Alpha 21064

From Wikipedia, the free encyclopedia

The 21064 microprocessor
The 21064 microprocessor mounted on a business card

The Alpha 21064 is a microprocessor developed and fabricated by Digital Equipment Corporation that implemented the Alpha (introduced as the Alpha AXP) instruction set architecture (ISA). It was introduced as the DECchip 21064 before it was renamed in 1994. The 21064 is also known by its code name, EV4. It was announced in February 1992 with volume availability in September 1992. The 21064 was the first commercial implementation of the Alpha ISA, and the first microprocessor from Digital to be available commercially. It was succeeded by a derivative, the Alpha 21064A, in October 1993. This last version was replaced by the Alpha 21164 in 1995.

History

The first Alpha processor was a test chip codenamed EV3. It was fabricated using Digital's 1.0-micrometre (μm) CMOS-3 process, lacked a floating-point unit, and had only 1 KB caches. The test chip was used to confirm the operation of the aggressive circuit design techniques and, along with simulators and emulators, to bring up firmware and the various operating systems that the company supported.

The production chip, codenamed EV4, was fabricated using Digital's 0.75 μm CMOS-4 process. Dirk Meyer and Edward McLellan were the micro-architects. McLellan designed the issue logic while Meyer designed the other major blocks. Jim Montanaro led the circuit implementation. The EV3 was used in the Alpha Demonstration Unit (ADU), a multiprocessor system used by Digital to develop software for the Alpha platform before the availability of EV4 parts.[1]

The 21064 was unveiled at the 39th International Solid-State Circuits Conference (ISSCC) in mid-February 1992. It was announced on 25 February 1992, with a 150 MHz sample introduced on the same day. It was priced at $3,375 in quantities of 100, $1,650 in quantities between 100 and 1,000, and $1,560 for quantities over 1,000. Volume shipments began in September 1992.

In early February 1993, the price of the 150 MHz version was reduced to $1,096 from $1,559 in quantities greater than 1,000.

On 25 February 1993, a 200 MHz version was introduced, with sample kits available, priced at $3,495. In volume, it was priced at $1,231 per unit in quantities greater than 10,000. Volume orders were accepted in June 1993, with shipments in August 1993. The price of the 150 MHz version was reduced in response: the sample kit was reduced to $1,690 from $3,375, effective in April 1993, and the volume price was reduced to $853 from $1,355 per unit in quantities greater than 10,000, effective in July 1993.

With the introduction of the Alpha 21066 and the Alpha 21068 on 10 September 1993, Digital adjusted the positioning of the existing 21064s and introduced a 166 MHz version priced at $499 per unit in quantities of 5,000. The price of the 150 MHz version was reduced to $455 per unit in quantities of 5,000.

On 6 June 1994, the price of the 200 MHz version was reduced by 31% to $544 to position it against the 60 MHz Pentium; and the 166 MHz version by 19% to $404 per unit in quantities of 5,000, effective on 3 July 1994.

The Alpha 21064 was fabricated at Digital's Hudson, Massachusetts and South Queensferry, Scotland facilities.

Users

The 21064 was mostly used in high-end computers such as workstations and servers. Users included:

Performance

The 21064 was the highest performing microprocessor from its introduction until 1993, when International Business Machines (IBM) introduced the multi-chip POWER2. It subsequently became the highest performing single-chip microprocessor, a position it held until the 275 MHz 21064A was introduced in October 1993.[2]

Description

The Alpha 21064 is a superpipelined dual-issue superscalar microprocessor that executes instructions in-order. It is capable of issuing up to two instructions every clock cycle to four functional units: an integer unit, a floating-point unit (FPU), an address unit, and a branch unit. The integer pipeline is seven stages long, and the floating-point pipeline ten stages. The first four stages of both pipelines are identical and are implemented by the I-box.

I-box

The I-box is the control unit; it fetches, decodes, and issues instructions and controls the pipeline.[3] During stage one, two instructions are fetched from the I-cache. Branch prediction is performed by logic in the I-box during stage two. Either static prediction or dynamic prediction is used. Static prediction examines the sign bit of the displacement field of a branch instruction and predicts the branch as taken if the sign bit indicates a backwards branch (that is, if the sign bit contains 1). Dynamic prediction examines an entry in the 2,048-entry by 1-bit branch history table. If the entry contains 1, the branch is predicted as taken.[4] When dynamic prediction is used, branch prediction is approximately 80% accurate for most programs.[5] The branch misprediction penalty is four cycles.[6]
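The two prediction schemes described above can be sketched in a few lines of Python. This is an illustrative model only, not Digital's logic: the table indexing (low-order bits of the instruction word address) and the update rule (record the most recent outcome) are assumptions, since the text gives only the table size and the meaning of a 1 bit. The names `static_predict_taken` and `OneBitPredictor` are hypothetical.

```python
BHT_ENTRIES = 2048  # 2,048-entry by 1-bit branch history table, per the text


def static_predict_taken(displacement: int) -> bool:
    """Static scheme: a negative displacement (sign bit 1) is a backwards
    branch, which is predicted as taken."""
    return displacement < 0


class OneBitPredictor:
    """Dynamic scheme: one history bit per table entry."""

    def __init__(self, entries: int = BHT_ENTRIES) -> None:
        self.table = [0] * entries

    def _index(self, pc: int) -> int:
        # Assumption: index by the low-order bits of the 4-byte-aligned
        # instruction address. The source does not describe the indexing.
        return (pc >> 2) % len(self.table)

    def predict(self, pc: int) -> bool:
        # An entry containing 1 means "predict taken".
        return self.table[self._index(pc)] == 1

    def update(self, pc: int, taken: bool) -> None:
        # Assumption: the bit simply records the branch's last outcome.
        self.table[self._index(pc)] = 1 if taken else 0
```

With a single history bit, a loop branch is typically mispredicted twice per loop execution (once on exit and once on re-entry), which is one reason later designs moved to 2-bit entries.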

These instructions are decoded during stage three. During stage four, the I-box checks whether the resources required by the two instructions are available. If so, the instructions are issued, provided they can be paired. Which instructions can be paired is determined by the number of read and write ports in the integer register file.[7] The 21064 can issue: an integer operate with a floating-point operate, any load/store instruction with any operate instruction, an integer operate with an integer branch, or a floating-point operate with a floating-point branch. Two combinations are not permitted: an integer operate and a floating-point store, and a floating-point operate and an integer store. If the two instructions cannot be issued together, the first four stages are stalled until the remaining instruction is issued. The first four stages are also stalled in the event that no instruction can be issued due to resource unavailability, dependencies, or similar conditions.
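The pairing rules listed above can be written as a small lookup. The sketch below is a simplified reading of the text, with several assumptions: instruction class names such as `int_op` and `fp_store` are made up for illustration, the pair is treated as unordered, and any combination the text does not mention is conservatively rejected.

```python
# Hypothetical instruction classes used only for this illustration.
OPERATES = {"int_op", "fp_op"}
MEMORY = {"int_load", "int_store", "fp_load", "fp_store"}

# The two combinations the text says are not permitted.
FORBIDDEN = {
    frozenset({"int_op", "fp_store"}),
    frozenset({"fp_op", "int_store"}),
}

# Permitted pairs that do not involve a load or store.
ALLOWED_NON_MEMORY = {
    frozenset({"int_op", "fp_op"}),
    frozenset({"int_op", "int_branch"}),
    frozenset({"fp_op", "fp_branch"}),
}


def can_dual_issue(a: str, b: str) -> bool:
    """Return True if classes a and b may issue together under the
    rules listed in the text (pair order ignored by assumption)."""
    pair = frozenset({a, b})
    if len(pair) != 2:
        return False  # two instructions of the same class are not listed
    if pair in FORBIDDEN:
        return False
    if pair in ALLOWED_NON_MEMORY:
        return True
    # "Any load/store instruction with any operate instruction."
    return len(pair & OPERATES) == 1 and len(pair & MEMORY) == 1
```

For example, `can_dual_issue("fp_load", "int_op")` is accepted, while `can_dual_issue("int_op", "fp_store")` is rejected as one of the two excluded combinations.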

The I-box contains two translation lookaside buffers (TLBs) for translating virtual addresses to physical addresses. These TLBs are referred to as instruction translation buffers (ITBs). The ITBs cache recently used page table entries for the instruction stream. An eight-entry ITB is used for 8 KB pages and a four-entry ITB for 4 MB pages. Both ITBs are fully associative and use a not-last-used replacement algorithm.[8]

Execution

Execution begins during stage five for all instructions. The register files are read during stage four. The pipelines beginning at stage five cannot be stalled.

Integer unit

The integer unit is responsible for executing integer instructions. It consists of the integer register file (IRF) and the E-box. The IRF contains thirty-two 64-bit registers and has four read ports and two write ports that are equally divided between the integer unit and the branch unit.[9] The E-box contains an adder, a logic unit, a barrel shifter, and a multiplier. Except for multiply, shift, and byte manipulation instructions, most integer instructions are completed by the end of stage five and thus have a latency of one cycle. The barrel shifter is pipelined, but shift and byte manipulation instructions are not completed until the end of stage six, and thus have a latency of two cycles. The multiplier is not pipelined in order to save die area;[5] thus multiply instructions have a variable latency of 19 to 23 cycles depending on the operands. In stage seven, integer instructions write their results to the IRF.

Address unit

The address unit, also known as the "A-box", executes load and store instructions. To enable the address unit and integer unit to operate in parallel, the address unit has its own displacement adder, which it uses to calculate virtual addresses, instead of using the adder in the integer unit.[10] A 32-entry fully associative translation lookaside buffer (TLB) is used to translate virtual addresses into physical addresses.[10] This TLB is referred to as the data translation buffer (DTB). The 21064 implements a 43-bit virtual address and a 34-bit physical address, and is therefore capable of addressing 8 TB of virtual memory and 16 GB of physical memory.
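The address-space figures follow directly from the address widths. As a quick check, assuming byte addressing and binary units (1 TB = 2^40 bytes, 1 GB = 2^30 bytes):

```python
VIRTUAL_ADDRESS_BITS = 43
PHYSICAL_ADDRESS_BITS = 34

# Each additional address bit doubles the number of addressable bytes.
virtual_bytes = 1 << VIRTUAL_ADDRESS_BITS
physical_bytes = 1 << PHYSICAL_ADDRESS_BITS

virtual_tb = virtual_bytes // 2**40    # terabytes of virtual memory
physical_gb = physical_bytes // 2**30  # gigabytes of physical memory

print(virtual_tb, "TB virtual;", physical_gb, "GB physical")
```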

Store instructions place their data in a 4-entry by 32-byte write buffer. The write buffer improves performance in two ways: it merges data from adjacent stores, which reduces the number of writes on the system bus, and it temporarily delays stores, which lets loads be serviced more quickly because the system bus is not used as often.[10]
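The effect of merging can be illustrated with a toy model of a 4-entry by 32-byte buffer. This is a sketch under stated assumptions, not a description of the 21064's actual policy: the text does not say when entries are drained, so the model writes out the oldest entry only when a new 32-byte block needs a slot, and the class name `WriteBuffer` and its `bus_writes` counter are hypothetical.

```python
from collections import OrderedDict

ENTRIES = 4       # number of write-buffer entries, per the text
LINE_BYTES = 32   # bytes per entry, per the text


class WriteBuffer:
    """Toy write buffer that merges stores falling in the same 32-byte block."""

    def __init__(self) -> None:
        # Maps block number -> {byte offset within block: byte value}.
        self.entries = OrderedDict()
        self.bus_writes = 0

    def store(self, addr: int, data: bytes) -> None:
        for i, byte in enumerate(data):
            block, offset = divmod(addr + i, LINE_BYTES)
            if block not in self.entries:
                if len(self.entries) == ENTRIES:
                    self._drain_oldest()  # assumed policy: evict oldest
                self.entries[block] = {}
            # A store to a block already buffered is merged in place.
            self.entries[block][offset] = byte

    def _drain_oldest(self) -> None:
        self.entries.popitem(last=False)
        self.bus_writes += 1  # one bus write per drained entry

    def flush(self) -> None:
        while self.entries:
            self._drain_oldest()
```

In this model, four consecutive 8-byte stores to addresses 0, 8, 16 and 24 occupy a single entry and cost one bus write when drained, rather than four.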

Floating-point unit

The floating-point unit consists of the floating-point register file (FRF) and the F-box.[7] The FRF contains thirty-two 64-bit registers and has three read ports and two write ports. The F-box contains a floating-point pipeline and a non-pipelined divide unit which retires one bit per cycle.

The floating-point register file is read and the data formatted into fraction, exponent, and sign in stage four. If executing add instructions, the adder calculates the exponent difference, and a predictive leading one or zero detector using input operands for normalizing the result is initiated. If executing multiply instructions, a 3× multiplicand is generated.

In stages five and six, alignment or a normalization shift and sticky-bit calculations are performed for adds and subtracts. Multiply instructions are multiplied in a pipelined, two-way interleaved array which uses a radix-8 Booth algorithm.[5][11] In stage eight, final addition is performed in parallel with rounding. Floating-point instructions write their results to the FRF in stage ten.[11]

Instructions executed in the pipeline have a six-cycle latency.[11] Single-precision (32-bit) and double-precision (64-bit) divides, which are executed in the non-pipelined divide unit, have a latency of 31 and 61 cycles, respectively.[12]

Caches

The 21064 has two on-die primary caches: an 8 KB data cache (known as the D-cache) using a write-through write policy and an 8 KB instruction cache (known as the I-cache). Both caches are direct-mapped for single-cycle access and have a 32-byte line size. The caches are built with six-transistor static random access memory (SRAM) cells that have an area of 98 μm². The caches are 1,024 cells wide by 66 cells tall, with the top two rows used for redundancy.
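For a direct-mapped cache of this size, an address splits into a byte offset, a line index, and a tag. The sketch below works through that split for an 8 KB cache with 32-byte lines and checks that the stated array dimensions account for exactly 8 KB of data. It is a generic illustration: whether the 21064 indexes its caches with virtual or physical address bits is not covered here, and `split_address` is a hypothetical helper name.

```python
CACHE_BYTES = 8 * 1024
LINE_BYTES = 32

NUM_LINES = CACHE_BYTES // LINE_BYTES      # 256 lines
OFFSET_BITS = LINE_BYTES.bit_length() - 1  # 5 bits select a byte in the line
INDEX_BITS = NUM_LINES.bit_length() - 1    # 8 bits select the line


def split_address(addr: int) -> tuple:
    """Split an address into (tag, index, offset) for a direct-mapped cache."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_LINES - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


# Two addresses exactly 8 KB apart map to the same line with different tags,
# so in a direct-mapped cache they evict each other.
a = split_address(0x0020)
b = split_address(0x2020)

# Array check: 1,024 cells wide by 66 tall, minus two redundant rows.
data_bits = 1024 * (66 - 2)
```

Here `data_bits` equals `CACHE_BYTES * 8`, so the 1,024 by 64 usable cells hold one bit each for the 8 KB of cache data.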

An optional external secondary cache, known as the B-cache, with capacities of 128 KB to 16 MB was supported. The cache operated at one-third to one-sixteenth of the internal clock frequency, or 12.5 to 66.67 MHz at 200 MHz.[13] The B-cache is direct-mapped and has a 128-byte line size by default, which could be configured to a larger size. The B-cache is accessed via the system bus.
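The quoted B-cache frequency range is simply the 200 MHz core clock divided by the two extreme ratios. A minimal check, with `divided_clock_mhz` as a hypothetical helper name:

```python
def divided_clock_mhz(core_mhz: float, divisor: int) -> float:
    """Clock frequency obtained by dividing the core clock by an integer ratio."""
    return core_mhz / divisor


CORE_MHZ = 200.0
fastest = divided_clock_mhz(CORE_MHZ, 3)    # one-third of the core clock
slowest = divided_clock_mhz(CORE_MHZ, 16)   # one-sixteenth of the core clock

print(f"B-cache clock range: {slowest:.2f} to {fastest:.2f} MHz")
```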

External interface

The external interface is a 128-bit data bus that operated at half to one-eighth the internal clock rate, or 25 to 100 MHz at 200 MHz. The width of the bus was configurable; systems using the 21064 could have a 64-bit external interface. The external interface also included a 34-bit address bus.

Fabrication

DEC Alpha 21064 (EV4S) die shot

The 21064 contained 1.68 million transistors.[14] The original EV4 was fabricated by Digital in its CMOS-4 process, which has a 0.75 μm feature size and three levels of aluminium interconnect.[14] The EV4 measures 13.9 mm by 16.8 mm, for an area of 233.52 mm². The later EV4S was fabricated in CMOS-4S, a 10% optical shrink of CMOS-4 with a 0.675 μm feature size. This version measured 12.4 mm by 15.0 mm, for an area of 186 mm².[15]
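The die areas and the shrunk feature size can be reproduced from the quoted dimensions. The short calculation below uses only figures from the text; rounding is applied because the values are decimal fractions.

```python
# EV4 (CMOS-4) and EV4S (CMOS-4S) die dimensions in millimetres.
ev4_area = round(13.9 * 16.8, 2)    # mm^2
ev4s_area = round(12.4 * 15.0, 2)   # mm^2

# A 10% optical shrink scales linear dimensions by 0.9.
cmos4_feature_um = 0.75
cmos4s_feature_um = round(cmos4_feature_um * 0.9, 3)

# Fraction of the original die area occupied by the shrunk die.
area_ratio = round(ev4s_area / ev4_area, 3)

print(ev4_area, ev4s_area, cmos4s_feature_um, area_ratio)
```

An ideal 10% linear shrink would scale area by 0.9 × 0.9 = 0.81; the quoted dimensions give a ratio slightly below that.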

The 21064 used a 3.3-volt (V) power supply.[14] The EV4 dissipated a maximum of 30 W at 200 MHz. The EV4S dissipated a maximum of 21.0 W at 150 MHz, 22.5 W at 166 MHz, and 27.0 W at 200 MHz.[16]

Package

A packaged 21064 microprocessor

The 21064 is packaged in a 431-pin alumina-ceramic pin grid array (PGA) measuring 61.72 mm by 61.72 mm.[17] Of the 431 pins, 291 were for signals and 140 were for power and ground.[14][18] The heatsink is directly attached to the package, secured by nuts attached to two studs protruding from the tungsten heat spreader.

Derivatives

Alpha 21064A

DEC Alpha 21064A (EV45) die shot

The Alpha 21064A, introduced as the DECchip 21064A, code-named EV45, is a further development of the Alpha 21064 introduced in October 1993. It operated at clock frequencies of 200, 225, 233, 275 and 300 MHz. The 225 MHz model was replaced by the 233 MHz model on 6 July 1994, which at introduction was priced at US$788 in quantities of 5,000, 10% less than the 225 MHz model it replaced. On the same day, the price of the 275 MHz model was reduced by 25% to US$1,083 in quantities of 5,000. The 300 MHz model was announced and sampled on 2 October 1995 and shipped in December 1995. There was also one model, the 21064A-275-PC, that was restricted to running Windows NT or operating systems that use the Windows NT memory management model.

The 21064A succeeded the original 21064 as the high-end Alpha microprocessor. It subsequently saw the most use in high-end systems. Users included:

  • Digital in some models of its DEC 3000 AXP, DEC 4000 AXP and DEC 7000/10000 AXP systems
  • Aspen Systems in its Alpine workstation
  • BTG, who used a 275 MHz model in its Action AXP275 RISC PC
  • Carrera Computers in its Cobra AXP 275 workstation
  • NekoTech, who used a 275 MHz model overclocked by 5% to 289 MHz in their Mach 2-289-T workstation
  • Network Appliance (now NetApp), who used a 275 MHz model in its storage systems

The 21064A had a number of microarchitectural improvements over the 21064. The primary caches were improved in two ways: the capacity of the I-cache and D-cache was doubled from 8 KB to 16 KB and parity protection was added to the cache tag and cache data arrays. Floating-point divides have a lower latency due to an improved divider that retires two bits per cycle on average. Branch prediction was improved by a larger 4,096-entry by 2-bit BHT.

The 21064A contains 2.8 million transistors and measures 14.5 by 10.5 mm, for an area of 152.25 mm². It was fabricated by Digital in their fifth-generation CMOS process, CMOS-5, a 0.5 μm process with four levels of aluminium interconnect.[19]

Alpha 21066

The Alpha 21066, introduced as the DECchip 21066, code-named LCA4 (Low Cost Alpha), is a low-cost variant of the Alpha 21064. Samples were introduced on 10 September 1993, with volume shipments in early 1994. At the time of introduction, the 166 MHz Alpha 21066 was priced at US$385 in quantities of 5,000. A 100 MHz model, intended for embedded systems, also existed. Sampling began in late 1994, with volume shipments in the third quarter of 1995. The Microprocessor Report recognized the Alpha 21066 as the first microprocessor with an integrated PCI controller.

The Alpha 21066 was intended for use in low-cost applications, specifically personal computers running Windows NT. Digital used various models of the Alpha 21066 in their Multia clients, AXPpci 33 original equipment manufacturer (OEM) motherboards and AXPvme single-board computers. Outside of Digital, users included Aspen Systems in its Alpine workstation, Carrera Computers in its Pantera I workstation, NekoTech, which used a 166 MHz model in its Mach 1-166 personal computer, and Parsys in its TransAlpha TA9000 Series supercomputers.

Due to the process shrink, the 21066 was able to include features that were desirable in cost-sensitive embedded systems. These features include an on-die B-cache and memory controller with ECC support, a functionally limited graphics accelerator supporting up to 8 MB of VRAM for implementing a framebuffer, a PCI controller and a phase-locked loop (PLL) clock generator for multiplying a 33 MHz external clock signal to the desired internal clock frequency.

The memory controller supported 64 KB to 2 MB of B-cache and 2 to 512 MB of memory. The ECC implementation was capable of detecting 1-, 2- and 4-bit errors and correcting 1-bit errors. To reduce cost, the Alpha 21066 has a 64-bit system bus, which reduced the number of pins and thus the size of the package. The reduced width of the system bus also reduced bandwidth and thus performance by 20%, which was deemed acceptable.

The 21066 contained 1.75 million transistors and measured 17.0 by 12.3 mm, for an area of 209.1 mm². It was fabricated in CMOS-4S, a 0.675 μm process with three levels of interconnect. The 21066 was packaged in a 287-pin CPGA measuring 57.404 by 57.404 mm.

Alpha 21066A

DEC Alpha 21066A

The Alpha 21066A, code-named LCA45, is a low-cost variant of the Alpha 21064A. It was announced on 14 November 1994, with samples of 100 and 233 MHz models introduced on the same day. Both models were shipped in March 1995. When announced, the 100 and 233 MHz models were priced at $175 and $360, respectively, in quantities of 5,000. A 266 MHz model was later made available.

The 21066A was second sourced by Mitsubishi Electric as the M36066A. It was the first Alpha microprocessor to be fabricated by the company. 100 and 233 MHz parts were announced in November 1994. At the time of the announcement, engineering samples were set for December 1994, commercial samples for July 1995 and volume quantities for September 1995. The 233 MHz part was priced at $490 in quantities of 1,000.[20]

Although it was based on the 21064A, the 21066A did not have the 16 KB instruction and data caches. A feature specific to the 21066A was power management – the microprocessor's internal clock frequency could be adjusted by software.

Digital used various models of the 21066A in their products which had previously used the 21066. Outside of Digital, Tadpole Technology used a 233 MHz model in their ALPHAbook 1 notebook.

The 21066A contained 1.8 million transistors on a die measuring 14.8 by 10.9 mm, for an area of 161.32 mm². It was fabricated in Digital's fifth-generation CMOS process, CMOS-5, a 0.5 μm process with three levels of interconnect. Mitsubishi Electric fabricated the M36066A in its own 0.5 μm three-level-metal process.

Alpha 21068

The Alpha 21068, introduced as the DECchip 21068, is a version of the 21066 positioned for embedded systems. It was identical to the 21066 but had a lower clock rate to reduce power dissipation and cost. Samples were introduced on 10 September 1993 with volume shipments in early 1994. It operated at 66 MHz and had a 9 W maximum power dissipation. At the time of introduction, the 21068 was priced at US$221 each in quantities of 5,000. On 6 June 1994, Digital announced that it was cutting the price by 16% to US$186, effective on 3 July 1994.

The Alpha 21068 was used by Digital in their AXPpci 33 motherboard and the AXPvme 64 and 64LC single-board computers.

Alpha 21068A

The Alpha 21068A, introduced as the DECchip 21068A, is a variant of the Alpha 21066A for embedded systems. It operated at a clock frequency of 100 MHz.

Chipsets

Initially, there was no standard chipset for the 21064 and 21064A. Digital's computers used custom application-specific integrated circuits (ASICs) to interface the microprocessor to the system. As this raised development cost for third parties who wished to develop Alpha-based products, Digital developed a standard chipset, the DECchip 21070 (Apecs), for original equipment manufacturers (OEMs).

There were two models of the 21070, the DECchip 21071 and the DECchip 21072. The 21071 was intended for workstations whereas the 21072 was intended for high-end workstations or low-end uniprocessor servers. The two models differed in memory subsystem features: the 21071 has a 64-bit memory bus and supports 8 MB to 2 GB of parity-protected memory whereas the 21072 has a 128-bit memory bus and supports 16 MB to 4 GB of ECC-protected memory.

The chipset consisted of three chip designs: the COMANCHE B-cache and memory controller, the DECADE data slice, and the EPIC PCI controller. The DECADE chips implemented the data paths in 32-bit slices, and therefore the 21071 has two such chips while the 21072 has four. The EPIC chip has a 32-bit path to the DECADE chips.

The 21070 was introduced on 10 January 1994,[21] with samples available. Volume shipments began in mid-1994. In quantities of 5,000, the 21071 was priced at $90 and the 21072 at $120.

21070 users included Carrera Computers for its Pantera workstations and Digital in some models of its AlphaStations and uniprocessor AlphaServers.

See also

  • AlphaVM: A full DEC Alpha system emulator running on Windows or Linux. It contains a high-performance emulator of the Alpha CPU.

Notes

  1. Charles P. Thacker; David G. Conroy; Lawrence C. Stewart (1992). "The Alpha Demonstration Unit: A High-performance Multiprocessor for Software and Chip Development" (PDF). Digital Technical Journal. 4 (4).
  2. Ryan 1994
  3. Digital Equipment Corporation 1996, p. 2-3–2-4
  4. Digital Equipment Corporation 1996, p. 2-5
  5. McLellan 1993, p. 42
  6. Dobberpuhl 1992, p. 37
  7. Dobberpuhl 1992, p. 36
  8. Digital Equipment Corporation 1996, p. 2-6
  9. Dobberpuhl 1992, pp. 35–36
  10. McLellan 1993, p. 43
  11. Dobberpuhl 1992, p. 38
  12. Gwennap 1994
  13. McLellan 1993, p. 44
  14. Dobberpuhl 1992, p. 35
  15. Bhandarkar 1995, pp. 2–4
  16. Digital Equipment Corporation 1996, p. 8-3
  17. Digital Equipment Corporation 1996, p. 8-2
  18. Bhandarkar 1995, p. 2
  19. Bhandarkar 1995, p. 3
  20. Krause 1994
  21. Digital Equipment Corporation 1994

References

Further reading

  • "DEC Enters Microprocessor Business with Alpha". (4 March 1992). Microprocessor Report, Volume 6, Number 3.
  • "DEC's Alpha Architecture Premiers". (4 March 1992). Microprocessor Report, Volume 6, Number 3.
  • "Digital Plans Broad Alpha Processor Family". (18 November 1992). Microprocessor Report, Volume 6, Number 3.
  • "Digital Reveals PCI Chip Sets For Alpha". (12 July 1993). Microprocessor Report, Volume 7, Number 9.
  • "Alpha Hits Low End with Digital's 21066". (13 September 1993). Microprocessor Report, Volume 7, Number 12.
  • Bhandarkar, Dileep P. (1995). Alpha Architecture and Implementations. Digital Press.
  • Fox, Thomas F. (1994). "The design of high-performance microprocessors at Digital". Proceedings of the 31st Annual ACM-IEEE Design Automation Conference. pp. 586–591.
  • Gronowski, Paul E. et al. (May 1998). "High-performance microprocessor design". IEEE Journal of Solid-State Circuits 33 (5): pp. 676–686.