
Technologies for exascale systems

P. W. Coteus, J. U. Knickerbocker, C. H. Lam, and Y. A. Vlasov

To satisfy the economic drive for ever more powerful computers to handle scientific and business applications, new technologies are needed to overcome the limitations of current approaches. New memory technologies will address the need for greater amounts of data in close proximity to the processors. Three-dimensional silicon integration will allow more cache and function to be integrated with the processor while allowing more than 1,000 times higher bandwidth communications at low power per channel using local interconnects between Si die layers and between die stacks. Integrated silicon nanophotonics will provide low-power and high-bandwidth optical interconnections between different parts of the system on a chip, board, and rack levels. Highly efficient power delivery and advanced liquid cooling will reduce the electrical demand and facility costs. A combination of these technologies will likely be required to build exascale systems that meet the combined challenges of a practical power constraint on the order of 20 MW with sufficient reliability and at a reasonable cost.

Introduction

High-performance computing (HPC) is currently experiencing very strong growth in all computing sectors, driven by an exponentially improving performance/cost ratio for HPC machines. As shown in Figure 1, both the average system performance and the performance of the fastest computers as measured by the Top500** benchmark have been increasing 1.8 times every year. Because systems have not dramatically increased in cost, the rate of improvement in performance/cost has also followed this trend. It is inevitable, therefore, that physical experiments will continue to be augmented and, in some cases, replaced by HPC simulations. The current worldwide investment in research and development is approximately $1 trillion, with HPC hardware accounting for merely $10 billion. (Software and services account for an additional $10 billion per year.) This leaves tremendous room for growth in HPC, which will likely continue as long as we can build more cost-effective machines.
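As a rough illustration of this growth rate, the following sketch projects how long the 1.8x-per-year trend would take to reach an exaflop; the top-system baseline of roughly 10 Pflop/s around 2011 is an assumption for illustration, not a figure from the paper.

```python
import math

# Back-of-envelope check of the 8-10 year exascale horizon, assuming (not stated
# in the paper) a top-system baseline of roughly 10 petaflops around 2011 and
# the 1.8x-per-year growth rate cited from the Top500 trend.
growth_per_year = 1.8
baseline_flops = 10e15       # assumed ~10 Pflop/s top system, circa 2011
target_flops = 1e18          # exaflop target

years = math.log(target_flops / baseline_flops) / math.log(growth_per_year)
print(f"Years to exaflop at 1.8x/year from 10 Pflop/s: {years:.1f}")
# -> about 8 years, consistent with the 8-10 year window discussed below
```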

The race to build an exaflop computer (10^18 floating-point operations/second) in the next 8–10 years is representative of the great potential that HPC computers hold. Nevertheless, there are enormous challenges in continuing to improve performance at the near-doubling-every-year rate [1]. Some of these challenges are highlighted in Table 1. This table shows how the underlying technologies are not improving at a rate that is even close to what would be needed to achieve 1.8 times per year through evolutionary means. Thus, there must be innovations in architecture while exploiting existing technologies in new ways, as well as developing and integrating new technologies.

There are primarily three different aspects to the challenges in achieving exascale computing. These are cost efficiency, energy efficiency, and reliability.

To continue improving the performance/cost ratio of computing, we must continue to higher levels of integration; simply leveraging future silicon by putting more performance in a compute chip is not sufficient. Figure 2 illustrates the exascale technologies to be considered, in addition to raw processor chip performance. First, we need to recognize that we must disproportionately increase the amount of computing with respect to other area constraints. Putting much more computing into a compute chip will present difficulties for the input/outputs (I/Os) of the memory and network. The sheer number drives us either to dramatically change the balance of memory bandwidth to computational operation or to develop new technologies that can achieve the much larger bandwidth.

© Copyright 2011 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied by any means or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor.

Digital Object Identifier: 10.1147/JRD.2011.2163967


As the growth in HPC will be realized by adding new application areas that will be largely explored by nonexperts in HPC, we should focus on solving these challenges in a manner that results in highly usable, balanced systems.

The challenges in processor memory are twofold. We need to have memory devices that support the bandwidth requirements that future processors will need. Additionally, we need to continue improving the cost per bit of memory. As this latter metric for dynamic random access memory (DRAM) is not anticipated to improve at a rate that is comparable to the rate of processor improvement, we need to aggressively research and develop new memory technologies. In the near term, this means that we need to change the bandwidth-per-device rule of thumb that relates the bandwidth needed per byte of memory. In the near future, we need much more bandwidth per device of memory. In the long term, we need new technologies that provide significantly better cost per bit than DRAM.
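To make the bandwidth-per-device pressure concrete, the following hedged sketch assumes an illustrative byte-per-flop balance ratio and DRAM device count (neither value is specified in the text) and computes the per-device bandwidth an exascale system would demand.

```python
# Illustrative (not from the paper) sketch of why per-device memory bandwidth
# must grow: aggregate bandwidth scales with the machine's flop rate, while the
# number of DRAM devices is capped by cost and power.
peak_flops = 1e18            # exaflop target from the text
bytes_per_flop = 0.1         # assumed sustained memory-bandwidth balance ratio
num_devices = 10_000_000     # assumed DRAM device count (order of magnitude)

aggregate_bw = peak_flops * bytes_per_flop          # bytes/s across the system
per_device_bw = aggregate_bw / num_devices          # bytes/s each device must supply
print(f"Aggregate memory bandwidth: {aggregate_bw/1e15:.0f} PB/s")
print(f"Required bandwidth per device: {per_device_bw/1e9:.0f} GB/s")
# With these assumptions, each device must deliver ~10 GB/s, well above the
# few GB/s typical of a commodity DDR3 device of the era.
```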

To exploit memory and processors with high bandwidth capabilities, we need to have packaging technology that will allow one to connect these two system components in a cost-effective and power-efficient manner. Technologies such as 3-D packaging are crucial in achieving this goal, as suggested by the illustration in Figure 3. Improvements in the efficiency of power delivery and cooling are also critically needed.

Figure 1. Performance of supercomputers: (top curve) total compute power of the highest performing 500 systems, (second curve) performance of the number one computer in the list, (third curve) number ten computer, and (lowest curve) number 500.

Table 1. Exascale challenges (current annual improvement rates).

Figure 2. Exascale technologies and their interrelation. Power and cooling are implicitly assumed.


In large systems, the processor-to-processor interconnect plays a critical role. The cost of these interconnects will become prohibitive without developing new technologies. Interconnects of the future will be dominated by optics, as this offers the potential for a far better cost solution for all distances. As we look to exascale, even the connections between processors on a common circuit card need to be optical because of the amount of bandwidth needed. Silicon photonics represents a nearly ideal solution to the interconnect problem. With silicon photonics, the cost is driven almost entirely by the packaging and optical connector cost, as the core photonics structures can be made in silicon technology using a relatively standard tool flow.

Memory technologies

The complex memory hierarchy shown in Figure 4 is designed to alleviate the gap between processor and memory performance. While individual processor operating frequency peaked at approximately 4 GHz in the mid-2000s, multicore processors continue to widen the performance gap. In massively parallel high-performance computers, concurrency and locality are taxing the capacity, bandwidth, and latency of the entire memory. In order to meet the prescribed exascale performance target, the capacity of main memory needs to be increased to more than 200 times the capacities employed in petascale systems. This increase in memory capacity with limited cost and energy resources places significant burdens on the design of the exascale memory system. The cost and energy challenge is pervasive, affecting every component of the exascale system. Here, we focus our attention on the memory system. The scaling trend of the incumbent memory technologies is studied. Emerging memory technologies potentially capable of providing cost- and power-effective solutions for the exascale memory system are also examined.
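The latency gap that this hierarchy spans can be illustrated with typical order-of-magnitude figures; the cycle counts below are common rules of thumb for a gigahertz-class core, not values taken from Figure 4.

```python
# Illustrative latency gap across the hierarchy sketched in Figure 4, expressed
# in processor cycles. These are typical orders of magnitude assumed for a
# GHz-class core, not figures from the paper.
latency_cycles = {
    "SRAM L1 cache":    4,
    "SRAM L2 cache":    12,
    "SRAM L3 cache":    40,
    "DRAM main memory": 200,
    "NAND Flash SSD":   100_000,
    "HDD":              10_000_000,
}

for level, cycles in latency_cycles.items():
    print(f"{level:18s} ~{cycles:>12,} cycles")
# The orders-of-magnitude jumps between levels are what the added tiers
# (RLDRAM, SSD) and, eventually, emerging memories are meant to bridge.
```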

Since its invention in 1967 [2], the one-device DRAM has been the workhorse of the memory system in all computing systems. Riding the advances in semiconductor technology prescribed by Moore's Law, the density of a DRAM chip has doubled every 18 months, and more recently, every 3 years, from the first commercially available DRAM chip of 1 Kb (kilobits) in the 1970s to 4 Gb (gigabits) today. However, the effort for further scaling the DRAM critical dimension (CD) has met with some unprecedented challenges. Current DRAM technology is approaching the 30-nm size limit enabled by 193-nm wavelength immersion lithography [3].

Figure 3. Conceptual "system-on-stack" exascale computing node.

Figure 4. Typical memory hierarchy of a computer system, where memory latency is depicted as a function of the number of processor cycles. (SSD: solid-state drive; SRAM: static random access memory; DRAM: dynamic random access memory; RLDRAM: reduced-latency DRAM.)


Complexity in the DRAM cell layout and the tight tolerances required of the sense circuits forbid the use of the double-patterning techniques employed in NAND Flash technology to halve the CD, as depicted in Figure 5. Next-generation extreme ultraviolet lithography with a 13.5-nm wavelength for scaling the CD below 20 nm is still under development.

The most challenging process in the integration of the DRAM cell is the storage capacitor. The storage capacitor in a typical modern DRAM is a cone-shaped structure with an aspect ratio of height to base greater than 50:1. In the cone structure, the storage node is sandwiched between the common plates on both sides to increase storage area. Without a major breakthrough for the dielectric material used in the storage capacitor, scaling the capacitor beyond the 20-nm node would require a twofold increase in the aspect ratio of the capacitor structure, and the cone-structured capacitor would have to be changed to a rod structure.
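A back-of-envelope geometric check of the roughly twofold aspect-ratio increase follows, under the assumptions (not stated explicitly above) that cell capacitance must remain constant and that capacitance scales with the capacitor's lateral surface area.

```python
# Rough geometric check (assumptions, not from the paper): treat the storage
# capacitor as a tall cylinder whose capacitance scales with its lateral surface
# area (~ diameter * height), and require constant capacitance per cell.
cd_old = 30e-9   # current node footprint (m), ~30-nm class from the text
cd_new = 20e-9   # target node footprint (m), ~20-nm class from the text
ar_old = 50      # >50:1 aspect ratio quoted in the text

# Constant capacitance: d_old * h_old = d_new * h_new -> h_new = h_old * d_old/d_new
h_old = ar_old * cd_old
h_new = h_old * (cd_old / cd_new)
ar_new = h_new / cd_new

print(f"Aspect ratio grows from {ar_old}:1 to ~{ar_new:.0f}:1 ({ar_new/ar_old:.2f}x)")
# -> about a 2.25x increase, consistent with the 'twofold increase' in the text
```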

The constant storage-capacitor refresh time of 64 ms dictates a stringent off-leakage requirement for the cell access transistor in DRAM to keep the signal charge. This leakage requirement practically forbids scaling of the channel length of the cell access transistor while keeping the memory cell size to a constant number of feature squares, six in state-of-the-art DRAM. This stringent leakage requirement necessitates the use of 3-D access devices, which increase the process complexity, thus increasing the production cost of DRAM. The slowdown in scaling progress and the intrinsic limits to CD scaling mean that DRAM is probably not going to provide a cost-effective solution for the exascale memory system.

Hard disk drives (HDDs) are mechanical devices that are power hungry, since they involve constantly spinning magnetic platters at rotary speeds that rival modern jet engines. With mechanical moving parts, data integrity and reliability have always been major issues confronting the HDD industry.

Figure 5. 2010 International Technology Roadmap for Semiconductors: memory array half-pitch dimension for DRAM and NAND Flash, along with lithographic requirements.


Extensive and complex error-correction systems such as Redundant Array of Independent Disks (RAID) systems have been devised to protect data stored in HDDs. However, one indispensable merit of the HDD is its low cost. State-of-the-art HDDs are capable of storing terabits per square inch, making them the most inexpensive storage devices other than tape drives. Today, a gigabyte of storage on an HDD costs less than 10 cents. However, HDD capacity is rapidly approaching its superparamagnetic limit, where thermal instability can randomly flip the magnetization direction in nanoscale magnets.

In recent years, solid-state drives (SSDs) employing NAND Flash memory as the storage medium have been inserted in the memory hierarchy to bridge the huge latency gap between the main memory and the HDD. NAND Flash memory with a cell size of four feature squares (4F²), the theoretical minimum for a planar memory cell, was introduced as high-density nonvolatile memory in the early 1990s [4]. With aggressive scaling and multibit storage capability, NAND Flash has become pervasive in many electronic devices. While NAND Flash is still approximately 10 times more expensive than HDD, it does provide faster-access file storage, as well as lower power consumption and cost of ownership because of its smaller footprint. However, NAND Flash, based on electrons stored in floating gates, is also approaching physical limits in scaling. As NAND Flash is scaled toward the nanoscale, the number of electrons that can be stored diminishes. For a CD equal to or smaller than 20 nm, there are merely a few hundred electrons stored in the floating gate. Data integrity is becoming a great concern, and extensive error-correction codes are incorporated in SSDs to alleviate this problem. In addition, since electrons tunnel in and out through the tunnel oxide of the floating gate under high electric fields, the wear-out mechanism of the tunnel oxide limits the number of reliable programming cycles of the memory cell. Current NAND Flash multilevel cell devices are limited to less than 10,000 programming cycles. The wear-out mechanism in NAND Flash also deteriorates with scaling.

In lieu of DRAM, NAND, and HDD, researchers and semiconductor developers are actively working on various emerging memory technologies as replacements or supplements. Resistive random access memory (RRAM), spin-torque transfer magnetic random access memory (STT MRAM), and phase-change memory are among the most actively pursued emerging memory technologies. Figure 6 shows typical structures and brief operating principles of these emerging memory technologies. All three are nonvolatile memory technologies with demonstrated performance rivaling that of DRAM. To be a viable replacement, any emerging memory technology must satisfy the specifications set forth in Table 2.

The memory element of RRAM is a metal-insulator-metal capacitor-like structure in which the insulator is typically a metal oxide. The reversible resistive-change phenomenon was first published in 1962 [5]. RRAM has attracted much attention since an enormous variety of insulators, not restricted to metal oxides, exhibit a similar resistive-change phenomenon. Although the basic physics underlying RRAM is not completely understood, it is generally agreed that the phenomenon is associated with oxygen vacancies in the oxide film and that the physical conduction path is filamentary [6–8]. Recently, tremendous progress has been made in RRAM in emerging memory research, from academia to memory developers. On the other hand, STT MRAM is built on theory revealed in the 1990s [9, 10]. The memory element of an STT MRAM cell is the magnetic tunneling junction (MTJ). Each MTJ consists of two ferromagnetic layers separated by a very thin tunneling dielectric film. Magnetization in one of the layers is pinned, or fixed in one direction, by coupling to an antiferromagnetic layer. The other ferromagnetic layer is a free layer and is used for information storage. By controlling the direction of magnetization of the free layer with respect to the pinned layer, the MTJ can be configured to a high-resistance antiparallel state or a low-resistance parallel state. The direction of magnetization of the free layer is controlled by the spin of electrons, determined by the polarity of the current driven through the MTJ during a write operation. The prospects of STT MRAM for replacing high-performance memory are promising [11]. Of all the emerging memory technologies, phase-change random access memory (PRAM) is the most mature, with limited production from major memory manufacturers. PRAM is based on the reversible transformation of a phase-change material from a polycrystalline to an amorphous state by Joule heating, creating a resistance change in the material. The phase-change characteristic of chalcogenide materials was observed in the 1960s [12, 13]. Current PRAM products are designed to replace NOR Flash in mobile applications. Considerably more progress is required in phase-change materials, memory cells, and array designs in order for PRAM to replace DRAM or NAND Flash.

The three memory technologies discussed in the preceding paragraph represent a short list of the emerging memory technologies in the research and development pipeline [14]. The tasks for these memory technologies in meeting the energy challenge for exascale computing are tremendous yet similar. All solid-state memory devices are governed by the operating voltage V and the two basic passive components, i.e., resistance R and capacitance C. With comparable lithographic CDs, these parameters are similar for the four emerging technologies; thus, the bit-cell read and write power levels are very much the same as, and not much better than, those of the incumbent DRAM and NAND Flash technologies.
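A minimal sketch of why comparable V and C imply comparable bit-level access energy follows, using the standard dynamic switching relation E ≈ CV²; the line capacitance and voltage are illustrative assumptions, not values from the paper.

```python
# Illustrative estimate (assumed values, not from the paper) of the energy to
# charge an array line during a bit access, using the standard E ~ C * V**2
# relation. With similar lithography, C and V are similar across technologies,
# so the bit-level access energy is similar too.
line_capacitance = 50e-15   # assumed ~50 fF bitline/wordline capacitance
voltage = 1.2               # assumed ~1.2 V array operating voltage

energy_per_line = line_capacitance * voltage**2      # joules drawn per line charge
print(f"Energy per line switch: {energy_per_line*1e15:.0f} fJ")
# ~72 fJ per line toggle; accessing many lines per word makes the array access
# energy picojoule-scale regardless of which cell technology sits at the crosspoint.
```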


To meet the energy requirement for exascale computing, a new memory architecture must be designed to take full advantage of the nonvolatility of these emerging technologies. In addition, exascale memory systems must rely heavily on advances in 3-D packaging to reduce the input and output distances, since much of the energy spent is in the movement of data. The cost factor is a different story; RRAM and PRAM would have an advantage over STT MRAM, since both RRAM and PRAM are capable of storing multiple bits in a single memory cell. This multiple-bit storage capability is the most effective way to reduce cost per bit, with little cost added in sensing circuits. The penalty of multiple-bit storage is degradation in latency, which can be hidden by an innovative memory system design. Since RRAM, STT MRAM, and PRAM are emerging technologies, there is not much field data on the reliability aspects of these technologies.

Figure 6. Emerging memory technologies: (a) RRAM based on modulation of the thickness of a tunneling oxide; (b) RRAM based on filamentary formation; (c) STT MRAM based on giant-magnetoresistive modulation; and (d) PRAM based on resistance variations of amorphous versus polycrystalline phases.


Field failure data generally provide a wealth of knowledge for memory manufacturers to develop and design next-generation products. STT MRAM and PRAM products in the field today could provide some valuable data, but there may not be enough data to envision a reliable memory system for exascale computing without extensive use of redundancy and error correction. Reliability is one of the most studied topics in RRAM research. Considering power, cost, reliability, and maturity of development, PRAM is the most probable candidate to be integrated into the exascale computing system.

Packaging, interconnection, and energy management technologies

Packaging, interconnection, and energy management present formidable challenges to achieving exascale computing within both the system power and cost objectives for the program. Complementary metal-oxide semiconductor (CMOS) feature size will continue to scale, but dissipated power will not. If power per operation were to remain constant, an exaflop machine would have 50 times the power of the anticipated 20-Pflop/s (peta-floating-point operations per second) IBM Blue Gene*/Q supercomputer now under construction [15]. If instead we are to realize a 20-MW machine, we must find a way to dramatically reduce power losses in data movement, power conversion, and cooling while also meeting the increased packaging and interconnection densities needed to support the performance objectives. We discuss this subject in inverse order, where each of these technology challenges can be described in more detail.
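A quick sanity check of these figures, using only the numbers quoted above (the 20-Pflop/s system, the exaflop target, and the 20-MW constraint):

```python
# Sanity check of the numbers in the text: at constant energy per operation,
# power scales with performance; and a 20-MW exaflop machine implies a
# system-level energy budget of only a few tens of picojoules per operation.
exaflop = 1e18            # flop/s
bluegene_q = 20e15        # flop/s, the 20-Pflop/s system cited in the text
power_budget_w = 20e6     # 20-MW practical constraint from the text

power_ratio = exaflop / bluegene_q
joules_per_flop = power_budget_w / exaflop
print(f"Performance (and power) ratio vs. Blue Gene/Q: {power_ratio:.0f}x")
print(f"Whole-system energy budget: {joules_per_flop*1e12:.0f} pJ per flop")
# -> 50x and 20 pJ/flop, which must cover compute, memory, interconnect,
#    power-conversion losses, and cooling overhead combined.
```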

Cooling

Energy-efficient cooling may be realized through the use of water cooling. If the water can be maintained at an acceptable inlet temperature by equilibration with the outside air, a practice known as free cooling [16], then cooling may become particularly inexpensive, as water chillers are not required and insulation to prevent condensation can be eliminated. Water at 35°C or cooler may be easily obtained anywhere on the planet through evaporative cooling towers and the like [17]; thus, above-ambient-temperature water cooling is an emerging technology. The challenge is to cool the processors, memory devices, switches, and power supplies of this exascale machine with such relatively warm water while containing the device temperatures at acceptable levels. This will require continued innovation in thermal interface materials to meet the needs of mechanical compliance. Additionally, thermal interface materials within soldered components must withstand the increased Pb-free solder reflow temperatures. Field replacement of serviceable parts will demand affordable separable interfaces, and we may move from "quick connects," where we break the water flow, to "thermal connectors," where we can break a thermal conduit without introducing high cost or high thermal resistance.
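As a hedged illustration of the warm-water challenge, the following sketch assumes a module power and a junction-to-water thermal resistance (neither value is from the paper) and estimates the resulting junction temperature.

```python
# Illustrative thermal budget for warm-water cooling (assumed values, not from
# the paper): junction temperature = water inlet temperature + power * total
# thermal resistance from junction to water.
water_inlet_c = 35.0        # free-cooled water temperature cited in the text
chip_power_w = 200.0        # assumed processor module power
r_junction_to_water = 0.15  # assumed K/W (TIMs, lid, cold plate, water film)

junction_c = water_inlet_c + chip_power_w * r_junction_to_water
print(f"Estimated junction temperature: {junction_c:.0f} C")
# -> ~65 C: workable, but only if thermal-interface and cold-plate resistances
#    stay low, which is why the text stresses continued TIM innovation.
```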

Power delivery

Efficient power conversion requires a thoughtful look at conversion from high-voltage transmission lines down to the final transistor power rails.

Table 2. Preliminary memory specifications for an exascale memory system.


Today, there can be as many as four power conversion steps: rectification of, for example, 240-V alternating current (ac) to direct current (dc); power-factor correction for the largely capacitive load that the switching electronics present back to the grid; dc-dc down-conversion in two or more steps; and finally the sub-1-V levels expected for exascale processor chip cores. Certainly, reducing the number of conversion steps will be an important first step, and making each conversion step as efficient as possible is a logical next step. We anticipate and expect continued improvements in the efficiency of both ac-dc and external dc-dc converters. Making the last conversion step as close to the processor as possible, either in the processor chip or immediately adjacent to it, will reduce resistance losses from the necessarily large currents. However, as more and more functions are integrated onto the processor chip, a proliferation of different precision voltages in the form of references, static random access memory (SRAM) supply voltages, I/O driver voltages, and receiver threshold voltages requires additional power supplies of modest current, which compete for the precious near-die real estate. This points to an opportunity for on-chip dc-dc voltage conversion, with the mix of on- and off-chip conversion depending on the required area and the resulting efficiency. Resistive series regulators are fundamentally limited to low (approximately 50%) conversion efficiency because of the inherent resistive divider network; these are suitable for very low-current applications or for voltage trimming. Buck converters are more efficient but require on-chip inductors, which are difficult to produce with high quality factors. Switched-capacitor circuits may be an effective solution for on-chip voltage conversion. Such circuits have been built with limited efficiency [18], but recent designs using trench capacitors can potentially enable on-chip conversion efficiencies of more than 90% [19]. Indeed, the Second International Workshop on Power Supply on Chip, PwrSOC10 [20], has garnered considerable attention. Workshop papers cite current densities of up to 10 A/mm²; however, these come with rather high losses on the order of 15% or more. Therefore, as a hedge against on-chip dc-dc power conversion, we must also consider what could be done with near-chip power conversion. For a few amperes per square millimeter, a 95%-efficient near-chip regulator operating off of a 3.3-V supply or higher may be interesting. IBM is active in such endeavors.
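To illustrate why the number and efficiency of conversion steps matter, the following sketch multiplies assumed per-stage efficiencies (illustrative values, not figures from the paper) through a four-step chain.

```python
# Illustrative end-to-end delivery efficiency for a multi-step conversion chain
# (per-stage efficiencies are assumptions, not figures from the paper).
stage_efficiency = {
    "ac-dc rectification":      0.95,
    "power-factor correction":  0.97,
    "bulk dc-dc step-down":     0.93,
    "point-of-load regulator":  0.90,
}

overall = 1.0
for stage, eff in stage_efficiency.items():
    overall *= eff

facility_power_mw = 20.0                      # 20-MW constraint from the text
delivered_mw = facility_power_mw * overall
print(f"End-to-end conversion efficiency: {overall:.1%}")
print(f"Power actually reaching the loads: {delivered_mw:.1f} MW of 20 MW")
# With ~77% overall efficiency, several megawatts are lost in conversion alone,
# which is why fewer, more efficient (and nearer-to-load) steps matter.
```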

Packaging and interconnection

Power losses due to data communication between the multiple die in the system are another significant challenge. One approach to reduce power significantly is to bring the communicating components into close proximity, where short-reach data communications are possible. A balanced architecture that combines lowest-power short-reach links and optimized long-reach communications channels can be achieved with advancements in packaging and interconnection. For short-distance channels, electrical communication links can provide significant power savings in transmit circuits, short wires, and low-power receiver circuits. These channels may consist of ultrashort-reach interconnection structures, such as within thinned die stacks, or other short-reach channels for data communications, such as high-density packaging between multiple die or die stacks. Power savings on the order of 10 to 1,000 times reduction compared with traditional off-chip power levels are possible within these types of communications channels. In addition to power savings, the use of these short-reach, high-density interconnectivity channels can provide the necessary bandwidth, low latency, and cumulative data rates of communications to scale the performance of the HPC system while also meeting cost objectives. The packaging and interconnection technology requirements drive both higher density interconnection and higher rates of data communication that support efficient utilization of die area for I/O, low power per bit of data transferred, and signal integrity with acceptable noise margins at the targeted data rates needed to meet the performance objectives of the system. Specially developed I/O circuits that can transmit and receive data at low power can provide the power savings while also maintaining a small area footprint compatible with high interconnection density. For long-reach data communications, enhancements in optical communications can be leveraged, as previously discussed.

Another power-saving approach is to power down circuits when not in use. The ability to support these power savings necessitates the ability to quickly power down and power up on demand. Highly evolved I/O cells that minimize the energy exchanged per bit of transferred data and that maintain a high data rate with low latency are critical. Optimization of cells that meet these challenges can provide energy savings and high-bandwidth data communications, as well as meet the objectives for the architecture and design of the system within the cost objectives.

The increased interconnection, I/O density, and wiring, and the resulting power savings, can be achieved through the use of 3-D die stacking, 3-D high-density packaging, and high-density interconnection. The exascale system optimization can leverage stacked memory, high-I/O processor die, and high-interconnectivity packaging beyond traditional solutions. Multi-high die-stack technology with through-silicon vias (TSVs) provides the highest level of integration at the lowest power levels for data communication. Use of fine-pitch off-chip I/O and off-chip-stack I/O combined with high-density packaging provides another level of power savings while supporting module integration. Optimization of power efficiency, component cost, wiring length, number of channels, operational data rate, and wire length for module integration are ongoing studies toward an efficient exascale computing solution.
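As a rough illustration of the short-reach savings, the sketch below compares the power needed to move 1 TB/s of traffic over three classes of link; the energy-per-bit figures are assumptions chosen to fall within the 10- to 1,000-times range quoted above.

```python
# Illustrative comparison (assumed energy-per-bit values, not from the paper) of
# moving 1 TB/s of memory traffic over different classes of link. The ratios of
# the assumed figures fall inside the 10x-1,000x range quoted in the text.
bandwidth_bits = 1e12 * 8          # 1 TB/s of traffic, in bits per second

energy_per_bit = {
    "conventional off-module link": 10e-12,   # assumed ~10 pJ/bit
    "on-package short-reach link":  1e-12,    # assumed ~1 pJ/bit
    "TSV link within a die stack":  0.05e-12, # assumed ~0.05 pJ/bit
}

for link, e in energy_per_bit.items():
    print(f"{link:30s}: {bandwidth_bits * e:7.1f} W for 1 TB/s")
# Roughly 80 W vs. 8 W vs. 0.4 W: the case for 3-D stacking and high-density
# packaging is made almost entirely by this data-movement energy.
```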


Examples of technical advancements in 3-D memory die stacking, 3-D high-density interconnection, and 3-D packaging using lead-free solder interconnection that may be applicable to exascale computing have been reported [21–25].

Optical communication technology

Optical interconnects are already a widely accepted solution in today's HPC systems. Starting with the first petaflops machines, short-reach active optical cables (AOCs) based on vertical-cavity surface-emitting lasers and multimode (VCSEL/MM) fibers have been deployed in large volumes, with more than 100,000 optical ports per system [26], to provide high-aggregate-bandwidth communications between racks and the central switch. In order to maintain a reasonable compute-to-bandwidth ratio, parallel optical links have been further developed to provide broadband connectivity to multichip modules in the IBM P775 system [27]. Small-form-factor VCSEL/MM-based optical engines have been developed [28] to provide up to 1.4 TB/s (terabytes per second) of aggregate switching capacity on a switching socket while significantly decreasing power dissipation. The number of parallel optical links in the whole system increased to more than two million, demonstrating an impressive 120% compound annual growth rate [29]. Utilization of optical interconnects is necessitated by the ever-increasing mismatch between computational operations and memory bandwidth. With this tendency likely to extend into the next decade, massive numbers of parallel optical links, on the order of 100 million, which will be used to connect racks, boards, modules, and chips together, are expected in HPC exascale systems. In order to be massively deployed in an exascale system, optical links should provide reliable communication while maintaining extremely low power dissipation, on the order of just a few picojoules per transmitted bit. It is worth mentioning that besides accounting for the electrical-to-optical and optical-to-electrical conversion in transceivers, this power budget should also include all electrical circuitry required to provide high-integrity electrical signaling on a card or on a module for an I/O link. These numbers are approximately 50–100 times lower than what is available with today's technology. Analogously aggressive scaling is expected for the cost of optical interconnects, from today's tens of dollars per transmitted gigabit per second of data to below tens of cents. Although it is expected that further development of next-generation VCSEL/MM-based transceivers might provide some of this scaling, serious difficulties are envisioned in meeting the aggressive power and cost savings required for exascale systems while simultaneously maintaining reliability.
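To put the picojoule-per-bit budget in system context, the following hedged estimate combines the roughly 100 million links cited above with an assumed 25-Gb/s line rate (a rate mentioned later for next-generation links) and treats all links as active, so it is an upper bound.

```python
# Rough system-level check of the optical power budget: ~100 million links
# (from the text) at an assumed 25 Gb/s per link, comparing a few-pJ/bit target
# against a ~50x-higher figure representative of today's technology.
num_links = 100e6
gbps_per_link = 25e9               # assumed per-link line rate
total_bits_per_s = num_links * gbps_per_link

for label, pj_per_bit in [("exascale target (~2 pJ/bit)", 2e-12),
                          ("today's technology (~100 pJ/bit)", 100e-12)]:
    power_mw = total_bits_per_s * pj_per_bit / 1e6
    print(f"{label:32s}: {power_mw:7.1f} MW for the optical interconnect")
# Even at the aggressive target the interconnect consumes megawatts, and at
# today's energy per bit it would exceed the entire 20-MW machine budget.
```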

The new technology of silicon photonics is emerging and has the promise of revolutionizing short-reach optical interconnects. This promise is connected to the idea of integrating optical and electrical circuitry on a single silicon die utilizing mature CMOS technology. If successful, such integration might result in significant cost reduction, greatly increased density of optical interconnects with very low power, and highly reliable all-silicon optical transceivers. Various approaches have been explored, from building optical circuits such as modulators and photodetectors in the front end of the CMOS line (FEOL) [30] to adding low-temperature processing steps at the back end of the line (BEOL) [31]. Several products utilizing some of these approaches mostly target the AOC market [32]. However, in order to meet the very aggressive exascale-era requirements in reliability, cost, and power dissipation, significant development is needed.

Over the last several years, IBM has developed a variant of silicon photonics technology called CMOS-integrated silicon nanophotonics [33, 34]. This technology allows monolithic integration of deeply scaled optical circuits into the FEOL of a standard CMOS process. The light signals produced by an external DC infrared laser propagate on a silicon-on-insulator die inside single-mode silicon waveguides with submicrometer dimensions. These optical circuits share the same silicon device layer with the bodies of nearby metal-oxide semiconductor transistors. Several processing modules have been added to a standard CMOS FEOL processing flow. These modules require a minimal number of additional unique masks and processing steps while sharing most mask levels and processing steps with the rest of the conventional CMOS flow. For example, passive optical waveguides and thermooptic or electrooptic modulators require the addition of only a single additional mask to the flow [33, 34], since they share the same silicon device layer with the CMOS p- and n-channel field-effect transistors. To build a high-performance Ge photodetector, a "Ge-first" integration approach was developed using a rapid-melt-growth technique concurrent with the source-drain anneal step [35].

Utilization of advanced scaled CMOS technology allows one to control the dimensions of optical nanophotonic waveguides to within just a few nanometers, opening the way to very dense optical circuitry. Indeed, the demonstrated optical devices such as wavelength division multiplexers (WDMs) [36], high-speed electrooptical modulators [37], germanium photodetectors [35, 38], and fiber edge couplers [39] are working close to the optical diffraction limit. With such integration density, the area occupied by optical circuitry is becoming comparable to, or sometimes even smaller than, the area occupied by the surrounding analog and digital CMOS circuits that provide signal amplification for robust I/O links.

The demonstrated integration density of just 0.5 mm² per channel in IBM CMOS-integrated silicon nanophotonics technology allows the design of massively parallel, terabit-per-second-class optical transceivers fabricated on a single CMOS die occupying an area of less than 5 × 5 mm² [33–40].
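A quick check of the terabit-per-second-class claim, assuming (as an illustration, not a figure from the paper) a 25-Gb/s rate per optical channel:

```python
# Quick check of the 'terabit-per-second-class' claim, assuming (not stated
# explicitly here) a 25-Gb/s line rate per optical channel.
die_area_mm2 = 5 * 5          # transceiver die area from the text
area_per_channel_mm2 = 0.5    # integration density from the text
gbps_per_channel = 25         # assumed per-channel data rate

channels = die_area_mm2 / area_per_channel_mm2
aggregate_tbps = channels * gbps_per_channel / 1000
print(f"{channels:.0f} channels -> ~{aggregate_tbps:.2f} Tb/s per die")
# -> roughly 50 channels and ~1.25 Tb/s, i.e., terabit-per-second class.
```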


Such integration density, achieved in a standard CMOS foundry, allows one to expect not only dramatic cost reduction and power savings for optical interconnects but also significant improvement in failure rates, potentially approaching the reliability of CMOS circuitry. With such advances, it is possible to expect that Si photonics transceivers will form the interconnect backbone at all levels, from AOCs connecting racks to standalone transceiver chips at the edge of the card or the module. To meet these expectations, significant additional effort is needed, along with advances in compact, low-power, and reliable lasers, for full realization of all the advantages promised by silicon nanophotonics as compared with traditional VCSEL/MM parallel links. Both have promise to scale power per transmitted bit to just a few picojoules per bit and even lower. In VCSEL/MM transceivers, this is expected to come from a further increase of the line data rate to 25 Gb/s and beyond and from utilization of more power-efficient electrical analog circuits designed in advanced CMOS or SiGe technologies. In silicon nanophotonics, analogous or even better power efficiency can be envisioned coming from efficient utilization of a single external laser whose power is shared among multiple parallel optical channels. Further power savings can come as a result of intimate integration of electrical and optical circuitry on the same silicon die, thus minimizing the parasitic effects of the more complex hybrid package typical for VCSEL/MM transceivers. With the high level of integration of almost all transceiver components in a single silicon die, the cost structure of transceivers based on Si nanophotonics technology will be dominated mostly by the cost of packaging and testing. Significant breakthroughs are needed in these areas to make this technology more attractive than more traditional VCSEL/MM optical links. Thus, it is most likely that these two technologies will coexist with each other in the next decade for building the most cost- and energy-efficient parallel optical links.

There are two inherent advantages of Si nanophotonics, differentiating it from VCSEL/MM, that can potentially make this technology more attractive for building exascale systems. One is the possibility of utilizing WDM concepts to significantly increase the aggregate bandwidth and to minimize the number of optical fibers in the system. Indeed, as opposed to VCSEL-based links utilizing multimode fibers, Si nanophotonics is based on single-mode optics and single-mode fibers. Moreover, utilization of small-chirp external modulation, combined with the low dispersion of single-mode fibers, allows the design of optical links that are almost independent of distance. Once in the optical domain, the data can be transmitted as close as the nearest analogous compute socket a few centimeters away or as far as a distant rack located at the farthest side of the building or even several kilometers away.
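An illustrative WDM calculation follows; the wavelength count and per-wavelength rate are assumptions, not values from the paper.

```python
# Illustrative WDM arithmetic (wavelength count and line rate are assumptions,
# not figures from the paper): aggregate bandwidth per single-mode fiber and
# the resulting reduction in fiber count versus one-signal-per-fiber links.
wavelengths_per_fiber = 16     # assumed WDM channel count
gbps_per_wavelength = 25       # assumed line rate per wavelength

fiber_tbps = wavelengths_per_fiber * gbps_per_wavelength / 1000
print(f"Aggregate bandwidth per fiber: {fiber_tbps:.1f} Tb/s")
print(f"Fiber-count reduction vs. one channel per fiber: {wavelengths_per_fiber}x")
# -> 0.4 Tb/s per fiber and 16x fewer fibers, which is the WDM advantage the
#    text highlights for single-mode silicon nanophotonics.
```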

The second, and probably the most important, advantage is the possibility of integrating Si nanophotonics transceivers into a more complex 3-D package in a way very similar to what is envisioned for the integration of memory stacks and compute nodes. While utilization of Si photonics transceivers in HPC systems and high-end data centers will most likely start to happen even earlier than the time horizon for the exascale era, the full promise of the technology will be appreciated when 3-D integration matures enough to stack several silicon dies in a single compute socket, as shown in Figure 3. Since, from a packaging standpoint, a CMOS nanophotonics transceiver die is not much different from a typical CMOS die, it is indeed possible to envision that such dies can be stacked together using 3-D integration technology with dense through-silicon vias. The sole function of the photonic layer in this vision is to provide very high-bandwidth (up to several tens of terabytes per second of data) optical communication off the socket. Bringing optical communication to the chip level would open opportunities for various new architectural solutions. The apparent independence of the communication signal integrity from distance allows the connection of the 3-D integrated compute socket to other sockets, switching nodes, racks, or memory banks located very far away.

Conclusion

The HPC market is expected to continue its strong growth if we can continue to find solutions that allow year-over-year growth in computing performance at constant cost and power. Indeed, if the current rate of growth continues, we can expect to reach an exaflop supercomputer, achieving a theoretical peak speed of 10^18 floating-point operations per second, within the decade. This drive to the exascale will demand advances in 3-D packaging, high-speed electrical and optical signaling, efficient power conversion and cooling, and memory technology. These are all areas of active research both within and outside of IBM. The next decade will be extremely exciting as advances in these areas come to fruition.

Acknowledgments

The authors would like to acknowledge the leadership, deep technical insight, and deeply rewarding conversations with Alan Gara. The manuscript benefited from the insightful suggestions of Dale Becker. Many colleagues on IBM's exascale program, too numerous to name, all contributed through their imagination, insight, and ongoing research.

*Trademark, service mark, or registered trademark of International Business Machines Corporation in the United States, other countries, or both.

**Trademark, service mark, or registered trademark of Top500.org in the United States, other countries, or both.


References

1. P. Kogge. (2011, Feb.). Next generation supercomputers. IEEE Spectr. [Online]. Available: http://spectrum.ieee.org/computing/hardware/nextgeneration-supercomputers

2. R. H. Dennard, "Field-effect transistor memory," U.S. Patent 3 387 286, Jun. 4, 1968.

3. B. Streefkerk, J. Baselmans, W. Gehoel-van Ansem, J. Mulkens, C. Hoogendam, M. Hoogendorp, D. G. Flagello, H. Sewell, and P. Graupner, "Extending optical lithography with immersion," Proc. SPIE, vol. 5377, pp. 285–305, 2004.

4. T. Endoh, R. Shirota, M. Momodomi, T. Tanaka, F. Masuoka, and S. Watanabe, "Electrically erasable programmable read-only memory with NAND cell structure," U.S. Patent 4 959 812, Sep. 25, 1990.

5. T. W. Hickmott, "Low-frequency negative resistance in thin anodic oxide films," J. Appl. Phys., vol. 33, no. 9, pp. 2669–2682, Sep. 1962.

6. A. Beck, J. G. Bednorz, C. Gerber, C. Rossel, and D. Widmer, "Reproducible switching effect in thin oxide films for memory applications," Appl. Phys. Lett., vol. 77, no. 1, pp. 139–141, Jul. 2000.

7. H. S. Yoon, I.-G. Baek, J. Zhao, H. Sim, M. Y. Park, H. Lee, G.-H. Oh, J. C. Shin, I.-S. Yeo, and U.-I. Chung, "Vertical cross-point resistance change memory for ultra-high density non-volatile memory applications or tunneling distance modulation," in VLSI Symp. Tech. Dig., 2009, pp. 26–28, Paper 2B2.

8. C. J. Chevallier, C. H. Siau, S. F. Lim, S. R. Namala, M. Matsuoka, B. L. Bateman, and D. Rinerson, "A 0.13 μm 64 Mb multi-layered conductive metal-oxide memory," in Proc. ISSCC, 2010, pp. 260–261.

9. J. C. Slonczewski, "Current-driven excitation of magnetic multilayers," J. Magn. Magn. Mater., vol. 159, no. 1/2, pp. L1–L7, Jun. 1996.

10. L. Berger, "Emission of spin waves by a magnetic multilayer traversed by a current," Phys. Rev. B, Condens. Matter, vol. 54, no. 13, pp. 9353–9358, Oct. 1996.

11. Y. Huai. (2008, Dec.). Spin-transfer torque MRAM (STT-MRAM): Challenges and prospects. AAPPS Bull. [Online]. 18(6), pp. 33–40. Available: http://www.cospa.ntu.edu.tw/aappsbulletin/data/18-6/33spin.pdf

12. A. D. Pearson, W. R. Northover, J. F. Dewald, and W. F. Peck, Jr., "Chemical, physical, and electrical properties of some unusual inorganic glasses," in Advances in Glass Technology. New York: Plenum, 1962, pp. 357–365.

13. W. R. Northover and A. D. Pearson, "Glass composition," U.S. Patent 3 117 013, Jan. 7, 1964.

14. G. W. Burr, B. N. Kurdi, J. C. Scott, C. H. Lam, K. Gopalakrishnan, and R. S. Shenoy, "Overview of candidate device technologies for storage-class memory," IBM J. Res. & Dev., vol. 52, no. 4/5, pp. 449–464, Jul. 2008.

15. R. A. Haring, "The IBM Blue Gene/Q compute chip + SIMD floating-point unit," presented at the 23rd Symp. High Performance Chips (Hot Chips), Palo Alto, CA, Aug. 17–19, 2011. [Online]. Available: http://www.hotchips.org/program/program-23

16. S. Hanson and J. Harshaw, "Free cooling using water economizers," TRANE Eng. Newslett., vol. 37, no. 3, pp. 1–8, 2008.

17. S. C. Sherwood and M. Huber. (2010, May). An adaptability limit to climate change due to heat stress. Proc. Nat. Acad. Sci. [Online]. 107(21), pp. 9552–9555. Available: http://www.pnas.org/content/early/2010/04/26/0913352107.full.pdf

18. G. Patounakis, Y. W. Li, and K. L. Shepard, "A fully integrated on-chip DC–DC conversion and power management system," IEEE J. Solid-State Circuits, vol. 39, no. 3, pp. 443–451, Mar. 2004.

19. L. Chang, D. J. Frank, R. K. Montoye, S. J. Koester, B. L. Ji, P. W. Coteus, R. H. Dennard, and W. Haensch, "Practical strategies for power-efficient computing technologies," Proc. IEEE, Special Issue on Circuit Technology for Ultra-Low Power, vol. 98, no. 2, pp. 215–236, Feb. 2010.

20. Proceedings of the International Workshop on Power Supply on Chip, Cork, Ireland, Oct. 13–15, 2010. [Online]. Available: http://www.powersoc.org/schedule.php

21. C. Mellor, "Samsung readies less sockety servers for 3D stacked memory," Channel Register, Dec. 7, 2010. [Online]. Available: www.channelregister.co.uk/2010/12/07/samsng_3d_tsv

22. M. LaPedus, "Micron COO talks 450 mm, 3D, EUV," EE Times, Jan. 11, 2011. [Online]. Available: http://www.eetimes.com/electronics-news/4212072/Micron-COO-talks-450-mm–3-D–EUV

23. J. U. Knickerbocker, P. S. Andry, L. P. Buchwalter, A. Deutsch, R. R. Horton, K. A. Jenkins, Y. H. Kwark, G. McVicker, C. S. Patel, R. J. Polastre, C. D. Schuster, A. Sharma, S. M. Sri-Jayantha, C. W. Surovic, C. K. Tsang, B. C. Webb, S. L. Wright, S. R. McKnight, E. J. Sprogis, and B. Dang, "Development of next-generation system-on-package (SOP) technology based on silicon carriers with fine-pitch chip interconnection," IBM J. Res. & Dev., vol. 49, no. 4/5, pp. 725–753, Jul. 2005.

24. J. Maria, B. Dang, S. L. Wright, C. K. Tsang, P. Andry, R. Polastre, Y. Liu, and J. U. Knickerbocker, "3D chip stacking with 50 μm pitch lead-free micro-C4 interconnections," in Proc. ECTC, Orlando, FL, 2011, pp. 268–273.

25. T. O. Dickson, Y. Liu, S. V. Rylov, B. Dang, C. K. Tsang, P. S. Andry, J. F. Bulzacchelli, H. A. Ainspan, X. Gu, L. Turlapati, M. P. Beakes, B. D. Parker, J. U. Knickerbocker, and D. J. Friedman, "An 8 × 10-Gb/s source-synchronous I/O system based on high-density silicon carrier interconnects," presented at the Symp. VLSI Circuits, Kyoto, Japan, 2011, Paper 8.1.

26. IBM Corporation, Fact Sheet & Background: Roadrunner Smashes the Petaflop Barrier, IBM press release, Jun. 9, 2008. [Online]. Available: http://www-03.ibm.com/press/us/en/pressrelease/24405.wss

27. IBM Corporation, IBM Power 775 Supercomputer. [Online]. Available: http://www-03.ibm.com/systems/power/hardware/775/

28. M. H. Fields, J. Foley, R. Kaneshiro, L. McColloch, D. Meadowcroft, F. W. Miller, S. Nassar, M. Robinson, and H. Xu, "Transceivers and optical engines for computer and datacenter interconnects," presented at the Optical Fiber Communication Conf. Expo. and Nat. Fiber Optic Engineers Conf., San Diego, CA, 2010, Paper OTuP1.

29. A. Benner, D. M. Kuchta, P. K. Pepeljugoski, R. A. Budd, G. Hougham, B. V. Fasano, K. Marston, H. Bagheri, E. J. Seminaro, H. Xu, D. Meadowcroft, M. H. Fields, L. McColloch, M. Robinson, F. W. Miller, R. Kaneshiro, R. Granger, D. Childers, and E. Childers, "Optics for high-performance servers and supercomputers," presented at the Optical Fiber Communication Conf. Expo. and Nat. Fiber Optic Engineers Conf., San Diego, CA, 2010, Paper OTuH1.

30. C. Gunn, "CMOS photonics for high-speed interconnects," IEEE Micro, vol. 26, no. 2, pp. 58–66, Mar./Apr. 2006.

31. I. Young, B. Block, M. Reshotko, and P. Chang, "Integration of nano-photonic devices for CMOS chip-to-chip optical I/O," presented at the Conf. Lasers Electro-Optics, San Diego, CA, 2010, Paper CWP1.

32. G. Masini, G. Capellini, J. Witzens, and C. Gunn, "A four-channel, 10 Gbps monolithic optical receiver in 130 nm CMOS with integrated Ge waveguide photodetectors," presented at the Optical Fiber Communication Conf. Expo. and Nat. Fiber Optic Engineers Conf., Anaheim, CA, 2007, Paper PDP31.

33. W. M. J. Green, S. Assefa, A. Rylyakov, C. Schow, F. Horst, and Y. A. Vlasov, "CMOS integrated silicon nanophotonics: Enabling technology for exascale computational systems," in Proc. SEMICON, Chiba, Japan, Dec. 1–3, 2010. [Online]. Available: http://www.research.ibm.com/photonics

34. S. Assefa, W. M. J. Green, A. Rylyakov, C. Schow, F. Horst, and Y. A. Vlasov, "CMOS integrated silicon nanophotonics: Enabling technology for exascale computational systems," presented at the Optical Fiber Communication Conf. Expo. and Nat. Fiber Optic Engineers Conf., Los Angeles, CA, 2011, Paper OMM6.


35. S. Assefa, F. Xia, S. W. Bedell, Y. Zhang, T. Topuria, P. M. Rice, and Y. A. Vlasov, "CMOS-integrated high-speed MSM germanium waveguide photodetector," Opt. Express, vol. 18, no. 5, pp. 4986–4999, Mar. 2010.

36. F. Horst, W. M. J. Green, B. J. Offrein, and Y. A. Vlasov, "Silicon-on-insulator Echelle grating WDM demultiplexers with two stigmatic points," IEEE Photon. Technol. Lett., vol. 21, no. 23, pp. 1743–1745, Dec. 2009.

37. J. C. Rosenberg, W. M. J. Green, S. Assefa, T. Barwicz, M. Yang, S. M. Shank, and Y. A. Vlasov, "Low-power 30 Gbps silicon microring modulator," presented at the Conf. Lasers Electro-Optics, Baltimore, MD, 2011, Paper PDPB9.

38. S. Assefa, F. Xia, and Y. A. Vlasov, "Reinventing germanium avalanche photodetector for nanophotonic on-chip optical interconnects," Nature, vol. 464, no. 7285, pp. 80–84, Mar. 2010.

39. F. E. Doany, B. G. Lee, S. Assefa, W. M. J. Green, M. Yang, C. L. Schow, C. V. Jahnes, S. Zhang, J. Singer, V. I. Kopp, J. A. Kash, and Y. A. Vlasov, "Multichannel high-bandwidth coupling of ultradense silicon photonic waveguide array to standard-pitch fiber array," J. Lightw. Technol., vol. 29, no. 4, pp. 475–482, Feb. 2011.

40. IBM Corporation, Made in IBM Labs: Breakthrough Chip Technology Lights the Path to Exascale Computing, IBM press release, Dec. 1, 2010. [Online]. Available: http://www-03.ibm.com/press/us/en/pressrelease/33115.wss

Received May 5, 2011; accepted for publication July 4, 2011

Paul W. Coteus IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, NY 10598 USA ([email protected]). Dr. Coteus is a Manager and Research Staff Member in the Systems Research department at the IBM T. J. Watson Research Center. He received his Ph.D. degree in physics from Columbia University in 1981. He continued to design an electron-proton collider and, from 1982 to 1988, was Assistant Professor of Physics at the University of Colorado, Boulder, studying neutron production of charmed baryons. In 1988, he joined the IBM T. J. Watson Research Center as a Research Staff Member. He has managed the Systems Packaging Group since 1994, where he directs and designs advanced packaging for high-speed electronics, including I/O circuits, memory system design, standardization of high-speed DRAM, and high-performance system packaging. He was Chairman in 2001 and Vice-chairman from 1998 to 2000 of the JEDEC (Joint Electron Devices Engineering Council) Future DRAM Task Group. He is the Chief Engineer of Blue Gene systems and leads the system design and packaging of this family of supercomputers, recently honored with the National Medal of Technology and Innovation. He is a senior member of the IEEE, a member of the IBM Academy of Technology, and an IBM Master Inventor. He has authored more than 90 papers in the field of electronic packaging and holds 105 U.S. patents.

John U. Knickerbocker IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, NY 10598 USA ([email protected]). Dr. Knickerbocker is an IBM Distinguished Engineer, Master Inventor, and Manager at the IBM T. J. Watson Research Center. He is the manager of 3-D Silicon Integration, including through-silicon vias (TSVs), wafer finishing, 3-D integrated circuit die and wafer assembly, and fine-pitch test, modeling, characterization, and reliability evaluation. He received his Ph.D. degree in 1982 from the University of Illinois, studying materials science and engineering. He joined IBM in 1983 at IBM Microelectronics in East Fishkill, NY, where he held a series of engineering and management positions during his first 21 years with IBM, including director of IBM worldwide packaging development. In 2003, he joined the IBM Research Division, where he has led the development of next-generation 3-D silicon integration. His 3-D research advancements include design, wafer fabrication with TSVs, wafer bonding and thinning, wafer finishing, thin die stacking and assembly with fine-pitch interconnections, wafer stacking, fine-pitch wafer test, mechanical, electrical, and thermal modeling, and characterization and reliability characterization. Dr. Knickerbocker has received an IBM Corporate Award, four Division Awards, and 47 Invention Plateau awards. He has authored or coauthored 188 U.S. patents or patent applications and more than 60 technical papers and publications. He serves as a member of the SEMATECH (Semiconductor Manufacturing Technology) 3-D Working Group. He has been a member of IEEE and the International Microelectronics and Packaging Society (IMAPS), and a Fellow of the American Ceramic Society. He has more than 28 years of experience at IBM and is a member of IBM's Academy of Technology. He continues to actively serve as a member of the Advanced Packaging Committee for the Electronic Components and Technology Conference as an active member of the IEEE Technical Society.

Chung H. Lam IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, NY 10598 USA ([email protected]). Dr. Lam is an IBM Distinguished Engineer, Master Inventor, and Manager at the IBM T. J. Watson Research Center. He received his B.Sc. degree in electrical engineering at Polytechnic University of New York in 1978. He joined the IBM General Technology Division in Burlington in 1978 as a memory circuit designer. He worked on 144-Kb CPROS (cost performance read-only storage) and 36-Kb EEPROM and led the development of a 64-Kb nonvolatile RAM. He was awarded the IBM Resident Study Fellowship in 1984, with which he received his M.Sc. and Ph.D. degrees in electrical engineering at Rensselaer Polytechnic Institute in 1987 and 1988, respectively. Upon returning from resident study, he took on responsibilities in various disciplines of semiconductor research and development, including circuit and device designs, as well as process integrations for memory and logic applications in the IBM Microelectronics Division. In 2003, he transferred to the IBM Research Division at the T. J. Watson Research Center to lead the research of new nonvolatile memory technologies. Currently, he manages the Phase Change Memory Research Joint Project with partner companies at the T. J. Watson Research Center. He is author or coauthor of 120 patents and 70 technical papers. He is a member of the Technical Committee of the IEEE International Memory Workshop (IMW), International Electron Devices Meeting (IEDM) 2011, International Symposium on VLSI Technology, Systems and Applications (VLSI-TSA), China Semiconductor Technology International Conference (CSTIC), and the International Technology Roadmap for Semiconductors Committee (ITRS). He is also a member of the IBM Academy of Technology.

Yurii A. Vlasov IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, NY 10598 USA ([email protected]). Dr. Vlasov is a Research Staff Member and Manager at the IBM T. J. Watson Research Center. He has led the IBM team on silicon integrated nanophotonics for on-chip optical interconnects since joining IBM in 2001. Prior to IBM, Dr. Vlasov developed semiconductor photonic crystals at the NEC Research Institute in Princeton and at the Strasbourg IPCMS Institute, France. He also was, for over a decade, a Research Scientist with the Ioffe Institute of Physics and Technology in St. Petersburg, Russia, working on the optics of semiconductors and photonic crystals. He received his M.S. degree from the University of St. Petersburg (1988) and the Ph.D. degree from the Ioffe Institute (1994), both in physics. Dr. Vlasov has published more than 100 journal papers, filed more than 40 U.S. patents, and delivered more than 150 invited and plenary talks in the area of nanophotonic structures. He served on numerous organizing committees of conferences on nanophotonics. Dr. Vlasov is a Fellow of both the Optical Society of America and the American Physical Society, as well as a Senior Member of the IEEE. He was awarded several Outstanding Technical Achievement Awards from IBM and was named a scientist of the year by the Scientific American journal. Dr. Vlasov has also served as an adjunct professor at Columbia University's Department of Electrical Engineering.
