
A Landscape of the New Dark Silicon Design Regime

Michael B. Taylor, University of California, San Diego

Abstract

Due to the breakdown of Dennard scaling, the percentage of a silicon chip that can switch at full frequency is dropping exponentially with each process generation. This utilization wall forces designers to ensure that, at any point in time, large fractions of their chips are effectively dark silicon, i.e., significantly underclocked or idle for large periods of time.

As exponentially larger fractions of a chip's transistors become dark, silicon area becomes an exponentially cheaper resource relative to power and energy consumption. This shift is driving a new class of architectural techniques that "spend" area to "buy" energy efficiency. All of these techniques seek to introduce new forms of heterogeneity into the computational stack.

This paper examines four key approaches, the four horsemen, that have emerged as top contenders for thriving in the dark silicon age. Each class carries, along with its virtues, deep-seated restrictions that require a careful understanding of the underlying tradeoffs and benefits. Further, we propose a set of dark silicon design principles, and examine how one of the darkest computing architectures of all, the human brain, trades off energy and area in ways that provide potential insights into future directions for computer architecture.

Keywords Dark Silicon, Multicore, Utilization Wall, Four Horsemen

1 Introduction

Recent VLSI technology trends have led to a new disruptive regime for digital chip designers, where Moore's Law continues but CMOS scaling provides increasingly diminished fruits. As in prior years, the computational capabilities of chips are still increasing by 2.8× per process generation; but a utilization wall [6] limits us to only 1.4× of this benefit, causing large underclocked swaths of silicon area, hence the term dark silicon [12, 7].

Fortunately, simple scaling theory makes these numbers easy to derive, allowing us to think intuitively about the problem. Transistor density continues to improve by 2× every two years, and native transistor speeds improve by 1.4×. But transistor energy efficiency improves by only 1.4×, which, under constant power budgets, causes a 2× shortfall in the energy budget needed to power a chip at its native frequency. Therefore, our utilization of a chip's potential is dropping exponentially by a jaw-dropping 2× per generation. Thus, if we are just bumping up against the dark silicon problem in last generation's designs, then in eight years, designs will be 93.75% dark!

A recent paper [18] refers to this widespread disruptive factor informally as the dark silicon apocalypse, because it officially marks the end of one reality ("Dennard scaling") [14], where progress could be measured by improvements in transistor speed and count, and the beginning of a new reality ("post-Dennard scaling"), where progress is measured by improvements in transistor energy efficiency. Previously, we tweaked our circuits to reduce transistor delays and turbo-charged them with dual-rail domino to reduce FO4 delays. Now, we will tweak our circuits to minimize capacitance switched per function; we will strip our circuits down and starve them of voltage to squeeze out every femtojoule. Where once we would spend exponentially increasing quantities of transistors to buy performance, now we will spend these transistors to buy energy efficiency.

The CMOS scaling breakdown was the direct cause of industry's transition to multicore in 2005. Because filling chips with cores does not fundamentally circumvent utilization wall limits, multicore is not the final solution to dark silicon [7]; it is merely industry's initial, liminal response to the shocking onset of the dark silicon age. Increasingly over time, the semiconductor industry is adapting to this new design regime.

Due to the breakdown of Dennard scaling, multicore chips will not scale as transistors shrink; the fraction of a chip that can be filled with cores running at full frequency is dropping exponentially with each process generation [7, 6]. This reality forces designers to ensure that, at any point in time, large fractions of their chips are effectively dark: either idle for long periods of time or significantly underclocked. As exponentially larger fractions of a chip's transistors become darker, silicon area becomes an exponentially cheaper resource relative to power and energy consumption. This shift calls for new architectural techniques that "spend" area to "buy" energy efficiency. Optionally, we can then apply the saved energy to increase performance, to extend battery life, or to lower operating temperatures.

In this paper, we examine the landscape that is coming to light in the dark silicon regime. We start by recapping the utilization wall that is the cause of dark silicon, and continue by examining why the multicore response to the utilization wall is inherently limited [7, 13, 9]. We examine four recently proposed approaches, which we refer to as the four horsemen [18], that emerge as solutions as we transition beyond the initial multicore stop-gap. Looking beyond the four horsemen, we propose a set of design principles for dark silicon, and show how the brain, an inherently "dark" computing device, offers new directions beyond these principles.


Figure 1: Multicore scaling leads to large amounts of dark silicon. Moving from 65 nm to 32 nm (S = 2), a chip with 4 cores at 1.8 GHz faces a spectrum of tradeoffs between core count and frequency: 4 cores at 2 × 1.8 GHz (12 cores dark), or 2 × 4 cores at 1.8 GHz (8 cores dark, 8 dim; industry's choice), among others. (From [7].)

2 The Utilization Wall that causes Dark Silicon

Table 1 shows the derivation of the utilization wall [6] that causes dark silicon [12, 7]. We employ a scaling factor, S, which is the ratio between the feature sizes of two processes (e.g., S = 32/22 ≈ 1.4 between 32 nm and 22 nm). In both Dennard and post-Dennard scaling, transistor count scales by S², and transistor switching frequency scales by S. Thus our net increase in compute performance is S³, or 2.8×.

However, to maintain a constant power envelope, these gains must be offset by a corresponding reduction in transistor switching energy. In both cases, scaling reduces transistor capacitance by S, improving energy efficiency by S. In Dennardian scaling, we can scale the threshold voltage and thus the operating voltage, which yields another S² energy-efficiency improvement. However, in today's post-Dennardian, leakage-limited regime, we cannot scale threshold voltage without exponentially increasing leakage, and as a result we must hold operating voltage roughly constant. The end result is a shortfall of S², or 2×, per process generation. This is an exponentially worsening problem that multiplies with each process generation.

This shortfall prevents multicore from being the solution to scaling [7, 6]. Although we have enough transistors to increase core count by 2×, and frequency could be 1.4× faster, the energy budget permits only a 1.4× improvement. Per Figure 1, across two process generations (S = 2), we can either increase core count by 2×, or frequency by 2×, or choose some middle ground between the two. The remaining 4× potential goes unused. This 4× reflects itself in darker silicon: either fewer cores or lower-frequency cores, depending on our preference.

More positively stated, the true new potential of Moore's Law is a 1.4× energy-efficiency improvement per generation, which can be used to increase performance by 1.4×. Additionally, if we can somehow make use of 2× more dark silicon, we can do even better.
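To make the first-order model concrete, the following sketch computes the post-Dennard budget for one process generation and enumerates points on Figure 1's core-count/frequency tradeoff. It is a minimal illustration of the scaling math above, using Figure 1's 4-core, 1.8-GHz baseline.

```python
# First-order post-Dennard scaling model (Section 2).
def post_dennard(S):
    """Per-generation scaling factors for a feature-size ratio S."""
    transistors = S ** 2          # transistor count scales by S^2
    frequency = S                 # switching frequency scales by S
    energy_eff = S                # capacitance drops by S; Vdd ~constant
    potential = transistors * frequency   # S^3: raw compute capability
    usable = energy_eff                   # S: what the power budget allows
    return potential, usable, usable / potential

potential, usable, utilization = post_dennard(S=2 ** 0.5)   # one generation
print(f"potential {potential:.1f}x, usable {usable:.1f}x, "
      f"utilization {utilization:.2f}x")          # 2.8x, 1.4x, 0.50x

# Figure 1's spectrum across two generations (S = 2): a 65-nm chip with
# 4 cores at 1.8 GHz scales to 16 possible cores at up to 3.6 GHz, but
# the power budget permits only a doubling of total core-GHz.
budget = 2 * (4 * 1.8)            # sustainable core-GHz at 32 nm
for cores in (4, 8, 16):
    print(f"{cores:2d} cores @ {budget / cores:.1f} GHz, "
          f"{16 - cores} cores dark")
```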

2.1 Dark Silicon is Real: Calibrating the Utilization Wall

A quick survey of recent designs from Tilera (TileGx), Intel, and AMD indicates that industry has pursued core count and frequency combinations consistent with the utilization wall. The latest 2012 ITRS roadmap also shows that scaling has proceeded consistently with post-Dennard predictions. For instance, Intel's 90-nm single-core Prescott chip ran at 3.8 GHz in 2004. Dennard scaling would suggest that a 22-nm multicore version should run at 15.5 GHz and contain 17 superscalar cores, for a total improvement of 69× in instruction throughput. Instead, the upcoming 2013 22-nm Intel Core i7 4960X runs at 3.6 GHz and has 6 superscalar cores, a 5.7× peak serial instruction throughput improvement. The darkness ratio is thus 91.74%, versus the 93.75% predicted by the utilization wall.
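The arithmetic behind this calibration is easy to reproduce. A minimal check, using the chip parameters cited above:

```python
# Reproducing the Prescott-to-Core-i7 calibration in Section 2.1.
S = 90 / 22                            # 90 nm -> 22 nm, S ~ 4.1

# Dennard prediction: frequency scales by S, core count by S^2.
dennard_ghz = 3.8 * S                  # ~15.5 GHz
dennard_cores = round(S ** 2)          # ~17 cores
dennard_gain = (dennard_ghz / 3.8) * dennard_cores   # ~69x throughput

# Actual design point: 6 superscalar cores at 3.6 GHz.
actual_gain = (3.6 / 3.8) * 6          # ~5.7x throughput

darkness = 1 - actual_gain / dennard_gain
print(f"{darkness:.2%} dark")          # ~91.8% (91.74% with rounded 5.7/69)
```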

Although the utilization wall is based on a first-order model that simplifies many factors, it has proved to be an effective tool for designers to gain intuition about the future. Follow-on work [9, 13, 20] has extended this early work on dark silicon and multicore scaling [7, 6] with more sophisticated models that incorporate factors such as application space and cache size.

Transistor Property               Dennard    Post-Dennard
Δ Quantity                        S²         S²
Δ Frequency                       S          S
Δ Capacitance                     1/S        1/S
Δ Vdd²                            1/S²       1
⇒ Δ Power = Δ(QFCVdd²)            1          S²
⇒ Δ Utilization = 1/ΔPower        1          1/S²

Table 1: Dennard vs. post-Dennard (leakage-limited) scaling. In contrast to Dennard scaling, which held until 2005 [14], under the post-Dennard regime the total chip utilization for a fixed power budget drops by S² with each process generation. The result is an exponential increase in dark silicon for a fixed-size chip. From [6].

3 Misconceptions of Dark Silicon

Let us clear up a few misconceptions before proceeding. First, dark silicon does not mean blank, useless, or unused silicon; it is just silicon that is not used all the time, or not at its full frequency. Even during the best days of CMOS scaling, microprocessors and other circuits were chock full of "dark logic" that is used only for some applications and/or only infrequently. For instance, caches are inherently dark because the average cache transistor is switched far less than one percent of cycles, and FPUs remain dark in integer codes.

Soon, the exponential growth of dark silicon area will push us beyond logic targeted for direct performance benefits, towards swaths of low-duty-cycle logic that exists not for direct performance benefit but for improving energy efficiency. This improved energy efficiency can then allow an indirect performance improvement, because it frees up more of the fixed power budget to be used for even more computation.

4 The Four Horsemen

We now examine the four horsemen: four promising approaches to dealing with dark silicon. These responses originally appeared to be unlikely candidates, carrying unwelcome burdens in design, manufacturing, or programming. None is ideal from an aesthetic engineering point of view. But the success of complex multi-regime devices like MOSFETs has taught us that engineers can tolerate complexity if the end result is better. We suspect that future chips will employ not just one horseman but all of them, in interesting and unique combinations.

4.1 The Shrinking Horseman

When confronted with the possibility of dark silicon, many chip designers insist, "Area is expensive. Chip designers will just build smaller chips instead of having dark silicon in their designs!" Among the four horsemen, shrinking chips are the most pessimistic outcome. Although all chips may eventually shrink somewhat, the ones that shrink the most will be those for which dark silicon cannot be applied fruitfully to improve the product. These chips will rapidly turn into low-margin businesses for which further generations of Moore's Law provide small benefit. We continue by examining a spectrum of second-order effects associated with shrinking chips.

Cost Side of Shrinking Silicon. Understanding shrinking chips calls for us to consider semiconductor economics. The "build smaller chips" argument has a ring of truth; after all, designers spend much of their time trying to meet area budgets for existing chip designs. But exponentially smaller chips are not exponentially cheaper: even if silicon begins as 50% of system cost, after a few process generations it will be a tiny fraction. Mask costs, design costs, and I/O pad area will fail to be amortized, leading to rising costs per mm² of silicon that ultimately will eliminate incentives to move the design to the next process generation. These designs will be "left behind" on older generations.

Revenue Side of Shrinking Silicon. Shrinking silicon can also shrink the chip's selling price. In a competitive market, if there is a way to use the next process generation's bounty of dark silicon to attain a benefit in the end product, then competition will force companies to do it. Otherwise, they will generally be forced into low-end, low-margin, high-competition markets while their competitors take the high end and enjoy high margins. Thus, in scenarios where dark silicon can be used profitably, decreasing area in lieu of exploiting it would certainly decrease system costs, but would catastrophically decrease sale price. The shrinking-chips scenario is therefore likely to happen only if we can find no practical use for dark silicon.

Power and Packaging Issues with Shrinking Chips. A major consequence of exponentially shrinking chips is a corresponding exponential rise in power density. Recent analysis of manycore thermal characteristics [20] has shown that peak hotspot temperature rise can be modeled as Tmax = TDP × (Rconv + k/A), where Tmax is the rise in temperature, TDP is the target chip thermal design power, Rconv is the heatsink's thermal convection resistance (lower is a better heatsink), k incorporates manycore design properties, and A is the chip area. If area drops exponentially, then the second term dominates and chip temperatures rise exponentially. This in turn will force a lower TDP so that temperature limits are met, reducing scaling below even the nominal 1.4× expected energy-efficiency gain. Thus, if thermals drive your shrinking-chip strategy, it is much better to hold frequency constant and increase cores by 1.4× with a net area decrease of 1.4× than it is to increase frequency by 1.4× and shrink your chip by 2×.
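To see how strongly the k/A term punishes shrinking dies, consider a small sketch of this thermal model. The constants below are illustrative placeholders chosen only to show the trend, not values taken from [20]:

```python
# Peak hotspot temperature rise vs. die area: Tmax = TDP * (Rconv + k/A).
# Illustrative constants only; not values from Huang et al. [20].
TDP = 100.0      # chip thermal design power, W
R_CONV = 0.2     # heatsink convection resistance, K/W
K = 50.0         # lumped manycore design constant, K*mm^2/W

def temp_rise(area_mm2):
    return TDP * (R_CONV + K / area_mm2)

for area in (400, 200, 100, 50):       # halving die area each generation
    print(f"{area:3d} mm^2 -> dT = {temp_rise(area):5.1f} K")
# Output climbs from 32.5 K to 120.0 K: as A shrinks, k/A dominates,
# forcing TDP (and thus performance) down to meet temperature limits.
```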

4.2 The Dim Horseman

As exponentially larger fractions of a chip's transistors become dark transistors, silicon area becomes an exponentially cheaper resource relative to power and energy consumption. This shift calls for new architectural techniques that "spend" area to "buy" energy efficiency. If we move past unhappy thoughts of shrinking silicon and consider populating dark silicon area with logic that we use only part of the time, then we have two choices: do we make the new logic general-purpose, or special-purpose? In this section, we examine low-duty-cycle alternatives that try to retain general applicability across many applications; in the next section, we consider specialized logic.

We employ the term dim silicon [11, 20] to refer to general-purpose logic that is heavily underclocked or used infrequently to meet the power budget; that is, the architecture strategically manages the chip-wide transistor duty cycle to enforce the overall power constraint. Whereas early 90-nm designs like Cell and Prescott were dimmed because actual power exceeded design-estimated power, we are converging on increasingly elegant methods that make better tradeoffs.

Dim silicon techniques include allowing the number of active cores and their frequency to be dynamically traded off, scaling up the amount of cache logic, employing near-threshold voltage (NTV) processor designs, and redesigning the architecture to accommodate bursts that temporarily allow the power budget to be exceeded, such as Turbo Boost and Computational Sprinting.

Turbo Boost 1.0. Although first-generation multicores had a ship-time-determined top frequency that was invariant to the number of currently active cores, Intel's Turbo Boost 1.0 enabled the next generation to make real-time tradeoffs between active core count and the frequency the cores ran at: the fewer the cores, the higher the frequency.

Near-Threshold Voltage Processors. One recently emerging approach is near-threshold voltage (NTV) logic [15], which operates transistors in the near-threshold regime, providing more palatable tradeoffs between energy and delay than prior subthreshold circuits.

Recently, researchers have explored wide-SIMD NTV processors [1], which seek to exploit data parallelism, as well as an NTV manycore processor [4] and an NTV x86 processor [17].

Although NTV per-processor performance drops faster than the corresponding savings in energy per instruction (a 5× energy improvement for an 8× performance cost), the performance loss can be offset by using 8× more processors in parallel, if the workload allows it. Then, an additional 5× processors could turn the energy-efficiency gains into additional performance. So with ideal parallelization, NTV could offer a 5× throughput improvement by absorbing 40× the area. But this would also require 40× more free parallelism in the workload, relative to the parallelism "consumed" by an equivalent energy-limited super-threshold manycore processor.
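The bookkeeping in this argument can be captured in a few lines, using the 5× and 8× figures quoted above:

```python
# NTV bookkeeping: 5x better energy per instruction at an 8x
# per-core performance cost (figures quoted above).
ENERGY_GAIN = 5.0                      # energy-per-instruction improvement
PERF_COST = 8.0                        # per-core slowdown at NTV

cores_to_match = PERF_COST             # 8x cores just match baseline speed
cores_for_gain = ENERGY_GAIN           # 5x more spend the energy savings
total_area = cores_to_match * cores_for_gain   # 40x area
throughput = ENERGY_GAIN                       # 5x, still power-limited

print(f"{total_area:.0f}x area and {total_area:.0f}x free parallelism "
      f"buy {throughput:.0f}x throughput")
```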

In practice, for many applications, 40× additional parallelism can be elusive. For chips with large power budgets that can already sustain hundreds of cores, applications that have this much spare parallelism are relatively rare. Interestingly, because of this effect, NTV's applicability across applications increases in low-energy environments, because the energy-limited baseline super-threshold design has consumed less of the available parallelism. And clearly, NTV becomes more applicable for workloads with extremely large amounts of parallelism.

NTV presents a variety of circuit-related challenges that have seen active investigation, especially because technology scaling will exacerbate rather than ameliorate these factors. A significant NTV challenge has been susceptibility to process variability. As operating voltages drop, variation in transistor threshold due to random dopant fluctuation is proportionally higher, and leakage and operating frequency can vary greatly. Since NTV designs expand area consumption by ~8× or more, variation issues are exacerbated. Other challenges include the penalties involved in designing low-operating-voltage SRAMs and the increased interconnect energy consumption due to greater spreading across cores.

Bigger Caches. An often-proposed dim-silicon alternative is to simply allocate otherwise-dark silicon area to caches. Because only a subset of cache transistors (e.g., a wordline) are accessed each cycle, cache memories have low duty cycles and thus are inherently dark. Compared to general-purpose logic, an L1 cache clocked at its maximum frequency can be about 10× darker per square mm, and larger caches can be even darker. Thus, adding cache is one way to simultaneously increase performance and lower power density per square mm. We can imagine, for instance, expanding per-core cache at a rate that soaks up the remaining dark silicon area: 1.4–2× more cache per core per generation. However, many applications do not benefit from additional cache, and upcoming TSV-integrated DRAM will reduce the benefit of caches.
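The inherent darkness of SRAM is easy to see from its geometry. The sketch below uses an assumed, illustrative 32-KB array organization rather than any particular design:

```python
# Why caches are inherently dark: only one wordline's worth of cells
# switches per access. Assumed geometry: 32 KB as 512 rows x 512 cols.
ROWS, COLS = 512, 512
cells_total = ROWS * COLS              # 256 Kbit = 32 KB of bit cells
cells_active = COLS                    # one row activated per access
duty_cycle = cells_active / cells_total
print(f"fraction of cells active per access: {duty_cycle:.4f}")   # ~0.002
# Even if accessed every cycle, >99% of the array is dark on any given
# cycle, and larger arrays (more rows) are darker still.
```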

Figure 2: The GreenDroid architecture, an example of a Coprocessor-Dominated Architecture (CoDA). The GreenDroid Mobile Application Processor (a) is made up of 16 non-identical tiles. Each tile (b) holds components common to every tile (the CPU, on-chip network (OCN), and shared L1 data cache) and provides space for multiple c-cores of various sizes. Panel (c) shows connections among these components and the c-cores.

Computational Sprinting and Turbo Boost. Other techniques employ "temporal dimness" as opposed to "spatial dimness," temporarily exceeding the nominal thermal budget while relying on thermal capacitance to buffer against temperature increases, and then ramping back to a comparatively dark state. Intel's Turbo Boost 2.0 uses this approach to boost performance until the processor reaches its nominal temperature, relying on the heatsink's innate thermal capacitance. ARM's big.LITTLE employs four A15 cores until the thermal envelope is exceeded (anecdotally, after about 10 seconds), and then switches over to four lower-energy, lower-performance A9 cores. Computational Sprinting [3] carries this a step further, employing phase-change materials that allow chips to exceed their sustainable thermal budget by an order of magnitude for several seconds, providing a short but substantial computational boost. These modes are especially useful for "race-to-finish" computations, such as web page rendering, for which response latency is important, or for which speeding the transition of both the processor and its support logic to a low-power state reduces energy consumption.
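A back-of-the-envelope model shows how thermal capacitance sets the sprint budget; every number below is an assumed illustration, not a figure from the cited designs:

```python
# Sprint duration: thermal capacitance buffers the power overshoot
# until the temperature headroom is used up. Illustrative values only.
C_TH = 30.0       # lumped thermal capacitance, J/K
HEADROOM = 20.0   # allowed temperature rise during a sprint, K
P_SUSTAIN = 5.0   # thermally sustainable power, W
P_SPRINT = 50.0   # power drawn while sprinting, W

sprint_s = C_TH * HEADROOM / (P_SPRINT - P_SUSTAIN)
print(f"~{sprint_s:.0f} s sprint at {P_SPRINT:.0f} W")   # ~13 s
# Phase-change materials raise the effective C_TH, stretching how long
# the chip may exceed its sustainable budget.
```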

4.3 The Specialized Horseman

The specialized horseman uses dark silicon to implement a host of specialized coprocessors, each either much faster or much more energy-efficient (100–1000×) than a general-purpose processor [6]. Execution hops among coprocessors and general-purpose cores, executing where it is most efficient. The unused cores are power- and clock-gated to keep them from consuming precious energy.

The promise of a future of widespread specialization is already being realized: we are seeing a proliferation of specialized accelerators that span diverse areas such as baseband processing, graphics, computer vision, and media coding. These accelerators enable orders-of-magnitude improvements in energy efficiency and performance, especially for computations that are highly parallel. Recent proposals [6, 13] have extrapolated this trend and anticipate that the near future will see systems comprising more coprocessors than general-purpose processors. In this paper, we term these systems Coprocessor-Dominated Architectures, or CoDAs.

As specialization usage grows to combat the dark silicon problem, we are faced with a modern-day specialization "tower-of-babel" crisis that fragments our notion of general-purpose computation and eliminates the traditional clear lines of communication between programmers, software, and the underlying hardware. Already, we see the deployment of specialized languages like CUDA that are not portable between similar architectures (e.g., AMD and Nvidia). We see over-specialization problems that cause accelerators to become inapplicable to closely related classes of computations (e.g., double-precision scientific codes running incorrectly on GPU graphics floating-point hardware). Adoption problems also occur due to the excessive costs of programming heterogeneous hardware (e.g., the slow uptake of the Sony PS3 versus the Xbox). Specialized hardware also risks obsolescence as standards are revised (e.g., JPEG standard revisions).

Insulating Humans from Complexity. These factors speak to potential exponential increases in design, verification, and programming effort for these CoDAs. Combating the tower-of-babel problem requires that we define a new paradigm for how specialization is expressed and exploited in future processing systems. We need new scalable architectural schemas that employ pervasively specialized hardware to minimize energy and maximize performance while at the same time insulating the hardware designer and the programmer from the underlying complexity of such systems.

Overcoming Amdahl-Imposed Limits on Specialization. Amdahl's Law provides an additional roadblock for specialization. To save energy across the majority of the computation, we need to find broad-based specialization approaches that apply both to regular, parallel code and to irregular code. And we need to ensure that communicating specialized processors do not fritter their energy savings away on costly cross-chip communication or shared memory accesses. One such CoDA-based system that targets both irregular and regular code is the UCSD GreenDroid processor [7, 8], a mobile application processor that implements Android mobile environment hotspots using hundreds of specialized cores called conservation cores, or c-cores [6, 11]. C-cores are automatically generated from C/C++ source code, and support a patching mechanism that allows them to track software changes. They attain an estimated ~8–10× energy-efficiency improvement, at no loss in serial performance, even on non-parallel code, and without any user or programmer intervention.
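Amdahl's Law applied to energy makes the coverage requirement concrete; the coverage and efficiency numbers below are assumed for illustration:

```python
# Amdahl's Law applied to energy: if specialized logic covers a
# fraction f of execution with an s-fold efficiency gain, overall
# energy scales as f/s + (1 - f). Coverage values are illustrative.
def overall_gain(f, s):
    return 1.0 / (f / s + (1.0 - f))

for f in (0.5, 0.9, 0.99):
    print(f"coverage {f:.0%}: {overall_gain(f, 1000):5.1f}x overall")
# Even a 1000x-efficient accelerator yields only ~2x overall at 50%
# coverage, hence the need for broad-based approaches such as c-cores.
```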

Unlike NTV processors, c-cores need not find additional parallelism in the workload to cover a serial performance loss. Thus, c-cores are likely to work across a wider range of workloads, including collections of serial programs. However, for highly parallel workloads where execution time is loosely concentrated, NTV processors may hold an area advantage due to their reconfigurability. Other specialized processors, such as Wisconsin's DySER [19] and Michigan's BERET [16], propose alternative architectures to c-cores that exploit specialization, trading energy savings for improved reconfigurability. Recent efforts have also examined the use of approximate computing [10] as an elegant way to package programmability, reconfigurability, and specialization.

4.4 The Deus Ex Machina Horseman

Of the four horsemen, this is by far the most unpredictable. Deus Ex Machina refers to a plot device in literature or theatre in which the protagonists seem utterly doomed, and then something completely unexpected and unforeshadowed comes out of nowhere to save the day. For dark silicon, one Deus Ex Machina would be a breakthrough in semiconductor devices. However, as we shall see, the required breakthroughs would have to be quite fundamental; most likely, they would require us to build transistors out of devices other than MOSFETs. Why? MOSFET leakage is set by fundamental principles of device physics: subthreshold slope is limited to 60 mV/decade at room temperature, corresponding to a 10× reduction in leakage for every 60 mV that the threshold voltage sits above Vss, a limit determined by the properties of thermionic emission of carriers across a potential well. Thus, although innovations like Intel's FinFET/TriGate transistor and high-K dielectrics represent significant achievements in keeping subthreshold slope close to its historical values, they still remain within the scope of the MOSFET-imposed limits, and are one-time improvements rather than scalable changes.
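The 60 mV/decade limit translates directly into a leakage budget, as this small sketch shows:

```python
# MOSFET subthreshold leakage falls ~10x for every 60 mV of threshold
# margin at room temperature (the thermionic 60 mV/decade limit).
SLOPE_MV_PER_DECADE = 60.0

def leakage_factor(margin_mv):
    """Relative subthreshold leakage vs. a zero-margin transistor."""
    return 10 ** (-margin_mv / SLOPE_MV_PER_DECADE)

for margin in (60, 120, 180, 300):
    print(f"{margin:3d} mV margin -> {leakage_factor(margin):.0e}x leakage")
# Lowering Vdd (and hence the margin) raises leakage exponentially,
# which is why threshold and operating voltages stopped scaling.
```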

Two VLSI candidates that bypass these limits, because they are not based on thermionic emission, are Tunnel Field-Effect Transistors (TFETs) [2], which are based on tunneling effects, and Nano-Electro-Mechanical (NEM) switches [5], which are based on physical switches. TFETs are reputed to have subthreshold slopes on the order of 30 mV/decade (twice as good as the ideal MOSFET) but have lower on-currents than MOSFETs, limiting their use in high-performance circuits. NEM switches have an essentially near-zero subthreshold slope but slow switching times. Both hint at orders-of-magnitude improvements in leakage, but remain to be tamed from the wild and integrated into real chips.

Realizing the importance of pushing on this front, DARPA has developed the recent $194 million STARnet program, which is funding three key directions for "beyond-CMOS" approaches: 1) developing electron-spin-based memory and computation devices (C-SPIN), 2) formulating new information-processing models that can leverage statistical (i.e., non-deterministic) beyond-CMOS devices (SONIC), and 3) creating new devices that extend prior work on TFETs to operate at even lower voltages (LEAST).


5 Design Principles for Dark Silicon

In this section, we seek to distill a set of design principles for dark silicon: rules of thumb that guide our existing designs along an evolutionary path to become increasingly "dark-silicon friendly."

1. When scaling existing designs, first decide how to spend the 1.4× energy-efficiency increase. As a baseline, chip capabilities will scale with energy, whether it is allocated to frequency or to more cores. You may increase or decrease frequency and/or transistor counts, but transistors switched per unit time can only increase by 1.4×.

2. Determine, for your domain, how to trade mostly-dark area for energy. If die area is fixed, any scaling is going to leave a surplus of transistors. Which combination of the four horsemen is most effective in your domain? Should you go dim (more caches? underclocked arrays of cores? NTV on top of that?), add accelerators or conservation cores, use new kinds of devices, or shrink your chip?

3. The future will be increasingly less pipelined. Pipelining increases duty cycle and introduces additional capacitance in circuits (registers, prediction circuits, bypassing, and clock-tree fanout), neither of which is dark-silicon friendly. For parallel domains, we will instead see more replicated cores running at lower frequencies, because this reduces the capacitive overhead. Furthermore, pipelining exacerbates the gap between processing and memory.

4. The future will have less architectural multiplexing, or logic sharing. Sharing introduces additional energy consumption because it requires sharers to have longer wires to the shared logic, and it introduces additional performance and energy overheads from the control logic that manages the sharing. For example, architectures that share physical pipelines among repositories of non-shared state (such as large-scale multithreading) pay large wire capacitances inside these memories to share that state. As area gets cheaper, it will make less sense to pay these overheads, and the degree of sharing will decrease so that the energy cost of pulling state out of these state repositories is reduced.

5. Allocating shared data to a shared RAM is still a good idea. If threads truly share data, then a shared RAM is often still more efficient than coherence protocols or other schemes.

6. Avoid investing in architectural techniques for saving transistors, unless they also improve energy efficiency. Transistors are getting exponentially cheaper, and we cannot use them all at once. Why are we trying to save them? Locally, transistor-saving optimizations make sense, but there is an exponential wind blowing against these optimizations in the long run.


7. Power rails are the new clocks. Ten years ago, it was a big step to move beyond a few clock domains. Now, chips can have hundreds of clock domains, each with its own clock gates. With dark silicon, we will see the same effect with power rails: chips will have hundreds and maybe thousands of power rails in the future, each with its own gates, to manage leakage for the many heterogeneous system components.

8. Heterogeneity results from the shift from a 1-D objective function (performance) to a 2-D objective function (performance and energy). The past lacked heterogeneity because designs were largely measured along a single axis, performance; to first order, there was a single winner. Now that performance and energy are both important, a Pareto curve trades off performance against energy, and there is no one optimal device across that curve; there are many winners. Optimal designs will incorporate several points along these curves.

6 Insights from the Brain: A Dark Technology

Perhaps one promising indicator that low-duty-cycle "dark technology" can be mastered, unlocking new application domains, is the efficiency and density of brains. The brain, even today, can perform many tasks that computers cannot, especially vision-related tasks. With 80 billion neurons and 100 trillion synapses operating at less than 100 mV, it is an existence proof of highly parallel, reliable, and highly dark operation, and it embodies three of the horsemen: dim, specialized, and deus ex machina. Neurons operate with extremely low duty cycles compared to processors, firing at best at a kilohertz. Although computing using silicon-simulated neurons introduces excessive "interpretive" overheads (neurons and transistors have fundamentally different properties), the brain can offer us insight and long-term ideas about how we can redesign systems for the extremely low duty cycles and low voltages called for by dark silicon.

1. Specialization. As with the specialized horseman, different groups of neurons serve different functions in cognitive processing, connect to different sensory organs, and boast reconfiguration, evolving neuronal connections customized to the computation over time.

2. Very dark operation. Neurons fire at a maximum rate of approximately 1,000 switches per second. Compare this to ALU transistors that toggle three billion times per second. The most active neuron's activity is a millionth that of the processing transistors in today's processors.

3. Low-Voltage Operation. Brain cells operate at approximately 100 mV, yielding CV² energy savings of 100× versus 1-V operation, in a clear parallel to the dim horseman's NTV circuits. Communication is low-swing and low-voltage, saving large amounts of energy.


4. Limited sharing and memory multiplexing. Since any given neuron can switch only 1,000 times per second, it must by definition have extremely limited sharing, since a point of multiplexing would be a bottleneck in parallel processing. The human visual system starts with 6M cones in the retina, similar to a two-megapixel display, processes the image with local neurons, and then sends it down the 1M-neuron optic nerve to the visual cortex. There is no central memory store; each pixel has a set of its own ALUs, so to speak, so energy waste due to multiplexing is minimal.

5. Data decimation. The human brain reduces the data size at each step and operates on concise but approximate representations. If two megapixels suffice to handle color-related vision tasks, why use more than that? Larger sensors would just require more neurons to store and compute on the data. We should ensure that we are processing no more data than necessary to achieve the final outcome. Increasingly, researchers have begun to formalize approximate computation as a way of handling dark silicon.

6. Analog Operation. The neuron performs a more complex basic operation than the typical digital transistor. On the input side, neurons combine information from many other neurons; on the output side, although neuron firing is a rail-to-rail digital pulse, neurons encode multiple bits of information via spike timings. Could there be more efficient ways to map operations onto silicon-based technologies? In RF wireless front-end communications, analog processing enables computations that would be impossible to do at speed digitally. But analog techniques may not scale well to deep-nanometer technology.

7. Fast, Static Scatter/Gather. Neurons have a fanout/fanin of roughly 7,000 to other neurons located significant distances away. Effectively, they perform efficient scatter/gather operations across a large number of static "memory locations" at low cost. Do more efficient ways exist for implementing these operations in silicon? They would presumably be useful for computations that operate on finite-sized, static graphs.

Conclusion

Although silicon is getting darker, for researchers the future is bright and exciting; dark silicon will cause a transformation of the computational stack, and will provide many opportunities for investigation.

References

[1] E. Krimer et al. "Synctium: A near-threshold stream processor for energy-constrained parallel applications." IEEE Computer Architecture Letters, Jan. 2010.

[2] A. Ionescu et al. "Tunnel field-effect transistors as energy-efficient electronic switches." Nature, Nov. 2011.

[3] A. Raghavan et al. "Computational sprinting." In HPCA, Feb. 2012.

[4] D. Fick et al. "Centip3De: A 3930 DMIPS/W configurable near-threshold 3D stacked system with 64 ARM Cortex-M3 cores." In ISSCC, Feb. 2012.

[5] F. Chen et al. "Demonstration of integrated micro-electro-mechanical switch circuits for VLSI applications." In ISSCC, Feb. 2010.

[6] G. Venkatesh et al. "Conservation cores: Reducing the energy of mature computations." In ASPLOS, 2010.

[7] Goulding et al. "GreenDroid: A mobile application processor for a future of dark silicon." In HOTCHIPS, 2010.

[8] Goulding-Hotta et al. "The GreenDroid mobile application processor: An architecture for silicon's dark future." IEEE Micro, Mar. 2011.

[9] H. Esmaeilzadeh et al. "Dark silicon and the end of multicore scaling." SIGARCH Comput. Archit. News, June 2011.

[10] H. Esmaeilzadeh et al. "Neural acceleration for general-purpose approximate programs." In MICRO, 2012.

[11] J. Sampson et al. "Efficient complex operators for irregular codes." In HPCA, 2011.

[12] R. Merritt. "ARM CTO: Power surge could create 'dark silicon'." EE Times, Oct. 2009.

[13] N. Hardavellas et al. "Toward dark silicon in servers." IEEE Micro, 2011.

[14] R. Dennard et al. "Design of Ion-Implanted MOSFET's with Very Small Physical Dimensions." JSSC, Oct. 1974.

[15] R. Dreslinski et al. "Near-threshold computing: Reclaiming Moore's law through energy efficient integrated circuits." Proceedings of the IEEE, Feb. 2010.

[16] S. Gupta et al. "Bundled execution of recurring traces for energy-efficient general purpose processing." In MICRO, 2011.

[17] S. Jain et al. "A 280 mV-to-1.2 V wide-operating-range IA-32 processor in 32 nm CMOS." In ISSCC, Feb. 2012.

[18] M. Taylor. "Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse." In Design Automation Conference, 2012.

[19] V. Govindaraju et al. "Dynamically specialized datapaths for energy efficient computing." In HPCA, 2011.

[20] W. Huang et al. "Scaling with design constraints: Predicting the future of big chips." IEEE Micro, July–Aug. 2011.

Bio: Michael Taylor is an Associate Professor at the University of California, San Diego. He can be found at http://darksilicon.org.
