A Landscape of the New Dark Silicon Design Regime



    Michael B. Taylor, University of California, San Diego


    Due to the breakdown of Dennard scaling, the percentage of a silicon chip that can switch at full frequency is dropping exponentially with each process generation. This utilization wall forces designers to ensure that, at any point in time, large fractions of their chips are effectively dark silicon, i.e., significantly underclocked or idle for large periods of time.

    As exponentially larger fractions of a chip's transistors become dark, silicon area becomes an exponentially cheaper resource relative to power and energy consumption. This shift is driving a new class of architectural techniques that "spend" area to "buy" energy efficiency. All of these techniques seek to introduce new forms of heterogeneity into the computational stack.

    This paper examines four key approaches (the four horsemen) that have emerged as top contenders for thriving in the dark silicon age. Each class carries with its virtues deep-seated restrictions that require a careful understanding of the underlying tradeoffs and benefits. Further, we propose a set of dark silicon design principles, and examine how one of the darkest computing architectures of all, the human brain, trades off energy and area in ways that provide potential insights into future directions for computer architecture.

    Keywords: Dark Silicon, Multicore, Utilization Wall, Four Horsemen

    1 Introduction

    Recent VLSI technology trends have led to a new disruptive regime for digital chip designers, where Moore's Law continues but CMOS scaling provides increasingly diminished fruits. As in prior years, the computational capabilities of chips are still increasing by 2.8× per process generation; but a utilization wall[6] limits us to only 1.4× of this benefit, causing large underclocked swaths of silicon area, hence the term dark silicon[12, 7].

    Digital Object Identifier 10.1109/MM.2013.90 0272-1732/$26.00 © 2013 IEEE

    This article has been accepted for publication in IEEE Micro but has not yet been fully edited. Some content may change prior to final publication.

    Fortunately, simple scaling theory makes these numbers easy to derive, allowing us to think intuitively about the problem. Transistor density continues to improve by 2× every two years, and native transistor speeds improve by 1.4×. But transistor energy efficiency improves by only 1.4×, which, under constant power budgets, causes a 2× shortfall in the energy budget needed to power a chip at its native frequency. Therefore, our utilization of a chip's potential is dropping exponentially, by a jaw-dropping 2× per generation. Thus, if we are just bumping up against the dark silicon problem in the last generation's designs, then in eight years, designs will be 93.75% dark!
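    The 93.75% figure follows from compounding the 2× per-generation utilization shortfall over four two-year generations. The short Python sketch below reproduces that arithmetic; the function name `dark_fraction` and the assumption that the chip is fully utilized at generation zero are illustrative choices, not part of the original analysis.

```python
def dark_fraction(generations: int, shortfall_per_gen: float = 2.0) -> float:
    """Fraction of a chip that must stay dark after `generations` process steps.

    Assumption: utilization is 100% at generation 0 and is divided by
    `shortfall_per_gen` with every subsequent process generation.
    """
    utilization = 1.0 / (shortfall_per_gen ** generations)
    return 1.0 - utilization


if __name__ == "__main__":
    years_per_generation = 2  # assumed cadence, matching the text
    for years in (2, 4, 6, 8):
        gens = years // years_per_generation
        print(f"{years} years ({gens} generations): {dark_fraction(gens):.2%} dark")
```

    Under these assumptions, the eight-year (four-generation) entry is the 93.75% quoted in the text.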

    A recent paper[18] refers to this widespread disruptive factor informally as the dark silicon apocalypse, because it officially marks the end of one reality (Dennard Scaling)[14], where progress could be measured by improvements in transistor speed and count, and the beginning of a new reality (post-Dennard Scaling), where progress is measured by improvements in transistor energy efficiency. Previously, we tweaked our circuits to reduce transistor delays and turbo-charged them with dual-rail domino to reduce FO4 delays. Now, we will tweak our circuits to minimize capacitance switched per function; we will strip our circuits down and starve them of voltage to squeeze out every femtojoule. Where once we would spend exponentially increasing quantities of transistors to buy performance, now we will spend these transistors to buy energy efficiency.

    The CMOS scaling breakdown was the direct cause of industry's transition to multicore in 2005. Because filling chips with cores does not fundamentally circumvent utilization wall limits, multicore is not the final solution to dark silicon[7]; it is merely industry's initial, liminal response to the shocking onset of the dark silicon age. Increasingly over time, the semiconductor industry is adapting to this new design regime.

    Due to the breakdown of Dennard Scaling, multicore chips will not scale as transistors shrink; the fraction of a chip that can be filled with cores running at full frequency is dropping exponentially with each process generation[7, 6]. This reality forces designers to ensure that, at any point in time, large fractions of their chips are effectively dark: either idle for long periods of time or significantly underclocked. As exponentially larger fractions of a chip's transistors become darker, silicon area becomes an exponentially cheaper resource relative to power and energy consumption. This shift calls for new architectural techniques that "spend" area to "buy" energy efficiency. Optionally, we can then apply the saved energy to increase performance, to extend battery life, or to lower operating temperatures.

    In this paper, we examine the landscape that is coming to light in the dark silicon regime. We start by recapping the utilization wall that is the cause of dark silicon, and continue by examining why the multicore response to the utilization wall is inherently limited[7, 13, 9]. We examine four recently proposed approaches, which we refer to as the four horsemen[18], that emerge as solutions as we transition beyond the initial multicore stop-gap solution. Looking beyond the four horsemen, we propose a set of design principles for dark silicon, and show how the brain, an inherently dark computing device, offers new directions beyond these principles.



    [Figure 1 panels. Example: 65 nm to 32 nm ($S = 2$), showing the spectrum of tradeoffs between number of cores and frequency. At 65 nm: 4 cores @ 1.8 GHz. At 32 nm: either 4 cores @ 2 × 1.8 GHz (12 cores dark), or 2 × 4 cores @ 1.8 GHz (8 cores dark, 8 dim; industry's choice).]

    Figure 1: Multicore scaling leads to large amounts of dark silicon. (From [7].)

    2 The Utilization Wall that causes Dark Silicon

    Table 1 shows the derivation of the utilization wall[6] that causes dark silicon[12, 7]. We employ a scaling factor, $S$, which is the ratio between the feature sizes of two processes (e.g., $S = 32/22 \approx 1.4\times$ between 32 and 22 nm). In both Dennard and post-Dennard scaling, transistor count scales by $S^2$, and transistor switching frequency scales by $S$. Thus our net increase in compute performance is $S^3$, or 2.8×.

    However, to maintain a constant power envelope, these gains must be offset by a corresponding reduction in transistor switching energy. In both cases, scaling reduces transistor capacitance by $S$, improving energy efficiency by $S$. In Dennardian scaling, we can scale the threshold voltage and thus the operating voltage, which yields another $S^2$ energy-efficiency improvement. However, in today's post-Dennardian, leakage-limited regime, we cannot scale threshold voltage without exponentially increasing leakage, and as a result, we must hold operating voltage roughly constant. The end result is a shortfall of $S^2$, or 2× per process generation. This is an exponentially worsening problem that multiplies with each process generation.

    This shortfall prevents multicore from being the solution to scaling[7, 6]. Although we have enough transistors to increase core count by 2×, and frequency could be 1.4× faster, the energy budget permits only a 1.4× improvement. Per Figure 1, across two process generations ($S = 2$), we can either increase core count by 2×, or frequency by 2×, or some middle ground between the two. The remaining 4× potential goes unused. This 4× reflects itself in darker silicon: either fewer cores or lower-frequency cores, depending on our preference.
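    The spectrum of choices in Figure 1 can be enumerated with a few lines of Python. This is only a sketch of the first-order model: it assumes power is proportional to (active cores × frequency) at constant voltage, and the function and key names are hypothetical.

```python
def design_points(base_cores: float, base_ghz: float, s: float) -> dict:
    """Post-Dennard options after scaling feature size by a factor s.

    Area permits s**2 more cores and devices could switch s times faster,
    but a fixed power budget pays for only an s-fold increase in
    (active cores x frequency).
    """
    max_cores = base_cores * s ** 2        # what the die area could hold
    max_ghz = base_ghz * s                 # what the transistors could reach
    budget = base_cores * base_ghz * s     # core-GHz the power budget allows
    return {
        "max_cores": max_cores,
        "max_ghz": max_ghz,
        # Ratio of Dennard-era potential to what the budget permits (S^2).
        "unused_potential": (max_cores * max_ghz) / budget,
        "options": {
            "more cores": (budget / base_ghz, base_ghz),
            "higher frequency": (base_cores, budget / base_cores),
            "middle ground": (base_cores * s ** 0.5, base_ghz * s ** 0.5),
        },
    }


if __name__ == "__main__":
    points = design_points(base_cores=4, base_ghz=1.8, s=2.0)
    print(f"unused potential: {points['unused_potential']:.1f}x")
    for name, (cores, ghz) in points["options"].items():
        dark = points["max_cores"] - cores
        print(f"{name}: {cores:.1f} cores @ {ghz:.2f} GHz, {dark:.1f} core slots dark")
```

    With the Figure 1 inputs (4 cores at 1.8 GHz, $S = 2$), the two extreme options correspond to the 8-core and 3.6-GHz design points in the figure, and the unused potential is the 4× discussed above.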

    More positively stated, the true new potential of Moore's Law is a 1.4× energy-efficiency improvement per generation, which can be used to increase performance by 1.4×. Additionally, if we can somehow make use of the 2× more dark silicon, we can do even better.

    2.1 Dark Silicon is Real: Calibrating the Utilization Wall

    A quick survey of recent designs, such as Tilera's TileGx and chips from Intel and AMD, indicates that industry has pursued core count and frequency combinations consistent with the utilization wall. The latest 2012 ITRS roadmap also shows that scaling has proceeded consistently with post-Dennard predictions. For instance, Intel's 90-nm single-core Prescott chip ran at 3.8 GHz in 2004. Dennard scaling would suggest that a 22-nm multicore version should run at 15.5 GHz and contain 17 superscalar cores, for a total improvement of 69× in instruction throughput. Instead, the upcoming 2013 22-nm Intel Core i7 4960X runs at 3.6 GHz and has 6 superscalar cores, a 5.7× peak serial instruction throughput improvement. The darkness ratio is thus 91.74%, versus the 93.75% predicted by the utilization wall.
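    The calibration numbers above can be re-derived from the first-order model. The following Python sketch is illustrative rather than the author's own calculation: the helper names are hypothetical, and it assumes the 90-nm to 22-nm transition spans four process generations, so the utilization wall predicts a $2^4$-fold shortfall.

```python
def dennard_projection(base_ghz: float, base_nm: float, target_nm: float) -> dict:
    """Ideal Dennard scaling of a single-core design between two nodes."""
    s = base_nm / target_nm
    return {"S": s, "ghz": base_ghz * s, "cores": s ** 2}


def darkness_ratio(actual_gain: float, projected_gain: float) -> float:
    """Fraction of the Dennard-projected throughput that went unrealized."""
    return 1.0 - actual_gain / projected_gain


def predicted_darkness(generations: int) -> float:
    """Utilization-wall prediction: utilization halves every generation."""
    return 1.0 - 0.5 ** generations


if __name__ == "__main__":
    proj = dennard_projection(base_ghz=3.8, base_nm=90, target_nm=22)
    ghz = round(proj["ghz"], 1)         # rounded as in the text
    cores = round(proj["cores"])        # rounded as in the text
    projected = round(cores * ghz / 3.8)
    actual = round((3.6 * 6) / (3.8 * 1), 1)
    print(f"Dennard projection: {cores} cores @ {ghz} GHz, {projected}x throughput")
    print(f"Actual gain: {actual}x")
    print(f"Darkness ratio: {darkness_ratio(actual, projected):.2%}")
    print(f"Utilization wall (4 generations, assumed): {predicted_darkness(4):.2%}")
```

    Note that the 91.74% in the text is obtained from the rounded intermediate values (5.7× and 69×); carrying full precision through the calculation shifts the result slightly.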

    Although the utilization wall is based on a first-order model that simplifies many factors, it has proved to be an effective tool for designers to gain intuition about the future. Follow-on work[9, 13, 20] has looked at extending this early work on dark silicon and multicore scaling[7, 6] with more sophisticated models that incorporate factors such as application space and cache size.

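    Transistor Property              Dennard      Post-Dennard
    ----------------------------------------------------------
    Quantity ($Q$)                   $S^2$        $S^2$
    Frequency ($F$)                  $S$          $S$
    Capacitance ($C$)                $1/S$        $1/S$
    $V_{dd}^2$                       $1/S^2$      $1$
    => Power $= Q F C V_{dd}^2$      $1$          $S^2$
    => Utilization $= 1/$Power       $1$          $1/S^2$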

    Table 1: Dennard vs. Post-Dennard (leakage-limited) scaling. In contrast to Dennard scaling, which held until 2005[14], under the post-Dennard regime the total chip utilization for a fixed power budget drops by $S^2$ with each process generation. The result is an exponential increase in dark silicon for a fixed-sized chip under a fixed area budget. From [6].
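    The rows of Table 1 can be checked mechanically. The sketch below multiplies out the per-generation scaling factors for both regimes; it uses exact fractions so the Dennard case comes out to exactly 1, and the function name and dictionary keys are illustrative assumptions.

```python
from fractions import Fraction


def scaling_factors(s: Fraction, dennard: bool) -> dict:
    """Per-generation scaling of each Table 1 row for a feature-size ratio s."""
    q = s ** 2                                # transistor quantity
    f = s                                     # switching frequency
    c = 1 / s                                 # capacitance per transistor
    v2 = 1 / s ** 2 if dennard else Fraction(1)  # Vdd^2 (held constant post-Dennard)
    power = q * f * c * v2                    # Power = Q * F * C * Vdd^2
    return {"Q": q, "F": f, "C": c, "Vdd^2": v2,
            "power": power, "utilization": 1 / power}


if __name__ == "__main__":
    s = Fraction(7, 5)  # roughly 1.4x, i.e. one process generation
    for regime, flag in (("Dennard", True), ("Post-Dennard", False)):
        row = scaling_factors(s, dennard=flag)
        print(f"{regime}: power x{float(row['power']):.2f}, "
              f"utilization x{float(row['utilization']):.2f}")
```

    For a fixed power budget, the post-Dennard utilization entry is the $1/S^2$ (roughly one half per generation) that defines the utilization wall.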

    3 Misconceptions of Dark Silicon

    Let us clear up a few misconceptions before proceeding. First, dark silicon does not mean blank, useless or unused silicon; it is just silicon that is not used all the time, or at its full frequency. Even during the best days of CMOS scaling, microprocessors and other circuits were chock full of "dark logic" that is used only for some applications and/or only infrequently. For instance, caches are inherently dark because the average cache transistor is switched in far fewer than one percent of cycles, and FPUs remain dark in integer codes.

    Soon, the exponential growth of dark silicon area will push us beyond logic targeted for direct performance benefits towards swaths of low-duty-cycle logic that exists not for direct performance benefit, but for improving energy efficiency. This improved energy efficiency can then allow an indirect performance improvement because it frees up more of the fixed power budget to be used for even more computation.

    4 The Four Horsemen

    We now examine the four horsemen: four promising approaches to dealing with dark silicon. These responses originally appeared to be unlikely candidates, carrying unwelcome burdens in design, manufacturing, or programming. None is ideal from an aesthetic engineering point of view. But the success of complex multi-regime devices like MOSFETs has taught us that engineers can tolerate complexity if the end result is better. We suspect that future chips will employ not just one horseman, but all of them, in interesting and unique combinations.

    4.1 The Shrinking Horseman

    When confronted with the possibility of dark silicon, many chip designers insist: "Area is expensive. Chip designers will just build smaller chips instead of having dark silicon in their designs!" Among the four horsemen, shrinking chips are the most pessimistic outcome. Although all chips may eventually shrink somewhat, the ones that shrink the most will be those for which dark silicon cannot be applied fruitfully to improve the product. These chips will rapidly turn into low-margin businesses for which further generations of Moore's Law provide small benefit. We continue by examining a spectrum of second-order effects associated with shrinking chips.

    Cost Side of Shrinking Silicon. Understanding shrinking chips calls for us to consider semiconductor economics. The "build smaller chips" argument has a ring of truth; after all, designers spend much of their time trying to meet area budgets for existing chip designs. But exponentially smaller chips are not exponentially cheaper: even if silicon begins as 50% of system cost, after a few process generations it will be a tiny fraction. Mask costs, design costs, and I/O pad area will fail to be amortized, leading to rising costs per mm^2 of silicon that ultimately will eliminate incentives to move the design to the next process generation. These designs will be left behind on older generations.

    Revenue Side of Shrinking Silicon. Shrinking silicon can also shrink the chip selling price. In a competitive market, if there is a way to use the next process generation's bounty of dark silicon to attain a benefit to the end product, then competition will force companies to do it. Otherwise, they will generally be forced into low-end, low-margin, high-competition markets, and their competitors will take the high end and enjoy high margins. Thus, in scenarios where dark silicon can be used profitably, decreasing area in lieu of exploiting it would certainly decrease system costs, but would catastrophically decrease sale price. The shrinking chips scenario is therefore likely to happen only if we can find no practical use for dark silicon.

    Power and Packaging Issues with Shrinking Chips. A major consequence of exponentially shrinking chips is a corresponding exponential rise in power density. Recent analysis of manycore thermal characteristics[20] has shown that peak hotspot temperature rise can be modeled as $T_{max} = TDP \times (R_{conv} + k/A)$, where $T_{max}$ is the rise in temperature, $TDP$ is the target chip thermal design power, $R_{conv}$ is the heatsink thermal convection resistance (lower is a better heatsink), $k$ incorporates manycore design properties, and $A$ is chip area. If area drops exponentially, then the second term dominates and chip temperatures rise exponentially. This in turn will force a lower TDP so that temperature limits are met, and will reduce scaling below even the nominal 1.4× expected energy-efficiency gain. Thus, if thermals drive your shrinking chip strategy, it is much better to hold your frequency constant and increase cores by 1.4×, with a net area decrease of 1.4×, than it is to increase your frequency by 1.4× and shrink your chip by 2×.
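    The effect of the $k/A$ term can be seen by evaluating the hotspot model for the two shrinking strategies just compared. In this Python sketch every numeric value (TDP, $R_{conv}$, $k$, starting area) and the units attached to them are made-up placeholders chosen for illustration; only the functional form comes from the model quoted above.

```python
def hotspot_rise(tdp_w: float, r_conv: float, k: float, area_mm2: float) -> float:
    """Peak hotspot temperature rise: T_max = TDP * (R_conv + k / A)."""
    return tdp_w * (r_conv + k / area_mm2)


def max_tdp(rise_limit: float, r_conv: float, k: float, area_mm2: float) -> float:
    """Largest TDP that keeps the hotspot rise within `rise_limit`."""
    return rise_limit / (r_conv + k / area_mm2)


if __name__ == "__main__":
    # Hypothetical parameters, not taken from the cited analysis.
    tdp, r_conv, k, area = 100.0, 0.2, 20.0, 200.0

    baseline = hotspot_rise(tdp, r_conv, k, area)
    # Strategy A: same frequency, 1.4x more cores, area shrinks by 1.4x.
    rise_a = hotspot_rise(tdp, r_conv, k, area / 1.4)
    # Strategy B: 1.4x higher frequency, same cores, area shrinks by 2x.
    rise_b = hotspot_rise(tdp, r_conv, k, area / 2.0)

    print(f"baseline rise: {baseline:.1f}")
    print(f"strategy A rise: {rise_a:.1f}, sustainable TDP: "
          f"{max_tdp(baseline, r_conv, k, area / 1.4):.1f} W")
    print(f"strategy B rise: {rise_b:.1f}, sustainable TDP: "
          f"{max_tdp(baseline, r_conv, k, area / 2.0):.1f} W")
```

    Both strategies spend the same power budget in this sketch, so the smaller die of strategy B runs hotter and must give back more TDP to stay within the same temperature limit.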

    4.2 The Dim Horseman

    As exponentially larger fractions of a chip's transistors become dark transistors, silicon area becomes an exponentially cheaper resource relative to power and energy consumption. This shift calls for new architectural techniques that "spend" area to "buy" energy efficiency. If we move past unhappy thoughts of shrinking silicon and consider populating dark silicon area with logic that we only use part of the time, then we have two choices: do we try to make the new logic general-purpose, or special-purpose? In this section, we examine low-duty-cycle alternatives that try to retain general applicability across many applications; in the next section we consider specialized logic.

    We employ the term dim silicon[11, 20] to refer to general-purpose logic that is heavily underclocked or used infrequently to meet the power budget; that is, the architecture strategically manages the chip-wide transistor duty cycle to enforce the overall power constraint. Whereas early 90-nm designs like Cell and Prescott were dimmed because actual power exceeded design-estimated power, we are converging on increasingly elegant methods that make better tradeoffs.

    Dim silicon techniques include dynamically trading off the number of active cores against their frequency, scaling up the amount of cache logic, employing near-threshold voltage (NTV) processor designs, and redesigning the architecture to accommodate bursts that temporarily allow the power budget to be exceeded, such as Turbo Boost and Computational Sprinting.

    Turbo Boost 1.0. Although first-generation multicores had a ship-time-determined top frequency that was invariant of the number of currently active cores, Intel's Turbo Boost 1.0 enabled the next generation to make real-time tradeoffs between active core count and the frequency the cores ran at: the fewer the cores, the higher the frequency.

    Near-Threshold Voltage Processors. One recently emerging approach is near-threshold voltage (NTV) logic[15], which operates transistors in the near-threshold regime, providing more palatable tradeoffs between energy and delay than prior subthreshold circuits.

    Recently, researchers have explored wide-SIMD NTV processors[1], which seek to exploit data parallelism, as well as NTV many-core processors[4] and an NTV x86 processor[17].

    Although NTV per-processor performance drops faster than the corresponding savings in energy per instruction (a 5× energy improvement for an 8× performance cost), the performance loss can be offset by using 8× more processors in parallel if the workload allows it. Then, an additional 5× processors could turn the energy-efficiency gains into additional performance. So with ideal parallelization, NTV could offer a 5× throughput improvement by absorbing 40× the area. But this would also require 40× more free parallelism in the workload relative to the parallelism consumed by an equivalent energy-limited super-threshold manycore processor.
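    The 5×, 8× and 40× figures are linked by simple arithmetic, sketched below in Python. The function name and returned keys are hypothetical, and the model assumes ideal parallel scaling with no interconnect or variation overheads.

```python
def ntv_tradeoff(energy_gain: float = 5.0, perf_loss: float = 8.0) -> dict:
    """First-order NTV accounting relative to a super-threshold baseline core.

    energy_gain: reduction in energy per instruction at near-threshold voltage.
    perf_loss:   reduction in per-core performance at near-threshold voltage.
    """
    # Each NTV core runs perf_loss times slower and uses energy_gain times
    # less energy per instruction, so its power is this fraction of baseline.
    per_core_power = 1.0 / (perf_loss * energy_gain)
    cores_to_break_even = perf_loss                 # restores baseline throughput
    area_multiplier = perf_loss * energy_gain       # cores that fit the same power
    throughput_gain = area_multiplier / perf_loss   # ideal parallel speedup
    return {
        "cores_to_break_even": cores_to_break_even,
        "area_multiplier": area_multiplier,
        "throughput_gain": throughput_gain,
        "power_multiplier": area_multiplier * per_core_power,
    }


if __name__ == "__main__":
    for key, value in ntv_tradeoff().items():
        print(f"{key}: {value:g}")
```

    With the default arguments, the same power budget supports 40× the cores for a 5× throughput gain, which is why the workload must expose 40× more parallelism.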

    In practice, for many applications, 40× additional parallelism can be elusive. For chips with large power budgets that can already sustain hundreds of cores, applications that have this much spare parallelism are relatively rare. Interestingly, because of this effect, NTV's applicability across applications increases in low-energy environments, because the energy-limited baseline super-threshold design has consumed less of the available parallelism. And clearly, NTV becomes more applicable for workloads with extremely large amounts of parallelism.

    NTV presents a variety of circuit-related challenges that have seen active investigation, especially because technology scaling will exacerbate rather than ameliorate these factors. A significant NTV challenge has been susceptibility to process variability. As operating voltages drop, variation in transistor threshold due to random dopant fluctuation is proportionally higher, and leakage and operating frequency can vary greatly. Since NTV designs expand the area consumption by 8× or more, variation issues are exacerbated. Other challenges include the penalties involved in designing low-operating-voltage SRAMs and the increased interconnect energy consumption due to greater spreading across cores.

    Bigger Caches. An often-proposed dim-silicon alternative is to simply allocate otherwise-dark silicon area for caches. Because only a subset of cache transistors (e.g., a wordline) is accessed each cycle, cache memories have low duty cycles and thus are inherently dark. Compared to general-purpose logic, an L1 cache clocked at its maximum frequency can be about 10× darker per square mm, and larger caches can be even darker. Thus, adding cache is one way to simultaneously increase performance and lower power density per square mm. We can imagine, for instance, expanding per-core cache at a rate that soaks up the remaining dark silicon area: 1.4× to 2× more cache per core per generation. However, many applications do not benefit from additional cache, and upcoming TSV-integrated DRAM will reduce cache benefit.

    Computational Sprinting and Turbo Boost. Other techniques employ temporal dimness as opposed to spatial dimness, temporarily exceeding the nominal thermal budget but relying on thermal capacitance to buffer against temperature increases, and then ramping back to a comparatively dark state. Intel's Turbo Boost 2.0 uses this approach to boost performance up until the processor reaches nominal temperature, relying upon the heatsink's innate thermal capacitance. ARM's big.LITTLE employs four A15 cores until the thermal envelope is exceeded (anecdotally about 10 seconds), and then switches over to four lower-energy, lower-performance A7 cores. Computational Sprinting[3] carries this a step further, employing phase-change materials that allow chips to exceed their sustainable thermal budget by an order of magnitude for several seconds, providing a short but substantial computational boost. These modes are especially useful for "race to finish" computations, such as web page rendering, for which response latency is important, or for which speeding up the transition of both the processor and its support logic to a low-power state reduces energy consumption.

    [Figure 2 panels (a), (b), (c). Recoverable labels: CPU, L1, D$, I$, C (c-core), OCN; scale bars of 1 mm.]

    Figure 2: The GreenDroid architecture, an example of a Coprocessor-Dominated Architecture (CoDA). The GreenDroid Mobile Application Processor (a) is made up of 16 non-identical tiles. Each tile (b) holds components common to every tile (the CPU, on-chip network (OCN), and shared L1 data cache) and provides space for multiple c-cores of various sizes. (c) shows connections among these components and the c-cores.

    4.3 The Specialized Horseman

    The specialized horseman uses dark silicon to implement a host of specialized coprocessors, each either much faster or much more energy-efficient (100× to 1000×) than a general-purpose processor[6]. Execution hops among coprocessors and general-purpose cores, executing where it is most efficient. The unused cores are power- and clock-gated to keep them from consuming precious energy.

    The promise for a future of widespread specialization is already being realized: we are seeing a proliferation of specialized accelerators that span diverse areas such as baseband processing, graphics, computer vision, and media coding. These accelerators enable orders-of-magnitude improvements in energy efficiency and performance, especially for computations that are highly parallel. Recent proposals[6, 13] have extrapolated this trend and anticipate that the near future will see systems composed of more coprocessors than general-purpose processors. In this paper, we term these systems Coprocessor-Dominated Architectures, or CoDAs.

    As specialization usage grows to combat the dark silicon problem, we are faced with a modern-day specialization "tower-of-babel" crisis that fragments our notion of general-purpose computation and eliminates the traditional clear lines of communication that we have between programmers and software and the underlying hardware. Already, we see the deployment of specialized languages like CUDA that are not usable between similar architectures (e.g., AMD and Nvidia). We see over-specialization problems between accelerators that cause them to become inapplicable to closely related classes of computations (e.g., double-precision scientific codes running incorrectly on GPU graphics floating-point hardware). Adoption problems also occur due to the excessive costs of programming heterogeneous hardware (e.g., the slow uptake of the Sony PS3 versus the Xbox). Specialized hardware also risks obsolescence as standards are revised (e.g., a JPEG standard revision).

    Insulating Humans from Complexity. These factors speak to potential exponential increases in design, verification and programming effort for these CoDAs. Combating the tower-of-babel problem requires that we define a new paradigm for how specialization is expressed and exploited in future processing systems. We need new scalable architectural schemas that employ pervasively specialized hardware to minimize energy and maximize performance while at the same time insulating the hardware designer and the programmer from the underlying complexity of such systems.

    Overcoming Amdahl-Imposed Limits on Specialization. Amdahl's Law provides an additional roadblock for specialization. To save energy across the majority of the computation, we need to find broad-based specialization approaches that apply to both regular, parallel code and irregular code. We also need to ensure that communicating specialized processors do not fritter their energy savings away on costly cross-chip communication or shared memory accesses.

    One such CoDA-based system that targets both irregular and regular code is the UCSD GreenDroid processor[7, 8], a mobile application processor that implements Android mobile environment hotspots using hundreds of specialized cores called conservation cores, or c-cores[6, 11]. C-cores are automatically generated from C/C++ source code, and support a patching mechanism that allows them to track software changes. They attain an estimated improvement of roughly 8× to 10× in energy efficiency, at no loss in serial performance, even on non-parallel code, and without any user or programmer intervention.

    Unlike NTV processors, c-cores need not find additional parallelism in the workload to cover a serial performance loss. Thus, c-cores are likely to work across a wider range of workloads, including collections of serial programs. However, for highly parallel workloads where execution time is loosely concentrated, NTV processors may hold an area advantage due to their reconfigurability. Other specialized processors such as Wisconsin's DySER[19] and Michigan's BERET[16] propose alternative architectures to c-cores that exploit specialization, trading energy savings for improved reconfigurability. Recent efforts have also examined the use of approximate computing[10] as an elegant way to package programmability, reconfigurability and specialization.

    4.4 The Deus Ex Machina Horseman

    Of the four horsemen, this is by far the most unpredictable. Deus ex machina refers to a plot device in literature or theatre in which the protagonists seem utterly doomed, and then something completely unexpected and unforeshadowed comes out of nowhere to save the day. For dark silicon, one deus ex machina would be a breakthrough in semiconductor devices. However, as we shall see, the breakthroughs that would be required would have to be quite fundamental; in fact, they most likely would require us to build transistors out of devices other than MOSFETs. Why? MOSFET leakage is set by fundamental principles of device physics, and is limited to a sub-threshold slope of 60 mV/decade at room temperature, corresponding to a 10× reduction in leakage for every 60 mV that the threshold voltage is above $V_{ss}$; this limit is determined by properties of thermionic emission of carriers across a potential well. Thus, although innovations like Intel's FinFET/Tri-Gate transistor, high-K dielectrics, etc., represent significant achievements in maintaining sub-threshold slope close to its historical values, they still remain within the scope of the MOSFET-imposed limits and are one-time improvements rather than scalable changes.
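    The sub-threshold slope translates directly into how much leakage a given voltage margin buys. The Python sketch below applies the one-decade-per-slope relationship stated above; the function name is illustrative, and the model deliberately ignores every other leakage mechanism.

```python
def leakage_reduction(delta_v_mv: float, slope_mv_per_decade: float = 60.0) -> float:
    """Factor by which sub-threshold leakage drops for a given voltage margin.

    delta_v_mv:          how far the threshold voltage sits above V_ss, in mV.
    slope_mv_per_decade: sub-threshold slope; 60 mV/decade is the MOSFET limit
                         at room temperature quoted in the text.
    """
    return 10.0 ** (delta_v_mv / slope_mv_per_decade)


if __name__ == "__main__":
    for margin in (60, 120, 300):
        print(f"{margin} mV at 60 mV/decade: {leakage_reduction(margin):.0f}x less leakage")
    # A steeper (smaller) slope buys the same reduction with half the voltage.
    print(f"60 mV at 30 mV/decade: {leakage_reduction(60, 30.0):.0f}x less leakage")
```

    The second case shows why a device with a steeper slope matters: it reaches the same leakage reduction at a lower threshold voltage, which in turn would permit a lower operating voltage.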

    Two VLSI candidates that bypass these limits, because they are not based on thermal injection, are Tunnel Field-Effect Transistors (TFETs)[2], which are based on tunneling effects, and Nano-Electro-Mechanical switches (NEMS)[5], which are based on physical switches. TFETs are reputed to have sub-threshold slopes on the order of 30 mV/decade, twice as good as the ideal MOSFET, but have lower on-currents than MOSFETs, limiting their use in high-performance circuits. NEMS devices have essentially a near-zero sub-threshold slope but slow switching times. Both hint at orders-of-magnitude improvements in leakage, but remain to be tamed from the wild and integrated into real chips.

    Recognizing the importance of pushing on this front, DARPA has recently established the $194 million STARnet program, which funds three key directions for beyond-CMOS approaches: 1) developing electron-spin-based memory and computation devices (C-SPIN), 2) formulating new information processing models that can leverage statistical (i.e., non-deterministic) beyond-CMOS devices (SONIC), and 3) creating new devices that extend prior work on TFETs to operate at even lower voltages (LEAST).



    5 Design Principles for Dark Silicon

    In this section, we seek to distill a set of design principles for dark silicon: rules of thumb that guide our existing designs along an evolutionary path to become increasingly dark silicon friendly.

    1. When scaling existing designs, rst decide how to spend the 1.4energy-eciency increase. As a baseline, chip capabilities will scalewith energy, whether it is allocated to frequency or more cores. You mayincrease or decrease frequency and/or transistor counts, but transistorsswitched per unit time can only increase by 1.4.

    2. Determine, for your domain, how to trade mostly-dark area forenergy. If die area is xed, any scaling is going to have a surplus oftransistors. Which combination of the four horsemen is most eective inyour domain? Should you go dim (more caches? underclocked arrays ofcores? NTV on top of that?) Add accelerators or conservation cores? Usenew kinds of devices? Shrink your chip?

    3. The future will be increasingly less pipelined. Pipelining increasesduty cycle and introduces additional capacitance in circuits (registers, pre-diction circuits, bypassing and clock tree fanout) neither of which is darksilicon friendly. For parallel domains, we will instead see more replicatedcores running at lower frequencies because it reduces this capacitive over-head. Furthermore, pipelining exacerbates the gap between processingand memory.

    4. The future will have less architectural multiplexing, or logicsharing. Sharing introduces additional energy consumption because itrequires sharers to have longer wires to the shared logic, and introducesadditional performance and energy overheads from the control logic thatmanages the sharing. For examples, architectures that have repositoriesof non-shared state that share physical pipelines (e.g. such as large-scalemultithreading) pay large wire capacitances inside these memories to sharethat state. As area gets cheaper, it will make less sense to pay these over-heads, and the degree of sharing will decrease so that the energy cost ofpulling state out of these state repositories will be reduced.

    5. Allocating shared data to a shared RAM is still a good idea. If threads are truly sharing data, then a shared RAM is often still more efficient than coherence protocols or other schemes.

    6. Avoid investing in architectural techniques for saving transistors, unless they also improve energy efficiency. Transistors are getting exponentially cheaper, and we can't use them all at once. Why are we trying to save transistors? Locally, transistor-saving optimizations make sense, but there is an exponential wind blowing against these optimizations in the long run.


    Digital Object Identifier 10.1109/MM.2013.90 0272-1732/$26.00 2013 IEEE

    This article has been accepted for publication in IEEE Micro but has not yet been fully edited. Some content may change prior to final publication.

    7. Power rails are the new clocks. Ten years ago, it was a big step to move beyond a few clock domains. Now, chips can have hundreds of clock domains, all with their own clock gates. With dark silicon, we will see the same effect with power rails: we will have hundreds and maybe thousands of power rails in the future, with their own gates, to manage the leakage for the many heterogeneous system components.

    8. Heterogeneity results from the shift from a 1-D objective function (performance) to a 2-D objective function (performance and energy). The past lacked heterogeneity, because designs were largely measured according to a single axis: performance. To first order, there was a single winner. Now that performance and energy are both important, a Pareto curve trades off performance and energy, and there is no one optimal device across that curve; there are many winners. Optimal designs will incorporate several points across these curves.
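
    As a rough illustration of principle 1, the sketch below treats the 1.4× figure as a budget on transistors switched per unit time and reports what fraction of a scaled design could be active at once. The helper name max_active_fraction and the assumed per-generation gains of 2× in transistor count and 1.4× in native frequency are introduced here for illustration only (together they give the 2.8× capability growth quoted in the introduction); this is a back-of-the-envelope model under those assumptions, not a method taken from the paper.

```python
"""Back-of-the-envelope model of the 1.4x switching budget (illustrative only)."""

# From the text: transistors switched per unit time can grow by at most 1.4x.
ENERGY_BUDGET_GAIN = 1.4

# Assumed per-generation gains if nothing were power-limited (hypothetical inputs).
ASSUMED_TRANSISTOR_GAIN = 2.0
ASSUMED_FREQUENCY_GAIN = 1.4


def max_active_fraction(frequency_scale: float, transistor_scale: float) -> float:
    """Return the largest fraction of the scaled design that fits the budget.

    Switching demand is modeled as frequency_scale * transistor_scale,
    both expressed relative to the previous process generation.
    """
    if frequency_scale <= 0 or transistor_scale <= 0:
        raise ValueError("scale factors must be positive")
    demand = frequency_scale * transistor_scale
    return min(1.0, ENERGY_BUDGET_GAIN / demand)


if __name__ == "__main__":
    # Each option is (frequency scale, transistor-count scale).
    options = {
        "spend budget on frequency": (ASSUMED_FREQUENCY_GAIN, 1.0),
        "spend budget on more cores": (1.0, 1.4),
        "take both gains": (ASSUMED_FREQUENCY_GAIN, ASSUMED_TRANSISTOR_GAIN),
    }
    for name, (freq, count) in options.items():
        fraction = max_active_fraction(freq, count)
        print(f"{name}: active fraction = {fraction:.2f}")
```

    Under these assumptions, spending the whole budget on either frequency or cores keeps the design fully active, while taking both gains leaves roughly half of the transistors dark or dim.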

    6 Insights from the Brain: A Dark Technology

    Perhaps one promising indicator that low duty cycle, dark technology can be mastered, unlocking new application domains, is the efficiency and density of brains. The brain, even today, can perform many tasks that computers cannot, especially vision-related tasks. With 80 billion neurons and 100 trillion synapses operating at < 100 mV, it is an existence proof of highly parallel, reliable, and highly dark operation, and it embodies three of the horsemen: dim, specialized, and deus ex machina. Neurons operate with extremely low duty cycles compared to processors: at best a kilohertz. Although computing using silicon-simulated neurons introduces excessive interpretive overheads (neurons and transistors have fundamentally different properties), the brain can offer us insight and long-term ideas about how we can redesign systems for the extremely low duty cycles and low voltages being called for by dark silicon.

    1. Specialization. As with the specialized horseman, different groups of neurons serve different functions in cognitive processing, connect to different sensory organs, and boast reconfiguration, evolving over time neuronal connections customized to the computation.

    2. Very dark operation. Neurons fire at a maximum rate of approximately 1,000 switches per second. Compare this to ALU transistors that toggle three billion times per second. The most active neuron's activity is roughly a millionth of that of processing transistors in today's processors.

    3. Low Voltage Operation. Brain cells operate at approximately 100 mV, yielding CV² energy savings of 100× versus 1 V operation, in a clear parallel to the dim horseman's NTV circuits. Communication is low-swing and low-voltage, saving large amounts of energy.



    4. Limited sharing and memory multiplexing. Since any given neuron can only switch 1,000 times per second, by definition, it must have extremely limited sharing, since a point of multiplexing would be a bottleneck in parallel processing. The human visual system starts with 6M cones in the retina, similar to a two-megapixel display, processes their output with local neurons, and then sends it on the 1M-neuron optic nerve to the visual cortex. There is no central memory store; each pixel has a set of its own ALUs, so to speak, so energy waste due to multiplexing is minimal.

    5. Data decimation. The human brain reduces the data size at each step and operates on concise but approximate representations. If two megapixels suffice to handle color-related vision tasks, why use more than that? Larger sensors would just require more neurons to store and compute on the data. We should ensure that we are processing no more data than necessary to achieve the final outcome. Increasingly, researchers have begun to formalize approximate computation as a way of handling dark silicon.

    6. Analog Operation. The neuron performs a more complex basic operation than the typical digital transistor. On the input side, neurons combine information from many other neurons; on the output side, although neuron firing is a rail-to-rail digital pulse, neurons encode multiple bits of information via spike timings. Could there be more efficient ways to map operations onto silicon-based technologies? In RF wireless front-end communications, analog processing enables computations that would be impossible to do at speed digitally. But analog techniques may not scale well to deep nanometer technology.

    7. Fast, Static, Scatter / Gather. Neurons have a fanout/fanin of 7,000 to other neurons that are located significant distances away. Effectively, they are able to perform efficient scatter-gather operations across a large number of static memory locations, at low cost. Do more efficient ways exist for implementing these operations in silicon? It would presumably be useful for computations that operate on finite-sized, static graphs.
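
    The figures quoted in items 2 through 5 can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, switching energy is assumed to be proportional to C·V² with the same capacitance at both voltages, and all inputs are the round numbers given in the text.

```python
"""Arithmetic behind the brain comparisons (illustrative; names are hypothetical)."""

# Round numbers quoted in the text.
NEURON_MAX_RATE_HZ = 1_000          # maximum neuron firing rate
ALU_TOGGLE_RATE_HZ = 3_000_000_000  # ALU transistor toggle rate
RETINA_CONES = 6_000_000            # cones in the retina
OPTIC_NERVE_NEURONS = 1_000_000     # neurons in the optic nerve


def activity_ratio(slow_hz: float, fast_hz: float) -> float:
    """Ratio of switching activity between a slow and a fast device."""
    return slow_hz / fast_hz


def cv2_energy_savings(low_voltage: float, high_voltage: float) -> float:
    """Energy saving factor from lowering the voltage, assuming E ~ C * V^2
    and equal capacitance at both operating points."""
    return (high_voltage / low_voltage) ** 2


def decimation_factor(inputs: int, outputs: int) -> float:
    """Factor by which a processing step reduces the number of channels."""
    return inputs / outputs


if __name__ == "__main__":
    ratio = activity_ratio(NEURON_MAX_RATE_HZ, ALU_TOGGLE_RATE_HZ)
    print(f"neuron vs. ALU activity ratio: {ratio:.2e}")
    print(f"CV^2 saving, 100 mV vs. 1 V: {cv2_energy_savings(0.1, 1.0):.0f}x")
    factor = decimation_factor(RETINA_CONES, OPTIC_NERVE_NEURONS)
    print(f"retina to optic nerve decimation: {factor:.0f}x")
```

    With these inputs, the voltage ratio of 10 gives the 100× saving, the firing-rate ratio comes out below one millionth, and the step from retina to optic nerve reduces the channel count sixfold.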


    Although silicon is getting darker, for researchers the future is bright and exciting; dark silicon will cause a transformation of the computational stack, and will provide many opportunities for investigation.


    [1] E. Krimer et al. Synctium: A near-threshold stream processor for energy-constrained parallel applications. IEEE Computer Architecture Letters, Jan. 2010.



    [2] A. Ionescu et al. Tunnel field-effect transistors as energy-efficient electronic switches. In Nature, November 2011.

    [3] A. Raghavan et al. Computational sprinting. In HPCA, Feb. 2012.

    [4] D. Fick et al. Centip3De: A 3930 DMIPS/W configurable near-threshold 3D stacked system with 64 ARM Cortex-M3 cores. In ISSCC, Feb. 2012.

    [5] F. Chen et al. Demonstration of integrated micro-electro-mechanical switch circuits for VLSI applications. In ISSCC, Feb. 2010.

    [6] G. Venkatesh et al. Conservation cores: Reducing the energy of mature computations. In ASPLOS, 2010.

    [7] Goulding et al. GreenDroid: A mobile application processor for a future of dark silicon. In HOTCHIPS, 2010.

    [8] Goulding-Hotta et al. The GreenDroid mobile application processor: An architecture for silicon's dark future. Micro, IEEE, March 2011.

    [9] H. Esmaeilzadeh et al. Dark silicon and the end of multicore scaling. SIGARCH Comput. Archit. News, June 2011.

    [10] H. Esmaeilzadeh et al. Neural acceleration for general-purpose approximate programs. MICRO, 2012.

    [11] J. Sampson et al. Efficient complex operators for irregular codes. In HPCA, 2011.

    [12] R. Merrit. ARM CTO: Power surge could create dark silicon. EE Times, October 2009.

    [13] N. Hardavellas et al. Toward dark silicon in servers. IEEE Micro, 2011.

    [14] R. Dennard. Design of ion-implanted MOSFETs with very small physical dimensions. In JSSC, October 1974.

    [15] R. Dreslinski et al. Near-threshold computing: Reclaiming Moore's law through energy efficient integrated circuits. Proceedings of the IEEE, Feb. 2010.

    [16] S. Gupta et al. Bundled execution of recurring traces for energy-efficient general purpose processing. MICRO, 2011.

    [17] S. Jain et al. A 280mV-to-1.2V wide-operating-range IA-32 processor in 32nm CMOS. In ISSCC, Feb. 2012.

    [18] M. Taylor. Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse. In Design Automation Conference, 2012.

    [19] V. Govindaraju et al. Dynamically specialized datapaths for energy efficient computing. In HPCA, 2011.

    [20] W. Huang et al. Scaling with design constraints: Predicting the future of big chips. IEEE Micro, July-Aug. 2011.

    Bio: Michael Taylor is an Associate Professor at the University of California, San Diego. He can be found at http://darksilicon.org.

