accelerator-based architectures for wireless sensor network

Accelerator-Based Architectures for Wireless Sensor Network Applications A dissertation presented by Mark David Hempstead to School of Engineering and Applied Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the subject of Engineering Sciences Harvard University Cambridge, Massachusetts May 2009


  • Accelerator-Based Architectures for Wireless Sensor Network Applications

    A dissertation presented

    by

    Mark David Hempstead

    to

    School of Engineering and Applied Science

    in partial fulfillment of the requirements

    for the degree of

    Doctor of Philosophy

    in the subject of

    Engineering Sciences

    Harvard University

    Cambridge, Massachusetts

    May 2009

  • © 2009 - Mark David Hempstead

    All rights reserved.

  • Thesis advisors: David Brooks and Gu-Yeon Wei

    Author: Mark David Hempstead

    Accelerator-Based Architectures for Wireless Sensor Network

    Applications

    Abstract

    Growing power consumption threatens the explosive growth that the semiconductor

    industry has sustained over the last several decades. While the number of transistors

    continues to double every process technology generation, the slowing of constant-field

    scaling has caused power density to increase, limiting clock frequency. To combat these

    trends, designers must get more performance from each transistor switch. Technology

    companies are applying microprocessors to a growing diversity of applications that

    are increasingly mobile and untethered from the power grid. One such domain is

    the emerging area of wireless sensor networks (WSNs) where, because nodes are

    often deeply embedded in an environment, power consumption is the primary design

    constraint.

    This dissertation explores the challenges of designing in a power-constrained era

    through the development of a model we call Navigo and the design and implemen-

    tation of an accelerator-based architecture for WSNs. We designed Navigo to aid in

    early architecture exploration as an alternative to the spreadsheets and back-of-the-

    envelope calculations that planners use to guide future designs. The results show

    that, even under ideal conditions, multicore processors will not achieve the performance

    gains necessary to maintain growth. This dissertation shows that if an increasing

    amount of area per technology node is allocated to specialized accelerators, then

    microprocessor performance growth will be maintained.

    As a case study of accelerator-based architectures, we developed a processor for

    WSNs. Our architecture includes accelerators for regular tasks, and event handling is

    offloaded to the event processor, removing the software overhead of a general-purpose

    design. Because the architecture is modular, VDD-gating can be employed to address

    leakage current at the architecture level. We built a prototype in 130nm CMOS. We

    compare our system to other systems in the literature and to a general-purpose

    design. Among the systems compared, ours has the lowest energy per equivalent

    instruction, and the results of our workload analysis show that the system is suited to

    both low-intensity and high-performance WSN applications.

  • Contents

    Title Page
    Abstract
    Table of Contents
    List of Figures
    List of Tables
    Citations to Previously Published Work
    Acknowledgments
    Dedication

    1 Introduction and Summary
        1.1 Motivation
            1.1.1 Technology and Trends
            1.1.2 Market Requirements
        1.2 Holistic Approach
        1.3 Accelerator-based Architectures
        1.4 Summary of Contributions

    2 Navigo: A Model to Study Power-Constrained Architectures and Specialization
        2.1 Navigo: A Model for Performance Trends in Future Technologies
            2.1.1 Modeling Methodology and Sample Libraries
        2.2 Power-constrained Performance for Multi-core
            2.2.1 Results without Power Constraints
            2.2.2 Results with Power Constraints
        2.3 Validating the Model
        2.4 Modeling Specialization
            2.4.1 Variant of Amdahl's Law for Specialization
            2.4.2 Examples of Specialized Cores
        2.5 Model Limitations and Future Directions

    3 An Ultra Low Power Event Driven Architecture for WSNs
        3.1 Background and Motivation
            3.1.1 Overview of WSN Applications
            3.1.2 PowerTOSSIM - Modeling Commercially Available Systems for WSN
            3.1.3 Low-Power Circuit Design Techniques
            3.1.4 Energy Scavenging
        3.2 Goals of the Architecture
        3.3 Architecture Description
            3.3.1 System Bus Description
            3.3.2 Event Processor Specification
            3.3.3 Description of Accelerators and Other Blocks
        3.4 Architecture Evaluation
            3.4.1 Performance Modeling - SystemC Simulator
            3.4.2 Test Application
            3.4.3 Cycle Performance Estimates
        3.5 Selection of Process Technology
            3.5.1 Background on Technology Scaling
            3.5.2 Simulation Study
            3.5.3 Modeling Architecture Across Process Technologies
            3.5.4 Results of System Analysis

    4 Silicon Implementation and Evaluation of Accelerator Based Systems
        4.1 Implementation Details
            4.1.1 Design Flow and Tools Used
            4.1.2 VDD-gate circuit
            4.1.3 Die-Photo and Test Chip Specifications
        4.2 Measurements of Prototype
            4.2.1 Test Methodology and Setup
            4.2.2 Functional Verification
            4.2.3 Block Level Power Measurements
            4.2.4 Energy per Task and Energy per Instruction
        4.3 Comparison to Related Work
            4.3.1 Categorization and Description of Similar Systems
            4.3.2 Summary and Comparison
        4.4 Comparison to General Purpose Microcontroller
            4.4.1 Performance and Energy Benefits of Specialization
            4.4.2 Workload Analysis and DVFS
        4.5 Using Navigo to Guide Future Revisions

    5 Conclusion and Future Directions
        5.1 Summary of Themes and Results
        5.2 Future Work
            5.2.1 Improved Modeling Frameworks
            5.2.2 Memory Systems for Accelerator-Based platforms
            5.2.3 Applying Accelerator-Based Architectures to Desktop/Mobile platforms

    Bibliography

    A Related Work: Description of Similar Systems
        A.1 General Purpose Commodity Based Systems
        A.2 Smart Dust - Early Event Driven
        A.3 Subthreshold Systems
        A.4 Asynchronous - SNAP
        A.5 Charm - Network Stack Acceleration

    B Detailed Design Documents
        B.1 System Bus Signals
        B.2 Memory Map
        B.3 Interrupt Map
        B.4 Power Domains

  • List of Figures

    1.1 Growth in Microprocessor Performance. Historically the industry has observed a total 1.58x performance gain per year. Power consumption constraints inhibit performance growth, causing a gap between expected and delivered performance. Data from Hennessy and Patterson [25] and spec.org [54].

    1.2 Research Approach. We take a holistic approach to research, understanding and addressing power consumption at all layers of the design space. Architecture innovations are informed by modeling and prototyping.

    2.1 Graphical depiction of Navigo. The model accepts library files for process technology, circuits, architecture, and market segments, and computes total and constrained power for a set of user-defined inputs such as supply voltage, frequency, etc.

    2.2 Results without power constraints across process technologies. Results assume nominal voltage for specified technology and MPU-HP market segment with a die size of 310 mm².

    2.3 Results with power constraints across process technologies - Server. Results assume nominal voltage for specified technology and MPU-HP market segment with a die size of 310 mm² and max power of 198 W.

    2.4 Results with power constraints across process technologies - Mobile. Results assume nominal voltage for specified technology and Mobile market segment with a die size of 100 mm² and max power of 35 W. Vdd is limited to VddMIN.

    2.5 Results with power constraints across process technologies without VddMIN constraints - Mobile. Results assume nominal voltage for specified technology and Mobile market segment with a die size of 100 mm² and max power of 35 W. Vdd can be reduced without a lower limit.

    2.6 Validation of Navigo using Microprocessors from 1996 to 2007. Predicted results use the most recent ITRS technology models. The initial core model is an Alpha 21164 0.5 GHz in 250nm technology introduced in 1996. The data points representing commercially available systems are also presented in Figure 2.5.

    2.7 Speeding up an application with specialized cores. A workload is split to an additional set of resources, the specialized core. The fraction of the application that can be executed on the specialized core is f, with a speedup of S.

    2.8 Understanding the impact of specialization on throughput. Calculations of throughput with specialization for different speedups (S) and fractions of workload (f). Assumes the general purpose core is fully utilized and resources for an additional specialized core have been provisioned.

    2.9 Specialization across process technologies with real SP cores. Total throughput for different values of f assuming the area and speedup of one example SP core per GP core. Mobile 35W market segment.

    2.10 Configurations that can achieve 1.58x/year throughput. Model two different accelerator structures: the programmable CELL SPE and an H.264 accelerator. Core2Duo-based GP cores and the Mobile 35W market assumed.

    3.1 Measured and simulated current consumption for the Beacon application. The simulated version includes a breakdown according to radio, LEDs, and CPU current. A lower resolution digital multi-meter was used for the above measurement, which did not capture the very short duration peak power spikes during the wakeups.

    3.2 Surge Application Power Consumption Breakdown. 60 sec of the surge TinyOS application run on the Mica2 mote.

    3.3 System Block Diagram.

    3.4 Event Processor State Machine

    3.5 Diagram and Code of the Monitoring Application. The code displayed are ISR routines written for the event processor. Actual address values have been omitted to make the code easy to read.

    3.6 Test Circuit Used for Simulations. The circuit consists of an 11 stage ring oscillator made up of an assortment of logic gates. Interconnect was modeled between devices.

    3.7 Leakage Power, EDP, and Frequency Across all Technologies. Each line indicates a technology node from 180nm to 70nm. Supply voltage is on the X-axis, which was swept from 0.1V to the max VDD specific to the process. Temperature is 20C and all transistors are minimum size.

    3.8 Results for Baseline Architecture. Performance target of N=100 sense and transmit tasks.

    3.9 Effect of Energy Reduction Techniques on Total Energy Consumption of the Architecture Across Process Technologies. Power supply voltage is limited to VtP + VtN and the number of tasks per second is 100.

    3.10 Summary of Energy Reduction Techniques Across Process Technologies. Each bar represents the minimum energy calculated for a particular architecture configuration and process technology. Both the total energy consumption and a percentage breakdown of the source of energy consumption are included.

    4.1 Custom VDD-Gating Circuit. The schematic shows four different parallel legs which are used to control VDD-gating strength. Layout of the filter component shows where the VDD-gating circuit is attached. In this example, the VDD-gating circuit requires an additional area of 3.2%.

    4.2 Die Photograph of 130nm Prototype. System includes an event processor and several accelerators for regular operation. The system has been realized in 130nm CMOS on a 2mm x 2mm die. The system contains 444,982 transistors including 4KB of foundry supplied SRAM.

    4.3 Frequency versus Voltage Shmoo. Shaded region of plot indicates where the test failed; the unshaded region indicates successful operation. Results from a full run of the sense and transmit application were used to generate a shmoo. Due to limitations of the test board the chip was measured up to 12.5 MHz. The shmoo generated using post layout simulations indicates the chip will work up to 100 MHz.

    4.4 Measured power consumption of the prototype under different supply voltages and clock frequencies. Plots a-c show the power consumption for the Event Processor, Accelerator, and SRAM power domains while sweeping voltage from 450 mV to 800 mV and frequency from 25 kHz to 12.5 MHz. Idle power is measured with the external clock off (0 MHz @ 550 mV). The VDD-gating transistor is off (not conducting) during the measurement of gated power.

    4.5 Energy per Task of Sense and Transmit Task. Application includes all accelerator blocks and power contributions from the SRAM and Event Processor.

    4.6 Comparison to Other Systems Designed for WSN.

    4.7 Performance and Power Benefits of Specialization. Test routines were executed both on the hardware accelerators and the microcontroller. Cycle count and energy savings are presented.

    4.8 Evaluation of Accelerator-based Architecture vs. General Purpose System

    4.9 WSN architecture projected to advanced process technologies and power budgets. Die size, f, and S are fixed based on measurements of the original system. Area is swept and the configuration with the maximum throughput is reported for three different power budgets.

    A.1 Smart Dust Microarchitecture [59].

    A.2 Block Diagram of the Subliminal Processor (University of Michigan) [51].

    A.3 Simplified block diagram of the SNAP processor for WSN. System includes separate instruction and data memories, a timer coprocessor, and a message processor which provides a FIFO interface to the off-chip radio and sensors [9].

    A.4 The Charm protocol processor microarchitecture [52].

  • List of Tables

    2.1 Predicted Process Technology Characteristics. High-Performance Microprocessor Technology, ITRS 2007 Edition [50].

    2.2 Technology Scaling Factors. High-Performance Microprocessor Logic. Indicates a departure from historical scaling trends resulting in an increase in power density [50].

    2.3 Example Cores used in analysis. Data collected from conference and journal publications and datasheets. SPEC2006 results used to determine IPC are from spec.org.

    2.4 Market Segment Constraints. Die size and Max Power Consumption for a set of market segments. Values for the first three markets came from ITRS [50]. The final four market segments are based on die size and thermal design point of commercially available Intel Processors.

    2.5 Select Microprocessors from 1996 to 2007. Performance data is from the analysis in Figure 1.1. Power consumption and die size data was acquired from datasheets and published microprocessor reports.

    2.6 Specialized Cores. Example SP cores used in the model. All measurements were scaled to 65nm technology and speedup was calculated by comparing published performance results to the performance on a general purpose CPU. The Core2 is included to show the relative area and performance cost of including another GP core instead of an SP core. Power and speedup for CELL SPE running Linpack.

    3.1 Sensor Sampling Rates of Different Phenomena

    3.2 Example WSN application domains.

    3.3 Power model for the Mica2. The mote was measured with the micasb sensor board and a 3V power supply.

    3.4 Event Processor Instruction Set

    3.5 Comparison of cycle count for the test application written on our architecture and on TinyOS for the Mica Platform.

    3.6 Scaling Factors. From theory and simulation data.

    3.7 Activity Ratios for Our Test Application

    B.1 System Bus Signals

    B.2 System Memory Map. All addresses are in hex.

    B.3 System Interrupt Map. Lists all of the interrupts in the prototype and the source of the interrupt.

    B.4 Power Domains in the Prototype. Lists all of the power domains in the prototype including virtual power domains and power domains for testing only.

  • Citations to Previously Published Work

    The architecture presented in Chapter 3 first appeared in the following paper:

        An ultra low power system architecture for sensor network applications, Mark Hempstead, Nikhil Tripathi, Patrick Mauro, Gu-Yeon Wei, and David Brooks, In The 32nd Annual International Symposium on Computer Architecture (ISCA), June 2005.

    The PowerTOSSIM simulator, presented in Section 3.1.2 including Figure 3.1, appeared in:

        Simulating the Power Consumption of Large Scale Sensor Network Applications, Victor Shnayder, Mark Hempstead, Bor-Rong Chen, Geoff Werner Allen, and Matt Welsh, In Proceedings of the Second ACM Conference on Embedded Networked Sensor Systems (SenSys), Baltimore, MD, Nov 2004.

    The evaluation of process technology selection, presented in Section 3.5, appeared in:

        Architecture and Circuit Techniques for Low-Throughput, Energy-Constrained Systems Across Technology Generations, Mark Hempstead, Gu-Yeon Wei and David Brooks, In Proceedings of the International Conference On Compilers, Architecture, And Synthesis For Embedded Systems (CASES), Seoul, South Korea, October 2006.

    The related work, presented in Section 4.3 and Appendix A, was first surveyed in the following invited paper:

        Survey of hardware systems for wireless sensor networks, Mark Hempstead, Michael J. Lyons, David Brooks and Gu-Yeon Wei. ASP Journal of Low Power Electronics, Vol. 4, No. 1, April 2008.

    The Navigo model presented in Chapter 2 is currently under submission in the following paper:

        Navigo: A Model to Study Power-Constrained Architectures and Specialization, Mark Hempstead, Gu-Yeon Wei, and David Brooks [Under Submission]

    The measurement results of our prototype, presented in Chapter 4, are currently under submission:

        An accelerator-based wireless sensor network processor in 130nm CMOS, Mark Hempstead, David Brooks, and Gu-Yeon Wei, [In preparation]

  • Acknowledgments

    The path to this PhD has been an adventure, and I would like to take this opportunity

    to thank all of those who have helped and supported me along the way.

    Throughout my journey the path was often hard to find and, without the guidance and

    encouragement of these individuals, I would never have overcome the academic,

    technical, and emotional challenges that blocked my way.

    First, I would like to thank my advisers Gu-Yeon Wei and David Brooks for taking

    a chance on me to start a fruitful collaboration across the disciplines of circuit design

    and architecture. Throughout the last few years they have supported and guided my

    transformation as a researcher. I appreciate the endless hours they spent providing

    feedback on talks, papers, and chips, pushing me to think more deeply. Early in

    my research career I received valuable feedback from my qualification committee,

    Woodward Yang and Paul Horowitz. I am grateful to Margo Seltzer for her instruction

    in paper writing and presentations in CS261 and, more recently, for agreeing to serve

    on my dissertation committee.

    Throughout the duration of my research project, several individuals helped me

    with architecture exploration and early Verilog coding, including Nikhil Tripathi,

    Patrick Mauro, and Xiaoyao Liang. Michael Lyons and I have enjoyed a strong collaboration

    brainstorming the design of SMASH, a next-generation architecture. I wish

    to thank the other members of the Mixed-signal VLSI and Architecture groups: Amber

    Tan, Ruwan Ratnayake, Andrew Liu, Hayun Chung, Ankur Agrawal, Wonyoung

    Kim, Durlov Khan, Meta Gupta, Benjamin Lee, VJ Reddi, and Kevin Brownell.

    They provided invaluable instruction and support when I was met with problems

    using CAD tools, test equipment, and architecture simulators. Moreover, they were

    the source of supportive conversations at lunch, over dinner, and during late-night

    tape-outs.

    Halfway through my grad student career, our group received the gift of Glenn

    Holloway, whose management of our machines and debugging support at all hours

    saved me weeks of frustration. Jim MacArthur in the Cruft circuits lab was an

    invaluable resource when I needed help designing PCBs, soldering, or finding random

    parts. Because my research crossed into the systems realm, early collaborations with

    the wireless sensor network (WSN) group, including Matt Welsh, Geoffrey Werner

    Challen, Victor Shnayder, and Bor-Rong Chen, helped me understand the needs

    of the WSN community. I'm thankful to UMC and the SRC for supporting the

    fabrication of my two test chips. I would like to thank Joel Emer, Mark Charney, and

    Geoffrey Loweny for hosting me at Intel in Hudson, MA for a summer and exposing

    me to research in higher performance systems.

    For me, grad school was more than just research: I had the opportunity to engage

    in a diverse set of activities, from teaching to graduate student organization

    and the Harvard house system. Harry Lewis introduced me to his unique course,

    QR48:BITS, and he was a wonderful teaching mentor who gave me the chance to try

    my hand at lecturing. Likewise, Woodward Yang showed me how to coach students

    in engineering design in ES96. I'm thankful to Hwa Chang and Jeffery Hopwood at

    Tufts for mentoring me after I took over the digital logic class this semester. I would

    like to encourage the students who have taken over the graduate student life committee

    to continue the good work of building a community within SEAS and motivating

    graduate students to leave their labs occasionally. For the past three years, my fellow

    tutors, masters, and students have made Lowell House into a vibrant and supportive

    home.

    Throughout my graduate school experience, it was the support of my caring friends

    and family that kept me going. Specifically, I would like to thank my parents, David

    and Rolande, who brought me up with such caring and supported me with a smile

    when I turned down a job in the real world for graduate school. My father, who

    taught me to think like an engineer at a young age through his probing questions at

    the dinner table, continues to challenge me today. My mother, who rightly believes I

    need emotional support just as much as technical support, continues to pick me back

    up after each paper rejection. My sister Amy was my lifeline here in Boston over the

    past few years. Though she has suppressed her engineering genes, she continues to

    surprise me with a display of her scientific mind over a bottle of wine. My brother

    Chris, the more practical engineer, taught me how to put a square peg in a round

    hole with a big hammer. His thoughtfulness and ingenuity just might convince

    me to start a company with him ... someday. Finally, I cannot give enough thanks

    to Megan, whose caring, kindness, and support over the last few years made this

    dissertation possible and easier to read. I look forward to many more adventures

    together and one more dissertation between us.

  • Dedicated to those who have paved the way for me:

    my parents David and Rolande,

    and my grandparents David and Margaret Hempstead, and Rudy and

    Lillian Perreault.


  • Chapter 1

    Introduction and Summary

    Contents

    1.1 Motivation
        1.1.1 Technology and Trends
        1.1.2 Market Requirements
    1.2 Holistic Approach
    1.3 Accelerator-based Architectures
    1.4 Summary of Contributions

    Advances in computational capabilities have driven the information technology

    revolution, which in turn has driven advances in nearly all fields of science, medicine,

    and business. Although incredibly powerful computing devices are available today,

    this single-minded pursuit of performance has made power consumption one of the

    main bottlenecks for nearly all types of computing systems, from high-end servers

    to wireless sensor devices. Due to limitations in device cooling at the high end and

    battery technology at the low end, processor designs are increasingly stratified into

    power-constrained market segments in which the challenge is to increase processor

    performance for a fixed power budget. While advanced fabrication technologies are

    projected to continue to provide computer designers a doubling of transistors per

    generation, slowing constant-field scaling and worsening wire parasitics will cause the

    energy per switching event to scale at a rate at which chip power essentially remains

    constant only if clock frequency and core activity stay fixed. Current trends towards

    large multi-core systems use the additional transistor bounty for additional power-efficient

    cores but, with single-thread performance saturated, most benefits will come

    through thread-level parallelism. Assuming an optimistic scenario for the continued

    extraction of thread-level parallelism from workloads, chip performance gains will

    track growth in transistor counts. The International Technology Roadmap for Semiconductors

    (ITRS) projects a doubling in the number of transistors every three years

    (i.e., approximately 1.25x per year), leading to an increasing gap between projected

    performance growth and historical performance growth rates. Bridging this performance gap will

    require an architectural paradigm shift to augment the multi-core trend, in which

    an increasing fraction of chip real estate must be devoted to specialized logic that

    provides significant benefits in performance per switching event for a growing portion

    of workloads.
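The growth-rate comparison in the preceding paragraph can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not part of the dissertation's models: it converts a doubling every three years into a per-year rate (the cube root of 2, about 1.26x, which the text rounds to 1.25x) and compounds the difference against the historical 1.58x per year trend reported in Figure 1.1. The function names and the 15-year horizon are assumptions chosen for illustration.

```python
# Illustrative arithmetic only. The 1.58x/year figure and the three-year
# doubling period come from the text; the 15-year horizon is an assumption.

def annual_rate(factor: float, years: float) -> float:
    """Convert a growth factor achieved over `years` into a per-year rate."""
    return factor ** (1.0 / years)

def cumulative_gap(historical: float, projected: float, years: int) -> float:
    """Ratio of historical-trend performance to projected performance after `years`."""
    return (historical / projected) ** years

transistor_rate = annual_rate(2.0, 3.0)  # doubling every three years
historical_rate = 1.58                   # historical performance growth per year
horizon_years = 15                       # hypothetical horizon, not from the text

gap = cumulative_gap(historical_rate, transistor_rate, horizon_years)
print(f"transistor-tracking growth: {transistor_rate:.2f}x per year")
print(f"gap after {horizon_years} years: {gap:.1f}x")
```

Even under the optimistic assumption that performance tracks transistor count exactly, the two rates diverge by more than an order of magnitude over such a horizon, which is the gap the text argues specialization must close.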

    This dissertation argues that maintaining growth in system performance requires

    using transistors more efficiently to achieve higher performance per watt. The power

    consumption of a computing device depends on all layers of the design space, from

    the application software to the system architecture, circuits, and process technology.

    This work takes a holistic approach by developing models and designs incorporating

    all layers of the design space. In this chapter, we describe the technology and market

    conditions that motivate this work and our holistic approach. We also describe

    accelerator-based architectures in general and allude to a prototype that we designed

    and taped out for this work. Finally, we summarize the main contributions of this

    work.

    [Figure 1.1: plot of CPU performance (log scale) versus year, 1985 to 2020, showing the historical 1.58x trend, the power-constrained era, and multi-core and single-thread performance predictions.]

    Figure 1.1: Growth in Microprocessor Performance. Historically the industry has observed a total 1.58x performance gain per year. Power consumption constraints inhibit performance growth, causing a gap between expected and delivered performance. Data from Hennessy and Patterson [25] and spec.org [54].

    1.1 Motivation

    1.1.1 Technology and Trends

    Over the past few decades the performance of microprocessors has grown steadily.

    However, over the past several years designers have been forced to slow the growth

    of single thread performance because of increasing power consumption. To explore

    these trends, Figure 1.1 plots both historical performance growth and projected multi-

    core and single-threaded performance growth until 2020. All data in the plot is

    relative to the VAX 11/780 as measured by SPECint benchmarks. Data in the plot

    prior to 2005 was obtained from Hennessy and Patterson, and data for recent years

    was obtained using the highest single-die performance SPECint2006 (single-thread)

    and SPECint2006rate (multi-core) from the SPEC website [25, 54]. Performance

    growth began to deviate from the historical 1.58x per year trend in 2001, primarily

    due to the difficulty of obtaining additional clock frequency and instruction-level

    parallelism improvements in the face of power constraints. The computing industry

    has reacted to this trend by concentrating on multi-core designs that capture

    thread-level parallelism. Unfortunately, as detailed in this work, power issues will prevent

    multi-core performance growth from meeting the historical trend, and closing this

    gap will require more efficient use of transistors.

    1.1.2 Market Requirements

    The growth of the semiconductor industry has been driven not only by performance

    gains but also by a growing diversity of applications for microprocessors.

    Microprocessors have moved out of government and corporate computing centers into

    homes, schools, coffee shops, and, now, pockets and pocketbooks. As microprocessors

    have found additional uses beyond high performance and desktop computing,

    new design constraints are being applied to microprocessors, among them power,

    size, and cost.

    Power consumption is increasingly the primary design constraint for mobile and

    embedded devices, as designers try to maximize battery life and reduce cooling cost.

    The performance and power consumption requirements across market segments vary

    by several orders of magnitude: high-performance servers have a power limit of

    200 W, while some processors for laptops and netbooks are designed to consume a

    maximum of 1-10 W (Chapter 2 includes a more detailed list of market segments and

    power constraints). The power constraints imposed by the market are at odds with

    the increase in power density caused by technology scaling. Because mobile and

    embedded devices are untethered from the power grid, power consumption has been

    a concern within these communities for some time.

    The emerging market segment of wireless sensor networks (WSNs) places even

    more stringent power constraints on processor design and therefore is an indicator

    of what is to come for the other market segments in the future. Wireless sensor

    networks have applications in medicine, science, industrial automation and security.

    WSN nodes are often deeply embedded in an environment and decoupled from the

    wired power grid. Consequently, designers would like to use scavenged energy to power

    WSN devices indefinitely. Currently available energy scavenging methods place a

    power consumption constraint of roughly 100 µW on microprocessors designed for

    environmentally powered WSNs (a more detailed background of WSNs and energy

    scavenging is presented in Section 3.1). These strict limits on power consumption

    provide increased design pressure to maximize performance-per-watt. As technology

    scales and power density increases, other market segments will face similar design

    challenges.

    Figure 1.2: Research Approach. We take a holistic approach to research, understanding and addressing power consumption at all layers of the design space. Architecture innovations are informed by modeling and prototyping. (Panel (a), Holistic Approach: the application, network, architecture, circuits, and process technology layers. Panel (b), Research Cycle: modeling (power and performance), design (architecture and circuits), circuit simulations, and prototyping.)

    This work investigates the impact of technology scaling on power consumption. As

    this section has described, the pressures of a power-constrained era require designers

    to think about improving performance per watt by using transistors more efficiently.

    This work takes a holistic approach looking at all areas of the design space, using the

    emerging domain of WSNs as a case study in ultra-low power design.

    1.2 Holistic Approach

    During the course of our research, we have taken the view that all layers of the

    design space influence power consumption, from the application and network to the

    architecture and circuits. Figure 1.2 provides a graphical description of the research

    approach we employed. Our research efforts follow an iterative approach through

    modeling, design, and prototyping, and our models incorporate inputs from a variety

    of design layers. For example, the PowerTOSSIM model (Section 3.1.2) accepts inputs

    from the network and application layers and physical power measurements of nodes

    while the Navigo model (Chapter 2) takes data from circuit simulations, process

    technology data and performance benchmarks of different architectures.

    We use modeling to guide design decisions which are verified by circuit simulations

    and prototyping. Chapter 3 describes a design motivated by the modeling of appli-

    cation behavior and addresses leakage current, which is increasing due to technology

    scaling. Because our power consumption targets are so low, we developed a prototype

    in 130nm CMOS to verify that our design achieves ultra low power operation. Both

    the power and performance measurements of the prototype, presented in Chapter 4,

    prompt more analysis and modeling of generalized accelerator-based architectures.

    Consequently, results from our prototype and modeling efforts will drive our future

    research efforts.

    1.3 Accelerator-based Architectures

    Both the trends in technology and market pressures to increase power efficiency

    reveal the need to extract more computation for each transistor switch. Many de-

    signers intuitively believe that application specific integrated circuits (ASICs) pro-

    vide higher performance and increased energy efficiency over general purpose based

    designs. However, ASICs are tuned for a particular set of computations and hence do

    not possess the flexibility and programmability of a general purpose processor. One

    approach, used by the system-on-chip community, places ASIC accelerators on a chip

    with a general purpose microcontroller. As we show in this work, an accelerator-based

    approach has the potential to compensate for the loss of performance due to power

    constraints. We show that maximizing total system performance requires that the

    accelerators provide application speedup (S) for a large fraction of the workload (f).
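
    The interplay between S and f can be illustrated with the classic form of Amdahl's law. The sketch below is the textbook formula only, not the enhanced variant developed in Chapter 2, and the function name is hypothetical.

```python
def amdahl_speedup(f: float, s: float) -> float:
    """Classic Amdahl's law: overall speedup when a fraction f of the
    workload is accelerated by a factor s and the rest runs unchanged."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0, 1]")
    if s <= 0.0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - f) + f / s)


if __name__ == "__main__":
    # Sweep a few illustrative (not measured) combinations of f and S.
    for f in (0.5, 0.9, 0.99):
        for s in (10.0, 100.0):
            print(f"f={f:.2f} S={s:5.0f} -> overall {amdahl_speedup(f, s):6.2f}x")
```

    Even with a large S, the overall gain stays below 1/(1-f), which is why the accelerated fraction of the workload matters as much as the per-accelerator speedup.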

    The regular nature of computation and the ultra-low power requirements of the

    WSN application domain make it well-suited to benefit from an accelerator-based

    architecture. As a case study of accelerator-based architectures, we designed and

    implemented a processor for WSN applications. Our implementation utilizes the

    modular nature of the architecture to turn off unused accelerators and address leakage

    current with architecture. We also do away with the notion that the system needs

    to be controlled by a high powered general purpose core and, instead, we replace it

    with an event-driven state machine. Traditionally, the energy efficiency of a system

    has been evaluated through the metric of energy-per-instruction. The concept of

    instruction is lost on accelerator-based architectures and, therefore, we propose several

    new methods to analyze the efficacy of our prototype.

    1.4 Summary of Contributions

    This work presents the combined contributions of four different modeling and

    analysis frameworks and a ground-up silicon implementation of a processor for wire-

    less sensor networks. Following the research approach presented in Section 1.2, the

    modeling frameworks are informed by several layers of the design space: applications,

    architecture, circuits, and process technology. The Navigo model, presented

    in Chapter 2, accepts libraries that describe architecture features, process technology

    characteristics, and voltage and frequency relationships from circuit simulations. Through

    the analysis of the inputs, Navigo reports an estimate of performance and power

    consumption for future generations of microprocessors. The results reveal that power

    consumption increasingly limits performance. Subsequent analysis with Navigo shows

    that specialization can provide the necessary performance-per-watt. However, the

    high level analysis from Navigo needed to be grounded in a real implementation to

    understand the benefits and costs of accelerator-based architectures. The design of

    our prototype was informed by our modeling of wireless sensor network applications

    with PowerTOSSIM, presented in Section 3.1.2, and an understanding of process

    technology trends. Likewise, the architecture of the prototype drives the analysis in

    Chapter 4 and the process technology study in Section 3.5. Through the models and

    prototype, this work presents the following insights and major contributions.

    Navigo: A Model to Study Power-Constrained Architectures and Specialization (Chapter 2)

    Modeling Framework for Early Exploration: Currently, designers use intuition

    and spreadsheet-based models to explore design decisions and estimate power

    consumption and performance of architectures five to fifteen years away from

    tape-out. Navigo provides features not available in spreadsheet-based models

    including voltage-frequency scaling to meet power constraints and input from

    circuit simulations. By incorporating different architecture models, Navigo can

    be used to model massive multi-core designs.

    Amdahl's Law for Specialization: We enhanced Amdahl's law to model heterogeneous

    accelerators that can provide a speedup (S) for a fraction of applications (f).

    With the enhanced Amdahl's law and architecture models of

    specialized accelerators, Navigo can be used to compare homogeneous multi-core

    designs with designs that include specialized accelerators.

    Results Show Increasing Effect of Power Constraints: Results using Navigo

    reveal that performance of multi-core systems will be significantly reduced due

    to power constraints. While some designers intuitively understand this result,

    our work is one of the first quantitative presentations of this issue. This result

    should serve as a call to action to develop systems with a higher performance-

    per-watt.

    Analysis for Amount of Specialization: By including specialized accelerators

    in the model, we use Navigo to select the amount of specialization (both S and

    f) required to maintain the performance growth shown in the semiconductor

    industry. This analysis gives designers the target amount of area to allocate to

    specialization in designs over the next decade.

    Accelerator-Based Architecture for Wireless Sensor Networks (Chapter 3)

    Holistic Design Informed Through Application and Circuits: We built

    PowerTOSSIM to study the power consumption of WSN applications. We used

    insights gained from PowerTOSSIM to guide our design of the system architec-

    ture.

    Accelerator Based Event-Driven Architecture: The custom architecture for

    WSN includes hardware accelerators for regular tasks, offloads event processing

    to a custom hardware component (the Event Processor), and addresses

    leakage power with architecture support for VDD-gating.

    Performance Improvements over Mica2: A SystemC model of the architecture

    shows a 10x performance improvement over the Mica2 architecture for typical

    WSN tasks.

    Framework for Process Technology Selection: We built a framework to evaluate

    the selection of process technology. We based the framework on a Verilog

    model of the architecture and circuit simulations of different process technology

    generations. The results show that because of increasing leakage current, the

    most advanced process technology node is not the best choice to minimize total

    system power consumption.

    Silicon Implementation and Evaluation of Accelerator Based Systems (Chapter 4)

    Prototype Chip in 130nm CMOS: We built a prototype as a case study of

    accelerator based architectures. It incorporates synthesized accelerator blocks,

    a custom VDD-gating circuit, and 2 KB of SRAM for a total of 444,982 transistors.

    Functional Verification and Per Block Power Measurements: We verified that the

    prototype functions correctly up to 12.5 MHz at 550 mV.

    Post layout simulations estimate that the system could run up to 100 MHz

    at 1.2 V. Measurements of per block power show that VDD-gating reduces

    idle leakage power by up to 100x.

    New Metric of Energy per Task and Comparison to Related Work: The

    traditional metric of energy-per-instruction does not accurately measure an

    accelerator-based architecture. Therefore we introduce two new metrics of En-

    ergy per Task and Energy per Equivalent Instruction to compare the prototype

    to related work. With a measured energy per task of 678.9 pJ and energy per

    equivalent instruction of 0.44 pJ this system is the lowest energy processor cur-

    rently available for WSNs.

    Analysis of Accelerator Speedup and Energy Savings: We isolate the benefits

    of accelerator based computing by comparing hardware and software implemen-

    tations of the routines expressed by the accelerators. The results show a 15x to

    635x performance speedup and a 10x to 600x energy savings, depending on the

    routine.

    Comparison to General Purpose Designs through Workload Analysis with Voltage

    and Frequency Scaling (VFS): We compare our system against a general

    purpose design while sweeping workload intensity. Voltage and frequency scal-

    ing and VDD-gating are included in the analysis. The results show that the

    architecture is well-suited for low duty cycle applications and at the same time

    can provide more performance for high intensity workloads than general purpose

    designs.

    This work provides both a high-level justification for accelerator-based architec-

    tures and a case study built from the ground up. The work concludes with a discus-

    sion of some of the open research questions in this area and a description of current

    research efforts.

  • Chapter 2

    Navigo: A Model to Study Power-Constrained Architectures and Specialization

    Contents

    2.1 Navigo: A Model for Performance Trends in Future Technologies

    2.1.1 Modeling Methodology and Sample Libraries

    2.2 Power-constrained Performance for Multi-core

    2.2.1 Results without Power Constraints

    2.2.2 Results with Power Constraints

    2.3 Validating the Model

    2.4 Modeling Specialization

    2.4.1 Variant of Amdahl's Law for Specialization

    2.4.2 Examples of Specialized Cores

    2.5 Model Limitations and Future Directions

    Given the technology scaling trends and market requirements presented in Sec-

    tion 1.1, it is important for chip architects to understand the limitations of homoge-

    neous parallelism and to consider more radical architectural approaches. This chapter

    presents Navigo, a model that incorporates technology scaling effects to predict future

    power-constrained performance trends. For a variety of processor cores, circuit

    parameters, and market segments, Navigo can be used to predict performance trends and

    shortfalls from the historical growth rate. Future designs that seek to bridge this gap

    must more effectively utilize switching events through specialized hardware. Special-

    ization hardware can take many forms [11, 29, 36, 38] including programmable SIMD

    units, hardcoded ASIC cores, or reconfigurable logic, and Navigo includes a general

    analytical model that can capture the impact of parallel specialization on power-

    constrained performance gains. This model projects the amount of specialization,

    quantified in terms of several parameters, that will be required in future technology

    generations to meet the historical performance scaling trends.

    In addressing the problem of power-constrained performance scalability, the chap-

    ter makes the following contributions:

    We describe Navigo (Section 2.1), a model incorporating technology scaling,

    circuit design parameters, and architectural design decisions into a high-level

    model to facilitate understanding the impact of power-constrained performance.

    We use Navigo to understand a large design space of input parameters (Sec-

    tion 2.2).

    We extend Navigo to model parallelizable specialization hardware (Section 2.4),

    introducing additional parameters to quantify specialization benefits and power/area

    costs. This model demonstrates that in order to maintain historical performance

    growth, we must increase the amount of specialization for each technology gen-

    eration.

    2.1 Navigo: A Model for Performance Trends in

    Future Technologies

    Trends in process technology scaling, predicted by the International Technology

    Roadmap for Semiconductors (ITRS), consider a variety of factors that affect the

    performance scalability of future computing systems. Designers can no longer rely on

    the next technology node to increase circuit performance and reduce energy consump-

    tion. Constant-field scaling (or Dennard scaling [64]) has run out with limits imposed

    on how aggressively one can reduce transistor threshold voltages (Vth) and supply

    voltage (Vdd). The dramatic increase in leakage current has effectively flattened out

    Vth scaling for planar CMOS technologies such that supply voltage scaling has also

    slowed down. While technology continues to reduce transistor size, wire parasitics

    are getting worse after a short respite gained by moving to copper. Lastly, the power

    ceiling imposed by cooling costs and battery life further limits the performance gains

    traditionally offered by technology scaling. In short, the landscape of processor design

    has changed dramatically since the end of the twentieth century. It is imperative to

    arm future designers with tools that can navigate through the complex interactions of

    future process technology scaling trends on architectural and circuit design choices,

    coupled with power budget limitations imposed across different market segments. To

    this end, we present Navigo, a detailed model that incorporates the effects of process

    technology, circuits, architecture, and market to predict future processor performance

    trends.

    Figure 2.1: Graphical depiction of Navigo. The model accepts library files for process technology, circuits, architecture, and market segments, and computes total and constrained power for a set of user-defined inputs such as supply voltage, frequency, etc. (Diagram: the library inputs are process technology (ITRS), circuits (HSPICE), architecture (general purpose and specialized cores), and market constraints (server, desktop, mobile, WSN, etc.); the user-defined inputs are technology node, Vdd (nominal, min), frequency, number and type of cores, and market selection; the outputs are throughput and power.)

    This section begins with a high-level overview of Navigo, which outlines the basic

    goals and assumptions made. Then, it describes the inner details of the model,

    revealing how it can be used by designers in early stages of design to help guide

    high-level system and architectural design decisions.

    Navigo provides designers with a powerful and flexible tool to navigate the in-

    tricate tradeoffs between process technology, circuits, and architecture, in order to

    predict their implications on performance in future processor designs. Figure 2.1

    presents a high-level graphical representation of Navigo. The model takes in a vari-

    ety of input libraries, which quantify detailed parameters corresponding to process

    technology, circuit performance, architecture, and market segment constraints. While

    each of these libraries can be modified by the user, Navigo includes built-in libraries

    based on ITRS technology scaling predictions out to 11nm (anticipated in 2022),

    predictive technology models (PTM) [47, 67], IPCs of currently available processor cores

    (based on SPECint2006 scores), and high-level power and area constraints for differ-

    ent market segments. With the libraries in place, the designer can sweep a variety

    of input parameters such as technology node, voltage, frequency, target market, etc.

    Navigo then outputs the total system throughput and power. The user can then

    refine her design by iterating through different input parameters to meet a specific

    throughput and/or power target.

    2.1.1 Modeling Methodology and Sample Libraries

    An engine that takes the various libraries and input sweep parameters to calcu-

    late throughput and power consumption is at the core of Navigo. This engine must

    consider a variety of factors such as the number and characteristics of computational

    blocks (i.e. cores), voltage and frequency scaling, wire loading, leakage power, and

    process technology, all constrained by power budget limitations. All of these factors

    are quantified by the different library parameters.

    The process technology library quantifies several parameters and characteristics

    utilized by Navigo, which are listed in Table 2.1. These parameters set the basic

    device and wire characteristics that Navigo uses to determine circuit speed, power,

    and the number of cores that will be available in future technology nodes. The built-in

    process technology library uses published data from ITRS 2007 [50] out to the

    11nm technology node anticipated in year 2022. ITRS predicts double gate technology

    will supplant planar bulk devices at the 32nm node in year 2013. ITRS is a

    predictive roadmap based on current projections of technology, and

    the semiconductor industry has a history of either under- or out-performing it.

    For example, Intel's technology roadmap is more aggressive, with processors at the

    45nm node already shipping and plans to introduce processors on the 32nm node in

    late 2009. Hence, this library can be readily modified by the user to better reflect

    updated ITRS projections or proprietary information if available. Table 2.2 compares

    technology trends up to 1999 described by Borkar [2] to ITRS 2007 predictions, which

    reveals a divergence in power density. This departure from traditional constant-field

    scaling affects frequency and voltage scaling in future designs, which we thoroughly

    explore in Section 2.2. Throughout the rest of this chapter, we rely on technology

    predictions made by ITRS 2007.

    Year of Production                 2007    2010    2013    2016    2019    2022
    Device type                        Planar Bulk     Double Gate
    Approximate node (nm)              65      45      32      22      16      11
    Supply Voltage (V)                 1.1     1.0     0.9     0.8     0.7     0.65
    Physical Gate Length (nm)          25      18      13      9       6.3     4.5
    Id sat (uA/um)                     1211    1807    2204    2627    2768    2786
    Intrinsic delay (ps)               0.64    0.46    0.26    0.15    0.1     0.08
    Intrinsic switching energy (fJ)    0.064   0.045   0.020   0.0085  0.0037  0.0020
    RC delay of 1mm wire (ps)          890     2100    4555    10652   23515   58525
    Die Size-Server (mm2)              310     310     310     310     310     310
    Number of Transistors (M)          1106    2212    4424    8848    17696   35391

    Table 2.1: Predicted Process Technology Characteristics. High-Performance Microprocessor Technology, ITRS 2007 Edition [50].

    Source                   Transistor   Energy per   Active   Area    Power
                             Delay        Switch       Power            Density
    Borkar99 [2]             0.70         0.34         0.49     0.49    1.00
    ITRS07 (average) [50]    0.67         0.51         0.76     0.50    1.53

    Table 2.2: Technology Scaling Factors. High-Performance Microprocessor Logic. Indicates a departure from historical scaling trends resulting in an increase in power density. [50]
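
    The power density divergence in Table 2.2 can be checked from the other columns. The sketch below is our own reading of how the columns relate, not a formula stated in the table: it assumes a design is shrunk by one generation and clocked faster in proportion to the delay improvement, so that active power scales as energy per switch divided by delay, and power density as active power divided by area. The function names are hypothetical.

```python
# Per-generation scaling factors copied from Table 2.2.
SCALING = {
    "Borkar99": {"delay": 0.70, "energy": 0.34, "area": 0.49},
    "ITRS07": {"delay": 0.67, "energy": 0.51, "area": 0.50},
}


def active_power_factor(delay: float, energy: float) -> float:
    """Frequency scales as 1/delay, so active power scales as energy/delay."""
    return energy / delay


def power_density_factor(delay: float, energy: float, area: float) -> float:
    """Active power scaling divided by area scaling."""
    return active_power_factor(delay, energy) / area


if __name__ == "__main__":
    for source, s in SCALING.items():
        power = active_power_factor(s["delay"], s["energy"])
        density = power_density_factor(s["delay"], s["energy"], s["area"])
        print(f"{source}: active power x{power:.2f}, power density x{density:.2f}")
```

    Under this reading, the historical factors keep power density roughly constant, while the ITRS 2007 factors raise it by about half per generation, matching the last column of the table to within rounding.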

    The circuits library utilizes predictive technology models (PTM) [47, 67], available

    from the 45nm node down to 16nm, to model how power and frequency scale with

    supply voltage and different amounts of wire parasitics. In the absence of detailed cir-

    cuit blocks that can be simulated, we rely on HSPICE simulations of fanout-of-4 ring

    oscillators across the technologies to determine basic frequency, power, and voltage

    trends. We combine ITRS predictions with PTM-based simulations to extrapolate

    trends at the 11nm node. These trends allow Navigo to scale voltage and frequency to

    meet different power budgets. It is also important to consider the effects of imposing

    minimum voltage (VddMIN) constraints, since allowing arbitrary reductions in supply

    voltage can lead to a variety of issues, including six-transistor SRAM cell instability

    [65] and exacerbation of on-chip voltage noise. Again, the circuits library can

    be modified by the user to model specific blocks if available.
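
    To make the role of VddMIN concrete, the sketch below picks the fastest operating point that fits a power budget without dropping below a minimum supply voltage. It is a simplified illustration, not Navigo's implementation: the scaling table is hypothetical (real values would come from the PTM-based HSPICE simulations), and the names OperatingPoint and pick_operating_point are ours.

```python
from typing import NamedTuple, Optional, Sequence


class OperatingPoint(NamedTuple):
    vdd: float          # supply voltage in volts
    freq_scale: float   # frequency relative to the nominal-voltage frequency
    power_scale: float  # power relative to the nominal-voltage power


# Hypothetical scaling table; these numbers are placeholders, not simulation data.
EXAMPLE_POINTS = [
    OperatingPoint(1.0, 1.00, 1.00),
    OperatingPoint(0.9, 0.85, 0.69),
    OperatingPoint(0.8, 0.68, 0.44),
    OperatingPoint(0.7, 0.48, 0.24),
    OperatingPoint(0.6, 0.28, 0.10),
]


def pick_operating_point(
    points: Sequence[OperatingPoint],
    nominal_power_w: float,
    budget_w: float,
    vdd_min: float,
) -> Optional[OperatingPoint]:
    """Return the fastest point that meets the budget and the VddMIN floor,
    or None when no allowed voltage fits the budget."""
    feasible = [
        p for p in points
        if p.vdd >= vdd_min and nominal_power_w * p.power_scale <= budget_w
    ]
    return max(feasible, key=lambda p: p.freq_scale, default=None)


if __name__ == "__main__":
    print(pick_operating_point(EXAMPLE_POINTS, 60.0, 35.0, vdd_min=0.7))
    print(pick_operating_point(EXAMPLE_POINTS, 60.0, 10.0, vdd_min=0.7))
```

    With these placeholder numbers, a 60 W design fits a 35 W budget at 0.8 V, while a 10 W budget cannot be met at all once the 0.7 V floor is enforced. In that situation, voltage scaling alone is not enough and some other change, such as fewer active cores, is needed.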

    The architecture library contains a collection of processor cores that the user can

    choose to tile together in future multi-core systems. The built-in architecture

    library consists of three cores currently in production, listed in Table 2.3. These cores,

    Intel Xeon (Netburst), Intel Core2Duo (Core), and Intel Atom, represent high-end

    server, desktop, and mobile CPUs. We plan to include analysis for processors such

    as Intel's Core i7, as detailed information becomes available. Parameters for the

    processors were obtained from publications, and SPEC scores for Xeon

    and Core2Duo from spec.org. Since official SPEC results are not available for Atom, we

    extrapolate based on benchmark comparisons between Atom and an Athlon with known

    SPEC scores [57]. While different processors have been implemented with different

    technologies, the power, performance, and area of each core are appropriately scaled by

    Navigo utilizing the process technology and circuits trends prescribed by their

    respective libraries. The user is not limited to these cores and can also add

    user-defined cores to the architecture library. For example, Section 2.4 explores the

    impact of specialized cores.

    Processor                   Tech    Die Size   Cores   Vdd     Freq    Power   IPC
                                (nm)    (mm2)              (V)     (GHz)   (W)     (SPEC06/GHz)
    Intel Xeon (Tulsa) [18]     65      435        2       1.25    3.4     110     3.72
    Intel Core2Duo (Wolfdale)   45      107        2       1.36    3       65      6.82
    Intel Atom [12]             45      25         1       1.0     2.0     2.0     2.35

    Table 2.3: Example Cores used in analysis. Data collected from conference and journal publications and datasheets. SPEC2006 results used to determine IPC are from spec.org.

    The market segment library identifies different market segment targets that

    constrain total area and maximum power. Table 2.4 lists examples of different market

    segments. Throughout the rest of the chapter, we focus on two particular market

    segments: server and mobile. The server market allows for a maximum area of

    310mm2 and maximum power of 198W as defined by ITRS. In contrast, the mobile

    market allows for a maximum area of 100mm2 and maximum power of 35W. Again,

    different market segments and/or constraints can be easily defined by the user via

    changes to the library.

    Market                                   Max Power (W)   Die Area (mm2)
    MPU-CP Cost and Performance              151             140
    MPU-HP High Performance                  198             310
    MPU-PCC Power Cost and Connectivity      3               70
    Desktop-95                               95              100
    Desktop-65                               65              100
    Mobile Standard Voltage                  35              100
    Mobile Ultra-low Voltage                 10              100

    Table 2.4: Market Segment Constraints. Die size and Max Power Consumption for a set of market segments. Values for the first three markets came from ITRS [50]. The final four market segments are based on die size and thermal design point of commercially available Intel Processors.

    Finally, Navigo's engine computes total throughput as follows:

    \[ \mathrm{Throughput} = N_{cores} \times \mathit{freq}(V_{dd}, tech) \times IPC_{core} \tag{2.1} \]

    where the number of cores, Ncores, is defined by the total die size (for a target market

    segment) divided by the area of the chosen core, scaled to the technology node. The IPC of each

    core can be derived from published (or simulated for new cores) SPEC benchmark

    results and clock frequency of the core. Operating frequency depends both on process

    technology and voltage, and is calculated based on the original frequency published

    for the core. First, Navigo calculates the maximum frequency of the core for nominal

    voltage in the new technology. We incorporate both the intrinsic switching delay of the

    transistor and effects due to wire delay scaling. We scale logic and wires independently

    because the projected trends follow competing directions and are modeled separately


    in ITRS.

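    $$freq_{VddNom} = freq_{core_{basetech}} \times \left( frac_{logic} \cdot \frac{freq_{switch_{tech}}}{freq_{switch_{basetech}}} + frac_{wire} \cdot \frac{freq_{wire_{tech}}}{freq_{wire_{basetech}}} \right) \qquad (2.2)$$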

    where basetech is the original technology in which the core was fabricated. The nom-

    inal frequency is then multiplied by PTM-based scaling factors to calculate voltage-

    specific frequencies.
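    Equations 2.1 and 2.2 can be sketched in a few lines of Python. This is a minimal illustration, not Navigo's implementation: the function names are invented here, the technology ratios are passed in directly rather than looked up in a technology library, and every numeric input in the example is a made-up placeholder.

```python
def nominal_frequency(freq_core_base, frac_logic, frac_wire,
                      switch_ratio, wire_ratio):
    """Equation 2.2: scale a core's published frequency to a new node.

    switch_ratio = freq_switch(tech) / freq_switch(basetech)
    wire_ratio   = freq_wire(tech)   / freq_wire(basetech)
    """
    return freq_core_base * (frac_logic * switch_ratio + frac_wire * wire_ratio)


def core_count(die_area_mm2, scaled_core_area_mm2):
    """Number of cores: die area divided by the technology-scaled core area."""
    return die_area_mm2 / scaled_core_area_mm2


def throughput(n_cores, freq, ipc_core):
    """Equation 2.1: total chip throughput."""
    return n_cores * freq * ipc_core


# Hypothetical inputs, chosen only to exercise the formulas.
freq_new = nominal_frequency(freq_core_base=3.0, frac_logic=0.8, frac_wire=0.2,
                             switch_ratio=1.5, wire_ratio=0.9)
n_cores = core_count(die_area_mm2=310.0, scaled_core_area_mm2=31.0)
total = throughput(n_cores, freq_new, ipc_core=1.2)
print(f"freq = {freq_new:.2f} GHz, cores = {n_cores:.1f}, throughput = {total:.2f}")
```

    Keeping the logic and wire ratios as separate arguments mirrors the text's point that the two trends are modeled independently.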

    Power depends on voltage, operating frequency, and the transistor switching rate

    of the architecture. We model average power with the following expression:

    $$P_{avg} = P_{active} + P_{leak} = freq \times (E_{switch} \times N_{switching} + E_{wire}) + P_{leak} \qquad (2.3)$$

    Traditionally, power consumption is modeled as the sum of active power and leakage

    power. Navigo computes active power as the number of transistor

    switches per second multiplied by the energy per switch. We calculate switching rate

    (Nswitching) from published frequency and power numbers. Since energy per switch

    (Eswitch) is technology dependent, it scales based on voltage-dependent scaling factors

    derived from HSPICE simulations for each technology node. Wires scale differently

    from transistors and, hence, are separately accounted for. We assume leakage power

    remains a fixed percentage of the total power consumption at maximum frequency

    and nominal voltage, which then scales with respect to different operating voltage

    levels. In order to accommodate different power budgets prescribed by different mar-

    ket segments, Navigo iterates through voltage and frequency settings until a specific

    power target is met. When the model reaches the VddMIN constraint, it reduces power by scaling

    frequency alone, at the expense of energy efficiency.
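    The power model of Equation 2.3 can be written out as follows. This is a sketch under stated assumptions: the function names and the example values are hypothetical, and the leakage helper simply solves "leakage is a fixed fraction of total power" for the leakage term; the voltage-dependent scaling of energy and leakage described above is not modeled here.

```python
def active_power_w(freq_hz, e_switch_j, n_switching, e_wire_j):
    """Active term of Equation 2.3: freq * (E_switch * N_switching + E_wire)."""
    return freq_hz * (e_switch_j * n_switching + e_wire_j)


def switching_rate(active_w, freq_hz, e_switch_j, e_wire_j):
    """Invert the active term to recover N_switching from published numbers."""
    return (active_w / freq_hz - e_wire_j) / e_switch_j


def leakage_power_w(active_at_max_w, leak_fraction):
    """Leakage as a fixed fraction of total power at max frequency.

    total = active + leak and leak = leak_fraction * total, hence
    leak = active * leak_fraction / (1 - leak_fraction).
    """
    if not 0.0 <= leak_fraction < 1.0:
        raise ValueError("leak_fraction must be in [0, 1)")
    return active_at_max_w * leak_fraction / (1.0 - leak_fraction)


# Hypothetical values, for illustration only.
p_active = active_power_w(freq_hz=2e9, e_switch_j=1e-15,
                          n_switching=1e7, e_wire_j=5e-9)
p_leak = leakage_power_w(p_active, leak_fraction=0.25)
p_avg = p_active + p_leak
print(f"P_active = {p_active:.1f} W, P_leak = {p_leak:.1f} W, P_avg = {p_avg:.1f} W")
```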


    While Navigo seeks to combine a variety of factors to accurately predict future

    performance, it makes several optimistic assumptions. First, it may not be feasi-

    ble to fit an integer number of cores into a predefined area. Hence, we allow for

    half-size cores with IPC and power that scale linearly by one half. Although this sce-

    nario is infeasible, for near-term technologies (e.g. 45nm), large area cores introduce

    quantization effects that make it difficult to observe consistent trends. This effect

    becomes significantly less important as we scale to more advanced technologies. Sec-

    ond, future multi- and many-core systems will face a variety of challenges to enable

    core-to-core communications. Navigo optimistically assumes a perfect on-chip inter-

    connection network. Lastly, and perhaps most important, we assume workloads can

    be fully parallelized to keep all cores running continuously. Hence, the model is orthogonal

    to Hill's investigation that compares single-threaded versus multi-threaded

    parallelism [27]. One of the main objectives of developing Navigo was to provide a

    detailed and yet flexible model to help designers predict performance trends and guide

    future designs. Moreover, we use Navigo to show that, even under the optimistic assumption

    of perfectly parallel workloads running on highly parallel many-core designs, power

    constraints will hamper performance growth and motivate designers to seek out new

    solutions beyond simply increasing the number of cores on a die.
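    The half-size-core assumption mentioned above can be illustrated with a short sketch. The rounding rule (round down to the nearest half core) and the function name are assumptions made for this example; the text does not specify how Navigo rounds.

```python
import math


def quantized_core_count(die_area_mm2, core_area_mm2, step=0.5):
    """Largest multiple of `step` cores that fits in the die area.

    step=1.0 allows only whole cores; step=0.5 also allows half-size cores,
    whose IPC and power are assumed to scale linearly by one half.
    """
    return math.floor(die_area_mm2 / core_area_mm2 / step) * step


# Hypothetical 56 mm^2 core on the 310 mm^2 server die.
whole = quantized_core_count(310, 56, step=1.0)
halves = quantized_core_count(310, 56, step=0.5)
print(f"whole cores only: {whole}, with half cores: {halves}")
```

    With large cores, allowing half steps recovers area that whole-core rounding would leave unused, which is why the quantization effect matters most at near-term nodes.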

    2.2 Power-constrained Performance for Multi-core

    Navigo can be used to understand power-constrained performance scalability across

    technology generations. In this section, we demonstrate the utility of Navigo by exploring

    the scalability of three classes of CPU architectures when considering power-constrained

    market segments (Table 2.4) and the impact of the minimum supply voltage

    constraint.

    For each of these explorations, we make several assumptions. First, we assume

    that area and power will be fixed by the market segment. More advanced technology

    nodes provide an increase in the number of available transistors leading to a doubling

    of available cores per technology generation; however, frequency benefits will be con-

    strained by power limits. If the power budget is exceeded for a given number of cores

    and clock frequency, we scale voltage and frequency down to meet the power bud-

    gets, subject to circuit constraints on the supply voltage, after which linear frequency

    scaling is utilized.
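    The budget-fitting procedure described above can be sketched as a simple search over operating points. Everything specific here is an assumption for illustration: the function name, the table of (Vdd, frequency scale, energy scale) points, and the simplification that power is proportional to frequency times energy per switch, with leakage ignored.

```python
def fit_power_budget(nominal_power_w, budget_w, operating_points):
    """Pick the highest-voltage operating point that meets the power budget.

    operating_points: list of (vdd, freq_scale, energy_scale) tuples sorted
    from nominal voltage down to VddMIN (the last entry). If even VddMIN
    exceeds the budget, frequency alone is reduced linearly at VddMIN.
    Returns (vdd, freq_scale, power_w).
    """
    if not operating_points:
        raise ValueError("at least one operating point is required")
    for vdd, freq_scale, energy_scale in operating_points:
        power_w = nominal_power_w * freq_scale * energy_scale
        if power_w <= budget_w:
            return vdd, freq_scale, power_w
    # VddMIN reached: voltage cannot drop further, so throttle frequency only.
    throttle = budget_w / power_w
    return vdd, freq_scale * throttle, budget_w


# Hypothetical scaling factors; real ones would come from circuit simulation.
POINTS = [(1.0, 1.00, 1.00), (0.9, 0.85, 0.81), (0.8, 0.70, 0.64)]

print(fit_power_budget(400.0, 198.0, POINTS))  # voltage scaling suffices
print(fit_power_budget(800.0, 198.0, POINTS))  # falls back to frequency-only
```

    In the second call, frequency drops in proportion to the power overshoot while energy per operation stays fixed, which is the inefficiency the VddMIN constraint introduces.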

    2.2.1 Results without Power Constraints

    To understand the impact of power constraints on scaling, we first consider the

    scenario where power is not a design constraint. Figure 2.2 illustrates this scenario with

    four sub-figures showing various outputs of the model when scaled across technology

    nodes for a fixed area budget of 310 mm2. The four sub-figures quantify, across

    the three core types, the number of cores, clock frequency, total power, and total

    chip throughput. Without power constraints, all metrics scale up with technology.

    Figure 2.2(a) shows that the number of Core2Duo cores starts at around 6 in the

    45nm node (recall that core count is scaled to meet the 310 mm2 budget), scaling

    to 93 cores by 11nm. Without power limitations, frequency scaling continues un-

    abated surpassing 19.12 GHz for the Xeon core in 11nm, but this comes at the price

    of increased power dissipation, exceeding a kilowatt in the worst case. Figure 2.2(d)

    [Figure 2.2, panels (a) Number of cores, (b) Frequency (GHz), (c) Total power (W), (d) Throughput; each plotted against technology node (45 nm, 32 nm, 22 nm, 16 nm, 11 nm) for the Atom, Xeon, and Core 2 cores. Panel (d) also shows ideal 1.58x/year and 1.35x/year (Core2) curves.]

    Figure 2.2: Results without power constraints across process technologies. Results assume nominal voltage for specified technology and MPU-HP market segment with a die size of 310 mm2.

    plots total chip throughput relative to the Core2Duo from the 45nm technology node,

    as calculated by increasing the core count along with frequency improvement. The

    throughput improvement increases at a slightly lower rate than the historical growth

    rate of 1.58x per year. This shows that if power were not a constraint, performance growth could

    be achieved through a combination of traditional frequency scaling and multi-core

    design.

    2.2.2 Results with Power Constraints

    Incorporating power constraints into our analysis gives a more realistic picture of expected

    trends in future technologies. We show that for market segments that tolerate higher

    power density systems, scaling trends are better compared to more constrained market

    segments. In this section, we compare the server market segment, which uses the same

    310 mm2 die with a power limit of 198W, and the mobile market segment, which uses

    a 100 mm2 die with a power limit of 35W. Figure 2.3 and Figure 2.4 plot the server

    and mobile market segment scalability analysis across the three core types. Each

    plot shows the required supply voltage, clock frequency, total power, and total chip

    throughput.

    Focusing on the results for the server market segment, we observe several impor-

    tant trends. For the Intel Xeon design, power is already constrained at the 45nm

    technology node, and the design must reduce supply voltage from nominal in order

    to meet the power goal. When moving to the 32nm node, the Xeon is able to achieve

    a small frequency increase by operating at the minimum supply voltage. Beyond

    32nm, the Xeon frequency reduces slightly and then flattens out as the power budget

    [Figure 2.3, panels (a) Vdd (V), (b) Frequency (GHz), (c) Total power (W), (d) Throughput; each plotted against technology node (45 nm, 32 nm, 22 nm, 16 nm, 11 nm) for the Atom, Xeon, and Core 2 cores. Panel (d) also shows ideal 1.58x/year and 1.35x/year (Core2) curves.]

    Figure 2.3: Results with power constraints across process technologies - Server. Results assume nominal voltage for specified technology and MPU-HP market segment with a die size of 310 mm2 and max power of 198 W.

    [Figure 2.4, panels (a) Vdd (V), (b) Frequency (GHz), (c) Total power (W), (d) Throughput; each plotted against technology node (45 nm, 32 nm, 22 nm, 16 nm, 11 nm) for the Atom, Xeon, and Core 2 cores. Panel (d) also shows ideal 1.58x/year and 1.35x/year (Core2) curves.]

    Figure 2.4: Results with power constraints across process technologies - Mobile. Results assume nominal voltage for specified technology and Mobile market segment with a die size of 100 mm2 and max power of 35 W. Vdd is limited to VddMIN.

    is soaked up by additional cores. In contrast, the Intel Core2Duo design allows full

    frequency scaling until the 22nm technology node, after which scaling is curtailed;

    in 11nm, frequency must be throttled when adding more cores. The Intel Atom

    core is much more power-efficient and can continue to scale frequency until 11nm,

    with additional power headroom. However, Atom starts with a significant perfor-

    mance disadvantage compared to Core2Duo, and hence by 11nm, the Core2Duo and

    Atom roughly converge on total throughput. The throughput of the best designs (Atom and

    Core2Duo) grows at a rate of 1.35x per year, which by 11nm leaves them nearly 6.6x

    below the 1.58x per year curve.
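    The size of this gap follows from compounding the two growth rates. The short check below assumes a 12-year span between the 45nm and 11nm nodes; that span is an assumption of this example, chosen because it reproduces the quoted factor, and is not stated in the text.

```python
def cumulative_gap(ideal_rate, actual_rate, years):
    """Ratio between two compounding growth curves after `years` years."""
    return (ideal_rate / actual_rate) ** years


# 1.58x/year historical trend versus 1.35x/year projected multi-core trend.
gap = cumulative_gap(ideal_rate=1.58, actual_rate=1.35, years=12)
print(f"gap after 12 years: {gap:.1f}x")
```

    A difference of roughly 17% per year therefore compounds into a gap of more than six times over the assumed span.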

    The mobile market segment, shown in Figure 2.4, exhibits similar trends, but the

    tighter power constraints result in more severe reductions in clock frequency and

    slower overall per-year throughput growth. For example, the Core2Duo hits a

    frequency cap around 32nm, and frequency flatlines until 16nm, when it dips slightly.

    Even the Atom processor reaches its power cap at 16nm, after which frequency also dips to

    maintain the power budget.

    An important issue that we see repeatedly throughout the above scenarios is that the

    minimum Vdd constraint is reached as we seek to fit designs with many cores into fixed

    power budgets by reducing voltage and clock frequency. When a design reaches

    this constraint, additional power reduction can only be achieved through inefficient

    frequency scaling: an essentially linear reduction in clock frequency offsets the additional

    cores. Practically speaking, designers may prefer to simply stop scaling the num-

    ber of cores in a system at this point. In order to understand this effect, we have

    run additional simulations with the constraint removed; the results are shown in

    [Figure 2.5, panels (a) Vdd (V), (b) Frequency (GHz), (c) Total power (W), (d) Throughput; each plotted against technology node (45 nm, 32 nm, 22 nm, 16 nm, 11 nm) for the Atom, Xeon, and Core 2 cores. Panel (d) also shows ideal 1.58x/year and 1.35x/year (Core2) curves.]

    Figure 2.5: Results with power constraints across process technologies without VddMIN constraints - Mobile. Results assume nominal voltage for specified technology and Mobile market segment with a die size of 100 mm2 and max power of 35 W. Vdd can be reduced without a lower limit.

    Figure 2.5. We significantly reduce VDD to meet the power constraints set by the

    market, to as low as 0.6V for the Xeon and Core2Duo microarchitectures in advanced

    technologies. There is a clear loss in throughput for systems under minimum

    VDD constraints (Figure 2.4 (d)) compared to systems without minimum VDD constraints

    (Figure 2.5 (d)). For the Atom processor, minimum Vdd is not a severe issue.

    For the mobile market segment in the 11nm node, scaling VDD reduces throughput

    by 13.4%. However, the minimum voltage constraint reduces the throughput of the

    Xeon core by 57.6% for the same target. Even without this constraint, the Xeon still

    performs poorly compared to the more power-efficient cores, because running at very

    low voltage does not provide ideal performance.

    2.3 Validating the Model

    This section presents a back-validation of Navigo for microprocessors built from

    1996 to 2007. Because of the predictive nature of the model, it is difficult to validate

    Navigo's predictions of the power and performance of microprocessors built using

    future process technologies. Therefore, we validate Navigo, starting from an initial

    data point from 1996, against microprocessors manufactured over the last 10 years.

    For validation, we seeded the microarchitecture library with the DEC Alpha 21164

    microprocessor, introduced in 1996 and manufactured in 350nm technology. We de-

    veloped the technology and circuits library based on ITRS data from 1997 to 2007

    and circuit simulation results, using SPICE models from industry and PTM. In 1997,

    the ITRS committee did not anticipate the growth in power density that started with

    the 180nm technology node. Therefore, for each node, we chose the technology model


    CPU              Year   Node (nm)   Die Size (mm2)   Throughput   Freq (GHz)   Power (W)
    Alpha 21164      1996      350           210              481        0.5          31
    Alpha 21164      1997      350           141              649        0.6          40
    Alpha 21264      1998      350           314              993        0.6          73
    Alpha 21264A     1999      250           210             1267        0.7          85
    Pentium III      2000      180           106             1779        1.0          29
    Athlon           2001      180           130             2584        1.6          68
    Pentium 4        2002      130           146             4195        3.0          81.8
    Opteron          2003      130           193             5364        2.2          89
    Xeon             2004      130           237             5764        3.6          92
    64-bit Xeon      2005       90            81             6505        3.6         110
    Core 2 Extreme   2006       65           143            17909        2.93         75
    Xeon 3085        2007       65           143            23207        3            65
    POWER6           2007       65           341            35071        4.7         180

    Table 2.5: Select Microprocessors from 1996 to 2007. Performance data is from the analysis in Figure 1.1. Power consumption and die size data was acquired from datasheets and published microprocessor reports.

    from the ITRS year closest to date of introduction. This technique isolates the error

    in ITRS predictions from the modeling framework.

    We compare predictions from Navigo with microprocessors manufactured between

    1996 and 2007, shown in Table 2.5. We calculate throughput from the same Hennessy

    and Patterson and SPECint2006 benchmark data used to develop Figure 1.1,

    described in Section 1.1.1. We gathered power consumption data from datasheets

    and online microprocessor reports. The die sizes of the microprocessors vary widely;

    therefore, we compare throughput per unit area and power per unit area.
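    The normalization is straightforward to reproduce from Table 2.5. The sketch below copies the table rows and divides throughput and power by die size; the variable and function names are illustrative, and the die-size column is assumed to be in mm2.

```python
# (cpu, year, node_nm, die_mm2, throughput, freq_ghz, power_w), from Table 2.5
TABLE_2_5 = [
    ("Alpha 21164", 1996, 350, 210, 481, 0.5, 31),
    ("Alpha 21164", 1997, 350, 141, 649, 0.6, 40),
    ("Alpha 21264", 1998, 350, 314, 993, 0.6, 73),
    ("Alpha 21264A", 1999, 250, 210, 1267, 0.7, 85),
    ("Pentium III", 2000, 180, 106, 1779, 1.0, 29),
    ("Athlon", 2001, 180, 130, 2584, 1.6, 68),
    ("Pentium 4", 2002, 130, 146, 4195, 3.0, 81.8),
    ("Opteron", 2003, 130, 193, 5364, 2.2, 89),
    ("Xeon", 2004, 130, 237, 5764, 3.6, 92),
    ("64-bit Xeon", 2005, 90, 81, 6505, 3.6, 110),
    ("Core 2 Extreme", 2006, 65, 143, 17909, 2.93, 75),
    ("Xeon 3085", 2007, 65, 143, 23207, 3.0, 65),
    ("POWER6", 2007, 65, 341, 35071, 4.7, 180),
]


def per_area_metrics(row):
    """Return (throughput per mm^2, power density in mW per mm^2)."""
    _cpu, _year, _node, die_mm2, thr, _freq, power_w = row
    return thr / die_mm2, 1000.0 * power_w / die_mm2


for row in TABLE_2_5:
    thr_area, pwr_area = per_area_metrics(row)
    print(f"{row[0]:<15} {row[1]} {row[2]:>4} nm "
          f"{thr_area:8.2f} /mm2 {pwr_area:7.1f} mW/mm2")
```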

    Figure 2.6 (a) presents a comparison of throughput per unit area predicted with

    Navigo and the throughput of commercially available microprocessors. The x-axis

    represents both technology node and year of introduction. The throughput of the

    [Figure 2.6, panels (a) Throughput/Area and (b) Power/Area (mW/mm2); each plotted against technology node and year (350nm 1997, 250nm 1999, 180nm 2000, 130nm 2002, 90nm 2005, 65nm 2006) for Navigo and for commercial microprocessors.]

    Figure 2.6: Validation of Navigo using Microprocessors from 1996 to 2007. Predicted results use the most recent ITRS technology models. The initial core model is an Alpha 21164 0.5 GHz in 350nm technology introduced in 1996. The data points representing commercially available systems are also presented in Table 2.5.

    initial core, Alpha 21164 0.5 GHz, matches the predictions from Navigo which reveals

    the absence of static offset errors in the model. The throughput predicted by Navigo

    aligns well with the results from the benchmarked microprocessors. Generally, Navigo

    estimates the upper bound of throughput per unit area. To combat increasing power

    consumption, designers of microprocessors in the 65nm node slowed the scaling of

    clock frequency and chose to design multi-core processors made of simpler cores.

    Navigo overestimates the throughput of multi-core designs because it assumes that

    the costs of communication and thread synchronization are zero.

    While Navigo predicts a general trend of increased power density, shown in Fig-

    ure 2.6 (b), it does not predict the drastic jump in power consumption caused by

    changes in microarchitecture, as it assumes a fixed core design. During the period be-


    tween 1997 and 2005, microarchitects aggressively pursued single-thread performance

    resulting in several high-throughput and high-power consumption designs. The deeply

    pipelined Netburst microarchitecture, manufactured in 130nm (Pentium 4 and Xeon),

    had notoriously high power consumption. Subsequently, the industry changed course

    and introduced more power efficient multi-core designs. The power consumption pre-

    dicted by Navigo matches the initial core Alpha 21164 in 350nm. Navigo also aligns

    well with the multi-core designs in the 65nm node, which utilize cores that have microarchitectures

    similar to the Alpha. The model correctly shows the transition between an

    earlier era, when constant field scaling was still possible and power density remained

    constant (350nm, 250nm, 180nm), and the current era of increasing power density.

    Our back-validation shows that Navigo predicts throughput well and points out

    general trends in power consumption. Navigo incorporates a static model of mi-

    croarchitecture, and thus for a more accurate prediction of power consumption, users

    should include cores in their libraries which best represent their target core design.

    2.4 Modeling Specialization

    Consistent progress towards smaller, faster, and more numerous transistors with

    each generation of process technology no longer yields the steady growth in comput-

    ing performance enjoyed throughout the 20th century. The power ceiling forced a

    right-hand turn in single-thread performance and CPU designers have been rac-

    ing to implement multi-core systems ever since. Unfortunately, Navigo predicts that

    even for the server market segment, multi-core scaling will only yield a 1.35x/year

    performance growth trend. In order to get back onto the 1.58x growth trend, design-


    ers must maximize the efficiency of transistor (and wire) switching. In other words,

    designers must minimize the overheads associated with a general-purpose (GP) CPU.

    One obvious direction is to replace general-purpose computing with dedicated, spe-

    cialized hardware that offers higher computation per unit area and power, for an

    increasing fraction of the machine's workload. IBM's CELL processor is one such

    example. It includes 8 SPEs, which are specialized cores used to speed up SIMD

    workloads [11]. Similarly, graphics processing units (GPUs) have been used extensively

    by programmers to speed up tasks related to video processing and other SIMD

    operations. Another example would be dedicated hardware specialized for

    H.264 decoding. In order to understand the potential benefits of specialization, this

    section introduces a parallel variant of Amdahl's Law for specialization. Then, by

    augmenting Navigo with specialization, we project the amount of specialization that

    will be required in future computing systems to increase system throughput by 1.58x

    per year.

    2.4.1 Variant of Amdahl's Law for Specialization

    Amdahl's Law is commonly used to describe the theoretical limitations of application

    speedup given constraints on the fraction of the workload that can be sped

    up.

    $$Speedup_{enhanced}(f, S) = \frac{1}{(1 - f) + \frac{f}{S}} \qquad (2.4)$$

    where f is the fraction of the workload that can be enhanced and S is the amount of

    speedup possible through enhancements. Amdahl's Law has been adapted to model

    symmetric and asymmetric multi-core systems [27], where parallel cores can execute

    [Figure 2.7, panels (a) Calculation framework and (b) Throughput vs f and S; panel (b) plots normalized throughput against fraction of workload (f) and speedup (S) on logarithmic axes.]

    Figure 2.7: Speeding up an application with specialized cores. A workload is split to an additional set of resources, the specialized core. The fraction of the application that can be executed on the specialized core is f, with a speedup of S.

    all workloads. With specialized cores, we must make a few assumptions in order to

    model speedup using Amdahl's Law. First, we assume special-purpose (SP) cores

    can only run specific parts of an application (f) while general-purpose cores can run

    the entire workload, albeit with lower efficiency. Second, we optimistically assume

    that workloads are arbitrarily parallelizable (also previously assumed in Navigo). T