
Generating the Next Wave of Custom Silicon

Borivoje Nikolić, Elad Alon, Krste Asanović
Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, USA
E-mail: {bora, elad, krste}@eecs.berkeley.edu

Abstract—Tidal waves in computing and communications have traditionally fueled the growth of the semiconductor industry. Mainframes were replaced by personal computers, followed by the proliferation of mobile telephones, each wave resulting in a dramatic increase in the volume of units shipped. The upcoming generation of computing does not have one clear product to drive the industry; rather, a diverse set of emerging applications is built on the interaction between edge devices and the cloud. Supporting differentiation amongst diverse products requires specialization of integrated circuits, which in turn requires a paradigm shift in the design of custom silicon. This paper outlines a vision to dramatically increase design reuse by focusing on developing digital and analog generators rather than specific instances of functional modules. The use of the open and extensible RISC-V instruction-set architecture enables customization with reduced software cost. Open-source chip generators amortize design and verification costs across many instances. Emulation of multi-processor systems running realistic workloads on public clouds validates design decisions at a dramatically reduced cost. The methodology is illustrated by the design of a complex system-on-a-chip.

Keywords—Energy efficiency, CMOS, microprocessor, digital signal processor, design-space exploration.

I. INTRODUCTION

Tidal waves in computing have driven the semiconductor industry over the past five decades (Fig. 1). While the development of mainframe computers fueled the start of the application-specific integrated circuit (ASIC) industry, the ecosystem around personal computers (PCs) enabled industry growth in the 1980s and 1990s. The PC wave was in turn replaced by the smartphone wave, whose volumes presently drive the industry. While there are early indications of a slowdown in smartphone shipments, there is no single dominant new device or platform emerging as a replacement. Instead, we are witnessing a diversification of computing needs and platforms. Each of the rapidly growing domains of entertainment, media, automotive, healthcare, and gaming is expected to be larger in volume than the entire computing domain from a decade ago, and each is driven by its own needs. PCs are based on x86 architectures and smartphones are based on ARM, and there is no indication that this will change in the foreseeable future. However, to satisfy the varied needs of new domains, flexible, shared, open platforms are needed, both to enable diversification and to reduce development cost by amortizing effort across domains.

Figure 1. Semiconductor industry waves.

Simultaneously, ASICs need to overcome the major challenge of exploding chip design costs. With the scaling of technology into single-digit-nanometer nodes, we have witnessed a doubling of design costs for leading chips in each technology node, and the widely quoted development costs in 14/16nm nodes range from tens to hundreds of millions of dollars. The majority of these costs are attributed to software development, verification, validation, and the engineering effort associated with design and respins. This needs to change to sustain the growth of the industry.

This paper outlines a vision and initial results for designing custom chips that will enable the next wave of computing platforms. The use of open, free instruction-set architectures facilitates diverse applications while maintaining core software compatibility. Designing generators, rather than instances, fosters reuse and technology portability. Open-source generators reduce the cost of implementing common features while allowing for product differentiation.

II. A PERSPECTIVE ON TECHNOLOGY SCALING

CMOS technology has enjoyed tremendous scaling over the past 50 years [1], resulting in repeated doubling of transistor counts in leading microprocessors (Fig. 2a). However, device-count increases have slowed recently, and performance has instead been added by designing larger chips (Fig. 2b). As a result, transistor density has increased only slowly in recent years (Fig. 2c). Both physical limitations and cost are constraining further logic density scaling. Even if logic stops scaling, memories are expected to continue scaling, as 3D cross-point arrays [2] are lithography friendly. Increasing die sizes coupled with higher defect densities in scaled technologies result in yield losses, and motivate the design of multi-die modules on an interposer [3], [4].



Figure 2. a) Transistor counts in microprocessors, b) die sizes, and c) transistor densities over the years.

Every modern deployed design is energy or power limited, whether it is a small mobile device or a massive compute server in the cloud. In every application domain, the complete design stack, including the underlying technologies, is tuned to maximize performance under energy limits [5]. Another clear trend is that non-recurring engineering (NRE) costs for custom silicon double with every new technology generation [6]. The cost is dominated by the effort of developing and porting software, verification and validation, acquisition and qualification of IP blocks, and physical design, and has been widely quoted to range between tens and hundreds of millions of USD in sub-20nm nodes.

III. SPECIALIZATION FOR ENERGY EFFICIENCY

Energy efficiency is achieved through hardware specialization, such that every function is implemented in an optimal way [7]. Dedicated hardware blocks optimized to perform a particular function achieve energy efficiency two to three orders of magnitude higher than that of general-purpose processors executing compiled high-level language code. Hardware optimized for a particular computing domain (such as a GPU) is more efficient within that domain, but requires programs written in a restricted environment. Investment in further specialization is currently justified only for applications that are especially valuable (e.g., TPU2 for machine learning) or widespread (e.g., MPEG encoding).

It has been explored whether a collection of specialized processors can be made general enough to run any application efficiently. Thirteen computational motifs have been identified as the key patterns represented in modern applications [8]. They include dense and sparse linear algebra as well as graph algorithms and graphical models. Some have well-defined datapath accelerators (such as vector processors for dense and sparse linear algebra), while others rely on enhanced data movement to accelerate the pattern. There is a persistent trend in the research community toward hyper-specialization, where a particular computational structure is highly optimized for the efficiency of one computational pattern of interest. It should be kept in mind, however, that Amdahl's Law implies that all parts of a system must be optimized to avoid efficiency bottlenecks, requiring pervasive specialization [9].

IV. HARDWARE GENERATORS

Optimal energy efficiency is achieved through specialization, but specialized hardware is costly to design, and therefore reuse is paramount. To enable design reuse, the emerging approach is to design generators, rather than instances, of either general-purpose or specialized blocks [10]–[12]. The idea of using higher-level programming constructs to describe VLSI functions has been explored for a long time [13], [14]. Generator design in traditional hardware-description languages is challenging and has not taken root, so new hardware construction languages (HCLs) have been developed to support generator development, including Spiral [15], Genesis2 [16], Chisel [17] and Magma [18].

Our focus has been on Chisel, which stands for Constructing Hardware in a Scala Embedded Language [17]. Chisel is a domain-specific extension to the Scala programming language. Chisel gives a hardware designer abstractions of primitive hardware components, such as registers, muxes, and wires. A Chisel designer constructs a set of libraries whose classes represent those hardware primitives. Executing a Chisel program generates a graph of such hardware components that serves as a blueprint for the hardware design; the back-end of the Chisel compiler translates this graph into a synthesizable Verilog representation that maps to standard FPGA or ASIC flows, as well as verification and integration collateral such as IP-XACT. Chisel by itself does not raise the level of design abstraction above RTL, but the software abstractions enabled by the expressiveness of the Scala host language allow designs to be modular and highly parameterized, thus enhancing reusability. Generators written in Chisel also preserve the designer's intent (i.e., control over the generated output), in contrast to high-level synthesis tools, which usually provide limited control over output quality; a minimal example of such a generator is shown below.

In the spirit of modern software frameworks that compile into an intermediate representation (IR), Chisel compiles into its own IR, FIRRTL [19]. Software frameworks transform code written in a particular programming language into a compiler-specific IR, such as LLVM, where IR-to-IR transformations (optimizations) such as constant propagation or dead-code elimination modify the program's structure; finally, a compiler backend converts the IR into code for the target ISA, e.g., ARM or x86. Translating an input language into an IR enables reuse of transformations across multiple designs and languages. The Chisel hardware compiler framework (HCF) is similarly structured: Chisel and Verilog frontends translate designs into FIRRTL, transformation passes provide simplification, optimization, and instrumentation, and the resulting FIRRTL can either be simulated directly or passed to one of many Verilog backends tailored for simulators, emulators, FPGAs, or ASICs (Fig. 3) [19]. Many FIRRTL transformations have been developed so far, ranging from simplifications and instrumentation to performance optimizations. Target-specific optimizations that map memories either to block RAM (BRAM) on FPGAs or to compiled SRAM macros in ASICs enable seamless emulation of Chisel code before committing to chip fabrication.

Another powerful extension of Chisel is its support for unified development of digital signal processing (DSP) functions [20]. This includes operator and data-type polymorphism, unified system modeling, and powerful static and trace-based bitwidth optimizations. As a result, a single generator can support fixed-point, floating-point, or complex data types, thus enabling different target instances, and design, validation, and verification are carried out from a single generator.
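To make the preceding description concrete, the following is a minimal, hypothetical Chisel generator (it is not drawn from any of the cited designs): the Scala parameters width and taps are elaboration-time knobs, while the registers and adders are the primitive hardware components from which the graph is built.

```scala
import chisel3._

// A width- and length-parameterized moving-sum filter. The Scala
// parameters select how much hardware is elaborated; the body describes
// that hardware with Chisel's register and adder primitives.
class MovingSum(width: Int, taps: Int) extends Module {
  val io = IO(new Bundle {
    val in  = Input(UInt(width.W))
    val out = Output(UInt(width.W))
  })

  // A shift register holding the most recent `taps` input samples.
  val history = Seq.fill(taps)(RegInit(0.U(width.W)))
  history.zipWithIndex.foreach { case (r, i) =>
    r := (if (i == 0) io.in else history(i - 1)) // Scala `if` selects the wiring
  }

  // The Scala `reduce` unrolls into an adder tree at elaboration time.
  io.out := history.reduce(_ + _)
}
```

Elaborating MovingSum(16, 4) and MovingSum(12, 64) produces two different Verilog instances from the same source, which is the sense in which a generator, rather than an instance, is being designed.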

Figure 3. A parallel between software and hardware compilers.

A well-written digital generator provides a hardware template that can be both statically parameterized and reconfigured on the fly, as in the case of radix-2^N 3^M 5^K Fast Fourier Transform (FFT) engines [21]; a simplified sketch of this pattern follows below.

Development of analog and mixed-signal modules is typically a time-consuming process of refining design specifications through layout iterations and post-layout simulations. Analog design automation in the form of circuit synthesis and automatic layout generation has been explored for many years [22], but has had limited traction. The Berkeley Analog Generator (BAG2) is a Python-based framework for process-portable development of analog generators [23], [24]. The specification-to-verification framework encapsulates a schematic-generation application-programming interface (API), a sizing routine (design script), and two layout-generation engines, Laygo and XBase. The BAG2 framework has been used to generate a wide range of interface blocks, including analog-to-digital and digital-to-analog converters as well as high-speed serial interfaces. Fig. 4 gives a coarse illustration of the BAG framework as the interconnection of the design script, the verification/simulation framework, and the layout and schematic generators. Fig. 5 shows a schematic of a generated 15Gb/s SerDes front-end, with a parameterized number of decision-feedback equalizer (DFE) taps, voltage swings, and resistances. Fig. 6 demonstrates the portability of BAG2 generators through the design of time-interleaved (TI) successive-approximation (SAR) analog-to-digital converters (ADCs) in three processes from three different foundries: STMicroelectronics 28FDSOI, GlobalFoundries 22FDX and TSMC 16FFC.
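The distinction between static parameterization and on-the-fly reconfiguration described above can be sketched in a few lines of Chisel. The block below is a hypothetical stand-in for the far more involved FFT generator of [21] (the names and structure are ours): the maximum decimation factor is fixed at elaboration time, while the active factor is selected at run time through a control input.

```scala
import chisel3._
import chisel3.util._

// Static parameter: the largest supported decimation factor, 2^maxLog2Factor.
// Runtime reconfiguration: io.log2Factor selects the active factor on the fly.
class Decimator(width: Int, maxLog2Factor: Int) extends Module {
  val io = IO(new Bundle {
    val in         = Flipped(Valid(UInt(width.W)))
    val log2Factor = Input(UInt(log2Ceil(maxLog2Factor + 1).W))
    val out        = Valid(UInt(width.W))
  })

  // Free-running sample counter.
  val count = RegInit(0.U(maxLog2Factor.W))
  when (io.in.valid) { count := count + 1.U }

  // Pass one sample out of every 2^log2Factor: valid only when the low
  // log2Factor bits of the sample counter are zero.
  val mask = (1.U << io.log2Factor).asUInt - 1.U
  io.out.valid := io.in.valid && ((count & mask) === 0.U)
  io.out.bits  := io.in.bits
}
```

Generating a family of instances (different widths and maximum factors) requires no change to the source, while software can retune the active factor without re-fabricating the chip.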

Figure 4. Illustration of the BAG framework.

Figure 5. Parameterization of a BAG-generated SerDes front-end [25].


Figure 6. BAG2-generated TI SAR ADC layouts in a) 28nm FDSOI, b) 22nm FDX, and c) 16nm FinFET technologies.

V. RISC-V INSTRUCTION-SET ARCHITECTURE

Any emerging platform will need both general-purpose and specialized processor cores to provide both flexible and efficient programmability. RISC-V is an open instruction-set architecture that can be freely used and, more importantly, extended. RISC-V supports 32-bit, 64-bit and 128-bit address spaces, rich operating systems and hypervisors, as well as deeply embedded applications. The base instruction-set architecture (ISA) is truly reduced, with fewer than 50 base instructions, but it supports several standard extensions, such as integer multiplication and division (M), atomic operations (A), single- (F), double- (D) and quad-precision (Q) floating-point operations, as well as compressed (C) 16-bit instruction formats. A processor core may implement any subset of the standard extensions, or can add its own custom extensions. A set of vector (V) extensions is presently undergoing development and standardization.

VI. ROCKET RISC-V CHIP GENERATORS

The free and open RISC-V ISA specification enables, but does not mandate, open-source implementations, and a large variety of both proprietary and open-source RISC-V cores is already available. In all of our recent chips, we have used different generations of the open-source Rocket Chip generator [26], written in Chisel, which includes a standard five-stage, single-issue, in-order processor pipeline, but other cores can be built using Rocket Chip as a library. Rocket Chip presents a template for designing systems-on-a-chip (SoCs), as it supports coherent multi-level caches and standard interconnects (TileLink2). The open-source SiFive Freedom platform adds basic interfaces (SPI, UART) and test and debug features (JTAG) to Rocket Chip [27]. The generator has many of its features parameterized, including the size and organization of the caches; the configuration sketch below illustrates how such parameters are composed. An example output is shown in Fig. 7.
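As an illustration, the snippet below composes configuration fragments in the style of the open-source rocket-chip repository. The fragment and package names follow recent releases of that repository and are not guaranteed to match every version, so treat this as a sketch rather than a drop-in configuration.

```scala
import freechips.rocketchip.config.Config            // newer releases: org.chipsalliance.cde.config
import freechips.rocketchip.subsystem.{WithNBigCores, WithL1ICacheSets}
import freechips.rocketchip.system.DefaultConfig

// A hypothetical SoC configuration: two in-order Rocket tiles with a
// smaller L1 instruction cache, layered on top of the default system.
// Fragments listed first take priority over those that follow.
class DualCoreSmallICacheConfig extends Config(
  new WithNBigCores(2) ++      // instantiate two Rocket cores
  new WithL1ICacheSets(32) ++  // reduce the number of L1 I-cache sets
  new DefaultConfig            // default interconnect, memory map, peripherals
)
```

Selecting this configuration when invoking the generator elaborates a correspondingly sized instance of the SoC from the same source.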

Figure 7. Example of Rocket Chip generator output.

TileLink2 is an open, chip-level interconnect protocol that supports multiple master and slave devices with coherent memory access [28], [29]. TileLink2 is designed to connect general-purpose multiprocessors, co-processors, accelerators, DMA engines, and simple or complex devices in an SoC, using a fast, scalable interconnect that provides both low-latency and high-throughput transfers. TileLink2 can be serialized over ChipLink to provide cache-coherent inter-chip communication.

VII. EXTENSIONS TO THE CHIP GENERATOR

The Rocket Chip generator offers multiple types of customization. Besides changing the generator parameters, it is possible to design a new processor core and drop it into the tile, as has been done with an out-of-order implementation of the RISC-V ISA [30]. The standard Rocket custom coprocessor (RoCC) interface provides a way to design decoupled coprocessors integrated in the same tile as the core, with two decoupled interfaces connecting the coprocessor to the core and to the L1 cache or outer memory system; a minimal sketch of such a coprocessor is shown below. In several of our prototype chips, we have added vector coprocessors via RoCC [31]–[33].

Peripheral devices can be connected to TileLink2. These include standard interfaces to external DRAM and Flash and high-speed serial links, as well as custom compute accelerators and DSP functions. Because of its extensibility, the Rocket Chip generator is an ideal platform for exploring heterogeneous architectures with a variety of tightly or loosely coupled accelerators. Finally, complete processing subsystems can be attached as peripheral devices or as RoCC-connected coprocessors. An FFT processor is an example of an accelerator that represents a computational motif; more complex examples include wireless baseband processors, imaging pipelines, and accelerators for machine-learning algorithms. To enable adding DSP accelerators to Rocket Chip with minimal effort, the interfaces are standardized. As illustrated in Fig. 8, every block (or composition of blocks) is wrapped into a standard set of interfaces that unpack and pack streaming data for TileLink2 (or, e.g., AXI-Stream via an adapter) [34].
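As a concrete illustration of the RoCC interface described above, the following is a minimal, hypothetical coprocessor written against the LazyRoCC API of the open-source rocket-chip repository; package paths and class names differ across rocket-chip versions, so this is a sketch rather than a drop-in design.

```scala
import chisel3._
import freechips.rocketchip.config.Parameters  // newer releases: org.chipsalliance.cde.config
import freechips.rocketchip.tile._

// A coprocessor that accumulates rs1 operands delivered by a custom
// instruction and returns the running sum over the response channel.
class AccumulatorRoCC(opcodes: OpcodeSet)(implicit p: Parameters)
    extends LazyRoCC(opcodes) {
  override lazy val module = new AccumulatorRoCCModule(this)
}

class AccumulatorRoCCModule(outer: AccumulatorRoCC)
    extends LazyRoCCModuleImp(outer) {
  // Tie off the optional interfaces first, then drive the signals we use.
  io := DontCare

  val acc = RegInit(0.U(64.W))

  // Command channel from the core (decoupled, as described in the text).
  io.cmd.ready := io.resp.ready
  when (io.cmd.valid && io.cmd.ready) { acc := acc + io.cmd.bits.rs1 }

  // Response channel back to the core's destination register.
  io.resp.valid     := io.cmd.valid && io.cmd.bits.inst.xd
  io.resp.bits.rd   := io.cmd.bits.inst.rd
  io.resp.bits.data := acc + io.cmd.bits.rs1

  io.busy      := false.B
  io.interrupt := false.B
  // The second decoupled interface, to the L1 cache (io.mem), is left
  // idle here; a real accelerator would issue loads and stores through it.
  io.mem.req.valid := false.B
}
```

The OpcodeSet argument (e.g., OpcodeSet.custom0) selects which custom opcode the core routes to this accelerator, so software reaches it through an ordinary custom instruction.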

Diplomacy is a framework for negotiating the parameters of hardware interfaces, divided between master and slave devices [29]. The slave parameters specify which subset of the optional ports is supported, how many distinct addressable endpoints lie downstream of the node, and whether the slave is always ready for new transactions. The master parameters specify the width of the data and the type of ready/valid signals used for communication, as well as the number of distinct addressable masters upstream of the node. Edge parameters join the master and slave parameters and compute the actual widths of the interface ports, as the toy model below illustrates. Adapters for adding new fields and assigning default values from the standard are straightforward to implement.
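To make the negotiation concrete, the toy Scala model below mirrors the structure described above without using the actual rocket-chip classes (the case-class names are ours, not TileLink's): master and slave sides each publish their requirements, and the edge joins the two to compute the final port widths.

```scala
// A toy model of diplomatic parameter negotiation; the real TileLink
// implementation lives in rocket-chip and uses different class names.
case class MasterParams(dataBits: Int, numMasterIds: Int)
case class SlaveParams(addressBits: Int, numEndpoints: Int, alwaysReady: Boolean)

// Edge parameters join both views and fix the physical port widths.
case class EdgeParams(m: MasterParams, s: SlaveParams) {
  val dataBits    = m.dataBits
  val addressBits = s.addressBits
  // ceil(log2(numMasterIds)) source-ID bits, with a minimum of one bit.
  val sourceBits  = math.max(1, 32 - Integer.numberOfLeadingZeros(m.numMasterIds - 1))
}

object NegotiationDemo extends App {
  val edge = EdgeParams(
    MasterParams(dataBits = 64, numMasterIds = 4),
    SlaveParams(addressBits = 32, numEndpoints = 2, alwaysReady = false))
  println(s"data=${edge.dataBits} addr=${edge.addressBits} source=${edge.sourceBits}")
  // prints: data=64 addr=32 source=2
}
```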

VIII. SYSTEM EMULATION

FPGA-accelerated simulation of both edge and cloud workloads is essential for both architecture definition and application validation. We have released FireSim [35], an open-source RISC-V simulation platform that enables cycle-exact microarchitectural simulation of large scale-out clusters by combining FPGA-accelerated simulation of silicon-proven RTL designs with a scalable, distributed network simulation. FireSim runs on Amazon EC2 F1, a public cloud FPGA platform, which greatly improves usability, provides elasticity, and lowers the cost of large-scale FPGA-based experiments. FireSim provides sufficient performance to run modern applications at scale, enabling true hardware-software co-design. For example, an automatically generated and deployed target cluster of 1,024 3.2 GHz quad-core server nodes, each with 16 GB of DRAM, interconnected by a 200 Gbit/s network with 2 µs latency, simulates at a 3.4 MHz processor clock rate (3.2 GHz / 3.4 MHz ≈ 940x, i.e., less than a 1,000x slowdown relative to real time). FireSim is a key tool for exploring emerging workloads in warehouse-scale machine design, with the goal of designing an optimal architecture in silicon.

IX. GENERATED SILICON

The principles outlined in this paper have been used to design a series of RISC-V-based custom SoCs in advanced technology nodes (Fig. 9). Each of the designed chips contains an instance of a RISC-V core, with added specializations that include vector coprocessors, DMA engines, DSP subsystems, and power-management units. Switched-capacitor DC-DC converters, analog-to-digital converters, and high-speed serial interfaces are generated as well. An example multi-processor system based on the Rocket Chip generator, with an added DSP subsystem and serial links, is shown in Fig. 10. This has been used as a template for the design of a sparse analysis SoC [36].

Figure 8. Wrapper for DSP functions.

Figure 9. Generated chips designed at UC Berkeley.

Figure 10. An example signal processing system based on RISC-V and generators.

X. CONCLUSION

In order to reduce the non-recurring engineering (NRE) cost associated with specialized silicon, a generator-based flow facilitates dramatically improved design reuse and supports an agile approach to hardware design. Generator technology, and in particular Chisel (for RTL generation), FIRRTL (for separation of concerns and process-specific optimizations), and BAG2 (for mixed-signal subsystem design and verification), is the basis for developing the efficient specialized chips necessary to support diverging applications. A small group of students has utilized this generator-based flow to design multiple SoCs that contain both general-purpose compute capabilities as well as custom DSP and analog/mixed-signal blocks, across a range of advanced processes from different foundries.

ACKNOWLEDGMENTS

This work was in part supported by DARPA PERFECT (HR0011-12-2-0016), DARPA CRAFT (HR0011-16-C-0052), Intel iSTC, ADEPT, ASPIRE and BWRC member companies. The authors acknowledge the work of Jonathan Bachrach, Brian Richards, Brian Zimmer, Yunsup Lee, Ben Keller, Paul Rigge, Angie Wang, Stevo Bailey, Eric Chang, Pi-Feng Chiu, Colin Schmidt, John Wright, Albert Ou, Howard Mao, Woorham Bae, Jeduk Han, Andrew Waterman, Henry Cook, Christopher Celio, Adam Izraelevitz, Zhongkai Wang, Sean Huang, Zhaokai Liu, Chick Markley, Sagar Karandikar, Alon Amid, Nathan Narevsky, Jaehwa Kwak, James Dunn.

REFERENCES

[1] G. E. Moore, “No exponential is forever: but ‘forever’ can be delayed!,” in IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, 2003.

[2] K. Bourzac, “Has Intel created a universal memory technology? [News],” IEEE Spectrum, vol. 54, no. 5, pp. 9–10, May 2017.

[3] D. Greenhill et al., “A 14nm 1GHz FPGA with 2.5D transceiver integration,” in 2017 IEEE International Solid-State Circuits Conference (ISSCC), 2017, pp. 54–55.

[4] N. Beck et al., “‘Zeppelin’: An SoC for multichip architectures,” in 2018 IEEE International Solid-State Circuits Conference (ISSCC), 2018, pp. 40–42.

[5] D. Markovic et al., “Methods for true energy-performance optimization,” IEEE J. Solid-State Circuits, vol. 39, no. 8, pp. 1282–1293, Aug. 2004.

[6] “How Much Will That Chip Cost?” [Online]. Available: http://semiengineering.com/how-much-will-that-chip-cost/.

[7] I. Kuon and J. Rose, “Measuring the gap between FPGAs and ASICs,” IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., vol. 26, no. 2, pp. 203–215, Feb. 2007.

[8] K. Asanović et al., “The landscape of parallel computing research: A view from Berkeley,” Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, 2006.

[9] G. M. Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities,” in Proceedings of the April 18-20, 1967, Spring Joint Computer Conference, Atlantic City, New Jersey, 1967, pp. 483–485.

[10] O. Shacham et al., “Rethinking digital design: Why design must change,” IEEE Micro, vol. 30, no. 6, pp. 9–24, Nov. 2010.

[11] M. Horowitz, “Computing’s energy problem (and what we can do about it),” in 2014 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, 2014, pp. 10–14.

[12] B. Nikolić, “Simpler, more efficient design,” in ESSCIRC Conference 2015 - 41st European Solid-State Circuits Conference (ESSCIRC), 2015, pp. 20–25.

[13] H. T. Kung, “Let’s design algorithms for VLSI systems,” in Proceedings of the Caltech Conference On Very Large Scale Integration, C. L. Seitz, Ed. Pasadena, CA: California Institute of Technology, 1979, pp. 65–90.

[14] M. Sheeran, “Designing regular array architectures using higher order functions,” in Functional Programming Languages and Computer Architecture, 1985, pp. 220–237.

[15] M. Puschel et al., “SPIRAL: Code generation for DSP transforms,” Proc. IEEE, vol. 93, no. 2, pp. 232–275, Feb. 2005.

[16] O. Shacham et al., “Avoiding game over: Bringing design to the next level,” in DAC Design Automation Conference 2012, 2012, pp. 623–629.

[17] J. Bachrach et al., “Chisel: Constructing hardware in a Scala embedded language,” in Proceedings of the 49th Annual Design Automation Conference, 2012, pp. 1216–1225.

[18] “Magma.” [Online]. Available: https://github.com/phanrahan/magma.

[19] A. Izraelevitz et al., “Reusability is FIRRTL ground: Hardware construction languages, compiler frameworks, and transformations,” in 2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2017, pp. 209–216.

[20] A. Wang et al., “ACED: A hardware library for generating DSP systems,” in Proc. 55th Design Automation Conference, DAC’2018, San Francisco, CA, 2018.

[21] A. Wang et al., “A generator of memory-based, runtime-reconfigurable 2^N 3^M 5^K FFT engines,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 1016–1020.

[22] R. A. Rutenbar, “Analog circuit and layout synthesis revisited,” in Proceedings of the 2015 International Symposium on Physical Design (ISPD), Monterey, CA, USA, 2015, pp. 83–83.

[23] J. Crossley et al., “BAG: A designer-oriented integrated framework for the development of AMS circuit generators,” in Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2013.

[24] E. Chang et al., “BAG2: A process-portable framework for generator-based AMS circuit design,” in 2018 IEEE Custom Integrated Circuits Conference (CICC), 2018, pp. 1–8.

[25] E. Chang et al., “An automated SerDes frontend generator verified with a 16nm instance achieving 15 Gb/s at 1.96 pJ/bit,” in 2018 Symposium on VLSI Circuits, Digest of Technical Papers, Honolulu, HI, 2018.

[26] K. Asanović et al., “The Rocket Chip Generator,” EECS Department, University of California, Berkeley, EECS-2016-17, Apr. 2016.

[27] “SiFive Freedom Platform.” [Online]. Available: https://www.sifive.com/products/freedom/.

[28] W. W. Terpstra, “TileLink: A free and open-source, high-performance scalable cache-coherent fabric designed for RISC-V,” in Proc. 7th RISC-V Workshop, Milpitas, CA, Nov. 2017.

[29] H. Cook et al., “Diplomatic design patterns: A TileLink case study,” in First Workshop on Computer Architecture Research with RISC-V (CARRV 2017), Boston, MA.

[30] P.-F. Chiu et al., “An out-of-order RISC-V processor with resilient low-voltage operation in 28nm CMOS,” in Symposium on VLSI Circuits, Digest of Technical Papers, Honolulu, HI, 2018.

[31] Y. Lee et al., “A 45nm 1.3GHz 16.7 double-precision GFLOPS/W RISC-V processor with vector accelerators,” in ESSCIRC 2014 - 40th European Solid State Circuits Conference (ESSCIRC), Venice Lido, Italy, 2014, pp. 199–202.

[32] B. Zimmer et al., “A RISC-V vector processor with simultaneous-switching switched-capacitor DC-DC converters in 28 nm FDSOI,” IEEE J. Solid-State Circuits, vol. 51, no. 4, pp. 930–942, Apr. 2016.

[33] B. Keller et al., “A RISC-V processor SoC with integrated power management at submicrosecond timescales in 28 nm FD-SOI,” IEEE J. Solid-State Circuits, vol. 52, no. 7, pp. 1863–1875, Jul. 2017.

[34] P. Rigge and B. Nikolic, “Designing digital signal processors with RocketChip,” in Second Workshop on Computer Architecture Research with RISC-V (CARRV 2018), Los Angeles, CA.

[35] S. Karandikar et al., “FireSim: FPGA-accelerated cycle-exact scale-out system simulation in the public cloud,” in Proc. 45th International Symposium on Computer Architecture - ISCA 2018, Los Angeles, CA.

[36] A. Wang et al., “A real-time, analog/digital co-designed 1.89-GHz bandwidth, 175-kHz resolution sparse spectral analysis RISC-V SoC in 16-nm FinFET,” presented at the 44th European Solid-State Circuits Conference, Dresden, Germany, 2018.
