Hardware–Software Co-Design: Not Just a Cliché
Adrian Sampson · James Bornholt · Luis Ceze
Sampa group, University of Washington · SNAPL 2015
Copyright © National Academy of Sciences. All rights reserved.
The Future of Computing Performance: Game Over or Next Level?
SUMMARY
[Figure S.1: clock frequency (MHz), logarithmic scale, versus year of introduction, 1985–2020]
FIGURE S.1 Processor performance from 1986 to 2008 as measured by the benchmark suite SPECint2000 and consensus targets from the International Technology Roadmap for Semiconductors for 2009 to 2020. The vertical scale is logarithmic. A break in the growth rate at around 2004 can be seen. Before 2004, processor performance was growing by a factor of about 100 per decade; since 2004, processor performance has been growing and is forecasted to grow by a factor of only about 2 per decade. An expectation gap is apparent. In 2010, this expectation gap for single-processor performance is about a factor of 10; by 2020, it will have grown to a factor of 1,000. Most sectors of the economy and society implicitly or explicitly expect computing to deliver steady, exponentially increasing performance, but as these graphs illustrate, traditional single-processor computing systems will not match expectations. Note that the SPEC benchmarks are a set of artificial workloads intended to measure a computer system's speed. A machine that achieves a SPEC benchmark score that is 30 percent faster than that of another machine should feel about 30 percent faster than the other machine on real workloads.
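The caption's growth rates make the expectation gap easy to sanity-check with a few lines of arithmetic. This is a back-of-the-envelope sketch only: the 100x and 2x per-decade rates are the caption's approximations, and the report's own figures come from fitted SPECint trend lines, so the 2020 value lands somewhat below the quoted factor of 1,000.

```python
# Rough model of the "expectation gap" from the figure caption.
# Rates are the caption's approximations, not the report's fitted curves.

def growth(factor_per_decade, years):
    """Performance multiplier after `years` of growth at a per-decade factor."""
    return factor_per_decade ** (years / 10.0)

def expectation_gap(year, break_year=2004):
    """Ratio between the extrapolated pre-2004 trend and the post-2004 trend."""
    years = year - break_year
    expected = growth(100, years)  # ~100x per decade, extrapolated past 2004
    actual = growth(2, years)      # ~2x per decade after the 2004 break
    return expected / actual

print(round(expectation_gap(2010)))  # about 10, matching the caption
print(round(expectation_gap(2020)))  # several hundred under these rough rates
```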
Application
Language
Architecture
Circuits
hardware–software abstraction boundary
parallelism
data movement
guard bands
energy costs
Application
Language
Architecture
Circuits
new abstractions for incorrectness
type systems
debuggers
probabilistic guarantees
auto-tuning
flaky functional units
lossy cache compression
neural acceleration
drowsy SRAMs
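These pieces compose: a type system for approximate data can statically keep results from flaky functional units or lossy caches out of precise state. The sketch below is a toy Python analogue of such a discipline; the names `Approx`, `endorse`, and `precise_store` are invented for illustration and are not any paper's actual API.

```python
# Toy analogue of a type discipline for approximate data. All names here
# are invented for illustration.

class Approx:
    """Marks a value as the result of an approximate computation."""
    def __init__(self, value):
        self.value = value

def endorse(x):
    """Explicit, programmer-audited cast from approximate back to precise."""
    return x.value if isinstance(x, Approx) else x

def precise_store(dest, x):
    """Reject implicit approximate-to-precise flows at the 'type check'."""
    if isinstance(x, Approx):
        raise TypeError("approximate value needs endorse() before precise use")
    dest.append(x)

precise_memory = []
pixel = Approx(203)                            # output of an approximate stage
precise_store(precise_memory, endorse(pixel))  # OK: the flow is made explicit
# precise_store(precise_memory, pixel)         # would raise TypeError
print(precise_memory)  # [203]
```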
Hardware design costs sanity & well-being
Thierry Moreau, FPGA design champion
[Moreau et al.; HPCA 2015]
Trust your compiler

st   r1, x    ; precise store
st.a r2, y    ; approximate store
ld   x,  r3   ; precise load
ld.a y,  r4   ; approximate load

approximate cache
line state bits?

[Esmaeilzadeh, Sampson, Ceze, Burger; ASPLOS 2012]
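One way to answer the "line state bits?" question is one extra bit per cache line recording whether the line was last written by an approximate store, so a precise load can trap on it. The toy model below is an illustration of that idea only, not the hardware design from the ASPLOS 2012 paper.

```python
# Toy direct-mapped cache with one "approximate" state bit per line.
# Illustrative only; not the actual hardware design.

class ApproxCache:
    def __init__(self, n_lines=8):
        self.data = [None] * n_lines
        self.approx = [False] * n_lines  # the per-line state bit

    def store(self, addr, value, approximate=False):
        line = addr % len(self.data)
        self.data[line] = value
        self.approx[line] = approximate  # st.a sets the bit; st clears it

    def load(self, addr, approximate=False):
        line = addr % len(self.data)
        if self.approx[line] and not approximate:
            # A precise load (ld) must not silently read approximate data.
            raise ValueError("precise load from an approximate line")
        return self.data[line]

cache = ApproxCache()
cache.store(0, 42)                      # st   r1, x
cache.store(1, 99, approximate=True)    # st.a r2, y
print(cache.load(0))                    # ld   x, r3 -> 42
print(cache.load(1, approximate=True))  # ld.a y, r4 -> 99
```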
More hardware flexibility that humans can actually program
explicit data movement
explicit memory blocks
explicit physical routing
explicit clock frequency
explicit ILP
explicit numeric bit width
FPGA
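Of these knobs, explicit numeric bit width is the easiest to make concrete in plain software: an FPGA design can devote exactly as many bits to a value as the application needs, where a CPU would force a fixed word size. A rough Python model of signed fixed-point quantization follows; the format and helper are illustrative, not tied to any particular toolchain.

```python
# Rough model of quantizing to an explicit fixed-point width, the kind of
# choice an FPGA design makes per signal. Illustrative only.

def quantize(x, int_bits, frac_bits):
    """Clamp and round x to a signed fixed-point format of explicit width."""
    scale = 1 << frac_bits
    lo = -(1 << (int_bits + frac_bits - 1))       # most negative raw code
    hi = (1 << (int_bits + frac_bits - 1)) - 1    # most positive raw code
    raw = max(lo, min(hi, round(x * scale)))
    return raw / scale

print(quantize(3.14159, int_bits=3, frac_bits=5))  # 3.15625: 5 fractional bits
print(quantize(100.0, int_bits=3, frac_bits=5))    # 3.96875: clamped to range
```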
More hardware flexibility that humans can actually program
A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services
Andrew Putnam, Adrian M. Caulfield, Eric S. Chung, Derek Chiou [1], Kypros Constantinides [2], John Demme [3], Hadi Esmaeilzadeh [4], Jeremy Fowers, Gopi Prashanth Gopal, Jan Gray, Michael Haselman, Scott Hauck [5], Stephen Heil, Amir Hormati [6], Joo-Young Kim, Sitaram Lanka, James Larus [7], Eric Peterson, Simon Pope, Aaron Smith, Jason Thong, Phillip Yi Xiao, Doug Burger
Microsoft
Abstract

Datacenter workloads demand high computational capabilities, flexibility, power efficiency, and low cost. It is challenging to improve all of these factors simultaneously. To advance datacenter capabilities beyond what commodity server designs can provide, we have designed and built a composable, reconfigurable fabric to accelerate portions of large-scale software services. Each instantiation of the fabric consists of a 6x8 2-D torus of high-end Stratix V FPGAs embedded into a half-rack of 48 machines. One FPGA is placed into each server, accessible through PCIe, and wired directly to other FPGAs with pairs of 10 Gb SAS cables.

In this paper, we describe a medium-scale deployment of this fabric on a bed of 1,632 servers, and measure its efficacy in accelerating the Bing web search engine. We describe the requirements and architecture of the system, detail the critical engineering challenges and solutions needed to make the system robust in the presence of failures, and measure the performance, power, and resilience of the system when ranking candidate documents. Under high load, the large-scale reconfigurable fabric improves the ranking throughput of each server by a factor of 95% for a fixed latency distribution—or, while maintaining equivalent throughput, reduces the tail latency by 29%.

1. Introduction

The rate at which server performance improves has slowed considerably. This slowdown, due largely to power limitations, has severe implications for datacenter operators, who have traditionally relied on consistent performance and efficiency improvements in servers to make improved services economically viable. While specialization of servers for specific scale workloads can provide efficiency gains, it is problematic for two reasons. First, homogeneity in the datacenter is highly desirable to reduce management issues and to provide a consistent platform that applications can rely on. Second, datacenter services evolve extremely rapidly, making non-programmable hardware features impractical. Thus, datacenter providers are faced with a conundrum: they need continued improvements in performance and efficiency, but cannot obtain those improvements from general-purpose systems.

Reconfigurable chips, such as Field Programmable Gate Arrays (FPGAs), offer the potential for flexible acceleration of many workloads. However, as of this writing, FPGAs have not been widely deployed as compute accelerators in either datacenter infrastructure or in client devices. One challenge traditionally associated with FPGAs is the need to fit the accelerated function into the available reconfigurable area. One could virtualize the FPGA by reconfiguring it at run-time to support more functions than could fit into a single device. However, current reconfiguration times for standard FPGAs are too slow to make this approach practical. Multiple FPGAs provide scalable area, but cost more, consume more power, and are wasteful when unneeded. On the other hand, using a single small FPGA per server restricts the workloads that may be accelerated, and may make the associated gains too small to justify the cost.

This paper describes a reconfigurable fabric (that we call Catapult for brevity) designed to balance these competing concerns. The Catapult fabric is embedded into each half-rack of 48 servers in the form of a small board with a medium-sized FPGA and local DRAM attached to each server. FPGAs are directly wired to each other in a 6x8 two-dimensional torus, allowing services to allocate groups of FPGAs to provide the necessary area to implement the desired functionality.

We evaluate the Catapult fabric by offloading a significant fraction of Microsoft Bing's ranking stack onto groups of eight FPGAs to support each instance of this service. When a server wishes to score (rank) a document, it performs the software portion of the scoring, converts the document into a format suitable for FPGA evaluation, and then injects the document to its local FPGA. The document is routed on the inter-FPGA network to the FPGA at the head of the ranking pipeline. After running the document through the eight-FPGA pipeline, the computed score is routed back to the requesting server. Although we designed the fabric for general-purpose service

[1] Microsoft and University of Texas at Austin. [2] Amazon Web Services. [3] Columbia University. [4] Georgia Institute of Technology. [5] Microsoft and University of Washington. [6] Google, Inc. [7] École Polytechnique Fédérale de Lausanne (EPFL). All authors contributed to this work while employed by Microsoft.
978-1-4799-4394-4/14/$31.00 © 2014 IEEE
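The document-scoring flow described in the excerpt's final paragraph can be sketched end to end: software preprocessing on the server, injection into the local FPGA, a trip through the eight-FPGA ranking pipeline, and the score routed back. Every function below is a stand-in invented for illustration; the excerpt does not expose the real system's interfaces.

```python
# Sketch of the Catapult scoring flow as described in the excerpt.
# All names and the placeholder arithmetic are invented for illustration.

def software_preprocess(document):
    """Software portion: convert the document to an FPGA-friendly format."""
    return {"features": document.lower().split()}

def fpga_stage(stage_id, payload):
    """Stand-in for one FPGA in the eight-stage ranking pipeline."""
    payload = dict(payload)
    payload.setdefault("partial_score", 0.0)
    payload["partial_score"] += 0.1 * stage_id  # placeholder computation
    return payload

def score_document(document, pipeline_depth=8):
    """Route a document through the pipeline and return its score."""
    payload = software_preprocess(document)
    # The local FPGA forwards the payload over the inter-FPGA torus to the
    # head of the pipeline; here that routing is just a loop over stages.
    for stage in range(pipeline_depth):
        payload = fpga_stage(stage, payload)
    return payload["partial_score"]  # routed back to the requesting server

print(round(score_document("web search ranking"), 1))  # 2.8
```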
23 authors!
Trust, but formally verify
Application
Language
Architecture
Circuits
verified properties
e.g., [Hunt and Larus; OSR April 2007]
Hardware beyond core computation
power supply & battery
mobile display & backlight
new memory technologies
software-defined networking
CPU
GPU
FPGA
accelerators