a lgorithms and s pecializers for p rovably optimal i mplementations with r esiliency and e...
Post on 25-Feb-2016
45 Views
Preview:
DESCRIPTION
TRANSCRIPT
Algorithms and Specializers for Provably Optimal Implementations with Resiliency and Efficiency
Elad Alon, Krste Asanovic (Director),Jonathan Bachrach, Jim Demmel, Armando Fox, Kurt Keutzer, Borivoje Nikolic, David Patterson, Koushik Sen, John Wawrzynek
krste@eecs.berkeley.eduhttp://aspire.eecs.berkeley.edu
UC Berkeley Future Application Drivers
2
Pervasive Speech
BIG DATA
Augmented Reality
Personalized Medicine
Environment
Social Networks
Robotics
UC Berkeley
3
Compute Energy “Iron Law”
When power is constrained, need better energy efficiency for more performance
Where performance is constrained (real-time), want better energy efficiency to lower power
Improving energy Efficiency is critical goal for all future systems and workloads
Performance = Power * Energy Efficiency(Tasks/Second) (Joules/Second) (Tasks/Joule)
UC Berkeley Good News: Moore’s Law Continues
4“Cramming more components onto integrated circuits”, Gordon E. Moore, Electronics, 1965
More Transis
tors/Chip
Cheaper!
UC Berkeley
Bad News:Dennard (Voltage) Scaling Over
5
Dennard Scaling
Post-Dennard Scaling
Moore, ISSCC Keynote, 2003
UC Berkeley
1st Impact of End of Scaling:End of Sequential Processor Era
6
UC Berkeley
7
Parallelism: A one-time gain
Use more, slower cores for better energy efficiency.Either simpler cores, or run cores at lower Vdd/frequency
Even simpler general-purpose microarchitectures?- Limited by smallest sensible core
Even Lower Vdd/Frequency?- Limited by Vdd/Vt scaling, errors
Now what?
UC Berkeley
8
[Muller, ARM CTO, 2009]
2nd Impact of End of Scaling: “Dark Silicon” Cannot switch all transistors at full frequency!
No savior device technology on horizon.Future energy-efficiency innovations must be above transistor level.
UC Berkeley
9
The End of General-Purpose Processors?
Most computing happens in specialized, heterogeneous processors- Can be 100-1000X more
efficient than general-purpose processor
Challenges:- Hardware design costs- Software development costs
NVIDIA Tegra2
UC Berkeley
The Real Scaling Challenge:Communication
As transistors become smaller and cheaper, communication dominates performance and energy
10
All scales: Across chip Up and down
memory hierarchy Chip-to-chip Board-to-board Rack-to-rack
UC Berkeley
11
ASPIRE: From Better to Best
What is the best we can do?- For a fixed target technology (e.g., 7nm)
Can we prove a bound? Can we design implementation approaching bound?
Provably Optimal Implementations
Specialize and optimize communication and computation across whole stack from
applications to hardware
UC Berkeley
Communication-Avoiding Algorithms: Algorithm Cost Measures
1.Arithmetic (FLOPS)2.Communication: moving data between
- levels of a memory hierarchy (sequential case) - processors over a network (parallel case).
12
CPUCache
DRAM
CPUDRAM
CPUDRAM
CPUDRAM
CPUDRAM
UC Berkeley Modeling Runtime & Energy
13
UC Berkeley
14
A few examples of speedups
Matrix multiplication- Up to 12x on IBM BG/P for n=8K on 64K cores; 95% less communication
QR decomposition (used in least squares, data mining, …)- Up to 8x on 8-core dual-socket Intel Clovertown, for 10M x 10- Up to 6.7x on 16-proc. Pentium III cluster, for 100K x 200- Up to 13x on Tesla C2050 / Fermi, for 110k x 100- Up to 4x on Grid of 4 cities (Dongarra, Langou et al)- “infinite speedup” for out-of-core on PowerPC laptop - LAPACK thrashed virtual memory, didn’t finish
Eigenvalues of band symmetric matrices- Up to 17x on Intel Gainestown, 8 core, vs MKL 10.0 (up to 1.9x sequential)
Iterative sparse linear equations solvers (GMRES)- Up to 4.3x on Intel Clovertown, 8 core
N-body (direct particle interactions with cutoff distance)- Up to 10x on Cray XT-4 (Hopper), 24K particles on 6K procs.
UC Berkeley Modeling Energy: Dynamic
15
UC Berkeley Modeling Energy: Memory Retention
16
UC Berkeley Modeling Energy: Background Power
17
UC Berkeley Energy Lower Bounds
18
UC Berkeley
Early Result:Perfect Strong Scaling in Time and Energy
Every time you add processor, use its memory M too Start with minimal number of procs: PM = 3n2
Increase P by factor c total memory increases by factor c Notation for timing model:
- γt , βt , αt = secs per flop, per word_moved, per message of size mT(cP) = n3/(cP) [ γT+ βt/M1/2 + αt/(mM1/2) ] = T(P)/c
Notation for energy model:- γe , βe , αe = Joules for same operations- δe = Joules per word of memory used per sec- εe = Joules per sec for leakage, etc.
E(cP) = cP { n3/(cP) [ γe+ βe/M1/2 + αe/(mM1/2) ] + δeMT(cP) + εET(cP) } = E(P)
Perfect scaling extends to n-body, Strassen, …[IPDPS, 2013]
19
UC Berkeley C-A Algorithms Not Just for HPC
In ASPIRE, apply to other key application areas: machine vision, databases, speech recognition, software-defined radio, …
Initial results on lower bounds of database join algorithms
20
UC Berkeley
From C-A Algorithms to Provably Optimal Systems?
1) Prove lower bounds on communication for a computation
2) Develop algorithm that achieves lower bound on a system
3) Find that communication time/energy cost is >90% of resulting implementation
4) We know we’re within 10% of optimal!
Supporting technique: Optimizing software stack and compute engines to reduce compute costs and expose unavoidable communication costs
21
UC Berkeley
ESP: An Applications Processor Architecture for ASPIRE
Future server and mobile SoCs will have many fixed-function accelerators and a general-purpose programmable multicore
Well-known how to customize hardware engines for specific task
ESP challenge is using specialized engines for general-purpose code
22
Intel Ivy Bridge (22nm)
Qualcomm Snapdragon MSM8960 (28nm)
ESP
ESP
UC Berkeley ESP: Ensembles of Specialized Processors
General-purpose hardware, flexible but inefficient Fixed-function hardware, efficient but inflexible Par Lab Insight: Patterns capture common operations
across many applications, each with unique communication & computation structure
Build an ensemble of specialized engines, each individually optimized for particular pattern but collectively covering application needs
Bet: Will give us efficiency plus flexibility- Any given core can have a different mix of these depending
on workload
23
UC Berkeley Par Lab: Motifs common across apps
24
Dense GraphSparse …
ApplicationsAudio Recognition
Object Recognition
Scene Analysis
Berkeley View “Dwarfs” or Motifs
UC Berkeley
25
Par Lab Apps
Motif (nee “Dwarf”) Popularity (Red Hot / Blue Cool)
Computing Domains
UC Berkeley
• Pipe-and-Filter• Agent-and-Repository• Event-based• Bulk Synchronous• Map-Reduce• Layered Systems• Model-view controller• Arbitrary Task Graphs• Puppeteer• Model-View-Controller
Application
• Graph Algorithms• Dynamic programming• Dense/Spare Linear Algebra • Un/Structured Grids• Graphical Models• Finite State Machines• Backtrack Branch-and-Bound• N-Body Methods• Circuits• Spectral Methods• Monte-Carlo
Architecting Parallel Software
Identify the Software Structure
Identify the Key Computations
UC Berkeley Mapping Software to ESP: Specializers
Capture desired functionality at high-level using patterns in a productive high-level language
Use pattern-specific compilers (Specializers) with autotuners to produce efficient low-level code
ASP specializer infrastructure, open-source download 27
ILP Engine
Dense Engine
Sparse Engine
Graph Engine
ESP Core
Glue Code
Dense Code
SparseCode
Graph Code
ESP Code
Dense GraphSparse …
ApplicationsAudio Recognition
Object Recognition
Scene Analysis
Berkeley View “Dwarfs” or Motifs
Specializers with SEJITS Implementations and Autotuning
UC Berkeley
Replacing Fixed Accelerators with Programmable Fabric
Future server and mobile SoCs will have many fixed-function accelerators and a general-purpose programmable multicore
Fabric challenge is retaining extreme energy efficiency while retaining programmability
28
Intel Ivy Bridge (22nm)
Qualcomm Snapdragon MSM8960 (28nm)
Fabric
Fabr
ic
Fabr
ic
Fabr
ic
UC Berkeley Strawman Fabric Architecture
29
M
ARM
ARM
ARM
AR
M
ARM
ARM
ARM
AR
M
ARM
ARM
ARM
AR
M
ARM
ARM
ARM
AR
Will never have a C compiler Only programmed using pattern-based
DSLs More dynamic, less static than earlier
approaches Dynamic dataflow-driven execution Dynamic routing Large memory support
UC Berkeley “Agile Hardware” Development
Current hardware design slow and arduous But now have huge design space to explore How to examine many design points efficiently?
Build parameterized generators, not point designs! Adopt and adapt best practices from Agile Software
- Complete LVS-DRC clean physical design of current version every ~ two weeks (“tapein”)
- Incremental feature addition- Test & Verification first step
30
UC Berkeley
31
Chisel: Constructing Hardware In a Scala Embedded Language
Embed a hardware-description language in Scala, using Scala’s extension facilities
A hardware module is just a data structure in Scala Different output routines can generate different types of
output (C, FPGA-Verilog, ASIC-Verilog) from same hardware representation
Full power of Scala for writing hardware generators- Object-Oriented: Factory objects, traits, overloading etc- Functional: Higher-order funcs, anonymous funcs, currying- Compiles to JVM: Good performance, Java interoperability
UC Berkeley
32
Chisel Design Flow
Chisel Program
C++ code
FPGA Verilog
ASIC Verilog
Software Simulator
C++ Compiler
Scala/JVM
FPGA Emulation
FPGA Tools
GDS Layout
ASIC Tools
UC Berkeley Chisel is much more than an HDL The base Chisel system allows you to use the full power
of Scala to describe the RTL of a design, then generate Verilog or C++ output from the RTL
But Chisel can be extended above with domain-specific languages (e.g., signal processing) for fabric
Importantly, Chisel can also be extended below with new backends or to add new tools or features (e.g., quantum computing circuits)
Only ~6,000 lines of code in current version including libraries!
BSD-licensed open source at:chisel.eecs.berkeley.edu
33
UC Berkeley
34
Many processor tapeouts in few years with small group (45nm, 28nm)
Clock test site
SRAM test site
DCDC test site
Processor Site
CO
RE
0VC
0
CO
RE
1VC
1
CO
RE
2VC
2
CO
RE
3VC
3
512K
BL2
VFIX
ED
Test
Site
s
UC Berkeley Resilient Circuits & Modeling
Future scaled technologies have high variability but want to run with lowest-possible margins to save energy
Significant increase in soft errors, need resilient systems Technology modeling to determine tradeoff between MTBF
and energy per task for logic, SRAM, & interconnect.
35
Techniques to reduce operating voltage can be worse for energy due to rapid rise in errors
UC Berkeley
36
Hard
war
eSo
ftwar
e
Computational and Structural Patterns
Algorithms and Specializers for Provably Optimal Implementations with Resiliency and Efficiency
Dense GraphSparse …
ESP (Ensembles of Specialized Processors)
Architecture
C++ Simulation
FPGA Emulation
Validation/Verification
ApplicationsAudio Recognition
Object Recognition
Scene Analysis
Hardware Cache Coherence
ASICSoC
FPGA ComputerImplementation Technologies
Communication-Avoiding AlgorithmsC-A GEMM C-A BFSC-A SpMV
Deep HW/SW Design-Space Exploration
Pipe&Filter Map-Reduce…
Hardware Generators using Chisel HDL
ILP Engine
Dense Engine
Sparse Engine
Graph Engine
ESP Core
Local Stores + DMA
Glue Code
Dense Code
SparseCode
Graph Code
ESP Code
Specializers with SEJITS Implementations and Autotuning
UC Berkeley ASPIRE Project
Initial $15.6M/5.5 year funding from DARPA PERFECT program- Started 9/28/2012- Located in Par Lab space + BWRC
Looking for industrial affiliates (see Krste!) Open House today, 5th floor Soda Hall
37
Research funded by DARPA Award Number HR0011-12-2-0016. Approved for public release; distribution is unlimited. The content of this presentation does not necessarily reflect the position or the policy of the US government and no official endorsement should be inferred.
top related