APPROXIMATE COMPUTING ACROSS THE
STACK:
ARCHITECTURE AND SYSTEMS
Anand Raghunathan
Kaushik Roy
School of Electrical & Computer Engineering
Purdue University
Acknowledgment: Students, alumni and collaborators from
Integrated Systems Lab and Nanoelectronics Research Lab @ Purdue
OUTLINE
Where did we come from? Tracing the roots of
approximate computing
Where do we go from here? Taking approximate
computing to the mainstream
• Design automation
• Cross-layer approximate computing
• Moving approximate computing to programmable platforms
Killer apps for approximate computing?
AXC: WHERE DID WE COME FROM?
Tradeoffs between Quality of Results and Efficiency are
not new
• Intellectual roots of approximate computing can be traced
back to many fields
AXC: WHERE DID WE COME FROM?
Approximation, Heuristic, and Probabilistic algorithms
• Trade off the amount of work for sub-optimal or occasionally
incorrect results
AXC: WHERE DID WE COME FROM?
Networking
• Best-effort packet delivery (IP)
• Reliability layered on top only when needed (TCP)
• Many apps do not need or use reliable packet delivery!
Video, audio streaming
Packets may be dropped,
corrupted, or delivered
out-of-order!
AXC: WHERE DID WE COME FROM?
Large-scale unstructured data storage
Eventual (!) consistency
AXC: WHERE DID WE COME FROM?
Digital Signal Processing
• Filter design (optimize taps, coefficients, and precision based on
specifications)
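For example, a brief runnable sketch (illustrative specs, not from the talk) of how tap count and coefficient precision trade filter quality for cost, using SciPy:

    import numpy as np
    from scipy import signal

    # Two low-pass designs with the same band edge: more taps give a sharper,
    # costlier filter; fewer taps give a cheaper approximation.
    h_hq = signal.firwin(63, cutoff=0.2)   # higher-quality design, 63 taps
    h_lq = signal.firwin(15, cutoff=0.2)   # cheap approximation, 15 taps

    # Coefficient quantization trades precision for hardware cost.
    h_q8 = np.round(h_hq * 2**7) / 2**7    # roughly 8-bit fixed-point coefficients

    for h in (h_hq, h_lq, h_q8):
        w, H = signal.freqz(h)
        stop = np.abs(H[w > 0.3 * np.pi]).max()   # worst stopband gain
        print(len(h), "taps, worst stopband gain:", 20 * np.log10(stop), "dB")

Fewer taps and coarser coefficients both cheapen the filter; the stopband attenuation they sacrifice is exactly the kind of specification-driven quality knob DSP designers have long exploited.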
WHAT’S “NEW”? APPROXIMATE COMPUTING THROUGHOUT THE STACK
• No golden answer
• Perfect/correct answers not always possible
• Too expensive to produce perfect/correct answers
[Figure: the computing stack (Programming Languages, Compilers, Runtimes / Architecture / Logic / Circuits), with "Strict Numerical or Boolean Equivalence" as the traditional contract between layers]
AXC: SOME EARLY EFFORTS*
Approximate signal processing (Chandrakasan et al., 1997)
Voltage overscaling (Shanbhag et al., ISLPED 1999)
Probabilistic CMOS (Palem et al., 2003)
Manufacturing yield enhancement (Breuer et al., 2004-)
Energy-efficient, variation-tolerant approximate hardware (Roy et al., 2006-)
Probabilistic arithmetic / biased voltage overscaling (Palem et al., CASES 2006-)
Parallel runtime framework with computation skipping, dependency relaxation (Raghunathan et al., IPDPS 2009; IPDPS 2010)
Error-resilient / stochastic processors (Mitra et al., 2010; Kumar et al., 2010)
Cross-layer, scalable-effort approximate HW design (Chippa et al., 2010)
Programming support for approximate computing (Chilimbi et al., 2010; Misailovic et al., 2010; Sampson et al., 2011)
…
…
* Not an exhaustive list!
AXC: WHERE DO WE GO FROM HERE?
From: a collection of (mostly manual) design techniques
To: design automation & programmable platforms
From: a component-centric focus
To: holistic (system-level) impact
From: disconnected efforts at different layers of the stack
To: a cross-layer framework for approximate computing
OUTLINE
Where did we come from? Tracing the roots of
approximate computing
Where do we go from here? Taking approximate
computing to the mainstream
• Design automation
• Cross-layer approximate computing
• Moving approximate computing to programmable platforms
Killer apps for approximate computing?
APPROXIMATE COMPUTING @ PURDUE
Approximate Architecture & System Design
• Scalable Effort Hardware (DAC 2010, DAC 2011, CICC 2013)
• Significance Driven Computation: MPEG, H.264 (DAC 2009, ISLPED 2009)
• QUORA: Quality Programmable vector processor (MICRO 2013)
Approximate Circuit Design
• Voltage Scalable meta-functions (DATE 2011)
• Energy-quality tradeoff in DCT (DATE 2006)
• Approximate memory design (DAC 2009)
• IMPACT: Imprecise Adders for low power approximate computing (ISLPED 2011)
Design Automation for Approximate Computing
• SALSA: Systematic Logic Synthesis for Approximate Circuits (DAC 2012)
• Substitute-and-Simplify: Design of quality configurable circuits (DATE 2013)
• MACACO: Modeling and Verification of Circuits for Approximate Computing (ICCAD 2011)
Approximate Computing in Software
• Best-effort parallel computing (DAC 2010)
• Dependency relaxation (IPDPS 2010)
• Partitioned Iterative Convergence (Cluster 2012)
• Analysis and characterization of inherent application resilience (DAC 2013)
[Figure: block diagram of the quality-programmable processor: an APE array with MAPEs and SMs, MUX/DEMUX data paths, per-APE scratch registers, ALU and accumulator (ACC), instruction memory, scalar register file, program counter, instruction decode & control unit (CAPE), data memory, and external interface signals (CLK, RESET, SM/MAPE row and column selects, data read/write/address)]
Application Resilience Characterization (ARC) Framework (implemented in Valgrind)
[Figure: ARC flow. Resilience identification: the application and a quality function are used to partition the application into resilient and sensitive parts. Resilience characterization: the resilient parts are evaluated under approximation models 1..n on a dataset, producing quality profiles that are checked against quality constraints.]
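A minimal sketch of the ARC idea in Python; the real framework instruments application binaries in Valgrind, so the quality function, approximation model, and constraint below are illustrative assumptions:

    import numpy as np

    def quality(golden, approx):
        # Application-specific quality function (here: normalized mean error).
        return 1.0 - np.abs(golden - approx).mean() / (np.abs(golden).mean() + 1e-12)

    def characterize(kernel, approx_models, dataset, q_min=0.95):
        # Resilience characterization: run each candidate approximation model
        # over the dataset and build a quality profile; the kernel is resilient
        # under a model if quality stays above the constraint q_min.
        golden = np.array([kernel(x) for x in dataset])
        profile = {name: quality(golden, np.array([m(kernel, x) for x in dataset]))
                   for name, m in approx_models.items()}
        return {name: q >= q_min for name, q in profile.items()}, profile

    # Example: a dot-product kernel with operand truncation as the model.
    kernel = lambda x: np.dot(x, x)
    models = {"trunc8": lambda k, x: k(np.round(x * 2**8) / 2**8)}
    resilient, profile = characterize(kernel, models,
                                      [np.random.rand(64) for _ in range(10)])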
SALSA/SASIMI: synthesize an approximate circuit, or a quality-configurable circuit, from the original circuit
CAD for Approximate Computing
• Approximate accelerators
• Quality-programmable processors
A NEW CLASS OF WORKLOADS!
Recognition, Mining and Synthesis (RMS)¹
[Figure: cloud of RMS algorithms spanning machine learning, data mining, and computer vision: SVM, CNN, SOM, Bayesian, GMM, PCA, LDA, EM, K-means, ARM, decision tree, regression, semantic indexing, a-priori, k-NN]
¹P. Dubey, "A Platform 2015 Workload Model: Recognition, Mining and Synthesis Moves Computers to the Era of Tera," Feb. 2005.
INTRINSIC RESILIENCE IN RMS APPLICATIONS
V. K. Chippa, S. T. Chakradhar, K. Roy and A. Raghunathan, "Analysis and characterization of inherent application resilience for approximate computing," DAC 2013.
Recognition, Mining, Synthesis Application Suite
Example: image search. A query image is run through Principal Component Analysis and an SVM classifier to produce ranked results (0: Burger, 1: Bread, 2: Food, ..., 25: McDonalds); 83% of runtime is spent in computations that can be approximated.
[Figure: % of runtime in resilient computations (0-100) for: online data clustering, character recognition, health information analysis, census data classification, census data modeling, image segmentation, eye model generation, eye detection, digit model generation, digit recognition, image search, document search]
Applications have a mix of resilient and sensitive computations.
PURDUE APPROXIMATE RMS ACCELERATOR (PARMA)
Programmable RMS Accelerator
Deployment scenarios:
• Embedded IC on an embedded platform
• System-on-chip on a mobile platform
• Accelerator card on a server platform
[Figure: PARMA organization: a programmable control processor, memory administrator, host interface, and banked memories connected through an on-chip control interconnect network to a 2D array of processing elements (PEs) with row and column controllers (CTRL)]
PURDUE APPROXIMATE RMS ACCELERATOR: OVERVIEW
[Figure: PARMA microarchitecture: the host processor reaches PARMA over the on-chip bus through a host interface; a main controller coordinates level 1 and level 2 memories, FIFOs, and an array of PEs, each with its own control]
Scaling Mechanisms
• Algorithm: skip less significant computations
• Architecture: modulate the precision of variables & operations (see the sketch below)
• Circuit: voltage overscaling
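To illustrate the architecture-level knob, a minimal Python sketch of precision modulation on a dot-product kernel; the bit-widths and fixed-point scheme are illustrative assumptions, not PARMA's actual datapath:

    import numpy as np

    def quantize(x, frac_bits):
        # Truncate operands to fixed point with `frac_bits` fraction bits:
        # the architecture-level precision knob.
        scale = 2.0 ** frac_bits
        return np.floor(x * scale) / scale

    def approx_dot(w, x, frac_bits=4):
        # SVM and K-means kernels reduce to dot products; lowering operand
        # precision shrinks multiplier energy at some accuracy cost.
        return np.dot(quantize(w, frac_bits), quantize(x, frac_bits))

    w, x = np.random.randn(128), np.random.randn(128)
    for b in (16, 8, 4, 2):
        print(b, "fraction bits, error:", abs(approx_dot(w, x, b) - np.dot(w, x)))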
Applications
• Support Vector Machines: handwritten digit recognition, checkerboard, 3D object detection
• K-means clustering: image segmentation
• GLVQ training: eye detection
RM Processor
• Process technology: TSMC 65nm
• Supply voltage: 0.6-1.0 V core, 3.3 V I/O
• Frequency: 250 MHz
• Power: 0.96 mW to 7.9 mW
• Area: 1.3 mm²
V. K. Chippa, et al., CICC 2013
PURDUE APPROXIMATE RMS ACCELERATOR: RESULTS
Energy savings:
• 1.3X-2X energy savings with no impact on output quality
• 2X-20X energy savings with 5% loss in quality
[Figure: normalized energy consumption for digit recognition, checkerboard classification, image segmentation, and eye detection under quality-loss budgets of 0%, 5%, and 10%, relative to no scaling]
Benefits of cross-layer design (SVM classification):
• Single-level scaling does not maximally exploit the inherent application resilience
• Cross-layer optimization enables a superior energy-quality tradeoff
[Figure: energy consumption per test vector (in joules) vs. classification accuracy for circuit-only, architecture-only, algorithm-only, and cross-layer scaling]
V. K. Chippa, et al., CICC 2013
A CASE FOR CROSS-LAYER APPROXIMATE COMPUTING
Iso-accuracy contours show weak dependence, i.e., for a given accuracy, maximal scaling at one level still leaves room for scaling at other levels.
Impact on decision boundary (SVM classification on PARMA):
Scaling at each level affects the decision boundary in a different manner.
[Figure: decision boundaries under Ideal, No Scaling, Algorithm, Architecture, and Circuit scaling]
APPROXIMATE COMPUTING IN SOFTWARE
• Skip computations that have lower impact on the result
• Relax global synchronization and communication
Templates allow programmers to easily specify mechanisms for
computation skipping and dependency relaxation
• Auto-tuning and runtime frameworks explore quality-speed tradeoff
Example: Iterative-convergence pattern
iterate {
    for (i = 1 to M) {
        …
    }
} until converged();
Iterative-convergence model (IPDPS 2009), mapped onto a best-effort stack (Apps / Prog. models / OS / HW):

1. Baseline: the best-effort runtime decides when to stop iterations.
iterate {
    for (j = 1 to M) {
        ..
    }
} until converged();

2. Computation skipping: filter computations that can be dropped.
iterate {
    mask[0 to M] = filter();
    for (j = 1 to M, mask) {
        ..
    }
} until converged();

3. Dependency relaxation: ignore some dependencies by processing iterations in batches of P.
iterate {
    mask[0 to M] = filter();
    for (j = 1 to M, mask, batch P) {
        ..
    }
} until converged();
Example: K-means expressed in the iterative-convergence template:

float *[M] points;
float *[k] means;
int [M] memberships;
float *[M] distances;

K-means() {
    means = random_array(k);
    iterate {
        for (i = 1 to M) {
            distances[i] = calc_distances(means, points[i]);
            memberships[i] = compute_cluster(distances);
        }
        means = calc_means(points, memberships);
    } until (no_change(means));
}
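To make the template concrete, here is a minimal runnable Python sketch of best-effort K-means with computation skipping. This is an illustrative reconstruction, not the IPDPS 2009 / DAC 2010 implementation; the stability-based filter heuristic, tolerance, and function names are assumptions:

    import numpy as np

    def best_effort_kmeans(points, k, max_iter=100, stable_rounds=2, tol=1e-4):
        # Best-effort K-means: points whose cluster membership has been stable
        # for `stable_rounds` iterations are masked out of the distance
        # computation (computation skipping); convergence is also relaxed to
        # a tolerance instead of exact no_change().
        rng = np.random.default_rng(0)
        means = points[rng.choice(len(points), k, replace=False)]
        membership = np.full(len(points), -1)
        stable = np.zeros(len(points), dtype=int)
        for _ in range(max_iter):
            active = stable < stable_rounds              # the filter() mask
            d = np.linalg.norm(points[active][:, None, :] - means[None, :, :],
                               axis=2)
            new_m = membership.copy()
            new_m[active] = d.argmin(axis=1)
            stable = np.where(new_m == membership, stable + 1, 0)
            membership = new_m
            new_means = np.array([points[membership == j].mean(axis=0)
                                  if np.any(membership == j) else means[j]
                                  for j in range(k)])
            if np.allclose(new_means, means, atol=tol):  # relaxed convergence
                return new_means, membership
            means = new_means
        return means, membership

Points with stable memberships are dropped from the distance computation, which dominates the runtime, so the savings grow as clustering converges.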
APPROXIMATE COMPUTING IN SOFTWARE
Applications:
• Image segmentation (K-means)
• Face detection (GLVQ)
• Semantic document search
• Eye detection, used in NEC's Face Recognition products
[Figure: GLVQ training: input images (training vectors) go through distance computation against classes with reference vectors (A1, A2, B1, B2, C1, C2, ...), followed by updates to the reference vectors]
Conventional algorithms process one training vector at a time due to dependencies. But, to scale, we need parallelism across many training vectors!
New algorithm: relax dependencies and process multiple training vectors simultaneously (~5X over the parallel implementation!)
iterate {
    for (j = 1 to M, batch P) {
        ..
    }
} until converged();
Results (Dell 2950: 8-core Xeon, 32 GB RAM; Intel TBB):
• 5X speedup at 99% accuracy
• 3X speedup at iso-accuracy
• 4.9X speedup with 0.1% error
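A minimal mini-batch-style Python sketch of the dependency-relaxation idea; this is not NEC's GLVQ code, and the LVQ-style update rule, learning rate, and batch size P are assumptions:

    import numpy as np

    def relaxed_glvq_step(vectors, labels, refs, ref_labels, lr=0.05, batch_p=32):
        # Dependency-relaxed training: reference vectors are updated once per
        # batch of P training vectors instead of after every vector, so all
        # distance computations in a batch can run in parallel on stale refs.
        for start in range(0, len(vectors), batch_p):
            xb = vectors[start:start + batch_p]
            yb = labels[start:start + batch_p]
            d = np.linalg.norm(xb[:, None, :] - refs[None, :, :], axis=2)
            nearest = d.argmin(axis=1)
            delta = np.zeros_like(refs)
            for x, y, j in zip(xb, yb, nearest):
                sign = 1.0 if ref_labels[j] == y else -1.0  # LVQ-style pull/push
                delta[j] += sign * lr * (x - refs[j])
            refs += delta                                   # one update per batch
        return refs

Using stale reference vectors within a batch is exactly the relaxed dependency: each batch's distance computations become independent and parallelizable.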
AxC @ Purdue
RM Processor and PARMA (Purdue Approximate RMS Accelerator), TSMC 65nm and IBM 45nm:
• RMS accelerators with TOPS/W processing efficiency: cross-layer AxC leads to 20X improvement in energy efficiency
• Quality-programmable processors: AxC enables programmable vector processors with 100s of GOPS/W
• Parallel software with improved scalability: the best-effort iterative-convergence model (IPDPS 2009) gives 3-5X speedup on multi-cores and many-cores
• Highly efficient BigData processing on clusters: AxC leads to 5X speedup over MapReduce implementations on 100-1000 node clusters (PIC library, an extension of Hadoop; PageRank)
OUTLINE
Where did we come from? Tracing the roots of
approximate computing
Taking approximate computing to the mainstream
• Design automation
• Cross-layer approximate computing
• Moving approximate computing to programmable platforms
Killer apps for approximate computing?
THE COMPUTATIONAL EFFICIENCY GAP
[Figure: today's gap: ~20 W versus ~200,000 W for IBM Watson playing Jeopardy, 2011. Credit: Yoshua Bengio, 2014 MLSS]
CHALLENGE 1: EMBEDDED LEARNING
Need to embed intelligence in mobiles, wearables, IoT devices
• Often not feasible to upload data to the cloud (connectivity, latency, energy, privacy)
Case study: image recognition using deep learning networks
• 2-20 Giga-operations to classify a ~220x220 image!
• Significant growth in computational complexity over the years
[Figure: number of scalar ops (in billions) per classified image for AlexNet (2012), OverFeat (2013), GoogLeNet (2014), VGG-A (2014), and VGG-E (2014), ranging from roughly 2.4 to over 20]
Battery life for the OverFeat deep convolutional net on a mobile GPU*:
• Energy/op (mobile GPU): 5 x 10^-2 nJ/op
• Energy/frame: 0.16 J/frame
• Time-to-die (2.1 Wh battery): 25 min (ideal)
• Performance (OMAP 4430): 1.5 fps (ideal)
*https://wiki.ubuntu.com/Specs/M/ARMSoCOMAP?action=AttachFile&do=get&target=OMAP_Overview_UDS.pdf
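A quick consistency check of these numbers; hedged, since the ops-per-frame figure is inferred from the stated energy values rather than given on the slide:

    # Hedged sanity check of the mobile-GPU numbers above.
    energy_per_op = 5e-2 * 1e-9                # 5 x 10^-2 nJ/op, in joules
    ops_per_frame = 0.16 / energy_per_op       # ~3.2e9 scalar ops per frame
    battery_joules = 2.1 * 3600                # 2.1 Wh battery = 7560 J
    frames_per_charge = battery_joules / 0.16  # ~47,000 frames, best case
    print(ops_per_frame, frames_per_charge)

The inferred ~3.2 Giga-ops per frame sits squarely in the 2-20 Giga-operation range quoted above.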
CHALLENGE 2: MODEL BUILDING AT SCALE
Turn-around time for learning is the key challenge
Example: VGG-E network
Performance:
• FLOPs/epoch (training): 61.5 Peta-ops
• Training time (Xeon Phi 7120P, peak 2.4 TFLOPS): ~27 days (ideal)
Energy:
• Training energy (1.5 kW/node): ~972 kWh
Need to parallelize on large-scale clusters (100s of nodes)
• Communication and global synchronization limit the parallel scalability of back-propagation / SGD
[Figure: scalar ops and memory accesses (peta ops/accesses, log scale) for model building for image recognition: OverFeat, VGG-E, and projected ImageNet-22K]
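These headline figures are mutually consistent, as a quick check shows; the ~91 epoch count is inferred, not stated:

    flops_per_epoch = 61.5e15                       # 61.5 Peta-ops per epoch
    peak_flops = 2.4e12                             # Xeon Phi 7120P peak
    secs_per_epoch = flops_per_epoch / peak_flops   # ~25,600 s (~7.1 h) per epoch
    epochs = 27 * 24 * 3600 / secs_per_epoch        # ~91 epochs fit in ~27 days
    energy_kwh = 1.5 * 27 * 24                      # 1.5 kW x 27 days = 972 kWh
    print(secs_per_epoch, epochs, energy_kwh)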
AXNN: APPROXIMATE NEURAL NETS
[Figure: normalized energy for MNIST, Facedet, SvnH, Cifar, Cifar-mlp, Adult, and their geometric mean, for the original network and at quality losses of <0.5%, ~2.5%, and ~7.5%]

Application                | Layers | Neurons | Parameters
House Number Recognition   | 8      | 47818   | 847434
Object Classification      | 6      | 38282   | 846890
Digit Recognition          | 6      | 8010    | 51046
Face Detection             | 4      | 13362   | 25634
Object Recognition MLP     | 2      | 1034    | 3157002
Census Data Analysis       | 2      | 12      | 172

[Figure: LeNet-5 resilience map: resilient and sensitive neurons across the input, Layer 1, Layer 3, Layer 5, and Layer 6, with energy savings annotated]
S. Venkataramani et al., "AxNN: energy-efficient neuromorphic systems using approximate computing," ISLPED 2014.
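A simplified Python sketch of the AxNN idea: estimate each neuron's significance from its contribution to output error, then approximate only the resilient set. The significance metric and the weight-quantization mechanism here are simplified assumptions, not the exact ISLPED 2014 method:

    import numpy as np

    def neuron_significance(activations, out_grads):
        # Estimate each neuron's contribution to output error as the mean
        # |activation * backpropagated gradient| over a characterization set.
        return np.mean(np.abs(activations * out_grads), axis=0)

    def approximate_resilient(weights, significance, frac=0.5, bits=8):
        # Quantize the outgoing weights of the least significant (most
        # resilient) fraction of neurons; sensitive neurons stay exact.
        order = np.argsort(significance)
        resilient = order[: max(1, int(frac * len(order)))]
        w = weights.copy()
        step = np.abs(w[resilient]).max() / (2 ** (bits - 1) - 1)
        if step > 0:
            w[resilient] = np.round(w[resilient] / step) * step
        return w

    acts = np.random.rand(100, 32)     # 100 samples x 32 neurons
    grads = np.random.randn(100, 32)   # backpropagated error terms
    W = np.random.randn(32, 10)        # 32 neurons x 10 outputs
    W_ax = approximate_resilient(W, neuron_significance(acts, grads))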
AGES OF APPROXIMATE COMPUTING?
The Age of Gods: computing systems were approximate, but outside the control of the "layers of the stack"
The Age of Heroes: computing system designers realize the opportunity, but heroic effort (a.k.a. grad students) is necessary
The Age of Men: approximate computing is democratized; accessible to the "common" designer / end user
[Figure: the computing system stack spanning the "seventh heaven" of applications and the "hell" of nanoscale physics; PARMA: Purdue Approximate RM Accelerator (65nm CMOS)]
https://engineering.purdue.edu/ISL/AxC