APPROXIMATE COMPUTING ACROSS THE
STACK:
ARCHITECTURE AND SYSTEMS
Anand Raghunathan
Kaushik Roy
School of Electrical & Computer Engineering
Purdue University
Acknowledgment: Students, alumni and collaborators from
Integrated Systems Lab and Nanoelectronics Research Lab @ Purdue
OUTLINE
Where did we come from? Tracing the roots of
approximate computing
Where do we go from here? Taking approximate
computing to the mainstream
• Design automation
• Cross-layer approximate computing
• Moving approximate computing to programmable platforms
Killer apps for approximate computing?
AXC: WHERE DID WE COME FROM?
Tradeoffs between Quality of Results and Efficiency are
not new
• Intellectual roots of approximate computing can be traced
back to many fields
AXC: WHERE DID WE COME FROM?
Approximation, Heuristic, and Probabilistic algorithms
• Trade off the amount of work for sub-optimal or occasionally
incorrect results
AXC: WHERE DID WE COME FROM?
Networking
• Best-effort packet delivery (IP)
• Reliability layered on top only when needed (TCP)
• Many apps do not need or use reliable packet delivery!
Video, audio streaming
Packets may be dropped,
corrupted, or delivered
out-of-order!
AXC: WHERE DID WE COME FROM?
Large-scale unstructured data storage
Eventual (!) consistency
AXC: WHERE DID WE COME FROM?
Digital Signal Processing
• Filter design (optimize taps, coefficients, and precision based on
specifications)
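For example, a brief runnable sketch (illustrative specs, not from the talk) of how tap count and coefficient precision trade filter quality for cost, using SciPy:

    import numpy as np
    from scipy import signal

    # Two low-pass designs with the same band edge: more taps give a sharper,
    # costlier filter; fewer taps give a cheaper approximation.
    h_hq = signal.firwin(63, cutoff=0.2)   # higher-quality design, 63 taps
    h_lq = signal.firwin(15, cutoff=0.2)   # cheap approximation, 15 taps

    # Coefficient quantization trades precision for hardware cost.
    h_q8 = np.round(h_hq * 2**7) / 2**7    # roughly 8-bit fixed-point coefficients

    for h in (h_hq, h_lq, h_q8):
        w, H = signal.freqz(h)
        stop = np.abs(H[w > 0.3 * np.pi]).max()   # worst stopband gain
        print(len(h), "taps, worst stopband gain:", 20 * np.log10(stop), "dB")

Fewer taps and coarser coefficients both cheapen the filter; the stopband attenuation they sacrifice is exactly the kind of specification-driven quality knob DSP designers have long exploited.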
WHAT’S “NEW”? APPROXIMATE COMPUTING THROUGHOUT THE STACK
• No golden answer
• Perfect/correct answers not always possible
• Too expensive to produce perfect/correct answers
[Figure: the computing stack (Programming Languages, Compilers, Runtimes / Architecture / Logic / Circuits), with "Strict Numerical or Boolean Equivalence" as the traditional contract between layers]
AXC: SOME EARLY EFFORTS*
Approximate signal processing (Chandrakasan et al., 1997)
Voltage overscaling (Shanbhag et al., ISLPED 1999)
Probabilistic CMOS (Palem et al., 2003)
Manufacturing yield enhancement (Breuer et al., 2004-)
Energy-efficient, variation-tolerant approximate hardware (Roy et al., 2006-)
Probabilistic arithmetic / biased voltage overscaling (Palem et al., CASES 2006-)
Parallel runtime framework with computation skipping, dependency relaxation (Raghunathan et al., IPDPS 2009; IPDPS 2010)
Error-resilient / stochastic processors (Mitra et al., 2010; Kumar et al., 2010)
Cross-layer, scalable-effort approximate HW design (Chippa et al., 2010)
Programming support for approximate computing (Chilimbi et al., 2010; Misailovic et al., 2010; Sampson et al., 2011)
…
…
* Not an exhaustive list!
AXC: WHERE DO WE GO FROM HERE?
From: a collection of (mostly manual) design techniques
To: design automation & programmable platforms
From: a component-centric focus
To: holistic (system-level) impact
From: disconnected efforts at different layers of the stack
To: a cross-layer framework for approximate computing
OUTLINE
Where did we come from? Tracing the roots of
approximate computing
Where do we go from here? Taking approximate
computing to the mainstream
• Design automation
• Cross-layer approximate computing
• Moving approximate computing to programmable platforms
Killer apps for approximate computing?
APPROXIMATE COMPUTING @ PURDUE
Approximate Architecture & System Design
• Scalable Effort Hardware (DAC 2010, DAC 2011, CICC 2013)
• Significance Driven Computation: MPEG, H.264 (DAC 2009, ISLPED 2009)
• QUORA: Quality Programmable vector processor (MICRO 2013)
Approximate Circuit Design
• Voltage Scalable meta-functions (DATE 2011)
• Energy-quality tradeoff in DCT (DATE 2006)
• Approximate memory design (DAC 2009)
• IMPACT: Imprecise Adders for low power approximate computing (ISLPED 2011)
Design Automation for Approximate Computing
• SALSA: Systematic Logic Synthesis for Approximate Circuits (DAC 2012)
• Substitute-and-Simplify: Design of quality configurable circuits (DATE 2013)
• MACACO: Modeling and Verification of Circuits for Approximate Computing (ICCAD 2011)
Approximate Computing in Software
• Best-effort parallel computing (DAC 2010)
• Dependency relaxation (IPDPS 2010)
• Partitioned Iterative Convergence (Cluster 2012)
• Analysis and characterization of inherent application resilience (DAC 2013)
[Figure: block diagram of the quality-programmable processor: an APE array with MAPEs and SMs, MUX/DEMUX data paths, per-APE scratch registers, ALU and accumulator (ACC), instruction memory, scalar register file, program counter, instruction decode & control unit (CAPE), data memory, and external interface signals (CLK, RESET, SM/MAPE row and column selects, data read/write/address)]
Application Resilience Characterization (ARC) Framework (implemented in Valgrind)
[Figure: ARC flow. Resilience identification: the application and a quality function are used to partition the application into resilient and sensitive parts. Resilience characterization: the resilient parts are evaluated under approximation models 1..n on a dataset, producing quality profiles that are checked against quality constraints.]
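A minimal sketch of the ARC idea in Python; the real framework instruments application binaries in Valgrind, so the quality function, approximation model, and constraint below are illustrative assumptions:

    import numpy as np

    def quality(golden, approx):
        # Application-specific quality function (here: normalized mean error).
        return 1.0 - np.abs(golden - approx).mean() / (np.abs(golden).mean() + 1e-12)

    def characterize(kernel, approx_models, dataset, q_min=0.95):
        # Resilience characterization: run each candidate approximation model
        # over the dataset and build a quality profile; the kernel is resilient
        # under a model if quality stays above the constraint q_min.
        golden = np.array([kernel(x) for x in dataset])
        profile = {name: quality(golden, np.array([m(kernel, x) for x in dataset]))
                   for name, m in approx_models.items()}
        return {name: q >= q_min for name, q in profile.items()}, profile

    # Example: a dot-product kernel with operand truncation as the model.
    kernel = lambda x: np.dot(x, x)
    models = {"trunc8": lambda k, x: k(np.round(x * 2**8) / 2**8)}
    resilient, profile = characterize(kernel, models,
                                      [np.random.rand(64) for _ in range(10)])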
SALSA/SASIMI: synthesize an approximate circuit, or a quality-configurable circuit, from the original circuit
CAD for Approximate Computing
• Approximate accelerators
• Quality-programmable processors
A NEW CLASS OF WORKLOADS!
Recognition, Mining and Synthesis (RMS)¹
[Figure: cloud of RMS algorithms spanning machine learning, data mining, and computer vision: SVM, CNN, SOM, Bayesian, GMM, PCA, LDA, EM, K-means, ARM, decision tree, regression, semantic indexing, a-priori, k-NN]
¹P. Dubey, "A Platform 2015 Workload Model: Recognition, Mining and Synthesis Moves Computers to the Era of Tera," Feb. 2005.
INTRINSIC RESILIENCE IN RMS APPLICATIONS
V. K. Chippa, S. T. Chakradhar, K. Roy and A. Raghunathan, "Analysis and characterization of inherent application resilience for approximate computing," DAC 2013.
Recognition, Mining, Synthesis Application Suite
Example: image search. A query image is run through Principal Component Analysis and an SVM classifier to produce ranked results (0: Burger, 1: Bread, 2: Food, ..., 25: McDonalds); 83% of runtime is spent in computations that can be approximated.
[Figure: % of runtime in resilient computations (0-100) for: online data clustering, character recognition, health information analysis, census data classification, census data modeling, image segmentation, eye model generation, eye detection, digit model generation, digit recognition, image search, document search]
Applications have a mix of resilient and sensitive computations.
PURDUE APPROXIMATE RMS ACCELERATOR (PARMA)
Programmable RMS Accelerator
Deployment scenarios:
• Embedded IC on an embedded platform
• System-on-chip on a mobile platform
• Accelerator card on a server platform
[Figure: PARMA organization: a programmable control processor, memory administrator, host interface, and banked memories connected through an on-chip control interconnect network to a 2D array of processing elements (PEs) with row and column controllers (CTRL)]
PURDUE APPROXIMATE RMS ACCELERATOR: OVERVIEW
[Figure: PARMA microarchitecture: the host processor reaches PARMA over the on-chip bus through a host interface; a main controller coordinates level 1 and level 2 memories, FIFOs, and an array of PEs, each with its own control]
Scaling Mechanisms
• Algorithm: skip less significant computations
• Architecture: modulate the precision of variables & operations (see the sketch below)
• Circuit: voltage overscaling
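To illustrate the architecture-level knob, a minimal Python sketch of precision modulation on a dot-product kernel; the bit-widths and fixed-point scheme are illustrative assumptions, not PARMA's actual datapath:

    import numpy as np

    def quantize(x, frac_bits):
        # Truncate operands to fixed point with `frac_bits` fraction bits:
        # the architecture-level precision knob.
        scale = 2.0 ** frac_bits
        return np.floor(x * scale) / scale

    def approx_dot(w, x, frac_bits=4):
        # SVM and K-means kernels reduce to dot products; lowering operand
        # precision shrinks multiplier energy at some accuracy cost.
        return np.dot(quantize(w, frac_bits), quantize(x, frac_bits))

    w, x = np.random.randn(128), np.random.randn(128)
    for b in (16, 8, 4, 2):
        print(b, "fraction bits, error:", abs(approx_dot(w, x, b) - np.dot(w, x)))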
Applications
• Support Vector Machines: handwritten digit recognition, checkerboard, 3D object detection
• K-means clustering: image segmentation
• GLVQ training: eye detection
RM Processor
• Process technology: TSMC 65nm
• Supply voltage: 0.6-1.0 V core, 3.3 V I/O
• Frequency: 250 MHz
• Power: 0.96 mW to 7.9 mW
• Area: 1.3 mm²
V. K. Chippa, et al., CICC 2013
PURDUE APPROXIMATE RMS ACCELERATOR: RESULTS
Energy savings:
• 1.3X-2X energy savings with no impact on output quality
• 2X-20X energy savings with 5% loss in quality
[Figure: normalized energy consumption for digit recognition, checkerboard classification, image segmentation, and eye detection under quality-loss budgets of 0%, 5%, and 10%, relative to no scaling]
Benefits of cross-layer design (SVM classification):
• Single-level scaling does not maximally exploit the inherent application resilience
• Cross-layer optimization enables a superior energy-quality tradeoff
[Figure: energy consumption per test vector (in joules) vs. classification accuracy for circuit-only, architecture-only, algorithm-only, and cross-layer scaling]
V. K. Chippa, et al., CICC 2013
A CASE FOR CROSS-LAYER APPROXIMATE COMPUTING
Iso-accuracy contours show weak dependence, i.e., for a given accuracy, maximal scaling at one level still leaves room for scaling at other levels.
Impact on decision boundary (SVM classification on PARMA):
Scaling at each level affects the decision boundary in a different manner.
[Figure: decision boundaries under Ideal, No Scaling, Algorithm, Architecture, and Circuit scaling]
APPROXIMATE COMPUTING IN SOFTWARE
• Skip computations that have lower impact on the result
• Relax global synchronization and communication
Templates allow programmers to easily specify mechanisms for
computation skipping and dependency relaxation
• Auto-tuning and runtime frameworks explore quality-speed tradeoff
Example: Iterative-convergence pattern
iterate {
    for (i = 1 to M) {
        …
    }
} until converged();
Iterative-convergence model (IPDPS 2009), mapped onto a best-effort stack (Apps / Prog. models / OS / HW):

1. Baseline: the best-effort runtime decides when to stop iterations.
iterate {
    for (j = 1 to M) {
        ..
    }
} until converged();

2. Computation skipping: filter computations that can be dropped.
iterate {
    mask[0 to M] = filter();
    for (j = 1 to M, mask) {
        ..
    }
} until converged();

3. Dependency relaxation: ignore some dependencies by processing iterations in batches of P.
iterate {
    mask[0 to M] = filter();
    for (j = 1 to M, mask, batch P) {
        ..
    }
} until converged();
Example: K-means expressed in the iterative-convergence template:

float *[M] points;
float *[k] means;
int [M] memberships;
float *[M] distances;

K-means() {
    means = random_array(k);
    iterate {
        for (i = 1 to M) {
            distances[i] = calc_distances(means, points[i]);
            memberships[i] = compute_cluster(distances);
        }
        means = calc_means(points, memberships);
    } until (no_change(means));
}
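To make the template concrete, here is a minimal runnable Python sketch of best-effort K-means with computation skipping. This is an illustrative reconstruction, not the IPDPS 2009 / DAC 2010 implementation; the stability-based filter heuristic, tolerance, and function names are assumptions:

    import numpy as np

    def best_effort_kmeans(points, k, max_iter=100, stable_rounds=2, tol=1e-4):
        # Best-effort K-means: points whose cluster membership has been stable
        # for `stable_rounds` iterations are masked out of the distance
        # computation (computation skipping); convergence is also relaxed to
        # a tolerance instead of exact no_change().
        rng = np.random.default_rng(0)
        means = points[rng.choice(len(points), k, replace=False)]
        membership = np.full(len(points), -1)
        stable = np.zeros(len(points), dtype=int)
        for _ in range(max_iter):
            active = stable < stable_rounds              # the filter() mask
            d = np.linalg.norm(points[active][:, None, :] - means[None, :, :],
                               axis=2)
            new_m = membership.copy()
            new_m[active] = d.argmin(axis=1)
            stable = np.where(new_m == membership, stable + 1, 0)
            membership = new_m
            new_means = np.array([points[membership == j].mean(axis=0)
                                  if np.any(membership == j) else means[j]
                                  for j in range(k)])
            if np.allclose(new_means, means, atol=tol):  # relaxed convergence
                return new_means, membership
            means = new_means
        return means, membership

Points with stable memberships are dropped from the distance computation, which dominates the runtime, so the savings grow as clustering converges.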
APPROXIMATE COMPUTING IN SOFTWARE
Applications:
• Image segmentation (K-means)
• Face detection (GLVQ)
• Semantic document search
• Eye detection, used in NEC's Face Recognition products
[Figure: GLVQ training: input images (training vectors) go through distance computation against classes with reference vectors (A1, A2, B1, B2, C1, C2, ...), followed by updates to the reference vectors]
Conventional algorithms process one training vector at a time due to dependencies. But, to scale, we need parallelism across many training vectors!
New algorithm: relax dependencies and process multiple training vectors simultaneously (~5X over the parallel implementation!)
iterate {
    for (j = 1 to M, batch P) {
        ..
    }
} until converged();
Results (Dell 2950: 8-core Xeon, 32 GB RAM; Intel TBB):
• 5X speedup at 99% accuracy
• 3X speedup at iso-accuracy
• 4.9X speedup with 0.1% error
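A minimal mini-batch-style Python sketch of the dependency-relaxation idea; this is not NEC's GLVQ code, and the LVQ-style update rule, learning rate, and batch size P are assumptions:

    import numpy as np

    def relaxed_glvq_step(vectors, labels, refs, ref_labels, lr=0.05, batch_p=32):
        # Dependency-relaxed training: reference vectors are updated once per
        # batch of P training vectors instead of after every vector, so all
        # distance computations in a batch can run in parallel on stale refs.
        for start in range(0, len(vectors), batch_p):
            xb = vectors[start:start + batch_p]
            yb = labels[start:start + batch_p]
            d = np.linalg.norm(xb[:, None, :] - refs[None, :, :], axis=2)
            nearest = d.argmin(axis=1)
            delta = np.zeros_like(refs)
            for x, y, j in zip(xb, yb, nearest):
                sign = 1.0 if ref_labels[j] == y else -1.0  # LVQ-style pull/push
                delta[j] += sign * lr * (x - refs[j])
            refs += delta                                   # one update per batch
        return refs

Using stale reference vectors within a batch is exactly the relaxed dependency: each batch's distance computations become independent and parallelizable.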
AxC @ Purdue
RM Processor and PARMA (Purdue Approximate RMS Accelerator), TSMC 65nm and IBM 45nm:
• RMS accelerators with TOPS/W processing efficiency: cross-layer AxC leads to 20X improvement in energy efficiency
• Quality-programmable processors: AxC enables programmable vector processors with 100s of GOPS/W
• Parallel software with improved scalability: the best-effort iterative-convergence model (IPDPS 2009) gives 3-5X speedup on multi-cores and many-cores
• Highly efficient BigData processing on clusters: AxC leads to 5X speedup over MapReduce implementations on 100-1000 node clusters (PIC library, an extension of Hadoop; PageRank)
OUTLINE
Where did we come from? Tracing the roots of
approximate computing
Taking approximate computing to the mainstream
• Design automation
• Cross-layer approximate computing
• Moving approximate computing to programmable platforms
Killer apps for approximate computing?
THE COMPUTATIONAL EFFICIENCY GAP
[Figure: today's gap: ~20 W versus ~200,000 W for IBM Watson playing Jeopardy, 2011. Credit: Yoshua Bengio, 2014 MLSS]
CHALLENGE 1: EMBEDDED LEARNING
Need to embed intelligence in mobiles, wearables, IoT devices
• Often not feasible to upload data to the cloud (connectivity, latency, energy, privacy)
Case study: image recognition using deep learning networks
• 2-20 Giga-operations to classify a ~220x220 image!
• Significant growth in computational complexity over the years
[Figure: number of scalar ops (in billions) per classified image for AlexNet (2012), OverFeat (2013), GoogLeNet (2014), VGG-A (2014), and VGG-E (2014), ranging from roughly 2.4 to over 20]
Battery life for the OverFeat deep convolutional net on a mobile GPU*:
• Energy/op (mobile GPU): 5 x 10^-2 nJ/op
• Energy/frame: 0.16 J/frame
• Time-to-die (2.1 Wh battery): 25 min (ideal)
• Performance (OMAP 4430): 1.5 fps (ideal)
*https://wiki.ubuntu.com/Specs/M/ARMSoCOMAP?action=AttachFile&do=get&target=OMAP_Overview_UDS.pdf
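A quick consistency check of these numbers; hedged, since the ops-per-frame figure is inferred from the stated energy values rather than given on the slide:

    # Hedged sanity check of the mobile-GPU numbers above.
    energy_per_op = 5e-2 * 1e-9                # 5 x 10^-2 nJ/op, in joules
    ops_per_frame = 0.16 / energy_per_op       # ~3.2e9 scalar ops per frame
    battery_joules = 2.1 * 3600                # 2.1 Wh battery = 7560 J
    frames_per_charge = battery_joules / 0.16  # ~47,000 frames, best case
    print(ops_per_frame, frames_per_charge)

The inferred ~3.2 Giga-ops per frame sits squarely in the 2-20 Giga-operation range quoted above.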
CHALLENGE 2: MODEL BUILDING AT SCALE
Turn-around time for learning is the key challenge
Example: VGG-E network
Performance:
• FLOPs/epoch (training): 61.5 Peta-ops
• Training time (Xeon Phi 7120P, peak 2.4 TFLOPS): ~27 days (ideal)
Energy:
• Training energy (1.5 kW/node): ~972 kWh
Need to parallelize on large-scale clusters (100s of nodes)
• Communication and global synchronization limit the parallel scalability of back-propagation / SGD
[Figure: scalar ops and memory accesses (peta ops/accesses, log scale) for model building for image recognition: OverFeat, VGG-E, and projected ImageNet-22K]
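These headline figures are mutually consistent, as a quick check shows; the ~91 epoch count is inferred, not stated:

    flops_per_epoch = 61.5e15                       # 61.5 Peta-ops per epoch
    peak_flops = 2.4e12                             # Xeon Phi 7120P peak
    secs_per_epoch = flops_per_epoch / peak_flops   # ~25,600 s (~7.1 h) per epoch
    epochs = 27 * 24 * 3600 / secs_per_epoch        # ~91 epochs fit in ~27 days
    energy_kwh = 1.5 * 27 * 24                      # 1.5 kW x 27 days = 972 kWh
    print(secs_per_epoch, epochs, energy_kwh)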
AXNN: APPROXIMATE NEURAL NETS
[Figure: normalized energy for MNIST, Facedet, SvnH, Cifar, Cifar-mlp, Adult, and their geometric mean, for the original network and at quality losses of <0.5%, ~2.5%, and ~7.5%]

Application                | Layers | Neurons | Parameters
House Number Recognition   | 8      | 47818   | 847434
Object Classification      | 6      | 38282   | 846890
Digit Recognition          | 6      | 8010    | 51046
Face Detection             | 4      | 13362   | 25634
Object Recognition MLP     | 2      | 1034    | 3157002
Census Data Analysis       | 2      | 12      | 172

[Figure: LeNet-5 resilience map: resilient and sensitive neurons across the input, Layer 1, Layer 3, Layer 5, and Layer 6, with energy savings annotated]
S. Venkataramani et al., "AxNN: energy-efficient neuromorphic systems using approximate computing," ISLPED 2014.
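A simplified Python sketch of the AxNN idea: estimate each neuron's significance from its contribution to output error, then approximate only the resilient set. The significance metric and the weight-quantization mechanism here are simplified assumptions, not the exact ISLPED 2014 method:

    import numpy as np

    def neuron_significance(activations, out_grads):
        # Estimate each neuron's contribution to output error as the mean
        # |activation * backpropagated gradient| over a characterization set.
        return np.mean(np.abs(activations * out_grads), axis=0)

    def approximate_resilient(weights, significance, frac=0.5, bits=8):
        # Quantize the outgoing weights of the least significant (most
        # resilient) fraction of neurons; sensitive neurons stay exact.
        order = np.argsort(significance)
        resilient = order[: max(1, int(frac * len(order)))]
        w = weights.copy()
        step = np.abs(w[resilient]).max() / (2 ** (bits - 1) - 1)
        if step > 0:
            w[resilient] = np.round(w[resilient] / step) * step
        return w

    acts = np.random.rand(100, 32)     # 100 samples x 32 neurons
    grads = np.random.randn(100, 32)   # backpropagated error terms
    W = np.random.randn(32, 10)        # 32 neurons x 10 outputs
    W_ax = approximate_resilient(W, neuron_significance(acts, grads))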
AGES OF APPROXIMATE COMPUTING?
The Age of Gods: computing systems were approximate, but outside the control of the "layers of the stack"
The Age of Heroes: computing system designers realize the opportunity, but heroic effort (a.k.a. grad students) is necessary
The Age of Men: approximate computing is democratized; accessible to the "common" designer / end user
[Figure: the computing system stack spanning the "seventh heaven" of applications and the "hell" of nanoscale physics; PARMA: Purdue Approximate RM Accelerator (65nm CMOS)]
https://engineering.purdue.edu/ISL/AxC