hades: architectural synthesis for heterogeneous dark...

27
HaDeS : Architectural Synthesis for Heterogeneous Dark Silicon Chip Multi - processors Yatish Turakhia, Bharathwaj Raghunathan, Siddharth Garg, Diana Marculescu 1

Upload: others

Post on 21-Jul-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: HaDeS: Architectural Synthesis for Heterogeneous Dark ...public.gi.ucsc.edu/~yatisht/files/DAC13_slides_Turakhia.pdf · Dark Silicon Challenge 0 0.2 0.4 0.6 0.8 1 1.2 0 5 10 15 20

HaDeS: Architectural Synthesis for Heterogeneous Dark Silicon Chip

Multi-processorsYatish Turakhia, Bharathwaj Raghunathan, Siddharth Garg,

Diana Marculescu

1

Page 2: HaDeS: Architectural Synthesis for Heterogeneous Dark ...public.gi.ucsc.edu/~yatisht/files/DAC13_slides_Turakhia.pdf · Dark Silicon Challenge 0 0.2 0.4 0.6 0.8 1 1.2 0 5 10 15 20

Dark Silicon Challenge

0

0.2

0.4

0.6

0.8

1

1.2

0

5

10

15

20

25

30

35

45 32 22 16 11 8

# Tr

ansi

stor

s

Pow

er/D

ark

Silic

on

Technology Node

Power

Dark Silicon

# Transistors80% dark silicon at the 8 nm technology node[Esmaeilzadeh et al., ISCA’11]

2

Page 3: HaDeS: Architectural Synthesis for Heterogeneous Dark ...public.gi.ucsc.edu/~yatisht/files/DAC13_slides_Turakhia.pdf · Dark Silicon Challenge 0 0.2 0.4 0.6 0.8 1 1.2 0 5 10 15 20

The Dark Silicon Era: Demise of Multi-core Scaling?} Rapid rise in the fraction of dark silicon due to power

constraints.

} What's the point of increasing core count if they aren’t active??

} Need to think out of the !!!

Dark

3

Page 4: HaDeS: Architectural Synthesis for Heterogeneous Dark ...public.gi.ucsc.edu/~yatisht/files/DAC13_slides_Turakhia.pdf · Dark Silicon Challenge 0 0.2 0.4 0.6 0.8 1 1.2 0 5 10 15 20

} How to best utilize dark silicon for performance enhancement?} Heterogeneity

Dark Silicon Architectures

Accelerators

Heterogeneous Cores

How do we decide?ØWhat types of cores and how many cores of each type should one provision on a CMP of given area so as to maximize the overall performance benefit?ØWhich core/s to turn on without exceeding power budget / how to map threads?ØHow much parallelism to exploit?

4

Page 5: HaDeS: Architectural Synthesis for Heterogeneous Dark ...public.gi.ucsc.edu/~yatisht/files/DAC13_slides_Turakhia.pdf · Dark Silicon Challenge 0 0.2 0.4 0.6 0.8 1 1.2 0 5 10 15 20

Library of Cores

Area Budget Peak PowerConstraint

Multi-threaded Applications

seq

parpar par

seq

parpar par

Architectural Synthesis+

DOP Optimization and Thread Mapping

HaDeS

Synthesized Heterogeneous

Dark Silicon CMPLast Level Cache

HaDeS: Overview

5

Page 6: HaDeS: Architectural Synthesis for Heterogeneous Dark ...public.gi.ucsc.edu/~yatisht/files/DAC13_slides_Turakhia.pdf · Dark Silicon Challenge 0 0.2 0.4 0.6 0.8 1 1.2 0 5 10 15 20

Application Performance Model

seq

LastLevelCache

SynthesizedHeterogeneousDarkSiliconCMP

par par par

Application type i

6

Page 7: HaDeS: Architectural Synthesis for Heterogeneous Dark ...public.gi.ucsc.edu/~yatisht/files/DAC13_slides_Turakhia.pdf · Dark Silicon Challenge 0 0.2 0.4 0.6 0.8 1 1.2 0 5 10 15 20

Application Performance Model

seq

LastLevelCache

par

par par Dark

7

Page 8: HaDeS: Architectural Synthesis for Heterogeneous Dark ...public.gi.ucsc.edu/~yatisht/files/DAC13_slides_Turakhia.pdf · Dark Silicon Challenge 0 0.2 0.4 0.6 0.8 1 1.2 0 5 10 15 20

Application Performance Model

seq

LastLevelCache

par

par par

iE

Total execution time of application i

8

Page 9: HaDeS: Architectural Synthesis for Heterogeneous Dark ...public.gi.ucsc.edu/~yatisht/files/DAC13_slides_Turakhia.pdf · Dark Silicon Challenge 0 0.2 0.4 0.6 0.8 1 1.2 0 5 10 15 20

Application Performance Model

seq

LastLevelCache

par

par par

smisi

E t=

Sequential time of application ion core type ms

Core type ms

9

Page 10: HaDeS: Architectural Synthesis for Heterogeneous Dark ...public.gi.ucsc.edu/~yatisht/files/DAC13_slides_Turakhia.pdf · Dark Silicon Challenge 0 0.2 0.4 0.6 0.8 1 1.2 0 5 10 15 20

Application Performance Model

seq

LastLevelCache

par

par par

[1, ]max {s

i

si im v DOPE t

Î= +

Slowest parallel thread limits performance

10

Page 11: HaDeS: Architectural Synthesis for Heterogeneous Dark ...public.gi.ucsc.edu/~yatisht/files/DAC13_slides_Turakhia.pdf · Dark Silicon Challenge 0 0.2 0.4 0.6 0.8 1 1.2 0 5 10 15 20

Application Performance Model

seq

LastLevelCache

par

par par

[1, ]max {s

i

si vim DOPE t

Î= +

Ranges over all parallel threads

11

Page 12: HaDeS: Architectural Synthesis for Heterogeneous Dark ...public.gi.ucsc.edu/~yatisht/files/DAC13_slides_Turakhia.pdf · Dark Silicon Challenge 0 0.2 0.4 0.6 0.8 1 1.2 0 5 10 15 20

Application Performance Model

seq

LastLevelCache

par

par par

[1, ]max {

is

si v Pi Om DE t

Î= +

Degree of parallelism of application i

12

Page 13: HaDeS: Architectural Synthesis for Heterogeneous Dark ...public.gi.ucsc.edu/~yatisht/files/DAC13_slides_Turakhia.pdf · Dark Silicon Challenge 0 0.2 0.4 0.6 0.8 1 1.2 0 5 10 15 20

Application Performance Model

seq

LastLevelCache

par

par par

[ ]

( )

1,max {s

p

i

si im v DO

pim v

Pi

Et

DOPt

Î= + +

Total execution time of a parallel thread v of application i depends on the core mapping mp(v) and is inversely proportional to# parallel threads.

mp(1)

mp(2) mp(3)

13

Page 14: HaDeS: Architectural Synthesis for Heterogeneous Dark ...public.gi.ucsc.edu/~yatisht/files/DAC13_slides_Turakhia.pdf · Dark Silicon Challenge 0 0.2 0.4 0.6 0.8 1 1.2 0 5 10 15 20

Application Performance Model

seq

LastLevelCache

par

par par

)

, ] ( )(

[1m .ax { }

p

s pi

pim vs

i im v DOPi

iim v

tE t

DOPK DOP

Î= + +

Penalty term for increased resource contention with increasing # parallel threads

14

Page 15: HaDeS: Architectural Synthesis for Heterogeneous Dark ...public.gi.ucsc.edu/~yatisht/files/DAC13_slides_Turakhia.pdf · Dark Silicon Challenge 0 0.2 0.4 0.6 0.8 1 1.2 0 5 10 15 20

Application Performance Model

seq

LastLevelCache

par

par par

)

, ] ( )[1

(max { . }p

s pi

pi

i iv DOPi

m vsim im v

ttE DOP

DOPK

Î= + +

Modelparameters

15

Page 16: HaDeS: Architectural Synthesis for Heterogeneous Dark ...public.gi.ucsc.edu/~yatisht/files/DAC13_slides_Turakhia.pdf · Dark Silicon Challenge 0 0.2 0.4 0.6 0.8 1 1.2 0 5 10 15 20

Application Performance Model: Validation

seq

LastLevelCache

par

par par

( )( )[1, ]

max { . }p

s pi

pim vs

i iim im vv DOPi

tE t K DOP

DOPÎ= + +

Simple,butsurprisinglyaccurate!!!

16

Page 17: HaDeS: Architectural Synthesis for Heterogeneous Dark ...public.gi.ucsc.edu/~yatisht/files/DAC13_slides_Turakhia.pdf · Dark Silicon Challenge 0 0.2 0.4 0.6 0.8 1 1.2 0 5 10 15 20

HaDeS Framework

min .( l l )N

s pi i i

iræ ö

+ç ÷è øå

Goal: Minimize weighted sum of execution time for each application.

User defined weights for workload i

16

Page 18: HaDeS: Architectural Synthesis for Heterogeneous Dark ...public.gi.ucsc.edu/~yatisht/files/DAC13_slides_Turakhia.pdf · Dark Silicon Challenge 0 0.2 0.4 0.6 0.8 1 1.2 0 5 10 15 20

HaDeS Framework

min .( l l )N

s pi i i

iræ ö

+ç ÷è øå

Goal: Minimize weighted sum of execution time for each application.

Sequential execution time of workload i

18

Page 19: HaDeS: Architectural Synthesis for Heterogeneous Dark ...public.gi.ucsc.edu/~yatisht/files/DAC13_slides_Turakhia.pdf · Dark Silicon Challenge 0 0.2 0.4 0.6 0.8 1 1.2 0 5 10 15 20

HaDeS Framework

min .( l l )N

s pi i i

iræ ö

+ç ÷è øå

Goal: Minimize weighted sum of execution time for each application.

Parallel execution time of workload i

19

Page 20: HaDeS: Architectural Synthesis for Heterogeneous Dark ...public.gi.ucsc.edu/~yatisht/files/DAC13_slides_Turakhia.pdf · Dark Silicon Challenge 0 0.2 0.4 0.6 0.8 1 1.2 0 5 10 15 20

HaDeS Framework: ILP-OPT

min .( l l )N

s pi i i

iræ ö

+ç ÷è øå

. [1,N]pijp

i ij ij ii

tl b K DOP i

DOPæ ö

³ + " Îç ÷ç ÷è ø

OBJECTIVE

0-1 Indicator variable that is 1 if at least one parallel thread of application iexecutes on a core of type j

Constraint: Parallel Execution Time

20

Page 21: HaDeS: Architectural Synthesis for Heterogeneous Dark ...public.gi.ucsc.edu/~yatisht/files/DAC13_slides_Turakhia.pdf · Dark Silicon Challenge 0 0.2 0.4 0.6 0.8 1 1.2 0 5 10 15 20

HaDeS Framework: ILP-OPT

min .( l l )N

s pi i i

iræ ö

+ç ÷è øå

. [1,N]pijp

i ij ij ii

tl b K DOP i

DOPæ ö

³ + " Îç ÷ç ÷è ø

OBJECTIVE

Variable DOP introduces non-linearity

Constraint: Parallel Execution Time

21

Page 22: HaDeS: Architectural Synthesis for Heterogeneous Dark ...public.gi.ucsc.edu/~yatisht/files/DAC13_slides_Turakhia.pdf · Dark Silicon Challenge 0 0.2 0.4 0.6 0.8 1 1.2 0 5 10 15 20

HaDeS Framework: ILP-OPT

min .( l l )N

s pi i i

iræ ö

+ç ÷è øå OBJECTIVE

Constraint: Parallel Execution Time

max max

1 1

max

max

max

[1,N], [1,N], [1,DOP ]

[1,N], [1,N], [1,DOP ]

1 [1,N], [1,N], [1,DOP ]

p pDOP DOPij ijp

i ij iw ij ijw ijww w

ijw ij

ijw iw

ijw ij iw

t tl b y K w m K

w w

m b i j wm y i j wm b y i j w

= =

æ ö æ ö³ + = +ç ÷ ç ÷ç ÷ ç ÷

è ø è ø£ " Î Î Î

£ " Î Î Î

³ + - " Î Î Î

å å

Many variables O(MxNxDOPmax) to solve for in an ILP!Only O(MxN) variables in fixed DOP case

22

Page 23: HaDeS: Architectural Synthesis for Heterogeneous Dark ...public.gi.ucsc.edu/~yatisht/files/DAC13_slides_Turakhia.pdf · Dark Silicon Challenge 0 0.2 0.4 0.6 0.8 1 1.2 0 5 10 15 20

HaDeS Framework: ITER-OPT} Separates the architectural

optimization from DOP optimization.

} Start with initial guess of design vector, compute optimal DOP for this heterogeneous design, re-synthesize heterogeneous design for this DOP and keep iterating till no further improvement in performance.

} Represents local minima. Only 2.5% loss of optimality in experiments.

} Orders of magnitude faster (430X speedup)than ILP-OPT.

DOP Optimizer

D1 DN

App 1 App N

Architecture Optimizer

Optimized Heterogeneous Architecture

Opt

imal

DO

PFo

r E

ach

App

licat

ion

(D*)

Optim

al Heterogeneous

Architecture (Q

*)

Initial Guess For Q*

23

Page 24: HaDeS: Architectural Synthesis for Heterogeneous Dark ...public.gi.ucsc.edu/~yatisht/files/DAC13_slides_Turakhia.pdf · Dark Silicon Challenge 0 0.2 0.4 0.6 0.8 1 1.2 0 5 10 15 20

Experimental Results} Used Snipersim, interval core

model based simulator for performance estimates and McPAT for power/area numbers.

} Library of 108 cores with 5 multi-threaded application benchmarks (from Parsec and Splash2 suites). Only 15 cores pareto optimal in area, power or performance.

} Performance model fairly accurate. Only 5.2% average error over 180 experiments with homogeneous and heterogeneous dark silicon CMPs.

24

Page 25: HaDeS: Architectural Synthesis for Heterogeneous Dark ...public.gi.ucsc.edu/~yatisht/files/DAC13_slides_Turakhia.pdf · Dark Silicon Challenge 0 0.2 0.4 0.6 0.8 1 1.2 0 5 10 15 20

Experimental Results} Between 19% to 60% performance improvements in

heterogeneous dark silicon CMPs over variable and fixed (16 threads) DOP homogeneous design respectively for ~50% dark silicon.

25

Page 26: HaDeS: Architectural Synthesis for Heterogeneous Dark ...public.gi.ucsc.edu/~yatisht/files/DAC13_slides_Turakhia.pdf · Dark Silicon Challenge 0 0.2 0.4 0.6 0.8 1 1.2 0 5 10 15 20

Experimental Results} Benefits of heterogeneous design begin to saturate

beyond certain portion of dark silicon.

26

Page 27: HaDeS: Architectural Synthesis for Heterogeneous Dark ...public.gi.ucsc.edu/~yatisht/files/DAC13_slides_Turakhia.pdf · Dark Silicon Challenge 0 0.2 0.4 0.6 0.8 1 1.2 0 5 10 15 20

Thank You!

Please visit the poster for questions and discussions!

27