the dark silicon implications for microprocessors - … area budget fixed power budget two sets of...

39
Karu Sankaralingam University of Wisconsin-Madison Collaborators: Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, and Doug Burger The Dark Silicon Implications for Microprocessors

Upload: hoangnhi

Post on 11-Apr-2018

218 views

Category:

Documents


1 download

TRANSCRIPT

Karu Sankaralingam University of Wisconsin-Madison

Collaborators: Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, and Doug Burger

The Dark Silicon Implications for Microprocessors

Multicore Decade?

2

We have relied on multicore scaling for over five years.

2000 2005 2010 2015 Pentium Extreme

Dual-Core

Core 2 Quad-Core

i7 980x Hex-Core

?

Multicore Decade?

3

We have relied on multicore scaling for over five years.

How much longer will it be our primary performance scaling technique?

2000 2005 2010 2015 Pentium Extreme

Dual-Core

Core 2 Quad-Core

i7 980x Hex-Core

?

Finding Optimal Multicore Designs

4

Comprehensive design space: Fixed area budget Fixed power budget Two sets of CMOS scaling projections Optimal core and diverse multicore organizations Parallel benchmarks

For next 5 technology generations, we find the best performing multicore from a comprehensive design space search for each of the PARSEC benchmarks

Symmetric Multicore Projections

5 Symmetric multicores alone will not sustain the multicore era.

0

4

8

12

16

20

0 2 4 6 8 10

Spee

dup

Year

Target

Symmetric

3.4x in 10 years

18x

Multicore Solutions

6

Asymmetric Topologies

0

4

8

12

16

20

0 2 4 6 8 10

Spee

dup

Year

Target Symmetric Asymmetric

0

4

8

12

16

20

0 2 4 6 8 10

Spee

dup

Year

Target Symmetric

3.5x

Multicore Solutions

7

Dynamic Topologies

[Chakraborty (2008), Suleman et al (2009)]

0

4

8

12

16

20

0 2 4 6 8 10

Spee

dup

Year

Target Symmetric Asymmetric Dynamic

0

4

8

12

16

20

0 2 4 6 8 10

Spee

dup

Year

Target Symmetric Asymmetric

Multicore Solutions

8

Dynamic Topologies

[Chakraborty (2008), Suleman et al (2009)]

0

4

8

12

16

20

0 2 4 6 8 10

Spee

dup

Year

Target Symmetric Asymmetric Dynamic

0

4

8

12

16

20

0 2 4 6 8 10

Spee

dup

Year

Target Symmetric Asymmetric

Multicore Solutions

9

Dynamic Topologies

[Chakraborty (2008), Suleman et al (2009)]

0

4

8

12

16

20

0 2 4 6 8 10

Spee

dup

Year

Target Symmetric Asymmetric Dynamic

0

4

8

12

16

20

0 2 4 6 8 10

Spee

dup

Year

Target Symmetric Asymmetric

Multicore Solutions

10

Dynamic Topologies

[Chakraborty (2008), Suleman et al (2009)]

0

4

8

12

16

20

0 2 4 6 8 10

Spee

dup

Year

Target Symmetric Asymmetric Dynamic

0

4

8

12

16

20

0 2 4 6 8 10

Spee

dup

Year

3.5x

0

4

8

12

16

20

0 2 4 6 8 10

Spee

dup

Year

Target Symmetric Asymmetric Dynamic

Multicore Solutions

11

Composed/Fused Topologies

[Ipek et al (2007), Kim et al (2007)]

0

4

8

12

16

20

0 2 4 6 8 10

Spee

dup

Year

Target Symmetric Asymmetric Dynamic Composed

0

4

8

12

16

20

0 2 4 6 8 10

Spee

dup

Year

Target Symmetric Asymmetric Dynamic

Multicore Solutions

12

Composed/Fused Topologies

[Ipek et al (2007), Kim et al (2007)]

0

4

8

12

16

20

0 2 4 6 8 10

Spee

dup

Year

Target Symmetric Asymmetric Dynamic Composed

0

4

8

12

16

20

0 2 4 6 8 10

Spee

dup

Year

Target Symmetric Asymmetric Dynamic

Multicore Solutions

13

Composed/Fused Topologies

[Ipek et al (2007), Kim et al (2007)]

0

4

8

12

16

20

0 2 4 6 8 10

Spee

dup

Year

Target Symmetric Asymmetric Dynamic Composed

0

4

8

12

16

20

0 2 4 6 8 10

Spee

dup

Year

Target

Symmetric

Asymmetric

Dynamic

3.7x

Multicore Solutions

14

GPU-Style Cores

0

4

8

12

16

20

0 2 4 6 8 10

Spee

dup

Year

Target Symmetric Asymmetric Dynamic Composed GPU

0

4

8

12

16

20

0 2 4 6 8 10

Spee

dup

Year

Target Symmetric Asymmetric Dynamic Composed

2.7x

Multicore Era Projections

15

0

4

8

12

16

20

0 2 4 6 8 10

Spee

dup

Year

Target

Composed Composed

3.7x

18x

Multicore Era Projections

16

The best designs speed up 14% per year rather than the recent trend of 34% per year

0

4

8

12

16

20

0 2 4 6 8 10

Spee

dup

Year

Target

Composed Composed

3.7x

18x

Why Diminishing Returns?

17

Transistor area is still scaling Voltage and capacitance scaling have slowed Result: designs are power, not area, limited

Devices • Find the best case technology scaling

Cores • Find the best cores

Multicores • Find the best multicore organization

Projections • Predict best case multicore performance for

each technology generation

Overview

18

Conservative Optimistic

Area 32x 32x

Power 4.5x 8.3x

Frequency 1.3x 3.9x

[Borkar 2007] [ITRS 2010]

Device Scaling Projections

19

From 45 nm to 8 nm:

Modeling Ideal Core Power/Perf.

20

Atom

Nehalem

Modeling Ideal Core Power/Perf.

21

Atom

Nehalem

0

5

10

15

20

25

30

0 5 10 15 20 25 30 35 40

Pow

er (T

DP,

Wat

ts)

SPECmark Score

Intel Nehalem AMD Shanghai Intel Core Intel Atom

Modeling Ideal Core Power/Perf.

22

Atom

Nehalem

0

5

10

15

20

25

30

0 5 10 15 20 25 30 35 40

Pow

er (T

DP,

Wat

ts)

SPECmark Score

Intel Nehalem AMD Shanghai Intel Core Intel Atom

0

5

10

15

20

25

30

0 10 20 30 40

Pow

er (T

DP,

Wat

ts)

SPECmark Score

Intel Nehalem AMD Shanghai Intel Core Intel Atom

Modeling Ideal Core Power/Perf.

23

Atom

Nehalem

0

5

10

15

20

25

30

0 10 20 30 40

Pow

er (T

DP,

Wat

ts)

SPECmark Score

Intel Nehalem AMD Shanghai Intel Core Intel Atom

0

5

10

15

20

25

30

0 10 20 30 40

Pow

er (T

DP,

Wat

ts)

SPECmark Score

Intel Nehalem AMD Shanghai Intel Core Intel Atom

Modeling Ideal Core Power/Perf.

24

Atom

Nehalem

0

5

10

15

20

25

30

0 10 20 30 40

Pow

er (T

DP,

Wat

ts)

SPECmark Score

Intel Nehalem AMD Shanghai Intel Core Intel Atom

Pareto Frontier includes all optimal power/performance points

0

5

10

15

20

25

30

0 10 20 30 40

Pow

er (T

DP,

Wat

ts)

SPECmark Score

Intel Nehalem AMD Shanghai Intel Core Intel Atom

Modeling Ideal Core Power/Perf.

25

Atom

Nehalem

0

5

10

15

20

25

30

0 10 20 30 40

Pow

er (T

DP,

Wat

ts)

SPECmark Score

Intel Nehalem AMD Shanghai Intel Core Intel Atom

Pareto Frontier includes all optimal power/performance points

Repeat using core area for optimal area/performance points

0

5

10

15

20

25

30

0 10 20 30 40

Pow

er (T

DP,

Wat

ts)

SPECmark Score

Combining Device and Core Models

26

45 nm Frontier

32 nm Frontier

Device Scaling

Devices • Find the best case technology scaling

Cores • Find the best cores

Multicores • Find the best multicore organization

Projections • Predict best case multicore performance for

each technology generation

Overview

27

What belongs in multicore model?

28

Styles

Applications

Topologies

Area & Power / Performance Tradeoffs

Architectures

PARSEC fparallel, Data Use

Number of Threads, Cache Sizes

Area & Power Budget

Cache & memory latencies, memory bandwidth

Pareto Frontiers

Multicore Speedup Model

29

Multicore Speedup

1 1-fparallel

Serial Speedup fparallel

Parallel Speedup

= +

Multicore Performance Model

30

Performance is limited by:

and

Memory bandwidth BWmax / (instructions per byte from memory)

Computation Ncores × (core frequency/CPIexe) × core utilization

[Guz et al, 2009]

Core Utilization Model

31

Core utilization is limited by:

Fraction of Time Core is Ready to Issue Number of Threads in Core / Number of Threads to Keep Busy

[Guz et al, 2009]

Multicore Model & Pareto Frontiers

32

0

5

10

15

20

25

30

0 10 20 30 40

100 points

A(q), P(q)

q

Translating from SPECmark

33

1. From q, find core’s SPECmark speedup

2. Frequency linearly distributed from Atom to Nehalem

3. Recall: model predicts benchmark performance as f(benchmark chars, frequency, CPIexe)

4. Compute CPIexe such that Benchmark Speedup = SPECmark Speedup

Area and Power Constraints

34

Ncores x A(q) ≤ Area Budget

Ncores x P(q) ≤ Power Budget

Dark silicon = Ncores / # of cores that fit in chip area

Devices • Find the best case technology scaling

Cores • Find the best cores

Multicores • Find the best multicore organization

Projections • Predict best case multicore performance for

each technology generation

Overview

35

Dark Silicon

36

0%

20%

40%

60%

80%

100%

blacksholes bodytrack canneal ferret streamcluster GM

Perc

enta

ge D

ark

Silic

on

ITRS Conservative

0%

20%

40%

60%

80%

100%

blacksholes bodytrack canneal ferret streamcluster GM

Perc

enta

ge D

ark

Silic

on

ITRS Conservative

Sources of Dark Silicon: Power + Limited Parallelism

At 22 nm: At 8 nm:

17% 26%

51%

71%

0

4

8

12

16

20

0 2 4 6 8 10

Spee

dup

Year

ITRS: All Topologies ITRS: Symmetric Conservative: All Topologies Conservative: Symmetric

0

4

8

12

16

20

0 2 4 6 8 10

Spee

dup

Year

Conservative: Symmetric

0

4

8

12

16

20

0 2 4 6 8 10

Spee

dup

Year

Conservative: All Topologies Conservative: Symmetric

0

4

8

12

16

20

0 2 4 6 8 10

Spee

dup

Year

ITRS: All Topologies ITRS: Symmetric Conservative: All Topologies Conservative: Symmetric

Overall Performance

37

Target

fparallel = 0.99

18x 16x

8x 6x 3x

Conclusions Multicore performance gains are limited

Need at least 18%-40% per generation from architecture alone without additional power

38

Unicore Era Multicore Era

?

39

Specialization

Shrinking chips Pervasive

Efficiency