Scaling Computers: Why Streaming Is Interesting

Streaming Workshop 8/03 MAH 1 Scaling Computers: Why Streaming Is Interesting Mark Horowitz Bill Dally, Christos Kozyrakis, Kunle Olukotun Computer Systems Laboratory Stanford University [email protected]


Page 1: Scaling Computers: Why Streaming Is Interesting

Streaming Workshop 8/03 MAH 1

Scaling Computers:

Why Streaming Is Interesting

Mark Horowitz
Bill Dally, Christos Kozyrakis, Kunle Olukotun

Computer Systems Laboratory
Stanford University

[email protected]

Page 2: Scaling Computers: Why Streaming Is Interesting

Processor Model

• User model has been very stable for 30 years
– Sequentially executes instructions
– IO operations interact with outside world

• Model has hidden the scaling of technology
– Efficiently transformed transistors to performance
– 8008 – 3,500 transistors, and ran at 200kHz
– P4 – 42M transistors, runs at 3GHz
– Performance changed from 0.06MIPS to >1000MIPS

• C is a perfect fit to this programming model
– They grew up together …
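
To put the two data points on this slide side by side, here is a small calculation. It is only a sketch using the figures quoted above; the dictionary and function names are illustrative, and the P4 MIPS value is the slide's lower bound (">1000").

```python
# Figures quoted on the slide; names are illustrative, not from the talk.
I8008 = {"transistors": 3_500, "clock_hz": 200e3, "mips": 0.06}
P4 = {"transistors": 42_000_000, "clock_hz": 3e9, "mips": 1000}  # slide says >1000 MIPS

def ratio(key):
    """How many times larger the P4 figure is than the 8008 figure."""
    return P4[key] / I8008[key]

for key in ("transistors", "clock_hz", "mips"):
    print(f"{key}: {ratio(key):,.0f}x")
```

All three ratios come out at roughly four orders of magnitude, which is the sense in which the sequential model "efficiently transformed transistors to performance".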

Page 3: Scaling Computers: Why Streaming Is Interesting

Why Talk About Anything Else?

• People hate change

– Programmers especially

• Most applications are written for this computation model

– It is also what the cheapest machines execute

• It is hard enough to make things work in this model

– Parallel programs are much more complex

– Need to focus on increasing productivity of programmers

• Hardware is always getting faster

– Wait 18 months and the machine will have 2x the performance

• DARPA funding is not a good answer

– IMHO

Page 4: Scaling Computers: Why Streaming Is Interesting

Supporting Data

[Figure: SPECint2000 (log scale, 1 to 1000) versus year, 1985 to 2002. Series: Intel 386, 486, Pentium, Pentium 2, Pentium 3, Pentium 4, Itanium; Alpha 21064, 21164, 21264; Sparc, SuperSparc, Sparc64; Mips; HP PA; PowerPC; AMD K6, K7.]

Page 5: Scaling Computers: Why Streaming Is Interesting

The World Is About To Change

• Processor performance will not continue to scale

– We will fall off the current performance curve soon

• Many factors will cause this to occur

– VLSI wire issues (global structures are hard to build)

– Insufficient recoverable ILP

– Power

• This performance growth was partially an illusion

Page 6: Scaling Computers: Why Streaming Is Interesting

Technology Scaling

Scaling CMOS has two direct effects:

• Devices get smaller

– Both transistors and wires

– Get more per square mm

– Generally means they get cheaper

– Enables more complex devices

• Transistors get faster

– So do wires when viewed the ‘right’ way

Page 7: Scaling Computers: Why Streaming Is Interesting

FO4 Inverter Delay Under Scaling

Gate delay varies linearly with process technology (so far)
• Useful rule of thumb: Dgate = 500 ps * Ldrawn (Ldrawn in um) at TTLH
• Issues with being able to continue this scaling
– Some current technologies are faster (le < feature size)

[Figure: Fanout=4 inverter delay at TT, 90% Vdd, 125C. Gate delay (ps, 0 to 700) versus technology Ldrawn (um, 1.2 down to 0.2), plotted against the line 500 * Ldrawn.]
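
The rule of thumb on this slide is simple enough to tabulate. The sketch below assumes Ldrawn is given in microns, as on the chart axis; the function name and the three example nodes are illustrative choices, not values read off the plot.

```python
def fo4_delay_ps(l_drawn_um):
    """Rule-of-thumb FO4 inverter delay in picoseconds: 500 ps per um of drawn length."""
    return 500.0 * l_drawn_um

# Example technology nodes (um); chosen for illustration only.
for l_drawn in (0.25, 0.18, 0.13):
    print(f"{l_drawn} um -> {fo4_delay_ps(l_drawn):.0f} ps")
```

This gives about 125 ps, 90 ps and 65 ps respectively. As the slide itself cautions, some technologies are faster than the rule predicts.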

Page 8: Scaling Computers: Why Streaming Is Interesting

Fixed Length Wire Scaling

[Figure: Semi-global wire capacitance, 1mm long. pF (0 to 0.6) versus technology Ldrawn (um: 0.25, 0.18, 0.13, 0.1, 0.07, 0.05, 0.035), for aggressive and conservative scaling.]

[Figure: Semi-global wire resistance, 1mm long. Kohms (0 to 0.4) versus technology Ldrawn (um: 0.25, 0.18, 0.13, 0.1, 0.07, 0.05, 0.035), for aggressive and conservative scaling.]

• R gets quite a bit worse with scaling; C basically constant

Page 9: Scaling Computers: Why Streaming Is Interesting

Module Level Wire Scaling

[Figure: Semi-global wire capacitance, scaled length. pF (0 to 0.6) versus technology Ldrawn (um: 0.25, 0.18, 0.13, 0.1, 0.07, 0.05, 0.035), for aggressive and conservative scaling.]

[Figure: Semi-global wire resistance, scaled length. Kohms (0 to 0.4) versus technology Ldrawn (um: 0.25, 0.18, 0.13, 0.1, 0.07, 0.05, 0.035), for aggressive and conservative scaling.]

• R is basically constant, and C falls linearly with scaling
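
One way to read the two wire-scaling slides together is sketched below. It assumes a simple lumped model in which wire delay is proportional to the product RC; that model, the symbol t_wire, and the scaling factor s < 1 are additions for illustration and are not stated on the slides.

```latex
\begin{aligned}
\text{fixed length (1 mm):} &\quad R \uparrow,\; C \approx \text{const}
  &&\Rightarrow\; t_{\text{wire}} \propto RC \;\uparrow \\
\text{scaled length (module):} &\quad R \approx \text{const},\; C \propto s
  &&\Rightarrow\; t_{\text{wire}} \propto RC \propto s
\end{aligned}
```

Under that assumption a module-level wire speeds up at about the same rate as the gate delay, which the rule of thumb makes proportional to Ldrawn, while a wire of fixed physical length gets slower. This is one reading of the earlier remark that wires also get faster "when viewed the 'right' way".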

Page 10: Scaling Computers: Why Streaming Is Interesting

The World is Growing

• The problem associated with wires is really due to complexity
• Diagram shows the logical span you reach in a cycle
– It also shows the logical span of a chip

[Diagram: the logical chip size compared with the distance a signal can go in 1 cycle. Old view: a chip looks small to a wire. New view: a chip looks really big to a wire.]

Communication is expensive, even on-chip

Page 11: Scaling Computers: Why Streaming Is Interesting

Architecture

• Convert transistors to performance
• Use transistors to
– Exploit parallelism
– Or create it (speculate)
• Processor generations
– Simple machine
  • Reuse hardware
– Pipelined
  • Separate hardware for each stage
– Super-scalar
  • Multiple port mems, function units
– Out-of-order
  • Mega-ports, complex scheduling
– Speculation
• Each design has more logic to accomplish same task (but faster)

Page 12: Scaling Computers: Why Streaming Is Interesting

Architecture Scaling

• Plot of IPC
– Compiler + IPC
– 1.5x / generation
– Until PIII, now falling
• There is a lot of hardware to make this happen
– Many transistors
– Lots of power
– Lots of designers

[Figure: IPC plot (y-axis 0.00 to 0.05) versus date, Jan-85 to Jan-00, for 80386, 80486, Pentium, Pentium II, Pentium III, Pentium 4.]

Page 13: Scaling Computers: Why Streaming Is Interesting

SpecInt/MHz

[Figure: SpecInt/MHz (log scale, 0.01 to 1.00) versus year, 1985 to 2002. Series: Intel 386, 486, Pentium, Pentium 2, Pentium 3, Pentium 4, Itanium; Alpha 21064, 21164, 21264; Sparc, SuperSparc, Sparc64; Mips; HP PA; PowerPC; AMD K6, K7.]

Page 14: Scaling Computers: Why Streaming Is Interesting

Clock Frequency Scaling

[Figure: clock frequency (log scale, 10 to 10000) versus year, 1985 to 2002. Series: Intel 386, 486, Pentium, Pentium 2, Pentium 3, Pentium 4, Itanium; Alpha 21064, 21164, 21264; Sparc, SuperSparc, Sparc64; Mips; HP PA; PowerPC; AMD K6, K7.]

Page 15: Scaling Computers: Why Streaming Is Interesting

Gates Per Clock

• Clock speed has been scaling faster than base technology

• Number of FO4 delays in a cycle has been falling

• Number of gates decreases 1.4x each generation

• Caused by:
– Faster circuit families (dynamic logic)
– Better optimization
– Better micro-architecture
– Better adder/mem arch

• All this generally requires more transistors

[Figure: FO4 inverter delays / cycle (log scale, 10 to 100) versus date, Dec-83 to Dec-98, for 80386, 80486, Pentium, Pentium II, with an exponential trend line.]

Page 16: Scaling Computers: Why Streaming Is Interesting

Clock Cycle in ‘FO4’

[Figure: clock cycle in FO4 delays (log scale, 10 to 100) versus year, 1985 to 2001. Series: Intel 386, 486, Pentium, Pentium 2, Pentium 3, Pentium 4, Itanium; Alpha 21064, 21164, 21264; Sparc, SuperSparc, Sparc64; Mips; HP PA; PowerPC; AMD K6, K7. Annotation on the chart: Alpha.]

Note that the points at around 10 FO4 are not correct. The FO4 for these technologies is about ½ my simple formula.
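
The metric on this chart is the clock period divided by the FO4 delay. A minimal sketch of that calculation follows, assuming the 500 ps * Ldrawn rule of thumb from the earlier slide. The function name, the `fo4_scale` parameter, and the 1 GHz / 0.18 um design point are hypothetical and not taken from the chart.

```python
def cycle_in_fo4(clock_hz, l_drawn_um, fo4_scale=1.0):
    """Clock period divided by the FO4 delay.

    fo4_scale multiplies the 500 ps * Ldrawn rule of thumb; 0.5 models the
    slide's remark that some technologies have about half that FO4 delay.
    """
    period_ps = 1e12 / clock_hz
    fo4_ps = 500.0 * l_drawn_um * fo4_scale
    return period_ps / fo4_ps

# Hypothetical design point, not taken from the chart: 1 GHz in 0.18 um.
print(round(cycle_in_fo4(1e9, 0.18), 1))        # simple formula
print(round(cycle_in_fo4(1e9, 0.18, 0.5), 1))   # FO4 at half the formula
```

Halving the FO4 delay doubles the number of FO4 delays that fit in a cycle, which is why the note says the points plotted near 10 FO4 understate the real value.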

Page 17: Scaling Computers: Why Streaming Is Interesting

Gates Per Clock

• Current SOA machines are at 16 FO4 gates per cycle
– Historical low values (Cray) were at this level

• Overhead for short tick machines grows rapidly
– Power
  • Increases clock power per logic function
– Latency
  • Flops are already 10-20% of cycle today
– Logic reach grows smaller
  • What fits in a cycle (how many bits/gates) decreases

• Difficult to generate a clock at less than 8 FO4 gates

• Continued scaling of gates/clock will be hard
– Performance gain from 16 FO4 to 8 FO4 is only 20% anyhow
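
The latency point can be illustrated with a little arithmetic. This sketch assumes that flop overhead is a fixed number of FO4 delays that does not shrink when the cycle gets shorter; that assumption and the function name are added here and are not stated on the slide.

```python
def overhead_fraction(cycle_fo4, overhead_fo4):
    """Fraction of the clock cycle consumed by fixed flop/clocking overhead."""
    return overhead_fo4 / cycle_fo4

# Slide: flops are 10-20% of a cycle at about 16 FO4 per cycle.
for share_at_16 in (0.10, 0.20):
    overhead_fo4 = share_at_16 * 16          # overhead expressed in FO4 delays
    at_8 = overhead_fraction(8, overhead_fo4)
    print(f"{share_at_16:.0%} at 16 FO4 -> {at_8:.0%} at 8 FO4")
```

Under that assumption the overhead share doubles to 20-40% of an 8 FO4 cycle, so halving the cycle buys much less than twice the useful logic time per second. It does not by itself reproduce the 20% figure quoted on the slide, which also reflects effects not modeled here.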

Page 18: Scaling Computers: Why Streaming Is Interesting

Power

[Figure: power (log scale, 1 to 100) versus year, 1985 to 2001. Series: Intel 386, 486, Pentium, Pentium 2, Pentium 3, Pentium 4, Itanium; Alpha 21064, 21164, 21264; Sparc, SuperSparc, Sparc64; Mips; HP PA; PowerPC; AMD K6, K7.]

Page 19: Scaling Computers: Why Streaming Is Interesting

Complexity and Power – The Dark Side

Two other factors will limit conventional scalar processors:

• Power
– Current SOA designs are power limited
– If you put in all the techniques you know about
  • It will dissipate too much power
  • Need to get the most performance for XX Watts
– Performance improvements must be energy efficient

• Complexity
– Very few companies can afford the design costs
  • Fewer players every year

Page 20: Scaling Computers: Why Streaming Is Interesting

The Illusion:

• I remember processor performance plots used to have two lines

• Microprocessors and mainframes
– Mainframes had maxed out and improved at technology rate
  • 2x every 6 years
– Unless something changes micros will be there soon

[Figure: two performance lines, labeled uP and Mainframe.]
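
To compare the two growth rates mentioned in the talk (2x every 18 months for microprocessors on an earlier slide, 2x every 6 years for mainframes here), the doubling times can be converted to compound annual rates. This is a sketch; the function name is illustrative.

```python
def annual_growth(factor, years):
    """Equivalent compound annual growth rate for `factor`x improvement over `years`."""
    return factor ** (1.0 / years) - 1.0

micro = annual_growth(2, 1.5)      # 2x every 18 months (earlier slide)
mainframe = annual_growth(2, 6)    # 2x every 6 years (this slide)
print(f"micro: {micro:.1%} per year, mainframe: {mainframe:.1%} per year")
```

That is roughly 59% per year against roughly 12% per year, which is the size of the drop the slide is warning about.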

Page 21: Scaling Computers: Why Streaming Is Interesting

Jim Smith’s Graph

• Graph Jim Smith presented at ISCA 2000
– Fastest uniprocessor
– Then growth has been modest
– Micros were starting from a slow point, so have a large growth rate

Page 22: Scaling Computers: Why Streaming Is Interesting

Coming Opportunity

• Conventional processor scaling is going to slow down
– Design costs are enormous
– Improving IPC is getting harder
– Improving cycle time is getting harder

• For performance need to exploit parallelism
– EV8, Pentium 4 – SMT
– Power 4 is an explicit multiprocessor
  • Power 5 is explicit multiprocessor with SMT

• How do we do this well?
– Create other programming models
– Make the models match VLSI constraints
– Don’t worry about universality

Page 23: Scaling Computers: Why Streaming Is Interesting

What VLSI Wants You to Build

• Modular machine
– Lots of potential compute units, w/ memory

[Diagram labels: Interconnect, Processor, Registers.]

Page 24: Scaling Computers: Why Streaming Is Interesting

Making Communication Explicit

• In VLSI, communication is what matters

It is the wires, stupid

• Another way of saying this is:
– In VLSI building computation elements is easy
– Keeping them fed is hard
– Hence, most of a modern processor stages data

• Want a computation model that
– Makes communication explicit
– Provides feedback to the programmer about communication

Page 25: Scaling Computers: Why Streaming Is Interesting

The “Ideal” VLSI Machine

• Lots of simple compute units

– Units fed by cheap (in energy, area) sources – local regs

– Relatively cheap instruction issue logic

• Memory (FIFOs) to decouple data fetch/execute

– Communication takes time (it is the LAW)

– Need to enable the machine to tolerate latency

• Interconnection network with high-bandwidth

– And as low a latency as possible

• Connections to large backing store

– Main memory and disk

• Streams are a programming model that matches this machine
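
The role of the FIFOs in tolerating latency can be shown with a toy cycle-level model. This is only a sketch under simplifying assumptions that are not from the talk: a fixed fetch latency, one request issued per cycle, one item executed per cycle, and a cap on outstanding requests. All names are hypothetical.

```python
from collections import deque

def run(n_items, latency, max_in_flight):
    """Cycle count for a decoupled fetch/execute pair.

    Fetch may have at most `max_in_flight` requests outstanding (in transit or
    waiting in the FIFO); each arrives `latency` cycles after issue. Execute
    consumes one arrived item per cycle.
    """
    in_transit = deque()   # arrival cycle of each outstanding request
    fifo = deque()         # arrived items waiting for the execute unit
    issued = done = cycle = 0
    while done < n_items:
        # Move requests whose data has arrived into the FIFO.
        while in_transit and in_transit[0] <= cycle:
            in_transit.popleft()
            fifo.append(cycle)
        # Execute unit: consume one item if available.
        if fifo:
            fifo.popleft()
            done += 1
        # Fetch unit: issue one new request per cycle if there is room.
        if issued < n_items and len(in_transit) + len(fifo) < max_in_flight:
            in_transit.append(cycle + latency)
            issued += 1
        cycle += 1
    return cycle

print(run(4, 3, 1))   # 13: execute waits out the full latency for every item
print(run(4, 3, 3))   # 7: latency is paid once, then one item per cycle
```

With a single request in flight the execute unit idles for the whole latency on every item. Once enough requests are outstanding to cover the latency, the machine pays the latency once and then runs at full rate.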

Page 26: Scaling Computers: Why Streaming Is Interesting

What Stream Programming Means To Me

• Programming model where
– Applications have large amounts of data parallelism
  • So you don’t have to invent it
– Communication is made explicit
  • So the programmer can tell what will be expensive
– Easy to estimate performance of applications

• Similar to
– Vector programming
– Synchronous dataflow graphs
– Matlab/Simulink
– Graphics machines, etc

It does not bother me that this is not general. It is common enough we should have an efficient solution for these applications

Page 27: Scaling Computers: Why Streaming Is Interesting

Stream Programming Abstraction

• Stream model makes communication / locality explicit
– Temporaries within kernels, producer-consumer variables between kernels never are written back to memory
– Most communication is represented by arcs, or memory gather/scatter
• Locality helps support large numbers of ALUs

[Diagram: input data Image 0 and Image 1 each pass through two convolve kernels and then feed a SAD kernel, which produces the output data, a Depth Map.]

Kernel: Operations within a kernel operate on local data
Streams: producer-consumer locality
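
The kernel/stream structure in the diagram can be mimicked with Python generators, where each kernel keeps only local state and intermediate streams are consumed as they are produced rather than stored. This is a toy sketch, not the workshop's code: the images are one-dimensional, the filter taps are hypothetical, and the SAD kernel is simplified to a running sum of absolute differences.

```python
def convolve(stream, taps):
    """Kernel: 1-D FIR filter over a stream, holding only a small local window."""
    window = [0] * len(taps)
    for x in stream:
        window = window[1:] + [x]
        yield sum(w * t for w, t in zip(window, taps))

def sad(stream_a, stream_b):
    """Kernel: running sum of absolute differences between two streams."""
    total = 0
    for a, b in zip(stream_a, stream_b):
        total += abs(a - b)
        yield total

# Toy one-dimensional "images"; real inputs would be 2-D pixel data.
image0 = [1, 2, 3, 4]
image1 = [1, 2, 4, 4]
taps = [1, 1]   # hypothetical filter coefficients

left = convolve(convolve(iter(image0), taps), taps)
right = convolve(convolve(iter(image1), taps), taps)
print(list(sad(left, right)))   # [0, 0, 1, 3]
```

The only data that touches "memory" here are the two input lists and the final output list; the values passed between the convolve stages and into the SAD kernel exist only as producer-consumer traffic, which is the locality the slide is describing.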

Page 28: Scaling Computers: Why Streaming Is Interesting

Stream Architecture

• Basically need four parts:
– Function units (Add, Mult)
– Registers to hold the data
– DMA channels to load/spill the registers
– Something to sequence instructions

• For efficient operation
– Want to decouple load/store from execute
  • Need lots of registers
– Can optimize registers/memory for different functions

Page 29: Scaling Computers: Why Streaming Is Interesting

Simple Stream Machine

• Vector machines were early stream machines
• Scalar processor was the sequencer
• Two levels of storage for the “registers”
– Vector registers
– Main memory

• Vector registers are often a limitation
– #ports = 3*#FU
– #regs = Latency

• Graphics use this resource model
– Slice the computation into stream

[Diagram labels: Register; FU, FU, FU, FU; Lane.]
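
The "#ports = 3*#FU" limitation can be spelled out as a one-line function. The sketch assumes the factor of 3 comes from two source reads and one result write per function unit per cycle; that breakdown is an interpretation, not something the slide states, and the names are illustrative.

```python
def register_file_ports(num_fu, reads_per_fu=2, writes_per_fu=1):
    """Ports needed if every function unit reads and writes the shared file each cycle."""
    return num_fu * (reads_per_fu + writes_per_fu)

for num_fu in (1, 4, 16):
    print(num_fu, "FUs ->", register_file_ports(num_fu), "ports")
```

The port count grows linearly with the number of function units (3, 12 and 48 here), which is what makes a single shared vector register file hard to scale.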

Page 30: Scaling Computers: Why Streaming Is Interesting

More Refined Model

• Split the registers into two functions
– Hold temporaries in the computation
– Input / Output of the kernel

• Temps don’t need to be global for all function units
– Although it is easier for the compiler if they are
  • Localized registers will need move instructions
– They don’t need to hide the memory latency either
  • Generated from function units

• Input / Output registers are only for queuing
– Number of ports does not need to grow with #FU
– Very regular access out of these registers
– Access can be longer
– Build dense memory for these

Page 31: Scaling Computers: Why Streaming Is Interesting

Imagine Stream Machine

[Diagram: four SDRAM banks, a Stream Register File, and ALU Clusters. Bandwidth labels: 2.1 GB/s, 26 GB/s, 435 GB/s.]
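
The three bandwidth labels describe a hierarchy, and the ratios between levels are easy to compute. The sketch below assumes that 2.1 GB/s is the SDRAM bandwidth, 26 GB/s the stream register file bandwidth, and 435 GB/s the bandwidth inside the ALU clusters; that mapping is inferred from the order of the labels and is not spelled out in the extracted text.

```python
# Bandwidths in GB/s as labeled on the diagram; the level names are an assumed mapping.
bandwidth = {"sdram": 2.1, "stream_register_file": 26.0, "alu_clusters": 435.0}

def ratio(inner, outer):
    """How many times more bandwidth the inner level offers than the outer level."""
    return bandwidth[inner] / bandwidth[outer]

print(f"SRF vs SDRAM: {ratio('stream_register_file', 'sdram'):.1f}x")
print(f"clusters vs SRF: {ratio('alu_clusters', 'stream_register_file'):.1f}x")
print(f"clusters vs SDRAM: {ratio('alu_clusters', 'sdram'):.0f}x")
```

Under that mapping each level supplies more than ten times the bandwidth of the level outside it, and the ALUs see about 200 times the off-chip bandwidth, so a program only keeps them busy if most of its traffic stays local.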

Page 32: Scaling Computers: Why Streaming Is Interesting

Stream Virtual Machine

• Really three threads of control

• Control thread is master

– Issues non-blocking instructions for core and DMA

– Mechanism for respecting dependencies

• Explicitly specify what each non-blocking instruction depends on

• DMA and Core threads

– Have a queue of work

– Run when constraints satisfied

• Machine dependent resources

– Graphics SRF = memory
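
The three-thread structure can be sketched as a small simulation: a control thread issues operations without blocking, each tagged with the operations it depends on, and the DMA and core units each drain their own queue when the dependencies are met. This is a hypothetical model for illustration, not the SVM specification. It assumes in-order execution within each queue and that every operation takes one step; all operation names are made up.

```python
from collections import deque

def run_svm(program):
    """Simulate a control thread feeding DMA and core work queues.

    `program` is a list of (name, unit, deps) tuples issued in order without
    blocking. Each unit runs the op at the head of its queue once every op
    named in `deps` has completed. Returns the completion order.
    """
    queues = {"dma": deque(), "core": deque()}
    for name, unit, deps in program:          # control thread: issue everything
        queues[unit].append((name, set(deps)))
    finished, order = set(), []
    progress = True
    while progress:
        progress = False
        for unit in ("dma", "core"):
            queue = queues[unit]
            if queue and queue[0][1] <= finished:   # dependencies satisfied
                name, _ = queue.popleft()
                finished.add(name)
                order.append(name)
                progress = True
    if any(queues.values()):
        raise RuntimeError("deadlock: unsatisfied dependencies")
    return order

program = [
    ("load_a", "dma", []),
    ("kernel_a", "core", ["load_a"]),
    ("load_b", "dma", []),
    ("kernel_b", "core", ["load_b"]),
    ("store_a", "dma", ["kernel_a"]),
]
print(run_svm(program))
```

Because the dependencies are explicit, the control thread never has to wait: the DMA unit can start loading the second block while the core is still working on the first, and the store is held back only until its own kernel has finished.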

Page 33: Scaling Computers: Why Streaming Is Interesting

Many Possible Stream Hardware Implementations

• We think (hope) all will fall into the general SVM model
• Need to have some parameters that characterize hardware
– Less is more in this situation
– Want to export only key issues

• Goal is to partition the compilation process into two parts
– High-level compiler
  • Work with parallelization of code
  • Data blocking and placement
  • Platform independent
– Low-level (node) compiler
  • Code scheduling
  • Detailed resource allocation

Page 34: Scaling Computers: Why Streaming Is Interesting

Streaming Summary

• Uniprocessor scaling will change soon
– Highly constrained space
– Performance gains are generally modest
– Hard to build any really interesting machines

• VLSI wants you to build parallel modular machines
– Streams are a programming model that matches machine

• Stream programming model is familiar to some programmers
– It is the model that they prototype SP applications in
– Easy to explain to someone

• Potentially large improvement possible in Ops/sec and Ops/Joule