


CS61C L40 Hardware Parallel Computing (1) Beamer, Fall 2006 © UCB

Head TA Scott Beamer

inst.eecs./~cs61c-tb

inst.eecs.berkeley.edu/~cs61c

UC Berkeley CS61C : Machine Structures

Lecture 40 – Hardware Parallel Computing

2006-12-06

http://www.cnn.com/2006/WORLD/europe/11/27/uk.spam.reut/index.html

Spam Emails on the Rise

New studies show spam emails have increased by 50% worldwide. Worse yet, much of the spamming is being done by criminals, so laws may not help.

Thanks to John Lazzaro for his CS152 slides
inst.eecs.berkeley.edu/~cs152/

CS61C L40 Hardware Parallel Computing (2) Beamer, Fall 2006 © UCB

Outline

• Last time was about how to exploit parallelism from the software point of view

• Today is about how to implement this in hardware

• Some parallel hardware techniques
• A couple of current examples
• Won't cover out-of-order execution, since it is too complicated

CS61C L40 Hardware Parallel Computing (3) Beamer, Fall 2006 © UCB

Introduction

• Given many threads (somehow generated by software), how do we implement this in hardware?

• Recall the performance equation:
  Execution Time = (Inst. Count)(CPI)(Cycle Time)
• Hardware parallelism improves (worked form below):

  • Instruction Count - If the equation is applied to each CPU, each CPU needs to do less

  • CPI - If the equation is applied to the system as a whole, more is done per cycle

  • Cycle Time - Will probably be made worse in the process
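A worked form of the equation with hypothetical numbers (not from the slides), to show the two views of where parallelism helps:

    Execution Time = (Inst. Count) x (CPI) x (Cycle Time)

    One CPU:              10^9 inst x 1.0 CPI x 1 ns            = 1.0 s
    4 CPUs, per-CPU view: (10^9 / 4) inst x 1.0 CPI x 1 ns      = 0.25 s each
    Same system, whole:   10^9 inst x 0.25 effective CPI x 1 ns = 0.25 s

In practice the cycle time (and coordination overhead) get somewhat worse, so the real speedup is below 4x.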

CS61C L40 Hardware Parallel Computing (4) Beamer, Fall 2006 © UCB

Disclaimers

• Please don't let today's material confuse what you have already learned about CPUs and pipelining

• When the programmer is mentioned today, it means whoever is generating the assembly code (so it is probably a compiler)

• Many of the concepts described today are difficult to implement, so if it sounds easy, think of possible hazards

CS61C L40 Hardware Parallel Computing (5) Beamer, Fall 2006 © UCB

Superscalar

• Add more functional units or pipelines to the CPU

• Directly reduces CPI by doing more per cycle

• Consider what if we:
  • Added another ALU
  • Added 2 more read ports to the RegFile
  • Added 1 more write port to the RegFile

CS61C L40 Hardware Parallel Computing (6) Beamer, Fall 2006 © UCB

Simple Superscalar MIPS CPU

[Datapath diagram: a dual-issue MIPS CPU. The instruction memory fetches two instructions (Inst0 and Inst1) per cycle at the PC; the register file has four read ports (Ra, Rb, Rc, Rd) and two write ports (W0, W1); two ALUs operate in parallel on operands A/B and C/D; a single data memory handles loads and stores.]

• Can now do 2 instructions in 1 cycle!
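To actually fill both slots, the two instructions issued together must be independent. A minimal C sketch (hypothetical variable names, not from the lecture) of what that looks like at the source level:

    #include <stdio.h>

    int main(void) {
        int a = 1, b = 2, c = 3, d = 4;

        /* Independent: a dual-issue CPU (or its compiler) can pair these
           two adds in the same cycle, since neither reads the other's result. */
        int x = a + b;
        int y = c + d;

        /* Dependent: z needs x, so this pair cannot issue together. */
        int z = x + 1;
        int w = z + y;

        printf("%d %d %d %d\n", x, y, z, w);
        return 0;
    }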


CS61C L40 Hardware Parallel Computing (7) Beamer, Fall 2006 © UCB

Simple Superscalar MIPS CPU (cont.)

• Considerations
  • ISA now has to be changed
  • Forwarding for pipelining now harder

• Limitations
  • Programmer must explicitly generate parallel code
  • Improvement only if other instructions can fill slots
  • Doesn't scale well

CS61C L40 Hardware Parallel Computing (8) Beamer, Fall 2006 © UCB

Single Instruction Multiple Data (SIMD)

• Often done in a vector form, so all data has the same operation applied to it

• Example: AltiVec (like SSE)
  • 128-bit registers can hold:
    4 floats, 4 ints, 8 shorts, 16 chars, etc.
  • Processes the whole vector (see the sketch below the diagram)

[Diagram: a 128-bit ALU with 128-bit inputs A and B]
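A minimal C sketch of the same idea using SSE intrinsics (the x86 analogue of AltiVec; the array values are hypothetical): one SIMD add processes four floats at once.

    /* Compile on x86 with: gcc -msse vec_add.c */
    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics */

    int main(void) {
        float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
        float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
        float c[4];

        __m128 va = _mm_loadu_ps(a);      /* load 4 floats into a 128-bit register */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_add_ps(va, vb);   /* one add processes the whole vector */
        _mm_storeu_ps(c, vc);

        printf("%f %f %f %f\n", c[0], c[1], c[2], c[3]);
        return 0;
    }

The equivalent scalar loop would need four separate adds; the vector version does the same work with a single arithmetic instruction (plus the loads and store).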

CS61C L40 Hardware Parallel Computing (9) Beamer, Fall 2006 © UCB

Superscalar in Practice

• ISAs have extensions for these vector operations

• One thread that has parallelism internally

• Performance improvement depends on the program and the programmer being able to fully utilize all slots

• Can be parts other than the ALU (like loads)
• Usefulness will be more apparent when combined with other parallel techniques

CS61C L40 Hardware Parallel Computing (10) Beamer, Fall 2006 © UCB

Thread Review

• A Thread is a single stream of instructions

• It has its own registers, PC, etc.
• Threads from the same process operate in the same virtual address space (see the sketch below)

• Threads are an easy way to describe/think about parallelism

• A single CPU can execute many threads by Time Division Multiplexing

[Diagram: one CPU's time divided into slices, cycling among Thread0, Thread1, and Thread2]
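A minimal pthreads sketch (hypothetical example, not from the lecture) of threads in one process sharing the same virtual address space: each thread has its own stack and registers, but they all update the same global variable.

    /* Compile with: gcc thread_demo.c -lpthread */
    #include <pthread.h>
    #include <stdio.h>

    static int counter = 0;                        /* shared by all threads */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        long id = (long) arg;                      /* private to this thread */
        pthread_mutex_lock(&lock);
        counter++;                                 /* shared memory: must synchronize */
        pthread_mutex_unlock(&lock);
        printf("thread %ld ran\n", id);
        return NULL;
    }

    int main(void) {
        pthread_t threads[3];
        for (long i = 0; i < 3; i++)
            pthread_create(&threads[i], NULL, worker, (void *) i);
        for (int i = 0; i < 3; i++)
            pthread_join(threads[i], NULL);
        printf("counter = %d\n", counter);         /* prints 3 */
        return 0;
    }

Whether the three threads run truly in parallel or are time-division multiplexed onto one CPU is up to the hardware and the OS; the program is written the same way either way.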

CS61C L40 Hardware Parallel Computing (11) Beamer, Fall 2006 © UCB

Multithreading

• Multithreading is running multiple threads through the same hardware

• Could we do Time Division Multiplexing better in hardware?

• Consider if we gave the OS the abstraction of having 4 physical CPUs that share memory and each execute one thread, but we did it all on 1 physical CPU? (see the sketch below)
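A small sketch of what that abstraction looks like from user code (Linux/glibc; sysconf's _SC_NPROCESSORS_ONLN is a common extension, not strict POSIX): the OS reports logical CPUs, so a single core with 4 hardware threads shows up as 4.

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* Number of logical CPUs the OS can schedule threads onto;
           on a multithreaded core this counts hardware threads, not dies. */
        long cpus = sysconf(_SC_NPROCESSORS_ONLN);
        printf("OS sees %ld logical CPUs\n", cpus);
        return 0;
    }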

CS61C L40 Hardware Parallel Computing (12) Beamer, Fall 2006 © UCB

Static Multithreading Example

[Diagram: a single pipeline (stages feeding the ALU) whose pipeline stages are shared round-robin by 4 threads, so it appears to be 4 CPUs, each at 1/4 the clock rate. Technique introduced in 1964 by Seymour Cray.]


CS61C L40 Hardware Parallel Computing (13) Beamer, Fall 2006 © UCB

Static Multithreading Example Analyzed

• Results:
  • 4 threads running in hardware
  • Pipeline hazards reduced
    - No more need to forward
    - No control issues
    - Fewer structural hazards

• Depends on being able to fully generate 4 threads evenly
  - Example: if 1 thread does 75% of the work,
    Utilization = (% time run)(% work done) = (.25)(.75) + (.75)(.25) = .375 = 37.5%

CS61C L40 Hardware Parallel Computing (14) Beamer, Fall 2006 © UCB

Dynamic Multithreading

• Adds flexibility in choosing the time to switch threads

• Simultaneous Multithreading (SMT)
  • Called Hyperthreading by Intel
  • Run multiple threads at the same time
  • Just allocate functional units when available

• Superscalar helps with this

CS61C L40 Hardware Parallel Computing (15) Beamer, Fall 2006 © UCB

Dynamic Multithreading Example

[Issue-slot diagrams, cycles 1-9, for a machine with 8 functional units (2 memory, 2 fixed-point, 2 floating-point, branch, condition codes):
  One thread, 8 units - many issue slots go unused each cycle
  Two threads, 8 units - the two threads together fill far more of the slots]

CS61C L40 Hardware Parallel Computing (16) Beamer, Fall 2006 © UCB

Multicore

• Put multiple CPUs on the same die
• Why is this better than multiple dies?
  • Smaller
  • Cheaper
  • Closer, so lower inter-processor latency
  • Can share an L2 Cache (details)
  • Less power

• Cost of multicore: complexity and slower single-thread execution

CS61C L40 Hardware Parallel Computing (17) Beamer, Fall 2006 © UCB

Multicore Example (IBM Power5)

[Die photo: two cores (Core #1, Core #2) plus shared resources]

CS61C L40 Hardware Parallel Computing (18) Beamer, Fall 2006 © UCB

Administrivia

• Proj4 due tonight at 11:59pm
• Proj1 - Check newsgroup for posting about Proj1 regrades, you may want one

• Lab tomorrow will have 2 surveys
• Come to class Friday for the HKN course survey


CS61C L40 Hardware Parallel Computing (19) Beamer, Fall 2006 © UCB

Upcoming Calendar

Week #15 (This Week):
  Mon - Parallel Computing in Software
  Wed - Parallel Computing in Hardware (Scott)
  Thu Lab - I/O Networking & 61C Feedback Survey
  Fri - LAST CLASS: Summary, Review, & HKN Evals

Week #16:
  Sun 2pm - Review, 10 Evans
  FINAL EXAM: THU 12-14 @ 12:30pm-3:30pm, 234 Hearst Gym

Final exam
Same rules as Midterm, except you get 2 double-sided handwritten review sheets (1 from your midterm, 1 new one) + green sheet [Don't bring backpacks]

CS61C L40 Hardware Parallel Computing (20) Beamer, Fall 2006 © UCB

Real World Example 1: Cell Processor

• Multicore, and more….

CS61C L40 Hardware Parallel Computing (21) Beamer, Fall 2006 © UCB

Real World Example 1: Cell Processor

• 9 Cores (1 PPE, 8 SPEs) at 3.2GHz

• Power Processing Element (PPE)
  • Supervises all activities, allocates work
  • Is multithreaded (2 threads)

• Synergistic Processing Element (SPE)
  • Where work gets done
  • Very superscalar
  • No cache, only Local Store

CS61C L40 Hardware Parallel Computing (22) Beamer, Fall 2006 © UCB

Real World Example 1: Cell Processor

• Great for other multimedia applications such as HDTV, cameras, etc.

• Really dependent on the programmer using the SPEs and Local Store to get the most out of it

CS61C L40 Hardware Parallel Computing (23) Beamer, Fall 2006 © UCB

Real World Example 2: Niagara Processor

• Multithreaded and Multicore

• 32 Threads (8 cores, 4 threads each) at 1.2GHz

• Designed for low power

• Has simpler pipelines to fit more on
• Maximizes thread-level parallelism
• Project Blackbox

CS61C L40 Hardware Parallel Computing (24) Beamer, Fall 2006 © UCB

Real World Example 2: Niagara Processor

• Each thread runs slower (1.2GHz), and there is less number crunching ability (no FP unit), but tons of threads

• This is great for webservers, where there are typically many simple requests, and many data stalls

• Can beat "faster" and more expensive CPUs, while using less power


CS61C L40 Hardware Parallel Computing (25) Beamer, Fall 2006 © UCB

Peer Instruction

1. The majority of PS3's processing power comes from the Cell processor

2. A computer that has max utilization can get more done multithreaded

3. Current multicore techniques can scale well to many (32+) cores

   ABC
1: FFF   2: FFT   3: FTF   4: FTT
5: TFF   6: TFT   7: TTF   8: TTT

CS61C L40 Hardware Parallel Computing (26) Beamer, Fall 2006 © UCB

Peer Instruction Answer

   ABC
1: FFF   2: FFT   3: FTF   4: FTT
5: TFF   6: TFT   7: TTF   8: TTT

1. The majority of PS3's processing power comes from the Cell processor
   - All of PS3 is 2.18 TFLOPS, Cell is only 204 GFLOPS (GPU can do a lot…)  FALSE

2. A computer that has max utilization can get more done multithreaded
   - No more functional power  FALSE

3. Current multicore techniques can scale well to many (32+) cores
   - Shared memory and caches are a huge barrier; that's why Cell has Local Store  FALSE

CS61C L40 Hardware Parallel Computing (27) Beamer, Fall 2006 © UCB

Summary

• Superscalar: More functional units
• Multithread: Multiple threads executing on the same CPU

• Multicore: Multiple CPUs on the same die

• The gains from all these parallel hardware techniques rely heavily on the programmer being able to map their task well to multiple threads

• Hit up CS150, CS152, CS162 and Wikipedia for more info