
CS 61C: Great Ideas in Computer Architecture (Machine Structures)
Thread-Level Parallelism (TLP) and OpenMP Intro

Instructors: Krste Asanovic & Vladimir Stojanovic
http://inst.eecs.berkeley.edu/~cs61c/

Review
•  Amdahl’s Law: Serial sections limit speedup
•  Flynn Taxonomy
•  Intel SSE SIMD Instructions
   –  Exploit data-level parallelism in loops
   –  One instruction fetch that operates on multiple operands simultaneously
   –  128-bit XMM registers
•  SSE Instructions in C
   –  Embed the SSE machine instructions directly into C programs through use of intrinsics
   –  Achieve efficiency beyond that of optimizing compiler

New-School Machine Structures (It’s a bit more complicated!)

•  Parallel Requests: assigned to computer, e.g., Search “Katz”
•  Parallel Threads: assigned to core, e.g., Lookup, Ads
•  Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
•  Parallel Data: >1 data item @ one time, e.g., Add of 4 pairs of words
•  Hardware descriptions: All gates @ one time
•  Programming Languages

[Figure: software/hardware stack, labeled "Harness Parallelism & Achieve High Performance". Levels shown: Warehouse Scale Computer and Smart Phone; Computer (Core, Core, …; Memory (Cache); Input/Output); Core (Instruction Unit(s); Functional Unit(s) computing A0+B0, A1+B1, A2+B2, A3+B3); Cache Memory; Logic Gates. "Project 3" is marked on the figure.]

Simple Multiprocessor

[Figure: Processor 0 and Processor 1, each with Control and a Datapath (PC, Registers, ALU). Both issue memory accesses to a single shared Memory (Bytes), which connects to Input and Output through the I/O-Memory Interfaces.]

Multiprocessor Execution Model
•  Each processor has its own PC and executes an independent stream of instructions (MIMD)
•  Different processors can access the same memory space
   –  Processors can communicate via shared memory by storing/loading to/from common locations
•  Two ways to use a multiprocessor:
   1.  Deliver high throughput for independent jobs via job-level parallelism
   2.  Improve the run time of a single program that has been specially crafted to run on a multiprocessor - a parallel-processing program

Use term core for processor (“Multicore”) because “Multiprocessor Microprocessor” too redundant

Transition to Multicore

[Figure: Sequential App Performance]

Parallelism Only Path to Higher Performance

•  Sequential processor performance not expected to increase much, and might go down
•  If want apps with more capability, have to embrace parallel processing (SIMD and MIMD)
•  In mobile systems, use multiple cores and GPUs
•  In warehouse-scale computers, use multiple nodes, and all the MIMD/SIMD capability of each node

Multiprocessors and You
•  Only path to performance is parallelism
   –  Clock rates flat or declining
   –  SIMD: 2X width every 3-4 years
      •  128b wide now, 256b 2011, 512b in 2014, 1024b in 2018?
   –  MIMD: Add 2 cores every 2 years: 2, 4, 6, 8, 10, …
•  Key challenge is to craft parallel programs that have high performance on multiprocessors as the number of processors increases, i.e., that scale
   –  Scheduling, load balancing, time for synchronization, overhead for communication
•  Project 3: fastest code on 8-core computers
   –  2 chips/computer, 4 cores/chip

Potential Parallel Performance (assuming SW can use it)

Year   Cores   SIMD bits/Core   Core * SIMD bits   Peak DP FLOPs/Cycle
2003     2         128                256                  4
2005     4         128                512                  8
2007     6         128                768                 12
2009     8         128               1024                 16
2011    10         256               2560                 40
2013    12         256               3072                 48
2015    14         512               7168                112
2017    16         512               8192                128
2019    18        1024              18432                288
2021    20        1024              20480                320

Growth annotations: MIMD +2 cores / 2 yrs (2.5X); SIMD 2X width / 4 yrs (8X); MIMD * SIMD (20X)
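The last column of the table appears to follow from the column before it: a double-precision value is 64 bits, and the figures match one DP operation per 64-bit SIMD lane per cycle on every core. A minimal sketch of that arithmetic follows; the function name is hypothetical and the one-operation-per-lane rule is an assumption inferred from the table, not something the slide states.

```c
/* Assumption: one double-precision operation per 64-bit SIMD lane per cycle,
   which reproduces the table's last column. The function name is
   hypothetical, chosen only for this illustration. */
unsigned peak_dp_flops_per_cycle(unsigned cores, unsigned simd_bits_per_core)
{
    const unsigned bits_per_double = 64;
    unsigned lanes_per_core = simd_bits_per_core / bits_per_double;
    return cores * lanes_per_core;
}
```

For example, `peak_dp_flops_per_cycle(10, 256)` gives 40, matching the 2011 row.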

Threads
•  Thread: a sequential flow of instructions that performs some task
•  Each thread has a PC + processor registers and accesses the shared memory
•  Each processor provides one (or more) hardware threads (or harts) that actively execute instructions
•  Operating system multiplexes multiple software threads onto the available hardware threads

Operating System Threads

Give the illusion of many active threads by time-multiplexing software threads onto hardware threads

•  Remove a software thread from a hardware thread by interrupting its execution and saving its registers and PC into memory
   –  Also if one thread is blocked waiting for network access or user input
•  Can make a different software thread active by loading its registers into a hardware thread’s registers and jumping to its saved PC

Hardware Multithreading
•  Basic idea: Processor resources are expensive and should not be left idle
•  Long latency to memory on a cache miss?
•  Hardware switches threads to bring in other useful work while waiting for cache miss
•  Cost of thread context switch must be much less than cache miss latency
•  Put in redundant hardware so don’t have to save context on every thread switch:
   –  PC, Registers
•  Attractive for apps with abundant TLP
   –  Commercial multi-user workloads

Hardware Multithreading

[Figure: one Processor containing Control and a Datapath (ALU) with two hardware contexts, PC 0 / Registers 0 and PC 1 / Registers 1, connected to Memory (Bytes), Input, and Output through the I/O-Memory Interfaces.]

•  Two copies of PC and Registers inside processor hardware
•  Looks like two processors to software (hardware thread 0, hardware thread 1)
•  Control logic decides which thread to execute an instruction from next

Multithreading vs. Multicore
•  Multithreading => Better Utilization
   –  ≈1% more hardware, 1.10X better performance?
   –  Share integer adders, floating-point units, all caches (L1 I$, L1 D$, L2$, L3$), Memory Controller
•  Multicore => Duplicate Processors
   –  ≈50% more hardware, ≈2X better performance?
   –  Share outer caches (L2$, L3$), Memory Controller
•  Modern machines do both
   –  Multiple cores with multiple threads per core

Krste’s MacBook Air
•  /usr/sbin/sysctl -a | grep hw\.

    hw.model = MacBookAir5,1
    …
    hw.physicalcpu: 2
    hw.logicalcpu: 4
    …
    hw.cpufrequency = 2,000,000,000
    hw.memsize = 8,589,934,592
    hw.cachelinesize = 64
    hw.l1icachesize: 32,768
    hw.l1dcachesize: 32,768
    hw.l2cachesize: 262,144
    hw.l3cachesize: 4,194,304

Machines in (old) 61C Lab
•  /usr/sbin/sysctl -a | grep hw\.

    hw.model = MacPro4,1
    …
    hw.physicalcpu: 8
    hw.logicalcpu: 16
    …
    hw.cpufrequency = 2,260,000,000
    hw.physmem = 2,147,483,648
    hw.cachelinesize = 64
    hw.l1icachesize: 32,768
    hw.l1dcachesize: 32,768
    hw.l2cachesize: 262,144
    hw.l3cachesize: 8,388,608

Therefore, should try up to 16 threads to see if there is a performance gain, even though there are only 8 cores

Administrivia

100s of (Mostly Dead) Parallel Programming Languages

ActorScript     Concurrent Pascal     JoCaml       Orc
Ada             Concurrent ML         Join         Oz
Afnix           Concurrent Haskell    Java         Pict
Alef            Curry                 Joule        Reia
Alice           CUDA                  Joyce        SALSA
APL             E                     LabVIEW      Scala
Axum            Eiffel                Limbo        SISAL
Chapel          Erlang                Linda        SR
Cilk            Fortran 90            MultiLisp    Stackless Python
Clean           Go                    Modula-3     SuperPascal
Clojure         Io                    Occam        VHDL
Concurrent C    Janus                 occam-π      XC

OpenMP

•  OpenMP is a language extension used for multi-threaded, shared-memory parallelism
   –  Compiler Directives (inserted into source code)
   –  Runtime Library Routines (called from your code)
   –  Environment Variables (set in your shell)
•  Portable
•  Standardized
•  Easy to compile: cc -fopenmp name.c

Shared Memory Model with Explicit Thread-based Parallelism

•  Multiple threads in a shared memory environment, explicit programming model with full programmer control over parallelization
•  Pros:
   –  Takes advantage of shared memory, programmer need not worry (that much) about data placement
   –  Compiler directives are simple and easy to use
   –  Legacy serial code does not need to be rewritten
•  Cons:
   –  Code can only be run in shared memory environments
   –  Compiler must support OpenMP (e.g. gcc 4.2)

OpenMP in CS61C

•  OpenMP is built on top of C, so you don’t have to learn a whole new programming language
   –  Make sure to add #include <omp.h>
   –  Compile with flag: gcc -fopenmp
   –  Mostly just a few lines of code to learn
•  You will NOT become experts at OpenMP
   –  Use slides as reference, will learn to use in lab
•  Key ideas:
   –  Shared vs. Private variables
   –  OpenMP directives for parallelization, work sharing, synchronization

OpenMP Programming Model
•  Fork - Join Model:
•  OpenMP programs begin as a single process (master thread) and execute sequentially until the first parallel region construct is encountered
   –  FORK: Master thread then creates a team of parallel threads
   –  Statements in program that are enclosed by the parallel region construct are executed in parallel among the various threads
   –  JOIN: When the team threads complete the statements in the parallel region construct, they synchronize and terminate, leaving only the master thread

OpenMP Extends C with Pragmas

•  Pragmas are a preprocessor mechanism C provides for language extensions
•  Commonly implemented pragmas: structure packing, symbol aliasing, floating point exception modes (not covered in 61C)
•  Good mechanism for OpenMP because compilers that don't recognize a pragma are supposed to ignore it
   –  Runs on sequential computer even with embedded pragmas

parallel Pragma and Scope

•  Basic OpenMP construct for parallelization:

    #pragma omp parallel
    {
      /* code goes here */
    }

   –  Each thread runs a copy of code within the block
   –  Thread scheduling is non-deterministic
   –  This is annoying, but curly brace MUST go on separate line from #pragma
•  OpenMP default is shared variables
   –  To make private, need to declare with pragma:

    #pragma omp parallel private (x)
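A minimal sketch of shared versus private variables in a parallel region. The function name `record_thread_ids` and the constant `MAX_THREADS` are hypothetical, and the `_OPENMP` guard is an added assumption so the same file also builds when the pragmas are ignored. Each thread keeps its ID in a private `x` and writes only its own slot of a shared array, so no two threads write the same location.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

#define MAX_THREADS 64   /* hypothetical cap on slots used in this sketch */

/* Each thread stores its ID in private x, then records it in the shared
   array slot with that index. Returns the team size seen by thread 0. */
int record_thread_ids(int seen[MAX_THREADS])
{
    int x = -1;          /* private inside the region: one copy per thread */
    int team_size = 1;   /* shared: written by thread 0 only */

    #pragma omp parallel private(x)
    {
#ifdef _OPENMP
        x = omp_get_thread_num();
        if (x == 0)
            team_size = omp_get_num_threads();
#else
        x = 0;           /* pragma ignored: one sequential "thread" */
#endif
        if (x < MAX_THREADS)
            seen[x] = x; /* distinct slot per thread */
    }
    return team_size;
}
```

Making `x` shared instead would let every thread overwrite the same variable, so a thread could record another thread's ID.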

Thread Creation

•  How many threads will OpenMP create?
•  Defined by OMP_NUM_THREADS environment variable (or code procedure call)
   –  Set this variable to the maximum number of threads you want OpenMP to use
   –  Usually equals the number of cores in the underlying hardware on which the program is run

What Kind of Threads?

•  OpenMP threads are operating system (software) threads.
•  OS will multiplex requested OpenMP threads onto available hardware threads.
•  Hopefully each gets a real hardware thread to run on, so no OS-level time-multiplexing.
•  But other tasks on machine can also use hardware threads!
•  Be careful when timing results for project 3!

OMP_NUM_THREADS

•  OpenMP intrinsic to set number of threads:
    omp_set_num_threads(x);
•  OpenMP intrinsic to get number of threads:
    num_th = omp_get_num_threads();
•  OpenMP intrinsic to get Thread ID number:
    th_ID = omp_get_thread_num();
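One detail that is easy to miss: omp_get_num_threads() reports the size of the current team, so it returns 1 when called from sequential code outside a parallel region. A small sketch, assuming a hypothetical helper named `team_sizes`; the `_OPENMP` guard is added so the file still compiles without OpenMP.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Requests `requested` threads, then reports the team size seen outside and
   inside a parallel region. The runtime may grant fewer than requested. */
void team_sizes(int requested, int *inside, int *outside)
{
    *inside = 1;
    *outside = 1;
#ifdef _OPENMP
    omp_set_num_threads(requested);
    *outside = omp_get_num_threads();     /* 1 when called from sequential code */
    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0)    /* only the master thread writes */
            *inside = omp_get_num_threads();
    }
#else
    (void)requested;                      /* no OpenMP: one thread everywhere */
#endif
}
```

Calling `team_sizes(4, &in, &out)` from main leaves `out` at 1 and `in` somewhere between 1 and 4.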

Parallel Hello World

    #include <stdio.h>
    #include <omp.h>

    int main () {
      int nthreads, tid;

      /* Fork team of threads with private var tid */
      #pragma omp parallel private(tid)
      {
        tid = omp_get_thread_num();  /* get thread id */
        printf("Hello World from thread = %d\n", tid);

        /* Only master thread does this */
        if (tid == 0) {
          nthreads = omp_get_num_threads();
          printf("Number of threads = %d\n", nthreads);
        }
      }  /* All threads join master and terminate */
    }

Data Races and Synchronization
•  Two memory accesses form a data race if they come from different threads to the same location, at least one is a write, and they occur one after another
•  If there is a data race, result of program can vary depending on chance (which thread first?)
•  Avoid data races by synchronizing writing and reading to get deterministic behavior
•  Synchronization done by user-level routines that rely on hardware synchronization instructions
•  (more later)
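As an illustration of synchronizing writes, the sketch below updates a shared counter from many threads and protects the update with the OpenMP critical construct, so only one thread at a time performs the load-add-store. The function name `count_with_critical` is hypothetical. Without the critical pragma the updates would form a data race and the final count could come out low.

```c
/* Adds 1 to a shared counter `iterations` times, with the iterations split
   across the team. The critical pragma lets one thread at a time run the
   update, so updates from different threads cannot interleave. */
long count_with_critical(int iterations)
{
    long counter = 0;   /* shared by all threads */
    int i;              /* loop index: private per thread */

    #pragma omp parallel for
    for (i = 0; i < iterations; i++) {
        #pragma omp critical
        {
            counter = counter + 1;
        }
    }
    return counter;
}
```

Serializing every iteration like this is slow; it is meant to show correctness, not performance.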

Analogy: Buying Milk

•  Your fridge has no milk. You and your roommate will return from classes at some point and check the fridge
•  Whoever gets home first will check the fridge, go and buy milk, and return
•  What if the other person gets back while the first person is buying milk?
   –  You’ve just bought twice as much milk as you need!
•  It would’ve helped to have left a note…

Lock Synchronization (1/2)

•  Use a “Lock” to grant access to a region (critical section) so that only one thread can operate at a time
   –  Need all processors to be able to access the lock, so use a location in shared memory as the lock
•  Processors read lock and either wait (if locked) or set lock and go into critical section
   –  0 means lock is free / open / unlocked / lock off
   –  1 means lock is set / closed / locked / lock on

Lock Synchronization (2/2)

•  Pseudocode:

    Check lock              <- Can loop/idle here if locked
    Set the lock
    Critical section        (e.g. change shared variables)
    Unset the lock

Possible Lock Implementation

•  Lock (a.k.a. busy wait)

    Get_lock:                      # $s0 -> addr of lock
          addiu $t1,$zero,1        # t1 = Locked value
    Loop: lw    $t0,0($s0)         # load lock
          bne   $t0,$zero,Loop     # loop if locked
    Lock: sw    $t1,0($s0)         # Unlocked, so lock

•  Unlock

    Unlock:
          sw    $zero,0($s0)

•  Any problems with this?

Possible Lock Problem

One interleaving in which both loads happen before either store:

    Time  Thread 1                     Thread 2
     |    addiu $t1,$zero,1
     |    Loop: lw $t0,0($s0)
     |                                 addiu $t1,$zero,1
     |                                 Loop: lw $t0,0($s0)
     |    bne $t0,$zero,Loop
     |                                 bne $t0,$zero,Loop
     |    Lock: sw $t1,0($s0)
     v                                 Lock: sw $t1,0($s0)

Both threads think they have set the lock! Exclusive access not guaranteed!

Hardware Synchronization

•  Hardware support required to prevent an interloper (another thread) from changing the value
   –  Atomic read/write memory operation
   –  No other access to the location allowed between the read and write
•  How best to implement in software?
   –  Single instr? Atomic swap of register ↔ memory
   –  Pair of instr? One for read, one for write

Synchronization in MIPS

•  Load linked: ll rt,off(rs)
•  Store conditional: sc rt,off(rs)
   –  Returns 1 (success) in rt if location has not changed since the ll
   –  Returns 0 (failure) in rt if location has changed
•  Note that sc clobbers the register value being stored (rt)!
   –  Need to have a copy elsewhere if you plan on repeating on failure or using value later

Synchronization in MIPS Example

•  Atomic swap (to test/set lock variable)
   Exchange contents of register and memory: $s4 ↔ Mem($s1)

    try: add  $t0,$zero,$s4    # copy value
         ll   $t1,0($s1)       # load linked
                               # sc would fail if another thread executes sc here
         sc   $t0,0($s1)       # store conditional
         beq  $t0,$zero,try    # loop if sc fails
         add  $s4,$zero,$t1    # load value in $s4

Test-and-Set

•  In a single atomic operation:
   –  Test to see if a memory location is set (contains a 1)
   –  Set it (to 1) if it isn’t (it contained a zero when tested)
   –  Otherwise indicate that the Set failed, so the program can try again
   –  While accessing, no other instruction can modify the memory location, including other Test-and-Set instructions
•  Useful for implementing lock operations
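For comparison, C11 exposes a test-and-set primitive through <stdatomic.h>. The sketch below builds a busy-wait lock on atomic_flag_test_and_set; the type name `SpinLock` and the three function names are hypothetical, and this is an illustration of the idea rather than what OpenMP or any particular library does internally.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type: flag set means locked, clear means free. */
typedef struct { atomic_flag held; } SpinLock;

/* One test-and-set attempt: returns true if this call acquired the lock. */
bool spin_try_lock(SpinLock *l)
{
    /* atomic_flag_test_and_set sets the flag and returns its previous value
       in one atomic step; a previous value of false means it was free. */
    return !atomic_flag_test_and_set(&l->held);
}

/* Busy wait: retry the test-and-set until it succeeds. */
void spin_lock(SpinLock *l)
{
    while (!spin_try_lock(l))
        ;
}

/* Release the lock by clearing the flag. */
void spin_unlock(SpinLock *l)
{
    atomic_flag_clear(&l->held);
}
```

A lock is declared as `SpinLock l = { ATOMIC_FLAG_INIT };` and the critical section sits between `spin_lock(&l)` and `spin_unlock(&l)`.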

Test-and-Set in MIPS

•  Example: MIPS sequence for implementing a T&S at ($s1)

    Try:    addiu $t0,$zero,1
            ll    $t1,0($s1)
            bne   $t1,$zero,Try
            sc    $t0,0($s1)
            beq   $t0,$zero,Try
    Locked:
            # critical section
    Unlock: sw    $zero,0($s1)

The idea is not for programmers to use this directly, but to use it as a tool for implementing parallel libraries

Clickers: Consider the following code when executed concurrently by two threads. What possible values can result in *($s0)?

    # *($s0) = 100
    lw   $t0,0($s0)
    addi $t0,$t0,1
    sw   $t0,0($s0)

A: 101 or 102
B: 100, 101, or 102
C: 100 or 101
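One way to check an answer is to enumerate every interleaving by hand or by program. The sketch below simulates the three instructions for two threads and records which final values occur. It assumes each instruction executes atomically and that both threads see a single consistent memory; the type and function names are hypothetical.

```c
#include <stdbool.h>

/* Simulated per-thread state: program counter (0..3) and register $t0. */
typedef struct { int pc; int t0; } ThreadState;

/* Executes the next instruction of thread t against the shared word *mem. */
static void step(ThreadState *t, int *mem)
{
    switch (t->pc) {
    case 0: t->t0 = *mem;      break;  /* lw   $t0,0($s0) */
    case 1: t->t0 = t->t0 + 1; break;  /* addi $t0,$t0,1  */
    case 2: *mem = t->t0;      break;  /* sw   $t0,0($s0) */
    }
    t->pc++;
}

/* Tries every interleaving; marks seen[v - 100] for each final value v. */
static void explore(ThreadState a, ThreadState b, int mem, bool seen[3])
{
    if (a.pc == 3 && b.pc == 3) {
        seen[mem - 100] = true;
        return;
    }
    if (a.pc < 3) {                    /* let thread A run one instruction */
        ThreadState a2 = a;
        int m2 = mem;
        step(&a2, &m2);
        explore(a2, b, m2, seen);
    }
    if (b.pc < 3) {                    /* let thread B run one instruction */
        ThreadState b2 = b;
        int m2 = mem;
        step(&b2, &m2);
        explore(a, b2, m2, seen);
    }
}

/* Fills seen[0..2] for the final values 100, 101, 102. */
void possible_results(bool seen[3])
{
    ThreadState start = { 0, 0 };
    seen[0] = seen[1] = seen[2] = false;
    explore(start, start, 100, seen);
}
```

Under these assumptions the enumeration marks 101 (both threads load 100 before either stores) and 102 (one thread finishes before the other loads), and never 100.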

OpenMP Directives (Work-Sharing)

[Figure: the three work-sharing constructs]
•  for: Shares iterations of a loop across the threads
•  sections: Each section is executed by a separate thread
•  single: Serializes the execution of a thread

•  These are defined within a parallel section

Parallel Statement Shorthand

    #pragma omp parallel
    {
      #pragma omp for   /* this is the only directive in the parallel section */
      for(i=0;i<len;i++) { … }
    }

can be shortened to:

    #pragma omp parallel for
    for(i=0;i<len;i++) { … }

•  Also works for sections
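The sections form of the shorthand can be sketched as follows. The function name `fill_both` and the work done in each section are hypothetical; the point is only the shape of the combined `parallel sections` directive with one `section` per independent piece of work.

```c
/* Fills a[] and b[] with independent work. With OpenMP enabled, each
   section may be executed by a different thread of the team; with the
   pragmas ignored, the two loops simply run one after the other. */
void fill_both(int *a, int *b, int n)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            for (int i = 0; i < n; i++)
                a[i] = i;          /* this section touches only a[] */
        }
        #pragma omp section
        {
            for (int i = 0; i < n; i++)
                b[i] = 2 * i;      /* this section touches only b[] */
        }
    }
}
```

Because the loop indices are declared inside each section, every thread gets its own copy and the two sections share no written data.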

Building Block: for loop

    for (i=0; i<max; i++) zero[i] = 0;

•  Break for loop into chunks, and allocate each to a separate thread
   –  e.g. if max = 100 with 2 threads: assign 0-49 to thread 0, and 50-99 to thread 1
•  Must have relatively simple “shape” for an OpenMP-aware compiler to be able to parallelize it
   –  Necessary for the run-time system to be able to determine how many of the loop iterations to assign to each thread
•  No premature exits from the loop allowed
   –  i.e. No break, return, exit, goto statements
   –  In general, don’t jump outside of any pragma block
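The chunking in the example (0-49 to thread 0, 50-99 to thread 1) can be sketched as a small helper that computes each thread's contiguous range. The function name `chunk_bounds` is hypothetical, and giving leftover iterations to the lowest-numbered threads is an assumption of this sketch; the actual split is chosen by the OpenMP runtime and its scheduling policy.

```c
/* Computes the half-open range [*begin, *end) of loop iterations for
   `thread` out of `nthreads`, using contiguous blocks. Leftover iterations
   go to the lowest-numbered threads (an assumption of this sketch). */
void chunk_bounds(int max, int nthreads, int thread, int *begin, int *end)
{
    int base = max / nthreads;      /* iterations every thread gets */
    int extra = max % nthreads;     /* leftover iterations to hand out */
    int extra_before = thread < extra ? thread : extra;

    *begin = thread * base + extra_before;
    *end = *begin + base + (thread < extra ? 1 : 0);
}
```

With max = 100 and 2 threads this gives [0, 50) for thread 0 and [50, 100) for thread 1, matching the slide's example. The ranges can only be computed up front because the loop's trip count is known before it starts, which is why the loop needs a simple shape.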

Parallel for pragma

    #pragma omp parallel for
    for (i=0; i<max; i++) zero[i] = 0;

•  Master thread creates additional threads, each with a separate execution context
•  All variables declared outside for loop are shared by default, except for loop index which is private per thread (Why?)
•  Implicit synchronization at end of for loop
•  Divide index regions sequentially per thread
   –  Thread 0 gets 0, 1, …, (max/n)-1;
   –  Thread 1 gets max/n, max/n+1, …, 2*(max/n)-1
   –  Why?

OpenMP Timing

•  Elapsed wall clock time: double omp_get_wtime(void);
   –  Returns elapsed wall clock time in seconds
   –  Time is measured per thread, no guarantee can be made that two distinct threads measure the same time
   –  Time is measured from “some time in the past,” so subtract results of two calls to omp_get_wtime to get elapsed time
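A minimal sketch of the subtract-two-calls pattern, timing a parallel loop that zeroes an array. The names `now_seconds` and `timed_zero` are hypothetical, and the clock() fallback used when OpenMP is not enabled is an added assumption that measures CPU time rather than wall clock time.

```c
#ifdef _OPENMP
#include <omp.h>
#else
#include <time.h>
#endif

/* Seconds since some fixed point in the past. */
static double now_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();                    /* wall clock time */
#else
    return (double)clock() / CLOCKS_PER_SEC;   /* assumption: CPU-time fallback */
#endif
}

/* Zeroes a[0..n-1] in parallel and returns the elapsed seconds. */
double timed_zero(double *a, int n)
{
    int i;
    double start = now_seconds();

    #pragma omp parallel for
    for (i = 0; i < n; i++)
        a[i] = 0.0;

    return now_seconds() - start;
}
```

Both calls are made by the same thread, outside the parallel loop, which sidesteps the caveat that different threads may not measure the same time.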

Matrix Multiply in OpenMP

    start_time = omp_get_wtime();

    #pragma omp parallel for private(tmp, i, j, k)
    for (i=0; i<Mdim; i++){
      for (j=0; j<Ndim; j++){
        tmp = 0.0;
        for( k=0; k<Pdim; k++){
          /* C(i,j) = sum(over k) A(i,k) * B(k,j)*/
          tmp += *(A+(i*Pdim+k)) * *(B+(k*Ndim+j));
        }
        *(C+(i*Ndim+j)) = tmp;
      }
    }
    run_time = omp_get_wtime() - start_time;

Outer loop spread across N threads; inner loops inside a single thread

Notes on Matrix Multiply Example

•  More performance optimizations available:
   –  Higher compiler optimization (-O2, -O3) to reduce number of instructions executed
   –  Cache blocking to improve memory performance
   –  Using SIMD SSE instructions to raise floating point computation rate (DLP)

And in Conclusion, …
•  Sequential software is slow software
   –  SIMD and MIMD only path to higher performance
•  Multithreading increases utilization, Multicore more processors (MIMD)
•  OpenMP as simple parallel extension to C
   –  Threads, Parallel for, private, critical sections, …
   –  ≈ C: small so easy to learn, but not very high level and it’s easy to get into trouble
