CS 352H: Computer Systems Architecture
Topic 14: Multicores, Multiprocessors, and Clusters
University of Texas at Austin, Fall 2009, Don Fussell



Introduction

Goal: connecting multiple computers to get higher performance
  Multiprocessors: scalability, availability, power efficiency

Job-level (process-level) parallelism: high throughput for independent jobs

Parallel processing program: single program run on multiple processors

Multicore microprocessors: chips with multiple processors (cores)


Hardware and Software

Hardware
  Serial: e.g., Pentium 4
  Parallel: e.g., quad-core Xeon e5345

Software
  Sequential: e.g., matrix multiplication
  Concurrent: e.g., operating system

Sequential/concurrent software can run on serial/parallel hardware

Challenge: making effective use of parallel hardware


What We’ve Already Covered

§2.11: Parallelism and Instructions (synchronization)

§3.6: Parallelism and Computer Arithmetic (associativity)

§4.10: Parallelism and Advanced Instruction-Level Parallelism

§5.8: Parallelism and Memory Hierarchies (cache coherence)

§6.9: Parallelism and I/O (redundant arrays of inexpensive disks)


Parallel Programming

Parallel software is the problem

Need to get significant performance improvement; otherwise, just use a faster uniprocessor, since it's easier!

Difficulties: partitioning, coordination, communications overhead


Amdahl’s Law

Sequential part can limit speedup

Example: 100 processors, 90× speedup?

  T_new = T_parallelizable/100 + T_sequential

  Speedup = 1 / ((1 − F_parallelizable) + F_parallelizable/100) = 90

Solving: F_parallelizable = 0.999

Need sequential part to be 0.1% of original time
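
As a sanity check, the same calculation as a tiny C sketch (the helper name is ours, not the slide's):

    #include <stdio.h>

    /* Amdahl's Law: predicted speedup for a parallelizable fraction f
       on p processors (illustrative helper, not from the slides). */
    double amdahl_speedup(double f, int p) {
        return 1.0 / ((1.0 - f) + f / p);
    }

    int main(void) {
        /* f = 0.999 on 100 processors prints ~91x, matching the ~90x
           target: only 0.1% of the time may remain sequential. */
        printf("speedup = %.1f\n", amdahl_speedup(0.999, 100));
        return 0;
    }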


Scaling Example

Workload: sum of 10 scalars, and 10 × 10 matrix sum

Speed up from 10 to 100 processors

Single processor: Time = (10 + 100) × t_add

10 processors: Time = 10 × t_add + 100/10 × t_add = 20 × t_add
  Speedup = 110/20 = 5.5 (55% of potential)

100 processors: Time = 10 × t_add + 100/100 × t_add = 11 × t_add
  Speedup = 110/11 = 10 (10% of potential)

Assumes load can be balanced across processors
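
A throwaway C sketch checking the slide's arithmetic (time measured in units of t_add; the 10 scalar adds stay serial):

    #include <stdio.h>

    /* Time to sum 10 scalars serially plus matrix_elems elements in parallel. */
    double time_units(int processors, int matrix_elems) {
        return 10.0 + (double)matrix_elems / processors;
    }

    int main(void) {
        double t1 = time_units(1, 100);                               /* 110 */
        printf("10 procs:  %.1f\n", t1 / time_units(10, 100));        /* 5.5 */
        printf("100 procs: %.1f\n", t1 / time_units(100, 100));       /* 10.0 */
        return 0;
    }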


Scaling Example (cont)

What if matrix size is 100 × 100?

Single processor: Time = (10 + 10000) × t_add

10 processors: Time = 10 × t_add + 10000/10 × t_add = 1010 × t_add
  Speedup = 10010/1010 = 9.9 (99% of potential)

100 processors: Time = 10 × t_add + 10000/100 × t_add = 110 × t_add
  Speedup = 10010/110 = 91 (91% of potential)

Assuming load balanced


Strong vs Weak Scaling

Strong scaling: problem size fixed (as in the example)

Weak scaling: problem size proportional to number of processors

10 processors, 10 × 10 matrix: Time = 20 × t_add

100 processors, 32 × 32 matrix: Time = 10 × t_add + 1000/100 × t_add = 20 × t_add

Constant performance in this example


Shared Memory

SMP: shared memory multiprocessor

Hardware provides single physical address space for all processors

Synchronize shared variables using locks

Memory access time: UMA (uniform) vs. NUMA (nonuniform)


Example: Sum Reduction

Sum 100,000 numbers on 100-processor UMA

Each processor has ID: 0 ≤ Pn ≤ 99

Partition: 1000 numbers per processor

Initial summation on each processor:

    sum[Pn] = 0;
    for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
        sum[Pn] = sum[Pn] + A[i];

Now need to add these partial sums

Reduction: divide and conquer; half the processors add pairs, then a quarter, …

Need to synchronize between reduction steps


Example: Sum Reduction

    half = 100;
    repeat
        synch();
        if (half % 2 != 0 && Pn == 0)
            sum[0] = sum[0] + sum[half-1];
            /* conditional sum needed when half is odd;
               processor 0 gets the missing element */
        half = half/2;   /* dividing line on who sums */
        if (Pn < half)
            sum[Pn] = sum[Pn] + sum[Pn+half];
    until (half == 1);
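
One way to make synch() concrete: a minimal sketch using POSIX barriers, assuming one thread per processor and the barrier initialized elsewhere with pthread_barrier_init(&barrier, NULL, P):

    #include <pthread.h>

    #define P 100
    extern double sum[P];        /* per-processor partial sums */
    pthread_barrier_t barrier;   /* init: pthread_barrier_init(&barrier, NULL, P) */

    /* Same reduction as above; pthread_barrier_wait() plays synch(). */
    void reduce(int Pn) {        /* Pn: this thread's ID, 0..P-1 */
        int half = P;
        do {
            pthread_barrier_wait(&barrier);   /* synch() */
            if (half % 2 != 0 && Pn == 0)
                sum[0] += sum[half - 1];      /* pick up odd element */
            half = half / 2;                  /* who sums this step */
            if (Pn < half)
                sum[Pn] += sum[Pn + half];
        } while (half != 1);
    }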


Message Passing

Each processor has private physical address space

Hardware sends/receives messages between processors


Loosely Coupled Clusters

Network of independent computers

Each has private memory and OS

Connected using I/O system (e.g., Ethernet/switch, Internet)

Suitable for applications with independent tasks: web servers, databases, simulations, …

High availability, scalable, affordable

Problems:
  Administration cost (prefer virtual machines)
  Low interconnect bandwidth (cf. processor/memory bandwidth on an SMP)


Sum Reduction (Again)

Sum 100,000 on 100 processors

First distribute 1000 numbers to each

Then do partial sums:

    sum = 0;
    for (i = 0; i < 1000; i = i + 1)
        sum = sum + AN[i];

Reduction:
  Half the processors send, other half receive and add
  Then a quarter send, a quarter receive and add, …


Sum Reduction (Again)

Given send() and receive() operations:

    limit = 100; half = 100;  /* 100 processors */
    repeat
        half = (half+1)/2;    /* send vs. receive dividing line */
        if (Pn >= half && Pn < limit)
            send(Pn - half, sum);
        if (Pn < (limit/2))
            sum = sum + receive();
        limit = half;         /* upper limit of senders */
    until (half == 1);        /* exit with final sum */

Send/receive also provide synchronization

Assumes send/receive take similar time to addition
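
For comparison, the standard message-passing way to write this today is to let the library build the reduction tree. A minimal MPI sketch (not the slide's hand-rolled loop):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int Pn;
        MPI_Comm_rank(MPI_COMM_WORLD, &Pn);

        double local_sum = 0.0;   /* partial sum over this rank's 1000 numbers */
        /* ... sum this rank's slice into local_sum ... */

        /* MPI_Reduce implements the send/receive tree internally. */
        double total = 0.0;
        MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (Pn == 0) printf("sum = %f\n", total);

        MPI_Finalize();
        return 0;
    }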


Grid Computing

Separate computers interconnected by long-haul networks (e.g., Internet connections)

Work units farmed out, results sent back

Can make use of idle time on PCs (e.g., SETI@home, World Community Grid)


Multithreading

Performing multiple threads of execution in parallel

Replicate registers, PC, etc.; fast switching between threads

Fine-grain multithreading:
  Switch threads after each cycle
  Interleave instruction execution
  If one thread stalls, others are executed

Coarse-grain multithreading:
  Only switch on long stall (e.g., L2-cache miss)
  Simplifies hardware, but doesn't hide short stalls (e.g., data hazards)


Simultaneous Multithreading

In a multiple-issue dynamically scheduled processor:
  Schedule instructions from multiple threads
  Instructions from independent threads execute when function units are available
  Within threads, dependencies handled by scheduling and register renaming

Example: Intel Pentium 4 HT
  Two threads: duplicated registers, shared function units and caches


Multithreading Example


Future of Multithreading

Will it survive? In what form?

Power considerations ⇒ simplified microarchitectures, simpler forms of multithreading

Tolerating cache-miss latency: thread switch may be most effective

Multiple simple cores might share resources more effectively


Instruction and Data Streams

An alternate classification:

                          Single Data Stream         Multiple Data Streams
  Single Instruction      SISD: Intel Pentium 4      SIMD: SSE instructions of x86
  Multiple Instructions   MISD: no examples today    MIMD: Intel Xeon e5345

SPMD: Single Program Multiple Data
  A parallel program on a MIMD computer
  Conditional code for different processors


SIMD

Operate elementwise on vectors of data

E.g., MMX and SSE instructions in x86: multiple data elements in 128-bit wide registers (see the sketch below)

All processors execute the same instruction at the same time, each with different data address, etc.

Simplifies synchronization

Reduced instruction control hardware

Works best for highly data-parallel applications
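
A minimal illustration of the idea with x86 SSE intrinsics: one instruction adds four floats held in a 128-bit register (a sketch assuming n is a multiple of 4):

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* Elementwise add of two float arrays, four lanes per iteration. */
    void add4(float *c, const float *a, const float *b, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);          /* load 4 floats */
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(c + i, _mm_add_ps(va, vb)); /* one SIMD add */
        }
    }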


Vector Processors

Highly pipelined function units

Stream data from/to vector registers to units:
  Data collected from memory into registers
  Results stored from registers to memory

Example: vector extension to MIPS
  32 × 64-element registers (64-bit elements)
  Vector instructions:
    lv, sv: load/store vector
    addv.d: add vectors of double
    addvs.d: add scalar to each element of vector of double

Significantly reduces instruction-fetch bandwidth


Example: DAXPY (Y = a × X + Y)

Conventional MIPS code:

          l.d    $f0,a($sp)      ;load scalar a
          addiu  r4,$s0,#512     ;upper bound of what to load
    loop: l.d    $f2,0($s0)      ;load x(i)
          mul.d  $f2,$f2,$f0     ;a × x(i)
          l.d    $f4,0($s1)      ;load y(i)
          add.d  $f4,$f4,$f2     ;a × x(i) + y(i)
          s.d    $f4,0($s1)      ;store into y(i)
          addiu  $s0,$s0,#8      ;increment index to x
          addiu  $s1,$s1,#8      ;increment index to y
          subu   $t0,r4,$s0      ;compute bound
          bne    $t0,$zero,loop  ;check if done

Vector MIPS code:

          l.d     $f0,a($sp)     ;load scalar a
          lv      $v1,0($s0)     ;load vector x
          mulvs.d $v2,$v1,$f0    ;vector-scalar multiply
          lv      $v3,0($s1)     ;load vector y
          addv.d  $v4,$v2,$v3    ;add y to product
          sv      $v4,0($s1)     ;store the result
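
For reference, the loop both versions implement, written in plain C (the #512 bound above corresponds to n = 64 doubles of 8 bytes each):

    /* DAXPY: y(i) = a * x(i) + y(i) over n elements. */
    void daxpy(int n, double a, const double *x, double *y) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }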


Vector vs. Scalar

Vector architectures and compilers simplify data-parallel programming

Explicit statement of absence of loop-carried dependences: reduced checking in hardware

Regular access patterns benefit from interleaved and burst memory

Avoid control hazards by avoiding loops

More general than ad-hoc media extensions (such as MMX, SSE): better match with compiler technology


History of GPUs

Early video cards: frame buffer memory with address generation for video output

3D graphics processing:
  Originally high-end computers (e.g., SGI)
  Moore's Law ⇒ lower cost, higher density
  3D graphics cards for PCs and game consoles

Graphics Processing Units: processors oriented to 3D graphics tasks
  Vertex/pixel processing, shading, texture mapping, rasterization


Graphics in the System


GPU Architectures

Processing is highly data-parallel:
  GPUs are highly multithreaded
  Use thread switching to hide memory latency; less reliance on multi-level caches
  Graphics memory is wide and high-bandwidth

Trend toward general-purpose GPUs:
  Heterogeneous CPU/GPU systems
  CPU for sequential code, GPU for parallel code

Programming languages/APIs:
  DirectX, OpenGL
  C for Graphics (Cg), High Level Shader Language (HLSL)
  Compute Unified Device Architecture (CUDA)


Example: NVIDIA Tesla

[Figure: streaming multiprocessor, containing 8 × streaming processors]


Example: NVIDIA Tesla

Streaming Processors (SPs):
  Single-precision FP and integer units
  Each SP is fine-grained multithreaded

Warp: group of 32 threads
  Executed in parallel, SIMD style: 8 SPs × 4 clock cycles
  Hardware contexts for 24 warps: registers, PCs, …


Classifying GPUs

Don't fit nicely into the SIMD/MIMD model

Conditional execution in a thread allows an illusion of MIMD
  But with performance degradation
  Need to write general-purpose code with care

                                  Static: discovered      Dynamic: discovered
                                  at compile time         at runtime
  Instruction-Level Parallelism   VLIW                    Superscalar
  Data-Level Parallelism          SIMD or Vector          Tesla Multiprocessor


Interconnection Networks

Network topologies: arrangements of processors, switches, and links

Examples: bus, ring, 2D mesh, N-cube (N = 3), fully connected


Multistage Networks


Network Characteristics

Performance:
  Latency per message (unloaded network)
  Throughput: link bandwidth, total network bandwidth, bisection bandwidth
  Congestion delays (depending on traffic)

Cost

Power

Routability in silicon


Parallel Benchmarks

Linpack: matrix linear algebra

SPECrate: parallel run of SPEC CPU programs; job-level parallelism

SPLASH: Stanford Parallel Applications for Shared Memory; mix of kernels and applications, strong scaling

NAS (NASA Advanced Supercomputing) suite: computational fluid dynamics kernels

PARSEC (Princeton Application Repository for Shared Memory Computers) suite: multithreaded applications using Pthreads and OpenMP


Code or Applications?

Traditional benchmarks: fixed code and data sets

Parallel programming is evolving:
  Should algorithms, programming languages, and tools be part of the system?
  Compare systems, provided they implement a given application (e.g., Linpack, Berkeley Design Patterns)
  Would foster innovation in approaches to parallelism


Modeling Performance

Assume performance metric of interest is achievable GFLOPs/sec
  Measured using computational kernels from Berkeley Design Patterns

Arithmetic intensity of a kernel: FLOPs per byte of memory accessed

For a given computer, determine:
  Peak GFLOPS (from data sheet)
  Peak memory bytes/sec (using the Stream benchmark)


Roofline Diagram

Attainable GFLOPs/sec = Min(Peak Memory BW × Arithmetic Intensity, Peak FP Performance)
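
Expressed as a tiny C helper (parameter names are ours), the roofline is just the lower of the two roofs:

    /* Roofline model: throughput is capped by whichever roof is lower,
       the memory roof (BW × intensity) or the compute roof. */
    double attainable_gflops(double peak_gflops,       /* from data sheet */
                             double peak_bw_gb_per_s,  /* from Stream */
                             double flops_per_byte) {  /* arithmetic intensity */
        double memory_roof = peak_bw_gb_per_s * flops_per_byte;
        return memory_roof < peak_gflops ? memory_roof : peak_gflops;
    }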


Comparing Systems

Example: Opteron X2 vs. Opteron X4
  2-core vs. 4-core, 2× FP performance/core, 2.2GHz vs. 2.3GHz
  Same memory system

To get higher performance on X4 than X2:
  Need high arithmetic intensity
  Or working set must fit in X4's 2MB L3 cache


Optimizing Performance

Optimize FP performance:
  Balance adds and multiplies
  Improve superscalar ILP and use of SIMD instructions

Optimize memory usage:
  Software prefetch: avoid load stalls
  Memory affinity: avoid non-local data accesses


Optimizing Performance

Choice of optimization depends on arithmetic intensity of code

Arithmetic intensity is not always fixed:
  May scale with problem size
  Caching reduces memory accesses, which increases arithmetic intensity


Four Example Systems

2 × quad-core Intel Xeon e5345 (Clovertown)

2 × quad-core AMD Opteron X4 2356 (Barcelona)


Four Example Systems

2 × oct-core IBM Cell QS20

2 × oct-core Sun UltraSPARC T2 5140 (Niagara 2)


And Their Rooflines

Kernels: SpMV (left), LBMHD (right)

Some optimizations change arithmetic intensity

x86 systems have higher peak GFLOPs, but harder to achieve, given memory bandwidth


Performance on SpMV

Sparse matrix/vector multiply: irregular memory accesses, memory bound

Arithmetic intensity: 0.166 before memory optimization, 0.25 after

Xeon vs. Opteron:
  Similar peak FLOPS
  Xeon limited by shared FSBs and chipset

UltraSPARC/Cell vs. x86:
  20–30 vs. 75 peak GFLOPs
  More cores and memory bandwidth


Performance on LBMHD

Fluid dynamics: structured grid over time steps
  Each point: 75 FP read/write, 1300 FP ops

Arithmetic intensity: 0.70 before optimization, 1.07 after

Opteron vs. UltraSPARC: more powerful cores, not limited by memory bandwidth

Xeon vs. others: still suffers from memory bottlenecks


Achieving Performance

Compare naïve vs. optimized code

If naïve code performs well, it's easier to write high-performance code for the system

System              Kernel   Naïve GFLOPs/sec   Optimized GFLOPs/sec   Naïve as % of optimized
Intel Xeon          SpMV     1.0                1.5                    64%
                    LBMHD    4.6                5.6                    82%
AMD Opteron X4      SpMV     1.4                3.6                    38%
                    LBMHD    7.1                14.1                   50%
Sun UltraSPARC T2   SpMV     3.5                4.1                    86%
                    LBMHD    9.7                10.5                   93%
IBM Cell QS20       SpMV     not feasible       6.4                    0%
                    LBMHD    not feasible       16.7                   0%


Fallacies

"Amdahl's Law doesn't apply to parallel computers, since we can achieve linear speedup"
  But only on applications with weak scaling

"Peak performance tracks observed performance"
  Marketers like this approach!
  But compare Xeon with others in the example
  Need to be aware of bottlenecks


Pitfalls

Not developing the software to take account of a multiprocessor architecture

Example: using a single lock for a shared composite resource
  Serializes accesses, even if they could be done in parallel
  Use finer-granularity locking (see the sketch below)
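
A sketch of the finer-granularity fix: one pthread mutex per bucket of a shared table instead of a single global lock, so updates to different buckets proceed in parallel (all names are illustrative):

    #include <pthread.h>

    #define NBUCKETS 64

    struct table {
        pthread_mutex_t lock[NBUCKETS];  /* was: one global lock */
        double value[NBUCKETS];
    };

    void update(struct table *t, int key, double delta) {
        int b = key % NBUCKETS;
        pthread_mutex_lock(&t->lock[b]);   /* serializes only this bucket */
        t->value[b] += delta;
        pthread_mutex_unlock(&t->lock[b]);
    }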


Concluding Remarks

Goal: higher performance by using multiple processors

Difficulties:
  Developing parallel software
  Devising appropriate architectures

Many reasons for optimism:
  Changing software and application environment
  Chip-level multiprocessors with lower latency, higher bandwidth interconnect

An ongoing challenge for computer architects!