
Page 1: Asymmetric Multiprocessing

Asymmetric Multiprocessing for High Performance Computing

Stephen Jenks
Electrical Engineering and Computer Science
UC Irvine
[email protected]

Source: NVIDIA

Page 2: Asymmetric Multiprocessing

Overview and Motivation

Conventional Parallel Computing Was:

◦ For scientists with large problem sets

◦ Complex and difficult

◦ Very expensive computers with limited access

Parallel Computing is Becoming:

◦ Ubiquitous

◦ Cheap

◦ Essential

◦ Different

◦ Still complex and difficult

Where Is Parallel Computing Going?

And What are the Research Challenges?


Page 3: Asymmetric Multiprocessing

Computing is Changing

Parallel programming is essential

◦ Clock speed not increasing much

◦ Performance gains require parallelism

Parallelism is changing

◦ Special purpose parallel engines

◦ CPU and parallel engine work together

◦ Different code on CPU & parallel engine

Asymmetric Computing


Page 4: Asymmetric Multiprocessing

Conventional Processor

Architecture Hasn’t Changed Much For 40 Years

◦ Pipelining and Superscalar Since the 1960s

◦ But has become integrated Microprocessors

High Clock Speed

◦ Great performance

◦ High Power

◦ Cooling Issues

◦ Various Solutions


From Hennessy & Patterson, 2nd Ed.

Page 5: Asymmetric Multiprocessing

Parallel Computing Problem Overview


Image Relaxation (Blur)

newimage[i][j] = (image[i][j] +

image[i][j-1] + image[i][j+1] +

image[i+1][j] + image[i-1][j]) / 5

Stencil
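A minimal C sketch of this relaxation stencil (the image dimensions W and H, the float pixel type, and the function name are illustrative assumptions): each interior pixel of newimage becomes the average of the corresponding pixel in image and its four neighbors.

/* Relaxation (blur) stencil over the interior of an image.
   W, H, and the float pixel type are assumptions for illustration. */
#define W 1024
#define H 1024

void relax(const float image[H][W], float newimage[H][W])
{
    for (int i = 1; i < H - 1; i++)
        for (int j = 1; j < W - 1; j++)
            newimage[i][j] = (image[i][j] +
                              image[i][j-1] + image[i][j+1] +
                              image[i+1][j] + image[i-1][j]) / 5.0f;
}

Partitioning the i loop gives each CPU a band of rows; rows at partition boundaries read neighbors owned by another CPU, which is why the shared memory on the next slide makes the dependences easy to satisfy.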

Page 6: Asymmetric Multiprocessing

Shared Memory Multiprocessors


[Figure: four CPUs, each with a private cache, connected to a shared memory]

• Each CPU computes results for its partition

• Memory is shared so dependences satisfied

CPUs see common address space

Page 7: Asymmetric Multiprocessing

Shared Memory Programming

Threads

◦ POSIX Threads

◦ Windows Threads

OpenMP

◦ User-inserted directives to compiler

◦ Loop parallelism

◦ Parallel regions

◦ Visual Studio

◦ GCC 4.2


Not Integrated with Compiler or Language

No idea if code is in thread or not

Poor optimizations

#pragma omp parallel for

for (i=0; i<n; i++)

a[i] = b[i] * c[i];
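For comparison, a minimal POSIX Threads sketch of the same loop (NTHREADS, the fixed array size N, and the worker function are illustrative assumptions): each thread is handed an explicit slice of the iteration space, which is the bookkeeping the single OpenMP pragma above generates automatically.

#include <pthread.h>

#define N        1000000
#define NTHREADS 4

static float a[N], b[N], c[N];

/* each thread multiplies its contiguous slice of b and c into a */
static void *worker(void *arg)
{
    long t     = (long)arg;
    long begin = t * (long)N / NTHREADS;
    long end   = (t + 1) * (long)N / NTHREADS;
    for (long i = begin; i < end; i++)
        a[i] = b[i] * c[i];
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, (void *)t);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    return 0;
}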

Page 8: Asymmetric Multiprocessing

Multicore Processors

Several CPU Cores Per Chip

Shared Memory

Shared Caches (sometimes)

Lower Clock Speed

◦ Lower Power & Heat

◦ But Good Performance

Program with Threads

Single Threaded Code

◦ Not Faster

◦ Majority of Code Today


Intel Core Duo

AMD Athlon 64 X2

Page 9: Asymmetric Multiprocessing

Conventional Processors are Dinosaurs

So much circuitry dedicated to keeping ALUs fed:

◦ Cache

◦ Out-of-order execution/reorder buffer

◦ Branch prediction

◦ Large Register Sets

◦ Simultaneous Multithreading

ALU tiny by comparison

Huge power for little performance gain

With thanks to Stanford's Pat Hanrahan for the analogy

Page 10: Asymmetric Multiprocessing

AMD Phenom X4 Floorplan


Source: AMD.com

Page 11: Asymmetric Multiprocessing

Single Nehalem (Intel Core i7) Core


Nehalem - Everything You Need to Know about Intel's New Architecture

http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3382

Page 12: Asymmetric Multiprocessing

NVIDIA GPU Floorplan


Source: Dr. Sumit Gupta - NVIDIA

Page 13: Asymmetric Multiprocessing

Asymmetric Parallel Accelerators

Current Cores are Powerful and General

◦ Some Applications Only Need Certain Operations

◦ Perhaps a Simpler Processor Could Be Faster

Pair General CPUs with Specialized Accelerators

◦ Graphics Processing Unit

◦ Field Programmable Gate Array (FPGA)

◦ Single Instruction, Multiple Data (SIMD) Processor

[Figure: possible hybrid AMD multi-core design, pairing an Athlon 64 CPU and an ATI GPU through a crossbar (XBAR) with HyperTransport and a memory controller]

Page 14: Asymmetric Multiprocessing

Graphics Processing Unit (GPU)

GPUs Do Pixel Pushing and Matrix Math

From NVIDIA CUDA (Compute Unified Device Architecture) Programming Guide, 11/29/2007

Page 15: Asymmetric Multiprocessing

CUDA Programming Model

Data Parallel

◦ But not Loop Parallel

◦ Very Lightweight Threads

◦ Write Code from Thread's Point of View

No Shared Memory

◦ Host Copies Data To and From Device

◦ Different Memories

Hundreds of Parallel Threads (Sort-of SIMD)

[Figure: a block of threads]
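Because host and device memories are separate, a CUDA host program explicitly allocates device buffers, copies inputs across, launches a kernel over many lightweight threads, and copies results back. A minimal sketch, assuming a simple element-wise kernel (scale_kernel, the 256-thread block size, and the missing error checking are illustrative simplifications, not the FDTD code shown later):

#include <cuda_runtime.h>

// One lightweight thread per element; the code is written from the thread's point of view.
__global__ void scale_kernel(float *d_data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_data[i] *= factor;
}

void scale_on_device(float *h_data, int n)
{
    float *d_data;
    size_t bytes = n * sizeof(float);

    cudaMalloc((void **)&d_data, bytes);                        // device buffer
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // host -> device

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale_kernel<<<blocks, threads>>>(d_data, 2.0f, n);         // launch a grid of blocks

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  // device -> host
    cudaFree(d_data);
}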

Page 16: Asymmetric Multiprocessing

Finite Difference, Time Domain

(FDTD) Electromagnetic Simulation


Fields stored in 6 3D Arrays:

EX, EY, EZ, HX, HY, HZ

2 other coefficient arrays

O(N³) algorithm

/* update magnetic field */
for (i = 1; i < nx - 1; i++)
    for (j = 1; j < ny - 1; j++) { /* combining Y loop */
        for (k = 1; k < nz - 1; k++) {
            ELEM_TYPE invmu = 1.0f/AR_INDEX(mu,i,j,k);
            ELEM_TYPE tmpx = rx*invmu;
            ELEM_TYPE tmpy = ry*invmu;
            ELEM_TYPE tmpz = rz*invmu;
            AR_INDEX(hx,i,j,k) += tmpz * (AR_INDEX(ey,i,j,k+1) - AR_INDEX(ey,i,j,k)) -
                                  tmpy * (AR_INDEX(ez,i,j+1,k) - AR_INDEX(ez,i,j,k));
            AR_INDEX(hy,i,j,k) += tmpx * (AR_INDEX(ez,i+1,j,k) - AR_INDEX(ez,i,j,k)) -
                                  tmpz * (AR_INDEX(ex,i,j,k+1) - AR_INDEX(ex,i,j,k));
            AR_INDEX(hz,i,j,k) += tmpy * (AR_INDEX(ex,i,j+1,k) - AR_INDEX(ex,i,j,k)) -
                                  tmpx * (AR_INDEX(ey,i+1,j,k) - AR_INDEX(ey,i,j,k));
        }

        /* now update the electric field */
        for (k = 1; k < nz - 1; k++) {
            ELEM_TYPE invep = 1.0f/AR_INDEX(ep,i,j,k);
            ELEM_TYPE tmpx = rx*invep;
            ELEM_TYPE tmpy = ry*invep;
            ELEM_TYPE tmpz = rz*invep;
            AR_INDEX(ex,i,j,k) += tmpy * (AR_INDEX(hz,i,j,k) - AR_INDEX(hz,i,j-1,k)) -
                                  tmpz * (AR_INDEX(hy,i,j,k) - AR_INDEX(hy,i,j,k-1));
            AR_INDEX(ey,i,j,k) += tmpz * (AR_INDEX(hx,i,j,k) - AR_INDEX(hx,i,j,k-1)) -
                                  tmpx * (AR_INDEX(hz,i,j,k) - AR_INDEX(hz,i-1,j,k));
            AR_INDEX(ez,i,j,k) += tmpx * (AR_INDEX(hy,i,j,k) - AR_INDEX(hy,i-1,j,k)) -
                                  tmpy * (AR_INDEX(hx,i,j,k) - AR_INDEX(hx,i,j-1,k));
        }
    }

Page 17: Asymmetric Multiprocessing

OpenMP FDTD


/* update magnetic field */
#pragma omp parallel for private(i,j,k) \
    shared(hx,hy,hz,ex,ey,ez)
for (i = 1; i < nx - 1; i++)
    for (j = 1; j < ny - 1; j++) {
        for (k = 1; k < nz - 1; k++) {
            ELEM_TYPE invmu = 1.0f/AR_INDEX(mu,i,j,k);
            ELEM_TYPE tmpx = rx*invmu;
            ELEM_TYPE tmpy = ry*invmu;
            ELEM_TYPE tmpz = rz*invmu;
            AR_INDEX(hx,i,j,k) += tmpz * (AR_INDEX(ey,i,j,k+1) - AR_INDEX(ey,i,j,k)) -
                                  tmpy * (AR_INDEX(ez,i,j+1,k) - AR_INDEX(ez,i,j,k));
            AR_INDEX(hy,i,j,k) += tmpx * (AR_INDEX(ez,i+1,j,k) - AR_INDEX(ez,i,j,k)) -
                                  tmpz * (AR_INDEX(ex,i,j,k+1) - AR_INDEX(ex,i,j,k));
            AR_INDEX(hz,i,j,k) += tmpy * (AR_INDEX(ex,i,j+1,k) - AR_INDEX(ex,i,j,k)) -
                                  tmpx * (AR_INDEX(ey,i+1,j,k) - AR_INDEX(ey,i,j,k));
        }
    }

#pragma omp parallel for private(i,j,k) \
    shared(hx,hy,hz,ex,ey,ez)

Machine     Sequential (s)   OpenMP (s)   Speedup
Core2Quad   26.7             8.8          3.0
PS3         28.6             26.5         1.1

Because of Dependences,

Need to Split E & M Loops

Page 18: Asymmetric Multiprocessing

CUDA FDTD


// Block index
int bx = blockIdx.x;

// because CUDA does not support 3D grids, we'll use the y block value to
// fake it. Unfortunately, this needs mod and divide (or a lookup table).
int by, bz;
by = blockIdx.y / gridZ;
bz = blockIdx.y % gridZ;

// Thread index
int tx = threadIdx.x;
int ty = threadIdx.y;
int tz = threadIdx.z;

// this thread's indices in the global matrix (skip outer boundaries)
int i = bx * THREAD_COUNT_X + tx + 1;
int j = by * THREAD_COUNT_Y + ty + 1;
int k = bz * THREAD_COUNT_Z + tz + 1;

At this point, the thread knows where it maps in problem space.

Predefined variables provide the thread's position in its block and the block's position in the grid.
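The decoding above implies a launch configuration in which the grid's y dimension packs both the y and z block counts, since (as the comment notes) CUDA grids are only two-dimensional here. A plausible sketch, assuming the interior dimensions divide evenly by the block shape (the THREAD_COUNT_* values and the kernel body are illustrative, not the presenter's actual code):

#define THREAD_COUNT_X 16   /* assumed block shape for illustration */
#define THREAD_COUNT_Y 16
#define THREAD_COUNT_Z 1

__global__ void fdtd_kernel(int gridZ /* , device field pointers ... */)
{
    int by = blockIdx.y / gridZ;   /* recovered y block index */
    int bz = blockIdx.y % gridZ;   /* recovered z block index */
    (void)by; (void)bz;            /* the real kernel indexes the field arrays here */
}

void launch_fdtd(int nx, int ny, int nz)
{
    int gridX = (nx - 2) / THREAD_COUNT_X;   /* interior points only */
    int gridY = (ny - 2) / THREAD_COUNT_Y;
    int gridZ = (nz - 2) / THREAD_COUNT_Z;

    dim3 block(THREAD_COUNT_X, THREAD_COUNT_Y, THREAD_COUNT_Z);
    dim3 grid(gridX, gridY * gridZ);         /* fold the y and z block counts into grid.y */
    fdtd_kernel<<<grid, block>>>(gridZ);
}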

Page 19: Asymmetric Multiprocessing

CUDA FDTD (cont.)


float invep = 1.0f/AR_INDEX(d_ep,i,j,k);
float tmpx = d_rx*invep;
float tmpy = d_ry*invep;
float tmpz = d_rz*invep;
float* ex = d_ex; float* ey = d_ey; float* ez = d_ez;

AR_INDEX(ex,i,j,k) += tmpy*(AR_INDEX(d_hz,i,j,k)-AR_INDEX(d_hz,i,j-1,k))-
                      tmpz*(AR_INDEX(d_hy,i,j,k)-AR_INDEX(d_hy,i,j,k-1));
AR_INDEX(ey,i,j,k) += tmpz*(AR_INDEX(d_hx,i,j,k)-AR_INDEX(d_hx,i,j,k-1))-
                      tmpx*(AR_INDEX(d_hz,i,j,k)-AR_INDEX(d_hz,i-1,j,k));
AR_INDEX(ez,i,j,k) += tmpx*(AR_INDEX(d_hy,i,j,k)-AR_INDEX(d_hy,i-1,j,k))-
                      tmpy*(AR_INDEX(d_hx,i,j,k)-AR_INDEX(d_hx,i,j-1,k));

Code looks normal, but:

Only one “iteration”

Arrays reside in device global memory

Machine            Code Style   Time (s)   Speedup
Core2Quad          Sequential   26.5       1
Core2Quad          OpenMP       8.8        3.0
GeForce 9800GX2    CUDA         7.4        3.6

Page 20: Asymmetric Multiprocessing

Better CUDA FDTD

Problem: CUDA Memory Accesses Slow

Problem: Not Much Work Done Per Thread

Idea 1: Copy Data to Fast Shared Memory

◦ Allows Reuse By Neighbor Threads

◦ Takes Advantage of Coalesced Memory Accesses

Idea 2: Compute Multiple Elements Per Thread

◦ But Each Block Only Gets 16KB Shared Space

◦ Barely Enough for 2 Points at 1x16x16 Threads

Machine            Code Style    Time (s)   Speedup
Core2Quad          Sequential    26.5       1
Core2Quad          OpenMP        8.8        3.0
GeForce 9800GX2    CUDA          7.4        3.6
GeForce 9800GX2    CUDA faster   6.5        4.1
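A minimal sketch of Idea 1 only, for a single (i,j) slice of one field: each thread stages its own hz element into on-chip shared memory, edge threads also stage a one-element halo, and the neighbor differences are then read from shared memory instead of repeated global loads. The layout macro IDX3, the block shape, and the kernel itself are assumptions for illustration, not the presenter's kernel, and the grid is assumed to cover the interior exactly.

#define TCX 16                              /* assumed block shape */
#define TCY 16
#define IDX3(a, i, j, k) ((a)[((size_t)(i) * ny + (j)) * nz + (k)])  /* assumed layout */

__global__ void hz_difference_tile(const float *d_hz, float *d_out,
                                   int ny, int nz, int k)
{
    __shared__ float hz_tile[TCY + 1][TCX + 1];

    int i = blockIdx.x * TCX + threadIdx.x + 1;   /* skip outer boundary */
    int j = blockIdx.y * TCY + threadIdx.y + 1;

    /* every thread stages its own element; edge threads also stage the halo */
    hz_tile[threadIdx.y + 1][threadIdx.x + 1] = IDX3(d_hz, i, j, k);
    if (threadIdx.y == 0)
        hz_tile[0][threadIdx.x + 1] = IDX3(d_hz, i, j - 1, k);
    if (threadIdx.x == 0)
        hz_tile[threadIdx.y + 1][0] = IDX3(d_hz, i - 1, j, k);
    __syncthreads();

    /* the (i,j)-(i,j-1) and (i,j)-(i-1,j) differences used by the E update
       now come from shared memory, which neighbor threads reuse */
    float center = hz_tile[threadIdx.y + 1][threadIdx.x + 1];
    float diff_j = center - hz_tile[threadIdx.y][threadIdx.x + 1];
    float diff_i = center - hz_tile[threadIdx.y + 1][threadIdx.x];
    IDX3(d_out, i, j, k) = diff_j - diff_i;       /* placeholder combination */
}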

Page 21: Asymmetric Multiprocessing

Cell Broadband Engine

PowerPC Processing Element with Simultaneous

Multithreading at 3.2 GHz

8 Synergistic Processing Elements at 3.2 GHz

◦ Optimized for SIMD/Vector processing (100 GFLOPS Total)

◦ 256KB Local Storage - no cache

4x16-byte-wide rings @ 96 bytes per clock cycle

From IBM Cell Broadband Engine Programmer Handbook, 10 May 2006

Page 22: Asymmetric Multiprocessing

Cell Programming Model

New IBM Single Source Compiler Supports OpenMP

PPE & SPEs Run Separate Programs

◦ Programmed in C

◦ Explicitly Move Memory

◦ Hard to Debug SPE Code

◦ SPE Good at Arithmetic, not Branching

◦ PPE & SPEs Communicate via Mailboxes & Memory

◦ SPEs Communicate via Signals & Memory

Example SPE Code:

struct dma_list_elem inList[DMA_LIST_SIZE]
    __attribute__ ((aligned (8)));
struct dma_list_elem outList[DMA_LIST_SIZE]
    __attribute__ ((aligned (8)));

void SendSignal1(program_data* destPD, int value)
{
    sig1_local.SPU_Sig_Notify_1 = value;
    mfc_sndsig(&(sig1_local.SPU_Sig_Notify_1),
               destPD->sig1_addr, 3 /* tag */,
               0, 0);
    mfc_sync(1<<3);
}

Page 23: Asymmetric Multiprocessing

Research Agenda & Future Directions

Programming and Performance Being Pushed Back to Software Developers

◦ Big Problem!

◦ Software Less Well Tested/Developed Than Hardware

◦ Architecture Affects Performance In Strange Ways

◦ Compiler Solutions (IBM SSC) Don’t Always Work Well

Extract Top Performance From Multicore CPUs

Make Multicore Processors Better to Program

Explore & Exploit Asymmetric Computing


Page 24: Asymmetric Multiprocessing

Parallel Microprocessor Problems

Memory interface too slow for 1 core/thread

Now multiple threads access memory simultaneously, overwhelming memory interface

Parallel programs can run as slowly as sequential ones!


[Figure: Then, a single CPU with its L2 cache and system/memory interface to memory; Now, two CPUs share the same L2 cache and system/memory interface]

Page 25: Asymmetric Multiprocessing

SPPM: Producer/Consumer Parallelism Using The Cache

[Figure: with conventional data partitioning, Thread 1 and Thread 2 each do half the work on data in memory and both hit the memory bottleneck; with SPPM, a producer thread and a consumer thread communicate through the cache]
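A minimal C/pthreads sketch of the producer/consumer idea (not the SPPM runtime itself; NBLOCKS, BLOCK, and the two processing stages are illustrative): the producer fills one block at a time and publishes it, and the consumer processes each block as soon as it is published, while that block is still likely to be resident in the shared cache.

#include <pthread.h>
#include <stdio.h>

#define NBLOCKS 64
#define BLOCK   4096

static float data[NBLOCKS][BLOCK];
static int   ready = 0;                  /* number of blocks produced so far */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

static void *producer(void *arg)
{
    for (int b = 0; b < NBLOCKS; b++) {
        for (int i = 0; i < BLOCK; i++)  /* stage 1: produce the block */
            data[b][i] = (float)(b + i) * 0.5f;
        pthread_mutex_lock(&lock);
        ready = b + 1;                   /* publish it to the consumer */
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

static void *consumer(void *arg)
{
    double sum = 0.0;
    for (int b = 0; b < NBLOCKS; b++) {
        pthread_mutex_lock(&lock);
        while (ready <= b)               /* wait until block b is published */
            pthread_cond_wait(&cond, &lock);
        pthread_mutex_unlock(&lock);
        for (int i = 0; i < BLOCK; i++)  /* stage 2: consume it while cached */
            sum += data[b][i];
    }
    printf("sum = %f\n", sum);
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}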

Page 26: Asymmetric Multiprocessing

Multicore Architecture Improvements

Cores in Common Multicore Chips Not Well Connected

◦ Communicate Through Cache & Memory

◦ Synchronization Slow (OS-based) or Uses Spin Waits

Register-Based Synchronization

◦ Shared Registers Between Cores

◦ Halt Processor While Waiting To Save Power

Preemptive Communications (Prepushing)

◦ Reduces Latency over Demand-Based Fetches

◦ Cuts Cache Coherence Traffic/Activity

Software Controlled Eviction

◦ Manages Shared Caches Using Explicit Operations

◦ Move Data to Shared Cache Before Needed From Private Cache

Synchronization Engine

◦ Hardware-based multicore synchronization operations


Page 27: Asymmetric Multiprocessing

Asymmetric Multicore Processors

Intel & AMD Likely to Add GPUs to CPUs in Multicore Processors

Connectivity Better than PCIe

Cache Access Possible

◦ Shared Memory Likely

◦ Programming Slightly Better

Multiple CPU Classes Possible

◦ Fast & Powerful for Some Apps

◦ Simple & Power Efficient for Most

Thoughts:

◦ Not many multithreaded apps, so don't need many CPUs

◦ Application accelerators speed up multimedia, etc.

◦ Probably Cost Effective to Integrate CPU & GPU

[Figure: possible hybrid AMD multicore design, as on page 13, with an Athlon 64 CPU and ATI GPU sharing a crossbar, HyperTransport, and memory controller]

Page 28: Asymmetric Multiprocessing

FPGA Application Accelerators

Cray and SGI Add FPGA Application Accelerators to High End Parallel Machines

◦ FPGA Code Created Through VHDL or Other Compiler

◦ Can Speed Up Some Applications

My Thoughts

◦ FPGA Great for Prototyping

◦ General & Flexible

◦ But Have Slow Clock Speed

◦ So Will Be Quickly Surpassed By GPUs & Special Purpose CPUs


Page 29: Asymmetric Multiprocessing

Intel Larrabee

[Figure: Larrabee floorplan, with many simple x86 cores, each with a slice of coherent L2 cache, connected by an interprocessor ring network to the memory & I/O interfaces]

Many simple, fast, low power, in-order x86 cores

Larrabee: A Many-Core x86 Architecture for Visual Computing

ACM Transactions on Graphics, Vol. 27, No. 3, Article 18, Publication date: August 2008.


Page 30: Asymmetric Multiprocessing

OpenCL

New standard pushed by Apple & others (all major players)

Data Parallel

◦ N-Dimensional computational domain

◦ Work-groups – similar to CUDA block

◦ Work-items – similar to CUDA threads

Task Parallel

◦ Similar to threads, but single work-item

Supports GPU, multicore, Cell, etc.
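A minimal sketch of an OpenCL kernel for the same element-wise loop used in the OpenMP example earlier (the kernel and argument names are illustrative): each work-item computes one element of the N-dimensional range, and the host groups work-items into work-groups when it enqueues the kernel.

// OpenCL C: one work-item per element of a 1-D computational domain.
__kernel void vec_mul(__global const float *b,
                      __global const float *c,
                      __global float *a,
                      const unsigned int n)
{
    size_t i = get_global_id(0);   /* this work-item's index in the global range */
    if (i < n)                     /* guard: the global range may be rounded up  */
        a[i] = b[i] * c[i];
}

The same kernel source can be built for a GPU, a multicore CPU, or the Cell by selecting a different OpenCL device, which is the portability the standard aims for.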


Page 31: Asymmetric Multiprocessing

Summary

Parallelism Big Resurgence

◦ Ubiquitous

◦ Still Difficult (Programming Challenge Worries Intel, Microsoft, & IBM)

New Types Of Parallelism Becoming Widespread

◦ Multicore Processors: But They Suffer From Bottlenecks And Are Not Well Integrated

◦ GPU Processing: Hard to Program, Especially Hard to Get Good Performance

◦ Asymmetric Processor Cores: Hard to Program and Manage, Also Especially Hard to Get Good Performance

Many Architecture, Hardware, and Software Research Challenges


Page 32: Asymmetric Multiprocessing

Resources

http://www.ibm.com/developerworks/power/cell - Cell Tools & Docs

http://www.nvidia.com/object/cuda_home.html - CUDA Tools & Docs

http://spds.ece.uci.edu/~sjenks/Pages/ParallelMicros.html - SPPM & Parallel Execution Models

http://web.mac.com/sfj/iWeb/Site/Welcome.html - Old Parallel Programming Blog (Source code for FDTD Examples)

http://newport.eecs.uci.edu/~sjenks - My homepage, where this presentation is posted
