
Page 1:

Data-Parallel Algorithms and Multithreading

Gregory S. Johnson
johnsong@cs.utexas.edu

Page 2:

Topics

• expressing v. exploiting parallelism

• data parallel algorithms, case study: MPIRE (part 1)

• hardware multithreading, case study: MPIRE (part 2)

• Culler’s multithreading rebuttal

Page 3:

Expressing v. Exploiting Parallelism

• achieving high parallel efficiency is a two-part problem:

1. expressing the problem in terms of parallelism - doing so may require an algorithm very different from the serial version (user)

2. exploiting this parallelism on a specific machine (user + compiler)

• the “family” tree of parallel architectures is both wide and deep

• parallel APIs abound: PVM, MPI, Shmem, OpenMP, CRAFT, HPF, F--, Co-Array Fortran, ZPL

Page 4:

Data-Parallel Algorithms

Page 5:

Data Parallel Algorithms

• one of the very first MPPs is described

• SIMD machine with modest processors

• machine used as a “co-processor” hosted by a conventional workstation

• dated model of parallelism (one data element per PE, modest ratio of computation to communication)

• a few of the presented parallel algorithms are timeless

Page 6:

Array Summation

[figure: log-step summation of a 16-element array x0 .. x15; at each step, pairs of partial sums are added in parallel until the total accumulates in a single element after log2(16) = 4 steps]
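A minimal serial C sketch of the log-step pairwise summation the figure depicts; on the SIMD machine each addition within a step would run on its own PE, which the inner loop stands in for here:

#include <stdio.h>

#define N 16   /* one data element per PE, as in the figure */

int main(void)
{
    int x[N];
    for (int i = 0; i < N; i++) x[i] = i;          /* x0 .. x15 */

    /* log2(N) steps; in step s, every element whose index is a multiple
       of 2^(s+1) adds in the partial sum 2^s positions to its right */
    for (int stride = 1; stride < N; stride *= 2)
        for (int i = 0; i + stride < N; i += 2 * stride)
            x[i] += x[i + stride];                 /* pairwise sums */

    printf("sum = %d\n", x[0]);                    /* 0 + 1 + ... + 15 = 120 */
    return 0;
}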

Page 7:

Region Labeling

Page 8:

Observations

• value placed on processors is quite low and the algorithms show this

• today’s trend is toward smaller numbers of more capable processors, not larger numbers of less capable processors (interconnects are not keeping pace with CPU development)

• Thinking Machines tried to reverse course with the CM-5 (MIMD) but perhaps too late (founded 1983, folded 1994)

• interesting point that as data grows arbitrarily large, data parallelism is much more abundant than instruction level parallelism (true for scientific codes, but ...)

• “we would be by no means surprised if some of the algorithms described in this article begin to look quite ‘old-fashioned’ in the years to come”

Page 9:

Case Study: MPIRE

• software implementation of direct volume rendering algorithms including splatting and ray casting

• designed to render high quality images of multi-GB, multi-resolution volume datasets with R, G, B, α per sample

• designed to express fine grained (per pixel), medium grained (subvolume), and coarse grained (per frame) parallelism

• runs relatively efficiently on 32 and 64-bit SMPs, MPPs, COWs, clustered SMPs, and uniprocessors

Page 10:

Case Study: MPIRE

Page 11:

Case Study: MPIRE

• nebulae consist of dust and gas which emit, reflect, or obscure light from nearby stars and fluorescing gas

• Rice, Hayden, and SDSC rendered a “fly-through” of a model of the Orion nebula based on Hubble and related observational data

• total imagery required:

[figure: dome projector layout - seven projectors (1 - 7) arranged around the dome, with 0° azimuth marked]

120 seconds * 30 frames per second = 3600 frames
3600 frames * 7 projectors = 25200 images
25200 images * 1280 x 1024 pixels = 3.3 x 10^10 pixels
3.3 x 10^10 pixels * 3 bytes per pixel = 92.3 GB

Page 12:

Case Study: MPIRE

Page 13:

Case Study: MPIRE

[figure: “eye” rays cast from the image plane through the data volume; samples along each ray are composited to form pixel i, j]

color(i, j) = c0 α0 + (1 - α0)(c1 α1 + (1 - α1)(c2 α2 + (1 - α2)( ... )))
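A minimal sketch of the front-to-back compositing loop this recurrence describes, for one eye ray; GetSample() and NUM_SAMPLES are hypothetical stand-ins for MPIRE’s actual sampling code:

#include <stdio.h>

#define NUM_SAMPLES 64                   /* hypothetical samples per ray */

typedef struct { float r, g, b, a; } Sample;

/* stub: a real ray caster would resample the data volume along ray (i, j) here */
static Sample GetSample(int i, int j, int k)
{
    (void)i; (void)j;
    Sample s = { 1.0f, 0.5f, 0.2f, 0.05f * (k % 4) };
    return s;
}

/* color(i, j) = c0 a0 + (1 - a0)(c1 a1 + (1 - a1)(c2 a2 + ...)) */
static void CompositeRay(int i, int j, float rgb[3])
{
    float transparency = 1.0f;           /* running product of (1 - alpha) */
    rgb[0] = rgb[1] = rgb[2] = 0.0f;

    for (int k = 0; k < NUM_SAMPLES && transparency > 0.01f; k++) {
        Sample s = GetSample(i, j, k);
        rgb[0] += transparency * s.a * s.r;
        rgb[1] += transparency * s.a * s.g;
        rgb[2] += transparency * s.a * s.b;
        transparency *= (1.0f - s.a);    /* later samples attenuated; stop when nearly opaque */
    }
}

int main(void)
{
    float rgb[3];
    CompositeRay(0, 0, rgb);
    printf("pixel(0, 0) = %.3f %.3f %.3f\n", rgb[0], rgb[1], rgb[2]);
    return 0;
}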

Page 14:

Case Study: MPIRE

[figure: the data volume is split between two PEs; each PE casts “eye” rays through its own subvolume to produce a partial image, and the images from PE #0 and PE #1 are combined in the order of composition into a composited image on PE #0]
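A minimal sketch (not MPIRE’s actual code) of the per-pixel merge implied by the figure, assuming each PE’s partial image carries premultiplied color and accumulated opacity; the far image is composited “under” the near image in the order of composition shown:

typedef struct { float r, g, b, a; } Pixel;   /* premultiplied color + accumulated opacity */

/* merge the far PE's partial image under the near PE's, updating the near image in place */
void CompositeUnder(Pixel *near_img, const Pixel *far_img, int npixels)
{
    for (int p = 0; p < npixels; p++) {
        float t = 1.0f - near_img[p].a;       /* transparency remaining after the near image */
        near_img[p].r += t * far_img[p].r;
        near_img[p].g += t * far_img[p].g;
        near_img[p].b += t * far_img[p].b;
        near_img[p].a += t * far_img[p].a;
    }
}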

Page 15:

Hardware Multithreading

Page 16:

Preliminaries

• modern OSes commonly support threads (lightweight processes), and many vendors ship shared* memory multiprocessors

• multithreading is one method for sharing the work of a single application across multiple processors with shared memory

• software multithreading: multithreading without hardware support for storing state (PC, registers, etc.) for multiple threads simultaneously; often one thread per CPU, with reduced utilization

• hardware multithreading: hardware-level support for storing state for multiple threads, permitting fast context switches

• APIs: POSIX Threads (Pthreads - yuck!), OpenMP, vendor-specific
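A minimal OpenMP sketch of the software-multithreading model these APIs expose: the runtime splits one loop’s iterations across the threads of a shared-memory machine, and the reduction clause gives each thread a private partial sum that is combined at loop exit.

#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double x[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++) x[i] = 1.0;

    #pragma omp parallel for reduction(+:sum)   /* iterations divided among threads */
    for (int i = 0; i < N; i++)
        sum += x[i];

    printf("sum = %.1f using up to %d threads\n", sum, omp_get_max_threads());
    return 0;
}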

Page 17:

Argument for Multithreading

• a variation of Little’s Law from queuing theory gives the average number of words “in flight” between memory and processor as the product of latency and bandwidth (assume full utilization)

bandwidth = concurrency / latency

performance = concurrency / latency

[figure: concurrency plotted against latency, with lines of constant bandwidth - sustaining a given bandwidth as latency grows requires proportionally more concurrency]
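A rough worked example, using the per-processor figures quoted on the MTA slides below (~260 MHz, ~150-cycle remote latency, 2.1 GBps); the numbers are illustrative only:

150 cycles / 260 MHz ≈ 577 ns per remote reference
577 ns * 2.1 GBps ≈ 1.2 KB ≈ 150 eight-byte words “in flight”

i.e. on the order of 100+ outstanding references per processor are needed to keep the memory system busy, which is roughly what 128 hardware streams provide.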

Page 18:

Tera MTA Overview

• Tera Computer Inc. founded in 1987, MTA serial no. 001 (2+ processors) delivered to SDSC in 1997

• 128 virtual processors (streams) each with state storage

• dynamic resource allocation

• LIW instruction words and explicit-dependence lookahead

• extended C with future type identifier and statement; relatively easy to program

• tagged memory with full / empty bits

Page 19:

Tera MTA Macro Architecture

[figure: processor modules, memory modules, and IOP modules connected by a 3D toroidal mesh interconnect; each processor module runs at ~260 MHz (780 Mflops); remote memory is 1 - 4 GB per module, reached at 2.1 GBps with ~150 cycle latency]

Page 20:

Tera MTA Micro Architecture

[figure: several concurrently running programs (serial code plus parallel subtasks such as loop iterations i = 0 .. n and i = 0 .. m) are mapped onto parallel threads of computation; threads are assigned to hardware streams (128 per processor, some unallocated), and ready instructions from all streams feed a pool that drives one execution pipeline per processor]

Page 21:

Tagged Memory: Full / Empty

Serial Marching Cubes

for (i=0; i<M; i++)
  for (j=0; j<N; j++)
    ExtractAndRender(grid[i][j]);

Parallel Marching Cubes (Ordered)

sync char tag[M][N];

#pragma tera assert parallel
for (i=0; i<M; i++)
  for (j=0; j<N; j++) {
    if (i>0) readff(&tag[i-1][j]);
    if (j>0) readff(&tag[i][j-1]);
    ExtractAndRender(grid[i][j]);
    writeef(&tag[i][j], 1);
  }

[figure: the grid of cells as seen from the eye, with full / empty tags marking each cell completed, ready, or blocked waiting on a readff of a neighbour that is still empty]
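A portable sketch of the same ordered wavefront, emulating what the full / empty tags provide in hardware: busy-waiting on ordinary flags (C11 atomics) takes the place of readff / writeef, OpenMP takes the place of the Tera pragma, and ExtractAndRender is a stub:

#include <stdatomic.h>
#include <stdio.h>

#define M 8
#define N 8

static atomic_int tag[M][N];                   /* 0 = "empty", 1 = "full" */

static void ExtractAndRender(int i, int j) { printf("cell %d,%d\n", i, j); }

int main(void)
{
    /* rows are divided among threads in order; each cell spins until its
       north and west neighbours are full, does its work, then marks itself full */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++) {
            if (i > 0) while (!atomic_load(&tag[i-1][j])) ;   /* ~ readff  */
            if (j > 0) while (!atomic_load(&tag[i][j-1])) ;
            ExtractAndRender(i, j);
            atomic_store(&tag[i][j], 1);                      /* ~ writeef */
        }
    return 0;
}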

Page 22:

Related Work

• Network Processors

• SMT (Simultaneous Multithreading, Dean Tullsen)

• HyperThreading (2-way SMT on Intel Xeon and Pentium 4)

• SSMT (Simultaneous Subordinate Microthreading, Yale Patt)

• Cascade (Burton Smith et al.)

Page 23:

Case Study: MPIRE

[figure: “eye” rays cast from the image plane through the data volume, as before]

Observations:

• no dependencies exist between eye rays

• data volume traversal is view dependent

Page 24:

Case Study: MPIRE

Ray *my_ray = NULL;
...

/* raycast an image pixel-by-pixel */
#pragma tera assert parallel
#pragma tera dynamic schedule
for (index = 0; index < xsize * ysize; index++) {
  #pragma tera assert local my_ray
  my_ray = InitializeRay(my_ray, index);
  image[index] = SampleRay(my_ray, data_volume);
}

Page 25:

Case Study: MPIRE

MPIRE Rendering Times
(raycasting engine, perspective projection, 1280 x 1024 pixels, 1.5GB input, 84 features)

Processors   IBM Power3 SP    SUN HPC 10000    Cray MTA
     1       1076.74s  ---    1157.96s  ---    724.46s  ---
     2        553.99s  97%     591.33s  98%    366.73s  99%
     4        297.68s  90%     317.06s  91%    182.27s  99%
     8        165.15s  82%     168.57s  86%     92.16s  98%
    16                          93.99s  77%

A. Snavely, G. Johnson and J. Genetti, "Data Intensive Volume Visualization on the Tera MTA and Cray T3E," Proceedings of the High Performance Computing Symposium - HPC '99, Adrian Tentner (Ed.), 1999.

Page 26:

Culler’s Multithreading Rebuttal

Page 27:

Culler’s Rebuttal

• multithreaded processors view the memory / interconnect as a pipeline which contains multiple outstanding memory references “in flight”

• since latency to memory is non-uniform, certain references are returned faster than others and potentially complete out of order

• references must thus be matched (synchronized) with their issuing instructions and the corresponding thread marked “ready”

• ready threads must then be scheduled

• Culler argues that synchronization and scheduling costs limit the concurrency which can be effectively exploited in a given code

Page 28:

Culler’s Rebuttal

• given R cycles of useful work between remote references with latency L, processor utilization U is given as:

U = R / (R + L)

• however if latency is fully covered by the availability of ready threads then U can be expressed as the following, where S is the total synchronization cost (product of the number of remote references and the synchronization cost per reference):

U = R / (R + S)
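A quick illustration with hypothetical numbers (not from Culler’s paper): take R = 50 cycles of useful work, L = 150 cycles of remote latency, and S = 15 cycles of total synchronization cost per reference interval.

Without latency hiding: U = 50 / (50 + 150) = 25%
With enough ready threads to hide the latency: U = 50 / (50 + 15) ≈ 77%

so multithreading helps, but the synchronization term S (not the raw latency L) becomes the ceiling on utilization.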

Page 29:

Culler’s Rebuttal

• scheduling threads without respect to the storage hierarchy (i.e. simply scheduling the next ready thread) increases the average latency of memory references

• but as latency increases, the number of threads required to hide the latency increases, thus synchronization costs increase, thus overall utilization decreases

• a related problem is that multithreading schemes typically reduce the state per thread (and thereby the cost of switching between threads) at the cost of an increased number of remote references (again increasing synchronization costs)

Page 30:

Culler’s Rebuttal & MTA

• the number of threads should be a function of the size of the top level of the storage hierarchy (to minimize switching costs), thus the size of the top level determines the amount of latency that can be hidden

• what if (as in the case of the Tera MTA) the top level is main memory?

• codes in which computation dominates are unable to benefit from high-speed register / cache access and reuse

• indeed we’ve seen this on the Tera MTA - the machine really does favor memory intensive codes with little reuse (like MPIRE)

Page 31:

TAM

• Threaded Abstract Machine tailors thread count and thread scheduling to the storage hierarchy of the target machine

• unlike MTA threads, TAM threads each execute until completion

• threads are organized by function activation; threads within the same activation execute “near” one another (in time or space)

• TAM benefits from the migration of data up through the storage hierarchy (TAM thread scheduling maximizes reuse)

Page 32:

TAM

Page 33:

Concluding Remarks

Page 34:

Hardware Multithreaded Graphics

• Argus pipeline

• growth of bus speeds versus growth rate of CPUs and GPUs implies the distance between the CPU and GPU is growing

• perhaps a solution to the problem of building and traversing balanced spatial data structures (potentially useful for ray tracing, shadows, anti-aliasing, etc.); in other words, maybe this is a solution to matching irregular data structures to streaming processors

Page 35:

The End

© Carter Emmart