
Page 1:

Data-Parallel Algorithms and Multithreading

Gregory S. Johnson
johnsong@cs.utexas.edu

Page 2:

Topics

• expressing v. exploiting parallelism

• data parallel algorithms, case study: MPIRE (part 1)

• hardware multithreading, case study: MPIRE (part 2)

• Culler’s multithreading rebuttal

Page 3:

Expressing v. Exploiting Parallelism

• achieving high parallel efficiency is a two-part problem:

1. expressing the problem in terms of parallelism - doing so may require an algorithm very different from the serial version (user)

2. exploiting this parallelism on a specific machine (user + compiler)

• the “family” tree of parallel architectures is both wide and deep

• parallel APIs abound: PVM, MPI, Shmem, OpenMP, CRAFT, HPF, F--, Co-Array Fortran, ZPL

Page 4:

Data-Parallel Algorithms

Page 5:

Data Parallel Algorithms

• one of the very first MPPs is described

• SIMD machine with modest processors

• machine used as a “co-processor” hosted by a conventional workstation

• dated model of parallelism (one data element per PE, modest ratio of computation to communication)

• a few of the presented parallel algorithms are timeless

Page 6:

Array Summation

[figure: log-step summation of a 16-element array x0 .. x15; at each step, pairs of partial sums are added in parallel until the total accumulates in a single element after log2(16) = 4 steps]
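A minimal serial C sketch of the log-step pairwise summation the figure depicts; on the SIMD machine each addition within a step would run on its own PE, which the inner loop stands in for here:

#include <stdio.h>

#define N 16   /* one data element per PE, as in the figure */

int main(void)
{
    int x[N];
    for (int i = 0; i < N; i++) x[i] = i;          /* x0 .. x15 */

    /* log2(N) steps; in step s, every element whose index is a multiple
       of 2^(s+1) adds in the partial sum 2^s positions to its right */
    for (int stride = 1; stride < N; stride *= 2)
        for (int i = 0; i + stride < N; i += 2 * stride)
            x[i] += x[i + stride];                 /* pairwise sums */

    printf("sum = %d\n", x[0]);                    /* 0 + 1 + ... + 15 = 120 */
    return 0;
}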

Page 7:

Region Labeling

Page 8:

Observations

• value placed on processors is quite low and the algorithms show this

• today’s trend is toward smaller numbers of more capable processors, not larger numbers of less capable processors (interconnects are not keeping pace with CPU development)

• Thinking Machines tried to reverse course with the CM-5 (MIMD) but perhaps too late (founded 1983, folded 1994)

• interesting point that as data grows arbitrarily large, data parallelism is much more abundant than instruction level parallelism (true for scientific codes, but ...)

• “we would be by no means surprised if some of the algorithms described in this article begin to look quite ‘old-fashioned’ in the years to come”

Page 9:

Case Study: MPIRE

• software implementation of direct volume rendering algorithms including splatting and ray casting

• designed to render high quality images of multi-GB, multi-resolution volume datasets with R, G, B, α per sample

• designed to express fine grained (per pixel), medium grained (subvolume), and coarse grained (per frame) parallelism

• runs relatively efficiently on 32 and 64-bit SMPs, MPPs, COWs, clustered SMPs, and uniprocessors

Page 10:

Case Study: MPIRE

Page 11:

Case Study: MPIRE

• nebulae consist of dust and gas which emit, reflect, or obscure light from nearby stars and fluorescing gas

• Rice, Hayden, and SDSC rendered a “fly-through” of a model of the Orion nebula based on Hubble and related observational data

• total imagery required:

[figure: dome projector layout - seven projectors (1 - 7) arranged around the dome, with 0° azimuth marked]

120 seconds * 30 frames per second = 3600 frames
3600 frames * 7 projectors = 25200 images
25200 images * 1280 x 1024 pixels = 3.3 x 10^10 pixels
3.3 x 10^10 pixels * 3 bytes per pixel = 92.3 GB

Page 12:

Case Study: MPIRE

Page 13:

Case Study: MPIRE

[figure: “eye” rays cast from the image plane through the data volume; samples along each ray are composited to form pixel i, j]

color(i, j) = c0 α0 + (1 - α0)(c1 α1 + (1 - α1)(c2 α2 + (1 - α2)( ... )))
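A minimal sketch of the front-to-back compositing loop this recurrence describes, for one eye ray; GetSample() and NUM_SAMPLES are hypothetical stand-ins for MPIRE’s actual sampling code:

#include <stdio.h>

#define NUM_SAMPLES 64                   /* hypothetical samples per ray */

typedef struct { float r, g, b, a; } Sample;

/* stub: a real ray caster would resample the data volume along ray (i, j) here */
static Sample GetSample(int i, int j, int k)
{
    (void)i; (void)j;
    Sample s = { 1.0f, 0.5f, 0.2f, 0.05f * (k % 4) };
    return s;
}

/* color(i, j) = c0 a0 + (1 - a0)(c1 a1 + (1 - a1)(c2 a2 + ...)) */
static void CompositeRay(int i, int j, float rgb[3])
{
    float transparency = 1.0f;           /* running product of (1 - alpha) */
    rgb[0] = rgb[1] = rgb[2] = 0.0f;

    for (int k = 0; k < NUM_SAMPLES && transparency > 0.01f; k++) {
        Sample s = GetSample(i, j, k);
        rgb[0] += transparency * s.a * s.r;
        rgb[1] += transparency * s.a * s.g;
        rgb[2] += transparency * s.a * s.b;
        transparency *= (1.0f - s.a);    /* later samples attenuated; stop when nearly opaque */
    }
}

int main(void)
{
    float rgb[3];
    CompositeRay(0, 0, rgb);
    printf("pixel(0, 0) = %.3f %.3f %.3f\n", rgb[0], rgb[1], rgb[2]);
    return 0;
}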

Page 14:

Case Study: MPIRE

[figure: the data volume is split between two PEs; each PE casts “eye” rays through its own subvolume to produce a partial image, and the images from PE #0 and PE #1 are combined in the order of composition into a composited image on PE #0]
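A minimal sketch (not MPIRE’s actual code) of the per-pixel merge implied by the figure, assuming each PE’s partial image carries premultiplied color and accumulated opacity; the far image is composited “under” the near image in the order of composition shown:

typedef struct { float r, g, b, a; } Pixel;   /* premultiplied color + accumulated opacity */

/* merge the far PE's partial image under the near PE's, updating the near image in place */
void CompositeUnder(Pixel *near_img, const Pixel *far_img, int npixels)
{
    for (int p = 0; p < npixels; p++) {
        float t = 1.0f - near_img[p].a;       /* transparency remaining after the near image */
        near_img[p].r += t * far_img[p].r;
        near_img[p].g += t * far_img[p].g;
        near_img[p].b += t * far_img[p].b;
        near_img[p].a += t * far_img[p].a;
    }
}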

Page 15:

Hardware Multithreading

Page 16:

Preliminaries

• modern OSes commonly support threads (lightweight processes), and many vendors ship shared* memory multiprocessors

• multithreading is one method for sharing the work of a single application across multiple processors with shared memory

• software multithreading: multithreading without hardware support for storing state (PC, registers, etc.) for multiple threads simultaneously; often one thread per CPU, with reduced utilization

• hardware multithreading: hardware-level support for storing state for multiple threads, permitting fast context switches

• APIs: POSIX Threads (Pthreads - yuck!), OpenMP, vendor-specific
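A minimal OpenMP sketch of the software-multithreading model these APIs expose: the runtime splits one loop’s iterations across the threads of a shared-memory machine, and the reduction clause gives each thread a private partial sum that is combined at loop exit.

#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double x[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++) x[i] = 1.0;

    #pragma omp parallel for reduction(+:sum)   /* iterations divided among threads */
    for (int i = 0; i < N; i++)
        sum += x[i];

    printf("sum = %.1f using up to %d threads\n", sum, omp_get_max_threads());
    return 0;
}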

Page 17:

Argument for Multithreading

• a variation of Little’s Law from queuing theory gives the average number of words “in flight” between memory and processor as the product of latency and bandwidth (assume full utilization)

bandwidth = concurrency / latency

performance = concurrency / latency

[figure: concurrency plotted against latency, with lines of constant bandwidth - sustaining a given bandwidth as latency grows requires proportionally more concurrency]
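A rough worked example, using the per-processor figures quoted on the MTA slides below (~260 MHz, ~150-cycle remote latency, 2.1 GBps); the numbers are illustrative only:

150 cycles / 260 MHz ≈ 577 ns per remote reference
577 ns * 2.1 GBps ≈ 1.2 KB ≈ 150 eight-byte words “in flight”

i.e. on the order of 100+ outstanding references per processor are needed to keep the memory system busy, which is roughly what 128 hardware streams provide.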

Page 18:

Tera MTA Overview

• Tera Computer Inc. founded in 1987, MTA serial no. 001 (2+ processors) delivered to SDSC in 1997

• 128 virtual processors (streams) each with state storage

• dynamic resource allocation

• LIW instruction words and explicit-dependence lookahead

• extended C with future type identifier and statement; relatively easy to program

• tagged memory with full / empty bits

Page 19:

Tera MTA Macro Architecture

[figure: processor modules, memory modules, and IOP modules connected by a 3D toroidal mesh interconnect; each processor module runs at ~260 MHz (780 Mflops); remote memory is 1 - 4 GB per module, reached at 2.1 GBps with ~150 cycle latency]

Page 20:

Tera MTA Micro Architecture

[figure: several concurrently running programs (serial code plus parallel subtasks such as loop iterations i = 0 .. n and i = 0 .. m) are mapped onto parallel threads of computation; threads are assigned to hardware streams (128 per processor, some unallocated), and ready instructions from all streams feed a pool that drives one execution pipeline per processor]

Page 21:

Tagged Memory: Full / Empty

Serial Marching Cubes

for (i=0; i<M; i++)
  for (j=0; j<N; j++)
    ExtractAndRender(grid[i][j]);

Parallel Marching Cubes (Ordered)

sync char tag[M][N];

#pragma tera assert parallel
for (i=0; i<M; i++)
  for (j=0; j<N; j++) {
    if (i>0) readff(&tag[i-1][j]);
    if (j>0) readff(&tag[i][j-1]);
    ExtractAndRender(grid[i][j]);
    writeef(&tag[i][j], 1);
  }

[figure: the grid of cells as seen from the eye, with full / empty tags marking each cell completed, ready, or blocked waiting on a readff of a neighbour that is still empty]
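A portable sketch of the same ordered wavefront, emulating what the full / empty tags provide in hardware: busy-waiting on ordinary flags (C11 atomics) takes the place of readff / writeef, OpenMP takes the place of the Tera pragma, and ExtractAndRender is a stub:

#include <stdatomic.h>
#include <stdio.h>

#define M 8
#define N 8

static atomic_int tag[M][N];                   /* 0 = "empty", 1 = "full" */

static void ExtractAndRender(int i, int j) { printf("cell %d,%d\n", i, j); }

int main(void)
{
    /* rows are divided among threads in order; each cell spins until its
       north and west neighbours are full, does its work, then marks itself full */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++) {
            if (i > 0) while (!atomic_load(&tag[i-1][j])) ;   /* ~ readff  */
            if (j > 0) while (!atomic_load(&tag[i][j-1])) ;
            ExtractAndRender(i, j);
            atomic_store(&tag[i][j], 1);                      /* ~ writeef */
        }
    return 0;
}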

Page 22:

Related Work

• Network Processors

• SMT (Simultaneous Multithreading, Dean Tullsen)

• HyperThreading (2-way SMT on Intel Xeon and Pentium 4)

• SSMT (Simultaneous Subordinate Microthreading, Yale Patt)

• Cascade (Burton Smith et al.)

Page 23:

Case Study: MPIRE

[figure: “eye” rays cast from the image plane through the data volume, as before]

Observations:

• no dependencies exist between eye rays

• data volume traversal is view dependent

Page 24:

Case Study: MPIRE

Ray *my_ray = NULL;
...

/* raycast an image pixel-by-pixel */
#pragma tera assert parallel
#pragma tera dynamic schedule
for (index = 0; index < xsize * ysize; index++) {
  #pragma tera assert local my_ray
  my_ray = InitializeRay(my_ray, index);
  image[index] = SampleRay(my_ray, data_volume);
}

Page 25:

Case Study: MPIRE

MPIRE Rendering Times
(raycasting engine, perspective projection, 1280 x 1024 pixels, 1.5GB input, 84 features)

Processors   IBM Power3 SP    SUN HPC 10000    Cray MTA
     1       1076.74s  ---    1157.96s  ---    724.46s  ---
     2        553.99s  97%     591.33s  98%    366.73s  99%
     4        297.68s  90%     317.06s  91%    182.27s  99%
     8        165.15s  82%     168.57s  86%     92.16s  98%
    16                          93.99s  77%

A. Snavely, G. Johnson and J. Genetti, "Data Intensive Volume Visualization on the Tera MTA and Cray T3E," Proceedings of the High Performance Computing Symposium - HPC '99, Adrian Tentner (Ed.), 1999.

Page 26:

Culler’s Multithreading Rebuttal

Page 27:

Culler’s Rebuttal

• multithreaded processors view the memory / interconnect as a pipeline which contains multiple outstanding memory references “in flight”

• since latency to memory is non-uniform, certain references are returned faster than others and potentially complete out of order

• references must thus be matched (synchronized) with their issuing instructions and the corresponding thread marked “ready”

• ready threads must then be scheduled

• Culler argues that synchronization and scheduling costs limit the concurrency which can be effectively exploited in a given code

Page 28:

Culler’s Rebuttal

• given R cycles of useful work between remote references with latency L, processor utilization U is given as:

U = R / (R + L)

• however if latency is fully covered by the availability of ready threads then U can be expressed as the following, where S is the total synchronization cost (product of the number of remote references and the synchronization cost per reference):

U = R / (R + S)
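A quick illustration with hypothetical numbers (not from Culler’s paper): take R = 50 cycles of useful work, L = 150 cycles of remote latency, and S = 15 cycles of total synchronization cost per reference interval.

Without latency hiding: U = 50 / (50 + 150) = 25%
With enough ready threads to hide the latency: U = 50 / (50 + 15) ≈ 77%

so multithreading helps, but the synchronization term S (not the raw latency L) becomes the ceiling on utilization.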

Page 29:

Culler’s Rebuttal

• scheduling threads without respect to the storage hierarchy (i.e. simply scheduling the next ready thread) increases the average latency of memory references

• but as latency increases, the number of threads required to hide the latency increases, thus synchronization costs increase, thus overall utilization decreases

• a related problem is that multithreading schemes typically reduce the state per thread (and thereby the cost of switching between threads) at the cost of an increased number of remote references (again increasing synchronization costs)

Page 30:

Culler’s Rebuttal & MTA

• the number of threads should be a function of the size of the top level of the storage hierarchy (to minimize switching costs), thus the size of the top level determines the amount of latency that can be hidden

• what if (as in the case of the Tera MTA) the top level is main memory?

• codes in which computation dominates are unable to benefit from high-speed register / cache access and reuse

• indeed we’ve seen this on the Tera MTA - the machine really does favor memory intensive codes with little reuse (like MPIRE)

Page 31:

TAM

• Threaded Abstract Machine tailors thread count and thread scheduling to the storage hierarchy of the target machine

• unlike MTA threads, TAM threads each execute until completion

• threads are organized by function activation; threads within the same activation execute “near” one another (in time or space)

• TAM benefits from the migration of data up through the storage hierarchy (TAM thread scheduling maximizes reuse)

Page 32:

TAM

Page 33:

Concluding Remarks

Page 34:

Hardware Multithreaded Graphics

• Argus pipeline

• growth of bus speeds versus growth rate of CPUs and GPUs implies the distance between the CPU and GPU is growing

• perhaps a solution to the problem of building and traversing balanced spatial data structures (potentially useful for ray tracing, shadows, anti-aliasing, etc.); in other words, maybe this is a solution to matching irregular data structures to streaming processors

Page 35:

The End

© Carter Emmart