introduction to parallel and high-performance...
TRANSCRIPT
![Page 1: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/1.jpg)
Introduction to Parallel and High-Performance Computing
(with Machine-Learning applications)
Anshul Gupta, IBM Research
Prabhanjan (Anju) Kambadur, Bloomberg L.P.
![Page 2: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/2.jpg)
Introduction to Parallel and High-Performance Computing (with Machine-Learning applications)
Part 1
Parallel computing basics and parallel algorithm analysis
Part 2
Parallel algorithms for Building a Classifier
![Page 3: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/3.jpg)
Part 1: Parallel and High-Performance Computing
•Why parallel computing?
![Page 4: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/4.jpg)
Part 1: Parallel and High-Performance Computing
•Why parallel computing?
• Parallel computing platforms
![Page 5: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/5.jpg)
Part 1: Parallel and High-Performance Computing
•Why parallel computing?
• Parallel computing platforms
• Parallel algorithm basics
![Page 6: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/6.jpg)
Part 1: Parallel and High-Performance Computing
•Why parallel computing?
• Parallel computing platforms
• Parallel algorithm basics
• Decomposition for parallelism
![Page 7: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/7.jpg)
Part 1: Parallel and High-Performance Computing
•Why parallel computing?
• Parallel computing platforms
• Parallel algorithm basics
• Decomposition for parallelism
• Parallel programming models
![Page 8: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/8.jpg)
Part 1: Parallel and High-Performance Computing
•Why parallel computing?
• Parallel computing platforms
• Parallel algorithm basics
• Decomposition for parallelism
• Parallel programming models
• Parallel algorithm analysis
![Page 9: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/9.jpg)
Why parallel computing?
Microprocessor trends: 1972--2015
Kirk M. Bresniker, Sharad Singhal, R. Stanley Williams, "Adapting to Thrive in a New Economy of Memory Abundance",Computer, vol.48, no. 12, pp. 44-53, Dec. 2015, doi:10.1109/MC.2015.368
![Page 10: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/10.jpg)
Parallel computing platforms
Highest level of parallelism:
• Compute nodes on an interconnection network
• Possibly, thousands of nodes
• Distributed memory
• Distributed or shared address space
• Scalability analysis crucial
![Page 11: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/11.jpg)
Parallel computing platforms (cont.)
A generalized compute node
• (Possibly) multiple CPUs
• (Possibly) multiple GPUs
![Page 12: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/12.jpg)
Parallel computing platforms (cont.)
![Page 13: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/13.jpg)
Parallel computing platforms (cont.)
Shared memory on a node, but Non-Uniform Memory Access (NUMA).
![Page 14: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/14.jpg)
Parallel computing platforms (cont.)
Nvidia GeForce GTX 680 (Kepler)
• 4 GPCs (graphics processing clusters)
• 8 SMXs (streaming multiprocessors)
• 192 X 8 = 1536 CUDA cores
![Page 15: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/15.jpg)
Parallel hardware hierarchy
Node ensemble
Nodes
CPUs
Cores
GPUs
GPCs
SMXs
CUDA cores
![Page 16: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/16.jpg)
Parallel program hierarchy
Process groups
Processes
Thread pools
Threads
GPU thread grids
GPU thread blocks
GPU threads
![Page 17: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/17.jpg)
Parallel programming paradigms
Distributed Memory Shared Memory Accelerator
![Page 18: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/18.jpg)
Parallel programming paradigms
Distributed Memory
• Multiple processes
• Distributed address space
• Explicit data movement
• Locality!
![Page 19: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/19.jpg)
Parallel programming paradigms
Distributed Memory
• Multiple processes
• Distributed address space
• Explicit data movement
• Locality!
Shared Memory
• Multiple threads
• Shared address space
• Implicit data movement
• Locality!
![Page 20: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/20.jpg)
Parallel programming paradigms
Distributed Memory
• Multiple processes
• Distributed address space
• Explicit data movement
• Locality!
Shared Memory
• Multiple threads
• Shared address space
• Implicit data movement
• Locality!
Accelerator
• Host memory PCIe Device memory
• Explicit data movement
• Locality!
![Page 21: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/21.jpg)
Algorithm design
• Algorithm design is critical to devising a computing solution
![Page 22: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/22.jpg)
Algorithm design
• Algorithm design is critical to devising a computation solution
• Serial algorithm is a recipe or sequence of basic steps or operations
![Page 23: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/23.jpg)
Algorithm design
• Algorithm design is critical to devising a computation solution
• Serial algorithm is a recipe or sequence of basic steps or operations
• Parallel algorithm is a recipe for solving the given problem using an ensemble of hardware resources
![Page 24: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/24.jpg)
Algorithm design
• Algorithm design is critical to devising a computation solution
• Serial algorithm is a recipe or sequence of basic steps or operations
• Parallel algorithm is a recipe for solving the given problem using an ensemble of hardware resources
• Specifying parallel algorithm involves a lot more than specifying a sequence of basic steps
![Page 25: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/25.jpg)
Algorithm design
• Algorithm design is critical to devising a computation solution
• Serial algorithm is a recipe or sequence of basic steps or operations
• Parallel algorithm is a recipe for solving the given problem using an ensemble of hardware resources
• Specifying parallel algorithm involves a lot more than specifying a sequence of basic steps
![Page 26: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/26.jpg)
Parallel algorithm design steps
• Identifying portions of work that can be performed concurrently
![Page 27: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/27.jpg)
Parallel algorithm design steps
• Identifying portions of work that can be performed concurrently
• Mapping concurrent pieces of work onto computing agents running in parallel
![Page 28: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/28.jpg)
Parallel algorithm design steps
• Identifying portions of work that can be performed concurrently
• Mapping concurrent pieces of work onto computing agents running in parallel
• Making the input, output, and intermediate data available to the right computing agent at the right time
![Page 29: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/29.jpg)
Parallel algorithm design steps
• Identifying portions of work that can be performed concurrently
• Mapping concurrent pieces of work onto computing agents running in parallel
• Making the input, output, and intermediate data available to the right computing agent at the right time
• Managing simultaneous requests for shared data
![Page 30: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/30.jpg)
Parallel algorithm design steps
• Identifying portions of work that can be performed concurrently
• Mapping concurrent pieces of work onto computing agents running in parallel
• Making the input, output, and intermediate data available to the right computing agent at the right time
• Managing simultaneous requests for shared data
• Synchronizing computing agents for correct program execution
![Page 31: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/31.jpg)
Parallel algorithm design steps
• Identifying portions of work that can be performed concurrently - Decomposition
• Mapping concurrent pieces of work onto computing agents running in parallel
• Making the input, output, and intermediate data available to the right computing agent at the right time – Data Dependencies
• Managing simultaneous requests for shared data
• Synchronizing computing agents for correct program execution –Task Dependencies
![Page 32: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/32.jpg)
Decomposition for concurrency
Task Decomposition Data Decomposition
![Page 33: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/33.jpg)
Decomposition for concurrency
Task Decomposition
• Concurrent tasks are identified and mapped onto to threads or processes
![Page 34: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/34.jpg)
Decomposition for concurrency
Task Decomposition
• Concurrent tasks are identified and mapped onto to threads or processes
• Tasks share or exchange data as needed
![Page 35: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/35.jpg)
Decomposition for concurrency
Task Decomposition
• Concurrent tasks are identified and mapped onto to threads or processes
• Tasks share or exchange data as needed
• May be static or dynamic
![Page 36: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/36.jpg)
Decomposition for concurrency
Task Decomposition
• Concurrent tasks are identified and mapped onto to threads or processes
• Tasks share or exchange data as needed
• May be static or dynamic
Data Decomposition
• Data is partitioned (input, output, or intermediate)
![Page 37: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/37.jpg)
Decomposition for concurrency
Task Decomposition
• Concurrent tasks are identified and mapped onto to threads or processes
• Tasks share or exchange data as needed
• May be static or dynamic
Data Decomposition
• Data is partitioned
• Partitions are assigned to computing agents
• “Owner computes” rule
![Page 38: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/38.jpg)
Decomposition for concurrency
Task Decomposition
• Concurrent tasks are identified and mapped onto to threads or processes
• Tasks share or exchange data as needed
• May be static or dynamic
Data Decomposition
• Data is partitioned
• Partitions are assigned to computing agents
• “Owner computes” rule
• Usually static
![Page 39: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/39.jpg)
Decomposition for concurrency
Data Decomposition
Task Decomposition
Hybrid Decompositions
(Example: sparse matrix factorization)
+ =
![Page 40: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/40.jpg)
Task decomposition example
Chess Program
• Each task evaluates all moves of a single piece (branch-and-bound)
• Small data (board position) can be replicated
• Dynamic load balancing required
R1 B1 P1K1 Q
. . .
P2
![Page 41: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/41.jpg)
Data decomposition example
Dense Matrix-Vector Multiplication
. =
P1
P2
P3
P4
P5
P6
P7
P8
P9
![Page 42: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/42.jpg)
Parallel application design guidelines
• Focus on one level of hierarchy at a time – from top to bottom.
![Page 43: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/43.jpg)
Parallel application design guidelines
• Focus on one level of hierarchy at a time – from top to bottom.
• Devise the best decomposition strategy at the given level
![Page 44: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/44.jpg)
Parallel application design guidelines
• Focus on one level of hierarchy at a time – from top to bottom.
• Devise the best decomposition strategy at the given level
• Computing agents are likely to be parallel themselves
![Page 45: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/45.jpg)
Parallel application design guidelines
• Focus on one level of hierarchy at a time – from top to bottom.
• Devise the best decomposition strategy at the given level
• Computing agents are likely to be parallel themselves
• Minimize interactions, synchronization, and data movement among computing agents
![Page 46: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/46.jpg)
Parallel application design guidelines
• Focus on one level of hierarchy at a time – from top to bottom.
• Devise the best decomposition strategy at the given level
• Computing agents are likely to be parallel themselves
• Minimize interactions, synchronization, and data movement among computing agents
• Minimize load imbalance and idling among computing agents
![Page 47: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/47.jpg)
Parallel application design guidelines
• Focus on one level of hierarchy at a time – from top to bottom.
• Devise the best decomposition strategy at the given level
• Computing agents are likely to be parallel themselves
• Minimize interactions, synchronization, and data movement among computing agents
• Minimize load imbalance and idling among computing agents
![Page 48: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/48.jpg)
Parallel algorithm analysis
• Serial run time, TS: time required by best known method on a single computing agent
![Page 49: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/49.jpg)
Parallel algorithm analysis
• Serial run time, TS: time required by best known method on a single computing agent
• Problem size, W = total amount of work: TS = kW
![Page 50: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/50.jpg)
Parallel algorithm analysis
• Serial run time, TS: time required by best known method on a single computing agent
• Problem size, W = total amount of work: TS = kW
• Parallel run time, TP: time elapsed between start of computation until the last of the p computing agents finishes
![Page 51: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/51.jpg)
Parallel algorithm analysis
• Serial run time, TS: time required by best known method on a single computing agent
• Problem size, W = total amount of work: TS = kW
• Parallel run time, TP: time elapsed between start of computation until the last of the p computing agents finishes
• Overhead, sum of all wasted compute resources: TO = pTP – TS
![Page 52: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/52.jpg)
Parallel algorithm analysis
• Serial run time, TS: time required by best known method on a single computing agent
• Problem size, W = total amount of work: TS = kW
• Parallel run time, TP: time elapsed between start of computation until the last of the p computing agents finishes
• Overhead, sum of all wasted compute resources: TO = pTP – TS
• Speedup, ratio of serial to parallel time: S = TS/TP = pTS/(TS+TO)
![Page 53: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/53.jpg)
Parallel algorithm analysis
• Serial run time, TS: time required by best known method on a single computing agent
• Problem size, W = total amount of work: TS = kW
• Parallel run time, TP: time elapsed between start of computation until the last of the p computing agents finishes
• Overhead, sum of all wasted compute resources: TO = pTP – TS
• Speedup, ratio of serial to parallel time: S = TS/TP = pTS/(TS+TO)
• Efficiency, fraction of overall time spent doing useful work: E = S/p = TS/pTP = TS/(TS+TO)
![Page 54: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/54.jpg)
Parallel algorithm analysis
• Serial run time, TS: time required by best known method on a single computing agent
• Problem size, W = total amount of work: TS = kW
• Parallel run time, TP: time elapsed between start of computation until the last of the p computing agents finishes
• Overhead, sum of all wasted compute resources: TO = pTP – TS
• Speedup, ratio of serial to parallel time: S = TS/TP = pTS/(TS+TO)
• Efficiency, fraction of overall time spent doing useful work: E = S/p = TS/pTP = TS/(TS+TO)
![Page 55: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/55.jpg)
Parallel algorithm analysis
![Page 56: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/56.jpg)
Parallel algorithm analysis
![Page 57: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/57.jpg)
Isoefficiency function
• Function fE(p) of the number of computing agents p by which the problem size Wmust grow in order to maintain a given efficiency E.
![Page 58: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/58.jpg)
Isoefficiency function
• Function fE(p) of the number of computing agents p by which the problem size Wmust grow with p in order to maintain a given efficiency E.
• Captures the effect of communication, load-imbalance, contention, serial-bottlenecks, etc.
![Page 59: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/59.jpg)
Isoefficiency function
![Page 60: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/60.jpg)
Scalability analysis
E = S/p = TS/pTP
![Page 61: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/61.jpg)
Scalability analysis
E = S/p = TS/pTP
Since TO = pTP – TS or pTP = TS +TO
![Page 62: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/62.jpg)
Scalability analysis
E = S/p = TS/pTP
Since TO = pTP – TS or pTP = TS +TO
Therefore, E = TS/(TS +TO), or E = kW/(kW+TO), because TS = kW
![Page 63: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/63.jpg)
Scalability analysis
E = S/p = TS/pTP
Since TO = pTP – TS or pTP = TS +TO
Therefore, E = TS/(TS +TO), or E = kW/(kW+TO), because TS = kW
W = TO . E/k(1-E)
![Page 64: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/64.jpg)
Scalability analysis
E = S/p = TS/pTP
Since TO = pTP – TS or pTP = TS +TO
Therefore, E = TS/(TS +TO), or E = kW/(kW+TO), because TS = kW
W = TO . E/k(1-E)
W ~ TO
![Page 65: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/65.jpg)
Scalability analysis: W = O(n3)
Algorithm A
TP = O(n3/p) + O(n2/√p)
Algorithm B
TP = O(n3/p) + O(n√n)
![Page 66: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/66.jpg)
Scalability analysis: W = O(n3)
Algorithm A
TP = O(n3/p) + O(n2/√p)
W = O(n3) => n3 ~ n2√p
Algorithm B
TP = O(n3/p) + O(n√n)
W = O(n3) => n3 ~ n1.5p
![Page 67: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/67.jpg)
Scalability analysis: W = O(n3)
Algorithm A
TP = O(n3/p) + O(n2/√p)
W = O(n3) => n3 ~ n2√p
n ~ √p
Algorithm B
TP = O(n3/p) + O(n√n)
W = O(n3) => n3 ~ n1.5p
n1.5 ~ p
![Page 68: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/68.jpg)
Scalability analysis: W = O(n3)
Algorithm A
TP = O(n3/p) + O(n2/√p)
W = O(n3) => n3 ~ n2√p
n ~ √p
W = O(n3) = O(p1.5)
Algorithm B
TP = O(n3/p) + O(n√n)
W = O(n3) => n3 ~ n1.5p
n1.5 ~ p
W = O(n3) = O(p2)
![Page 69: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/69.jpg)
Parallel algorithm design and analysis
Dense Matrix-Vector Multiplication (1-D decomposition)
. =
P1
P2
P3
P4
P5
P6
P7
P8
P9
A x y
![Page 70: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/70.jpg)
Parallel algorithm design and analysis
Dense Matrix-Vector Multiplication (2-D decomposition)
P1P1
. =
P1 P2 P3
P5 P6
P7 P8 P9
A x y
![Page 71: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/71.jpg)
Parallel matrix-vector multiplication: 1-D decomposition
![Page 72: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/72.jpg)
Parallel matrix-vector multiplication: 2-D decomposition
![Page 73: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/73.jpg)
Scalability analysis of matrix-vector multiplication (1-D decomposition)
TP = n2/p + tslog(p) + twn
![Page 74: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/74.jpg)
Scalability analysis of matrix-vector multiplication (1-D decomposition)
TP = n2/p + tslog(p) + twn
TO = tsplog(p) + twpn
![Page 75: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/75.jpg)
Scalability analysis of matrix-vector multiplication (1-D decomposition)
TP = n2/p + tslog(p) + twn
TO = tsplog(p) + twpn
W = O(n2)
![Page 76: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/76.jpg)
Scalability analysis of matrix-vector multiplication (1-D decomposition)
TP = n2/p + tslog(p) + twn
TO = tsplog(p) + twpn
W = O(n2)
1: n2 ~ plog(p)
![Page 77: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/77.jpg)
Scalability analysis of matrix-vector multiplication (1-D decomposition)
TP = n2/p + tslog(p) + twn
TO = tsplog(p) + twpn
W = O(n2)
1: n2 ~ plog(p)
2: n2 ~ pn, or n ~ p, or W = O(n2) = O(p2)
![Page 78: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/78.jpg)
Scalability analysis of matrix-vector multiplication (1-D decomposition)
TP = n2/p + tslog(p) + twn
TO = tsplog(p) + twpn
W = O(n2)
1: n2 ~ plog(p)
2: n2 ~ pn, or n ~ p, or W = O(n2) = O(p2)
![Page 79: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/79.jpg)
Scalability analysis of matrix-vector multiplication (2-D decomposition)
TP = n2/p + tslog(p) + tw(n/√p)log(p)
![Page 80: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/80.jpg)
Scalability analysis of matrix-vector multiplication (2-D decomposition)
TP = n2/p + tslog(p) + tw(n/√p)log(p)
TO = tsplog(p) + twn√plog(p)
![Page 81: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/81.jpg)
Scalability analysis of matrix-vector multiplication (2-D decomposition)
TP = n2/p + tslog(p) + tw(n/√p)log(p)
TO = tsplog(p) + twn√plog(p)
W = O(n2)
![Page 82: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/82.jpg)
Scalability analysis of matrix-vector multiplication (2-D decomposition)
TP = n2/p + tslog(p) + tw(n/√p)log(p)
TO = tsplog(p) + twn√plog(p)
W = O(n2)
1: n2 ~ plog(p)
![Page 83: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/83.jpg)
Scalability analysis of matrix-vector multiplication (2-D decomposition)
TP = n2/p + tslog(p) + tw(n/√p)log(p)
TO = tsplog(p) + twn√plog(p)
W = O(n2)
1: n2 ~ plog(p)
2: n2 ~ n√plog(p), or n ~ √plog(p),
or W = O(n2) = O(plog2p)
![Page 84: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/84.jpg)
Scalability analysis of matrix-vector multiplication (2-D decomposition)
TP = n2/p + tslog(p) + tw(n/√p)log(p)
TO = tsplog(p) + twn√plog(p)
W = O(n2)
1: n2 ~ plog(p)
2: n2 ~ n√plog(p), or n ~ √plog(p),
or W = O(n2) = O(plog2p)
![Page 85: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/85.jpg)
Isoefficiency function of dense sparse matrix-vector multiplication
1-D decomposition
W ~ p2
2-D decomposition
W ~ plog2p
2-D decomposition is likely to yield higher speedups, require smaller
problems to deliver the speedups, and scale more readily to larger number
of computing agents.
![Page 86: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/86.jpg)
Concluding remarks
• Parallelism necessary for continued performance improvement.
![Page 87: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/87.jpg)
Concluding remarks
• Parallelism necessary for continued performance improvement.
• Complex hierarchy of parallel computing hardware and programming paradigms
![Page 88: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/88.jpg)
Concluding remarks
• Parallelism necessary for continued performance improvement.
• Complex hierarchy of parallel computing hardware and programming paradigms
• Systematic top down parallel application design
![Page 89: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/89.jpg)
Concluding remarks
• Parallelism necessary for continued performance improvement.
• Complex hierarchy of parallel computing hardware and programming paradigms
• Systematic top down parallel application design
• Decomposition strategy is critical
![Page 90: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/90.jpg)
Concluding remarks
• Parallelism necessary for continued performance improvement.
• Complex hierarchy of parallel computing hardware and programming paradigms
• Systematic top down parallel application design
• Decomposition strategy is critical
• Analysis important to understand scalability
![Page 91: Introduction to Parallel and High-Performance …auai.org/uai2016/tutorials_pres/parallel.pdfIntroduction to Parallel and High-Performance Computing (with Machine-Learning applications)](https://reader033.vdocuments.mx/reader033/viewer/2022052717/5f043c2c7e708231d40cfa13/html5/thumbnails/91.jpg)
Thank you!