Parallel Programming & Stuff Jud Leonard February 28, 2008


Parallel Programming & Stuff
Jud Leonard

February 28, 2008


SiCortex Systems


Outline

• Parallel problems
  – Simulation Models
  – Imaging
  – Monte Carlo methods
  – Embarrassing Parallelism

• Software issues due to parallelism
  – Communication
  – Synchronization
  – Simultaneity
  – Debugging


Limits to Scaling

• Amdahl’s Law: serial eventually dominates
  – Seldom the limitation in practice
  – Gustafson: Big problems have lots of parallelism

• Often in practice, communication dominates
  – Each node treats a smaller volume
  – Each node must communicate with more partners
  – More, smaller messages in the fabric

• Improved communication enables scaling

• Communication is key to higher performance
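
The contrast between Amdahl's fixed-size view and Gustafson's scaled-size view can be illustrated with the two textbook formulas. This is a minimal sketch, not taken from the slides; the function names and the 1% serial fraction are assumptions chosen for illustration.

```python
def amdahl_speedup(serial_fraction, processors):
    """Fixed problem size: the serial part is not accelerated."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


def gustafson_speedup(serial_fraction, processors):
    """Scaled problem size: the parallel part grows with the processor count.

    Here serial_fraction is measured on the parallel run.
    """
    return serial_fraction + (1.0 - serial_fraction) * processors


if __name__ == "__main__":
    # Assumed serial fraction of 1%, for illustration only.
    for p in (10, 100, 1000):
        print(p, round(amdahl_speedup(0.01, p), 1), round(gustafson_speedup(0.01, p), 1))
```

With a fixed problem, the speedup can never exceed 1/serial_fraction however many processors are added; with a problem that grows alongside the machine, the speedup keeps growing almost linearly. Neither formula accounts for communication cost.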


Physical System Simulations

• Spatial partition of problem
  – Works best if compute load evenly distributed
    • Weather, Climate
    • Fluid dynamics
  – Complex boundary management after load balancing

• Partition criteria must balance:
  – Communication
  – Compute
  – Storage


Example: 3D Convolution

• Operate on N^3 array with M^3 processors
• Result is a weighted sum of neighbor points

• Single-processor
  – No communication cost
  – Compute time ≈ N^3

• 3D partition
  – Communication ≈ (N/M)^2
  – Compute time ≈ (N/M)^3
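
The cost estimates above can be combined into a rough efficiency model. The sketch below is an illustration under stated assumptions: `cost_ratio` (the cost of communicating one surface point relative to computing one interior point) is a hypothetical parameter, and compute and communication are assumed not to overlap.

```python
def scaling_efficiency(n, m, cost_ratio):
    """Efficiency of an N^3 convolution on M^3 processors.

    cost_ratio is an assumed constant: the cost of communicating one
    surface point relative to computing one interior point.
    """
    side = n / m                          # points per edge of each node's sub-cube
    compute = side ** 3                   # per-node compute time ~ (N/M)^3
    # A single processor has no neighbors, so no communication cost.
    communicate = 0.0 if m == 1 else cost_ratio * side ** 2   # ~ (N/M)^2
    serial = n ** 3                       # single-processor time ~ N^3
    processors = m ** 3
    return serial / (processors * (compute + communicate))


if __name__ == "__main__":
    for m in (1, 10, 100):
        print(m ** 3, scaling_efficiency(1000, m, cost_ratio=10))
```

For m > 1 the expression simplifies to 1 / (1 + cost_ratio * M / N), so in this model efficiency falls as the sub-cube edge N/M shrinks toward the cost ratio.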


Scalability of 3D Convolution

[Figure: Effect of Cost Ratio on Scaling Efficiency. Scaling efficiency (1% to 100%) plotted against number of processors (1 to 1E+10); an axis is also labeled "Elapsed Time". Series: 10x scaling, 100x scaling.]


Example: Logic Simulation

• Modern chips contain many millions of gates
  – Enormous inherent parallelism in model

• Product quality depends on test coverage
  – Economic incentive

• Perfect application for parallel simulation
  – Why has nobody done it?
    • Communication costs
    • Complexity of partition problem
      – Multidimensional non-linear optimization


Example: Seismic Imaging

• Similar to Radar, Sonar, MRI…

• Record echoes of a distinctive signal
  – Correlate across time and space
  – Estimate remote structure from variation in echo delay at multiple sensors

• Terabytes of data
  – Need efficient algorithms
  – Every sensor affected by the whole structure
  – How to partition for efficiency?
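
The "correlate, then estimate delay" step can be sketched for a single sensor as a brute-force cross-correlation. This is only a toy illustration: the function names, the five-sample signal, and the noise-free recording are all made up, and real seismic processing uses far more efficient algorithms across many sensors.

```python
def cross_correlate(signal, echo, lag):
    """Sum of products of the signal with the echo read `lag` samples later."""
    return sum(s * echo[i + lag] for i, s in enumerate(signal)
               if 0 <= i + lag < len(echo))


def estimate_delay(signal, echo):
    """Return the lag (in samples) at which the echo best matches the signal."""
    max_lag = len(echo) - len(signal)
    return max(range(max_lag + 1),
               key=lambda lag: cross_correlate(signal, echo, lag))


if __name__ == "__main__":
    chirp = [1.0, -1.0, 1.0, 1.0, -1.0]          # made-up "distinctive signal"
    # Attenuated copy of the chirp, starting 7 samples into the recording.
    recording = [0.0] * 7 + [0.5 * x for x in chirp] + [0.0] * 4
    print(estimate_delay(chirp, recording))
```

Each lag is independent of the others, so this step parallelizes easily; the hard part noted above is that every sensor's data depends on the whole structure.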


New Issues due to Parallelism

• Communication costs
  – My memory is more accessible than others
    • Planning, sequencing halo exchanges
  – Bulk transfers most efficient
    • but take longer
  – Subroutine syntax vs Language intrinsic
  – Coherence and synchronization explicitly managed
  – Issues of grain size

• Synchronization
  – Coordination of “loose” parallelism
    • Identification of necessary sync points
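
A halo exchange can be sketched in one dimension without any message-passing library. In the example below, list copies stand in for messages between nodes; the function names and the 3-point averaging stencil are assumptions for illustration, and the input length is assumed to divide evenly among the parts.

```python
def smooth_serial(values):
    """3-point average on the interior; end points are left unchanged."""
    out = list(values)
    for i in range(1, len(values) - 1):
        out[i] = (values[i - 1] + values[i] + values[i + 1]) / 3.0
    return out


def smooth_partitioned(values, parts):
    """Same stencil, computed chunk by chunk after a halo exchange."""
    size = len(values) // parts          # assumes len(values) divides evenly
    chunks = [values[p * size:(p + 1) * size] for p in range(parts)]

    # Halo exchange: each chunk receives one boundary value from each neighbor.
    left_halo = [None] + [chunks[p - 1][-1] for p in range(1, parts)]
    right_halo = [chunks[p + 1][0] for p in range(parts - 1)] + [None]

    result = []
    for p, chunk in enumerate(chunks):
        padded = [left_halo[p]] + chunk + [right_halo[p]]
        for i in range(1, len(padded) - 1):
            if padded[i - 1] is None or padded[i + 1] is None:
                result.append(padded[i])     # global end point: unchanged
            else:
                result.append((padded[i - 1] + padded[i] + padded[i + 1]) / 3.0)
    return result
```

The exchange has to finish before any chunk computes its boundary points, which is one of the necessary sync points. All halo values are gathered in one step before computing, which mirrors the preference for bulk transfers.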


Mind Games

• Simultaneity
  – Contrary to habitual sequential mindset
  – Access to variables is not well-ordered between parallel threads
  – Order is not repeatable

• Debugging
  – Printf?
  – Breakpoints?
  – Timestamps?
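
A small sketch of shared-variable access between threads, using Python's standard `threading` module. The function name and the counts are illustrative assumptions. The order in which the workers interleave differs from run to run, but the lock makes the final total repeatable.

```python
import threading


def count_with_lock(threads, increments):
    """Increment a shared counter from several threads, guarded by a lock."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            with lock:               # only one thread at a time updates counter
                counter += 1

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()                     # sync point: wait for every worker
    return counter


if __name__ == "__main__":
    print(count_with_lock(4, 1000))
```

Without the lock, the read-modify-write in `counter += 1` is not guaranteed to be atomic, so updates could be lost in a way that varies between runs.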


Interesting Problems - Parallelism

• Event-driven simulation
• Load balancing
• Debugging
  – Correctness
    • Dependency
    • Synchronization
  – Performance
    • Critical paths


The Kautz Digraph

• Log diameter (base 3, in our case)
  – Reach any of 972 nodes in 6 or fewer steps

• Multiple disjoint paths
  – Fault tolerance
  – Congestion avoidance

• Large bisection width
  – No choke points as network grows

• Natural tree structure
  – Parallel broadcast & multicast
  – Parallel barriers & collectives


Alphabetic Construction

• Node names are strings of length k (diameter)
  – Alphabet of d+1 letters (d = degree)
  – No letter repeats in adjacent positions
  – ABAC: allowed
  – ABAA: not allowed

• Network order = (d+1) d^(k-1)
  – d+1 choices for first letter
  – d choices for each of the remaining (k-1) letters

• Connections correspond to shifts
  – ABAC, CBAC, DBAC -> BACA, BACB, BACD
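
The alphabetic construction is small enough to check by brute force. The sketch below enumerates node names, generates successors by shifting, and measures distances by breadth-first search. The function names are hypothetical, and the letters A, B, C, ... are assumed as the alphabet.

```python
from collections import deque
from itertools import product
from string import ascii_uppercase


def kautz_nodes(d, k):
    """All length-k strings over d+1 letters with no adjacent repeats."""
    letters = ascii_uppercase[:d + 1]
    return ["".join(w) for w in product(letters, repeat=k)
            if all(a != b for a, b in zip(w, w[1:]))]


def successors(node, d):
    """Shift left by one and append any letter that differs from the last."""
    letters = ascii_uppercase[:d + 1]
    return [node[1:] + c for c in letters if c != node[-1]]


def eccentricity(source, d):
    """Largest shortest-path distance from `source`, by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in successors(node, d):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return max(dist.values())


if __name__ == "__main__":
    print(len(kautz_nodes(3, 6)))        # compare with (d+1) * d**(k-1)
    print(successors("ABAC", 3))
    print(eccentricity("ABABAB", 3))
```

For d = 3 and k = 6 the enumeration gives 4 * 3^5 = 972 names, which agrees with the order formula above.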


Noteworthy

• Most paths simply shift in destination ID
  – ABCD -> BCDB -> CDBA -> DBAD -> BADC

• Unless tail overlaps head
  – ABCD -> BCDA -> CDAB

• A few nodes have bidirectionally-connected neighbors
  – ABAB <-> BABA

• A “necklace” consists of nodes whose names are merely rotations of each other
  – ABCD -> BCDA -> CDAB -> DABC -> ABCD again
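
The shift-based paths and the necklace can be reproduced with a few lines of string handling. This is a sketch of the idea on the slide, not the router's actual algorithm; `route` and `necklace` are hypothetical names, and both functions assume their inputs are valid node names.

```python
def route(source, destination):
    """Path from source to destination by shifting in destination letters.

    If a tail of the source already matches a head of the destination,
    only the remaining destination letters need to be shifted in.
    """
    k = len(source)
    overlap = 0
    for m in range(k, 0, -1):            # try the longest overlap first
        if source[-m:] == destination[:m]:
            overlap = m
            break
    path = [source]
    for letter in destination[overlap:]:
        path.append(path[-1][1:] + letter)
    return path


def necklace(node):
    """Nodes reached by repeatedly rotating the name left by one letter.

    Assumes the first and last letters differ, so every rotation is
    still a valid node name.
    """
    ring = [node]
    while True:
        rotated = ring[-1][1:] + ring[-1][0]
        if rotated == node:
            return ring
        ring.append(rotated)


if __name__ == "__main__":
    print(" -> ".join(route("ABCD", "BADC")))
    print(" -> ".join(route("ABCD", "CDAB")))
    print(" -> ".join(necklace("ABCD")))
```

Every intermediate name stays valid: where the source and destination letters meet, either they overlap (and the destination has no adjacent repeats) or the overlap is zero, which means the source's last letter differs from the destination's first.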


Whatsa Kautz Graph?

[Figure: Kautz graph with 4 nodes, numbered 0 to 3]

Diam:   1   2   3    4    5    6
Order:  4  12  36  108  324  972


Kautz Graph Topology

[Figure: Kautz graph with 12 nodes, numbered 0 to 11]

Diam:   1   2   3    4    5    6
Order:  4  12  36  108  324  972


Whatsa Kautz Graph?

[Figure: Kautz graph with 36 nodes, numbered 0 to 35]

Diam:   1   2   3    4    5    6
Order:  4  12  36  108  324  972


Interconnect Fabric

• Logarithmic diameter
  – Low latency
  – Low contention
  – Low switch degree

• Multiple paths
  – Fault tolerant to link, node, or module failures
  – Congestion avoidance

• Cost-effective
  – Scalable
  – Modular

[Figure: node block diagram. Labels: CPU with Cache (x6), L2 Cache, PCIe, Fabric Switch, DMA, Memory Control, DDR DIMM (x2).]


DMA Engine API

• Per-process structures:
  – Command and Event queues in user space
  – Buffer Descriptor table (writable by kernel only)
  – Route Descriptor table (writable by kernel only)
  – Heap (User readable/writable)
  – Counters (control conditional execution)

• Simple command set:
  – Send Event: immediate data for remote event queue
  – Put Im Heap: immediate data for remote heap
  – Send Command: nested command for remote exec
  – Put Buffer to Buffer: RDMA transfer
  – Do Command: conditionally execute command string


Interesting Problems - SiCortex

• Collectives optimized for Kautz digraph
  – Optimization for a subset
  – Primitive operations

• Partitions
  – Best subsets to choose
  – Best communication pattern within a subset

• Topology mapping
  – N-dimensional mesh
  – Tree
  – Systolic array

• Global shared memory


Brains and Beauty, too!


ICE9 Die Layout


27-node Module

[Figure: 27-node module. Labels: PCI Express Module Options, ICE9 Node Chip, DDR2 DIMM, Dual Gigabit Ethernet, Fibre Channel, 10 Gb Ethernet, InfiniBand, Power regulator, Backpanel Connector, Module Service Processor, MSP Ethernet.]


What’s new or unique? What’s not?

New or unique:
• Designed for HPC
• It’s not x86
  – Performance = low power
• Communication
  – Kautz digraph topology
  – Messaging: 1st class op
  – Mesochronous cluster
• Open source everything
• Performance counters
• Reliable by design
  – ECC everywhere
  – Thousands of monitors
• Factors of 3
• Lighted gull wing doors!

Not new:
• Linux (Gentoo)
• Little-endian
• MIPS-64 ISA
• Pathscale compiler
• GNU toolchain
• IEEE Floating Point
• MPI
• PCI Express I/O