Scalable Molecular Dynamics for Large Biomolecular Systems


Post on 03-Jan-2016


TRANSCRIPT

Page 1: Scalable Molecular Dynamics for Large Biomolecular Systems

1

Scalable Molecular Dynamics for Large Biomolecular Systems

Robert Brunner

James C Phillips

Laxmikant Kale

Page 2: Scalable Molecular Dynamics for Large Biomolecular Systems

2

Overview

• Context: approach and methodology
• Molecular dynamics for biomolecules
• Our program NAMD
  – Basic parallelization strategy
• NAMD performance optimizations
  – Techniques
  – Results
• Conclusions: summary, lessons and future work

Page 3: Scalable Molecular Dynamics for Large Biomolecular Systems

3

The context

• Objective: enhance performance and productivity in parallel programming
  – For complex, dynamic applications
  – Scalable to thousands of processors
• Theme:
  – Adaptive techniques for handling dynamic behavior
• Look for the optimal division of labor between the human programmer and the “system”
  – Let the programmer specify what to do in parallel
  – Let the system decide when and where to run the subcomputations
• Data-driven objects as the substrate

Page 4: Scalable Molecular Dynamics for Large Biomolecular Systems

4


Page 5: Scalable Molecular Dynamics for Large Biomolecular Systems

5

Data driven execution

[Figure: two schedulers, each with its own message queue (Message Q)]

Page 6: Scalable Molecular Dynamics for Large Biomolecular Systems

6

Charm++

• Parallel C++ with data-driven objects
• Object arrays and collections
• Asynchronous method invocation
• Object groups:
  – global object with a “representative” on each PE
• Prioritized scheduling
• Mature, robust, portable
• http://charm.cs.uiuc.edu
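The data-driven model described on the last two slides can be illustrated with a small sketch. This is not the Charm++ API: the `Scheduler` and `Patch` classes, the `send` method, and the convention that a lower priority number runs first are all assumptions made for illustration. Each processor is modeled as a queue of pending method invocations that a scheduler drains in priority order.

```python
import heapq
import itertools

class Scheduler:
    """Toy per-processor scheduler: runs the most urgent queued message first."""

    def __init__(self):
        self._queue = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order per priority

    def send(self, obj, method, *args, priority=0):
        # Asynchronous invocation: enqueue the call and return immediately.
        heapq.heappush(self._queue, (priority, next(self._order), obj, method, args))

    def run(self):
        # Data-driven execution: whichever message is ready and most urgent runs next.
        while self._queue:
            _, _, obj, method, args = heapq.heappop(self._queue)
            getattr(obj, method)(*args)

class Patch:
    """Hypothetical object that just records the messages it receives."""

    def __init__(self, name, log):
        self.name, self.log = name, log

    def receive(self, payload):
        self.log.append((self.name, payload))

log = []
sched = Scheduler()
a, b = Patch("A", log), Patch("B", log)
sched.send(a, "receive", "bulk work", priority=5)
sched.send(b, "receive", "urgent work", priority=1)
sched.send(a, "receive", "more bulk work", priority=5)
sched.run()
print(log)
```

The urgent message to B runs before the two earlier-or-equal bulk messages to A, which is the effect prioritized scheduling is meant to give.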

Page 7: Scalable Molecular Dynamics for Large Biomolecular Systems

7

Multi-partition decomposition

Page 8: Scalable Molecular Dynamics for Large Biomolecular Systems

8

Load balancing

• Based on migratable objects
• Collect timing data for several cycles
• Run heuristic load balancer
  – Several alternative ones
• Re-map and migrate objects accordingly
  – Registration mechanisms facilitate migration

Page 9: Scalable Molecular Dynamics for Large Biomolecular Systems

9

Measurement based load balancing

• Application-induced imbalances:
  – Abrupt, but infrequent, or
  – Slow, cumulative
  – Rarely: frequent, large changes
• Principle of persistence
  – Extension of the principle of locality
  – The behavior of objects, including computational load and communication patterns, tends to persist over time
• We have implemented strategies that exploit this automatically
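One way to picture how the principle of persistence is exploited: time each object for a few cycles and use the recent average as the predicted load for the next cycles. The sketch below is a minimal illustration under that assumption; the `LoadDatabase` class, its method names, and the sliding-window average are hypothetical and are not taken from the Charm++ load balancing framework.

```python
from collections import defaultdict, deque

class LoadDatabase:
    """Keeps the last few measured times per object and predicts from them."""

    def __init__(self, window=3):
        # One bounded history per object; old samples fall off automatically.
        self._samples = defaultdict(lambda: deque(maxlen=window))

    def record(self, obj_id, seconds):
        self._samples[obj_id].append(seconds)

    def predicted_load(self, obj_id):
        # Persistence assumption: the near future looks like the recent past.
        samples = self._samples[obj_id]
        return sum(samples) / len(samples) if samples else 0.0

db = LoadDatabase(window=3)
for measured in (1.0, 2.0, 3.0, 4.0):
    db.record("compute-17", measured)
print(db.predicted_load("compute-17"))  # average of the last three samples
```

A slowly drifting load shows up in the window after a few cycles, while an abrupt change is picked up as soon as the old samples age out.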

Page 10: Scalable Molecular Dynamics for Large Biomolecular Systems

10

Molecular Dynamics

Page 11: Scalable Molecular Dynamics for Large Biomolecular Systems

11

Molecular dynamics and NAMD

• MD to understand the structure and function of biomolecules
  – proteins, DNA, membranes
• NAMD is a production-quality MD program
  – Active use by biophysicists (science publications)
  – 50,000+ lines of C++ code
  – 1000+ registered users
  – Features and “accessories” such as
    • VMD: visualization
    • Biocore: collaboratory
    • Steered and Interactive Molecular Dynamics

Page 12: Scalable Molecular Dynamics for Large Biomolecular Systems

12

NAMD Contributors

• PIs:
  – Laxmikant Kale, Klaus Schulten, Robert Skeel
• NAMD 1:
  – Robert Brunner, Andrew Dalke, Attila Gursoy, Bill Humphrey, Mark Nelson
• NAMD 2:
  – M. Bhandarkar, R. Brunner, A. Gursoy, J. Phillips, N. Krawetz, A. Shinozaki, K. Varadarajan, Gengbin Zheng, ..

Page 13: Scalable Molecular Dynamics for Large Biomolecular Systems

13

Molecular Dynamics

• Collection of [charged] atoms, with bonds
• Newtonian mechanics
• At each time-step
  – Calculate forces on each atom
    • bonds:
    • non-bonded: electrostatic and van der Waals
  – Calculate velocities and advance positions
• 1 femtosecond time-step, millions needed!
• Thousands of atoms (1,000 - 100,000)
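The per-time-step loop above can be sketched on a toy system. The example below assumes two atoms on a line joined by one harmonic bond, in arbitrary units, and uses velocity Verlet integration; the slides do not say which integrator NAMD uses, so that choice and all function names here are assumptions. With a 1 femtosecond step, one nanosecond of simulated time is a million iterations of this loop.

```python
def bond_forces(x, k=1.0, r0=1.0):
    """Forces from a harmonic bond between atoms 0 and 1 (toy force field)."""
    stretch = (x[1] - x[0]) - r0
    # A stretched bond pulls the atoms together; forces are equal and opposite.
    return [k * stretch, -k * stretch]

def step(x, v, f, mass, dt):
    """One velocity Verlet step: half kick, drift, new forces, half kick."""
    v = [vi + 0.5 * dt * fi / m for vi, fi, m in zip(v, f, mass)]
    x = [xi + dt * vi for xi, vi in zip(x, v)]
    f = bond_forces(x)
    v = [vi + 0.5 * dt * fi / m for vi, fi, m in zip(v, f, mass)]
    return x, v, f

def energy(x, v, mass, k=1.0, r0=1.0):
    """Kinetic plus bond potential energy, used only to check the integration."""
    kinetic = sum(0.5 * m * vi * vi for m, vi in zip(mass, v))
    return kinetic + 0.5 * k * ((x[1] - x[0]) - r0) ** 2

x, v, mass = [0.0, 1.2], [0.0, 0.0], [1.0, 1.0]
f = bond_forces(x)
e0 = energy(x, v, mass)
for _ in range(1000):
    x, v, f = step(x, v, f, mass, dt=0.01)
e1 = energy(x, v, mass)
print(e0, e1)
```

Total energy should stay close to its starting value, which is a quick sanity check on any integrator of this kind.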

Page 14: Scalable Molecular Dynamics for Large Biomolecular Systems

14

Cut-off radius

• Use of cut-off radius to reduce work
  – 8 - 14 Å
  – Faraway charges ignored!
• 80-95% of the work is non-bonded force computation
• Some simulations need faraway contributions
  – Periodic systems: Ewald, Particle-Mesh Ewald (PME)
  – Aperiodic systems: FMA (fast multipole algorithm)
• Even so, cut-off based computations are important:
  – near-atom calculations are part of the above
  – multiple time-stepping is used: k cut-off steps, 1 PME/FMA
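To see how much work a cut-off removes, the sketch below counts atom pairs within a cut-off by brute force and compares that with the number of all pairs. The brute-force scan, the function name, and the unit-spaced 4×4×4 test grid are illustrative assumptions; a production code would use spatial binning rather than testing every pair.

```python
from itertools import combinations

def pairs_within_cutoff(coords, cutoff):
    """Return index pairs (i, j), i < j, whose distance is at most `cutoff`."""
    limit = cutoff * cutoff  # compare squared distances, no square roots needed
    close = []
    for (i, a), (j, b) in combinations(enumerate(coords), 2):
        d2 = sum((p - q) ** 2 for p, q in zip(a, b))
        if d2 <= limit:
            close.append((i, j))
    return close

# 64 points on a unit-spaced cubic grid stand in for atoms.
grid = [(x, y, z) for x in range(4) for y in range(4) for z in range(4)]
close = pairs_within_cutoff(grid, cutoff=1.0)
all_pairs = len(grid) * (len(grid) - 1) // 2
print(len(close), "of", all_pairs, "pairs are inside the cut-off")
```

The number of pairs inside the cut-off grows linearly with the number of atoms, while the number of all pairs grows quadratically, which is why the cut-off matters more as systems get larger.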

Page 15: Scalable Molecular Dynamics for Large Biomolecular Systems

15

Scalability

• The program should scale up to use a large number of processors.
  – But what does that mean?
• An individual simulation isn’t truly scalable
• Better definition of scalability:
  – If I double the number of processors, I should be able to retain parallel efficiency by increasing the problem size

Page 16: Scalable Molecular Dynamics for Large Biomolecular Systems

16

Isoefficiency

• Quantify scalability
  – (Work of Vipin Kumar, U. Minnesota)
• How much increase in problem size is needed to retain the same efficiency on a larger machine?
• Efficiency: Seq. Time / (P · Parallel Time)
  – Parallel time = computation + communication + idle
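The efficiency definition above is easy to apply to measured times. The sketch below uses two numbers from the ASCI Red row of the machine comparison table at the end of this transcript (28.0 on 1 processor, 0.196 on 192 processors); the function names are illustrative.

```python
def speedup(seq_time, par_time):
    """Speedup = sequential time / parallel time."""
    return seq_time / par_time

def efficiency(seq_time, procs, par_time):
    """Efficiency = sequential time / (P * parallel time)."""
    return seq_time / (procs * par_time)

s = speedup(28.0, 0.196)
e = efficiency(28.0, 192, 0.196)
print(f"speedup {s:.0f}, efficiency {e:.2f}")
```

The speedup matches the 143 listed in that table, and the efficiency comes out a little under three quarters.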

Page 17: Scalable Molecular Dynamics for Large Biomolecular Systems

17

Atom decomposition

• Partition the atoms array across processors
  – Nearby atoms may not be on the same processor
  – Communication: O(N) per processor
  – Communication/computation: O(N)/(N/P) = O(P)
  – Again, not scalable by our definition

Page 18: Scalable Molecular Dynamics for Large Biomolecular Systems

18

Force Decomposition

• Distribute force matrix to processors
  – Matrix is sparse, non-uniform
  – Each processor has one block
  – Communication: N/√P
  – Ratio: √P
• Better scalability in practice
  – (can use 100+ processors)
  – Plimpton:
  – Hwang, Saltz, et al:
    • 6% on 32 PEs, 36% on 128 processors
  – Yet not scalable in the sense defined here!

Page 19: Scalable Molecular Dynamics for Large Biomolecular Systems

19

Spatial Decomposition

• Allocate close-by atoms to the same processor
• Three variations possible:
  – Partitioning into P boxes, 1 per processor
    • Good scalability, but hard to implement
  – Partitioning into fixed-size boxes, each a little larger than the cutoff distance
  – Partitioning into smaller boxes
• Communication: O(N/P)
  – so, scalable in principle
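The three decompositions can be compared by their communication-to-computation ratio per processor. The sketch below takes computation as N/P in every case, communication as N for atom decomposition and N/P for spatial decomposition (as stated on the slides), and N/√P for force decomposition (the cost usually quoted for that scheme, assumed here). Constant factors are ignored, so only the growth with P is meaningful.

```python
import math

def comm_to_comp_ratio(scheme, n_atoms, procs):
    """Order-of-magnitude communication/computation ratio per processor."""
    computation = n_atoms / procs  # cut-off limited work per processor
    communication = {
        "atom": n_atoms,                       # O(N)
        "force": n_atoms / math.sqrt(procs),   # O(N / sqrt(P)), assumed
        "spatial": n_atoms / procs,            # O(N / P)
    }[scheme]
    return communication / computation

for scheme in ("atom", "force", "spatial"):
    ratios = [comm_to_comp_ratio(scheme, 92224, p) for p in (16, 64, 256)]
    print(scheme, ratios)
```

Only the spatial ratio stays constant as processors are added; the other two grow with P, so no increase in problem size can restore their efficiency, which is what "not scalable by our definition" means.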

Page 20: Scalable Molecular Dynamics for Large Biomolecular Systems

20

Spatial Decomposition in NAMD

• NAMD 1 used spatial decomposition
• Good theoretical isoefficiency, but for a fixed-size system, load balancing problems
• For midsize systems, got good speedups up to 16 processors…
• Use the symmetry of Newton’s 3rd law to facilitate load balancing

Page 21: Scalable Molecular Dynamics for Large Biomolecular Systems

21

Spatial Decomposition

But the load balancing problems are still severe:

Page 22: Scalable Molecular Dynamics for Large Biomolecular Systems

22

Page 23: Scalable Molecular Dynamics for Large Biomolecular Systems

23

FD + SD

• Now, we have many more objects to load balance:– Each diamond can be assigned to any processor

– Number of diamonds (3D):

• 14·Number of Patches
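The factor of 14 can be checked by counting. In 3D each patch has 26 neighbors; by Newton's third law each neighboring pair needs only one compute object, which gives 13 per patch, plus one object for interactions within the patch itself. The sketch below counts these on a periodic grid; the periodic wrap-around and the function name are assumptions made so that every patch has a full set of neighbors.

```python
from itertools import product

def count_compute_objects(nx, ny, nz):
    """Count unordered patch pairs (including self-pairs) among grid neighbors.

    Assumes a periodic grid with at least 3 patches along each axis.
    """
    dims = (nx, ny, nz)
    pairs = set()
    for cell in product(range(nx), range(ny), range(nz)):
        for offset in product((-1, 0, 1), repeat=3):
            neighbor = tuple((c + d) % n for c, d, n in zip(cell, offset, dims))
            # Sorting makes (a, b) and (b, a) the same object; offset 0 gives (a, a).
            pairs.add(tuple(sorted((cell, neighbor))))
    return len(pairs)

patches = 4 * 4 * 4
print(count_compute_objects(4, 4, 4), "compute objects for", patches, "patches")
```

For a non-periodic system the patches on the boundary have fewer neighbors, so 14 per patch is an upper bound there.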

Page 24: Scalable Molecular Dynamics for Large Biomolecular Systems

24

Bond Forces

• Multiple types of forces:
  – Bonds (2), Angles (3), Dihedrals (4), ..
  – Luckily, each involves atoms in neighboring patches only
• Straightforward implementation:
  – Send message to all neighbors,
  – receive forces from them
  – 26*2 messages per patch!

Page 25: Scalable Molecular Dynamics for Large Biomolecular Systems

25

Bonded Forces:

• Assume one patch per processor:
  – an angle force involving atoms in patches (x1,y1,z1), (x2,y2,z2), (x3,y3,z3)
  – is calculated in patch (max{xi}, max{yi}, max{zi})

[Figure: patches labeled A, B and C]
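The rule for choosing where a bonded force is computed can be written as a one-line function. The sketch below assumes patches are identified by integer grid coordinates; the function name is illustrative.

```python
def owning_patch(patch_coords):
    """Patch that computes a bonded term, given the patches of its atoms.

    Takes the component-wise maximum of the patch grid coordinates.
    """
    xs, ys, zs = zip(*patch_coords)
    return (max(xs), max(ys), max(zs))

# An angle term whose three atoms sit in three different patches.
angle_patches = [(1, 2, 3), (2, 2, 2), (1, 3, 3)]
print(owning_patch(angle_patches))  # (2, 3, 3)
```

Because every patch applies the same rule, each bonded term is computed exactly once without any negotiation. The chosen patch need not contain any of the atoms itself, as in the example.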

Page 26: Scalable Molecular Dynamics for Large Biomolecular Systems

26

Implementation

• Multiple objects per processor
  – Different types: patches, pairwise forces, bonded forces
  – Each may have its data ready at different times
  – Need ability to map and remap them
  – Need prioritized scheduling
• Charm++ supports all of these

Page 27: Scalable Molecular Dynamics for Large Biomolecular Systems

27

Load Balancing

• Is a major challenge for this application
  – especially for a large number of processors
• Unpredictable workloads
  – Each diamond (force object) and patch encapsulates a variable amount of work
  – Static estimates are inaccurate
• Measurement-based load balancing framework
  – Robert Brunner’s recent Ph.D. thesis
  – Very slow variations across timesteps

Page 28: Scalable Molecular Dynamics for Large Biomolecular Systems

28

Bipartite graph balancing

• Background load:
  – Patches (integration, ..) and bond-related forces
• Migratable load:
  – Non-bonded forces
• Bipartite communication graph
  – between migratable and non-migratable objects
• Challenge:
  – Balance load while minimizing communication

Page 29: Scalable Molecular Dynamics for Large Biomolecular Systems

29

Load balancing strategy

Greedy variant (simplified):

  Sort compute objects (diamonds)
  Repeat (until all assigned)
    S = set of all processors that:
        -- are not overloaded
        -- generate least new communication
    P = least loaded {S}
    Assign heaviest compute to P

Refinement:

  Repeat
    - Pick a compute from the most overloaded PE
    - Assign it to a suitable underloaded PE
  Until (no movement)

[Figure: a compute object placed between two cells]
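The two phases in the pseudocode can be sketched as follows. This is an illustration under stated assumptions, not NAMD's balancer: "overloaded" is taken to mean above a fixed multiple of the average load, "new communication" is counted as the number of patches whose data a processor does not already hold, the refinement phase ignores communication, and all names and thresholds are hypothetical.

```python
from collections import defaultdict

def greedy_assign(computes, patch_home, background, overload=1.1):
    """computes: {name: (load, patches)}; patch_home: {patch: proc};
    background: {proc: non-migratable load}. Returns (assignment, load)."""
    load = dict(background)
    present = defaultdict(set)  # patches whose data each processor already holds
    for patch, proc in patch_home.items():
        present[proc].add(patch)
    total = sum(load.values()) + sum(work for work, _ in computes.values())
    limit = overload * total / len(load)
    assignment = {}
    # Heaviest compute first.
    for name, (work, patches) in sorted(computes.items(), key=lambda kv: -kv[1][0]):
        def new_comm(proc):
            return len(set(patches) - present[proc])
        # Processors that stay under the limit; fall back to all if none do.
        fits = [p for p in load if load[p] + work <= limit] or list(load)
        least = min(new_comm(p) for p in fits)
        candidates = [p for p in fits if new_comm(p) == least]
        target = min(candidates, key=lambda p: (load[p], p))
        assignment[name] = target
        load[target] += work
        present[target].update(patches)
    return assignment, load

def refine(assignment, computes, load, tolerance=1.05):
    """Move computes off the most loaded processor until nothing more fits."""
    limit = tolerance * sum(load.values()) / len(load)
    while True:
        donor = max(load, key=load.get)
        if load[donor] <= limit:
            break
        movable = sorted((n for n, p in assignment.items() if p == donor),
                         key=lambda n: -computes[n][0])
        for name in movable:
            work = computes[name][0]
            takers = [p for p in load if p != donor and load[p] + work <= limit]
            if takers:
                target = min(takers, key=load.get)
                assignment[name] = target
                load[donor] -= work
                load[target] += work
                break
        else:
            break  # no compute could be moved: stop
    return assignment, load

computes = {"AA": (4.0, ("A", "A")), "BB": (4.0, ("B", "B")), "AB": (2.0, ("A", "B"))}
assignment, load = greedy_assign(computes, {"A": 0, "B": 1}, {0: 1.0, 1: 1.0})
assignment, load = refine(assignment, computes, load)
print(assignment, load)
```

In the small example the two self-interaction computes land on the processors that already hold their patches, so they add no communication; only the cross-patch compute needs remote data wherever it goes.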

Page 30: Scalable Molecular Dynamics for Large Biomolecular Systems

30

[Figure: two charts of time per processor, each split into migratable work and non-migratable work; the processor axis of the second chart is labeled 0 to 14 and Average]

Page 31: Scalable Molecular Dynamics for Large Biomolecular Systems

32

Initial Speedup Results: ASCI Red

[Figure: Speedup on ASCI Red: Apo-A1. Speedup (0 to 900) versus processors (0 to 1800).]

Page 32: Scalable Molecular Dynamics for Large Biomolecular Systems

33

BC1 complex: 200k atoms

Page 33: Scalable Molecular Dynamics for Large Biomolecular Systems

34

Optimizations

• Series of optimizations
• Examples to be covered here:
  – Grainsize distributions (bimodal)
  – Integration: message sending overheads

Page 34: Scalable Molecular Dynamics for Large Biomolecular Systems

35

Grainsize and Amdahl’s law

• A variant of Amdahl’s law, for objects, would be:
  – The fastest time can be no shorter than the time for the biggest single object!
• How did it apply to us?
  – Sequential step time was 57 seconds
  – To run on 2k processors, no object should take more than 28 msecs.
    • Should be even shorter
  – Grainsize analysis via Projections showed that was not so.
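The 28 msec figure follows from dividing the sequential step time by the processor count, and the object variant of Amdahl's law adds a second bound from the largest object. The sketch below states both; the function names and the toy object times are illustrative, and "2k" is read as 2000 processors.

```python
def ideal_grain_ms(sequential_step_s, procs):
    """Average work per processor, in milliseconds, if the step divided perfectly."""
    return 1000.0 * sequential_step_s / procs

def step_time_lower_bound(object_times, procs):
    """A step can finish no sooner than its biggest object or its average share."""
    return max(max(object_times), sum(object_times) / procs)

print(ideal_grain_ms(57.0, 2000))  # 28.5

# Toy example: one 40 ms object dominates even though the average share is 15 ms.
print(step_time_lower_bound([40.0, 5.0, 5.0, 5.0, 5.0], procs=4))  # 40.0
```

An object right at the average share leaves no room for balancing, which is why the slide adds that objects should be even shorter.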

Page 35: Scalable Molecular Dynamics for Large Biomolecular Systems

36

Grainsize analysis

[Figure: Grainsize distribution. Number of objects (0 to 1000) versus grainsize in milliseconds (1 to 43). A callout marks the “Problem”.]

Solution: split compute objects that may have too much work, using a heuristic based on the number of interacting atoms
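A splitting heuristic of this kind might look like the sketch below. The slides only say the heuristic is based on the number of interacting atoms; estimating work as the product of the two patches' atom counts, slicing the first patch's atoms into chunks, and the `max_pairs` threshold are all assumptions made for illustration.

```python
def split_compute(atoms_a, atoms_b, max_pairs):
    """Split a pairwise compute so each piece covers at most `max_pairs` atom pairs.

    Returns a list of (subset of atoms_a, atoms_b) pieces that together
    cover every pair exactly once.
    """
    if not atoms_a or not atoms_b:
        return [(atoms_a, atoms_b)]
    # Largest slice of atoms_a whose pair count stays within the limit.
    chunk = max(1, max_pairs // len(atoms_b))
    return [(atoms_a[i:i + chunk], atoms_b) for i in range(0, len(atoms_a), chunk)]

pieces = split_compute(list(range(10)), list(range(6)), max_pairs=20)
print([len(a) * len(b) for a, b in pieces])  # pair counts per piece
```

Each piece can then be scheduled and migrated as a separate object, which removes the heavy tail of the grainsize distribution at the cost of a few more objects.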

Page 36: Scalable Molecular Dynamics for Large Biomolecular Systems

37

Grainsize reduced

[Figure: Grainsize distribution after splitting. Number of objects (0 to 1600) versus grainsize in msecs (1 to 25).]

Page 37: Scalable Molecular Dynamics for Large Biomolecular Systems

38

Performance audit

Page 38: Scalable Molecular Dynamics for Large Biomolecular Systems

39

Performance audit

• Throughout the optimization process, an audit was kept to decide where to look to improve performance

Component      Ideal   Actual
Total          57.04   86
nonBonded      52.44   49.77
Bonds           3.16    3.9
Integration     1.44    3.05
Overhead        0       7.97
Imbalance       0      10.45
Idle            0       9.25
Receives        0       1.61

Integration time doubled
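The audit figures can be cross-checked with a few lines of arithmetic: the components should add up to the totals, and the ratio of ideal to actual total gives the parallel efficiency at that point in the tuning. The numbers below are copied from the audit table; the variable names are illustrative.

```python
ideal = {"nonBonded": 52.44, "Bonds": 3.16, "Integration": 1.44}
actual = {
    "nonBonded": 49.77, "Bonds": 3.9, "Integration": 3.05,
    "Overhead": 7.97, "Imbalance": 10.45, "Idle": 9.25, "Receives": 1.61,
}

ideal_total = sum(ideal.values())
actual_total = sum(actual.values())
audit_efficiency = ideal_total / actual_total
integration_growth = actual["Integration"] / ideal["Integration"]

print(f"ideal {ideal_total:.2f}, actual {actual_total:.2f}")
print(f"efficiency {audit_efficiency:.2f}, integration grew {integration_growth:.1f}x")
```

The sums reproduce the 57.04 and 86 in the Total row, and the integration entry is a little more than twice its ideal value, which is the anomaly the next slides investigate.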

Page 39: Scalable Molecular Dynamics for Large Biomolecular Systems

40

Integration overhead analysis

integration

Problem: integration time had doubled from sequential run

Page 40: Scalable Molecular Dynamics for Large Biomolecular Systems

41

Integration overhead example:

• The Projections pictures showed the overhead was associated with sending messages.
• Many cells were sending 30-40 messages.
  – The overhead was still too much compared with the cost of messages.
  – Code analysis: memory allocations!
  – Identical message is being sent to 30+ processors.
• Simple multicast support was added to Charm++
  – Mainly eliminates memory allocations (and some copying)

Page 41: Scalable Molecular Dynamics for Large Biomolecular Systems

42

Integration overhead: After multicast

Page 42: Scalable Molecular Dynamics for Large Biomolecular Systems

43

Improved Performance Data

[Figure: Speedup on ASCI Red. Speedup (0 to 1400) versus processors (0 to 2500).]

Page 43: Scalable Molecular Dynamics for Large Biomolecular Systems

45

Results on Linux Cluster

[Figure: Speedup on Linux Cluster. Speedup (0 to 80) versus processors (0 to 120).]

Page 44: Scalable Molecular Dynamics for Large Biomolecular Systems

46

Performance of Apo-A1 on ASCI Red

[Figure: speedup (0 to 1200) versus processors (0 to 2500).]

Page 45: Scalable Molecular Dynamics for Large Biomolecular Systems

47

Performance of Apo-A1 on O2k and T3E

[Figure: speedup (0 to 250) versus processors (0 to 300).]

Page 46: Scalable Molecular Dynamics for Large Biomolecular Systems

48

Lessons learned

• Need to downsize objects!
  – Choose the smallest possible grainsize that amortizes overhead
• One of the biggest challenges
  – was getting time for performance tuning runs on parallel machines

Page 47: Scalable Molecular Dynamics for Large Biomolecular Systems

49

Future and Planned work

• Speedup on small molecules!
  – Interactive molecular dynamics
• Increased speedups on 2k-10k processors
  – Smaller grainsizes
  – New algorithms for reducing communication impact
  – New load balancing strategies
• Further performance improvements for PME/FMA
  – With multiple timestepping
  – Needs multi-phase load balancing

Page 48: Scalable Molecular Dynamics for Large Biomolecular Systems

50

Steered MD: example picture

Image and Simulation by the theoretical biophysics group, Beckman Institute, UIUC

Page 49: Scalable Molecular Dynamics for Large Biomolecular Systems

51

More information

• Charm++ and associated framework:
  – http://charm.cs.uiuc.edu
• NAMD and associated biophysics tools:
  – http://www.ks.uiuc.edu
• Both include downloadable software

Page 50: Scalable Molecular Dynamics for Large Biomolecular Systems

52

Performance: size of system

System (# of atoms)     Procs          1       2       4       8      16      32      64     128     160
bR (3,762 atoms)        Time        1.14    0.58   0.315   0.158   0.086   0.048
                        Speedup      1.0    1.97    3.61    7.20    13.2    23.7
ER-ERE (36,573 atoms)   Time               6.115   3.099   1.598   0.810   0.397   0.212   0.123   0.098
                        Speedup           (1.97)    3.89    7.54    14.9    30.3    56.8    97.9     123
ApoA-I (92,224 atoms)   Time                       10.76    5.46    2.85    1.47   0.729   0.382   0.321
                        Speedup                   (3.88)    7.64    14.7    28.4    57.3     109     130

Performance data on Cray T3E

Page 51: Scalable Molecular Dynamics for Large Biomolecular Systems

53

Performance: various machines

Machine             Procs          1       2       4       8      16      32      64     128     160     192
T3E                 Time                6.12    3.10    1.60   0.810   0.397   0.212   0.123   0.098
                    Speedup           (1.97)    3.89    7.54    14.9    30.3    56.8    97.9     123
Origin 2000         Time        8.28    4.20    2.17    1.07   0.542   0.271   0.152
                    Speedup      1.0    1.96    3.80    7.74    15.3    30.5    54.3
ASCI Red            Time        28.0    13.9    7.24    3.76    1.91    1.01   0.500   0.279   0.227   0.196
                    Speedup      1.0    2.01    3.87    7.45    14.7    27.9    56.0     100     123     143
NOWs (HP735/125)    Time        24.1    12.4    6.39    3.69
                    Speedup      1.0    1.94    3.77    6.54