
1

CCS: The Center for Computational Sciences

CCS Cray Biology Workshop

Mark R. Fahey
May 9, 2003

2

Goals

• Computational Sciences:
  − Identify one or two important biology codes that are expected to benefit from vectorization
  − Identify people who will work on the optimization of these codes on the X1
• Life Sciences:
  − Hear what Cray has done with informatics algorithms to date
  − Find out what non-traditional facilities can be accessed (i.e., all integer and word manipulation)
• Cray:
  − Understand ORNL's usage of applications in the life sciences; interested in hearing about genomics, proteomics, or systems biology
  − Work with scientists who have bio applications believed to be well suited for the X1 system, and see how Cray might assist in porting and optimizing those codes

3

Agenda

1060 Commerce Park, Conference Room 176

8:00-9:00    Cray X1 Overview, Mark Fahey
9:00-10:00   Biology and Informatics at Cray, Jim Maltby
10:00-10:30  Break
10:30-11:00  Genomes to Life, Al Geist
11:00-11:45  Cross-cutting needs: GTL Facilities and Transparent Access to Supercomputers, Phil Locascio
11:45-12:15  PROSPECT and pathway codes, Ying Xu
12:15-1:15   Lunch

4

Agenda (cont.)

6010 Conference Room

1:15-1:40   PICUPP and gene function codes, Nagiza Samatova
1:40-2:05   Protein complex using Rosetta, Andrey Gorin
2:05-2:20   MD simulation of biological molecules, Pratul Agarwal
2:20-3:30   Discussion
3:30-4:00   Break
4:00-4:30   Biomechanics, Richard Ward
4:30-5:00   Discussion and wrap-up, Mark Fahey

5

CCS Cray Biology Workshop: CCS X1

Mark R. Fahey
faheymr@ornl.gov

Center for Computational Sciences

6


Acknowledgement

Research sponsored by the Mathematical, Information, and Computational Sciences Division, Office of Advanced Scientific Computing Research, U.S. Department of Energy, under Contract No. DE-AC05-00OR22725 with UT-Battelle, LLC.

7

Phase 1 – March 2003

• 32 Vector Processors
  − 8 nodes, each with 4 processors
• 128 GB shared memory
• 8 TB of disk space
• 400 GigaFLOP/s

8

Phase 2 – September 2003

• 256 Vector Processors
  − 64 nodes
• 1 TB shared memory
• 20 TB of disk space
• 3.2 TeraFLOP/s

9

Outline

• Evaluation Plan
• X1 architecture
• CCS X1 status

10


Evaluation Plan

11

Evaluate entire system

• Compare performance with other systems
  − Applications Performance Matrix
• Determine most-effective usage
• Evaluate system-software reliability and performance
• Predict scalability
• Collaborate with Cray on next generation

12

Hierarchical approach

• System software
• Microbenchmarks
• Parallel-paradigm evaluation
• Full applications
• Scalability evaluation

13

System-software evaluation

• Job-management systems
• Mean time between failures
• Mean time to repair
• All problems, with Cray responses
• Scalability and fault tolerance of OS
• Filesystem performance & scalability
• Tuning for HPSS, NFS, and wide-area high-bandwidth networks

14

Microbenchmarking

• Results of standard benchmarks
• Performance metrics of components (see the bandwidth sketch below)
  − Vector & scalar arithmetic
  − Memory hierarchy
  − Message passing
  − Process & thread management
  − I/O primitives
• Models of component performance
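To make the memory-hierarchy measurements concrete, a minimal bandwidth microbenchmark in C is sketched below. It times a STREAM-style triad loop and reports sustained bandwidth; the array size, repeat count, and gettimeofday timer are illustrative assumptions, not details of the actual CCS benchmark suite.

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define N (1 << 22)   /* assumed array length; large enough to exceed the 2 MB cache */
#define NTRIES 10     /* repeat and keep the best time */

static double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1.0e-6 * tv.tv_usec;
}

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    double best = 1.0e30;

    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    for (int t = 0; t < NTRIES; t++) {
        double t0 = wall_seconds();
        for (long i = 0; i < N; i++)      /* triad: two loads and one store per iteration */
            a[i] = b[i] + 3.0 * c[i];
        double dt = wall_seconds() - t0;
        if (dt < best) best = dt;
    }

    /* three arrays of 8-byte doubles are moved each pass */
    printf("triad bandwidth: %.2f GB/s (checksum %g)\n",
           3.0 * 8.0 * N / best / 1.0e9, a[N / 2]);
    free(a); free(b); free(c);
    return 0;
}

The inner loop is exactly the kind of unit-stride, dependence-free kernel the X1's vector hardware is designed to stream.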

15

Programming-paradigm evaluation

• MPI-1, MPI-2 one-sided, SHMEM, Global Arrays, Co-Array Fortran, UPC, OpenMP, MLP, … (see the SHMEM sketch below)
• Identify best techniques for X1
• Develop optimization strategies for applications
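As one example of the one-sided SHMEM model in the list above, here is a minimal C sketch in which each PE writes a block of doubles directly into a neighbor's symmetric array. It uses OpenSHMEM-style names (shmem_init, shmem_double_put, shmem_barrier_all); the Cray SHMEM library of this era used somewhat different initialization calls, so treat this as a generic illustration rather than X1-specific code.

/* One-sided put: each PE deposits data into its right-hand neighbor. */
#include <stdio.h>
#include <shmem.h>

#define N 1024

static double src[N];   /* statically allocated arrays are symmetric in SHMEM */
static double dst[N];

int main(void)
{
    shmem_init();
    int me    = shmem_my_pe();
    int npes  = shmem_n_pes();
    int right = (me + 1) % npes;

    for (int i = 0; i < N; i++) src[i] = me;   /* fill the local source block */

    shmem_double_put(dst, src, N, right);      /* one-sided remote store */
    shmem_barrier_all();                       /* ensure all puts have completed */

    printf("PE %d received dst[0] = %g from PE %d\n",
           me, dst[0], (me - 1 + npes) % npes);

    shmem_finalize();
    return 0;
}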

16

Scalability evaluation

• Hot-spot analysis
  − Inter- and intra-node communication
  − Memory contention
  − Parallel I/O
• Trend analysis for selected communication and I/O patterns
• Trend analysis for kernel benchmarks
• Scalability predictions from performance models and bounds

17

Application tuning & benchmarking

• Full applications of interest to DOE Office of Science
  − Scientific goals require multi-terascale resources
• Evaluation of performance, scaling, and efficiency
• Evaluation of ease/effectiveness of targeted tuning

18

Identifying applications

• Draft evaluation plan
• Prototype Workshop: Nov. 5-6
• Feb 3-5, 2003: Fusion
• Feb 6, 2003: Climate
• March 2, 2003: Materials
• Current audience
  − This workshop…

19

Identifying applications

• Priorities
  − Potential performance and science payoff
• Schedule the pipeline
  − Porting, tuning, production runs
  − Small number of applications in each stage
• Potential application
  − Important to DOE Office of Science
  − Scientific goals require multi-terascale resources
• Potential user
  − Willing and able to learn the X1
  − Knows the application
  − Motivated to tune the application, not just recompile

20


X1 Architecture

21

Cray X1 – Eventual goal

• World's biggest single system
• 4096 processors, one system image
• 50+ TF
• 4-way SMP nodes
• Globally addressable memory among nodes

22

X1 node

• 4 multi-streaming processors (MSPs)
• 51 GF
• 16 GB shared memory (CCS)
• 200 GB/s bandwidth
  − 150 GB/s to local MSPs
  − Extra for remote access

[Node diagram: four MSPs connected through memory controllers to local shared memory, with paths to remote memory]

23

X1: MSP and SSP

• MSP
  − 4 single-streaming processors (SSPs)
    • 12.8 GF (25.6 GF single precision)
  − 2 MB shared cache
    • 51 GB/s load, 26 GB/s store
    • 4-word cache line
• SSP
  − Decoupled scalar and vector units
  − 400 MHz 2-way super-scalar unit
    • 16 kB each data & instruction caches
  − 800 MHz 2-pipe vector unit (see the loop sketch below)
    • 32 registers, 64 values (64-bit) each
    • 3 functional unit groups: add, multiply, divide/sqrt

[MSP diagram: four SSPs sharing a 2 MB cache]
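To connect these numbers to code, the sketch below shows the kind of unit-stride, dependence-free loop that maps naturally onto an SSP's 2-pipe vector unit: each 64-iteration strip fills one vector register, and independent strips can be multistreamed across the four SSPs of an MSP. This is a generic example, not code from the workshop; the compiler flags and directives that control vectorization and multistreaming are omitted.

/* Generic vector-friendly kernel: unit stride, no loop-carried dependence,
 * so the loop can run from vector registers and be split across SSPs. */
void daxpy_like(long n, double alpha, const double * restrict x, double * restrict y)
{
    for (long i = 0; i < n; i++)
        y[i] = y[i] + alpha * x[i];
}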

24

X1: MSP

• Is an MSP one or four processors?
• One!
  − Fast synchronization
  − Shared cache
  − Can act like one 8-pipe processor
• Four!
  − Each SSP can operate independently
  − OpenMP-like directives

25

X1 interconnect

• 3D torus
  − Not full bisection bandwidth
  − Process topology matters
• 12.8 GB/s/MSP
• Memory to memory
• Globally addressable memory
  − Load/store memory on any node
  − Remote address translation
    • On memory's node, not at the processor
    • Avoids TLB misses
    • Requires contiguous processors
  − Cache coherent
    • Only cache local memory

26

Parallel models

• Multistreaming within MSP
• OpenMP within node (late 2003)
• Between nodes (or processors), as in the MPI-2 sketch below
  − MPI-1 two-sided message passing
  − MPI-2 one-sided communication
  − SHMEM one-sided communication
  − Co-Array Fortran remote memory
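As an illustration of the inter-node one-sided model listed above, here is a minimal MPI-2 sketch in C: each rank exposes a window of memory and its neighbor writes into it with MPI_Put between two fences. It is a generic example, not one of the evaluation codes.

#include <stdio.h>
#include <mpi.h>

#define N 8

int main(int argc, char **argv)
{
    double local[N], window[N];
    MPI_Win win;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < N; i++) { local[i] = rank; window[i] = -1.0; }

    /* Expose 'window' so other ranks can write into it directly. */
    MPI_Win_create(window, N * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    /* One-sided write into the right-hand neighbor's window. */
    MPI_Put(local, N, MPI_DOUBLE, (rank + 1) % size, 0, N, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);   /* second fence completes all puts */

    printf("rank %d: window[0] = %g\n", rank, window[0]);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}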

27

CCS X1: Status

• Current state
• What's there, what's not
• Latest performance numbers

28

Status: Current State

• UNICOS/mp 2.1.09
  − Relatively stable
  − Tuesday afternoon was memorable, though
• PE 4.3.0.0
  − C, C++, Co-Array Fortran compilers
• Some open software
• LIBSCI, MPI, SHMEM
• gdb, TotalView CLI, PAT performance tool
• DFS-NFS gateway to home directories
• /scratch scratch space

29

Status: What's there

• LIBSCI
  − Optimized BLAS – some issues
  − LAPACK
  − BLACS/ScaLAPACK (not yet optimized)
  − FFTs
  − NetCDF
    • Unofficially obtained from Cray
• Open software
  − Perl 5.6.1 provided by Cray
  − Tcl/Tk 8.4 built locally

30

Status: What's not there

• Passwords exported from DCE
  − chsh disabled
• Loopmark listings for C (coming fall)
• Integration between psched and PBS (July)
• gcc
• OpenMP not supported yet
• MPI SSP mode
  − Not officially supported yet; in PE 5.2 (Sept.)
• No way to effectively trap exceptions in a code

31

Status: Problems

• Single-precision BLAS
  − CGEMM performance
    • Poor with leading dimension of 2^n (like 1024); see the padding sketch below
• Jobs migrating
• NFS performance issues
• CPES occasionally loses NFS mounts
• Network performance issues
• Compiler (cc and ftn) core dumping
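A common workaround for the power-of-two leading-dimension problem is to pad the arrays so the leading dimension is not 2^n, which avoids pathological strides through the memory banks. The sketch below illustrates the idea for CGEMM called from C; the trailing-underscore cgemm_ symbol and its argument-passing convention are assumptions about the Fortran BLAS binding in LIBSCI, not documented behavior.

#include <stdlib.h>
#include <complex.h>

/* Assumed Fortran-style BLAS binding for single-precision complex GEMM:
 * trailing-underscore name, all arguments passed by reference. */
extern void cgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const float complex *alpha, const float complex *a, const int *lda,
                   const float complex *b, const int *ldb,
                   const float complex *beta, float complex *c, const int *ldc);

int main(void)
{
    int n   = 1024;      /* logical matrix size */
    int lda = n + 1;     /* pad the leading dimension so it is not a power of two */
    float complex alpha = 1.0f, beta = 0.0f;

    /* Allocate with the padded leading dimension; the extra row per column
     * is never touched, it only shifts the memory stride. */
    float complex *a = calloc((size_t)lda * n, sizeof *a);
    float complex *b = calloc((size_t)lda * n, sizeof *b);
    float complex *c = calloc((size_t)lda * n, sizeof *c);

    cgemm_("N", "N", &n, &n, &n, &alpha, a, &lda, b, &lda, &beta, c, &lda);

    free(a); free(b); free(c);
    return 0;
}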

32


Status: Performance

[Slides 33-38: performance plots from the early evaluation; see Tom Dunigan's evaluation page in the references]

39


Problems: Pivoting, matrix scaling

40

Conclusion

• Early results are promising
• Many issues/problems to be resolved
• Definitely not ready for the general population
• Evaluation continues

41

Conclusion: Goals

• Computational Sciences:
  − Identify one or two important biology codes that are expected to benefit from vectorization
  − Identify people who will work on the optimization of these codes on the X1
• Life Sciences:
  − Hear what Cray has done with informatics algorithms to date
  − Demonstrate what non-traditional facilities can be accessed (i.e., all integer and word manipulation)
• Cray:
  − Understand ORNL's usage of applications in the life sciences; interested in hearing about genomics, proteomics, or systems biology
  − Work with scientists who have bio applications believed to be well suited for the X1 system, and see how Cray might assist in porting and optimizing those codes

42


References

• James L. Schwarzmeier. “Cray X1 Architecture and Hardware Overview.” ORNL Cray X1 Tutorial, November 2002.

• Nathan Wichmann. “Coding for Performance on the Cray X1.” ORNL Cray X1 Tutorial, November 2002.

• John M. Levesque. “Message-Passing Paradigms.” ORNL Cray X1 Tutorial, November 2002.

• Arthur S. Bland. “Cray X1 at ORNL.” ORNL Cray X1 Tutorial, November 2002.

• James B. White III. “DOE Evaluation of the Cray X1.” ORNL Cray X1 Tutorial, November 2002.

• James B. White III. “Modern Vector Systems.” ORNL Cray Fusion Workshop, February 2003.

• Plots taken from Tom Dunigan’s early evaluation web page http://www.csm.ornl.gov/~dunigan
