Titanium: A High Performance Language Based on Java (yelick/talks/titanium/hpcs-sp03.pdf), Kathy Yelick, ...

TRANSCRIPT

Page 1: Titanium: A High Performance Language Based on Java

Kathy Yelick, Computer Science Division, EECS, University of California, Berkeley

http://titanium.cs.berkeley.edu/

U.C. Berkeley

Also the UPC project at LBNL: http://upc.nersc.gov

Page 2: Titanium Group (Past and Present)

• Susan Graham
• Katherine Yelick
• Paul Hilfinger
• Phillip Colella (LBNL)
• Alex Aiken
• Greg Balls
• Andrew Begel
• Dan Bonachea
• Kaushik Datta
• David Gay
• Ed Givelberg
• Arvind Krishnamurthy
• Ben Liblit
• Peter McCorquodale (LBNL)
• Sabrina Merchant
• Carleton Miyamoto
• Chang Sun Lin
• Geoff Pike
• Luigi Semenzato (LBNL)
• Jimmy Su
• Tong Wen (LBNL)
• Siu Man Yau

(and many undergrad researchers)

Page 3: Context

• Most parallel programs are written using explicit parallelism, either:
  • Message passing with an SPMD model
    • Usually for scientific applications with C++/Fortran
    • Scales easily
  • Shared memory with threads in C or Java
    • Usually for non-scientific applications
    • Easier to program, but usually provides less scalable performance
• Global Address Space Languages take the best of both
  • global address space like threads (programmability)
  • SPMD parallelism like MPI (performance)
  • local/global distinction, i.e., layout matters (performance)

Page 4: Titanium

• Based on Java, a cleaner C++
  • classes, automatic memory management, etc.
  • compiled to C and then native binary (no JVM)
• Scalable parallelism model
  • SPMD with a global address space
• Optimizing compiler
  • static (compile-time) optimizer, not a JIT
  • communication and memory optimizations
  • synchronization analysis (e.g. static barrier analysis)
  • cache and other uniprocessor optimizations

Page 5: Summary of Features Added to Java

1. Scalable parallelism (Java threads replaced)
2. Multidimensional arrays with iterators
3. Checked Synchronization
4. Immutable (“value”) classes
5. Operator overloading
6. Templates
7. Zone-based memory management (regions)
8. Libraries for collective communication, distributed arrays, bulk I/O

Page 6: Immutable Classes in Titanium

• For small objects, would sometimes prefer
  • to avoid level of indirection
  • pass by value (copying of entire object)
  • especially when immutable -- fields never modified
• Examples:
  • complex type
  • multiple fields (pressure, velocity, force) in a grid
• Titanium introduces immutable classes
  • all fields are final (constant)
  • compiler implements them as above (by value, without indirection)
• Note: considering extension to allow mutation

Page 7: Example of Immutable Classes

• An immutable class has few additions

    immutable class Complex {            // "immutable" is the new keyword
        Complex () {real=0; imag=0; }    // zero-argument constructor required
        ...                              // rest unchanged; no assignment to fields outside of constructors
    }

• Use of immutable complex values

    Complex c1 = new Complex(7.1, 4.3);
    c1 = c1.add(c1);

• Addresses performance and programmability
  • Similar to structs in C in terms of performance
  • Adds support for complex types
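For comparison, here is a minimal sketch of the same idea in plain Java, which has no `immutable` keyword. The class name `ComplexValue` is a hypothetical stand-in: the class and its fields are declared `final`, and `add` returns a new value instead of changing the receiver. Unlike a Titanium immutable class, a Java object like this is still reached through a reference, so the sketch shows the programming model only, not the performance benefit.

```java
// Plain-Java approximation of the slide's immutable Complex (hypothetical name).
final class ComplexValue {
    final double real;   // final: never modified after construction
    final double imag;

    // Zero-argument constructor, mirroring the Titanium requirement.
    ComplexValue() {
        this(0.0, 0.0);
    }

    ComplexValue(double real, double imag) {
        this.real = real;
        this.imag = imag;
    }

    // Returns a new value; the receiver and the argument are left untouched.
    ComplexValue add(ComplexValue other) {
        return new ComplexValue(real + other.real, imag + other.imag);
    }

    public static void main(String[] args) {
        ComplexValue c1 = new ComplexValue(7.1, 4.3);
        c1 = c1.add(c1);
        System.out.println(c1.real + " + " + c1.imag + "i");
    }
}
```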

Page 8: Operator Overloading

• Titanium adds operator overloading, important for readability in scientific code
  • Very similar to operator overloading in C++

    public Complex operator+(Complex c) {
        return new Complex(c.real + real, c.imag + imag);
    }
    Complex c1 = new Complex(7.1, 4.3);
    c1 = c1 + c1;

• Adds to programmability, not performance
  • Must be used judiciously

Page 9: Templates

• Many applications use containers:
  • E.g., arrays parameterized by dimensions, element types
  • Java supports this kind of parameterization through inheritance; Java templates based on this as well
    • May only put Object types into containers
    • Inefficient when used extensively
• Titanium provides a template mechanism closer to that of C++
  • E.g., can instantiate with “double” or “immutable Complex”

Page 10: Example of Templates

    template <class Element> class Stack {
        . . .
        public Element pop() {...}
        public void push( Element arrival ) {...}
    }

    template Stack<int> list = new template Stack<int>();   // int: not an object
    list.push( 1 );
    int x = list.pop();                                      // strongly typed, no dynamic cast

• Addresses programmability and performance
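To make the cost of "Object types only" containers concrete, here is a hedged plain-Java sketch. `IntStack` is a hypothetical hand-written specialization that stands in for what instantiating the template with `int` provides: elements are stored unboxed in an `int[]`, and `pop` returns an `int` without any cast. It is not Titanium code and does not claim to show how the Titanium compiler implements templates.

```java
import java.util.Arrays;

// Hypothetical hand specialization of Stack for int: elements are stored
// unboxed in an int[], so push and pop need no Integer objects and no casts.
final class IntStack {
    private int[] items = new int[8];
    private int size = 0;

    void push(int arrival) {
        if (size == items.length) {
            items = Arrays.copyOf(items, size * 2);   // grow when full
        }
        items[size++] = arrival;
    }

    int pop() {
        if (size == 0) {
            throw new IllegalStateException("pop on empty stack");
        }
        return items[--size];
    }

    boolean isEmpty() {
        return size == 0;
    }

    public static void main(String[] args) {
        IntStack list = new IntStack();
        list.push(1);
        int x = list.pop();   // statically an int: no dynamic cast
        System.out.println(x);
    }
}
```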

Page 11: Multidimensional Arrays

• Arrays in Java are objects
• Only 1D arrays directly supported
• Array bounds are checked
  • Safe but potentially slow
• Multidimensional arrays as arrays-of-arrays
  • General, but may be slow due to memory layout and difficulty of compiler analysis
  • Hand-coding (array libraries) can confuse optimizer

Page 12: Multidimensional Arrays in Titanium

• New kind of multidimensional array added
  • Sub-arrays are supported
  • Indexed by Points (tuple of ints)
• Very expressive sub-array support, e.g.,
  • Can refer to a row or column as a sub-array
  • refer to the boundary region of an array
• Optimized by the compiler for caches
• Addresses programmability and performance
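The contrast with arrays-of-arrays can be sketched in plain Java. In this illustration, `Grid2D` and its methods are hypothetical names, not the Titanium array API. It keeps all elements in one contiguous `double[]` and represents a row or column as a view (offset, strides, extents) over the same storage, so taking a sub-array copies nothing while bounds are still checked.

```java
// Hypothetical 2D array stored in one contiguous double[]. A sub-array is a
// view (offset + strides + extents) over the same storage, not a copy.
final class Grid2D {
    private final double[] data;
    private final int offset;
    private final int rowStride;
    private final int colStride;
    final int rows;
    final int cols;

    Grid2D(int rows, int cols) {
        this(new double[rows * cols], 0, cols, 1, rows, cols);
    }

    private Grid2D(double[] data, int offset, int rowStride, int colStride,
                   int rows, int cols) {
        this.data = data;
        this.offset = offset;
        this.rowStride = rowStride;
        this.colStride = colStride;
        this.rows = rows;
        this.cols = cols;
    }

    // Bounds-checked mapping from (i, j) to a position in the flat storage.
    private int index(int i, int j) {
        if (i < 0 || i >= rows || j < 0 || j >= cols) {
            throw new IndexOutOfBoundsException("(" + i + ", " + j + ")");
        }
        return offset + i * rowStride + j * colStride;
    }

    double get(int i, int j) { return data[index(i, j)]; }

    void set(int i, int j, double v) { data[index(i, j)] = v; }

    // View of the rectangle [i0, i0+h) x [j0, j0+w); shares storage with this.
    Grid2D slice(int i0, int j0, int h, int w) {
        if (i0 < 0 || j0 < 0 || h < 0 || w < 0 || i0 + h > rows || j0 + w > cols) {
            throw new IndexOutOfBoundsException("slice outside array");
        }
        int start = offset + i0 * rowStride + j0 * colStride;
        return new Grid2D(data, start, rowStride, colStride, h, w);
    }

    Grid2D row(int i) { return slice(i, 0, 1, cols); }

    Grid2D column(int j) { return slice(0, j, rows, 1); }

    public static void main(String[] args) {
        Grid2D g = new Grid2D(3, 4);
        g.column(2).set(1, 0, 9.0);          // write through a column view
        System.out.println(g.get(1, 2));     // read back through the full array
    }
}
```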

Page 13: Unordered iteration

• With arrays, Titanium adds unordered iteration
  • Helps compiler with loop analysis
  • Also avoids some indexing details

    foreach (p within A.domain()) { A[p]... }

  • p is a Point (tuple of ints) that can be used to index arrays
  • Works for any dimension array
• Provides programmability and performance
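A rough plain-Java analogue of iterating over a domain of Points is sketched below. `Box` and `forEachPoint` are hypothetical names chosen for this illustration. The same loop works for any number of dimensions because the point is advanced like an odometer instead of through nested `for` loops. This sketch happens to visit points in lexicographic order, whereas the slide's `foreach` leaves the order unspecified, so the body should not depend on it.

```java
import java.util.function.Consumer;

// Hypothetical rectangular index domain [lo, hi] (inclusive) of any dimension.
final class Box {
    private final int[] lo;
    private final int[] hi;

    Box(int[] lo, int[] hi) {
        if (lo.length != hi.length) {
            throw new IllegalArgumentException("rank mismatch");
        }
        this.lo = lo.clone();
        this.hi = hi.clone();
    }

    // Calls body once per point in the domain. The body must not rely on
    // the visit order (this sketch uses lexicographic order).
    void forEachPoint(Consumer<int[]> body) {
        for (int d = 0; d < lo.length; d++) {
            if (lo[d] > hi[d]) {
                return;                          // empty domain
            }
        }
        int[] p = lo.clone();
        while (true) {
            body.accept(p.clone());              // hand out a copy of the point
            int d = p.length - 1;
            while (d >= 0 && p[d] == hi[d]) {    // carry, like an odometer
                p[d] = lo[d];
                d--;
            }
            if (d < 0) {
                return;                          // every coordinate wrapped: done
            }
            p[d]++;
        }
    }

    public static void main(String[] args) {
        Box domain = new Box(new int[] {0, 0}, new int[] {1, 2});
        domain.forEachPoint(p -> System.out.println(p[0] + "," + p[1]));
    }
}
```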

Page 14: Parallelism Model

• Titanium starts a copy of “main” on each processor (SPMD parallelism)
  • Only major restriction to Java semantics
  • Replaced Java’s thread model
  • Many programs written with more general threads do:

        for i = 1 to p fork

  • Handling dynamic thread creation on 1000s of processors is difficult
• Design is purely a performance consideration; dynamic threads are a future direction
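The SPMD style can be emulated in plain Java to show its shape. In this sketch, `Spmd`, `Body`, and `run` are hypothetical names, and Java threads stand in for Titanium processes. A fixed set of threads is created once, every thread runs the same body, and the only thing that distinguishes them is a rank. A barrier separates the phases.

```java
import java.util.concurrent.CyclicBarrier;

// Emulates SPMD: a fixed set of threads is created once, and every thread
// runs the same body, distinguished only by its rank.
final class Spmd {
    interface Body {
        void run(int rank, int numProcs, CyclicBarrier barrier) throws Exception;
    }

    static void run(int numProcs, Body body) {
        CyclicBarrier barrier = new CyclicBarrier(numProcs);
        Thread[] procs = new Thread[numProcs];
        for (int r = 0; r < numProcs; r++) {
            final int rank = r;
            procs[r] = new Thread(() -> {
                try {
                    body.run(rank, numProcs, barrier);
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            procs[r].start();
        }
        for (Thread t : procs) {
            try {
                t.join();                         // wait for every rank to finish
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
        }
    }

    public static void main(String[] args) {
        int[] partial = new int[4];
        int[] total = new int[1];
        run(4, (rank, n, barrier) -> {
            partial[rank] = rank + 1;             // phase 1: each rank fills its own slot
            barrier.await();                      // every rank reaches the same barrier
            if (rank == 0) {                      // phase 2: rank 0 combines the slots
                for (int v : partial) {
                    total[0] += v;
                }
            }
        });
        System.out.println(total[0]);
    }
}
```

Because the thread count is fixed up front, there is no dynamic `fork` to manage, which is the simplification the slide argues for.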

Page 15: Global Address Space

• Shared address space is partitioned
• References (pointers) are either local or global (meaning possibly remote)

[Figure: the global address space spans processes p0, p1, ..., pn. Object heaps are shared and hold objects with fields x and y (x: 1, y: 2; x: 5, y: 6; x: 7, y: 8). Program stacks are private; each holds a local reference l and a global reference g.]

Page 16: Communication

• Titanium has explicit global communication:
  • Broadcast, reduction, etc.
  • Primarily used to set up distributed data structures
• Most communication is implicit through the shared address space
  • Dereferencing a global reference, g.x, can generate communication
  • Arrays have copy operations, which generate bulk communication: A1.copy(A2)
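The difference between fine-grained dereferences and bulk copies can be illustrated with a toy cost model in plain Java. `RemoteArray` is a hypothetical class that simply counts one "message" per remote operation. It is an assumption made for illustration and says nothing about how the Titanium runtime actually batches or schedules communication.

```java
// Toy cost model (not Titanium's runtime): an array owned by another
// processor, where each remote operation is counted as one message.
final class RemoteArray {
    private final double[] data;
    private int messageCount = 0;

    RemoteArray(double[] data) {
        this.data = data.clone();
    }

    // Fine-grained access, like dereferencing a global reference.
    double get(int i) {
        messageCount++;
        return data[i];
    }

    // Bulk access, like an array copy: one message for the whole range.
    void copyTo(double[] local) {
        messageCount++;
        System.arraycopy(data, 0, local, 0, Math.min(local.length, data.length));
    }

    int messages() {
        return messageCount;
    }

    public static void main(String[] args) {
        RemoteArray a = new RemoteArray(new double[1000]);
        double sum = 0;
        for (int i = 0; i < 1000; i++) {
            sum += a.get(i);                      // one message per element
        }
        System.out.println("element-wise messages: " + a.messages());

        RemoteArray b = new RemoteArray(new double[1000]);
        double[] local = new double[1000];
        b.copyTo(local);                          // one message in total
        System.out.println("bulk-copy messages: " + b.messages());
    }
}
```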

Page 17: Distributed Data Structures

• Building distributed arrays:

    Particle [1d] single [1d] allParticle = new Particle [0:Ti.numProcs-1][1d];
    Particle [1d] myParticle = new Particle [0:myParticleCount-1];
    allParticle.exchange(myParticle);

• Now each processor has array of pointers, one to each processor’s chunk of particles

[Figure: processors P0, P1, P2; all-to-all broadcast.]
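What the exchange achieves can be emulated with Java threads. In this sketch, `ExchangeDemo` is a hypothetical name, threads stand in for processors, and a shared `double[][]` stands in for the directory of pointers. Each rank allocates its own chunk, publishes a reference to it in its own slot, and after a barrier every rank can reach every other rank's chunk.

```java
import java.util.concurrent.CyclicBarrier;

// Emulates the effect of allParticle.exchange(myParticle) with threads:
// every rank publishes a reference to its own chunk, and after a barrier
// each rank can reach every chunk through the shared directory.
final class ExchangeDemo {
    // Returns, for each rank, the total number of elements it can reach.
    static int[] run(int numProcs) {
        double[][] all = new double[numProcs][];      // one slot per rank
        int[] visible = new int[numProcs];
        CyclicBarrier barrier = new CyclicBarrier(numProcs);
        Thread[] procs = new Thread[numProcs];
        for (int r = 0; r < numProcs; r++) {
            final int rank = r;
            procs[r] = new Thread(() -> {
                double[] mine = new double[rank + 1]; // chunk sizes differ per rank
                all[rank] = mine;                     // publish my chunk
                try {
                    barrier.await();                  // wait until every slot is filled
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
                for (double[] chunk : all) {
                    visible[rank] += chunk.length;    // every chunk is now reachable
                }
            });
            procs[r].start();
        }
        for (Thread t : procs) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        for (int n : run(3)) {
            System.out.println(n);
        }
    }
}
```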

Page 18: Global Address Space

• Communication through global address space designed for
  • Productivity: explicit representation of distributed data structures
  • Performance: exploits efficient one-sided communication (remote put/get) when it exists
  • Tunability: shared memory style uses more global dereferences; distributed style uses more array copies

Page 19: Region-Based Memory Management

• Extension of Java’s implicit memory management
• Regions are still “safe”, but can avoid or reduce need for distributed garbage collection

    PrivateRegion r = new PrivateRegion();
    for (int j = 0; j < 10; j++) {
        int[] x = new ( r ) int[j + 1];   // allocate x inside region r
        work(j, x);
    }
    try { r.delete(); }                   // frees everything allocated in r
    catch (RegionInUse oops) {
        System.out.println("failed to delete");
    }

• Designed for performance
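The region idea can be modeled in plain Java as a simple arena. `IntRegion`, `pin`, and `unpin` are hypothetical names. The explicit pin count is only a stand-in for the safety check behind `RegionInUse`, and this sketch does not describe how Titanium detects live references. What it does show is the core mechanism: allocations are carved out of one pool by bumping an offset, and a single `delete()` releases all of them at once.

```java
// Toy region (arena): allocations are carved out of one pool by bumping an
// offset, and delete() releases all of them in a single step.
final class IntRegion {
    private int[] pool;
    private int used = 0;
    private int pins = 0;   // stand-in for "references into the region still exist"

    IntRegion(int capacity) {
        pool = new int[capacity];
    }

    // Returns the start offset of a block of `length` ints inside the pool.
    int allocate(int length) {
        if (pool == null) {
            throw new IllegalStateException("region deleted");
        }
        if (length < 0 || used + length > pool.length) {
            throw new IllegalArgumentException("region exhausted");
        }
        int start = used;
        used += length;
        return start;
    }

    void set(int offset, int value) { pool[offset] = value; }

    int get(int offset) { return pool[offset]; }

    void pin() { pins++; }

    void unpin() { pins--; }

    int usedInts() { return used; }

    // Frees every allocation at once; refuses while the region is pinned.
    boolean delete() {
        if (pins > 0) {
            return false;
        }
        pool = null;
        used = 0;
        return true;
    }

    public static void main(String[] args) {
        IntRegion r = new IntRegion(64);
        for (int j = 0; j < 10; j++) {
            int x = r.allocate(j + 1);     // block of j + 1 ints, as on the slide
            r.set(x, j);
        }
        System.out.println("ints in use: " + r.usedInts());
        if (!r.delete()) {
            System.out.println("failed to delete");
        }
    }
}
```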

Page 20: Applications in Titanium

• Several applications and benchmarks
  • Heart simulation
  • Fluid solvers with Adaptive Mesh Refinement (AMR)
  • Dense linear algebra: LU, MatMul
  • Unstructured mesh kernel: EM3D
  • Finite element benchmark
  • Genetics: micro-array selection
  • Tree-structure n-body code

Page 21: 3D AMR Gas Dynamics

• Hyperbolic Solver [McCorquodale and Colella]
  • Implementation of Berger-Colella algorithm
  • Mesh generation algorithm included
• 2D Example (3D supported)
  • Mach-10 shock on solid surface at oblique angle
• Future: Self-gravitating gas dynamics package

Page 22: Heart Simulation

• Immersed Boundary Method [Peskin/McQueen, Yau]
  • Fibers (e.g., heart muscles) modeled by list of fiber points
  • Fluid space modeled by a regular lattice
• Irregular fiber lists need to interact with regular fluid lattice
  • Trade-off between load balancing of fibers and minimizing communication
  • memory and communication intensive
• Uses several parallel numerical kernels
  • Navier-Stokes solver
  • 3-D FFT solver
  • Soon to be enhanced using an adaptive multigrid solver (possibly written in KeLP)

Page 23: Productivity Measures

• Performance
• Programmability
• Robustness
• Portability

Page 24: Serial Performance (Pure Java)

[Figure: bar chart, “Performance on a Pentium IV (1.5GHz)”. Y-axis: MFlops, 0 to 450. Benchmarks: Overall, FFT, SOR, MC, Sparse, LU. Series: java, C (gcc -O6), Ti, Ti -nobc.]

Note the Ti/Java numbers use Java arrays, not Titanium arrays

Ti -nobc is with bounds-checking disabled

Page 25: Parallel Performance and Scalability

• Poisson solver using “Method of Local Corrections”
• Communication < 5%; Scaled speedup nearly ideal (flat)

[Figures: results on the IBM SP at SDSC and the Cray T3E at NERSC.]

Page 26: Performance Tuning by Compiler

[Figure: “Scale performance”, a scaled version of GUPS. X-axis: percentage of remote accesses (5, 10, 30, 50, 80, 90, 95). Y-axis: time (seconds), 0 to 12. Series: 1 thread, 2 thread, 4 thread, each also shown with write pipeline and with read pipeline.]

Advantage of compiled languages (Berkeley UPC compiler)

Page 27: Programmability

• Heart simulation developed in ~1 year
  • Extended to support 2D structures for Cochlea model in ~1 month
• Preliminary code length measures
  • Simple torus model
    • Serial torus code is 17045 lines long (2/3 comments)
    • Parallel Titanium torus version is 3057 lines long.
  • Full heart model
    • Shared memory Fortran heart code is 8187 lines long
    • Parallel Titanium version is 4249 lines long.
  • Needs to be analyzed more carefully, but not a significant overhead for distributed memory parallelism

Page 28: Robustness

• Robustness is the primary motivation for language “safety” in Java
  • Type-safe, array bounds checked, auto memory management
  • Study on C++ vs. Java from Phipps at Spirus:
    • C++ has 2-3x more bugs per line than Java
    • Java had 30-200% more lines of code per minute
• Extended in Titanium
  • Checked synchronization avoids barrier deadlocks
  • More abstract array indexing, retains bounds checking
• No attempt to quantify this for Titanium yet
  • Would like to measure speed of error detection (compile time, runtime exceptions, etc.)

Page 29: Portability

• Heart code and other applications run anywhere Titanium runs
  • Runs on serial or shared memory machines with native C compiler
    • Including my laptop!
    • Very important for programmer productivity
  • For distributed memory, requires communication layer
    • Alpha/Quadrics, IBM SP, Cray T3E, PC/Myrinet, anything with MPI
    • Global Address Space Networking layer (GASNet)
      – With C compiler, get Titanium and LBNL/UPC compilers
• FFTW used in heart code: strategy for performance and portability

Page 30: Performance and Portability Approach

• Use machines, not humans for architecture-specific tuning
  • Code generation + search-based selection
  • Can adapt to cache size, # registers, network buffering
• Used in
  • Signal processing: FFTW, SPIRAL, UHFFT
  • Dense linear algebra: Atlas, PHiPAC
  • Sparse linear algebra: Sparsity
  • Rectangular grid-based computations: Titanium compiler
  • Global communication: Atlas-derivative
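The "search-based selection" half of this approach can be sketched in a few lines of plain Java. `TileSearch`, `transposeBlocked`, and `pickTile` are hypothetical names, and the kernel (a blocked matrix transpose with one tunable tile size) is chosen only for brevity. Real autotuners such as those listed above also generate code variants and use far more careful timing. This sketch only times each candidate on the current machine and keeps the fastest.

```java
// Empirical search over one tuning parameter (tile size) for a blocked
// matrix transpose. The machine, not the programmer, picks the value.
final class TileSearch {
    // Transposes the n x n row-major matrix src into dst, tile by tile.
    static void transposeBlocked(double[] src, double[] dst, int n, int tile) {
        for (int ii = 0; ii < n; ii += tile) {
            for (int jj = 0; jj < n; jj += tile) {
                int iMax = Math.min(ii + tile, n);
                int jMax = Math.min(jj + tile, n);
                for (int i = ii; i < iMax; i++) {
                    for (int j = jj; j < jMax; j++) {
                        dst[j * n + i] = src[i * n + j];
                    }
                }
            }
        }
    }

    // Times each candidate tile size and returns the fastest on this machine.
    static int pickTile(int n, int[] candidates, int repeats) {
        double[] src = new double[n * n];
        double[] dst = new double[n * n];
        int best = candidates[0];
        long bestTime = Long.MAX_VALUE;
        for (int tile : candidates) {
            transposeBlocked(src, dst, n, tile);          // warm-up run
            long start = System.nanoTime();
            for (int r = 0; r < repeats; r++) {
                transposeBlocked(src, dst, n, tile);
            }
            long elapsed = System.nanoTime() - start;
            if (elapsed < bestTime) {
                bestTime = elapsed;
                best = tile;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int best = pickTile(512, new int[] {4, 8, 16, 32, 64}, 5);
        System.out.println("selected tile size: " + best);
    }
}
```

The selected tile size will vary between machines and even between runs, which is the point: the choice adapts to the cache hierarchy it is measured on.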

Page 31: Summary

• Titanium designed for performance and programmability
  • Some compromises (regions, local/global refs)
• Retains robustness (safety) of Java
  • Also a big help in learning, avoiding certain kinds of bugs
• Tunability and performance transparency
  • Aggressive automatic optimizations can make this worse
• Advertising (all open source):
  • Titanium compiler: http://titanium.cs.berkeley.edu
  • Berkeley UPC compiler: http://upc.nersc.gov
  • Automatic tuning: http://www.cs.berkeley.edu/~richie/bebop