
Page 1: Stanford Streaming Supercomputer

Stanford Streaming Supercomputer

Eric Darve

Mechanical Engineering Department

Stanford University

[Diagram: SSS node block diagram. A Stream Register File feeds clusters 0–15 through an inter-cluster crossbar; a micro-controller sequences the clusters; a scalar processor with a scalar cache, a stream controller, and a memory system/network connect to DRAM.]

Page 2: Stanford Streaming Supercomputer


Overview of Streaming Project

• Main PIs:
  – Pat Hanrahan, [email protected]
  – Bill Dally, [email protected]

• Objectives:
  – Cost/Performance: 100:1 compared to clusters.
  – Programmable: applicable to a large class of scientific applications.
  – Porting and developing new code made easier: stream language, support of legacy codes.

Page 3: Stanford Streaming Supercomputer


Performance/Cost

Item               Cost ($)   Per Node ($)
Processor chip          200            200
Router chip             200             50
Memory chip              20            320
Board/Backplane       3,000            188
Cabinet              50,000             49
Power                     1             50
Per-Node Cost                          976

Cost estimate: about $1K/node. Preliminary numbers, parts cost only, no I/O included. Expect 2x to 4x to account for margin and I/O.
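Reading the two columns together (an inference from the numbers, not stated on the slide): Cost is the unit price of each part, and Per Node amortizes it over however many nodes share that part, e.g.

$$16 \text{ memory chips} \times \$20 = \$320/\text{node}, \qquad \$50{,}000 \text{ cabinet} \,/\, \sim\!1024 \text{ nodes} \approx \$49/\text{node}.$$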

Page 4: Stanford Streaming Supercomputer


FOR IMMEDIATE RELEASE
October 21, 2002

Sandia National Laboratories and Cray Inc. finalize $90 million contract for new supercomputer

Collaboration on Red Storm System under Department of Energy’s Advanced Simulation and Computing Program (ASCI)

ALBUQUERQUE, N.M. and SEATTLE, Wash. — The Department of Energy’s Sandia National Laboratories and Cray Inc. (Nasdaq NM: CRAY) today announced that they have finalized a multiyear contract, valued at approximately $90 million, under which Cray will collaborate with Sandia to develop and deliver a new massively parallel processing (MPP) supercomputer called Red Storm. In June 2002, Sandia reported that Cray had been selected for the award, subject to successful contract negotiations.

Page 5: Stanford Streaming Supercomputer


Performance/Cost Comparisons

• Earth Simulator (today)
  – Peak 40 TFLOPS, ~$450M
  – 0.09 MFLOPS/$
  – Sustained 0.03 MFLOPS/$

• Red Storm (2004)
  – Peak 40 TFLOPS, ~$90M
  – 0.44 MFLOPS/$

• SSS (proposed 2006)
  – Peak 40 TFLOPS, < $1M
  – 128 MFLOPS/$
  – Sustained 30 MFLOPS/$ (single node)

• Numbers are sketchy today, but even if we are off by 2x, the improvement over the status quo is large.
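For concreteness, the first two figures are just peak rate divided by system price:

$$\text{ES: } \frac{40\times10^{12}\ \text{FLOPS}}{\$450\times10^{6}} \approx 0.09\ \text{MFLOPS}/\$, \qquad \text{Red Storm: } \frac{40\times10^{12}\ \text{FLOPS}}{\$90\times10^{6}} \approx 0.44\ \text{MFLOPS}/\$.$$

The SSS figure appears to be per node: 128 GFLOPS per node at the ~$1K/node cost from the earlier slide gives 128 MFLOPS/$. That reading is an inference, not stated in the talk.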

Page 6: Stanford Streaming Supercomputer


[Chart: GFLOPS for the Earth Simulator, Red Storm, ASCI machines, a desktop SSS, and the full SSS.]

Page 7: Stanford Streaming Supercomputer


How did we achieve that?

Page 8: Stanford Streaming Supercomputer


VLSI Makes Computation Plentiful

VLSI: very large-scale integration. This is the current level of computer microchip miniaturization (microchips containing hundreds of thousands of transistors).

• Abundant, inexpensive arithmetic

– Can put 100s of 64-bit ALUs on a chip

– 20pJ per FP operation

• (Relatively) high off-chip bandwidth

– 1Tb/s demonstrated, 2nJ per word off chip

• Memory is inexpensive: $100/GByte

[Images: NVIDIA GeForce4, ~120 GFLOPS, ~1.2 Tops/sec; Velio VC3003, 1 Tb/s I/O bandwidth.]

Page 9: Stanford Streaming Supercomputer


But VLSI imposes some constraints

Current architecture: few ALUs per chip, which means expensive and limited performance.

Objective for the SSS architecture:
• Keep hundreds of ALUs per chip busy.

Difficulties:
• Locality of data: we need to match ~20 Tb/s of ALU bandwidth to ~100 Gb/s of chip bandwidth.
• Latency tolerance: cover ~500-cycle remote memory access times.

Arithmetic is cheap, global bandwidth is expensive: local << global on-chip << off-chip << global system.

[Images: a 64-bit ALU drawn to scale against a full chip; the architecture of the Pentium 4.]
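The locality requirement can be quantified: each word brought over the pins must be reused on chip roughly

$$\frac{20\ \text{Tb/s (ALU demand)}}{100\ \text{Gb/s (pin bandwidth)}} = 200\ \text{times}.$$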

Page 10: Stanford Streaming Supercomputer


The Stream Model exposes parallelism and locality in applications

• Streams of records passing through kernels
• Parallelism
  – Across stream elements
  – Across kernels
• Locality
  – Within kernels
  – Producer-consumer locality between kernels (see the C sketch after the diagram)

[Diagram: a grid of cells and an index stream feed a chain of kernels: K1 (12 words I/O, 50 ops), K2 (14 words I/O, 100 ops, with a table lookup), K3 (15 words I/O, 70 ops), and K4 (9 words I/O, 80 ops), writing results back to the grid of cells.]
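As a minimal illustration of both properties (plain C, not Brook or SSS code; k1 and k2 are hypothetical kernel functions):

/* Hypothetical sketch: kernel K1's output feeds kernel K2 directly.
   In a stream processor the intermediate t stays in on-chip registers
   (producer-consumer locality), and iterations are independent
   (parallelism across stream elements). */
for (int i = 0; i < n; i++) {
    float t = k1(in[i]);   /* K1 produces a record */
    out[i]  = k2(t);       /* K2 consumes it immediately */
}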

Page 11: Stanford Streaming Supercomputer


Streams match scientific computation to constraints of VLSI

Stream program matches application to Bandwidth Hierarchy 32:4:1

[Diagram: the same stream program mapped onto the hierarchy Memory → Stream Cache → Stream Register File → Local Registers. The labels indicate roughly 300 ops and 900 words of local-register traffic across kernels K1–K4, with the table, index, and result streams moving ~58 words through the SRF and ~12 words (plus ~9.5 words) through memory per grid cell.]

Page 12: Stanford Streaming Supercomputer


Scientific programs stream well

Bandwidth by polynomial order (Euler equations), in GB/s:

Polynomial order   Mem BW   SRF BW   LRF BW
Constant               17       47      856
Linear                 16       71     1289
Quadratic              11       83     1372
Cubic                   7       93     1447

StreamFEM results show L:S:M ratios of 206:13:1 to 50:3:1
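The quoted ratios follow from the endpoints of the table:

$$\text{Constant: } 856 : 47 : 17 \approx 50 : 3 : 1, \qquad \text{Cubic: } 1447 : 93 : 7 \approx 206 : 13 : 1.$$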

Page 13: Stanford Streaming Supercomputer


BW Hierarchy of SSS

[Diagram: the SSS bandwidth hierarchy. DRAM (16 GB/s) feeds stream cache banks 0–7 (64 GB/s), which feed the Stream Register File (512 GB/s), which feeds the functional units in clusters 0–15 (3,840 GB/s).]
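In ratio form the hierarchy is $16 : 64 : 512 : 3840 = 1 : 4 : 32 : 240$; the SRF : stream cache : DRAM portion, $512 : 64 : 16 = 32 : 4 : 1$, matches the 32:4:1 figure quoted on the earlier bandwidth-hierarchy slide.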

Page 14: Stanford Streaming Supercomputer


Stream processor = Vector processor + Local registers

• Like a vector processor, a stream processor:
  – Amortizes instruction overhead over the records of a stream
  – Hides latency by loading (storing) streams of records
  – Can exploit producer-consumer locality at the SRF (VRF) level

• Stream processors add local registers and microcoded kernels:
  – >90% of all references come from local registers
    • Increases the effective bandwidth and capacity of the SRF (VRF) by 10x
    • Enables 10x the number of ALUs
    • Enables the SRF to capture the working set

[Diagram: vector processor: Memory → Vector Register File (1x : 10x bandwidth); stream processor: Memory → Stream Register File → Local Registers (1x : 10x : 100x).]

Page 15: Stanford Streaming Supercomputer


Brook: streaming language

• C with streaming
  – Makes data parallelism explicit
  – Declares the communication pattern

• Streams
  – A view of records in memory
  – Operated on in parallel
  – Accessing stream values is not permitted outside of kernels

Page 16: Stanford Streaming Supercomputer


Brook Kernels

• Kernels
  – Functions which operate only on streams
    • Stream arguments are read-only or write-only
    • Reduction variables (associative operations only; see the sketch below)
  – Restricted communication between records
    • No state or “static” variables
    • No global memory access
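As a sketch of a reduction variable, modeled on the UpdatePosition kernel on the next slide (hypothetical: the talk does not show Brook's reduction syntax, so the reduce qualifier and this kernel are assumptions):

/* Hypothetical Brook kernel: sums an associative quantity over a stream.
   Associativity lets partial sums be formed in any order across clusters. */
kernel void KineticEnergy ( Vectors sVel, const float mass,
                            reduce float ke )
{
    ke += 0.5f * mass * ( sVel.x * sVel.x
                        + sVel.y * sVel.y
                        + sVel.z * sVel.z );
}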

Page 17: Stanford Streaming Supercomputer


Brook Example: Molecular Dynamics

struct Vector { float x, y, z; };

typedef stream struct Vector Vectors;

kernel void UpdatePosition ( Vectors sPos, Vectors sVel,
                             const float timestep,
                             out Vectors sNewPos )
{
    sNewPos.x = sPos.x + timestep * sVel.x;
    sNewPos.y = sPos.y + timestep * sVel.y;
    sNewPos.z = sPos.z + timestep * sVel.z;
}

Page 18: Stanford Streaming Supercomputer


struct Vector { float x, y, z; };

typedef stream struct Vector Vectors;

void main () {
    struct Vector Pos[MAX] = { … };
    struct Vector Vel[MAX] = { … };

    Vectors sPos, sVel, sNewPos;

    streamLoad (sPos, Pos, MAX);
    streamLoad (sVel, Vel, MAX);
    UpdatePosition (sPos, sVel, 0.2f, sNewPos);

    streamStore (sNewPos, Pos);
}

Page 19: Stanford Streaming Supercomputer


StreamMD: motivation

• Application: study the folding of human proteins.

• Molecular dynamics: computer simulation of the dynamics of macromolecules.

• Why this application?
  – Expect high arithmetic intensity.
  – Requires variable-length neighbor lists.
  – Molecular dynamics can also be used in engine simulation to model sprays, e.g. droplet formation and breakup, drag, and droplet deformation.

• Test case chosen for initial evaluation: a box of water molecules.

[Images: a DNA molecule; the human immunodeficiency virus (HIV).]

Page 20: Stanford Streaming Supercomputer


Numerical Algorithm

• The interaction between atoms is modeled by the potential energy associated with each configuration. It includes:
  – Chemical bond potentials.
  – Electrostatic interactions.
  – Van der Waals interactions.

• Newton’s second law of motion is used to compute the trajectories of all atoms:

• Velocity Verlet time integrator (leap-frog):
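The slide’s equations did not survive the transcript; the standard forms, consistent with the surrounding text, are

$$m_i\,\ddot{\mathbf{r}}_i = \mathbf{F}_i = -\nabla_{\mathbf{r}_i} V(\mathbf{r}_1,\dots,\mathbf{r}_N)$$

for the equations of motion, and for the velocity Verlet integrator

$$\mathbf{r}_i(t+\Delta t) = \mathbf{r}_i(t) + \Delta t\,\mathbf{v}_i(t) + \frac{\Delta t^2}{2 m_i}\,\mathbf{F}_i(t), \qquad \mathbf{v}_i(t+\Delta t) = \mathbf{v}_i(t) + \frac{\Delta t}{2 m_i}\left[\mathbf{F}_i(t) + \mathbf{F}_i(t+\Delta t)\right].$$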

Page 21: Stanford Streaming Supercomputer


High-Level Implementation in Brook

• A cutoff is used to compute the non-bonded forces: two particles do not interact if they are separated by more than the cutoff radius.

• A gridding technique is used to accelerate the search for all atoms within the cutoff radius.

• A stream of variable length is associated with each cell of the grid: it contains all the water molecules inside the cell.

• High-level Brook functionality is used (see the sketch after the diagram):
  – streamGatherOP: used to construct the list of all water molecules inside a cell.
  – streamScatterOP: used to reduce the partial forces computed for each molecule.

[Diagram: GatherOP moves records from Memory into the SRF while incrementing a count (n++); ScatterOP writes records back to Memory, combining each with the stored value (f = f + g).]
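A plain-C sketch of the two patterns (the streamGatherOP/streamScatterOP signatures are not shown in the talk; the loops below, and all the variable names, are assumptions illustrating only the semantics in the diagram):

/* GatherOP: pack the molecules belonging to grid cell c into a
   contiguous stream, counting as we go (the n++ in the diagram). */
int n = 0;
for (int i = 0; i < nMolecules; i++)
    if (cellOf[i] == c)
        packed[n++] = molecules[i];

/* ScatterOP: accumulate each partial force into its home molecule
   (the f = f + g combine in the diagram). */
for (int k = 0; k < n; k++)
    force[home[k]] += partialForce[k];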

Page 22: Stanford Streaming Supercomputer


StreamMD Results

[Figure: VLIW instruction schedules for the StreamMD inner kernel, one panel for Imagine and one for the SSS. Each panel is a dense grid of FADD/FSUB/FMUL/FDIV/FSQRT (and FINVSQRT_LOOKUP) operations against cycle ticks, roughly 100–340 on one panel and 120–260 on the other, with nearly every ALU slot filled.]

• Preliminary schedule obtained using the Imagine architecture:
  – High arithmetic intensity: all ALUs are kept busy, so the achieved GFLOPS rate is expected to be very high.
  – SRF bandwidth is sufficient: about 1 word per 30 instructions.

• These results helped guide architectural decisions for the SSS.

Page 23: Stanford Streaming Supercomputer


Observations

• Arithmetic intensity is sufficient: bandwidth is not going to be the limiting factor in these applications, and the computation can be organized naturally in a streaming fashion.

• The interaction between the application developers and the language development group has helped ensure that Brook can be used to code real scientific applications.

• The architecture has been refined in the process of evaluating these applications.

• Implementation is much easier than with MPI: Brook hides all the parallelization complexity from the user, the code is very clean and easy to understand, and the streaming versions of these applications are in the range of 1,000–5,000 lines of code.

Page 24: Stanford Streaming Supercomputer


A GPU is a stream processor

• The GPU on a graphics card is a streaming processor.

• NVIDIA recently announced that their latest graphics card, the NV30, will be programmable and capable of delivering 51 GFLOPS peak performance (versus 1.6 GFLOPS for a Pentium 4).

Can we use this computing power for scientific applications?

Page 25: Stanford Streaming Supercomputer


Cg: Assembly or High-level?

Assembly…

DP3 R0, c[11].xyzx, c[11].xyzx;
RSQ R0, R0.x;
MUL R0, R0.x, c[11].xyzx;
MOV R1, c[3];
MUL R1, R1.x, c[0].xyzx;
DP3 R2, R1.xyzx, R1.xyzx;
RSQ R2, R2.x;
MUL R1, R2.x, R1.xyzx;
ADD R2, R0.xyzx, R1.xyzx;
DP3 R3, R2.xyzx, R2.xyzx;
RSQ R3, R3.x;
MUL R2, R3.x, R2.xyzx;
DP3 R2, R1.xyzx, R2.xyzx;
MAX R2, c[3].z, R2.x;
MOV R2.z, c[3].y;
MOV R2.w, c[3].y;
LIT R2, R2;
...


Cg

COLOR cPlastic = Ca + Cd * dot(Nf, L) + Cs * pow(max(0, dot(Nf, H)), phongExp);


or PhongShader

Page 26: Stanford Streaming Supercomputer


Cg uses separate vertex and fragment programs

[Diagram: the Cg pipeline. The application feeds a programmable Vertex Processor, then Assembly & Rasterization, then a programmable Fragment Processor (with access to Textures), then Framebuffer Operations, and finally the Framebuffer; separate programs are loaded onto the vertex and fragment processors.]

Page 27: Stanford Streaming Supercomputer


Characteristics of NV30 & Cg

• Characteristics of the GPU:
  – Optimized for 4-vector arithmetic
  – Cg has vector data types and operations, e.g. float2, float3, float4
  – Cg also has matrix data types, e.g. float3x3, float3x4, float4x4

• Some math:
  – sin/cos/etc.
  – normalize
  – Dot product: dot(v1, v2);
  – Matrix multiply:
    • matrix-vector: mul(M, v); // returns a vector
    • vector-matrix: mul(v, M); // returns a vector
    • matrix-matrix: mul(M, N); // returns a matrix

Page 28: Stanford Streaming Supercomputer

Example: MD

Innermost loop in C: computation of the LJ and Coulomb interactions.

for (k = nj0; k < nj1; k++) {        /* loop over indices in neighbor list */
    jnr = jjnr[k];                   /* get index of next j atom (array load) */
    j3  = 3*jnr;                     /* calc j atom index in coord & force arrays */
    jx  = pos[j3];                   /* load x,y,z coordinates for j atom */
    jy  = pos[j3+1];
    jz  = pos[j3+2];
    qq  = iq*charge[jnr];            /* load j charge and calc. product */
    dx  = ix - jx;                   /* calc vector distance i-j */
    dy  = iy - jy;
    dz  = iz - jz;
    rsq = dx*dx + dy*dy + dz*dz;     /* calc square distance i-j */
    rinv   = 1.0/sqrt(rsq);          /* 1/r */
    rinvsq = rinv*rinv;              /* 1/(r*r) */
    vcoul  = qq*rinv;                /* potential from this interaction */
    fscal  = vcoul*rinvsq;           /* scalar force / |dr| */
    vctot += vcoul;                  /* add to temporary potential variable */
    fix   += dx*fscal;               /* add to i atom temporary force variable */
    fiy   += dy*fscal;               /* F = dr * scalarforce / |dr| */
    fiz   += dz*fscal;
    force[j3]   -= dx*fscal;         /* subtract from j atom forces */
    force[j3+1] -= dy*fscal;
    force[j3+2] -= dz*fscal;
}

Page 29: Stanford Streaming Supercomputer


Inner loop in Cg

/* Find the indices of the j atoms */
jnr = f4tex1D (jjnr, k);

/* Get the atom positions */
j1 = f3tex1D (pos, jnr.x);
j2 = f3tex1D (pos, jnr.y);
j3 = f3tex1D (pos, jnr.z);
j4 = f3tex1D (pos, jnr.w);

We compute four interactions at a time to take advantage of the high performance of vector arithmetic. The atom coordinates are fetched from data stored as a texture.

Page 30: Stanford Streaming Supercomputer

/* Get the vectorial distance, and r^2 */
d1 = i - j1;
d2 = i - j2;
d3 = i - j3;
d4 = i - j4;

rsq.x = dot(d1, d1);
rsq.y = dot(d2, d2);
rsq.z = dot(d3, d3);
rsq.w = dot(d4, d4);

/* Calculate 1/r */
rinv.x = rsqrt(rsq.x);
rinv.y = rsqrt(rsq.y);
rinv.z = rsqrt(rsq.z);
rinv.w = rsqrt(rsq.w);

The squared distances use the built-in dot product for float3 arithmetic; 1/r uses the built-in rsqrt function.

Page 31: Stanford Streaming Supercomputer

/* Calculate interactions (highly efficient float4 arithmetic) */
rinvsq  = rinv * rinv;
rinvsix = rinvsq * rinvsq * rinvsq;

vnb6   = rinvsix * temp_nbfp;
vnb12  = rinvsix * rinvsix * temp_nbfp;
vnbtot = vnb12 - vnb6;

qq    = iqA * temp_charge;
vcoul = qq * rinv;
fs    = (12.0f * vnb12 - 6.0f * vnb6 + vcoul) * rinvsq;
vctot = vcoul;

/* Calculate the vectorial force and update the local i atom force */
fi1 = d1 * fs.x;
fi2 = d2 * fs.y;
fi3 = d3 * fs.z;
fi4 = d4 * fs.w;

This is the force computation.
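For clarity (a standard derivation added here, consistent with the code): with $v_{12} = C_{12} r^{-12}$, $v_6 = C_6 r^{-6}$, and $v_c = qq/r$, the force on atom i is

$$\mathbf{F} = -\frac{dV}{dr}\,\frac{\mathbf{r}}{r} = \bigl(12\,v_{12} - 6\,v_6 + v_c\bigr)\,\frac{\mathbf{r}}{r^{2}},$$

which is exactly fs * d with fs = (12.0f*vnb12 - 6.0f*vnb6 + vcoul) * rinvsq.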

Page 32: Stanford Streaming Supercomputer

ret_prev.fi_with_vtot.xyz += fi1 + fi2 + fi3 + fi4;

ret_prev.fi_with_vtot.w += dot(vnbtot, float4(1, 1, 1, 1))
                         + dot(vctot,  float4(1, 1, 1, 1));

The return type is:

struct inner_ret { float4 fi_with_vtot; };

It contains the x, y, and z components of the force together with the total energy: the first statement accumulates the total force due to the 4 interactions, and the second accumulates the total potential energy for this particle.

Page 33: Stanford Streaming Supercomputer


Conclusion

• Three representative applications show high bandwidth ratios: StreamMD, StreamFLO, StreamFEM.

• The feasibility of streaming is established for scientific applications: high arithmetic intensity, and the bandwidth hierarchy is sufficient.

• Available today: the NVIDIA NV30 graphics card.

• Future work:
  – StreamMD to GROMACS (Folding@Home)
  – StreamFEM and StreamFLO to 3D
  – Multinode versions of all applications
  – Sparse solvers for implicit time-stepping
  – Adaptive meshing
  – Numerics