Stanford Streaming Supercomputer


  • Stanford Streaming Supercomputer
    Eric Darve, Mechanical Engineering Department, Stanford University

    [Block diagram: an SSS node. A scalar processor with scalar cache, a micro-controller, and a stream controller sit beside a Stream Register File, which feeds arithmetic clusters 0-15 through an inter-cluster crossbar and connects to DRAM through the memory system and network.]

  • Overview of Streaming Project
    Main PIs:
    Pat Hanrahan, hanrahan@graphics.stanford.edu
    Bill Dally, billd@csl.stanford.edu

    Objectives:
    Cost/Performance: 100:1 compared to clusters.
    Programmable: applicable to a large class of scientific applications.
    Porting and developing new code made easier: stream language, support for legacy codes.


  • Performance/Cost
    Cost estimate: about $1K per node.
    Preliminary numbers, parts cost only, no I/O included. Expect 2x to 4x to account for margin and I/O.

    Item              Part Cost ($)   Per Node ($)
    Processor chip    200             200
    Router chip       200             50
    Memory chip       20              320
    Board/Backplane   3,000           188
    Cabinet           50,000          49
    Power                             150
    Per-Node Cost                     976
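    Reading the table, the per-node column appears to amortize each shared part over the nodes sharing it; this is my inference, not stated on the slide, but the numbers are consistent:

    \[
    \frac{\$3{,}000}{\$188/\text{node}} \approx 16 \ \text{nodes per board}, \qquad
    \frac{\$50{,}000}{\$49/\text{node}} \approx 1020 \ \text{nodes per cabinet},
    \]

    with 16 memory chips per node ($20 x 16 = $320) and one router shared by 4 nodes ($200 / 4 = $50).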


  • News Center

    FOR IMMEDIATE RELEASE
    October 21, 2002

    Sandia National Laboratories and Cray Inc. finalize $90 million contract for new supercomputer

    Collaboration on Red Storm system under the Department of Energy's Advanced Simulation and Computing Program (ASCI)

    ALBUQUERQUE, N.M., and SEATTLE, Wash. The Department of Energy's Sandia National Laboratories and Cray Inc. (Nasdaq NM: CRAY) today announced that they have finalized a multiyear contract, valued at approximately $90 million, under which Cray will collaborate with Sandia to develop and deliver a new massively parallel processing (MPP) supercomputer called Red Storm. In June 2002, Sandia reported that Cray had been selected for the award, subject to successful contract negotiations.


  • Performance/Cost Comparisons
    Earth Simulator (today): peak 40 TFLOPS, ~$450M. 0.09 MFLOPS/$ peak, 0.03 MFLOPS/$ sustained.
    Red Storm (2004): peak 40 TFLOPS, ~$90M. 0.44 MFLOPS/$ peak.
    SSS (proposed 2006): peak 40 TFLOPS, < $1M. 128 MFLOPS/$ peak, 30 MFLOPS/$ sustained (single node).
    Numbers are sketchy today, but even if we are off by 2x, the improvement over the status quo is large.
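    As a sanity check on the peak figures (simple division over the quoted prices):

    \[
    \frac{40 \times 10^{6}\ \text{MFLOPS}}{\$450 \times 10^{6}} \approx 0.09\ \text{MFLOPS}/\$, \qquad
    \frac{40 \times 10^{6}\ \text{MFLOPS}}{\$90 \times 10^{6}} \approx 0.44\ \text{MFLOPS}/\$.
    \]

    The 128 MFLOPS/$ figure for SSS matches a per-node reading: roughly 128 GFLOPS of peak per node at the ~$1K/node parts cost estimated earlier (my inference, not an explicit statement on the slide).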


    [Chart: GFLOPS compared across ES, Red Storm, Desktop SSS, SSS, and the ASCI machines.]


  • How did we achieve that?


  • VLSI Makes Computation Plentiful
    VLSI: very large-scale integration, the current level of microchip miniaturization (chips containing hundreds of thousands of transistors or more).
    Abundant, inexpensive arithmetic: can put 100s of 64-bit ALUs on a chip, at 20 pJ per FP operation.
    (Relatively) high off-chip bandwidth: 1 Tb/s demonstrated, 2 nJ per word off chip.
    Memory is inexpensive: $100/GByte.
    Examples: NVIDIA GeForce4, ~120 GFLOPS and ~1.2 Tops/sec; Velio VC3003, 1 Tb/s I/O bandwidth.
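    To put those energy numbers in perspective (a back-of-the-envelope estimate, not from the slide): a chip sustaining 100 GFLOPS spends only about

    \[
    100 \times 10^{9}\ \tfrac{\text{op}}{\text{s}} \times 20\ \tfrac{\text{pJ}}{\text{op}} = 2\ \text{W}
    \]

    on arithmetic itself, while each off-chip word at 2 nJ costs 100x the energy of a floating-point operation. This is why the architecture works so hard to keep operands on chip.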


  • But VLSI imposes some constraints
    Current architectures put few ALUs on each chip: expensive and limited performance.
    Objective for the SSS architecture: keep hundreds of ALUs per chip busy.
    Difficulties:
    Locality of data: we need to match ~20 Tb/s of ALU bandwidth to ~100 Gb/s of chip bandwidth.
    Latency tolerance: to cover a 500-cycle remote memory access time.
    Arithmetic is cheap, global bandwidth is expensive.
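    The gap between those two bandwidths is a factor of

    \[
    \frac{20\ \text{Tb/s}}{100\ \text{Gb/s}} = 200,
    \]

    so on average only about one operand in 200 can come from off chip; the rest must be served by local registers and the stream register file.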
  • The Stream Model exposes parallelism and locality in applications
    Streams of records passing through kernels.
    Parallelism: across stream elements, and across kernels.
    Locality: within kernels, and producer-consumer locality between kernels (see the kernel-chaining sketch after the diagram below).


    [Diagram: a stream application as a kernel pipeline. Kernel K1 (12 words of I/O, 50 ops), K2 (14 words, 100 ops), K3 (15 words, 70 ops), and K4 (9 words, 80 ops) pass streams of records between them, reading a grid of cells, a lookup table, and index streams.]
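    As a minimal sketch of producer-consumer locality in the Brook dialect used later in this talk (the kernels, names, and MAX bound are illustrative, not from the slides), the intermediate stream sTmp below is produced by one kernel and consumed by the next without ever being stored back to memory:

    typedef stream float Floats;

    /* Producer: scale each record of the input stream. */
    kernel void Scale (Floats sIn, const float a, out Floats sTmp) {
        sTmp = a * sIn;
    }

    /* Consumer: add two streams element-wise. */
    kernel void Add (Floats sA, Floats sB, out Floats sOut) {
        sOut = sA + sB;
    }

    void main () {
        float In[MAX] = {};
        float Bias[MAX] = {};
        Floats sIn, sBias, sTmp, sOut;

        streamLoad (sIn, In, MAX);
        streamLoad (sBias, Bias, MAX);

        Scale (sIn, 2.0f, sTmp);  /* sTmp stays in the stream register file */
        Add (sTmp, sBias, sOut);  /* consumed directly from the SRF */

        streamStore (sOut, In);
    }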

  • Streams match scientific computation to constraints of VLSI
    The stream program matches the application to the bandwidth hierarchy, 32:4:1.
    [Diagram: the kernel pipeline above mapped onto the hierarchy of Memory, Stream Cache, Stream Register File, and Local Registers, annotating the ops per kernel and the number of words moved at each level.]


  • Scientific programs stream well
    StreamFEM results show LRF:SRF:memory bandwidth ratios of 206:13:1 to 50:3:1 (see the chart data below).


    [Chart data: StreamFEM bandwidth and sustained performance on a single SSS node, Euler equations, by polynomial order.]

    Polynomial Order   Mem BW (GB/s)   SRF BW (GB/s)   LRF BW (GB/s)   Sustained GFLOPS
    Constant           16.6            46.7            856.2           16.7
    Linear             15.6            71.3            1288.9          27.1
    Quadratic          11.0            83.4            1371.9          32.2
    Cubic              6.5             93.4            1446.5          35.0

    (The embedded spreadsheet also lists MHD runs sustaining 22.8 to 36.6 GFLOPS, a strip-mined cubic Euler run at 39.3 GFLOPS, and 8 to 51 ops per memory access across all runs.)
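    A quick check of the quoted ratios against this data (simple division): for the constant-order run,

    \[
    856.2 : 46.7 : 16.6 \approx 52 : 2.8 : 1,
    \]

    consistent with the 50:3:1 end of the range above, while the cubic run gives roughly 222:14:1, near the 206:13:1 end.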

  • BW Hierarchy of SSS

    [Diagram: bandwidth hierarchy of an SSS node. DRAM feeds the node at 16 GB/s, the stream cache banks 0-7 at 64 GB/s, the stream register file at 512 GB/s, and the functional units' local registers in clusters 0-15 at 3,840 GB/s.]
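    These hardware numbers reproduce the 32:4:1 hierarchy cited earlier, with local registers another factor of 7.5 above the stream register file:

    \[
    512 : 64 : 16 = 32 : 4 : 1, \qquad \frac{3840}{512} = 7.5.
    \]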

  • Stream processor = vector processor + local registers
    Like a vector processor, a stream processor:
    amortizes instruction overhead over the records of a stream;
    hides latency by loading (storing) streams of records;
    can exploit producer-consumer locality at the SRF (VRF) level.
    Stream processors add local registers and microcoded kernels:
    >90% of all references are served from local registers,
    which increases the effective bandwidth and capacity of the SRF (VRF) by 10x,
    enables 10x the number of ALUs,
    and enables the SRF to capture the working set.


    [Diagram: a vector processor's Memory (1x) feeding a Vector Register File (10x), beside a stream processor's Memory (1x) feeding a Stream Register File (10x) feeding Local Registers (100x).]

  • Brook: streaming language
    C with streaming:
    makes data parallelism explicit;
    declares the communication pattern.
    Streams:
    a view of records in memory;
    operated on in parallel;
    accessing stream values is not permitted outside of kernels.


  • Brook Kernels
    Kernels:
    functions which operate only on streams;
    stream arguments are read-only or write-only;
    reduction variables (associative operations only);
    restricted communication between records;
    no state or static variables;
    no global memory access.
    (A sketch of a reduction kernel follows below.)
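    The deck does not show a reduction, so here is a minimal sketch in the same style. The reduce qualifier and the exact syntax are my assumptions; the slides only say that reduction variables must use associative operations:

    typedef stream float Floats;

    /* Sum-reduce a stream. Because '+' is associative, the compiler
       is free to combine partial sums across clusters in any order. */
    kernel void SumReduce (Floats sIn, reduce float total) {
        total = total + sIn;
    }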


  • Brook Example: Molecular Dynamics

    struct Vector { float x, y, z; };

    typedef stream struct Vector Vectors;

    /* Advance each particle position by one explicit Euler step.
       sPos and sVel are read-only streams; sNewPos is write-only. */
    kernel void UpdatePosition (Vectors sPos, Vectors sVel,
                                const float timestep,
                                out Vectors sNewPos) {
        sNewPos.x = sPos.x + timestep * sVel.x;
        sNewPos.y = sPos.y + timestep * sVel.y;
        sNewPos.z = sPos.z + timestep * sVel.z;
    }


  • Brook Example: Molecular Dynamics (driver)

    struct Vector { float x, y, z; };

    typedef stream struct Vector Vectors;

    void main () {
        struct Vector Pos[MAX] = {};
        struct Vector Vel[MAX] = {};

        Vectors sPos, sVel, sNewPos;

        /* Gather the arrays from memory into streams, run the kernel
           over every record in parallel, and scatter the result back. */
        streamLoad (sPos, Pos, MAX);
        streamLoad (sVel, Vel, MAX);
        UpdatePosition (sPos, sVel, 0.2f, sNewPos);

        streamStore (sNewPos, Pos);
    }


  • StreamMD: motivation
    Application: study the folding of human proteins.
    Molecular dynamics: computer simulation of the dynamics of macromolecules.
    Why this application?
    We expect high arithmetic intensity.
    It requires variable-length neighbor lists.
    Molecular dynamics can also be used in engine simulation to model sprays, e.g. droplet formation and breakup, drag, and droplet deformation.
    Test case chosen for initial evaluation: a box of water molecules.
    [Images: DNA molecule; human immunodeficiency virus.]
    (A sketch of a pairwise-force kernel follows below.)
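    The StreamMD code itself is not shown in this deck. As a sketch only, in the style of the UpdatePosition example, one way to express the inner loop is to flatten the neighbor lists into a stream of particle pairs ahead of time; the structs, kernel, names, and the pair-stream workaround for variable-length lists are all my assumptions:

    struct Vector { float x, y, z; };
    struct Pair { struct Vector pi, pj; };

    typedef stream struct Pair Pairs;
    typedef stream struct Vector Vectors;

    /* One force contribution per neighbor pair; cutoff2 is the
       squared interaction cutoff. The interaction law here is an
       illustrative inverse-square placeholder, not the water model. */
    kernel void PairForce (Pairs sPair, const float cutoff2,
                           out Vectors sForce) {
        float dx = sPair.pi.x - sPair.pj.x;
        float dy = sPair.pi.y - sPair.pj.y;
        float dz = sPair.pi.z - sPair.pj.z;
        float r2 = dx * dx + dy * dy + dz * dz;
        float s = (r2 < cutoff2) ? 1.0f / (r2 * r2) : 0.0f;
        sForce.x = s * dx;
        sForce.y = s * dy;
        sForce.z = s * dz;
    }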
