Latency vs. Bandwidth: Which Matters More?
Katherine Yelick
U.C. Berkeley and LBNL
Joint work with: Xiaoye Li, Lenny Oliker, Brian Gaeke, Parry Husbands (LBNL)
The Berkeley IRAM group: Dave Patterson, Joe Gebis, Dave Judd, Christoforos Kozyrakis, Sam Williams, …
The Berkeley Bebop group: Jim Demmel, Rich Vuduc, Ben Lee, Rajesh Nishtala, …
K. Yelick, PIM Software 2004
Blame the Memory Bus
Many scientific applications run at less than 10% of hardware peak, even on a single processor. The trend is to blame the memory bus. Is this accurate?
We need to understand the bottlenecks to design better machines and design better algorithms.
Two parts: algorithm bottlenecks on microprocessors, and bottlenecks on a PIM system, VIRAM.
[Figure: Processor-memory performance gap, 1980-2000, log scale. CPU performance grows ~60%/yr while DRAM grows ~7%/yr, so the gap grows ~50%/yr, opening up around 1982.]
Note: this is latency, not bandwidth.
Memory Intensive Applications
Poor performance is especially problematic for memory-intensive applications: a low ratio of arithmetic operations to memory operations, and irregular memory access patterns.
Example: sparse matrix-vector multiply (the dominant kernel of NAS CG). Many scientific applications do this from some perspective. Compute y = y + A*x.
The matrix is stored as two main arrays: a column-index array (int) and a value array (floating point).
For each element y[i], compute the sum over j of x[index[j]] * value[j].
So latency (to x) dominates, right? The accesses to x are irregular and not necessarily in cache.
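A scalar sketch of the kernel just described. The slide gives only the column-index and value arrays; the `row_ptr` array delimiting each row's entries is an added assumption (a CRS-style layout), and all names are illustrative.

```python
# Sparse matrix-vector multiply y = y + A*x with the two-array layout
# described above: a column-index array (int) and a value array (float),
# plus assumed row pointers delimiting each row's nonzeros.
def spmv(row_ptr, col_index, value, x, y):
    for i in range(len(row_ptr) - 1):          # one output element per row
        acc = y[i]
        for j in range(row_ptr[i], row_ptr[i + 1]):
            acc += value[j] * x[col_index[j]]  # indexed (irregular) load from x
        y[i] = acc
    return y

# Tiny example: A = [[2, 0], [1, 3]], x = [1, 1], y starts at 0.
y = spmv([0, 1, 3], [0, 0, 1], [2.0, 1.0, 3.0], [1.0, 1.0], [0.0, 0.0])
```

Note that the index and value streams are read once each, unit stride, while only the loads from `x` are irregular; that asymmetry is what the next slide's model quantifies.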
Performance Model is Revealing
A simple analytical model for the sparse matvec kernel:
(# loads from memory × cost of a memory load) + (# loads from cache × cost of a cache load) + …
Two versions: one with only compulsory misses to the source vector x, and one where all accesses to x produce a miss to memory.
Conclusion: cache misses to the source vector (memory latency) are not the dominant cost. PAPI measurements confirm this.
So bandwidth to the matrix dominates, right?
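The two model variants can be written down in a few lines. The latency numbers below are illustrative assumptions, not measurements from the talk; the matrix dimensions are the DIS matrix used later in the deck.

```python
# Sketch of the slide's load-cost model for sparse matvec:
#   cycles ~ (#loads from memory) * memory latency
#          + (#loads from cache)  * cache latency
# mem_lat and cache_lat are assumed, illustrative cycle counts.
def matvec_cycles(nnz, x_misses, mem_lat=100, cache_lat=2):
    matrix_loads = 2 * nnz          # one index + one value per nonzero
    x_hits = nnz - x_misses         # remaining x accesses hit in cache
    return matrix_loads * mem_lat + x_misses * mem_lat + x_hits * cache_lat

nnz, n = 177_820, 10_000
lower = matvec_cycles(nnz, x_misses=n)    # only compulsory misses to x
upper = matvec_cycles(nnz, x_misses=nnz)  # every access to x misses
# The two bounds differ by well under 2x: the matrix streams, not the
# misses to x, dominate the total load cost.
```

With these assumed latencies the ratio `upper/lower` is about 1.4, which is the slide's point: even treating every access to x as a memory miss barely moves the total.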
Memory Bandwidth Measurements
Yes, but be careful about how you measure bandwidth: it is not a constant.
An Architectural Probe
Sqmat is a tunable probe to measure architectures: it processes a stream of small matrices and squares each matrix to some power, which sets the computational intensity. The stream may be direct (dense) or indirect (sparse); if indirect, a parameter controls how frequently there is a non-unit-stride jump.
Parameters: matrix size within the stream, computational intensity, indirection (yes/no), and the number of unit strides before a jump.
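A scalar sketch of the probe's inner loop. The names are illustrative, and the slide does not give the probe's source, so this is only a sketch of the stated behavior: each small matrix is squared k times (the computational-intensity knob), and an optional index stream makes the accesses indirect.

```python
# Sqmat-style probe sketch: square each small matrix k times.
# `index` adds an indirect (sparse-style) load stream; None means
# direct (dense), unit-stride access through the stream.
def sqmat(matrices, k, index=None):
    order = index if index is not None else range(len(matrices))
    out = []
    for i in order:                       # unit stride, or indexed jumps
        m = matrices[i]
        for _ in range(k):                # k squarings = computational intensity
            m = [[sum(m[r][t] * m[t][c] for t in range(len(m)))
                  for c in range(len(m))] for r in range(len(m))]
        out.append(m)
    return out

# 1x1 matrices make the arithmetic easy to check: 3 squared twice is 81.
assert sqmat([[[3]]], k=2) == [[[81]]]
```

Raising k adds flops without adding memory traffic, so sweeping k separates compute-bound from bandwidth-bound behavior on a given machine.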
Cost of Indirection
Adding a second load stream for indices into the stream has a big effect on some machines.
This is truly a bandwidth issue.
[Figure: slowdown (1-6) vs. number of squarings (computational intensity, 1-512) for Itanium 2, Opteron, Power3, and Power4.]
Cost of Irregularity
Slowdown relative to the previous slide's results. Even a tiny bit of irregularity (1/S) can have a big effect.
[Figure: four panels (Opteron, Itanium 2, Power3, Power4) of slowdown vs. number of squarings (1-64), for S = 1, 2, 4, 8, 16, 128 unit strides between jumps.]
What Does This Have to Do with PIMs?
Performance of Sqmat on PIMs and others for 3x3 matrices, squared 10 times (high computational intensity!).
Imagine is much faster for long streams, slower for short ones.
Mflop/s with varying stream lengths:
[Figure: Mflop/s (0-3500) for IMAGINE, IRAM, DIVA, and Power3 at stream lengths 8, 16, 32, …, 1024.]
VIRAM Overview
Technology: IBM SA-27E, 0.18 µm CMOS, 6 metal layers.
Die area: 290 mm²; 225 mm² for memory/logic.
Transistor count: ~130M; 13 MB of DRAM.
Power supply: 1.2 V for logic, 1.8 V for DRAM.
Typical power consumption: 2.0 W = 0.5 W (scalar) + 1.0 W (vector) + 0.2 W (DRAM) + 0.3 W (misc).
MIPS scalar core + 4-lane vector unit. Peak vector performance: 1.6/3.2/6.4 Gops without multiply-add (64b/32b/16b operations); 3.2/6.4/12.8 Gops with madd; 1.6 Gflops (single-precision).
Die: 14.5 mm × 20.0 mm.
Vector IRAM ISA Summary
Vector ALU: signed/unsigned integer, single/double FP; operand forms .v, .vv, .vs, .sv.
Vector memory: unit-stride, constant-stride, and indexed loads and stores (signed/unsigned integer).
Data widths: 8, 16, 32, 64 bits.
Scalar: MIPS64 scalar instruction set.
• 91 instructions
• 660 opcodes
ALU operations: integer, floating-point, fixed-point and DSP, convert, logical, vector processing, flag processing.
VIRAM Compiler
Based on Cray's production compiler.
Challenges: narrow data types and scalar/vector memory consistency.
Advantages relative to media extensions: powerful addressing modes and an ISA independent of datapath width.
[Diagram: frontends (C, Fortran95, C++) feed Cray's PDGCS optimizer, which feeds code generators for T3D/T3E, SV2/VIRAM, and C90/T90/X1.]
Compiler and OS Enhancements
Compiler based on Cray PDGCS:
Outer-loop vectorization. Strided and indexed vector loads and stores.
Vectorization of loops with if statements: full predicated execution of vector instructions using flag registers.
Vectorization of reductions and FFTs: instructions for simple intra-register permutations; automatic for reductions, manual (or StreamIT) for FFTs.
Vectorization of loops with break statements: software speculation support for vector loads.
OS development: MMU-based virtual memory.
OS performance: dirty and valid bits for registers to reduce context-switch overhead.
HW Resources Visible to Software
[Figure: in Vector IRAM, the memory hierarchy is visible to software; in a Pentium III, most of it is transparent to software.]
• Software (applications/compiler/OS) can control: main memory, registers, execution datapaths
VIRAM Chip Statistics
Technology: IBM SA-27E, 0.18 µm CMOS, 6 layers of copper; deep-trench DRAM cell, full-speed logic
Area: 270 mm² (65 mm² logic, 140 mm² DRAM)
Transistors: ~130 million (7.5M logic, 122.5M DRAM)
Supply: 1.2 V logic, 1.8 V DRAM, 3.3 V I/O
Clock: 200 MHz
Power: 2 W (0.5 W MIPS core, 1 W vector unit, 0.5 W DRAM/I-O)
Package: 304-lead quad ceramic package (125 signal I/Os)
Crossbar BW: 12.8 GB/s per direction (load or store, peak)
Peak performance: integer without madd 1.6/3.2/6.4 Gops (64b/32b/16b); integer with madd 3.2/6.4/12.8 Gops (64b/32b/16b); FP 1.6 Gflops (32b, without madd)
VIRAM Design Statistics
RTL model: 170K lines of Verilog
Design methodology: synthesized (MIPS core, vector unit control, FP datapath); full-custom (vector register file, crossbar, integer datapaths); macros (DRAM, SRAM for caches)
IP sources: UC Berkeley (vector coprocessor, crossbar, I/O), MIPS Technologies (MIPS core), IBM (DRAM/SRAM macros), MIT (FP datapath)
Verification: 566K lines of directed tests (9.8M lines of assembly); 4 months of random testing on 20 Linux workstations
Design team: 5 graduate students
Status: place & route, chip assembly
Tape-out: October 2002
Design time: ~2.5 years
VIRAM Chip
Taped out to IBM in October '02. Received wafers in June 2003; chips were thinned, diced, and packaged. Parts were sent to ISI, who produced test boards.
[Die photo: DRAM banks, I/O, MIPS core, and 4 64-bit vector lanes.]
Demonstration System
Based on the MIPS Malta development board: PCI, Ethernet, AMR, IDE, USB, CompactFlash, parallel, serial.
VIRAM daughter-card, designed at ISI-East: VIRAM processor, Galileo GT64120 chipset, 1 DIMM slot for external DRAM.
Software support and OS: a monitor utility for debugging, and a modified version of MIPS Linux.
Benchmarks for Scientific Problems
Dense and sparse matrix-vector multiplication: compare to tuned codes on conventional machines.
Transitive closure (small & large data sets), on a dense graph representation.
NSA Giga-Updates Per Second (GUPS, 16-bit & 64-bit): fetch-and-increment a stream of "random" addresses.
Sparse matrix-vector product: order 10000, 177820 nonzeros.
Computing a histogram, used for image processing of a 16-bit greyscale image (1536 × 1536); two algorithms: a 64-element sorting kernel, and privatization. Also used in sorting.
2D unstructured mesh adaptation: initial grid 4802 triangles, final grid 24010 triangles.
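The GUPS kernel in the list above is essentially a fetch-and-increment over a stream of pseudo-random table locations. A minimal sketch; the real benchmark uses a fixed address-generator stream, for which Python's `random` is only a stand-in here.

```python
import random

# GUPS-style kernel sketch: read-modify-write a table at pseudo-random
# locations. Each update is an irregular, essentially uncacheable access,
# which is what makes the benchmark memory-system-bound.
def gups(table_size, n_updates, seed=0):
    rng = random.Random(seed)
    table = [0] * table_size
    for _ in range(n_updates):
        addr = rng.randrange(table_size)   # "random" address stream
        table[addr] += 1                   # fetch-and-increment
    return table

t = gups(1 << 10, 4096)
```

Because each increment depends on a load from an unpredictable address, the rate is set by how many independent memory operations the machine can keep in flight, not by arithmetic throughput.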
Sparse MVM Performance
Performance is matrix-dependent: the lp matrix.
Compiled for VIRAM using the "independent" pragma, with a sparse column layout.
Sparsity-optimized for the other machines, with a sparse row (or blocked row) layout.
[Figure: MFLOPS (0-250) for VIRAM-4, VIRAM-8, Sun Ultra I, MIPS R10K, Alpha 21264, and PowerPC 604e.]
Power and Performance on BLAS-2
100×100 matrix-vector multiplication (column layout). The VIRAM result is compiled; the others are hand-coded or Atlas-optimized. VIRAM performance improves with larger matrices, and VIRAM power includes on-chip main memory. An 8-lane version of VIRAM nearly doubles MFLOPS.
[Figure: MFLOPS and MFLOPS/Watt (0-400) for VIRAM, Sun Ultra I, Sun Ultra II, MIPS R12K, Alpha 21264, PowerPC G3, and Power3 630.]
Performance Comparison
IRAM was designed for media processing: low power was a higher priority than high performance.
IRAM (at 200 MHz) is better for apps with sufficient parallelism.
[Figure: MOPS (0-1000) on Transitive, GUPS, SPMV (reg), SPMV (rand), Hist, and Mesh for VIRAM, R10K, P-III, P4, Sparc, and EV6.]
Power Efficiency
Same data on a log plot; includes the low-power processors (Mobile PIII). The picture is the same for operations/cycle.
[Figure: MOPS/Watt (0.1-1000, log scale) on the same benchmarks for VIRAM, R10K, P-III, P4, Sparc, and EV6.]
Which Problems are Limited by Bandwidth?
What is the bottleneck in each case? Transitive and GUPS are limited by bandwidth (near the 6.4 GB/s peak). SPMV and Mesh are limited by address generation and bank conflicts. For Histogram there is insufficient parallelism.
[Figure: memory bandwidth in MB/s (0-6000, left axis) and MOPS (0-1000, right axis) for Transitive, GUPS, SPMV (regular), SPMV (random), Histogram, and Mesh.]
Summary of 1-PIM Results
Programmability advantage: all benchmarks vectorized by the VIRAM compiler (Cray vectorizer), with restructuring and hints from programmers.
Performance advantage: large on applications limited only by bandwidth; more address generators/sub-banks would help irregular performance.
Performance/power advantage: over both low-power and high-performance processors. Both PIM and data parallelism are key.
Alternative VIRAM Designs
"VIRAM-4Lane": 4 lanes, 8 MB, ~190 mm², 3.2 Gops at 200 MHz.
"VIRAM-2Lanes": 2 lanes, 4 MB, ~120 mm², 1.6 Gops at 200 MHz.
"VIRAM-Lite": 1 lane, 2 MB, ~60 mm², 0.8 Gops at 200 MHz.
Compiled Multimedia Performance
A single executable serves multiple implementations, and performance scales linearly with the number of lanes. Remember, this is a 200 MHz, 2 W processor.
[Figure: millions of operations per second (0-4000) for matmul 64x64, saxpy 4K, fir filter, decrypt, detect, convolve, compose, and colorspace (integer and floating-point kernels), with 1, 2, and 4 lanes.]
Third Party Comparison (I)
[Figure: ISI results for SLIIC kernels (performance): speedup over PPC G3 (0-25) on Corner Turn, Coherent Sidelobe Canceller, and Beam Steering, for PPC G3-400MHz, M32R/D-80MHz, PPC G4-733MHz, Pentium III-733MHz, VIRAM-200MHz, and Imagine-400MHz.]
Third Party Comparison (II)
[Figure: ISI-East results for SLIIC kernels (performance/Watt): improvement over PPC G3 (0-40) on the same three kernels for the same six processors.]
Vectors vs. SIMD or VLIW
SIMD: short, fixed-length vector extensions. They require wide issue or an ISA change to scale, and they don't support vector memory accesses. They are difficult to compile for, and performance is wasted on pack/unpack, shifts, rotates, …
VLIW: an architecture for instruction-level parallelism, orthogonal to vectors for data parallelism, but inefficient for data parallelism: large code size (3X for IA-64?), extra work for software (scheduling more instructions), and extra work for hardware (decoding more instructions).
Vector vs. Wide Word SIMD: Example
Vector instruction sets have strided and scatter/gather load/store operations; SIMD extensions load only contiguous memory. Vector length is implementation-independent; SIMD extensions change the ISA with the hardware's bit width.
A simple example: conversion from RGB to YUV. Thanks to Christoforos Kozyrakis.
Y = [( 9798*R + 19235*G +  3736*B) / 32768]
U = [(-4784*R -  9437*G +  4221*B) / 32768] + 128
V = [(20218*R - 16941*G -  3277*B) / 32768] + 128
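In scalar form the conversion is three fixed-point dot products per pixel, using exactly the coefficients above; reading the bracketed divide by 32768 as an arithmetic shift (floor) is an assumption. This is the reference against which the vector and MMX versions on the next slides can be compared.

```python
# Scalar RGB -> YUV using the slide's Q15 fixed-point coefficients.
# ">> 15" implements the divide by 32768 (floor semantics assumed).
def rgb_to_yuv(r, g, b):
    y = ((  9798 * r + 19235 * g +  3736 * b) >> 15)
    u = (( -4784 * r -  9437 * g +  4221 * b) >> 15) + 128
    v = (( 20218 * r - 16941 * g -  3277 * b) >> 15) + 128
    return y, u, v

# Black maps to (0, 128, 128); white maps to Y = 255 and V = 128,
# since the V coefficients sum to exactly zero.
```

The vector version below does the same three multiply-adds per component with one strided load per color plane; the MMX version needs dozens of pack/unpack and shift instructions to achieve the same arithmetic.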
VIRAM Code
RGBtoYUV:
vlds.u.b r_v, r_addr, stride3, addr_inc # load R
vlds.u.b g_v, g_addr, stride3, addr_inc # load G
vlds.u.b b_v, b_addr, stride3, addr_inc # load B
xlmul.u.sv o1_v, t0_s, r_v # calculate Y
xlmadd.u.sv o1_v, t1_s, g_v
xlmadd.u.sv o1_v, t2_s, b_v
vsra.vs o1_v, o1_v, s_s
xlmul.u.sv o2_v, t3_s, r_v # calculate U
xlmadd.u.sv o2_v, t4_s, g_v
xlmadd.u.sv o2_v, t5_s, b_v
vsra.vs o2_v, o2_v, s_s
vadd.sv o2_v, a_s, o2_v
xlmul.u.sv o3_v, t6_s, r_v # calculate V
xlmadd.u.sv o3_v, t7_s, g_v
xlmadd.u.sv o3_v, t8_s, b_v
vsra.vs o3_v, o3_v, s_s
vadd.sv o3_v, a_s, o3_v
vsts.b o1_v, y_addr, stride3, addr_inc # store Y
vsts.b o2_v, u_addr, stride3, addr_inc # store U
vsts.b o3_v, v_addr, stride3, addr_inc # store V
subu pix_s,pix_s, len_s
bnez pix_s, RGBtoYUV
MMX Code (1)
RGBtoYUV:
movq mm1, [eax]
pxor mm6, mm6
movq mm0, mm1
psrlq mm1, 16
punpcklbw mm0, ZEROS
movq mm7, mm1
punpcklbw mm1, ZEROS
movq mm2, mm0
pmaddwd mm0, YR0GR
movq mm3, mm1
pmaddwd mm1, YBG0B
movq mm4, mm2
pmaddwd mm2, UR0GR
movq mm5, mm3
pmaddwd mm3, UBG0B
punpckhbw mm7, mm6;
pmaddwd mm4, VR0GR
paddd mm0, mm1
pmaddwd mm5, VBG0B
movq mm1, 8[eax]
paddd mm2, mm3
movq mm6, mm1
paddd mm4, mm5
movq mm5, mm1
psllq mm1, 32
paddd mm1, mm7
punpckhbw mm6, ZEROS
movq mm3, mm1
pmaddwd mm1, YR0GR
movq mm7, mm5
pmaddwd mm5, YBG0B
psrad mm0, 15
movq TEMP0, mm6
movq mm6, mm3
pmaddwd mm6, UR0GR
psrad mm2, 15
paddd mm1, mm5
movq mm5, mm7
pmaddwd mm7, UBG0B
psrad mm1, 15
pmaddwd mm3, VR0GR
packssdw mm0, mm1
pmaddwd mm5, VBG0B
psrad mm4, 15
movq mm1, 16[eax]
MMX Code (2)
  paddd mm6, mm7
movq mm7, mm1
psrad mm6, 15
paddd mm3, mm5
psllq mm7, 16
movq mm5, mm7
psrad mm3, 15
movq TEMPY, mm0
packssdw mm2, mm6
movq mm0, TEMP0
punpcklbw mm7, ZEROS
movq mm6, mm0
movq TEMPU, mm2
psrlq mm0, 32
paddw mm7, mm0
movq mm2, mm6
pmaddwd mm2, YR0GR
movq mm0, mm7
pmaddwd mm7, YBG0B
packssdw mm4, mm3
add eax, 24
add edx, 8
movq TEMPV, mm4
movq mm4, mm6
pmaddwd mm6, UR0GR
movq mm3, mm0
pmaddwd mm0, UBG0B
paddd mm2, mm7
pmaddwd mm4,
pxor mm7, mm7
pmaddwd mm3, VBG0B
punpckhbw mm1,
paddd mm0, mm6
movq mm6, mm1
pmaddwd mm6, YBG0B
punpckhbw mm5,
movq mm7, mm5
paddd mm3, mm4
pmaddwd mm5, YR0GR
movq mm4, mm1
pmaddwd mm4, UBG0B
psrad mm0, 15
paddd mm0, OFFSETW
psrad mm2, 15
paddd mm6, mm5
movq mm5, mm7
MMX Code (3)
  pmaddwd mm7, UR0GR
psrad mm3, 15
pmaddwd mm1, VBG0B
psrad mm6, 15
paddd mm4, OFFSETD
packssdw mm2, mm6
pmaddwd mm5, VR0GR
paddd mm7, mm4
psrad mm7, 15
movq mm6, TEMPY
packssdw mm0, mm7
movq mm4, TEMPU
packuswb mm6, mm2
movq mm7, OFFSETB
paddd mm1, mm5
paddw mm4, mm7
psrad mm1, 15
movq [ebx], mm6
packuswb mm4,
movq mm5, TEMPV
packssdw mm3, mm4
paddw mm5, mm7
paddw mm3, mm7
movq [ecx], mm4
packuswb mm5, mm3
add ebx, 8
add ecx, 8
movq [edx], mm5
dec edi
jnz RGBtoYUV
Summary
Combination of vectors and PIM: a simple execution model for hardware pushes complexity to the compiler; low power/footprint/etc.; PIM provides the bandwidth needed by vectors, and vectors hide latency effectively.
Programmability: programmable from a "high-level" language, with a more compact instruction stream.
Works well for: applications with fine-grained data parallelism, memory-intensive problems, and both scientific and multimedia applications.
The End

Algorithm Space
[Figure: algorithms arranged along two axes, regularity and reuse: two-sided dense linear algebra, one-sided dense linear algebra, FFTs, sparse iterative solvers, sparse direct solvers, asynchronous discrete event simulation, Gröbner basis ("symbolic LU"), search, and sorting.]
VIRAM Overview
Die: 14.5 mm × 20.0 mm.
MIPS core (200 MHz): single-issue, 8 KB I & D caches.
Vector unit (200 MHz): 32 64b elements per register; 256b datapaths (16b, 32b, 64b ops); 4 address-generation units.
Main memory system: 13 MB of on-chip DRAM in 8 banks; 12.8 GB/s peak bandwidth.
Typical power consumption: 2.0 W.
Peak vector performance: 1.6/3.2/6.4 Gops without multiply-add; 1.6 Gflops (single-precision).
Fabrication by IBM; tape-out in O(1 month).
Power Efficiency
Huge power/performance advantage in VIRAM, from both PIM technology and the data-parallel execution model (compiler-controlled).
[Figure: MOPS/Watt (0-500, linear scale) on Transitive, GUPS, SPMV (reg), SPMV (rand), Hist, and Mesh for VIRAM, R10K, P-III, P4, Sparc, and EV6.]
Analysis of a Multi-PIM System
Machine parameters:
Floating-point performance: PIM-node dependent; application dependent, not theoretical peak.
Amount of memory per processor: use 1/10th for algorithm data.
Communication overhead: time the processor is busy sending a message; cannot be overlapped.
Communication latency: time across the network (can be overlapped).
Communication bandwidth: single node and bisection.
Back-of-the-envelope calculations!
Real Data from an Old Machine (T3E)
UPC uses a global address space with a non-blocking remote put/get model; it does not cache remote data.
[Figure: sparse matrix-vector multiply on the T3E: Mflops (0-250) vs. processors (1-32) for UPC + Prefetch, MPI (Aztec), UPC Bulk, and UPC Small.]
Running Sparse MVM on a Pflop PIM
1 GHz × 8 pipes × 8 ALUs/pipe = 64 GFLOPS/node peak; 8 address generators limit performance to 16 Gflops. 500 ns latency, 1-cycle put/get overhead, 100-cycle MP overhead. There are programmability differences too: packing vs. a global address space.
[Figure: achieved ops/sec (1e7-1e16, log scale) for put/get, blocking read/write, synchronous MP, asynchronous MP, and peak.]
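The bullet arithmetic above can be checked in a few lines. The link bandwidth, the 8-byte message size, and reading "16 Gflops" as one multiply-add per generated address are assumptions layered on the slide's numbers.

```python
# Back-of-the-envelope node model from the slide's parameters.
# The 6.4 GB/s link bandwidth and 8-byte messages are assumptions.
CLOCK_HZ = 1e9

flops_peak = CLOCK_HZ * 8 * 8   # 8 pipes x 8 ALUs/pipe = 64 GFLOPS/node
flops_addr = CLOCK_HZ * 8 * 2   # 8 address generators, assuming one
                                # multiply-add (2 flops) per address

def message_cost_s(nbytes, overhead_cycles, latency_s=500e-9, bw_bytes=6.4e9):
    """Time for one remote access: software overhead (not overlappable)
    + network latency (overlappable) + transfer time."""
    return overhead_cycles / CLOCK_HZ + latency_s + nbytes / bw_bytes

put_get  = message_cost_s(8, overhead_cycles=1)    # lightweight put/get
msg_pass = message_cost_s(8, overhead_cycles=100)  # message-passing overhead
```

For small messages the 100-cycle software overhead is the only term that differs, which is why the put/get curves in the figure sit well above the message-passing ones until messages get large.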
Effect of Memory Size
For small memory nodes or smaller problem sizes, low overhead is more important.
For large memory nodes and large problems, packing is better.
[Figure: ops/sec (1e7-1e16, log scale) vs. MB/node of data (0.3-4210) for put/get, blocking read/write, synchronous MP, asynchronous MP, and peak.]
Conclusions
The performance advantage for PIMs depends on the application: fine-grained parallelism is needed to utilize on-chip bandwidth. Data parallelism is one model, with the usual trade-offs: hardware and programming simplicity, but limited expressibility.
The largest advantages for PIMs are power and packaging, enabling a peta-scale machine.
Multiprocessor PIMs should be easier to program, at least at the scale of current machines (Tflops). Can we get rid of the current programming-model hierarchy?
Benchmarks
Kernels designed to stress memory systems; some taken from the Data Intensive Systems stressmarks.
Unit and constant stride: dense matrix-vector multiplication, transitive closure.
Constant stride: FFT.
Indirect addressing: NSA Giga-Updates Per Second (GUPS), sparse matrix-vector multiplication, histogram calculation (sorting).
Frequent branching as well as irregular memory access: unstructured mesh adaptation.
Conclusions and VIRAM Future Directions
VIRAM outperforms the Pentium III on scientific problems, with lower power and clock rate than the Mobile Pentium.
Vectorization techniques developed for the Cray PVPs are applicable. PIM technology provides a low-power, low-cost memory system. A similar combination is used in the Sony Playstation.
Small ISA changes can have a large impact: limited in-register permutations sped up a 1K FFT by 5x.
The memory system can still be a bottleneck: indexed/variable stride is costly, due to address generation.
Future work: ongoing investigations into the impact of lanes and subbanks; technical paper in preparation (expect completion 09/01); run benchmarks on real VIRAM chips; examine multiprocessor VIRAM configurations.
Management Plan
Roles of different groups and PIs: senior researchers working on particular classes of benchmarks.
Parry: sorting and histograms. Sherry: sparse matrices. Lenny: unstructured mesh adaptation. Brian: simulation. Jin and Hyun: specific benchmarks.
Plan to hire an additional postdoc for next year (focus on Imagine). Undergrad model used for targeted benchmark efforts.
Plan for using computational resources at NERSC: few resources used, except for comparisons.
Future Funding Prospects
FY2003 and beyond: DARPA initiated the DIS program, and related projects are continuing under Polymorphic Computing. A new BAA is coming in "High Productivity Systems". Interest from other DOE labs (LANL) in the general problem.
General model: most architectural research projects need benchmarking, and the work has higher quality if done by people who understand the apps. The expertise for hardware projects is different: system-level design, circuit design, etc. Interest from both the IRAM and Imagine groups shows the level of interest.
Long Term Impact
Potential impact on computer science: promote research on new architectures and micro-architectures, and understand future architectures.
Preparation for procurements.
Provide visibility of NERSC in core CS research areas.
Correlate applications: DOE vs. large-market problems.
Influence future machines through research collaborations.
Benchmark Performance on IRAM Simulator
IRAM (200 MHz, 2 W) versus Mobile Pentium III (500 MHz, 4 W)
Project Goals for FY02 and Beyond
Use the established data-intensive scientific benchmarks with other emerging architectures:
IMAGINE (Stanford Univ.): designed for graphics and image/signal processing; peak 20 GFLOPS (32-bit FP). Key features: vector processing, VLIW, and a streaming memory system. (Not a PIM-based design.) Preliminary discussions with Bill Dally.
DIVA (DARPA-sponsored: USC/ISI): based on a PIM "smart memory" design, but for multiprocessors; moves computation to data. Designed for irregular data structures and dynamic databases. Discussions with Mary Hall about benchmark comparisons.
Media Benchmarks
FFT uses in-register permutations and a generalized reduction. All others are written in C with the Cray vectorizing compiler.
[Figure: GOPS (0-4) for the media benchmarks.]
Integer Benchmarks
Strided access is important (e.g., RGB); narrow types are limited by address generation.
Outer-loop vectorization and unrolling are used: this helps avoid short vectors, but spilling can be a problem.
[Figure: performance (0-7000) with 1, 2, and 4 lanes.]
Status of benchmarking software release
Release contents:
Build and test scripts (Makefiles, timing, analysis, ...)
Standard random number generator
GUPS: C codes, optimized inner loop, docs
Pointer jumping (with and without update), Transitive, Field
Conjugate Gradient (matrix), Neighborhood
Optimized vector histogram code; vector histogram code generator
Test cases (small and large working sets); optimized and unoptimized versions
Future work:
• Write more documentation, add better test cases as we find them
• Incorporate media benchmarks, AMR code, library of frequently-used compiler flags & pragmas
Status of benchmarking work
Two performance models: a simulator (vsim-p) and a trace analyzer (vsimII).
Recent work on vsim-p: refining the performance model for double-precision FP performance.
Recent work on vsimII: making the backend modular (goal: model different architectures with the same ISA); fixing bugs in the memory model of the VIRAM-1 backend; better comments in the code for maintainability; completing a new backend for a new decoupled cluster architecture.
Comparison with Mobile Pentium
GUPS: VIRAM gets 6x more GUPS.

Data element width:   16-bit  32-bit  64-bit
Mobile Pentium GUPS:  .045    .046    .036
VIRAM GUPS:           .295    .295    .244
[Figures: total execution time (seconds) for Transitive (matrix sizes 50-550), Update (working sets 0tiny through test4), and Pointer (working sets 0tiny through test3), comparing P-III and VIRAM 4-lane.]
VIRAM is 30-50% faster than the P-III, and its execution time rises much more slowly with data size.
Sparse CG
Solve Ax = b; sparse matrix-vector multiplication dominates.
The traditional CRS format requires indexed load/store for the X/Y vectors and variable, usually short, vector lengths.
Other formats vectorize better:
CRS with a narrow band (e.g., RCM ordering): smaller strides for the X vector.
Segmented sum (modified from the old code developed for the Cray PVP): long vector lengths of the same size; unit stride.
ELL format: make all rows the same length by padding with zeros; long vector lengths of the same size, at the cost of extra flops.
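A sketch of the ELL padding idea (CRS-style inputs; names hypothetical): every row is padded to the longest row's length, buying uniform, unit-stride vectors at the cost of extra flops on the explicit zeros.

```python
# Convert CRS (row_ptr, col_index, value) to ELL: pad each row with
# explicit zeros so all rows have the same length. Vector length becomes
# uniform and the access pattern regular; the zeros cost extra flops.
def crs_to_ell(row_ptr, col_index, value):
    nrows = len(row_ptr) - 1
    width = max(row_ptr[i + 1] - row_ptr[i] for i in range(nrows))
    ell_val = [[0.0] * width for _ in range(nrows)]
    ell_col = [[0] * width for _ in range(nrows)]   # padding points at column 0
    for i in range(nrows):
        for k, j in enumerate(range(row_ptr[i], row_ptr[i + 1])):
            ell_val[i][k] = value[j]
            ell_col[i][k] = col_index[j]
    return ell_col, ell_val

cols, vals = crs_to_ell([0, 1, 3], [0, 0, 1], [2.0, 1.0, 3.0])
```

The padded zeros multiply harmlessly into the result, which is why the next slide's ELL row reports both raw MFLOPS and (in parentheses) the useful rate after discounting the 4.6x extra flops.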
SMVM Performance
DIS matrix: N = 10000, M = 177820 (~17 nonzeros per row). Mobile PIII (500 MHz) CRS: 35 MFLOPS.

IRAM results (MFLOPS) by number of sub-banks:
                         1         2         4         8
CRS                      91        106       109       110
CRS banded               110       110       110       110
SEG-SUM                  135       154       163       165
ELL (4.6x more flops)    511(111)  570(124)  612(133)  632(137)
2D Unstructured Mesh Adaptation
A powerful tool for efficiently solving computational problems with evolving physical features (shocks, vortices, shear layers, crack propagation).
Complicated logic and data structures make it difficult to achieve high efficiency: irregular data access patterns (pointer chasing), many conditionals, integer intensive.
Adaptation is a tool for making the numerical solution cost-effective. Three types of element subdivision.
Vectorization Strategy and Performance Results
Color elements based on vertices (not edges)
This guarantees no conflicts during vector operations.
Vectorize across each subdivision type (1:2, 1:3, 1:4), one color at a time. This is difficult: many conditionals, low flops, irregular data access, dependencies.
Initial grid: 4802 triangles; final grid: 24010 triangles.
Preliminary results show VIRAM 4.5x faster than a Mobile Pentium III 500, at higher code complexity (requires graph coloring + reordering).
Time (ms): Pentium III 500: 61; 1 lane: 18; 2 lanes: 14; 4 lanes: 13.