
Page 1:

April, 2011

Cartesian Computations

Larry Carter

with thanks to

Bowen Alpern, Jeanne Ferrante,

Kang Su Gatlin, Nick Mitchell, Michelle Strout

Page 2:

Outline

• Architectural trends

– Computation is (almost) free

– Data movement and memory aren’t

• Cartesian computations
– Examples
– Static, tidy and lazy implementations

• Analysis of Cartesian computations
– Computation density and temp data ratio
– Some theorems

• Application to examples

Page 3:

Technology trends

Everything’s getting smaller

Energy consumption limits computation speed

Need to keep chip from melting

Electricity costs are non-trivial!

Energy is dominated by cost to move data

In 32 nm technology (available today): 0.1 pJ (picoJoule) to add two 32-bit numbers

2 pJ to move a 32-bit number 1 mm on chip

320 pJ to move a 32-bit number onto or off a chip

(These numbers may be reduced by ~10x with heroic methods, e.g. very low power & speed.)

[Figure: feature-width scale from m down to Å, marking the sizes of a chip, the wavelength of light, and an atom]

Page 4:

Computation is (almost) free

On a 20 x 20 mm chip, you can fit:

1,000,000 32-bit adders, … or …

256 MB fast memory (SRAM)

Memory is what takes up area

For 100 Watts of power, you can

Perform 10¹⁵ additions per second (one petaOp), … or …

Read data from DRAM at 128 GB/sec

Memory and moving data is what consumes power
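These figures are easy to sanity-check. A minimal sketch (Python) rederives them from the energy numbers on the previous slide; the ~100 pJ/bit DRAM figure is inferred here from the 128 GB/sec claim, not stated in the talk:

    # What 100 W buys, using the slides' energy numbers.
    POWER_W = 100.0                # watts = joules/second
    ADD_ENERGY_J = 0.1e-12         # 0.1 pJ per 32-bit add
    print(f"{POWER_W / ADD_ENERGY_J:.1e} adds/sec")     # 1.0e+15: one petaOp

    DRAM_BYTES_PER_SEC = 128e9     # 128 GB/sec at 100 W, per the slide
    pj_per_bit = POWER_W / (DRAM_BYTES_PER_SEC * 8) * 1e12
    print(f"{pj_per_bit:.0f} pJ per DRAM bit")          # ~98 pJ/bit (inferred)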

Page 5:

(Super)computer architecture

Core = CPU + caches

Chip = bunch of cores
High-bandwidth interconnect

Node = bunch of chips + DRAM
Tb/sec bandwidth between chips
Less bandwidth to memory

System = bunch of nodes
Switch or grid connection
OK bandwidth for block transfers

[Figure: a chip containing several cores (each a processor with L1 and L2 caches), all attached to DRAM]

Page 6:

Models of computation

RAM: one processor connected to one memory

PRAM: many processors sharing one memory

[Figure: RAM and PRAM block diagrams]

Page 7:

The point being …

(P)RAM analysis counts arithmetic operations, ignoring cost of memory accesses and data movement

Real costs, both time and energy, are dominated by memory accesses and data movement (and that's not even considering disk). These costs vary by over 100x depending on where the data goes.

Algorithm analysis should focus on data movement!

The rest of this talk will study data movement for a common class of scientific computation kernels.

Page 8:

Outline

• Architectural trends

– Computation is (almost) free

– Data movement and memory aren’t

• Cartesian computations
– Examples
– Static, tidy and lazy implementations

• Analysis of Cartesian computations
– Computation density and temp data ratio
– Some theorems

• Application to examples

Page 9:

An incomplete taxonomy of kernels

Compute-intensive: e.g. well-tuned matrix multiply, seismic migration, …
Runtime is dominated by core speed

Streaming applications: e.g. simple encoding, scanning, one pass of quicksort, …
Runtime is dominated by sequential access to memory

Dynamic data structures: e.g. binary search trees, quadtrees, graph algorithms
Runtime is dominated by random access to memory

and …

Page 10:

Cartesian computation

Given large data structures A and B

A generates temporary data T, which updates B

For each chunk of A and B, there’s a “tile” of computation

There may be data dependences, but we’ll assume that reordering is possible.

In a uniform Cartesian computation, for all tiles of a given size, the amount of computation is roughly the same.

[Figure: grid of tiles, with A along one axis and B along the other]

Page 11:

A messy detail

For very small tiles, the uniformity assumption fails

How small is “very small” depends on the computation

Most very small tiles will have no temp data or work

If we only execute non-empty very small tiles, there is a savings in communication costs.

But using very small tiles requires much more communication than using large tiles

• In this talk, we assume that such a strategy is too expensive!

• We also leave open the problem of non-uniform distributions.

Page 12:

Sparse matrix-vector product y = Mx

A = input vector (x), B = output vector (y)

T might be pairs of the form {i, Mij xj}

Matrix M could be:

– known implicitly (e.g. a regular grid), or
– streamed in from local memory, stored in order of use

A tile is a rectangular subset of M

The memory characteristics of y = Mx

– depend on the density & distribution of nonzeros
– depend on the implementation

[Figure: sparse matrix M with scattered nonzeros (*), multiplying input vector x to produce output vector y]
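To make the roles concrete, here is a minimal sketch (Python; the CSR storage and function names are illustrative assumptions, one way to stream M in order of use) of how y = Mx generates and consumes temp pairs {i, Mij·xj}:

    # Sparse y = Mx viewed as a Cartesian computation:
    # A = x (input), B = y (output); temp data T = (i, Mij*xj) pairs.
    def generate_temp(row_ptr, col_idx, vals, x):
        temp = []                                  # temp data T
        for i in range(len(row_ptr) - 1):
            for k in range(row_ptr[i], row_ptr[i + 1]):
                temp.append((i, vals[k] * x[col_idx[k]]))
        return temp

    def consume_temp(temp, y):
        for i, v in temp:                          # updates B
            y[i] += v

A tidy implementation would consume each pair as soon as it is generated; a lazy one buffers the pairs and consumes them when the right chunk of y is resident.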

Page 13:

A different (?) example

Histogram (from NAS Integer Sort Benchmark)

For i=1 to numkeys
    Tally[Key[i]]++

Key is A; Tally is B

Temp data is indices into the tile's subset of Tally

Transpose, Bit Reversal (used in FFTs), and other Permutations are similar

[Figure: Key array (3 1 4 1 5 9 2 6 5 3 5 8 9 7 5 …) being tallied into Tally[0..9]]
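Permutations fit the same mold. A minimal sketch (Python; names are illustrative) of bit-reversal as a Cartesian computation, with temp data = (destination, value) pairs:

    # Bit-reversal permutation as a Cartesian computation:
    # A = src (read sequentially), B = dst (updated); temp = (dest, value) pairs.
    def bit_reverse(i, nbits):
        r = 0
        for _ in range(nbits):
            r = (r << 1) | (i & 1)
            i >>= 1
        return r

    def permute(src, nbits):
        dst = [0] * len(src)
        temp = ((bit_reverse(i, nbits), v) for i, v in enumerate(src))
        for j, v in temp:                  # consuming temp data updates B
            dst[j] = v
        return dst

    print(permute(list(range(8)), 3))      # [0, 4, 2, 6, 1, 5, 3, 7]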

Page 14:

Generalization: Map-Reduce

Input A is a set of (key,value) pairs

Distributed by “key” in big computer

“Map” takes each (k,v) to a list {(k1,v1), (k2,v2), … }

Resulting data is sorted according to new keys

Sj = {v | (kj, v) comes from Map}

This is intermediate data of Cartesian computation

“Reduce” is applied to each Sj, producing a new pair (kj, Reduce(Sj))

(or more generally, any number of new pairs)

This is output set B
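A minimal word-count-style sketch (Python; the map/reduce functions are illustrative, not from the talk) showing where the sets Sj of intermediate data appear:

    from collections import defaultdict

    def map_fn(k, v):                      # "Map": (k,v) -> list of (k', v')
        return [(word, 1) for word in v.split()]

    def reduce_fn(k, values):              # "Reduce": (kj, Sj) -> new pair
        return (k, sum(values))

    def mapreduce(pairs):
        groups = defaultdict(list)         # Sj: the intermediate (temp) data
        for k, v in pairs:
            for k2, v2 in map_fn(k, v):
                groups[k2].append(v2)      # grouped ("sorted") by new key
        return [reduce_fn(k, vs) for k, vs in groups.items()]   # output set B

    print(mapreduce([(1, "to be or not to be")]))
    # [('to', 2), ('be', 2), ('or', 1), ('not', 1)]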

Page 15:

In general …

In a uniform Cartesian computation,

– chunks of A are used to generate temporary data
(may need extra “one-time” data too)

– chunks of B consume the temporary data
(N.B. the intermediate values of B are not called temp data)

– every chunk of A generates data for every chunk of B

Program for executing Cartesian computations:

for each “architectural unit” (cache, core, chip, node, …), choose

– tile size and shape

– whether to move temp data or to move chunks of A and/or B

– how to use multiple processors and share memory

Page 16:

Implementing Cartesian computations

Tidy implementation
– Temp data is consumed as soon as it is generated
• relevant chunks of both A and B must be in local memory

– A and B chunks are copied or moved so each is together at some time

Temp data doesn’t move; but each bit of A and B may move many times

• particularly if the tiles are small

Lazy implementation
– Bring bits of A into a processor; generate temp data
– Move the temp data either:
• to another processor, or
• to some level of cache memory
– Consume temp data where/when the appropriate chunk of B is available

Temp data moves, but A and B may move less than for tidy program

Page 17:

Two types of lazy implementation

1. Static partitioning:

– Partition B among the memories of all “architectural units”
(B must fit in the totality of these memories)

– Stream A in a few bits at a time; generate temp data and send it to the unit owning the appropriate chunk of B

A and B are only moved once
Temp data may need to travel a long distance

Aside: we could swap the roles of A and B.

2. Bucket tiling:
• Stream A in; generate temp data and store it in local memory
• Store each piece of temp data in a bucket depending on which chunk of B will consume it
• Later, bring B in a chunk at a time and consume the temp data
• Repeat the above until done

A and/or B may move in & out many times.

Page 18:

Bucket Tiling Example: histogram

For i=1 to numkeys
    Tally[Key[i]]++

becomes …

For ii=0 to numKeys by chunk_size       // Break the set of keys into chunks
    Bend[*] = 0
    For i=ii to ii+chunk_size-1         // Put keys into buckets
        k = Key[i]>>16                  //   (indexed by high bits)
        Temp[k][Bend[k]++] = Key[i] & mask
    For k=0 to numBuckets-1             // Process each bucket
        For i=0 to Bend[k]-1
            Tally[(k<<16)+Temp[k][i]]++

Choose chunk_size so that Temp fits in L2 cache;

Choose “16” so random accesses to Tally are in L1 cache

Code ran 3x faster (very old experiment)
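A runnable translation (a sketch in Python; sizes are scaled down so it actually runs, and the variable names follow the pseudocode above):

    import random

    NBITS, SHIFT = 20, 10              # small key width so the sketch runs quickly
    MASK = (1 << SHIFT) - 1
    NUM_BUCKETS = 1 << (NBITS - SHIFT)
    CHUNK = 4096                       # chosen so Temp would fit in L2 cache

    keys = [random.getrandbits(NBITS) for _ in range(100_000)]
    tally = [0] * (1 << NBITS)

    for ii in range(0, len(keys), CHUNK):
        temp = [[] for _ in range(NUM_BUCKETS)]       # Bend[*] = 0
        for key in keys[ii:ii + CHUNK]:               # put keys into buckets
            temp[key >> SHIFT].append(key & MASK)     #   (indexed by high bits)
        for k in range(NUM_BUCKETS):                  # process each bucket
            base = k << SHIFT
            for low in temp[k]:
                tally[base + low] += 1

    assert sum(tally) == len(keys)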

Page 19:

Hybrid implementations

Many variants are possible:

Different strategies at different levels of granularity

E.g. tidy implementation within node; lazy within chip

If a chunk of A (or B) is needed in several places, it can either move from place to place, or there can be duplicate copies

If B has copies, they need to be combined at the end.

Different strategies for different regions

E.g., if (non-uniform) problem has dense regions, use static partitioning but duplicate denser parts of A and/or B to reduce communication

Page 20:

Outline

• Architectural trends

– Computation is (almost) free

– Data movement and memory aren’t

• Cartesian computations
– Examples
– Static, tidy and lazy implementations

• Analysis of Cartesian computations
– Computation density and temp data ratio
– Some theorems

• Application to examples

Page 21:

Data movement in Cartesian computations

Focus on the movement of A, B, and temp data (ignoring other data) into an architectural component (cache, core, chip, or node).

Same analysis for sequential or parallel execution.

Useful metrics:

Bits moved into the component give an approximate energy or power requirement (power = energy/time)

– ignores the difference between moving from DRAM vs. from another chip
– ignores the difference between near and distant network moves

Computation/communication ratio = ops/sec ÷ bits/sec = speed ÷ bandwidth


Page 22:

Quantifying data movement

Measure chunks of A and B in bits
Tile area is measured in “square bits”
In a uniform problem, tile area is proportional to work
Define compute density ρ = work/bit² (e.g. floatOps/bit²)

Measure temp data in bits
Suppose an a × b tile generates t bits of temp data
Define (temp) data density d = t/ab (d is “bits/bit²”)
In a uniform problem, d is independent of the tile choice

Suppose all temp data is moved off-chip (or off-core or off-node)
(this almost happens in static partitioning – occasionally B is local):
bits moved = abd
compute speed / bandwidth ≤ ρ/d

Page 23:

y = Mx example

Suppose x and y are 64-bit floats.

Suppose 1/400 of the matrix entries are non-zero
(similar to the NAS cg class B benchmark)

ρ = 2 floatOps/(400 × 64 bits × 64 bits)
  ≈ 1.2 × 10⁻⁶ floatOp/bit²

Suppose the temp data is {64-bit float, 32-bit index}

d = 96 bits/(400 × 64 × 64 bit²)
  ≈ 58 × 10⁻⁶ bit⁻¹

Note: ρ/d = 1 floatOp/48 bits
≈ compute/communication for static partitioning

A typical supercomputer has only 4 bits of DRAM bandwidth per floatOp
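These densities are easy to check; a small sketch (Python) reproduces the slide's arithmetic:

    # Densities for sparse y = Mx with 1/400 nonzeros and 64-bit vector entries.
    bits_per_entry = 64
    nz_fraction = 1 / 400

    rho = 2 * nz_fraction / bits_per_entry**2    # floatOps per bit^2
    d = 96 * nz_fraction / bits_per_entry**2     # temp bits per bit^2

    print(f"rho = {rho:.2e} floatOp/bit^2")      # ~1.2e-06
    print(f"d   = {d:.2e} per bit")              # ~5.9e-05
    print(f"rho/d = 1 floatOp per {d / rho:.0f} bits")   # 48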

Page 24:

Lower bound on tidy implementations

Recall: the intermediate data is consumed immediately, in the same component where it is produced.

And every square bit of A × B co-resides in memory at some time.

Theorem: If a machine component with memory capacity c executes an a × b tile tidily, with a > c and b > c, then at least ab/c bits must move into the component during the computation, even if the memory contained data at the start.

Note: This is better than abd (static partitioning) if 1/c < d

Corollary: computation speed ≤ ρ × memory capacity × bandwidth

Page 25:

Proof (if memory is initially empty):

Assign a penny to each square bit (i,j) in the tile (a total of ab pennies)

The first time i and j are in the same core, give the penny to whichever load operation brought i or j in later

No load can get more than c pennies (the loaded bit can pair with at most the c bits resident at the time)

Thus, there must be at least ab/c load operations.

Proof (general case):

Messy.

Page 26:

Upper bound on tidy implementations

Tiling an a × b Cartesian problem:

For each chunk Bi of B of size c(1-ε):
    Move Bi into local memory
    Stream through all nibbles of A of size εc
    Execute the tile nibble × Bi

B moves into the component once: b bits of B are moved
At most b/(c(1-ε)) + 1 chunks of B, so ab/(c(1-ε)) + a bits of A are moved
Temp data is always local.

Total: ab/(c(1-ε)) + a + b bits

For large a and b and small ε, this is about ab/c, matching the lower bound
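A sketch of this tidy schedule (Python; the chunk and nibble sizes are parameters standing in for c(1-ε) and εc, and the histogram tile is just one illustrative example of execute_tile):

    # Tidy tiling: for each resident chunk of B, stream all of A through.
    def tidy_execute(A, B, chunk_size, nibble_size, execute_tile):
        for bi in range(0, len(B), chunk_size):
            b_chunk = B[bi:bi + chunk_size]          # move Bi into local memory
            for ai in range(0, len(A), nibble_size):
                nibble = A[ai:ai + nibble_size]      # stream a nibble of A in
                execute_tile(nibble, b_chunk, bi)    # temp data stays local
            B[bi:bi + chunk_size] = b_chunk          # write Bi back

    def tile_histogram(nibble, b_chunk, bi):         # one tile of the histogram
        for key in nibble:
            if bi <= key < bi + len(b_chunk):
                b_chunk[key - bi] += 1

    tally = [0] * 16
    tidy_execute([3, 1, 4, 1, 5, 9, 2, 6], tally,
                 chunk_size=4, nibble_size=2, execute_tile=tile_histogram)
    print(tally)   # [0, 2, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0]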

Page 27:

Hierarchical tiling example

Partition A among DRAM of nodes

Each cache-sized chunk of B makes a tour of all nodes

When at a node,

chunk resides in cache of one core

local part of A streams into core

relevant tile is executed

Data movement of B between nodes is ab / (DRAM size)
(corresponds to the horizontal lines in the picture)

Movement of A into cores is ab / (cache size)
(the vertical lines in the picture)

[Figure: a × b grid with A along one axis and B along the other, cut horizontally at DRAM-size intervals and vertically at cache-size intervals]

Page 28:

Lower bound for bucket tiling

Recall: Bucket tiling has only two operations:

move bits of A into component, generate temp data

move bits of B into component, consume temp data

Theorem: A bucket tiled uniform Cartesian computation, where temp data is stored in a component with memory capacity c, moves at least ab(2d/c)½ bits of A and B into the component.

Proof: See Carter & Gatlin, FOCS ’98

But (d/c)½ is the geometric mean of d and 1/c, so this is worse than abd (static partitioning) or ab/c (tidy), or both.

Possible advantage over static partitioning:

If B doesn’t fit in the sum of the local memories, static partitioning isn’t possible

Bucket tiling communication is to nearest neighbor


Page 29:

Upper bound on bucket tiling

Partition the space into m × n tiles
The entire a × b problem needs ab/mn tiles
A tile generates mnd bits of temp data; this must be ≤ c

For each tile:
    Move m bits of A into the node and create temp data from it
    Consume the temp data by moving and updating n bits of B
⇒ m+n bits of A and B moved per tile (the temp data stays in cache)

Data movement is minimized if m = n = (c/d)½
so total communication = ab/(c/d) × 2(c/d)½ = 2ab(d/c)½

A 6×6 tile needs 12 bits moved (3 points per bit)
A 4×9 tile needs 13 bits moved
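A two-line check (Python) that, for a fixed tile area, a square tile minimizes the bits moved per tile point:

    # Bits of A and B moved per point of an m x n tile: (m+n)/(m*n).
    comm = lambda m, n: (m + n) / (m * n)
    print(comm(6, 6), comm(4, 9))   # 0.333... vs 0.361...: the square tile wins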

Page 30:

(Theoretically) better implementation

Form a triangular tile with side m
m(m-1)d/2 bits of temp data; this must be ≤ c, so m ≈ (2c/d)½
Slide the tile along by deleting one column and generating one row

communication = ab(2d/c)½

√2 times better (4 vs. 3 points per bit moved in the example), matching the lower bound

[Figure: triangular tile sliding along the diagonal; each step deletes one column and generates one row]

Page 31:

Which implementation is best?

Method              | Bits of cache to shared memory | Bits to nearest-neighbor node | Bits to random node
Static partitioning | (relatively small)             | none                          | ~abd
Bucket tiling       | abd                            | 2ab(d/c2)½                    | none
Tidy                | ab/c1                          | ab/c2                         | none

Each core has a cache of size c1
The node has memory of size c2 >> c1
a = |A| > c2, b = |B| > c2
d = temp data ratio

Page 32:

Sparse Matrix-Vector example

Method              | Bits of cache to shared memory | Bits to nearest-neighbor node | Bits to random node
Static partitioning | (relatively small)             | none                          | abd = ab/2KB (Worst!)
Bucket tiling       | abd                            | 2ab(d/c2)½ = ab×0.015/c2½     | none
Tidy                | ab/c1 (c1 > 2KB)               | ab/c2 (c2 >> 2KB)             | none

Each core has a cache of size c1
The node has memory of size c2 >> c1
a = |A| > c2, b = |B| > c2
d = 58 × 10⁻⁶ bit⁻¹ ≈ (2 KB)⁻¹

Page 33:

Histogram example

Suppose A is 64-bit integers and B is 16-bit counters

There’s one increment operation per row

An a x b tile (a and b are bits) represents a/64 rows

In a uniform problem, tile contains b/|B| of indices

so tile has (a/64)x(b/|B|) = ab/64|B| “hits”

= work/bit2 = (ab/64|B|)/ab = 1 / 64|B| adds/bit2

Suppose we need 32-bit indices for temp data

(fewer bits are needed since index is limited in tile)

d = t/ab = (ab/64|B|) x 32/ab = 1 / (2|B|)

Note: /d = 1 add/32 bits (not at all surprising) ≈ compute/communication for static partitioning

|B| is size of B in bits

5

65

Key

Tally
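The same arithmetic as a tiny check (Python; the value of |B| is arbitrary, since it cancels):

    B_bits = 2**20             # |B|, the size of Tally in bits (arbitrary)
    rho = 1 / (64 * B_bits)    # adds per bit^2
    d = 1 / (2 * B_bits)       # temp bits per bit^2
    print(f"rho/d = 1 add per {d / rho:.0f} bits")   # 32, independent of |B|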

Page 34:

Histogram example

Method              | Bits of cache to shared memory | Bits to nearest-neighbor node | Bits to random node
Static partitioning | (relatively small)             | none                          | abd = a/2 (BEST!)
Bucket tiling       | abd = a/2                      | 2ab(d/c2)½ = a(2b/c2)½ > a    | none
Tidy                | ab/c1 (b/c1 >> 1)              | ab/c2 (b/c2 > 1)              | none

Each core has a cache of size c1
The node has memory of size c2 >> c1
a = |A| > c2, b = |B| > c2
d = 1/(2b)

Page 35:

Conclusions, etc

Computation is free

A theory for data movement (analogous to a theory of computation) is needed for today’s architectures

For Cartesian computations, we can relate the three orthogonal aspects of computers:
– computation speed
– bandwidth
– memory
and choose algorithms accordingly

Results are needed for other classes of computations!

And for architectural features (e.g. sequential access)!

Page 36:

Backup slides

Page 37:

Implementing Cartesian computations
(put some words here)

[Figure: two panels, “Static partitioning” and “Tidy (tiling)”, showing processors P1–P4 over the A × B grid; in the tidy schedule, chunks move between processors (1->2, 1->3, 1->4, 2->1, 2->3, etc.)]

Page 38:

Algorithmic vs. memory analysis

• MergeSort or QuickSort

– O(N lg N) operations

– Sequential memory access

• BucketSort or CountSort

– O(N) operations

– Random memory access

Page 39:

Algorithmic vs. memory analysis

• Sandia study of Linear Algebra algorithms

Millions of floating point ops needed for sample problem

– 1950: 1,000,000,000,000 Mops (Cramer’s rule)

– 1965: 10,000,000 Mops (Gaussian Elimination)

– 1975: 300 Mops (Gauss-Seidel)

– 1985: 8 Mops (Conjugate Gradient)

• Each algorithmic improvement results in less locality in memory references.

Page 40:

Integer Sort Example

• Loop from NAS IS benchmark

For i=1 to numkeys
    Tally[Key[i]]++                      // Random access to Tally (BAD)

• Bucket-tiling*

For i=1 to numkeys                       // Count keys per bucket
    Bcount[Key[i]>>12]++                 //   Bcount in cache (GOOD); Key is sequential (OK)
For k=1 to maxKey>>12                    // Find bucket end points
    Bend[k] = Bend[k-1] + Bcount[k]      //   Bend fits in cache (GOOD)
For i=1 to numkeys                       // Put keys into buckets
    Temp[Bend[Key[i]>>12]++] = Key[i]    //   Bend in cache; Key and Temp are sequential (OK)
For i=1 to numkeys                       // Count keys
    Tally[Temp[i]]++                     //   Active parts of Tally stay in cache (GOOD); Temp OK

Bottom line: the code is over 2 times faster (on an IBM RS6000).

*Alpern & Carter ’94, “Towards a Model for Portable Parallel Performance”

Page 41:

Inspector – Executor paradigm

• Introduced by the Chaos Group at U. Md. (led by Joel Saltz); see e.g. their 1995 TPDS paper

• “Inspector” is runtime code that:

– Iterates through the loops without executing them, recording information about data access order

– Reorders data in arrays or chooses new execution order

– Generates new index arrays

• “Executor” is (modified) code

– Does the actual computation

– Uses reordered data and index arrays
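A minimal sketch of the paradigm (Python; illustrative names, not the Chaos group's actual interface), for an irregular gather y[i] += x[idx[i]]:

    # Inspector-executor sketch for the irregular access y[i] += x[idx[i]].
    def inspector(idx):
        # Iterate without executing: record the access pattern and choose a
        # new execution order that visits x roughly sequentially.
        return sorted(range(len(idx)), key=lambda i: idx[i])   # new index array

    def executor(x, idx, order):
        y = [0.0] * len(idx)
        for i in order:                 # reordered loop does the actual work
            y[i] += x[idx[i]]
        return y

    x = [10.0, 20.0, 30.0, 40.0]
    idx = [3, 0, 2, 0]
    print(executor(x, idx, inspector(idx)))   # [40.0, 10.0, 30.0, 10.0]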