TRANSCRIPT
IBM Research
© 2008
Feeding the Multicore Beast: It’s All About the Data!
Michael Perrone, IBM Master Inventor, Mgr, Cell Solutions Dept.
[email protected]
Outline
History: Data challenge
Motivation for multicore
Implications for programmers
How Cell addresses these implications
Examples
– 2D/3D FFT: Medical imaging, petroleum, general HPC…
– Green’s Functions: Seismic imaging (petroleum)
– String Matching: Network processing (DPI & intrusion detection)
– Neural Networks: Finance
The Hungry Beast
[Figure: Data (“food”) flowing through a Data Pipe to the Processor (“beast”)]
Pipe too small = starved beast
Pipe big enough = well-fed beast
Pipe too big = wasted resources
If flops grow faster than pipe capacity…
… the beast gets hungrier!
Move the food closer
Example: Intel Tulsa (Xeon MP 7100 series)
– 65nm, 349mm², 2 cores
– 3.4 GHz @ 150W
– ~54.4 SP GFLOPS
– http://www.intel.com/products/processor/xeon/index.htm
Large cache on chip
– ~50% of area
– Keeps data close for efficient access
If the data is local, the beast is happy!
– True for many algorithms
What happens if the beast is still hungry?
[Figure: data set too large to fit in cache]
If the data set doesn’t fit in cache
– Cache misses
– Memory latency exposed
– Performance degraded
Several important application classes don’t fit
– Graph searching algorithms
– Network security
– Natural language processing
– Bioinformatics
– Many HPC workloads
Make the food bowl larger
[Figure: data set still exceeding a larger cache]
Cache size steadily increasing
Implications
– Chip real estate reserved for cache
– Less space on chip for computes
– More power required for fewer FLOPS
But…
– Important application working sets are growing faster
– Multicore even more demanding on cache than uni-core
Power Density – The fundamental problem
[Figure: power density (W/cm²) vs. process node (1.5µm down to 0.07µm) for i386, i486, Pentium, Pentium Pro, Pentium II, Pentium III – rising past “hot plate” toward “nuclear reactor” levels]
Source: Fred Pollack, Intel. “New Microprocessor Challenges in the Coming Generations of CMOS Technologies,” Micro32
What’s causing the problem?
Gate dielectric approaching a fundamental limit (a few atomic layers)
[Figure: gate stack cross-section with oxide thickness Tox ≈ 11Å]
[Figure: power density (W/cm²) vs. gate length (microns), 1994 vs. 2004 – passive (leakage) power catching up with active power as gate lengths shrink toward 65nm]
Power, signal jitter, etc...
Diminishing Returns on Frequency
[Figure: chip clock speed (MHz, log scale 10²–10⁴) vs. year (1990–2010), with frequency-driven design points leveling off]
In a power-constrained environment, chip clock speed yields diminishing returns. The industry has moved to lower-frequency multicore architectures.
Power vs. Performance Trade-Offs
[Figure: relative power vs. relative performance – power grows superlinearly with single-core performance along frequency-driven design points]
We need to adapt our algorithms to get performance out of multicore.
Implications of Multicore
There are more mouths to feed
– Data movement will take center stage
Complexity of cores will stop increasing
… and has started to decrease in some cases
Complexity increases will center around communication
Assumption
– Achieving a significant % of peak performance is important
Feeding the Cell Processor
8 SPEs, each with
– LS (local store)
– MFC (memory flow controller)
– SXU (synergistic execution unit)
PPE (64-bit Power Architecture with VMX)
– OS functions
– Disk IO
– Network IO
[Figure: Cell BE block diagram – the PPE (PPU with L1 and L2) and 8 SPEs (SPU/SXU + LS + MFC each) on the EIB (up to 96B/cycle; 16B/cycle per port), with the MIC to dual XDR memory and the BIC to FlexIO]
Cell Approach: Feed the beast more efficiently
Explicitly “orchestrate” the data flow between main memory and each SPE’s local store
– Use the SPE’s DMA engine to gather & scatter data between main memory and local store
– Enables detailed programmer control of data flow
• Get/Put data when & where you want it
• Hides latency: simultaneous reads, writes & computes
– Avoids restrictive HW cache management
• Unlikely to determine optimal data flow
• Potentially very inefficient
– Allows more efficient use of the existing bandwidth
BOTTOM LINE:
It’s all about the data!
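The get/compute/put orchestration above can be sketched as a double-buffered loop. This is an illustrative plain-Python stand-in for what would be C with `mfc_get`/`mfc_put` DMA intrinsics on an SPE: the two buffers model local store, and the list copies model DMA transfers that would run concurrently with the compute.

```python
# Double-buffering sketch: while buffer (k+1) % 2 is being "fetched",
# the core computes on buffer k % 2, hiding transfer latency.
def process(stream, compute, chunk=4):
    chunks = [stream[i:i + chunk] for i in range(0, len(stream), chunk)]
    bufs = [None, None]                   # two local-store buffers
    out = []
    if not chunks:
        return out
    bufs[0] = list(chunks[0])             # "DMA get" the first chunk
    for k in range(len(chunks)):
        nxt = (k + 1) % 2
        if k + 1 < len(chunks):
            bufs[nxt] = list(chunks[k + 1])  # prefetch next chunk
        out.extend(compute(v) for v in bufs[k % 2])  # compute on current
        # a "DMA put" of these results would overlap the next compute
    return out
```

In real SPE code the get, compute, and put genuinely overlap; here the pattern is shown sequentially for clarity.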
Cell Comparison: ~4x the FLOPS @ ~½ the power
Both 65nm technology (to scale)
Memory Managing Processor vs. Traditional General Purpose Processor
[Figure: die photos comparing conventional IBM, AMD, and Intel general-purpose processors with the Cell BE]
Examples of Feeding Cell
2D and 3D FFTs
Seismic Imaging
String Matching
Neural Networks (function approximation)
Feeding FFTs to Cell
[Figure: input image streamed through double-buffered tiles; tiles transposed into the transposed image]
SIMDized data
DMAs double buffered
Pass 1: For each buffer
• DMA Get buffer
• Do four 1D FFTs in SIMD
• Transpose tiles
• DMA Put buffer
Pass 2: For each buffer
• DMA Get buffer
• Do four 1D FFTs in SIMD
• Transpose tiles
• DMA Put buffer
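The two passes above can be sketched end to end: each pass does 1D transforms along rows and then transposes, so both passes touch only unit-stride rows (the access pattern the DMA engine moves efficiently). A naive O(N²) DFT in plain Python stands in for the SIMDized 1D FFT kernel a Cell implementation would use.

```python
# Two-pass 2D FFT: row transforms + transpose, applied twice.
import cmath

def dft_1d(row):
    n = len(row)
    return [sum(row[k] * cmath.exp(-2j * cmath.pi * j * k / n)
                for k in range(n)) for j in range(n)]

def transpose(m):
    return [list(col) for col in zip(*m)]

def fft_2d(image):
    # Pass 1: 1D transforms along rows, then transpose.
    pass1 = transpose([dft_1d(r) for r in image])
    # Pass 2: identical code; the second transpose restores the layout.
    return transpose([dft_1d(r) for r in pass1])
```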
3D FFTs
Long stride trashes cache
Cell DMA allows prefetch
[Figure: N×N×N volume – stride-1 access along the fastest axis vs. stride-N² access along the slowest; DMA gathers a contiguous data envelope around each element]
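A minimal sketch of the DMA-prefetch idea for the stride-N² axis, assuming the N×N×N volume is stored flat with x fastest: rather than letting a cache chase stride-N² loads, each z-pencil is explicitly gathered into a contiguous buffer, transformed there, and scattered back. `gather_pencil`/`scatter_pencil` are illustrative names for what the MFC’s DMA-list transfers would do.

```python
# Flat N^3 volume, index = x + N*y + N*N*z (x fastest).
N = 4
vol = list(range(N * N * N))

def gather_pencil(vol, x, y):
    """Copy the stride-N^2 z-pencil at (x, y) into a contiguous buffer."""
    base = x + N * y
    return [vol[base + N * N * z] for z in range(N)]

def scatter_pencil(vol, x, y, buf):
    """Write a transformed pencil back to its strided home locations."""
    base = x + N * y
    for z, v in enumerate(buf):
        vol[base + N * N * z] = v
```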
Feeding Seismic Imaging to Cell
New G at each (x, y)
Radial symmetry of G reduces BW requirements
[Figure: data volume with a Green’s function window centered at each (X, Y)]
Σᵢⱼ D(x+i, y+j) G(x, y, i, j)
Feeding Seismic Imaging to Cell
[Figure: the data volume partitioned into overlapping column blocks, one per SPE (SPE 0 – SPE 7)]
Feeding Seismic Imaging to Cell
For each X
– Load next column of data
– Load next column of indices
– For each Y
• Load Green’s functions
• SIMDize Green’s functions
• Compute convolution at (X, Y)
– Cycle buffers
[Figure: data buffer of height H and width 2R+1, with a Green’s index buffer, sliding across the volume centered at (X, Y)]
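The per-(X, Y) step in the loop above amounts to a windowed sum over a (2R+1)-wide neighborhood. A minimal sketch, with the Green’s function simplified to a fixed (2R+1)×(2R+1) table G(i, j) rather than the full G(x, y, i, j), and plain Python standing in for SIMDized C with double-buffered columns:

```python
# Windowed sum: sum over i, j in [-R, R] of D(x+i, y+j) * G(i, j).
R = 1

def convolve_at(D, G, x, y):
    # D indexed as D[x][y]; G indexed as G[i + R][j + R].
    return sum(D[x + i][y + j] * G[i + R][j + R]
               for i in range(-R, R + 1)
               for j in range(-R, R + 1))
```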
Feeding String Matching to Cell
Find (lots of) substrings in (long) string
Build graph of words & represent as DFA
Problem: Graph doesn’t fit in LS
Sample word list: “the”, “that”, “math”
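One standard way to build the word graph above is an Aho-Corasick automaton: a trie over the word list plus failure links, which together behave as a DFA scanning the input in a single pass. This plain-Python sketch makes no attempt at the local-store tiling a Cell port would need.

```python
# Aho-Corasick: trie + BFS-computed failure links, then one-pass scan.
from collections import deque

def build_dfa(words):
    trie = [{}]            # per-node transition tables; node 0 = root
    out = [set()]          # words ending at each node
    for w in words:
        s = 0
        for ch in w:
            if ch not in trie[s]:
                trie.append({})
                out.append(set())
                trie[s][ch] = len(trie) - 1
            s = trie[s][ch]
        out[s].add(w)
    fail = [0] * len(trie)  # failure link = longest proper-suffix state
    q = deque(trie[0].values())
    while q:
        s = q.popleft()
        for ch, t in trie[s].items():
            q.append(t)
            f = fail[s]
            while f and ch not in trie[f]:
                f = fail[f]
            fail[t] = trie[f].get(ch, 0)
            out[t] |= out[fail[t]]   # inherit matches ending here
    return trie, fail, out

def match(text, dfa):
    trie, fail, out = dfa
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in trie[s]:
            s = fail[s]
        s = trie[s].get(ch, 0)
        for w in out[s]:
            hits.append((i - len(w) + 1, w))   # (start index, word)
    return hits
```

The transition tables are exactly the graph that, for large word lists, outgrows the 256KB local store, which is why the data flow (tiling the tables) dominates the design on Cell.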
Feeding Neural Networks to Cell
Neural net function F(X)
– RBF, MLP, KNN, etc.
If too big for LS, BW bound
N basis functions: dot product + nonlinearity
D input dimensions
D×N matrix of parameters
[Figure: network diagram mapping input X through N basis functions to output F]
Convert BW Bound to Compute Bound
Split function over multiple SPEs
Avoids unnecessary memory traffic
Reduces compute time per SPE
Minimal merge overhead
[Figure: the network split across SPEs, partial outputs merged into F]
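The split-and-merge above can be sketched for an RBF-style network F(X) = Σₙ wₙ φ(‖X − cₙ‖²): each "SPE" (a plain Python worker here) keeps its slice of the N basis functions resident, so the parameters are read once instead of streamed repeatedly, and the merge is a single addition per worker. The Gaussian φ and all names are illustrative.

```python
# Partition N basis functions across workers; merge partial sums.
import math

def partial_F(x, centers, weights):
    """One worker's share: its resident slice of basis functions."""
    return sum(w * math.exp(-sum((xi - ci) ** 2
                                 for xi, ci in zip(x, c)))
               for c, w in zip(centers, weights))

def F(x, centers, weights, n_spes=4):
    # Strided split of the N basis functions into n_spes chunks.
    chunks = [(centers[i::n_spes], weights[i::n_spes])
              for i in range(n_spes)]
    return sum(partial_F(x, c, w) for c, w in chunks)
```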
Moral of the Story: It’s All About the Data!
The data problem is growing: multicore
Intelligent software prefetching
– Use DMA engines
– Don’t rely on HW prefetching
Efficient data management
– Multibuffering: Hide the latency!
– BW utilization: Make every byte count!
– SIMDization: Make every vector count!
– Problem/data partitioning: Make every core work!
– Software multithreading: Keep every core busy!
Abstract
Technological obstacles have prevented the microprocessor industry from achieving increased performance through increased chip clock speeds. In reaction to these restrictions, the industry has chosen the path of multicore processors. Multicore processors promise tremendous GFLOPS performance but raise the challenge of how one programs them. In this talk, I will discuss the motivation for multicore, the implications for programmers, and how the design of the Cell/B.E. processor addresses these challenges. As an example, I will review one or two applications that highlight the strengths of Cell.