Mapping Data-intensive Applications to an Explicitly Managed Memory Architecture: Challenges and Solutions Pierre G. Paulin Director, SoC Platform Automation STMicroelectronics (Canada) Inc. Memory Architecture and Organization Workshop Montreal, Canada, 3 October 2013


Page 1: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1

Mapping Data-intensive Applications to an Explicitly Managed Memory Architecture: Challenges and Solutions

Pierre G. Paulin
Director, SoC Platform Automation
STMicroelectronics (Canada) Inc.

Memory Architecture and Organization Workshop Montreal, Canada, 3 October 2013


Outline

• Cache vs. Explicitly Managed Memory (EMM)

• STHORM (aka Platform 2012) many-core fabric
  • Explicitly Managed Memory (EMM) architecture

• STHORM programming tools

• Computer vision application benchmark results

• Challenges of programming EMM architectures

• Outlook: Emerging programming tool solutions

Cache vs. EMM

07/10/2013

[Figure: memory hierarchies compared. Cache-based: cores with private L1 caches, a coherent L2 cache, and L3 memory, served by refill and coherency hardware. EMM-based: cores with TCDM, either per core or as a shared L1 TCDM behind a local interconnect, with L2 memory (or an L2 usable as memory or cache) and L3 memory, served by DMA engines. Latency labels down the hierarchy: 1 cycle, tens of cycles, hundreds of cycles.]


Cache vs. Explicitly Managed Memory

• Cache
  Pros:
  • High performance
  • Programming simplicity
  Cons:
  • Higher power, cost
  • Unpredictable performance

• EMM
  Pros:
  • Lower power, cost
  • Performance predictability
  • Explicit control of data movement
  Cons:
  • Poor handling of irregular accesses (scatter-gather)
  • Programming complexity
    • Use of DMAs for explicit data movement
    • Need to overlap communication and computation


STHORM Many-core Fabric: Explicitly Managed Memory Architecture

STMicroelectronics STHORM Platform

[Figure: efficiency spectrum in GOPS/mm² and GOPS/W, running from general-purpose computing (CPU, software) through throughput computing (GP-GPU) to hardware IP, with STHORM positioned as a mixed SW/HW platform. Scale labels: 1, 3, 8, >100.]


What is STHORM?

• Massively parallel programmable processor:
  • Delivers high performance per mm²: 20 GOPS/cluster, 3.7 mm²/cluster (28 nm)
  • Low power, can be battery-powered: 100 GOPS/W
  • Explicitly managed memory architecture
  • Scalable GALS architecture: each cluster has its own (F, V) operating point

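[Figure: STHORM block diagram. The host connects through a bridge to the fabric controller (FC), the L2 memory, and the clusters. Each cluster contains 16 STxP70 cores (Core 1 to Core 16, each with integer and floating-point units), L1 memory, a synchronizer, DMAs, DVFS, and a cluster controller.]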

STHORM Cluster Architecture

[Figure: cluster composition. A software cluster (parametric SMP: cores, shared memory, synchronizer) and hardware processing elements (HW PE, each made of HW PUs and a Comm block) attach to an asynchronous local interconnect, together with the cluster controller (control processor, IOs) and a network interface (NI) to the top-level asynchronous 2D-mesh network. Regions are labelled "User defined" and "STHORM infrastructure".]

• Scalable GALS architecture
• Each cluster has its own operating point (F, V)
• Asynchronous 2D mesh interconnect


LD/ST and DMA Memory Transfers

• NUMA architecture

• Intra-cluster:
  • LD/ST (UMA)
  • DMA: from/to TCDM

• Inter-cluster:
  • LD/ST (NUMA)
  • DMA: L1 to/from L1

• Cluster to/from L2-Mem:
  • LD/ST (NUMA)
  • DMA: L1 to/from L2

• Cluster to/from L3-Mem (through the system bridge):
  • LD/ST (NUMA)
  • DMA: L1 to/from L3

[Figure: STHORM fabric as a grid of clusters, each with its own L1-MEM, plus the fabric controller, the L2-MEM, and the system bridge.]

STHORM Cluster Overview

[Figure: STHORM cluster architecture. Sixteen STxP70 + FPx processing elements and an STxP70 cluster processor (CP), each with a 16 KB program cache (P$), reach the shared tightly coupled data memory (TCDM, memory banks #1 to #32, labelled "32-KB TCDM") through the TCDM logarithmic interconnect. Other blocks: peripheral logarithmic interconnect, global interconnect interface, cluster controller interconnect (CCI), DMA channels #0 and #1, hardware synchronizer (HWS), timers, and the debug and test unit (DTU).]

• Parametric multi-core crossbar with a logarithmic structure
• Reduced arbitration complexity
• Round-robin arbitration scheme
• Up to N memory accesses per cycle
• Test-and-set support

Shared Memory

[Figure: logarithmic interconnect between N processors (P1 to P4) and 2N memory banks (B1 to B8), built from a routing tree and an arbitration tree.]

Shared Memory (contd.)

Shared Memory Conflicts

[Figure: relative computation time versus LD/ST percentage (axis 0 to 1.4), with theoretical and measured series and a marked "typical case". Data labels:]

LD/ST %             0%      35.7%    50%      66.7%    100%
Computation time    1       1.0328   1.0721   1.1329   1.2536

STHORM Cluster Overview

[Figure: STHORM cluster architecture diagram (repeated), shown with the properties of the DMA channels:]

• Supports 1D and 2D transfers
• Up to 3.2 GB/s peak per DMA
• Supports up to 16 outstanding transactions
• Out-of-order (OoO) support


STHORM Programming Tools


Programming STHORM

[Figure: tool flow. The application, expressed in a programming model, is mapped onto the platform under mapping control, and performance feedback and debug information flow back together with mapping info. Platform: ARM Cortex-A9 host on the system bus; fabric controller (FC) with its memory, controller peripherals, and DMA; clusters on an asynchronous NoC. Each cluster: PE0 to PEn, shared L1 memory, cluster controller (CC) with memory and peripherals, hardware synchronizer, multichannel DMA, and hardware PEs (HWPE 0 to HWPE N, with send/receive and read/write ports) on the A-LIC.]

OpenCL: Standard of the Convergence

[Figure: spectrum from multi-core CPUs (Cortex-A9, Cortex-A15: more programmability, task parallelism with run-to-completion) to GPUs (Midgard-T6xx, Rogue: more parallelism, data parallelism with some synchronization), with the many-core STHORM (aka P2012) in between.]

STHORM as OpenCL Device

Conceptual OpenCL device:
• Scalable programming model
• Supports SPMD model
• Supports task parallelism
• Covers complex memory hierarchy
• Supports async memory transfers

STHORM compute device:
• Scalable architecture (cluster based)
• Supports SPMD with 0-cost branch divergence
• Supports task parallelism
• Shared local memory
• 1D/2D DMA engines
• Hardware synchronizer

[Figure: the conceptual OpenCL device (compute units 1 to N, each with processors 1 to M, private memory per processor, and a local memory; a global/constant memory data cache; global memory) beside the STHORM compute device (clusters 1 to N, each with 16 xp70+FPx PEs, 256 KB of shared L1 memory, and 2 DMA channels for async transfers; global memory in external DDR memory, L3). Note on the STHORM side: P$ refill, static and dynamic buffer allocation.]


Global OpenCL Execution Model

• Host-centric, devices are asynchronous

• Supports two kinds of parallelism
  • DLP (data-level parallelism): multiple instances of a kernel run in parallel
  • TLP (task-level parallelism): different kernels executed in parallel

• Communication
  • Between work-items of the same work-group
  • Between kernels: through buffers in global memory
  • Between host and kernels: memory-mapped buffers

• Synchronization
  • Between work-items: work-group 'barrier'
  • Between commands of a command queue: out-of-order

STHORM ≠ GPU: Kernel Level Parallelism

Independent clusters + independent processors
• Work-item divergence is not an issue in STHORM
• GPUs can support it, but at the cost of stalling threads
• STHORM supports more complex OpenCL task graphs than GPUs. Both task-level and data-level (ND-Range) parallelism are possible

STHORM cores are not HW-multithreaded
• The STHORM OCL runtime does not accept more than 16 work-items per work-group when creating an ND-Range
• But you can choose which work to do in each work-item (e.g. using "case")

[Figure: how work-groups (WG) are laid out on a GPU compared with STHORM.]
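The last point, choosing the work per work-item with a "case", can be sketched in plain C. This is a host-side stand-in rather than STHORM runtime code: `work_item`, `run_work_group`, and the three placeholder tasks are names invented for illustration, and `id` plays the role that `get_local_id(0)` would play inside a kernel.

```c
#include <assert.h>

enum { WG_SIZE = 16 };   /* work-items per work-group, per the limit above */

/* Work performed by one work-item. `id` stands in for get_local_id(0);
   the three tasks are placeholders invented for this sketch. */
static void work_item(int id, const float *in, float *out, int n)
{
    switch (id % 3) {
    case 0:   /* task A: scale */
        for (int k = 0; k < n; k++) out[k] = 2.0f * in[k];
        break;
    case 1:   /* task B: offset */
        for (int k = 0; k < n; k++) out[k] = in[k] + 1.0f;
        break;
    default:  /* task C: negate */
        for (int k = 0; k < n; k++) out[k] = -in[k];
        break;
    }
}

/* Host-side stand-in for one work-group: runs the work-items one after
   another. `out` holds WG_SIZE rows of n floats, one row per work-item. */
void run_work_group(const float *in, float *out, int n)
{
    assert(n > 0);
    for (int id = 0; id < WG_SIZE; id++)
        work_item(id, in, out + id * n, n);
}
```

On independent, non-lockstep cores each branch runs at full speed on its own processor, which is the property the slide contrasts with thread stalling on GPUs.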

STHORM Memory Mapping

[Figure: mapping of the OpenCL memory spaces onto the STHORM memory hierarchy.]

• L1 (shared TCDM), 256 KB, ~2 cycles: local memory, private memory (work-item stacks), constant memory, and the runtime (a "~10 KB" label appears beside these)
• L2, 1 MB, ~20 cycles: program cache refill, allocation of static and non-static buffers
• L3 (external memory), 150~200 cycles: global memory (physically contiguous); host memory (virtualized)
• Residency labels: resident, semi-resident (per program build), non-resident (per kernel run)

Data Movements inside STHORM

[Figure: data paths between L3 global memory (buffers) and the shared 256 KB L1, which holds local memory, private memory (work-item stacks), constant memory, and the runtime (~10 KB). Labels on the paths:]

• Scalar/vector load/store
• async_work_group_copy
• async_work_group_2d_copy
• async_work_item_[2d]_copy
• Program load time

OpenCL Programming Tools

[Figure: tool chain. The OpenCL application goes through the OCL compiler and runs on the SoC (host plus STHORM clusters 0 to N, each with a runtime (RT) and a cluster controller (CC)). The system trace module and the debug and test unit produce low-level traces that feed a trace database, used by OpenCL-aware visualization and analysis tools and by an OpenCL-aware debugger.]

OpenCL encodings:
• Symbolic names / physical IDs
• Trace point states (enabled / disabled)
• Compact encoding to reduce instrumentation cost
• Debug symbolic info


STHORM Target Platform

STHORM Test Chip

• 28 nm LP CMOS process
• 4 clusters, 69 dual-issue processors
• 80 GOPS
• 1 MB L2 memory
• 600 MHz typical
• 200 mW per cluster typical (400 mW worst case)
• 3.7 mm² per cluster
• Fully functional samples available since Dec. 2012

Energy efficiency = 100 GOPS/W
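As a quick consistency check, the headline numbers on this slide can be derived from one another. The processor breakdown in the last line is an assumption, based on the cluster processor (CP) and fabric controller (FC) that appear on the architecture slides; the slide itself only states the total of 69.

```latex
\begin{align*}
\text{throughput per cluster} &= \frac{80\ \text{GOPS}}{4\ \text{clusters}} = 20\ \text{GOPS/cluster} \\
\text{energy efficiency} &= \frac{20\ \text{GOPS}}{0.2\ \text{W}} = 100\ \text{GOPS/W} \\
\text{area efficiency} &= \frac{20\ \text{GOPS}}{3.7\ \text{mm}^2} \approx 5.4\ \text{GOPS/mm}^2 \\
\text{processors (assumed split)} &= 4 \times (16\ \text{PE} + 1\ \text{CP}) + 1\ \text{FC} = 69
\end{align*}
```

The energy figure uses the typical 200 mW per cluster; at the 400 mW worst case the same arithmetic gives 50 GOPS/W.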

STHORM Evaluation Board

[Figure: STHORM SoC + Zynq board. STHORM 2D SoC: four clusters (16 PEs each, each labelled 512K), fabric controller (32K), 1 MB L2 memory, asynchronous NoC (ANoC), and an SNoC serdes interface. Zynq FPGA (Z7020): Cortex-A9 host, SNoC/NI, SNoC-AXI bridge, AXI, and 512 MB of L3 memory. The link between the two chips is labelled 1 GB/s. Board I/O: Ethernet, USB, SD card interface, HDMI (HD video), 2 CMOS image sensor interfaces, Bluetooth, UART, MEMS, and STM-Probe / STMC2 debug and trace.]

STHORM Evaluation Board

[Figure: annotated photo of the board: STHORM test chip; Zynq host (Cortex-A9, peripherals); 2 CMOS imaging plug-ins; Bluetooth module; 3x USB (OTG, serial, JTAG); Ethernet; MEMS (accelerometer, gyroscope, magnetometer) and 2 microphones; debug and trace.]


Applications & Benchmarks


Benchmarking Visual Analytics Applications

• Objective: evaluate the speedup achievable when offloading visual analytics applications from the ARM host to the STHORM accelerator

• Corner detection: FAST, SIFT

• Feature tracking: PKLT

• Input image: VGA resolution, 777 corners detected

• Measurement results collected by analyzing execution traces from STHORM evaluation board

Visual Analytics Benchmark

[Figure: two bar charts comparing one STHORM cluster (16 processors, 600 MHz, 3.7 mm²) with an ARM Cortex-A9 + NEON dual core at 1 GHz (one core used). Running time (ms), axis 0 to 300, for KLT (no pyramid), PKLT (1 level), FAST9, and FAST ROSTEN. Speedup of 1 STHORM cluster versus 1 CA9, axis 0 to 12, for the same benchmarks plus their geometric mean.]

Visual Analytics Benchmark

[Figure: the same running-time and speedup charts (1 STHORM cluster, 16 processors at 600 MHz, versus one core of an ARM Cortex-A9 + NEON dual core at 1 GHz), with an added annotation:]

8X GOPS/mm², 25X GOPS/mm²*W

KLT Scalability: Execution Time

[Figure: bar chart of execution time in ms (less is better), axis 0 to 10, for STHORM with 4, 8, 16, and 32 processors at 600 MHz and for the CA9 at 1 GHz. Runs are grouped as 1 cluster (1 WG, <n> WI) and 2 clusters (2 WG, 32 WI).]

Same compiled code, different runtime parameters (ND-Range).

WG = work-group, WI = work-item

KLT Scalability: Speed-up

[Figure: bar chart of speedup versus CA9 (more is better), axis 0 to 12, for STHORM with 4, 8, 16, and 32 processors at 600 MHz. Runs are grouped as 1 cluster (1 WG, <n> WI) and 2 clusters (2 WG, 32 WI). Data labels, in bar order: 1.8x, 3.5x, 6x, 10.5x.]

Same compiled code, different runtime parameters (ND-Range).

WG = work-group, WI = work-item

FAST Application Performance

[Figure: bar chart comparing FAST performance across platforms; the platform names are not recoverable from the extracted text. Data labels: 260, 31, 25, 12, 7, 1.3.]

Best in class performance

FAST Application Power

Best in class energy efficiency


Challenges of Programming EMM Architectures

OpenCL Trace Visualization Ex. (1 Cluster)

[Figure: execution trace with one row per work-item (WI) instance. Legend: running, DMA prog., waiting, barrier.]

OpenCL Trace Visualization Ex. (4 Clusters)

[Figure: execution trace with one row per work-item (WI) instance. Legend: running, DMA prog., waiting, barrier.]

Main Source of Complexity and Bugs

• Data slicing/tiling to exploit the L1 memory
  [Figure: the data set is split into tiles (Tile 0, Tile 1), each handled by a work-group of 16 work-items.]

• Asynchronous transfer operations to fully exploit the DMA
  [Figure: read, process, and write stages.]

• Merge multiple functions into a single OpenCL kernel to maximize data locality and reduce host-device interaction
  [Figure: Sobel pipeline: gradient x, gradient y, Euclidean norm. Bar chart of Sobel 3x3 performance (lower is better, axis 0 to 400): 100 with 1 OpenCL kernel, 378 with 3 OpenCL kernels.]
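The kernel-merging point can be illustrated with a plain C sketch of the Sobel example. It is not the code measured on the slide: the function names are invented here, borders are simply skipped, and the squared gradient magnitude is used in place of the Euclidean norm so that the sketch needs no math library. The unfused version makes three passes and stores two full-size intermediate images, which on an EMM target would mean extra round trips to L3; the fused version computes both gradients and combines them while the pixel neighbourhood is still local.

```c
#include <assert.h>

/* 3x3 Sobel responses at interior pixel (x, y) of a row-major image. */
static float grad_x(const float *img, int w, int x, int y)
{
    const float *p = img + y * w + x;
    return (p[-w + 1] + 2.0f * p[1] + p[w + 1])
         - (p[-w - 1] + 2.0f * p[-1] + p[w - 1]);
}

static float grad_y(const float *img, int w, int x, int y)
{
    const float *p = img + y * w + x;
    return (p[w - 1] + 2.0f * p[w] + p[w + 1])
         - (p[-w - 1] + 2.0f * p[-w] + p[-w + 1]);
}

/* Unfused: three passes, two full-size intermediates (gx, gy). */
void sobel_three_passes(const float *img, float *gx, float *gy,
                        float *out, int w, int h)
{
    assert(w >= 3 && h >= 3);
    for (int y = 1; y < h - 1; y++)
        for (int x = 1; x < w - 1; x++)
            gx[y * w + x] = grad_x(img, w, x, y);
    for (int y = 1; y < h - 1; y++)
        for (int x = 1; x < w - 1; x++)
            gy[y * w + x] = grad_y(img, w, x, y);
    for (int y = 1; y < h - 1; y++)
        for (int x = 1; x < w - 1; x++) {
            int i = y * w + x;
            out[i] = gx[i] * gx[i] + gy[i] * gy[i];  /* squared magnitude */
        }
}

/* Fused: one pass, no intermediate images. */
void sobel_fused(const float *img, float *out, int w, int h)
{
    assert(w >= 3 && h >= 3);
    for (int y = 1; y < h - 1; y++)
        for (int x = 1; x < w - 1; x++) {
            float gx = grad_x(img, w, x, y);
            float gy = grad_y(img, w, x, y);
            out[y * w + x] = gx * gx + gy * gy;      /* squared magnitude */
        }
}
```

Both functions leave border pixels untouched, so a caller would zero-initialise the output image first.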

Hide Memory Latency: Software Pipeline

Kernel code, for each block:
• Work-group async-copy: next input data to local
• Work-group async-copy: previous output data to global
• Process my own current input local block
• Work-group barrier
• Wait for async-copies completion

[Figure: timeline in which the read, process, and write stages of successive blocks overlap.]
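The benefit of this overlap can be estimated with a simple cost model. The model is an assumption made here for illustration, not a result from the talk: each of the N blocks needs a read time T_r, a compute time T_c, and a write time T_w; the two DMA transfers of a step share one channel; and transfers overlap fully with computation.

```latex
\begin{align*}
T_{\text{serial}}    &= N \,(T_r + T_c + T_w) \\
T_{\text{pipelined}} &\le (N + 2)\,\max\!\bigl(T_c,\; T_r + T_w\bigr)
\end{align*}
```

The N + 2 steps account for the pipeline fill and drain. Under these assumptions a compute-bound kernel with many blocks approaches N·T_c, that is, the transfer time is hidden, while a kernel with low compute intensity stays bound by T_r + T_w.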

Hide Memory Latency: Double Buffer

Local memory holds four buffers: In0, In1, Out0, Out1. Each step can write (W), compute (C), and read (R).

Prolog
  Step 0:   R: block 0 → local in0
  Step 1:   C: local in0 → local out0
            R: block 1 → local in1

Loop body
  Step 2:   W: local out0 → block 0
            C: local in1 → local out1
            R: block 2 → local in0
  Step i:   W: local out[i%2] → block i-2
            C: local in[(i+1)%2] → local out[(i+1)%2]
            R: block i → local in[i%2]

Epilog
  Step N:   W: local out[N%2] → block N-2
            C: local in[(N+1)%2] → local out[(N+1)%2]
  Step N+1: W: local out[(N+1)%2] → block N-1
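The schedule above can be exercised with a small host-side C simulation. This is a sketch under stated assumptions: `dma_copy` is a synchronous `memcpy` standing in for an asynchronous DMA transfer, `compute` is an arbitrary placeholder (doubling), and `BLOCK` is an illustrative block size; none of these names come from the STHORM toolchain. The loop runs steps 0 to N+1 and uses the same buffer indices as the slide, so the prolog and epilog fall out of the three guards.

```c
#include <assert.h>
#include <string.h>

enum { BLOCK = 4 };   /* elements per block; illustrative size */

/* Stand-in for an asynchronous DMA copy; here it completes immediately. */
static void dma_copy(float *dst, const float *src)
{
    memcpy(dst, src, BLOCK * sizeof *dst);
}

/* Per-block computation; doubling is an arbitrary placeholder. */
static void compute(float *out, const float *in)
{
    for (int k = 0; k < BLOCK; k++)
        out[k] = 2.0f * in[k];
}

/* Processes nblocks blocks with two input and two output buffers. */
void process_double_buffered(const float *global_in, float *global_out,
                             int nblocks)
{
    float in[2][BLOCK], out[2][BLOCK];   /* In0/In1 and Out0/Out1 */

    assert(nblocks >= 0);
    for (int i = 0; i < nblocks + 2; i++) {
        if (i >= 2)                      /* W: out[i%2] -> block i-2 */
            dma_copy(global_out + (i - 2) * BLOCK, out[i % 2]);
        if (i >= 1 && i <= nblocks)      /* C: block i-1, other buffer pair */
            compute(out[(i + 1) % 2], in[(i + 1) % 2]);
        if (i < nblocks)                 /* R: block i -> in[i%2] */
            dma_copy(in[i % 2], global_in + i * BLOCK);
    }
}
```

Within one step the write, the compute, and the read touch three different buffers, which is what allows real DMA transfers to run concurrently with the computation.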


OpenCL Pros

• OpenCL enables full performance of STHORM EMM
  • Explicitly managed memory by exploiting DMAs and L1 memory
  • OpenCL can adapt to different host-fabric system configurations

• OpenCL is well suited to image processing applications
  • Exploiting data-level parallelism is enough in most cases
  • Coarse-grain task parallelism is still possible with kernel graphs

• Conceptually simple execution model
  • WG barriers
  • Atomic operations

• Good scalability
  • One kernel, multiple execution configurations (number of clusters, number of PEs)


OpenCL Cons

• Lower-level programming
  • Low-level programming (host programming and kernel memory management)
  • First parallelization steps are error-prone (C → OpenCL, image tiling)
  • Substantial learning curve, need knowledge of target architecture
  • Specialization of code for the target
  • Lower programming productivity
  • Wide range of results, depending on programmer experience

• Performance and bandwidth issues with kernels of low compute intensity
  • Data round trip between L1 and L3
  • Host-device interaction
  • Need to manually merge logical functions into a single kernel


Outlook: Emerging Programming Tool Solutions

STHORM Programming Outlook

[Figure: two programming paths onto the STHORM clusters (1 to 4). Expert programmer: API + OpenCL-C, giving an OpenCL-C kernel and a C host wrapper. Algorithm designer: a computer-vision-specific language compiled by KernelGenius into an OpenCL-C kernel and a C host wrapper.]

Kernel Genius Inputs

[Figure: input graph. An NxM image feeds a 5x1 convolution node, whose NxM output feeds a 1x5 convolution node that produces an NxM image. Each node is annotated with its read and write patterns: R: [-2:2] [0:0], W: [0:0] [0:0].]

Read/Write Pattern and Stride Examples

[Figure: three example nodes, panels (a) to (c), annotated with input/output strides and read/write windows:
• 4x4 IDCT node: stride_inX = stride_inY = 4, stride_outX = stride_outY = 4, readX = readY = [0:3]
• 3x3 gradient node: stride_inX = stride_inY = 1, readX = readY = [-1:1]
• 5x5 blur node: stride_inX = stride_inY = 1, readX = readY = [-2:2]
Output strides (stride_outX = 2, stride_outY = 1) and write windows ([0:0], [0:1], [0:4]) also appear in the figure, but their assignment to nodes is not recoverable from the extracted text.]

KernelGenius Compiler

[Figure: the 5x1 / 1x5 convolution graph on an NxM image (read and write patterns R: [-2:2] [0:0], W: [0:0] [0:0]) is compiled into a tiled pipeline: asynchronous copy-tile-in and copy-tile-out operations around synchronous processing of Nx2 and Nx5 tiles, scheduled at cycles 0, 1, 3, and 4.]

Compiler steps:
• Graph merge
• Data tiling
• Tile-based scheduling
• Size the local buffering
• Parallelize the processing
• Manage image borders
• Generate the OpenCL kernel + C wrapper
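To make the "size the local buffering" step concrete, here is a minimal C sketch of how a read window could translate into a number of buffered tile lines. It is an illustration only and not KernelGenius's actual algorithm: `window_t`, `window_extent`, and `lines_needed` are hypothetical names, and the formula assumes a node that produces consecutive output lines at a fixed vertical stride.

```c
#include <assert.h>

/* Inclusive access window along one axis, e.g. [-2:2]. Hypothetical type. */
typedef struct { int lo, hi; } window_t;

/* Number of consecutive lines a window spans. */
static int window_extent(window_t w)
{
    assert(w.lo <= w.hi);
    return w.hi - w.lo + 1;
}

/* Input lines that must be resident in local memory for a node to produce
   `out_lines` consecutive output lines at vertical input stride `stride`. */
int lines_needed(window_t read, int out_lines, int stride)
{
    assert(out_lines > 0 && stride > 0);
    return (out_lines - 1) * stride + window_extent(read);
}
```

With a [-2:2] window and one output line this gives 5 lines, which matches the Nx5 tile shown for the 5-tap convolution in the figure; a [-1:1] window gives 3.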

Buffer Calculation Examples

[Figure: (a) Node A feeds Node B: the tiles output by A are buffered to form the input of B, which produces tile out B. (b) Node A feeds both Node B and Node C: the buffered output tiles of A form Input B and Input C, which produce tile out B and tile out C.]

Buffer Management and Scheduling

[Figure A: scheduled and buffer-allocated graph. Nodes: A, copy tile in (async operation, cycle 0, rate 2); B, downsample (cycle 1, rate 1); C, 5x5 blur (cycle 3, rate 1); D, 3x3 gradient (cycle 4, rate 1); E, copy tile out (async operation, cycle 5, rate 1). B, C, and D are synchronous processing. Edges carry push and pop counts of 1 and peek ranges ([0:0], [-1:1], [-2:2]). Buffers hold 4, 5, 3, and 2 tiles, of width "width" or "width/2".]

[Figure B: scheduler loop iterations 0 to N+5. Prolog: nodes start one after another (2Ai at iteration 0, then B, then C at iteration 3 and D at iteration 4). Steady state: 2Ai, B, C, D in every iteration. Epilog: nodes drain one after another. Async issue (2Ai, Ei) and wait (2Aw, Ew) operations appear in each iteration.]

Simple Example

C code:

    /* `out` is added as a parameter here; the slide's listing writes to it
       without declaring it. */
    void ImageDiff(int w, int h, float *in1, float *in2, float *out) {
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                out[y*w+x] = in1[y*w+x] - in2[y*w+x];
            }
        }
    }

OpenCL kernel code:

    __kernel void ImageDifference(__global DATA_TYPE *global_output,
                                  __global DATA_TYPE *global_input0,
                                  __global DATA_TYPE *global_input1,
                                  __local DATA_TYPE *local_buffer0,
                                  __local DATA_TYPE *local_buffer1,
                                  __local DATA_TYPE *local_buffer2,
                                  int image_width, int image_height) {
        event_t e_in[2], e_out;
        int id = get_local_id(0), nb_it = get_local_size(0);

        /* Local data buffering: each local buffer holds two image lines. */
        __local DATA_TYPE *local_in0[2] = {local_buffer0, local_buffer0+image_width};
        __local DATA_TYPE *local_in1[2] = {local_buffer1, local_buffer1+image_width};
        __local DATA_TYPE *local_out[2] = {local_buffer2, local_buffer2+image_width};

        for (int i = 0; i < (image_height+2); i++) {
            /* Async DMA: write back the line computed in the previous iteration. */
            if (i > 1) {
                e_out = async_work_group_copy(global_output, local_out[i&0x1], image_width, 0);
                global_output += image_width;
            }
            /* Async DMA: fetch the next line of both inputs. */
            if (i < image_height) {
                e_in[0] = async_work_group_copy(local_in0[i&0x1], global_input0, image_width, 0);
                global_input0 += image_width;
                e_in[1] = async_work_group_copy(local_in1[i&0x1], global_input1, image_width, 0);
                global_input1 += image_width;
            }

            /* Computations: process the line fetched in the previous iteration. */
            if ((i > 0) && (i <= image_height)) {
                int j = (~i)&0x1;
                for (int k = id; k < image_width; k += nb_it) {
                    local_out[j][k] = local_in0[j][k] - local_in1[j][k];
                }
                barrier(CLK_LOCAL_MEM_FENCE);
            }

            /* Synchronization: wait for the transfers issued above. */
            if (i > 1) { wait_group_events(1, &e_out); }
            if (i < image_height) { wait_group_events(2, e_in); }
        }
    }

KernelGenius code:

    kernel ImageDiff(int w, int h, float in1[h][w], float in2[h][w]) {
        Operator<float> sub(in1,in2) {
            .function=${ @sub=$in1-$in2; }$
        }
        return sub;
    }

The slide highlights four kinds of code in the OpenCL kernel: synchronization, async DMA, local data buffering, and computations.


Kernel Genius early results (1)


Kernel Genius early results (2)

Canny Edge Example (Step 1)

[Figure: Canny edge detection pipeline: noise reduction, gradient, non-maxima suppression, edge points connection, thinning, edges. It is split into three steps: step 1 as an OpenCL kernel, step 2 as C code on the host, step 3 as an OpenCL kernel. Step 1 graph: Gaussian blur (X blur and Y blur), gradient X and gradient Y, magnitude, non-maxima, strong edge, output image. The blur and gradient stages are built from 7x1 and 1x7 convolution nodes. Node types: convolutions are predefined parametric nodes; the remaining nodes are programmable operator or filter nodes.]


Canny Edge Detector Result


Kernel Genius Lessons

• Use of a domain-specific programming model on EMM

• Pros
  • Higher level of abstraction
  • Automation of DMA-based data movement
  • Automation of scheduler generation
  • Automation of buffer sizing

• Cons
  • Domain-specific: imaging and computer vision
  • Does not cover some highly dynamic use cases
    • Revert back to OpenCL
  • Non-standard programming model

STHORM Programming Outlook

[Figure: programming paths onto the STHORM clusters (1 to 4). Expert programmer: API + OpenCL-C, giving an OpenCL-C kernel and a C host wrapper. Algorithm designer: a computer-vision-specific language compiled by KernelGenius into an OpenCL-C kernel and a C host wrapper. A third entry point is added: a computer vision standard API.]


Conclusion

• Explicitly managed memory architecture leads to high performance at low cost and power
  • Over 35X speedup with respect to ARM Cortex A9 @ 1 GHz for typical visual analytics applications
  • 20X to 40X power reduction with respect to GPUs

• Development with OpenCL makes it possible to fully exploit the architecture, at the expense of extra development effort
  • Double buffering
  • Management of DMA-based data movement
  • Kernel fusion

• Programming tools outlook
  • Domain-specific higher-level capture and automation yield the best time-to-market and performance similar to OpenCL-based capture

Thank you! Questions?