Resource Allocation & Scheduling in Moore's Law Twilight Zone
Luca Benini, Università di Bologna & STMicroelectronics
The Twilight of Moore's Law: Economics
• 32/28nm: Fab $3B; Process R&D $1.2B; Design $50-90M; Mask $2-3M; EDA $400-500M
• 22/20nm: Fab $4-7B; Process R&D $2.1-3B; Design $120-500M; Mask $5-8M; EDA $0.8-1.2B
Market volume wall: only the largest-volume products will be manufactured with the most advanced technology
The Twilight of Moore’s Law: Power
Dark Silicon !!!
[Figure: scaling factor vs. year; transistor scaling (Moore's Law) keeps rising while supply voltage (ITRS) flattens]
Thermal wall: transistor count still increases exponentially, but we can no longer power the entire chip (voltages and cooling do not scale)
[Figure: watts/chip vs. year; projected chip power (ITRS) crosses the max power sustainable with air cooling + heatsink; Watanabe et al., ISCA'10] [Hardavellas11]
The Twilight of Moore's Law: IO Bandwidth
Memory wall: larger datasets and limited bandwidth, at high power cost, for accessing external memory
STMicroelectronics’ Platform 2012
[Figure: efficiency spectrum in GOPS/mm² and GOPS/W, spanning roughly 1 to >100, from general-purpose computing (CPU, SW) through throughput computing (GPGPU) to HW IP; Platform 2012 targets the mixed HW/SW middle ground at around 1 GOPS/mW]
Closing The Accelerator Efficiency Gap
P2012 in a nutshell…
[Diagram: the P2012 fabric inside an SoC; a fabric controller plus multiple clusters, surrounded by IPs; a customization design flow produces HW-optimized IPs; the fabric acts as a 3D-stackable SW accelerator]
Value Proposition for STM
• IPs for SoCs: P2012 mixing HW and SW (video codecs, imaging, baseband, IQI, …) for X GOPS in a small, low-power footprint
• Standalone SoC: P2012 as a programmable device (ecosystem, analytics, fragmented markets, …)
Trade-off: heterogeneous computing wins on area/power/productivity; homogeneous computing wins on flexibility/quick prototyping
A Killer Application (domain) for P2012
[Charts: video analytics market forecasts, 2010-2016, in M$ (scale 0-2500); applications split into security and business intelligence, platforms split into embedded and Intel]
Embedded visual intelligence, the next killer app: machines that see (J. Bier)
P2012 SoC in 28nm
• 4 clusters, 69 processors
• 80 GFLOPS
• 1 MB L2 memory
• 2D flip-chip or 3D stacked
• 600 MHz typical
• < 2 W
• 3.7 mm² per cluster
Energy efficiency: 40 GOPS/W = 0.04 GOPS/mW
Taped out 2/3/12
Designing the P2012 Computing SoC
Communication challenge
P2012 as GP Accelerator
[Diagram: an ARM host coupled to the P2012 fabric (fabric controller plus clusters 0-3, each with L1 TCDM), an L2, and L3 (DRAM)]
Cluster interconnect → latency
Global interconnect → decoupling
Off-chip interconnect → bandwidth
The cluster
[Diagram: cluster block structure; an STxP70-based cluster processor (CP: 32K I$, 16K TCDM, 32-entry ITC) with CC peripherals and the CC interconnect (CCI); a DMA sub-system with channels #0 to #P-1; a hardware synchronizer (HWS); a debug & test unit on JTAG (TMS, TCK, TDI, TDO, BI, BO); and an ENCore<N> "computing farm" of PEs #0 to #N-1, each with its own I$, sharing a 256-KB, 2xN-bank TCDM behind an N:1 mux/arbiter; a GALS interface and STBus T3 connect the cluster outward; the cluster is configurable (N, EFUs, banking factor, …)]
Cluster control: ENCore boot, HWPE control, etc.
DMA control: L3 ↔ L1 and L1 ↔ L1
Debug: multicore debug
P2012 Cluster Main Features
• Symmetric multi-processing
• Uniform memory access within the cluster
• Non-uniform memory access between clusters
• Up to 16+1 processors per cluster
• Up to 20.4 GOPS (32-bit) peak per cluster at 600 MHz
• 2 DMA channels allowing up to 6.4 GB/s data transfer
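As a sanity check on the peak number (a back-of-the-envelope reading; the two-ops-per-cycle issue rate is inferred from the figures above, not a quoted spec):

$$ 17\ \text{processors} \times 2\ \tfrac{\text{ops}}{\text{cycle}} \times 0.6\ \text{GHz} = 20.4\ \text{GOPS} $$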
P2012 Cluster Main Features (Cont’d)
• HW support for synchronization:
  – Fast barrier (within a cluster only) in ~4 cycles for 16 processors
  – Flexible barrier in ~20 cycles for 16 processors
• Seamless combination of non-programmable (HWPEs) and programmable (PEs) processing elements
• High level of customization through:
  – The number of STxP70 processing elements
  – The STxP70 extensions (ISA customization)
  – Up to 32 user-defined HW PEs
  – Memory sizes
  – Banking factor of the shared memory
Cluster interconnect
[Diagram: cluster interconnect; 16 STxP70 cores with FPx extensions (16K I$ each) reach the 8-KB shared memory banks and the HWS through a low-latency interconnect with 32/64/128-bit links and DEMUX stages; a peripheral interconnect hosts the timer and interrupt control (per-core ITC, it_wdt, it_hws); ENC2EXT, EXT2MEM, and EXT2PER bridges plus an STBus T3 N:1 node connect to the CC, the DMA, and the instruction/data paths; each core exposes an OCE on the JTAG chain]
Features:
• 2/3-cycle latency → 0/1 pipe stalls
• Non-blocking
• Parametric soft IP
Logarithmic Mesh of Trees
[Diagram: N processors reach M memory cuts through routing trees of depth log2(M) and arbitration trees of depth log2(N), built from simple routing and arbitration primitives]
Fine-grained interleaving → routing based on the LSBs of the address
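A sketch of what LSB-based interleaving means for an address, in plain C (NUM_BANKS = 32 and 32-bit words are assumptions consistent with the cluster figures above; the helper names are hypothetical):

```c
#include <stdint.h>

#define NUM_BANKS  32   /* M memory cuts, a power of two */
#define LOG2_BANKS 5

/* Consecutive 32-bit words map to consecutive banks, so work-items streaming
   through an array hit different banks and conflicts stay rare. */
static inline uint32_t bank_of(uint32_t byte_addr) {
    return (byte_addr >> 2) & (NUM_BANKS - 1); /* log2(M) LSBs of the word address */
}

static inline uint32_t offset_in_bank(uint32_t byte_addr) {
    return byte_addr >> (2 + LOG2_BANKS);      /* remaining high-order bits */
}
```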
Log ICO Performance
[Chart: shared memory conflicts; normalized computation time (theoretical vs. measured) as LD/ST density grows through 0%, 35.7%, 50%, 66.7%, 100%; measured time rises from 1 to 1.033, 1.072, 1.133, and 1.254]
Typical case → worst case?
Global interconnect
• Each cluster has its own operating point, tunable to the right energy budget
• Inter-cluster communication is handled by a fully asynchronous network-on-chip
Cluster Decoupling
[Diagram: four clusters running at independent frequencies F1-F4]
P2012 SoC GANoC
[Diagram: P2012 SoC GANoC; a 64-bit G-ANoC-L3 links the SoC ports and the 1-MB L2 tile; a 64-bit G-ANoC-L2 links the fabric controller and four cluster instances; the fabric controller is an STxP70-V4B with ITC, 32-KB TCDM, 16-KB I$, OCE, STM, JTAG, DMA, trace, and FC peripherals, taking external interrupts from I/O peripherals, resets (rst_n, anoc_rst_n), a 100-MHz ref_clk, and sys_clk, and raising FC interrupts for the host; each cluster (CC, CVP, cluster processor, CC peripherals, DMA, ENCore16) attaches through its CCI and a GALS interface]
• Topology is tailored to the architecture
• System plug: asymmetric bandwidth
• Sideband signaling: events, debug
Off-chip interconnect
LD/ST and DMA memory transfers
• Intra-cluster: LD/ST (UMA); DMA from/to TCDM to/from HWPE
• Inter-cluster: LD/ST (NUMA); DMA L1 to/from L1
• Cluster to/from L2 memory: LD/ST (NUMA); DMA L1 to/from L2
• Cluster to/from L3 memory (through the system bridge): LD/ST (NUMA); DMA L1 to/from L3
[Diagram: the fabric controller and the P2012 clusters reach the L2-MEM and two system bridges through the LOG-ICO and the GANoC]
DMA BW of 6.4 GB/s per cluster → what about external memory bandwidth?
Short-term 2D strategy: FPGA SoC host
[Diagram: L3 (DRAM) and an ARM host (Xilinx Zynq 7000) attached to the P2012 fabric (FC, L2, clusters 0-3 with L1 TCDM) over a ~1 GB/s link]
Xilinx Zynq 7000:
• Dual ARM Cortex-A9 MPCore (800 MHz, NEON, 32K L1-I$, 32K L1-D$, 512K shared L2)
• DDR3, DDR2, and LPDDR2 controllers
• 2x QSPI, NAND flash and NOR flash memory controller
• 2x USB 2.0 (OTG), 2x GbE, 2x CAN 2.0B, 2x SD/SDIO, 2x UART, 2x SPI, 2x I2C, 4x 32b GPIO
• Advanced low-power 28nm programmable logic:
  – 28k to 350k logic cells (approximately 430k to 5.2M equivalent ASIC gates)
  – 240KB to 2180KB of extensible block RAM
  – 80 to 900 18x25 DSP slices (58 to 1080 GMACS peak DSP performance)
• PCI Express Gen2 x8 (in largest devices)
• 154 to 404 user IOs (multiplexed + SelectIO)
• 4 to 16 12.5-Gbps transceivers (in largest devices)
SoC integration: Logical view
[Diagram: the P2012 fabric plugs into the SoC interconnect through two layers at 10+ GB/s]
P2012 – 3D Wioming test chip
[Diagram: 3D stack cross-section; the Wide IO DRAM sits face down over the SoC (also face down) on the package substrate; a central array of µbumps and ~1100 TSVs carry SoC signal, DRAM test, and supply IOs (including SDRAM VDDQ and µ-buffer supplies); peripheral µbumps, package balls, and the package molding complete the stack]
• Fully functional
• High yield
• Actual performance higher than expected
June 2012 ….
SW & Tools
SW & Tools: Key challenges
• SW ecosystem: sticking to SW and SW-integration standards throughout the development chain
• App ecosystem: enabling SW development and application content well ahead of silicon availability
• Platform ecosystem: enabling a short iteration loop among all the actors, from academic R&D up to the product owner
And be efficient (GOPS/W, GOPS/$)!
Programming P2012
[Diagram: programming flow; the application is expressed in a programming model, mapping control deploys it onto the platform (an ARM A9 host and fabric controller on the system bus; clusters of PEs with shared L1 MEM, cluster controller, HW synchronizer, multichannel DMA, and HWPEs with send/receive streaming ports on the A-LIC, all joined by the async NoC), and performance feedback & debug return map info to close the loop]
P2012 Programming Models
• SW-based platform variant:
  – OpenCL, via the CLAM compiler (CL Above Many-Core)
  – NPM (Native Programming Model): based on MIND components for code partitioning; communication components pulled from a provided library; fine-grain parallelism through a runtime C API
• Mixed HW/SW platform variant:
  – PEDF (Predicated Execution Data Flow): dataflow-based programming; supports HW PEs accessed via streaming interfaces; not currently part of the public SDK
Software stack (for the SW-based platform)
Assume P2012 acts as an accelerator to an Android ARM-SMP system
[Diagram: software stack; host side (ARM SMP): Linux + P2012 driver, programming models, applications; P2012 fabric side (STxP70): the P2012 runtime]
Assume the ARM is running a Linux system
Focus on augmented-reality applications (based on OpenCV).
Note that P2012 is a GP platform
Focus on OpenCL and dedicated programming model (NPM)
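A minimal host-side sketch of that OpenCL path (written under assumptions, not the actual P2012 SDK code): plain OpenCL 1.1 host API with error checking omitted; the kernel body, buffer sizes, and the accelerator device type are illustrative.

```c
#include <CL/cl.h>

int main(void) {
    const char *src =
        "__kernel void fast_kernel(__global const uchar *img,"
        "                          __global int *kp) {"
        "    int i = get_global_id(0);"
        "    kp[i] = img[i] > 40;   /* placeholder for real FAST logic */"
        "}";
    cl_platform_id plat;
    cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    /* assumption: the P2012 fabric enumerates as an accelerator-type device */
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_ACCELERATOR, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "fast_kernel", NULL);
    cl_mem img = clCreateBuffer(ctx, CL_MEM_READ_ONLY, 640 * 480, NULL, NULL);
    cl_mem kp  = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, 640 * 480 * 4, NULL, NULL);
    clSetKernelArg(k, 0, sizeof(cl_mem), &img);
    clSetKernelArg(k, 1, sizeof(cl_mem), &kp);
    /* 2 work-groups of 16 work-items: one group per cluster, within the
       16-work-items-per-group limit the P2012 runtime imposes (see later) */
    size_t global = 32, local = 16;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, &local, 0, NULL, NULL);
    clFinish(q);
    return 0;
}
```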
Programming Heterogeneous Parallelism
[Diagram: spectrum of heterogeneous parallelism, from multi-core CPU (Cortex-A15: task parallelism, run-to-completion, more programmability) through many-core (P2012) to GPU (Rogue: data parallelism with some synchronization, more parallelism)]
Native Programming Model
[Diagram: NPM deployment; application host code runs on the host (Linux + fabric driver, Comete); application components (AC1-AC3) and binding components (BC1-BC3) are deployed across cluster 1, cluster 2, and memory, with each cluster running the NPM runtime]
• Application component (computation → actor) + internal parallelism
• Binding component, from a library (communication)
• Building blocks for dataflow with HW & SW filters → heterogeneous computing (e.g. based on the RVC-CAL MPEG standard)
Image Understanding: OpenCV on P2012
FAST: key point detection
[Diagram: host side (ARM, POSIX): Linux (+Android), P2012 Linux driver, P2012 OpenCL 1.1, OpenCV library, and the app; P2012 side: FAST kernel instances replicated across the PEs of the fabric]
OCV functions accelerated with P2012 (transparently for the app programmer) → standard domain-specific APIs
Visual Analytics Benchmarks
[Charts: running time in ms (scale 0-500) for SIFT, PKLT, FAST_2, and FAST_1 on one P2012 cluster at 600 MHz vs. an ARM Cortex-A9 dual core at 1 GHz (NEON + FPU), and the resulting speedups (scale 0-10) of the single P2012 cluster over the ARM CA9 dual core, including the geometric mean]
8-Jul-12, Eric Flamand / STMicroelectronics
Scalability (PKLT)
[Charts: PKLT running time in ms (ARM CA9 at 1 GHz: 107; P2012 with 4/8/16/32 cores: 50/27/16/9) and the corresponding speedups vs. the ARM CA9: 2.1x, 4.0x, 6.7x, 11.9x; the 4-16 core points use 1 cluster with a varying ND-range, the 32-core point uses 2 clusters]
P2012: SDK / Software Ecosystem
[Diagram: SDK / software ecosystem; IDE, compilers, P2012 simulators, runtime, and applications]
Allocation and scheduling for P2012
Managing memory
P2012 ≠ GPU: Kernel level parallelism
Independent clusters + independent per-processor instruction fetch:
• Work-item divergence is not an issue
• P2012 supports more complex OpenCL task graphs than GPUs; both task-level and data-level (ND-Range) parallelism are possible
P2012 cores are not HW-multithreaded:
• The P2012 OCL runtime does not accept more than 16 work-items per work-group when creating an ND-Range
• But you can choose which work to do (e.g. using "case") in each work-item, as sketched below
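A hedged OpenCL-C illustration of that "case"-based work selection (the split into a serial job and an element-wise job is invented for illustration):

```c
/* Because each P2012 core fetches instructions independently, these divergent
   branches do not pay a GPU-style lockstep penalty. */
__kernel void work_select(__global const float *in,
                          __global float *out,
                          __global float *group_sum,
                          int n) {
    int lid = get_local_id(0);       /* 0..15: at most 16 work-items per group */
    int gid = get_global_id(0);

    switch (lid) {
    case 0: {                        /* work-item 0 of each group: a serial job */
        float s = 0.0f;
        for (int i = 0; i < n; ++i)
            s += in[i];
        group_sum[get_group_id(0)] = s;
        break;
    }
    default:                         /* remaining work-items: data-parallel job */
        /* note: out[] indices owned by each group's item 0 are skipped here */
        if (gid < n)
            out[gid] = in[gid] * in[gid];
        break;
    }
}
```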
The best way to hide memory transfer latencies when programming for P2012 is to overlap computation with DMA transfers. This technique is based on software pipelining and double buffering; multiple buffers are needed to implement such a mechanism.
P2012 ≠ GPU: Managing vs. hiding memory latency
Memory Mapping and Data Movements
[Diagram: memory mapping; global memory (buffers) and constant memory live in L3, more than 200 cycles away; local memory (shared) and private memory (kernel stacks) live in the cluster's 256-KB shared L1; data moves via scalar/vector load/store and the async_work_group_copy, async_work_group_2d_copy, and async_work_item_[2d]_copy primitives]
The compiler can accurately compute the OpenCL-C kernel stack size.
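A minimal OpenCL-C sketch of the copy-in / compute / copy-out pattern using the standard async_work_group_copy (the 2D and per-work-item variants named above are P2012-specific extensions, not shown; N and the scaling step are illustrative assumptions):

```c
#define N 256

__kernel void scale_tile(__global const float *in, __global float *out, float k) {
    __local float tile[N];
    int grp = get_group_id(0);

    /* DMA this work-group's block from global (L3) to local (L1) memory */
    event_t e = async_work_group_copy(tile, in + grp * N, N, 0);
    wait_group_events(1, &e);

    /* compute entirely out of the fast shared L1 */
    for (int i = get_local_id(0); i < N; i += get_local_size(0))
        tile[i] *= k;
    barrier(CLK_LOCAL_MEM_FENCE);   /* all writes to tile done before copy-out */

    /* DMA the result back from local to global memory */
    e = async_work_group_copy(out + grp * N, tile, N, 0);
    wait_group_events(1, &e);
}
```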
P2012 OpenCL Programming Style
P2012: 2 clusters, 16 PEs per cluster
[Diagram: the data set is split into blocks, handled by an ND-Range of 2 work-groups with 16 work-items per work-group]
Software-pipelined double buffering: read block N+1 (global → local) while processing block N (local → local) and writing block N-1 (local → global); a kernel-level sketch follows
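A hedged OpenCL-C sketch of that double-buffered, software-pipelined loop, shown for one work-group streaming over nblocks blocks (BLOCK, the x2 processing step, and the synchronous write-back are illustrative assumptions; with a third buffer the write-back could be overlapped too, hence "multiple buffers" above):

```c
#define BLOCK 256

__kernel void stream_blocks(__global const float *in,
                            __global float *out,
                            int nblocks) {
    __local float buf[2][BLOCK];
    event_t rd[2];
    int cur = 0;

    /* prologue: prefetch block 0 into buffer 0 */
    rd[cur] = async_work_group_copy(buf[cur], in, BLOCK, 0);

    for (int b = 0; b < nblocks; ++b) {
        int nxt = cur ^ 1;
        /* read block b+1 (global -> local) while block b is processed */
        if (b + 1 < nblocks)
            rd[nxt] = async_work_group_copy(buf[nxt],
                                            in + (b + 1) * BLOCK, BLOCK, 0);

        wait_group_events(1, &rd[cur]);      /* block b has landed in L1 */
        for (int i = get_local_id(0); i < BLOCK; i += get_local_size(0))
            buf[cur][i] *= 2.0f;             /* process block b (local -> local) */
        barrier(CLK_LOCAL_MEM_FENCE);

        /* write block b back (local -> global) before its buffer is reused */
        event_t wr = async_work_group_copy(out + b * BLOCK, buf[cur], BLOCK, 0);
        wait_group_events(1, &wr);

        cur = nxt;
    }
}
```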
P2012 OpenCL Programming Challenge
1. Fill processors/clusters with processing
   – Fully with data parallelism when available
   – Task parallelism otherwise (mid-to-coarse grain)
2. Optimize data locality
   – Use local & private memory as a user-managed cache
   – Minimize global ↔ local/private traffic
3. Parallelize memory transfers & computation
   – Use asynchronous copies (DMA)
   – Satisfy memory size constraints
Can we automate this?
Allocation and scheduling for P2012
Managing Power & Temperature
Closed Loop Control
[Diagram: closed-loop control; per-tile sensors (P, V, T) feed a controller that drives actuators (V, F); a 4-cycle GALS crossing → cluster-level control]
• Temperature sensing: 1 absolute sensor (bandgap reference), 8 relative sensors per cluster (ring oscillators)
• TMF-ring: 1 per cluster; TMF-sens: 128 per cluster
• FLL-based actuation: 4 cycles → /2, /4, … frequency division; 20 cycles → ±10 MHz steps (δf)
Thermal Controller
[Intel®, ISSCC 2007]
Threshold-based controller (see the C sketch after this list):
• T > Tmax → low frequency; T < Tmin → high frequency
• Cannot prevent overshoot
• Causes thermal cycling
Classical feedback controller:
• PID controllers
• Better than the threshold-based approach
• Cannot prevent overshoot
Model predictive controller:
• Internal prediction: avoids overshoot
• Optimization: maximizes performance
• Centralized; aware of neighbor cores' thermal influence
• All at once: a MIMO controller
• Complexity!
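A hedged C sketch of the threshold-based controller from the comparison above: pure hysteresis between two frequency levels (read_temp_c(), set_freq_mhz(), and the F_LOW/F_HIGH values are hypothetical):

```c
extern float read_temp_c(void);     /* hypothetical temperature-sensor read */
extern void  set_freq_mhz(int mhz); /* hypothetical frequency actuator */

#define F_LOW  300
#define F_HIGH 600

void threshold_ctrl_step(float t_min, float t_max) {
    float T = read_temp_c();
    if (T > t_max)
        set_freq_mhz(F_LOW);        /* too hot: clamp to the low frequency */
    else if (T < t_min)
        set_freq_mhz(F_HIGH);       /* cooled down: jump back up */
    /* The controller reacts only after a threshold is crossed, so temperature
       can overshoot t_max, and bouncing between F_LOW and F_HIGH is exactly
       the thermal cycling noted above. */
}
```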
[Diagram: MPC structure; a thermal model predicts future outputs from past inputs & outputs; an optimizer chooses the future inputs by minimizing a cost function of the future error between the target frequency and the predicted output, subject to constraints]
GOAL: track the performance request at a safe temperature
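For concreteness, a generic receding-horizon formulation of the kind sketched here (a textbook form, not the exact controller of the talk; the horizon $H$, thermal model matrices $A$ and $B$, and power map $P(\cdot)$ are illustrative):

$$
\begin{aligned}
\min_{f(t),\,\ldots,\,f(t+H-1)} \quad & \sum_{k=t}^{t+H-1} \bigl(f_{\mathrm{target}}(k) - f(k)\bigr)^2 \\
\text{s.t.} \quad & T(k+1) = A\,T(k) + B\,P\bigl(f(k)\bigr) \\
& T(k) \le T_{\max} \quad \text{for all } k
\end{aligned}
$$

Only the first move $f(t)$ is applied; at the next step the horizon slides forward. Predicting $T$ over the horizon is what lets the controller slow down before $T_{\max}$ is reached, rather than after.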
[Diagram: distributed thermal control; each of the four clusters has a thermal controller that takes the cluster CPI, the energy controller's requested frequency f_i,EC, and its own temperature plus neighbor temperatures (T_i + T_neigh) from the P2012 plant, and outputs a thermally safe frequency f_i,TC; an energy controller supervises the four loops]
High Level Architecture
Energy proportionality + thermal & variability management → a holistic view
[Diagram: sensors and performance counters in each core and cluster of P2012 report the current state to a controller; job allocation, scheduling, and frequency assignment apply rules to produce f_REQ; the controller then sets f_CLK < f_REQ based on T (ERC Multitherman)]
MPC: Preliminary Results
• Trace-driven simulation (Matlab)
• Power model: nonlinear vs. linear
• Thermal model: first vs. second order
• Centralized vs. distributed
MPC enables exploitation of thermal capacitance (and PCM enhancers) for computational sprinting.
[Chart: power and temperature traces during computational sprinting; an N-core burst briefly draws far more power than the 1-core steady state while temperature stays below Tmax]