Resource Allocation & Scheduling in Moore's Law Twilight Zone
Luca Benini, Università di Bologna & STMicroelectronics
The Twilight of Moore's Law: Economics
• 32/28nm: Fab $3B; Process R&D $1.2B; Design $50-90M; Mask $2-3M; EDA $400-500M
• 22/20nm: Fab $4-7B; Process R&D $2.1-3B; Design $120-500M; Mask $5-8M; EDA $0.8-1.2B
Market volume wall: only the largest-volume products will be manufactured with the most advanced technology
The Twilight of Moore’s Law: Power
Dark Silicon !!!
[Figure: scaling factor vs. year; transistor scaling (Moore's Law) keeps rising while supply voltage (ITRS) flattens]
Thermal wall: transistor count still increases exponentially, but we can no longer power the entire chip (voltages and cooling do not scale)
[Figure: watts/chip vs. year; projected chip power (ITRS) crosses the max power sustainable with air cooling + heatsink; Watanabe et al., ISCA'10] [Hardavellas11]
The Twilight of Moore's Law: IO Bandwidth
Memory wall: larger datasets and limited bandwidth, at high power cost, for accessing external memory
STMicroelectronics’ Platform 2012
[Figure: efficiency spectrum in GOPS/mm² and GOPS/W, spanning roughly 1 to >100, from general-purpose computing (CPU, SW) through throughput computing (GPGPU) to HW IP; Platform 2012 targets the mixed HW/SW middle ground at around 1 GOPS/mW]
Closing The Accelerator Efficiency Gap
P2012 in a nutshell…
[Diagram: the P2012 fabric inside an SoC; a fabric controller plus multiple clusters, surrounded by IPs; a customization design flow produces HW-optimized IPs; the fabric acts as a 3D-stackable SW accelerator]
Value Proposition for STM
• IPs for SoCs: P2012 mixing HW and SW (video codecs, imaging, baseband, IQI, …) for X GOPS in a small, low-power footprint
• Standalone SoC: P2012 as a programmable device (ecosystem, analytics, fragmented markets, …)
Trade-off: heterogeneous computing wins on area/power/productivity; homogeneous computing wins on flexibility/quick prototyping
A Killer Application (domain) for P2012
[Charts: video analytics market forecasts, 2010-2016, in M$ (scale 0-2500); applications split into security and business intelligence, platforms split into embedded and Intel]
Embedded visual intelligence, the next killer app: machines that see (J. Bier)
P2012 SoC in 28nm
• 4 clusters, 69 processors
• 80 GFLOPS
• 1 MB L2 memory
• 2D flip-chip or 3D stacked
• 600 MHz typical
• < 2 W
• 3.7 mm² per cluster
Energy efficiency: 40 GOPS/W = 0.04 GOPS/mW
Taped out 2/3/12
Designing the P2012 Computing SoC
Communication challenge
P2012 as GP Accelerator
[Diagram: an ARM host coupled to the P2012 fabric (fabric controller plus clusters 0-3, each with L1 TCDM), an L2, and L3 (DRAM)]
Cluster interconnect → latency
Global interconnect → decoupling
Off-chip interconnect → bandwidth
The cluster
[Diagram: cluster block structure; an STxP70-based cluster processor (CP: 32K I$, 16K TCDM, 32-entry ITC) with CC peripherals and the CC interconnect (CCI); a DMA sub-system with channels #0 to #P-1; a hardware synchronizer (HWS); a debug & test unit on JTAG (TMS, TCK, TDI, TDO, BI, BO); and an ENCore<N> "computing farm" of PEs #0 to #N-1, each with its own I$, sharing a 256-KB, 2xN-bank TCDM behind an N:1 mux/arbiter; a GALS interface and STBus T3 connect the cluster outward; the cluster is configurable (N, EFUs, banking factor, …)]
Cluster control: ENCore boot, HWPE control, etc.
DMA control: L3 ↔ L1 and L1 ↔ L1
Debug: multicore debug
P2012 Cluster Main Features
• Symmetric multi-processing
• Uniform memory access within the cluster
• Non-uniform memory access between clusters
• Up to 16+1 processors per cluster
• Up to 20.4 GOPS (32-bit) peak per cluster at 600 MHz
• 2 DMA channels allowing up to 6.4 GB/s data transfer
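As a sanity check on the peak number (a back-of-the-envelope reading; the two-ops-per-cycle issue rate is inferred from the figures above, not a quoted spec):

$$ 17\ \text{processors} \times 2\ \tfrac{\text{ops}}{\text{cycle}} \times 0.6\ \text{GHz} = 20.4\ \text{GOPS} $$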
P2012 Cluster Main Features (Cont’d)
• HW support for synchronization:
  – Fast barrier (within a cluster only) in ~4 cycles for 16 processors
  – Flexible barrier in ~20 cycles for 16 processors
• Seamless combination of non-programmable (HWPEs) and programmable (PEs) processing elements
• High level of customization through:
  – The number of STxP70 processing elements
  – The STxP70 extensions (ISA customization)
  – Up to 32 user-defined HW PEs
  – Memory sizes
  – Banking factor of the shared memory
Cluster interconnect
[Diagram: cluster interconnect; 16 STxP70 cores with FPx extensions (16K I$ each) reach the 8-KB shared memory banks and the HWS through a low-latency interconnect with 32/64/128-bit links and DEMUX stages; a peripheral interconnect hosts the timer and interrupt control (per-core ITC, it_wdt, it_hws); ENC2EXT, EXT2MEM, and EXT2PER bridges plus an STBus T3 N:1 node connect to the CC, the DMA, and the instruction/data paths; each core exposes an OCE on the JTAG chain]
Features:
• 2/3-cycle latency → 0/1 pipe stalls
• Non-blocking
• Parametric soft IP
Logarithmic Mesh of Trees
[Diagram: N processors reach M memory cuts through routing trees of depth log2(M) and arbitration trees of depth log2(N), built from simple routing and arbitration primitives]
Fine-grained interleaving → routing based on the LSBs of the address
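A sketch of what LSB-based interleaving means for an address, in plain C (NUM_BANKS = 32 and 32-bit words are assumptions consistent with the cluster figures above; the helper names are hypothetical):

```c
#include <stdint.h>

#define NUM_BANKS  32   /* M memory cuts, a power of two */
#define LOG2_BANKS 5

/* Consecutive 32-bit words map to consecutive banks, so work-items streaming
   through an array hit different banks and conflicts stay rare. */
static inline uint32_t bank_of(uint32_t byte_addr) {
    return (byte_addr >> 2) & (NUM_BANKS - 1); /* log2(M) LSBs of the word address */
}

static inline uint32_t offset_in_bank(uint32_t byte_addr) {
    return byte_addr >> (2 + LOG2_BANKS);      /* remaining high-order bits */
}
```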
Log ICO Performance
[Chart: shared memory conflicts; normalized computation time (theoretical vs. measured) as LD/ST density grows through 0%, 35.7%, 50%, 66.7%, 100%; measured time rises from 1 to 1.033, 1.072, 1.133, and 1.254]
Typical case → worst case?
Global interconnect
• Each cluster has its own operating point, tunable to the right energy budget
• Inter-cluster communication is handled by a fully asynchronous network-on-chip
Cluster Decoupling
[Diagram: four clusters running at independent frequencies F1-F4]
P2012 SoC GANoC
[Diagram: P2012 SoC GANoC; a 64-bit G-ANoC-L3 links the SoC ports and the 1-MB L2 tile; a 64-bit G-ANoC-L2 links the fabric controller and four cluster instances; the fabric controller is an STxP70-V4B with ITC, 32-KB TCDM, 16-KB I$, OCE, STM, JTAG, DMA, trace, and FC peripherals, taking external interrupts from I/O peripherals, resets (rst_n, anoc_rst_n), a 100-MHz ref_clk, and sys_clk, and raising FC interrupts for the host; each cluster (CC, CVP, cluster processor, CC peripherals, DMA, ENCore16) attaches through its CCI and a GALS interface]
• Topology is tailored to the architecture
• System plug: asymmetric bandwidth
• Sideband signaling: events, debug
Off-chip interconnect
LD/ST and DMA memory transfers
• Intra-cluster: LD/ST (UMA); DMA from/to TCDM to/from HWPE
• Inter-cluster: LD/ST (NUMA); DMA L1 to/from L1
• Cluster to/from L2 memory: LD/ST (NUMA); DMA L1 to/from L2
• Cluster to/from L3 memory (through the system bridge): LD/ST (NUMA); DMA L1 to/from L3
[Diagram: the fabric controller and the P2012 clusters reach the L2-MEM and two system bridges through the LOG-ICO and the GANoC]
DMA BW of 6.4 GB/s per cluster → what about external memory bandwidth?
Short-term 2D strategy: FPGA SoC host
[Diagram: L3 (DRAM) and an ARM host (Xilinx Zynq 7000) attached to the P2012 fabric (FC, L2, clusters 0-3 with L1 TCDM) over a ~1 GB/s link]
Xilinx Zynq 7000:
• Dual ARM Cortex-A9 MPCore (800 MHz, NEON, 32K L1-I$, 32K L1-D$, 512K shared L2)
• DDR3, DDR2, and LPDDR2 controllers
• 2x QSPI, NAND flash and NOR flash memory controller
• 2x USB 2.0 (OTG), 2x GbE, 2x CAN 2.0B, 2x SD/SDIO, 2x UART, 2x SPI, 2x I2C, 4x 32b GPIO
• Advanced low-power 28nm programmable logic:
  – 28k to 350k logic cells (approximately 430k to 5.2M equivalent ASIC gates)
  – 240KB to 2180KB of extensible block RAM
  – 80 to 900 18x25 DSP slices (58 to 1080 GMACS peak DSP performance)
• PCI Express Gen2 x8 (in largest devices)
• 154 to 404 user IOs (multiplexed + SelectIO)
• 4 to 16 12.5-Gbps transceivers (in largest devices)
SoC integration: Logical view
[Diagram: the P2012 fabric plugs into the SoC interconnect through two layers at 10+ GB/s]
P2012 – 3D Wioming test chip
[Diagram: 3D stack cross-section; the Wide IO DRAM sits face down over the SoC (also face down) on the package substrate; a central array of µbumps and ~1100 TSVs carry SoC signal, DRAM test, and supply IOs (including SDRAM VDDQ and µ-buffer supplies); peripheral µbumps, package balls, and the package molding complete the stack]
• Fully functional
• High yield
• Actual performance higher than expected
June 2012 ….
SW & Tools
SW & Tools: Key challenges
• SW ecosystem: sticking to SW and SW-integration standards throughout the development chain
• App ecosystem: enabling SW development and application content well ahead of silicon availability
• Platform ecosystem: enabling a short iteration loop among all the actors, from academic R&D up to the product owner
And be efficient (GOPS/W, GOPS/$)!
Programming P2012
[Diagram: programming flow; the application is expressed in a programming model, mapping control deploys it onto the platform (an ARM A9 host and fabric controller on the system bus; clusters of PEs with shared L1 MEM, cluster controller, HW synchronizer, multichannel DMA, and HWPEs with send/receive streaming ports on the A-LIC, all joined by the async NoC), and performance feedback & debug return map info to close the loop]
P2012 Programming Models
• SW-based platform variant:
  – OpenCL, via the CLAM compiler (CL Above Many-Core)
  – NPM (Native Programming Model): based on MIND components for code partitioning; communication components pulled from a provided library; fine-grain parallelism through a runtime C API
• Mixed HW/SW platform variant:
  – PEDF (Predicated Execution Data Flow): dataflow-based programming; supports HW PEs accessed via streaming interfaces; not currently part of the public SDK
Software stack (for the SW-based platform)
Assume P2012 acts as an accelerator to an Android ARM-SMP system
[Diagram: software stack; host side (ARM SMP): Linux + P2012 driver, programming models, applications; P2012 fabric side (STxP70): the P2012 runtime]
Assume the ARM is running a Linux system
Focus on augmented-reality applications (based on OpenCV).
Note that P2012 is a GP platform
Focus on OpenCL and dedicated programming model (NPM)
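A minimal host-side sketch of that OpenCL path (written under assumptions, not the actual P2012 SDK code): plain OpenCL 1.1 host API with error checking omitted; the kernel body, buffer sizes, and the accelerator device type are illustrative.

```c
#include <CL/cl.h>

int main(void) {
    const char *src =
        "__kernel void fast_kernel(__global const uchar *img,"
        "                          __global int *kp) {"
        "    int i = get_global_id(0);"
        "    kp[i] = img[i] > 40;   /* placeholder for real FAST logic */"
        "}";
    cl_platform_id plat;
    cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    /* assumption: the P2012 fabric enumerates as an accelerator-type device */
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_ACCELERATOR, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "fast_kernel", NULL);
    cl_mem img = clCreateBuffer(ctx, CL_MEM_READ_ONLY, 640 * 480, NULL, NULL);
    cl_mem kp  = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, 640 * 480 * 4, NULL, NULL);
    clSetKernelArg(k, 0, sizeof(cl_mem), &img);
    clSetKernelArg(k, 1, sizeof(cl_mem), &kp);
    /* 2 work-groups of 16 work-items: one group per cluster, within the
       16-work-items-per-group limit the P2012 runtime imposes (see later) */
    size_t global = 32, local = 16;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, &local, 0, NULL, NULL);
    clFinish(q);
    return 0;
}
```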
Programming Heterogeneous Parallelism
[Diagram: spectrum of heterogeneous parallelism, from multi-core CPU (Cortex-A15: task parallelism, run-to-completion, more programmability) through many-core (P2012) to GPU (Rogue: data parallelism with some synchronization, more parallelism)]
Native Programming Model
[Diagram: NPM deployment; application host code runs on the host (Linux + fabric driver, Comete); application components (AC1-AC3) and binding components (BC1-BC3) are deployed across cluster 1, cluster 2, and memory, with each cluster running the NPM runtime]
• Application component (computation → actor) + internal parallelism
• Binding component, from a library (communication)
• Building blocks for dataflow with HW & SW filters → heterogeneous computing (e.g. based on the RVC-CAL MPEG standard)
Image Understanding: OpenCV on P2012
FAST: key point detection
[Diagram: host side (ARM, POSIX): Linux (+Android), P2012 Linux driver, P2012 OpenCL 1.1, OpenCV library, and the app; P2012 side: FAST kernel instances replicated across the PEs of the fabric]
OCV functions accelerated with P2012 (transparently for the app programmer) → standard domain-specific APIs
Visual Analytics Benchmarks
[Charts: running time in ms (scale 0-500) for SIFT, PKLT, FAST_2, and FAST_1 on one P2012 cluster at 600 MHz vs. an ARM Cortex-A9 dual core at 1 GHz (NEON + FPU), and the resulting speedups (scale 0-10) of the single P2012 cluster over the ARM CA9 dual core, including the geometric mean]
8-Jul-12, Eric Flamand / STMicroelectronics
Scalability (PKLT)
[Charts: PKLT running time in ms (ARM CA9 at 1 GHz: 107; P2012 with 4/8/16/32 cores: 50/27/16/9) and the corresponding speedups vs. the ARM CA9: 2.1x, 4.0x, 6.7x, 11.9x; the 4-16 core points use 1 cluster with a varying ND-range, the 32-core point uses 2 clusters]
P2012: SDK / Software Ecosystem
[Diagram: SDK / software ecosystem; IDE, compilers, P2012 simulators, runtime, and applications]
Allocation and scheduling for P2012
Managing memory
P2012 ≠ GPU: Kernel level parallelism
Independent clusters + independent per-processor instruction fetch:
• Work-item divergence is not an issue
• P2012 supports more complex OpenCL task graphs than GPUs; both task-level and data-level (ND-Range) parallelism are possible
P2012 cores are not HW-multithreaded:
• The P2012 OCL runtime does not accept more than 16 work-items per work-group when creating an ND-Range
• But you can choose which work to do (e.g. using "case") in each work-item, as sketched below
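A hedged OpenCL-C illustration of that "case"-based work selection (the split into a serial job and an element-wise job is invented for illustration):

```c
/* Because each P2012 core fetches instructions independently, these divergent
   branches do not pay a GPU-style lockstep penalty. */
__kernel void work_select(__global const float *in,
                          __global float *out,
                          __global float *group_sum,
                          int n) {
    int lid = get_local_id(0);       /* 0..15: at most 16 work-items per group */
    int gid = get_global_id(0);

    switch (lid) {
    case 0: {                        /* work-item 0 of each group: a serial job */
        float s = 0.0f;
        for (int i = 0; i < n; ++i)
            s += in[i];
        group_sum[get_group_id(0)] = s;
        break;
    }
    default:                         /* remaining work-items: data-parallel job */
        /* note: out[] indices owned by each group's item 0 are skipped here */
        if (gid < n)
            out[gid] = in[gid] * in[gid];
        break;
    }
}
```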
The best way to hide memory transfer latencies when programming for P2012 is to overlap computation with DMA transfers. This technique is based on software pipelining and double buffering; multiple buffers are needed to implement such a mechanism.
P2012 ≠ GPU: Managing vs. hiding memory latency
Memory Mapping and Data Movements
[Diagram: memory mapping; global memory (buffers) and constant memory live in L3, more than 200 cycles away; local memory (shared) and private memory (kernel stacks) live in the cluster's 256-KB shared L1; data moves via scalar/vector load/store and the async_work_group_copy, async_work_group_2d_copy, and async_work_item_[2d]_copy primitives]
The compiler can accurately compute the OpenCL-C kernel stack size.
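A minimal OpenCL-C sketch of the copy-in / compute / copy-out pattern using the standard async_work_group_copy (the 2D and per-work-item variants named above are P2012-specific extensions, not shown; N and the scaling step are illustrative assumptions):

```c
#define N 256

__kernel void scale_tile(__global const float *in, __global float *out, float k) {
    __local float tile[N];
    int grp = get_group_id(0);

    /* DMA this work-group's block from global (L3) to local (L1) memory */
    event_t e = async_work_group_copy(tile, in + grp * N, N, 0);
    wait_group_events(1, &e);

    /* compute entirely out of the fast shared L1 */
    for (int i = get_local_id(0); i < N; i += get_local_size(0))
        tile[i] *= k;
    barrier(CLK_LOCAL_MEM_FENCE);   /* all writes to tile done before copy-out */

    /* DMA the result back from local to global memory */
    e = async_work_group_copy(out + grp * N, tile, N, 0);
    wait_group_events(1, &e);
}
```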
P2012 OpenCL Programming Style
P2012: 2 clusters, 16 PEs per cluster
[Diagram: the data set is split into blocks, handled by an ND-Range of 2 work-groups with 16 work-items per work-group]
Software-pipelined double buffering: read block N+1 (global → local) while processing block N (local → local) and writing block N-1 (local → global); a kernel-level sketch follows
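A hedged OpenCL-C sketch of that double-buffered, software-pipelined loop, shown for one work-group streaming over nblocks blocks (BLOCK, the x2 processing step, and the synchronous write-back are illustrative assumptions; with a third buffer the write-back could be overlapped too, hence "multiple buffers" above):

```c
#define BLOCK 256

__kernel void stream_blocks(__global const float *in,
                            __global float *out,
                            int nblocks) {
    __local float buf[2][BLOCK];
    event_t rd[2];
    int cur = 0;

    /* prologue: prefetch block 0 into buffer 0 */
    rd[cur] = async_work_group_copy(buf[cur], in, BLOCK, 0);

    for (int b = 0; b < nblocks; ++b) {
        int nxt = cur ^ 1;
        /* read block b+1 (global -> local) while block b is processed */
        if (b + 1 < nblocks)
            rd[nxt] = async_work_group_copy(buf[nxt],
                                            in + (b + 1) * BLOCK, BLOCK, 0);

        wait_group_events(1, &rd[cur]);      /* block b has landed in L1 */
        for (int i = get_local_id(0); i < BLOCK; i += get_local_size(0))
            buf[cur][i] *= 2.0f;             /* process block b (local -> local) */
        barrier(CLK_LOCAL_MEM_FENCE);

        /* write block b back (local -> global) before its buffer is reused */
        event_t wr = async_work_group_copy(out + b * BLOCK, buf[cur], BLOCK, 0);
        wait_group_events(1, &wr);

        cur = nxt;
    }
}
```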
P2012 OpenCL Programming Challenge
1. Fill processors/clusters with processing
   – Fully with data parallelism when available
   – Task parallelism otherwise (mid-to-coarse grain)
2. Optimize data locality
   – Use local & private memory as a user-managed cache
   – Minimize global ↔ local/private traffic
3. Parallelize memory transfers & computation
   – Use asynchronous copies (DMA)
   – Satisfy memory size constraints
Can we automate this?
Allocation and scheduling for P2012
Managing Power & Temperature
Closed Loop Control
[Diagram: closed-loop control; per-tile sensors (P, V, T) feed a controller that drives actuators (V, F); a 4-cycle GALS crossing → cluster-level control]
• Temperature sensing: 1 absolute sensor (bandgap reference), 8 relative sensors per cluster (ring oscillators)
• TMF-ring: 1 per cluster; TMF-sens: 128 per cluster
• FLL-based actuation: 4 cycles → /2, /4, … frequency division; 20 cycles → ±10 MHz steps (δf)
Thermal Controller
[Intel®, ISSCC 2007]
Threshold-based controller (see the C sketch after this list):
• T > Tmax → low frequency; T < Tmin → high frequency
• Cannot prevent overshoot
• Causes thermal cycling
Classical feedback controller:
• PID controllers
• Better than the threshold-based approach
• Cannot prevent overshoot
Model predictive controller:
• Internal prediction: avoids overshoot
• Optimization: maximizes performance
• Centralized; aware of neighbor cores' thermal influence
• All at once: a MIMO controller
• Complexity!
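A hedged C sketch of the threshold-based controller from the comparison above: pure hysteresis between two frequency levels (read_temp_c(), set_freq_mhz(), and the F_LOW/F_HIGH values are hypothetical):

```c
extern float read_temp_c(void);     /* hypothetical temperature-sensor read */
extern void  set_freq_mhz(int mhz); /* hypothetical frequency actuator */

#define F_LOW  300
#define F_HIGH 600

void threshold_ctrl_step(float t_min, float t_max) {
    float T = read_temp_c();
    if (T > t_max)
        set_freq_mhz(F_LOW);        /* too hot: clamp to the low frequency */
    else if (T < t_min)
        set_freq_mhz(F_HIGH);       /* cooled down: jump back up */
    /* The controller reacts only after a threshold is crossed, so temperature
       can overshoot t_max, and bouncing between F_LOW and F_HIGH is exactly
       the thermal cycling noted above. */
}
```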
[Diagram: MPC structure; a thermal model predicts future outputs from past inputs & outputs; an optimizer chooses the future inputs by minimizing a cost function of the future error between the target frequency and the predicted output, subject to constraints]
GOAL: track the performance request at a safe temperature
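For concreteness, a generic receding-horizon formulation of the kind sketched here (a textbook form, not the exact controller of the talk; the horizon $H$, thermal model matrices $A$ and $B$, and power map $P(\cdot)$ are illustrative):

$$
\begin{aligned}
\min_{f(t),\,\ldots,\,f(t+H-1)} \quad & \sum_{k=t}^{t+H-1} \bigl(f_{\mathrm{target}}(k) - f(k)\bigr)^2 \\
\text{s.t.} \quad & T(k+1) = A\,T(k) + B\,P\bigl(f(k)\bigr) \\
& T(k) \le T_{\max} \quad \text{for all } k
\end{aligned}
$$

Only the first move $f(t)$ is applied; at the next step the horizon slides forward. Predicting $T$ over the horizon is what lets the controller slow down before $T_{\max}$ is reached, rather than after.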
[Diagram: distributed thermal control; each of the four clusters has a thermal controller that takes the cluster CPI, the energy controller's requested frequency f_i,EC, and its own temperature plus neighbor temperatures (T_i + T_neigh) from the P2012 plant, and outputs a thermally safe frequency f_i,TC; an energy controller supervises the four loops]
High Level Architecture
Energy proportionality + thermal & variability management → a holistic view
[Diagram: sensors and performance counters in each core and cluster of P2012 report the current state to a controller; job allocation, scheduling, and frequency assignment apply rules to produce f_REQ; the controller then sets f_CLK < f_REQ based on T (ERC Multitherman)]
MPC: Preliminary Results
• Trace-driven simulation (Matlab)
• Power model: nonlinear vs. linear
• Thermal model: first vs. second order
• Centralized vs. distributed
MPC enables exploitation of thermal capacitance (and PCM enhancers) for computational sprinting.
[Chart: power and temperature traces during computational sprinting; an N-core burst briefly draws far more power than the 1-core steady state while temperature stays below Tmax]