![Page 1: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/1.jpg)
Mapping Data-intensive Applications to an Explicitly Managed Memory Architecture: Challenges and Solutions
Pierre G. Paulin, Director, SoC Platform Automation, STMicroelectronics (Canada) Inc.
Memory Architecture and Organization Workshop, Montreal, Canada, 3 October 2013
![Page 2: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/2.jpg)
Outline
• Cache vs. Explicitly Managed Memory (EMM)
• STHORM (aka Platform 2012) many-core fabric
  • Explicitly Managed Memory (EMM) architecture
• STHORM programming tools
• Computer vision application benchmark results
• Challenges of programming EMM architectures
• Outlook: Emerging programming tool solutions
![Page 3: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/3.jpg)
Cache vs. EMM
07/10/2013
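[Figure: memory hierarchies compared side by side. Cache-based: cores with L1 caches, a coherent L2 cache, and refill and coherency hardware. EMM with per-core TCDMs: an L2 memory or cache, filled by DMA. EMM with a shared L1 TCDM: cores reach the TCDM through a local interconnect, with L2 memory and DMA. Each hierarchy sits on top of L3 memory. Latency scale: 1 cycle, tens of cycles, hundreds of cycles.]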
![Page 4: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/4.jpg)
Cache vs. Explicitly Managed Memory
• Cache
  + High performance
  + Programming simplicity
  - Higher power, cost
  - Unpredictable performance
• EMM
  + Lower power, cost
  + Performance predictability
  + Explicit control of data movement
  - Poor handling of irregular accesses (scatter-gather)
  - Programming complexity
    • Use of DMAs for explicit data movement
    • Need to overlap communication and computation
![Page 5: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/5.jpg)
STHORM Many-core Fabric: Explicitly Managed Memory Architecture
![Page 6: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/6.jpg)
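STMicroelectronics STHORM Platform
[Figure: efficiency spectrum in GOPS/mm² and GOPS/W, running from CPU (general-purpose computing, SW) through GP-GPU (throughput computing) to HW IP, with scale marks 1, 3, 8, and > 100. STHORM is positioned as a mixed SW/HW platform.]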
![Page 7: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/7.jpg)
What is STHORM?
• Massively parallel programmable processor
• Delivers high performance per mm²
  • 20 GOPS/cluster
  • 3.7 mm²/cluster (28 nm)
• Low power: can be battery-powered
  • 100 GOPS/W
• Explicitly managed memory architecture
• Scalable GALS architecture: each cluster has its own (F, V) operating point
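[Figure: STHORM fabric overview. A cluster contains cores 1 to 16 (STxP70, each with integer and floating-point units), L1 memory, a synchronizer, two DMA units, DVFS, and a cluster controller. Fabric-level blocks: L2 memory, fabric controller (FC), and a bridge to the host.]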
![Page 8: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/8.jpg)
STHORM Cluster Architecture
[Figure: STHORM cluster architecture. A parametric SMP software cluster (cores, shared memory, synchronizer, network interface) and hardware processing elements (HW PE, each with several HW PUs and a communication block) attach to an asynchronous local interconnect, together with the cluster controller (control processor) and I/O. Clusters connect through a top-level asynchronous 2D mesh network. Blocks are marked as either user-defined or STHORM infrastructure.]
• Scalable GALS architecture
• Each cluster has its own operating point (F, V)
• Asynchronous 2D mesh interconnect
![Page 9: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/9.jpg)
LD/ST and DMA Memory Transfers
• NUMA architecture
• Intra-cluster:
  • LD/ST (UMA)
  • DMA: from/to TCDM
• Inter-cluster:
  • LD/ST (NUMA)
  • DMA: L1 to/from L1
• Cluster to/from L2 memory:
  • LD/ST (NUMA)
  • DMA: L1 to/from L2
• Cluster to/from L3 memory (through the system bridge):
  • LD/ST (NUMA)
  • DMA: L1 to/from L3
[Figure: STHORM fabric with nine clusters, each with its own L1 memory, plus a fabric controller, an L2 memory, and the system bridge.]
![Page 10: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/10.jpg)
STHORM Cluster Architecture
STHORM Cluster Overview
[Figure: STHORM cluster block diagram. Processing elements STxP70 + FPx #1 to #16 and an STxP70 cluster processor (CP), each with a 16 KB program cache (P$), access the shared tightly coupled data memory (TCDM, memory banks #1 to #32) through the TCDM logarithmic interconnect. Other blocks: peripheral logarithmic interconnect, cluster controller (CC) with its interconnect (CCI) and peripherals, hardware synchronizer (HWS), timers, debug and test unit (DTU), DMA channels #0 and #1, and the global interconnect interface (ENC2EXT, EXT2PER, EXT2MEM). The diagram also carries a "32-KB TCDM" label.]
• Parametric multi-core crossbar with a logarithmic structure
• Reduced arbitration complexity
• Round-robin arbitration scheme
• Up to N memory accesses per cycle
• Test-and-set support
![Page 11: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/11.jpg)
Shared Memory
[Figure: logarithmic interconnect between N processors (P1 to P4 shown) and 2N memory banks (B1 to B8 shown), built from a routing tree and an arbitration tree.]
![Page 12: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/12.jpg)
Shared Memory (contd.)
[Figure: Shared Memory Conflicts. Computation time (y-axis, 0 to 1.4) versus LD/ST % (x-axis ticks 0%, 35.70%, 50%, 66.70%, 100%), with theoretical and measured series and a marked "typical case". Data labels, in order: 1, 1.0328, 1.0721, 1.1329, 1.2536.]
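The conflict overhead plotted above comes from several cores addressing the same TCDM bank in the same cycle. The sketch below is a simplified software model of that effect, not the STHORM hardware: it assumes 32-bit words interleaved across banks and one access served per bank per cycle. The names `bank_of` and `cycles_to_serve` are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { MAX_BANKS = 64 };  /* illustrative upper bound for this model */

/* Assumption: consecutive 32-bit words map to consecutive banks. */
static unsigned bank_of(uint32_t byte_addr, unsigned n_banks) {
    return (byte_addr / 4u) % n_banks;
}

/* Cycles needed to serve one request per core, assuming each bank
   serves at most one request per cycle: the most loaded bank decides. */
static unsigned cycles_to_serve(const uint32_t *addrs, size_t n_cores,
                                unsigned n_banks) {
    unsigned per_bank[MAX_BANKS] = {0};
    unsigned worst = 0;
    assert(n_banks > 0 && n_banks <= MAX_BANKS);
    for (size_t c = 0; c < n_cores; c++) {
        unsigned b = bank_of(addrs[c], n_banks);
        per_bank[b]++;
        if (per_bank[b] > worst) worst = per_bank[b];
    }
    return worst;
}
```

With 4 cores and 8 banks, consecutive word addresses are served in one cycle, while three cores hitting addresses that are 32 bytes apart all land in the same bank and serialize.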
![Page 13: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/13.jpg)
STHORM Cluster Overview
[Figure: STHORM cluster block diagram, repeated from the earlier STHORM Cluster Overview slide.]
• Supports 1D and 2D transfers
• Up to 3.2 GB/s peak per DMA
• Supports up to 16 outstanding transactions
• Out-of-order (OoO) support
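To make the 2D transfer mode concrete, here is a plain-C stand-in for what one 2D DMA descriptor expresses: a number of rows of fixed length, with independent source and destination pitches. This is a synchronous software model under our own naming (`copy_2d` is hypothetical), not the STHORM DMA programming interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy `rows` rows of `row_bytes` bytes each. Consecutive rows are
   `src_pitch` bytes apart in the source and `dst_pitch` bytes apart
   in the destination. A 1D transfer is the special case rows == 1. */
static void copy_2d(unsigned char *dst, size_t dst_pitch,
                    const unsigned char *src, size_t src_pitch,
                    size_t row_bytes, size_t rows) {
    assert(row_bytes <= dst_pitch && row_bytes <= src_pitch);
    for (size_t r = 0; r < rows; r++) {
        memcpy(dst + r * dst_pitch, src + r * src_pitch, row_bytes);
    }
}
```

Extracting a 2x2 tile at position (1, 1) of a 4x4 byte image, for example, uses a source pitch of 4 and a destination pitch of 2, so the tile ends up densely packed in local memory.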
![Page 14: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/14.jpg)
STHORM Programming Tools
![Page 15: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/15.jpg)
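Programming STHORM
[Figure: STHORM programming flow. An application written against a programming model is mapped onto the platform under mapping control, with performance feedback and debug and with mapping info passed between the tools. The platform consists of an ARM Cortex-A9 host on the system bus, a fabric controller (with memory, control peripherals, and DMA), and clusters on an asynchronous NoC. Each cluster has PE0 to PEn sharing an L1 memory, a cluster controller (memory, control peripherals), a hardware synchronizer, a multichannel DMA, and HWPE 0 to HWPE N attached through the A-LIC with read/write and send/receive ports.]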
![Page 16: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/16.jpg)
OpenCL: Standard of the Convergence
[Figure: OpenCL as the convergence point between multi-core CPUs (Cortex-A9, Cortex-A15; task parallelism, run-to-completion; more programmability) and GPUs (Midgard-T6xx, Rogue; data parallelism with some synchronization; more parallelism), with the many-core STHORM (aka P2012) in between.]
![Page 17: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/17.jpg)
STHORM as OpenCL Device
Conceptual OpenCL device
[Figure: compute units 1 to N, each with processors 1 to M (each with private memory) and a local memory; a global / constant memory data cache; global memory.]
• Scalable programming model
• Supports SPMD model
• Supports task parallelism
• Covers complex memory hierarchy
• Supports async memory transfers
STHORM compute device
[Figure: clusters 1 to N, each with 16 PEs (xp70+FPx), 256 KB shared memory (L1), and 2 DMA channels used for async transfers; global memory is external DDR memory (L3). A further block is labeled "P$ refill, static & dynamic buffer allocation".]
• Scalable architecture (cluster based)
• Supports SPMD with 0-cost branch divergence
• Supports task parallelism
• Shared local memory
• 1D/2D DMA engines
• Hardware synchronizer
![Page 18: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/18.jpg)
Global OpenCL Execution Model
• Host-centric, devices are asynchronous
• Supports two kinds of parallelism
  • DLP (data-level parallelism): multiple instances of a kernel run in parallel
  • TLP (task-level parallelism): different kernels executed in parallel
• Communication
  • Between work-items of the same work-group
  • Between kernels: through buffers in global memory
  • Between host and kernels: memory-mapped buffers
• Synchronization
  • Between work-items: work-group 'barrier'
  • Between commands of a command queue: out-of-order
![Page 19: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/19.jpg)
STHORM ≠ GPU: Kernel Level Parallelism
Independent clusters + independent processors
• Work-item divergence is not an issue on STHORM
• GPUs can support it, but at the cost of stalling threads
• STHORM supports more complex OpenCL task graphs than GPUs: both task-level and data-level (ND-Range) parallelism are possible
STHORM cores are not HW-multithreaded
• The STHORM OCL runtime does not accept more than 16 work-items per work-group when creating an ND-Range
• But you can choose which work to do in each work-item (e.g. using "case")
[Figure: scheduling of work-groups (WG) on a GPU versus on STHORM.]
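The two constraints above (at most 16 work-items per work-group, and per-item work selected with a "case") can be sketched in plain C as follows. This is a host-side illustration, not STHORM runtime code: `work_groups_needed` and `work_item_body` are hypothetical names, and the three jobs in the switch are placeholders.

```c
#include <assert.h>

enum { MAX_WI_PER_WG = 16 };  /* work-group size limit quoted on the slide */

/* Number of work-groups needed to cover `total_wi` work-items. */
static unsigned work_groups_needed(unsigned total_wi) {
    return (total_wi + MAX_WI_PER_WG - 1) / MAX_WI_PER_WG;
}

/* Emulates one work-item choosing its job from its local id.
   In a kernel, `local_id` would come from get_local_id(0). */
static int work_item_body(unsigned local_id, int x) {
    assert(local_id < MAX_WI_PER_WG);
    switch (local_id) {
    case 0:  return x + 1;  /* placeholder: a bookkeeping task   */
    case 1:  return x * 2;  /* placeholder: a second, unlike task */
    default: return x;      /* placeholder: data-parallel work    */
    }
}
```

Because each STHORM core runs its own instruction stream, the branches of the switch do not stall one another, which is what makes this style practical on this architecture according to the slide.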
![Page 20: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/20.jpg)
STHORM Memory Mapping
[Figure: mapping of OpenCL memory spaces onto the STHORM memory levels.]
• L1 (256 KB shared TCDM, ~2 cycles): local memory, private memory (work-item stacks), constant memory, ~10 KB runtime
• L2 (1 MB, ~20 cycles): program cache refill, allocation of static and non-static buffers
• L3 (external memory, 150~200 cycles): global memory (physically contiguous)
• Host memory (virtualized)
• Residency labels in the figure: resident; semi-resident (per program build); non-resident (per kernel run)
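The 256 KB L1 budget directly bounds the tile size a kernel can use. The helper below is a rough sizing sketch under stated assumptions: all buffers are the same size, and everything else living in L1 (runtime, stacks, constants) is lumped into a single `reserved_bytes` parameter. The function name and parameters are ours.

```c
#include <assert.h>
#include <stddef.h>

/* Largest number of image rows per tile such that `n_buffers` tiles
   fit in L1 after setting aside `reserved_bytes`. */
static size_t max_tile_rows(size_t l1_bytes, size_t reserved_bytes,
                            size_t width_px, size_t bytes_per_px,
                            size_t n_buffers) {
    assert(width_px > 0 && bytes_per_px > 0 && n_buffers > 0);
    if (reserved_bytes >= l1_bytes) return 0;
    return (l1_bytes - reserved_bytes) / (width_px * bytes_per_px * n_buffers);
}
```

For a 640-pixel-wide float image with two double-buffered inputs and one double-buffered output (6 buffers), this gives 17 rows with nothing reserved and 16 rows once about 10 KB is set aside for the runtime.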
![Page 21: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/21.jpg)
Data Movements inside STHORM
[Figure: data movement between L3 global memory (buffers) and the 256 KB shared L1, which holds local memory, private memory (work-item stacks), constant memory, and ~10 KB of runtime. One transfer is labeled "program load time".]
• Scalar/vector load/store
• async_work_group_copy
• async_work_group_2d_copy
• async_work_item_[2d]_copy
![Page 22: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/22.jpg)
OpenCL Programming Tools
[Figure: OpenCL tool flow. An OpenCL application goes through the OCL compiler and runs on the SoC (host plus STHORM clusters 0 to N, each with a runtime (RT) and a cluster controller (CC)). A system trace module and the debug and test unit produce low-level traces, which are combined with OpenCL encodings in a trace database used by OpenCL-aware visualization and analysis tools and by an OpenCL-aware debugger.]
• Symbolic names / physical IDs
• Trace point states (enabled / disabled)
• Compact encoding to reduce instrumentation cost
• Debug symbolic info
![Page 23: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/23.jpg)
STHORM Target Platform
![Page 24: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/24.jpg)
STHORM Test Chip
• 28 nm LP CMOS process
• 4 clusters, 69 dual-issue processors
• 80 GOPS
• 1 MB L2 memory
• 600 MHz typical
• 200 mW per cluster typical (400 mW worst case)
• 3.7 mm² per cluster
• Fully functional samples available since Dec. 2012
Energy efficiency = 100 GOPS / W
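The headline figures are mutually consistent, which a few lines of integer arithmetic can confirm. The sketch below uses only numbers quoted on the slides (20 GOPS and 3.7 mm² per cluster, 200 mW typical, 4 clusters, 16 PEs per cluster). The breakdown of the 69 processors as 16 PEs plus one cluster processor per cluster, plus one fabric controller, is our assumption.

```c
#include <assert.h>

/* GOPS per watt from GOPS and milliwatts. */
static unsigned gops_per_watt(unsigned gops, unsigned milliwatts) {
    assert(milliwatts > 0);
    return gops * 1000u / milliwatts;
}

/* Milli-GOPS per mm^2, with the area given in tenths of a mm^2. */
static unsigned milli_gops_per_mm2(unsigned gops, unsigned area_tenths_mm2) {
    assert(area_tenths_mm2 > 0);
    return gops * 10000u / area_tenths_mm2;
}

/* Assumed breakdown: (PEs + 1 cluster processor) per cluster,
   plus 1 fabric controller for the whole fabric. */
static unsigned processor_count(unsigned clusters, unsigned pes_per_cluster) {
    return clusters * (pes_per_cluster + 1u) + 1u;
}
```

With these inputs, 20 GOPS at 200 mW gives the quoted 100 GOPS/W (50 GOPS/W at the 400 mW worst case), 4 clusters give 80 GOPS, and 20 GOPS over 3.7 mm² is roughly 5.4 GOPS/mm².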
![Page 25: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/25.jpg)
STHORM Evaluation Board
[Figure: STHORM SoC + Zynq board. STHORM 2D SoC: clusters 1 to 4 (16 PEs each), four blocks labeled 512K, fabric controller (32K), L2 memory (1 MB), ANoC, and an SNoC SerDes interface. A 1 GB/s link connects it to the Zynq FPGA (Z7020): SNoC/NI, SNoC-AXI bridge, AXI, and Cortex-A9. Board-level blocks: L3 memory (512 MB), Ethernet, USB, SD card interface, HDMI (HD video), debug, 2 CMOS image sensor interfaces, Bluetooth, UART, MEMS, and debug and trace through STM-Probe / STMC2.]
![Page 26: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/26.jpg)
STHORM Evaluation Board
[Photo: evaluation board, with callouts:]
• STHORM test chip
• Zynq host: Cortex-A9, peripherals
• 2 CMOS imaging plug-in
• Bluetooth module
• 3x USB (OTG, serial, JTAG)
• Ethernet
• MEMS: accelerometer / gyro / magnetometer; microphones (2x)
• Debug & trace
![Page 27: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/27.jpg)
Applications & Benchmarks
![Page 28: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/28.jpg)
Benchmarking Visual Analytics Applications
• Objective: evaluate the achievable speedup when offloading visual analytics applications from the ARM host to the STHORM accelerator
• Corner detection: FAST, SIFT
• Feature tracking: PKLT
• Input image: VGA resolution, 777 corners detected
• Measurement results collected by analyzing execution traces from STHORM evaluation board
![Page 29: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/29.jpg)
Visual Analytics Benchmark
[Figure: speedup of 1 STHORM cluster versus 1 Cortex-A9 (y-axis 0 to 12) for KLT (no pyramid), PKLT (1 level), FAST9, FAST ROSTEN, and their geometric mean.]
[Figure: running time in ms (y-axis 0 to 300) for KLT (no pyramid), PKLT (1 level), FAST9, and FAST ROSTEN on STHORM (16 processors, 600 MHz) and on ARM + NEON (dual core, 1 GHz).]
Platforms compared: ARM Cortex-A9 dual core (one core used) and 1 STHORM cluster (3.7 mm²).
![Page 30: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/30.jpg)
Visual Analytics Benchmark
[Figure: same speedup and running-time charts as on the previous Visual Analytics Benchmark slide, comparing an ARM Cortex-A9 dual core (one core used) with 1 STHORM cluster (3.7 mm²).]
8X GOPS/mm², 25X GOPS/mm²*W
![Page 31: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/31.jpg)
KLT Scalability: Execution Time
[Figure: execution time in ms (less is better; y-axis 0 to 10) for STHORM with 4, 8, 16, and 32 processors at 600 MHz and for a Cortex-A9 at 1 GHz. Runs are annotated "1 cluster (1 WG, <n> WI)" and "2 clusters (2 WG, 32 WI)".]
Same compiled code, different runtime parameters (ND-Range).
WG = work-group, WI = work-item.
![Page 32: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/32.jpg)
KLT Scalability: Speed-up
[Figure: speedup versus Cortex-A9 (more is better; y-axis 0 to 12) for STHORM with 4, 8, 16, and 32 processors at 600 MHz. Data labels, in order: 1.8x, 3.5x, 6x, 10.5x. Runs are annotated "1 cluster (1 WG, <n> WI)" and "2 clusters (2 WG, 32 WI)".]
Same compiled code, different runtime parameters (ND-Range).
WG = work-group, WI = work-item.
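One way to read the speedup labels is to ask how much of the ideal 2x gain each doubling of the processor count delivers. The sketch below assumes the labels 1.8x, 3.5x, 6x, and 10.5x correspond, in order, to 4, 8, 16, and 32 processors. Speedups are passed in tenths to keep the arithmetic in integers, and `doubling_efficiency_pct` is our own helper name.

```c
#include <assert.h>

/* Percentage of the ideal 2x gain achieved when the processor count
   doubles. Both speedups are given in tenths (1.8x -> 18). */
static unsigned doubling_efficiency_pct(unsigned prev_x10, unsigned next_x10) {
    assert(prev_x10 > 0);
    return 100u * next_x10 / (2u * prev_x10);
}
```

Under that reading, going from 4 to 8 processors retains about 97% of the ideal gain, 8 to 16 about 85%, and 16 to 32 (which also crosses from one cluster to two) about 87%.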
![Page 33: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/33.jpg)
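FAST Application Performance
Best in class performance
[Figure: FAST performance comparison across platforms. Data labels: 260, 31, 25, 12, 7, 1.3. Platform names and units are not recoverable from the extracted text.]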
![Page 34: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/34.jpg)
FAST Application Power
Best in class energy efficiency
![Page 35: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/35.jpg)
Challenges of Programming EMM Architectures
![Page 36: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/36.jpg)
OpenCL Trace Visualization Ex. (1 Cluster)
[Figure: trace timeline with one row per work-item (WI) instance. States shown: running, DMA prog., waiting, barrier.]
![Page 37: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/37.jpg)
OpenCL Trace Visualization Ex. (4 Clusters)
[Figure: trace timeline with one row per work-item (WI) instance. States shown: running, DMA prog., waiting, barrier.]
![Page 38: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/38.jpg)
Main Source of Complexity and Bugs
• Data slicing/tiling for exploiting the L1 memory
  [Figure: a data set split into tiles (Tile 0, Tile 1, ...), each assigned to a work-group of 16 work-items.]
• Asynchronous transfer operations to fully exploit the DMA
  [Figure: read, process, and write stages.]
• Merge multiple functions into a single OpenCL kernel to maximize data locality and reduce host-device interaction
  [Figure: Sobel built from gradient x, gradient y, and Euclidean norm. Sobel 3x3 performance (lower is better; y-axis 0 to 400): 100 with 1 OpenCL kernel, 378 with 3 OpenCL kernels.]
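The kernel-merging point can be illustrated with a fused Sobel in plain C: gradient x, gradient y, and the Euclidean norm are computed in a single pass, so the two gradient images are never written out and read back. This is our own minimal sketch, not the benchmarked kernel. It uses an integer square root to stay dependency-free and leaves border pixels untouched.

```c
#include <assert.h>
#include <stddef.h>

/* Integer square root by linear search (inputs here stay below ~2.1e6). */
static unsigned isqrt_u32(unsigned v) {
    unsigned r = 0;
    while ((r + 1u) * (r + 1u) <= v) r++;
    return r;
}

/* Fused 3x3 Sobel: out = floor(sqrt(gx^2 + gy^2)) for interior pixels. */
static void sobel_fused(const unsigned char *in, unsigned *out,
                        size_t w, size_t h) {
    assert(w >= 3 && h >= 3);
    for (size_t y = 1; y + 1 < h; y++) {
        const unsigned char *r0 = in + (y - 1) * w;  /* row above   */
        const unsigned char *r1 = in + y * w;        /* current row */
        const unsigned char *r2 = in + (y + 1) * w;  /* row below   */
        for (size_t x = 1; x + 1 < w; x++) {
            int gx = (r0[x + 1] + 2 * r1[x + 1] + r2[x + 1])
                   - (r0[x - 1] + 2 * r1[x - 1] + r2[x - 1]);
            int gy = (r2[x - 1] + 2 * r2[x] + r2[x + 1])
                   - (r0[x - 1] + 2 * r0[x] + r0[x + 1]);
            out[y * w + x] = isqrt_u32((unsigned)(gx * gx + gy * gy));
        }
    }
}
```

Split into three kernels, the same computation would move two intermediate images through global memory and add two more host-device launches, which is the overhead the 100 versus 378 comparison points at.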
![Page 39: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/39.jpg)
Hide Memory Latency: Software Pipeline
Kernel code, per iteration:
• Work-group async-copy: next input data to local
• Work-group async-copy: previous output data to global
• Process my own current input local block
• Work-group barrier
• Wait for async-copies completion
[Figure: read, process, and write stages of successive blocks overlapping in time.]
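A back-of-the-envelope model shows why the overlap matters. The sketch below compares a fully serial schedule with an idealized three-stage pipeline in which every step costs as much as the slowest stage. It ignores DMA programming overhead, barrier cost, and bandwidth contention, so it is an optimistic bound under our own assumptions rather than a prediction for STHORM.

```c
#include <assert.h>

static unsigned max3(unsigned a, unsigned b, unsigned c) {
    unsigned m = a > b ? a : b;
    return m > c ? m : c;
}

/* No overlap: every block pays read + process + write. */
static unsigned long serial_cycles(unsigned n_blocks, unsigned rd,
                                   unsigned proc, unsigned wr) {
    return (unsigned long)n_blocks * (rd + proc + wr);
}

/* Idealized overlap: n_blocks + 2 steps (fill and drain included),
   each bounded by the slowest of the three stages. */
static unsigned long pipelined_cycles(unsigned n_blocks, unsigned rd,
                                      unsigned proc, unsigned wr) {
    assert(n_blocks > 0);
    return (unsigned long)(n_blocks + 2u) * max3(rd, proc, wr);
}
```

For 100 blocks with stage costs of 2, 5, and 2 units, the serial schedule takes 900 units and the pipelined one 510. The gain shrinks as one stage comes to dominate the other two.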
![Page 40: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/40.jpg)
Hide Memory Latency: Double Buffer
Local memory holds four buffers: In0, In1, Out0, Out1. At each step a write (W), a compute (C), and a read (R) operate on different buffers.
Prolog
• Step 0: read block 0 → local in0
• Step 1: compute local in0 → local out0; read block 1 → local in1
Loop body
• Step 2: write local out0 → block 0; compute local in1 → local out1; read block 2 → local in0
• Step i: write local out[i%2] → block i-2; compute local in[(i+1)%2] → local out[(i+1)%2]; read block i → local in[i%2]
Epilog
• Step N: write local out[N%2] → block N-2; compute local in[(N+1)%2] → local out[(N+1)%2]
• Step N+1: write local out[(N+1)%2] → block N-1
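The index discipline of the double-buffer schedule is easy to get wrong, so a plain-C emulation can help check it. In the sketch below `memcpy` stands in for the asynchronous DMA copies, so nothing actually overlaps, and the per-block computation (doubling each element) is a placeholder. Only the buffer selection mirrors the schedule: at step i, block i is read, block i-1 is computed, and block i-2 is written back.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { BLOCK = 4 };  /* elements per block; illustrative */

/* Placeholder computation on one block. */
static void compute_block(const int *in, int *out) {
    for (size_t k = 0; k < BLOCK; k++) out[k] = 2 * in[k];
}

static void run_double_buffered(const int *global_in, int *global_out,
                                size_t n_blocks) {
    int local_in[2][BLOCK], local_out[2][BLOCK];
    assert(global_in != NULL && global_out != NULL);
    for (size_t i = 0; i < n_blocks + 2; i++) {
        if (i >= 2)                    /* write block i-2 */
            memcpy(global_out + (i - 2) * BLOCK, local_out[i & 1],
                   sizeof local_out[0]);
        if (i < n_blocks)              /* read block i */
            memcpy(local_in[i & 1], global_in + i * BLOCK,
                   sizeof local_in[0]);
        if (i >= 1 && i <= n_blocks)   /* compute block i-1 */
            compute_block(local_in[(i - 1) & 1], local_out[(i - 1) & 1]);
    }
}
```

Within one step the read targets `local_in[i & 1]` while the compute uses `local_in[(i - 1) & 1]`, and the write drains `local_out[i & 1]` while the compute fills `local_out[(i - 1) & 1]`, so no buffer is read and written in the same step.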
![Page 41: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/41.jpg)
OpenCL Pros
• OpenCL enables full performance of STHORM EMM
  • Explicitly managed memory by exploiting DMAs and L1 memory
• OpenCL can adapt to different host-fabric system configurations
• OpenCL is well suited to image processing applications
  • Exploiting data-level parallelism is enough in most cases
  • Coarse-grain task parallelism is still possible with kernel graphs
• Conceptually simple execution model
  • WG barriers
  • Atomic operations
• Good scalability
  • One kernel, multiple execution configurations (number of clusters, number of PEs)
![Page 42: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/42.jpg)
OpenCL Cons
• Lower-level programming
  • Low-level programming (host programming and kernel memory management)
  • First parallelization steps are error-prone (C → OpenCL, image tiling)
  → Substantial learning curve, need knowledge of target architecture
  → Specialization of code for the target
  → Lower programming productivity
  → Wide range of results, depending on programmer experience
→ Performance and bandwidth issues with low compute-intensity kernels
  • Data round trip between L1 and L3
  • Host-device interaction
  → Need to manually merge logical functions into a single kernel
![Page 43: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/43.jpg)
Outlook: Emerging Programming Tool Solutions
![Page 44: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/44.jpg)
STHORM Programming Outlook
[Figure: two programming paths onto STHORM clusters 1 to 4. Expert programmer: API + OpenCL-C, producing an OpenCL-C kernel and a C host wrapper. Algorithm designer: a computer-vision-specific language processed by KernelGenius, which produces an OpenCL-C kernel and a C host wrapper.]
![Page 45: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/45.jpg)
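Kernel Genius Inputs
[Figure: input graph. An NxM image feeds a 5x1 convolution node, whose NxM output feeds a 1x5 convolution node that produces an NxM image. In the extracted text both nodes carry the annotations R: [-2:2] [0:0] and W: [0:0] [0:0].]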
![Page 46: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/46.jpg)
Read/Write Pattern and Stride Examples
[Figure: three example nodes, in panels (a) to (c), annotated with strides and read/write patterns: a 4x4 IDCT node (stride_inX=4, stride_inY=4, stride_outX=4, stride_outY=4), a 3x3 gradient node (stride_inX=1, stride_inY=1), and a 5x5 blur node (stride_inX=1, stride_inY=1, readX=[-2:2], readY=[-2:2]). Further labels whose node cannot be recovered reliably from the extracted text: stride_outX=2, stride_outY=1 (twice); readX=[-1:1], readY=[-1:1]; readX=[0:3], readY=[0:3]; writeX=[0:0], writeY=[0:0]; writeX=[0:1], writeY=[0:0]; writeX=[0:4], writeY=[0:4].]
![Page 47: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/47.jpg)
KernelGenius Compiler
[Figure: the input graph (NxM image → 5x1 convolution → 1x5 convolution → NxM image, with read/write annotations) is compiled into a tile-based schedule: copy tile in and copy tile out as async operations, the two convolutions as synchronous processing, stages labeled cycle 0, cycle 1, cycle 3, and cycle 4, and intermediate buffers labeled Nx2 tile, Nx5 tile, and Nx2 tile.]
• Graph merge
• Data tiling
• Tile-based scheduling
• Size the local buffering
• Parallelize the processing
• Manage image borders
• Generate the OpenCL kernel + C wrapper
![Page 48: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/48.jpg)
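Buffer Calculation Examples
[Figure: two buffer-sizing examples. (a) Node A feeds node B: the tiles output by A are buffered to form input B, from which B produces tile out B. (b) Node A feeds node B and node C: the buffering of A's output tiles serves both input B and input C, from which the nodes produce tile out B and tile out C.]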
![Page 49: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/49.jpg)
Buffer Management and Scheduling
[Figure A, scheduled and buffer-allocated graph: A (copy tile in, async operation; cycle 0, rate 2) → B (downsample; cycle 1, rate 1) → C (5x5 blur; cycle 3, rate 1) → D (3x3 gradient; cycle 4, rate 1) → E (copy tile out, async operation; cycle 5, rate 1). B, C, and D run as synchronous processing. Nodes are annotated with pop and push counts of 1 and with peek ranges ([-2:2], [-1:1], [0:0]). The buffers between nodes are labeled 4 tiles, 5 tiles, 3 tiles, and 2 tiles, with widths of either width or width/2.]
[Figure B, scheduler loop iterations: the prolog fills the pipeline (2Ai alone at iteration 0, B added from iteration 1, C from iteration 3, D from iteration 4), the steady state runs 2Ai, B, C, and D through iteration N, and the epilog (iterations N+1 to N+5) drains it. Rows labeled Ei, Ew, and 2Aw accompany every iteration.]
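The link between a node's peek range and the buffering it needs can be written down in one line: a node that looks at rows lo to hi around the current one needs hi - lo + 1 rows resident. The helper below is our own reading of the figure, not KernelGenius code. It reproduces the "5 tiles" and "3 tiles" labels for the 5x5 blur and the 3x3 gradient, and it does not attempt to model the buffers on the asynchronous copy edges.

```c
#include <assert.h>

/* Rows (tiles) that must stay resident for a node whose peek range
   is [lo:hi] relative to the current row. */
static unsigned window_tiles(int lo, int hi) {
    assert(lo <= hi);
    return (unsigned)(hi - lo + 1);
}
```

A peek range of [-2:2] therefore gives 5 tiles, [-1:1] gives 3, and a pure point operation with [0:0] needs only the current tile.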
![Page 50: Mapping Data-intensive Applications to an Explicitly Managed …jasonxue/MeAOW/20131003_meaow_pres... · 2013. 10. 8. · PKLT (1 Level) FAST9 FAST ROSTEN Geomean Speed ST HORM 1](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac66a0d9899b110531dd4e/html5/thumbnails/50.jpg)
Simple Example

C code

void ImageDiff(int w, int h, float *in1, float *in2, float *out) {
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            out[y*w+x] = in1[y*w+x] - in2[y*w+x];
        }
    }
}

OpenCL kernel code

__kernel void ImageDifference(__global DATA_TYPE *global_output,
                              __global DATA_TYPE *global_input0,
                              __global DATA_TYPE *global_input1,
                              __local DATA_TYPE *local_buffer0,
                              __local DATA_TYPE *local_buffer1,
                              __local DATA_TYPE *local_buffer2,
                              int image_width, int image_height) {
    event_t e_in[2], e_out;
    int id = get_local_id(0), nb_it = get_local_size(0);

    /* Local data buffering: two rows (ping-pong) per input and per output */
    __local DATA_TYPE *local_in0[2] = {local_buffer0, local_buffer0+image_width};
    __local DATA_TYPE *local_in1[2] = {local_buffer1, local_buffer1+image_width};
    __local DATA_TYPE *local_out[2] = {local_buffer2, local_buffer2+image_width};

    for (int i = 0; i < (image_height+2); i++) {
        /* Async DMA: write back the row computed in the previous iteration */
        if (i > 1) {
            e_out = async_work_group_copy(global_output, local_out[i&0x1], image_width, 0);
            global_output += image_width;
        }
        /* Async DMA: fetch the next row of each input */
        if (i < image_height) {
            e_in[0] = async_work_group_copy(local_in0[i&0x1], global_input0, image_width, 0);
            global_input0 += image_width;
            e_in[1] = async_work_group_copy(local_in1[i&0x1], global_input1, image_width, 0);
            global_input1 += image_width;
        }
        /* Computations: process the row fetched in the previous iteration */
        if ((i > 0) && (i <= image_height)) {
            int j = (~i)&0x1;
            for (int k = id; k < image_width; k += nb_it) {
                local_out[j][k] = local_in0[j][k] - local_in1[j][k];
            }
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        /* Synchronization: wait for the outstanding transfers */
        if (i > 1) { wait_group_events(1, &e_out); }
        if (i < image_height) { wait_group_events(2, e_in); }
    }
}

Annotated parts of the OpenCL kernel: local data buffering, async DMA, computations, synchronization.

KernelGenius code

kernel ImageDiff(int w, int h, float in1[h][w], float in2[h][w]) {
    Operator<float> sub(in1,in2) {
        .function=${ @sub=$in1-$in2; }$
    }
    return sub;
}
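The double-buffering pattern of the OpenCL kernel can be tried out on a host in plain C. The following is a minimal sketch under stated assumptions: `memcpy` stands in for the asynchronous DMA transfers, so nothing actually overlaps and there are no events to wait on; `MAX_W` is an arbitrary bound on the row width; and the function name `image_diff_double_buffered` is hypothetical. The loop keeps the same three stages and the same ping-pong indexing as the kernel.

```c
#include <assert.h>
#include <string.h>

#define MAX_W 1024  /* assumed upper bound on the row width */

/* Row-wise image difference using two-entry (ping-pong) local buffers.
   memcpy stands in for async_work_group_copy and completes immediately. */
void image_diff_double_buffered(int w, int h, const float *in0,
                                const float *in1, float *out) {
    static float local_in0[2][MAX_W];
    static float local_in1[2][MAX_W];
    static float local_out[2][MAX_W];
    assert(w > 0 && w <= MAX_W && h >= 0);

    for (int i = 0; i < h + 2; i++) {
        /* Write back the row computed in the previous iteration (row i - 2). */
        if (i > 1) {
            memcpy(out + (size_t)(i - 2) * w, local_out[i & 1],
                   (size_t)w * sizeof(float));
        }
        /* Fetch row i of both inputs into buffer (i & 1). */
        if (i < h) {
            memcpy(local_in0[i & 1], in0 + (size_t)i * w, (size_t)w * sizeof(float));
            memcpy(local_in1[i & 1], in1 + (size_t)i * w, (size_t)w * sizeof(float));
        }
        /* Compute on the row fetched in the previous iteration. */
        if (i > 0 && i <= h) {
            int j = (~i) & 1;  /* the other buffer, i.e. (i - 1) & 1 */
            for (int k = 0; k < w; k++) {
                local_out[j][k] = local_in0[j][k] - local_in1[j][k];
            }
        }
    }
}
```

The two extra loop iterations (`h + 2`) are what let the last computed row drain through the write-back stage; on real hardware each stage would work on a different row at the same time.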
Kernel Genius early results (1)
Kernel Genius early results (2)
Canny Edge Example (Step 1)
[Figure: Canny edge detection pipeline and its mapping]
Pipeline stages shown: Noise reduction, Gradient, Non Maxima Suppression, Edge Points Connection, Thinning, Edges.
Implementation steps: Step 1: OpenCL kernel; Step 2: C code (host); Step 3: OpenCL kernel.
Step 1 graph nodes: Gaussian Blur (X Blur, Y Blur), Gradient X, Gradient Y, four convolution nodes (7x1, 1x7, 7x1, 1x7), Magnitude, Non Maxima, Strong edge, Out Img.
Node types: Convolution predefined parametric nodes; Operator programmable nodes; Filter programmable node.
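The blur and gradient stages are drawn as pairs of 7x1 and 1x7 convolutions, which is the usual way to apply a separable 2D filter as two 1D passes. A minimal C sketch of that idea follows. The function names, the clamp-to-edge border handling and the caller-supplied tap values are assumptions for illustration and are not taken from the slides.

```c
#include <assert.h>

#define TAPS 7  /* filter length, matching the 7-tap nodes in the graph */

static int clamp_index(int v, int lo, int hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Horizontal pass: each output pixel is a weighted sum of TAPS neighbours
   in the same row. Borders are clamped to the edge (an assumption). */
void convolve_rows(int w, int h, const float *in, float *out,
                   const float taps[TAPS]) {
    assert(w > 0 && h > 0);
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            float acc = 0.0f;
            for (int t = 0; t < TAPS; t++) {
                int xx = clamp_index(x + t - TAPS / 2, 0, w - 1);
                acc += taps[t] * in[y * w + xx];
            }
            out[y * w + x] = acc;
        }
    }
}

/* Vertical pass: same weighted sum, taken along the column. */
void convolve_cols(int w, int h, const float *in, float *out,
                   const float taps[TAPS]) {
    assert(w > 0 && h > 0);
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            float acc = 0.0f;
            for (int t = 0; t < TAPS; t++) {
                int yy = clamp_index(y + t - TAPS / 2, 0, h - 1);
                acc += taps[t] * in[yy * w + x];
            }
            out[y * w + x] = acc;
        }
    }
}

/* Separable 2D filter: horizontal pass into tmp, then vertical pass. */
void separable_filter(int w, int h, const float *in, float *tmp, float *out,
                      const float row_taps[TAPS], const float col_taps[TAPS]) {
    convolve_rows(w, h, in, tmp, row_taps);
    convolve_cols(w, h, tmp, out, col_taps);
}
```

Two 7-tap passes cost 14 multiply-accumulates per pixel instead of 49 for a full 7x7 kernel, which is why the graph uses pairs of 1D convolution nodes.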
Canny Edge Detector Result
Kernel Genius Lessons
• Use of a domain-specific programming model on EMM
• Pros
  • Higher level of abstraction
  • Automation of DMA-based data movement
  • Automation of scheduler generation
  • Automation of buffer sizing
• Cons
  • Domain-specific: imaging and computer vision
  • Does not cover some highly dynamic use cases
    • Revert to OpenCL for those
  • Non-standard programming model
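Automated buffer sizing can be pictured using the peek ranges annotated on the scheduled graph. The sketch below is an assumption about one plausible rule, not the KernelGenius algorithm: a consumer that peeks tiles in the range `[lo:hi]` needs at least `hi - lo + 1` tiles resident, and one extra tile is reserved when an asynchronous DMA operation fills or drains the buffer. The `port_t` struct and both function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical description of one consumer port. */
typedef struct {
    int peek_lo;     /* lowest tile offset read, e.g. -2 */
    int peek_hi;     /* highest tile offset read, e.g. +2 */
    int async_side;  /* 1 if an async DMA fills or drains the buffer */
} port_t;

/* Minimum number of tiles to keep resident under the assumed rule. */
int buffer_depth_in_tiles(port_t p) {
    assert(p.peek_lo <= p.peek_hi);
    int window = p.peek_hi - p.peek_lo + 1;
    return window + (p.async_side ? 1 : 0);
}

/* Buffer size in bytes for tiles holding tile_width elements each. */
unsigned long buffer_bytes(port_t p, int tile_width, unsigned long elem_size) {
    assert(tile_width > 0);
    return (unsigned long)buffer_depth_in_tiles(p)
         * (unsigned long)tile_width * elem_size;
}
```

With this rule, a `[-2:2]` window gives 5 tiles and a `[-1:1]` window gives 3, while a `[-1:1]` window and a `[0:0]` window next to an asynchronous copy give 4 and 2. These match the depths drawn on the graph, but which buffer carries which depth is a reading of the figure rather than something the slides state.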
STHORM Programming Outlook
[Figure: programming flows targeting the STHORM fabric (Cluster 1 to Cluster 4)]
User profiles: expert programmer; algorithm designer.
Entry points: computer vision specific language (KernelGenius); computer vision standard API (API + OpenCL-C); OpenCL-C kernel with C host wrapper.
Conclusion
• An explicitly managed memory architecture delivers high performance at low cost and power
  • Over 35X speedup vs. an ARM Cortex-A9 @ 1 GHz for typical visual analytics applications
  • 20X to 40X power reduction vs. GPUs
• Development with OpenCL makes it possible to fully exploit the architecture, at the expense of extra development effort
  • Double buffering
  • Management of DMA-based data movement
  • Kernel fusion
• Programming tools outlook
  • Domain-specific higher-level capture and automation yield the best time-to-market and performance similar to OpenCL-based capture
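Kernel fusion, listed above among the manual optimizations, merges consecutive stages so that intermediate results stay in registers or local memory instead of passing through a full-size buffer. A minimal C sketch follows, assuming two hypothetical element-wise stages (a difference followed by a squaring); the choice of stages and the function names are illustrative only.

```c
#include <assert.h>

/* Unfused: two passes, with a full-size intermediate buffer tmp. */
void diff_then_square_unfused(int n, const float *a, const float *b,
                              float *tmp, float *out) {
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        tmp[i] = a[i] - b[i];
    }
    for (int i = 0; i < n; i++) {
        out[i] = tmp[i] * tmp[i];
    }
}

/* Fused: one pass, the intermediate value lives in a scalar. */
void diff_then_square_fused(int n, const float *a, const float *b,
                            float *out) {
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        float d = a[i] - b[i];
        out[i] = d * d;
    }
}
```

On an explicitly managed memory the fused form removes one buffer from local memory and one round of data movement, at the price of a larger, less reusable kernel.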
Thank you! Questions?