A Flexible Multi-Core Platform For Multi-Standard Video Applications
Soo-Ik Chae
Center for SoC Design Technology
Seoul National University
MPSoC 2009, Savannah, Georgia, USA
Content
• Motivation
• Proposed multi-core platform architecture
  • RISC cluster
  • Hardware operating system kernel
  • Computation coprocessor architecture
  • Communication architecture with two separated networks
• Design flow for application mapping
• Experimental result
  • H.264/AVC 720p high profile decoder implementations
• Future work
High-performance Video Systems
Huge computation load
• 60 GOPS to decode 1080p 30 fps
• Dedicated H/W blocks (for high-end applications)
Multiple standards / new standards
• MPEG2/4, H.264, DivX, VC-1, etc.
• Software with RISC, DSP, SIMD processors
Embedded in mobile devices
• PMPs, smart phones, etc.
• Area and energy efficiencies are critical
Large data transfers and memories
• At least 96MB for 1080p decoders
• Application-specific optimized communication and memory architectures
Should satisfy all of these CONFLICTING requirements!!
Flexible high-performance platform
Proposed Multi-core Platform Architecture
An array of RISC clusters with coprocessors connected through two separated networks: control and data
Each RISC cluster consists of up to 4 cores, shared I$ and D$, a HOSK, and coprocessors.
[Figure: multi-core platform. RISC clusters 0 to 3 are connected by a control network (FIFO groups) and a data network (local memories, a direct memory access controller, and the global memory). Inside a cluster: RISC core 0 (control), RISC cores 1 to 3 (data), shared I-cache, shared D-cache, hardware OS kernel (HOSK) with thread control memory and context memory, context/data bus, computation coprocessor, and communication coprocessors for data processing and for control processing.]
A Multi-threading RISC Cluster
Multithreading: scheduling, context switching, load balancing
Communication: synchronization, message passing (channel access)
Implementation: area (complexity), scalability (# of threads, # of cores), coherent shared memory
[Figure: a hardware operating system, three area-efficient RISC cores, shared caches, and a communication co-processor (CCP) connected to H/W blocks through channels or CATs.]
• H/W-based task queue management + {priority + RR}-based task scheduling
• Fast context switching in 4 ~ 17 cycles
• Dynamic thread allocation + pre-emptive multithreading (priority or round robin)
• Thread migration without compulsory cache misses
• H/W-based mutex/semaphore
• No cache-coherency problem
• Channel access with a single co-processor instruction
• Thread suspend or wake-up without software intervention
• On-chip/off-chip memory-based context memory
• No system services in each core + shared multiplier unit
• Use larger SRAMs
• No cache fragmentation
• The number of cores in a cluster is limited due to cache sharing.
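The {priority + RR}-based task scheduling named on this slide can be illustrated with a small software model. This is only a sketch under assumptions: the class and method names are hypothetical, level 0 is taken as the highest priority, and the real HOSK does this in hardware rather than in code.

```python
from collections import deque


class ReadyQueues:
    """Software model of priority plus round-robin scheduling (hypothetical names)."""

    def __init__(self, num_priorities):
        # One FIFO of ready thread IDs per priority level; level 0 is highest (assumption).
        self.queues = [deque() for _ in range(num_priorities)]

    def make_ready(self, thread_id, priority):
        self.queues[priority].append(thread_id)

    def pick_next(self):
        """Return (thread_id, priority) of the next thread to run, or None if idle."""
        for priority, queue in enumerate(self.queues):
            if queue:
                return queue.popleft(), priority
        return None

    def time_slice_expired(self, thread_id, priority):
        # Round robin: a preempted thread goes to the back of its own priority level.
        self.make_ready(thread_id, priority)


rq = ReadyQueues(num_priorities=2)
rq.make_ready("luma", 1)
rq.make_ready("chroma", 1)
rq.make_ready("control", 0)

order = []
for _ in range(4):
    thread_id, priority = rq.pick_next()
    order.append(thread_id)
    if thread_id != "control":  # the control thread suspends after one run
        rq.time_slice_expired(thread_id, priority)
print(order)  # ['control', 'luma', 'chroma', 'luma']
```

The higher-priority thread runs first, and the two equal-priority threads then alternate, which is the behavior the two policies are combined to give.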
Hardware Operating System Kernel (HOSK)
[Figure: HOSK block diagram. Labels: config, core, context controller, context buffer (R0, R1, ..., R15 (PC)), thread manager (task scheduling & semaphore control), main controller, and context manager. The context bus connects to the context memory and the thread control memory (SDRAM or SRAM); the coprocessor bus (or data bus) connects to each core's datapath and register file (R0, R1, ..., R15 (PC)).]
Context switching time by context bus width:
• 32-bit bus: 17 cycles
• 64-bit bus: 9 cycles
• 544-bit bus: 4 cycles
Context switching order: R15, R14, R13, …
Pre-fetch or save contexts in background!
• Main controller: receives service requests and controls the other blocks
• Context manager: pre-fetches or saves contexts in the background
• Thread manager: schedules tasks and controls semaphores
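The three cycle counts quoted for the 32-, 64-, and 544-bit context buses fit a simple transfer model. The sketch below is one model that reproduces them, not the documented HOSK timing: the 544-bit context size (17 words of 32 bits) and the fixed 4-cycle floor are assumptions inferred from the numbers.

```python
import math

CONTEXT_BITS = 544  # assumption: 17 words x 32 bits, inferred from the widest bus
MIN_CYCLES = 4      # assumption: fixed floor, inferred from the 544-bit case


def context_switch_cycles(bus_width_bits):
    """Estimate cycles to move one thread context over the context bus."""
    transfers = math.ceil(CONTEXT_BITS / bus_width_bits)
    return max(MIN_CYCLES, transfers)


cycles = {width: context_switch_cycles(width) for width in (32, 64, 544)}
print(cycles)  # {32: 17, 64: 9, 544: 4}
```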
Computation Coprocessors
Local memory is accessed by both RISC cores and the computation coprocessors
Coprocessor task manager selects an available hardware thread for an outstanding coprocessor command
A pool of hardware threads
General coprocessor interface
Command queues to issue nonblocking coprocessor commands
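[Figure: a RISC cluster in which RISC core 0 (control) and RISC cores 1 to 3 (data), labeled as a pool of software threads, issue commands through a command queue manager (arbiter and command queue) to the computation coprocessor. The coprocessor holds a thread pool of hardware threads T0, T1, ..., Tn and connects to the local memory.]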
Implemented for the computation-intensive parts of the video algorithms that cannot be run on the RISC cores.
Communication Network Architecture
Two separated communication networks among RISC clusters
• Control network: smaller data size, and synchronization information
  • based on conventional message passing
  • employs point-to-point hardware FIFOs
  • provides a new path to transfer data
• Data network: larger data size
  • based on remote DMA operations, in a bus-like style
  • employs memory (local or global) and hardware FIFOs
  • handles high-rate data transfers for stream-based applications
[Figure: two RISC clusters, each with RISC core 0 (control), RISC cores 1 to 3 (data), a HOSK, and coprocessors. The coprocessors of both clusters attach to the control network and to the data network.]
Control Network: point-to-point FIFO based
[Figure: the control cores (RISC core 0) of clusters A, B, and C, each in its own clock domain (clock A, clock B, clock C), connect through their coprocessors to FIFO groups. Each FIFO group contains FIFOs 0, 1, ..., N, labeled 32, and is accessed through a clusterID controller and a fifoID controller.]
• Fully programmable connectivity
• Two-level distributed identification for FIFOs
• Each control transaction is initiated by a control core with a clusterID and a fifoID
• A control core can issue a command to the communication coprocessor in a single cycle for a control transaction
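The two-level identification (clusterID, then fifoID) can be modeled as an index into a FIFO group followed by an index within that group. The sketch below is a toy model, not the hardware design: the class, its methods, the FIFO depth, and the 32-bit word width (read from the "32" labels in the figure) are assumptions.

```python
from collections import deque


class ControlNetwork:
    """Toy model: a FIFO is named by the pair (cluster_id, fifo_id)."""

    def __init__(self, num_clusters, fifos_per_group, depth):
        self.depth = depth  # hypothetical FIFO depth in words
        self.groups = [
            [deque() for _ in range(fifos_per_group)] for _ in range(num_clusters)
        ]

    def send(self, cluster_id, fifo_id, word):
        """Push one word; return False if the FIFO is full."""
        fifo = self.groups[cluster_id][fifo_id]
        if len(fifo) >= self.depth:
            return False
        fifo.append(word & 0xFFFFFFFF)  # assumption: 32-bit words
        return True

    def receive(self, cluster_id, fifo_id):
        """Pop one word, or return None if the FIFO is empty."""
        fifo = self.groups[cluster_id][fifo_id]
        return fifo.popleft() if fifo else None


net = ControlNetwork(num_clusters=3, fifos_per_group=4, depth=2)
sent = [net.send(1, 0, word) for word in (0xA, 0xB, 0xC)]
received = [net.receive(1, 0) for _ in range(3)]
print(sent, received)  # [True, True, False] [10, 11, None]
```

In the platform, a full or empty FIFO would presumably suspend the calling thread rather than return a status value; the return values here only make the model easy to inspect.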
Data Communication Network
[Figure: four RISC clusters, each with four processors (P), a coprocessor, an I-$, and a D-$. Clusters share local memories (SRAM) over local data networks 1 to 3. Global data network 1 connects the clusters through the DMA controller (multimedia address translator (MAT), request queue, data recombination unit (DRU)) and a memory controller to the global memory for streaming data (SDRAM). Global data network 2 connects them through a memory controller to the global memory for I/D cache data (SDRAM).]
Streaming data is stored in either a local memory or a global memory, depending on the size of the data.
The platform provides nC2 local data links.
Local data between two RISC clusters is exchanged through a shared local memory.
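As a quick check of the nC2 figure, the count below assumes that n denotes the number of RISC clusters and that there is one shared local memory per pair of clusters; the function name is hypothetical.

```python
import math


def local_data_links(num_clusters):
    """Number of distinct cluster pairs, i.e. n choose 2."""
    return math.comb(num_clusters, 2)


links = {n: local_data_links(n) for n in (4, 6)}
print(links)  # {4: 6, 6: 15}
```

This is the upper limit on pairwise links; a given mapping need only instantiate the pairs that actually exchange data.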
Global Data Communication with a DMAC
A centralized DMA controller performs address translation, DMA request queue management, and data arrangement so that data cores are free from tasks related to data transfers
The two global data networks, for streaming data and for I/D cache data, can be either unified or separated, depending on the configuration of the memory controllers
A small buffer is used between the DMA controller and a RISC cluster for DMA operations
[Figure: a RISC cluster (four processors, a coprocessor, I-$ and D-$). Global data network 1 links it through the DMA controller (multimedia address translator (MAT), request queue, data recombination unit (DRU)) and a memory controller to the global memory for streaming data. Global data network 2 links the I-$ and D-$ through a memory controller to the global memory for I/D cache data.]
Design Flow for Application Mapping
Starting with an application model and a platform model with constraints
• video specification
• area, power
• operating frequency
• number of clusters
Flow: application profiling -> cluster partitioning -> communication mapping -> TLM modeling & function profiling -> HW/SW thread partitioning & mapping -> performance estimation -> verification
Annotations on the flow steps:
• function partitioning & clustering
• configurable network
• SystemC simulation in TLM
• Multithreading
• Code generation (for RISC clusters)
• RTL coding or generation (for coprocessors)
• Core #, cache sizing for each cluster
• Sizing local memories
• FPGA prototyping
Partitioning into clusters
According to the profiling results for reference software, the application is first partitioned into grouped functions.
Each grouped function is mapped into a RISC cluster.
Assumptions:
• RISC clusters with 4 cores @ 200 MHz
• utilization rate = 0.7
Upper MIPS bound for a 4-core cluster = 560 MIPS
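The 560 MIPS bound follows directly from the stated assumptions if each core retires at most one instruction per cycle; that one-instruction-per-cycle reading is an assumption, as is the helper name. The same bound gives a lower limit on the number of software-only clusters for the CIF load of 2091 MIPS listed in the resolution table.

```python
import math


def cluster_mips_bound(cores, clock_mhz, utilization):
    """Upper MIPS bound of one cluster, assuming one instruction per cycle per core."""
    return cores * clock_mhz * utilization


bound = cluster_mips_bound(cores=4, clock_mhz=200, utilization=0.7)
cif_clusters = math.ceil(2091 / bound)  # CIF load taken from the resolution table
print(round(bound), cif_clusters)  # 560 4
```

The result of four clusters agrees with the CIF example, which maps the decoder into 4 RISC clusters.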
Cluster Partitioning
Example: an H.264/AVC CIF decoder is mapped into 4 RISC clusters
Resolutions MBs MIPS
CIF 396 2091
D1 1350 7130
720p 3600 19013
1080p 8160 43097
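The MBs column matches the number of 16x16 macroblocks per frame, and the MIPS column scales almost linearly with it. The sketch below recomputes both; the frame dimensions are assumptions (common definitions, with D1 taken as 720x480), and the function name is hypothetical.

```python
import math


def macroblocks_per_frame(width, height, mb_size=16):
    """Count 16x16 macroblocks, rounding partial blocks up."""
    return math.ceil(width / mb_size) * math.ceil(height / mb_size)


# Frame sizes are assumptions; the MIPS values are copied from the table.
resolutions = {"CIF": (352, 288), "D1": (720, 480),
               "720p": (1280, 720), "1080p": (1920, 1080)}
table_mips = {"CIF": 2091, "D1": 7130, "720p": 19013, "1080p": 43097}

mb_counts = {name: macroblocks_per_frame(w, h) for name, (w, h) in resolutions.items()}
for name, mbs in mb_counts.items():
    print(name, mbs, round(table_mips[name] / mbs, 2))
```

Under these assumptions every row comes out at about 5.28 MIPS per macroblock, so the table can be read as a per-macroblock cost multiplied by the frame size.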
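[Figure: H.264 decoder block diagram grouped into RISC clusters. Input: H.264 bitstream. Blocks: entropy decoding, inverse quantization, intra prediction (neighbor reference pixels, current 16x16), inter prediction (multiple reference frames, frame N-1), MUX, reconstruction, and deblocking filter, leading to the output. Per-block loads labeled in the figure: 231 MIPS, 259 MIPS, 113 MIPS, 1087 MIPS, 45 MIPS, and 356 MIPS.]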
Cluster Partitioning
Example: An H.264/AVC 720p decoder is mapped into 6 RISC clusters
[Figure: H.264 decoder block diagram grouped into RISC clusters. Input: H.264 bitstream. Blocks: entropy decoding, inverse quantization, intra prediction (neighbor reference pixels, current 16x16), inter prediction (multiple reference frames, frame N-1), MUX, reconstruction, and deblocking filter, leading to the output. Per-block loads labeled in the figure: 2106 MIPS, 2354 MIPS, 1026 MIPS, 9882 MIPS, 480 MIPS, and 3240 MIPS.]
Communication Mapping
1. Identify control and data flows among the clusters
2. Map each control flow into a specific FIFO in a FIFO group
3. Map a data flow for streaming into a local data network or the global data network according to the size of its bandwidth requirement
4. Map data flows for I/D cache into the global memory
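Steps 2 to 4 amount to a classification of each flow. The sketch below shows one way to express that rule; it is illustrative only. The `Flow` type, the flow names, the bandwidth values, the 100 MB/s threshold, and the choice to send the larger streaming flows to the global data network are all assumptions, not values from the slides.

```python
from dataclasses import dataclass


@dataclass
class Flow:
    name: str              # hypothetical flow name
    kind: str              # "control", "stream", or "cache"
    bandwidth_mb_s: float  # illustrative bandwidth requirement


def map_flow(flow, local_limit_mb_s):
    """Return the network resource a flow is mapped to (steps 2 to 4)."""
    if flow.kind == "control":
        return "FIFO in a FIFO group"
    if flow.kind == "cache":
        return "global memory"
    if flow.kind == "stream":
        # Assumption: flows above the threshold go to the global data network.
        if flow.bandwidth_mb_s <= local_limit_mb_s:
            return "local data network"
        return "global data network"
    raise ValueError(f"unknown flow kind: {flow.kind}")


flows = [
    Flow("sync", "control", 0.5),
    Flow("small stream", "stream", 40.0),
    Flow("large stream", "stream", 400.0),
    Flow("cache refill", "cache", 200.0),
]
mapping = {f.name: map_flow(f, local_limit_mb_s=100.0) for f in flows}
for name, target in mapping.items():
    print(f"{name}: {target}")
```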
Example 1: Control Network Mapping for an H.264/AVC CIF high-profile decoder
[Figure: control network among the ED cluster, the ITQ / INTRA / RECON cluster, the INTER cluster, and the DF cluster, connected through FIFO groups. Each link is labeled with its transaction and size:]
• 0.24 MB/s, 32 x 5
• 0.048 MB/s, 32 x 1
• 0.048 MB/s, 32 x 1
• 1.63 MB/s, 32 x 34
• 0.19 MB/s, 32 x 4
Example 1: Data Network Mapping for an H.264/AVC CIF high-profile decoder
[Figure: data network among the ED cluster, the ITQ / INTRA / RECON cluster, the INTER cluster, and the DF cluster, connected through three local memories and, via the DMAC, the global memory. Transaction labels in the figure: 40.96 MB/s, 4.57 MB/s, 4.56 MB/s (four links), 9.12 MB/s (two links), and 0.19 MB/s. Size labels in the figure: 3.41 KB, 0.45 KB, and 2.40 KB.]
Example 2: Control Network Mapping for an H.264/AVC 720p high-profile decoder
[Figure: control network among the ED, ITQ, INTRA, INTER, RECON, and DF clusters, connected through ten FIFO groups. Each link is labeled with its transaction and size:]
• 0.43 MB/s, 32 x 1 (seven links)
• 14.69 MB/s, 32 x 34
• 1.70 MB/s, 32 x 4
• 1.30 MB/s, 32 x 3
Example 2: Data Network Mapping for an H.264/AVC 720p high-profile decoder
[Figure: data network among the ED, ITQ, INTRA, INTER, RECON, and DF clusters, connected through eight local memories and, via the DMAC, the global memory. Each link is labeled with its transaction and size:]
• 82.9 MB/s, 1.63 KB
• 82.9 MB/s, 1.54 KB
• 41.5 MB/s, 0.77 KB (three links)
• 19.9 MB/s, 2.95 KB
• 1.73 MB/s
• 372.4 MB/s
• 41.5 MB/s
• -, 0.64 KB
• -, 2.40 KB
HW/SW Thread Partitioning & Mapping
For each RISC cluster (cores and coprocessors):
1. Profile the required MIPS of each thread from TLM modeling
2. Select # of RISC cores and HW threads in the coprocessor
3. Allocate the threads to the cores or the coprocessor in the cluster
4. Back to step 2 if the result is not good enough
Example: Thread Partitioning & Mapping for Intra-prediction (1)
• ~480 MIPS for intra prediction in the 720p decoder
• Upper bound for a 4-core cluster: 560 MIPS
Map all threads to SW
Thread-level parallelism is limited due to dependency among the threads, which limits core utilization
threads          MIPS
control          2.9
luma 4x4         401.8
luma 8x8         298.3
luma 16x16       95.8
chroma cb/cr     78.0
Example: Thread Partitioning & Mapping for Intra-prediction (2)
Dependency and intra-prediction order in a MB (4x4 luma intra prediction for luma samples)
Dependency:
0 1 2 3
2 3 4 5
4 5 6 7
6 7 8 9
Intra-prediction ordering:
0  1  4  5
2  3  6  7
8  9  12 13
10 11 14 15
Core utilization: limited because of limited parallelism
(2) Reducing cores from 4 to 3
[Figure: two bar charts of core utilization (axis 0 to 1) in the INTRA cluster, over RISC core 0 (control) and RISC cores 1 to 3 (data). One chart has four bars labeled 0.26, 0.53, 0.54, and 0.25; the other has three bars labeled 0.79, 0.78, and 0.23.]
Example: Thread Partitioning & Mapping for Inter-prediction
• Inter prediction case in the 720p decoder
• Upper bound for a 4-core cluster: 560 MIPS
One of several possible SW-HW partitions is selected.
threads                      MIPS
control                      2.9
luma DMA setup               300.7
luma data recombination      1838.6
luma interpolation           4644.9
chroma DMA setup             269.6
chroma data recombination    546.1
chroma interpolation         414.7
[Figure: bar chart of core utilization (axis 0 to 1) in the INTER cluster, over RISC core 0 (control) and RISC cores 1 to 3 (data). Bar labels: 0.45, 0.81, 0.79, and 0.76.]
A Software-Centric Solution
For H.264/AVC 720p High-Profile Decoder
Complexity of 720p High-profile Decoder
Logic gate count and memory usage
Synthesis conditions
• 0.18-um CMOS technology
• 200 MHz for RISC clusters and 100 MHz for others

Computation (logic part in K gates, with the number of cores in parentheses; memory part in KB)
Component       RISC cluster   Coprocessor   I-cache   Tag    D-cache   Tag
ED cluster      186 (2)        35            8.00      0.67   4.00      0.38
ITQ cluster     226 (3)        30            4.00      0.35   1.00      0.10
INTRA cluster   226 (3)        0             32.00     2.43   2.00      0.20
INTER cluster   266 (4)        36            8.00      0.67   2.00      0.20
RECON cluster   145 (1)        5             1.00      0.10   0.50      0.05
DF cluster      145 (1)        10            8.00      0.67   0.50      0.05
Sum             1,194 (14)     116           61.00     4.90   10.00     1.00

Communication (logic part in K gates; memory part in KB)
Component         Logic   Memory
Control network   25      -
Data network      42      11.50
Sum               67      11.50

Total (Logic + Memory): 1,377 K gates, 88.39 KB
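The totals in the complexity table can be cross-checked by summing its rows. The sketch below copies the table values and reads the parenthesized numbers as core counts, which is an assumption based on the per-cluster core counts given elsewhere in the slides.

```python
# Per cluster: (RISC cluster K gates, cores, coprocessor K gates,
#               I-cache KB, I-tag KB, D-cache KB, D-tag KB)
clusters = {
    "ED":    (186, 2, 35, 8.00, 0.67, 4.00, 0.38),
    "ITQ":   (226, 3, 30, 4.00, 0.35, 1.00, 0.10),
    "INTRA": (226, 3, 0, 32.00, 2.43, 2.00, 0.20),
    "INTER": (266, 4, 36, 8.00, 0.67, 2.00, 0.20),
    "RECON": (145, 1, 5, 1.00, 0.10, 0.50, 0.05),
    "DF":    (145, 1, 10, 8.00, 0.67, 0.50, 0.05),
}
network_logic = 25 + 42  # control network + data network, K gates
network_memory = 11.50   # data network, KB

cores = sum(row[1] for row in clusters.values())
logic = sum(row[0] + row[2] for row in clusters.values()) + network_logic
memory = sum(sum(row[3:]) for row in clusters.values()) + network_memory
print(cores, logic, round(memory, 2))  # 14 1377 88.37
```

The core count and logic total match the table exactly. The memory total comes out about 0.02 KB below the listed 88.39 KB, presumably because the individual table entries are rounded.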
Thread Partitioning for 720p (@ 200MHz)
Each task list in the figure is labeled as either a coprocessor task or a processor task.
ED Cluster (2 cores)
• 1. intra mode calculation 2. motion vector calculation 3. NAL decoding
• 1. VLD parsing 2. boundary strength calculation
ITQ Cluster (3 cores)
• 1. luma dc transform 2. chroma dc transform 3. chroma dequantization 4. chroma inverse transformation
• 1. 4x4 luma dequantization 2. 4x4 inverse transform 3. 8x8 luma dequantization 4. 8x8 inverse transform
INTRA Cluster (3 cores)
• 1. luma 4x4 prediction 2. luma 8x8 prediction 3. luma 16x16 prediction 4. chroma 8x8 prediction
INTER Cluster (4 cores)
• 1. chroma prediction
• 1. luma sample prediction 2. luma DMA setup 3. luma data recombination 4. chroma data recombination 5. chroma DMA setup
RECON Cluster (1 core)
• 1. add residual and prediction
DF Cluster (1 core)
• 1. DMA setup
• 1. filtering process 2. data recombination
Communication Network
[Figure: communication network of the 720p decoder. Clusters: ED (2 cores), ITQ (3 cores), INTRA (3 cores), INTER (4 cores), RECON (1 core), and DF (1 core). They are connected by the control network (ten FIFO groups), the local data network (eight local memories), the global data network for streaming data (DMAC and global memory), and the global data network for cache data (global memory). Bandwidth labels in the figure: 21.6 MB/sec, 415.63 MB/sec, 310.2 MB/sec, and 196 MB/sec.]
Core Utilization (@200MHz)
[Figure: six bar charts of core utilization (axis 0 to 1), one per cluster, over RISC core 0 (control) and RISC cores 1 to 3 (data). The pair after each cluster name is (thread number, context switching number per MB), followed by the bar labels.]
• ED Cluster (3, 2): 0.63, 0.58
• ITQ Cluster (4, 7): 0.81, 0.73, 0.82
• INTRA Cluster (3, 0): 0.79, 0.78, 0.23
• INTER Cluster (4, 0): 0.45, 0.81, 0.79, 0.76
• RECON Cluster (1, 0): 0.55
• DF Cluster (1, 0): 0.45
Design Space Exploration
• Seven mappings of an H.264 720p decoder
• With the same networks for control and data communication
[Figure: chart over the total number of cores (6, 8, 8, 10, 11, 13, 14), showing complexity in logic (K gates, axis 0 to 1600) and complexity in memory (KB, axis 0 to 90). The two ends of the range are annotated as software-centric and hardware-centric.]
Future Work
• More codec implementations
  • H.264/AVC 720p high-profile encoder
  • VC-1 720p advanced-profile decoder
• Flexible coprocessors: coarse-grained reconfigurable architecture (CGRA)
[Figure: a RISC cluster in which RISC core 0 (control) and RISC cores 1 to 3 (data) issue commands through a command queue manager (arbiter and command queue) to the computation coprocessor, with its thread pool T0, T1, ..., Tn and the local memory.]
Thank you