
A Flexible Multi-Core Platform For Multi-Standard Video Applications

Soo-Ik Chae

Center for SoC Design Technology

Seoul National University

MPSoC 2009, Savannah, Georgia, USA

Content

• Motivation
• Proposed multi-core platform architecture
  • RISC cluster
  • Hardware operating system kernel
  • Computation coprocessor architecture
  • Communication architecture with two separate networks
• Design flow for application mapping
• Experimental result
  • H.264/AVC 720p high-profile decoder implementations
• Future work

High-performance Video Systems

Huge computation load
• 60 GOPS to decode 1080p at 30 fps
• Dedicated H/W blocks (for high-end applications)

Multiple standards / new standards
• MPEG-2/4, H.264, DivX, VC-1, etc.
• Software with RISC, DSP, SIMD processors

Embedded in mobile devices
• PMPs, smart phones, etc.
• Area and energy efficiencies are critical

Large data transfers and memories
• At least 96 MB for 1080p decoders
• Application-specific optimized communication and memory architectures

A platform should satisfy all of these CONFLICTING requirements: a flexible high-performance platform.

Proposed Multi-core Platform Architecture

An array of RISC clusters with coprocessors, connected through two separate networks: control and data.

Each RISC cluster consists of up to 4 cores, shared I$ and D$, a HOSK, and coprocessors.

[Figure: multi-core platform. RISC clusters 0 to 3 are connected by a control network (FIFO groups) and a data network (local memories, direct memory access controller, global memory). Each cluster contains RISC core 0 (control), RISC cores 1 to 3 (data), a shared I-cache and a shared D-cache, a hardware OS kernel (HOSK) with thread control memory and context memory, a context/data bus, a computation coprocessor, and communication coprocessors for data processing and control processing.]

A Multi-threading RISC Cluster

Design concerns:

• Multithreading: scheduling, context switching, load balancing
• Communication: synchronization, message passing (channel access)
• Implementation: area (complexity), scalability (# of threads, # of cores), coherent shared memory

Building blocks: hardware operating system, area-efficient RISC cores, shared caches, communication co-processor (CCP), and H/W blocks connected through channels or CATs.

Mechanisms:

• H/W-based task queue management + {priority + RR}-based task scheduling
• Fast context switching in 4 ~ 17 cycles
• Dynamic thread allocation + pre-emptive multithreading (priority- or round-robin-based)
• Thread migration without compulsory cache misses
• H/W-based mutex/semaphore
• No cache-coherency problem
• Channel access with a single co-processor instruction
• Thread suspend or wake-up without software intervention
• On-chip/off-chip memory-based context memory
• No system services in each core + shared multiplier unit
• Use larger SRAMs; no cache fragmentation
• The number of cores in a cluster is limited due to cache sharing.

Hardware Operating System Kernel (HOSK)

[Figure: HOSK block diagram. Each core (config, datapath, register file R0, R1, ..., R15 (PC)) connects over the context bus to the HOS, which contains the main controller, the context manager with a context buffer (R0, R1, ..., R15 (PC)), and the thread manager (task scheduling & semaphore control). Context memory and thread control memory are SDRAM or SRAM. Cores also attach to the coprocessor bus (or data bus).]

Context switching time by context bus width:
• 32-bit bus: 17 cycles
• 64-bit bus: 9 cycles
• 544-bit bus: 4 cycles

Context switching order: R15, R14, R13, … Contexts are pre-fetched or saved in the background.

HOSK blocks:
• Main controller: receives service requests and controls the other blocks
• Context manager: pre-fetches or saves contexts in the background
• Thread manager: schedules tasks and controls semaphores
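The cycle counts quoted for the 32-bit and 64-bit context buses are consistent with moving a 544-bit context at one bus word per cycle. The sketch below only illustrates that arithmetic. Treating 544 bits (17 words of 32 bits) as the context size is an assumption inferred from the widest bus on the slide, and `bus_transfers` is a hypothetical helper name.

```python
import math

# Assumption: one full context is 544 bits (17 x 32-bit words),
# inferred from the widest context bus mentioned on the slide.
CONTEXT_BITS = 544


def bus_transfers(bus_width_bits: int, context_bits: int = CONTEXT_BITS) -> int:
    """Number of bus words needed to move one full context."""
    return math.ceil(context_bits / bus_width_bits)


for width in (32, 64, 544):
    print(f"{width}-bit bus: {bus_transfers(width)} transfer(s)")
```

For the 544-bit bus this model gives a single transfer, while the slide quotes 4 cycles, so the real switch presumably includes a fixed overhead that this sketch does not model.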


Computation Coprocessors

Local memory is accessed by both RISC cores and the computation coprocessors

Coprocessor task manager selects an available hardware thread for an outstanding coprocessor command

A pool of hardware threads

General coprocessor interface

Command queues to issue nonblocking coprocessor commands

[Figure: a RISC cluster with RISC core 0 (control) and RISC cores 1 to 3 (data), running a pool of software threads, issues commands through a command queue manager, an arbiter and a command queue to a thread pool of hardware threads T0, T1, ..., Tn that share the local memory.]

Coprocessors are implemented for the computation-intensive parts of the video algorithms that cannot be run on the RISC cores.
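The interplay between the nonblocking command queue and the pool of hardware threads can be sketched as a small software model. This is an illustrative toy only, not the coprocessor's actual interface: the class name, method names, and first-idle-thread selection policy are all assumptions.

```python
from collections import deque


class CoprocessorTaskManager:
    """Toy model: a command queue feeding a fixed pool of hardware threads."""

    def __init__(self, num_hw_threads: int):
        self.queue = deque()                 # outstanding commands, FIFO order
        self.busy = [None] * num_hw_threads  # command on each HW thread, or None

    def issue(self, command: str) -> None:
        """Nonblocking issue: the core enqueues the command and continues."""
        self.queue.append(command)

    def dispatch(self) -> list:
        """Give queued commands to idle HW threads; return (thread, command) pairs."""
        started = []
        for tid, running in enumerate(self.busy):
            if running is None and self.queue:
                self.busy[tid] = self.queue.popleft()
                started.append((tid, self.busy[tid]))
        return started

    def complete(self, tid: int) -> None:
        """Mark a hardware thread as idle again."""
        self.busy[tid] = None


mgr = CoprocessorTaskManager(num_hw_threads=2)
for cmd in ("cmd_a", "cmd_b", "cmd_c"):
    mgr.issue(cmd)          # never blocks, even with more commands than threads
print(mgr.dispatch())       # two commands start on threads 0 and 1
mgr.complete(0)
print(mgr.dispatch())       # the remaining command takes the freed thread
```

The point of the model is that `issue` never waits for a free hardware thread; a command simply stays in the queue until the task manager finds an available thread.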


Communication Network Architecture

Communication among RISC clusters uses two separate networks.

Control network
• Carries smaller data and synchronization information
• Based on conventional message passing
• Employs point-to-point hardware FIFOs
• Provides a new path to transfer data

Data network
• Carries larger data
• Based on remote DMA operations, in a bus-like style
• Employs memory (local or global) and hardware FIFOs
• Handles high-rate data transfers for stream-based applications

[Figure: two RISC clusters, each with RISC cores 1 to 3 (data), a HOSK and coprocessors, connected by the control network and the data network.]


Control Network: point-to-point FIFO based

[Figure: the coprocessors in clusters A, B and C, each in its own clock domain (clock A, clock B, clock C), connect through clusterID controllers and fifoID controllers to FIFO groups. Each FIFO group holds FIFOs 0, 1, ..., N, labelled 32.]

• Fully programmable connectivity
• Two-level distributed identification for FIFOs
• Each control transaction is initiated by a control core with a clusterID and a fifoID
• A control core can issue a command to the communication coprocessor in a single cycle for a control transaction
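The two-level identification can be pictured as indexing first by cluster and then by FIFO within that cluster's group. The following is a minimal software sketch of that addressing scheme, not a description of the hardware: the class, its parameters, the 32-bit word width, and the full/empty return conventions are assumptions for illustration.

```python
from collections import deque


class ControlNetwork:
    """Toy model of point-to-point FIFOs addressed by (clusterID, fifoID)."""

    def __init__(self, num_clusters: int, fifos_per_group: int, depth: int):
        self.depth = depth
        # One FIFO group per cluster; each group holds several FIFOs.
        self.groups = [[deque() for _ in range(fifos_per_group)]
                       for _ in range(num_clusters)]

    def send(self, cluster_id: int, fifo_id: int, word: int) -> bool:
        """Push one word (assumed 32-bit); return False if the FIFO is full."""
        fifo = self.groups[cluster_id][fifo_id]
        if len(fifo) >= self.depth:
            return False
        fifo.append(word & 0xFFFFFFFF)
        return True

    def receive(self, cluster_id: int, fifo_id: int):
        """Pop the oldest word, or return None if the FIFO is empty."""
        fifo = self.groups[cluster_id][fifo_id]
        return fifo.popleft() if fifo else None


net = ControlNetwork(num_clusters=3, fifos_per_group=2, depth=2)
net.send(cluster_id=1, fifo_id=0, word=0xCAFE)
print(hex(net.receive(cluster_id=1, fifo_id=0)))
```

In this sketch a sender that finds the FIFO full simply gets `False` back; on the platform the slides describe thread suspend and wake-up without software intervention, which this model leaves out.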

Data Communication Network

[Figure: RISC clusters (cores P, coprocessor, I-$, D-$) share local memories (SRAM) over local data networks. A global data network connects the clusters through the DMA controller (multimedia address translator (MAT), request queue, data recombination unit (DRU)) and memory controllers to the global memory for streaming data (SDRAM) and the global memory for I/D cache data (SDRAM).]

• Streaming data is stored in either a local memory or a global memory, depending on the size of the data.
• The platform provides nC2 = n(n-1)/2 local data links for n RISC clusters.
• Local data between two RISC clusters is exchanged through a shared local memory.
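The nC2 count is simply the number of unordered cluster pairs, since each pair of clusters can share one local memory. A short sketch, with `local_links` as a hypothetical helper name and the cluster counts 4 and 6 chosen to match the CIF and 720p examples in this deck:

```python
from itertools import combinations
from math import comb


def local_links(num_clusters: int) -> list:
    """All unordered cluster pairs; each pair can share one local memory."""
    return list(combinations(range(num_clusters), 2))


for n in (4, 6):
    links = local_links(n)
    print(n, "clusters:", len(links), "possible local links; comb =", comb(n, 2))
```

This is an upper bound on connectivity; a given mapping only needs the links that carry an actual data flow.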


Global Data Communication with a DMAC

• A centralized DMA controller performs address translation, DMA request queue management, and data arrangement, so that data cores are free from tasks related to data transfers.
• The two global data networks, for streaming data and for I/D cache data, can be either unified or separated, depending on the configuration of the memory controllers.
• A small buffer is used between the DMA controller and a RISC cluster for DMA operations.

[Figure: a RISC cluster (cores P, coprocessor, I-$, D-$) connects through the DMA controller (multimedia address translator (MAT), request queue, data recombination unit (DRU)) and a memory controller to the global memory for streaming data, and through a second memory controller to the global memory for I/D cache data.]

Design Flow for Application Mapping

Starting with an application model and a platform model with constraints:
• video specification
• area, power
• operating frequency
• number of clusters

Flow steps:
1. application profiling
2. cluster partitioning
3. communication mapping
4. TLM modeling & function profiling
5. HW/SW thread partitioning & mapping
6. performance estimation
7. verification

Annotations on the flow:
• function partitioning & clustering
• configurable network
• SystemC simulation in TLM; multithreading
• Core #, cache sizing for each cluster; sizing local memories
• Code generation (for RISC clusters); RTL coding or generation (for coprocessors)
• FPGA prototyping

Partitioning into clusters

According to the profiling results for reference software, the application is first partitioned into grouped functions. Each grouped function is mapped onto a RISC cluster.

Assumptions: RISC clusters with 4 cores @ 200 MHz, utilization rate = 0.7

Upper MIPS bound for a 4-core cluster = 560 MIPS
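The 560 MIPS bound follows from the stated assumptions if each core retires one instruction per cycle (an assumption implied by the numbers, not stated on the slide). The sketch below reproduces the bound and, as an illustration, the number of software-only clusters needed for the 2091 MIPS CIF decoder load quoted in the resolution table; the function names are hypothetical.

```python
import math


def cluster_mips_bound(cores: int, clock_mhz: float, utilization: float) -> float:
    """Upper MIPS bound, assuming one instruction per cycle per core."""
    return cores * clock_mhz * utilization


def clusters_needed(required_mips: float, bound: float) -> int:
    """Minimum number of software-only clusters for a given load."""
    return math.ceil(required_mips / bound)


bound = cluster_mips_bound(cores=4, clock_mhz=200, utilization=0.7)
print(round(bound), "MIPS per 4-core cluster")
print(clusters_needed(2091, bound), "clusters for the CIF decoder")
```

The CIF result agrees with the 4-cluster mapping used in the deck. The same estimate does not apply to the 720p mapping, where coprocessors take over part of the load.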

Cluster Partitioning

Example: an H.264/AVC CIF decoder is mapped onto 4 RISC clusters.

Resolution   MBs    MIPS
CIF          396    2091
D1           1350   7130
720p         3600   19013
1080p        8160   43097

[Figure: H.264 decoder block diagram (H.264 bitstream, entropy decoding, inverse quantization, intra prediction, inter prediction, MUX, reconstruction, deblocking filter, output; neighbor reference pixels, current 16x16, multi reference frames, frame N-1), with the blocks grouped into RISC clusters. MIPS labels in the figure: 231, 259, 113, 1087, 45, 356.]
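The MB column of the resolution table matches the number of 16x16 macroblocks per frame, and the MIPS column scales almost linearly with it. The sketch below checks both observations. The frame dimensions are standard values assumed here (the slide lists only the MB counts), and D1 is taken as 720x480.

```python
import math

# Assumed frame sizes in pixels; not given on the slide.
FRAME_SIZES = {"CIF": (352, 288), "D1": (720, 480),
               "720p": (1280, 720), "1080p": (1920, 1080)}
# MIPS figures copied from the resolution table.
SLIDE_MIPS = {"CIF": 2091, "D1": 7130, "720p": 19013, "1080p": 43097}


def macroblocks(width: int, height: int) -> int:
    """Number of 16x16 macroblocks, rounding partial rows and columns up."""
    return math.ceil(width / 16) * math.ceil(height / 16)


for name, (w, h) in FRAME_SIZES.items():
    mbs = macroblocks(w, h)
    print(name, mbs, round(SLIDE_MIPS[name] / mbs, 2))
```

Under these assumptions every resolution comes out at roughly 5.28 MIPS per macroblock, which suggests the table was obtained by scaling a per-macroblock profile.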


Cluster Partitioning

Example: an H.264/AVC 720p decoder is mapped onto 6 RISC clusters.

Resolution   MBs    MIPS
CIF          396    2091
D1           1350   7130
720p         3600   19013
1080p        8160   43097

[Figure: H.264 decoder block diagram (H.264 bitstream, entropy decoding, inverse quantization, intra prediction, inter prediction, MUX, reconstruction, deblocking filter, output; neighbor reference pixels, current 16x16, multi reference frames, frame N-1), with the blocks grouped into RISC clusters. MIPS labels in the figure: 2106, 2354, 1026, 9882, 480, 3240.]

Communication Mapping

1. Identify control and data flows among the clusters.

2. Map each control flow onto a specific FIFO in a FIFO group.

3. Map each streaming data flow onto a local data network or the global data network, according to its bandwidth requirement.

4. Map data flows for the I/D caches onto the global memory.

Example 1: Control Network Mapping for an H.264/AVC CIF high-profile decoder

[Figure: the ED, ITQ / INTRA / RECON, INTER and DF clusters connected through FIFO groups. Each link is labelled with its transaction rate and size: 0.24 MB/s, 32 x 5; 0.048 MB/s, 32 x 1; 0.048 MB/s, 32 x 1; 1.63 MB/s, 32 x 34; 0.19 MB/s, 32 x 4.]

Example 1: Data Network Mapping for an H.264/AVC CIF high-profile decoder

[Figure: the ED, ITQ / INTRA / RECON, INTER and DF clusters connected through local memories, the DMAC and the global memory. Bandwidth labels: 40.96 MB/s, 4.57 MB/s, 4.56 MB/s (four links), 9.12 MB/s (two links), 0.19 MB/s. Size labels: 3.41 KB, 0.45 KB, 2.40 KB.]

Example 2: Control Network Mapping for an H.264/AVC 720p high-profile decoder

[Figure: the ED, ITQ, INTRA, INTER, RECON and DF clusters connected through FIFO groups. Each link is labelled with its transaction rate and size: 0.43 MB/s, 32 x 1 (seven links); 14.69 MB/s, 32 x 34; 1.70 MB/s, 32 x 4; 1.30 MB/s, 32 x 3.]

Example 2: Data Network Mapping for an H.264/AVC 720p high-profile decoder

[Figure: the ED, ITQ, INTRA, INTER, RECON and DF clusters connected through local memories, the DMAC and the global memory. Link labels: 82.9 MB/s, 1.63 KB; 82.9 MB/s, 1.54 KB; 41.5 MB/s, 0.77 KB (three links); 19.9 MB/s, 2.95 KB; 1.73 MB/s; 372.4 MB/s; 41.5 MB/s; -, 0.64 KB; -, 2.40 KB.]

HW/SW Thread Partitioning & Mapping

For each RISC cluster:

1. Profile the required MIPS of each thread from TLM modeling.

2. Select the number of RISC cores and of HW threads in the coprocessor.

3. Allocate the threads to the cores or the coprocessor in the cluster.

4. Go back to step 2 if the result is not good enough.

Example: Thread Partitioning & Mapping for Intra-prediction (1)

• ~480 MIPS for intra prediction in the 720p decoder
• Upper bound for a 4-core cluster: 560 MIPS
• Map all threads to SW
• Thread-level parallelism is limited due to dependency among the threads, which limits core utilization

Thread           MIPS
control          2.9
luma 4x4         401.8
luma 8x8         298.3
luma 16x16       95.8
chroma cb/cr     78.0

Example: Thread Partitioning & Mapping for Intra-prediction (2)

Dependency and intra-prediction order in a MB (4x4 luma intra prediction for luma samples)

Dependency:
0  1  2  3
2  3  4  5
4  5  6  7
6  7  8  9

Intra-prediction ordering:
0  1  4  5
2  3  6  7
8  9  12 13
10 11 14 15

Core utilization: limited because of limited parallelism (2). Reducing cores from 4 to 3.

[Figure: per-core utilization of the INTRA cluster. With 4 cores: 0.26, 0.53, 0.54, 0.25. With 3 cores: 0.79, 0.78, 0.23.]
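The dependency grid on this slide can be reproduced as a wavefront schedule. The sketch below assumes that each 4x4 block depends on its left, top-left, top and top-right neighbours inside the macroblock (a simplified model of 4x4 intra prediction, adopted here for illustration) and computes the earliest step at which each block can start. The function names are hypothetical.

```python
def wavefront_levels(n: int = 4) -> list:
    """Earliest step at which each block can start, given its neighbour dependencies."""
    level = [[0] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            # Assumed dependencies: left, top-left, top, top-right neighbours.
            deps = [(r, c - 1), (r - 1, c - 1), (r - 1, c), (r - 1, c + 1)]
            ready = [level[y][x] + 1 for y, x in deps
                     if 0 <= y < n and 0 <= x < n]
            level[r][c] = max(ready, default=0)
    return level


def max_parallel_blocks(level: list) -> int:
    """Largest number of blocks sharing one step, i.e. the usable parallelism."""
    flat = [v for row in level for v in row]
    return max(flat.count(step) for step in set(flat))


levels = wavefront_levels()
for row in levels:
    print(row)
print("max blocks per step:", max_parallel_blocks(levels))
```

Under this model at most two blocks are ready in any step, which is consistent with the parallelism of 2 noted on the slide and with the low utilization seen when four cores are assigned.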

Example: Thread Partitioning & Mapping for Inter-prediction

• Inter prediction case in the 720p decoder
• Upper bound for a 4-core cluster: 560 MIPS
• One of several possible SW-HW partitions is selected.

Thread                        MIPS
control                       2.9
luma DMA setup                300.7
luma data recombination       1838.6
luma interpolation            4644.9
chroma DMA setup              269.6
chroma data recombination     546.1
chroma interpolation          414.7

[Figure: per-core utilization of the INTER cluster: 0.45, 0.81, 0.79, 0.76.]

A Software-Centric Solution

For H.264/AVC 720p High-Profile Decoder


Complexity of 720p High-profile Decoder

Logic gate count and memory usage

Synthesis conditions: 0.18-um CMOS technology; 200 MHz for RISC clusters and 100 MHz for others

Computation (logic part in K gates, memory part in KB):

Component        RISC cluster (cores)   Coprocessor   I-cache   Tag    D-cache   Tag
ED cluster       186 (2)                35            8.00      0.67   4.00      0.38
ITQ cluster      226 (3)                30            4.00      0.35   1.00      0.10
INTRA cluster    226 (3)                0             32.00     2.43   2.00      0.20
INTER cluster    266 (4)                36            8.00      0.67   2.00      0.20
RECON cluster    145 (1)                5             1.00      0.10   0.50      0.05
DF cluster       145 (1)                10            8.00      0.67   0.50      0.05
Sum              1,194 (14)             116           61.00     4.90   10.00     1.00

Communication:

Component         Logic (K gates)   Memory (KB)
Control network   25                -
Data network      42                11.50
Sum               67                11.50

Total (Logic + Memory): 1,377 K gates, 88.39 KB
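The totals in the complexity table can be cross-checked by summing the per-cluster figures. The sketch below copies the numbers from the slide; the tuple layout and variable names are illustrative only.

```python
# Per cluster: (cluster logic K gates, cores, coprocessor K gates,
#               I-cache KB, I-tag KB, D-cache KB, D-tag KB), copied from the slide.
CLUSTERS = {
    "ED":    (186, 2, 35, 8.00, 0.67, 4.00, 0.38),
    "ITQ":   (226, 3, 30, 4.00, 0.35, 1.00, 0.10),
    "INTRA": (226, 3, 0, 32.00, 2.43, 2.00, 0.20),
    "INTER": (266, 4, 36, 8.00, 0.67, 2.00, 0.20),
    "RECON": (145, 1, 5, 1.00, 0.10, 0.50, 0.05),
    "DF":    (145, 1, 10, 8.00, 0.67, 0.50, 0.05),
}
NETWORK_LOGIC = 25 + 42   # control network + data network, K gates
NETWORK_MEMORY = 11.50    # data network memory, KB

logic = sum(row[0] + row[2] for row in CLUSTERS.values()) + NETWORK_LOGIC
cores = sum(row[1] for row in CLUSTERS.values())
memory = sum(sum(row[3:]) for row in CLUSTERS.values()) + NETWORK_MEMORY

print(f"logic: {logic} K gates, cores: {cores}, memory: {memory:.2f} KB")
```

The logic and core totals match the slide exactly. The memory total computed from the rounded per-cluster entries differs from the quoted 88.39 KB only by rounding in the tag columns.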


Thread Partitioning for 720p (@ 200MHz)

In the figure, each task list is marked as either a coprocessor task or a processor task.

ED Cluster (2 cores)
• intra mode calculation; motion vector calculation; NAL decoding
• VLD parsing; boundary strength calculation

ITQ Cluster (3 cores)
• luma dc transform; chroma dc transform; chroma dequantization; chroma inverse transformation
• 4x4 luma dequantization; 4x4 inverse transform; 8x8 luma dequantization; 8x8 inverse transform

INTRA Cluster (3 cores)
• luma 4x4 prediction; luma 8x8 prediction; luma 16x16 prediction; chroma 8x8 prediction

INTER Cluster (4 cores)
• chroma prediction
• luma sample prediction; luma DMA setup; luma data recombination; chroma data recombination; chroma DMA setup

RECON Cluster (1 core)
• add residual and prediction

DF Cluster (1 core)
• DMA setup
• filtering process; data recombination

Communication Network

[Figure: complete communication network of the 720p decoder. Clusters: ED (2 cores), ITQ (3 cores), INTRA (3 cores), INTER (4 cores), RECON (1 core), DF (1 core). The control network uses FIFO groups, the local data network uses local memories, and the global data networks for streaming data and for cache data use the DMAC and the global memories. Bandwidth labels: 21.6 MB/sec, 415.63 MB/sec, 310.2 MB/sec, 196 MB/sec.]

Core Utilization (@200MHz)

The pair after each cluster name is (thread number, context switching number per MB).

Cluster          Utilization per core
ED (3, 2)        0.63, 0.58
ITQ (4, 7)       0.81, 0.73, 0.82
INTRA (3, 0)     0.79, 0.78, 0.23
INTER (4, 0)     0.45, 0.81, 0.79, 0.76
RECON (1, 0)     0.55
DF (1, 0)        0.45

Design Space Exploration

• Seven mappings of an H.264 720p decoder
• With the same networks for control and data communication

[Figure: complexity in logic (K gates, axis 0 to 1600) and complexity in memory (KB, axis 0 to 90) versus total number of cores for the seven mappings (6, 8, 8, 10, 11, 13, 14 cores), ranging between hardware-centric and software-centric solutions.]

Future Works

• More codec implementations
  • H.264/AVC 720p high-profile encoder
  • VC-1 720p advanced-profile decoder
• Flexible coprocessors: coarse-grained reconfigurable architecture (CGRA)

[Figure: a RISC cluster with RISC core 0 (control) and RISC cores 1 to 3 (data) issues commands through a command queue manager, an arbiter and a command queue to a thread pool T0, T1, ..., Tn that shares the local memory.]

Thank you
