cstalks-polymorphic heterogeneous multicore systems-17aug

blog.nus.edu.sg/cstalks

Polymorphic Heterogeneous Multi-Core Systems

Mihai Pricopi

CSTalks

August 17, 2011

Motivation


Single-core performance (complexity) increase

Motivation


Instruction-level parallelism (ILP)

1: e = a + b

2: f = c + d

3: g = e * f

4: h = f * 2

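[Figure: dependence graph of instructions 1 to 4. Instructions 1 and 2 are independent; 3 depends on 1 and 2; 4 depends on 2.]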
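The dependences among the four instructions above can be checked with a minimal sketch. It assumes unit latency and unlimited issue width; the dictionary encoding and the helper name `issue_cycles` are hypothetical, chosen only for illustration.

```python
# Hypothetical encoding of the four instructions: number -> (destination, sources).
instructions = {
    1: ("e", ["a", "b"]),
    2: ("f", ["c", "d"]),
    3: ("g", ["e", "f"]),
    4: ("h", ["f"]),
}

def issue_cycles(instrs):
    """Earliest issue cycle per instruction (assumes unit latency, unlimited width).

    Also assumes each variable is written once and every producer appears
    earlier in program order than its consumers, as in this example.
    """
    producer = {dest: n for n, (dest, _) in instrs.items()}
    cycles = {}
    for n in sorted(instrs):
        _, sources = instrs[n]
        # Inputs such as a, b, c, d have no producer and impose no wait.
        deps = [producer[s] for s in sources if s in producer]
        cycles[n] = 1 + max((cycles[d] for d in deps), default=0)
    return cycles

cycles = issue_cycles(instructions)
ilp = len(instructions) / max(cycles.values())
print(cycles)  # {1: 1, 2: 1, 3: 2, 4: 2}
print(ilp)     # 2.0
```

Under these assumptions the four instructions finish issuing in two cycles, so the available ILP is 2 instructions per cycle.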

Motivation


2006 2007


Motivation


Thread-level parallelism (TLP)

Multi-threaded applications

Multi-programmed jobs

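[Figure: a single multi-threaded process spread over cores P0 and P1, and two multi-programmed jobs, Process0 and Process1, running on cores P0 and P1]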

Motivation


NVIDIA Tesla many-core: up to 960 simple, identical cores.

Massively exploits TLP.

Sequential programs suffer from limited ILP exploitation.

A gap between TLP and ILP.

Solution: heterogeneous systems that bridge the gap between TLP and ILP.
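One way to see why sequential code gains little from many simple cores is Amdahl's law. The slides do not use this model; it is brought in here as a standard approximation, and the parallel fractions below are made-up illustrative values. Only the core count of 960 comes from the slide.

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Amdahl's law: speedup over one core when only part of the work scales.

    Assumes all cores are identical and the serial part runs on one of them.
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)

# Illustrative parallel fractions (assumptions, not measurements).
for fraction in (0.5, 0.9, 0.99):
    print(fraction, round(amdahl_speedup(fraction, 960), 1))
```

Even with 960 cores, a program that is half sequential cannot reach a 2x speedup in this model, which is the gap a stronger core for the serial part is meant to close.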


Heterogeneous Chip Multi-processors


Multi-core systems that use cores with different performance parameters.

Existing results show that heterogeneous systems are more efficient than homogeneous ones in terms of performance, power, area and delay.

Heterogeneity can be achieved by using:

◦ Asymmetric chip multi-processors (ACMPs)

◦ Multiprocessor system-on-chip (MPSoC)

◦ Polymorphic architectures that dynamically reconfigure their internal structure to adapt to different software requirements




Asymmetric chip multi-processors (ACMPs)

[Figure: two example ACMP layouts, one with cores P0 to P3 and one with cores P0 to P4]



Multiprocessor system-on-chip (MPSoC)

[Figure: example MPSoC with an ARM core, a DSP, a memory controller, a video accelerator and bridges]

Program Phase Behavior - gzip


Program Phase Behavior - gcc


Polymorphic Heterogeneous Multi-Core Systems


• General-purpose applications

• Novel architecture that can be tailored to the software requirements

• Base system: homogeneous processor

• Reconfigurable capabilities

  • Internal structure adaptation

  • Core-coalition

  • Memory

[Figure: 16 cores, P0 to P15, with two reconfigurable fabric (RF) blocks]


Polymorphic Heterogeneous Multi-Core Systems – Reconfigurable Fabric


• Reconfigurable hardware shared by different processors

• RF implements custom instructions

• Dynamic reconfiguration at runtime – speedup

1: e = a + b

2: f = c + d

3: g = e * f

4: h = f * 2

[Figure: dependence graph of instructions 1 to 4, with a custom instruction implemented on the RF shared by P0 and P1]


Polymorphic Heterogeneous Multi-Core Systems – Reconfigurable Fabric


• Challenging problems:

• The amount of RF is limited.

• Deciding when to reconfigure the RF (scheduling).

• Choosing the set of custom instructions that gives the highest speedup.

• Overhead of the dynamic reconfiguration.
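Choosing custom instructions under a limited amount of RF can be viewed as a 0/1 knapsack problem. The sketch below is one possible formulation, not the method used in this work: the candidate names, areas and gains are hypothetical, and reconfiguration overhead and scheduling are deliberately ignored.

```python
def select_custom_instructions(candidates, area_budget):
    """0/1 knapsack: maximize total gain with total area <= area_budget.

    candidates: list of (name, area, gain) tuples with integer areas.
    Returns (best_gain, chosen_names).
    """
    # best[a] holds the best (gain, names) achievable with at most `a` area units.
    best = [(0, [])] * (area_budget + 1)
    for name, area, gain in candidates:
        # Walk the budget downward so each candidate is picked at most once.
        for a in range(area_budget, area - 1, -1):
            gain_with = best[a - area][0] + gain
            if gain_with > best[a][0]:
                best[a] = (gain_with, best[a - area][1] + [name])
    return best[area_budget]

# Hypothetical candidates: (name, RF area units, cycles saved).
candidates = [("ci_a", 3, 40), ("ci_b", 4, 50), ("ci_c", 2, 30)]
result = select_custom_instructions(candidates, area_budget=5)
print(result)  # (70, ['ci_a', 'ci_c'])
```

With a budget of 5 area units, two small custom instructions beat the single largest one here. A realistic selector would also charge each reconfiguration against the gain.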


Polymorphic Heterogeneous Multi-Core Systems – Core Structure Adaptation


• Similar performance can be achieved by using smaller processor internal units.

• Instruction fetch window size, issue width, instruction window size and frequency can be changed dynamically.

• Power and thermal concerns.
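The first point can be sketched as a simple selection rule: for each program phase, pick the narrowest configuration whose performance is close to the best one. This is only an illustrative policy under assumed inputs; the IPC numbers, the 5% tolerance and the function name `pick_issue_width` are all hypothetical.

```python
def pick_issue_width(measured_ipc, tolerance=0.05):
    """Return the smallest issue width within `tolerance` of the best IPC.

    measured_ipc: dict mapping issue width -> IPC observed for one phase.
    """
    best_ipc = max(measured_ipc.values())
    for width in sorted(measured_ipc):
        if measured_ipc[width] >= (1.0 - tolerance) * best_ipc:
            return width

# Hypothetical per-phase measurements (not taken from the talk).
low_ilp_phase = {2: 0.95, 4: 0.98, 8: 0.99}
high_ilp_phase = {2: 1.6, 4: 2.9, 8: 3.0}

print(pick_issue_width(low_ilp_phase))   # 2
print(pick_issue_width(high_ilp_phase))  # 4
```

In the low-ILP phase the narrowest core is nearly as fast as the widest, so the wider units could be switched off to save power.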


Polymorphic Heterogeneous Multi-Core Systems – Core-Coalition


• Coalition helps create “stronger” cores from the existing light cores:

• accelerates serial applications by extracting more ILP (if available).

• uses a limited amount of hardware shared between cores.

• coalitions of up to 4 cores can be formed.

[Figure: 2-core coalition: P0 (2-way) + P1 (2-way) ≡ P (4-way)]
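A rough sizing rule for coalitions can be sketched as follows. The slide only states that two 2-way cores behave like one 4-way core and that at most 4 cores can be joined; the linear scaling to other sizes, the ILP estimates and the helper `coalition_size` are assumptions made for illustration.

```python
BASE_WIDTH = 2      # each light core is 2-way (from the slide)
MAX_COALITION = 4   # at most 4 cores per coalition (from the slide)

def coalition_size(estimated_ilp):
    """Smallest coalition whose combined width covers the estimated ILP.

    Assumes the combined width grows linearly with the number of cores,
    which the slide confirms only for the 2-core case.
    """
    for n_cores in range(1, MAX_COALITION + 1):
        if n_cores * BASE_WIDTH >= estimated_ilp:
            return n_cores
    return MAX_COALITION  # ILP beyond what the largest coalition offers

# Hypothetical ILP estimates for four serial applications.
for ilp in (1.5, 3.2, 7.0, 12.0):
    print(ilp, coalition_size(ilp))
```

An application with little ILP stays on a single light core, leaving the remaining cores free for other threads.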


Polymorphic Heterogeneous Multi-Core Systems – Core-Coalition Execution Model

Mihai Pricopi, National University of Singapore

[Figure: control-flow graph (CFG) with basic blocks B0 to B4, and a timeline of their execution on Core 0 and Core 1 through the pipeline stages SF, RF, EX and CM]

SF: Sentinel instruction fetch and global renaming

RF: Regular instruction fetch, decode and renaming

EX: Regular instruction execution

CM: Regular instruction commit

Experimental Results - Speedup


Experimental Results – Load Balance


Proposed directions


Next steps:

◦ Implement Coalition on FPGA.

◦ Further study of the overhead and power consumption introduced by the shared resources.

◦ Implement a dynamic scheduler for Coalition.


Questions?

Next Week’s Talk

A Unified Framework for Recommendations in the Social Network by Chen Wei

Join us next Wednesday!

Wednesday, 31 August, 2011
