Intra-Core Loop-Task Accelerators for Task-Based Parallel Programs
Christopher Batten
Computer Systems Laboratory, School of Electrical and Computer Engineering
Cornell University, Spring 2018


Page 1: Intra-Core Loop-Task Accelerators for Task-Based Parallel Programs (cbatten/pdfs/batten-lta..., 2020. 7. 14.)

Intra-Core Loop-Task Accelerators for Task-Based Parallel Programs

Christopher Batten

Computer Systems Laboratory, School of Electrical and Computer Engineering

Cornell University

Spring 2018

Page 2

Motivating Trends in Computer Architecture

[Figure: 40 years of microprocessor trend data, 1975-2015 (transistors in thousands, SPECint performance, frequency in MHz, typical power in W, number of cores; log scale 10^0-10^7), with the MIPS R2K, DEC Alpha 21264, Intel P4, AMD 4-core Opteron, and Intel 48-core prototype marked, and SPECint performance growth segments annotated at ~15%/year and ~9%/year. Data collected by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, C. Batten. The trends motivate data-parallelism via GPGPUs & vector units and hardware specialization: fine-grain task-level parallelism, instruction set specialization, subgraph specialization, application-specific accelerators, domain-specific accelerators, coarse-grain reconfigurable arrays, and field-programmable gate arrays.]

Cornell University Christopher Batten 2 / 29

Page 3

[Figure: energy efficiency (tasks per joule) vs. performance (tasks per second) design space. A design power constraint bounds high-performance architectures and a design performance constraint bounds embedded architectures; moving from a simple processor toward more-flexible accelerators, less-flexible accelerators, and custom ASICs trades flexibility for specialization.]

Page 4

Vertically Integrated Research Methodology

Our research involves reconsidering all aspects of the computing stack including applications, programming frameworks, compiler optimizations, runtime systems, instruction set design, microarchitecture design, VLSI implementation, and hardware design methodologies.

[Figure: evaluation flow. Applications pass through a cross compiler to a binary; a functional-level model runs on a functional simulator, a cycle-level model on a cycle-level simulator, a register-transfer-level model on an RTL simulator, and a gate-level model on a gate-level simulator; synthesis and place-and-route produce layout, and switching activity drives power analysis. Key metrics: cycle count, cycle time, area, energy.]

Experimenting with full-chip layout, FPGA prototypes, and test chips is a key part of our research methodology.

Page 5

Projects Within the Batten Research Group

[Figure: BRG projects spanning the stack (algorithms, applications, compiler, PL, ISA, uArch, RTL, VLSI, circuits, technology):
  ● GPGPU Architecture [ISCA'13] [MICRO'14a] (AFOSR)
  ● Integrated Voltage Regulation [MICRO'14b] [TCAS'18]
  ● Accelerator-Centric SoC Design [DAC'16] [FPGA'17] [HOTCHIPS'17] (DARPA, NSF)
  ● Deep Processing in Memory (NSF, SRC)
  ● Accel for Loop-Level Parallelism [MICRO'14c] (DARPA, AFOSR, NSF)
  ● Accel for Task-Level Parallelism [MICRO'17] [ISCA'16] (AFOSR, NSF)
  ● PyMTL/Pydgin Frameworks [MICRO'14d] [ISPASS'15,'16] [DAC'18] (NSF, DARPA)]

Page 6

Inter-Core
● Task-Based Parallel Programming Frameworks
  ○ Intel TBB, Cilk


Page 8

Intra-Core
● Packed-SIMD Vectorization
  ○ Intel AVX, Arm NEON

Page 9

Challenges of Combining Tasks and Vectors


Challenge #1: Intra-Core Parallel Abstraction Gap

void app_kernel_tbb_avx(int N, float* src, float* dst) {
  // Pack data into padded, aligned chunks
  //   src -> src_chunks[NUM_CHUNKS * SIMD_WIDTH]
  //   dst -> dst_chunks[NUM_CHUNKS * SIMD_WIDTH]
  ...

  // Use TBB across cores
  parallel_for(range(0, NUM_CHUNKS, TASK_SIZE), [&](range r) {
    for (int i = r.begin(); i < r.end(); i++) {
      // Use packed-SIMD within a core
      #pragma simd vlen(SIMD_WIDTH)
      for (int j = 0; j < SIMD_WIDTH; j++) {
        if (src_chunks[i][j] > THRESHOLD)
          dst_chunks[i][j] = DoLightCompute(src_chunks[i][j]);
        else
          dst_chunks[i][j] = DoHeavyCompute(src_chunks[i][j]);
  ...
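For readers who want to run the shape of this kernel, here is a minimal, portable sketch of the same two-level structure: `std::thread` stands in for TBB's `parallel_for` across cores, and the inner j-loop marks the region a compiler would be asked to vectorize. `DoLightCompute`, `DoHeavyCompute`, `SIMD_WIDTH`, and `THRESHOLD` are hypothetical stand-ins, not the talk's actual kernels.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Hypothetical stand-ins for the slide's per-element kernels.
static float DoLightCompute(float x) { return x + 1.0f; }
static float DoHeavyCompute(float x) { return x * x + 1.0f; }

constexpr int   SIMD_WIDTH = 8;    // assumed packed-SIMD width
constexpr float THRESHOLD  = 0.5f; // assumed branch threshold

// One loop task: covers chunks [begin, end), each chunk SIMD_WIDTH wide.
// The inner j-loop is the part packed-SIMD (e.g. #pragma simd) would cover;
// here it is plain scalar code.
void loop_task(const float* src, float* dst, int begin, int end) {
  for (int i = begin; i < end; ++i)
    for (int j = 0; j < SIMD_WIDTH; ++j) {
      int idx = i * SIMD_WIDTH + j;
      dst[idx] = (src[idx] > THRESHOLD) ? DoLightCompute(src[idx])
                                        : DoHeavyCompute(src[idx]);
    }
}

// Inter-core layer: split the chunks across threads (TBB parallel_for's role).
void app_kernel(int num_chunks, const float* src, float* dst, int num_threads) {
  std::vector<std::thread> workers;
  int per = (num_chunks + num_threads - 1) / num_threads;
  for (int t = 0; t < num_threads; ++t) {
    int b = t * per;
    int e = std::min(num_chunks, b + per);
    if (b < e) workers.emplace_back(loop_task, src, dst, b, e);
  }
  for (auto& w : workers) w.join();
}
```

The point of the sketch is the gap itself: the programmer manages two unrelated mechanisms (tasks across cores, SIMD within a core) plus the chunking and padding that glue them together.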


Page 11


Challenge #2: Inefficient Execution of Irregular Tasks


Page 12

Native Performance Results

[Figure: native performance results on regular and irregular application kernels]

Page 13

Loop-Task Accelerator (LTA) Vision


● Motivation

● Challenge #1: LTA SW

● Challenge #2: LTA HW

● Evaluation

● Conclusion


Page 14

LTA SW: API and ISA Hint

void app_kernel_lta(int N, float* src, float* dst) {
  LTA_PARALLEL_FOR(0, N, (dst, src), ({
    if (src[i] > THRESHOLD)
      dst[i] = DoComputeLight(src[i]);
    else
      dst[i] = DoComputeHeavy(src[i]);
  }));
}

void loop_task_func(void* a, int start, int end, int step = 1);

Hint that hardware can potentially accelerate task execution
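A sketch of how the macro and the loop-task signature above might fit together, assuming a hypothetical `Args` capture struct and stand-in `DoComputeLight`/`DoComputeHeavy`/`THRESHOLD`; the actual LTA runtime's lowering is not shown on the slide.

```cpp
// Loop-task signature from the slide: one task covers iterations [start, end).
typedef void (*loop_task_func_t)(void* a, int start, int end, int step);

// Hypothetical struct LTA_PARALLEL_FOR might use to capture (dst, src).
struct Args {
  float*       dst;
  const float* src;
};

constexpr float THRESHOLD = 0.5f;                         // stand-in
static float DoComputeLight(float x) { return x + 1.0f; } // stand-in
static float DoComputeHeavy(float x) { return x * x; }    // stand-in

// Body the macro would generate from the user's loop body.
void task_body(void* a, int start, int end, int step) {
  Args* args = static_cast<Args*>(a);
  for (int i = start; i < end; i += step) {
    if (args->src[i] > THRESHOLD)
      args->dst[i] = DoComputeLight(args->src[i]);
    else
      args->dst[i] = DoComputeHeavy(args->src[i]);
  }
}

// Without an LTA present, the runtime can simply call the task function;
// with one, the ISA hint lets hardware execute the same range in parallel.
void run_serial_fallback(loop_task_func_t f, void* a, int n) {
  f(a, 0, n, 1);
}
```

Because the hint only names a function with this uniform signature, the same binary runs unchanged on cores with or without an LTA.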


Page 15

LTA SW: Task-Based Runtime



Page 17

LTA HW: Fully Coupled LTA


Coupling better for regular workloads (amortize frontend/memory)


Page 18

LTA HW: Fully Decoupled LTA

Decoupling better for irregular workloads (hide latencies)

Page 19

LTA HW: Task-Coupling Taxonomy

+ Higher Perf on Irregular   − Higher Area/Energy

Task Group (lock-step execution)

More decoupling (more task groups) in either space or time improves performance on irregular workloads at the cost of area/energy
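The labels above can be made concrete: write a configuration as lanes/lane-groups x chimes/chime-groups, so more lane groups means more spatial decoupling and more chime groups means more temporal decoupling. A small sketch that enumerates the power-of-two configurations for a given lane/chime count, following the naming convention used in the taxonomy grids later in the deck:

```cpp
#include <string>
#include <vector>

// Enumerate task-coupling configurations "L/GL x C/GC" for L lanes and
// C chimes: GL lane groups (spatial decoupling) and GC chime groups
// (temporal decoupling), with group counts as powers of two.
std::vector<std::string> configs(int lanes, int chimes) {
  std::vector<std::string> out;
  for (int gl = 1; gl <= lanes; gl *= 2)      // spatial axis
    for (int gc = 1; gc <= chimes; gc *= 2)   // temporal axis
      out.push_back(std::to_string(lanes) + "/" + std::to_string(gl) + "x" +
                    std::to_string(chimes) + "/" + std::to_string(gc));
  return out;
}
```

For example, 8 lanes x 4 chimes yields twelve points, from the fully coupled 8/1x4/1 to the fully decoupled 8/8x4/4.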


Page 20

LTA HW: Task-Coupling Taxonomy


Does it matter whether we decouple in space or in time?


Page 21

[Figure: task-coupling design space. Each configuration is written lanes/lane-groups x chimes/chime-groups; more lane groups means more spatial decoupling and more chime groups means more temporal decoupling. Grids are shown for 2 lanes x 4 chimes (2/1x4/1 ... 2/2x4/4), 4 lanes x 2 chimes (4/1x2/1 ... 4/4x2/2), 4 lanes x 8 chimes (4/1x8/1 ... 4/4x8/8), and 8 lanes x 4 chimes (8/1x4/1 ... 8/8x4/4).]

Page 22

LTA HW: Microarchitectural Template

[Figure: LTA microarchitectural template. A task distributor receives work from the GPP and fills a μTask queue; lane groups (y lanes per lane group, gc chime groups, z chimes per chime group) each contain per-lane μRFs, a PC, PFB, IQ, RT, short-latency functional units (SLFUs), and an LSU, plus a writeback arbiter with WQ and WCU. Long-latency MDU and FPU groups are shared through MDU/FPU crossbars, and IMU/TMU/DMU units connect through IMem/DMem crossbars (with PIB/PDB buffers and 4B/32B ports) to the L1 instruction and data caches.]


Page 24

Evaluation: Methodology


• Ported 16 application kernels from PBBS and in-house benchmark suites with diverse loop-task parallelism

• Scientific computing: N-body simulation, MRI-Q, SGEMM

• Image processing: bilateral filter, RGB-to-CMYK, DCT

• Graph algorithms: breadth-first search, maximal matching

• Search/Sort algorithms: radix sort, substring matching

• gem5 + PyMTL co-simulation for cycle-level performance

• Component/event-based area/energy modeling

• Uses area/energy dictionary backed by VLSI results and McPAT
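The dictionary-based model reduces to a dot product: multiply each event count from cycle-level simulation by a per-event energy number from the VLSI/McPAT-backed dictionary. A minimal sketch; the event names and energies in the usage below are illustrative, not the actual dictionary:

```cpp
#include <map>
#include <string>

// Component/event-based energy model: total energy is the sum over events of
// (event count from cycle-level simulation) x (per-event energy from a
// dictionary backed by VLSI results and McPAT).
double total_energy_pj(const std::map<std::string, long>&   counts,
                       const std::map<std::string, double>& dict_pj) {
  double total = 0.0;
  for (const auto& kv : counts) {
    auto it = dict_pj.find(kv.first);
    if (it != dict_pj.end())
      total += static_cast<double>(kv.second) * it->second;
  }
  return total;
}
```

For example, 100 I-cache accesses at 10 pJ plus 200 ALU ops at 1 pJ would total 1200 pJ.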


Page 25

Evaluation: Design-Space Exploration

[Figure: design-space exploration results across spatial decoupling, temporal decoupling, and resource constraints]

Page 26

Evaluation: Design-Space Exploration

Prefer spatial decoupling over temporal decoupling


Page 28

Evaluation: Design-Space Exploration

Reduce spatial decoupling to improve energy efficiency

Page 29

Evaluation: Energy Breakdown

[Figure: energy breakdowns for mriq (regular) and sarray (irregular) across the 8/1x4/1, 8/2x4/1, 8/4x4/1, and 8/8x4/1 configurations, split into I-cache, PIB, TMU, frontend, regfile, RT+ROB, SLFU, LLFU, LSU, and D-cache components]

Conservative comparison: the IO/O3 baselines run serial code, while the LTA uses the parallel runtime even on a single core


Page 30

Evaluation: Energy Efficiency vs. Performance

[Figure: energy efficiency vs. performance, normalized to an in-order baseline (left, In-Order vs IO+LTA) and an out-of-order baseline (right, Out-of-Order vs IO+LTA)]

Page 31

Evaluation: Multicore LTA Performance

[Figure: multicore LTA performance on regular and irregular kernels, with annotated speedups of 4.4x, 2.9x, 10.7x, and 5.2x]

Page 32

Evaluation: Multicore Energy and Area

• Both the baseline CMP and the CMP+LTA designs use the same application code and almost exactly the same parallel runtime

• CMP+LTA vs. CMP-IO: improves energy efficiency by 1.1× (geo mean)

• CMP+LTA vs. CMP-O3: improves energy efficiency by 3.2× (geo mean)
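The geo-mean numbers aggregate per-benchmark energy-efficiency ratios; for reference, the geometric mean of a set of ratios is computed as below (a generic helper, not code from the study):

```cpp
#include <cmath>
#include <vector>

// Geometric mean of per-benchmark improvement ratios: exp(mean(log r_i)).
// Preferred over the arithmetic mean for ratios because inverting every
// ratio simply inverts the result, so speedups and slowdowns weigh equally.
double geo_mean(const std::vector<double>& ratios) {
  double log_sum = 0.0;
  for (double r : ratios) log_sum += std::log(r);
  return std::exp(log_sum / static_cast<double>(ratios.size()));
}
```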


Page 33

Related Work

• Challenge #1: Intra-Core Parallel Abstraction Gap
  ○ Persistent threads for GPGPUs (S. Tzeng et al.)
  ○ OpenCL, OpenMP, C++ AMP
  ○ Cilk for packed-SIMD (B. Ren et al.)
  ○ and more ...

• Challenge #2: Inefficient Execution of Irregular Tasks
  ○ Variable warp sizing (T. Rogers et al.)
  ○ Temporal SIMT (S. Keckler et al.)
  ○ Vector-lane threading (S. Rivoire et al.)
  ○ and more ...

• See the MICRO'17 paper for detailed references

Page 34

LTA Take-Away Points

• Intra-core parallel abstraction gap and inefficient execution of irregular tasks are fundamental challenges for CMPs

• LTAs address both challenges with a lightweight ISA hint and a flexible microarchitectural template

[Figure: LTA vision — GPPs augmented with intra-core LTAs; a task-based parallel programming framework partitions work into loop tasks, and an LTA ISA hint connects the LTA SW to the LTA HW]

• Results suggest that in a resource-constrained environment, architects should favor spatial decoupling over temporal decoupling

Page 35

Upcoming Computer Laboratory Seminar

“A New Era of Open-Source SoC Design” — Wednesday, May 16th @ 4:15pm

Celerity: An Accelerator-Centric System-on-Chip

CARRV’17, October 14, 2017, Boston, MA, USA T. Ajayi et al.

[Figure 1: Celerity SoC Architecture — a general-purpose tier of five RV64G cores (each with I-cache, D-cache, RoCC, and AXI interfaces), a massively parallel tier of RV32IM manycore tiles (I mem, D mem, crossbar, NoC router), a specialization tier with a BNN accelerator, off-chip I/O, and a bus from the BSG IP library]

stack to support new software research ideas. Complete off-the-shelf RISC-V processor and memory system implementations (e.g., Rocket chip SoC generator) enable rapidly deploying traditional processors before modifying and/or extending these initial implementations with new hardware research ideas. Similarly, OCN IP that is designed within the RISC-V ecosystem (e.g., NASTI, TileLink) can reduce system-level integration effort. Standard verification test suites can greatly simplify developing new processor microarchitectures, and turn-key FPGA gateware (e.g., framework for running Rocket cores on various Xilinx Zynq FPGA boards) can help reduce engineering effort. The entire RISC-V ecosystem has a strong emphasis on open-source software and hardware which facilitates modifying and/or extending just the component of interest in the context of a given research idea.

In this paper, we describe our experiences using the RISC-V ecosystem to build Celerity, an accelerator-centric system-on-chip (SoC) which uses a tiered accelerator fabric to improve energy efficiency in the context of high-performance embedded systems [1]. The general-purpose tier includes a few fully featured RISC-V processors capable of running general-purpose software including an operating system, networking stack, and non-critical control/configuration software. This tier is optimized for high flexibility, but of course at the cost of energy efficiency. The massively parallel tier includes hundreds of lightweight RISC-V processors, a distributed, non-cache-coherent memory system, and a mesh-based interconnect. This tier is optimized for efficiently executing applications with fine-grain data- and/or thread-level parallelism. The specialization tier includes application-specific accelerators (possibly generated using high-level synthesis). This tier is optimized for extreme energy efficiency, but of course at the cost of flexibility. We envision a three-step process for mapping algorithms to such fabrics. Step 1: Implement the algorithm using the general-purpose tier. Step 2: Accelerate the algorithm using either the massively parallel tier OR the specialization tier. Step 3: Improve performance and efficiency by cooperatively using both the specialization AND the massively parallel tier. A key feature of tiered accelerator fabrics is the use of high-throughput parallel links to interconnect all three tiers.

The Celerity SoC is a 5×5 mm 385M-transistor chip in TSMC 16 nm designed and implemented by a team of over 20 students and faculty from the University of Michigan, Cornell University, and the Bespoke Silicon Group at the University of Washington and the University of California, San Diego, as part of the DARPA Circuit Realization At Faster Timescales (CRAFT) program. Figure 1 illustrates the SoC architecture. The Celerity SoC includes five Chisel-generated Rocket RV64G cores in the general-purpose tier, a 496-core RV32IM tiled manycore processor in the massively parallel tier, and a complex HLS-generated BNN (binarized neural network) accelerator implemented as a Rocket custom co-processor (RoCC) in the specialization tier. Celerity also includes tightly integrated Rocket-to-manycore communication channels, manycore-to-BNN high-speed links, a sleep-mode subsystem with ten RV32IM cores, a fully synthesizable phase-locked-loop clocking subsystem, and a digital low-dropout voltage regulator. The chip was taped out in May 2017, and it will return from the foundry in the fall. The Celerity SoC is an open-source project, and links to all of the source files are available online at http://opencelerity.org.

In the rest of the paper, we describe each of the three tiers in more detail by answering four key questions: What did we build in that tier? How did we build it? How did we leverage the RISC-V ecosystem to facilitate design, implementation, and verification in that tier? And what were the challenges in leveraging the RISC-V ecosystem in that tier? Overall, the RISC-V ecosystem played an important role in enabling a team of junior graduate students to design and tape out the highest-performance RISC-V SoC to date in just nine months.

2 GENERAL-PURPOSE TIER WITH RV64G CORES

The general-purpose tier uses fully featured RISC-V processors to execute general-purpose software. This tier is optimized for high flexibility, at the potential expense of energy efficiency.

What Did We Build? – The Celerity SoC general-purpose tier includes five RV64G cores. The RV64G instruction set is comprised of approximately 150 instructions for 64-bit integer arithmetic, single- and double-precision floating-point arithmetic, memory access, unconditional and conditional control flow, and atomic memory operations. The cores use a relatively simple five-stage, single-issue, in-order pipeline. The RV64G core includes a memory management unit and support for RISC-V machine, supervisor, and user privilege levels. It can be used in either a bare-metal mode, with a proxy kernel (i.e., system calls are proxied to a separate host machine), or with a RISC-V port of the Linux operating system. The RV64G cores serve as the interface between the other tiers and the off-chip “northbridge” which is implemented as gateware in an FPGA. The northbridge includes support for initial boot-up, off-chip DRAM, and other I/O. Each RV64G core includes a 16 KB four-way set-associative instruction cache and a 16 KB four-way set-associative data cache. There is no on-chip L2 cache. The five RV64G cores are not cache-coherent, and thus only support running five independent instances of non-SMP Linux. Limited communication between the RV64G cores is possible using software-managed coherence and special support in the northbridge.

How Did We Build It? – We used the Berkeley Rocket chip SoC generator to create the RV64G core [2]. The Rocket chip SoC generator is written in Chisel [4], a hardware construction language embedded in Scala. Generating an RV64G core simply required

PyMTL: A Python-Based Hardware Modeling Framework


Page 36

Ji Kim, Shreesha Srinath, Christopher Torng, Berkin Ilbeyi, Moyang Wang, Shunning Jiang, Khalid Al-Hawaj, Tuan Ta, Lin Cheng

and many M.S./B.S. students

Equipment, Tools, and IP: Intel, NVIDIA, Synopsys, Cadence, Xilinx, ARM


Page 37

Batten Research Group

Exploring cross-layer hardware specialization using a vertically integrated research methodology

[Figure: energy efficiency (tasks per joule) vs. performance (tasks per second) design space with design power and performance constraints, embedded and high-performance architectures, and the flexibility-vs-specialization spectrum from more-flexible accelerators to custom ASICs]

[Figure: LTA vision summary — GPPs augmented with intra-core LTAs, a task-based parallel programming framework generating loop tasks, and an LTA ISA hint connecting LTA SW to LTA HW]