Benchmarks for Analyzing Embedded Multicore Processor

Leveling the Field for Multicore Open Systems Architectures. Markus Levy, President, EEMBC; President, Multicore Association


Post on 01-Jan-2017


Page 1: Benchmarks for Analyzing Embedded Multicore Processor

Leveling the Field for Multicore Open Systems Architectures

Markus Levy

President, EEMBC

President, Multicore Association

Page 2: Benchmarks for Analyzing Embedded Multicore Processor

Analyzing the Multicore Ecosystem

• Embedded Microprocessor Benchmark Consortium® (EEMBC)

• Industry benchmarks since 1997
• Tool for evaluating embedded processors, compilers, systems
• MultiBench for shredding multicore processors

Page 3: Benchmarks for Analyzing Embedded Multicore Processor

Enabling the Multicore Ecosystem

• Initial engagement began in May 2005
• Industry-wide participation
• Current efforts
– Communications APIs
– Hypervisors
– Multicore Programming Practices
– Resource Management

Page 4: Benchmarks for Analyzing Embedded Multicore Processor

Multicore Issues to Solve

• Concurrent programming
• Communications, synchronization, resource management between/among cores
• Performance analysis
• Debugging
• Distributed power management
• OS virtualization
• Modeling and simulation
• Load balancing
• Algorithm partitioning

Page 5: Benchmarks for Analyzing Embedded Multicore Processor

Multicore Benchmarking Rules

• Do not rely on a single answer
• Match your application requirements
– Small or large data sets
– Few or many threads
– Dependencies
– OS overhead

Page 6: Benchmarks for Analyzing Embedded Multicore Processor

Benchmarking Multicore – What’s Important?

• Measuring scalability
• Memory and I/O bandwidth
• Inter-core communications
• OS scheduling support
• Efficiency of synchronization
• System-level functionality

Page 7: Benchmarks for Analyzing Embedded Multicore Processor

EEMBC Multicore Strategy

• Evaluation and future development of scalable SMP architectures
– Includes multicore and manycore
• Measure impact of parallelization and scalability across both data processing and computationally-intensive tasks
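One way to put a number on "scalability" is a speedup and an efficiency figure. The sketch below assumes a common convention that the slides do not define: speedup is the score with N contexts divided by the single-context score, and efficiency is speedup divided by N. The function names and the sample scores are illustrative only and are not EEMBC data.

```python
def speedup(baseline_score: float, score: float) -> float:
    """Speedup relative to the single-context run (higher score = better)."""
    if baseline_score <= 0:
        raise ValueError("baseline score must be positive")
    return score / baseline_score


def efficiency(baseline_score: float, score: float, contexts: int) -> float:
    """Fraction of ideal linear scaling achieved with `contexts` contexts."""
    if contexts < 1:
        raise ValueError("contexts must be >= 1")
    return speedup(baseline_score, score) / contexts


# Hypothetical throughput scores (iterations/s) keyed by context count.
# These are placeholders, NOT measured results.
scores = {1: 100.0, 2: 190.0, 4: 340.0, 8: 300.0}

for n, s in scores.items():
    print(f"{n} contexts: speedup {speedup(scores[1], s):.2f}, "
          f"efficiency {efficiency(scores[1], s, n):.2f}")
```

An efficiency well below 1.0 at a given context count points at one of the shared resources listed earlier (memory bandwidth, synchronization, OS scheduling) rather than at the cores themselves.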

Page 8: Benchmarks for Analyzing Embedded Multicore Processor

Some Results

[Chart: Quad Core. Speedup (y-axis, 0 to 4.5) vs. number of concurrent streams (x-axis: 1, 2, 4, 8). Series: 64M-check-reassembly, 64M-cmykw2, 64M-rotatew2, 64M-tcp-mixed.]

Huge drop in performance when oversubscribed

Nice scaling on networking-only workloads

Some benchmarks plateau earlier than expected

MultiBench Results

Page 9: Benchmarks for Analyzing Embedded Multicore Processor

Two Processor System Utilizing Single Memory Controller

[Diagram: Quad Core Processor 1 and Quad Core Processor 2 connect over the Intel Front Side Bus to the North Bridge, which holds the single DDR2 interface.]

Processors 1 and 2 must always arbitrate for memory via their front side bus connection through the North Bridge.

Page 10: Benchmarks for Analyzing Embedded Multicore Processor

Dual Quads vs. Single Quad

1. Find max values for each workload
2. Ratio of all scores

[Chart: score ratio per workload (y-axis, 0 to 2.5). Workloads on the x-axis: 64M-check-reassembly, 64M-check-reassembly-tcp, 64M-check-reassembly-tcp-cmykw2-rotatew2, 64M-check-reassembly-tcp-h264w2, 64M-cmykw2, 64M-cmykw2-rotatew2, 64M-rotatew2, 64M-tcp-mixed, 64M-x264-1worker, 64M-x264-2workers, 64M-x264-4workers, 64M-x264-8workers, ippktcheck-64M-1Worker, ipres-72M1worker, ipres-72M2worker, md5-32M1worker, md5-32M2worker, md5-32M4worker, rgbcmyk-5x12M1workers, rgbcmyk-5x12M2workers, rgbcmyk-5x12M4workers, rgbcmyk-5x12M8workers, rotate-16x4Ms1w1, rotate-16x4Ms1w2, rotate-16x4Ms1w4, rotate-16x4Ms1w8, rotate-16x4Ms32w1, rotate-16x4Ms32w2, rotate-16x4Ms32w4, rotate-16x4Ms32w8, rotate-16x4Ms4w1, rotate-16x4Ms4w2, rotate-16x4Ms4w4, rotate-16x4Ms4w8, rotate-34kX512-90deg, rotate-color-4M-90deg.]

No Workloads Yield 2x Scaling
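The two steps on this slide (find the max value for each workload, then take the ratio of the scores) can be sketched as follows. This assumes the ratio is the best dual-quad score divided by the best single-quad score, which the slide implies but does not spell out. The function names and all numbers are placeholders, not the data behind the chart.

```python
def best_scores(runs: dict[str, list[float]]) -> dict[str, float]:
    """Step 1: keep the maximum score seen for each workload."""
    return {name: max(values) for name, values in runs.items()}


def score_ratios(dual: dict[str, list[float]],
                 single: dict[str, list[float]]) -> dict[str, float]:
    """Step 2: ratio of best dual-quad score to best single-quad score."""
    best_dual, best_single = best_scores(dual), best_scores(single)
    return {name: best_dual[name] / best_single[name]
            for name in best_dual if name in best_single}


# Placeholder scores per workload, one entry per tested configuration.
# Illustration only; these are NOT the slide's measurements.
single_quad = {"64M-rotatew2": [10.0, 18.0, 20.0],
               "64M-tcp-mixed": [5.0, 9.0, 16.0]}
dual_quad = {"64M-rotatew2": [10.0, 19.0, 24.0],
             "64M-tcp-mixed": [5.0, 10.0, 28.0]}

ratios = score_ratios(dual_quad, single_quad)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

A ratio of 2.0 would mean the second quad-core processor doubled the best achievable score; the slide's conclusion is that no workload reached that.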

Page 11: Benchmarks for Analyzing Embedded Multicore Processor

Two Processor System Utilizing Dual Memory Controllers

[Diagram: Quad Core Processor 1 and Quad Core Processor 2 joined by a link, with two DDR2 interfaces. Labeled access paths: Direct Access, Shared Access, Doubly Shared Access.]

Page 12: Benchmarks for Analyzing Embedded Multicore Processor

Two Processor System Utilizing Single Memory Controller

[Diagram: Quad Core Processor 1 joined by a link to Quad Core Processor 2, which holds the DDR2 interface.]

• Processor 1 must always access memory by traversing link to Processor 2
• Requires arbitration to access Processor 2’s memory
• Processor 2 always has prioritized access to this memory since it is directly attached.
• Affinity can help performance
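As a rough illustration of the affinity point, the sketch below pins a process to the cores that sit next to the memory controller. It assumes a Linux host, where Python exposes `os.sched_setaffinity`; the CPU numbering (4 to 7 for the directly attached processor) and the helper name `pin_to_cpus` are hypothetical and would have to be checked against the real topology.

```python
import os


def pin_to_cpus(preferred: set[int]) -> set[int]:
    """Pin the calling process to `preferred` CPUs where the OS allows it.

    Returns the CPU set actually in effect afterwards.
    """
    if not hasattr(os, "sched_setaffinity"):
        # Non-Linux platform: the stdlib offers no affinity call here,
        # so report all CPUs and change nothing.
        return set(range(os.cpu_count() or 1))
    allowed = os.sched_getaffinity(0)          # CPUs we may run on right now
    target = (preferred & allowed) or allowed  # never request an empty set
    os.sched_setaffinity(0, target)
    return os.sched_getaffinity(0)


# Hypothetical layout: assume CPUs 4-7 belong to the processor that is
# directly attached to memory ("Processor 2" on the slide).
NEAR_MEMORY_CPUS = {4, 5, 6, 7}

if __name__ == "__main__":
    print("running on CPUs:", sorted(pin_to_cpus(NEAR_MEMORY_CPUS)))
```

Whether pinning helps depends on the workload; memory-bound workers are the likeliest to benefit from staying on the directly attached processor.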

Page 13: Benchmarks for Analyzing Embedded Multicore Processor

[Two charts, titled "Dual Quad Cores – Single Memory Controller" and "Dual Quad Cores – Dual Memory Controllers". Each: y-axis 0 to 4.5; x-axis 1, 2, 3, 4, 6, 8, 12, 16, 20; series 64M-cmykw2, 64M-cmykw2-rotatew2, 64M-rotatew2.]

Page 14: Benchmarks for Analyzing Embedded Multicore Processor

[Chart: y-axis 0 to 4; x-axis 1 to 9. Series: rotate-color-4M-90deg (DS), rotate-color-4M-90deg (DD).]

• Rotation by multiple workers cooperating to process a single image.
• Each worker thread acquires slices and writes them to the output buffer.
• Potential bottlenecks related to memory interfaces and synchronization between worker threads.
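The worker pattern described above can be sketched in a few lines. This is not the MultiBench kernel; it is a minimal stand-in that assumes a row-major image held as a list of lists, a 90-degree clockwise rotation, and a lock-protected "next slice" counter as the only synchronization point. The names `rotate90_parallel` and `slice_rows` are invented for the example.

```python
import threading


def rotate90_parallel(image, workers=4, slice_rows=2):
    """Rotate a row-major 2-D list 90 degrees clockwise using worker threads.

    Workers repeatedly acquire the next unprocessed band of source rows
    (a "slice") and write the rotated pixels into a shared output buffer.
    """
    rows, cols = len(image), len(image[0])
    out = [[None] * rows for _ in range(cols)]  # output is cols x rows
    next_row = 0
    lock = threading.Lock()                     # guards next_row only

    def worker():
        nonlocal next_row
        while True:
            with lock:                          # acquire a slice
                start = next_row
                next_row = min(rows, start + slice_rows)
                end = next_row
            if start >= rows:
                return                          # no slices left
            for r in range(start, end):
                for c in range(cols):
                    # Clockwise: source (r, c) lands at (c, rows - 1 - r).
                    out[c][rows - 1 - r] = image[r][c]

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out


image = [[1, 2, 3],
         [4, 5, 6]]
print(rotate90_parallel(image))  # [[4, 1], [5, 2], [6, 3]]
```

Each slice writes to a disjoint region of the output, so only slice acquisition needs the lock. In CPython the global interpreter lock prevents a real speedup here; the sketch shows the structure, and the two bottlenecks named on the slide map onto the lock (synchronization) and the writes to `out` (memory interface).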

Page 15: Benchmarks for Analyzing Embedded Multicore Processor

[Two charts, titled "Dual Quad Cores – Single Memory Controller" and "Dual Quad Cores – Dual Memory Controllers". Each: y-axis 0 to 8; x-axis 1 to 9; series rotate-16x4Ms32w1, rotate-16x4Ms32w2, rotate-16x4Ms32w4, rotate-16x4Ms32w8.]

Page 16: Benchmarks for Analyzing Embedded Multicore Processor

The Multicore Association Roadmap

Communications
– MCAPI: ultra-lightweight

Resource Management
– Memory management
– Basic synchronization
– Resource registration
– Resource partitioning

Task Management
– Task scheduling

The Four MCA Pillars

[Diagram labels: Virtualization (or OS); Communication; Resource Management; Task Management; Debug; Multicore System; Adopted stds; MCA Foundation APIs]

Value Added Functions
• Languages
• Programming Models
• Design Environments
• Application Generators
• Benchmarks

Services
• Load Balancing
• System Mgt.
• Power Mgt.
• Reliability
• Quality of Service

Page 17: Benchmarks for Analyzing Embedded Multicore Processor

Why Virtualize?

• Hardware consolidation
• Migration and hosting of legacy applications
• Resource management and balancing
• Faster provisioning
• Fault tolerance

Page 18: Benchmarks for Analyzing Embedded Multicore Processor

Virtualized Multicore System

Fractional CPU assignment for background tasks such as firmware update/system health monitor/power management

Benchmarking involves many system-level elements

[Diagram labels: HARDWARE PERIPHERALS; ARM CORE #1 to #4; TRANGO #1 to #4; VPU A to VPU G; SMP; SMP RTOS; App (multiple instances); Proprietary Application; Firmware Update]

Page 19: Benchmarks for Analyzing Embedded Multicore Processor

EEMBC Hypermark

Important Metrics

• Overall performance overhead of a hypervisor, i.e. CPU loading
• Static footprint (code and data size)
• Jitter
• Interrupt latency
• Comparison to native performance
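Several of these metrics reduce to simple arithmetic once the raw measurements exist. The sketch below is not Hypermark; it only shows one plausible way to derive an overhead figure and a latency summary. It assumes overhead is the score lost relative to native, and it takes jitter as the spread between the worst and best latency sample, which is one of several definitions in use. All numbers are made up.

```python
import statistics


def overhead_percent(native_score: float, virtualized_score: float) -> float:
    """Performance lost under the hypervisor, as a percentage of native."""
    return 100.0 * (native_score - virtualized_score) / native_score


def latency_summary(samples_us: list[float]) -> dict[str, float]:
    """Summarize interrupt-latency samples (microseconds)."""
    return {
        "min": min(samples_us),
        "max": max(samples_us),
        "mean": statistics.fmean(samples_us),
        # Jitter taken here as worst-case spread; other definitions exist.
        "jitter": max(samples_us) - min(samples_us),
    }


# Made-up numbers for illustration; not Hypermark output.
native, virtualized = 1000.0, 940.0
latencies_us = [4.0, 5.0, 4.5, 9.0, 4.2]

print(f"overhead: {overhead_percent(native, virtualized):.1f}%")
print(latency_summary(latencies_us))
```

Running the same workload natively and under the hypervisor, with everything else held constant, is what makes the "comparison to native performance" bullet meaningful.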

Page 20: Benchmarks for Analyzing Embedded Multicore Processor


Multicore Programming Practices (MPP)

• Long term
– Continue research into languages, methodologies, etc.
• Short term
– How today’s embedded C/C++ code may be written to be “multicore ready”
• Influence of a group of like-minded methodology experts to ensure completeness, usefulness and industry-wide compatibility
• Creation of a standard “best practices” guide through a recognized, neutral industry body
– Based on capturing current best practices

Page 21: Benchmarks for Analyzing Embedded Multicore Processor

MPP Scope & Approach


[Diagram: Sequential C/C++ targeting three architecture classes]

Architecture 1, e.g. Shared Memory

Architecture 2, e.g. Programmer Managed Memory

Architecture 3, e.g. Message Passing

• Focus on existing C/C++ without extensions, targeting current architectures

• Draw up framework of common pitfalls when transitioning from serial to parallel

• Consider solutions or avoidance tactics

• Analyze performance of solution
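A typical entry in such a framework of pitfalls is the lost update on shared state when serial code is split across threads on a shared-memory architecture. The sketch below is an illustration chosen for this transcript, not material from the MPP guide. It is written in Python for brevity, although the MPP effort targets C/C++, where the same avoidance tactic would be a mutex or an atomic. The names `Counter` and `count_in_parallel` are invented.

```python
import threading


class Counter:
    """Shared counter whose increment is made atomic with a lock."""

    def __init__(self) -> None:
        self.value = 0
        self._lock = threading.Lock()

    def increment(self) -> None:
        # Without the lock, "read, add, write" from two threads can
        # interleave and lose updates (a classic serial-to-parallel pitfall).
        with self._lock:
            self.value += 1


def count_in_parallel(threads: int, increments: int) -> int:
    """Have `threads` workers each increment one shared counter."""
    counter = Counter()

    def work() -> None:
        for _ in range(increments):
            counter.increment()

    pool = [threading.Thread(target=work) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter.value


print(count_in_parallel(threads=4, increments=10_000))  # 40000
```

The last bullet applies here too: the lock fixes correctness but serializes every increment, so a per-thread partial sum combined at the end would usually perform better.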

Page 22: Benchmarks for Analyzing Embedded Multicore Processor

Summary

• Performance analysis highlights benefits and pitfalls

• EEMBC licensing and membership

• Portability prevents entrapment

• Multicore Association standards and working groups