power optimization through manycore multiprocessing

Power Optimization Through Many-Core Multiprocessing: Delivering High Performance in a Low Power World. ChipEx2012. Haydn Povey, Marketing Director – Implementation & Security, ARM Processor Division. May 2, 2012

Upload: chiportal

Post on 13-Dec-2014

DESCRIPTION

John Goodacre, ARM

TRANSCRIPT

Page 1: Power Optimization Through Manycore Multiprocessing

Power Optimization Through Many-Core Multiprocessing

Delivering High Performance in a Low Power World

ChipEx2012

Haydn Povey

Marketing Director – Implementation & Security

ARM Processor Division

May 2, 2012

Page 2: Power Optimization Through Manycore Multiprocessing

Billions of Connected Devices

Performance expectations continue to increase exponentially, but power efficiency and scalability are becoming formidable challenges

Form Factor                2015 TAM (millions)
Mobile Phones              1,750
Media Players              300
Mobile Computers           750
Desktop PCs                150
Digital TV/STB             500
Automotive Infotainment    100
Other*                     450
Total                      4,000 (4 billion)

*Includes PND, photo-frames, etc

ABI Research, IDC, Gartner and ARM forecasts

Page 3: Power Optimization Through Manycore Multiprocessing

Historic Technology Drivers

Era            Platform            Driving metric
Up to 1980s    Mainframes/mini     Functionality
1990s          The PC              Functionality / $
2000s          Notebooks           Functionality / (Power × $)
2010s          Mobile Computing    Functionality / (Energy × $)

Page 4: Power Optimization Through Manycore Multiprocessing

Limitations with Multiprocessing

Cost of offering the peak single thread performance on each CPU quickly exceeds chassis thermal limits

System and software bottlenecks limit overall scalability

Single die integration offered some roadmap

Page 5: Power Optimization Through Manycore Multiprocessing

Evolution to Many-Core

Base theorem: Simpler and smaller processor designs require exponentially less energy to accomplish the same amount of compute as a more complex and larger processor design.

“Approximate rule of thumb”: To increase performance 50% you double the power and area cost of the processor design

Quickly reaches point of diminishing returns
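The rule of thumb above can be turned into a small worked example. This is only a sketch: it assumes the rule holds as a smooth power law (performance grows as power raised to log2(1.5)) and that the workload splits perfectly across cores, neither of which the slide claims. The function name is illustrative.

```python
import math

# Exponent implied by the rule of thumb: 2x power/area -> 1.5x performance.
# Treating the rule as a continuous power law is an assumption.
EXPONENT = math.log2(1.5)  # about 0.585


def relative_performance(relative_power):
    """Single-core performance for a core that costs `relative_power` units."""
    return relative_power ** EXPONENT


# Same total budget of 4 power units, spent two different ways.
one_big_core = relative_performance(4.0)           # 1.5 * 1.5 = 2.25
four_small_cores = 4 * relative_performance(1.0)   # 4.0, if work is perfectly parallel

print(f"one big core:     {one_big_core:.2f}")
print(f"four small cores: {four_small_cores:.2f}")
```

Under these assumptions the four small cores deliver more aggregate throughput for the same budget, while the single big core is still 2.25 times faster on any one thread.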

Page 6: Power Optimization Through Manycore Multiprocessing

Challenge of Many-Core

Many-core definition: Use ‘lots’ of smaller, more efficient processors to achieve a higher aggregate performance than can be reached through multiprocessing

Smaller processors are not capable of executing the same single thread as a higher performance processor in the same time, so they can’t execute existing applications effectively

Many threads cannot easily be decomposed into simpler, smaller tasks so as to benefit from multiprocessing on the smaller processors

Software development challenge
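One way to see why decomposition matters is Amdahl's law, which is not named on the slide but captures the same point. The sketch below compares a workload's runtime on several small cores against one big core. The small-core speed of 1/2.25 is an assumption taken from the 50%-per-doubling rule of thumb (a core with a quarter of the power budget), and the function name is illustrative.

```python
def runtime_on_small_cores(serial_fraction, n_cores, small_core_speed):
    """Runtime relative to one big core (whose runtime is defined as 1.0).

    serial_fraction: share of the work that cannot be split across cores.
    small_core_speed: single-thread speed of a small core relative to the big core.
    """
    parallel_fraction = 1.0 - serial_fraction
    return (serial_fraction + parallel_fraction / n_cores) / small_core_speed


SMALL_CORE_SPEED = 1 / 2.25  # assumption: quarter-power core under the rule of thumb

for serial in (0.0, 0.25, 0.5, 1.0):
    t = runtime_on_small_cores(serial, n_cores=4, small_core_speed=SMALL_CORE_SPEED)
    print(f"serial fraction {serial:.2f}: runtime {t:.2f}x the big core")
```

With these numbers, four small cores win only when roughly a quarter or less of the work is serial; a fully serial thread takes 2.25 times as long as on the big core.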

Page 7: Power Optimization Through Manycore Multiprocessing

Software Data Decomposition

Split large quantity of DATA into smaller chunks that can be operated on in parallel

Each data item is independent

[Figure: TASK blocks spread across eight CPUs]
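Data decomposition can be sketched in a few lines. The example below is a minimal illustration, not code from the talk: `brighten` is a hypothetical per-item operation, and a thread pool stands in for the eight CPUs (in CPython a process pool would be needed for real CPU-bound speedup).

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, n_chunks):
    """Split data into n_chunks contiguous slices of near-equal size."""
    size, extra = divmod(len(data), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def brighten(chunk):
    """Hypothetical work: each item is processed independently of the others."""
    return [min(255, pixel + 16) for pixel in chunk]


def process_in_parallel(data, n_workers=8):
    """Apply the same operation to every chunk, one chunk per worker."""
    chunks = split_into_chunks(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(brighten, chunks)  # map() preserves chunk order
        return [pixel for chunk in results for pixel in chunk]
```

Because no item depends on another, the chunks can be processed in any order and on any worker; only the final reassembly needs the original ordering.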

Page 8: Power Optimization Through Manycore Multiprocessing

Software Task Decomposition

Functionally independent tasks can be executed concurrently

Each task item is functionally independent

[Figure: independent TASK blocks scheduled across eight CPUs]
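Task decomposition differs from data decomposition in that each worker runs a different function. A minimal sketch follows; the three task functions are hypothetical placeholders, and a thread pool again stands in for the CPUs.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical, functionally independent tasks: none needs another's result.
def decode_audio():
    return "audio decoded"


def sync_mail():
    return "mail synced"


def log_gps():
    return "gps logged"


def run_independent_tasks(tasks, n_workers=8):
    """Submit every task at once and collect the results by task name."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = {task.__name__: pool.submit(task) for task in tasks}
        return {name: future.result() for name, future in futures.items()}


results = run_independent_tasks([decode_audio, sync_mail, log_gps])
```

The scheduler is free to place each task on whichever worker is idle, because the tasks share no data and have no ordering constraints.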

Page 9: Power Optimization Through Manycore Multiprocessing

Example: Real Time Video Encoding

Functional Block Partitioning

Functional blocks are serially dependent

But temporally independent

Distribute different functional blocks across available processors

Split into defined functional threads

Uses passing of data blocks between threads to allocate work

Requires code changes and fine tuning

(Simplified MPEG encoding functional block diagram): Analogue Video Sampling, Remove Inter-Frame Redundancy, Remove Intra-Frame Redundancy, Quantise Samples, Run-Length Compress, Buffer Store, Motion Compensation

[Figure: functional blocks running on CPU0, CPU1, CPU2 and CPU3 over TIME]
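The functional-block approach maps naturally onto a pipeline of threads joined by queues. The sketch below shows the pattern with two simplified stand-in stages named after blocks in the diagram; the stage bodies are illustrative assumptions, not a real MPEG encoder.

```python
import queue
import threading

STOP = object()  # sentinel passed down the pipeline to shut each stage down


def stage(work, inbox, outbox):
    """One functional block: take a data block, process it, pass it on."""
    while True:
        block = inbox.get()
        if block is STOP:
            outbox.put(STOP)
            return
        outbox.put(work(block))


def run_pipeline(frames, stages):
    """Run each stage in its own thread, connected by FIFO queues."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=stage, args=(work, queues[i], queues[i + 1]))
        for i, work in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for frame in frames:
        queues[0].put(frame)
    queues[0].put(STOP)
    results = []
    while (item := queues[-1].get()) is not STOP:
        results.append(item)
    for t in threads:
        t.join()
    return results


def quantise(samples):
    """Stand-in for 'Quantise Samples': drop the two low-order bits."""
    return [s // 4 for s in samples]


def run_length_compress(samples):
    """Stand-in for 'Run-Length Compress': emit (value, run length) pairs."""
    runs = []
    for s in samples:
        if runs and runs[-1][0] == s:
            runs[-1][1] += 1
        else:
            runs.append([s, 1])
    return [tuple(r) for r in runs]


encoded = run_pipeline([[8, 8, 8, 9], [0, 4, 4, 4]], [quantise, run_length_compress])
```

Each frame must pass through the stages in order (serial dependence), but while one stage compresses frame N the previous stage can already quantise frame N+1 (temporal independence).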

Page 10: Power Optimization Through Manycore Multiprocessing

Strategy Focus: The Thermal Wall

SOC sustained power is limited in mobile devices by thermals:

1.5W to 2W with low-cost POP and stacked memories

3W without stacked memories

[Figure: Power vs. Time. Labels: Un-managed Max Power (@Tjmax); Burst for responsiveness (e.g. Browsing); Sustained performance (e.g. HD Video Record, Gaming); Power Optimised Low End (e.g. e-Mail, Voice, MP3); T >= Tjmax, Tskin; Managed Sustained Power; Tj >= Tmax; Tj < Tmax; “Opportunistic Residency”]

Responsiveness is a must

Complex active management is needed

Page 11: Power Optimization Through Manycore Multiprocessing

Applying Nominal Use Case

Typical Day for Smartphone User

90 min voice calling

60 min email / social networking

30 min reading web

50 min Angry Birds / other gaming

90 min jogging while listening to music and logging GPS co-ordinates

10 min video recording

7 hrs sleep with music alarm clock

OS typically executing ~28 active processes

Apps synching in background

Page 12: Power Optimization Through Manycore Multiprocessing

Use Case Measurements

Page 13: Power Optimization Through Manycore Multiprocessing

Use Case Conclusion

Profiled CPU States

CPU state     Minutes    % of CPU active time
Deep Sleep    1186       n/a
200 MHz       154        60%
500 MHz       69         27%
800 MHz       18         7%
1000 MHz      4          2%
1200 MHz      10         4%

If the phone was ARM big.LITTLE™ enabled...

Active CPU time

12% big

88% LITTLE
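The percentages in the table follow directly from the minutes column, as the short calculation below shows. The 800 MHz cut-off between "LITTLE" and "big" work is an assumption made here for illustration; the slide does not state how the 12% / 88% split was derived.

```python
# Minutes spent at each CPU frequency (MHz), copied from the slide.
minutes_at_mhz = {200: 154, 500: 69, 800: 18, 1000: 4, 1200: 10}

active_minutes = sum(minutes_at_mhz.values())  # 255 minutes awake and running

# Share of active time per frequency, rounded to whole percent.
share_percent = {
    mhz: round(100 * minutes / active_minutes)
    for mhz, minutes in minutes_at_mhz.items()
}

# Assumption: anything at or above this frequency would run on the "big" core.
BIG_THRESHOLD_MHZ = 800
big_minutes = sum(m for mhz, m in minutes_at_mhz.items() if mhz >= BIG_THRESHOLD_MHZ)
big_share = round(100 * big_minutes / active_minutes, 1)
little_share = round(100 - big_share, 1)

print(share_percent)
print(f"big: {big_share}%  LITTLE: {little_share}%")
```

With that cut-off the split comes out at about 12.5% big and 87.5% LITTLE, close to the figures quoted on the slide.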

Page 14: Power Optimization Through Manycore Multiprocessing

big.LITTLE Processing

Multiprocessing capable, many-core benefits

Page 15: Power Optimization Through Manycore Multiprocessing

“big” Processor – Cortex-A15

ARM Cortex™-A15 Processor

3.5+ DMIPS/MHz

1-4 core MPCore™ configurable

Advanced Capabilities

Full ARMv7A architecture

Thumb®-2, TrustZone®, VFP, NEON™

Virtualization, large address extensions

AMBA® 4 ACE™ coherency

High Performance

Targeting 1.5GHz mobile implementation on 28nm

Hard Macro Quad-core Implementation @ 2GHz on 28HPM process

Page 16: Power Optimization Through Manycore Multiprocessing

“LITTLE” Processor – Cortex-A7

ARM Cortex-A7 Processor

“LITTLE” to Cortex-A15 “big”

1-4 core MPCore configurable

Same Architectural Capabilities

Full ARMv7A architecture

Thumb-2, TrustZone, VFP, NEON

Virtualization, large address extensions

AMBA 4 ACE Coherency

ISA identical to Cortex-A15 processor

High Performance

Up to 1.2GHz for mobile implementation on 28nm

Page 17: Power Optimization Through Manycore Multiprocessing

Comparison of big.LITTLE Pipelines

Page 18: Power Optimization Through Manycore Multiprocessing

Performance Comparison

Page 19: Power Optimization Through Manycore Multiprocessing

Power Efficiency Comparison

Page 20: Power Optimization Through Manycore Multiprocessing

Software Use Models

big.LITTLE Task Migration – One CPU active

Migrate between Cortex-A15 and Cortex-A7 depending on performance requirements

big.LITTLE MP – Both CPUs can be active

Allocate threads that need high performance to Cortex-A15

Allocate threads that don’t require high performance to Cortex-A7 for best energy efficiency

AMBA 4 hardware coherency between Cortex-A15 and Cortex-A7
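The two use models can be contrasted with a small sketch. Everything here is hypothetical: the function names, the load metric (0.0 to 1.0) and the thresholds are assumptions chosen for illustration, not ARM's switcher or scheduler implementation.

```python
def choose_cluster(current, load, up_threshold=0.85, down_threshold=0.30):
    """Task migration model: only one cluster is active at a time.

    Uses two thresholds (hysteresis) so the system does not bounce between
    clusters when the load hovers near a single cut-off. Values are assumed.
    """
    if current == "LITTLE" and load > up_threshold:
        return "big"
    if current == "big" and load < down_threshold:
        return "LITTLE"
    return current


def allocate_threads(thread_demand, big_threshold=0.6):
    """MP model: both clusters active, each thread placed by its own demand."""
    return {
        name: "big" if demand >= big_threshold else "LITTLE"
        for name, demand in thread_demand.items()
    }


placement = allocate_threads({"game_render": 0.9, "mail_sync": 0.1, "mp3_decode": 0.2})
```

In the migration model the whole workload moves together, while in the MP model a demanding thread can run on Cortex-A15 at the same time as background threads stay on Cortex-A7.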

Page 21: Power Optimization Through Manycore Multiprocessing

Task Migration Mechanics

Page 22: Power Optimization Through Manycore Multiprocessing

CCI-400 Cache Coherent Interconnect

CCI-400 2+3 (x3)

2 full AMBA 4 ACE slave interfaces

+3 ACE-Lite I/O coherent slave interfaces

x3 master interfaces

CCI interfaces:

AMBA 4 ACE and ACE-Lite manage all coherency, shareability and barriers

AMBA 4 compliant, 128-bit single layer at up to ½ Cortex-A15 frequency

[Figure: block diagram of the CoreLink™ CCI-400 Cache Coherent Interconnect (128 bit @ up to 0.5 Cortex-A15 frequency). Labeled blocks: Quad Cortex-A15, Quad Cortex-A7, Mali-T604 Graphics, coherent I/O device, LCD, DMA, ADB-400, MMU-400, NIC-400 (configurable: AXI4/AXI3/AHB/APB and AXI4/AXI3/AHB), GIC-400, DMC-400, DDR3/2 and LPDDR2/3 PHYs, other slaves. Interface labels: ACE, ACE-Lite, ACE-Lite + DVM, AXI4, 128b]
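As a rough illustration of what "128-bit at up to ½ Cortex-A15 frequency" implies, the arithmetic below estimates an upper bound on data rate for a single interface. It assumes one full-width data beat per interconnect clock in one direction, which is a simplification and not a figure from the talk; the CPU frequencies are the 1.5GHz and 2GHz targets quoted for Cortex-A15 earlier in the deck.

```python
def peak_bytes_per_second(bus_width_bits, cpu_hz, clock_ratio=0.5):
    """Upper bound for one interface, assuming one data beat per interconnect clock."""
    bytes_per_beat = bus_width_bits // 8
    return bytes_per_beat * cpu_hz * clock_ratio


mobile = peak_bytes_per_second(128, 1.5e9)      # Cortex-A15 at 1.5 GHz
hard_macro = peak_bytes_per_second(128, 2.0e9)  # Cortex-A15 at 2 GHz

print(f"1.5 GHz CPU: {mobile / 1e9:.0f} GB/s per interface")      # prints 12 GB/s
print(f"2.0 GHz CPU: {hard_macro / 1e9:.0f} GB/s per interface")  # prints 16 GB/s
```

Sustained throughput in a real system would be lower, since it depends on memory latency, arbitration and coherency traffic.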

Page 23: Power Optimization Through Manycore Multiprocessing

Summary

Multiprocessing enables today’s applications to scale while maintaining single thread performance

Nicely addresses the multi-tasking of stacked usage scenarios

Many-core brings the energy advantages of simpler and smaller processors, but with the challenges of software complexity and a lack of backwards compatibility with respect to single thread performance

big.LITTLE processing, as delivered by the ARM Cortex-A15 and Cortex-A7, offers both the performance and compatibility advantages of multiprocessing along with the power efficiency and scalability advantages of many-core processing