TRANSCRIPT
Embedded Computing Systems for Signal Processing Applications Part 1: Introduction November 7th 2014 Eric Debes
2
What is this about?
Introduction to power/performance tradeoffs and system architecture Overview of existing processor and system architectures Consumer vs. Industrial/Embedded
Why do we care? Engineering added value is in complex and critical system architecture Need to know different components available Software/Hardware System Architecture and Modelling Power/Performance/Price Tradeoffs
Introduction
3
1. Introduction
2. General-Purpose Processors and Parallelism
3. Application Specific Processors: DSPs, FPGAs, accelerators, SoCs
4. PC Architecture vs. Embedded System Architecture
5. Hard Real-time Systems and RTOS
6. Power Constraints
7. Critical and Complex Systems, MDE, MDA
Outline
4
Embedded
Size and thermal constraints Sometimes battery life (energy) constraints
Real-time Time constraints Can be hard real-time Or soft-real time
Systems Typically include multiple components Require different areas of expertise:
Signal Processing, computer vision, machine learning/Cognition and other algorithmic expertise Software Architecture Hardware/Computing Architecture Thermal and mechanical engineering
Embedded Real-time Systems
5
Consumer: DVD/video players, set-top boxes, Playstation, printers, disk drives, GPS, cameras, mp3 players Communications: cellphones, Mobile Internet Devices, netbooks, PDAs with WiFi, GSM/3G, WiMax, GPS, cameras, music/video Automotive: driving innovation for many embedded applications, e.g. sensors, buses, infotainment Industrial applications: process control, instrumentation Other niche markets: video surveillance, satellites, airplanes, sonars, radars, military applications
Application Examples
6
Texec = NI * CPI * Tc
NI = Number of Instructions CPI = Cycles Per Instruction Tc = Cycle Time
Texec = NI / (IPC * F)
IPC = Instructions Per Cycle F = Frequency
Performance improves with Silicon manufacturing technology
Microarchitecture improvements
Higher frequencies with deeper pipelines Higher IPC through parallelism
Performance
7
Performance
[Figure: clock frequency (MHz, log scale 10-10,000) vs. year (1987-2005) for Intel (386, 486, Pentium, Pentium Pro, Pentium II), PowerPC (601, 603, 604, 604+, MPC750) and Alpha (21064, 21064A, 21164, 21164A, 21264, 21264S) processors, with clock period per average gate delay on a secondary axis (1-100). Processor frequency scales by 2x per generation.]
8
Dynamic Power = α x C x V² x f
α = activity factor C = capacitance V = voltage f = frequency
Power = Pdyn + Pstatic Power is limited by
maximum current (Voltage regulator limitation) Thermal constraints
Power
9
Power
[Figure: maximum power (W, log scale) vs. process generation (1.5 down to 0.18 microns) for i386, i486, Pentium, Pentium MMX, Pentium Pro and Pentium II; surface power density (W/cm², log scale 1-1000) for i386 through Pentium III approaches that of a hot plate and trends toward nuclear-reactor and rocket-nozzle levels.]
Power density
10
ASIC
High-performance Dedicated to one specific application Not programmable
Processor Programmable General-purpose
Reconfigurable Architecture Good compromise between programmability and performance
Processor Architecture Spectrum
Microprocessor Reconfigurable ASIC
11
Processor Architecture Spectrum
12
Processor Architecture Spectrum
13
What are the key components in a Computing System?
Processor with Arithmetic and Logic Units Register File Caches or local memory
Memory Buses/Interconnect I/O Devices
Key Components of a Computing System
Part 2: General-purpose Processors and Parallelism November 7th 2014 Eric Debes
Embedded Computing Systems for Signal Processing Applications
15
Laundry Example Ann, Brian, Cathy, Dave
each have one load of clothes to wash, dry, and fold
Washer takes 30 minutes Dryer takes 40 minutes
Pipelining: It's Natural!
16
Sequential laundry takes 6 hours for 4 loads If they learned pipelining, how long would laundry take?
Sequential Laundry
[Figure: Gantt chart, 6 PM to midnight — loads A, B, C, D each take 30 min wash + 40 min dry + 20 min fold, run back-to-back for 6 hours total.]
17
Pipelining doesn't help the latency of a single task, but it helps the throughput of the entire workload Pipeline rate is limited by the slowest pipeline stage Multiple tasks operate simultaneously Potential speedup = number of pipe stages Unbalanced lengths of pipe stages reduce the speedup
Pipelining Lessons
[Figure: pipelined laundry Gantt chart, 6 PM onwards — stage times 30, 40, 40, 40, 40, 20 minutes; the four loads A-D overlap and finish in 3.5 hours.]
18
Process technology provides more transistors for advanced architectures These deliver higher peak perf But lower power efficiency
Performance = Frequency x Instructions per Clock Cycle
Power = Switching Activity x Dynamic Capacitance x Voltage² x Frequency
History: How did we increase Perf in the Past?
[Figure: bar charts for Pipelined, Superscalar, OOO-Speculative and Deep-Pipeline microarchitectures — each step increases area (up to ~4x) for a smaller gain in performance, while power grows faster than performance, so power efficiency (MIPS/W) drops.]
19
In many systems today power is the limiting factor and will drive most of the architecture decisions New Goal: optimize performance in a given power envelope
Why Multi-Cores?
20
Dual Core
Rule of thumb (in the same process technology): a 1% voltage reduction allows a ~1% frequency reduction, saves ~3% power, and costs ~0.66% performance.

Single core: Voltage = 1, Freq = 1, Area = 1, Power = 1, Perf = 1
Dual core (voltage and frequency each reduced by 15%): Voltage = 0.85, Freq = 0.85, Area = 2, Power = ~1, Perf = ~1.8

How to maximize performance in the same power envelope?
Power = Dynamic Capacitance x Voltage² x Frequency
21
Multicore
[Figure: normalized power/performance — a large core with cache: power 4, performance 2; a small core: power 1, performance 1 (1/4 the power for 1/2 the performance); four small cores C1-C4 sharing a cache: power 4, performance 4. Multi-core is power efficient and allows better power and thermal management.]
22
Era of Parallelism
23
Thermal is the main limiting factor in future designs (not size) Move away from frequency alone to deliver performance Challenges in scaling: need to exploit thread-level parallelism to efficiently use the transistors available thanks to Moore's Law
Power/performance tradeoffs dictate architectural choices Multi-everywhere
Multi-threading Chip level multi-processing
Throughput oriented designs
Summary: Why Multi-Cores?
24
Processors are designed to address the needs of the mass market.
Mobile applications: low power and good power management are top priorities to enable thinner systems and longer battery life
Office, image, video: single-threaded perf matters, plus some level of multithreaded perf -> Multi-core
RMS (Recognition, Mining, Synthesis) applications and model-based computing: massively parallel apps, good scaling on a large number of cores -> Many-core
Because of the large markets in each of the classes above, they are the focus of silicon manufacturers and are driving innovation in the semiconductor market
Application-driven Architecture Design
25
RMS Scaling on a Many-Core Simulator
[Figure: speed-up vs. number of cores (0-64) on a many-core simulator for RMS workloads — Gauss-Seidel, sparse/dense matrix, Kmeans, SVD, SVM classification, cloth simulation (AVDF/AVIF/US), face detection, SepiaTone (Core Image), text indexing, CFD, ray tracing, FB estimation, body tracking, portfolio management, game physics — most scale near-linearly up to 64 cores.]
Data from Intel Application Research Lab
26
Low-power architecture and SoCs
ARM based LPIA/Atom based
Multi-core Core microarchitecture PowerPC
Many-core GP GPU Larrabee
3 Classes of Applications 3 Types of Processors
27
Examples of ARM-based low power architectures and SoCs: TI OMAP, Nvidia Tegra, Apple A4/A5, Samsung Exynos
Low-power Architecture and SoCs
28
Towards PC on a chip
Same Intel Core (e.g. Bay Trail) for Tablets/Smartphones, Consumer Electronic Devices and Embedded Market
29
Multi-core IBM Power4 IBM Cell Intel Ivy Bridge
Multicore
30
Tick-Tock model
Modular design to decrease cost (design, test, validation)
Integrate graphics on chip
Intel Roadmap for Intel Core Microarchitecture
31
Binning for leakage distribution and performance P = α x C x V² x f + leakage
Turbo mode to optimize performance under a given power envelope Policy to balance thermal budget between general purpose cores, and between GPP cores and graphics Next: Maximize performance under a given thermal envelope at the platform level
Power/Performance Tradeoffs
32
GP GPU: NVidia GeForce more than 2000 PEs
33
GPUs do not need large caches because the large number of threads hides memory latency: the chip is designed to tolerate DRAM latency through a huge number of threads. Local memories are still present to limit bandwidth to GDDR. CPUs need large multi-level caches because the data needs to be close to the execution units. The fast-growing video game industry exerts strong economic pressure that forces constant innovation.
CPUs vs. GPUs
34
For a given application, the processor architecture should be chosen depending on performance/power efficiency MIPS/Watt or Gflops/Watt Energy efficiency (Energy-Delay Product)
This is highly dependent on the application and targeted power envelope. Examples: ARM and Atom are optimized for mainstream office and media apps in a power envelope between 1W and <10W The Core microarchitecture is optimized for high-end office and media apps in a power envelope between 15W and ~75W GPUs are optimized for graphics applications and some selected scientific applications between 10W and more than 400W
Performance/Power for different architectures
35
Processor will integrate - Big core for single thread perf - Small core for multithreaded perf - some dedicated hardware units for
- graphics - media - encryption - networking function - other function specific logic
Systems will be heterogeneous Processor core will be connected to - one or multiple many-core cards - and dedicated function hw in the chipset + reconfigurable logic in the system or on chip?
Future: PC on a Chip
[Diagram: two 16-core IA tiles, each with Gfx/Media, PCI-Express links and memory channels — one on the main board, one as a high-end add-in card — connected through a graphics/chipset hub (GCH) to two big IA cores.]
Part 3: App Specific Proc: DSPs, FPGAs, Accelerators, SoCs November 7th 2014 Eric Debes
Embedded Computing Systems for Signal Processing Applications
37
What are application specific processors? Processors or System-on-chip targeting a specific (class of) application(s)
Very common for Audio: MP3, AAC coding and decoding in audio players Image: JPEG or JPEG2000 coding and decoding, e.g. Digital cameras Video: MPEG, H264 coding and decoding, e.g. DVD players or set-top-boxes Encryption: RSA, AES Communication: GSM, 3G in cellphones
Why?
Large markets can justify the development of application specific processors Dedicated circuits provide higher performance with lower power dissipation, better battery life and very often lower cost.
Application Specific Processors
38
Application Specific Signal Processor Spectrum
39
DSPs Dedicated ASICs FPGAs Accelerators as coprocessors ISA extensions SoCs
Different Types of ASPs
40
Summary of Architectural Features of DSPs
Data path configured for DSP Fixed-point arithmetic MAC - multiply-accumulate
Multiple memory banks and buses - Harvard architecture: separate data and instruction memories Multiple data memories
Specialized addressing modes Bit-reversed addressing Circular buffers
Specialized instruction set and execution control Zero-overhead loops Support for MAC
Specialized peripherals for DSP
41
DSP Example: 320C62x/67x DSP
42
Many dedicated ASICs exist on the market, especially for media and communication applications. Example:
MP3 player DVD player Video processing engines, e.g. De-interlacing, super-resolution Video Encoder/Decoder GSM/3G TCP/IP Offload engine
Advantages: Low power, high perf/power efficiency Small area compared to same functionality in DSP or GPP
Drawbacks Cost of designing ASICs requires large volume Not flexible: cannot handle different applications, cannot evolve to follow standard evolution
Dedicated ASICs
43
Reconfigurable architectures FPGAs contain gates that can be programmed for a specific application
Each logic element outputs one data bit
Interconnect programmable between elements FPGAs can be reconfigured to target a different function by loading another configuration
44
Specifications Input: RTL coding, structural or behavioral description
RTL Simulation Functional simulation to check logic and data flow (no timing analysis)
Synthesis Translation into specific hardware primitives Optimization to meet area and performance constraints
Place and Route Map hardware primitives to specific places on the chip based on area and performance for the given technology Specify routing
Timing Analysis Verification that timing specifications are met
Test and Verification of the component on the FPGA board
FPGAs Design Flow
45
Vivado HLS Design flow: from C to VHDL
46
Vivado HLS Development Flow
1. C/C++ programming
2. C simulation: algorithm validation
3. Optimization directives insertion: pipeline, unroll, merge (loops); array partitioning; interfaces; dataflow
4. Synthesis
5. Results OK? (perf/resources) - if no, go back to directives insertion; if yes, continue
6. Cosimulation: RTL design validation
7. IP generation
47
Vivado HLS User Interface
Project explorer Directives insertion Code editor
Synthesis log
48
Synthesis report
Vivado HLS Tooling
Instruction analysis view
Clock-cycle-accurate representation Verification of actual parallelization of instructions (e.g. pipelining) Localization of data dependencies
Latencies Loop pipelining (latency/throughput) Resources
49
Xilinx Design methodology
Design methodology options A combination is possible!
50
Current generations of FPGAs add a GPP on the chip
Hardwired PowerPC (Xilinx) NIOS Softcore (Altera) MicroBlaze Softcore (Xilinx)
SoC with ARM on Xilinx Zynq
FPGAs with On-chip GPP
51
DSP blocks in reconfigurable architectures
Stratix DSP blocks consist of hardware multipliers, adders, subtractors, accumulators, and pipeline registers
Some FPGAs add DSP blocks to increase performance of DSP algorithms Example: Stratix DSP blocks
52
Reconf matrix of DSP blocks as media coproc.
[Diagram: a general-purpose processor (instruction unit, execution unit, instruction and data caches, memory) coupled to a coprocessor built as a reconfigurable matrix of 8x3 processing elements — each PE with a 32b multiplier, a 32b add/sub unit and a shift register — plus two embedded memory groups and ROM-based control.]
It is possible to build complex systems based on recent FPGA architectures, taking advantage of the regular structure of the DSP blocks in the FPGA matrix
53
Dedicated circuits to accelerate a specific part of an application Typically connected to a general-purpose processor or a DSP Granularity can vary:
Accelerator for a DCT function Accelerator for a whole JPEG encoder
Accelerators are very common in system on chip Are typically called through an API function call from the main CPU
Accelerators as Coprocessors
54
Extending the ISA of a general-purpose processor with SIMD instructions and specific instructions targeting media and communication applications is very common. It adds application-specific features and turns a general-purpose processor into a signal/image/video processor. Examples:
Intel MMX, SSE PowerPC AltiVec SUN VIS Xscale WMMX ARM Neon, Thumb-2, Trustzone, Jazelle, etc.
ISA extension in General-Purpose Processors
55
Conflicting requirements
ASICs <-> Media Proc/DSPs <-> GPPs
ASIC side: better power efficiency, runs at lower frequency Smaller chip size, lower leakage
GPP side: flexibility, re-programmability (vs. redesign cost) Better programming tools, shorter TTM for new apps
56
The Energy-Flexibility Gap
[Figure: energy efficiency (MOPS/mW or MIPS/mW, log scale 0.1-1000) vs. flexibility (coverage) — dedicated HW is the most efficient, followed by reconfigurable processors/logic, then media processors and DSPs, with embedded processors the most flexible but least energy efficient.]
57
SoCs integrate the optimal mix of processors and dedicated hardware units for the different applications targeted by the system. Typically integrate a general purpose processor, e.g. ARM Can integrate a DSP Accelerators for specific functions Dedicated memories Integration boosts performance, cuts cost, reduces power consumption compared to a similar mix of processors on a card
System-on-Chip
58
Digital Camera hardware diagram
[Diagram: digital camera hardware — mechanical shutter, CMOS imager with A/D converter feeding an image-processing ASIC backed by two 256K x16 DRAMs; an MCU with 32K x8 SRAM and serial EEPROM drives a memory-card interface, an LCD-control ASIC with LCD, and a 68-pin PCMCIA connector ASIC; power control runs from a 3.3V CR-123 lithium cell; user-interface keys, activity LED and door interlock complete the system.]
ASIC Integration Opportunity
59
MPSoC: A Platform Story
A platform defines architectural constraints imposed to support reuse of hardware and software components
Best of all worlds:
Provides some level of flexibility While being power efficient And enabling some level of reusability Can last multiple product generations Requires forward-looking platform-based design to integrate potential future extensions
Programming model and design efficiency are key!
60
Nvidia Tegra
61
TI OMAP
62
Freescale iMX6
63
Intel Silvermont/Bay Trail
64
Tegra K1
65
Tegra K1
66
Tegra K1
67
Embedded Processor Architecture Trends Where do we come from?
DSP, FPGA, CPU, GPUs
68
Embedded Processor Architecture Trends Where do we come from?
DSP, FPGA, CPU, GPUs
What is available today? Mix of multicore/manycore and hardware accelerator on the same chip (e.g. Tegra K1) Or mix of multicore and FPGA on the same chip (Zynq)
[Diagram: two SoC sketches — an i.MX6-style multicore CPU + GPU/manycore + image/video accelerator + I/O, and a multicore CPU + FPGA + image/video accelerator + I/O.]
69
Embedded Processor Architecture Trends Where do we come from?
DSP, FPGA, CPU, GPUs What is available today?
Mix of multicore/manycore and hardware accelerator on the same chip (e.g. Tegra K1) Or mix of multicore and FPGA on the same chip (Zynq)
Where are we going? Mix of multicore, manycore, FPGA and Hardware accelerators on the same chip Designed for real-time sensor processing
[Diagram: a single chip integrating multicore CPU, GPU/manycore, FPGA and image/video accelerator with shared I/O.]
Higher Integration/Lower Power
70
Very Fast Moving Industry Performance and power evolve at a very fast pace!
CPUs and GPUs driven by the PC market System-on-Chip driven by the cellphone/tablet market ~50x perf/Watt improvement over the last 5 years
GTX 285 (2009) vs. Tegra K1 (2014):
Type: PCIe card / System-on-chip
CPU: Intel (PC board) / ARM (integrated)
Interconnect: PCIe / On-chip
# Cores: 240 / 192
Power: 200W / 2W
Total Power: 600W / 5W
Performance: 1000 Gflops / 365 Gflops
Perf/Watt: 5 Gflops/W / 180 Gflops/W
50x perf/Watt improvement in 5 years: fanless design possible today!
71
Efficiently Use Heterogeneous SoCs
Use the right core for the right task:
GPU: massively parallel tasks CPU: control-oriented tasks FPGA: compute-intensive tasks with hard real-time constraints
Competitive programming approaches:
Parallel programming languages and tools for CPU and GPU High Level Synthesis (from C to VHDL) for FPGAs Development, profiling, debugging tools are evolving as fast as the hardware
[Figure: video pipeline example — power (Watt), performance (fps) and perf/Watt compared for single CPU, 4x CPU and 4x CPU + GPU; the heterogeneous configuration achieves the highest performance and the best perf/Watt.]
Improved power efficiency, time to market and portability within product line
72
New architecture is driven by power and thermal
Most systems are limited by thermals
Parallelism is needed for perf and power efficiency Instruction level parallelism: Pipeline, OOO, VLIW Data-level parallelism: SIMD, Vector, 2D SIMD Matrices Thread level parallelism: SMP, CMP, SMT/HT System level parallelism: I/Os, Memory Hierarchy
Key Issues with Parallelism
Extracting parallelism from applications System issues: the rest of the system needs to be well balanced Programming models: need to be portable, easy to learn and efficient
Application Specific Signal Processors and SoCs Spectrum: ASICs, FPGA, Media Proc, DSP, GPP + ISA extensions Depending on power/performance constraints, often a mix (SoC)
Summary