TRANSCRIPT
Embedded Computing Systems for Signal Processing Applications Part 1: Introduction November 7th 2014 Eric Debes
2
What is this about?
Introduction to power/performance tradeoffs and system architecture Overview of existing processor and system architectures Consumer vs. Industrial/Embedded
Why do we care? Engineering added value is in complex and critical system architecture Need to know different components available Software/Hardware System Architecture and Modelling Power/Performance/Price Tradeoffs
Introduction
3
1. Introduction
2. General-Purpose Processors and Parallelism
3. Application Specific Processors: DSPs, FPGAs, accelerators, SoCs
4. PC Architecture vs. Embedded System Architecture
5. Hard Real-time Systems and RTOS
6. Power Constraints
7. Critical and Complex Systems, MDE, MDA
Outline
4
Embedded
Size and thermal constraints Sometimes battery life (energy) constraints
Real-time Time constraints Can be hard real-time Or soft-real time
Systems Typically include multiple components Require different areas of expertise:
Signal Processing, computer vision, machine learning/Cognition and other algorithmic expertise Software Architecture Hardware/Computing Architecture Thermal and mechanical engineering
Embedded Real-time Systems
5
Consumer: DVD/video players, set-top boxes, Playstation, printers, disk drives, GPS, cameras, mp3 players Communications: cellphones, Mobile Internet Devices, netbooks, PDAs with WiFi, GSM/3G, WiMax, GPS, cameras, music/video Automotive: driving innovation for many embedded applications, e.g. sensors, buses, infotainment Industrial applications: process control, instrumentation Other niche markets: video surveillance, satellites, airplanes, sonars, radars, military applications
Application Examples
6
Texec = NI * CPI * Tc
NI = Number of Instructions CPI = Cycles Per Instruction Tc = Cycle Time
Texec = NI / (IPC * F)
IPC = Instructions Per Cycle F = Frequency
Performance improves with Silicon manufacturing technology
Microarchitecture improvements
Higher frequencies with deeper pipelines Higher IPC through parallelism
Performance
7
Performance
[Figure: clock frequency (MHz, log scale 10-10,000) vs. year (1987-2005) for Intel (386, 486, Pentium, Pentium Pro, Pentium II), PowerPC (601, 603, 604, 604+, MPC750) and Alpha (21064, 21064A, 21164, 21164A, 21264, 21264S) processors, with clock period per average gate delay on a secondary axis (1-100). Processor frequency scales by 2x per generation.]
8
Dynamic Power = α x C x V² x f
α = activity factor C = capacitance V = voltage f = frequency
Power = Pdyn + Pstatic Power is limited by
maximum current (Voltage regulator limitation) Thermal constraints
Power
9
Power
[Figure: maximum power (W, log scale) vs. process generation (1.5 down to 0.18 microns) for i386, i486, Pentium, Pentium MMX, Pentium Pro and Pentium II; surface power density (W/cm², log scale 1-1000) for i386 through Pentium III approaches that of a hot plate and trends toward nuclear-reactor and rocket-nozzle levels.]
Power density
10
ASIC
High-performance Dedicated to one specific application Not programmable
Processor Programmable General-purpose
Reconfigurable Architecture Good compromise between programmability and performance
Processor Architecture Spectrum
Microprocessor Reconfigurable ASIC
11
Processor Architecture Spectrum
12
Processor Architecture Spectrum
13
What are the key components in a Computing System?
Processor with Arithmetic and Logic Units Register File Caches or local memory
Memory Buses/Interconnect I/O Devices
Key Components of a Computing System
Part 2: General-purpose Processors and Parallelism November 7th 2014 Eric Debes
Embedded Computing Systems for Signal Processing Applications
15
Laundry Example Ann, Brian, Cathy, Dave
each have one load of clothes to wash, dry, and fold
Washer takes 30 minutes Dryer takes 40 minutes
Pipelining: It's Natural!
16
Sequential laundry takes 6 hours for 4 loads If they learned pipelining, how long would laundry take?
Sequential Laundry
[Figure: Gantt chart, 6 PM to midnight — loads A, B, C, D each take 30 min wash + 40 min dry + 20 min fold, run back-to-back for 6 hours total.]
17
Pipelining doesn't help the latency of a single task, but it helps the throughput of the entire workload Pipeline rate is limited by the slowest pipeline stage Multiple tasks operate simultaneously Potential speedup = number of pipe stages Unbalanced lengths of pipe stages reduce the speedup
Pipelining Lessons
[Figure: pipelined laundry Gantt chart, 6 PM onwards — stage times 30, 40, 40, 40, 40, 20 minutes; the four loads A-D overlap and finish in 3.5 hours.]
18
Process technology provides more transistors for advanced architectures These deliver higher peak perf But lower power efficiency
Performance = Frequency x Instructions per Clock Cycle
Power = Switching Activity x Dynamic Capacitance x Voltage² x Frequency
History: How did we increase Perf in the Past?
[Figure: bar charts for Pipelined, Superscalar, OOO-Speculative and Deep-Pipeline microarchitectures — each step increases area (up to ~4x) for a smaller gain in performance, while power grows faster than performance, so power efficiency (MIPS/W) drops.]
19
In many systems today power is the limiting factor and will drive most of the architecture decisions New Goal: optimize performance in a given power envelope
Why Multi-Cores?
20
Dual Core
Rule of thumb (in the same process technology): a 1% voltage reduction allows a ~1% frequency reduction, saves ~3% power, and costs ~0.66% performance.

Single core: Voltage = 1, Freq = 1, Area = 1, Power = 1, Perf = 1
Dual core (voltage and frequency each reduced by 15%): Voltage = 0.85, Freq = 0.85, Area = 2, Power = ~1, Perf = ~1.8

How to maximize performance in the same power envelope?
Power = Dynamic Capacitance x Voltage² x Frequency
21
Multicore
[Figure: normalized power/performance — a large core with cache: power 4, performance 2; a small core: power 1, performance 1 (1/4 the power for 1/2 the performance); four small cores C1-C4 sharing a cache: power 4, performance 4. Multi-core is power efficient and allows better power and thermal management.]
22
Era of Parallelism
23
Thermal is the main limiting factor in future designs (not size) Move away from frequency alone to deliver performance Challenges in scaling: need to exploit thread-level parallelism to efficiently use the transistors available thanks to Moore's Law
Power/performance tradeoffs dictate architectural choices Multi-everywhere
Multi-threading Chip level multi-processing
Throughput oriented designs
Summary: Why Multi-Cores?
24
Processors are designed to address the needs of the mass market.
Mobile applications: low power and good power management are top priorities to enable thinner systems and longer battery life
Office, image, video: single-threaded perf matters, plus some level of multithreaded perf -> Multi-core
RMS (Recognition, Mining, Synthesis) applications and model-based computing: massively parallel apps, good scaling on a large number of cores -> Many-core
Because of the large markets in each of the classes above, they are the focus of silicon manufacturers and are driving innovation in the semiconductor market
Application-driven Architecture Design
25
RMS Scaling on a Many-Core Simulator
[Figure: speed-up vs. number of cores (0-64) on a many-core simulator for RMS workloads — Gauss-Seidel, sparse/dense matrix, Kmeans, SVD, SVM classification, cloth simulation (AVDF/AVIF/US), face detection, SepiaTone (Core Image), text indexing, CFD, ray tracing, FB estimation, body tracking, portfolio management, game physics — most scale near-linearly up to 64 cores.]
Data from Intel Application Research Lab
26
Low-power architecture and SoCs
ARM based LPIA/Atom based
Multi-core Core microarchitecture PowerPC
Many-core GP GPU Larrabee
3 Classes of Applications 3 Types of Processors
27
Examples of ARM-based low power architectures and SoCs: TI OMAP, Nvidia Tegra, Apple A4/A5, Samsung Exynos
Low-power Architecture and SoCs
28
Towards PC on a chip
Same Intel Core (e.g. Bay Trail) for Tablets/Smartphones, Consumer Electronic Devices and Embedded Market
29
Multi-core IBM Power4 IBM Cell Intel Ivy Bridge
Multicore
30
Tick-Tock model
Modular design to decrease cost (design, test, validation)
Integrate graphics on chip
Intel Roadmap for Intel Core Microarchitecture
31
Binning for leakage distribution and performance P = α x C x V² x f + leakage
Turbo mode to optimize performance under a given power envelope Policy to balance thermal budget between general purpose cores, and between GPP cores and graphics Next: Maximize performance under a given thermal envelope at the platform level
Power/Performance Tradeoffs
32
GP GPU: NVidia GeForce more than 2000 PEs
33
GPUs do not need large caches because the large number of threads hides memory latency: the chip is designed to tolerate DRAM latency through a huge number of threads. Local memories are still present to limit bandwidth to GDDR. CPUs need large multi-level caches because the data needs to be close to the execution units. The fast-growing video game industry exerts strong economic pressure that forces constant innovation.
CPUs vs. GPUs
34
For a given application, the processor architecture should be chosen depending on performance/power efficiency MIPS/Watt or Gflops/Watt Energy efficiency (Energy-Delay Product)
This is highly dependent on the application and targeted power envelope. Examples: ARM and Atom are optimized for mainstream office and media apps in a power envelope between 1W and <10W The Core microarchitecture is optimized for high-end office and media apps in a power envelope between 15W and ~75W GPUs are optimized for graphics applications and some selected scientific applications between 10W and more than 400W
Performance/Power for different architectures
35
Processor will integrate - Big core for single thread perf - Small core for multithreaded perf - some dedicated hardware units for
- graphics - media - encryption - networking function - other function specific logic
Systems will be heterogeneous Processor core will be connected to - one or multiple many-core cards - and dedicated function hw in the chipset + reconfigurable logic in the system or on chip?
Future: PC on a Chip
[Diagram: two 16-core IA tiles, each with Gfx/Media, PCI-Express links and memory channels — one on the main board, one as a high-end add-in card — connected through a graphics/chipset hub (GCH) to two big IA cores.]
Part 3: App Specific Proc: DSPs, FPGAs, Accelerators, SoCs November 7th 2014 Eric Debes
Embedded Computing Systems for Signal Processing Applications
37
What are application specific processors? Processors or System-on-chip targeting a specific (class of) application(s)
Very common for Audio: MP3, AAC coding and decoding in audio players Image: JPEG or JPEG2000 coding and decoding, e.g. Digital cameras Video: MPEG, H264 coding and decoding, e.g. DVD players or set-top-boxes Encryption: RSA, AES Communication: GSM, 3G in cellphones
Why?
Large markets can justify the development of application specific processors Dedicated circuits provide higher performance with lower power dissipation, better battery life and very often lower cost.
Application Specific Processors
38
Application Specific Signal Processor Spectrum
39
DSPs Dedicated ASICs FPGAs Accelerators as coprocessors ISA extensions SoCs
Different Types of ASPs
40
Summary of Architectural Features of DSPs
Data path configured for DSP Fixed-point arithmetic MAC - multiply-accumulate
Multiple memory banks and buses - Harvard architecture: separate data and instruction memories Multiple data memories
Specialized addressing modes Bit-reversed addressing Circular buffers
Specialized instruction set and execution control Zero-overhead loops Support for MAC
Specialized peripherals for DSP
41
DSP Example: 320C62x/67x DSP
42
Many dedicated ASICs exist on the market, especially for media and communication applications. Example:
MP3 player DVD player Video processing engines, e.g. De-interlacing, super-resolution Video Encoder/Decoder GSM/3G TCP/IP Offload engine
Advantages: Low power, high perf/power efficiency Small area compared to same functionality in DSP or GPP
Drawbacks Cost of designing ASICs requires large volume Not flexible: cannot handle different applications, cannot evolve to follow standard evolution
Dedicated ASICs
43
Reconfigurable architectures FPGAs contain gates that can be programmed for a specific application
Each logic element outputs one data bit
Interconnect programmable between elements FPGAs can be reconfigured to target a different function by loading another configuration
44
Specifications Input: RTL coding, structural or behavioral description
RTL Simulation Functional simulation to check logic and data flow (no timing analysis)
Synthesis Translation into specific hardware primitives Optimization to meet area and performance constraints
Place and Route Map hardware primitives to specific places on the chip based on area and performance for the given technology Specify routing
Timing Analysis Verification that timing specifications are met
Test and Verification of the component on the FPGA board
FPGAs Design Flow
45
Vivado HLS Design flow: from C to VHDL
46
Vivado HLS Development Flow
1. C/C++ programming
2. C simulation: algorithm validation
3. Optimization directives insertion: pipeline, unroll, merge (loops); array partitioning; interfaces; dataflow
4. Synthesis
5. Results OK? (perf/resources) - if no, go back to directives insertion; if yes, continue
6. Cosimulation: RTL design validation
7. IP generation
47
Vivado HLS User Interface
Project explorer Directives insertion Code editor
Synthesis log
48
Synthesis report
Vivado HLS Tooling
Instruction analysis view
Clock-cycle-accurate representation Verification of actual parallelization of instructions (e.g. pipelining) Localization of data dependencies
Latencies Loop pipelining (latency/throughput) Resources
49
Xilinx Design methodology
Design methodology options A combination is possible!
50
Current generations of FPGAs add a GPP on the chip
Hardwired PowerPC (Xilinx) NIOS Softcore (Altera) MicroBlaze Softcore (Xilinx)
SoC with ARM on Xilinx Zynq
FPGAs with On-chip GPP
51
DSP blocks in reconfigurable architectures
Stratix DSP blocks consist of hardware multipliers, adders, subtractors, accumulators, and pipeline registers
Some FPGAs add DSP blocks to increase performance of DSP algorithms Example: Stratix DSP blocks
52
Reconf matrix of DSP blocks as media coproc.
[Diagram: a general-purpose processor (instruction unit, execution unit, instruction and data caches, memory) coupled to a coprocessor built as a reconfigurable matrix of 8x3 processing elements — each PE with a 32b multiplier, a 32b add/sub unit and a shift register — plus two embedded memory groups and ROM-based control.]
It is possible to build complex systems based on recent FPGA architectures, taking advantage of the regular structure of the DSP blocks in the FPGA matrix
53
Dedicated circuits to accelerate a specific part of an application Typically connected to a general-purpose processor or a DSP Granularity can vary:
Accelerator for a DCT function Accelerator for a whole JPEG encoder
Accelerators are very common in system on chip Are typically called through an API function call from the main CPU
Accelerators as Coprocessors
54
Extending the ISA of a general-purpose processor with SIMD instructions and specific instructions targeting media and communication applications is very common. It adds application-specific features and turns a general-purpose processor into a signal/image/video processor. Examples:
Intel MMX, SSE PowerPC AltiVec SUN VIS Xscale WMMX ARM Neon, Thumb-2, Trustzone, Jazelle, etc.
ISA extension in General-Purpose Processors
55
Conflicting requirements
ASICs <-> Media Proc/DSPs <-> GPPs
ASIC side: better power efficiency, runs at lower frequency Smaller chip size, lower leakage
GPP side: flexibility, re-programmability (vs. redesign cost) Better programming tools, shorter TTM for new apps
56
The Energy-Flexibility Gap
[Figure: energy efficiency (MOPS/mW or MIPS/mW, log scale 0.1-1000) vs. flexibility (coverage) — dedicated HW is the most efficient, followed by reconfigurable processors/logic, then media processors and DSPs, with embedded processors the most flexible but least energy efficient.]
57
SoCs integrate the optimal mix of processors and dedicated hardware units for the different applications targeted by the system. Typically integrate a general purpose processor, e.g. ARM Can integrate a DSP Accelerators for specific functions Dedicated memories Integration boosts performance, cuts cost, reduces power consumption compared to a similar mix of processors on a card
System-on-Chip
58
Digital Camera hardware diagram
[Diagram: digital camera hardware — mechanical shutter, CMOS imager with A/D converter feeding an image-processing ASIC backed by two 256K x16 DRAMs; an MCU with 32K x8 SRAM and serial EEPROM drives a memory-card interface, an LCD-control ASIC with LCD, and a 68-pin PCMCIA connector ASIC; power control runs from a 3.3V CR-123 lithium cell; user-interface keys, activity LED and door interlock complete the system.]
ASIC Integration Opportunity
59
MPSoC: A Platform Story
A platform defines architectural constraints imposed to support reuse of hardware and software components
Best of all worlds:
Provides some level of flexibility While being power efficient And enabling some level of reusability Can last multiple product generations Requires forward-looking platform-based design to integrate potential future extensions
Programming model and design efficiency are key!
60
Nvidia Tegra
61
TI OMAP
62
Freescale iMX6
63
Intel Silvermont/Bay Trail
64
Tegra K1
65
Tegra K1
66
Tegra K1
67
Embedded Processor Architecture Trends Where do we come from?
DSP, FPGA, CPU, GPUs
68
Embedded Processor Architecture Trends Where do we come from?
DSP, FPGA, CPU, GPUs
What is available today? Mix of multicore/manycore and hardware accelerator on the same chip (e.g. Tegra K1) Or mix of multicore and FPGA on the same chip (Zynq)
[Diagram: two SoC sketches — an i.MX6-style multicore CPU + GPU/manycore + image/video accelerator + I/O, and a multicore CPU + FPGA + image/video accelerator + I/O.]
69
Embedded Processor Architecture Trends Where do we come from?
DSP, FPGA, CPU, GPUs What is available today?
Mix of multicore/manycore and hardware accelerator on the same chip (e.g. Tegra K1) Or mix of multicore and FPGA on the same chip (Zynq)
Where are we going? Mix of multicore, manycore, FPGA and Hardware accelerators on the same chip Designed for real-time sensor processing
[Diagram: a single chip integrating multicore CPU, GPU/manycore, FPGA and image/video accelerator with shared I/O.]
Higher Integration/Lower Power
70
Very Fast Moving Industry Performance and power evolve at a very fast pace!
CPUs and GPUs driven by the PC market System-on-Chip driven by the cellphone/tablet market ~50x perf/Watt improvement over the last 5 years
GTX 285 (2009) vs. Tegra K1 (2014):
Type: PCIe card / System-on-chip
CPU: Intel (PC board) / ARM (integrated)
Interconnect: PCIe / On-chip
# Cores: 240 / 192
Power: 200W / 2W
Total Power: 600W / 5W
Performance: 1000 Gflops / 365 Gflops
Perf/Watt: 5 Gflops/W / 180 Gflops/W
50x perf/Watt improvement in 5 years: fanless design possible today!
71
Efficiently Use Heterogeneous SoCs
Use the right core for the right task:
GPU: massively parallel tasks CPU: control-oriented tasks FPGA: compute-intensive tasks with hard real-time constraints
Competitive programming approaches:
Parallel programming languages and tools for CPU and GPU High Level Synthesis (from C to VHDL) for FPGAs Development, profiling, debugging tools are evolving as fast as the hardware
[Figure: video pipeline example — power (Watt), performance (fps) and perf/Watt compared for single CPU, 4x CPU and 4x CPU + GPU; the heterogeneous configuration achieves the highest performance and the best perf/Watt.]
Improved power efficiency, time to market and portability within product line
72
New architecture is driven by power and thermal
Most systems are limited by thermals
Parallelism is needed for perf and power efficiency Instruction level parallelism: Pipeline, OOO, VLIW Data-level parallelism: SIMD, Vector, 2D SIMD Matrices Thread level parallelism: SMP, CMP, SMT/HT System level parallelism: I/Os, Memory Hierarchy
Key Issues with Parallelism
Extracting parallelism from applications System issues: the rest of the system needs to be well balanced Programming models: need to be portable, easy to learn and efficient
Application Specific Signal Processors and SoCs Spectrum: ASICs, FPGA, Media Proc, DSP, GPP + ISA extensions Depending on power/performance constraints, often a mix (SoC)
Summary