Frank Vahid, UC Riverside
1
Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs
Frank Vahid, Associate Professor
Dept. of Computer Science and Engineering, University of California, Riverside
Also with the Center for Embedded Computer Systems at UC Irvine
http://www.cs.ucr.edu/~vahid
This research has been supported by the National Science Foundation, NEC, Trimedia, and Triscend
General Purpose vs. Special Purpose
Standard tradeoff
Oct. 14, 2002, Cincinnati, Ohio: physicians at Cincinnati Children’s Hospital Medical Center report duct tape effective at treating warts.
Amazing to think this came from wolves
General Purpose vs. Single Purpose Processors
Designers have long known that:
General-purpose processors are flexible
Single-purpose processors are fast
ENIAC, 1940s: its flexibility was the big deal
[Figure: a general-purpose processor (controller with control logic, state register, IR, and PC; datapath with register file and general ALU; program memory holding assembly code for "total = 0; for i = 1 to ..."; data memory) versus a single-purpose processor (controller with control logic and state register; custom datapath with registers i and total plus an adder; data memory), both implementing: total = 0; for i = 1 to N loop total += M[i]; end loop]
General purpose OR single purpose: the standard tradeoff of flexibility, design cost, and time-to-market versus performance, power efficiency, and size
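The slide's example loop, written as the C a general-purpose processor would execute instruction by instruction (a sketch; the single-purpose version would instead wire the accumulation into a dedicated datapath):

```c
/* The loop from the slide: total = 0; for i = 1 to N loop
   total += M[i]; end loop.  A general-purpose processor fetches
   and executes these as instructions; a single-purpose processor
   would implement the same computation with registers i and
   total plus a single adder. */
int accumulate(const int *M, int n) {
    int total = 0;
    for (int i = 0; i < n; i++)
        total += M[i];
    return total;
}
```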
Mixing General and Single Purpose Processors
A.k.a. Hardware/software partitioning
Hardware: single-purpose processors
coprocessor, accelerator, peripheral, etc.
Software: general-purpose processors
Though hardware underneath!
Especially important for embedded systems
Computers embedded in devices (cameras, cars, toys, even people)
Speed, cost, time-to-market, power, size, … demands are tough
[Figure: digital camera chip: a microcontroller plus single-purpose processors, including a CCD preprocessor, pixel coprocessor, A2D and D2A converters, JPEG codec, DMA controller, memory controller, ISA bus interface, UART, LCD/display control, and multiplier/accumulator, connected to the lens and CCD]
[Chart: sw-only vs. hw-only vs. hw/sw implementations compared on flexibility, speed, and energy]
How is Partitioning Done for Embedded Systems?
Partitioning into hw and sw blocks done early, during the conceptual stage
Sw design done separately from hw design
Attempts since the late 1980s to automate have not yet been successful
Partitioning manually is reasonably straightforward:
the spec is informal and not machine readable,
sw algorithms may differ from hw algorithms,
and there is no compelling need for tools
[Diagram: informal spec → system partitioning → sw spec (sw design → processor) and hw spec (hw design → ASIC)]
New Platforms Invite New Efforts in Hw/Sw Partitioning
New single-chip platforms contain both a general-purpose processor and an FPGA
FPGA: field-programmable gate array
Programmable just like software; flexible
Intended largely to implement single-purpose processors
Can we perform a later partitioning to improve the software too?
[Diagram: informal spec → system partitioning → sw spec (sw design) and hw spec (hw design); a second, later partitioning step maps parts of the software onto the processor + FPGA]
Commercial Single-Chip Microprocessor/FPGA Platforms
Triscend E5 chip
[Die photo: configurable logic, 8051 processor plus other peripherals, memory]
Triscend E5: based on 8-bit 8051 CISC core (2000)
10 Dhrystone MIPS at 40 MHz
Up to 40K logic gates
Cost only about $4
Single-Chip Microprocessor/FPGA Platforms
Atmel FPSLIC: Field-Programmable System-Level IC
Based on AVR 8-bit RISC core
20 Dhrystone MIPS
5k-40k logic gates
$5-$10
Courtesy of Atmel
Single-Chip Microprocessor/FPGA Platforms
Triscend A7 chip (2001)
Based on ARM7 32-bit RISC processor
54 Dhrystone MIPS at 60 MHz
Up to 40k logic gates
$10-$20 in volume
Courtesy of Triscend
Single-Chip Microprocessor/FPGA Platforms
Altera’s Excalibur EPXA 10 (2002)
ARM (922T) hard core
200 Dhrystone MIPS at 200 MHz
~200k to ~2 million logic gates
Source: www.altera.com
Single-Chip Microprocessor/FPGA Platforms
Xilinx Virtex II Pro (2002)
PowerPC based
420 Dhrystone MIPS at 300 MHz
1 to 4 PowerPCs
4 to 16 gigabit transceivers (622 Mbps to 3.125 Gbps)
12 to 216 multipliers
Millions of logic gates
200k to 4M bits RAM
204 to 852 I/O
$100-$500 (>25,000 units)
[Die photo: configurable logic, PowerPCs, up to 16 serial transceivers]
Courtesy of Xilinx
Single-Chip Microprocessor/FPGA Platforms
Why wouldn’t future microprocessor chips include some amount of on-chip FPGA?
One argument against: area
Lots of silicon area taken up by FPGA
FPGA about 20-30 times less area efficient than custom logic
FPGA used to be for prototyping, too big for final products
But chip trends imply that FPGAs will be O.K. in final products…
How Much is Enough?
[Photo series: perhaps a bit small; reasonably sized; probably plenty big for most of us; more than typically necessary]
How Much Custom Logic is Enough?
1993: ~1 million logic transistors (perhaps a bit small)
(For scale: 8-bit processor, 50,000 tr.; Pentium, 3 million tr.; MPEG decoder, several million tr.)
1996: ~5-8 million logic transistors (reasonably sized)
1999: ~10-50 million logic transistors (probably plenty big for most of us)
2002: ~100-200 million logic transistors (more than typically necessary)
2008: >1 BILLION logic transistors, vs. 1 M in 1993 (perhaps very few people could design this)
Frank Vahid, UC Riverside 22
Very Few Companies Can Design High-End ICs
Designer productivity growing at a slower rate than chip capacity
1981: 100 designer months, ~$1M
2002: 30,000 designer months, ~$300M
[Chart: logic transistors per chip (Moore’s Law) vs. productivity in K transistors per staff-month, 1981-2009, showing a widening design productivity gap. Source: ITRS’99]
Frank Vahid, UC Riverside 23
Single-Chip Platforms with On-Chip FPGAs
[Chart: cost per IC vs. volume for 1990, 2000, and 2010 technologies; mainstream design volumes are becoming out of reach of mainstream designers]
So, big FPGAs on-chip are O.K., because mainstream designers couldn’t have used all that silicon area anyways
But couldn’t designers use custom logic instead of FPGAs to make smaller chips and save costs?
Frank Vahid, UC Riverside 24
Shrinking Chips
Yes, but there’s a limit: chips are becoming pin limited
The pads connecting to external pins can only shrink so far
A football huddle can only get so small
This area will exist whether we use it all or not
Frank Vahid, UC Riverside 25
Trend Towards Pre-Fabricated Platforms: ASSPs
ASSP: application specific standard product
Domain-specific pre-fabricated IC
e.g., digital camera IC
ASIC: application specific IC
ASSP revenue > ASIC revenue
ASSP design starts > ASIC design starts
(A design start is a unique IC design, ignoring the quantity produced of the same IC)
ASIC design starts decreasing, due to strong benefits of using pre-fabricated devices
Source: Gartner/Dataquest, September ’01
Frank Vahid, UC Riverside 26
Microprocessor/FPGA Platforms
Trends point towards such platforms increasing in popularity
Can we automatically partition the software to utilize the FPGA, for improved speed and energy?
Frank Vahid, UC Riverside 27
Automatic Hardware/Software Partitioning
Since the late 1980s, the goal has been spec in, hw/sw out. But no successful commercial tool yet. Why?
// From MediaBench’s JPEG codec
GLOBAL(void)
jpeg_fdct_ifast (DCTELEM * data)
{
DCTELEM tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7;
DCTELEM tmp10, tmp11, tmp12, tmp13;
DCTELEM z1, z2, z3, z4, z5, z11, z13;
DCTELEM *dataptr;
int ctr;
SHIFT_TEMPS
/* Pass 1: process rows. */
dataptr = data;
for (ctr = DCTSIZE-1; ctr >= 0; ctr--) {
tmp0 = dataptr[0] + dataptr[7];
tmp7 = dataptr[0] - dataptr[7];
tmp1 = dataptr[1] + dataptr[6];
…
// Thousands of lines like this in dozens of files
[Diagram: the ideal flow: “spec” → partitioner → software (compilation onto processor) and hardware (synthesis onto ASIC/FPGA)]
Frank Vahid, UC Riverside 28
Why No Successful Tool Yet?
Most research has focused on extensive exploration
Roots in VLSI CAD: decompose the problem into fine-grained operations, then apply sophisticated partitioning algorithms
Examples: min-cut, dynamic programming, simulated annealing, tabu-search, genetic evolution, etc.
Is this overkill?
[Diagram: “spec” decomposed into 1000s of nodes (like circuit partitioning) fed to the partitioner]
Frank Vahid, UC Riverside 29
We Really Only Need Consider a Few Loops – Due to the 90-10 Rule
Recent appearance of embedded benchmark suites enables analysis and understanding of the real problem
We’ve examined UCLA’s MediaBench, NetBench, and Motorola’s PowerStone; currently examining EEMBC (the embedded equivalent of SPEC)
UCR loop analysis tools based on SimpleScalar and Simics
[Chart: fraction of total execution time contributed by each of the top 10 loops]
// From MediaBench’s JPEG codec
GLOBAL(void)
jpeg_fdct_ifast (DCTELEM * data)
{
DCTELEM tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7;
DCTELEM tmp10, tmp11, tmp12, tmp13;
DCTELEM z1, z2, z3, z4, z5, z11, z13;
DCTELEM *dataptr;
int ctr;
SHIFT_TEMPS
/* Pass 1: process rows. */
dataptr = data;
for (ctr = DCTSIZE-1; ctr >= 0; ctr--) {
tmp0 = dataptr[0] + dataptr[7];
tmp7 = dataptr[0] - dataptr[7];
tmp1 = dataptr[1] + dataptr[6];
…
Assigned each loop a number, sorted by fraction of contribution to total execution time
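The bookkeeping behind such a chart is simple; a hypothetical sketch (not the actual UCR tool) that turns per-loop cycle counts into sorted fractions of total execution time:

```c
#include <stdlib.h>

/* Sort helper: descending order of doubles. */
static int cmp_desc(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x < y) - (x > y);
}

/* frac[i] = loop i's share of total cycles, sorted largest-first,
   so frac[0] corresponds to loop "1" in the chart, frac[1] to
   loop "2", and so on. */
void loop_fractions(const long *cycles, int nloops, double *frac) {
    long total = 0;
    for (int i = 0; i < nloops; i++)
        total += cycles[i];
    for (int i = 0; i < nloops; i++)
        frac[i] = (double)cycles[i] / (double)total;
    qsort(frac, nloops, sizeof(double), cmp_desc);
}
```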
Frank Vahid, UC Riverside 30
The 90-10 Rule Holds for Embedded Systems
[Charts: for each benchmark suite, % execution time vs. the top 10 loops, alongside % size of program: a few loops dominate execution time while occupying little code]
In fact, the most frequent loop alone took 50% of time, using 1% of code
Frank Vahid, UC Riverside 31
So Need We Only Consider the First Few Loops? Not Necessarily
What if programs were self-similar w.r.t. the 90-10 rule? Remove the most frequent loop: does the 90-10 rule still hold? Intuition might say yes: remove the loop, and we have another program.
[Charts: % remaining execution time and cumulative speedup as loops 1-10 are moved to hardware]
So we need only speed up the first few loops
After that, speedups are limited
Good from a tool perspective!
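The flattening the charts show is just Amdahl's Law: if the region being moved to hardware occupies fraction f of execution time and the region itself speeds up by a factor s, the overall speedup is 1/((1-f) + f/s). A small illustration:

```c
/* Overall speedup (Amdahl's Law) when a region taking fraction f
   of total execution time is sped up by factor s.  As s grows,
   the result saturates at 1/(1-f), no matter how fast the
   hardware version of the region is. */
double overall_speedup(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}
```

Even an enormous loop speedup for a loop with f = 0.5 caps the overall speedup near 2x, which is why moving loops beyond the first few buys little.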
Frank Vahid, UC Riverside 32
Manually Partitioned Several PowerStone Benchmarks onto Triscend A7 and E5 Chips
Used a multimeter and timer to measure performance and power
Obtained good speedups and energy savings by partitioning software among the microprocessor and on-chip FPGA
A7 results
Benchmark | Time_orig (s) | Time_sw/hw (s) | Speedup | P_orig (W) | P_sw/hw (W) | E_orig (J) | E_sw/hw (J) | E savings
PS_g3fax | 11.47 | 7.44 | 1.5 | 1.320 | 1.332 | 15.140 | 9.910 | 35%
PS_crc | 10.92 | 4.51 | 2.4 | 1.320 | 1.320 | 14.414 | 5.953 | 59%
PS_brev | 9.84 | 3.28 | 3.0 | 1.332 | 1.344 | 13.107 | 4.408 | 66%
Average speedup: 2.3; average energy savings: 53%
E5 results
Benchmark | Time_orig (s) | Time_sw/hw (s) | Speedup | P_orig (W) | P_sw/hw (W) | E_orig (J) | E_sw/hw (J) | E savings
PS_g3fax | 15.16 | 7.11 | 2.1 | 0.252 | 0.270 | 3.820 | 1.920 | 50%
PS_crc | 10.64 | 4.64 | 2.3 | 0.207 | 0.225 | 2.202 | 1.044 | 53%
PS_brev | 17.81 | 1.81 | 9.8 | 0.252 | 0.270 | 4.488 | 0.489 | 89%
Average speedup: 4.8; average energy savings: 64%
[Photos: E5 IC and Triscend A7 development board]
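The energy columns in the tables follow directly from E = P × t; for example, the A7 row for PS_g3fax gives 1.320 W × 11.47 s ≈ 15.14 J before partitioning. A quick check:

```c
/* Energy in joules from average power (W) and run time (s),
   and the fractional energy savings reported in the tables:
   savings = 1 - E_sw/hw / E_orig. */
double energy(double power_w, double time_s) {
    return power_w * time_s;
}

double energy_savings(double p_orig, double t_orig,
                      double p_swhw, double t_swhw) {
    return 1.0 - energy(p_swhw, t_swhw) / energy(p_orig, t_orig);
}
```

Note that the partitioned power is often slightly higher (the FPGA is active), but the much shorter run time still yields a large energy savings.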
Frank Vahid, UC Riverside 33
Simulation-Based Results for More Benchmarks
Example | Archit | Cycles_orig | Cycles_sw | Cycles_hw | Loop Sp. | Clk_hw (MHz) | Total Sp. | P_sw (W) | P_hw (W) | E_orig (J) | E_sw/hw (J) | E savings
PS_g3fax | 8051 | 19,675,456 | 10,812,544 | 176,562 | 61 | 25 | 2.2 | 0.05 | 0.032 | 0.1142 | 0.05408 | 53%
PS_crc | 8051 | 291,196 | 180,224 | 7,168 | 25 | 25 | 2.5 | 0.05 | 0.028 | 0.0017 | 0.00071 | 58%
PS_summin | 8051 | 109,821,892 | 20,394,080 | 384,416 | 53 | 25 | 1.2 | 0.05 | 0.033 | 0.6376 | 0.53657 | 16%
PS_brev | 8051 | 330,064 | 305,768 | 1,360 | 225 | 25 | 12.9 | 0.05 | 0.034 | 0.0019 | 0.00015 | 92%
PS_matmul | 8051 | 119,420 | 101,576 | 2,560 | 40 | 25 | 5.9 | 0.05 | 0.035 | 0.0007 | 0.00012 | 82%
PS_g3fax | MIPS | 15,600,000 | 4,720,000 | 599,000 | 8 | 100 | 1.4 | 0.07 | 0.111 | 0.0265 | 0.02163 | 18%
PS_adpcm | MIPS | 113,000 | 29,300 | 5,440 | 5 | 100 | 1.3 | 0.07 | 0.181 | 0.0002 | 0.00018 | 6%
PS_crc | MIPS | 5,040,000 | 3,480,000 | 460,800 | 8 | 100 | 2.5 | 0.07 | 0.061 | 0.0086 | 0.00379 | 56%
PS_des | MIPS | 142,000 | 70,700 | 15,100 | 5 | 100 | 1.6 | 0.07 | 0.197 | 0.0002 | 0.00019 | 20%
PS_engine | MIPS | 915,000 | 145,000 | 28,100 | 5 | 100 | 1.1 | 0.07 | 0.082 | 0.0016 | 0.00146 | 6%
PS_jpeg | MIPS | 7,900,000 | 646,000 | 171,000 | 4 | 100 | 1.1 | 0.07 | 0.092 | 0.0134 | 0.01360 | -1%
PS_summin | MIPS | 2,920,000 | 1,270,000 | 266,000 | 5 | 100 | 1.5 | 0.07 | 0.111 | 0.0050 | 0.00375 | 24%
PS_v42 | MIPS | 3,850,000 | 846,000 | 216,000 | 4 | 100 | 1.2 | 0.07 | 0.102 | 0.0065 | 0.00605 | 7%
PS_brev | MIPS | 3,566 | 2,499 | 138 | 18 | 100 | 3.0 | 0.07 | 0.107 | 0.0000 | 0.00000 | 62%
MB_g721 | MIPS | 838,230,002 | 457,674,179 | 9,985,261 | 46 | 100 | 2.1 | 0.07 | 0.152 | 1.4250 | 0.75035 | 47%
MB_adpcm | MIPS | 32,894,094 | 32,866,110 | 1,183,260 | 28 | 42 | 11.6 | 0.07 | 0.130 | 0.0559 | 0.00821 | 85%
MB_pegwit | MIPS | 42,752,919 | 33,276,287 | 2,167,651 | 15 | 50 | 3.1 | 0.07 | 0.170 | 0.0727 | 0.03241 | 55%
NB_dh | MIPS | 1,793,032,157 | 1,349,063,192 | 45,156,767 | 30 | 69 | 3.5 | 0.07 | 0.121 | 3.0482 | 1.00547 | 67%
NB_md5 | MIPS | 5,374,034 | 3,046,881 | 289,877 | 11 | 47 | 1.8 | 0.07 | 0.251 | 0.0091 | 0.00722 | 21%
NB_tl | MIPS | 57,412,470 | 29,244,221 | 2,479,552 | 12 | 58 | 1.8 | 0.07 | 0.059 | 0.0976 | 0.05930 | 39%
Average loop speedup: 30; average total speedup: 3.2; average energy savings: 34%
Speedup of 3.2 and energy savings of 34% obtained with only 10,500 gates (avg)
(Quicker than physical implementation, results matched reasonably well)
Frank Vahid, UC Riverside 34
Looking at Multiple Loops per Benchmark
Manually created several partitioned versions of each benchmark
Most speedup gained with the first 20,000 gates: surprisingly few gates!
Stitt, Grattan and Vahid, Field-Programmable Custom Computing Machines (FCCM), 2002
Stitt and Vahid, IEEE Design and Test, Dec. 2002
J. Villarreal, D. Suresh, G. Stitt, F. Vahid and W. Najjar, Design Automation of Embedded Systems, 2002 (to appear)
[Chart: speedup vs. gates (0 to 25,000) for G721(MB), ADPCM(MB), PEGWIT(MB), DH(NB), MD5(NB), TL(NB), and URL(NB); annotations: 27.2, and 2.05 at 90,000 gates]
Frank Vahid, UC Riverside 35
Ideal Speedups for Different Architectures
[Charts: speedup vs. number of loops moved to hardware, one curve per loop speedup ratio]
Varied the loop speedup ratio (sw time / hw time of the loop itself) to see the impact of a faster microprocessor or slower FPGA: 30, 20, 10 (base case), 5 and 2
Loop speedups of 5 or more work fine for the first few loops, and are not hard to achieve
Frank Vahid, UC Riverside 36
Ideal Energy Savings for Different Architectures
Varied the loop power ratio (FPGA power / microprocessor power) to account for different architectures: 2.5, 2.0, 1.5 (base case), 1.0
Energy savings are quite resilient to these variations
[Charts: energy savings vs. number of loops moved to hardware, one curve per power ratio]
Frank Vahid, UC Riverside 37
How is Automated Partitioning Done?
[Diagram: informal spec → system partitioning → sw spec (sw design) and hw spec (hw design) → processor + FPGA and ASIC; the later partitioning step onto the processor + FPGA is the focus]
Previous data obtained manually
Frank Vahid, UC Riverside 38
Source-Level Partitioning
The front-end converts code into an intermediate format, such as SUIF (Stanford University Intermediate Format)
The intermediate format is explored for hardware candidates
A binary is generated by assembling and linking; hw source is generated and synthesized into netlists
[Flow: SW source → compiler front-end → hw/sw partitioning → compiler back-end → assembly & object files → assembler & linker → binary (processor); hw source → synthesis → netlists (FPGA)]
Frank Vahid, UC Riverside 39
Problems with Source-Level Partitioning
Though technically superior, source-level partitioning disrupts the standard commercial tool flow significantly
Requires a special compiler (ouch!)
Multiple source languages, and changing source languages
How to deal with library code, assembly code, object code?
[Diagram: compiler front-end: C source → C SUIF compiler; C++ source → C++ SUIF compiler; Java source → ?]
Frank Vahid, UC Riverside 40
Binary Partitioning
[Flow: SW source → compilation → binary → hw/sw partitioning → hw source → synthesis → netlists (FPGA); assembler & linker produce an updated binary (processor)]
Source code is first compiled and linked to create a binary
Candidate hardware regions (a few small, frequent loops) are decompiled for partitioning
HDL is generated and synthesized, and the binary is updated to use the hardware
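One plausible way to find the candidate loops in a binary (a sketch under assumptions, not necessarily Triscend's or UCR's actual algorithm) is to treat taken backward branches as loop back-edges and rank them by profiled execution count:

```c
#include <stdint.h>

/* One profiled branch: its address, its target address, and how
   many times it was taken during profiling. */
typedef struct {
    uint32_t pc;
    uint32_t target;
    long taken;
} BranchRec;

/* Index of the most frequently taken backward branch (target at
   or before the branch itself, i.e., a loop back-edge), or -1 if
   there is none.  The region [target, pc] is then a candidate to
   decompile and synthesize to hardware. */
int hottest_loop(const BranchRec *br, int n) {
    int best = -1;
    for (int i = 0; i < n; i++)
        if (br[i].target <= br[i].pc &&
            (best < 0 || br[i].taken > br[best].taken))
            best = i;
    return best;
}
```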
Frank Vahid, UC Riverside 41
Binary-Level Partitioning Results (ICCAD’02)
Source-level:
• Average speedup: 1.5
• Average energy savings: 27%
• Average 4,361 gates
Binary-level:
• Average speedup: 1.4
• Average energy savings: 13%
• Large area overhead, averaging 10,325 gates
Frank Vahid, UC Riverside 42
Binary Partitioning Could Eventually Lead to Dynamic Hw/Sw Partitioning
Dynamic software optimization gaining interest, e.g., HP’s Dynamo
What better optimization than moving to FPGA?
Add a component on-chip that:
detects the most frequent sw loops,
decompiles a loop,
performs compiler optimizations,
synthesizes to a netlist,
places and routes the netlist onto a (simple) FPGA,
and updates the sw to call the FPGA
[Diagram: processor with I$ and D$, profiler, DMA, memory, and configurable logic on one chip]
A self-improving IC: can be invisible to the designer, appearing simply as an efficient processor
HARD! Much future work.
Frank Vahid, UC Riverside 43
Conclusions
Hardware/software partitioning can significantly improve software speed and energy
Single-chip microprocessor/FPGA platforms, increasing in popularity, make such partitioning even more attractive
A successful commercial tool is still on the horizon
Binary-level partitioning may help in some cases
Source-level can yield massive parallelism (Profs. Najjar/Payne)
Future dynamic hw/sw partitioning possible?
The distinction between sw and hw is continually being blurred!
Many people involved: Greg Stitt, Roman Lysecky, Shawn Nematbakhsh, Dinesh Suresh, Walid Najjar, Jason Villarreal, Tom Payne, several others…
Support from NSF, Triscend, and soon SRC… Exciting new directions!