l33:low power reconfigurable system jun-dong cho sungkyunkwan univ. dept. of ece, vada lab
Post on 30-Dec-2015
L33: Low Power Reconfigurable System
Jun-Dong Cho, SungKyunKwan Univ.
Dept. of ECE, Vada Lab. http://vada.skku.ac.kr
Answer IV: Reconfigurable Processor
• Configurable datapaths (e.g., splittable ALUs, complex operations)
• Configurable interconnect (e.g., nearest neighbor, k buses)
• SIMD processor, many functional units, preferably VLIW, possibly superscalar
ULTRA-LOW-POWER DOMAIN-SPECIFIC MULTIMEDIA PROCESSORS
• Arthur Abnous and Jan Rabaey
• Programmability requires a generalized computation, storage, and communication system that can be used to implement different kinds of algorithms
• Domain-specific processors give up some of the flexibility of a general-purpose programmable device to achieve higher levels of energy efficiency, while remaining flexible enough to handle a variety of algorithms within the domain
Flexibility vs. Energy-Efficiency
• There is a trade-off between efficiency and flexibility: programmable designs incur significant performance and power penalties compared to ASICs.
• For parallel signal processing algorithms, significant power savings can be achieved by executing the dominant computational kernels of a given class of applications with common features on dedicated, optimized processing elements with minimum energy overhead.
Application Domains
CELP-Based Speech Coding
• LPC Analysis and Synthesis
• Codebook Search
• Lag Computation
DCT-Based Video Compression and Decompression
• DCT and Inverse DCT
• Motion Estimation and Compensation
• Huffman Coding and Decoding
Baseband Processing for Digital Radios
• Demodulation, Channel Equalization
• Timing Recovery, Error Correction
The Re-configurable Terminal
Low-Power Multimedia Processing
• Hybrid, re-configurable architecture
– application-specific, parallelism, pipelining, locality, minimum control overhead, zero power when idle
• Task scheduling and miscellaneous functions on embedded core processor (low speed, minimum functionality)
• Standardized communication protocols reduce design cycle and enable high-level support
• Use extensive set of low-power circuit techniques
– Reduced swing, variable voltages and frequency, self-timing, locally generated clocks
Arithmetic Energy Profile: VSELP Speech Coder
Lag Computation + Basic Vector Filtering + Codebook Search = 76% of total time
Hybrid Architecture Template
The dominant, energy-intensive computational kernels of a given domain of algorithms are implemented as a set of independent, concurrent threads of computation on the satellite processors.
The Proposed Architecture, Arthur Abnous and Jan Rabaey, UC-Berkeley
Energy-Efficiency + Domain-Specific Programmability
Control Processor
• The main task of the control processor is to configure the satellite processors and the communication network and to manage the overall control flow of a given signal processing algorithm
• It uses the available satellite processors and the re-configurable interconnect to compose, in hardware, the data flow graph corresponding to a given kernel of computation
Overlay operation
• Control processor configures network and co-processors
• Co-processors operate in distributed “data-driven” mode
• At completion, control returns to the core processor for next reconfiguration
Satellite Processors
Elements of Energy-Efficiency
Multi-Processor Implementation
Communication Network
Distributed Data-Driven Control
Execution of a hardware module is triggered by the arrival of tokens. When there are no tokens to be processed at a given module, no switching activity occurs in that module.
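The token-triggered behaviour described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the `Module` class, its queues, and the `firings` counter (used here as a stand-in for switching activity) are hypothetical names introduced for illustration and are not taken from the slides.

```python
from collections import deque


class Module:
    """A module that fires only when every input queue holds a token."""

    def __init__(self, name, func, n_inputs):
        self.name = name
        self.func = func
        self.inputs = [deque() for _ in range(n_inputs)]
        self.consumers = []   # (module, port) pairs fed by this module
        self.results = []     # tokens produced when nothing is connected downstream
        self.firings = 0      # hypothetical proxy for switching activity

    def connect(self, consumer, port):
        self.consumers.append((consumer, port))

    def ready(self):
        return all(len(q) > 0 for q in self.inputs)

    def fire(self):
        args = [q.popleft() for q in self.inputs]
        out = self.func(*args)
        self.firings += 1
        if self.consumers:
            for module, port in self.consumers:
                module.inputs[port].append(out)
        else:
            self.results.append(out)


def run(modules):
    """Keep firing ready modules until no module has tokens to process."""
    progress = True
    while progress:
        progress = False
        for m in modules:
            if m.ready():
                m.fire()
                progress = True


mul = Module("mul", lambda a, b: a * b, 2)
add = Module("add", lambda x, y: x + y, 2)
idle = Module("idle", lambda x: x, 1)   # never receives a token
mul.connect(add, 0)

# Inject tokens for two evaluations of a*b + c.
for a, b, c in [(1, 2, 3), (4, 5, 6)]:
    mul.inputs[0].append(a)
    mul.inputs[1].append(b)
    add.inputs[1].append(c)

run([mul, add, idle])
print(add.results, mul.firings, idle.firings)
```

The `idle` module never fires because no token ever reaches it, which mirrors the point that a module without tokens shows no activity.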
Implementation of Handshaking
Single-Wire, Two-Phase Asynchronous Handshaking Protocol
Low Power Circuit Techniques
• Reduced swing interconnect (communication network, memories, programmable logic modules)
• On-chip dc-dc conversion + multiple supply voltages
• Locally synchronous, globally asynchronous
• Automatic power-down
• Optimized libraries (0.6 μm CMOS + Cadence/Synopsys design flow)
Power-Variable Performance
Design Methodology
Switching Activity Reduction
(a) Average activity in a multiplier as a function of the constant value
(b) Parallel and serial implementations of an adder tree
VSELP Synthesis Filter Mapped onto Satellite Processors
Mappings of VSELP Kernel
• The most energy efficient CELP-based speech algorithm
- dissipates 36 mW ( Vdd = 1.8V, 0.5 um CMOS)
- requires 23.4 MOPS
• Proposed VSELP speech coder
- 0.6 um CMOS
- dissipates under 5 mW
Case Studies
• Voice coder for cellular
• Video decoder
• Baseband radio modem
• Security - encryption processor
Architecture for vector dot product
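[Figure: two memories, each with an address generator (AddGen), feeding a MAC through the network (6 buses); input ports IP1 and IP2 and output port OP1; configuration bus with strobe, address (8) and data (16) lines; control signals Network Reset, Satellite Reset, Slow Mode and AutoAck Mode]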
• 0.6 μm CMOS process
• Supply voltage: 1.5 V
• Power estimation tool
– PowerMill
• 1 MAC, 2 SRAMs, 2 address generators, 2 external input ports, 1 external output port
• All data and address values are 16 bits.
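As a rough software analogue of the structure listed above (two memories, two address generators, one MAC), the following sketch streams two vectors through a single multiply-accumulate step. The function names and the unit-stride addressing are assumptions made for illustration; the slides do not specify the addressing patterns or the accumulator width.

```python
def address_generator(base, stride, count):
    """Yield `count` addresses starting at `base` with a fixed stride (assumed pattern)."""
    for i in range(count):
        yield base + i * stride


def dot_product(mem_a, mem_b, length):
    """Stream two operand vectors through one multiply-accumulate unit."""
    acc = 0
    gen_a = address_generator(0, 1, length)
    gen_b = address_generator(0, 1, length)
    for addr_a, addr_b in zip(gen_a, gen_b):
        # One MAC operation per pair of operands fetched from the two memories.
        acc += mem_a[addr_a] * mem_b[addr_b]
    return acc


print(dot_product([1, 2, 3], [4, 5, 6], 3))
```

Each loop iteration corresponds to one operand pair arriving at the MAC, so the number of iterations equals the vector length.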
Result
• The most energy efficient CELP-based speech algorithm
- dissipates 36 mW ( Vdd = 1.8V, 0.5 um CMOS)
- requires 23.4 MOPS
• Proposed VSELP speech coder
- 0.6 um CMOS
- dissipates under 5 mW
IIR Mapping
IIR Comparison
FFT Mapping
FFT Comparison
Result

FIR Results
Processor                 StrongARM  TMS320C2xx  TMS320LC54x  XC4003A  Pleiades
Frequency (MHz)           169        20          40           6        14
# of Multipliers          0.5        1           1            5        1
Throughput (cycle/tap)    17         1           1            0.2      1
Energy/tap (J)            37.4n      1.3n        600p         2.2n     205p

IIR Results
Processor                 StrongARM  TMS320C2xx  TMS320LC54x  XC4003A  Pleiades
Frequency (MHz)           169        20          40           2.1      14
# of Multipliers          0.5        1           1            9        2
Throughput (cycle/IIR)    114        20          13           1        8
Energy/IIR (J)            277n       19.1n       9.5n         103n     1.9n

FFT Results
Processor                 StrongARM  TMS320C2xx  TMS320LC54x  XC4003A  Pleiades
Frequency (MHz)           169        20          40           -        14
# of Multipliers          0.5        1           1            -        4
Throughput (cycle/stage)  766        152         76           -        8
Energy/stage (J)          1870n      131n        49.3n        -        13.3n

StrongARM: microprocessor [2]
TMS320C2xx: DSP chip [3,4,5,6]
TMS320LC54x: DSP chip [7,8,12]
XC4003A: FPGA chip [9,10]
Conclusions
• The StrongARM has the worst performance of all because it takes many instructions and cycles to execute a kernel in a highly sequential manner.
– The lack of a single-cycle multiplier exacerbates this problem.
– The other architectures have more internal parallelism, which allows them to achieve superior performance.
• Pleiades (architecture for vector dot product) does much better on the energy scale than the TI DSPs.
– DSPs are general-purpose, and instruction execution involves a great deal of overhead.
– Pleiades can create dedicated hardware structures tuned to the task at hand and executes operations with a small energy overhead.
• Pleiades outperforms the other processors by a large margin owing to its ability to exploit higher levels of parallelism by creating a dedicated parallel structure from its computational resources and flexible interconnect.
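To put the energy comparison in numbers, the short sketch below divides each processor's FIR energy per tap by the Pleiades figure. The values are the energy-per-tap entries as read from the FIR results in this transcript (37.4n, 1.3n, 600p, 2.2n, 205p); the dictionary and variable names are illustrative only.

```python
# Energy per FIR tap in joules, as read from the FIR results.
energy_per_tap = {
    "StrongARM": 37.4e-9,
    "TMS320C2xx": 1.3e-9,
    "TMS320LC54x": 600e-12,
    "XC4003A": 2.2e-9,
    "Pleiades": 205e-12,
}

# Ratio of each processor's energy per tap to that of Pleiades.
ratios = {name: e / energy_per_tap["Pleiades"] for name, e in energy_per_tap.items()}

for name, ratio in ratios.items():
    print(f"{name:12s} {ratio:7.1f}x")
```

On these figures the TI DSPs use roughly 3 to 6 times the energy per tap of Pleiades, and the StrongARM roughly 180 times.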
Reconfiguration for Power Saving in Real-Time Motion Estimation, S. R. Park, UMASS
Motion Estimation
Block Matching Algorithm
Configurable H/W Paradigms
Programmable Logic Modules
Why Hardware for Motion Estimation?
• Most computationally demanding part of video encoding
• Example: CCIR 601 format
– 720 by 576 pixels
– 16 by 16 macroblock (n = 16)
– 32 by 32 search area (p = 8)
– 25 Hz frame rate (f_frame = 25)
• 9 giga-operations/s are needed for the full-search block matching algorithm.
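The 9 giga-operations/s figure can be reproduced with a back-of-the-envelope count. The sketch below assumes (2p + 1)^2 candidate positions per macroblock and three operations per pixel comparison (subtract, absolute value, accumulate); both assumptions are common for full-search block matching but are not stated on the slide, and the function name is illustrative.

```python
def full_search_ops_per_second(width, height, n, p, frame_rate, ops_per_pixel=3):
    """Estimate operations per second for full-search block matching."""
    macroblocks = (width // n) * (height // n)   # macroblocks per frame
    candidates = (2 * p + 1) ** 2                # displacements in [-p, p] on each axis
    ops_per_block = candidates * n * n * ops_per_pixel
    return macroblocks * ops_per_block * frame_rate


ops = full_search_ops_per_second(width=720, height=576, n=16, p=8, frame_rate=25)
print(f"{ops / 1e9:.2f} GOPS")
```

With the CCIR 601 parameters above this gives about 8.99 giga-operations per second, consistent with the quoted value.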
Why Reconfiguration in Motion Estimation?
• Adjusting the search area at frame-rate according to the changing characteristics of video sequences
• Reducing Power Consumption by avoiding unnecessary computation
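A minimal full-search block matching routine with an adjustable search range p shows how a smaller search area avoids computation. It uses the sum of absolute differences (SAD) as the matching criterion, which is an assumption here since the slides do not name the cost function; the function names and the synthetic test frames are hypothetical.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between a block and a displaced reference block."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, n, p):
    """Return (dx, dy, evaluated) for the best match within +/-p pixels."""
    height, width = len(ref), len(ref[0])
    best = None
    evaluated = 0
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            # Skip candidates that fall outside the reference frame.
            if not (0 <= by + dy and by + dy + n <= height
                    and 0 <= bx + dx and bx + dx + n <= width):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            evaluated += 1
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best[1], best[2], evaluated


# Synthetic 12x12 frames: the current frame is the reference shifted by (2, 1).
ref = [[(7 * x + 13 * y) % 31 for x in range(12)] for y in range(12)]
cur = [[ref[min(y + 1, 11)][min(x + 2, 11)] for x in range(12)] for y in range(12)]

wide = full_search(cur, ref, bx=4, by=4, n=4, p=2)
narrow = full_search(cur, ref, bx=4, by=4, n=4, p=1)
print(wide, narrow[2])
```

Halving p from 2 to 1 cuts the number of SAD evaluations from 25 to 9 in this toy case, at the cost of missing the true displacement when it lies outside the reduced search area.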
Motion Vector Distributions
Architecture for Motion Estimation
From P. Pirsch et al., VLSI Architectures for Video Compression, Proc. of IEEE, 1995
Re-configurable Architecture for ME
Power Estimation in Reconfigurable Architecture
Power vs Search area
Resource Reuse in FPGAs
Conclusion
• By adjusting the search area according to the changing characteristics of a picture, power can be saved. Further power savings can be achieved by using freed-up resources as local memory
• Extension of the Adaptive Search Space Method to software implementation
– Varying p still reduces computation and hence power
– Resource reuse may also be applicable in S/W implementation by freeing up cache space and compute power for more power-efficient use of memory
Future Works
• Reconfiguration to support more sophisticated motion estimation algorithms (intelligent search, object-based, ...)
• More detailed performance studies over a wider range of video sequences
• Generalization of this concept to other algorithms and architectures (not just video)
• Modification to FPGA architectures to support the use of logic and configuration cells as local memory
Motion Estimation - Conventional
Motion Estimation - Data Reuse
[Equations: P_a and P_b expressed in terms of P_add and P_abs]
Therefore, the power reduction factor is 11%.
Kernel Scheduling in Reconfigurable Computing
• R. Maestre, F. J. Kurdahi, N. Bagherzadeh, H. Singh, R. Hermida, M. Fernandez, Design, Automation and Test in Europe (DATE 99), Munich, Germany, Mar. 1999
The Partition
Partitioning finds subsets of kernels that may be scheduled (executed) independently of the other kernels.
Partitioning of the application DFG
The Scheduling
After partitioning, detailed scheduling is performed within each partition.
Scheduling within a given partition
The Major Criteria
[Figure: (a) MPEG sequence (ME, MC, DCT, Q, IQ, IDCT, IMC) with the number of contexts and the granularity of computation of each kernel; (b) a possible schedule of an image frame; (c) an alternative schedule]
• Context reloading
– Minimize
• Data reuse
– Maximize
• Computation and data movement overlapping
– Maximize
Scheduling
[Figure: timeline of a possible schedule for Partition = { k1, k2, k3 }, with rows for CM, FB set 1 and FB set 2 along a time axis, showing context loads (C), data loads (D), kernel computations (K), result reads (R), overlap intervals (kc) and idle time]
• α^i = event α in iteration i
• k_i = computation time
• kc_i = possible overlap of computation and context loading
• C_i = context loading time
• D_i = data loading time
• R_i = result reading time
< Execution model representation >
Algorithm
[Figure: kernels Ki, Kj, Km, Kp with connections numbered 1 to 4; successive states LEE = ∅, { 1 }, { 1, 2 }, { 1, 3 }, { 1, 3, 4 }, { 1, 4 }, with BC = TRUE in each state]
< Some steps of an exploration sequence >
References
[1] A. Abnous and J. Rabaey, “Ultra-Low-Power Domain-Specific Multimedia Processors”, Proceedings of the IEEE VLSI Signal Processing Workshop, San Francisco, Oct. 1996.
[2] Digital Semiconductor, Digital Semiconductor SA-110 Microprocessor Technical Reference Manual, Digital Equipment Corporation, 1996.
[3] TMS320C5x General-Purpose Application User’s Guide, Literature Number SPRU164, TI, 1997.
[4] T. Anderson, The TMS320C2xx Sum-of-Products Methodology, Technical Application Report SPRA068, TI, 1996.
[5] M. Tsai, IIR Filter Design on the TMS320C54x DSP, Technical Application Report SPRA079, TI, 1996.
[6] Ftp://ftp.ti.com/pub/tms320bbs/c5xxfiles/54xffts.exe, ‘C54x Software Support Files, TI.
[7] C. Turner, Calculation of TMS320LS54x Power Dissipation, Technical Application Report SPRA164, TI, 1997.
[8] C. Turner, Calculation of TMS320LS54x Power Dissipation, Technical Application Report SPRA088, TI, 1996.
[9] E. Kusse, Personal communication, 1996.
[10] J. Rabaey et al., “Fast Prototyping of Data Path Intensive Architecture”, IEEE Design & Test Magazine, Vol. 8, No. 2, pp. 40-51, 1991.
[11] J. Montanaro et al., “A 160-MHz, 32-b, 0.5-W CMOS RISC Microprocessor”, IEEE Journal of Solid-State Circuits, Vol. 31, No. 11, pp. 1703-1714, Nov. 1996.
[12] A. Fischman and P. Rowland, Designing Low-Power Applications with TMS320LC54x, Technical Application Report SPRA281, TI, 1997.
[13] Daniel D. Gajski, Nikil D. Dutt, Allen C-H Wu, Steve Y-L Lin, “High-Level Synthesis: Introduction to Chip and System Design”, Kluwer Academic Publishers, 1992.
[14] Duncan A. Buell, Jeffrey M. Arnold, Walter J. Kleinfelder, “Splash 2: FPGAs in a Custom Computing Machine”, IEEE Computer Society Press, Los Alamitos, California.
[15] Jonathan Babb, Russell Tessier, Mathew Dahl, Silvina Zimi Hanono, David M. Hoki, and Anant Agarwal, “Logic Emulation with Virtual Wires”, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 16, No. 6, June 1997.
[16] M. Vasilko, Djamel Ait-Boudaoud, “Architectural Synthesis Techniques for Dynamically Reconfigurable Logic”, Field Programmable Logic: Smart Applications, New Paradigms and Compilers, Proceedings of 6th Int. Workshop on Field Programmable Logic and Applications, FPL 96, Darmstadt, Germany, Sept. 23-25, 1996.
[17] Patrick Lysaght, Gordon McGregor and Jonathan Stockwood, “Configuration Controller Synthesis for Dynamically Reconfigurable Systems”, IEE Colloquium on Hardware-Software Cosynthesis for Reconfigurable Systems, 1996.
[18] M. Vasilko, Djamel Ait-Boudaoud, “Scheduling for Dynamically Reconfigurable FPGAs”, Proceedings of International Workshop on Logic and Architecture Synthesis, pp. 328-336, IFIP TC10 WG10.5, Dec. 18-19, 1995.
[19] Doug Smith, Dinesh Bhatia, “RACE: Reconfigurable and Adaptive Computing Environment”, Field Programmable Logic: Smart Applications, New Paradigms and Compilers, Proceedings of 6th Int. Workshop on Field Programmable Logic and Applications, FPL 96, Darmstadt, Germany, Sept. 23-25, 1996. See http://www.ececs.uc.edu/ ~ dal.
[20] Xilinx Netlist Format (XNF) Specification, Version 6.1, June 1, 1995.
[21] Xilinx XABEL reference manual.