system/behavioural level low power design
TRANSCRIPT
Dimitrios Soudris, NTUA Low Power Design Course
Power savings in terms of Design Level
[Figure: design levels (System, Behavior, RT, Logic, Transistor, Layout) annotated with achievable power savings of 10-20x, 2-5x and 20-50%; an arrow is labelled "Increasing power savings".]
Techniques to reduce supply voltage
• Algorithm: transformations to exploit concurrency
• Architecture: parallelism and pipelining
• Circuit/Logic: transistor sizing, fast logic structures
• Technology: threshold voltage reduction, feature size scaling
Techniques to minimize the switched capacitance
• System: partitioning, power-down, power states
• Algorithm: complexity, concurrency, regularity, locality, data representation
• Architecture: concurrency, instruction set selection, signal correlations, data representation, data encoding
• Circuit/Logic: transistor sizing, logic optimization, power down, layout optimization
• Technology: advanced packaging, SOI
Instruction Level Power Estimation methodology
[Flow diagram; boxes listed in the order they appear in the transcript:]
• Assembly/Machine Code
• Determination of Basic Blocks
• Execution Profiling
• Base Cost Table
• Global Program Cost Estimation
• Cache Penalty Estimation (Cache Simulation)
• Final Program Cost
• Stall Analysis
• Basic Block Cost Estimation
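The boxes of the flow can be combined into a small cost calculation. The following is a minimal sketch, assuming that each basic block's cost is the sum of per-instruction base costs plus a stall term, that execution profiling supplies how often each block runs, and that the cache penalty is added at the end. The instruction names, cost values and the helpers `basic_block_cost` and `program_cost` are hypothetical and are not taken from the slides.

```python
# Hypothetical base-cost table: energy units per instruction (illustrative values).
BASE_COST = {"load": 4.0, "add": 1.0, "mul": 3.0, "branch": 2.0}
STALL_COST = 0.5           # hypothetical cost per stall cycle
CACHE_MISS_PENALTY = 10.0  # hypothetical cost per cache miss


def basic_block_cost(instructions, stall_cycles):
    """Basic block cost: base costs from the table plus the stall analysis term."""
    return sum(BASE_COST[op] for op in instructions) + STALL_COST * stall_cycles


def program_cost(blocks, exec_counts, cache_misses):
    """Global cost weights each block by its profiled execution count;
    the final cost adds the cache penalty estimate."""
    global_cost = sum(
        basic_block_cost(block["instructions"], block["stalls"]) * exec_counts[name]
        for name, block in blocks.items()
    )
    return global_cost + CACHE_MISS_PENALTY * cache_misses


# Hypothetical program: an initialisation block and a loop body.
blocks = {
    "init": {"instructions": ["load", "load"], "stalls": 0},
    "loop": {"instructions": ["load", "mul", "add", "branch"], "stalls": 1},
}
exec_counts = {"init": 1, "loop": 100}  # from execution profiling
total = program_cost(blocks, exec_counts, cache_misses=5)
```

In this sketch the loop body dominates the total because it is weighted by its execution count, which is why profiling enters before the global estimate.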
Power consumption of microprocessors
[Figure: bar charts of power (watts) for three microprocessors in normal operation and in reduced-power modes. PowerPC 603 (modes shown: normal operation, Doze, Nap): bar values 2.2, 0.366, 0.135, 0.047. MIPS 4200 (normal operation, reduced power): bar values 1.5, 0.4. Hitachi SH7032 (normal operation, Doze): bar values 130, 50, 6.6.]
Power consumption of transfer and storage over datapath operations, both in hardware and software
[Figure, first panel: relative energy per operation]
• 16-bit carry-select adder: 1
• 16-bit multiplier: 3.6
• 8x128x16 SRAM (read): 4.4
• 8x128x16 SRAM (write): 9
• External I/O access: 10
• 16-bit memory access: 33
[Figure, second panel: relative energy (axis 0.0 to 0.4) of storage, interconnect and other RISC components.]
Variable-Voltage Techniques
• Existing low-power techniques
– static variable-voltage techniques
– efficient design of the voltage converter
– dynamic variable-voltage techniques
System for Variable Voltage Supplies
[Figure: block diagram with a Workload Filter, Rate Controller, Voltage Regulator, Ring Oscillator and FIFO around an arbitrary synchronous DSP; the generated signals are VDD and clock.]
Architecture Power Optimization Techniques
• Architecture-driven voltage reduction: The key idea is to speed up the circuit so that the supply voltage can be reduced while still meeting throughput constraints. The speed-up can be obtained by introducing parallelism in hardware or by inserting flip-flops (pipelining)
• Switching activity minimization: Prevent the generation and propagation of spurious transitions, or reduce the number of transitions, e.g. by retiming, path balancing, or the choice of data representation
• Switched capacitance minimization: Reduce the capacitance that is switched in each operation
• Dynamic power management: Under certain conditions a part of the circuit becomes inactive, so unnecessary calculations can be avoided, e.g. with gated clocks, operand isolation, pre-computation, and guarded evaluation
Architecture Trade-offs: Reference Data Path
• Critical path delay: $T_{adder} + T_{comparator} = 25$ ns, so $f_{ref} = 40$ MHz
• Total capacitance being switched: $C_{ref}$
• $V_{dd} = V_{ref} = 5$ V
• Power for the reference datapath: $P_{ref} = C_{ref} V_{ref}^2 f_{ref}$
Voltage Reduction Technique: Parallelism
• The clock rate can be reduced by half with the same throughput: $f_{par} = f_{ref}/2$
• $V_{par} = V_{ref}/1.7$, $C_{par} = 2.15\,C_{ref}$
• $P_{par} = (2.15\,C_{ref})(V_{ref}/1.7)^2 (f_{ref}/2) \approx 0.36\,P_{ref}$
Voltage Reduction Technique: Pipeline
• $f_{pipe} = f_{ref}$, $C_{pipe} = 1.1\,C_{ref}$, $V_{pipe} = V_{ref}/1.7$
• Voltage can be dropped while maintaining the original throughput
• $P_{pipe} = C_{pipe} V_{pipe}^2 f_{pipe} = (1.1\,C_{ref})(V_{ref}/1.7)^2 f_{ref} \approx 0.37\,P_{ref}$
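The parallel and pipelined power figures can be checked numerically from P = C V² f. The sketch below uses a hypothetical helper, `relative_power`, that returns P / Pref from the capacitance ratio, the voltage divisor and the frequency ratio. With the divisor taken as exactly 1.7, the ratios come out slightly above the quoted 0.36 and 0.37. The quoted values are reproduced if the reduced supply is about 2.9 V (5 V / 2.9 V ≈ 1.72), which is an assumption made here and is not stated on the slides.

```python
def relative_power(c_ratio, v_divisor, f_ratio):
    """Return P / P_ref for P = C * V^2 * f, given C/C_ref, V_ref/V and f/f_ref."""
    return c_ratio * (1.0 / v_divisor) ** 2 * f_ratio


# Values from the slides, with the voltage divisor taken as exactly 1.7.
parallel_rounded = relative_power(2.15, 1.7, 0.5)
pipeline_rounded = relative_power(1.10, 1.7, 1.0)

# Assumption: the reduced supply is 2.9 V, so the divisor is 5 / 2.9.
parallel_assumed = relative_power(2.15, 5.0 / 2.9, 0.5)
pipeline_assumed = relative_power(1.10, 5.0 / 2.9, 1.0)

for name, value in [
    ("parallel, divisor 1.7", parallel_rounded),
    ("pipeline, divisor 1.7", pipeline_rounded),
    ("parallel, 2.9 V supply", parallel_assumed),
    ("pipeline, 2.9 V supply", pipeline_assumed),
]:
    print(f"{name}: {value:.3f} P_ref")
```

Either way, both techniques cut power to a little over a third of the reference, and the difference between them comes from the capacitance overhead (2.15x at half the clock rate versus 1.1x at the full rate).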
Logic Style and Power Consumption
• Power-delay product improves as voltage decreases
• The “best” logic style minimizes power-delay for a given delay constraint
Data representation
• Sign-extension activity significantly reduced using sign-magnitude representation
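The effect can be illustrated by counting bit toggles on a 16-bit bus for a small signal that oscillates around zero. This is a minimal sketch; the sample sequence, the word width and the helpers `twos_complement`, `sign_magnitude` and `toggles` are illustrative assumptions, not data from the course.

```python
def twos_complement(value, bits=16):
    """Encode a signed integer as a two's complement bit pattern."""
    return value & ((1 << bits) - 1)


def sign_magnitude(value, bits=16):
    """Encode a signed integer as a sign bit followed by its magnitude."""
    sign = 1 << (bits - 1) if value < 0 else 0
    return sign | abs(value)


def toggles(words):
    """Count the bit positions that change between consecutive words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


# Hypothetical small-amplitude signal crossing zero on every sample.
samples = [1, -1, 2, -2, 1, -1]
tc_toggles = toggles([twos_complement(v) for v in samples])
sm_toggles = toggles([sign_magnitude(v) for v in samples])
print("two's complement toggles:", tc_toggles)
print("sign-magnitude toggles:", sm_toggles)
```

In two's complement every zero crossing flips all the sign-extension bits, whereas in sign-magnitude only the sign bit and the few magnitude bits change.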
Glitching activity reduction (1)
• Depends heavily on the topology of the circuit
• Circuit topology is also important for the clock selection
• The selection of the clock period used during scheduling may affect the glitching activity, since large values of the clock period lead to schedule chains with many functional units
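[Figure: two adder topologies for A + B + C + D: (a) a chain of three adders, ((A + B) + C) + D; (b) a balanced tree, (A + B) + (C + D).]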
Glitching activity reduction (2)
• Sometimes the architecture topology is not detailed
• RTL transformations for reducing glitching activity:
– Architectural delay balancing using buffers and transparent latches
– Use of the clock signal to suppress glitchy transitions
– Selective delay insertion to minimize glitch propagation
– Multiplexer decomposition and multiplexer tree structuring to eliminate the use of glitchy control signals and to minimize glitch propagation on data and control signals
Glitching activity reduction (3)
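Function:
if (x < y) then
z = c + d
else
z = a + b
[Figure: two architectures implementing the function with a comparison of x and y, 0/1 multiplexers and operands a, b, c, d, producing z.]
ARCHITECTURE 1
Power consumption without glitches: 823.9 μW
Power consumption with glitches: 1650 μW
ARCHITECTURE 2
Power consumption without glitches: 951.7 μW
Power consumption with glitches: 1357.7 μW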
Signals and Operations Reordering
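• Example: complex multiplication, trading a multiplication for an addition
[Figure: (a) direct form with four multiplications, Yr = Ar·Xr − Ai·Xi and Yi = Ai·Xr + Ar·Xi; (b) reordered form with three multiplications, built from the terms (Ai − Ar)·Xr, (Ai + Ar)·Xi and Ar·(Xr + Xi).]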
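The reordering can be checked with a few lines of code. The sketch below assumes that A = Ar + j·Ai is a constant coefficient, so that Ai − Ar and Ai + Ar can be computed ahead of time; the function names are illustrative.

```python
def complex_mul_direct(ar, ai, xr, xi):
    """Direct form: 4 multiplications, 2 additions/subtractions."""
    yr = ar * xr - ai * xi
    yi = ai * xr + ar * xi
    return yr, yi


def complex_mul_reordered(ar, ai, xr, xi):
    """Reordered form: 3 multiplications, with one shared product."""
    diff = ai - ar   # precomputable when A is a constant coefficient
    total = ai + ar  # precomputable when A is a constant coefficient
    shared = ar * (xr + xi)
    yr = shared - total * xi
    yi = shared + diff * xr
    return yr, yi


# Both forms give the same result on integer test vectors.
for operands in [(3, 4, 5, 6), (-2, 7, 1, -9), (0, 1, 8, 8)]:
    assert complex_mul_direct(*operands) == complex_mul_reordered(*operands)
```

The saving is one multiplier at the cost of one extra adder, which usually pays off because a multiplication switches far more capacitance than an addition.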
Module Selection
[Figure: (a), (c), (d) data-flow graphs whose multiplications (*i, *ii, *iii) and additions (+i, +ii) are bound to different library modules; (b) RTL library:]
• Ripple adder: Area = 2744, Latency = 30 ns, Power = 1199 μW
• Carry-lookahead adder: Area = 3959, Latency = 20 ns, Power = 1467 μW
• Array multiplier: Area = 16185, Latency = 60 ns, Power = 18540 μW
• Wallace multiplier: Area = 18443, Latency = 40 ns, Power = 23545 μW
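Module selection can be viewed as a small constrained search over the RTL library. The sketch below is illustrative only. It assumes the library figures pair with the modules as ripple adder (30 ns, 1199 μW), carry-lookahead adder (20 ns, 1467 μW), array multiplier (60 ns, 18540 μW) and Wallace multiplier (40 ns, 23545 μW). It considers a hypothetical path of one multiplication followed by one addition, and it uses the sum of module powers as a crude cost. The function `select_modules` is not part of any tool.

```python
from itertools import product

# (area, latency in ns, power in uW); the pairing with modules is an assumption.
LIBRARY = {
    "adder": {
        "ripple": (2744, 30, 1199),
        "carry_lookahead": (3959, 20, 1467),
    },
    "multiplier": {
        "array": (16185, 60, 18540),
        "wallace": (18443, 40, 23545),
    },
}


def select_modules(latency_bound_ns):
    """Pick the adder/multiplier pair with the lowest summed power whose
    combined latency (multiply, then add) meets the bound; None if infeasible."""
    best = None
    for (add_name, add), (mul_name, mul) in product(
        LIBRARY["adder"].items(), LIBRARY["multiplier"].items()
    ):
        latency = add[1] + mul[1]
        power = add[2] + mul[2]
        if latency <= latency_bound_ns and (best is None or power < best[2]):
            best = (add_name, mul_name, power, latency)
    return best


relaxed = select_modules(90)  # slow, low-power modules are acceptable
tight = select_modules(70)    # forces the faster multiplier
```

Tightening the latency bound pushes the selection toward the faster, more power-hungry modules, which is the trade-off the library figures express.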
Power Management Techniques in RT-level
• Power management reduces unnecessary transitions under certain conditions
• Power Management Techniques
– Clock-Based Power Management: automatic synthesis of gated-clock circuits, clock gating techniques for data path registers, clock tree construction to facilitate clock gating, and power management using multiple non-overlapping clocks
Power Management Techniques in RT-level (cont’d)
– Pre-computation
– Operand Isolation
• Guarded evaluation
• Operand isolation in the context of high-level synthesis
– Dynamic Frequency Scaling
The concept of gating clock signals
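[Figure: (a) a register REG fed through a 0/1 multiplexer, with a comparator (<) and signals X, Y, A and B; (b) two schemes (scheme 1 and scheme 2) that derive a gated clock from the clock and the comparator output; (c) waveforms of clock, comparator output, gated clock (scheme 1) and gated clock (scheme 2), with one clock period marked.]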
Automatic synthesis of gated clocks
• Reactive systems wait for a certain event to occur before changing state. During the wait periods the outputs of the system do not change, and if the system is still clocked, power is wasted. This method recognizes these idle states and inserts the appropriate logic to stop the clock
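[Figure: a finite-state machine (inputs IN, combinational logic, STATE register, outputs OUT, clock CLK) and its gated-clock version, in which an activation function Fa and a latch L generate the gated clock GCLK from CLK.]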
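The idea of stopping the clock in idle states can be sketched with a behavioural model. The three-state controller below is hypothetical; `activation` plays the role of the activation function (true when the next state equals the current state), and the simulation counts how many clock pulses reach the state register with and without gating.

```python
def next_state(state, start):
    """Hypothetical reactive controller: waits in IDLE until start is 1."""
    if state == "IDLE":
        return "RUN1" if start else "IDLE"
    if state == "RUN1":
        return "RUN2"
    return "IDLE"  # RUN2 returns to IDLE


def activation(state, start):
    """True when the state would not change, so the clock can be stopped."""
    return next_state(state, start) == state


def simulate(inputs):
    """Return (pulses without gating, pulses with gating) for an input trace."""
    free_state = gated_state = "IDLE"
    free_pulses = gated_pulses = 0
    for start in inputs:
        # Free-running clock: the state register is clocked every cycle.
        free_state = next_state(free_state, start)
        free_pulses += 1
        # Gated clock: the register is clocked only when the state must change.
        if not activation(gated_state, start):
            gated_state = next_state(gated_state, start)
            gated_pulses += 1
        assert free_state == gated_state  # gating must not change behaviour
    return free_pulses, gated_pulses


pulses = simulate([0, 0, 0, 1, 0, 0, 0, 0, 1, 0])
```

The longer the controller waits for its event, the larger the share of clock pulses that gating removes, which is why the technique targets reactive systems.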
Gated-clock techniques for data path registers
• The aim is to determine the conditions under which the register retains or re-loads its value
• The condition can be expressed in terms of the select signals connected to the individual multiplexers along the path
[Figure: a register fed by a chain of 0/1 multiplexers with select signals constr(0), constr(1) and constr(2).]
Clock tree design to derive gated-clock signals
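[Figure: a gated-clock cell that combines the clock with an idle condition, and two clock trees, (a) and (b), that distribute the clock through nodes A and B to registers R1 to R4 with idle conditions x1, x2, x3 and x4; in one of the trees R3 and R4 are gated by x1+x3 and x2+x4.]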
Power Management Using Multiple Non-Overlapping Clocks
• With gated clocks, the clock signals that feed the various sub-circuits are suppressed whenever the registers in those sub-circuits do not need to load a new value. In general, the cycles during which clock transitions are suppressed need not follow any regular pattern, since the suppression is data-dependent. Some types of designs, however, contain sub-circuits whose idle clock cycles follow a simple, regular pattern. For example, a component may be active and idle in alternating clock cycles. If the cycles in which a sub-circuit is idle follow a regular pattern, the clock generation circuitry need not be data-dependent.
Pre-computation
• Pre-computation [Ald94] is another RT-level and gate-level power management technique. It relies on duplicating part of the logic in order to pre-compute the output values one clock cycle earlier than required. When this succeeds, the original logic is turned off in the next clock cycle, eliminating activity in its internal nodes. For pre-computation to achieve power savings, there must be combinational blocks for which a relatively large percentage of the output values can be pre-computed by a significantly smaller block
Pre-computation: Example (1)
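[Figure: the original circuit, with inputs X1 … XN registered in R1 and feeding combinational block A, whose output f is registered in R2; and the pre-computation architecture, in which predictor functions g1 and g0 drive two flip-flops (FF) and the load-enable signal LE of R1.]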
Pre-computation: Example (2)
• The Boolean functions g1 and g0 serve as the predictor functions of the whole architecture, according to the following implications:
g1 = 1 ⇒ f = 1
g0 = 1 ⇒ f = 0
• Therefore, if either g1 or g0 is high during clock cycle T, the load enable signal (LE) goes low, and the inputs to block A are prevented from changing during clock cycle T+1. Hence no gate output transitions occur inside block A, while the correct output value for the next time frame is provided by the two registers located at the outputs of g1 and g0
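A commonly used illustration of predictor functions is an n-bit comparator f = (C > D), where the most significant bits alone decide the result whenever they differ. The slides do not say which function f is, so the comparator, the bit width and the helper names below are assumptions. The sketch checks the two implications exhaustively and counts how often block A could be disabled.

```python
def f(c, d):
    """Block A: full comparison of two unsigned operands."""
    return int(c > d)


def predictors(c, d, bits):
    """Predictors built from the most significant bits only."""
    c_msb = (c >> (bits - 1)) & 1
    d_msb = (d >> (bits - 1)) & 1
    g1 = c_msb & (1 - d_msb)  # C has the MSB set and D does not, so C > D
    g0 = (1 - c_msb) & d_msb  # D has the MSB set and C does not, so C < D
    return g1, g0


def check(bits=4):
    """Verify g1 = 1 => f = 1 and g0 = 1 => f = 0 over all operand pairs.
    Returns (input pairs for which LE is low, total number of pairs)."""
    disabled = 0
    total = 0
    for c in range(1 << bits):
        for d in range(1 << bits):
            g1, g0 = predictors(c, d, bits)
            if g1:
                assert f(c, d) == 1
            if g0:
                assert f(c, d) == 0
            disabled += g1 | g0  # LE goes low when either predictor is high
            total += 1
    return disabled, total


disabled_pairs, all_pairs = check(bits=4)
```

For uniformly distributed operands the predictors fire for half of all input pairs, and they need only two input bits, which matches the requirement that a much smaller block pre-computes a large share of the outputs.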
The concept of operand isolation
• In operand isolation, transparent latches are inserted at all the inputs of an embedded logic block, and control circuitry is added to detect the idle conditions for the block. When the block is not required to perform any useful operation, the transparent latches at its inputs are disabled and retain the previous cycle's values, avoiding unnecessary power dissipation in the idle block
[Figure: an embedded block inside combinational logic, with transparent latches at its inputs controlled by circuitry that detects the idle condition.]
Guarded evaluation
• Guarded evaluation [Tiwari95] is a shut-down technique at the RT and gate level that does not require additional logic to be synthesized to implement the shut-down mechanism; instead, it exploits signals that already exist in the original circuit. The approach is based on placing transparent latches with an enable signal at the inputs of each block of the circuit that needs to be power managed.
[Figure: a circuit with block F and signals X, Y, Z and O, shown in its original form and with latches enabled by S'.]
Operand isolation during high-level synthesis: RTL circuit (1)
• For functional units that have one or more idle controller states, it is possible to insert transparent latches at the functional unit’s inputs to perform operand isolation. The latch enable signals for the latches at the inputs of a functional unit can be derived directly from its idle controller states
• The expressions for the latch enable signals LE1, …, LE4 are:
• LE1 = LE3 = x4
• LE2 = x1 + x2
• LE4 = x2 + x3
Operand isolation during high-level synthesis: RTL circuit (2)
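[Figure: RTL circuit with operand isolation. Functional units and their latch enables: MUL1 (*1, *3, *5) with LE1, SUB1 (-1, -2) with LE2, MUL2 (*2, *4, *6) with LE3, ADD1 (+1, +2) with LE4, and comparator CMP (<1). Registers R1 to R5 hold the variables (v1, v5, v6), (u, u1, v7), (v2, v3, v4), (y, y1) and (x, x1). A control FSM with states s1 to s4, inputs reset and c1, and outputs contr(1) … contr(13) drives the transparent latches. Operations shown in the FSM states: c1 = a < x1, x1 = x + dx, y1 = y + v4, v2 = 3*x, v3 = 3*y, v4 = u*dx, v = u - v5, u1 = v7 - v6, v1 = u*dx, v5 = v1*v2, v6 = v3*dx.]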
Glitching in Static CMOS
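[Figure: a two-gate static CMOS circuit with inputs A, B and C, internal node X and output Z; waveforms of X and Z for the input transition ABC = 101 → 000 under a unit-delay model.]
Also called: dynamic hazards
Observe: No glitching in dynamic circuits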
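A unit-delay simulation makes the glitch visible. The transcript does not preserve the gate types, so the sketch below assumes two cascaded NOR gates, X = NOR(A, B) and Z = NOR(X, C). That structure is an assumption chosen because it produces a spurious pulse for the input change ABC = 101 → 000.

```python
def nor(a, b):
    """Two-input NOR gate."""
    return int(not (a or b))


def simulate(steps=4):
    """Unit-delay simulation of X = NOR(A, B), Z = NOR(X, C).
    Returns the list of (X, Z) values, starting from the settled state."""
    a, b, c = 1, 0, 1  # settled state for ABC = 101
    x = nor(a, b)
    z = nor(x, c)
    trace = [(x, z)]
    a, b, c = 0, 0, 0  # the inputs switch to 000
    for _ in range(steps):
        # Each gate output reflects its inputs from one time unit earlier,
        # so Z is computed from the previous value of X.
        x, z = nor(a, b), nor(x, c)
        trace.append((x, z))
    return trace


trace = simulate()
z_values = [z for _, z in trace]
z_transitions = sum(p != q for p, q in zip(z_values, z_values[1:]))
```

Z starts and ends at 0 but pulses to 1 for one time unit, because the falling C reaches the second gate one gate delay before the rising X does. Both transitions of that pulse dissipate power without contributing to the final result.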
Low Power Design Course
RTL Power Estimation Techniques
RTL Estimation Classification
• RTL Power Estimation
– Analytical Methods: Complexity-based Models, Information theoretic-based Models
– Empirical Methods: Constant-Activity Models, Variable Activity-based Models
RTL Estimation Methods
• Analytical Methods
– attempt to relate the power consumption of a particular RTL description to fundamental quantities that describe the physical capacitance and activity of a design
• Empirical Methods
– the strategy is to “measure” the power consumption of existing implementations and produce a model based on those measurements. These techniques employ the so-called macromodelling approach to architectural power estimation
RTL Macromodelling (1)
• An RTL power estimation flow consists of the following steps:
– Characterize every component in the high-level design library by simulating it under pseudo-random data and fitting the power macro-model equation to the power dissipation results using a least-mean-square-error fit
– Extract the variable values for the macro-model equation from either static analysis of the circuit structure and functionality, or by performing a behavioral simulation. In the latter case, a power co-simulator linked with a standard RTL simulator can be used to collect input data statistics for various RTL modules in the design
RTL Macromodelling (2)
– Evaluate the power macro-model equations for high-level design components which are found in the library by plugging the parameter values in the corresponding macro-model equations
– Estimate the power dissipation for random logic or interface circuitry by simulating the gate-level description of these components, or by performing probabilistic power estimation. The low level simulation can be significantly sped up by the application of statistical sampling techniques
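The characterize-then-evaluate steps can be sketched with a one-parameter macro-model. The linear form P = c0 + c1 · activity, the characterization data and the helper names are assumptions made for illustration; real macro-models typically use several parameters (for example input and output activities and signal probabilities).

```python
def fit_macromodel(activities, powers):
    """Least-squares fit of P = c0 + c1 * activity (closed-form linear regression)."""
    n = len(activities)
    mean_a = sum(activities) / n
    mean_p = sum(powers) / n
    sxx = sum((a - mean_a) ** 2 for a in activities)
    sxy = sum((a - mean_a) * (p - mean_p) for a, p in zip(activities, powers))
    c1 = sxy / sxx
    c0 = mean_p - c1 * mean_a
    return c0, c1


def estimate_power(coefficients, activity):
    """Evaluate the fitted macro-model for an activity value taken from simulation."""
    c0, c1 = coefficients
    return c0 + c1 * activity


# Hypothetical characterization data for one library component:
# average input switching activity versus simulated power (arbitrary units).
activities = [0.1, 0.2, 0.3, 0.4, 0.5]
powers = [3.1, 3.9, 5.2, 5.8, 7.0]

coefficients = fit_macromodel(activities, powers)
# Activity value as it might be collected by a behavioural simulation.
estimate = estimate_power(coefficients, 0.25)
```

The expensive low-level simulation is paid once per library component during characterization; every later estimate is a cheap equation evaluation.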
Low Power Design Course
Low-Power Design at the Logic Level
Retiming: Flip-flop insertion to minimize hazard activity
• The method is based on repositioning the flip-flops in the circuit so as to minimize either the number of flip-flops or the delay through the longest pipeline stage
[Figure: a gate g driving load capacitance CL, shown without and with a register R at the gate output.]
Two-Level Logic Circuits Switching Activity Minimization (1)
• Taking into account the static and transition probabilities (i.e. temporal correlation) of the primary inputs, we can insert additional input signals into certain gates of the first logic level (i.e. the AND gates), resulting in reduced switching activity
• Appropriately-selected input signals force the outputs of the AND gates to logic level zero for a number of combinations of the binary input signals
Two-Level Logic Circuits Switching Activity Minimization (2)
• Example: Signal x3 exhibits low transition probability and high static-1 probability, while the signals x0, x1, and x2 are characterized by high transition probabilities
[Figure: the initial logic circuit, with AND gates g1, g2 and g3 computing y1 = x0 x1, y2 = x0 x2 and y3 = x0 x3, and gate g4 combining them into F; and the modified logic circuit, in which x3 is used as an additional gate input and the signals are y'1, y'2, y'3 and F'.]
$F = x_0 x_1 + x_0 x_2 + x_0 x_3$
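The example can be reproduced in a short simulation. The transcript does not show exactly how x3 is wired into the modified circuit, so the sketch below assumes that the complement of x3 is added as an extra input to g1 and g2. This keeps the function unchanged and forces both gates to 0 whenever x3 is 1. The stimulus is a hypothetical deterministic sequence in which x0, x1 and x2 switch often and x3 is almost always 1.

```python
from itertools import product


def initial_gates(x0, x1, x2, x3):
    """First-level AND gates of the initial circuit: y1, y2, y3."""
    return (x0 & x1, x0 & x2, x0 & x3)


def modified_gates(x0, x1, x2, x3):
    """Assumed modification: the complement of x3 is added to g1 and g2."""
    not_x3 = 1 - x3
    return (x0 & x1 & not_x3, x0 & x2 & not_x3, x0 & x3)


def output(gates):
    """Second-level OR gate g4."""
    return gates[0] | gates[1] | gates[2]


def stimulus(length=64):
    """Hypothetical inputs: x0..x2 follow counter bits, x3 is 0 once every 16 cycles."""
    vectors = []
    for t in range(length):
        x3 = 0 if t % 16 == 5 else 1
        vectors.append((t & 1, (t >> 1) & 1, (t >> 2) & 1, x3))
    return vectors


def and_gate_toggles(gate_fn, vectors):
    """Total number of output transitions on the three AND gates."""
    outs = [gate_fn(*v) for v in vectors]
    return sum(a != b for prev, cur in zip(outs, outs[1:]) for a, b in zip(prev, cur))


# The two circuits compute the same function for all 16 input combinations.
equivalent = all(
    output(initial_gates(*v)) == output(modified_gates(*v))
    for v in product((0, 1), repeat=4)
)
initial_toggles = and_gate_toggles(initial_gates, stimulus())
modified_toggles = and_gate_toggles(modified_gates, stimulus())
```

Because x3 is 1 most of the time, g1 and g2 are held at 0 in the modified circuit and stop following the frequent changes of x0, x1 and x2, while the function is preserved because x0·x3 already covers those cases.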