low voltage system design for highly energy...
TRANSCRIPT
Low Voltage System Design for Highly Energy Constrained
Applications
N. Verma, J. Kwong, Y.K. Ramadass, D.C. Daly, V. Sze, N. Ickes, B. Ginsburg, A.P. Chandrakasan
Massachusetts Institute of Technology
FTFC 2008
2
Emerging Microsystems
What can One Jouleof energy do?
Send a 1 Megabyte file over 802.11b
6 Billion Operations on
a C55x DSP
The Energy Problem
7.5 cm3
AA battery
Alkaline: ~10,000J
Industrial Plants and Power Monitoring (courtesy ABB)
Implantable Medical Electronics [Medtronic]
Seizure Onset DetectionSystem [J. Guttag, MIT]
Portable Medical Electronics
3
Integrated Sensor Nodes
Compact Form Factor (mm3 – cm3)Energy dissipation per function must be minimized
Average power < 10µW (for energy scavenging)For many embedded applications, computing speed is not criticalExploit Application Attributes to Minimize EnergyExploit Application Attributes to Minimize Energy
4
Outline
• Voltage scaling– Advantages, challenges
• Ultra-low-voltage computing– Logic– SRAM– Power delivery and processing
• Low-energy CPUs• Low-energy ADCs• Conclusions
5
Voltage Scaling
Minimum energy VDD for logic results from opposing active and leakage components
Simulation of 32-b CLA adder
E/O
p (J
)
VDD (V)
ELEAK = ILEAKVDDdt∫Op
EACT=CVDD2
ETOT
6
Optimal VDD and Vt
Full minimization of energy requires VDD andtechnology optimization (Vt, slope, etc.)
VDD and Vt yield contours of constant energy and performance
7
ULV MOSFET Degradations
>103 104 107ID,µ
ID,-4σ
ID,+4σ
In Sub-Vt:1) Vt variation exponentially affects strength 2) ION/IOFF is severely degraded
8
ULV Computing PlatformUltra-low-power DSP for bio-medical implants
and sensor-nodes
0.3V MSP430 SoC1) Sub-Vt Logic2) Sub-Vt SRAM3) Integrated DC/DC
Specialized techniques target severe variation and current-ratio with minimum overhead
9
Sub-Vt Logic Level Failure
Sub-Vt static CMOS gates exhibit variation in logic levels (VOH, VOL)
Active
Leakage
RDF
Vt
Inverter VTC VOL Histogram
2) IACTIVE ≈ ILEAKAGE
1) Vt variation
VOLfailures
10
Sub-Vt Library DesignModel worst-case logic path VNAND
VNOR
NOR NAND
VNOR
Can’t support two logic levels
[J. Kwong, et al. ISLPED’06]
11
Sub-Vt Library DesignDefine failure as loss of butterfly plot intersections
Failure rate decreases exponentially with device width and VDD
12
Sub-Vt Library Design
0.340.320.300.280.260.24VDD(V)
111111.632-PMOS11.32.272.32.934.432-NMOS1111.331.6721-PMOS1111.331.6721-NMOS
• Find failure rate vs. width plots for different circuit primitives
• Keep device sizes as small as possible, subject to yield constraint
VDD=0.24V
Width Increase:
13
Sub-Vt Timing Analysis - Challenges
Order-of-magnitude higher delay variation in sub-Vt
0.5 1 1.5 20
100
200
Delay Norm. to Mean
Occ
urr e
nce s
thold
R1D Q D Q
R2Combinational
Logic
tC-Q, min tlogic, min
clk1 clk2tskew = tclk2 -tclk1
Clk Clk
tC-Q,min+tlogic,min > tskew+thold
300mV
1.2V
-5 0 5 100
100
200
Norm. thold
Occ
urre
nces
Data
Gaussian FitLognormal Fit
14
STA Methodology
Focus on hold time violations
Occ
urre
nces
Placed and Routed Design
Path 1 Path 2
…………………………
Exhaustive timing report (from P&R tool)
Path 1 Path 2 Path 3 Path 4
……………………………………………………
1. Sort into bins
2. Select paths with high timing variability (likely
to fail)3. Monte
Carlo simulation
Path 100 Path 101…………………………
Path 300Path 301…………………………
Path 300Path 301……………………………………………………
4. Buffer insertion
15
Determining Path Variability
0.10.12
0.150.20.250.33
Normalized Width (W)
Num
ber o
f Sta
ges
(N)
1 2 3 4 5 6
5
10
15
Path variability metric:(1) Sum σ/µ of all logic
gates in path
(2) Assign weighting to each path according to logic depth
Equal σ/µcontours
16
STA Methodology
Placed and Routed Design
Path 1 Path 2
…………………………
Exhaustive timing report (from P&R tool)
Path 1 Path 2 Path 3 Path 4
……………………………………………………
1. Sort into bins
2. Select paths with high timing variability (likely
to fail)3. Monte
Carlo simulation
Path 100 Path 101…………………………
Path 300Path 301…………………………
Path 300Path 301……………………………………………………
4. Buffer insertion
Hold Time Margin (ns)
Occ
urre
nces
17
SRAM Power Consumption
CPU
DC-DC
65nm 0.3V MSP430
128kb SRAM cache
90nm 2-Core Itanium
Core2
Core1
26MB SRAM cachePower: 12% Power: 69%
[J.Wuu, et al. ISSCC, 2005] [J.Kwong, et al. ISSCC, 2008]
SRAMs can’t be power-gated: leakage-power matters
18
SRAM Voltage Scaling
SRAMLogic
V MIN
(V)
[ISSCC, JSSC, VLSI, CICC, ISLPED `02-`07]
Low-VMIN SRAMs provide major leakage-power savings (>20x) without active-energy overhead of sleep modes
SRAMs pose primary limit to voltage scaling
19
SRAM Bit-Cell StabilityWLWL
BLBLB
Hold SNM Read SNM
Hold SNM preserved to low-voltages
Read SNM degraded at low-voltages
NT NCNT NC
*Soft-oxide breakdown further distorts curves
20
6T Low Voltage Failures
Relative device strengths determine
readabilty/writeability
Read SNM(Mean, 3σ,
4σ)
Hold SNM(Mean, 3σ,
4σ)
Write margin(Mean, 3σ, 4σ)
BLBL
21
1σ, 2σ, 3σ, 4σ
Read Current Distribution Array performance determined by worst-case IREAD
In sub-Vt, reduced overdrive lowers mean IREAD, and variation causes larger degradation of tail IREAD
IREAD
“1”
“1”
BL
22
Bit-Line Leakage
ILEAK depends on stored data and can exceed IREAD at low voltages
IREAD,µ , IREAD,3σ , IREAD,4σ
“1”“1”
“0”“0”
“0”“0”
“0”
IREAD
TotalILEAK
“0”
“0”
23
8T Bit-Cell For Sub-Vt SRAM
Buffer eliminates read SNM limitation,peripheral assists allow sub-Vt write and sensing
WL
BLBL
RDWL
RDBL
M1M2
M3M4
M5M6M7M8
VVDD
Buffer-Foot
6 T Storage Cell 2T ReadBuffer
24
6T: Fails at low VDD 8T: Wide margins at low VDD
8T vs. 6T for ULV
[L.Chang, VLSI, 2000]BLT BLC
WL
NT
NC
RdBL
RdWL
BLT BLC
WL
NT NC
25
Single-bit read failures for equal sized 6Tand 8T
For same area, VDD,MIN goes from 0.6V to 0.3V:Leakage power goes down by 8X
8T vs. 6T for ULV
26
“Zero” Sub-Vt Leakage Read-Buffer
“0”
“1” (256, 128, 64 Cells)
“0”
“1” (64 Cells)“1” (128 Cells)
“1” (256 Cells)
Read buffer VDS=0, VGS<0
6T Storage
Cell
“0”
PCHRG
Sub-Vtleakage
6T Storage
Cell
“0”
PCHRG
“1”
No sub-Vtleakage
27
Read-Buffer Foot-Driver Limitation
Read-buffer foot-driver must have strong drive currentbut consume minimal leakage power and area
6T 6T
“1”
“1”
6T 6T
“0”
“0”128 X IREAD
ILEAK
Acc
ess e
dR
owU
nacc
esse
dR
ow
28
Gate-Boosted Foot-Driver
WLB
BFB
>500x
VEnhance drive current of near
minimum sized NMOS by >500x
CBOOST
M1M2
M3
DD
WLB
VDDBoosted node has minimal capacitance
BFB 128 X IREAD
Array
29
Sub-Vt Write
To ensure write, boost WL 50mV and reduce cell supply
“1” “0”
Adjust trip voltage
Increase gate drive
Reduce gate drive
Mean
3σ
4σ
30
Virtual Cell Supply
Access devices and supply-
driver interact to accurately
set VVDDStacked-effect reduces leakage
during hold
VVDD
WR
QQB
VVDD settles to low intermediate
voltage
“1” “1”
VVDD
“1” “0”
WR“1”
QB
31
Sense-Amplifier Offset
1) Global variation degrades sense-amp accuracy for single-ended read- Pseudo-differential structure eliminates offset
2) Local variation results in uncorrelated error distribution of sense-amps- Only device up-sizing can reduce offset deviation
1σ, 2σ, 3σ
BLREF
ENB
Q
Diff. Input (60mV)
32
Sense-Amplifier Redundancy
Probability of error depends on joint
probability that all sense-amps fail:PERR,tot=(PERR,N)N
InputSwing PERR,1
PERR,2
PERR,4
PERR,8
N=8N=4
N=2
(N=1) (N=2) (N=4)
Column Pitch
(N=8)Sense-Amp Area
33
Redundancy Implementation
Start-up loop selects between 2 sense-
amps, yielding error improvement of 5x
N=2
EN0 EN1
RD
BL
Peri p
her a
lre
dun d
ancy
cont
rol refRead0
refRead1
saRef
rdncyCtrl
rdncyClk
Ref. Bit Cell
2 F-FSelection
logic
34
Prototype SRAM
1.89mm1.
12m
m
0 ºC
25 ºC
75 ºC
Leakage Power
>20x
256kbCapacity350mVVMIN
8 Blocks X 256 Rows X 128 ColumnsArchitecture65nm CMOSProcess
[N. Verma, et al., ISSCC’07]
35
Delivering the Optimum VDD
• Minimum Energy Point (MEP) varies with workload and temperature
• Tracking the MEP : Energy changes by ~3X
Increasing Workload
0.25 0.3 0.35 0.4 0.45 0.51
1.52
2.53
3.54
4.55
2-taps5-taps10-taps12-taps
VDD (V)E o
p(N
orm
aliz
ed)
Temperature, Duration of Leakage
ELEAKAGE
VMEPEACTIVEWorkload, Activity
VMEP
36
VDD,var
Minimum Energy Tracking Loop
CLKvar
VDD,var
Vref Energy Sensor
Circuitry
Load
(FIR Filter)
DC-DC Converter
and Control
Energy Minimization
Algorithm
Critical Path Replica Ring
Oscillator
COMP
Energy /
operationDAC
Cload
AVref
VBAT
Completely on-chip except for the passive filter components
37
Calculating Energy/Operation
COUNTER
Enable
Vcurr
Vcas
Vref (= V1)
N operations
C1
C2
C1 C2Cload+−
V1
+−V2 CLK
C1 C2Cload+−
V2
COMP
opref EVK ∝⋅
)VV(K)( Count 21 −∝=
Isink
)(2 21122
21 VVVVVEOP −≈−∝
38
Prototype MEP Tracking Loop
Additional Leakage
Decreasing Workload
2.1X
0.3 0.4 0.5 0.6 0.71
2
3
4
5
6
7
8
VDD (V)E o
p(N
orm
aliz
ed) WL1
WL4WL7WL7+1µA
1.12mm
1.05mm
C1 C2
DACDC-DC
EMBDC-DC
FIR Filter and Test
Circuitry
Automatic adjustment of MEP based on workload gives 2.1X energy savings
39
+−
CVBAT
2/2/
VVVV
VVV
BAT
BAT
BAT
BATBL ∆−
∆−∆−= ⋅η
CLC
C
CL
BAT
BATBL V
VV
VV=
∆−=η
C
C
+−
VBAT CLCC
2/2/2/VVVV
VVV
BAT
BAT
BAT
BATBL ∆−
∆−∆−= ⋅η
C
CL
BAT
BATBL V
VV
VV=
∆−=
2/2/η
Φ1 Φ2
Φ1 Φ2
VCL = VBAT − ∆V
VCL = VBAT/2 − ∆V
Linear efficiency drop: need VC≈VCL
Capacitive DC/DC Conversion
40
Scalable Voltage Generation
12CB
VBAT
Φ1 Φ2L
4CB
4CB
VBATΦ1 Φ2
Φ2
Φ1 Φ2
L
L
4CB
Φ2
Φ1 Φ2 L
6CB
6CB
VBAT
Φ1 Φ2
Φ2
Φ1 Φ2
L
L
Ratio: 1/1 Ratio: 1/2 Ratio: 1/3
Programmable switches yield DC/DC conversion for desired load voltages
41
Fully On-Chip Conversion to 0.3V
Multiple gain settings
Adjustable switch sizes
Charge recycling
Switch capacitors can be integrated on-chip for low load currents
VBP
CL
VL
Charge Recycler
enW<2:0> Switch MatrixNon- Ov.
Clock Generator
VBAT(1.2V)G<4:0>
IL
Topology for VL < 0.6V
Φ1Φ2
VBAT 6CB
6CB
Φ1 Φ2
Φ2Φ1 Φ2
Φ2 Φ1
VL
Topology for VL < 0.4VVL
VBAT12CB
Φ1 Φ2
Φ1Φ2
VBAT
COMPVREF
VL
42
Charge Recycling
Φ1
Φ2
ΦCR
12CB
Cload
VLΦ1 Φ2
Φ2
Φ1
CCR
כ כ כ כ
ΦCRLCR
VBAT
-0.1
0
0.1
i L(A
)
00.20.40.6
V BP
(V)
0 1 2 3 4 5 600.51
1.5
Time (ns)
ΦC
R(V
)
Reduce losses due to bottom plate parasitic capacitance
Cpar
43
1 10 100 5000.5
0.55
0.6
0.65
0.7
0.75
0.8
Load Power (µW)
Effic
ienc
y
Efficiency with microcontroller as load
= 75%
Measured efficiency while delivering 500mV
DC-DC Converter Efficiency
44
Sub-Vt Microcontroller• 16-bit RISC architecture• Supports 27 instructions and 7 addressing modes of
the standard MSP430 instruction set
HFCLK
LFCLK
...
Sequ
en-c
e r
ALU
Addr
Ctrl
Dat
aCt
rl
Address[15:0]
Data[15:0]
TCK
ACLKSMCLK
MCLKOutput ports
Clock system
ACLK
SMCLKMCLK
Memory interface
Input ports
Watchdogtimer
128kb SRAM
CPU
JTA
GReg. File
45
Prototype 0.3V SoC
Performance
1.36mm2SRAM0.12mm2
DC-DC Converter
Area
0.14mm2Logic
65nm CMOSProcess
VDD = 300mVMinimum Functional VDD
VDD = 500mVMinimum Energy Point
2.5µW leakage power, 10µW total power at 400kHz[J. Kwong, et al., ISSCC’08]
46
Parallelism to Meet Performance
0 0.2 0.4 0.6 0.8 110-3
10-2
10-1
100
Ene
rgy/
oper
atio
n (n
orm
aliz
ed)
Leakage Energy
Active Energy
Total Energy
Minimum Energy Point
VDD (V)
Baseband energy dominated by correlators
MEP occurs at 0.3V, BUT baseband throughput is constrained (500MSymb./s)
GOLD CODE
INPUT OUTPUT
CLEAR
47
Parallelize Baseband Architecture
L = 20 (cross-correlation)M = 31 (preamble symbols)Total # of correlators = 620
500MSymb/s performance at 0.4V:5.8X energy savings
[V.Sze, et al., ISLPED’07]
Correlator Bank
Demodulation
Correlator Sub-bank 1
Correlator 1Correlator 2
Correlator L
Correlator Sub-bank 2
Ret im
ingB
lock
5 Tap FIR Filter
5 Tap FIR Filter
5 Tap FIR Filter
5 Tap FIR Filter
………
……
…
Correlator L+1Correlator L+2
Correlator 2L
Correlator Sub-bank MCorrelator (M-1)L+1Correlator (M-1)L+2
Correlator ML
5-bitInputfromA
DC
Serialt o
Parallel
Thr esh oldD
etector/P
o sit ionE
ncod er
Bi tD
e coder
………
………
………
DemodulatedBits
Acquisition/Timing Control
FIRCoefficients
48
Scalable, Low-Power SAR ADC PU
RGE
AZE
RO
SMPL BIT-CYCLE
CLK
AZE
RO
SMP L BIT-CYCLE
12b mode (17 cycles) 8b Mode (12 cycles)
PUR G
E
12
6
6VREFP
VREFN
VINp
VINn
Sub DAC
Comparator
Main DAC
SAR
Clock Manager
CLK SLEEP
SLEEP
DOUT
CNVRT
VREFP
VREFN
49
AZ1
PRG
PRG
AZ2
PRG
PRG
AZ3
PRG
PRG
AZ3
PRG
PRG
SLEEP12
SLEEP8
VINp
VINn
VOUTp
VOUTn
Scalable Comparator
Regenerative amplifiersare more efficient than
linear stages
Clocked to reset
SLEEP
SLEEP SLEEP
50
Offset Compensating Latch
Total Gain
Nor
mal
ized
Pow
er-D
elay
3 linear ampstages
5 linear ampstages 1 regenerative
stage
[J.-T. Wu, et al., JSSC`88]
VINp VINn
VOUTpVOUTn
CR1 CR2
Reduces gain requirement of less efficient linear pre-amps by 100X
51
12/8b 0-100kS/s SAR ADC
65.5dBSNDR165fJ/Conv. StepFOM
100kS/sMax. Fs25µWPower
12b Mode
[N.Verma, et al., ISSCC’06]
52
Variation Resilient ULV ADC
Required resolution allows use of redundancy to improve yield
• CVDD2 highly digital, low-voltage flash ADC
53
Simplified ADC Architecture
Triangle used for comparator calibration
Memory stores histogram
Many redundant comparators
Highly digital, voltage scalable
architecture
55
Conclusions
• Energy Efficient operation (Joules/op, Joules/conversion, Joules/bit) requires system-level optimization
• Exploit application specific attributes (reduced performance requirements) to save energy
• Exploit highly digital approaches for traditional analog function
• Slower is Better–exploit sub-threshold operation to minimize energy dissipation
56
References
1. A. Wang and A. Chandrakasan, “A 180mv FFT processor using sub-threshold circuit techniques,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2004, pp. 292–293.
2. B. H. Calhoun, A. Wang, and A. Chandrakasan, “Modeling and sizing for minimum energy operation in subthreshold circuits,” IEEE Journal of Solid-State Circuits, vol. 40,no. 5, pp. 1778–1786, Sept. 2005.
3. A. Wang, A. Chandrakasan, and S. Kosonocky, “Optimal suppply and threshold scaling for sub-threshold CMOS circuits,” in Proc. IEEE Comp. Society Annual Int. Symp. VLSI, April 2002, pp. 5–9.
4. J. Chen, L. T. Clark, and Y. Cao, “Ultra-low voltage circuit design in the presence of variations,” IEEE circuits Devices Mag., vol. 21, no. 1, pp. 12–20, Jan./Feb. 2005.
5. J. Kwong and A. Chandrakasan, “Variation-driven device sizing for minimum energy sub-threshold circuits,” in Proc. Int. Symp. Low Power Electronics and Design, 2006, pp. 8–13.
6. B. Zhai, S. Hanson, D. Blaauw, and D. Sylvester, “Analysis and mitigation of variability in subthresholddesign,” in Proc. Int. Symp. Low Power Electronics and Design, 2005, pp. 20–25.
7. E. Seevinck, F. J. List, and J. Lohstroh, “Static-noise margin analysis of MOS SRAM cells,” IEEE Journal of Solid-State Circuits, vol. SC-22, no. 5, pp. 748–754, Oct. 1987.
8. M. Agostinelli, J. Hicks, J. Xu, B. Woolery, K. Mistry, K. Zhang, S. Jacobs, J. Jopling,W. Yang, B. Lee, T. Raz, M. Mehalel, P. Kolar, Y. W. J. Sandford, D. Pivin, C. Peterson, M. DiBattista, S. Pae, M. Jones, S. Johnson, and G. Subramanian, “Erratic fluctuations of SRAM cache Vmin at the 90nm process technology node,” in IEDM Dig. Tech. Papers, Dec. 2005, pp. 671–674.
9. R. Rodriguez, J. H. Stathis, B. P. Linder, S. Kowalczyk, C. T. Chuang, R. V. Joshi, G. Northrop, K. Bernstein, A. J. Bhavnagarwala, and S. Lombardo, “The impact of gate-oxide breakdown on SRAM stability,” IEEE Electron Device Letters, vol. 23, no. 9, pp. 559–561, Sept. 2002.
57
References
10 T. Mizumo, J.-I. Okamura, and A. Toriumi, “Experimental study of threshold voltage fluctuations using an 8kMOSFET’s array,” in Proc. IEEE Symp. VLSI Technology, May 1993, pp. 41–42.
11. A. Bhavnagarwala, X. Tang, and J. Meindl, “The impact of intrinsic device fluctuations on CMOS SRAM cell stability,” IEEE Journal of Solid-State Circuits, vol. 36, no. 4, pp. 658–665, April 2001.
12. L. Chang, D. M. Fried, J. Hergenrother, J. W. Sleight, R. H. Dennard, R. Montoye, L. Sekaric, S. J. McNab, A. W. Topol, C. D. Adams, K. W. Guarini, and W. Haensch, “Stable SRAM cell design for the 32nm node and beyond,” in Proc. IEEE Symp. VLSI Circuits, June 2005, pp. 128–129.
13. N. Verma and A. Chandrakasan, “A 65nm 8T sub-Vt SRAM employing sense-amplifier redundancy,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2007, pp. 328–329.
14. Y.Ye, S. Borkar, and V. De, “A new technique for standby leakage reduction in highperformance circuits,” in Proc. IEEE Symp. VLSI Circuits, June 1998, pp. 40–41.
15. K. Zhang, U. Bhattacharya, Z. Chen, F. Hamzaoglu, D. Murray, N. Vallepath, Y. Wang, B. Zheng , M. Bohr, “A 3-GHz 70Mb SRAM in 65nm CMOS technology with integrated column-based dynamic power supply,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2005, pp. 474–475.
16. K. Zhang, K. Hose, V. De, and B. Senyk, “The scaling of data sensing schemes for high speed cache design in sub-0.18µm technologies,” in Proc. IEEE Symp. VLSI Circuits, June 2000, pp. 226–227.
17. M. Pelgrom, H. Tuinhout, and M. Vertregt, “Transistor matching in analog CMOS applications,” in IEDM Dig. Tech. Papers, Dec. 1998, pp. 915–918.
18. M. P. Flynn, C. Donovan, and L. Sattler, “Digital calibration incorporating redundancy of flash ADCs,” IEEE Transactions on Circuits and Systems-II, vol. 50, no. 3, pp. 205–213, May 2003.
19. Y. Ramadass, A. P. Chandrakasan, "Minimum Energy Tracking Loop with Embedded DC-DC Converter Delivering Voltages Down to 250mV in 65nm CMOS," IEEE International Solid-State Circuits Conference, pp. 64-65, February 2007.
58
References
20. Y. Ramadass, A. P. Chandrakasan, "Voltage Scalable Switched Capacitor DC-DC Converter for Ultra-Low-Power On-Chip Applications," IEEE Power Electronics Specialists Conference, pp. 2353-2359, June 2007.
21. J. Kwong, Y. Ramadass, N. Verma, M. Koesler, K. Huber, H. Moormann, A. Chandrakasan, "A 65nm Sub-VtMicrocontroller with Integrated SRAM and Switched-Capacitor DC-DC Converter," ISSCC, pp. 318-319, February 2008.
22. V. Sze, A. Chandrakasan, "A 0.4-V UWB Baseband Processor," IEEE International Symposium Low Power Electronics and Design, pp. 262-267, August 2007.
23. N. Verma, A. P. Chandrakasan, "A 25uW 100kS/s 12b ADC for Wireless Micro-Sensor Applications," IEEE ISSCC, pp. 222-223, February 2006.
24. D.C. Daly, A. P. Chandrakasan, "A 6-bit, 0.2V to 0.9V Highly Digital Flash ADC with Comparator Redundancy," IEEE International Solid-State Circuits Conference, pp. 554-555, February 2008.