Containing the Nanometer ‘Pandora Box’: Design Techniques for Variation Aware Low Power Systems
Kaushik Roy, Georgios Karakonstantis, Abhijit Chatterjee
Nanoelectronics Research Lab (NRL)
Era of Miniaturization
Increasing demand for small multifunctional mobile platforms consisting of heterogeneous components
[Figure: system on chip combining application specific processors, a general processor, memory, camera, video, WLAN, RF, and mixed signal blocks]
Scaling Challenges: Power
[Figure: leakage and dynamic power (W) versus technology node, 250 nm to 45 nm, for a 15 mm2 die. Source: Intel]
Large power dissipation and large power density
Increasing battery gap: rapidly increasing processor power consumption versus slowly increasing battery capacity
Need for Efficient Use of Available Energy
Scaling Challenges: Process Variations
Variation in channel length (Device 1 vs. Device 2, Leff1 < Leff2)
Random dopant fluctuation
Inter- and intra-die variations
Leakage spread and delay variation (Intel)
Device parameters are no longer deterministic
[Figure labels: High Performance, Short Channel Effects, Process Variations]
The Nanometer ‘Pandora Box’
Reduced Yield/Quality
Exponential Power Growth
Technology scaling beyond 90nm unlocks the Nanometer ‘Pandora Box’
Soft Errors
Addressing the Issues (Logic)
Common symptoms: delay failures
Medicine? Can we address them jointly?
[Figure: path delay and power versus supply voltage (Vdd) against the clock period Tc, with Vddnom marked; one panel shows delay failures, the other increased power]
Robustness (delay variations): increase supply voltage
Low power: reduce supply voltage (Voltage Over-Scaling, VOS)
Low power and robustness have contradictory design requirements
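The conflict between the two requirements can be made concrete with a first-order model. The sketch below is not from the slides: it assumes the alpha-power law for gate delay and the usual C·Vdd²·f expression for dynamic power, and the threshold voltage, exponent, and nominal supply are hypothetical values chosen only for illustration.

```python
# First-order sketch under assumed models and hypothetical parameters:
#   gate delay     ~ Vdd / (Vdd - Vth)^alpha   (alpha-power law)
#   dynamic power  ~ C * Vdd^2 * f
VTH = 0.3       # hypothetical threshold voltage (V)
ALPHA = 1.3     # hypothetical velocity-saturation exponent
VDD_NOM = 1.0   # hypothetical nominal supply (V)


def gate_delay(vdd):
    """Relative gate delay at supply vdd (arbitrary units)."""
    return vdd / (vdd - VTH) ** ALPHA


def normalized_delay(vdd):
    """Delay relative to the delay at the nominal supply."""
    return gate_delay(vdd) / gate_delay(VDD_NOM)


def normalized_dynamic_power(vdd):
    """Dynamic power relative to nominal, at a fixed clock frequency."""
    return (vdd / VDD_NOM) ** 2


if __name__ == "__main__":
    for vdd in (1.0, 0.9, 0.8, 0.7):
        print(f"Vdd={vdd:.1f} V  delay x{normalized_delay(vdd):.2f}  "
              f"power x{normalized_dynamic_power(vdd):.2f}")
```

Lowering Vdd reduces power quadratically in this model while lengthening every path, which is why over-scaling pushes the longest paths past the clock period first.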
Addressing the Issues
Cross layer design is necessary for highly optimized systems
Logic (General Purpose, Application Specific): Variable Latency Units (CRISTA, Telescopic), Error Detection & Correction (RAZOR, Intel), Algorithmic Noise Tolerance, Significance Driven Approach
Memory (stability failures): Circuit-Architecture Co-Design (Redundancy, ECC, ABB, …)
Mixed Signal (loss of performance due to guardbands)
Outline
Scaling Challenges: Variations, Power
Logic
  General Purpose Processors: Error Detection and Correction, Prediction Based Techniques
  Application Specific Systems: Significance Driven Approach, Algorithmic Noise Tolerance
  System Level Techniques
Robust Low Power Memory: Circuit (Bit Cell Level), Architecture Level
Mixed Signal
Conclusion
Error Detection and Correction
[Figure: RAZOR flip-flop with a main flip-flop (clk), a shadow latch (clk_del), and a comparator that raises Error_L / Error]
Tune Vdd by monitoring the error rate during circuit operation
Eliminates the need for voltage/clock margins
Shadow latch samples the delayed signal
A comparator and a metastability detector identify and validate any timing error
64% energy savings with 3% performance and energy overhead
RAZOR II (JSSC ‘09) circumvents the tight timing constraints of RAZOR flip-flops through micro-architecture techniques such as replay
Intel provides an overview of dynamic variation aware and low power techniques at the micro-architecture level (JSSC ‘09)
* D. Ernst, et al, “Razor: Circuit-Level Correction of Timing Errors for Low-Power Operation,” IEEE Micro, 2004.
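Tuning Vdd from the observed error rate can be pictured as a simple feedback loop. The following is a minimal sketch and not the RAZOR controller itself: the error-rate model, target rate, step size, and voltages are all hypothetical, and a real design would count Error signals from the flip-flops instead of calling a model.

```python
def error_rate(vdd, v_first_failure=0.85, slope=2.0):
    """Hypothetical model: no timing errors above v_first_failure,
    then an error rate that grows linearly as Vdd drops further."""
    return max(0.0, min(1.0, (v_first_failure - vdd) * slope))


def tune_vdd(vdd=1.0, target_rate=0.01, step=0.01, cycles=200):
    """Lower Vdd while the error rate stays at or below the target,
    raise it again as soon as the target is exceeded."""
    for _ in range(cycles):
        if error_rate(vdd) > target_rate:
            vdd += step   # too many corrections: back off
        else:
            vdd -= step   # margin left: keep scaling down
        vdd = round(vdd, 4)  # stay on the regulator's voltage grid
    return vdd


if __name__ == "__main__":
    settled = tune_vdd()
    print(f"settled near {settled:.2f} V, error rate {error_rate(settled):.3f}")
```

The loop settles around the point of first failure instead of at a worst-case margin, which is the sense in which error detection removes the need for static voltage guardbands.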
Prediction based (CRISTA)
[Figure: path delay distributions (# of paths vs. path delay) against the delay target, at nominal and scaled supply. Conventional system: complete failure when scaled. CRISTA based design: slack S at nominal; failures when scaled are predictable and rare by design]
Meet the delay target while considering variation and VOS induced delay errors
CRISTA – Generic Logic
Long paths are activated rarely and are evaluated in two cycles
Shannon expansion and gate sizing: the critical path (CP) is activated when x1’x2x3 = 1, with probability 12.5%
40% less power with 9% area overhead for a two stage pipelined ALU
Applied to circuits (variable latency units) as well as at the micro-architectural level (Trifecta, TVLSI ‘10)
* S. Ghosh, et al, “CRISTA: A New Paradigm for Low-Power, Variation-Tolerant, and Adaptive Circuit Synthesis Using Critical Path Isolation,” TCAD, 2007.
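The 12.5% figure follows from counting input combinations. A small sketch, assuming the three control inputs are independent and equally likely to be 0 or 1 (an assumption, the slides do not state the input statistics):

```python
from itertools import product


def critical_path_active(x1, x2, x3):
    """Activation condition from the slide: x1' x2 x3 = 1."""
    return (not x1) and x2 and x3


def activation_probability():
    """Fraction of the 8 input combinations that activate the critical path."""
    hits = sum(1 for bits in product([0, 1], repeat=3)
               if critical_path_active(*bits))
    return hits / 8


def average_cycles(p_long):
    """Expected latency when long paths take two cycles and the rest one."""
    return 1 * (1 - p_long) + 2 * p_long


if __name__ == "__main__":
    p = activation_probability()
    print(p, average_cycles(p))
```

Under this assumption only one combination in eight needs the second cycle, so the average latency penalty of the two-cycle evaluation stays small.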
CRISTA: Variable Latency Adder Case
Delay of the circuit depends on the input data and on carry propagation, e.g. A(a0..a3) = 1111 and B(b0..b3) = 0001
Classify operations into long and short latency
Inputs susceptible to failure are given 2 cycles
Utilize slack S1 and S2 for tackling VOS and variation induced delay errors
The predictor trades off overhead against penalty depending on the number of monitored inputs
[Figure: 4-bit ripple carry adder (four full adders, inputs a0..a3 and b0..b3, Cin, carries Co,0..Co,3, sums S0..S3) with a stretch-clock predictor; timing diagram at VDD = 1 V showing the clock period Tc, short latency cases with slack S1 and S2, a long latency case, and a delay failure]
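A minimal sketch of such a predictor, under stated assumptions: it monitors the propagate signals (a XOR b) of a few middle bit positions and requests a second cycle only when all of them propagate. The function names, the 8-bit width, and the choice of monitored positions are hypothetical and are not taken from the CRISTA design.

```python
def longest_propagate_run(a, b, width):
    """Length of the longest run of consecutive propagate bits (a_i XOR b_i),
    used here as a proxy for the worst carry chain."""
    p = a ^ b
    best = run = 0
    for i in range(width):
        if (p >> i) & 1:
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best


def needs_two_cycles(a, b, first=3, count=2):
    """Predict a long-latency operation when every monitored bit propagates."""
    mask = ((1 << count) - 1) << first
    return ((a ^ b) & mask) == mask


if __name__ == "__main__":
    # 4-bit case in the spirit of the slide, read here as integers 0b1111 + 0b0001
    print(longest_propagate_run(0b1111, 0b0001, 4))
    print(needs_two_cycles(0b1111, 0b0001, first=1, count=2))
```

With 8-bit operands and bits 3 and 4 monitored, any operation predicted as short has a propagate run of at most 4 bits, and a quarter of uniformly random operand pairs are sent to the two-cycle path. Monitoring more bits lowers that penalty but enlarges the predictor.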
Application Specific Systems
Obtain best quality possible, in presence of delay failures under VOS or parametric variations
[Figure: DCT output images for the original 6-alphabet design and the proposed 2-alphabet design. Panel 1: system under parametric variations at nominal Vdd (nominal corner vs. slow corner). Panel 2: system under voltage over-scaling at the nominal corner (nominal Vdd vs. scaled Vdd)]
Significance Driven Approach
Not all computations contribute equally to output quality
Algorithm: separate significant from less significant computations; adjust complexity for minimum quality degradation under delay errors
Architecture: minimize sharing for significant computations, leaving slack to tackle delay errors due to Vdd scaling/process variation and ensuring correct operation under delay errors; maximize sharing for less significant computations, with tight timing, to reduce any area overhead
Result: energy efficient and robust DSP blocks
Application to DCT
[Figure: 2-D DCT data flow. Input block (x0..x7), 1D-DCT, 1-D intermediate DCT coefficients (w0..w56), transpose memory, 1D-DCT, final DCT coefficients (z0..z7). Significance decreases from low frequency to high frequency coefficients; computations are classed as Crucial, Less Crucial 1, and Less Crucial 2 (significant vs. not-so significant)]
*G. Karakonstantis, et al, “Process-Variation Resilient & Voltage Scalable DCT Architecture for Robust Low-Power Computing,” TVLSI, 2010.
Scalable DCT Architecture
Under delay failures only less-crucial computations are affected
Low power (55% savings) with graceful quality degradation (33 dB to 23 dB)
[Figure: delay (ns) of Path1(w0) through Path8(w7) for the conventional DCT and the proposed DCT against the clock target. Dc, Dlc1, and Dlc2 mark the delays of the Crucial, Less Crucial 1, and Less Crucial 2 computations for outputs z0..z7; the failing paths are marked FAIL]
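Why losing the high-frequency coefficients degrades quality gracefully can be checked on a single row of pixels. The sketch below is only an illustration: it zeroes the high-frequency coefficients as a stand-in for those outputs being corrupted by delay errors, uses a plain orthonormal 8-point DCT instead of the proposed architecture, and the pixel values are hypothetical.

```python
import math


def dct(x):
    """Orthonormal 1-D DCT-II."""
    n = len(x)
    coeffs = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        coeffs.append(scale * sum(
            x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i in range(n)))
    return coeffs


def idct(coeffs):
    """Inverse of dct() (orthonormal DCT-III)."""
    n = len(coeffs)
    return [sum(math.sqrt((1.0 if k == 0 else 2.0) / n) * coeffs[k]
                * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for k in range(n))
            for i in range(n)]


def reconstruct(x, kept):
    """Keep the `kept` lowest-frequency coefficients and zero the rest."""
    c = dct(x)
    return idct(c[:kept] + [0.0] * (len(c) - kept))


def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


ROW = [52, 55, 61, 66, 70, 61, 64, 73]  # hypothetical pixel row

if __name__ == "__main__":
    for kept in (8, 4, 2):
        print(kept, round(mse(ROW, reconstruct(ROW, kept)), 3))
```

Because the transform is orthonormal, the reconstruction error equals the energy of the discarded coefficients, so protecting the low-frequency terms bounds the damage when the rest fail.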
Algorithmic Noise Tolerance
[Figure: the input feeds a main block (8-bit, output Ya) and a reduced precision replica (6-bit, output Yp); the error control block compares the two against a threshold (> Th) and selects Yout]
Redundancy based: addition of a reduced precision replica of the main block
Error control: estimates and corrects potential errors
Threshold determined at design time
Use of linear arithmetic units
Applied to the design of various DSP blocks (FFT, FIR, Viterbi) leading to 20-50% power savings with graceful quality loss
* B. Shim, et al, “Reliable Low-Power Digital Signal Processing via Reduced Precision Redundancy,” TVLSI, 2004.
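The decision rule can be sketched in a few lines. This is a minimal illustration under assumptions, not the published design: the main block is modeled as a small weighted sum, a timing error is modeled as a flipped output bit, the replica drops the two least significant input bits (8-bit vs. 6-bit), and the taps and samples are hypothetical.

```python
COEFFS = [1, 2, 3, 2, 1]          # hypothetical non-negative filter taps
SAMPLES = [200, 13, 77, 255, 6]   # hypothetical 8-bit input samples
DROPPED_BITS = 2                  # 8-bit main block vs. 6-bit replica


def main_block(samples, fault_mask=0):
    """Full precision result Ya; fault_mask flips output bits to model
    a timing error under voltage over-scaling."""
    y = sum(c * x for c, x in zip(COEFFS, samples))
    return y ^ fault_mask


def replica(samples):
    """Reduced precision estimate Yp, assumed to be error free."""
    truncated = [(x >> DROPPED_BITS) << DROPPED_BITS for x in samples]
    return sum(c * x for c, x in zip(COEFFS, truncated))


# Largest gap between an error-free Ya and Yp, fixed at design time
THRESHOLD = ((1 << DROPPED_BITS) - 1) * sum(abs(c) for c in COEFFS)


def ant_output(ya, yp, th=THRESHOLD):
    """Fall back to the replica when the main block looks wrong."""
    return yp if abs(ya - yp) > th else ya


if __name__ == "__main__":
    yp = replica(SAMPLES)
    print(ant_output(main_block(SAMPLES), yp))
    print(ant_output(main_block(SAMPLES, fault_mask=1 << 9), yp))
```

An error in a high-order bit moves Ya far from Yp and is replaced by the estimate, so the residual error is bounded by the replica's precision loss. Errors smaller than the threshold pass through undetected.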
SDA vs ANT
ANT
Challenge: design of reduced overhead estimators based on application specific characteristics
Need extra hardware for approximate computation
Extra hardware translates to less power savings even at scaled voltages
SDA
Challenge: identification of significant computations based on application specific characteristics
Small area and power overhead
Low power overhead (3%) at nominal voltage
~25% more power savings in DCT
Finer granularity => more flexible and able to adapt to conditions
Algorithm/architecture co-design can lead to energy efficient architectures with minimum area overhead by utilizing the inherent error resiliency of ASIC systems
Approximate Computation
Probabilistic Computing*
Recognition, Mining and Synthesis (RMS) applications have inherently statistical behavior and are based on iterations
No single ‘golden’ result
Subsequent iterations may compensate for errors in previous stages
Scalable Effort**: SVM machine with more than 40% power savings under acceptable error rates
Stochastic Processors
*K. V. Palem, et al, “Sustaining Moore’s law in embedded computing through probabilistic and approximate design: retrospects and prospects,” CASES, 2009.
**V. K. Chippa, et al, “Scalable Effort Hardware Design: Exploiting Algorithmic Resilience for Energy Efficiency,” DAC, 2010.
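How later iterations can absorb an earlier error is easy to see on any convergent iteration. The sketch below uses Newton's square root iteration purely as a stand-in for an iterative RMS kernel; the injected fault, its size, and the iteration at which it occurs are hypothetical.

```python
def sqrt_iterative(value, iterations=20, faulty_iteration=None, fault=0.0):
    """Newton iteration for sqrt(value). An additive fault can be injected
    after one iteration to mimic a computation error."""
    x = value
    for i in range(iterations):
        x = 0.5 * (x + value / x)
        if i == faulty_iteration:
            x += fault  # corrupted intermediate result
    return x


if __name__ == "__main__":
    print(sqrt_iterative(2.0))
    print(sqrt_iterative(2.0, faulty_iteration=3, fault=0.5))
```

Both runs end at the same value because the remaining iterations pull the corrupted estimate back, at the cost of a few extra steps. An error injected in the very last iteration would not be corrected.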
Error Resilient System Architecture
Execute control-intensive code (must be error free, significant) on the reliable core
Execute data operations (errors can be tolerated, less significant) on the less reliable cores
Applied to RMS applications and LDPC decoding; 90% accuracy is maintained even under 2x10^-4 errors/cycle/core
[Figure: ERSA organization. Super Reliable Core: highly reliable, runs the main thread, OS visible, supervises the RRCs. Relaxed Reliability Cores (RRCs): inexpensive and unreliable, run worker threads, sequestered from the OS, with a reliable MMU and restart unit. Each core contains I$, fetch and decode, issue, FPU, ALU, Ld/St, MMU, and D$, and the cores are joined by an interconnect; reliable and unreliable components are marked]
* L. Leem, et al, “ERSA: Error-Resilient System Architecture For Probabilistic Applications,” DATE , 2010.
Variation Aware Power Management
Exploit the variable workload of applications over time to adjust the voltage in various power domains on-chip, while considering variations
Memory Failures
RDF and LER impact memory more than logic
Memory stability is quantified in terms of read, write and access failure probability
Read failures (PRF): negative read SNM, flip of data
Write failures (PWF)
Access failures
Cell failure probability is modeled as the union of all failure probabilities
[Figure: butterfly curves (VQB vs. VQ, in V). Read: SNM = MIN(SNM1, SNM2); a positive read SNM with two stable points (no read functional failure) becomes a negative read SNM, SNM < 0 (read functional failure), under process variation. Write: a positive write margin with a single stable point (successful write, no write functional failure) becomes a negative write margin with two stable converging points (write functional failure) under process variation]
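The union of the failure mechanisms can be written out directly. A minimal sketch, assuming the read, write, and access failure events are independent (an assumption, the slides only say "union") and using hypothetical probabilities:

```python
def read_snm(snm1, snm2):
    """Read static noise margin: the smaller of the two butterfly lobes."""
    return min(snm1, snm2)


def cell_failure_probability(p_read, p_write, p_access):
    """P(read OR write OR access failure), assuming independent events."""
    return 1.0 - (1.0 - p_read) * (1.0 - p_write) * (1.0 - p_access)


def array_yield(p_cell, n_cells):
    """Probability that every cell in an array without redundancy works,
    assuming independent cell failures."""
    return (1.0 - p_cell) ** n_cells


if __name__ == "__main__":
    p_cell = cell_failure_probability(1e-6, 2e-6, 5e-7)  # hypothetical values
    print(p_cell, array_yield(p_cell, 1024 * 1024))
    print(read_snm(0.12, -0.03) < 0)  # negative SNM: read functional failure
```

Even a cell failure probability of a few parts per million leaves a megabit array with a low chance of being fault free, which is why redundancy and ECC matter at the array level.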
Robust Memory Design: Circuit
Circuit level solutions for 6T based memories: up-size bit-cells, ABB
Read and write requirements contradict each other
New bit cells that isolate read from write
8T bit-cells have better read stability at the cost of 30% area overhead
10T bit-cells have better stability at lower Vdd
Schmitt Trigger bit-cells improve both read and write
[Figure: read SNM, 6T vs. 8T (L. Chang et al., VLSI Symp. ’03)]
Robust Memory Design: Architecture
Architecture level: addition of redundant rows and columns, Error Correction Codes (ECC)
Circuit/architecture co-design: hybrid memory with preferential storage for video applications, combining 8T and 6T cells
46% power savings (@10 MHz) with 11% area overhead
Access time requirements met at 600 mV
[Figure: 6T-only array (PF↑, Area↓) and 8T-only array (PF↓, Area↑; trade-off of power vs. 33% area penalty) compared with the novel hybrid array that mixes 8T and 6T cells]
*J. Chang, et al, “A voltage-scalable & process variation resilient hybrid SRAM architecture for MPEG-4 video processors,” DAC, 2009.
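One way to picture preferential storage is to bound the pixel error by which bits are allowed to fail. The sketch below assumes that the higher-order bits of each 8-bit pixel sit in 8T cells that never fail and the remaining low-order bits sit in 6T cells that may flip at the scaled voltage; the bit split and failure probability are hypothetical, and the slides do not spell out the exact mapping.

```python
import random

PIXEL_BITS = 8


def read_pixel(pixel, robust_msbs, p_fail_6t, rng):
    """Read back a pixel from a hybrid word.

    Assumption: the top `robust_msbs` bits are in 8T cells and are error
    free; each remaining low-order bit is in a 6T cell and flips with
    probability p_fail_6t.
    """
    for bit in range(PIXEL_BITS - robust_msbs):
        if rng.random() < p_fail_6t:
            pixel ^= 1 << bit
    return pixel


def worst_case_error(robust_msbs):
    """Largest possible pixel error when only the 6T bits can flip."""
    return (1 << (PIXEL_BITS - robust_msbs)) - 1


if __name__ == "__main__":
    rng = random.Random(0)
    for robust in (0, 4, 8):
        errors = [abs(read_pixel(p, robust, 0.1, rng) - p) for p in range(256)]
        print(robust, max(errors), worst_case_error(robust))
```

Protecting more high-order bits shrinks the worst-case error exponentially, while each protected bit pays the 8T area cost, which is the trade-off the hybrid array balances.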
Mixed-Signal: Adaptive Wireless Systems
[Figure: adaptive wireless transceiver. RF front end: crystal oscillator, PLL (phase detector, ÷M, ÷N, lowpass filter, VCO), DAC, mixers, bandpass filters, power amplifier, switch control, low noise amplifier, VGA, lowpass filters, baseband amplifier, buffer, ADCs, and a sensor. Processor: DSP with an analysis and control engine (ACE). Detection mechanism: EVM, null tone, fading. Adaptation control (hardware/software) responds to channel/signal quality and system health through recalibration and a tuning methodology]
BIST for Multiple Specs
[Figure: a multi-tone test input from the DSP drives the DUT with process variations; the captured response forms the measurement space, and a MARS mapping function maps it to the specs (Gain_i, IIP3_i, ..., Spec_i N)]
NOTE: Model building across process and tuning knobs!
Production Phase
Process Tuning Approach
[Figure labels: test stimulus, RF to low-frequency conversion, test response, diagnosis]
Two different techniques for tuning
Test Stimulus and Test Response
Sensor at Tx output
Multiple instances (process perturbations)
Test stimulus (multi-tone); transmitter output (RF signal: 2.4 GHz)
Test response: sampled by on-board ADC (low frequency envelope)
Accurate spec measurement with NO external tester support!
Loopback test and tuning
[Figure: tuning flow. An alternate diagnostic stimulus is applied to the RF system under test; sensors at test observation nodes feed a correlation based BIST for multiple specs; an optimization engine updates the circuit knobs, starting from the one-shot predicted initial guess, and loops (yes/no) until the optimum is reached; digital predistortion follows, giving improved QoS performance specs of the RF system]
Step 1: Perform one-shot to predict the initial guess for the circuit tuning knobs
Step 2: Perform BIST-assisted iterative tuning of the circuit knobs
Step 3: Perform digital compensation after setting the circuit knobs to their optimum values
Adaptive LNA
TSMC 0.18 µm CMOS design
Adaptation Performance: LNA
Parameter Tuning
Experimental Results: Large parameter analysis
Large parameter instances
Small parameter instances
Experimental Results: Large parameter tuning
Non tunable
Experimental Results: BIST
AT-BIST results across process and tuning knobs for Tx
Experimental Results: Nominal Specs
              Gain      IIP2        IIP3
Nominal       42.5 dB   -11.5 dBm   -7 dBm
Lower bound   41.5 dB   -12.5 dBm   -8 dBm
Upper bound   43.5 dB   -10.5 dBm   -6 dBm
One-Instance (P1)
              Gain      IIP2        IIP3
Before        40.1      -10         -5.3
After         41.5726   -11         -7.2569
• 207 possible knob combinations (P1) for yield recovery
• Power conscious knob combination (P1): 0.5724 W
• Converged knob combination (P1): 0.5724 W
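The yield recovery for instance P1 can be checked against the spec bounds listed above. A small sketch, assuming the before/after values are in the same units as the nominal specs (dB for gain, dBm for IIP2 and IIP3), which the slide does not state:

```python
# Bounds and P1 measurements copied from the slide; units assumed to match.
SPEC_BOUNDS = {
    "Gain": (41.5, 43.5),
    "IIP2": (-12.5, -10.5),
    "IIP3": (-8.0, -6.0),
}
BEFORE = {"Gain": 40.1, "IIP2": -10.0, "IIP3": -5.3}
AFTER = {"Gain": 41.5726, "IIP2": -11.0, "IIP3": -7.2569}


def violations(measured, bounds=SPEC_BOUNDS):
    """Names of the specs that fall outside their [lower, upper] bounds."""
    return [name for name, value in measured.items()
            if not (bounds[name][0] <= value <= bounds[name][1])]


if __name__ == "__main__":
    print("before tuning:", violations(BEFORE))
    print("after tuning: ", violations(AFTER))
```

Before tuning all three specs fall outside their bounds; after tuning all three are inside, which is what yield recovery means for this instance.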
Mixed-Signal
Concurrent Tuning of Multiple Specifications
Estimation of multiple specs using MARS
Fast convergence: 3 to 5 iterations for 4 knobs
Power Optimum convergence for large and small parameter deviations
30% improvement in yield
Ongoing Work
Hardware Analysis
Receiver Analysis
Concurrent tuning of transmitter and receiver
Conclusion
Cross Layer Design Techniques that facilitate Voltage Over-Scaling and Tolerance to Parametric Variations
Application to the design of energy efficient Logic (Digital and Mixed Signal) and Memory Blocks
Combination of the presented techniques can allow the design of low power and robust systems in the nano-scale as well as in the post-silicon era
Questions