pipelining high-radix srt division algorithmssquire/images/srt1.pdf · • pipelining techniques...
TRANSCRIPT
Pipelining High-Radix SRT Division Algorithms
Saurabh Upadhyay
Illinois Institute of Technology
Pipelining High-Radix SRT Division Algorithms – p. 1/20
Outline
• Introduction• Division by digit recurrence and pipelining• Use of different memory elements to pipeline the algorithm in
static CMOS• Use of pass-transistor logic• Domino clocking techniques for the algorithm• comparison and conclusion
Pipelining High-Radix SRT Division Algorithms – p. 2/20
Introduction
• SRT Division is very popular division algorithm used in currentmicroprocessors.
• It is named for D. Sweeny, J. E. Robertson and K. D. Tocher.• This dissertation shows different ways to pipeline SRT division
algorithm.• Simulation results are compared to find out the fastest possible
implementation.
Pipelining High-Radix SRT Division Algorithms – p. 3/20
What current microprocessors use?
Processor Division Algorithm Connectivity
DEC 21164 Alpha AXP SRT Adder-CoupledHal Sparc64 SRT IndependentHP PA7200 SRT IndependentHP PA8000 SRT Multiplier-accumulate-c/pIBM RS/6000 Power2 Newton-Raphson IntegratedIntel Pentium SRT Adder-coupledIntel Pentium Pro SRT IndependentMips R8000 Multiplicative IntegratedMips R10000 SRT Multiplier-coupledPowerPc 604 SRT IntegratedPowerPc 620 SRT IntegratedSun SuperSparc Goldschmidt Multiplier-integratedSun UltraSparc SRT Independent
Pipelining High-Radix SRT Division Algorithms – p. 4/20
Division by Digit Recurrence
• x = q · d + rem
Pipelining High-Radix SRT Division Algorithms – p. 5/20
Division by Digit Recurrence
• x = q · d + rem
• Divisor is normalized, i.e. 1
2≤ d < 1 for fractional division.
Pipelining High-Radix SRT Division Algorithms – p. 5/20
Division by Digit Recurrence
• x = q · d + rem
• Divisor is normalized, i.e. 1
2≤ d < 1 for fractional division.
• For normalized fractional divisor the quotient is in the range0 < q < 2
Pipelining High-Radix SRT Division Algorithms – p. 5/20
Division by Digit Recurrence
• x = q · d + rem
• Divisor is normalized, i.e. 1
2≤ d < 1 for fractional division.
• For normalized fractional divisor the quotient is in the range0 < q < 2
• wi+1 = r · wi − qi+1 · d
where w0 = x
Pipelining High-Radix SRT Division Algorithms – p. 5/20
Division by Digit Recurrence
• x = q · d + rem
• Divisor is normalized, i.e. 1
2≤ d < 1 for fractional division.
• For normalized fractional divisor the quotient is in the range0 < q < 2
• wi+1 = r · wi − qi+1 · d
where w0 = x
• Consists of n iterations, each iteration produces one digit of thequotient.
Pipelining High-Radix SRT Division Algorithms – p. 5/20
Division by Digit Recurrence (continued)
• qi+1 is selected such that wi+1 is bounded.
Pipelining High-Radix SRT Division Algorithms – p. 6/20
Division by Digit Recurrence (continued)
• qi+1 is selected such that wi+1 is bounded.• qi+1 = QST (r · wi, d) . . . Quotient Selection Table
Pipelining High-Radix SRT Division Algorithms – p. 6/20
Division by Digit Recurrence (continued)
• qi+1 is selected such that wi+1 is bounded.• qi+1 = QST (r · wi, d) . . . Quotient Selection Table
• q = q[n] = q[0] +∑n
i=1qir
−i
Pipelining High-Radix SRT Division Algorithms – p. 6/20
Division by Digit Recurrence (continued)
• qi+1 is selected such that wi+1 is bounded.• qi+1 = QST (r · wi, d) . . . Quotient Selection Table
• q = q[n] = q[0] +∑n
i=1qir
−i
• Quotient is formed by concatenation of quotient bits.
Pipelining High-Radix SRT Division Algorithms – p. 6/20
Division by Digit Recurrence (continued)
• qi+1 is selected such that wi+1 is bounded.• qi+1 = QST (r · wi, d) . . . Quotient Selection Table
• q = q[n] = q[0] +∑n
i=1qir
−i
• Quotient is formed by concatenation of quotient bits.• On-the-fly conversion is used for redundant quotient digit set.
2−1 MUX LOA
D A
ND
SH
IFT
−IN
CO
NT
RO
L
selectLoad
selectLoad
QM
Q QM
Q
QM
QM
Qin
in
qj+1q . .
.
Q REG
2−1 MUX
QM REG
Pipelining High-Radix SRT Division Algorithms – p. 6/20
Basic SRT Division Algorithm
{carry[8:0],2’b00}{sum[8:0],2’b00}
11
CSA
2d
i+1q state0
0 {3’b000,Dividend}
mux41/mux51 mux21
11
d= {3’b000, Divisor}__2d
__d
q={q2+,q+,q−,q2−,0}i+1
ulp
mux21
11
11
88
3
d
QST
11
wi+1 = r · wi − qi+1 · d
Pipelining High-Radix SRT Division Algorithms – p. 7/20
Pipelining
• Pipelining is a technique used to create clock cycle times forarithmetic data-paths.
• Usually, memory elements like flip-flops, transparent latches andpulsed latches are used to impose sequencing.
• Pipelining techniques also vary with different logic gate family.• Intelligent pipelining technique can improve performance greatly,
on the other hand poor pipelining introduces large sequencingoverheads.
Pipelining High-Radix SRT Division Algorithms – p. 8/20
Pipelining with Flip-Flops
11
CSA
clk FFFF
11
11
0 {3’b000,Dividend}d= {3’b000, Divisor}
mux41/mux51 mux21mux21
clk
11
11
i+1q state0
{q2+,q+,q−,q2−,0}
{sum[8:0],2’b00}{carry[8:0],2’b00}
____2ddd2d
ulp
Tc = ∆logic + ∆CQ + ∆DC + tskew
Pipelining High-Radix SRT Division Algorithms – p. 9/20
Pipelining with Flip-Flops
11
CSA
clk FFFF
11
11
0 {3’b000,Dividend}d= {3’b000, Divisor}
mux41/mux51 mux21mux21
clk
11
11
i+1q state0
{q2+,q+,q−,q2−,0}
{sum[8:0],2’b00}{carry[8:0],2’b00}
____2ddd2d
ulp
Tc = ∆logic + ∆CQ + ∆DC + tskew = 2ns
Pipelining High-Radix SRT Division Algorithms – p. 9/20
Pipelining with Latches
CSA
1111
{q2+,q+,q−,q2−,0}
2d d
i+1q state0
0 {3’b000,Dividend}d= {3’b000, Divisor}
mux41/mux51 mux21 mux21
latch
{carry[8:0],2’b00}{sum[8:0],2’b00}
clk_bar clk_bar
111111
11 1111
__ __2dd
ulp
clkclkclk
latchlatch
latch latch
Tc = ∆logic + 2 · ∆DQ
Pipelining High-Radix SRT Division Algorithms – p. 10/20
Pipelining with Latches
CSA
1111
{q2+,q+,q−,q2−,0}
2d d
i+1q state0
0 {3’b000,Dividend}d= {3’b000, Divisor}
mux41/mux51 mux21 mux21
latch
{carry[8:0],2’b00}{sum[8:0],2’b00}
clk_bar clk_bar
111111
11 1111
__ __2dd
ulp
clkclkclk
latchlatch
latch latch
Tc = ∆logic + 2 · ∆DQ = 1.8ns
Pipelining High-Radix SRT Division Algorithms – p. 10/20
Pipelining with Pulsed Latches
{q2+,q+,q−,q2−,0}
2d
i+1q state0
11
0 {3’b000,Dividend}d= {3’b000, Divisor}
mux41/mux51 mux21
ulp
11
{carry[8:0],2’b00}{sum[8:0],2’b00}
11
____d d 2d
clk
mux21
latchlatch
11
CSA
clk
11
Tc = ∆logic + ∆DQ + max(0, ∆DC + tskew − tpw)
Pipelining High-Radix SRT Division Algorithms – p. 11/20
Pipelining with Pulsed Latches
{q2+,q+,q−,q2−,0}
2d
i+1q state0
11
0 {3’b000,Dividend}d= {3’b000, Divisor}
mux41/mux51 mux21
ulp
11
{carry[8:0],2’b00}{sum[8:0],2’b00}
11
____d d 2d
clk
mux21
latchlatch
11
CSA
clk
11
Tc = ∆logic + ∆DQ + max(0, ∆DC + tskew − tpw) = 1.6ns
Pipelining High-Radix SRT Division Algorithms – p. 11/20
Use of Pass Transistor Logic
• Static multiplexors are replaced with transmission gatemultiplexors, a well-known application of pass transistor logic.
• The algorithm now needs a 5-1 multiplexor instead of a 4-1multiplexor.
• The time periods are now,1.6ns for sequencing using Flip-Flops,1.7ns for sequencing using Latches and1.4ns for sequencing using Pulsed Latches.
• The implementation with flip-flops is faster than the one withlatches in this case.
Pipelining High-Radix SRT Division Algorithms – p. 12/20
Using Domino Logic
• Designers prefer domino logic for high-performance systems.• Domino logic is considered twice as fast as the static CMOS logic,
but this is often not the case.• Requirement of dual-rail domino gates, remedies for
charge-sharing problem and clocking techniques used limit thespeed.
• Traditional domino clocking has large sequencing overhead.• Overlapping clocks can be used to hide sequencing overheads,
as shown ahead.
Pipelining High-Radix SRT Division Algorithms – p. 13/20
Traditional Domino Clocking
ulp
{carry[8:0],2’b00}
11x2
clk
latchlatch
latch latchlatch
CSA
{sum[8:0],2’b00}
{q2+,q+,q−,q2−,0}
state0
{3’b000,Dividend}d= {3’b000, Divisor}
mux21 mux21
11x2
clk_bar clk_bar
clk clk
____2d d d 2d 0
clkmux51
clk_bar
clkclk
qi+1
11x2 11x2
11x211x2
11x211x2
Tc = ∆logic + 2 · ∆DC + 2 · tskew
Pipelining High-Radix SRT Division Algorithms – p. 14/20
Traditional Domino Clocking
ulp
{carry[8:0],2’b00}
11x2
clk
latchlatch
latch latchlatch
CSA
{sum[8:0],2’b00}
{q2+,q+,q−,q2−,0}
state0
{3’b000,Dividend}d= {3’b000, Divisor}
mux21 mux21
11x2
clk_bar clk_bar
clk clk
____2d d d 2d 0
clkmux51
clk_bar
clkclk
qi+1
11x2 11x2
11x211x2
11x211x2
Tc = ∆logic + 2 · ∆DC + 2 · tskew = 1.6ns
Pipelining High-Radix SRT Division Algorithms – p. 14/20
Skew Tolerant Domino Clocking
Static
Dynam
ic
Static
Dynam
ic
Static
Dynam
ic
overlap
yx Phase 2
2
1
222111
Phase 1
T c
Dynam
ic
Dynam
ic
Static
Static
Dynam
ic
Static
Pipelining High-Radix SRT Division Algorithms – p. 15/20
Constraints
• te =Tc
N+ thold + tskew
• tp = tprech + tskew
• Tc = te + tp
• toverlap =N−1
NTc − tprech − tskew1 =
thold + tskew2 + tborrow
Pipelining High-Radix SRT Division Algorithms – p. 16/20
Skew Tolerant Domino Implementation
{sum[8:0],2’b00}
CSA
2d d
state0
11x2
0 {3’b000,Dividend}d= {3’b000, Divisor}
mux21 mux21
ulp
{carry[8:0],2’b00}
____2dd
11x2
11x2
mux51
phi2
phi1phi1 phi1
{q2+,q+,q−,q2−,0}
qi+1
Tc = ∆logic
Pipelining High-Radix SRT Division Algorithms – p. 17/20
Skew Tolerant Domino Implementation
{sum[8:0],2’b00}
CSA
2d d
state0
11x2
0 {3’b000,Dividend}d= {3’b000, Divisor}
mux21 mux21
ulp
{carry[8:0],2’b00}
____2dd
11x2
11x2
mux51
phi2
phi1phi1 phi1
{q2+,q+,q−,q2−,0}
qi+1
Tc = ∆logic = 1.1ns
Pipelining High-Radix SRT Division Algorithms – p. 17/20
Comparison
Sequencing method Time Period, Number of devicesTcns(FO4) (p-mos,n-mos)
Skew-tolerant domino 1.1 (6.2) 1052 (264, 788)Pulsed Latches (xmux) 1.4 (7.9) 1096 (548, 548)Pulsed Latches 1.6 (9) 964 (482, 482)Flip-Flops (xmux) 1.6 (9) 1140 (570, 570)Traditional domino 1.6 (9) 1932 (704, 1228)Transparent Latches (xmux) 1.7 (9.6) 1360 (680, 680)Transparent Latches 1.8 (10.1) 1228 (614, 614)Flip-Flops 2 (11.2) 1008 (504, 504)
Pipelining High-Radix SRT Division Algorithms – p. 18/20
Conclusions
• Speed improves significantly with a better pipelining approach.• Skew tolerant domino is the fastest implementation, yet has very
few devices used.• Alternatively, a pulsed latch implementation with transmission
gates offers high clock frequency.• Design complexity may also be an important factor while choosing
a sequencing method.
Pipelining High-Radix SRT Division Algorithms – p. 19/20
Acknowledgements
• My deepest gratitude goes to my advisor Dr. James Stine.• Inspiring working conditions at the Illinois Institute of Technology
greatly facilitated my research.• Thank you!
Pipelining High-Radix SRT Division Algorithms – p. 20/20