pipelining high-radix srt division algorithmssquire/images/srt1.pdf · • pipelining techniques...

Pipelining High-Radix SRT Division Algorithms

Saurabh Upadhyay

Illinois Institute of Technology

Pipelining High-Radix SRT Division Algorithms – p. 1/20

Outline

• Introduction• Division by digit recurrence and pipelining• Use of different memory elements to pipeline the algorithm in

static CMOS• Use of pass-transistor logic• Domino clocking techniques for the algorithm• comparison and conclusion


Introduction

• SRT Division is very popular division algorithm used in currentmicroprocessors.

• It is named for D. Sweeny, J. E. Robertson and K. D. Tocher.• This dissertation shows different ways to pipeline SRT division

algorithm.• Simulation results are compared to find out the fastest possible

implementation.


What current microprocessors use?

Processor Division Algorithm Connectivity

DEC 21164 Alpha AXP SRT Adder-CoupledHal Sparc64 SRT IndependentHP PA7200 SRT IndependentHP PA8000 SRT Multiplier-accumulate-c/pIBM RS/6000 Power2 Newton-Raphson IntegratedIntel Pentium SRT Adder-coupledIntel Pentium Pro SRT IndependentMips R8000 Multiplicative IntegratedMips R10000 SRT Multiplier-coupledPowerPc 604 SRT IntegratedPowerPc 620 SRT IntegratedSun SuperSparc Goldschmidt Multiplier-integratedSun UltraSparc SRT Independent


Division by Digit Recurrence

• x = q · d + rem



• x = q · d + rem

• Divisor is normalized, i.e. 1

2≤ d < 1 for fractional division.



• x = q · d + rem



• For normalized fractional divisor the quotient is in the range0 < q < 2



• x = q · d + rem




• wi+1 = r · wi − qi+1 · d

where w0 = x



• x = q · d + rem




• wi+1 = r · wi − qi+1 · d

where w0 = x

• Consists of n iterations, each iteration produces one digit of thequotient.


Division by Digit Recurrence (continued)

• qi+1 is selected such that wi+1 is bounded.



• qi+1 is selected such that wi+1 is bounded.• qi+1 = QST (r · wi, d) . . . Quotient Selection Table




• q = q[n] = q[0] +∑n

i=1qir

−i




• q = q[n] = q[0] +∑n

i=1qir

−i

• Quotient is formed by concatenation of quotient bits.




• q = q[n] = q[0] +∑n

i=1qir

−i

• Quotient is formed by concatenation of quotient bits.• On-the-fly conversion is used for redundant quotient digit set.

2−1 MUX LOA

D A

ND

SH

IFT

−IN

CO

NT

RO

L

selectLoad

selectLoad

QM

Q QM

Q

QM

QM

Qin

in

qj+1q . .

.

Q REG

2−1 MUX

QM REG


Basic SRT Division Algorithm

{carry[8:0],2’b00}{sum[8:0],2’b00}

11

CSA

2d

i+1q state0

0 {3’b000,Dividend}

mux41/mux51 mux21

11

d= {3’b000, Divisor}__2d

__d

q={q2+,q+,q−,q2−,0}i+1

ulp

mux21

11

11

88

3

d

QST

11

wi+1 = r · wi − qi+1 · d


Pipelining

• Pipelining is a technique used to create clock cycle times forarithmetic data-paths.

• Usually, memory elements like flip-flops, transparent latches andpulsed latches are used to impose sequencing.

• Pipelining techniques also vary with different logic gate family.• Intelligent pipelining technique can improve performance greatly,

on the other hand poor pipelining introduces large sequencingoverheads.


Pipelining with Flip-Flops

11

CSA

clk FFFF

11

11

0 {3’b000,Dividend}d= {3’b000, Divisor}

mux41/mux51 mux21mux21

clk

11

11

i+1q state0

{q2+,q+,q−,q2−,0}

{sum[8:0],2’b00}{carry[8:0],2’b00}

____2ddd2d

ulp

Tc = ∆logic + ∆CQ + ∆DC + tskew


Pipelining with Flip-Flops

11

CSA

clk FFFF

11

11


mux41/mux51 mux21mux21

clk

11

11

i+1q state0

{q2+,q+,q−,q2−,0}

{sum[8:0],2’b00}{carry[8:0],2’b00}

____2ddd2d

ulp

Tc = ∆logic + ∆CQ + ∆DC + tskew = 2ns


Pipelining with Latches

CSA

1111

{q2+,q+,q−,q2−,0}

2d d

i+1q state0


mux41/mux51 mux21 mux21

latch

{carry[8:0],2’b00}{sum[8:0],2’b00}

clk_bar clk_bar

111111

11 1111

__ __2dd

ulp

clkclkclk

latchlatch

latch latch

Tc = ∆logic + 2 · ∆DQ


Pipelining with Latches

CSA

1111

{q2+,q+,q−,q2−,0}

2d d

i+1q state0


mux41/mux51 mux21 mux21

latch

{carry[8:0],2’b00}{sum[8:0],2’b00}

clk_bar clk_bar

111111

11 1111

__ __2dd

ulp

clkclkclk

latchlatch

latch latch

Tc = ∆logic + 2 · ∆DQ = 1.8ns


Pipelining with Pulsed Latches

{q2+,q+,q−,q2−,0}

2d

i+1q state0

11


mux41/mux51 mux21

ulp

11

{carry[8:0],2’b00}{sum[8:0],2’b00}

11

____d d 2d

clk

mux21

latchlatch

11

CSA

clk

11

Tc = ∆logic + ∆DQ + max(0, ∆DC + tskew − tpw)


Pipelining with Pulsed Latches

{q2+,q+,q−,q2−,0}

2d

i+1q state0

11


mux41/mux51 mux21

ulp

11

{carry[8:0],2’b00}{sum[8:0],2’b00}

11

____d d 2d

clk

mux21

latchlatch

11

CSA

clk

11

Tc = ∆logic + ∆DQ + max(0, ∆DC + tskew − tpw) = 1.6ns


Use of Pass Transistor Logic

• Static multiplexors are replaced with transmission gatemultiplexors, a well-known application of pass transistor logic.

• The algorithm now needs a 5-1 multiplexor instead of a 4-1multiplexor.

• The time periods are now,1.6ns for sequencing using Flip-Flops,1.7ns for sequencing using Latches and1.4ns for sequencing using Pulsed Latches.

• The implementation with flip-flops is faster than the one withlatches in this case.


Using Domino Logic

• Designers prefer domino logic for high-performance systems.• Domino logic is considered twice as fast as the static CMOS logic,

but this is often not the case.• Requirement of dual-rail domino gates, remedies for

charge-sharing problem and clocking techniques used limit thespeed.

• Traditional domino clocking has large sequencing overhead.• Overlapping clocks can be used to hide sequencing overheads,

as shown ahead.


Traditional Domino Clocking

ulp

{carry[8:0],2’b00}

11x2

clk

latchlatch

latch latchlatch

CSA

{sum[8:0],2’b00}

{q2+,q+,q−,q2−,0}

state0

{3’b000,Dividend}d= {3’b000, Divisor}

mux21 mux21

11x2

clk_bar clk_bar

clk clk

____2d d d 2d 0

clkmux51

clk_bar

clkclk

qi+1

11x2 11x2

11x211x2

11x211x2

Tc = ∆logic + 2 · ∆DC + 2 · tskew


Traditional Domino Clocking

ulp

{carry[8:0],2’b00}

11x2

clk

latchlatch

latch latchlatch

CSA

{sum[8:0],2’b00}

{q2+,q+,q−,q2−,0}

state0

{3’b000,Dividend}d= {3’b000, Divisor}

mux21 mux21

11x2

clk_bar clk_bar

clk clk

____2d d d 2d 0

clkmux51

clk_bar

clkclk

qi+1

11x2 11x2

11x211x2

11x211x2

Tc = ∆logic + 2 · ∆DC + 2 · tskew = 1.6ns


Skew Tolerant Domino Clocking

Static

Dynam

ic

Static

Dynam

ic

Static

Dynam

ic

overlap

yx Phase 2

2

1

222111

Phase 1

T c

Dynam

ic

Dynam

ic

Static

Static

Dynam

ic

Static


Constraints

• te =Tc

N+ thold + tskew

• tp = tprech + tskew

• Tc = te + tp

• toverlap =N−1

NTc − tprech − tskew1 =

thold + tskew2 + tborrow


Skew Tolerant Domino Implementation

{sum[8:0],2’b00}

CSA

2d d

state0

11x2


mux21 mux21

ulp

{carry[8:0],2’b00}

____2dd

11x2

11x2

mux51

phi2

phi1phi1 phi1

{q2+,q+,q−,q2−,0}

qi+1

Tc = ∆logic


Skew Tolerant Domino Implementation

{sum[8:0],2’b00}

CSA

2d d

state0

11x2


mux21 mux21

ulp

{carry[8:0],2’b00}

____2dd

11x2

11x2

mux51

phi2

phi1phi1 phi1

{q2+,q+,q−,q2−,0}

qi+1

Tc = ∆logic = 1.1ns


Comparison

Sequencing method Time Period, Number of devicesTcns(FO4) (p-mos,n-mos)

Skew-tolerant domino 1.1 (6.2) 1052 (264, 788)Pulsed Latches (xmux) 1.4 (7.9) 1096 (548, 548)Pulsed Latches 1.6 (9) 964 (482, 482)Flip-Flops (xmux) 1.6 (9) 1140 (570, 570)Traditional domino 1.6 (9) 1932 (704, 1228)Transparent Latches (xmux) 1.7 (9.6) 1360 (680, 680)Transparent Latches 1.8 (10.1) 1228 (614, 614)Flip-Flops 2 (11.2) 1008 (504, 504)


Conclusions

• Speed improves significantly with a better pipelining approach.• Skew tolerant domino is the fastest implementation, yet has very

few devices used.• Alternatively, a pulsed latch implementation with transmission

gates offers high clock frequency.• Design complexity may also be an important factor while choosing

a sequencing method.


Acknowledgements

• My deepest gratitude goes to my advisor Dr. James Stine.• Inspiring working conditions at the Illinois Institute of Technology

greatly facilitated my research.• Thank you!


pipelining high-radix srt division algorithmssquire/images/srt1.pdf · • pipelining techniques...

Documents