CALTECH CS137 Winter2006 -- DeHon 1
CS137: Electronic Design Automation
Day 9: January 30, 2006
Parallel Prefix
Today
• Bit-Level
  – Addition
  – LUT Cascades
• For Sums
  – Applications
• FSMs
• SATADD
• Data Forwarding
• Pointer Jumping
  – Applications
Introduction / Reminder
Addition in Log Time
Ripple Carry Addition
• Simple “definition” of addition
• Serially resolve carry at each bit
CLA
• Think about each adder bit as computing a function on the carry in
  – c[i]=g(c[i-1])
  – Particular function g will depend on a[i], b[i]
  – g=f(a[i],b[i])
Functions
• What functions can g(c[i-1]) be?
  – g(x)=1 Generate
    • a[i]=b[i]=1
  – g(x)=x Propagate
    • a[i] xor b[i]=1
  – g(x)=0 Squash
    • a[i]=b[i]=0
Combining
• Want to combine functions
  – Compute c[i]=g[i](g[i-1](c[i-2]))
  – Compute compose of two functions
• What functions will the compose of two of these functions be?
  – Same as before
    • Propagate, generate, squash
Compose Rules
(LSB MSB)   Compose Result
GG          G
GP          G
GS          S
PG          G
PP          P
PS          S
SG          G
SP          S
SS          S
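The compose rules can be checked mechanically. The sketch below is only an illustration: the names `bit_function`, `compose`, `apply_fn`, and `carry_out` are hypothetical, and the left-to-right `reduce` stands in for the tree a real adder would use. It classifies each bit position as G, P, or S, composes the per-bit functions, and applies the result to the carry in.

```python
from functools import reduce

def bit_function(a: int, b: int) -> str:
    """Classify one adder bit as Generate, Propagate, or Squash."""
    if a == 1 and b == 1:
        return "G"
    if a != b:
        return "P"
    return "S"

def compose(lsb: str, msb: str) -> str:
    """Compose two carry functions; the MSB function is applied last."""
    # A propagating MSB passes the LSB behaviour through unchanged;
    # a generating or squashing MSB overrides it.
    return lsb if msb == "P" else msb

def apply_fn(fn: str, carry_in: int) -> int:
    """Evaluate a carry function on a concrete carry in."""
    return {"G": 1, "S": 0, "P": carry_in}[fn]

def carry_out(a_bits, b_bits, carry_in=0):
    """Carry out of the whole adder; bit lists are LSB first."""
    fns = [bit_function(a, b) for a, b in zip(a_bits, b_bits)]
    return apply_fn(reduce(compose, fns), carry_in)
```

Because `compose` is associative, the same per-bit functions can be combined in any grouping, which is what allows a log-depth tree.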
Combining
• Do it again…
• Combine g[i-3,i-2] and g[i-1,i]
• What do we get?
Reduce Tree
Associative Reduce Prefix
• Shows us how to compute the Nth value in O(log(N)) time
• Can actually produce all intermediate values in this time
  – w/ only a constant factor more hardware
Prefix Tree
Parallel Prefix
• Important Pattern
• Applicable any time operation is associative
• Function Composition is always associative
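As a rough software model of the pattern, the sketch below computes all prefixes of a list under any associative operator. The function name `parallel_prefix` is hypothetical. Each recursion level corresponds to one tree level, and all operations within a level are independent, so hardware could perform them at the same time; here they simply run in a loop.

```python
import operator

def parallel_prefix(op, xs):
    """Inclusive prefix of xs under an associative op, in tree order."""
    n = len(xs)
    if n <= 1:
        return list(xs)
    # Reduce step: combine adjacent pairs (all pairs are independent).
    pairs = [op(xs[i], xs[i + 1]) for i in range(0, n - 1, 2)]
    # Recurse on the half-length problem.
    sub = parallel_prefix(op, pairs)
    # Prefix step: odd positions come straight from the recursion,
    # even positions need one more op with their own element.
    out = [xs[0]]
    for i in range(1, n):
        if i % 2 == 1:
            out.append(sub[i // 2])
        else:
            out.append(op(sub[i // 2 - 1], xs[i]))
    return out

# Example use: running sums of a short list.
running = parallel_prefix(operator.add, [1, 2, 3, 4, 5])
```

The operator only has to be associative, not commutative: operand order is preserved at every level.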
Generalizing
LUT Cascade
Cascaded LUT Delay Model
• Tcascade = T(3LUT) + T(mux)
• Don’t pay
  – General interconnect
  – Full 4-LUT delay
Parallel Prefix LUT Cascade?
• Can we do better than N×Tmux?
• Can we compute LUT cascade in O(log(N)) time?
• Can we compute mux cascade using parallel prefix?
• Can we make mux cascade associative?
Parallel Prefix Mux cascade
• How can mux transform S → mux-out?
  – A=0, B=0 → mux-out=0   Stop = S
  – A=1, B=1 → mux-out=1   Generate = G
  – A=0, B=1 → mux-out=S   Buffer = B
  – A=1, B=0 → mux-out=/S  Invert = I
Parallel Prefix Mux cascade
• How can 2 muxes transform input?
• Can I compute 2-mux transforms from 1 mux transforms?
Two-mux transforms
(first mux, second mux → combined transform)
SS→S   SG→G   SB→S   SI→G
GS→S   GG→G   GB→G   GI→S
BS→S   BG→G   BB→B   BI→I
IS→S   IG→G   IB→I   II→B
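The two-mux table can be derived rather than memorized. The sketch below assumes the mux selects A when S=0 and B when S=1, which matches the single-mux transforms listed earlier; the helper names are hypothetical. It composes two muxes by feeding the first mux's output into the second mux's select, then names the resulting truth table.

```python
from itertools import product

def mux(s: int, a: int, b: int) -> int:
    """2:1 mux under the assumed convention: output A when S=0, B when S=1."""
    return b if s else a

# Each (A, B) setting is also the truth table (output at S=0, output at S=1).
NAMES = {(0, 0): "S", (1, 1): "G", (0, 1): "B", (1, 0): "I"}
SETTINGS = {name: ab for ab, name in NAMES.items()}

def compose(first: str, second: str) -> str:
    """Transform of two cascaded muxes; `first` drives the select of `second`."""
    fa, fb = SETTINGS[first]
    sa, sb = SETTINGS[second]
    outs = tuple(mux(mux(s, fa, fb), sa, sb) for s in (0, 1))
    return NAMES[outs]

def is_associative() -> bool:
    """Check (x.y).z == x.(y.z) for every triple of transforms."""
    return all(compose(compose(x, y), z) == compose(x, compose(y, z))
               for x, y, z in product("SGBI", repeat=3))
```

Since the four transforms are closed under composition and the check over all 64 triples succeeds, the mux cascade fits the parallel-prefix pattern.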
Generalizing mux-cascade
• How can N muxes transform the input?
• Is mux transform composition associative?
Associative Reduce Mux-Cascade
Can be hardwired, no general interconnect
For Sums
Prefix Sum
• Common Operation:
  – Want B[x] such that B[x]=A[0]+A[1]+…+A[x]
  – For I=0 to x
    • B[I]=B[I-1]+A[I]
Prefix Sum
• Compute in tree fashion
  – A[I]+A[I+1]
  – A[I]+A[I+1]+A[I+2]+A[I+3]
  – …
• Combine partial sums back down tree
  – S(0:7)+S(8:9)+S(10)=S(0:10)
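One way to see which tree nodes contribute to a given prefix is to read the binary representation of x+1. The sketch below (the function name `prefix_blocks` is hypothetical) lists the aligned power-of-two blocks whose partial sums add up to S(0:x); for x=10 it reproduces the S(0:7)+S(8:9)+S(10) example.

```python
def prefix_blocks(x: int):
    """Aligned power-of-two blocks (lo, hi) whose sums add up to S(0:x)."""
    blocks, start, remaining = [], 0, x + 1
    size = 1 << remaining.bit_length()
    while remaining:
        size >>= 1
        if remaining & size:
            # This bit of x+1 is set, so one block of this size is used.
            blocks.append((start, start + size - 1))
            start += size
            remaining -= size
    return blocks
```

Each prefix needs at most about log2(x+1) blocks, one per tree level, which is why the downward pass adds only logarithmic depth.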
Other simple operators
• Prefix-OR
• Prefix-AND
• Prefix-MAX
• Prefix-MIN
Find-First One
• Useful for arbitration
  – Finds first (highest-priority) requestor
  – Also magnitude finding in numbers
• How:
  – Prefix-OR
  – Locally compute X[I-1]^X[I]
  – Flags the first one
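A small sketch of the find-first-one recipe follows. It assumes index 0 has the highest priority and that X[-1] is taken as 0; `itertools.accumulate` is sequential and only stands in for a parallel prefix-OR. The function name is hypothetical.

```python
from itertools import accumulate
from operator import or_

def find_first_one(bits):
    """One-hot flag at the first 1 in bits (index 0 has highest priority)."""
    seen = list(accumulate(bits, or_))          # prefix-OR
    prev = [0] + seen[:-1]                      # X[i-1], with X[-1] = 0
    return [p ^ x for p, x in zip(prev, seen)]  # 1 only where the OR first rises
```

The prefix-OR is monotone (0...0 1...1), so the XOR of neighbors is 1 at exactly one position, or nowhere if there is no request.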
Arbitration
• Often want to find first M requestors
  – E.g. Assign unique memory ports to first M processors requesting
• Prefix-sum across all potential requesters
• Counts requesters, giving unique number to each
• Know if one of first M
  – Perhaps which resource assigned
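The arbitration idea can be sketched in a few lines. This is an illustration under assumed conventions: requests are 0/1 flags, ports are numbered from 0, and `None` means no grant. `accumulate` again stands in for a parallel prefix-sum, and `arbitrate` is a hypothetical name.

```python
from itertools import accumulate

def arbitrate(requests, m):
    """Give ports 0..m-1 to the first m requesters; None means no grant."""
    counts = list(accumulate(requests))   # prefix-sum of request bits
    # A requester's count is its 1-based rank among requesters.
    return [c - 1 if r and c <= m else None
            for r, c in zip(requests, counts)]
```

For example, with requests `[0, 1, 1, 0, 1, 1]` and two ports, positions 1 and 2 receive ports 0 and 1 and the later requesters receive nothing.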
Partitioning
• Use something to order
  – E.g. spectral linear ordering
  – …or 1D cellular swap to produce linear order
• Parallel prefix on area of units
  – If not all same area
• Know where the midpoint is
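A minimal sketch of the midpoint step, assuming the units are already in linear order and have positive areas. The name `midpoint_cut` and the choice to cut at the first unit where the running area reaches half the total are illustrative assumptions.

```python
from bisect import bisect_left
from itertools import accumulate

def midpoint_cut(areas):
    """Smallest index k such that units 0..k hold at least half the total area."""
    running = list(accumulate(areas))   # prefix-sum of unit areas
    return bisect_left(running, running[-1] / 2)
```

In a parallel setting each unit could compare its own prefix value against half the total locally, so no search is needed after the prefix-sum.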
Channel Width
• Prefix sum on delta wires at each node
  – To compute net channel widths at all points along channel
  – E.g. 1D ordered
• Maybe use with cellular placement scheme
Rank Finding
• Looking for I’th ordered element (counting from the largest)
• Do a prefix-sum on high-bit only
  – Know m=number of things > 01111111… (i.e. with high bit set)
• High-low search on result
  – I.e. if m ≥ I, recurse on half with high-bit true
  – If m < I, search for (I-m)’th element in half with leading zero
• Find median in log^2(N) time
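The bit-by-bit search can be sketched as follows. The conventions are assumptions made for the example: values are unsigned and fit in `bits` bits, and `i` is 1-indexed and counts from the largest. The count `m` is obtained with `len` here; in hardware it would come from a sum-reduce over the current bit. The name `select_largest` is hypothetical.

```python
def select_largest(values, i, bits):
    """Return the i'th largest (1-indexed) of the unsigned `bits`-bit values."""
    candidates = list(values)
    for bit in reversed(range(bits)):
        high = [v for v in candidates if (v >> bit) & 1]
        m = len(high)   # in hardware: a sum-reduce over the current bit
        if m >= i:
            candidates = high          # answer has this bit set
        else:
            candidates = [v for v in candidates if not (v >> bit) & 1]
            i -= m                     # skip the m larger values
    return candidates[0]
```

Each of the `bits` rounds costs one log-depth reduce, which is where a log-squared bound comes from when the word width is itself logarithmic in N.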
FA/FSM Evaluation
(regular expression recognition)
Finite Automata
• Machine has finite state: S
• On each cycle
  – Input I
  – Compute output and new state
    • Based on inputs and current state
• Oi,S(i+1)=f(Si,Ii)
• Intuitively, a sequential process
  – Must know previous state to compute next
  – Must know state to compute output
Function Specialization
• But, this is just functions
  – …and function composition is associative
• Given that we know input sequence:
  – I0,I1,I2…
• Can compute specialized functions:
  – fi(s)=f(s,Ii)
• What is fi(s)?
  – Worst-case, a translation table:
    • S=0 → NS0, S=1 → NS1 ….
Function Composition
• Now: O(i+m),S(i+m+1)=f(i+m)(f(i+m-1)(f(i+m-2)(…fi(Si))))
• Can we compute the function composition?
  – f(i+1,i)(s)=f(i+1)(fi(s))
  – What is f(i+1,i)(s)?
    • A translation table just like fi(s) and f(i+1)(s)
    • Table of size |S|, can fill in in O(|S|) time
Recursive Function Composition
• Now: O(i+m),S(i+m+1)=f(i+m)(f(i+m-1)(f(i+m-2)(…fi(Si))))
• We can compute the composition
  – f(i+1,i)(s)=f(i+1)(fi(s))
• Repeat to compute
  – f(i+3,i)(s)=f(i+3,i+2)(f(i+1,i)(s))
  – Etc. until have computed: f(i+m,i)(s) in O(log(m)) steps
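The recursive composition can be sketched with translation tables stored as tuples. Everything named here is hypothetical, including the example automaton `step`, a 3-state FA that latches state 2 once the input has contained "ab". Tables are combined pairwise, level by level, and the final table is applied to the start state.

```python
def step(state, ch):
    """Hypothetical 3-state FA that latches state 2 once it has seen 'ab'."""
    if state == 2 or (state == 1 and ch == "b"):
        return 2
    return 1 if ch == "a" else 0

def specialize(f, symbol, states):
    """Translation table for fi(s) = f(s, Ii): entry s holds the next state."""
    return tuple(f(s, symbol) for s in states)

def compose(first, second):
    """Table for second(first(s)); filling it in takes O(|S|) lookups."""
    return tuple(second[s] for s in first)

def tree_reduce(op, items):
    """Combine a non-empty list pairwise, level by level: O(log m) levels."""
    items = list(items)
    while len(items) > 1:
        nxt = [op(items[k], items[k + 1]) for k in range(0, len(items) - 1, 2)]
        if len(items) % 2:
            nxt.append(items[-1])   # odd element carries up unchanged
        items = nxt
    return items[0]
```

For an input string `text`, `tree_reduce(compose, [specialize(step, ch, (0, 1, 2)) for ch in text])[0]` gives the state reached from state 0, the same answer as stepping through the characters one at a time.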
Implications
• If can get input stream,
  – Any FA can be evaluated in O(log(N)) time
  – Regular Expression recognition in O(log(N))
• Any streaming operator with finite state
  – Where the input stream is independent of the output stream
  – Can be run arbitrarily fast by using parallel-prefix on FSM evaluation
Saturated Addition
• S(i+1)=max(min(Ii+Si,maxval),minval)
• Could model as FSM with:
  – |S|=maxval-minval
• So, in theory, FSM result applies
• …but |S| might be 2^16, 2^24
SATADD Composition
• Can compute composition efficiently
[Papadantonakis et al. FPT2005]
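One way to see why a compact composition exists is to note that "add an offset, then clamp to a range" is closed under composition. The sketch below uses that representation as an assumption; it is not necessarily the formulation used in the cited paper, and all names are hypothetical. Each step is stored as an (offset, low, high) triple instead of a table with one entry per state.

```python
from functools import reduce

def clamp(x, lo, hi):
    return max(min(x, hi), lo)

def satadd(i, minval, maxval):
    """Represent s -> clamp(s + i, minval, maxval) as an (offset, low, high) triple."""
    return (i, minval, maxval)

def apply_fn(fn, s):
    add, lo, hi = fn
    return clamp(s + add, lo, hi)

def compose(first, second):
    """Triple for second(first(s)); the result is again add-then-clamp."""
    a1, lo1, hi1 = first
    a2, lo2, hi2 = second
    # Shift the first range by the second offset, then clamp its ends
    # into the second range.
    return (a1 + a2, clamp(lo1 + a2, lo2, hi2), clamp(hi1 + a2, lo2, hi2))

def compose_all(fns):
    """Left-to-right fold; a tree of compose calls gives the same triple."""
    return reduce(compose, fns)
```

The composed function costs three numbers regardless of |S|, so the reduce tree avoids the 2^16-entry or 2^24-entry tables that the generic FSM construction would need.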
SATADD Composition
SATADD Reduce Tree
Data Forwarding
UltraScalar
From Henry, Kuszmaul, et al., ARVLSI’99, SPAA’99, ISCA’00
Consider Machine
• Each FU has a full RF
  – FU=Functional Unit
  – RF=Register File
• Build network between FUs
  – use network to connect produce/consume
  – use register names to configure interconnect
• Signal data ready along network
Ultrascalar: concept model
Ultrascalar Concept
• Linear delay
• O(1) register cost / FU
• Complete renaming at each FU
  – different set of registers
  – so when say complete RF at each FU, that’s only the logical registers
Ultrascalar: cyclic prefix
Parallel Prefix
• Basic idea is one we saw with adders
• An FU will either
  – produce a register (generate)
  – or transmit a register (propagate)
  – can do tree combining
    • pair of FUs will either both propagate or will generate
    • compute function by pair in one stage
    • recurse to next stage
    • get log-depth tree network connecting producer and consumer
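For a single register, the generate/propagate view reduces to a "last writer wins" operator. The sketch below is a simplified linear (non-cyclic) model with hypothetical names: `writes[k]` is the value FU k produces for the register, or `None` if it only propagates, and `initial` is assumed not to be `None`. It uses a doubling scan (log2(N) rounds) rather than the tree network itself.

```python
def combine(earlier, later):
    """Summary of two adjacent FU groups for one register: last writer wins."""
    return later if later is not None else earlier

def forwarded_inputs(initial, writes):
    """Value of one register as seen on input by each FU, in program order."""
    # FU k sees the initial value plus the writes of FUs 0..k-1.
    vals = ([initial] + list(writes))[:len(writes)]
    step = 1
    while step < len(vals):
        # Every position combines with the one `step` places earlier, all at once.
        vals = [combine(vals[k - step], vals[k]) if k >= step else vals[k]
                for k in range(len(vals))]
        step *= 2
    return vals
```

For example, `forwarded_inputs(10, [None, 3, None, None, 7, None])` yields `[10, 10, 3, 3, 3, 7]`: each FU receives the value from the nearest earlier producer.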
Ultrascalar: cyclic prefix
Pointer Jumping
Pointer Jumping Motivation
• Have a tree
  – E.g. is-a relationship tree in NETL
• Want to know if a node is of a particular type (is-a mammal)
• How long to find out?
  – Naïve: O(distance)
    • Spread one level per timestep
Following Pointer Chain
• Naïve: spread/color from target node
  – On each step push down to children
• Most nodes idle
  – Only active on the step something arrives
• Can the idle nodes do something to accelerate?
Jumping Intermediates
• Add notion of transitive parent
• Initially: transitive-parent=parent
• On each step:
  – If my transitive-parent marked
    • Mark self
  – else
    • Transitive-parent = transitive-parent(transitive-parent)
How Much Jumping?
• On each step:
  – If my transitive-parent marked
    • Mark self
  – else
    • Transitive-parent = transitive-parent(transitive-parent)
• How many such steps?
  – O(log(distance))
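The step rule can be simulated directly. The sketch below assumes the tree is given as a dictionary from node to parent, with the root mapped to itself; `mark_descendants` is a hypothetical name. All nodes update at once, so each round reads only the state left by the previous round.

```python
def mark_descendants(parent, target):
    """Mark every node below `target`; returns (marked nodes, steps used)."""
    tp = dict(parent)          # transitive-parent, initially the parent
    marked = {target}
    steps = 0
    while True:
        # All nodes act at once, so both sets are computed from the old state.
        newly = {n for n in tp if n not in marked and tp[n] in marked}
        jumped = {n: tp[tp[n]] for n in tp
                  if n not in marked and n not in newly}
        if not newly and all(tp[n] == p for n, p in jumped.items()):
            return marked, steps
        marked |= newly
        tp.update(jumped)
        steps += 1
```

On a chain of 8 nodes with the target at the top, the marked region grows to distance 1, 3, then 7, so three steps suffice where one-level-per-step spreading needs seven.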
Pointer Jumping
• Same basic idea as data forwarding
• Can find length of a list in O(log(length)) time
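List length follows from the same jumping idea if each node also accumulates how far its pointer has travelled. In this sketch (hypothetical names; `nxt[n]` is the successor of node n, or `None` at the tail) every node ends with its distance to the tail, so the head's value plus one is the list length.

```python
def list_ranks(nxt):
    """Distance from each node to the end of its list; nxt[n] is None at the tail."""
    rank = {n: 0 if nxt[n] is None else 1 for n in nxt}
    ptr = dict(nxt)
    while any(p is not None for p in ptr.values()):
        # Every node jumps at once, so read only the old rank/ptr values.
        new_rank = {n: rank[n] + (rank[ptr[n]] if ptr[n] is not None else 0)
                    for n in ptr}
        new_ptr = {n: (ptr[ptr[n]] if ptr[n] is not None else None)
                   for n in ptr}
        rank, ptr = new_rank, new_ptr
    return rank
```

Each round doubles the span every live pointer covers, so the loop runs about log2(length) times.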
Variations
Segmented Parallel Prefix
• fi() can ignore its input
– …or the function can let special I’s tell it to reset the state
• E.g. build huge/hardwired carry chain hardware and configurably break into separate adders (LUT cascades)
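A common way to express segmentation, shown here for sums as an illustrative assumption, is to pair each value with a flag marking the start of a segment and fold the reset into the operator. The names are hypothetical, and `accumulate` stands in for a parallel prefix network.

```python
from itertools import accumulate

def seg_op(left, right):
    """Associative combine on (starts_segment, value) pairs for a segmented sum."""
    lflag, lval = left
    rflag, rval = right
    # A segment start on the right discards everything accumulated so far.
    return (lflag or rflag, rval if rflag else lval + rval)

def segmented_prefix_sum(values, starts):
    """Running sums that restart wherever starts[k] is True."""
    pairs = list(zip(starts, values))
    return [v for _, v in accumulate(pairs, seg_op)]
```

Because `seg_op` is still associative, the same hardwired prefix network works unchanged; only the flags decide where it is broken into separate chains.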
Cyclic Segmented Parallel Prefix
• Wrap output back to input
• Configurable segmentation defines the starting/stopping point
• E.g.
  – In Ultrascalar data forwarding
    • Leave data in place and use FUs in FIFO fashion, redefining the “head” at each cycle
  – Priority allocation scheme
    • Mark priority item as start of segment
      – Perhaps choose randomly (e.g. hardware router)
Admin
• Class Wed.
• Baseline due Friday
Big Ideas
• Any associative operation can be made parallel
  – Performed in log(N) time with O(N) hardware
• Any Finite Automata computation can be accelerated with parallelism
  – (FA evaluation is in NC)
• Function composition is associative, so all functional operations can be associative