IMPLEMENTATION OF FINITE FIELD INVERSION
Debdeep Mukhopadhyay, Chester Rebeiro
Dept. of Computer Science and Engineering
Indian Institute of Technology Kharagpur
INDIA
Finite Field Inverse
23-27 May 2011, Anurag Labs, DRDO
Itoh-Tsujii Method for Binary Fields
The Steps
How do we do a Squaring
Consider (again) the field GF(2^4), with irreducible polynomial x^4 + x + 1. What is (x^3 + x^2 + 1)^2 in this field?
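As a worked check of the question above, the following Python sketch squares an element of GF(2^4) using bitmask polynomials (bit i holds the coefficient of x^i). The function name gf_square and the bitmask encoding are illustrative assumptions and do not come from the slides.

```python
def gf_square(a, poly, m):
    """Square a in GF(2^m); a and poly are bitmasks (bit i = coefficient of x^i)."""
    # Over GF(2), squaring spreads the bits: the coefficient of x^i moves to x^(2i).
    spread = 0
    for i in range(m):
        if (a >> i) & 1:
            spread |= 1 << (2 * i)
    # Reduce modulo the irreducible polynomial, highest degree first.
    for i in range(2 * m - 2, m - 1, -1):
        if (spread >> i) & 1:
            spread ^= poly << (i - m)
    return spread


a = 0b1101        # x^3 + x^2 + 1
poly = 0b10011    # x^4 + x + 1
print(bin(gf_square(a, poly, 4)))  # 0b1110, i.e. x^3 + x^2 + x
```

By hand: (x^3 + x^2 + 1)^2 = x^6 + x^4 + 1. With x^4 = x + 1 and x^6 = x^3 + x^2, this reduces to x^3 + x^2 + x.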
Squaring
Squaring can be represented in the form of a matrix multiplication T·a.
Quad Operation
A quad operation can be done by two squaring operations.
A quad operation can be written in the form T^2·a.
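To make the matrix view concrete, here is a minimal Python sketch that builds the squaring matrix T for GF(2^4) with x^4 + x + 1, forms T^2 by composition, and checks that one application of T^2 matches two squarings. The row-bitmask representation and the helper names (squaring_matrix, apply_matrix, compose) are assumptions made for illustration.

```python
def squaring_matrix(poly, m):
    """Return T as m row bitmasks; bit j of row i is the entry T[i][j]."""
    rows = [0] * m
    for j in range(m):
        col = 1 << (2 * j)                      # x^j squared is x^(2j)
        for i in range(2 * m - 2, m - 1, -1):   # reduce modulo poly
            if (col >> i) & 1:
                col ^= poly << (i - m)
        for i in range(m):
            if (col >> i) & 1:
                rows[i] |= 1 << j
    return rows


def apply_matrix(rows, a):
    """Compute the matrix-vector product over GF(2)."""
    out = 0
    for i, row in enumerate(rows):
        # Output bit i is the parity of the input bits selected by row i.
        if bin(row & a).count("1") & 1:
            out |= 1 << i
    return out


def compose(rows_a, rows_b, m):
    """Matrix product A·B over GF(2): column j of A·B is A applied to column j of B."""
    prod = [0] * m
    for j in range(m):
        col_b = sum(((rows_b[i] >> j) & 1) << i for i in range(m))
        col = apply_matrix(rows_a, col_b)
        for i in range(m):
            if (col >> i) & 1:
                prod[i] |= 1 << j
    return prod


T = squaring_matrix(0b10011, 4)
T2 = compose(T, T, 4)
a = 0b1101
print(bin(apply_matrix(T, apply_matrix(T, a))))  # two squarings
print(bin(apply_matrix(T2, a)))                  # one quad operation
```

Both printed values are the same field element, a^4. In hardware terms, T2 is a single layer of XOR logic rather than two squarers in series.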
Advantage of using Quad Operations
Quad circuits have better LUT utilization compared to Squarer circuits
Generalization of the Itoh-Tsujii Algorithm
Theorem 1
Theorem 2
Quad Itoh-Tsujii Inversion Algorithm
A Circuit for Inversion
At every clock cycle, either the multiplier or the quadblock is active.
The output of the multiplier is stored in the mout register.
Finding the Inverse
Finding the Inverse Step 2
Finding the Inverse Step 2
Control Signals for the Inverse
Performance Charts
Higher Powered Itoh-Tsujii
• We have seen that quad circuits utilize LUTs better than squarer circuits.
• LUT size is also increasing as silicon process technology shrinks.
• We have seen the 4-LUT become the 6-LUT, and now the 8-LUT.
• This motivates investigating power circuits of higher order than the quad circuit.
Revisiting the Theorems
2^n Itoh-Tsujii Inversion
These are the overheads
Higher Powered
Overhead in 2^n Itoh-Tsujii
• Computation of .
• Using addition chain for , can be computed in clock cycles, where is the length of addition chain for .
• Computation of , for
• Using addition chain for , that contains , can be
computed during computation, because .
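The symbols in the bullets above did not survive extraction, so the sketch below rests on stated assumptions rather than on the slides' exact notation. It assumes the usual generalization: with m - 1 = n·q + r, the inverse is (alpha_q^(2^r) · beta_r)^2, where alpha_k = a^(2^(n·k) - 1) and beta_j = a^(2^j - 1). Under that reading the two overheads are alpha_1 = a^(2^n - 1) and beta_r. The names alpha, beta, ones and ita_inverse are hypothetical, and a simple binary (double-and-add) chain stands in for whatever addition chain a real design would choose.

```python
def gf_mul(a, b, poly, m):
    """Multiply in GF(2^m) by shift-and-add, reducing after each shift."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= poly
    return r


def gf_pow2(a, e, poly, m):
    """Raise a to the power 2^e by e repeated squarings."""
    for _ in range(e):
        a = gf_mul(a, a, poly, m)
    return a


def ones(base, count, step, poly, m):
    """Given base = x^(2^step - 1), return x^(2^(step*count) - 1), count >= 1.

    Uses g(i+j) = g(i)^(2^(step*j)) * g(j), where g(i) = x^(2^(step*i) - 1).
    """
    acc, k = base, 1
    for bit in bin(count)[3:]:          # bits after the leading 1
        acc = gf_mul(gf_pow2(acc, step * k, poly, m), acc, poly, m)
        k *= 2
        if bit == "1":
            acc = gf_mul(gf_pow2(acc, step, poly, m), base, poly, m)
            k += 1
    return acc


def ita_inverse(a, poly, m, n):
    """Invert a nonzero element of GF(2^m), m >= 2, with a 2^n power circuit."""
    q, r = divmod(m - 1, n)
    result = None
    if q > 0:
        alpha_1 = ones(a, n, 1, poly, m)        # overhead: a^(2^n - 1)
        alpha_q = ones(alpha_1, q, n, poly, m)  # a^(2^(n*q) - 1)
        result = gf_pow2(alpha_q, r, poly, m)
    if r > 0:
        beta_r = ones(a, r, 1, poly, m)         # overhead: a^(2^r - 1)
        result = beta_r if result is None else gf_mul(result, beta_r, poly, m)
    return gf_mul(result, result, poly, m)      # final squaring


AES_POLY = 0x11B                                # x^8 + x^4 + x^3 + x + 1
inv = ita_inverse(0x53, AES_POLY, 8, 2)
print(hex(inv), gf_mul(0x53, inv, AES_POLY, 8))
```

This sketch computes beta_r separately for clarity. A hardware schedule would more likely pick an addition chain for n that passes through r, so that beta_r falls out of the alpha_1 computation at no extra cost.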
2^n Itoh-Tsujii Design
Configurable Parameters
• Addition chain.
• Power circuit used in power block.
• Number of cascaded power circuits in the power block.
• These have an effect on
– Number of clock cycles.
– Critical path delay.
Building the Optimal Design
For a given field and a given FPGA, how do we decide the optimal design?
Estimating AREA required on an FPGA
• A k-input LUT (k-LUT) can implement any function of at most k input variables.
• Total number of k-LUTs to implement a function with variables can be expressed as
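The expression itself is missing from the extracted text. As a rough stand-in, the sketch below uses a simple tree-counting argument that fits XOR-style functions: each k-LUT absorbs k signals and produces one, so it reduces the signal count by k - 1. The function name lut_count and the closed form ceil((x - 1) / (k - 1)) are assumptions and may differ from the slide's formula.

```python
def lut_count(x, k):
    """Estimate the k-LUTs needed for an x-input function built as a tree.

    Assumes the function decomposes associatively (for example an XOR of
    x bits). A single input is just a wire and needs no LUT.
    """
    if x <= 1:
        return 0
    return -(-(x - 1) // (k - 1))   # integer ceiling of (x - 1) / (k - 1)


for x in (1, 4, 5, 7, 8):
    print(x, lut_count(x, 4), lut_count(x, 6))
```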
Estimating Delay of a Design in an FPGA
• Delay in FPGAs comprises LUT delay and routing delay.
• For this ITA architecture, we have found experimentally that the total delay is proportional to the number of LUTs in the critical path.
• We denote the number of LUTs in a delay path as maxlutpath.
• In k-LUT, maxlutpath of an variable function is
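The closing expression is not present in the extracted text. A plausible reading, offered here only as an assumption, is the depth of a balanced tree of k-LUTs, that is ceil(log_k x) for an x-variable function. The helper name max_lut_path is hypothetical. Integer arithmetic is used to avoid floating-point rounding in the logarithm.

```python
def max_lut_path(x, k):
    """Depth, in LUT levels, of a balanced k-ary tree combining x inputs."""
    depth, reach = 0, 1
    while reach < x:    # each extra LUT level multiplies the reachable inputs by k
        reach *= k
        depth += 1
    return depth


for x in (1, 4, 5, 16, 17):
    print(x, max_lut_path(x, 4), max_lut_path(x, 6))
```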
Recap : Karatsuba Multiplier
Hybrid Karatsuba Multiplier for GF(2^233)
Note that the school book multiplier has replaced the general Karatsuba multiplier.
School Book Multiplier
Estimating LUT Requirement for Hybrid Karatsuba Multiplier
• The field multiplier is a hybrid Karatsuba multiplier.
• An m-bit hybrid Karatsuba multiplier consists of two ⌈m/2⌉-bit multipliers and one ⌊m/2⌋-bit multiplier. This happens in a recursive manner.
• At the threshold ( ) level, the School-Book multiplier is invoked.
• Total area of the m-bit hybrid Karatsuba multiplier is given by
• Total area for the School-Book multiplier is
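The recursive structure described for the hybrid Karatsuba multiplier can be sketched in software as follows. This is a minimal illustration over GF(2)[x] with bitmask polynomials and no modular reduction. The function names and the threshold value 29 are illustrative assumptions, not values taken from the slides.

```python
def schoolbook(a, b):
    """Carry-less (GF(2)) polynomial product by shift-and-XOR."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def hybrid_karatsuba(a, b, m, threshold):
    """Multiply two polynomials of at most m bits over GF(2).

    Splits recursively into two ceil(m/2)-bit products and one
    floor(m/2)-bit product, and switches to schoolbook multiplication
    once m <= threshold (threshold must be at least 1).
    """
    if m <= threshold:
        return schoolbook(a, b)
    lo = (m + 1) // 2               # ceil(m/2) bits in the low halves
    hi = m - lo                     # floor(m/2) bits in the high halves
    mask = (1 << lo) - 1
    a_l, a_h = a & mask, a >> lo
    b_l, b_h = b & mask, b >> lo
    low = hybrid_karatsuba(a_l, b_l, lo, threshold)
    high = hybrid_karatsuba(a_h, b_h, hi, threshold)
    mid = hybrid_karatsuba(a_l ^ a_h, b_l ^ b_h, lo, threshold)
    # Over GF(2), the cross term is mid + low + high (addition is XOR).
    return (high << (2 * lo)) ^ ((mid ^ low ^ high) << lo) ^ low


mask233 = (1 << 233) - 1
a = (3 ** 140) & mask233
b = (5 ** 100) & mask233
print(hybrid_karatsuba(a, b, 233, 29) == schoolbook(a, b))  # True
```

A field multiplier would follow this unreduced product with a modular reduction step. The threshold controls where the recursion hands over to the schoolbook leaves, which is the knob the area and delay estimates depend on.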
Estimating Delay of Hybrid Karatsuba Multiplier
• The hybrid Karatsuba multiplier is distributed in smaller multipliers like a tree. Height of the tree is
• Each level of the Simple Karatsuba tree introduces one LUT delay.
• In threshold ( ) level, School-Book multiplier delay is added.
• Delay of School-Book multiplier is
• Delay of the entire multiplier in LUTs is given by
Estimating Area & Delay for Modular Reduction
• For fields generated by trinomials, the area of modular reduction is almost equal to the field size, and the delay is one LUT considering LUT size .
• For fields generated by pentanomials,
– and 2 LUT for .
– and 2 LUT for .
Area & Delay Estimates for 2^n Circuit
• The output of a 2^n circuit, which raises an input a to the power 2^n, can be expressed as T^n·a, where T is a binary field matrix and ,
• LUT requirement per output bit is
• Total LUT requirement for the 2^n circuit is
• LUT delay per output bit is
• Since all bits are in parallel, delay of 2^n circuit is
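The per-bit expressions are missing from the extracted text. One way to read these bullets, shown here as a sketch under assumptions, is to count the nonzero entries in each row of T^n (the number of input bits that output bit depends on), then apply a tree-style LUT count and tree depth to each row. The helpers lut_count, max_lut_path and estimate_power_circuit, and the closed forms inside them, are hypothetical and may not match the slide's formulas.

```python
def gf_square(a, poly, m):
    """Square a bitmask polynomial in GF(2^m)."""
    spread = 0
    for i in range(m):
        if (a >> i) & 1:
            spread |= 1 << (2 * i)
    for i in range(2 * m - 2, m - 1, -1):
        if (spread >> i) & 1:
            spread ^= poly << (i - m)
    return spread


def power_rows(poly, m, n):
    """Rows of T^n: bit j of row i is set when output bit i depends on input bit j."""
    rows = [0] * m
    for j in range(m):
        col = 1 << j
        for _ in range(n):              # column j of T^n is (x^j)^(2^n)
            col = gf_square(col, poly, m)
        for i in range(m):
            if (col >> i) & 1:
                rows[i] |= 1 << j
    return rows


def lut_count(x, k):
    """Assumed model: k-LUTs in an XOR tree over x inputs."""
    return 0 if x <= 1 else -(-(x - 1) // (k - 1))


def max_lut_path(x, k):
    """Assumed model: depth of a balanced k-ary tree over x inputs."""
    depth, reach = 0, 1
    while reach < x:
        reach *= k
        depth += 1
    return depth


def estimate_power_circuit(poly, m, n, k):
    """Return (total LUTs, LUT delay) for a 2^n circuit under the assumed model."""
    weights = [bin(row).count("1") for row in power_rows(poly, m, n)]
    total = sum(lut_count(w, k) for w in weights)
    delay = max(max_lut_path(w, k) for w in weights)  # output bits are in parallel
    return total, delay


P233 = (1 << 233) | (1 << 74) | 1       # x^233 + x^74 + 1
for n in (1, 2, 3):
    print(n, estimate_power_circuit(P233, 233, n, 4))
```

For the small field GF(2^4) with x^4 + x + 1 and k = 4, this model gives 2 LUTs for a squarer and 3 LUTs for a quad circuit, against 4 LUTs for two squarers in series.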
Area & Delay Estimates for Multiplexer
• For a 2^s : 1 MUX, there are s selection lines, and thus the output is a function of 2^s + s variables.
• For a MUX in GF(2^m), each of the 2^s input lines is of width m bits.
• Total LUT requirement is
• Total LUT delay of the MUX is
• When number of inputs to MUX , the above gives a close upper bound
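The multiplexer expressions are also missing from the extracted text. The sketch below applies a tree-style LUT model to each output bit, treating it as a function of 2^s + s variables and multiplying the per-bit LUT count by the width m. The helper names and closed forms are assumptions, and real synthesis results for multiplexers can differ from this count.

```python
def lut_count(x, k):
    """Assumed model: k-LUTs needed for an x-input function built as a tree."""
    return 0 if x <= 1 else -(-(x - 1) // (k - 1))


def max_lut_path(x, k):
    """Assumed model: depth of a balanced k-ary tree over x inputs."""
    depth, reach = 0, 1
    while reach < x:
        reach *= k
        depth += 1
    return depth


def mux_estimate(s, m, k):
    """Return (total LUTs, LUT delay) for an m-bit wide 2^s : 1 multiplexer."""
    variables = 2 ** s + s          # data inputs plus selection lines, per bit
    return m * lut_count(variables, k), max_lut_path(variables, k)


for k in (4, 6):
    print(k, mux_estimate(2, 233, k))
```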
Area & Delay of PowerBlock
• Let the Powerblock contain u_s cascaded 2^n circuits.
• The has selection lines, where
• LUT requirement for is
• Total LUT requirement for Powerblock is
• Delay of is
• Total LUT delay of Powerblock is
Area & Delay for the Entire Architecture
• LUT estimate for the entire architecture is
• There are two parallel delay paths.
– LUT delay of first path is
– LUT delay of second path is
– LUT delay of entire architecture is
Optimal Number of Cascades
• For a given field and based FPGA, Powerblock can be configured with different power circuits and cascades .
• Increase in reduces clock cycles, but increases delay of Powerblock.
• is fixed, but depends on and .
• is minimum when
• Minimum delay of the ITA architecture is thus
Power Circuit Selection to achieve Minimum Clock Cycles
• Number of clock cycles for the inversion can be approximated as
• Number of clock cycles for increases linearly with .
• The term reduces with increase in .
• When is small, the reduction in is significant for increase in .
• But, for large values of n, the increase in dominates over the decrease in
• So, increases with increase in for large values of .
Tuning Design for Optimality
• The performance metric is
• Minimization of without increasing gives best performance. Area remains almost the same.
• The following steps are performed to achieve optimal performance
• The optimal architecture is given by
Validation of Theoretical Estimates
• Our estimation model uses maxlutpath to find LUT delay.
• Routing delay is difficult to model in FPGAs.
• To get the overall delay, we have used experimental results for a reference ITA architecture.
• Total delay of reference architecture is the
• Let the LUT delay of the reference architecture be
• Total delay of any other ITA architecture in the same field is approximately
• Here is a constant and depends on FPGA technology.
• In 4-LUT based and 6-LUT based Xilinx FPGAs, has values 0.2 and 0.1 respectively.
Validation on 4-input LUT FPGAs
Validation on 6-input LUT FPGAs
Experimental Results
Comparison Charts