Post on 19-Dec-2015
Cryptographic Algorithms and their Implementations
Discussion of how to map different algorithms to our architecture
Public-Key Algorithms (Modular Exponentiation)
Rijndael
Serpent
Others (Mars, RC6, Twofish, etc.)
Modular Exponentiation
Square and Multiply Algorithm for Modular Exponentiation
Modular Exponentiation
Montgomery Modular Multiplication
Modular Exponentiation
Several approaches to implementing Modular Multiplication:
Redundant-representation based (e.g. carry-save)
Residue Number System based
Systolic array based
Word-based implementations are preferable, due to their similarity with symmetric-key algorithms
This rules out systolic arrays
Modular Exponentiation
The most popular and fastest implementations were based on carry-save representation.
The carry-save based implementations were also word-oriented.
We selected the fastest, simplest implementation:
Simplicity and homogeneity in algorithms are extremely beneficial when designing a custom reconfigurable fabric.
Performance when implemented on Xilinx Virtex FPGAs: almost 5 Mb/s (the highest reported figure that we could find)
Modular Exponentiation
Five-to-two Multiplier Modular Exponentiation (P, E, M)
K = 2^(2k) mod M … computed externally
1. P1[0], P2[0] = 5to2_MontMult(K, 0, 1, 0, M),
   Z1[0], Z2[0] = 5to2_MontMult(K, 0, P, 0, M);
2. FOR i = 0 to n-1 DO
3.   Z1[i+1], Z2[i+1] = 5to2_MontMult(Z1[i], Z2[i], Z1[i], Z2[i], M)
4.   IF e[i] = 1 THEN
       P1[i+1], P2[i+1] = 5to2_MontMult(P1[i], P2[i], Z1[i], Z2[i], M)
     ELSE
       P1[i+1], P2[i+1] = P1[i], P2[i]
5. ENDFOR
6. P1[n], P2[n] = 5to2_MontMult(1, 0, P1[n-1], P2[n-1], M)
7. P = P1[n] + P2[n]
8. RETURN P
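The control flow of the exponentiation above can be sketched in software. This is a minimal sketch, not the hardware design: each carry-save pair (P1, P2), (Z1, Z2) is collapsed into a single integer, and the names `mont_mult` and `mont_mod_exp` are hypothetical. It assumes an odd modulus m with m < 2^k.

```python
def mont_mult(a, b, m, k):
    """Bit-serial Montgomery product a*b*2^(-k) mod m (odd m, a and b < m < 2^k)."""
    s = 0
    for i in range(k):
        s += ((a >> i) & 1) * b    # add b when bit i of a is set
        if s & 1:                  # q_i = 1: add m so the sum becomes even
            s += m
        s >>= 1                    # exact division by 2
    return s - m if s >= m else s  # s < 2m here, so one subtraction suffices


def mont_mod_exp(base, exponent, m, k):
    """base**exponent mod m, following steps 1-8 with plain integers."""
    r2 = pow(2, 2 * k, m)              # K = 2^(2k) mod M, "computed externally"
    p = mont_mult(r2, 1, m, k)         # step 1: the value 1 in Montgomery form
    z = mont_mult(r2, base % m, m, k)  # step 1: the base in Montgomery form
    for i in range(exponent.bit_length()):  # steps 2-5, least significant bit first
        if (exponent >> i) & 1:
            p = mont_mult(p, z, m, k)  # multiply by the current power of the base
        z = mont_mult(z, z, m, k)      # square for the next exponent bit
    return mont_mult(1, p, m, k)       # step 6: leave Montgomery form


print(mont_mod_exp(7, 560, 561, 10) == pow(7, 560, 561))
```

The multiply uses the value of Z from before the squaring, which matches steps 3 and 4 reading Z1[i], Z2[i] rather than the updated pair.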
Modular Exponentiation
Five-to-two CSA Montgomery Multiplication (A1, A2, B1, B2, M)
1. S1[0], S2[0] = 0, 0
2. FOR i = 0 to m-1 DO
3.   q[i] = [(S1[i] + S2[i]) + A[i]*(B1 + B2)] mod 2
4.   S1[i+1], S2[i+1] = CSR[(S1[i] + S2[i]) + A[i]*(B1 + B2) + q[i]*M] div 2
5. ENDFOR
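A software model may help in reading the loop above. The sketch below keeps the running sum as a carry-save pair and reduces five operands to two with three 3-to-2 carry-save adders per iteration. Two points are assumptions rather than statements from the slides: A[i] is taken to be bit i of A1 + A2, and the function names are hypothetical. M must be odd.

```python
def csa(a, b, c):
    """3-to-2 carry-save adder: returns (sum, carry) with sum + carry == a + b + c."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1


def mont_mult_5to2(a1, a2, b1, b2, m, nbits):
    """Return (s1, s2) with (s1 + s2) * 2^nbits congruent to (a1+a2)*(b1+b2) mod m."""
    a = a1 + a2                # assumption: A[i] is bit i of this sum
    s1 = s2 = 0
    for i in range(nbits):
        ai = (a >> i) & 1
        x1 = b1 if ai else 0   # A[i] * B1
        x2 = b2 if ai else 0   # A[i] * B2
        # q[i] is the parity of S1 + S2 + A[i]*(B1 + B2): XOR of the four LSBs.
        q = (s1 ^ s2 ^ x1 ^ x2) & 1
        # Five operands reduced to two by three CSA levels.
        t1, t2 = csa(s1, s2, x1)
        t1, t2 = csa(t1, t2, x2)
        t1, t2 = csa(t1, t2, m if q else 0)
        # t1 + t2 is even and t2 has a zero LSB, so t1 does too: both shifts are exact.
        s1, s2 = t1 >> 1, t2 >> 1
    return s1, s2


s1, s2 = mont_mult_5to2(100, 50, 77, 12, 239, 10)
print(((s1 + s2) << 10) % 239 == (150 * 89) % 239)
```

No carry ever propagates across the full operand width inside the loop; the only full addition is the final S1 + S2, which corresponds to step 7 of the exponentiation algorithm.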
Modular Exponentiation
[Figure: Their Implementation of MM. Three 1024-bit CSA stages feed 1024-bit registers holding S1[i] and S2[i]; the inputs Ai.B1, Ai.B2 and qi.n are read from memories (MEM); Ai is produced by a full adder (FA) and a flip-flop (FF) fed from two 1024-bit shift registers.]
Modular Exponentiation
Implementing MM on our design
[Figure: the 1024-bit datapath split into 64-bit slices, from <63:0> up to <1023:960>. Each slice has 64-bit CSA blocks, 64-bit registers holding its part of S1[i] and S2[i], and memories (MEM) supplying the matching 64 bits of Ai.B1, Ai.B2 and qi.n.]
Modular Exponentiation
Each of the 64-bit CSA blocks maps to a single basic-block
Outputs of the last basic-block are registered
qi is generated by a random-logic block at the second basic-block
Broadcast to all groups
Ai is generated in a similar manner, utilizing two more basic-blocks
Also broadcast to all groups
[Figure: Ai generation using a full adder (FA), a flip-flop (FF) and two 64-bit shift registers]
Modular Exponentiation
Efficient and scalable mapping to our design
1024-bit RSA will need 16 groups, 2048-bit will need 32, and 4096-bit will need 64 groups
Primary concern: the clock rate may be limited by the bit-broadcasts of qi and Ai
Potential impediment to scalability
We are exploring methods for pipelining these broadcasts as well, to shorten the cycle time and improve scalability.
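The group counts above follow from one 64-bit slice per group. The sketch below, with hypothetical names, computes the counts and checks that a wide carry-save addition gives the same result when computed one 64-bit slice at a time. It is a functional model only and says nothing about timing or the broadcast wiring.

```python
WORD = 64
MASK = (1 << WORD) - 1


def groups_needed(operand_bits):
    """One group per 64-bit slice of the operand."""
    return operand_bits // WORD


def csa_sliced(a, b, c, operand_bits):
    """Carry-save addition of three operands, evaluated slice by slice."""
    total_sum = total_carry = 0
    for g in range(groups_needed(operand_bits)):
        shift = g * WORD
        wa = (a >> shift) & MASK
        wb = (b >> shift) & MASK
        wc = (c >> shift) & MASK
        total_sum |= (wa ^ wb ^ wc) << shift                           # per-bit sum
        total_carry |= ((wa & wb) | (wa & wc) | (wb & wc)) << shift    # per-bit majority
    # The carry vector has weight 2, so it is shifted left by one bit overall.
    return total_sum, total_carry << 1


a = (1 << 1024) - 1
b = int("deadbeef" * 32, 16)
c = pow(12345, 70, 1 << 1024)
s, cy = csa_sliced(a, b, c, 1024)
print(groups_needed(1024), s + cy == a + b + c)
```

Within a slice every output bit depends only on the three input bits at the same position, which is why the slices can be evaluated independently.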
Rijndael
Primary operations:
Sub-Bytes
Shift-Rows
Mix-Columns
Add-Round-Key
Rijndael
Representation of Data: 128-bit state.
[Figure: the 128 bits of state drawn as four 32-bit words, each made up of four 8-bit bytes]
Rijndael
Add-Round-Key
Simple 128-bit XOR operation: uses 1 basic-block
Sub-Bytes
Simple operation: byte-wise table lookup from S-Box
Each S-box is 2 kbits; 16 parallel S-boxes are required
No basic-blocks required, but ALL memory-blocks are required
Shift-Rows
Simple operation: 4 x 32-bit permutations
Uses only 1 basic-block
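As a reference for the two logic-only steps, here is a minimal software sketch of Add-Round-Key and Shift-Rows. It assumes the usual AES byte ordering, where the byte at row r and column c is stored at index r + 4*c; the function names are hypothetical.

```python
def add_round_key(state, round_key):
    """XOR the 16-byte state with the 16-byte round key."""
    return bytes(s ^ k for s, k in zip(state, round_key))


def shift_rows(state):
    """Rotate row r of the state left by r byte positions (a fixed permutation)."""
    return bytes(state[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4))


print(list(shift_rows(bytes(range(16)))))
```

Both functions are pure wiring plus XOR, which is consistent with each needing only a single basic-block.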
Rijndael Mix-Columns
Somewhat complicated: can be implemented using table lookups, but we’re out of Memory !
Alternative implementation:
Rijndael Mix-Columns
Operation may be expressed in terms of “xtime()” function
Mix-columns implementation requires “xtime()” operation on each byte, followed by 4 XOR operations
Rijndael Mix-Columns
In order to implement “xtime()” efficiently, we modified it this way:
In this form, only 2 basic-blocks are needed to apply “xtime()” to all 16 bytes
A single basic-block takes the 128-bit data as input and generates the “xtime()” mask (0, 0, 0, 0, x7, x7, 0, x7) for each of the 16 bytes at the permute unit, where x7 is the most significant bit of the byte
Another basic-block then performs the XOR operation, followed by a left shift (substituting the LSB with x7) at the permute unit
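The mask-then-shift form can be checked against the textbook definition of xtime(). The sketch below is a software illustration with hypothetical function names: the mask is XORed in before the shift, and the vacated LSB is filled with x7.

```python
def xtime_reference(x):
    """Textbook xtime(): multiply by 2 in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    x <<= 1
    if x & 0x100:
        x ^= 0x11B
    return x


def xtime_masked(x):
    """xtime() as XOR with the mask (0,0,0,0,x7,x7,0,x7), then shift left with LSB = x7."""
    x7 = (x >> 7) & 1
    mask = 0b00001101 if x7 else 0          # bits 3, 2 and 0 carry x7
    return (((x ^ mask) << 1) | x7) & 0xFF  # shift, fill the LSB with x7, drop bit 8


print(all(xtime_masked(x) == xtime_reference(x) for x in range(256)))
```

After the shift, the mask bits land on positions 4, 3 and 1, and the substituted LSB supplies position 0. Together these are the bits of 0x1B, the reduction constant.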
Rijndael Mix-Columns
After generating output from the “xtime()” function, 4 x 128-bit XOR operations need to be performed
4 basic-blocks will be used
Note that the mix-column operation is carried out in parallel on all 4 columns.
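A table-free Mix-Columns can be sketched directly from this description: apply xtime() to every byte, then combine five terms per output byte with four XORs. The function names are hypothetical and the state is assumed to be stored column by column.

```python
def xtime(x):
    """Multiply a byte by 2 in GF(2^8) (AES reduction polynomial 0x11B)."""
    x <<= 1
    return (x ^ 0x11B) if x & 0x100 else x


def mix_column(col):
    """Mix one 4-byte column; 3*a is formed as xtime(a) ^ a."""
    a0, a1, a2, a3 = col
    d0, d1, d2, d3 = (xtime(a) for a in col)
    return [
        d0 ^ (d1 ^ a1) ^ a2 ^ a3,   # 2*a0 + 3*a1 + a2 + a3
        a0 ^ d1 ^ (d2 ^ a2) ^ a3,   # a0 + 2*a1 + 3*a2 + a3
        a0 ^ a1 ^ d2 ^ (d3 ^ a3),   # a0 + a1 + 2*a2 + 3*a3
        (d0 ^ a0) ^ a1 ^ a2 ^ d3,   # 3*a0 + a1 + a2 + 2*a3
    ]


def mix_columns(state):
    """Apply mix_column to each of the four columns of a 16-byte state."""
    out = []
    for c in range(4):
        out += mix_column(state[4 * c:4 * c + 4])
    return bytes(out)


print([hex(b) for b in mix_column([0xDB, 0x13, 0x53, 0x45])])
```

In hardware the four columns are processed at once, so each of the four XOR levels becomes one 128-bit XOR.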
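[Figure: basic-block datapath used for “xtime()”. A permute unit, 4-bit random logic and 64 carry-save adders (inputs A, B, C, D; outputs O1, O2) feed 64-bit registers, with each 64-bit path split into two 32-bit halves. One block is labelled “Xtime masks for all bytes”, the other “XOR operation”.]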
Rijndael Implementation summary
Only 8 basic-blocks required
2 (1 each) for Add-Round-Key and Shift-Rows
6 for Mix-Columns (2 for xtime(), 4 for XOR operations)
16 memory-blocks required
All memory blocks are used up in a single round!
Inefficient implementation, because this mapping of Rijndael is memory-intensive
Only 10% of the logic is used, versus 100% of the memory.
Rijndael Potential Solutions
Add lots of memory: at least 10 times more
Issues with memory placement
Consider memory-less implementations of Sub-Bytes
Requires GF() constant multiplication and Inverse Affine Transforms
Currently under study as the more efficient and practical option.
Serpent
Substitution-permutation cipher comprised of Key Mixing, S-Box Substitution, and Linear Transformation.
S-boxes: 4 x 4 bit
32 copies required each round
16 x 4 x 32 = 2048 bits per round.
Serpent
The Linear Transformation step consists of:
8 fixed permute operations, and
8 XOR operations
All operands are 32 bits wide
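For reference, the linear transformation can be written out as below. This is a sketch based on the published Serpent specification rather than on the slides. It uses six rotates and two shifts (the eight fixed permutes) and eight XORs on 32-bit words; the function names are hypothetical.

```python
M32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit word left by a fixed amount n (0 < n < 32)."""
    return ((x << n) | (x >> (32 - n))) & M32


def serpent_lt(x0, x1, x2, x3):
    """Serpent linear transformation on four 32-bit words."""
    x0 = rotl32(x0, 13)
    x2 = rotl32(x2, 3)
    x1 ^= x0 ^ x2                  # 2 XORs
    x3 ^= x2 ^ ((x0 << 3) & M32)   # 2 XORs, 1 fixed shift
    x1 = rotl32(x1, 1)
    x3 = rotl32(x3, 7)
    x0 ^= x1 ^ x3                  # 2 XORs
    x2 ^= x3 ^ ((x1 << 7) & M32)   # 2 XORs, 1 fixed shift
    x0 = rotl32(x0, 5)
    x2 = rotl32(x2, 22)
    return x0, x1, x2, x3


a = (0x01234567, 0x89ABCDEF, 0xDEADBEEF, 0x0BADF00D)
b = (0xFFFFFFFF, 0x00000001, 0x80000000, 0x13579BDF)
xor_in = tuple(p ^ q for p, q in zip(a, b))
xor_out = tuple(p ^ q for p, q in zip(serpent_lt(*a), serpent_lt(*b)))
print(serpent_lt(*xor_in) == xor_out)
```

The check at the end confirms that the transformation is linear over XOR, as expected for a network built only from fixed permutes and XORs.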
Serpent
Serpent is an ideal match for our architecture:
8 x 32-bit fixed shifts and rotates can easily be implemented by the permute units of 2 basic-blocks.
An additional 2 basic-blocks are required to implement the 8 x 32-bit XOR operations.
The 128-bit key mixing stage in each round requires 1 more basic-block.
Total of 5 basic-blocks and 2 kbits of memory required per round.
Each round fits perfectly in a single group of our architecture!
16 of Serpent’s 32 rounds may be unrolled in our architecture.
Other Algorithms
DES
Implementation of a single round is trivial: a single group may implement multiple rounds!
Twofish
Complex structure; more time is required to define an implementation on our architecture.
However, all its basic operations are directly supported.
RC6 and MARS
Involve complicated operations requiring special-purpose logic:
Data-dependent rotations
Multiplication modulo 2^32
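To show why these two operations do not reduce to fixed wiring and XORs, here is a short sketch in the style of the RC6 round function. The function names are hypothetical and the code illustrates the operations only; it is not a full cipher.

```python
M32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit word left; only the low 5 bits of n are used."""
    x &= M32
    n &= 31
    return ((x << n) | (x >> (32 - n))) & M32


def rc6_quadratic(b):
    """RC6-style b * (2b + 1) mod 2^32, rotated left by 5: needs a 32-bit multiplier."""
    return rotl32((b * (2 * b + 1)) & M32, 5)


def data_dependent_rotate(a, t, u):
    """Rotate (a XOR t) by an amount taken from the data word u, not fixed in advance."""
    return rotl32(a ^ t, u)


print(rc6_quadratic(1), data_dependent_rotate(1, 0, 33))
```

A fixed rotate is a constant permutation of wires, whereas the rotate here must select among 32 possible permutations at run time.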
Other Algorithms
RC6 and MARS
This special-purpose logic was not incorporated because:
These algorithms are more suitable for software implementations than for hardware
Lack of support for, and popularity of, these algorithms
Adding special-purpose logic would incur overhead beyond its own area, as additional supporting interconnect must be provided.
Comparison with Related Work
Although we cannot provide results based on empirical evaluation, we can present a logical framework for comparison of individual features
Through deductive reasoning, we identify what possible advantages one approach may have over the other, assuming all other factors normalized.
Comparison with Related Work
Comparison with FPGA based implementations
Area Efficiency
Use of basic gates instead of LUTs
Basic-blocks with limited flexibility, thus fewer configuration bits
Basic units (full adders) combined into clusters of 64 and programmed as a single entity: further savings in configuration memory elements
Performance
Use of basic gates instead of LUTs
Simpler interconnect, with fewer routing switches
Hierarchical organization: no long wires (except for bit-broadcast)
Far less configuration data required: faster reconfiguration
Comparison with Related Work
Comparison with FPGA based implementations
Potential pitfalls
Design dedicates a considerable amount of area to inter-block interconnect.
Until the actual area can be quantified, we are unsure of our area-efficiency estimates.
Need to identify the most suitable Performance/Area tradeoff.
Comparison with Related Work
Comparison with COBRA Architecture
Uses multiple copies of special-purpose logic blocks, coupled with extremely simple interconnect.
Comparison with Related Work
Comparison with COBRA Architecture
Low logic-utilization; we have more generic blocks
Fixed latency operations
Intermediate values registered only at RCE boundary.
Programming Methodology
Reconfigurable Computing devices suffer from the following two critical issues:
Lack of a comprehensive programming model
Lack of hardware virtualization
The first issue refers to the difficulty of programming RC architectures such as FPGAs
The second issue deals with the exposure of hardware resource limitations to the programmer.
Programming Methodology
How COBRA deals with these issues
Essentially a special-purpose programmable architecture rather than a configurable one
VLIW-like instructions alleviate some of the programming-model issues
They also resolve the virtualization aspect.
Programming Methodology
The programming methodology and the impact of the issues mentioned can be seen in terms of a spectrum:
[Figure: spectrum of approaches, with the labels Microprocessor, COBRA [3], Our Approach, FPGAs]
Programming Methodology
Programming model issue less severe for us because:
Simple, highly specialized architecture
Hardware Virtualization is still a concern.
Programming Methodology
Programming model:
Provide basic primitives that are supported by our architecture.
Programming is to be accomplished by expressing an algorithm using these primitives and interconnecting these primitives together using 32-bit interconnect.
Mapping such a description onto our design should be a trivial software challenge.
Due to special purpose nature, primitives are limited in number and thus programming should be an easy task.
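To make the idea of "primitives plus 32-bit interconnect" concrete, the sketch below describes a computation as a list of primitive instances wired by name and then evaluates it. Everything here is hypothetical: the primitive names, the netlist format and the `evaluate` helper are illustrative assumptions, not the tool flow of the architecture.

```python
M32 = 0xFFFFFFFF

# Hypothetical primitive library: name -> function on 32-bit words.
PRIMITIVES = {
    "xor32": lambda a, b: a ^ b,
    "and32": lambda a, b: a & b,
    "or32": lambda a, b: a | b,
    "rotl32": lambda a, n: ((a << n) | (a >> (32 - n))) & M32,  # fixed rotate, 0 < n < 32
}


def evaluate(netlist, inputs):
    """Evaluate a netlist given in dependency order.

    Each entry is (output_wire, primitive_name, operands); an operand is either
    the name of a wire (str) or an integer constant such as a rotate amount.
    """
    wires = dict(inputs)
    for out, prim, operands in netlist:
        args = [wires[o] if isinstance(o, str) else o for o in operands]
        wires[out] = PRIMITIVES[prim](*args)
    return wires


# Two primitives connected by one 32-bit wire "t".
netlist = [
    ("t", "xor32", ("a", "b")),
    ("y", "rotl32", ("t", 3)),
]
print(hex(evaluate(netlist, {"a": 0x0F, "b": 0xF0})["y"]))
```

A mapping tool would then assign each netlist entry to a basic-block and each named wire to a 32-bit interconnect segment.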
Programming Methodology
Programming Primitives:
32-bit Carry Save Adder
32-bit XOR
32-bit AND
32-bit OR
32, 64, and 128-bit Ripple Carry Adder
32, 64, and 128-bit Fixed Shifts
32-bit Rotates and random permutes
64-bit, 128-bit limited permutes (TBD)
ANDing a 32-bit value with a single bit
128-bit shift-register
Random bit-logic implementation, since each block is also capable of implementing:
single 4-input function
two 3-input functions
four 2-input functions
4 global bit-broadcast lines
32-bit interconnect, point to point
Conclusion: Work in Progress
The following areas of the design are still under consideration and not yet completely defined:
Configurable Memory-block Architecture
VLSI design to evaluate performance metrics and fine-tune the logical design
i.e. if found to be too slow, reduce the number of switches, use longer wires, minimize the interconnect to what is necessary, etc.
Furthermore, the iterative process of evaluating more symmetric-key algorithms and refining the architecture is still in progress.