George Mason University
Hardware Architectures of Secret-Key Block Ciphers
and Hash Functions
ECE 545 Lecture 8b
Recommended reading
• K. Gaj and P. Chodowiec, "FPGA and ASIC Implementations of AES," Chapter 10 in C.K. Koc (Ed.), Cryptographic Engineering.
  Section 10.4 Parameters of Hardware Implementations
  Section 10.5 Hardware Architectures of Symmetric Block Ciphers
Recommended reading
• E. Homsirikamol, M. Rogawski, and K. Gaj, "Throughput vs. Area Trade-offs in High-Speed Architectures of Five Round 3 SHA-3 Candidates Implemented Using Xilinx and Altera FPGAs," in LNCS 6917, Cryptographic Hardware and Embedded Systems - CHES 2011, Nara, Japan, Sep. 28-Oct. 1, pp. 491-506. Sections 1-4.
Secret-key Ciphers
Cipher
message
ciphertext
cryptographic key
N bits
N bits
K bits
Current American Standards: AES vs. Triple DES
Triple DES (3DES): 64-bit input, 64-bit output, 168-bit key
AES: 128-bit input, 128-bit output, 128-, 192-, or 256-bit key
Initial transformation
Final transformation
#rounds times
Round Key[i] i:=i+1
Round Key[0]
i:=1
i<#rounds?
Cipher Round
Round Key[#rounds+1]
Typical Flow Diagram of a Secret-Key Block Cipher
key scheduling
encryption/decryption
memory of internal keys
output
input/key
input interface
output interface
Control unit
control
Top level block diagram
key expansion encryption/
decryption
memory of internal keys
output
input
input interface
output interface
Control unit
control
key setup
key scheduling
key
Primary parameters of hardware implementations for secret-key block ciphers
Latency Throughput
Encryption/ decryption
Time to encrypt/decrypt a single block
of data
Mi
Ci Number of bits
encrypted/decrypted in a unit of time
Encryption/ decryption
Mi Mi+1 Mi+2
Ci Ci+1 Ci+2
Throughput = (Block_size · Number_of_blocks_processed_simultaneously) / Latency
Dependence of the encryption time on latency and throughput
Encryption time = Latency + (Message_size − Block_size) / Throughput
[Plot: Time vs. Message size]
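The two definitions above can be checked with a few lines of arithmetic. The following is a minimal sketch; the function names and the example numbers (128-bit block, 10 rounds, 10 ns clock) are illustrative assumptions, not values from the lecture.

```python
def throughput_bits_per_s(block_size_bits, blocks_in_flight, latency_s):
    """Throughput = block_size * number_of_blocks_processed_simultaneously / latency."""
    return block_size_bits * blocks_in_flight / latency_s


def encryption_time_s(message_size_bits, block_size_bits, latency_s, throughput):
    """Encryption time = latency + (message_size - block_size) / throughput."""
    return latency_s + (message_size_bits - block_size_bits) / throughput


# Hypothetical basic iterative unit: 10 rounds, one round per 10 ns clock cycle,
# one 128-bit block processed at a time.
latency = 10 * 10e-9
thr = throughput_bits_per_s(128, 1, latency)
one_block = encryption_time_s(128, 128, latency, thr)     # equals the latency
ten_blocks = encryption_time_s(1280, 128, latency, thr)   # latency + 9 more blocks
```

For a single-block message the encryption time reduces to the latency; for long messages the throughput term dominates.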
round key
register
combinational logic
one round
multiplexer
Basic iterative architecture
IN
OUT
P1
C1
P2
C2
P3
Basic architecture: Timing
#rounds · clock_period
CLK
Block vs. stream ciphers
Block cipher
K
M1, M2, …, Mn → C1, C2, …, Cn
Ci = fK(Mi)
Every block of ciphertext is a function of only one corresponding block of plaintext
Stream cipher
K, internal state IS
m1, m2, …, mn → c1, c2, …, cn
ci = fK(mi, ISi), ISi+1 = gK(mi, ISi)
Every block of ciphertext is a function of the current block of plaintext and the current internal state of the cipher
Typical stream cipher Sender Receiver
Pseudorandom Key Generator
mi
plaintext
ci
ciphertext
ki keystream
key initialization vector (seed)
Pseudorandom Key Generator
mi plaintext
ci
ciphertext
ki keystream
key initialization vector (seed)
ECB (Electronic CodeBook) mode
Electronic CodeBook Mode – ECB Encryption
M1 M2 M3
E
Ci = EK(Mi) for i=1..N
MN-1 MN
E E E E . . .
C1 C2 C3 CN-1 CN
K K K K K
Electronic CodeBook Mode – ECB Decryption
C1 C2 C3
D
Mi = DK(Ci) for i=1..N
CN-1 CN
D D D D . . .
M1 M2 M3 MN-1 MN
K K K K K
Counter Mode
Counter Mode - CTR Encryption
m1 m2 m3
E
ci = mi ⊕ ki ki = EK(IV+i-1) for i=1..N
mN-1 mN
. . .
E E E E . . .
c1 c2 c3 cN-1 cN
IV IV+1 IV+2 IV+N-2 IV+N-1
k1 k2 k3 kN-1 kN
K K K K K
Counter Mode - CTR Decryption
c1 c2 c3
E
mi = ci ⊕ ki ki = EK(IV+i-1) for i=1..N
cN-1 cN
. . .
E E E E . . .
m1 m2 m3 mN-1 mN
IV IV+1 IV+2 IV+N-2 IV+N-1
k1 k2 k3 kN-1 kN
K K K K K
Counter Mode - CTR
E K
IN
OUT
counter
IV
1 L
ci
mi
E K
IN
OUT
counter
IV
1 L
ci
mi
1 L 1 L
IS1 = IV
ci = EK(ISi) ⊕ mi
ISi+1 = ISi + 1
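The counter-mode recurrence can be sketched in software as follows. Note the assumptions: `toy_E` is a hypothetical stand-in for E_K built from SHA-256 (it is not AES and not a real block cipher), and the 16-byte block size is chosen only for illustration. Because counter mode only ever uses E_K in the forward direction, the same routine serves for encryption and decryption.

```python
import hashlib

BLOCK_BYTES = 16  # assumed block size for this sketch


def toy_E(key: bytes, block: bytes) -> bytes:
    """Hypothetical stand-in for E_K (a keyed pseudorandom function, NOT a real cipher)."""
    return hashlib.sha256(key + block).digest()[:BLOCK_BYTES]


def ctr_process(key: bytes, iv: int, blocks):
    """Apply c_i = E_K(IS_i) xor m_i with IS_1 = IV and IS_{i+1} = IS_i + 1."""
    out = []
    state = iv  # IS_1 = IV
    for m in blocks:
        k_i = toy_E(key, state.to_bytes(BLOCK_BYTES, "big"))  # keystream block
        out.append(bytes(a ^ b for a, b in zip(m, k_i)))
        state = (state + 1) % (1 << (8 * BLOCK_BYTES))  # IS_{i+1} = IS_i + 1
    return out


key = b"example key"
msgs = [b"A" * BLOCK_BYTES, b"B" * BLOCK_BYTES, b"A" * BLOCK_BYTES]
cts = ctr_process(key, 7, msgs)
recovered = ctr_process(key, 7, cts)  # decryption is the same operation
```

Each keystream block depends only on the counter value, never on an earlier ciphertext block, which is what lets hardware compute many blocks at once.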
CBC (Cipher Block Chaining) Mode
Cipher Block Chaining Mode - CBC Encryption
m1 m2 m3
E
IV
ci = EK(mi ⊕ ci-1) for i=1..N c0=IV
mN-1 mN . . .
E E E E . . .
c1 c2 c3 cN-1 cN
Cipher Block Chaining Mode - CBC Decryption
mi = DK(ci) ⊕ ci-1 for i=1..N c0=IV
m1 m2 m3 mN-1 mN
IV . . .
D D D D D . . .
c1 c2 c3 cN-1 cN
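A minimal sketch of the CBC equations is given below. The 64-bit `toy_E`/`toy_D` pair is a hypothetical invertible mapping used only to make the example runnable; it is not DES, AES, or any secure cipher. The point of interest is the data dependency: encryption of block i needs c(i-1), while decryption needs only ciphertext blocks that are already available.

```python
MASK = (1 << 64) - 1
MULT = 0x9E3779B97F4A7C15          # odd constant, so x * MULT mod 2**64 is invertible
MULT_INV = pow(MULT, -1, 1 << 64)  # modular inverse (Python 3.8+)


def toy_E(k, x):
    """Hypothetical toy 'cipher': NOT secure, only invertible."""
    return ((x ^ k) * MULT) & MASK


def toy_D(k, y):
    """Inverse of toy_E."""
    return ((y * MULT_INV) & MASK) ^ k


def cbc_encrypt(k, iv, ms):
    cs, prev = [], iv              # c_0 = IV
    for m in ms:
        prev = toy_E(k, m ^ prev)  # c_i = E_K(m_i xor c_{i-1}): strictly sequential
        cs.append(prev)
    return cs


def cbc_decrypt(k, iv, cs):
    ms, prev = [], iv
    for c in cs:
        ms.append(toy_D(k, c) ^ prev)  # m_i = D_K(c_i) xor c_{i-1}
        prev = c
    return ms


key, iv = 0x0123456789ABCDEF, 0
plain = [5, 5, 42]
cipher = cbc_encrypt(key, iv, plain)
```

Two equal plaintext blocks produce different ciphertext blocks here, unlike in ECB mode, because each ciphertext is fed back into the next encryption.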
Primary factor in choosing the encryption/decryption unit architecture
Symmetric-key cipher mode of operation:
1. Non-feedback cipher modes
ECB, counter mode
2. Feedback cipher modes
CBC, CFB, OFB
Non-feedback Counter Mode - CTR
M0 M1 M2
E
Ci = Mi ⊕ AES(IV+i) for i=0..N
MN-1 MN
. . .
E E E E . . .
C1 C2 C3 CN-1 CN
IV IV+1 IV+2 IV+N-1 IV+N
Feedback cipher modes - CBC M1 M2 M3
E
IV
C1 = AES(M1 ⊕ IV)
Ci = AES(Mi ⊕ Ci-1) for i=2..N
MN-1 MN . . .
E E E E . . .
C1 C2 C3 CN-1 CN
Feedback cipher modes CBC, CFB, OFB
combinational logic
k rounds
register
multiplexer
round 1 round 2
round k . . . . .
k-rounds Loop Unrolling
Loop Unrolling: Timing
IN
OUT
P1
C1
P2
C2
P3
#rounds/k · extended_clock_period
CLK
speed
area
k=2 k=3 k=4 k=5
loop-unrolling basic architecture
Loop Unrolling: Speed vs. Area
speed = speed_basic · (1 + τ) / (1 + τ / k), where τ << 1
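The formula can be evaluated numerically to see how little loop unrolling gains when τ is small. This is a minimal sketch; the transcript does not define τ, so the value 0.1 below is purely an assumed illustration of a small overhead parameter.

```python
def unrolling_speedup(k, tau):
    """speed / speed_basic = (1 + tau) / (1 + tau / k) for k unrolled rounds."""
    return (1 + tau) / (1 + tau / k)


tau = 0.1  # assumed value with tau << 1
speedups = {k: unrolling_speedup(k, tau) for k in (1, 2, 3, 4, 5)}
upper_bound = 1 + tau  # limit of the expression as k grows
```

With τ = 0.1 the speed-up can never exceed 10%, while the combinational area grows roughly in proportion to k, which is consistent with the flat speed vs. area curve for loop unrolling.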
combinational logic
MUX register
one round
Architectures suitable for feedback modes
round K . . . .
round 1
round 2
MUX
round #rounds
. . . .
round 1
round 2
Decreasing area by resource sharing
F F
D0 D1
D0’ D1’
F
D0 D1
D0’ D1’
multiplexer
Before After
register
Throughput
Area
basic architecture
Resource Sharing: Speed vs. Area
- basic architecture
- resource sharing
resource sharing
Non-Feedback Cipher Modes ECB, counter, OCB
Comparison for non-feedback cipher modes, e.g. Counter Mode - CTR
M0 M1 M2
E
Ci = Mi ⊕ AES(IV+i) for i=0..N
MN-1 MN
. . .
E E E E . . .
C1 C2 C3 CN-1 CN
IV IV+1 IV+2 IV+N-1 IV+N
E
IV
E
C1
M1
Z1
Z1
E
C2
M2
Z2
Z2
E
CN-1
MN-1
ZN-1
ZN-1
E
CN
MN
ZN
MN
. . . L
R
length
g(L)
Zi=f(L, R)
E
0
E
T
ZN
τ bits
Control sum
OCB
Increasing speed by parallel processing
Encryption/ decryption
unit
Encryption/ decryption
unit
Encryption/ decryption
unit
Encryption/ decryption
unit
Encryption/ decryption
unit
Encryption/ decryption
unit
Increasing speed using pipelining
Cipher 1 Cipher 2
round 1 round 1
round 2
round 10
. . .
round 16
. . .
Speed = block size / target_clock_period
target clock period, e.g., 20 ns
Pipelined operation of the encryption unit
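Blocks in the pipeline stages (newest block first):
clock cycle 1: B1
clock cycle 2: B2 B1
clock cycle 3: B3 B2 B1
clock cycle 4: B4 B3 B2 B1
clock cycle 5: B5 B4 B3 B2
clock cycle 6: B6 B5 B4 B3
clock cycle 7: B7 B6 B5 B4
clock cycle 8: B8 B7 B6 B5
clock cycle 9: B9 B8 B7 B6
clock cycle 10: B10 B9 B8 B7
clock cycle 11: B11 B10 B9 B8
clock cycle 12: B12 B11 B10 B9
clock cycle 13: B13 B12 B11 B10
clock cycle 14: B14 B13 B12 B11
clock cycle 15: B15 B14 B13 B12
clock cycle 16: B16 B15 B14 B13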
. . . .
#rounds registers
round #rounds = one pipeline stage
round 1 = one pipeline stage
round 2 = one pipeline stage
Full outer-round pipelining
Total # of pipeline stages = #rounds
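The pipelined operation can be reproduced with a short simulation. This is a sketch under simple assumptions: one new block enters stage 1 every clock cycle, every stage takes exactly one cycle, and the helper names are hypothetical. It also evaluates the speed of a full pipeline as block size divided by clock period, using the 20 ns example period and an assumed 128-bit block.

```python
def pipeline_occupancy(num_stages, num_blocks, num_cycles):
    """Return, for each clock cycle, the block held in each stage (None if empty)."""
    table = []
    for cycle in range(1, num_cycles + 1):
        row = []
        for stage in range(1, num_stages + 1):
            b = cycle - stage + 1  # block that entered `stage - 1` cycles ago
            row.append(f"B{b}" if 1 <= b <= num_blocks else None)
        table.append(row)
    return table


def pipelined_speed(block_size_bits, clock_period_s):
    """One block leaves a full pipeline every clock cycle."""
    return block_size_bits / clock_period_s


occupancy = pipeline_occupancy(num_stages=4, num_blocks=16, num_cycles=16)
speed = pipelined_speed(128, 20e-9)  # assumed 128-bit block, 20 ns clock
```

Row 8 of `occupancy` holds B8, B7, B6, B5, matching clock cycle 8 of the slide; after the pipeline fills, a finished block emerges every cycle.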
Full mixed inner- and outer-round pipelining
round #rounds =k pipeline stages
. . . .
round 1 = k pipeline stages
round 2 =k pipeline stages
. . . .
. . . .
. . . .
k registers
Total # of pipeline stages = #rounds·k
k rounds
register1
register2
register k . . . .
pipeline stage 1 = round 1
pipeline stage 2 = round 2
pipeline stage k = round k
multiplexer
k-stage Outer-Round Pipelining
IN
OUT
P1
C1
P2
C3
P3
Outer-Round Pipelining: Timing
#rounds · clock_period
CLK
P4 P5 P6
C2 C4
speed
area
outer-round pipelining non-feedback modes
basic architecture
k=2
k=3
k=4
k=5
Outer-Round Pipelining: Speed vs. Area
outer-round pipelining feedback modes
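Why k-stage outer-round pipelining helps only in non-feedback modes can be seen from a simple cycle count. The model below is a simplified sketch, not taken from the lecture: it assumes #rounds is a multiple of k, one pipeline stage per clock cycle, and no input/output stalls. In a feedback mode the next block cannot start before the previous ciphertext is ready, so only one of the k stages is ever busy.

```python
def finish_cycle(num_blocks, rounds, k, feedback):
    """Clock cycle in which the last block leaves a k-stage outer-round pipeline."""
    if feedback:
        # Each block must wait for the previous ciphertext: strictly sequential.
        return num_blocks * rounds
    last = num_blocks - 1
    # Blocks are processed in groups of k; blocks within a group enter the
    # pipeline on consecutive cycles, and each block needs `rounds` cycles.
    return (last // k) * rounds + (last % k) + rounds


non_feedback = finish_cycle(num_blocks=8, rounds=16, k=4, feedback=False)
feedback = finish_cycle(num_blocks=8, rounds=16, k=4, feedback=True)
basic = finish_cycle(num_blocks=8, rounds=16, k=1, feedback=False)
```

Under these assumptions the pipelined unit finishes 8 blocks in 35 cycles in a non-feedback mode but needs 128 cycles in a feedback mode, the same as the basic architecture, while still paying for the extra area.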
round #rounds = one pipeline stage
. . . .
round 1 = one pipeline stage
round 2 = one pipeline stage
K registers
round K = one pipeline stage
. . . .
round 1 = one pipeline stage
round 2 = one pipeline stage
MUX K registers
combinational logic
MUX register
one round, no pipelining
Outer-round Pipelining
one round
register1
register2
register k . . . .
pipeline stage 1
pipeline stage 2
pipeline stage k
multiplexer
k-stage Inner-Round Pipelining
IN
OUT
P1
C1
P2
C3
P3
Inner-Round Pipelining: Timing
#rounds · (k · reduced_clock_period)
CLK
P4 P5 P6
C2 C4
speed
area
inner-round pipelining non-feedback modes
basic architecture k=2
k=3
k=4
k=5
inner-round pipelining feedback modes
Inner-Round Pipelining: Speed vs. Area
Mixed Inner- and Outer-round Pipelining
round #rounds =k pipeline stages
. . . .
round 1 = k pipeline stages
round 2 =k pipeline stages
. . . .
. . . .
. . . .
d) k registers
round K = k pipeline stages
. . . .
round 1 = k pipeline stages
round 2 = k pipeline stages
MUX
. . . .
. . . .
. . . .
c)
k registers
one round = k pipeline stages
MUX
. . . .
b) k registers
one round, no pipelining
MUX
a) register
combinational logic
- basic architecture - outer-round pipelining
- inner-round pipelining - mixed inner and outer-round pipelining
Area
Throughput
basic architecture
inner-round pipelining
mixed inner and outer-round pipelining
outer-round pipelining
K=2 K=3
K=4
K=2
K=3
k=2
kopt
Comparison of the traditional and new design methodologies
Area [CLB slices]
Speed [Mbit/s]
Choosing optimum architecture for non-feedback cipher modes
basic architecture
inner-round pipelining
mixed inner and outer-round pipelining
Area
Latency
basic architecture
inner-round pipelining
mixed inner and outer-round pipelining
outer-round pipelining
K=2 K=4 K=3 K=5
K=2 K=3
k=2
kopt
Latency vs. area dependence for the new design methodology
- basic architecture - outer-round pipelining
- inner-round pipelining - mixed inner and outer-round pipelining
r1 r2 r1
r1 r2 r1 r3 r4
op1 op2 op3 op4 op5
op1 op2 op3 op4 op5
TCLKmin
TCLKmin
Limits on the minimum clock period after pipelining (1)
1. Delay of a single round divided by k = number of internal pipeline stages
k=2
2. Delay of the longest indivisible operation
k=4
r1 r2 r1
cntr1 cntr3 cntr1
rc Control Unit
op1 op3 op4 op5
TCLKmin
op2
Limits on the minimum clock period after pipelining (2)
3. Delays within the control unit
4. Maximum latency
5. Maximum input/output bandwidth
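The first three limits can be combined into a single lower bound on the clock period. The sketch below uses hypothetical operation delays (in ns) for a round made of five operations and ignores register setup and clock-to-output overhead; limits 4 and 5 constrain the design differently and are not modeled.

```python
def min_clock_period(round_delay, k, op_delays, control_delay=0.0):
    """Lower bound on the clock period with k inner pipeline stages per round."""
    return max(
        round_delay / k,   # limit 1: round delay divided by k
        max(op_delays),    # limit 2: longest indivisible operation
        control_delay,     # limit 3: delays within the control unit
    )


ops = [2.0, 3.0, 1.0, 2.0, 2.0]        # assumed delays of op1..op5, in ns
t_k2 = min_clock_period(sum(ops), 2, ops)
t_k4 = min_clock_period(sum(ops), 4, ops)
```

With these numbers, k=2 is bounded by the round delay divided by k (5 ns), whereas at k=4 the longest indivisible operation (3 ns) becomes the limit, so adding more stages beyond that point no longer shortens the clock period.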
cntr2 cntr4
DES encryption/decryption core
clock reset
encrypt/decrypt
data input data available data read
64
key input
key available key read
64
IV input IV available IV read
64
Key memory
Key schedule
data output
write full
64
56 key
4 4
read bank write bank
round number round key 4 48
DES/3DES
key choice 2
AES encryption/decryption core
clock reset
encrypt/decrypt
data input data available data read
128
key input key available
key read
64
IV input IV available IV read
128
Key schedule
Key memory
data output
write full
128
64 Key material
4 4
read bank write bank
round number round key 4 128
DES/3DES
Cycle number 6
2 key size
speed
area
k=2 k=3 k=4 k=5
outer-round pipelining
inner-round pipelining
loop-unrolling
resource sharing
- basic architecture - loop unrolling
- inner-round pipelining - outer-round pipelining
- resource sharing
basic architecture
k=2
k=3
k=4
k=2
k=3
k=4
k=5
k=5
Performance of alternative architectures: in non-feedback cipher modes (ECB, counter)
speed
area
k=2 k=3 k=4 k=5
outer-round pipelining inner-round pipelining
loop-unrolling
resource sharing
basic architecture
Performance of alternative architectures: in feedback cipher modes (CBC, CFB, OFB)
- basic architecture - loop unrolling
- inner-round pipelining - outer-round pipelining
- resource sharing
Hash Functions
Hash Function
arbitrary length
message
hash function
hash value h(m)
h
m
fixed length
It is computationally infeasible to find such
m and m’ that h(m)=h(m’)
Collision Resistance:
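Collision resistance depends on the hash value being long enough. As a minimal illustration (an assumption-laden sketch, not part of the lecture), the code below truncates SHA-256 to 16 bits and finds two different messages with the same truncated value by simple search; with a full-length output the same search is computationally infeasible.

```python
import hashlib


def truncated_hash(m: bytes, bits: int = 16) -> int:
    """Top `bits` bits of SHA-256(m); a deliberately weak hash for illustration."""
    return int.from_bytes(hashlib.sha256(m).digest(), "big") >> (256 - bits)


def find_collision(bits: int = 16):
    """Search messages "0", "1", "2", ... until two share a truncated hash value."""
    seen = {}
    i = 0
    while True:
        m = str(i).encode()
        t = truncated_hash(m, bits)
        if t in seen:
            return seen[t], m
        seen[t] = m
        i += 1


m1, m2 = find_collision(16)
```

The search must terminate after at most 2^16 + 1 messages, since only 2^16 distinct truncated values exist.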
Message
Hash function
Public key cipher
Alice Signature
Alice’s private key
Bob
Hash function
Alice’s public key
Hash Functions in Digital Signature Schemes
Hash value 1
Hash value 2
Hash value
Public key cipher
yes no
Message Signature
General scheme for constructing a secure hash function: Merkle–Damgård scheme
Message m
Padding, appending bit length, M
M1
IV H0 H1 H2 f f . . .
Ht
compression function
output transformation
h(m) g
M2 Mt . . .
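The Merkle–Damgård flow (padding with appended bit length, iterated compression function f, output transformation g) can be sketched as below. The assumptions are significant: the compression function is a stand-in built from SHA-256, the block size, padding layout, and all-zero IV are chosen for illustration, and g is taken to be the identity. This is not the definition of any standardized hash function.

```python
import hashlib

BLOCK = 64  # assumed message block size in bytes


def pad(m: bytes) -> bytes:
    """Append a 1 bit, zero bits, and the 64-bit message bit length."""
    bitlen = 8 * len(m)
    m += b"\x80"
    m += b"\x00" * ((-len(m) - 8) % BLOCK)
    return m + bitlen.to_bytes(8, "big")


def f(h: bytes, block: bytes) -> bytes:
    """Hypothetical compression function: H_i = f(H_{i-1}, M_i)."""
    return hashlib.sha256(h + block).digest()


def md_hash(m: bytes, iv: bytes = b"\x00" * 32) -> bytes:
    M = pad(m)
    h = iv                            # H_0 = IV
    for i in range(0, len(M), BLOCK):
        h = f(h, M[i:i + BLOCK])      # H_1, H_2, ..., H_t
    return h                          # g assumed to be the identity


digest = md_hash(b"abc")
```

The loop carries H from one block to the next, so blocks of a single message cannot be compressed in parallel; this is why the multi-unit and pipelined hash architectures require independent messages.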
Sponge Scheme
R
H
+
IV
CLR
Wt Kt Step t
Basic iterative architecture of SHA-1 and SHA-2
OUT
Data Stream 1 . . . . . . . . Data Stream k
. . . . .
R
H
+
IV
CLR
Wt KtStep t
R
H
+
IV
CLR
Wt KtStep t
Architecture with Multiple Processing Units
R
H
+
IV
CLR
Wt KtStep t
R
H
+
IV
CLR
Wt KtStep t
Features of architecture with multiple processing units
Pros:
• Throughput increases by a factor of k
Cons:
• Latency the same as for the basic architecture
• Area increases by a factor of k
• Requires k independent data streams (messages)
Pipelined architecture
H
+
R 1
IV
K t W t
H
IV
R 2 step t, stage 1
step t, stage 2
Features of the pipelined architecture
Pros:
• Throughput increases by a factor close to k
• Area increases by a factor smaller than k
Cons:
• Latency the same as for the basic architecture
• Requires k independent data streams (messages)
R
H
+
IV
CLR
. . . .
Wt Wt+1
Wt+k-1
Kt Kt+1
Kt+k-1
Step t Step t+1
Step t+k-1
. . . .
Unrolled architecture
Features of the unrolled architecture
Pros:
• Improves both latency and throughput (lower latency, higher throughput)
• Requires only one data stream
Cons:
• Area may increase substantially compared to the basic architecture
• datapath width = state size
• one clock cycle per one round
Starting Point: Basic Iterative Architecture
Currently the most common architecture used to implement SHA-1, SHA-2, and many other hash functions.
Throughput
Area A
Th x1
• datapath width = state size
• two clock cycles per one round
Horizontal Folding - /2(h)
Typically Throughput/Area ratio increases
Throughput
Area A
Th x1 /2(h)
• datapath width = state size/2
• two clock cycles per one round/step
Vertical Folding - /2(v)
Throughput
Area A
Th x1
/2(v)
Typically Throughput/Area ratio decreases
Vertical Folding with the State Kept in Memory BLAKE /4(h)/4(v)-m
• datapath width = state size/4
• 16 clock cycles per one round
Throughput/Area ratio increases
Throughput
Area A
Th x1
/4(h)/4(v)-m
• datapath width = state size
• one clock cycle per two rounds
Unrolling - x2
Typically Throughput/Area ratio decreases
Throughput
Area A
Th
2A
x1
x2
• datapath width = state size
• one clock cycle per two rounds
Efficient Unrolling - x2
Throughput
Area A
Th
2A
x1 x2
x2-efficient
Sometimes Throughput/Area ratio increases
Multiple Packets Available for Parallel Processing
[Figure: queues of packets of mixed sizes (40 B, 64 B, 576 B, 1500 B) from sources N1, N2, N3, …, NX]
Typical sizes of packets: 40 B – 1500 B
1500 B = Maximum Transmission Unit (MTU) for Ethernet v2
Parallel Processing Using Multi-Unit Architecture – MU2
Throughput
Area A
Th
2A
2Th
Typically Throughput/Area ratio stays the same
x1
MU2
Unrolled Architecture with Pipelining - x2-PPL2
Throughput
Area A
Th
2A
2Th
x1
x2-PPL2
Typically Throughput/Area ratio stays almost the same
Basic Architecture with Pipelining - x1-PPL2
Throughput
Area A
Th
2A
2Th
x1
x1-PPL2
Typically Throughput/Area ratio increases
BLAKE-256 in Virtex 5
Why the Interface Matters
• Pin limit
Total number of I/O ports ≤ total number of FPGA I/O pins
• Support for the maximum throughput
Time to load the next message block ≤ time to process the previous block
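The throughput condition translates into a minimum interface width. The sketch below assumes that one w-bit word is transferred per clock cycle and that the interface and the core share the same clock; the 512-bit block and 65 cycles per block are hypothetical example values, not figures from the lecture.

```python
def load_cycles(block_size_bits, bus_width_bits):
    """Clock cycles needed to load one message block over a w-bit bus (ceiling)."""
    return -(-block_size_bits // bus_width_bits)


def load_keeps_up(block_size_bits, bus_width_bits, cycles_per_block):
    """True if loading the next block takes no longer than processing the previous one."""
    return load_cycles(block_size_bits, bus_width_bits) <= cycles_per_block


def min_bus_width(block_size_bits, cycles_per_block):
    """Smallest bus width (in bits) that satisfies the throughput condition."""
    return -(-block_size_bits // cycles_per_block)


ok_32 = load_keeps_up(512, 32, 65)   # 16 load cycles vs. 65 processing cycles
ok_4 = load_keeps_up(512, 4, 65)     # 128 load cycles: the core would starve
w_min = min_bus_width(512, 65)
```

A narrower bus saves pins but, below the minimum width, the core idles while waiting for data, so the two bullet points above pull in opposite directions.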
Interface: Two possible solutions
Length of the message communicated at the beginning
+ easy to implement passive source circuit
− area overhead for the counter of message bits
Dedicated end of message port
− more intelligent source circuit required
+ no need for internal message bit counter
msg_bitlen
zero_word
message end_of_msg SHA core
SHA Core: Interface & Typical Configuration
• SHA core is an active component; surrounding FIFOs are passive and widely available
• Input interface is separate from the output interface
• Processing the current block, reading the next block, and storing the result for the previous message can all be done in parallel
fifoin_empty
fifoin_read
idata w w
odata
fifoout_full
fifoout_write
fifoin_full
fifoin_write
fifoout_empty
fifoout_read
Input FIFO SHA core
clk rst
ext_idata
w
ext_odata din dout
src_ready
src_read
dst_ready
dst_write
din dout
full empty
write read
Output FIFO
din dout
full empty
write read
w
clk rst
clk rst clk rst
clk rst
clk rst
SHA Core Interface
w
SHA core
din dout
src_ready
src_read
dst_ready
dst_write
clk rst
clk rst
w
Communication Protocol for Unpadded Messages
msg_bitlen
zero_word −−−−−
message
w bits
.
.
.
seg_0_bitlen
zero_word
seg_0
w bits
seg_1_bitlen
seg_1
. . .
seg_n-1_bitlen
seg_n-1
a) b)
−−−−−
Communication Protocol for Unpadded Messages Without Message Splitting
last = 1 | msg_len_bp
message
msg_len_bp – message length before padding [bits]
w bits
Communication Protocol for Unpadded Messages With Message Splitting
.
.
.
last=0 | seg_0_len_bp
seg_0
w bits
last = 0 | seg_1_len_bp
seg_1
. . .
last = 1 | seg_n-1_len_bp
seg_n-1
seg_i_len_bp – segment i length before padding [bits]
* For all i < n-1 segment i length is assumed to be a multiple of the message block size, b [characteristic to each function], and thus also the word size, w. The last segment cannot consist of only padding bits. It must include at least one message bit.
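A source circuit following this protocol could build its word stream roughly as sketched below. The exact bit layout is not given in the transcript, so several details are assumptions made for illustration: the `last` flag is placed in the most significant bit of the w-bit header word, the segment length in bits occupies the remaining w−1 bits, data words are big-endian, and an incomplete final word is zero-filled. The function name is hypothetical.

```python
def encode_segments(segments, w=32):
    """Turn message segments into w-bit words: a header word, then the data words."""
    words = []
    step = w // 8  # bytes per word
    for i, seg in enumerate(segments):
        last = 1 if i == len(segments) - 1 else 0
        bitlen = 8 * len(seg)              # seg_i_len_bp, length before padding
        if bitlen >= 1 << (w - 1):
            raise ValueError("segment too long for the assumed header layout")
        words.append((last << (w - 1)) | bitlen)   # header: last | seg_i_len_bp
        padded = seg + b"\x00" * (-len(seg) % step)  # zero-fill the final word
        for j in range(0, len(padded), step):
            words.append(int.from_bytes(padded[j:j + step], "big"))
    return words


# Two segments: a full 512-bit block followed by a short final segment.
stream = encode_segments([b"abcd" * 16, b"xyz"], w=32)
```

The first segment here is 512 bits long, which respects the rule that every segment except the last is a multiple of the block size (assuming b = 512 for this example).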
w
SHA core
din dout
src_ready
src_read
dst_ready
dst_write
clk rst
clk rst
w
io_clk
io_clk
[Figure: SHA core with input and output FIFOs; clock inputs clk and io_clk. FIFO-side signals: ext_idata, fifoin_full, fifoin_write, fifoin_empty, fifoin_read, idata, odata, fifoout_full, fifoout_write, fifoout_empty, fifoout_read, ext_odata. SHA core ports: din, dout, src_ready, src_read, dst_ready, dst_write. All data buses are w bits wide.]
[Figure: the same SHA core and FIFO configuration, with the FIFO clock inputs labeled "clk or io_clk".]