SIMD Instruction Set Extensions for KECCAK with Applications to SHA-3, Keyak and Ketje
Hemendra K. Rawat and Patrick Schaumont
Virginia Tech, Blacksburg, USA
{hrawat, schaum}@vt.edu
Motivation
“A flexible SIMD instruction set for a flexible KECCAK sponge”
• Secure systems and protocols rely on a suite of cryptographic applications
  - Hashing, MAC, Encryption, PRNG, AEAD etc.
• Traditionally, different algorithms: SHA-1, SHA-2, AES, GMAC, HMAC
• Different instruction set extensions: AES-NI, SHA and carry-less multiplication
  - Vector extensions in Intel, ARM, SPARC and PowerPC
• KECCAK is SHA-3
  - Successor of SHA-2
• Subset of the cryptographic primitive family KECCAK SPONGE
• Not just a hash
• A flexible sponge for all symmetric cryptography
Outline
• KECCAK Background
• Novelty Claims
• Design Principles
• Proposed Instruction Set Extensions
• Results
  - Performance
  - Hardware Overhead
• Conclusion
KECCAK HASH (SHA-3)
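[Figure: Hashing using the KECCAK sponge construction. A variable-length input is padded and absorbed in r-bit blocks into a state made of a rate part (r bits) and a capacity part (c bits), both initialized to 0. The cryptographic permutation f is applied after each block in the absorbing phase and between output blocks in the squeezing phase, which produces the output digest.]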
KECCAK-f[b] Permutation
The permutation f operates on a b-bit state, state[0 : b]:

    def keccak_f(state, max_rounds):
        for round_index in range(max_rounds):
            theta(state)              # Add column parities
            rho(state)                # Rotate lanes
            pi(state)                 # Transpose lanes
            chi(state)                # Add non-linearity
            iota(state, round_index)  # L00 ⊕ K_round

b = 1600, 800, 400, 200, 100, 50, 25 bits
b = (r + c) bits
[Figure: KECCAK-f in a nutshell. The state is a 5 × 5 array of lanes Lxy (x = 0..4, y = 0..4) extending along the z axis; the slice, plane and lane substructures are marked.]
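The sponge and the round structure described on these two slides can be sketched in plain Python. This is a minimal software model, assuming the 1600-bit permutation with 24 rounds and the SHA3-256 parameters (rate 1088 bits, capacity 512 bits, domain suffix 01). The names keccak_f1600, permute and sha3_256 are illustrative and do not come from the slides, and the sketch says nothing about how the proposed SIMD instructions schedule the same work.

```python
import hashlib

MASK64 = (1 << 64) - 1

def rol64(value, shift):
    """Rotate a 64-bit lane left by shift bits."""
    shift %= 64
    return ((value << shift) | (value >> (64 - shift))) & MASK64

def keccak_f1600(lanes):
    """Apply 24 rounds in place; lanes[x][y] holds lane Lxy."""
    lfsr = 1  # generates the round-constant bits used by iota
    for _ in range(24):
        # theta: add column parities
        c = [lanes[x][0] ^ lanes[x][1] ^ lanes[x][2] ^ lanes[x][3] ^ lanes[x][4]
             for x in range(5)]
        d = [c[(x + 4) % 5] ^ rol64(c[(x + 1) % 5], 1) for x in range(5)]
        for x in range(5):
            for y in range(5):
                lanes[x][y] ^= d[x]
        # rho and pi: rotate each lane and move it to its new position
        x, y = 1, 0
        current = lanes[x][y]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            current, lanes[x][y] = lanes[x][y], rol64(current, (t + 1) * (t + 2) // 2)
        # chi: non-linear step along each plane (fixed y)
        for y in range(5):
            plane = [lanes[x][y] for x in range(5)]
            for x in range(5):
                lanes[x][y] = plane[x] ^ (~plane[(x + 1) % 5] & plane[(x + 2) % 5])
        # iota: XOR the round constant into L00
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
            if lfsr & 2:
                lanes[0][0] ^= 1 << ((1 << j) - 1)
    return lanes

def permute(state):
    """Run the permutation on a 200-byte state (little-endian lanes)."""
    lanes = [[int.from_bytes(state[8 * (x + 5 * y): 8 * (x + 5 * y) + 8], "little")
              for y in range(5)] for x in range(5)]
    keccak_f1600(lanes)
    out = bytearray(200)
    for x in range(5):
        for y in range(5):
            out[8 * (x + 5 * y): 8 * (x + 5 * y) + 8] = lanes[x][y].to_bytes(8, "little")
    return out

def sha3_256(message):
    """Sponge with r = 1088 bits, c = 512 bits and a 256-bit digest."""
    rate = 136  # rate in bytes
    padded = bytearray(message) + b"\x06"   # suffix 01 plus first padding bit
    padded += bytes(-len(padded) % rate)    # zero fill up to a full block
    padded[-1] |= 0x80                      # final padding bit
    state = bytearray(200)                  # rate and capacity start at 0
    for offset in range(0, len(padded), rate):
        for i in range(rate):               # absorb one r-bit block
            state[i] ^= padded[offset + i]
        state = permute(state)
    return bytes(state[:32])                # squeeze: digest fits in one block

# Cross-check against the standard library implementation.
assert sha3_256(b"abc") == hashlib.sha3_256(b"abc").digest()
```

Only the first 136 bytes of the state are ever XORed with input or read as output; the remaining 64 bytes form the capacity.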
KECCAK Applications
[Figure: Constructions with sponge: (a) Hash (message in, digest out), (b) Message Authentication Code (key and message in, MAC out), (c) Keystream Generation (key and IV in, keystream out). Construction with duplex: (d) Authenticated Encryption (key and message in, keystream and MAC out). Each construction iterates f over a state that starts at 0 and is split into rate r and capacity c.]
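Constructions (b) and (c) can be illustrated with the sponge instances that ship in Python's hashlib. This is only a sketch under assumptions: SHA3-256 and SHAKE128 stand in for the generic sponge, the key is assumed to have a fixed length so that key || message is unambiguous, and the helper names sponge_mac, sponge_keystream and xor_bytes are hypothetical. It is not the Keyak or Ketje mode.

```python
import hashlib

def sponge_mac(key, message):
    """(b) MAC: absorb the key, then the message, and squeeze a tag."""
    return hashlib.sha3_256(key + message).digest()

def sponge_keystream(key, iv, length):
    """(c) Keystream: absorb key and IV, then squeeze length bytes."""
    return hashlib.shake_128(key + iv).digest(length)

def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

key = b"k" * 16          # illustrative fixed-length key
iv = b"n" * 16           # illustrative IV, must not repeat under one key
message = b"attack at dawn"

tag = sponge_mac(key, message)
keystream = sponge_keystream(key, iv, len(message))
ciphertext = xor_bytes(message, keystream)
assert xor_bytes(ciphertext, keystream) == message
```

The duplex construction in (d) interleaves absorbing and squeezing on one state, which these one-shot calls do not model.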
Application Stack
[Figure: Application stack, top to bottom]
Application Layer: Hash, MAC, PRNG, AEAD
Cryptographic Construction / API: Sponge, Duplex
Primitives and their Custom Instructions:
  KECCAK-f[1600]:   rl1x, kxorr64, chi1, chi2
  KECCAK-p[800,12]: rl1x, xorr, chi1, chi3
  KECCAK-f[400]:    rl1x, xorr, chi1, chi3
  KECCAK-f[200]:    rl1x, xorr, chi1, chi3
Novelty Claims
Previous work
• Custom-instruction designs for a 16 bit micro-controller for SHA-3 finalists, Constantin et al.
• Integration of a 64 bit KECCAK (SHA-3) data path into a 32 bit LEON3, Wang et al.
This work
• Six new custom instructions for a 128 bit SIMD unit
• Multipurpose KECCAK: 1600, 800, 400 and 200 bits
• Five KECCAK algorithms: SHA3-512 hash, LakeKEYAK, RiverKEYAK, KetjeSR and KetjeJR authenticated ciphers
• Compatible with the ARM NEON instruction set
Design Principles
Portability
  - Easy integration into different processor architectures
  - RISC-like instruction format (2 input operands, 1 output operand)
  - No non-standard architectural features
Flexibility
  - Support for multiple symmetric cryptographic applications
    - Hashing
    - MACing
    - Stream Ciphers
    - AEAD
    - PRNG
Simplicity
  - Small set of instructions
  - Low hardware overhead
  - Simple operations like XOR, AND, NOT and rotations
  - Short critical path
Design Principles
Performance
  - Optimize the computation-intensive part: KECCAK-{f,p} primitives
  - Partition the dataflow graph into SIMD-like instruction patterns
    - 2 x 128 bit input and 128 bit output
  - Minimize the schedule length of the graph
  - Optimize instruction shapes and functionality for
    - High ILP (instruction-level parallelism)
    - Minimum register spills
    - Minimum MOV/VEXT for register rearrangements
    - In-place round computation
Mapping KECCAK to Instructions
[Figure: Data dependencies of the θ-effect. The column parities C0..C4 are the XOR of the five lanes in each column of the 5 × 5 state (lanes Lxy, x = 0..4, y = 0..4). Each θ-effect value D0..D4 is the XOR of one column parity with another column parity rotated left by 1 (ROL1).]

[Figure: Instruction rl1x. A ROL1 and an XOR combine source registers d0/s0 and d1/s1 into destination register d2/s2.]

rl1x.u64 d2, d0, d1
rl1x.u32 s2, s0, s1
rl1x.u16 s2, s0, s1
rl1x.u8  s2, s0, s1

NEON Register Naming Convention
• Qi = Quad word (128 bit)
• Di = Double word (64 bit)
• Si = Single word (32 bit)

NEON Instruction specifier
• VADD.I32 q1, q2, q3 specifies that q1, q2 and q3 each hold 4 x 32-bit integers
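A small Python model of the θ-effect computation may help here. It assumes that rl1x returns its first source XORed with its second source rotated left by 1; the slide shows the ROL1 and XOR but not which operand is rotated, so the operand order is an assumption, and the function names rol, rl1x and theta_effect are hypothetical.

```python
def rol(value, shift, width):
    """Rotate a width-bit value left by shift bits."""
    mask = (1 << width) - 1
    shift %= width
    return ((value << shift) | (value >> (width - shift))) & mask

def rl1x(src0, src1, width=64):
    """Assumed semantics of rl1x: src0 XOR ROL1(src1) on width-bit lanes."""
    return src0 ^ rol(src1, 1, width)

def theta_effect(state, width=64):
    """state[y][x] holds lane Lxy; returns the values D0..D4."""
    parity = [0] * 5
    for row in state:                 # column parity C[x]: XOR over all y
        for x in range(5):
            parity[x] ^= row[x]
    # D[x] = C[x-1] XOR ROL1(C[x+1]), one rl1x per column
    return [rl1x(parity[(x - 1) % 5], parity[(x + 1) % 5], width)
            for x in range(5)]

# Example: a state whose only non-zero lane is L00 = 1.
state = [[0] * 5 for _ in range(5)]
state[0][0] = 1
d_values = theta_effect(state)
```

The width argument plays the role of the .u64, .u32, .u16 and .u8 specifiers: the same operation applies to the lanes of KECCAK-f[1600], [800], [400] and [200].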
KECCAK Custom Instructions

Instruction  Step             Description             Target Primitive     Syntax
rl1x         theta            Rotate Left 1 and XOR   1600, 800, 400, 200  rl1x.u64 d2, d0, d1
                                                                           rl1x.u32 s2, s0, s1
                                                                           rl1x.u16 s2, s0, s1
                                                                           rl1x.u8 s2, s0, s1
kxorr64      theta, rho & pi  XOR Rotate & Assign     1600                 kxorr64 d2, d0, d1, #i
xorr         theta, rho & pi  XOR Rotate & Assign     800, 400, 200        xorr.u32 s2, s0, s1, #i
                                                                           xorr.u16 s2, s0, s1, #i
                                                                           xorr.u8 s2, s0, s1, #i
chi1         chi              chi step                1600, 800, 400, 200  chi1.u32 q2, q0, q1
                                                                           chi1.u64 q2, q0, q1
chi2         chi              chi step (last lane)    1600                 chi2.u64 d4, q0, q1
chi3         chi              chi step (last lane)    800, 400, 200        chi3.u32 s4, d0, d1
Implementation & Validation
• Reference implementations
  - KeccakCodePackage: http://keccak.noekeon.org/files.html
• Hand-optimized custom instruction implementations
  - KECCAK-f,p {1600, 800, 400, 200}
• Cross-compiled with a GCC toolchain extended with KECCAK instruction support
• Timing Simple CPU model in GEM5
• Instructions described in GEM5's ISA description language
• Simulation
  - Single core ARM CPU @ 1 GHz
  - 32KB L1 I and D cache
  - L2 cache of 2MB
[Figure: Tool flow. The crypto applications (SHA3, LakeKeyak, RiverKeyak, KetjeJR, KetjeSR) are built into ARM native binaries by the GNU compiler framework (compiler, assembler, linker) extended with the KECCAK instructions, and run on the GEM5 architectural simulator. GEM5 blocks shown: front end, CPU models (Atomic Simple, Timing Simple, InOrder, O3) and ISA (TLB, faults, interrupts, decoder with the KECCAK instructions).]
Results: Performance
[Figure: Performance in instructions/byte for various KECCAK modes. Bar chart, y-axis INSTRUCTIONS/BYTE from 0 to 350, comparing Optimized C, 32-bit ASM, NEON and This work (CI) for SHA3 (c=1024), LakeKeyak (E), LakeKeyak (D), RiverKeyak (E), RiverKeyak (D), KetjeSR (E), KetjeSR (D), KetjeJR (E) and KetjeJR (D). Group annotations: KECCAK-f[1600], KECCAK-f[800,nr], KECCAK-f[400,nr], KECCAK-p[200,nr]; Hash, AEAD (Encryption+MAC), Lightweight AEAD.]

Performance gain between 1.4x and 2.6x over hand-optimized assembly on ARMv7
Results: Hardware Cost
[Figure: Gate equivalent estimates with UMC 90nm (4658 GE), stacked bar from 0% to 100% with legend rl1x, kxorr64, xorr, chi1, chi2, chi3 and data labels 1869, 1221, 894, 242, 122.]

[Figure: Overhead. KECCAK instructions (0.1%) compared with the Cortex-A9 core (100%).]
Conclusion
• Analysis of the instruction set design space for KECCAK primitives
• Six custom instructions based on the NEON instruction set in ARMv7
• Five different KECCAK applications: SHA3, LakeKEYAK, RiverKEYAK, KetjeSR and KetjeJR
• Performance gain between 1.4x and 2.6x over hand-optimized assembly on ARMv7, at a hardware overhead of just 4658 GEs
• Portability aspects (Intel AVX, generic 64/32 bit architectures) are covered in the paper
Questions!
Thank you ….
Appendix
KECCAK (SHA-3)
Instance          Definition
SHA3-224(M)       Keccak[448](M||01, 224)
SHA3-256(M)       Keccak[512](M||01, 256)
SHA3-384(M)       Keccak[768](M||01, 384)
SHA3-512(M)       Keccak[1024](M||01, 512)
SHAKE128(M, d)    Keccak[256](M||1111, d)
SHAKE256(M, d)    Keccak[512](M||1111, d)

[Figure: Message M (arbitrary length) is mapped by H to a hash value (fixed length).]
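The bracketed number in each instance definition is the capacity c in bits. Since b = r + c and the SHA-3 instances use the 1600-bit state, the rate follows by subtraction. The short Python sketch below works this out; the dictionary and function names are illustrative only.

```python
STATE_BITS = 1600  # b for KECCAK-f[1600], used by all SHA-3 instances

# capacity c in bits and domain-separation suffix, copied from the definitions
INSTANCES = {
    "SHA3-224": (448, "01"),
    "SHA3-256": (512, "01"),
    "SHA3-384": (768, "01"),
    "SHA3-512": (1024, "01"),
    "SHAKE128": (256, "1111"),
    "SHAKE256": (512, "1111"),
}

def rate_bits(capacity_bits, state_bits=STATE_BITS):
    """r = b - c: the number of state bits absorbed or squeezed per call to f."""
    return state_bits - capacity_bits

for name, (capacity, suffix) in INSTANCES.items():
    rate = rate_bits(capacity)
    print(f"{name}: c = {capacity}, r = {rate} bits ({rate // 8} bytes), suffix {suffix}")
```

A larger capacity leaves a smaller rate, so SHA3-512 processes fewer message bytes per permutation call than SHA3-224.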
Keccak
Source: http://keccak.noekeon.org/
KECCAK-f Permutation (2)
[Figure: The five steps of a round: θ step, ρ step, π step, χ step and ι step (L00 ⊕ K_round). Source: http://keccak.noekeon.org/]

R = ι ∘ χ ∘ π ∘ ρ ∘ θ (θ is applied first, ι last)
Proposed Instruction Set Extensions
[Figure (a): Data dependencies of θ, ρ, π, χ, ι (one plane). Lanes L00, L11, L22, L33, L44 are XORed with D0..D4 (θ & π step), rotated (ρ step) and combined by χ (χ step) into lanes L00, L10, L20, L30, L40, with B0..B4 and P0..P4 as intermediate values.]

[Figure (b): Instruction kxorr64/xorr. An XOR and a rotate left by i, ROL(i), combine source registers d0/s0 and d1/s1 into destination register d2/s2.]

kxorr64  d2, d0, d1, #i
xorr.u32 s2, s0, s1, #i
xorr.u16 s2, s0, s1, #i
xorr.u8  s2, s0, s1, #i
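The fused θ, ρ and π work for one output plane can be modelled in a few lines of Python. The sketch assumes that kxorr64/xorr computes ROL(src0 XOR src1, #i), which fits the "XOR Rotate & Assign" description but is not spelled out on the slide. The rotation amounts are the standard KECCAK-f[1600] ρ offsets for lanes L00, L11, L22, L33 and L44; the names xorr and theta_rho_pi_plane0 are hypothetical.

```python
def rol(value, shift, width):
    """Rotate a width-bit value left by shift bits."""
    mask = (1 << width) - 1
    shift %= width
    return ((value << shift) | (value >> (width - shift))) & mask

def xorr(src0, src1, shift, width=64):
    """Assumed semantics of kxorr64/xorr: ROL(src0 XOR src1, shift)."""
    return rol(src0 ^ src1, shift, width)

# rho offsets of L00, L11, L22, L33, L44 in KECCAK-f[1600]
RHO_DIAGONAL = [0, 44, 43, 21, 14]

def theta_rho_pi_plane0(diagonal_lanes, d_values, width=64):
    """Inputs to chi for plane y = 0: one xorr per lane.

    diagonal_lanes holds L00, L11, L22, L33, L44 and d_values holds D0..D4.
    """
    return [xorr(lane, d_values[x], RHO_DIAGONAL[x], width)
            for x, lane in enumerate(diagonal_lanes)]

# Example: every diagonal lane is 1 and the theta-effect is zero.
rotated = theta_rho_pi_plane0([1, 1, 1, 1, 1], [0, 0, 0, 0, 0])
```

For narrower lanes the rol helper reduces each offset modulo the lane width.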
Proposed Instruction Set Extensions
[Figure: Data dependencies of the χ step (one plane) and the instructions chi1, chi2 and chi3, in panels (a), (b) and (c). Source: http://keccak.noekeon.org/]

chi1.u32 q2, q0, q1   NOT, AND and XOR on four 32 bit lanes of q0 and q1, result in q2
chi1.u64 q2, q0, q1   NOT, AND and XOR on two 64 bit lanes of q0 and q1, result in q2
chi2.u64 d4, q0, q1   NOT, AND and XOR for the last lane, result in d4
chi3.u32 s4, d0, d1   NOT, AND and XOR for the last lane, result in s4
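The NOT, AND and XOR pattern in these figures is the χ step applied along one plane. A minimal Python model of that step is given below; it covers the arithmetic only and does not reproduce how chi1, chi2 and chi3 pack lanes into q, d and s registers, which the transcript does not specify. The function name chi_plane is hypothetical.

```python
def chi_plane(plane, width=64):
    """Apply chi to one plane of five lanes.

    out[x] = plane[x] XOR (NOT plane[x+1] AND plane[x+2]), indices mod 5.
    """
    mask = (1 << width) - 1
    return [plane[x] ^ (~plane[(x + 1) % 5] & plane[(x + 2) % 5] & mask)
            for x in range(5)]

# Example on 8 bit lanes, as in KECCAK-f[200].
result = chi_plane([1, 0, 1, 0, 0], width=8)
```

Every output lane depends on three neighbouring input lanes of the same plane, which is why the whole plane has to be available before it can be updated in place.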
Portability Aspects
Instruction   Intel AVX   64 bit Arch   32 bit Arch
rl1x          ✓           ✓*            ✓*
kxorr64       ✓           ✓             X
xorr          ✓           ✓             ✓
chi1          ✓           X             X
chi2          ✓           X             X
chi3          ✓           ✓             X

Table 2: Feasibility of the proposed instructions on different platforms
• Based on instruction format, register width and the instruction encoding width.
* Not all variants supported
Results: Hardware Cost
Instruction   Area (µm²)   Gate equivalent   Transistor equivalent
rl1x          1238         310               1238
kxorr64       7474         1869              7474
xorr          4884         1221              4884
chi1          3576         894               3576
chi2          966          242               966
chi3          486          122               486
Total         18624        4658              18624

Table 1: Area, GE and transistor count estimates for the proposed custom instructions

Typical Cortex-A9 CPU: 3.8 million gates
KECCAK instructions have 0.1% overhead
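The totals and the overhead figure can be rechecked from the per-instruction gate equivalents in Table 1 and the 3.8 million gate count quoted for the Cortex-A9. The variable names in this short Python check are illustrative.

```python
# Gate equivalents per instruction, copied from Table 1
GATE_EQUIVALENTS = {
    "rl1x": 310,
    "kxorr64": 1869,
    "xorr": 1221,
    "chi1": 894,
    "chi2": 242,
    "chi3": 122,
}
CORTEX_A9_GATES = 3_800_000  # typical core size quoted on the slide

total_ge = sum(GATE_EQUIVALENTS.values())
overhead_percent = 100 * total_ge / CORTEX_A9_GATES

print(f"total: {total_ge} GE")
print(f"overhead: {overhead_percent:.2f}% of the core")
for name, ge in GATE_EQUIVALENTS.items():
    print(f"{name}: {100 * ge / total_ge:.1f}% of the added logic")
```

The sum reproduces the 4658 GE total, and the ratio comes to about 0.12%, which the slide rounds to 0.1%.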
Motivation
Cryptographic Instruction-Set
• Intel AES-NI, SHA-1,2, Carry-less multiplication extensions
• ARMv8 Crypto Instructions
Processor Architecture
• ARM (Soft IP)
• RISC-V (Open Architecture from UC Berkeley)
Symmetric Cryptography Applications
• Hash, MAC, Encryption/Decryption, AEAD, PRNG
Universal Sponge Construction
• Hash: SHA-3 (Winner of NIST SHA-3 Competition)
• AEAD: LakeKeyak, Ketje
• PRNG, Stream Ciphers etc.