Sandy2x: New Curve25519 Speed Records

Tung Chou

Technische Universiteit Eindhoven, The Netherlands

October 13, 2015


X25519 and Ed25519

X25519

• ECDH scheme

• public keys and shared secrets are points on the Montgomery curve

$y^2 = x^3 + 486662x^2 + x$

over $\mathbb{F}_{2^{255}-19}$

• by Bernstein, 2006

Ed25519

• signature scheme

• public keys and (part of) signatures are points on the twisted Edwards curve

$-x^2 + y^2 = 1 - \frac{121665}{121666}x^2y^2$

over $\mathbb{F}_{2^{255}-19}$

• by Bernstein, Duif, Lange, Schwabe, and Yang, 2011
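The two curve equations above can be checked numerically. The following is a minimal sketch; the helper names (`is_square`, `on_montgomery`, `on_edwards`) are hypothetical and not from the talk, and the Montgomery check only tests whether some y exists for a given x.

```python
# Sketch only: helper names are illustrative, not part of Sandy2x.
P = 2**255 - 19                              # field prime
A = 486662                                   # Montgomery coefficient
D = (-121665 * pow(121666, -1, P)) % P       # Edwards constant; pow(..., -1, P) needs Python 3.8+

def is_square(v):
    """Euler's criterion: v is a square in F_p iff v^((p-1)/2) is 0 or 1."""
    v %= P
    return v == 0 or pow(v, (P - 1) // 2, P) == 1

def on_montgomery(x):
    """True if some y satisfies y^2 = x^3 + A*x^2 + x over F_p."""
    return is_square(x**3 + A * x**2 + x)

def on_edwards(x, y):
    """True if -x^2 + y^2 = 1 + D*x^2*y^2 over F_p, with D = -121665/121666."""
    return (-x * x + y * y - 1 - D * x * x * y * y) % P == 0
```

For example, `on_edwards(0, 1)` holds for the neutral element, and `on_montgomery(9)` holds for the x-coordinate 9 used as the X25519 base point.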


The ECC implementation pyramid

Scalar multiplication

ECC add/double

Finite-field arithmetic

Big-integer or polynomial arithmetic

(slide credit: Peter Schwabe)


The big multiplier

• used in all papers about ECC speeds on Intel microarchitectures

• 64 × 64 → 128-bit multiplication in one instruction (mul)

• (This talk focuses on Sandy Bridge/Ivy Bridge)


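The radix-$2^{51}$ representation for $\mathbb{F}_{2^{255}-19}$

$f = f_0 + f_1 2^{51} + f_2 2^{102} + f_3 2^{153} + f_4 2^{204}$

$g = g_0 + g_1 2^{51} + g_2 2^{102} + g_3 2^{153} + g_4 2^{204}$

$$
\begin{aligned}
h_0 &= f_0g_0 + 19f_1g_4 + 19f_2g_3 + 19f_3g_2 + 19f_4g_1 \\
h_1 &= f_0g_1 + f_1g_0 + 19f_2g_4 + 19f_3g_3 + 19f_4g_2 \\
h_2 &= f_0g_2 + f_1g_1 + f_2g_0 + 19f_3g_4 + 19f_4g_3 \\
h_3 &= f_0g_3 + f_1g_2 + f_2g_1 + f_3g_0 + 19f_4g_4 \\
h_4 &= f_0g_4 + f_1g_3 + f_2g_2 + f_3g_1 + f_4g_0
\end{aligned}
$$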

• 25 multiplication instructions + overhead.

• some carries required.
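The formulas for h0 to h4 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Sandy2x code: the function names are hypothetical, Python's unbounded integers stand in for 128-bit products, and the carry step is left out.

```python
# Sketch only: names are illustrative; carries are deliberately omitted.
P = 2**255 - 19
MASK51 = (1 << 51) - 1

def to_limbs51(x):
    """Split an integer below 2^255 into five 51-bit limbs, least significant first."""
    return [(x >> (51 * i)) & MASK51 for i in range(5)]

def from_limbs51(limbs):
    """Recombine (possibly uncarried) limbs and reduce modulo p."""
    return sum(l << (51 * i) for i, l in enumerate(limbs)) % P

def mul51(f, g):
    """Schoolbook product of two 5-limb values; returns uncarried h0..h4."""
    h = [0] * 5
    for i in range(5):
        for j in range(5):
            term = f[i] * g[j]      # corresponds to one 64 x 64 -> 128-bit mul
            if i + j >= 5:
                term *= 19          # 2^255 = 19 (mod p) wraps high terms around
            h[(i + j) % 5] += term
    return h
```

The double loop performs exactly 25 limb products, matching the count on the slide; `from_limbs51(mul51(to_limbs51(a), to_limbs51(b)))` should equal `a * b % P` for inputs below 2^255.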


A small multiplier

• a 2-way vectorized multiplier

• 32 × 32 → 64-bit multiplications in one instruction (vpmuludq)

• usage: $(a_0b_0, a_1b_1) = (a_0, a_1) \times (b_0, b_1)$
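As a rough model of the usage line above, the 128-bit form of vpmuludq can be emulated in Python. This sketch assumes each operand is a pair of 64-bit lanes of which only the low 32 bits take part in the product; the function name simply mirrors the instruction.

```python
# Sketch only: a software model of the 2-way multiply, not real SIMD code.
MASK32 = 0xFFFFFFFF

def vpmuludq(a, b):
    """Multiply the low 32 bits of each 64-bit lane of a and b.

    a and b are 2-tuples of lane values; the result is a 2-tuple of
    64-bit products (a 32 x 32-bit product always fits in 64 bits).
    """
    return tuple((x & MASK32) * (y & MASK32) for x, y in zip(a, b))
```

For instance, `vpmuludq((3, 0xFFFFFFFF), (5, 0xFFFFFFFF))` gives `(15, 0xFFFFFFFE00000001)`, and any bits above bit 31 of an input lane are ignored.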


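The radix-$2^{25.5}$ representation for $\mathbb{F}_{2^{255}-19}$

$f = f_0 + f_1 2^{26} + f_2 2^{51} + f_3 2^{77} + f_4 2^{102} + f_5 2^{128} + f_6 2^{153} + f_7 2^{179} + f_8 2^{204} + f_9 2^{230}$

$g = g_0 + g_1 2^{26} + g_2 2^{51} + g_3 2^{77} + g_4 2^{102} + g_5 2^{128} + g_6 2^{153} + g_7 2^{179} + g_8 2^{204} + g_9 2^{230}$

$$
\begin{aligned}
h_0 &= f_0g_0 + 38f_1g_9 + 19f_2g_8 + 38f_3g_7 + 19f_4g_6 + 38f_5g_5 + 19f_6g_4 + 38f_7g_3 + 19f_8g_2 + 38f_9g_1 \\
h_1 &= f_0g_1 + f_1g_0 + 19f_2g_9 + 19f_3g_8 + 19f_4g_7 + 19f_5g_6 + 19f_6g_5 + 19f_7g_4 + 19f_8g_3 + 19f_9g_2 \\
h_2 &= f_0g_2 + 2f_1g_1 + f_2g_0 + 38f_3g_9 + 19f_4g_8 + 38f_5g_7 + 19f_6g_6 + 38f_7g_5 + 19f_8g_4 + 38f_9g_3 \\
h_3 &= f_0g_3 + f_1g_2 + f_2g_1 + f_3g_0 + 19f_4g_9 + 19f_5g_8 + 19f_6g_7 + 19f_7g_6 + 19f_8g_5 + 19f_9g_4 \\
h_4 &= f_0g_4 + 2f_1g_3 + f_2g_2 + 2f_3g_1 + f_4g_0 + 38f_5g_9 + 19f_6g_8 + 38f_7g_7 + 19f_8g_6 + 38f_9g_5 \\
h_5 &= f_0g_5 + f_1g_4 + f_2g_3 + f_3g_2 + f_4g_1 + f_5g_0 + 19f_6g_9 + 19f_7g_8 + 19f_8g_7 + 19f_9g_6 \\
h_6 &= f_0g_6 + 2f_1g_5 + f_2g_4 + 2f_3g_3 + f_4g_2 + 2f_5g_1 + f_6g_0 + 38f_7g_9 + 19f_8g_8 + 38f_9g_7 \\
h_7 &= f_0g_7 + f_1g_6 + f_2g_5 + f_3g_4 + f_4g_3 + f_5g_2 + f_6g_1 + f_7g_0 + 19f_8g_9 + 19f_9g_8 \\
h_8 &= f_0g_8 + 2f_1g_7 + f_2g_6 + 2f_3g_5 + f_4g_4 + 2f_5g_3 + f_6g_2 + 2f_7g_1 + f_8g_0 + 38f_9g_9 \\
h_9 &= f_0g_9 + f_1g_8 + f_2g_7 + f_3g_6 + f_4g_5 + f_5g_4 + f_6g_3 + f_7g_2 + f_8g_1 + f_9g_0
\end{aligned}
$$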

• 100 multiplication instructions + overhead for two field multiplications computed in parallel, i.e. 50 per multiplication.

• some carries required.

Sandy2x sets new speed records by using the vectorized multiplier.
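The coefficients 2, 19 and 38 in h0 to h9 follow from the limb offsets. The sketch below is an illustration with hypothetical names rather than the Sandy2x implementation: it computes a single multiplication with plain Python integers, without vectorization and without carries.

```python
# Sketch only: one scalar multiplication; Sandy2x runs two in vector lanes.
P = 2**255 - 19
# Bit offset of limb i is ceil(25.5 * i): 0, 26, 51, 77, ..., 230
OFFSETS = [(51 * i + 1) // 2 for i in range(10)]

def to_limbs255(x):
    """Split an integer below 2^255 into ten limbs of alternating 26 and 25 bits."""
    limbs = []
    for i in range(10):
        width = (OFFSETS[i + 1] if i < 9 else 255) - OFFSETS[i]
        limbs.append((x >> OFFSETS[i]) & ((1 << width) - 1))
    return limbs

def from_limbs255(limbs):
    """Recombine (possibly uncarried) limbs and reduce modulo p."""
    return sum(l << OFFSETS[i] for i, l in enumerate(limbs)) % P

def mul255(f, g):
    """Schoolbook product of two 10-limb values; returns uncarried h0..h9."""
    h = [0] * 10
    for i in range(10):
        for j in range(10):
            c = 1
            if i % 2 == 1 and j % 2 == 1:
                c *= 2      # two odd offsets overshoot the target offset by one bit
            if i + j >= 10:
                c *= 19     # 2^255 = 19 (mod p) wraps high terms around
            h[(i + j) % 10] += c * f[i] * g[j]
    return h
```

Each field multiplication uses 100 limb products here; packing two independent multiplications into the two lanes of the vector multiplier is what brings the instruction count to 50 per multiplication.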


Performance results

Operation                          SB cycles   IB cycles   reference
X25519 public-key generation          54 346      52 169   Sandy2x
                                      61 828      57 612   [A. Moon]
                                     194 165     182 876   [Ed25519]
X25519 shared secret computation     156 995     159 128   Sandy2x
                                     194 036     182 708   [Ed25519]
Ed25519 public-key generation         57 164      54 901   Sandy2x
                                      63 712      59 332   [A. Moon]
                                      64 015      61 099   [Ed25519]
Ed25519 sign                          63 526      59 949   Sandy2x
                                      67 692      62 624   [A. Moon]
                                      72 444      67 284   [Ed25519]
Ed25519 verification                 205 741     198 406   Sandy2x
                                     227 628     204 376   [A. Moon]
                                     222 564     209 060   [Ed25519]

• Andrew Moon “floodyberry”, https://github.com/floodyberry/ed25519-donna


Why is vectorization better?

Ports

• on each SB/IB core there are 6 ports: Ports 0, 1 and 5 are for arithmetic.

• each instruction is decomposed into micro-operations (µops)

  • mul: 2 µops, handled by Port 0 and 1.
  • vpmuludq: 1 µop, handled by Port 0.
  • vpaddq: 1 µop, handled by either Port 1 or Port 5.

• port utilization gives a lower bound on the cycle count


Why is vectorization better?

Using the vectorized multiplier

• 109 vpmuludq + 95 vpaddq

• lower bound: 109 cycles (dominated by Port 0)

• actual cycle count: 112 cycles (56 cycles per multiplication)

Using the serial multiplier

• 25 mul + 4 imul + 20 add + 20 adc

• lower bound: (25 · 2 + 4 + 20 + 20 · 2)/3 = 38

• actual cycle count is much larger: 52 cycles

• perf-stat shows that the core fails to distribute the µops equally over the ports
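The two lower bounds above are simple port-pressure arithmetic. The sketch below reproduces them under the µop counts stated on the slides (mul and adc counted as 2 µops, imul and add as 1, three arithmetic ports); the function names are hypothetical.

```python
# Sketch only: reproduces the slide's port-utilization estimates.

def serial_bound(mul=25, imul=4, add=20, adc=20, ports=3):
    """Best case: all µops spread evenly over the three arithmetic ports."""
    uops = 2 * mul + imul + add + 2 * adc
    return uops / ports

def vector_bound(vpmuludq=109, vpaddq=95):
    """vpmuludq can only use Port 0; vpaddq is shared between Ports 1 and 5."""
    return max(vpmuludq, vpaddq / 2)

print(serial_bound())        # 38.0 cycles for one multiplication
print(vector_bound() / 2)    # 54.5 cycles per multiplication (two per pass)
```

The serial bound of 38 assumes a perfect spread over the ports, which the measured 52 cycles does not reach, while the vectorized code runs at 112 cycles against a bound of 109.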


Why is vectorization better?

More reasons

• carries take more cycles when using the serial multiplier

              M−     M
serial        52     68
vectorized    56     69.5

(cycles; M− appears to denote a multiplication without the carries, M one with carries)

• batched squarings are faster

• instruction interleaving hides cost for addition/subtraction

• constant-time table lookups are faster with vector instructions
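The idea behind a constant-time table lookup can be sketched as a masked scan over all entries. This is a generic illustration, not the Sandy2x routine: the function name is hypothetical, entries and indices are assumed to be non-negative and below 2^64 and 2^32 respectively, and Python itself gives no timing guarantees.

```python
# Sketch only: shows the masking logic; real code would use vector registers.
MASK64 = (1 << 64) - 1

def ct_lookup(table, index):
    """Return table[index] while touching every entry of the table."""
    result = 0
    for i, entry in enumerate(table):
        diff = i ^ index
        # (diff - 1) >> 32 is -1 when diff == 0 and 0 for 0 < diff < 2^32,
        # so mask is all ones for the wanted entry and zero otherwise.
        mask = ((diff - 1) >> 32) & MASK64
        result |= entry & mask
    return result
```

Because every entry is read and combined with a mask, the memory access pattern does not depend on the secret index; with vector instructions several words of an entry can be masked per instruction.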


Ending Remarks

The main messages of this talk:

• Vectorization should be considered on recent Intel microarchitectures.

Code (for X25519 shared secret computation)

• https://sites.google.com/a/crypto.tw/blueprint/
