Fast constant-time gcd computation and modular inversion

Daniel J. Bernstein 1,2 and Bo-Yin Yang 3

1 Department of Computer Science, University of Illinois at Chicago, Chicago, IL 60607–7045, USA
2 Horst Görtz Institute for IT Security, Ruhr University Bochum, Germany, djb@cr.yp.to
3 Institute of Information Science and Research Center of Information Technology and Innovation, Academia Sinica, 128 Section 2 Academia Road, Taipei 115-29, Taiwan, by@crypto.tw

Abstract. This paper introduces streamlined constant-time variants of Euclid's algorithm, both for polynomial inputs and for integer inputs. As concrete applications, this paper saves time in (1) modular inversion for Curve25519, which was previously believed to be handled much more efficiently by Fermat's method, and (2) key generation for the ntruhrss701 and sntrup4591761 lattice-based cryptosystems.

Keywords: Euclid's algorithm · greatest common divisor · gcd · modular reciprocal · modular inversion · constant-time computations · branchless algorithms · algorithm design · NTRU · Curve25519

1 Introduction

There is a vast literature on variants and applications of Euclid's gcd algorithm. A textbook extension of the algorithm computes modular reciprocals. Stopping the algorithm halfway produces a "half-gcd", writing one input modulo another as a ratio of two half-size outputs, an example of two-dimensional lattice-basis reduction. Well-known speedups include Lehmer's use of lower precision [44]; "binary" algorithms such as Stein's gcd algorithm [63] and Kaliski's "almost inverse" algorithm [40]; and the subquadratic algorithms of Knuth [41] and Schönhage [60], which combine Lehmer's idea with subquadratic integer multiplication. All of these algorithms have been adapted from the case of integer inputs to the (simpler) case of polynomial inputs, starting with Stevin's 16th-century algorithm [64, page 241 of original, page 123 of cited PDF] to compute the gcd of two polynomials. The Berlekamp–Massey algorithm (see [9] and [50]), one of the most important tools in decoding error-correcting codes, was later recognized as being equivalent to a special case of a Euclid–Stevin half-gcd computation, with the input and output coefficients in reverse order, and with one input being a power of x; see [32], [38], and [51]. There are many more examples.

Author list in alphabetical order; see https://www.ams.org/profession/leaders/culture/CultureStatement04.pdf. This work was supported by the U.S. National Science Foundation under grant 1314919, by the Cisco University Research Program, by the Netherlands Organisation for Scientific Research (NWO) under grant 639.073.005, and by DFG Cluster of Excellence 2092 "CASA: Cyber Security in the Age of Large-Scale Adversaries". This work also was supported by Taiwan Ministry of Science and Technology (MoST) grant MOST105-2221-E-001-014-MY3 and Academia Sinica Investigator Award AS-IA-104-M01. "Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation" (or other funding agencies). Permanent ID of this document: c130922fff0455e43cc7c5ca180787781b409f63. Date: 2019.04.13. Future updates including software at https://gcd.cr.yp.to.




However, in cryptography these algorithms are dangerous. The central problem is that these algorithms have conditional branches that depend on the inputs. Often these inputs are secret, and the branches leak information to the attacker through cache timing, branch timing, etc. For example, Aldaya–García–Tapia–Brumley [7] recently presented a single-trace cache-timing attack successfully extracting RSA keys from OpenSSL's implementation of Stein's binary gcd algorithm.

A common defense against these attacks¹ is to compute modular reciprocals in a completely different way, namely via Fermat's little theorem: if f is a nonzero element of a finite field F_q then 1/f = f^(q−2). For example, Bernstein's Curve25519 paper [10] computes a ratio modulo p = 2^255 − 19 using "a straightforward sequence of 254 squarings and 11 multiplications" and says that this is "about 7% of the Curve25519 computation". Somewhat more multiplications are required for "random" primes, as in the new CSIDH [28] post-quantum system, but the bottom line is that a b-bit Fermat-style inversion is not much more expensive than b squarings, while a b-bit elliptic-curve variable-base-point scalar multiplication is an order of magnitude more expensive.
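As a concrete illustration of the Fermat approach (our own sketch, not code from [10]; an optimized implementation replaces the generic exponentiation by the fixed sequence of 254 squarings and 11 multiplications mentioned above), inversion modulo the Curve25519 prime is a single modular exponentiation:

p = 2**255 - 19                      # the Curve25519 prime

def fermat_inverse(f):
    # 1/f = f^(p-2) mod p by Fermat's little theorem (f nonzero mod p);
    # pow() here is a variable-time stand-in for the fixed 254S + 11M ladder
    return pow(f, p - 2, p)

assert (5 * fermat_inverse(5)) % p == 1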

However, the bigger picture is that simply saying "use Fermat for inversion, and stop worrying about the speed of Euclid's algorithm" is not satisfactory:

• There are many cryptographic computations where inversion is a much larger fraction of the total time. For example, typical strategies for ECC signature generation (see, e.g., [16, Section 4]) involve only about 1/4 as many elliptic-curve operations, making the cost of inversion² much more noticeable. As another example, polynomial inversion is the primary bottleneck in key generation³ in the NTRU lattice-based cryptosystem.

• There are many cryptographic computations that rely on Euclid's algorithm for operations that Fermat's method cannot handle. For example, decryption in the McEliece code-based cryptosystem (see, e.g., [13]) relies critically on a half-gcd computation. This cryptosystem has large public keys but has attracted interest⁴ for its very strong security track record. Many newer proposals with smaller public keys also rely on half-gcd computations⁵.

These issues have prompted some work on variants of Euclid's algorithm protected against timing attacks. For example, [13] uses a constant-time version of the Berlekamp–Massey algorithm inside the McEliece cryptosystem; see also [36, Algorithm 10] and [30]. As another example, Bos [21] developed a constant-time version of Kaliski's algorithm and reported in [21, Table 1] that this took 486000 ARM Cortex-A8 cycles for inversion modulo a 254-bit prime. For comparison, Bernstein and Schwabe [17] had reported 527102 ARM Cortex-A8 cycles for an entire Curve25519 scalar multiplication, including a Fermat inversion that is an order of magnitude faster than the Euclid inversion in [21].

Euclid's algorithm becomes more competitive for "random" moduli, since these moduli make the modular reductions in Fermat's method more expensive⁶.

¹There is also a tremendous amount of work aimed at defending against power attacks, electromagnetic attacks, and more broadly attackers with physical sensors close to the computation being carried out. For example, one typically tries to protect modular inversion by multiplying by r before and after each inversion, where r is a random invertible quantity. This paper focuses on protecting deterministic algorithms against timing attacks.

²This refers to inversion in F_p, when the elliptic curve is defined over F_p, to convert an elliptic-curve point from projective coordinates to affine coordinates. Applications that use ECDSA, rather than Schnorr-style signature systems such as Ed25519, also need inversions modulo the group order.

³The standard argument that key-generation performance is important is that, for forward secrecy, TLS generates a public key and uses the key just once. For several counterarguments see [14, Appendix T.2].

⁴For example, two of the round-2 submissions to the NIST post-quantum competition, Classic McEliece [12] and NTS-KEM [6], are based directly on the McEliece cryptosystem.

⁵See, e.g., the HQC [4] and LAC [45] round-2 submissions to the NIST competition.
⁶Standard estimates are that squaring modulo a random prime is 2 or 3 times slower than squaring modulo a special prime. Combining these estimates with the ARM Cortex-A8 speeds from [17] suggests that Fermat inversion for a random 256-bit prime should take roughly 100000 Cortex-A8 cycles, still much faster than the Euclid inversion from [21]. It is claimed in [21, Table 2] that Fermat inversion takes 584000 Cortex-A8 cycles; presumably this Fermat software was not properly optimized.


It is also easy to see that Euclid's algorithm beats Fermat's method for all large enough primes: for n-bit primes, Euclid's algorithm uses Θ(n) simple linear-time operations such as big-integer subtractions, while Fermat's method uses Θ(n) big-integer multiplications. The Euclid-vs-Fermat cutoff depends on the cost ratio between big-integer multiplications and big-integer subtractions, so one expects the cutoff to be larger on CPUs with large hardware multipliers, but many input sizes of interest in cryptography are clearly beyond the cutoff. Both [39, Algorithm 10] and [14, Appendix T.2] use constant-time variants of the Euclid–Stevin algorithm for polynomial inversions in NTRU key generation; each polynomial here has nearly 10000 bits in about 700 coefficients.

1.1 Contributions of this paper. This paper introduces simple fast constant-time "division steps". When division steps are iterated a constant number of times, they reveal the gcd of the inputs, the modular reciprocal when the inputs are coprime, etc. Division steps work the same way for integer inputs and for polynomial inputs and are suitable for all applications of Euclid's algorithm that we have investigated. As an illustration, we display in Figure 1.2 an integer gcd algorithm using division steps.

This paper uses two case studies to show that these division steps are much faster than previous constant-time variants of Euclid's algorithm. One case study (Section 1.2) is an integer-modular-inversion problem selected to be extremely favorable to Fermat's method:

• The platforms are the popular Intel Haswell, Skylake, and Kaby Lake CPU cores. Each of these CPU cores has a large hardware area devoted to fast multiplications.

• The modulus is the Curve25519 prime 2^255 − 19. Sparse primes allow very fast reductions, and the performance of multiplication modulo this particular prime has been studied in detail on many platforms.

The latest speed records for inversion modulo 2^255 − 19 on the platforms mentioned above are from the recent Nath–Sarkar paper "Efficient inversion in (pseudo-)Mersenne prime order fields" [54], which takes 11854 cycles, 9301 cycles, and 8971 cycles on Haswell, Skylake, and Kaby Lake respectively. We achieve slightly better inversion speeds on each of these platforms: 10050 cycles, 8778 cycles, and 8543 cycles. We emphasize the Fermat-friendly nature of this case study. It is safe to predict that our algorithm will produce much larger speedups compared to Fermat's method on CPUs with smaller multipliers, on FPGAs, and on ASICs⁷. For example, on an ARM Cortex-A7 (only a 32-bit multiplier), our inversion modulo 2^255 − 19 takes 35277 cycles, while the best prior result (Fujii and Aranha [34], using Fermat) was 62648 cycles.

The other case study (Section 7) is key generation in the ntruhrss701 cryptosystem from the Hülsing–Rijneveld–Schanck–Schwabe paper "High-speed key encapsulation from NTRU" [39] at CHES 2017. We again take an Intel core for comparability, specifically Haswell, since [39] selected Haswell. That paper used 150000 Haswell cycles to invert a polynomial modulo (x^701 − 1)/(x − 1) with coefficients modulo 3. We reduce the cost of this inversion to just 90000 Haswell cycles. We also speed up key generation in the sntrup4591761 [14] cryptosystem.

1.3 Comparison to previous constant-time algorithms. Earlier constant-time variants of Euclid's algorithm and the Euclid–Stevin algorithm are like our algorithm in that they consist of a prespecified number of constant-time steps. Each step of the previous algorithms might seem rather simple at first glance. However, as illustrated by our new speed records, the details matter. Our division steps are particularly simple, allowing very efficient computations.


⁷We also expect spinoffs in quantum computation. See, e.g., the recent constant-time version of Kaliski's algorithm by Roetteler–Naehrig–Svore–Lauter [57, Section 3.4] in the context of Shor's algorithm to compute elliptic-curve discrete logarithms.


def shortgcd2(f,g):
  delta,f,g = 1,ZZ(f),ZZ(g)
  assert f & 1
  m = 4+3*max(f.nbits(),g.nbits())
  for loop in range(m):
    if delta > 0 and g & 1: delta,f,g = -delta,g,-f
    delta,g = 1+delta,(g+(g&1)*f)//2
  return abs(f)

Figure 1.2: Example of a gcd algorithm for integers using this paper's division steps. The f input is assumed to be odd. Each iteration of the main loop replaces (δ, f, g) with divstep(δ, f, g). Correctness follows from Theorem 11.2. The algorithm is expressed in the Sage [58] computer-algebra system. The algorithm is not constant-time as shown but can be made constant-time with low overhead; see Section 5.2.
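A quick usage sketch (our own example values, run in Sage, where ZZ and gcd are available):

assert shortgcd2(135, 60) == gcd(135, 60) == 15   # the odd input comes first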


Each step of the constant-time algorithms of Bos [21] and Roetteler–Naehrig–Svore–Lauter [57, Section 3.4], like each step in the algorithms of Stein and Kaliski, consists of a limited number of additions and subtractions followed by a division by 2. The decisions of which operations to perform are based on (1) the bottom bits of the integers and (2) which integer is larger, a comparison that sometimes needs to inspect many bits of the integers. A constant-time comparison inspects all bits.

The time for this comparison might not seem problematic in context: subtraction also inspects all bits and produces the result of the comparison as a side effect. But this requires a very wide data flow in the algorithm. One cannot decide what needs to be done in a step without inspecting all bits produced by the previous step. For our algorithm, the bottom t bits of the inputs (or the bottom t coefficients in the polynomial case) decide the linear combinations used in the next t division steps, so the data flow is much narrower.

For the polynomial case, size comparison is simply a matter of comparing polynomial degrees, but there are still many other details that affect performance, as illustrated by our 1.7× speedup for the ntruhrss701 polynomial-inversion problem mentioned above. See Section 7 for a detailed comparison of our algorithm in this case to the algorithm in [39]. Important issues include the number of control variables manipulated in the inner loop, the complexity of the arithmetic operations being carried out, and the way that the output is scaled.

The closest previous algorithms that we have found in the literature are algorithms for "systolic" hardware developed in the 1980s by Bojanczyk, Brent, and Kung. See [24] for integer gcd, [19] for integer inversion, and [23] for the polynomial case. These algorithms have predictable output scaling, a single control variable δ (optionally represented in unary) beyond the loop counter, and the same basic feature that several bottom bits decide linear combinations used for several steps. Our algorithms are nevertheless simpler and faster, for a combination of reasons. First, our division steps have fewer case distinctions than the steps in [24] (and [19]). Second, our data flow can be decomposed into a "conditional swap" followed by an "elimination" (see Section 3), whereas the data flow in [23] requires a more complicated routing of inputs. Third, surprisingly, we use fewer iterations than [24]. Fourth, for us t bits determine linear combinations for exactly t steps, whereas this is not exactly true in [24]. Fifth, we gain considerable speed in, e.g., Curve25519 inversion by jumping through division steps.

1.4 Comparison to previous subquadratic algorithms. An easy consequence of our data flow is a constant-time algorithm where the number of operations is subquadratic in the input size, asymptotically n(log n)^{2+o(1)}. This algorithm has important advantages over earlier subquadratic gcd algorithms, as we now explain.

The starting point for subquadratic gcd algorithms is the following idea from Lehmer [44]. The quotients at the beginning of gcd computation depend only on the most significant bits of the inputs. After computing the corresponding 2×2 transition matrix, one can retroactively multiply this matrix by the remaining bits of the inputs and then continue with the new, hopefully smaller, inputs.

A closer look shows that it is not easy to formulate a precise statement about the amount of progress that one is guaranteed to make by inspecting j most significant bits of the inputs. We return to this difficulty below. Assume for the moment that j bits of the inputs determine roughly j bits of initial information about the quotients in Euclid's algorithm.

Lehmer's paper predated subquadratic multiplication but is well known to be helpful even with quadratic multiplication. For example, consider the following procedure starting with 40-bit integers a and b: add a to b, then add the new b to a, and so on back and forth for a total of 30 additions. The results fit into words on a typical 64-bit CPU, and this procedure might seem to be efficiently using the CPU's arithmetic instructions. However, it is generally faster to directly compute 832040a + 514229b and 1346269a + 832040b using the CPU's fast multiplication circuits.
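A small self-check of this claim in plain Python (our own illustrative values): the 30 alternating additions always produce the two fixed linear combinations named above, so two multiply-accumulate operations give the same result.

def alternating_additions(a, b, count=30):
    for i in range(count):
        if i % 2 == 0: b += a      # add a to b
        else:          a += b      # add the new b to a
    return a, b

a, b = 987654321, 123456789
assert alternating_additions(a, b) == (1346269*a + 832040*b, 832040*a + 514229*b)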

Faster algorithms for multiplication make Lehmer's approach more effective. Knuth [41], using Lehmer's idea recursively on top of FFT-based integer multiplication, achieved cost n(log n)^{5+o(1)} for integer gcd. Schönhage [60] improved 5 to 2. Subsequent papers such as [53], [22], and [66] adapted the ideas to the polynomial case.

The subquadratic algorithms appearing in [41], [60], [22, Algorithm EMGCD], [66, Algorithm SCH], [62, Algorithm Half-GB-gcd], [52, Algorithms HGCD-Q and HGCD-B and SGCD and HGCD-D], [25, Algorithm 3.1], etc. all have the following structure. There is an initial recursive call that computes a 2×2 transition matrix; there is a computation of reduced inputs using a fast multiplication subroutine; there is a call to a fast division subroutine, producing a possibly large quotient and remainder; and then there is more recursion.

We emphasize the role of the division subroutine in each of these algorithms. If the recursive computation is expressed as a tree, then each node in the tree calls the left subtree recursively, multiplies by a transition matrix, calls a division subroutine, and calls the right subtree recursively. The results of the left subtree are described by some number of quotients in Euclid's algorithm, so variations in the sizes of quotients produce irregularities in the amount of information computed by the recursions. These irregularities can be quite severe: multiplying by the transition matrix does not make any progress when the inputs have sufficiently different sizes. The call to a division subroutine is a bandage over these irregularities, guaranteeing significant progress in all cases.

Our subquadratic algorithm has a simpler structure. We guarantee that a recursive call to a size-j subtree always jumps through exactly j division steps. Each node in the computation tree involves multiplications and recursive calls, but the large variable-time division subroutine has disappeared, and the underlying irregularities have also disappeared. One can think of each Euclid-style division as being decomposed into a series of our simpler constant-time division steps, but our recursion does not care where each Euclid-style division begins and ends. See Section 1.2 for a case study of the speeds that we achieve.

A prerequisite for this regular structure is that the linear combinations in j of our division steps are determined by the bottom j bits of the integer inputs⁸, without regard to top bits. This prerequisite was almost achieved by the Brent–Kung "systolic" gcd algorithm [24] mentioned above, but Brent and Kung did not observe that the data flow allows a subquadratic algorithm.

⁸For polynomials one can work from the bottom or from the top, since there are no carries. We choose to work from bottom coefficients in the polynomial case to emphasize the similarities between the polynomial case and the integer case.


The Stehlé–Zimmermann subquadratic "binary recursive gcd" algorithm [62] also emphasizes that it works with bottom bits, but it has the same irregularities as earlier subquadratic algorithms, and it again needs a large variable-time division subroutine.

Often the literature considers the "normal" case of polynomial gcd computation. This is by definition the case that each Euclid–Stevin quotient has degree 1. The conventional recursion then returns a predictable amount of information, allowing a more regular algorithm such as [53, Algorithm PGCD]⁹. Our algorithm achieves the same regularity for the general case.

Another previous algorithm that achieves the same regularity for a different special case of polynomial gcd computation is Thomé's subquadratic variant [68, Program 4.1] of the Berlekamp–Massey algorithm. Our algorithm can be viewed as extending this regularity to arbitrary polynomial gcd computations, and beyond this to integer gcd computations.

2 Organization of the paper

We start with the polynomial case. Section 3 defines division steps. Section 5, relying on theorems in Section 4, states our main algorithm to compute c coefficients of the nth iterate of divstep. This takes n(c + n) simple operations. We also explain how "jumps" reduce the cost for large n to (c + n)(log cn)^{2+o(1)} operations. All of these algorithms take constant time, i.e., time independent of the input coefficients, for any particular (n, c).

There is a long tradition in number theory of defining Euclid's algorithm, continued fractions, etc. for inputs in a "complete" field, such as the field of real numbers or the field of power series. We define division steps for power series and state our main algorithm for power series. Division steps produce polynomial outputs from polynomial inputs, so the reader can restrict attention to polynomials if desired, but there are advantages of following the tradition, as we now explain.

Requiring algorithms to work for general power series means that the algorithms can only inspect a limited number of leading coefficients of each input, without regard to the degree of inputs that happen to be polynomials. This restriction simplifies algorithm design and helped us design our main algorithm.

Going beyond polynomials also makes it easy to describe further algorithmic options, such as iterating divstep on inputs (1, g/f) instead of (f, g). Restricting to polynomial inputs would require g/f to be replaced with a nearby polynomial; the limited precision of the inputs would affect the table of divstep iterates, rather than merely being an option in the algorithms.

Section 6, relying on theorems in Appendix A, applies our main algorithm to gcd computation and modular inversion. Here the inputs are polynomials. A constant fraction of the power-series coefficients used in our main algorithm are guaranteed to be 0 in this case, and one can save some time by skipping computations involving those coefficients. However, it is still helpful to factor the algorithm-design process into (1) designing the fast power-series algorithm and then (2) eliminating unused values for special cases.

Section 7 is a case study of the software speed of modular inversion of polynomials as part of key generation in the NTRU cryptosystem.

The remaining sections of the paper consider the integer case. Section 8 defines division steps for integers. Section 10, relying on theorems in Section 9, states our main algorithm to compute c bits of the nth iterate of divstep. Section 11, relying on theorems in Appendix G, applies our main algorithm to gcd computation and modular inversion. Section 12 is a case study of the software speed of inversion of integers modulo primes used in elliptic-curve cryptography.

⁹The analysis in [53, Section VI] considers only normal inputs (and power-of-2 sizes). The algorithm works only for normal inputs. The performance of division in the non-normal case was discussed in [53, Section VIII], but incorrectly described as solely an issue for the base case of the recursion.



2.1 General notation. We write M_2(R) for the ring of 2×2 matrices over a ring R. For example, M_2(ℝ) is the ring of 2×2 matrices over the field ℝ of real numbers.

2.2 Notation for the polynomial case. Fix a field k. The formal-power-series ring k[[x]] is the set of infinite sequences (f_0, f_1, ...) with each f_i ∈ k. The ring operations are:

• 0 = (0, 0, 0, ...).
• 1 = (1, 0, 0, ...).
• +: (f_0, f_1, ...), (g_0, g_1, ...) ↦ (f_0 + g_0, f_1 + g_1, ...).
• −: (f_0, f_1, ...), (g_0, g_1, ...) ↦ (f_0 − g_0, f_1 − g_1, ...).
• ·: (f_0, f_1, f_2, ...), (g_0, g_1, g_2, ...) ↦ (f_0g_0, f_0g_1 + f_1g_0, f_0g_2 + f_1g_1 + f_2g_0, ...).

The element (0, 1, 0, ...) is written "x", and the entry f_i in f = (f_0, f_1, ...) is called the "coefficient of x^i in f". As in the Sage [58] computer-algebra system, we write f[i] for the coefficient of x^i in f. The "constant coefficient" f[0] has a more standard notation f(0), and we use the f(0) notation when we are not also inspecting other coefficients.
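For concreteness, here is a minimal, unoptimized sketch of the product rule above in plain Python (also valid in Sage), with power series given as lists of low-order coefficients; the helper name is ours, not the paper's.

def series_mul(f, g, n):
    # coefficient i of f*g is f[0]*g[i] + f[1]*g[i-1] + ... + f[i]*g[0]
    h = [0]*n
    for i, fi in enumerate(f[:n]):
        for j, gj in enumerate(g[:n-i]):
            h[i+j] += fi*gj
    return h

assert series_mul([1, 1], [1, 1], 3) == [1, 2, 1]   # (1+x)^2 = 1 + 2x + x^2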

The unit group of k[[x]] is written k[[x]]^*. This is the same as the set of f ∈ k[[x]] such that f(0) ≠ 0.

The polynomial ring k[x] is the subring of sequences (f_0, f_1, ...) that have only finitely many nonzero coefficients. We write deg f for the degree of f ∈ k[x]; this means the maximum i such that f[i] ≠ 0, or −∞ if f = 0. We also define f[−∞] = 0, so f[deg f] is always defined. Beware that the literature sometimes instead defines deg 0 = −1, and in particular Sage defines deg 0 = −1.

We define f mod x^n as f[0] + f[1]x + ··· + f[n−1]x^{n−1} for any f ∈ k[[x]]. We define f mod g, for any f, g ∈ k[x] with g ≠ 0, as the unique polynomial r such that deg r < deg g and f − r = gq for some q ∈ k[x].

The ring k[[x]] is a domain. The formal-Laurent-series field k((x)) is by definition the field of fractions of k[[x]]. Each element of k((x)) can be written (not uniquely) as f/x^i for some f ∈ k[[x]] and some i ∈ {0, 1, 2, ...}. The subring k[1/x] is the smallest subring of k((x)) containing both k and 1/x.

2.3 Notation for the integer case. The set of infinite sequences (f_0, f_1, ...) with each f_i ∈ {0, 1} forms a ring under the following operations:

• 0 = (0, 0, 0, ...).
• 1 = (1, 0, 0, ...).
• +: (f_0, f_1, ...), (g_0, g_1, ...) ↦ the unique (h_0, h_1, ...) such that for every n ≥ 0: h_0 + 2h_1 + 4h_2 + ··· + 2^{n−1}h_{n−1} ≡ (f_0 + 2f_1 + 4f_2 + ··· + 2^{n−1}f_{n−1}) + (g_0 + 2g_1 + 4g_2 + ··· + 2^{n−1}g_{n−1}) (mod 2^n).
• −: same with (f_0 + 2f_1 + 4f_2 + ··· + 2^{n−1}f_{n−1}) − (g_0 + 2g_1 + 4g_2 + ··· + 2^{n−1}g_{n−1}).
• ·: same with (f_0 + 2f_1 + 4f_2 + ··· + 2^{n−1}f_{n−1}) · (g_0 + 2g_1 + 4g_2 + ··· + 2^{n−1}g_{n−1}).

We follow the tradition among number theorists of using the notation Z_2 for this ring, and not for the finite field F_2 = Z/2. This ring Z_2 has characteristic 0, so Z can be viewed as a subset of Z_2.

One can think of (f_0, f_1, ...) ∈ Z_2 as an infinite series f_0 + 2f_1 + ···. The difference between Z_2 and the power-series ring F_2[[x]] is that Z_2 has carries: for example, adding (1, 0, ...) to (1, 0, ...) in Z_2 produces (0, 1, ...).
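The following plain-Python sketch (our own helper, not the paper's) makes the carry behavior concrete: on the first n bits, addition in Z_2 is just integer addition modulo 2^n.

def add_2adic(fbits, gbits):
    n = len(fbits)
    f = sum(fi << i for i, fi in enumerate(fbits))
    g = sum(gi << i for i, gi in enumerate(gbits))
    s = (f + g) % 2**n
    return [(s >> i) & 1 for i in range(n)]

assert add_2adic([1, 0, 0], [1, 0, 0]) == [0, 1, 0]   # a carry appears, unlike F_2[[x]] where 1 + 1 = 0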


We define f mod 2^n as f_0 + 2f_1 + ··· + 2^{n−1}f_{n−1} for any f = (f_0, f_1, ...) ∈ Z_2.

The unit group of Z_2 is written Z_2^*. This is the same as the set of odd f ∈ Z_2, i.e., the set of f ∈ Z_2 such that f mod 2 ≠ 0.

If f ∈ Z_2 and f ≠ 0 then ord_2 f means the largest integer e ≥ 0 such that f ∈ 2^e Z_2, equivalently the unique integer e ≥ 0 such that f ∈ 2^e Z_2^*.

The ring Z_2 is a domain. The field Q_2 is by definition the field of fractions of Z_2. Each element of Q_2 can be written (not uniquely) as f/2^i for some f ∈ Z_2 and some i ∈ {0, 1, 2, ...}. The subring Z[1/2] is the smallest subring of Q_2 containing both Z and 1/2.

3 Definition of x-adic division steps

Define divstep: Z × k[[x]]^* × k[[x]] → Z × k[[x]]^* × k[[x]] as follows:

divstep(δ, f, g) = (1 − δ, g, (g(0)f − f(0)g)/x)  if δ > 0 and g(0) ≠ 0,
divstep(δ, f, g) = (1 + δ, f, (f(0)g − g(0)f)/x)  otherwise.

Note that f(0)g − g(0)f is divisible by x in k[[x]]. The name "division step" is justified in Theorem C.1, which shows that dividing a degree-d_0 polynomial by a degree-d_1 polynomial with 0 ≤ d_1 < d_0 can be viewed as computing divstep^{2d_0−2d_1}.
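As a concrete illustration (our own sketch, not the paper's code), here is the definition in plain Python, with power series over F_q represented as lists of low-order coefficients; the output g has one fewer known coefficient, matching the division by x.

def divstep(q, delta, f, g):
    if delta > 0 and g[0] % q != 0:
        delta, f, g = -delta, g, f                                  # conditional swap
    g = [(f[0]*gj - g[0]*fj) % q for fj, gj in zip(f, g)][1:]       # eliminate, then divide by x
    return 1 + delta, f, g

# First step of the example in Table 4.1 (Section 4), k = F_7: swap, then eliminate.
d1, f1, g1 = divstep(7, 1, [2,0,1,1,2,1,1,1], [3,1,4,1,5,2,2,0])
assert (d1, f1, g1) == (0, [3,1,4,1,5,2,2,0], [5,2,1,3,6,6,3])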

3.1 Transition matrices. Write (δ_1, f_1, g_1) = divstep(δ, f, g). Then

(f_1, g_1)^T = T(δ, f, g) · (f, g)^T  and  (1, δ_1)^T = S(δ, f, g) · (1, δ)^T,

where T: Z × k[[x]]^* × k[[x]] → M_2(k[1/x]) is defined by

T(δ, f, g) = [[0, 1], [g(0)/x, −f(0)/x]]  if δ > 0 and g(0) ≠ 0,
T(δ, f, g) = [[1, 0], [−g(0)/x, f(0)/x]]  otherwise,

and S: Z × k[[x]]^* × k[[x]] → M_2(Z) is defined by

S(δ, f, g) = [[1, 0], [1, −1]]  if δ > 0 and g(0) ≠ 0,
S(δ, f, g) = [[1, 0], [1, 1]]  otherwise.

(Here [[a, b], [c, d]] denotes the 2×2 matrix with rows (a, b) and (c, d).) Note that both T(δ, f, g) and S(δ, f, g) are defined entirely by (δ, f(0), g(0)); they do not depend on the remaining coefficients of f and g.

3.2 Decomposition. This divstep function is a composition of two simpler functions (and the transition matrices can similarly be decomposed):

• a conditional swap replacing (δ, f, g) with (−δ, g, f) if δ > 0 and g(0) ≠ 0, followed by
• an elimination replacing (δ, f, g) with (1 + δ, f, (f(0)g − g(0)f)/x).


Figure 3.3 (diagram not reproduced): Data flow in an x-adic division step, decomposed as a conditional swap (A to B) followed by an elimination (B to C). The swap bit is set if δ > 0 and g[0] ≠ 0. The g outputs are f′[0]g′[1] − g′[0]f′[1], f′[0]g′[2] − g′[0]f′[2], f′[0]g′[3] − g′[0]f′[3], etc. Color scheme: consider, e.g., the linear combination cg − df of power series f, g, where c = f[0] and d = g[0]; inputs f, g affect the jth output coefficient cg[j] − df[j] via f[j], g[j] (black) and via the choice of c, d (red+blue). Green operations are special, creating the swap bit.

See Figure 3.3. This decomposition is our motivation for defining divstep in the first case, the swapped case, to compute (g(0)f − f(0)g)/x. Using (f(0)g − g(0)f)/x in both cases might seem simpler, but would require the conditional swap to forward a conditional negation to the elimination.

One can also compose these functions in the opposite order. This is compatible with some of what we say later about iterating the composition. However, the corresponding transition matrices would depend on another coefficient of g. This would complicate our algorithm statements.

Another possibility, as in [23], is to keep f and g in place, using the sign of δ to decide whether to add a multiple of f into g or to add a multiple of g into f. One way to do this in constant time is to perform two multiplications and two additions. Another way is to perform conditional swaps before and after one addition and one multiplication. Our approach is more efficient.

3.4 Scaling. One can define a variant of divstep that computes f − (f(0)/g(0))g in the first case (the swapped case) and g − (g(0)/f(0))f in the second case.

This variant has two advantages. First, it multiplies only one series, rather than two, by a scalar in k. Second, multiplying both f and g by any u ∈ k[[x]]^* has the same effect on the output, so this variant induces a function on fewer variables (δ, g/f): the ratio ρ = g/f is replaced by (1/ρ − 1/ρ(0))/x when δ > 0 and ρ(0) ≠ 0, and by (ρ − ρ(0))/x otherwise, while δ is replaced by 1 − δ or 1 + δ respectively. This is the composition of, first, a conditional inversion that replaces (δ, ρ) with (−δ, 1/ρ) if δ > 0 and ρ(0) ≠ 0; second, an elimination that replaces ρ with (ρ − ρ(0))/x while adding 1 to δ.

On the other hand, working projectively with f, g is generally more efficient than working with ρ. Working with f(0)g − g(0)f, rather than g − (g(0)/f(0))f, has the virtue of being a "fraction-free" computation: it involves only ring operations in k, not divisions. This makes the effect of scaling only slightly more difficult to track. Specifically, say divstep(δ, f, g) = (δ′, f′, g′).


Table 4.1: Iterates (δ_n, f_n, g_n) = divstep^n(δ, f, g) for k = F_7, δ = 1, f = 2 + 7x + 1x^2 + 8x^3 + 2x^4 + 8x^5 + 1x^6 + 8x^7, and g = 3 + 1x + 4x^2 + 1x^3 + 5x^4 + 9x^5 + 2x^6. Each power series is displayed as a row of coefficients of x^0, x^1, etc.

 n  δ_n  f_n (x^0 .. x^9)        g_n (x^0 .. x^9)
 0   1   2 0 1 1 2 1 1 1 0 0    3 1 4 1 5 2 2 0 0 0
 1   0   3 1 4 1 5 2 2 0 0 0    5 2 1 3 6 6 3 0 0 0
 2   1   3 1 4 1 5 2 2 0 0 0    1 4 4 0 1 6 0 0 0 0
 3   0   1 4 4 0 1 6 0 0 0 0    3 6 1 2 5 2 0 0 0 0
 4   1   1 4 4 0 1 6 0 0 0 0    1 3 2 2 5 0 0 0 0 0
 5   0   1 3 2 2 5 0 0 0 0 0    1 2 5 3 6 0 0 0 0 0
 6   1   1 3 2 2 5 0 0 0 0 0    6 3 1 1 0 0 0 0 0 0
 7   0   6 3 1 1 0 0 0 0 0 0    1 4 4 2 0 0 0 0 0 0
 8   1   6 3 1 1 0 0 0 0 0 0    0 2 4 0 0 0 0 0 0 0
 9   2   6 3 1 1 0 0 0 0 0 0    5 3 0 0 0 0 0 0 0 0
10  −1   5 3 0 0 0 0 0 0 0 0    4 5 5 0 0 0 0 0 0 0
11   0   5 3 0 0 0 0 0 0 0 0    6 4 0 0 0 0 0 0 0 0
12   1   5 3 0 0 0 0 0 0 0 0    2 0 0 0 0 0 0 0 0 0
13   0   2 0 0 0 0 0 0 0 0 0    6 0 0 0 0 0 0 0 0 0
14   1   2 0 0 0 0 0 0 0 0 0    0 0 0 0 0 0 0 0 0 0
15   2   2 0 0 0 0 0 0 0 0 0    0 0 0 0 0 0 0 0 0 0
16   3   2 0 0 0 0 0 0 0 0 0    0 0 0 0 0 0 0 0 0 0
17   4   2 0 0 0 0 0 0 0 0 0    0 0 0 0 0 0 0 0 0 0
18   5   2 0 0 0 0 0 0 0 0 0    0 0 0 0 0 0 0 0 0 0
19   6   2 0 0 0 0 0 0 0 0 0    0 0 0 0 0 0 0 0 0 0

If u ∈ k[[x]] with u(0) = 1 then divstep(δ, uf, ug) = (δ′, uf′, ug′). If u ∈ k^* then

divstep(δ, uf, g) = (δ′, f′, ug′) in the first (swapped) case;
divstep(δ, uf, g) = (δ′, uf′, ug′) in the second case;
divstep(δ, f, ug) = (δ′, uf′, ug′) in the first case;
divstep(δ, f, ug) = (δ′, f′, ug′) in the second case;

and divstep(δ, uf, ug) = (δ′, uf′, u²g′).

One can also scale g(0)f − f(0)g by other nonzero constants. For example, for typical fields k, one can quickly find "half-size" a, b such that a/b = g(0)/f(0). Computing af − bg may be more efficient than computing g(0)f − f(0)g.

We use the g(0)/f(0) variant in our case study in Section 7, for the field k = F_3. Dividing by f(0) in this field is the same as multiplying by f(0).
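A one-line check of this fact in Sage: every nonzero c in F_3 satisfies c^2 = 1, so 1/c = c.

assert all(~GF(3)(c) == GF(3)(c) for c in [1, 2])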

4 Iterates of x-adic division steps

Starting from (δ, f, g) ∈ Z × k[[x]]^* × k[[x]], define (δ_n, f_n, g_n) = divstep^n(δ, f, g) ∈ Z × k[[x]]^* × k[[x]] for each n ≥ 0. Also define T_n = T(δ_n, f_n, g_n) ∈ M_2(k[1/x]) and S_n = S(δ_n, f_n, g_n) ∈ M_2(Z) for each n ≥ 0. Our main algorithm relies on various simple mathematical properties of these objects; this section states those properties. Table 4.1 shows all δ_n, f_n, g_n for one example of (δ, f, g).

Theorem 4.2. (f_n, g_n)^T = T_{n−1} ··· T_m · (f_m, g_m)^T and (1, δ_n)^T = S_{n−1} ··· S_m · (1, δ_m)^T if n ≥ m ≥ 0.


Proof. This follows from (f_{m+1}, g_{m+1})^T = T_m · (f_m, g_m)^T and (1, δ_{m+1})^T = S_m · (1, δ_m)^T.

Theorem 4.3. T_{n−1} ··· T_m ∈ [[k + ··· + (1/x^{n−m−1})k, k + ··· + (1/x^{n−m−1})k], [(1/x)k + ··· + (1/x^{n−m})k, (1/x)k + ··· + (1/x^{n−m})k]] if n > m ≥ 0.

A product of n − m of these T matrices can thus be stored using 4(n − m) coefficients from k. Beware that the theorem statement relies on n > m; for n = m the product is the identity matrix.

Proof. If n − m = 1 then the product is T_m, which is in [[k, k], [(1/x)k, (1/x)k]] by definition. For n − m > 1, assume inductively that

T_{n−1} ··· T_{m+1} ∈ [[k + ··· + (1/x^{n−m−2})k, k + ··· + (1/x^{n−m−2})k], [(1/x)k + ··· + (1/x^{n−m−1})k, (1/x)k + ··· + (1/x^{n−m−1})k]].

Multiply on the right by T_m ∈ [[k, k], [(1/x)k, (1/x)k]]. The top entries of the product are in (k + ··· + (1/x^{n−m−2})k) + ((1/x)k + ··· + (1/x^{n−m−1})k) = k + ··· + (1/x^{n−m−1})k as claimed, and the bottom entries are in ((1/x)k + ··· + (1/x^{n−m−1})k) + ((1/x²)k + ··· + (1/x^{n−m})k) = (1/x)k + ··· + (1/x^{n−m})k as claimed.

Theorem 4.4. S_{n−1} ··· S_m ∈ {[[1, 0], [b, e]] : b ∈ {2−(n−m), 4−(n−m), ..., n−m−2, n−m}, e ∈ {1, −1}} if n > m ≥ 0.

In other words, the bottom-left entry is between −(n−m) exclusive and n−m inclusive, and has the same parity as n−m. There are exactly 2(n−m) possibilities for the product. Beware that the theorem statement again relies on n > m.

Proof. If n − m = 1 then {2−(n−m), ..., n−m−2, n−m} = {1} and the product is S_m, which is [[1, 0], [1, ±1]] by definition.

For n − m > 1, assume inductively that

S_{n−1} ··· S_{m+1} ∈ {[[1, 0], [b, e]] : b ∈ {2−(n−m−1), ..., n−m−3, n−m−1}, e ∈ {1, −1}},

and multiply on the right by S_m = [[1, 0], [1, ±1]]. The top two entries of the product are again 1 and 0. The bottom-left entry is in {2−(n−m−1), ..., n−m−3, n−m−1} + {1, −1} = {2−(n−m), ..., n−m−2, n−m}. The bottom-right entry is again 1 or −1.

The next theorem compares δ_m, f_m, g_m, T_m, S_m to δ′_m, f′_m, g′_m, T′_m, S′_m defined in the same way starting from (δ′, f′, g′) ∈ Z × k[[x]]^* × k[[x]].

Theorem 4.5. Let m, t be nonnegative integers. Assume that δ′_m = δ_m, f′_m ≡ f_m (mod x^t), and g′_m ≡ g_m (mod x^t). Then for each integer n with m < n ≤ m + t: T′_{n−1} = T_{n−1}, S′_{n−1} = S_{n−1}, δ′_n = δ_n, f′_n ≡ f_n (mod x^{t−(n−m−1)}), and g′_n ≡ g_n (mod x^{t−(n−m)}).

In other words, δ_m and the first t coefficients of f_m and g_m determine all of the following:

• the matrices T_m, T_{m+1}, ..., T_{m+t−1};
• the matrices S_m, S_{m+1}, ..., S_{m+t−1};
• δ_{m+1}, ..., δ_{m+t};
• the first t coefficients of f_{m+1}, the first t − 1 coefficients of f_{m+2}, the first t − 2 coefficients of f_{m+3}, and so on through the first coefficient of f_{m+t};
• the first t − 1 coefficients of g_{m+1}, the first t − 2 coefficients of g_{m+2}, the first t − 3 coefficients of g_{m+3}, and so on through the first coefficient of g_{m+t−1}.

Our main algorithm computes (δ_n, f_n mod x^{t−(n−m−1)}, g_n mod x^{t−(n−m)}) for any selected n ∈ {m+1, ..., m+t}, along with the product of the matrices T_m, ..., T_{n−1}. See Section 5.

Proof. See Figure 3.3 for the intuition. Formally, fix m, t and induct on n. Observe that (δ′_{n−1}, f′_{n−1}(0), g′_{n−1}(0)) equals (δ_{n−1}, f_{n−1}(0), g_{n−1}(0)):

• If n = m + 1: By assumption f′_m ≡ f_m (mod x^t), and n ≤ m + t so t ≥ 1, so in particular f′_m(0) = f_m(0). Similarly g′_m(0) = g_m(0). By assumption δ′_m = δ_m.
• If n > m + 1: f′_{n−1} ≡ f_{n−1} (mod x^{t−(n−m−2)}) by the inductive hypothesis, and n ≤ m + t so t − (n−m−2) ≥ 2, so in particular f′_{n−1}(0) = f_{n−1}(0); similarly g′_{n−1} ≡ g_{n−1} (mod x^{t−(n−m−1)}) by the inductive hypothesis, and t − (n−m−1) ≥ 1, so in particular g′_{n−1}(0) = g_{n−1}(0); and δ′_{n−1} = δ_{n−1} by the inductive hypothesis.

Now use the fact that T and S inspect only the first coefficient of each series to see that

T′_{n−1} = T(δ′_{n−1}, f′_{n−1}, g′_{n−1}) = T(δ_{n−1}, f_{n−1}, g_{n−1}) = T_{n−1},
S′_{n−1} = S(δ′_{n−1}, f′_{n−1}, g′_{n−1}) = S(δ_{n−1}, f_{n−1}, g_{n−1}) = S_{n−1}

as claimed. Multiply (1, δ′_{n−1})^T = (1, δ_{n−1})^T by S′_{n−1} = S_{n−1} to see that δ′_n = δ_n as claimed.

By the inductive hypothesis (T′_m, ..., T′_{n−2}) = (T_m, ..., T_{n−2}), and T′_{n−1} = T_{n−1}, so T′_{n−1} ··· T′_m = T_{n−1} ··· T_m. Abbreviate T_{n−1} ··· T_m as P. Then (f′_n, g′_n)^T = P · (f′_m, g′_m)^T and (f_n, g_n)^T = P · (f_m, g_m)^T by Theorem 4.2, so (f′_n − f_n, g′_n − g_n)^T = P · (f′_m − f_m, g′_m − g_m)^T.

By Theorem 4.3, P ∈ [[k + ··· + (1/x^{n−m−1})k, k + ··· + (1/x^{n−m−1})k], [(1/x)k + ··· + (1/x^{n−m})k, (1/x)k + ··· + (1/x^{n−m})k]]. By assumption, f′_m − f_m and g′_m − g_m are multiples of x^t. Multiply to see that f′_n − f_n is a multiple of x^t/x^{n−m−1} = x^{t−(n−m−1)} as claimed, and g′_n − g_n is a multiple of x^t/x^{n−m} = x^{t−(n−m)} as claimed.

5 Fast computation of iterates of x-adic division steps

Our goal in this section is to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). We are given the power series f, g to precision t, t respectively, and we compute f_n, g_n to precision t − n + 1, t − n respectively if n ≥ 1; see Theorem 4.5. We also compute the n-step transition matrix T_{n−1} ··· T_0.

We begin with an algorithm that simply computes one divstep iterate after another, multiplying by scalars in k and dividing by x, exactly as in the divstep definition. We specify this algorithm in Figure 5.1 as a function divstepsx in the Sage [58] computer-algebra system. The quantities u, v, q, r ∈ k[1/x] inside the algorithm keep track of the coefficients of the current f, g in terms of the original f, g.
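A usage sketch in Sage (our own invocation, reusing the example of Table 4.1 and Section 6.3) of the divstepsx function from Figure 5.1 below:

k = GF(7)
kx.<x> = k[]
f = 2 + 7*x + x^2 + 8*x^3 + 2*x^4 + 8*x^5 + x^6 + 8*x^7
g = 3 + x + 4*x^2 + x^3 + 5*x^4 + 9*x^5 + 2*x^6
delta, f13, g13, P = divstepsx(13, 20, 1, f, g)
print(delta, f13, g13)   # 0 2 6, i.e. (delta_13, f_13, g_13) as in Table 4.1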

The rest of this section considers (1) constant-time computation, (2) "jumping" through division steps, and (3) further techniques to save time inside the computations.

5.2 Constant-time computation. The underlying Sage functions do not take constant time, but they can be replaced with functions that do take constant time. The case distinction can be replaced with constant-time bit operations, for example obtaining the new (δ, f, g) as follows.


def divstepsx(n,t,delta,f,g):
  assert t >= n and n >= 0
  f,g = f.truncate(t),g.truncate(t)
  kx = f.parent()
  x = kx.gen()
  u,v,q,r = kx(1),kx(0),kx(0),kx(1)

  while n > 0:
    f = f.truncate(t)
    if delta > 0 and g[0] != 0: delta,f,g,u,v,q,r = -delta,g,f,q,r,u,v
    f0,g0 = f[0],g[0]
    delta,g,q,r = 1+delta,(f0*g-g0*f)/x,(f0*q-g0*u)/x,(f0*r-g0*v)/x
    n,t = n-1,t-1
    g = kx(g).truncate(t)

  M2kx = MatrixSpace(kx.fraction_field(),2)
  return delta,f,g,M2kx((u,v,q,r))

Figure 5.1: Algorithm divstepsx to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). Inputs: n, t ∈ Z with 0 ≤ n ≤ t; δ ∈ Z; f, g ∈ k[[x]] to precision at least t. Outputs: δ_n; f_n to precision t if n = 0, or t − (n − 1) if n ≥ 1; g_n to precision t − n; T_{n−1} ··· T_0.

• Let s be the "sign bit" of −δ, meaning 0 if −δ ≥ 0 or 1 if −δ < 0.
• Let z be the "nonzero bit" of g(0), meaning 0 if g(0) = 0 or 1 if g(0) ≠ 0.
• Compute and output 1 + (1 − 2sz)δ. For example, shift δ to obtain 2δ, "AND" with −sz to obtain 2szδ, and subtract from 1 + δ.
• Compute and output f ⊕ sz(f ⊕ g), obtaining f if sz = 0 or g if sz = 1.
• Compute and output (1 − 2sz)(f(0)g − g(0)f)/x. Alternatively, compute and output simply (f(0)g − g(0)f)/x; see the discussion of decomposition and scaling in Section 3.

Standard techniques minimize the exact cost of these computations for various representations of the inputs. The details depend on the platform. See our case study in Section 7.
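To make the selection recipe concrete, here is a hedged plain-Python sketch (our own, not the paper's software) applying the same masking idea to the integer division steps of Figure 1.2; it assumes |δ| < 2^63 so that a 63-bit shift extracts the sign, and Python integers are of course not actually constant-time, so real implementations apply this to fixed-width words in C or assembly.

def divstep2_branchless(delta, f, g):
    s = ((-delta) >> 63) & 1            # "sign bit" of -delta: 1 iff delta > 0
    z = g & 1                           # "nonzero bit" of g mod 2
    mask = -(s & z)                     # all ones if swapping, else 0
    delta, f, g = (delta ^ (mask & (delta ^ -delta)),   # select delta or -delta
                   f ^ (mask & (f ^ g)),                # select f or g
                   g ^ (mask & (g ^ -f)))               # select g or -f
    return 1 + delta, f, (g + (g & 1)*f) // 2

assert divstep2_branchless(1, 5, 3) == (0, 3, -1)       # swap case, matching Figure 1.2's branch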

5.3 Jumping through division steps. Here is a more general divide-and-conquer algorithm to compute (δ_n, f_n, g_n) and the n-step transition matrix T_{n−1} ··· T_0:

• If n ≤ 1: use the definition of divstep and stop.
• Choose j ∈ {1, 2, ..., n−1}.
• Jump j steps from δ, f, g to δ_j, f_j, g_j. Specifically, compute the j-step transition matrix T_{j−1} ··· T_0, and then multiply by (f, g)^T to obtain (f_j, g_j)^T. To compute the j-step transition matrix, call the same algorithm recursively. The critical point here is that this recursive call uses the inputs to precision only j; this determines the j-step transition matrix by Theorem 4.5.
• Similarly jump another n − j steps from δ_j, f_j, g_j to δ_n, f_n, g_n.


from divstepsx import divstepsx

def jumpdivstepsx(n,t,delta,f,g):
  assert t >= n and n >= 0
  kx = f.truncate(t).parent()

  if n <= 1: return divstepsx(n,t,delta,f,g)

  j = n//2

  delta,f1,g1,P1 = jumpdivstepsx(j,j,delta,f,g)
  f,g = P1*vector((f,g))
  f,g = kx(f).truncate(t-j),kx(g).truncate(t-j)

  delta,f2,g2,P2 = jumpdivstepsx(n-j,n-j,delta,f,g)
  f,g = P2*vector((f,g))
  f,g = kx(f).truncate(t-n+1),kx(g).truncate(t-n)

  return delta,f,g,P2*P1

Figure 5.4: Algorithm jumpdivstepsx to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). Same inputs and outputs as in Figure 5.1.

This splits an (n, t) problem into two smaller problems, namely a (j, j) problem and an (n − j, n − j) problem, at the cost of O(1) polynomial multiplications involving O(t + n) coefficients.

If j is always chosen as 1 then this algorithm ends up performing the same computations as Figure 5.1: compute δ_1, f_1, g_1 to precision t − 1; compute δ_2, f_2, g_2 to precision t − 2; etc. But there are many more possibilities for j.

Our jumpdivstepsx algorithm in Figure 5.4 takes j = ⌊n/2⌋, balancing the (j, j) problem with the (n − j, n − j) problem. In particular, with subquadratic polynomial multiplication, the algorithm is subquadratic. With FFT-based polynomial multiplication, the algorithm uses (t + n)(log(t + n))^{1+o(1)} + n(log n)^{2+o(1)} operations, and hence n(log n)^{2+o(1)} operations if t is bounded by n(log n)^{1+o(1)}. For comparison, Figure 5.1 takes Θ(n(t + n)) operations.

We emphasize that, unlike standard fast-gcd algorithms such as [66], this algorithm does not call a polynomial-division subroutine between its two recursive calls.

5.5 More speedups. Our jumpdivstepsx algorithm is asymptotically Θ(log n) times slower than fast polynomial multiplication. The best previous gcd algorithms also have a Θ(log n) slowdown, both in the worst case and in the average case¹⁰. This might suggest that there is very little room for improvement.

However, Θ says nothing about constant factors. There are several standard ideas for constant-factor speedups in gcd computation, beyond lower-level speedups in multiplication. We now apply the ideas to jumpdivstepsx.

The final polynomial multiplications in jumpdivstepsx produce six large outputs: f, g, and four entries of a matrix product P. Callers do not always use all of these outputs; for example, the recursive calls to jumpdivstepsx do not use the resulting f and g.

¹⁰There are faster cases. Strassen [66] gave an algorithm that replaces the log factor with the entropy of the list of degrees in the corresponding continued fraction; there are occasional inputs where this entropy is o(log n). For example, computing gcd{f, 0} takes linear time. However, these speedups are not useful for constant-time computations.


In principle there is no difficulty applying dead-value elimination and higher-level optimizations to each of the 63 possible combinations of desired outputs.

The recursive calls in jumpdivstepsx produce two products P_1, P_2 of transition matrices, which are then (if desired) multiplied to obtain P = P_2P_1. In Figure 5.4, jumpdivstepsx multiplies f and g first by P_1 and then by P_2 to obtain f_n and g_n. An alternative is to multiply by P.

The inputs f and g are truncated to t coefficients, and the resulting f_n and g_n are truncated to t − n + 1 and t − n coefficients respectively. Often the caller (for example, jumpdivstepsx itself) wants more coefficients of P·(f, g)^T. Rather than performing an entirely new multiplication by P, one can add the truncation errors back into the results: multiply P by the f and g truncation errors, multiply P_2 by the f_j and g_j truncation errors, and skip the final truncations of f_n and g_n.

There are standard speedups for multiplication in the context of truncated products, sums of products (e.g., in matrix multiplication), partially known products (e.g., the coefficients of x^{−1}, x^{−2}, x^{−3}, ... in P_1·(f, g)^T are known to be 0), repeated inputs (e.g., each entry of P_1 is multiplied by two entries of P_2 and by one of f, g), and outputs being reused as inputs. See, e.g., the discussions of "FFT addition", "FFT caching", and "FFT doubling" in [11].

Any choice of j between 1 and n − 1 produces correct results. Values of j close to the edges do not take advantage of fast polynomial multiplication, but this does not imply that j = ⌊n/2⌋ is optimal. The number of combinations of choices of j and other options above is small enough that, for each small (t, n) in turn, one can simply measure all options and select the best. The results depend on the context (which outputs are needed) and on the exact performance of polynomial multiplication.

6 Fast polynomial gcd computation and modular inversion

This section presents three algorithms: gcdx computes gcd{R_0, R_1} where R_0 is a polynomial of degree d > 0 and R_1 is a polynomial of degree < d; gcddegree computes merely deg gcd{R_0, R_1}; recipx computes the reciprocal of R_1 modulo R_0 if gcd{R_0, R_1} = 1. Figure 6.1 specifies all three algorithms, and Theorem 6.2 justifies all three algorithms. The theorem is proven in Appendix A.

Each algorithm calls divstepsx from Section 5 to compute (δ_{2d−1}, f_{2d−1}, g_{2d−1}) = divstep^{2d−1}(1, f, g). Here f = x^d R_0(1/x) is the reversal of R_0, and g = x^{d−1} R_1(1/x). Of course, one can replace divstepsx here with jumpdivstepsx, which has the same outputs. The algorithms vary in the input-precision parameter t passed to divstepsx, and vary in how they use the outputs:

• gcddegree uses input precision t = 2d − 1 to compute δ_{2d−1}, and then uses one part of Theorem 6.2, which states that deg gcd{R_0, R_1} = δ_{2d−1}/2.
• gcdx uses precision t = 3d − 1 to compute δ_{2d−1} and d + 1 coefficients of f_{2d−1}. The algorithm then uses another part of Theorem 6.2, which states that f_{2d−1}/f_{2d−1}(0) is the reversal of gcd{R_0, R_1}.
• recipx uses t = 2d − 1 to compute δ_{2d−1}, 1 coefficient of f_{2d−1}, and the product P of transition matrices. It then uses the last part of Theorem 6.2, which (in the coprime case) states that the top-right corner of P is f_{2d−1}(0)/x^{2d−2} times the degree-(d − 1) reversal of the desired reciprocal.

One can generalize to compute gcd{R_0, R_1} for any polynomials R_0, R_1 of degree ≤ d, in time that depends only on d: scan the inputs to find the top degree and always perform 2d iterations of divstep. The extra iteration handles the case that both R_0 and R_1 have degree d. For simplicity we focus on the case deg R_0 = d > deg R_1, with d > 0.


from divstepsx import divstepsx

def gcddegree(R0,R1):
  d = R0.degree()
  assert d > 0 and d > R1.degree()
  f,g = R0.reverse(d),R1.reverse(d-1)
  delta,f,g,P = divstepsx(2*d-1,2*d-1,1,f,g)
  return delta//2

def gcdx(R0,R1):
  d = R0.degree()
  assert d > 0 and d > R1.degree()
  f,g = R0.reverse(d),R1.reverse(d-1)
  delta,f,g,P = divstepsx(2*d-1,3*d-1,1,f,g)
  return f.reverse(delta//2)/f[0]

def recipx(R0,R1):
  d = R0.degree()
  assert d > 0 and d > R1.degree()
  f,g = R0.reverse(d),R1.reverse(d-1)
  delta,f,g,P = divstepsx(2*d-1,2*d-1,1,f,g)
  if delta != 0: return
  kx = f.parent()
  x = kx.gen()
  return kx(x^(2*d-2)*P[0][1]/f[0]).reverse(d-1)

Figure 6.1: Algorithm gcdx to compute gcd{R_0, R_1}; algorithm gcddegree to compute deg gcd{R_0, R_1}; algorithm recipx to compute the reciprocal of R_1 modulo R_0 when gcd{R_0, R_1} = 1. All three algorithms assume that R_0, R_1 ∈ k[x], that deg R_0 > 0, and that deg R_0 > deg R_1.

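A usage sketch in Sage for the three algorithms of Figure 6.1, on the example worked out in Section 6.3 below (our own invocation; it assumes divstepsx from Figure 5.1 is importable):

k = GF(7)
kx.<x> = k[]
R0 = 2*x^7 + 7*x^6 + x^5 + 8*x^4 + 2*x^3 + 8*x^2 + x + 8
R1 = 3*x^6 + x^5 + 4*x^4 + x^3 + 5*x^2 + 9*x + 2
print(gcddegree(R0, R1))   # 0: the gcd has degree 0
print(gcdx(R0, R1))        # 1
V = recipx(R0, R1)
print((V*R1) % R0)         # 1, so V is the reciprocal of R1 modulo R0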

One can similarly characterize gcd and modular reciprocal in terms of subsequent iterates of divstep, with δ increased by the number of extra steps. Implementors might, for example, prefer to use an iteration count that is divisible by a large power of 2.

Theorem 6.2. Let k be a field. Let d be a positive integer. Let R_0, R_1 be elements of the polynomial ring k[x] with deg R_0 = d > deg R_1. Define G = gcd{R_0, R_1}, and let V be the unique polynomial of degree < d − deg G such that V R_1 ≡ G (mod R_0). Define f = x^d R_0(1/x); g = x^{d−1} R_1(1/x); (δ_n, f_n, g_n) = divstep^n(1, f, g); T_n = T(δ_n, f_n, g_n); and [[u_n, v_n], [q_n, r_n]] = T_{n−1} ··· T_0. Then

deg G = δ_{2d−1}/2,
G = x^{deg G} f_{2d−1}(1/x)/f_{2d−1}(0),
V = x^{−d+1+deg G} v_{2d−1}(1/x)/f_{2d−1}(0).

6.3 A numerical example. Take k = F_7, d = 7, R_0 = 2x^7 + 7x^6 + 1x^5 + 8x^4 + 2x^3 + 8x^2 + 1x + 8, and R_1 = 3x^6 + 1x^5 + 4x^4 + 1x^3 + 5x^2 + 9x + 2. Then f = 2 + 7x + 1x^2 + 8x^3 + 2x^4 + 8x^5 + 1x^6 + 8x^7 and g = 3 + 1x + 4x^2 + 1x^3 + 5x^4 + 9x^5 + 2x^6. The gcdx algorithm computes (δ_13, f_13, g_13) = divstep^13(1, f, g) = (0, 2, 6); see Table 4.1 for all iterates of divstep on these inputs.


Dividing f_13 by its constant coefficient produces 1, the reversal of gcd{R_0, R_1}. The gcddegree algorithm simply computes δ_13 = 0, which is enough information to see that gcd{R_0, R_1} = 1.

6.4 Constant-factor speedups. Compared to the general power-series setting in divstepsx, the inputs to gcddegree, gcdx, and recipx are more limited. For example, the f input is all 0 after the first d + 1 coefficients, so computations involving subsequent coefficients can be skipped. Similarly, the top-right entry of the matrix P in recipx is guaranteed to have coefficient 0 for 1, 1/x, 1/x^2, and so on through 1/x^{d−2}, so one can save time in the polynomial computations that produce this entry. See Theorems A.1 and A.2 for bounds on the sizes of all intermediate results.

6.5 Bottom-up gcd and inversion. Our division steps eliminate bottom coefficients of the inputs. Appendices B, C, and D relate this to a traditional Euclid–Stevin computation, which gradually eliminates top coefficients of the inputs. The reversal of inputs and outputs in gcddegree, gcdx, and recipx exchanges the role of top coefficients and bottom coefficients. The following theorem states an alternative: skip the reversal, and instead directly eliminate the bottom coefficients of the original inputs.

Theorem 6.6. Let k be a field. Let d be a positive integer. Let f, g be elements of the polynomial ring k[x] with f(0) ≠ 0, deg f ≤ d, and deg g < d. Define γ = gcd{f, g}.

Define (δ_n, f_n, g_n) = divstep^n(1, f, g), T_n = T(δ_n, f_n, g_n), and ( u_n  v_n ; q_n  r_n ) = T_{n−1} · · · T_0. Then

  deg γ = deg f_{2d−1},
  γ = f_{2d−1}/ℓ,
  x^{2d−2} γ ≡ (x^{2d−2} v_{2d−1}/ℓ) g (mod f),

where ℓ is the leading coefficient of f_{2d−1}.

Proof. Note that γ(0) ≠ 0, since f is a multiple of γ.

Define R0 = x^d f(1/x). Then R0 ∈ k[x], deg R0 = d since f(0) ≠ 0, and f = x^d R0(1/x). Define R1 = x^{d−1} g(1/x). Then R1 ∈ k[x], deg R1 < d, and g = x^{d−1} R1(1/x). Define G = gcd{R0, R1}. We will show below that γ = c x^{deg G} G(1/x) for some c ∈ k*.

Then γ = c f_{2d−1}/f_{2d−1}(0) by Theorem 6.2, i.e., γ is a constant multiple of f_{2d−1}. Hence deg γ = deg f_{2d−1} as claimed. Furthermore γ has leading coefficient 1, so 1 = cℓ/f_{2d−1}(0), i.e., γ = f_{2d−1}/ℓ as claimed.

By Theorem 4.2, f_{2d−1} = u_{2d−1} f + v_{2d−1} g. Also x^{2d−2} u_{2d−1}, x^{2d−2} v_{2d−1} ∈ k[x] by Theorem 4.3, so

  x^{2d−2} γ = (x^{2d−2} u_{2d−1}/ℓ) f + (x^{2d−2} v_{2d−1}/ℓ) g ≡ (x^{2d−2} v_{2d−1}/ℓ) g (mod f)

as claimed.

All that remains is to show that γ = c x^{deg G} G(1/x) for some c ∈ k*. If R1 = 0 then G = gcd{R0, 0} = R0/R0[d], so x^{deg G} G(1/x) = x^d R0(1/x)/R0[d] = f/f[0]; also g = 0, so γ = f/f[deg f] = (f[0]/f[deg f]) x^{deg G} G(1/x). Assume from now on that R1 ≠ 0.

We have R1 = QG for some Q ∈ k[x]. Note that deg R1 = deg Q + deg G. Also R1(1/x) = Q(1/x) G(1/x), so x^{deg R1} R1(1/x) = x^{deg Q} Q(1/x) · x^{deg G} G(1/x) since deg R1 = deg Q + deg G. Hence x^{deg G} G(1/x) divides the polynomial x^{deg R1} R1(1/x), which in turn divides x^{d−1} R1(1/x) = g. Similarly x^{deg G} G(1/x) divides f. Hence x^{deg G} G(1/x) divides gcd{f, g} = γ.

Furthermore, G = A R0 + B R1 for some A, B ∈ k[x]. Take any integer m above all of deg G, deg A, deg B; then

  x^{m+d−deg G} x^{deg G} G(1/x) = x^{m+d} G(1/x)
    = x^m A(1/x) x^d R0(1/x) + x^{m+1} B(1/x) x^{d−1} R1(1/x)
    = x^{m−deg A} x^{deg A} A(1/x) f + x^{m+1−deg B} x^{deg B} B(1/x) g


so x^{m+d−deg G} x^{deg G} G(1/x) is a multiple of γ, so the polynomial γ/(x^{deg G} G(1/x)) divides x^{m+d−deg G}, so γ/(x^{deg G} G(1/x)) = c x^i for some c ∈ k* and some i ≥ 0. We must have i = 0 since γ(0) ≠ 0.

6.7. How to divide by x^{2d−2} modulo f. Unlike top-down inversion (and top-down gcd and bottom-up gcd), bottom-up inversion naturally scales its result by a power of x. We now explain various ways to remove this scaling.

For simplicity we focus here on the case gcd{f, g} = 1. The divstep computation in Theorem 6.6 naturally produces the polynomial x^{2d−2} v_{2d−1}/ℓ, which has degree at most d − 1 by Theorem 6.2. Our objective is to divide this polynomial by x^{2d−2} modulo f, obtaining the reciprocal of g modulo f by Theorem 6.6. One way to do this is to multiply by a reciprocal of x^{2d−2} modulo f. This reciprocal depends only on d and f, i.e., it can be precomputed before g is available.

Here are several standard strategies to compute 1/x^{2d−2} modulo f, or more generally 1/x^e modulo f for any nonnegative integer e.

• Compute the polynomial (1 − f/f(0))/x, which is 1/x modulo f, and then raise this to the eth power modulo f. This takes a logarithmic number of multiplications modulo f.

• Start from 1/f(0), the reciprocal of f modulo x. Compute the reciprocals of f modulo x^2, x^4, x^8, etc. by Newton (Hensel) iteration. After finding the reciprocal r of f modulo x^e, divide 1 − rf by x^e to obtain the reciprocal of x^e modulo f. This takes a constant number of full-size multiplications, a constant number of half-size multiplications, etc. The total number of multiplications is again logarithmic, but if e is linear, as in the case e = 2d − 2, then the total size of the multiplications is linear, without an extra logarithmic factor.

• Combine the powering strategy with the Hensel strategy. For example, use the reciprocal of f modulo x^{d−1} to compute the reciprocal of x^{d−1} modulo f, and then square modulo f to obtain the reciprocal of x^{2d−2} modulo f, rather than using the reciprocal of f modulo x^{2d−2}.

• Write down a simple formula for the reciprocal of x^e modulo f if f has a special structure that makes this easy. For example, if f = x^d − 1, as in original NTRU, and d ≥ 3, then the reciprocal of x^{2d−2} is simply x^2. As another example, if f = x^d + x^{d−1} + · · · + 1 and d ≥ 5, then the reciprocal of x^{2d−2} is x^4.

In most cases the division by x^{2d−2} modulo f involves extra arithmetic that is avoided by the top-down reversal approach. On the other hand, the case f = x^d − 1 does not involve any extra arithmetic. There are also applications of inversion that can easily handle a scaled result.
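To make the first strategy above concrete, here is a minimal Sage sketch (not constant-time); the helper name recip_x_power and the small test values are ours, not part of the software described in this paper.

    def recip_x_power(f, e):
        # f in k[x] with f(0) != 0; return the canonical representative of x^(-e) in k[x]/(f)
        R = f.parent()
        x = R.gen()
        h = (1 - f * f(0)**(-1)) // x        # h represents 1/x modulo f, since x*h = 1 - f/f(0)
        Q = R.quotient(f)
        return (Q(h)**e).lift()

    R = PolynomialRing(GF(3), 'x')
    x = R.gen()
    f = sum(x**i for i in range(8))          # x^7 + x^6 + ... + 1, so d = 7 and e = 2d - 2 = 12
    assert (recip_x_power(f, 12) * x**12) % f == 1   # the result is x^4, matching the last bullet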

7. Software case study for polynomial modular inversion

As a concrete case study, we consider the problem of inversion in the ring (Z/3)[x]/(x^700 + x^699 + · · · + x + 1) on an Intel Haswell CPU core. The situation before our work was that this inversion consumed half of the key-generation time in the ntruhrss701 cryptosystem: about 150000 cycles out of 300000 cycles. This cryptosystem was introduced by Hülsing, Rijneveld, Schanck, and Schwabe [39] at CHES 2017.¹¹

¹¹NTRU-HRSS is another round-2 submission in NIST's ongoing post-quantum competition. Google has also selected NTRU-HRSS for its "CECPQ2" post-quantum experiment. CECPQ2 includes a slightly modified version of ntruhrss701, and the NTRU-HRSS authors have also announced their intention to tweak NTRU-HRSS; these changes do not affect the performance of key generation.


We saved 60000 cycles out of these 150000 cycles, reducing the total key-generation time to 240000 cycles. Our approach also simplifies code, for example eliminating the series of conditional power-of-2 rotations in [39].

7.1. Details of the new computation. Our computation works with one integer δ and four polynomials v, r, f, g ∈ (Z/3)[x]. Initially δ = 1, v = 0, r = 1, f = x^700 + x^699 + · · · + x + 1, and g = g_699 x^699 + · · · + g_0 x^0 is obtained by reversing the 700 input coefficients.

We actually store −δ instead of δ; this turns out to be marginally more efficient for this platform.

We then apply 2 · 700 − 1 = 1399 iterations of a main loop. Each iteration works as follows, with the order of operations determined experimentally to try to minimize latency issues. (A simplified model of one iteration appears after the list.)

• Replace v with xv.

• Compute a swap mask as −1 if δ > 0 and g(0) ≠ 0, otherwise 0.

• Compute c ∈ Z/3 as f(0)g(0).

• Replace δ with −δ if the swap mask is set.

• Add 1 to δ.

• Replace (f, g) with (g, f) if the swap mask is set.

• Replace g with (g − cf)/x.

• Replace (v, r) with (r, v) if the swap mask is set.

• Replace r with r − cv.
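For reference, here is a hedged Sage/Python model of one iteration of this loop (scalar polynomial arithmetic only, no vectorization, no constant-time masking); the function name and the explicit x argument are ours, not part of the production software.

    def iterate(delta, v, r, f, g, x):
        v = x * v                                # step 1: v <- xv
        swap = delta > 0 and g(0) != 0           # step 2: swap mask
        c = f(0) * g(0)                          # step 3: c in Z/3
        if swap:                                 # steps 4, 6, 8: the conditional swaps
            delta, f, g, v, r = -delta, g, f, r, v
        delta = delta + 1                        # step 5
        g = (g - c * f) // x                     # step 7: exact division, constant term cancels
        r = r - c * v                            # step 9
        return delta, v, r, f, g

Applying this 1399 times, starting from δ = 1, v = 0, r = 1, f = x^700 + · · · + 1, and the reversed input g over GF(3), models the computation described above.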

This multiplies the transition matrix T(δ, f, g) into the vector ( v/x^{n−1} ; r/x^n ), where n is the number of iterations, and at the same time replaces (δ, f, g) with divstep(δ, f, g). Actually, we are slightly modifying divstep here (and making the corresponding modification to T), replacing the original coefficients f(0) and g(0) with the scaled coefficients 1 and g(0)/f(0) = f(0)g(0), as in Section 3.4. In other words, if f(0) = −1 then we negate both f and g. We suppress further comments on this deviation from the definition of divstep.

The effect of n iterations is to replace the original (δ, f, g) with divstep^n(δ, f, g). The matrix product T_{n−1} · · · T_0 has the form ( · · ·  v/x^{n−1} ; · · ·  r/x^n ). The new f and g are congruent to v/x^{n−1} and r/x^n, respectively, times the original g modulo x^700 + x^699 + · · · + x + 1.

After 1399 iterations, we have δ = 0 if and only if the input is invertible modulo x^700 + x^699 + · · · + x + 1, by Theorem 6.2. Furthermore, if δ = 0 then the inverse is x^{−699} v_{1399}(1/x)/f(0), where v_{1399} = v/x^{1398}; i.e., the inverse is x^{699} v(1/x)/f(0). We thus multiply v by f(0) and take the coefficients of x^0, . . . , x^699 in reverse order to obtain the desired inverse.

We use signed representatives −1, 0, 1 for elements of Z/3. We use 2-bit two's-complement representations of −1, 0, 1: the bottom bit is 1, 0, 1 respectively, and the top bit is 1, 0, 0. We wrote a simple superoptimizer to find optimal sequences of 3, 6, 6 bit operations, respectively, for multiplication, addition, and subtraction. We do not claim novelty for the approach in this paragraph: Boothby and Bradshaw in [20, Section 4.1] suggested the same 2-bit representation for Z/3 (without explaining it as signed two's-complement) and found the same operation counts by a similar search.
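As an illustration of this representation, here is our own sketch of a 3-operation multiplication in Python, brute-force checked over all nine input pairs; it matches the stated operation count but is not the superoptimizer output, and we do not reproduce the 6-operation addition and subtraction sequences.

    # value 0 -> (bottom, top) = (0, 0); 1 -> (1, 0); -1 -> (1, 1)
    def mul3(b1, t1, b2, t2):
        b = b1 & b2              # product is nonzero only if both factors are nonzero
        t = (t1 ^ t2) & b        # product is negative only if exactly one factor is negative
        return b, t

    enc = {0: (0, 0), 1: (1, 0), -1: (1, 1)}
    dec = {bits: val for val, bits in enc.items()}
    for a in (-1, 0, 1):
        for b in (-1, 0, 1):
            assert dec[mul3(*enc[a], *enc[b])] == a * b

The same bit operations, applied to 256-bit vectors of bottom bits and top bits, multiply 256 coefficients in parallel.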

We store each of the four polynomials v, r, f, g as six 256-bit vectors: for example, the coefficients of x^0, . . . , x^255 in v are stored as a 256-bit vector of bottom bits and a 256-bit vector of top bits. Within each 256-bit vector, we use the first 64-bit word to store the coefficients of x^0, x^4, . . . , x^252, the next 64-bit word to store the coefficients of x^1, x^5, . . . , x^253, etc. Multiplying by x thus moves the first 64-bit word to the second, moves the second to the third, moves the third to the fourth, moves the fourth to the first, and shifts the first by 1 bit; the bit that falls off the end of the first word is inserted into the next vector.

The first 256 iterations involve only the first 256-bit vectors for v and r, so we simply leave the second and third vectors as 0. The next 256 iterations involve only the first two 256-bit vectors for v and r. Similarly, we manipulate only the first 256-bit vectors for f and g in the final 256 iterations, and we manipulate only the first and second 256-bit vectors for f and g in the previous 256 iterations. Our main loop thus has five sections: 256 iterations involving (1, 1, 3, 3) vectors for (v, r, f, g); 256 iterations involving (2, 2, 3, 3) vectors; 375 iterations involving (3, 3, 3, 3) vectors; 256 iterations involving (3, 3, 2, 2) vectors; and 256 iterations involving (3, 3, 1, 1) vectors. This makes the code longer but saves almost 20% of the vector manipulations.

7.2. Comparison to the old computation. Many features of our computation are also visible in the software from [39]. We emphasize the ways that our computation is more streamlined.

There are essentially the same four polynomials v, r, f, g in [39] (labeled "c", "b", "g", "f"). Instead of a single integer δ (plus a loop counter) there are four integers "degf", "degg", "k", and "done" (plus a loop counter). One cannot simply eliminate "degf" and "degg" in favor of their difference δ, since "done" is set when "degf" reaches 0, and "done" influences "k", which in turn influences the results of the computation: the final result is multiplied by x^{701−k}.

Multiplication by x^{701−k} is a conceptually simple matter of rotating by k positions within 701 positions and then reducing modulo x^700 + x^699 + · · · + x + 1, since x^701 − 1 is a multiple of x^700 + x^699 + · · · + x + 1. However, since k is a variable, the obvious method of rotating by k positions does not take constant time; [39] instead performs conditional rotations by 1 position, 2 positions, 4 positions, etc., where the condition bits are the bits of k.

The inputs and outputs are not reversed in [39]. This affects the power of x multiplied into the final results. Our approach skips the powers of x entirely.

Each polynomial in [39] is represented as six 256-bit vectors, with the five-section pattern skipping manipulations of some vectors as explained above. The order of coefficients in each vector is simply x^0, x^1, . . .; multiplication by x in [39] uses more arithmetic instructions than our approach does.

At a lower level, [39] uses unsigned representatives 0, 1, 2 of Z/3, and uses a manually designed sequence of 19 bit operations for a multiply-add operation in Z/3. Langley [43] suggested a sequence of 16 bit operations, or 14 if an "and-not" operation is available. As in [20], we use just 9 bit operations.

The inversion software from [39] is 798 lines (32273 bytes) of Python generating assembly, producing 26634 bytes of object code. Our inversion software is 541 lines (14675 bytes) of C with intrinsics, compiled with gcc 8.2.1 to generate assembly, producing 10545 bytes of object code.

7.3. More NTRU examples. More than 1/3 of the key-generation time in [39] is consumed by a second inversion, which replaces the modulus 3 with 2^13. This is handled in [39] with an initial inversion modulo 2 followed by 8 multiplications modulo 2^13 for a Newton iteration.¹² The initial inversion modulo 2 is handled in [39] by exponentiation,

¹²This use of Newton iteration follows the NTRU key-generation strategy suggested in [61], but we point out a difference in the details. All 8 multiplications in [39] are carried out modulo 2^13, while [61] instead follows the traditional precision doubling in Newton iteration: compute an inverse modulo 2^2, then 2^4, then 2^8 (or 2^7), then 2^13. Perhaps limiting the precision of the first 6 multiplications would save time in [39].


taking just 10332 cycles; the point here is that squaring modulo 2 is particularly efficient.

Much more inversion time is spent in key generation in sntrup4591761, a slightly larger cryptosystem from the NTRU Prime [14] family.¹³ NTRU Prime uses a prime modulus, such as 4591 for sntrup4591761, to avoid concerns that a power-of-2 modulus could allow attacks (see generally [14]), but this modulus also makes arithmetic slower. Before our work, the key-generation software for sntrup4591761 took 6 million cycles, partly for inversion in (Z/3)[x]/(x^761 − x − 1) but mostly for inversion in (Z/4591)[x]/(x^761 − x − 1). We reimplemented both of these inversion steps, reducing the total key-generation time to just 940852 cycles.¹⁴ Almost all of our inversion code modulo 3 is shared between sntrup4591761 and ntruhrss701.

7.4. Does lattice-based cryptography need fast inversion? Lyubashevsky, Peikert, and Regev [46] published an NTRU alternative that avoids inversions.¹⁵ However, this cryptosystem has slower encryption than NTRU, has slower decryption than NTRU, and appears to be covered by a 2010 patent [35] by Gaborit and Aguilar-Melchor. It is easy to envision applications that will (1) select NTRU and (2) benefit from faster inversions in NTRU key generation.¹⁶

Lyubashevsky and Seiler have very recently announced "NTTRU" [47], a variant of NTRU with a ring chosen to allow very fast NTT-based (FFT-based) multiplication and division. This variant triggers many of the security concerns described in [14], but if the variant survives security analysis then it will eliminate concerns about key-generation speed in NTRU.

8. Definition of 2-adic division steps

Define divstep : Z × Z_2^* × Z_2 → Z × Z_2^* × Z_2 as follows:

  divstep(δ, f, g) =
    (1 − δ, g, (g − f)/2)              if δ > 0 and g is odd,
    (1 + δ, f, (g + (g mod 2)f)/2)     otherwise.

Note that g − f is even in the first case (since both f and g are odd), and g + (g mod 2)f is even in the second case (since f is odd).
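The following minimal Python model of this divstep (our own sketch; variable-time, for illustration only) can be compared against the constant-time algorithms later in the paper.

    def divstep(delta, f, g):
        if delta > 0 and g & 1:
            return 1 - delta, g, (g - f) // 2
        return 1 + delta, f, (g + (g & 1) * f) // 2

    # iterating eventually sends g to 0, with |f| then equal to the gcd of the inputs (f odd):
    delta, f, g = 1, 21, 12
    while g != 0:
        delta, f, g = divstep(delta, f, g)
    assert abs(f) == 3            # gcd(21, 12)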

8.1. Transition matrices. Write (δ_1, f_1, g_1) = divstep(δ, f, g). Then

  ( f_1 ; g_1 ) = T(δ, f, g) ( f ; g )   and   ( 1 ; δ_1 ) = S(δ, f, g) ( 1 ; δ ),

¹³NTRU Prime is another round-2 NIST submission. One other NTRU-based encryption system, NTRUEncrypt [29], was submitted to the competition but has been merged into NTRU-HRSS for round 2; see [59]. For completeness we mention that NTRUEncrypt key generation took 980760 cycles for ntrukem443 and 2639756 cycles for ntrukem743. As far as we know, the NTRUEncrypt software was not designed to run in constant time.

¹⁴This is a median cycle count from the SUPERCOP benchmarking framework. There is considerable variation in the cycle counts, more than 3% between quartiles, presumably depending on the mapping from virtual addresses to physical addresses. There is no dependence on the secret input being inverted.

¹⁵The LPR cryptosystem is the basis for several round-2 NIST submissions: Kyber, LAC, NewHope, Round5, Saber, ThreeBears, and the "NTRU LPRime" option in NTRU Prime.

¹⁶Google, for example, selected NTRU-HRSS for CECPQ2 as noted above, with the following explanation: "Schemes with a quotient-style key (like HRSS) will probably have faster encap/decap operations at the cost of much slower key-generation. Since there will be many uses outside TLS where keys can be reused this is interesting as long as the key-generation speed is still reasonable for TLS." See [42].


where T : Z × Z_2^* × Z_2 → M_2(Z[1/2]) is defined by

  T(δ, f, g) = ( 0  1 ; −1/2  1/2 )           if δ > 0 and g is odd,
  T(δ, f, g) = ( 1  0 ; (g mod 2)/2  1/2 )    otherwise,

and S : Z × Z_2^* × Z_2 → M_2(Z) is defined by

  S(δ, f, g) = ( 1  0 ; 1  −1 )    if δ > 0 and g is odd,
  S(δ, f, g) = ( 1  0 ; 1  1 )     otherwise.

Note that both T(δ, f, g) and S(δ, f, g) are defined entirely by δ, the bottom bit of f (which is always 1), and the bottom bit of g; they do not depend on the remaining bits of f and g.

8.2. Decomposition. This 2-adic divstep, like the power-series divstep, is a composition of two simpler functions: a conditional swap that replaces (δ, f, g) with (−δ, g, −f) if δ > 0 and g is odd, followed by an elimination that replaces (δ, f, g) with (1 + δ, f, (g + (g mod 2)f)/2).

8.3. Sizes. Write (δ_1, f_1, g_1) = divstep(δ, f, g). If f and g are integers that fit into n + 1 bits two's-complement, in the sense that −2^n ≤ f < 2^n and −2^n ≤ g < 2^n, then also −2^n ≤ f_1 < 2^n and −2^n ≤ g_1 < 2^n. Similarly, if −2^n < f < 2^n and −2^n < g < 2^n, then −2^n < f_1 < 2^n and −2^n < g_1 < 2^n. Beware, however, that intermediate results such as g − f need an extra bit.
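As a quick consistency check (our own Python sketch, not from the paper's software), one application of the matrices T and S above reproduces one divstep on a few sample inputs; exact rational arithmetic avoids rounding, and divstep is restated inline so the sketch is self-contained.

    from fractions import Fraction

    def divstep(delta, f, g):
        if delta > 0 and g & 1:
            return 1 - delta, g, (g - f) // 2
        return 1 + delta, f, (g + (g & 1) * f) // 2

    def T(delta, f, g):
        half = Fraction(1, 2)
        if delta > 0 and g & 1:
            return ((0, 1), (-half, half))
        return ((1, 0), ((g & 1) * half, half))

    def S(delta, f, g):
        return ((1, 0), (1, -1)) if delta > 0 and g & 1 else ((1, 0), (1, 1))

    for delta, f, g in [(1, 11, 6), (1, 7, 5), (-2, 9, 14), (3, -5, 3)]:
        d1, f1, g1 = divstep(delta, f, g)
        t, s = T(delta, f, g), S(delta, f, g)
        assert (f1, g1) == (t[0][0]*f + t[0][1]*g, t[1][0]*f + t[1][1]*g)
        assert (1, d1) == (s[0][0] + s[0][1]*delta, s[1][0] + s[1][1]*delta)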

As in the polynomial case, an important feature of divstep is that iterating divstep on integer inputs eventually sends g to 0. Concretely, we will show that (D + o(1))n iterations are sufficient, where D = 2/(10 − log_2 633) = 2.882100569 . . .; see Theorem 11.2 for exact bounds. The proof technique cannot do better than (d + o(1))n iterations, where d = 14/(15 − log_2(561 + √249185)) = 2.828339631 . . .. Numerical evidence is consistent with the possibility that (d + o(1))n iterations are required in the worst case, which is what matters in the context of constant-time algorithms.

8.4. A non-functioning variant. Many variations are possible in the definition of divstep. However, some care is required, as the following variant illustrates.

Define posdivstep : Z × Z_2^* × Z_2 → Z × Z_2^* × Z_2 as follows:

  posdivstep(δ, f, g) =
    (1 − δ, g, (g + f)/2)              if δ > 0 and g is odd,
    (1 + δ, f, (g + (g mod 2)f)/2)     otherwise.

This is just like divstep, except that it eliminates the negation of f in the swapped case. It is not true that iterating posdivstep on integer inputs eventually sends g to 0: for example, posdivstep^2(1, 1, 1) = (1, 1, 1). For comparison, in the polynomial case we were free to negate f (or g) at any moment.

8.5. A centered variant. As another variant of divstep, we define cdivstep : Z × Z_2^* × Z_2 → Z × Z_2^* × Z_2 as follows:

  cdivstep(δ, f, g) =
    (1 − δ, g, (f − g)/2)              if δ > 0 and g is odd,
    (1 + δ, f, (g + (g mod 2)f)/2)     if δ = 0,
    (1 + δ, f, (g − (g mod 2)f)/2)     otherwise.


Iterating cdivstep produces the same intermediate results as a variable-time gcd algorithm introduced by Stehlé and Zimmermann in [62]. The analogous gcd algorithm for divstep is a new variant of the Stehlé–Zimmermann algorithm, and we analyze the performance of this variant to prove bounds on the performance of divstep; see Appendices E, F, and G. As an analogy, one of our proofs in the polynomial case relies on relating x-adic divstep to the Euclid–Stevin algorithm.

The extra case distinctions make each cdivstep iteration more complicated than each divstep iteration. Similarly, the Stehlé–Zimmermann algorithm is more complicated than the new variant. To understand the motivation for cdivstep, compare divstep^3(−2, f, g) to cdivstep^3(−2, f, g). One has divstep^3(−2, f, g) = (1, f, g_3), where

  8g_3 ∈ {g, g + f, g + 2f, g + 3f, g + 4f, g + 5f, g + 6f, g + 7f}.

One has cdivstep^3(−2, f, g) = (1, f, g′_3), where

  8g′_3 ∈ {g − 3f, g − 2f, g − f, g, g + f, g + 2f, g + 3f, g + 4f}.

It seems intuitively clear that the "centered" output g′_3 is smaller than the "uncentered" output g_3; that, more generally, cdivstep has smaller outputs than divstep; and that cdivstep thus uses fewer iterations than divstep.

However, our analyses do not support this intuition. All available evidence indicates that, surprisingly, the worst case of cdivstep uses almost 10% more iterations than the worst case of divstep. Concretely, our proof technique cannot do better than (c + o(1))n iterations for cdivstep, where c = 2/(log_2(√17 − 1) − 1) = 3.110510062 . . ., and numerical evidence is consistent with the possibility that (c + o(1))n iterations are required.

8.6. A plus-or-minus variant. As yet another example, the following variant of divstep is essentially¹⁷ the Brent–Kung "Algorithm PM" from [24]:

  pmdivstep(δ, f, g) =
    (−δ, g, (f − g)/2)     if δ ≥ 0 and (g − f) mod 4 = 0,
    (−δ, g, (f + g)/2)     if δ ≥ 0 and (g + f) mod 4 = 0,
    (δ, f, (g − f)/2)      if δ < 0 and (g − f) mod 4 = 0,
    (δ, f, (g + f)/2)      if δ < 0 and (g + f) mod 4 = 0,
    (1 + δ, f, g/2)        otherwise.

There are even more cases here than in cdivstep, although splitting out an initial conditional swap reduces the number of cases to 3. Another complication here is that the linear combinations used in pmdivstep depend on the bottom two bits of f and g, rather than just the bottom bit.

To understand the motivation for pmdivstep, note that after each addition or subtraction there are always at least two halvings, as in the "sliding window" approach to exponentiation. Again it seems intuitively clear that this reduces the number of iterations required. Consider, however, the following example (which is from [24, Theorem 3], modulo minor details): pmdivstep needs 3b − 1 iterations to reach g = 0 starting from (1, 1, 3 · 2^{b−2}). Specifically, the first b − 2 iterations produce

  (2, 1, 3 · 2^{b−3}), (3, 1, 3 · 2^{b−4}), . . . , (b − 2, 1, 3 · 2), (b − 1, 1, 3),

and then the next 2b + 1 iterations produce

  (1 − b, 3, 2), (2 − b, 3, 1), (2 − b, 3, 2), (3 − b, 3, 1), (3 − b, 3, 2), . . . , (−1, 3, 1), (−1, 3, 2), (0, 3, 1), (0, 1, 2), (1, 1, 1), (−1, 1, 0).

¹⁷The Brent–Kung algorithm has an extra loop to allow the case that both inputs f, g are even. Compare [24, Figure 4] to the definition of pmdivstep.


Brent and Kung prove that ⌈cn⌉ systolic cells are enough for their algorithm, where c = 3.110510062 . . . is the constant mentioned above, and they conjecture that (3 + o(1))n cells are enough, so presumably (3 + o(1))n iterations of pmdivstep are enough. Our analysis shows that fewer iterations are sufficient for divstep, even though divstep is simpler than pmdivstep.

9. Iterates of 2-adic division steps

The following results are analogous to the x-adic results in Section 4. We state these results separately to support verification.

Starting from (δ, f, g) ∈ Z × Z_2^* × Z_2, define (δ_n, f_n, g_n) = divstep^n(δ, f, g) ∈ Z × Z_2^* × Z_2, T_n = T(δ_n, f_n, g_n) ∈ M_2(Z[1/2]), and S_n = S(δ_n, f_n, g_n) ∈ M_2(Z).

Theorem 9.1. ( f_n ; g_n ) = T_{n−1} · · · T_m ( f_m ; g_m ) and ( 1 ; δ_n ) = S_{n−1} · · · S_m ( 1 ; δ_m ) if n ≥ m ≥ 0.

Proof. This follows from ( f_{m+1} ; g_{m+1} ) = T_m ( f_m ; g_m ) and ( 1 ; δ_{m+1} ) = S_m ( 1 ; δ_m ).

Theorem 9.2. T_{n−1} · · · T_m ∈ ( (1/2^{n−m−1})Z  (1/2^{n−m−1})Z ; (1/2^{n−m})Z  (1/2^{n−m})Z ) if n > m ≥ 0.

Proof. As in Theorem 4.3.

Theorem 9.3. S_{n−1} · · · S_m ∈ ( 1  0 ; {2 − (n−m), . . . , n−m−2, n−m}  {1, −1} ) if n > m ≥ 0.

Proof. As in Theorem 4.4.

Theorem 9.4. Let m, t be nonnegative integers. Assume that δ′_m = δ_m, f′_m ≡ f_m (mod 2^t), and g′_m ≡ g_m (mod 2^t). Then, for each integer n with m < n ≤ m + t: T′_{n−1} = T_{n−1}, S′_{n−1} = S_{n−1}, δ′_n = δ_n, f′_n ≡ f_n (mod 2^{t−(n−m−1)}), and g′_n ≡ g_n (mod 2^{t−(n−m)}).

Proof. Fix m, t and induct on n. Observe that (δ′_{n−1}, f′_{n−1} mod 2, g′_{n−1} mod 2) equals (δ_{n−1}, f_{n−1} mod 2, g_{n−1} mod 2):

• If n = m + 1: By assumption f′_m ≡ f_m (mod 2^t), and n ≤ m + t, so t ≥ 1, so in particular f′_m mod 2 = f_m mod 2. Similarly g′_m mod 2 = g_m mod 2. By assumption δ′_m = δ_m.

• If n > m + 1: f′_{n−1} ≡ f_{n−1} (mod 2^{t−(n−m−2)}) by the inductive hypothesis, and n ≤ m + t, so t − (n−m−2) ≥ 2, so in particular f′_{n−1} mod 2 = f_{n−1} mod 2; similarly g′_{n−1} ≡ g_{n−1} (mod 2^{t−(n−m−1)}) by the inductive hypothesis, and t − (n−m−1) ≥ 1, so in particular g′_{n−1} mod 2 = g_{n−1} mod 2; and δ′_{n−1} = δ_{n−1} by the inductive hypothesis.

Now use the fact that T and S inspect only the bottom bit of each number to see that

  T′_{n−1} = T(δ′_{n−1}, f′_{n−1}, g′_{n−1}) = T(δ_{n−1}, f_{n−1}, g_{n−1}) = T_{n−1},
  S′_{n−1} = S(δ′_{n−1}, f′_{n−1}, g′_{n−1}) = S(δ_{n−1}, f_{n−1}, g_{n−1}) = S_{n−1}

as claimed. Multiply ( 1 ; δ′_{n−1} ) = ( 1 ; δ_{n−1} ) by S′_{n−1} = S_{n−1} to see that δ′_n = δ_n as claimed.


def truncate(f, t):
    if t == 0: return 0
    twot = 1 << (t - 1)
    return ((f + twot) & (2 * twot - 1)) - twot

def divsteps2(n, t, delta, f, g):
    assert t >= n and n >= 0
    f, g = truncate(f, t), truncate(g, t)
    u, v, q, r = 1, 0, 0, 1

    while n > 0:
        f = truncate(f, t)
        if delta > 0 and g & 1:
            delta, f, g, u, v, q, r = -delta, g, -f, q, r, -u, -v
        g0 = g & 1
        delta, g, q, r = 1 + delta, (g + g0*f)/2, (q + g0*u)/2, (r + g0*v)/2
        n, t = n - 1, t - 1
        g = truncate(ZZ(g), t)

    M2Q = MatrixSpace(QQ, 2)
    return delta, f, g, M2Q((u, v, q, r))

Figure 10.1: Algorithm divsteps2 to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). Inputs: n, t ∈ Z with 0 ≤ n ≤ t; δ ∈ Z; at least bottom t bits of f, g ∈ Z_2. Outputs: δ_n; bottom t bits of f_n if n = 0, or t − (n − 1) bits if n ≥ 1; bottom t − n bits of g_n; T_{n−1} · · · T_0.

By the inductive hypothesis, (T′_m, . . . , T′_{n−2}) = (T_m, . . . , T_{n−2}), and T′_{n−1} = T_{n−1}, so T′_{n−1} · · · T′_m = T_{n−1} · · · T_m. Abbreviate T_{n−1} · · · T_m as P. Then ( f′_n ; g′_n ) = P ( f′_m ; g′_m ) and ( f_n ; g_n ) = P ( f_m ; g_m ) by Theorem 9.1, so

  ( f′_n − f_n ; g′_n − g_n ) = P ( f′_m − f_m ; g′_m − g_m ).

By Theorem 9.2, P ∈ ( (1/2^{n−m−1})Z  (1/2^{n−m−1})Z ; (1/2^{n−m})Z  (1/2^{n−m})Z ). By assumption f′_m − f_m and g′_m − g_m are multiples of 2^t. Multiply to see that f′_n − f_n is a multiple of 2^t/2^{n−m−1} = 2^{t−(n−m−1)}, as claimed, and g′_n − g_n is a multiple of 2^t/2^{n−m} = 2^{t−(n−m)}, as claimed.

10. Fast computation of iterates of 2-adic division steps

We describe, just as in Section 5, two algorithms to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). We are given the bottom t bits of f, g, and compute f_n, g_n to t − n + 1, t − n bits respectively if n ≥ 1; see Theorem 9.4. We also compute the product of the relevant T matrices. We specify these algorithms in Figures 10.1–10.2 as functions divsteps2 and jumpdivsteps2 in Sage.

The first function, divsteps2, simply takes divsteps one after another, analogously to divstepsx. The second function, jumpdivsteps2, uses a straightforward divide-and-conquer strategy, analogously to jumpdivstepsx. More generally, j in jumpdivsteps2 can be taken anywhere between 1 and n − 1; this generalization includes divsteps2 as a special case.

As in Section 5, the Sage functions are not constant-time, but it is easy to build constant-time versions of the same computations. The comments on performance in Section 5 are applicable here, with integer arithmetic replacing polynomial arithmetic. The same basic optimization techniques are also applicable here.


from divsteps2 import divsteps2, truncate

def jumpdivsteps2(n, t, delta, f, g):
    assert t >= n and n >= 0
    if n <= 1: return divsteps2(n, t, delta, f, g)

    j = n // 2

    delta, f1, g1, P1 = jumpdivsteps2(j, j, delta, f, g)
    f, g = P1 * vector((f, g))
    f, g = truncate(ZZ(f), t - j), truncate(ZZ(g), t - j)

    delta, f2, g2, P2 = jumpdivsteps2(n - j, n - j, delta, f, g)
    f, g = P2 * vector((f, g))
    f, g = truncate(ZZ(f), t - n + 1), truncate(ZZ(g), t - n)

    return delta, f, g, P2 * P1

Figure 10.2: Algorithm jumpdivsteps2 to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). Same inputs and outputs as in Figure 10.1.

11. Fast integer gcd computation and modular inversion

This section presents two algorithms. If f is an odd integer and g is an integer, then gcd2 computes gcd{f, g}. If also gcd{f, g} = 1, then recip2 computes the reciprocal of g modulo f. Figure 11.1 specifies both algorithms, and Theorem 11.2 justifies both algorithms.

The algorithms take the smallest nonnegative integer d such that |f| < 2^d and |g| < 2^d. More generally, one can take d as an extra input; the algorithms then take constant time if d is constant. Theorem 11.2 relies on f^2 + 4g^2 ≤ 5 · 2^{2d} (which is certainly true if |f| < 2^d and |g| < 2^d) but does not rely on d being minimal. One can further generalize to allow even f by first finding the number of shared powers of 2 in f and g, and then reducing to the odd case.

The algorithms then choose a positive integer m as a particular function of d, and call divsteps2 (which can be transparently replaced with jumpdivsteps2) to compute (δ_m, f_m, g_m) = divstep^m(1, f, g). The choice of m guarantees g_m = 0 and f_m = ± gcd{f, g} by Theorem 11.2.

The gcd2 algorithm uses input precision m + d to obtain d + 1 bits of f_m, i.e., to obtain the signed remainder of f_m modulo 2^{d+1}. This signed remainder is exactly f_m = ± gcd{f, g}, since the assumptions |f| < 2^d and |g| < 2^d imply |gcd{f, g}| < 2^d.

The recip2 algorithm uses input precision just m + 1 to obtain the signed remainder of f_m modulo 4. This algorithm assumes gcd{f, g} = 1, so f_m = ±1, so the signed remainder is exactly f_m. The algorithm also extracts the top-right corner v_m of the transition matrix, and multiplies by f_m, obtaining a reciprocal of g modulo f by Theorem 11.2.

Both gcd2 and recip2 are bottom-up algorithms, and in particular recip2 naturally obtains this reciprocal v_m/f_m as a fraction with denominator 2^{m−1}. It multiplies by 2^{m−1} to obtain an integer, and then multiplies by ((f + 1)/2)^{m−1} modulo f to divide by 2^{m−1} modulo f. The reciprocal of 2^{m−1} can be computed ahead of time. Another way to compute this reciprocal is through Hensel lifting, to compute f^{−1} (mod 2^{m−1}); for details and further techniques see Section 6.7.
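A small numeric check of this divide-by-2^{m−1} trick (our own sketch, with arbitrary sample values): since f is odd, (f + 1)/2 is an inverse of 2 modulo f, so its (m − 1)st power inverts 2^{m−1}.

    f, m = 1000003, 25                 # any odd f and any m >= 1 work here
    half = (f + 1) // 2
    assert (2 * half) % f == 1
    assert (pow(half, m - 1, f) * pow(2, m - 1, f)) % f == 1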

We emphasize that recip2 assumes that its inputs are coprime; it does not notice if this assumption is violated. One can check the assumption by computing f_m to higher precision (as in gcd2), or by multiplying the supposed reciprocal by g and seeing whether the result is 1 modulo f.


from divsteps2 import divsteps2

def iterations(d):
    return (49*d + 80) // 17 if d < 46 else (49*d + 57) // 17

def gcd2(f, g):
    assert f & 1
    d = max(f.nbits(), g.nbits())
    m = iterations(d)
    delta, fm, gm, P = divsteps2(m, m + d, 1, f, g)
    return abs(fm)

def recip2(f, g):
    assert f & 1
    d = max(f.nbits(), g.nbits())
    m = iterations(d)
    precomp = Integers(f)((f + 1)/2)^(m - 1)
    delta, fm, gm, P = divsteps2(m, m + 1, 1, f, g)
    V = sign(fm) * ZZ(P[0][1] * 2^(m - 1))
    return ZZ(V * precomp)

Figure 11.1: Algorithm gcd2 to compute gcd{f, g}; algorithm recip2 to compute the reciprocal of g modulo f when gcd{f, g} = 1. Both algorithms assume that f is odd.

For comparison, the recipx algorithm from Section 6 does notice if its polynomial inputs are coprime, using a quick test of δ_m.

Theorem 11.2. Let f be an odd integer. Let g be an integer. Define (δ_n, f_n, g_n) = divstep^n(1, f, g), T_n = T(δ_n, f_n, g_n), and ( u_n  v_n ; q_n  r_n ) = T_{n−1} · · · T_0. Let d be a real number. Assume that f^2 + 4g^2 ≤ 5 · 2^{2d}. Let m be an integer. Assume that m ≥ ⌊(49d + 80)/17⌋ if d < 46, and that m ≥ ⌊(49d + 57)/17⌋ if d ≥ 46. Then m ≥ 1, g_m = 0, f_m = ± gcd{f, g}, 2^{m−1} v_m ∈ Z, and 2^{m−1} v_m g ≡ 2^{m−1} f_m (mod f).

In particular, if gcd{f, g} = 1, then f_m = ±1, and dividing 2^{m−1} v_m/f_m by 2^{m−1} modulo f produces the reciprocal of g modulo f.

Proof. First f^2 ≥ 1, so 1 ≤ f^2 + 4g^2 ≤ 5 · 2^{2d} < 2^{2d+7/3} since 5^3 < 2^7. Hence d > −7/6. If d < 46, then m ≥ ⌊(49d + 80)/17⌋ ≥ ⌊(49(−7/6) + 80)/17⌋ = 1. If d ≥ 46, then m ≥ ⌊(49d + 57)/17⌋ ≥ ⌊(49 · 46 + 57)/17⌋ = 135. Either way m ≥ 1 as claimed.

Define R_0 = f, R_1 = 2g, and G = gcd{R_0, R_1}. Then G = gcd{f, g} since f is odd. Define b = log_2 √(R_0^2 + R_1^2) = log_2 √(f^2 + 4g^2) ≥ 0. Then 2^{2b} = f^2 + 4g^2 ≤ 5 · 2^{2d}, so b ≤ d + log_4 5. We now split into three cases, in each case defining n as in Theorem G.6.

• If b ≤ 21: Define n = ⌊19b/7⌋. Then n ≤ ⌊49b/17⌋ ≤ ⌊49(d + log_4 5)/17⌋ ≤ ⌊(49d + 57)/17⌋ ≤ m since 5^49 ≤ 4^57.

• If 21 < b ≤ 46: Define n = ⌊(49b + 23)/17⌋. If d ≥ 46 then n ≤ ⌊(49d + 23)/17⌋ ≤ ⌊(49d + 57)/17⌋ ≤ m. Otherwise d < 46, so n ≤ ⌊(49(d + log_4 5) + 23)/17⌋ ≤ ⌊(49d + 80)/17⌋ ≤ m.

• If b > 46: Define n = ⌊49b/17⌋. Then n ≤ ⌊49b/17⌋ ≤ ⌊49(d + log_4 5)/17⌋ ≤ ⌊(49d + 57)/17⌋ ≤ m.


In all cases n ≤ m.

Now divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0} by Theorem G.6, so divstep^m(1, R_0, R_1/2) ∈ Z × Z × {0}, so

  (δ_m, f_m, g_m) = divstep^m(1, f, g) = divstep^m(1, R_0, R_1/2) ∈ Z × {G, −G} × {0}

by Theorem G.3. Hence g_m = 0 as claimed, and f_m = ±G = ± gcd{f, g} as claimed. Both u_m and v_m are in (1/2^{m−1})Z by Theorem 9.2. In particular 2^{m−1} v_m ∈ Z as claimed. Finally, f_m = u_m f + v_m g by Theorem 9.1, so 2^{m−1} f_m = 2^{m−1} u_m f + 2^{m−1} v_m g, and 2^{m−1} u_m ∈ Z, so 2^{m−1} f_m ≡ 2^{m−1} v_m g (mod f) as claimed.

12. Software case study for integer modular inversion

As emphasized in Section 1, we have selected an inversion problem to be extremely favorable to Fermat's method. We use Intel platforms with 64-bit multipliers. We focus on the problem of inverting modulo the well-known Curve25519 prime p = 2^255 − 19. There has been a long line of Curve25519 implementation papers, starting with [10] in 2006; we compare to the latest speed records [54] (in assembly) for inversion modulo this prime.

Despite all these Fermat advantages, we achieve better speeds using our 2-adic division steps. As mentioned in Section 1, we take 10050 cycles, 8778 cycles, and 8543 cycles on Haswell, Skylake, and Kaby Lake, where [54] used 11854 cycles, 9301 cycles, and 8971 cycles. This section looks more closely at how the new computation works.

12.1. Details of the new computation. Theorem 11.2 guarantees that 738 divstep iterations suffice, since ⌊(49 · 255 + 57)/17⌋ = 738. To simplify the computation we actually use 744 = 12 · 62 iterations.

The decisions in the first 62 iterations are determined entirely by the bottom 62 bits of the inputs. We perform these iterations entirely in 64-bit words, apply the resulting 62-step transition matrix to the entire original inputs to jump to the result of 62 divstep iterations, and then repeat.

This is similar in two important ways to Lehmer's gcd computation from [44]. One of Lehmer's ideas, mentioned before, is to carry out some steps in limited precision. Another of Lehmer's ideas is for this limit to be a small constant. Our advantage over Lehmer's computation is that we have a completely regular constant-time data flow.

There are many other possibilities for which precision to use at each step. All of these possibilities can be described using the jumping framework described in Section 5.3, which applies to both the x-adic case and the 2-adic case. Figure 12.2 gives three examples of jump strategies. The first strategy computes each divstep in turn to full precision, as in the case study in Section 7. The second strategy is what we use in this case study. The third strategy illustrates an asymptotically faster choice of jump distances. The third strategy might be more efficient than the second strategy in hardware, but the second strategy seems optimal for 64-bit CPUs.

In more detail, we invert a 255-bit x modulo p as follows. (A simplified Python model of the whole computation appears after the list.)

1. Set f = p, g = x, δ = 1, i = 1.

2. Set f_0 = f (mod 2^64), g_0 = g (mod 2^64).

3. Compute (δ′, f_1, g_1) = divstep^62(δ, f_0, g_0), and obtain a scaled transition matrix T_i such that

     (T_i/2^62) ( f_0 ; g_0 ) = ( f_1 ; g_1 ).

   The 63-bit signed entries of T_i fit into 64-bit registers. We call this step jump64divsteps2.

4. Compute (f, g) ← T_i(f, g)/2^62. Set δ = δ′.


Figure 12.2: Three jump strategies for 744 divstep iterations on 255-bit integers. Left strategy: j = 1, i.e., computing each divstep in turn to full precision. Middle strategy: j = 1 for ≤62 requested iterations, else j = 62; i.e., using 62 bottom bits to compute 62 iterations, jumping 62 iterations, using 62 new bottom bits to compute 62 more iterations, etc. Right strategy: j = 1 for 2 requested iterations, j = 2 for 3 or 4 requested iterations, j = 4 for 5 or 6 or 7 or 8 requested iterations, etc. Vertical axis: 0 on top through 744 on bottom, number of iterations. Horizontal axis (within each strategy): 0 on left through 254 on right, bit positions used in computation.

5. Set i ← i + 1. Go back to step 3 if i ≤ 12.

6. Compute v mod p, where v is the top-right corner of T_12 T_11 · · · T_1:

   (a) Compute pair-products T_{2i} T_{2i−1}, with entries being 126-bit signed integers, which fit into two 64-bit limbs (radix 2^64).

   (b) Compute quad-products T_{4i} T_{4i−1} T_{4i−2} T_{4i−3}, with entries being 252-bit signed integers (four 64-bit limbs, radix 2^64).

   (c) At this point the three quad-products are converted into unsigned integers modulo p, divided into 4 64-bit limbs.

   (d) Compute final vector-matrix-vector multiplications modulo p.

7. Compute x^{−1} = sgn(f) · v · 2^{−744} (mod p), where 2^{−744} is precomputed.
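The following Python model (our own sketch; variable-time, exact integer arithmetic, no 62-iteration batching, and requiring Python 3.8+ for pow with a negative exponent) follows the same recipe and can be used to sanity-check its output.

    p = 2**255 - 19

    def invert_mod_p(x, iterations=744):
        delta, f, g = 1, p, x % p
        P = [[1, 0], [0, 1]]                   # holds 2^n times the product of transition matrices
        for _ in range(iterations):
            if delta > 0 and g & 1:
                delta, f, g = 1 - delta, g, (g - f) // 2
                M = [[0, 2], [-1, 1]]          # 2*T in the swap case
            else:
                g0 = g & 1
                delta, f, g = 1 + delta, f, (g + g0 * f) // 2
                M = [[2, 0], [g0, 1]]          # 2*T otherwise
            P = [[M[0][0]*P[0][0] + M[0][1]*P[1][0], M[0][0]*P[0][1] + M[0][1]*P[1][1]],
                 [M[1][0]*P[0][0] + M[1][1]*P[1][0], M[1][0]*P[0][1] + M[1][1]*P[1][1]]]
        assert g == 0 and f in (1, -1)         # 744 >= 738 iterations suffice for 255-bit inputs
        v = P[0][1]                            # 2^744 times the top-right corner
        return (1 if f == 1 else -1) * v * pow(2, -iterations, p) % p

    assert invert_mod_p(5) * 5 % p == 1

Replacing the loop body by 62-iteration jumps on the bottom 64 bits, as in steps 2–5 above, gives the constant-time computation.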

12.3. Other CPUs. Our advantage is larger on platforms with smaller multipliers. For example, we have written software to invert modulo 2^255 − 19 on an ARM Cortex-A7, obtaining a median cycle count of 35277 cycles. The best previous Cortex-A7 result we have found in the literature is 62648 cycles, reported by Fujii and Aranha [34] using Fermat's method. Internally, our software does repeated calls to jump32divsteps2, units of 30 iterations, with the resulting transition matrix fitting inside 32-bit general-purpose registers.

12.4. Other primes. Nath and Sarkar present inversion speeds for 20 different special primes, 12 of which were considered in previous work. For concreteness we consider inversion modulo the double-size prime 2^511 − 187. Aranha–Barreto–Pereira–Ricardini [8] selected this prime for their "M-511" curve.

Nath and Sarkar report that their Fermat inversion software for 2^511 − 187 takes 72804 Haswell cycles, 47062 Skylake cycles, or 45014 Kaby Lake cycles. The slowdown from 2^255 − 19, more than 5× in each case, is easy to explain: the exponent is twice as long, producing almost twice as many multiplications; each multiplication handles twice as many bits; and multiplication cost per bit is not constant.

Our 511-bit software is solidly faster: under 30000 cycles on all of these platforms. Most of the cycles in our computation are in evaluating jump64divsteps2, and the number of calls to jump64divsteps2 scales linearly with the number of bits in the input. There is some overhead for multiplications, so a modulus of twice the size takes more than twice as long, but our scalability to large moduli is much better than the Fermat scalability.

Our advantage is also larger in applications that use "random" primes rather than special primes: we pay the extra cost for Montgomery multiplication only at the end of our computation, whereas Fermat inversion pays the extra cost in each step.

References

[1] – (no editor), Actes du congrès international des mathématiciens, tome 3, Gauthier-Villars Éditeur, Paris, 1971. See [41].

[2] – (no editor), CWIT 2011: 12th Canadian workshop on information theory, Kelowna, British Columbia, May 17–20, 2011, Institute of Electrical and Electronics Engineers, 2011. ISBN 978-1-4577-0743-8. See [51].

[3] Carlisle Adams, Jan Camenisch (editors), Selected areas in cryptography – SAC 2017, 24th international conference, Ottawa, ON, Canada, August 16–18, 2017, revised selected papers, Lecture Notes in Computer Science 10719, Springer, 2018. ISBN 978-3-319-72564-2. See [15].

[4] Carlos Aguilar Melchor, Nicolas Aragon, Slim Bettaieb, Loïc Bidoux, Olivier Blazy, Jean-Christophe Deneuville, Philippe Gaborit, Edoardo Persichetti, Gilles Zémor, Hamming Quasi-Cyclic (HQC) (2017). URL: https://pqc-hqc.org/doc/hqc-specification_2017-11-30.pdf. Citations in this document: §1.

[5] Alfred V. Aho (chairman), Proceedings of fifth annual ACM symposium on theory of computing, Austin, Texas, April 30–May 2, 1973, ACM, New York, 1973. See [53].

[6] Martin Albrecht, Carlos Cid, Kenneth G. Paterson, C. J. Tjhai, Martin Tomlinson, NTS-KEM, "The main document submitted to NIST" (2017). URL: https://nts-kem.io. Citations in this document: §1.

[7] Alejandro Cabrera Aldaya, Cesar Pereida García, Luis Manuel Alvarez Tapia, Billy Bob Brumley, Cache-timing attacks on RSA key generation (2018). URL: https://eprint.iacr.org/2018/367. Citations in this document: §1.

[8] Diego F. Aranha, Paulo S. L. M. Barreto, Geovandro C. C. F. Pereira, Jefferson E. Ricardini, A note on high-security general-purpose elliptic curves (2013). URL: https://eprint.iacr.org/2013/647. Citations in this document: §12.4.


[9] Elwyn R. Berlekamp, Algebraic coding theory, McGraw-Hill, 1968. Citations in this document: §1.

[10] Daniel J. Bernstein, Curve25519: new Diffie-Hellman speed records, in PKC 2006 [70] (2006), 207–228. URL: https://cr.yp.to/papers.html#curve25519. Citations in this document: §1, §12.

[11] Daniel J. Bernstein, Fast multiplication and its applications, in [27] (2008), 325–384. URL: https://cr.yp.to/papers.html#multapps. Citations in this document: §5.5.

[12] Daniel J. Bernstein, Tung Chou, Tanja Lange, Ingo von Maurich, Rafael Misoczki, Ruben Niederhagen, Edoardo Persichetti, Christiane Peters, Peter Schwabe, Nicolas Sendrier, Jakub Szefer, Wen Wang, Classic McEliece: conservative code-based cryptography, "Supporting Documentation" (2017). URL: https://classic.mceliece.org/nist.html. Citations in this document: §1.

[13] Daniel J. Bernstein, Tung Chou, Peter Schwabe, McBits: fast constant-time code-based cryptography, in CHES 2013 [18] (2013), 250–272. URL: https://binary.cr.yp.to/mcbits.html. Citations in this document: §1, §1.

[14] Daniel J. Bernstein, Chitchanok Chuengsatiansup, Tanja Lange, Christine van Vredendaal, NTRU Prime: reducing attack surface at low cost, full version of [15] (2017). URL: https://ntruprime.cr.yp.to/papers.html. Citations in this document: §1, §1, §1.1, §7.3, §7.3, §7.4.

[15] Daniel J. Bernstein, Chitchanok Chuengsatiansup, Tanja Lange, Christine van Vredendaal, NTRU Prime: reducing attack surface at low cost, in SAC 2017 [3], abbreviated version of [14] (2018), 235–260.

[16] Daniel J. Bernstein, Niels Duif, Tanja Lange, Peter Schwabe, Bo-Yin Yang, High-speed high-security signatures, Journal of Cryptographic Engineering 2 (2012), 77–89; see also older version at CHES 2011. URL: https://ed25519.cr.yp.to/ed25519-20110926.pdf. Citations in this document: §1.

[17] Daniel J. Bernstein, Peter Schwabe, NEON crypto, in CHES 2012 [56] (2012), 320–339. URL: https://cr.yp.to/papers.html#neoncrypto. Citations in this document: §1, §1.

[18] Guido Bertoni, Jean-Sébastien Coron (editors), Cryptographic hardware and embedded systems – CHES 2013 – 15th international workshop, Santa Barbara, CA, USA, August 20–23, 2013, proceedings, Lecture Notes in Computer Science 8086, Springer, 2013. ISBN 978-3-642-40348-4. See [13].

[19] Adam W. Bojanczyk, Richard P. Brent, A systolic algorithm for extended GCD computation, Computers & Mathematics with Applications 14 (1987), 233–238. ISSN 0898-1221. URL: https://maths-people.anu.edu.au/~brent/pd/rpb096i.pdf. Citations in this document: §1.3, §1.3.

[20] Tomas J. Boothby, Robert W. Bradshaw, Bitslicing and the method of four Russians over larger finite fields (2009). URL: http://arxiv.org/abs/0901.1413. Citations in this document: §7.1, §7.2.

[21] Joppe W. Bos, Constant time modular inversion, Journal of Cryptographic Engineering 4 (2014), 275–281. URL: http://joppebos.com/files/CTInversion.pdf. Citations in this document: §1, §1, §1, §1, §1, §1.3.


[22] Richard P. Brent, Fred G. Gustavson, David Y. Y. Yun, Fast solution of Toeplitz systems of equations and computation of Padé approximants, Journal of Algorithms 1 (1980), 259–295. ISSN 0196-6774. URL: https://maths-people.anu.edu.au/~brent/pd/rpb059i.pdf. Citations in this document: §1.4, §1.4.

[23] Richard P. Brent, Hsiang-Tsung Kung, Systolic VLSI arrays for polynomial GCD computation, IEEE Transactions on Computers C-33 (1984), 731–736. URL: https://maths-people.anu.edu.au/~brent/pd/rpb073i.pdf. Citations in this document: §1.3, §1.3, §3.2.

[24] Richard P. Brent, Hsiang-Tsung Kung, A systolic VLSI array for integer GCD computation, ARITH-7 (1985), 118–125. URL: https://maths-people.anu.edu.au/~brent/pd/rpb077i.pdf. Citations in this document: §1.3, §1.3, §1.3, §1.3, §1.4, §8.6, §8.6, §8.6.

[25] Richard P. Brent, Paul Zimmermann, An O(M(n) log n) algorithm for the Jacobi symbol, in ANTS 2010 [37] (2010), 83–95. URL: https://arxiv.org/pdf/1004.2091.pdf. Citations in this document: §1.4.

[26] Duncan A. Buell (editor), Algorithmic number theory, 6th international symposium, ANTS-VI, Burlington, VT, USA, June 13–18, 2004, proceedings, Lecture Notes in Computer Science 3076, Springer, 2004. ISBN 3-540-22156-5. See [62].

[27] Joe P. Buhler, Peter Stevenhagen (editors), Surveys in algorithmic number theory, Mathematical Sciences Research Institute Publications 44, Cambridge University Press, New York, 2008. See [11].

[28] Wouter Castryck, Tanja Lange, Chloe Martindale, Lorenz Panny, Joost Renes, CSIDH: an efficient post-quantum commutative group action, in Asiacrypt 2018 [55] (2018), 395–427. URL: https://csidh.isogeny.org/csidh-20181118.pdf. Citations in this document: §1.

[29] Cong Chen, Jeffrey Hoffstein, William Whyte, Zhenfei Zhang, NIST PQ Submission: NTRUEncrypt, a lattice based encryption algorithm (2017). URL: https://csrc.nist.gov/Projects/Post-Quantum-Cryptography/Round-1-Submissions. Citations in this document: §7.3.

[30] Tung Chou, McBits revisited, in CHES 2017 [33] (2017), 213–231. URL: https://tungchou.github.io/mcbits/. Citations in this document: §1.

[31] Benoît Daireaux, Véronique Maume-Deschamps, Brigitte Vallée, The Lyapunov tortoise and the dyadic hare, in [49] (2005), 71–94. URL: http://emis.ams.org/journals/DMTCS/pdfpapers/dmAD0108.pdf. Citations in this document: §E.5, §F.2.

[32] Jean Louis Dornstetter, On the equivalence between Berlekamp's and Euclid's algorithms, IEEE Transactions on Information Theory 33 (1987), 428–431. Citations in this document: §1.

[33] Wieland Fischer, Naofumi Homma (editors), Cryptographic hardware and embedded systems – CHES 2017 – 19th international conference, Taipei, Taiwan, September 25–28, 2017, proceedings, Lecture Notes in Computer Science 10529, Springer, 2017. ISBN 978-3-319-66786-7. See [30], [39].

[34] Hayato Fujii, Diego F. Aranha, Curve25519 for the Cortex-M4 and beyond, LATINCRYPT 2017, to appear (2017). URL: https://www.lasca.ic.unicamp.br/media/publications/paper39.pdf. Citations in this document: §1.1, §12.3.


[35] Philippe Gaborit, Carlos Aguilar Melchor, Cryptographic method for communicating confidential information, patent EP2537284B1 (2010). URL: https://patents.google.com/patent/EP2537284B1/en. Citations in this document: §7.4.

[36] Mariya Georgieva, Frédéric de Portzamparc, Toward secure implementation of McEliece decryption, in COSADE 2015 [48] (2015), 141–156. URL: https://eprint.iacr.org/2015/271. Citations in this document: §1.

[37] Guillaume Hanrot, François Morain, Emmanuel Thomé (editors), Algorithmic number theory, 9th international symposium, ANTS-IX, Nancy, France, July 19–23, 2010, proceedings, Lecture Notes in Computer Science 6197, 2010. ISBN 978-3-642-14517-9. See [25].

[38] Agnes E. Heydtmann, Jørn M. Jensen, On the equivalence of the Berlekamp-Massey and the Euclidean algorithms for decoding, IEEE Transactions on Information Theory 46 (2000), 2614–2624. Citations in this document: §1.

[39] Andreas Hülsing, Joost Rijneveld, John M. Schanck, Peter Schwabe, High-speed key encapsulation from NTRU, in CHES 2017 [33] (2017), 232–252. URL: https://eprint.iacr.org/2017/667. Citations in this document: §1, §1.1, §1.1, §1.3, §7, §7, §7.2, §7.2, §7.2, §7.2, §7.2, §7.2, §7.2, §7.2, §7.3, §7.3, §7.3, §7.3, §7.3.

[40] Burton S. Kaliski, The Montgomery inverse and its applications, IEEE Transactions on Computers 44 (1995), 1064–1065. Citations in this document: §1.

[41] Donald E. Knuth, The analysis of algorithms, in [1] (1971), 269–274. Citations in this document: §1, §1.4, §1.4.

[42] Adam Langley, CECPQ2 (2018). URL: https://www.imperialviolet.org/2018/12/12/cecpq2.html. Citations in this document: §7.4.

[43] Adam Langley, Add initial HRSS support (2018). URL: https://boringssl.googlesource.com/boringssl/+/7b935937b18215294e7dbf6404742855e3349092/crypto/hrss/hrss.c. Citations in this document: §7.2.

[44] Derrick H. Lehmer, Euclid's algorithm for large numbers, American Mathematical Monthly 45 (1938), 227–233. ISSN 0002-9890. URL: http://links.jstor.org/sici?sici=0002-9890(193804)45:4<227:EAFLN>2.0.CO;2-Y. Citations in this document: §1, §1.4, §12.1.

[45] Xianhui Lu, Yamin Liu, Dingding Jia, Haiyang Xue, Jingnan He, Zhenfei Zhang, LAC: lattice-based cryptosystems (2017). URL: https://csrc.nist.gov/Projects/Post-Quantum-Cryptography/Round-1-Submissions. Citations in this document: §1.

[46] Vadim Lyubashevsky, Chris Peikert, Oded Regev, On ideal lattices and learning with errors over rings, Journal of the ACM 60 (2013), Article 43, 35 pages. URL: https://eprint.iacr.org/2012/230. Citations in this document: §7.4.

[47] Vadim Lyubashevsky, Gregor Seiler, NTTRU: truly fast NTRU using NTT (2019). URL: https://eprint.iacr.org/2019/040. Citations in this document: §7.4.

[48] Stefan Mangard, Axel Y. Poschmann (editors), Constructive side-channel analysis and secure design – 6th international workshop, COSADE 2015, Berlin, Germany, April 13–14, 2015, revised selected papers, Lecture Notes in Computer Science 9064, Springer, 2015. ISBN 978-3-319-21475-7. See [36].


[49] Conrado Martínez (editor), 2005 international conference on analysis of algorithms: papers from the conference (AofA'05) held in Barcelona, June 6–10, 2005, The Association Discrete Mathematics & Theoretical Computer Science (DMTCS), Nancy, 2005. URL: http://www.emis.ams.org/journals/DMTCS/proceedings/dmAD01ind.html. See [31].

[50] James Massey, Shift-register synthesis and BCH decoding, IEEE Transactions on Information Theory 15 (1969), 122–127. ISSN 0018-9448. Citations in this document: §1.

[51] Todd D. Mateer, On the equivalence of the Berlekamp-Massey and the euclidean algorithms for algebraic decoding, in CWIT 2011 [2] (2011), 139–142. Citations in this document: §1.

[52] Niels Möller, On Schönhage's algorithm and subquadratic integer GCD computation, Mathematics of Computation 77 (2008), 589–607. URL: https://www.lysator.liu.se/~nisse/archive/S0025-5718-07-02017-0.pdf. Citations in this document: §1.4.

[53] Robert T. Moenck, Fast computation of GCDs, in STOC 1973 [5] (1973), 142–151. Citations in this document: §1.4, §1.4, §1.4, §1.4.

[54] Kaushik Nath, Palash Sarkar, Efficient inversion in (pseudo-)Mersenne prime order fields (2018). URL: https://eprint.iacr.org/2018/985. Citations in this document: §1.1, §12, §12.

[55] Thomas Peyrin, Steven D. Galbraith, Advances in cryptology – ASIACRYPT 2018 – 24th international conference on the theory and application of cryptology and information security, Brisbane, QLD, Australia, December 2–6, 2018, proceedings, part I, Lecture Notes in Computer Science 11272, Springer, 2018. ISBN 978-3-030-03325-5. See [28].

[56] Emmanuel Prouff, Patrick Schaumont (editors), Cryptographic hardware and embedded systems – CHES 2012 – 14th international workshop, Leuven, Belgium, September 9–12, 2012, proceedings, Lecture Notes in Computer Science 7428, Springer, 2012. ISBN 978-3-642-33026-1. See [17].

[57] Martin Roetteler, Michael Naehrig, Krysta M. Svore, Kristin E. Lauter, Quantum resource estimates for computing elliptic curve discrete logarithms, in ASIACRYPT 2017 [67] (2017), 241–270. URL: https://eprint.iacr.org/2017/598. Citations in this document: §1.1, §1.3.

[58] The Sage Developers (editor), SageMath, the Sage Mathematics Software System (Version 8.0), 2017. URL: https://www.sagemath.org. Citations in this document: §1.2, §1.2, §2.2, §5.

[59] John Schanck, Announcement of NTRU-HRSS-KEM and NTRUEncrypt merger (2018). URL: https://groups.google.com/a/list.nist.gov/d/msg/pqc-forum/SrFO_vK3xbI/mSmjY0HZCgAJ. Citations in this document: §7.3.

[60] Arnold Schönhage, Schnelle Berechnung von Kettenbruchentwicklungen, Acta Informatica 1 (1971), 139–144. ISSN 0001-5903. Citations in this document: §1, §1.4, §1.4.

[61] Joseph H. Silverman, Almost inverses and fast NTRU key creation, NTRU Cryptosystems (Technical Note #014) (1999). URL: https://assets.securityinnovation.com/static/downloads/NTRU/resources/NTRUTech014.pdf. Citations in this document: §7.3, §7.3.


[62] Damien Stehlé, Paul Zimmermann, A binary recursive gcd algorithm, in ANTS 2004 [26] (2004), 411–425. URL: https://perso.ens-lyon.fr/damien.stehle/BINARY.html. Citations in this document: §1.4, §1.4, §8.5, §E, §E.6, §E.6, §E.6, §F.2, §F.2, §F.2, §H, §H.

[63] Josef Stein, Computational problems associated with Racah algebra, Journal of Computational Physics 1 (1967), 397–405. Citations in this document: §1.

[64] Simon Stevin, L'arithmétique, Imprimerie de Christophle Plantin, 1585. URL: http://www.dwc.knaw.nl/pub/bronnen/Simon_Stevin-[II_B]_The_Principal_Works_of_Simon_Stevin_Mathematics.pdf. Citations in this document: §1.

[65] Volker Strassen, The computational complexity of continued fractions, in SYM-SAC 1981 [69] (1981), 51–67; see also newer version [66].

[66] Volker Strassen, The computational complexity of continued fractions, SIAM Journal on Computing 12 (1983), 1–27; see also older version [65]. ISSN 0097-5397. Citations in this document: §1.4, §1.4, §5.3, §5.5.

[67] Tsuyoshi Takagi, Thomas Peyrin (editors), Advances in cryptology – ASIACRYPT 2017 – 23rd international conference on the theory and applications of cryptology and information security, Hong Kong, China, December 3–7, 2017, proceedings, part II, Lecture Notes in Computer Science 10625, Springer, 2017. ISBN 978-3-319-70696-2. See [57].

[68] Emmanuel Thomé, Subquadratic computation of vector generating polynomials and improvement of the block Wiedemann algorithm, Journal of Symbolic Computation 33 (2002), 757–775. URL: https://hal.inria.fr/inria-00103417/document. Citations in this document: §1.4.

[69] Paul S. Wang (editor), SYM-SAC '81: proceedings of the 1981 ACM Symposium on Symbolic and Algebraic Computation, Snowbird, Utah, August 5–7, 1981, Association for Computing Machinery, New York, 1981. ISBN 0-89791-047-8. See [65].

[70] Moti Yung, Yevgeniy Dodis, Aggelos Kiayias, Tal Malkin (editors), Public key cryptography – 9th international conference on theory and practice in public-key cryptography, New York, NY, USA, April 24–26, 2006, proceedings, Lecture Notes in Computer Science 3958, Springer, 2006. ISBN 978-3-540-33851-2. See [10].

A. Proof of main gcd theorem for polynomials

We give two separate proofs of the facts stated earlier relating gcd and modular reciprocal to a particular iterate of divstep in the polynomial case.

• The second proof consists of a review of the Euclid–Stevin gcd algorithm (Appendix B); a description of exactly how the intermediate results in the Euclid–Stevin algorithm relate to iterates of divstep (Appendix C); and finally a translation of gcd/reciprocal results from the Euclid–Stevin context to the divstep context (Appendix D).

• The first proof, given in this appendix, is shorter: it works directly with divstep, without a detour through the Euclid–Stevin labeling of intermediate results.


Theorem A.1. Let k be a field. Let d be a positive integer. Let R0, R1 be elements of the polynomial ring k[x] with deg R0 = d > deg R1. Define f = x^d R0(1/x), g = x^{d−1} R1(1/x), and (δ_n, f_n, g_n) = divstep^n(1, f, g) for n ≥ 0. Then

  2 deg f_n ≤ 2d − 1 − n + δ_n for n ≥ 0,
  2 deg g_n ≤ 2d − 1 − n − δ_n for n ≥ 0.

Proof. Induct on n. If n = 0 then (δ_n, f_n, g_n) = (1, f, g), so 2 deg f_n = 2 deg f ≤ 2d = 2d − 1 − n + δ_n and 2 deg g_n = 2 deg g ≤ 2(d − 1) = 2d − 1 − n − δ_n.

Assume from now on that n ≥ 1. By the inductive hypothesis, 2 deg f_{n−1} ≤ 2d − n + δ_{n−1} and 2 deg g_{n−1} ≤ 2d − n − δ_{n−1}.

Case 1: δ_{n−1} ≤ 0. Then

  (δ_n, f_n, g_n) = (1 + δ_{n−1}, f_{n−1}, (f_{n−1}(0) g_{n−1} − g_{n−1}(0) f_{n−1})/x).

Both f_{n−1} and g_{n−1} have degree at most (2d − n − δ_{n−1})/2, so the polynomial x g_n = f_{n−1}(0) g_{n−1} − g_{n−1}(0) f_{n−1} also has degree at most (2d − n − δ_{n−1})/2, so 2 deg g_n ≤ 2d − n − δ_{n−1} − 2 = 2d − 1 − n − δ_n. Also 2 deg f_n = 2 deg f_{n−1} ≤ 2d − 1 − n + δ_n.

Case 2: δ_{n−1} > 0 and g_{n−1}(0) = 0. Then

  (δ_n, f_n, g_n) = (1 + δ_{n−1}, f_{n−1}, f_{n−1}(0) g_{n−1}/x),

so 2 deg g_n = 2 deg g_{n−1} − 2 ≤ 2d − n − δ_{n−1} − 2 = 2d − 1 − n − δ_n. As before, 2 deg f_n = 2 deg f_{n−1} ≤ 2d − 1 − n + δ_n.

Case 3: δ_{n−1} > 0 and g_{n−1}(0) ≠ 0. Then

  (δ_n, f_n, g_n) = (1 − δ_{n−1}, g_{n−1}, (g_{n−1}(0) f_{n−1} − f_{n−1}(0) g_{n−1})/x).

This time the polynomial x g_n = g_{n−1}(0) f_{n−1} − f_{n−1}(0) g_{n−1} also has degree at most (2d − n + δ_{n−1})/2, so 2 deg g_n ≤ 2d − n + δ_{n−1} − 2 = 2d − 1 − n − δ_n. Also 2 deg f_n = 2 deg g_{n−1} ≤ 2d − n − δ_{n−1} = 2d − 1 − n + δ_n.

Theorem A.2. In the situation of Theorem A.1, define T_n = T(δ_n, f_n, g_n) and

( u_n , v_n ; q_n , r_n ) = T_{n−1} ··· T_0.

Then x^n(u_n r_n − v_n q_n) ∈ k* for n ≥ 0; x^{n−1}u_n ∈ k[x] for n ≥ 1; x^{n−1}v_n, x^n q_n, x^n r_n ∈ k[x] for n ≥ 0; and

2 deg(x^{n−1}u_n) ≤ n − 3 + δ_n for n ≥ 1,
2 deg(x^{n−1}v_n) ≤ n − 1 + δ_n for n ≥ 0,
2 deg(x^n q_n) ≤ n − 1 − δ_n for n ≥ 0,
2 deg(x^n r_n) ≤ n + 1 − δ_n for n ≥ 0.

Proof. Each matrix T_n has determinant in (1/x)k* by construction, so the product of n such matrices has determinant in (1/x^n)k*. In particular, u_n r_n − v_n q_n ∈ (1/x^n)k*.

Induct on n. For n = 0 we have δ_n = 1, u_n = 1, v_n = 0, q_n = 0, r_n = 1, so x^{n−1}v_n = 0 ∈ k[x], x^n q_n = 0 ∈ k[x], x^n r_n = 1 ∈ k[x], 2 deg(x^{n−1}v_n) = −∞ ≤ n − 1 + δ_n, 2 deg(x^n q_n) = −∞ ≤ n − 1 − δ_n, and 2 deg(x^n r_n) = 0 ≤ n + 1 − δ_n.

Assume from now on that n ≥ 1. By Theorem 4.3, u_n, v_n ∈ (1/x^{n−1})k[x] and q_n, r_n ∈ (1/x^n)k[x], so all that remains is to prove the degree bounds.

Case 1: δ_{n−1} ≤ 0. Then δ_n = 1 + δ_{n−1}, u_n = u_{n−1}, v_n = v_{n−1}, q_n = (f_{n−1}(0)q_{n−1} − g_{n−1}(0)u_{n−1})/x, and r_n = (f_{n−1}(0)r_{n−1} − g_{n−1}(0)v_{n−1})/x.


Note that n > 1 since δ_0 > 0. By the inductive hypothesis, 2 deg(x^{n−2}u_{n−1}) ≤ n − 4 + δ_{n−1}, so 2 deg(x^{n−1}u_n) ≤ n − 2 + δ_{n−1} = n − 3 + δ_n. Also 2 deg(x^{n−2}v_{n−1}) ≤ n − 2 + δ_{n−1}, so 2 deg(x^{n−1}v_n) ≤ n + δ_{n−1} = n − 1 + δ_n.

For q_n, note that both x^{n−1}q_{n−1} and x^{n−1}u_{n−1} have degree at most (n − 2 − δ_{n−1})/2, so the same is true for their k-linear combination x^n q_n. Hence 2 deg(x^n q_n) ≤ n − 2 − δ_{n−1} = n − 1 − δ_n. Similarly 2 deg(x^n r_n) ≤ n − δ_{n−1} = n + 1 − δ_n.

Case 2: δ_{n−1} > 0 and g_{n−1}(0) = 0. Then δ_n = 1 + δ_{n−1}, u_n = u_{n−1}, v_n = v_{n−1}, q_n = f_{n−1}(0)q_{n−1}/x, and r_n = f_{n−1}(0)r_{n−1}/x.

If n = 1 then u_n = u_0 = 1, so 2 deg(x^{n−1}u_n) = 0 = n − 3 + δ_n. Otherwise 2 deg(x^{n−2}u_{n−1}) ≤ n − 4 + δ_{n−1} by the inductive hypothesis, so 2 deg(x^{n−1}u_n) ≤ n − 3 + δ_n as above.

Also 2 deg(x^{n−1}v_n) ≤ n − 1 + δ_n as above; 2 deg(x^n q_n) = 2 deg(x^{n−1}q_{n−1}) ≤ n − 2 − δ_{n−1} = n − 1 − δ_n; and 2 deg(x^n r_n) = 2 deg(x^{n−1}r_{n−1}) ≤ n − δ_{n−1} = n + 1 − δ_n.

Case 3: δ_{n−1} > 0 and g_{n−1}(0) ≠ 0. Then δ_n = 1 − δ_{n−1}, u_n = q_{n−1}, v_n = r_{n−1}, q_n = (g_{n−1}(0)u_{n−1} − f_{n−1}(0)q_{n−1})/x, and r_n = (g_{n−1}(0)v_{n−1} − f_{n−1}(0)r_{n−1})/x.

If n = 1 then u_n = q_0 = 0, so 2 deg(x^{n−1}u_n) = −∞ ≤ n − 3 + δ_n. Otherwise 2 deg(x^{n−1}q_{n−1}) ≤ n − 2 − δ_{n−1} by the inductive hypothesis, so 2 deg(x^{n−1}u_n) ≤ n − 2 − δ_{n−1} = n − 3 + δ_n. Similarly 2 deg(x^{n−1}r_{n−1}) ≤ n − δ_{n−1} by the inductive hypothesis, so 2 deg(x^{n−1}v_n) ≤ n − δ_{n−1} = n − 1 + δ_n.

Finally, for q_n, note that both x^{n−1}u_{n−1} and x^{n−1}q_{n−1} have degree at most (n − 2 + δ_{n−1})/2, so the same is true for their k-linear combination x^n q_n, so 2 deg(x^n q_n) ≤ n − 2 + δ_{n−1} = n − 1 − δ_n. Similarly 2 deg(x^n r_n) ≤ n + δ_{n−1} = n + 1 − δ_n.

First proof of Theorem 6.2. Write φ = f_{2d−1}. By Theorem A.1, 2 deg φ ≤ δ_{2d−1} and 2 deg g_{2d−1} ≤ −δ_{2d−1}. Each f_n is nonzero by construction, so deg φ ≥ 0, so δ_{2d−1} ≥ 0.

By Theorem A.2, x^{2d−2}u_{2d−1} is a polynomial of degree at most d − 2 + δ_{2d−1}/2. Define C_0 = x^{−d+δ_{2d−1}/2} u_{2d−1}(1/x); then C_0 is a polynomial of degree at most d − 2 + δ_{2d−1}/2. Similarly define C_1 = x^{−d+1+δ_{2d−1}/2} v_{2d−1}(1/x), and observe that C_1 is a polynomial of degree at most d − 1 + δ_{2d−1}/2.

Multiply C_0 by R_0 = x^d f(1/x), multiply C_1 by R_1 = x^{d−1}g(1/x), and add, to obtain

C_0 R_0 + C_1 R_1 = x^{δ_{2d−1}/2}(u_{2d−1}(1/x)f(1/x) + v_{2d−1}(1/x)g(1/x)) = x^{δ_{2d−1}/2} φ(1/x)

by Theorem 4.2.

Define Γ as the monic polynomial x^{δ_{2d−1}/2} φ(1/x)/φ(0) of degree δ_{2d−1}/2. We have just shown that Γ ∈ R_0 k[x] + R_1 k[x]: specifically, Γ = (C_0/φ(0))R_0 + (C_1/φ(0))R_1. To finish the proof, we will show that R_0 ∈ Γ k[x]. This forces Γ = gcd{R_0, R_1} = G, which in turn implies deg G = δ_{2d−1}/2 and V = C_1/φ(0).

Observe that δ_{2d} = 1 + δ_{2d−1} and f_{2d} = φ: the "swapped" case is impossible here, since δ_{2d−1} > 0 would imply g_{2d−1} = 0. By Theorem A.1 again, 2 deg g_{2d} ≤ −1 − δ_{2d} = −2 − δ_{2d−1}, so g_{2d} = 0.

By Theorem 4.2, φ = f_{2d} = u_{2d}f + v_{2d}g and 0 = g_{2d} = q_{2d}f + r_{2d}g, so x^{2d}r_{2d}φ = Δf, where Δ = x^{2d}(u_{2d}r_{2d} − v_{2d}q_{2d}). By Theorem A.2, Δ ∈ k* and p = x^{2d}r_{2d} is a polynomial of degree at most (2d + 1 − δ_{2d})/2 = d − δ_{2d−1}/2. The polynomials x^{d−δ_{2d−1}/2} p(1/x) and x^{δ_{2d−1}/2} φ(1/x) = φ(0)Γ have product Δ x^d f(1/x) = Δ R_0, so Γ divides R_0 as claimed.

B  Review of the Euclid–Stevin algorithm

This appendix reviews the Euclid–Stevin algorithm to compute polynomial gcd, including the extended version of the algorithm that writes each remainder as a linear combination of the two inputs.


Theorem B.1 (the Euclid–Stevin algorithm). Let k be a field. Let R_0, R_1 be elements of the polynomial ring k[x]. Recursively define a finite sequence of polynomials R_2, R_3, ..., R_r ∈ k[x] as follows: if i ≥ 1 and R_i = 0 then r = i; if i ≥ 1 and R_i ≠ 0 then R_{i+1} = R_{i−1} mod R_i. Define d_i = deg R_i and ℓ_i = R_i[d_i]. Then:

• d_1 > d_2 > ··· > d_r = −∞;

• ℓ_i ≠ 0 if 1 ≤ i ≤ r − 1;

• if ℓ_{r−1} ≠ 0 then gcd{R_0, R_1} = R_{r−1}/ℓ_{r−1}; and

• if ℓ_{r−1} = 0 then R_0 = R_1 = gcd{R_0, R_1} = 0 and r = 1.

Proof. If 1 ≤ i ≤ r − 1 then R_i ≠ 0, so ℓ_i = R_i[deg R_i] ≠ 0. Also R_{i+1} = R_{i−1} mod R_i, so deg R_{i+1} < deg R_i, i.e., d_{i+1} < d_i. Hence d_1 > d_2 > ··· > d_r, and d_r = deg R_r = deg 0 = −∞.

If ℓ_{r−1} = 0 then r must be 1, so R_1 = 0 (since R_r = 0) and R_0 = 0 (since ℓ_0 = 0), so gcd{R_0, R_1} = 0. Assume from now on that ℓ_{r−1} ≠ 0. The divisibility of R_{i+1} − R_{i−1} by R_i implies gcd{R_{i−1}, R_i} = gcd{R_i, R_{i+1}}, so gcd{R_0, R_1} = gcd{R_1, R_2} = ··· = gcd{R_{r−1}, R_r} = gcd{R_{r−1}, 0} = R_{r−1}/ℓ_{r−1}.

Theorem B.2 (the extended Euclid–Stevin algorithm). In the situation of Theorem B.1, define U_0, V_0, ..., U_r, V_r ∈ k[x] as follows: U_0 = 1; V_0 = 0; U_1 = 0; V_1 = 1; U_{i+1} = U_{i−1} − ⌊R_{i−1}/R_i⌋U_i for i ∈ {1, ..., r − 1}; V_{i+1} = V_{i−1} − ⌊R_{i−1}/R_i⌋V_i for i ∈ {1, ..., r − 1}. Then:

• R_i = U_i R_0 + V_i R_1 for i ∈ {0, ..., r};

• U_i V_{i+1} − U_{i+1} V_i = (−1)^i for i ∈ {0, ..., r − 1};

• V_{i+1} R_i − V_i R_{i+1} = (−1)^i R_0 for i ∈ {0, ..., r − 1}; and

• if d_0 > d_1 then deg V_{i+1} = d_0 − d_i for i ∈ {0, ..., r − 1}.

Proof. U_0 R_0 + V_0 R_1 = 1·R_0 + 0·R_1 = R_0 and U_1 R_0 + V_1 R_1 = 0·R_0 + 1·R_1 = R_1. For i ∈ {1, ..., r − 1}, assume inductively that R_{i−1} = U_{i−1}R_0 + V_{i−1}R_1 and R_i = U_i R_0 + V_i R_1, and write Q_i = ⌊R_{i−1}/R_i⌋. Then R_{i+1} = R_{i−1} mod R_i = R_{i−1} − Q_i R_i = (U_{i−1} − Q_i U_i)R_0 + (V_{i−1} − Q_i V_i)R_1 = U_{i+1}R_0 + V_{i+1}R_1.

If i = 0 then U_i V_{i+1} − U_{i+1}V_i = 1·1 − 0·0 = 1 = (−1)^i. For i ∈ {1, ..., r − 1}, assume inductively that U_{i−1}V_i − U_i V_{i−1} = (−1)^{i−1}; then U_i V_{i+1} − U_{i+1}V_i = U_i(V_{i−1} − Q_i V_i) − (U_{i−1} − Q_i U_i)V_i = −(U_{i−1}V_i − U_i V_{i−1}) = (−1)^i.

Multiply R_i = U_i R_0 + V_i R_1 by V_{i+1}, multiply R_{i+1} = U_{i+1}R_0 + V_{i+1}R_1 by V_i, and subtract, to see that V_{i+1}R_i − V_i R_{i+1} = (−1)^i R_0.

Finally, assume d_0 > d_1. If i = 0 then deg V_{i+1} = deg 1 = 0 = d_0 − d_i. For i ∈ {1, ..., r − 1}, assume inductively that deg V_i = d_0 − d_{i−1}. Then deg(V_i R_{i+1}) < d_0, since deg R_{i+1} = d_{i+1} < d_{i−1}, so deg(V_{i+1}R_i) = deg(V_i R_{i+1} + (−1)^i R_0) = d_0, so deg V_{i+1} = d_0 − d_i.
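As a concrete companion to Theorems B.1 and B.2, here is a brief Sage sketch (ours) of the extended Euclid–Stevin computation; it simply records the remainders R_i together with U_i, V_i so that R_i = U_i R_0 + V_i R_1 at every index. The example field and polynomials are arbitrary.

def euclid_stevin(R0, R1):
    P = R0.parent()
    R, U, V = [R0, R1], [P(1), P(0)], [P(0), P(1)]
    i = 1
    while R[i] != 0:
        Q = R[i-1] // R[i]                    # polynomial quotient, floor(R_{i-1}/R_i)
        R.append(R[i-1] - Q*R[i])             # R_{i+1} = R_{i-1} mod R_i
        U.append(U[i-1] - Q*U[i])
        V.append(V[i-1] - Q*V[i])
        i += 1
    return R, U, V

P.<x> = GF(7)[]
R, U, V = euclid_stevin(x^3 + 2*x + 1, x^2 + 5)
assert all(R[i] == U[i]*(x^3 + 2*x + 1) + V[i]*(x^2 + 5) for i in range(len(R)))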

C  Euclid–Stevin results as iterates of divstep

If the Euclid–Stevin algorithm is written as a sequence of polynomial divisions, and each polynomial division is written as a sequence of coefficient-elimination steps, then each coefficient-elimination step corresponds to one application of divstep. The exact correspondence is stated in Theorem C.2. This generalizes the known Berlekamp–Massey equivalence mentioned in Section 1.

Theorem C.1 is a warmup, showing how a single polynomial division relates to a sequence of division steps.


Theorem C.1 (polynomial division as a sequence of division steps). Let k be a field. Let R_0, R_1 be elements of the polynomial ring k[x]. Write d_0 = deg R_0 and d_1 = deg R_1. Assume that d_0 > d_1 ≥ 0. Then

divstep^{2d_0−2d_1}(1, x^{d_0}R_0(1/x), x^{d_0−1}R_1(1/x)) = (1, ℓ_0^{d_0−d_1−1} x^{d_1}R_1(1/x), (ℓ_0^{d_0−d_1−1}ℓ_1)^{d_0−d_1+1} x^{d_1−1}R_2(1/x)),

where ℓ_0 = R_0[d_0], ℓ_1 = R_1[d_1], and R_2 = R_0 mod R_1.

Proof. Part 1: the first d_0 − d_1 − 1 iterations. If δ ∈ {1, 2, ..., d_0 − d_1 − 1} then d_0 − δ > d_1, so R_1[d_0 − δ] = 0, so the polynomial ℓ_0^{δ−1} x^{d_0−δ}R_1(1/x) has constant coefficient 0. The polynomial x^{d_0}R_0(1/x) has constant coefficient ℓ_0. Hence

divstep(δ, x^{d_0}R_0(1/x), ℓ_0^{δ−1} x^{d_0−δ}R_1(1/x)) = (1 + δ, x^{d_0}R_0(1/x), ℓ_0^{δ} x^{d_0−δ−1}R_1(1/x))

by definition of divstep. By induction,

divstep^{d_0−d_1−1}(1, x^{d_0}R_0(1/x), x^{d_0−1}R_1(1/x))
= (d_0 − d_1, x^{d_0}R_0(1/x), ℓ_0^{d_0−d_1−1} x^{d_1}R_1(1/x))
= (d_0 − d_1, x^{d_0}R_0(1/x), x^{d_1}R′_1(1/x)),

where R′_1 = ℓ_0^{d_0−d_1−1}R_1. Note that R′_1[d_1] = ℓ′_1, where ℓ′_1 = ℓ_0^{d_0−d_1−1}ℓ_1.

Part 2: the swap iteration and the last d_0 − d_1 iterations. Define

Z_{d_0−d_1} = R_0 mod x^{d_0−d_1}R′_1,
Z_{d_0−d_1+1} = R_0 mod x^{d_0−d_1−1}R′_1,
...,
Z_{2d_0−2d_1} = R_0 mod R′_1 = R_0 mod R_1 = R_2.

Then

Z_{d_0−d_1} = R_0 − (R_0[d_0]/ℓ′_1) x^{d_0−d_1}R′_1,
Z_{d_0−d_1+1} = Z_{d_0−d_1} − (Z_{d_0−d_1}[d_0 − 1]/ℓ′_1) x^{d_0−d_1−1}R′_1,
...,
Z_{2d_0−2d_1} = Z_{2d_0−2d_1−1} − (Z_{2d_0−2d_1−1}[d_1]/ℓ′_1) R′_1.

Substitute 1/x for x to see that

x^{d_0}Z_{d_0−d_1}(1/x) = x^{d_0}R_0(1/x) − (R_0[d_0]/ℓ′_1) x^{d_1}R′_1(1/x),
x^{d_0−1}Z_{d_0−d_1+1}(1/x) = x^{d_0−1}Z_{d_0−d_1}(1/x) − (Z_{d_0−d_1}[d_0 − 1]/ℓ′_1) x^{d_1}R′_1(1/x),
...,
x^{d_1}Z_{2d_0−2d_1}(1/x) = x^{d_1}Z_{2d_0−2d_1−1}(1/x) − (Z_{2d_0−2d_1−1}[d_1]/ℓ′_1) x^{d_1}R′_1(1/x).

We will use these equations to show that the (d_0 − d_1)th, ..., (2d_0 − 2d_1)th iterates of divstep compute ℓ′_1 x^{d_0−1}Z_{d_0−d_1}(1/x), ..., (ℓ′_1)^{d_0−d_1+1} x^{d_1−1}Z_{2d_0−2d_1}(1/x) respectively.

The constant coefficients of x^{d_0}R_0(1/x) and x^{d_1}R′_1(1/x) are R_0[d_0] and ℓ′_1 ≠ 0 respectively, and ℓ′_1 x^{d_0}R_0(1/x) − R_0[d_0] x^{d_1}R′_1(1/x) = ℓ′_1 x^{d_0}Z_{d_0−d_1}(1/x), so

divstep(d_0 − d_1, x^{d_0}R_0(1/x), x^{d_1}R′_1(1/x)) = (1 − (d_0 − d_1), x^{d_1}R′_1(1/x), ℓ′_1 x^{d_0−1}Z_{d_0−d_1}(1/x)).


The constant coefficients of x^{d_1}R′_1(1/x) and ℓ′_1 x^{d_0−1}Z_{d_0−d_1}(1/x) are ℓ′_1 and ℓ′_1 Z_{d_0−d_1}[d_0 − 1] respectively, and

(ℓ′_1)^2 x^{d_0−1}Z_{d_0−d_1}(1/x) − ℓ′_1 Z_{d_0−d_1}[d_0 − 1] x^{d_1}R′_1(1/x) = (ℓ′_1)^2 x^{d_0−1}Z_{d_0−d_1+1}(1/x),

so

divstep(1 − (d_0 − d_1), x^{d_1}R′_1(1/x), ℓ′_1 x^{d_0−1}Z_{d_0−d_1}(1/x)) = (2 − (d_0 − d_1), x^{d_1}R′_1(1/x), (ℓ′_1)^2 x^{d_0−2}Z_{d_0−d_1+1}(1/x)).

Continue in this way to see that

divstep^i(d_0 − d_1, x^{d_0}R_0(1/x), x^{d_1}R′_1(1/x)) = (i − (d_0 − d_1), x^{d_1}R′_1(1/x), (ℓ′_1)^i x^{d_0−i}Z_{d_0−d_1+i−1}(1/x))

for 1 ≤ i ≤ d_0 − d_1 + 1. In particular,

divstep^{2d_0−2d_1}(1, x^{d_0}R_0(1/x), x^{d_0−1}R_1(1/x))
= divstep^{d_0−d_1+1}(d_0 − d_1, x^{d_0}R_0(1/x), x^{d_1}R′_1(1/x))
= (1, x^{d_1}R′_1(1/x), (ℓ′_1)^{d_0−d_1+1} x^{d_1−1}R_2(1/x))
= (1, ℓ_0^{d_0−d_1−1} x^{d_1}R_1(1/x), (ℓ_0^{d_0−d_1−1}ℓ_1)^{d_0−d_1+1} x^{d_1−1}R_2(1/x))

as claimed.
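Theorem C.1 is easy to check numerically. The following sketch (ours) reuses the divstep transcription from Appendix A and one arbitrary example over GF(7); the assertion encodes exactly the displayed conclusion of the theorem.

P.<x> = GF(7)[]
R0, R1 = x^3 + 2*x + 1, x^2 + 5
d0, d1 = R0.degree(), R1.degree()
l0, l1 = R0.leading_coefficient(), R1.leading_coefficient()
R2 = R0 % R1
state = (1, R0.reverse(d0), R1.reverse(d0 - 1))
for _ in range(2*d0 - 2*d1):
    state = divstep(*state)
assert state == (1, l0^(d0 - d1 - 1) * R1.reverse(d1),
                 (l0^(d0 - d1 - 1) * l1)^(d0 - d1 + 1) * R2.reverse(d1 - 1))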

Theorem C.2. In the situation of Theorem B.2, assume that d_0 > d_1. Define f = x^{d_0}R_0(1/x) and g = x^{d_0−1}R_1(1/x). Then f, g ∈ k[x]. Define (δ_n, f_n, g_n) = divstep^n(1, f, g) and T_n = T(δ_n, f_n, g_n).

Part A. If i ∈ {0, 1, ..., r − 2} and 2d_0 − 2d_i ≤ n < 2d_0 − d_i − d_{i+1}, then

δ_n = 1 + n − (2d_0 − 2d_i) > 0,
f_n = a_n x^{d_i}R_i(1/x),
g_n = b_n x^{2d_0−d_i−n−1}R_{i+1}(1/x),

T_{n−1} ··· T_0 = ( a_n (1/x)^{d_0−d_i} U_i(1/x) , a_n (1/x)^{d_0−d_i−1} V_i(1/x) ; b_n (1/x)^{n−(d_0−d_i)+1} U_{i+1}(1/x) , b_n (1/x)^{n−(d_0−d_i)} V_{i+1}(1/x) )

for some a_n, b_n ∈ k*.

Part B. If i ∈ {0, 1, ..., r − 2} and 2d_0 − d_i − d_{i+1} ≤ n < 2d_0 − 2d_{i+1}, then

δ_n = 1 + n − (2d_0 − 2d_{i+1}) ≤ 0,
f_n = a_n x^{d_{i+1}}R_{i+1}(1/x),
g_n = b_n x^{2d_0−d_{i+1}−n−1}Z_n(1/x),

T_{n−1} ··· T_0 = ( a_n (1/x)^{d_0−d_{i+1}} U_{i+1}(1/x) , a_n (1/x)^{d_0−d_{i+1}−1} V_{i+1}(1/x) ; b_n (1/x)^{n−(d_0−d_{i+1})+1} X_n(1/x) , b_n (1/x)^{n−(d_0−d_{i+1})} Y_n(1/x) )

for some a_n, b_n ∈ k*, where

Z_n = R_i mod x^{2d_0−2d_{i+1}−n}R_{i+1},
X_n = U_i − ⌊R_i/(x^{2d_0−2d_{i+1}−n}R_{i+1})⌋ x^{2d_0−2d_{i+1}−n}U_{i+1},
Y_n = V_i − ⌊R_i/(x^{2d_0−2d_{i+1}−n}R_{i+1})⌋ x^{2d_0−2d_{i+1}−n}V_{i+1}.


Part C. If n ≥ 2d_0 − 2d_{r−1}, then

δ_n = 1 + n − (2d_0 − 2d_{r−1}) > 0,
f_n = a_n x^{d_{r−1}}R_{r−1}(1/x),
g_n = 0,

T_{n−1} ··· T_0 = ( a_n (1/x)^{d_0−d_{r−1}} U_{r−1}(1/x) , a_n (1/x)^{d_0−d_{r−1}−1} V_{r−1}(1/x) ; b_n (1/x)^{n−(d_0−d_{r−1})+1} U_r(1/x) , b_n (1/x)^{n−(d_0−d_{r−1})} V_r(1/x) )

for some a_n, b_n ∈ k*.

The proof also gives recursive formulas for a_n, b_n: namely, (a_0, b_0) = (1, 1) for the base case; (a_n, b_n) = (b_{n−1}, g_{n−1}(0)a_{n−1}) for the swapped case; and (a_n, b_n) = (a_{n−1}, f_{n−1}(0)b_{n−1}) for the unswapped case. One can, if desired, augment divstep to track a_n and b_n; these are the denominators eliminated in a "fraction-free" computation.
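A minimal sketch (ours) of the augmentation just described—tracking (a_n, b_n) alongside divstep via the stated recursions—might look as follows; divstep here is the polynomial transcription from Appendix A.

def divstep_ab(delta, f, g, a, b):
    if delta > 0 and g(0) != 0:            # swapped case
        a, b = b, g(0)*a
    else:                                   # unswapped case
        a, b = a, f(0)*b
    return divstep(delta, f, g) + (a, b)    # (delta_n, f_n, g_n, a_n, b_n)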

Proof. By hypothesis deg R_0 = d_0 > d_1 = deg R_1. Hence d_0 is a nonnegative integer and f = R_0[d_0] + R_0[d_0 − 1]x + ··· + R_0[0]x^{d_0} ∈ k[x]. Similarly g = R_1[d_0 − 1] + R_1[d_0 − 2]x + ··· + R_1[0]x^{d_0−1} ∈ k[x]; this is also true for d_0 = 0, with R_1 = 0 and g = 0.

We induct on n, and split the analysis of n into six cases. There are three different formulas (A, B, C) applicable to different ranges of n, and within each range we consider the first n separately. For brevity we write R̄_i = R_i(1/x), Ū_i = U_i(1/x), V̄_i = V_i(1/x), X̄_n = X_n(1/x), Ȳ_n = Y_n(1/x), Z̄_n = Z_n(1/x).

Case A1: n = 2d_0 − 2d_i for some i ∈ {0, 1, ..., r − 2}.

If n = 0 then i = 0. By assumption δ_n = 1 as claimed; f_n = f = x^{d_0}R̄_0, so f_n = a_n x^{d_0}R̄_0 as claimed, where a_n = 1; g_n = g = x^{d_0−1}R̄_1, so g_n = b_n x^{d_0−1}R̄_1 as claimed, where b_n = 1. Also U_i = 1, U_{i+1} = 0, V_i = 0, V_{i+1} = 1, and d_0 − d_i = 0, so

( a_n (1/x)^{d_0−d_i} Ū_i , a_n (1/x)^{d_0−d_i−1} V̄_i ; b_n (1/x)^{d_0−d_i+1} Ū_{i+1} , b_n (1/x)^{d_0−d_i} V̄_{i+1} ) = ( 1 , 0 ; 0 , 1 ) = T_{n−1} ··· T_0

as claimed.

Otherwise n > 0, so i > 0. Now 2d_0 − d_{i−1} − d_i ≤ n − 1 < 2d_0 − 2d_i, so by the inductive hypothesis δ_{n−1} = 1 + (n − 1) − (2d_0 − 2d_i) = 0; f_{n−1} = a_{n−1}x^{d_i}R̄_i for some a_{n−1} ∈ k*; g_{n−1} = b_{n−1}x^{d_i}Z̄_{n−1} for some b_{n−1} ∈ k*, where Z_{n−1} = R_{i−1} mod xR_i; and

T_{n−2} ··· T_0 = ( a_{n−1}(1/x)^{d_0−d_i} Ū_i , a_{n−1}(1/x)^{d_0−d_i−1} V̄_i ; b_{n−1}(1/x)^{d_0−d_i} X̄_{n−1} , b_{n−1}(1/x)^{d_0−d_i−1} Ȳ_{n−1} ),

where X_{n−1} = U_{i−1} − ⌊R_{i−1}/(xR_i)⌋xU_i and Y_{n−1} = V_{i−1} − ⌊R_{i−1}/(xR_i)⌋xV_i.

Note that deg Z_{n−1} ≤ d_i. Observe that

R_{i+1} = R_{i−1} mod R_i = Z_{n−1} − (Z_{n−1}[d_i]/ℓ_i)R_i,
U_{i+1} = X_{n−1} − (Z_{n−1}[d_i]/ℓ_i)U_i,
V_{i+1} = Y_{n−1} − (Z_{n−1}[d_i]/ℓ_i)V_i.

The point is that subtracting (Z_{n−1}[d_i]/ℓ_i)R_i from Z_{n−1} eliminates the coefficient of x^{d_i} and produces a polynomial of degree at most d_i − 1, namely Z_{n−1} mod R_i = R_{i−1} mod R_i.

Starting from R_{i+1} = Z_{n−1} − (Z_{n−1}[d_i]/ℓ_i)R_i, substitute 1/x for x and multiply by a_{n−1}b_{n−1}ℓ_i x^{d_i} to see that

a_{n−1}b_{n−1}ℓ_i x^{d_i}R̄_{i+1} = a_{n−1}ℓ_i g_{n−1} − b_{n−1}Z_{n−1}[d_i] f_{n−1} = f_{n−1}(0)g_{n−1} − g_{n−1}(0)f_{n−1}.


Now

(δ_n, f_n, g_n) = divstep(δ_{n−1}, f_{n−1}, g_{n−1}) = (1 + δ_{n−1}, f_{n−1}, (f_{n−1}(0)g_{n−1} − g_{n−1}(0)f_{n−1})/x).

Hence δ_n = 1 = 1 + n − (2d_0 − 2d_i) > 0 as claimed; f_n = a_n x^{d_i}R̄_i, where a_n = a_{n−1} ∈ k*, as claimed; and g_n = b_n x^{d_i−1}R̄_{i+1}, where b_n = a_{n−1}b_{n−1}ℓ_i ∈ k*, as claimed.

Finally, Ū_{i+1} = X̄_{n−1} − (Z_{n−1}[d_i]/ℓ_i)Ū_i and V̄_{i+1} = Ȳ_{n−1} − (Z_{n−1}[d_i]/ℓ_i)V̄_i. Substitute these into the matrix

( a_n (1/x)^{d_0−d_i} Ū_i , a_n (1/x)^{d_0−d_i−1} V̄_i ; b_n (1/x)^{d_0−d_i+1} Ū_{i+1} , b_n (1/x)^{d_0−d_i} V̄_{i+1} ),

along with a_n = a_{n−1} and b_n = a_{n−1}b_{n−1}ℓ_i, to see that this matrix is

T_{n−1} = ( 1 , 0 ; −b_{n−1}Z_{n−1}[d_i]/x , a_{n−1}ℓ_i/x )

times the matrix T_{n−2} ··· T_0 shown above.

Case A2: 2d_0 − 2d_i < n < 2d_0 − d_i − d_{i+1} for some i ∈ {0, 1, ..., r − 2}.

By the inductive hypothesis, δ_{n−1} = n − (2d_0 − 2d_i) > 0; f_{n−1} = a_{n−1}x^{d_i}R̄_i, where a_{n−1} ∈ k*; g_{n−1} = b_{n−1}x^{2d_0−d_i−n}R̄_{i+1}, where b_{n−1} ∈ k*; and

T_{n−2} ··· T_0 = ( a_{n−1}(1/x)^{d_0−d_i} Ū_i , a_{n−1}(1/x)^{d_0−d_i−1} V̄_i ; b_{n−1}(1/x)^{n−(d_0−d_i)} Ū_{i+1} , b_{n−1}(1/x)^{n−(d_0−d_i)−1} V̄_{i+1} ).

Note that g_{n−1}(0) = b_{n−1}R_{i+1}[2d_0 − d_i − n] = 0, since 2d_0 − d_i − n > d_{i+1} = deg R_{i+1}. Thus

(δ_n, f_n, g_n) = divstep(δ_{n−1}, f_{n−1}, g_{n−1}) = (1 + δ_{n−1}, f_{n−1}, f_{n−1}(0)g_{n−1}/x).

Hence δ_n = 1 + n − (2d_0 − 2d_i) > 0 as claimed; f_n = a_n x^{d_i}R̄_i, where a_n = a_{n−1} ∈ k*, as claimed; and g_n = b_n x^{2d_0−d_i−n−1}R̄_{i+1}, where b_n = f_{n−1}(0)b_{n−1} = a_{n−1}b_{n−1}ℓ_i ∈ k*, as claimed.

Also T_{n−1} = ( 1 , 0 ; 0 , a_{n−1}ℓ_i/x ). Multiply by T_{n−2} ··· T_0 above to see that

T_{n−1} ··· T_0 = ( a_n (1/x)^{d_0−d_i} Ū_i , a_n (1/x)^{d_0−d_i−1} V̄_i ; b_n (1/x)^{n−(d_0−d_i)+1} Ū_{i+1} , b_n (1/x)^{n−(d_0−d_i)} V̄_{i+1} )

as claimed.

Case B1: n = 2d_0 − d_i − d_{i+1} for some i ∈ {0, 1, ..., r − 2}.

Note that 2d_0 − 2d_i ≤ n − 1 < 2d_0 − d_i − d_{i+1}. By the inductive hypothesis, δ_{n−1} = d_i − d_{i+1} > 0; f_{n−1} = a_{n−1}x^{d_i}R̄_i, where a_{n−1} ∈ k*; g_{n−1} = b_{n−1}x^{d_{i+1}}R̄_{i+1}, where b_{n−1} ∈ k*; and

T_{n−2} ··· T_0 = ( a_{n−1}(1/x)^{d_0−d_i} Ū_i , a_{n−1}(1/x)^{d_0−d_i−1} V̄_i ; b_{n−1}(1/x)^{d_0−d_{i+1}} Ū_{i+1} , b_{n−1}(1/x)^{d_0−d_{i+1}−1} V̄_{i+1} ).

Observe that

Z_n = R_i − (ℓ_i/ℓ_{i+1}) x^{d_i−d_{i+1}}R_{i+1},
X_n = U_i − (ℓ_i/ℓ_{i+1}) x^{d_i−d_{i+1}}U_{i+1},
Y_n = V_i − (ℓ_i/ℓ_{i+1}) x^{d_i−d_{i+1}}V_{i+1}.

Substitute 1/x for x in the Z_n equation and multiply by a_{n−1}b_{n−1}ℓ_{i+1}x^{d_i} to see that

a_{n−1}b_{n−1}ℓ_{i+1}x^{d_i}Z̄_n = b_{n−1}ℓ_{i+1}f_{n−1} − a_{n−1}ℓ_i g_{n−1} = g_{n−1}(0)f_{n−1} − f_{n−1}(0)g_{n−1}.


Also note that g_{n−1}(0) = b_{n−1}ℓ_{i+1} ≠ 0, so

(δ_n, f_n, g_n) = divstep(δ_{n−1}, f_{n−1}, g_{n−1}) = (1 − δ_{n−1}, g_{n−1}, (g_{n−1}(0)f_{n−1} − f_{n−1}(0)g_{n−1})/x).

Hence δ_n = 1 − (d_i − d_{i+1}) ≤ 0 as claimed; f_n = a_n x^{d_{i+1}}R̄_{i+1}, where a_n = b_{n−1} ∈ k*, as claimed; and g_n = b_n x^{d_i−1}Z̄_n, where b_n = a_{n−1}b_{n−1}ℓ_{i+1} ∈ k*, as claimed.

Finally, substitute X̄_n = Ū_i − (ℓ_i/ℓ_{i+1})(1/x)^{d_i−d_{i+1}}Ū_{i+1} and Ȳ_n = V̄_i − (ℓ_i/ℓ_{i+1})(1/x)^{d_i−d_{i+1}}V̄_{i+1} into the matrix

( a_n (1/x)^{d_0−d_{i+1}} Ū_{i+1} , a_n (1/x)^{d_0−d_{i+1}−1} V̄_{i+1} ; b_n (1/x)^{d_0−d_i+1} X̄_n , b_n (1/x)^{d_0−d_i} Ȳ_n ),

along with a_n = b_{n−1} and b_n = a_{n−1}b_{n−1}ℓ_{i+1}, to see that this matrix is

T_{n−1} = ( 0 , 1 ; b_{n−1}ℓ_{i+1}/x , −a_{n−1}ℓ_i/x )

times the matrix T_{n−2} ··· T_0 shown above.

Case B2: 2d_0 − d_i − d_{i+1} < n < 2d_0 − 2d_{i+1} for some i ∈ {0, 1, ..., r − 2}.

Now 2d_0 − d_i − d_{i+1} ≤ n − 1 < 2d_0 − 2d_{i+1}. By the inductive hypothesis, δ_{n−1} = n − (2d_0 − 2d_{i+1}) ≤ 0; f_{n−1} = a_{n−1}x^{d_{i+1}}R̄_{i+1} for some a_{n−1} ∈ k*; g_{n−1} = b_{n−1}x^{2d_0−d_{i+1}−n}Z̄_{n−1} for some b_{n−1} ∈ k*, where Z_{n−1} = R_i mod x^{2d_0−2d_{i+1}−n+1}R_{i+1}; and

T_{n−2} ··· T_0 = ( a_{n−1}(1/x)^{d_0−d_{i+1}} Ū_{i+1} , a_{n−1}(1/x)^{d_0−d_{i+1}−1} V̄_{i+1} ; b_{n−1}(1/x)^{n−(d_0−d_{i+1})} X̄_{n−1} , b_{n−1}(1/x)^{n−(d_0−d_{i+1})−1} Ȳ_{n−1} ),

where X_{n−1} = U_i − ⌊R_i/(x^{2d_0−2d_{i+1}−n+1}R_{i+1})⌋ x^{2d_0−2d_{i+1}−n+1}U_{i+1} and Y_{n−1} = V_i − ⌊R_i/(x^{2d_0−2d_{i+1}−n+1}R_{i+1})⌋ x^{2d_0−2d_{i+1}−n+1}V_{i+1}.

Observe that

Z_n = R_i mod x^{2d_0−2d_{i+1}−n}R_{i+1} = Z_{n−1} − (Z_{n−1}[2d_0 − d_{i+1} − n]/ℓ_{i+1}) x^{2d_0−2d_{i+1}−n}R_{i+1},
X_n = X_{n−1} − (Z_{n−1}[2d_0 − d_{i+1} − n]/ℓ_{i+1}) x^{2d_0−2d_{i+1}−n}U_{i+1},
Y_n = Y_{n−1} − (Z_{n−1}[2d_0 − d_{i+1} − n]/ℓ_{i+1}) x^{2d_0−2d_{i+1}−n}V_{i+1}.

The point is that deg Z_{n−1} ≤ 2d_0 − d_{i+1} − n, and subtracting

(Z_{n−1}[2d_0 − d_{i+1} − n]/ℓ_{i+1}) x^{2d_0−2d_{i+1}−n}R_{i+1}

from Z_{n−1} eliminates the coefficient of x^{2d_0−d_{i+1}−n}.

Starting from

Z_n = Z_{n−1} − (Z_{n−1}[2d_0 − d_{i+1} − n]/ℓ_{i+1}) x^{2d_0−2d_{i+1}−n}R_{i+1},

substitute 1/x for x and multiply by a_{n−1}b_{n−1}ℓ_{i+1}x^{2d_0−d_{i+1}−n} to see that

a_{n−1}b_{n−1}ℓ_{i+1}x^{2d_0−d_{i+1}−n}Z̄_n = a_{n−1}ℓ_{i+1}g_{n−1} − b_{n−1}Z_{n−1}[2d_0 − d_{i+1} − n]f_{n−1} = f_{n−1}(0)g_{n−1} − g_{n−1}(0)f_{n−1}.

Now

(δ_n, f_n, g_n) = divstep(δ_{n−1}, f_{n−1}, g_{n−1}) = (1 + δ_{n−1}, f_{n−1}, (f_{n−1}(0)g_{n−1} − g_{n−1}(0)f_{n−1})/x).

Hence δ_n = 1 + n − (2d_0 − 2d_{i+1}) ≤ 0 as claimed; f_n = a_n x^{d_{i+1}}R̄_{i+1}, where a_n = a_{n−1} ∈ k*, as claimed; and g_n = b_n x^{2d_0−d_{i+1}−n−1}Z̄_n, where b_n = a_{n−1}b_{n−1}ℓ_{i+1} ∈ k*, as claimed.


Finally, substitute

X̄_n = X̄_{n−1} − (Z_{n−1}[2d_0 − d_{i+1} − n]/ℓ_{i+1})(1/x)^{2d_0−2d_{i+1}−n}Ū_{i+1}

and

Ȳ_n = Ȳ_{n−1} − (Z_{n−1}[2d_0 − d_{i+1} − n]/ℓ_{i+1})(1/x)^{2d_0−2d_{i+1}−n}V̄_{i+1}

into the matrix

( a_n (1/x)^{d_0−d_{i+1}} Ū_{i+1} , a_n (1/x)^{d_0−d_{i+1}−1} V̄_{i+1} ; b_n (1/x)^{n−(d_0−d_{i+1})+1} X̄_n , b_n (1/x)^{n−(d_0−d_{i+1})} Ȳ_n ),

along with a_n = a_{n−1} and b_n = a_{n−1}b_{n−1}ℓ_{i+1}, to see that this matrix is

T_{n−1} = ( 1 , 0 ; −b_{n−1}Z_{n−1}[2d_0 − d_{i+1} − n]/x , a_{n−1}ℓ_{i+1}/x )

times the matrix T_{n−2} ··· T_0 shown above.

Case C1: n = 2d_0 − 2d_{r−1}. This is essentially the same as Case A1, except that i is replaced with r − 1 and g_n ends up as 0. For completeness we spell out the details.

If n = 0 then r = 1 and R_1 = 0, so g = 0. By assumption δ_n = 1 as claimed; f_n = f = x^{d_0}R̄_0, so f_n = a_n x^{d_0}R̄_0 as claimed, where a_n = 1; g_n = g = 0 as claimed. Define b_n = 1. Now U_{r−1} = 1, U_r = 0, V_{r−1} = 0, V_r = 1, and d_0 − d_{r−1} = 0, so

( a_n (1/x)^{d_0−d_{r−1}} Ū_{r−1} , a_n (1/x)^{d_0−d_{r−1}−1} V̄_{r−1} ; b_n (1/x)^{d_0−d_{r−1}+1} Ū_r , b_n (1/x)^{d_0−d_{r−1}} V̄_r ) = ( 1 , 0 ; 0 , 1 ) = T_{n−1} ··· T_0

as claimed.

Otherwise n > 0, so r ≥ 2. Now 2d_0 − d_{r−2} − d_{r−1} ≤ n − 1 < 2d_0 − 2d_{r−1}, so by the inductive hypothesis (for i = r − 2) δ_{n−1} = 1 + (n − 1) − (2d_0 − 2d_{r−1}) = 0; f_{n−1} = a_{n−1}x^{d_{r−1}}R̄_{r−1} for some a_{n−1} ∈ k*; g_{n−1} = b_{n−1}x^{d_{r−1}}Z̄_{n−1} for some b_{n−1} ∈ k*, where Z_{n−1} = R_{r−2} mod xR_{r−1}; and

T_{n−2} ··· T_0 = ( a_{n−1}(1/x)^{d_0−d_{r−1}} Ū_{r−1} , a_{n−1}(1/x)^{d_0−d_{r−1}−1} V̄_{r−1} ; b_{n−1}(1/x)^{d_0−d_{r−1}} X̄_{n−1} , b_{n−1}(1/x)^{d_0−d_{r−1}−1} Ȳ_{n−1} ),

where X_{n−1} = U_{r−2} − ⌊R_{r−2}/(xR_{r−1})⌋xU_{r−1} and Y_{n−1} = V_{r−2} − ⌊R_{r−2}/(xR_{r−1})⌋xV_{r−1}.

Observe that

0 = R_r = R_{r−2} mod R_{r−1} = Z_{n−1} − (Z_{n−1}[d_{r−1}]/ℓ_{r−1})R_{r−1},
U_r = X_{n−1} − (Z_{n−1}[d_{r−1}]/ℓ_{r−1})U_{r−1},
V_r = Y_{n−1} − (Z_{n−1}[d_{r−1}]/ℓ_{r−1})V_{r−1}.

Starting from Z_{n−1} − (Z_{n−1}[d_{r−1}]/ℓ_{r−1})R_{r−1} = 0, substitute 1/x for x and multiply by a_{n−1}b_{n−1}ℓ_{r−1}x^{d_{r−1}} to see that

a_{n−1}ℓ_{r−1}g_{n−1} − b_{n−1}Z_{n−1}[d_{r−1}]f_{n−1} = 0,

i.e., f_{n−1}(0)g_{n−1} − g_{n−1}(0)f_{n−1} = 0. Now

(δ_n, f_n, g_n) = divstep(δ_{n−1}, f_{n−1}, g_{n−1}) = (1 + δ_{n−1}, f_{n−1}, 0).

Hence δ_n = 1 = 1 + n − (2d_0 − 2d_{r−1}) > 0 as claimed; f_n = a_n x^{d_{r−1}}R̄_{r−1}, where a_n = a_{n−1} ∈ k*, as claimed; and g_n = 0 as claimed.

Define b_n = a_{n−1}b_{n−1}ℓ_{r−1} ∈ k*, and substitute Ū_r = X̄_{n−1} − (Z_{n−1}[d_{r−1}]/ℓ_{r−1})Ū_{r−1} and V̄_r = Ȳ_{n−1} − (Z_{n−1}[d_{r−1}]/ℓ_{r−1})V̄_{r−1} into the matrix

( a_n (1/x)^{d_0−d_{r−1}} U_{r−1}(1/x) , a_n (1/x)^{d_0−d_{r−1}−1} V_{r−1}(1/x) ; b_n (1/x)^{d_0−d_{r−1}+1} U_r(1/x) , b_n (1/x)^{d_0−d_{r−1}} V_r(1/x) )


to see that this matrix is

T_{n−1} = ( 1 , 0 ; −b_{n−1}Z_{n−1}[d_{r−1}]/x , a_{n−1}ℓ_{r−1}/x )

times the matrix T_{n−2} ··· T_0 shown above.

Case C2: n > 2d_0 − 2d_{r−1}.

By the inductive hypothesis, δ_{n−1} = n − (2d_0 − 2d_{r−1}); f_{n−1} = a_{n−1}x^{d_{r−1}}R̄_{r−1} for some a_{n−1} ∈ k*; g_{n−1} = 0; and

T_{n−2} ··· T_0 = ( a_{n−1}(1/x)^{d_0−d_{r−1}} U_{r−1}(1/x) , a_{n−1}(1/x)^{d_0−d_{r−1}−1} V_{r−1}(1/x) ; b_{n−1}(1/x)^{n−(d_0−d_{r−1})} U_r(1/x) , b_{n−1}(1/x)^{n−(d_0−d_{r−1})−1} V_r(1/x) )

for some b_{n−1} ∈ k*. Hence δ_n = 1 + δ_{n−1} = 1 + n − (2d_0 − 2d_{r−1}) > 0; f_n = f_{n−1} = a_n x^{d_{r−1}}R̄_{r−1}, where a_n = a_{n−1}; and g_n = 0. Also T_{n−1} = ( 1 , 0 ; 0 , a_{n−1}ℓ_{r−1}/x ), so

T_{n−1} ··· T_0 = ( a_n (1/x)^{d_0−d_{r−1}} U_{r−1}(1/x) , a_n (1/x)^{d_0−d_{r−1}−1} V_{r−1}(1/x) ; b_n (1/x)^{n−(d_0−d_{r−1})+1} U_r(1/x) , b_n (1/x)^{n−(d_0−d_{r−1})} V_r(1/x) ),

where b_n = a_{n−1}b_{n−1}ℓ_{r−1} ∈ k*.

D  Alternative proof of main gcd theorem for polynomials

This appendix gives another proof of Theorem 6.2, using the relationship in Theorem C.2 between the Euclid–Stevin algorithm and iterates of divstep.

Second proof of Theorem 6.2. Define R_2, R_3, ..., R_r and d_i and ℓ_i as in Theorem B.1, and define U_i and V_i as in Theorem B.2. Then d_0 = d > d_1, and all of the hypotheses of Theorems B.1, B.2, and C.2 are satisfied.

By Theorem B.1, G = gcd{R_0, R_1} = R_{r−1}/ℓ_{r−1} (since R_0 ≠ 0), so deg G = d_{r−1}. Also, by Theorem B.2,

U_{r−1}R_0 + V_{r−1}R_1 = R_{r−1} = ℓ_{r−1}G,

so (V_{r−1}/ℓ_{r−1})R_1 ≡ G (mod R_0), and deg V_{r−1} = d_0 − d_{r−2} < d_0 − d_{r−1} = d − deg G, so V = V_{r−1}/ℓ_{r−1}.

Case 1: d_{r−1} ≥ 1. Part C of Theorem C.2 applies to n = 2d − 1, since n = 2d_0 − 1 ≥ 2d_0 − 2d_{r−1}. In particular, δ_n = 1 + n − (2d_0 − 2d_{r−1}) = 2d_{r−1} = 2 deg G, f_{2d−1} = a_{2d−1}x^{d_{r−1}}R_{r−1}(1/x), and v_{2d−1} = a_{2d−1}(1/x)^{d_0−d_{r−1}−1}V_{r−1}(1/x).

Case 2: d_{r−1} = 0. Then r ≥ 2 and d_{r−2} ≥ 1. Part B of Theorem C.2 applies to i = r − 2 and n = 2d − 1 = 2d_0 − 1, since 2d_0 − d_i − d_{i+1} = 2d_0 − d_{r−2} ≤ 2d_0 − 1 = n < 2d_0 = 2d_0 − 2d_{i+1}. In particular, δ_n = 1 + n − (2d_0 − 2d_{i+1}) = 2d_{r−1} = 2 deg G; again f_{2d−1} = a_{2d−1}x^{d_{r−1}}R_{r−1}(1/x); and again v_{2d−1} = a_{2d−1}(1/x)^{d_0−d_{r−1}−1}V_{r−1}(1/x).

In both cases f_{2d−1}(0) = a_{2d−1}ℓ_{r−1}, so x^{deg G}f_{2d−1}(1/x)/f_{2d−1}(0) = R_{r−1}/ℓ_{r−1} = G and x^{−d+1+deg G}v_{2d−1}(1/x)/f_{2d−1}(0) = V_{r−1}/ℓ_{r−1} = V.

E  A gcd algorithm for integers

This appendix presents a variable-time algorithm to compute the gcd of two integers. This appendix also explains how a modification to the algorithm produces the Stehlé–Zimmermann algorithm [62].

Theorem E.1. Let e be a positive integer. Let f be an element of Z_2^*. Let g be an element of 2^e Z_2^*. Then there is a unique element q ∈ {1, 3, 5, ..., 2^{e+1} − 1} such that qg/2^e − f ∈ 2^{e+1}Z_2.


We define f div_2 g = q/2^e ∈ Q, and we define f mod_2 g = f − qg/2^e ∈ 2^{e+1}Z_2. Then f = (f div_2 g)g + (f mod_2 g). Note that e = ord_2 g, so we are not creating any ambiguity by omitting e from the f div_2 g and f mod_2 g notation.

Proof. First note that the quotient f/(g/2^e) is odd. Define q = (f/(g/2^e)) mod 2^{e+1}; then q ∈ {1, 3, 5, ..., 2^{e+1} − 1} and qg/2^e − f ∈ 2^{e+1}Z_2 by construction. To see uniqueness, note that if q′ ∈ {1, 3, 5, ..., 2^{e+1} − 1} also satisfies q′g/2^e − f ∈ 2^{e+1}Z_2, then (q − q′)g/2^e ∈ 2^{e+1}Z_2, so q − q′ ∈ 2^{e+1}Z_2 since g/2^e ∈ Z_2^*, so q mod 2^{e+1} = q′ mod 2^{e+1}, so q = q′.

Theorem E.2. Let R_0 be a nonzero element of Z_2. Let R_1 be an element of 2Z_2. Recursively define e_0, e_1, e_2, e_3, ... ∈ Z and R_2, R_3, ... ∈ 2Z_2 as follows:

• If i ≥ 0 and R_i ≠ 0 then e_i = ord_2 R_i.

• If i ≥ 1 and R_i ≠ 0 then R_{i+1} = −((R_{i−1}/2^{e_{i−1}}) mod_2 R_i)/2^{e_i} ∈ 2Z_2.

• If i ≥ 1 and R_i = 0 then e_i, R_{i+1}, e_{i+1}, R_{i+2}, e_{i+2}, ... are undefined.

Then

( R_i/2^{e_i} ; R_{i+1} ) = ( 0 , 1/2^{e_i} ; −1/2^{e_i} , ((R_{i−1}/2^{e_{i−1}}) div_2 R_i)/2^{e_i} ) ( R_{i−1}/2^{e_{i−1}} ; R_i )

for each i ≥ 1 where R_{i+1} is defined.

Proof. We have (R_{i−1}/2^{e_{i−1}}) mod_2 R_i = R_{i−1}/2^{e_{i−1}} − ((R_{i−1}/2^{e_{i−1}}) div_2 R_i)R_i. Negate and divide by 2^{e_i} to obtain the matrix equation.

Theorem E.3. Let R_0 be an odd element of Z. Let R_1 be an element of 2Z. Define e_0, e_1, e_2, e_3, ... and R_2, R_3, ... as in Theorem E.2. Then R_i ∈ 2Z for each i ≥ 1 where R_i is defined. Furthermore, if t ≥ 0 and R_{t+1} = 0, then |R_t/2^{e_t}| = gcd{R_0, R_1}.

Proof. Write g = gcd{R_0, R_1} ∈ Z. Then g is odd, since R_0 is odd.

We first show by induction on i that R_i ∈ gZ for each i ≥ 0. For i ∈ {0, 1} this follows from the definition of g. For i ≥ 2, we have R_{i−2} ∈ gZ and R_{i−1} ∈ gZ by the inductive hypothesis. Now (R_{i−2}/2^{e_{i−2}}) div_2 R_{i−1} has the form q/2^{e_{i−1}} for q ∈ Z, by definition of div_2, and 2^{2e_{i−1}+e_{i−2}}R_i = 2^{e_{i−2}}qR_{i−1} − 2^{e_{i−1}}R_{i−2}, by definition of R_i. The right side of this equation is in gZ, so 2^{2e_{i−1}+e_{i−2}}R_i is in gZ. This implies first that R_i ∈ Q; but also R_i ∈ Z_2; so R_i ∈ Z. This also implies that R_i ∈ gZ, since g is odd.

By construction, R_i ∈ 2Z_2 for each i ≥ 1, and R_i ∈ Z for each i ≥ 0, so R_i ∈ 2Z for each i ≥ 1.

Finally, assume that t ≥ 0 and R_{t+1} = 0. Then all of R_0, R_1, ..., R_t are nonzero. Write g′ = |R_t/2^{e_t}|. Then 2^{e_t}g′ = |R_t| ∈ gZ, and g is odd, so g′ ∈ gZ.

The equation 2^{e_{i−1}}R_{i−2} = 2^{e_{i−2}}qR_{i−1} − 2^{2e_{i−1}+e_{i−2}}R_i shows that if R_{i−1}, R_i ∈ g′Z then R_{i−2} ∈ g′Z, since g′ is odd. By construction R_t, R_{t+1} ∈ g′Z, so R_{t−1}, R_{t−2}, ..., R_1, R_0 ∈ g′Z. Hence g = gcd{R_0, R_1} ∈ g′Z. Both g and g′ are positive, so g′ = g.

E.4. The gcd algorithm. Here is the algorithm featured in this appendix.

Given an odd integer R_0 and an even integer R_1, compute the even integers R_2, R_3, ... defined above. In other words, for each i ≥ 1 where R_i ≠ 0: compute e_i = ord_2 R_i; find q ∈ {1, 3, 5, ..., 2^{e_i+1} − 1} such that qR_i/2^{e_i} − R_{i−1}/2^{e_{i−1}} ∈ 2^{e_i+1}Z; and compute R_{i+1} = (qR_i/2^{e_i} − R_{i−1}/2^{e_{i−1}})/2^{e_i} ∈ 2Z. If R_{t+1} = 0 for some t ≥ 0, output |R_t/2^{e_t}| and stop.

We will show in Theorem F.26 that this algorithm terminates. The output is gcd{R_0, R_1} by Theorem E.3. This algorithm is closely related to iterating divstep, and we will use this relationship to analyze iterates of divstep; see Appendix G.
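The following Sage sketch (ours, variable-time, for illustration only) runs the algorithm just described on integer inputs, reusing div2_mod2 from above; for example, gcd_2adic(21, 14) returns 7.

def gcd_2adic(R0, R1):                       # R0 odd, R1 even
    e_prev = 0                               # e_0 = ord_2 R0 = 0
    while R1 != 0:
        e = ZZ(R1).valuation(2)              # e_i
        q, r = div2_mod2(R0 // 2^e_prev, R1) # r = (R_{i-1}/2^{e_{i-1}}) mod_2 R_i
        R0, R1, e_prev = R1, -r // 2^e, e    # R_{i+1} = (q R_i/2^{e_i} - R_{i-1}/2^{e_{i-1}})/2^{e_i}
    return abs(R0 // 2^e_prev)               # |R_t/2^{e_t}| = gcd(R_0, R_1)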

More generally, if R_0 is an odd integer and R_1 is any integer, one can compute gcd{R_0, R_1} by using the above algorithm to compute gcd{R_0, 2R_1}. Even more generally,


one can compute the gcd of any two integers by first handling the case gcd{0, 0} = 0, then handling powers of 2 in both inputs, then applying the odd-input algorithm.

E.5. A non-functioning variant. Imagine replacing the subtractions above with additions: find q ∈ {1, 3, 5, ..., 2^{e_i+1} − 1} such that qR_i/2^{e_i} + R_{i−1}/2^{e_{i−1}} ∈ 2^{e_i+1}Z, and compute R_{i+1} = (qR_i/2^{e_i} + R_{i−1}/2^{e_{i−1}})/2^{e_i} ∈ 2Z. If R_0 and R_1 are positive then all R_i are positive, so the algorithm does not terminate. This non-terminating algorithm is related to posdivstep (see Section 8.4) in the same way that the algorithm above is related to divstep.

Daireaux, Maume-Deschamps, and Vallée in [31, Section 6] mentioned this algorithm as a "non-centered LSB algorithm", said that one can easily add a "supplementary stopping condition" to ensure termination, and dismissed the resulting algorithm as being "certainly slower than the centered version".

E.6. A centered variant: the Stehlé–Zimmermann algorithm. Imagine using centered remainders {1 − 2^e, ..., −3, −1, 1, 3, ..., 2^e − 1} modulo 2^{e+1}, rather than uncentered remainders {1, 3, 5, ..., 2^{e+1} − 1}. The above gcd algorithm then turns into the Stehlé–Zimmermann gcd algorithm from [62].

Specifically, Stehlé and Zimmermann begin with integers a, b having ord_2 a < ord_2 b. If b ≠ 0, then "generalized binary division" in [62, Lemma 1] defines q as the unique odd integer with |q| < 2^{ord_2 b − ord_2 a} such that r = a + qb/2^{ord_2 b − ord_2 a} is divisible by 2^{ord_2 b + 1}. The gcd algorithm in [62, Algorithm Elementary-GB] repeatedly sets (a, b) ← (b, r). When b = 0, the algorithm outputs the odd part of a.

It is equivalent to set (a, b) ← (b/2^{ord_2 b}, r/2^{ord_2 b}), since this simply divides all subsequent results by 2^{ord_2 b}. In other words, starting from an odd a_0 and an even b_0, recursively define an odd a_{i+1} and an even b_{i+1} by the following formulas, stopping when b_i = 0:

• a_{i+1} = b_i/2^e, where e = ord_2 b_i;

• b_{i+1} = (a_i + qb_i/2^e)/2^e ∈ 2Z, for a unique q ∈ {1 − 2^e, ..., −3, −1, 1, 3, ..., 2^e − 1}.

Now relabel a_0, b_0, b_1, b_2, b_3, ... as R_0, R_1, R_2, R_3, R_4, ... to obtain the following recursive definition: R_{i+1} = (qR_i/2^{e_i} + R_{i−1}/2^{e_{i−1}})/2^{e_i} ∈ 2Z, and thus

( R_i/2^{e_i} ; R_{i+1} ) = ( 0 , 1/2^{e_i} ; 1/2^{e_i} , q/2^{2e_i} ) ( R_{i−1}/2^{e_{i−1}} ; R_i )

for a unique q ∈ {1 − 2^{e_i}, ..., −3, −1, 1, 3, ..., 2^{e_i} − 1}.

for a unique q isin 1minus 2ei minus3minus1 1 3 2ei minus 1To summarize starting from the StehleacutendashZimmermann algorithm one can obtain the

gcd algorithm featured in this appendix by making the following three changes Firstdivide out 2ei instead of allowing powers of 2 to accumulate Second add Riminus12eiminus1

instead of subtracting it this does not affect the number of iterations for centered q Thirdswitch from centered q to uncentered q
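For comparison, here is a sketch (ours) of the centered variant in the relabeled form above: the same structure as gcd_2adic, but with the R_{i−1}/2^{e_{i−1}} term added and with q reduced to the centered set {1 − 2^e, ..., −3, −1, 1, 3, ..., 2^e − 1}. This is only meant to mirror the recursion displayed above, not to reproduce the exact algorithm statement of [62].

def gcd_centered(R0, R1):                    # R0 odd, R1 even
    e_prev = 0
    while R1 != 0:
        e = ZZ(R1).valuation(2)
        g0 = ZZ(R1) // 2^e
        a = R0 // 2^e_prev
        q = (-a * inverse_mod(g0, 2^(e+1))) % 2^(e+1)
        if q >= 2^e: q -= 2^(e+1)            # centered odd q, |q| < 2^e
        R0, R1, e_prev = R1, (a + q*g0) // 2^e, e
    return abs(R0 // 2^e_prev)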

Intuitively, centered remainders are smaller than uncentered remainders, and it is well known that centered remainders improve the worst-case performance of Euclid's algorithm, so one might expect that the Stehlé–Zimmermann algorithm performs better than our uncentered variant. However, as noted earlier, our analysis produces the opposite conclusion. See Appendix F.

F  Performance of the gcd algorithm for integers

The possible matrices in Theorem E.2 are

( 0 , 1/2 ; −1/2 , 1/4 ), ( 0 , 1/2 ; −1/2 , 3/4 )   for e_i = 1;

( 0 , 1/4 ; −1/4 , 1/16 ), ( 0 , 1/4 ; −1/4 , 3/16 ), ( 0 , 1/4 ; −1/4 , 5/16 ), ( 0 , 1/4 ; −1/4 , 7/16 )   for e_i = 2;


etc. In this appendix we show that a product of these matrices shrinks the input by a factor exponential in the weight ∑ e_i. This implies that ∑ e_i in the algorithm of Appendix E is O(b), where b is the number of input bits.

F.1. Lower bounds. It is easy to see that each of these matrices has two eigenvalues of absolute value 1/2^{e_i}. This does not imply, however, that a weight-w product of these matrices shrinks the input by a factor 1/2^w. Concretely, the weight-7 product

(1/4^7) ( 328 , −380 ; −158 , 233 ) = ( 0 , 1/4 ; −1/4 , 1/16 )( 0 , 1/2 ; −1/2 , 3/4 )^2 ( 0 , 1/2 ; −1/2 , 1/4 )( 0 , 1/2 ; −1/2 , 3/4 )^2

of 6 matrices has spectral radius (maximum absolute value of eigenvalues)

(561 + √249185)/2^15 = 0.0323542582... = (0.6125380391...)^7 = 1/2^{4.949900586...}.

The (w/7)th power of this matrix is expressed as a weight-w product and has spectral radius 1/2^{0.7071286552...w}. Consequently, this proof strategy cannot obtain an O constant better than 7/(15 − log_2(561 + √249185)) = 1.414169815.... We prove a bound of twice the weight on the number of divstep iterations (see Appendix G), so this proof strategy cannot obtain an O constant better than 14/(15 − log_2(561 + √249185)) = 2.828339631... for the number of divstep iterations.

Rotating the list of 6 matrices shown above gives 6 weight-7 products with the same spectral radius. In a small search, we did not find any weight-w products with spectral radius above (0.6125380391...)^w. Our upper bounds (see below) are (0.6181640625...)^w for all large w.
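The weight-7 product above is easy to reproduce; the following Sage snippet (ours) checks the matrix identity and prints its spectral radius.

MS = MatrixSpace(QQ, 2)
def M(e, q): return MS([0, 1/2^e, -1/2^e, q/2^(2*e)])   # the matrices of Theorem E.2
P = M(2,1) * M(1,3)^2 * M(1,1) * M(1,3)^2
assert P == MS([328, -380, -158, 233]) / 4^7
print(max(abs(z) for z in P.eigenvalues()).n())          # 0.0323542582... = (561 + sqrt(249185))/2^15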

For comparison, the non-terminating variant in Appendix E.5 allows the matrix ( 0 , 1/2 ; 1/2 , 3/4 ), which has spectral radius 1, with eigenvector ( 1 ; 2 ). The Stehlé–Zimmermann algorithm reviewed in Appendix E.6 allows the matrix ( 0 , 1/2 ; 1/2 , 1/4 ), which has spectral radius (1 + √17)/8 = 0.6403882032..., so this proof strategy cannot obtain an O constant better than 2/(log_2(√17 − 1) − 1) = 3.110510062... for the number of cdivstep iterations.

F.2. A warning about worst-case inputs. Define M as the matrix 4^{−7}( 328 , −380 ; −158 , 233 ) shown above. Then M^{−3} = ( 60321097 , 113368060 ; 47137246 , 88663112 ). Applying our gcd algorithm to the inputs (60321097, 47137246) applies a series of 18 matrices of total weight 21: in the notation of Theorem E.2, the matrix ( 0 , 1/2 ; −1/2 , 3/4 ) twice, then the matrix ( 0 , 1/2 ; −1/2 , 1/4 ), then the matrix ( 0 , 1/2 ; −1/2 , 3/4 ) twice, then the weight-2 matrix ( 0 , 1/4 ; −1/4 , 1/16 ); all of these matrices again; and all of these matrices a third time.

One might think that these are worst-case inputs to our algorithm, analogous to a standard matrix construction of Fibonacci numbers as worst-case inputs to Euclid's algorithm. We emphasize that these are not worst-case inputs: the inputs (22293, −128330) are much smaller and nevertheless require 19 steps of total weight 23. We now explain why the analogy fails.

The matrix M^3 has an eigenvector (1, −0.53...) with eigenvalue 1/2^{21·0.707...} = 1/2^{14.8...}, and an eigenvector (1, 0.78...) with eigenvalue 1/2^{21·1.29...} = 1/2^{27.1...}, so it has spectral radius around 1/2^{14.8}. Our proof strategy thus cannot guarantee a gain of more than 14.8 bits from weight 21. What we actually show below is that weight 21 is guaranteed to gain more than 13.97 bits. (The quantity α_21 in Figure F.14 is 133569/2^31 = 2^{−13.972...}.)


The matrix M^{−3} has spectral radius 2^{27.1...}. It is not surprising that each column of M^{−3} is close to 27 bits in size; the only way to avoid this would be for the other eigenvector to be pointing in almost exactly the same direction as (1, 0) or (0, 1). For our inputs, taken from the first column of M^{−3}, the algorithm gains nearly 27 bits from weight 21, almost double the guaranteed gain. In short, these are good inputs, not bad inputs.

Now replace M with the matrix F = ( 0 , 1 ; 1 , −1 ), which reduces worst-case inputs to Euclid's algorithm. The eigenvalues of F are (√5 − 1)/2 = 1/2^{0.69...} and −(√5 + 1)/2 = −2^{0.69...}. The entries of powers of F^{−1} are Fibonacci numbers, again with sizes well predicted by the spectral radius of F^{−1}, namely 2^{0.69...}. However, the spectral radius of F is larger than 1. The reason that this large eigenvalue does not spoil the termination of Euclid's algorithm is that Euclid's algorithm inspects sizes and chooses quotients to decrease sizes at each step.

Stehlé and Zimmermann in [62, Section 6.2] measure the performance of their algorithm for "worst case" inputs^18 obtained from negative powers of the matrix ( 0 , 1/2 ; 1/2 , 1/4 ). The statement that these inputs are "worst case" is not justified. On the contrary: if these inputs have b bits then they take (0.73690... + o(1))b steps of the Stehlé–Zimmermann algorithm and thus (1.4738... + o(1))b cdivsteps, which is considerably fewer steps than many other inputs that we have tried^19 for various values of b. The quantity 0.73690... here is log(2)/log((√17 + 1)/2), coming from the smaller eigenvalue −2/(√17 + 1) = −0.39038... of the above matrix.

F.3. Norms. We use the standard definitions of the 2-norm of a real vector and the 2-norm of a real matrix. We summarize the basic theory of 2-norms here to keep this paper self-contained.

Definition F.4. If v ∈ R^2, then |v|_2 is defined as √(v_1^2 + v_2^2), where v = (v_1, v_2).

Definition F.5. If P ∈ M_2(R), then |P|_2 is defined as max{|Pv|_2 : v ∈ R^2, |v|_2 = 1}.

This maximum exists because {v ∈ R^2 : |v|_2 = 1} is a compact set (namely the unit circle) and |Pv|_2 is a continuous function of v. See also Theorem F.11 for an explicit formula for |P|_2.

Beware that the literature sometimes uses the notation |P|_2 for the entrywise 2-norm √(P_{11}^2 + P_{12}^2 + P_{21}^2 + P_{22}^2). We do not use entrywise norms of matrices in this paper.

Theorem F.6. Let v be an element of Z^2. If v ≠ 0 then |v|_2 ≥ 1.

Proof. Write v as (v_1, v_2) with v_1, v_2 ∈ Z. If v_1 ≠ 0 then v_1^2 ≥ 1, so v_1^2 + v_2^2 ≥ 1. If v_2 ≠ 0 then v_2^2 ≥ 1, so v_1^2 + v_2^2 ≥ 1. Either way, |v|_2 = √(v_1^2 + v_2^2) ≥ 1.

Theorem F.7. Let v be an element of R^2. Let r be an element of R. Then |rv|_2 = |r| |v|_2.

Proof. Write v as (v_1, v_2) with v_1, v_2 ∈ R. Then rv = (rv_1, rv_2), so |rv|_2 = √(r^2v_1^2 + r^2v_2^2) = √(r^2)·√(v_1^2 + v_2^2) = |r| |v|_2.

^18 Specifically, Stehlé and Zimmermann claim that "the worst case of the binary variant" (i.e., of their gcd algorithm) is for inputs "G_n and 2G_{n−1}, where G_0 = 0, G_1 = 1, G_n = −G_{n−1} + 4G_{n−2}, which gives all binary quotients equal to 1". Daireaux, Maume-Deschamps, and Vallée in [31, Section 2.3] state that Stehlé and Zimmermann "exhibited the precise worst-case number of iterations" and give an equivalent characterization of this case.

^19 For example, "G_n" from [62] is −5467345 for n = 18 and 2135149 for n = 17. The inputs (−5467345, 2·2135149) use 17 iterations of the "GB" divisions from [62], while the smaller inputs (−1287513, 2·622123) use 24 "GB" iterations. The inputs (1, −5467345, 2135149) use 34 cdivsteps; the inputs (1, −2718281, 3141592) use 45 cdivsteps; the inputs (1, −1287513, 622123) use 58 cdivsteps.


Theorem F.8. Let P be an element of M_2(R). Let r be an element of R. Then |rP|_2 = |r| |P|_2.

Proof. By definition, |rP|_2 = max{|rPv|_2 : v ∈ R^2, |v|_2 = 1}. By Theorem F.7, this is the same as max{|r| |Pv|_2} = |r| max{|Pv|_2} = |r| |P|_2.

Theorem F.9. Let P be an element of M_2(R). Let v be an element of R^2. Then |Pv|_2 ≤ |P|_2 |v|_2.

Proof. If v = 0 then Pv = 0 and |Pv|_2 = 0 = |P|_2 |v|_2.

Otherwise |v|_2 ≠ 0. Write w = v/|v|_2. Then |w|_2 = 1, so |Pw|_2 ≤ |P|_2, so |Pv|_2 = |Pw|_2 |v|_2 ≤ |P|_2 |v|_2.

Theorem F.10. Let P, Q be elements of M_2(R). Then |PQ|_2 ≤ |P|_2 |Q|_2.

Proof. If v ∈ R^2 and |v|_2 = 1, then |PQv|_2 ≤ |P|_2 |Qv|_2 ≤ |P|_2 |Q|_2 |v|_2.

Theorem F.11. Let P be an element of M_2(R). Assume that P*P = ( a , b ; c , d ), where P* is the transpose of P. Then |P|_2 = √((a + d + √((a − d)^2 + 4b^2))/2).

Proof. P*P ∈ M_2(R) is its own transpose, so by the spectral theorem it has two orthonormal eigenvectors with real eigenvalues. In other words, R^2 has a basis e_1, e_2 such that:

• |e_1|_2 = 1;

• |e_2|_2 = 1;

• the dot product e_1*e_2 is 0 (here we view vectors as column matrices);

• P*Pe_1 = λ_1 e_1 for some λ_1 ∈ R; and

• P*Pe_2 = λ_2 e_2 for some λ_2 ∈ R.

Any v ∈ R^2 with |v|_2 = 1 can be written uniquely as c_1 e_1 + c_2 e_2 for some c_1, c_2 ∈ R with c_1^2 + c_2^2 = 1: explicitly, c_1 = v*e_1 and c_2 = v*e_2. Then P*Pv = λ_1 c_1 e_1 + λ_2 c_2 e_2.

By definition, |P|_2^2 is the maximum of |Pv|_2^2 over v ∈ R^2 with |v|_2 = 1. Note that |Pv|_2^2 = (Pv)*(Pv) = v*P*Pv = λ_1 c_1^2 + λ_2 c_2^2, with c_1, c_2 defined as above. For example, 0 ≤ |Pe_1|_2^2 = λ_1 and 0 ≤ |Pe_2|_2^2 = λ_2. Thus |Pv|_2^2 achieves max{λ_1, λ_2}. It cannot be larger than this: if c_1^2 + c_2^2 = 1, then λ_1 c_1^2 + λ_2 c_2^2 ≤ max{λ_1, λ_2}. Hence |P|_2^2 = max{λ_1, λ_2}.

To finish the proof, we make the spectral theorem explicit for the matrix P*P = ( a , b ; c , d ) ∈ M_2(R). Note that (P*P)* = P*P, so b = c. The characteristic polynomial of this matrix is (x − a)(x − d) − b^2 = x^2 − (a + d)x + ad − b^2, with roots

(a + d ± √((a + d)^2 − 4(ad − b^2)))/2 = (a + d ± √((a − d)^2 + 4b^2))/2.

The largest eigenvalue of P*P is thus (a + d + √((a − d)^2 + 4b^2))/2, and |P|_2 is the square root of this.

Theorem F.12. Let P be an element of M_2(R). Assume that P*P = ( a , b ; c , d ), where P* is the transpose of P. Let N be an element of R. Then N ≥ |P|_2 if and only if N ≥ 0, 2N^2 ≥ a + d, and (2N^2 − a − d)^2 ≥ (a − d)^2 + 4b^2.


earlybounds = { 0: 1, 1: 1, 2: 689491/2^20, 3: 779411/2^21, 4: 880833/2^22,
                5: 165219/2^20, ... }   # explicit fractions for alpha_0,...,alpha_66; remaining entries not reproduced here

def alpha(w):
  if w >= 67: return (633/1024)^w
  return earlybounds[w]

assert all(alpha(w)^49 < 2^(-(34*w-23)) for w in range(31,100))
assert min((633/1024)^w/alpha(w) for w in range(68)) == 633^5/(2^30*165219)

Figure F.14: Definition of α_w for w ∈ {0, 1, 2, ...}; computer verification that α_w < 2^{−(34w−23)/49} for 31 ≤ w ≤ 99; and computer verification that min{(633/1024)^w/α_w : 0 ≤ w ≤ 67} = 633^5/(2^30 · 165219). If w ≥ 67 then α_w = (633/1024)^w.

Our upper-bound computations work with matrices with rational entries. We use Theorem F.12 to compare the 2-norms of these matrices to various rational bounds N. All of the computations here use exact arithmetic, avoiding the correctness questions that would have shown up if we had used approximations to the square roots in Theorem F.11. Sage includes exact representations of algebraic numbers such as |P|_2, but switching to Theorem F.12 made our upper-bound computations an order of magnitude faster.

Proof. Assume that N ≥ 0, that 2N^2 ≥ a + d, and that (2N^2 − a − d)^2 ≥ (a − d)^2 + 4b^2. Then 2N^2 − a − d ≥ √((a − d)^2 + 4b^2), since 2N^2 − a − d ≥ 0; so N^2 ≥ (a + d + √((a − d)^2 + 4b^2))/2; so N ≥ √((a + d + √((a − d)^2 + 4b^2))/2), since N ≥ 0; so N ≥ |P|_2 by Theorem F.11.

Conversely, assume that N ≥ |P|_2. Then N ≥ 0. Also N^2 ≥ |P|_2^2, so 2N^2 ≥ 2|P|_2^2 = a + d + √((a − d)^2 + 4b^2) by Theorem F.11. In particular, 2N^2 ≥ a + d and (2N^2 − a − d)^2 ≥ (a − d)^2 + 4b^2.

F.13. Upper bounds. We now show that every weight-w product of the matrices in Theorem E.2 has norm at most α_w. Here α_0, α_1, α_2, ... is an exponentially decreasing sequence of positive real numbers defined in Figure F.14.

Definition F.15. For e ∈ {1, 2, ...} and q ∈ {1, 3, 5, ..., 2^{e+1} − 1}, define M(e, q) = ( 0 , 1/2^e ; −1/2^e , q/2^{2e} ).


Theorem F.16. |M(e, q)|_2 < (1 + √2)/2^e if e ∈ {1, 2, ...} and q ∈ {1, 3, 5, ..., 2^{e+1} − 1}.

Proof. Define ( a , b ; c , d ) = M(e, q)*M(e, q). Then a = 1/2^{2e}, b = c = −q/2^{3e}, and d = 1/2^{2e} + q^2/2^{4e}, so a + d = 2/2^{2e} + q^2/2^{4e} < 6/2^{2e} and (a − d)^2 + 4b^2 = 4q^2/2^{6e} + q^4/2^{8e} ≤ 32/2^{4e}.

By Theorem F.11, |M(e, q)|_2 = √((a + d + √((a − d)^2 + 4b^2))/2) ≤ √((6/2^{2e} + √32/2^{2e})/2) = (1 + √2)/2^e.

Definition F.17. Define β_w = inf{α_{w+j}/α_j : j ≥ 0} for w ∈ {0, 1, 2, ...}.

Theorem F.18. If w ∈ {0, 1, 2, ...} then β_w = min{α_{w+j}/α_j : 0 ≤ j ≤ 67}. If w ≥ 67 then β_w = (633/1024)^w · 633^5/(2^30 · 165219).

The first statement reduces the computation of β_w to a finite computation. The computation can be further optimized, but is not a bottleneck for us. Similar comments apply to γ_{w,e} in Theorem F.20.

Proof. By definition, α_w = (633/1024)^w for all w ≥ 67. If j ≥ 67 then also w + j ≥ 67, so α_{w+j}/α_j = (633/1024)^{w+j}/(633/1024)^j = (633/1024)^w. In short, all terms α_{w+j}/α_j for j ≥ 67 are identical, so the infimum of α_{w+j}/α_j for j ≥ 0 is the same as the minimum for 0 ≤ j ≤ 67.

If w ≥ 67 then α_{w+j}/α_j = (633/1024)^{w+j}/α_j. The quotient (633/1024)^j/α_j for 0 ≤ j ≤ 67 has minimum value 633^5/(2^30 · 165219).

Definition F.19. Define γ_{w,e} = inf{β_{w+j} 2^j 70/169 : j ≥ e} for w ∈ {0, 1, 2, ...} and e ∈ {1, 2, 3, ...}.

This quantity γ_{w,e} is designed to be as large as possible subject to two conditions: first, γ_{w,e} ≤ β_{w+e} 2^e 70/169, used in Theorem F.21; second, γ_{w,e} ≤ γ_{w,e+1}, used in Theorem F.22.

Theorem F.20. γ_{w,e} = min{β_{w+j} 2^j 70/169 : e ≤ j ≤ e + 67} if w ∈ {0, 1, 2, ...} and e ∈ {1, 2, 3, ...}. If w + e ≥ 67 then γ_{w,e} = 2^e(633/1024)^{w+e}(70/169) · 633^5/(2^30 · 165219).

Proof. If w + j ≥ 67 then β_{w+j} = (633/1024)^{w+j} · 633^5/(2^30 · 165219), so β_{w+j} 2^j 70/169 = (633/1024)^w (633/512)^j (70/169) · 633^5/(2^30 · 165219), which increases monotonically with j.

In particular, if j ≥ e + 67 then w + j ≥ 67, so β_{w+j} 2^j 70/169 ≥ β_{w+e+67} 2^{e+67} 70/169. Hence γ_{w,e} = inf{β_{w+j} 2^j 70/169 : j ≥ e} = min{β_{w+j} 2^j 70/169 : e ≤ j ≤ e + 67}. Also, if w + e ≥ 67 then w + j ≥ 67 for all j ≥ e, so

γ_{w,e} = (633/1024)^w (633/512)^e (70/169) · 633^5/(2^30 · 165219) = 2^e(633/1024)^{w+e}(70/169) · 633^5/(2^30 · 165219)

as claimed.

Theorem F.21. Define S as the smallest subset of Z × M_2(Q) such that

• (0, ( 1 , 0 ; 0 , 1 )) ∈ S, and

• if (w, P) ∈ S and (w = 0 or |P|_2 > β_w) and e ∈ {1, 2, 3, ...} and |P|_2 > γ_{w,e} and q ∈ {1, 3, 5, ..., 2^{e+1} − 1}, then (e + w, M(e, q)P) ∈ S.

Assume that |P|_2 ≤ α_w for each (w, P) ∈ S. Then |M(e_j, q_j) ··· M(e_1, q_1)|_2 ≤ α_{e_1+···+e_j} for all j ∈ {0, 1, 2, ...}, all e_1, ..., e_j ∈ {1, 2, 3, ...}, and all q_1, ..., q_j ∈ Z such that q_i ∈ {1, 3, 5, ..., 2^{e_i+1} − 1} for each i.


The idea here is that, instead of searching through all products P of matrices M(e, q) of total weight w to see whether |P|_2 ≤ α_w, we prune the search in a way that makes the search finite across all w (see Theorem F.22), while still being sure to find a minimal counterexample if one exists. The first layer of pruning skips all descendants QP of P if w > 0 and |P|_2 ≤ β_w; the point is that any such counterexample QP implies a smaller counterexample Q. The second layer of pruning skips weight-(w + e) children M(e, q)P of P if |P|_2 ≤ γ_{w,e}; the point is that any such child is eliminated by the first layer of pruning.

Proof. Induct on j.

If j = 0 then M(e_j, q_j) ··· M(e_1, q_1) = ( 1 , 0 ; 0 , 1 ), with norm 1 = α_0. Assume from now on that j ≥ 1.

Case 1: |M(e_i, q_i) ··· M(e_1, q_1)|_2 ≤ β_{e_1+···+e_i} for some i ∈ {1, 2, ..., j}.

Then j − i < j, so |M(e_j, q_j) ··· M(e_{i+1}, q_{i+1})|_2 ≤ α_{e_{i+1}+···+e_j} by the inductive hypothesis, so |M(e_j, q_j) ··· M(e_1, q_1)|_2 ≤ β_{e_1+···+e_i} α_{e_{i+1}+···+e_j} by Theorem F.10; but β_{e_1+···+e_i} α_{e_{i+1}+···+e_j} ≤ α_{e_1+···+e_j} by definition of β.

Case 2: |M(e_i, q_i) ··· M(e_1, q_1)|_2 > β_{e_1+···+e_i} for every i ∈ {1, 2, ..., j}.

Suppose that |M(e_{i−1}, q_{i−1}) ··· M(e_1, q_1)|_2 ≤ γ_{e_1+···+e_{i−1},e_i} for some i ∈ {1, 2, ..., j}. By Theorem F.16, |M(e_i, q_i)|_2 < (1 + √2)/2^{e_i} < (169/70)/2^{e_i}. By Theorem F.10,

|M(e_i, q_i) ··· M(e_1, q_1)|_2 ≤ |M(e_i, q_i)|_2 γ_{e_1+···+e_{i−1},e_i} ≤ 169 γ_{e_1+···+e_{i−1},e_i}/(70 · 2^{e_i}) ≤ β_{e_1+···+e_i},

contradiction. Thus |M(e_{i−1}, q_{i−1}) ··· M(e_1, q_1)|_2 > γ_{e_1+···+e_{i−1},e_i} for all i ∈ {1, 2, ..., j}.

Now (e_1 + ··· + e_i, M(e_i, q_i) ··· M(e_1, q_1)) ∈ S for each i ∈ {0, 1, 2, ..., j}. Indeed, induct on i. The base case i = 0 is simply (0, ( 1 , 0 ; 0 , 1 )) ∈ S. If i ≥ 1, then (w, P) ∈ S by the inductive hypothesis, where w = e_1 + ··· + e_{i−1} and P = M(e_{i−1}, q_{i−1}) ··· M(e_1, q_1). Furthermore:

• w = 0 (when i = 1) or |P|_2 > β_w (when i ≥ 2, by definition of this case);

• e_i ∈ {1, 2, 3, ...};

• |P|_2 > γ_{w,e_i} (as shown above); and

• q_i ∈ {1, 3, 5, ..., 2^{e_i+1} − 1}.

Hence (e_1 + ··· + e_i, M(e_i, q_i) ··· M(e_1, q_1)) = (e_i + w, M(e_i, q_i)P) ∈ S by definition of S.

In particular, (w, P) ∈ S with w = e_1 + ··· + e_j and P = M(e_j, q_j) ··· M(e_1, q_1), so |P|_2 ≤ α_w by hypothesis.

Theorem F.22. In Theorem F.21, S is finite, and |P|_2 ≤ α_w for each (w, P) ∈ S.

Proof. This is a computer proof. We run the Sage script in Figure F.23. We observe that it terminates successfully (in slightly over 2546 minutes using Sage 8.6 on one core of a 3.5GHz Intel Xeon E3-1275 v3 CPU) and prints 3787975117.

What the script does is search depth-first through the elements of S, verifying that each (w, P) ∈ S satisfies |P|_2 ≤ α_w. The output is the number of elements searched, so S has at most 3787975117 elements.

Specifically, if (w, P) ∈ S, then verify(w,P) checks that |P|_2 ≤ α_w and recursively calls verify on all descendants of (w, P). Actually, for efficiency, the script scales each matrix to have integer entries: it works with 2^{2e}M(e, q) (labeled scaledM(e,q) in the script) instead of M(e, q), and it works with 4^wP (labeled P in the script) instead of P.

The script computes β_w as shown in Theorem F.18, computes γ_{w,e} as shown in Theorem F.20, and compares |P|_2 to various bounds as shown in Theorem F.12.

If w > 0 and |P|_2 ≤ β_w, then verify(w,P) returns. There are no descendants of (w, P) in this case. Also |P|_2 ≤ α_w, since β_w ≤ α_w/α_0 = α_w.


from alpha import alpha
from memoized import memoized

R = MatrixSpace(ZZ,2)
def scaledM(e,q): return R((0,2^e,-2^e,q))

@memoized
def beta(w): return min(alpha(w+j)/alpha(j) for j in range(68))

@memoized
def gamma(w,e): return min(beta(w+j)*2^j*70/169 for j in range(e,e+68))

def spectralradiusisatmost(PP,N): # assuming PP has form P.transpose()*P
  (a,b),(c,d) = PP
  X = 2*N^2-a-d
  return N>=0 and X>=0 and X^2>=(a-d)^2+4*b^2

def verify(w,P):
  nodes = 1
  PP = P.transpose()*P
  if w>0 and spectralradiusisatmost(PP,4^w*beta(w)): return nodes
  assert spectralradiusisatmost(PP,4^w*alpha(w))
  for e in PositiveIntegers():
    if spectralradiusisatmost(PP,4^w*gamma(w,e)): return nodes
    for q in range(1,2^(e+1),2):
      nodes += verify(e+w,scaledM(e,q)*P)

print verify(0,R(1))

Figure F.23: Sage script verifying |P|_2 ≤ α_w for each (w, P) ∈ S. Output is the number of pairs (w, P) considered, if the verification succeeds; or an assertion failure, if the verification fails.

Otherwise verify(w,P) asserts that |P|_2 ≤ α_w: i.e., it checks that |P|_2 ≤ α_w, and raises an exception if not. It then tries successively e = 1, e = 2, e = 3, etc., continuing as long as |P|_2 > γ_{w,e}. For each e where |P|_2 > γ_{w,e}, verify(w,P) recursively handles (e + w, M(e, q)P) for each q ∈ {1, 3, 5, ..., 2^{e+1} − 1}.

Note that γ_{w,e} ≤ γ_{w,e+1} ≤ ···, so if |P|_2 ≤ γ_{w,e} then also |P|_2 ≤ γ_{w,e+1}, etc. This is where it is important that γ_{w,e} is not simply defined as β_{w+e} 2^e 70/169.

To summarize, verify(w,P) recursively enumerates all descendants of (w, P). Hence verify(0,R(1)) recursively enumerates all elements (w, P) of S, checking |P|_2 ≤ α_w for each (w, P).

Theorem F.24. If j ∈ {0, 1, 2, ...}, e_1, ..., e_j ∈ {1, 2, 3, ...}, q_1, ..., q_j ∈ Z, and q_i ∈ {1, 3, 5, ..., 2^{e_i+1} − 1} for each i, then |M(e_j, q_j) ··· M(e_1, q_1)|_2 ≤ α_{e_1+···+e_j}.

Proof. This is the conclusion of Theorem F.21, so it suffices to check the assumptions. Define S as in Theorem F.21; then |P|_2 ≤ α_w for each (w, P) ∈ S by Theorem F.22.


Theorem F.25. In the situation of Theorem E.3, assume that R_0, R_1, R_2, ..., R_{j+1} are defined. Then |(R_j/2^{e_j}, R_{j+1})|_2 ≤ α_{e_1+···+e_j} |(R_0, R_1)|_2, and if e_1 + ··· + e_j ≥ 67 then e_1 + ··· + e_j ≤ log_{1024/633} |(R_0, R_1)|_2.

If R_0 and R_1 are each bounded by 2^b in absolute value, then |(R_0, R_1)|_2 is bounded by 2^{b+0.5}, so log_{1024/633} |(R_0, R_1)|_2 ≤ (b + 0.5)/log_2(1024/633) = (b + 0.5)(1.4410...).

Proof. By Theorem E.2,

( R_i/2^{e_i} ; R_{i+1} ) = M(e_i, q_i) ( R_{i−1}/2^{e_{i−1}} ; R_i ),

where q_i = 2^{e_i}((R_{i−1}/2^{e_{i−1}}) div_2 R_i) ∈ {1, 3, 5, ..., 2^{e_i+1} − 1}. Hence

( R_j/2^{e_j} ; R_{j+1} ) = M(e_j, q_j) ··· M(e_1, q_1) ( R_0 ; R_1 ).

The product M(e_j, q_j) ··· M(e_1, q_1) has 2-norm at most α_w, where w = e_1 + ··· + e_j, so |(R_j/2^{e_j}, R_{j+1})|_2 ≤ α_w |(R_0, R_1)|_2 by Theorem F.9.

By Theorem F.6, |(R_j/2^{e_j}, R_{j+1})|_2 ≥ 1. If e_1 + ··· + e_j ≥ 67, then α_{e_1+···+e_j} = (633/1024)^{e_1+···+e_j}, so 1 ≤ (633/1024)^{e_1+···+e_j} |(R_0, R_1)|_2.

Theorem F.26. In the situation of Theorem E.3, there exists t ≥ 0 such that R_{t+1} = 0.

Proof. Suppose not. Then R_0, R_1, ..., R_j, R_{j+1} are all defined for arbitrarily large j. Take j ≥ 67 with j > log_{1024/633} |(R_0, R_1)|_2. Then e_1 + ··· + e_j ≥ 67, so j ≤ e_1 + ··· + e_j ≤ log_{1024/633} |(R_0, R_1)|_2 < j by Theorem F.25, contradiction.

G  Proof of main gcd theorem for integers

This appendix shows how the gcd algorithm from Appendix E is related to iterating divstep. This appendix then uses the analysis from Appendix F to put bounds on the number of divstep iterations needed.

Theorem G.1 (2-adic division as a sequence of division steps). In the situation of Theorem E.1, divstep^{2e}(1, f, g/2) = (1, g/2^e, (qg/2^e − f)/2^{e+1}).

In other words, divstep^{2e}(1, f, g/2) = (1, g/2^e, −(f mod_2 g)/2^{e+1}).

Proof. Define

z_e = (g/2^e − f)/2 ∈ Z_2,
z_{e+1} = (z_e + (z_e mod 2)g/2^e)/2 ∈ Z_2,
z_{e+2} = (z_{e+1} + (z_{e+1} mod 2)g/2^e)/2 ∈ Z_2,
...,
z_{2e} = (z_{2e−1} + (z_{2e−1} mod 2)g/2^e)/2 ∈ Z_2.

Then

z_e = (g/2^e − f)/2,
z_{e+1} = ((1 + 2(z_e mod 2))g/2^e − f)/4,
z_{e+2} = ((1 + 2(z_e mod 2) + 4(z_{e+1} mod 2))g/2^e − f)/8,
...,
z_{2e} = ((1 + 2(z_e mod 2) + ··· + 2^e(z_{2e−1} mod 2))g/2^e − f)/2^{e+1}.


In short, z_{2e} = (q′g/2^e − f)/2^{e+1}, where

q′ = 1 + 2(z_e mod 2) + ··· + 2^e(z_{2e−1} mod 2) ∈ {1, 3, 5, ..., 2^{e+1} − 1}.

Hence q′g/2^e − f = 2^{e+1}z_{2e} ∈ 2^{e+1}Z_2, so q′ = q by the uniqueness statement in Theorem E.1. Note that this construction gives an alternative proof of the existence of q in Theorem E.1.

All that remains is to prove divstep^{2e}(1, f, g/2) = (1, g/2^e, (qg/2^e − f)/2^{e+1}), i.e., divstep^{2e}(1, f, g/2) = (1, g/2^e, z_{2e}).

If 1 ≤ i < e then g/2^i ∈ 2Z_2, so divstep(i, f, g/2^i) = (i + 1, f, g/2^{i+1}). By induction, divstep^{e−1}(1, f, g/2) = (e, f, g/2^e). By assumption e > 0 and g/2^e is odd, so divstep(e, f, g/2^e) = (1 − e, g/2^e, (g/2^e − f)/2) = (1 − e, g/2^e, z_e). What remains is to prove that divstep^e(1 − e, g/2^e, z_e) = (1, g/2^e, z_{2e}).

If 0 ≤ i ≤ e − 1 then 1 − e + i ≤ 0, so divstep(1 − e + i, g/2^e, z_{e+i}) = (2 − e + i, g/2^e, (z_{e+i} + (z_{e+i} mod 2)g/2^e)/2) = (2 − e + i, g/2^e, z_{e+i+1}). By induction, divstep^i(1 − e, g/2^e, z_e) = (1 − e + i, g/2^e, z_{e+i}) for 0 ≤ i ≤ e. In particular, divstep^e(1 − e, g/2^e, z_e) = (1, g/2^e, z_{2e}), as claimed.

Theorem G.2 (2-adic gcd as a sequence of division steps). In the situation of Theorem E.3, assume that $R_0, R_1, \ldots, R_j, R_{j+1}$ are defined. Then $\operatorname{divstep}^{2w}(1, R_0, R_1/2) = (1, R_j/2^{e_j}, R_{j+1}/2)$ where $w = e_1 + \cdots + e_j$.

Therefore, if $R_{t+1} = 0$, then $\operatorname{divstep}^n(1, R_0, R_1/2) = (1 + n - 2w, R_t/2^{e_t}, 0)$ for all $n \ge 2w$, where $w = e_1 + \cdots + e_t$.

Proof. Induct on $j$. If $j = 0$ then $w = 0$, so $\operatorname{divstep}^{2w}(1, R_0, R_1/2) = (1, R_0, R_1/2)$, and $R_0 = R_0/2^{e_0}$ since $R_0$ is odd by hypothesis.

If $j \ge 1$ then by definition $R_{j+1} = -((R_{j-1}/2^{e_{j-1}}) \bmod_2 R_j)/2^{e_j}$. Furthermore
$$\operatorname{divstep}^{2e_1+\cdots+2e_{j-1}}(1, R_0, R_1/2) = (1, R_{j-1}/2^{e_{j-1}}, R_j/2)$$
by the inductive hypothesis, so
$$\operatorname{divstep}^{2e_1+\cdots+2e_j}(1, R_0, R_1/2) = \operatorname{divstep}^{2e_j}(1, R_{j-1}/2^{e_{j-1}}, R_j/2) = (1, R_j/2^{e_j}, R_{j+1}/2)$$
by Theorem G.1.

Theorem G.3. Let $R_0$ be an odd element of $\mathbb{Z}$. Let $R_1$ be an element of $2\mathbb{Z}$. Then there exists $n \in \{0, 1, 2, \ldots\}$ such that $\operatorname{divstep}^n(1, R_0, R_1/2) \in \mathbb{Z} \times \mathbb{Z} \times \{0\}$. Furthermore, any such $n$ has $\operatorname{divstep}^n(1, R_0, R_1/2) \in \mathbb{Z} \times \{G, -G\} \times \{0\}$, where $G = \gcd\{R_0, R_1\}$.

Proof. Define $e_0, e_1, e_2, e_3, \ldots$ and $R_2, R_3, \ldots$ as in Theorem E.2. By Theorem F.26 there exists $t \ge 0$ such that $R_{t+1} = 0$. By Theorem E.3, $R_t/2^{e_t} \in \{G, -G\}$. By Theorem G.2, $\operatorname{divstep}^{2w}(1, R_0, R_1/2) = (1, R_t/2^{e_t}, 0) \in \mathbb{Z} \times \{G, -G\} \times \{0\}$, where $w = e_1 + \cdots + e_t$.

Conversely, assume that $n \in \{0, 1, 2, \ldots\}$ and that $\operatorname{divstep}^n(1, R_0, R_1/2) \in \mathbb{Z} \times \mathbb{Z} \times \{0\}$. Write $\operatorname{divstep}^n(1, R_0, R_1/2)$ as $(\delta, f, 0)$. Then $\operatorname{divstep}^{n+k}(1, R_0, R_1/2) = (k+\delta, f, 0)$ for all $k \in \{0, 1, 2, \ldots\}$, so in particular $\operatorname{divstep}^{2w+n}(1, R_0, R_1/2) = (2w+\delta, f, 0)$. Also $\operatorname{divstep}^{2w+k}(1, R_0, R_1/2) = (k+1, R_t/2^{e_t}, 0)$ for all $k \in \{0, 1, 2, \ldots\}$, so in particular $\operatorname{divstep}^{2w+n}(1, R_0, R_1/2) = (n+1, R_t/2^{e_t}, 0)$. Hence $f = R_t/2^{e_t} \in \{G, -G\}$ and $\operatorname{divstep}^n(1, R_0, R_1/2) \in \mathbb{Z} \times \{G, -G\} \times \{0\}$ as claimed.

Theorem G.4. Let $R_0$ be an odd element of $\mathbb{Z}$. Let $R_1$ be an element of $2\mathbb{Z}$. Define $b = \log_2\sqrt{R_0^2 + R_1^2}$. Assume that $b \le 21$. Then $\operatorname{divstep}^{\lfloor 19b/7 \rfloor}(1, R_0, R_1/2) \in \mathbb{Z} \times \mathbb{Z} \times \{0\}$.

Proof. If $\operatorname{divstep}(\delta, f, g) = (\delta_1, f_1, g_1)$ then $\operatorname{divstep}(\delta, -f, -g) = (\delta_1, -f_1, -g_1)$. Hence $\operatorname{divstep}^n(1, R_0, R_1/2) \in \mathbb{Z} \times \mathbb{Z} \times \{0\}$ if and only if $\operatorname{divstep}^n(1, -R_0, -R_1/2) \in \mathbb{Z} \times \mathbb{Z} \times \{0\}$. From now on assume without loss of generality that $R_0 > 0$.


 s        W_s     R_0    R_1    |   s            W_s      R_0       R_1
 0          1       1      0    |  29        4598269     1355     -1662
 1          5       1     -2    |  30        7839829      777     -2690
 2          5       1     -2    |  31       12565573     2433     -2578
 3          5       1     -2    |  32       22372085     2969     -3682
 4         17       1     -4    |  33       35806445     5443     -2486
 5         17       1     -4    |  34       71013905     4097     -7364
 6         65       1     -8    |  35       74173637     5119     -6926
 7         65       1     -8    |  36      205509533     9973    -10298
 8        157      11     -6    |  37      226964725    11559     -9662
 9        181       9    -10    |  38      487475029    11127    -19070
10        421      15    -14    |  39      534274997    13241    -18946
11        541      21    -10    |  40     1543129037    24749    -30506
12       1165      29    -18    |  41     1639475149    20307    -35030
13       1517      29    -26    |  42     3473731181    39805    -43466
14       2789      17    -50    |  43     4500780005    49519    -45262
15       3653      47    -38    |  44     9497198125    38051    -89718
16       8293      47    -78    |  45    12700184149    82743    -76510
17       8293      47    -78    |  46    29042662405   152287    -76494
18      24245     121    -98    |  47    36511782869   152713   -114850
19      27361      55   -156    |  48    82049276437   222249   -180706
20      56645     191   -142    |  49    90638999365   220417   -205074
21      79349     215   -182    |  50   234231548149   229593   -426050
22     132989     283   -230    |  51   257130797053   338587   -377478
23     213053     133   -442    |  52   466845672077   301069   -613354
24     415001     451   -460    |  53   622968125533   437163   -657158
25     613405     501   -602    |  54  1386285660565   600809  -1012578
26    1345061     719   -910    |  55  1876581808109   604547  -1229270
27    1552237    1021   -714    |  56  4208169535453  1352357  -1542498
28    2866525     789  -1498    |

Figure G.5: For each $s \in \{0, 1, \ldots, 56\}$: minimum $R_0^2 + R_1^2$ where $R_0$ is an odd integer, $R_1$ is an even integer, and $(R_0, R_1)$ requires $\ge s$ iterations of divstep; also an example of $(R_0, R_1)$ reaching this minimum.

The rest of the proof is a computer proof. We enumerate all pairs $(R_0, R_1)$ where $R_0$ is a positive odd integer, $R_1$ is an even integer, and $R_0^2 + R_1^2 \le 2^{42}$. For each $(R_0, R_1)$ we iterate divstep to find the minimum $n \ge 0$ where $\operatorname{divstep}^n(1, R_0, R_1/2) \in \mathbb{Z} \times \mathbb{Z} \times \{0\}$. We then say that $(R_0, R_1)$ "needs $n$ steps".

For each $s \ge 0$, define $W_s$ as the smallest $R_0^2 + R_1^2$ where $(R_0, R_1)$ needs $\ge s$ steps. The computation shows that each $(R_0, R_1)$ needs $\le 56$ steps, and also shows the values $W_s$ for each $s \in \{0, 1, \ldots, 56\}$. We display these values in Figure G.5. We check for each $s \in \{0, 1, \ldots, 56\}$ that $2^{14s} \le W_s^{19}$.

Finally, we claim that each $(R_0, R_1)$ needs $\le \lfloor 19b/7 \rfloor$ steps, where $b = \log_2\sqrt{R_0^2 + R_1^2}$. If not, then $(R_0, R_1)$ needs $\ge s$ steps where $s = \lfloor 19b/7 \rfloor + 1$. We have $s \le 56$ since $(R_0, R_1)$ needs $\le 56$ steps. We also have $W_s \le R_0^2 + R_1^2$ by definition of $W_s$. Hence $2^{14s/19} \le R_0^2 + R_1^2$, so $14s/19 \le 2b$, so $s \le 19b/7$, contradiction.

Theorem G.6 (Bounds on the number of divsteps required). Let $R_0$ be an odd element of $\mathbb{Z}$. Let $R_1$ be an element of $2\mathbb{Z}$. Define $b = \log_2\sqrt{R_0^2 + R_1^2}$. Define $n = \lfloor 19b/7 \rfloor$ if $b \le 21$; $n = \lfloor (49b + 23)/17 \rfloor$ if $21 < b \le 46$; and $n = \lfloor 49b/17 \rfloor$ if $b > 46$. Then $\operatorname{divstep}^n(1, R_0, R_1/2) \in \mathbb{Z} \times \mathbb{Z} \times \{0\}$.


All three cases have $n \le \lfloor (49b + 23)/17 \rfloor$ since $19/7 < 49/17$.

Proof. If $b \le 21$ then $\operatorname{divstep}^n(1, R_0, R_1/2) \in \mathbb{Z} \times \mathbb{Z} \times \{0\}$ by Theorem G.4.

Assume from now on that $b > 21$. This implies $n \ge \lfloor 49b/17 \rfloor \ge 60$.

Define $e_0, e_1, e_2, e_3, \ldots$ and $R_2, R_3, \ldots$ as in Theorem E.2. By Theorem F.26 there exists $t \ge 0$ such that $R_{t+1} = 0$. By Theorem E.3, $R_t/2^{e_t} \in \{G, -G\}$. By Theorem G.2, $\operatorname{divstep}^{2w}(1, R_0, R_1/2) = (1, R_t/2^{e_t}, 0) \in \mathbb{Z} \times \{G, -G\} \times \{0\}$, where $w = e_1 + \cdots + e_t$.

To finish the proof, we will show that $2w \le n$; hence $\operatorname{divstep}^n(1, R_0, R_1/2) \in \mathbb{Z} \times \mathbb{Z} \times \{0\}$ as claimed.

Suppose that $2w > n$. Both $2w$ and $n$ are integers, so $2w \ge n+1 \ge 61$, so $w \ge 31$. By Theorem F.6 and Theorem F.25, $1 \le |(R_t/2^{e_t}, R_{t+1})|_2 \le \alpha_w |(R_0, R_1)|_2 = \alpha_w 2^b$, so $\log_2(1/\alpha_w) \le b$.

Case 1: $w \ge 67$. By definition $1/\alpha_w = (1024/633)^w$, so $w \log_2(1024/633) \le b$. We have $34/49 < \log_2(1024/633)$, so $34w/49 < b$, so $n+1 \le 2w < 49b/17 < (49b+23)/17$. This contradicts the definition of $n$ as the largest integer below $49b/17$ if $b > 46$, and as the largest integer below $(49b+23)/17$ if $b \le 46$.

Case 2: $w \le 66$. Then $\alpha_w < 2^{-(34w-23)/49}$ since $w \ge 31$, so $b \ge \log_2(1/\alpha_w) > (34w-23)/49$, so $n+1 \le 2w \le (49b+23)/17$. This again contradicts the definition of $n$ if $b \le 46$. If $b > 46$ then $n \ge \lfloor 49 \cdot 46/17 \rfloor = 132$, so $2w \ge n+1 > 132 \ge 2w$, contradiction.
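To make the bound of Theorem G.6 concrete, here is a plain-Python sketch (ours, not the paper's constant-time software) that iterates the integer divstep of Section 8, restated from the formulas used in the proof of Theorem G.1, and checks the conclusions of Theorems G.3 and G.6 on the extreme pair from Figure G.5. Python integers are variable-time, so this is only a correctness check; the helper names are ours.

    # Sketch: iterate divstep and compare against the bound n(b) of Theorem G.6.
    from math import floor, log2, sqrt, gcd

    def divstep(delta, f, g):
        # integer division step: f odd
        if delta > 0 and g % 2 == 1:
            return 1 - delta, g, (g - f) // 2
        return 1 + delta, f, (g + (g % 2) * f) // 2

    def iterations_bound(R0, R1):
        b = log2(sqrt(R0*R0 + R1*R1))
        if b <= 21: return floor(19*b/7)
        if b <= 46: return floor((49*b + 23)/17)
        return floor(49*b/17)

    def check(R0, R1):                  # R0 odd, R1 even
        delta, f, g = 1, R0, R1 // 2
        for _ in range(iterations_bound(R0, R1)):
            delta, f, g = divstep(delta, f, g)
        assert g == 0 and abs(f) == gcd(R0, R1)   # Theorems G.3 and G.6

    check(1352357, -1542498)            # the s = 56 example in Figure G.5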

H  Comparison to performance of cdivstep

This appendix briefly compares the analysis of divstep from Appendix G to an analogous analysis of cdivstep.

First, modify Theorem E.1 to take the unique element $q \in \{\pm 1, \pm 3, \ldots, \pm(2^e-1)\}$ such that $f + qg/2^e \in 2^{e+1}\mathbb{Z}_2$. The corresponding modification of Theorem G.1 says that $\operatorname{cdivstep}^{2e}(1, f, g/2) = (1, g/2^e, (f + qg/2^e)/2^{e+1})$. The proof proceeds identically, except that $z_j = (z_{j-1} - (z_{j-1} \bmod 2)g/2^e)/2 \in \mathbb{Z}_2$ for $e < j < 2e$, and $z_{2e} = (z_{2e-1} + (z_{2e-1} \bmod 2)g/2^e)/2 \in \mathbb{Z}_2$. Also, we have $q = -1 - 2(z_e \bmod 2) - \cdots - 2^{e-1}(z_{2e-2} \bmod 2) + 2^e(z_{2e-1} \bmod 2) \in \{\pm 1, \pm 3, \ldots, \pm(2^e-1)\}$. In the notation of [62], $\operatorname{GB}(f, g) = (q, r)$ where $r = f + qg/2^e = 2^{e+1}z_{2e}$.

The corresponding gcd algorithm is the centered algorithm from Appendix E.6, with addition instead of subtraction. Recall that, except for inessential scalings by powers of 2, this is the same as the Stehlé–Zimmermann gcd algorithm from [62].

The corresponding set of transition matrices is different from the divstep case. As mentioned in Appendix F, there are weight-$w$ products with spectral radius $0.6403882032\ldots^{\,w}$, so the proof strategy for cdivstep cannot obtain weight-$w$ norm bounds below this spectral radius. This is worse than our bounds for divstep.

Enumerating small pairs $(R_0, R_1)$ as in Theorem G.4 also produces worse results for cdivstep than for divstep. See Figure H.1.


[Figure H.1: scatter plot; only the caption is reproduced here.]

Figure H.1: For each $s \in \{1, 2, \ldots, 56\}$: red dot at $(b, s - 2.8283396\,b)$ where $b$ is the minimum $\log_2\sqrt{R_0^2 + R_1^2}$ such that $R_0$ is an odd integer, $R_1$ is an even integer, and $(R_0, R_1)$ requires $\ge s$ iterations of divstep. For each $s \in \{1, 2, \ldots, 58\}$: blue dot at $(b, s - 2.8283396\,b)$ where $b$ is the minimum $\log_2\sqrt{R_0^2 + R_1^2}$ such that $R_0$ is an odd integer, $R_1$ is an even integer, and $(R_0, R_1)$ requires $\ge s$ iterations of cdivstep.



However in cryptography these algorithms are dangerous The central problem is thatthese algorithms have conditional branches that depend on the inputs Often these inputsare secret and the branches leak information to the attacker through cache timing branchtiming etc For example AldayandashGarciacuteandashTapiandashBrumley [7] recently presented a single-trace cache-timing attack successfully extracting RSA keys from OpenSSLrsquos implementationof Steinrsquos binary gcd algorithm

A common defense against these attacks1 is to compute modular reciprocals in a completely different way, namely via Fermat's little theorem: if $f$ is a nonzero element of a finite field $\mathbb{F}_q$ then $1/f = f^{q-2}$. For example, Bernstein's Curve25519 paper [10] computes a ratio modulo $p = 2^{255} - 19$ using "a straightforward sequence of 254 squarings and 11 multiplications" and says that this is "about 7% of the Curve25519 computation". Somewhat more multiplications are required for "random" primes, as in the new CSIDH [28] post-quantum system, but the bottom line is that a $b$-bit Fermat-style inversion is not much more expensive than $b$ squarings, while a $b$-bit elliptic-curve variable-base-point scalar multiplication is an order of magnitude more expensive.
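For concreteness, Fermat-style inversion modulo $2^{255}-19$ can be written in a few lines of Python (our illustration only; real implementations use a fixed addition chain of 254 squarings and 11 multiplications and run in constant time, unlike this generic exponentiation):

    # Fermat's little theorem: 1/f = f^(p-2) mod p for prime p and f != 0.
    p = 2**255 - 19
    def fermat_inverse(f):
        return pow(f, p - 2, p)      # generic square-and-multiply, not constant-time

    assert fermat_inverse(5) * 5 % p == 1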

However the bigger picture is that simply saying ldquouse Fermat for inversion and stopworrying about the speed of Euclidrsquos algorithmrdquo is not satisfactory

bull There are many cryptographic computations where inversion is a much larger fractionof the total time For example typical strategies for ECC signature generation (seeeg [16 Section 4]) involve only about 14 as many elliptic-curve operations makingthe cost of inversion2 much more noticeable As another example polynomialinversion is the primary bottleneck in key generation3 in the NTRU lattice-basedcryptosystem

bull There are many cryptographic computations that rely on Euclidrsquos algorithm foroperations that Fermatrsquos method cannot handle For example decryption in theMcEliece code-based cryptosystem (see eg [13]) relies critically on a half-gcdcomputation This cryptosystem has large public keys but has attracted interest4

for its very strong security track record Many newer proposals with smaller publickeys also rely on half-gcd computations5

These issues have prompted some work on variants of Euclidrsquos algorithm protected againsttiming attacks For example [13] uses a constant-time version of the BerlekampndashMasseyalgorithm inside the McEliece cryptosystem see also [36 Algorithm 10] and [30] Asanother example Bos [21] developed a constant-time version of Kaliskirsquos algorithm andreported in [21 Table 1] that this took 486000 ARM Cortex-A8 cycles for inversion moduloa 254-bit prime For comparison Bernstein and Schwabe [17] had reported 527102 ARMCortex-A8 cycles for an entire Curve25519 scalar multiplication including Fermat inversionan order of magnitude faster than the Euclid inversion in [21]

Euclidrsquos algorithm becomes more competitive for ldquorandomrdquo moduli since these modulimake the modular reductions in Fermatrsquos method more expensive6 It is also easy to see

1There is also a tremendous amount of work aimed at defending against power attacks electromagneticattacks and more broadly attackers with physical sensors close to the computation being carried out Forexample one typically tries to protect modular inversion by multiplying by r before and after each inversionwhere r is a random invertible quantity This paper focuses on protecting deterministic algorithms againsttiming attacks

2This refers to inversion in Fp when the elliptic curve is defined over Fp to convert an elliptic-curvepoint from projective coordinates to affine coordinates Applications that use ECDSA rather thanSchnorr-style signature systems such as Ed25519 also need inversions modulo the group order

3The standard argument that key-generation performance is important is that for forward secrecy TLSgenerates a public key and uses the key just once For several counterarguments see [14 Appendix T2]

4For example two of the round-2 submissions to the NIST post-quantum competition ClassicMcEliece [12] and NTS-KEM [6] are based directly on the McEliece cryptosystem

5 See, e.g., the HQC [4] and LAC [45] round-2 submissions to the NIST competition.
6 Standard estimates are that squaring modulo a random prime is 2 or 3 times slower than squaring


that Euclidrsquos algorithm beats Fermatrsquos method for all large enough primes for n-bit primesEuclidrsquos algorithm uses Θ(n) simple linear-time operations such as big-integer subtractionswhile Fermatrsquos method uses Θ(n) big-integer multiplications The Euclid-vs-Fermat cutoffdepends on the cost ratio between big-integer multiplications and big-integer subtractionsso one expects the cutoff to be larger on CPUs with large hardware multipliers but manyinput sizes of interest in cryptography are clearly beyond the cutoff Both [39 Algorithm10] and [14 Appendix T2] use constant-time variants of the EuclidndashStevin algorithm forpolynomial inversions in NTRU key generation each polynomial here has nearly 10000bits in about 700 coefficients

1.1  Contributions of this paper. This paper introduces simple fast constant-time "division steps". When division steps are iterated a constant number of times, they reveal the gcd of the inputs, the modular reciprocal when the inputs are coprime, etc. Division steps work the same way for integer inputs and for polynomial inputs, and are suitable for all applications of Euclid's algorithm that we have investigated. As an illustration, we display in Figure 1.2 an integer gcd algorithm using division steps.

This paper uses two case studies to show that these division steps are much faster thanprevious constant-time variants of Euclidrsquos algorithm One case study (Section 12) is aninteger-modular-inversion problem selected to be extremely favorable to Fermatrsquos method

bull The platforms are the popular Intel Haswell Skylake and Kaby Lake CPU coresEach of these CPU cores has a large hardware area devoted to fast multiplications

bull The modulus is the Curve25519 prime 2255 minus 19 Sparse primes allow very fastreductions and the performance of multiplication modulo this particular prime hasbeen studied in detail on many platforms

The latest speed records for inversion modulo 2255 minus 19 on the platforms mentioned aboveare from the recent NathndashSarkar paper ldquoEfficient inversion in (pseudo-)Mersenne primeorder fieldsrdquo [54] which takes 11854 cycles 9301 cycles and 8971 cycles on Haswell Skylakeand Kaby Lake respectively We achieve slightly better inversion speeds on each of theseplatforms 10050 cycles 8778 cycles and 8543 cycles We emphasize the Fermat-friendlynature of this case study It is safe to predict that our algorithm will produce much largerspeedups compared to Fermatrsquos method on CPUs with smaller multipliers on FPGAs andon ASICs7 For example on an ARM Cortex-A7 (only a 32-bit multiplier) our inversionmodulo 2255 minus 19 takes 35277 cycles while the best prior result (Fujii and Aranha [34]using Fermat) was 62648 cycles

The other case study (Section 7) is key generation in the ntruhrss701 cryptosystemfrom the HuumllsingndashRijneveldndashSchanckndashSchwabe paper ldquoHigh-speed key encapsulation fromNTRUrdquo [39] at CHES 2017 We again take an Intel core for comparability specificallyHaswell since [39] selected Haswell That paper used 150000 Haswell cycles to invert apolynomial modulo (x701 minus 1)(x minus 1) with coefficients modulo 3 We reduce the costof this inversion to just 90000 Haswell cycles We also speed up key generation in thesntrup4591761 [14] cryptosystem

1.3  Comparison to previous constant-time algorithms. Earlier constant-time variants of Euclid's algorithm and the Euclid–Stevin algorithm are like our algorithm in that they consist of a prespecified number of constant-time steps. Each step of the previous algorithms might seem rather simple at first glance. However, as illustrated by our new

modulo a special prime Combining these estimates with the ARM Cortex-A8 speeds from [17] suggeststhat Fermat inversion for a random 256-bit prime should take roughly 100000 Cortex-A8 cycles still muchfaster than the Euclid inversion from [21] It is claimed in [21 Table 2] that Fermat inversion takes 584000Cortex-A8 cycles presumably this Fermat software was not properly optimized

7We also expect spinoffs in quantum computation See eg the recent constant-time version of Kaliskirsquosalgorithm by RoettelerndashNaehrigndashSvorendashLauter [57 Section 34] in the context of Shorrsquos algorithm tocompute elliptic-curve discrete logarithms


    def shortgcd2(f, g):
        delta, f, g = 1, ZZ(f), ZZ(g)
        assert f & 1
        m = 4 + 3 * max(f.nbits(), g.nbits())
        for loop in range(m):
            if delta > 0 and g & 1:
                delta, f, g = -delta, g, -f
            delta, g = 1 + delta, (g + (g & 1) * f) // 2
        return abs(f)

Figure 1.2: Example of a gcd algorithm for integers using this paper's division steps. The f input is assumed to be odd. Each iteration of the main loop replaces $(\delta, f, g)$ with $\operatorname{divstep}(\delta, f, g)$. Correctness follows from Theorem 11.2. The algorithm is expressed in the Sage [58] computer-algebra system. The algorithm is not constant-time as shown, but can be made constant-time with low overhead; see Section 5.2.

speed records the details matter Our division steps are particularly simple allowing veryefficient computations

Each step of the constant-time algorithms of Bos [21] and RoettelerndashNaehrigndashSvorendashLauter [57 Section 34] like each step in the algorithms of Stein and Kaliski consists of alimited number of additions and subtractions followed by a division by 2 The decisionsof which operations to perform are based on (1) the bottom bits of the integers and (2)which integer is largermdasha comparison that sometimes needs to inspect many bits of theintegers A constant-time comparison inspects all bits

The time for this comparison might not seem problematic in context subtraction alsoinspects all bits and produces the result of the comparison as a side effect But this requiresa very wide data flow in the algorithm One cannot decide what needs to be be done ina step without inspecting all bits produced by the previous step For our algorithm thebottom t bits of the inputs (or the bottom t coefficients in the polynomial case) decide thelinear combinations used in the next t division steps so the data flow is much narrower

For the polynomial case size comparison is simply a matter of comparing polynomialdegrees but there are still many other details that affect performance as illustrated byour 17times speedup for the ntruhrss701 polynomial-inversion problem mentioned aboveSee Section 7 for a detailed comparison of our algorithm in this case to the algorithmin [39] Important issues include the number of control variables manipulated in the innerloop the complexity of the arithmetic operations being carried out and the way that theoutput is scaled

The closest previous algorithms that we have found in the literature are algorithms forldquosystolicrdquo hardware developed in the 1980s by Bojanczyk Brent and Kung See [24] forinteger gcd [19] for integer inversion and [23] for the polynomial case These algorithmshave predictable output scaling a single control variable δ (optionally represented in unary)beyond the loop counter and the same basic feature that several bottom bits decide linearcombinations used for several steps Our algorithms are nevertheless simpler and fasterfor a combination of reasons First our division steps have fewer case distinctions thanthe steps in [24] (and [19]) Second our data flow can be decomposed into a ldquoconditionalswaprdquo followed by an ldquoeliminationrdquo (see Section 3) whereas the data flow in [23] requiresa more complicated routing of inputs Third surprisingly we use fewer iterations than[24] Fourth for us t bits determine linear combinations for exactly t steps whereas this isnot exactly true in [24] Fifth we gain considerable speed in eg Curve25519 inversionby jumping through division steps

1.4  Comparison to previous subquadratic algorithms. An easy consequence of our data flow is a constant-time algorithm where the number of operations is subquadratic


in the input size: asymptotically $n(\log n)^{2+o(1)}$. This algorithm has important advantages over earlier subquadratic gcd algorithms, as we now explain.

The starting point for subquadratic gcd algorithms is the following idea from Lehmer [44]The quotients at the beginning of gcd computation depend only on the most significantbits of the inputs After computing the corresponding 2times 2 transition matrix one canretroactively multiply this matrix by the remaining bits of the inputs and then continuewith the new hopefully smaller inputs

A closer look shows that it is not easy to formulate a precise statement about theamount of progress that one is guaranteed to make by inspecting j most significant bitsof the inputs We return to this difficulty below Assume for the moment that j bits ofthe inputs determine roughly j bits of initial information about the quotients in Euclidrsquosalgorithm

Lehmer's paper predated subquadratic multiplication but is well known to be helpful even with quadratic multiplication. For example, consider the following procedure starting with 40-bit integers a and b: add a to b, then add the new b to a, and so on back and forth, for a total of 30 additions. The results fit into words on a typical 64-bit CPU, and this procedure might seem to be efficiently using the CPU's arithmetic instructions. However, it is generally faster to directly compute $832040a + 514229b$ and $1346269a + 832040b$ using the CPU's fast multiplication circuits.
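A short check (ours) confirms that the 30 alternating additions compute exactly these two linear combinations; the coefficients are consecutive Fibonacci numbers:

    # 30 alternating additions on (a, b) versus the two direct linear combinations.
    a, b = 1 << 39, (1 << 39) + 12345        # any 40-bit starting values
    a0, b0 = a, b
    for i in range(15):                      # 2 additions per iteration = 30 additions
        b += a
        a += b
    assert b == 832040*a0 + 514229*b0
    assert a == 1346269*a0 + 832040*b0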

Faster algorithms for multiplication make Lehmer's approach more effective. Knuth [41], using Lehmer's idea recursively on top of FFT-based integer multiplication, achieved cost $n(\log n)^{5+o(1)}$ for integer gcd. Schönhage [60] improved 5 to 2. Subsequent papers such as [53], [22], and [66] adapted the ideas to the polynomial case.

The subquadratic algorithms appearing in [41] [60] [22 Algorithm EMGCD] [66Algorithm SCH] [62 Algorithm Half-GB-gcd] [52 Algorithms HGCD-Q and HGCD-Band SGCD and HGCD-D] [25 Algorithm 31] etc all have the following structure Thereis an initial recursive call that computes a 2times 2 transition matrix there is a computationof reduced inputs using a fast multiplication subroutine there is a call to a fast divisionsubroutine producing a possibly large quotient and remainder and then there is morerecursion

We emphasize the role of the division subroutine in each of these algorithms If therecursive computation is expressed as a tree then each node in the tree calls the left subtreerecursively multiplies by a transition matrix calls a division subroutine and calls theright subtree recursively The results of the left subtree are described by some number ofquotients in Euclidrsquos algorithm so variations in the sizes of quotients produce irregularitiesin the amount of information computed by the recursions These irregularities can bequite severe multiplying by the transition matrix does not make any progress when theinputs have sufficiently different sizes The call to a division subroutine is a bandage overthese irregularities guaranteeing significant progress in all cases

Our subquadratic algorithm has a simpler structure We guarantee that a recursivecall to a size-j subtree always jumps through exactly j division steps Each node in thecomputation tree involves multiplications and recursive calls but the large variable-timedivision subroutine has disappeared and the underlying irregularities have also disappearedOne can think of each Euclid-style division as being decomposed into a series of our simplerconstant-time division steps but our recursion does not care where each Euclid-styledivision begins and ends See Section 12 for a case study of the speeds that we achieve

A prerequisite for this regular structure is that the linear combinations in j of ourdivision steps are determined by the bottom j bits of the integer inputs8 without regardto top bits This prerequisite was almost achieved by the BrentndashKung ldquosystolicrdquo gcdalgorithm [24] mentioned above but Brent and Kung did not observe that the data flow

8For polynomials one can work from the bottom or from the top since there are no carries Wechoose to work from bottom coefficients in the polynomial case to emphasize the similarities between thepolynomial case and the integer case


allows a subquadratic algorithm. The Stehlé–Zimmermann subquadratic "binary recursive gcd" algorithm [62] also emphasizes that it works with bottom bits, but it has the same irregularities as earlier subquadratic algorithms, and it again needs a large variable-time division subroutine.

Often the literature considers the "normal" case of polynomial gcd computation. This is by definition the case that each Euclid–Stevin quotient has degree 1. The conventional recursion then returns a predictable amount of information, allowing a more regular algorithm such as [53, Algorithm PGCD].9 Our algorithm achieves the same regularity for the general case.

Another previous algorithm that achieves the same regularity for a different special case of polynomial gcd computation is Thomé's subquadratic variant [68, Program 4.1] of the Berlekamp–Massey algorithm. Our algorithm can be viewed as extending this regularity to arbitrary polynomial gcd computations, and beyond this to integer gcd computations.

2  Organization of the paper

We start with the polynomial case. Section 3 defines division steps. Section 5, relying on theorems in Section 4, states our main algorithm to compute c coefficients of the nth iterate of divstep. This takes $n(c+n)$ simple operations. We also explain how "jumps" reduce the cost for large n to $(c+n)(\log cn)^{2+o(1)}$ operations. All of these algorithms take constant time, i.e., time independent of the input coefficients, for any particular (n, c).

There is a long tradition in number theory of defining Euclidrsquos algorithm continuedfractions etc for inputs in a ldquocompleterdquo field such as the field of real numbers or the fieldof power series We define division steps for power series and state our main algorithmfor power series Division steps produce polynomial outputs from polynomial inputs sothe reader can restrict attention to polynomials if desired but there are advantages offollowing the tradition as we now explain

Requiring algorithms to work for general power series means that the algorithms canonly inspect a limited number of leading coefficients of each input without regard to thedegree of inputs that happen to be polynomials This restriction simplifies algorithmdesign and helped us design our main algorithm

Going beyond polynomials also makes it easy to describe further algorithmic optionssuch as iterating divstep on inputs (1 gf) instead of (f g) Restricting to polynomialinputs would require gf to be replaced with a nearby polynomial the limited precision ofthe inputs would affect the table of divstep iterates rather than merely being an option inthe algorithms

Section 6 relying on theorems in Appendix A applies our main algorithm to gcdcomputation and modular inversion Here the inputs are polynomials A constant fractionof the power-series coefficients used in our main algorithm are guaranteed to be 0 in thiscase and one can save some time by skipping computations involving those coefficientsHowever it is still helpful to factor the algorithm-design process into (1) designing the fastpower-series algorithm and then (2) eliminating unused values for special cases

Section 7 is a case study of the software speed of modular inversion of polynomials aspart of key generation in the NTRU cryptosystem

The remaining sections of the paper consider the integer case Section 8 defines divisionsteps for integers Section 10 relying on theorems in Section 9 states our main algorithm tocompute c bits of the nth iterate of divstep Section 11 relying on theorems in Appendix Gapplies our main algorithm to gcd computation and modular inversion Section 12 is a case

9The analysis in [53 Section VI] considers only normal inputs (and power-of-2 sizes) The algorithmworks only for normal inputs The performance of division in the non-normal case was discussed in [53Section VIII] but incorrectly described as solely an issue for the base case of the recursion


study of the software speed of inversion of integers modulo primes used in elliptic-curvecryptography

2.1  General notation. We write $M_2(R)$ for the ring of $2 \times 2$ matrices over a ring R. For example, $M_2(\mathbb{R})$ is the ring of $2 \times 2$ matrices over the field $\mathbb{R}$ of real numbers.

2.2  Notation for the polynomial case. Fix a field k. The formal-power-series ring $k[[x]]$ is the set of infinite sequences $(f_0, f_1, \ldots)$ with each $f_i \in k$. The ring operations are:

• $0 = (0, 0, 0, \ldots)$;

• $1 = (1, 0, 0, \ldots)$;

• $+\colon (f_0, f_1, \ldots), (g_0, g_1, \ldots) \mapsto (f_0 + g_0, f_1 + g_1, \ldots)$;

• $-\colon (f_0, f_1, \ldots), (g_0, g_1, \ldots) \mapsto (f_0 - g_0, f_1 - g_1, \ldots)$;

• $\cdot\colon (f_0, f_1, f_2, \ldots), (g_0, g_1, g_2, \ldots) \mapsto (f_0 g_0,\ f_0 g_1 + f_1 g_0,\ f_0 g_2 + f_1 g_1 + f_2 g_0,\ \ldots)$.

The element $(0, 1, 0, \ldots)$ is written "x", and the entry $f_i$ in $f = (f_0, f_1, \ldots)$ is called the "coefficient of $x^i$ in f". As in the Sage [58] computer-algebra system, we write $f[i]$ for the coefficient of $x^i$ in f. The "constant coefficient" $f[0]$ has a more standard notation $f(0)$, and we use the $f(0)$ notation when we are not also inspecting other coefficients.

The unit group of $k[[x]]$ is written $k[[x]]^*$. This is the same as the set of $f \in k[[x]]$ such that $f(0) \ne 0$.

The polynomial ring $k[x]$ is the subring of sequences $(f_0, f_1, \ldots)$ that have only finitely many nonzero coefficients. We write $\deg f$ for the degree of $f \in k[x]$; this means the maximum i such that $f[i] \ne 0$, or $-\infty$ if $f = 0$. We also define $f[-\infty] = 0$, so $f[\deg f]$ is always defined. Beware that the literature sometimes instead defines $\deg 0 = -1$, and in particular Sage defines $\deg 0 = -1$.

We define $f \bmod x^n$ as $f[0] + f[1]x + \cdots + f[n-1]x^{n-1}$ for any $f \in k[[x]]$. We define $f \bmod g$, for any $f, g \in k[x]$ with $g \ne 0$, as the unique polynomial r such that $\deg r < \deg g$ and $f - r = gq$ for some $q \in k[x]$.

The ring $k[[x]]$ is a domain. The formal-Laurent-series field $k((x))$ is by definition the field of fractions of $k[[x]]$. Each element of $k((x))$ can be written (not uniquely) as $f/x^i$ for some $f \in k[[x]]$ and some $i \in \{0, 1, 2, \ldots\}$. The subring $k[1/x]$ is the smallest subring of $k((x))$ containing both k and $1/x$.

2.3  Notation for the integer case. The set of infinite sequences $(f_0, f_1, \ldots)$ with each $f_i \in \{0, 1\}$ forms a ring under the following operations:

• $0 = (0, 0, 0, \ldots)$;

• $1 = (1, 0, 0, \ldots)$;

• $+\colon (f_0, f_1, \ldots), (g_0, g_1, \ldots) \mapsto$ the unique $(h_0, h_1, \ldots)$ such that, for every $n \ge 0$, $h_0 + 2h_1 + 4h_2 + \cdots + 2^{n-1}h_{n-1} \equiv (f_0 + 2f_1 + 4f_2 + \cdots + 2^{n-1}f_{n-1}) + (g_0 + 2g_1 + 4g_2 + \cdots + 2^{n-1}g_{n-1}) \pmod{2^n}$;

• $-$: same with $(f_0 + 2f_1 + 4f_2 + \cdots + 2^{n-1}f_{n-1}) - (g_0 + 2g_1 + 4g_2 + \cdots + 2^{n-1}g_{n-1})$;

• $\cdot$: same with $(f_0 + 2f_1 + 4f_2 + \cdots + 2^{n-1}f_{n-1}) \cdot (g_0 + 2g_1 + 4g_2 + \cdots + 2^{n-1}g_{n-1})$.

We follow the tradition among number theorists of using the notation $\mathbb{Z}_2$ for this ring, and not for the finite field $\mathbb{F}_2 = \mathbb{Z}/2$. This ring $\mathbb{Z}_2$ has characteristic 0, so $\mathbb{Z}$ can be viewed as a subset of $\mathbb{Z}_2$.

One can think of $(f_0, f_1, \ldots) \in \mathbb{Z}_2$ as an infinite series $f_0 + 2f_1 + \cdots$. The difference between $\mathbb{Z}_2$ and the power-series ring $\mathbb{F}_2[[x]]$ is that $\mathbb{Z}_2$ has carries: for example, adding $(1, 0, \ldots)$ to $(1, 0, \ldots)$ in $\mathbb{Z}_2$ produces $(0, 1, \ldots)$.


We define $f \bmod 2^n$ as $f_0 + 2f_1 + \cdots + 2^{n-1}f_{n-1}$ for any $f = (f_0, f_1, \ldots) \in \mathbb{Z}_2$.

The unit group of $\mathbb{Z}_2$ is written $\mathbb{Z}_2^*$. This is the same as the set of odd $f \in \mathbb{Z}_2$, i.e., the set of $f \in \mathbb{Z}_2$ such that $f \bmod 2 \ne 0$.

If $f \in \mathbb{Z}_2$ and $f \ne 0$, then $\operatorname{ord}_2 f$ means the largest integer $e \ge 0$ such that $f \in 2^e\mathbb{Z}_2$, equivalently the unique integer $e \ge 0$ such that $f \in 2^e\mathbb{Z}_2^*$.

The ring $\mathbb{Z}_2$ is a domain. The field $\mathbb{Q}_2$ is by definition the field of fractions of $\mathbb{Z}_2$. Each element of $\mathbb{Q}_2$ can be written (not uniquely) as $f/2^i$ for some $f \in \mathbb{Z}_2$ and some $i \in \{0, 1, 2, \ldots\}$. The subring $\mathbb{Z}[1/2]$ is the smallest subring of $\mathbb{Q}_2$ containing both $\mathbb{Z}$ and $1/2$.
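The n-bit truncations above behave exactly like ordinary integers reduced modulo $2^n$; a tiny Python check (ours) illustrates the carry in the example above and the $\operatorname{ord}_2$ function:

    # Truncated 2-adic arithmetic is ordinary arithmetic modulo 2^n.
    n = 8
    assert (1 + 1) % 2**n == 2                    # (1,0,...) + (1,0,...) = (0,1,...)
    ord2 = lambda f: (f & -f).bit_length() - 1    # largest e with 2^e dividing f
    assert ord2(40) == 3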

3  Definition of x-adic division steps

Define $\operatorname{divstep}\colon \mathbb{Z} \times k[[x]]^* \times k[[x]] \to \mathbb{Z} \times k[[x]]^* \times k[[x]]$ as follows:
$$\operatorname{divstep}(\delta, f, g) = \begin{cases} (1-\delta,\ g,\ (g(0)f - f(0)g)/x) & \text{if } \delta > 0 \text{ and } g(0) \ne 0, \\ (1+\delta,\ f,\ (f(0)g - g(0)f)/x) & \text{otherwise.} \end{cases}$$

Note that $f(0)g - g(0)f$ is divisible by x in $k[[x]]$. The name "division step" is justified in Theorem C.1, which shows that dividing a degree-$d_0$ polynomial by a degree-$d_1$ polynomial with $0 \le d_1 < d_0$ can be viewed as computing $\operatorname{divstep}^{2d_0 - 2d_1}$.
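As an illustration (our own sketch, not one of the paper's figures), a single x-adic division step can be computed on power series truncated to precision t and represented as Sage polynomials; Figure 5.1 later iterates exactly this map.

    # One x-adic division step on truncated power series.
    # f, g are Sage polynomials over a field k, standing in for the first t
    # coefficients of elements of k[[x]]; f must satisfy f(0) != 0.
    def divstep1(delta, f, g, t):
        kx = f.parent()
        x = kx.gen()
        if delta > 0 and g[0] != 0:
            delta, f, g = -delta, g, f                     # conditional swap
        g = kx((f[0]*g - g[0]*f)/x).truncate(t - 1)        # elimination, then divide by x
        return 1 + delta, f.truncate(t), g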

3.1  Transition matrices. Write $(\delta_1, f_1, g_1) = \operatorname{divstep}(\delta, f, g)$. Then
$$\begin{pmatrix} f_1 \\ g_1 \end{pmatrix} = T(\delta, f, g) \begin{pmatrix} f \\ g \end{pmatrix} \quad\text{and}\quad \begin{pmatrix} 1 \\ \delta_1 \end{pmatrix} = S(\delta, f, g) \begin{pmatrix} 1 \\ \delta \end{pmatrix},$$
where $T\colon \mathbb{Z} \times k[[x]]^* \times k[[x]] \to M_2(k[1/x])$ is defined by
$$T(\delta, f, g) = \begin{cases} \begin{pmatrix} 0 & 1 \\ g(0)/x & -f(0)/x \end{pmatrix} & \text{if } \delta > 0 \text{ and } g(0) \ne 0, \\[1ex] \begin{pmatrix} 1 & 0 \\ -g(0)/x & f(0)/x \end{pmatrix} & \text{otherwise,} \end{cases}$$
and $S\colon \mathbb{Z} \times k[[x]]^* \times k[[x]] \to M_2(\mathbb{Z})$ is defined by
$$S(\delta, f, g) = \begin{cases} \begin{pmatrix} 1 & 0 \\ 1 & -1 \end{pmatrix} & \text{if } \delta > 0 \text{ and } g(0) \ne 0, \\[1ex] \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix} & \text{otherwise.} \end{cases}$$

Note that both $T(\delta, f, g)$ and $S(\delta, f, g)$ are defined entirely by $(\delta, f(0), g(0))$; they do not depend on the remaining coefficients of f and g.

3.2  Decomposition. This divstep function is a composition of two simpler functions (and the transition matrices can similarly be decomposed):

• a conditional swap, replacing $(\delta, f, g)$ with $(-\delta, g, f)$ if $\delta > 0$ and $g(0) \ne 0$, followed by

• an elimination, replacing $(\delta, f, g)$ with $(1 + \delta, f, (f(0)g - g(0)f)/x)$.


[Figure 3.3: data-flow diagram. Row A holds $\delta, f[0], g[0], f[1], g[1], f[2], g[2], f[3], g[3], \ldots$; a conditional swap produces row B with $-\delta$ (if swapped) and $f'[0], g'[0], f'[1], g'[1], \ldots$; an elimination produces row C with $1+\delta'$ and $f''[0], g''[0], f''[1], g''[1], \ldots$]

Figure 3.3: Data flow in an x-adic division step, decomposed as a conditional swap (A to B) followed by an elimination (B to C). The swap bit is set if $\delta > 0$ and $g[0] \ne 0$. The g outputs are $f'[0]g'[1] - g'[0]f'[1]$, $f'[0]g'[2] - g'[0]f'[2]$, $f'[0]g'[3] - g'[0]f'[3]$, etc. Color scheme: consider, e.g., the linear combination $cg - df$ of power series f, g where $c = f[0]$ and $d = g[0]$; the inputs f, g affect the jth output coefficient $cg[j] - df[j]$ via $f[j], g[j]$ (black) and via the choice of c, d (red+blue). Green operations are special, creating the swap bit.

See Figure 3.3.

This decomposition is our motivation for defining divstep in the first case (the swapped case) to compute $(g(0)f - f(0)g)/x$. Using $(f(0)g - g(0)f)/x$ in both cases might seem simpler, but would require the conditional swap to forward a conditional negation to the elimination.

One can also compose these functions in the opposite order. This is compatible with some of what we say later about iterating the composition. However, the corresponding transition matrices would depend on another coefficient of g. This would complicate our algorithm statements.

Another possibility, as in [23], is to keep f and g in place, using the sign of δ to decide whether to add a multiple of f into g or to add a multiple of g into f. One way to do this in constant time is to perform two multiplications and two additions. Another way is to perform conditional swaps before and after one addition and one multiplication. Our approach is more efficient.

3.4  Scaling. One can define a variant of divstep that computes $f - (f(0)/g(0))g$ in the first case (the swapped case) and $g - (g(0)/f(0))f$ in the second case.

This variant has two advantages. First, it multiplies only one series, rather than two, by a scalar in k. Second, multiplying both f and g by any $u \in k[[x]]^*$ has the same effect on the output, so this variant induces a function on fewer variables $(\delta, g/f)$: the ratio $\rho = g/f$ is replaced by $(1/\rho - 1/\rho(0))/x$ when $\delta > 0$ and $\rho(0) \ne 0$, and by $(\rho - \rho(0))/x$ otherwise, while δ is replaced by $1-\delta$ or $1+\delta$ respectively. This is the composition of, first, a conditional inversion that replaces $(\delta, \rho)$ with $(-\delta, 1/\rho)$ if $\delta > 0$ and $\rho(0) \ne 0$; second, an elimination that replaces ρ with $(\rho - \rho(0))/x$ while adding 1 to δ.

On the other hand, working projectively with f, g is generally more efficient than working with ρ. Working with $f(0)g - g(0)f$, rather than $g - (g(0)/f(0))f$, has the virtue of being a "fraction-free" computation: it involves only ring operations in k, not divisions. This makes the effect of scaling only slightly more difficult to track. Specifically, say $\operatorname{divstep}(\delta, f, g) = (\delta', f', g')$.


Table 4.1: Iterates $(\delta_n, f_n, g_n) = \operatorname{divstep}^n(\delta, f, g)$ for $k = \mathbb{F}_7$, $\delta = 1$, $f = 2 + 7x + 1x^2 + 8x^3 + 2x^4 + 8x^5 + 1x^6 + 8x^7$, and $g = 3 + 1x + 4x^2 + 1x^3 + 5x^4 + 9x^5 + 2x^6$. Each power series is displayed as a row of coefficients of $x^0, x^1$, etc.

  n  δ_n   f_n: x^0 ... x^9      g_n: x^0 ... x^9
  0   1    2 0 1 1 2 1 1 1 0 0   3 1 4 1 5 2 2 0 0 0
  1   0    3 1 4 1 5 2 2 0 0 0   5 2 1 3 6 6 3 0 0 0
  2   1    3 1 4 1 5 2 2 0 0 0   1 4 4 0 1 6 0 0 0 0
  3   0    1 4 4 0 1 6 0 0 0 0   3 6 1 2 5 2 0 0 0 0
  4   1    1 4 4 0 1 6 0 0 0 0   1 3 2 2 5 0 0 0 0 0
  5   0    1 3 2 2 5 0 0 0 0 0   1 2 5 3 6 0 0 0 0 0
  6   1    1 3 2 2 5 0 0 0 0 0   6 3 1 1 0 0 0 0 0 0
  7   0    6 3 1 1 0 0 0 0 0 0   1 4 4 2 0 0 0 0 0 0
  8   1    6 3 1 1 0 0 0 0 0 0   0 2 4 0 0 0 0 0 0 0
  9   2    6 3 1 1 0 0 0 0 0 0   5 3 0 0 0 0 0 0 0 0
 10  -1    5 3 0 0 0 0 0 0 0 0   4 5 5 0 0 0 0 0 0 0
 11   0    5 3 0 0 0 0 0 0 0 0   6 4 0 0 0 0 0 0 0 0
 12   1    5 3 0 0 0 0 0 0 0 0   2 0 0 0 0 0 0 0 0 0
 13   0    2 0 0 0 0 0 0 0 0 0   6 0 0 0 0 0 0 0 0 0
 14   1    2 0 0 0 0 0 0 0 0 0   0 0 0 0 0 0 0 0 0 0
 15   2    2 0 0 0 0 0 0 0 0 0   0 0 0 0 0 0 0 0 0 0
 16   3    2 0 0 0 0 0 0 0 0 0   0 0 0 0 0 0 0 0 0 0
 17   4    2 0 0 0 0 0 0 0 0 0   0 0 0 0 0 0 0 0 0 0
 18   5    2 0 0 0 0 0 0 0 0 0   0 0 0 0 0 0 0 0 0 0
 19   6    2 0 0 0 0 0 0 0 0 0   0 0 0 0 0 0 0 0 0 0

If $u \in k[[x]]$ with $u(0) = 1$, then $\operatorname{divstep}(\delta, uf, ug) = (\delta', uf', ug')$. If $u \in k^*$, then
$$\operatorname{divstep}(\delta, uf, g) = (\delta', f', ug') \text{ in the first (swapped) case;}$$
$$\operatorname{divstep}(\delta, uf, g) = (\delta', uf', ug') \text{ in the second case;}$$
$$\operatorname{divstep}(\delta, f, ug) = (\delta', uf', ug') \text{ in the first case;}$$
$$\operatorname{divstep}(\delta, f, ug) = (\delta', f', ug') \text{ in the second case;}$$
and $\operatorname{divstep}(\delta, uf, ug) = (\delta', uf', u^2g')$.

One can also scale $g(0)f - f(0)g$ by other nonzero constants. For example, for typical fields k one can quickly find "half-size" a, b such that $a/b = g(0)/f(0)$. Computing $af - bg$ may be more efficient than computing $g(0)f - f(0)g$.

We use the $g(0)/f(0)$ variant in our case study in Section 7 for the field $k = \mathbb{F}_3$. Dividing by $f(0)$ in this field is the same as multiplying by $f(0)$.

4  Iterates of x-adic division steps

Starting from $(\delta, f, g) \in \mathbb{Z} \times k[[x]]^* \times k[[x]]$, define $(\delta_n, f_n, g_n) = \operatorname{divstep}^n(\delta, f, g) \in \mathbb{Z} \times k[[x]]^* \times k[[x]]$ for each $n \ge 0$. Also define $T_n = T(\delta_n, f_n, g_n) \in M_2(k[1/x])$ and $S_n = S(\delta_n, f_n, g_n) \in M_2(\mathbb{Z})$ for each $n \ge 0$. Our main algorithm relies on various simple mathematical properties of these objects; this section states those properties. Table 4.1 shows all $\delta_n, f_n, g_n$ for one example of $(\delta, f, g)$.

Theorem 4.2. $\begin{pmatrix} f_n \\ g_n \end{pmatrix} = T_{n-1} \cdots T_m \begin{pmatrix} f_m \\ g_m \end{pmatrix}$ and $\begin{pmatrix} 1 \\ \delta_n \end{pmatrix} = S_{n-1} \cdots S_m \begin{pmatrix} 1 \\ \delta_m \end{pmatrix}$ if $n \ge m \ge 0$.


Proof. This follows from $\begin{pmatrix} f_{m+1} \\ g_{m+1} \end{pmatrix} = T_m \begin{pmatrix} f_m \\ g_m \end{pmatrix}$ and $\begin{pmatrix} 1 \\ \delta_{m+1} \end{pmatrix} = S_m \begin{pmatrix} 1 \\ \delta_m \end{pmatrix}$.

Theorem 4.3. $T_{n-1} \cdots T_m \in \begin{pmatrix} k + \cdots + \frac{1}{x^{n-m-1}}k & k + \cdots + \frac{1}{x^{n-m-1}}k \\ \frac{1}{x}k + \cdots + \frac{1}{x^{n-m}}k & \frac{1}{x}k + \cdots + \frac{1}{x^{n-m}}k \end{pmatrix}$ if $n > m \ge 0$.

A product of $n-m$ of these T matrices can thus be stored using $4(n-m)$ coefficients from k. Beware that the theorem statement relies on $n > m$; for $n = m$ the product is the identity matrix.

Proof. If $n-m = 1$ then the product is $T_m$, which is in $\begin{pmatrix} k & k \\ \frac{1}{x}k & \frac{1}{x}k \end{pmatrix}$ by definition. For $n-m > 1$, assume inductively that
$$T_{n-1} \cdots T_{m+1} \in \begin{pmatrix} k + \cdots + \frac{1}{x^{n-m-2}}k & k + \cdots + \frac{1}{x^{n-m-2}}k \\ \frac{1}{x}k + \cdots + \frac{1}{x^{n-m-1}}k & \frac{1}{x}k + \cdots + \frac{1}{x^{n-m-1}}k \end{pmatrix}.$$
Multiply on the right by $T_m \in \begin{pmatrix} k & k \\ \frac{1}{x}k & \frac{1}{x}k \end{pmatrix}$. The top entries of the product are in $(k + \cdots + \frac{1}{x^{n-m-2}}k) + (\frac{1}{x}k + \cdots + \frac{1}{x^{n-m-1}}k) = k + \cdots + \frac{1}{x^{n-m-1}}k$ as claimed, and the bottom entries are in $(\frac{1}{x}k + \cdots + \frac{1}{x^{n-m-1}}k) + (\frac{1}{x^2}k + \cdots + \frac{1}{x^{n-m}}k) = \frac{1}{x}k + \cdots + \frac{1}{x^{n-m}}k$ as claimed.

Theorem 4.4. $S_{n-1} \cdots S_m \in \begin{pmatrix} 1 & 0 \\ \{2-(n-m), \ldots, n-m-2, n-m\} & \{1, -1\} \end{pmatrix}$ if $n > m \ge 0$.

In other words, the bottom-left entry is between $-(n-m)$ exclusive and $n-m$ inclusive and has the same parity as $n-m$. There are exactly $2(n-m)$ possibilities for the product. Beware that the theorem statement again relies on $n > m$.

Proof. If $n-m = 1$ then $\{2-(n-m), \ldots, n-m-2, n-m\} = \{1\}$, and the product is $S_m$, which is $\begin{pmatrix} 1 & 0 \\ 1 & \pm 1 \end{pmatrix}$ by definition.

For $n-m > 1$, assume inductively that
$$S_{n-1} \cdots S_{m+1} \in \begin{pmatrix} 1 & 0 \\ \{2-(n-m-1), \ldots, n-m-3, n-m-1\} & \{1, -1\} \end{pmatrix},$$
and multiply on the right by $S_m = \begin{pmatrix} 1 & 0 \\ 1 & \pm 1 \end{pmatrix}$. The top two entries of the product are again 1 and 0. The bottom-left entry is in $\{2-(n-m-1), \ldots, n-m-3, n-m-1\} + \{1, -1\} = \{2-(n-m), \ldots, n-m-2, n-m\}$. The bottom-right entry is again 1 or $-1$.

The next theorem compares $\delta_m, f_m, g_m, T_m, S_m$ to $\delta'_m, f'_m, g'_m, T'_m, S'_m$ defined in the same way starting from $(\delta', f', g') \in \mathbb{Z} \times k[[x]]^* \times k[[x]]$.

Theorem 4.5. Let m, t be nonnegative integers. Assume that $\delta'_m = \delta_m$, $f'_m \equiv f_m \pmod{x^t}$, and $g'_m \equiv g_m \pmod{x^t}$. Then, for each integer n with $m < n \le m+t$: $T'_{n-1} = T_{n-1}$, $S'_{n-1} = S_{n-1}$, $\delta'_n = \delta_n$, $f'_n \equiv f_n \pmod{x^{t-(n-m-1)}}$, and $g'_n \equiv g_n \pmod{x^{t-(n-m)}}$.

In other words, $\delta_m$ and the first t coefficients of $f_m$ and $g_m$ determine all of the following:

• the matrices $T_m, T_{m+1}, \ldots, T_{m+t-1}$;

• the matrices $S_m, S_{m+1}, \ldots, S_{m+t-1}$;

• $\delta_{m+1}, \ldots, \delta_{m+t}$;

• the first t coefficients of $f_{m+1}$, the first $t-1$ coefficients of $f_{m+2}$, the first $t-2$ coefficients of $f_{m+3}$, and so on through the first coefficient of $f_{m+t}$;


• the first $t-1$ coefficients of $g_{m+1}$, the first $t-2$ coefficients of $g_{m+2}$, the first $t-3$ coefficients of $g_{m+3}$, and so on through the first coefficient of $g_{m+t-1}$.

Our main algorithm computes $(\delta_n, f_n \bmod x^{t-(n-m-1)}, g_n \bmod x^{t-(n-m)})$ for any selected $n \in \{m+1, \ldots, m+t\}$, along with the product of the matrices $T_m, \ldots, T_{n-1}$. See Section 5.

Proof. See Figure 3.3 for the intuition. Formally, fix m, t and induct on n. Observe that $(\delta'_{n-1}, f'_{n-1}(0), g'_{n-1}(0))$ equals $(\delta_{n-1}, f_{n-1}(0), g_{n-1}(0))$:

• If $n = m+1$: By assumption $f'_m \equiv f_m \pmod{x^t}$ and $n \le m+t$, so $t \ge 1$, so in particular $f'_m(0) = f_m(0)$. Similarly $g'_m(0) = g_m(0)$. By assumption $\delta'_m = \delta_m$.

• If $n > m+1$: $f'_{n-1} \equiv f_{n-1} \pmod{x^{t-(n-m-2)}}$ by the inductive hypothesis, and $n \le m+t$, so $t-(n-m-2) \ge 2$, so in particular $f'_{n-1}(0) = f_{n-1}(0)$; similarly $g'_{n-1} \equiv g_{n-1} \pmod{x^{t-(n-m-1)}}$ by the inductive hypothesis, and $t-(n-m-1) \ge 1$, so in particular $g'_{n-1}(0) = g_{n-1}(0)$; and $\delta'_{n-1} = \delta_{n-1}$ by the inductive hypothesis.

Now use the fact that T and S inspect only the first coefficient of each series to see that
$$T'_{n-1} = T(\delta'_{n-1}, f'_{n-1}, g'_{n-1}) = T(\delta_{n-1}, f_{n-1}, g_{n-1}) = T_{n-1},$$
$$S'_{n-1} = S(\delta'_{n-1}, f'_{n-1}, g'_{n-1}) = S(\delta_{n-1}, f_{n-1}, g_{n-1}) = S_{n-1}$$
as claimed.

Multiply $\begin{pmatrix} 1 \\ \delta'_{n-1} \end{pmatrix} = \begin{pmatrix} 1 \\ \delta_{n-1} \end{pmatrix}$ by $S'_{n-1} = S_{n-1}$ to see that $\delta'_n = \delta_n$ as claimed.

By the inductive hypothesis, $(T'_m, \ldots, T'_{n-2}) = (T_m, \ldots, T_{n-2})$, and $T'_{n-1} = T_{n-1}$, so $T'_{n-1} \cdots T'_m = T_{n-1} \cdots T_m$. Abbreviate $T_{n-1} \cdots T_m$ as P. Then $\begin{pmatrix} f'_n \\ g'_n \end{pmatrix} = P \begin{pmatrix} f'_m \\ g'_m \end{pmatrix}$ and $\begin{pmatrix} f_n \\ g_n \end{pmatrix} = P \begin{pmatrix} f_m \\ g_m \end{pmatrix}$ by Theorem 4.2, so
$$\begin{pmatrix} f'_n - f_n \\ g'_n - g_n \end{pmatrix} = P \begin{pmatrix} f'_m - f_m \\ g'_m - g_m \end{pmatrix}.$$
By Theorem 4.3, $P \in \begin{pmatrix} k + \cdots + \frac{1}{x^{n-m-1}}k & k + \cdots + \frac{1}{x^{n-m-1}}k \\ \frac{1}{x}k + \cdots + \frac{1}{x^{n-m}}k & \frac{1}{x}k + \cdots + \frac{1}{x^{n-m}}k \end{pmatrix}$. By assumption, $f'_m - f_m$ and $g'_m - g_m$ are multiples of $x^t$. Multiply to see that $f'_n - f_n$ is a multiple of $x^t/x^{n-m-1} = x^{t-(n-m-1)}$ as claimed, and $g'_n - g_n$ is a multiple of $x^t/x^{n-m} = x^{t-(n-m)}$ as claimed.

5  Fast computation of iterates of x-adic division steps

Our goal in this section is to compute $(\delta_n, f_n, g_n) = \operatorname{divstep}^n(\delta, f, g)$. We are given the power series f, g to precision t, t respectively, and we compute $f_n, g_n$ to precision $t-n+1, t-n$ respectively if $n \ge 1$; see Theorem 4.5. We also compute the n-step transition matrix $T_{n-1} \cdots T_0$.

We begin with an algorithm that simply computes one divstep iterate after another, multiplying by scalars in k and dividing by x exactly as in the divstep definition. We specify this algorithm in Figure 5.1 as a function divstepsx in the Sage [58] computer-algebra system. The quantities $u, v, q, r \in k[1/x]$ inside the algorithm keep track of the coefficients of the current f, g in terms of the original f, g.

The rest of this section considers (1) constant-time computation, (2) "jumping" through division steps, and (3) further techniques to save time inside the computations.

5.2  Constant-time computation. The underlying Sage functions do not take constant time, but they can be replaced with functions that do take constant time. The case distinction can be replaced with constant-time bit operations, for example obtaining the new $(\delta, f, g)$ as follows.


    def divstepsx(n, t, delta, f, g):
        assert t >= n and n >= 0
        f, g = f.truncate(t), g.truncate(t)
        kx = f.parent()
        x = kx.gen()
        u, v, q, r = kx(1), kx(0), kx(0), kx(1)

        while n > 0:
            f = f.truncate(t)
            if delta > 0 and g[0] != 0:
                delta, f, g, u, v, q, r = -delta, g, f, q, r, u, v
            f0, g0 = f[0], g[0]
            delta, g, q, r = 1 + delta, (f0*g - g0*f)/x, (f0*q - g0*u)/x, (f0*r - g0*v)/x
            n, t = n - 1, t - 1
            g = kx(g).truncate(t)

        M2kx = MatrixSpace(kx.fraction_field(), 2)
        return delta, f, g, M2kx((u, v, q, r))

Figure 5.1: Algorithm divstepsx to compute $(\delta_n, f_n, g_n) = \operatorname{divstep}^n(\delta, f, g)$. Inputs: $n, t \in \mathbb{Z}$ with $0 \le n \le t$; $\delta \in \mathbb{Z}$; $f, g \in k[[x]]$ to precision at least t. Outputs: $\delta_n$; $f_n$ to precision t if $n = 0$, or $t-(n-1)$ if $n \ge 1$; $g_n$ to precision $t-n$; $T_{n-1} \cdots T_0$.

• Let s be the "sign bit" of $-\delta$, meaning 0 if $-\delta \ge 0$ or 1 if $-\delta < 0$.

• Let z be the "nonzero bit" of $g(0)$, meaning 0 if $g(0) = 0$ or 1 if $g(0) \ne 0$.

• Compute and output $1 + (1 - 2sz)\delta$. For example, shift δ to obtain $2\delta$, "AND" with $-sz$ to obtain $2sz\delta$, and subtract from $1 + \delta$.

• Compute and output $f \oplus sz(f \oplus g)$, obtaining f if $sz = 0$ or g if $sz = 1$.

• Compute and output $(1 - 2sz)(f(0)g - g(0)f)/x$. Alternatively, compute and output simply $(f(0)g - g(0)f)/x$; see the discussion of decomposition and scaling in Section 3.

Standard techniques minimize the exact cost of these computations for various representations of the inputs. The details depend on the platform; see our case study in Section 7.
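The following sketch (ours, not part of the paper) illustrates these bit operations for the special case $k = \mathbb{F}_2$, with the coefficients of f and g packed into Python integers (bit i is the coefficient of $x^i$). Python integers carry no constant-time guarantee; the point is only the branch-free data flow. The paper's case studies use other fields (e.g., $\mathbb{F}_3$ in Section 7); $\mathbb{F}_2$ merely keeps the sketch short.

    # Branch-free x-adic divstep for k = F_2, coefficients packed into integers.
    # Assumes |delta| is far below 2^63; f must have f[0] = 1 (a unit).
    def divstep_bits(delta, f, g):
        s = ((-delta) >> 63) & 1                     # "sign bit" of -delta: 1 iff delta > 0
        z = g & 1                                    # "nonzero bit" of g(0)
        mask = -(s & z)                              # all-ones if swapping, else 0
        delta = 1 + delta - ((2 * delta) & mask)     # 1 - delta or 1 + delta
        t = (f ^ g) & mask
        f, g = f ^ t, g ^ t                          # conditional swap
        g = (g ^ (f & -(g & 1))) >> 1                # (f(0)g - g(0)f)/x over F_2
        return delta, f, g

    # Example: f = 1 + x + x^3 -> 0b1011, g = x + x^2 -> 0b110.
    print(divstep_bits(1, 0b1011, 0b110))            # (2, 11, 3), i.e. g replaced by g/x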

5.3  Jumping through division steps. Here is a more general divide-and-conquer algorithm to compute $(\delta_n, f_n, g_n)$ and the n-step transition matrix $T_{n-1} \cdots T_0$:

• If $n \le 1$: use the definition of divstep and stop.

• Choose $j \in \{1, 2, \ldots, n-1\}$.

• Jump j steps from $\delta, f, g$ to $\delta_j, f_j, g_j$. Specifically, compute the j-step transition matrix $T_{j-1} \cdots T_0$, and then multiply by $\begin{pmatrix} f \\ g \end{pmatrix}$ to obtain $\begin{pmatrix} f_j \\ g_j \end{pmatrix}$. To compute the j-step transition matrix, call the same algorithm recursively. The critical point here is that this recursive call uses the inputs to precision only j; this determines the j-step transition matrix by Theorem 4.5.

• Similarly, jump another $n-j$ steps from $\delta_j, f_j, g_j$ to $\delta_n, f_n, g_n$.


    from divstepsx import divstepsx

    def jumpdivstepsx(n, t, delta, f, g):
        assert t >= n and n >= 0
        kx = f.truncate(t).parent()

        if n <= 1:
            return divstepsx(n, t, delta, f, g)

        j = n // 2

        delta, f1, g1, P1 = jumpdivstepsx(j, j, delta, f, g)
        f, g = P1 * vector((f, g))
        f, g = kx(f).truncate(t - j), kx(g).truncate(t - j)

        delta, f2, g2, P2 = jumpdivstepsx(n - j, n - j, delta, f, g)
        f, g = P2 * vector((f, g))
        f, g = kx(f).truncate(t - n + 1), kx(g).truncate(t - n)

        return delta, f, g, P2 * P1

Figure 5.4: Algorithm jumpdivstepsx to compute $(\delta_n, f_n, g_n) = \operatorname{divstep}^n(\delta, f, g)$. Same inputs and outputs as in Figure 5.1.

This splits an (n, t) problem into two smaller problems, namely a (j, j) problem and an $(n-j, n-j)$ problem, at the cost of O(1) polynomial multiplications involving $O(t+n)$ coefficients.

If j is always chosen as 1, then this algorithm ends up performing the same computations as Figure 5.1: compute $\delta_1, f_1, g_1$ to precision $t-1$; compute $\delta_2, f_2, g_2$ to precision $t-2$; etc. But there are many more possibilities for j.

Our jumpdivstepsx algorithm in Figure 5.4 takes $j = \lfloor n/2 \rfloor$, balancing the (j, j) problem with the $(n-j, n-j)$ problem. In particular, with subquadratic polynomial multiplication the algorithm is subquadratic. With FFT-based polynomial multiplication, the algorithm uses $(t+n)(\log(t+n))^{1+o(1)} + n(\log n)^{2+o(1)}$ operations, and hence $n(\log n)^{2+o(1)}$ operations if t is bounded by $n(\log n)^{1+o(1)}$. For comparison, Figure 5.1 takes $\Theta(n(t+n))$ operations.

We emphasize that, unlike standard fast-gcd algorithms such as [66], this algorithm does not call a polynomial-division subroutine between its two recursive calls.

5.5  More speedups. Our jumpdivstepsx algorithm is asymptotically $\Theta(\log n)$ times slower than fast polynomial multiplication. The best previous gcd algorithms also have a $\Theta(\log n)$ slowdown, both in the worst case and in the average case.10 This might suggest that there is very little room for improvement.

However, Θ says nothing about constant factors. There are several standard ideas for constant-factor speedups in gcd computation, beyond lower-level speedups in multiplication. We now apply the ideas to jumpdivstepsx.

The final polynomial multiplications in jumpdivstepsx produce six large outputs: f, g, and four entries of a matrix product P. Callers do not always use all of these outputs; for example, the recursive calls to jumpdivstepsx do not use the resulting f

10 There are faster cases. Strassen [66] gave an algorithm that replaces the log factor with the entropy of the list of degrees in the corresponding continued fraction; there are occasional inputs where this entropy is $o(\log n)$. For example, computing $\gcd\{f, 0\}$ takes linear time. However, these speedups are not useful for constant-time computations.


and g. In principle there is no difficulty applying dead-value elimination and higher-level optimizations to each of the 63 possible combinations of desired outputs.

The recursive calls in jumpdivstepsx produce two products $P_1, P_2$ of transition matrices, which are then (if desired) multiplied to obtain $P = P_2 P_1$. In Figure 5.4, jumpdivstepsx multiplies f and g first by $P_1$ and then by $P_2$ to obtain $f_n$ and $g_n$. An alternative is to multiply by P.

The inputs f and g are truncated to t coefficients, and the resulting $f_n$ and $g_n$ are truncated to $t-n+1$ and $t-n$ coefficients respectively. Often the caller (for example, jumpdivstepsx itself) wants more coefficients of $P\begin{pmatrix} f \\ g \end{pmatrix}$. Rather than performing an entirely new multiplication by P, one can add the truncation errors back into the results: multiply P by the f and g truncation errors, multiply $P_2$ by the $f_j$ and $g_j$ truncation errors, and skip the final truncations of $f_n$ and $g_n$.

There are standard speedups for multiplication in the context of truncated products, sums of products (e.g., in matrix multiplication), partially known products (e.g., the coefficients of $x^{-1}, x^{-2}, x^{-3}, \ldots$ in $P_1\begin{pmatrix} f \\ g \end{pmatrix}$ are known to be 0), repeated inputs (e.g., each entry of $P_1$ is multiplied by two entries of $P_2$ and by one of f, g), and outputs being reused as inputs. See, e.g., the discussions of "FFT addition", "FFT caching", and "FFT doubling" in [11].

Any choice of j between 1 and $n-1$ produces correct results. Values of j close to the edges do not take advantage of fast polynomial multiplication, but this does not imply that $j = \lfloor n/2 \rfloor$ is optimal. The number of combinations of choices of j and other options above is small enough that, for each small (t, n) in turn, one can simply measure all options and select the best. The results depend on the context (which outputs are needed) and on the exact performance of polynomial multiplication.

6  Fast polynomial gcd computation and modular inversion

This section presents three algorithms: gcdx computes $\gcd\{R_0, R_1\}$, where $R_0$ is a polynomial of degree $d > 0$ and $R_1$ is a polynomial of degree $< d$; gcddegree computes merely $\deg\gcd\{R_0, R_1\}$; recipx computes the reciprocal of $R_1$ modulo $R_0$ if $\gcd\{R_0, R_1\} = 1$. Figure 6.1 specifies all three algorithms, and Theorem 6.2 justifies all three algorithms. The theorem is proven in Appendix A.

Each algorithm calls divstepsx from Section 5 to compute $(\delta_{2d-1}, f_{2d-1}, g_{2d-1}) = \operatorname{divstep}^{2d-1}(1, f, g)$. Here $f = x^d R_0(1/x)$ is the reversal of $R_0$, and $g = x^{d-1} R_1(1/x)$. Of course, one can replace divstepsx here with jumpdivstepsx, which has the same outputs. The algorithms vary in the input-precision parameter t passed to divstepsx, and vary in how they use the outputs:

• gcddegree uses input precision $t = 2d-1$ to compute $\delta_{2d-1}$, and then uses one part of Theorem 6.2, which states that $\deg\gcd\{R_0, R_1\} = \delta_{2d-1}/2$.

• gcdx uses precision $t = 3d-1$ to compute $\delta_{2d-1}$ and $d+1$ coefficients of $f_{2d-1}$. The algorithm then uses another part of Theorem 6.2, which states that $f_{2d-1}/f_{2d-1}(0)$ is the reversal of $\gcd\{R_0, R_1\}$.

• recipx uses $t = 2d-1$ to compute $\delta_{2d-1}$, 1 coefficient of $f_{2d-1}$, and the product P of transition matrices. It then uses the last part of Theorem 6.2, which (in the coprime case) states that the top-right corner of P is $f_{2d-1}(0)/x^{2d-2}$ times the degree-$(d-1)$ reversal of the desired reciprocal.

One can generalize to compute $\gcd\{R_0, R_1\}$ for any polynomials $R_0, R_1$ of degree $\le d$ in time that depends only on d: scan the inputs to find the top degree, and always perform 2d iterations of divstep.

16 Fast constant-time gcd computation and modular inversion

fromdivstepsximportdivstepsx

defgcddegree(R0R1)d=R0degree()assertdgt0anddgtR1degree()fg=R0reverse(d)R1reverse(d-1)deltafgP=divstepsx(2d-12d-11fg)returndelta2

defgcdx(R0R1)d=R0degree()assertdgt0anddgtR1degree()fg=R0reverse(d)R1reverse(d-1)deltafgP=divstepsx(2d-13d-11fg)returnfreverse(delta2)f[0]

defrecipx(R0R1)d=R0degree()assertdgt0anddgtR1degree()fg=R0reverse(d)R1reverse(d-1)deltafgP=divstepsx(2d-12d-11fg)ifdelta=0returnkx=fparent()x=kxgen()returnkx(x^(2d-2)P[0][1]f[0])reverse(d-1)

Figure 61 Algorithm gcdx to compute gcdR0 R1 algorithm gcddegree to computedeg gcdR0 R1 algorithm recipx to compute the reciprocal of R1 modulo R0 whengcdR0 R1 = 1 All three algorithms assume that R0 R1 isin k[x] that degR0 gt 0 andthat degR0 gt degR1

2d iterations of divstep The extra iteration handles the case that both R0 and R1 havedegree d For simplicity we focus on the case degR0 = d gt degR1 with d gt 0

One can similarly characterize gcd and modular reciprocal in terms of subsequentiterates of divstep with δ increased by the number of extra steps Implementors mightfor example prefer to use an iteration count that is divisible by a large power of 2

Theorem 62 Let k be a field Let d be a positive integer Let R0 R1 be elements ofthe polynomial ring k[x] with degR0 = d gt degR1 Define G = gcdR0 R1 and let Vbe the unique polynomial of degree lt d minus degG such that V R1 equiv G (mod R0) Definef = xdR0(1x) g = xdminus1R1(1x) (δn fn gn) = divstepn(1 f g) Tn = T (δn fn gn) and(un vnqn rn

)= Tnminus1 middot middot middot T0 Then

degG = δ2dminus12G = xdegGf2dminus1(1x)f2dminus1(0)V = xminusd+1+degGv2dminus1(1x)f2dminus1(0)

63 A numerical example Take k = F7 d = 7 R0 = 2x7 + 7x6 + 1x5 + 8x4 +2x3 + 8x2 + 1x + 8 and R1 = 3x6 + 1x5 + 4x4 + 1x3 + 5x2 + 9x + 2 Then f =2 + 7x+ 1x2 + 8x3 + 2x4 + 8x5 + 1x6 + 8x7 and g = 3 + 1x+ 4x2 + 1x3 + 5x4 + 9x5 + 2x6The gcdx algorithm computes (δ13 f13 g13) = divstep13(1 f g) = (0 2 6) see Table 41

Daniel J Bernstein and Bo-Yin Yang 17

for all iterates of divstep on these inputs Dividing f13 by its constant coefficient produces1 the reversal of gcdR0 R1 The gcddegree algorithm simply computes δ13 = 0 whichis enough information to see that gcdR0 R1 = 164 Constant-factor speedups Compared to the general power-series setting indivstepsx the inputs to gcddegree gcdx and recipx are more limited For examplethe f input is all 0 after the first d+ 1 coefficients so computations involving subsequentcoefficients can be skipped Similarly the top-right entry of the matrix P in recipx isguaranteed to have coefficient 0 for 1 1x 1x2 and so on through 1xdminus2 so one cansave time in the polynomial computations that produce this entry See Theorems A1and A2 for bounds on the sizes of all intermediate results65 Bottom-up gcd and inversion Our division steps eliminate bottom coefficients ofthe inputs Appendices B C and D relate this to a traditional EuclidndashStevin computationwhich gradually eliminates top coefficients of the inputs The reversal of inputs andoutputs in gcddegree gcdx and recipx exchanges the role of top coefficients and bottomcoefficients The following theorem states an alternative skip the reversal and insteaddirectly eliminate the bottom coefficients of the original inputsTheorem 66 Let k be a field Let d be a positive integer Let f g be elements of thepolynomial ring k[x] with f(0) 6= 0 deg f le d and deg g lt d Define γ = gcdf g

Define (δn fn gn) = divstepn(1 f g) Tn = T (δn fn gn) and(un vnqn rn

)= Tnminus1 middot middot middot T0

Thendeg γ = deg f2dminus1

γ = f2dminus1`x2dminus2γ equiv x2dminus2v2dminus1g` (mod f)

where ` is the leading coefficient of f2dminus1Proof Note that γ(0) 6= 0 since f is a multiple of γ

Define R0 = xdf(1x) Then R0 isin k[x] degR0 = d since f(0) 6= 0 and f = xdR0(1x)Define R1 = xdminus1g(1x) Then R1 isin k[x] degR1 lt d and g = xdminus1R1(1x)Define G = gcdR0 R1 We will show below that γ = cxdegGG(1x) for some c isin klowast

Then γ = cf2dminus1f2dminus1(0) by Theorem 62 ie γ is a constant multiple of f2dminus1 Hencedeg γ = deg f2dminus1 as claimed Furthermore γ has leading coefficient 1 so 1 = c`f2dminus1(0)ie γ = f2dminus1` as claimed

By Theorem 42 f2dminus1 = u2dminus1f + v2dminus1g Also x2dminus2u2dminus1 x2dminus2v2dminus1 isin k[x] by

Theorem 43 so

x2dminus2γ = (x2dminus2u2dminus1`)f + (x2dminus2v2dminus1`)g equiv (x2dminus2v2dminus1`)g (mod f)

as claimedAll that remains is to show that γ = cxdegGG(1x) for some c isin klowast If R1 = 0 then

G = gcdR0 0 = R0R0[d] so xdegGG(1x) = xdR0(1x)R0[d] = ff [0] also g = 0 soγ = ff [deg f ] = (f [0]f [deg f ])xdegGG(1x) Assume from now on that R1 6= 0

We have R1 = QG for some Q isin k[x] Note that degR1 = degQ + degG AlsoR1(1x) = Q(1x)G(1x) so xdegR1R1(1x) = xdegQQ(1x)xdegGG(1x) since degR1 =degQ+ degG Hence xdegGG(1x) divides the polynomial xdegR1R1(1x) which in turndivides xdminus1R1(1x) = g Similarly xdegGG(1x) divides f Hence xdegGG(1x) dividesgcdf g = γ

Furthermore G = AR0 +BR1 for some AB isin k[x] Take any integer m above all ofdegGdegAdegB then

xm+dminusdegGxdegGG(1x) = xm+dG(1x)= xmA(1x)xdR0(1x) + xm+1B(1x)xdminus1R1(1x)= xmminusdegAxdegAA(1x)f + xm+1minusdegBxdegBB(1x)g

18 Fast constant-time gcd computation and modular inversion

so xm+dminusdegGxdegGG(1x) is a multiple of γ so the polynomial γxdegGG(1x) dividesxm+dminusdegG so γxdegGG(1x) = cxi for some c isin klowast and some i ge 0 We must have i = 0since γ(0) 6= 0

67 How to divide by x2dminus2 modulo f Unlike top-down inversion (and top-downgcd and bottom-up gcd) bottom-up inversion naturally scales its result by a power of xWe now explain various ways to remove this scaling

For simplicity we focus here on the case gcdf g = 1 The divstep computation inTheorem 66 naturally produces the polynomial x2dminus2v2dminus1` which has degree at mostd minus 1 by Theorem 62 Our objective is to divide this polynomial by x2dminus2 modulo f obtaining the reciprocal of g modulo f by Theorem 66 One way to do this is to multiplyby a reciprocal of x2dminus2 modulo f This reciprocal depends only on d and f ie it can beprecomputed before g is available

Here are several standard strategies to compute 1x2dminus2 modulo f or more generally1xe modulo f for any nonnegative integer e

bull Compute the polynomial (1 minus ff(0))x which is 1x modulo f and then raisethis to the eth power modulo f This takes a logarithmic number of multiplicationsmodulo f

bull Start from 1f(0) the reciprocal of f modulo x Compute the reciprocals of fmodulo x2 x4 x8 etc by Newton (Hensel) iteration After finding the reciprocal rof f modulo xe divide 1minus rf by xe to obtain the reciprocal of xe modulo f Thistakes a constant number of full-size multiplications a constant number of half-sizemultiplications etc The total number of multiplications is again logarithmic butif e is linear as in the case e = 2dminus 2 then the total size of the multiplications islinear without an extra logarithmic factor

bull Combine the powering strategy with the Hensel strategy For example use thereciprocal of f modulo xdminus1 to compute the reciprocal of xdminus1 modulo f and thensquare modulo f to obtain the reciprocal of x2dminus2 modulo f rather than using thereciprocal of f modulo x2dminus2

bull Write down a simple formula for the reciprocal of xe modulo f if f has a specialstructure that makes this easy For example if f = xd minus 1 as in original NTRUand d ge 3 then the reciprocal of x2dminus2 is simply x2 As another example iff = xd + xdminus1 + middot middot middot+ 1 and d ge 5 then the reciprocal of x2dminus2 is x4

In most cases the division by x2dminus2 modulo f involves extra arithmetic that is avoided bythe top-down reversal approach On the other hand the case f = xd minus 1 does not involveany extra arithmetic There are also applications of inversion that can easily handle ascaled result

7 Software case study for polynomial modular inversionAs a concrete case study we consider the problem of inversion in the ring (Z3)[x](x700 +x699 + middot middot middot+ x+ 1) on an Intel Haswell CPU core The situation before our work was thatthis inversion consumed half of the key-generation time in the ntruhrss701 cryptosystemabout 150000 cycles out of 300000 cycles This cryptosystem was introduced by HuumllsingRijneveld Schanck and Schwabe [39] at CHES 201711

11NTRU-HRSS is another round-2 submission in NISTrsquos ongoing post-quantum competition Googlehas also selected NTRU-HRSS for its ldquoCECPQ2rdquo post-quantum experiment CECPQ2 includes a slightlymodified version of ntruhrss701 and the NTRU-HRSS authors have also announced their intention totweak NTRU-HRSS these changes do not affect the performance of key generation

Daniel J Bernstein and Bo-Yin Yang 19

We saved 60000 cycles out of these 150000 cycles reducing the total key-generationtime to 240000 cycles Our approach also simplifies code for example eliminating theseries of conditional power-of-2 rotations in [39]

71 Details of the new computation Our computation works with one integer δ andfour polynomials v r f g isin (Z3)[x] Initially δ = 1 v = 0 r = 1 f = x700 + x699 + middot middot middot+x + 1 and g = g699x

699 + middot middot middot + g0x0 is obtained by reversing the 700 input coefficients

We actually store minusδ instead of δ this turns out to be marginally more efficient for thisplatform

We then apply 2 middot 700minus 1 = 1399 iterations of a main loop Each iteration works asfollows with the order of operations determined experimentally to try to minimize latencyissues

bull Replace v with xv

bull Compute a swap mask as minus1 if δ gt 0 and g(0) 6= 0 otherwise 0

bull Compute c isin Z3 as f(0)g(0)

bull Replace δ with minusδ if the swap mask is set

bull Add 1 to δ

bull Replace (f g) with (g f) if the swap mask is set

bull Replace g with (g minus cf)x

bull Replace (v r) with (r v) if the swap mask is set

bull Replace r with r minus cv

This multiplies the transition matrix T (δ f g) into the vector(vxnminus1

rxn

)where n is the

number of iterations and at the same time replaces (δ f g) with divstep(δ f g) Actuallywe are slightly modifying divstep here (and making the corresponding modification toT ) replacing the original coefficients f(0) and g(0) with the scaled coefficients 1 andg(0)f(0) = f(0)g(0) as in Section 34 In other words if f(0) = minus1 then we negate bothf and g We suppress further comments on this deviation from the definition of divstep

The effect of n iterations is to replace the original (δ f g) with divstepn(δ f g) The

matrix product Tnminus1 middot middot middot T0 has the form(middot middot middot vxnminus1

middot middot middot rxn

) The new f and g are congruent

to vxnminus1 and rxn respectively times the original g modulo x700 + x699 + middot middot middot+ x+ 1After 1399 iterations we have δ = 0 if and only if the input is invertible modulo

x700 + x699 + middot middot middot + x + 1 by Theorem 62 Furthermore if δ = 0 then the inverse isxminus699v1399(1x)f(0) where v1399 = vx1398 ie the inverse is x699v(1x)f(0) We thusmultiply v by f(0) and take the coefficients of x0 x699 in reverse order to obtain thedesired inverse

We use signed representatives minus1 0 1 for elements of Z3 We use 2-bit tworsquos-complement representations of minus1 0 1 the bottom bit is 1 0 1 respectively and thetop bit is 1 0 0 We wrote a simple superoptimizer to find optimal sequences of 3 6 6bit operations respectively for multiplication addition and subtraction We do notclaim novelty for the approach in this paragraph Boothby and Bradshaw in [20 Section41] suggested the same 2-bit representation for Z3 (without explaining it as signedtworsquos-complement) and found the same operation counts by a similar search

We store each of the four polynomials v r f g as six 256-bit vectors for examplethe coefficients of x0 x255 in v are stored as a 256-bit vector of bottom bits and a256-bit vector of top bits Within each 256-bit vector we use the first 64-bit word to

20 Fast constant-time gcd computation and modular inversion

store the coefficients of x0 x4 x252 the next 64-bit word to store the coefficients ofx1 x5 x253 etc Multiplying by x thus moves the first 64-bit word to the secondmoves the second to the third moves the third to the fourth moves the fourth to the firstand shifts the first by 1 bit the bit that falls off the end of the first word is inserted intothe next vector

The first 256 iterations involve only the first 256-bit vectors for v and r so we simplyleave the second and third vectors as 0 The next 256 iterations involve only the first two256-bit vectors for v and r Similarly we manipulate only the first 256-bit vectors for f andg in the final 256 iterations and we manipulate only the first and second 256-bit vectors forf and g in the previous 256 iterations Our main loop thus has five sections 256 iterationsinvolving (1 1 3 3) vectors for (v r f g) 256 iterations involving (2 2 3 3) vectors 375iterations involving (3 3 3 3) vectors 256 iterations involving (3 3 2 2) vectors and 256iterations involving (3 3 1 1) vectors This makes the code longer but saves almost 20of the vector manipulations

72 Comparison to the old computation Many features of our computation arealso visible in the software from [39] We emphasize the ways that our computation ismore streamlined

There are essentially the same four polynomials v r f g in [39] (labeled ldquocrdquo ldquobrdquo ldquogrdquoldquofrdquo) Instead of a single integer δ (plus a loop counter) there are four integers ldquodegfrdquoldquodeggrdquo ldquokrdquo and ldquodonerdquo (plus a loop counter) One cannot simply eliminate ldquodegfrdquo andldquodeggrdquo in favor of their difference δ since ldquodonerdquo is set when ldquodegfrdquo reaches 0 and ldquodonerdquoinfluences ldquokrdquo which in turn influences the results of the computation the final result ismultiplied by x701minusk

Multiplication by x701minusk is a conceptually simple matter of rotating by k positionswithin 701 positions and then reducing modulo x700 + x699 + middot middot middot+ x+ 1 since x701 minus 1 isa multiple of x700 + x699 + middot middot middot+ x+ 1 However since k is a variable the obvious methodof rotating by k positions does not take constant time [39] instead performs conditionalrotations by 1 position 2 positions 4 positions etc where the condition bits are the bitsof k

The inputs and outputs are not reversed in [39] This affects the power of x multipliedinto the final results Our approach skips the powers of x entirely

Each polynomial in [39] is represented as six 256-bit vectors with the five-section patternskipping manipulations of some vectors as explained above The order of coefficients in eachvector is simply x0 x1 multiplication by x in [39] uses more arithmetic instructionsthan our approach does

At a lower level [39] uses unsigned representatives 0 1 2 of Z3 and uses a manuallydesigned sequence of 19 bit operations for a multiply-add operation in Z3 Langley [43]suggested a sequence of 16 bit operations or 14 if an ldquoand-notrdquo operation is available Asin [20] we use just 9 bit operations

The inversion software from [39] is 798 lines (32273 bytes) of Python generatingassembly producing 26634 bytes of object code Our inversion software is 541 lines (14675bytes) of C with intrinsics compiled with gcc 821 to generate assembly producing 10545bytes of object code

73 More NTRU examples More than 13 of the key-generation time in [39] isconsumed by a second inversion which replaces the modulus 3 with 213 This is handledin [39] with an initial inversion modulo 2 followed by 8 multiplications modulo 213 for aNewton iteration12 The initial inversion modulo 2 is handled in [39] by exponentiation

12This use of Newton iteration follows the NTRU key-generation strategy suggested in [61] but we pointout a difference in the details All 8 multiplications in [39] are carried out modulo 213 while [61] insteadfollows the traditional precision doubling in Newton iteration compute an inverse modulo 22 then 24then 28 (or 27) then 213 Perhaps limiting the precision of the first 6 multiplications would save timein [39]

Daniel J Bernstein and Bo-Yin Yang 21

taking just 10332 cycles the point here is that squaring modulo 2 is particularly efficientMuch more inversion time is spent in key generation in sntrup4591761 a slightly larger

cryptosystem from the NTRU Prime [14] family13 NTRU Prime uses a prime modulussuch as 4591 for sntrup4591761 to avoid concerns that a power-of-2 modulus could allowattacks (see generally [14]) but this modulus also makes arithmetic slower Before ourwork the key-generation software for sntrup4591761 took 6 million cycles partly forinversion in (Z3)[x](x761 minus xminus 1) but mostly for inversion in (Z4591)[x](x761 minus xminus 1)We reimplemented both of these inversion steps reducing the total key-generation timeto just 940852 cycles14 Almost all of our inversion code modulo 3 is shared betweensntrup4591761 and ntruhrss701

74 Does lattice-based cryptography need fast inversion Lyubashevsky Peikertand Regev [46] published an NTRU alternative that avoids inversions15 However thiscryptosystem has slower encryption than NTRU has slower decryption than NTRU andappears to be covered by a 2010 patent [35] by Gaborit and Aguilar-Melchor It is easy toenvision applications that will (1) select NTRU and (2) benefit from faster inversions inNTRU key generation16

Lyubashevsky and Seiler have very recently announced ldquoNTTRUrdquo [47] a variant ofNTRU with a ring chosen to allow very fast NTT-based (FFT-based) multiplication anddivision This variant triggers many of the security concerns described in [14] but if thevariant survives security analysis then it will eliminate concerns about key-generationspeed in NTRU

8 Definition of 2-adic division steps

Define divstep Ztimes Zlowast2 times Z2 rarr Ztimes Zlowast2 times Z2 as follows

divstep(δ f g) =

(1minus δ g (g minus f)2) if δ gt 0 and g is odd(1 + δ f (g + (g mod 2)f)2) otherwise

Note that g minus f is even in the first case (since both f and g are odd) and g + (g mod 2)fis even in the second case (since f is odd)

81 Transition matrices Write (δ1 f1 g1) = divstep(δ f g) Then(f1g1

)= T (δ f g)

(fg

)and

(1δ1

)= S(δ f g)

(1δ

)13NTRU Prime is another round-2 NIST submission One other NTRU-based encryption system

NTRUEncrypt [29] was submitted to the competition but has been merged into NTRU-HRSS for round2 see [59] For completeness we mention that NTRUEncrypt key generation took 980760 cycles forntrukem443 and 2639756 cycles for ntrukem743 As far as we know the NTRUEncrypt software was notdesigned to run in constant time

14This is a median cycle count from the SUPERCOP benchmarking framework There is considerablevariation in the cycle counts more than 3 between quartiles presumably depending on the mappingfrom virtual addresses to physical addresses There is no dependence on the secret input being inverted

15The LPR cryptosystem is the basis for several round-2 NIST submissions Kyber LAC NewHopeRound5 Saber ThreeBears and the ldquoNTRU LPRimerdquo option in NTRU Prime

16Google for example selected NTRU-HRSS for CECPQ2 as noted above with the following explanationldquoSchemes with a quotient-style key (like HRSS) will probably have faster encapdecap operations at thecost of much slower key-generation Since there will be many uses outside TLS where keys can be reusedthis is interesting as long as the key-generation speed is still reasonable for TLSrdquo See [42]

22 Fast constant-time gcd computation and modular inversion

where T Ztimes Zlowast2 times Z2 rarrM2(Z[12]) is defined by

T (δ f g) =

(0 1minus12

12

)if δ gt 0 and g is odd

(1 0

g mod 22

12

)otherwise

and S Ztimes Zlowast2 times Z2 rarrM2(Z) is defined by

S(δ f g) =

(1 01 minus1

)if δ gt 0 and g is odd

(1 01 1

)otherwise

Note that both T (δ f g) and S(δ f g) are defined entirely by δ the bottom bit of f(which is always 1) and the bottom bit of g they do not depend on the remaining bits off and g82 Decomposition This 2-adic divstep like the power-series divstep is a compositionof two simpler functions Specifically it is a conditional swap that replaces (δ f g) with(minusδ gminusf) if δ gt 0 and g is odd followed by an elimination that replaces (δ f g) with(1 + δ f (g + (g mod 2)f)2)83 Sizes Write (δ1 f1 g1) = divstep(δ f g) If f and g are integers that fit into n+ 1bits tworsquos-complement in the sense that minus2n le f lt 2n and minus2n le g lt 2n then alsominus2n le f1 lt 2n and minus2n le g1 lt 2n Similarly if minus2n lt f lt 2n and minus2n lt g lt 2n thenminus2n lt f1 lt 2n and minus2n lt g1 lt 2n Beware however that intermediate results such asg minus f need an extra bit

As in the polynomial case an important feature of divstep is that iterating divstepon integer inputs eventually sends g to 0 Concretely we will show that (D + o(1))niterations are sufficient where D = 2(10minus log2 633) = 2882100569 see Theorem 112for exact bounds The proof technique cannot do better than (d+ o(1))n iterations whered = 14(15minus log2(561 +

radic249185)) = 2828339631 Numerical evidence is consistent

with the possibility that (d + o(1))n iterations are required in the worst case which iswhat matters in the context of constant-time algorithms84 A non-functioning variant Many variations are possible in the definition ofdivstep However some care is required as the following variant illustrates

Define posdivstep Ztimes Zlowast2 times Z2 rarr Ztimes Zlowast2 times Z2 as follows

posdivstep(δ f g) =

(1minus δ g (g + f)2) if δ gt 0 and g is odd(1 + δ f (g + (g mod 2)f)2) otherwise

This is just like divstep except that it eliminates the negation of f in the swapped caseIt is not true that iterating posdivstep on integer inputs eventually sends g to 0 For

example posdivstep2(1 1 1) = (1 1 1) For comparison in the polynomial case we werefree to negate f (or g) at any moment85 A centered variant As another variant of divstep we define cdivstep Ztimes Zlowast2 timesZ2 rarr Ztimes Zlowast2 times Z2 as follows

cdivstep(δ f g) =

(1minus δ g (f minus g)2) if δ gt 0 and g is odd(1 + δ f (g + (g mod 2)f)2) if δ = 0(1 + δ f (g minus (g mod 2)f)2) otherwise

Daniel J Bernstein and Bo-Yin Yang 23

Iterating cdivstep produces the same intermediate results as a variable-time gcd algorithmintroduced by Stehleacute and Zimmermann in [62] The analogous gcd algorithm for divstep isa new variant of the StehleacutendashZimmermann algorithm and we analyze the performance ofthis variant to prove bounds on the performance of divstep see Appendices E F and GAs an analogy one of our proofs in the polynomial case relies on relating x-adic divstep tothe EuclidndashStevin algorithm

The extra case distinctions make each cdivstep iteration more complicated than eachdivstep iteration Similarly the StehleacutendashZimmermann algorithm is more complicated thanthe new variant To understand the motivation for cdivstep compare divstep3(minus2 f g)to cdivstep3(minus2 f g) One has divstep3(minus2 f g) = (1 f g3) where

8g3 isin g g + f g + 2f g + 3f g + 4f g + 5f g + 6f g + 7f

One has cdivstep3(minus2 f g) = (1 f gprime3) where

8gprime3 isin g minus 3f g minus 2f g minus f g g + f g + 2f g + 3f g + 4f

It seems intuitively clear that the ldquocenteredrdquo output gprime3 is smaller than the ldquouncenteredrdquooutput g3 that more generally cdivstep has smaller outputs than divstep and thatcdivstep thus uses fewer iterations than divstep

However our analyses do not support this intuition All available evidence indicatesthat surprisingly the worst case of cdivstep uses almost 10 more iterations than theworst case of divstep Concretely our proof technique cannot do better than (c+ o(1))niterations for cdivstep where c = 2(log2(

radic17minus 1)minus 1) = 3110510062 and numerical

evidence is consistent with the possibility that (c+ o(1))n iterations are required

86 A plus-or-minus variant As yet another example the following variant of divstepis essentially17 the BrentndashKung ldquoAlgorithm PMrdquo from [24]

pmdivstep(δ f g) =

(minusδ g (f minus g)2) if δ ge 0 and (g minus f) mod 4 = 0(minusδ g (f + g)2) if δ ge 0 and (g + f) mod 4 = 0(δ f (g minus f)2) if δ lt 0 and (g minus f) mod 4 = 0(δ f (g + f)2) if δ lt 0 and (g + f) mod 4 = 0(1 + δ f g2) otherwise

There are even more cases here than in cdivstep although splitting out an initial conditionalswap reduces the number of cases to 3 Another complication here is that the linearcombinations used in pmdivstep depend on the bottom two bits of f and g rather thanjust the bottom bit

To understand the motivation for pmdivstep note that after each addition or subtractionthere are always at least two halvings as in the ldquosliding windowrdquo approach to exponentiationAgain it seems intuitively clear that this reduces the number of iterations required Considerhowever the following example (which is from [24 Theorem 3] modulo minor details)pmdivstep needs 3bminus 1 iterations to reach g = 0 starting from (1 1 3 middot 2bminus2) Specificallythe first bminus 2 iterations produce

(2 1 3 middot 2bminus3) (3 1 3 middot 2bminus4) (bminus 2 1 3 middot 2) (bminus 1 1 3)

and then the next 2b+ 1 iterations produce

(1minus b 3 2) (2minus b 3 1) (2minus b 3 2) (3minus b 3 1) (3minus b 3 2) (minus1 3 1) (minus1 3 2) (0 3 1) (0 1 2) (1 1 1) (minus1 1 0)

17The BrentndashKung algorithm has an extra loop to allow the case that both inputs f g are even Compare[24 Figure 4] to the definition of pmdivstep

24 Fast constant-time gcd computation and modular inversion

Brent and Kung prove that dcne systolic cells are enough for their algorithm wherec = 3110510062 is the constant mentioned above and they conjecture that (3 + o(1))ncells are enough so presumably (3 + o(1))n iterations of pmdivstep are enough Ouranalysis shows that fewer iterations are sufficient for divstep even though divstep issimpler than pmdivstep

9 Iterates of 2-adic division stepsThe following results are analogous to the x-adic results in Section 4 We state theseresults separately to support verification

Starting from (δ f g) isin ZtimesZlowast2timesZ2 define (δn fn gn) = divstepn(δ f g) isin ZtimesZlowast2timesZ2Tn = T (δn fn gn) isinM2(Z[12]) and Sn = S(δn fn gn) isinM2(Z)

Theorem 91(fngn

)= Tnminus1 middot middot middot Tm

(fmgm

)and

(1δn

)= Snminus1 middot middot middot Sm

(1δm

)if n ge m ge 0

Proof This follows from(fm+1gm+1

)= Tm

(fmgm

)and

(1

δm+1

)= Sm

(1δm

)

Theorem 92 Tnminus1 middot middot middot Tm isin( 1

2nminusmminus1Z 12nminusmminus1Z

12nminusmZ 1

2nminusmZ

)if n gt m ge 0

Proof As in Theorem 43

Theorem 93 Snminus1 middot middot middot Sm isin(

1 02minus (nminusm) nminusmminus 2 nminusm 1minus1

)if n gt

m ge 0

Proof As in Theorem 44

Theorem 94 Let m t be nonnegative integers Assume that δprimem = δm f primem equiv fm(mod 2t) and gprimem equiv gm (mod 2t) Then for each integer n with m lt n le m+ t T primenminus1 =Tnminus1 S primenminus1 = Snminus1 δprimen = δn f primen equiv fn (mod 2tminus(nminusmminus1)) and gprimen equiv gn (mod 2tminus(nminusm))

Proof Fix m t and induct on n Observe that (δprimenminus1 fprimenminus1 mod 2 gprimenminus1 mod 2) equals

(δnminus1 fnminus1 mod 2 gnminus1 mod 2)

bull If n = m + 1 By assumption f primem equiv fm (mod 2t) and n le m + t so t ge 1 so inparticular f primem mod 2 = fm mod 2 Similarly gprimem mod 2 = gm mod 2 By assumptionδprimem = δm

bull If n gt m + 1 f primenminus1 equiv fnminus1 (mod 2tminus(nminusmminus2)) by the inductive hypothesis andn le m+ t so tminus(nminusmminus2) ge 2 so in particular f primenminus1 mod 2 = fnminus1 mod 2 similarlygprimenminus1 equiv gnminus1 (mod 2tminus(nminusmminus1)) by the inductive hypothesis and tminus (nminusmminus1) ge 1so in particular gprimenminus1 mod 2 = gnminus1 mod 2 and δprimenminus1 = δnminus1 by the inductivehypothesis

Now use the fact that T and S inspect only the bottom bit of each number to see that

T primenminus1 = T (δprimenminus1 fprimenminus1 g

primenminus1) = T (δnminus1 fnminus1 gnminus1) = Tnminus1

S primenminus1 = S(δprimenminus1 fprimenminus1 g

primenminus1) = S(δnminus1 fnminus1 gnminus1) = Snminus1

as claimedMultiply

(1

δprimenminus1

)=(

1δnminus1

)by S primenminus1 = Snminus1 to see that δprimen = δn as claimed

Daniel J Bernstein and Bo-Yin Yang 25

deftruncate(ft)ift==0return0twot=1ltlt(t-1)return((f+twot)amp(2twot-1))-twot

defdivsteps2(ntdeltafg)asserttgt=nandngt=0fg=truncate(ft)truncate(gt)uvqr=1001

whilengt0f=truncate(ft)ifdeltagt0andgamp1deltafguvqr=-deltag-fqr-u-vg0=gamp1deltagqr=1+delta(g+g0f)2(q+g0u)2(r+g0v)2nt=n-1t-1g=truncate(ZZ(g)t)

M2Q=MatrixSpace(QQ2)returndeltafgM2Q((uvqr))

Figure 101 Algorithm divsteps2 to compute (δn fn gn) = divstepn(δ f g) Inputsn t isin Z with 0 le n le t δ isin Z at least bottom t bits of f g isin Z2 Outputs δn bottom tbits of fn if n = 0 or tminus (nminus 1) bits if n ge 1 bottom tminus n bits of gn Tnminus1 middot middot middot T0

By the inductive hypothesis (T primem T primenminus2) = (Tm Tnminus2) and T primenminus1 = Tnminus1 so

T primenminus1 middot middot middot T primem = Tnminus1 middot middot middot Tm Abbreviate Tnminus1 middot middot middot Tm as P Then(f primengprimen

)= P

(f primemgprimem

)and(

fngn

)= P

(fmgm

)by Theorem 91 so

(f primen minus fngprimen minus gn

)= P

(f primem minus fmgprimem minus gm

)

By Theorem 92 P isin( 1

2nminusmminus1Z 12nminusmminus1Z

12nminusmZ 1

2nminusmZ

) By assumption f primem minus fm and gprimem minus gm

are multiples of 2t Multiply to see that f primen minus fn is a multiple of 2t2nminusmminus1 = 2tminus(nminusmminus1)

as claimed and gprimen minus gn is a multiple of 2t2nminusm = 2tminus(nminusm) as claimed

10 Fast computation of iterates of 2-adic division stepsWe describe just as in Section 5 two algorithms to compute (δn fn gn) = divstepn(δ f g)We are given the bottom t bits of f g and compute fn gn to tminusn+1 tminusn bits respectivelyif n ge 1 see Theorem 94 We also compute the product of the relevant T matrices Wespecify these algorithms in Figures 101ndash102 as functions divsteps2 and jumpdivsteps2in Sage

The first function divsteps2 simply takes divsteps one after another analogouslyto divstepsx The second function jumpdivsteps2 uses a straightforward divide-and-conquer strategy analogously to jumpdivstepsx More generally j in jumpdivsteps2can be taken anywhere between 1 and nminus 1 this generalization includes divsteps2 as aspecial case

As in Section 5 the Sage functions are not constant-time but it is easy to buildconstant-time versions of the same computations The comments on performance inSection 5 are applicable here with integer arithmetic replacing polynomial arithmeticThe same basic optimization techniques are also applicable here

26 Fast constant-time gcd computation and modular inversion

fromdivsteps2importdivsteps2truncate

defjumpdivsteps2(ntdeltafg)asserttgt=nandngt=0ifnlt=1returndivsteps2(ntdeltafg)

j=n2

deltaf1g1P1=jumpdivsteps2(jjdeltafg)fg=P1vector((fg))fg=truncate(ZZ(f)t-j)truncate(ZZ(g)t-j)

deltaf2g2P2=jumpdivsteps2(n-jn-jdeltafg)fg=P2vector((fg))fg=truncate(ZZ(f)t-n+1)truncate(ZZ(g)t-n)

returndeltafgP2P1

Figure 102 Algorithm jumpdivsteps2 to compute (δn fn gn) = divstepn(δ f g) Sameinputs and outputs as in Figure 101

11 Fast integer gcd computation and modular inversionThis section presents two algorithms If f is an odd integer and g is an integer then gcd2computes gcdf g If also gcdf g = 1 then recip2 computes the reciprocal of g modulof Figure 111 specifies both algorithms and Theorem 112 justifies both algorithms

The algorithms take the smallest nonnegative integer d such that |f | lt 2d and |g| lt 2dMore generally one can take d as an extra input the algorithms then take constant time ifd is constant Theorem 112 relies on f2 + 4g2 le 5 middot 22d (which is certainly true if |f | lt 2dand |g| lt 2d) but does not rely on d being minimal One can further generalize to alloweven f by first finding the number of shared powers of 2 in f and g and then reducing tothe odd case

The algorithms then choose a positive integer m as a particular function of d andcall divsteps2 (which can be transparently replaced with jumpdivsteps2) to compute(δm fm gm) = divstepm(1 f g) The choice of m guarantees gm = 0 and fm = plusmn gcdf gby Theorem 112

The gcd2 algorithm uses input precisionm+d to obtain d+1 bits of fm ie to obtain thesigned remainder of fm modulo 2d+1 This signed remainder is exactly fm = plusmn gcdf gsince the assumptions |f | lt 2d and |g| lt 2d imply |gcdf g| lt 2d

The recip2 algorithm uses input precision just m+ 1 to obtain the signed remainder offm modulo 4 This algorithm assumes gcdf g = 1 so fm = plusmn1 so the signed remainderis exactly fm The algorithm also extracts the top-right corner vm of the transition matrixand multiplies by fm obtaining a reciprocal of g modulo f by Theorem 112

Both gcd2 and recip2 are bottom-up algorithms and in particular recip2 naturallyobtains this reciprocal vmfm as a fraction with denominator 2mminus1 It multiplies by 2mminus1

to obtain an integer and then multiplies by ((f + 1)2)mminus1 modulo f to divide by 2mminus1

modulo f The reciprocal of 2mminus1 can be computed ahead of time Another way tocompute this reciprocal is through Hensel lifting to compute fminus1 (mod 2mminus1) for detailsand further techniques see Section 67

We emphasize that recip2 assumes that its inputs are coprime it does not notice ifthis assumption is violated One can check the assumption by computing fm to higherprecision (as in gcd2) or by multiplying the supposed reciprocal by g and seeing whether

Daniel J Bernstein and Bo-Yin Yang 27

fromdivsteps2importdivsteps2

defiterations(d)return(49d+80)17ifdlt46else(49d+57)17

defgcd2(fg)assertfamp1d=max(fnbits()gnbits())m=iterations(d)deltafmgmP=divsteps2(mm+d1fg)returnabs(fm)

defrecip2(fg)assertfamp1d=max(fnbits()gnbits())m=iterations(d)precomp=Integers(f)((f+1)2)^(m-1)deltafmgmP=divsteps2(mm+11fg)V=sign(fm)ZZ(P[0][1]2^(m-1))returnZZ(Vprecomp)

Figure 111 Algorithm gcd2 to compute gcdf g algorithm recip2 to compute thereciprocal of g modulo f when gcdf g = 1 Both algorithms assume that f is odd

the result is 1 modulo f For comparison the recip algorithm from Section 6 does noticeif its polynomial inputs are coprime using a quick test of δm

Theorem 112 Let f be an odd integer Let g be an integer Define (δn fn gn) =

divstepn(1 f g) Tn = T (δn fn gn) and(un vnqn rn

)= Tnminus1 middot middot middot T0 Let d be a real number

Assume that f2 + 4g2 le 5 middot 22d Let m be an integer Assume that m ge b(49d+ 80)17c ifd lt 46 and that m ge b(49d+ 57)17c if d ge 46 Then m ge 1 gm = 0 fm = plusmn gcdf g2mminus1vm isin Z and 2mminus1vmg equiv 2mminus1fm (mod f)

In particular if gcdf g = 1 then fm = plusmn1 and dividing 2mminus1vmfm by 2mminus1 modulof produces the reciprocal of g modulo f

Proof First f2 ge 1 so 1 le f2 + 4g2 le 5 middot 22d lt 22d+73 since 53 lt 27 Hence d gt minus76If d lt 46 then m ge b(49d+ 80)17c ge b(49(minus76) + 80)17c = 1 If d ge 46 thenm ge b(49d+ 57)17c ge b(49 middot 46 + 57)17c = 135 Either way m ge 1 as claimed

Define R0 = f R1 = 2g and G = gcdR0 R1 Then G = gcdf g since f is oddDefine b = log2

radicR2

0 +R21 = log2

radicf2 + 4g2 ge 0 Then 22b = f2 + 4g2 le 5 middot 22d so

b le d+ log4 5 We now split into three cases in each case defining n as in Theorem G6

bull If b le 21 Define n = b19b7c Then n le b49b17c le b49(d+ log4 5)17c leb(49d+ 57)17c le m since 549 le 457

bull If 21 lt b le 46 Define n = b(49b+ 23)17c If d ge 46 then n le b(49d+ 23)17c leb(49d+ 57)17c le m Otherwise d lt 46 so n le b(49(d+ log4 5) + 23)17c leb(49d+ 80)17c le m

bull If b gt 46 Define n = b49b17c Then n le b49b17c le b49(d+ log4 5)17c leb(49d+ 57)17c le m

28 Fast constant-time gcd computation and modular inversion

In all cases n le mNow divstepn(1 R0 R12) isin ZtimesZtimes0 by Theorem G6 so divstepm(1 R0 R12) isin

Ztimes Ztimes 0 so

(δm fm gm) = divstepm(1 f g) = divstepm(1 R0 R12) isin Ztimes GminusG times 0

by Theorem G3 Hence gm = 0 as claimed and fm = plusmnG = plusmn gcdf g as claimedBoth um and vm are in (12mminus1)Z by Theorem 92 In particular 2mminus1vm isin Z as

claimed Finally fm = umf + vmg by Theorem 91 so 2mminus1fm = 2mminus1umf + 2mminus1vmgand 2mminus1um isin Z so 2mminus1fm equiv 2mminus1vmg (mod f) as claimed

12 Software case study for integer modular inversionAs emphasized in Section 1 we have selected an inversion problem to be extremely favorableto Fermatrsquos method We use Intel platforms with 64-bit multipliers We focus on theproblem of inverting modulo the well-known Curve25519 prime p = 2255 minus 19 Therehas been a long line of Curve25519 implementation papers starting with [10] in 2006 wecompare to the latest speed records [54] (in assembly) for inversion modulo this prime

Despite all these Fermat advantages we achieve better speeds using our 2-adic divisionsteps As mentioned in Section 1 we take 10050 cycles 8778 cycles and 8543 cycles onHaswell Skylake and Kaby Lake where [54] used 11854 cycles 9301 cycles and 8971cycles This section looks more closely at how the new computation works

121 Details of the new computation Theorem 112 guarantees that 738 divstepiterations suffice since b(49 middot 255 + 57)17c = 738 To simplify the computation weactually use 744 = 12 middot 62 iterations

The decisions in the first 62 iterations are determined entirely by the bottom 62 bitsof the inputs We perform these iterations entirely in 64-bit words apply the resulting62-step transition matrix to the entire original inputs to jump to the result of 62 divstepiterations and then repeat

This is similar in two important ways to Lehmerrsquos gcd computation from [44] One ofLehmerrsquos ideas mentioned before is to carry out some steps in limited precision Anotherof Lehmerrsquos ideas is for this limit to be a small constant Our advantage over Lehmerrsquoscomputation is that we have a completely regular constant-time data flow

There are many other possibilities for which precision to use at each step All of thesepossibilities can be described using the jumping framework described in Section 53 whichapplies to both the x-adic case and the 2-adic case Figure 122 gives three examples ofjump strategies The first strategy computes each divstep in turn to full precision as inthe case study in Section 7 The second strategy is what we use in this case study Thethird strategy illustrates an asymptotically faster choice of jump distances The thirdstrategy might be more efficient than the second strategy in hardware but the secondstrategy seems optimal for 64-bit CPUs

In more detail we invert a 255-bit x modulo p as follows

1 Set f = p g = x δ = 1 i = 1

2 Set f0 = f (mod 264) g0 = g (mod 264)

3 Compute (δprime f1 g1) = divstep62(δ f0 g0) and obtain a scaled transition matrix Tist Ti

262

(f0g0

)=(f1g1

) The 63-bit signed entries of Ti fit into 64-bit registers We

call this step jump64divsteps2

4 Compute (f g)larr Ti(f g)262 Set δ = δprime

Daniel J Bernstein and Bo-Yin Yang 29

Figure 122 Three jump strategies for 744 divstep iterations on 255-bit integers Leftstrategy j = 1 ie computing each divstep in turn to full precision Middle strategyj = 1 for le62 requested iterations else j = 62 ie using 62 bottom bits to compute 62iterations jumping 62 iterations using 62 new bottom bits to compute 62 more iterationsetc Right strategy j = 1 for 2 requested iterations j = 2 for 3 or 4 requested iterationsj = 4 for 5 or 6 or 7 or 8 requested iterations etc Vertical axis 0 on top through 744 onbottom number of iterations Horizontal axis (within each strategy) 0 on left through254 on right bit positions used in computation

5 Set ilarr i+ 1 Go back to step 3 if i le 12

6 Compute v mod p where v is the top-right corner of T12T11 middot middot middot T1

(a) Compute pair-products T2iT2iminus1 with entries being 126-bit signed integers whichfits into two 64-bit limbs (radix 264)

(b) Compute quad-products T4iT4iminus1T4iminus2T4iminus3 with entries being 252-bit signedintegers (four 64-bit limbs radix 264)

(c) At this point the three quad-products are converted into unsigned integersmodulo p divided into 4 64-bit limbs

(d) Compute final vector-matrix-vector multiplications modulo p

7 Compute xminus1 = sgn(f) middot v middot 2minus744 (mod p) where 2minus744 is precomputed

123 Other CPUs Our advantage is larger on platforms with smaller multipliers Forexample we have written software to invert modulo 2255 minus 19 on an ARM Cortex-A7obtaining a median cycle count of 35277 cycles The best previous Cortex-A7 result wehave found in the literature is 62648 cycles reported by Fujii and Aranha [34] usingFermatrsquos method Internally our software does repeated calls to jump32divsteps2 units

30 Fast constant-time gcd computation and modular inversion

of 30 iterations with the resulting transition matrix fitting inside 32-bit general-purposeregisters

124 Other primes Nath and Sarkar present inversion speeds for 20 different specialprimes 12 of which were considered in previous work For concreteness we considerinversion modulo the double-size prime 2511 minus 187 AranhandashBarretondashPereirandashRicardini [8]selected this prime for their ldquoM-511rdquo curve

Nath and Sarkar report that their Fermat inversion software for 2511 minus 187 takes 72804Haswell cycles 47062 Skylake cycles or 45014 Kaby Lake cycles The slowdown from2255 minus 19 more than 5times in each case is easy to explain the exponent is twice as longproducing almost twice as many multiplications each multiplication handles twice as manybits and multiplication cost per bit is not constant

Our 511-bit software is solidly faster under 30000 cycles on all of these platforms Mostof the cycles in our computation are in evaluating jump64divsteps2 and the number ofcalls to jump64divsteps2 scales linearly with the number of bits in the input There issome overhead for multiplications so a modulus of twice the size takes more than twice aslong but our scalability to large moduli is much better than the Fermat scalability

Our advantage is also larger in applications that use ldquorandomrdquo primes rather thanspecial primes we pay the extra cost for Montgomery multiplication only at the end ofour computation whereas Fermat inversion pays the extra cost in each step

References[1] mdash (no editor) Actes du congregraves international des matheacutematiciens tome 3 Gauthier-

Villars Eacutediteur Paris 1971 See [41]

[2] mdash (no editor) CWIT 2011 12th Canadian workshop on information theoryKelowna British Columbia May 17ndash20 2011 Institute of Electrical and ElectronicsEngineers 2011 ISBN 978-1-4577-0743-8 See [51]

[3] Carlisle Adams Jan Camenisch (editors) Selected areas in cryptographymdashSAC 201724th international conference Ottawa ON Canada August 16ndash18 2017 revisedselected papers Lecture Notes in Computer Science 10719 Springer 2018 ISBN978-3-319-72564-2 See [15]

[4] Carlos Aguilar Melchor Nicolas Aragon Slim Bettaieb Loiumlc Bidoux OlivierBlazy Jean-Christophe Deneuville Philippe Gaborit Edoardo Persichetti GillesZeacutemor Hamming Quasi-Cyclic (HQC) (2017) URL httpspqc-hqcorgdochqc-specification_2017-11-30pdf Citations in this document sect1

[5] Alfred V Aho (chairman) Proceedings of fifth annual ACM symposium on theory ofcomputing Austin Texas April 30ndashMay 2 1973 ACM New York 1973 See [53]

[6] Martin Albrecht Carlos Cid Kenneth G Paterson CJ Tjhai Martin TomlinsonNTS-KEM ldquoThe main document submitted to NISTrdquo (2017) URL httpsnts-kemio Citations in this document sect1

[7] Alejandro Cabrera Aldaya Cesar Pereida Garciacutea Luis Manuel Alvarez Tapia BillyBob Brumley Cache-timing attacks on RSA key generation (2018) URL httpseprintiacrorg2018367 Citations in this document sect1

[8] Diego F Aranha Paulo S L M Barreto Geovandro C C F Pereira JeffersonE Ricardini A note on high-security general-purpose elliptic curves (2013) URLhttpseprintiacrorg2013647 Citations in this document sect124

Daniel J Bernstein and Bo-Yin Yang 31

[9] Elwyn R Berlekamp Algebraic coding theory McGraw-Hill 1968 Citations in thisdocument sect1

[10] Daniel J Bernstein Curve25519 new Diffie-Hellman speed records in PKC 2006 [70](2006) 207ndash228 URL httpscryptopapershtmlcurve25519 Citations inthis document sect1 sect12

[11] Daniel J Bernstein Fast multiplication and its applications in [27] (2008) 325ndash384 URL httpscryptopapershtmlmultapps Citations in this documentsect55

[12] Daniel J Bernstein Tung Chou Tanja Lange Ingo von Maurich Rafael MisoczkiRuben Niederhagen Edoardo Persichetti Christiane Peters Peter Schwabe NicolasSendrier Jakub Szefer Wen Wang Classic McEliece conservative code-based cryp-tography ldquoSupporting Documentationrdquo (2017) URL httpsclassicmcelieceorgnisthtml Citations in this document sect1

[13] Daniel J Bernstein Tung Chou Peter Schwabe McBits fast constant-time code-based cryptography in CHES 2013 [18] (2013) 250ndash272 URL httpsbinarycryptomcbitshtml Citations in this document sect1 sect1

[14] Daniel J Bernstein Chitchanok Chuengsatiansup Tanja Lange Christine vanVredendaal NTRU Prime reducing attack surface at low cost full version of[15] (2017) URL httpsntruprimecryptopapershtml Citations in thisdocument sect1 sect1 sect11 sect73 sect73 sect74

[15] Daniel J Bernstein Chitchanok Chuengsatiansup Tanja Lange Christine vanVredendaal NTRU Prime reducing attack surface at low cost in SAC 2017 [3]abbreviated version of [14] (2018) 235ndash260

[16] Daniel J Bernstein Niels Duif Tanja Lange Peter Schwabe Bo-Yin Yang High-speed high-security signatures Journal of Cryptographic Engineering 2 (2012)77ndash89 see also older version at CHES 2011 URL httpsed25519cryptoed25519-20110926pdf Citations in this document sect1

[17] Daniel J Bernstein Peter Schwabe NEON crypto in CHES 2012 [56] (2012) 320ndash339URL httpscryptopapershtmlneoncrypto Citations in this documentsect1 sect1

[18] Guido Bertoni Jean-Seacutebastien Coron (editors) Cryptographic hardware and embeddedsystemsmdashCHES 2013mdash15th international workshop Santa Barbara CA USAAugust 20ndash23 2013 proceedings Lecture Notes in Computer Science 8086 Springer2013 ISBN 978-3-642-40348-4 See [13]

[19] Adam W Bojanczyk Richard P Brent A systolic algorithm for extended GCDcomputation Computers amp Mathematics with Applications 14 (1987) 233ndash238ISSN 0898-1221 URL httpsmaths-peopleanueduau~brentpdrpb096ipdf Citations in this document sect13 sect13

[20] Tomas J Boothby and Robert W Bradshaw Bitslicing and the method of fourRussians over larger finite fields (2009) URL httparxivorgabs09011413Citations in this document sect71 sect72

[21] Joppe W Bos Constant time modular inversion Journal of Cryptographic Engi-neering 4 (2014) 275ndash281 URL httpjoppeboscomfilesCTInversionpdfCitations in this document sect1 sect1 sect1 sect1 sect1 sect13

32 Fast constant-time gcd computation and modular inversion

[22] Richard P Brent Fred G Gustavson David Y Y Yun Fast solution of Toeplitzsystems of equations and computation of Padeacute approximants Journal of Algorithms1 (1980) 259ndash295 ISSN 0196-6774 URL httpsmaths-peopleanueduau~brentpdrpb059ipdf Citations in this document sect14 sect14

[23] Richard P Brent Hsiang-Tsung Kung Systolic VLSI arrays for polynomial GCDcomputation IEEE Transactions on Computers C-33 (1984) 731ndash736 URLhttpsmaths-peopleanueduau~brentpdrpb073ipdf Citations in thisdocument sect13 sect13 sect32

[24] Richard P Brent Hsiang-Tsung Kung A systolic VLSI array for integer GCDcomputation ARITH-7 (1985) 118ndash125 URL httpsmaths-peopleanueduau~brentpdrpb077ipdf Citations in this document sect13 sect13 sect13 sect13 sect14sect86 sect86 sect86

[25] Richard P Brent Paul Zimmermann An O(M(n) logn) algorithm for the Jacobisymbol in ANTS 2010 [37] (2010) 83ndash95 URL httpsarxivorgpdf10042091pdf Citations in this document sect14

[26] Duncan A Buell (editor) Algorithmic number theory 6th international symposiumANTS-VI Burlington VT USA June 13ndash18 2004 proceedings Lecture Notes inComputer Science 3076 Springer 2004 ISBN 3-540-22156-5 See [62]

[27] Joe P Buhler Peter Stevenhagen (editors) Surveys in algorithmic number theoryMathematical Sciences Research Institute Publications 44 Cambridge UniversityPress New York 2008 See [11]

[28] Wouter Castryck Tanja Lange Chloe Martindale Lorenz Panny Joost RenesCSIDH an efficient post-quantum commutative group action in Asiacrypt 2018[55] (2018) 395ndash427 URL httpscsidhisogenyorgcsidh-20181118pdfCitations in this document sect1

[29] Cong Chen Jeffrey Hoffstein William Whyte Zhenfei Zhang NIST PQ SubmissionNTRUEncrypt A lattice based encryption algorithm (2017) URL httpscsrcnistgovProjectsPost-Quantum-CryptographyRound-1-Submissions Cita-tions in this document sect73

[30] Tung Chou McBits revisited in CHES 2017 [33] (2017) 213ndash231 URL httpstungchougithubiomcbits Citations in this document sect1

[31] Benoicirct Daireaux Veacuteronique Maume-Deschamps Brigitte Valleacutee The Lyapunovtortoise and the dyadic hare in [49] (2005) 71ndash94 URL httpemisamsorgjournalsDMTCSpdfpapersdmAD0108pdf Citations in this document sectE5sectF2

[32] Jean Louis Dornstetter On the equivalence between Berlekamprsquos and Euclidrsquos algo-rithms IEEE Transactions on Information Theory 33 (1987) 428ndash431 Citations inthis document sect1

[33] Wieland Fischer Naofumi Homma (editors) Cryptographic hardware and embeddedsystemsmdashCHES 2017mdash19th international conference Taipei Taiwan September25ndash28 2017 proceedings Lecture Notes in Computer Science 10529 Springer 2017ISBN 978-3-319-66786-7 See [30] [39]

[34] Hayato Fujii Diego F Aranha Curve25519 for the Cortex-M4 and beyond LATIN-CRYPT 2017 to appear (2017) URL httpswwwlascaicunicampbrmediapublicationspaper39pdf Citations in this document sect11 sect123

Daniel J Bernstein and Bo-Yin Yang 33

[35] Philippe Gaborit Carlos Aguilar Melchor Cryptographic method for communicatingconfidential information patent EP2537284B1 (2010) URL httpspatentsgooglecompatentEP2537284B1en Citations in this document sect74

[36] Mariya Georgieva Freacutedeacuteric de Portzamparc Toward secure implementation ofMcEliece decryption in COSADE 2015 [48] (2015) 141ndash156 URL httpseprintiacrorg2015271 Citations in this document sect1

[37] Guillaume Hanrot Franccedilois Morain Emmanuel Thomeacute (editors) Algorithmic numbertheory 9th international symposium ANTS-IX Nancy France July 19ndash23 2010proceedings Lecture Notes in Computer Science 6197 2010 ISBN 978-3-642-14517-9See [25]

[38] Agnes E Heydtmann Joslashrn M Jensen On the equivalence of the Berlekamp-Masseyand the Euclidean algorithms for decoding IEEE Transactions on Information Theory46 (2000) 2614ndash2624 Citations in this document sect1

[39] Andreas Huumllsing Joost Rijneveld John M Schanck Peter Schwabe High-speedkey encapsulation from NTRU in CHES 2017 [33] (2017) 232ndash252 URL httpseprintiacrorg2017667 Citations in this document sect1 sect11 sect11 sect13 sect7sect7 sect72 sect72 sect72 sect72 sect72 sect72 sect72 sect72 sect73 sect73 sect73 sect73 sect73

[40] Burton S Kaliski The Montgomery inverse and its applications IEEE Transactionson Computers 44 (1995) 1064ndash1065 Citations in this document sect1

[41] Donald E Knuth The analysis of algorithms in [1] (1971) 269ndash274 Citations inthis document sect1 sect14 sect14

[42] Adam Langley CECPQ2 (2018) URL httpswwwimperialvioletorg20181212cecpq2html Citations in this document sect74

[43] Adam Langley Add initial HRSS support (2018)URL httpsboringsslgooglesourcecomboringssl+7b935937b18215294e7dbf6404742855e3349092cryptohrsshrssc Citationsin this document sect72

[44] Derrick H Lehmer Euclidrsquos algorithm for large numbers American MathematicalMonthly 45 (1938) 227ndash233 ISSN 0002-9890 URL httplinksjstororgsicisici=0002-9890(193804)454lt227EAFLNgt20CO2-Y Citations in thisdocument sect1 sect14 sect121

[45] Xianhui Lu Yamin Liu Dingding Jia Haiyang Xue Jingnan He Zhenfei ZhangLAC lattice-based cryptosystems (2017) URL httpscsrcnistgovProjectsPost-Quantum-CryptographyRound-1-Submissions Citations in this documentsect1

[46] Vadim Lyubashevsky Chris Peikert Oded Regev On ideal lattices and learningwith errors over rings Journal of the ACM 60 (2013) Article 43 35 pages URLhttpseprintiacrorg2012230 Citations in this document sect74

[47] Vadim Lyubashevsky Gregor Seiler NTTRU truly fast NTRU using NTT (2019)URL httpseprintiacrorg2019040 Citations in this document sect74

[48] Stefan Mangard Axel Y Poschmann (editors) Constructive side-channel analysisand secure designmdash6th international workshop COSADE 2015 Berlin GermanyApril 13ndash14 2015 revised selected papers Lecture Notes in Computer Science 9064Springer 2015 ISBN 978-3-319-21475-7 See [36]

34 Fast constant-time gcd computation and modular inversion

[49] Conrado Martiacutenez (editor) 2005 international conference on analysis of algorithmspapers from the conference (AofArsquo05) held in Barcelona June 6ndash10 2005 TheAssociation Discrete Mathematics amp Theoretical Computer Science (DMTCS)Nancy 2005 URL httpwwwemisamsorgjournalsDMTCSproceedingsdmAD01indhtml See [31]

[50] James Massey Shift-register synthesis and BCH decoding IEEE Transactions onInformation Theory 15 (1969) 122ndash127 ISSN 0018-9448 Citations in this documentsect1

[51] Todd D Mateer On the equivalence of the Berlekamp-Massey and the euclideanalgorithms for algebraic decoding in CWIT 2011 [2] (2011) 139ndash142 Citations inthis document sect1

[52] Niels Moumlller On Schoumlnhagersquos algorithm and subquadratic integer GCD computationMathematics of Computation 77 (2008) 589ndash607 URL httpswwwlysatorliuse~nissearchiveS0025-5718-07-02017-0pdf Citations in this documentsect14

[53] Robert T Moenck Fast computation of GCDs in STOC 1973 [5] (1973) 142ndash151Citations in this document sect14 sect14 sect14 sect14

[54] Kaushik Nath Palash Sarkar Efficient inversion in (pseudo-)Mersenne primeorder fields (2018) URL httpseprintiacrorg2018985 Citations in thisdocument sect11 sect12 sect12

[55] Thomas Peyrin Steven D Galbraith Advances in cryptologymdashASIACRYPT 2018mdash24th international conference on the theory and application of cryptology and infor-mation security Brisbane QLD Australia December 2ndash6 2018 proceedings part ILecture Notes in Computer Science 11272 Springer 2018 ISBN 978-3-030-03325-5See [28]

[56] Emmanuel Prouff Patrick Schaumont (editors) Cryptographic hardware and embed-ded systemsmdashCHES 2012mdash14th international workshop Leuven Belgium September9ndash12 2012 proceedings Lecture Notes in Computer Science 7428 Springer 2012ISBN 978-3-642-33026-1 See [17]

[57] Martin Roetteler Michael Naehrig Krysta M Svore Kristin E Lauter Quantumresource estimates for computing elliptic curve discrete logarithms in ASIACRYPT2017 [67] (2017) 241ndash270 URL httpseprintiacrorg2017598 Citationsin this document sect11 sect13

[58] The Sage Developers (editor) SageMath the Sage Mathematics Software System(Version 80) 2017 URL httpswwwsagemathorg Citations in this documentsect12 sect12 sect22 sect5

[59] John Schanck Announcement of NTRU-HRSS-KEM and NTRUEncrypt merger(2018) URL httpsgroupsgooglecomalistnistgovdmsgpqc-forumSrFO_vK3xbImSmjY0HZCgAJ Citations in this document sect73

[60] Arnold Schoumlnhage Schnelle Berechnung von Kettenbruchentwicklungen Acta Infor-matica 1 (1971) 139ndash144 ISSN 0001-5903 Citations in this document sect1 sect14sect14

[61] Joseph H Silverman Almost inverses and fast NTRU key creation NTRU Cryptosys-tems (Technical Note014) (1999) URL httpsassetssecurityinnovationcomstaticdownloadsNTRUresourcesNTRUTech014pdf Citations in this doc-ument sect73 sect73

Daniel J Bernstein and Bo-Yin Yang 35

[62] Damien Stehleacute Paul Zimmermann A binary recursive gcd algorithm in ANTS 2004[26] (2004) 411ndash425 URL httpspersoens-lyonfrdamienstehleBINARYhtml Citations in this document sect14 sect14 sect85 sectE sectE6 sectE6 sectE6 sectF2 sectF2sectF2 sectH sectH

[63] Josef Stein Computational problems associated with Racah algebra Journal ofComputational Physics 1 (1967) 397ndash405 Citations in this document sect1

[64] Simon Stevin Lrsquoarithmeacutetique Imprimerie de Christophle Plantin 1585 URLhttpwwwdwcknawnlpubbronnenSimon_Stevin-[II_B]_The_Principal_Works_of_Simon_Stevin_Mathematicspdf Citations in this document sect1

[65] Volker Strassen The computational complexity of continued fractions in SYM-SAC1981 [69] (1981) 51ndash67 see also newer version [66]

[66] Volker Strassen The computational complexity of continued fractions SIAM Journalon Computing 12 (1983) 1ndash27 see also older version [65] ISSN 0097-5397 Citationsin this document sect14 sect14 sect53 sect55

[67] Tsuyoshi Takagi Thomas Peyrin (editors) Advances in cryptologymdashASIACRYPT2017mdash23rd international conference on the theory and applications of cryptology andinformation security Hong Kong China December 3ndash7 2017 proceedings part IILecture Notes in Computer Science 10625 Springer 2017 ISBN 978-3-319-70696-2See [57]

[68] Emmanuel Thomeacute Subquadratic computation of vector generating polynomials andimprovement of the block Wiedemann algorithm Journal of Symbolic Computation33 (2002) 757ndash775 URL httpshalinriafrinria-00103417document Ci-tations in this document sect14

[69] Paul S Wang (editor) SYM-SAC rsquo81 proceedings of the 1981 ACM Symposium onSymbolic and Algebraic Computation Snowbird Utah August 5ndash7 1981 Associationfor Computing Machinery New York 1981 ISBN 0-89791-047-8 See [65]

[70] Moti Yung Yevgeniy Dodis Aggelos Kiayias Tal Malkin (editors) Public keycryptographymdash9th international conference on theory and practice in public-keycryptography New York NY USA April 24ndash26 2006 proceedings Lecture Notesin Computer Science 3958 Springer 2006 ISBN 978-3-540-33851-2 See [10]

A Proof of main gcd theorem for polynomialsWe give two separate proofs of the facts stated earlier relating gcd and modular reciprocalto a particular iterate of divstep in the polynomial case

bull The second proof consists of a review of the EuclidndashStevin gcd algorithm (Ap-pendix B) a description of exactly how the intermediate results in the EuclidndashStevinalgorithm relate to iterates of divstep (Appendix C) and finally a translationof gcdreciprocal results from the EuclidndashStevin context to the divstep context(Appendix D)

bull The first proof given in this appendix is shorter it works directly with divstepwithout a detour through the EuclidndashStevin labeling of intermediate results

36 Fast constant-time gcd computation and modular inversion

Theorem A1 Let k be a field Let d be a positive integer Let R0 R1 be elements of thepolynomial ring k[x] with degR0 = d gt degR1 Define f = xdR0(1x) g = xdminus1R1(1x)and (δn fn gn) = divstepn(1 f g) for n ge 0 Then

2 deg fn le 2dminus 1minus n+ δn for n ge 02 deg gn le 2dminus 1minus nminus δn for n ge 0

Proof Induct on n If n = 0 then (δn fn gn) = (1 f g) so 2 deg fn = 2 deg f le 2d =2dminus 1minus n+ δn and 2 deg gn = 2 deg g le 2(dminus 1) = 2dminus 1minus nminus δn

Assume from now on that n ge 1 By the inductive hypothesis 2 deg fnminus1 le 2dminusn+δnminus1and 2 deg gnminus1 le 2dminus nminus δnminus1

Case 1 δnminus1 le 0 Then

(δn fn gn) = (1 + δnminus1 fnminus1 (fnminus1(0)gnminus1 minus gnminus1(0)fnminus1)x)

Both fnminus1 and gnminus1 have degree at most (2d minus n minus δnminus1)2 so the polynomial xgn =fnminus1(0)gnminus1 minus gnminus1(0)fnminus1 also has degree at most (2d minus n minus δnminus1)2 so 2 deg gn le2dminus nminus δnminus1 minus 2 = 2dminus 1minus nminus δn Also 2 deg fn = 2 deg fnminus1 le 2dminus 1minus n+ δn

Case 2 δnminus1 gt 0 and gnminus1(0) = 0 Then

(δn fn gn) = (1 + δnminus1 fnminus1 fnminus1(0)gnminus1x)

so 2 deg gn = 2 deg gnminus1 minus 2 le 2dminus nminus δnminus1 minus 2 = 2dminus 1minus nminus δn As before 2 deg fn =2 deg fnminus1 le 2dminus 1minus n+ δn

Case 3 δnminus1 gt 0 and gnminus1(0) 6= 0 Then

(δn fn gn) = (1minus δnminus1 gnminus1 (gnminus1(0)fnminus1 minus fnminus1(0)gnminus1)x)

This time the polynomial xgn = gnminus1(0)fnminus1 minus fnminus1(0)gnminus1 also has degree at most(2d minus n + δnminus1)2 so 2 deg gn le 2d minus n + δnminus1 minus 2 = 2d minus 1 minus n minus δn Also 2 deg fn =2 deg gnminus1 le 2dminus nminus δnminus1 = 2dminus 1minus n+ δn

Theorem A2 In the situation of Theorem A1 define Tn = T (δn fn gn) and(un vnqn rn

)= Tnminus1 middot middot middot T0

Then xn(unrnminusvnqn) isin klowast for n ge 0 xnminus1un isin k[x] for n ge 1 xnminus1vn xnqn x

nrn isin k[x]for n ge 0 and

2 deg(xnminus1un) le nminus 3 + δn for n ge 12 deg(xnminus1vn) le nminus 1 + δn for n ge 0

2 deg(xnqn) le nminus 1minus δn for n ge 02 deg(xnrn) le n+ 1minus δn for n ge 0

Proof Each matrix Tn has determinant in (1x)klowast by construction so the product of nsuch matrices has determinant in (1xn)klowast In particular unrn minus vnqn isin (1xn)klowast

Induct on n For n = 0 we have δn = 1 un = 1 vn = 0 qn = 0 rn = 1 soxnminus1vn = 0 isin k[x] xnqn = 0 isin k[x] xnrn = 1 isin k[x] 2 deg(xnminus1vn) = minusinfin le nminus 1 + δn2 deg(xnqn) = minusinfin le nminus 1minus δn and 2 deg(xnrn) = 0 le n+ 1minus δn

Assume from now on that n ge 1 By Theorem 43 un vn isin (1xnminus1)k[x] andqn rn isin (1xn)k[x] so all that remains is to prove the degree bounds

Case 1 δnminus1 le 0 Then δn = 1 + δnminus1 un = unminus1 vn = vnminus1 qn = (fnminus1(0)qnminus1 minusgnminus1(0)unminus1)x and rn = (fnminus1(0)rnminus1 minus gnminus1(0)vnminus1)x

Daniel J Bernstein and Bo-Yin Yang 37

Note that n gt 1 since δ0 gt 0 By the inductive hypothesis 2 deg(xnminus2unminus1) le nminus 4 +δnminus1 so 2 deg(xnminus1un) le nminus 2 + δnminus1 = nminus 3 + δn Also 2 deg(xnminus2vnminus1) le nminus 2 + δnminus1so 2 deg(xnminus1vn) le n+ δnminus1 = nminus 1 + δn

For qn note that both xnminus1qnminus1 and xnminus1unminus1 have degree at most (nminus2minusδnminus1)2 sothe same is true for their k-linear combination xnqn Hence 2 deg(xnqn) le nminus 2minus δnminus1 =nminus 1minus δn Similarly 2 deg(xnrn) le nminus δnminus1 = n+ 1minus δn

Case 2 δnminus1 gt 0 and gnminus1(0) = 0 Then δn = 1 + δnminus1 un = unminus1 vn = vnminus1qn = fnminus1(0)qnminus1x and rn = fnminus1(0)rnminus1x

If n = 1 then un = u0 = 1 so 2 deg(xnminus1un) = 0 = n minus 3 + δn Otherwise2 deg(xnminus2unminus1) le nminus4 + δnminus1 by the inductive hypothesis so 2 deg(xnminus1un) le nminus3 + δnas above

Also 2 deg(xnminus1vn) le nminus 1 + δn as above 2 deg(xnqn) = 2 deg(xnminus1qnminus1) le nminus 2minusδnminus1 = nminus 1minus δn and 2 deg(xnrn) = 2 deg(xnminus1rnminus1) le nminus δnminus1 = n+ 1minus δn

Case 3 δnminus1 gt 0 and gnminus1(0) 6= 0 Then δn = 1 minus δnminus1 un = qnminus1 vn = rnminus1qn = (gnminus1(0)unminus1 minus fnminus1(0)qnminus1)x and rn = (gnminus1(0)vnminus1 minus fnminus1(0)rnminus1)x

If n = 1 then un = q0 = 0 so 2 deg(xnminus1un) = minusinfin le n minus 3 + δn Otherwise2 deg(xnminus1qnminus1) le nminus 2minus δnminus1 by the inductive hypothesis so 2 deg(xnminus1un) le nminus 2minusδnminus1 = nminus 3 + δn Similarly 2 deg(xnminus1rnminus1) le nminus δnminus1 by the inductive hypothesis so2 deg(xnminus1vn) le nminus δnminus1 = nminus 1 + δn

Finally for qn note that both xnminus1unminus1 and xnminus1qnminus1 have degree at most (n minus2 + δnminus1)2 so the same is true for their k-linear combination xnqn so 2 deg(xnqn) lenminus 2 + δnminus1 = nminus 1minus δn Similarly 2 deg(xnrn) le n+ δnminus1 = n+ 1minus δn

First proof of Theorem 62 Write ϕ = f2dminus1 By Theorem A1 2 degϕ le δ2dminus1 and2 deg g2dminus1 le minusδ2dminus1 Each fn is nonzero by construction so degϕ ge 0 so δ2dminus1 ge 0

By Theorem A2 x2dminus2u2dminus1 is a polynomial of degree at most dminus 2 + δ2dminus12 DefineC0 = xminusd+δ2dminus12u2dminus1(1x) then C0 is a polynomial of degree at most dminus 2 + δ2dminus12Similarly define C1 = xminusd+1+δ2dminus12v2dminus1(1x) and observe that C1 is a polynomial ofdegree at most dminus 1 + δ2dminus12

Multiply C0 by R0 = xdf(1x) multiply C1 by R1 = xdminus1g(1x) and add to obtain

C0R0 + C1R1 = xδ2dminus12(u2dminus1(1x)f(1x) + v2dminus1(1x)g(1x))= xδ2dminus12ϕ(1x)

by Theorem 42Define Γ as the monic polynomial xδ2dminus12ϕ(1x)ϕ(0) of degree δ2dminus12 We have

just shown that Γ isin R0k[x] + R1k[x] specifically Γ = (C0ϕ(0))R0 + (C1ϕ(0))R1 Tofinish the proof we will show that R0 isin Γk[x] This forces Γ = gcdR0 R1 = G which inturn implies degG = δ2dminus12 and V = C1ϕ(0)

Observe that δ2d = 1 + δ2dminus1 and f2d = ϕ the ldquoswappedrdquo case is impossible here sinceδ2dminus1 gt 0 implies g2dminus1 = 0 By Theorem A1 again 2 deg g2d le minus1minus δ2d = minus2minus δ2dminus1so g2d = 0

By Theorem 42 ϕ = f2d = u2df + v2dg and 0 = g2d = q2df + r2dg so x2dr2dϕ = ∆fwhere ∆ = x2d(u2dr2dminus v2dq2d) By Theorem A2 ∆ isin klowast and p = x2dr2d is a polynomialof degree at most (2d+ 1minus δ2d)2 = dminus δ2dminus12 The polynomials xdminusδ2dminus12p(1x) andxδ2dminus12ϕ(1x) = Γ have product ∆xdf(1x) = ∆R0 so Γ divides R0 as claimed

B Review of the EuclidndashStevin algorithmThis appendix reviews the EuclidndashStevin algorithm to compute polynomial gcd includingthe extended version of the algorithm that writes each remainder as a linear combinationof the two inputs

38 Fast constant-time gcd computation and modular inversion

Theorem B1 (the EuclidndashStevin algorithm) Let k be a field Let R0 R1 be ele-ments of the polynomial ring k[x] Recursively define a finite sequence of polynomialsR2 R3 Rr isin k[x] as follows if i ge 1 and Ri = 0 then r = i if i ge 1 and Ri 6= 0 thenRi+1 = Riminus1 mod Ri Define di = degRi and `i = Ri[di] Then

bull d1 gt d2 gt middot middot middot gt dr = minusinfin

bull `i 6= 0 if 1 le i le r minus 1

bull if `rminus1 6= 0 then gcdR0 R1 = Rrminus1`rminus1 and

bull if `rminus1 = 0 then R0 = R1 = gcdR0 R1 = 0 and r = 1

Proof If 1 le i le r minus 1 then Ri 6= 0 so `i = Ri[degRi] 6= 0 Also Ri+1 = Riminus1 mod Ri sodegRi+1 lt degRi ie di+1 lt di Hence d1 gt d2 gt middot middot middot gt dr and dr = degRr = deg 0 =minusinfin

If `rminus1 = 0 then r must be 1 so R1 = 0 (since Rr = 0) and R0 = 0 (since `0 = 0) sogcdR0 R1 = 0 Assume from now on that `rminus1 6= 0 The divisibility of Ri+1 minus Riminus1by Ri implies gcdRiminus1 Ri = gcdRi Ri+1 so gcdR0 R1 = gcdR1 R2 = middot middot middot =gcdRrminus1 Rr = gcdRrminus1 0 = Rrminus1`rminus1

Theorem B2 (the extended EuclidndashStevin algorithm) In the situation of Theorem B1define U0 V0 Ur Ur isin k[x] as follows U0 = 1 V0 = 0 U1 = 0 V1 = 1 Ui+1 = Uiminus1minusbRiminus1RicUi for i isin 1 r minus 1 Vi+1 = Viminus1 minus bRiminus1RicVi for i isin 1 r minus 1Then

bull Ri = UiR0 + ViR1 for i isin 0 r

bull UiVi+1 minus Ui+1Vi = (minus1)i for i isin 0 r minus 1

bull Vi+1Ri minus ViRi+1 = (minus1)iR0 for i isin 0 r minus 1 and

bull if d0 gt d1 then deg Vi+1 = d0 minus di for i isin 0 r minus 1

Proof U0R0 + V0R1 = 1R0 + 0R1 = R0 and U1R0 + V1R1 = 0R0 + 1R1 = R1 Fori isin 1 r minus 1 assume inductively that Riminus1 = Uiminus1R0+Viminus1R1 and Ri = UiR0+ViR1and writeQi = bRiminus1Ric Then Ri+1 = Riminus1 mod Ri = Riminus1minusQiRi = (Uiminus1minusQiUi)R0+(Viminus1 minusQiVi)R1 = Ui+1R0 + Vi+1R1

If i = 0 then UiVi+1minusUi+1Vi = 1 middot 1minus 0 middot 0 = 1 = (minus1)i For i isin 1 r minus 1 assumeinductively that Uiminus1Vi minus UiViminus1 = (minus1)iminus1 then UiVi+1 minus Ui+1Vi = Ui(Viminus1 minus QVi) minus(Uiminus1 minusQUi)Vi = minus(Uiminus1Vi minus UiViminus1) = (minus1)i

Multiply Ri = UiR0 + ViR1 by Vi+1 multiply Ri+1 = Ui+1R0 + Vi+1R1 by Ui+1 andsubtract to see that Vi+1Ri minus ViRi+1 = (minus1)iR0

Finally assume d0 gt d1 If i = 0 then deg Vi+1 = deg 1 = 0 = d0 minus di Fori isin 1 r minus 1 assume inductively that deg Vi = d0 minus diminus1 Then deg ViRi+1 lt d0since degRi+1 = di+1 lt diminus1 so deg Vi+1Ri = deg(ViRi+1 +(minus1)iR0) = d0 so deg Vi+1 =d0 minus di

C EuclidndashStevin results as iterates of divstepIf the EuclidndashStevin algorithm is written as a sequence of polynomial divisions andeach polynomial division is written as a sequence of coefficient-elimination steps theneach coefficient-elimination step corresponds to one application of divstep The exactcorrespondence is stated in Theorem C2 This generalizes the known BerlekampndashMasseyequivalence mentioned in Section 1

Theorem C1 is a warmup showing how a single polynomial division relates to asequence of division steps

Daniel J Bernstein and Bo-Yin Yang 39

Theorem C1 (polynomial division as a sequence of division steps) Let k be a fieldLet R0 R1 be elements of the polynomial ring k[x] Write d0 = degR0 and d1 = degR1Assume that d0 gt d1 ge 0 Then

divstep2d0minus2d1(1 xd0R0(1x) xd0minus1R1(1x))= (1 `d0minusd1minus1

0 xd1R1(1x) (`d0minusd1minus10 `1)d0minusd1+1xd1minus1R2(1x))

where `0 = R0[d0] `1 = R1[d1] and R2 = R0 mod R1

Proof Part 1 The first d0 minus d1 minus 1 iterations If δ isin 1 2 d0 minus d1 minus 1 thend0minusδ gt d1 so R1[d0minusδ] = 0 so the polynomial `δminus1

0 xd0minusδR1(1x) has constant coefficient0 The polynomial xd0R0(1x) has constant coefficient `0 Hence

divstep(δ xd0R0(1x) `δminus10 xd0minusδR1(1x)) = (1 + δ xd0R0(1x) `δ0xd0minusδminus1R1(1x))

by definition of divstep By induction

divstepd0minusd1minus1(1 xd0R0(1x) xd0minus1R1(1x))= (d0 minus d1 x

d0R0(1x) `d0minusd1minus10 xd1R1(1x))

= (d0 minus d1 xd0R0(1x) xd1Rprime1(1x))

where Rprime1 = `d0minusd1minus10 R1 Note that Rprime1[d1] = `prime1 where `prime1 = `d0minusd1minus1

0 `1

Part 2 The swap iteration and the last d0 minus d1 iterations Define

Zd0minusd1 = R0 mod xd0minusd1Rprime1Zd0minusd1+1 = R0 mod xd0minusd1minus1Rprime1

Z2d0minus2d1 = R0 mod Rprime1 = R0 mod R1 = R2

ThenZd0minusd1 = R0 minus (R0[d0]`prime1)xd0minusd1Rprime1

Zd0minusd1+1 = Zd0minusd1 minus (Zd0minusd1 [d0 minus 1]`prime1)xd0minusd1minus1Rprime1

Z2d0minus2d1 = Z2d0minus2d1minus1 minus (Z2d0minus2d1minus1[d1]`prime1)Rprime1

Substitute 1x for x to see that

xd0Zd0minusd1(1x) = xd0R0(1x)minus (R0[d0]`prime1)xd1Rprime1(1x)xd0minus1Zd0minusd1+1(1x) = xd0minus1Zd0minusd1(1x)minus (Zd0minusd1 [d0 minus 1]`prime1)xd1Rprime1(1x)

xd1Z2d0minus2d1(1x) = xd1Z2d0minus2d1minus1(1x)minus (Z2d0minus2d1minus1[d1]`prime1)xd1Rprime1(1x)

We will use these equations to show that the d0 minus d1 2d0 minus 2d1 iterates of divstepcompute `prime1xd0minus1Zd0minusd1 (`prime1)d0minusd1+1xd1minus1Z2d0minus2d1 respectively

The constant coefficients of xd0R0(1x) and xd1Rprime1(1x) are R0[d0] and `prime1 6= 0 respec-tively and `prime1xd0R0(1x)minusR0[d0]xd1Rprime1(1x) = `prime1x

d0Zd0minusd1(1x) so

divstep(d0 minus d1 xd0R0(1x) xd1Rprime1(1x))

= (1minus (d0 minus d1) xd1Rprime1(1x) `prime1xd0minus1Zd0minusd1(1x))

40 Fast constant-time gcd computation and modular inversion

The constant coefficients of xd1Rprime1(1x) and `prime1xd0minus1Zd0minusd1(1x) are `prime1 and `prime1Zd0minusd1 [d0minus1]respectively and

(`prime1)2xd0minus1Zd0minusd1(1x)minus `prime1Zd0minusd1 [d0 minus 1]xd1Rprime1(1x) = (`prime1)2xd0minus1Zd0minusd1+1(1x)

sodivstep(1minus (d0 minus d1) xd1Rprime1(1x) `prime1xd0minus1Zd0minusd1(1x))

= (2minus (d0 minus d1) xd1Rprime1(1x) (`prime1)2xd0minus2Zd0minusd1+1(1x))

Continue in this way to see that

divstepi(d0 minus d1 xd0R0(1x) xd1Rprime1(1x))

= (iminus (d0 minus d1) xd1Rprime1(1x) (`prime1)ixd0minusiZd0minusd1+iminus1(1x))

for 1 le i le d0 minus d1 + 1 In particular

divstep2d0minus2d1(1 xd0R0(1x) xd0minus1R1(1x))= divstepd0minusd1+1(d0 minus d1 x

d0R0(1x) xd1Rprime1(1x))= (1 xd1Rprime1(1x) (`prime1)d0minusd1+1xd1minus1R2(1x))= (1 `d0minusd1minus1

0 xd1R1(1x) (`d0minusd1minus10 `1)d0minusd1+1xd1minus1R2(1x))

as claimed

Theorem C2 In the situation of Theorem B2 assume that d0 gt d1 Define f =xd0R0(1x) and g = xd0minus1R1(1x) Then f g isin k[x] Define (δn fn gn) = divstepn(1 f g)and Tn = T (δn fn gn)

Part A If i isin 0 1 r minus 2 and 2d0 minus 2di le n lt 2d0 minus di minus di+1 then

δn = 1 + nminus (2d0 minus 2di) gt 0fn = anx

diRi(1x)gn = bnx

2d0minusdiminusnminus1Ri+1(1x)

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdiUi(1x) an(1x)d0minusdiminus1Vi(1x)bn(1x)nminus(d0minusdi)+1Ui+1(1x) bn(1x)nminus(d0minusdi)Vi+1(1x)

)for some an bn isin klowast

Part B If i isin 0 1 r minus 2 and 2d0 minus di minus di+1 le n lt 2d0 minus 2di+1 then

δn = 1 + nminus (2d0 minus 2di+1) le 0fn = anx

di+1Ri+1(1x)gn = bnx

2d0minusdi+1minusnminus1Zn(1x)

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdi+1Ui+1(1x) an(1x)d0minusdi+1minus1Vi+1(1x)bn(1x)nminus(d0minusdi+1)+1Xn(1x) bn(1x)nminus(d0minusdi+1)Yn(1x)

)for some an bn isin klowast where

Zn = Ri mod x2d0minus2di+1minusnRi+1

Xn = Ui minuslfloorRix

2d0minus2di+1minusnRi+1rfloorx2d0minus2di+1minusnUi+1

Yn = Vi minuslfloorRix

2d0minus2di+1minusnRi+1rfloorx2d0minus2di+1minusnVi+1

Daniel J Bernstein and Bo-Yin Yang 41

Part C If n ge 2d0 minus 2drminus1 then

δn = 1 + nminus (2d0 minus 2drminus1) gt 0fn = anx

drminus1Rrminus1(1x)gn = 0

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdrminus1Urminus1(1x) an(1x)d0minusdrminus1minus1Vrminus1(1x)bn(1x)nminus(d0minusdrminus1)+1Ur(1x) bn(1x)nminus(d0minusdrminus1)Vr(1x)

)for some an bn isin klowast

The proof also gives recursive formulas for an bn namely (a0 b0) = (1 1) forthe base case (an bn) = (bnminus1 gnminus1(0)anminus1) for the swapped case and (an bn) =(anminus1 fnminus1(0)bnminus1) for the unswapped case One can if desired augment divstep totrack an and bn these are the denominators eliminated in a ldquofraction-freerdquo computation

Proof By hypothesis degR0 = d0 gt d1 = degR1 Hence d0 is a nonnegative integer andf = R0[d0] +R0[d0 minus 1]x+ middot middot middot+R0[0]xd0 isin k[x] Similarly g = R1[d0 minus 1] +R1[d0 minus 2]x+middot middot middot+R1[0]xd0minus1 isin k[x] this is also true for d0 = 0 with R1 = 0 and g = 0

We induct on n and split the analysis of n into six cases There are three differentformulas (A B C) applicable to different ranges of n and within each range we considerthe first n separately For brevity we write Ri = Ri(1x) U i = Ui(1x) V i = Vi(1x)Xn = Xn(1x) Y n = Yn(1x) Zn = Zn(1x)

Case A1 n = 2d0 minus 2di for some i isin 0 1 r minus 2If n = 0 then i = 0 By assumption δn = 1 as claimed fn = f = xd0R0 so fn = anx

d0R0as claimed where an = 1 gn = g = xd0minus1R1 so gn = bnx

d0minus1R1 as claimed where bn = 1Also Ui = 1 Ui+1 = 0 Vi = 0 Vi+1 = 1 and d0 minus di = 0 so(

an(1x)d0minusdiU i an(1x)d0minusdiminus1V ibn(1x)d0minusdi+1U i+1 bn(1x)d0minusdiV i+1

)=(

1 00 1

)= Tnminus1 middot middot middot T0

as claimedOtherwise n gt 0 so i gt 0 Now 2d0minusdiminus1minusdi le nminus 1 lt 2d0minus 2di so by the inductive

hypothesis δnminus1 = 1 + (nminus 1)minus (2d0 minus 2di) = 0 fnminus1 = anminus1xdiRi for some anminus1 isin klowast

gnminus1 = bnminus1xdiZnminus1 for some bnminus1 isin klowast where Znminus1 = Riminus1 mod xRi and

Tnminus2 middot middot middot T0 =(

anminus1(1x)d0minusdiU i anminus1(1x)d0minusdiminus1V ibnminus1(1x)d0minusdiXnminus1 bnminus1(1x)d0minusdiminus1Y nminus1

)where Xn = Uiminus1 minus bRiminus1xRicxUi and Yn = Viminus1 minus bRiminus1xRicxVi

Note that degZnminus1 le di Observe that

Ri+1 = Riminus1 mod Ri = Znminus1 minus (Znminus1[di]`i)RiUi+1 = Xnminus1 minus (Znminus1[di]`i)UiVi+1 = Ynminus1 minus (Znminus1[di]`i)Vi

The point is that subtracting (Znminus1[di]`i)Ri from Znminus1 eliminates the coefficient of xdi

and produces a polynomial of degree at most diminus 1 namely Znminus1 mod Ri = Riminus1 mod RiStarting from Ri+1 = Znminus1 minus (Znminus1[di]`i)Ri substitute 1x for x and multiply by

anminus1bnminus1`ixdi to see that

anminus1bnminus1`ixdiRi+1 = anminus1`ignminus1 minus bnminus1Znminus1[di]fnminus1

= fnminus1(0)gnminus1 minus gnminus1(0)fnminus1

42 Fast constant-time gcd computation and modular inversion

Now(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1)

= (1 + δnminus1 fnminus1 (fnminus1(0)gnminus1 minus gnminus1(0)fnminus1)x)

Hence δn = 1 = 1 + nminus (2d0 minus 2di) gt 0 as claimed fn = anxdiRi where an = anminus1 isin klowast

as claimed and gn = bnxdiminus1Ri+1 where bn = anminus1bnminus1`i isin klowast as claimed

Finally U i+1 = Xnminus1 minus (Znminus1[di]`i)U i and V i+1 = Y nminus1 minus (Znminus1[di]`i)V i Substi-tute these into the matrix(

an(1x)d0minusdiU i an(1x)d0minusdiminus1V ibn(1x)d0minusdi+1U i+1 bn(1x)d0minusdiV i+1

)

along with an = anminus1 and bn = anminus1bnminus1`i to see that this matrix is

Tnminus1 =(

1 0minusbnminus1Znminus1[di]x anminus1`ix

)times Tnminus2 middot middot middot T0 shown above

Case A2 2d0 minus 2di lt n lt 2d0 minus di minus di+1 for some i isin 0 1 r minus 2By the inductive hypothesis δnminus1 = n minus (2d0 minus 2di) gt 0 fnminus1 = anminus1x

diRi whereanminus1 isin klowast gnminus1 = bnminus1x

2d0minusdiminusnRi+1 where bnminus1 isin klowast and

Tnminus2 middot middot middot T0 =(

anminus1(1x)d0minusdiU i anminus1(1x)d0minusdiminus1V ibnminus1(1x)nminus(d0minusdi)U i+1 bnminus1(1x)nminus(d0minusdi)minus1V i+1

)

Note that gnminus1(0) = bnminus1Ri+1[2d0 minus di minus n] = 0 since 2d0 minus di minus n gt di+1 = degRi+1Thus

(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1) = (1 + δnminus1 fnminus1 fnminus1(0)gnminus1x)

Hence δn = 1 + n minus (2d0 minus 2di) gt 0 as claimed fn = anxdiRi where an = anminus1 isin klowast

as claimed and gn = bnx2d0minusdiminusnminus1Ri+1 where bn = fnminus1(0)bnminus1 = anminus1bnminus1`i isin klowast as

claimedAlso Tnminus1 =

(1 00 anminus1`ix

) Multiply by Tnminus2 middot middot middot T0 above to see that

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdiU i an(1x)d0minusdiminus1V ibn(1x)nminus(d0minusdi)+1U i+1 bn`i(1x)nminus(d0minusdi)V i+1

)as claimed

Case B1 n = 2d0 minus di minus di+1 for some i isin 0 1 r minus 2Note that 2d0 minus 2di le nminus 1 lt 2d0 minus di minus di+1 By the inductive hypothesis δnminus1 =

diminusdi+1 gt 0 fnminus1 = anminus1xdiRi where anminus1 isin klowast gnminus1 = bnminus1x

di+1Ri+1 where bnminus1 isin klowastand

Tnminus2 middot middot middot T0 =(

anminus1(1x)d0minusdiU i anminus1(1x)d0minusdiminus1V ibnminus1(1x)d0minusdi+1U i+1 bnminus1`i(1x)d0minusdi+1minus1V i+1

)

Observe thatZn = Ri minus (`i`i+1)xdiminusdi+1Ri+1

Xn = Ui minus (`i`i+1)xdiminusdi+1Ui+1

Yn = Vi minus (`i`i+1)xdiminusdi+1Vi+1

Substitute 1x for x in the Zn equation and multiply by anminus1bnminus1`i+1xdi to see that

anminus1bnminus1`i+1xdiZn = bnminus1`i+1fnminus1 minus anminus1`ignminus1

= gnminus1(0)fnminus1 minus fnminus1(0)gnminus1

Daniel J Bernstein and Bo-Yin Yang 43

Also note that gnminus1(0) = bnminus1`i+1 6= 0 so

(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1)= (1minus δnminus1 gnminus1 (gnminus1(0)fnminus1 minus fnminus1(0)gnminus1)x)

Hence δn = 1 minus (di minus di+1) le 0 as claimed fn = anxdi+1Ri+1 where an = bnminus1 isin klowast as

claimed and gn = bnxdiminus1Zn where bn = anminus1bnminus1`i+1 isin klowast as claimed

Finally substitute Xn = U i minus (`i`i+1)(1x)diminusdi+1U i+1 and substitute Y n = V i minus(`i`i+1)(1x)diminusdi+1V i+1 into the matrix(

an(1x)d0minusdi+1U i+1 an(1x)d0minusdi+1minus1V i+1bn(1x)d0minusdi+1Xn bn(1x)d0minusdiY n

)

along with an = bnminus1 and bn = anminus1bnminus1`i+1 to see that this matrix is Tnminus1 =(0 1

bnminus1`i+1x minusanminus1`ix

)times Tnminus2 middot middot middot T0 shown above

Case B2 2d0 minus di minus di+1 lt n lt 2d0 minus 2di+1 for some i isin 0 1 r minus 2Now 2d0minusdiminusdi+1 le nminus1 lt 2d0minus2di+1 By the inductive hypothesis δnminus1 = nminus(2d0minus

2di+1) le 0 fnminus1 = anminus1xdi+1Ri+1 for some anminus1 isin klowast gnminus1 = bnminus1x

2d0minusdi+1minusnZnminus1 forsome bnminus1 isin klowast where Znminus1 = Ri mod x2d0minus2di+1minusn+1Ri+1 and

Tnminus2 middot middot middot T0 =(

anminus1(1x)d0minusdi+1U i+1 anminus1(1x)d0minusdi+1minus1V i+1bnminus1(1x)nminus(d0minusdi+1)Xnminus1 bnminus1(1x)nminus(d0minusdi+1)minus1Y nminus1

)where Xnminus1 = Ui minus

lfloorRix

2d0minus2di+1minusn+1Ri+1rfloorx2d0minus2di+1minusn+1Ui+1 and Ynminus1 = Vi minuslfloor

Rix2d0minus2di+1minusn+1Ri+1

rfloorx2d0minus2di+1minusn+1Vi+1

Observe that

Zn = Ri mod x2d0minus2di+1minusnRi+1

= Znminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnRi+1

Xn = Xnminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnUi+1

Yn = Ynminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnVi+1

The point is that degZnminus1 le 2d0 minus di+1 minus n and subtracting

(Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnRi+1

from Znminus1 eliminates the coefficient of x2d0minusdi+1minusnStarting from

Zn = Znminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnRi+1

substitute 1x for x and multiply by anminus1bnminus1`i+1x2d0minusdi+1minusn to see that

anminus1bnminus1`i+1x2d0minusdi+1minusnZn = anminus1`i+1gnminus1 minus bnminus1Znminus1[2d0 minus di+1 minus n]fnminus1

= fnminus1(0)gnminus1 minus gnminus1(0)fnminus1

Now(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1)

= (1 + δnminus1 fnminus1 (fnminus1(0)gnminus1 minus gnminus1(0)fnminus1)x)

Hence δn = 1 +nminus (2d0minus 2di+1) le 0 as claimed fn = anxdi+1Ri+1 where an = anminus1 isin klowast

as claimed and gn = bnx2d0minusdi+1minusnminus1Zn where bn = anminus1bnminus1`i+1 isin klowast as claimed

44 Fast constant-time gcd computation and modular inversion

Finally substitute

Xn = Xnminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)(1x)2d0minus2di+1minusnU i+1

andY n = Y nminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)(1x)2d0minus2di+1minusnV i+1

into the matrix (an(1x)d0minusdi+1U i+1 an(1x)d0minusdi+1minus1V i+1

bn(1x)nminus(d0minusdi+1)+1Xn bn(1x)nminus(d0minusdi+1)Y n

)

along with an = anminus1 and bn = anminus1bnminus1`i+1 to see that this matrix is Tnminus1 =(1 0

minusbnminus1Znminus1[2d0 minus di+1 minus n]x anminus1`i+1x

)times Tnminus2 middot middot middot T0 shown above

Case C1 n = 2d0 minus 2drminus1 This is essentially the same as Case A1 except that i isreplaced with r minus 1 and gn ends up as 0 For completeness we spell out the details

If n = 0 then r = 1 and R1 = 0 so g = 0 By assumption δn = 1 as claimedfn = f = xd0R0 so fn = anx

d0R0 as claimed where an = 1 gn = g = 0 as claimed Definebn = 1 Now Urminus1 = 1 Ur = 0 Vrminus1 = 0 Vr = 1 and d0 minus drminus1 = 0 so(

an(1x)d0minusdrminus1Urminus1 an(1x)d0minusdrminus1minus1V rminus1bn(1x)d0minusdrminus1+1Ur bn(1x)d0minusdrminus1V r

)=(

1 00 1

)= Tnminus1 middot middot middot T0

as claimedOtherwise n gt 0 so r ge 2 Now 2d0 minus drminus2 minus drminus1 le n minus 1 lt 2d0 minus 2drminus1 so

by the inductive hypothesis (for i = r minus 2) δnminus1 = 1 + (n minus 1) minus (2d0 minus 2drminus1) = 0fnminus1 = anminus1x

drminus1Rrminus1 for some anminus1 isin klowast gnminus1 = bnminus1xdrminus1Znminus1 for some bnminus1 isin klowast

where Znminus1 = Rrminus2 mod xRrminus1 and

Tnminus2 middot middot middot T0 =(anminus1(1x)d0minusdrminus1Urminus1 anminus1(1x)d0minusdrminus1minus1V rminus1bnminus1(1x)d0minusdrminus1Xnminus1 bnminus1(1x)d0minusdrminus1minus1Y nminus1

)where Xn = Urminus2 minus bRrminus2xRrminus1cxUrminus1 and Yn = Vrminus2 minus bRrminus2xRrminus1cxVrminus1

Observe that

0 = Rr = Rrminus2 mod Rrminus1 = Znminus1 minus (Znminus1[drminus1]`rminus1)Rrminus1

Ur = Xnminus1 minus (Znminus1[drminus1]`rminus1)Urminus1

Vr = Ynminus1 minus (Znminus1[drminus1]`rminus1)Vrminus1

Starting from Znminus1 minus (Znminus1[drminus1]`rminus1)Rrminus1 = 0 substitute 1x for x and multiplyby anminus1bnminus1`rminus1x

drminus1 to see that

anminus1`rminus1gnminus1 minus bnminus1Znminus1[drminus1]fnminus1 = 0

ie fnminus1(0)gnminus1 minus gnminus1(0)fnminus1 = 0 Now

(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1) = (1 + δnminus1 fnminus1 0)

Hence δn = 1 = 1 + n minus (2d0 minus 2drminus1) gt 0 as claimed fn = anxdrminus1Rrminus1 where

an = anminus1 isin klowast as claimed and gn = 0 as claimedDefine bn = anminus1bnminus1`rminus1 isin klowast and substitute Ur = Xnminus1 minus (Znminus1[drminus1]`rminus1)Urminus1

and V r = Y nminus1 minus (Znminus1[drminus1]`rminus1)V rminus1 into the matrix(an(1x)d0minusdrminus1Urminus1(1x) an(1x)d0minusdrminus1minus1Vrminus1(1x)bn(1x)d0minusdrminus1+1Ur(1x) bn(1x)d0minusdrminus1Vr(1x)

)

Daniel J Bernstein and Bo-Yin Yang 45

to see that this matrix is Tnminus1 =(

1 0minusbnminus1Znminus1[drminus1]x anminus1`rminus1x

)times Tnminus2 middot middot middot T0

shown above

Case C2 n gt 2d0 minus 2drminus1By the inductive hypothesis δnminus1 = n minus (2d0 minus 2drminus1) fnminus1 = anminus1x

drminus1Rrminus1 forsome anminus1 isin klowast gnminus1 = 0 and

Tnminus2 middot middot middot T0 =(anminus1(1x)d0minusdrminus1Urminus1(1x) anminus1(1x)d0minusdrminus1minus1Vrminus1(1x)bnminus1(1x)nminus(d0minusdrminus1)Ur(1x) bnminus1(1x)nminus(d0minusdrminus1)minus1Vr(1x)

)for some bnminus1 isin klowast Hence δn = 1 + δn = 1 + n minus (2d0 minus 2drminus1) gt 0 fn = fnminus1 =

anxdrminus1Rrminus1 where an = anminus1 and gn = 0 Also Tnminus1 =

(1 00 anminus1`rminus1x

)so

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdrminus1Urminus1(1x) an(1x)d0minusdrminus1minus1Vrminus1(1x)bn(1x)nminus(d0minusdrminus1)+1Ur(1x) bn(1x)nminus(d0minusdrminus1)Vr(1x)

)where bn = anminus1bnminus1`rminus1 isin klowast

D Alternative proof of main gcd theorem for polynomialsThis appendix gives another proof of Theorem 62 using the relationship in Theorem C2between the EuclidndashStevin algorithm and iterates of divstep

Second proof of Theorem 62 Define R2 R3 Rr and di and `i as in Theorem B1 anddefine Ui and Vi as in Theorem B2 Then d0 = d gt d1 and all of the hypotheses ofTheorems B1 B2 and C2 are satisfied

By Theorem B1 G = gcdR0 R1 = Rrminus1`rminus1 (since R0 6= 0) so degG = drminus1Also by Theorem B2

Urminus1R0 + Vrminus1R1 = Rrminus1 = `rminus1G

so (Vrminus1`rminus1)R1 equiv G (mod R0) and deg Vrminus1 = d0 minus drminus2 lt d0 minus drminus1 = dminus degG soV = Vrminus1`rminus1

Case 1 drminus1 ge 1 Part C of Theorem C2 applies to n = 2dminus 1 since n = 2d0 minus 1 ge2d0 minus 2drminus1 In particular δn = 1 + n minus (2d0 minus 2drminus1) = 2drminus1 = 2 degG f2dminus1 =a2dminus1x

drminus1Rrminus1(1x) and v2dminus1 = a2dminus1(1x)d0minusdrminus1minus1Vrminus1(1x)Case 2 drminus1 = 0 Then r ge 2 and drminus2 ge 1 Part B of Theorem C2 applies to

i = r minus 2 and n = 2dminus 1 = 2d0 minus 1 since 2d0 minus di minus di+1 = 2d0 minus drminus2 le 2d0 minus 1 = n lt2d0 = 2d0 minus 2di+1 In particular δn = 1 + n minus (2d0 minus 2di+1) = 2drminus1 = 2 degG againf2dminus1 = a2dminus1x

drminus1Rrminus1(1x) and again v2dminus1 = a2dminus1(1x)d0minusdrminus1minus1Vrminus1(1x)In both cases f2dminus1(0) = a2dminus1`rminus1 so xdegGf2dminus1(1x)f2dminus1(0) = Rrminus1`rminus1 = G

and xminusd+1+degGv2dminus1(1x)f2dminus1(0) = Vrminus1`rminus1 = V

E A gcd algorithm for integersThis appendix presents a variable-time algorithm to compute the gcd of two integersThis appendix also explains how a modification to the algorithm produces the StehleacutendashZimmermann algorithm [62]

Theorem E1 Let e be a positive integer Let f be an element of Zlowast2 Let g be anelement of 2eZlowast2 Then there is a unique element q isin

1 3 5 2e+1 minus 1

such that

qg2e minus f isin 2e+1Z2

46 Fast constant-time gcd computation and modular inversion

We define f div2 g = q2e isin Q and we define f mod2 g = f minus qg2e isin 2e+1Z2 Thenf = (f div2 g)g + (f mod2 g) Note that e = ord2 g so we are not creating any ambiguityby omitting e from the f div2 g and f mod2 g notation

Proof First note that the quotient f(g2e) is odd Define q = (f(g2e)) mod 2e+1 thenq isin

1 3 5 2e+1 and qg2e minus f isin 2e+1Z2 by construction To see uniqueness note

that if qprime isin

1 3 5 2e+1 minus 1also satisfies qprimeg2e minus f isin 2e+1Z2 then (q minus qprime)g2e isin

2e+1Z2 so q minus qprime isin 2e+1Z2 since g2e isin Zlowast2 so q mod 2e+1 = qprime mod 2e+1 so q = qprime

Theorem E2 Let R0 be a nonzero element of Z2 Let R1 be an element of 2Z2Recursively define e0 e1 e2 e3 isin Z and R2 R3 isin 2Z2 as follows

bull If i ge 0 and Ri 6= 0 then ei = ord2 Ri

bull If i ge 1 and Ri 6= 0 then Ri+1 = minus((Riminus12eiminus1)mod2 Ri)2ei isin 2Z2

bull If i ge 1 and Ri = 0 then ei Ri+1 ei+1 Ri+2 ei+2 are undefined

Then (Ri2ei

Ri+1

)=(

0 12ei

minus12ei ((Riminus12eiminus1)div2 Ri)2ei

)(Riminus12eiminus1

Ri

)for each i ge 1 where Ri+1 is defined

Proof We have (Riminus12eiminus1)mod2 Ri = Riminus12eiminus1 minus ((Riminus12eiminus1)div2 Ri)Ri Negateand divide by 2ei to obtain the matrix equation

Theorem E3 Let R0 be an odd element of Z Let R1 be an element of 2Z Definee0 e1 e2 e3 and R2 R3 as in Theorem E2 Then Ri isin 2Z for each i ge 1 whereRi is defined Furthermore if t ge 0 and Rt+1 = 0 then |Rt2et | = gcdR0 R1

Proof Write g = gcdR0 R1 isin Z Then g is odd since R0 is oddWe first show by induction on i that Ri isin gZ for each i ge 0 For i isin 0 1 this follows

from the definition of g For i ge 2 we have Riminus2 isin gZ and Riminus1 isin gZ by the inductivehypothesis Now (Riminus22eiminus2)div2 Riminus1 has the form q2eiminus1 for q isin Z by definition ofdiv2 and 22eiminus1+eiminus2Ri = 2eiminus2qRiminus1 minus 2eiminus1Riminus2 by definition of Ri The right side ofthis equation is in gZ so 2middotmiddotmiddotRi is in gZ This implies first that Ri isin Q but also Ri isin Z2so Ri isin Z This also implies that Ri isin gZ since g is odd

By construction Ri isin 2Z2 for each i ge 1 and Ri isin Z for each i ge 0 so Ri isin 2Z foreach i ge 1

Finally assume that t ge 0 and Rt+1 = 0 Then all of R0 R1 Rt are nonzero Writegprime = |Rt2et | Then 2etgprime = |Rt| isin gZ and g is odd so gprime isin gZ

The equation 2eiminus1Riminus2 = 2eiminus2qRiminus1minus22eiminus1+eiminus2Ri shows that if Riminus1 Ri isin gprimeZ thenRiminus2 isin gprimeZ since gprime is odd By construction Rt Rt+1 isin gprimeZ so Rtminus1 Rtminus2 R1 R0 isingprimeZ Hence g = gcdR0 R1 isin gprimeZ Both g and gprime are positive so gprime = g

E4 The gcd algorithm Here is the algorithm featured in this appendixGiven an odd integer R0 and an even integer R1 compute the even integers R2 R3

defined above In other words for each i ge 1 where Ri 6= 0 compute ei = ord2 Rifind q isin

1 3 5 2ei+1 minus 1

such that qRi2ei minus Riminus12eiminus1 isin 2ei+1Z and compute

Ri+1 = (qRi2ei minusRiminus12eiminus1)2ei isin 2Z If Rt+1 = 0 for some t ge 0 output |Rt2et | andstop

We will show in Theorem F26 that this algorithm terminates The output is gcdR0 R1by Theorem E3 This algorithm is closely related to iterating divstep and we will usethis relationship to analyze iterates of divstep see Appendix G

More generally if R0 is an odd integer and R1 is any integer one can computegcdR0 R1 by using the above algorithm to compute gcdR0 2R1 Even more generally

Daniel J Bernstein and Bo-Yin Yang 47

one can compute the gcd of any two integers by first handling the case gcd0 0 = 0 thenhandling powers of 2 in both inputs then applying the odd-input algorithmE5 A non-functioning variant Imagine replacing the subtractions above withadditions find q isin

1 3 5 2ei+1 minus 1

such that qRi2ei +Riminus12eiminus1 isin 2ei+1Z and

compute Ri+1 = (qRi2ei + Riminus12eiminus1)2ei isin 2Z If R0 and R1 are positive then allRi are positive so the algorithm does not terminate This non-terminating algorithm isrelated to posdivstep (see Section 84) in the same way that the algorithm above is relatedto divstep

Daireaux Maume-Deschamps and Valleacutee in [31 Section 6] mentioned this algorithm asa ldquonon-centered LSB algorithmrdquo said that one can easily add a ldquosupplementary stoppingconditionrdquo to ensure termination and dismissed the resulting algorithm as being ldquocertainlyslower than the centered versionrdquoE6 A centered variant the StehleacutendashZimmermann algorithm Imagine usingcentered remainders 1minus 2e minus3minus1 1 3 2e minus 1 modulo 2e+1 rather than uncen-tered remainders

1 3 5 2e+1 minus 1

The above gcd algorithm then turns into the

StehleacutendashZimmermann gcd algorithm from [62]Specifically Stehleacute and Zimmermann begin with integers a b having ord2 a lt ord2 b

If b 6= 0 then ldquogeneralized binary divisionrdquo in [62 Lemma 1] defines q as the unique oddinteger with |q| lt 2ord2 bminusord2 a such that r = a+ qb2ord2 bminusord2 a is divisible by 2ord2 b+1The gcd algorithm in [62 Algorithm Elementary-GB] repeatedly sets (a b)larr (b r) Whenb = 0 the algorithm outputs the odd part of a

It is equivalent to set (a b)larr (b2ord2 b r2ord2 b) since this simply divides all subse-quent results by 2ord2 b In other words starting from an odd a0 and an even b0 recursivelydefine an odd ai+1 and an even bi+1 by the following formulas stopping when bi = 0bull ai+1 = bi2e where e = ord2 bi

bull bi+1 = (ai + qbi2e)2e isin 2Z for a unique q isin 1minus 2e minus3minus1 1 3 2e minus 1Now relabel a0 b0 b1 b2 b3 as R0 R1 R2 R3 R4 to obtain the following recursivedefinition Ri+1 = (qRi2ei +Riminus12eiminus1)2ei isin 2Z and thus(

Ri2ei

Ri+1

)=(

0 12ei

12ei q22ei

)(Riminus12eiminus1

Ri

)

for a unique q isin 1minus 2ei minus3minus1 1 3 2ei minus 1To summarize starting from the StehleacutendashZimmermann algorithm one can obtain the

gcd algorithm featured in this appendix by making the following three changes Firstdivide out 2ei instead of allowing powers of 2 to accumulate Second add Riminus12eiminus1

instead of subtracting it this does not affect the number of iterations for centered q Thirdswitch from centered q to uncentered q

Intuitively centered remainders are smaller than uncentered remainders and it iswell known that centered remainders improve the worst-case performance of Euclidrsquosalgorithm so one might expect that the StehleacutendashZimmermann algorithm performs betterthan our uncentered variant However as noted earlier our analysis produces the oppositeconclusion See Appendix F

F Performance of the gcd algorithm for integersThe possible matrices in Theorem E2 are(

0 12minus12 14

)

(0 12minus12 34

)for ei = 1(

0 14minus14 116

)

(0 14minus14 316

)

(0 14minus14 516

)

(0 14minus14 716

)for ei = 2

48 Fast constant-time gcd computation and modular inversion

etc In this appendix we show that a product of these matrices shrinks the input by a factorexponential in the weight

sumei This implies that

sumei in the algorithm of Appendix E is

O(b) where b is the number of input bits

F1 Lower bounds It is easy to see that each of these matrices has two eigenvaluesof absolute value 12ei This does not imply however that a weight-w product of thesematrices shrinks the input by a factor 12w Concretely the weight-7 product

147

(328 minus380minus158 233

)=(

0 14minus14 116

)(0 12minus12 34

)2( 0 12minus12 14

)(0 12minus12 34

)2

of 6 matrices has spectral radius (maximum absolute value of eigenvalues)

(561 +radic

249185)215 = 003235425826 = 061253803919 7 = 124949900586

The (w7)th power of this matrix is expressed as a weight-w product and has spectralradius 1207071286552w Consequently this proof strategy cannot obtain an O constantbetter than 7(15minus log2(561 +

radic249185)) = 1414169815 We prove a bound twice the

weight on the number of divstep iterations (see Appendix G) so this proof strategy cannotobtain an O constant better than 14(15 minus log2(561 +

radic249185)) = 2828339631 for

the number of divstep iterationsRotating the list of 6 matrices shown above gives 6 weight-7 products with the same

spectral radius In a small search we did not find any weight-w matrices with spectralradius above 061253803919 w Our upper bounds (see below) are 06181640625w for alllarge w

For comparison the non-terminating variant in Appendix E5 allows the matrix(0 12

12 34

) which has spectral radius 1 with eigenvector

(12

) The StehleacutendashZimmermann

algorithm reviewed in Appendix E6 allows the matrix(

0 1212 14

) which has spectral

radius (1 +radic

17)8 = 06403882032 so this proof strategy cannot obtain an O constantbetter than 2(log2(

radic17minus 1)minus 1) = 3110510062 for the number of cdivstep iterations

F2 A warning about worst-case inputs DefineM as the matrix 4minus7(

328 minus380minus158 233

)shown above Then Mminus3 =

(60321097 11336806047137246 88663112

) Applying our gcd algorithm to

inputs (60321097 47137246) applies a series of 18 matrices of total weight 21 in the

notation of Theorem E2 the matrix(

0 12minus12 34

)twice then the matrix

(0 12minus12 14

)

then the matrix(

0 12minus12 34

)twice then the weight-2 matrix

(0 14minus14 116

) all of

these matrices again and all of these matrices a third timeOne might think that these are worst-case inputs to our algorithm analogous to

a standard matrix construction of Fibonacci numbers as worst-case inputs to Euclidrsquosalgorithm We emphasize that these are not worst-case inputs the inputs (22293minus128330)are much smaller and nevertheless require 19 steps of total weight 23 We now explainwhy the analogy fails

The matrix M3 has an eigenvector (1minus053 ) with eigenvalue 1221middot0707 =12148 and an eigenvector (1 078 ) with eigenvalue 1221middot129 = 12271 so ithas spectral radius around 12148 Our proof strategy thus cannot guarantee a gainof more than 148 bits from weight 21 What we actually show below is that weight21 is guaranteed to gain more than 1397 bits (The quantity α21 in Figure F14 is133569231 = 2minus13972)

Daniel J Bernstein and Bo-Yin Yang 49

The matrix Mminus3 has spectral radius 2271 It is not surprising that each column ofMminus3 is close to 27 bits in size the only way to avoid this would be for the other eigenvectorto be pointing in almost exactly the same direction as (1 0) or (0 1) For our inputs takenfrom the first column of Mminus3 the algorithm gains nearly 27 bits from weight 21 almostdouble the guaranteed gain In short these are good inputs not bad inputs

Now replace M with the matrix F =(

0 11 minus1

) which reduces worst-case inputs to

Euclidrsquos algorithm The eigenvalues of F are (radic

5minus 1)2 = 12069 and minus(radic

5 + 1)2 =minus2069 The entries of powers of Fminus1 are Fibonacci numbers again with sizes wellpredicted by the spectral radius of Fminus1 namely 2069 However the spectral radius ofF is larger than 1 The reason that this large eigenvalue does not spoil the terminationof Euclidrsquos algorithm is that Euclidrsquos algorithm inspects sizes and chooses quotients todecrease sizes at each step

Stehleacute and Zimmermann in [62 Section 62] measure the performance of their algorithm

for ldquoworst caserdquo inputs18 obtained from negative powers of the matrix(

0 1212 14

) The

statement that these inputs are ldquoworst caserdquo is not justified On the contrary if these inputshave b bits then they take (073690 + o(1))b steps of the StehleacutendashZimmermann algorithmand thus (14738 + o(1))b cdivsteps which is considerably fewer steps than manyother inputs that we have tried19 for various values of b The quantity 073690 here islog(2) log((

radic17 + 1)2) coming from the smaller eigenvalue minus2(

radic17 + 1) = minus039038

of the above matrix

F3 Norms We use the standard definitions of the 2-norm of a real vector and the2-norm of a real matrix We summarize the basic theory of 2-norms here to keep thispaper self-contained

Definition F4 If v isin R2 then |v|2 is defined asradicv2

1 + v22 where v = (v1 v2)

Definition F5 If P isinM2(R) then |P |2 is defined as max|Pv|2 v isin R2 |v|2 = 1

This maximum exists becausev isin R2 |v|2 = 1

is a compact set (namely the unit

circle) and |Pv|2 is a continuous function of v See also Theorem F11 for an explicitformula for |P |2

Beware that the literature sometimes uses the notation |P |2 for the entrywise 2-normradicP 2

11 + P 212 + P 2

21 + P 222 We do not use entrywise norms of matrices in this paper

Theorem F6 Let v be an element of Z2 If v 6= 0 then |v|2 ge 1

Proof Write v as (v1 v2) with v1 v2 isin Z If v1 6= 0 then v21 ge 1 so v2

1 + v22 ge 1 If v2 6= 0

then v22 ge 1 so v2

1 + v22 ge 1 Either way |v|2 =

radicv2

1 + v22 ge 1

Theorem F7 Let v be an element of R2 Let r be an element of R Then |rv|2 = |r||v|2

Proof Write v as (v1 v2) with v1 v2 isin R Then rv = (rv1 rv2) so |rv|2 =radicr2v2

1 + r2v22 =radic

r2radicv2

1 + v22 = |r||v|2

18Specifically Stehleacute and Zimmermann claim that ldquothe worst case of the binary variantrdquo (ie of theirgcd algorithm) is for inputs ldquoGn and 2Gnminus1 where G0 = 0 G1 = 1 Gn = minusGnminus1 + 4Gnminus2 which givesall binary quotients equal to 1rdquo Daireaux Maume-Deschamps and Valleacutee in [31 Section 23] state thatStehleacute and Zimmermann ldquoexhibited the precise worst-case number of iterationsrdquo and give an equivalentcharacterization of this case

19For example ldquoGnrdquo from [62] is minus5467345 for n = 18 and 2135149 for n = 17 The inputs (minus5467345 2 middot2135149) use 17 iterations of the ldquoGBrdquo divisions from [62] while the smaller inputs (minus1287513 2middot622123) use24 ldquoGBrdquo iterations The inputs (1minus5467345 2135149) use 34 cdivsteps the inputs (1minus2718281 3141592)use 45 cdivsteps the inputs (1minus1287513 622123) use 58 cdivsteps

50 Fast constant-time gcd computation and modular inversion

Theorem F8 Let P be an element of M2(R) Let r be an element of R Then|rP |2 = |r||P |2

Proof By definition |rP |2 = max|rPv|2 v isin R2 |v|2 = 1

By Theorem F7 this is the

same as max|r||Pv|2 = |r|max|Pv|2 = |r||P |2

Theorem F9 Let P be an element of M2(R) Let v be an element of R2 Then|Pv|2 le |P |2|v|2

Proof If v = 0 then Pv = 0 and |Pv|2 = 0 = |P |2|v|2Otherwise |v|2 6= 0 Write w = v|v|2 Then |w|2 = 1 so |Pw|2 le |P |2 so |Pv|2 =

|Pw|2|v|2 le |P |2|v|2

Theorem F10 Let PQ be elements of M2(R) Then |PQ|2 le |P |2|Q|2

Proof If v isin R2 and |v|2 = 1 then |PQv|2 le |P |2|Qv|2 le |P |2|Q|2|v|2

Theorem F11 Let P be an element of M2(R) Assume that P lowastP =(a bc d

)where P lowast

is the transpose of P Then |P |2 =radic

a+d+radic

(aminusd)2+4b2

2

Proof P lowastP isin M2(R) is its own transpose so by the spectral theorem it has twoorthonormal eigenvectors with real eigenvalues In other words R2 has a basis e1 e2 suchthat

bull |e1|2 = 1

bull |e2|2 = 1

bull the dot product elowast1e2 is 0 (here we view vectors as column matrices)

bull P lowastPe1 = λ1e1 for some λ1 isin R and

bull P lowastPe2 = λ2e2 for some λ2 isin R

Any v isin R2 with |v|2 = 1 can be written uniquely as c1e1 + c2e2 for some c1 c2 isin R withc2

1 + c22 = 1 explicitly c1 = vlowaste1 and c2 = vlowaste2 Then P lowastPv = λ1c1e1 + λ2c2e2

By definition |P |22 is the maximum of |Pv|22 over v isin R2 with |v|2 = 1 Note that|Pv|22 = (Pv)lowast(Pv) = vlowastP lowastPv = λ1c

21 + λ2c

22 with c1 c2 defined as above For example

0 le |Pe1|22 = λ1 and 0 le |Pe2|22 = λ2 Thus |Pv|22 achieves maxλ1 λ2 It cannot belarger than this if c2

1 +c22 = 1 then λ1c

21 +λ2c

22 le maxλ1 λ2 Hence |P |22 = maxλ1 λ2

To finish the proof we make the spectral theorem explicit for the matrix P lowastP =(a bc d

)isinM2(R) Note that (P lowastP )lowast = P lowastP so b = c The characteristic polynomial of

this matrix is (xminus a)(xminus d)minus b2 = x2 minus (a+ d)x+ adminus b2 with roots

a+ dplusmnradic

(a+ d)2 minus 4(adminus b2)2 =

a+ dplusmnradic

(aminus d)2 + 4b2

2

The largest eigenvalue of P lowastP is thus (a+ d+radic

(aminus d)2 + 4b2)2 and |P |2 is the squareroot of this

Theorem F12 Let P be an element of M2(R) Assume that P lowastP =(a bc d

)where P lowast

is the transpose of P Let N be an element of R Then N ge |P |2 if and only if N ge 02N2 ge a+ d and (2N2 minus aminus d)2 ge (aminus d)2 + 4b2

Daniel J Bernstein and Bo-Yin Yang 51

earlybounds=011126894912^2037794112^2148808332^2251652192^206977232^2078823132^2483067332^239920452^22104392132^25112816812^25126890072^27138243032^28142578172^27156342292^29163862452^29179429512^31185834332^31197136532^32204328912^32211335692^31223282932^33238004212^35244892332^35256040592^36267388892^37271122152^35282767752^3729849732^36308292972^40312534432^39326254052^4133956252^39344650552^42352865672^42361759512^42378586372^4538656472^4239404692^4240247512^42412409172^46425934112^48433643372^48448890152^50455437912^5046418992^47472050052^50489977912^53493071912^52507544232^5451575272^51522815152^54536940732^56542122492^55552582732^56566360932^58577810812^59589529592^60592914752^59607185992^61618789972^62625348212^62633292852^62644043412^63659866332^65666035532^65

defalpha(w)ifwgt=67return(6331024)^wreturnearlybounds[w]

assertall(alpha(w)^49lt2^(-(34w-23))forwinrange(31100))assertmin((6331024)^walpha(w)forwinrange(68))==633^5(2^30165219)

Figure F14 Definition of αw for w isin 0 1 2 computer verificationthat αw lt 2minus(34wminus23)49 for 31 le w le 99 and computer verification thatmin(6331024)wαw 0 le w le 67 = 6335(230165219) If w ge 67 then αw =(6331024)w

Our upper-bound computations work with matrices with rational entries We useTheorem F12 to compare the 2-norms of these matrices to various rational bounds N Allof the computations here use exact arithmetic avoiding the correctness questions thatwould have shown up if we had used approximations to the square roots in Theorem F11Sage includes exact representations of algebraic numbers such as |P |2 but switching toTheorem F12 made our upper-bound computations an order of magnitude faster

Proof Assume thatN ge 0 that 2N2 ge a+d and that (2N2minusaminusd)2 ge (aminusd)2+4b2 Then2N2minusaminusd ge

radic(aminus d)2 + 4b2 since 2N2minusaminusd ge 0 so N2 ge (a+d+

radic(aminus d)2 + 4b2)2

so N geradic

(a+ d+radic

(aminus d)2 + 4b2)2 since N ge 0 so N ge |P |2 by Theorem F11Conversely assume that N ge |P |2 Then N ge 0 Also N2 ge |P |22 so 2N2 ge

2|P |22 = a + d +radic

(aminus d)2 + 4b2 by Theorem F11 In particular 2N2 ge a + d and(2N2 minus aminus d)2 ge (aminus d)2 + 4b2

F13 Upper bounds We now show that every weight-w product of the matrices inTheorem E2 has norm at most αw Here α0 α1 α2 is an exponentially decreasingsequence of positive real numbers defined in Figure F14

Definition F15 For e isin 1 2 and q isin

1 3 5 2e+1 minus 1 define Meq =(

0 12eminus12e q22e

)

52 Fast constant-time gcd computation and modular inversion

Theorem F16 |Meq|2 lt (1 +radic

2)2e if e isin 1 2 and q isin

1 3 5 2e+1 minus 1

Proof Define(a bc d

)= MlowasteqMeq Then a = 122e b = c = minusq23e and d = 122e +

q224e so a+ d = 222e + q224e lt 622e and (aminus d)2 + 4b2 = 4q226e + q428e le 3224eBy Theorem F12 |Meq|2 =

radic(a+ d+

radic(aminus d)2 + 4b2)2 le

radic(622e +

radic3222e)2 =

(1 +radic

2)2e

Definition F17 Define βw = infαw+jαj j ge 0 for w isin 0 1 2

Theorem F18 If w isin 0 1 2 then βw = minαw+jαj 0 le j le 67 If w ge 67then βw = (6331024)w6335(230165219)

The first statement reduces the computation of βw to a finite computation Thecomputation can be further optimized but is not a bottleneck for us Similar commentsapply to γwe in Theorem F20

Proof By definition αw = (6331024)w for all w ge 67 If j ge 67 then also w + j ge 67 soαw+jαj = (6331024)w+j(6331024)j = (6331024)w In short all terms αw+jαj forj ge 67 are identical so the infimum of αw+jαj for j ge 0 is the same as the minimum for0 le j le 67

If w ge 67 then αw+jαj = (6331024)w+jαj The quotient (6331024)jαj for0 le j le 67 has minimum value 6335(230165219)

Definition F19 Define γwe = infβw+j2j70169 j ge e

for w isin 0 1 2 and

e isin 1 2 3

This quantity γwe is designed to be as large as possible subject to two conditions firstγwe le βw+e2e70169 used in Theorem F21 second γwe le γwe+1 used in Theorem F22

Theorem F20 γwe = minβw+j2j70169 e le j le e+ 67

if w isin 0 1 2 and

e isin 1 2 3 If w + e ge 67 then γwe = 2e(6331024)w+e(70169)6335(230165219)

Proof If w + j ge 67 then βw+j = (6331024)w+j6335(230165219) so βw+j2j70169 =(6331024)w(633512)j(70169)6335(230165219) which increases monotonically with j

In particular if j ge e+ 67 then w + j ge 67 so βw+j2j70169 ge βw+e+672e+6770169Hence γwe = inf

βw+j2j70169 j ge e

= min

βw+j2j70169 e le j le e+ 67

Also if

w + e ge 67 then w + j ge 67 for all j ge e so

γwe =(

6331024

)w (633512

)e 70169

6335

230165219 = 2e(

6331024

)w+e 70169

6335

230165219

as claimed

Theorem F21 Define S as the smallest subset of ZtimesM2(Q) such that

bull(0(

1 00 1))isin S and

bull if (wP ) isin S and w = 0 or |P |2 gt βw and e isin 1 2 3 and |P |2 gt γwe andq isin

1 3 5 2e+1 minus 1

then (e+ wM(e q)P ) isin S

Assume that |P |2 le αw for each (wP ) isin S Then |M(ej qj) middot middot middotM(e1 q1)|2 le αe1+middotmiddotmiddot+ej

for all j isin 0 1 2 all e1 ej isin 1 2 3 and all q1 qj isin Z such thatqi isin

1 3 5 2ei+1 minus 1

for each i

Daniel J Bernstein and Bo-Yin Yang 53

The idea here is that instead of searching through all products P of matrices M(e q)of total weight w to see whether |P |2 le αw we prune the search in a way that makesthe search finite across all w (see Theorem F22) while still being sure to find a minimalcounterexample if one exists The first layer of pruning skips all descendants QP of P ifw gt 0 and |P |2 le βw the point is that any such counterexample QP implies a smallercounterexample Q The second layer of pruning skips weight-(w + e) children M(e q)P ofP if |P |2 le γwe the point is that any such child is eliminated by the first layer of pruning

Proof Induct on jIf j = 0 then M(ej qj) middot middot middotM(e1 q1) =

(1 00 1) with norm 1 = α0 Assume from now on

that j ge 1Case 1 |M(ei qi) middot middot middotM(e1 q1)|2 le βe1+middotmiddotmiddot+ei

for some i isin 1 2 jThen j minus i lt j so |M(ej qj) middot middot middotM(ei+1 qi+1)|2 le αei+1+middotmiddotmiddot+ej by the inductive

hypothesis so |M(ej qj) middot middot middotM(e1 q1)|2 le βe1+middotmiddotmiddot+eiαei+1+middotmiddotmiddot+ej by Theorem F10 butβe1+middotmiddotmiddot+ei

αei+1+middotmiddotmiddot+ejle αe1+middotmiddotmiddot+ej

by definition of βCase 2 |M(ei qi) middot middot middotM(e1 q1)|2 gt βe1+middotmiddotmiddot+ei for every i isin 1 2 jSuppose that |M(eiminus1 qiminus1) middot middot middotM(e1 q1)|2 le γe1+middotmiddotmiddot+eiminus1ei

for some i isin 1 2 jBy Theorem F16 |M(ei qi)|2 lt (1 +

radic2)2ei lt (16970)2ei By Theorem F10

|M(ei qi) middot middot middotM(e1 q1)|2 le |M(ei qi)|2γe1+middotmiddotmiddot+eiminus1ei le169γe1+middotmiddotmiddot+eiminus1ei

70 middot 2ei+1 le βe1+middotmiddotmiddot+ei

contradiction Thus |M(eiminus1 qiminus1) middot middot middotM(e1 q1)|2 gt γe1+middotmiddotmiddot+eiminus1eifor all i isin 1 2 j

Now (e1 + middot middot middot + eiM(ei qi) middot middot middotM(e1 q1)) isin S for each i isin 0 1 2 j Indeedinduct on i The base case i = 0 is simply

(0(

1 00 1))isin S If i ge 1 then (wP ) isin S by

the inductive hypothesis where w = e1 + middot middot middot+ eiminus1 and P = M(eiminus1 qiminus1) middot middot middotM(e1 q1)Furthermore

bull w = 0 (when i = 1) or |P |2 gt βw (when i ge 2 by definition of this case)

bull ei isin 1 2 3

bull |P |2 gt γwei(as shown above) and

bull qi isin

1 3 5 2ei+1 minus 1

Hence (e1 + middot middot middot+ eiM(ei qi) middot middot middotM(e1 q1)) = (ei + wM(ei qi)P ) isin S by definition of SIn particular (wP ) isin S with w = e1 + middot middot middot + ej and P = M(ej qj) middot middot middotM(e1 q1) so

|P |2 le αw by hypothesis

Theorem F22 In Theorem F21 S is finite and |P |2 le αw for each (wP ) isin S

Proof This is a computer proof We run the Sage script in Figure F23 We observe thatit terminates successfully (in slightly over 2546 minutes using Sage 86 on one core of a35GHz Intel Xeon E3-1275 v3 CPU) and prints 3787975117

What the script does is search depth-first through the elements of S verifying thateach (wP ) isin S satisfies |P |2 le αw The output is the number of elements searched so Shas at most 3787975117 elements

Specifically if (wP ) isin S then verify(wP) checks that |P |2 le αw and recursivelycalls verify on all descendants of (wP ) Actually for efficiency the script scales eachmatrix to have integer entries it works with 22eM(e q) (labeled scaledM(eq) in thescript) instead of M(e q) and it works with 4wP (labeled P in the script) instead of P

The script computes βw as shown in Theorem F18 computes γw as shown in Theo-rem F20 and compares |P |2 to various bounds as shown in Theorem F12

If w gt 0 and |P |2 le βw then verify(wP) returns There are no descendants of (wP )in this case Also |P |2 le αw since βw le αwα0 = αw

54 Fast constant-time gcd computation and modular inversion

fromalphaimportalphafrommemoizedimportmemoized

R=MatrixSpace(ZZ2)defscaledM(eq)returnR((02^e-2^eq))

memoizeddefbeta(w)returnmin(alpha(w+j)alpha(j)forjinrange(68))

memoizeddefgamma(we)returnmin(beta(w+j)2^j70169forjinrange(ee+68))

defspectralradiusisatmost(PPN)assumingPPhasformPtranspose()P(ab)(cd)=PPX=2N^2-a-dreturnNgt=0andXgt=0andX^2gt=(a-d)^2+4b^2

defverify(wP)nodes=1PP=Ptranspose()Pifwgt0andspectralradiusisatmost(PP4^wbeta(w))returnnodesassertspectralradiusisatmost(PP4^walpha(w))foreinPositiveIntegers()ifspectralradiusisatmost(PP4^wgamma(we))returnnodesforqinrange(12^(e+1)2)nodes+=verify(e+wscaledM(eq)P)

printverify(0R(1))

Figure F23 Sage script verifying |P |2 le αw for each (wP ) isin S Output is number ofpairs (wP ) considered if verification succeeds or an assertion failure if verification fails

Otherwise verify(w, P) asserts that |P|_2 ≤ α_w, i.e., it checks that |P|_2 ≤ α_w and raises an exception if not. It then tries successively e = 1, e = 2, e = 3, etc., continuing as long as |P|_2 > γ_{w,e}. For each e where |P|_2 > γ_{w,e}, verify(w, P) recursively handles (e + w, M(e, q)P) for each q ∈ {1, 3, 5, ..., 2^{e+1} − 1}.

Note that γ_{w,e} ≤ γ_{w,e+1} ≤ ···, so if |P|_2 ≤ γ_{w,e} then also |P|_2 ≤ γ_{w,e+1} etc. This is where it is important that γ_{w,e} is not simply defined as β_{w+e}·2^e·70/169.

To summarize, verify(w, P) recursively enumerates all descendants of (w, P). Hence verify(0, R(1)) recursively enumerates all elements (w, P) of S, checking |P|_2 ≤ α_w for each (w, P).

Theorem F.24. If j ∈ {0, 1, 2, ...}; e_1, ..., e_j ∈ {1, 2, 3, ...}; q_1, ..., q_j ∈ Z; and q_i ∈ {1, 3, 5, ..., 2^{e_i+1} − 1} for each i; then |M(e_j, q_j)···M(e_1, q_1)|_2 ≤ α_{e_1+···+e_j}.

Proof. This is the conclusion of Theorem F.21, so it suffices to check the assumptions. Define S as in Theorem F.21; then |P|_2 ≤ α_w for each (w, P) ∈ S by Theorem F.22. ∎


Theorem F.25. In the situation of Theorem E.3, assume that R_0, R_1, R_2, ..., R_{j+1} are defined. Then |(R_j/2^{e_j}, R_{j+1})|_2 ≤ α_{e_1+···+e_j}·|(R_0, R_1)|_2, and if e_1 + ··· + e_j ≥ 67 then e_1 + ··· + e_j ≤ log_{1024/633} |(R_0, R_1)|_2.

If R_0 and R_1 are each bounded by 2^b in absolute value then |(R_0, R_1)|_2 is bounded by 2^{b+0.5}, so log_{1024/633} |(R_0, R_1)|_2 ≤ (b + 0.5)/log_2(1024/633) = (b + 0.5)·(1.4410...).

Proof. By Theorem E.2, (R_i/2^{e_i}, R_{i+1}) = M(e_i, q_i)·(R_{i−1}/2^{e_{i−1}}, R_i), where q_i = 2^{e_i}·((R_{i−1}/2^{e_{i−1}}) div_2 R_i) ∈ {1, 3, 5, ..., 2^{e_i+1} − 1}. Hence

  (R_j/2^{e_j}, R_{j+1}) = M(e_j, q_j)···M(e_1, q_1)·(R_0, R_1).

The product M(e_j, q_j)···M(e_1, q_1) has 2-norm at most α_w, where w = e_1 + ··· + e_j, so |(R_j/2^{e_j}, R_{j+1})|_2 ≤ α_w·|(R_0, R_1)|_2 by Theorem F.9.

By Theorem F.6, |(R_j/2^{e_j}, R_{j+1})|_2 ≥ 1. If e_1 + ··· + e_j ≥ 67 then α_{e_1+···+e_j} = (633/1024)^{e_1+···+e_j}, so 1 ≤ (633/1024)^{e_1+···+e_j}·|(R_0, R_1)|_2, i.e., e_1 + ··· + e_j ≤ log_{1024/633} |(R_0, R_1)|_2. ∎

Theorem F.26. In the situation of Theorem E.3, there exists t ≥ 0 such that R_{t+1} = 0.

Proof. Suppose not. Then R_0, R_1, ..., R_j, R_{j+1} are all defined for arbitrarily large j. Take j ≥ 67 with j > log_{1024/633} |(R_0, R_1)|_2. Then e_1 + ··· + e_j ≥ 67, so j ≤ e_1 + ··· + e_j ≤ log_{1024/633} |(R_0, R_1)|_2 < j by Theorem F.25, contradiction. ∎

G Proof of main gcd theorem for integers

This appendix shows how the gcd algorithm from Appendix E is related to iterating divstep. This appendix then uses the analysis from Appendix F to put bounds on the number of divstep iterations needed.

Theorem G.1 (2-adic division as a sequence of division steps). In the situation of Theorem E.1, divstep^{2e}(1, f, g/2) = (1, g/2^e, (qg/2^e − f)/2^{e+1}).

In other words, divstep^{2e}(1, f, g/2) = (1, g/2^e, −(f mod_2 g)/2^{e+1}).

Proof. Define

  z_e = (g/2^e − f)/2 ∈ Z_2,
  z_{e+1} = (z_e + (z_e mod 2)·g/2^e)/2 ∈ Z_2,
  z_{e+2} = (z_{e+1} + (z_{e+1} mod 2)·g/2^e)/2 ∈ Z_2,
  ...,
  z_{2e} = (z_{2e−1} + (z_{2e−1} mod 2)·g/2^e)/2 ∈ Z_2.

Then

  z_e = (g/2^e − f)/2,
  z_{e+1} = ((1 + 2(z_e mod 2))·g/2^e − f)/4,
  z_{e+2} = ((1 + 2(z_e mod 2) + 4(z_{e+1} mod 2))·g/2^e − f)/8,
  ...,
  z_{2e} = ((1 + 2(z_e mod 2) + ··· + 2^e(z_{2e−1} mod 2))·g/2^e − f)/2^{e+1}.

In short, z_{2e} = (q′g/2^e − f)/2^{e+1} where

  q′ = 1 + 2(z_e mod 2) + ··· + 2^e(z_{2e−1} mod 2) ∈ {1, 3, 5, ..., 2^{e+1} − 1}.

Hence q′g/2^e − f = 2^{e+1}z_{2e} ∈ 2^{e+1}Z_2, so q′ = q by the uniqueness statement in Theorem E.1. (Note that this construction gives an alternative proof of the existence of q in Theorem E.1.)

All that remains is to prove divstep^{2e}(1, f, g/2) = (1, g/2^e, (qg/2^e − f)/2^{e+1}), i.e., divstep^{2e}(1, f, g/2) = (1, g/2^e, z_{2e}).

If 1 ≤ i < e then g/2^i ∈ 2Z_2, so divstep(i, f, g/2^i) = (i + 1, f, g/2^{i+1}). By induction, divstep^{e−1}(1, f, g/2) = (e, f, g/2^e). By assumption e > 0 and g/2^e is odd, so divstep(e, f, g/2^e) = (1 − e, g/2^e, (g/2^e − f)/2) = (1 − e, g/2^e, z_e). What remains is to prove that divstep^e(1 − e, g/2^e, z_e) = (1, g/2^e, z_{2e}).

If 0 ≤ i ≤ e − 1 then 1 − e + i ≤ 0, so divstep(1 − e + i, g/2^e, z_{e+i}) = (2 − e + i, g/2^e, (z_{e+i} + (z_{e+i} mod 2)·g/2^e)/2) = (2 − e + i, g/2^e, z_{e+i+1}). By induction, divstep^i(1 − e, g/2^e, z_e) = (1 − e + i, g/2^e, z_{e+i}) for 0 ≤ i ≤ e. In particular divstep^e(1 − e, g/2^e, z_e) = (1, g/2^e, z_{2e}) as claimed. ∎
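The following short Sage/Python sketch (our illustration, not from the paper) spot-checks Theorem G.1 on small inputs, using the integer divstep from Section 8: f is odd, g is nonzero and even, e = ord_2(g), and q is the unique element of {1, 3, ..., 2^{e+1} − 1} with qg/2^e ≡ f (mod 2^{e+1}).

    # Sketch; assumes this divstep matches the integer divstep of Section 8.
    def divstep(delta, f, g):
        if delta > 0 and g % 2 == 1:
            return 1 - delta, g, (g - f) // 2
        return 1 + delta, f, (g + (g % 2) * f) // 2

    def check_G1(f, g):
        assert f % 2 == 1 and g != 0 and g % 2 == 0
        e = 0
        while (g >> e) % 2 == 0: e += 1          # e = ord_2(g)
        q = next(q for q in range(1, 2**(e+1), 2)
                 if (q * (g >> e) - f) % 2**(e+1) == 0)
        state = (1, f, g // 2)
        for _ in range(2 * e): state = divstep(*state)
        assert state == (1, g >> e, (q * (g >> e) - f) // 2**(e+1))

    check_G1(7, 40)    # e = 3, q = 11
    check_G1(-5, 12)   # e = 2, q = 1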

Theorem G.2 (2-adic gcd as a sequence of division steps). In the situation of Theorem E.3, assume that R_0, R_1, ..., R_j, R_{j+1} are defined. Then divstep^{2w}(1, R_0, R_1/2) = (1, R_j/2^{e_j}, R_{j+1}/2), where w = e_1 + ··· + e_j.

Therefore, if R_{t+1} = 0, then divstep^n(1, R_0, R_1/2) = (1 + n − 2w, R_t/2^{e_t}, 0) for all n ≥ 2w, where w = e_1 + ··· + e_t.

Proof. Induct on j. If j = 0 then w = 0, so divstep^{2w}(1, R_0, R_1/2) = (1, R_0, R_1/2), and R_0 = R_0/2^{e_0} since R_0 is odd by hypothesis.

If j ≥ 1 then by definition R_{j+1} = −((R_{j−1}/2^{e_{j−1}}) mod_2 R_j)/2^{e_j}. Furthermore

  divstep^{2e_1+···+2e_{j−1}}(1, R_0, R_1/2) = (1, R_{j−1}/2^{e_{j−1}}, R_j/2)

by the inductive hypothesis, so

  divstep^{2e_1+···+2e_j}(1, R_0, R_1/2) = divstep^{2e_j}(1, R_{j−1}/2^{e_{j−1}}, R_j/2) = (1, R_j/2^{e_j}, R_{j+1}/2)

by Theorem G.1. ∎

Theorem G.3. Let R_0 be an odd element of Z. Let R_1 be an element of 2Z. Then there exists n ∈ {0, 1, 2, ...} such that divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0}. Furthermore, any such n has divstep^n(1, R_0, R_1/2) ∈ Z × {G, −G} × {0}, where G = gcd{R_0, R_1}.

Proof. Define e_0, e_1, e_2, e_3, ... and R_2, R_3, ... as in Theorem E.2. By Theorem F.26 there exists t ≥ 0 such that R_{t+1} = 0. By Theorem E.3, R_t/2^{e_t} ∈ {G, −G}. By Theorem G.2, divstep^{2w}(1, R_0, R_1/2) = (1, R_t/2^{e_t}, 0) ∈ Z × {G, −G} × {0}, where w = e_1 + ··· + e_t.

Conversely, assume that n ∈ {0, 1, 2, ...} and that divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0}. Write divstep^n(1, R_0, R_1/2) as (δ, f, 0). Then divstep^{n+k}(1, R_0, R_1/2) = (k + δ, f, 0) for all k ∈ {0, 1, 2, ...}, so in particular divstep^{2w+n}(1, R_0, R_1/2) = (2w + δ, f, 0). Also divstep^{2w+k}(1, R_0, R_1/2) = (k + 1, R_t/2^{e_t}, 0) for all k ∈ {0, 1, 2, ...}, so in particular divstep^{2w+n}(1, R_0, R_1/2) = (n + 1, R_t/2^{e_t}, 0). Hence f = R_t/2^{e_t} ∈ {G, −G}, and divstep^n(1, R_0, R_1/2) ∈ Z × {G, −G} × {0} as claimed. ∎
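Continuing the sketch above (again our illustration, not the paper's code), Theorem G.3 can be observed numerically by iterating the divstep helper from the previous sketch until the third coordinate is 0 and comparing |f| with gcd{R_0, R_1}:

    from math import gcd

    def to_gcd(R0, R1):
        # iterate divstep (from the sketch above) until the third coordinate is 0
        state = (1, R0, R1 // 2)
        while state[2] != 0:
            state = divstep(*state)
        return state[1]

    assert abs(to_gcd(135, 90)) == gcd(135, 90) == 45
    assert abs(to_gcd(-35, 56)) == gcd(-35, 56) == 7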

Theorem G.4. Let R_0 be an odd element of Z. Let R_1 be an element of 2Z. Define b = log_2 √(R_0² + R_1²). Assume that b ≤ 21. Then divstep^{⌊19b/7⌋}(1, R_0, R_1/2) ∈ Z × Z × {0}.

Proof. If divstep(δ, f, g) = (δ_1, f_1, g_1) then divstep(δ, −f, −g) = (δ_1, −f_1, −g_1). Hence divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0} if and only if divstep^n(1, −R_0, −R_1/2) ∈ Z × Z × {0}. From now on assume without loss of generality that R_0 > 0.


 s   W_s        (R_0, R_1)      |   s   W_s             (R_0, R_1)
 0   1          (1, 0)          |  29   4598269         (1355, −1662)
 1   5          (1, −2)         |  30   7839829         (777, −2690)
 2   5          (1, −2)         |  31   12565573        (2433, −2578)
 3   5          (1, −2)         |  32   22372085        (2969, −3682)
 4   17         (1, −4)         |  33   35806445        (5443, −2486)
 5   17         (1, −4)         |  34   71013905        (4097, −7364)
 6   65         (1, −8)         |  35   74173637        (5119, −6926)
 7   65         (1, −8)         |  36   205509533       (9973, −10298)
 8   157        (11, −6)        |  37   226964725       (11559, −9662)
 9   181        (9, −10)        |  38   487475029       (11127, −19070)
10   421        (15, −14)       |  39   534274997       (13241, −18946)
11   541        (21, −10)       |  40   1543129037      (24749, −30506)
12   1165       (29, −18)       |  41   1639475149      (20307, −35030)
13   1517       (29, −26)       |  42   3473731181      (39805, −43466)
14   2789       (17, −50)       |  43   4500780005      (49519, −45262)
15   3653       (47, −38)       |  44   9497198125      (38051, −89718)
16   8293       (47, −78)       |  45   12700184149     (82743, −76510)
17   8293       (47, −78)       |  46   29042662405     (152287, −76494)
18   24245      (121, −98)      |  47   36511782869     (152713, −114850)
19   27361      (55, −156)      |  48   82049276437     (222249, −180706)
20   56645      (191, −142)     |  49   90638999365     (220417, −205074)
21   79349      (215, −182)     |  50   234231548149    (229593, −426050)
22   132989     (283, −230)     |  51   257130797053    (338587, −377478)
23   213053     (133, −442)     |  52   466845672077    (301069, −613354)
24   415001     (451, −460)     |  53   622968125533    (437163, −657158)
25   613405     (501, −602)     |  54   1386285660565   (600809, −1012578)
26   1345061    (719, −910)     |  55   1876581808109   (604547, −1229270)
27   1552237    (1021, −714)    |  56   4208169535453   (1352357, −1542498)
28   2866525    (789, −1498)    |

Figure G.5: For each s ∈ {0, 1, ..., 56}: minimum R_0² + R_1² where R_0 is an odd integer, R_1 is an even integer, and (R_0, R_1) requires ≥s iterations of divstep; also an example of (R_0, R_1) reaching this minimum.

The rest of the proof is a computer proof. We enumerate all pairs (R_0, R_1) where R_0 is a positive odd integer, R_1 is an even integer, and R_0² + R_1² ≤ 2^{42}. For each (R_0, R_1), we iterate divstep to find the minimum n ≥ 0 where divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0}. We then say that (R_0, R_1) "needs n steps".

For each s ≥ 0, define W_s as the smallest R_0² + R_1² where (R_0, R_1) needs ≥s steps. The computation shows that each (R_0, R_1) needs ≤56 steps, and also shows the values W_s for each s ∈ {0, 1, ..., 56}. We display these values in Figure G.5. We check for each s ∈ {0, 1, ..., 56} that 2^{14s} ≤ W_s^{19}.

Finally, we claim that each (R_0, R_1) needs ≤⌊19b/7⌋ steps, where b = log_2 √(R_0² + R_1²). If not, then (R_0, R_1) needs ≥s steps, where s = ⌊19b/7⌋ + 1. We have s ≤ 56 since (R_0, R_1) needs ≤56 steps. We also have W_s ≤ R_0² + R_1² by definition of W_s. Hence 2^{14s/19} ≤ R_0² + R_1², so 14s/19 ≤ 2b, so s ≤ 19b/7, contradiction. ∎

Theorem G.6 (Bounds on the number of divsteps required). Let R_0 be an odd element of Z. Let R_1 be an element of 2Z. Define b = log_2 √(R_0² + R_1²). Define n = ⌊19b/7⌋ if b ≤ 21; n = ⌊(49b + 23)/17⌋ if 21 < b ≤ 46; and n = ⌊49b/17⌋ if b > 46. Then divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0}.

All three cases have n ≤ ⌊(49b + 23)/17⌋, since 19/7 < 49/17.

Proof. If b ≤ 21 then divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0} by Theorem G.4.

Assume from now on that b > 21. This implies n ≥ ⌊49b/17⌋ ≥ 60.

Define e_0, e_1, e_2, e_3, ... and R_2, R_3, ... as in Theorem E.2. By Theorem F.26 there exists t ≥ 0 such that R_{t+1} = 0. By Theorem E.3, R_t/2^{e_t} ∈ {G, −G}. By Theorem G.2, divstep^{2w}(1, R_0, R_1/2) = (1, R_t/2^{e_t}, 0) ∈ Z × {G, −G} × {0}, where w = e_1 + ··· + e_t.

To finish the proof we will show that 2w ≤ n; hence divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0} as claimed.

Suppose that 2w > n. Both 2w and n are integers, so 2w ≥ n + 1 ≥ 61, so w ≥ 31. By Theorem F.6 and Theorem F.25, 1 ≤ |(R_t/2^{e_t}, R_{t+1})|_2 ≤ α_w·|(R_0, R_1)|_2 = α_w·2^b, so log_2(1/α_w) ≤ b.

Case 1: w ≥ 67. By definition 1/α_w = (1024/633)^w, so w·log_2(1024/633) ≤ b. We have 34/49 < log_2(1024/633), so 34w/49 < b, so n + 1 ≤ 2w < 49b/17 < (49b + 23)/17. This contradicts the definition of n as the largest integer below 49b/17 if b > 46, and as the largest integer below (49b + 23)/17 if b ≤ 46.

Case 2: w ≤ 66. Then α_w < 2^{−(34w−23)/49} since w ≥ 31, so b ≥ log_2(1/α_w) > (34w − 23)/49, so n + 1 ≤ 2w ≤ (49b + 23)/17. This again contradicts the definition of n if b ≤ 46. If b > 46 then n ≥ ⌊49·46/17⌋ = 132, so 2w ≥ n + 1 > 132 ≥ 2w, contradiction. ∎
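As an illustration (ours, not from the paper), the following Sage/Python sketch evaluates the iteration bound of Theorem G.6 from b and checks, on a few small pairs, that the divstep helper sketched after Theorem G.1 reaches a zero third coordinate within that many steps:

    from math import floor, log2, gcd

    def iterations(b):
        # bound on the number of divsteps from Theorem G.6
        if b <= 21: return floor(19*b/7)
        if b <= 46: return floor((49*b + 23)/17)
        return floor(49*b/17)

    def needs(R0, R1):
        # minimum n with divstep^n(1, R0, R1/2) having third coordinate 0
        state, n = (1, R0, R1 // 2), 0
        while state[2] != 0:
            state, n = divstep(*state), n + 1
        return n, state[1]

    for (R0, R1) in [(1, 0), (47, -78), (1023, 1018)]:
        b = log2(R0**2 + R1**2) / 2
        n, f = needs(R0, R1)
        assert n <= iterations(b) and abs(f) == gcd(R0, R1)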

H Comparison to performance of cdivstep

This appendix briefly compares the analysis of divstep from Appendix G to an analogous analysis of cdivstep.

First modify Theorem E.1 to take the unique element q ∈ {±1, ±3, ..., ±(2^e − 1)} such that f + qg/2^e ∈ 2^{e+1}Z_2. The corresponding modification of Theorem G.1 says that cdivstep^{2e}(1, f, g/2) = (1, g/2^e, (f + qg/2^e)/2^{e+1}). The proof proceeds identically, except that z_j = (z_{j−1} − (z_{j−1} mod 2)·g/2^e)/2 ∈ Z_2 for e < j < 2e, and z_{2e} = (z_{2e−1} + (z_{2e−1} mod 2)·g/2^e)/2 ∈ Z_2. Also we have q = −1 − 2(z_e mod 2) − ··· − 2^{e−1}(z_{2e−2} mod 2) + 2^e(z_{2e−1} mod 2) ∈ {±1, ±3, ..., ±(2^e − 1)}. In the notation of [62], GB(f, g) = (q, r) where r = f + qg/2^e = 2^{e+1}z_{2e}.

The corresponding gcd algorithm is the centered algorithm from Appendix E.6, with addition instead of subtraction. Recall that, except for inessential scalings by powers of 2, this is the same as the Stehlé–Zimmermann gcd algorithm from [62].

The corresponding set of transition matrices is different from the divstep case. As mentioned in Appendix F, there are weight-w products with spectral radius 0.6403882032...^w, so the proof strategy for cdivstep cannot obtain weight-w norm bounds below this spectral radius. This is worse than our bounds for divstep.

Enumerating small pairs (R_0, R_1) as in Theorem G.4 also produces worse results for cdivstep than for divstep. See Figure H.1.


[Figure H.1: scatter plot; data points not recoverable from the text extraction.] For each s ∈ {1, 2, ..., 56}: red dot at (b, s − 2.8283396·b), where b is the minimum log_2 √(R_0² + R_1²) over odd integers R_0 and even integers R_1 such that (R_0, R_1) requires ≥s iterations of divstep. For each s ∈ {1, 2, ..., 58}: blue dot at (b, s − 2.8283396·b), where b is the minimum log_2 √(R_0² + R_1²) over odd integers R_0 and even integers R_1 such that (R_0, R_1) requires ≥s iterations of cdivstep.

  • Introduction
  • Organization of the paper
  • Definition of x-adic division steps
  • Iterates of x-adic division steps
  • Fast computation of iterates of x-adic division steps
  • Fast polynomial gcd computation and modular inversion
  • Software case study for polynomial modular inversion
  • Definition of 2-adic division steps
  • Iterates of 2-adic division steps
  • Fast computation of iterates of 2-adic division steps
  • Fast integer gcd computation and modular inversion
  • Software case study for integer modular inversion
  • Proof of main gcd theorem for polynomials
  • Review of the Euclid–Stevin algorithm
  • Euclid–Stevin results as iterates of divstep
  • Alternative proof of main gcd theorem for polynomials
  • A gcd algorithm for integers
  • Performance of the gcd algorithm for integers
  • Proof of main gcd theorem for integers
  • Comparison to performance of cdivstep

that Euclid's algorithm beats Fermat's method for all large enough primes: for n-bit primes, Euclid's algorithm uses Θ(n) simple linear-time operations such as big-integer subtractions, while Fermat's method uses Θ(n) big-integer multiplications. The Euclid-vs-Fermat cutoff depends on the cost ratio between big-integer multiplications and big-integer subtractions, so one expects the cutoff to be larger on CPUs with large hardware multipliers, but many input sizes of interest in cryptography are clearly beyond the cutoff. Both [39, Algorithm 10] and [14, Appendix T.2] use constant-time variants of the Euclid–Stevin algorithm for polynomial inversions in NTRU key generation; each polynomial here has nearly 10000 bits in about 700 coefficients.

1.1 Contributions of this paper. This paper introduces simple fast constant-time "division steps". When division steps are iterated a constant number of times, they reveal the gcd of the inputs, the modular reciprocal when the inputs are coprime, etc. Division steps work the same way for integer inputs and for polynomial inputs, and are suitable for all applications of Euclid's algorithm that we have investigated. As an illustration, we display in Figure 1.2 an integer gcd algorithm using division steps.

This paper uses two case studies to show that these division steps are much faster thanprevious constant-time variants of Euclidrsquos algorithm One case study (Section 12) is aninteger-modular-inversion problem selected to be extremely favorable to Fermatrsquos method

bull The platforms are the popular Intel Haswell Skylake and Kaby Lake CPU coresEach of these CPU cores has a large hardware area devoted to fast multiplications

bull The modulus is the Curve25519 prime 2255 minus 19 Sparse primes allow very fastreductions and the performance of multiplication modulo this particular prime hasbeen studied in detail on many platforms

The latest speed records for inversion modulo 2255 minus 19 on the platforms mentioned aboveare from the recent NathndashSarkar paper ldquoEfficient inversion in (pseudo-)Mersenne primeorder fieldsrdquo [54] which takes 11854 cycles 9301 cycles and 8971 cycles on Haswell Skylakeand Kaby Lake respectively We achieve slightly better inversion speeds on each of theseplatforms 10050 cycles 8778 cycles and 8543 cycles We emphasize the Fermat-friendlynature of this case study It is safe to predict that our algorithm will produce much largerspeedups compared to Fermatrsquos method on CPUs with smaller multipliers on FPGAs andon ASICs7 For example on an ARM Cortex-A7 (only a 32-bit multiplier) our inversionmodulo 2255 minus 19 takes 35277 cycles while the best prior result (Fujii and Aranha [34]using Fermat) was 62648 cycles

The other case study (Section 7) is key generation in the ntruhrss701 cryptosystemfrom the HuumllsingndashRijneveldndashSchanckndashSchwabe paper ldquoHigh-speed key encapsulation fromNTRUrdquo [39] at CHES 2017 We again take an Intel core for comparability specificallyHaswell since [39] selected Haswell That paper used 150000 Haswell cycles to invert apolynomial modulo (x701 minus 1)(x minus 1) with coefficients modulo 3 We reduce the costof this inversion to just 90000 Haswell cycles We also speed up key generation in thesntrup4591761 [14] cryptosystem

1.3 Comparison to previous constant-time algorithms. Earlier constant-time variants of Euclid's algorithm and the Euclid–Stevin algorithm are like our algorithm in that they consist of a prespecified number of constant-time steps. Each step of the previous algorithms might seem rather simple at first glance. However, as illustrated by our new

modulo a special prime Combining these estimates with the ARM Cortex-A8 speeds from [17] suggeststhat Fermat inversion for a random 256-bit prime should take roughly 100000 Cortex-A8 cycles still muchfaster than the Euclid inversion from [21] It is claimed in [21 Table 2] that Fermat inversion takes 584000Cortex-A8 cycles presumably this Fermat software was not properly optimized

7We also expect spinoffs in quantum computation See eg the recent constant-time version of Kaliskirsquosalgorithm by RoettelerndashNaehrigndashSvorendashLauter [57 Section 34] in the context of Shorrsquos algorithm tocompute elliptic-curve discrete logarithms

    def shortgcd2(f, g):
        delta, f, g = 1, ZZ(f), ZZ(g)
        assert f & 1
        m = 4 + 3*max(f.nbits(), g.nbits())
        for loop in range(m):
            if delta > 0 and g & 1: delta, f, g = -delta, g, -f
            delta, g = 1 + delta, (g + (g & 1)*f) // 2
        return abs(f)

Figure 1.2: Example of a gcd algorithm for integers using this paper's division steps. The f input is assumed to be odd. Each iteration of the main loop replaces (δ, f, g) with divstep(δ, f, g). Correctness follows from Theorem 11.2. The algorithm is expressed in the Sage [58] computer-algebra system. The algorithm is not constant-time as shown, but can be made constant-time with low overhead; see Section 5.2.
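A quick usage check (our example, not from the paper), with the figure above loaded in Sage:

    print(shortgcd2(135, 90))   # 45, after m = 4 + 3*8 = 28 loop iterations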

speed records the details matter Our division steps are particularly simple allowing veryefficient computations

Each step of the constant-time algorithms of Bos [21] and RoettelerndashNaehrigndashSvorendashLauter [57 Section 34] like each step in the algorithms of Stein and Kaliski consists of alimited number of additions and subtractions followed by a division by 2 The decisionsof which operations to perform are based on (1) the bottom bits of the integers and (2)which integer is largermdasha comparison that sometimes needs to inspect many bits of theintegers A constant-time comparison inspects all bits

The time for this comparison might not seem problematic in context: subtraction also inspects all bits and produces the result of the comparison as a side effect. But this requires a very wide data flow in the algorithm. One cannot decide what needs to be done in a step without inspecting all bits produced by the previous step. For our algorithm, the bottom t bits of the inputs (or the bottom t coefficients in the polynomial case) decide the linear combinations used in the next t division steps, so the data flow is much narrower.

For the polynomial case, size comparison is simply a matter of comparing polynomial degrees, but there are still many other details that affect performance, as illustrated by our 1.7× speedup for the ntruhrss701 polynomial-inversion problem mentioned above. See Section 7 for a detailed comparison of our algorithm in this case to the algorithm in [39]. Important issues include the number of control variables manipulated in the inner loop, the complexity of the arithmetic operations being carried out, and the way that the output is scaled.

The closest previous algorithms that we have found in the literature are algorithms forldquosystolicrdquo hardware developed in the 1980s by Bojanczyk Brent and Kung See [24] forinteger gcd [19] for integer inversion and [23] for the polynomial case These algorithmshave predictable output scaling a single control variable δ (optionally represented in unary)beyond the loop counter and the same basic feature that several bottom bits decide linearcombinations used for several steps Our algorithms are nevertheless simpler and fasterfor a combination of reasons First our division steps have fewer case distinctions thanthe steps in [24] (and [19]) Second our data flow can be decomposed into a ldquoconditionalswaprdquo followed by an ldquoeliminationrdquo (see Section 3) whereas the data flow in [23] requiresa more complicated routing of inputs Third surprisingly we use fewer iterations than[24] Fourth for us t bits determine linear combinations for exactly t steps whereas this isnot exactly true in [24] Fifth we gain considerable speed in eg Curve25519 inversionby jumping through division steps

1.4 Comparison to previous subquadratic algorithms. An easy consequence of our data flow is a constant-time algorithm where the number of operations is subquadratic

Daniel J Bernstein and Bo-Yin Yang 5

in the input sizemdashasymptotically n(logn)2+o(1) This algorithm has important advantagesover earlier subquadratic gcd algorithms as we now explain

The starting point for subquadratic gcd algorithms is the following idea from Lehmer [44]The quotients at the beginning of gcd computation depend only on the most significantbits of the inputs After computing the corresponding 2times 2 transition matrix one canretroactively multiply this matrix by the remaining bits of the inputs and then continuewith the new hopefully smaller inputs

A closer look shows that it is not easy to formulate a precise statement about theamount of progress that one is guaranteed to make by inspecting j most significant bitsof the inputs We return to this difficulty below Assume for the moment that j bits ofthe inputs determine roughly j bits of initial information about the quotients in Euclidrsquosalgorithm

Lehmerrsquos paper predated subquadratic multiplication but is well known to be helpfuleven with quadratic multiplication For example consider the following procedure startingwith 40-bit integers a and b add a to b then add the new b to a and so on back and forthfor a total of 30 additions The results fit into words on a typical 64-bit CPU and thisprocedure might seem to be efficiently using the CPUrsquos arithmetic instructions Howeverit is generally faster to directly compute 832040a+ 514229b and 1346269a+ 832040b usingthe CPUrsquos fast multiplication circuits
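The coefficients above are Fibonacci numbers; a tiny check (ours, not from the paper) confirms that the 30 alternating additions compute exactly these two linear combinations of the original a and b:

    a0, b0 = 10**12 + 39, 10**12 + 61    # arbitrary ~40-bit examples
    a, b = a0, b0
    for i in range(30):
        if i % 2 == 0: b += a            # odd-numbered additions update b
        else:          a += b            # even-numbered additions update a
    assert b == 832040*a0 + 514229*b0
    assert a == 1346269*a0 + 832040*b0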

Faster algorithms for multiplication make Lehmerrsquos approach more effective Knuth[41] using Lehmerrsquos idea recursively on top of FFT-based integer multiplication achievedcost n(logn)5+o(1) for integer gcd Schoumlnhage [60] improved 5 to 2 Subsequent paperssuch as [53] [22] and [66] adapted the ideas to the polynomial case

The subquadratic algorithms appearing in [41] [60] [22 Algorithm EMGCD] [66Algorithm SCH] [62 Algorithm Half-GB-gcd] [52 Algorithms HGCD-Q and HGCD-Band SGCD and HGCD-D] [25 Algorithm 31] etc all have the following structure Thereis an initial recursive call that computes a 2times 2 transition matrix there is a computationof reduced inputs using a fast multiplication subroutine there is a call to a fast divisionsubroutine producing a possibly large quotient and remainder and then there is morerecursion

We emphasize the role of the division subroutine in each of these algorithms If therecursive computation is expressed as a tree then each node in the tree calls the left subtreerecursively multiplies by a transition matrix calls a division subroutine and calls theright subtree recursively The results of the left subtree are described by some number ofquotients in Euclidrsquos algorithm so variations in the sizes of quotients produce irregularitiesin the amount of information computed by the recursions These irregularities can bequite severe multiplying by the transition matrix does not make any progress when theinputs have sufficiently different sizes The call to a division subroutine is a bandage overthese irregularities guaranteeing significant progress in all cases

Our subquadratic algorithm has a simpler structure We guarantee that a recursivecall to a size-j subtree always jumps through exactly j division steps Each node in thecomputation tree involves multiplications and recursive calls but the large variable-timedivision subroutine has disappeared and the underlying irregularities have also disappearedOne can think of each Euclid-style division as being decomposed into a series of our simplerconstant-time division steps but our recursion does not care where each Euclid-styledivision begins and ends See Section 12 for a case study of the speeds that we achieve

A prerequisite for this regular structure is that the linear combinations in j of ourdivision steps are determined by the bottom j bits of the integer inputs8 without regardto top bits This prerequisite was almost achieved by the BrentndashKung ldquosystolicrdquo gcdalgorithm [24] mentioned above but Brent and Kung did not observe that the data flow

8For polynomials one can work from the bottom or from the top since there are no carries Wechoose to work from bottom coefficients in the polynomial case to emphasize the similarities between thepolynomial case and the integer case

6 Fast constant-time gcd computation and modular inversion

allows a subquadratic algorithm The StehleacutendashZimmermann subquadratic ldquobinary recursivegcdrdquo algorithm [62] also emphasizes that it works with bottom bits but it has the sameirregularities as earlier subquadratic algorithms and it again needs a large variable-timedivision subroutine

Often the literature considers the ldquonormalrdquo case of polynomial gcd computation Thisis by definition the case that each EuclidndashStevin quotient has degree 1 The conventionalrecursion then returns a predictable amount of information allowing a more regularalgorithm such as [53 Algorithm PGCD]9 Our algorithm achieves the same regularityfor the general case

Another previous algorithm that achieves the same regularity for a different special caseof polynomial gcd computation is Thomeacutersquos subquadratic variant [68 Program 41] of theBerlekampndashMassey algorithm Our algorithm can be viewed as extending this regularityto arbitrary polynomial gcd computations and beyond this to integer gcd computations

2 Organization of the paper

We start with the polynomial case. Section 3 defines division steps. Section 5, relying on theorems in Section 4, states our main algorithm to compute c coefficients of the nth iterate of divstep. This takes n(c + n) simple operations. We also explain how "jumps" reduce the cost, for large n, to (c + n)(log cn)^{2+o(1)} operations. All of these algorithms take constant time, i.e., time independent of the input coefficients, for any particular (n, c).

There is a long tradition in number theory of defining Euclidrsquos algorithm continuedfractions etc for inputs in a ldquocompleterdquo field such as the field of real numbers or the fieldof power series We define division steps for power series and state our main algorithmfor power series Division steps produce polynomial outputs from polynomial inputs sothe reader can restrict attention to polynomials if desired but there are advantages offollowing the tradition as we now explain

Requiring algorithms to work for general power series means that the algorithms canonly inspect a limited number of leading coefficients of each input without regard to thedegree of inputs that happen to be polynomials This restriction simplifies algorithmdesign and helped us design our main algorithm

Going beyond polynomials also makes it easy to describe further algorithmic optionssuch as iterating divstep on inputs (1 gf) instead of (f g) Restricting to polynomialinputs would require gf to be replaced with a nearby polynomial the limited precision ofthe inputs would affect the table of divstep iterates rather than merely being an option inthe algorithms

Section 6 relying on theorems in Appendix A applies our main algorithm to gcdcomputation and modular inversion Here the inputs are polynomials A constant fractionof the power-series coefficients used in our main algorithm are guaranteed to be 0 in thiscase and one can save some time by skipping computations involving those coefficientsHowever it is still helpful to factor the algorithm-design process into (1) designing the fastpower-series algorithm and then (2) eliminating unused values for special cases

Section 7 is a case study of the software speed of modular inversion of polynomials aspart of key generation in the NTRU cryptosystem

The remaining sections of the paper consider the integer case Section 8 defines divisionsteps for integers Section 10 relying on theorems in Section 9 states our main algorithm tocompute c bits of the nth iterate of divstep Section 11 relying on theorems in Appendix Gapplies our main algorithm to gcd computation and modular inversion Section 12 is a case

9The analysis in [53 Section VI] considers only normal inputs (and power-of-2 sizes) The algorithmworks only for normal inputs The performance of division in the non-normal case was discussed in [53Section VIII] but incorrectly described as solely an issue for the base case of the recursion

Daniel J Bernstein and Bo-Yin Yang 7

study of the software speed of inversion of integers modulo primes used in elliptic-curvecryptography

21 General notation We write M2(R) for the ring of 2 times 2 matrices over a ring RFor example M2(R) is the ring of 2times 2 matrices over the field R of real numbers

22 Notation for the polynomial case Fix a field k The formal-power-series ringk[[x]] is the set of infinite sequences (f0 f1 ) with each fi isin k The ring operations are

bull 0 (0 0 0 )

bull 1 (1 0 0 )

bull + (f0 f1 ) (g0 g1 ) 7rarr (f0 + g0 f1 + g1 )

bull minus (f0 f1 ) (g0 g1 ) 7rarr (f0 minus g0 f1 minus g1 )

bull middot (f0 f1 f2 ) (g0 g1 g2 ) 7rarr (f0g0 f0g1 + f1g0 f0g2 + f1g1 + f2g0 )

The element (0 1 0 ) is written ldquoxrdquo and the entry fi in f = (f0 f1 ) is called theldquocoefficient of xi in frdquo As in the Sage [58] computer-algebra system we write f [i] for thecoefficient of xi in f The ldquoconstant coefficientrdquo f [0] has a more standard notation f(0)and we use the f(0) notation when we are not also inspecting other coefficients

The unit group of k[[x]] is written k[[x]]lowast This is the same as the set of f isin k[[x]] suchthat f(0) 6= 0

The polynomial ring k[x] is the subring of sequences (f0 f1 ) that have only finitelymany nonzero coefficients We write deg f for the degree of f isin k[x] this means themaximum i such that f [i] 6= 0 or minusinfin if f = 0 We also define f [minusinfin] = 0 so f [deg f ] isalways defined Beware that the literature sometimes instead defines deg 0 = minus1 and inparticular Sage defines deg 0 = minus1

We define f mod xn as f [0] + f [1]x+ middot middot middot+ f [nminus 1]xnminus1 for any f isin k[[x]] We definef mod g for any f g isin k[x] with g 6= 0 as the unique polynomial r such that deg r lt deg gand f minus r = gq for some q isin k[x]

The ring k[[x]] is a domain The formal-Laurent-series field k((x)) is by definition thefield of fractions of k[[x]] Each element of k((x)) can be written (not uniquely) as fxifor some f isin k[[x]] and some i isin 0 1 2 The subring k[1x] is the smallest subringof k((x)) containing both k and 1x

23 Notation for the integer case The set of infinite sequences (f0 f1 ) with eachfi isin 0 1 forms a ring under the following operations

bull 0 (0 0 0 )

bull 1 (1 0 0 )

bull + (f0 f1 ) (g0 g1 ) 7rarr the unique (h0 h1 ) such that for every n ge 0h0 + 2h1 + 4h2 + middot middot middot+ 2nminus1hnminus1 equiv (f0 + 2f1 + 4f2 + middot middot middot+ 2nminus1fnminus1) + (g0 + 2g1 +4g2 + middot middot middot+ 2nminus1gnminus1) (mod 2n)

bull minus same with (f0 + 2f1 + 4f2 + middot middot middot+ 2nminus1fnminus1)minus (g0 + 2g1 + 4g2 + middot middot middot+ 2nminus1gnminus1)

bull middot same with (f0 + 2f1 + 4f2 + middot middot middot+ 2nminus1fnminus1) middot (g0 + 2g1 + 4g2 + middot middot middot+ 2nminus1gnminus1)

We follow the tradition among number theorists of using the notation Z2 for this ring andnot for the finite field F2 = Z2 This ring Z2 has characteristic 0 so Z can be viewed asa subset of Z2

One can think of (f0 f1 ) isin Z2 as an infinite series f0 + 2f1 + middot middot middot The differencebetween Z2 and the power-series ring F2[[x]] is that Z2 has carries for example adding(1 0 ) to (1 0 ) in Z2 produces (0 1 )

8 Fast constant-time gcd computation and modular inversion

We define f mod 2n as f0 + 2f1 + middot middot middot+ 2nminus1fnminus1 for any f = (f0 f1 ) isin Z2The unit group of Z2 is written Zlowast2 This is the same as the set of odd f isin Z2 ie the

set of f isin Z2 such that f mod 2 6= 0If f isin Z2 and f 6= 0 then ord2 f means the largest integer e ge 0 such that f isin 2eZ2

equivalently the unique integer e ge 0 such that f isin 2eZlowast2The ring Z2 is a domain The field Q2 is by definition the field of fractions of Z2

Each element of Q2 can be written (not uniquely) as f2i for some f isin Z2 and somei isin 0 1 2 The subring Z[12] is the smallest subring of Q2 containing both Z and12

3 Definition of x-adic division steps

Define divstep: Z × k[[x]]^∗ × k[[x]] → Z × k[[x]]^∗ × k[[x]] as follows:

  divstep(δ, f, g) = (1 − δ, g, (g(0)f − f(0)g)/x)  if δ > 0 and g(0) ≠ 0,
  divstep(δ, f, g) = (1 + δ, f, (f(0)g − g(0)f)/x)  otherwise.

Note that f(0)g − g(0)f is divisible by x in k[[x]]. The name "division step" is justified in Theorem C.1, which shows that dividing a degree-d_0 polynomial by a degree-d_1 polynomial with 0 ≤ d_1 < d_0 can be viewed as computing divstep^{2d_0−2d_1}.
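For readers who want to experiment, here is a minimal Sage sketch (ours; the paper's full algorithm appears in Figure 5.1) of a single division step on polynomials over a field, viewed as truncated power series:

    def divstep1(delta, f, g):
        # One x-adic division step on f, g in k[x] with f(0) != 0:
        # conditional swap followed by elimination.
        if delta > 0 and g[0] != 0:
            delta, f, g = -delta, g, f
        return 1 + delta, f, (f[0]*g - g[0]*f).shift(-1)   # exact division by x

    k = GF(7); kx.<x> = k[]
    f = 2 + x^2 + x^3 + 2*x^4 + x^5 + x^6 + x^7
    g = 3 + x + 4*x^2 + x^3 + 5*x^4 + 2*x^5 + 2*x^6
    print(divstep1(1, f, g))   # matches row n = 1 of Table 4.1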

31 Transition matrices Write (δ1 f1 g1) = divstep(δ f g) Then(f1g1

)= T (δ f g)

(fg

)and

(1δ1

)= S(δ f g)

(1δ

)where T Ztimes k[[x]]lowast times k[[x]]rarrM2(k[1x]) is defined by

T (δ f g) =

(0 1g(0)x

minusf(0)x

)if δ gt 0 and g(0) 6= 0

(1 0minusg(0)x

f(0)x

)otherwise

and S Ztimes k[[x]]lowast times k[[x]]rarrM2(Z) is defined by

S(δ f g) =

(1 01 minus1

)if δ gt 0 and g(0) 6= 0

(1 01 1

)otherwise

Note that both T (δ f g) and S(δ f g) are defined entirely by (δ f(0) g(0)) they do notdepend on the remaining coefficients of f and g

32 Decomposition This divstep function is a composition of two simpler functions(and the transition matrices can similarly be decomposed)

bull a conditional swap replacing (δ f g) with (minusδ g f) if δ gt 0 and g(0) 6= 0 followedby

bull an elimination replacing (δ f g) with (1 + δ f (f(0)g minus g(0)f)x)

Daniel J Bernstein and Bo-Yin Yang 9

A δ f [0] g[0] f [1] g[1] f [2] g[2] f [3] g[3] middot middot middot

minusδ swap

B δprime f prime[0] gprime[0] f prime[1] gprime[1] f prime[2] gprime[2] f prime[3] gprime[3] middot middot middot

C 1 + δprime f primeprime[0] gprimeprime[0] f primeprime[1] gprimeprime[1] f primeprime[2] gprimeprime[2] f primeprime[3] middot middot middot

Figure 33 Data flow in an x-adic division step decomposed as a conditional swap (A toB) followed by an elimination (B to C) The swap bit is set if δ gt 0 and g[0] 6= 0 Theg outputs are f prime[0]gprime[1]minus gprime[0]f prime[1] f prime[0]gprime[2]minus gprime[0]f prime[2] f prime[0]gprime[3]minus gprime[0]f prime[3] etc Colorscheme consider eg linear combination cg minus df of power series f g where c = f [0] andd = g[0] inputs f g affect jth output coefficient cg[j]minus df [j] via f [j] g[j] (black) and viachoice of c d (red+blue) Green operations are special creating the swap bit

See Figure 33This decomposition is our motivation for defining divstep in the first casemdashthe swapped

casemdashto compute (g(0)f minus f(0)g)x Using (f(0)g minus g(0)f)x in both cases might seemsimpler but would require the conditional swap to forward a conditional negation to theelimination

One can also compose these functions in the opposite order This is compatible withsome of what we say later about iterating the composition However the correspondingtransition matrices would depend on another coefficient of g This would complicate ouralgorithm statements

Another possibility as in [23] is to keep f and g in place using the sign of δ to decidewhether to add a multiple of f into g or to add a multiple of g into f One way to dothis in constant time is to perform two multiplications and two additions Another way isto perform conditional swaps before and after one addition and one multiplication Ourapproach is more efficient

34 Scaling One can define a variant of divstep that computes f minus (f(0)g(0))g in thefirst case (the swapped case) and g minus (g(0)f(0))f in the second case

This variant has two advantages First it multiplies only one series rather than twoby a scalar in k Second multiplying both f and g by any u isin k[[x]]lowast has the same effecton the output so this variant induces a function on fewer variables (δ gf) the ratioρ = gf is replaced by (1ρ minus 1ρ(0))x when δ gt 0 and ρ(0) 6= 0 and by (ρ minus ρ(0))xotherwise while δ is replaced by 1minus δ or 1+ δ respectively This is the composition of firsta conditional inversion that replaces (δ ρ) with (minusδ 1ρ) if δ gt 0 and ρ(0) 6= 0 second anelimination that replaces ρ with (ρminus ρ(0))x while adding 1 to δ

On the other hand working projectively with f g is generally more efficient than workingwith ρ Working with f(0)gminus g(0)f rather than gminus (g(0)f(0))f has the virtue of being aldquofraction-freerdquo computation it involves only ring operations in k not divisions This makesthe effect of scaling only slightly more difficult to track Specifically say divstep(δ f g) =

10 Fast constant-time gcd computation and modular inversion

Table 41 Iterates (δn fn gn) = divstepn(δ f g) for k = F7 δ = 1 f = 2 + 7x+ 1x2 +8x3 + 2x4 + 8x5 + 1x6 + 8x7 and g = 3 + 1x+ 4x2 + 1x3 + 5x4 + 9x5 + 2x6 Each powerseries is displayed as a row of coefficients of x0 x1 etc

n δn fn gnx0 x1 x2 x3 x4 x5 x6 x7 x8 x9 x0 x1 x2 x3 x4 x5 x6 x7 x8 x9

0 1 2 0 1 1 2 1 1 1 0 0 3 1 4 1 5 2 2 0 0 0 1 0 3 1 4 1 5 2 2 0 0 0 5 2 1 3 6 6 3 0 0 0 2 1 3 1 4 1 5 2 2 0 0 0 1 4 4 0 1 6 0 0 0 0 3 0 1 4 4 0 1 6 0 0 0 0 3 6 1 2 5 2 0 0 0 0 4 1 1 4 4 0 1 6 0 0 0 0 1 3 2 2 5 0 0 0 0 0 5 0 1 3 2 2 5 0 0 0 0 0 1 2 5 3 6 0 0 0 0 0 6 1 1 3 2 2 5 0 0 0 0 0 6 3 1 1 0 0 0 0 0 0 7 0 6 3 1 1 0 0 0 0 0 0 1 4 4 2 0 0 0 0 0 0 8 1 6 3 1 1 0 0 0 0 0 0 0 2 4 0 0 0 0 0 0 0 9 2 6 3 1 1 0 0 0 0 0 0 5 3 0 0 0 0 0 0 0 0

10 minus1 5 3 0 0 0 0 0 0 0 0 4 5 5 0 0 0 0 0 0 0 11 0 5 3 0 0 0 0 0 0 0 0 6 4 0 0 0 0 0 0 0 0 12 1 5 3 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 13 0 2 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 14 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16 3 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 17 4 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 18 5 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 19 6 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

(δprime f prime gprime) If u isin k[[x]] with u(0) = 1 then divstep(δ uf ug) = (δprime uf prime ugprime) If u isin klowast then

divstep(δ uf g) = (δprime f prime ugprime) in the first (swapped) casedivstep(δ uf g) = (δprime uf prime ugprime) in the second casedivstep(δ f ug) = (δprime uf prime ugprime) in the first casedivstep(δ f ug) = (δprime f prime ugprime) in the second case

and divstep(δ uf ug) = (δprime uf prime u2gprime)One can also scale g(0)f minus f(0)g by other nonzero constants For example for typical

fields k one can quickly find ldquohalf-sizerdquo a b such that ab = g(0)f(0) Computing af minus bgmay be more efficient than computing g(0)f minus f(0)g

We use the g(0)f(0) variant in our case study in Section 7 for the field k = F3Dividing by f(0) in this field is the same as multiplying by f(0)

4 Iterates of x-adic division stepsStarting from (δ f g) isin Z times k[[x]]lowast times k[[x]] define (δn fn gn) = divstepn(δ f g) isinZ times k[[x]]lowast times k[[x]] for each n ge 0 Also define Tn = T (δn fn gn) isin M2(k[1x]) andSn = S(δn fn gn) isinM2(Z) for each n ge 0 Our main algorithm relies on various simplemathematical properties of these objects this section states those properties Table 41shows all δn fn gn for one example of (δ f g)

Theorem 42(fngn

)= Tnminus1 middot middot middot Tm

(fmgm

)and

(1δn

)= Snminus1 middot middot middot Sm

(1δm

)if n ge m ge 0

Daniel J Bernstein and Bo-Yin Yang 11

Proof This follows from(fm+1gm+1

)= Tm

(fmgm

)and

(1

δm+1

)= Sm

(1δm

)

Theorem 43 Tnminus1 middot middot middot Tm isin(k + middot middot middot+ 1

xnminusmminus1 k k + middot middot middot+ 1xnminusmminus1 k

1xk + middot middot middot+ 1

xnminusm k1xk + middot middot middot+ 1

xnminusm k

)if n gt m ge 0

A product of nminusm of these T matrices can thus be stored using 4(nminusm) coefficientsfrom k Beware that the theorem statement relies on n gt m for n = m the product is theidentity matrix

Proof If n minus m = 1 then the product is Tm which is in(

k k1xk

1xk

)by definition For

nminusm gt 1 assume inductively that

Tnminus1 middot middot middot Tm+1 isin(k + middot middot middot+ 1

xnminusmminus2 k k + middot middot middot+ 1xnminusmminus2 k

1xk + middot middot middot+ 1

xnminusmminus1 k1xk + middot middot middot+ 1

xnminusmminus1 k

)

Multiply on the right by Tm isin(

k k1xk

1xk

) The top entries of the product are in (k + middot middot middot+

1xnminusmminus2 k) + ( 1

xk+ middot middot middot+ 1xnminusmminus1 k) = k+ middot middot middot+ 1

xnminusmminus1 k as claimed and the bottom entriesare in ( 1

xk + middot middot middot+ 1xnminusmminus1 k) + ( 1

x2 k + middot middot middot+ 1xnminusm k) = 1

xk + middot middot middot+ 1xnminusm k as claimed

Theorem 44 Snminus1 middot middot middot Sm isin(

1 02minus (nminusm) nminusmminus 2 nminusm 1minus1

)if n gt

m ge 0

In other words the bottom-left entry is between minus(nminusm) exclusive and nminusm inclusiveand has the same parity as nminusm There are exactly 2(nminusm) possibilities for the productBeware that the theorem statement again relies on n gt m

Proof If nminusm = 1 then 2minus (nminusm) nminusmminus 2 nminusm = 1 and the product isSm which is

( 1 01 plusmn1

)by definition

For nminusm gt 1 assume inductively that

Snminus1 middot middot middot Sm+1 isin(

1 02minus (nminusmminus 1) nminusmminus 3 nminusmminus 1 1minus1

)

and multiply on the right by Sm =( 1 0

1 plusmn1) The top two entries of the product are again 1

and 0 The bottom-left entry is in 2minus (nminusmminus 1) nminusmminus 3 nminusmminus 1+1minus1 =2minus (nminusm) nminusmminus 2 nminusm The bottom-right entry is again 1 or minus1

The next theorem compares δm fm gm TmSm to δprimem f primem gprimem T primemS primem defined in thesame way starting from (δprime f prime gprime) isin Ztimes k[[x]]lowast times k[[x]]

Theorem 45 Let m t be nonnegative integers Assume that δprimem = δm f primem equiv fm(mod xt) and gprimem equiv gm (mod xt) Then for each integer n with m lt n le m+ t T primenminus1 =Tnminus1 S primenminus1 = Snminus1 δprimen = δn f primen equiv fn (mod xtminus(nminusmminus1)) and gprimen equiv gn (mod xtminus(nminusm))

In other words δm and the first t coefficients of fm and gm determine all of thefollowing

bull the matrices Tm Tm+1 Tm+tminus1

bull the matrices SmSm+1 Sm+tminus1

bull δm+1 δm+t

bull the first t coefficients of fm+1 the first t minus 1 coefficients of fm+2 the first t minus 2coefficients of fm+3 and so on through the first coefficient of fm+t

12 Fast constant-time gcd computation and modular inversion

bull the first tminus 1 coefficients of gm+1 the first tminus 2 coefficients of gm+2 the first tminus 3coefficients of gm+3 and so on through the first coefficient of gm+tminus1

Our main algorithm computes (δn fn mod xtminus(nminusmminus1) gn mod xtminus(nminusm)) for any selectedn isin m+ 1 m+ t along with the product of the matrices Tm Tnminus1 See Sec-tion 5

Proof See Figure 33 for the intuition Formally fix m t and induct on n Observe that(δprimenminus1 f

primenminus1(0) gprimenminus1(0)) equals (δnminus1 fnminus1(0) gnminus1(0))

bull If n = m + 1 By assumption f primem equiv fm (mod xt) and n le m + t so t ge 1 so inparticular f primem(0) = fm(0) Similarly gprimem(0) = gm(0) By assumption δprimem = δm

bull If n gt m + 1 f primenminus1 equiv fnminus1 (mod xtminus(nminusmminus2)) by the inductive hypothesis andn le m + t so t minus (n minus m minus 2) ge 2 so in particular f primenminus1(0) = fnminus1(0) similarlygprimenminus1 equiv gnminus1 (mod xtminus(nminusmminus1)) by the inductive hypothesis and tminus (nminusmminus1) ge 1so in particular gprimenminus1(0) = gnminus1(0) and δprimenminus1 = δnminus1 by the inductive hypothesis

Now use the fact that T and S inspect only the first coefficient of each series to seethat

T primenminus1 = T (δprimenminus1 fprimenminus1 g

primenminus1) = T (δnminus1 fnminus1 gnminus1) = Tnminus1

S primenminus1 = S(δprimenminus1 fprimenminus1 g

primenminus1) = S(δnminus1 fnminus1 gnminus1) = Snminus1

as claimedMultiply

(1

δprimenminus1

)=(

1δnminus1

)by S primenminus1 = Snminus1 to see that δprimen = δn as claimed

By the inductive hypothesis (T primem T primenminus2) = (Tm Tnminus2) and T primenminus1 = Tnminus1 so

T primenminus1 middot middot middot T primem = Tnminus1 middot middot middot Tm Abbreviate Tnminus1 middot middot middot Tm as P Then(f primengprimen

)= P

(f primemgprimem

)and(

fngn

)= P

(fmgm

)by Theorem 42 so

(f primen minus fngprimen minus gn

)= P

(f primem minus fmgprimem minus gm

)

By Theorem 43 P isin(k + middot middot middot+ 1

xnminusmminus1 k k + middot middot middot+ 1xnminusmminus1 k

1xk + middot middot middot+ 1

xnminusm k1xk + middot middot middot+ 1

xnminusm k

) By assumption

f primem minus fm and gprimem minus gm are multiples of xt Multiply to see that f primen minus fn is a multiple ofxtxnminusmminus1 = xtminus(nminusmminus1) as claimed and gprimen minus gn is a multiple of xtxnminusm = xtminus(nminusm)

as claimed

5 Fast computation of iterates of x-adic division stepsOur goal in this section is to compute (δn fn gn) = divstepn(δ f g) We are giventhe power series f g to precision t t respectively and we compute fn gn to precisiont minus n + 1 t minus n respectively if n ge 1 see Theorem 45 We also compute the n-steptransition matrix Tnminus1 middot middot middot T0

We begin with an algorithm that simply computes one divstep iterate after anothermultiplying by scalars in k and dividing by x exactly as in the divstep definition We specifythis algorithm in Figure 51 as a function divstepsx in the Sage [58] computer-algebrasystem The quantities u v q r isin k[1x] inside the algorithm keep track of the coefficientsof the current f g in terms of the original f g

The rest of this section considers (1) constant-time computation (2) ldquojumpingrdquo throughdivision steps and (3) further techniques to save time inside the computations

5.2 Constant-time computation. The underlying Sage functions do not take constant time, but they can be replaced with functions that do take constant time. The case distinction can be replaced with constant-time bit operations, for example obtaining the new (δ, f, g) as follows:


    def divstepsx(n, t, delta, f, g):
        assert t >= n and n >= 0
        f, g = f.truncate(t), g.truncate(t)
        kx = f.parent()
        x = kx.gen()
        u, v, q, r = kx(1), kx(0), kx(0), kx(1)

        while n > 0:
            f = f.truncate(t)
            if delta > 0 and g[0] != 0: delta, f, g, u, v, q, r = -delta, g, f, q, r, u, v
            f0, g0 = f[0], g[0]
            delta, g, q, r = 1 + delta, (f0*g - g0*f)/x, (f0*q - g0*u)/x, (f0*r - g0*v)/x
            n, t = n - 1, t - 1
            g = kx(g).truncate(t)

        M2kx = MatrixSpace(kx.fraction_field(), 2)
        return delta, f, g, M2kx((u, v, q, r))

Figure 5.1: Algorithm divstepsx to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). Inputs: n, t ∈ Z with 0 ≤ n ≤ t; δ ∈ Z; f, g ∈ k[[x]] to precision at least t. Outputs: δ_n; f_n to precision t if n = 0, or t − (n − 1) if n ≥ 1; g_n to precision t − n; T_{n−1}···T_0.

• Let s be the "sign bit" of −δ, meaning 0 if −δ ≥ 0 or 1 if −δ < 0.

• Let z be the "nonzero bit" of g(0), meaning 0 if g(0) = 0 or 1 if g(0) ≠ 0.

• Compute and output 1 + (1 − 2sz)δ. For example, shift δ to obtain 2δ, "AND" with −sz to obtain 2szδ, and subtract from 1 + δ.

• Compute and output f ⊕ sz(f ⊕ g), obtaining f if sz = 0 or g if sz = 1.

• Compute and output (1 − 2sz)(f(0)g − g(0)f)/x. Alternatively, compute and output simply (f(0)g − g(0)f)/x; see the discussion of decomposition and scaling in Section 3.

Standard techniques minimize the exact cost of these computations for various representations of the inputs. The details depend on the platform. See our case study in Section 7.
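As a concrete illustration of the bit-trick style described above (our sketch, under the assumption that coefficients are stored as integers in [0, p); Python itself is of course not constant-time), here is a branchless division step on coefficient lists over a prime field:

    def divstep_branchless(delta, f, g, p):
        # f, g: lists of coefficients mod p, f[0] != 0; |delta| < 2**63 assumed.
        s = ((-delta) >> 63) & 1            # sign bit of -delta: 1 iff delta > 0
        z = ((-g[0]) >> 63) & 1             # nonzero bit of g(0)
        mask = -(s & z)                     # all-ones if swapping, else 0
        delta = 1 + (1 - 2*(s & z))*delta
        f, g = ([fi ^ ((fi ^ gi) & mask) for fi, gi in zip(f, g)],
                [gi ^ ((fi ^ gi) & mask) for fi, gi in zip(f, g)])
        f0, g0 = f[0], g[0]
        # elimination: coefficient j of the new g is f0*g[j+1] - g0*f[j+1]
        g = [(f0*gi - g0*fi) % p for fi, gi in zip(f[1:], g[1:])] + [0]
        return delta, f, g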

5.3 Jumping through division steps. Here is a more general divide-and-conquer algorithm to compute (δ_n, f_n, g_n) and the n-step transition matrix T_{n−1}···T_0:

• If n ≤ 1, use the definition of divstep and stop.

• Choose j ∈ {1, 2, ..., n − 1}.

• Jump j steps from δ, f, g to δ_j, f_j, g_j. Specifically, compute the j-step transition matrix T_{j−1}···T_0, and then multiply by (f, g) to obtain (f_j, g_j). To compute the j-step transition matrix, call the same algorithm recursively. The critical point here is that this recursive call uses the inputs to precision only j; this determines the j-step transition matrix by Theorem 4.5.

• Similarly jump another n − j steps from δ_j, f_j, g_j to δ_n, f_n, g_n.


    from divstepsx import divstepsx

    def jumpdivstepsx(n, t, delta, f, g):
        assert t >= n and n >= 0
        kx = f.truncate(t).parent()

        if n <= 1: return divstepsx(n, t, delta, f, g)

        j = n // 2

        delta, f1, g1, P1 = jumpdivstepsx(j, j, delta, f, g)
        f, g = P1 * vector((f, g))
        f, g = kx(f).truncate(t - j), kx(g).truncate(t - j)

        delta, f2, g2, P2 = jumpdivstepsx(n - j, n - j, delta, f, g)
        f, g = P2 * vector((f, g))
        f, g = kx(f).truncate(t - n + 1), kx(g).truncate(t - n)

        return delta, f, g, P2 * P1

Figure 5.4: Algorithm jumpdivstepsx to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). Same inputs and outputs as in Figure 5.1.

This splits an (n, t) problem into two smaller problems, namely a (j, j) problem and an (n − j, n − j) problem, at the cost of O(1) polynomial multiplications involving O(t + n) coefficients.

If j is always chosen as 1, then this algorithm ends up performing the same computations as Figure 5.1: compute δ_1, f_1, g_1 to precision t − 1; compute δ_2, f_2, g_2 to precision t − 2; etc. But there are many more possibilities for j.

Our jumpdivstepsx algorithm in Figure 5.4 takes j = ⌊n/2⌋, balancing the (j, j) problem with the (n − j, n − j) problem. In particular, with subquadratic polynomial multiplication the algorithm is subquadratic. With FFT-based polynomial multiplication, the algorithm uses (t + n)(log(t + n))^{1+o(1)} + n(log n)^{2+o(1)} operations, and hence n(log n)^{2+o(1)} operations if t is bounded by n(log n)^{1+o(1)}. For comparison, Figure 5.1 takes Θ(n(t + n)) operations.

We emphasize that unlike standard fast-gcd algorithms such as [66] this algorithmdoes not call a polynomial-division subroutine between its two recursive calls

55 More speedups Our jumpdivstepsx algorithm is asymptotically Θ(logn) timesslower than fast polynomial multiplication The best previous gcd algorithms also have aΘ(logn) slowdown both in the worst case and in the average case10 This might suggestthat there is very little room for improvement

However Θ says nothing about constant factors There are several standard ideas forconstant-factor speedups in gcd computation beyond lower-level speedups in multiplicationWe now apply the ideas to jumpdivstepsx

The final polynomial multiplications in jumpdivstepsx produce six large outputsf g and four entries of a matrix product P Callers do not always use all of theseoutputs for example the recursive calls to jumpdivstepsx do not use the resulting f

10There are faster cases Strassen [66] gave an algorithm that replaces the log factor with the entropy ofthe list of degrees in the corresponding continued fraction there are occasional inputs where this entropyis o(log n) For example computing gcd 0 takes linear time However these speedups are not usefulfor constant-time computations


and g In principle there is no difficulty applying dead-value elimination and higher-leveloptimizations to each of the 63 possible combinations of desired outputs

The recursive calls in jumpdivstepsx produce two products P1 P2 of transitionmatrices which are then (if desired) multiplied to obtain P = P2P1 In Figure 54jumpdivstepsx multiplies f and g first by P1 and then by P2 to obtain fn and gn Analternative is to multiply by P

The inputs f and g are truncated to t coefficients and the resulting fn and gn aretruncated to tminus n+ 1 and tminus n coefficients respectively Often the caller (for example

jumpdivstepsx itself) wants more coefficients of P(fg

) Rather than performing an

entirely new multiplication by P one can add the truncation errors back into the resultsmultiply P by the f and g truncation errors multiply P2 by the fj and gj truncationerrors and skip the final truncations of fn and gn

There are standard speedups for multiplication in the context of truncated productssums of products (eg in matrix multiplication) partially known products (eg the

coefficients of xminus1 xminus2 xminus3 in P1

(fg

)are known to be 0) repeated inputs (eg each

entry of P1 is multiplied by two entries of P2 and by one of f g) and outputs being reusedas inputs See eg the discussions of ldquoFFT additionrdquo ldquoFFT cachingrdquo and ldquoFFT doublingrdquoin [11]

Any choice of j between 1 and nminus 1 produces correct results Values of j close to theedges do not take advantage of fast polynomial multiplication but this does not imply thatj = bn2c is optimal The number of combinations of choices of j and other options aboveis small enough that for each small (t n) in turn one can simply measure all options andselect the best The results depend on the contextmdashwhich outputs are neededmdashand onthe exact performance of polynomial multiplication

6 Fast polynomial gcd computation and modular inversion

This section presents three algorithms: gcdx computes gcd{R_0, R_1}, where R_0 is a polynomial of degree d > 0 and R_1 is a polynomial of degree <d; gcddegree computes merely deg gcd{R_0, R_1}; recipx computes the reciprocal of R_1 modulo R_0, if gcd{R_0, R_1} = 1. Figure 6.1 specifies all three algorithms, and Theorem 6.2 justifies all three algorithms. The theorem is proven in Appendix A.

Each algorithm calls divstepsx from Section 5 to compute (δ_{2d−1}, f_{2d−1}, g_{2d−1}) = divstep^{2d−1}(1, f, g). Here f = x^d R_0(1/x) is the reversal of R_0, and g = x^{d−1} R_1(1/x). Of course, one can replace divstepsx here with jumpdivstepsx, which has the same outputs. The algorithms vary in the input-precision parameter t passed to divstepsx, and vary in how they use the outputs:

• gcddegree uses input precision t = 2d − 1 to compute δ_{2d−1}, and then uses one part of Theorem 6.2, which states that deg gcd{R_0, R_1} = δ_{2d−1}/2.

• gcdx uses precision t = 3d − 1 to compute δ_{2d−1} and d + 1 coefficients of f_{2d−1}. The algorithm then uses another part of Theorem 6.2, which states that f_{2d−1}/f_{2d−1}(0) is the reversal of gcd{R_0, R_1}.

• recipx uses t = 2d − 1 to compute δ_{2d−1}, 1 coefficient of f_{2d−1}, and the product P of transition matrices. It then uses the last part of Theorem 6.2, which (in the coprime case) states that the top-right corner of P is f_{2d−1}(0)/x^{2d−2} times the degree-(d − 1) reversal of the desired reciprocal.

One can generalize to compute gcdR0 R1 for any polynomials R0 R1 of degree ledin time that depends only on d scan the inputs to find the top degree and always perform


    from divstepsx import divstepsx

    def gcddegree(R0, R1):
        d = R0.degree()
        assert d > 0 and d > R1.degree()
        f, g = R0.reverse(d), R1.reverse(d-1)
        delta, f, g, P = divstepsx(2*d-1, 2*d-1, 1, f, g)
        return delta // 2

    def gcdx(R0, R1):
        d = R0.degree()
        assert d > 0 and d > R1.degree()
        f, g = R0.reverse(d), R1.reverse(d-1)
        delta, f, g, P = divstepsx(2*d-1, 3*d-1, 1, f, g)
        return f.reverse(delta//2) / f[0]

    def recipx(R0, R1):
        d = R0.degree()
        assert d > 0 and d > R1.degree()
        f, g = R0.reverse(d), R1.reverse(d-1)
        delta, f, g, P = divstepsx(2*d-1, 2*d-1, 1, f, g)
        if delta != 0: return
        kx = f.parent()
        x = kx.gen()
        return kx(x^(2*d-2) * P[0][1] / f[0]).reverse(d-1)

Figure 6.1: Algorithm gcdx to compute gcd{R_0, R_1}; algorithm gcddegree to compute deg gcd{R_0, R_1}; algorithm recipx to compute the reciprocal of R_1 modulo R_0 when gcd{R_0, R_1} = 1. All three algorithms assume that R_0, R_1 ∈ k[x], that deg R_0 > 0, and that deg R_0 > deg R_1.

2d iterations of divstep The extra iteration handles the case that both R0 and R1 havedegree d For simplicity we focus on the case degR0 = d gt degR1 with d gt 0
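As a quick illustration of how the three algorithms are meant to be called, the following Sage snippet (ours, not part of the paper's software) checks gcddegree and gcdx against Sage's built-in gcd and checks recipx against the definition of a modular reciprocal. It assumes divstepsx from Section 5 and the three functions of Figure 6.1 have been loaded; the two example polynomials are arbitrary.

    k = GF(7)
    kx.<x> = k[]
    R0 = x^7 + 3*x^5 + x + 5
    R1 = 2*x^6 + x^2 + 6
    assert gcddegree(R0, R1) == gcd(R0, R1).degree()
    assert gcdx(R0, R1) == gcd(R0, R1)            # both are monic
    if gcddegree(R0, R1) == 0:
        assert (recipx(R0, R1)*R1) % R0 == 1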

One can similarly characterize gcd and modular reciprocal in terms of subsequent iterates of divstep, with δ increased by the number of extra steps. Implementors might, for example, prefer to use an iteration count that is divisible by a large power of 2.

Theorem 6.2. Let k be a field. Let d be a positive integer. Let R0, R1 be elements of the polynomial ring k[x] with deg R0 = d > deg R1. Define G = gcd{R0, R1}, and let V be the unique polynomial of degree < d − deg G such that V·R1 ≡ G (mod R0). Define f = x^d·R0(1/x); g = x^{d−1}·R1(1/x); (δ_n, f_n, g_n) = divstep^n(1, f, g); T_n = T(δ_n, f_n, g_n); and (u_n, v_n; q_n, r_n) = T_{n−1}···T_0, writing (a, b; c, d) for the 2×2 matrix with rows (a, b) and (c, d). Then

    deg G = δ_{2d−1}/2,
    G = x^{deg G}·f_{2d−1}(1/x)/f_{2d−1}(0),
    V = x^{−d+1+deg G}·v_{2d−1}(1/x)/f_{2d−1}(0).

6.3. A numerical example. Take k = F_7, d = 7, R0 = 2x^7 + 7x^6 + 1x^5 + 8x^4 + 2x^3 + 8x^2 + 1x + 8, and R1 = 3x^6 + 1x^5 + 4x^4 + 1x^3 + 5x^2 + 9x + 2. Then f = 2 + 7x + 1x^2 + 8x^3 + 2x^4 + 8x^5 + 1x^6 + 8x^7 and g = 3 + 1x + 4x^2 + 1x^3 + 5x^4 + 9x^5 + 2x^6. The gcdx algorithm computes (δ_13, f_13, g_13) = divstep^13(1, f, g) = (0, 2, 6); see Table 4.1 for all iterates of divstep on these inputs. Dividing f_13 by its constant coefficient produces 1, the reversal of gcd{R0, R1}. The gcddegree algorithm simply computes δ_13 = 0, which is enough information to see that gcd{R0, R1} = 1.

6.4. Constant-factor speedups. Compared to the general power-series setting in divstepsx, the inputs to gcddegree, gcdx, and recipx are more limited. For example, the f input is all 0 after the first d + 1 coefficients, so computations involving subsequent coefficients can be skipped. Similarly, the top-right entry of the matrix P in recipx is guaranteed to have coefficient 0 for 1, 1/x, 1/x^2, and so on through 1/x^{d−2}, so one can save time in the polynomial computations that produce this entry. See Theorems A.1 and A.2 for bounds on the sizes of all intermediate results.

6.5. Bottom-up gcd and inversion. Our division steps eliminate bottom coefficients of the inputs. Appendices B, C, and D relate this to a traditional Euclid–Stevin computation, which gradually eliminates top coefficients of the inputs. The reversal of inputs and outputs in gcddegree, gcdx, and recipx exchanges the role of top coefficients and bottom coefficients. The following theorem states an alternative: skip the reversal, and instead directly eliminate the bottom coefficients of the original inputs.

Theorem 6.6. Let k be a field. Let d be a positive integer. Let f, g be elements of the polynomial ring k[x] with f(0) ≠ 0, deg f ≤ d, and deg g < d. Define γ = gcd{f, g}.

Define (δ_n, f_n, g_n) = divstep^n(1, f, g), T_n = T(δ_n, f_n, g_n), and (u_n, v_n; q_n, r_n) = T_{n−1}···T_0. Then

    deg γ = deg f_{2d−1},
    γ = f_{2d−1}/ℓ,
    x^{2d−2}·γ ≡ x^{2d−2}·v_{2d−1}·g/ℓ (mod f),

where ℓ is the leading coefficient of f_{2d−1}.

Proof. Note that γ(0) ≠ 0 since f is a multiple of γ.

Define R0 = x^d·f(1/x). Then R0 ∈ k[x]; deg R0 = d since f(0) ≠ 0; and f = x^d·R0(1/x). Define R1 = x^{d−1}·g(1/x). Then R1 ∈ k[x]; deg R1 < d; and g = x^{d−1}·R1(1/x). Define G = gcd{R0, R1}. We will show below that γ = c·x^{deg G}·G(1/x) for some c ∈ k^*.

Then γ = c·f_{2d−1}/f_{2d−1}(0) by Theorem 6.2, i.e., γ is a constant multiple of f_{2d−1}. Hence deg γ = deg f_{2d−1} as claimed. Furthermore γ has leading coefficient 1, so 1 = c·ℓ/f_{2d−1}(0), i.e., γ = f_{2d−1}/ℓ as claimed.

By Theorem 4.2, f_{2d−1} = u_{2d−1}·f + v_{2d−1}·g. Also x^{2d−2}·u_{2d−1}, x^{2d−2}·v_{2d−1} ∈ k[x] by Theorem 4.3, so

    x^{2d−2}·γ = (x^{2d−2}·u_{2d−1}/ℓ)·f + (x^{2d−2}·v_{2d−1}/ℓ)·g ≡ (x^{2d−2}·v_{2d−1}/ℓ)·g (mod f)

as claimed.

All that remains is to show that γ = c·x^{deg G}·G(1/x) for some c ∈ k^*. If R1 = 0 then G = gcd{R0, 0} = R0/R0[d], so x^{deg G}·G(1/x) = x^d·R0(1/x)/R0[d] = f/f[0]; also g = 0, so γ = f/f[deg f] = (f[0]/f[deg f])·x^{deg G}·G(1/x). Assume from now on that R1 ≠ 0.

We have R1 = Q·G for some Q ∈ k[x]. Note that deg R1 = deg Q + deg G. Also R1(1/x) = Q(1/x)·G(1/x), so x^{deg R1}·R1(1/x) = x^{deg Q}·Q(1/x)·x^{deg G}·G(1/x) since deg R1 = deg Q + deg G. Hence x^{deg G}·G(1/x) divides the polynomial x^{deg R1}·R1(1/x), which in turn divides x^{d−1}·R1(1/x) = g. Similarly x^{deg G}·G(1/x) divides f. Hence x^{deg G}·G(1/x) divides gcd{f, g} = γ.

Furthermore G = A·R0 + B·R1 for some A, B ∈ k[x]. Take any integer m above all of deg G, deg A, deg B; then

    x^{m+d−deg G}·x^{deg G}·G(1/x) = x^{m+d}·G(1/x)
        = x^m·A(1/x)·x^d·R0(1/x) + x^{m+1}·B(1/x)·x^{d−1}·R1(1/x)
        = x^{m−deg A}·x^{deg A}·A(1/x)·f + x^{m+1−deg B}·x^{deg B}·B(1/x)·g,

so x^{m+d−deg G}·x^{deg G}·G(1/x) is a multiple of γ, so the polynomial γ/(x^{deg G}·G(1/x)) divides x^{m+d−deg G}, so γ/(x^{deg G}·G(1/x)) = c·x^i for some c ∈ k^* and some i ≥ 0. We must have i = 0 since γ(0) ≠ 0.

6.7. How to divide by x^{2d−2} modulo f. Unlike top-down inversion (and top-down gcd and bottom-up gcd), bottom-up inversion naturally scales its result by a power of x. We now explain various ways to remove this scaling.

For simplicity we focus here on the case gcd{f, g} = 1. The divstep computation in Theorem 6.6 naturally produces the polynomial x^{2d−2}·v_{2d−1}/ℓ, which has degree at most d − 1 by Theorem 6.2. Our objective is to divide this polynomial by x^{2d−2} modulo f, obtaining the reciprocal of g modulo f by Theorem 6.6. One way to do this is to multiply by a reciprocal of x^{2d−2} modulo f. This reciprocal depends only on d and f, i.e., it can be precomputed before g is available.

Here are several standard strategies to compute 1/x^{2d−2} modulo f, or more generally 1/x^e modulo f for any nonnegative integer e:

• Compute the polynomial (1 − f/f(0))/x, which is 1/x modulo f, and then raise this to the eth power modulo f. This takes a logarithmic number of multiplications modulo f.

• Start from 1/f(0), the reciprocal of f modulo x. Compute the reciprocals of f modulo x^2, x^4, x^8, etc. by Newton (Hensel) iteration. After finding the reciprocal r of f modulo x^e, divide 1 − rf by x^e to obtain the reciprocal of x^e modulo f. This takes a constant number of full-size multiplications, a constant number of half-size multiplications, etc. The total number of multiplications is again logarithmic, but if e is linear, as in the case e = 2d − 2, then the total size of the multiplications is linear, without an extra logarithmic factor.

• Combine the powering strategy with the Hensel strategy. For example, use the reciprocal of f modulo x^{d−1} to compute the reciprocal of x^{d−1} modulo f, and then square modulo f to obtain the reciprocal of x^{2d−2} modulo f, rather than using the reciprocal of f modulo x^{2d−2}.

• Write down a simple formula for the reciprocal of x^e modulo f if f has a special structure that makes this easy. For example, if f = x^d − 1 as in original NTRU, and d ≥ 3, then the reciprocal of x^{2d−2} is simply x^2. As another example, if f = x^d + x^{d−1} + ··· + 1 and d ≥ 5, then the reciprocal of x^{2d−2} is x^4.

In most cases the division by x^{2d−2} modulo f involves extra arithmetic that is avoided by the top-down reversal approach. On the other hand, the case f = x^d − 1 does not involve any extra arithmetic. There are also applications of inversion that can easily handle a scaled result.
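For concreteness, here is a small Sage sketch of the first (powering) strategy above; the function name is ours and is not part of the paper's software. It assumes f(0) ≠ 0 and a nonnegative integer e, and the final two lines check the special-structure example from the last bullet.

    def recip_x_power(f, e):
        R = f.parent(); x = R.gen()
        r = (1 - f*(1/f(0)))//x            # r is congruent to 1/x modulo f
        result = R(1)
        for bit in ZZ(e).digits(2)[::-1]:  # square-and-multiply: logarithmically many products mod f
            result = result^2 % f
            if bit: result = result*r % f
        return result

    kx.<x> = GF(3)[]
    f = kx(sum(x^i for i in range(8)))     # f = x^7 + x^6 + ... + 1, so d = 7
    assert recip_x_power(f, 2*7 - 2) == x^4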

7 Software case study for polynomial modular inversion

As a concrete case study, we consider the problem of inversion in the ring (Z/3)[x]/(x^700 + x^699 + ··· + x + 1) on an Intel Haswell CPU core. The situation before our work was that this inversion consumed half of the key-generation time in the ntruhrss701 cryptosystem, about 150000 cycles out of 300000 cycles. This cryptosystem was introduced by Hülsing, Rijneveld, Schanck, and Schwabe [39] at CHES 2017.^11

^11 NTRU-HRSS is another round-2 submission in NIST's ongoing post-quantum competition. Google has also selected NTRU-HRSS for its "CECPQ2" post-quantum experiment. CECPQ2 includes a slightly modified version of ntruhrss701, and the NTRU-HRSS authors have also announced their intention to tweak NTRU-HRSS; these changes do not affect the performance of key generation.

We saved 60000 cycles out of these 150000 cycles, reducing the total key-generation time to 240000 cycles. Our approach also simplifies code, for example eliminating the series of conditional power-of-2 rotations in [39].

7.1. Details of the new computation. Our computation works with one integer δ and four polynomials v, r, f, g ∈ (Z/3)[x]. Initially δ = 1; v = 0; r = 1; f = x^700 + x^699 + ··· + x + 1; and g = g_699·x^699 + ··· + g_0·x^0 is obtained by reversing the 700 input coefficients. We actually store −δ instead of δ; this turns out to be marginally more efficient for this platform.

We then apply 2·700 − 1 = 1399 iterations of a main loop. Each iteration works as follows, with the order of operations determined experimentally to try to minimize latency issues:

• Replace v with xv.

• Compute a swap mask as −1 if δ > 0 and g(0) ≠ 0, otherwise 0.

• Compute c ∈ Z/3 as f(0)g(0).

• Replace δ with −δ if the swap mask is set.

• Add 1 to δ.

• Replace (f, g) with (g, f) if the swap mask is set.

• Replace g with (g − cf)/x.

• Replace (v, r) with (r, v) if the swap mask is set.

• Replace r with r − cv.

This multiplies the transition matrix T(δ, f, g) into the vector (v/x^{n−1}; r/x^n), where n is the number of iterations, and at the same time replaces (δ, f, g) with divstep(δ, f, g). Actually, we are slightly modifying divstep here (and making the corresponding modification to T), replacing the original coefficients f(0) and g(0) with the scaled coefficients 1 and g(0)/f(0) = f(0)g(0), as in Section 3.4. In other words, if f(0) = −1 then we negate both f and g. We suppress further comments on this deviation from the definition of divstep.

The effect of n iterations is to replace the original (δ, f, g) with divstep^n(δ, f, g). The matrix product T_{n−1}···T_0 has the form (· , v/x^{n−1}; · , r/x^n). The new f and g are congruent to v/x^{n−1} and r/x^n respectively, times the original g, modulo x^700 + x^699 + ··· + x + 1.

After 1399 iterations we have δ = 0 if and only if the input is invertible modulo x^700 + x^699 + ··· + x + 1, by Theorem 6.2. Furthermore, if δ = 0, then the inverse is x^{−699}·v_{1399}(1/x)·f(0), where v_{1399} = v/x^{1398}; i.e., the inverse is x^699·v(1/x)·f(0). We thus multiply v by f(0) and take the coefficients of x^0, ..., x^699 in reverse order to obtain the desired inverse.
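The following Sage sketch (ours; variable-time and not bitsliced, unlike the actual software) mirrors the loop above and the recovery of the inverse, so the structure can be checked independently of the vectorized implementation.

    R.<x> = GF(3)[]
    modulus = R(sum(x^i for i in range(701)))      # x^700 + x^699 + ... + x + 1

    def invert_hrss(a):                            # a: polynomial of degree < 700 over GF(3)
        f, g = modulus, a.reverse(699)             # reverse the 700 input coefficients
        delta, v, r = 1, R(0), R(1)
        for _ in range(2*700 - 1):
            v = x*v
            swap = delta > 0 and g[0] != 0
            c = f[0]*g[0]
            if swap: delta, f, g, v, r = -delta, g, f, r, v
            delta = delta + 1
            g = (g - c*f)//x
            r = r - c*v
        assert delta == 0                          # delta = 0 iff the input was invertible
        return (f[0]*v).reverse(699)               # multiply by f(0), reverse the coefficients

For an invertible input a, one can check that (invert_hrss(a)*a) % modulus == 1.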

We use signed representatives −1, 0, 1 for elements of Z/3. We use 2-bit two's-complement representations of −1, 0, 1: the bottom bit is 1, 0, 1 respectively, and the top bit is 1, 0, 0. We wrote a simple superoptimizer to find optimal sequences of 3, 6, 6 bit operations respectively for multiplication, addition, and subtraction. We do not claim novelty for the approach in this paragraph: Boothby and Bradshaw in [20, Section 4.1] suggested the same 2-bit representation for Z/3 (without explaining it as signed two's-complement) and found the same operation counts by a similar search.

We store each of the four polynomials v, r, f, g as six 256-bit vectors: for example, the coefficients of x^0, ..., x^255 in v are stored as a 256-bit vector of bottom bits and a 256-bit vector of top bits. Within each 256-bit vector we use the first 64-bit word to store the coefficients of x^0, x^4, ..., x^252; the next 64-bit word to store the coefficients of x^1, x^5, ..., x^253; etc. Multiplying by x thus moves the first 64-bit word to the second, moves the second to the third, moves the third to the fourth, moves the fourth to the first, and shifts the first by 1 bit; the bit that falls off the end of the first word is inserted into the next vector.

The first 256 iterations involve only the first 256-bit vectors for v and r, so we simply leave the second and third vectors as 0. The next 256 iterations involve only the first two 256-bit vectors for v and r. Similarly, we manipulate only the first 256-bit vectors for f and g in the final 256 iterations, and we manipulate only the first and second 256-bit vectors for f and g in the previous 256 iterations. Our main loop thus has five sections: 256 iterations involving (1, 1, 3, 3) vectors for (v, r, f, g); 256 iterations involving (2, 2, 3, 3) vectors; 375 iterations involving (3, 3, 3, 3) vectors; 256 iterations involving (3, 3, 2, 2) vectors; and 256 iterations involving (3, 3, 1, 1) vectors. This makes the code longer but saves almost 20% of the vector manipulations.
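To make the 2-bit signed representation above concrete, here is a small Python check (ours, not the paper's superoptimizer output) of the 3-operation multiplication; the 6-operation addition and subtraction circuits are not reproduced here. In the real software the same bit operations are applied lane-wise to whole 256-bit vectors.

    def encode(a):                      # a in {-1, 0, 1}; returns (bottom bit, top bit)
        return (1 if a != 0 else 0, 1 if a == -1 else 0)

    def decode(bot, top):
        return -1 if top else bot

    def mul3(abot, atop, bbot, btop):   # 3 bit operations: AND, XOR, AND
        bot = abot & bbot               # product is nonzero iff both factors are nonzero
        top = (atop ^ btop) & bot       # product is -1 iff exactly one factor is -1
        return bot, top

    for a in (-1, 0, 1):
        for b in (-1, 0, 1):
            assert decode(*mul3(*encode(a), *encode(b))) == a*b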

7.2. Comparison to the old computation. Many features of our computation are also visible in the software from [39]. We emphasize the ways that our computation is more streamlined.

There are essentially the same four polynomials v, r, f, g in [39] (labeled "c", "b", "g", "f"). Instead of a single integer δ (plus a loop counter), there are four integers "degf", "degg", "k", and "done" (plus a loop counter). One cannot simply eliminate "degf" and "degg" in favor of their difference δ, since "done" is set when "degf" reaches 0, and "done" influences "k", which in turn influences the results of the computation: the final result is multiplied by x^{701−k}.

Multiplication by x^{701−k} is a conceptually simple matter of rotating by k positions within 701 positions and then reducing modulo x^700 + x^699 + ··· + x + 1, since x^701 − 1 is a multiple of x^700 + x^699 + ··· + x + 1. However, since k is a variable, the obvious method of rotating by k positions does not take constant time; [39] instead performs conditional rotations by 1 position, 2 positions, 4 positions, etc., where the condition bits are the bits of k.

The inputs and outputs are not reversed in [39]. This affects the power of x multiplied into the final results. Our approach skips the powers of x entirely.

Each polynomial in [39] is represented as six 256-bit vectors, with the five-section pattern skipping manipulations of some vectors as explained above. The order of coefficients in each vector is simply x^0, x^1, ...; multiplication by x in [39] uses more arithmetic instructions than our approach does.

At a lower level, [39] uses unsigned representatives 0, 1, 2 of Z/3, and uses a manually designed sequence of 19 bit operations for a multiply-add operation in Z/3. Langley [43] suggested a sequence of 16 bit operations, or 14 if an "and-not" operation is available. As in [20], we use just 9 bit operations.

The inversion software from [39] is 798 lines (32273 bytes) of Python generating assembly, producing 26634 bytes of object code. Our inversion software is 541 lines (14675 bytes) of C with intrinsics, compiled with gcc 8.2.1 to generate assembly, producing 10545 bytes of object code.

7.3. More NTRU examples. More than 1/3 of the key-generation time in [39] is consumed by a second inversion, which replaces the modulus 3 with 2^13. This is handled in [39] with an initial inversion modulo 2, followed by 8 multiplications modulo 2^13 for a Newton iteration.^12 The initial inversion modulo 2 is handled in [39] by exponentiation, taking just 10332 cycles; the point here is that squaring modulo 2 is particularly efficient.

^12 This use of Newton iteration follows the NTRU key-generation strategy suggested in [61], but we point out a difference in the details. All 8 multiplications in [39] are carried out modulo 2^13, while [61] instead follows the traditional precision doubling in Newton iteration: compute an inverse modulo 2^2, then 2^4, then 2^8 (or 2^7), then 2^13. Perhaps limiting the precision of the first 6 multiplications would save time in [39].

Much more inversion time is spent in key generation in sntrup4591761, a slightly larger cryptosystem from the NTRU Prime [14] family.^13 NTRU Prime uses a prime modulus, such as 4591 for sntrup4591761, to avoid concerns that a power-of-2 modulus could allow attacks (see generally [14]), but this modulus also makes arithmetic slower. Before our work, the key-generation software for sntrup4591761 took 6 million cycles, partly for inversion in (Z/3)[x]/(x^761 − x − 1) but mostly for inversion in (Z/4591)[x]/(x^761 − x − 1). We reimplemented both of these inversion steps, reducing the total key-generation time to just 940852 cycles.^14 Almost all of our inversion code modulo 3 is shared between sntrup4591761 and ntruhrss701.

^13 NTRU Prime is another round-2 NIST submission. One other NTRU-based encryption system, NTRUEncrypt [29], was submitted to the competition but has been merged into NTRU-HRSS for round 2; see [59]. For completeness we mention that NTRUEncrypt key generation took 980760 cycles for ntrukem443 and 2639756 cycles for ntrukem743. As far as we know, the NTRUEncrypt software was not designed to run in constant time.

^14 This is a median cycle count from the SUPERCOP benchmarking framework. There is considerable variation in the cycle counts, more than 3% between quartiles, presumably depending on the mapping from virtual addresses to physical addresses. There is no dependence on the secret input being inverted.

7.4. Does lattice-based cryptography need fast inversion? Lyubashevsky, Peikert, and Regev [46] published an NTRU alternative that avoids inversions.^15 However, this cryptosystem has slower encryption than NTRU, has slower decryption than NTRU, and appears to be covered by a 2010 patent [35] by Gaborit and Aguilar-Melchor. It is easy to envision applications that will (1) select NTRU and (2) benefit from faster inversions in NTRU key generation.^16

^15 The LPR cryptosystem is the basis for several round-2 NIST submissions: Kyber, LAC, NewHope, Round5, Saber, ThreeBears, and the "NTRU LPRime" option in NTRU Prime.

^16 Google, for example, selected NTRU-HRSS for CECPQ2, as noted above, with the following explanation: "Schemes with a quotient-style key (like HRSS) will probably have faster encap/decap operations at the cost of much slower key-generation. Since there will be many uses outside TLS where keys can be reused, this is interesting as long as the key-generation speed is still reasonable for TLS." See [42].

Lyubashevsky and Seiler have very recently announced "NTTRU" [47], a variant of NTRU with a ring chosen to allow very fast NTT-based (FFT-based) multiplication and division. This variant triggers many of the security concerns described in [14], but if the variant survives security analysis then it will eliminate concerns about key-generation speed in NTRU.

8 Definition of 2-adic division steps

Define divstep: Z × Z*_2 × Z_2 → Z × Z*_2 × Z_2 as follows:

    divstep(δ, f, g) = (1 − δ, g, (g − f)/2)                 if δ > 0 and g is odd,
                       (1 + δ, f, (g + (g mod 2)f)/2)        otherwise.

Note that g − f is even in the first case (since both f and g are odd), and g + (g mod 2)f is even in the second case (since f is odd).
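A direct Python transcription of this definition (a sketch; the constant-time software in Section 12 never branches like this) makes it easy to experiment with small inputs:

    def divstep(delta, f, g):
        assert f & 1                                   # f must be odd
        if delta > 0 and g & 1:
            return 1 - delta, g, (g - f) >> 1
        return 1 + delta, f, (g + (g & 1)*f) >> 1

    # Iterating eventually sends g to 0 and leaves f = +-gcd (see Sections 8.3 and 11):
    delta, f, g = 1, 21, 35
    while g != 0:
        delta, f, g = divstep(delta, f, g)
    assert abs(f) == 7                                 # gcd(21, 35)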

8.1. Transition matrices. Write (δ_1, f_1, g_1) = divstep(δ, f, g). Then

    (f_1; g_1) = T(δ, f, g)·(f; g)   and   (1; δ_1) = S(δ, f, g)·(1; δ),

where T: Z × Z*_2 × Z_2 → M_2(Z[1/2]) is defined by

    T(δ, f, g) = (0, 1; −1/2, 1/2)              if δ > 0 and g is odd,
                 (1, 0; (g mod 2)/2, 1/2)        otherwise,

and S: Z × Z*_2 × Z_2 → M_2(Z) is defined by

    S(δ, f, g) = (1, 0; 1, −1)   if δ > 0 and g is odd,
                 (1, 0; 1, 1)    otherwise.

Note that both T(δ, f, g) and S(δ, f, g) are defined entirely by δ, the bottom bit of f (which is always 1), and the bottom bit of g; they do not depend on the remaining bits of f and g.

8.2. Decomposition. This 2-adic divstep, like the power-series divstep, is a composition of two simpler functions. Specifically, it is a conditional swap that replaces (δ, f, g) with (−δ, g, −f) if δ > 0 and g is odd, followed by an elimination that replaces (δ, f, g) with (1 + δ, f, (g + (g mod 2)f)/2).

8.3. Sizes. Write (δ_1, f_1, g_1) = divstep(δ, f, g). If f and g are integers that fit into n + 1 bits two's-complement, in the sense that −2^n ≤ f < 2^n and −2^n ≤ g < 2^n, then also −2^n ≤ f_1 < 2^n and −2^n ≤ g_1 < 2^n. Similarly, if −2^n < f < 2^n and −2^n < g < 2^n, then −2^n < f_1 < 2^n and −2^n < g_1 < 2^n. Beware, however, that intermediate results such as g − f need an extra bit.

As in the polynomial case, an important feature of divstep is that iterating divstep on integer inputs eventually sends g to 0. Concretely, we will show that (D + o(1))n iterations are sufficient, where D = 2/(10 − log_2 633) = 2.882100569...; see Theorem 11.2 for exact bounds. The proof technique cannot do better than (d + o(1))n iterations, where d = 14/(15 − log_2(561 + √249185)) = 2.828339631.... Numerical evidence is consistent with the possibility that (d + o(1))n iterations are required in the worst case, which is what matters in the context of constant-time algorithms.

8.4. A non-functioning variant. Many variations are possible in the definition of divstep. However, some care is required, as the following variant illustrates.

Define posdivstep: Z × Z*_2 × Z_2 → Z × Z*_2 × Z_2 as follows:

    posdivstep(δ, f, g) = (1 − δ, g, (g + f)/2)               if δ > 0 and g is odd,
                          (1 + δ, f, (g + (g mod 2)f)/2)      otherwise.

This is just like divstep except that it eliminates the negation of f in the swapped case. It is not true that iterating posdivstep on integer inputs eventually sends g to 0. For example, posdivstep^2(1, 1, 1) = (1, 1, 1). For comparison, in the polynomial case we were free to negate f (or g) at any moment.

8.5. A centered variant. As another variant of divstep, we define cdivstep: Z × Z*_2 × Z_2 → Z × Z*_2 × Z_2 as follows:

    cdivstep(δ, f, g) = (1 − δ, g, (f − g)/2)                 if δ > 0 and g is odd,
                        (1 + δ, f, (g + (g mod 2)f)/2)        if δ = 0,
                        (1 + δ, f, (g − (g mod 2)f)/2)        otherwise.

Iterating cdivstep produces the same intermediate results as a variable-time gcd algorithm introduced by Stehlé and Zimmermann in [62]. The analogous gcd algorithm for divstep is a new variant of the Stehlé–Zimmermann algorithm, and we analyze the performance of this variant to prove bounds on the performance of divstep; see Appendices E, F, and G. As an analogy, one of our proofs in the polynomial case relies on relating x-adic divstep to the Euclid–Stevin algorithm.

The extra case distinctions make each cdivstep iteration more complicated than each divstep iteration. Similarly, the Stehlé–Zimmermann algorithm is more complicated than the new variant. To understand the motivation for cdivstep, compare divstep^3(−2, f, g) to cdivstep^3(−2, f, g). One has divstep^3(−2, f, g) = (1, f, g_3) where

    8g_3 ∈ {g, g + f, g + 2f, g + 3f, g + 4f, g + 5f, g + 6f, g + 7f}.

One has cdivstep^3(−2, f, g) = (1, f, g'_3) where

    8g'_3 ∈ {g − 3f, g − 2f, g − f, g, g + f, g + 2f, g + 3f, g + 4f}.

It seems intuitively clear that the "centered" output g'_3 is smaller than the "uncentered" output g_3, that more generally cdivstep has smaller outputs than divstep, and that cdivstep thus uses fewer iterations than divstep.

However, our analyses do not support this intuition. All available evidence indicates that, surprisingly, the worst case of cdivstep uses almost 10% more iterations than the worst case of divstep. Concretely, our proof technique cannot do better than (c + o(1))n iterations for cdivstep, where c = 2/(log_2(√17 − 1) − 1) = 3.110510062..., and numerical evidence is consistent with the possibility that (c + o(1))n iterations are required.

8.6. A plus-or-minus variant. As yet another example, the following variant of divstep is essentially^17 the Brent–Kung "Algorithm PM" from [24]:

    pmdivstep(δ, f, g) = (−δ, g, (f − g)/2)     if δ ≥ 0 and (g − f) mod 4 = 0,
                         (−δ, g, (f + g)/2)     if δ ≥ 0 and (g + f) mod 4 = 0,
                         (δ, f, (g − f)/2)      if δ < 0 and (g − f) mod 4 = 0,
                         (δ, f, (g + f)/2)      if δ < 0 and (g + f) mod 4 = 0,
                         (1 + δ, f, g/2)        otherwise.

^17 The Brent–Kung algorithm has an extra loop to allow the case that both inputs f, g are even. Compare [24, Figure 4] to the definition of pmdivstep.

There are even more cases here than in cdivstep, although splitting out an initial conditional swap reduces the number of cases to 3. Another complication here is that the linear combinations used in pmdivstep depend on the bottom two bits of f and g, rather than just the bottom bit.

To understand the motivation for pmdivstep, note that after each addition or subtraction there are always at least two halvings, as in the "sliding window" approach to exponentiation. Again it seems intuitively clear that this reduces the number of iterations required. Consider, however, the following example (which is from [24, Theorem 3], modulo minor details): pmdivstep needs 3b − 1 iterations to reach g = 0 starting from (1, 1, 3·2^{b−2}). Specifically, the first b − 2 iterations produce

    (2, 1, 3·2^{b−3}), (3, 1, 3·2^{b−4}), ..., (b − 2, 1, 3·2), (b − 1, 1, 3),

and then the next 2b + 1 iterations produce

    (1 − b, 3, 2), (2 − b, 3, 1), (2 − b, 3, 2), (3 − b, 3, 1), (3 − b, 3, 2), ..., (−1, 3, 1), (−1, 3, 2), (0, 3, 1), (0, 1, 2), (1, 1, 1), (−1, 1, 0).

Brent and Kung prove that ⌈cn⌉ systolic cells are enough for their algorithm, where c = 3.110510062... is the constant mentioned above, and they conjecture that (3 + o(1))n cells are enough, so presumably (3 + o(1))n iterations of pmdivstep are enough. Our analysis shows that fewer iterations are sufficient for divstep, even though divstep is simpler than pmdivstep.

9 Iterates of 2-adic division steps

The following results are analogous to the x-adic results in Section 4. We state these results separately to support verification.

Starting from (δ, f, g) ∈ Z × Z*_2 × Z_2, define (δ_n, f_n, g_n) = divstep^n(δ, f, g) ∈ Z × Z*_2 × Z_2, T_n = T(δ_n, f_n, g_n) ∈ M_2(Z[1/2]), and S_n = S(δ_n, f_n, g_n) ∈ M_2(Z).

Theorem 9.1. (f_n; g_n) = T_{n−1}···T_m·(f_m; g_m) and (1; δ_n) = S_{n−1}···S_m·(1; δ_m) if n ≥ m ≥ 0.

Proof. This follows from (f_{m+1}; g_{m+1}) = T_m·(f_m; g_m) and (1; δ_{m+1}) = S_m·(1; δ_m).

Theorem 9.2. T_{n−1}···T_m ∈ ((1/2^{n−m−1})Z, (1/2^{n−m−1})Z; (1/2^{n−m})Z, (1/2^{n−m})Z) if n > m ≥ 0.

Proof. As in Theorem 4.3.

Theorem 9.3. S_{n−1}···S_m ∈ (1, 0; {2 − (n − m), ..., n − m − 2} ∪ {n − m}, {1, −1}) if n > m ≥ 0.

Proof. As in Theorem 4.4.

Theorem 9.4. Let m, t be nonnegative integers. Assume that δ'_m = δ_m, f'_m ≡ f_m (mod 2^t), and g'_m ≡ g_m (mod 2^t). Then, for each integer n with m < n ≤ m + t: T'_{n−1} = T_{n−1}; S'_{n−1} = S_{n−1}; δ'_n = δ_n; f'_n ≡ f_n (mod 2^{t−(n−m−1)}); and g'_n ≡ g_n (mod 2^{t−(n−m)}).

Proof. Fix m, t and induct on n. Observe that (δ'_{n−1}, f'_{n−1} mod 2, g'_{n−1} mod 2) equals (δ_{n−1}, f_{n−1} mod 2, g_{n−1} mod 2):

• If n = m + 1: By assumption f'_m ≡ f_m (mod 2^t), and n ≤ m + t so t ≥ 1, so in particular f'_m mod 2 = f_m mod 2. Similarly g'_m mod 2 = g_m mod 2. By assumption δ'_m = δ_m.

• If n > m + 1: f'_{n−1} ≡ f_{n−1} (mod 2^{t−(n−m−2)}) by the inductive hypothesis, and n ≤ m + t so t − (n − m − 2) ≥ 2, so in particular f'_{n−1} mod 2 = f_{n−1} mod 2; similarly g'_{n−1} ≡ g_{n−1} (mod 2^{t−(n−m−1)}) by the inductive hypothesis and t − (n − m − 1) ≥ 1, so in particular g'_{n−1} mod 2 = g_{n−1} mod 2; and δ'_{n−1} = δ_{n−1} by the inductive hypothesis.

Now use the fact that T and S inspect only the bottom bit of each number to see that

    T'_{n−1} = T(δ'_{n−1}, f'_{n−1}, g'_{n−1}) = T(δ_{n−1}, f_{n−1}, g_{n−1}) = T_{n−1},
    S'_{n−1} = S(δ'_{n−1}, f'_{n−1}, g'_{n−1}) = S(δ_{n−1}, f_{n−1}, g_{n−1}) = S_{n−1}

as claimed. Multiply (1; δ'_{n−1}) = (1; δ_{n−1}) by S'_{n−1} = S_{n−1} to see that δ'_n = δ_n as claimed.

By the inductive hypothesis (T'_m, ..., T'_{n−2}) = (T_m, ..., T_{n−2}), and T'_{n−1} = T_{n−1}, so T'_{n−1}···T'_m = T_{n−1}···T_m. Abbreviate T_{n−1}···T_m as P. Then (f'_n; g'_n) = P·(f'_m; g'_m) and (f_n; g_n) = P·(f_m; g_m) by Theorem 9.1, so

    (f'_n − f_n; g'_n − g_n) = P·(f'_m − f_m; g'_m − g_m).

By Theorem 9.2, P ∈ ((1/2^{n−m−1})Z, (1/2^{n−m−1})Z; (1/2^{n−m})Z, (1/2^{n−m})Z). By assumption, f'_m − f_m and g'_m − g_m are multiples of 2^t. Multiply to see that f'_n − f_n is a multiple of 2^t/2^{n−m−1} = 2^{t−(n−m−1)}, as claimed, and g'_n − g_n is a multiple of 2^t/2^{n−m} = 2^{t−(n−m)}, as claimed.

def truncate(f, t):
    if t == 0: return 0
    twot = 1 << (t-1)
    return ((f + twot) & (2*twot - 1)) - twot

def divsteps2(n, t, delta, f, g):
    assert t >= n and n >= 0
    f, g = truncate(f, t), truncate(g, t)
    u, v, q, r = 1, 0, 0, 1

    while n > 0:
        f = truncate(f, t)
        if delta > 0 and g & 1:
            delta, f, g, u, v, q, r = -delta, g, -f, q, r, -u, -v
        g0 = g & 1
        delta, g, q, r = 1+delta, (g+g0*f)/2, (q+g0*u)/2, (r+g0*v)/2
        n, t = n-1, t-1
        g = truncate(ZZ(g), t)

    M2Q = MatrixSpace(QQ, 2)
    return delta, f, g, M2Q((u, v, q, r))

Figure 10.1: Algorithm divsteps2 to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). Inputs: n, t ∈ Z with 0 ≤ n ≤ t; δ ∈ Z; at least the bottom t bits of f, g ∈ Z_2. Outputs: δ_n; bottom t bits of f_n if n = 0, or t − (n − 1) bits if n ≥ 1; bottom t − n bits of g_n; T_{n−1}···T_0.

10 Fast computation of iterates of 2-adic division steps

We describe, just as in Section 5, two algorithms to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). We are given the bottom t bits of f, g and compute f_n, g_n to t − n + 1, t − n bits respectively if n ≥ 1; see Theorem 9.4. We also compute the product of the relevant T matrices. We specify these algorithms in Figures 10.1–10.2 as functions divsteps2 and jumpdivsteps2 in Sage.

The first function, divsteps2, simply takes divsteps one after another, analogously to divstepsx. The second function, jumpdivsteps2, uses a straightforward divide-and-conquer strategy, analogously to jumpdivstepsx. More generally, j in jumpdivsteps2 can be taken anywhere between 1 and n − 1; this generalization includes divsteps2 as a special case.

As in Section 5, the Sage functions are not constant-time, but it is easy to build constant-time versions of the same computations. The comments on performance in Section 5 are applicable here, with integer arithmetic replacing polynomial arithmetic. The same basic optimization techniques are also applicable here.

from divsteps2 import divsteps2, truncate

def jumpdivsteps2(n, t, delta, f, g):
    assert t >= n and n >= 0
    if n <= 1: return divsteps2(n, t, delta, f, g)

    j = n//2

    delta, f1, g1, P1 = jumpdivsteps2(j, j, delta, f, g)
    f, g = P1*vector((f, g))
    f, g = truncate(ZZ(f), t-j), truncate(ZZ(g), t-j)

    delta, f2, g2, P2 = jumpdivsteps2(n-j, n-j, delta, f, g)
    f, g = P2*vector((f, g))
    f, g = truncate(ZZ(f), t-n+1), truncate(ZZ(g), t-n)

    return delta, f, g, P2*P1

Figure 10.2: Algorithm jumpdivsteps2 to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). Same inputs and outputs as in Figure 10.1.
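As a quick consistency check (ours; it assumes the code of Figures 10.1 and 10.2 has been saved as modules named divsteps2 and jumpdivsteps2, following the figures' own import lines), the two functions can be compared on random inputs; with the truncations used, their outputs must agree exactly.

    from divsteps2 import divsteps2
    from jumpdivsteps2 import jumpdivsteps2

    for _ in range(100):
        n = ZZ.random_element(1, 50)
        t = n + ZZ.random_element(0, 20)
        delta = ZZ.random_element(-10, 10)
        f = 2*ZZ.random_element(2^t) + 1      # odd f, as required
        g = ZZ.random_element(2^t)
        assert divsteps2(n, t, delta, f, g) == jumpdivsteps2(n, t, delta, f, g)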

11 Fast integer gcd computation and modular inversion

This section presents two algorithms. If f is an odd integer and g is an integer, then gcd2 computes gcd{f, g}. If also gcd{f, g} = 1, then recip2 computes the reciprocal of g modulo f. Figure 11.1 specifies both algorithms, and Theorem 11.2 justifies both algorithms.

The algorithms take the smallest nonnegative integer d such that |f| < 2^d and |g| < 2^d. More generally, one can take d as an extra input; the algorithms then take constant time if d is constant. Theorem 11.2 relies on f^2 + 4g^2 ≤ 5·2^{2d} (which is certainly true if |f| < 2^d and |g| < 2^d) but does not rely on d being minimal. One can further generalize to allow even f, by first finding the number of shared powers of 2 in f and g and then reducing to the odd case.

The algorithms then choose a positive integer m as a particular function of d, and call divsteps2 (which can be transparently replaced with jumpdivsteps2) to compute (δ_m, f_m, g_m) = divstep^m(1, f, g). The choice of m guarantees g_m = 0 and f_m = ±gcd{f, g} by Theorem 11.2.

The gcd2 algorithm uses input precision m + d to obtain d + 1 bits of f_m, i.e., to obtain the signed remainder of f_m modulo 2^{d+1}. This signed remainder is exactly f_m = ±gcd{f, g}, since the assumptions |f| < 2^d and |g| < 2^d imply |gcd{f, g}| < 2^d.

The recip2 algorithm uses input precision just m + 1 to obtain the signed remainder of f_m modulo 4. This algorithm assumes gcd{f, g} = 1, so f_m = ±1, so the signed remainder is exactly f_m. The algorithm also extracts the top-right corner v_m of the transition matrix, and multiplies by f_m, obtaining a reciprocal of g modulo f by Theorem 11.2.

Both gcd2 and recip2 are bottom-up algorithms, and in particular recip2 naturally obtains this reciprocal v_m·f_m as a fraction with denominator 2^{m−1}. It multiplies by 2^{m−1} to obtain an integer, and then multiplies by ((f + 1)/2)^{m−1} modulo f to divide by 2^{m−1} modulo f. The reciprocal of 2^{m−1} can be computed ahead of time. Another way to compute this reciprocal is through Hensel lifting to compute f^{−1} (mod 2^{m−1}); for details and further techniques see Section 6.7.

We emphasize that recip2 assumes that its inputs are coprime; it does not notice if this assumption is violated. One can check the assumption by computing f_m to higher precision (as in gcd2), or by multiplying the supposed reciprocal by g and seeing whether the result is 1 modulo f. For comparison, the recipx algorithm from Section 6 does notice if its polynomial inputs are coprime, using a quick test of δ_m.

from divsteps2 import divsteps2

def iterations(d):
    return (49*d + 80)//17 if d < 46 else (49*d + 57)//17

def gcd2(f, g):
    assert f & 1
    d = max(f.nbits(), g.nbits())
    m = iterations(d)
    delta, fm, gm, P = divsteps2(m, m+d, 1, f, g)
    return abs(fm)

def recip2(f, g):
    assert f & 1
    d = max(f.nbits(), g.nbits())
    m = iterations(d)
    precomp = Integers(f)((f+1)/2)^(m-1)
    delta, fm, gm, P = divsteps2(m, m+1, 1, f, g)
    V = sign(fm)*ZZ(P[0][1]*2^(m-1))
    return ZZ(V*precomp)

Figure 11.1: Algorithm gcd2 to compute gcd{f, g}; algorithm recip2 to compute the reciprocal of g modulo f when gcd{f, g} = 1. Both algorithms assume that f is odd.
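A short Sage sanity check (ours; it assumes the code of Figure 11.1 has been loaded) exercises both algorithms on random inputs:

    for _ in range(100):
        f = 2*ZZ.random_element(1, 2^254) + 1      # random odd f >= 3
        g = ZZ.random_element(2^255)
        assert gcd2(f, g) == gcd(f, g)
        if gcd(f, g) == 1:
            assert recip2(f, g)*g % f == 1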

Theorem 11.2. Let f be an odd integer. Let g be an integer. Define (δ_n, f_n, g_n) = divstep^n(1, f, g), T_n = T(δ_n, f_n, g_n), and (u_n, v_n; q_n, r_n) = T_{n−1}···T_0. Let d be a real number. Assume that f^2 + 4g^2 ≤ 5·2^{2d}. Let m be an integer. Assume that m ≥ ⌊(49d + 80)/17⌋ if d < 46, and that m ≥ ⌊(49d + 57)/17⌋ if d ≥ 46. Then m ≥ 1; g_m = 0; f_m = ±gcd{f, g}; 2^{m−1}v_m ∈ Z; and 2^{m−1}v_m·g ≡ 2^{m−1}f_m (mod f).

In particular, if gcd{f, g} = 1, then f_m = ±1, and dividing 2^{m−1}v_m·f_m by 2^{m−1} modulo f produces the reciprocal of g modulo f.

Proof. First f^2 ≥ 1, so 1 ≤ f^2 + 4g^2 ≤ 5·2^{2d} < 2^{2d+7/3}, since 5^3 < 2^7. Hence d > −7/6. If d < 46 then m ≥ ⌊(49d + 80)/17⌋ ≥ ⌊(49·(−7/6) + 80)/17⌋ = 1. If d ≥ 46 then m ≥ ⌊(49d + 57)/17⌋ ≥ ⌊(49·46 + 57)/17⌋ = 135. Either way m ≥ 1 as claimed.

Define R0 = f, R1 = 2g, and G = gcd{R0, R1}. Then G = gcd{f, g} since f is odd. Define b = log_2 √(R0^2 + R1^2) = log_2 √(f^2 + 4g^2) ≥ 0. Then 2^{2b} = f^2 + 4g^2 ≤ 5·2^{2d}, so b ≤ d + log_4 5. We now split into three cases, in each case defining n as in Theorem G.6:

• If b ≤ 21: Define n = ⌊19b/7⌋. Then n ≤ ⌊49b/17⌋ ≤ ⌊49(d + log_4 5)/17⌋ ≤ ⌊(49d + 57)/17⌋ ≤ m, since 5^{49} ≤ 4^{57}.

• If 21 < b ≤ 46: Define n = ⌊(49b + 23)/17⌋. If d ≥ 46 then n ≤ ⌊(49d + 23)/17⌋ ≤ ⌊(49d + 57)/17⌋ ≤ m. Otherwise d < 46, so n ≤ ⌊(49(d + log_4 5) + 23)/17⌋ ≤ ⌊(49d + 80)/17⌋ ≤ m.

• If b > 46: Define n = ⌊49b/17⌋. Then n ≤ ⌊49b/17⌋ ≤ ⌊49(d + log_4 5)/17⌋ ≤ ⌊(49d + 57)/17⌋ ≤ m.

In all cases, n ≤ m. Now divstep^n(1, R0, R1/2) ∈ Z × Z × {0} by Theorem G.6, so divstep^m(1, R0, R1/2) ∈ Z × Z × {0}, so

    (δ_m, f_m, g_m) = divstep^m(1, f, g) = divstep^m(1, R0, R1/2) ∈ Z × {−G, G} × {0}

by Theorem G.3. Hence g_m = 0 as claimed, and f_m = ±G = ±gcd{f, g} as claimed.

Both u_m and v_m are in (1/2^{m−1})Z by Theorem 9.2. In particular 2^{m−1}v_m ∈ Z as claimed. Finally, f_m = u_m·f + v_m·g by Theorem 9.1, so 2^{m−1}f_m = 2^{m−1}u_m·f + 2^{m−1}v_m·g, and 2^{m−1}u_m ∈ Z, so 2^{m−1}f_m ≡ 2^{m−1}v_m·g (mod f) as claimed.

12 Software case study for integer modular inversion

As emphasized in Section 1, we have selected an inversion problem to be extremely favorable to Fermat's method. We use Intel platforms with 64-bit multipliers. We focus on the problem of inverting modulo the well-known Curve25519 prime p = 2^255 − 19. There has been a long line of Curve25519 implementation papers, starting with [10] in 2006; we compare to the latest speed records [54] (in assembly) for inversion modulo this prime.

Despite all these Fermat advantages, we achieve better speeds using our 2-adic division steps. As mentioned in Section 1, we take 10050 cycles, 8778 cycles, and 8543 cycles on Haswell, Skylake, and Kaby Lake, where [54] used 11854 cycles, 9301 cycles, and 8971 cycles. This section looks more closely at how the new computation works.

12.1. Details of the new computation. Theorem 11.2 guarantees that 738 divstep iterations suffice, since ⌊(49·255 + 57)/17⌋ = 738. To simplify the computation we actually use 744 = 12·62 iterations.

The decisions in the first 62 iterations are determined entirely by the bottom 62 bits of the inputs. We perform these iterations entirely in 64-bit words, apply the resulting 62-step transition matrix to the entire original inputs to jump to the result of 62 divstep iterations, and then repeat.

This is similar in two important ways to Lehmer's gcd computation from [44]. One of Lehmer's ideas, mentioned before, is to carry out some steps in limited precision. Another of Lehmer's ideas is for this limit to be a small constant. Our advantage over Lehmer's computation is that we have a completely regular constant-time data flow.

There are many other possibilities for which precision to use at each step. All of these possibilities can be described using the jumping framework described in Section 5.3, which applies to both the x-adic case and the 2-adic case. Figure 12.2 gives three examples of jump strategies. The first strategy computes each divstep in turn, to full precision, as in the case study in Section 7. The second strategy is what we use in this case study. The third strategy illustrates an asymptotically faster choice of jump distances. The third strategy might be more efficient than the second strategy in hardware, but the second strategy seems optimal for 64-bit CPUs.

Figure 12.2: Three jump strategies for 744 divstep iterations on 255-bit integers. Left strategy: j = 1, i.e., computing each divstep in turn to full precision. Middle strategy: j = 1 for ≤ 62 requested iterations, else j = 62; i.e., using 62 bottom bits to compute 62 iterations, jumping 62 iterations, using 62 new bottom bits to compute 62 more iterations, etc. Right strategy: j = 1 for 2 requested iterations; j = 2 for 3 or 4 requested iterations; j = 4 for 5, 6, 7, or 8 requested iterations; etc. Vertical axis: 0 on top through 744 on bottom, number of iterations. Horizontal axis (within each strategy): 0 on left through 254 on right, bit positions used in computation.

In more detail, we invert a 255-bit x modulo p as follows.

1. Set f = p, g = x, δ = 1, i = 1.

2. Set f_0 = f mod 2^64, g_0 = g mod 2^64.

3. Compute (δ', f_1, g_1) = divstep^62(δ, f_0, g_0) and obtain a scaled transition matrix T_i such that (T_i/2^62)·(f_0; g_0) = (f_1; g_1). The 63-bit signed entries of T_i fit into 64-bit registers. We call this step jump64divsteps2.

4. Compute (f, g) ← T_i·(f, g)/2^62. Set δ = δ'.

5. Set i ← i + 1. Go back to step 3 if i ≤ 12.

6. Compute v mod p, where v is the top-right corner of T_12·T_11···T_1:

(a) Compute pair-products T_{2i}·T_{2i−1}, with entries being 126-bit signed integers, which fit into two 64-bit limbs (radix 2^64).

(b) Compute quad-products T_{4i}·T_{4i−1}·T_{4i−2}·T_{4i−3}, with entries being 252-bit signed integers (four 64-bit limbs, radix 2^64).

(c) At this point the three quad-products are converted into unsigned integers modulo p, divided into 4 64-bit limbs.

(d) Compute the final vector-matrix-vector multiplications modulo p.

7. Compute x^{−1} = sgn(f)·v·2^{−744} (mod p), where 2^{−744} is precomputed.
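The following Python model (ours; variable-time, with none of the vectorization or batched matrix products of step 6, and requiring Python 3.8+ for the negative modular exponent) follows steps 1-7 and can be used to check the overall data flow. Function and variable names are illustrative, not taken from the paper's software.

    M64 = 2**64 - 1
    P25519 = 2**255 - 19

    def jump64divsteps2(delta, f, g):
        # 62 divsteps driven only by delta and the low 64 bits of f and g (Theorem 9.4).
        # Returns the new delta and (u, v, q, r) with (u, v; q, r) = 2^62 T_61 ... T_0;
        # these scaled entries fit in 63 bits (signed), hence in 64-bit registers.
        u, v, q, r = 1, 0, 0, 1
        for _ in range(62):
            if delta > 0 and g & 1:
                delta, f, g = 1 - delta, g, (g - f) >> 1
                u, v, q, r = 2*q, 2*r, q - u, r - v
            else:
                g0 = g & 1
                delta, g = 1 + delta, (g + g0*f) >> 1
                u, v, q, r = 2*u, 2*v, g0*u + q, g0*v + r
        return delta, u, v, q, r

    def invert25519(x, p=P25519):          # assumes gcd(x, p) = 1
        delta, f, g = 1, p, x % p
        P = [[1, 0], [0, 1]]               # running product of the scaled matrices T_i
        for _ in range(12):                # 12*62 = 744 >= 738 iterations (Theorem 11.2)
            delta, u, v, q, r = jump64divsteps2(delta, f & M64, g & M64)
            f, g = (u*f + v*g) >> 62, (q*f + r*g) >> 62    # exact divisions
            P = [[u*P[0][0] + v*P[1][0], u*P[0][1] + v*P[1][1]],
                 [q*P[0][0] + r*P[1][0], q*P[0][1] + r*P[1][1]]]
        sign = 1 if f > 0 else -1          # f is now +-1; P[0][1] is 2^744 times v_744
        return sign * P[0][1] * pow(2, -744, p) % p

The result can be cross-checked against Fermat inversion, pow(x, p-2, p).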

12.3. Other CPUs. Our advantage is larger on platforms with smaller multipliers. For example, we have written software to invert modulo 2^255 − 19 on an ARM Cortex-A7, obtaining a median cycle count of 35277 cycles. The best previous Cortex-A7 result we have found in the literature is 62648 cycles, reported by Fujii and Aranha [34], using Fermat's method. Internally, our software does repeated calls to jump32divsteps2, units of 30 iterations, with the resulting transition matrix fitting inside 32-bit general-purpose registers.

12.4. Other primes. Nath and Sarkar present inversion speeds for 20 different special primes, 12 of which were considered in previous work. For concreteness we consider inversion modulo the double-size prime 2^511 − 187. Aranha–Barreto–Pereira–Ricardini [8] selected this prime for their "M-511" curve.

Nath and Sarkar report that their Fermat inversion software for 2^511 − 187 takes 72804 Haswell cycles, 47062 Skylake cycles, or 45014 Kaby Lake cycles. The slowdown from 2^255 − 19, more than 5× in each case, is easy to explain: the exponent is twice as long, producing almost twice as many multiplications; each multiplication handles twice as many bits; and multiplication cost per bit is not constant.

Our 511-bit software is solidly faster: under 30000 cycles on all of these platforms. Most of the cycles in our computation are in evaluating jump64divsteps2, and the number of calls to jump64divsteps2 scales linearly with the number of bits in the input. There is some overhead for multiplications, so a modulus of twice the size takes more than twice as long, but our scalability to large moduli is much better than the Fermat scalability.

Our advantage is also larger in applications that use "random" primes rather than special primes: we pay the extra cost for Montgomery multiplication only at the end of our computation, whereas Fermat inversion pays the extra cost in each step.

References

[1] — (no editor), Actes du congrès international des mathématiciens, tome 3, Gauthier-Villars Éditeur, Paris, 1971. See [41].

[2] — (no editor), CWIT 2011: 12th Canadian workshop on information theory, Kelowna, British Columbia, May 17–20, 2011, Institute of Electrical and Electronics Engineers, 2011. ISBN 978-1-4577-0743-8. See [51].

[3] Carlisle Adams, Jan Camenisch (editors), Selected areas in cryptography – SAC 2017, 24th international conference, Ottawa, ON, Canada, August 16–18, 2017, revised selected papers, Lecture Notes in Computer Science 10719, Springer, 2018. ISBN 978-3-319-72564-2. See [15].

[4] Carlos Aguilar Melchor, Nicolas Aragon, Slim Bettaieb, Loïc Bidoux, Olivier Blazy, Jean-Christophe Deneuville, Philippe Gaborit, Edoardo Persichetti, Gilles Zémor, Hamming Quasi-Cyclic (HQC) (2017). URL: https://pqc-hqc.org/doc/hqc-specification_2017-11-30.pdf. Citations in this document: §1.

[5] Alfred V. Aho (chairman), Proceedings of fifth annual ACM symposium on theory of computing, Austin, Texas, April 30–May 2, 1973, ACM, New York, 1973. See [53].

[6] Martin Albrecht, Carlos Cid, Kenneth G. Paterson, CJ Tjhai, Martin Tomlinson, NTS-KEM, "The main document submitted to NIST" (2017). URL: https://nts-kem.io. Citations in this document: §1.

[7] Alejandro Cabrera Aldaya, Cesar Pereida García, Luis Manuel Alvarez Tapia, Billy Bob Brumley, Cache-timing attacks on RSA key generation (2018). URL: https://eprint.iacr.org/2018/367. Citations in this document: §1.

[8] Diego F. Aranha, Paulo S. L. M. Barreto, Geovandro C. C. F. Pereira, Jefferson E. Ricardini, A note on high-security general-purpose elliptic curves (2013). URL: https://eprint.iacr.org/2013/647. Citations in this document: §12.4.

[9] Elwyn R. Berlekamp, Algebraic coding theory, McGraw-Hill, 1968. Citations in this document: §1.

[10] Daniel J. Bernstein, Curve25519: new Diffie-Hellman speed records, in PKC 2006 [70] (2006), 207–228. URL: https://cr.yp.to/papers.html#curve25519. Citations in this document: §1, §12.

[11] Daniel J. Bernstein, Fast multiplication and its applications, in [27] (2008), 325–384. URL: https://cr.yp.to/papers.html#multapps. Citations in this document: §5.5.

[12] Daniel J. Bernstein, Tung Chou, Tanja Lange, Ingo von Maurich, Rafael Misoczki, Ruben Niederhagen, Edoardo Persichetti, Christiane Peters, Peter Schwabe, Nicolas Sendrier, Jakub Szefer, Wen Wang, Classic McEliece: conservative code-based cryptography, "Supporting Documentation" (2017). URL: https://classic.mceliece.org/nist.html. Citations in this document: §1.

[13] Daniel J. Bernstein, Tung Chou, Peter Schwabe, McBits: fast constant-time code-based cryptography, in CHES 2013 [18] (2013), 250–272. URL: https://binary.cr.yp.to/mcbits.html. Citations in this document: §1, §1.

[14] Daniel J. Bernstein, Chitchanok Chuengsatiansup, Tanja Lange, Christine van Vredendaal, NTRU Prime: reducing attack surface at low cost, full version of [15] (2017). URL: https://ntruprime.cr.yp.to/papers.html. Citations in this document: §1, §1, §1.1, §7.3, §7.3, §7.4.

[15] Daniel J. Bernstein, Chitchanok Chuengsatiansup, Tanja Lange, Christine van Vredendaal, NTRU Prime: reducing attack surface at low cost, in SAC 2017 [3], abbreviated version of [14] (2018), 235–260.

[16] Daniel J. Bernstein, Niels Duif, Tanja Lange, Peter Schwabe, Bo-Yin Yang, High-speed high-security signatures, Journal of Cryptographic Engineering 2 (2012), 77–89; see also older version at CHES 2011. URL: https://ed25519.cr.yp.to/ed25519-20110926.pdf. Citations in this document: §1.

[17] Daniel J. Bernstein, Peter Schwabe, NEON crypto, in CHES 2012 [56] (2012), 320–339. URL: https://cr.yp.to/papers.html#neoncrypto. Citations in this document: §1, §1.

[18] Guido Bertoni, Jean-Sébastien Coron (editors), Cryptographic hardware and embedded systems – CHES 2013 – 15th international workshop, Santa Barbara, CA, USA, August 20–23, 2013, proceedings, Lecture Notes in Computer Science 8086, Springer, 2013. ISBN 978-3-642-40348-4. See [13].

[19] Adam W. Bojanczyk, Richard P. Brent, A systolic algorithm for extended GCD computation, Computers & Mathematics with Applications 14 (1987), 233–238. ISSN 0898-1221. URL: https://maths-people.anu.edu.au/~brent/pd/rpb096i.pdf. Citations in this document: §1.3, §1.3.

[20] Tomas J. Boothby and Robert W. Bradshaw, Bitslicing and the method of four Russians over larger finite fields (2009). URL: http://arxiv.org/abs/0901.1413. Citations in this document: §7.1, §7.2.

[21] Joppe W. Bos, Constant time modular inversion, Journal of Cryptographic Engineering 4 (2014), 275–281. URL: http://joppebos.com/files/CTInversion.pdf. Citations in this document: §1, §1, §1, §1, §1, §1.3.

[22] Richard P. Brent, Fred G. Gustavson, David Y. Y. Yun, Fast solution of Toeplitz systems of equations and computation of Padé approximants, Journal of Algorithms 1 (1980), 259–295. ISSN 0196-6774. URL: https://maths-people.anu.edu.au/~brent/pd/rpb059i.pdf. Citations in this document: §1.4, §1.4.

[23] Richard P. Brent, Hsiang-Tsung Kung, Systolic VLSI arrays for polynomial GCD computation, IEEE Transactions on Computers C-33 (1984), 731–736. URL: https://maths-people.anu.edu.au/~brent/pd/rpb073i.pdf. Citations in this document: §1.3, §1.3, §3.2.

[24] Richard P. Brent, Hsiang-Tsung Kung, A systolic VLSI array for integer GCD computation, ARITH-7 (1985), 118–125. URL: https://maths-people.anu.edu.au/~brent/pd/rpb077i.pdf. Citations in this document: §1.3, §1.3, §1.3, §1.3, §1.4, §8.6, §8.6, §8.6.

[25] Richard P. Brent, Paul Zimmermann, An O(M(n) log n) algorithm for the Jacobi symbol, in ANTS 2010 [37] (2010), 83–95. URL: https://arxiv.org/pdf/1004.2091.pdf. Citations in this document: §1.4.

[26] Duncan A. Buell (editor), Algorithmic number theory, 6th international symposium, ANTS-VI, Burlington, VT, USA, June 13–18, 2004, proceedings, Lecture Notes in Computer Science 3076, Springer, 2004. ISBN 3-540-22156-5. See [62].

[27] Joe P. Buhler, Peter Stevenhagen (editors), Surveys in algorithmic number theory, Mathematical Sciences Research Institute Publications 44, Cambridge University Press, New York, 2008. See [11].

[28] Wouter Castryck, Tanja Lange, Chloe Martindale, Lorenz Panny, Joost Renes, CSIDH: an efficient post-quantum commutative group action, in Asiacrypt 2018 [55] (2018), 395–427. URL: https://csidh.isogeny.org/csidh-20181118.pdf. Citations in this document: §1.

[29] Cong Chen, Jeffrey Hoffstein, William Whyte, Zhenfei Zhang, NIST PQ Submission: NTRUEncrypt. A lattice based encryption algorithm (2017). URL: https://csrc.nist.gov/Projects/Post-Quantum-Cryptography/Round-1-Submissions. Citations in this document: §7.3.

[30] Tung Chou, McBits revisited, in CHES 2017 [33] (2017), 213–231. URL: https://tungchou.github.io/mcbits/. Citations in this document: §1.

[31] Benoît Daireaux, Véronique Maume-Deschamps, Brigitte Vallée, The Lyapunov tortoise and the dyadic hare, in [49] (2005), 71–94. URL: http://emis.ams.org/journals/DMTCS/pdfpapers/dmAD0108.pdf. Citations in this document: §E.5, §F.2.

[32] Jean Louis Dornstetter, On the equivalence between Berlekamp's and Euclid's algorithms, IEEE Transactions on Information Theory 33 (1987), 428–431. Citations in this document: §1.

[33] Wieland Fischer, Naofumi Homma (editors), Cryptographic hardware and embedded systems – CHES 2017 – 19th international conference, Taipei, Taiwan, September 25–28, 2017, proceedings, Lecture Notes in Computer Science 10529, Springer, 2017. ISBN 978-3-319-66786-7. See [30], [39].

[34] Hayato Fujii, Diego F. Aranha, Curve25519 for the Cortex-M4 and beyond, LATINCRYPT 2017, to appear (2017). URL: https://www.lasca.ic.unicamp.br/media/publications/paper39.pdf. Citations in this document: §1.1, §12.3.

[35] Philippe Gaborit, Carlos Aguilar Melchor, Cryptographic method for communicating confidential information, patent EP2537284B1 (2010). URL: https://patents.google.com/patent/EP2537284B1/en. Citations in this document: §7.4.

[36] Mariya Georgieva, Frédéric de Portzamparc, Toward secure implementation of McEliece decryption, in COSADE 2015 [48] (2015), 141–156. URL: https://eprint.iacr.org/2015/271. Citations in this document: §1.

[37] Guillaume Hanrot, François Morain, Emmanuel Thomé (editors), Algorithmic number theory, 9th international symposium, ANTS-IX, Nancy, France, July 19–23, 2010, proceedings, Lecture Notes in Computer Science 6197, 2010. ISBN 978-3-642-14517-9. See [25].

[38] Agnes E. Heydtmann, Jørn M. Jensen, On the equivalence of the Berlekamp-Massey and the Euclidean algorithms for decoding, IEEE Transactions on Information Theory 46 (2000), 2614–2624. Citations in this document: §1.

[39] Andreas Hülsing, Joost Rijneveld, John M. Schanck, Peter Schwabe, High-speed key encapsulation from NTRU, in CHES 2017 [33] (2017), 232–252. URL: https://eprint.iacr.org/2017/667. Citations in this document: §1, §1.1, §1.1, §1.3, §7, §7, §7.2, §7.2, §7.2, §7.2, §7.2, §7.2, §7.2, §7.2, §7.3, §7.3, §7.3, §7.3, §7.3.

[40] Burton S. Kaliski, The Montgomery inverse and its applications, IEEE Transactions on Computers 44 (1995), 1064–1065. Citations in this document: §1.

[41] Donald E. Knuth, The analysis of algorithms, in [1] (1971), 269–274. Citations in this document: §1, §1.4, §1.4.

[42] Adam Langley, CECPQ2 (2018). URL: https://www.imperialviolet.org/2018/12/12/cecpq2.html. Citations in this document: §7.4.

[43] Adam Langley, Add initial HRSS support (2018). URL: https://boringssl.googlesource.com/boringssl/+/7b935937b18215294e7dbf6404742855e3349092/crypto/hrss/hrss.c. Citations in this document: §7.2.

[44] Derrick H. Lehmer, Euclid's algorithm for large numbers, American Mathematical Monthly 45 (1938), 227–233. ISSN 0002-9890. URL: http://links.jstor.org/sici?sici=0002-9890(193804)45:4<227:EAFLN>2.0.CO;2-Y. Citations in this document: §1, §1.4, §12.1.

[45] Xianhui Lu, Yamin Liu, Dingding Jia, Haiyang Xue, Jingnan He, Zhenfei Zhang, LAC: lattice-based cryptosystems (2017). URL: https://csrc.nist.gov/Projects/Post-Quantum-Cryptography/Round-1-Submissions. Citations in this document: §1.

[46] Vadim Lyubashevsky, Chris Peikert, Oded Regev, On ideal lattices and learning with errors over rings, Journal of the ACM 60 (2013), Article 43, 35 pages. URL: https://eprint.iacr.org/2012/230. Citations in this document: §7.4.

[47] Vadim Lyubashevsky, Gregor Seiler, NTTRU: truly fast NTRU using NTT (2019). URL: https://eprint.iacr.org/2019/040. Citations in this document: §7.4.

[48] Stefan Mangard, Axel Y. Poschmann (editors), Constructive side-channel analysis and secure design – 6th international workshop, COSADE 2015, Berlin, Germany, April 13–14, 2015, revised selected papers, Lecture Notes in Computer Science 9064, Springer, 2015. ISBN 978-3-319-21475-7. See [36].

[49] Conrado Martínez (editor), 2005 international conference on analysis of algorithms, papers from the conference (AofA'05) held in Barcelona, June 6–10, 2005, The Association Discrete Mathematics & Theoretical Computer Science (DMTCS), Nancy, 2005. URL: http://www.emis.ams.org/journals/DMTCS/proceedings/dmAD01ind.html. See [31].

[50] James Massey, Shift-register synthesis and BCH decoding, IEEE Transactions on Information Theory 15 (1969), 122–127. ISSN 0018-9448. Citations in this document: §1.

[51] Todd D. Mateer, On the equivalence of the Berlekamp-Massey and the euclidean algorithms for algebraic decoding, in CWIT 2011 [2] (2011), 139–142. Citations in this document: §1.

[52] Niels Möller, On Schönhage's algorithm and subquadratic integer GCD computation, Mathematics of Computation 77 (2008), 589–607. URL: https://www.lysator.liu.se/~nisse/archive/S0025-5718-07-02017-0.pdf. Citations in this document: §1.4.

[53] Robert T. Moenck, Fast computation of GCDs, in STOC 1973 [5] (1973), 142–151. Citations in this document: §1.4, §1.4, §1.4, §1.4.

[54] Kaushik Nath, Palash Sarkar, Efficient inversion in (pseudo-)Mersenne prime order fields (2018). URL: https://eprint.iacr.org/2018/985. Citations in this document: §1.1, §12, §12.

[55] Thomas Peyrin, Steven D. Galbraith, Advances in cryptology – ASIACRYPT 2018 – 24th international conference on the theory and application of cryptology and information security, Brisbane, QLD, Australia, December 2–6, 2018, proceedings, part I, Lecture Notes in Computer Science 11272, Springer, 2018. ISBN 978-3-030-03325-5. See [28].

[56] Emmanuel Prouff, Patrick Schaumont (editors), Cryptographic hardware and embedded systems – CHES 2012 – 14th international workshop, Leuven, Belgium, September 9–12, 2012, proceedings, Lecture Notes in Computer Science 7428, Springer, 2012. ISBN 978-3-642-33026-1. See [17].

[57] Martin Roetteler, Michael Naehrig, Krysta M. Svore, Kristin E. Lauter, Quantum resource estimates for computing elliptic curve discrete logarithms, in ASIACRYPT 2017 [67] (2017), 241–270. URL: https://eprint.iacr.org/2017/598. Citations in this document: §1.1, §1.3.

[58] The Sage Developers (editor), SageMath, the Sage Mathematics Software System (Version 8.0), 2017. URL: https://www.sagemath.org. Citations in this document: §1.2, §1.2, §2.2, §5.

[59] John Schanck, Announcement of NTRU-HRSS-KEM and NTRUEncrypt merger (2018). URL: https://groups.google.com/a/list.nist.gov/d/msg/pqc-forum/SrFO_vK3xbI/mSmjY0HZCgAJ. Citations in this document: §7.3.

[60] Arnold Schönhage, Schnelle Berechnung von Kettenbruchentwicklungen, Acta Informatica 1 (1971), 139–144. ISSN 0001-5903. Citations in this document: §1, §1.4, §1.4.

[61] Joseph H. Silverman, Almost inverses and fast NTRU key creation, NTRU Cryptosystems (Technical Note 014) (1999). URL: https://assets.securityinnovation.com/static/downloads/NTRU/resources/NTRUTech014.pdf. Citations in this document: §7.3, §7.3.

[62] Damien Stehlé, Paul Zimmermann, A binary recursive gcd algorithm, in ANTS 2004 [26] (2004), 411–425. URL: https://perso.ens-lyon.fr/damien.stehle/BINARY.html. Citations in this document: §1.4, §1.4, §8.5, §E, §E.6, §E.6, §E.6, §F.2, §F.2, §F.2, §H, §H.

[63] Josef Stein, Computational problems associated with Racah algebra, Journal of Computational Physics 1 (1967), 397–405. Citations in this document: §1.

[64] Simon Stevin, L'arithmétique, Imprimerie de Christophle Plantin, 1585. URL: http://www.dwc.knaw.nl/pub/bronnen/Simon_Stevin-[II_B]_The_Principal_Works_of_Simon_Stevin_Mathematics.pdf. Citations in this document: §1.

[65] Volker Strassen, The computational complexity of continued fractions, in SYMSAC 1981 [69] (1981), 51–67; see also newer version [66].

[66] Volker Strassen, The computational complexity of continued fractions, SIAM Journal on Computing 12 (1983), 1–27; see also older version [65]. ISSN 0097-5397. Citations in this document: §1.4, §1.4, §5.3, §5.5.

[67] Tsuyoshi Takagi, Thomas Peyrin (editors), Advances in cryptology – ASIACRYPT 2017 – 23rd international conference on the theory and applications of cryptology and information security, Hong Kong, China, December 3–7, 2017, proceedings, part II, Lecture Notes in Computer Science 10625, Springer, 2017. ISBN 978-3-319-70696-2. See [57].

[68] Emmanuel Thomé, Subquadratic computation of vector generating polynomials and improvement of the block Wiedemann algorithm, Journal of Symbolic Computation 33 (2002), 757–775. URL: https://hal.inria.fr/inria-00103417/document. Citations in this document: §1.4.

[69] Paul S. Wang (editor), SYM-SAC '81: proceedings of the 1981 ACM Symposium on Symbolic and Algebraic Computation, Snowbird, Utah, August 5–7, 1981, Association for Computing Machinery, New York, 1981. ISBN 0-89791-047-8. See [65].

[70] Moti Yung, Yevgeniy Dodis, Aggelos Kiayias, Tal Malkin (editors), Public key cryptography – 9th international conference on theory and practice in public-key cryptography, New York, NY, USA, April 24–26, 2006, proceedings, Lecture Notes in Computer Science 3958, Springer, 2006. ISBN 978-3-540-33851-2. See [10].

A   Proof of main gcd theorem for polynomials

We give two separate proofs of the facts stated earlier relating gcd and modular reciprocal to a particular iterate of divstep in the polynomial case.

• The second proof consists of a review of the Euclid–Stevin gcd algorithm (Appendix B); a description of exactly how the intermediate results in the Euclid–Stevin algorithm relate to iterates of divstep (Appendix C); and, finally, a translation of gcd/reciprocal results from the Euclid–Stevin context to the divstep context (Appendix D).

• The first proof, given in this appendix, is shorter: it works directly with divstep, without a detour through the Euclid–Stevin labeling of intermediate results. (A small Sage sketch of divstep, convenient for spot-checking either proof on examples, appears after this list.)
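For readers who want to experiment with the statements below, here is a minimal Sage sketch of the polynomial divstep map, written directly from the case analysis used in these proofs (swap when $\delta>0$ and $g(0)\ne 0$, otherwise keep $f$; in both cases divide by $x$). The base ring $\mathbb{F}_7$ and the sample inputs are illustrative assumptions, not values from the paper.

    kx.<x> = GF(7)[]

    def divstep(delta, f, g):
        # one division step on k[x]; swap case when delta > 0 and g(0) != 0
        if delta > 0 and g(0) != 0:
            return 1 - delta, g, (g(0)*f - f(0)*g) // x
        return 1 + delta, f, (f(0)*g - g(0)*f) // x

    # example: R0 = x^3 + 2x + 1, R1 = x^2 + 5, d = 3;
    # f = x^d R0(1/x) and g = x^(d-1) R1(1/x) as in Theorem A.1
    delta, f, g = 1, 1 + 2*x**2 + x**3, 1 + 5*x**2
    for n in range(2*3 - 1):
        delta, f, g = divstep(delta, f, g)
    print(delta, f, g)   # the state after 2d-1 steps, from which Theorem 6.2 reads off the gcd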


Theorem A.1. Let $k$ be a field. Let $d$ be a positive integer. Let $R_0,R_1$ be elements of the polynomial ring $k[x]$ with $\deg R_0=d>\deg R_1$. Define $f=x^dR_0(1/x)$, $g=x^{d-1}R_1(1/x)$, and $(\delta_n,f_n,g_n)=\mathrm{divstep}^n(1,f,g)$ for $n\ge 0$. Then
$$2\deg f_n\le 2d-1-n+\delta_n \quad\text{for } n\ge 0,$$
$$2\deg g_n\le 2d-1-n-\delta_n \quad\text{for } n\ge 0.$$

Proof. Induct on $n$. If $n=0$ then $(\delta_n,f_n,g_n)=(1,f,g)$, so $2\deg f_n=2\deg f\le 2d=2d-1-n+\delta_n$ and $2\deg g_n=2\deg g\le 2(d-1)=2d-1-n-\delta_n$.

Assume from now on that $n\ge 1$. By the inductive hypothesis, $2\deg f_{n-1}\le 2d-n+\delta_{n-1}$ and $2\deg g_{n-1}\le 2d-n-\delta_{n-1}$.

Case 1: $\delta_{n-1}\le 0$. Then
$$(\delta_n,f_n,g_n)=(1+\delta_{n-1},\,f_{n-1},\,(f_{n-1}(0)g_{n-1}-g_{n-1}(0)f_{n-1})/x).$$
Both $f_{n-1}$ and $g_{n-1}$ have degree at most $(2d-n-\delta_{n-1})/2$, so the polynomial $xg_n=f_{n-1}(0)g_{n-1}-g_{n-1}(0)f_{n-1}$ also has degree at most $(2d-n-\delta_{n-1})/2$, so $2\deg g_n\le 2d-n-\delta_{n-1}-2=2d-1-n-\delta_n$. Also $2\deg f_n=2\deg f_{n-1}\le 2d-1-n+\delta_n$.

Case 2: $\delta_{n-1}>0$ and $g_{n-1}(0)=0$. Then
$$(\delta_n,f_n,g_n)=(1+\delta_{n-1},\,f_{n-1},\,f_{n-1}(0)g_{n-1}/x),$$
so $2\deg g_n=2\deg g_{n-1}-2\le 2d-n-\delta_{n-1}-2=2d-1-n-\delta_n$. As before, $2\deg f_n=2\deg f_{n-1}\le 2d-1-n+\delta_n$.

Case 3: $\delta_{n-1}>0$ and $g_{n-1}(0)\ne 0$. Then
$$(\delta_n,f_n,g_n)=(1-\delta_{n-1},\,g_{n-1},\,(g_{n-1}(0)f_{n-1}-f_{n-1}(0)g_{n-1})/x).$$
This time the polynomial $xg_n=g_{n-1}(0)f_{n-1}-f_{n-1}(0)g_{n-1}$ also has degree at most $(2d-n+\delta_{n-1})/2$, so $2\deg g_n\le 2d-n+\delta_{n-1}-2=2d-1-n-\delta_n$. Also $2\deg f_n=2\deg g_{n-1}\le 2d-n-\delta_{n-1}=2d-1-n+\delta_n$.

Theorem A.2. In the situation of Theorem A.1, define $T_n=T(\delta_n,f_n,g_n)$ and
$$\begin{pmatrix}u_n&v_n\\q_n&r_n\end{pmatrix}=T_{n-1}\cdots T_0.$$
Then $x^n(u_nr_n-v_nq_n)\in k^*$ for $n\ge 0$; $x^{n-1}u_n\in k[x]$ for $n\ge 1$; $x^{n-1}v_n,\,x^nq_n,\,x^nr_n\in k[x]$ for $n\ge 0$; and
$$2\deg(x^{n-1}u_n)\le n-3+\delta_n \text{ for } n\ge 1,\qquad 2\deg(x^{n-1}v_n)\le n-1+\delta_n \text{ for } n\ge 0,$$
$$2\deg(x^nq_n)\le n-1-\delta_n \text{ for } n\ge 0,\qquad 2\deg(x^nr_n)\le n+1-\delta_n \text{ for } n\ge 0.$$

Proof. Each matrix $T_n$ has determinant in $(1/x)k^*$ by construction, so the product of $n$ such matrices has determinant in $(1/x^n)k^*$. In particular $u_nr_n-v_nq_n\in(1/x^n)k^*$.

Induct on $n$. For $n=0$ we have $\delta_n=1$, $u_n=1$, $v_n=0$, $q_n=0$, $r_n=1$, so $x^{n-1}v_n=0\in k[x]$, $x^nq_n=0\in k[x]$, $x^nr_n=1\in k[x]$, $2\deg(x^{n-1}v_n)=-\infty\le n-1+\delta_n$, $2\deg(x^nq_n)=-\infty\le n-1-\delta_n$, and $2\deg(x^nr_n)=0\le n+1-\delta_n$.

Assume from now on that $n\ge 1$. By Theorem 4.3, $u_n,v_n\in(1/x^{n-1})k[x]$ and $q_n,r_n\in(1/x^n)k[x]$, so all that remains is to prove the degree bounds.

Case 1: $\delta_{n-1}\le 0$. Then $\delta_n=1+\delta_{n-1}$; $u_n=u_{n-1}$; $v_n=v_{n-1}$; $q_n=(f_{n-1}(0)q_{n-1}-g_{n-1}(0)u_{n-1})/x$; and $r_n=(f_{n-1}(0)r_{n-1}-g_{n-1}(0)v_{n-1})/x$.

Note that $n>1$ since $\delta_0>0$. By the inductive hypothesis $2\deg(x^{n-2}u_{n-1})\le n-4+\delta_{n-1}$, so $2\deg(x^{n-1}u_n)\le n-2+\delta_{n-1}=n-3+\delta_n$. Also $2\deg(x^{n-2}v_{n-1})\le n-2+\delta_{n-1}$, so $2\deg(x^{n-1}v_n)\le n+\delta_{n-1}=n-1+\delta_n$.

For $q_n$, note that both $x^{n-1}q_{n-1}$ and $x^{n-1}u_{n-1}$ have degree at most $(n-2-\delta_{n-1})/2$, so the same is true for their $k$-linear combination $x^nq_n$. Hence $2\deg(x^nq_n)\le n-2-\delta_{n-1}=n-1-\delta_n$. Similarly $2\deg(x^nr_n)\le n-\delta_{n-1}=n+1-\delta_n$.

Case 2: $\delta_{n-1}>0$ and $g_{n-1}(0)=0$. Then $\delta_n=1+\delta_{n-1}$; $u_n=u_{n-1}$; $v_n=v_{n-1}$; $q_n=f_{n-1}(0)q_{n-1}/x$; and $r_n=f_{n-1}(0)r_{n-1}/x$.

If $n=1$ then $u_n=u_0=1$, so $2\deg(x^{n-1}u_n)=0=n-3+\delta_n$. Otherwise $2\deg(x^{n-2}u_{n-1})\le n-4+\delta_{n-1}$ by the inductive hypothesis, so $2\deg(x^{n-1}u_n)\le n-3+\delta_n$ as above.

Also $2\deg(x^{n-1}v_n)\le n-1+\delta_n$ as above; $2\deg(x^nq_n)=2\deg(x^{n-1}q_{n-1})\le n-2-\delta_{n-1}=n-1-\delta_n$; and $2\deg(x^nr_n)=2\deg(x^{n-1}r_{n-1})\le n-\delta_{n-1}=n+1-\delta_n$.

Case 3: $\delta_{n-1}>0$ and $g_{n-1}(0)\ne 0$. Then $\delta_n=1-\delta_{n-1}$; $u_n=q_{n-1}$; $v_n=r_{n-1}$; $q_n=(g_{n-1}(0)u_{n-1}-f_{n-1}(0)q_{n-1})/x$; and $r_n=(g_{n-1}(0)v_{n-1}-f_{n-1}(0)r_{n-1})/x$.

If $n=1$ then $u_n=q_0=0$, so $2\deg(x^{n-1}u_n)=-\infty\le n-3+\delta_n$. Otherwise $2\deg(x^{n-1}q_{n-1})\le n-2-\delta_{n-1}$ by the inductive hypothesis, so $2\deg(x^{n-1}u_n)\le n-2-\delta_{n-1}=n-3+\delta_n$. Similarly $2\deg(x^{n-1}r_{n-1})\le n-\delta_{n-1}$ by the inductive hypothesis, so $2\deg(x^{n-1}v_n)\le n-\delta_{n-1}=n-1+\delta_n$.

Finally, for $q_n$, note that both $x^{n-1}u_{n-1}$ and $x^{n-1}q_{n-1}$ have degree at most $(n-2+\delta_{n-1})/2$, so the same is true for their $k$-linear combination $x^nq_n$, so $2\deg(x^nq_n)\le n-2+\delta_{n-1}=n-1-\delta_n$. Similarly $2\deg(x^nr_n)\le n+\delta_{n-1}=n+1-\delta_n$.

First proof of Theorem 6.2. Write $\varphi=f_{2d-1}$. By Theorem A.1, $2\deg\varphi\le\delta_{2d-1}$ and $2\deg g_{2d-1}\le-\delta_{2d-1}$. Each $f_n$ is nonzero by construction, so $\deg\varphi\ge 0$, so $\delta_{2d-1}\ge 0$.

By Theorem A.2, $x^{2d-2}u_{2d-1}$ is a polynomial of degree at most $d-2+\delta_{2d-1}/2$. Define $C_0=x^{-d+\delta_{2d-1}/2}u_{2d-1}(1/x)$; then $C_0$ is a polynomial of degree at most $d-2+\delta_{2d-1}/2$. Similarly define $C_1=x^{-d+1+\delta_{2d-1}/2}v_{2d-1}(1/x)$, and observe that $C_1$ is a polynomial of degree at most $d-1+\delta_{2d-1}/2$.

Multiply $C_0$ by $R_0=x^df(1/x)$, multiply $C_1$ by $R_1=x^{d-1}g(1/x)$, and add, to obtain
$$C_0R_0+C_1R_1=x^{\delta_{2d-1}/2}\bigl(u_{2d-1}(1/x)f(1/x)+v_{2d-1}(1/x)g(1/x)\bigr)=x^{\delta_{2d-1}/2}\varphi(1/x)$$
by Theorem 4.2.

Define $\Gamma$ as the monic polynomial $x^{\delta_{2d-1}/2}\varphi(1/x)/\varphi(0)$ of degree $\delta_{2d-1}/2$. We have just shown that $\Gamma\in R_0k[x]+R_1k[x]$: specifically, $\Gamma=(C_0/\varphi(0))R_0+(C_1/\varphi(0))R_1$. To finish the proof we will show that $R_0\in\Gamma k[x]$. This forces $\Gamma=\gcd\{R_0,R_1\}=G$, which in turn implies $\deg G=\delta_{2d-1}/2$ and $V=C_1/\varphi(0)$.

Observe that $\delta_{2d}=1+\delta_{2d-1}$ and $f_{2d}=\varphi$: the "swapped" case is impossible here, since $\delta_{2d-1}>0$ implies $g_{2d-1}=0$. By Theorem A.1 again, $2\deg g_{2d}\le-1-\delta_{2d}=-2-\delta_{2d-1}$, so $g_{2d}=0$.

By Theorem 4.2, $\varphi=f_{2d}=u_{2d}f+v_{2d}g$ and $0=g_{2d}=q_{2d}f+r_{2d}g$, so $x^{2d}r_{2d}\varphi=\Delta f$ where $\Delta=x^{2d}(u_{2d}r_{2d}-v_{2d}q_{2d})$. By Theorem A.2, $\Delta\in k^*$ and $p=x^{2d}r_{2d}$ is a polynomial of degree at most $(2d+1-\delta_{2d})/2=d-\delta_{2d-1}/2$. The polynomials $x^{d-\delta_{2d-1}/2}p(1/x)$ and $x^{\delta_{2d-1}/2}\varphi(1/x)=\varphi(0)\Gamma$ have product $\Delta x^df(1/x)=\Delta R_0$, so $\Gamma$ divides $R_0$ as claimed.

B   Review of the Euclid–Stevin algorithm

This appendix reviews the Euclid–Stevin algorithm to compute polynomial gcd, including the extended version of the algorithm that writes each remainder as a linear combination of the two inputs.


Theorem B.1 (the Euclid–Stevin algorithm). Let $k$ be a field. Let $R_0,R_1$ be elements of the polynomial ring $k[x]$. Recursively define a finite sequence of polynomials $R_2,R_3,\ldots,R_r\in k[x]$ as follows: if $i\ge 1$ and $R_i=0$ then $r=i$; if $i\ge 1$ and $R_i\ne 0$ then $R_{i+1}=R_{i-1}\bmod R_i$. Define $d_i=\deg R_i$ and $\ell_i=R_i[d_i]$. Then

• $d_1>d_2>\cdots>d_r=-\infty$;

• $\ell_i\ne 0$ if $1\le i\le r-1$;

• if $\ell_{r-1}\ne 0$ then $\gcd\{R_0,R_1\}=R_{r-1}/\ell_{r-1}$; and

• if $\ell_{r-1}=0$ then $R_0=R_1=\gcd\{R_0,R_1\}=0$ and $r=1$.

Proof. If $1\le i\le r-1$ then $R_i\ne 0$, so $\ell_i=R_i[\deg R_i]\ne 0$. Also $R_{i+1}=R_{i-1}\bmod R_i$, so $\deg R_{i+1}<\deg R_i$, i.e., $d_{i+1}<d_i$. Hence $d_1>d_2>\cdots>d_r$, and $d_r=\deg R_r=\deg 0=-\infty$.

If $\ell_{r-1}=0$ then $r$ must be $1$, so $R_1=0$ (since $R_r=0$) and $R_0=0$ (since $\ell_0=0$), so $\gcd\{R_0,R_1\}=0$. Assume from now on that $\ell_{r-1}\ne 0$. The divisibility of $R_{i+1}-R_{i-1}$ by $R_i$ implies $\gcd\{R_{i-1},R_i\}=\gcd\{R_i,R_{i+1}\}$, so $\gcd\{R_0,R_1\}=\gcd\{R_1,R_2\}=\cdots=\gcd\{R_{r-1},R_r\}=\gcd\{R_{r-1},0\}=R_{r-1}/\ell_{r-1}$.

Theorem B.2 (the extended Euclid–Stevin algorithm). In the situation of Theorem B.1, define $U_0,V_0,\ldots,U_r,V_r\in k[x]$ as follows: $U_0=1$; $V_0=0$; $U_1=0$; $V_1=1$; $U_{i+1}=U_{i-1}-\lfloor R_{i-1}/R_i\rfloor U_i$ for $i\in\{1,\ldots,r-1\}$; $V_{i+1}=V_{i-1}-\lfloor R_{i-1}/R_i\rfloor V_i$ for $i\in\{1,\ldots,r-1\}$. Then

• $R_i=U_iR_0+V_iR_1$ for $i\in\{0,\ldots,r\}$;

• $U_iV_{i+1}-U_{i+1}V_i=(-1)^i$ for $i\in\{0,\ldots,r-1\}$;

• $V_{i+1}R_i-V_iR_{i+1}=(-1)^iR_0$ for $i\in\{0,\ldots,r-1\}$; and

• if $d_0>d_1$ then $\deg V_{i+1}=d_0-d_i$ for $i\in\{0,\ldots,r-1\}$.

Proof. $U_0R_0+V_0R_1=1R_0+0R_1=R_0$, and $U_1R_0+V_1R_1=0R_0+1R_1=R_1$. For $i\in\{1,\ldots,r-1\}$, assume inductively that $R_{i-1}=U_{i-1}R_0+V_{i-1}R_1$ and $R_i=U_iR_0+V_iR_1$, and write $Q_i=\lfloor R_{i-1}/R_i\rfloor$. Then $R_{i+1}=R_{i-1}\bmod R_i=R_{i-1}-Q_iR_i=(U_{i-1}-Q_iU_i)R_0+(V_{i-1}-Q_iV_i)R_1=U_{i+1}R_0+V_{i+1}R_1$.

If $i=0$ then $U_iV_{i+1}-U_{i+1}V_i=1\cdot 1-0\cdot 0=1=(-1)^i$. For $i\in\{1,\ldots,r-1\}$, assume inductively that $U_{i-1}V_i-U_iV_{i-1}=(-1)^{i-1}$; then $U_iV_{i+1}-U_{i+1}V_i=U_i(V_{i-1}-Q_iV_i)-(U_{i-1}-Q_iU_i)V_i=-(U_{i-1}V_i-U_iV_{i-1})=(-1)^i$.

Multiply $R_i=U_iR_0+V_iR_1$ by $V_{i+1}$, multiply $R_{i+1}=U_{i+1}R_0+V_{i+1}R_1$ by $V_i$, and subtract, to see that $V_{i+1}R_i-V_iR_{i+1}=(-1)^iR_0$.

Finally, assume $d_0>d_1$. If $i=0$ then $\deg V_{i+1}=\deg 1=0=d_0-d_i$. For $i\in\{1,\ldots,r-1\}$, assume inductively that $\deg V_i=d_0-d_{i-1}$. Then $\deg V_iR_{i+1}<d_0$ since $\deg R_{i+1}=d_{i+1}<d_{i-1}$, so $\deg V_{i+1}R_i=\deg(V_iR_{i+1}+(-1)^iR_0)=d_0$, so $\deg V_{i+1}=d_0-d_i$.
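As a concrete companion to Theorems B.1 and B.2, the following Sage sketch runs the extended Euclid–Stevin recursion and checks the first identity of Theorem B.2 together with the gcd statement of Theorem B.1; the base ring and the two input polynomials are illustrative assumptions.

    kx.<x> = GF(7)[]

    def euclid_stevin(R0, R1):
        # remainder sequence R, with cofactors U, V such that U[i]*R0 + V[i]*R1 = R[i]
        R, U, V = [R0, R1], [1, 0], [0, 1]
        while R[-1] != 0:
            Q = R[-2] // R[-1]             # quotient in k[x]
            R.append(R[-2] - Q*R[-1])      # R_{i+1} = R_{i-1} mod R_i
            U.append(U[-2] - Q*U[-1])
            V.append(V[-2] - Q*V[-1])
        return R, U, V

    R, U, V = euclid_stevin(x**5 + 3*x + 1, 2*x**3 + x)
    assert all(U[i]*R[0] + V[i]*R[1] == R[i] for i in range(len(R)))
    assert R[-2] / R[-2].leading_coefficient() == gcd(R[0], R[1])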

C   Euclid–Stevin results as iterates of divstep

If the Euclid–Stevin algorithm is written as a sequence of polynomial divisions, and each polynomial division is written as a sequence of coefficient-elimination steps, then each coefficient-elimination step corresponds to one application of divstep. The exact correspondence is stated in Theorem C.2. This generalizes the known Berlekamp–Massey equivalence mentioned in Section 1.

Theorem C.1 is a warmup, showing how a single polynomial division relates to a sequence of division steps.


Theorem C.1 (polynomial division as a sequence of division steps). Let $k$ be a field. Let $R_0,R_1$ be elements of the polynomial ring $k[x]$. Write $d_0=\deg R_0$ and $d_1=\deg R_1$. Assume that $d_0>d_1\ge 0$. Then
$$\mathrm{divstep}^{2d_0-2d_1}(1,\,x^{d_0}R_0(1/x),\,x^{d_0-1}R_1(1/x))=\bigl(1,\;\ell_0^{d_0-d_1-1}x^{d_1}R_1(1/x),\;(\ell_0^{d_0-d_1-1}\ell_1)^{d_0-d_1+1}x^{d_1-1}R_2(1/x)\bigr)$$
where $\ell_0=R_0[d_0]$, $\ell_1=R_1[d_1]$, and $R_2=R_0\bmod R_1$.

Proof. Part 1: the first $d_0-d_1-1$ iterations. If $\delta\in\{1,2,\ldots,d_0-d_1-1\}$ then $d_0-\delta>d_1$, so $R_1[d_0-\delta]=0$, so the polynomial $\ell_0^{\delta-1}x^{d_0-\delta}R_1(1/x)$ has constant coefficient $0$. The polynomial $x^{d_0}R_0(1/x)$ has constant coefficient $\ell_0$. Hence
$$\mathrm{divstep}(\delta,\,x^{d_0}R_0(1/x),\,\ell_0^{\delta-1}x^{d_0-\delta}R_1(1/x))=(1+\delta,\,x^{d_0}R_0(1/x),\,\ell_0^{\delta}x^{d_0-\delta-1}R_1(1/x))$$
by definition of divstep. By induction,
$$\mathrm{divstep}^{d_0-d_1-1}(1,\,x^{d_0}R_0(1/x),\,x^{d_0-1}R_1(1/x))=(d_0-d_1,\,x^{d_0}R_0(1/x),\,\ell_0^{d_0-d_1-1}x^{d_1}R_1(1/x))=(d_0-d_1,\,x^{d_0}R_0(1/x),\,x^{d_1}R_1'(1/x))$$
where $R_1'=\ell_0^{d_0-d_1-1}R_1$. Note that $R_1'[d_1]=\ell_1'$ where $\ell_1'=\ell_0^{d_0-d_1-1}\ell_1$.

Part 2: the swap iteration and the last $d_0-d_1$ iterations. Define
$$Z_{d_0-d_1}=R_0\bmod x^{d_0-d_1}R_1',\quad Z_{d_0-d_1+1}=R_0\bmod x^{d_0-d_1-1}R_1',\quad\ldots,\quad Z_{2d_0-2d_1}=R_0\bmod R_1'=R_0\bmod R_1=R_2.$$
Then
$$Z_{d_0-d_1}=R_0-(R_0[d_0]/\ell_1')x^{d_0-d_1}R_1',$$
$$Z_{d_0-d_1+1}=Z_{d_0-d_1}-(Z_{d_0-d_1}[d_0-1]/\ell_1')x^{d_0-d_1-1}R_1',$$
$$\vdots$$
$$Z_{2d_0-2d_1}=Z_{2d_0-2d_1-1}-(Z_{2d_0-2d_1-1}[d_1]/\ell_1')R_1'.$$
Substitute $1/x$ for $x$ to see that
$$x^{d_0}Z_{d_0-d_1}(1/x)=x^{d_0}R_0(1/x)-(R_0[d_0]/\ell_1')x^{d_1}R_1'(1/x),$$
$$x^{d_0-1}Z_{d_0-d_1+1}(1/x)=x^{d_0-1}Z_{d_0-d_1}(1/x)-(Z_{d_0-d_1}[d_0-1]/\ell_1')x^{d_1}R_1'(1/x),$$
$$\vdots$$
$$x^{d_1}Z_{2d_0-2d_1}(1/x)=x^{d_1}Z_{2d_0-2d_1-1}(1/x)-(Z_{2d_0-2d_1-1}[d_1]/\ell_1')x^{d_1}R_1'(1/x).$$
We will use these equations to show that the $d_0-d_1,\ldots,2d_0-2d_1$ iterates of divstep compute $\ell_1'x^{d_0-1}Z_{d_0-d_1},\ldots,(\ell_1')^{d_0-d_1+1}x^{d_1-1}Z_{2d_0-2d_1}$ respectively.

The constant coefficients of $x^{d_0}R_0(1/x)$ and $x^{d_1}R_1'(1/x)$ are $R_0[d_0]$ and $\ell_1'\ne 0$ respectively, and $\ell_1'x^{d_0}R_0(1/x)-R_0[d_0]x^{d_1}R_1'(1/x)=\ell_1'x^{d_0}Z_{d_0-d_1}(1/x)$, so
$$\mathrm{divstep}(d_0-d_1,\,x^{d_0}R_0(1/x),\,x^{d_1}R_1'(1/x))=(1-(d_0-d_1),\,x^{d_1}R_1'(1/x),\,\ell_1'x^{d_0-1}Z_{d_0-d_1}(1/x)).$$


The constant coefficients of $x^{d_1}R_1'(1/x)$ and $\ell_1'x^{d_0-1}Z_{d_0-d_1}(1/x)$ are $\ell_1'$ and $\ell_1'Z_{d_0-d_1}[d_0-1]$ respectively, and
$$(\ell_1')^2x^{d_0-1}Z_{d_0-d_1}(1/x)-\ell_1'Z_{d_0-d_1}[d_0-1]\,x^{d_1}R_1'(1/x)=(\ell_1')^2x^{d_0-1}Z_{d_0-d_1+1}(1/x),$$
so
$$\mathrm{divstep}(1-(d_0-d_1),\,x^{d_1}R_1'(1/x),\,\ell_1'x^{d_0-1}Z_{d_0-d_1}(1/x))=(2-(d_0-d_1),\,x^{d_1}R_1'(1/x),\,(\ell_1')^2x^{d_0-2}Z_{d_0-d_1+1}(1/x)).$$
Continue in this way to see that
$$\mathrm{divstep}^i(d_0-d_1,\,x^{d_0}R_0(1/x),\,x^{d_1}R_1'(1/x))=(i-(d_0-d_1),\,x^{d_1}R_1'(1/x),\,(\ell_1')^ix^{d_0-i}Z_{d_0-d_1+i-1}(1/x))$$
for $1\le i\le d_0-d_1+1$. In particular,
$$\mathrm{divstep}^{2d_0-2d_1}(1,\,x^{d_0}R_0(1/x),\,x^{d_0-1}R_1(1/x))=\mathrm{divstep}^{d_0-d_1+1}(d_0-d_1,\,x^{d_0}R_0(1/x),\,x^{d_1}R_1'(1/x))$$
$$=(1,\,x^{d_1}R_1'(1/x),\,(\ell_1')^{d_0-d_1+1}x^{d_1-1}R_2(1/x))=(1,\,\ell_0^{d_0-d_1-1}x^{d_1}R_1(1/x),\,(\ell_0^{d_0-d_1-1}\ell_1)^{d_0-d_1+1}x^{d_1-1}R_2(1/x))$$
as claimed.

Theorem C.2. In the situation of Theorem B.2, assume that $d_0>d_1$. Define $f=x^{d_0}R_0(1/x)$ and $g=x^{d_0-1}R_1(1/x)$. Then $f,g\in k[x]$. Define $(\delta_n,f_n,g_n)=\mathrm{divstep}^n(1,f,g)$ and $T_n=T(\delta_n,f_n,g_n)$.

Part A. If $i\in\{0,1,\ldots,r-2\}$ and $2d_0-2d_i\le n<2d_0-d_i-d_{i+1}$ then
$$\delta_n=1+n-(2d_0-2d_i)>0,\qquad f_n=a_nx^{d_i}R_i(1/x),\qquad g_n=b_nx^{2d_0-d_i-n-1}R_{i+1}(1/x),$$
$$T_{n-1}\cdots T_0=\begin{pmatrix}a_n(1/x)^{d_0-d_i}U_i(1/x)&a_n(1/x)^{d_0-d_i-1}V_i(1/x)\\b_n(1/x)^{n-(d_0-d_i)+1}U_{i+1}(1/x)&b_n(1/x)^{n-(d_0-d_i)}V_{i+1}(1/x)\end{pmatrix}$$
for some $a_n,b_n\in k^*$.

Part B. If $i\in\{0,1,\ldots,r-2\}$ and $2d_0-d_i-d_{i+1}\le n<2d_0-2d_{i+1}$ then
$$\delta_n=1+n-(2d_0-2d_{i+1})\le 0,\qquad f_n=a_nx^{d_{i+1}}R_{i+1}(1/x),\qquad g_n=b_nx^{2d_0-d_{i+1}-n-1}Z_n(1/x),$$
$$T_{n-1}\cdots T_0=\begin{pmatrix}a_n(1/x)^{d_0-d_{i+1}}U_{i+1}(1/x)&a_n(1/x)^{d_0-d_{i+1}-1}V_{i+1}(1/x)\\b_n(1/x)^{n-(d_0-d_{i+1})+1}X_n(1/x)&b_n(1/x)^{n-(d_0-d_{i+1})}Y_n(1/x)\end{pmatrix}$$
for some $a_n,b_n\in k^*$, where
$$Z_n=R_i\bmod x^{2d_0-2d_{i+1}-n}R_{i+1},$$
$$X_n=U_i-\bigl\lfloor R_i/(x^{2d_0-2d_{i+1}-n}R_{i+1})\bigr\rfloor x^{2d_0-2d_{i+1}-n}U_{i+1},$$
$$Y_n=V_i-\bigl\lfloor R_i/(x^{2d_0-2d_{i+1}-n}R_{i+1})\bigr\rfloor x^{2d_0-2d_{i+1}-n}V_{i+1}.$$

Part C. If $n\ge 2d_0-2d_{r-1}$ then
$$\delta_n=1+n-(2d_0-2d_{r-1})>0,\qquad f_n=a_nx^{d_{r-1}}R_{r-1}(1/x),\qquad g_n=0,$$
$$T_{n-1}\cdots T_0=\begin{pmatrix}a_n(1/x)^{d_0-d_{r-1}}U_{r-1}(1/x)&a_n(1/x)^{d_0-d_{r-1}-1}V_{r-1}(1/x)\\b_n(1/x)^{n-(d_0-d_{r-1})+1}U_r(1/x)&b_n(1/x)^{n-(d_0-d_{r-1})}V_r(1/x)\end{pmatrix}$$
for some $a_n,b_n\in k^*$.

The proof also gives recursive formulas for $a_n,b_n$: namely, $(a_0,b_0)=(1,1)$ for the base case; $(a_n,b_n)=(b_{n-1},g_{n-1}(0)a_{n-1})$ for the swapped case; and $(a_n,b_n)=(a_{n-1},f_{n-1}(0)b_{n-1})$ for the unswapped case. One can, if desired, augment divstep to track $a_n$ and $b_n$; these are the denominators eliminated in a "fraction-free" computation.

Proof. By hypothesis $\deg R_0=d_0>d_1=\deg R_1$. Hence $d_0$ is a nonnegative integer and $f=R_0[d_0]+R_0[d_0-1]x+\cdots+R_0[0]x^{d_0}\in k[x]$. Similarly $g=R_1[d_0-1]+R_1[d_0-2]x+\cdots+R_1[0]x^{d_0-1}\in k[x]$; this is also true for $d_0=0$, with $R_1=0$ and $g=0$.

We induct on $n$, and split the analysis of $n$ into six cases. There are three different formulas (A, B, C) applicable to different ranges of $n$, and within each range we consider the first $n$ separately. For brevity, we write $\overline{R_i}=R_i(1/x)$, $\overline{U_i}=U_i(1/x)$, $\overline{V_i}=V_i(1/x)$, $\overline{X_n}=X_n(1/x)$, $\overline{Y_n}=Y_n(1/x)$, $\overline{Z_n}=Z_n(1/x)$.

Case A1: $n=2d_0-2d_i$ for some $i\in\{0,1,\ldots,r-2\}$.

If $n=0$ then $i=0$. By assumption $\delta_n=1$ as claimed; $f_n=f=x^{d_0}\overline{R_0}$, so $f_n=a_nx^{d_0}\overline{R_0}$ as claimed where $a_n=1$; $g_n=g=x^{d_0-1}\overline{R_1}$, so $g_n=b_nx^{d_0-1}\overline{R_1}$ as claimed where $b_n=1$. Also $U_i=1$, $U_{i+1}=0$, $V_i=0$, $V_{i+1}=1$, and $d_0-d_i=0$, so
$$\begin{pmatrix}a_n(1/x)^{d_0-d_i}\overline{U_i}&a_n(1/x)^{d_0-d_i-1}\overline{V_i}\\b_n(1/x)^{d_0-d_i+1}\overline{U_{i+1}}&b_n(1/x)^{d_0-d_i}\overline{V_{i+1}}\end{pmatrix}=\begin{pmatrix}1&0\\0&1\end{pmatrix}=T_{n-1}\cdots T_0$$
as claimed.

Otherwise $n>0$, so $i>0$. Now $2d_0-d_{i-1}-d_i\le n-1<2d_0-2d_i$, so by the inductive hypothesis $\delta_{n-1}=1+(n-1)-(2d_0-2d_i)=0$; $f_{n-1}=a_{n-1}x^{d_i}\overline{R_i}$ for some $a_{n-1}\in k^*$; $g_{n-1}=b_{n-1}x^{d_i}\overline{Z_{n-1}}$ for some $b_{n-1}\in k^*$, where $Z_{n-1}=R_{i-1}\bmod xR_i$; and
$$T_{n-2}\cdots T_0=\begin{pmatrix}a_{n-1}(1/x)^{d_0-d_i}\overline{U_i}&a_{n-1}(1/x)^{d_0-d_i-1}\overline{V_i}\\b_{n-1}(1/x)^{d_0-d_i}\overline{X_{n-1}}&b_{n-1}(1/x)^{d_0-d_i-1}\overline{Y_{n-1}}\end{pmatrix}$$
where $X_{n-1}=U_{i-1}-\lfloor R_{i-1}/(xR_i)\rfloor xU_i$ and $Y_{n-1}=V_{i-1}-\lfloor R_{i-1}/(xR_i)\rfloor xV_i$.

Note that $\deg Z_{n-1}\le d_i$. Observe that
$$R_{i+1}=R_{i-1}\bmod R_i=Z_{n-1}-(Z_{n-1}[d_i]/\ell_i)R_i,$$
$$U_{i+1}=X_{n-1}-(Z_{n-1}[d_i]/\ell_i)U_i,$$
$$V_{i+1}=Y_{n-1}-(Z_{n-1}[d_i]/\ell_i)V_i.$$
The point is that subtracting $(Z_{n-1}[d_i]/\ell_i)R_i$ from $Z_{n-1}$ eliminates the coefficient of $x^{d_i}$ and produces a polynomial of degree at most $d_i-1$, namely $Z_{n-1}\bmod R_i=R_{i-1}\bmod R_i$.

Starting from $R_{i+1}=Z_{n-1}-(Z_{n-1}[d_i]/\ell_i)R_i$, substitute $1/x$ for $x$ and multiply by $a_{n-1}b_{n-1}\ell_ix^{d_i}$ to see that
$$a_{n-1}b_{n-1}\ell_ix^{d_i}\overline{R_{i+1}}=a_{n-1}\ell_ig_{n-1}-b_{n-1}Z_{n-1}[d_i]f_{n-1}=f_{n-1}(0)g_{n-1}-g_{n-1}(0)f_{n-1}.$$


Now
$$(\delta_n,f_n,g_n)=\mathrm{divstep}(\delta_{n-1},f_{n-1},g_{n-1})=(1+\delta_{n-1},\,f_{n-1},\,(f_{n-1}(0)g_{n-1}-g_{n-1}(0)f_{n-1})/x).$$
Hence $\delta_n=1=1+n-(2d_0-2d_i)>0$ as claimed; $f_n=a_nx^{d_i}\overline{R_i}$ where $a_n=a_{n-1}\in k^*$ as claimed; and $g_n=b_nx^{d_i-1}\overline{R_{i+1}}$ where $b_n=a_{n-1}b_{n-1}\ell_i\in k^*$ as claimed.

Finally, $\overline{U_{i+1}}=\overline{X_{n-1}}-(Z_{n-1}[d_i]/\ell_i)\overline{U_i}$ and $\overline{V_{i+1}}=\overline{Y_{n-1}}-(Z_{n-1}[d_i]/\ell_i)\overline{V_i}$. Substitute these into the matrix
$$\begin{pmatrix}a_n(1/x)^{d_0-d_i}\overline{U_i}&a_n(1/x)^{d_0-d_i-1}\overline{V_i}\\b_n(1/x)^{d_0-d_i+1}\overline{U_{i+1}}&b_n(1/x)^{d_0-d_i}\overline{V_{i+1}}\end{pmatrix}$$
along with $a_n=a_{n-1}$ and $b_n=a_{n-1}b_{n-1}\ell_i$ to see that this matrix is
$$T_{n-1}=\begin{pmatrix}1&0\\-b_{n-1}Z_{n-1}[d_i]/x&a_{n-1}\ell_i/x\end{pmatrix}$$
times $T_{n-2}\cdots T_0$ shown above.

Case A2: $2d_0-2d_i<n<2d_0-d_i-d_{i+1}$ for some $i\in\{0,1,\ldots,r-2\}$.

By the inductive hypothesis $\delta_{n-1}=n-(2d_0-2d_i)>0$; $f_{n-1}=a_{n-1}x^{d_i}\overline{R_i}$ where $a_{n-1}\in k^*$; $g_{n-1}=b_{n-1}x^{2d_0-d_i-n}\overline{R_{i+1}}$ where $b_{n-1}\in k^*$; and
$$T_{n-2}\cdots T_0=\begin{pmatrix}a_{n-1}(1/x)^{d_0-d_i}\overline{U_i}&a_{n-1}(1/x)^{d_0-d_i-1}\overline{V_i}\\b_{n-1}(1/x)^{n-(d_0-d_i)}\overline{U_{i+1}}&b_{n-1}(1/x)^{n-(d_0-d_i)-1}\overline{V_{i+1}}\end{pmatrix}.$$
Note that $g_{n-1}(0)=b_{n-1}R_{i+1}[2d_0-d_i-n]=0$, since $2d_0-d_i-n>d_{i+1}=\deg R_{i+1}$. Thus
$$(\delta_n,f_n,g_n)=\mathrm{divstep}(\delta_{n-1},f_{n-1},g_{n-1})=(1+\delta_{n-1},\,f_{n-1},\,f_{n-1}(0)g_{n-1}/x).$$
Hence $\delta_n=1+n-(2d_0-2d_i)>0$ as claimed; $f_n=a_nx^{d_i}\overline{R_i}$ where $a_n=a_{n-1}\in k^*$ as claimed; and $g_n=b_nx^{2d_0-d_i-n-1}\overline{R_{i+1}}$ where $b_n=f_{n-1}(0)b_{n-1}=a_{n-1}b_{n-1}\ell_i\in k^*$ as claimed.

Also $T_{n-1}=\begin{pmatrix}1&0\\0&a_{n-1}\ell_i/x\end{pmatrix}$. Multiply by $T_{n-2}\cdots T_0$ above to see that
$$T_{n-1}\cdots T_0=\begin{pmatrix}a_n(1/x)^{d_0-d_i}\overline{U_i}&a_n(1/x)^{d_0-d_i-1}\overline{V_i}\\b_n(1/x)^{n-(d_0-d_i)+1}\overline{U_{i+1}}&b_n(1/x)^{n-(d_0-d_i)}\overline{V_{i+1}}\end{pmatrix}$$
as claimed.

Case B1: $n=2d_0-d_i-d_{i+1}$ for some $i\in\{0,1,\ldots,r-2\}$.

Note that $2d_0-2d_i\le n-1<2d_0-d_i-d_{i+1}$. By the inductive hypothesis $\delta_{n-1}=d_i-d_{i+1}>0$; $f_{n-1}=a_{n-1}x^{d_i}\overline{R_i}$ where $a_{n-1}\in k^*$; $g_{n-1}=b_{n-1}x^{d_{i+1}}\overline{R_{i+1}}$ where $b_{n-1}\in k^*$; and
$$T_{n-2}\cdots T_0=\begin{pmatrix}a_{n-1}(1/x)^{d_0-d_i}\overline{U_i}&a_{n-1}(1/x)^{d_0-d_i-1}\overline{V_i}\\b_{n-1}(1/x)^{d_0-d_{i+1}}\overline{U_{i+1}}&b_{n-1}(1/x)^{d_0-d_{i+1}-1}\overline{V_{i+1}}\end{pmatrix}.$$
Observe that
$$Z_n=R_i-(\ell_i/\ell_{i+1})x^{d_i-d_{i+1}}R_{i+1},$$
$$X_n=U_i-(\ell_i/\ell_{i+1})x^{d_i-d_{i+1}}U_{i+1},$$
$$Y_n=V_i-(\ell_i/\ell_{i+1})x^{d_i-d_{i+1}}V_{i+1}.$$
Substitute $1/x$ for $x$ in the $Z_n$ equation and multiply by $a_{n-1}b_{n-1}\ell_{i+1}x^{d_i}$ to see that
$$a_{n-1}b_{n-1}\ell_{i+1}x^{d_i}\overline{Z_n}=b_{n-1}\ell_{i+1}f_{n-1}-a_{n-1}\ell_ig_{n-1}=g_{n-1}(0)f_{n-1}-f_{n-1}(0)g_{n-1}.$$


Also note that $g_{n-1}(0)=b_{n-1}\ell_{i+1}\ne 0$, so
$$(\delta_n,f_n,g_n)=\mathrm{divstep}(\delta_{n-1},f_{n-1},g_{n-1})=(1-\delta_{n-1},\,g_{n-1},\,(g_{n-1}(0)f_{n-1}-f_{n-1}(0)g_{n-1})/x).$$
Hence $\delta_n=1-(d_i-d_{i+1})\le 0$ as claimed; $f_n=a_nx^{d_{i+1}}\overline{R_{i+1}}$ where $a_n=b_{n-1}\in k^*$ as claimed; and $g_n=b_nx^{d_i-1}\overline{Z_n}$ where $b_n=a_{n-1}b_{n-1}\ell_{i+1}\in k^*$ as claimed.

Finally, substitute $\overline{X_n}=\overline{U_i}-(\ell_i/\ell_{i+1})(1/x)^{d_i-d_{i+1}}\overline{U_{i+1}}$ and $\overline{Y_n}=\overline{V_i}-(\ell_i/\ell_{i+1})(1/x)^{d_i-d_{i+1}}\overline{V_{i+1}}$ into the matrix
$$\begin{pmatrix}a_n(1/x)^{d_0-d_{i+1}}\overline{U_{i+1}}&a_n(1/x)^{d_0-d_{i+1}-1}\overline{V_{i+1}}\\b_n(1/x)^{d_0-d_i+1}\overline{X_n}&b_n(1/x)^{d_0-d_i}\overline{Y_n}\end{pmatrix}$$
along with $a_n=b_{n-1}$ and $b_n=a_{n-1}b_{n-1}\ell_{i+1}$ to see that this matrix is
$$T_{n-1}=\begin{pmatrix}0&1\\b_{n-1}\ell_{i+1}/x&-a_{n-1}\ell_i/x\end{pmatrix}$$
times $T_{n-2}\cdots T_0$ shown above.

Case B2: $2d_0-d_i-d_{i+1}<n<2d_0-2d_{i+1}$ for some $i\in\{0,1,\ldots,r-2\}$.

Now $2d_0-d_i-d_{i+1}\le n-1<2d_0-2d_{i+1}$. By the inductive hypothesis $\delta_{n-1}=n-(2d_0-2d_{i+1})\le 0$; $f_{n-1}=a_{n-1}x^{d_{i+1}}\overline{R_{i+1}}$ for some $a_{n-1}\in k^*$; $g_{n-1}=b_{n-1}x^{2d_0-d_{i+1}-n}\overline{Z_{n-1}}$ for some $b_{n-1}\in k^*$, where $Z_{n-1}=R_i\bmod x^{2d_0-2d_{i+1}-n+1}R_{i+1}$; and
$$T_{n-2}\cdots T_0=\begin{pmatrix}a_{n-1}(1/x)^{d_0-d_{i+1}}\overline{U_{i+1}}&a_{n-1}(1/x)^{d_0-d_{i+1}-1}\overline{V_{i+1}}\\b_{n-1}(1/x)^{n-(d_0-d_{i+1})}\overline{X_{n-1}}&b_{n-1}(1/x)^{n-(d_0-d_{i+1})-1}\overline{Y_{n-1}}\end{pmatrix}$$
where $X_{n-1}=U_i-\bigl\lfloor R_i/(x^{2d_0-2d_{i+1}-n+1}R_{i+1})\bigr\rfloor x^{2d_0-2d_{i+1}-n+1}U_{i+1}$ and $Y_{n-1}=V_i-\bigl\lfloor R_i/(x^{2d_0-2d_{i+1}-n+1}R_{i+1})\bigr\rfloor x^{2d_0-2d_{i+1}-n+1}V_{i+1}$.

Observe that
$$Z_n=R_i\bmod x^{2d_0-2d_{i+1}-n}R_{i+1}=Z_{n-1}-(Z_{n-1}[2d_0-d_{i+1}-n]/\ell_{i+1})x^{2d_0-2d_{i+1}-n}R_{i+1},$$
$$X_n=X_{n-1}-(Z_{n-1}[2d_0-d_{i+1}-n]/\ell_{i+1})x^{2d_0-2d_{i+1}-n}U_{i+1},$$
$$Y_n=Y_{n-1}-(Z_{n-1}[2d_0-d_{i+1}-n]/\ell_{i+1})x^{2d_0-2d_{i+1}-n}V_{i+1}.$$
The point is that $\deg Z_{n-1}\le 2d_0-d_{i+1}-n$, and subtracting
$$(Z_{n-1}[2d_0-d_{i+1}-n]/\ell_{i+1})x^{2d_0-2d_{i+1}-n}R_{i+1}$$
from $Z_{n-1}$ eliminates the coefficient of $x^{2d_0-d_{i+1}-n}$.

Starting from the $Z_n$ equation above, substitute $1/x$ for $x$ and multiply by $a_{n-1}b_{n-1}\ell_{i+1}x^{2d_0-d_{i+1}-n}$ to see that
$$a_{n-1}b_{n-1}\ell_{i+1}x^{2d_0-d_{i+1}-n}\overline{Z_n}=a_{n-1}\ell_{i+1}g_{n-1}-b_{n-1}Z_{n-1}[2d_0-d_{i+1}-n]f_{n-1}=f_{n-1}(0)g_{n-1}-g_{n-1}(0)f_{n-1}.$$
Now
$$(\delta_n,f_n,g_n)=\mathrm{divstep}(\delta_{n-1},f_{n-1},g_{n-1})=(1+\delta_{n-1},\,f_{n-1},\,(f_{n-1}(0)g_{n-1}-g_{n-1}(0)f_{n-1})/x).$$
Hence $\delta_n=1+n-(2d_0-2d_{i+1})\le 0$ as claimed; $f_n=a_nx^{d_{i+1}}\overline{R_{i+1}}$ where $a_n=a_{n-1}\in k^*$ as claimed; and $g_n=b_nx^{2d_0-d_{i+1}-n-1}\overline{Z_n}$ where $b_n=a_{n-1}b_{n-1}\ell_{i+1}\in k^*$ as claimed.


Finally, substitute
$$\overline{X_n}=\overline{X_{n-1}}-(Z_{n-1}[2d_0-d_{i+1}-n]/\ell_{i+1})(1/x)^{2d_0-2d_{i+1}-n}\overline{U_{i+1}}$$
and
$$\overline{Y_n}=\overline{Y_{n-1}}-(Z_{n-1}[2d_0-d_{i+1}-n]/\ell_{i+1})(1/x)^{2d_0-2d_{i+1}-n}\overline{V_{i+1}}$$
into the matrix
$$\begin{pmatrix}a_n(1/x)^{d_0-d_{i+1}}\overline{U_{i+1}}&a_n(1/x)^{d_0-d_{i+1}-1}\overline{V_{i+1}}\\b_n(1/x)^{n-(d_0-d_{i+1})+1}\overline{X_n}&b_n(1/x)^{n-(d_0-d_{i+1})}\overline{Y_n}\end{pmatrix}$$
along with $a_n=a_{n-1}$ and $b_n=a_{n-1}b_{n-1}\ell_{i+1}$ to see that this matrix is
$$T_{n-1}=\begin{pmatrix}1&0\\-b_{n-1}Z_{n-1}[2d_0-d_{i+1}-n]/x&a_{n-1}\ell_{i+1}/x\end{pmatrix}$$
times $T_{n-2}\cdots T_0$ shown above.

Case C1: $n=2d_0-2d_{r-1}$. This is essentially the same as Case A1, except that $i$ is replaced with $r-1$, and $g_n$ ends up as $0$. For completeness we spell out the details.

If $n=0$ then $r=1$ and $R_1=0$, so $g=0$. By assumption $\delta_n=1$ as claimed; $f_n=f=x^{d_0}\overline{R_0}$, so $f_n=a_nx^{d_0}\overline{R_0}$ as claimed where $a_n=1$; $g_n=g=0$ as claimed. Define $b_n=1$. Now $U_{r-1}=1$, $U_r=0$, $V_{r-1}=0$, $V_r=1$, and $d_0-d_{r-1}=0$, so
$$\begin{pmatrix}a_n(1/x)^{d_0-d_{r-1}}\overline{U_{r-1}}&a_n(1/x)^{d_0-d_{r-1}-1}\overline{V_{r-1}}\\b_n(1/x)^{d_0-d_{r-1}+1}\overline{U_r}&b_n(1/x)^{d_0-d_{r-1}}\overline{V_r}\end{pmatrix}=\begin{pmatrix}1&0\\0&1\end{pmatrix}=T_{n-1}\cdots T_0$$
as claimed.

Otherwise $n>0$, so $r\ge 2$. Now $2d_0-d_{r-2}-d_{r-1}\le n-1<2d_0-2d_{r-1}$, so by the inductive hypothesis (for $i=r-2$) $\delta_{n-1}=1+(n-1)-(2d_0-2d_{r-1})=0$; $f_{n-1}=a_{n-1}x^{d_{r-1}}\overline{R_{r-1}}$ for some $a_{n-1}\in k^*$; $g_{n-1}=b_{n-1}x^{d_{r-1}}\overline{Z_{n-1}}$ for some $b_{n-1}\in k^*$, where $Z_{n-1}=R_{r-2}\bmod xR_{r-1}$; and
$$T_{n-2}\cdots T_0=\begin{pmatrix}a_{n-1}(1/x)^{d_0-d_{r-1}}\overline{U_{r-1}}&a_{n-1}(1/x)^{d_0-d_{r-1}-1}\overline{V_{r-1}}\\b_{n-1}(1/x)^{d_0-d_{r-1}}\overline{X_{n-1}}&b_{n-1}(1/x)^{d_0-d_{r-1}-1}\overline{Y_{n-1}}\end{pmatrix}$$
where $X_{n-1}=U_{r-2}-\lfloor R_{r-2}/(xR_{r-1})\rfloor xU_{r-1}$ and $Y_{n-1}=V_{r-2}-\lfloor R_{r-2}/(xR_{r-1})\rfloor xV_{r-1}$.

Observe that
$$0=R_r=R_{r-2}\bmod R_{r-1}=Z_{n-1}-(Z_{n-1}[d_{r-1}]/\ell_{r-1})R_{r-1},$$
$$U_r=X_{n-1}-(Z_{n-1}[d_{r-1}]/\ell_{r-1})U_{r-1},$$
$$V_r=Y_{n-1}-(Z_{n-1}[d_{r-1}]/\ell_{r-1})V_{r-1}.$$
Starting from $Z_{n-1}-(Z_{n-1}[d_{r-1}]/\ell_{r-1})R_{r-1}=0$, substitute $1/x$ for $x$ and multiply by $a_{n-1}b_{n-1}\ell_{r-1}x^{d_{r-1}}$ to see that
$$a_{n-1}\ell_{r-1}g_{n-1}-b_{n-1}Z_{n-1}[d_{r-1}]f_{n-1}=0,$$
i.e., $f_{n-1}(0)g_{n-1}-g_{n-1}(0)f_{n-1}=0$. Now
$$(\delta_n,f_n,g_n)=\mathrm{divstep}(\delta_{n-1},f_{n-1},g_{n-1})=(1+\delta_{n-1},\,f_{n-1},\,0).$$
Hence $\delta_n=1=1+n-(2d_0-2d_{r-1})>0$ as claimed; $f_n=a_nx^{d_{r-1}}\overline{R_{r-1}}$ where $a_n=a_{n-1}\in k^*$ as claimed; and $g_n=0$ as claimed.

Define $b_n=a_{n-1}b_{n-1}\ell_{r-1}\in k^*$ and substitute $\overline{U_r}=\overline{X_{n-1}}-(Z_{n-1}[d_{r-1}]/\ell_{r-1})\overline{U_{r-1}}$ and $\overline{V_r}=\overline{Y_{n-1}}-(Z_{n-1}[d_{r-1}]/\ell_{r-1})\overline{V_{r-1}}$ into the matrix
$$\begin{pmatrix}a_n(1/x)^{d_0-d_{r-1}}U_{r-1}(1/x)&a_n(1/x)^{d_0-d_{r-1}-1}V_{r-1}(1/x)\\b_n(1/x)^{d_0-d_{r-1}+1}U_r(1/x)&b_n(1/x)^{d_0-d_{r-1}}V_r(1/x)\end{pmatrix}$$


to see that this matrix is
$$T_{n-1}=\begin{pmatrix}1&0\\-b_{n-1}Z_{n-1}[d_{r-1}]/x&a_{n-1}\ell_{r-1}/x\end{pmatrix}$$
times $T_{n-2}\cdots T_0$ shown above.

Case C2: $n>2d_0-2d_{r-1}$.

By the inductive hypothesis $\delta_{n-1}=n-(2d_0-2d_{r-1})$; $f_{n-1}=a_{n-1}x^{d_{r-1}}\overline{R_{r-1}}$ for some $a_{n-1}\in k^*$; $g_{n-1}=0$; and
$$T_{n-2}\cdots T_0=\begin{pmatrix}a_{n-1}(1/x)^{d_0-d_{r-1}}U_{r-1}(1/x)&a_{n-1}(1/x)^{d_0-d_{r-1}-1}V_{r-1}(1/x)\\b_{n-1}(1/x)^{n-(d_0-d_{r-1})}U_r(1/x)&b_{n-1}(1/x)^{n-(d_0-d_{r-1})-1}V_r(1/x)\end{pmatrix}$$
for some $b_{n-1}\in k^*$. Hence $\delta_n=1+\delta_{n-1}=1+n-(2d_0-2d_{r-1})>0$; $f_n=f_{n-1}=a_nx^{d_{r-1}}\overline{R_{r-1}}$ where $a_n=a_{n-1}$; and $g_n=0$. Also $T_{n-1}=\begin{pmatrix}1&0\\0&a_{n-1}\ell_{r-1}/x\end{pmatrix}$, so
$$T_{n-1}\cdots T_0=\begin{pmatrix}a_n(1/x)^{d_0-d_{r-1}}U_{r-1}(1/x)&a_n(1/x)^{d_0-d_{r-1}-1}V_{r-1}(1/x)\\b_n(1/x)^{n-(d_0-d_{r-1})+1}U_r(1/x)&b_n(1/x)^{n-(d_0-d_{r-1})}V_r(1/x)\end{pmatrix}$$
where $b_n=a_{n-1}b_{n-1}\ell_{r-1}\in k^*$.

D   Alternative proof of main gcd theorem for polynomials

This appendix gives another proof of Theorem 6.2, using the relationship in Theorem C.2 between the Euclid–Stevin algorithm and iterates of divstep.

Second proof of Theorem 6.2. Define $R_2,R_3,\ldots,R_r$ and $d_i$ and $\ell_i$ as in Theorem B.1, and define $U_i$ and $V_i$ as in Theorem B.2. Then $d_0=d>d_1$, and all of the hypotheses of Theorems B.1, B.2, and C.2 are satisfied.

By Theorem B.1, $G=\gcd\{R_0,R_1\}=R_{r-1}/\ell_{r-1}$ (since $R_0\ne 0$), so $\deg G=d_{r-1}$. Also, by Theorem B.2,
$$U_{r-1}R_0+V_{r-1}R_1=R_{r-1}=\ell_{r-1}G,$$
so $(V_{r-1}/\ell_{r-1})R_1\equiv G\pmod{R_0}$, and $\deg V_{r-1}=d_0-d_{r-2}<d_0-d_{r-1}=d-\deg G$, so $V=V_{r-1}/\ell_{r-1}$.

Case 1: $d_{r-1}\ge 1$. Part C of Theorem C.2 applies to $n=2d-1$, since $n=2d_0-1\ge 2d_0-2d_{r-1}$. In particular $\delta_n=1+n-(2d_0-2d_{r-1})=2d_{r-1}=2\deg G$; $f_{2d-1}=a_{2d-1}x^{d_{r-1}}R_{r-1}(1/x)$; and $v_{2d-1}=a_{2d-1}(1/x)^{d_0-d_{r-1}-1}V_{r-1}(1/x)$.

Case 2: $d_{r-1}=0$. Then $r\ge 2$ and $d_{r-2}\ge 1$. Part B of Theorem C.2 applies to $i=r-2$ and $n=2d-1=2d_0-1$, since $2d_0-d_i-d_{i+1}=2d_0-d_{r-2}\le 2d_0-1=n<2d_0=2d_0-2d_{i+1}$. In particular $\delta_n=1+n-(2d_0-2d_{i+1})=2d_{r-1}=2\deg G$, again $f_{2d-1}=a_{2d-1}x^{d_{r-1}}R_{r-1}(1/x)$, and again $v_{2d-1}=a_{2d-1}(1/x)^{d_0-d_{r-1}-1}V_{r-1}(1/x)$.

In both cases $f_{2d-1}(0)=a_{2d-1}\ell_{r-1}$, so $x^{\deg G}f_{2d-1}(1/x)/f_{2d-1}(0)=R_{r-1}/\ell_{r-1}=G$ and $x^{-d+1+\deg G}v_{2d-1}(1/x)/f_{2d-1}(0)=V_{r-1}/\ell_{r-1}=V$.

E   A gcd algorithm for integers

This appendix presents a variable-time algorithm to compute the gcd of two integers. This appendix also explains how a modification to the algorithm produces the Stehlé–Zimmermann algorithm [62].

Theorem E.1. Let $e$ be a positive integer. Let $f$ be an element of $\mathbb{Z}_2^*$. Let $g$ be an element of $2^e\mathbb{Z}_2^*$. Then there is a unique element $q\in\{1,3,5,\ldots,2^{e+1}-1\}$ such that $qg/2^e-f\in 2^{e+1}\mathbb{Z}_2$.


We define $f\;\mathrm{div}_2\;g=q/2^e\in\mathbb{Q}$, and we define $f\;\mathrm{mod}_2\;g=f-qg/2^e\in 2^{e+1}\mathbb{Z}_2$. Then $f=(f\;\mathrm{div}_2\;g)g+(f\;\mathrm{mod}_2\;g)$. Note that $e=\mathrm{ord}_2\,g$, so we are not creating any ambiguity by omitting $e$ from the $f\;\mathrm{div}_2\;g$ and $f\;\mathrm{mod}_2\;g$ notation.

Proof. First note that the quotient $f/(g/2^e)$ is odd. Define $q=(f/(g/2^e))\bmod 2^{e+1}$; then $q\in\{1,3,5,\ldots,2^{e+1}-1\}$ and $qg/2^e-f\in 2^{e+1}\mathbb{Z}_2$ by construction. To see uniqueness, note that if $q'\in\{1,3,5,\ldots,2^{e+1}-1\}$ also satisfies $q'g/2^e-f\in 2^{e+1}\mathbb{Z}_2$, then $(q-q')g/2^e\in 2^{e+1}\mathbb{Z}_2$, so $q-q'\in 2^{e+1}\mathbb{Z}_2$ since $g/2^e\in\mathbb{Z}_2^*$, so $q\bmod 2^{e+1}=q'\bmod 2^{e+1}$, so $q=q'$.
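Theorem E.1 specializes to plain integers (an odd $f$ and a nonzero even $g$), and in that case $q$, $f\;\mathrm{div}_2\;g$, and $f\;\mathrm{mod}_2\;g$ are easy to compute. The following Python sketch (Python 3.8+ for the three-argument pow with exponent $-1$) is an illustration, not code from the paper.

    def div2_mod2(f, g):
        # f odd, g nonzero and even; returns (q, r) with q odd, 0 < q < 2^(e+1),
        # and r = f - q*(g/2^e) divisible by 2^(e+1), where e = ord_2(g).
        # Here q/2^e is f div2 g and r is f mod2 g in the notation of Theorem E.1.
        assert f % 2 == 1 and g != 0 and g % 2 == 0
        e = (g & -g).bit_length() - 1          # e = ord_2(g)
        g0 = g >> e                            # odd part of g, a unit
        q = (f * pow(g0, -1, 2**(e + 1))) % 2**(e + 1)
        r = f - q * g0
        assert q % 2 == 1 and r % 2**(e + 1) == 0
        return q, r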

Theorem E.2. Let $R_0$ be a nonzero element of $\mathbb{Z}_2$. Let $R_1$ be an element of $2\mathbb{Z}_2$. Recursively define $e_0,e_1,e_2,e_3,\ldots\in\mathbb{Z}$ and $R_2,R_3,\ldots\in 2\mathbb{Z}_2$ as follows:

• If $i\ge 0$ and $R_i\ne 0$ then $e_i=\mathrm{ord}_2\,R_i$.

• If $i\ge 1$ and $R_i\ne 0$ then $R_{i+1}=-((R_{i-1}/2^{e_{i-1}})\;\mathrm{mod}_2\;R_i)/2^{e_i}\in 2\mathbb{Z}_2$.

• If $i\ge 1$ and $R_i=0$ then $e_i,R_{i+1},e_{i+1},R_{i+2},e_{i+2},\ldots$ are undefined.

Then
$$\begin{pmatrix}R_i/2^{e_i}\\R_{i+1}\end{pmatrix}=\begin{pmatrix}0&1/2^{e_i}\\-1/2^{e_i}&((R_{i-1}/2^{e_{i-1}})\;\mathrm{div}_2\;R_i)/2^{e_i}\end{pmatrix}\begin{pmatrix}R_{i-1}/2^{e_{i-1}}\\R_i\end{pmatrix}$$
for each $i\ge 1$ where $R_{i+1}$ is defined.

Proof. We have $(R_{i-1}/2^{e_{i-1}})\;\mathrm{mod}_2\;R_i=R_{i-1}/2^{e_{i-1}}-((R_{i-1}/2^{e_{i-1}})\;\mathrm{div}_2\;R_i)R_i$. Negate and divide by $2^{e_i}$ to obtain the matrix equation.

Theorem E.3. Let $R_0$ be an odd element of $\mathbb{Z}$. Let $R_1$ be an element of $2\mathbb{Z}$. Define $e_0,e_1,e_2,e_3,\ldots$ and $R_2,R_3,\ldots$ as in Theorem E.2. Then $R_i\in 2\mathbb{Z}$ for each $i\ge 1$ where $R_i$ is defined. Furthermore, if $t\ge 0$ and $R_{t+1}=0$, then $|R_t/2^{e_t}|=\gcd\{R_0,R_1\}$.

Proof. Write $g=\gcd\{R_0,R_1\}\in\mathbb{Z}$. Then $g$ is odd since $R_0$ is odd.

We first show by induction on $i$ that $R_i\in g\mathbb{Z}$ for each $i\ge 0$. For $i\in\{0,1\}$ this follows from the definition of $g$. For $i\ge 2$: we have $R_{i-2}\in g\mathbb{Z}$ and $R_{i-1}\in g\mathbb{Z}$ by the inductive hypothesis. Now $(R_{i-2}/2^{e_{i-2}})\;\mathrm{div}_2\;R_{i-1}$ has the form $q/2^{e_{i-1}}$ for $q\in\mathbb{Z}$ by definition of $\mathrm{div}_2$, and $2^{2e_{i-1}+e_{i-2}}R_i=2^{e_{i-2}}qR_{i-1}-2^{e_{i-1}}R_{i-2}$ by definition of $R_i$. The right side of this equation is in $g\mathbb{Z}$, so $2^{2e_{i-1}+e_{i-2}}R_i$ is in $g\mathbb{Z}$. This implies, first, that $R_i\in\mathbb{Q}$; but also $R_i\in\mathbb{Z}_2$, so $R_i\in\mathbb{Z}$. This also implies that $R_i\in g\mathbb{Z}$, since $g$ is odd.

By construction $R_i\in 2\mathbb{Z}_2$ for each $i\ge 1$, and $R_i\in\mathbb{Z}$ for each $i\ge 0$, so $R_i\in 2\mathbb{Z}$ for each $i\ge 1$.

Finally, assume that $t\ge 0$ and $R_{t+1}=0$. Then all of $R_0,R_1,\ldots,R_t$ are nonzero. Write $g'=|R_t/2^{e_t}|$. Then $2^{e_t}g'=|R_t|\in g\mathbb{Z}$, and $g$ is odd, so $g'\in g\mathbb{Z}$.

The equation $2^{e_{i-1}}R_{i-2}=2^{e_{i-2}}qR_{i-1}-2^{2e_{i-1}+e_{i-2}}R_i$ shows that if $R_{i-1},R_i\in g'\mathbb{Z}$ then $R_{i-2}\in g'\mathbb{Z}$, since $g'$ is odd. By construction $R_t,R_{t+1}\in g'\mathbb{Z}$, so $R_{t-1},R_{t-2},\ldots,R_1,R_0\in g'\mathbb{Z}$. Hence $g=\gcd\{R_0,R_1\}\in g'\mathbb{Z}$. Both $g$ and $g'$ are positive, so $g'=g$.

E.4. The gcd algorithm. Here is the algorithm featured in this appendix.

Given an odd integer $R_0$ and an even integer $R_1$, compute the even integers $R_2,R_3,\ldots$ defined above. In other words, for each $i\ge 1$ where $R_i\ne 0$: compute $e_i=\mathrm{ord}_2\,R_i$; find $q\in\{1,3,5,\ldots,2^{e_i+1}-1\}$ such that $qR_i/2^{e_i}-R_{i-1}/2^{e_{i-1}}\in 2^{e_i+1}\mathbb{Z}$; and compute $R_{i+1}=(qR_i/2^{e_i}-R_{i-1}/2^{e_{i-1}})/2^{e_i}\in 2\mathbb{Z}$. If $R_{t+1}=0$ for some $t\ge 0$, output $|R_t/2^{e_t}|$ and stop.

We will show in Theorem F.26 that this algorithm terminates. The output is $\gcd\{R_0,R_1\}$ by Theorem E.3. This algorithm is closely related to iterating divstep, and we will use this relationship to analyze iterates of divstep; see Appendix G.
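The following Python sketch of this algorithm maintains the pair $(R_{i-1}/2^{e_{i-1}},R_i)$; the sample input at the end is an illustrative assumption (Python 3.8+ for the modular inverse).

    import math

    def gcd_odd_even(R0, R1):
        # R0 odd, R1 even; returns gcd(R0, R1) via the recursion of Theorem E.2
        assert R0 % 2 == 1 and R1 % 2 == 0
        prev, cur = R0, R1                         # prev = R_{i-1}/2^{e_{i-1}}, cur = R_i
        while cur != 0:
            e = (cur & -cur).bit_length() - 1      # e_i = ord_2(R_i)
            c = cur >> e                           # R_i/2^{e_i}, odd
            q = (prev * pow(c, -1, 2**(e + 1))) % 2**(e + 1)
            prev, cur = c, (q * c - prev) >> e     # R_{i+1} = (q R_i/2^{e_i} - R_{i-1}/2^{e_{i-1}})/2^{e_i}
        return abs(prev)                           # |R_t/2^{e_t}|

    assert gcd_odd_even(273, 220) == math.gcd(273, 220) == 1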

More generally, if $R_0$ is an odd integer and $R_1$ is any integer, one can compute $\gcd\{R_0,R_1\}$ by using the above algorithm to compute $\gcd\{R_0,2R_1\}$. Even more generally, one can compute the gcd of any two integers by first handling the case $\gcd\{0,0\}=0$, then handling powers of $2$ in both inputs, then applying the odd-input algorithm.

E.5. A non-functioning variant. Imagine replacing the subtractions above with additions: find $q\in\{1,3,5,\ldots,2^{e_i+1}-1\}$ such that $qR_i/2^{e_i}+R_{i-1}/2^{e_{i-1}}\in 2^{e_i+1}\mathbb{Z}$, and compute $R_{i+1}=(qR_i/2^{e_i}+R_{i-1}/2^{e_{i-1}})/2^{e_i}\in 2\mathbb{Z}$. If $R_0$ and $R_1$ are positive then all $R_i$ are positive, so the algorithm does not terminate. This non-terminating algorithm is related to posdivstep (see Section 8.4) in the same way that the algorithm above is related to divstep.

Daireaux, Maume-Deschamps, and Vallée in [31, Section 6] mentioned this algorithm as a "non-centered LSB algorithm", said that one can easily add a "supplementary stopping condition" to ensure termination, and dismissed the resulting algorithm as being "certainly slower than the centered version".

E.6. A centered variant: the Stehlé–Zimmermann algorithm. Imagine using centered remainders $\{1-2^e,\ldots,-3,-1,1,3,\ldots,2^e-1\}$ modulo $2^{e+1}$, rather than uncentered remainders $\{1,3,5,\ldots,2^{e+1}-1\}$. The above gcd algorithm then turns into the Stehlé–Zimmermann gcd algorithm from [62].

Specifically, Stehlé and Zimmermann begin with integers $a,b$ having $\mathrm{ord}_2\,a<\mathrm{ord}_2\,b$. If $b\ne 0$ then "generalized binary division" in [62, Lemma 1] defines $q$ as the unique odd integer with $|q|<2^{\mathrm{ord}_2 b-\mathrm{ord}_2 a}$ such that $r=a+qb/2^{\mathrm{ord}_2 b-\mathrm{ord}_2 a}$ is divisible by $2^{\mathrm{ord}_2 b+1}$. The gcd algorithm in [62, Algorithm Elementary-GB] repeatedly sets $(a,b)\leftarrow(b,r)$. When $b=0$, the algorithm outputs the odd part of $a$.

It is equivalent to set $(a,b)\leftarrow(b/2^{\mathrm{ord}_2 b},r/2^{\mathrm{ord}_2 b})$, since this simply divides all subsequent results by $2^{\mathrm{ord}_2 b}$. In other words, starting from an odd $a_0$ and an even $b_0$, recursively define an odd $a_{i+1}$ and an even $b_{i+1}$ by the following formulas, stopping when $b_i=0$:

• $a_{i+1}=b_i/2^e$ where $e=\mathrm{ord}_2\,b_i$;

• $b_{i+1}=(a_i+qb_i/2^e)/2^e\in 2\mathbb{Z}$ for a unique $q\in\{1-2^e,\ldots,-3,-1,1,3,\ldots,2^e-1\}$.

Now relabel $a_0,b_0,b_1,b_2,b_3,\ldots$ as $R_0,R_1,R_2,R_3,R_4,\ldots$ to obtain the following recursive definition: $R_{i+1}=(qR_i/2^{e_i}+R_{i-1}/2^{e_{i-1}})/2^{e_i}\in 2\mathbb{Z}$, and thus
$$\begin{pmatrix}R_i/2^{e_i}\\R_{i+1}\end{pmatrix}=\begin{pmatrix}0&1/2^{e_i}\\1/2^{e_i}&q/2^{2e_i}\end{pmatrix}\begin{pmatrix}R_{i-1}/2^{e_{i-1}}\\R_i\end{pmatrix}$$
for a unique $q\in\{1-2^{e_i},\ldots,-3,-1,1,3,\ldots,2^{e_i}-1\}$.

To summarize: starting from the Stehlé–Zimmermann algorithm, one can obtain the gcd algorithm featured in this appendix by making the following three changes. First, divide out $2^{e_i}$ instead of allowing powers of $2$ to accumulate. Second, add $R_{i-1}/2^{e_{i-1}}$ instead of subtracting it; this does not affect the number of iterations for centered $q$. Third, switch from centered $q$ to uncentered $q$.

Intuitively, centered remainders are smaller than uncentered remainders, and it is well known that centered remainders improve the worst-case performance of Euclid's algorithm, so one might expect that the Stehlé–Zimmermann algorithm performs better than our uncentered variant. However, as noted earlier, our analysis produces the opposite conclusion. See Appendix F.

F   Performance of the gcd algorithm for integers

The possible matrices in Theorem E.2 are
$$\begin{pmatrix}0&1/2\\-1/2&1/4\end{pmatrix},\ \begin{pmatrix}0&1/2\\-1/2&3/4\end{pmatrix}\quad\text{for }e_i=1;$$
$$\begin{pmatrix}0&1/4\\-1/4&1/16\end{pmatrix},\ \begin{pmatrix}0&1/4\\-1/4&3/16\end{pmatrix},\ \begin{pmatrix}0&1/4\\-1/4&5/16\end{pmatrix},\ \begin{pmatrix}0&1/4\\-1/4&7/16\end{pmatrix}\quad\text{for }e_i=2;$$
etc. In this appendix we show that a product of these matrices shrinks the input by a factor exponential in the weight $\sum e_i$. This implies that $\sum e_i$ in the algorithm of Appendix E is $O(b)$, where $b$ is the number of input bits.

F.1. Lower bounds. It is easy to see that each of these matrices has two eigenvalues of absolute value $1/2^{e_i}$. This does not imply, however, that a weight-$w$ product of these matrices shrinks the input by a factor $1/2^w$. Concretely, the weight-7 product
$$\frac{1}{4^7}\begin{pmatrix}328&-380\\-158&233\end{pmatrix}=\begin{pmatrix}0&1/4\\-1/4&1/16\end{pmatrix}\begin{pmatrix}0&1/2\\-1/2&3/4\end{pmatrix}^2\begin{pmatrix}0&1/2\\-1/2&1/4\end{pmatrix}\begin{pmatrix}0&1/2\\-1/2&3/4\end{pmatrix}^2$$
of 6 matrices has spectral radius (maximum absolute value of eigenvalues)
$$(561+\sqrt{249185})/2^{15}=0.03235425826\ldots=0.61253803919\ldots^7=1/2^{4.949900586\ldots}.$$
The $(w/7)$th power of this matrix is expressed as a weight-$w$ product and has spectral radius $1/2^{0.7071286552\ldots w}$. Consequently this proof strategy cannot obtain an $O$ constant better than $7/(15-\log_2(561+\sqrt{249185}))=1.414169815\ldots$. We prove a bound twice the weight on the number of divstep iterations (see Appendix G), so this proof strategy cannot obtain an $O$ constant better than $14/(15-\log_2(561+\sqrt{249185}))=2.828339631\ldots$ for the number of divstep iterations.

Rotating the list of 6 matrices shown above gives 6 weight-7 products with the same spectral radius. In a small search we did not find any weight-$w$ matrices with spectral radius above $0.61253803919\ldots^w$. Our upper bounds (see below) are $0.6181640625^w$ for all large $w$.

For comparison, the non-terminating variant in Appendix E.5 allows the matrix $\begin{pmatrix}0&1/2\\1/2&3/4\end{pmatrix}$, which has spectral radius $1$, with eigenvector $\begin{pmatrix}1\\2\end{pmatrix}$. The Stehlé–Zimmermann algorithm reviewed in Appendix E.6 allows the matrix $\begin{pmatrix}0&1/2\\1/2&1/4\end{pmatrix}$, which has spectral radius $(1+\sqrt{17})/8=0.6403882032\ldots$, so this proof strategy cannot obtain an $O$ constant better than $2/(\log_2(\sqrt{17}-1)-1)=3.110510062\ldots$ for the number of cdivstep iterations.
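The weight-7 product above is easy to reproduce. The following Sage sketch (an illustration, not one of the paper's own verification scripts) rebuilds it from the matrices listed at the start of this appendix (the $M(e,q)$ of Definition F.15 below) and prints its spectral radius.

    R = MatrixSpace(QQ, 2)
    def M(e, q): return R([0, 1/2**e, -1/2**e, q/2**(2*e)])

    P = M(2, 1) * M(1, 3)**2 * M(1, 1) * M(1, 3)**2     # total weight 2+1+1+1+1+1 = 7
    assert 4**7 * P == R([328, -380, -158, 233])
    print(max(abs(z) for z in P.eigenvalues()).n())     # 0.0323542... = 0.6125380...^7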

F.2. A warning about worst-case inputs. Define $M$ as the matrix $4^{-7}\begin{pmatrix}328&-380\\-158&233\end{pmatrix}$ shown above. Then $M^{-3}=\begin{pmatrix}60321097&113368060\\47137246&88663112\end{pmatrix}$. Applying our gcd algorithm to inputs $(60321097,47137246)$ applies a series of 18 matrices of total weight 21: in the notation of Theorem E.2, the matrix $\begin{pmatrix}0&1/2\\-1/2&3/4\end{pmatrix}$ twice, then the matrix $\begin{pmatrix}0&1/2\\-1/2&1/4\end{pmatrix}$, then the matrix $\begin{pmatrix}0&1/2\\-1/2&3/4\end{pmatrix}$ twice, then the weight-2 matrix $\begin{pmatrix}0&1/4\\-1/4&1/16\end{pmatrix}$; all of these matrices again; and all of these matrices a third time.

One might think that these are worst-case inputs to our algorithm, analogous to a standard matrix construction of Fibonacci numbers as worst-case inputs to Euclid's algorithm. We emphasize that these are not worst-case inputs: the inputs $(22293,-128330)$ are much smaller and nevertheless require 19 steps, of total weight 23. We now explain why the analogy fails.

The matrix $M^3$ has an eigenvector $(1,-0.53\ldots)$ with eigenvalue $1/2^{21\cdot 0.707\ldots}=1/2^{14.8\ldots}$ and an eigenvector $(1,0.78\ldots)$ with eigenvalue $1/2^{21\cdot 1.29\ldots}=1/2^{27.1\ldots}$, so it has spectral radius around $1/2^{14.8}$. Our proof strategy thus cannot guarantee a gain of more than 14.8 bits from weight 21. What we actually show below is that weight 21 is guaranteed to gain more than 13.97 bits. (The quantity $\alpha_{21}$ in Figure F.14 is $133569/2^{31}=2^{-13.972\ldots}$.)

The matrix $M^{-3}$ has spectral radius $2^{27.1\ldots}$. It is not surprising that each column of $M^{-3}$ is close to 27 bits in size: the only way to avoid this would be for the other eigenvector to be pointing in almost exactly the same direction as $(1,0)$ or $(0,1)$. For our inputs, taken from the first column of $M^{-3}$, the algorithm gains nearly 27 bits from weight 21, almost double the guaranteed gain. In short, these are good inputs, not bad inputs.

Now replace $M$ with the matrix $F=\begin{pmatrix}0&1\\1&-1\end{pmatrix}$, which reduces worst-case inputs to Euclid's algorithm. The eigenvalues of $F$ are $(\sqrt{5}-1)/2=1/2^{0.69\ldots}$ and $-(\sqrt{5}+1)/2=-2^{0.69\ldots}$. The entries of powers of $F^{-1}$ are Fibonacci numbers, again with sizes well predicted by the spectral radius of $F^{-1}$, namely $2^{0.69\ldots}$. However, the spectral radius of $F$ is larger than 1. The reason that this large eigenvalue does not spoil the termination of Euclid's algorithm is that Euclid's algorithm inspects sizes and chooses quotients to decrease sizes at each step.

Stehlé and Zimmermann in [62, Section 6.2] measure the performance of their algorithm for "worst case" inputs^18 obtained from negative powers of the matrix $\begin{pmatrix}0&1/2\\1/2&1/4\end{pmatrix}$. The statement that these inputs are "worst case" is not justified. On the contrary, if these inputs have $b$ bits then they take $(0.73690\ldots+o(1))b$ steps of the Stehlé–Zimmermann algorithm and thus $(1.4738\ldots+o(1))b$ cdivsteps, which is considerably fewer steps than many other inputs that we have tried^19 for various values of $b$. The quantity $0.73690\ldots$ here is $\log(2)/\log((\sqrt{17}+1)/2)$, coming from the smaller eigenvalue $-2/(\sqrt{17}+1)=-0.39038\ldots$ of the above matrix.

^18 Specifically, Stehlé and Zimmermann claim that "the worst case of the binary variant" (i.e., of their gcd algorithm) is for inputs "$G_n$ and $2G_{n-1}$, where $G_0=0$, $G_1=1$, $G_n=-G_{n-1}+4G_{n-2}$, which gives all binary quotients equal to 1". Daireaux, Maume-Deschamps, and Vallée in [31, Section 2.3] state that Stehlé and Zimmermann "exhibited the precise worst-case number of iterations" and give an equivalent characterization of this case.

^19 For example, "$G_n$" from [62] is $-5467345$ for $n=18$, and $2135149$ for $n=17$. The inputs $(-5467345,2\cdot 2135149)$ use 17 iterations of the "GB" divisions from [62], while the smaller inputs $(-1287513,2\cdot 622123)$ use 24 "GB" iterations. The inputs $(1,-5467345,2135149)$ use 34 cdivsteps; the inputs $(1,-2718281,3141592)$ use 45 cdivsteps; the inputs $(1,-1287513,622123)$ use 58 cdivsteps.

F.3. Norms. We use the standard definitions of the 2-norm of a real vector and the 2-norm of a real matrix. We summarize the basic theory of 2-norms here to keep this paper self-contained.

Definition F.4. If $v\in\mathbb{R}^2$ then $|v|_2$ is defined as $\sqrt{v_1^2+v_2^2}$, where $v=(v_1,v_2)$.

Definition F.5. If $P\in M_2(\mathbb{R})$ then $|P|_2$ is defined as $\max\{|Pv|_2:v\in\mathbb{R}^2,\,|v|_2=1\}$.

This maximum exists because $\{v\in\mathbb{R}^2:|v|_2=1\}$ is a compact set (namely the unit circle) and $|Pv|_2$ is a continuous function of $v$. See also Theorem F.11 for an explicit formula for $|P|_2$.

Beware that the literature sometimes uses the notation $|P|_2$ for the entrywise 2-norm $\sqrt{P_{11}^2+P_{12}^2+P_{21}^2+P_{22}^2}$. We do not use entrywise norms of matrices in this paper.

Theorem F.6. Let $v$ be an element of $\mathbb{Z}^2$. If $v\ne 0$ then $|v|_2\ge 1$.

Proof. Write $v$ as $(v_1,v_2)$ with $v_1,v_2\in\mathbb{Z}$. If $v_1\ne 0$ then $v_1^2\ge 1$, so $v_1^2+v_2^2\ge 1$. If $v_2\ne 0$ then $v_2^2\ge 1$, so $v_1^2+v_2^2\ge 1$. Either way $|v|_2=\sqrt{v_1^2+v_2^2}\ge 1$.

Theorem F.7. Let $v$ be an element of $\mathbb{R}^2$. Let $r$ be an element of $\mathbb{R}$. Then $|rv|_2=|r||v|_2$.

Proof. Write $v$ as $(v_1,v_2)$ with $v_1,v_2\in\mathbb{R}$. Then $rv=(rv_1,rv_2)$, so $|rv|_2=\sqrt{r^2v_1^2+r^2v_2^2}=\sqrt{r^2}\sqrt{v_1^2+v_2^2}=|r||v|_2$.


Theorem F.8. Let $P$ be an element of $M_2(\mathbb{R})$. Let $r$ be an element of $\mathbb{R}$. Then $|rP|_2=|r||P|_2$.

Proof. By definition $|rP|_2=\max\{|rPv|_2:v\in\mathbb{R}^2,\,|v|_2=1\}$. By Theorem F.7 this is the same as $\max\{|r||Pv|_2:\ldots\}=|r|\max\{|Pv|_2:\ldots\}=|r||P|_2$.

Theorem F.9. Let $P$ be an element of $M_2(\mathbb{R})$. Let $v$ be an element of $\mathbb{R}^2$. Then $|Pv|_2\le|P|_2|v|_2$.

Proof. If $v=0$ then $Pv=0$ and $|Pv|_2=0=|P|_2|v|_2$. Otherwise $|v|_2\ne 0$. Write $w=v/|v|_2$. Then $|w|_2=1$, so $|Pw|_2\le|P|_2$, so $|Pv|_2=|Pw|_2|v|_2\le|P|_2|v|_2$.

Theorem F.10. Let $P,Q$ be elements of $M_2(\mathbb{R})$. Then $|PQ|_2\le|P|_2|Q|_2$.

Proof. If $v\in\mathbb{R}^2$ and $|v|_2=1$ then $|PQv|_2\le|P|_2|Qv|_2\le|P|_2|Q|_2|v|_2$.

Theorem F.11. Let $P$ be an element of $M_2(\mathbb{R})$. Assume that $P^*P=\begin{pmatrix}a&b\\c&d\end{pmatrix}$, where $P^*$ is the transpose of $P$. Then
$$|P|_2=\sqrt{\frac{a+d+\sqrt{(a-d)^2+4b^2}}{2}}.$$

Proof. $P^*P\in M_2(\mathbb{R})$ is its own transpose, so, by the spectral theorem, it has two orthonormal eigenvectors with real eigenvalues. In other words, $\mathbb{R}^2$ has a basis $e_1,e_2$ such that

• $|e_1|_2=1$;

• $|e_2|_2=1$;

• the dot product $e_1^*e_2$ is $0$ (here we view vectors as column matrices);

• $P^*Pe_1=\lambda_1e_1$ for some $\lambda_1\in\mathbb{R}$; and

• $P^*Pe_2=\lambda_2e_2$ for some $\lambda_2\in\mathbb{R}$.

Any $v\in\mathbb{R}^2$ with $|v|_2=1$ can be written uniquely as $c_1e_1+c_2e_2$ for some $c_1,c_2\in\mathbb{R}$ with $c_1^2+c_2^2=1$: explicitly, $c_1=v^*e_1$ and $c_2=v^*e_2$. Then $P^*Pv=\lambda_1c_1e_1+\lambda_2c_2e_2$.

By definition $|P|_2^2$ is the maximum of $|Pv|_2^2$ over $v\in\mathbb{R}^2$ with $|v|_2=1$. Note that $|Pv|_2^2=(Pv)^*(Pv)=v^*P^*Pv=\lambda_1c_1^2+\lambda_2c_2^2$ with $c_1,c_2$ defined as above. For example, $0\le|Pe_1|_2^2=\lambda_1$ and $0\le|Pe_2|_2^2=\lambda_2$. Thus $|Pv|_2^2$ achieves $\max\{\lambda_1,\lambda_2\}$. It cannot be larger than this: if $c_1^2+c_2^2=1$ then $\lambda_1c_1^2+\lambda_2c_2^2\le\max\{\lambda_1,\lambda_2\}$. Hence $|P|_2^2=\max\{\lambda_1,\lambda_2\}$.

To finish the proof, we make the spectral theorem explicit for the matrix $P^*P=\begin{pmatrix}a&b\\c&d\end{pmatrix}\in M_2(\mathbb{R})$. Note that $(P^*P)^*=P^*P$, so $b=c$. The characteristic polynomial of this matrix is $(x-a)(x-d)-b^2=x^2-(a+d)x+ad-b^2$, with roots
$$\frac{a+d\pm\sqrt{(a+d)^2-4(ad-b^2)}}{2}=\frac{a+d\pm\sqrt{(a-d)^2+4b^2}}{2}.$$
The largest eigenvalue of $P^*P$ is thus $(a+d+\sqrt{(a-d)^2+4b^2})/2$, and $|P|_2$ is the square root of this.

Theorem F.12. Let $P$ be an element of $M_2(\mathbb{R})$. Assume that $P^*P=\begin{pmatrix}a&b\\c&d\end{pmatrix}$, where $P^*$ is the transpose of $P$. Let $N$ be an element of $\mathbb{R}$. Then $N\ge|P|_2$ if and only if $N\ge 0$, $2N^2\ge a+d$, and $(2N^2-a-d)^2\ge(a-d)^2+4b^2$.


    earlybounds = [...]   # 68 exact rational constants alpha_0, ..., alpha_67 (not reproduced here)

    def alpha(w):
        if w >= 67: return (633/1024)^w
        return earlybounds[w]

    assert all(alpha(w)^49 < 2^(-(34*w-23)) for w in range(31, 100))
    assert min((633/1024)^w/alpha(w) for w in range(68)) == 633^5/(2^30*165219)

Figure F.14: Definition of $\alpha_w$ for $w\in\{0,1,2,\ldots\}$; computer verification that $\alpha_w<2^{-(34w-23)/49}$ for $31\le w\le 99$; and computer verification that $\min\{(633/1024)^w/\alpha_w:0\le w\le 67\}=633^5/(2^{30}\cdot 165219)$. If $w\ge 67$ then $\alpha_w=(633/1024)^w$.

Our upper-bound computations work with matrices with rational entries. We use Theorem F.12 to compare the 2-norms of these matrices to various rational bounds $N$. All of the computations here use exact arithmetic, avoiding the correctness questions that would have shown up if we had used approximations to the square roots in Theorem F.11. Sage includes exact representations of algebraic numbers such as $|P|_2$, but switching to Theorem F.12 made our upper-bound computations an order of magnitude faster.

Proof. Assume that $N\ge 0$, that $2N^2\ge a+d$, and that $(2N^2-a-d)^2\ge(a-d)^2+4b^2$. Then $2N^2-a-d\ge\sqrt{(a-d)^2+4b^2}$ since $2N^2-a-d\ge 0$, so $N^2\ge(a+d+\sqrt{(a-d)^2+4b^2})/2$, so $N\ge\sqrt{(a+d+\sqrt{(a-d)^2+4b^2})/2}$ since $N\ge 0$, so $N\ge|P|_2$ by Theorem F.11.

Conversely, assume that $N\ge|P|_2$. Then $N\ge 0$. Also $N^2\ge|P|_2^2$, so $2N^2\ge 2|P|_2^2=a+d+\sqrt{(a-d)^2+4b^2}$ by Theorem F.11. In particular $2N^2\ge a+d$ and $(2N^2-a-d)^2\ge(a-d)^2+4b^2$.

F.13. Upper bounds. We now show that every weight-$w$ product of the matrices in Theorem E.2 has norm at most $\alpha_w$. Here $\alpha_0,\alpha_1,\alpha_2,\ldots$ is an exponentially decreasing sequence of positive real numbers defined in Figure F.14.

Definition F.15. For $e\in\{1,2,\ldots\}$ and $q\in\{1,3,5,\ldots,2^{e+1}-1\}$, define
$$M(e,q)=\begin{pmatrix}0&1/2^e\\-1/2^e&q/2^{2e}\end{pmatrix}.$$


Theorem F.16. $|M(e,q)|_2<(1+\sqrt{2})/2^e$ if $e\in\{1,2,\ldots\}$ and $q\in\{1,3,5,\ldots,2^{e+1}-1\}$.

Proof. Define $\begin{pmatrix}a&b\\c&d\end{pmatrix}=M(e,q)^*M(e,q)$. Then $a=1/2^{2e}$; $b=c=-q/2^{3e}$; and $d=1/2^{2e}+q^2/2^{4e}$, so $a+d=2/2^{2e}+q^2/2^{4e}<6/2^{2e}$ and $(a-d)^2+4b^2=4q^2/2^{6e}+q^4/2^{8e}\le 32/2^{4e}$. By Theorem F.11,
$$|M(e,q)|_2=\sqrt{\frac{a+d+\sqrt{(a-d)^2+4b^2}}{2}}\le\sqrt{\frac{6/2^{2e}+\sqrt{32}/2^{2e}}{2}}=\frac{1+\sqrt{2}}{2^e}.$$

Definition F.17. Define $\beta_w=\inf\{\alpha_{w+j}/\alpha_j:j\ge 0\}$ for $w\in\{0,1,2,\ldots\}$.

Theorem F.18. If $w\in\{0,1,2,\ldots\}$ then $\beta_w=\min\{\alpha_{w+j}/\alpha_j:0\le j\le 67\}$. If $w\ge 67$ then $\beta_w=(633/1024)^w\cdot 633^5/(2^{30}\cdot 165219)$.

The first statement reduces the computation of $\beta_w$ to a finite computation. The computation can be further optimized, but is not a bottleneck for us. Similar comments apply to $\gamma_{w,e}$ in Theorem F.20.

Proof. By definition $\alpha_w=(633/1024)^w$ for all $w\ge 67$. If $j\ge 67$ then also $w+j\ge 67$, so $\alpha_{w+j}/\alpha_j=(633/1024)^{w+j}/(633/1024)^j=(633/1024)^w$. In short, all terms $\alpha_{w+j}/\alpha_j$ for $j\ge 67$ are identical, so the infimum of $\alpha_{w+j}/\alpha_j$ for $j\ge 0$ is the same as the minimum for $0\le j\le 67$.

If $w\ge 67$ then $\alpha_{w+j}/\alpha_j=(633/1024)^{w+j}/\alpha_j$. The quotient $(633/1024)^j/\alpha_j$ for $0\le j\le 67$ has minimum value $633^5/(2^{30}\cdot 165219)$.

Definition F.19. Define $\gamma_{w,e}=\inf\{\beta_{w+j}2^j\cdot 70/169:j\ge e\}$ for $w\in\{0,1,2,\ldots\}$ and $e\in\{1,2,3,\ldots\}$.

This quantity $\gamma_{w,e}$ is designed to be as large as possible subject to two conditions: first, $\gamma_{w,e}\le\beta_{w+e}2^e\cdot 70/169$, used in Theorem F.21; second, $\gamma_{w,e}\le\gamma_{w,e+1}$, used in Theorem F.22.

Theorem F.20. $\gamma_{w,e}=\min\{\beta_{w+j}2^j\cdot 70/169:e\le j\le e+67\}$ if $w\in\{0,1,2,\ldots\}$ and $e\in\{1,2,3,\ldots\}$. If $w+e\ge 67$ then $\gamma_{w,e}=2^e(633/1024)^{w+e}(70/169)\cdot 633^5/(2^{30}\cdot 165219)$.

Proof. If $w+j\ge 67$ then $\beta_{w+j}=(633/1024)^{w+j}\cdot 633^5/(2^{30}\cdot 165219)$, so $\beta_{w+j}2^j\cdot 70/169=(633/1024)^w(633/512)^j(70/169)\cdot 633^5/(2^{30}\cdot 165219)$, which increases monotonically with $j$.

In particular, if $j\ge e+67$ then $w+j\ge 67$, so $\beta_{w+j}2^j\cdot 70/169\ge\beta_{w+e+67}2^{e+67}\cdot 70/169$. Hence $\gamma_{w,e}=\inf\{\beta_{w+j}2^j\cdot 70/169:j\ge e\}=\min\{\beta_{w+j}2^j\cdot 70/169:e\le j\le e+67\}$. Also, if $w+e\ge 67$ then $w+j\ge 67$ for all $j\ge e$, so
$$\gamma_{w,e}=\left(\frac{633}{1024}\right)^w\left(\frac{633}{512}\right)^e\frac{70}{169}\cdot\frac{633^5}{2^{30}\cdot 165219}=2^e\left(\frac{633}{1024}\right)^{w+e}\frac{70}{169}\cdot\frac{633^5}{2^{30}\cdot 165219}$$
as claimed.

Theorem F.21. Define $S$ as the smallest subset of $\mathbb{Z}\times M_2(\mathbb{Q})$ such that

• $\left(0,\begin{pmatrix}1&0\\0&1\end{pmatrix}\right)\in S$; and

• if $(w,P)\in S$, and $w=0$ or $|P|_2>\beta_w$, and $e\in\{1,2,3,\ldots\}$ and $|P|_2>\gamma_{w,e}$ and $q\in\{1,3,5,\ldots,2^{e+1}-1\}$, then $(e+w,M(e,q)P)\in S$.

Assume that $|P|_2\le\alpha_w$ for each $(w,P)\in S$. Then $|M(e_j,q_j)\cdots M(e_1,q_1)|_2\le\alpha_{e_1+\cdots+e_j}$ for all $j\in\{0,1,2,\ldots\}$, all $e_1,\ldots,e_j\in\{1,2,3,\ldots\}$, and all $q_1,\ldots,q_j\in\mathbb{Z}$ such that $q_i\in\{1,3,5,\ldots,2^{e_i+1}-1\}$ for each $i$.


The idea here is that instead of searching through all products $P$ of matrices $M(e,q)$ of total weight $w$ to see whether $|P|_2\le\alpha_w$, we prune the search in a way that makes the search finite across all $w$ (see Theorem F.22), while still being sure to find a minimal counterexample if one exists. The first layer of pruning skips all descendants $QP$ of $P$ if $w>0$ and $|P|_2\le\beta_w$; the point is that any such counterexample $QP$ implies a smaller counterexample $Q$. The second layer of pruning skips weight-$(w+e)$ children $M(e,q)P$ of $P$ if $|P|_2\le\gamma_{w,e}$; the point is that any such child is eliminated by the first layer of pruning.

Proof. Induct on $j$.

If $j=0$ then $M(e_j,q_j)\cdots M(e_1,q_1)=\begin{pmatrix}1&0\\0&1\end{pmatrix}$, with norm $1=\alpha_0$. Assume from now on that $j\ge 1$.

Case 1: $|M(e_i,q_i)\cdots M(e_1,q_1)|_2\le\beta_{e_1+\cdots+e_i}$ for some $i\in\{1,2,\ldots,j\}$.

Then $j-i<j$, so $|M(e_j,q_j)\cdots M(e_{i+1},q_{i+1})|_2\le\alpha_{e_{i+1}+\cdots+e_j}$ by the inductive hypothesis, so $|M(e_j,q_j)\cdots M(e_1,q_1)|_2\le\beta_{e_1+\cdots+e_i}\alpha_{e_{i+1}+\cdots+e_j}$ by Theorem F.10; but $\beta_{e_1+\cdots+e_i}\alpha_{e_{i+1}+\cdots+e_j}\le\alpha_{e_1+\cdots+e_j}$ by definition of $\beta$.

Case 2: $|M(e_i,q_i)\cdots M(e_1,q_1)|_2>\beta_{e_1+\cdots+e_i}$ for every $i\in\{1,2,\ldots,j\}$.

Suppose that $|M(e_{i-1},q_{i-1})\cdots M(e_1,q_1)|_2\le\gamma_{e_1+\cdots+e_{i-1},e_i}$ for some $i\in\{1,2,\ldots,j\}$. By Theorem F.16, $|M(e_i,q_i)|_2<(1+\sqrt{2})/2^{e_i}<(169/70)/2^{e_i}$. By Theorem F.10,
$$|M(e_i,q_i)\cdots M(e_1,q_1)|_2\le|M(e_i,q_i)|_2\,\gamma_{e_1+\cdots+e_{i-1},e_i}\le\frac{169\,\gamma_{e_1+\cdots+e_{i-1},e_i}}{70\cdot 2^{e_i}}\le\beta_{e_1+\cdots+e_i},$$
contradiction. Thus $|M(e_{i-1},q_{i-1})\cdots M(e_1,q_1)|_2>\gamma_{e_1+\cdots+e_{i-1},e_i}$ for all $i\in\{1,2,\ldots,j\}$.

Now $(e_1+\cdots+e_i,M(e_i,q_i)\cdots M(e_1,q_1))\in S$ for each $i\in\{0,1,2,\ldots,j\}$. Indeed, induct on $i$. The base case $i=0$ is simply $\left(0,\begin{pmatrix}1&0\\0&1\end{pmatrix}\right)\in S$. If $i\ge 1$ then $(w,P)\in S$ by the inductive hypothesis, where $w=e_1+\cdots+e_{i-1}$ and $P=M(e_{i-1},q_{i-1})\cdots M(e_1,q_1)$. Furthermore:

• $w=0$ (when $i=1$) or $|P|_2>\beta_w$ (when $i\ge 2$, by definition of this case);

• $e_i\in\{1,2,3,\ldots\}$;

• $|P|_2>\gamma_{w,e_i}$ (as shown above); and

• $q_i\in\{1,3,5,\ldots,2^{e_i+1}-1\}$.

Hence $(e_1+\cdots+e_i,M(e_i,q_i)\cdots M(e_1,q_1))=(e_i+w,M(e_i,q_i)P)\in S$ by definition of $S$.

In particular $(w,P)\in S$ with $w=e_1+\cdots+e_j$ and $P=M(e_j,q_j)\cdots M(e_1,q_1)$, so $|P|_2\le\alpha_w$ by hypothesis.

Theorem F.22. In Theorem F.21, $S$ is finite, and $|P|_2\le\alpha_w$ for each $(w,P)\in S$.

Proof. This is a computer proof. We run the Sage script in Figure F.23. We observe that it terminates successfully (in slightly over 2546 minutes using Sage 8.6 on one core of a 3.5GHz Intel Xeon E3-1275 v3 CPU) and prints 3787975117.

What the script does is search depth-first through the elements of $S$, verifying that each $(w,P)\in S$ satisfies $|P|_2\le\alpha_w$. The output is the number of elements searched, so $S$ has at most 3787975117 elements.

Specifically, if $(w,P)\in S$, then verify(w,P) checks that $|P|_2\le\alpha_w$ and recursively calls verify on all descendants of $(w,P)$. Actually, for efficiency, the script scales each matrix to have integer entries: it works with $2^{2e}M(e,q)$ (labeled scaledM(e,q) in the script) instead of $M(e,q)$, and it works with $4^wP$ (labeled P in the script) instead of $P$.

The script computes $\beta_w$ as shown in Theorem F.18, computes $\gamma_{w,e}$ as shown in Theorem F.20, and compares $|P|_2$ to various bounds as shown in Theorem F.12.

If $w>0$ and $|P|_2\le\beta_w$ then verify(w,P) returns. There are no descendants of $(w,P)$ in this case. Also $|P|_2\le\alpha_w$ since $\beta_w\le\alpha_w/\alpha_0=\alpha_w$.


    from alpha import alpha
    from memoized import memoized

    R = MatrixSpace(ZZ, 2)
    def scaledM(e, q): return R((0, 2^e, -2^e, q))

    @memoized
    def beta(w): return min(alpha(w+j)/alpha(j) for j in range(68))

    @memoized
    def gamma(w, e): return min(beta(w+j)*2^j*70/169 for j in range(e, e+68))

    def spectralradiusisatmost(PP, N):
      # assuming PP has form P.transpose()*P
      (a, b), (c, d) = PP
      X = 2*N^2 - a - d
      return N >= 0 and X >= 0 and X^2 >= (a-d)^2 + 4*b^2

    def verify(w, P):
      nodes = 1
      PP = P.transpose()*P
      if w > 0 and spectralradiusisatmost(PP, 4^w*beta(w)): return nodes
      assert spectralradiusisatmost(PP, 4^w*alpha(w))
      for e in PositiveIntegers():
        if spectralradiusisatmost(PP, 4^w*gamma(w, e)): return nodes
        for q in range(1, 2^(e+1), 2):
          nodes += verify(e+w, scaledM(e, q)*P)

    print(verify(0, R(1)))

Figure F.23: Sage script verifying $|P|_2\le\alpha_w$ for each $(w,P)\in S$. Output is the number of pairs $(w,P)$ considered if verification succeeds, or an assertion failure if verification fails.

Otherwise verify(w,P) asserts that $|P|_2\le\alpha_w$: i.e., it checks that $|P|_2\le\alpha_w$, and raises an exception if not. It then tries successively $e=1$, $e=2$, $e=3$, etc., continuing as long as $|P|_2>\gamma_{w,e}$. For each $e$ where $|P|_2>\gamma_{w,e}$, verify(w,P) recursively handles $(e+w,M(e,q)P)$ for each $q\in\{1,3,5,\ldots,2^{e+1}-1\}$.

Note that $\gamma_{w,e}\le\gamma_{w,e+1}\le\cdots$, so if $|P|_2\le\gamma_{w,e}$ then also $|P|_2\le\gamma_{w,e+1}$, etc. This is where it is important that $\gamma_{w,e}$ is not simply defined as $\beta_{w+e}2^e\cdot 70/169$.

To summarize, verify(w,P) recursively enumerates all descendants of $(w,P)$. Hence verify(0,R(1)) recursively enumerates all elements $(w,P)$ of $S$, checking $|P|_2\le\alpha_w$ for each $(w,P)$.

Theorem F.24. If $j\in\{0,1,2,\ldots\}$, $e_1,\ldots,e_j\in\{1,2,3,\ldots\}$, $q_1,\ldots,q_j\in\mathbb{Z}$, and $q_i\in\{1,3,5,\ldots,2^{e_i+1}-1\}$ for each $i$, then $|M(e_j,q_j)\cdots M(e_1,q_1)|_2\le\alpha_{e_1+\cdots+e_j}$.

Proof. This is the conclusion of Theorem F.21, so it suffices to check the assumptions. Define $S$ as in Theorem F.21; then $|P|_2\le\alpha_w$ for each $(w,P)\in S$ by Theorem F.22.

Daniel J Bernstein and Bo-Yin Yang 55

Theorem F25 In the situation of Theorem E3 assume that R0 R1 R2 Rj+1 aredefined Then |(Rj2ej Rj+1)|2 le αe1+middotmiddotmiddot+ej

|(R0 R1)|2 and if e1 + middot middot middot + ej ge 67 thene1 + middot middot middot+ ej le log1024633 |(R0 R1)|2

If R0 and R1 are each bounded by 2b in absolute value then |(R0 R1)|2 is bounded by2b+05 so log1024633 |(R0 R1)|2 le (b+ 05) log2(1024633) = (b+ 05)(14410 )

Proof By Theorem E2 (Ri2ei

Ri+1

)= M(ei qi)

(Riminus12eiminus1

Ri

)where qi = 2ei ((Riminus12eiminus1)div2 Ri) isin

1 3 5 2ei+1 minus 1

Hence(

Rj2ej

Rj+1

)= M(ej qj) middot middot middotM(e1 q1)

(R0R1

)

The product M(ej qj) middot middot middotM(e1 q1) has 2-norm at most αw where w = e1 + middot middot middot+ ej so|(Rj2ej Rj+1)|2 le αw|(R0 R1)|2 by Theorem F9

By Theorem F6 |(Rj2ej Rj+1)|2 ge 1 If e1 + middot middot middot + ej ge 67 then αe1+middotmiddotmiddot+ej =(6331024)e1+middotmiddotmiddot+ej so 1 le (6331024)e1+middotmiddotmiddot+ej |(R0 R1)|2

Theorem F26 In the situation of Theorem E3 there exists t ge 0 such that Rt+1 = 0

Proof Suppose not Then R0 R1 Rj Rj+1 are all defined for arbitrarily large j Takej ge 67 with j gt log1024633 |(R0 R1)|2 Then e1 + middot middot middot + ej ge 67 so j le e1 + middot middot middot + ej lelog1024633 |(R0 R1)|2 lt j by Theorem F25 contradiction

G Proof of main gcd theorem for integersThis appendix shows how the gcd algorithm from Appendix E is related to iteratingdivstep This appendix then uses the analysis from Appendix F to put bounds on thenumber of divstep iterations needed

Theorem G1 (2-adic division as a sequence of division steps) In the situation ofTheorem E1 divstep2e(1 f g2) = (1 g2e (qg2e minus f)2e+1)

In other words divstep2e(1 f g2) = (1 g2eminus(f mod2 g)2e+1)

Proof Defineze = (g2e minus f)2 isin Z2

ze+1 = (ze + (ze mod 2)g2e)2 isin Z2

ze+2 = (ze+1 + (ze+1 mod 2)g2e)2 isin Z2

z2e = (z2eminus1 + (z2eminus1 mod 2)g2e)2 isin Z2

Thenze = (g2e minus f)2

ze+1 = ((1 + 2(ze mod 2))g2e minus f)4ze+2 = ((1 + 2(ze mod 2) + 4(ze+1 mod 2))g2e minus f)8

z2e = ((1 + 2(ze mod 2) + middot middot middot+ 2e(z2eminus1 mod 2))g2e minus f)2e+1

56 Fast constant-time gcd computation and modular inversion

In short z2e = (qg2e minus f)2e+1 where

qprime = 1 + 2(ze mod 2) + middot middot middot+ 2e(z2eminus1 mod 2) isin

1 3 5 2e+1 minus 1

Hence qprimeg2eminusf = 2e+1z2e isin 2e+1Z2 so qprime = q by the uniqueness statement in Theorem E1Note that this construction gives an alternative proof of the existence of q in Theorem E1

All that remains is to prove divstep2e(1 f g2) = (1 g2e (qg2e minus f)2e+1) iedivstep2e(1 f g2) = (1 g2e z2e)

If 1 le i lt e then g2i isin 2Z2 so divstep(i f g2i) = (i + 1 f g2i+1) By in-duction divstepeminus1(1 f g2) = (e f g2e) By assumption e gt 0 and g2e is odd sodivstep(e f g2e) = (1 minus e g2e (g2e minus f)2) = (1 minus e g2e ze) What remains is toprove that divstepe(1minus e g2e ze) = (1 g2e z2e)

If 0 le i le eminus1 then 1minuse+i le 0 so divstep(1minuse+i g2e ze+i) = (2minuse+i g2e (ze+i+(ze+i mod 2)g2e)2) = (2minus e+ i g2e ze+i+1) By induction divstepi(1minus e g2e ze) =(1minus e+ i g2e ze+i) for 0 le i le e In particular divstepe(1minus e g2e ze) = (1 g2e z2e)as claimed

Theorem G2 (2-adic gcd as a sequence of division steps) In the situation of Theo-rem E3 assume that R0 R1 Rj Rj+1 are defined Then divstep2w(1 R0 R12) =(1 Rj2ej Rj+12) where w = e1 + middot middot middot+ ej

Therefore if Rt+1 = 0 then divstepn(1 R0 R12) = (1 + nminus 2wRt2et 0) for alln ge 2w where w = e1 + middot middot middot+ et

Proof Induct on j If j = 0 then w = 0 so divstep2w(1 R0 R12) = (1 R0 R12) andR0 = R02e0 since R0 is odd by hypothesis

If j ge 1 then by definition Rj+1 = minus((Rjminus12ejminus1)mod2 Rj)2ej Furthermore

divstep2e1+middotmiddotmiddot+2ejminus1(1 R0 R12) = (1 Rjminus12ejminus1 Rj2)

by the inductive hypothesis so

divstep2e1+middotmiddotmiddot+2ej (1 R0 R12) = divstep2ej (1 Rjminus12ejminus1 Rj2) = (1 Rj2ej Rj+12)

by Theorem G1

Theorem G3 Let R0 be an odd element of Z Let R1 be an element of 2Z Then thereexists n isin 0 1 2 such that divstepn(1 R0 R12) isin Ztimes Ztimes 0 Furthermore anysuch n has divstepn(1 R0 R12) isin Ztimes GminusG times 0 where G = gcdR0 R1

Proof Define e0 e1 e2 e3 and R2 R3 as in Theorem E2 By Theorem F26 thereexists t ge 0 such that Rt+1 = 0 By Theorem E3 Rt2et isin GminusG By Theorem G2divstep2w(1 R0 R12) = (1 Rt2et 0) isin Ztimes GminusG times 0 where w = e1 + middot middot middot+ et

Conversely assume that n isin 0 1 2 and that divstepn(1 R0 R12) isin ZtimesZtimes0Write divstepn(1 R0 R12) as (δ f 0) Then divstepn+k(1 R0 R12) = (k + δ f 0) forall k isin 0 1 2 so in particular divstep2w+n(1 R0 R12) = (2w + δ f 0) Alsodivstep2w+k(1 R0 R12) = (k + 1 Rt2et 0) for all k isin 0 1 2 so in particu-lar divstep2w+n(1 R0 R12) = (n + 1 Rt2et 0) Hence f = Rt2et isin GminusG anddivstepn(1 R0 R12) isin Ztimes GminusG times 0 as claimed

Theorem G4 Let R0 be an odd element of Z Let R1 be an element of 2Z Defineb = log2

radicR2

0 +R21 Assume that b le 21 Then divstepb19b7c(1 R0 R12) isin ZtimesZtimes 0

Proof If divstep(δ f g) = (δ1 f1 g1) then divstep(δminusfminusg) = (δ1minusf1minusg1) Hencedivstepn(1 R0 R12) isin ZtimesZtimes0 if and only if divstepn(1minusR0minusR12) isin ZtimesZtimes0From now on assume without loss of generality that R0 gt 0

Daniel J Bernstein and Bo-Yin Yang 57

s Ws R0 R10 1 1 01 5 1 minus22 5 1 minus23 5 1 minus24 17 1 minus45 17 1 minus46 65 1 minus87 65 1 minus88 157 11 minus69 181 9 minus1010 421 15 minus1411 541 21 minus1012 1165 29 minus1813 1517 29 minus2614 2789 17 minus5015 3653 47 minus3816 8293 47 minus7817 8293 47 minus7818 24245 121 minus9819 27361 55 minus15620 56645 191 minus14221 79349 215 minus18222 132989 283 minus23023 213053 133 minus44224 415001 451 minus46025 613405 501 minus60226 1345061 719 minus91027 1552237 1021 minus71428 2866525 789 minus1498

s Ws R0 R129 4598269 1355 minus166230 7839829 777 minus269031 12565573 2433 minus257832 22372085 2969 minus368233 35806445 5443 minus248634 71013905 4097 minus736435 74173637 5119 minus692636 205509533 9973 minus1029837 226964725 11559 minus966238 487475029 11127 minus1907039 534274997 13241 minus1894640 1543129037 24749 minus3050641 1639475149 20307 minus3503042 3473731181 39805 minus4346643 4500780005 49519 minus4526244 9497198125 38051 minus8971845 12700184149 82743 minus7651046 29042662405 152287 minus7649447 36511782869 152713 minus11485048 82049276437 222249 minus18070649 90638999365 220417 minus20507450 234231548149 229593 minus42605051 257130797053 338587 minus37747852 466845672077 301069 minus61335453 622968125533 437163 minus65715854 1386285660565 600809 minus101257855 1876581808109 604547 minus122927056 4208169535453 1352357 minus1542498

Figure G5 For each s isin 0 1 56 Minimum R20 + R2

1 where R0 is an odd integerR1 is an even integer and (R0 R1) requires ges iterations of divstep Also an example of(R0 R1) reaching this minimum

The rest of the proof is a computer proof We enumerate all pairs (R0 R1) where R0is a positive odd integer R1 is an even integer and R2

0 + R21 le 242 For each (R0 R1)

we iterate divstep to find the minimum n ge 0 where divstepn(1 R0 R12) isin Ztimes Ztimes 0We then say that (R0 R1) ldquoneeds n stepsrdquo

For each s ge 0 define Ws as the smallest R20 +R2

1 where (R0 R1) needs ges steps Thecomputation shows that each (R0 R1) needs le56 steps and also shows the values Ws

for each s isin 0 1 56 We display these values in Figure G5 We check for eachs isin 0 1 56 that 214s leW 19

s Finally we claim that each (R0 R1) needs leb19b7c steps where b = log2

radicR2

0 +R21

If not then (R0 R1) needs ges steps where s = b19b7c + 1 We have s le 56 since(R0 R1) needs le56 steps We also have Ws le R2

0 + R21 by definition of Ws Hence

214s19 le R20 +R2

1 so 14s19 le 2b so s le 19b7 contradiction

Theorem G6 (Bounds on the number of divsteps required) Let R0 be an odd elementof Z Let R1 be an element of 2Z Define b = log2

radicR2

0 +R21 Define n = b19b7c

if b le 21 n = b(49b+ 23)17c if 21 lt b le 46 and n = b49b17c if b gt 46 Thendivstepn(1 R0 R12) isin Ztimes Ztimes 0

58 Fast constant-time gcd computation and modular inversion

All three cases have n le b(49b+ 23)17c since 197 lt 4917

Proof If b le 21 then divstepn(1 R0 R12) isin Ztimes Ztimes 0 by Theorem G4Assume from now on that b gt 21 This implies n ge b49b17c ge 60Define e0 e1 e2 e3 and R2 R3 as in Theorem E2 By Theorem F26 there

exists t ge 0 such that Rt+1 = 0 By Theorem E3 Rt2et isin GminusG By Theorem G2divstep2w(1 R0 R12) = (1 Rt2et 0) isin Ztimes GminusG times 0 where w = e1 + middot middot middot+ et

To finish the proof we will show that 2w le n Hence divstepn(1 R0 R12) isin ZtimesZtimes0as claimed

Suppose that 2w gt n Both 2w and n are integers so 2w ge n+ 1 ge 61 so w ge 31By Theorem F6 and Theorem F25 1 le |(Rt2et Rt+1)|2 le αw|(R0 R1)|2 = αw2b so

log2(1αw) le bCase 1 w ge 67 By definition 1αw = (1024633)w so w log2(1024633) le b We have

3449 lt log2(1024633) so 34w49 lt b so n + 1 le 2w lt 49b17 lt (49b + 23)17 Thiscontradicts the definition of n as the largest integer below 49b17 if b gt 46 and as thelargest integer below (49b+ 23)17 if b le 46

Case 2 w le 66 Then αw lt 2minus(34wminus23)49 since w ge 31 so b ge lg(1αw) gt(34w minus 23)49 so n + 1 le 2w le (49b + 23)17 This again contradicts the definitionof n if b le 46 If b gt 46 then n ge b49 middot 4617c = 132 so 2w ge n + 1 gt 132 ge 2wcontradiction

H Comparison to performance of cdivstepThis appendix briefly compares the analysis of divstep from Appendix G to an analogousanalysis of cdivstep

First modify Theorem E1 to take the unique element q isin plusmn1plusmn3 plusmn(2e minus 1)such that f + qg2e isin 2e+1Z2 The corresponding modification of Theorem G1 saysthat cdivstep2e(1 f g2) = (1 g2e (f + qg2e)2e+1) The proof proceeds identicallyexcept zj = (zjminus1 minus (zjminus1 mod 2)g2e)2 isin Z2 for e lt j lt 2e and z2e = (z2eminus1 +(z2eminus1 mod 2)g2e)2 isin Z2 Also we have q = minus1minus2(ze mod 2)minusmiddot middot middotminus2eminus1(z2eminus2 mod 2)+2e(z2eminus1 mod 2) isin plusmn1plusmn3 plusmn(2e minus 1) In the notation of [62] GB(f g) = (q r) wherer = f + qg2e = 2e+1z2e

The corresponding gcd algorithm is the centered algorithm from Appendix E6 withaddition instead of subtraction Recall that except for inessential scalings by powers of 2this is the same as the StehleacutendashZimmermann gcd algorithm from [62]

The corresponding set of transition matrices is different from the divstep case As men-tioned in Appendix F there are weight-w products with spectral radius 06403882032 wso the proof strategy for cdivstep cannot obtain weight-w norm bounds below this spectralradius This is worse than our bounds for divstep

Enumerating small pairs (R0 R1) as in Theorem G4 also produces worse results forcdivstep than for divstep See Figure H1

Daniel J Bernstein and Bo-Yin Yang 59

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bullbull

bull

bull

bull

bull

bull

bullbullbullbullbullbull

bull

bullbullbullbullbullbullbull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bullbull

bull

bullbull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

minus3

minus2

minus1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21

Figure H1 For each s isin 1 2 56 red dot at (b sminus 28283396b) where b is minimumlog2

radicR2

0 +R21 where R0 is an odd integer R1 is an even integer and (R0 R1) requires ges

iterations of divstep For each s isin 1 2 58 blue dot at (b sminus 28283396b) where b isminimum log2

radicR2

0 +R21 where R0 is an odd integer R1 is an even integer and (R0 R1)

requires ges iterations of cdivstep

  • Introduction
  • Organization of the paper
  • Definition of x-adic division steps
  • Iterates of x-adic division steps
  • Fast computation of iterates of x-adic division steps
  • Fast polynomial gcd computation and modular inversion
  • Software case study for polynomial modular inversion
  • Definition of 2-adic division steps
  • Iterates of 2-adic division steps
  • Fast computation of iterates of 2-adic division steps
  • Fast integer gcd computation and modular inversion
  • Software case study for integer modular inversion
  • Proof of main gcd theorem for polynomials
  • Review of the EuclidndashStevin algorithm
  • EuclidndashStevin results as iterates of divstep
  • Alternative proof of main gcd theorem for polynomials
  • A gcd algorithm for integers
  • Performance of the gcd algorithm for integers
  • Proof of main gcd theorem for integers
  • Comparison to performance of cdivstep
Page 4: Fast constant-time gcd computation and modular inversion

4 Fast constant-time gcd computation and modular inversion

defshortgcd2(fg)deltafg=1ZZ(f)ZZ(g)assertfamp1m=4+3max(fnbits()gnbits())forloopinrange(m)ifdeltagt0andgamp1deltafg=-deltag-fdeltag=1+delta(g+(gamp1)f)2returnabs(f)

Figure 12 Example of a gcd algorithm for integers using this paperrsquos division stepsThe f input is assumed to be odd Each iteration of the main loop replaces (δ f g) withdivstep(δ f g) Correctness follows from Theorem 112 The algorithm is expressed in theSage [58] computer-algebra system The algorithm is not constant-time as shown but canbe made constant-time with low overhead see Section 52

speed records the details matter Our division steps are particularly simple allowing veryefficient computations

Each step of the constant-time algorithms of Bos [21] and RoettelerndashNaehrigndashSvorendashLauter [57 Section 34] like each step in the algorithms of Stein and Kaliski consists of alimited number of additions and subtractions followed by a division by 2 The decisionsof which operations to perform are based on (1) the bottom bits of the integers and (2)which integer is largermdasha comparison that sometimes needs to inspect many bits of theintegers A constant-time comparison inspects all bits

The time for this comparison might not seem problematic in context subtraction alsoinspects all bits and produces the result of the comparison as a side effect But this requiresa very wide data flow in the algorithm One cannot decide what needs to be be done ina step without inspecting all bits produced by the previous step For our algorithm thebottom t bits of the inputs (or the bottom t coefficients in the polynomial case) decide thelinear combinations used in the next t division steps so the data flow is much narrower

For the polynomial case size comparison is simply a matter of comparing polynomialdegrees but there are still many other details that affect performance as illustrated byour 17times speedup for the ntruhrss701 polynomial-inversion problem mentioned aboveSee Section 7 for a detailed comparison of our algorithm in this case to the algorithmin [39] Important issues include the number of control variables manipulated in the innerloop the complexity of the arithmetic operations being carried out and the way that theoutput is scaled

The closest previous algorithms that we have found in the literature are algorithms forldquosystolicrdquo hardware developed in the 1980s by Bojanczyk Brent and Kung See [24] forinteger gcd [19] for integer inversion and [23] for the polynomial case These algorithmshave predictable output scaling a single control variable δ (optionally represented in unary)beyond the loop counter and the same basic feature that several bottom bits decide linearcombinations used for several steps Our algorithms are nevertheless simpler and fasterfor a combination of reasons First our division steps have fewer case distinctions thanthe steps in [24] (and [19]) Second our data flow can be decomposed into a ldquoconditionalswaprdquo followed by an ldquoeliminationrdquo (see Section 3) whereas the data flow in [23] requiresa more complicated routing of inputs Third surprisingly we use fewer iterations than[24] Fourth for us t bits determine linear combinations for exactly t steps whereas this isnot exactly true in [24] Fifth we gain considerable speed in eg Curve25519 inversionby jumping through division steps

14 Comparison to previous subquadratic algorithms An easy consequence ofour data flow is a constant-time algorithm where the number of operations is subquadratic

Daniel J Bernstein and Bo-Yin Yang 5

in the input sizemdashasymptotically n(logn)2+o(1) This algorithm has important advantagesover earlier subquadratic gcd algorithms as we now explain

The starting point for subquadratic gcd algorithms is the following idea from Lehmer [44]The quotients at the beginning of gcd computation depend only on the most significantbits of the inputs After computing the corresponding 2times 2 transition matrix one canretroactively multiply this matrix by the remaining bits of the inputs and then continuewith the new hopefully smaller inputs

A closer look shows that it is not easy to formulate a precise statement about theamount of progress that one is guaranteed to make by inspecting j most significant bitsof the inputs We return to this difficulty below Assume for the moment that j bits ofthe inputs determine roughly j bits of initial information about the quotients in Euclidrsquosalgorithm

Lehmerrsquos paper predated subquadratic multiplication but is well known to be helpfuleven with quadratic multiplication For example consider the following procedure startingwith 40-bit integers a and b add a to b then add the new b to a and so on back and forthfor a total of 30 additions The results fit into words on a typical 64-bit CPU and thisprocedure might seem to be efficiently using the CPUrsquos arithmetic instructions Howeverit is generally faster to directly compute 832040a+ 514229b and 1346269a+ 832040b usingthe CPUrsquos fast multiplication circuits

Faster algorithms for multiplication make Lehmerrsquos approach more effective Knuth[41] using Lehmerrsquos idea recursively on top of FFT-based integer multiplication achievedcost n(logn)5+o(1) for integer gcd Schoumlnhage [60] improved 5 to 2 Subsequent paperssuch as [53] [22] and [66] adapted the ideas to the polynomial case

The subquadratic algorithms appearing in [41] [60] [22 Algorithm EMGCD] [66Algorithm SCH] [62 Algorithm Half-GB-gcd] [52 Algorithms HGCD-Q and HGCD-Band SGCD and HGCD-D] [25 Algorithm 31] etc all have the following structure Thereis an initial recursive call that computes a 2times 2 transition matrix there is a computationof reduced inputs using a fast multiplication subroutine there is a call to a fast divisionsubroutine producing a possibly large quotient and remainder and then there is morerecursion

We emphasize the role of the division subroutine in each of these algorithms If therecursive computation is expressed as a tree then each node in the tree calls the left subtreerecursively multiplies by a transition matrix calls a division subroutine and calls theright subtree recursively The results of the left subtree are described by some number ofquotients in Euclidrsquos algorithm so variations in the sizes of quotients produce irregularitiesin the amount of information computed by the recursions These irregularities can bequite severe multiplying by the transition matrix does not make any progress when theinputs have sufficiently different sizes The call to a division subroutine is a bandage overthese irregularities guaranteeing significant progress in all cases

Our subquadratic algorithm has a simpler structure We guarantee that a recursivecall to a size-j subtree always jumps through exactly j division steps Each node in thecomputation tree involves multiplications and recursive calls but the large variable-timedivision subroutine has disappeared and the underlying irregularities have also disappearedOne can think of each Euclid-style division as being decomposed into a series of our simplerconstant-time division steps but our recursion does not care where each Euclid-styledivision begins and ends See Section 12 for a case study of the speeds that we achieve

A prerequisite for this regular structure is that the linear combinations in j of ourdivision steps are determined by the bottom j bits of the integer inputs8 without regardto top bits This prerequisite was almost achieved by the BrentndashKung ldquosystolicrdquo gcdalgorithm [24] mentioned above but Brent and Kung did not observe that the data flow

8For polynomials one can work from the bottom or from the top since there are no carries Wechoose to work from bottom coefficients in the polynomial case to emphasize the similarities between thepolynomial case and the integer case

6 Fast constant-time gcd computation and modular inversion

allows a subquadratic algorithm The StehleacutendashZimmermann subquadratic ldquobinary recursivegcdrdquo algorithm [62] also emphasizes that it works with bottom bits but it has the sameirregularities as earlier subquadratic algorithms and it again needs a large variable-timedivision subroutine

Often the literature considers the ldquonormalrdquo case of polynomial gcd computation Thisis by definition the case that each EuclidndashStevin quotient has degree 1 The conventionalrecursion then returns a predictable amount of information allowing a more regularalgorithm such as [53 Algorithm PGCD]9 Our algorithm achieves the same regularityfor the general case

Another previous algorithm that achieves the same regularity for a different special caseof polynomial gcd computation is Thomeacutersquos subquadratic variant [68 Program 41] of theBerlekampndashMassey algorithm Our algorithm can be viewed as extending this regularityto arbitrary polynomial gcd computations and beyond this to integer gcd computations

2 Organization of the paperWe start with the polynomial case Section 3 defines division steps Section 5 relyingon theorems in Section 4 states our main algorithm to compute c coefficients of the nthiterate of divstep This takes n(c+ n) simple operations We also explain how ldquojumpsrdquoreduce the cost for large n to (c+ n)(log cn)2+o(1) operations All of these algorithms takeconstant time ie time independent of the input coefficients for any particular (n c)

There is a long tradition in number theory of defining Euclidrsquos algorithm continuedfractions etc for inputs in a ldquocompleterdquo field such as the field of real numbers or the fieldof power series We define division steps for power series and state our main algorithmfor power series Division steps produce polynomial outputs from polynomial inputs sothe reader can restrict attention to polynomials if desired but there are advantages offollowing the tradition as we now explain

Requiring algorithms to work for general power series means that the algorithms canonly inspect a limited number of leading coefficients of each input without regard to thedegree of inputs that happen to be polynomials This restriction simplifies algorithmdesign and helped us design our main algorithm

Going beyond polynomials also makes it easy to describe further algorithmic optionssuch as iterating divstep on inputs (1 gf) instead of (f g) Restricting to polynomialinputs would require gf to be replaced with a nearby polynomial the limited precision ofthe inputs would affect the table of divstep iterates rather than merely being an option inthe algorithms

Section 6 relying on theorems in Appendix A applies our main algorithm to gcdcomputation and modular inversion Here the inputs are polynomials A constant fractionof the power-series coefficients used in our main algorithm are guaranteed to be 0 in thiscase and one can save some time by skipping computations involving those coefficientsHowever it is still helpful to factor the algorithm-design process into (1) designing the fastpower-series algorithm and then (2) eliminating unused values for special cases

Section 7 is a case study of the software speed of modular inversion of polynomials aspart of key generation in the NTRU cryptosystem

The remaining sections of the paper consider the integer case Section 8 defines divisionsteps for integers Section 10 relying on theorems in Section 9 states our main algorithm tocompute c bits of the nth iterate of divstep Section 11 relying on theorems in Appendix Gapplies our main algorithm to gcd computation and modular inversion Section 12 is a case

9The analysis in [53 Section VI] considers only normal inputs (and power-of-2 sizes) The algorithmworks only for normal inputs The performance of division in the non-normal case was discussed in [53Section VIII] but incorrectly described as solely an issue for the base case of the recursion

Daniel J Bernstein and Bo-Yin Yang 7

study of the software speed of inversion of integers modulo primes used in elliptic-curvecryptography

21 General notation We write M2(R) for the ring of 2 times 2 matrices over a ring RFor example M2(R) is the ring of 2times 2 matrices over the field R of real numbers

22 Notation for the polynomial case Fix a field k The formal-power-series ringk[[x]] is the set of infinite sequences (f0 f1 ) with each fi isin k The ring operations are

bull 0 (0 0 0 )

bull 1 (1 0 0 )

bull + (f0 f1 ) (g0 g1 ) 7rarr (f0 + g0 f1 + g1 )

bull minus (f0 f1 ) (g0 g1 ) 7rarr (f0 minus g0 f1 minus g1 )

bull middot (f0 f1 f2 ) (g0 g1 g2 ) 7rarr (f0g0 f0g1 + f1g0 f0g2 + f1g1 + f2g0 )

The element (0 1 0 ) is written ldquoxrdquo and the entry fi in f = (f0 f1 ) is called theldquocoefficient of xi in frdquo As in the Sage [58] computer-algebra system we write f [i] for thecoefficient of xi in f The ldquoconstant coefficientrdquo f [0] has a more standard notation f(0)and we use the f(0) notation when we are not also inspecting other coefficients

The unit group of k[[x]] is written k[[x]]lowast This is the same as the set of f isin k[[x]] suchthat f(0) 6= 0

The polynomial ring k[x] is the subring of sequences (f0 f1 ) that have only finitelymany nonzero coefficients We write deg f for the degree of f isin k[x] this means themaximum i such that f [i] 6= 0 or minusinfin if f = 0 We also define f [minusinfin] = 0 so f [deg f ] isalways defined Beware that the literature sometimes instead defines deg 0 = minus1 and inparticular Sage defines deg 0 = minus1

We define f mod xn as f [0] + f [1]x+ middot middot middot+ f [nminus 1]xnminus1 for any f isin k[[x]] We definef mod g for any f g isin k[x] with g 6= 0 as the unique polynomial r such that deg r lt deg gand f minus r = gq for some q isin k[x]

The ring k[[x]] is a domain The formal-Laurent-series field k((x)) is by definition thefield of fractions of k[[x]] Each element of k((x)) can be written (not uniquely) as fxifor some f isin k[[x]] and some i isin 0 1 2 The subring k[1x] is the smallest subringof k((x)) containing both k and 1x

23 Notation for the integer case The set of infinite sequences (f0 f1 ) with eachfi isin 0 1 forms a ring under the following operations

bull 0 (0 0 0 )

bull 1 (1 0 0 )

bull + (f0 f1 ) (g0 g1 ) 7rarr the unique (h0 h1 ) such that for every n ge 0h0 + 2h1 + 4h2 + middot middot middot+ 2nminus1hnminus1 equiv (f0 + 2f1 + 4f2 + middot middot middot+ 2nminus1fnminus1) + (g0 + 2g1 +4g2 + middot middot middot+ 2nminus1gnminus1) (mod 2n)

bull minus same with (f0 + 2f1 + 4f2 + middot middot middot+ 2nminus1fnminus1)minus (g0 + 2g1 + 4g2 + middot middot middot+ 2nminus1gnminus1)

bull middot same with (f0 + 2f1 + 4f2 + middot middot middot+ 2nminus1fnminus1) middot (g0 + 2g1 + 4g2 + middot middot middot+ 2nminus1gnminus1)

We follow the tradition among number theorists of using the notation Z2 for this ring andnot for the finite field F2 = Z2 This ring Z2 has characteristic 0 so Z can be viewed asa subset of Z2

One can think of (f0 f1 ) isin Z2 as an infinite series f0 + 2f1 + middot middot middot The differencebetween Z2 and the power-series ring F2[[x]] is that Z2 has carries for example adding(1 0 ) to (1 0 ) in Z2 produces (0 1 )

8 Fast constant-time gcd computation and modular inversion

We define f mod 2n as f0 + 2f1 + middot middot middot+ 2nminus1fnminus1 for any f = (f0 f1 ) isin Z2The unit group of Z2 is written Zlowast2 This is the same as the set of odd f isin Z2 ie the

set of f isin Z2 such that f mod 2 6= 0If f isin Z2 and f 6= 0 then ord2 f means the largest integer e ge 0 such that f isin 2eZ2

equivalently the unique integer e ge 0 such that f isin 2eZlowast2The ring Z2 is a domain The field Q2 is by definition the field of fractions of Z2

Each element of Q2 can be written (not uniquely) as f2i for some f isin Z2 and somei isin 0 1 2 The subring Z[12] is the smallest subring of Q2 containing both Z and12

3 Definition of x-adic division stepsDefine divstep Ztimes k[[x]]lowast times k[[x]]rarr Ztimes k[[x]]lowast times k[[x]] as follows

divstep(δ f g) =

(1minus δ g (g(0)f minus f(0)g)x) if δ gt 0 and g(0) 6= 0(1 + δ f (f(0)g minus g(0)f)x) otherwise

Note that f(0)g minus g(0)f is divisible by x in k[[x]] The name ldquodivision steprdquo is justified inTheorem C1 which shows that dividing a degree-d0 polynomial by a degree-d1 polynomialwith 0 le d1 lt d0 can be viewed as computing divstep2d0minus2d1

31 Transition matrices Write (δ1 f1 g1) = divstep(δ f g) Then(f1g1

)= T (δ f g)

(fg

)and

(1δ1

)= S(δ f g)

(1δ

)where T Ztimes k[[x]]lowast times k[[x]]rarrM2(k[1x]) is defined by

T (δ f g) =

(0 1g(0)x

minusf(0)x

)if δ gt 0 and g(0) 6= 0

(1 0minusg(0)x

f(0)x

)otherwise

and S Ztimes k[[x]]lowast times k[[x]]rarrM2(Z) is defined by

S(δ f g) =

(1 01 minus1

)if δ gt 0 and g(0) 6= 0

(1 01 1

)otherwise

Note that both T (δ f g) and S(δ f g) are defined entirely by (δ f(0) g(0)) they do notdepend on the remaining coefficients of f and g

32 Decomposition This divstep function is a composition of two simpler functions(and the transition matrices can similarly be decomposed)

bull a conditional swap replacing (δ f g) with (minusδ g f) if δ gt 0 and g(0) 6= 0 followedby

bull an elimination replacing (δ f g) with (1 + δ f (f(0)g minus g(0)f)x)

Daniel J Bernstein and Bo-Yin Yang 9

A δ f [0] g[0] f [1] g[1] f [2] g[2] f [3] g[3] middot middot middot

minusδ swap

B δprime f prime[0] gprime[0] f prime[1] gprime[1] f prime[2] gprime[2] f prime[3] gprime[3] middot middot middot

C 1 + δprime f primeprime[0] gprimeprime[0] f primeprime[1] gprimeprime[1] f primeprime[2] gprimeprime[2] f primeprime[3] middot middot middot

Figure 33 Data flow in an x-adic division step decomposed as a conditional swap (A toB) followed by an elimination (B to C) The swap bit is set if δ gt 0 and g[0] 6= 0 Theg outputs are f prime[0]gprime[1]minus gprime[0]f prime[1] f prime[0]gprime[2]minus gprime[0]f prime[2] f prime[0]gprime[3]minus gprime[0]f prime[3] etc Colorscheme consider eg linear combination cg minus df of power series f g where c = f [0] andd = g[0] inputs f g affect jth output coefficient cg[j]minus df [j] via f [j] g[j] (black) and viachoice of c d (red+blue) Green operations are special creating the swap bit

See Figure 33This decomposition is our motivation for defining divstep in the first casemdashthe swapped

casemdashto compute (g(0)f minus f(0)g)x Using (f(0)g minus g(0)f)x in both cases might seemsimpler but would require the conditional swap to forward a conditional negation to theelimination

One can also compose these functions in the opposite order This is compatible withsome of what we say later about iterating the composition However the correspondingtransition matrices would depend on another coefficient of g This would complicate ouralgorithm statements

Another possibility as in [23] is to keep f and g in place using the sign of δ to decidewhether to add a multiple of f into g or to add a multiple of g into f One way to dothis in constant time is to perform two multiplications and two additions Another way isto perform conditional swaps before and after one addition and one multiplication Ourapproach is more efficient

34 Scaling One can define a variant of divstep that computes f minus (f(0)g(0))g in thefirst case (the swapped case) and g minus (g(0)f(0))f in the second case

This variant has two advantages First it multiplies only one series rather than twoby a scalar in k Second multiplying both f and g by any u isin k[[x]]lowast has the same effecton the output so this variant induces a function on fewer variables (δ gf) the ratioρ = gf is replaced by (1ρ minus 1ρ(0))x when δ gt 0 and ρ(0) 6= 0 and by (ρ minus ρ(0))xotherwise while δ is replaced by 1minus δ or 1+ δ respectively This is the composition of firsta conditional inversion that replaces (δ ρ) with (minusδ 1ρ) if δ gt 0 and ρ(0) 6= 0 second anelimination that replaces ρ with (ρminus ρ(0))x while adding 1 to δ

On the other hand working projectively with f g is generally more efficient than workingwith ρ Working with f(0)gminus g(0)f rather than gminus (g(0)f(0))f has the virtue of being aldquofraction-freerdquo computation it involves only ring operations in k not divisions This makesthe effect of scaling only slightly more difficult to track Specifically say divstep(δ f g) =

10 Fast constant-time gcd computation and modular inversion

Table 41 Iterates (δn fn gn) = divstepn(δ f g) for k = F7 δ = 1 f = 2 + 7x+ 1x2 +8x3 + 2x4 + 8x5 + 1x6 + 8x7 and g = 3 + 1x+ 4x2 + 1x3 + 5x4 + 9x5 + 2x6 Each powerseries is displayed as a row of coefficients of x0 x1 etc

n δn fn gnx0 x1 x2 x3 x4 x5 x6 x7 x8 x9 x0 x1 x2 x3 x4 x5 x6 x7 x8 x9

0 1 2 0 1 1 2 1 1 1 0 0 3 1 4 1 5 2 2 0 0 0 1 0 3 1 4 1 5 2 2 0 0 0 5 2 1 3 6 6 3 0 0 0 2 1 3 1 4 1 5 2 2 0 0 0 1 4 4 0 1 6 0 0 0 0 3 0 1 4 4 0 1 6 0 0 0 0 3 6 1 2 5 2 0 0 0 0 4 1 1 4 4 0 1 6 0 0 0 0 1 3 2 2 5 0 0 0 0 0 5 0 1 3 2 2 5 0 0 0 0 0 1 2 5 3 6 0 0 0 0 0 6 1 1 3 2 2 5 0 0 0 0 0 6 3 1 1 0 0 0 0 0 0 7 0 6 3 1 1 0 0 0 0 0 0 1 4 4 2 0 0 0 0 0 0 8 1 6 3 1 1 0 0 0 0 0 0 0 2 4 0 0 0 0 0 0 0 9 2 6 3 1 1 0 0 0 0 0 0 5 3 0 0 0 0 0 0 0 0

10 minus1 5 3 0 0 0 0 0 0 0 0 4 5 5 0 0 0 0 0 0 0 11 0 5 3 0 0 0 0 0 0 0 0 6 4 0 0 0 0 0 0 0 0 12 1 5 3 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 13 0 2 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 14 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16 3 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 17 4 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 18 5 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 19 6 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

(δprime f prime gprime) If u isin k[[x]] with u(0) = 1 then divstep(δ uf ug) = (δprime uf prime ugprime) If u isin klowast then

divstep(δ uf g) = (δprime f prime ugprime) in the first (swapped) casedivstep(δ uf g) = (δprime uf prime ugprime) in the second casedivstep(δ f ug) = (δprime uf prime ugprime) in the first casedivstep(δ f ug) = (δprime f prime ugprime) in the second case

and divstep(δ uf ug) = (δprime uf prime u2gprime)One can also scale g(0)f minus f(0)g by other nonzero constants For example for typical

fields k one can quickly find ldquohalf-sizerdquo a b such that ab = g(0)f(0) Computing af minus bgmay be more efficient than computing g(0)f minus f(0)g

We use the g(0)f(0) variant in our case study in Section 7 for the field k = F3Dividing by f(0) in this field is the same as multiplying by f(0)

4 Iterates of x-adic division stepsStarting from (δ f g) isin Z times k[[x]]lowast times k[[x]] define (δn fn gn) = divstepn(δ f g) isinZ times k[[x]]lowast times k[[x]] for each n ge 0 Also define Tn = T (δn fn gn) isin M2(k[1x]) andSn = S(δn fn gn) isinM2(Z) for each n ge 0 Our main algorithm relies on various simplemathematical properties of these objects this section states those properties Table 41shows all δn fn gn for one example of (δ f g)

Theorem 42(fngn

)= Tnminus1 middot middot middot Tm

(fmgm

)and

(1δn

)= Snminus1 middot middot middot Sm

(1δm

)if n ge m ge 0

Daniel J Bernstein and Bo-Yin Yang 11

Proof This follows from(fm+1gm+1

)= Tm

(fmgm

)and

(1

δm+1

)= Sm

(1δm

)

Theorem 43 Tnminus1 middot middot middot Tm isin(k + middot middot middot+ 1

xnminusmminus1 k k + middot middot middot+ 1xnminusmminus1 k

1xk + middot middot middot+ 1

xnminusm k1xk + middot middot middot+ 1

xnminusm k

)if n gt m ge 0

A product of nminusm of these T matrices can thus be stored using 4(nminusm) coefficientsfrom k Beware that the theorem statement relies on n gt m for n = m the product is theidentity matrix

Proof If n minus m = 1 then the product is Tm which is in(

k k1xk

1xk

)by definition For

nminusm gt 1 assume inductively that

Tnminus1 middot middot middot Tm+1 isin(k + middot middot middot+ 1

xnminusmminus2 k k + middot middot middot+ 1xnminusmminus2 k

1xk + middot middot middot+ 1

xnminusmminus1 k1xk + middot middot middot+ 1

xnminusmminus1 k

)

Multiply on the right by Tm isin(

k k1xk

1xk

) The top entries of the product are in (k + middot middot middot+

1xnminusmminus2 k) + ( 1

xk+ middot middot middot+ 1xnminusmminus1 k) = k+ middot middot middot+ 1

xnminusmminus1 k as claimed and the bottom entriesare in ( 1

xk + middot middot middot+ 1xnminusmminus1 k) + ( 1

x2 k + middot middot middot+ 1xnminusm k) = 1

xk + middot middot middot+ 1xnminusm k as claimed

Theorem 44 Snminus1 middot middot middot Sm isin(

1 02minus (nminusm) nminusmminus 2 nminusm 1minus1

)if n gt

m ge 0

In other words the bottom-left entry is between minus(nminusm) exclusive and nminusm inclusiveand has the same parity as nminusm There are exactly 2(nminusm) possibilities for the productBeware that the theorem statement again relies on n gt m

Proof If nminusm = 1 then 2minus (nminusm) nminusmminus 2 nminusm = 1 and the product isSm which is

( 1 01 plusmn1

)by definition

For nminusm gt 1 assume inductively that

Snminus1 middot middot middot Sm+1 isin(

1 02minus (nminusmminus 1) nminusmminus 3 nminusmminus 1 1minus1

)

and multiply on the right by Sm =( 1 0

1 plusmn1) The top two entries of the product are again 1

and 0 The bottom-left entry is in 2minus (nminusmminus 1) nminusmminus 3 nminusmminus 1+1minus1 =2minus (nminusm) nminusmminus 2 nminusm The bottom-right entry is again 1 or minus1

The next theorem compares δm fm gm TmSm to δprimem f primem gprimem T primemS primem defined in thesame way starting from (δprime f prime gprime) isin Ztimes k[[x]]lowast times k[[x]]

Theorem 45 Let m t be nonnegative integers Assume that δprimem = δm f primem equiv fm(mod xt) and gprimem equiv gm (mod xt) Then for each integer n with m lt n le m+ t T primenminus1 =Tnminus1 S primenminus1 = Snminus1 δprimen = δn f primen equiv fn (mod xtminus(nminusmminus1)) and gprimen equiv gn (mod xtminus(nminusm))

In other words δm and the first t coefficients of fm and gm determine all of thefollowing

bull the matrices Tm Tm+1 Tm+tminus1

bull the matrices SmSm+1 Sm+tminus1

bull δm+1 δm+t

bull the first t coefficients of fm+1 the first t minus 1 coefficients of fm+2 the first t minus 2coefficients of fm+3 and so on through the first coefficient of fm+t

12 Fast constant-time gcd computation and modular inversion

bull the first tminus 1 coefficients of gm+1 the first tminus 2 coefficients of gm+2 the first tminus 3coefficients of gm+3 and so on through the first coefficient of gm+tminus1

Our main algorithm computes (δn fn mod xtminus(nminusmminus1) gn mod xtminus(nminusm)) for any selectedn isin m+ 1 m+ t along with the product of the matrices Tm Tnminus1 See Sec-tion 5

Proof See Figure 33 for the intuition Formally fix m t and induct on n Observe that(δprimenminus1 f

primenminus1(0) gprimenminus1(0)) equals (δnminus1 fnminus1(0) gnminus1(0))

bull If n = m + 1 By assumption f primem equiv fm (mod xt) and n le m + t so t ge 1 so inparticular f primem(0) = fm(0) Similarly gprimem(0) = gm(0) By assumption δprimem = δm

bull If n gt m + 1 f primenminus1 equiv fnminus1 (mod xtminus(nminusmminus2)) by the inductive hypothesis andn le m + t so t minus (n minus m minus 2) ge 2 so in particular f primenminus1(0) = fnminus1(0) similarlygprimenminus1 equiv gnminus1 (mod xtminus(nminusmminus1)) by the inductive hypothesis and tminus (nminusmminus1) ge 1so in particular gprimenminus1(0) = gnminus1(0) and δprimenminus1 = δnminus1 by the inductive hypothesis

Now use the fact that T and S inspect only the first coefficient of each series to seethat

T primenminus1 = T (δprimenminus1 fprimenminus1 g

primenminus1) = T (δnminus1 fnminus1 gnminus1) = Tnminus1

S primenminus1 = S(δprimenminus1 fprimenminus1 g

primenminus1) = S(δnminus1 fnminus1 gnminus1) = Snminus1

as claimedMultiply

(1

δprimenminus1

)=(

1δnminus1

)by S primenminus1 = Snminus1 to see that δprimen = δn as claimed

By the inductive hypothesis (T primem T primenminus2) = (Tm Tnminus2) and T primenminus1 = Tnminus1 so

T primenminus1 middot middot middot T primem = Tnminus1 middot middot middot Tm Abbreviate Tnminus1 middot middot middot Tm as P Then(f primengprimen

)= P

(f primemgprimem

)and(

fngn

)= P

(fmgm

)by Theorem 42 so

(f primen minus fngprimen minus gn

)= P

(f primem minus fmgprimem minus gm

)

By Theorem 43 P isin(k + middot middot middot+ 1

xnminusmminus1 k k + middot middot middot+ 1xnminusmminus1 k

1xk + middot middot middot+ 1

xnminusm k1xk + middot middot middot+ 1

xnminusm k

) By assumption

f primem minus fm and gprimem minus gm are multiples of xt Multiply to see that f primen minus fn is a multiple ofxtxnminusmminus1 = xtminus(nminusmminus1) as claimed and gprimen minus gn is a multiple of xtxnminusm = xtminus(nminusm)

as claimed

5 Fast computation of iterates of x-adic division stepsOur goal in this section is to compute (δn fn gn) = divstepn(δ f g) We are giventhe power series f g to precision t t respectively and we compute fn gn to precisiont minus n + 1 t minus n respectively if n ge 1 see Theorem 45 We also compute the n-steptransition matrix Tnminus1 middot middot middot T0

We begin with an algorithm that simply computes one divstep iterate after anothermultiplying by scalars in k and dividing by x exactly as in the divstep definition We specifythis algorithm in Figure 51 as a function divstepsx in the Sage [58] computer-algebrasystem The quantities u v q r isin k[1x] inside the algorithm keep track of the coefficientsof the current f g in terms of the original f g

The rest of this section considers (1) constant-time computation (2) ldquojumpingrdquo throughdivision steps and (3) further techniques to save time inside the computations

52 Constant-time computation The underlying Sage functions do not take constanttime but they can be replaced with functions that do take constant time The casedistinction can be replaced with constant-time bit operations for example obtaining thenew (δ f g) as follows

Daniel J Bernstein and Bo-Yin Yang 13

defdivstepsx(ntdeltafg)asserttgt=nandngt=0fg=ftruncate(t)gtruncate(t)kx=fparent()x=kxgen()uvqr=kx(1)kx(0)kx(0)kx(1)

whilengt0f=ftruncate(t)ifdeltagt0andg[0]=0deltafguvqr=-deltagfqruvf0g0=f[0]g[0]deltagqr=1+delta(f0g-g0f)x(f0q-g0u)x(f0r-g0v)xnt=n-1t-1g=kx(g)truncate(t)

M2kx=MatrixSpace(kxfraction_field()2)returndeltafgM2kx((uvqr))

Figure 51 Algorithm divstepsx to compute (δn fn gn) = divstepn(δ f g) Inputsn t isin Z with 0 le n le t δ isin Z f g isin k[[x]] to precision at least t Outputs δn fn toprecision t if n = 0 or tminus (nminus 1) if n ge 1 gn to precision tminus n Tnminus1 middot middot middot T0

bull Let s be the ldquosign bitrdquo of minusδ meaning 0 if minusδ ge 0 or 1 if minusδ lt 0

bull Let z be the ldquononzero bitrdquo of g(0) meaning 0 if g(0) = 0 or 1 if g(0) 6= 0

bull Compute and output 1 + (1minus 2sz)δ For example shift δ to obtain 2δ ldquoANDrdquo withminussz to obtain 2szδ and subtract from 1 + δ

bull Compute and output f oplus sz(f oplus g) obtaining f if sz = 0 or g if sz = 1

bull Compute and output (1minus 2sz)(f(0)gminus g(0)f)x Alternatively compute and outputsimply (f(0)gminusg(0)f)x See the discussion of decomposition and scaling in Section 3

Standard techniques minimize the exact cost of these computations for various repre-sentations of the inputs The details depend on the platform See our case study inSection 7

53 Jumping through division steps Here is a more general divide-and-conqueralgorithm to compute (δn fn gn) and the n-step transition matrix Tnminus1 middot middot middot T0

bull If n le 1 use the definition of divstep and stop

bull Choose j isin 1 2 nminus 1

bull Jump j steps from δ f g to δj fj gj Specifically compute the j-step transition

matrix Tjminus1 middot middot middot T0 and then multiply by(fg

)to obtain

(fjgj

) To compute the

j-step transition matrix call the same algorithm recursively The critical point hereis that this recursive call uses the inputs to precision only j this determines thej-step transition matrix by Theorem 45

bull Similarly jump another nminus j steps from δj fj gj to δn fn gn

14 Fast constant-time gcd computation and modular inversion

fromdivstepsximportdivstepsx

defjumpdivstepsx(ntdeltafg)asserttgt=nandngt=0kx=ftruncate(t)parent()

ifnlt=1returndivstepsx(ntdeltafg)

j=n2

deltaf1g1P1=jumpdivstepsx(jjdeltafg)fg=P1vector((fg))fg=kx(f)truncate(t-j)kx(g)truncate(t-j)

deltaf2g2P2=jumpdivstepsx(n-jn-jdeltafg)fg=P2vector((fg))fg=kx(f)truncate(t-n+1)kx(g)truncate(t-n)

returndeltafgP2P1

Figure 54 Algorithm jumpdivstepsx to compute (δn fn gn) = divstepn(δ f g) Sameinputs and outputs as in Figure 51

This splits an (n t) problem into two smaller problems namely a (j j) problem and an(nminus j nminus j) problem at the cost of O(1) polynomial multiplications involving O(t+ n)coefficients

If j is always chosen as 1 then this algorithm ends up performing the same computationsas Figure 51 compute δ1 f1 g1 to precision tminus 1 compute δ2 f2 g2 to precision tminus 2etc But there are many more possibilities for j

Our jumpdivstepsx algorithm in Figure 54 takes j = bn2c balancing the (j j)problem with the (n minus j n minus j) problem In particular with subquadratic polynomialmultiplication the algorithm is subquadratic With FFT-based polynomial multiplica-tion the algorithm uses (t+ n)(log(t+ n))1+o(1) + n(logn)2+o(1) operations and hencen(logn)2+o(1) operations if t is bounded by n(logn)1+o(1) For comparison Figure 51takes Θ(n(t+ n)) operations

We emphasize that unlike standard fast-gcd algorithms such as [66] this algorithmdoes not call a polynomial-division subroutine between its two recursive calls

55 More speedups Our jumpdivstepsx algorithm is asymptotically Θ(logn) timesslower than fast polynomial multiplication The best previous gcd algorithms also have aΘ(logn) slowdown both in the worst case and in the average case10 This might suggestthat there is very little room for improvement

However Θ says nothing about constant factors There are several standard ideas forconstant-factor speedups in gcd computation beyond lower-level speedups in multiplicationWe now apply the ideas to jumpdivstepsx

The final polynomial multiplications in jumpdivstepsx produce six large outputsf g and four entries of a matrix product P Callers do not always use all of theseoutputs for example the recursive calls to jumpdivstepsx do not use the resulting f

10There are faster cases Strassen [66] gave an algorithm that replaces the log factor with the entropy ofthe list of degrees in the corresponding continued fraction there are occasional inputs where this entropyis o(log n) For example computing gcd 0 takes linear time However these speedups are not usefulfor constant-time computations

Daniel J Bernstein and Bo-Yin Yang 15

and g In principle there is no difficulty applying dead-value elimination and higher-leveloptimizations to each of the 63 possible combinations of desired outputs

The recursive calls in jumpdivstepsx produce two products P1 P2 of transitionmatrices which are then (if desired) multiplied to obtain P = P2P1 In Figure 54jumpdivstepsx multiplies f and g first by P1 and then by P2 to obtain fn and gn Analternative is to multiply by P

The inputs f and g are truncated to t coefficients and the resulting fn and gn aretruncated to tminus n+ 1 and tminus n coefficients respectively Often the caller (for example

jumpdivstepsx itself) wants more coefficients of P(fg

) Rather than performing an

entirely new multiplication by P one can add the truncation errors back into the resultsmultiply P by the f and g truncation errors multiply P2 by the fj and gj truncationerrors and skip the final truncations of fn and gn

There are standard speedups for multiplication in the context of truncated productssums of products (eg in matrix multiplication) partially known products (eg the

coefficients of xminus1 xminus2 xminus3 in P1

(fg

)are known to be 0) repeated inputs (eg each

entry of P1 is multiplied by two entries of P2 and by one of f g) and outputs being reusedas inputs See eg the discussions of ldquoFFT additionrdquo ldquoFFT cachingrdquo and ldquoFFT doublingrdquoin [11]

Any choice of j between 1 and nminus 1 produces correct results Values of j close to theedges do not take advantage of fast polynomial multiplication but this does not imply thatj = bn2c is optimal The number of combinations of choices of j and other options aboveis small enough that for each small (t n) in turn one can simply measure all options andselect the best The results depend on the contextmdashwhich outputs are neededmdashand onthe exact performance of polynomial multiplication

6 Fast polynomial gcd computation and modular inversionThis section presents three algorithms gcdx computes gcdR0 R1 where R0 is a poly-nomial of degree d gt 0 and R1 is a polynomial of degree ltd gcddegree computes merelydeg gcdR0 R1 recipx computes the reciprocal of R1 modulo R0 if gcdR0 R1 = 1Figure 61 specifies all three algorithms and Theorem 62 justifies all three algorithmsThe theorem is proven in Appendix A

Each algorithm calls divstepsx from Section 5 to compute (δ2dminus1 f2dminus1 g2dminus1) =divstep2dminus1(1 f g) Here f = xdR0(1x) is the reversal of R0 and g = xdminus1R1(1x) Ofcourse one can replace divstepsx here with jumpdivstepsx which has the same outputsThe algorithms vary in the input-precision parameter t passed to divstepsx and vary inhow they use the outputs

bull gcddegree uses input precision t = 2dminus 1 to compute δ2dminus1 and then uses one partof Theorem 62 which states that deg gcdR0 R1 = δ2dminus12

bull gcdx uses precision t = 3dminus 1 to compute δ2dminus1 and d+ 1 coefficients of f2dminus1 Thealgorithm then uses another part of Theorem 62 which states that f2dminus1f2dminus1(0)is the reversal of gcdR0 R1

bull recipx uses t = 2dminus1 to compute δ2dminus1 1 coefficient of f2dminus1 and the product P oftransition matrices It then uses the last part of Theorem 62 which (in the coprimecase) states that the top-right corner of P is f2dminus1(0)x2dminus2 times the degree-(dminus 1)reversal of the desired reciprocal

One can generalize to compute gcdR0 R1 for any polynomials R0 R1 of degree ledin time that depends only on d scan the inputs to find the top degree and always perform

16 Fast constant-time gcd computation and modular inversion

fromdivstepsximportdivstepsx

defgcddegree(R0R1)d=R0degree()assertdgt0anddgtR1degree()fg=R0reverse(d)R1reverse(d-1)deltafgP=divstepsx(2d-12d-11fg)returndelta2

defgcdx(R0R1)d=R0degree()assertdgt0anddgtR1degree()fg=R0reverse(d)R1reverse(d-1)deltafgP=divstepsx(2d-13d-11fg)returnfreverse(delta2)f[0]

defrecipx(R0R1)d=R0degree()assertdgt0anddgtR1degree()fg=R0reverse(d)R1reverse(d-1)deltafgP=divstepsx(2d-12d-11fg)ifdelta=0returnkx=fparent()x=kxgen()returnkx(x^(2d-2)P[0][1]f[0])reverse(d-1)

Figure 61 Algorithm gcdx to compute gcdR0 R1 algorithm gcddegree to computedeg gcdR0 R1 algorithm recipx to compute the reciprocal of R1 modulo R0 whengcdR0 R1 = 1 All three algorithms assume that R0 R1 isin k[x] that degR0 gt 0 andthat degR0 gt degR1

2d iterations of divstep The extra iteration handles the case that both R0 and R1 havedegree d For simplicity we focus on the case degR0 = d gt degR1 with d gt 0

One can similarly characterize gcd and modular reciprocal in terms of subsequentiterates of divstep with δ increased by the number of extra steps Implementors mightfor example prefer to use an iteration count that is divisible by a large power of 2

Theorem 62 Let k be a field Let d be a positive integer Let R0 R1 be elements ofthe polynomial ring k[x] with degR0 = d gt degR1 Define G = gcdR0 R1 and let Vbe the unique polynomial of degree lt d minus degG such that V R1 equiv G (mod R0) Definef = xdR0(1x) g = xdminus1R1(1x) (δn fn gn) = divstepn(1 f g) Tn = T (δn fn gn) and(un vnqn rn

)= Tnminus1 middot middot middot T0 Then

degG = δ2dminus12G = xdegGf2dminus1(1x)f2dminus1(0)V = xminusd+1+degGv2dminus1(1x)f2dminus1(0)

63 A numerical example Take k = F7 d = 7 R0 = 2x7 + 7x6 + 1x5 + 8x4 +2x3 + 8x2 + 1x + 8 and R1 = 3x6 + 1x5 + 4x4 + 1x3 + 5x2 + 9x + 2 Then f =2 + 7x+ 1x2 + 8x3 + 2x4 + 8x5 + 1x6 + 8x7 and g = 3 + 1x+ 4x2 + 1x3 + 5x4 + 9x5 + 2x6The gcdx algorithm computes (δ13 f13 g13) = divstep13(1 f g) = (0 2 6) see Table 41

Daniel J Bernstein and Bo-Yin Yang 17

for all iterates of divstep on these inputs Dividing f13 by its constant coefficient produces1 the reversal of gcdR0 R1 The gcddegree algorithm simply computes δ13 = 0 whichis enough information to see that gcdR0 R1 = 164 Constant-factor speedups Compared to the general power-series setting indivstepsx the inputs to gcddegree gcdx and recipx are more limited For examplethe f input is all 0 after the first d+ 1 coefficients so computations involving subsequentcoefficients can be skipped Similarly the top-right entry of the matrix P in recipx isguaranteed to have coefficient 0 for 1 1x 1x2 and so on through 1xdminus2 so one cansave time in the polynomial computations that produce this entry See Theorems A1and A2 for bounds on the sizes of all intermediate results65 Bottom-up gcd and inversion Our division steps eliminate bottom coefficients ofthe inputs Appendices B C and D relate this to a traditional EuclidndashStevin computationwhich gradually eliminates top coefficients of the inputs The reversal of inputs andoutputs in gcddegree gcdx and recipx exchanges the role of top coefficients and bottomcoefficients The following theorem states an alternative skip the reversal and insteaddirectly eliminate the bottom coefficients of the original inputsTheorem 66 Let k be a field Let d be a positive integer Let f g be elements of thepolynomial ring k[x] with f(0) 6= 0 deg f le d and deg g lt d Define γ = gcdf g

Define (δn fn gn) = divstepn(1 f g) Tn = T (δn fn gn) and(un vnqn rn

)= Tnminus1 middot middot middot T0

Thendeg γ = deg f2dminus1

γ = f2dminus1`x2dminus2γ equiv x2dminus2v2dminus1g` (mod f)

where ` is the leading coefficient of f2dminus1Proof Note that γ(0) 6= 0 since f is a multiple of γ

Define R_0 = x^d f(1/x). Then R_0 ∈ k[x]; deg R_0 = d since f(0) ≠ 0; and f = x^d R_0(1/x). Define R_1 = x^{d−1} g(1/x). Then R_1 ∈ k[x]; deg R_1 < d; and g = x^{d−1} R_1(1/x). Define G = gcd{R_0, R_1}. We will show below that γ = c·x^{deg G}G(1/x) for some c ∈ k^*.

Then γ = c·f_{2d−1}/f_{2d−1}(0) by Theorem 6.2, i.e., γ is a constant multiple of f_{2d−1}. Hence deg γ = deg f_{2d−1} as claimed. Furthermore γ has leading coefficient 1, so 1 = cℓ/f_{2d−1}(0), i.e., γ = f_{2d−1}/ℓ as claimed.

By Theorem 4.2, f_{2d−1} = u_{2d−1}f + v_{2d−1}g. Also x^{2d−2}u_{2d−1}, x^{2d−2}v_{2d−1} ∈ k[x] by Theorem 4.3, so

x^{2d−2}γ = (x^{2d−2}u_{2d−1}/ℓ)f + (x^{2d−2}v_{2d−1}/ℓ)g ≡ (x^{2d−2}v_{2d−1}/ℓ)g (mod f)

as claimed.

All that remains is to show that γ = c·x^{deg G}G(1/x) for some c ∈ k^*. If R_1 = 0 then G = gcd{R_0, 0} = R_0/R_0[d], so x^{deg G}G(1/x) = x^d R_0(1/x)/R_0[d] = f/f[0]; also g = 0, so γ = f/f[deg f] = (f[0]/f[deg f])·x^{deg G}G(1/x). Assume from now on that R_1 ≠ 0.

We have R_1 = QG for some Q ∈ k[x]. Note that deg R_1 = deg Q + deg G. Also R_1(1/x) = Q(1/x)G(1/x), so x^{deg R_1}R_1(1/x) = x^{deg Q}Q(1/x)·x^{deg G}G(1/x) since deg R_1 = deg Q + deg G. Hence x^{deg G}G(1/x) divides the polynomial x^{deg R_1}R_1(1/x), which in turn divides x^{d−1}R_1(1/x) = g. Similarly x^{deg G}G(1/x) divides f. Hence x^{deg G}G(1/x) divides gcd{f, g} = γ.

Furthermore, G = AR_0 + BR_1 for some A, B ∈ k[x]. Take any integer m above all of deg G, deg A, deg B; then

x^{m+d−deg G}·x^{deg G}G(1/x) = x^{m+d}G(1/x)
  = x^m A(1/x)·x^d R_0(1/x) + x^{m+1}B(1/x)·x^{d−1}R_1(1/x)
  = x^{m−deg A}·x^{deg A}A(1/x)·f + x^{m+1−deg B}·x^{deg B}B(1/x)·g,


so x^{m+d−deg G}·x^{deg G}G(1/x) is a multiple of γ, so the polynomial γ/(x^{deg G}G(1/x)) divides x^{m+d−deg G}, so γ/(x^{deg G}G(1/x)) = c·x^i for some c ∈ k^* and some i ≥ 0. We must have i = 0 since γ(0) ≠ 0.

6.7. How to divide by x^{2d−2} modulo f. Unlike top-down inversion (and top-down gcd and bottom-up gcd), bottom-up inversion naturally scales its result by a power of x. We now explain various ways to remove this scaling.

For simplicity we focus here on the case gcd{f, g} = 1. The divstep computation in Theorem 6.6 naturally produces the polynomial x^{2d−2}v_{2d−1}/ℓ, which has degree at most d − 1 by Theorem 6.2. Our objective is to divide this polynomial by x^{2d−2} modulo f, obtaining the reciprocal of g modulo f by Theorem 6.6. One way to do this is to multiply by a reciprocal of x^{2d−2} modulo f. This reciprocal depends only on d and f, i.e., it can be precomputed before g is available.

Here are several standard strategies to compute 1/x^{2d−2} modulo f, or more generally 1/x^e modulo f for any nonnegative integer e:

• Compute the polynomial (1 − f/f(0))/x, which is 1/x modulo f, and then raise this to the eth power modulo f. This takes a logarithmic number of multiplications modulo f. (A small Sage sketch of this strategy appears after this list.)

• Start from 1/f(0), the reciprocal of f modulo x. Compute the reciprocals of f modulo x^2, x^4, x^8, etc. by Newton (Hensel) iteration. After finding the reciprocal r of f modulo x^e, divide 1 − rf by x^e to obtain the reciprocal of x^e modulo f. This takes a constant number of full-size multiplications, a constant number of half-size multiplications, etc. The total number of multiplications is again logarithmic, but if e is linear, as in the case e = 2d − 2, then the total size of the multiplications is linear, without an extra logarithmic factor.

• Combine the powering strategy with the Hensel strategy. For example, use the reciprocal of f modulo x^{d−1} to compute the reciprocal of x^{d−1} modulo f, and then square modulo f to obtain the reciprocal of x^{2d−2} modulo f, rather than using the reciprocal of f modulo x^{2d−2}.

• Write down a simple formula for the reciprocal of x^e modulo f if f has a special structure that makes this easy. For example, if f = x^d − 1, as in original NTRU, and d ≥ 3, then the reciprocal of x^{2d−2} is simply x^2. As another example, if f = x^d + x^{d−1} + ··· + 1 and d ≥ 5, then the reciprocal of x^{2d−2} is x^4.
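As promised in the first bullet, here is a small Sage sketch of the powering strategy: 1/x modulo f is (1 − f/f(0))/x, and 1/x^e modulo f is its eth power computed in k[x]/f. The function name xe_reciprocal is ours; this is an illustration, not the paper's software.

    def xe_reciprocal(f, e):
        R = f.parent(); x = R.gen()
        S = R.quotient(f)                  # the ring k[x]/f
        inv_x = S((1 - f/f(0))//x)         # exact division: the constant term cancels
        return inv_x^e                     # logarithmic number of multiplications mod f

For f = x^d − 1 and e = 2d − 2 this returns x^2, matching the last example above.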

In most cases the division by x^{2d−2} modulo f involves extra arithmetic that is avoided by the top-down reversal approach. On the other hand, the case f = x^d − 1 does not involve any extra arithmetic. There are also applications of inversion that can easily handle a scaled result.

7 Software case study for polynomial modular inversion

As a concrete case study, we consider the problem of inversion in the ring (Z/3)[x]/(x^700 + x^699 + ··· + x + 1) on an Intel Haswell CPU core. The situation before our work was that this inversion consumed half of the key-generation time in the ntruhrss701 cryptosystem: about 150000 cycles out of 300000 cycles. This cryptosystem was introduced by Hülsing, Rijneveld, Schanck, and Schwabe [39] at CHES 2017.^11

^11 NTRU-HRSS is another round-2 submission in NIST's ongoing post-quantum competition. Google has also selected NTRU-HRSS for its "CECPQ2" post-quantum experiment. CECPQ2 includes a slightly modified version of ntruhrss701, and the NTRU-HRSS authors have also announced their intention to tweak NTRU-HRSS; these changes do not affect the performance of key generation.


We saved 60000 cycles out of these 150000 cycles, reducing the total key-generation time to 240000 cycles. Our approach also simplifies the code, for example eliminating the series of conditional power-of-2 rotations in [39].

7.1. Details of the new computation. Our computation works with one integer δ and four polynomials v, r, f, g ∈ (Z/3)[x]. Initially δ = 1; v = 0; r = 1; f = x^700 + x^699 + ··· + x + 1; and g = g_699 x^699 + ··· + g_0 x^0 is obtained by reversing the 700 input coefficients. We actually store −δ instead of δ; this turns out to be marginally more efficient for this platform.

We then apply 2·700 − 1 = 1399 iterations of a main loop. Each iteration works as follows, with the order of operations determined experimentally to try to minimize latency issues:

• Replace v with xv.

• Compute a swap mask as −1 if δ > 0 and g(0) ≠ 0, otherwise 0.

• Compute c ∈ Z/3 as f(0)g(0).

• Replace δ with −δ if the swap mask is set.

• Add 1 to δ.

• Replace (f, g) with (g, f) if the swap mask is set.

• Replace g with (g − cf)/x.

• Replace (v, r) with (r, v) if the swap mask is set.

• Replace r with r − cv.

This multiplies the transition matrix T(δ, f, g) into the vector (v/x^{n−1}; r/x^n), where n is the number of iterations, and at the same time replaces (δ, f, g) with divstep(δ, f, g). Actually, we are slightly modifying divstep here (and making the corresponding modification to T), replacing the original coefficients f(0) and g(0) with the scaled coefficients 1 and g(0)/f(0) = f(0)g(0), as in Section 3.4. In other words, if f(0) = −1 then we negate both f and g. We suppress further comments on this deviation from the definition of divstep.
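The following is a minimal, variable-time Sage sketch of this loop, with the same variable names as in the list above; the real software is branchless and bitsliced as described below, and the function name invert_mod_phi is ours.

    R.<x> = GF(3)[]
    def invert_mod_phi(p, n=700):                   # p: input polynomial of degree < n
        phi = sum(x^i for i in range(n+1))           # x^n + ... + x + 1
        delta, v, r = 1, R(0), R(1)
        f = phi
        g = R(list(reversed(p.padded_list(n))))      # reverse the n input coefficients
        for _ in range(2*n - 1):
            v = x*v
            swap = (delta > 0) and (g(0) != 0)
            c = f(0)*g(0)
            if swap: delta = -delta
            delta = delta + 1
            if swap: f, g = g, f
            g = (g - c*f)//x                         # exact: the constant term cancels
            if swap: v, r = r, v
            r = r - c*v
        if delta != 0: return None                   # input not invertible
        v = f(0)*v
        return R(list(reversed(v.padded_list(n))))   # coefficients of x^0..x^{n-1}, reversed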

The effect of n iterations is to replace the original (δ, f, g) with divstep^n(δ, f, g). The matrix product T_{n−1} ··· T_0 has the form (···  v/x^{n−1}; ···  r/x^n). The new f and g are congruent to v/x^{n−1} and r/x^n respectively times the original g modulo x^700 + x^699 + ··· + x + 1.

After 1399 iterations we have δ = 0 if and only if the input is invertible modulo x^700 + x^699 + ··· + x + 1, by Theorem 6.2. Furthermore, if δ = 0, then the inverse is x^{−699}v_{1399}(1/x)/f(0), where v_{1399} = v/x^{1398}; i.e., the inverse is x^{699}v(1/x)/f(0). We thus multiply v by f(0) and take the coefficients of x^0, ..., x^699 in reverse order to obtain the desired inverse.

We use signed representatives −1, 0, 1 for elements of Z/3. We use 2-bit two's-complement representations of −1, 0, 1: the bottom bit is 1, 0, 1 respectively, and the top bit is 1, 0, 0. We wrote a simple superoptimizer to find optimal sequences of 3, 6, 6 bit operations respectively for multiplication, addition, and subtraction. We do not claim novelty for the approach in this paragraph: Boothby and Bradshaw in [20, Section 4.1] suggested the same 2-bit representation for Z/3 (without explaining it as signed two's-complement) and found the same operation counts by a similar search.
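For illustration, here is one possible bitsliced instruction sequence on this 2-bit representation, written in plain Python (here ^ is bitwise XOR, not Sage exponentiation). The multiplication uses 3 bit operations and the addition 6, matching the operation counts stated above, but these are not necessarily the superoptimized sequences used in our software; negation costs one extra XOR in this sketch, so subtraction via a − b = a + (−b) takes 7 operations rather than the 6 reported in the text.

    def mul3(x0, x1, y0, y1):              # (bottom, top) lanes, word-parallel
        r0 = x0 & y0                        # nonzero iff both factors are nonzero
        r1 = (x1 ^ y1) & r0                 # -1 iff signs differ and product is nonzero
        return r0, r1

    def add3(x0, x1, y0, y1):
        t  = x1 ^ y0
        r1 = t & (y1 ^ x0)                  # sum is -1
        r0 = (x0 ^ y0) | (t ^ y1)           # sum is nonzero
        return r0, r1

    def neg3(x0, x1):
        return x0, x1 ^ x0                  # flip the sign of nonzero lanes

    # exhaustive check on single lanes
    enc = {0: (0, 0), 1: (1, 0), -1: (1, 1)}
    dec = {v: k for k, v in enc.items()}
    for a in (-1, 0, 1):
        for b in (-1, 0, 1):
            assert dec[mul3(*enc[a], *enc[b])] == (a*b + 1) % 3 - 1
            assert dec[add3(*enc[a], *enc[b])] == (a + b + 1) % 3 - 1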

We store each of the four polynomials v, r, f, g as six 256-bit vectors: for example, the coefficients of x^0, ..., x^255 in v are stored as a 256-bit vector of bottom bits and a 256-bit vector of top bits. Within each 256-bit vector we use the first 64-bit word to


store the coefficients of x^0, x^4, ..., x^252, the next 64-bit word to store the coefficients of x^1, x^5, ..., x^253, etc. Multiplying by x thus moves the first 64-bit word to the second, moves the second to the third, moves the third to the fourth, moves the fourth to the first, and shifts the first by 1 bit; the bit that falls off the end of the first word is inserted into the next vector.

The first 256 iterations involve only the first 256-bit vectors for v and r, so we simply leave the second and third vectors as 0. The next 256 iterations involve only the first two 256-bit vectors for v and r. Similarly, we manipulate only the first 256-bit vectors for f and g in the final 256 iterations, and we manipulate only the first and second 256-bit vectors for f and g in the previous 256 iterations. Our main loop thus has five sections: 256 iterations involving (1, 1, 3, 3) vectors for (v, r, f, g); 256 iterations involving (2, 2, 3, 3) vectors; 375 iterations involving (3, 3, 3, 3) vectors; 256 iterations involving (3, 3, 2, 2) vectors; and 256 iterations involving (3, 3, 1, 1) vectors. This makes the code longer but saves almost 20% of the vector manipulations.

7.2. Comparison to the old computation. Many features of our computation are also visible in the software from [39]. We emphasize the ways that our computation is more streamlined.

There are essentially the same four polynomials v, r, f, g in [39] (labeled "c", "b", "g", "f"). Instead of a single integer δ (plus a loop counter) there are four integers "degf", "degg", "k", and "done" (plus a loop counter). One cannot simply eliminate "degf" and "degg" in favor of their difference δ, since "done" is set when "degf" reaches 0, and "done" influences "k", which in turn influences the results of the computation: the final result is multiplied by x^{701−k}.

Multiplication by x^{701−k} is a conceptually simple matter of rotating by k positions within 701 positions and then reducing modulo x^700 + x^699 + ··· + x + 1, since x^701 − 1 is a multiple of x^700 + x^699 + ··· + x + 1. However, since k is a variable, the obvious method of rotating by k positions does not take constant time. [39] instead performs conditional rotations by 1 position, 2 positions, 4 positions, etc., where the condition bits are the bits of k.

The inputs and outputs are not reversed in [39]. This affects the power of x multiplied into the final results. Our approach skips the powers of x entirely.

Each polynomial in [39] is represented as six 256-bit vectors, with the five-section pattern skipping manipulations of some vectors as explained above. The order of coefficients in each vector is simply x^0, x^1, ...; multiplication by x in [39] uses more arithmetic instructions than our approach does.

At a lower level, [39] uses unsigned representatives 0, 1, 2 of Z/3 and uses a manually designed sequence of 19 bit operations for a multiply-add operation in Z/3. Langley [43] suggested a sequence of 16 bit operations, or 14 if an "and-not" operation is available. As in [20], we use just 9 bit operations.

The inversion software from [39] is 798 lines (32273 bytes) of Python generating assembly, producing 26634 bytes of object code. Our inversion software is 541 lines (14675 bytes) of C with intrinsics, compiled with gcc 8.2.1 to generate assembly, producing 10545 bytes of object code.

7.3. More NTRU examples. More than 1/3 of the key-generation time in [39] is consumed by a second inversion, which replaces the modulus 3 with 2^13. This is handled in [39] with an initial inversion modulo 2 followed by 8 multiplications modulo 2^13 for a Newton iteration.^12 The initial inversion modulo 2 is handled in [39] by exponentiation,

^12 This use of Newton iteration follows the NTRU key-generation strategy suggested in [61], but we point out a difference in the details. All 8 multiplications in [39] are carried out modulo 2^13, while [61] instead follows the traditional precision doubling in Newton iteration: compute an inverse modulo 2^2, then 2^4, then 2^8 (or 2^7), then 2^13. Perhaps limiting the precision of the first 6 multiplications would save time in [39].


taking just 10332 cycles; the point here is that squaring modulo 2 is particularly efficient.

Much more inversion time is spent in key generation in sntrup4591761, a slightly larger cryptosystem from the NTRU Prime [14] family.^13 NTRU Prime uses a prime modulus, such as 4591 for sntrup4591761, to avoid concerns that a power-of-2 modulus could allow attacks (see generally [14]), but this modulus also makes arithmetic slower. Before our work, the key-generation software for sntrup4591761 took 6 million cycles, partly for inversion in (Z/3)[x]/(x^761 − x − 1) but mostly for inversion in (Z/4591)[x]/(x^761 − x − 1). We reimplemented both of these inversion steps, reducing the total key-generation time to just 940852 cycles.^14 Almost all of our inversion code modulo 3 is shared between sntrup4591761 and ntruhrss701.

^13 NTRU Prime is another round-2 NIST submission. One other NTRU-based encryption system, NTRUEncrypt [29], was submitted to the competition but has been merged into NTRU-HRSS for round 2; see [59]. For completeness we mention that NTRUEncrypt key generation took 980760 cycles for ntrukem443 and 2639756 cycles for ntrukem743. As far as we know, the NTRUEncrypt software was not designed to run in constant time.

^14 This is a median cycle count from the SUPERCOP benchmarking framework. There is considerable variation in the cycle counts, more than 3% between quartiles, presumably depending on the mapping from virtual addresses to physical addresses. There is no dependence on the secret input being inverted.

7.4. Does lattice-based cryptography need fast inversion? Lyubashevsky, Peikert, and Regev [46] published an NTRU alternative that avoids inversions.^15 However, this cryptosystem has slower encryption than NTRU, has slower decryption than NTRU, and appears to be covered by a 2010 patent [35] by Gaborit and Aguilar-Melchor. It is easy to envision applications that will (1) select NTRU and (2) benefit from faster inversions in NTRU key generation.^16

^15 The LPR cryptosystem is the basis for several round-2 NIST submissions: Kyber, LAC, NewHope, Round5, Saber, ThreeBears, and the "NTRU LPRime" option in NTRU Prime.

^16 Google, for example, selected NTRU-HRSS for CECPQ2, as noted above, with the following explanation: "Schemes with a quotient-style key (like HRSS) will probably have faster encap/decap operations at the cost of much slower key-generation. Since there will be many uses outside TLS where keys can be reused, this is interesting as long as the key-generation speed is still reasonable for TLS." See [42].

Lyubashevsky and Seiler have very recently announced "NTTRU" [47], a variant of NTRU with a ring chosen to allow very fast NTT-based (FFT-based) multiplication and division. This variant triggers many of the security concerns described in [14], but if the variant survives security analysis then it will eliminate concerns about key-generation speed in NTRU.

8 Definition of 2-adic division steps

Define divstep: Z × Z_2^* × Z_2 → Z × Z_2^* × Z_2 as follows:

divstep(δ, f, g) = (1 − δ, g, (g − f)/2)             if δ > 0 and g is odd,
divstep(δ, f, g) = (1 + δ, f, (g + (g mod 2)f)/2)    otherwise.

Note that g − f is even in the first case (since both f and g are odd), and g + (g mod 2)f is even in the second case (since f is odd).
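A direct plain-Python transcription of this map on ordinary integers reads as follows; it is a sketch for experimentation, not the constant-time versions developed later.

    def divstep(delta, f, g):
        assert f & 1                                  # f must be odd
        if delta > 0 and g & 1:
            return 1 - delta, g, (g - f) // 2
        return 1 + delta, f, (g + (g % 2) * f) // 2   # both divisions are exact

Iterating divstep on integer inputs eventually sends g to 0 (see Section 8.3 and Theorem 11.2), at which point |f| is the gcd:

    def gcd_via_divstep(f, g):
        delta = 1
        while g != 0:
            delta, f, g = divstep(delta, f, g)
        return abs(f)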

8.1. Transition matrices. Write (δ_1, f_1, g_1) = divstep(δ, f, g). Then

(f_1; g_1) = T(δ, f, g)·(f; g)   and   (1; δ_1) = S(δ, f, g)·(1; δ),

where T: Z × Z_2^* × Z_2 → M_2(Z[1/2]) is defined by

T(δ, f, g) = (0  1; −1/2  1/2)              if δ > 0 and g is odd,
T(δ, f, g) = (1  0; (g mod 2)/2  1/2)       otherwise,

and S: Z × Z_2^* × Z_2 → M_2(Z) is defined by

S(δ, f, g) = (1  0; 1  −1)    if δ > 0 and g is odd,
S(δ, f, g) = (1  0; 1  1)     otherwise.

Note that both T(δ, f, g) and S(δ, f, g) are defined entirely by δ, the bottom bit of f (which is always 1), and the bottom bit of g; they do not depend on the remaining bits of f and g.

8.2. Decomposition. This 2-adic divstep, like the power-series divstep, is a composition of two simpler functions: a conditional swap that replaces (δ, f, g) with (−δ, g, −f) if δ > 0 and g is odd, followed by an elimination that replaces (δ, f, g) with (1 + δ, f, (g + (g mod 2)f)/2).

8.3. Sizes. Write (δ_1, f_1, g_1) = divstep(δ, f, g). If f and g are integers that fit into n + 1 bits two's-complement, in the sense that −2^n ≤ f < 2^n and −2^n ≤ g < 2^n, then also −2^n ≤ f_1 < 2^n and −2^n ≤ g_1 < 2^n. Similarly, if −2^n < f < 2^n and −2^n < g < 2^n, then −2^n < f_1 < 2^n and −2^n < g_1 < 2^n. Beware, however, that intermediate results such as g − f need an extra bit.

As in the polynomial case, an important feature of divstep is that iterating divstep on integer inputs eventually sends g to 0. Concretely, we will show that (D + o(1))n iterations are sufficient, where D = 2/(10 − log_2 633) = 2.8821...; see Theorem 11.2 for exact bounds. The proof technique cannot do better than (d + o(1))n iterations, where d = 14/(15 − log_2(561 + √249185)) = 2.8283.... Numerical evidence is consistent with the possibility that (d + o(1))n iterations are required in the worst case, which is what matters in the context of constant-time algorithms.

8.4. A non-functioning variant. Many variations are possible in the definition of divstep. However, some care is required, as the following variant illustrates.

Define posdivstep: Z × Z_2^* × Z_2 → Z × Z_2^* × Z_2 as follows:

posdivstep(δ, f, g) = (1 − δ, g, (g + f)/2)             if δ > 0 and g is odd,
posdivstep(δ, f, g) = (1 + δ, f, (g + (g mod 2)f)/2)    otherwise.

This is just like divstep, except that it eliminates the negation of f in the swapped case.

It is not true that iterating posdivstep on integer inputs eventually sends g to 0.

For example, posdivstep^2(1, 1, 1) = (1, 1, 1). For comparison, in the polynomial case we were free to negate f (or g) at any moment.

8.5. A centered variant. As another variant of divstep, we define cdivstep: Z × Z_2^* × Z_2 → Z × Z_2^* × Z_2 as follows:

cdivstep(δ, f, g) = (1 − δ, g, (f − g)/2)             if δ > 0 and g is odd,
cdivstep(δ, f, g) = (1 + δ, f, (g + (g mod 2)f)/2)    if δ = 0,
cdivstep(δ, f, g) = (1 + δ, f, (g − (g mod 2)f)/2)    otherwise.
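For comparison with the divstep transcription in Section 8, here is a direct plain-integer transcription of cdivstep (again just a sketch):

    def cdivstep(delta, f, g):
        assert f & 1
        if delta > 0 and g & 1:
            return 1 - delta, g, (f - g) // 2
        if delta == 0:
            return 1 + delta, f, (g + (g % 2) * f) // 2
        return 1 + delta, f, (g - (g % 2) * f) // 2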


Iterating cdivstep produces the same intermediate results as a variable-time gcd algorithm introduced by Stehlé and Zimmermann in [62]. The analogous gcd algorithm for divstep is a new variant of the Stehlé–Zimmermann algorithm, and we analyze the performance of this variant to prove bounds on the performance of divstep; see Appendices E, F, and G. As an analogy, one of our proofs in the polynomial case relies on relating x-adic divstep to the Euclid–Stevin algorithm.

The extra case distinctions make each cdivstep iteration more complicated than each divstep iteration. Similarly, the Stehlé–Zimmermann algorithm is more complicated than the new variant. To understand the motivation for cdivstep, compare divstep^3(−2, f, g) to cdivstep^3(−2, f, g). One has divstep^3(−2, f, g) = (1, f, g_3), where

8g_3 ∈ {g, g + f, g + 2f, g + 3f, g + 4f, g + 5f, g + 6f, g + 7f}.

One has cdivstep^3(−2, f, g) = (1, f, g'_3), where

8g'_3 ∈ {g − 3f, g − 2f, g − f, g, g + f, g + 2f, g + 3f, g + 4f}.

It seems intuitively clear that the "centered" output g'_3 is smaller than the "uncentered" output g_3, that more generally cdivstep has smaller outputs than divstep, and that cdivstep thus uses fewer iterations than divstep.

However, our analyses do not support this intuition. All available evidence indicates that, surprisingly, the worst case of cdivstep uses almost 10% more iterations than the worst case of divstep. Concretely, our proof technique cannot do better than (c + o(1))n iterations for cdivstep, where c = 2/(log_2(√17 − 1) − 1) = 3.1105..., and numerical evidence is consistent with the possibility that (c + o(1))n iterations are required.

8.6. A plus-or-minus variant. As yet another example, the following variant of divstep is essentially^17 the Brent–Kung "Algorithm PM" from [24]:

pmdivstep(δ, f, g) = (−δ, g, (f − g)/2)    if δ ≥ 0 and (g − f) mod 4 = 0,
pmdivstep(δ, f, g) = (−δ, g, (f + g)/2)    if δ ≥ 0 and (g + f) mod 4 = 0,
pmdivstep(δ, f, g) = (δ, f, (g − f)/2)     if δ < 0 and (g − f) mod 4 = 0,
pmdivstep(δ, f, g) = (δ, f, (g + f)/2)     if δ < 0 and (g + f) mod 4 = 0,
pmdivstep(δ, f, g) = (1 + δ, f, g/2)       otherwise.

There are even more cases here than in cdivstep, although splitting out an initial conditional swap reduces the number of cases to 3. Another complication here is that the linear combinations used in pmdivstep depend on the bottom two bits of f and g, rather than just the bottom bit.

To understand the motivation for pmdivstep, note that after each addition or subtraction there are always at least two halvings, as in the "sliding window" approach to exponentiation. Again it seems intuitively clear that this reduces the number of iterations required. Consider, however, the following example (which is from [24, Theorem 3], modulo minor details): pmdivstep needs 3b − 1 iterations to reach g = 0 starting from (1, 1, 3·2^{b−2}). Specifically, the first b − 2 iterations produce

(2, 1, 3·2^{b−3}), (3, 1, 3·2^{b−4}), ..., (b − 2, 1, 3·2), (b − 1, 1, 3),

and then the next 2b + 1 iterations produce

(1 − b, 3, 2), (2 − b, 3, 1), (2 − b, 3, 2), (3 − b, 3, 1), (3 − b, 3, 2), ..., (−1, 3, 1), (−1, 3, 2), (0, 3, 1), (0, 1, 2), (1, 1, 1), (−1, 1, 0).

^17 The Brent–Kung algorithm has an extra loop to allow the case that both inputs f, g are even. Compare [24, Figure 4] to the definition of pmdivstep.


Brent and Kung prove that ⌈cn⌉ systolic cells are enough for their algorithm, where c = 3.1105... is the constant mentioned above, and they conjecture that (3 + o(1))n cells are enough, so presumably (3 + o(1))n iterations of pmdivstep are enough. Our analysis shows that fewer iterations are sufficient for divstep, even though divstep is simpler than pmdivstep.

9 Iterates of 2-adic division steps

The following results are analogous to the x-adic results in Section 4. We state these results separately to support verification.

Starting from (δ, f, g) ∈ Z × Z_2^* × Z_2, define (δ_n, f_n, g_n) = divstep^n(δ, f, g) ∈ Z × Z_2^* × Z_2, T_n = T(δ_n, f_n, g_n) ∈ M_2(Z[1/2]), and S_n = S(δ_n, f_n, g_n) ∈ M_2(Z). Similarly, starting from a second input (δ', f', g') ∈ Z × Z_2^* × Z_2, define (δ'_n, f'_n, g'_n), T'_n, and S'_n.

Theorem 9.1. (f_n; g_n) = T_{n−1} ··· T_m (f_m; g_m) and (1; δ_n) = S_{n−1} ··· S_m (1; δ_m) if n ≥ m ≥ 0.

Proof. This follows from (f_{m+1}; g_{m+1}) = T_m (f_m; g_m) and (1; δ_{m+1}) = S_m (1; δ_m).

Theorem 9.2. T_{n−1} ··· T_m ∈ ((1/2^{n−m−1})Z  (1/2^{n−m−1})Z; (1/2^{n−m})Z  (1/2^{n−m})Z) if n > m ≥ 0.

Proof. As in Theorem 4.3.

Theorem 9.3. S_{n−1} ··· S_m ∈ ({1}  {0}; {2 − (n−m), ..., n−m − 2, n−m}  {1, −1}) if n > m ≥ 0.

Proof. As in Theorem 4.4.

Theorem 9.4. Let m, t be nonnegative integers. Assume that δ'_m = δ_m, f'_m ≡ f_m (mod 2^t), and g'_m ≡ g_m (mod 2^t). Then, for each integer n with m < n ≤ m + t: T'_{n−1} = T_{n−1}; S'_{n−1} = S_{n−1}; δ'_n = δ_n; f'_n ≡ f_n (mod 2^{t−(n−m−1)}); and g'_n ≡ g_n (mod 2^{t−(n−m)}).

Proof. Fix m, t and induct on n. Observe that (δ'_{n−1}, f'_{n−1} mod 2, g'_{n−1} mod 2) equals (δ_{n−1}, f_{n−1} mod 2, g_{n−1} mod 2):

• If n = m + 1: By assumption f'_m ≡ f_m (mod 2^t), and n ≤ m + t, so t ≥ 1, so in particular f'_m mod 2 = f_m mod 2. Similarly g'_m mod 2 = g_m mod 2. By assumption δ'_m = δ_m.

• If n > m + 1: f'_{n−1} ≡ f_{n−1} (mod 2^{t−(n−m−2)}) by the inductive hypothesis, and n ≤ m + t, so t − (n−m−2) ≥ 2, so in particular f'_{n−1} mod 2 = f_{n−1} mod 2. Similarly g'_{n−1} ≡ g_{n−1} (mod 2^{t−(n−m−1)}) by the inductive hypothesis, and t − (n−m−1) ≥ 1, so in particular g'_{n−1} mod 2 = g_{n−1} mod 2; and δ'_{n−1} = δ_{n−1} by the inductive hypothesis.

Now use the fact that T and S inspect only the bottom bit of each number to see that

T'_{n−1} = T(δ'_{n−1}, f'_{n−1}, g'_{n−1}) = T(δ_{n−1}, f_{n−1}, g_{n−1}) = T_{n−1},
S'_{n−1} = S(δ'_{n−1}, f'_{n−1}, g'_{n−1}) = S(δ_{n−1}, f_{n−1}, g_{n−1}) = S_{n−1},

as claimed.

Multiply (1; δ'_{n−1}) = (1; δ_{n−1}) by S'_{n−1} = S_{n−1} to see that δ'_n = δ_n, as claimed.


    def truncate(f,t):
      if t == 0: return 0
      twot = 1 << (t-1)
      return ((f+twot) & (2*twot-1)) - twot

    def divsteps2(n,t,delta,f,g):
      assert t >= n and n >= 0
      f,g = truncate(f,t),truncate(g,t)
      u,v,q,r = 1,0,0,1

      while n > 0:
        f = truncate(f,t)
        if delta > 0 and g&1: delta,f,g,u,v,q,r = -delta,g,-f,q,r,-u,-v
        g0 = g&1
        delta,g,q,r = 1+delta,(g+g0*f)/2,(q+g0*u)/2,(r+g0*v)/2
        n,t = n-1,t-1
        g = truncate(ZZ(g),t)

      M2Q = MatrixSpace(QQ,2)
      return delta,f,g,M2Q((u,v,q,r))

Figure 10.1: Algorithm divsteps2 to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). Inputs: n, t ∈ Z with 0 ≤ n ≤ t; δ ∈ Z; at least bottom t bits of f, g ∈ Z_2. Outputs: δ_n; bottom t bits of f_n if n = 0, or t − (n − 1) bits if n ≥ 1; bottom t − n bits of g_n; T_{n−1} ··· T_0.

By the inductive hypothesis, (T'_m, ..., T'_{n−2}) = (T_m, ..., T_{n−2}), and T'_{n−1} = T_{n−1}, so T'_{n−1} ··· T'_m = T_{n−1} ··· T_m. Abbreviate T_{n−1} ··· T_m as P. Then (f'_n; g'_n) = P(f'_m; g'_m) and (f_n; g_n) = P(f_m; g_m) by Theorem 9.1, so

(f'_n − f_n; g'_n − g_n) = P(f'_m − f_m; g'_m − g_m).

By Theorem 9.2, P ∈ ((1/2^{n−m−1})Z  (1/2^{n−m−1})Z; (1/2^{n−m})Z  (1/2^{n−m})Z). By assumption, f'_m − f_m and g'_m − g_m are multiples of 2^t. Multiply to see that f'_n − f_n is a multiple of 2^t/2^{n−m−1} = 2^{t−(n−m−1)}, as claimed, and g'_n − g_n is a multiple of 2^t/2^{n−m} = 2^{t−(n−m)}, as claimed.

10 Fast computation of iterates of 2-adic division steps

We describe, just as in Section 5, two algorithms to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). We are given the bottom t bits of f, g, and compute f_n, g_n to t − n + 1, t − n bits respectively if n ≥ 1; see Theorem 9.4. We also compute the product of the relevant T matrices. We specify these algorithms in Figures 10.1–10.2 as functions divsteps2 and jumpdivsteps2 in Sage.

The first function, divsteps2, simply takes divsteps one after another, analogously to divstepsx. The second function, jumpdivsteps2, uses a straightforward divide-and-conquer strategy, analogously to jumpdivstepsx. More generally, j in jumpdivsteps2 can be taken anywhere between 1 and n − 1; this generalization includes divsteps2 as a special case.

As in Section 5, the Sage functions are not constant-time, but it is easy to build constant-time versions of the same computations. The comments on performance in Section 5 are applicable here, with integer arithmetic replacing polynomial arithmetic. The same basic optimization techniques are also applicable here.
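For verification, here is a small consistency check (our own, assuming the Sage divsteps2 of Figure 10.1 is loaded): for n ≥ 1, the returned δ must match n literal applications of the Section 8 recurrence, and the returned f, g must match modulo 2^{t−(n−1)} and 2^{t−n} respectively, as the figure caption states.

    def check_divsteps2(n, t, delta, f, g):
        assert n >= 1
        dn, fn, gn, P = divsteps2(n, t, delta, f, g)
        for _ in range(n):
            if delta > 0 and g & 1:
                delta, f, g = 1 - delta, g, (g - f)//2
            else:
                delta, f, g = 1 + delta, f, (g + (g % 2)*f)//2
        assert dn == delta
        assert (fn - f) % 2**(t - n + 1) == 0
        assert (gn - g) % 2**(t - n) == 0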


    from divsteps2 import divsteps2,truncate

    def jumpdivsteps2(n,t,delta,f,g):
      assert t >= n and n >= 0
      if n <= 1: return divsteps2(n,t,delta,f,g)

      j = n//2

      delta,f1,g1,P1 = jumpdivsteps2(j,j,delta,f,g)
      f,g = P1*vector((f,g))
      f,g = truncate(ZZ(f),t-j),truncate(ZZ(g),t-j)

      delta,f2,g2,P2 = jumpdivsteps2(n-j,n-j,delta,f,g)
      f,g = P2*vector((f,g))
      f,g = truncate(ZZ(f),t-n+1),truncate(ZZ(g),t-n)

      return delta,f,g,P2*P1

Figure 10.2: Algorithm jumpdivsteps2 to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). Same inputs and outputs as in Figure 10.1.

11 Fast integer gcd computation and modular inversion

This section presents two algorithms. If f is an odd integer and g is an integer, then gcd2 computes gcd{f, g}. If also gcd{f, g} = 1, then recip2 computes the reciprocal of g modulo f. Figure 11.1 specifies both algorithms, and Theorem 11.2 justifies both algorithms.

The algorithms take the smallest nonnegative integer d such that |f| < 2^d and |g| < 2^d. More generally, one can take d as an extra input; the algorithms then take constant time if d is constant. Theorem 11.2 relies on f^2 + 4g^2 ≤ 5·2^{2d} (which is certainly true if |f| < 2^d and |g| < 2^d) but does not rely on d being minimal. One can further generalize to allow even f, by first finding the number of shared powers of 2 in f and g and then reducing to the odd case.

The algorithms then choose a positive integer m as a particular function of d, and call divsteps2 (which can be transparently replaced with jumpdivsteps2) to compute (δ_m, f_m, g_m) = divstep^m(1, f, g). The choice of m guarantees g_m = 0 and f_m = ± gcd{f, g} by Theorem 11.2.

The gcd2 algorithm uses input precision m + d to obtain d + 1 bits of f_m, i.e., to obtain the signed remainder of f_m modulo 2^{d+1}. This signed remainder is exactly f_m = ± gcd{f, g}, since the assumptions |f| < 2^d and |g| < 2^d imply |gcd{f, g}| < 2^d.

The recip2 algorithm uses input precision just m + 1 to obtain the signed remainder of f_m modulo 4. This algorithm assumes gcd{f, g} = 1, so f_m = ±1, so the signed remainder is exactly f_m. The algorithm also extracts the top-right corner v_m of the transition matrix, and multiplies by f_m, obtaining a reciprocal of g modulo f by Theorem 11.2.

Both gcd2 and recip2 are bottom-up algorithms, and in particular recip2 naturally obtains this reciprocal v_m f_m as a fraction with denominator 2^{m−1}. It multiplies by 2^{m−1} to obtain an integer, and then multiplies by ((f + 1)/2)^{m−1} modulo f to divide by 2^{m−1} modulo f. The reciprocal of 2^{m−1} can be computed ahead of time. Another way to compute this reciprocal is through Hensel lifting to compute f^{−1} (mod 2^{m−1}); for details and further techniques see Section 6.7.

We emphasize that recip2 assumes that its inputs are coprime; it does not notice if this assumption is violated. One can check the assumption by computing f_m to higher precision (as in gcd2), or by multiplying the supposed reciprocal by g and seeing whether the result is 1 modulo f. For comparison, the recip algorithm from Section 6 does notice if its polynomial inputs are coprime, using a quick test of δ_m.


    from divsteps2 import divsteps2

    def iterations(d):
      return (49*d+80)//17 if d < 46 else (49*d+57)//17

    def gcd2(f,g):
      assert f & 1
      d = max(f.nbits(),g.nbits())
      m = iterations(d)
      delta,fm,gm,P = divsteps2(m,m+d,1,f,g)
      return abs(fm)

    def recip2(f,g):
      assert f & 1
      d = max(f.nbits(),g.nbits())
      m = iterations(d)
      precomp = Integers(f)((f+1)/2)^(m-1)
      delta,fm,gm,P = divsteps2(m,m+1,1,f,g)
      V = sign(fm)*ZZ(P[0][1]*2^(m-1))
      return ZZ(V*precomp)

Figure 11.1: Algorithm gcd2 to compute gcd{f, g}; algorithm recip2 to compute the reciprocal of g modulo f when gcd{f, g} = 1. Both algorithms assume that f is odd.
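As a small usage illustration of Figure 11.1 (the specific inputs are ours; gcd2 needs an odd f, and recip2 additionally needs gcd{f, g} = 1):

    assert gcd2(793, 1300) == 13     # 793 = 13*61, 1300 = 13*100
    assert recip2(101, 77) == 21     # 77*21 = 1617 = 16*101 + 1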


Theorem 11.2. Let f be an odd integer. Let g be an integer. Define (δ_n, f_n, g_n) = divstep^n(1, f, g), T_n = T(δ_n, f_n, g_n), and (u_n v_n; q_n r_n) = T_{n−1} ··· T_0. Let d be a real number. Assume that f^2 + 4g^2 ≤ 5·2^{2d}. Let m be an integer. Assume that m ≥ ⌊(49d + 80)/17⌋ if d < 46, and that m ≥ ⌊(49d + 57)/17⌋ if d ≥ 46. Then m ≥ 1; g_m = 0; f_m = ± gcd{f, g}; 2^{m−1}v_m ∈ Z; and 2^{m−1}v_m g ≡ 2^{m−1}f_m (mod f).

In particular, if gcd{f, g} = 1, then f_m = ±1, and dividing 2^{m−1}v_m f_m by 2^{m−1} modulo f produces the reciprocal of g modulo f.

Proof. First f^2 ≥ 1, so 1 ≤ f^2 + 4g^2 ≤ 5·2^{2d} < 2^{2d+7/3}, since 5^3 < 2^7. Hence d > −7/6. If d < 46, then m ≥ ⌊(49d + 80)/17⌋ ≥ ⌊(49·(−7/6) + 80)/17⌋ = 1. If d ≥ 46, then m ≥ ⌊(49d + 57)/17⌋ ≥ ⌊(49·46 + 57)/17⌋ = 135. Either way m ≥ 1 as claimed.

Define R_0 = f, R_1 = 2g, and G = gcd{R_0, R_1}. Then G = gcd{f, g} since f is odd. Define b = log_2 √(R_0^2 + R_1^2) = log_2 √(f^2 + 4g^2) ≥ 0. Then 2^{2b} = f^2 + 4g^2 ≤ 5·2^{2d}, so b ≤ d + log_4 5. We now split into three cases, in each case defining n as in Theorem G.6:

• If b ≤ 21: Define n = ⌊19b/7⌋. Then n ≤ ⌊49b/17⌋ ≤ ⌊49(d + log_4 5)/17⌋ ≤ ⌊(49d + 57)/17⌋ ≤ m, since 5^49 ≤ 4^57.

• If 21 < b ≤ 46: Define n = ⌊(49b + 23)/17⌋. If d ≥ 46, then n ≤ ⌊(49d + 23)/17⌋ ≤ ⌊(49d + 57)/17⌋ ≤ m. Otherwise d < 46, so n ≤ ⌊(49(d + log_4 5) + 23)/17⌋ ≤ ⌊(49d + 80)/17⌋ ≤ m.

• If b > 46: Define n = ⌊49b/17⌋. Then n ≤ ⌊49b/17⌋ ≤ ⌊49(d + log_4 5)/17⌋ ≤ ⌊(49d + 57)/17⌋ ≤ m.


In all cases n ≤ m.

Now divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0} by Theorem G.6, so divstep^m(1, R_0, R_1/2) ∈ Z × Z × {0}, so

(δ_m, f_m, g_m) = divstep^m(1, f, g) = divstep^m(1, R_0, R_1/2) ∈ Z × {G, −G} × {0}

by Theorem G.3. Hence g_m = 0 as claimed, and f_m = ±G = ± gcd{f, g} as claimed.

Both u_m and v_m are in (1/2^{m−1})Z by Theorem 9.2. In particular 2^{m−1}v_m ∈ Z as claimed. Finally, f_m = u_m f + v_m g by Theorem 9.1, so 2^{m−1}f_m = 2^{m−1}u_m f + 2^{m−1}v_m g, and 2^{m−1}u_m ∈ Z, so 2^{m−1}f_m ≡ 2^{m−1}v_m g (mod f) as claimed.

12 Software case study for integer modular inversion

As emphasized in Section 1, we have selected an inversion problem to be extremely favorable to Fermat's method. We use Intel platforms with 64-bit multipliers. We focus on the problem of inverting modulo the well-known Curve25519 prime p = 2^255 − 19. There has been a long line of Curve25519 implementation papers, starting with [10] in 2006; we compare to the latest speed records [54] (in assembly) for inversion modulo this prime.

Despite all these Fermat advantages, we achieve better speeds using our 2-adic division steps. As mentioned in Section 1, we take 10050 cycles, 8778 cycles, and 8543 cycles on Haswell, Skylake, and Kaby Lake, where [54] used 11854 cycles, 9301 cycles, and 8971 cycles. This section looks more closely at how the new computation works.

12.1. Details of the new computation. Theorem 11.2 guarantees that 738 divstep iterations suffice, since ⌊(49·255 + 57)/17⌋ = 738. To simplify the computation we actually use 744 = 12·62 iterations.

The decisions in the first 62 iterations are determined entirely by the bottom 62 bits of the inputs. We perform these iterations entirely in 64-bit words, apply the resulting 62-step transition matrix to the entire original inputs to jump to the result of 62 divstep iterations, and then repeat.

This is similar in two important ways to Lehmer's gcd computation from [44]. One of Lehmer's ideas, mentioned before, is to carry out some steps in limited precision. Another of Lehmer's ideas is for this limit to be a small constant. Our advantage over Lehmer's computation is that we have a completely regular constant-time data flow.

There are many other possibilities for which precision to use at each step. All of these possibilities can be described using the jumping framework described in Section 5.3, which applies to both the x-adic case and the 2-adic case. Figure 12.2 gives three examples of jump strategies. The first strategy computes each divstep in turn, to full precision, as in the case study in Section 7. The second strategy is what we use in this case study. The third strategy illustrates an asymptotically faster choice of jump distances. The third strategy might be more efficient than the second strategy in hardware, but the second strategy seems optimal for 64-bit CPUs.

In more detail, we invert a 255-bit x modulo p as follows:

1. Set f = p, g = x, δ = 1, i = 1.

2. Set f_0 = f (mod 2^64), g_0 = g (mod 2^64).

3. Compute (δ', f_1, g_1) = divstep^62(δ, f_0, g_0), and obtain a scaled transition matrix T_i such that (T_i/2^62)(f_0; g_0) = (f_1; g_1). The 63-bit signed entries of T_i fit into 64-bit registers. We call this step jump64divsteps2.

4. Compute (f, g) ← T_i(f, g)/2^62. Set δ = δ'.


Figure 12.2: Three jump strategies for 744 divstep iterations on 255-bit integers. Left strategy: j = 1, i.e., computing each divstep in turn to full precision. Middle strategy: j = 1 for ≤62 requested iterations, else j = 62; i.e., using 62 bottom bits to compute 62 iterations, jumping 62 iterations, using 62 new bottom bits to compute 62 more iterations, etc. Right strategy: j = 1 for 2 requested iterations; j = 2 for 3 or 4 requested iterations; j = 4 for 5, 6, 7, or 8 requested iterations; etc. Vertical axis: 0 on top through 744 on bottom, number of iterations. Horizontal axis (within each strategy): 0 on left through 254 on right, bit positions used in computation.

5. Set i ← i + 1. Go back to step 3 if i ≤ 12.

6. Compute v mod p, where v is the top-right corner of T_12 T_11 ··· T_1:

(a) Compute pair-products T_{2i}T_{2i−1}, with entries being 126-bit signed integers, which fit into two 64-bit limbs (radix 2^64).

(b) Compute quad-products T_{4i}T_{4i−1}T_{4i−2}T_{4i−3}, with entries being 252-bit signed integers (four 64-bit limbs, radix 2^64).

(c) At this point the three quad-products are converted into unsigned integers modulo p, divided into 4 64-bit limbs.

(d) Compute the final vector-matrix-vector multiplications modulo p.

7. Compute x^{−1} = sgn(f)·v·2^{−744} (mod p), where 2^{−744} is precomputed.
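The following plain-Python sketch shows the 12 × 62 structure just described, without limbs, vectorization, or constant-time masking; jump64divsteps2 is named after the step above, but this simplified implementation and the name invert25519 are ours.

    p = 2**255 - 19

    def jump64divsteps2(delta, f, g):
        # 62 divsteps driven by the bottom 64 bits of f, g; returns the new delta
        # and u, v, q, r with (u v; q r)/2^62 the true 62-step transition matrix.
        u, v, q, r = 1, 0, 0, 1
        f, g = f % 2**64, g % 2**64          # only these bits influence the decisions
        for _ in range(62):
            if delta > 0 and g & 1:
                delta, f, g, u, v, q, r = -delta, g, -f, q, r, -u, -v
            g0 = g & 1
            delta, g = 1 + delta, (g + g0*f) >> 1
            q, r = q + g0*u, r + g0*v        # scaled matrix update ...
            u, v = 2*u, 2*v                  # ... scale grows by one bit per step
        return delta, u, v, q, r

    def invert25519(x):
        delta, f, g = 1, p, x
        M = [[1, 0], [0, 1]]                 # running product T_i ... T_1
        for _ in range(12):
            delta, u, v, q, r = jump64divsteps2(delta, f, g)
            f, g = (u*f + v*g) >> 62, (q*f + r*g) >> 62   # exact: apply T_i / 2^62
            M = [[u*M[0][0] + v*M[1][0], u*M[0][1] + v*M[1][1]],
                 [q*M[0][0] + r*M[1][0], q*M[0][1] + r*M[1][1]]]
        sign = 1 if f > 0 else -1            # f is now +-1 since gcd(p, x) = 1
        inv2_744 = pow(pow(2, 744, p), p - 2, p)          # 2^-744 mod p, precomputable
        return sign * M[0][1] * inv2_744 % p

    # example: assert invert25519(12345) * 12345 % p == 1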

12.3. Other CPUs. Our advantage is larger on platforms with smaller multipliers. For example, we have written software to invert modulo 2^255 − 19 on an ARM Cortex-A7, obtaining a median cycle count of 35277 cycles. The best previous Cortex-A7 result we have found in the literature is 62648 cycles, reported by Fujii and Aranha [34], using Fermat's method. Internally, our software does repeated calls to jump32divsteps2, units


of 30 iterations, with the resulting transition matrix fitting inside 32-bit general-purpose registers.

12.4. Other primes. Nath and Sarkar present inversion speeds for 20 different special primes, 12 of which were considered in previous work. For concreteness we consider inversion modulo the double-size prime 2^511 − 187. Aranha–Barreto–Pereira–Ricardini [8] selected this prime for their "M-511" curve.

Nath and Sarkar report that their Fermat inversion software for 2^511 − 187 takes 72804 Haswell cycles, 47062 Skylake cycles, or 45014 Kaby Lake cycles. The slowdown from 2^255 − 19, more than 5× in each case, is easy to explain: the exponent is twice as long, producing almost twice as many multiplications; each multiplication handles twice as many bits; and multiplication cost per bit is not constant.

Our 511-bit software is solidly faster: under 30000 cycles on all of these platforms. Most of the cycles in our computation are in evaluating jump64divsteps2, and the number of calls to jump64divsteps2 scales linearly with the number of bits in the input. There is some overhead for multiplications, so a modulus of twice the size takes more than twice as long, but our scalability to large moduli is much better than the Fermat scalability.

Our advantage is also larger in applications that use "random" primes rather than special primes: we pay the extra cost for Montgomery multiplication only at the end of our computation, whereas Fermat inversion pays the extra cost in each step.

References[1] mdash (no editor) Actes du congregraves international des matheacutematiciens tome 3 Gauthier-

Villars Eacutediteur Paris 1971 See [41]

[2] mdash (no editor) CWIT 2011 12th Canadian workshop on information theoryKelowna British Columbia May 17ndash20 2011 Institute of Electrical and ElectronicsEngineers 2011 ISBN 978-1-4577-0743-8 See [51]

[3] Carlisle Adams Jan Camenisch (editors) Selected areas in cryptographymdashSAC 201724th international conference Ottawa ON Canada August 16ndash18 2017 revisedselected papers Lecture Notes in Computer Science 10719 Springer 2018 ISBN978-3-319-72564-2 See [15]

[4] Carlos Aguilar Melchor Nicolas Aragon Slim Bettaieb Loiumlc Bidoux OlivierBlazy Jean-Christophe Deneuville Philippe Gaborit Edoardo Persichetti GillesZeacutemor Hamming Quasi-Cyclic (HQC) (2017) URL httpspqc-hqcorgdochqc-specification_2017-11-30pdf Citations in this document sect1

[5] Alfred V Aho (chairman) Proceedings of fifth annual ACM symposium on theory ofcomputing Austin Texas April 30ndashMay 2 1973 ACM New York 1973 See [53]

[6] Martin Albrecht Carlos Cid Kenneth G Paterson CJ Tjhai Martin TomlinsonNTS-KEM ldquoThe main document submitted to NISTrdquo (2017) URL httpsnts-kemio Citations in this document sect1

[7] Alejandro Cabrera Aldaya Cesar Pereida Garciacutea Luis Manuel Alvarez Tapia BillyBob Brumley Cache-timing attacks on RSA key generation (2018) URL httpseprintiacrorg2018367 Citations in this document sect1

[8] Diego F Aranha Paulo S L M Barreto Geovandro C C F Pereira JeffersonE Ricardini A note on high-security general-purpose elliptic curves (2013) URLhttpseprintiacrorg2013647 Citations in this document sect124


[9] Elwyn R Berlekamp Algebraic coding theory McGraw-Hill 1968 Citations in thisdocument sect1

[10] Daniel J Bernstein Curve25519 new Diffie-Hellman speed records in PKC 2006 [70](2006) 207ndash228 URL httpscryptopapershtmlcurve25519 Citations inthis document sect1 sect12

[11] Daniel J Bernstein Fast multiplication and its applications in [27] (2008) 325ndash384 URL httpscryptopapershtmlmultapps Citations in this documentsect55

[12] Daniel J Bernstein Tung Chou Tanja Lange Ingo von Maurich Rafael MisoczkiRuben Niederhagen Edoardo Persichetti Christiane Peters Peter Schwabe NicolasSendrier Jakub Szefer Wen Wang Classic McEliece conservative code-based cryp-tography ldquoSupporting Documentationrdquo (2017) URL httpsclassicmcelieceorgnisthtml Citations in this document sect1

[13] Daniel J Bernstein Tung Chou Peter Schwabe McBits fast constant-time code-based cryptography in CHES 2013 [18] (2013) 250ndash272 URL httpsbinarycryptomcbitshtml Citations in this document sect1 sect1

[14] Daniel J Bernstein Chitchanok Chuengsatiansup Tanja Lange Christine vanVredendaal NTRU Prime reducing attack surface at low cost full version of[15] (2017) URL httpsntruprimecryptopapershtml Citations in thisdocument sect1 sect1 sect11 sect73 sect73 sect74

[15] Daniel J Bernstein Chitchanok Chuengsatiansup Tanja Lange Christine vanVredendaal NTRU Prime reducing attack surface at low cost in SAC 2017 [3]abbreviated version of [14] (2018) 235ndash260

[16] Daniel J Bernstein Niels Duif Tanja Lange Peter Schwabe Bo-Yin Yang High-speed high-security signatures Journal of Cryptographic Engineering 2 (2012)77ndash89 see also older version at CHES 2011 URL httpsed25519cryptoed25519-20110926pdf Citations in this document sect1

[17] Daniel J Bernstein Peter Schwabe NEON crypto in CHES 2012 [56] (2012) 320ndash339URL httpscryptopapershtmlneoncrypto Citations in this documentsect1 sect1

[18] Guido Bertoni Jean-Seacutebastien Coron (editors) Cryptographic hardware and embeddedsystemsmdashCHES 2013mdash15th international workshop Santa Barbara CA USAAugust 20ndash23 2013 proceedings Lecture Notes in Computer Science 8086 Springer2013 ISBN 978-3-642-40348-4 See [13]

[19] Adam W Bojanczyk Richard P Brent A systolic algorithm for extended GCDcomputation Computers amp Mathematics with Applications 14 (1987) 233ndash238ISSN 0898-1221 URL httpsmaths-peopleanueduau~brentpdrpb096ipdf Citations in this document sect13 sect13

[20] Tomas J Boothby and Robert W Bradshaw Bitslicing and the method of fourRussians over larger finite fields (2009) URL httparxivorgabs09011413Citations in this document sect71 sect72

[21] Joppe W Bos Constant time modular inversion Journal of Cryptographic Engi-neering 4 (2014) 275ndash281 URL httpjoppeboscomfilesCTInversionpdfCitations in this document sect1 sect1 sect1 sect1 sect1 sect13


[22] Richard P Brent Fred G Gustavson David Y Y Yun Fast solution of Toeplitzsystems of equations and computation of Padeacute approximants Journal of Algorithms1 (1980) 259ndash295 ISSN 0196-6774 URL httpsmaths-peopleanueduau~brentpdrpb059ipdf Citations in this document sect14 sect14

[23] Richard P Brent Hsiang-Tsung Kung Systolic VLSI arrays for polynomial GCDcomputation IEEE Transactions on Computers C-33 (1984) 731ndash736 URLhttpsmaths-peopleanueduau~brentpdrpb073ipdf Citations in thisdocument sect13 sect13 sect32

[24] Richard P Brent Hsiang-Tsung Kung A systolic VLSI array for integer GCDcomputation ARITH-7 (1985) 118ndash125 URL httpsmaths-peopleanueduau~brentpdrpb077ipdf Citations in this document sect13 sect13 sect13 sect13 sect14sect86 sect86 sect86

[25] Richard P Brent Paul Zimmermann An O(M(n) logn) algorithm for the Jacobisymbol in ANTS 2010 [37] (2010) 83ndash95 URL httpsarxivorgpdf10042091pdf Citations in this document sect14

[26] Duncan A Buell (editor) Algorithmic number theory 6th international symposiumANTS-VI Burlington VT USA June 13ndash18 2004 proceedings Lecture Notes inComputer Science 3076 Springer 2004 ISBN 3-540-22156-5 See [62]

[27] Joe P Buhler Peter Stevenhagen (editors) Surveys in algorithmic number theoryMathematical Sciences Research Institute Publications 44 Cambridge UniversityPress New York 2008 See [11]

[28] Wouter Castryck Tanja Lange Chloe Martindale Lorenz Panny Joost RenesCSIDH an efficient post-quantum commutative group action in Asiacrypt 2018[55] (2018) 395ndash427 URL httpscsidhisogenyorgcsidh-20181118pdfCitations in this document sect1

[29] Cong Chen Jeffrey Hoffstein William Whyte Zhenfei Zhang NIST PQ SubmissionNTRUEncrypt A lattice based encryption algorithm (2017) URL httpscsrcnistgovProjectsPost-Quantum-CryptographyRound-1-Submissions Cita-tions in this document sect73

[30] Tung Chou McBits revisited in CHES 2017 [33] (2017) 213ndash231 URL httpstungchougithubiomcbits Citations in this document sect1

[31] Benoicirct Daireaux Veacuteronique Maume-Deschamps Brigitte Valleacutee The Lyapunovtortoise and the dyadic hare in [49] (2005) 71ndash94 URL httpemisamsorgjournalsDMTCSpdfpapersdmAD0108pdf Citations in this document sectE5sectF2

[32] Jean Louis Dornstetter On the equivalence between Berlekamprsquos and Euclidrsquos algo-rithms IEEE Transactions on Information Theory 33 (1987) 428ndash431 Citations inthis document sect1

[33] Wieland Fischer Naofumi Homma (editors) Cryptographic hardware and embeddedsystemsmdashCHES 2017mdash19th international conference Taipei Taiwan September25ndash28 2017 proceedings Lecture Notes in Computer Science 10529 Springer 2017ISBN 978-3-319-66786-7 See [30] [39]

[34] Hayato Fujii Diego F Aranha Curve25519 for the Cortex-M4 and beyond LATIN-CRYPT 2017 to appear (2017) URL httpswwwlascaicunicampbrmediapublicationspaper39pdf Citations in this document sect11 sect123


[35] Philippe Gaborit Carlos Aguilar Melchor Cryptographic method for communicatingconfidential information patent EP2537284B1 (2010) URL httpspatentsgooglecompatentEP2537284B1en Citations in this document sect74

[36] Mariya Georgieva Freacutedeacuteric de Portzamparc Toward secure implementation ofMcEliece decryption in COSADE 2015 [48] (2015) 141ndash156 URL httpseprintiacrorg2015271 Citations in this document sect1

[37] Guillaume Hanrot Franccedilois Morain Emmanuel Thomeacute (editors) Algorithmic numbertheory 9th international symposium ANTS-IX Nancy France July 19ndash23 2010proceedings Lecture Notes in Computer Science 6197 2010 ISBN 978-3-642-14517-9See [25]

[38] Agnes E Heydtmann Joslashrn M Jensen On the equivalence of the Berlekamp-Masseyand the Euclidean algorithms for decoding IEEE Transactions on Information Theory46 (2000) 2614ndash2624 Citations in this document sect1

[39] Andreas Huumllsing Joost Rijneveld John M Schanck Peter Schwabe High-speedkey encapsulation from NTRU in CHES 2017 [33] (2017) 232ndash252 URL httpseprintiacrorg2017667 Citations in this document sect1 sect11 sect11 sect13 sect7sect7 sect72 sect72 sect72 sect72 sect72 sect72 sect72 sect72 sect73 sect73 sect73 sect73 sect73

[40] Burton S Kaliski The Montgomery inverse and its applications IEEE Transactionson Computers 44 (1995) 1064ndash1065 Citations in this document sect1

[41] Donald E Knuth The analysis of algorithms in [1] (1971) 269ndash274 Citations inthis document sect1 sect14 sect14

[42] Adam Langley CECPQ2 (2018) URL httpswwwimperialvioletorg20181212cecpq2html Citations in this document sect74

[43] Adam Langley Add initial HRSS support (2018)URL httpsboringsslgooglesourcecomboringssl+7b935937b18215294e7dbf6404742855e3349092cryptohrsshrssc Citationsin this document sect72

[44] Derrick H Lehmer Euclidrsquos algorithm for large numbers American MathematicalMonthly 45 (1938) 227ndash233 ISSN 0002-9890 URL httplinksjstororgsicisici=0002-9890(193804)454lt227EAFLNgt20CO2-Y Citations in thisdocument sect1 sect14 sect121

[45] Xianhui Lu Yamin Liu Dingding Jia Haiyang Xue Jingnan He Zhenfei ZhangLAC lattice-based cryptosystems (2017) URL httpscsrcnistgovProjectsPost-Quantum-CryptographyRound-1-Submissions Citations in this documentsect1

[46] Vadim Lyubashevsky Chris Peikert Oded Regev On ideal lattices and learningwith errors over rings Journal of the ACM 60 (2013) Article 43 35 pages URLhttpseprintiacrorg2012230 Citations in this document sect74

[47] Vadim Lyubashevsky Gregor Seiler NTTRU truly fast NTRU using NTT (2019)URL httpseprintiacrorg2019040 Citations in this document sect74

[48] Stefan Mangard Axel Y Poschmann (editors) Constructive side-channel analysisand secure designmdash6th international workshop COSADE 2015 Berlin GermanyApril 13ndash14 2015 revised selected papers Lecture Notes in Computer Science 9064Springer 2015 ISBN 978-3-319-21475-7 See [36]


[49] Conrado Martiacutenez (editor) 2005 international conference on analysis of algorithmspapers from the conference (AofArsquo05) held in Barcelona June 6ndash10 2005 TheAssociation Discrete Mathematics amp Theoretical Computer Science (DMTCS)Nancy 2005 URL httpwwwemisamsorgjournalsDMTCSproceedingsdmAD01indhtml See [31]

[50] James Massey Shift-register synthesis and BCH decoding IEEE Transactions onInformation Theory 15 (1969) 122ndash127 ISSN 0018-9448 Citations in this documentsect1

[51] Todd D Mateer On the equivalence of the Berlekamp-Massey and the euclideanalgorithms for algebraic decoding in CWIT 2011 [2] (2011) 139ndash142 Citations inthis document sect1

[52] Niels Moumlller On Schoumlnhagersquos algorithm and subquadratic integer GCD computationMathematics of Computation 77 (2008) 589ndash607 URL httpswwwlysatorliuse~nissearchiveS0025-5718-07-02017-0pdf Citations in this documentsect14

[53] Robert T Moenck Fast computation of GCDs in STOC 1973 [5] (1973) 142ndash151Citations in this document sect14 sect14 sect14 sect14

[54] Kaushik Nath Palash Sarkar Efficient inversion in (pseudo-)Mersenne primeorder fields (2018) URL httpseprintiacrorg2018985 Citations in thisdocument sect11 sect12 sect12

[55] Thomas Peyrin Steven D Galbraith Advances in cryptologymdashASIACRYPT 2018mdash24th international conference on the theory and application of cryptology and infor-mation security Brisbane QLD Australia December 2ndash6 2018 proceedings part ILecture Notes in Computer Science 11272 Springer 2018 ISBN 978-3-030-03325-5See [28]

[56] Emmanuel Prouff Patrick Schaumont (editors) Cryptographic hardware and embed-ded systemsmdashCHES 2012mdash14th international workshop Leuven Belgium September9ndash12 2012 proceedings Lecture Notes in Computer Science 7428 Springer 2012ISBN 978-3-642-33026-1 See [17]

[57] Martin Roetteler Michael Naehrig Krysta M Svore Kristin E Lauter Quantumresource estimates for computing elliptic curve discrete logarithms in ASIACRYPT2017 [67] (2017) 241ndash270 URL httpseprintiacrorg2017598 Citationsin this document sect11 sect13

[58] The Sage Developers (editor) SageMath the Sage Mathematics Software System(Version 80) 2017 URL httpswwwsagemathorg Citations in this documentsect12 sect12 sect22 sect5

[59] John Schanck Announcement of NTRU-HRSS-KEM and NTRUEncrypt merger(2018) URL httpsgroupsgooglecomalistnistgovdmsgpqc-forumSrFO_vK3xbImSmjY0HZCgAJ Citations in this document sect73

[60] Arnold Schoumlnhage Schnelle Berechnung von Kettenbruchentwicklungen Acta Infor-matica 1 (1971) 139ndash144 ISSN 0001-5903 Citations in this document sect1 sect14sect14

[61] Joseph H Silverman Almost inverses and fast NTRU key creation NTRU Cryptosys-tems (Technical Note014) (1999) URL httpsassetssecurityinnovationcomstaticdownloadsNTRUresourcesNTRUTech014pdf Citations in this doc-ument sect73 sect73


[62] Damien Stehleacute Paul Zimmermann A binary recursive gcd algorithm in ANTS 2004[26] (2004) 411ndash425 URL httpspersoens-lyonfrdamienstehleBINARYhtml Citations in this document sect14 sect14 sect85 sectE sectE6 sectE6 sectE6 sectF2 sectF2sectF2 sectH sectH

[63] Josef Stein Computational problems associated with Racah algebra Journal ofComputational Physics 1 (1967) 397ndash405 Citations in this document sect1

[64] Simon Stevin Lrsquoarithmeacutetique Imprimerie de Christophle Plantin 1585 URLhttpwwwdwcknawnlpubbronnenSimon_Stevin-[II_B]_The_Principal_Works_of_Simon_Stevin_Mathematicspdf Citations in this document sect1

[65] Volker Strassen The computational complexity of continued fractions in SYM-SAC1981 [69] (1981) 51ndash67 see also newer version [66]

[66] Volker Strassen The computational complexity of continued fractions SIAM Journalon Computing 12 (1983) 1ndash27 see also older version [65] ISSN 0097-5397 Citationsin this document sect14 sect14 sect53 sect55

[67] Tsuyoshi Takagi Thomas Peyrin (editors) Advances in cryptologymdashASIACRYPT2017mdash23rd international conference on the theory and applications of cryptology andinformation security Hong Kong China December 3ndash7 2017 proceedings part IILecture Notes in Computer Science 10625 Springer 2017 ISBN 978-3-319-70696-2See [57]

[68] Emmanuel Thomeacute Subquadratic computation of vector generating polynomials andimprovement of the block Wiedemann algorithm Journal of Symbolic Computation33 (2002) 757ndash775 URL httpshalinriafrinria-00103417document Ci-tations in this document sect14

[69] Paul S Wang (editor) SYM-SAC rsquo81 proceedings of the 1981 ACM Symposium onSymbolic and Algebraic Computation Snowbird Utah August 5ndash7 1981 Associationfor Computing Machinery New York 1981 ISBN 0-89791-047-8 See [65]

[70] Moti Yung Yevgeniy Dodis Aggelos Kiayias Tal Malkin (editors) Public keycryptographymdash9th international conference on theory and practice in public-keycryptography New York NY USA April 24ndash26 2006 proceedings Lecture Notesin Computer Science 3958 Springer 2006 ISBN 978-3-540-33851-2 See [10]

A Proof of main gcd theorem for polynomials

We give two separate proofs of the facts stated earlier relating gcd and modular reciprocal to a particular iterate of divstep in the polynomial case:

• The second proof consists of a review of the Euclid–Stevin gcd algorithm (Appendix B), a description of exactly how the intermediate results in the Euclid–Stevin algorithm relate to iterates of divstep (Appendix C), and finally a translation of gcd/reciprocal results from the Euclid–Stevin context to the divstep context (Appendix D).

• The first proof, given in this appendix, is shorter: it works directly with divstep, without a detour through the Euclid–Stevin labeling of intermediate results.


Theorem A.1. Let k be a field. Let d be a positive integer. Let R_0, R_1 be elements of the polynomial ring k[x] with deg R_0 = d > deg R_1. Define f = x^d R_0(1/x), g = x^{d−1}R_1(1/x), and (δ_n, f_n, g_n) = divstep^n(1, f, g) for n ≥ 0. Then

2 deg f_n ≤ 2d − 1 − n + δ_n for n ≥ 0,
2 deg g_n ≤ 2d − 1 − n − δ_n for n ≥ 0.

Proof. Induct on n. If n = 0 then (δ_n, f_n, g_n) = (1, f, g), so 2 deg f_n = 2 deg f ≤ 2d = 2d − 1 − n + δ_n and 2 deg g_n = 2 deg g ≤ 2(d − 1) = 2d − 1 − n − δ_n.

Assume from now on that n ≥ 1. By the inductive hypothesis, 2 deg f_{n−1} ≤ 2d − n + δ_{n−1} and 2 deg g_{n−1} ≤ 2d − n − δ_{n−1}.

Case 1: δ_{n−1} ≤ 0. Then

(δ_n, f_n, g_n) = (1 + δ_{n−1}, f_{n−1}, (f_{n−1}(0)g_{n−1} − g_{n−1}(0)f_{n−1})/x).

Both f_{n−1} and g_{n−1} have degree at most (2d − n − δ_{n−1})/2, so the polynomial xg_n = f_{n−1}(0)g_{n−1} − g_{n−1}(0)f_{n−1} also has degree at most (2d − n − δ_{n−1})/2, so 2 deg g_n ≤ 2d − n − δ_{n−1} − 2 = 2d − 1 − n − δ_n. Also 2 deg f_n = 2 deg f_{n−1} ≤ 2d − 1 − n + δ_n.

Case 2: δ_{n−1} > 0 and g_{n−1}(0) = 0. Then

(δ_n, f_n, g_n) = (1 + δ_{n−1}, f_{n−1}, f_{n−1}(0)g_{n−1}/x),

so 2 deg g_n = 2 deg g_{n−1} − 2 ≤ 2d − n − δ_{n−1} − 2 = 2d − 1 − n − δ_n. As before, 2 deg f_n = 2 deg f_{n−1} ≤ 2d − 1 − n + δ_n.

Case 3: δ_{n−1} > 0 and g_{n−1}(0) ≠ 0. Then

(δ_n, f_n, g_n) = (1 − δ_{n−1}, g_{n−1}, (g_{n−1}(0)f_{n−1} − f_{n−1}(0)g_{n−1})/x).

This time the polynomial xg_n = g_{n−1}(0)f_{n−1} − f_{n−1}(0)g_{n−1} also has degree at most (2d − n + δ_{n−1})/2, so 2 deg g_n ≤ 2d − n + δ_{n−1} − 2 = 2d − 1 − n − δ_n. Also 2 deg f_n = 2 deg g_{n−1} ≤ 2d − n − δ_{n−1} = 2d − 1 − n + δ_n.

Theorem A.2. In the situation of Theorem A.1, define T_n = T(δ_n, f_n, g_n) and (u_n v_n; q_n r_n) = T_{n−1} ··· T_0. Then x^n(u_n r_n − v_n q_n) ∈ k^* for n ≥ 0; x^{n−1}u_n ∈ k[x] for n ≥ 1; x^{n−1}v_n, x^n q_n, x^n r_n ∈ k[x] for n ≥ 0; and

2 deg(x^{n−1}u_n) ≤ n − 3 + δ_n for n ≥ 1,
2 deg(x^{n−1}v_n) ≤ n − 1 + δ_n for n ≥ 0,
2 deg(x^n q_n) ≤ n − 1 − δ_n for n ≥ 0,
2 deg(x^n r_n) ≤ n + 1 − δ_n for n ≥ 0.

Proof. Each matrix T_n has determinant in (1/x)k^* by construction, so the product of n such matrices has determinant in (1/x^n)k^*. In particular, u_n r_n − v_n q_n ∈ (1/x^n)k^*.

Induct on n. For n = 0 we have δ_n = 1, u_n = 1, v_n = 0, q_n = 0, r_n = 1, so x^{n−1}v_n = 0 ∈ k[x], x^n q_n = 0 ∈ k[x], x^n r_n = 1 ∈ k[x], 2 deg(x^{n−1}v_n) = −∞ ≤ n − 1 + δ_n, 2 deg(x^n q_n) = −∞ ≤ n − 1 − δ_n, and 2 deg(x^n r_n) = 0 ≤ n + 1 − δ_n.

Assume from now on that n ≥ 1. By Theorem 4.3, u_n, v_n ∈ (1/x^{n−1})k[x] and q_n, r_n ∈ (1/x^n)k[x], so all that remains is to prove the degree bounds.

Case 1: δ_{n−1} ≤ 0. Then δ_n = 1 + δ_{n−1}, u_n = u_{n−1}, v_n = v_{n−1}, q_n = (f_{n−1}(0)q_{n−1} − g_{n−1}(0)u_{n−1})/x, and r_n = (f_{n−1}(0)r_{n−1} − g_{n−1}(0)v_{n−1})/x.

Note that n > 1 since δ_0 > 0. By the inductive hypothesis, 2 deg(x^{n−2}u_{n−1}) ≤ n − 4 + δ_{n−1}, so 2 deg(x^{n−1}u_n) ≤ n − 2 + δ_{n−1} = n − 3 + δ_n. Also 2 deg(x^{n−2}v_{n−1}) ≤ n − 2 + δ_{n−1}, so 2 deg(x^{n−1}v_n) ≤ n + δ_{n−1} = n − 1 + δ_n.

For q_n, note that both x^{n−1}q_{n−1} and x^{n−1}u_{n−1} have degree at most (n − 2 − δ_{n−1})/2, so the same is true for their k-linear combination x^n q_n. Hence 2 deg(x^n q_n) ≤ n − 2 − δ_{n−1} = n − 1 − δ_n. Similarly 2 deg(x^n r_n) ≤ n − δ_{n−1} = n + 1 − δ_n.

Case 2: δ_{n−1} > 0 and g_{n−1}(0) = 0. Then δ_n = 1 + δ_{n−1}, u_n = u_{n−1}, v_n = v_{n−1}, q_n = f_{n−1}(0)q_{n−1}/x, and r_n = f_{n−1}(0)r_{n−1}/x.

If n = 1 then u_n = u_0 = 1, so 2 deg(x^{n−1}u_n) = 0 = n − 3 + δ_n. Otherwise 2 deg(x^{n−2}u_{n−1}) ≤ n − 4 + δ_{n−1} by the inductive hypothesis, so 2 deg(x^{n−1}u_n) ≤ n − 3 + δ_n as above.

Also 2 deg(x^{n−1}v_n) ≤ n − 1 + δ_n as above; 2 deg(x^n q_n) = 2 deg(x^{n−1}q_{n−1}) ≤ n − 2 − δ_{n−1} = n − 1 − δ_n; and 2 deg(x^n r_n) = 2 deg(x^{n−1}r_{n−1}) ≤ n − δ_{n−1} = n + 1 − δ_n.

Case 3: δ_{n−1} > 0 and g_{n−1}(0) ≠ 0. Then δ_n = 1 − δ_{n−1}, u_n = q_{n−1}, v_n = r_{n−1}, q_n = (g_{n−1}(0)u_{n−1} − f_{n−1}(0)q_{n−1})/x, and r_n = (g_{n−1}(0)v_{n−1} − f_{n−1}(0)r_{n−1})/x.

If n = 1 then u_n = q_0 = 0, so 2 deg(x^{n−1}u_n) = −∞ ≤ n − 3 + δ_n. Otherwise 2 deg(x^{n−1}q_{n−1}) ≤ n − 2 − δ_{n−1} by the inductive hypothesis, so 2 deg(x^{n−1}u_n) ≤ n − 2 − δ_{n−1} = n − 3 + δ_n. Similarly 2 deg(x^{n−1}r_{n−1}) ≤ n − δ_{n−1} by the inductive hypothesis, so 2 deg(x^{n−1}v_n) ≤ n − δ_{n−1} = n − 1 + δ_n.

Finally, for q_n, note that both x^{n−1}u_{n−1} and x^{n−1}q_{n−1} have degree at most (n − 2 + δ_{n−1})/2, so the same is true for their k-linear combination x^n q_n, so 2 deg(x^n q_n) ≤ n − 2 + δ_{n−1} = n − 1 − δ_n. Similarly 2 deg(x^n r_n) ≤ n + δ_{n−1} = n + 1 − δ_n.

First proof of Theorem 62 Write ϕ = f2dminus1 By Theorem A1 2 degϕ le δ2dminus1 and2 deg g2dminus1 le minusδ2dminus1 Each fn is nonzero by construction so degϕ ge 0 so δ2dminus1 ge 0

By Theorem A2 x2dminus2u2dminus1 is a polynomial of degree at most dminus 2 + δ2dminus12 DefineC0 = xminusd+δ2dminus12u2dminus1(1x) then C0 is a polynomial of degree at most dminus 2 + δ2dminus12Similarly define C1 = xminusd+1+δ2dminus12v2dminus1(1x) and observe that C1 is a polynomial ofdegree at most dminus 1 + δ2dminus12

Multiply C0 by R0 = xdf(1x) multiply C1 by R1 = xdminus1g(1x) and add to obtain

C0R0 + C1R1 = xδ2dminus12(u2dminus1(1x)f(1x) + v2dminus1(1x)g(1x))= xδ2dminus12ϕ(1x)

by Theorem 42Define Γ as the monic polynomial xδ2dminus12ϕ(1x)ϕ(0) of degree δ2dminus12 We have

just shown that Γ isin R0k[x] + R1k[x] specifically Γ = (C0ϕ(0))R0 + (C1ϕ(0))R1 Tofinish the proof we will show that R0 isin Γk[x] This forces Γ = gcdR0 R1 = G which inturn implies degG = δ2dminus12 and V = C1ϕ(0)

Observe that δ2d = 1 + δ2dminus1 and f2d = ϕ the ldquoswappedrdquo case is impossible here sinceδ2dminus1 gt 0 implies g2dminus1 = 0 By Theorem A1 again 2 deg g2d le minus1minus δ2d = minus2minus δ2dminus1so g2d = 0

By Theorem 42 ϕ = f2d = u2df + v2dg and 0 = g2d = q2df + r2dg so x2dr2dϕ = ∆fwhere ∆ = x2d(u2dr2dminus v2dq2d) By Theorem A2 ∆ isin klowast and p = x2dr2d is a polynomialof degree at most (2d+ 1minus δ2d)2 = dminus δ2dminus12 The polynomials xdminusδ2dminus12p(1x) andxδ2dminus12ϕ(1x) = Γ have product ∆xdf(1x) = ∆R0 so Γ divides R0 as claimed

B Review of the EuclidndashStevin algorithmThis appendix reviews the EuclidndashStevin algorithm to compute polynomial gcd includingthe extended version of the algorithm that writes each remainder as a linear combinationof the two inputs

38 Fast constant-time gcd computation and modular inversion

Theorem B1 (the EuclidndashStevin algorithm) Let k be a field Let R0 R1 be ele-ments of the polynomial ring k[x] Recursively define a finite sequence of polynomialsR2 R3 Rr isin k[x] as follows if i ge 1 and Ri = 0 then r = i if i ge 1 and Ri 6= 0 thenRi+1 = Riminus1 mod Ri Define di = degRi and `i = Ri[di] Then

bull d1 gt d2 gt middot middot middot gt dr = minusinfin

bull `i 6= 0 if 1 le i le r minus 1

bull if `rminus1 6= 0 then gcdR0 R1 = Rrminus1`rminus1 and

bull if `rminus1 = 0 then R0 = R1 = gcdR0 R1 = 0 and r = 1

Proof If 1 le i le r minus 1 then Ri 6= 0 so `i = Ri[degRi] 6= 0 Also Ri+1 = Riminus1 mod Ri sodegRi+1 lt degRi ie di+1 lt di Hence d1 gt d2 gt middot middot middot gt dr and dr = degRr = deg 0 =minusinfin

If `rminus1 = 0 then r must be 1 so R1 = 0 (since Rr = 0) and R0 = 0 (since `0 = 0) sogcdR0 R1 = 0 Assume from now on that `rminus1 6= 0 The divisibility of Ri+1 minus Riminus1by Ri implies gcdRiminus1 Ri = gcdRi Ri+1 so gcdR0 R1 = gcdR1 R2 = middot middot middot =gcdRrminus1 Rr = gcdRrminus1 0 = Rrminus1`rminus1

Theorem B2 (the extended EuclidndashStevin algorithm) In the situation of Theorem B1define U0 V0 Ur Ur isin k[x] as follows U0 = 1 V0 = 0 U1 = 0 V1 = 1 Ui+1 = Uiminus1minusbRiminus1RicUi for i isin 1 r minus 1 Vi+1 = Viminus1 minus bRiminus1RicVi for i isin 1 r minus 1Then

bull Ri = UiR0 + ViR1 for i isin 0 r

bull UiVi+1 minus Ui+1Vi = (minus1)i for i isin 0 r minus 1

bull Vi+1Ri minus ViRi+1 = (minus1)iR0 for i isin 0 r minus 1 and

bull if d0 gt d1 then deg Vi+1 = d0 minus di for i isin 0 r minus 1

Proof U0R0 + V0R1 = 1R0 + 0R1 = R0 and U1R0 + V1R1 = 0R0 + 1R1 = R1 Fori isin 1 r minus 1 assume inductively that Riminus1 = Uiminus1R0+Viminus1R1 and Ri = UiR0+ViR1and writeQi = bRiminus1Ric Then Ri+1 = Riminus1 mod Ri = Riminus1minusQiRi = (Uiminus1minusQiUi)R0+(Viminus1 minusQiVi)R1 = Ui+1R0 + Vi+1R1

If i = 0 then UiVi+1minusUi+1Vi = 1 middot 1minus 0 middot 0 = 1 = (minus1)i For i isin 1 r minus 1 assumeinductively that Uiminus1Vi minus UiViminus1 = (minus1)iminus1 then UiVi+1 minus Ui+1Vi = Ui(Viminus1 minus QVi) minus(Uiminus1 minusQUi)Vi = minus(Uiminus1Vi minus UiViminus1) = (minus1)i

Multiply Ri = UiR0 + ViR1 by Vi+1 multiply Ri+1 = Ui+1R0 + Vi+1R1 by Ui+1 andsubtract to see that Vi+1Ri minus ViRi+1 = (minus1)iR0

Finally assume d0 gt d1 If i = 0 then deg Vi+1 = deg 1 = 0 = d0 minus di Fori isin 1 r minus 1 assume inductively that deg Vi = d0 minus diminus1 Then deg ViRi+1 lt d0since degRi+1 = di+1 lt diminus1 so deg Vi+1Ri = deg(ViRi+1 +(minus1)iR0) = d0 so deg Vi+1 =d0 minus di

C EuclidndashStevin results as iterates of divstepIf the EuclidndashStevin algorithm is written as a sequence of polynomial divisions andeach polynomial division is written as a sequence of coefficient-elimination steps theneach coefficient-elimination step corresponds to one application of divstep The exactcorrespondence is stated in Theorem C2 This generalizes the known BerlekampndashMasseyequivalence mentioned in Section 1

Theorem C1 is a warmup showing how a single polynomial division relates to asequence of division steps

Daniel J Bernstein and Bo-Yin Yang 39

Theorem C1 (polynomial division as a sequence of division steps) Let k be a fieldLet R0 R1 be elements of the polynomial ring k[x] Write d0 = degR0 and d1 = degR1Assume that d0 gt d1 ge 0 Then

divstep2d0minus2d1(1 xd0R0(1x) xd0minus1R1(1x))= (1 `d0minusd1minus1

0 xd1R1(1x) (`d0minusd1minus10 `1)d0minusd1+1xd1minus1R2(1x))

where `0 = R0[d0] `1 = R1[d1] and R2 = R0 mod R1

Proof Part 1 The first d0 minus d1 minus 1 iterations If δ isin 1 2 d0 minus d1 minus 1 thend0minusδ gt d1 so R1[d0minusδ] = 0 so the polynomial `δminus1

0 xd0minusδR1(1x) has constant coefficient0 The polynomial xd0R0(1x) has constant coefficient `0 Hence

divstep(δ xd0R0(1x) `δminus10 xd0minusδR1(1x)) = (1 + δ xd0R0(1x) `δ0xd0minusδminus1R1(1x))

by definition of divstep By induction

divstepd0minusd1minus1(1 xd0R0(1x) xd0minus1R1(1x))= (d0 minus d1 x

d0R0(1x) `d0minusd1minus10 xd1R1(1x))

= (d0 minus d1 xd0R0(1x) xd1Rprime1(1x))

where Rprime1 = `d0minusd1minus10 R1 Note that Rprime1[d1] = `prime1 where `prime1 = `d0minusd1minus1

0 `1

Part 2 The swap iteration and the last d0 minus d1 iterations Define

Zd0minusd1 = R0 mod xd0minusd1Rprime1Zd0minusd1+1 = R0 mod xd0minusd1minus1Rprime1

Z2d0minus2d1 = R0 mod Rprime1 = R0 mod R1 = R2

ThenZd0minusd1 = R0 minus (R0[d0]`prime1)xd0minusd1Rprime1

Zd0minusd1+1 = Zd0minusd1 minus (Zd0minusd1 [d0 minus 1]`prime1)xd0minusd1minus1Rprime1

Z2d0minus2d1 = Z2d0minus2d1minus1 minus (Z2d0minus2d1minus1[d1]`prime1)Rprime1

Substitute 1x for x to see that

xd0Zd0minusd1(1x) = xd0R0(1x)minus (R0[d0]`prime1)xd1Rprime1(1x)xd0minus1Zd0minusd1+1(1x) = xd0minus1Zd0minusd1(1x)minus (Zd0minusd1 [d0 minus 1]`prime1)xd1Rprime1(1x)

xd1Z2d0minus2d1(1x) = xd1Z2d0minus2d1minus1(1x)minus (Z2d0minus2d1minus1[d1]`prime1)xd1Rprime1(1x)

We will use these equations to show that the d0 minus d1 2d0 minus 2d1 iterates of divstepcompute `prime1xd0minus1Zd0minusd1 (`prime1)d0minusd1+1xd1minus1Z2d0minus2d1 respectively

The constant coefficients of xd0R0(1x) and xd1Rprime1(1x) are R0[d0] and `prime1 6= 0 respec-tively and `prime1xd0R0(1x)minusR0[d0]xd1Rprime1(1x) = `prime1x

d0Zd0minusd1(1x) so

divstep(d0 minus d1 xd0R0(1x) xd1Rprime1(1x))

= (1minus (d0 minus d1) xd1Rprime1(1x) `prime1xd0minus1Zd0minusd1(1x))

40 Fast constant-time gcd computation and modular inversion

The constant coefficients of xd1Rprime1(1x) and `prime1xd0minus1Zd0minusd1(1x) are `prime1 and `prime1Zd0minusd1 [d0minus1]respectively and

(`prime1)2xd0minus1Zd0minusd1(1x)minus `prime1Zd0minusd1 [d0 minus 1]xd1Rprime1(1x) = (`prime1)2xd0minus1Zd0minusd1+1(1x)

sodivstep(1minus (d0 minus d1) xd1Rprime1(1x) `prime1xd0minus1Zd0minusd1(1x))

= (2minus (d0 minus d1) xd1Rprime1(1x) (`prime1)2xd0minus2Zd0minusd1+1(1x))

Continue in this way to see that

divstepi(d0 minus d1 xd0R0(1x) xd1Rprime1(1x))

= (iminus (d0 minus d1) xd1Rprime1(1x) (`prime1)ixd0minusiZd0minusd1+iminus1(1x))

for 1 le i le d0 minus d1 + 1 In particular

divstep2d0minus2d1(1 xd0R0(1x) xd0minus1R1(1x))= divstepd0minusd1+1(d0 minus d1 x

d0R0(1x) xd1Rprime1(1x))= (1 xd1Rprime1(1x) (`prime1)d0minusd1+1xd1minus1R2(1x))= (1 `d0minusd1minus1

0 xd1R1(1x) (`d0minusd1minus10 `1)d0minusd1+1xd1minus1R2(1x))

as claimed

Theorem C2 In the situation of Theorem B2 assume that d0 gt d1 Define f =xd0R0(1x) and g = xd0minus1R1(1x) Then f g isin k[x] Define (δn fn gn) = divstepn(1 f g)and Tn = T (δn fn gn)

Part A If i isin 0 1 r minus 2 and 2d0 minus 2di le n lt 2d0 minus di minus di+1 then

δn = 1 + nminus (2d0 minus 2di) gt 0fn = anx

diRi(1x)gn = bnx

2d0minusdiminusnminus1Ri+1(1x)

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdiUi(1x) an(1x)d0minusdiminus1Vi(1x)bn(1x)nminus(d0minusdi)+1Ui+1(1x) bn(1x)nminus(d0minusdi)Vi+1(1x)

)for some an bn isin klowast

Part B If i isin 0 1 r minus 2 and 2d0 minus di minus di+1 le n lt 2d0 minus 2di+1 then

δn = 1 + nminus (2d0 minus 2di+1) le 0fn = anx

di+1Ri+1(1x)gn = bnx

2d0minusdi+1minusnminus1Zn(1x)

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdi+1Ui+1(1x) an(1x)d0minusdi+1minus1Vi+1(1x)bn(1x)nminus(d0minusdi+1)+1Xn(1x) bn(1x)nminus(d0minusdi+1)Yn(1x)

)for some an bn isin klowast where

Zn = Ri mod x2d0minus2di+1minusnRi+1

Xn = Ui minuslfloorRix

2d0minus2di+1minusnRi+1rfloorx2d0minus2di+1minusnUi+1

Yn = Vi minuslfloorRix

2d0minus2di+1minusnRi+1rfloorx2d0minus2di+1minusnVi+1

Daniel J Bernstein and Bo-Yin Yang 41

Part C If n ge 2d0 minus 2drminus1 then

δn = 1 + nminus (2d0 minus 2drminus1) gt 0fn = anx

drminus1Rrminus1(1x)gn = 0

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdrminus1Urminus1(1x) an(1x)d0minusdrminus1minus1Vrminus1(1x)bn(1x)nminus(d0minusdrminus1)+1Ur(1x) bn(1x)nminus(d0minusdrminus1)Vr(1x)

)for some an bn isin klowast

The proof also gives recursive formulas for an bn namely (a0 b0) = (1 1) forthe base case (an bn) = (bnminus1 gnminus1(0)anminus1) for the swapped case and (an bn) =(anminus1 fnminus1(0)bnminus1) for the unswapped case One can if desired augment divstep totrack an and bn these are the denominators eliminated in a ldquofraction-freerdquo computation

Proof By hypothesis degR0 = d0 gt d1 = degR1 Hence d0 is a nonnegative integer andf = R0[d0] +R0[d0 minus 1]x+ middot middot middot+R0[0]xd0 isin k[x] Similarly g = R1[d0 minus 1] +R1[d0 minus 2]x+middot middot middot+R1[0]xd0minus1 isin k[x] this is also true for d0 = 0 with R1 = 0 and g = 0

We induct on n and split the analysis of n into six cases There are three differentformulas (A B C) applicable to different ranges of n and within each range we considerthe first n separately For brevity we write Ri = Ri(1x) U i = Ui(1x) V i = Vi(1x)Xn = Xn(1x) Y n = Yn(1x) Zn = Zn(1x)

Case A1 n = 2d0 minus 2di for some i isin 0 1 r minus 2If n = 0 then i = 0 By assumption δn = 1 as claimed fn = f = xd0R0 so fn = anx

d0R0as claimed where an = 1 gn = g = xd0minus1R1 so gn = bnx

d0minus1R1 as claimed where bn = 1Also Ui = 1 Ui+1 = 0 Vi = 0 Vi+1 = 1 and d0 minus di = 0 so(

an(1x)d0minusdiU i an(1x)d0minusdiminus1V ibn(1x)d0minusdi+1U i+1 bn(1x)d0minusdiV i+1

)=(

1 00 1

)= Tnminus1 middot middot middot T0

as claimedOtherwise n gt 0 so i gt 0 Now 2d0minusdiminus1minusdi le nminus 1 lt 2d0minus 2di so by the inductive

hypothesis δnminus1 = 1 + (nminus 1)minus (2d0 minus 2di) = 0 fnminus1 = anminus1xdiRi for some anminus1 isin klowast

gnminus1 = bnminus1xdiZnminus1 for some bnminus1 isin klowast where Znminus1 = Riminus1 mod xRi and

Tnminus2 middot middot middot T0 =(

anminus1(1x)d0minusdiU i anminus1(1x)d0minusdiminus1V ibnminus1(1x)d0minusdiXnminus1 bnminus1(1x)d0minusdiminus1Y nminus1

)where Xn = Uiminus1 minus bRiminus1xRicxUi and Yn = Viminus1 minus bRiminus1xRicxVi

Note that degZnminus1 le di Observe that

Ri+1 = Riminus1 mod Ri = Znminus1 minus (Znminus1[di]`i)RiUi+1 = Xnminus1 minus (Znminus1[di]`i)UiVi+1 = Ynminus1 minus (Znminus1[di]`i)Vi

The point is that subtracting (Znminus1[di]`i)Ri from Znminus1 eliminates the coefficient of xdi

and produces a polynomial of degree at most diminus 1 namely Znminus1 mod Ri = Riminus1 mod RiStarting from Ri+1 = Znminus1 minus (Znminus1[di]`i)Ri substitute 1x for x and multiply by

anminus1bnminus1`ixdi to see that

anminus1bnminus1`ixdiRi+1 = anminus1`ignminus1 minus bnminus1Znminus1[di]fnminus1

= fnminus1(0)gnminus1 minus gnminus1(0)fnminus1

42 Fast constant-time gcd computation and modular inversion

Now(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1)

= (1 + δnminus1 fnminus1 (fnminus1(0)gnminus1 minus gnminus1(0)fnminus1)x)

Hence δn = 1 = 1 + nminus (2d0 minus 2di) gt 0 as claimed fn = anxdiRi where an = anminus1 isin klowast

as claimed and gn = bnxdiminus1Ri+1 where bn = anminus1bnminus1`i isin klowast as claimed

Finally U i+1 = Xnminus1 minus (Znminus1[di]`i)U i and V i+1 = Y nminus1 minus (Znminus1[di]`i)V i Substi-tute these into the matrix(

an(1x)d0minusdiU i an(1x)d0minusdiminus1V ibn(1x)d0minusdi+1U i+1 bn(1x)d0minusdiV i+1

)

along with an = anminus1 and bn = anminus1bnminus1`i to see that this matrix is

Tnminus1 =(

1 0minusbnminus1Znminus1[di]x anminus1`ix

)times Tnminus2 middot middot middot T0 shown above

Case A2 2d0 minus 2di lt n lt 2d0 minus di minus di+1 for some i isin 0 1 r minus 2By the inductive hypothesis δnminus1 = n minus (2d0 minus 2di) gt 0 fnminus1 = anminus1x

diRi whereanminus1 isin klowast gnminus1 = bnminus1x

2d0minusdiminusnRi+1 where bnminus1 isin klowast and

Tnminus2 middot middot middot T0 =(

anminus1(1x)d0minusdiU i anminus1(1x)d0minusdiminus1V ibnminus1(1x)nminus(d0minusdi)U i+1 bnminus1(1x)nminus(d0minusdi)minus1V i+1

)

Note that gnminus1(0) = bnminus1Ri+1[2d0 minus di minus n] = 0 since 2d0 minus di minus n gt di+1 = degRi+1Thus

(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1) = (1 + δnminus1 fnminus1 fnminus1(0)gnminus1x)

Hence δn = 1 + n minus (2d0 minus 2di) gt 0 as claimed fn = anxdiRi where an = anminus1 isin klowast

as claimed and gn = bnx2d0minusdiminusnminus1Ri+1 where bn = fnminus1(0)bnminus1 = anminus1bnminus1`i isin klowast as

claimedAlso Tnminus1 =

(1 00 anminus1`ix

) Multiply by Tnminus2 middot middot middot T0 above to see that

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdiU i an(1x)d0minusdiminus1V ibn(1x)nminus(d0minusdi)+1U i+1 bn`i(1x)nminus(d0minusdi)V i+1

)as claimed

Case B1 n = 2d0 minus di minus di+1 for some i isin 0 1 r minus 2Note that 2d0 minus 2di le nminus 1 lt 2d0 minus di minus di+1 By the inductive hypothesis δnminus1 =

diminusdi+1 gt 0 fnminus1 = anminus1xdiRi where anminus1 isin klowast gnminus1 = bnminus1x

di+1Ri+1 where bnminus1 isin klowastand

Tnminus2 middot middot middot T0 =(

anminus1(1x)d0minusdiU i anminus1(1x)d0minusdiminus1V ibnminus1(1x)d0minusdi+1U i+1 bnminus1`i(1x)d0minusdi+1minus1V i+1

)

Observe thatZn = Ri minus (`i`i+1)xdiminusdi+1Ri+1

Xn = Ui minus (`i`i+1)xdiminusdi+1Ui+1

Yn = Vi minus (`i`i+1)xdiminusdi+1Vi+1

Substitute 1x for x in the Zn equation and multiply by anminus1bnminus1`i+1xdi to see that

anminus1bnminus1`i+1xdiZn = bnminus1`i+1fnminus1 minus anminus1`ignminus1

= gnminus1(0)fnminus1 minus fnminus1(0)gnminus1

Daniel J Bernstein and Bo-Yin Yang 43

Also note that gnminus1(0) = bnminus1`i+1 6= 0 so

(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1)= (1minus δnminus1 gnminus1 (gnminus1(0)fnminus1 minus fnminus1(0)gnminus1)x)

Hence δn = 1 minus (di minus di+1) le 0 as claimed fn = anxdi+1Ri+1 where an = bnminus1 isin klowast as

claimed and gn = bnxdiminus1Zn where bn = anminus1bnminus1`i+1 isin klowast as claimed

Finally substitute Xn = U i minus (`i`i+1)(1x)diminusdi+1U i+1 and substitute Y n = V i minus(`i`i+1)(1x)diminusdi+1V i+1 into the matrix(

an(1x)d0minusdi+1U i+1 an(1x)d0minusdi+1minus1V i+1bn(1x)d0minusdi+1Xn bn(1x)d0minusdiY n

)

along with an = bnminus1 and bn = anminus1bnminus1`i+1 to see that this matrix is Tnminus1 =(0 1

bnminus1`i+1x minusanminus1`ix

)times Tnminus2 middot middot middot T0 shown above

Case B2 2d0 minus di minus di+1 lt n lt 2d0 minus 2di+1 for some i isin 0 1 r minus 2Now 2d0minusdiminusdi+1 le nminus1 lt 2d0minus2di+1 By the inductive hypothesis δnminus1 = nminus(2d0minus

2di+1) le 0 fnminus1 = anminus1xdi+1Ri+1 for some anminus1 isin klowast gnminus1 = bnminus1x

2d0minusdi+1minusnZnminus1 forsome bnminus1 isin klowast where Znminus1 = Ri mod x2d0minus2di+1minusn+1Ri+1 and

Tnminus2 middot middot middot T0 =(

anminus1(1x)d0minusdi+1U i+1 anminus1(1x)d0minusdi+1minus1V i+1bnminus1(1x)nminus(d0minusdi+1)Xnminus1 bnminus1(1x)nminus(d0minusdi+1)minus1Y nminus1

)where Xnminus1 = Ui minus

lfloorRix

2d0minus2di+1minusn+1Ri+1rfloorx2d0minus2di+1minusn+1Ui+1 and Ynminus1 = Vi minuslfloor

Rix2d0minus2di+1minusn+1Ri+1

rfloorx2d0minus2di+1minusn+1Vi+1

Observe that

Zn = Ri mod x2d0minus2di+1minusnRi+1

= Znminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnRi+1

Xn = Xnminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnUi+1

Yn = Ynminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnVi+1

The point is that degZnminus1 le 2d0 minus di+1 minus n and subtracting

(Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnRi+1

from Znminus1 eliminates the coefficient of x2d0minusdi+1minusnStarting from

Zn = Znminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnRi+1

substitute 1x for x and multiply by anminus1bnminus1`i+1x2d0minusdi+1minusn to see that

anminus1bnminus1`i+1x2d0minusdi+1minusnZn = anminus1`i+1gnminus1 minus bnminus1Znminus1[2d0 minus di+1 minus n]fnminus1

= fnminus1(0)gnminus1 minus gnminus1(0)fnminus1

Now(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1)

= (1 + δnminus1 fnminus1 (fnminus1(0)gnminus1 minus gnminus1(0)fnminus1)x)

Hence δn = 1 +nminus (2d0minus 2di+1) le 0 as claimed fn = anxdi+1Ri+1 where an = anminus1 isin klowast

as claimed and gn = bnx2d0minusdi+1minusnminus1Zn where bn = anminus1bnminus1`i+1 isin klowast as claimed

44 Fast constant-time gcd computation and modular inversion

Finally substitute

Xn = Xnminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)(1x)2d0minus2di+1minusnU i+1

andY n = Y nminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)(1x)2d0minus2di+1minusnV i+1

into the matrix (an(1x)d0minusdi+1U i+1 an(1x)d0minusdi+1minus1V i+1

bn(1x)nminus(d0minusdi+1)+1Xn bn(1x)nminus(d0minusdi+1)Y n

)

along with an = anminus1 and bn = anminus1bnminus1`i+1 to see that this matrix is Tnminus1 =(1 0

minusbnminus1Znminus1[2d0 minus di+1 minus n]x anminus1`i+1x

)times Tnminus2 middot middot middot T0 shown above

Case C1 n = 2d0 minus 2drminus1 This is essentially the same as Case A1 except that i isreplaced with r minus 1 and gn ends up as 0 For completeness we spell out the details

If n = 0 then r = 1 and R1 = 0 so g = 0 By assumption δn = 1 as claimedfn = f = xd0R0 so fn = anx

d0R0 as claimed where an = 1 gn = g = 0 as claimed Definebn = 1 Now Urminus1 = 1 Ur = 0 Vrminus1 = 0 Vr = 1 and d0 minus drminus1 = 0 so(

an(1x)d0minusdrminus1Urminus1 an(1x)d0minusdrminus1minus1V rminus1bn(1x)d0minusdrminus1+1Ur bn(1x)d0minusdrminus1V r

)=(

1 00 1

)= Tnminus1 middot middot middot T0

as claimedOtherwise n gt 0 so r ge 2 Now 2d0 minus drminus2 minus drminus1 le n minus 1 lt 2d0 minus 2drminus1 so

by the inductive hypothesis (for i = r minus 2) δnminus1 = 1 + (n minus 1) minus (2d0 minus 2drminus1) = 0fnminus1 = anminus1x

drminus1Rrminus1 for some anminus1 isin klowast gnminus1 = bnminus1xdrminus1Znminus1 for some bnminus1 isin klowast

where Znminus1 = Rrminus2 mod xRrminus1 and

Tnminus2 middot middot middot T0 =(anminus1(1x)d0minusdrminus1Urminus1 anminus1(1x)d0minusdrminus1minus1V rminus1bnminus1(1x)d0minusdrminus1Xnminus1 bnminus1(1x)d0minusdrminus1minus1Y nminus1

)where Xn = Urminus2 minus bRrminus2xRrminus1cxUrminus1 and Yn = Vrminus2 minus bRrminus2xRrminus1cxVrminus1

Observe that

0 = Rr = Rrminus2 mod Rrminus1 = Znminus1 minus (Znminus1[drminus1]`rminus1)Rrminus1

Ur = Xnminus1 minus (Znminus1[drminus1]`rminus1)Urminus1

Vr = Ynminus1 minus (Znminus1[drminus1]`rminus1)Vrminus1

Starting from Znminus1 minus (Znminus1[drminus1]`rminus1)Rrminus1 = 0 substitute 1x for x and multiplyby anminus1bnminus1`rminus1x

drminus1 to see that

anminus1`rminus1gnminus1 minus bnminus1Znminus1[drminus1]fnminus1 = 0

ie fnminus1(0)gnminus1 minus gnminus1(0)fnminus1 = 0 Now

(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1) = (1 + δnminus1 fnminus1 0)

Hence δn = 1 = 1 + n minus (2d0 minus 2drminus1) gt 0 as claimed fn = anxdrminus1Rrminus1 where

an = anminus1 isin klowast as claimed and gn = 0 as claimedDefine bn = anminus1bnminus1`rminus1 isin klowast and substitute Ur = Xnminus1 minus (Znminus1[drminus1]`rminus1)Urminus1

and V r = Y nminus1 minus (Znminus1[drminus1]`rminus1)V rminus1 into the matrix(an(1x)d0minusdrminus1Urminus1(1x) an(1x)d0minusdrminus1minus1Vrminus1(1x)bn(1x)d0minusdrminus1+1Ur(1x) bn(1x)d0minusdrminus1Vr(1x)

)

Daniel J Bernstein and Bo-Yin Yang 45

to see that this matrix is Tnminus1 =(

1 0minusbnminus1Znminus1[drminus1]x anminus1`rminus1x

)times Tnminus2 middot middot middot T0

shown above

Case C2 n gt 2d0 minus 2drminus1By the inductive hypothesis δnminus1 = n minus (2d0 minus 2drminus1) fnminus1 = anminus1x

drminus1Rrminus1 forsome anminus1 isin klowast gnminus1 = 0 and

Tnminus2 middot middot middot T0 =(anminus1(1x)d0minusdrminus1Urminus1(1x) anminus1(1x)d0minusdrminus1minus1Vrminus1(1x)bnminus1(1x)nminus(d0minusdrminus1)Ur(1x) bnminus1(1x)nminus(d0minusdrminus1)minus1Vr(1x)

)for some bnminus1 isin klowast Hence δn = 1 + δn = 1 + n minus (2d0 minus 2drminus1) gt 0 fn = fnminus1 =

anxdrminus1Rrminus1 where an = anminus1 and gn = 0 Also Tnminus1 =

(1 00 anminus1`rminus1x

)so

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdrminus1Urminus1(1x) an(1x)d0minusdrminus1minus1Vrminus1(1x)bn(1x)nminus(d0minusdrminus1)+1Ur(1x) bn(1x)nminus(d0minusdrminus1)Vr(1x)

)where bn = anminus1bnminus1`rminus1 isin klowast

D Alternative proof of main gcd theorem for polynomialsThis appendix gives another proof of Theorem 62 using the relationship in Theorem C2between the EuclidndashStevin algorithm and iterates of divstep

Second proof of Theorem 62 Define R2 R3 Rr and di and `i as in Theorem B1 anddefine Ui and Vi as in Theorem B2 Then d0 = d gt d1 and all of the hypotheses ofTheorems B1 B2 and C2 are satisfied

By Theorem B1 G = gcdR0 R1 = Rrminus1`rminus1 (since R0 6= 0) so degG = drminus1Also by Theorem B2

Urminus1R0 + Vrminus1R1 = Rrminus1 = `rminus1G

so (Vrminus1`rminus1)R1 equiv G (mod R0) and deg Vrminus1 = d0 minus drminus2 lt d0 minus drminus1 = dminus degG soV = Vrminus1`rminus1

Case 1 drminus1 ge 1 Part C of Theorem C2 applies to n = 2dminus 1 since n = 2d0 minus 1 ge2d0 minus 2drminus1 In particular δn = 1 + n minus (2d0 minus 2drminus1) = 2drminus1 = 2 degG f2dminus1 =a2dminus1x

drminus1Rrminus1(1x) and v2dminus1 = a2dminus1(1x)d0minusdrminus1minus1Vrminus1(1x)Case 2 drminus1 = 0 Then r ge 2 and drminus2 ge 1 Part B of Theorem C2 applies to

i = r minus 2 and n = 2dminus 1 = 2d0 minus 1 since 2d0 minus di minus di+1 = 2d0 minus drminus2 le 2d0 minus 1 = n lt2d0 = 2d0 minus 2di+1 In particular δn = 1 + n minus (2d0 minus 2di+1) = 2drminus1 = 2 degG againf2dminus1 = a2dminus1x

drminus1Rrminus1(1x) and again v2dminus1 = a2dminus1(1x)d0minusdrminus1minus1Vrminus1(1x)In both cases f2dminus1(0) = a2dminus1`rminus1 so xdegGf2dminus1(1x)f2dminus1(0) = Rrminus1`rminus1 = G

and xminusd+1+degGv2dminus1(1x)f2dminus1(0) = Vrminus1`rminus1 = V

E A gcd algorithm for integersThis appendix presents a variable-time algorithm to compute the gcd of two integersThis appendix also explains how a modification to the algorithm produces the StehleacutendashZimmermann algorithm [62]

Theorem E1 Let e be a positive integer Let f be an element of Zlowast2 Let g be anelement of 2eZlowast2 Then there is a unique element q isin

1 3 5 2e+1 minus 1

such that

qg2e minus f isin 2e+1Z2

46 Fast constant-time gcd computation and modular inversion

We define f div2 g = q2e isin Q and we define f mod2 g = f minus qg2e isin 2e+1Z2 Thenf = (f div2 g)g + (f mod2 g) Note that e = ord2 g so we are not creating any ambiguityby omitting e from the f div2 g and f mod2 g notation

Proof First note that the quotient f(g2e) is odd Define q = (f(g2e)) mod 2e+1 thenq isin

1 3 5 2e+1 and qg2e minus f isin 2e+1Z2 by construction To see uniqueness note

that if qprime isin

1 3 5 2e+1 minus 1also satisfies qprimeg2e minus f isin 2e+1Z2 then (q minus qprime)g2e isin

2e+1Z2 so q minus qprime isin 2e+1Z2 since g2e isin Zlowast2 so q mod 2e+1 = qprime mod 2e+1 so q = qprime

Theorem E2 Let R0 be a nonzero element of Z2 Let R1 be an element of 2Z2Recursively define e0 e1 e2 e3 isin Z and R2 R3 isin 2Z2 as follows

bull If i ge 0 and Ri 6= 0 then ei = ord2 Ri

bull If i ge 1 and Ri 6= 0 then Ri+1 = minus((Riminus12eiminus1)mod2 Ri)2ei isin 2Z2

bull If i ge 1 and Ri = 0 then ei Ri+1 ei+1 Ri+2 ei+2 are undefined

Then (Ri2ei

Ri+1

)=(

0 12ei

minus12ei ((Riminus12eiminus1)div2 Ri)2ei

)(Riminus12eiminus1

Ri

)for each i ge 1 where Ri+1 is defined

Proof We have (Riminus12eiminus1)mod2 Ri = Riminus12eiminus1 minus ((Riminus12eiminus1)div2 Ri)Ri Negateand divide by 2ei to obtain the matrix equation

Theorem E3 Let R0 be an odd element of Z Let R1 be an element of 2Z Definee0 e1 e2 e3 and R2 R3 as in Theorem E2 Then Ri isin 2Z for each i ge 1 whereRi is defined Furthermore if t ge 0 and Rt+1 = 0 then |Rt2et | = gcdR0 R1

Proof Write g = gcdR0 R1 isin Z Then g is odd since R0 is oddWe first show by induction on i that Ri isin gZ for each i ge 0 For i isin 0 1 this follows

from the definition of g For i ge 2 we have Riminus2 isin gZ and Riminus1 isin gZ by the inductivehypothesis Now (Riminus22eiminus2)div2 Riminus1 has the form q2eiminus1 for q isin Z by definition ofdiv2 and 22eiminus1+eiminus2Ri = 2eiminus2qRiminus1 minus 2eiminus1Riminus2 by definition of Ri The right side ofthis equation is in gZ so 2middotmiddotmiddotRi is in gZ This implies first that Ri isin Q but also Ri isin Z2so Ri isin Z This also implies that Ri isin gZ since g is odd

By construction Ri isin 2Z2 for each i ge 1 and Ri isin Z for each i ge 0 so Ri isin 2Z foreach i ge 1

Finally assume that t ge 0 and Rt+1 = 0 Then all of R0 R1 Rt are nonzero Writegprime = |Rt2et | Then 2etgprime = |Rt| isin gZ and g is odd so gprime isin gZ

The equation 2eiminus1Riminus2 = 2eiminus2qRiminus1minus22eiminus1+eiminus2Ri shows that if Riminus1 Ri isin gprimeZ thenRiminus2 isin gprimeZ since gprime is odd By construction Rt Rt+1 isin gprimeZ so Rtminus1 Rtminus2 R1 R0 isingprimeZ Hence g = gcdR0 R1 isin gprimeZ Both g and gprime are positive so gprime = g

E4 The gcd algorithm Here is the algorithm featured in this appendixGiven an odd integer R0 and an even integer R1 compute the even integers R2 R3

defined above In other words for each i ge 1 where Ri 6= 0 compute ei = ord2 Rifind q isin

1 3 5 2ei+1 minus 1

such that qRi2ei minus Riminus12eiminus1 isin 2ei+1Z and compute

Ri+1 = (qRi2ei minusRiminus12eiminus1)2ei isin 2Z If Rt+1 = 0 for some t ge 0 output |Rt2et | andstop

We will show in Theorem F26 that this algorithm terminates The output is gcdR0 R1by Theorem E3 This algorithm is closely related to iterating divstep and we will usethis relationship to analyze iterates of divstep see Appendix G

More generally if R0 is an odd integer and R1 is any integer one can computegcdR0 R1 by using the above algorithm to compute gcdR0 2R1 Even more generally

Daniel J Bernstein and Bo-Yin Yang 47

one can compute the gcd of any two integers by first handling the case gcd0 0 = 0 thenhandling powers of 2 in both inputs then applying the odd-input algorithmE5 A non-functioning variant Imagine replacing the subtractions above withadditions find q isin

1 3 5 2ei+1 minus 1

such that qRi2ei +Riminus12eiminus1 isin 2ei+1Z and

compute Ri+1 = (qRi2ei + Riminus12eiminus1)2ei isin 2Z If R0 and R1 are positive then allRi are positive so the algorithm does not terminate This non-terminating algorithm isrelated to posdivstep (see Section 84) in the same way that the algorithm above is relatedto divstep

Daireaux Maume-Deschamps and Valleacutee in [31 Section 6] mentioned this algorithm asa ldquonon-centered LSB algorithmrdquo said that one can easily add a ldquosupplementary stoppingconditionrdquo to ensure termination and dismissed the resulting algorithm as being ldquocertainlyslower than the centered versionrdquoE6 A centered variant the StehleacutendashZimmermann algorithm Imagine usingcentered remainders 1minus 2e minus3minus1 1 3 2e minus 1 modulo 2e+1 rather than uncen-tered remainders

1 3 5 2e+1 minus 1

The above gcd algorithm then turns into the

StehleacutendashZimmermann gcd algorithm from [62]Specifically Stehleacute and Zimmermann begin with integers a b having ord2 a lt ord2 b

If b 6= 0 then ldquogeneralized binary divisionrdquo in [62 Lemma 1] defines q as the unique oddinteger with |q| lt 2ord2 bminusord2 a such that r = a+ qb2ord2 bminusord2 a is divisible by 2ord2 b+1The gcd algorithm in [62 Algorithm Elementary-GB] repeatedly sets (a b)larr (b r) Whenb = 0 the algorithm outputs the odd part of a

It is equivalent to set (a b)larr (b2ord2 b r2ord2 b) since this simply divides all subse-quent results by 2ord2 b In other words starting from an odd a0 and an even b0 recursivelydefine an odd ai+1 and an even bi+1 by the following formulas stopping when bi = 0bull ai+1 = bi2e where e = ord2 bi

bull bi+1 = (ai + qbi2e)2e isin 2Z for a unique q isin 1minus 2e minus3minus1 1 3 2e minus 1Now relabel a0 b0 b1 b2 b3 as R0 R1 R2 R3 R4 to obtain the following recursivedefinition Ri+1 = (qRi2ei +Riminus12eiminus1)2ei isin 2Z and thus(

Ri2ei

Ri+1

)=(

0 12ei

12ei q22ei

)(Riminus12eiminus1

Ri

)

for a unique q isin 1minus 2ei minus3minus1 1 3 2ei minus 1To summarize starting from the StehleacutendashZimmermann algorithm one can obtain the

gcd algorithm featured in this appendix by making the following three changes Firstdivide out 2ei instead of allowing powers of 2 to accumulate Second add Riminus12eiminus1

instead of subtracting it this does not affect the number of iterations for centered q Thirdswitch from centered q to uncentered q

Intuitively centered remainders are smaller than uncentered remainders and it iswell known that centered remainders improve the worst-case performance of Euclidrsquosalgorithm so one might expect that the StehleacutendashZimmermann algorithm performs betterthan our uncentered variant However as noted earlier our analysis produces the oppositeconclusion See Appendix F

F Performance of the gcd algorithm for integersThe possible matrices in Theorem E2 are(

0 12minus12 14

)

(0 12minus12 34

)for ei = 1(

0 14minus14 116

)

(0 14minus14 316

)

(0 14minus14 516

)

(0 14minus14 716

)for ei = 2

48 Fast constant-time gcd computation and modular inversion

etc In this appendix we show that a product of these matrices shrinks the input by a factorexponential in the weight

sumei This implies that

sumei in the algorithm of Appendix E is

O(b) where b is the number of input bits

F1 Lower bounds It is easy to see that each of these matrices has two eigenvaluesof absolute value 12ei This does not imply however that a weight-w product of thesematrices shrinks the input by a factor 12w Concretely the weight-7 product

147

(328 minus380minus158 233

)=(

0 14minus14 116

)(0 12minus12 34

)2( 0 12minus12 14

)(0 12minus12 34

)2

of 6 matrices has spectral radius (maximum absolute value of eigenvalues)

(561 +radic

249185)215 = 003235425826 = 061253803919 7 = 124949900586

The (w7)th power of this matrix is expressed as a weight-w product and has spectralradius 1207071286552w Consequently this proof strategy cannot obtain an O constantbetter than 7(15minus log2(561 +

radic249185)) = 1414169815 We prove a bound twice the

weight on the number of divstep iterations (see Appendix G) so this proof strategy cannotobtain an O constant better than 14(15 minus log2(561 +

radic249185)) = 2828339631 for

the number of divstep iterationsRotating the list of 6 matrices shown above gives 6 weight-7 products with the same

spectral radius In a small search we did not find any weight-w matrices with spectralradius above 061253803919 w Our upper bounds (see below) are 06181640625w for alllarge w

For comparison the non-terminating variant in Appendix E5 allows the matrix(0 12

12 34

) which has spectral radius 1 with eigenvector

(12

) The StehleacutendashZimmermann

algorithm reviewed in Appendix E6 allows the matrix(

0 1212 14

) which has spectral

radius (1 +radic

17)8 = 06403882032 so this proof strategy cannot obtain an O constantbetter than 2(log2(

radic17minus 1)minus 1) = 3110510062 for the number of cdivstep iterations

F2 A warning about worst-case inputs DefineM as the matrix 4minus7(

328 minus380minus158 233

)shown above Then Mminus3 =

(60321097 11336806047137246 88663112

) Applying our gcd algorithm to

inputs (60321097 47137246) applies a series of 18 matrices of total weight 21 in the

notation of Theorem E2 the matrix(

0 12minus12 34

)twice then the matrix

(0 12minus12 14

)

then the matrix(

0 12minus12 34

)twice then the weight-2 matrix

(0 14minus14 116

) all of

these matrices again and all of these matrices a third timeOne might think that these are worst-case inputs to our algorithm analogous to

a standard matrix construction of Fibonacci numbers as worst-case inputs to Euclidrsquosalgorithm We emphasize that these are not worst-case inputs the inputs (22293minus128330)are much smaller and nevertheless require 19 steps of total weight 23 We now explainwhy the analogy fails

The matrix M3 has an eigenvector (1minus053 ) with eigenvalue 1221middot0707 =12148 and an eigenvector (1 078 ) with eigenvalue 1221middot129 = 12271 so ithas spectral radius around 12148 Our proof strategy thus cannot guarantee a gainof more than 148 bits from weight 21 What we actually show below is that weight21 is guaranteed to gain more than 1397 bits (The quantity α21 in Figure F14 is133569231 = 2minus13972)

Daniel J Bernstein and Bo-Yin Yang 49

The matrix Mminus3 has spectral radius 2271 It is not surprising that each column ofMminus3 is close to 27 bits in size the only way to avoid this would be for the other eigenvectorto be pointing in almost exactly the same direction as (1 0) or (0 1) For our inputs takenfrom the first column of Mminus3 the algorithm gains nearly 27 bits from weight 21 almostdouble the guaranteed gain In short these are good inputs not bad inputs

Now replace M with the matrix F =(

0 11 minus1

) which reduces worst-case inputs to

Euclidrsquos algorithm The eigenvalues of F are (radic

5minus 1)2 = 12069 and minus(radic

5 + 1)2 =minus2069 The entries of powers of Fminus1 are Fibonacci numbers again with sizes wellpredicted by the spectral radius of Fminus1 namely 2069 However the spectral radius ofF is larger than 1 The reason that this large eigenvalue does not spoil the terminationof Euclidrsquos algorithm is that Euclidrsquos algorithm inspects sizes and chooses quotients todecrease sizes at each step

Stehleacute and Zimmermann in [62 Section 62] measure the performance of their algorithm

for ldquoworst caserdquo inputs18 obtained from negative powers of the matrix(

0 1212 14

) The

statement that these inputs are ldquoworst caserdquo is not justified On the contrary if these inputshave b bits then they take (073690 + o(1))b steps of the StehleacutendashZimmermann algorithmand thus (14738 + o(1))b cdivsteps which is considerably fewer steps than manyother inputs that we have tried19 for various values of b The quantity 073690 here islog(2) log((

radic17 + 1)2) coming from the smaller eigenvalue minus2(

radic17 + 1) = minus039038

of the above matrix

F3 Norms We use the standard definitions of the 2-norm of a real vector and the2-norm of a real matrix We summarize the basic theory of 2-norms here to keep thispaper self-contained

Definition F4 If v isin R2 then |v|2 is defined asradicv2

1 + v22 where v = (v1 v2)

Definition F5 If P isinM2(R) then |P |2 is defined as max|Pv|2 v isin R2 |v|2 = 1

This maximum exists becausev isin R2 |v|2 = 1

is a compact set (namely the unit

circle) and |Pv|2 is a continuous function of v See also Theorem F11 for an explicitformula for |P |2

Beware that the literature sometimes uses the notation |P |2 for the entrywise 2-normradicP 2

11 + P 212 + P 2

21 + P 222 We do not use entrywise norms of matrices in this paper

Theorem F6 Let v be an element of Z2 If v 6= 0 then |v|2 ge 1

Proof Write v as (v1 v2) with v1 v2 isin Z If v1 6= 0 then v21 ge 1 so v2

1 + v22 ge 1 If v2 6= 0

then v22 ge 1 so v2

1 + v22 ge 1 Either way |v|2 =

radicv2

1 + v22 ge 1

Theorem F7 Let v be an element of R2 Let r be an element of R Then |rv|2 = |r||v|2

Proof Write v as (v1 v2) with v1 v2 isin R Then rv = (rv1 rv2) so |rv|2 =radicr2v2

1 + r2v22 =radic

r2radicv2

1 + v22 = |r||v|2

18Specifically Stehleacute and Zimmermann claim that ldquothe worst case of the binary variantrdquo (ie of theirgcd algorithm) is for inputs ldquoGn and 2Gnminus1 where G0 = 0 G1 = 1 Gn = minusGnminus1 + 4Gnminus2 which givesall binary quotients equal to 1rdquo Daireaux Maume-Deschamps and Valleacutee in [31 Section 23] state thatStehleacute and Zimmermann ldquoexhibited the precise worst-case number of iterationsrdquo and give an equivalentcharacterization of this case

19For example ldquoGnrdquo from [62] is minus5467345 for n = 18 and 2135149 for n = 17 The inputs (minus5467345 2 middot2135149) use 17 iterations of the ldquoGBrdquo divisions from [62] while the smaller inputs (minus1287513 2middot622123) use24 ldquoGBrdquo iterations The inputs (1minus5467345 2135149) use 34 cdivsteps the inputs (1minus2718281 3141592)use 45 cdivsteps the inputs (1minus1287513 622123) use 58 cdivsteps

50 Fast constant-time gcd computation and modular inversion

Theorem F8 Let P be an element of M2(R) Let r be an element of R Then|rP |2 = |r||P |2

Proof By definition |rP |2 = max|rPv|2 v isin R2 |v|2 = 1

By Theorem F7 this is the

same as max|r||Pv|2 = |r|max|Pv|2 = |r||P |2

Theorem F9 Let P be an element of M2(R) Let v be an element of R2 Then|Pv|2 le |P |2|v|2

Proof If v = 0 then Pv = 0 and |Pv|2 = 0 = |P |2|v|2Otherwise |v|2 6= 0 Write w = v|v|2 Then |w|2 = 1 so |Pw|2 le |P |2 so |Pv|2 =

|Pw|2|v|2 le |P |2|v|2

Theorem F10 Let PQ be elements of M2(R) Then |PQ|2 le |P |2|Q|2

Proof If v isin R2 and |v|2 = 1 then |PQv|2 le |P |2|Qv|2 le |P |2|Q|2|v|2

Theorem F11 Let P be an element of M2(R) Assume that P lowastP =(a bc d

)where P lowast

is the transpose of P Then |P |2 =radic

a+d+radic

(aminusd)2+4b2

2

Proof P lowastP isin M2(R) is its own transpose so by the spectral theorem it has twoorthonormal eigenvectors with real eigenvalues In other words R2 has a basis e1 e2 suchthat

bull |e1|2 = 1

bull |e2|2 = 1

bull the dot product elowast1e2 is 0 (here we view vectors as column matrices)

bull P lowastPe1 = λ1e1 for some λ1 isin R and

bull P lowastPe2 = λ2e2 for some λ2 isin R

Any v isin R2 with |v|2 = 1 can be written uniquely as c1e1 + c2e2 for some c1 c2 isin R withc2

1 + c22 = 1 explicitly c1 = vlowaste1 and c2 = vlowaste2 Then P lowastPv = λ1c1e1 + λ2c2e2

By definition |P |22 is the maximum of |Pv|22 over v isin R2 with |v|2 = 1 Note that|Pv|22 = (Pv)lowast(Pv) = vlowastP lowastPv = λ1c

21 + λ2c

22 with c1 c2 defined as above For example

0 le |Pe1|22 = λ1 and 0 le |Pe2|22 = λ2 Thus |Pv|22 achieves maxλ1 λ2 It cannot belarger than this if c2

1 +c22 = 1 then λ1c

21 +λ2c

22 le maxλ1 λ2 Hence |P |22 = maxλ1 λ2

To finish the proof we make the spectral theorem explicit for the matrix P lowastP =(a bc d

)isinM2(R) Note that (P lowastP )lowast = P lowastP so b = c The characteristic polynomial of

this matrix is (xminus a)(xminus d)minus b2 = x2 minus (a+ d)x+ adminus b2 with roots

a+ dplusmnradic

(a+ d)2 minus 4(adminus b2)2 =

a+ dplusmnradic

(aminus d)2 + 4b2

2

The largest eigenvalue of P lowastP is thus (a+ d+radic

(aminus d)2 + 4b2)2 and |P |2 is the squareroot of this

Theorem F12 Let P be an element of M2(R) Assume that P lowastP =(a bc d

)where P lowast

is the transpose of P Let N be an element of R Then N ge |P |2 if and only if N ge 02N2 ge a+ d and (2N2 minus aminus d)2 ge (aminus d)2 + 4b2

Daniel J Bernstein and Bo-Yin Yang 51

earlybounds=011126894912^2037794112^2148808332^2251652192^206977232^2078823132^2483067332^239920452^22104392132^25112816812^25126890072^27138243032^28142578172^27156342292^29163862452^29179429512^31185834332^31197136532^32204328912^32211335692^31223282932^33238004212^35244892332^35256040592^36267388892^37271122152^35282767752^3729849732^36308292972^40312534432^39326254052^4133956252^39344650552^42352865672^42361759512^42378586372^4538656472^4239404692^4240247512^42412409172^46425934112^48433643372^48448890152^50455437912^5046418992^47472050052^50489977912^53493071912^52507544232^5451575272^51522815152^54536940732^56542122492^55552582732^56566360932^58577810812^59589529592^60592914752^59607185992^61618789972^62625348212^62633292852^62644043412^63659866332^65666035532^65

defalpha(w)ifwgt=67return(6331024)^wreturnearlybounds[w]

assertall(alpha(w)^49lt2^(-(34w-23))forwinrange(31100))assertmin((6331024)^walpha(w)forwinrange(68))==633^5(2^30165219)

Figure F14 Definition of αw for w isin 0 1 2 computer verificationthat αw lt 2minus(34wminus23)49 for 31 le w le 99 and computer verification thatmin(6331024)wαw 0 le w le 67 = 6335(230165219) If w ge 67 then αw =(6331024)w

Our upper-bound computations work with matrices with rational entries We useTheorem F12 to compare the 2-norms of these matrices to various rational bounds N Allof the computations here use exact arithmetic avoiding the correctness questions thatwould have shown up if we had used approximations to the square roots in Theorem F11Sage includes exact representations of algebraic numbers such as |P |2 but switching toTheorem F12 made our upper-bound computations an order of magnitude faster

Proof Assume thatN ge 0 that 2N2 ge a+d and that (2N2minusaminusd)2 ge (aminusd)2+4b2 Then2N2minusaminusd ge

radic(aminus d)2 + 4b2 since 2N2minusaminusd ge 0 so N2 ge (a+d+

radic(aminus d)2 + 4b2)2

so N geradic

(a+ d+radic

(aminus d)2 + 4b2)2 since N ge 0 so N ge |P |2 by Theorem F11Conversely assume that N ge |P |2 Then N ge 0 Also N2 ge |P |22 so 2N2 ge

2|P |22 = a + d +radic

(aminus d)2 + 4b2 by Theorem F11 In particular 2N2 ge a + d and(2N2 minus aminus d)2 ge (aminus d)2 + 4b2

F13 Upper bounds We now show that every weight-w product of the matrices inTheorem E2 has norm at most αw Here α0 α1 α2 is an exponentially decreasingsequence of positive real numbers defined in Figure F14

Definition F15 For e isin 1 2 and q isin

1 3 5 2e+1 minus 1 define Meq =(

0 12eminus12e q22e

)

52 Fast constant-time gcd computation and modular inversion

Theorem F16 |Meq|2 lt (1 +radic

2)2e if e isin 1 2 and q isin

1 3 5 2e+1 minus 1

Proof Define(a bc d

)= MlowasteqMeq Then a = 122e b = c = minusq23e and d = 122e +

q224e so a+ d = 222e + q224e lt 622e and (aminus d)2 + 4b2 = 4q226e + q428e le 3224eBy Theorem F12 |Meq|2 =

radic(a+ d+

radic(aminus d)2 + 4b2)2 le

radic(622e +

radic3222e)2 =

(1 +radic

2)2e

Definition F17 Define βw = infαw+jαj j ge 0 for w isin 0 1 2

Theorem F18 If w isin 0 1 2 then βw = minαw+jαj 0 le j le 67 If w ge 67then βw = (6331024)w6335(230165219)

The first statement reduces the computation of βw to a finite computation Thecomputation can be further optimized but is not a bottleneck for us Similar commentsapply to γwe in Theorem F20

Proof By definition αw = (6331024)w for all w ge 67 If j ge 67 then also w + j ge 67 soαw+jαj = (6331024)w+j(6331024)j = (6331024)w In short all terms αw+jαj forj ge 67 are identical so the infimum of αw+jαj for j ge 0 is the same as the minimum for0 le j le 67

If w ge 67 then αw+jαj = (6331024)w+jαj The quotient (6331024)jαj for0 le j le 67 has minimum value 6335(230165219)

Definition F19 Define γwe = infβw+j2j70169 j ge e

for w isin 0 1 2 and

e isin 1 2 3

This quantity γwe is designed to be as large as possible subject to two conditions firstγwe le βw+e2e70169 used in Theorem F21 second γwe le γwe+1 used in Theorem F22

Theorem F20 γwe = minβw+j2j70169 e le j le e+ 67

if w isin 0 1 2 and

e isin 1 2 3 If w + e ge 67 then γwe = 2e(6331024)w+e(70169)6335(230165219)

Proof If w + j ge 67 then βw+j = (6331024)w+j6335(230165219) so βw+j2j70169 =(6331024)w(633512)j(70169)6335(230165219) which increases monotonically with j

In particular if j ge e+ 67 then w + j ge 67 so βw+j2j70169 ge βw+e+672e+6770169Hence γwe = inf

βw+j2j70169 j ge e

= min

βw+j2j70169 e le j le e+ 67

Also if

w + e ge 67 then w + j ge 67 for all j ge e so

γwe =(

6331024

)w (633512

)e 70169

6335

230165219 = 2e(

6331024

)w+e 70169

6335

230165219

as claimed

Theorem F21 Define S as the smallest subset of ZtimesM2(Q) such that

bull(0(

1 00 1))isin S and

bull if (wP ) isin S and w = 0 or |P |2 gt βw and e isin 1 2 3 and |P |2 gt γwe andq isin

1 3 5 2e+1 minus 1

then (e+ wM(e q)P ) isin S

Assume that |P |2 le αw for each (wP ) isin S Then |M(ej qj) middot middot middotM(e1 q1)|2 le αe1+middotmiddotmiddot+ej

for all j isin 0 1 2 all e1 ej isin 1 2 3 and all q1 qj isin Z such thatqi isin

1 3 5 2ei+1 minus 1

for each i

Daniel J Bernstein and Bo-Yin Yang 53

The idea here is that instead of searching through all products P of matrices M(e q)of total weight w to see whether |P |2 le αw we prune the search in a way that makesthe search finite across all w (see Theorem F22) while still being sure to find a minimalcounterexample if one exists The first layer of pruning skips all descendants QP of P ifw gt 0 and |P |2 le βw the point is that any such counterexample QP implies a smallercounterexample Q The second layer of pruning skips weight-(w + e) children M(e q)P ofP if |P |2 le γwe the point is that any such child is eliminated by the first layer of pruning

Proof Induct on jIf j = 0 then M(ej qj) middot middot middotM(e1 q1) =

(1 00 1) with norm 1 = α0 Assume from now on

that j ge 1Case 1 |M(ei qi) middot middot middotM(e1 q1)|2 le βe1+middotmiddotmiddot+ei

for some i isin 1 2 jThen j minus i lt j so |M(ej qj) middot middot middotM(ei+1 qi+1)|2 le αei+1+middotmiddotmiddot+ej by the inductive

hypothesis so |M(ej qj) middot middot middotM(e1 q1)|2 le βe1+middotmiddotmiddot+eiαei+1+middotmiddotmiddot+ej by Theorem F10 butβe1+middotmiddotmiddot+ei

αei+1+middotmiddotmiddot+ejle αe1+middotmiddotmiddot+ej

by definition of βCase 2 |M(ei qi) middot middot middotM(e1 q1)|2 gt βe1+middotmiddotmiddot+ei for every i isin 1 2 jSuppose that |M(eiminus1 qiminus1) middot middot middotM(e1 q1)|2 le γe1+middotmiddotmiddot+eiminus1ei

for some i isin 1 2 jBy Theorem F16 |M(ei qi)|2 lt (1 +

radic2)2ei lt (16970)2ei By Theorem F10

|M(ei qi) middot middot middotM(e1 q1)|2 le |M(ei qi)|2γe1+middotmiddotmiddot+eiminus1ei le169γe1+middotmiddotmiddot+eiminus1ei

70 middot 2ei+1 le βe1+middotmiddotmiddot+ei

contradiction Thus |M(eiminus1 qiminus1) middot middot middotM(e1 q1)|2 gt γe1+middotmiddotmiddot+eiminus1eifor all i isin 1 2 j

Now (e1 + middot middot middot + eiM(ei qi) middot middot middotM(e1 q1)) isin S for each i isin 0 1 2 j Indeedinduct on i The base case i = 0 is simply

(0(

1 00 1))isin S If i ge 1 then (wP ) isin S by

the inductive hypothesis where w = e1 + middot middot middot+ eiminus1 and P = M(eiminus1 qiminus1) middot middot middotM(e1 q1)Furthermore

bull w = 0 (when i = 1) or |P |2 gt βw (when i ge 2 by definition of this case)

bull ei isin 1 2 3

bull |P |2 gt γwei(as shown above) and

bull qi isin

1 3 5 2ei+1 minus 1

Hence (e1 + middot middot middot+ eiM(ei qi) middot middot middotM(e1 q1)) = (ei + wM(ei qi)P ) isin S by definition of SIn particular (wP ) isin S with w = e1 + middot middot middot + ej and P = M(ej qj) middot middot middotM(e1 q1) so

|P |2 le αw by hypothesis

Theorem F22 In Theorem F21 S is finite and |P |2 le αw for each (wP ) isin S

Proof This is a computer proof We run the Sage script in Figure F23 We observe thatit terminates successfully (in slightly over 2546 minutes using Sage 86 on one core of a35GHz Intel Xeon E3-1275 v3 CPU) and prints 3787975117

What the script does is search depth-first through the elements of S verifying thateach (wP ) isin S satisfies |P |2 le αw The output is the number of elements searched so Shas at most 3787975117 elements

Specifically if (wP ) isin S then verify(wP) checks that |P |2 le αw and recursivelycalls verify on all descendants of (wP ) Actually for efficiency the script scales eachmatrix to have integer entries it works with 22eM(e q) (labeled scaledM(eq) in thescript) instead of M(e q) and it works with 4wP (labeled P in the script) instead of P

The script computes βw as shown in Theorem F18 computes γw as shown in Theo-rem F20 and compares |P |2 to various bounds as shown in Theorem F12

If w gt 0 and |P |2 le βw then verify(wP) returns There are no descendants of (wP )in this case Also |P |2 le αw since βw le αwα0 = αw

54 Fast constant-time gcd computation and modular inversion

fromalphaimportalphafrommemoizedimportmemoized

R=MatrixSpace(ZZ2)defscaledM(eq)returnR((02^e-2^eq))

memoizeddefbeta(w)returnmin(alpha(w+j)alpha(j)forjinrange(68))

memoizeddefgamma(we)returnmin(beta(w+j)2^j70169forjinrange(ee+68))

defspectralradiusisatmost(PPN)assumingPPhasformPtranspose()P(ab)(cd)=PPX=2N^2-a-dreturnNgt=0andXgt=0andX^2gt=(a-d)^2+4b^2

defverify(wP)nodes=1PP=Ptranspose()Pifwgt0andspectralradiusisatmost(PP4^wbeta(w))returnnodesassertspectralradiusisatmost(PP4^walpha(w))foreinPositiveIntegers()ifspectralradiusisatmost(PP4^wgamma(we))returnnodesforqinrange(12^(e+1)2)nodes+=verify(e+wscaledM(eq)P)

printverify(0R(1))

Figure F23 Sage script verifying |P |2 le αw for each (wP ) isin S Output is number ofpairs (wP ) considered if verification succeeds or an assertion failure if verification fails

Otherwise verify(wP) asserts that |P |2 le αw ie it checks that |P |2 le αw andraises an exception if not It then tries successively e = 1 e = 2 e = 3 etc continuingas long as |P |2 gt γwe For each e where |P |2 gt γwe verify(wP) recursively handles(e+ wM(e q)P ) for each q isin

1 3 5 2e+1 minus 1

Note that γwe le γwe+1 le middot middot middot so if |P |2 le γwe then also |P |2 le γwe+1 etc This iswhere it is important that γwe is not simply defined as βw+e2e70169

To summarize verify(wP) recursively enumerates all descendants of (wP ) Henceverify(0R(1)) recursively enumerates all elements (wP ) of S checking |P |2 le αw foreach (wP )

Theorem F24 If j isin 0 1 2 e1 ej isin 1 2 3 q1 qj isin Z and qi isin1 3 5 2ei+1 minus 1

for each i then |M(ej qj) middot middot middotM(e1 q1)|2 le αe1+middotmiddotmiddot+ej

Proof This is the conclusion of Theorem F21 so it suffices to check the assumptionsDefine S as in Theorem F21 then |P |2 le αw for each (wP ) isin S by Theorem F22

Daniel J Bernstein and Bo-Yin Yang 55

Theorem F25 In the situation of Theorem E3 assume that R0 R1 R2 Rj+1 aredefined Then |(Rj2ej Rj+1)|2 le αe1+middotmiddotmiddot+ej

|(R0 R1)|2 and if e1 + middot middot middot + ej ge 67 thene1 + middot middot middot+ ej le log1024633 |(R0 R1)|2

If R0 and R1 are each bounded by 2b in absolute value then |(R0 R1)|2 is bounded by2b+05 so log1024633 |(R0 R1)|2 le (b+ 05) log2(1024633) = (b+ 05)(14410 )

Proof By Theorem E2 (Ri2ei

Ri+1

)= M(ei qi)

(Riminus12eiminus1

Ri

)where qi = 2ei ((Riminus12eiminus1)div2 Ri) isin

1 3 5 2ei+1 minus 1

Hence(

Rj2ej

Rj+1

)= M(ej qj) middot middot middotM(e1 q1)

(R0R1

)

The product M(ej qj) middot middot middotM(e1 q1) has 2-norm at most αw where w = e1 + middot middot middot+ ej so|(Rj2ej Rj+1)|2 le αw|(R0 R1)|2 by Theorem F9

By Theorem F6 |(Rj2ej Rj+1)|2 ge 1 If e1 + middot middot middot + ej ge 67 then αe1+middotmiddotmiddot+ej =(6331024)e1+middotmiddotmiddot+ej so 1 le (6331024)e1+middotmiddotmiddot+ej |(R0 R1)|2

Theorem F26 In the situation of Theorem E3 there exists t ge 0 such that Rt+1 = 0

Proof Suppose not Then R0 R1 Rj Rj+1 are all defined for arbitrarily large j Takej ge 67 with j gt log1024633 |(R0 R1)|2 Then e1 + middot middot middot + ej ge 67 so j le e1 + middot middot middot + ej lelog1024633 |(R0 R1)|2 lt j by Theorem F25 contradiction

G Proof of main gcd theorem for integersThis appendix shows how the gcd algorithm from Appendix E is related to iteratingdivstep This appendix then uses the analysis from Appendix F to put bounds on thenumber of divstep iterations needed

Theorem G1 (2-adic division as a sequence of division steps) In the situation ofTheorem E1 divstep2e(1 f g2) = (1 g2e (qg2e minus f)2e+1)

In other words divstep2e(1 f g2) = (1 g2eminus(f mod2 g)2e+1)

Proof Defineze = (g2e minus f)2 isin Z2

ze+1 = (ze + (ze mod 2)g2e)2 isin Z2

ze+2 = (ze+1 + (ze+1 mod 2)g2e)2 isin Z2

z2e = (z2eminus1 + (z2eminus1 mod 2)g2e)2 isin Z2

Thenze = (g2e minus f)2

ze+1 = ((1 + 2(ze mod 2))g2e minus f)4ze+2 = ((1 + 2(ze mod 2) + 4(ze+1 mod 2))g2e minus f)8

z2e = ((1 + 2(ze mod 2) + middot middot middot+ 2e(z2eminus1 mod 2))g2e minus f)2e+1

56 Fast constant-time gcd computation and modular inversion

In short z2e = (qg2e minus f)2e+1 where

qprime = 1 + 2(ze mod 2) + middot middot middot+ 2e(z2eminus1 mod 2) isin

1 3 5 2e+1 minus 1

Hence qprimeg2eminusf = 2e+1z2e isin 2e+1Z2 so qprime = q by the uniqueness statement in Theorem E1Note that this construction gives an alternative proof of the existence of q in Theorem E1

All that remains is to prove divstep2e(1 f g2) = (1 g2e (qg2e minus f)2e+1) iedivstep2e(1 f g2) = (1 g2e z2e)

If 1 le i lt e then g2i isin 2Z2 so divstep(i f g2i) = (i + 1 f g2i+1) By in-duction divstepeminus1(1 f g2) = (e f g2e) By assumption e gt 0 and g2e is odd sodivstep(e f g2e) = (1 minus e g2e (g2e minus f)2) = (1 minus e g2e ze) What remains is toprove that divstepe(1minus e g2e ze) = (1 g2e z2e)

If 0 le i le eminus1 then 1minuse+i le 0 so divstep(1minuse+i g2e ze+i) = (2minuse+i g2e (ze+i+(ze+i mod 2)g2e)2) = (2minus e+ i g2e ze+i+1) By induction divstepi(1minus e g2e ze) =(1minus e+ i g2e ze+i) for 0 le i le e In particular divstepe(1minus e g2e ze) = (1 g2e z2e)as claimed

Theorem G.2 (2-adic gcd as a sequence of division steps). In the situation of Theorem E.3, assume that R_0, R_1, ..., R_j, R_{j+1} are defined. Then divstep^{2w}(1, R_0, R_1/2) = (1, R_j/2^{e_j}, R_{j+1}/2), where w = e_1 + ··· + e_j.

Therefore, if R_{t+1} = 0, then divstep^n(1, R_0, R_1/2) = (1 + n − 2w, R_t/2^{e_t}, 0) for all n ≥ 2w, where w = e_1 + ··· + e_t.

Proof. Induct on j. If j = 0 then w = 0, so divstep^{2w}(1, R_0, R_1/2) = (1, R_0, R_1/2), and R_0 = R_0/2^{e_0} since R_0 is odd by hypothesis.

If j ≥ 1 then by definition R_{j+1} = −((R_{j−1}/2^{e_{j−1}}) mod_2 R_j)/2^{e_j}. Furthermore

divstep^{2e_1+···+2e_{j−1}}(1, R_0, R_1/2) = (1, R_{j−1}/2^{e_{j−1}}, R_j/2)

by the inductive hypothesis, so

divstep^{2e_1+···+2e_j}(1, R_0, R_1/2) = divstep^{2e_j}(1, R_{j−1}/2^{e_{j−1}}, R_j/2) = (1, R_j/2^{e_j}, R_{j+1}/2)

by Theorem G.1.

Theorem G.3. Let R_0 be an odd element of Z. Let R_1 be an element of 2Z. Then there exists n ∈ {0, 1, 2, ...} such that divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0}. Furthermore, any such n has divstep^n(1, R_0, R_1/2) ∈ Z × {G, −G} × {0}, where G = gcd{R_0, R_1}.

Proof. Define e_0, e_1, e_2, e_3, ... and R_2, R_3, ... as in Theorem E.2. By Theorem F.26 there exists t ≥ 0 such that R_{t+1} = 0. By Theorem E.3, R_t/2^{e_t} ∈ {G, −G}. By Theorem G.2, divstep^{2w}(1, R_0, R_1/2) = (1, R_t/2^{e_t}, 0) ∈ Z × {G, −G} × {0}, where w = e_1 + ··· + e_t.

Conversely, assume that n ∈ {0, 1, 2, ...} and that divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0}. Write divstep^n(1, R_0, R_1/2) as (δ, f, 0). Then divstep^{n+k}(1, R_0, R_1/2) = (k + δ, f, 0) for all k ∈ {0, 1, 2, ...}, so in particular divstep^{2w+n}(1, R_0, R_1/2) = (2w + δ, f, 0). Also divstep^{2w+k}(1, R_0, R_1/2) = (k + 1, R_t/2^{e_t}, 0) for all k ∈ {0, 1, 2, ...}, so in particular divstep^{2w+n}(1, R_0, R_1/2) = (n + 1, R_t/2^{e_t}, 0). Hence f = R_t/2^{e_t} ∈ {G, −G}, and divstep^n(1, R_0, R_1/2) ∈ Z × {G, −G} × {0} as claimed.

Theorem G.4. Let R_0 be an odd element of Z. Let R_1 be an element of 2Z. Define b = log_2 √(R_0^2 + R_1^2). Assume that b ≤ 21. Then divstep^⌊19b/7⌋(1, R_0, R_1/2) ∈ Z × Z × {0}.

Proof. If divstep(δ, f, g) = (δ_1, f_1, g_1), then divstep(δ, −f, −g) = (δ_1, −f_1, −g_1). Hence divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0} if and only if divstep^n(1, −R_0, −R_1/2) ∈ Z × Z × {0}. From now on assume without loss of generality that R_0 > 0.


Figure G.5: For each s ∈ {0, 1, ..., 56}: the minimum W_s of R_0^2 + R_1^2, where R_0 is an odd integer, R_1 is an even integer, and (R_0, R_1) requires ≥ s iterations of divstep; also an example of (R_0, R_1) reaching this minimum. (The 57-row table of values is not reproduced here; the first row is s = 0, W_0 = 1, with example (R_0, R_1) = (1, 0).)

The rest of the proof is a computer proof. We enumerate all pairs (R_0, R_1) where R_0 is a positive odd integer, R_1 is an even integer, and R_0^2 + R_1^2 ≤ 2^{42}. For each (R_0, R_1) we iterate divstep to find the minimum n ≥ 0 where divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0}. We then say that (R_0, R_1) "needs n steps".

For each s ≥ 0, define W_s as the smallest R_0^2 + R_1^2 where (R_0, R_1) needs ≥ s steps. The computation shows that each (R_0, R_1) needs ≤ 56 steps, and also shows the values W_s for each s ∈ {0, 1, ..., 56}. We display these values in Figure G.5. We check, for each s ∈ {0, 1, ..., 56}, that 2^{14s} ≤ W_s^{19}.

Finally, we claim that each (R_0, R_1) needs ≤ ⌊19b/7⌋ steps, where b = log_2 √(R_0^2 + R_1^2). If not, then (R_0, R_1) needs ≥ s steps where s = ⌊19b/7⌋ + 1. We have s ≤ 56 since (R_0, R_1) needs ≤ 56 steps. We also have W_s ≤ R_0^2 + R_1^2 by definition of W_s. Hence 2^{14s/19} ≤ R_0^2 + R_1^2, so 14s/19 ≤ 2b, so s ≤ 19b/7, contradiction.
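This enumeration is straightforward to express in code. The following Python sketch is ours, not the verification software used for the proof; it assumes the integer divstep from Section 8, namely divstep(δ, f, g) = (1 − δ, g, (g − f)/2) if δ > 0 and g is odd, and (1 + δ, f, (g + (g mod 2)f)/2) otherwise, which matches the branches used in the proof of Theorem G.1 above.

def divstep(delta, f, g):
    # integer divstep as assumed above
    if delta > 0 and g % 2 == 1:
        return 1 - delta, g, (g - f) // 2
    return 1 + delta, f, (g + (g % 2) * f) // 2

def steps_needed(R0, R1):
    # minimum n >= 0 with divstep^n(1, R0, R1/2) in Z x Z x {0}
    delta, f, g, n = 1, R0, R1 // 2, 0
    while g != 0:
        delta, f, g = divstep(delta, f, g)
        n += 1
    return n

# Tiny window of the enumeration behind Figure G.5 (the proof covers R0^2 + R1^2 <= 2^42):
W = {}
for R0 in range(1, 100, 2):
    for R1 in range(-100, 101, 2):
        w, n = R0 * R0 + R1 * R1, steps_needed(R0, R1)
        for s in range(n + 1):
            W[s] = min(W.get(s, w), w)
# W[0] = 1, reached by (R0, R1) = (1, 0), matching the first row of Figure G.5.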

Theorem G.6 (Bounds on the number of divsteps required). Let R_0 be an odd element of Z. Let R_1 be an element of 2Z. Define b = log_2 √(R_0^2 + R_1^2). Define n = ⌊19b/7⌋ if b ≤ 21; n = ⌊(49b + 23)/17⌋ if 21 < b ≤ 46; and n = ⌊49b/17⌋ if b > 46. Then divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0}.


All three cases have n ≤ ⌊(49b + 23)/17⌋, since 19/7 < 49/17.

Proof. If b ≤ 21 then divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0} by Theorem G.4.

Assume from now on that b > 21. This implies n ≥ ⌊49b/17⌋ ≥ 60.

Define e_0, e_1, e_2, e_3, ... and R_2, R_3, ... as in Theorem E.2. By Theorem F.26 there exists t ≥ 0 such that R_{t+1} = 0. By Theorem E.3, R_t/2^{e_t} ∈ {G, −G}. By Theorem G.2, divstep^{2w}(1, R_0, R_1/2) = (1, R_t/2^{e_t}, 0) ∈ Z × {G, −G} × {0}, where w = e_1 + ··· + e_t.

To finish the proof we will show that 2w ≤ n; hence divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0} as claimed.

Suppose that 2w > n. Both 2w and n are integers, so 2w ≥ n + 1 ≥ 61, so w ≥ 31. By Theorem F.6 and Theorem F.25, 1 ≤ |(R_t/2^{e_t}, R_{t+1})|_2 ≤ α_w |(R_0, R_1)|_2 = α_w 2^b, so log_2(1/α_w) ≤ b.

Case 1: w ≥ 67. By definition 1/α_w = (1024/633)^w, so w log_2(1024/633) ≤ b. We have 34/49 < log_2(1024/633), so 34w/49 < b, so n + 1 ≤ 2w < 49b/17 < (49b + 23)/17. This contradicts the definition of n as the largest integer below 49b/17 if b > 46, and as the largest integer below (49b + 23)/17 if b ≤ 46.

Case 2: w ≤ 66. Then α_w < 2^{−(34w−23)/49} since w ≥ 31, so b ≥ log_2(1/α_w) > (34w − 23)/49, so n + 1 ≤ 2w ≤ (49b + 23)/17. This again contradicts the definition of n if b ≤ 46. If b > 46, then n ≥ ⌊49 · 46/17⌋ = 132, so 2w ≥ n + 1 > 132 ≥ 2w, contradiction.
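For concreteness, the iteration bound in Theorem G.6 is simple to evaluate; the following small Python helper is ours (an illustration, not part of the paper's software) and uses floating-point log2, which is adequate for an illustration but not for a formal bound.

from math import floor, log2, sqrt

def iterations_bound(R0, R1):
    b = log2(sqrt(R0 * R0 + R1 * R1))      # b = log2 sqrt(R0^2 + R1^2)
    if b <= 21:
        return floor(19 * b / 7)
    if b <= 46:
        return floor((49 * b + 23) / 17)
    return floor(49 * b / 17)

# Example: for b = 256 exactly, the bound is floor(49 * 256 / 17) = 737 iterations.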

H Comparison to performance of cdivstep

This appendix briefly compares the analysis of divstep from Appendix G to an analogous analysis of cdivstep.

First, modify Theorem E.1 to take the unique element q ∈ {±1, ±3, ..., ±(2^e − 1)} such that f + qg/2^e ∈ 2^{e+1}Z_2. The corresponding modification of Theorem G.1 says that cdivstep^{2e}(1, f, g/2) = (1, g/2^e, (f + qg/2^e)/2^{e+1}). The proof proceeds identically, except that z_j = (z_{j−1} − (z_{j−1} mod 2) g/2^e)/2 ∈ Z_2 for e < j < 2e, and z_{2e} = (z_{2e−1} + (z_{2e−1} mod 2) g/2^e)/2 ∈ Z_2. Also, we have q = −1 − 2(z_e mod 2) − ··· − 2^{e−1}(z_{2e−2} mod 2) + 2^e(z_{2e−1} mod 2) ∈ {±1, ±3, ..., ±(2^e − 1)}. In the notation of [62], GB(f, g) = (q, r), where r = f + qg/2^e = 2^{e+1}z_{2e}.

The corresponding gcd algorithm is the centered algorithm from Appendix E.6, with addition instead of subtraction. Recall that, except for inessential scalings by powers of 2, this is the same as the Stehlé–Zimmermann gcd algorithm from [62].

The corresponding set of transition matrices is different from the divstep case. As mentioned in Appendix F, there are weight-w products with spectral radius 0.6403882032...^w, so the proof strategy for cdivstep cannot obtain weight-w norm bounds below this spectral radius. This is worse than our bounds for divstep.

Enumerating small pairs (R_0, R_1) as in Theorem G.4 also produces worse results for cdivstep than for divstep. See Figure H.1.


Figure H.1: (Plot not reproduced here.) For each s ∈ {1, 2, ..., 56}, a red dot at (b, s − 2.8283396b), where b is the minimum of log_2 √(R_0^2 + R_1^2) over odd integers R_0 and even integers R_1 such that (R_0, R_1) requires ≥ s iterations of divstep. For each s ∈ {1, 2, ..., 58}, a blue dot at (b, s − 2.8283396b), where b is the minimum of log_2 √(R_0^2 + R_1^2) over odd integers R_0 and even integers R_1 such that (R_0, R_1) requires ≥ s iterations of cdivstep.


in the input size: asymptotically n(log n)^{2+o(1)}. This algorithm has important advantages over earlier subquadratic gcd algorithms, as we now explain.

The starting point for subquadratic gcd algorithms is the following idea from Lehmer [44]. The quotients at the beginning of a gcd computation depend only on the most significant bits of the inputs. After computing the corresponding 2 × 2 transition matrix, one can retroactively multiply this matrix by the remaining bits of the inputs, and then continue with the new, hopefully smaller, inputs.

A closer look shows that it is not easy to formulate a precise statement about the amount of progress that one is guaranteed to make by inspecting j most significant bits of the inputs. We return to this difficulty below. Assume for the moment that j bits of the inputs determine roughly j bits of initial information about the quotients in Euclid's algorithm.

Lehmer's paper predated subquadratic multiplication but is well known to be helpful even with quadratic multiplication. For example, consider the following procedure, starting with 40-bit integers a and b: add a to b, then add the new b to a, and so on, back and forth, for a total of 30 additions. The results fit into words on a typical 64-bit CPU, and this procedure might seem to be efficiently using the CPU's arithmetic instructions. However, it is generally faster to directly compute 832040a + 514229b and 1346269a + 832040b using the CPU's fast multiplication circuits.
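To make the arithmetic behind this example concrete, here is a small Python illustration (ours, not from the paper's software): 30 alternating additions act linearly on (a, b), and the resulting coefficients are exactly the Fibonacci numbers quoted above.

# 30 alternating additions b += a; a += b; ... are a linear map on (a, b),
# and after 30 steps the coefficients are Fibonacci numbers.
def alternate_additions(a, b, count=30):
    for i in range(count):
        if i % 2 == 0:
            b += a
        else:
            a += b
    return a, b

a, b = 123456789, 987654321    # sample inputs (well within 64-bit words)
assert alternate_additions(a, b) == (1346269*a + 832040*b, 832040*a + 514229*b)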

Faster algorithms for multiplication make Lehmer's approach more effective. Knuth [41], using Lehmer's idea recursively on top of FFT-based integer multiplication, achieved cost n(log n)^{5+o(1)} for integer gcd. Schönhage [60] improved 5 to 2. Subsequent papers such as [53], [22], and [66] adapted the ideas to the polynomial case.

The subquadratic algorithms appearing in [41], [60], [22, Algorithm EMGCD], [66, Algorithm SCH], [62, Algorithm Half-GB-gcd], [52, Algorithms HGCD-Q and HGCD-B and SGCD and HGCD-D], [25, Algorithm 3.1], etc. all have the following structure. There is an initial recursive call that computes a 2 × 2 transition matrix; there is a computation of reduced inputs using a fast multiplication subroutine; there is a call to a fast division subroutine, producing a possibly large quotient and remainder; and then there is more recursion.

We emphasize the role of the division subroutine in each of these algorithms. If the recursive computation is expressed as a tree, then each node in the tree calls the left subtree recursively, multiplies by a transition matrix, calls a division subroutine, and calls the right subtree recursively. The results of the left subtree are described by some number of quotients in Euclid's algorithm, so variations in the sizes of quotients produce irregularities in the amount of information computed by the recursions. These irregularities can be quite severe: multiplying by the transition matrix does not make any progress when the inputs have sufficiently different sizes. The call to a division subroutine is a bandage over these irregularities, guaranteeing significant progress in all cases.

Our subquadratic algorithm has a simpler structure. We guarantee that a recursive call to a size-j subtree always jumps through exactly j division steps. Each node in the computation tree involves multiplications and recursive calls, but the large variable-time division subroutine has disappeared, and the underlying irregularities have also disappeared. One can think of each Euclid-style division as being decomposed into a series of our simpler constant-time division steps, but our recursion does not care where each Euclid-style division begins and ends. See Section 12 for a case study of the speeds that we achieve.

A prerequisite for this regular structure is that the linear combinations in j of our division steps are determined by the bottom j bits of the integer inputs,8 without regard to top bits. This prerequisite was almost achieved by the Brent–Kung "systolic" gcd algorithm [24] mentioned above, but Brent and Kung did not observe that the data flow allows a subquadratic algorithm. The Stehlé–Zimmermann subquadratic "binary recursive gcd" algorithm [62] also emphasizes that it works with bottom bits, but it has the same irregularities as earlier subquadratic algorithms, and it again needs a large variable-time division subroutine.

8 For polynomials one can work from the bottom or from the top, since there are no carries. We choose to work from bottom coefficients in the polynomial case to emphasize the similarities between the polynomial case and the integer case.

Often the literature considers the "normal" case of polynomial gcd computation. This is by definition the case that each Euclid–Stevin quotient has degree 1. The conventional recursion then returns a predictable amount of information, allowing a more regular algorithm such as [53, Algorithm PGCD].9 Our algorithm achieves the same regularity for the general case.

Another previous algorithm that achieves the same regularity, for a different special case of polynomial gcd computation, is Thomé's subquadratic variant [68, Program 4.1] of the Berlekamp–Massey algorithm. Our algorithm can be viewed as extending this regularity to arbitrary polynomial gcd computations, and beyond this to integer gcd computations.

2 Organization of the paper

We start with the polynomial case. Section 3 defines division steps. Section 5, relying on theorems in Section 4, states our main algorithm to compute c coefficients of the nth iterate of divstep. This takes n(c + n) simple operations. We also explain how "jumps" reduce the cost for large n to (c + n)(log cn)^{2+o(1)} operations. All of these algorithms take constant time, i.e., time independent of the input coefficients, for any particular (n, c).

There is a long tradition in number theory of defining Euclid's algorithm, continued fractions, etc. for inputs in a "complete" field, such as the field of real numbers or the field of power series. We define division steps for power series, and state our main algorithm for power series. Division steps produce polynomial outputs from polynomial inputs, so the reader can restrict attention to polynomials if desired, but there are advantages of following the tradition, as we now explain.

Requiring algorithms to work for general power series means that the algorithms can only inspect a limited number of leading coefficients of each input, without regard to the degree of inputs that happen to be polynomials. This restriction simplifies algorithm design and helped us design our main algorithm.

Going beyond polynomials also makes it easy to describe further algorithmic options, such as iterating divstep on inputs (1, g/f) instead of (f, g). Restricting to polynomial inputs would require g/f to be replaced with a nearby polynomial; the limited precision of the inputs would affect the table of divstep iterates, rather than merely being an option in the algorithms.

Section 6, relying on theorems in Appendix A, applies our main algorithm to gcd computation and modular inversion. Here the inputs are polynomials. A constant fraction of the power-series coefficients used in our main algorithm are guaranteed to be 0 in this case, and one can save some time by skipping computations involving those coefficients. However, it is still helpful to factor the algorithm-design process into (1) designing the fast power-series algorithm and then (2) eliminating unused values for special cases.

Section 7 is a case study of the software speed of modular inversion of polynomials as part of key generation in the NTRU cryptosystem.

The remaining sections of the paper consider the integer case. Section 8 defines division steps for integers. Section 10, relying on theorems in Section 9, states our main algorithm to compute c bits of the nth iterate of divstep. Section 11, relying on theorems in Appendix G, applies our main algorithm to gcd computation and modular inversion. Section 12 is a case study of the software speed of inversion of integers modulo primes used in elliptic-curve cryptography.

9 The analysis in [53, Section VI] considers only normal inputs (and power-of-2 sizes). The algorithm works only for normal inputs. The performance of division in the non-normal case was discussed in [53, Section VIII], but incorrectly described as solely an issue for the base case of the recursion.

2.1 General notation. We write M_2(R) for the ring of 2 × 2 matrices over a ring R. For example, M_2(ℝ) is the ring of 2 × 2 matrices over the field ℝ of real numbers.

2.2 Notation for the polynomial case. Fix a field k. The formal-power-series ring k[[x]] is the set of infinite sequences (f_0, f_1, ...) with each f_i ∈ k. The ring operations are

• 0: (0, 0, 0, ...).

• 1: (1, 0, 0, ...).

• +: (f_0, f_1, ...), (g_0, g_1, ...) ↦ (f_0 + g_0, f_1 + g_1, ...).

• −: (f_0, f_1, ...), (g_0, g_1, ...) ↦ (f_0 − g_0, f_1 − g_1, ...).

• ·: (f_0, f_1, f_2, ...), (g_0, g_1, g_2, ...) ↦ (f_0g_0, f_0g_1 + f_1g_0, f_0g_2 + f_1g_1 + f_2g_0, ...).

The element (0, 1, 0, ...) is written "x", and the entry f_i in f = (f_0, f_1, ...) is called the "coefficient of x^i in f". As in the Sage [58] computer-algebra system, we write f[i] for the coefficient of x^i in f. The "constant coefficient" f[0] has a more standard notation f(0), and we use the f(0) notation when we are not also inspecting other coefficients.

The unit group of k[[x]] is written k[[x]]^*. This is the same as the set of f ∈ k[[x]] such that f(0) ≠ 0.

The polynomial ring k[x] is the subring of sequences (f_0, f_1, ...) that have only finitely many nonzero coefficients. We write deg f for the degree of f ∈ k[x]; this means the maximum i such that f[i] ≠ 0, or −∞ if f = 0. We also define f[−∞] = 0, so f[deg f] is always defined. Beware that the literature sometimes instead defines deg 0 = −1, and in particular Sage defines deg 0 = −1.

We define f mod x^n as f[0] + f[1]x + ··· + f[n − 1]x^{n−1} for any f ∈ k[[x]]. We define f mod g, for any f, g ∈ k[x] with g ≠ 0, as the unique polynomial r such that deg r < deg g and f − r = gq for some q ∈ k[x].

The ring k[[x]] is a domain. The formal-Laurent-series field k((x)) is by definition the field of fractions of k[[x]]. Each element of k((x)) can be written (not uniquely) as f/x^i for some f ∈ k[[x]] and some i ∈ {0, 1, 2, ...}. The subring k[1/x] is the smallest subring of k((x)) containing both k and 1/x.

2.3 Notation for the integer case. The set of infinite sequences (f_0, f_1, ...) with each f_i ∈ {0, 1} forms a ring under the following operations:

• 0: (0, 0, 0, ...).

• 1: (1, 0, 0, ...).

• +: (f_0, f_1, ...), (g_0, g_1, ...) ↦ the unique (h_0, h_1, ...) such that, for every n ≥ 0, h_0 + 2h_1 + 4h_2 + ··· + 2^{n−1}h_{n−1} ≡ (f_0 + 2f_1 + 4f_2 + ··· + 2^{n−1}f_{n−1}) + (g_0 + 2g_1 + 4g_2 + ··· + 2^{n−1}g_{n−1}) (mod 2^n).

• −: same with (f_0 + 2f_1 + 4f_2 + ··· + 2^{n−1}f_{n−1}) − (g_0 + 2g_1 + 4g_2 + ··· + 2^{n−1}g_{n−1}).

• ·: same with (f_0 + 2f_1 + 4f_2 + ··· + 2^{n−1}f_{n−1}) · (g_0 + 2g_1 + 4g_2 + ··· + 2^{n−1}g_{n−1}).

We follow the tradition among number theorists of using the notation Z_2 for this ring, and not for the finite field F_2 = Z/2. This ring Z_2 has characteristic 0, so Z can be viewed as a subset of Z_2.

One can think of (f_0, f_1, ...) ∈ Z_2 as an infinite series f_0 + 2f_1 + ···. The difference between Z_2 and the power-series ring F_2[[x]] is that Z_2 has carries: for example, adding (1, 0, ...) to (1, 0, ...) in Z_2 produces (0, 1, ...).


We define f mod 2^n as f_0 + 2f_1 + ··· + 2^{n−1}f_{n−1} for any f = (f_0, f_1, ...) ∈ Z_2.

The unit group of Z_2 is written Z_2^*. This is the same as the set of odd f ∈ Z_2, i.e., the set of f ∈ Z_2 such that f mod 2 ≠ 0.

If f ∈ Z_2 and f ≠ 0, then ord_2 f means the largest integer e ≥ 0 such that f ∈ 2^e Z_2, equivalently the unique integer e ≥ 0 such that f ∈ 2^e Z_2^*.

The ring Z_2 is a domain. The field Q_2 is by definition the field of fractions of Z_2. Each element of Q_2 can be written (not uniquely) as f/2^i for some f ∈ Z_2 and some i ∈ {0, 1, 2, ...}. The subring Z[1/2] is the smallest subring of Q_2 containing both Z and 1/2.

3 Definition of x-adic division steps

Define divstep: Z × k[[x]]^* × k[[x]] → Z × k[[x]]^* × k[[x]] as follows:

divstep(δ, f, g) =
  (1 − δ, g, (g(0)f − f(0)g)/x)   if δ > 0 and g(0) ≠ 0,
  (1 + δ, f, (f(0)g − g(0)f)/x)   otherwise.

Note that f(0)g − g(0)f is divisible by x in k[[x]]. The name "division step" is justified in Theorem C.1, which shows that dividing a degree-d_0 polynomial by a degree-d_1 polynomial with 0 ≤ d_1 < d_0 can be viewed as computing divstep^{2d_0−2d_1}.

3.1 Transition matrices. Write (δ_1, f_1, g_1) = divstep(δ, f, g). Then

(f_1; g_1) = T(δ, f, g) (f; g)   and   (1; δ_1) = S(δ, f, g) (1; δ),

where (a; b) denotes a column vector, T: Z × k[[x]]^* × k[[x]] → M_2(k[1/x]) is defined by

T(δ, f, g) =
  [[0, 1], [g(0)/x, −f(0)/x]]   if δ > 0 and g(0) ≠ 0,
  [[1, 0], [−g(0)/x, f(0)/x]]   otherwise,

and S: Z × k[[x]]^* × k[[x]] → M_2(Z) is defined by

S(δ, f, g) =
  [[1, 0], [1, −1]]   if δ > 0 and g(0) ≠ 0,
  [[1, 0], [1, 1]]    otherwise.

Note that both T(δ, f, g) and S(δ, f, g) are defined entirely by (δ, f(0), g(0)); they do not depend on the remaining coefficients of f and g.

3.2 Decomposition. This divstep function is a composition of two simpler functions (and the transition matrices can similarly be decomposed):

• a conditional swap replacing (δ, f, g) with (−δ, g, f) if δ > 0 and g(0) ≠ 0, followed by

• an elimination replacing (δ, f, g) with (1 + δ, f, (f(0)g − g(0)f)/x).


Figure 3.3: (Diagram not reproduced here.) Data flow in an x-adic division step, decomposed as a conditional swap (A to B) followed by an elimination (B to C). The swap bit is set if δ > 0 and g[0] ≠ 0. The g outputs are f′[0]g′[1] − g′[0]f′[1], f′[0]g′[2] − g′[0]f′[2], f′[0]g′[3] − g′[0]f′[3], etc. Color scheme: consider, e.g., the linear combination cg − df of power series f, g, where c = f[0] and d = g[0]; the inputs f, g affect the jth output coefficient cg[j] − df[j] via f[j], g[j] (black) and via the choice of c, d (red+blue). Green operations are special, creating the swap bit.

See Figure 3.3.

This decomposition is our motivation for defining divstep in the first case (the swapped case) to compute (g(0)f − f(0)g)/x. Using (f(0)g − g(0)f)/x in both cases might seem simpler, but would require the conditional swap to forward a conditional negation to the elimination.

One can also compose these functions in the opposite order. This is compatible with some of what we say later about iterating the composition. However, the corresponding transition matrices would depend on another coefficient of g. This would complicate our algorithm statements.

Another possibility, as in [23], is to keep f and g in place, using the sign of δ to decide whether to add a multiple of f into g or to add a multiple of g into f. One way to do this in constant time is to perform two multiplications and two additions. Another way is to perform conditional swaps before and after one addition and one multiplication. Our approach is more efficient.

3.4 Scaling. One can define a variant of divstep that computes f − (f(0)/g(0))g in the first case (the swapped case) and g − (g(0)/f(0))f in the second case.

This variant has two advantages. First, it multiplies only one series, rather than two, by a scalar in k. Second, multiplying both f and g by any u ∈ k[[x]]^* has the same effect on the output, so this variant induces a function on fewer variables (δ, g/f): the ratio ρ = g/f is replaced by (1/ρ − 1/ρ(0))/x when δ > 0 and ρ(0) ≠ 0, and by (ρ − ρ(0))/x otherwise, while δ is replaced by 1 − δ or 1 + δ respectively. This is the composition of, first, a conditional inversion that replaces (δ, ρ) with (−δ, 1/ρ) if δ > 0 and ρ(0) ≠ 0; second, an elimination that replaces ρ with (ρ − ρ(0))/x while adding 1 to δ.

On the other hand, working projectively with f, g is generally more efficient than working with ρ. Working with f(0)g − g(0)f, rather than g − (g(0)/f(0))f, has the virtue of being a "fraction-free" computation: it involves only ring operations in k, not divisions. This makes the effect of scaling only slightly more difficult to track. Specifically, say divstep(δ, f, g) = (δ′, f′, g′).


Table 4.1: Iterates (δ_n, f_n, g_n) = divstep^n(δ, f, g) for k = F_7, δ = 1, f = 2 + 7x + 1x^2 + 8x^3 + 2x^4 + 8x^5 + 1x^6 + 8x^7, and g = 3 + 1x + 4x^2 + 1x^3 + 5x^4 + 9x^5 + 2x^6. Each power series is displayed as a row of coefficients of x^0, x^1, etc. (The 20 rows of table data, for n = 0 through 19, are not reproduced here.)

If u ∈ k[[x]] with u(0) = 1, then divstep(δ, uf, ug) = (δ′, uf′, ug′). If u ∈ k^*, then

divstep(δ, uf, g) = (δ′, f′, ug′) in the first (swapped) case;
divstep(δ, uf, g) = (δ′, uf′, ug′) in the second case;
divstep(δ, f, ug) = (δ′, uf′, ug′) in the first case;
divstep(δ, f, ug) = (δ′, f′, ug′) in the second case;

and divstep(δ, uf, ug) = (δ′, uf′, u^2 g′).

One can also scale g(0)f − f(0)g by other nonzero constants. For example, for typical fields k, one can quickly find "half-size" a, b such that a/b = g(0)/f(0). Computing af − bg may be more efficient than computing g(0)f − f(0)g.

We use the g(0)/f(0) variant in our case study in Section 7 for the field k = F_3. Dividing by f(0) in this field is the same as multiplying by f(0).
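As a quick numerical sanity check of the last scaling rule (an illustration only; it reuses the coefficient-list divstepx sketch given after the divstep definition above, and therefore carries the same assumptions):

# Check divstep(delta, u*f, u*g) = (delta', u*f', u^2*g') over GF(7) with u = 3.
p, u = 7, 3
f = [2, 0, 1, 1]            # 2 + x^2 + x^3, f(0) != 0
g = [3, 1, 4, 1]
d1, f1, g1 = divstepx(1, f, g, p)
d2, f2, g2 = divstepx(1, [u*c % p for c in f], [u*c % p for c in g], p)
assert d2 == d1
assert f2 == [u*c % p for c in f1]
assert g2 == [u*u*c % p for c in g1]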

4 Iterates of x-adic division steps

Starting from (δ, f, g) ∈ Z × k[[x]]^* × k[[x]], define (δ_n, f_n, g_n) = divstep^n(δ, f, g) ∈ Z × k[[x]]^* × k[[x]] for each n ≥ 0. Also define T_n = T(δ_n, f_n, g_n) ∈ M_2(k[1/x]) and S_n = S(δ_n, f_n, g_n) ∈ M_2(Z) for each n ≥ 0. Our main algorithm relies on various simple mathematical properties of these objects; this section states those properties. Table 4.1 shows all δ_n, f_n, g_n for one example of (δ, f, g).

Theorem 4.2. (f_n; g_n) = T_{n−1} ··· T_m (f_m; g_m) and (1; δ_n) = S_{n−1} ··· S_m (1; δ_m) if n ≥ m ≥ 0.


Proof. This follows from (f_{m+1}; g_{m+1}) = T_m (f_m; g_m) and (1; δ_{m+1}) = S_m (1; δ_m).

Theorem 4.3. T_{n−1} ··· T_m ∈ [[k + ··· + (1/x^{n−m−1})k, k + ··· + (1/x^{n−m−1})k], [(1/x)k + ··· + (1/x^{n−m})k, (1/x)k + ··· + (1/x^{n−m})k]] if n > m ≥ 0.

A product of n − m of these T matrices can thus be stored using 4(n − m) coefficients from k. Beware that the theorem statement relies on n > m; for n = m the product is the identity matrix.

Proof. If n − m = 1 then the product is T_m, which is in [[k, k], [(1/x)k, (1/x)k]] by definition. For n − m > 1, assume inductively that

T_{n−1} ··· T_{m+1} ∈ [[k + ··· + (1/x^{n−m−2})k, k + ··· + (1/x^{n−m−2})k], [(1/x)k + ··· + (1/x^{n−m−1})k, (1/x)k + ··· + (1/x^{n−m−1})k]].

Multiply on the right by T_m ∈ [[k, k], [(1/x)k, (1/x)k]]. The top entries of the product are in (k + ··· + (1/x^{n−m−2})k) + ((1/x)k + ··· + (1/x^{n−m−1})k) = k + ··· + (1/x^{n−m−1})k as claimed, and the bottom entries are in ((1/x)k + ··· + (1/x^{n−m−1})k) + ((1/x^2)k + ··· + (1/x^{n−m})k) = (1/x)k + ··· + (1/x^{n−m})k as claimed.

Theorem 4.4. S_{n−1} ··· S_m ∈ [[1, 0], [{2 − (n − m), ..., n − m − 2, n − m}, {1, −1}]] if n > m ≥ 0.

In other words, the bottom-left entry is between −(n − m) (exclusive) and n − m (inclusive), and has the same parity as n − m. There are exactly 2(n − m) possibilities for the product. Beware that the theorem statement again relies on n > m.

Proof. If n − m = 1 then {2 − (n − m), ..., n − m − 2, n − m} = {1}, and the product is S_m, which is [[1, 0], [1, ±1]] by definition.

For n − m > 1, assume inductively that

S_{n−1} ··· S_{m+1} ∈ [[1, 0], [{2 − (n − m − 1), ..., n − m − 3, n − m − 1}, {1, −1}]],

and multiply on the right by S_m = [[1, 0], [1, ±1]]. The top two entries of the product are again 1 and 0. The bottom-left entry is in {2 − (n − m − 1), ..., n − m − 3, n − m − 1} + {1, −1} = {2 − (n − m), ..., n − m − 2, n − m}. The bottom-right entry is again 1 or −1.

The next theorem compares δ_m, f_m, g_m, T_m, S_m to δ′_m, f′_m, g′_m, T′_m, S′_m, defined in the same way starting from (δ′, f′, g′) ∈ Z × k[[x]]^* × k[[x]].

Theorem 4.5. Let m, t be nonnegative integers. Assume that δ′_m = δ_m, f′_m ≡ f_m (mod x^t), and g′_m ≡ g_m (mod x^t). Then, for each integer n with m < n ≤ m + t: T′_{n−1} = T_{n−1}, S′_{n−1} = S_{n−1}, δ′_n = δ_n, f′_n ≡ f_n (mod x^{t−(n−m−1)}), and g′_n ≡ g_n (mod x^{t−(n−m)}).

In other words, δ_m and the first t coefficients of f_m and g_m determine all of the following:

• the matrices T_m, T_{m+1}, ..., T_{m+t−1};

• the matrices S_m, S_{m+1}, ..., S_{m+t−1};

• δ_{m+1}, ..., δ_{m+t};

• the first t coefficients of f_{m+1}, the first t − 1 coefficients of f_{m+2}, the first t − 2 coefficients of f_{m+3}, and so on through the first coefficient of f_{m+t};

• the first t − 1 coefficients of g_{m+1}, the first t − 2 coefficients of g_{m+2}, the first t − 3 coefficients of g_{m+3}, and so on through the first coefficient of g_{m+t−1}.

Our main algorithm computes (δ_n, f_n mod x^{t−(n−m−1)}, g_n mod x^{t−(n−m)}) for any selected n ∈ {m + 1, ..., m + t}, along with the product of the matrices T_m, ..., T_{n−1}. See Section 5.

Proof. See Figure 3.3 for the intuition. Formally, fix m, t and induct on n. Observe that (δ′_{n−1}, f′_{n−1}(0), g′_{n−1}(0)) equals (δ_{n−1}, f_{n−1}(0), g_{n−1}(0)):

• If n = m + 1: By assumption f′_m ≡ f_m (mod x^t), and n ≤ m + t, so t ≥ 1, so in particular f′_m(0) = f_m(0). Similarly g′_m(0) = g_m(0). By assumption δ′_m = δ_m.

• If n > m + 1: f′_{n−1} ≡ f_{n−1} (mod x^{t−(n−m−2)}) by the inductive hypothesis, and n ≤ m + t, so t − (n − m − 2) ≥ 2, so in particular f′_{n−1}(0) = f_{n−1}(0); similarly g′_{n−1} ≡ g_{n−1} (mod x^{t−(n−m−1)}) by the inductive hypothesis, and t − (n − m − 1) ≥ 1, so in particular g′_{n−1}(0) = g_{n−1}(0); and δ′_{n−1} = δ_{n−1} by the inductive hypothesis.

Now use the fact that T and S inspect only the first coefficient of each series to see that

T′_{n−1} = T(δ′_{n−1}, f′_{n−1}, g′_{n−1}) = T(δ_{n−1}, f_{n−1}, g_{n−1}) = T_{n−1},
S′_{n−1} = S(δ′_{n−1}, f′_{n−1}, g′_{n−1}) = S(δ_{n−1}, f_{n−1}, g_{n−1}) = S_{n−1},

as claimed.

Multiply (1; δ′_{n−1}) = (1; δ_{n−1}) by S′_{n−1} = S_{n−1} to see that δ′_n = δ_n as claimed.

By the inductive hypothesis, (T′_m, ..., T′_{n−2}) = (T_m, ..., T_{n−2}), and T′_{n−1} = T_{n−1}, so T′_{n−1} ··· T′_m = T_{n−1} ··· T_m. Abbreviate T_{n−1} ··· T_m as P. Then (f′_n; g′_n) = P (f′_m; g′_m) and (f_n; g_n) = P (f_m; g_m) by Theorem 4.2, so (f′_n − f_n; g′_n − g_n) = P (f′_m − f_m; g′_m − g_m).

By Theorem 4.3, P ∈ [[k + ··· + (1/x^{n−m−1})k, k + ··· + (1/x^{n−m−1})k], [(1/x)k + ··· + (1/x^{n−m})k, (1/x)k + ··· + (1/x^{n−m})k]]. By assumption, f′_m − f_m and g′_m − g_m are multiples of x^t. Multiply to see that f′_n − f_n is a multiple of x^t/x^{n−m−1} = x^{t−(n−m−1)} as claimed, and g′_n − g_n is a multiple of x^t/x^{n−m} = x^{t−(n−m)} as claimed.

5 Fast computation of iterates of x-adic division steps

Our goal in this section is to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). We are given the power series f, g to precision t, and we compute f_n, g_n to precision t − n + 1, t − n respectively if n ≥ 1; see Theorem 4.5. We also compute the n-step transition matrix T_{n−1} ··· T_0.

We begin with an algorithm that simply computes one divstep iterate after another, multiplying by scalars in k and dividing by x, exactly as in the divstep definition. We specify this algorithm in Figure 5.1 as a function divstepsx in the Sage [58] computer-algebra system. The quantities u, v, q, r ∈ k[1/x] inside the algorithm keep track of the coefficients of the current f, g in terms of the original f, g.

The rest of this section considers (1) constant-time computation, (2) "jumping" through division steps, and (3) further techniques to save time inside the computations.

5.2 Constant-time computation. The underlying Sage functions do not take constant time, but they can be replaced with functions that do take constant time. The case distinction can be replaced with constant-time bit operations, for example obtaining the new (δ, f, g) as follows:


def divstepsx(n,t,delta,f,g):
  assert t >= n and n >= 0
  f,g = f.truncate(t),g.truncate(t)
  kx = f.parent()
  x = kx.gen()
  u,v,q,r = kx(1),kx(0),kx(0),kx(1)

  while n > 0:
    f = f.truncate(t)
    if delta > 0 and g[0] != 0: delta,f,g,u,v,q,r = -delta,g,f,q,r,u,v
    f0,g0 = f[0],g[0]
    delta,g,q,r = 1+delta,(f0*g-g0*f)/x,(f0*q-g0*u)/x,(f0*r-g0*v)/x
    n,t = n-1,t-1
    g = kx(g).truncate(t)

  M2kx = MatrixSpace(kx.fraction_field(),2)
  return delta,f,g,M2kx((u,v,q,r))

Figure 5.1: Algorithm divstepsx to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). Inputs: n, t ∈ Z with 0 ≤ n ≤ t; δ ∈ Z; f, g ∈ k[[x]] to precision at least t. Outputs: δ_n; f_n to precision t if n = 0, or t − (n − 1) if n ≥ 1; g_n to precision t − n; T_{n−1} ··· T_0.

• Let s be the "sign bit" of −δ, meaning 0 if −δ ≥ 0 or 1 if −δ < 0.

• Let z be the "nonzero bit" of g(0), meaning 0 if g(0) = 0 or 1 if g(0) ≠ 0.

• Compute and output 1 + (1 − 2sz)δ. For example, shift δ to obtain 2δ, "AND" with −sz to obtain 2szδ, and subtract from 1 + δ.

• Compute and output f ⊕ sz(f ⊕ g), obtaining f if sz = 0 or g if sz = 1.

• Compute and output (1 − 2sz)(f(0)g − g(0)f)/x. Alternatively, compute and output simply (f(0)g − g(0)f)/x; see the discussion of decomposition and scaling in Section 3.

Standard techniques minimize the exact cost of these computations for various representations of the inputs. The details depend on the platform. See our case study in Section 7.
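To make the bit-level description above concrete, here is a small Python sketch (ours, and illustrative only: Python integers are not a constant-time data type, and real implementations compute the nonzero bit z from the bits of g(0) rather than with a comparison). It assumes a 64-bit two's-complement view of δ and of one word of coefficients.

# Branchless controls for one division step, following the bullets above.
MASK64 = (1 << 64) - 1

def divstep_controls(delta, z):
    # s: sign bit of -delta, i.e. 1 exactly when delta > 0 (valid for |delta| < 2^63)
    s = ((-delta) >> 63) & 1
    sz = s & z
    # new delta = 1 + (1 - 2*s*z)*delta, computed as (1 + delta) - (2*delta AND -sz)
    new_delta = (1 + delta) - ((2 * delta) & -sz)
    return sz, new_delta

def select_word(sz, f_word, g_word):
    # f XOR sz*(f XOR g): returns f_word if sz = 0, g_word if sz = 1
    mask = (-sz) & MASK64
    return f_word ^ (mask & (f_word ^ g_word))

The same selection is applied word by word across all stored coefficients of f and g.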

5.3 Jumping through division steps. Here is a more general divide-and-conquer algorithm to compute (δ_n, f_n, g_n) and the n-step transition matrix T_{n−1} ··· T_0:

• If n ≤ 1: use the definition of divstep and stop.

• Choose j ∈ {1, 2, ..., n − 1}.

• Jump j steps from δ, f, g to δ_j, f_j, g_j. Specifically, compute the j-step transition matrix T_{j−1} ··· T_0, and then multiply by (f; g) to obtain (f_j; g_j). To compute the j-step transition matrix, call the same algorithm recursively. The critical point here is that this recursive call uses the inputs to precision only j; this determines the j-step transition matrix by Theorem 4.5.

• Similarly, jump another n − j steps from δ_j, f_j, g_j to δ_n, f_n, g_n.


from divstepsx import divstepsx

def jumpdivstepsx(n,t,delta,f,g):
  assert t >= n and n >= 0
  kx = f.truncate(t).parent()

  if n <= 1: return divstepsx(n,t,delta,f,g)

  j = n//2

  delta,f1,g1,P1 = jumpdivstepsx(j,j,delta,f,g)
  f,g = P1*vector((f,g))
  f,g = kx(f).truncate(t-j),kx(g).truncate(t-j)

  delta,f2,g2,P2 = jumpdivstepsx(n-j,n-j,delta,f,g)
  f,g = P2*vector((f,g))
  f,g = kx(f).truncate(t-n+1),kx(g).truncate(t-n)

  return delta,f,g,P2*P1

Figure 5.4: Algorithm jumpdivstepsx to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). Same inputs and outputs as in Figure 5.1.

This splits an (n, t) problem into two smaller problems, namely a (j, j) problem and an (n − j, n − j) problem, at the cost of O(1) polynomial multiplications involving O(t + n) coefficients.

If j is always chosen as 1, then this algorithm ends up performing the same computations as Figure 5.1: compute δ_1, f_1, g_1 to precision t − 1; compute δ_2, f_2, g_2 to precision t − 2; etc. But there are many more possibilities for j.

Our jumpdivstepsx algorithm in Figure 5.4 takes j = ⌊n/2⌋, balancing the (j, j) problem with the (n − j, n − j) problem. In particular, with subquadratic polynomial multiplication the algorithm is subquadratic. With FFT-based polynomial multiplication, the algorithm uses (t + n)(log(t + n))^{1+o(1)} + n(log n)^{2+o(1)} operations, and hence n(log n)^{2+o(1)} operations if t is bounded by n(log n)^{1+o(1)}. For comparison, Figure 5.1 takes Θ(n(t + n)) operations.

We emphasize that, unlike standard fast-gcd algorithms such as [66], this algorithm does not call a polynomial-division subroutine between its two recursive calls.

5.5 More speedups. Our jumpdivstepsx algorithm is asymptotically Θ(log n) times slower than fast polynomial multiplication. The best previous gcd algorithms also have a Θ(log n) slowdown, both in the worst case and in the average case.10 This might suggest that there is very little room for improvement.

However, Θ says nothing about constant factors. There are several standard ideas for constant-factor speedups in gcd computation, beyond lower-level speedups in multiplication. We now apply the ideas to jumpdivstepsx.

The final polynomial multiplications in jumpdivstepsx produce six large outputs: f, g, and four entries of a matrix product P. Callers do not always use all of these outputs; for example, the recursive calls to jumpdivstepsx do not use the resulting f and g.

10 There are faster cases. Strassen [66] gave an algorithm that replaces the log factor with the entropy of the list of degrees in the corresponding continued fraction; there are occasional inputs where this entropy is o(log n). For example, computing a gcd where one input is 0 takes linear time. However, these speedups are not useful for constant-time computations.


In principle, there is no difficulty applying dead-value elimination and higher-level optimizations to each of the 63 possible combinations of desired outputs.

The recursive calls in jumpdivstepsx produce two products P_1, P_2 of transition matrices, which are then (if desired) multiplied to obtain P = P_2 P_1. In Figure 5.4, jumpdivstepsx multiplies f and g first by P_1 and then by P_2 to obtain f_n and g_n. An alternative is to multiply by P.

The inputs f and g are truncated to t coefficients, and the resulting f_n and g_n are truncated to t − n + 1 and t − n coefficients respectively. Often the caller (for example, jumpdivstepsx itself) wants more coefficients of P(f; g). Rather than performing an entirely new multiplication by P, one can add the truncation errors back into the results: multiply P by the f and g truncation errors, multiply P_2 by the f_j and g_j truncation errors, and skip the final truncations of f_n and g_n.

There are standard speedups for multiplication in the context of truncated products; sums of products (e.g., in matrix multiplication); partially known products (e.g., the coefficients of x^{−1}, x^{−2}, x^{−3}, ... in P_1(f; g) are known to be 0); repeated inputs (e.g., each entry of P_1 is multiplied by two entries of P_2 and by one of f, g); and outputs being reused as inputs. See, e.g., the discussions of "FFT addition", "FFT caching", and "FFT doubling" in [11].

Any choice of j between 1 and n − 1 produces correct results. Values of j close to the edges do not take advantage of fast polynomial multiplication, but this does not imply that j = ⌊n/2⌋ is optimal. The number of combinations of choices of j and other options above is small enough that, for each small (t, n) in turn, one can simply measure all options and select the best. The results depend on the context (which outputs are needed) and on the exact performance of polynomial multiplication.

6 Fast polynomial gcd computation and modular inversion

This section presents three algorithms: gcdx computes gcd{R_0, R_1}, where R_0 is a polynomial of degree d > 0 and R_1 is a polynomial of degree <d; gcddegree computes merely deg gcd{R_0, R_1}; recipx computes the reciprocal of R_1 modulo R_0 if gcd{R_0, R_1} = 1. Figure 6.1 specifies all three algorithms, and Theorem 6.2 justifies all three algorithms. The theorem is proven in Appendix A.

Each algorithm calls divstepsx from Section 5 to compute (δ_{2d−1}, f_{2d−1}, g_{2d−1}) = divstep^{2d−1}(1, f, g). Here f = x^d R_0(1/x) is the reversal of R_0, and g = x^{d−1} R_1(1/x). Of course, one can replace divstepsx here with jumpdivstepsx, which has the same outputs. The algorithms vary in the input-precision parameter t passed to divstepsx, and vary in how they use the outputs:

• gcddegree uses input precision t = 2d − 1 to compute δ_{2d−1}, and then uses one part of Theorem 6.2, which states that deg gcd{R_0, R_1} = δ_{2d−1}/2.

• gcdx uses precision t = 3d − 1 to compute δ_{2d−1} and d + 1 coefficients of f_{2d−1}. The algorithm then uses another part of Theorem 6.2, which states that f_{2d−1}/f_{2d−1}(0) is the reversal of gcd{R_0, R_1}.

• recipx uses t = 2d − 1 to compute δ_{2d−1}, 1 coefficient of f_{2d−1}, and the product P of transition matrices. It then uses the last part of Theorem 6.2, which (in the coprime case) states that the top-right corner of P is f_{2d−1}(0)/x^{2d−2} times the degree-(d − 1) reversal of the desired reciprocal.

One can generalize to compute gcd{R_0, R_1} for any polynomials R_0, R_1 of degree ≤ d, in time that depends only on d: scan the inputs to find the top degree, and always perform 2d iterations of divstep.


from divstepsx import divstepsx

def gcddegree(R0,R1):
  d = R0.degree()
  assert d > 0 and d > R1.degree()
  f,g = R0.reverse(d),R1.reverse(d-1)
  delta,f,g,P = divstepsx(2*d-1,2*d-1,1,f,g)
  return delta//2

def gcdx(R0,R1):
  d = R0.degree()
  assert d > 0 and d > R1.degree()
  f,g = R0.reverse(d),R1.reverse(d-1)
  delta,f,g,P = divstepsx(2*d-1,3*d-1,1,f,g)
  return f.reverse(delta//2)/f[0]

def recipx(R0,R1):
  d = R0.degree()
  assert d > 0 and d > R1.degree()
  f,g = R0.reverse(d),R1.reverse(d-1)
  delta,f,g,P = divstepsx(2*d-1,2*d-1,1,f,g)
  if delta != 0: return
  kx = f.parent()
  x = kx.gen()
  return kx(x^(2*d-2)*P[0][1]/f[0]).reverse(d-1)

Figure 6.1: Algorithm gcdx to compute gcd{R_0, R_1}; algorithm gcddegree to compute deg gcd{R_0, R_1}; algorithm recipx to compute the reciprocal of R_1 modulo R_0 when gcd{R_0, R_1} = 1. All three algorithms assume that R_0, R_1 ∈ k[x], that deg R_0 > 0, and that deg R_0 > deg R_1.
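As a usage illustration (ours, to be run in Sage with Figure 5.1 saved as divstepsx.py and the functions above available), the inputs of the numerical example in Section 6.3 below can be fed directly to these functions; the expected outputs follow from that example and Theorem 6.2.

kx.<x> = GF(7)[]
R0 = 2*x^7 + 7*x^6 + 1*x^5 + 8*x^4 + 2*x^3 + 8*x^2 + 1*x + 8
R1 = 3*x^6 + 1*x^5 + 4*x^4 + 1*x^3 + 5*x^2 + 9*x + 2
print(gcddegree(R0, R1))      # 0, i.e., gcd{R0, R1} = 1
print(gcdx(R0, R1))           # 1
V = recipx(R0, R1)
print((V*R1) % R0)            # 1, since V is the reciprocal of R1 modulo R0 (Theorem 6.2)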

The extra iteration handles the case that both R_0 and R_1 have degree d. For simplicity, we focus on the case deg R_0 = d > deg R_1 with d > 0.

One can similarly characterize gcd and modular reciprocal in terms of subsequent iterates of divstep, with δ increased by the number of extra steps. Implementors might, for example, prefer to use an iteration count that is divisible by a large power of 2.

Theorem 6.2. Let k be a field. Let d be a positive integer. Let R_0, R_1 be elements of the polynomial ring k[x] with deg R_0 = d > deg R_1. Define G = gcd{R_0, R_1}, and let V be the unique polynomial of degree < d − deg G such that V R_1 ≡ G (mod R_0). Define f = x^d R_0(1/x); g = x^{d−1} R_1(1/x); (δ_n, f_n, g_n) = divstep^n(1, f, g); T_n = T(δ_n, f_n, g_n); and [[u_n, v_n], [q_n, r_n]] = T_{n−1} ··· T_0. Then

deg G = δ_{2d−1}/2,
G = x^{deg G} f_{2d−1}(1/x)/f_{2d−1}(0),
V = x^{−d+1+deg G} v_{2d−1}(1/x)/f_{2d−1}(0).

6.3 A numerical example. Take k = F_7, d = 7, R_0 = 2x^7 + 7x^6 + 1x^5 + 8x^4 + 2x^3 + 8x^2 + 1x + 8, and R_1 = 3x^6 + 1x^5 + 4x^4 + 1x^3 + 5x^2 + 9x + 2. Then f = 2 + 7x + 1x^2 + 8x^3 + 2x^4 + 8x^5 + 1x^6 + 8x^7 and g = 3 + 1x + 4x^2 + 1x^3 + 5x^4 + 9x^5 + 2x^6. The gcdx algorithm computes (δ_13, f_13, g_13) = divstep^13(1, f, g) = (0, 2, 6); see Table 4.1


for all iterates of divstep on these inputs. Dividing f_13 by its constant coefficient produces 1, the reversal of gcd{R_0, R_1}. The gcddegree algorithm simply computes δ_13 = 0, which is enough information to see that gcd{R_0, R_1} = 1.

6.4 Constant-factor speedups. Compared to the general power-series setting in divstepsx, the inputs to gcddegree, gcdx, and recipx are more limited. For example, the f input is all 0 after the first d + 1 coefficients, so computations involving subsequent coefficients can be skipped. Similarly, the top-right entry of the matrix P in recipx is guaranteed to have coefficient 0 for 1, 1/x, 1/x^2, and so on through 1/x^{d−2}, so one can save time in the polynomial computations that produce this entry. See Theorems A.1 and A.2 for bounds on the sizes of all intermediate results.

6.5 Bottom-up gcd and inversion. Our division steps eliminate bottom coefficients of the inputs. Appendices B, C, and D relate this to a traditional Euclid–Stevin computation, which gradually eliminates top coefficients of the inputs. The reversal of inputs and outputs in gcddegree, gcdx, and recipx exchanges the role of top coefficients and bottom coefficients. The following theorem states an alternative: skip the reversal, and instead directly eliminate the bottom coefficients of the original inputs.

Theorem 6.6. Let k be a field. Let d be a positive integer. Let f, g be elements of the polynomial ring k[x] with f(0) ≠ 0, deg f ≤ d, and deg g < d. Define γ = gcd{f, g}.

Define (δ_n, f_n, g_n) = divstep^n(1, f, g); T_n = T(δ_n, f_n, g_n); and [[u_n, v_n], [q_n, r_n]] = T_{n−1} ··· T_0. Then

deg γ = deg f_{2d−1},
γ = f_{2d−1}/ℓ,
x^{2d−2} γ ≡ x^{2d−2} v_{2d−1} g/ℓ (mod f),

where ℓ is the leading coefficient of f_{2d−1}.

Proof. Note that γ(0) ≠ 0, since f is a multiple of γ.

Define R_0 = x^d f(1/x). Then R_0 ∈ k[x]; deg R_0 = d since f(0) ≠ 0; and f = x^d R_0(1/x). Define R_1 = x^{d−1} g(1/x). Then R_1 ∈ k[x]; deg R_1 < d; and g = x^{d−1} R_1(1/x).

Define G = gcd{R_0, R_1}. We will show below that γ = c x^{deg G} G(1/x) for some c ∈ k^*. Then γ = c f_{2d−1}/f_{2d−1}(0) by Theorem 6.2, i.e., γ is a constant multiple of f_{2d−1}. Hence deg γ = deg f_{2d−1} as claimed. Furthermore γ has leading coefficient 1, so 1 = cℓ/f_{2d−1}(0), i.e., γ = f_{2d−1}/ℓ as claimed.

By Theorem 4.2, f_{2d−1} = u_{2d−1} f + v_{2d−1} g. Also x^{2d−2} u_{2d−1}, x^{2d−2} v_{2d−1} ∈ k[x] by Theorem 4.3, so

x^{2d−2} γ = (x^{2d−2} u_{2d−1}/ℓ) f + (x^{2d−2} v_{2d−1}/ℓ) g ≡ (x^{2d−2} v_{2d−1}/ℓ) g (mod f)

as claimed.

All that remains is to show that γ = c x^{deg G} G(1/x) for some c ∈ k^*. If R_1 = 0 then G = gcd{R_0, 0} = R_0/R_0[d], so x^{deg G} G(1/x) = x^d R_0(1/x)/R_0[d] = f/f[0]; also g = 0, so γ = f/f[deg f] = (f[0]/f[deg f]) x^{deg G} G(1/x). Assume from now on that R_1 ≠ 0.

We have R_1 = QG for some Q ∈ k[x]. Note that deg R_1 = deg Q + deg G. Also R_1(1/x) = Q(1/x) G(1/x), so x^{deg R_1} R_1(1/x) = x^{deg Q} Q(1/x) · x^{deg G} G(1/x) since deg R_1 = deg Q + deg G. Hence x^{deg G} G(1/x) divides the polynomial x^{deg R_1} R_1(1/x), which in turn divides x^{d−1} R_1(1/x) = g. Similarly x^{deg G} G(1/x) divides f. Hence x^{deg G} G(1/x) divides gcd{f, g} = γ.

Furthermore, G = AR_0 + BR_1 for some A, B ∈ k[x]. Take any integer m above all of deg G, deg A, deg B; then

x^{m+d−deg G} x^{deg G} G(1/x) = x^{m+d} G(1/x)
  = x^m A(1/x) x^d R_0(1/x) + x^{m+1} B(1/x) x^{d−1} R_1(1/x)
  = x^{m−deg A} x^{deg A} A(1/x) f + x^{m+1−deg B} x^{deg B} B(1/x) g,

so x^{m+d−deg G} x^{deg G} G(1/x) is a multiple of γ, so the polynomial γ/(x^{deg G} G(1/x)) divides x^{m+d−deg G}, so γ/(x^{deg G} G(1/x)) = c x^i for some c ∈ k^* and some i ≥ 0. We must have i = 0, since γ(0) ≠ 0.

6.7 How to divide by x^{2d−2} modulo f. Unlike top-down inversion (and top-down gcd and bottom-up gcd), bottom-up inversion naturally scales its result by a power of x. We now explain various ways to remove this scaling.

For simplicity we focus here on the case gcd{f, g} = 1. The divstep computation in Theorem 6.6 naturally produces the polynomial x^{2d−2} v_{2d−1}/ℓ, which has degree at most d − 1 by Theorem 6.2. Our objective is to divide this polynomial by x^{2d−2} modulo f, obtaining the reciprocal of g modulo f by Theorem 6.6. One way to do this is to multiply by a reciprocal of x^{2d−2} modulo f. This reciprocal depends only on d and f, i.e., it can be precomputed before g is available.

Here are several standard strategies to compute 1/x^{2d−2} modulo f, or more generally 1/x^e modulo f for any nonnegative integer e:

• Compute the polynomial (1 − f/f(0))/x, which is 1/x modulo f, and then raise this to the eth power modulo f. This takes a logarithmic number of multiplications modulo f.

• Start from 1/f(0), the reciprocal of f modulo x. Compute the reciprocals of f modulo x^2, x^4, x^8, etc. by Newton (Hensel) iteration. After finding the reciprocal r of f modulo x^e, divide 1 − rf by x^e to obtain the reciprocal of x^e modulo f. This takes a constant number of full-size multiplications, a constant number of half-size multiplications, etc. The total number of multiplications is again logarithmic, but if e is linear, as in the case e = 2d − 2, then the total size of the multiplications is linear, without an extra logarithmic factor.

• Combine the powering strategy with the Hensel strategy. For example, use the reciprocal of f modulo x^{d−1} to compute the reciprocal of x^{d−1} modulo f, and then square modulo f to obtain the reciprocal of x^{2d−2} modulo f, rather than using the reciprocal of f modulo x^{2d−2}.

• Write down a simple formula for the reciprocal of x^e modulo f if f has a special structure that makes this easy. For example, if f = x^d − 1 as in original NTRU, and d ≥ 3, then the reciprocal of x^{2d−2} is simply x^2. As another example, if f = x^d + x^{d−1} + ··· + 1 and d ≥ 5, then the reciprocal of x^{2d−2} is x^4.

In most cases the division by x^{2d−2} modulo f involves extra arithmetic that is avoided by the top-down reversal approach. On the other hand, the case f = x^d − 1 does not involve any extra arithmetic. There are also applications of inversion that can easily handle a scaled result.
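As an illustration of the first strategy above (ours; it assumes Sage, a polynomial f over a field with f(0) ≠ 0, and is a sketch rather than an optimized routine):

def recip_x_power(f, e):
    # 1/x modulo f is (1 - f/f(0))/x: its constant term cancels, and x times it is 1 - f/f(0).
    kx = f.parent()
    x = kx.gen()
    inv_x = kx((1 - f/f[0])/x)
    return power_mod(inv_x, e, f)    # square-and-multiply modulo f (Sage's power_mod)

# Example check over GF(7) with the special case from the last bullet:
# kx.<x> = GF(7)[]; d = 10; assert recip_x_power(x^d - 1, 2*d - 2) == x^2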

7 Software case study for polynomial modular inversion

As a concrete case study, we consider the problem of inversion in the ring (Z/3)[x]/(x^700 + x^699 + ··· + x + 1) on an Intel Haswell CPU core. The situation before our work was that this inversion consumed half of the key-generation time in the ntruhrss701 cryptosystem: about 150000 cycles out of 300000 cycles. This cryptosystem was introduced by Hülsing, Rijneveld, Schanck, and Schwabe [39] at CHES 2017.11

11 NTRU-HRSS is another round-2 submission in NIST's ongoing post-quantum competition. Google has also selected NTRU-HRSS for its "CECPQ2" post-quantum experiment. CECPQ2 includes a slightly modified version of ntruhrss701, and the NTRU-HRSS authors have also announced their intention to tweak NTRU-HRSS; these changes do not affect the performance of key generation.


We saved 60000 cycles out of these 150000 cycles, reducing the total key-generation time to 240000 cycles. Our approach also simplifies the code, for example eliminating the series of conditional power-of-2 rotations in [39].

7.1 Details of the new computation. Our computation works with one integer δ and four polynomials v, r, f, g ∈ (Z/3)[x]. Initially δ = 1; v = 0; r = 1; f = x^700 + x^699 + ··· + x + 1; and g = g_699 x^699 + ··· + g_0 x^0 is obtained by reversing the 700 input coefficients. We actually store −δ instead of δ; this turns out to be marginally more efficient for this platform.

We then apply 2 · 700 − 1 = 1399 iterations of a main loop. Each iteration works as follows, with the order of operations determined experimentally to try to minimize latency issues:

• Replace v with xv.

• Compute a swap mask as −1 if δ > 0 and g(0) ≠ 0, otherwise 0.

• Compute c ∈ Z/3 as f(0)g(0).

• Replace δ with −δ if the swap mask is set.

• Add 1 to δ.

• Replace (f, g) with (g, f) if the swap mask is set.

• Replace g with (g − cf)/x.

• Replace (v, r) with (r, v) if the swap mask is set.

• Replace r with r − cv.

This multiplies the transition matrix T(δ, f, g) into the vector (v/x^{n−1}; r/x^n), where n is the number of iterations, and at the same time replaces (δ, f, g) with divstep(δ, f, g). Actually, we are slightly modifying divstep here (and making the corresponding modification to T), replacing the original coefficients f(0) and g(0) with the scaled coefficients 1 and g(0)/f(0) = f(0)g(0), as in Section 3.4. In other words, if f(0) = −1 then we negate both f and g. We suppress further comments on this deviation from the definition of divstep.

The effect of n iterations is to replace the original (δ, f, g) with divstep^n(δ, f, g). The matrix product T_{n−1} ··· T_0 has the form [[·, v/x^{n−1}], [·, r/x^n]]. The new f and g are congruent to v/x^{n−1} and r/x^n, respectively, times the original g, modulo x^700 + x^699 + ··· + x + 1.

After 1399 iterations we have δ = 0 if and only if the input is invertible modulo x^700 + x^699 + ··· + x + 1, by Theorem 6.2. Furthermore, if δ = 0, then the inverse is x^{−699} v_{1399}(1/x)/f(0), where v_{1399} = v/x^{1398}; i.e., the inverse is x^{699} v(1/x)/f(0). We thus multiply v by f(0) and take the coefficients of x^0, ..., x^699 in reverse order to obtain the desired inverse.

We use signed representatives −1, 0, 1 for elements of Z/3. We use 2-bit two's-complement representations of −1, 0, 1: the bottom bit is 1, 0, 1 respectively, and the top bit is 1, 0, 0. We wrote a simple superoptimizer to find optimal sequences of 3, 6, 6 bit operations, respectively, for multiplication, addition, and subtraction. We do not claim novelty for the approach in this paragraph: Boothby and Bradshaw in [20, Section 4.1] suggested the same 2-bit representation for Z/3 (without explaining it as signed two's-complement) and found the same operation counts by a similar search.

We store each of the four polynomials v, r, f, g as six 256-bit vectors: for example, the coefficients of x^0, ..., x^255 in v are stored as a 256-bit vector of bottom bits and a 256-bit vector of top bits. Within each 256-bit vector, we use the first 64-bit word to store the coefficients of x^0, x^4, ..., x^252; the next 64-bit word to store the coefficients of x^1, x^5, ..., x^253; etc. Multiplying by x thus moves the first 64-bit word to the second, moves the second to the third, moves the third to the fourth, moves the fourth to the first, and shifts the first by 1 bit; the bit that falls off the end of the first word is inserted into the next vector.

The first 256 iterations involve only the first 256-bit vectors for v and r, so we simply leave the second and third vectors as 0. The next 256 iterations involve only the first two 256-bit vectors for v and r. Similarly, we manipulate only the first 256-bit vectors for f and g in the final 256 iterations, and we manipulate only the first and second 256-bit vectors for f and g in the previous 256 iterations. Our main loop thus has five sections: 256 iterations involving (1, 1, 3, 3) vectors for (v, r, f, g); 256 iterations involving (2, 2, 3, 3) vectors; 375 iterations involving (3, 3, 3, 3) vectors; 256 iterations involving (3, 3, 2, 2) vectors; and 256 iterations involving (3, 3, 1, 1) vectors. This makes the code longer but saves almost 20% of the vector manipulations.

72 Comparison to the old computation Many features of our computation arealso visible in the software from [39] We emphasize the ways that our computation ismore streamlined

There are essentially the same four polynomials v r f g in [39] (labeled ldquocrdquo ldquobrdquo ldquogrdquoldquofrdquo) Instead of a single integer δ (plus a loop counter) there are four integers ldquodegfrdquoldquodeggrdquo ldquokrdquo and ldquodonerdquo (plus a loop counter) One cannot simply eliminate ldquodegfrdquo andldquodeggrdquo in favor of their difference δ since ldquodonerdquo is set when ldquodegfrdquo reaches 0 and ldquodonerdquoinfluences ldquokrdquo which in turn influences the results of the computation the final result ismultiplied by x701minusk

Multiplication by x^{701−k} is a conceptually simple matter of rotating by k positions within 701 positions and then reducing modulo x^700 + x^699 + ··· + x + 1, since x^701 − 1 is a multiple of x^700 + x^699 + ··· + x + 1. However, since k is a variable, the obvious method of rotating by k positions does not take constant time. [39] instead performs conditional rotations by 1 position, 2 positions, 4 positions, etc., where the condition bits are the bits of k.

The inputs and outputs are not reversed in [39]. This affects the power of x multiplied into the final results. Our approach skips the powers of x entirely.

Each polynomial in [39] is represented as six 256-bit vectors, with the five-section pattern skipping manipulations of some vectors as explained above. The order of coefficients in each vector is simply x^0, x^1, ...; multiplication by x in [39] uses more arithmetic instructions than our approach does.

At a lower level, [39] uses unsigned representatives 0, 1, 2 of Z/3, and uses a manually designed sequence of 19 bit operations for a multiply-add operation in Z/3. Langley [43] suggested a sequence of 16 bit operations, or 14 if an "and-not" operation is available. As in [20], we use just 9 bit operations.

The inversion software from [39] is 798 lines (32273 bytes) of Python generating assembly, producing 26634 bytes of object code. Our inversion software is 541 lines (14675 bytes) of C with intrinsics, compiled with gcc 8.2.1 to generate assembly, producing 10545 bytes of object code.

7.3 More NTRU examples. More than 1/3 of the key-generation time in [39] is consumed by a second inversion, which replaces the modulus 3 with 2^13. This is handled in [39] with an initial inversion modulo 2, followed by 8 multiplications modulo 2^13 for a Newton iteration.^12 The initial inversion modulo 2 is handled in [39] by exponentiation,

^12 This use of Newton iteration follows the NTRU key-generation strategy suggested in [61], but we point out a difference in the details. All 8 multiplications in [39] are carried out modulo 2^13, while [61] instead follows the traditional precision doubling in Newton iteration: compute an inverse modulo 2^2, then 2^4, then 2^8 (or 2^7), then 2^13. Perhaps limiting the precision of the first 6 multiplications would save time in [39].


taking just 10332 cycles; the point here is that squaring modulo 2 is particularly efficient.

Much more inversion time is spent in key generation in sntrup4591761, a slightly larger

cryptosystem from the NTRU Prime [14] family.^13 NTRU Prime uses a prime modulus, such as 4591 for sntrup4591761, to avoid concerns that a power-of-2 modulus could allow attacks (see generally [14]), but this modulus also makes arithmetic slower. Before our work, the key-generation software for sntrup4591761 took 6 million cycles, partly for inversion in (Z/3)[x]/(x^761 − x − 1) but mostly for inversion in (Z/4591)[x]/(x^761 − x − 1). We reimplemented both of these inversion steps, reducing the total key-generation time to just 940852 cycles.^14 Almost all of our inversion code modulo 3 is shared between sntrup4591761 and ntruhrss701.
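As a generic illustration of the precision-doubling Newton (Hensel) iteration mentioned in footnote 12 (our own scalar sketch; [39] and [61] apply the same recurrence with polynomial arithmetic), an inverse modulo 2^13 of an odd integer can be lifted from the trivial inverse modulo 2:

```c
#include <stdint.h>
#include <assert.h>

/* Hensel/Newton lifting of an inverse modulo powers of 2 (scalar sketch).
   If a*x = 1 (mod 2^k) then x*(2 - a*x) is an inverse of a modulo 2^(2k),
   so each iteration doubles the precision. */
static uint32_t inv_mod_8192(uint32_t a) {
  assert(a & 1);
  uint32_t x = 1;                    /* inverse modulo 2^1 */
  for (int bits = 1; bits < 13; bits *= 2)
    x = x * (2 - a * x);             /* now correct modulo 2^(2*bits) */
  return x & 0x1fff;                 /* reduce modulo 2^13 */
}
```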

7.4 Does lattice-based cryptography need fast inversion? Lyubashevsky, Peikert, and Regev [46] published an NTRU alternative that avoids inversions.^15 However, this cryptosystem has slower encryption than NTRU, has slower decryption than NTRU, and appears to be covered by a 2010 patent [35] by Gaborit and Aguilar-Melchor. It is easy to envision applications that will (1) select NTRU and (2) benefit from faster inversions in NTRU key generation.^16

Lyubashevsky and Seiler have very recently announced "NTTRU" [47], a variant of NTRU with a ring chosen to allow very fast NTT-based (FFT-based) multiplication and division. This variant triggers many of the security concerns described in [14], but if the variant survives security analysis then it will eliminate concerns about key-generation speed in NTRU.

8 Definition of 2-adic division steps

Define divstep: $\mathbb{Z}\times\mathbb{Z}_2^*\times\mathbb{Z}_2\to\mathbb{Z}\times\mathbb{Z}_2^*\times\mathbb{Z}_2$ as follows:

$$\mathrm{divstep}(\delta,f,g)=\begin{cases}(1-\delta,\;g,\;(g-f)/2)&\text{if }\delta>0\text{ and }g\text{ is odd},\\(1+\delta,\;f,\;(g+(g\bmod 2)f)/2)&\text{otherwise}.\end{cases}$$

Note that g − f is even in the first case (since both f and g are odd), and g + (g mod 2)f is even in the second case (since f is odd).
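For concreteness, one division step can be transcribed directly from this definition; the following C sketch (our own names, word-sized inputs, written with a branch for readability rather than in constant time) mirrors the two cases:

```c
#include <stdint.h>

/* One 2-adic division step on word-sized integers (f must be odd). */
static void divstep_once(int64_t *delta, int64_t *f, int64_t *g) {
  int64_t d = *delta, f0 = *f, g0 = *g;
  if (d > 0 && (g0 & 1)) {
    *delta = 1 - d;
    *f = g0;
    *g = (g0 - f0) / 2;              /* g - f is even, so this is exact */
  } else {
    *delta = 1 + d;
    /* f unchanged */
    *g = (g0 + (g0 & 1) * f0) / 2;   /* g + (g mod 2)f is even */
  }
}
```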

8.1 Transition matrices. Write (δ_1, f_1, g_1) = divstep(δ, f, g). Then

$$\begin{pmatrix}f_1\\g_1\end{pmatrix}=T(\delta,f,g)\begin{pmatrix}f\\g\end{pmatrix}\quad\text{and}\quad\begin{pmatrix}1\\\delta_1\end{pmatrix}=S(\delta,f,g)\begin{pmatrix}1\\\delta\end{pmatrix}$$

^13 NTRU Prime is another round-2 NIST submission. One other NTRU-based encryption system, NTRUEncrypt [29], was submitted to the competition but has been merged into NTRU-HRSS for round 2; see [59]. For completeness we mention that NTRUEncrypt key generation took 980760 cycles for ntrukem443 and 2639756 cycles for ntrukem743. As far as we know, the NTRUEncrypt software was not designed to run in constant time.

^14 This is a median cycle count from the SUPERCOP benchmarking framework. There is considerable variation in the cycle counts, more than 3% between quartiles, presumably depending on the mapping from virtual addresses to physical addresses. There is no dependence on the secret input being inverted.

^15 The LPR cryptosystem is the basis for several round-2 NIST submissions: Kyber, LAC, NewHope, Round5, Saber, ThreeBears, and the "NTRU LPRime" option in NTRU Prime.

^16 Google, for example, selected NTRU-HRSS for CECPQ2, as noted above, with the following explanation: "Schemes with a quotient-style key (like HRSS) will probably have faster encap/decap operations at the cost of much slower key-generation. Since there will be many uses outside TLS where keys can be reused, this is interesting as long as the key-generation speed is still reasonable for TLS." See [42].


where T: $\mathbb{Z}\times\mathbb{Z}_2^*\times\mathbb{Z}_2\to M_2(\mathbb{Z}[1/2])$ is defined by

$$T(\delta,f,g)=\begin{cases}\begin{pmatrix}0&1\\-\tfrac12&\tfrac12\end{pmatrix}&\text{if }\delta>0\text{ and }g\text{ is odd},\\[2ex]\begin{pmatrix}1&0\\\tfrac{g\bmod 2}{2}&\tfrac12\end{pmatrix}&\text{otherwise},\end{cases}$$

and S: $\mathbb{Z}\times\mathbb{Z}_2^*\times\mathbb{Z}_2\to M_2(\mathbb{Z})$ is defined by

$$S(\delta,f,g)=\begin{cases}\begin{pmatrix}1&0\\1&-1\end{pmatrix}&\text{if }\delta>0\text{ and }g\text{ is odd},\\[2ex]\begin{pmatrix}1&0\\1&1\end{pmatrix}&\text{otherwise}.\end{cases}$$

Note that both T(δ, f, g) and S(δ, f, g) are defined entirely by δ, the bottom bit of f (which is always 1), and the bottom bit of g; they do not depend on the remaining bits of f and g.

8.2 Decomposition. This 2-adic divstep, like the power-series divstep, is a composition of two simpler functions: a conditional swap that replaces (δ, f, g) with (−δ, g, −f) if δ > 0 and g is odd, followed by an elimination that replaces (δ, f, g) with (1 + δ, f, (g + (g mod 2)f)/2).

8.3 Sizes. Write (δ_1, f_1, g_1) = divstep(δ, f, g). If f and g are integers that fit into n + 1 bits two's-complement, in the sense that −2^n ≤ f < 2^n and −2^n ≤ g < 2^n, then also −2^n ≤ f_1 < 2^n and −2^n ≤ g_1 < 2^n. Similarly, if −2^n < f < 2^n and −2^n < g < 2^n, then −2^n < f_1 < 2^n and −2^n < g_1 < 2^n. Beware, however, that intermediate results such as g − f need an extra bit.

As in the polynomial case, an important feature of divstep is that iterating divstep on integer inputs eventually sends g to 0. Concretely, we will show that (D + o(1))n iterations are sufficient, where D = 2/(10 − log_2 633) = 2.882100569...; see Theorem 11.2 for exact bounds. The proof technique cannot do better than (d + o(1))n iterations, where d = 14/(15 − log_2(561 + √249185)) = 2.828339631... Numerical evidence is consistent with the possibility that (d + o(1))n iterations are required in the worst case, which is what matters in the context of constant-time algorithms.

8.4 A non-functioning variant. Many variations are possible in the definition of divstep. However, some care is required, as the following variant illustrates.

Define posdivstep: $\mathbb{Z}\times\mathbb{Z}_2^*\times\mathbb{Z}_2\to\mathbb{Z}\times\mathbb{Z}_2^*\times\mathbb{Z}_2$ as follows:

$$\mathrm{posdivstep}(\delta,f,g)=\begin{cases}(1-\delta,\;g,\;(g+f)/2)&\text{if }\delta>0\text{ and }g\text{ is odd},\\(1+\delta,\;f,\;(g+(g\bmod 2)f)/2)&\text{otherwise}.\end{cases}$$

This is just like divstep, except that it eliminates the negation of f in the swapped case. It is not true that iterating posdivstep on integer inputs eventually sends g to 0: for example, posdivstep^2(1, 1, 1) = (1, 1, 1). For comparison, in the polynomial case we were free to negate f (or g) at any moment.

8.5 A centered variant. As another variant of divstep, we define cdivstep: $\mathbb{Z}\times\mathbb{Z}_2^*\times\mathbb{Z}_2\to\mathbb{Z}\times\mathbb{Z}_2^*\times\mathbb{Z}_2$ as follows:

$$\mathrm{cdivstep}(\delta,f,g)=\begin{cases}(1-\delta,\;g,\;(f-g)/2)&\text{if }\delta>0\text{ and }g\text{ is odd},\\(1+\delta,\;f,\;(g+(g\bmod 2)f)/2)&\text{if }\delta=0,\\(1+\delta,\;f,\;(g-(g\bmod 2)f)/2)&\text{otherwise}.\end{cases}$$


Iterating cdivstep produces the same intermediate results as a variable-time gcd algorithm introduced by Stehlé and Zimmermann in [62]. The analogous gcd algorithm for divstep is a new variant of the Stehlé–Zimmermann algorithm, and we analyze the performance of this variant to prove bounds on the performance of divstep; see Appendices E, F, and G. As an analogy, one of our proofs in the polynomial case relies on relating x-adic divstep to the Euclid–Stevin algorithm.

The extra case distinctions make each cdivstep iteration more complicated than each divstep iteration. Similarly, the Stehlé–Zimmermann algorithm is more complicated than the new variant. To understand the motivation for cdivstep, compare divstep^3(−2, f, g) to cdivstep^3(−2, f, g). One has divstep^3(−2, f, g) = (1, f, g_3) where

8g_3 ∈ {g, g + f, g + 2f, g + 3f, g + 4f, g + 5f, g + 6f, g + 7f}.

One has cdivstep^3(−2, f, g) = (1, f, g′_3) where

8g′_3 ∈ {g − 3f, g − 2f, g − f, g, g + f, g + 2f, g + 3f, g + 4f}.

It seems intuitively clear that the "centered" output g′_3 is smaller than the "uncentered" output g_3, that more generally cdivstep has smaller outputs than divstep, and that cdivstep thus uses fewer iterations than divstep.

However, our analyses do not support this intuition. All available evidence indicates that, surprisingly, the worst case of cdivstep uses almost 10% more iterations than the worst case of divstep. Concretely, our proof technique cannot do better than (c + o(1))n iterations for cdivstep, where c = 2/(log_2(√17 − 1) − 1) = 3.110510062..., and numerical evidence is consistent with the possibility that (c + o(1))n iterations are required.

8.6 A plus-or-minus variant. As yet another example, the following variant of divstep is essentially^17 the Brent–Kung "Algorithm PM" from [24]:

$$\mathrm{pmdivstep}(\delta,f,g)=\begin{cases}(-\delta,\;g,\;(f-g)/2)&\text{if }\delta\ge 0\text{ and }(g-f)\bmod 4=0,\\(-\delta,\;g,\;(f+g)/2)&\text{if }\delta\ge 0\text{ and }(g+f)\bmod 4=0,\\(\delta,\;f,\;(g-f)/2)&\text{if }\delta<0\text{ and }(g-f)\bmod 4=0,\\(\delta,\;f,\;(g+f)/2)&\text{if }\delta<0\text{ and }(g+f)\bmod 4=0,\\(1+\delta,\;f,\;g/2)&\text{otherwise}.\end{cases}$$

There are even more cases here than in cdivstep, although splitting out an initial conditional swap reduces the number of cases to 3. Another complication here is that the linear combinations used in pmdivstep depend on the bottom two bits of f and g, rather than just the bottom bit.

To understand the motivation for pmdivstep, note that after each addition or subtraction there are always at least two halvings, as in the "sliding window" approach to exponentiation. Again it seems intuitively clear that this reduces the number of iterations required. Consider, however, the following example (which is from [24, Theorem 3] modulo minor details): pmdivstep needs 3b − 1 iterations to reach g = 0 starting from (1, 1, 3 · 2^{b−2}). Specifically, the first b − 2 iterations produce

(2, 1, 3 · 2^{b−3}), (3, 1, 3 · 2^{b−4}), ..., (b − 2, 1, 3 · 2), (b − 1, 1, 3),

and then the next 2b + 1 iterations produce

(1 − b, 3, 2), (2 − b, 3, 1), (2 − b, 3, 2), (3 − b, 3, 1), (3 − b, 3, 2), ..., (−1, 3, 1), (−1, 3, 2), (0, 3, 1), (0, 1, 2), (1, 1, 1), (−1, 1, 0).

^17 The Brent–Kung algorithm has an extra loop to allow the case that both inputs f, g are even. Compare [24, Figure 4] to the definition of pmdivstep.


Brent and Kung prove that ⌈cn⌉ systolic cells are enough for their algorithm, where c = 3.110510062... is the constant mentioned above, and they conjecture that (3 + o(1))n cells are enough, so presumably (3 + o(1))n iterations of pmdivstep are enough. Our analysis shows that fewer iterations are sufficient for divstep, even though divstep is simpler than pmdivstep.

9 Iterates of 2-adic division steps

The following results are analogous to the x-adic results in Section 4. We state these results separately to support verification.

Starting from (δ, f, g) ∈ $\mathbb{Z}\times\mathbb{Z}_2^*\times\mathbb{Z}_2$, define (δ_n, f_n, g_n) = divstep^n(δ, f, g) ∈ $\mathbb{Z}\times\mathbb{Z}_2^*\times\mathbb{Z}_2$, T_n = T(δ_n, f_n, g_n) ∈ $M_2(\mathbb{Z}[1/2])$, and S_n = S(δ_n, f_n, g_n) ∈ $M_2(\mathbb{Z})$.

Theorem 9.1. $\begin{pmatrix}f_n\\g_n\end{pmatrix}=T_{n-1}\cdots T_m\begin{pmatrix}f_m\\g_m\end{pmatrix}$ and $\begin{pmatrix}1\\\delta_n\end{pmatrix}=S_{n-1}\cdots S_m\begin{pmatrix}1\\\delta_m\end{pmatrix}$ if n ≥ m ≥ 0.

Proof. This follows from $\begin{pmatrix}f_{m+1}\\g_{m+1}\end{pmatrix}=T_m\begin{pmatrix}f_m\\g_m\end{pmatrix}$ and $\begin{pmatrix}1\\\delta_{m+1}\end{pmatrix}=S_m\begin{pmatrix}1\\\delta_m\end{pmatrix}$.

Theorem 9.2. $T_{n-1}\cdots T_m\in\begin{pmatrix}\frac{1}{2^{n-m-1}}\mathbb{Z}&\frac{1}{2^{n-m-1}}\mathbb{Z}\\[1ex]\frac{1}{2^{n-m}}\mathbb{Z}&\frac{1}{2^{n-m}}\mathbb{Z}\end{pmatrix}$ if n > m ≥ 0.

Proof. As in Theorem 4.3.

Theorem 9.3. $S_{n-1}\cdots S_m\in\begin{pmatrix}1&0\\\{2-(n-m),\,\dots,\,n-m-2,\,n-m\}&\{1,-1\}\end{pmatrix}$ if n > m ≥ 0.

Proof. As in Theorem 4.4.

Theorem 9.4. Let m, t be nonnegative integers. Assume that δ′_m = δ_m, f′_m ≡ f_m (mod 2^t), and g′_m ≡ g_m (mod 2^t). Then, for each integer n with m < n ≤ m + t: T′_{n−1} = T_{n−1}, S′_{n−1} = S_{n−1}, δ′_n = δ_n, f′_n ≡ f_n (mod 2^{t−(n−m−1)}), and g′_n ≡ g_n (mod 2^{t−(n−m)}).

Proof. Fix m, t and induct on n. Observe that (δ′_{n−1}, f′_{n−1} mod 2, g′_{n−1} mod 2) equals (δ_{n−1}, f_{n−1} mod 2, g_{n−1} mod 2):

• If n = m + 1: By assumption f′_m ≡ f_m (mod 2^t), and n ≤ m + t, so t ≥ 1, so in particular f′_m mod 2 = f_m mod 2. Similarly g′_m mod 2 = g_m mod 2. By assumption δ′_m = δ_m.

• If n > m + 1: f′_{n−1} ≡ f_{n−1} (mod 2^{t−(n−m−2)}) by the inductive hypothesis, and n ≤ m + t, so t − (n − m − 2) ≥ 2, so in particular f′_{n−1} mod 2 = f_{n−1} mod 2; similarly g′_{n−1} ≡ g_{n−1} (mod 2^{t−(n−m−1)}) by the inductive hypothesis, and t − (n − m − 1) ≥ 1, so in particular g′_{n−1} mod 2 = g_{n−1} mod 2; and δ′_{n−1} = δ_{n−1} by the inductive hypothesis.

Now use the fact that T and S inspect only the bottom bit of each number to see that

T′_{n−1} = T(δ′_{n−1}, f′_{n−1}, g′_{n−1}) = T(δ_{n−1}, f_{n−1}, g_{n−1}) = T_{n−1},
S′_{n−1} = S(δ′_{n−1}, f′_{n−1}, g′_{n−1}) = S(δ_{n−1}, f_{n−1}, g_{n−1}) = S_{n−1}

as claimed.

Multiply $\begin{pmatrix}1\\\delta'_{n-1}\end{pmatrix}=\begin{pmatrix}1\\\delta_{n-1}\end{pmatrix}$ by S′_{n−1} = S_{n−1} to see that δ′_n = δ_n as claimed.


```
def truncate(f,t):
    if t == 0: return 0
    twot = 1<<(t-1)
    return ((f+twot) & (2*twot-1)) - twot

def divsteps2(n,t,delta,f,g):
    assert t >= n and n >= 0
    f,g = truncate(f,t),truncate(g,t)
    u,v,q,r = 1,0,0,1

    while n > 0:
        f = truncate(f,t)
        if delta > 0 and g&1: delta,f,g,u,v,q,r = -delta,g,-f,q,r,-u,-v
        g0 = g&1
        delta,g,q,r = 1+delta,(g+g0*f)/2,(q+g0*u)/2,(r+g0*v)/2
        n,t = n-1,t-1
        g = truncate(ZZ(g),t)

    M2Q = MatrixSpace(QQ,2)
    return delta,f,g,M2Q((u,v,q,r))
```

Figure 10.1: Algorithm divsteps2 to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). Inputs: n, t ∈ Z with 0 ≤ n ≤ t; δ ∈ Z; at least bottom t bits of f, g ∈ Z_2. Outputs: δ_n; bottom t bits of f_n if n = 0, or t − (n − 1) bits if n ≥ 1; bottom t − n bits of g_n; T_{n−1} ··· T_0.

By the inductive hypothesis (T′_m, ..., T′_{n−2}) = (T_m, ..., T_{n−2}), and T′_{n−1} = T_{n−1}, so T′_{n−1} ··· T′_m = T_{n−1} ··· T_m. Abbreviate T_{n−1} ··· T_m as P. Then

$$\begin{pmatrix}f'_n\\g'_n\end{pmatrix}=P\begin{pmatrix}f'_m\\g'_m\end{pmatrix}\quad\text{and}\quad\begin{pmatrix}f_n\\g_n\end{pmatrix}=P\begin{pmatrix}f_m\\g_m\end{pmatrix}$$

by Theorem 9.1, so

$$\begin{pmatrix}f'_n-f_n\\g'_n-g_n\end{pmatrix}=P\begin{pmatrix}f'_m-f_m\\g'_m-g_m\end{pmatrix}.$$

By Theorem 9.2, $P\in\begin{pmatrix}\frac{1}{2^{n-m-1}}\mathbb{Z}&\frac{1}{2^{n-m-1}}\mathbb{Z}\\[1ex]\frac{1}{2^{n-m}}\mathbb{Z}&\frac{1}{2^{n-m}}\mathbb{Z}\end{pmatrix}$. By assumption, f′_m − f_m and g′_m − g_m are multiples of 2^t. Multiply to see that f′_n − f_n is a multiple of 2^t/2^{n−m−1} = 2^{t−(n−m−1)}, as claimed, and g′_n − g_n is a multiple of 2^t/2^{n−m} = 2^{t−(n−m)}, as claimed.

10 Fast computation of iterates of 2-adic division steps

We describe, just as in Section 5, two algorithms to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). We are given the bottom t bits of f, g, and compute f_n, g_n to t − n + 1, t − n bits respectively if n ≥ 1; see Theorem 9.4. We also compute the product of the relevant T matrices. We specify these algorithms in Figures 10.1–10.2 as functions divsteps2 and jumpdivsteps2 in Sage.

The first function, divsteps2, simply takes divsteps one after another, analogously to divstepsx. The second function, jumpdivsteps2, uses a straightforward divide-and-conquer strategy, analogously to jumpdivstepsx. More generally, j in jumpdivsteps2 can be taken anywhere between 1 and n − 1; this generalization includes divsteps2 as a special case.

As in Section 5, the Sage functions are not constant-time, but it is easy to build constant-time versions of the same computations. The comments on performance in Section 5 are applicable here, with integer arithmetic replacing polynomial arithmetic. The same basic optimization techniques are also applicable here.
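To indicate what such a constant-time version looks like, here is a branchless sketch of one 2-adic division step on 64-bit words (our own illustration, tracking only (δ, f, g); the production code of Section 12 additionally tracks a scaled transition matrix):

```c
#include <stdint.h>

/* One branchless 2-adic divstep on 64-bit words (f odd), with f and g held
   as two's-complement values in uint64_t.  The mask "swap" is all ones
   exactly when the definition swaps; every operation is executed for every
   input, so there are no secret-dependent branches. */
static void divstep_ct(int64_t *delta, uint64_t *f, uint64_t *g)
{
  int64_t  d  = *delta;
  uint64_t f0 = *f, g0 = *g;
  uint64_t swap = (uint64_t)(-(int64_t)((uint64_t)(d > 0) & (g0 & 1)));

  /* conditional swap: (delta,f,g) <- (-delta,g,-f) when swap is all ones */
  d = (int64_t)((swap & (0 - (uint64_t)d)) | (~swap & (uint64_t)d));
  uint64_t f1 = (swap & g0) | (~swap & f0);
  g0 = (swap & (0 - f0)) | (~swap & g0);

  /* elimination: g <- (g + (g mod 2)*f)/2, delta <- 1 + delta */
  uint64_t odd = (uint64_t)(-(int64_t)(g0 & 1));
  *delta = 1 + d;
  *f = f1;
  /* the sum below is even; arithmetic right shift halves it exactly
     (assuming the usual sign-propagating shift on signed integers) */
  *g = (uint64_t)((int64_t)(g0 + (odd & f1)) >> 1);
}
```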


```
from divsteps2 import divsteps2,truncate

def jumpdivsteps2(n,t,delta,f,g):
    assert t >= n and n >= 0
    if n <= 1: return divsteps2(n,t,delta,f,g)

    j = n//2

    delta,f1,g1,P1 = jumpdivsteps2(j,j,delta,f,g)
    f,g = P1*vector((f,g))
    f,g = truncate(ZZ(f),t-j),truncate(ZZ(g),t-j)

    delta,f2,g2,P2 = jumpdivsteps2(n-j,n-j,delta,f,g)
    f,g = P2*vector((f,g))
    f,g = truncate(ZZ(f),t-n+1),truncate(ZZ(g),t-n)

    return delta,f,g,P2*P1
```

Figure 10.2: Algorithm jumpdivsteps2 to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). Same inputs and outputs as in Figure 10.1.

11 Fast integer gcd computation and modular inversion

This section presents two algorithms. If f is an odd integer and g is an integer then gcd2 computes gcd{f, g}. If also gcd{f, g} = 1 then recip2 computes the reciprocal of g modulo f. Figure 11.1 specifies both algorithms, and Theorem 11.2 justifies both algorithms.

The algorithms take the smallest nonnegative integer d such that |f| < 2^d and |g| < 2^d. More generally, one can take d as an extra input; the algorithms then take constant time if d is constant. Theorem 11.2 relies on f^2 + 4g^2 ≤ 5 · 2^{2d} (which is certainly true if |f| < 2^d and |g| < 2^d) but does not rely on d being minimal. One can further generalize to allow even f by first finding the number of shared powers of 2 in f and g and then reducing to the odd case.

The algorithms then choose a positive integer m as a particular function of d, and call divsteps2 (which can be transparently replaced with jumpdivsteps2) to compute (δ_m, f_m, g_m) = divstep^m(1, f, g). The choice of m guarantees g_m = 0 and f_m = ± gcd{f, g} by Theorem 11.2.

The gcd2 algorithm uses input precision m + d to obtain d + 1 bits of f_m, i.e., to obtain the signed remainder of f_m modulo 2^{d+1}. This signed remainder is exactly f_m = ± gcd{f, g}, since the assumptions |f| < 2^d and |g| < 2^d imply |gcd{f, g}| < 2^d.

The recip2 algorithm uses input precision just m + 1 to obtain the signed remainder of f_m modulo 4. This algorithm assumes gcd{f, g} = 1, so f_m = ±1, so the signed remainder is exactly f_m. The algorithm also extracts the top-right corner v_m of the transition matrix, and multiplies by f_m, obtaining a reciprocal of g modulo f by Theorem 11.2.

Both gcd2 and recip2 are bottom-up algorithms, and in particular recip2 naturally obtains this reciprocal v_m f_m as a fraction with denominator 2^{m−1}. It multiplies by 2^{m−1} to obtain an integer, and then multiplies by ((f + 1)/2)^{m−1} modulo f to divide by 2^{m−1} modulo f. The reciprocal of 2^{m−1} can be computed ahead of time. Another way to compute this reciprocal is through Hensel lifting to compute f^{−1} (mod 2^{m−1}); for details and further techniques see Section 6.7.
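As a tiny numerical check of this trick (our example, not from the paper): take f = 7 and m = 3, so the precomputed constant is

$$\Big(\tfrac{f+1}{2}\Big)^{m-1}=4^2=16\equiv 2\pmod 7,\qquad 2^{m-1}\cdot 2=4\cdot 2=8\equiv 1\pmod 7,$$

so multiplying by 2 indeed divides by 2^{m−1} = 4 modulo 7.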

We emphasize that recip2 assumes that its inputs are coprime; it does not notice if this assumption is violated. One can check the assumption by computing f_m to higher precision (as in gcd2), or by multiplying the supposed reciprocal by g and seeing whether


```
from divsteps2 import divsteps2

def iterations(d):
    return (49*d+80)//17 if d<46 else (49*d+57)//17

def gcd2(f,g):
    assert f & 1
    d = max(f.nbits(),g.nbits())
    m = iterations(d)
    delta,fm,gm,P = divsteps2(m,m+d,1,f,g)
    return abs(fm)

def recip2(f,g):
    assert f & 1
    d = max(f.nbits(),g.nbits())
    m = iterations(d)
    precomp = Integers(f)((f+1)/2)^(m-1)
    delta,fm,gm,P = divsteps2(m,m+1,1,f,g)
    V = sign(fm)*ZZ(P[0][1]*2^(m-1))
    return ZZ(V*precomp)
```

Figure 11.1: Algorithm gcd2 to compute gcd{f, g}; algorithm recip2 to compute the reciprocal of g modulo f when gcd{f, g} = 1. Both algorithms assume that f is odd.

the result is 1 modulo f. For comparison, the recip algorithm from Section 6 does notice if its polynomial inputs are coprime, using a quick test of δ_m.

Theorem 11.2. Let f be an odd integer. Let g be an integer. Define (δ_n, f_n, g_n) = divstep^n(1, f, g), T_n = T(δ_n, f_n, g_n), and $\begin{pmatrix}u_n&v_n\\q_n&r_n\end{pmatrix}=T_{n-1}\cdots T_0$. Let d be a real number. Assume that f^2 + 4g^2 ≤ 5 · 2^{2d}. Let m be an integer. Assume that m ≥ ⌊(49d + 80)/17⌋ if d < 46, and that m ≥ ⌊(49d + 57)/17⌋ if d ≥ 46. Then m ≥ 1, g_m = 0, f_m = ± gcd{f, g}, 2^{m−1}v_m ∈ Z, and 2^{m−1}v_m g ≡ 2^{m−1}f_m (mod f).

In particular, if gcd{f, g} = 1, then f_m = ±1, and dividing 2^{m−1}v_m f_m by 2^{m−1} modulo f produces the reciprocal of g modulo f.

Proof. First f^2 ≥ 1, so 1 ≤ f^2 + 4g^2 ≤ 5 · 2^{2d} < 2^{2d+7/3} since 5^3 < 2^7. Hence d > −7/6. If d < 46 then m ≥ ⌊(49d + 80)/17⌋ ≥ ⌊(49 · (−7/6) + 80)/17⌋ = 1. If d ≥ 46 then m ≥ ⌊(49d + 57)/17⌋ ≥ ⌊(49 · 46 + 57)/17⌋ = 135. Either way m ≥ 1 as claimed.

Define R_0 = f, R_1 = 2g, and G = gcd{R_0, R_1}. Then G = gcd{f, g} since f is odd. Define b = log_2 √(R_0^2 + R_1^2) = log_2 √(f^2 + 4g^2) ≥ 0. Then 2^{2b} = f^2 + 4g^2 ≤ 5 · 2^{2d}, so b ≤ d + log_4 5. We now split into three cases, in each case defining n as in Theorem G.6:

• If b ≤ 21: Define n = ⌊19b/7⌋. Then n ≤ ⌊49b/17⌋ ≤ ⌊49(d + log_4 5)/17⌋ ≤ ⌊(49d + 57)/17⌋ ≤ m since 5^{49} ≤ 4^{57}.

• If 21 < b ≤ 46: Define n = ⌊(49b + 23)/17⌋. If d ≥ 46 then n ≤ ⌊(49d + 23)/17⌋ ≤ ⌊(49d + 57)/17⌋ ≤ m. Otherwise d < 46, so n ≤ ⌊(49(d + log_4 5) + 23)/17⌋ ≤ ⌊(49d + 80)/17⌋ ≤ m.

• If b > 46: Define n = ⌊49b/17⌋. Then n ≤ ⌊49b/17⌋ ≤ ⌊49(d + log_4 5)/17⌋ ≤ ⌊(49d + 57)/17⌋ ≤ m.


In all cases n ≤ m.

Now divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0} by Theorem G.6, so divstep^m(1, R_0, R_1/2) ∈ Z × Z × {0}, so

(δ_m, f_m, g_m) = divstep^m(1, f, g) = divstep^m(1, R_0, R_1/2) ∈ Z × {G, −G} × {0}

by Theorem G.3. Hence g_m = 0 as claimed, and f_m = ±G = ± gcd{f, g} as claimed.

Both u_m and v_m are in (1/2^{m−1})Z by Theorem 9.2. In particular 2^{m−1}v_m ∈ Z as claimed. Finally f_m = u_m f + v_m g by Theorem 9.1, so 2^{m−1}f_m = 2^{m−1}u_m f + 2^{m−1}v_m g, and 2^{m−1}u_m ∈ Z, so 2^{m−1}f_m ≡ 2^{m−1}v_m g (mod f) as claimed.

12 Software case study for integer modular inversion

As emphasized in Section 1, we have selected an inversion problem to be extremely favorable to Fermat's method. We use Intel platforms with 64-bit multipliers. We focus on the problem of inverting modulo the well-known Curve25519 prime p = 2^255 − 19. There has been a long line of Curve25519 implementation papers, starting with [10] in 2006; we compare to the latest speed records [54] (in assembly) for inversion modulo this prime.

Despite all these Fermat advantages, we achieve better speeds using our 2-adic division steps. As mentioned in Section 1, we take 10050 cycles, 8778 cycles, and 8543 cycles on Haswell, Skylake, and Kaby Lake, where [54] used 11854 cycles, 9301 cycles, and 8971 cycles. This section looks more closely at how the new computation works.

12.1 Details of the new computation. Theorem 11.2 guarantees that 738 divstep iterations suffice, since ⌊(49 · 255 + 57)/17⌋ = 738. To simplify the computation we actually use 744 = 12 · 62 iterations.

The decisions in the first 62 iterations are determined entirely by the bottom 62 bits of the inputs. We perform these iterations entirely in 64-bit words, apply the resulting 62-step transition matrix to the entire original inputs to jump to the result of 62 divstep iterations, and then repeat.

This is similar in two important ways to Lehmer's gcd computation from [44]. One of Lehmer's ideas, mentioned before, is to carry out some steps in limited precision. Another of Lehmer's ideas is for this limit to be a small constant. Our advantage over Lehmer's computation is that we have a completely regular constant-time data flow.

There are many other possibilities for which precision to use at each step. All of these possibilities can be described using the jumping framework described in Section 5.3, which applies to both the x-adic case and the 2-adic case. Figure 12.2 gives three examples of jump strategies. The first strategy computes each divstep in turn, to full precision, as in the case study in Section 7. The second strategy is what we use in this case study. The third strategy illustrates an asymptotically faster choice of jump distances. The third strategy might be more efficient than the second strategy in hardware, but the second strategy seems optimal for 64-bit CPUs.

In more detail, we invert a 255-bit x modulo p as follows:

1. Set f = p, g = x, δ = 1, i = 1.

2. Set f_0 = f mod 2^64, g_0 = g mod 2^64.

3. Compute (δ′, f_1, g_1) = divstep^62(δ, f_0, g_0) and obtain a scaled transition matrix T_i such that

$$\frac{T_i}{2^{62}}\begin{pmatrix}f_0\\g_0\end{pmatrix}=\begin{pmatrix}f_1\\g_1\end{pmatrix}.$$

The 63-bit signed entries of T_i fit into 64-bit registers. We call this step jump64divsteps2.

4. Compute (f, g) ← T_i(f, g)/2^62. Set δ = δ′.


Figure 12.2: Three jump strategies for 744 divstep iterations on 255-bit integers. Left strategy: j = 1, i.e., computing each divstep in turn to full precision. Middle strategy: j = 1 for ≤62 requested iterations, else j = 62; i.e., using 62 bottom bits to compute 62 iterations, jumping 62 iterations, using 62 new bottom bits to compute 62 more iterations, etc. Right strategy: j = 1 for 2 requested iterations, j = 2 for 3 or 4 requested iterations, j = 4 for 5 or 6 or 7 or 8 requested iterations, etc. Vertical axis: 0 on top through 744 on bottom, number of iterations. Horizontal axis (within each strategy): 0 on left through 254 on right, bit positions used in computation.

5. Set i ← i + 1. Go back to step 3 if i ≤ 12.

6. Compute v mod p, where v is the top-right corner of T_12 T_11 ··· T_1:

(a) Compute pair-products T_{2i}T_{2i−1}, with entries being 126-bit signed integers, which fit into two 64-bit limbs (radix 2^64).

(b) Compute quad-products T_{4i}T_{4i−1}T_{4i−2}T_{4i−3}, with entries being 252-bit signed integers (four 64-bit limbs, radix 2^64).

(c) At this point the three quad-products are converted into unsigned integers modulo p, divided into 4 64-bit limbs.

(d) Compute final vector-matrix-vector multiplications modulo p.

7. Compute x^{−1} = sgn(f) · v · 2^{−744} (mod p), where 2^{−744} is precomputed.

12.3 Other CPUs. Our advantage is larger on platforms with smaller multipliers. For example, we have written software to invert modulo 2^255 − 19 on an ARM Cortex-A7, obtaining a median cycle count of 35277 cycles. The best previous Cortex-A7 result we have found in the literature is 62648 cycles, reported by Fujii and Aranha [34] using Fermat's method. Internally, our software does repeated calls to jump32divsteps2, units of 30 iterations, with the resulting transition matrix fitting inside 32-bit general-purpose registers.

12.4 Other primes. Nath and Sarkar present inversion speeds for 20 different special primes, 12 of which were considered in previous work. For concreteness we consider inversion modulo the double-size prime 2^511 − 187. Aranha–Barreto–Pereira–Ricardini [8] selected this prime for their "M-511" curve.

Nath and Sarkar report that their Fermat inversion software for 2^511 − 187 takes 72804 Haswell cycles, 47062 Skylake cycles, or 45014 Kaby Lake cycles. The slowdown from 2^255 − 19, more than 5× in each case, is easy to explain: the exponent is twice as long, producing almost twice as many multiplications; each multiplication handles twice as many bits; and multiplication cost per bit is not constant.

Our 511-bit software is solidly faster: under 30000 cycles on all of these platforms. Most of the cycles in our computation are in evaluating jump64divsteps2, and the number of calls to jump64divsteps2 scales linearly with the number of bits in the input. There is some overhead for multiplications, so a modulus of twice the size takes more than twice as long, but our scalability to large moduli is much better than the Fermat scalability.

Our advantage is also larger in applications that use "random" primes rather than special primes: we pay the extra cost for Montgomery multiplication only at the end of our computation, whereas Fermat inversion pays the extra cost in each step.

References

[1] — (no editor), Actes du congrès international des mathématiciens, tome 3, Gauthier-Villars Éditeur, Paris, 1971. See [41].

[2] mdash (no editor) CWIT 2011 12th Canadian workshop on information theoryKelowna British Columbia May 17ndash20 2011 Institute of Electrical and ElectronicsEngineers 2011 ISBN 978-1-4577-0743-8 See [51]

[3] Carlisle Adams Jan Camenisch (editors) Selected areas in cryptographymdashSAC 201724th international conference Ottawa ON Canada August 16ndash18 2017 revisedselected papers Lecture Notes in Computer Science 10719 Springer 2018 ISBN978-3-319-72564-2 See [15]

[4] Carlos Aguilar Melchor, Nicolas Aragon, Slim Bettaieb, Loïc Bidoux, Olivier Blazy, Jean-Christophe Deneuville, Philippe Gaborit, Edoardo Persichetti, Gilles Zémor, Hamming Quasi-Cyclic (HQC) (2017). URL: https://pqc-hqc.org/doc/hqc-specification_2017-11-30.pdf. Citations in this document: §1.

[5] Alfred V Aho (chairman) Proceedings of fifth annual ACM symposium on theory ofcomputing Austin Texas April 30ndashMay 2 1973 ACM New York 1973 See [53]

[6] Martin Albrecht Carlos Cid Kenneth G Paterson CJ Tjhai Martin TomlinsonNTS-KEM ldquoThe main document submitted to NISTrdquo (2017) URL httpsnts-kemio Citations in this document sect1

[7] Alejandro Cabrera Aldaya, Cesar Pereida García, Luis Manuel Alvarez Tapia, Billy Bob Brumley, Cache-timing attacks on RSA key generation (2018). URL: https://eprint.iacr.org/2018/367. Citations in this document: §1.

[8] Diego F Aranha Paulo S L M Barreto Geovandro C C F Pereira JeffersonE Ricardini A note on high-security general-purpose elliptic curves (2013) URLhttpseprintiacrorg2013647 Citations in this document sect124


[9] Elwyn R Berlekamp Algebraic coding theory McGraw-Hill 1968 Citations in thisdocument sect1

[10] Daniel J Bernstein Curve25519 new Diffie-Hellman speed records in PKC 2006 [70](2006) 207ndash228 URL httpscryptopapershtmlcurve25519 Citations inthis document sect1 sect12

[11] Daniel J Bernstein Fast multiplication and its applications in [27] (2008) 325ndash384 URL httpscryptopapershtmlmultapps Citations in this documentsect55

[12] Daniel J Bernstein Tung Chou Tanja Lange Ingo von Maurich Rafael MisoczkiRuben Niederhagen Edoardo Persichetti Christiane Peters Peter Schwabe NicolasSendrier Jakub Szefer Wen Wang Classic McEliece conservative code-based cryp-tography ldquoSupporting Documentationrdquo (2017) URL httpsclassicmcelieceorgnisthtml Citations in this document sect1

[13] Daniel J Bernstein Tung Chou Peter Schwabe McBits fast constant-time code-based cryptography in CHES 2013 [18] (2013) 250ndash272 URL httpsbinarycryptomcbitshtml Citations in this document sect1 sect1

[14] Daniel J Bernstein Chitchanok Chuengsatiansup Tanja Lange Christine vanVredendaal NTRU Prime reducing attack surface at low cost full version of[15] (2017) URL httpsntruprimecryptopapershtml Citations in thisdocument sect1 sect1 sect11 sect73 sect73 sect74

[15] Daniel J Bernstein Chitchanok Chuengsatiansup Tanja Lange Christine vanVredendaal NTRU Prime reducing attack surface at low cost in SAC 2017 [3]abbreviated version of [14] (2018) 235ndash260

[16] Daniel J Bernstein Niels Duif Tanja Lange Peter Schwabe Bo-Yin Yang High-speed high-security signatures Journal of Cryptographic Engineering 2 (2012)77ndash89 see also older version at CHES 2011 URL httpsed25519cryptoed25519-20110926pdf Citations in this document sect1

[17] Daniel J Bernstein Peter Schwabe NEON crypto in CHES 2012 [56] (2012) 320ndash339URL httpscryptopapershtmlneoncrypto Citations in this documentsect1 sect1

[18] Guido Bertoni, Jean-Sébastien Coron (editors), Cryptographic hardware and embedded systems—CHES 2013—15th international workshop, Santa Barbara, CA, USA, August 20–23, 2013, proceedings, Lecture Notes in Computer Science 8086, Springer, 2013. ISBN 978-3-642-40348-4. See [13].

[19] Adam W Bojanczyk Richard P Brent A systolic algorithm for extended GCDcomputation Computers amp Mathematics with Applications 14 (1987) 233ndash238ISSN 0898-1221 URL httpsmaths-peopleanueduau~brentpdrpb096ipdf Citations in this document sect13 sect13

[20] Tomas J Boothby and Robert W Bradshaw Bitslicing and the method of fourRussians over larger finite fields (2009) URL httparxivorgabs09011413Citations in this document sect71 sect72

[21] Joppe W Bos Constant time modular inversion Journal of Cryptographic Engi-neering 4 (2014) 275ndash281 URL httpjoppeboscomfilesCTInversionpdfCitations in this document sect1 sect1 sect1 sect1 sect1 sect13


[22] Richard P Brent Fred G Gustavson David Y Y Yun Fast solution of Toeplitzsystems of equations and computation of Padeacute approximants Journal of Algorithms1 (1980) 259ndash295 ISSN 0196-6774 URL httpsmaths-peopleanueduau~brentpdrpb059ipdf Citations in this document sect14 sect14

[23] Richard P Brent Hsiang-Tsung Kung Systolic VLSI arrays for polynomial GCDcomputation IEEE Transactions on Computers C-33 (1984) 731ndash736 URLhttpsmaths-peopleanueduau~brentpdrpb073ipdf Citations in thisdocument sect13 sect13 sect32

[24] Richard P Brent Hsiang-Tsung Kung A systolic VLSI array for integer GCDcomputation ARITH-7 (1985) 118ndash125 URL httpsmaths-peopleanueduau~brentpdrpb077ipdf Citations in this document sect13 sect13 sect13 sect13 sect14sect86 sect86 sect86

[25] Richard P Brent Paul Zimmermann An O(M(n) logn) algorithm for the Jacobisymbol in ANTS 2010 [37] (2010) 83ndash95 URL httpsarxivorgpdf10042091pdf Citations in this document sect14

[26] Duncan A Buell (editor) Algorithmic number theory 6th international symposiumANTS-VI Burlington VT USA June 13ndash18 2004 proceedings Lecture Notes inComputer Science 3076 Springer 2004 ISBN 3-540-22156-5 See [62]

[27] Joe P Buhler Peter Stevenhagen (editors) Surveys in algorithmic number theoryMathematical Sciences Research Institute Publications 44 Cambridge UniversityPress New York 2008 See [11]

[28] Wouter Castryck Tanja Lange Chloe Martindale Lorenz Panny Joost RenesCSIDH an efficient post-quantum commutative group action in Asiacrypt 2018[55] (2018) 395ndash427 URL httpscsidhisogenyorgcsidh-20181118pdfCitations in this document sect1

[29] Cong Chen Jeffrey Hoffstein William Whyte Zhenfei Zhang NIST PQ SubmissionNTRUEncrypt A lattice based encryption algorithm (2017) URL httpscsrcnistgovProjectsPost-Quantum-CryptographyRound-1-Submissions Cita-tions in this document sect73

[30] Tung Chou McBits revisited in CHES 2017 [33] (2017) 213ndash231 URL httpstungchougithubiomcbits Citations in this document sect1

[31] Benoît Daireaux, Véronique Maume-Deschamps, Brigitte Vallée, The Lyapunov tortoise and the dyadic hare, in [49] (2005), 71–94. URL: http://emis.ams.org/journals/DMTCS/pdfpapers/dmAD0108.pdf. Citations in this document: §E.5, §F.2.

[32] Jean Louis Dornstetter On the equivalence between Berlekamprsquos and Euclidrsquos algo-rithms IEEE Transactions on Information Theory 33 (1987) 428ndash431 Citations inthis document sect1

[33] Wieland Fischer Naofumi Homma (editors) Cryptographic hardware and embeddedsystemsmdashCHES 2017mdash19th international conference Taipei Taiwan September25ndash28 2017 proceedings Lecture Notes in Computer Science 10529 Springer 2017ISBN 978-3-319-66786-7 See [30] [39]

[34] Hayato Fujii Diego F Aranha Curve25519 for the Cortex-M4 and beyond LATIN-CRYPT 2017 to appear (2017) URL httpswwwlascaicunicampbrmediapublicationspaper39pdf Citations in this document sect11 sect123


[35] Philippe Gaborit Carlos Aguilar Melchor Cryptographic method for communicatingconfidential information patent EP2537284B1 (2010) URL httpspatentsgooglecompatentEP2537284B1en Citations in this document sect74

[36] Mariya Georgieva Freacutedeacuteric de Portzamparc Toward secure implementation ofMcEliece decryption in COSADE 2015 [48] (2015) 141ndash156 URL httpseprintiacrorg2015271 Citations in this document sect1

[37] Guillaume Hanrot, François Morain, Emmanuel Thomé (editors), Algorithmic number theory, 9th international symposium, ANTS-IX, Nancy, France, July 19–23, 2010, proceedings, Lecture Notes in Computer Science 6197, 2010. ISBN 978-3-642-14517-9. See [25].

[38] Agnes E Heydtmann Joslashrn M Jensen On the equivalence of the Berlekamp-Masseyand the Euclidean algorithms for decoding IEEE Transactions on Information Theory46 (2000) 2614ndash2624 Citations in this document sect1

[39] Andreas Huumllsing Joost Rijneveld John M Schanck Peter Schwabe High-speedkey encapsulation from NTRU in CHES 2017 [33] (2017) 232ndash252 URL httpseprintiacrorg2017667 Citations in this document sect1 sect11 sect11 sect13 sect7sect7 sect72 sect72 sect72 sect72 sect72 sect72 sect72 sect72 sect73 sect73 sect73 sect73 sect73

[40] Burton S Kaliski The Montgomery inverse and its applications IEEE Transactionson Computers 44 (1995) 1064ndash1065 Citations in this document sect1

[41] Donald E Knuth The analysis of algorithms in [1] (1971) 269ndash274 Citations inthis document sect1 sect14 sect14

[42] Adam Langley CECPQ2 (2018) URL httpswwwimperialvioletorg20181212cecpq2html Citations in this document sect74

[43] Adam Langley Add initial HRSS support (2018)URL httpsboringsslgooglesourcecomboringssl+7b935937b18215294e7dbf6404742855e3349092cryptohrsshrssc Citationsin this document sect72

[44] Derrick H Lehmer Euclidrsquos algorithm for large numbers American MathematicalMonthly 45 (1938) 227ndash233 ISSN 0002-9890 URL httplinksjstororgsicisici=0002-9890(193804)454lt227EAFLNgt20CO2-Y Citations in thisdocument sect1 sect14 sect121

[45] Xianhui Lu Yamin Liu Dingding Jia Haiyang Xue Jingnan He Zhenfei ZhangLAC lattice-based cryptosystems (2017) URL httpscsrcnistgovProjectsPost-Quantum-CryptographyRound-1-Submissions Citations in this documentsect1

[46] Vadim Lyubashevsky Chris Peikert Oded Regev On ideal lattices and learningwith errors over rings Journal of the ACM 60 (2013) Article 43 35 pages URLhttpseprintiacrorg2012230 Citations in this document sect74

[47] Vadim Lyubashevsky Gregor Seiler NTTRU truly fast NTRU using NTT (2019)URL httpseprintiacrorg2019040 Citations in this document sect74

[48] Stefan Mangard Axel Y Poschmann (editors) Constructive side-channel analysisand secure designmdash6th international workshop COSADE 2015 Berlin GermanyApril 13ndash14 2015 revised selected papers Lecture Notes in Computer Science 9064Springer 2015 ISBN 978-3-319-21475-7 See [36]


[49] Conrado Martínez (editor), 2005 international conference on analysis of algorithms, papers from the conference (AofA'05) held in Barcelona, June 6–10, 2005, The Association Discrete Mathematics & Theoretical Computer Science (DMTCS), Nancy, 2005. URL: http://www.emis.ams.org/journals/DMTCS/proceedings/dmAD01ind.html. See [31].

[50] James Massey Shift-register synthesis and BCH decoding IEEE Transactions onInformation Theory 15 (1969) 122ndash127 ISSN 0018-9448 Citations in this documentsect1

[51] Todd D Mateer On the equivalence of the Berlekamp-Massey and the euclideanalgorithms for algebraic decoding in CWIT 2011 [2] (2011) 139ndash142 Citations inthis document sect1

[52] Niels Moumlller On Schoumlnhagersquos algorithm and subquadratic integer GCD computationMathematics of Computation 77 (2008) 589ndash607 URL httpswwwlysatorliuse~nissearchiveS0025-5718-07-02017-0pdf Citations in this documentsect14

[53] Robert T Moenck Fast computation of GCDs in STOC 1973 [5] (1973) 142ndash151Citations in this document sect14 sect14 sect14 sect14

[54] Kaushik Nath Palash Sarkar Efficient inversion in (pseudo-)Mersenne primeorder fields (2018) URL httpseprintiacrorg2018985 Citations in thisdocument sect11 sect12 sect12

[55] Thomas Peyrin Steven D Galbraith Advances in cryptologymdashASIACRYPT 2018mdash24th international conference on the theory and application of cryptology and infor-mation security Brisbane QLD Australia December 2ndash6 2018 proceedings part ILecture Notes in Computer Science 11272 Springer 2018 ISBN 978-3-030-03325-5See [28]

[56] Emmanuel Prouff Patrick Schaumont (editors) Cryptographic hardware and embed-ded systemsmdashCHES 2012mdash14th international workshop Leuven Belgium September9ndash12 2012 proceedings Lecture Notes in Computer Science 7428 Springer 2012ISBN 978-3-642-33026-1 See [17]

[57] Martin Roetteler Michael Naehrig Krysta M Svore Kristin E Lauter Quantumresource estimates for computing elliptic curve discrete logarithms in ASIACRYPT2017 [67] (2017) 241ndash270 URL httpseprintiacrorg2017598 Citationsin this document sect11 sect13

[58] The Sage Developers (editor) SageMath the Sage Mathematics Software System(Version 80) 2017 URL httpswwwsagemathorg Citations in this documentsect12 sect12 sect22 sect5

[59] John Schanck Announcement of NTRU-HRSS-KEM and NTRUEncrypt merger(2018) URL httpsgroupsgooglecomalistnistgovdmsgpqc-forumSrFO_vK3xbImSmjY0HZCgAJ Citations in this document sect73

[60] Arnold Schönhage, Schnelle Berechnung von Kettenbruchentwicklungen, Acta Informatica 1 (1971), 139–144. ISSN 0001-5903. Citations in this document: §1, §1.4, §1.4.

[61] Joseph H Silverman Almost inverses and fast NTRU key creation NTRU Cryptosys-tems (Technical Note014) (1999) URL httpsassetssecurityinnovationcomstaticdownloadsNTRUresourcesNTRUTech014pdf Citations in this doc-ument sect73 sect73


[62] Damien Stehlé, Paul Zimmermann, A binary recursive gcd algorithm, in ANTS 2004 [26] (2004), 411–425. URL: https://perso.ens-lyon.fr/damien.stehle/BINARY.html. Citations in this document: §1.4, §1.4, §8.5, §E, §E.6, §E.6, §E.6, §F.2, §F.2, §F.2, §H, §H.

[63] Josef Stein Computational problems associated with Racah algebra Journal ofComputational Physics 1 (1967) 397ndash405 Citations in this document sect1

[64] Simon Stevin Lrsquoarithmeacutetique Imprimerie de Christophle Plantin 1585 URLhttpwwwdwcknawnlpubbronnenSimon_Stevin-[II_B]_The_Principal_Works_of_Simon_Stevin_Mathematicspdf Citations in this document sect1

[65] Volker Strassen The computational complexity of continued fractions in SYM-SAC1981 [69] (1981) 51ndash67 see also newer version [66]

[66] Volker Strassen The computational complexity of continued fractions SIAM Journalon Computing 12 (1983) 1ndash27 see also older version [65] ISSN 0097-5397 Citationsin this document sect14 sect14 sect53 sect55

[67] Tsuyoshi Takagi Thomas Peyrin (editors) Advances in cryptologymdashASIACRYPT2017mdash23rd international conference on the theory and applications of cryptology andinformation security Hong Kong China December 3ndash7 2017 proceedings part IILecture Notes in Computer Science 10625 Springer 2017 ISBN 978-3-319-70696-2See [57]

[68] Emmanuel Thomé, Subquadratic computation of vector generating polynomials and improvement of the block Wiedemann algorithm, Journal of Symbolic Computation 33 (2002), 757–775. URL: https://hal.inria.fr/inria-00103417/document. Citations in this document: §1.4.

[69] Paul S Wang (editor) SYM-SAC rsquo81 proceedings of the 1981 ACM Symposium onSymbolic and Algebraic Computation Snowbird Utah August 5ndash7 1981 Associationfor Computing Machinery New York 1981 ISBN 0-89791-047-8 See [65]

[70] Moti Yung Yevgeniy Dodis Aggelos Kiayias Tal Malkin (editors) Public keycryptographymdash9th international conference on theory and practice in public-keycryptography New York NY USA April 24ndash26 2006 proceedings Lecture Notesin Computer Science 3958 Springer 2006 ISBN 978-3-540-33851-2 See [10]

A Proof of main gcd theorem for polynomials

We give two separate proofs of the facts stated earlier relating gcd and modular reciprocal to a particular iterate of divstep in the polynomial case.

• The second proof consists of a review of the Euclid–Stevin gcd algorithm (Appendix B), a description of exactly how the intermediate results in the Euclid–Stevin algorithm relate to iterates of divstep (Appendix C), and finally a translation of gcd/reciprocal results from the Euclid–Stevin context to the divstep context (Appendix D).

• The first proof, given in this appendix, is shorter: it works directly with divstep, without a detour through the Euclid–Stevin labeling of intermediate results.


Theorem A.1. Let k be a field. Let d be a positive integer. Let R_0, R_1 be elements of the polynomial ring k[x] with deg R_0 = d > deg R_1. Define f = x^d R_0(1/x), g = x^{d−1} R_1(1/x), and (δ_n, f_n, g_n) = divstep^n(1, f, g) for n ≥ 0. Then

2 deg f_n ≤ 2d − 1 − n + δ_n for n ≥ 0,
2 deg g_n ≤ 2d − 1 − n − δ_n for n ≥ 0.

Proof Induct on n If n = 0 then (δn fn gn) = (1 f g) so 2 deg fn = 2 deg f le 2d =2dminus 1minus n+ δn and 2 deg gn = 2 deg g le 2(dminus 1) = 2dminus 1minus nminus δn

Assume from now on that n ge 1 By the inductive hypothesis 2 deg fnminus1 le 2dminusn+δnminus1and 2 deg gnminus1 le 2dminus nminus δnminus1

Case 1 δnminus1 le 0 Then

(δn fn gn) = (1 + δnminus1 fnminus1 (fnminus1(0)gnminus1 minus gnminus1(0)fnminus1)x)

Both fnminus1 and gnminus1 have degree at most (2d minus n minus δnminus1)2 so the polynomial xgn =fnminus1(0)gnminus1 minus gnminus1(0)fnminus1 also has degree at most (2d minus n minus δnminus1)2 so 2 deg gn le2dminus nminus δnminus1 minus 2 = 2dminus 1minus nminus δn Also 2 deg fn = 2 deg fnminus1 le 2dminus 1minus n+ δn

Case 2 δnminus1 gt 0 and gnminus1(0) = 0 Then

(δn fn gn) = (1 + δnminus1 fnminus1 fnminus1(0)gnminus1x)

so 2 deg gn = 2 deg gnminus1 minus 2 le 2dminus nminus δnminus1 minus 2 = 2dminus 1minus nminus δn As before 2 deg fn =2 deg fnminus1 le 2dminus 1minus n+ δn

Case 3 δnminus1 gt 0 and gnminus1(0) 6= 0 Then

(δn fn gn) = (1minus δnminus1 gnminus1 (gnminus1(0)fnminus1 minus fnminus1(0)gnminus1)x)

This time the polynomial xgn = gnminus1(0)fnminus1 minus fnminus1(0)gnminus1 also has degree at most(2d minus n + δnminus1)2 so 2 deg gn le 2d minus n + δnminus1 minus 2 = 2d minus 1 minus n minus δn Also 2 deg fn =2 deg gnminus1 le 2dminus nminus δnminus1 = 2dminus 1minus n+ δn

Theorem A2 In the situation of Theorem A1 define Tn = T (δn fn gn) and(un vnqn rn

)= Tnminus1 middot middot middot T0

Then xn(unrnminusvnqn) isin klowast for n ge 0 xnminus1un isin k[x] for n ge 1 xnminus1vn xnqn x

nrn isin k[x]for n ge 0 and

2 deg(xnminus1un) le nminus 3 + δn for n ge 12 deg(xnminus1vn) le nminus 1 + δn for n ge 0

2 deg(xnqn) le nminus 1minus δn for n ge 02 deg(xnrn) le n+ 1minus δn for n ge 0

Proof Each matrix Tn has determinant in (1x)klowast by construction so the product of nsuch matrices has determinant in (1xn)klowast In particular unrn minus vnqn isin (1xn)klowast

Induct on n For n = 0 we have δn = 1 un = 1 vn = 0 qn = 0 rn = 1 soxnminus1vn = 0 isin k[x] xnqn = 0 isin k[x] xnrn = 1 isin k[x] 2 deg(xnminus1vn) = minusinfin le nminus 1 + δn2 deg(xnqn) = minusinfin le nminus 1minus δn and 2 deg(xnrn) = 0 le n+ 1minus δn

Assume from now on that n ge 1 By Theorem 43 un vn isin (1xnminus1)k[x] andqn rn isin (1xn)k[x] so all that remains is to prove the degree bounds

Case 1 δnminus1 le 0 Then δn = 1 + δnminus1 un = unminus1 vn = vnminus1 qn = (fnminus1(0)qnminus1 minusgnminus1(0)unminus1)x and rn = (fnminus1(0)rnminus1 minus gnminus1(0)vnminus1)x


Note that n gt 1 since δ0 gt 0 By the inductive hypothesis 2 deg(xnminus2unminus1) le nminus 4 +δnminus1 so 2 deg(xnminus1un) le nminus 2 + δnminus1 = nminus 3 + δn Also 2 deg(xnminus2vnminus1) le nminus 2 + δnminus1so 2 deg(xnminus1vn) le n+ δnminus1 = nminus 1 + δn

For qn note that both xnminus1qnminus1 and xnminus1unminus1 have degree at most (nminus2minusδnminus1)2 sothe same is true for their k-linear combination xnqn Hence 2 deg(xnqn) le nminus 2minus δnminus1 =nminus 1minus δn Similarly 2 deg(xnrn) le nminus δnminus1 = n+ 1minus δn

Case 2 δnminus1 gt 0 and gnminus1(0) = 0 Then δn = 1 + δnminus1 un = unminus1 vn = vnminus1qn = fnminus1(0)qnminus1x and rn = fnminus1(0)rnminus1x

If n = 1 then un = u0 = 1 so 2 deg(xnminus1un) = 0 = n minus 3 + δn Otherwise2 deg(xnminus2unminus1) le nminus4 + δnminus1 by the inductive hypothesis so 2 deg(xnminus1un) le nminus3 + δnas above

Also 2 deg(xnminus1vn) le nminus 1 + δn as above 2 deg(xnqn) = 2 deg(xnminus1qnminus1) le nminus 2minusδnminus1 = nminus 1minus δn and 2 deg(xnrn) = 2 deg(xnminus1rnminus1) le nminus δnminus1 = n+ 1minus δn

Case 3 δnminus1 gt 0 and gnminus1(0) 6= 0 Then δn = 1 minus δnminus1 un = qnminus1 vn = rnminus1qn = (gnminus1(0)unminus1 minus fnminus1(0)qnminus1)x and rn = (gnminus1(0)vnminus1 minus fnminus1(0)rnminus1)x

If n = 1 then un = q0 = 0 so 2 deg(xnminus1un) = minusinfin le n minus 3 + δn Otherwise2 deg(xnminus1qnminus1) le nminus 2minus δnminus1 by the inductive hypothesis so 2 deg(xnminus1un) le nminus 2minusδnminus1 = nminus 3 + δn Similarly 2 deg(xnminus1rnminus1) le nminus δnminus1 by the inductive hypothesis so2 deg(xnminus1vn) le nminus δnminus1 = nminus 1 + δn

Finally for qn note that both xnminus1unminus1 and xnminus1qnminus1 have degree at most (n minus2 + δnminus1)2 so the same is true for their k-linear combination xnqn so 2 deg(xnqn) lenminus 2 + δnminus1 = nminus 1minus δn Similarly 2 deg(xnrn) le n+ δnminus1 = n+ 1minus δn

First proof of Theorem 62 Write ϕ = f2dminus1 By Theorem A1 2 degϕ le δ2dminus1 and2 deg g2dminus1 le minusδ2dminus1 Each fn is nonzero by construction so degϕ ge 0 so δ2dminus1 ge 0

By Theorem A2 x2dminus2u2dminus1 is a polynomial of degree at most dminus 2 + δ2dminus12 DefineC0 = xminusd+δ2dminus12u2dminus1(1x) then C0 is a polynomial of degree at most dminus 2 + δ2dminus12Similarly define C1 = xminusd+1+δ2dminus12v2dminus1(1x) and observe that C1 is a polynomial ofdegree at most dminus 1 + δ2dminus12

Multiply C0 by R0 = xdf(1x) multiply C1 by R1 = xdminus1g(1x) and add to obtain

C0R0 + C1R1 = xδ2dminus12(u2dminus1(1x)f(1x) + v2dminus1(1x)g(1x))= xδ2dminus12ϕ(1x)

by Theorem 42Define Γ as the monic polynomial xδ2dminus12ϕ(1x)ϕ(0) of degree δ2dminus12 We have

just shown that Γ isin R0k[x] + R1k[x] specifically Γ = (C0ϕ(0))R0 + (C1ϕ(0))R1 Tofinish the proof we will show that R0 isin Γk[x] This forces Γ = gcdR0 R1 = G which inturn implies degG = δ2dminus12 and V = C1ϕ(0)

Observe that δ2d = 1 + δ2dminus1 and f2d = ϕ the ldquoswappedrdquo case is impossible here sinceδ2dminus1 gt 0 implies g2dminus1 = 0 By Theorem A1 again 2 deg g2d le minus1minus δ2d = minus2minus δ2dminus1so g2d = 0

By Theorem 42 ϕ = f2d = u2df + v2dg and 0 = g2d = q2df + r2dg so x2dr2dϕ = ∆fwhere ∆ = x2d(u2dr2dminus v2dq2d) By Theorem A2 ∆ isin klowast and p = x2dr2d is a polynomialof degree at most (2d+ 1minus δ2d)2 = dminus δ2dminus12 The polynomials xdminusδ2dminus12p(1x) andxδ2dminus12ϕ(1x) = Γ have product ∆xdf(1x) = ∆R0 so Γ divides R0 as claimed

B Review of the EuclidndashStevin algorithmThis appendix reviews the EuclidndashStevin algorithm to compute polynomial gcd includingthe extended version of the algorithm that writes each remainder as a linear combinationof the two inputs


Theorem B1 (the EuclidndashStevin algorithm) Let k be a field Let R0 R1 be ele-ments of the polynomial ring k[x] Recursively define a finite sequence of polynomialsR2 R3 Rr isin k[x] as follows if i ge 1 and Ri = 0 then r = i if i ge 1 and Ri 6= 0 thenRi+1 = Riminus1 mod Ri Define di = degRi and `i = Ri[di] Then

bull d1 gt d2 gt middot middot middot gt dr = minusinfin

bull `i 6= 0 if 1 le i le r minus 1

bull if `rminus1 6= 0 then gcdR0 R1 = Rrminus1`rminus1 and

bull if `rminus1 = 0 then R0 = R1 = gcdR0 R1 = 0 and r = 1

Proof If 1 le i le r minus 1 then Ri 6= 0 so `i = Ri[degRi] 6= 0 Also Ri+1 = Riminus1 mod Ri sodegRi+1 lt degRi ie di+1 lt di Hence d1 gt d2 gt middot middot middot gt dr and dr = degRr = deg 0 =minusinfin

If `rminus1 = 0 then r must be 1 so R1 = 0 (since Rr = 0) and R0 = 0 (since `0 = 0) sogcdR0 R1 = 0 Assume from now on that `rminus1 6= 0 The divisibility of Ri+1 minus Riminus1by Ri implies gcdRiminus1 Ri = gcdRi Ri+1 so gcdR0 R1 = gcdR1 R2 = middot middot middot =gcdRrminus1 Rr = gcdRrminus1 0 = Rrminus1`rminus1

Theorem B2 (the extended EuclidndashStevin algorithm) In the situation of Theorem B1define U0 V0 Ur Ur isin k[x] as follows U0 = 1 V0 = 0 U1 = 0 V1 = 1 Ui+1 = Uiminus1minusbRiminus1RicUi for i isin 1 r minus 1 Vi+1 = Viminus1 minus bRiminus1RicVi for i isin 1 r minus 1Then

bull Ri = UiR0 + ViR1 for i isin 0 r

bull UiVi+1 minus Ui+1Vi = (minus1)i for i isin 0 r minus 1

bull Vi+1Ri minus ViRi+1 = (minus1)iR0 for i isin 0 r minus 1 and

bull if d0 gt d1 then deg Vi+1 = d0 minus di for i isin 0 r minus 1

Proof U0R0 + V0R1 = 1R0 + 0R1 = R0 and U1R0 + V1R1 = 0R0 + 1R1 = R1 Fori isin 1 r minus 1 assume inductively that Riminus1 = Uiminus1R0+Viminus1R1 and Ri = UiR0+ViR1and writeQi = bRiminus1Ric Then Ri+1 = Riminus1 mod Ri = Riminus1minusQiRi = (Uiminus1minusQiUi)R0+(Viminus1 minusQiVi)R1 = Ui+1R0 + Vi+1R1

If i = 0 then UiVi+1minusUi+1Vi = 1 middot 1minus 0 middot 0 = 1 = (minus1)i For i isin 1 r minus 1 assumeinductively that Uiminus1Vi minus UiViminus1 = (minus1)iminus1 then UiVi+1 minus Ui+1Vi = Ui(Viminus1 minus QVi) minus(Uiminus1 minusQUi)Vi = minus(Uiminus1Vi minus UiViminus1) = (minus1)i

Multiply Ri = UiR0 + ViR1 by Vi+1 multiply Ri+1 = Ui+1R0 + Vi+1R1 by Ui+1 andsubtract to see that Vi+1Ri minus ViRi+1 = (minus1)iR0

Finally assume d0 gt d1 If i = 0 then deg Vi+1 = deg 1 = 0 = d0 minus di Fori isin 1 r minus 1 assume inductively that deg Vi = d0 minus diminus1 Then deg ViRi+1 lt d0since degRi+1 = di+1 lt diminus1 so deg Vi+1Ri = deg(ViRi+1 +(minus1)iR0) = d0 so deg Vi+1 =d0 minus di

C EuclidndashStevin results as iterates of divstepIf the EuclidndashStevin algorithm is written as a sequence of polynomial divisions andeach polynomial division is written as a sequence of coefficient-elimination steps theneach coefficient-elimination step corresponds to one application of divstep The exactcorrespondence is stated in Theorem C2 This generalizes the known BerlekampndashMasseyequivalence mentioned in Section 1

Theorem C1 is a warmup showing how a single polynomial division relates to asequence of division steps


Theorem C1 (polynomial division as a sequence of division steps) Let k be a fieldLet R0 R1 be elements of the polynomial ring k[x] Write d0 = degR0 and d1 = degR1Assume that d0 gt d1 ge 0 Then

divstep2d0minus2d1(1 xd0R0(1x) xd0minus1R1(1x))= (1 `d0minusd1minus1

0 xd1R1(1x) (`d0minusd1minus10 `1)d0minusd1+1xd1minus1R2(1x))

where `0 = R0[d0] `1 = R1[d1] and R2 = R0 mod R1

Proof Part 1 The first d0 minus d1 minus 1 iterations If δ isin 1 2 d0 minus d1 minus 1 thend0minusδ gt d1 so R1[d0minusδ] = 0 so the polynomial `δminus1

0 xd0minusδR1(1x) has constant coefficient0 The polynomial xd0R0(1x) has constant coefficient `0 Hence

divstep(δ xd0R0(1x) `δminus10 xd0minusδR1(1x)) = (1 + δ xd0R0(1x) `δ0xd0minusδminus1R1(1x))

by definition of divstep By induction

divstepd0minusd1minus1(1 xd0R0(1x) xd0minus1R1(1x))= (d0 minus d1 x

d0R0(1x) `d0minusd1minus10 xd1R1(1x))

= (d0 minus d1 xd0R0(1x) xd1Rprime1(1x))

where Rprime1 = `d0minusd1minus10 R1 Note that Rprime1[d1] = `prime1 where `prime1 = `d0minusd1minus1

0 `1

Part 2 The swap iteration and the last d0 minus d1 iterations Define

Zd0minusd1 = R0 mod xd0minusd1Rprime1Zd0minusd1+1 = R0 mod xd0minusd1minus1Rprime1

Z2d0minus2d1 = R0 mod Rprime1 = R0 mod R1 = R2

ThenZd0minusd1 = R0 minus (R0[d0]`prime1)xd0minusd1Rprime1

Zd0minusd1+1 = Zd0minusd1 minus (Zd0minusd1 [d0 minus 1]`prime1)xd0minusd1minus1Rprime1

Z2d0minus2d1 = Z2d0minus2d1minus1 minus (Z2d0minus2d1minus1[d1]`prime1)Rprime1

Substitute 1x for x to see that

xd0Zd0minusd1(1x) = xd0R0(1x)minus (R0[d0]`prime1)xd1Rprime1(1x)xd0minus1Zd0minusd1+1(1x) = xd0minus1Zd0minusd1(1x)minus (Zd0minusd1 [d0 minus 1]`prime1)xd1Rprime1(1x)

xd1Z2d0minus2d1(1x) = xd1Z2d0minus2d1minus1(1x)minus (Z2d0minus2d1minus1[d1]`prime1)xd1Rprime1(1x)

We will use these equations to show that the d0 minus d1 2d0 minus 2d1 iterates of divstepcompute `prime1xd0minus1Zd0minusd1 (`prime1)d0minusd1+1xd1minus1Z2d0minus2d1 respectively

The constant coefficients of xd0R0(1x) and xd1Rprime1(1x) are R0[d0] and `prime1 6= 0 respec-tively and `prime1xd0R0(1x)minusR0[d0]xd1Rprime1(1x) = `prime1x

d0Zd0minusd1(1x) so

divstep(d0 minus d1 xd0R0(1x) xd1Rprime1(1x))

= (1minus (d0 minus d1) xd1Rprime1(1x) `prime1xd0minus1Zd0minusd1(1x))

40 Fast constant-time gcd computation and modular inversion

The constant coefficients of xd1Rprime1(1x) and `prime1xd0minus1Zd0minusd1(1x) are `prime1 and `prime1Zd0minusd1 [d0minus1]respectively and

(`prime1)2xd0minus1Zd0minusd1(1x)minus `prime1Zd0minusd1 [d0 minus 1]xd1Rprime1(1x) = (`prime1)2xd0minus1Zd0minusd1+1(1x)

sodivstep(1minus (d0 minus d1) xd1Rprime1(1x) `prime1xd0minus1Zd0minusd1(1x))

= (2minus (d0 minus d1) xd1Rprime1(1x) (`prime1)2xd0minus2Zd0minusd1+1(1x))

Continue in this way to see that

divstepi(d0 minus d1 xd0R0(1x) xd1Rprime1(1x))

= (iminus (d0 minus d1) xd1Rprime1(1x) (`prime1)ixd0minusiZd0minusd1+iminus1(1x))

for 1 le i le d0 minus d1 + 1 In particular

divstep2d0minus2d1(1 xd0R0(1x) xd0minus1R1(1x))= divstepd0minusd1+1(d0 minus d1 x

d0R0(1x) xd1Rprime1(1x))= (1 xd1Rprime1(1x) (`prime1)d0minusd1+1xd1minus1R2(1x))= (1 `d0minusd1minus1

0 xd1R1(1x) (`d0minusd1minus10 `1)d0minusd1+1xd1minus1R2(1x))

as claimed

Theorem C2 In the situation of Theorem B2 assume that d0 gt d1 Define f =xd0R0(1x) and g = xd0minus1R1(1x) Then f g isin k[x] Define (δn fn gn) = divstepn(1 f g)and Tn = T (δn fn gn)

Part A If i isin 0 1 r minus 2 and 2d0 minus 2di le n lt 2d0 minus di minus di+1 then

δn = 1 + nminus (2d0 minus 2di) gt 0fn = anx

diRi(1x)gn = bnx

2d0minusdiminusnminus1Ri+1(1x)

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdiUi(1x) an(1x)d0minusdiminus1Vi(1x)bn(1x)nminus(d0minusdi)+1Ui+1(1x) bn(1x)nminus(d0minusdi)Vi+1(1x)

)for some an bn isin klowast

Part B If i isin 0 1 r minus 2 and 2d0 minus di minus di+1 le n lt 2d0 minus 2di+1 then

δn = 1 + nminus (2d0 minus 2di+1) le 0fn = anx

di+1Ri+1(1x)gn = bnx

2d0minusdi+1minusnminus1Zn(1x)

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdi+1Ui+1(1x) an(1x)d0minusdi+1minus1Vi+1(1x)bn(1x)nminus(d0minusdi+1)+1Xn(1x) bn(1x)nminus(d0minusdi+1)Yn(1x)

)for some an bn isin klowast where

Zn = Ri mod x2d0minus2di+1minusnRi+1

Xn = Ui minuslfloorRix

2d0minus2di+1minusnRi+1rfloorx2d0minus2di+1minusnUi+1

Yn = Vi minuslfloorRix

2d0minus2di+1minusnRi+1rfloorx2d0minus2di+1minusnVi+1

Daniel J Bernstein and Bo-Yin Yang 41

Part C If n ge 2d0 minus 2drminus1 then

δn = 1 + nminus (2d0 minus 2drminus1) gt 0fn = anx

drminus1Rrminus1(1x)gn = 0

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdrminus1Urminus1(1x) an(1x)d0minusdrminus1minus1Vrminus1(1x)bn(1x)nminus(d0minusdrminus1)+1Ur(1x) bn(1x)nminus(d0minusdrminus1)Vr(1x)

)for some an bn isin klowast

The proof also gives recursive formulas for an bn namely (a0 b0) = (1 1) forthe base case (an bn) = (bnminus1 gnminus1(0)anminus1) for the swapped case and (an bn) =(anminus1 fnminus1(0)bnminus1) for the unswapped case One can if desired augment divstep totrack an and bn these are the denominators eliminated in a ldquofraction-freerdquo computation

Proof By hypothesis degR0 = d0 gt d1 = degR1 Hence d0 is a nonnegative integer andf = R0[d0] +R0[d0 minus 1]x+ middot middot middot+R0[0]xd0 isin k[x] Similarly g = R1[d0 minus 1] +R1[d0 minus 2]x+middot middot middot+R1[0]xd0minus1 isin k[x] this is also true for d0 = 0 with R1 = 0 and g = 0

We induct on n and split the analysis of n into six cases There are three differentformulas (A B C) applicable to different ranges of n and within each range we considerthe first n separately For brevity we write Ri = Ri(1x) U i = Ui(1x) V i = Vi(1x)Xn = Xn(1x) Y n = Yn(1x) Zn = Zn(1x)

Case A1 n = 2d0 minus 2di for some i isin 0 1 r minus 2If n = 0 then i = 0 By assumption δn = 1 as claimed fn = f = xd0R0 so fn = anx

d0R0as claimed where an = 1 gn = g = xd0minus1R1 so gn = bnx

d0minus1R1 as claimed where bn = 1Also Ui = 1 Ui+1 = 0 Vi = 0 Vi+1 = 1 and d0 minus di = 0 so(

an(1x)d0minusdiU i an(1x)d0minusdiminus1V ibn(1x)d0minusdi+1U i+1 bn(1x)d0minusdiV i+1

)=(

1 00 1

)= Tnminus1 middot middot middot T0

as claimedOtherwise n gt 0 so i gt 0 Now 2d0minusdiminus1minusdi le nminus 1 lt 2d0minus 2di so by the inductive

hypothesis δnminus1 = 1 + (nminus 1)minus (2d0 minus 2di) = 0 fnminus1 = anminus1xdiRi for some anminus1 isin klowast

gnminus1 = bnminus1xdiZnminus1 for some bnminus1 isin klowast where Znminus1 = Riminus1 mod xRi and

Tnminus2 middot middot middot T0 =(

anminus1(1x)d0minusdiU i anminus1(1x)d0minusdiminus1V ibnminus1(1x)d0minusdiXnminus1 bnminus1(1x)d0minusdiminus1Y nminus1

)where Xn = Uiminus1 minus bRiminus1xRicxUi and Yn = Viminus1 minus bRiminus1xRicxVi

Note that degZnminus1 le di Observe that

Ri+1 = Riminus1 mod Ri = Znminus1 minus (Znminus1[di]`i)RiUi+1 = Xnminus1 minus (Znminus1[di]`i)UiVi+1 = Ynminus1 minus (Znminus1[di]`i)Vi

The point is that subtracting (Znminus1[di]`i)Ri from Znminus1 eliminates the coefficient of xdi

and produces a polynomial of degree at most diminus 1 namely Znminus1 mod Ri = Riminus1 mod RiStarting from Ri+1 = Znminus1 minus (Znminus1[di]`i)Ri substitute 1x for x and multiply by

anminus1bnminus1`ixdi to see that

anminus1bnminus1`ixdiRi+1 = anminus1`ignminus1 minus bnminus1Znminus1[di]fnminus1

= fnminus1(0)gnminus1 minus gnminus1(0)fnminus1

42 Fast constant-time gcd computation and modular inversion

Now(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1)

= (1 + δnminus1 fnminus1 (fnminus1(0)gnminus1 minus gnminus1(0)fnminus1)x)

Hence δn = 1 = 1 + nminus (2d0 minus 2di) gt 0 as claimed fn = anxdiRi where an = anminus1 isin klowast

as claimed and gn = bnxdiminus1Ri+1 where bn = anminus1bnminus1`i isin klowast as claimed

Finally U i+1 = Xnminus1 minus (Znminus1[di]`i)U i and V i+1 = Y nminus1 minus (Znminus1[di]`i)V i Substi-tute these into the matrix(

an(1x)d0minusdiU i an(1x)d0minusdiminus1V ibn(1x)d0minusdi+1U i+1 bn(1x)d0minusdiV i+1

)

along with an = anminus1 and bn = anminus1bnminus1`i to see that this matrix is

Tnminus1 =(

1 0minusbnminus1Znminus1[di]x anminus1`ix

)times Tnminus2 middot middot middot T0 shown above

Case A2 2d0 minus 2di lt n lt 2d0 minus di minus di+1 for some i isin 0 1 r minus 2By the inductive hypothesis δnminus1 = n minus (2d0 minus 2di) gt 0 fnminus1 = anminus1x

diRi whereanminus1 isin klowast gnminus1 = bnminus1x

2d0minusdiminusnRi+1 where bnminus1 isin klowast and

Tnminus2 middot middot middot T0 =(

anminus1(1x)d0minusdiU i anminus1(1x)d0minusdiminus1V ibnminus1(1x)nminus(d0minusdi)U i+1 bnminus1(1x)nminus(d0minusdi)minus1V i+1

)

Note that gnminus1(0) = bnminus1Ri+1[2d0 minus di minus n] = 0 since 2d0 minus di minus n gt di+1 = degRi+1Thus

(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1) = (1 + δnminus1 fnminus1 fnminus1(0)gnminus1x)

Hence δn = 1 + n minus (2d0 minus 2di) gt 0 as claimed fn = anxdiRi where an = anminus1 isin klowast

as claimed and gn = bnx2d0minusdiminusnminus1Ri+1 where bn = fnminus1(0)bnminus1 = anminus1bnminus1`i isin klowast as

claimedAlso Tnminus1 =

(1 00 anminus1`ix

) Multiply by Tnminus2 middot middot middot T0 above to see that

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdiU i an(1x)d0minusdiminus1V ibn(1x)nminus(d0minusdi)+1U i+1 bn`i(1x)nminus(d0minusdi)V i+1

)as claimed

Case B1 n = 2d0 minus di minus di+1 for some i isin 0 1 r minus 2Note that 2d0 minus 2di le nminus 1 lt 2d0 minus di minus di+1 By the inductive hypothesis δnminus1 =

diminusdi+1 gt 0 fnminus1 = anminus1xdiRi where anminus1 isin klowast gnminus1 = bnminus1x

di+1Ri+1 where bnminus1 isin klowastand

Tnminus2 middot middot middot T0 =(

anminus1(1x)d0minusdiU i anminus1(1x)d0minusdiminus1V ibnminus1(1x)d0minusdi+1U i+1 bnminus1`i(1x)d0minusdi+1minus1V i+1

)

Observe thatZn = Ri minus (`i`i+1)xdiminusdi+1Ri+1

Xn = Ui minus (`i`i+1)xdiminusdi+1Ui+1

Yn = Vi minus (`i`i+1)xdiminusdi+1Vi+1

Substitute 1x for x in the Zn equation and multiply by anminus1bnminus1`i+1xdi to see that

anminus1bnminus1`i+1xdiZn = bnminus1`i+1fnminus1 minus anminus1`ignminus1

= gnminus1(0)fnminus1 minus fnminus1(0)gnminus1

Daniel J Bernstein and Bo-Yin Yang 43

Also note that gnminus1(0) = bnminus1`i+1 6= 0 so

(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1)= (1minus δnminus1 gnminus1 (gnminus1(0)fnminus1 minus fnminus1(0)gnminus1)x)

Hence δn = 1 minus (di minus di+1) le 0 as claimed fn = anxdi+1Ri+1 where an = bnminus1 isin klowast as

claimed and gn = bnxdiminus1Zn where bn = anminus1bnminus1`i+1 isin klowast as claimed

Finally substitute Xn = U i minus (`i`i+1)(1x)diminusdi+1U i+1 and substitute Y n = V i minus(`i`i+1)(1x)diminusdi+1V i+1 into the matrix(

an(1x)d0minusdi+1U i+1 an(1x)d0minusdi+1minus1V i+1bn(1x)d0minusdi+1Xn bn(1x)d0minusdiY n

)

along with an = bnminus1 and bn = anminus1bnminus1`i+1 to see that this matrix is Tnminus1 =(0 1

bnminus1`i+1x minusanminus1`ix

)times Tnminus2 middot middot middot T0 shown above

Case B2 2d0 minus di minus di+1 lt n lt 2d0 minus 2di+1 for some i isin 0 1 r minus 2Now 2d0minusdiminusdi+1 le nminus1 lt 2d0minus2di+1 By the inductive hypothesis δnminus1 = nminus(2d0minus

2di+1) le 0 fnminus1 = anminus1xdi+1Ri+1 for some anminus1 isin klowast gnminus1 = bnminus1x

2d0minusdi+1minusnZnminus1 forsome bnminus1 isin klowast where Znminus1 = Ri mod x2d0minus2di+1minusn+1Ri+1 and

Tnminus2 middot middot middot T0 =(

anminus1(1x)d0minusdi+1U i+1 anminus1(1x)d0minusdi+1minus1V i+1bnminus1(1x)nminus(d0minusdi+1)Xnminus1 bnminus1(1x)nminus(d0minusdi+1)minus1Y nminus1

)where Xnminus1 = Ui minus

lfloorRix

2d0minus2di+1minusn+1Ri+1rfloorx2d0minus2di+1minusn+1Ui+1 and Ynminus1 = Vi minuslfloor

Rix2d0minus2di+1minusn+1Ri+1

rfloorx2d0minus2di+1minusn+1Vi+1

Observe that

Zn = Ri mod x2d0minus2di+1minusnRi+1

= Znminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnRi+1

Xn = Xnminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnUi+1

Yn = Ynminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnVi+1

The point is that degZnminus1 le 2d0 minus di+1 minus n and subtracting

(Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnRi+1

from Znminus1 eliminates the coefficient of x2d0minusdi+1minusnStarting from

Zn = Znminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnRi+1

substitute 1x for x and multiply by anminus1bnminus1`i+1x2d0minusdi+1minusn to see that

anminus1bnminus1`i+1x2d0minusdi+1minusnZn = anminus1`i+1gnminus1 minus bnminus1Znminus1[2d0 minus di+1 minus n]fnminus1

= fnminus1(0)gnminus1 minus gnminus1(0)fnminus1

Now(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1)

= (1 + δnminus1 fnminus1 (fnminus1(0)gnminus1 minus gnminus1(0)fnminus1)x)

Hence δn = 1 +nminus (2d0minus 2di+1) le 0 as claimed fn = anxdi+1Ri+1 where an = anminus1 isin klowast

as claimed and gn = bnx2d0minusdi+1minusnminus1Zn where bn = anminus1bnminus1`i+1 isin klowast as claimed

44 Fast constant-time gcd computation and modular inversion

Finally substitute

Xn = Xnminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)(1x)2d0minus2di+1minusnU i+1

andY n = Y nminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)(1x)2d0minus2di+1minusnV i+1

into the matrix (an(1x)d0minusdi+1U i+1 an(1x)d0minusdi+1minus1V i+1

bn(1x)nminus(d0minusdi+1)+1Xn bn(1x)nminus(d0minusdi+1)Y n

)

along with an = anminus1 and bn = anminus1bnminus1`i+1 to see that this matrix is Tnminus1 =(1 0

minusbnminus1Znminus1[2d0 minus di+1 minus n]x anminus1`i+1x

)times Tnminus2 middot middot middot T0 shown above

Case C1 n = 2d0 minus 2drminus1 This is essentially the same as Case A1 except that i isreplaced with r minus 1 and gn ends up as 0 For completeness we spell out the details

If n = 0 then r = 1 and R1 = 0 so g = 0 By assumption δn = 1 as claimedfn = f = xd0R0 so fn = anx

d0R0 as claimed where an = 1 gn = g = 0 as claimed Definebn = 1 Now Urminus1 = 1 Ur = 0 Vrminus1 = 0 Vr = 1 and d0 minus drminus1 = 0 so(

an(1x)d0minusdrminus1Urminus1 an(1x)d0minusdrminus1minus1V rminus1bn(1x)d0minusdrminus1+1Ur bn(1x)d0minusdrminus1V r

)=(

1 00 1

)= Tnminus1 middot middot middot T0

as claimedOtherwise n gt 0 so r ge 2 Now 2d0 minus drminus2 minus drminus1 le n minus 1 lt 2d0 minus 2drminus1 so

by the inductive hypothesis (for i = r minus 2) δnminus1 = 1 + (n minus 1) minus (2d0 minus 2drminus1) = 0fnminus1 = anminus1x

drminus1Rrminus1 for some anminus1 isin klowast gnminus1 = bnminus1xdrminus1Znminus1 for some bnminus1 isin klowast

where Znminus1 = Rrminus2 mod xRrminus1 and

Tnminus2 middot middot middot T0 =(anminus1(1x)d0minusdrminus1Urminus1 anminus1(1x)d0minusdrminus1minus1V rminus1bnminus1(1x)d0minusdrminus1Xnminus1 bnminus1(1x)d0minusdrminus1minus1Y nminus1

)where Xn = Urminus2 minus bRrminus2xRrminus1cxUrminus1 and Yn = Vrminus2 minus bRrminus2xRrminus1cxVrminus1

Observe that

0 = Rr = Rrminus2 mod Rrminus1 = Znminus1 minus (Znminus1[drminus1]`rminus1)Rrminus1

Ur = Xnminus1 minus (Znminus1[drminus1]`rminus1)Urminus1

Vr = Ynminus1 minus (Znminus1[drminus1]`rminus1)Vrminus1

Starting from Znminus1 minus (Znminus1[drminus1]`rminus1)Rrminus1 = 0 substitute 1x for x and multiplyby anminus1bnminus1`rminus1x

drminus1 to see that

anminus1`rminus1gnminus1 minus bnminus1Znminus1[drminus1]fnminus1 = 0

ie fnminus1(0)gnminus1 minus gnminus1(0)fnminus1 = 0 Now

(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1) = (1 + δnminus1 fnminus1 0)

Hence δn = 1 = 1 + n minus (2d0 minus 2drminus1) gt 0 as claimed fn = anxdrminus1Rrminus1 where

an = anminus1 isin klowast as claimed and gn = 0 as claimedDefine bn = anminus1bnminus1`rminus1 isin klowast and substitute Ur = Xnminus1 minus (Znminus1[drminus1]`rminus1)Urminus1

and V r = Y nminus1 minus (Znminus1[drminus1]`rminus1)V rminus1 into the matrix(an(1x)d0minusdrminus1Urminus1(1x) an(1x)d0minusdrminus1minus1Vrminus1(1x)bn(1x)d0minusdrminus1+1Ur(1x) bn(1x)d0minusdrminus1Vr(1x)

)

Daniel J Bernstein and Bo-Yin Yang 45

to see that this matrix is Tnminus1 =(

1 0minusbnminus1Znminus1[drminus1]x anminus1`rminus1x

)times Tnminus2 middot middot middot T0

shown above

Case C2 n gt 2d0 minus 2drminus1By the inductive hypothesis δnminus1 = n minus (2d0 minus 2drminus1) fnminus1 = anminus1x

drminus1Rrminus1 forsome anminus1 isin klowast gnminus1 = 0 and

Tnminus2 middot middot middot T0 =(anminus1(1x)d0minusdrminus1Urminus1(1x) anminus1(1x)d0minusdrminus1minus1Vrminus1(1x)bnminus1(1x)nminus(d0minusdrminus1)Ur(1x) bnminus1(1x)nminus(d0minusdrminus1)minus1Vr(1x)

)for some bnminus1 isin klowast Hence δn = 1 + δn = 1 + n minus (2d0 minus 2drminus1) gt 0 fn = fnminus1 =

anxdrminus1Rrminus1 where an = anminus1 and gn = 0 Also Tnminus1 =

(1 00 anminus1`rminus1x

)so

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdrminus1Urminus1(1x) an(1x)d0minusdrminus1minus1Vrminus1(1x)bn(1x)nminus(d0minusdrminus1)+1Ur(1x) bn(1x)nminus(d0minusdrminus1)Vr(1x)

)where bn = anminus1bnminus1`rminus1 isin klowast

D Alternative proof of main gcd theorem for polynomialsThis appendix gives another proof of Theorem 62 using the relationship in Theorem C2between the EuclidndashStevin algorithm and iterates of divstep

Second proof of Theorem 62 Define R2 R3 Rr and di and `i as in Theorem B1 anddefine Ui and Vi as in Theorem B2 Then d0 = d gt d1 and all of the hypotheses ofTheorems B1 B2 and C2 are satisfied

By Theorem B1 G = gcdR0 R1 = Rrminus1`rminus1 (since R0 6= 0) so degG = drminus1Also by Theorem B2

Urminus1R0 + Vrminus1R1 = Rrminus1 = `rminus1G

so (Vrminus1`rminus1)R1 equiv G (mod R0) and deg Vrminus1 = d0 minus drminus2 lt d0 minus drminus1 = dminus degG soV = Vrminus1`rminus1

Case 1 drminus1 ge 1 Part C of Theorem C2 applies to n = 2dminus 1 since n = 2d0 minus 1 ge2d0 minus 2drminus1 In particular δn = 1 + n minus (2d0 minus 2drminus1) = 2drminus1 = 2 degG f2dminus1 =a2dminus1x

drminus1Rrminus1(1x) and v2dminus1 = a2dminus1(1x)d0minusdrminus1minus1Vrminus1(1x)Case 2 drminus1 = 0 Then r ge 2 and drminus2 ge 1 Part B of Theorem C2 applies to

i = r minus 2 and n = 2dminus 1 = 2d0 minus 1 since 2d0 minus di minus di+1 = 2d0 minus drminus2 le 2d0 minus 1 = n lt2d0 = 2d0 minus 2di+1 In particular δn = 1 + n minus (2d0 minus 2di+1) = 2drminus1 = 2 degG againf2dminus1 = a2dminus1x

drminus1Rrminus1(1x) and again v2dminus1 = a2dminus1(1x)d0minusdrminus1minus1Vrminus1(1x)In both cases f2dminus1(0) = a2dminus1`rminus1 so xdegGf2dminus1(1x)f2dminus1(0) = Rrminus1`rminus1 = G

and xminusd+1+degGv2dminus1(1x)f2dminus1(0) = Vrminus1`rminus1 = V

E A gcd algorithm for integersThis appendix presents a variable-time algorithm to compute the gcd of two integersThis appendix also explains how a modification to the algorithm produces the StehleacutendashZimmermann algorithm [62]

Theorem E1 Let e be a positive integer Let f be an element of Zlowast2 Let g be anelement of 2eZlowast2 Then there is a unique element q isin

1 3 5 2e+1 minus 1

such that

qg2e minus f isin 2e+1Z2

46 Fast constant-time gcd computation and modular inversion

We define f div2 g = q2e isin Q and we define f mod2 g = f minus qg2e isin 2e+1Z2 Thenf = (f div2 g)g + (f mod2 g) Note that e = ord2 g so we are not creating any ambiguityby omitting e from the f div2 g and f mod2 g notation

Proof First note that the quotient f(g2e) is odd Define q = (f(g2e)) mod 2e+1 thenq isin

1 3 5 2e+1 and qg2e minus f isin 2e+1Z2 by construction To see uniqueness note

that if qprime isin

1 3 5 2e+1 minus 1also satisfies qprimeg2e minus f isin 2e+1Z2 then (q minus qprime)g2e isin

2e+1Z2 so q minus qprime isin 2e+1Z2 since g2e isin Zlowast2 so q mod 2e+1 = qprime mod 2e+1 so q = qprime

Theorem E2 Let R0 be a nonzero element of Z2 Let R1 be an element of 2Z2Recursively define e0 e1 e2 e3 isin Z and R2 R3 isin 2Z2 as follows

bull If i ge 0 and Ri 6= 0 then ei = ord2 Ri

bull If i ge 1 and Ri 6= 0 then Ri+1 = minus((Riminus12eiminus1)mod2 Ri)2ei isin 2Z2

bull If i ge 1 and Ri = 0 then ei Ri+1 ei+1 Ri+2 ei+2 are undefined

Then (Ri2ei

Ri+1

)=(

0 12ei

minus12ei ((Riminus12eiminus1)div2 Ri)2ei

)(Riminus12eiminus1

Ri

)for each i ge 1 where Ri+1 is defined

Proof We have (Riminus12eiminus1)mod2 Ri = Riminus12eiminus1 minus ((Riminus12eiminus1)div2 Ri)Ri Negateand divide by 2ei to obtain the matrix equation

Theorem E3 Let R0 be an odd element of Z Let R1 be an element of 2Z Definee0 e1 e2 e3 and R2 R3 as in Theorem E2 Then Ri isin 2Z for each i ge 1 whereRi is defined Furthermore if t ge 0 and Rt+1 = 0 then |Rt2et | = gcdR0 R1

Proof Write g = gcdR0 R1 isin Z Then g is odd since R0 is oddWe first show by induction on i that Ri isin gZ for each i ge 0 For i isin 0 1 this follows

from the definition of g For i ge 2 we have Riminus2 isin gZ and Riminus1 isin gZ by the inductivehypothesis Now (Riminus22eiminus2)div2 Riminus1 has the form q2eiminus1 for q isin Z by definition ofdiv2 and 22eiminus1+eiminus2Ri = 2eiminus2qRiminus1 minus 2eiminus1Riminus2 by definition of Ri The right side ofthis equation is in gZ so 2middotmiddotmiddotRi is in gZ This implies first that Ri isin Q but also Ri isin Z2so Ri isin Z This also implies that Ri isin gZ since g is odd

By construction Ri isin 2Z2 for each i ge 1 and Ri isin Z for each i ge 0 so Ri isin 2Z foreach i ge 1

Finally assume that t ge 0 and Rt+1 = 0 Then all of R0 R1 Rt are nonzero Writegprime = |Rt2et | Then 2etgprime = |Rt| isin gZ and g is odd so gprime isin gZ

The equation 2eiminus1Riminus2 = 2eiminus2qRiminus1minus22eiminus1+eiminus2Ri shows that if Riminus1 Ri isin gprimeZ thenRiminus2 isin gprimeZ since gprime is odd By construction Rt Rt+1 isin gprimeZ so Rtminus1 Rtminus2 R1 R0 isingprimeZ Hence g = gcdR0 R1 isin gprimeZ Both g and gprime are positive so gprime = g

E4 The gcd algorithm Here is the algorithm featured in this appendixGiven an odd integer R0 and an even integer R1 compute the even integers R2 R3

defined above In other words for each i ge 1 where Ri 6= 0 compute ei = ord2 Rifind q isin

1 3 5 2ei+1 minus 1

such that qRi2ei minus Riminus12eiminus1 isin 2ei+1Z and compute

Ri+1 = (qRi2ei minusRiminus12eiminus1)2ei isin 2Z If Rt+1 = 0 for some t ge 0 output |Rt2et | andstop

We will show in Theorem F26 that this algorithm terminates The output is gcdR0 R1by Theorem E3 This algorithm is closely related to iterating divstep and we will usethis relationship to analyze iterates of divstep see Appendix G

More generally if R0 is an odd integer and R1 is any integer one can computegcdR0 R1 by using the above algorithm to compute gcdR0 2R1 Even more generally

Daniel J Bernstein and Bo-Yin Yang 47

one can compute the gcd of any two integers by first handling the case gcd0 0 = 0 thenhandling powers of 2 in both inputs then applying the odd-input algorithmE5 A non-functioning variant Imagine replacing the subtractions above withadditions find q isin

1 3 5 2ei+1 minus 1

such that qRi2ei +Riminus12eiminus1 isin 2ei+1Z and

compute Ri+1 = (qRi2ei + Riminus12eiminus1)2ei isin 2Z If R0 and R1 are positive then allRi are positive so the algorithm does not terminate This non-terminating algorithm isrelated to posdivstep (see Section 84) in the same way that the algorithm above is relatedto divstep

Daireaux Maume-Deschamps and Valleacutee in [31 Section 6] mentioned this algorithm asa ldquonon-centered LSB algorithmrdquo said that one can easily add a ldquosupplementary stoppingconditionrdquo to ensure termination and dismissed the resulting algorithm as being ldquocertainlyslower than the centered versionrdquoE6 A centered variant the StehleacutendashZimmermann algorithm Imagine usingcentered remainders 1minus 2e minus3minus1 1 3 2e minus 1 modulo 2e+1 rather than uncen-tered remainders

1 3 5 2e+1 minus 1

The above gcd algorithm then turns into the

StehleacutendashZimmermann gcd algorithm from [62]Specifically Stehleacute and Zimmermann begin with integers a b having ord2 a lt ord2 b

If b 6= 0 then ldquogeneralized binary divisionrdquo in [62 Lemma 1] defines q as the unique oddinteger with |q| lt 2ord2 bminusord2 a such that r = a+ qb2ord2 bminusord2 a is divisible by 2ord2 b+1The gcd algorithm in [62 Algorithm Elementary-GB] repeatedly sets (a b)larr (b r) Whenb = 0 the algorithm outputs the odd part of a

It is equivalent to set (a b)larr (b2ord2 b r2ord2 b) since this simply divides all subse-quent results by 2ord2 b In other words starting from an odd a0 and an even b0 recursivelydefine an odd ai+1 and an even bi+1 by the following formulas stopping when bi = 0bull ai+1 = bi2e where e = ord2 bi

bull bi+1 = (ai + qbi2e)2e isin 2Z for a unique q isin 1minus 2e minus3minus1 1 3 2e minus 1Now relabel a0 b0 b1 b2 b3 as R0 R1 R2 R3 R4 to obtain the following recursivedefinition Ri+1 = (qRi2ei +Riminus12eiminus1)2ei isin 2Z and thus(

Ri2ei

Ri+1

)=(

0 12ei

12ei q22ei

)(Riminus12eiminus1

Ri

)

for a unique q isin 1minus 2ei minus3minus1 1 3 2ei minus 1To summarize starting from the StehleacutendashZimmermann algorithm one can obtain the

gcd algorithm featured in this appendix by making the following three changes Firstdivide out 2ei instead of allowing powers of 2 to accumulate Second add Riminus12eiminus1

instead of subtracting it this does not affect the number of iterations for centered q Thirdswitch from centered q to uncentered q

Intuitively centered remainders are smaller than uncentered remainders and it iswell known that centered remainders improve the worst-case performance of Euclidrsquosalgorithm so one might expect that the StehleacutendashZimmermann algorithm performs betterthan our uncentered variant However as noted earlier our analysis produces the oppositeconclusion See Appendix F

F Performance of the gcd algorithm for integersThe possible matrices in Theorem E2 are(

0 12minus12 14

)

(0 12minus12 34

)for ei = 1(

0 14minus14 116

)

(0 14minus14 316

)

(0 14minus14 516

)

(0 14minus14 716

)for ei = 2

48 Fast constant-time gcd computation and modular inversion

etc In this appendix we show that a product of these matrices shrinks the input by a factorexponential in the weight

sumei This implies that

sumei in the algorithm of Appendix E is

O(b) where b is the number of input bits

F1 Lower bounds It is easy to see that each of these matrices has two eigenvaluesof absolute value 12ei This does not imply however that a weight-w product of thesematrices shrinks the input by a factor 12w Concretely the weight-7 product

147

(328 minus380minus158 233

)=(

0 14minus14 116

)(0 12minus12 34

)2( 0 12minus12 14

)(0 12minus12 34

)2

of 6 matrices has spectral radius (maximum absolute value of eigenvalues)

(561 +radic

249185)215 = 003235425826 = 061253803919 7 = 124949900586

The (w7)th power of this matrix is expressed as a weight-w product and has spectralradius 1207071286552w Consequently this proof strategy cannot obtain an O constantbetter than 7(15minus log2(561 +

radic249185)) = 1414169815 We prove a bound twice the

weight on the number of divstep iterations (see Appendix G) so this proof strategy cannotobtain an O constant better than 14(15 minus log2(561 +

radic249185)) = 2828339631 for

the number of divstep iterationsRotating the list of 6 matrices shown above gives 6 weight-7 products with the same

spectral radius In a small search we did not find any weight-w matrices with spectralradius above 061253803919 w Our upper bounds (see below) are 06181640625w for alllarge w

For comparison the non-terminating variant in Appendix E5 allows the matrix(0 12

12 34

) which has spectral radius 1 with eigenvector

(12

) The StehleacutendashZimmermann

algorithm reviewed in Appendix E6 allows the matrix(

0 1212 14

) which has spectral

radius (1 +radic

17)8 = 06403882032 so this proof strategy cannot obtain an O constantbetter than 2(log2(

radic17minus 1)minus 1) = 3110510062 for the number of cdivstep iterations

F2 A warning about worst-case inputs DefineM as the matrix 4minus7(

328 minus380minus158 233

)shown above Then Mminus3 =

(60321097 11336806047137246 88663112

) Applying our gcd algorithm to

inputs (60321097 47137246) applies a series of 18 matrices of total weight 21 in the

notation of Theorem E2 the matrix(

0 12minus12 34

)twice then the matrix

(0 12minus12 14

)

then the matrix(

0 12minus12 34

)twice then the weight-2 matrix

(0 14minus14 116

) all of

these matrices again and all of these matrices a third timeOne might think that these are worst-case inputs to our algorithm analogous to

a standard matrix construction of Fibonacci numbers as worst-case inputs to Euclidrsquosalgorithm We emphasize that these are not worst-case inputs the inputs (22293minus128330)are much smaller and nevertheless require 19 steps of total weight 23 We now explainwhy the analogy fails

The matrix M3 has an eigenvector (1minus053 ) with eigenvalue 1221middot0707 =12148 and an eigenvector (1 078 ) with eigenvalue 1221middot129 = 12271 so ithas spectral radius around 12148 Our proof strategy thus cannot guarantee a gainof more than 148 bits from weight 21 What we actually show below is that weight21 is guaranteed to gain more than 1397 bits (The quantity α21 in Figure F14 is133569231 = 2minus13972)

Daniel J Bernstein and Bo-Yin Yang 49

The matrix Mminus3 has spectral radius 2271 It is not surprising that each column ofMminus3 is close to 27 bits in size the only way to avoid this would be for the other eigenvectorto be pointing in almost exactly the same direction as (1 0) or (0 1) For our inputs takenfrom the first column of Mminus3 the algorithm gains nearly 27 bits from weight 21 almostdouble the guaranteed gain In short these are good inputs not bad inputs

Now replace M with the matrix F =(

0 11 minus1

) which reduces worst-case inputs to

Euclidrsquos algorithm The eigenvalues of F are (radic

5minus 1)2 = 12069 and minus(radic

5 + 1)2 =minus2069 The entries of powers of Fminus1 are Fibonacci numbers again with sizes wellpredicted by the spectral radius of Fminus1 namely 2069 However the spectral radius ofF is larger than 1 The reason that this large eigenvalue does not spoil the terminationof Euclidrsquos algorithm is that Euclidrsquos algorithm inspects sizes and chooses quotients todecrease sizes at each step

Stehleacute and Zimmermann in [62 Section 62] measure the performance of their algorithm

for ldquoworst caserdquo inputs18 obtained from negative powers of the matrix(

0 1212 14

) The

statement that these inputs are ldquoworst caserdquo is not justified On the contrary if these inputshave b bits then they take (073690 + o(1))b steps of the StehleacutendashZimmermann algorithmand thus (14738 + o(1))b cdivsteps which is considerably fewer steps than manyother inputs that we have tried19 for various values of b The quantity 073690 here islog(2) log((

radic17 + 1)2) coming from the smaller eigenvalue minus2(

radic17 + 1) = minus039038

of the above matrix

F3 Norms We use the standard definitions of the 2-norm of a real vector and the2-norm of a real matrix We summarize the basic theory of 2-norms here to keep thispaper self-contained

Definition F4 If v isin R2 then |v|2 is defined asradicv2

1 + v22 where v = (v1 v2)

Definition F5 If P isinM2(R) then |P |2 is defined as max|Pv|2 v isin R2 |v|2 = 1

This maximum exists becausev isin R2 |v|2 = 1

is a compact set (namely the unit

circle) and |Pv|2 is a continuous function of v See also Theorem F11 for an explicitformula for |P |2

Beware that the literature sometimes uses the notation |P |2 for the entrywise 2-normradicP 2

11 + P 212 + P 2

21 + P 222 We do not use entrywise norms of matrices in this paper

Theorem F6 Let v be an element of Z2 If v 6= 0 then |v|2 ge 1

Proof Write v as (v1 v2) with v1 v2 isin Z If v1 6= 0 then v21 ge 1 so v2

1 + v22 ge 1 If v2 6= 0

then v22 ge 1 so v2

1 + v22 ge 1 Either way |v|2 =

radicv2

1 + v22 ge 1

Theorem F7 Let v be an element of R2 Let r be an element of R Then |rv|2 = |r||v|2

Proof Write v as (v1 v2) with v1 v2 isin R Then rv = (rv1 rv2) so |rv|2 =radicr2v2

1 + r2v22 =radic

r2radicv2

1 + v22 = |r||v|2

18Specifically Stehleacute and Zimmermann claim that ldquothe worst case of the binary variantrdquo (ie of theirgcd algorithm) is for inputs ldquoGn and 2Gnminus1 where G0 = 0 G1 = 1 Gn = minusGnminus1 + 4Gnminus2 which givesall binary quotients equal to 1rdquo Daireaux Maume-Deschamps and Valleacutee in [31 Section 23] state thatStehleacute and Zimmermann ldquoexhibited the precise worst-case number of iterationsrdquo and give an equivalentcharacterization of this case

19For example ldquoGnrdquo from [62] is minus5467345 for n = 18 and 2135149 for n = 17 The inputs (minus5467345 2 middot2135149) use 17 iterations of the ldquoGBrdquo divisions from [62] while the smaller inputs (minus1287513 2middot622123) use24 ldquoGBrdquo iterations The inputs (1minus5467345 2135149) use 34 cdivsteps the inputs (1minus2718281 3141592)use 45 cdivsteps the inputs (1minus1287513 622123) use 58 cdivsteps

50 Fast constant-time gcd computation and modular inversion

Theorem F8 Let P be an element of M2(R) Let r be an element of R Then|rP |2 = |r||P |2

Proof By definition |rP |2 = max|rPv|2 v isin R2 |v|2 = 1

By Theorem F7 this is the

same as max|r||Pv|2 = |r|max|Pv|2 = |r||P |2

Theorem F9 Let P be an element of M2(R) Let v be an element of R2 Then|Pv|2 le |P |2|v|2

Proof If v = 0 then Pv = 0 and |Pv|2 = 0 = |P |2|v|2Otherwise |v|2 6= 0 Write w = v|v|2 Then |w|2 = 1 so |Pw|2 le |P |2 so |Pv|2 =

|Pw|2|v|2 le |P |2|v|2

Theorem F10 Let PQ be elements of M2(R) Then |PQ|2 le |P |2|Q|2

Proof If v isin R2 and |v|2 = 1 then |PQv|2 le |P |2|Qv|2 le |P |2|Q|2|v|2

Theorem F11 Let P be an element of M2(R) Assume that P lowastP =(a bc d

)where P lowast

is the transpose of P Then |P |2 =radic

a+d+radic

(aminusd)2+4b2

2

Proof P lowastP isin M2(R) is its own transpose so by the spectral theorem it has twoorthonormal eigenvectors with real eigenvalues In other words R2 has a basis e1 e2 suchthat

bull |e1|2 = 1

bull |e2|2 = 1

bull the dot product elowast1e2 is 0 (here we view vectors as column matrices)

bull P lowastPe1 = λ1e1 for some λ1 isin R and

bull P lowastPe2 = λ2e2 for some λ2 isin R

Any v isin R2 with |v|2 = 1 can be written uniquely as c1e1 + c2e2 for some c1 c2 isin R withc2

1 + c22 = 1 explicitly c1 = vlowaste1 and c2 = vlowaste2 Then P lowastPv = λ1c1e1 + λ2c2e2

By definition |P |22 is the maximum of |Pv|22 over v isin R2 with |v|2 = 1 Note that|Pv|22 = (Pv)lowast(Pv) = vlowastP lowastPv = λ1c

21 + λ2c

22 with c1 c2 defined as above For example

0 le |Pe1|22 = λ1 and 0 le |Pe2|22 = λ2 Thus |Pv|22 achieves maxλ1 λ2 It cannot belarger than this if c2

1 +c22 = 1 then λ1c

21 +λ2c

22 le maxλ1 λ2 Hence |P |22 = maxλ1 λ2

To finish the proof we make the spectral theorem explicit for the matrix P lowastP =(a bc d

)isinM2(R) Note that (P lowastP )lowast = P lowastP so b = c The characteristic polynomial of

this matrix is (xminus a)(xminus d)minus b2 = x2 minus (a+ d)x+ adminus b2 with roots

a+ dplusmnradic

(a+ d)2 minus 4(adminus b2)2 =

a+ dplusmnradic

(aminus d)2 + 4b2

2

The largest eigenvalue of P lowastP is thus (a+ d+radic

(aminus d)2 + 4b2)2 and |P |2 is the squareroot of this

Theorem F12 Let P be an element of M2(R) Assume that P lowastP =(a bc d

)where P lowast

is the transpose of P Let N be an element of R Then N ge |P |2 if and only if N ge 02N2 ge a+ d and (2N2 minus aminus d)2 ge (aminus d)2 + 4b2

Daniel J Bernstein and Bo-Yin Yang 51

earlybounds=011126894912^2037794112^2148808332^2251652192^206977232^2078823132^2483067332^239920452^22104392132^25112816812^25126890072^27138243032^28142578172^27156342292^29163862452^29179429512^31185834332^31197136532^32204328912^32211335692^31223282932^33238004212^35244892332^35256040592^36267388892^37271122152^35282767752^3729849732^36308292972^40312534432^39326254052^4133956252^39344650552^42352865672^42361759512^42378586372^4538656472^4239404692^4240247512^42412409172^46425934112^48433643372^48448890152^50455437912^5046418992^47472050052^50489977912^53493071912^52507544232^5451575272^51522815152^54536940732^56542122492^55552582732^56566360932^58577810812^59589529592^60592914752^59607185992^61618789972^62625348212^62633292852^62644043412^63659866332^65666035532^65

defalpha(w)ifwgt=67return(6331024)^wreturnearlybounds[w]

assertall(alpha(w)^49lt2^(-(34w-23))forwinrange(31100))assertmin((6331024)^walpha(w)forwinrange(68))==633^5(2^30165219)

Figure F14 Definition of αw for w isin 0 1 2 computer verificationthat αw lt 2minus(34wminus23)49 for 31 le w le 99 and computer verification thatmin(6331024)wαw 0 le w le 67 = 6335(230165219) If w ge 67 then αw =(6331024)w

Our upper-bound computations work with matrices with rational entries We useTheorem F12 to compare the 2-norms of these matrices to various rational bounds N Allof the computations here use exact arithmetic avoiding the correctness questions thatwould have shown up if we had used approximations to the square roots in Theorem F11Sage includes exact representations of algebraic numbers such as |P |2 but switching toTheorem F12 made our upper-bound computations an order of magnitude faster

Proof Assume thatN ge 0 that 2N2 ge a+d and that (2N2minusaminusd)2 ge (aminusd)2+4b2 Then2N2minusaminusd ge

radic(aminus d)2 + 4b2 since 2N2minusaminusd ge 0 so N2 ge (a+d+

radic(aminus d)2 + 4b2)2

so N geradic

(a+ d+radic

(aminus d)2 + 4b2)2 since N ge 0 so N ge |P |2 by Theorem F11Conversely assume that N ge |P |2 Then N ge 0 Also N2 ge |P |22 so 2N2 ge

2|P |22 = a + d +radic

(aminus d)2 + 4b2 by Theorem F11 In particular 2N2 ge a + d and(2N2 minus aminus d)2 ge (aminus d)2 + 4b2

F13 Upper bounds We now show that every weight-w product of the matrices inTheorem E2 has norm at most αw Here α0 α1 α2 is an exponentially decreasingsequence of positive real numbers defined in Figure F14

Definition F15 For e isin 1 2 and q isin

1 3 5 2e+1 minus 1 define Meq =(

0 12eminus12e q22e

)

52 Fast constant-time gcd computation and modular inversion

Theorem F16 |Meq|2 lt (1 +radic

2)2e if e isin 1 2 and q isin

1 3 5 2e+1 minus 1

Proof Define(a bc d

)= MlowasteqMeq Then a = 122e b = c = minusq23e and d = 122e +

q224e so a+ d = 222e + q224e lt 622e and (aminus d)2 + 4b2 = 4q226e + q428e le 3224eBy Theorem F12 |Meq|2 =

radic(a+ d+

radic(aminus d)2 + 4b2)2 le

radic(622e +

radic3222e)2 =

(1 +radic

2)2e

Definition F17 Define βw = infαw+jαj j ge 0 for w isin 0 1 2

Theorem F18 If w isin 0 1 2 then βw = minαw+jαj 0 le j le 67 If w ge 67then βw = (6331024)w6335(230165219)

The first statement reduces the computation of βw to a finite computation Thecomputation can be further optimized but is not a bottleneck for us Similar commentsapply to γwe in Theorem F20

Proof By definition αw = (6331024)w for all w ge 67 If j ge 67 then also w + j ge 67 soαw+jαj = (6331024)w+j(6331024)j = (6331024)w In short all terms αw+jαj forj ge 67 are identical so the infimum of αw+jαj for j ge 0 is the same as the minimum for0 le j le 67

If w ge 67 then αw+jαj = (6331024)w+jαj The quotient (6331024)jαj for0 le j le 67 has minimum value 6335(230165219)

Definition F19 Define γwe = infβw+j2j70169 j ge e

for w isin 0 1 2 and

e isin 1 2 3

This quantity γwe is designed to be as large as possible subject to two conditions firstγwe le βw+e2e70169 used in Theorem F21 second γwe le γwe+1 used in Theorem F22

Theorem F20 γwe = minβw+j2j70169 e le j le e+ 67

if w isin 0 1 2 and

e isin 1 2 3 If w + e ge 67 then γwe = 2e(6331024)w+e(70169)6335(230165219)

Proof If w + j ge 67 then βw+j = (6331024)w+j6335(230165219) so βw+j2j70169 =(6331024)w(633512)j(70169)6335(230165219) which increases monotonically with j

In particular if j ge e+ 67 then w + j ge 67 so βw+j2j70169 ge βw+e+672e+6770169Hence γwe = inf

βw+j2j70169 j ge e

= min

βw+j2j70169 e le j le e+ 67

Also if

w + e ge 67 then w + j ge 67 for all j ge e so

γwe =(

6331024

)w (633512

)e 70169

6335

230165219 = 2e(

6331024

)w+e 70169

6335

230165219

as claimed

Theorem F21 Define S as the smallest subset of ZtimesM2(Q) such that

bull(0(

1 00 1))isin S and

bull if (wP ) isin S and w = 0 or |P |2 gt βw and e isin 1 2 3 and |P |2 gt γwe andq isin

1 3 5 2e+1 minus 1

then (e+ wM(e q)P ) isin S

Assume that |P |2 le αw for each (wP ) isin S Then |M(ej qj) middot middot middotM(e1 q1)|2 le αe1+middotmiddotmiddot+ej

for all j isin 0 1 2 all e1 ej isin 1 2 3 and all q1 qj isin Z such thatqi isin

1 3 5 2ei+1 minus 1

for each i

Daniel J Bernstein and Bo-Yin Yang 53

The idea here is that instead of searching through all products P of matrices M(e q)of total weight w to see whether |P |2 le αw we prune the search in a way that makesthe search finite across all w (see Theorem F22) while still being sure to find a minimalcounterexample if one exists The first layer of pruning skips all descendants QP of P ifw gt 0 and |P |2 le βw the point is that any such counterexample QP implies a smallercounterexample Q The second layer of pruning skips weight-(w + e) children M(e q)P ofP if |P |2 le γwe the point is that any such child is eliminated by the first layer of pruning

Proof Induct on jIf j = 0 then M(ej qj) middot middot middotM(e1 q1) =

(1 00 1) with norm 1 = α0 Assume from now on

that j ge 1Case 1 |M(ei qi) middot middot middotM(e1 q1)|2 le βe1+middotmiddotmiddot+ei

for some i isin 1 2 jThen j minus i lt j so |M(ej qj) middot middot middotM(ei+1 qi+1)|2 le αei+1+middotmiddotmiddot+ej by the inductive

hypothesis so |M(ej qj) middot middot middotM(e1 q1)|2 le βe1+middotmiddotmiddot+eiαei+1+middotmiddotmiddot+ej by Theorem F10 butβe1+middotmiddotmiddot+ei

αei+1+middotmiddotmiddot+ejle αe1+middotmiddotmiddot+ej

by definition of βCase 2 |M(ei qi) middot middot middotM(e1 q1)|2 gt βe1+middotmiddotmiddot+ei for every i isin 1 2 jSuppose that |M(eiminus1 qiminus1) middot middot middotM(e1 q1)|2 le γe1+middotmiddotmiddot+eiminus1ei

for some i isin 1 2 jBy Theorem F16 |M(ei qi)|2 lt (1 +

radic2)2ei lt (16970)2ei By Theorem F10

|M(ei qi) middot middot middotM(e1 q1)|2 le |M(ei qi)|2γe1+middotmiddotmiddot+eiminus1ei le169γe1+middotmiddotmiddot+eiminus1ei

70 middot 2ei+1 le βe1+middotmiddotmiddot+ei

contradiction Thus |M(eiminus1 qiminus1) middot middot middotM(e1 q1)|2 gt γe1+middotmiddotmiddot+eiminus1eifor all i isin 1 2 j

Now (e1 + middot middot middot + eiM(ei qi) middot middot middotM(e1 q1)) isin S for each i isin 0 1 2 j Indeedinduct on i The base case i = 0 is simply

(0(

1 00 1))isin S If i ge 1 then (wP ) isin S by

the inductive hypothesis where w = e1 + middot middot middot+ eiminus1 and P = M(eiminus1 qiminus1) middot middot middotM(e1 q1)Furthermore

bull w = 0 (when i = 1) or |P |2 gt βw (when i ge 2 by definition of this case)

bull ei isin 1 2 3

bull |P |2 gt γwei(as shown above) and

bull qi isin

1 3 5 2ei+1 minus 1

Hence (e1 + middot middot middot+ eiM(ei qi) middot middot middotM(e1 q1)) = (ei + wM(ei qi)P ) isin S by definition of SIn particular (wP ) isin S with w = e1 + middot middot middot + ej and P = M(ej qj) middot middot middotM(e1 q1) so

|P |2 le αw by hypothesis

Theorem F22 In Theorem F21 S is finite and |P |2 le αw for each (wP ) isin S

Proof This is a computer proof We run the Sage script in Figure F23 We observe thatit terminates successfully (in slightly over 2546 minutes using Sage 86 on one core of a35GHz Intel Xeon E3-1275 v3 CPU) and prints 3787975117

What the script does is search depth-first through the elements of S verifying thateach (wP ) isin S satisfies |P |2 le αw The output is the number of elements searched so Shas at most 3787975117 elements

Specifically if (wP ) isin S then verify(wP) checks that |P |2 le αw and recursivelycalls verify on all descendants of (wP ) Actually for efficiency the script scales eachmatrix to have integer entries it works with 22eM(e q) (labeled scaledM(eq) in thescript) instead of M(e q) and it works with 4wP (labeled P in the script) instead of P

The script computes βw as shown in Theorem F18 computes γw as shown in Theo-rem F20 and compares |P |2 to various bounds as shown in Theorem F12

If w gt 0 and |P |2 le βw then verify(wP) returns There are no descendants of (wP )in this case Also |P |2 le αw since βw le αwα0 = αw

54 Fast constant-time gcd computation and modular inversion

fromalphaimportalphafrommemoizedimportmemoized

R=MatrixSpace(ZZ2)defscaledM(eq)returnR((02^e-2^eq))

memoizeddefbeta(w)returnmin(alpha(w+j)alpha(j)forjinrange(68))

memoizeddefgamma(we)returnmin(beta(w+j)2^j70169forjinrange(ee+68))

defspectralradiusisatmost(PPN)assumingPPhasformPtranspose()P(ab)(cd)=PPX=2N^2-a-dreturnNgt=0andXgt=0andX^2gt=(a-d)^2+4b^2

defverify(wP)nodes=1PP=Ptranspose()Pifwgt0andspectralradiusisatmost(PP4^wbeta(w))returnnodesassertspectralradiusisatmost(PP4^walpha(w))foreinPositiveIntegers()ifspectralradiusisatmost(PP4^wgamma(we))returnnodesforqinrange(12^(e+1)2)nodes+=verify(e+wscaledM(eq)P)

printverify(0R(1))

Figure F23 Sage script verifying |P |2 le αw for each (wP ) isin S Output is number ofpairs (wP ) considered if verification succeeds or an assertion failure if verification fails

Otherwise verify(wP) asserts that |P |2 le αw ie it checks that |P |2 le αw andraises an exception if not It then tries successively e = 1 e = 2 e = 3 etc continuingas long as |P |2 gt γwe For each e where |P |2 gt γwe verify(wP) recursively handles(e+ wM(e q)P ) for each q isin

1 3 5 2e+1 minus 1

Note that γwe le γwe+1 le middot middot middot so if |P |2 le γwe then also |P |2 le γwe+1 etc This iswhere it is important that γwe is not simply defined as βw+e2e70169

To summarize verify(wP) recursively enumerates all descendants of (wP ) Henceverify(0R(1)) recursively enumerates all elements (wP ) of S checking |P |2 le αw foreach (wP )

Theorem F24 If j isin 0 1 2 e1 ej isin 1 2 3 q1 qj isin Z and qi isin1 3 5 2ei+1 minus 1

for each i then |M(ej qj) middot middot middotM(e1 q1)|2 le αe1+middotmiddotmiddot+ej

Proof This is the conclusion of Theorem F21 so it suffices to check the assumptionsDefine S as in Theorem F21 then |P |2 le αw for each (wP ) isin S by Theorem F22

Daniel J Bernstein and Bo-Yin Yang 55

Theorem F25 In the situation of Theorem E3 assume that R0 R1 R2 Rj+1 aredefined Then |(Rj2ej Rj+1)|2 le αe1+middotmiddotmiddot+ej

|(R0 R1)|2 and if e1 + middot middot middot + ej ge 67 thene1 + middot middot middot+ ej le log1024633 |(R0 R1)|2

If R0 and R1 are each bounded by 2b in absolute value then |(R0 R1)|2 is bounded by2b+05 so log1024633 |(R0 R1)|2 le (b+ 05) log2(1024633) = (b+ 05)(14410 )

Proof By Theorem E2 (Ri2ei

Ri+1

)= M(ei qi)

(Riminus12eiminus1

Ri

)where qi = 2ei ((Riminus12eiminus1)div2 Ri) isin

1 3 5 2ei+1 minus 1

Hence(

Rj2ej

Rj+1

)= M(ej qj) middot middot middotM(e1 q1)

(R0R1

)

The product M(ej qj) middot middot middotM(e1 q1) has 2-norm at most αw where w = e1 + middot middot middot+ ej so|(Rj2ej Rj+1)|2 le αw|(R0 R1)|2 by Theorem F9

By Theorem F6 |(Rj2ej Rj+1)|2 ge 1 If e1 + middot middot middot + ej ge 67 then αe1+middotmiddotmiddot+ej =(6331024)e1+middotmiddotmiddot+ej so 1 le (6331024)e1+middotmiddotmiddot+ej |(R0 R1)|2

Theorem F26 In the situation of Theorem E3 there exists t ge 0 such that Rt+1 = 0

Proof Suppose not Then R0 R1 Rj Rj+1 are all defined for arbitrarily large j Takej ge 67 with j gt log1024633 |(R0 R1)|2 Then e1 + middot middot middot + ej ge 67 so j le e1 + middot middot middot + ej lelog1024633 |(R0 R1)|2 lt j by Theorem F25 contradiction

G Proof of main gcd theorem for integersThis appendix shows how the gcd algorithm from Appendix E is related to iteratingdivstep This appendix then uses the analysis from Appendix F to put bounds on thenumber of divstep iterations needed

Theorem G1 (2-adic division as a sequence of division steps) In the situation ofTheorem E1 divstep2e(1 f g2) = (1 g2e (qg2e minus f)2e+1)

In other words divstep2e(1 f g2) = (1 g2eminus(f mod2 g)2e+1)

Proof Defineze = (g2e minus f)2 isin Z2

ze+1 = (ze + (ze mod 2)g2e)2 isin Z2

ze+2 = (ze+1 + (ze+1 mod 2)g2e)2 isin Z2

z2e = (z2eminus1 + (z2eminus1 mod 2)g2e)2 isin Z2

Thenze = (g2e minus f)2

ze+1 = ((1 + 2(ze mod 2))g2e minus f)4ze+2 = ((1 + 2(ze mod 2) + 4(ze+1 mod 2))g2e minus f)8

z2e = ((1 + 2(ze mod 2) + middot middot middot+ 2e(z2eminus1 mod 2))g2e minus f)2e+1

56 Fast constant-time gcd computation and modular inversion

In short z2e = (qg2e minus f)2e+1 where

qprime = 1 + 2(ze mod 2) + middot middot middot+ 2e(z2eminus1 mod 2) isin

1 3 5 2e+1 minus 1

Hence qprimeg2eminusf = 2e+1z2e isin 2e+1Z2 so qprime = q by the uniqueness statement in Theorem E1Note that this construction gives an alternative proof of the existence of q in Theorem E1

All that remains is to prove divstep2e(1 f g2) = (1 g2e (qg2e minus f)2e+1) iedivstep2e(1 f g2) = (1 g2e z2e)

If 1 le i lt e then g2i isin 2Z2 so divstep(i f g2i) = (i + 1 f g2i+1) By in-duction divstepeminus1(1 f g2) = (e f g2e) By assumption e gt 0 and g2e is odd sodivstep(e f g2e) = (1 minus e g2e (g2e minus f)2) = (1 minus e g2e ze) What remains is toprove that divstepe(1minus e g2e ze) = (1 g2e z2e)

If 0 le i le eminus1 then 1minuse+i le 0 so divstep(1minuse+i g2e ze+i) = (2minuse+i g2e (ze+i+(ze+i mod 2)g2e)2) = (2minus e+ i g2e ze+i+1) By induction divstepi(1minus e g2e ze) =(1minus e+ i g2e ze+i) for 0 le i le e In particular divstepe(1minus e g2e ze) = (1 g2e z2e)as claimed

Theorem G2 (2-adic gcd as a sequence of division steps) In the situation of Theo-rem E3 assume that R0 R1 Rj Rj+1 are defined Then divstep2w(1 R0 R12) =(1 Rj2ej Rj+12) where w = e1 + middot middot middot+ ej

Therefore if Rt+1 = 0 then divstepn(1 R0 R12) = (1 + nminus 2wRt2et 0) for alln ge 2w where w = e1 + middot middot middot+ et

Proof Induct on j If j = 0 then w = 0 so divstep2w(1 R0 R12) = (1 R0 R12) andR0 = R02e0 since R0 is odd by hypothesis

If j ge 1 then by definition Rj+1 = minus((Rjminus12ejminus1)mod2 Rj)2ej Furthermore

divstep2e1+middotmiddotmiddot+2ejminus1(1 R0 R12) = (1 Rjminus12ejminus1 Rj2)

by the inductive hypothesis so

divstep2e1+middotmiddotmiddot+2ej (1 R0 R12) = divstep2ej (1 Rjminus12ejminus1 Rj2) = (1 Rj2ej Rj+12)

by Theorem G1

Theorem G3 Let R0 be an odd element of Z Let R1 be an element of 2Z Then thereexists n isin 0 1 2 such that divstepn(1 R0 R12) isin Ztimes Ztimes 0 Furthermore anysuch n has divstepn(1 R0 R12) isin Ztimes GminusG times 0 where G = gcdR0 R1

Proof Define e0 e1 e2 e3 and R2 R3 as in Theorem E2 By Theorem F26 thereexists t ge 0 such that Rt+1 = 0 By Theorem E3 Rt2et isin GminusG By Theorem G2divstep2w(1 R0 R12) = (1 Rt2et 0) isin Ztimes GminusG times 0 where w = e1 + middot middot middot+ et

Conversely assume that n isin 0 1 2 and that divstepn(1 R0 R12) isin ZtimesZtimes0Write divstepn(1 R0 R12) as (δ f 0) Then divstepn+k(1 R0 R12) = (k + δ f 0) forall k isin 0 1 2 so in particular divstep2w+n(1 R0 R12) = (2w + δ f 0) Alsodivstep2w+k(1 R0 R12) = (k + 1 Rt2et 0) for all k isin 0 1 2 so in particu-lar divstep2w+n(1 R0 R12) = (n + 1 Rt2et 0) Hence f = Rt2et isin GminusG anddivstepn(1 R0 R12) isin Ztimes GminusG times 0 as claimed

Theorem G4 Let R0 be an odd element of Z Let R1 be an element of 2Z Defineb = log2

radicR2

0 +R21 Assume that b le 21 Then divstepb19b7c(1 R0 R12) isin ZtimesZtimes 0

Proof If divstep(δ f g) = (δ1 f1 g1) then divstep(δminusfminusg) = (δ1minusf1minusg1) Hencedivstepn(1 R0 R12) isin ZtimesZtimes0 if and only if divstepn(1minusR0minusR12) isin ZtimesZtimes0From now on assume without loss of generality that R0 gt 0

Daniel J Bernstein and Bo-Yin Yang 57

s Ws R0 R10 1 1 01 5 1 minus22 5 1 minus23 5 1 minus24 17 1 minus45 17 1 minus46 65 1 minus87 65 1 minus88 157 11 minus69 181 9 minus1010 421 15 minus1411 541 21 minus1012 1165 29 minus1813 1517 29 minus2614 2789 17 minus5015 3653 47 minus3816 8293 47 minus7817 8293 47 minus7818 24245 121 minus9819 27361 55 minus15620 56645 191 minus14221 79349 215 minus18222 132989 283 minus23023 213053 133 minus44224 415001 451 minus46025 613405 501 minus60226 1345061 719 minus91027 1552237 1021 minus71428 2866525 789 minus1498

s Ws R0 R129 4598269 1355 minus166230 7839829 777 minus269031 12565573 2433 minus257832 22372085 2969 minus368233 35806445 5443 minus248634 71013905 4097 minus736435 74173637 5119 minus692636 205509533 9973 minus1029837 226964725 11559 minus966238 487475029 11127 minus1907039 534274997 13241 minus1894640 1543129037 24749 minus3050641 1639475149 20307 minus3503042 3473731181 39805 minus4346643 4500780005 49519 minus4526244 9497198125 38051 minus8971845 12700184149 82743 minus7651046 29042662405 152287 minus7649447 36511782869 152713 minus11485048 82049276437 222249 minus18070649 90638999365 220417 minus20507450 234231548149 229593 minus42605051 257130797053 338587 minus37747852 466845672077 301069 minus61335453 622968125533 437163 minus65715854 1386285660565 600809 minus101257855 1876581808109 604547 minus122927056 4208169535453 1352357 minus1542498

Figure G5 For each s isin 0 1 56 Minimum R20 + R2

1 where R0 is an odd integerR1 is an even integer and (R0 R1) requires ges iterations of divstep Also an example of(R0 R1) reaching this minimum

The rest of the proof is a computer proof We enumerate all pairs (R0 R1) where R0is a positive odd integer R1 is an even integer and R2

0 + R21 le 242 For each (R0 R1)

we iterate divstep to find the minimum n ge 0 where divstepn(1 R0 R12) isin Ztimes Ztimes 0We then say that (R0 R1) ldquoneeds n stepsrdquo

For each s ge 0 define Ws as the smallest R20 +R2

1 where (R0 R1) needs ges steps Thecomputation shows that each (R0 R1) needs le56 steps and also shows the values Ws

for each s isin 0 1 56 We display these values in Figure G5 We check for eachs isin 0 1 56 that 214s leW 19

s Finally we claim that each (R0 R1) needs leb19b7c steps where b = log2

radicR2

0 +R21

If not then (R0 R1) needs ges steps where s = b19b7c + 1 We have s le 56 since(R0 R1) needs le56 steps We also have Ws le R2

0 + R21 by definition of Ws Hence

214s19 le R20 +R2

1 so 14s19 le 2b so s le 19b7 contradiction

Theorem G6 (Bounds on the number of divsteps required) Let R0 be an odd elementof Z Let R1 be an element of 2Z Define b = log2

radicR2

0 +R21 Define n = b19b7c

if b le 21 n = b(49b+ 23)17c if 21 lt b le 46 and n = b49b17c if b gt 46 Thendivstepn(1 R0 R12) isin Ztimes Ztimes 0

58 Fast constant-time gcd computation and modular inversion

All three cases have n le b(49b+ 23)17c since 197 lt 4917

Proof If b le 21 then divstepn(1 R0 R12) isin Ztimes Ztimes 0 by Theorem G4Assume from now on that b gt 21 This implies n ge b49b17c ge 60Define e0 e1 e2 e3 and R2 R3 as in Theorem E2 By Theorem F26 there

exists t ge 0 such that Rt+1 = 0 By Theorem E3 Rt2et isin GminusG By Theorem G2divstep2w(1 R0 R12) = (1 Rt2et 0) isin Ztimes GminusG times 0 where w = e1 + middot middot middot+ et

To finish the proof we will show that 2w le n Hence divstepn(1 R0 R12) isin ZtimesZtimes0as claimed

Suppose that 2w gt n Both 2w and n are integers so 2w ge n+ 1 ge 61 so w ge 31By Theorem F6 and Theorem F25 1 le |(Rt2et Rt+1)|2 le αw|(R0 R1)|2 = αw2b so

log2(1αw) le bCase 1 w ge 67 By definition 1αw = (1024633)w so w log2(1024633) le b We have

3449 lt log2(1024633) so 34w49 lt b so n + 1 le 2w lt 49b17 lt (49b + 23)17 Thiscontradicts the definition of n as the largest integer below 49b17 if b gt 46 and as thelargest integer below (49b+ 23)17 if b le 46

Case 2 w le 66 Then αw lt 2minus(34wminus23)49 since w ge 31 so b ge lg(1αw) gt(34w minus 23)49 so n + 1 le 2w le (49b + 23)17 This again contradicts the definitionof n if b le 46 If b gt 46 then n ge b49 middot 4617c = 132 so 2w ge n + 1 gt 132 ge 2wcontradiction

H Comparison to performance of cdivstepThis appendix briefly compares the analysis of divstep from Appendix G to an analogousanalysis of cdivstep

First modify Theorem E1 to take the unique element q isin plusmn1plusmn3 plusmn(2e minus 1)such that f + qg2e isin 2e+1Z2 The corresponding modification of Theorem G1 saysthat cdivstep2e(1 f g2) = (1 g2e (f + qg2e)2e+1) The proof proceeds identicallyexcept zj = (zjminus1 minus (zjminus1 mod 2)g2e)2 isin Z2 for e lt j lt 2e and z2e = (z2eminus1 +(z2eminus1 mod 2)g2e)2 isin Z2 Also we have q = minus1minus2(ze mod 2)minusmiddot middot middotminus2eminus1(z2eminus2 mod 2)+2e(z2eminus1 mod 2) isin plusmn1plusmn3 plusmn(2e minus 1) In the notation of [62] GB(f g) = (q r) wherer = f + qg2e = 2e+1z2e

The corresponding gcd algorithm is the centered algorithm from Appendix E6 withaddition instead of subtraction Recall that except for inessential scalings by powers of 2this is the same as the StehleacutendashZimmermann gcd algorithm from [62]

The corresponding set of transition matrices is different from the divstep case As men-tioned in Appendix F there are weight-w products with spectral radius 06403882032 wso the proof strategy for cdivstep cannot obtain weight-w norm bounds below this spectralradius This is worse than our bounds for divstep

Enumerating small pairs (R0 R1) as in Theorem G4 also produces worse results forcdivstep than for divstep See Figure H1

Daniel J Bernstein and Bo-Yin Yang 59

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bullbull

bull

bull

bull

bull

bull

bullbullbullbullbullbull

bull

bullbullbullbullbullbullbull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bullbull

bull

bullbull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

minus3

minus2

minus1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21

Figure H1 For each s isin 1 2 56 red dot at (b sminus 28283396b) where b is minimumlog2

radicR2

0 +R21 where R0 is an odd integer R1 is an even integer and (R0 R1) requires ges

iterations of divstep For each s isin 1 2 58 blue dot at (b sminus 28283396b) where b isminimum log2

radicR2

0 +R21 where R0 is an odd integer R1 is an even integer and (R0 R1)

requires ges iterations of cdivstep


allows a subquadratic algorithm. The Stehlé–Zimmermann subquadratic "binary recursive gcd" algorithm [62] also emphasizes that it works with bottom bits, but it has the same irregularities as earlier subquadratic algorithms, and it again needs a large variable-time division subroutine.

Often the literature considers the "normal" case of polynomial gcd computation. This is, by definition, the case that each Euclid–Stevin quotient has degree 1. The conventional recursion then returns a predictable amount of information, allowing a more regular algorithm such as [53, Algorithm PGCD].9 Our algorithm achieves the same regularity for the general case.

Another previous algorithm that achieves the same regularity for a different special case of polynomial gcd computation is Thomé's subquadratic variant [68, Program 4.1] of the Berlekamp–Massey algorithm. Our algorithm can be viewed as extending this regularity to arbitrary polynomial gcd computations, and beyond this to integer gcd computations.

2 Organization of the paper

We start with the polynomial case. Section 3 defines division steps. Section 5, relying on theorems in Section 4, states our main algorithm to compute c coefficients of the nth iterate of divstep. This takes n(c + n) simple operations. We also explain how "jumps" reduce the cost for large n to (c + n)(log cn)^(2+o(1)) operations. All of these algorithms take constant time, i.e., time independent of the input coefficients, for any particular (n, c).

There is a long tradition in number theory of defining Euclid's algorithm, continued fractions, etc. for inputs in a "complete" field, such as the field of real numbers or the field of power series. We define division steps for power series and state our main algorithm for power series. Division steps produce polynomial outputs from polynomial inputs, so the reader can restrict attention to polynomials if desired, but there are advantages of following the tradition, as we now explain.

Requiring algorithms to work for general power series means that the algorithms can only inspect a limited number of leading coefficients of each input, without regard to the degree of inputs that happen to be polynomials. This restriction simplifies algorithm design, and helped us design our main algorithm.

Going beyond polynomials also makes it easy to describe further algorithmic options, such as iterating divstep on inputs (1, g/f) instead of (f, g). Restricting to polynomial inputs would require g/f to be replaced with a nearby polynomial; the limited precision of the inputs would affect the table of divstep iterates, rather than merely being an option in the algorithms.

Section 6, relying on theorems in Appendix A, applies our main algorithm to gcd computation and modular inversion. Here the inputs are polynomials. A constant fraction of the power-series coefficients used in our main algorithm are guaranteed to be 0 in this case, and one can save some time by skipping computations involving those coefficients. However, it is still helpful to factor the algorithm-design process into (1) designing the fast power-series algorithm and then (2) eliminating unused values for special cases.

Section 7 is a case study of the software speed of modular inversion of polynomials as part of key generation in the NTRU cryptosystem.

The remaining sections of the paper consider the integer case. Section 8 defines division steps for integers. Section 10, relying on theorems in Section 9, states our main algorithm to compute c bits of the nth iterate of divstep. Section 11, relying on theorems in Appendix G, applies our main algorithm to gcd computation and modular inversion. Section 12 is a case study of the software speed of inversion of integers modulo primes used in elliptic-curve cryptography.

9 The analysis in [53, Section VI] considers only normal inputs (and power-of-2 sizes). The algorithm works only for normal inputs. The performance of division in the non-normal case was discussed in [53, Section VIII], but incorrectly described as solely an issue for the base case of the recursion.



2.1. General notation. We write M2(R) for the ring of 2 × 2 matrices over a ring R. For example, M2(ℝ) is the ring of 2 × 2 matrices over the field ℝ of real numbers.

2.2. Notation for the polynomial case. Fix a field k. The formal-power-series ring k[[x]] is the set of infinite sequences (f0, f1, ...) with each fi ∈ k. The ring operations are as follows:

• 0 = (0, 0, 0, ...).
• 1 = (1, 0, 0, ...).
• +: (f0, f1, ...), (g0, g1, ...) ↦ (f0 + g0, f1 + g1, ...).
• −: (f0, f1, ...), (g0, g1, ...) ↦ (f0 − g0, f1 − g1, ...).
• ·: (f0, f1, f2, ...), (g0, g1, g2, ...) ↦ (f0g0, f0g1 + f1g0, f0g2 + f1g1 + f2g0, ...).

The element (0, 1, 0, ...) is written "x", and the entry fi in f = (f0, f1, ...) is called the "coefficient of x^i in f". As in the Sage [58] computer-algebra system, we write f[i] for the coefficient of x^i in f. The "constant coefficient" f[0] has a more standard notation f(0), and we use the f(0) notation when we are not also inspecting other coefficients.
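To illustrate this notation, here is a minimal Sage sketch (our own example, not from the paper); the ring name kxx and the sample series are ours:

# Minimal Sage sketch (ours): the coefficient notation f[i] in k[[x]] for k = F7.
k = GF(7)
kxx.<x> = PowerSeriesRing(k)
f = 2 + 7*x + 1*x^2 + 8*x^3     # 7 and 8 reduce to 0 and 1 in F7
print(f[0], f[1], f[2], f[3])   # 2 0 1 1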

The unit group of k[[x]] is written k[[x]]*. This is the same as the set of f ∈ k[[x]] such that f(0) ≠ 0.

The polynomial ring k[x] is the subring of sequences (f0, f1, ...) that have only finitely many nonzero coefficients. We write deg f for the degree of f ∈ k[x]; this means the maximum i such that f[i] ≠ 0, or −∞ if f = 0. We also define f[−∞] = 0, so f[deg f] is always defined. Beware that the literature sometimes instead defines deg 0 = −1, and in particular Sage defines deg 0 = −1.

We define f mod x^n as f[0] + f[1]x + ··· + f[n−1]x^(n−1) for any f ∈ k[[x]]. We define f mod g, for any f, g ∈ k[x] with g ≠ 0, as the unique polynomial r such that deg r < deg g and f − r = gq for some q ∈ k[x].

The ring k[[x]] is a domain. The formal-Laurent-series field k((x)) is by definition the field of fractions of k[[x]]. Each element of k((x)) can be written (not uniquely) as f/x^i for some f ∈ k[[x]] and some i ∈ {0, 1, 2, ...}. The subring k[1/x] is the smallest subring of k((x)) containing both k and 1/x.

2.3. Notation for the integer case. The set of infinite sequences (f0, f1, ...) with each fi ∈ {0, 1} forms a ring under the following operations:

• 0 = (0, 0, 0, ...).
• 1 = (1, 0, 0, ...).
• +: (f0, f1, ...), (g0, g1, ...) ↦ the unique (h0, h1, ...) such that, for every n ≥ 0, h0 + 2h1 + 4h2 + ··· + 2^(n−1)h_{n−1} ≡ (f0 + 2f1 + 4f2 + ··· + 2^(n−1)f_{n−1}) + (g0 + 2g1 + 4g2 + ··· + 2^(n−1)g_{n−1}) (mod 2^n).
• −: same with (f0 + 2f1 + 4f2 + ··· + 2^(n−1)f_{n−1}) − (g0 + 2g1 + 4g2 + ··· + 2^(n−1)g_{n−1}).
• ·: same with (f0 + 2f1 + 4f2 + ··· + 2^(n−1)f_{n−1}) · (g0 + 2g1 + 4g2 + ··· + 2^(n−1)g_{n−1}).

We follow the tradition among number theorists of using the notation Z2 for this ring, and not for the finite field F2 = Z/2. This ring Z2 has characteristic 0, so Z can be viewed as a subset of Z2.

One can think of (f0, f1, ...) ∈ Z2 as an infinite series f0 + 2f1 + ···. The difference between Z2 and the power-series ring F2[[x]] is that Z2 has carries: for example, adding (1, 0, ...) to (1, 0, ...) in Z2 produces (0, 1, ...).


We define f mod 2^n as f0 + 2f1 + ··· + 2^(n−1)f_{n−1} for any f = (f0, f1, ...) ∈ Z2.

The unit group of Z2 is written Z2*. This is the same as the set of odd f ∈ Z2, i.e., the set of f ∈ Z2 such that f mod 2 ≠ 0.

If f ∈ Z2 and f ≠ 0 then ord_2 f means the largest integer e ≥ 0 such that f ∈ 2^e Z2, equivalently the unique integer e ≥ 0 such that f ∈ 2^e Z2*.

The ring Z2 is a domain. The field Q2 is by definition the field of fractions of Z2. Each element of Q2 can be written (not uniquely) as f/2^i for some f ∈ Z2 and some i ∈ {0, 1, 2, ...}. The subring Z[1/2] is the smallest subring of Q2 containing both Z and 1/2.

3 Definition of x-adic division steps

Define divstep: Z × k[[x]]* × k[[x]] → Z × k[[x]]* × k[[x]] as follows:

divstep(δ, f, g) =
  (1 − δ, g, (g(0)f − f(0)g)/x)    if δ > 0 and g(0) ≠ 0,
  (1 + δ, f, (f(0)g − g(0)f)/x)    otherwise.

Note that f(0)g − g(0)f is divisible by x in k[[x]]. The name "division step" is justified in Theorem C.1, which shows that dividing a degree-d0 polynomial by a degree-d1 polynomial with 0 ≤ d1 < d0 can be viewed as computing divstep^(2d0−2d1).
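The definition translates directly into a few lines of Sage. The following one-step function is our own illustrative sketch (not one of the paper's named algorithms); iterating it on polynomial inputs reproduces, for example, the rows of Table 4.1:

# Sketch (ours): a single x-adic division step, on polynomials, with exact division by x.
def divstep(delta, f, g):
    x = f.parent().gen()
    if delta > 0 and g[0] != 0:
        return 1 - delta, g, (g[0]*f - f[0]*g)//x
    return 1 + delta, f, (f[0]*g - g[0]*f)//x

k = GF(7)
kx.<x> = k[]
f = 2 + 7*x + 1*x^2 + 8*x^3 + 2*x^4 + 8*x^5 + 1*x^6 + 8*x^7
g = 3 + 1*x + 4*x^2 + 1*x^3 + 5*x^4 + 9*x^5 + 2*x^6
delta1, f1, g1 = divstep(1, f, g)
print(delta1, f1[0], g1[0])   # 0 3 5, matching the n = 1 row of Table 4.1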

3.1. Transition matrices. Write (δ1, f1, g1) = divstep(δ, f, g). Then

(f1; g1) = T(δ, f, g) (f; g)   and   (1; δ1) = S(δ, f, g) (1; δ),

where (a; b) denotes a column vector and (a, b; c, d) denotes the 2 × 2 matrix with rows (a, b) and (c, d). Here T: Z × k[[x]]* × k[[x]] → M2(k[1/x]) is defined by

T(δ, f, g) =
  (0, 1; g(0)/x, −f(0)/x)     if δ > 0 and g(0) ≠ 0,
  (1, 0; −g(0)/x, f(0)/x)     otherwise,

and S: Z × k[[x]]* × k[[x]] → M2(Z) is defined by

S(δ, f, g) =
  (1, 0; 1, −1)    if δ > 0 and g(0) ≠ 0,
  (1, 0; 1, 1)     otherwise.

Note that both T(δ, f, g) and S(δ, f, g) are defined entirely by (δ, f(0), g(0)); they do not depend on the remaining coefficients of f and g.
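These case distinctions are easy to mirror in Sage; the helper below is an illustrative sketch of ours (not one of the paper's algorithms) returning the pair (T(δ, f, g), S(δ, f, g)) for a polynomial or power-series input f, g:

# Sketch (ours): the transition matrices T and S of Section 3.1.
def transition(delta, f, g):
    kx = f.parent()
    x = kx.gen()
    M2 = MatrixSpace(kx.fraction_field(), 2)
    if delta > 0 and g[0] != 0:
        return M2([[0, 1], [g[0]/x, -f[0]/x]]), matrix(ZZ, [[1, 0], [1, -1]])
    return M2([[1, 0], [-g[0]/x, f[0]/x]]), matrix(ZZ, [[1, 0], [1, 1]])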

3.2. Decomposition. This divstep function is a composition of two simpler functions (and the transition matrices can similarly be decomposed):

• a conditional swap, replacing (δ, f, g) with (−δ, g, f) if δ > 0 and g(0) ≠ 0, followed by
• an elimination, replacing (δ, f, g) with (1 + δ, f, (f(0)g − g(0)f)/x).


Figure 3.3 (data-flow diagram omitted): Data flow in an x-adic division step, decomposed as a conditional swap (A to B) followed by an elimination (B to C). The swap bit is set if δ > 0 and g[0] ≠ 0. The g outputs are f′[0]g′[1] − g′[0]f′[1], f′[0]g′[2] − g′[0]f′[2], f′[0]g′[3] − g′[0]f′[3], etc. Color scheme: consider, e.g., the linear combination cg − df of power series f, g, where c = f[0] and d = g[0]; inputs f, g affect the jth output coefficient cg[j] − df[j] via f[j], g[j] (black) and via the choice of c, d (red+blue). Green operations are special, creating the swap bit.

See Figure 3.3. This decomposition is our motivation for defining divstep in the first case (the swapped case) to compute (g(0)f − f(0)g)/x. Using (f(0)g − g(0)f)/x in both cases might seem simpler, but would require the conditional swap to forward a conditional negation to the elimination.

One can also compose these functions in the opposite order. This is compatible with some of what we say later about iterating the composition. However, the corresponding transition matrices would depend on another coefficient of g. This would complicate our algorithm statements.

Another possibility, as in [23], is to keep f and g in place, using the sign of δ to decide whether to add a multiple of f into g or to add a multiple of g into f. One way to do this in constant time is to perform two multiplications and two additions. Another way is to perform conditional swaps before and after one addition and one multiplication. Our approach is more efficient.

3.4. Scaling. One can define a variant of divstep that computes f − (f(0)/g(0))g in the first case (the swapped case) and g − (g(0)/f(0))f in the second case.

This variant has two advantages. First, it multiplies only one series, rather than two, by a scalar in k. Second, multiplying both f and g by any u ∈ k[[x]]* has the same effect on the output, so this variant induces a function on fewer variables (δ, g/f): the ratio ρ = g/f is replaced by (1/ρ − 1/ρ(0))/x when δ > 0 and ρ(0) ≠ 0, and by (ρ − ρ(0))/x otherwise, while δ is replaced by 1 − δ or 1 + δ respectively. This is the composition of, first, a conditional inversion that replaces (δ, ρ) with (−δ, 1/ρ) if δ > 0 and ρ(0) ≠ 0; second, an elimination that replaces ρ with (ρ − ρ(0))/x while adding 1 to δ.

On the other hand, working projectively with f, g is generally more efficient than working with ρ. Working with f(0)g − g(0)f, rather than g − (g(0)/f(0))f, has the virtue of being a "fraction-free" computation: it involves only ring operations in k, not divisions. This makes the effect of scaling only slightly more difficult to track.


Table 4.1: Iterates (δn, fn, gn) = divstep^n(δ, f, g) for k = F7, δ = 1, f = 2 + 7x + 1x^2 + 8x^3 + 2x^4 + 8x^5 + 1x^6 + 8x^7, and g = 3 + 1x + 4x^2 + 1x^3 + 5x^4 + 9x^5 + 2x^6. Each power series is displayed as a row of coefficients of x^0, x^1, etc.

 n  δn  fn: x^0 .. x^9          gn: x^0 .. x^9
 0   1  2 0 1 1 2 1 1 1 0 0     3 1 4 1 5 2 2 0 0 0
 1   0  3 1 4 1 5 2 2 0 0 0     5 2 1 3 6 6 3 0 0 0
 2   1  3 1 4 1 5 2 2 0 0 0     1 4 4 0 1 6 0 0 0 0
 3   0  1 4 4 0 1 6 0 0 0 0     3 6 1 2 5 2 0 0 0 0
 4   1  1 4 4 0 1 6 0 0 0 0     1 3 2 2 5 0 0 0 0 0
 5   0  1 3 2 2 5 0 0 0 0 0     1 2 5 3 6 0 0 0 0 0
 6   1  1 3 2 2 5 0 0 0 0 0     6 3 1 1 0 0 0 0 0 0
 7   0  6 3 1 1 0 0 0 0 0 0     1 4 4 2 0 0 0 0 0 0
 8   1  6 3 1 1 0 0 0 0 0 0     0 2 4 0 0 0 0 0 0 0
 9   2  6 3 1 1 0 0 0 0 0 0     5 3 0 0 0 0 0 0 0 0
10  −1  5 3 0 0 0 0 0 0 0 0     4 5 5 0 0 0 0 0 0 0
11   0  5 3 0 0 0 0 0 0 0 0     6 4 0 0 0 0 0 0 0 0
12   1  5 3 0 0 0 0 0 0 0 0     2 0 0 0 0 0 0 0 0 0
13   0  2 0 0 0 0 0 0 0 0 0     6 0 0 0 0 0 0 0 0 0
14   1  2 0 0 0 0 0 0 0 0 0     0 0 0 0 0 0 0 0 0 0
15   2  2 0 0 0 0 0 0 0 0 0     0 0 0 0 0 0 0 0 0 0
16   3  2 0 0 0 0 0 0 0 0 0     0 0 0 0 0 0 0 0 0 0
17   4  2 0 0 0 0 0 0 0 0 0     0 0 0 0 0 0 0 0 0 0
18   5  2 0 0 0 0 0 0 0 0 0     0 0 0 0 0 0 0 0 0 0
19   6  2 0 0 0 0 0 0 0 0 0     0 0 0 0 0 0 0 0 0 0

Specifically, say divstep(δ, f, g) = (δ′, f′, g′). If u ∈ k[[x]] with u(0) = 1, then divstep(δ, uf, ug) = (δ′, uf′, ug′). If u ∈ k*, then

divstep(δ, uf, g) = (δ′, f′, ug′) in the first (swapped) case;
divstep(δ, uf, g) = (δ′, uf′, ug′) in the second case;
divstep(δ, f, ug) = (δ′, uf′, ug′) in the first case;
divstep(δ, f, ug) = (δ′, f′, ug′) in the second case;

and divstep(δ, uf, ug) = (δ′, uf′, u^2 g′).

One can also scale g(0)f − f(0)g by other nonzero constants. For example, for typical fields k, one can quickly find "half-size" a, b such that a/b = g(0)/f(0). Computing af − bg may be more efficient than computing g(0)f − f(0)g.

We use the g(0)/f(0) variant in our case study in Section 7 for the field k = F3. Dividing by f(0) in this field is the same as multiplying by f(0).

4 Iterates of x-adic division steps

Starting from (δ, f, g) ∈ Z × k[[x]]* × k[[x]], define (δn, fn, gn) = divstep^n(δ, f, g) ∈ Z × k[[x]]* × k[[x]] for each n ≥ 0. Also define Tn = T(δn, fn, gn) ∈ M2(k[1/x]) and Sn = S(δn, fn, gn) ∈ M2(Z) for each n ≥ 0. Our main algorithm relies on various simple mathematical properties of these objects; this section states those properties. Table 4.1 shows all δn, fn, gn for one example of (δ, f, g).

Theorem 4.2. (fn; gn) = T_{n−1} ··· T_m (fm; gm) and (1; δn) = S_{n−1} ··· S_m (1; δm) if n ≥ m ≥ 0.


Proof. This follows from (f_{m+1}; g_{m+1}) = T_m (fm; gm) and (1; δ_{m+1}) = S_m (1; δm).

Theorem 4.3. T_{n−1} ··· T_m ∈ (k + ··· + (1/x^(n−m−1))k, k + ··· + (1/x^(n−m−1))k; (1/x)k + ··· + (1/x^(n−m))k, (1/x)k + ··· + (1/x^(n−m))k) if n > m ≥ 0.

A product of n − m of these T matrices can thus be stored using 4(n − m) coefficients from k. Beware that the theorem statement relies on n > m; for n = m the product is the identity matrix.

Proof. If n − m = 1 then the product is T_m, which is in (k, k; (1/x)k, (1/x)k) by definition. For n − m > 1, assume inductively that

T_{n−1} ··· T_{m+1} ∈ (k + ··· + (1/x^(n−m−2))k, k + ··· + (1/x^(n−m−2))k; (1/x)k + ··· + (1/x^(n−m−1))k, (1/x)k + ··· + (1/x^(n−m−1))k).

Multiply on the right by T_m ∈ (k, k; (1/x)k, (1/x)k). The top entries of the product are in (k + ··· + (1/x^(n−m−2))k) + ((1/x)k + ··· + (1/x^(n−m−1))k) = k + ··· + (1/x^(n−m−1))k as claimed, and the bottom entries are in ((1/x)k + ··· + (1/x^(n−m−1))k) + ((1/x^2)k + ··· + (1/x^(n−m))k) = (1/x)k + ··· + (1/x^(n−m))k as claimed.

Theorem 4.4. S_{n−1} ··· S_m ∈ ({1}, {0}; {2 − (n−m), ..., n−m−2, n−m}, {1, −1}) if n > m ≥ 0.

In other words, the bottom-left entry is between −(n−m) exclusive and n−m inclusive, and has the same parity as n−m. There are exactly 2(n−m) possibilities for the product. Beware that the theorem statement again relies on n > m.

Proof. If n − m = 1 then {2 − (n−m), ..., n−m−2, n−m} = {1}, and the product is S_m, which is (1, 0; 1, ±1) by definition.

For n − m > 1, assume inductively that

S_{n−1} ··· S_{m+1} ∈ ({1}, {0}; {2 − (n−m−1), ..., n−m−3, n−m−1}, {1, −1}),

and multiply on the right by S_m = (1, 0; 1, ±1). The top two entries of the product are again 1 and 0. The bottom-left entry is in {2 − (n−m−1), ..., n−m−3, n−m−1} + {1, −1} = {2 − (n−m), ..., n−m−2, n−m}. The bottom-right entry is again 1 or −1.

The next theorem compares δm, fm, gm, Tm, Sm to δ′m, f′m, g′m, T′m, S′m defined in the same way starting from (δ′, f′, g′) ∈ Z × k[[x]]* × k[[x]].

Theorem 4.5. Let m, t be nonnegative integers. Assume that δ′m = δm, f′m ≡ fm (mod x^t), and g′m ≡ gm (mod x^t). Then, for each integer n with m < n ≤ m + t: T′_{n−1} = T_{n−1}, S′_{n−1} = S_{n−1}, δ′n = δn, f′n ≡ fn (mod x^(t−(n−m−1))), and g′n ≡ gn (mod x^(t−(n−m))).

In other words, δm and the first t coefficients of fm and gm determine all of the following:

• the matrices T_m, T_{m+1}, ..., T_{m+t−1};
• the matrices S_m, S_{m+1}, ..., S_{m+t−1};
• δ_{m+1}, ..., δ_{m+t};
• the first t coefficients of f_{m+1}, the first t − 1 coefficients of f_{m+2}, the first t − 2 coefficients of f_{m+3}, and so on through the first coefficient of f_{m+t};
• the first t − 1 coefficients of g_{m+1}, the first t − 2 coefficients of g_{m+2}, the first t − 3 coefficients of g_{m+3}, and so on through the first coefficient of g_{m+t−1}.

Our main algorithm computes (δn, fn mod x^(t−(n−m−1)), gn mod x^(t−(n−m))) for any selected n ∈ {m+1, ..., m+t}, along with the product of the matrices T_m, ..., T_{n−1}. See Section 5.

Proof. See Figure 3.3 for the intuition. Formally, fix m, t and induct on n. Observe that (δ′_{n−1}, f′_{n−1}(0), g′_{n−1}(0)) equals (δ_{n−1}, f_{n−1}(0), g_{n−1}(0)):

• If n = m + 1: By assumption f′m ≡ fm (mod x^t) and n ≤ m + t, so t ≥ 1, so in particular f′m(0) = fm(0). Similarly g′m(0) = gm(0). By assumption δ′m = δm.

• If n > m + 1: f′_{n−1} ≡ f_{n−1} (mod x^(t−(n−m−2))) by the inductive hypothesis, and n ≤ m + t, so t − (n−m−2) ≥ 2, so in particular f′_{n−1}(0) = f_{n−1}(0); similarly g′_{n−1} ≡ g_{n−1} (mod x^(t−(n−m−1))) by the inductive hypothesis, and t − (n−m−1) ≥ 1, so in particular g′_{n−1}(0) = g_{n−1}(0); and δ′_{n−1} = δ_{n−1} by the inductive hypothesis.

Now use the fact that T and S inspect only the first coefficient of each series to see that

T′_{n−1} = T(δ′_{n−1}, f′_{n−1}, g′_{n−1}) = T(δ_{n−1}, f_{n−1}, g_{n−1}) = T_{n−1},
S′_{n−1} = S(δ′_{n−1}, f′_{n−1}, g′_{n−1}) = S(δ_{n−1}, f_{n−1}, g_{n−1}) = S_{n−1}

as claimed.

Multiply (1; δ′_{n−1}) = (1; δ_{n−1}) by S′_{n−1} = S_{n−1} to see that δ′n = δn as claimed.

By the inductive hypothesis, (T′_m, ..., T′_{n−2}) = (T_m, ..., T_{n−2}), and T′_{n−1} = T_{n−1}, so T′_{n−1} ··· T′_m = T_{n−1} ··· T_m. Abbreviate T_{n−1} ··· T_m as P. Then (f′n; g′n) = P (f′m; g′m) and (fn; gn) = P (fm; gm) by Theorem 4.2, so (f′n − fn; g′n − gn) = P (f′m − fm; g′m − gm).

By Theorem 4.3, P ∈ (k + ··· + (1/x^(n−m−1))k, k + ··· + (1/x^(n−m−1))k; (1/x)k + ··· + (1/x^(n−m))k, (1/x)k + ··· + (1/x^(n−m))k). By assumption, f′m − fm and g′m − gm are multiples of x^t. Multiply to see that f′n − fn is a multiple of x^t/x^(n−m−1) = x^(t−(n−m−1)) as claimed, and g′n − gn is a multiple of x^t/x^(n−m) = x^(t−(n−m)) as claimed.

5 Fast computation of iterates of x-adic division steps

Our goal in this section is to compute (δn, fn, gn) = divstep^n(δ, f, g). We are given the power series f, g to precision t, t respectively, and we compute fn, gn to precision t − n + 1, t − n respectively if n ≥ 1; see Theorem 4.5. We also compute the n-step transition matrix T_{n−1} ··· T_0.

We begin with an algorithm that simply computes one divstep iterate after another, multiplying by scalars in k and dividing by x exactly as in the divstep definition. We specify this algorithm in Figure 5.1 as a function divstepsx in the Sage [58] computer-algebra system. The quantities u, v, q, r ∈ k[1/x] inside the algorithm keep track of the coefficients of the current f, g in terms of the original f, g.

The rest of this section considers (1) constant-time computation, (2) "jumping" through division steps, and (3) further techniques to save time inside the computations.

5.2. Constant-time computation. The underlying Sage functions do not take constant time, but they can be replaced with functions that do take constant time. The case distinction can be replaced with constant-time bit operations, for example obtaining the new (δ, f, g) as follows (see also the sketch after this list):


def divstepsx(n,t,delta,f,g):
  assert t >= n and n >= 0
  f,g = f.truncate(t),g.truncate(t)
  kx = f.parent()
  x = kx.gen()
  u,v,q,r = kx(1),kx(0),kx(0),kx(1)

  while n > 0:
    f = f.truncate(t)
    if delta > 0 and g[0] != 0: delta,f,g,u,v,q,r = -delta,g,f,q,r,u,v
    f0,g0 = f[0],g[0]
    delta,g,q,r = 1+delta,(f0*g-g0*f)/x,(f0*q-g0*u)/x,(f0*r-g0*v)/x
    n,t = n-1,t-1
    g = kx(g).truncate(t)

  M2kx = MatrixSpace(kx.fraction_field(),2)
  return delta,f,g,M2kx((u,v,q,r))

Figure 5.1: Algorithm divstepsx to compute (δn, fn, gn) = divstep^n(δ, f, g). Inputs: n, t ∈ Z with 0 ≤ n ≤ t; δ ∈ Z; f, g ∈ k[[x]] to precision at least t. Outputs: δn; fn to precision t if n = 0, or t − (n − 1) if n ≥ 1; gn to precision t − n; T_{n−1} ··· T_0.

• Let s be the "sign bit" of −δ, meaning 0 if −δ ≥ 0 or 1 if −δ < 0.
• Let z be the "nonzero bit" of g(0), meaning 0 if g(0) = 0 or 1 if g(0) ≠ 0.
• Compute and output 1 + (1 − 2sz)δ. For example, shift δ to obtain 2δ, "AND" with −sz to obtain 2szδ, and subtract from 1 + δ.
• Compute and output f ⊕ sz(f ⊕ g), obtaining f if sz = 0 or g if sz = 1.
• Compute and output (1 − 2sz)(f(0)g − g(0)f)/x. Alternatively, compute and output simply (f(0)g − g(0)f)/x; see the discussion of decomposition and scaling in Section 3.
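To make the bit-level description concrete, here is a small Python sketch (ours, with hypothetical helper names) of the branchless selection of the new δ and the conditional swap; coefficients are assumed to be stored as nonnegative integers below 2^63, and a real implementation would use vectorized bit operations as in Section 7.

# Sketch (ours): branchless selections described in the list above.
def conditional_step_header(delta, f0, g0):
    s = ((-delta) >> 63) & 1          # sign bit of -delta (two's complement)
    z = ((g0 | -g0) >> 63) & 1        # 1 exactly when g(0) is nonzero
    m = -(s & z)                      # all-ones mask (-1) exactly when swapping
    new_delta = 1 + delta - ((2*delta) & m)   # 1 - delta if swapping, else 1 + delta
    return m, new_delta

def conditional_swap(m, f, g):
    # coefficientwise swap of equal-length lists when m = -1, identity when m = 0
    newf = [fi ^ (m & (fi ^ gi)) for fi, gi in zip(f, g)]
    newg = [gi ^ (m & (fi ^ gi)) for fi, gi in zip(f, g)]
    return newf, newg

m, d1 = conditional_step_header(1, 2, 3)     # delta = 1 > 0 and g(0) = 3 != 0
print(m, d1)                                 # -1 0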

Standard techniques minimize the exact cost of these computations for various representations of the inputs. The details depend on the platform. See our case study in Section 7.

5.3. Jumping through division steps. Here is a more general divide-and-conquer algorithm to compute (δn, fn, gn) and the n-step transition matrix T_{n−1} ··· T_0:

• If n ≤ 1: use the definition of divstep and stop.
• Choose j ∈ {1, 2, ..., n − 1}.
• Jump j steps from δ, f, g to δj, fj, gj. Specifically, compute the j-step transition matrix T_{j−1} ··· T_0, and then multiply by (f; g) to obtain (fj; gj). To compute the j-step transition matrix, call the same algorithm recursively. The critical point here is that this recursive call uses the inputs to precision only j; this determines the j-step transition matrix by Theorem 4.5.
• Similarly jump another n − j steps from δj, fj, gj to δn, fn, gn.


from divstepsx import divstepsx

def jumpdivstepsx(n,t,delta,f,g):
  assert t >= n and n >= 0
  kx = f.truncate(t).parent()

  if n <= 1: return divstepsx(n,t,delta,f,g)

  j = n//2

  delta,f1,g1,P1 = jumpdivstepsx(j,j,delta,f,g)
  f,g = P1*vector((f,g))
  f,g = kx(f).truncate(t-j),kx(g).truncate(t-j)

  delta,f2,g2,P2 = jumpdivstepsx(n-j,n-j,delta,f,g)
  f,g = P2*vector((f,g))
  f,g = kx(f).truncate(t-n+1),kx(g).truncate(t-n)

  return delta,f,g,P2*P1

Figure 5.4: Algorithm jumpdivstepsx to compute (δn, fn, gn) = divstep^n(δ, f, g). Same inputs and outputs as in Figure 5.1.

This splits an (n, t) problem into two smaller problems, namely a (j, j) problem and an (n − j, n − j) problem, at the cost of O(1) polynomial multiplications involving O(t + n) coefficients.

If j is always chosen as 1 then this algorithm ends up performing the same computations as Figure 5.1: compute δ1, f1, g1 to precision t − 1; compute δ2, f2, g2 to precision t − 2; etc. But there are many more possibilities for j.

Our jumpdivstepsx algorithm in Figure 5.4 takes j = ⌊n/2⌋, balancing the (j, j) problem with the (n − j, n − j) problem. In particular, with subquadratic polynomial multiplication, the algorithm is subquadratic. With FFT-based polynomial multiplication, the algorithm uses (t + n)(log(t + n))^(1+o(1)) + n(log n)^(2+o(1)) operations, and hence n(log n)^(2+o(1)) operations if t is bounded by n(log n)^(1+o(1)). For comparison, Figure 5.1 takes Θ(n(t + n)) operations.

We emphasize that, unlike standard fast-gcd algorithms such as [66], this algorithm does not call a polynomial-division subroutine between its two recursive calls.

5.5. More speedups. Our jumpdivstepsx algorithm is asymptotically Θ(log n) times slower than fast polynomial multiplication. The best previous gcd algorithms also have a Θ(log n) slowdown, both in the worst case and in the average case.10 This might suggest that there is very little room for improvement.

However, Θ says nothing about constant factors. There are several standard ideas for constant-factor speedups in gcd computation, beyond lower-level speedups in multiplication. We now apply the ideas to jumpdivstepsx.

The final polynomial multiplications in jumpdivstepsx produce six large outputs: f, g, and four entries of a matrix product P. Callers do not always use all of these outputs; for example, the recursive calls to jumpdivstepsx do not use the resulting f and g.

10 There are faster cases. Strassen [66] gave an algorithm that replaces the log factor with the entropy of the list of degrees in the corresponding continued fraction; there are occasional inputs where this entropy is o(log n). For example, computing a gcd with 0 takes linear time. However, these speedups are not useful for constant-time computations.


In principle there is no difficulty applying dead-value elimination and higher-level optimizations to each of the 63 possible combinations of desired outputs.

The recursive calls in jumpdivstepsx produce two products P1, P2 of transition matrices, which are then (if desired) multiplied to obtain P = P2 P1. In Figure 5.4, jumpdivstepsx multiplies f and g first by P1 and then by P2 to obtain fn and gn. An alternative is to multiply by P.

The inputs f and g are truncated to t coefficients, and the resulting fn and gn are truncated to t − n + 1 and t − n coefficients respectively. Often the caller (for example, jumpdivstepsx itself) wants more coefficients of P (f; g). Rather than performing an entirely new multiplication by P, one can add the truncation errors back into the results: multiply P by the f and g truncation errors, multiply P2 by the fj and gj truncation errors, and skip the final truncations of fn and gn.

There are standard speedups for multiplication in the context of truncated products, sums of products (e.g., in matrix multiplication), partially known products (e.g., the coefficients of x^−1, x^−2, x^−3, ... in P1 (f; g) are known to be 0), repeated inputs (e.g., each entry of P1 is multiplied by two entries of P2 and by one of f, g), and outputs being reused as inputs. See, e.g., the discussions of "FFT addition", "FFT caching", and "FFT doubling" in [11].

Any choice of j between 1 and n − 1 produces correct results. Values of j close to the edges do not take advantage of fast polynomial multiplication, but this does not imply that j = ⌊n/2⌋ is optimal. The number of combinations of choices of j and other options above is small enough that, for each small (t, n) in turn, one can simply measure all options and select the best. The results depend on the context (which outputs are needed) and on the exact performance of polynomial multiplication.

6 Fast polynomial gcd computation and modular inversion

This section presents three algorithms: gcdx computes gcd{R0, R1}, where R0 is a polynomial of degree d > 0 and R1 is a polynomial of degree <d; gcddegree computes merely deg gcd{R0, R1}; recipx computes the reciprocal of R1 modulo R0 if gcd{R0, R1} = 1. Figure 6.1 specifies all three algorithms, and Theorem 6.2 justifies all three algorithms. The theorem is proven in Appendix A.

Each algorithm calls divstepsx from Section 5 to compute (δ_{2d−1}, f_{2d−1}, g_{2d−1}) = divstep^(2d−1)(1, f, g). Here f = x^d R0(1/x) is the reversal of R0, and g = x^(d−1) R1(1/x). Of course, one can replace divstepsx here with jumpdivstepsx, which has the same outputs. The algorithms vary in the input-precision parameter t passed to divstepsx, and vary in how they use the outputs:

• gcddegree uses input precision t = 2d − 1 to compute δ_{2d−1}, and then uses one part of Theorem 6.2, which states that deg gcd{R0, R1} = δ_{2d−1}/2.
• gcdx uses precision t = 3d − 1 to compute δ_{2d−1} and d + 1 coefficients of f_{2d−1}. The algorithm then uses another part of Theorem 6.2, which states that f_{2d−1}/f_{2d−1}(0) is the reversal of gcd{R0, R1}.
• recipx uses t = 2d − 1 to compute δ_{2d−1}, 1 coefficient of f_{2d−1}, and the product P of transition matrices. It then uses the last part of Theorem 6.2, which (in the coprime case) states that the top-right corner of P is f_{2d−1}(0)/x^(2d−2) times the degree-(d − 1) reversal of the desired reciprocal.

One can generalize to compute gcd{R0, R1} for any polynomials R0, R1 of degree ≤ d, in time that depends only on d: scan the inputs to find the top degree, and always perform 2d iterations of divstep.


from divstepsx import divstepsx

def gcddegree(R0,R1):
  d = R0.degree()
  assert d > 0 and d > R1.degree()
  f,g = R0.reverse(d),R1.reverse(d-1)
  delta,f,g,P = divstepsx(2*d-1,2*d-1,1,f,g)
  return delta//2

def gcdx(R0,R1):
  d = R0.degree()
  assert d > 0 and d > R1.degree()
  f,g = R0.reverse(d),R1.reverse(d-1)
  delta,f,g,P = divstepsx(2*d-1,3*d-1,1,f,g)
  return f.reverse(delta//2)/f[0]

def recipx(R0,R1):
  d = R0.degree()
  assert d > 0 and d > R1.degree()
  f,g = R0.reverse(d),R1.reverse(d-1)
  delta,f,g,P = divstepsx(2*d-1,2*d-1,1,f,g)
  if delta != 0: return
  kx = f.parent()
  x = kx.gen()
  return kx(x^(2*d-2)*P[0][1]/f[0]).reverse(d-1)

Figure 6.1: Algorithm gcdx to compute gcd{R0, R1}; algorithm gcddegree to compute deg gcd{R0, R1}; algorithm recipx to compute the reciprocal of R1 modulo R0 when gcd{R0, R1} = 1. All three algorithms assume that R0, R1 ∈ k[x], that deg R0 > 0, and that deg R0 > deg R1.
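As a small usage sketch (ours, assuming the Figure 5.1 and Figure 6.1 functions are loaded in Sage), consider the case d = 1 over F7:

# Usage sketch (ours): invert 3 modulo x + 2 over F7 with the Figure 6.1 functions.
k = GF(7)
kx.<x> = k[]
R0 = x + 2
R1 = kx(3)
print(gcddegree(R0, R1))   # 0, so gcd{R0, R1} = 1
print(recipx(R0, R1))      # 5, and indeed 3*5 = 15 = 1 in F7, so 5 = 1/R1 modulo R0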

The extra iteration handles the case that both R0 and R1 have degree d. For simplicity, we focus on the case deg R0 = d > deg R1 with d > 0.

One can similarly characterize gcd and modular reciprocal in terms of subsequent iterates of divstep, with δ increased by the number of extra steps. Implementors might, for example, prefer to use an iteration count that is divisible by a large power of 2.

Theorem 6.2. Let k be a field. Let d be a positive integer. Let R0, R1 be elements of the polynomial ring k[x] with deg R0 = d > deg R1. Define G = gcd{R0, R1}, and let V be the unique polynomial of degree < d − deg G such that V R1 ≡ G (mod R0). Define f = x^d R0(1/x), g = x^(d−1) R1(1/x), (δn, fn, gn) = divstep^n(1, f, g), Tn = T(δn, fn, gn), and (un, vn; qn, rn) = T_{n−1} ··· T_0. Then

deg G = δ_{2d−1}/2,
G = x^(deg G) f_{2d−1}(1/x)/f_{2d−1}(0),
V = x^(−d+1+deg G) v_{2d−1}(1/x)/f_{2d−1}(0).

6.3. A numerical example. Take k = F7, d = 7, R0 = 2x^7 + 7x^6 + 1x^5 + 8x^4 + 2x^3 + 8x^2 + 1x + 8, and R1 = 3x^6 + 1x^5 + 4x^4 + 1x^3 + 5x^2 + 9x + 2. Then f = 2 + 7x + 1x^2 + 8x^3 + 2x^4 + 8x^5 + 1x^6 + 8x^7 and g = 3 + 1x + 4x^2 + 1x^3 + 5x^4 + 9x^5 + 2x^6. The gcdx algorithm computes (δ13, f13, g13) = divstep^13(1, f, g) = (0, 2, 6); see Table 4.1 for all iterates of divstep on these inputs.


Dividing f13 by its constant coefficient produces 1, the reversal of gcd{R0, R1}. The gcddegree algorithm simply computes δ13 = 0, which is enough information to see that gcd{R0, R1} = 1.

6.4. Constant-factor speedups. Compared to the general power-series setting in divstepsx, the inputs to gcddegree, gcdx, and recipx are more limited. For example, the f input is all 0 after the first d + 1 coefficients, so computations involving subsequent coefficients can be skipped. Similarly, the top-right entry of the matrix P in recipx is guaranteed to have coefficient 0 for 1, 1/x, 1/x^2, and so on through 1/x^(d−2), so one can save time in the polynomial computations that produce this entry. See Theorems A.1 and A.2 for bounds on the sizes of all intermediate results.

6.5. Bottom-up gcd and inversion. Our division steps eliminate bottom coefficients of the inputs. Appendices B, C, and D relate this to a traditional Euclid–Stevin computation, which gradually eliminates top coefficients of the inputs. The reversal of inputs and outputs in gcddegree, gcdx, and recipx exchanges the role of top coefficients and bottom coefficients. The following theorem states an alternative: skip the reversal, and instead directly eliminate the bottom coefficients of the original inputs.

Theorem 6.6. Let k be a field. Let d be a positive integer. Let f, g be elements of the polynomial ring k[x] with f(0) ≠ 0, deg f ≤ d, and deg g < d. Define γ = gcd{f, g}.

Define (δn, fn, gn) = divstep^n(1, f, g), Tn = T(δn, fn, gn), and (un, vn; qn, rn) = T_{n−1} ··· T_0. Then

deg γ = deg f_{2d−1},
γ = f_{2d−1}/ℓ,
x^(2d−2) γ ≡ (x^(2d−2) v_{2d−1}/ℓ) g (mod f),

where ℓ is the leading coefficient of f_{2d−1}.

Proof. Note that γ(0) ≠ 0, since f is a multiple of γ.

Define R0 = x^d f(1/x). Then R0 ∈ k[x]; deg R0 = d since f(0) ≠ 0; and f = x^d R0(1/x). Define R1 = x^(d−1) g(1/x). Then R1 ∈ k[x]; deg R1 < d; and g = x^(d−1) R1(1/x). Define G = gcd{R0, R1}. We will show below that γ = c x^(deg G) G(1/x) for some c ∈ k*.

Then γ = c f_{2d−1}/f_{2d−1}(0) by Theorem 6.2, i.e., γ is a constant multiple of f_{2d−1}. Hence deg γ = deg f_{2d−1} as claimed. Furthermore, γ has leading coefficient 1, so 1 = cℓ/f_{2d−1}(0), i.e., γ = f_{2d−1}/ℓ as claimed.

By Theorem 4.2, f_{2d−1} = u_{2d−1} f + v_{2d−1} g. Also x^(2d−2) u_{2d−1}, x^(2d−2) v_{2d−1} ∈ k[x] by Theorem 4.3, so

x^(2d−2) γ = (x^(2d−2) u_{2d−1}/ℓ) f + (x^(2d−2) v_{2d−1}/ℓ) g ≡ (x^(2d−2) v_{2d−1}/ℓ) g (mod f)

as claimed.

All that remains is to show that γ = c x^(deg G) G(1/x) for some c ∈ k*. If R1 = 0 then G = gcd{R0, 0} = R0/R0[d], so x^(deg G) G(1/x) = x^d R0(1/x)/R0[d] = f/f[0]; also g = 0, so γ = f/f[deg f] = (f[0]/f[deg f]) x^(deg G) G(1/x). Assume from now on that R1 ≠ 0.

We have R1 = QG for some Q ∈ k[x]. Note that deg R1 = deg Q + deg G. Also R1(1/x) = Q(1/x) G(1/x), so x^(deg R1) R1(1/x) = x^(deg Q) Q(1/x) x^(deg G) G(1/x) since deg R1 = deg Q + deg G. Hence x^(deg G) G(1/x) divides the polynomial x^(deg R1) R1(1/x), which in turn divides x^(d−1) R1(1/x) = g. Similarly, x^(deg G) G(1/x) divides f. Hence x^(deg G) G(1/x) divides gcd{f, g} = γ.

Furthermore, G = A R0 + B R1 for some A, B ∈ k[x]. Take any integer m above all of deg G, deg A, deg B; then

x^(m+d−deg G) x^(deg G) G(1/x) = x^(m+d) G(1/x)
  = x^m A(1/x) x^d R0(1/x) + x^(m+1) B(1/x) x^(d−1) R1(1/x)
  = x^(m−deg A) x^(deg A) A(1/x) f + x^(m+1−deg B) x^(deg B) B(1/x) g,

so x^(m+d−deg G) x^(deg G) G(1/x) is a multiple of γ, so the polynomial γ/(x^(deg G) G(1/x)) divides x^(m+d−deg G), so γ/(x^(deg G) G(1/x)) = c x^i for some c ∈ k* and some i ≥ 0. We must have i = 0 since γ(0) ≠ 0.

6.7. How to divide by x^(2d−2) modulo f. Unlike top-down inversion (and top-down gcd and bottom-up gcd), bottom-up inversion naturally scales its result by a power of x. We now explain various ways to remove this scaling.

For simplicity we focus here on the case gcd{f, g} = 1. The divstep computation in Theorem 6.6 naturally produces the polynomial x^(2d−2) v_{2d−1}/ℓ, which has degree at most d − 1 by Theorem 6.2. Our objective is to divide this polynomial by x^(2d−2) modulo f, obtaining the reciprocal of g modulo f by Theorem 6.6. One way to do this is to multiply by a reciprocal of x^(2d−2) modulo f. This reciprocal depends only on d and f, i.e., it can be precomputed before g is available.

Here are several standard strategies to compute 1/x^(2d−2) modulo f, or more generally 1/x^e modulo f for any nonnegative integer e (the first strategy is illustrated in the sketch after this list):

• Compute the polynomial (1 − f/f(0))/x, which is 1/x modulo f, and then raise this to the eth power modulo f. This takes a logarithmic number of multiplications modulo f.
• Start from 1/f(0), the reciprocal of f modulo x. Compute the reciprocals of f modulo x^2, x^4, x^8, etc. by Newton (Hensel) iteration. After finding the reciprocal r of f modulo x^e, divide 1 − rf by x^e to obtain the reciprocal of x^e modulo f. This takes a constant number of full-size multiplications, a constant number of half-size multiplications, etc. The total number of multiplications is again logarithmic, but if e is linear, as in the case e = 2d − 2, then the total size of the multiplications is linear, without an extra logarithmic factor.
• Combine the powering strategy with the Hensel strategy. For example, use the reciprocal of f modulo x^(d−1) to compute the reciprocal of x^(d−1) modulo f, and then square modulo f to obtain the reciprocal of x^(2d−2) modulo f, rather than using the reciprocal of f modulo x^(2d−2).
• Write down a simple formula for the reciprocal of x^e modulo f if f has a special structure that makes this easy. For example, if f = x^d − 1, as in original NTRU, and d ≥ 3, then the reciprocal of x^(2d−2) is simply x^2. As another example, if f = x^d + x^(d−1) + ··· + 1 and d ≥ 5, then the reciprocal of x^(2d−2) is x^4.
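Here is a minimal Sage sketch (ours, not the paper's code) of the first strategy, using the polynomial (1 − f/f(0))/x and powering in the quotient ring:

# Sketch (ours): 1/x^e modulo f via the powering strategy above.
def recip_x_power(f, e):
    kx = f.parent()
    x = kx.gen()
    h = (1 - f * f[0]^(-1)) // x    # h = 1/x modulo f, since x*h = 1 - f/f(0)
    Q = kx.quotient(f)              # arithmetic modulo f
    return (Q(h)^e).lift()          # h^e = 1/x^e modulo f

k = GF(7)
kx.<x> = k[]
print(recip_x_power(x^3 - 1, 4))    # x^2, matching the f = x^d - 1 example (d = 3)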

In most cases the division by x^(2d−2) modulo f involves extra arithmetic that is avoided by the top-down reversal approach. On the other hand, the case f = x^d − 1 does not involve any extra arithmetic. There are also applications of inversion that can easily handle a scaled result.

7 Software case study for polynomial modular inversion

As a concrete case study, we consider the problem of inversion in the ring (Z/3)[x]/(x^700 + x^699 + ··· + x + 1) on an Intel Haswell CPU core. The situation before our work was that this inversion consumed half of the key-generation time in the ntruhrss701 cryptosystem: about 150000 cycles out of 300000 cycles. This cryptosystem was introduced by Hülsing, Rijneveld, Schanck, and Schwabe [39] at CHES 2017.11

11 NTRU-HRSS is another round-2 submission in NIST's ongoing post-quantum competition. Google has also selected NTRU-HRSS for its "CECPQ2" post-quantum experiment. CECPQ2 includes a slightly modified version of ntruhrss701, and the NTRU-HRSS authors have also announced their intention to tweak NTRU-HRSS; these changes do not affect the performance of key generation.


We saved 60000 cycles out of these 150000 cycles, reducing the total key-generation time to 240000 cycles. Our approach also simplifies the code, for example eliminating the series of conditional power-of-2 rotations in [39].

7.1. Details of the new computation. Our computation works with one integer δ and four polynomials v, r, f, g ∈ (Z/3)[x]. Initially δ = 1; v = 0; r = 1; f = x^700 + x^699 + ··· + x + 1; and g = g_699 x^699 + ··· + g_0 x^0 is obtained by reversing the 700 input coefficients. We actually store −δ instead of δ; this turns out to be marginally more efficient for this platform.

We then apply 2·700 − 1 = 1399 iterations of a main loop. Each iteration works as follows, with the order of operations determined experimentally to try to minimize latency issues:

• Replace v with xv.
• Compute a swap mask as −1 if δ > 0 and g(0) ≠ 0, otherwise 0.
• Compute c ∈ Z/3 as f(0)g(0).
• Replace δ with −δ if the swap mask is set.
• Add 1 to δ.
• Replace (f, g) with (g, f) if the swap mask is set.
• Replace g with (g − cf)/x.
• Replace (v, r) with (r, v) if the swap mask is set.
• Replace r with r − cv.

This multiplies the transition matrix T(δ, f, g) into the vector (v/x^(n−1); r/x^n), where n is the number of iterations, and at the same time replaces (δ, f, g) with divstep(δ, f, g). Actually, we are slightly modifying divstep here (and making the corresponding modification to T), replacing the original coefficients f(0) and g(0) with the scaled coefficients 1 and g(0)/f(0) = f(0)g(0), as in Section 3.4. In other words, if f(0) = −1 then we negate both f and g. We suppress further comments on this deviation from the definition of divstep.

The effect of n iterations is to replace the original (δ, f, g) with divstep^n(δ, f, g). The matrix product T_{n−1} ··· T_0 has the form (···, v/x^(n−1); ···, r/x^n). The new f and g are congruent to v/x^(n−1) and r/x^n, respectively, times the original g, modulo x^700 + x^699 + ··· + x + 1.

After 1399 iterations we have δ = 0 if and only if the input is invertible modulo x^700 + x^699 + ··· + x + 1, by Theorem 6.2. Furthermore, if δ = 0, then the inverse is x^(−699) v_1399(1/x)/f(0), where v_1399 = v/x^1398; i.e., the inverse is x^699 v(1/x)/f(0). We thus multiply v by f(0) and take the coefficients of x^0, ..., x^699 in reverse order to obtain the desired inverse.

We use signed representatives −1, 0, 1 for elements of Z/3. We use 2-bit two's-complement representations of −1, 0, 1: the bottom bit is 1, 0, 1 respectively, and the top bit is 1, 0, 0. We wrote a simple superoptimizer to find optimal sequences of 3, 6, 6 bit operations, respectively, for multiplication, addition, and subtraction. We do not claim novelty for the approach in this paragraph: Boothby and Bradshaw in [20, Section 4.1] suggested the same 2-bit representation for Z/3 (without explaining it as signed two's-complement) and found the same operation counts by a similar search.
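To make the representation concrete, here is a small Python sketch (ours, not the paper's superoptimizer output) of a 3-bit-operation multiplication in this encoding, together with a brute-force check; we do not reproduce the 6-operation addition and subtraction sequences here.

# Sketch (ours): Z/3 multiplication in the 2-bit signed representation described above.
# Encoding: 0 -> (bottom, top) = (0,0); 1 -> (1,0); -1 -> (1,1).
def mul3(a_bot, a_top, b_bot, b_top):
    bot = a_bot & b_bot              # product is nonzero iff both inputs are nonzero
    top = (a_top ^ b_top) & bot      # sign of the product; 3 bit operations in total
    return bot, top

encode = {0: (0, 0), 1: (1, 0), -1: (1, 1)}
decode = {v: k for k, v in encode.items()}
for a in (-1, 0, 1):
    for b in (-1, 0, 1):
        assert decode[mul3(*encode[a], *encode[b])] == a * b
print("mul3 matches multiplication in Z/3")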

We store each of the four polynomials v, r, f, g as six 256-bit vectors: for example, the coefficients of x^0, ..., x^255 in v are stored as a 256-bit vector of bottom bits and a 256-bit vector of top bits. Within each 256-bit vector, we use the first 64-bit word to store the coefficients of x^0, x^4, ..., x^252; the next 64-bit word to store the coefficients of x^1, x^5, ..., x^253; etc. Multiplying by x thus moves the first 64-bit word to the second, moves the second to the third, moves the third to the fourth, moves the fourth to the first, and shifts the first by 1 bit; the bit that falls off the end of the first word is inserted into the next vector.

The first 256 iterations involve only the first 256-bit vectors for v and r, so we simply leave the second and third vectors as 0. The next 256 iterations involve only the first two 256-bit vectors for v and r. Similarly, we manipulate only the first 256-bit vectors for f and g in the final 256 iterations, and we manipulate only the first and second 256-bit vectors for f and g in the previous 256 iterations. Our main loop thus has five sections: 256 iterations involving (1, 1, 3, 3) vectors for (v, r, f, g); 256 iterations involving (2, 2, 3, 3) vectors; 375 iterations involving (3, 3, 3, 3) vectors; 256 iterations involving (3, 3, 2, 2) vectors; and 256 iterations involving (3, 3, 1, 1) vectors. This makes the code longer but saves almost 20% of the vector manipulations.

7.2. Comparison to the old computation. Many features of our computation are also visible in the software from [39]. We emphasize the ways that our computation is more streamlined.

There are essentially the same four polynomials v, r, f, g in [39] (labeled "c", "b", "g", "f"). Instead of a single integer δ (plus a loop counter), there are four integers "degf", "degg", "k", and "done" (plus a loop counter). One cannot simply eliminate "degf" and "degg" in favor of their difference δ, since "done" is set when "degf" reaches 0, and "done" influences "k", which in turn influences the results of the computation: the final result is multiplied by x^(701−k).

Multiplication by x^(701−k) is a conceptually simple matter of rotating by k positions within 701 positions and then reducing modulo x^700 + x^699 + ··· + x + 1, since x^701 − 1 is a multiple of x^700 + x^699 + ··· + x + 1. However, since k is a variable, the obvious method of rotating by k positions does not take constant time; [39] instead performs conditional rotations by 1 position, 2 positions, 4 positions, etc., where the condition bits are the bits of k.

The inputs and outputs are not reversed in [39]. This affects the power of x multiplied into the final results. Our approach skips the powers of x entirely.

Each polynomial in [39] is represented as six 256-bit vectors, with the five-section pattern skipping manipulations of some vectors as explained above. The order of coefficients in each vector is simply x^0, x^1, ...; multiplication by x in [39] uses more arithmetic instructions than our approach does.

At a lower level, [39] uses unsigned representatives 0, 1, 2 of Z/3, and uses a manually designed sequence of 19 bit operations for a multiply-add operation in Z/3. Langley [43] suggested a sequence of 16 bit operations, or 14 if an "and-not" operation is available. As in [20], we use just 9 bit operations.

The inversion software from [39] is 798 lines (32273 bytes) of Python generating assembly, producing 26634 bytes of object code. Our inversion software is 541 lines (14675 bytes) of C with intrinsics, compiled with gcc 8.2.1 to generate assembly, producing 10545 bytes of object code.

7.3. More NTRU examples. More than 1/3 of the key-generation time in [39] is consumed by a second inversion, which replaces the modulus 3 with 2^13. This is handled in [39] with an initial inversion modulo 2, followed by 8 multiplications modulo 2^13 for a Newton iteration.12 The initial inversion modulo 2 is handled in [39] by exponentiation, taking just 10332 cycles; the point here is that squaring modulo 2 is particularly efficient.

12 This use of Newton iteration follows the NTRU key-generation strategy suggested in [61], but we point out a difference in the details. All 8 multiplications in [39] are carried out modulo 2^13, while [61] instead follows the traditional precision doubling in Newton iteration: compute an inverse modulo 2^2, then 2^4, then 2^8 (or 2^7), then 2^13. Perhaps limiting the precision of the first 6 multiplications would save time in [39].


Much more inversion time is spent in key generation in sntrup4591761, a slightly larger cryptosystem from the NTRU Prime [14] family.13 NTRU Prime uses a prime modulus, such as 4591 for sntrup4591761, to avoid concerns that a power-of-2 modulus could allow attacks (see generally [14]), but this modulus also makes arithmetic slower. Before our work, the key-generation software for sntrup4591761 took 6 million cycles, partly for inversion in (Z/3)[x]/(x^761 − x − 1) but mostly for inversion in (Z/4591)[x]/(x^761 − x − 1). We reimplemented both of these inversion steps, reducing the total key-generation time to just 940852 cycles.14 Almost all of our inversion code modulo 3 is shared between sntrup4591761 and ntruhrss701.

7.4. Does lattice-based cryptography need fast inversion? Lyubashevsky, Peikert, and Regev [46] published an NTRU alternative that avoids inversions.15 However, this cryptosystem has slower encryption than NTRU, has slower decryption than NTRU, and appears to be covered by a 2010 patent [35] by Gaborit and Aguilar-Melchor. It is easy to envision applications that will (1) select NTRU and (2) benefit from faster inversions in NTRU key generation.16

Lyubashevsky and Seiler have very recently announced "NTTRU" [47], a variant of NTRU with a ring chosen to allow very fast NTT-based (FFT-based) multiplication and division. This variant triggers many of the security concerns described in [14], but if the variant survives security analysis then it will eliminate concerns about key-generation speed in NTRU.

8 Definition of 2-adic division steps

Define divstep: Z × Z2* × Z2 → Z × Z2* × Z2 as follows:

divstep(δ, f, g) =
  (1 − δ, g, (g − f)/2)              if δ > 0 and g is odd,
  (1 + δ, f, (g + (g mod 2)f)/2)     otherwise.

Note that g − f is even in the first case (since both f and g are odd), and g + (g mod 2)f is even in the second case (since f is odd).
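This definition also fits in a few lines of Python; the following one-step function is our own illustration (mirroring the loop body of Figure 10.1, not one of the paper's named algorithms), and can be iterated on integer inputs:

# Sketch (ours): a single 2-adic division step on integers (f must be odd).
def divstep2(delta, f, g):
    if delta > 0 and g & 1:
        return 1 - delta, g, (g - f) // 2
    return 1 + delta, f, (g + (g & 1) * f) // 2

# Iterating from (1, f, g) with f odd eventually forces g to 0:
delta, f, g = 1, 21, 35
while g != 0:
    delta, f, g = divstep2(delta, f, g)
print(f)   # 7 = gcd(21, 35)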

13 NTRU Prime is another round-2 NIST submission. One other NTRU-based encryption system, NTRUEncrypt [29], was submitted to the competition but has been merged into NTRU-HRSS for round 2; see [59]. For completeness we mention that NTRUEncrypt key generation took 980760 cycles for ntrukem443 and 2639756 cycles for ntrukem743. As far as we know, the NTRUEncrypt software was not designed to run in constant time.

14 This is a median cycle count from the SUPERCOP benchmarking framework. There is considerable variation in the cycle counts, more than 3% between quartiles, presumably depending on the mapping from virtual addresses to physical addresses. There is no dependence on the secret input being inverted.

15 The LPR cryptosystem is the basis for several round-2 NIST submissions: Kyber, LAC, NewHope, Round5, Saber, ThreeBears, and the "NTRU LPRime" option in NTRU Prime.

16 Google, for example, selected NTRU-HRSS for CECPQ2, as noted above, with the following explanation: "Schemes with a quotient-style key (like HRSS) will probably have faster encap/decap operations at the cost of much slower key-generation. Since there will be many uses outside TLS where keys can be reused, this is interesting as long as the key-generation speed is still reasonable for TLS." See [42].

8.1. Transition matrices. Write (δ1, f1, g1) = divstep(δ, f, g). Then

(f1; g1) = T(δ, f, g) (f; g)   and   (1; δ1) = S(δ, f, g) (1; δ),


where T: Z × Z2* × Z2 → M2(Z[1/2]) is defined by

T(δ, f, g) =
  (0, 1; −1/2, 1/2)              if δ > 0 and g is odd,
  (1, 0; (g mod 2)/2, 1/2)       otherwise,

and S: Z × Z2* × Z2 → M2(Z) is defined by

S(δ, f, g) =
  (1, 0; 1, −1)    if δ > 0 and g is odd,
  (1, 0; 1, 1)     otherwise.

Note that both T(δ, f, g) and S(δ, f, g) are defined entirely by δ, the bottom bit of f (which is always 1), and the bottom bit of g; they do not depend on the remaining bits of f and g.

8.2. Decomposition. This 2-adic divstep, like the power-series divstep, is a composition of two simpler functions: specifically, it is a conditional swap that replaces (δ, f, g) with (−δ, g, −f) if δ > 0 and g is odd, followed by an elimination that replaces (δ, f, g) with (1 + δ, f, (g + (g mod 2)f)/2).

8.3. Sizes. Write (δ1, f1, g1) = divstep(δ, f, g). If f and g are integers that fit into n + 1 bits two's-complement, in the sense that −2^n ≤ f < 2^n and −2^n ≤ g < 2^n, then also −2^n ≤ f1 < 2^n and −2^n ≤ g1 < 2^n. Similarly, if −2^n < f < 2^n and −2^n < g < 2^n, then −2^n < f1 < 2^n and −2^n < g1 < 2^n. Beware, however, that intermediate results such as g − f need an extra bit.

As in the polynomial case, an important feature of divstep is that iterating divstep on integer inputs eventually sends g to 0. Concretely, we will show that (D + o(1))n iterations are sufficient, where D = 2/(10 − log2 633) = 2.882100569...; see Theorem 11.2 for exact bounds. The proof technique cannot do better than (d + o(1))n iterations, where d = 14/(15 − log2(561 + √249185)) = 2.828339631.... Numerical evidence is consistent with the possibility that (d + o(1))n iterations are required in the worst case, which is what matters in the context of constant-time algorithms.

8.4. A non-functioning variant. Many variations are possible in the definition of divstep. However, some care is required, as the following variant illustrates.

Define posdivstep: Z × Z2* × Z2 → Z × Z2* × Z2 as follows:

posdivstep(δ, f, g) =
  (1 − δ, g, (g + f)/2)              if δ > 0 and g is odd,
  (1 + δ, f, (g + (g mod 2)f)/2)     otherwise.

This is just like divstep, except that it eliminates the negation of f in the swapped case. It is not true that iterating posdivstep on integer inputs eventually sends g to 0.

For example, posdivstep^2(1, 1, 1) = (1, 1, 1). For comparison, in the polynomial case we were free to negate f (or g) at any moment.

8.5. A centered variant. As another variant of divstep, we define cdivstep: Z × Z2* × Z2 → Z × Z2* × Z2 as follows:

cdivstep(δ, f, g) =
  (1 − δ, g, (f − g)/2)              if δ > 0 and g is odd,
  (1 + δ, f, (g + (g mod 2)f)/2)     if δ = 0,
  (1 + δ, f, (g − (g mod 2)f)/2)     otherwise.


Iterating cdivstep produces the same intermediate results as a variable-time gcd algorithm introduced by Stehlé and Zimmermann in [62]. The analogous gcd algorithm for divstep is a new variant of the Stehlé–Zimmermann algorithm, and we analyze the performance of this variant to prove bounds on the performance of divstep; see Appendices E, F, and G. As an analogy, one of our proofs in the polynomial case relies on relating x-adic divstep to the Euclid–Stevin algorithm.

The extra case distinctions make each cdivstep iteration more complicated than each divstep iteration. Similarly, the Stehlé–Zimmermann algorithm is more complicated than the new variant. To understand the motivation for cdivstep, compare divstep^3(−2, f, g) to cdivstep^3(−2, f, g). One has divstep^3(−2, f, g) = (1, f, g3) where

8g3 ∈ {g, g + f, g + 2f, g + 3f, g + 4f, g + 5f, g + 6f, g + 7f}.

One has cdivstep^3(−2, f, g) = (1, f, g′3) where

8g′3 ∈ {g − 3f, g − 2f, g − f, g, g + f, g + 2f, g + 3f, g + 4f}.

It seems intuitively clear that the "centered" output g′3 is smaller than the "uncentered" output g3, that more generally cdivstep has smaller outputs than divstep, and that cdivstep thus uses fewer iterations than divstep.

However, our analyses do not support this intuition. All available evidence indicates that, surprisingly, the worst case of cdivstep uses almost 10% more iterations than the worst case of divstep. Concretely, our proof technique cannot do better than (c + o(1))n iterations for cdivstep, where c = 2/(log2(√17 − 1) − 1) = 3.110510062..., and numerical evidence is consistent with the possibility that (c + o(1))n iterations are required.

8.6. A plus-or-minus variant. As yet another example, the following variant of divstep is essentially17 the Brent–Kung "Algorithm PM" from [24]:

pmdivstep(δ, f, g) =
  (−δ, g, (f − g)/2)     if δ ≥ 0 and (g − f) mod 4 = 0,
  (−δ, g, (f + g)/2)     if δ ≥ 0 and (g + f) mod 4 = 0,
  (δ, f, (g − f)/2)      if δ < 0 and (g − f) mod 4 = 0,
  (δ, f, (g + f)/2)      if δ < 0 and (g + f) mod 4 = 0,
  (1 + δ, f, g/2)        otherwise.

There are even more cases here than in cdivstep, although splitting out an initial conditional swap reduces the number of cases to 3. Another complication here is that the linear combinations used in pmdivstep depend on the bottom two bits of f and g, rather than just the bottom bit.

To understand the motivation for pmdivstep, note that after each addition or subtraction there are always at least two halvings, as in the "sliding window" approach to exponentiation. Again it seems intuitively clear that this reduces the number of iterations required. Consider, however, the following example (which is from [24, Theorem 3], modulo minor details): pmdivstep needs 3b − 1 iterations to reach g = 0 starting from (1, 1, 3 · 2^(b−2)). Specifically, the first b − 2 iterations produce

(2, 1, 3 · 2^(b−3)), (3, 1, 3 · 2^(b−4)), ..., (b − 2, 1, 3 · 2), (b − 1, 1, 3),

and then the next 2b + 1 iterations produce

(1 − b, 3, 2), (2 − b, 3, 1), (2 − b, 3, 2), (3 − b, 3, 1), (3 − b, 3, 2), ..., (−1, 3, 1), (−1, 3, 2), (0, 3, 1), (0, 1, 2), (1, 1, 1), (−1, 1, 0).

17 The Brent–Kung algorithm has an extra loop to allow the case that both inputs f, g are even. Compare [24, Figure 4] to the definition of pmdivstep.


Brent and Kung prove that ⌈cn⌉ systolic cells are enough for their algorithm, where c = 3.110510062... is the constant mentioned above, and they conjecture that (3 + o(1))n cells are enough, so presumably (3 + o(1))n iterations of pmdivstep are enough. Our analysis shows that fewer iterations are sufficient for divstep, even though divstep is simpler than pmdivstep.

9 Iterates of 2-adic division steps

The following results are analogous to the x-adic results in Section 4. We state these results separately to support verification.

Starting from (δ, f, g) ∈ Z × Z2* × Z2, define (δn, fn, gn) = divstep^n(δ, f, g) ∈ Z × Z2* × Z2, Tn = T(δn, fn, gn) ∈ M2(Z[1/2]), and Sn = S(δn, fn, gn) ∈ M2(Z).

Theorem 9.1. (fn; gn) = T_{n−1} ··· T_m (fm; gm) and (1; δn) = S_{n−1} ··· S_m (1; δm) if n ≥ m ≥ 0.

Proof. This follows from (f_{m+1}; g_{m+1}) = T_m (fm; gm) and (1; δ_{m+1}) = S_m (1; δm).

Theorem 92 Tnminus1 middot middot middot Tm isin( 1

2nminusmminus1Z 12nminusmminus1Z

12nminusmZ 1

2nminusmZ

)if n gt m ge 0

Proof As in Theorem 43

Theorem 93 Snminus1 middot middot middot Sm isin(

1 02minus (nminusm) nminusmminus 2 nminusm 1minus1

)if n gt

m ge 0

Proof As in Theorem 44

Theorem 94 Let m t be nonnegative integers Assume that δprimem = δm f primem equiv fm(mod 2t) and gprimem equiv gm (mod 2t) Then for each integer n with m lt n le m+ t T primenminus1 =Tnminus1 S primenminus1 = Snminus1 δprimen = δn f primen equiv fn (mod 2tminus(nminusmminus1)) and gprimen equiv gn (mod 2tminus(nminusm))

Proof Fix m t and induct on n Observe that (δprimenminus1 fprimenminus1 mod 2 gprimenminus1 mod 2) equals

(δnminus1 fnminus1 mod 2 gnminus1 mod 2)

bull If n = m + 1 By assumption f primem equiv fm (mod 2t) and n le m + t so t ge 1 so inparticular f primem mod 2 = fm mod 2 Similarly gprimem mod 2 = gm mod 2 By assumptionδprimem = δm

bull If n gt m + 1 f primenminus1 equiv fnminus1 (mod 2tminus(nminusmminus2)) by the inductive hypothesis andn le m+ t so tminus(nminusmminus2) ge 2 so in particular f primenminus1 mod 2 = fnminus1 mod 2 similarlygprimenminus1 equiv gnminus1 (mod 2tminus(nminusmminus1)) by the inductive hypothesis and tminus (nminusmminus1) ge 1so in particular gprimenminus1 mod 2 = gnminus1 mod 2 and δprimenminus1 = δnminus1 by the inductivehypothesis

Now use the fact that T and S inspect only the bottom bit of each number to see that

T primenminus1 = T (δprimenminus1 fprimenminus1 g

primenminus1) = T (δnminus1 fnminus1 gnminus1) = Tnminus1

S primenminus1 = S(δprimenminus1 fprimenminus1 g

primenminus1) = S(δnminus1 fnminus1 gnminus1) = Snminus1

as claimedMultiply

(1

δprimenminus1

)=(

1δnminus1

)by S primenminus1 = Snminus1 to see that δprimen = δn as claimed

Daniel J Bernstein and Bo-Yin Yang 25

deftruncate(ft)ift==0return0twot=1ltlt(t-1)return((f+twot)amp(2twot-1))-twot

defdivsteps2(ntdeltafg)asserttgt=nandngt=0fg=truncate(ft)truncate(gt)uvqr=1001

whilengt0f=truncate(ft)ifdeltagt0andgamp1deltafguvqr=-deltag-fqr-u-vg0=gamp1deltagqr=1+delta(g+g0f)2(q+g0u)2(r+g0v)2nt=n-1t-1g=truncate(ZZ(g)t)

M2Q=MatrixSpace(QQ2)returndeltafgM2Q((uvqr))

Figure 101 Algorithm divsteps2 to compute (δn fn gn) = divstepn(δ f g) Inputsn t isin Z with 0 le n le t δ isin Z at least bottom t bits of f g isin Z2 Outputs δn bottom tbits of fn if n = 0 or tminus (nminus 1) bits if n ge 1 bottom tminus n bits of gn Tnminus1 middot middot middot T0

By the inductive hypothesis (T primem T primenminus2) = (Tm Tnminus2) and T primenminus1 = Tnminus1 so

T primenminus1 middot middot middot T primem = Tnminus1 middot middot middot Tm Abbreviate Tnminus1 middot middot middot Tm as P Then(f primengprimen

)= P

(f primemgprimem

)and(

fngn

)= P

(fmgm

)by Theorem 91 so

(f primen minus fngprimen minus gn

)= P

(f primem minus fmgprimem minus gm

)

By Theorem 92 P isin( 1

2nminusmminus1Z 12nminusmminus1Z

12nminusmZ 1

2nminusmZ

) By assumption f primem minus fm and gprimem minus gm

are multiples of 2t Multiply to see that f primen minus fn is a multiple of 2t2nminusmminus1 = 2tminus(nminusmminus1)

as claimed and gprimen minus gn is a multiple of 2t2nminusm = 2tminus(nminusm) as claimed

10 Fast computation of iterates of 2-adic division stepsWe describe just as in Section 5 two algorithms to compute (δn fn gn) = divstepn(δ f g)We are given the bottom t bits of f g and compute fn gn to tminusn+1 tminusn bits respectivelyif n ge 1 see Theorem 94 We also compute the product of the relevant T matrices Wespecify these algorithms in Figures 101ndash102 as functions divsteps2 and jumpdivsteps2in Sage

The first function divsteps2 simply takes divsteps one after another analogouslyto divstepsx The second function jumpdivsteps2 uses a straightforward divide-and-conquer strategy analogously to jumpdivstepsx More generally j in jumpdivsteps2can be taken anywhere between 1 and nminus 1 this generalization includes divsteps2 as aspecial case

As in Section 5 the Sage functions are not constant-time but it is easy to buildconstant-time versions of the same computations The comments on performance inSection 5 are applicable here with integer arithmetic replacing polynomial arithmeticThe same basic optimization techniques are also applicable here

26 Fast constant-time gcd computation and modular inversion

fromdivsteps2importdivsteps2truncate

defjumpdivsteps2(ntdeltafg)asserttgt=nandngt=0ifnlt=1returndivsteps2(ntdeltafg)

j=n2

deltaf1g1P1=jumpdivsteps2(jjdeltafg)fg=P1vector((fg))fg=truncate(ZZ(f)t-j)truncate(ZZ(g)t-j)

deltaf2g2P2=jumpdivsteps2(n-jn-jdeltafg)fg=P2vector((fg))fg=truncate(ZZ(f)t-n+1)truncate(ZZ(g)t-n)

returndeltafgP2P1

Figure 102 Algorithm jumpdivsteps2 to compute (δn fn gn) = divstepn(δ f g) Sameinputs and outputs as in Figure 101

11 Fast integer gcd computation and modular inversionThis section presents two algorithms If f is an odd integer and g is an integer then gcd2computes gcdf g If also gcdf g = 1 then recip2 computes the reciprocal of g modulof Figure 111 specifies both algorithms and Theorem 112 justifies both algorithms

The algorithms take the smallest nonnegative integer d such that |f | lt 2d and |g| lt 2dMore generally one can take d as an extra input the algorithms then take constant time ifd is constant Theorem 112 relies on f2 + 4g2 le 5 middot 22d (which is certainly true if |f | lt 2dand |g| lt 2d) but does not rely on d being minimal One can further generalize to alloweven f by first finding the number of shared powers of 2 in f and g and then reducing tothe odd case

The algorithms then choose a positive integer m as a particular function of d andcall divsteps2 (which can be transparently replaced with jumpdivsteps2) to compute(δm fm gm) = divstepm(1 f g) The choice of m guarantees gm = 0 and fm = plusmn gcdf gby Theorem 112

The gcd2 algorithm uses input precisionm+d to obtain d+1 bits of fm ie to obtain thesigned remainder of fm modulo 2d+1 This signed remainder is exactly fm = plusmn gcdf gsince the assumptions |f | lt 2d and |g| lt 2d imply |gcdf g| lt 2d

The recip2 algorithm uses input precision just m+ 1 to obtain the signed remainder offm modulo 4 This algorithm assumes gcdf g = 1 so fm = plusmn1 so the signed remainderis exactly fm The algorithm also extracts the top-right corner vm of the transition matrixand multiplies by fm obtaining a reciprocal of g modulo f by Theorem 112

Both gcd2 and recip2 are bottom-up algorithms and in particular recip2 naturallyobtains this reciprocal vmfm as a fraction with denominator 2mminus1 It multiplies by 2mminus1

to obtain an integer and then multiplies by ((f + 1)2)mminus1 modulo f to divide by 2mminus1

modulo f The reciprocal of 2mminus1 can be computed ahead of time Another way tocompute this reciprocal is through Hensel lifting to compute fminus1 (mod 2mminus1) for detailsand further techniques see Section 67

We emphasize that recip2 assumes that its inputs are coprime it does not notice ifthis assumption is violated One can check the assumption by computing fm to higherprecision (as in gcd2) or by multiplying the supposed reciprocal by g and seeing whether

Daniel J Bernstein and Bo-Yin Yang 27

fromdivsteps2importdivsteps2

defiterations(d)return(49d+80)17ifdlt46else(49d+57)17

defgcd2(fg)assertfamp1d=max(fnbits()gnbits())m=iterations(d)deltafmgmP=divsteps2(mm+d1fg)returnabs(fm)

defrecip2(fg)assertfamp1d=max(fnbits()gnbits())m=iterations(d)precomp=Integers(f)((f+1)2)^(m-1)deltafmgmP=divsteps2(mm+11fg)V=sign(fm)ZZ(P[0][1]2^(m-1))returnZZ(Vprecomp)

Figure 111 Algorithm gcd2 to compute gcdf g algorithm recip2 to compute thereciprocal of g modulo f when gcdf g = 1 Both algorithms assume that f is odd

the result is 1 modulo f For comparison the recip algorithm from Section 6 does noticeif its polynomial inputs are coprime using a quick test of δm

Theorem 112 Let f be an odd integer Let g be an integer Define (δn fn gn) =

divstepn(1 f g) Tn = T (δn fn gn) and(un vnqn rn

)= Tnminus1 middot middot middot T0 Let d be a real number

Assume that f2 + 4g2 le 5 middot 22d Let m be an integer Assume that m ge b(49d+ 80)17c ifd lt 46 and that m ge b(49d+ 57)17c if d ge 46 Then m ge 1 gm = 0 fm = plusmn gcdf g2mminus1vm isin Z and 2mminus1vmg equiv 2mminus1fm (mod f)

In particular if gcdf g = 1 then fm = plusmn1 and dividing 2mminus1vmfm by 2mminus1 modulof produces the reciprocal of g modulo f

Proof First f2 ge 1 so 1 le f2 + 4g2 le 5 middot 22d lt 22d+73 since 53 lt 27 Hence d gt minus76If d lt 46 then m ge b(49d+ 80)17c ge b(49(minus76) + 80)17c = 1 If d ge 46 thenm ge b(49d+ 57)17c ge b(49 middot 46 + 57)17c = 135 Either way m ge 1 as claimed

Define R0 = f R1 = 2g and G = gcdR0 R1 Then G = gcdf g since f is oddDefine b = log2

radicR2

0 +R21 = log2

radicf2 + 4g2 ge 0 Then 22b = f2 + 4g2 le 5 middot 22d so

b le d+ log4 5 We now split into three cases in each case defining n as in Theorem G6

bull If b le 21 Define n = b19b7c Then n le b49b17c le b49(d+ log4 5)17c leb(49d+ 57)17c le m since 549 le 457

bull If 21 lt b le 46 Define n = b(49b+ 23)17c If d ge 46 then n le b(49d+ 23)17c leb(49d+ 57)17c le m Otherwise d lt 46 so n le b(49(d+ log4 5) + 23)17c leb(49d+ 80)17c le m

bull If b gt 46 Define n = b49b17c Then n le b49b17c le b49(d+ log4 5)17c leb(49d+ 57)17c le m

28 Fast constant-time gcd computation and modular inversion

In all cases n le mNow divstepn(1 R0 R12) isin ZtimesZtimes0 by Theorem G6 so divstepm(1 R0 R12) isin

Ztimes Ztimes 0 so

(δm fm gm) = divstepm(1 f g) = divstepm(1 R0 R12) isin Ztimes GminusG times 0

by Theorem G3 Hence gm = 0 as claimed and fm = plusmnG = plusmn gcdf g as claimedBoth um and vm are in (12mminus1)Z by Theorem 92 In particular 2mminus1vm isin Z as

claimed Finally fm = umf + vmg by Theorem 91 so 2mminus1fm = 2mminus1umf + 2mminus1vmgand 2mminus1um isin Z so 2mminus1fm equiv 2mminus1vmg (mod f) as claimed

12 Software case study for integer modular inversionAs emphasized in Section 1 we have selected an inversion problem to be extremely favorableto Fermatrsquos method We use Intel platforms with 64-bit multipliers We focus on theproblem of inverting modulo the well-known Curve25519 prime p = 2255 minus 19 Therehas been a long line of Curve25519 implementation papers starting with [10] in 2006 wecompare to the latest speed records [54] (in assembly) for inversion modulo this prime

Despite all these Fermat advantages we achieve better speeds using our 2-adic divisionsteps As mentioned in Section 1 we take 10050 cycles 8778 cycles and 8543 cycles onHaswell Skylake and Kaby Lake where [54] used 11854 cycles 9301 cycles and 8971cycles This section looks more closely at how the new computation works

121 Details of the new computation Theorem 112 guarantees that 738 divstepiterations suffice since b(49 middot 255 + 57)17c = 738 To simplify the computation weactually use 744 = 12 middot 62 iterations

The decisions in the first 62 iterations are determined entirely by the bottom 62 bitsof the inputs We perform these iterations entirely in 64-bit words apply the resulting62-step transition matrix to the entire original inputs to jump to the result of 62 divstepiterations and then repeat

This is similar in two important ways to Lehmerrsquos gcd computation from [44] One ofLehmerrsquos ideas mentioned before is to carry out some steps in limited precision Anotherof Lehmerrsquos ideas is for this limit to be a small constant Our advantage over Lehmerrsquoscomputation is that we have a completely regular constant-time data flow

There are many other possibilities for which precision to use at each step All of thesepossibilities can be described using the jumping framework described in Section 53 whichapplies to both the x-adic case and the 2-adic case Figure 122 gives three examples ofjump strategies The first strategy computes each divstep in turn to full precision as inthe case study in Section 7 The second strategy is what we use in this case study Thethird strategy illustrates an asymptotically faster choice of jump distances The thirdstrategy might be more efficient than the second strategy in hardware but the secondstrategy seems optimal for 64-bit CPUs

In more detail we invert a 255-bit x modulo p as follows

1 Set f = p g = x δ = 1 i = 1

2 Set f0 = f (mod 264) g0 = g (mod 264)

3 Compute (δprime f1 g1) = divstep62(δ f0 g0) and obtain a scaled transition matrix Tist Ti

262

(f0g0

)=(f1g1

) The 63-bit signed entries of Ti fit into 64-bit registers We

call this step jump64divsteps2

4 Compute (f g)larr Ti(f g)262 Set δ = δprime

Daniel J Bernstein and Bo-Yin Yang 29

Figure 122 Three jump strategies for 744 divstep iterations on 255-bit integers Leftstrategy j = 1 ie computing each divstep in turn to full precision Middle strategyj = 1 for le62 requested iterations else j = 62 ie using 62 bottom bits to compute 62iterations jumping 62 iterations using 62 new bottom bits to compute 62 more iterationsetc Right strategy j = 1 for 2 requested iterations j = 2 for 3 or 4 requested iterationsj = 4 for 5 or 6 or 7 or 8 requested iterations etc Vertical axis 0 on top through 744 onbottom number of iterations Horizontal axis (within each strategy) 0 on left through254 on right bit positions used in computation

5 Set ilarr i+ 1 Go back to step 3 if i le 12

6 Compute v mod p where v is the top-right corner of T12T11 middot middot middot T1

(a) Compute pair-products T2iT2iminus1 with entries being 126-bit signed integers whichfits into two 64-bit limbs (radix 264)

(b) Compute quad-products T4iT4iminus1T4iminus2T4iminus3 with entries being 252-bit signedintegers (four 64-bit limbs radix 264)

(c) At this point the three quad-products are converted into unsigned integersmodulo p divided into 4 64-bit limbs

(d) Compute final vector-matrix-vector multiplications modulo p

7 Compute xminus1 = sgn(f) middot v middot 2minus744 (mod p) where 2minus744 is precomputed

123 Other CPUs Our advantage is larger on platforms with smaller multipliers Forexample we have written software to invert modulo 2255 minus 19 on an ARM Cortex-A7obtaining a median cycle count of 35277 cycles The best previous Cortex-A7 result wehave found in the literature is 62648 cycles reported by Fujii and Aranha [34] usingFermatrsquos method Internally our software does repeated calls to jump32divsteps2 units

30 Fast constant-time gcd computation and modular inversion

of 30 iterations with the resulting transition matrix fitting inside 32-bit general-purposeregisters

124 Other primes Nath and Sarkar present inversion speeds for 20 different specialprimes 12 of which were considered in previous work For concreteness we considerinversion modulo the double-size prime 2511 minus 187 AranhandashBarretondashPereirandashRicardini [8]selected this prime for their ldquoM-511rdquo curve

Nath and Sarkar report that their Fermat inversion software for 2511 minus 187 takes 72804Haswell cycles 47062 Skylake cycles or 45014 Kaby Lake cycles The slowdown from2255 minus 19 more than 5times in each case is easy to explain the exponent is twice as longproducing almost twice as many multiplications each multiplication handles twice as manybits and multiplication cost per bit is not constant

Our 511-bit software is solidly faster under 30000 cycles on all of these platforms Mostof the cycles in our computation are in evaluating jump64divsteps2 and the number ofcalls to jump64divsteps2 scales linearly with the number of bits in the input There issome overhead for multiplications so a modulus of twice the size takes more than twice aslong but our scalability to large moduli is much better than the Fermat scalability

Our advantage is also larger in applications that use ldquorandomrdquo primes rather thanspecial primes we pay the extra cost for Montgomery multiplication only at the end ofour computation whereas Fermat inversion pays the extra cost in each step

References[1] mdash (no editor) Actes du congregraves international des matheacutematiciens tome 3 Gauthier-

Villars Eacutediteur Paris 1971 See [41]

[2] mdash (no editor) CWIT 2011 12th Canadian workshop on information theoryKelowna British Columbia May 17ndash20 2011 Institute of Electrical and ElectronicsEngineers 2011 ISBN 978-1-4577-0743-8 See [51]

[3] Carlisle Adams Jan Camenisch (editors) Selected areas in cryptographymdashSAC 201724th international conference Ottawa ON Canada August 16ndash18 2017 revisedselected papers Lecture Notes in Computer Science 10719 Springer 2018 ISBN978-3-319-72564-2 See [15]

[4] Carlos Aguilar Melchor Nicolas Aragon Slim Bettaieb Loiumlc Bidoux OlivierBlazy Jean-Christophe Deneuville Philippe Gaborit Edoardo Persichetti GillesZeacutemor Hamming Quasi-Cyclic (HQC) (2017) URL httpspqc-hqcorgdochqc-specification_2017-11-30pdf Citations in this document sect1

[5] Alfred V Aho (chairman) Proceedings of fifth annual ACM symposium on theory ofcomputing Austin Texas April 30ndashMay 2 1973 ACM New York 1973 See [53]

[6] Martin Albrecht Carlos Cid Kenneth G Paterson CJ Tjhai Martin TomlinsonNTS-KEM ldquoThe main document submitted to NISTrdquo (2017) URL httpsnts-kemio Citations in this document sect1

[7] Alejandro Cabrera Aldaya Cesar Pereida Garciacutea Luis Manuel Alvarez Tapia BillyBob Brumley Cache-timing attacks on RSA key generation (2018) URL httpseprintiacrorg2018367 Citations in this document sect1

[8] Diego F Aranha Paulo S L M Barreto Geovandro C C F Pereira JeffersonE Ricardini A note on high-security general-purpose elliptic curves (2013) URLhttpseprintiacrorg2013647 Citations in this document sect124

Daniel J Bernstein and Bo-Yin Yang 31

[9] Elwyn R Berlekamp Algebraic coding theory McGraw-Hill 1968 Citations in thisdocument sect1

[10] Daniel J Bernstein Curve25519 new Diffie-Hellman speed records in PKC 2006 [70](2006) 207ndash228 URL httpscryptopapershtmlcurve25519 Citations inthis document sect1 sect12

[11] Daniel J Bernstein Fast multiplication and its applications in [27] (2008) 325ndash384 URL httpscryptopapershtmlmultapps Citations in this documentsect55

[12] Daniel J Bernstein Tung Chou Tanja Lange Ingo von Maurich Rafael MisoczkiRuben Niederhagen Edoardo Persichetti Christiane Peters Peter Schwabe NicolasSendrier Jakub Szefer Wen Wang Classic McEliece conservative code-based cryp-tography ldquoSupporting Documentationrdquo (2017) URL httpsclassicmcelieceorgnisthtml Citations in this document sect1

[13] Daniel J Bernstein Tung Chou Peter Schwabe McBits fast constant-time code-based cryptography in CHES 2013 [18] (2013) 250ndash272 URL httpsbinarycryptomcbitshtml Citations in this document sect1 sect1

[14] Daniel J Bernstein Chitchanok Chuengsatiansup Tanja Lange Christine vanVredendaal NTRU Prime reducing attack surface at low cost full version of[15] (2017) URL httpsntruprimecryptopapershtml Citations in thisdocument sect1 sect1 sect11 sect73 sect73 sect74

[15] Daniel J Bernstein Chitchanok Chuengsatiansup Tanja Lange Christine vanVredendaal NTRU Prime reducing attack surface at low cost in SAC 2017 [3]abbreviated version of [14] (2018) 235ndash260

[16] Daniel J Bernstein Niels Duif Tanja Lange Peter Schwabe Bo-Yin Yang High-speed high-security signatures Journal of Cryptographic Engineering 2 (2012)77ndash89 see also older version at CHES 2011 URL httpsed25519cryptoed25519-20110926pdf Citations in this document sect1

[17] Daniel J Bernstein Peter Schwabe NEON crypto in CHES 2012 [56] (2012) 320ndash339URL httpscryptopapershtmlneoncrypto Citations in this documentsect1 sect1

[18] Guido Bertoni Jean-Seacutebastien Coron (editors) Cryptographic hardware and embeddedsystemsmdashCHES 2013mdash15th international workshop Santa Barbara CA USAAugust 20ndash23 2013 proceedings Lecture Notes in Computer Science 8086 Springer2013 ISBN 978-3-642-40348-4 See [13]

[19] Adam W Bojanczyk Richard P Brent A systolic algorithm for extended GCDcomputation Computers amp Mathematics with Applications 14 (1987) 233ndash238ISSN 0898-1221 URL httpsmaths-peopleanueduau~brentpdrpb096ipdf Citations in this document sect13 sect13

[20] Tomas J Boothby and Robert W Bradshaw Bitslicing and the method of fourRussians over larger finite fields (2009) URL httparxivorgabs09011413Citations in this document sect71 sect72

[21] Joppe W Bos Constant time modular inversion Journal of Cryptographic Engi-neering 4 (2014) 275ndash281 URL httpjoppeboscomfilesCTInversionpdfCitations in this document sect1 sect1 sect1 sect1 sect1 sect13

32 Fast constant-time gcd computation and modular inversion

[22] Richard P Brent Fred G Gustavson David Y Y Yun Fast solution of Toeplitzsystems of equations and computation of Padeacute approximants Journal of Algorithms1 (1980) 259ndash295 ISSN 0196-6774 URL httpsmaths-peopleanueduau~brentpdrpb059ipdf Citations in this document sect14 sect14

[23] Richard P Brent Hsiang-Tsung Kung Systolic VLSI arrays for polynomial GCDcomputation IEEE Transactions on Computers C-33 (1984) 731ndash736 URLhttpsmaths-peopleanueduau~brentpdrpb073ipdf Citations in thisdocument sect13 sect13 sect32

[24] Richard P Brent Hsiang-Tsung Kung A systolic VLSI array for integer GCDcomputation ARITH-7 (1985) 118ndash125 URL httpsmaths-peopleanueduau~brentpdrpb077ipdf Citations in this document sect13 sect13 sect13 sect13 sect14sect86 sect86 sect86

[25] Richard P Brent Paul Zimmermann An O(M(n) logn) algorithm for the Jacobisymbol in ANTS 2010 [37] (2010) 83ndash95 URL httpsarxivorgpdf10042091pdf Citations in this document sect14

[26] Duncan A Buell (editor) Algorithmic number theory 6th international symposiumANTS-VI Burlington VT USA June 13ndash18 2004 proceedings Lecture Notes inComputer Science 3076 Springer 2004 ISBN 3-540-22156-5 See [62]

[27] Joe P Buhler Peter Stevenhagen (editors) Surveys in algorithmic number theoryMathematical Sciences Research Institute Publications 44 Cambridge UniversityPress New York 2008 See [11]

[28] Wouter Castryck Tanja Lange Chloe Martindale Lorenz Panny Joost RenesCSIDH an efficient post-quantum commutative group action in Asiacrypt 2018[55] (2018) 395ndash427 URL httpscsidhisogenyorgcsidh-20181118pdfCitations in this document sect1

[29] Cong Chen Jeffrey Hoffstein William Whyte Zhenfei Zhang NIST PQ SubmissionNTRUEncrypt A lattice based encryption algorithm (2017) URL httpscsrcnistgovProjectsPost-Quantum-CryptographyRound-1-Submissions Cita-tions in this document sect73

[30] Tung Chou McBits revisited in CHES 2017 [33] (2017) 213ndash231 URL httpstungchougithubiomcbits Citations in this document sect1

[31] Benoicirct Daireaux Veacuteronique Maume-Deschamps Brigitte Valleacutee The Lyapunovtortoise and the dyadic hare in [49] (2005) 71ndash94 URL httpemisamsorgjournalsDMTCSpdfpapersdmAD0108pdf Citations in this document sectE5sectF2

[32] Jean Louis Dornstetter On the equivalence between Berlekamprsquos and Euclidrsquos algo-rithms IEEE Transactions on Information Theory 33 (1987) 428ndash431 Citations inthis document sect1

[33] Wieland Fischer Naofumi Homma (editors) Cryptographic hardware and embeddedsystemsmdashCHES 2017mdash19th international conference Taipei Taiwan September25ndash28 2017 proceedings Lecture Notes in Computer Science 10529 Springer 2017ISBN 978-3-319-66786-7 See [30] [39]

[34] Hayato Fujii Diego F Aranha Curve25519 for the Cortex-M4 and beyond LATIN-CRYPT 2017 to appear (2017) URL httpswwwlascaicunicampbrmediapublicationspaper39pdf Citations in this document sect11 sect123

Daniel J Bernstein and Bo-Yin Yang 33

[35] Philippe Gaborit Carlos Aguilar Melchor Cryptographic method for communicatingconfidential information patent EP2537284B1 (2010) URL httpspatentsgooglecompatentEP2537284B1en Citations in this document sect74

[36] Mariya Georgieva Freacutedeacuteric de Portzamparc Toward secure implementation ofMcEliece decryption in COSADE 2015 [48] (2015) 141ndash156 URL httpseprintiacrorg2015271 Citations in this document sect1

[37] Guillaume Hanrot Franccedilois Morain Emmanuel Thomeacute (editors) Algorithmic numbertheory 9th international symposium ANTS-IX Nancy France July 19ndash23 2010proceedings Lecture Notes in Computer Science 6197 2010 ISBN 978-3-642-14517-9See [25]

[38] Agnes E Heydtmann Joslashrn M Jensen On the equivalence of the Berlekamp-Masseyand the Euclidean algorithms for decoding IEEE Transactions on Information Theory46 (2000) 2614ndash2624 Citations in this document sect1

[39] Andreas Huumllsing Joost Rijneveld John M Schanck Peter Schwabe High-speedkey encapsulation from NTRU in CHES 2017 [33] (2017) 232ndash252 URL httpseprintiacrorg2017667 Citations in this document sect1 sect11 sect11 sect13 sect7sect7 sect72 sect72 sect72 sect72 sect72 sect72 sect72 sect72 sect73 sect73 sect73 sect73 sect73

[40] Burton S Kaliski The Montgomery inverse and its applications IEEE Transactionson Computers 44 (1995) 1064ndash1065 Citations in this document sect1

[41] Donald E Knuth The analysis of algorithms in [1] (1971) 269ndash274 Citations inthis document sect1 sect14 sect14

[42] Adam Langley CECPQ2 (2018) URL httpswwwimperialvioletorg20181212cecpq2html Citations in this document sect74

[43] Adam Langley Add initial HRSS support (2018)URL httpsboringsslgooglesourcecomboringssl+7b935937b18215294e7dbf6404742855e3349092cryptohrsshrssc Citationsin this document sect72

[44] Derrick H Lehmer Euclidrsquos algorithm for large numbers American MathematicalMonthly 45 (1938) 227ndash233 ISSN 0002-9890 URL httplinksjstororgsicisici=0002-9890(193804)454lt227EAFLNgt20CO2-Y Citations in thisdocument sect1 sect14 sect121

[45] Xianhui Lu Yamin Liu Dingding Jia Haiyang Xue Jingnan He Zhenfei ZhangLAC lattice-based cryptosystems (2017) URL httpscsrcnistgovProjectsPost-Quantum-CryptographyRound-1-Submissions Citations in this documentsect1

[46] Vadim Lyubashevsky Chris Peikert Oded Regev On ideal lattices and learningwith errors over rings Journal of the ACM 60 (2013) Article 43 35 pages URLhttpseprintiacrorg2012230 Citations in this document sect74

[47] Vadim Lyubashevsky Gregor Seiler NTTRU truly fast NTRU using NTT (2019)URL httpseprintiacrorg2019040 Citations in this document sect74

[48] Stefan Mangard Axel Y Poschmann (editors) Constructive side-channel analysisand secure designmdash6th international workshop COSADE 2015 Berlin GermanyApril 13ndash14 2015 revised selected papers Lecture Notes in Computer Science 9064Springer 2015 ISBN 978-3-319-21475-7 See [36]

34 Fast constant-time gcd computation and modular inversion

[49] Conrado Martiacutenez (editor) 2005 international conference on analysis of algorithmspapers from the conference (AofArsquo05) held in Barcelona June 6ndash10 2005 TheAssociation Discrete Mathematics amp Theoretical Computer Science (DMTCS)Nancy 2005 URL httpwwwemisamsorgjournalsDMTCSproceedingsdmAD01indhtml See [31]

[50] James Massey Shift-register synthesis and BCH decoding IEEE Transactions onInformation Theory 15 (1969) 122ndash127 ISSN 0018-9448 Citations in this documentsect1

[51] Todd D Mateer On the equivalence of the Berlekamp-Massey and the euclideanalgorithms for algebraic decoding in CWIT 2011 [2] (2011) 139ndash142 Citations inthis document sect1

[52] Niels Moumlller On Schoumlnhagersquos algorithm and subquadratic integer GCD computationMathematics of Computation 77 (2008) 589ndash607 URL httpswwwlysatorliuse~nissearchiveS0025-5718-07-02017-0pdf Citations in this documentsect14

[53] Robert T Moenck Fast computation of GCDs in STOC 1973 [5] (1973) 142ndash151Citations in this document sect14 sect14 sect14 sect14

[54] Kaushik Nath Palash Sarkar Efficient inversion in (pseudo-)Mersenne primeorder fields (2018) URL httpseprintiacrorg2018985 Citations in thisdocument sect11 sect12 sect12

[55] Thomas Peyrin Steven D Galbraith Advances in cryptologymdashASIACRYPT 2018mdash24th international conference on the theory and application of cryptology and infor-mation security Brisbane QLD Australia December 2ndash6 2018 proceedings part ILecture Notes in Computer Science 11272 Springer 2018 ISBN 978-3-030-03325-5See [28]

[56] Emmanuel Prouff Patrick Schaumont (editors) Cryptographic hardware and embed-ded systemsmdashCHES 2012mdash14th international workshop Leuven Belgium September9ndash12 2012 proceedings Lecture Notes in Computer Science 7428 Springer 2012ISBN 978-3-642-33026-1 See [17]

[57] Martin Roetteler Michael Naehrig Krysta M Svore Kristin E Lauter Quantumresource estimates for computing elliptic curve discrete logarithms in ASIACRYPT2017 [67] (2017) 241ndash270 URL httpseprintiacrorg2017598 Citationsin this document sect11 sect13

[58] The Sage Developers (editor) SageMath the Sage Mathematics Software System(Version 80) 2017 URL httpswwwsagemathorg Citations in this documentsect12 sect12 sect22 sect5

[59] John Schanck Announcement of NTRU-HRSS-KEM and NTRUEncrypt merger(2018) URL httpsgroupsgooglecomalistnistgovdmsgpqc-forumSrFO_vK3xbImSmjY0HZCgAJ Citations in this document sect73

[60] Arnold Schoumlnhage Schnelle Berechnung von Kettenbruchentwicklungen Acta Infor-matica 1 (1971) 139ndash144 ISSN 0001-5903 Citations in this document sect1 sect14sect14

[61] Joseph H Silverman Almost inverses and fast NTRU key creation NTRU Cryptosys-tems (Technical Note014) (1999) URL httpsassetssecurityinnovationcomstaticdownloadsNTRUresourcesNTRUTech014pdf Citations in this doc-ument sect73 sect73

Daniel J Bernstein and Bo-Yin Yang 35

[62] Damien Stehleacute Paul Zimmermann A binary recursive gcd algorithm in ANTS 2004[26] (2004) 411ndash425 URL httpspersoens-lyonfrdamienstehleBINARYhtml Citations in this document sect14 sect14 sect85 sectE sectE6 sectE6 sectE6 sectF2 sectF2sectF2 sectH sectH

[63] Josef Stein Computational problems associated with Racah algebra Journal ofComputational Physics 1 (1967) 397ndash405 Citations in this document sect1

[64] Simon Stevin Lrsquoarithmeacutetique Imprimerie de Christophle Plantin 1585 URLhttpwwwdwcknawnlpubbronnenSimon_Stevin-[II_B]_The_Principal_Works_of_Simon_Stevin_Mathematicspdf Citations in this document sect1

[65] Volker Strassen The computational complexity of continued fractions in SYM-SAC1981 [69] (1981) 51ndash67 see also newer version [66]

[66] Volker Strassen The computational complexity of continued fractions SIAM Journalon Computing 12 (1983) 1ndash27 see also older version [65] ISSN 0097-5397 Citationsin this document sect14 sect14 sect53 sect55

[67] Tsuyoshi Takagi Thomas Peyrin (editors) Advances in cryptologymdashASIACRYPT2017mdash23rd international conference on the theory and applications of cryptology andinformation security Hong Kong China December 3ndash7 2017 proceedings part IILecture Notes in Computer Science 10625 Springer 2017 ISBN 978-3-319-70696-2See [57]

[68] Emmanuel Thomeacute Subquadratic computation of vector generating polynomials andimprovement of the block Wiedemann algorithm Journal of Symbolic Computation33 (2002) 757ndash775 URL httpshalinriafrinria-00103417document Ci-tations in this document sect14

[69] Paul S Wang (editor) SYM-SAC rsquo81 proceedings of the 1981 ACM Symposium onSymbolic and Algebraic Computation Snowbird Utah August 5ndash7 1981 Associationfor Computing Machinery New York 1981 ISBN 0-89791-047-8 See [65]

[70] Moti Yung Yevgeniy Dodis Aggelos Kiayias Tal Malkin (editors) Public keycryptographymdash9th international conference on theory and practice in public-keycryptography New York NY USA April 24ndash26 2006 proceedings Lecture Notesin Computer Science 3958 Springer 2006 ISBN 978-3-540-33851-2 See [10]

A Proof of main gcd theorem for polynomialsWe give two separate proofs of the facts stated earlier relating gcd and modular reciprocalto a particular iterate of divstep in the polynomial case

bull The second proof consists of a review of the EuclidndashStevin gcd algorithm (Ap-pendix B) a description of exactly how the intermediate results in the EuclidndashStevinalgorithm relate to iterates of divstep (Appendix C) and finally a translationof gcdreciprocal results from the EuclidndashStevin context to the divstep context(Appendix D)

bull The first proof given in this appendix is shorter it works directly with divstepwithout a detour through the EuclidndashStevin labeling of intermediate results

36 Fast constant-time gcd computation and modular inversion

Theorem A1 Let k be a field Let d be a positive integer Let R0 R1 be elements of thepolynomial ring k[x] with degR0 = d gt degR1 Define f = xdR0(1x) g = xdminus1R1(1x)and (δn fn gn) = divstepn(1 f g) for n ge 0 Then

2 deg fn le 2dminus 1minus n+ δn for n ge 02 deg gn le 2dminus 1minus nminus δn for n ge 0

Proof Induct on n If n = 0 then (δn fn gn) = (1 f g) so 2 deg fn = 2 deg f le 2d =2dminus 1minus n+ δn and 2 deg gn = 2 deg g le 2(dminus 1) = 2dminus 1minus nminus δn

Assume from now on that n ge 1 By the inductive hypothesis 2 deg fnminus1 le 2dminusn+δnminus1and 2 deg gnminus1 le 2dminus nminus δnminus1

Case 1 δnminus1 le 0 Then

(δn fn gn) = (1 + δnminus1 fnminus1 (fnminus1(0)gnminus1 minus gnminus1(0)fnminus1)x)

Both fnminus1 and gnminus1 have degree at most (2d minus n minus δnminus1)2 so the polynomial xgn =fnminus1(0)gnminus1 minus gnminus1(0)fnminus1 also has degree at most (2d minus n minus δnminus1)2 so 2 deg gn le2dminus nminus δnminus1 minus 2 = 2dminus 1minus nminus δn Also 2 deg fn = 2 deg fnminus1 le 2dminus 1minus n+ δn

Case 2 δnminus1 gt 0 and gnminus1(0) = 0 Then

(δn fn gn) = (1 + δnminus1 fnminus1 fnminus1(0)gnminus1x)

so 2 deg gn = 2 deg gnminus1 minus 2 le 2dminus nminus δnminus1 minus 2 = 2dminus 1minus nminus δn As before 2 deg fn =2 deg fnminus1 le 2dminus 1minus n+ δn

Case 3 δnminus1 gt 0 and gnminus1(0) 6= 0 Then

(δn fn gn) = (1minus δnminus1 gnminus1 (gnminus1(0)fnminus1 minus fnminus1(0)gnminus1)x)

This time the polynomial xgn = gnminus1(0)fnminus1 minus fnminus1(0)gnminus1 also has degree at most(2d minus n + δnminus1)2 so 2 deg gn le 2d minus n + δnminus1 minus 2 = 2d minus 1 minus n minus δn Also 2 deg fn =2 deg gnminus1 le 2dminus nminus δnminus1 = 2dminus 1minus n+ δn

Theorem A2 In the situation of Theorem A1 define Tn = T (δn fn gn) and(un vnqn rn

)= Tnminus1 middot middot middot T0

Then xn(unrnminusvnqn) isin klowast for n ge 0 xnminus1un isin k[x] for n ge 1 xnminus1vn xnqn x

nrn isin k[x]for n ge 0 and

2 deg(xnminus1un) le nminus 3 + δn for n ge 12 deg(xnminus1vn) le nminus 1 + δn for n ge 0

2 deg(xnqn) le nminus 1minus δn for n ge 02 deg(xnrn) le n+ 1minus δn for n ge 0

Proof Each matrix Tn has determinant in (1x)klowast by construction so the product of nsuch matrices has determinant in (1xn)klowast In particular unrn minus vnqn isin (1xn)klowast

Induct on n For n = 0 we have δn = 1 un = 1 vn = 0 qn = 0 rn = 1 soxnminus1vn = 0 isin k[x] xnqn = 0 isin k[x] xnrn = 1 isin k[x] 2 deg(xnminus1vn) = minusinfin le nminus 1 + δn2 deg(xnqn) = minusinfin le nminus 1minus δn and 2 deg(xnrn) = 0 le n+ 1minus δn

Assume from now on that n ge 1 By Theorem 43 un vn isin (1xnminus1)k[x] andqn rn isin (1xn)k[x] so all that remains is to prove the degree bounds

Case 1 δnminus1 le 0 Then δn = 1 + δnminus1 un = unminus1 vn = vnminus1 qn = (fnminus1(0)qnminus1 minusgnminus1(0)unminus1)x and rn = (fnminus1(0)rnminus1 minus gnminus1(0)vnminus1)x

Daniel J Bernstein and Bo-Yin Yang 37

Note that n gt 1 since δ0 gt 0 By the inductive hypothesis 2 deg(xnminus2unminus1) le nminus 4 +δnminus1 so 2 deg(xnminus1un) le nminus 2 + δnminus1 = nminus 3 + δn Also 2 deg(xnminus2vnminus1) le nminus 2 + δnminus1so 2 deg(xnminus1vn) le n+ δnminus1 = nminus 1 + δn

For qn note that both xnminus1qnminus1 and xnminus1unminus1 have degree at most (nminus2minusδnminus1)2 sothe same is true for their k-linear combination xnqn Hence 2 deg(xnqn) le nminus 2minus δnminus1 =nminus 1minus δn Similarly 2 deg(xnrn) le nminus δnminus1 = n+ 1minus δn

Case 2 δnminus1 gt 0 and gnminus1(0) = 0 Then δn = 1 + δnminus1 un = unminus1 vn = vnminus1qn = fnminus1(0)qnminus1x and rn = fnminus1(0)rnminus1x

If n = 1 then un = u0 = 1 so 2 deg(xnminus1un) = 0 = n minus 3 + δn Otherwise2 deg(xnminus2unminus1) le nminus4 + δnminus1 by the inductive hypothesis so 2 deg(xnminus1un) le nminus3 + δnas above

Also 2 deg(xnminus1vn) le nminus 1 + δn as above 2 deg(xnqn) = 2 deg(xnminus1qnminus1) le nminus 2minusδnminus1 = nminus 1minus δn and 2 deg(xnrn) = 2 deg(xnminus1rnminus1) le nminus δnminus1 = n+ 1minus δn

Case 3 δnminus1 gt 0 and gnminus1(0) 6= 0 Then δn = 1 minus δnminus1 un = qnminus1 vn = rnminus1qn = (gnminus1(0)unminus1 minus fnminus1(0)qnminus1)x and rn = (gnminus1(0)vnminus1 minus fnminus1(0)rnminus1)x

If n = 1 then un = q0 = 0 so 2 deg(xnminus1un) = minusinfin le n minus 3 + δn Otherwise2 deg(xnminus1qnminus1) le nminus 2minus δnminus1 by the inductive hypothesis so 2 deg(xnminus1un) le nminus 2minusδnminus1 = nminus 3 + δn Similarly 2 deg(xnminus1rnminus1) le nminus δnminus1 by the inductive hypothesis so2 deg(xnminus1vn) le nminus δnminus1 = nminus 1 + δn

Finally for qn note that both xnminus1unminus1 and xnminus1qnminus1 have degree at most (n minus2 + δnminus1)2 so the same is true for their k-linear combination xnqn so 2 deg(xnqn) lenminus 2 + δnminus1 = nminus 1minus δn Similarly 2 deg(xnrn) le n+ δnminus1 = n+ 1minus δn

First proof of Theorem 62 Write ϕ = f2dminus1 By Theorem A1 2 degϕ le δ2dminus1 and2 deg g2dminus1 le minusδ2dminus1 Each fn is nonzero by construction so degϕ ge 0 so δ2dminus1 ge 0

By Theorem A2 x2dminus2u2dminus1 is a polynomial of degree at most dminus 2 + δ2dminus12 DefineC0 = xminusd+δ2dminus12u2dminus1(1x) then C0 is a polynomial of degree at most dminus 2 + δ2dminus12Similarly define C1 = xminusd+1+δ2dminus12v2dminus1(1x) and observe that C1 is a polynomial ofdegree at most dminus 1 + δ2dminus12

Multiply C0 by R0 = xdf(1x) multiply C1 by R1 = xdminus1g(1x) and add to obtain

C0R0 + C1R1 = xδ2dminus12(u2dminus1(1x)f(1x) + v2dminus1(1x)g(1x))= xδ2dminus12ϕ(1x)

by Theorem 42Define Γ as the monic polynomial xδ2dminus12ϕ(1x)ϕ(0) of degree δ2dminus12 We have

just shown that Γ isin R0k[x] + R1k[x] specifically Γ = (C0ϕ(0))R0 + (C1ϕ(0))R1 Tofinish the proof we will show that R0 isin Γk[x] This forces Γ = gcdR0 R1 = G which inturn implies degG = δ2dminus12 and V = C1ϕ(0)

Observe that δ2d = 1 + δ2dminus1 and f2d = ϕ the ldquoswappedrdquo case is impossible here sinceδ2dminus1 gt 0 implies g2dminus1 = 0 By Theorem A1 again 2 deg g2d le minus1minus δ2d = minus2minus δ2dminus1so g2d = 0

By Theorem 42 ϕ = f2d = u2df + v2dg and 0 = g2d = q2df + r2dg so x2dr2dϕ = ∆fwhere ∆ = x2d(u2dr2dminus v2dq2d) By Theorem A2 ∆ isin klowast and p = x2dr2d is a polynomialof degree at most (2d+ 1minus δ2d)2 = dminus δ2dminus12 The polynomials xdminusδ2dminus12p(1x) andxδ2dminus12ϕ(1x) = Γ have product ∆xdf(1x) = ∆R0 so Γ divides R0 as claimed

B Review of the EuclidndashStevin algorithmThis appendix reviews the EuclidndashStevin algorithm to compute polynomial gcd includingthe extended version of the algorithm that writes each remainder as a linear combinationof the two inputs

38 Fast constant-time gcd computation and modular inversion

Theorem B1 (the EuclidndashStevin algorithm) Let k be a field Let R0 R1 be ele-ments of the polynomial ring k[x] Recursively define a finite sequence of polynomialsR2 R3 Rr isin k[x] as follows if i ge 1 and Ri = 0 then r = i if i ge 1 and Ri 6= 0 thenRi+1 = Riminus1 mod Ri Define di = degRi and `i = Ri[di] Then

bull d1 gt d2 gt middot middot middot gt dr = minusinfin

bull `i 6= 0 if 1 le i le r minus 1

bull if `rminus1 6= 0 then gcdR0 R1 = Rrminus1`rminus1 and

bull if `rminus1 = 0 then R0 = R1 = gcdR0 R1 = 0 and r = 1

Proof If 1 le i le r minus 1 then Ri 6= 0 so `i = Ri[degRi] 6= 0 Also Ri+1 = Riminus1 mod Ri sodegRi+1 lt degRi ie di+1 lt di Hence d1 gt d2 gt middot middot middot gt dr and dr = degRr = deg 0 =minusinfin

If `rminus1 = 0 then r must be 1 so R1 = 0 (since Rr = 0) and R0 = 0 (since `0 = 0) sogcdR0 R1 = 0 Assume from now on that `rminus1 6= 0 The divisibility of Ri+1 minus Riminus1by Ri implies gcdRiminus1 Ri = gcdRi Ri+1 so gcdR0 R1 = gcdR1 R2 = middot middot middot =gcdRrminus1 Rr = gcdRrminus1 0 = Rrminus1`rminus1

Theorem B2 (the extended EuclidndashStevin algorithm) In the situation of Theorem B1define U0 V0 Ur Ur isin k[x] as follows U0 = 1 V0 = 0 U1 = 0 V1 = 1 Ui+1 = Uiminus1minusbRiminus1RicUi for i isin 1 r minus 1 Vi+1 = Viminus1 minus bRiminus1RicVi for i isin 1 r minus 1Then

bull Ri = UiR0 + ViR1 for i isin 0 r

bull UiVi+1 minus Ui+1Vi = (minus1)i for i isin 0 r minus 1

bull Vi+1Ri minus ViRi+1 = (minus1)iR0 for i isin 0 r minus 1 and

bull if d0 gt d1 then deg Vi+1 = d0 minus di for i isin 0 r minus 1

Proof U0R0 + V0R1 = 1R0 + 0R1 = R0 and U1R0 + V1R1 = 0R0 + 1R1 = R1 Fori isin 1 r minus 1 assume inductively that Riminus1 = Uiminus1R0+Viminus1R1 and Ri = UiR0+ViR1and writeQi = bRiminus1Ric Then Ri+1 = Riminus1 mod Ri = Riminus1minusQiRi = (Uiminus1minusQiUi)R0+(Viminus1 minusQiVi)R1 = Ui+1R0 + Vi+1R1

If i = 0 then UiVi+1minusUi+1Vi = 1 middot 1minus 0 middot 0 = 1 = (minus1)i For i isin 1 r minus 1 assumeinductively that Uiminus1Vi minus UiViminus1 = (minus1)iminus1 then UiVi+1 minus Ui+1Vi = Ui(Viminus1 minus QVi) minus(Uiminus1 minusQUi)Vi = minus(Uiminus1Vi minus UiViminus1) = (minus1)i

Multiply Ri = UiR0 + ViR1 by Vi+1 multiply Ri+1 = Ui+1R0 + Vi+1R1 by Ui+1 andsubtract to see that Vi+1Ri minus ViRi+1 = (minus1)iR0

Finally assume d0 gt d1 If i = 0 then deg Vi+1 = deg 1 = 0 = d0 minus di Fori isin 1 r minus 1 assume inductively that deg Vi = d0 minus diminus1 Then deg ViRi+1 lt d0since degRi+1 = di+1 lt diminus1 so deg Vi+1Ri = deg(ViRi+1 +(minus1)iR0) = d0 so deg Vi+1 =d0 minus di

C EuclidndashStevin results as iterates of divstepIf the EuclidndashStevin algorithm is written as a sequence of polynomial divisions andeach polynomial division is written as a sequence of coefficient-elimination steps theneach coefficient-elimination step corresponds to one application of divstep The exactcorrespondence is stated in Theorem C2 This generalizes the known BerlekampndashMasseyequivalence mentioned in Section 1

Theorem C1 is a warmup showing how a single polynomial division relates to asequence of division steps

Daniel J Bernstein and Bo-Yin Yang 39

Theorem C1 (polynomial division as a sequence of division steps) Let k be a fieldLet R0 R1 be elements of the polynomial ring k[x] Write d0 = degR0 and d1 = degR1Assume that d0 gt d1 ge 0 Then

divstep2d0minus2d1(1 xd0R0(1x) xd0minus1R1(1x))= (1 `d0minusd1minus1

0 xd1R1(1x) (`d0minusd1minus10 `1)d0minusd1+1xd1minus1R2(1x))

where `0 = R0[d0] `1 = R1[d1] and R2 = R0 mod R1

Proof Part 1 The first d0 minus d1 minus 1 iterations If δ isin 1 2 d0 minus d1 minus 1 thend0minusδ gt d1 so R1[d0minusδ] = 0 so the polynomial `δminus1

0 xd0minusδR1(1x) has constant coefficient0 The polynomial xd0R0(1x) has constant coefficient `0 Hence

divstep(δ xd0R0(1x) `δminus10 xd0minusδR1(1x)) = (1 + δ xd0R0(1x) `δ0xd0minusδminus1R1(1x))

by definition of divstep By induction

divstepd0minusd1minus1(1 xd0R0(1x) xd0minus1R1(1x))= (d0 minus d1 x

d0R0(1x) `d0minusd1minus10 xd1R1(1x))

= (d0 minus d1 xd0R0(1x) xd1Rprime1(1x))

where Rprime1 = `d0minusd1minus10 R1 Note that Rprime1[d1] = `prime1 where `prime1 = `d0minusd1minus1

0 `1

Part 2 The swap iteration and the last d0 minus d1 iterations Define

Zd0minusd1 = R0 mod xd0minusd1Rprime1Zd0minusd1+1 = R0 mod xd0minusd1minus1Rprime1

Z2d0minus2d1 = R0 mod Rprime1 = R0 mod R1 = R2

ThenZd0minusd1 = R0 minus (R0[d0]`prime1)xd0minusd1Rprime1

Zd0minusd1+1 = Zd0minusd1 minus (Zd0minusd1 [d0 minus 1]`prime1)xd0minusd1minus1Rprime1

Z2d0minus2d1 = Z2d0minus2d1minus1 minus (Z2d0minus2d1minus1[d1]`prime1)Rprime1

Substitute 1x for x to see that

xd0Zd0minusd1(1x) = xd0R0(1x)minus (R0[d0]`prime1)xd1Rprime1(1x)xd0minus1Zd0minusd1+1(1x) = xd0minus1Zd0minusd1(1x)minus (Zd0minusd1 [d0 minus 1]`prime1)xd1Rprime1(1x)

xd1Z2d0minus2d1(1x) = xd1Z2d0minus2d1minus1(1x)minus (Z2d0minus2d1minus1[d1]`prime1)xd1Rprime1(1x)

We will use these equations to show that the d0 minus d1 2d0 minus 2d1 iterates of divstepcompute `prime1xd0minus1Zd0minusd1 (`prime1)d0minusd1+1xd1minus1Z2d0minus2d1 respectively

The constant coefficients of xd0R0(1x) and xd1Rprime1(1x) are R0[d0] and `prime1 6= 0 respec-tively and `prime1xd0R0(1x)minusR0[d0]xd1Rprime1(1x) = `prime1x

d0Zd0minusd1(1x) so

divstep(d0 minus d1 xd0R0(1x) xd1Rprime1(1x))

= (1minus (d0 minus d1) xd1Rprime1(1x) `prime1xd0minus1Zd0minusd1(1x))

40 Fast constant-time gcd computation and modular inversion

The constant coefficients of xd1Rprime1(1x) and `prime1xd0minus1Zd0minusd1(1x) are `prime1 and `prime1Zd0minusd1 [d0minus1]respectively and

(`prime1)2xd0minus1Zd0minusd1(1x)minus `prime1Zd0minusd1 [d0 minus 1]xd1Rprime1(1x) = (`prime1)2xd0minus1Zd0minusd1+1(1x)

sodivstep(1minus (d0 minus d1) xd1Rprime1(1x) `prime1xd0minus1Zd0minusd1(1x))

= (2minus (d0 minus d1) xd1Rprime1(1x) (`prime1)2xd0minus2Zd0minusd1+1(1x))

Continue in this way to see that

divstepi(d0 minus d1 xd0R0(1x) xd1Rprime1(1x))

= (iminus (d0 minus d1) xd1Rprime1(1x) (`prime1)ixd0minusiZd0minusd1+iminus1(1x))

for 1 le i le d0 minus d1 + 1 In particular

divstep2d0minus2d1(1 xd0R0(1x) xd0minus1R1(1x))= divstepd0minusd1+1(d0 minus d1 x

d0R0(1x) xd1Rprime1(1x))= (1 xd1Rprime1(1x) (`prime1)d0minusd1+1xd1minus1R2(1x))= (1 `d0minusd1minus1

0 xd1R1(1x) (`d0minusd1minus10 `1)d0minusd1+1xd1minus1R2(1x))

as claimed

Theorem C2 In the situation of Theorem B2 assume that d0 gt d1 Define f =xd0R0(1x) and g = xd0minus1R1(1x) Then f g isin k[x] Define (δn fn gn) = divstepn(1 f g)and Tn = T (δn fn gn)

Part A If i isin 0 1 r minus 2 and 2d0 minus 2di le n lt 2d0 minus di minus di+1 then

δn = 1 + nminus (2d0 minus 2di) gt 0fn = anx

diRi(1x)gn = bnx

2d0minusdiminusnminus1Ri+1(1x)

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdiUi(1x) an(1x)d0minusdiminus1Vi(1x)bn(1x)nminus(d0minusdi)+1Ui+1(1x) bn(1x)nminus(d0minusdi)Vi+1(1x)

)for some an bn isin klowast

Part B If i isin 0 1 r minus 2 and 2d0 minus di minus di+1 le n lt 2d0 minus 2di+1 then

δn = 1 + nminus (2d0 minus 2di+1) le 0fn = anx

di+1Ri+1(1x)gn = bnx

2d0minusdi+1minusnminus1Zn(1x)

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdi+1Ui+1(1x) an(1x)d0minusdi+1minus1Vi+1(1x)bn(1x)nminus(d0minusdi+1)+1Xn(1x) bn(1x)nminus(d0minusdi+1)Yn(1x)

)for some an bn isin klowast where

Zn = Ri mod x2d0minus2di+1minusnRi+1

Xn = Ui minuslfloorRix

2d0minus2di+1minusnRi+1rfloorx2d0minus2di+1minusnUi+1

Yn = Vi minuslfloorRix

2d0minus2di+1minusnRi+1rfloorx2d0minus2di+1minusnVi+1

Daniel J Bernstein and Bo-Yin Yang 41

Part C If n ge 2d0 minus 2drminus1 then

δn = 1 + nminus (2d0 minus 2drminus1) gt 0fn = anx

drminus1Rrminus1(1x)gn = 0

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdrminus1Urminus1(1x) an(1x)d0minusdrminus1minus1Vrminus1(1x)bn(1x)nminus(d0minusdrminus1)+1Ur(1x) bn(1x)nminus(d0minusdrminus1)Vr(1x)

)for some an bn isin klowast

The proof also gives recursive formulas for an bn namely (a0 b0) = (1 1) forthe base case (an bn) = (bnminus1 gnminus1(0)anminus1) for the swapped case and (an bn) =(anminus1 fnminus1(0)bnminus1) for the unswapped case One can if desired augment divstep totrack an and bn these are the denominators eliminated in a ldquofraction-freerdquo computation

Proof By hypothesis degR0 = d0 gt d1 = degR1 Hence d0 is a nonnegative integer andf = R0[d0] +R0[d0 minus 1]x+ middot middot middot+R0[0]xd0 isin k[x] Similarly g = R1[d0 minus 1] +R1[d0 minus 2]x+middot middot middot+R1[0]xd0minus1 isin k[x] this is also true for d0 = 0 with R1 = 0 and g = 0

We induct on n and split the analysis of n into six cases There are three differentformulas (A B C) applicable to different ranges of n and within each range we considerthe first n separately For brevity we write Ri = Ri(1x) U i = Ui(1x) V i = Vi(1x)Xn = Xn(1x) Y n = Yn(1x) Zn = Zn(1x)

Case A1 n = 2d0 minus 2di for some i isin 0 1 r minus 2If n = 0 then i = 0 By assumption δn = 1 as claimed fn = f = xd0R0 so fn = anx

d0R0as claimed where an = 1 gn = g = xd0minus1R1 so gn = bnx

d0minus1R1 as claimed where bn = 1Also Ui = 1 Ui+1 = 0 Vi = 0 Vi+1 = 1 and d0 minus di = 0 so(

an(1x)d0minusdiU i an(1x)d0minusdiminus1V ibn(1x)d0minusdi+1U i+1 bn(1x)d0minusdiV i+1

)=(

1 00 1

)= Tnminus1 middot middot middot T0

as claimedOtherwise n gt 0 so i gt 0 Now 2d0minusdiminus1minusdi le nminus 1 lt 2d0minus 2di so by the inductive

hypothesis δnminus1 = 1 + (nminus 1)minus (2d0 minus 2di) = 0 fnminus1 = anminus1xdiRi for some anminus1 isin klowast

gnminus1 = bnminus1xdiZnminus1 for some bnminus1 isin klowast where Znminus1 = Riminus1 mod xRi and

Tnminus2 middot middot middot T0 =(

anminus1(1x)d0minusdiU i anminus1(1x)d0minusdiminus1V ibnminus1(1x)d0minusdiXnminus1 bnminus1(1x)d0minusdiminus1Y nminus1

)where Xn = Uiminus1 minus bRiminus1xRicxUi and Yn = Viminus1 minus bRiminus1xRicxVi

Note that degZnminus1 le di Observe that

Ri+1 = Riminus1 mod Ri = Znminus1 minus (Znminus1[di]`i)RiUi+1 = Xnminus1 minus (Znminus1[di]`i)UiVi+1 = Ynminus1 minus (Znminus1[di]`i)Vi

The point is that subtracting (Znminus1[di]`i)Ri from Znminus1 eliminates the coefficient of xdi

and produces a polynomial of degree at most diminus 1 namely Znminus1 mod Ri = Riminus1 mod RiStarting from Ri+1 = Znminus1 minus (Znminus1[di]`i)Ri substitute 1x for x and multiply by

anminus1bnminus1`ixdi to see that

anminus1bnminus1`ixdiRi+1 = anminus1`ignminus1 minus bnminus1Znminus1[di]fnminus1

= fnminus1(0)gnminus1 minus gnminus1(0)fnminus1

42 Fast constant-time gcd computation and modular inversion

Now(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1)

= (1 + δnminus1 fnminus1 (fnminus1(0)gnminus1 minus gnminus1(0)fnminus1)x)

Hence δn = 1 = 1 + nminus (2d0 minus 2di) gt 0 as claimed fn = anxdiRi where an = anminus1 isin klowast

as claimed and gn = bnxdiminus1Ri+1 where bn = anminus1bnminus1`i isin klowast as claimed

Finally U i+1 = Xnminus1 minus (Znminus1[di]`i)U i and V i+1 = Y nminus1 minus (Znminus1[di]`i)V i Substi-tute these into the matrix(

an(1x)d0minusdiU i an(1x)d0minusdiminus1V ibn(1x)d0minusdi+1U i+1 bn(1x)d0minusdiV i+1

)

along with an = anminus1 and bn = anminus1bnminus1`i to see that this matrix is

Tnminus1 =(

1 0minusbnminus1Znminus1[di]x anminus1`ix

)times Tnminus2 middot middot middot T0 shown above

Case A2 2d0 minus 2di lt n lt 2d0 minus di minus di+1 for some i isin 0 1 r minus 2By the inductive hypothesis δnminus1 = n minus (2d0 minus 2di) gt 0 fnminus1 = anminus1x

diRi whereanminus1 isin klowast gnminus1 = bnminus1x

2d0minusdiminusnRi+1 where bnminus1 isin klowast and

Tnminus2 middot middot middot T0 =(

anminus1(1x)d0minusdiU i anminus1(1x)d0minusdiminus1V ibnminus1(1x)nminus(d0minusdi)U i+1 bnminus1(1x)nminus(d0minusdi)minus1V i+1

)

Note that gnminus1(0) = bnminus1Ri+1[2d0 minus di minus n] = 0 since 2d0 minus di minus n gt di+1 = degRi+1Thus

(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1) = (1 + δnminus1 fnminus1 fnminus1(0)gnminus1x)

Hence δn = 1 + n minus (2d0 minus 2di) gt 0 as claimed fn = anxdiRi where an = anminus1 isin klowast

as claimed and gn = bnx2d0minusdiminusnminus1Ri+1 where bn = fnminus1(0)bnminus1 = anminus1bnminus1`i isin klowast as

claimedAlso Tnminus1 =

(1 00 anminus1`ix

) Multiply by Tnminus2 middot middot middot T0 above to see that

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdiU i an(1x)d0minusdiminus1V ibn(1x)nminus(d0minusdi)+1U i+1 bn`i(1x)nminus(d0minusdi)V i+1

)as claimed

Case B1 n = 2d0 minus di minus di+1 for some i isin 0 1 r minus 2Note that 2d0 minus 2di le nminus 1 lt 2d0 minus di minus di+1 By the inductive hypothesis δnminus1 =

diminusdi+1 gt 0 fnminus1 = anminus1xdiRi where anminus1 isin klowast gnminus1 = bnminus1x

di+1Ri+1 where bnminus1 isin klowastand

Tnminus2 middot middot middot T0 =(

anminus1(1x)d0minusdiU i anminus1(1x)d0minusdiminus1V ibnminus1(1x)d0minusdi+1U i+1 bnminus1`i(1x)d0minusdi+1minus1V i+1

)

Observe thatZn = Ri minus (`i`i+1)xdiminusdi+1Ri+1

Xn = Ui minus (`i`i+1)xdiminusdi+1Ui+1

Yn = Vi minus (`i`i+1)xdiminusdi+1Vi+1

Substitute 1x for x in the Zn equation and multiply by anminus1bnminus1`i+1xdi to see that

anminus1bnminus1`i+1xdiZn = bnminus1`i+1fnminus1 minus anminus1`ignminus1

= gnminus1(0)fnminus1 minus fnminus1(0)gnminus1

Daniel J Bernstein and Bo-Yin Yang 43

Also note that gnminus1(0) = bnminus1`i+1 6= 0 so

(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1)= (1minus δnminus1 gnminus1 (gnminus1(0)fnminus1 minus fnminus1(0)gnminus1)x)

Hence δn = 1 minus (di minus di+1) le 0 as claimed fn = anxdi+1Ri+1 where an = bnminus1 isin klowast as

claimed and gn = bnxdiminus1Zn where bn = anminus1bnminus1`i+1 isin klowast as claimed

Finally substitute Xn = U i minus (`i`i+1)(1x)diminusdi+1U i+1 and substitute Y n = V i minus(`i`i+1)(1x)diminusdi+1V i+1 into the matrix(

an(1x)d0minusdi+1U i+1 an(1x)d0minusdi+1minus1V i+1bn(1x)d0minusdi+1Xn bn(1x)d0minusdiY n

)

along with an = bnminus1 and bn = anminus1bnminus1`i+1 to see that this matrix is Tnminus1 =(0 1

bnminus1`i+1x minusanminus1`ix

)times Tnminus2 middot middot middot T0 shown above

Case B2 2d0 minus di minus di+1 lt n lt 2d0 minus 2di+1 for some i isin 0 1 r minus 2Now 2d0minusdiminusdi+1 le nminus1 lt 2d0minus2di+1 By the inductive hypothesis δnminus1 = nminus(2d0minus

2di+1) le 0 fnminus1 = anminus1xdi+1Ri+1 for some anminus1 isin klowast gnminus1 = bnminus1x

2d0minusdi+1minusnZnminus1 forsome bnminus1 isin klowast where Znminus1 = Ri mod x2d0minus2di+1minusn+1Ri+1 and

Tnminus2 middot middot middot T0 =(

anminus1(1x)d0minusdi+1U i+1 anminus1(1x)d0minusdi+1minus1V i+1bnminus1(1x)nminus(d0minusdi+1)Xnminus1 bnminus1(1x)nminus(d0minusdi+1)minus1Y nminus1

)where Xnminus1 = Ui minus

lfloorRix

2d0minus2di+1minusn+1Ri+1rfloorx2d0minus2di+1minusn+1Ui+1 and Ynminus1 = Vi minuslfloor

Rix2d0minus2di+1minusn+1Ri+1

rfloorx2d0minus2di+1minusn+1Vi+1

Observe that

Zn = Ri mod x2d0minus2di+1minusnRi+1

= Znminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnRi+1

Xn = Xnminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnUi+1

Yn = Ynminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnVi+1

The point is that degZnminus1 le 2d0 minus di+1 minus n and subtracting

(Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnRi+1

from Znminus1 eliminates the coefficient of x2d0minusdi+1minusnStarting from

Zn = Znminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnRi+1

substitute 1x for x and multiply by anminus1bnminus1`i+1x2d0minusdi+1minusn to see that

anminus1bnminus1`i+1x2d0minusdi+1minusnZn = anminus1`i+1gnminus1 minus bnminus1Znminus1[2d0 minus di+1 minus n]fnminus1

= fnminus1(0)gnminus1 minus gnminus1(0)fnminus1

Now(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1)

= (1 + δnminus1 fnminus1 (fnminus1(0)gnminus1 minus gnminus1(0)fnminus1)x)

Hence δn = 1 +nminus (2d0minus 2di+1) le 0 as claimed fn = anxdi+1Ri+1 where an = anminus1 isin klowast

as claimed and gn = bnx2d0minusdi+1minusnminus1Zn where bn = anminus1bnminus1`i+1 isin klowast as claimed

44 Fast constant-time gcd computation and modular inversion

Finally substitute

Xn = Xnminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)(1x)2d0minus2di+1minusnU i+1

andY n = Y nminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)(1x)2d0minus2di+1minusnV i+1

into the matrix (an(1x)d0minusdi+1U i+1 an(1x)d0minusdi+1minus1V i+1

bn(1x)nminus(d0minusdi+1)+1Xn bn(1x)nminus(d0minusdi+1)Y n

)

along with an = anminus1 and bn = anminus1bnminus1`i+1 to see that this matrix is Tnminus1 =(1 0

minusbnminus1Znminus1[2d0 minus di+1 minus n]x anminus1`i+1x

)times Tnminus2 middot middot middot T0 shown above

Case C1 n = 2d0 minus 2drminus1 This is essentially the same as Case A1 except that i isreplaced with r minus 1 and gn ends up as 0 For completeness we spell out the details

If n = 0 then r = 1 and R1 = 0 so g = 0 By assumption δn = 1 as claimedfn = f = xd0R0 so fn = anx

d0R0 as claimed where an = 1 gn = g = 0 as claimed Definebn = 1 Now Urminus1 = 1 Ur = 0 Vrminus1 = 0 Vr = 1 and d0 minus drminus1 = 0 so(

an(1x)d0minusdrminus1Urminus1 an(1x)d0minusdrminus1minus1V rminus1bn(1x)d0minusdrminus1+1Ur bn(1x)d0minusdrminus1V r

)=(

1 00 1

)= Tnminus1 middot middot middot T0

as claimedOtherwise n gt 0 so r ge 2 Now 2d0 minus drminus2 minus drminus1 le n minus 1 lt 2d0 minus 2drminus1 so

by the inductive hypothesis (for i = r minus 2) δnminus1 = 1 + (n minus 1) minus (2d0 minus 2drminus1) = 0fnminus1 = anminus1x

drminus1Rrminus1 for some anminus1 isin klowast gnminus1 = bnminus1xdrminus1Znminus1 for some bnminus1 isin klowast

where Znminus1 = Rrminus2 mod xRrminus1 and

Tnminus2 middot middot middot T0 =(anminus1(1x)d0minusdrminus1Urminus1 anminus1(1x)d0minusdrminus1minus1V rminus1bnminus1(1x)d0minusdrminus1Xnminus1 bnminus1(1x)d0minusdrminus1minus1Y nminus1

)where Xn = Urminus2 minus bRrminus2xRrminus1cxUrminus1 and Yn = Vrminus2 minus bRrminus2xRrminus1cxVrminus1

Observe that

0 = Rr = Rrminus2 mod Rrminus1 = Znminus1 minus (Znminus1[drminus1]`rminus1)Rrminus1

Ur = Xnminus1 minus (Znminus1[drminus1]`rminus1)Urminus1

Vr = Ynminus1 minus (Znminus1[drminus1]`rminus1)Vrminus1

Starting from Znminus1 minus (Znminus1[drminus1]`rminus1)Rrminus1 = 0 substitute 1x for x and multiplyby anminus1bnminus1`rminus1x

drminus1 to see that

anminus1`rminus1gnminus1 minus bnminus1Znminus1[drminus1]fnminus1 = 0

ie fnminus1(0)gnminus1 minus gnminus1(0)fnminus1 = 0 Now

(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1) = (1 + δnminus1 fnminus1 0)

Hence δn = 1 = 1 + n minus (2d0 minus 2drminus1) gt 0 as claimed fn = anxdrminus1Rrminus1 where

an = anminus1 isin klowast as claimed and gn = 0 as claimedDefine bn = anminus1bnminus1`rminus1 isin klowast and substitute Ur = Xnminus1 minus (Znminus1[drminus1]`rminus1)Urminus1

and V r = Y nminus1 minus (Znminus1[drminus1]`rminus1)V rminus1 into the matrix(an(1x)d0minusdrminus1Urminus1(1x) an(1x)d0minusdrminus1minus1Vrminus1(1x)bn(1x)d0minusdrminus1+1Ur(1x) bn(1x)d0minusdrminus1Vr(1x)

)

Daniel J Bernstein and Bo-Yin Yang 45

to see that this matrix is Tnminus1 =(

1 0minusbnminus1Znminus1[drminus1]x anminus1`rminus1x

)times Tnminus2 middot middot middot T0

shown above

Case C2 n gt 2d0 minus 2drminus1By the inductive hypothesis δnminus1 = n minus (2d0 minus 2drminus1) fnminus1 = anminus1x

drminus1Rrminus1 forsome anminus1 isin klowast gnminus1 = 0 and

Tnminus2 middot middot middot T0 =(anminus1(1x)d0minusdrminus1Urminus1(1x) anminus1(1x)d0minusdrminus1minus1Vrminus1(1x)bnminus1(1x)nminus(d0minusdrminus1)Ur(1x) bnminus1(1x)nminus(d0minusdrminus1)minus1Vr(1x)

)for some bnminus1 isin klowast Hence δn = 1 + δn = 1 + n minus (2d0 minus 2drminus1) gt 0 fn = fnminus1 =

anxdrminus1Rrminus1 where an = anminus1 and gn = 0 Also Tnminus1 =

(1 00 anminus1`rminus1x

)so

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdrminus1Urminus1(1x) an(1x)d0minusdrminus1minus1Vrminus1(1x)bn(1x)nminus(d0minusdrminus1)+1Ur(1x) bn(1x)nminus(d0minusdrminus1)Vr(1x)

)where bn = anminus1bnminus1`rminus1 isin klowast

D Alternative proof of main gcd theorem for polynomialsThis appendix gives another proof of Theorem 62 using the relationship in Theorem C2between the EuclidndashStevin algorithm and iterates of divstep

Second proof of Theorem 62 Define R2 R3 Rr and di and `i as in Theorem B1 anddefine Ui and Vi as in Theorem B2 Then d0 = d gt d1 and all of the hypotheses ofTheorems B1 B2 and C2 are satisfied

By Theorem B1 G = gcdR0 R1 = Rrminus1`rminus1 (since R0 6= 0) so degG = drminus1Also by Theorem B2

Urminus1R0 + Vrminus1R1 = Rrminus1 = `rminus1G

so (Vrminus1`rminus1)R1 equiv G (mod R0) and deg Vrminus1 = d0 minus drminus2 lt d0 minus drminus1 = dminus degG soV = Vrminus1`rminus1

Case 1 drminus1 ge 1 Part C of Theorem C2 applies to n = 2dminus 1 since n = 2d0 minus 1 ge2d0 minus 2drminus1 In particular δn = 1 + n minus (2d0 minus 2drminus1) = 2drminus1 = 2 degG f2dminus1 =a2dminus1x

drminus1Rrminus1(1x) and v2dminus1 = a2dminus1(1x)d0minusdrminus1minus1Vrminus1(1x)Case 2 drminus1 = 0 Then r ge 2 and drminus2 ge 1 Part B of Theorem C2 applies to

i = r minus 2 and n = 2dminus 1 = 2d0 minus 1 since 2d0 minus di minus di+1 = 2d0 minus drminus2 le 2d0 minus 1 = n lt2d0 = 2d0 minus 2di+1 In particular δn = 1 + n minus (2d0 minus 2di+1) = 2drminus1 = 2 degG againf2dminus1 = a2dminus1x

drminus1Rrminus1(1x) and again v2dminus1 = a2dminus1(1x)d0minusdrminus1minus1Vrminus1(1x)In both cases f2dminus1(0) = a2dminus1`rminus1 so xdegGf2dminus1(1x)f2dminus1(0) = Rrminus1`rminus1 = G

and xminusd+1+degGv2dminus1(1x)f2dminus1(0) = Vrminus1`rminus1 = V

E A gcd algorithm for integersThis appendix presents a variable-time algorithm to compute the gcd of two integersThis appendix also explains how a modification to the algorithm produces the StehleacutendashZimmermann algorithm [62]

Theorem E1 Let e be a positive integer Let f be an element of Zlowast2 Let g be anelement of 2eZlowast2 Then there is a unique element q isin

1 3 5 2e+1 minus 1

such that

qg2e minus f isin 2e+1Z2

46 Fast constant-time gcd computation and modular inversion

We define f div2 g = q2e isin Q and we define f mod2 g = f minus qg2e isin 2e+1Z2 Thenf = (f div2 g)g + (f mod2 g) Note that e = ord2 g so we are not creating any ambiguityby omitting e from the f div2 g and f mod2 g notation

Proof First note that the quotient f(g2e) is odd Define q = (f(g2e)) mod 2e+1 thenq isin

1 3 5 2e+1 and qg2e minus f isin 2e+1Z2 by construction To see uniqueness note

that if qprime isin

1 3 5 2e+1 minus 1also satisfies qprimeg2e minus f isin 2e+1Z2 then (q minus qprime)g2e isin

2e+1Z2 so q minus qprime isin 2e+1Z2 since g2e isin Zlowast2 so q mod 2e+1 = qprime mod 2e+1 so q = qprime

Theorem E2 Let R0 be a nonzero element of Z2 Let R1 be an element of 2Z2Recursively define e0 e1 e2 e3 isin Z and R2 R3 isin 2Z2 as follows

bull If i ge 0 and Ri 6= 0 then ei = ord2 Ri

bull If i ge 1 and Ri 6= 0 then Ri+1 = minus((Riminus12eiminus1)mod2 Ri)2ei isin 2Z2

bull If i ge 1 and Ri = 0 then ei Ri+1 ei+1 Ri+2 ei+2 are undefined

Then (Ri2ei

Ri+1

)=(

0 12ei

minus12ei ((Riminus12eiminus1)div2 Ri)2ei

)(Riminus12eiminus1

Ri

)for each i ge 1 where Ri+1 is defined

Proof We have (Riminus12eiminus1)mod2 Ri = Riminus12eiminus1 minus ((Riminus12eiminus1)div2 Ri)Ri Negateand divide by 2ei to obtain the matrix equation

Theorem E3 Let R0 be an odd element of Z Let R1 be an element of 2Z Definee0 e1 e2 e3 and R2 R3 as in Theorem E2 Then Ri isin 2Z for each i ge 1 whereRi is defined Furthermore if t ge 0 and Rt+1 = 0 then |Rt2et | = gcdR0 R1

Proof Write g = gcdR0 R1 isin Z Then g is odd since R0 is oddWe first show by induction on i that Ri isin gZ for each i ge 0 For i isin 0 1 this follows

from the definition of g For i ge 2 we have Riminus2 isin gZ and Riminus1 isin gZ by the inductivehypothesis Now (Riminus22eiminus2)div2 Riminus1 has the form q2eiminus1 for q isin Z by definition ofdiv2 and 22eiminus1+eiminus2Ri = 2eiminus2qRiminus1 minus 2eiminus1Riminus2 by definition of Ri The right side ofthis equation is in gZ so 2middotmiddotmiddotRi is in gZ This implies first that Ri isin Q but also Ri isin Z2so Ri isin Z This also implies that Ri isin gZ since g is odd

By construction Ri isin 2Z2 for each i ge 1 and Ri isin Z for each i ge 0 so Ri isin 2Z foreach i ge 1

Finally assume that t ge 0 and Rt+1 = 0 Then all of R0 R1 Rt are nonzero Writegprime = |Rt2et | Then 2etgprime = |Rt| isin gZ and g is odd so gprime isin gZ

The equation 2eiminus1Riminus2 = 2eiminus2qRiminus1minus22eiminus1+eiminus2Ri shows that if Riminus1 Ri isin gprimeZ thenRiminus2 isin gprimeZ since gprime is odd By construction Rt Rt+1 isin gprimeZ so Rtminus1 Rtminus2 R1 R0 isingprimeZ Hence g = gcdR0 R1 isin gprimeZ Both g and gprime are positive so gprime = g

E4 The gcd algorithm Here is the algorithm featured in this appendixGiven an odd integer R0 and an even integer R1 compute the even integers R2 R3

defined above In other words for each i ge 1 where Ri 6= 0 compute ei = ord2 Rifind q isin

1 3 5 2ei+1 minus 1

such that qRi2ei minus Riminus12eiminus1 isin 2ei+1Z and compute

Ri+1 = (qRi2ei minusRiminus12eiminus1)2ei isin 2Z If Rt+1 = 0 for some t ge 0 output |Rt2et | andstop

We will show in Theorem F26 that this algorithm terminates The output is gcdR0 R1by Theorem E3 This algorithm is closely related to iterating divstep and we will usethis relationship to analyze iterates of divstep see Appendix G

More generally if R0 is an odd integer and R1 is any integer one can computegcdR0 R1 by using the above algorithm to compute gcdR0 2R1 Even more generally

Daniel J Bernstein and Bo-Yin Yang 47

one can compute the gcd of any two integers by first handling the case gcd0 0 = 0 thenhandling powers of 2 in both inputs then applying the odd-input algorithmE5 A non-functioning variant Imagine replacing the subtractions above withadditions find q isin

1 3 5 2ei+1 minus 1

such that qRi2ei +Riminus12eiminus1 isin 2ei+1Z and

compute Ri+1 = (qRi2ei + Riminus12eiminus1)2ei isin 2Z If R0 and R1 are positive then allRi are positive so the algorithm does not terminate This non-terminating algorithm isrelated to posdivstep (see Section 84) in the same way that the algorithm above is relatedto divstep

Daireaux Maume-Deschamps and Valleacutee in [31 Section 6] mentioned this algorithm asa ldquonon-centered LSB algorithmrdquo said that one can easily add a ldquosupplementary stoppingconditionrdquo to ensure termination and dismissed the resulting algorithm as being ldquocertainlyslower than the centered versionrdquoE6 A centered variant the StehleacutendashZimmermann algorithm Imagine usingcentered remainders 1minus 2e minus3minus1 1 3 2e minus 1 modulo 2e+1 rather than uncen-tered remainders

1 3 5 2e+1 minus 1

The above gcd algorithm then turns into the

StehleacutendashZimmermann gcd algorithm from [62]Specifically Stehleacute and Zimmermann begin with integers a b having ord2 a lt ord2 b

If b 6= 0 then ldquogeneralized binary divisionrdquo in [62 Lemma 1] defines q as the unique oddinteger with |q| lt 2ord2 bminusord2 a such that r = a+ qb2ord2 bminusord2 a is divisible by 2ord2 b+1The gcd algorithm in [62 Algorithm Elementary-GB] repeatedly sets (a b)larr (b r) Whenb = 0 the algorithm outputs the odd part of a

It is equivalent to set (a b)larr (b2ord2 b r2ord2 b) since this simply divides all subse-quent results by 2ord2 b In other words starting from an odd a0 and an even b0 recursivelydefine an odd ai+1 and an even bi+1 by the following formulas stopping when bi = 0bull ai+1 = bi2e where e = ord2 bi

bull bi+1 = (ai + qbi2e)2e isin 2Z for a unique q isin 1minus 2e minus3minus1 1 3 2e minus 1Now relabel a0 b0 b1 b2 b3 as R0 R1 R2 R3 R4 to obtain the following recursivedefinition Ri+1 = (qRi2ei +Riminus12eiminus1)2ei isin 2Z and thus(

Ri2ei

Ri+1

)=(

0 12ei

12ei q22ei

)(Riminus12eiminus1

Ri

)

for a unique q isin 1minus 2ei minus3minus1 1 3 2ei minus 1To summarize starting from the StehleacutendashZimmermann algorithm one can obtain the

gcd algorithm featured in this appendix by making the following three changes Firstdivide out 2ei instead of allowing powers of 2 to accumulate Second add Riminus12eiminus1

instead of subtracting it this does not affect the number of iterations for centered q Thirdswitch from centered q to uncentered q

Intuitively centered remainders are smaller than uncentered remainders and it iswell known that centered remainders improve the worst-case performance of Euclidrsquosalgorithm so one might expect that the StehleacutendashZimmermann algorithm performs betterthan our uncentered variant However as noted earlier our analysis produces the oppositeconclusion See Appendix F

F Performance of the gcd algorithm for integersThe possible matrices in Theorem E2 are(

0 12minus12 14

)

(0 12minus12 34

)for ei = 1(

0 14minus14 116

)

(0 14minus14 316

)

(0 14minus14 516

)

(0 14minus14 716

)for ei = 2

48 Fast constant-time gcd computation and modular inversion

etc In this appendix we show that a product of these matrices shrinks the input by a factorexponential in the weight

sumei This implies that

sumei in the algorithm of Appendix E is

O(b) where b is the number of input bits

F1 Lower bounds It is easy to see that each of these matrices has two eigenvaluesof absolute value 12ei This does not imply however that a weight-w product of thesematrices shrinks the input by a factor 12w Concretely the weight-7 product

147

(328 minus380minus158 233

)=(

0 14minus14 116

)(0 12minus12 34

)2( 0 12minus12 14

)(0 12minus12 34

)2

of 6 matrices has spectral radius (maximum absolute value of eigenvalues)

(561 +radic

249185)215 = 003235425826 = 061253803919 7 = 124949900586

The (w7)th power of this matrix is expressed as a weight-w product and has spectralradius 1207071286552w Consequently this proof strategy cannot obtain an O constantbetter than 7(15minus log2(561 +

radic249185)) = 1414169815 We prove a bound twice the

weight on the number of divstep iterations (see Appendix G) so this proof strategy cannotobtain an O constant better than 14(15 minus log2(561 +

radic249185)) = 2828339631 for

the number of divstep iterationsRotating the list of 6 matrices shown above gives 6 weight-7 products with the same

spectral radius In a small search we did not find any weight-w matrices with spectralradius above 061253803919 w Our upper bounds (see below) are 06181640625w for alllarge w

For comparison the non-terminating variant in Appendix E5 allows the matrix(0 12

12 34

) which has spectral radius 1 with eigenvector

(12

) The StehleacutendashZimmermann

algorithm reviewed in Appendix E6 allows the matrix(

0 1212 14

) which has spectral

radius (1 +radic

17)8 = 06403882032 so this proof strategy cannot obtain an O constantbetter than 2(log2(

radic17minus 1)minus 1) = 3110510062 for the number of cdivstep iterations

F2 A warning about worst-case inputs DefineM as the matrix 4minus7(

328 minus380minus158 233

)shown above Then Mminus3 =

(60321097 11336806047137246 88663112

) Applying our gcd algorithm to

inputs (60321097 47137246) applies a series of 18 matrices of total weight 21 in the

notation of Theorem E2 the matrix(

0 12minus12 34

)twice then the matrix

(0 12minus12 14

)

then the matrix(

0 12minus12 34

)twice then the weight-2 matrix

(0 14minus14 116

) all of

these matrices again and all of these matrices a third timeOne might think that these are worst-case inputs to our algorithm analogous to

a standard matrix construction of Fibonacci numbers as worst-case inputs to Euclidrsquosalgorithm We emphasize that these are not worst-case inputs the inputs (22293minus128330)are much smaller and nevertheless require 19 steps of total weight 23 We now explainwhy the analogy fails

The matrix M3 has an eigenvector (1minus053 ) with eigenvalue 1221middot0707 =12148 and an eigenvector (1 078 ) with eigenvalue 1221middot129 = 12271 so ithas spectral radius around 12148 Our proof strategy thus cannot guarantee a gainof more than 148 bits from weight 21 What we actually show below is that weight21 is guaranteed to gain more than 1397 bits (The quantity α21 in Figure F14 is133569231 = 2minus13972)

Daniel J Bernstein and Bo-Yin Yang 49

The matrix Mminus3 has spectral radius 2271 It is not surprising that each column ofMminus3 is close to 27 bits in size the only way to avoid this would be for the other eigenvectorto be pointing in almost exactly the same direction as (1 0) or (0 1) For our inputs takenfrom the first column of Mminus3 the algorithm gains nearly 27 bits from weight 21 almostdouble the guaranteed gain In short these are good inputs not bad inputs

Now replace M with the matrix F =(

0 11 minus1

) which reduces worst-case inputs to

Euclidrsquos algorithm The eigenvalues of F are (radic

5minus 1)2 = 12069 and minus(radic

5 + 1)2 =minus2069 The entries of powers of Fminus1 are Fibonacci numbers again with sizes wellpredicted by the spectral radius of Fminus1 namely 2069 However the spectral radius ofF is larger than 1 The reason that this large eigenvalue does not spoil the terminationof Euclidrsquos algorithm is that Euclidrsquos algorithm inspects sizes and chooses quotients todecrease sizes at each step

Stehleacute and Zimmermann in [62 Section 62] measure the performance of their algorithm

for ldquoworst caserdquo inputs18 obtained from negative powers of the matrix(

0 1212 14

) The

statement that these inputs are ldquoworst caserdquo is not justified On the contrary if these inputshave b bits then they take (073690 + o(1))b steps of the StehleacutendashZimmermann algorithmand thus (14738 + o(1))b cdivsteps which is considerably fewer steps than manyother inputs that we have tried19 for various values of b The quantity 073690 here islog(2) log((

radic17 + 1)2) coming from the smaller eigenvalue minus2(

radic17 + 1) = minus039038

of the above matrix

F3 Norms We use the standard definitions of the 2-norm of a real vector and the2-norm of a real matrix We summarize the basic theory of 2-norms here to keep thispaper self-contained

Definition F4 If v isin R2 then |v|2 is defined asradicv2

1 + v22 where v = (v1 v2)

Definition F5 If P isinM2(R) then |P |2 is defined as max|Pv|2 v isin R2 |v|2 = 1

This maximum exists becausev isin R2 |v|2 = 1

is a compact set (namely the unit

circle) and |Pv|2 is a continuous function of v See also Theorem F11 for an explicitformula for |P |2

Beware that the literature sometimes uses the notation |P |2 for the entrywise 2-normradicP 2

11 + P 212 + P 2

21 + P 222 We do not use entrywise norms of matrices in this paper

Theorem F6 Let v be an element of Z2 If v 6= 0 then |v|2 ge 1

Proof Write v as (v1 v2) with v1 v2 isin Z If v1 6= 0 then v21 ge 1 so v2

1 + v22 ge 1 If v2 6= 0

then v22 ge 1 so v2

1 + v22 ge 1 Either way |v|2 =

radicv2

1 + v22 ge 1

Theorem F7 Let v be an element of R2 Let r be an element of R Then |rv|2 = |r||v|2

Proof Write v as (v1 v2) with v1 v2 isin R Then rv = (rv1 rv2) so |rv|2 =radicr2v2

1 + r2v22 =radic

r2radicv2

1 + v22 = |r||v|2

18Specifically Stehleacute and Zimmermann claim that ldquothe worst case of the binary variantrdquo (ie of theirgcd algorithm) is for inputs ldquoGn and 2Gnminus1 where G0 = 0 G1 = 1 Gn = minusGnminus1 + 4Gnminus2 which givesall binary quotients equal to 1rdquo Daireaux Maume-Deschamps and Valleacutee in [31 Section 23] state thatStehleacute and Zimmermann ldquoexhibited the precise worst-case number of iterationsrdquo and give an equivalentcharacterization of this case

19For example ldquoGnrdquo from [62] is minus5467345 for n = 18 and 2135149 for n = 17 The inputs (minus5467345 2 middot2135149) use 17 iterations of the ldquoGBrdquo divisions from [62] while the smaller inputs (minus1287513 2middot622123) use24 ldquoGBrdquo iterations The inputs (1minus5467345 2135149) use 34 cdivsteps the inputs (1minus2718281 3141592)use 45 cdivsteps the inputs (1minus1287513 622123) use 58 cdivsteps

50 Fast constant-time gcd computation and modular inversion

Theorem F8 Let P be an element of M2(R) Let r be an element of R Then|rP |2 = |r||P |2

Proof By definition |rP |2 = max|rPv|2 v isin R2 |v|2 = 1

By Theorem F7 this is the

same as max|r||Pv|2 = |r|max|Pv|2 = |r||P |2

Theorem F9 Let P be an element of M2(R) Let v be an element of R2 Then|Pv|2 le |P |2|v|2

Proof If v = 0 then Pv = 0 and |Pv|2 = 0 = |P |2|v|2Otherwise |v|2 6= 0 Write w = v|v|2 Then |w|2 = 1 so |Pw|2 le |P |2 so |Pv|2 =

|Pw|2|v|2 le |P |2|v|2

Theorem F10 Let PQ be elements of M2(R) Then |PQ|2 le |P |2|Q|2

Proof If v isin R2 and |v|2 = 1 then |PQv|2 le |P |2|Qv|2 le |P |2|Q|2|v|2

Theorem F11 Let P be an element of M2(R) Assume that P lowastP =(a bc d

)where P lowast

is the transpose of P Then |P |2 =radic

a+d+radic

(aminusd)2+4b2

2

Proof P lowastP isin M2(R) is its own transpose so by the spectral theorem it has twoorthonormal eigenvectors with real eigenvalues In other words R2 has a basis e1 e2 suchthat

bull |e1|2 = 1

bull |e2|2 = 1

bull the dot product elowast1e2 is 0 (here we view vectors as column matrices)

bull P lowastPe1 = λ1e1 for some λ1 isin R and

bull P lowastPe2 = λ2e2 for some λ2 isin R

Any v isin R2 with |v|2 = 1 can be written uniquely as c1e1 + c2e2 for some c1 c2 isin R withc2

1 + c22 = 1 explicitly c1 = vlowaste1 and c2 = vlowaste2 Then P lowastPv = λ1c1e1 + λ2c2e2

By definition |P |22 is the maximum of |Pv|22 over v isin R2 with |v|2 = 1 Note that|Pv|22 = (Pv)lowast(Pv) = vlowastP lowastPv = λ1c

21 + λ2c

22 with c1 c2 defined as above For example

0 le |Pe1|22 = λ1 and 0 le |Pe2|22 = λ2 Thus |Pv|22 achieves maxλ1 λ2 It cannot belarger than this if c2

1 +c22 = 1 then λ1c

21 +λ2c

22 le maxλ1 λ2 Hence |P |22 = maxλ1 λ2

To finish the proof we make the spectral theorem explicit for the matrix P lowastP =(a bc d

)isinM2(R) Note that (P lowastP )lowast = P lowastP so b = c The characteristic polynomial of

this matrix is (xminus a)(xminus d)minus b2 = x2 minus (a+ d)x+ adminus b2 with roots

a+ dplusmnradic

(a+ d)2 minus 4(adminus b2)2 =

a+ dplusmnradic

(aminus d)2 + 4b2

2

The largest eigenvalue of P lowastP is thus (a+ d+radic

(aminus d)2 + 4b2)2 and |P |2 is the squareroot of this

Theorem F12 Let P be an element of M2(R) Assume that P lowastP =(a bc d

)where P lowast

is the transpose of P Let N be an element of R Then N ge |P |2 if and only if N ge 02N2 ge a+ d and (2N2 minus aminus d)2 ge (aminus d)2 + 4b2

Daniel J Bernstein and Bo-Yin Yang 51

earlybounds=011126894912^2037794112^2148808332^2251652192^206977232^2078823132^2483067332^239920452^22104392132^25112816812^25126890072^27138243032^28142578172^27156342292^29163862452^29179429512^31185834332^31197136532^32204328912^32211335692^31223282932^33238004212^35244892332^35256040592^36267388892^37271122152^35282767752^3729849732^36308292972^40312534432^39326254052^4133956252^39344650552^42352865672^42361759512^42378586372^4538656472^4239404692^4240247512^42412409172^46425934112^48433643372^48448890152^50455437912^5046418992^47472050052^50489977912^53493071912^52507544232^5451575272^51522815152^54536940732^56542122492^55552582732^56566360932^58577810812^59589529592^60592914752^59607185992^61618789972^62625348212^62633292852^62644043412^63659866332^65666035532^65

defalpha(w)ifwgt=67return(6331024)^wreturnearlybounds[w]

assertall(alpha(w)^49lt2^(-(34w-23))forwinrange(31100))assertmin((6331024)^walpha(w)forwinrange(68))==633^5(2^30165219)

Figure F14 Definition of αw for w isin 0 1 2 computer verificationthat αw lt 2minus(34wminus23)49 for 31 le w le 99 and computer verification thatmin(6331024)wαw 0 le w le 67 = 6335(230165219) If w ge 67 then αw =(6331024)w

Our upper-bound computations work with matrices with rational entries We useTheorem F12 to compare the 2-norms of these matrices to various rational bounds N Allof the computations here use exact arithmetic avoiding the correctness questions thatwould have shown up if we had used approximations to the square roots in Theorem F11Sage includes exact representations of algebraic numbers such as |P |2 but switching toTheorem F12 made our upper-bound computations an order of magnitude faster

Proof Assume thatN ge 0 that 2N2 ge a+d and that (2N2minusaminusd)2 ge (aminusd)2+4b2 Then2N2minusaminusd ge

radic(aminus d)2 + 4b2 since 2N2minusaminusd ge 0 so N2 ge (a+d+

radic(aminus d)2 + 4b2)2

so N geradic

(a+ d+radic

(aminus d)2 + 4b2)2 since N ge 0 so N ge |P |2 by Theorem F11Conversely assume that N ge |P |2 Then N ge 0 Also N2 ge |P |22 so 2N2 ge

2|P |22 = a + d +radic

(aminus d)2 + 4b2 by Theorem F11 In particular 2N2 ge a + d and(2N2 minus aminus d)2 ge (aminus d)2 + 4b2

F13 Upper bounds We now show that every weight-w product of the matrices inTheorem E2 has norm at most αw Here α0 α1 α2 is an exponentially decreasingsequence of positive real numbers defined in Figure F14

Definition F15 For e isin 1 2 and q isin

1 3 5 2e+1 minus 1 define Meq =(

0 12eminus12e q22e

)

52 Fast constant-time gcd computation and modular inversion

Theorem F16 |Meq|2 lt (1 +radic

2)2e if e isin 1 2 and q isin

1 3 5 2e+1 minus 1

Proof Define(a bc d

)= MlowasteqMeq Then a = 122e b = c = minusq23e and d = 122e +

q224e so a+ d = 222e + q224e lt 622e and (aminus d)2 + 4b2 = 4q226e + q428e le 3224eBy Theorem F12 |Meq|2 =

radic(a+ d+

radic(aminus d)2 + 4b2)2 le

radic(622e +

radic3222e)2 =

(1 +radic

2)2e

Definition F17 Define βw = infαw+jαj j ge 0 for w isin 0 1 2

Theorem F18 If w isin 0 1 2 then βw = minαw+jαj 0 le j le 67 If w ge 67then βw = (6331024)w6335(230165219)

The first statement reduces the computation of βw to a finite computation Thecomputation can be further optimized but is not a bottleneck for us Similar commentsapply to γwe in Theorem F20

Proof By definition αw = (6331024)w for all w ge 67 If j ge 67 then also w + j ge 67 soαw+jαj = (6331024)w+j(6331024)j = (6331024)w In short all terms αw+jαj forj ge 67 are identical so the infimum of αw+jαj for j ge 0 is the same as the minimum for0 le j le 67

If w ge 67 then αw+jαj = (6331024)w+jαj The quotient (6331024)jαj for0 le j le 67 has minimum value 6335(230165219)

Definition F19 Define γwe = infβw+j2j70169 j ge e

for w isin 0 1 2 and

e isin 1 2 3

This quantity γwe is designed to be as large as possible subject to two conditions firstγwe le βw+e2e70169 used in Theorem F21 second γwe le γwe+1 used in Theorem F22

Theorem F20 γwe = minβw+j2j70169 e le j le e+ 67

if w isin 0 1 2 and

e isin 1 2 3 If w + e ge 67 then γwe = 2e(6331024)w+e(70169)6335(230165219)

Proof If w + j ge 67 then βw+j = (6331024)w+j6335(230165219) so βw+j2j70169 =(6331024)w(633512)j(70169)6335(230165219) which increases monotonically with j

In particular if j ge e+ 67 then w + j ge 67 so βw+j2j70169 ge βw+e+672e+6770169Hence γwe = inf

βw+j2j70169 j ge e

= min

βw+j2j70169 e le j le e+ 67

Also if

w + e ge 67 then w + j ge 67 for all j ge e so

γwe =(

6331024

)w (633512

)e 70169

6335

230165219 = 2e(

6331024

)w+e 70169

6335

230165219

as claimed

Theorem F21 Define S as the smallest subset of ZtimesM2(Q) such that

bull(0(

1 00 1))isin S and

bull if (wP ) isin S and w = 0 or |P |2 gt βw and e isin 1 2 3 and |P |2 gt γwe andq isin

1 3 5 2e+1 minus 1

then (e+ wM(e q)P ) isin S

Assume that |P |2 le αw for each (wP ) isin S Then |M(ej qj) middot middot middotM(e1 q1)|2 le αe1+middotmiddotmiddot+ej

for all j isin 0 1 2 all e1 ej isin 1 2 3 and all q1 qj isin Z such thatqi isin

1 3 5 2ei+1 minus 1

for each i

Daniel J Bernstein and Bo-Yin Yang 53

The idea here is that instead of searching through all products P of matrices M(e q)of total weight w to see whether |P |2 le αw we prune the search in a way that makesthe search finite across all w (see Theorem F22) while still being sure to find a minimalcounterexample if one exists The first layer of pruning skips all descendants QP of P ifw gt 0 and |P |2 le βw the point is that any such counterexample QP implies a smallercounterexample Q The second layer of pruning skips weight-(w + e) children M(e q)P ofP if |P |2 le γwe the point is that any such child is eliminated by the first layer of pruning

Proof Induct on jIf j = 0 then M(ej qj) middot middot middotM(e1 q1) =

(1 00 1) with norm 1 = α0 Assume from now on

that j ge 1Case 1 |M(ei qi) middot middot middotM(e1 q1)|2 le βe1+middotmiddotmiddot+ei

for some i isin 1 2 jThen j minus i lt j so |M(ej qj) middot middot middotM(ei+1 qi+1)|2 le αei+1+middotmiddotmiddot+ej by the inductive

hypothesis so |M(ej qj) middot middot middotM(e1 q1)|2 le βe1+middotmiddotmiddot+eiαei+1+middotmiddotmiddot+ej by Theorem F10 butβe1+middotmiddotmiddot+ei

αei+1+middotmiddotmiddot+ejle αe1+middotmiddotmiddot+ej

by definition of βCase 2 |M(ei qi) middot middot middotM(e1 q1)|2 gt βe1+middotmiddotmiddot+ei for every i isin 1 2 jSuppose that |M(eiminus1 qiminus1) middot middot middotM(e1 q1)|2 le γe1+middotmiddotmiddot+eiminus1ei

for some i isin 1 2 jBy Theorem F16 |M(ei qi)|2 lt (1 +

radic2)2ei lt (16970)2ei By Theorem F10

|M(ei qi) middot middot middotM(e1 q1)|2 le |M(ei qi)|2γe1+middotmiddotmiddot+eiminus1ei le169γe1+middotmiddotmiddot+eiminus1ei

70 middot 2ei+1 le βe1+middotmiddotmiddot+ei

contradiction Thus |M(eiminus1 qiminus1) middot middot middotM(e1 q1)|2 gt γe1+middotmiddotmiddot+eiminus1eifor all i isin 1 2 j

Now (e1 + middot middot middot + eiM(ei qi) middot middot middotM(e1 q1)) isin S for each i isin 0 1 2 j Indeedinduct on i The base case i = 0 is simply

(0(

1 00 1))isin S If i ge 1 then (wP ) isin S by

the inductive hypothesis where w = e1 + middot middot middot+ eiminus1 and P = M(eiminus1 qiminus1) middot middot middotM(e1 q1)Furthermore

bull w = 0 (when i = 1) or |P |2 gt βw (when i ge 2 by definition of this case)

bull ei isin 1 2 3

bull |P |2 gt γwei(as shown above) and

bull qi isin

1 3 5 2ei+1 minus 1

Hence (e1 + middot middot middot+ eiM(ei qi) middot middot middotM(e1 q1)) = (ei + wM(ei qi)P ) isin S by definition of SIn particular (wP ) isin S with w = e1 + middot middot middot + ej and P = M(ej qj) middot middot middotM(e1 q1) so

|P |2 le αw by hypothesis

Theorem F22 In Theorem F21 S is finite and |P |2 le αw for each (wP ) isin S

Proof This is a computer proof We run the Sage script in Figure F23 We observe thatit terminates successfully (in slightly over 2546 minutes using Sage 86 on one core of a35GHz Intel Xeon E3-1275 v3 CPU) and prints 3787975117

What the script does is search depth-first through the elements of S verifying thateach (wP ) isin S satisfies |P |2 le αw The output is the number of elements searched so Shas at most 3787975117 elements

Specifically if (wP ) isin S then verify(wP) checks that |P |2 le αw and recursivelycalls verify on all descendants of (wP ) Actually for efficiency the script scales eachmatrix to have integer entries it works with 22eM(e q) (labeled scaledM(eq) in thescript) instead of M(e q) and it works with 4wP (labeled P in the script) instead of P

The script computes βw as shown in Theorem F18 computes γw as shown in Theo-rem F20 and compares |P |2 to various bounds as shown in Theorem F12

If w gt 0 and |P |2 le βw then verify(wP) returns There are no descendants of (wP )in this case Also |P |2 le αw since βw le αwα0 = αw

54 Fast constant-time gcd computation and modular inversion

fromalphaimportalphafrommemoizedimportmemoized

R=MatrixSpace(ZZ2)defscaledM(eq)returnR((02^e-2^eq))

memoizeddefbeta(w)returnmin(alpha(w+j)alpha(j)forjinrange(68))

memoizeddefgamma(we)returnmin(beta(w+j)2^j70169forjinrange(ee+68))

defspectralradiusisatmost(PPN)assumingPPhasformPtranspose()P(ab)(cd)=PPX=2N^2-a-dreturnNgt=0andXgt=0andX^2gt=(a-d)^2+4b^2

defverify(wP)nodes=1PP=Ptranspose()Pifwgt0andspectralradiusisatmost(PP4^wbeta(w))returnnodesassertspectralradiusisatmost(PP4^walpha(w))foreinPositiveIntegers()ifspectralradiusisatmost(PP4^wgamma(we))returnnodesforqinrange(12^(e+1)2)nodes+=verify(e+wscaledM(eq)P)

printverify(0R(1))

Figure F23 Sage script verifying |P |2 le αw for each (wP ) isin S Output is number ofpairs (wP ) considered if verification succeeds or an assertion failure if verification fails

Otherwise verify(wP) asserts that |P |2 le αw ie it checks that |P |2 le αw andraises an exception if not It then tries successively e = 1 e = 2 e = 3 etc continuingas long as |P |2 gt γwe For each e where |P |2 gt γwe verify(wP) recursively handles(e+ wM(e q)P ) for each q isin

1 3 5 2e+1 minus 1

Note that γwe le γwe+1 le middot middot middot so if |P |2 le γwe then also |P |2 le γwe+1 etc This iswhere it is important that γwe is not simply defined as βw+e2e70169

To summarize verify(wP) recursively enumerates all descendants of (wP ) Henceverify(0R(1)) recursively enumerates all elements (wP ) of S checking |P |2 le αw foreach (wP )

Theorem F24 If j isin 0 1 2 e1 ej isin 1 2 3 q1 qj isin Z and qi isin1 3 5 2ei+1 minus 1

for each i then |M(ej qj) middot middot middotM(e1 q1)|2 le αe1+middotmiddotmiddot+ej

Proof This is the conclusion of Theorem F21 so it suffices to check the assumptionsDefine S as in Theorem F21 then |P |2 le αw for each (wP ) isin S by Theorem F22

Daniel J Bernstein and Bo-Yin Yang 55

Theorem F25 In the situation of Theorem E3 assume that R0 R1 R2 Rj+1 aredefined Then |(Rj2ej Rj+1)|2 le αe1+middotmiddotmiddot+ej

|(R0 R1)|2 and if e1 + middot middot middot + ej ge 67 thene1 + middot middot middot+ ej le log1024633 |(R0 R1)|2

If R0 and R1 are each bounded by 2b in absolute value then |(R0 R1)|2 is bounded by2b+05 so log1024633 |(R0 R1)|2 le (b+ 05) log2(1024633) = (b+ 05)(14410 )

Proof By Theorem E2 (Ri2ei

Ri+1

)= M(ei qi)

(Riminus12eiminus1

Ri

)where qi = 2ei ((Riminus12eiminus1)div2 Ri) isin

1 3 5 2ei+1 minus 1

Hence(

Rj2ej

Rj+1

)= M(ej qj) middot middot middotM(e1 q1)

(R0R1

)

The product M(ej qj) middot middot middotM(e1 q1) has 2-norm at most αw where w = e1 + middot middot middot+ ej so|(Rj2ej Rj+1)|2 le αw|(R0 R1)|2 by Theorem F9

By Theorem F6 |(Rj2ej Rj+1)|2 ge 1 If e1 + middot middot middot + ej ge 67 then αe1+middotmiddotmiddot+ej =(6331024)e1+middotmiddotmiddot+ej so 1 le (6331024)e1+middotmiddotmiddot+ej |(R0 R1)|2

Theorem F26 In the situation of Theorem E3 there exists t ge 0 such that Rt+1 = 0

Proof Suppose not Then R0 R1 Rj Rj+1 are all defined for arbitrarily large j Takej ge 67 with j gt log1024633 |(R0 R1)|2 Then e1 + middot middot middot + ej ge 67 so j le e1 + middot middot middot + ej lelog1024633 |(R0 R1)|2 lt j by Theorem F25 contradiction

G Proof of main gcd theorem for integersThis appendix shows how the gcd algorithm from Appendix E is related to iteratingdivstep This appendix then uses the analysis from Appendix F to put bounds on thenumber of divstep iterations needed

Theorem G1 (2-adic division as a sequence of division steps) In the situation ofTheorem E1 divstep2e(1 f g2) = (1 g2e (qg2e minus f)2e+1)

In other words divstep2e(1 f g2) = (1 g2eminus(f mod2 g)2e+1)

Proof Defineze = (g2e minus f)2 isin Z2

ze+1 = (ze + (ze mod 2)g2e)2 isin Z2

ze+2 = (ze+1 + (ze+1 mod 2)g2e)2 isin Z2

z2e = (z2eminus1 + (z2eminus1 mod 2)g2e)2 isin Z2

Thenze = (g2e minus f)2

ze+1 = ((1 + 2(ze mod 2))g2e minus f)4ze+2 = ((1 + 2(ze mod 2) + 4(ze+1 mod 2))g2e minus f)8

z2e = ((1 + 2(ze mod 2) + middot middot middot+ 2e(z2eminus1 mod 2))g2e minus f)2e+1

56 Fast constant-time gcd computation and modular inversion

In short z2e = (qg2e minus f)2e+1 where

qprime = 1 + 2(ze mod 2) + middot middot middot+ 2e(z2eminus1 mod 2) isin

1 3 5 2e+1 minus 1

Hence qprimeg2eminusf = 2e+1z2e isin 2e+1Z2 so qprime = q by the uniqueness statement in Theorem E1Note that this construction gives an alternative proof of the existence of q in Theorem E1

All that remains is to prove divstep2e(1 f g2) = (1 g2e (qg2e minus f)2e+1) iedivstep2e(1 f g2) = (1 g2e z2e)

If 1 le i lt e then g2i isin 2Z2 so divstep(i f g2i) = (i + 1 f g2i+1) By in-duction divstepeminus1(1 f g2) = (e f g2e) By assumption e gt 0 and g2e is odd sodivstep(e f g2e) = (1 minus e g2e (g2e minus f)2) = (1 minus e g2e ze) What remains is toprove that divstepe(1minus e g2e ze) = (1 g2e z2e)

If 0 le i le eminus1 then 1minuse+i le 0 so divstep(1minuse+i g2e ze+i) = (2minuse+i g2e (ze+i+(ze+i mod 2)g2e)2) = (2minus e+ i g2e ze+i+1) By induction divstepi(1minus e g2e ze) =(1minus e+ i g2e ze+i) for 0 le i le e In particular divstepe(1minus e g2e ze) = (1 g2e z2e)as claimed

Theorem G2 (2-adic gcd as a sequence of division steps) In the situation of Theo-rem E3 assume that R0 R1 Rj Rj+1 are defined Then divstep2w(1 R0 R12) =(1 Rj2ej Rj+12) where w = e1 + middot middot middot+ ej

Therefore if Rt+1 = 0 then divstepn(1 R0 R12) = (1 + nminus 2wRt2et 0) for alln ge 2w where w = e1 + middot middot middot+ et

Proof Induct on j If j = 0 then w = 0 so divstep2w(1 R0 R12) = (1 R0 R12) andR0 = R02e0 since R0 is odd by hypothesis

If j ge 1 then by definition Rj+1 = minus((Rjminus12ejminus1)mod2 Rj)2ej Furthermore

divstep2e1+middotmiddotmiddot+2ejminus1(1 R0 R12) = (1 Rjminus12ejminus1 Rj2)

by the inductive hypothesis so

divstep2e1+middotmiddotmiddot+2ej (1 R0 R12) = divstep2ej (1 Rjminus12ejminus1 Rj2) = (1 Rj2ej Rj+12)

by Theorem G1

Theorem G3 Let R0 be an odd element of Z Let R1 be an element of 2Z Then thereexists n isin 0 1 2 such that divstepn(1 R0 R12) isin Ztimes Ztimes 0 Furthermore anysuch n has divstepn(1 R0 R12) isin Ztimes GminusG times 0 where G = gcdR0 R1

Proof Define e0 e1 e2 e3 and R2 R3 as in Theorem E2 By Theorem F26 thereexists t ge 0 such that Rt+1 = 0 By Theorem E3 Rt2et isin GminusG By Theorem G2divstep2w(1 R0 R12) = (1 Rt2et 0) isin Ztimes GminusG times 0 where w = e1 + middot middot middot+ et

Conversely assume that n isin 0 1 2 and that divstepn(1 R0 R12) isin ZtimesZtimes0Write divstepn(1 R0 R12) as (δ f 0) Then divstepn+k(1 R0 R12) = (k + δ f 0) forall k isin 0 1 2 so in particular divstep2w+n(1 R0 R12) = (2w + δ f 0) Alsodivstep2w+k(1 R0 R12) = (k + 1 Rt2et 0) for all k isin 0 1 2 so in particu-lar divstep2w+n(1 R0 R12) = (n + 1 Rt2et 0) Hence f = Rt2et isin GminusG anddivstepn(1 R0 R12) isin Ztimes GminusG times 0 as claimed

Theorem G4 Let R0 be an odd element of Z Let R1 be an element of 2Z Defineb = log2

radicR2

0 +R21 Assume that b le 21 Then divstepb19b7c(1 R0 R12) isin ZtimesZtimes 0

Proof If divstep(δ f g) = (δ1 f1 g1) then divstep(δminusfminusg) = (δ1minusf1minusg1) Hencedivstepn(1 R0 R12) isin ZtimesZtimes0 if and only if divstepn(1minusR0minusR12) isin ZtimesZtimes0From now on assume without loss of generality that R0 gt 0

Daniel J Bernstein and Bo-Yin Yang 57

s Ws R0 R10 1 1 01 5 1 minus22 5 1 minus23 5 1 minus24 17 1 minus45 17 1 minus46 65 1 minus87 65 1 minus88 157 11 minus69 181 9 minus1010 421 15 minus1411 541 21 minus1012 1165 29 minus1813 1517 29 minus2614 2789 17 minus5015 3653 47 minus3816 8293 47 minus7817 8293 47 minus7818 24245 121 minus9819 27361 55 minus15620 56645 191 minus14221 79349 215 minus18222 132989 283 minus23023 213053 133 minus44224 415001 451 minus46025 613405 501 minus60226 1345061 719 minus91027 1552237 1021 minus71428 2866525 789 minus1498

s Ws R0 R129 4598269 1355 minus166230 7839829 777 minus269031 12565573 2433 minus257832 22372085 2969 minus368233 35806445 5443 minus248634 71013905 4097 minus736435 74173637 5119 minus692636 205509533 9973 minus1029837 226964725 11559 minus966238 487475029 11127 minus1907039 534274997 13241 minus1894640 1543129037 24749 minus3050641 1639475149 20307 minus3503042 3473731181 39805 minus4346643 4500780005 49519 minus4526244 9497198125 38051 minus8971845 12700184149 82743 minus7651046 29042662405 152287 minus7649447 36511782869 152713 minus11485048 82049276437 222249 minus18070649 90638999365 220417 minus20507450 234231548149 229593 minus42605051 257130797053 338587 minus37747852 466845672077 301069 minus61335453 622968125533 437163 minus65715854 1386285660565 600809 minus101257855 1876581808109 604547 minus122927056 4208169535453 1352357 minus1542498

Figure G5 For each s isin 0 1 56 Minimum R20 + R2

1 where R0 is an odd integerR1 is an even integer and (R0 R1) requires ges iterations of divstep Also an example of(R0 R1) reaching this minimum

The rest of the proof is a computer proof We enumerate all pairs (R0 R1) where R0is a positive odd integer R1 is an even integer and R2

0 + R21 le 242 For each (R0 R1)

we iterate divstep to find the minimum n ge 0 where divstepn(1 R0 R12) isin Ztimes Ztimes 0We then say that (R0 R1) ldquoneeds n stepsrdquo

For each s ge 0 define Ws as the smallest R20 +R2

1 where (R0 R1) needs ges steps Thecomputation shows that each (R0 R1) needs le56 steps and also shows the values Ws

for each s isin 0 1 56 We display these values in Figure G5 We check for eachs isin 0 1 56 that 214s leW 19

s Finally we claim that each (R0 R1) needs leb19b7c steps where b = log2

radicR2

0 +R21

If not then (R0 R1) needs ges steps where s = b19b7c + 1 We have s le 56 since(R0 R1) needs le56 steps We also have Ws le R2

0 + R21 by definition of Ws Hence

214s19 le R20 +R2

1 so 14s19 le 2b so s le 19b7 contradiction

Theorem G6 (Bounds on the number of divsteps required) Let R0 be an odd elementof Z Let R1 be an element of 2Z Define b = log2

radicR2

0 +R21 Define n = b19b7c

if b le 21 n = b(49b+ 23)17c if 21 lt b le 46 and n = b49b17c if b gt 46 Thendivstepn(1 R0 R12) isin Ztimes Ztimes 0

58 Fast constant-time gcd computation and modular inversion

All three cases have n le b(49b+ 23)17c since 197 lt 4917

Proof If b le 21 then divstepn(1 R0 R12) isin Ztimes Ztimes 0 by Theorem G4Assume from now on that b gt 21 This implies n ge b49b17c ge 60Define e0 e1 e2 e3 and R2 R3 as in Theorem E2 By Theorem F26 there

exists t ge 0 such that Rt+1 = 0 By Theorem E3 Rt2et isin GminusG By Theorem G2divstep2w(1 R0 R12) = (1 Rt2et 0) isin Ztimes GminusG times 0 where w = e1 + middot middot middot+ et

To finish the proof we will show that 2w le n Hence divstepn(1 R0 R12) isin ZtimesZtimes0as claimed

Suppose that 2w gt n Both 2w and n are integers so 2w ge n+ 1 ge 61 so w ge 31By Theorem F6 and Theorem F25 1 le |(Rt2et Rt+1)|2 le αw|(R0 R1)|2 = αw2b so

log2(1αw) le bCase 1 w ge 67 By definition 1αw = (1024633)w so w log2(1024633) le b We have

3449 lt log2(1024633) so 34w49 lt b so n + 1 le 2w lt 49b17 lt (49b + 23)17 Thiscontradicts the definition of n as the largest integer below 49b17 if b gt 46 and as thelargest integer below (49b+ 23)17 if b le 46

Case 2 w le 66 Then αw lt 2minus(34wminus23)49 since w ge 31 so b ge lg(1αw) gt(34w minus 23)49 so n + 1 le 2w le (49b + 23)17 This again contradicts the definitionof n if b le 46 If b gt 46 then n ge b49 middot 4617c = 132 so 2w ge n + 1 gt 132 ge 2wcontradiction

H Comparison to performance of cdivstepThis appendix briefly compares the analysis of divstep from Appendix G to an analogousanalysis of cdivstep

First modify Theorem E1 to take the unique element q isin plusmn1plusmn3 plusmn(2e minus 1)such that f + qg2e isin 2e+1Z2 The corresponding modification of Theorem G1 saysthat cdivstep2e(1 f g2) = (1 g2e (f + qg2e)2e+1) The proof proceeds identicallyexcept zj = (zjminus1 minus (zjminus1 mod 2)g2e)2 isin Z2 for e lt j lt 2e and z2e = (z2eminus1 +(z2eminus1 mod 2)g2e)2 isin Z2 Also we have q = minus1minus2(ze mod 2)minusmiddot middot middotminus2eminus1(z2eminus2 mod 2)+2e(z2eminus1 mod 2) isin plusmn1plusmn3 plusmn(2e minus 1) In the notation of [62] GB(f g) = (q r) wherer = f + qg2e = 2e+1z2e

The corresponding gcd algorithm is the centered algorithm from Appendix E6 withaddition instead of subtraction Recall that except for inessential scalings by powers of 2this is the same as the StehleacutendashZimmermann gcd algorithm from [62]

The corresponding set of transition matrices is different from the divstep case As men-tioned in Appendix F there are weight-w products with spectral radius 06403882032 wso the proof strategy for cdivstep cannot obtain weight-w norm bounds below this spectralradius This is worse than our bounds for divstep

Enumerating small pairs (R0 R1) as in Theorem G4 also produces worse results forcdivstep than for divstep See Figure H1

Daniel J Bernstein and Bo-Yin Yang 59

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bullbull

bull

bull

bull

bull

bull

bullbullbullbullbullbull

bull

bullbullbullbullbullbullbull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bullbull

bull

bullbull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

minus3

minus2

minus1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21

Figure H1 For each s isin 1 2 56 red dot at (b sminus 28283396b) where b is minimumlog2

radicR2

0 +R21 where R0 is an odd integer R1 is an even integer and (R0 R1) requires ges

iterations of divstep For each s isin 1 2 58 blue dot at (b sminus 28283396b) where b isminimum log2

radicR2

0 +R21 where R0 is an odd integer R1 is an even integer and (R0 R1)

requires ges iterations of cdivstep

  • Introduction
  • Organization of the paper
  • Definition of x-adic division steps
  • Iterates of x-adic division steps
  • Fast computation of iterates of x-adic division steps
  • Fast polynomial gcd computation and modular inversion
  • Software case study for polynomial modular inversion
  • Definition of 2-adic division steps
  • Iterates of 2-adic division steps
  • Fast computation of iterates of 2-adic division steps
  • Fast integer gcd computation and modular inversion
  • Software case study for integer modular inversion
  • Proof of main gcd theorem for polynomials
  • Review of the EuclidndashStevin algorithm
  • EuclidndashStevin results as iterates of divstep
  • Alternative proof of main gcd theorem for polynomials
  • A gcd algorithm for integers
  • Performance of the gcd algorithm for integers
  • Proof of main gcd theorem for integers
  • Comparison to performance of cdivstep
Page 7: Fast constant-time gcd computation and modular inversion

Daniel J Bernstein and Bo-Yin Yang 7

study of the software speed of inversion of integers modulo primes used in elliptic-curvecryptography

21 General notation We write M2(R) for the ring of 2 times 2 matrices over a ring RFor example M2(R) is the ring of 2times 2 matrices over the field R of real numbers

22 Notation for the polynomial case Fix a field k The formal-power-series ringk[[x]] is the set of infinite sequences (f0 f1 ) with each fi isin k The ring operations are

bull 0 (0 0 0 )

bull 1 (1 0 0 )

bull + (f0 f1 ) (g0 g1 ) 7rarr (f0 + g0 f1 + g1 )

bull minus (f0 f1 ) (g0 g1 ) 7rarr (f0 minus g0 f1 minus g1 )

bull middot (f0 f1 f2 ) (g0 g1 g2 ) 7rarr (f0g0 f0g1 + f1g0 f0g2 + f1g1 + f2g0 )

The element (0 1 0 ) is written ldquoxrdquo and the entry fi in f = (f0 f1 ) is called theldquocoefficient of xi in frdquo As in the Sage [58] computer-algebra system we write f [i] for thecoefficient of xi in f The ldquoconstant coefficientrdquo f [0] has a more standard notation f(0)and we use the f(0) notation when we are not also inspecting other coefficients

The unit group of k[[x]] is written k[[x]]lowast This is the same as the set of f isin k[[x]] suchthat f(0) 6= 0

The polynomial ring k[x] is the subring of sequences (f0 f1 ) that have only finitelymany nonzero coefficients We write deg f for the degree of f isin k[x] this means themaximum i such that f [i] 6= 0 or minusinfin if f = 0 We also define f [minusinfin] = 0 so f [deg f ] isalways defined Beware that the literature sometimes instead defines deg 0 = minus1 and inparticular Sage defines deg 0 = minus1

We define f mod xn as f [0] + f [1]x+ middot middot middot+ f [nminus 1]xnminus1 for any f isin k[[x]] We definef mod g for any f g isin k[x] with g 6= 0 as the unique polynomial r such that deg r lt deg gand f minus r = gq for some q isin k[x]

The ring k[[x]] is a domain The formal-Laurent-series field k((x)) is by definition thefield of fractions of k[[x]] Each element of k((x)) can be written (not uniquely) as fxifor some f isin k[[x]] and some i isin 0 1 2 The subring k[1x] is the smallest subringof k((x)) containing both k and 1x

23 Notation for the integer case The set of infinite sequences (f0 f1 ) with eachfi isin 0 1 forms a ring under the following operations

bull 0 (0 0 0 )

bull 1 (1 0 0 )

bull + (f0 f1 ) (g0 g1 ) 7rarr the unique (h0 h1 ) such that for every n ge 0h0 + 2h1 + 4h2 + middot middot middot+ 2nminus1hnminus1 equiv (f0 + 2f1 + 4f2 + middot middot middot+ 2nminus1fnminus1) + (g0 + 2g1 +4g2 + middot middot middot+ 2nminus1gnminus1) (mod 2n)

bull minus same with (f0 + 2f1 + 4f2 + middot middot middot+ 2nminus1fnminus1)minus (g0 + 2g1 + 4g2 + middot middot middot+ 2nminus1gnminus1)

bull middot same with (f0 + 2f1 + 4f2 + middot middot middot+ 2nminus1fnminus1) middot (g0 + 2g1 + 4g2 + middot middot middot+ 2nminus1gnminus1)

We follow the tradition among number theorists of using the notation Z2 for this ring andnot for the finite field F2 = Z2 This ring Z2 has characteristic 0 so Z can be viewed asa subset of Z2

One can think of (f0 f1 ) isin Z2 as an infinite series f0 + 2f1 + middot middot middot The differencebetween Z2 and the power-series ring F2[[x]] is that Z2 has carries for example adding(1 0 ) to (1 0 ) in Z2 produces (0 1 )

8 Fast constant-time gcd computation and modular inversion

We define f mod 2n as f0 + 2f1 + middot middot middot+ 2nminus1fnminus1 for any f = (f0 f1 ) isin Z2The unit group of Z2 is written Zlowast2 This is the same as the set of odd f isin Z2 ie the

set of f isin Z2 such that f mod 2 6= 0If f isin Z2 and f 6= 0 then ord2 f means the largest integer e ge 0 such that f isin 2eZ2

equivalently the unique integer e ge 0 such that f isin 2eZlowast2The ring Z2 is a domain The field Q2 is by definition the field of fractions of Z2

Each element of Q2 can be written (not uniquely) as f2i for some f isin Z2 and somei isin 0 1 2 The subring Z[12] is the smallest subring of Q2 containing both Z and12

3 Definition of x-adic division stepsDefine divstep Ztimes k[[x]]lowast times k[[x]]rarr Ztimes k[[x]]lowast times k[[x]] as follows

divstep(δ f g) =

(1minus δ g (g(0)f minus f(0)g)x) if δ gt 0 and g(0) 6= 0(1 + δ f (f(0)g minus g(0)f)x) otherwise

Note that f(0)g minus g(0)f is divisible by x in k[[x]] The name ldquodivision steprdquo is justified inTheorem C1 which shows that dividing a degree-d0 polynomial by a degree-d1 polynomialwith 0 le d1 lt d0 can be viewed as computing divstep2d0minus2d1

31 Transition matrices Write (δ1 f1 g1) = divstep(δ f g) Then(f1g1

)= T (δ f g)

(fg

)and

(1δ1

)= S(δ f g)

(1δ

)where T Ztimes k[[x]]lowast times k[[x]]rarrM2(k[1x]) is defined by

T (δ f g) =

(0 1g(0)x

minusf(0)x

)if δ gt 0 and g(0) 6= 0

(1 0minusg(0)x

f(0)x

)otherwise

and S Ztimes k[[x]]lowast times k[[x]]rarrM2(Z) is defined by

S(δ f g) =

(1 01 minus1

)if δ gt 0 and g(0) 6= 0

(1 01 1

)otherwise

Note that both T (δ f g) and S(δ f g) are defined entirely by (δ f(0) g(0)) they do notdepend on the remaining coefficients of f and g

32 Decomposition This divstep function is a composition of two simpler functions(and the transition matrices can similarly be decomposed)

bull a conditional swap replacing (δ f g) with (minusδ g f) if δ gt 0 and g(0) 6= 0 followedby

bull an elimination replacing (δ f g) with (1 + δ f (f(0)g minus g(0)f)x)

Daniel J Bernstein and Bo-Yin Yang 9

A δ f [0] g[0] f [1] g[1] f [2] g[2] f [3] g[3] middot middot middot

minusδ swap

B δprime f prime[0] gprime[0] f prime[1] gprime[1] f prime[2] gprime[2] f prime[3] gprime[3] middot middot middot

C 1 + δprime f primeprime[0] gprimeprime[0] f primeprime[1] gprimeprime[1] f primeprime[2] gprimeprime[2] f primeprime[3] middot middot middot

Figure 33 Data flow in an x-adic division step decomposed as a conditional swap (A toB) followed by an elimination (B to C) The swap bit is set if δ gt 0 and g[0] 6= 0 Theg outputs are f prime[0]gprime[1]minus gprime[0]f prime[1] f prime[0]gprime[2]minus gprime[0]f prime[2] f prime[0]gprime[3]minus gprime[0]f prime[3] etc Colorscheme consider eg linear combination cg minus df of power series f g where c = f [0] andd = g[0] inputs f g affect jth output coefficient cg[j]minus df [j] via f [j] g[j] (black) and viachoice of c d (red+blue) Green operations are special creating the swap bit

See Figure 33This decomposition is our motivation for defining divstep in the first casemdashthe swapped

casemdashto compute (g(0)f minus f(0)g)x Using (f(0)g minus g(0)f)x in both cases might seemsimpler but would require the conditional swap to forward a conditional negation to theelimination

One can also compose these functions in the opposite order This is compatible withsome of what we say later about iterating the composition However the correspondingtransition matrices would depend on another coefficient of g This would complicate ouralgorithm statements

Another possibility as in [23] is to keep f and g in place using the sign of δ to decidewhether to add a multiple of f into g or to add a multiple of g into f One way to dothis in constant time is to perform two multiplications and two additions Another way isto perform conditional swaps before and after one addition and one multiplication Ourapproach is more efficient

34 Scaling One can define a variant of divstep that computes f minus (f(0)g(0))g in thefirst case (the swapped case) and g minus (g(0)f(0))f in the second case

This variant has two advantages First it multiplies only one series rather than twoby a scalar in k Second multiplying both f and g by any u isin k[[x]]lowast has the same effecton the output so this variant induces a function on fewer variables (δ gf) the ratioρ = gf is replaced by (1ρ minus 1ρ(0))x when δ gt 0 and ρ(0) 6= 0 and by (ρ minus ρ(0))xotherwise while δ is replaced by 1minus δ or 1+ δ respectively This is the composition of firsta conditional inversion that replaces (δ ρ) with (minusδ 1ρ) if δ gt 0 and ρ(0) 6= 0 second anelimination that replaces ρ with (ρminus ρ(0))x while adding 1 to δ

On the other hand working projectively with f g is generally more efficient than workingwith ρ Working with f(0)gminus g(0)f rather than gminus (g(0)f(0))f has the virtue of being aldquofraction-freerdquo computation it involves only ring operations in k not divisions This makesthe effect of scaling only slightly more difficult to track Specifically say divstep(δ f g) =

10 Fast constant-time gcd computation and modular inversion

Table 41 Iterates (δn fn gn) = divstepn(δ f g) for k = F7 δ = 1 f = 2 + 7x+ 1x2 +8x3 + 2x4 + 8x5 + 1x6 + 8x7 and g = 3 + 1x+ 4x2 + 1x3 + 5x4 + 9x5 + 2x6 Each powerseries is displayed as a row of coefficients of x0 x1 etc

n δn fn gnx0 x1 x2 x3 x4 x5 x6 x7 x8 x9 x0 x1 x2 x3 x4 x5 x6 x7 x8 x9

0 1 2 0 1 1 2 1 1 1 0 0 3 1 4 1 5 2 2 0 0 0 1 0 3 1 4 1 5 2 2 0 0 0 5 2 1 3 6 6 3 0 0 0 2 1 3 1 4 1 5 2 2 0 0 0 1 4 4 0 1 6 0 0 0 0 3 0 1 4 4 0 1 6 0 0 0 0 3 6 1 2 5 2 0 0 0 0 4 1 1 4 4 0 1 6 0 0 0 0 1 3 2 2 5 0 0 0 0 0 5 0 1 3 2 2 5 0 0 0 0 0 1 2 5 3 6 0 0 0 0 0 6 1 1 3 2 2 5 0 0 0 0 0 6 3 1 1 0 0 0 0 0 0 7 0 6 3 1 1 0 0 0 0 0 0 1 4 4 2 0 0 0 0 0 0 8 1 6 3 1 1 0 0 0 0 0 0 0 2 4 0 0 0 0 0 0 0 9 2 6 3 1 1 0 0 0 0 0 0 5 3 0 0 0 0 0 0 0 0

10 minus1 5 3 0 0 0 0 0 0 0 0 4 5 5 0 0 0 0 0 0 0 11 0 5 3 0 0 0 0 0 0 0 0 6 4 0 0 0 0 0 0 0 0 12 1 5 3 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 13 0 2 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 14 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16 3 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 17 4 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 18 5 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 19 6 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

(δprime f prime gprime) If u isin k[[x]] with u(0) = 1 then divstep(δ uf ug) = (δprime uf prime ugprime) If u isin klowast then

divstep(δ uf g) = (δprime f prime ugprime) in the first (swapped) casedivstep(δ uf g) = (δprime uf prime ugprime) in the second casedivstep(δ f ug) = (δprime uf prime ugprime) in the first casedivstep(δ f ug) = (δprime f prime ugprime) in the second case

and divstep(δ uf ug) = (δprime uf prime u2gprime)One can also scale g(0)f minus f(0)g by other nonzero constants For example for typical

fields k one can quickly find ldquohalf-sizerdquo a b such that ab = g(0)f(0) Computing af minus bgmay be more efficient than computing g(0)f minus f(0)g

We use the g(0)f(0) variant in our case study in Section 7 for the field k = F3Dividing by f(0) in this field is the same as multiplying by f(0)

4 Iterates of x-adic division stepsStarting from (δ f g) isin Z times k[[x]]lowast times k[[x]] define (δn fn gn) = divstepn(δ f g) isinZ times k[[x]]lowast times k[[x]] for each n ge 0 Also define Tn = T (δn fn gn) isin M2(k[1x]) andSn = S(δn fn gn) isinM2(Z) for each n ge 0 Our main algorithm relies on various simplemathematical properties of these objects this section states those properties Table 41shows all δn fn gn for one example of (δ f g)

Theorem 42(fngn

)= Tnminus1 middot middot middot Tm

(fmgm

)and

(1δn

)= Snminus1 middot middot middot Sm

(1δm

)if n ge m ge 0

Daniel J Bernstein and Bo-Yin Yang 11

Proof This follows from(fm+1gm+1

)= Tm

(fmgm

)and

(1

δm+1

)= Sm

(1δm

)

Theorem 43 Tnminus1 middot middot middot Tm isin(k + middot middot middot+ 1

xnminusmminus1 k k + middot middot middot+ 1xnminusmminus1 k

1xk + middot middot middot+ 1

xnminusm k1xk + middot middot middot+ 1

xnminusm k

)if n gt m ge 0

A product of nminusm of these T matrices can thus be stored using 4(nminusm) coefficientsfrom k Beware that the theorem statement relies on n gt m for n = m the product is theidentity matrix

Proof If n minus m = 1 then the product is Tm which is in(

k k1xk

1xk

)by definition For

nminusm gt 1 assume inductively that

Tnminus1 middot middot middot Tm+1 isin(k + middot middot middot+ 1

xnminusmminus2 k k + middot middot middot+ 1xnminusmminus2 k

1xk + middot middot middot+ 1

xnminusmminus1 k1xk + middot middot middot+ 1

xnminusmminus1 k

)

Multiply on the right by Tm isin(

k k1xk

1xk

) The top entries of the product are in (k + middot middot middot+

1xnminusmminus2 k) + ( 1

xk+ middot middot middot+ 1xnminusmminus1 k) = k+ middot middot middot+ 1

xnminusmminus1 k as claimed and the bottom entriesare in ( 1

xk + middot middot middot+ 1xnminusmminus1 k) + ( 1

x2 k + middot middot middot+ 1xnminusm k) = 1

xk + middot middot middot+ 1xnminusm k as claimed

Theorem 44 Snminus1 middot middot middot Sm isin(

1 02minus (nminusm) nminusmminus 2 nminusm 1minus1

)if n gt

m ge 0

In other words the bottom-left entry is between minus(nminusm) exclusive and nminusm inclusiveand has the same parity as nminusm There are exactly 2(nminusm) possibilities for the productBeware that the theorem statement again relies on n gt m

Proof If nminusm = 1 then 2minus (nminusm) nminusmminus 2 nminusm = 1 and the product isSm which is

( 1 01 plusmn1

)by definition

For nminusm gt 1 assume inductively that

Snminus1 middot middot middot Sm+1 isin(

1 02minus (nminusmminus 1) nminusmminus 3 nminusmminus 1 1minus1

)

and multiply on the right by Sm =( 1 0

1 plusmn1) The top two entries of the product are again 1

and 0 The bottom-left entry is in 2minus (nminusmminus 1) nminusmminus 3 nminusmminus 1+1minus1 =2minus (nminusm) nminusmminus 2 nminusm The bottom-right entry is again 1 or minus1

The next theorem compares δm fm gm TmSm to δprimem f primem gprimem T primemS primem defined in thesame way starting from (δprime f prime gprime) isin Ztimes k[[x]]lowast times k[[x]]

Theorem 45 Let m t be nonnegative integers Assume that δprimem = δm f primem equiv fm(mod xt) and gprimem equiv gm (mod xt) Then for each integer n with m lt n le m+ t T primenminus1 =Tnminus1 S primenminus1 = Snminus1 δprimen = δn f primen equiv fn (mod xtminus(nminusmminus1)) and gprimen equiv gn (mod xtminus(nminusm))

In other words δm and the first t coefficients of fm and gm determine all of thefollowing

bull the matrices Tm Tm+1 Tm+tminus1

bull the matrices SmSm+1 Sm+tminus1

bull δm+1 δm+t

bull the first t coefficients of fm+1 the first t minus 1 coefficients of fm+2 the first t minus 2coefficients of fm+3 and so on through the first coefficient of fm+t

12 Fast constant-time gcd computation and modular inversion

bull the first tminus 1 coefficients of gm+1 the first tminus 2 coefficients of gm+2 the first tminus 3coefficients of gm+3 and so on through the first coefficient of gm+tminus1

Our main algorithm computes (δn fn mod xtminus(nminusmminus1) gn mod xtminus(nminusm)) for any selectedn isin m+ 1 m+ t along with the product of the matrices Tm Tnminus1 See Sec-tion 5

Proof See Figure 33 for the intuition Formally fix m t and induct on n Observe that(δprimenminus1 f

primenminus1(0) gprimenminus1(0)) equals (δnminus1 fnminus1(0) gnminus1(0))

bull If n = m + 1 By assumption f primem equiv fm (mod xt) and n le m + t so t ge 1 so inparticular f primem(0) = fm(0) Similarly gprimem(0) = gm(0) By assumption δprimem = δm

bull If n gt m + 1 f primenminus1 equiv fnminus1 (mod xtminus(nminusmminus2)) by the inductive hypothesis andn le m + t so t minus (n minus m minus 2) ge 2 so in particular f primenminus1(0) = fnminus1(0) similarlygprimenminus1 equiv gnminus1 (mod xtminus(nminusmminus1)) by the inductive hypothesis and tminus (nminusmminus1) ge 1so in particular gprimenminus1(0) = gnminus1(0) and δprimenminus1 = δnminus1 by the inductive hypothesis

Now use the fact that T and S inspect only the first coefficient of each series to seethat

T primenminus1 = T (δprimenminus1 fprimenminus1 g

primenminus1) = T (δnminus1 fnminus1 gnminus1) = Tnminus1

S primenminus1 = S(δprimenminus1 fprimenminus1 g

primenminus1) = S(δnminus1 fnminus1 gnminus1) = Snminus1

as claimedMultiply

(1

δprimenminus1

)=(

1δnminus1

)by S primenminus1 = Snminus1 to see that δprimen = δn as claimed

By the inductive hypothesis (T primem T primenminus2) = (Tm Tnminus2) and T primenminus1 = Tnminus1 so

T primenminus1 middot middot middot T primem = Tnminus1 middot middot middot Tm Abbreviate Tnminus1 middot middot middot Tm as P Then(f primengprimen

)= P

(f primemgprimem

)and(

fngn

)= P

(fmgm

)by Theorem 42 so

(f primen minus fngprimen minus gn

)= P

(f primem minus fmgprimem minus gm

)

By Theorem 43 P isin(k + middot middot middot+ 1

xnminusmminus1 k k + middot middot middot+ 1xnminusmminus1 k

1xk + middot middot middot+ 1

xnminusm k1xk + middot middot middot+ 1

xnminusm k

) By assumption

f primem minus fm and gprimem minus gm are multiples of xt Multiply to see that f primen minus fn is a multiple ofxtxnminusmminus1 = xtminus(nminusmminus1) as claimed and gprimen minus gn is a multiple of xtxnminusm = xtminus(nminusm)

as claimed

5 Fast computation of iterates of x-adic division stepsOur goal in this section is to compute (δn fn gn) = divstepn(δ f g) We are giventhe power series f g to precision t t respectively and we compute fn gn to precisiont minus n + 1 t minus n respectively if n ge 1 see Theorem 45 We also compute the n-steptransition matrix Tnminus1 middot middot middot T0

We begin with an algorithm that simply computes one divstep iterate after anothermultiplying by scalars in k and dividing by x exactly as in the divstep definition We specifythis algorithm in Figure 51 as a function divstepsx in the Sage [58] computer-algebrasystem The quantities u v q r isin k[1x] inside the algorithm keep track of the coefficientsof the current f g in terms of the original f g

The rest of this section considers (1) constant-time computation (2) ldquojumpingrdquo throughdivision steps and (3) further techniques to save time inside the computations

52 Constant-time computation The underlying Sage functions do not take constanttime but they can be replaced with functions that do take constant time The casedistinction can be replaced with constant-time bit operations for example obtaining thenew (δ f g) as follows

Daniel J Bernstein and Bo-Yin Yang 13

defdivstepsx(ntdeltafg)asserttgt=nandngt=0fg=ftruncate(t)gtruncate(t)kx=fparent()x=kxgen()uvqr=kx(1)kx(0)kx(0)kx(1)

whilengt0f=ftruncate(t)ifdeltagt0andg[0]=0deltafguvqr=-deltagfqruvf0g0=f[0]g[0]deltagqr=1+delta(f0g-g0f)x(f0q-g0u)x(f0r-g0v)xnt=n-1t-1g=kx(g)truncate(t)

M2kx=MatrixSpace(kxfraction_field()2)returndeltafgM2kx((uvqr))

Figure 51 Algorithm divstepsx to compute (δn fn gn) = divstepn(δ f g) Inputsn t isin Z with 0 le n le t δ isin Z f g isin k[[x]] to precision at least t Outputs δn fn toprecision t if n = 0 or tminus (nminus 1) if n ge 1 gn to precision tminus n Tnminus1 middot middot middot T0

bull Let s be the ldquosign bitrdquo of minusδ meaning 0 if minusδ ge 0 or 1 if minusδ lt 0

bull Let z be the ldquononzero bitrdquo of g(0) meaning 0 if g(0) = 0 or 1 if g(0) 6= 0

bull Compute and output 1 + (1minus 2sz)δ For example shift δ to obtain 2δ ldquoANDrdquo withminussz to obtain 2szδ and subtract from 1 + δ

bull Compute and output f oplus sz(f oplus g) obtaining f if sz = 0 or g if sz = 1

bull Compute and output (1minus 2sz)(f(0)gminus g(0)f)x Alternatively compute and outputsimply (f(0)gminusg(0)f)x See the discussion of decomposition and scaling in Section 3

Standard techniques minimize the exact cost of these computations for various repre-sentations of the inputs The details depend on the platform See our case study inSection 7

53 Jumping through division steps Here is a more general divide-and-conqueralgorithm to compute (δn fn gn) and the n-step transition matrix Tnminus1 middot middot middot T0

bull If n le 1 use the definition of divstep and stop

bull Choose j isin 1 2 nminus 1

bull Jump j steps from δ f g to δj fj gj Specifically compute the j-step transition

matrix Tjminus1 middot middot middot T0 and then multiply by(fg

)to obtain

(fjgj

) To compute the

j-step transition matrix call the same algorithm recursively The critical point hereis that this recursive call uses the inputs to precision only j this determines thej-step transition matrix by Theorem 45

bull Similarly jump another nminus j steps from δj fj gj to δn fn gn

14 Fast constant-time gcd computation and modular inversion

fromdivstepsximportdivstepsx

defjumpdivstepsx(ntdeltafg)asserttgt=nandngt=0kx=ftruncate(t)parent()

ifnlt=1returndivstepsx(ntdeltafg)

j=n2

deltaf1g1P1=jumpdivstepsx(jjdeltafg)fg=P1vector((fg))fg=kx(f)truncate(t-j)kx(g)truncate(t-j)

deltaf2g2P2=jumpdivstepsx(n-jn-jdeltafg)fg=P2vector((fg))fg=kx(f)truncate(t-n+1)kx(g)truncate(t-n)

returndeltafgP2P1

Figure 54 Algorithm jumpdivstepsx to compute (δn fn gn) = divstepn(δ f g) Sameinputs and outputs as in Figure 51

This splits an (n t) problem into two smaller problems namely a (j j) problem and an(nminus j nminus j) problem at the cost of O(1) polynomial multiplications involving O(t+ n)coefficients

If j is always chosen as 1 then this algorithm ends up performing the same computationsas Figure 51 compute δ1 f1 g1 to precision tminus 1 compute δ2 f2 g2 to precision tminus 2etc But there are many more possibilities for j

Our jumpdivstepsx algorithm in Figure 54 takes j = bn2c balancing the (j j)problem with the (n minus j n minus j) problem In particular with subquadratic polynomialmultiplication the algorithm is subquadratic With FFT-based polynomial multiplica-tion the algorithm uses (t+ n)(log(t+ n))1+o(1) + n(logn)2+o(1) operations and hencen(logn)2+o(1) operations if t is bounded by n(logn)1+o(1) For comparison Figure 51takes Θ(n(t+ n)) operations

We emphasize that unlike standard fast-gcd algorithms such as [66] this algorithmdoes not call a polynomial-division subroutine between its two recursive calls

55 More speedups Our jumpdivstepsx algorithm is asymptotically Θ(logn) timesslower than fast polynomial multiplication The best previous gcd algorithms also have aΘ(logn) slowdown both in the worst case and in the average case10 This might suggestthat there is very little room for improvement

However Θ says nothing about constant factors There are several standard ideas forconstant-factor speedups in gcd computation beyond lower-level speedups in multiplicationWe now apply the ideas to jumpdivstepsx

The final polynomial multiplications in jumpdivstepsx produce six large outputsf g and four entries of a matrix product P Callers do not always use all of theseoutputs for example the recursive calls to jumpdivstepsx do not use the resulting f

10There are faster cases Strassen [66] gave an algorithm that replaces the log factor with the entropy ofthe list of degrees in the corresponding continued fraction there are occasional inputs where this entropyis o(log n) For example computing gcd 0 takes linear time However these speedups are not usefulfor constant-time computations

Daniel J Bernstein and Bo-Yin Yang 15

and g In principle there is no difficulty applying dead-value elimination and higher-leveloptimizations to each of the 63 possible combinations of desired outputs

The recursive calls in jumpdivstepsx produce two products P1 P2 of transitionmatrices which are then (if desired) multiplied to obtain P = P2P1 In Figure 54jumpdivstepsx multiplies f and g first by P1 and then by P2 to obtain fn and gn Analternative is to multiply by P

The inputs f and g are truncated to t coefficients and the resulting fn and gn aretruncated to tminus n+ 1 and tminus n coefficients respectively Often the caller (for example

jumpdivstepsx itself) wants more coefficients of P(fg

) Rather than performing an

entirely new multiplication by P one can add the truncation errors back into the resultsmultiply P by the f and g truncation errors multiply P2 by the fj and gj truncationerrors and skip the final truncations of fn and gn

There are standard speedups for multiplication in the context of truncated productssums of products (eg in matrix multiplication) partially known products (eg the

coefficients of xminus1 xminus2 xminus3 in P1

(fg

)are known to be 0) repeated inputs (eg each

entry of P1 is multiplied by two entries of P2 and by one of f g) and outputs being reusedas inputs See eg the discussions of ldquoFFT additionrdquo ldquoFFT cachingrdquo and ldquoFFT doublingrdquoin [11]

Any choice of j between 1 and nminus 1 produces correct results Values of j close to theedges do not take advantage of fast polynomial multiplication but this does not imply thatj = bn2c is optimal The number of combinations of choices of j and other options aboveis small enough that for each small (t n) in turn one can simply measure all options andselect the best The results depend on the contextmdashwhich outputs are neededmdashand onthe exact performance of polynomial multiplication

6 Fast polynomial gcd computation and modular inversionThis section presents three algorithms gcdx computes gcdR0 R1 where R0 is a poly-nomial of degree d gt 0 and R1 is a polynomial of degree ltd gcddegree computes merelydeg gcdR0 R1 recipx computes the reciprocal of R1 modulo R0 if gcdR0 R1 = 1Figure 61 specifies all three algorithms and Theorem 62 justifies all three algorithmsThe theorem is proven in Appendix A

Each algorithm calls divstepsx from Section 5 to compute (δ2dminus1 f2dminus1 g2dminus1) =divstep2dminus1(1 f g) Here f = xdR0(1x) is the reversal of R0 and g = xdminus1R1(1x) Ofcourse one can replace divstepsx here with jumpdivstepsx which has the same outputsThe algorithms vary in the input-precision parameter t passed to divstepsx and vary inhow they use the outputs

bull gcddegree uses input precision t = 2dminus 1 to compute δ2dminus1 and then uses one partof Theorem 62 which states that deg gcdR0 R1 = δ2dminus12

bull gcdx uses precision t = 3dminus 1 to compute δ2dminus1 and d+ 1 coefficients of f2dminus1 Thealgorithm then uses another part of Theorem 62 which states that f2dminus1f2dminus1(0)is the reversal of gcdR0 R1

bull recipx uses t = 2dminus1 to compute δ2dminus1 1 coefficient of f2dminus1 and the product P oftransition matrices It then uses the last part of Theorem 62 which (in the coprimecase) states that the top-right corner of P is f2dminus1(0)x2dminus2 times the degree-(dminus 1)reversal of the desired reciprocal

One can generalize to compute gcdR0 R1 for any polynomials R0 R1 of degree ledin time that depends only on d scan the inputs to find the top degree and always perform

16 Fast constant-time gcd computation and modular inversion

fromdivstepsximportdivstepsx

defgcddegree(R0R1)d=R0degree()assertdgt0anddgtR1degree()fg=R0reverse(d)R1reverse(d-1)deltafgP=divstepsx(2d-12d-11fg)returndelta2

defgcdx(R0R1)d=R0degree()assertdgt0anddgtR1degree()fg=R0reverse(d)R1reverse(d-1)deltafgP=divstepsx(2d-13d-11fg)returnfreverse(delta2)f[0]

defrecipx(R0R1)d=R0degree()assertdgt0anddgtR1degree()fg=R0reverse(d)R1reverse(d-1)deltafgP=divstepsx(2d-12d-11fg)ifdelta=0returnkx=fparent()x=kxgen()returnkx(x^(2d-2)P[0][1]f[0])reverse(d-1)

Figure 61 Algorithm gcdx to compute gcdR0 R1 algorithm gcddegree to computedeg gcdR0 R1 algorithm recipx to compute the reciprocal of R1 modulo R0 whengcdR0 R1 = 1 All three algorithms assume that R0 R1 isin k[x] that degR0 gt 0 andthat degR0 gt degR1

2d iterations of divstep The extra iteration handles the case that both R0 and R1 havedegree d For simplicity we focus on the case degR0 = d gt degR1 with d gt 0

One can similarly characterize gcd and modular reciprocal in terms of subsequentiterates of divstep with δ increased by the number of extra steps Implementors mightfor example prefer to use an iteration count that is divisible by a large power of 2

Theorem 62 Let k be a field Let d be a positive integer Let R0 R1 be elements ofthe polynomial ring k[x] with degR0 = d gt degR1 Define G = gcdR0 R1 and let Vbe the unique polynomial of degree lt d minus degG such that V R1 equiv G (mod R0) Definef = xdR0(1x) g = xdminus1R1(1x) (δn fn gn) = divstepn(1 f g) Tn = T (δn fn gn) and(un vnqn rn

)= Tnminus1 middot middot middot T0 Then

degG = δ2dminus12G = xdegGf2dminus1(1x)f2dminus1(0)V = xminusd+1+degGv2dminus1(1x)f2dminus1(0)

63 A numerical example Take k = F7 d = 7 R0 = 2x7 + 7x6 + 1x5 + 8x4 +2x3 + 8x2 + 1x + 8 and R1 = 3x6 + 1x5 + 4x4 + 1x3 + 5x2 + 9x + 2 Then f =2 + 7x+ 1x2 + 8x3 + 2x4 + 8x5 + 1x6 + 8x7 and g = 3 + 1x+ 4x2 + 1x3 + 5x4 + 9x5 + 2x6The gcdx algorithm computes (δ13 f13 g13) = divstep13(1 f g) = (0 2 6) see Table 41

Daniel J Bernstein and Bo-Yin Yang 17

for all iterates of divstep on these inputs Dividing f13 by its constant coefficient produces1 the reversal of gcdR0 R1 The gcddegree algorithm simply computes δ13 = 0 whichis enough information to see that gcdR0 R1 = 164 Constant-factor speedups Compared to the general power-series setting indivstepsx the inputs to gcddegree gcdx and recipx are more limited For examplethe f input is all 0 after the first d+ 1 coefficients so computations involving subsequentcoefficients can be skipped Similarly the top-right entry of the matrix P in recipx isguaranteed to have coefficient 0 for 1 1x 1x2 and so on through 1xdminus2 so one cansave time in the polynomial computations that produce this entry See Theorems A1and A2 for bounds on the sizes of all intermediate results65 Bottom-up gcd and inversion Our division steps eliminate bottom coefficients ofthe inputs Appendices B C and D relate this to a traditional EuclidndashStevin computationwhich gradually eliminates top coefficients of the inputs The reversal of inputs andoutputs in gcddegree gcdx and recipx exchanges the role of top coefficients and bottomcoefficients The following theorem states an alternative skip the reversal and insteaddirectly eliminate the bottom coefficients of the original inputsTheorem 66 Let k be a field Let d be a positive integer Let f g be elements of thepolynomial ring k[x] with f(0) 6= 0 deg f le d and deg g lt d Define γ = gcdf g

Define (δn fn gn) = divstepn(1 f g) Tn = T (δn fn gn) and(un vnqn rn

)= Tnminus1 middot middot middot T0

Thendeg γ = deg f2dminus1

γ = f2dminus1`x2dminus2γ equiv x2dminus2v2dminus1g` (mod f)

where ` is the leading coefficient of f2dminus1Proof Note that γ(0) 6= 0 since f is a multiple of γ

Define R0 = xdf(1x) Then R0 isin k[x] degR0 = d since f(0) 6= 0 and f = xdR0(1x)Define R1 = xdminus1g(1x) Then R1 isin k[x] degR1 lt d and g = xdminus1R1(1x)Define G = gcdR0 R1 We will show below that γ = cxdegGG(1x) for some c isin klowast

Then γ = cf2dminus1f2dminus1(0) by Theorem 62 ie γ is a constant multiple of f2dminus1 Hencedeg γ = deg f2dminus1 as claimed Furthermore γ has leading coefficient 1 so 1 = c`f2dminus1(0)ie γ = f2dminus1` as claimed

By Theorem 42 f2dminus1 = u2dminus1f + v2dminus1g Also x2dminus2u2dminus1 x2dminus2v2dminus1 isin k[x] by

Theorem 43 so

x2dminus2γ = (x2dminus2u2dminus1`)f + (x2dminus2v2dminus1`)g equiv (x2dminus2v2dminus1`)g (mod f)

as claimedAll that remains is to show that γ = cxdegGG(1x) for some c isin klowast If R1 = 0 then

G = gcdR0 0 = R0R0[d] so xdegGG(1x) = xdR0(1x)R0[d] = ff [0] also g = 0 soγ = ff [deg f ] = (f [0]f [deg f ])xdegGG(1x) Assume from now on that R1 6= 0

We have R1 = QG for some Q isin k[x] Note that degR1 = degQ + degG AlsoR1(1x) = Q(1x)G(1x) so xdegR1R1(1x) = xdegQQ(1x)xdegGG(1x) since degR1 =degQ+ degG Hence xdegGG(1x) divides the polynomial xdegR1R1(1x) which in turndivides xdminus1R1(1x) = g Similarly xdegGG(1x) divides f Hence xdegGG(1x) dividesgcdf g = γ

Furthermore G = AR0 +BR1 for some AB isin k[x] Take any integer m above all ofdegGdegAdegB then

xm+dminusdegGxdegGG(1x) = xm+dG(1x)= xmA(1x)xdR0(1x) + xm+1B(1x)xdminus1R1(1x)= xmminusdegAxdegAA(1x)f + xm+1minusdegBxdegBB(1x)g

18 Fast constant-time gcd computation and modular inversion

so xm+dminusdegGxdegGG(1x) is a multiple of γ so the polynomial γxdegGG(1x) dividesxm+dminusdegG so γxdegGG(1x) = cxi for some c isin klowast and some i ge 0 We must have i = 0since γ(0) 6= 0

67 How to divide by x2dminus2 modulo f Unlike top-down inversion (and top-downgcd and bottom-up gcd) bottom-up inversion naturally scales its result by a power of xWe now explain various ways to remove this scaling

For simplicity we focus here on the case gcdf g = 1 The divstep computation inTheorem 66 naturally produces the polynomial x2dminus2v2dminus1` which has degree at mostd minus 1 by Theorem 62 Our objective is to divide this polynomial by x2dminus2 modulo f obtaining the reciprocal of g modulo f by Theorem 66 One way to do this is to multiplyby a reciprocal of x2dminus2 modulo f This reciprocal depends only on d and f ie it can beprecomputed before g is available

Here are several standard strategies to compute 1x2dminus2 modulo f or more generally1xe modulo f for any nonnegative integer e

bull Compute the polynomial (1 minus ff(0))x which is 1x modulo f and then raisethis to the eth power modulo f This takes a logarithmic number of multiplicationsmodulo f

bull Start from 1f(0) the reciprocal of f modulo x Compute the reciprocals of fmodulo x2 x4 x8 etc by Newton (Hensel) iteration After finding the reciprocal rof f modulo xe divide 1minus rf by xe to obtain the reciprocal of xe modulo f Thistakes a constant number of full-size multiplications a constant number of half-sizemultiplications etc The total number of multiplications is again logarithmic butif e is linear as in the case e = 2dminus 2 then the total size of the multiplications islinear without an extra logarithmic factor

bull Combine the powering strategy with the Hensel strategy For example use thereciprocal of f modulo xdminus1 to compute the reciprocal of xdminus1 modulo f and thensquare modulo f to obtain the reciprocal of x2dminus2 modulo f rather than using thereciprocal of f modulo x2dminus2

bull Write down a simple formula for the reciprocal of xe modulo f if f has a specialstructure that makes this easy For example if f = xd minus 1 as in original NTRUand d ge 3 then the reciprocal of x2dminus2 is simply x2 As another example iff = xd + xdminus1 + middot middot middot+ 1 and d ge 5 then the reciprocal of x2dminus2 is x4

In most cases the division by x2dminus2 modulo f involves extra arithmetic that is avoided bythe top-down reversal approach On the other hand the case f = xd minus 1 does not involveany extra arithmetic There are also applications of inversion that can easily handle ascaled result

7 Software case study for polynomial modular inversionAs a concrete case study we consider the problem of inversion in the ring (Z3)[x](x700 +x699 + middot middot middot+ x+ 1) on an Intel Haswell CPU core The situation before our work was thatthis inversion consumed half of the key-generation time in the ntruhrss701 cryptosystemabout 150000 cycles out of 300000 cycles This cryptosystem was introduced by HuumllsingRijneveld Schanck and Schwabe [39] at CHES 201711

11NTRU-HRSS is another round-2 submission in NISTrsquos ongoing post-quantum competition Googlehas also selected NTRU-HRSS for its ldquoCECPQ2rdquo post-quantum experiment CECPQ2 includes a slightlymodified version of ntruhrss701 and the NTRU-HRSS authors have also announced their intention totweak NTRU-HRSS these changes do not affect the performance of key generation

Daniel J Bernstein and Bo-Yin Yang 19

We saved 60000 cycles out of these 150000 cycles reducing the total key-generationtime to 240000 cycles Our approach also simplifies code for example eliminating theseries of conditional power-of-2 rotations in [39]

71 Details of the new computation Our computation works with one integer δ andfour polynomials v r f g isin (Z3)[x] Initially δ = 1 v = 0 r = 1 f = x700 + x699 + middot middot middot+x + 1 and g = g699x

699 + middot middot middot + g0x0 is obtained by reversing the 700 input coefficients

We actually store minusδ instead of δ this turns out to be marginally more efficient for thisplatform

We then apply 2 middot 700minus 1 = 1399 iterations of a main loop Each iteration works asfollows with the order of operations determined experimentally to try to minimize latencyissues

bull Replace v with xv

bull Compute a swap mask as minus1 if δ gt 0 and g(0) 6= 0 otherwise 0

bull Compute c isin Z3 as f(0)g(0)

bull Replace δ with minusδ if the swap mask is set

bull Add 1 to δ

bull Replace (f g) with (g f) if the swap mask is set

bull Replace g with (g minus cf)x

bull Replace (v r) with (r v) if the swap mask is set

bull Replace r with r minus cv

This multiplies the transition matrix T (δ f g) into the vector(vxnminus1

rxn

)where n is the

number of iterations and at the same time replaces (δ f g) with divstep(δ f g) Actuallywe are slightly modifying divstep here (and making the corresponding modification toT ) replacing the original coefficients f(0) and g(0) with the scaled coefficients 1 andg(0)f(0) = f(0)g(0) as in Section 34 In other words if f(0) = minus1 then we negate bothf and g We suppress further comments on this deviation from the definition of divstep

The effect of n iterations is to replace the original (δ f g) with divstepn(δ f g) The

matrix product Tnminus1 middot middot middot T0 has the form(middot middot middot vxnminus1

middot middot middot rxn

) The new f and g are congruent

to vxnminus1 and rxn respectively times the original g modulo x700 + x699 + middot middot middot+ x+ 1After 1399 iterations we have δ = 0 if and only if the input is invertible modulo

x700 + x699 + middot middot middot + x + 1 by Theorem 62 Furthermore if δ = 0 then the inverse isxminus699v1399(1x)f(0) where v1399 = vx1398 ie the inverse is x699v(1x)f(0) We thusmultiply v by f(0) and take the coefficients of x0 x699 in reverse order to obtain thedesired inverse

We use signed representatives minus1 0 1 for elements of Z3 We use 2-bit tworsquos-complement representations of minus1 0 1 the bottom bit is 1 0 1 respectively and thetop bit is 1 0 0 We wrote a simple superoptimizer to find optimal sequences of 3 6 6bit operations respectively for multiplication addition and subtraction We do notclaim novelty for the approach in this paragraph Boothby and Bradshaw in [20 Section41] suggested the same 2-bit representation for Z3 (without explaining it as signedtworsquos-complement) and found the same operation counts by a similar search

We store each of the four polynomials v r f g as six 256-bit vectors for examplethe coefficients of x0 x255 in v are stored as a 256-bit vector of bottom bits and a256-bit vector of top bits Within each 256-bit vector we use the first 64-bit word to

20 Fast constant-time gcd computation and modular inversion

store the coefficients of x0 x4 x252 the next 64-bit word to store the coefficients ofx1 x5 x253 etc Multiplying by x thus moves the first 64-bit word to the secondmoves the second to the third moves the third to the fourth moves the fourth to the firstand shifts the first by 1 bit the bit that falls off the end of the first word is inserted intothe next vector

The first 256 iterations involve only the first 256-bit vectors for v and r so we simplyleave the second and third vectors as 0 The next 256 iterations involve only the first two256-bit vectors for v and r Similarly we manipulate only the first 256-bit vectors for f andg in the final 256 iterations and we manipulate only the first and second 256-bit vectors forf and g in the previous 256 iterations Our main loop thus has five sections 256 iterationsinvolving (1 1 3 3) vectors for (v r f g) 256 iterations involving (2 2 3 3) vectors 375iterations involving (3 3 3 3) vectors 256 iterations involving (3 3 2 2) vectors and 256iterations involving (3 3 1 1) vectors This makes the code longer but saves almost 20of the vector manipulations

72 Comparison to the old computation Many features of our computation arealso visible in the software from [39] We emphasize the ways that our computation ismore streamlined

There are essentially the same four polynomials v r f g in [39] (labeled ldquocrdquo ldquobrdquo ldquogrdquoldquofrdquo) Instead of a single integer δ (plus a loop counter) there are four integers ldquodegfrdquoldquodeggrdquo ldquokrdquo and ldquodonerdquo (plus a loop counter) One cannot simply eliminate ldquodegfrdquo andldquodeggrdquo in favor of their difference δ since ldquodonerdquo is set when ldquodegfrdquo reaches 0 and ldquodonerdquoinfluences ldquokrdquo which in turn influences the results of the computation the final result ismultiplied by x701minusk

Multiplication by x701minusk is a conceptually simple matter of rotating by k positionswithin 701 positions and then reducing modulo x700 + x699 + middot middot middot+ x+ 1 since x701 minus 1 isa multiple of x700 + x699 + middot middot middot+ x+ 1 However since k is a variable the obvious methodof rotating by k positions does not take constant time [39] instead performs conditionalrotations by 1 position 2 positions 4 positions etc where the condition bits are the bitsof k

The inputs and outputs are not reversed in [39] This affects the power of x multipliedinto the final results Our approach skips the powers of x entirely

Each polynomial in [39] is represented as six 256-bit vectors with the five-section patternskipping manipulations of some vectors as explained above The order of coefficients in eachvector is simply x0 x1 multiplication by x in [39] uses more arithmetic instructionsthan our approach does

At a lower level [39] uses unsigned representatives 0 1 2 of Z3 and uses a manuallydesigned sequence of 19 bit operations for a multiply-add operation in Z3 Langley [43]suggested a sequence of 16 bit operations or 14 if an ldquoand-notrdquo operation is available Asin [20] we use just 9 bit operations

The inversion software from [39] is 798 lines (32273 bytes) of Python generatingassembly producing 26634 bytes of object code Our inversion software is 541 lines (14675bytes) of C with intrinsics compiled with gcc 821 to generate assembly producing 10545bytes of object code

73 More NTRU examples More than 13 of the key-generation time in [39] isconsumed by a second inversion which replaces the modulus 3 with 213 This is handledin [39] with an initial inversion modulo 2 followed by 8 multiplications modulo 213 for aNewton iteration12 The initial inversion modulo 2 is handled in [39] by exponentiation

12This use of Newton iteration follows the NTRU key-generation strategy suggested in [61] but we pointout a difference in the details All 8 multiplications in [39] are carried out modulo 213 while [61] insteadfollows the traditional precision doubling in Newton iteration compute an inverse modulo 22 then 24then 28 (or 27) then 213 Perhaps limiting the precision of the first 6 multiplications would save timein [39]

Daniel J Bernstein and Bo-Yin Yang 21

taking just 10332 cycles the point here is that squaring modulo 2 is particularly efficientMuch more inversion time is spent in key generation in sntrup4591761 a slightly larger

cryptosystem from the NTRU Prime [14] family13 NTRU Prime uses a prime modulussuch as 4591 for sntrup4591761 to avoid concerns that a power-of-2 modulus could allowattacks (see generally [14]) but this modulus also makes arithmetic slower Before ourwork the key-generation software for sntrup4591761 took 6 million cycles partly forinversion in (Z3)[x](x761 minus xminus 1) but mostly for inversion in (Z4591)[x](x761 minus xminus 1)We reimplemented both of these inversion steps reducing the total key-generation timeto just 940852 cycles14 Almost all of our inversion code modulo 3 is shared betweensntrup4591761 and ntruhrss701

74 Does lattice-based cryptography need fast inversion Lyubashevsky Peikertand Regev [46] published an NTRU alternative that avoids inversions15 However thiscryptosystem has slower encryption than NTRU has slower decryption than NTRU andappears to be covered by a 2010 patent [35] by Gaborit and Aguilar-Melchor It is easy toenvision applications that will (1) select NTRU and (2) benefit from faster inversions inNTRU key generation16

Lyubashevsky and Seiler have very recently announced ldquoNTTRUrdquo [47] a variant ofNTRU with a ring chosen to allow very fast NTT-based (FFT-based) multiplication anddivision This variant triggers many of the security concerns described in [14] but if thevariant survives security analysis then it will eliminate concerns about key-generationspeed in NTRU

8 Definition of 2-adic division steps

Define divstep Ztimes Zlowast2 times Z2 rarr Ztimes Zlowast2 times Z2 as follows

divstep(δ f g) =

(1minus δ g (g minus f)2) if δ gt 0 and g is odd(1 + δ f (g + (g mod 2)f)2) otherwise

Note that g minus f is even in the first case (since both f and g are odd) and g + (g mod 2)fis even in the second case (since f is odd)

81 Transition matrices Write (δ1 f1 g1) = divstep(δ f g) Then(f1g1

)= T (δ f g)

(fg

)and

(1δ1

)= S(δ f g)

(1δ

)13NTRU Prime is another round-2 NIST submission One other NTRU-based encryption system

NTRUEncrypt [29] was submitted to the competition but has been merged into NTRU-HRSS for round2 see [59] For completeness we mention that NTRUEncrypt key generation took 980760 cycles forntrukem443 and 2639756 cycles for ntrukem743 As far as we know the NTRUEncrypt software was notdesigned to run in constant time

14This is a median cycle count from the SUPERCOP benchmarking framework There is considerablevariation in the cycle counts more than 3 between quartiles presumably depending on the mappingfrom virtual addresses to physical addresses There is no dependence on the secret input being inverted

15The LPR cryptosystem is the basis for several round-2 NIST submissions Kyber LAC NewHopeRound5 Saber ThreeBears and the ldquoNTRU LPRimerdquo option in NTRU Prime

16Google for example selected NTRU-HRSS for CECPQ2 as noted above with the following explanationldquoSchemes with a quotient-style key (like HRSS) will probably have faster encapdecap operations at thecost of much slower key-generation Since there will be many uses outside TLS where keys can be reusedthis is interesting as long as the key-generation speed is still reasonable for TLSrdquo See [42]

22 Fast constant-time gcd computation and modular inversion

where T Ztimes Zlowast2 times Z2 rarrM2(Z[12]) is defined by

T (δ f g) =

(0 1minus12

12

)if δ gt 0 and g is odd

(1 0

g mod 22

12

)otherwise

and S Ztimes Zlowast2 times Z2 rarrM2(Z) is defined by

S(δ f g) =

(1 01 minus1

)if δ gt 0 and g is odd

(1 01 1

)otherwise

Note that both T (δ f g) and S(δ f g) are defined entirely by δ the bottom bit of f(which is always 1) and the bottom bit of g they do not depend on the remaining bits off and g82 Decomposition This 2-adic divstep like the power-series divstep is a compositionof two simpler functions Specifically it is a conditional swap that replaces (δ f g) with(minusδ gminusf) if δ gt 0 and g is odd followed by an elimination that replaces (δ f g) with(1 + δ f (g + (g mod 2)f)2)83 Sizes Write (δ1 f1 g1) = divstep(δ f g) If f and g are integers that fit into n+ 1bits tworsquos-complement in the sense that minus2n le f lt 2n and minus2n le g lt 2n then alsominus2n le f1 lt 2n and minus2n le g1 lt 2n Similarly if minus2n lt f lt 2n and minus2n lt g lt 2n thenminus2n lt f1 lt 2n and minus2n lt g1 lt 2n Beware however that intermediate results such asg minus f need an extra bit

As in the polynomial case an important feature of divstep is that iterating divstepon integer inputs eventually sends g to 0 Concretely we will show that (D + o(1))niterations are sufficient where D = 2(10minus log2 633) = 2882100569 see Theorem 112for exact bounds The proof technique cannot do better than (d+ o(1))n iterations whered = 14(15minus log2(561 +

radic249185)) = 2828339631 Numerical evidence is consistent

with the possibility that (d + o(1))n iterations are required in the worst case which iswhat matters in the context of constant-time algorithms84 A non-functioning variant Many variations are possible in the definition ofdivstep However some care is required as the following variant illustrates

Define posdivstep Ztimes Zlowast2 times Z2 rarr Ztimes Zlowast2 times Z2 as follows

posdivstep(δ f g) =

(1minus δ g (g + f)2) if δ gt 0 and g is odd(1 + δ f (g + (g mod 2)f)2) otherwise

This is just like divstep except that it eliminates the negation of f in the swapped caseIt is not true that iterating posdivstep on integer inputs eventually sends g to 0 For

example posdivstep2(1 1 1) = (1 1 1) For comparison in the polynomial case we werefree to negate f (or g) at any moment85 A centered variant As another variant of divstep we define cdivstep Ztimes Zlowast2 timesZ2 rarr Ztimes Zlowast2 times Z2 as follows

cdivstep(δ f g) =

(1minus δ g (f minus g)2) if δ gt 0 and g is odd(1 + δ f (g + (g mod 2)f)2) if δ = 0(1 + δ f (g minus (g mod 2)f)2) otherwise

Daniel J Bernstein and Bo-Yin Yang 23

Iterating cdivstep produces the same intermediate results as a variable-time gcd algorithmintroduced by Stehleacute and Zimmermann in [62] The analogous gcd algorithm for divstep isa new variant of the StehleacutendashZimmermann algorithm and we analyze the performance ofthis variant to prove bounds on the performance of divstep see Appendices E F and GAs an analogy one of our proofs in the polynomial case relies on relating x-adic divstep tothe EuclidndashStevin algorithm

The extra case distinctions make each cdivstep iteration more complicated than eachdivstep iteration Similarly the StehleacutendashZimmermann algorithm is more complicated thanthe new variant To understand the motivation for cdivstep compare divstep3(minus2 f g)to cdivstep3(minus2 f g) One has divstep3(minus2 f g) = (1 f g3) where

8g3 isin g g + f g + 2f g + 3f g + 4f g + 5f g + 6f g + 7f

One has cdivstep3(minus2 f g) = (1 f gprime3) where

8gprime3 isin g minus 3f g minus 2f g minus f g g + f g + 2f g + 3f g + 4f

It seems intuitively clear that the ldquocenteredrdquo output gprime3 is smaller than the ldquouncenteredrdquooutput g3 that more generally cdivstep has smaller outputs than divstep and thatcdivstep thus uses fewer iterations than divstep

However our analyses do not support this intuition All available evidence indicatesthat surprisingly the worst case of cdivstep uses almost 10 more iterations than theworst case of divstep Concretely our proof technique cannot do better than (c+ o(1))niterations for cdivstep where c = 2(log2(

radic17minus 1)minus 1) = 3110510062 and numerical

evidence is consistent with the possibility that (c+ o(1))n iterations are required

86 A plus-or-minus variant As yet another example the following variant of divstepis essentially17 the BrentndashKung ldquoAlgorithm PMrdquo from [24]

pmdivstep(δ f g) =

(minusδ g (f minus g)2) if δ ge 0 and (g minus f) mod 4 = 0(minusδ g (f + g)2) if δ ge 0 and (g + f) mod 4 = 0(δ f (g minus f)2) if δ lt 0 and (g minus f) mod 4 = 0(δ f (g + f)2) if δ lt 0 and (g + f) mod 4 = 0(1 + δ f g2) otherwise

There are even more cases here than in cdivstep although splitting out an initial conditionalswap reduces the number of cases to 3 Another complication here is that the linearcombinations used in pmdivstep depend on the bottom two bits of f and g rather thanjust the bottom bit

To understand the motivation for pmdivstep note that after each addition or subtractionthere are always at least two halvings as in the ldquosliding windowrdquo approach to exponentiationAgain it seems intuitively clear that this reduces the number of iterations required Considerhowever the following example (which is from [24 Theorem 3] modulo minor details)pmdivstep needs 3bminus 1 iterations to reach g = 0 starting from (1 1 3 middot 2bminus2) Specificallythe first bminus 2 iterations produce

(2 1 3 middot 2bminus3) (3 1 3 middot 2bminus4) (bminus 2 1 3 middot 2) (bminus 1 1 3)

and then the next 2b+ 1 iterations produce

(1minus b 3 2) (2minus b 3 1) (2minus b 3 2) (3minus b 3 1) (3minus b 3 2) (minus1 3 1) (minus1 3 2) (0 3 1) (0 1 2) (1 1 1) (minus1 1 0)

17The BrentndashKung algorithm has an extra loop to allow the case that both inputs f g are even Compare[24 Figure 4] to the definition of pmdivstep

24 Fast constant-time gcd computation and modular inversion

Brent and Kung prove that dcne systolic cells are enough for their algorithm wherec = 3110510062 is the constant mentioned above and they conjecture that (3 + o(1))ncells are enough so presumably (3 + o(1))n iterations of pmdivstep are enough Ouranalysis shows that fewer iterations are sufficient for divstep even though divstep issimpler than pmdivstep

9 Iterates of 2-adic division stepsThe following results are analogous to the x-adic results in Section 4 We state theseresults separately to support verification

Starting from (δ f g) isin ZtimesZlowast2timesZ2 define (δn fn gn) = divstepn(δ f g) isin ZtimesZlowast2timesZ2Tn = T (δn fn gn) isinM2(Z[12]) and Sn = S(δn fn gn) isinM2(Z)

Theorem 91(fngn

)= Tnminus1 middot middot middot Tm

(fmgm

)and

(1δn

)= Snminus1 middot middot middot Sm

(1δm

)if n ge m ge 0

Proof This follows from(fm+1gm+1

)= Tm

(fmgm

)and

(1

δm+1

)= Sm

(1δm

)

Theorem 92 Tnminus1 middot middot middot Tm isin( 1

2nminusmminus1Z 12nminusmminus1Z

12nminusmZ 1

2nminusmZ

)if n gt m ge 0

Proof As in Theorem 43

Theorem 93 Snminus1 middot middot middot Sm isin(

1 02minus (nminusm) nminusmminus 2 nminusm 1minus1

)if n gt

m ge 0

Proof As in Theorem 44

Theorem 94 Let m t be nonnegative integers Assume that δprimem = δm f primem equiv fm(mod 2t) and gprimem equiv gm (mod 2t) Then for each integer n with m lt n le m+ t T primenminus1 =Tnminus1 S primenminus1 = Snminus1 δprimen = δn f primen equiv fn (mod 2tminus(nminusmminus1)) and gprimen equiv gn (mod 2tminus(nminusm))

Proof Fix m t and induct on n Observe that (δprimenminus1 fprimenminus1 mod 2 gprimenminus1 mod 2) equals

(δnminus1 fnminus1 mod 2 gnminus1 mod 2)

bull If n = m + 1 By assumption f primem equiv fm (mod 2t) and n le m + t so t ge 1 so inparticular f primem mod 2 = fm mod 2 Similarly gprimem mod 2 = gm mod 2 By assumptionδprimem = δm

bull If n gt m + 1 f primenminus1 equiv fnminus1 (mod 2tminus(nminusmminus2)) by the inductive hypothesis andn le m+ t so tminus(nminusmminus2) ge 2 so in particular f primenminus1 mod 2 = fnminus1 mod 2 similarlygprimenminus1 equiv gnminus1 (mod 2tminus(nminusmminus1)) by the inductive hypothesis and tminus (nminusmminus1) ge 1so in particular gprimenminus1 mod 2 = gnminus1 mod 2 and δprimenminus1 = δnminus1 by the inductivehypothesis

Now use the fact that T and S inspect only the bottom bit of each number to see that

T primenminus1 = T (δprimenminus1 fprimenminus1 g

primenminus1) = T (δnminus1 fnminus1 gnminus1) = Tnminus1

S primenminus1 = S(δprimenminus1 fprimenminus1 g

primenminus1) = S(δnminus1 fnminus1 gnminus1) = Snminus1

as claimedMultiply

(1

δprimenminus1

)=(

1δnminus1

)by S primenminus1 = Snminus1 to see that δprimen = δn as claimed

Daniel J Bernstein and Bo-Yin Yang 25

deftruncate(ft)ift==0return0twot=1ltlt(t-1)return((f+twot)amp(2twot-1))-twot

defdivsteps2(ntdeltafg)asserttgt=nandngt=0fg=truncate(ft)truncate(gt)uvqr=1001

whilengt0f=truncate(ft)ifdeltagt0andgamp1deltafguvqr=-deltag-fqr-u-vg0=gamp1deltagqr=1+delta(g+g0f)2(q+g0u)2(r+g0v)2nt=n-1t-1g=truncate(ZZ(g)t)

M2Q=MatrixSpace(QQ2)returndeltafgM2Q((uvqr))

Figure 101 Algorithm divsteps2 to compute (δn fn gn) = divstepn(δ f g) Inputsn t isin Z with 0 le n le t δ isin Z at least bottom t bits of f g isin Z2 Outputs δn bottom tbits of fn if n = 0 or tminus (nminus 1) bits if n ge 1 bottom tminus n bits of gn Tnminus1 middot middot middot T0

By the inductive hypothesis (T primem T primenminus2) = (Tm Tnminus2) and T primenminus1 = Tnminus1 so

T primenminus1 middot middot middot T primem = Tnminus1 middot middot middot Tm Abbreviate Tnminus1 middot middot middot Tm as P Then(f primengprimen

)= P

(f primemgprimem

)and(

fngn

)= P

(fmgm

)by Theorem 91 so

(f primen minus fngprimen minus gn

)= P

(f primem minus fmgprimem minus gm

)

By Theorem 92 P isin( 1

2nminusmminus1Z 12nminusmminus1Z

12nminusmZ 1

2nminusmZ

) By assumption f primem minus fm and gprimem minus gm

are multiples of 2t Multiply to see that f primen minus fn is a multiple of 2t2nminusmminus1 = 2tminus(nminusmminus1)

as claimed and gprimen minus gn is a multiple of 2t2nminusm = 2tminus(nminusm) as claimed

10 Fast computation of iterates of 2-adic division stepsWe describe just as in Section 5 two algorithms to compute (δn fn gn) = divstepn(δ f g)We are given the bottom t bits of f g and compute fn gn to tminusn+1 tminusn bits respectivelyif n ge 1 see Theorem 94 We also compute the product of the relevant T matrices Wespecify these algorithms in Figures 101ndash102 as functions divsteps2 and jumpdivsteps2in Sage

The first function divsteps2 simply takes divsteps one after another analogouslyto divstepsx The second function jumpdivsteps2 uses a straightforward divide-and-conquer strategy analogously to jumpdivstepsx More generally j in jumpdivsteps2can be taken anywhere between 1 and nminus 1 this generalization includes divsteps2 as aspecial case

As in Section 5 the Sage functions are not constant-time but it is easy to buildconstant-time versions of the same computations The comments on performance inSection 5 are applicable here with integer arithmetic replacing polynomial arithmeticThe same basic optimization techniques are also applicable here

26 Fast constant-time gcd computation and modular inversion

fromdivsteps2importdivsteps2truncate

defjumpdivsteps2(ntdeltafg)asserttgt=nandngt=0ifnlt=1returndivsteps2(ntdeltafg)

j=n2

deltaf1g1P1=jumpdivsteps2(jjdeltafg)fg=P1vector((fg))fg=truncate(ZZ(f)t-j)truncate(ZZ(g)t-j)

deltaf2g2P2=jumpdivsteps2(n-jn-jdeltafg)fg=P2vector((fg))fg=truncate(ZZ(f)t-n+1)truncate(ZZ(g)t-n)

returndeltafgP2P1

Figure 102 Algorithm jumpdivsteps2 to compute (δn fn gn) = divstepn(δ f g) Sameinputs and outputs as in Figure 101

11 Fast integer gcd computation and modular inversionThis section presents two algorithms If f is an odd integer and g is an integer then gcd2computes gcdf g If also gcdf g = 1 then recip2 computes the reciprocal of g modulof Figure 111 specifies both algorithms and Theorem 112 justifies both algorithms

The algorithms take the smallest nonnegative integer d such that |f | lt 2d and |g| lt 2dMore generally one can take d as an extra input the algorithms then take constant time ifd is constant Theorem 112 relies on f2 + 4g2 le 5 middot 22d (which is certainly true if |f | lt 2dand |g| lt 2d) but does not rely on d being minimal One can further generalize to alloweven f by first finding the number of shared powers of 2 in f and g and then reducing tothe odd case

The algorithms then choose a positive integer m as a particular function of d andcall divsteps2 (which can be transparently replaced with jumpdivsteps2) to compute(δm fm gm) = divstepm(1 f g) The choice of m guarantees gm = 0 and fm = plusmn gcdf gby Theorem 112

The gcd2 algorithm uses input precisionm+d to obtain d+1 bits of fm ie to obtain thesigned remainder of fm modulo 2d+1 This signed remainder is exactly fm = plusmn gcdf gsince the assumptions |f | lt 2d and |g| lt 2d imply |gcdf g| lt 2d

The recip2 algorithm uses input precision just m+ 1 to obtain the signed remainder offm modulo 4 This algorithm assumes gcdf g = 1 so fm = plusmn1 so the signed remainderis exactly fm The algorithm also extracts the top-right corner vm of the transition matrixand multiplies by fm obtaining a reciprocal of g modulo f by Theorem 112

Both gcd2 and recip2 are bottom-up algorithms and in particular recip2 naturallyobtains this reciprocal vmfm as a fraction with denominator 2mminus1 It multiplies by 2mminus1

to obtain an integer and then multiplies by ((f + 1)2)mminus1 modulo f to divide by 2mminus1

modulo f The reciprocal of 2mminus1 can be computed ahead of time Another way tocompute this reciprocal is through Hensel lifting to compute fminus1 (mod 2mminus1) for detailsand further techniques see Section 67

We emphasize that recip2 assumes that its inputs are coprime it does not notice ifthis assumption is violated One can check the assumption by computing fm to higherprecision (as in gcd2) or by multiplying the supposed reciprocal by g and seeing whether

Daniel J Bernstein and Bo-Yin Yang 27

fromdivsteps2importdivsteps2

defiterations(d)return(49d+80)17ifdlt46else(49d+57)17

defgcd2(fg)assertfamp1d=max(fnbits()gnbits())m=iterations(d)deltafmgmP=divsteps2(mm+d1fg)returnabs(fm)

defrecip2(fg)assertfamp1d=max(fnbits()gnbits())m=iterations(d)precomp=Integers(f)((f+1)2)^(m-1)deltafmgmP=divsteps2(mm+11fg)V=sign(fm)ZZ(P[0][1]2^(m-1))returnZZ(Vprecomp)

Figure 111 Algorithm gcd2 to compute gcdf g algorithm recip2 to compute thereciprocal of g modulo f when gcdf g = 1 Both algorithms assume that f is odd

the result is 1 modulo f For comparison the recip algorithm from Section 6 does noticeif its polynomial inputs are coprime using a quick test of δm

Theorem 112 Let f be an odd integer Let g be an integer Define (δn fn gn) =

divstepn(1 f g) Tn = T (δn fn gn) and(un vnqn rn

)= Tnminus1 middot middot middot T0 Let d be a real number

Assume that f2 + 4g2 le 5 middot 22d Let m be an integer Assume that m ge b(49d+ 80)17c ifd lt 46 and that m ge b(49d+ 57)17c if d ge 46 Then m ge 1 gm = 0 fm = plusmn gcdf g2mminus1vm isin Z and 2mminus1vmg equiv 2mminus1fm (mod f)

In particular if gcdf g = 1 then fm = plusmn1 and dividing 2mminus1vmfm by 2mminus1 modulof produces the reciprocal of g modulo f

Proof First f2 ge 1 so 1 le f2 + 4g2 le 5 middot 22d lt 22d+73 since 53 lt 27 Hence d gt minus76If d lt 46 then m ge b(49d+ 80)17c ge b(49(minus76) + 80)17c = 1 If d ge 46 thenm ge b(49d+ 57)17c ge b(49 middot 46 + 57)17c = 135 Either way m ge 1 as claimed

Define R0 = f R1 = 2g and G = gcdR0 R1 Then G = gcdf g since f is oddDefine b = log2

radicR2

0 +R21 = log2

radicf2 + 4g2 ge 0 Then 22b = f2 + 4g2 le 5 middot 22d so

b le d+ log4 5 We now split into three cases in each case defining n as in Theorem G6

bull If b le 21 Define n = b19b7c Then n le b49b17c le b49(d+ log4 5)17c leb(49d+ 57)17c le m since 549 le 457

bull If 21 lt b le 46 Define n = b(49b+ 23)17c If d ge 46 then n le b(49d+ 23)17c leb(49d+ 57)17c le m Otherwise d lt 46 so n le b(49(d+ log4 5) + 23)17c leb(49d+ 80)17c le m

bull If b gt 46 Define n = b49b17c Then n le b49b17c le b49(d+ log4 5)17c leb(49d+ 57)17c le m

28 Fast constant-time gcd computation and modular inversion

In all cases n le mNow divstepn(1 R0 R12) isin ZtimesZtimes0 by Theorem G6 so divstepm(1 R0 R12) isin

Ztimes Ztimes 0 so

(δm fm gm) = divstepm(1 f g) = divstepm(1 R0 R12) isin Ztimes GminusG times 0

by Theorem G3 Hence gm = 0 as claimed and fm = plusmnG = plusmn gcdf g as claimedBoth um and vm are in (12mminus1)Z by Theorem 92 In particular 2mminus1vm isin Z as

claimed Finally fm = umf + vmg by Theorem 91 so 2mminus1fm = 2mminus1umf + 2mminus1vmgand 2mminus1um isin Z so 2mminus1fm equiv 2mminus1vmg (mod f) as claimed

12 Software case study for integer modular inversionAs emphasized in Section 1 we have selected an inversion problem to be extremely favorableto Fermatrsquos method We use Intel platforms with 64-bit multipliers We focus on theproblem of inverting modulo the well-known Curve25519 prime p = 2255 minus 19 Therehas been a long line of Curve25519 implementation papers starting with [10] in 2006 wecompare to the latest speed records [54] (in assembly) for inversion modulo this prime

Despite all these Fermat advantages we achieve better speeds using our 2-adic divisionsteps As mentioned in Section 1 we take 10050 cycles 8778 cycles and 8543 cycles onHaswell Skylake and Kaby Lake where [54] used 11854 cycles 9301 cycles and 8971cycles This section looks more closely at how the new computation works

121 Details of the new computation Theorem 112 guarantees that 738 divstepiterations suffice since b(49 middot 255 + 57)17c = 738 To simplify the computation weactually use 744 = 12 middot 62 iterations

The decisions in the first 62 iterations are determined entirely by the bottom 62 bitsof the inputs We perform these iterations entirely in 64-bit words apply the resulting62-step transition matrix to the entire original inputs to jump to the result of 62 divstepiterations and then repeat

This is similar in two important ways to Lehmerrsquos gcd computation from [44] One ofLehmerrsquos ideas mentioned before is to carry out some steps in limited precision Anotherof Lehmerrsquos ideas is for this limit to be a small constant Our advantage over Lehmerrsquoscomputation is that we have a completely regular constant-time data flow

There are many other possibilities for which precision to use at each step All of thesepossibilities can be described using the jumping framework described in Section 53 whichapplies to both the x-adic case and the 2-adic case Figure 122 gives three examples ofjump strategies The first strategy computes each divstep in turn to full precision as inthe case study in Section 7 The second strategy is what we use in this case study Thethird strategy illustrates an asymptotically faster choice of jump distances The thirdstrategy might be more efficient than the second strategy in hardware but the secondstrategy seems optimal for 64-bit CPUs

In more detail we invert a 255-bit x modulo p as follows

1 Set f = p g = x δ = 1 i = 1

2 Set f0 = f (mod 264) g0 = g (mod 264)

3 Compute (δprime f1 g1) = divstep62(δ f0 g0) and obtain a scaled transition matrix Tist Ti

262

(f0g0

)=(f1g1

) The 63-bit signed entries of Ti fit into 64-bit registers We

call this step jump64divsteps2

4 Compute (f g)larr Ti(f g)262 Set δ = δprime

Daniel J Bernstein and Bo-Yin Yang 29

Figure 122 Three jump strategies for 744 divstep iterations on 255-bit integers Leftstrategy j = 1 ie computing each divstep in turn to full precision Middle strategyj = 1 for le62 requested iterations else j = 62 ie using 62 bottom bits to compute 62iterations jumping 62 iterations using 62 new bottom bits to compute 62 more iterationsetc Right strategy j = 1 for 2 requested iterations j = 2 for 3 or 4 requested iterationsj = 4 for 5 or 6 or 7 or 8 requested iterations etc Vertical axis 0 on top through 744 onbottom number of iterations Horizontal axis (within each strategy) 0 on left through254 on right bit positions used in computation

5 Set ilarr i+ 1 Go back to step 3 if i le 12

6 Compute v mod p where v is the top-right corner of T12T11 middot middot middot T1

(a) Compute pair-products T2iT2iminus1 with entries being 126-bit signed integers whichfits into two 64-bit limbs (radix 264)

(b) Compute quad-products T4iT4iminus1T4iminus2T4iminus3 with entries being 252-bit signedintegers (four 64-bit limbs radix 264)

(c) At this point the three quad-products are converted into unsigned integersmodulo p divided into 4 64-bit limbs

(d) Compute final vector-matrix-vector multiplications modulo p

7 Compute xminus1 = sgn(f) middot v middot 2minus744 (mod p) where 2minus744 is precomputed

123 Other CPUs Our advantage is larger on platforms with smaller multipliers Forexample we have written software to invert modulo 2255 minus 19 on an ARM Cortex-A7obtaining a median cycle count of 35277 cycles The best previous Cortex-A7 result wehave found in the literature is 62648 cycles reported by Fujii and Aranha [34] usingFermatrsquos method Internally our software does repeated calls to jump32divsteps2 units

30 Fast constant-time gcd computation and modular inversion

of 30 iterations with the resulting transition matrix fitting inside 32-bit general-purposeregisters

124 Other primes Nath and Sarkar present inversion speeds for 20 different specialprimes 12 of which were considered in previous work For concreteness we considerinversion modulo the double-size prime 2511 minus 187 AranhandashBarretondashPereirandashRicardini [8]selected this prime for their ldquoM-511rdquo curve

Nath and Sarkar report that their Fermat inversion software for 2511 minus 187 takes 72804Haswell cycles 47062 Skylake cycles or 45014 Kaby Lake cycles The slowdown from2255 minus 19 more than 5times in each case is easy to explain the exponent is twice as longproducing almost twice as many multiplications each multiplication handles twice as manybits and multiplication cost per bit is not constant

Our 511-bit software is solidly faster under 30000 cycles on all of these platforms Mostof the cycles in our computation are in evaluating jump64divsteps2 and the number ofcalls to jump64divsteps2 scales linearly with the number of bits in the input There issome overhead for multiplications so a modulus of twice the size takes more than twice aslong but our scalability to large moduli is much better than the Fermat scalability

Our advantage is also larger in applications that use ldquorandomrdquo primes rather thanspecial primes we pay the extra cost for Montgomery multiplication only at the end ofour computation whereas Fermat inversion pays the extra cost in each step

References[1] mdash (no editor) Actes du congregraves international des matheacutematiciens tome 3 Gauthier-

Villars Eacutediteur Paris 1971 See [41]

[2] mdash (no editor) CWIT 2011 12th Canadian workshop on information theoryKelowna British Columbia May 17ndash20 2011 Institute of Electrical and ElectronicsEngineers 2011 ISBN 978-1-4577-0743-8 See [51]

[3] Carlisle Adams Jan Camenisch (editors) Selected areas in cryptographymdashSAC 201724th international conference Ottawa ON Canada August 16ndash18 2017 revisedselected papers Lecture Notes in Computer Science 10719 Springer 2018 ISBN978-3-319-72564-2 See [15]

[4] Carlos Aguilar Melchor Nicolas Aragon Slim Bettaieb Loiumlc Bidoux OlivierBlazy Jean-Christophe Deneuville Philippe Gaborit Edoardo Persichetti GillesZeacutemor Hamming Quasi-Cyclic (HQC) (2017) URL httpspqc-hqcorgdochqc-specification_2017-11-30pdf Citations in this document sect1

[5] Alfred V Aho (chairman) Proceedings of fifth annual ACM symposium on theory ofcomputing Austin Texas April 30ndashMay 2 1973 ACM New York 1973 See [53]

[6] Martin Albrecht Carlos Cid Kenneth G Paterson CJ Tjhai Martin TomlinsonNTS-KEM ldquoThe main document submitted to NISTrdquo (2017) URL httpsnts-kemio Citations in this document sect1

[7] Alejandro Cabrera Aldaya Cesar Pereida Garciacutea Luis Manuel Alvarez Tapia BillyBob Brumley Cache-timing attacks on RSA key generation (2018) URL httpseprintiacrorg2018367 Citations in this document sect1

[8] Diego F Aranha Paulo S L M Barreto Geovandro C C F Pereira JeffersonE Ricardini A note on high-security general-purpose elliptic curves (2013) URLhttpseprintiacrorg2013647 Citations in this document sect124

Daniel J Bernstein and Bo-Yin Yang 31

[9] Elwyn R Berlekamp Algebraic coding theory McGraw-Hill 1968 Citations in thisdocument sect1

[10] Daniel J Bernstein Curve25519 new Diffie-Hellman speed records in PKC 2006 [70](2006) 207ndash228 URL httpscryptopapershtmlcurve25519 Citations inthis document sect1 sect12

[11] Daniel J Bernstein Fast multiplication and its applications in [27] (2008) 325ndash384 URL httpscryptopapershtmlmultapps Citations in this documentsect55

[12] Daniel J Bernstein Tung Chou Tanja Lange Ingo von Maurich Rafael MisoczkiRuben Niederhagen Edoardo Persichetti Christiane Peters Peter Schwabe NicolasSendrier Jakub Szefer Wen Wang Classic McEliece conservative code-based cryp-tography ldquoSupporting Documentationrdquo (2017) URL httpsclassicmcelieceorgnisthtml Citations in this document sect1

[13] Daniel J Bernstein Tung Chou Peter Schwabe McBits fast constant-time code-based cryptography in CHES 2013 [18] (2013) 250ndash272 URL httpsbinarycryptomcbitshtml Citations in this document sect1 sect1

[14] Daniel J Bernstein Chitchanok Chuengsatiansup Tanja Lange Christine vanVredendaal NTRU Prime reducing attack surface at low cost full version of[15] (2017) URL httpsntruprimecryptopapershtml Citations in thisdocument sect1 sect1 sect11 sect73 sect73 sect74

[15] Daniel J Bernstein Chitchanok Chuengsatiansup Tanja Lange Christine vanVredendaal NTRU Prime reducing attack surface at low cost in SAC 2017 [3]abbreviated version of [14] (2018) 235ndash260

[16] Daniel J Bernstein Niels Duif Tanja Lange Peter Schwabe Bo-Yin Yang High-speed high-security signatures Journal of Cryptographic Engineering 2 (2012)77ndash89 see also older version at CHES 2011 URL httpsed25519cryptoed25519-20110926pdf Citations in this document sect1

[17] Daniel J Bernstein Peter Schwabe NEON crypto in CHES 2012 [56] (2012) 320ndash339URL httpscryptopapershtmlneoncrypto Citations in this documentsect1 sect1

[18] Guido Bertoni Jean-Seacutebastien Coron (editors) Cryptographic hardware and embeddedsystemsmdashCHES 2013mdash15th international workshop Santa Barbara CA USAAugust 20ndash23 2013 proceedings Lecture Notes in Computer Science 8086 Springer2013 ISBN 978-3-642-40348-4 See [13]

[19] Adam W Bojanczyk Richard P Brent A systolic algorithm for extended GCDcomputation Computers amp Mathematics with Applications 14 (1987) 233ndash238ISSN 0898-1221 URL httpsmaths-peopleanueduau~brentpdrpb096ipdf Citations in this document sect13 sect13

[20] Tomas J Boothby and Robert W Bradshaw Bitslicing and the method of fourRussians over larger finite fields (2009) URL httparxivorgabs09011413Citations in this document sect71 sect72

[21] Joppe W Bos Constant time modular inversion Journal of Cryptographic Engi-neering 4 (2014) 275ndash281 URL httpjoppeboscomfilesCTInversionpdfCitations in this document sect1 sect1 sect1 sect1 sect1 sect13

32 Fast constant-time gcd computation and modular inversion

[22] Richard P Brent Fred G Gustavson David Y Y Yun Fast solution of Toeplitzsystems of equations and computation of Padeacute approximants Journal of Algorithms1 (1980) 259ndash295 ISSN 0196-6774 URL httpsmaths-peopleanueduau~brentpdrpb059ipdf Citations in this document sect14 sect14

[23] Richard P Brent Hsiang-Tsung Kung Systolic VLSI arrays for polynomial GCDcomputation IEEE Transactions on Computers C-33 (1984) 731ndash736 URLhttpsmaths-peopleanueduau~brentpdrpb073ipdf Citations in thisdocument sect13 sect13 sect32

[24] Richard P Brent Hsiang-Tsung Kung A systolic VLSI array for integer GCDcomputation ARITH-7 (1985) 118ndash125 URL httpsmaths-peopleanueduau~brentpdrpb077ipdf Citations in this document sect13 sect13 sect13 sect13 sect14sect86 sect86 sect86

[25] Richard P Brent Paul Zimmermann An O(M(n) logn) algorithm for the Jacobisymbol in ANTS 2010 [37] (2010) 83ndash95 URL httpsarxivorgpdf10042091pdf Citations in this document sect14

[26] Duncan A Buell (editor) Algorithmic number theory 6th international symposiumANTS-VI Burlington VT USA June 13ndash18 2004 proceedings Lecture Notes inComputer Science 3076 Springer 2004 ISBN 3-540-22156-5 See [62]

[27] Joe P Buhler Peter Stevenhagen (editors) Surveys in algorithmic number theoryMathematical Sciences Research Institute Publications 44 Cambridge UniversityPress New York 2008 See [11]

[28] Wouter Castryck Tanja Lange Chloe Martindale Lorenz Panny Joost RenesCSIDH an efficient post-quantum commutative group action in Asiacrypt 2018[55] (2018) 395ndash427 URL httpscsidhisogenyorgcsidh-20181118pdfCitations in this document sect1

[29] Cong Chen Jeffrey Hoffstein William Whyte Zhenfei Zhang NIST PQ SubmissionNTRUEncrypt A lattice based encryption algorithm (2017) URL httpscsrcnistgovProjectsPost-Quantum-CryptographyRound-1-Submissions Cita-tions in this document sect73

[30] Tung Chou McBits revisited in CHES 2017 [33] (2017) 213ndash231 URL httpstungchougithubiomcbits Citations in this document sect1

[31] Benoicirct Daireaux Veacuteronique Maume-Deschamps Brigitte Valleacutee The Lyapunovtortoise and the dyadic hare in [49] (2005) 71ndash94 URL httpemisamsorgjournalsDMTCSpdfpapersdmAD0108pdf Citations in this document sectE5sectF2

[32] Jean Louis Dornstetter On the equivalence between Berlekamprsquos and Euclidrsquos algo-rithms IEEE Transactions on Information Theory 33 (1987) 428ndash431 Citations inthis document sect1

[33] Wieland Fischer Naofumi Homma (editors) Cryptographic hardware and embeddedsystemsmdashCHES 2017mdash19th international conference Taipei Taiwan September25ndash28 2017 proceedings Lecture Notes in Computer Science 10529 Springer 2017ISBN 978-3-319-66786-7 See [30] [39]

[34] Hayato Fujii Diego F Aranha Curve25519 for the Cortex-M4 and beyond LATIN-CRYPT 2017 to appear (2017) URL httpswwwlascaicunicampbrmediapublicationspaper39pdf Citations in this document sect11 sect123

Daniel J Bernstein and Bo-Yin Yang 33

[35] Philippe Gaborit Carlos Aguilar Melchor Cryptographic method for communicatingconfidential information patent EP2537284B1 (2010) URL httpspatentsgooglecompatentEP2537284B1en Citations in this document sect74

[36] Mariya Georgieva Freacutedeacuteric de Portzamparc Toward secure implementation ofMcEliece decryption in COSADE 2015 [48] (2015) 141ndash156 URL httpseprintiacrorg2015271 Citations in this document sect1

[37] Guillaume Hanrot Franccedilois Morain Emmanuel Thomeacute (editors) Algorithmic numbertheory 9th international symposium ANTS-IX Nancy France July 19ndash23 2010proceedings Lecture Notes in Computer Science 6197 2010 ISBN 978-3-642-14517-9See [25]

[38] Agnes E Heydtmann Joslashrn M Jensen On the equivalence of the Berlekamp-Masseyand the Euclidean algorithms for decoding IEEE Transactions on Information Theory46 (2000) 2614ndash2624 Citations in this document sect1

[39] Andreas Huumllsing Joost Rijneveld John M Schanck Peter Schwabe High-speedkey encapsulation from NTRU in CHES 2017 [33] (2017) 232ndash252 URL httpseprintiacrorg2017667 Citations in this document sect1 sect11 sect11 sect13 sect7sect7 sect72 sect72 sect72 sect72 sect72 sect72 sect72 sect72 sect73 sect73 sect73 sect73 sect73

[40] Burton S Kaliski The Montgomery inverse and its applications IEEE Transactionson Computers 44 (1995) 1064ndash1065 Citations in this document sect1

[41] Donald E Knuth The analysis of algorithms in [1] (1971) 269ndash274 Citations inthis document sect1 sect14 sect14

[42] Adam Langley CECPQ2 (2018) URL httpswwwimperialvioletorg20181212cecpq2html Citations in this document sect74

[43] Adam Langley Add initial HRSS support (2018)URL httpsboringsslgooglesourcecomboringssl+7b935937b18215294e7dbf6404742855e3349092cryptohrsshrssc Citationsin this document sect72

[44] Derrick H Lehmer Euclidrsquos algorithm for large numbers American MathematicalMonthly 45 (1938) 227ndash233 ISSN 0002-9890 URL httplinksjstororgsicisici=0002-9890(193804)454lt227EAFLNgt20CO2-Y Citations in thisdocument sect1 sect14 sect121

[45] Xianhui Lu Yamin Liu Dingding Jia Haiyang Xue Jingnan He Zhenfei ZhangLAC lattice-based cryptosystems (2017) URL httpscsrcnistgovProjectsPost-Quantum-CryptographyRound-1-Submissions Citations in this documentsect1

[46] Vadim Lyubashevsky Chris Peikert Oded Regev On ideal lattices and learningwith errors over rings Journal of the ACM 60 (2013) Article 43 35 pages URLhttpseprintiacrorg2012230 Citations in this document sect74

[47] Vadim Lyubashevsky Gregor Seiler NTTRU truly fast NTRU using NTT (2019)URL httpseprintiacrorg2019040 Citations in this document sect74

[48] Stefan Mangard Axel Y Poschmann (editors) Constructive side-channel analysisand secure designmdash6th international workshop COSADE 2015 Berlin GermanyApril 13ndash14 2015 revised selected papers Lecture Notes in Computer Science 9064Springer 2015 ISBN 978-3-319-21475-7 See [36]

34 Fast constant-time gcd computation and modular inversion

[49] Conrado Martiacutenez (editor) 2005 international conference on analysis of algorithmspapers from the conference (AofArsquo05) held in Barcelona June 6ndash10 2005 TheAssociation Discrete Mathematics amp Theoretical Computer Science (DMTCS)Nancy 2005 URL httpwwwemisamsorgjournalsDMTCSproceedingsdmAD01indhtml See [31]

[50] James Massey Shift-register synthesis and BCH decoding IEEE Transactions onInformation Theory 15 (1969) 122ndash127 ISSN 0018-9448 Citations in this documentsect1

[51] Todd D Mateer On the equivalence of the Berlekamp-Massey and the euclideanalgorithms for algebraic decoding in CWIT 2011 [2] (2011) 139ndash142 Citations inthis document sect1

[52] Niels Moumlller On Schoumlnhagersquos algorithm and subquadratic integer GCD computationMathematics of Computation 77 (2008) 589ndash607 URL httpswwwlysatorliuse~nissearchiveS0025-5718-07-02017-0pdf Citations in this documentsect14

[53] Robert T Moenck Fast computation of GCDs in STOC 1973 [5] (1973) 142ndash151Citations in this document sect14 sect14 sect14 sect14

[54] Kaushik Nath Palash Sarkar Efficient inversion in (pseudo-)Mersenne primeorder fields (2018) URL httpseprintiacrorg2018985 Citations in thisdocument sect11 sect12 sect12

[55] Thomas Peyrin Steven D Galbraith Advances in cryptologymdashASIACRYPT 2018mdash24th international conference on the theory and application of cryptology and infor-mation security Brisbane QLD Australia December 2ndash6 2018 proceedings part ILecture Notes in Computer Science 11272 Springer 2018 ISBN 978-3-030-03325-5See [28]

[56] Emmanuel Prouff Patrick Schaumont (editors) Cryptographic hardware and embed-ded systemsmdashCHES 2012mdash14th international workshop Leuven Belgium September9ndash12 2012 proceedings Lecture Notes in Computer Science 7428 Springer 2012ISBN 978-3-642-33026-1 See [17]

[57] Martin Roetteler Michael Naehrig Krysta M Svore Kristin E Lauter Quantumresource estimates for computing elliptic curve discrete logarithms in ASIACRYPT2017 [67] (2017) 241ndash270 URL httpseprintiacrorg2017598 Citationsin this document sect11 sect13

[58] The Sage Developers (editor) SageMath the Sage Mathematics Software System(Version 80) 2017 URL httpswwwsagemathorg Citations in this documentsect12 sect12 sect22 sect5

[59] John Schanck Announcement of NTRU-HRSS-KEM and NTRUEncrypt merger(2018) URL httpsgroupsgooglecomalistnistgovdmsgpqc-forumSrFO_vK3xbImSmjY0HZCgAJ Citations in this document sect73

[60] Arnold Schoumlnhage Schnelle Berechnung von Kettenbruchentwicklungen Acta Infor-matica 1 (1971) 139ndash144 ISSN 0001-5903 Citations in this document sect1 sect14sect14

[61] Joseph H Silverman Almost inverses and fast NTRU key creation NTRU Cryptosys-tems (Technical Note014) (1999) URL httpsassetssecurityinnovationcomstaticdownloadsNTRUresourcesNTRUTech014pdf Citations in this doc-ument sect73 sect73

Daniel J Bernstein and Bo-Yin Yang 35

[62] Damien Stehleacute Paul Zimmermann A binary recursive gcd algorithm in ANTS 2004[26] (2004) 411ndash425 URL httpspersoens-lyonfrdamienstehleBINARYhtml Citations in this document sect14 sect14 sect85 sectE sectE6 sectE6 sectE6 sectF2 sectF2sectF2 sectH sectH

[63] Josef Stein Computational problems associated with Racah algebra Journal ofComputational Physics 1 (1967) 397ndash405 Citations in this document sect1

[64] Simon Stevin Lrsquoarithmeacutetique Imprimerie de Christophle Plantin 1585 URLhttpwwwdwcknawnlpubbronnenSimon_Stevin-[II_B]_The_Principal_Works_of_Simon_Stevin_Mathematicspdf Citations in this document sect1

[65] Volker Strassen The computational complexity of continued fractions in SYM-SAC1981 [69] (1981) 51ndash67 see also newer version [66]

[66] Volker Strassen The computational complexity of continued fractions SIAM Journalon Computing 12 (1983) 1ndash27 see also older version [65] ISSN 0097-5397 Citationsin this document sect14 sect14 sect53 sect55

[67] Tsuyoshi Takagi Thomas Peyrin (editors) Advances in cryptologymdashASIACRYPT2017mdash23rd international conference on the theory and applications of cryptology andinformation security Hong Kong China December 3ndash7 2017 proceedings part IILecture Notes in Computer Science 10625 Springer 2017 ISBN 978-3-319-70696-2See [57]

[68] Emmanuel Thomeacute Subquadratic computation of vector generating polynomials andimprovement of the block Wiedemann algorithm Journal of Symbolic Computation33 (2002) 757ndash775 URL httpshalinriafrinria-00103417document Ci-tations in this document sect14

[69] Paul S Wang (editor) SYM-SAC rsquo81 proceedings of the 1981 ACM Symposium onSymbolic and Algebraic Computation Snowbird Utah August 5ndash7 1981 Associationfor Computing Machinery New York 1981 ISBN 0-89791-047-8 See [65]

[70] Moti Yung Yevgeniy Dodis Aggelos Kiayias Tal Malkin (editors) Public keycryptographymdash9th international conference on theory and practice in public-keycryptography New York NY USA April 24ndash26 2006 proceedings Lecture Notesin Computer Science 3958 Springer 2006 ISBN 978-3-540-33851-2 See [10]

A Proof of main gcd theorem for polynomialsWe give two separate proofs of the facts stated earlier relating gcd and modular reciprocalto a particular iterate of divstep in the polynomial case

bull The second proof consists of a review of the EuclidndashStevin gcd algorithm (Ap-pendix B) a description of exactly how the intermediate results in the EuclidndashStevinalgorithm relate to iterates of divstep (Appendix C) and finally a translationof gcdreciprocal results from the EuclidndashStevin context to the divstep context(Appendix D)

bull The first proof given in this appendix is shorter it works directly with divstepwithout a detour through the EuclidndashStevin labeling of intermediate results

36 Fast constant-time gcd computation and modular inversion

Theorem A1 Let k be a field Let d be a positive integer Let R0 R1 be elements of thepolynomial ring k[x] with degR0 = d gt degR1 Define f = xdR0(1x) g = xdminus1R1(1x)and (δn fn gn) = divstepn(1 f g) for n ge 0 Then

2 deg fn le 2dminus 1minus n+ δn for n ge 02 deg gn le 2dminus 1minus nminus δn for n ge 0

Proof Induct on n If n = 0 then (δn fn gn) = (1 f g) so 2 deg fn = 2 deg f le 2d =2dminus 1minus n+ δn and 2 deg gn = 2 deg g le 2(dminus 1) = 2dminus 1minus nminus δn

Assume from now on that n ge 1 By the inductive hypothesis 2 deg fnminus1 le 2dminusn+δnminus1and 2 deg gnminus1 le 2dminus nminus δnminus1

Case 1 δnminus1 le 0 Then

(δn fn gn) = (1 + δnminus1 fnminus1 (fnminus1(0)gnminus1 minus gnminus1(0)fnminus1)x)

Both fnminus1 and gnminus1 have degree at most (2d minus n minus δnminus1)2 so the polynomial xgn =fnminus1(0)gnminus1 minus gnminus1(0)fnminus1 also has degree at most (2d minus n minus δnminus1)2 so 2 deg gn le2dminus nminus δnminus1 minus 2 = 2dminus 1minus nminus δn Also 2 deg fn = 2 deg fnminus1 le 2dminus 1minus n+ δn

Case 2 δnminus1 gt 0 and gnminus1(0) = 0 Then

(δn fn gn) = (1 + δnminus1 fnminus1 fnminus1(0)gnminus1x)

so 2 deg gn = 2 deg gnminus1 minus 2 le 2dminus nminus δnminus1 minus 2 = 2dminus 1minus nminus δn As before 2 deg fn =2 deg fnminus1 le 2dminus 1minus n+ δn

Case 3 δnminus1 gt 0 and gnminus1(0) 6= 0 Then

(δn fn gn) = (1minus δnminus1 gnminus1 (gnminus1(0)fnminus1 minus fnminus1(0)gnminus1)x)

This time the polynomial xgn = gnminus1(0)fnminus1 minus fnminus1(0)gnminus1 also has degree at most(2d minus n + δnminus1)2 so 2 deg gn le 2d minus n + δnminus1 minus 2 = 2d minus 1 minus n minus δn Also 2 deg fn =2 deg gnminus1 le 2dminus nminus δnminus1 = 2dminus 1minus n+ δn

Theorem A2 In the situation of Theorem A1 define Tn = T (δn fn gn) and(un vnqn rn

)= Tnminus1 middot middot middot T0

Then xn(unrnminusvnqn) isin klowast for n ge 0 xnminus1un isin k[x] for n ge 1 xnminus1vn xnqn x

nrn isin k[x]for n ge 0 and

2 deg(xnminus1un) le nminus 3 + δn for n ge 12 deg(xnminus1vn) le nminus 1 + δn for n ge 0

2 deg(xnqn) le nminus 1minus δn for n ge 02 deg(xnrn) le n+ 1minus δn for n ge 0

Proof Each matrix Tn has determinant in (1x)klowast by construction so the product of nsuch matrices has determinant in (1xn)klowast In particular unrn minus vnqn isin (1xn)klowast

Induct on n For n = 0 we have δn = 1 un = 1 vn = 0 qn = 0 rn = 1 soxnminus1vn = 0 isin k[x] xnqn = 0 isin k[x] xnrn = 1 isin k[x] 2 deg(xnminus1vn) = minusinfin le nminus 1 + δn2 deg(xnqn) = minusinfin le nminus 1minus δn and 2 deg(xnrn) = 0 le n+ 1minus δn

Assume from now on that n ge 1 By Theorem 43 un vn isin (1xnminus1)k[x] andqn rn isin (1xn)k[x] so all that remains is to prove the degree bounds

Case 1 δnminus1 le 0 Then δn = 1 + δnminus1 un = unminus1 vn = vnminus1 qn = (fnminus1(0)qnminus1 minusgnminus1(0)unminus1)x and rn = (fnminus1(0)rnminus1 minus gnminus1(0)vnminus1)x

Daniel J Bernstein and Bo-Yin Yang 37

Note that n gt 1 since δ0 gt 0 By the inductive hypothesis 2 deg(xnminus2unminus1) le nminus 4 +δnminus1 so 2 deg(xnminus1un) le nminus 2 + δnminus1 = nminus 3 + δn Also 2 deg(xnminus2vnminus1) le nminus 2 + δnminus1so 2 deg(xnminus1vn) le n+ δnminus1 = nminus 1 + δn

For qn note that both xnminus1qnminus1 and xnminus1unminus1 have degree at most (nminus2minusδnminus1)2 sothe same is true for their k-linear combination xnqn Hence 2 deg(xnqn) le nminus 2minus δnminus1 =nminus 1minus δn Similarly 2 deg(xnrn) le nminus δnminus1 = n+ 1minus δn

Case 2 δnminus1 gt 0 and gnminus1(0) = 0 Then δn = 1 + δnminus1 un = unminus1 vn = vnminus1qn = fnminus1(0)qnminus1x and rn = fnminus1(0)rnminus1x

If n = 1 then un = u0 = 1 so 2 deg(xnminus1un) = 0 = n minus 3 + δn Otherwise2 deg(xnminus2unminus1) le nminus4 + δnminus1 by the inductive hypothesis so 2 deg(xnminus1un) le nminus3 + δnas above

Also 2 deg(xnminus1vn) le nminus 1 + δn as above 2 deg(xnqn) = 2 deg(xnminus1qnminus1) le nminus 2minusδnminus1 = nminus 1minus δn and 2 deg(xnrn) = 2 deg(xnminus1rnminus1) le nminus δnminus1 = n+ 1minus δn

Case 3 δnminus1 gt 0 and gnminus1(0) 6= 0 Then δn = 1 minus δnminus1 un = qnminus1 vn = rnminus1qn = (gnminus1(0)unminus1 minus fnminus1(0)qnminus1)x and rn = (gnminus1(0)vnminus1 minus fnminus1(0)rnminus1)x

If n = 1 then un = q0 = 0 so 2 deg(xnminus1un) = minusinfin le n minus 3 + δn Otherwise2 deg(xnminus1qnminus1) le nminus 2minus δnminus1 by the inductive hypothesis so 2 deg(xnminus1un) le nminus 2minusδnminus1 = nminus 3 + δn Similarly 2 deg(xnminus1rnminus1) le nminus δnminus1 by the inductive hypothesis so2 deg(xnminus1vn) le nminus δnminus1 = nminus 1 + δn

Finally for qn note that both xnminus1unminus1 and xnminus1qnminus1 have degree at most (n minus2 + δnminus1)2 so the same is true for their k-linear combination xnqn so 2 deg(xnqn) lenminus 2 + δnminus1 = nminus 1minus δn Similarly 2 deg(xnrn) le n+ δnminus1 = n+ 1minus δn

First proof of Theorem 62 Write ϕ = f2dminus1 By Theorem A1 2 degϕ le δ2dminus1 and2 deg g2dminus1 le minusδ2dminus1 Each fn is nonzero by construction so degϕ ge 0 so δ2dminus1 ge 0

By Theorem A2 x2dminus2u2dminus1 is a polynomial of degree at most dminus 2 + δ2dminus12 DefineC0 = xminusd+δ2dminus12u2dminus1(1x) then C0 is a polynomial of degree at most dminus 2 + δ2dminus12Similarly define C1 = xminusd+1+δ2dminus12v2dminus1(1x) and observe that C1 is a polynomial ofdegree at most dminus 1 + δ2dminus12

Multiply C0 by R0 = xdf(1x) multiply C1 by R1 = xdminus1g(1x) and add to obtain

C0R0 + C1R1 = xδ2dminus12(u2dminus1(1x)f(1x) + v2dminus1(1x)g(1x))= xδ2dminus12ϕ(1x)

by Theorem 42Define Γ as the monic polynomial xδ2dminus12ϕ(1x)ϕ(0) of degree δ2dminus12 We have

just shown that Γ isin R0k[x] + R1k[x] specifically Γ = (C0ϕ(0))R0 + (C1ϕ(0))R1 Tofinish the proof we will show that R0 isin Γk[x] This forces Γ = gcdR0 R1 = G which inturn implies degG = δ2dminus12 and V = C1ϕ(0)

Observe that δ2d = 1 + δ2dminus1 and f2d = ϕ the ldquoswappedrdquo case is impossible here sinceδ2dminus1 gt 0 implies g2dminus1 = 0 By Theorem A1 again 2 deg g2d le minus1minus δ2d = minus2minus δ2dminus1so g2d = 0

By Theorem 42 ϕ = f2d = u2df + v2dg and 0 = g2d = q2df + r2dg so x2dr2dϕ = ∆fwhere ∆ = x2d(u2dr2dminus v2dq2d) By Theorem A2 ∆ isin klowast and p = x2dr2d is a polynomialof degree at most (2d+ 1minus δ2d)2 = dminus δ2dminus12 The polynomials xdminusδ2dminus12p(1x) andxδ2dminus12ϕ(1x) = Γ have product ∆xdf(1x) = ∆R0 so Γ divides R0 as claimed

B Review of the EuclidndashStevin algorithmThis appendix reviews the EuclidndashStevin algorithm to compute polynomial gcd includingthe extended version of the algorithm that writes each remainder as a linear combinationof the two inputs

38 Fast constant-time gcd computation and modular inversion

Theorem B1 (the EuclidndashStevin algorithm) Let k be a field Let R0 R1 be ele-ments of the polynomial ring k[x] Recursively define a finite sequence of polynomialsR2 R3 Rr isin k[x] as follows if i ge 1 and Ri = 0 then r = i if i ge 1 and Ri 6= 0 thenRi+1 = Riminus1 mod Ri Define di = degRi and `i = Ri[di] Then

bull d1 gt d2 gt middot middot middot gt dr = minusinfin

bull `i 6= 0 if 1 le i le r minus 1

bull if `rminus1 6= 0 then gcdR0 R1 = Rrminus1`rminus1 and

bull if `rminus1 = 0 then R0 = R1 = gcdR0 R1 = 0 and r = 1

Proof If 1 le i le r minus 1 then Ri 6= 0 so `i = Ri[degRi] 6= 0 Also Ri+1 = Riminus1 mod Ri sodegRi+1 lt degRi ie di+1 lt di Hence d1 gt d2 gt middot middot middot gt dr and dr = degRr = deg 0 =minusinfin

If `rminus1 = 0 then r must be 1 so R1 = 0 (since Rr = 0) and R0 = 0 (since `0 = 0) sogcdR0 R1 = 0 Assume from now on that `rminus1 6= 0 The divisibility of Ri+1 minus Riminus1by Ri implies gcdRiminus1 Ri = gcdRi Ri+1 so gcdR0 R1 = gcdR1 R2 = middot middot middot =gcdRrminus1 Rr = gcdRrminus1 0 = Rrminus1`rminus1

Theorem B2 (the extended EuclidndashStevin algorithm) In the situation of Theorem B1define U0 V0 Ur Ur isin k[x] as follows U0 = 1 V0 = 0 U1 = 0 V1 = 1 Ui+1 = Uiminus1minusbRiminus1RicUi for i isin 1 r minus 1 Vi+1 = Viminus1 minus bRiminus1RicVi for i isin 1 r minus 1Then

bull Ri = UiR0 + ViR1 for i isin 0 r

bull UiVi+1 minus Ui+1Vi = (minus1)i for i isin 0 r minus 1

bull Vi+1Ri minus ViRi+1 = (minus1)iR0 for i isin 0 r minus 1 and

bull if d0 gt d1 then deg Vi+1 = d0 minus di for i isin 0 r minus 1

Proof U0R0 + V0R1 = 1R0 + 0R1 = R0 and U1R0 + V1R1 = 0R0 + 1R1 = R1 Fori isin 1 r minus 1 assume inductively that Riminus1 = Uiminus1R0+Viminus1R1 and Ri = UiR0+ViR1and writeQi = bRiminus1Ric Then Ri+1 = Riminus1 mod Ri = Riminus1minusQiRi = (Uiminus1minusQiUi)R0+(Viminus1 minusQiVi)R1 = Ui+1R0 + Vi+1R1

If i = 0 then UiVi+1minusUi+1Vi = 1 middot 1minus 0 middot 0 = 1 = (minus1)i For i isin 1 r minus 1 assumeinductively that Uiminus1Vi minus UiViminus1 = (minus1)iminus1 then UiVi+1 minus Ui+1Vi = Ui(Viminus1 minus QVi) minus(Uiminus1 minusQUi)Vi = minus(Uiminus1Vi minus UiViminus1) = (minus1)i

Multiply Ri = UiR0 + ViR1 by Vi+1 multiply Ri+1 = Ui+1R0 + Vi+1R1 by Ui+1 andsubtract to see that Vi+1Ri minus ViRi+1 = (minus1)iR0

Finally assume d0 gt d1 If i = 0 then deg Vi+1 = deg 1 = 0 = d0 minus di Fori isin 1 r minus 1 assume inductively that deg Vi = d0 minus diminus1 Then deg ViRi+1 lt d0since degRi+1 = di+1 lt diminus1 so deg Vi+1Ri = deg(ViRi+1 +(minus1)iR0) = d0 so deg Vi+1 =d0 minus di

C EuclidndashStevin results as iterates of divstepIf the EuclidndashStevin algorithm is written as a sequence of polynomial divisions andeach polynomial division is written as a sequence of coefficient-elimination steps theneach coefficient-elimination step corresponds to one application of divstep The exactcorrespondence is stated in Theorem C2 This generalizes the known BerlekampndashMasseyequivalence mentioned in Section 1

Theorem C1 is a warmup showing how a single polynomial division relates to asequence of division steps

Daniel J Bernstein and Bo-Yin Yang 39

Theorem C1 (polynomial division as a sequence of division steps) Let k be a fieldLet R0 R1 be elements of the polynomial ring k[x] Write d0 = degR0 and d1 = degR1Assume that d0 gt d1 ge 0 Then

divstep2d0minus2d1(1 xd0R0(1x) xd0minus1R1(1x))= (1 `d0minusd1minus1

0 xd1R1(1x) (`d0minusd1minus10 `1)d0minusd1+1xd1minus1R2(1x))

where `0 = R0[d0] `1 = R1[d1] and R2 = R0 mod R1

Proof Part 1 The first d0 minus d1 minus 1 iterations If δ isin 1 2 d0 minus d1 minus 1 thend0minusδ gt d1 so R1[d0minusδ] = 0 so the polynomial `δminus1

0 xd0minusδR1(1x) has constant coefficient0 The polynomial xd0R0(1x) has constant coefficient `0 Hence

divstep(δ xd0R0(1x) `δminus10 xd0minusδR1(1x)) = (1 + δ xd0R0(1x) `δ0xd0minusδminus1R1(1x))

by definition of divstep By induction

divstepd0minusd1minus1(1 xd0R0(1x) xd0minus1R1(1x))= (d0 minus d1 x

d0R0(1x) `d0minusd1minus10 xd1R1(1x))

= (d0 minus d1 xd0R0(1x) xd1Rprime1(1x))

where Rprime1 = `d0minusd1minus10 R1 Note that Rprime1[d1] = `prime1 where `prime1 = `d0minusd1minus1

0 `1

Part 2 The swap iteration and the last d0 minus d1 iterations Define

Zd0minusd1 = R0 mod xd0minusd1Rprime1Zd0minusd1+1 = R0 mod xd0minusd1minus1Rprime1

Z2d0minus2d1 = R0 mod Rprime1 = R0 mod R1 = R2

ThenZd0minusd1 = R0 minus (R0[d0]`prime1)xd0minusd1Rprime1

Zd0minusd1+1 = Zd0minusd1 minus (Zd0minusd1 [d0 minus 1]`prime1)xd0minusd1minus1Rprime1

Z2d0minus2d1 = Z2d0minus2d1minus1 minus (Z2d0minus2d1minus1[d1]`prime1)Rprime1

Substitute 1x for x to see that

xd0Zd0minusd1(1x) = xd0R0(1x)minus (R0[d0]`prime1)xd1Rprime1(1x)xd0minus1Zd0minusd1+1(1x) = xd0minus1Zd0minusd1(1x)minus (Zd0minusd1 [d0 minus 1]`prime1)xd1Rprime1(1x)

xd1Z2d0minus2d1(1x) = xd1Z2d0minus2d1minus1(1x)minus (Z2d0minus2d1minus1[d1]`prime1)xd1Rprime1(1x)

We will use these equations to show that the d0 minus d1 2d0 minus 2d1 iterates of divstepcompute `prime1xd0minus1Zd0minusd1 (`prime1)d0minusd1+1xd1minus1Z2d0minus2d1 respectively

The constant coefficients of xd0R0(1x) and xd1Rprime1(1x) are R0[d0] and `prime1 6= 0 respec-tively and `prime1xd0R0(1x)minusR0[d0]xd1Rprime1(1x) = `prime1x

d0Zd0minusd1(1x) so

divstep(d0 minus d1 xd0R0(1x) xd1Rprime1(1x))

= (1minus (d0 minus d1) xd1Rprime1(1x) `prime1xd0minus1Zd0minusd1(1x))

40 Fast constant-time gcd computation and modular inversion

The constant coefficients of xd1Rprime1(1x) and `prime1xd0minus1Zd0minusd1(1x) are `prime1 and `prime1Zd0minusd1 [d0minus1]respectively and

(`prime1)2xd0minus1Zd0minusd1(1x)minus `prime1Zd0minusd1 [d0 minus 1]xd1Rprime1(1x) = (`prime1)2xd0minus1Zd0minusd1+1(1x)

sodivstep(1minus (d0 minus d1) xd1Rprime1(1x) `prime1xd0minus1Zd0minusd1(1x))

= (2minus (d0 minus d1) xd1Rprime1(1x) (`prime1)2xd0minus2Zd0minusd1+1(1x))

Continue in this way to see that

divstepi(d0 minus d1 xd0R0(1x) xd1Rprime1(1x))

= (iminus (d0 minus d1) xd1Rprime1(1x) (`prime1)ixd0minusiZd0minusd1+iminus1(1x))

for 1 le i le d0 minus d1 + 1 In particular

divstep2d0minus2d1(1 xd0R0(1x) xd0minus1R1(1x))= divstepd0minusd1+1(d0 minus d1 x

d0R0(1x) xd1Rprime1(1x))= (1 xd1Rprime1(1x) (`prime1)d0minusd1+1xd1minus1R2(1x))= (1 `d0minusd1minus1

0 xd1R1(1x) (`d0minusd1minus10 `1)d0minusd1+1xd1minus1R2(1x))

as claimed

Theorem C2 In the situation of Theorem B2 assume that d0 gt d1 Define f =xd0R0(1x) and g = xd0minus1R1(1x) Then f g isin k[x] Define (δn fn gn) = divstepn(1 f g)and Tn = T (δn fn gn)

Part A If i isin 0 1 r minus 2 and 2d0 minus 2di le n lt 2d0 minus di minus di+1 then

δn = 1 + nminus (2d0 minus 2di) gt 0fn = anx

diRi(1x)gn = bnx

2d0minusdiminusnminus1Ri+1(1x)

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdiUi(1x) an(1x)d0minusdiminus1Vi(1x)bn(1x)nminus(d0minusdi)+1Ui+1(1x) bn(1x)nminus(d0minusdi)Vi+1(1x)

)for some an bn isin klowast

Part B If i isin 0 1 r minus 2 and 2d0 minus di minus di+1 le n lt 2d0 minus 2di+1 then

δn = 1 + nminus (2d0 minus 2di+1) le 0fn = anx

di+1Ri+1(1x)gn = bnx

2d0minusdi+1minusnminus1Zn(1x)

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdi+1Ui+1(1x) an(1x)d0minusdi+1minus1Vi+1(1x)bn(1x)nminus(d0minusdi+1)+1Xn(1x) bn(1x)nminus(d0minusdi+1)Yn(1x)

)for some an bn isin klowast where

Zn = Ri mod x2d0minus2di+1minusnRi+1

Xn = Ui minuslfloorRix

2d0minus2di+1minusnRi+1rfloorx2d0minus2di+1minusnUi+1

Yn = Vi minuslfloorRix

2d0minus2di+1minusnRi+1rfloorx2d0minus2di+1minusnVi+1

Daniel J Bernstein and Bo-Yin Yang 41

Part C If n ge 2d0 minus 2drminus1 then

δn = 1 + nminus (2d0 minus 2drminus1) gt 0fn = anx

drminus1Rrminus1(1x)gn = 0

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdrminus1Urminus1(1x) an(1x)d0minusdrminus1minus1Vrminus1(1x)bn(1x)nminus(d0minusdrminus1)+1Ur(1x) bn(1x)nminus(d0minusdrminus1)Vr(1x)

)for some an bn isin klowast

The proof also gives recursive formulas for an bn namely (a0 b0) = (1 1) forthe base case (an bn) = (bnminus1 gnminus1(0)anminus1) for the swapped case and (an bn) =(anminus1 fnminus1(0)bnminus1) for the unswapped case One can if desired augment divstep totrack an and bn these are the denominators eliminated in a ldquofraction-freerdquo computation

Proof By hypothesis degR0 = d0 gt d1 = degR1 Hence d0 is a nonnegative integer andf = R0[d0] +R0[d0 minus 1]x+ middot middot middot+R0[0]xd0 isin k[x] Similarly g = R1[d0 minus 1] +R1[d0 minus 2]x+middot middot middot+R1[0]xd0minus1 isin k[x] this is also true for d0 = 0 with R1 = 0 and g = 0

We induct on n and split the analysis of n into six cases There are three differentformulas (A B C) applicable to different ranges of n and within each range we considerthe first n separately For brevity we write Ri = Ri(1x) U i = Ui(1x) V i = Vi(1x)Xn = Xn(1x) Y n = Yn(1x) Zn = Zn(1x)

Case A1 n = 2d0 minus 2di for some i isin 0 1 r minus 2If n = 0 then i = 0 By assumption δn = 1 as claimed fn = f = xd0R0 so fn = anx

d0R0as claimed where an = 1 gn = g = xd0minus1R1 so gn = bnx

d0minus1R1 as claimed where bn = 1Also Ui = 1 Ui+1 = 0 Vi = 0 Vi+1 = 1 and d0 minus di = 0 so(

an(1x)d0minusdiU i an(1x)d0minusdiminus1V ibn(1x)d0minusdi+1U i+1 bn(1x)d0minusdiV i+1

)=(

1 00 1

)= Tnminus1 middot middot middot T0

as claimedOtherwise n gt 0 so i gt 0 Now 2d0minusdiminus1minusdi le nminus 1 lt 2d0minus 2di so by the inductive

hypothesis δnminus1 = 1 + (nminus 1)minus (2d0 minus 2di) = 0 fnminus1 = anminus1xdiRi for some anminus1 isin klowast

gnminus1 = bnminus1xdiZnminus1 for some bnminus1 isin klowast where Znminus1 = Riminus1 mod xRi and

Tnminus2 middot middot middot T0 =(

anminus1(1x)d0minusdiU i anminus1(1x)d0minusdiminus1V ibnminus1(1x)d0minusdiXnminus1 bnminus1(1x)d0minusdiminus1Y nminus1

)where Xn = Uiminus1 minus bRiminus1xRicxUi and Yn = Viminus1 minus bRiminus1xRicxVi

Note that degZnminus1 le di Observe that

Ri+1 = Riminus1 mod Ri = Znminus1 minus (Znminus1[di]`i)RiUi+1 = Xnminus1 minus (Znminus1[di]`i)UiVi+1 = Ynminus1 minus (Znminus1[di]`i)Vi

The point is that subtracting (Znminus1[di]`i)Ri from Znminus1 eliminates the coefficient of xdi

and produces a polynomial of degree at most diminus 1 namely Znminus1 mod Ri = Riminus1 mod RiStarting from Ri+1 = Znminus1 minus (Znminus1[di]`i)Ri substitute 1x for x and multiply by

anminus1bnminus1`ixdi to see that

anminus1bnminus1`ixdiRi+1 = anminus1`ignminus1 minus bnminus1Znminus1[di]fnminus1

= fnminus1(0)gnminus1 minus gnminus1(0)fnminus1

42 Fast constant-time gcd computation and modular inversion

Now(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1)

= (1 + δnminus1 fnminus1 (fnminus1(0)gnminus1 minus gnminus1(0)fnminus1)x)

Hence δn = 1 = 1 + nminus (2d0 minus 2di) gt 0 as claimed fn = anxdiRi where an = anminus1 isin klowast

as claimed and gn = bnxdiminus1Ri+1 where bn = anminus1bnminus1`i isin klowast as claimed

Finally U i+1 = Xnminus1 minus (Znminus1[di]`i)U i and V i+1 = Y nminus1 minus (Znminus1[di]`i)V i Substi-tute these into the matrix(

an(1x)d0minusdiU i an(1x)d0minusdiminus1V ibn(1x)d0minusdi+1U i+1 bn(1x)d0minusdiV i+1

)

along with an = anminus1 and bn = anminus1bnminus1`i to see that this matrix is

Tnminus1 =(

1 0minusbnminus1Znminus1[di]x anminus1`ix

)times Tnminus2 middot middot middot T0 shown above

Case A2 2d0 minus 2di lt n lt 2d0 minus di minus di+1 for some i isin 0 1 r minus 2By the inductive hypothesis δnminus1 = n minus (2d0 minus 2di) gt 0 fnminus1 = anminus1x

diRi whereanminus1 isin klowast gnminus1 = bnminus1x

2d0minusdiminusnRi+1 where bnminus1 isin klowast and

Tnminus2 middot middot middot T0 =(

anminus1(1x)d0minusdiU i anminus1(1x)d0minusdiminus1V ibnminus1(1x)nminus(d0minusdi)U i+1 bnminus1(1x)nminus(d0minusdi)minus1V i+1

)

Note that gnminus1(0) = bnminus1Ri+1[2d0 minus di minus n] = 0 since 2d0 minus di minus n gt di+1 = degRi+1Thus

(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1) = (1 + δnminus1 fnminus1 fnminus1(0)gnminus1x)

Hence δn = 1 + n minus (2d0 minus 2di) gt 0 as claimed fn = anxdiRi where an = anminus1 isin klowast

as claimed and gn = bnx2d0minusdiminusnminus1Ri+1 where bn = fnminus1(0)bnminus1 = anminus1bnminus1`i isin klowast as

claimedAlso Tnminus1 =

(1 00 anminus1`ix

) Multiply by Tnminus2 middot middot middot T0 above to see that

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdiU i an(1x)d0minusdiminus1V ibn(1x)nminus(d0minusdi)+1U i+1 bn`i(1x)nminus(d0minusdi)V i+1

)as claimed

Case B1 n = 2d0 minus di minus di+1 for some i isin 0 1 r minus 2Note that 2d0 minus 2di le nminus 1 lt 2d0 minus di minus di+1 By the inductive hypothesis δnminus1 =

diminusdi+1 gt 0 fnminus1 = anminus1xdiRi where anminus1 isin klowast gnminus1 = bnminus1x

di+1Ri+1 where bnminus1 isin klowastand

Tnminus2 middot middot middot T0 =(

anminus1(1x)d0minusdiU i anminus1(1x)d0minusdiminus1V ibnminus1(1x)d0minusdi+1U i+1 bnminus1`i(1x)d0minusdi+1minus1V i+1

)

Observe thatZn = Ri minus (`i`i+1)xdiminusdi+1Ri+1

Xn = Ui minus (`i`i+1)xdiminusdi+1Ui+1

Yn = Vi minus (`i`i+1)xdiminusdi+1Vi+1

Substitute 1x for x in the Zn equation and multiply by anminus1bnminus1`i+1xdi to see that

anminus1bnminus1`i+1xdiZn = bnminus1`i+1fnminus1 minus anminus1`ignminus1

= gnminus1(0)fnminus1 minus fnminus1(0)gnminus1

Daniel J Bernstein and Bo-Yin Yang 43

Also note that gnminus1(0) = bnminus1`i+1 6= 0 so

(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1)= (1minus δnminus1 gnminus1 (gnminus1(0)fnminus1 minus fnminus1(0)gnminus1)x)

Hence δn = 1 minus (di minus di+1) le 0 as claimed fn = anxdi+1Ri+1 where an = bnminus1 isin klowast as

claimed and gn = bnxdiminus1Zn where bn = anminus1bnminus1`i+1 isin klowast as claimed

Finally substitute Xn = U i minus (`i`i+1)(1x)diminusdi+1U i+1 and substitute Y n = V i minus(`i`i+1)(1x)diminusdi+1V i+1 into the matrix(

an(1x)d0minusdi+1U i+1 an(1x)d0minusdi+1minus1V i+1bn(1x)d0minusdi+1Xn bn(1x)d0minusdiY n

)

along with an = bnminus1 and bn = anminus1bnminus1`i+1 to see that this matrix is Tnminus1 =(0 1

bnminus1`i+1x minusanminus1`ix

)times Tnminus2 middot middot middot T0 shown above

Case B2 2d0 minus di minus di+1 lt n lt 2d0 minus 2di+1 for some i isin 0 1 r minus 2Now 2d0minusdiminusdi+1 le nminus1 lt 2d0minus2di+1 By the inductive hypothesis δnminus1 = nminus(2d0minus

2di+1) le 0 fnminus1 = anminus1xdi+1Ri+1 for some anminus1 isin klowast gnminus1 = bnminus1x

2d0minusdi+1minusnZnminus1 forsome bnminus1 isin klowast where Znminus1 = Ri mod x2d0minus2di+1minusn+1Ri+1 and

Tnminus2 middot middot middot T0 =(

anminus1(1x)d0minusdi+1U i+1 anminus1(1x)d0minusdi+1minus1V i+1bnminus1(1x)nminus(d0minusdi+1)Xnminus1 bnminus1(1x)nminus(d0minusdi+1)minus1Y nminus1

)where Xnminus1 = Ui minus

lfloorRix

2d0minus2di+1minusn+1Ri+1rfloorx2d0minus2di+1minusn+1Ui+1 and Ynminus1 = Vi minuslfloor

Rix2d0minus2di+1minusn+1Ri+1

rfloorx2d0minus2di+1minusn+1Vi+1

Observe that

Zn = Ri mod x2d0minus2di+1minusnRi+1

= Znminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnRi+1

Xn = Xnminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnUi+1

Yn = Ynminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnVi+1

The point is that degZnminus1 le 2d0 minus di+1 minus n and subtracting

(Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnRi+1

from Znminus1 eliminates the coefficient of x2d0minusdi+1minusnStarting from

Zn = Znminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnRi+1

substitute 1x for x and multiply by anminus1bnminus1`i+1x2d0minusdi+1minusn to see that

anminus1bnminus1`i+1x2d0minusdi+1minusnZn = anminus1`i+1gnminus1 minus bnminus1Znminus1[2d0 minus di+1 minus n]fnminus1

= fnminus1(0)gnminus1 minus gnminus1(0)fnminus1

Now(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1)

= (1 + δnminus1 fnminus1 (fnminus1(0)gnminus1 minus gnminus1(0)fnminus1)x)

Hence δn = 1 +nminus (2d0minus 2di+1) le 0 as claimed fn = anxdi+1Ri+1 where an = anminus1 isin klowast

as claimed and gn = bnx2d0minusdi+1minusnminus1Zn where bn = anminus1bnminus1`i+1 isin klowast as claimed

44 Fast constant-time gcd computation and modular inversion

Finally substitute

Xn = Xnminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)(1x)2d0minus2di+1minusnU i+1

andY n = Y nminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)(1x)2d0minus2di+1minusnV i+1

into the matrix (an(1x)d0minusdi+1U i+1 an(1x)d0minusdi+1minus1V i+1

bn(1x)nminus(d0minusdi+1)+1Xn bn(1x)nminus(d0minusdi+1)Y n

)

along with an = anminus1 and bn = anminus1bnminus1`i+1 to see that this matrix is Tnminus1 =(1 0

minusbnminus1Znminus1[2d0 minus di+1 minus n]x anminus1`i+1x

)times Tnminus2 middot middot middot T0 shown above

Case C1 n = 2d0 minus 2drminus1 This is essentially the same as Case A1 except that i isreplaced with r minus 1 and gn ends up as 0 For completeness we spell out the details

If n = 0 then r = 1 and R1 = 0 so g = 0 By assumption δn = 1 as claimedfn = f = xd0R0 so fn = anx

d0R0 as claimed where an = 1 gn = g = 0 as claimed Definebn = 1 Now Urminus1 = 1 Ur = 0 Vrminus1 = 0 Vr = 1 and d0 minus drminus1 = 0 so(

an(1x)d0minusdrminus1Urminus1 an(1x)d0minusdrminus1minus1V rminus1bn(1x)d0minusdrminus1+1Ur bn(1x)d0minusdrminus1V r

)=(

1 00 1

)= Tnminus1 middot middot middot T0

as claimedOtherwise n gt 0 so r ge 2 Now 2d0 minus drminus2 minus drminus1 le n minus 1 lt 2d0 minus 2drminus1 so

by the inductive hypothesis (for i = r minus 2) δnminus1 = 1 + (n minus 1) minus (2d0 minus 2drminus1) = 0fnminus1 = anminus1x

drminus1Rrminus1 for some anminus1 isin klowast gnminus1 = bnminus1xdrminus1Znminus1 for some bnminus1 isin klowast

where Znminus1 = Rrminus2 mod xRrminus1 and

Tnminus2 middot middot middot T0 =(anminus1(1x)d0minusdrminus1Urminus1 anminus1(1x)d0minusdrminus1minus1V rminus1bnminus1(1x)d0minusdrminus1Xnminus1 bnminus1(1x)d0minusdrminus1minus1Y nminus1

)where Xn = Urminus2 minus bRrminus2xRrminus1cxUrminus1 and Yn = Vrminus2 minus bRrminus2xRrminus1cxVrminus1

Observe that

0 = Rr = Rrminus2 mod Rrminus1 = Znminus1 minus (Znminus1[drminus1]`rminus1)Rrminus1

Ur = Xnminus1 minus (Znminus1[drminus1]`rminus1)Urminus1

Vr = Ynminus1 minus (Znminus1[drminus1]`rminus1)Vrminus1

Starting from Znminus1 minus (Znminus1[drminus1]`rminus1)Rrminus1 = 0 substitute 1x for x and multiplyby anminus1bnminus1`rminus1x

drminus1 to see that

anminus1`rminus1gnminus1 minus bnminus1Znminus1[drminus1]fnminus1 = 0

ie fnminus1(0)gnminus1 minus gnminus1(0)fnminus1 = 0 Now

(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1) = (1 + δnminus1 fnminus1 0)

Hence δn = 1 = 1 + n minus (2d0 minus 2drminus1) gt 0 as claimed fn = anxdrminus1Rrminus1 where

an = anminus1 isin klowast as claimed and gn = 0 as claimedDefine bn = anminus1bnminus1`rminus1 isin klowast and substitute Ur = Xnminus1 minus (Znminus1[drminus1]`rminus1)Urminus1

and V r = Y nminus1 minus (Znminus1[drminus1]`rminus1)V rminus1 into the matrix(an(1x)d0minusdrminus1Urminus1(1x) an(1x)d0minusdrminus1minus1Vrminus1(1x)bn(1x)d0minusdrminus1+1Ur(1x) bn(1x)d0minusdrminus1Vr(1x)

)

Daniel J Bernstein and Bo-Yin Yang 45

to see that this matrix is Tnminus1 =(

1 0minusbnminus1Znminus1[drminus1]x anminus1`rminus1x

)times Tnminus2 middot middot middot T0

shown above

Case C2 n gt 2d0 minus 2drminus1By the inductive hypothesis δnminus1 = n minus (2d0 minus 2drminus1) fnminus1 = anminus1x

drminus1Rrminus1 forsome anminus1 isin klowast gnminus1 = 0 and

Tnminus2 middot middot middot T0 =(anminus1(1x)d0minusdrminus1Urminus1(1x) anminus1(1x)d0minusdrminus1minus1Vrminus1(1x)bnminus1(1x)nminus(d0minusdrminus1)Ur(1x) bnminus1(1x)nminus(d0minusdrminus1)minus1Vr(1x)

)for some bnminus1 isin klowast Hence δn = 1 + δn = 1 + n minus (2d0 minus 2drminus1) gt 0 fn = fnminus1 =

anxdrminus1Rrminus1 where an = anminus1 and gn = 0 Also Tnminus1 =

(1 00 anminus1`rminus1x

)so

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdrminus1Urminus1(1x) an(1x)d0minusdrminus1minus1Vrminus1(1x)bn(1x)nminus(d0minusdrminus1)+1Ur(1x) bn(1x)nminus(d0minusdrminus1)Vr(1x)

)where bn = anminus1bnminus1`rminus1 isin klowast

D Alternative proof of main gcd theorem for polynomialsThis appendix gives another proof of Theorem 62 using the relationship in Theorem C2between the EuclidndashStevin algorithm and iterates of divstep

Second proof of Theorem 62 Define R2 R3 Rr and di and `i as in Theorem B1 anddefine Ui and Vi as in Theorem B2 Then d0 = d gt d1 and all of the hypotheses ofTheorems B1 B2 and C2 are satisfied

By Theorem B1 G = gcdR0 R1 = Rrminus1`rminus1 (since R0 6= 0) so degG = drminus1Also by Theorem B2

Urminus1R0 + Vrminus1R1 = Rrminus1 = `rminus1G

so (Vrminus1`rminus1)R1 equiv G (mod R0) and deg Vrminus1 = d0 minus drminus2 lt d0 minus drminus1 = dminus degG soV = Vrminus1`rminus1

Case 1 drminus1 ge 1 Part C of Theorem C2 applies to n = 2dminus 1 since n = 2d0 minus 1 ge2d0 minus 2drminus1 In particular δn = 1 + n minus (2d0 minus 2drminus1) = 2drminus1 = 2 degG f2dminus1 =a2dminus1x

drminus1Rrminus1(1x) and v2dminus1 = a2dminus1(1x)d0minusdrminus1minus1Vrminus1(1x)Case 2 drminus1 = 0 Then r ge 2 and drminus2 ge 1 Part B of Theorem C2 applies to

i = r minus 2 and n = 2dminus 1 = 2d0 minus 1 since 2d0 minus di minus di+1 = 2d0 minus drminus2 le 2d0 minus 1 = n lt2d0 = 2d0 minus 2di+1 In particular δn = 1 + n minus (2d0 minus 2di+1) = 2drminus1 = 2 degG againf2dminus1 = a2dminus1x

drminus1Rrminus1(1x) and again v2dminus1 = a2dminus1(1x)d0minusdrminus1minus1Vrminus1(1x)In both cases f2dminus1(0) = a2dminus1`rminus1 so xdegGf2dminus1(1x)f2dminus1(0) = Rrminus1`rminus1 = G

and xminusd+1+degGv2dminus1(1x)f2dminus1(0) = Vrminus1`rminus1 = V

E A gcd algorithm for integersThis appendix presents a variable-time algorithm to compute the gcd of two integersThis appendix also explains how a modification to the algorithm produces the StehleacutendashZimmermann algorithm [62]

Theorem E1 Let e be a positive integer Let f be an element of Zlowast2 Let g be anelement of 2eZlowast2 Then there is a unique element q isin

1 3 5 2e+1 minus 1

such that

qg2e minus f isin 2e+1Z2

46 Fast constant-time gcd computation and modular inversion

We define f div2 g = q2e isin Q and we define f mod2 g = f minus qg2e isin 2e+1Z2 Thenf = (f div2 g)g + (f mod2 g) Note that e = ord2 g so we are not creating any ambiguityby omitting e from the f div2 g and f mod2 g notation

Proof First note that the quotient f(g2e) is odd Define q = (f(g2e)) mod 2e+1 thenq isin

1 3 5 2e+1 and qg2e minus f isin 2e+1Z2 by construction To see uniqueness note

that if qprime isin

1 3 5 2e+1 minus 1also satisfies qprimeg2e minus f isin 2e+1Z2 then (q minus qprime)g2e isin

2e+1Z2 so q minus qprime isin 2e+1Z2 since g2e isin Zlowast2 so q mod 2e+1 = qprime mod 2e+1 so q = qprime

Theorem E2 Let R0 be a nonzero element of Z2 Let R1 be an element of 2Z2Recursively define e0 e1 e2 e3 isin Z and R2 R3 isin 2Z2 as follows

bull If i ge 0 and Ri 6= 0 then ei = ord2 Ri

bull If i ge 1 and Ri 6= 0 then Ri+1 = minus((Riminus12eiminus1)mod2 Ri)2ei isin 2Z2

bull If i ge 1 and Ri = 0 then ei Ri+1 ei+1 Ri+2 ei+2 are undefined

Then (Ri2ei

Ri+1

)=(

0 12ei

minus12ei ((Riminus12eiminus1)div2 Ri)2ei

)(Riminus12eiminus1

Ri

)for each i ge 1 where Ri+1 is defined

Proof We have (Riminus12eiminus1)mod2 Ri = Riminus12eiminus1 minus ((Riminus12eiminus1)div2 Ri)Ri Negateand divide by 2ei to obtain the matrix equation

Theorem E3 Let R0 be an odd element of Z Let R1 be an element of 2Z Definee0 e1 e2 e3 and R2 R3 as in Theorem E2 Then Ri isin 2Z for each i ge 1 whereRi is defined Furthermore if t ge 0 and Rt+1 = 0 then |Rt2et | = gcdR0 R1

Proof Write g = gcdR0 R1 isin Z Then g is odd since R0 is oddWe first show by induction on i that Ri isin gZ for each i ge 0 For i isin 0 1 this follows

from the definition of g For i ge 2 we have Riminus2 isin gZ and Riminus1 isin gZ by the inductivehypothesis Now (Riminus22eiminus2)div2 Riminus1 has the form q2eiminus1 for q isin Z by definition ofdiv2 and 22eiminus1+eiminus2Ri = 2eiminus2qRiminus1 minus 2eiminus1Riminus2 by definition of Ri The right side ofthis equation is in gZ so 2middotmiddotmiddotRi is in gZ This implies first that Ri isin Q but also Ri isin Z2so Ri isin Z This also implies that Ri isin gZ since g is odd

By construction Ri isin 2Z2 for each i ge 1 and Ri isin Z for each i ge 0 so Ri isin 2Z foreach i ge 1

Finally assume that t ge 0 and Rt+1 = 0 Then all of R0 R1 Rt are nonzero Writegprime = |Rt2et | Then 2etgprime = |Rt| isin gZ and g is odd so gprime isin gZ

The equation 2eiminus1Riminus2 = 2eiminus2qRiminus1minus22eiminus1+eiminus2Ri shows that if Riminus1 Ri isin gprimeZ thenRiminus2 isin gprimeZ since gprime is odd By construction Rt Rt+1 isin gprimeZ so Rtminus1 Rtminus2 R1 R0 isingprimeZ Hence g = gcdR0 R1 isin gprimeZ Both g and gprime are positive so gprime = g

E4 The gcd algorithm Here is the algorithm featured in this appendixGiven an odd integer R0 and an even integer R1 compute the even integers R2 R3

defined above In other words for each i ge 1 where Ri 6= 0 compute ei = ord2 Rifind q isin

1 3 5 2ei+1 minus 1

such that qRi2ei minus Riminus12eiminus1 isin 2ei+1Z and compute

Ri+1 = (qRi2ei minusRiminus12eiminus1)2ei isin 2Z If Rt+1 = 0 for some t ge 0 output |Rt2et | andstop

We will show in Theorem F26 that this algorithm terminates The output is gcdR0 R1by Theorem E3 This algorithm is closely related to iterating divstep and we will usethis relationship to analyze iterates of divstep see Appendix G

More generally if R0 is an odd integer and R1 is any integer one can computegcdR0 R1 by using the above algorithm to compute gcdR0 2R1 Even more generally

Daniel J Bernstein and Bo-Yin Yang 47

one can compute the gcd of any two integers by first handling the case gcd0 0 = 0 thenhandling powers of 2 in both inputs then applying the odd-input algorithmE5 A non-functioning variant Imagine replacing the subtractions above withadditions find q isin

1 3 5 2ei+1 minus 1

such that qRi2ei +Riminus12eiminus1 isin 2ei+1Z and

compute Ri+1 = (qRi2ei + Riminus12eiminus1)2ei isin 2Z If R0 and R1 are positive then allRi are positive so the algorithm does not terminate This non-terminating algorithm isrelated to posdivstep (see Section 84) in the same way that the algorithm above is relatedto divstep

Daireaux Maume-Deschamps and Valleacutee in [31 Section 6] mentioned this algorithm asa ldquonon-centered LSB algorithmrdquo said that one can easily add a ldquosupplementary stoppingconditionrdquo to ensure termination and dismissed the resulting algorithm as being ldquocertainlyslower than the centered versionrdquoE6 A centered variant the StehleacutendashZimmermann algorithm Imagine usingcentered remainders 1minus 2e minus3minus1 1 3 2e minus 1 modulo 2e+1 rather than uncen-tered remainders

1 3 5 2e+1 minus 1

The above gcd algorithm then turns into the

StehleacutendashZimmermann gcd algorithm from [62]Specifically Stehleacute and Zimmermann begin with integers a b having ord2 a lt ord2 b

If b 6= 0 then ldquogeneralized binary divisionrdquo in [62 Lemma 1] defines q as the unique oddinteger with |q| lt 2ord2 bminusord2 a such that r = a+ qb2ord2 bminusord2 a is divisible by 2ord2 b+1The gcd algorithm in [62 Algorithm Elementary-GB] repeatedly sets (a b)larr (b r) Whenb = 0 the algorithm outputs the odd part of a

It is equivalent to set (a b)larr (b2ord2 b r2ord2 b) since this simply divides all subse-quent results by 2ord2 b In other words starting from an odd a0 and an even b0 recursivelydefine an odd ai+1 and an even bi+1 by the following formulas stopping when bi = 0bull ai+1 = bi2e where e = ord2 bi

bull bi+1 = (ai + qbi2e)2e isin 2Z for a unique q isin 1minus 2e minus3minus1 1 3 2e minus 1Now relabel a0 b0 b1 b2 b3 as R0 R1 R2 R3 R4 to obtain the following recursivedefinition Ri+1 = (qRi2ei +Riminus12eiminus1)2ei isin 2Z and thus(

Ri2ei

Ri+1

)=(

0 12ei

12ei q22ei

)(Riminus12eiminus1

Ri

)

for a unique q isin 1minus 2ei minus3minus1 1 3 2ei minus 1To summarize starting from the StehleacutendashZimmermann algorithm one can obtain the

gcd algorithm featured in this appendix by making the following three changes Firstdivide out 2ei instead of allowing powers of 2 to accumulate Second add Riminus12eiminus1

instead of subtracting it this does not affect the number of iterations for centered q Thirdswitch from centered q to uncentered q

Intuitively centered remainders are smaller than uncentered remainders and it iswell known that centered remainders improve the worst-case performance of Euclidrsquosalgorithm so one might expect that the StehleacutendashZimmermann algorithm performs betterthan our uncentered variant However as noted earlier our analysis produces the oppositeconclusion See Appendix F

F Performance of the gcd algorithm for integersThe possible matrices in Theorem E2 are(

0 12minus12 14

)

(0 12minus12 34

)for ei = 1(

0 14minus14 116

)

(0 14minus14 316

)

(0 14minus14 516

)

(0 14minus14 716

)for ei = 2

48 Fast constant-time gcd computation and modular inversion

etc In this appendix we show that a product of these matrices shrinks the input by a factorexponential in the weight

sumei This implies that

sumei in the algorithm of Appendix E is

O(b) where b is the number of input bits

F1 Lower bounds It is easy to see that each of these matrices has two eigenvaluesof absolute value 12ei This does not imply however that a weight-w product of thesematrices shrinks the input by a factor 12w Concretely the weight-7 product

147

(328 minus380minus158 233

)=(

0 14minus14 116

)(0 12minus12 34

)2( 0 12minus12 14

)(0 12minus12 34

)2

of 6 matrices has spectral radius (maximum absolute value of eigenvalues)

(561 +radic

249185)215 = 003235425826 = 061253803919 7 = 124949900586

The (w7)th power of this matrix is expressed as a weight-w product and has spectralradius 1207071286552w Consequently this proof strategy cannot obtain an O constantbetter than 7(15minus log2(561 +

radic249185)) = 1414169815 We prove a bound twice the

weight on the number of divstep iterations (see Appendix G) so this proof strategy cannotobtain an O constant better than 14(15 minus log2(561 +

radic249185)) = 2828339631 for

the number of divstep iterationsRotating the list of 6 matrices shown above gives 6 weight-7 products with the same

spectral radius In a small search we did not find any weight-w matrices with spectralradius above 061253803919 w Our upper bounds (see below) are 06181640625w for alllarge w

For comparison the non-terminating variant in Appendix E5 allows the matrix(0 12

12 34

) which has spectral radius 1 with eigenvector

(12

) The StehleacutendashZimmermann

algorithm reviewed in Appendix E6 allows the matrix(

0 1212 14

) which has spectral

radius (1 +radic

17)8 = 06403882032 so this proof strategy cannot obtain an O constantbetter than 2(log2(

radic17minus 1)minus 1) = 3110510062 for the number of cdivstep iterations

F2 A warning about worst-case inputs DefineM as the matrix 4minus7(

328 minus380minus158 233

)shown above Then Mminus3 =

(60321097 11336806047137246 88663112

) Applying our gcd algorithm to

inputs (60321097 47137246) applies a series of 18 matrices of total weight 21 in the

notation of Theorem E2 the matrix(

0 12minus12 34

)twice then the matrix

(0 12minus12 14

)

then the matrix(

0 12minus12 34

)twice then the weight-2 matrix

(0 14minus14 116

) all of

these matrices again and all of these matrices a third timeOne might think that these are worst-case inputs to our algorithm analogous to

a standard matrix construction of Fibonacci numbers as worst-case inputs to Euclidrsquosalgorithm We emphasize that these are not worst-case inputs the inputs (22293minus128330)are much smaller and nevertheless require 19 steps of total weight 23 We now explainwhy the analogy fails

The matrix M3 has an eigenvector (1minus053 ) with eigenvalue 1221middot0707 =12148 and an eigenvector (1 078 ) with eigenvalue 1221middot129 = 12271 so ithas spectral radius around 12148 Our proof strategy thus cannot guarantee a gainof more than 148 bits from weight 21 What we actually show below is that weight21 is guaranteed to gain more than 1397 bits (The quantity α21 in Figure F14 is133569231 = 2minus13972)

Daniel J Bernstein and Bo-Yin Yang 49

The matrix Mminus3 has spectral radius 2271 It is not surprising that each column ofMminus3 is close to 27 bits in size the only way to avoid this would be for the other eigenvectorto be pointing in almost exactly the same direction as (1 0) or (0 1) For our inputs takenfrom the first column of Mminus3 the algorithm gains nearly 27 bits from weight 21 almostdouble the guaranteed gain In short these are good inputs not bad inputs

Now replace M with the matrix F =(

0 11 minus1

) which reduces worst-case inputs to

Euclidrsquos algorithm The eigenvalues of F are (radic

5minus 1)2 = 12069 and minus(radic

5 + 1)2 =minus2069 The entries of powers of Fminus1 are Fibonacci numbers again with sizes wellpredicted by the spectral radius of Fminus1 namely 2069 However the spectral radius ofF is larger than 1 The reason that this large eigenvalue does not spoil the terminationof Euclidrsquos algorithm is that Euclidrsquos algorithm inspects sizes and chooses quotients todecrease sizes at each step

Stehleacute and Zimmermann in [62 Section 62] measure the performance of their algorithm

for ldquoworst caserdquo inputs18 obtained from negative powers of the matrix(

0 1212 14

) The

statement that these inputs are ldquoworst caserdquo is not justified On the contrary if these inputshave b bits then they take (073690 + o(1))b steps of the StehleacutendashZimmermann algorithmand thus (14738 + o(1))b cdivsteps which is considerably fewer steps than manyother inputs that we have tried19 for various values of b The quantity 073690 here islog(2) log((

radic17 + 1)2) coming from the smaller eigenvalue minus2(

radic17 + 1) = minus039038

of the above matrix

F3 Norms We use the standard definitions of the 2-norm of a real vector and the2-norm of a real matrix We summarize the basic theory of 2-norms here to keep thispaper self-contained

Definition F4 If v isin R2 then |v|2 is defined asradicv2

1 + v22 where v = (v1 v2)

Definition F5 If P isinM2(R) then |P |2 is defined as max|Pv|2 v isin R2 |v|2 = 1

This maximum exists becausev isin R2 |v|2 = 1

is a compact set (namely the unit

circle) and |Pv|2 is a continuous function of v See also Theorem F11 for an explicitformula for |P |2

Beware that the literature sometimes uses the notation |P |2 for the entrywise 2-normradicP 2

11 + P 212 + P 2

21 + P 222 We do not use entrywise norms of matrices in this paper

Theorem F.6. Let v be an element of Z^2. If v ≠ 0 then |v|_2 ≥ 1.

Proof. Write v as (v_1, v_2) with v_1, v_2 ∈ Z. If v_1 ≠ 0 then v_1^2 ≥ 1, so v_1^2 + v_2^2 ≥ 1. If v_2 ≠ 0 then v_2^2 ≥ 1, so v_1^2 + v_2^2 ≥ 1. Either way |v|_2 = √(v_1^2 + v_2^2) ≥ 1.

Theorem F.7. Let v be an element of R^2. Let r be an element of R. Then |rv|_2 = |r| |v|_2.

Proof. Write v as (v_1, v_2) with v_1, v_2 ∈ R. Then rv = (rv_1, rv_2), so |rv|_2 = √(r^2 v_1^2 + r^2 v_2^2) = √(r^2)·√(v_1^2 + v_2^2) = |r| |v|_2.

^18 Specifically, Stehlé and Zimmermann claim that "the worst case of the binary variant" (i.e., of their gcd algorithm) is for inputs "G_n and 2G_{n−1}, where G_0 = 0, G_1 = 1, G_n = −G_{n−1} + 4G_{n−2}, which gives all binary quotients equal to 1". Daireaux, Maume-Deschamps, and Vallée in [31, Section 2.3] state that Stehlé and Zimmermann "exhibited the precise worst-case number of iterations" and give an equivalent characterization of this case.

^19 For example, "G_n" from [62] is −5467345 for n = 18 and 2135149 for n = 17. The inputs (−5467345, 2·2135149) use 17 iterations of the "GB" divisions from [62], while the smaller inputs (−1287513, 2·622123) use 24 "GB" iterations. The inputs (1, −5467345, 2135149) use 34 cdivsteps; the inputs (1, −2718281, 3141592) use 45 cdivsteps; the inputs (1, −1287513, 622123) use 58 cdivsteps.


Theorem F.8. Let P be an element of M_2(R). Let r be an element of R. Then |rP|_2 = |r| |P|_2.

Proof. By definition |rP|_2 = max{|rPv|_2 : v ∈ R^2, |v|_2 = 1}. By Theorem F.7 this is the same as max{|r| |Pv|_2 : v ∈ R^2, |v|_2 = 1} = |r| max{|Pv|_2 : v ∈ R^2, |v|_2 = 1} = |r| |P|_2.

Theorem F.9. Let P be an element of M_2(R). Let v be an element of R^2. Then |Pv|_2 ≤ |P|_2 |v|_2.

Proof. If v = 0 then Pv = 0 and |Pv|_2 = 0 = |P|_2 |v|_2. Otherwise |v|_2 ≠ 0. Write w = v/|v|_2. Then |w|_2 = 1, so |Pw|_2 ≤ |P|_2, so |Pv|_2 = |Pw|_2 |v|_2 ≤ |P|_2 |v|_2.

Theorem F.10. Let P, Q be elements of M_2(R). Then |PQ|_2 ≤ |P|_2 |Q|_2.

Proof. If v ∈ R^2 and |v|_2 = 1 then |PQv|_2 ≤ |P|_2 |Qv|_2 ≤ |P|_2 |Q|_2 |v|_2.

Theorem F.11. Let P be an element of M_2(R). Assume that P*P = (a b; c d), where P* is the transpose of P. Then |P|_2 = √((a + d + √((a − d)^2 + 4b^2))/2).

Proof. P*P ∈ M_2(R) is its own transpose, so by the spectral theorem it has two orthonormal eigenvectors with real eigenvalues. In other words, R^2 has a basis e_1, e_2 such that

• |e_1|_2 = 1;
• |e_2|_2 = 1;
• the dot product e_1*e_2 is 0 (here we view vectors as column matrices);
• P*P e_1 = λ_1 e_1 for some λ_1 ∈ R; and
• P*P e_2 = λ_2 e_2 for some λ_2 ∈ R.

Any v ∈ R^2 with |v|_2 = 1 can be written uniquely as c_1 e_1 + c_2 e_2 for some c_1, c_2 ∈ R with c_1^2 + c_2^2 = 1: explicitly, c_1 = v*e_1 and c_2 = v*e_2. Then P*Pv = λ_1 c_1 e_1 + λ_2 c_2 e_2.

By definition |P|_2^2 is the maximum of |Pv|_2^2 over v ∈ R^2 with |v|_2 = 1. Note that |Pv|_2^2 = (Pv)*(Pv) = v*P*Pv = λ_1 c_1^2 + λ_2 c_2^2 with c_1, c_2 defined as above. For example, 0 ≤ |Pe_1|_2^2 = λ_1 and 0 ≤ |Pe_2|_2^2 = λ_2. Thus |Pv|_2^2 achieves max{λ_1, λ_2}. It cannot be larger than this: if c_1^2 + c_2^2 = 1 then λ_1 c_1^2 + λ_2 c_2^2 ≤ max{λ_1, λ_2}. Hence |P|_2^2 = max{λ_1, λ_2}.

To finish the proof, we make the spectral theorem explicit for the matrix P*P = (a b; c d) ∈ M_2(R). Note that (P*P)* = P*P, so b = c. The characteristic polynomial of this matrix is (x − a)(x − d) − b^2 = x^2 − (a + d)x + ad − b^2, with roots

(a + d ± √((a + d)^2 − 4(ad − b^2)))/2 = (a + d ± √((a − d)^2 + 4b^2))/2.

The largest eigenvalue of P*P is thus (a + d + √((a − d)^2 + 4b^2))/2, and |P|_2 is the square root of this.

Theorem F.12. Let P be an element of M_2(R). Assume that P*P = (a b; c d), where P* is the transpose of P. Let N be an element of R. Then N ≥ |P|_2 if and only if N ≥ 0, 2N^2 ≥ a + d, and (2N^2 − a − d)^2 ≥ (a − d)^2 + 4b^2.


earlybounds = {0: 1, 1: 1, 2: 689491/2^20, 3: 779411/2^21, 4: 880833/2^22, 5: 165219/2^20,
  6: 97723/2^20, 7: 882313/2^24, 8: 306733/2^23, 9: 92045/2^22, 10: 439213/2^25,
  11: 281681/2^25, 12: 689007/2^27, 13: 824303/2^28, 14: 257817/2^27, 15: 634229/2^29,
  16: 386245/2^29, 17: 942951/2^31, 18: 583433/2^31, 19: 713653/2^32, 20: 432891/2^32,
  21: 133569/2^31, 22: 328293/2^33, 23: 800421/2^35, 24: 489233/2^35, 25: 604059/2^36,
  26: 738889/2^37, 27: 112215/2^35, 28: 276775/2^37, 29: 84973/2^36, 30: 829297/2^40,
  31: 253443/2^39, 32: 625405/2^41, 33: 95625/2^39, 34: 465055/2^42, 35: 286567/2^42,
  36: 175951/2^42, 37: 858637/2^45, 38: 65647/2^42, 39: 40469/2^42, 40: 24751/2^42,
  41: 240917/2^46, 42: 593411/2^48, 43: 364337/2^48, 44: 889015/2^50, 45: 543791/2^50,
  46: 41899/2^47, 47: 205005/2^50, 48: 997791/2^53, 49: 307191/2^52, 50: 754423/2^54,
  51: 57527/2^51, 52: 281515/2^54, 53: 694073/2^56, 54: 212249/2^55, 55: 258273/2^56,
  56: 636093/2^58, 57: 781081/2^59, 58: 952959/2^60, 59: 291475/2^59, 60: 718599/2^61,
  61: 878997/2^62, 62: 534821/2^62, 63: 329285/2^62, 64: 404341/2^63, 65: 986633/2^65,
  66: 603553/2^65}

def alpha(w):
    if w >= 67: return (633/1024)^w
    return earlybounds[w]

assert all(alpha(w)^49 < 2^(-(34*w-23)) for w in range(31, 100))
assert min((633/1024)^w/alpha(w) for w in range(68)) == 633^5/(2^30*165219)

Figure F.14: Definition of α_w for w ∈ {0, 1, 2, …}; computer verification that α_w < 2^(−(34w−23)/49) for 31 ≤ w ≤ 99; and computer verification that min{(633/1024)^w/α_w : 0 ≤ w ≤ 67} = 633^5/(2^30·165219). If w ≥ 67 then α_w = (633/1024)^w.

Our upper-bound computations work with matrices with rational entries. We use Theorem F.12 to compare the 2-norms of these matrices to various rational bounds N. All of the computations here use exact arithmetic, avoiding the correctness questions that would have shown up if we had used approximations to the square roots in Theorem F.11. Sage includes exact representations of algebraic numbers such as |P|_2, but switching to Theorem F.12 made our upper-bound computations an order of magnitude faster.
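To make the square-root-free comparison concrete, here is a minimal standalone sketch (not one of the paper's scripts; the helper names and example values are illustrative) of the test in Theorem F.12 using exact rational arithmetic:

    from fractions import Fraction

    def norm2_at_most(P, N):
        # P is a 2x2 matrix ((p11, p12), (p21, p22)) with Fraction entries;
        # N is a Fraction.  Returns True iff |P|_2 <= N, via Theorem F.12
        # applied to P*P = (a b; b d), with no square roots.
        (p11, p12), (p21, p22) = P
        a = p11*p11 + p21*p21
        b = p11*p12 + p21*p22
        d = p12*p12 + p22*p22
        X = 2*N*N - a - d
        return N >= 0 and X >= 0 and X*X >= (a - d)**2 + 4*b*b

    # example: M(1,1) = (0 1/2; -1/2 1/4) has 2-norm about 0.64
    M = ((Fraction(0), Fraction(1, 2)), (Fraction(-1, 2), Fraction(1, 4)))
    assert norm2_at_most(M, Fraction(65, 100))
    assert not norm2_at_most(M, Fraction(1, 2))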

Proof. Assume that N ≥ 0, that 2N^2 ≥ a + d, and that (2N^2 − a − d)^2 ≥ (a − d)^2 + 4b^2. Then 2N^2 − a − d ≥ √((a − d)^2 + 4b^2) since 2N^2 − a − d ≥ 0, so N^2 ≥ (a + d + √((a − d)^2 + 4b^2))/2, so N ≥ √((a + d + √((a − d)^2 + 4b^2))/2) since N ≥ 0, so N ≥ |P|_2 by Theorem F.11.

Conversely, assume that N ≥ |P|_2. Then N ≥ 0. Also N^2 ≥ |P|_2^2, so 2N^2 ≥ 2|P|_2^2 = a + d + √((a − d)^2 + 4b^2) by Theorem F.11. In particular 2N^2 ≥ a + d and (2N^2 − a − d)^2 ≥ (a − d)^2 + 4b^2.

F.13. Upper bounds. We now show that every weight-w product of the matrices in Theorem E.2 has norm at most α_w. Here α_0, α_1, α_2, … is an exponentially decreasing sequence of positive real numbers defined in Figure F.14.

Definition F.15. For e ∈ {1, 2, …} and q ∈ {1, 3, 5, …, 2^(e+1) − 1}, define M(e, q) = (0, 1/2^e; −1/2^e, q/2^(2e)).


Theorem F.16. |M(e, q)|_2 < (1 + √2)/2^e if e ∈ {1, 2, …} and q ∈ {1, 3, 5, …, 2^(e+1) − 1}.

Proof. Define (a b; c d) = M(e, q)*M(e, q). Then a = 1/2^(2e), b = c = −q/2^(3e), and d = 1/2^(2e) + q^2/2^(4e), so a + d = 2/2^(2e) + q^2/2^(4e) < 6/2^(2e) and (a − d)^2 + 4b^2 = 4q^2/2^(6e) + q^4/2^(8e) ≤ 32/2^(4e). By Theorem F.11, |M(e, q)|_2 = √((a + d + √((a − d)^2 + 4b^2))/2) ≤ √((6/2^(2e) + √32/2^(2e))/2) = (1 + √2)/2^e.

Definition F.17. Define β_w = inf{α_{w+j}/α_j : j ≥ 0} for w ∈ {0, 1, 2, …}.

Theorem F.18. If w ∈ {0, 1, 2, …} then β_w = min{α_{w+j}/α_j : 0 ≤ j ≤ 67}. If w ≥ 67 then β_w = (633/1024)^w · 633^5/(2^30·165219).

The first statement reduces the computation of β_w to a finite computation. The computation can be further optimized, but is not a bottleneck for us. Similar comments apply to γ_{w,e} in Theorem F.20.

Proof. By definition α_w = (633/1024)^w for all w ≥ 67. If j ≥ 67 then also w + j ≥ 67, so α_{w+j}/α_j = (633/1024)^(w+j)/(633/1024)^j = (633/1024)^w. In short, all terms α_{w+j}/α_j for j ≥ 67 are identical, so the infimum of α_{w+j}/α_j for j ≥ 0 is the same as the minimum for 0 ≤ j ≤ 67.

If w ≥ 67 then α_{w+j}/α_j = (633/1024)^(w+j)/α_j. The quotient (633/1024)^j/α_j for 0 ≤ j ≤ 67 has minimum value 633^5/(2^30·165219).

Definition F.19. Define γ_{w,e} = inf{β_{w+j}·2^j·70/169 : j ≥ e} for w ∈ {0, 1, 2, …} and e ∈ {1, 2, 3, …}.

This quantity γ_{w,e} is designed to be as large as possible subject to two conditions: first, γ_{w,e} ≤ β_{w+e}·2^e·70/169, used in Theorem F.21; second, γ_{w,e} ≤ γ_{w,e+1}, used in Theorem F.22.

Theorem F.20. γ_{w,e} = min{β_{w+j}·2^j·70/169 : e ≤ j ≤ e + 67} if w ∈ {0, 1, 2, …} and e ∈ {1, 2, 3, …}. If w + e ≥ 67 then γ_{w,e} = 2^e·(633/1024)^(w+e)·(70/169)·633^5/(2^30·165219).

Proof. If w + j ≥ 67 then β_{w+j} = (633/1024)^(w+j)·633^5/(2^30·165219), so β_{w+j}·2^j·70/169 = (633/1024)^w·(633/512)^j·(70/169)·633^5/(2^30·165219), which increases monotonically with j.

In particular, if j ≥ e + 67 then w + j ≥ 67, so β_{w+j}·2^j·70/169 ≥ β_{w+e+67}·2^(e+67)·70/169. Hence γ_{w,e} = inf{β_{w+j}·2^j·70/169 : j ≥ e} = min{β_{w+j}·2^j·70/169 : e ≤ j ≤ e + 67}. Also, if w + e ≥ 67 then w + j ≥ 67 for all j ≥ e, so

γ_{w,e} = (633/1024)^w·(633/512)^e·(70/169)·633^5/(2^30·165219) = 2^e·(633/1024)^(w+e)·(70/169)·633^5/(2^30·165219)

as claimed.

Theorem F.21. Define S as the smallest subset of Z × M_2(Q) such that

• (0, (1 0; 0 1)) ∈ S, and

• if (w, P) ∈ S, and w = 0 or |P|_2 > β_w, and e ∈ {1, 2, 3, …} and |P|_2 > γ_{w,e} and q ∈ {1, 3, 5, …, 2^(e+1) − 1}, then (e + w, M(e, q)P) ∈ S.

Assume that |P|_2 ≤ α_w for each (w, P) ∈ S. Then |M(e_j, q_j) ⋯ M(e_1, q_1)|_2 ≤ α_{e_1+⋯+e_j} for all j ∈ {0, 1, 2, …}, all e_1, …, e_j ∈ {1, 2, 3, …}, and all q_1, …, q_j ∈ Z such that q_i ∈ {1, 3, 5, …, 2^(e_i+1) − 1} for each i.


The idea here is that, instead of searching through all products P of matrices M(e, q) of total weight w to see whether |P|_2 ≤ α_w, we prune the search in a way that makes the search finite across all w (see Theorem F.22), while still being sure to find a minimal counterexample if one exists. The first layer of pruning skips all descendants QP of P if w > 0 and |P|_2 ≤ β_w; the point is that any such counterexample QP implies a smaller counterexample Q. The second layer of pruning skips weight-(w + e) children M(e, q)P of P if |P|_2 ≤ γ_{w,e}; the point is that any such child is eliminated by the first layer of pruning.

Proof. Induct on j.

If j = 0 then M(e_j, q_j) ⋯ M(e_1, q_1) = (1 0; 0 1), with norm 1 = α_0. Assume from now on that j ≥ 1.

Case 1: |M(e_i, q_i) ⋯ M(e_1, q_1)|_2 ≤ β_{e_1+⋯+e_i} for some i ∈ {1, 2, …, j}.

Then j − i < j, so |M(e_j, q_j) ⋯ M(e_{i+1}, q_{i+1})|_2 ≤ α_{e_{i+1}+⋯+e_j} by the inductive hypothesis, so |M(e_j, q_j) ⋯ M(e_1, q_1)|_2 ≤ β_{e_1+⋯+e_i} α_{e_{i+1}+⋯+e_j} by Theorem F.10; but β_{e_1+⋯+e_i} α_{e_{i+1}+⋯+e_j} ≤ α_{e_1+⋯+e_j} by definition of β.

Case 2: |M(e_i, q_i) ⋯ M(e_1, q_1)|_2 > β_{e_1+⋯+e_i} for every i ∈ {1, 2, …, j}.

Suppose that |M(e_{i−1}, q_{i−1}) ⋯ M(e_1, q_1)|_2 ≤ γ_{e_1+⋯+e_{i−1}, e_i} for some i ∈ {1, 2, …, j}. By Theorem F.16, |M(e_i, q_i)|_2 < (1 + √2)/2^(e_i) < (169/70)/2^(e_i). By Theorem F.10, |M(e_i, q_i) ⋯ M(e_1, q_1)|_2 ≤ |M(e_i, q_i)|_2 γ_{e_1+⋯+e_{i−1}, e_i} ≤ 169 γ_{e_1+⋯+e_{i−1}, e_i}/(70·2^(e_i)) ≤ β_{e_1+⋯+e_i}, contradiction. Thus |M(e_{i−1}, q_{i−1}) ⋯ M(e_1, q_1)|_2 > γ_{e_1+⋯+e_{i−1}, e_i} for all i ∈ {1, 2, …, j}.

Now (e_1 + ⋯ + e_i, M(e_i, q_i) ⋯ M(e_1, q_1)) ∈ S for each i ∈ {0, 1, 2, …, j}. Indeed, induct on i. The base case i = 0 is simply (0, (1 0; 0 1)) ∈ S. If i ≥ 1 then (w, P) ∈ S by the inductive hypothesis, where w = e_1 + ⋯ + e_{i−1} and P = M(e_{i−1}, q_{i−1}) ⋯ M(e_1, q_1). Furthermore:

• w = 0 (when i = 1) or |P|_2 > β_w (when i ≥ 2, by definition of this case);
• e_i ∈ {1, 2, 3, …};
• |P|_2 > γ_{w, e_i} (as shown above); and
• q_i ∈ {1, 3, 5, …, 2^(e_i+1) − 1}.

Hence (e_1 + ⋯ + e_i, M(e_i, q_i) ⋯ M(e_1, q_1)) = (e_i + w, M(e_i, q_i)P) ∈ S by definition of S.

In particular (w, P) ∈ S with w = e_1 + ⋯ + e_j and P = M(e_j, q_j) ⋯ M(e_1, q_1), so |P|_2 ≤ α_w by hypothesis.

Theorem F.22. In Theorem F.21, S is finite, and |P|_2 ≤ α_w for each (w, P) ∈ S.

Proof. This is a computer proof. We run the Sage script in Figure F.23. We observe that it terminates successfully (in slightly over 2546 minutes using Sage 8.6 on one core of a 3.5GHz Intel Xeon E3-1275 v3 CPU) and prints 3787975117.

What the script does is search depth-first through the elements of S, verifying that each (w, P) ∈ S satisfies |P|_2 ≤ α_w. The output is the number of elements searched, so S has at most 3787975117 elements.

Specifically, if (w, P) ∈ S, then verify(w,P) checks that |P|_2 ≤ α_w and recursively calls verify on all descendants of (w, P). Actually, for efficiency, the script scales each matrix to have integer entries: it works with 2^(2e) M(e, q) (labeled scaledM(e,q) in the script) instead of M(e, q), and it works with 4^w P (labeled P in the script) instead of P.

The script computes β_w as shown in Theorem F.18, computes γ_{w,e} as shown in Theorem F.20, and compares |P|_2 to various bounds as shown in Theorem F.12.

If w > 0 and |P|_2 ≤ β_w then verify(w,P) returns. There are no descendants of (w, P) in this case. Also |P|_2 ≤ α_w since β_w ≤ α_w/α_0 = α_w.


from alpha import alpha
from memoized import memoized

R = MatrixSpace(ZZ, 2)
def scaledM(e, q): return R((0, 2^e, -2^e, q))

@memoized
def beta(w): return min(alpha(w+j)/alpha(j) for j in range(68))

@memoized
def gamma(w, e): return min(beta(w+j)*2^j*70/169 for j in range(e, e+68))

def spectralradiusisatmost(PP, N):   # assuming PP has form P.transpose()*P
    (a, b), (c, d) = PP
    X = 2*N^2 - a - d
    return N >= 0 and X >= 0 and X^2 >= (a-d)^2 + 4*b^2

def verify(w, P):
    nodes = 1
    PP = P.transpose()*P
    if w > 0 and spectralradiusisatmost(PP, 4^w*beta(w)): return nodes
    assert spectralradiusisatmost(PP, 4^w*alpha(w))
    for e in PositiveIntegers():
        if spectralradiusisatmost(PP, 4^w*gamma(w, e)): return nodes
        for q in range(1, 2^(e+1), 2):
            nodes += verify(e+w, scaledM(e, q)*P)

print verify(0, R(1))

Figure F.23: Sage script verifying |P|_2 ≤ α_w for each (w, P) ∈ S. Output is the number of pairs (w, P) considered, if verification succeeds; or an assertion failure, if verification fails.

Otherwise verify(w,P) asserts that |P|_2 ≤ α_w; i.e., it checks that |P|_2 ≤ α_w, and raises an exception if not. It then tries successively e = 1, e = 2, e = 3, etc., continuing as long as |P|_2 > γ_{w,e}. For each e where |P|_2 > γ_{w,e}, verify(w,P) recursively handles (e + w, M(e, q)P) for each q ∈ {1, 3, 5, …, 2^(e+1) − 1}.

Note that γ_{w,e} ≤ γ_{w,e+1} ≤ ⋯, so if |P|_2 ≤ γ_{w,e} then also |P|_2 ≤ γ_{w,e+1} etc. This is where it is important that γ_{w,e} is not simply defined as β_{w+e}·2^e·70/169.

To summarize, verify(w,P) recursively enumerates all descendants of (w, P). Hence verify(0,R(1)) recursively enumerates all elements (w, P) of S, checking |P|_2 ≤ α_w for each (w, P).

Theorem F.24. If j ∈ {0, 1, 2, …}, e_1, …, e_j ∈ {1, 2, 3, …}, q_1, …, q_j ∈ Z, and q_i ∈ {1, 3, 5, …, 2^(e_i+1) − 1} for each i, then |M(e_j, q_j) ⋯ M(e_1, q_1)|_2 ≤ α_{e_1+⋯+e_j}.

Proof. This is the conclusion of Theorem F.21, so it suffices to check the assumptions. Define S as in Theorem F.21; then |P|_2 ≤ α_w for each (w, P) ∈ S by Theorem F.22.


Theorem F.25. In the situation of Theorem E.3, assume that R_0, R_1, R_2, …, R_{j+1} are defined. Then |(R_j/2^(e_j), R_{j+1})|_2 ≤ α_{e_1+⋯+e_j} |(R_0, R_1)|_2, and if e_1 + ⋯ + e_j ≥ 67 then e_1 + ⋯ + e_j ≤ log_{1024/633} |(R_0, R_1)|_2.

If R_0 and R_1 are each bounded by 2^b in absolute value then |(R_0, R_1)|_2 is bounded by 2^(b+0.5), so log_{1024/633} |(R_0, R_1)|_2 ≤ (b + 0.5)/log_2(1024/633) = (b + 0.5)·(1.4410…).

Proof. By Theorem E.2, (R_i/2^(e_i); R_{i+1}) = M(e_i, q_i)(R_{i−1}/2^(e_{i−1}); R_i), where q_i = 2^(e_i)((R_{i−1}/2^(e_{i−1})) div_2 R_i) ∈ {1, 3, 5, …, 2^(e_i+1) − 1}. Hence

(R_j/2^(e_j); R_{j+1}) = M(e_j, q_j) ⋯ M(e_1, q_1)(R_0; R_1).

The product M(e_j, q_j) ⋯ M(e_1, q_1) has 2-norm at most α_w, where w = e_1 + ⋯ + e_j, so |(R_j/2^(e_j), R_{j+1})|_2 ≤ α_w |(R_0, R_1)|_2 by Theorem F.9.

By Theorem F.6, |(R_j/2^(e_j), R_{j+1})|_2 ≥ 1. If e_1 + ⋯ + e_j ≥ 67 then α_{e_1+⋯+e_j} = (633/1024)^(e_1+⋯+e_j), so 1 ≤ (633/1024)^(e_1+⋯+e_j) |(R_0, R_1)|_2.

Theorem F.26. In the situation of Theorem E.3, there exists t ≥ 0 such that R_{t+1} = 0.

Proof. Suppose not. Then R_0, R_1, …, R_j, R_{j+1} are all defined for arbitrarily large j. Take j ≥ 67 with j > log_{1024/633} |(R_0, R_1)|_2. Then e_1 + ⋯ + e_j ≥ 67, so j ≤ e_1 + ⋯ + e_j ≤ log_{1024/633} |(R_0, R_1)|_2 < j by Theorem F.25, contradiction.

G. Proof of main gcd theorem for integers

This appendix shows how the gcd algorithm from Appendix E is related to iterating divstep. This appendix then uses the analysis from Appendix F to put bounds on the number of divstep iterations needed.

Theorem G.1 (2-adic division as a sequence of division steps). In the situation of Theorem E.1, divstep^(2e)(1, f, g/2) = (1, g/2^e, (qg/2^e − f)/2^(e+1)).

In other words, divstep^(2e)(1, f, g/2) = (1, g/2^e, −(f mod_2 g)/2^(e+1)).

Proof. Define

z_e = (g/2^e − f)/2 ∈ Z_2,
z_{e+1} = (z_e + (z_e mod 2)g/2^e)/2 ∈ Z_2,
z_{e+2} = (z_{e+1} + (z_{e+1} mod 2)g/2^e)/2 ∈ Z_2,
…,
z_{2e} = (z_{2e−1} + (z_{2e−1} mod 2)g/2^e)/2 ∈ Z_2.

Then

z_e = (g/2^e − f)/2,
z_{e+1} = ((1 + 2(z_e mod 2))g/2^e − f)/4,
z_{e+2} = ((1 + 2(z_e mod 2) + 4(z_{e+1} mod 2))g/2^e − f)/8,
…,
z_{2e} = ((1 + 2(z_e mod 2) + ⋯ + 2^e(z_{2e−1} mod 2))g/2^e − f)/2^(e+1).


In short, z_{2e} = (q′g/2^e − f)/2^(e+1), where

q′ = 1 + 2(z_e mod 2) + ⋯ + 2^e(z_{2e−1} mod 2) ∈ {1, 3, 5, …, 2^(e+1) − 1}.

Hence q′g/2^e − f = 2^(e+1) z_{2e} ∈ 2^(e+1) Z_2, so q′ = q by the uniqueness statement in Theorem E.1. Note that this construction gives an alternative proof of the existence of q in Theorem E.1.

All that remains is to prove divstep^(2e)(1, f, g/2) = (1, g/2^e, (qg/2^e − f)/2^(e+1)), i.e., divstep^(2e)(1, f, g/2) = (1, g/2^e, z_{2e}).

If 1 ≤ i < e then g/2^i ∈ 2Z_2, so divstep(i, f, g/2^i) = (i + 1, f, g/2^(i+1)). By induction, divstep^(e−1)(1, f, g/2) = (e, f, g/2^e). By assumption e > 0 and g/2^e is odd, so divstep(e, f, g/2^e) = (1 − e, g/2^e, (g/2^e − f)/2) = (1 − e, g/2^e, z_e). What remains is to prove that divstep^e(1 − e, g/2^e, z_e) = (1, g/2^e, z_{2e}).

If 0 ≤ i ≤ e − 1 then 1 − e + i ≤ 0, so divstep(1 − e + i, g/2^e, z_{e+i}) = (2 − e + i, g/2^e, (z_{e+i} + (z_{e+i} mod 2)g/2^e)/2) = (2 − e + i, g/2^e, z_{e+i+1}). By induction, divstep^i(1 − e, g/2^e, z_e) = (1 − e + i, g/2^e, z_{e+i}) for 0 ≤ i ≤ e. In particular divstep^e(1 − e, g/2^e, z_e) = (1, g/2^e, z_{2e}), as claimed.
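As a quick illustration (not from the paper), here is a direct Python transcription of the integer divstep used throughout this appendix, together with a numerical check of Theorem G.1; the function name and the example values are ours:

    def divstep(delta, f, g):
        # one 2-adic division step on integers; f is odd
        if delta > 0 and g % 2 == 1:
            return 1 - delta, g, (g - f) // 2
        return 1 + delta, f, (g + (g % 2) * f) // 2

    # Theorem G.1 with f = 7, g = 12: here e = 2, g/2^e = 3, and q = 5 is the unique
    # element of {1,3,5,7} with q*(g/2^e) = f modulo 2^(e+1).  The theorem predicts
    # divstep^(2e)(1, f, g/2) = (1, g/2^e, (q*g/2^e - f)/2^(e+1)) = (1, 3, 1).
    state = (1, 7, 6)
    for _ in range(4):
        state = divstep(*state)
    assert state == (1, 3, 1)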

Theorem G.2 (2-adic gcd as a sequence of division steps). In the situation of Theorem E.3, assume that R_0, R_1, …, R_j, R_{j+1} are defined. Then divstep^(2w)(1, R_0, R_1/2) = (1, R_j/2^(e_j), R_{j+1}/2), where w = e_1 + ⋯ + e_j.

Therefore, if R_{t+1} = 0, then divstep^n(1, R_0, R_1/2) = (1 + n − 2w, R_t/2^(e_t), 0) for all n ≥ 2w, where w = e_1 + ⋯ + e_t.

Proof. Induct on j. If j = 0 then w = 0, so divstep^(2w)(1, R_0, R_1/2) = (1, R_0, R_1/2), and R_0 = R_0/2^(e_0) since R_0 is odd by hypothesis.

If j ≥ 1 then by definition R_{j+1} = −((R_{j−1}/2^(e_{j−1})) mod_2 R_j)/2^(e_j). Furthermore

divstep^(2e_1+⋯+2e_{j−1})(1, R_0, R_1/2) = (1, R_{j−1}/2^(e_{j−1}), R_j/2)

by the inductive hypothesis, so

divstep^(2e_1+⋯+2e_j)(1, R_0, R_1/2) = divstep^(2e_j)(1, R_{j−1}/2^(e_{j−1}), R_j/2) = (1, R_j/2^(e_j), R_{j+1}/2)

by Theorem G.1.

Theorem G.3. Let R_0 be an odd element of Z. Let R_1 be an element of 2Z. Then there exists n ∈ {0, 1, 2, …} such that divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0}. Furthermore, any such n has divstep^n(1, R_0, R_1/2) ∈ Z × {G, −G} × {0}, where G = gcd{R_0, R_1}.

Proof. Define e_0, e_1, e_2, e_3, … and R_2, R_3, … as in Theorem E.2. By Theorem F.26 there exists t ≥ 0 such that R_{t+1} = 0. By Theorem E.3, R_t/2^(e_t) ∈ {G, −G}. By Theorem G.2, divstep^(2w)(1, R_0, R_1/2) = (1, R_t/2^(e_t), 0) ∈ Z × {G, −G} × {0}, where w = e_1 + ⋯ + e_t.

Conversely, assume that n ∈ {0, 1, 2, …} and that divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0}. Write divstep^n(1, R_0, R_1/2) as (δ, f, 0). Then divstep^(n+k)(1, R_0, R_1/2) = (k + δ, f, 0) for all k ∈ {0, 1, 2, …}, so in particular divstep^(2w+n)(1, R_0, R_1/2) = (2w + δ, f, 0). Also divstep^(2w+k)(1, R_0, R_1/2) = (k + 1, R_t/2^(e_t), 0) for all k ∈ {0, 1, 2, …}, so in particular divstep^(2w+n)(1, R_0, R_1/2) = (n + 1, R_t/2^(e_t), 0). Hence f = R_t/2^(e_t) ∈ {G, −G}, and divstep^n(1, R_0, R_1/2) ∈ Z × {G, −G} × {0} as claimed.

Theorem G.4. Let R_0 be an odd element of Z. Let R_1 be an element of 2Z. Define b = log_2 √(R_0^2 + R_1^2). Assume that b ≤ 21. Then divstep^⌊19b/7⌋(1, R_0, R_1/2) ∈ Z × Z × {0}.

Proof. If divstep(δ, f, g) = (δ_1, f_1, g_1) then divstep(δ, −f, −g) = (δ_1, −f_1, −g_1). Hence divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0} if and only if divstep^n(1, −R_0, −R_1/2) ∈ Z × Z × {0}. From now on assume without loss of generality that R_0 > 0.


 s   W_s              example (R_0, R_1)
 0   1                (1, 0)
 1   5                (1, −2)
 2   5                (1, −2)
 3   5                (1, −2)
 4   17               (1, −4)
 5   17               (1, −4)
 6   65               (1, −8)
 7   65               (1, −8)
 8   157              (11, −6)
 9   181              (9, −10)
10   421              (15, −14)
11   541              (21, −10)
12   1165             (29, −18)
13   1517             (29, −26)
14   2789             (17, −50)
15   3653             (47, −38)
16   8293             (47, −78)
17   8293             (47, −78)
18   24245            (121, −98)
19   27361            (55, −156)
20   56645            (191, −142)
21   79349            (215, −182)
22   132989           (283, −230)
23   213053           (133, −442)
24   415001           (451, −460)
25   613405           (501, −602)
26   1345061          (719, −910)
27   1552237          (1021, −714)
28   2866525          (789, −1498)
29   4598269          (1355, −1662)
30   7839829          (777, −2690)
31   12565573         (2433, −2578)
32   22372085         (2969, −3682)
33   35806445         (5443, −2486)
34   71013905         (4097, −7364)
35   74173637         (5119, −6926)
36   205509533        (9973, −10298)
37   226964725        (11559, −9662)
38   487475029        (11127, −19070)
39   534274997        (13241, −18946)
40   1543129037       (24749, −30506)
41   1639475149       (20307, −35030)
42   3473731181       (39805, −43466)
43   4500780005       (49519, −45262)
44   9497198125       (38051, −89718)
45   12700184149      (82743, −76510)
46   29042662405      (152287, −76494)
47   36511782869      (152713, −114850)
48   82049276437      (222249, −180706)
49   90638999365      (220417, −205074)
50   234231548149     (229593, −426050)
51   257130797053     (338587, −377478)
52   466845672077     (301069, −613354)
53   622968125533     (437163, −657158)
54   1386285660565    (600809, −1012578)
55   1876581808109    (604547, −1229270)
56   4208169535453    (1352357, −1542498)

Figure G.5: For each s ∈ {0, 1, …, 56}: minimum R_0^2 + R_1^2 where R_0 is an odd integer, R_1 is an even integer, and (R_0, R_1) requires ≥s iterations of divstep. Also an example of (R_0, R_1) reaching this minimum.

The rest of the proof is a computer proof. We enumerate all pairs (R_0, R_1) where R_0 is a positive odd integer, R_1 is an even integer, and R_0^2 + R_1^2 ≤ 2^42. For each (R_0, R_1) we iterate divstep to find the minimum n ≥ 0 where divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0}. We then say that (R_0, R_1) "needs n steps".

For each s ≥ 0, define W_s as the smallest R_0^2 + R_1^2 where (R_0, R_1) needs ≥s steps. The computation shows that each (R_0, R_1) needs ≤56 steps, and also shows the values W_s for each s ∈ {0, 1, …, 56}. We display these values in Figure G.5. We check, for each s ∈ {0, 1, …, 56}, that 2^(14s) ≤ W_s^19.

Finally, we claim that each (R_0, R_1) needs ≤⌊19b/7⌋ steps, where b = log_2 √(R_0^2 + R_1^2). If not, then (R_0, R_1) needs ≥s steps, where s = ⌊19b/7⌋ + 1. We have s ≤ 56 since (R_0, R_1) needs ≤56 steps. We also have W_s ≤ R_0^2 + R_1^2 by definition of W_s. Hence 2^(14s/19) ≤ R_0^2 + R_1^2, so 14s/19 ≤ 2b, so s ≤ 19b/7, contradiction.
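The step-counting procedure is easy to reproduce. Here is a small sketch (ours, not the verification software used for the theorem) that counts divsteps for a pair (R_0, R_1) and spot-checks the ⌊19b/7⌋ bound on some of the example pairs from Figure G.5:

    from math import floor, log2

    def divstep(delta, f, g):
        # one 2-adic division step on integers; f is odd
        if delta > 0 and g % 2 == 1:
            return 1 - delta, g, (g - f) // 2
        return 1 + delta, f, (g + (g % 2) * f) // 2

    def steps_needed(R0, R1):
        # minimum n with divstep^n(1, R0, R1/2) in Z x Z x {0}
        delta, f, g, n = 1, R0, R1 // 2, 0
        while g != 0:
            delta, f, g = divstep(delta, f, g)
            n += 1
        return n

    for R0, R1 in [(1, -2), (11, -6), (47, -78), (1021, -714)]:
        b = log2(R0*R0 + R1*R1) / 2
        assert steps_needed(R0, R1) <= floor(19*b/7)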

Theorem G.6 (Bounds on the number of divsteps required). Let R_0 be an odd element of Z. Let R_1 be an element of 2Z. Define b = log_2 √(R_0^2 + R_1^2). Define n = ⌊19b/7⌋ if b ≤ 21; n = ⌊(49b + 23)/17⌋ if 21 < b ≤ 46; and n = ⌊49b/17⌋ if b > 46. Then divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0}.


All three cases have n ≤ ⌊(49b + 23)/17⌋, since 19/7 < 49/17.

Proof. If b ≤ 21 then divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0} by Theorem G.4.

Assume from now on that b > 21. This implies n ≥ ⌊49b/17⌋ ≥ 60.

Define e_0, e_1, e_2, e_3, … and R_2, R_3, … as in Theorem E.2. By Theorem F.26 there exists t ≥ 0 such that R_{t+1} = 0. By Theorem E.3, R_t/2^(e_t) ∈ {G, −G}. By Theorem G.2, divstep^(2w)(1, R_0, R_1/2) = (1, R_t/2^(e_t), 0) ∈ Z × {G, −G} × {0}, where w = e_1 + ⋯ + e_t.

To finish the proof we will show that 2w ≤ n. Hence divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0} as claimed.

Suppose that 2w > n. Both 2w and n are integers, so 2w ≥ n + 1 ≥ 61, so w ≥ 31.

By Theorem F.6 and Theorem F.25, 1 ≤ |(R_t/2^(e_t), R_{t+1})|_2 ≤ α_w |(R_0, R_1)|_2 = α_w 2^b, so log_2(1/α_w) ≤ b.

Case 1: w ≥ 67. By definition 1/α_w = (1024/633)^w, so w log_2(1024/633) ≤ b. We have 34/49 < log_2(1024/633), so 34w/49 < b, so n + 1 ≤ 2w < 49b/17 < (49b + 23)/17. This contradicts the definition of n as the largest integer below 49b/17 if b > 46, and as the largest integer below (49b + 23)/17 if b ≤ 46.

Case 2: w ≤ 66. Then α_w < 2^(−(34w−23)/49) since w ≥ 31, so b ≥ lg(1/α_w) > (34w − 23)/49, so n + 1 ≤ 2w ≤ (49b + 23)/17. This again contradicts the definition of n if b ≤ 46. If b > 46 then n ≥ ⌊49·46/17⌋ = 132, so 2w ≥ n + 1 > 132 ≥ 2w, contradiction.

H. Comparison to performance of cdivstep

This appendix briefly compares the analysis of divstep from Appendix G to an analogous analysis of cdivstep.

First modify Theorem E.1 to take the unique element q ∈ {±1, ±3, …, ±(2^e − 1)} such that f + qg/2^e ∈ 2^(e+1) Z_2. The corresponding modification of Theorem G.1 says that cdivstep^(2e)(1, f, g/2) = (1, g/2^e, (f + qg/2^e)/2^(e+1)). The proof proceeds identically, except that z_j = (z_{j−1} − (z_{j−1} mod 2)g/2^e)/2 ∈ Z_2 for e < j < 2e and z_{2e} = (z_{2e−1} + (z_{2e−1} mod 2)g/2^e)/2 ∈ Z_2. Also, we have q = −1 − 2(z_e mod 2) − ⋯ − 2^(e−1)(z_{2e−2} mod 2) + 2^e(z_{2e−1} mod 2) ∈ {±1, ±3, …, ±(2^e − 1)}. In the notation of [62], GB(f, g) = (q, r), where r = f + qg/2^e = 2^(e+1) z_{2e}.

The corresponding gcd algorithm is the centered algorithm from Appendix E.6, with addition instead of subtraction. Recall that, except for inessential scalings by powers of 2, this is the same as the Stehlé–Zimmermann gcd algorithm from [62].

The corresponding set of transition matrices is different from the divstep case. As mentioned in Appendix F, there are weight-w products with spectral radius (0.6403882032…)^w, so the proof strategy for cdivstep cannot obtain weight-w norm bounds below this spectral radius. This is worse than our bounds for divstep.

Enumerating small pairs (R_0, R_1) as in Theorem G.4 also produces worse results for cdivstep than for divstep. See Figure H.1.


[Figure H.1: scatter plot of red and blue dots; horizontal axis 1 through 21, vertical axis −3 through 3. See caption.]

Figure H.1: For each s ∈ {1, 2, …, 56}: red dot at (b, s − 2.8283396b), where b is the minimum log_2 √(R_0^2 + R_1^2) such that R_0 is an odd integer, R_1 is an even integer, and (R_0, R_1) requires ≥s iterations of divstep. For each s ∈ {1, 2, …, 58}: blue dot at (b, s − 2.8283396b), where b is the minimum log_2 √(R_0^2 + R_1^2) such that R_0 is an odd integer, R_1 is an even integer, and (R_0, R_1) requires ≥s iterations of cdivstep.


We define f mod 2^n as f_0 + 2f_1 + ⋯ + 2^(n−1) f_{n−1} for any f = (f_0, f_1, …) ∈ Z_2.

The unit group of Z_2 is written Z_2^*. This is the same as the set of odd f ∈ Z_2, i.e., the set of f ∈ Z_2 such that f mod 2 ≠ 0.

If f ∈ Z_2 and f ≠ 0 then ord_2 f means the largest integer e ≥ 0 such that f ∈ 2^e Z_2, equivalently the unique integer e ≥ 0 such that f ∈ 2^e Z_2^*.

The ring Z_2 is a domain. The field Q_2 is by definition the field of fractions of Z_2. Each element of Q_2 can be written (not uniquely) as f/2^i for some f ∈ Z_2 and some i ∈ {0, 1, 2, …}. The subring Z[1/2] is the smallest subring of Q_2 containing both Z and 1/2.

3. Definition of x-adic division steps

Define divstep: Z × k[[x]]^* × k[[x]] → Z × k[[x]]^* × k[[x]] as follows:

divstep(δ, f, g) = (1 − δ, g, (g(0)f − f(0)g)/x)  if δ > 0 and g(0) ≠ 0,
divstep(δ, f, g) = (1 + δ, f, (f(0)g − g(0)f)/x)  otherwise.

Note that f(0)g − g(0)f is divisible by x in k[[x]]. The name "division step" is justified in Theorem C.1, which shows that dividing a degree-d_0 polynomial by a degree-d_1 polynomial with 0 ≤ d_1 < d_0 can be viewed as computing divstep^(2d_0−2d_1).
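As a small illustration (ours, not part of the paper), one x-adic division step can be applied to truncated power series represented as coefficient lists; the values below are taken from the example tabulated in Table 4.1 (k = F_7), and the helper name is ours:

    p = 7
    f = [2, 0, 1, 1, 2, 1, 1, 1, 0, 0]   # 2 + 7x + 1x^2 + ... reduced mod 7
    g = [3, 1, 4, 1, 5, 2, 2, 0, 0, 0]
    delta = 1

    def divstep_trunc(delta, f, g):
        # one divstep on truncated series: conditional swap, then elimination;
        # the output g has one fewer known coefficient (division by x)
        if delta > 0 and g[0] != 0:
            delta, f, g = -delta, g, f
        new_g = [(f[0]*g[i] - g[0]*f[i]) % p for i in range(1, len(g))]
        return 1 + delta, f, new_g

    delta, f, g = divstep_trunc(delta, f, g)
    assert (delta, f[0], g[:7]) == (0, 3, [5, 2, 1, 3, 6, 6, 3])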

3.1. Transition matrices. Write (δ_1, f_1, g_1) = divstep(δ, f, g). Then (f_1; g_1) = T(δ, f, g)(f; g) and (1; δ_1) = S(δ, f, g)(1; δ), where T: Z × k[[x]]^* × k[[x]] → M_2(k[1/x]) is defined by

T(δ, f, g) = (0, 1; g(0)/x, −f(0)/x)  if δ > 0 and g(0) ≠ 0,
T(δ, f, g) = (1, 0; −g(0)/x, f(0)/x)  otherwise,

and S: Z × k[[x]]^* × k[[x]] → M_2(Z) is defined by

S(δ, f, g) = (1, 0; 1, −1)  if δ > 0 and g(0) ≠ 0,
S(δ, f, g) = (1, 0; 1, 1)  otherwise.

Note that both T(δ, f, g) and S(δ, f, g) are defined entirely by (δ, f(0), g(0)); they do not depend on the remaining coefficients of f and g.

3.2. Decomposition. This divstep function is a composition of two simpler functions (and the transition matrices can similarly be decomposed):

• a conditional swap replacing (δ, f, g) with (−δ, g, f) if δ > 0 and g(0) ≠ 0, followed by

• an elimination replacing (δ, f, g) with (1 + δ, f, (f(0)g − g(0)f)/x).


A:  δ      f[0]  g[0]  f[1]  g[1]  f[2]  g[2]  f[3]  g[3]  ⋯
        (conditional swap, controlled by the swap bit; δ becomes −δ if swapping)
B:  δ′     f′[0]  g′[0]  f′[1]  g′[1]  f′[2]  g′[2]  f′[3]  g′[3]  ⋯
        (elimination)
C:  1+δ′   f″[0]  g″[0]  f″[1]  g″[1]  f″[2]  g″[2]  f″[3]  ⋯

Figure 3.3: Data flow in an x-adic division step, decomposed as a conditional swap (A to B) followed by an elimination (B to C). The swap bit is set if δ > 0 and g[0] ≠ 0. The g outputs are f′[0]g′[1] − g′[0]f′[1], f′[0]g′[2] − g′[0]f′[2], f′[0]g′[3] − g′[0]f′[3], etc. Color scheme: consider, e.g., a linear combination cg − df of power series f, g, where c = f[0] and d = g[0]; the inputs f, g affect the jth output coefficient cg[j] − df[j] via f[j], g[j] (black) and via the choice of c, d (red+blue). Green operations are special, creating the swap bit.

See Figure 3.3.

This decomposition is our motivation for defining divstep in the first case (the swapped case) to compute (g(0)f − f(0)g)/x. Using (f(0)g − g(0)f)/x in both cases might seem simpler, but would require the conditional swap to forward a conditional negation to the elimination.

One can also compose these functions in the opposite order. This is compatible with some of what we say later about iterating the composition. However, the corresponding transition matrices would depend on another coefficient of g. This would complicate our algorithm statements.

Another possibility, as in [23], is to keep f and g in place, using the sign of δ to decide whether to add a multiple of f into g or to add a multiple of g into f. One way to do this in constant time is to perform two multiplications and two additions. Another way is to perform conditional swaps before and after one addition and one multiplication. Our approach is more efficient.

3.4. Scaling. One can define a variant of divstep that computes f − (f(0)/g(0))g in the first case (the swapped case) and g − (g(0)/f(0))f in the second case.

This variant has two advantages. First, it multiplies only one series, rather than two, by a scalar in k. Second, multiplying both f and g by any u ∈ k[[x]]^* has the same effect on the output, so this variant induces a function on fewer variables (δ, g/f): the ratio ρ = g/f is replaced by (1/ρ − 1/ρ(0))/x when δ > 0 and ρ(0) ≠ 0, and by (ρ − ρ(0))/x otherwise, while δ is replaced by 1 − δ or 1 + δ respectively. This is the composition of, first, a conditional inversion that replaces (δ, ρ) with (−δ, 1/ρ) if δ > 0 and ρ(0) ≠ 0; second, an elimination that replaces ρ with (ρ − ρ(0))/x while adding 1 to δ.

On the other hand, working projectively with f, g is generally more efficient than working with ρ. Working with f(0)g − g(0)f rather than g − (g(0)/f(0))f has the virtue of being a "fraction-free" computation: it involves only ring operations in k, not divisions. This makes the effect of scaling only slightly more difficult to track. Specifically, say divstep(δ, f, g) =


Table 4.1: Iterates (δ_n, f_n, g_n) = divstep^n(δ, f, g) for k = F_7, δ = 1, f = 2 + 7x + 1x^2 + 8x^3 + 2x^4 + 8x^5 + 1x^6 + 8x^7, and g = 3 + 1x + 4x^2 + 1x^3 + 5x^4 + 9x^5 + 2x^6. Each power series is displayed as a row of coefficients of x^0, x^1, etc.

 n  δ_n   f_n (x^0 … x^9)        g_n (x^0 … x^9)
 0   1    2 0 1 1 2 1 1 1 0 0    3 1 4 1 5 2 2 0 0 0
 1   0    3 1 4 1 5 2 2 0 0 0    5 2 1 3 6 6 3 0 0 0
 2   1    3 1 4 1 5 2 2 0 0 0    1 4 4 0 1 6 0 0 0 0
 3   0    1 4 4 0 1 6 0 0 0 0    3 6 1 2 5 2 0 0 0 0
 4   1    1 4 4 0 1 6 0 0 0 0    1 3 2 2 5 0 0 0 0 0
 5   0    1 3 2 2 5 0 0 0 0 0    1 2 5 3 6 0 0 0 0 0
 6   1    1 3 2 2 5 0 0 0 0 0    6 3 1 1 0 0 0 0 0 0
 7   0    6 3 1 1 0 0 0 0 0 0    1 4 4 2 0 0 0 0 0 0
 8   1    6 3 1 1 0 0 0 0 0 0    0 2 4 0 0 0 0 0 0 0
 9   2    6 3 1 1 0 0 0 0 0 0    5 3 0 0 0 0 0 0 0 0
10  −1    5 3 0 0 0 0 0 0 0 0    4 5 5 0 0 0 0 0 0 0
11   0    5 3 0 0 0 0 0 0 0 0    6 4 0 0 0 0 0 0 0 0
12   1    5 3 0 0 0 0 0 0 0 0    2 0 0 0 0 0 0 0 0 0
13   0    2 0 0 0 0 0 0 0 0 0    6 0 0 0 0 0 0 0 0 0
14   1    2 0 0 0 0 0 0 0 0 0    0 0 0 0 0 0 0 0 0 0
15   2    2 0 0 0 0 0 0 0 0 0    0 0 0 0 0 0 0 0 0 0
16   3    2 0 0 0 0 0 0 0 0 0    0 0 0 0 0 0 0 0 0 0
17   4    2 0 0 0 0 0 0 0 0 0    0 0 0 0 0 0 0 0 0 0
18   5    2 0 0 0 0 0 0 0 0 0    0 0 0 0 0 0 0 0 0 0
19   6    2 0 0 0 0 0 0 0 0 0    0 0 0 0 0 0 0 0 0 0

(δ′, f′, g′). If u ∈ k[[x]] with u(0) = 1 then divstep(δ, uf, ug) = (δ′, uf′, ug′). If u ∈ k^* then

divstep(δ, uf, g) = (δ′, f′, ug′) in the first (swapped) case;
divstep(δ, uf, g) = (δ′, uf′, ug′) in the second case;
divstep(δ, f, ug) = (δ′, uf′, ug′) in the first case;
divstep(δ, f, ug) = (δ′, f′, ug′) in the second case;

and divstep(δ, uf, ug) = (δ′, uf′, u^2 g′).

One can also scale g(0)f − f(0)g by other nonzero constants. For example, for typical fields k, one can quickly find "half-size" a, b such that a/b = g(0)/f(0). Computing af − bg may be more efficient than computing g(0)f − f(0)g.

We use the g(0)/f(0) variant in our case study in Section 7 for the field k = F_3. Dividing by f(0) in this field is the same as multiplying by f(0).

4. Iterates of x-adic division steps

Starting from (δ, f, g) ∈ Z × k[[x]]^* × k[[x]], define (δ_n, f_n, g_n) = divstep^n(δ, f, g) ∈ Z × k[[x]]^* × k[[x]] for each n ≥ 0. Also define T_n = T(δ_n, f_n, g_n) ∈ M_2(k[1/x]) and S_n = S(δ_n, f_n, g_n) ∈ M_2(Z) for each n ≥ 0. Our main algorithm relies on various simple mathematical properties of these objects; this section states those properties. Table 4.1 shows all δ_n, f_n, g_n for one example of (δ, f, g).

Theorem 4.2. (f_n; g_n) = T_{n−1} ⋯ T_m (f_m; g_m) and (1; δ_n) = S_{n−1} ⋯ S_m (1; δ_m) if n ≥ m ≥ 0.


Proof. This follows from (f_{m+1}; g_{m+1}) = T_m (f_m; g_m) and (1; δ_{m+1}) = S_m (1; δ_m).

Theorem 4.3. T_{n−1} ⋯ T_m ∈ (k + ⋯ + (1/x^(n−m−1))k, k + ⋯ + (1/x^(n−m−1))k; (1/x)k + ⋯ + (1/x^(n−m))k, (1/x)k + ⋯ + (1/x^(n−m))k) if n > m ≥ 0.

A product of n − m of these T matrices can thus be stored using 4(n − m) coefficients from k. Beware that the theorem statement relies on n > m; for n = m the product is the identity matrix.

Proof. If n − m = 1 then the product is T_m, which is in (k, k; (1/x)k, (1/x)k) by definition. For n − m > 1, assume inductively that

T_{n−1} ⋯ T_{m+1} ∈ (k + ⋯ + (1/x^(n−m−2))k, k + ⋯ + (1/x^(n−m−2))k; (1/x)k + ⋯ + (1/x^(n−m−1))k, (1/x)k + ⋯ + (1/x^(n−m−1))k).

Multiply on the right by T_m ∈ (k, k; (1/x)k, (1/x)k). The top entries of the product are in (k + ⋯ + (1/x^(n−m−2))k) + ((1/x)k + ⋯ + (1/x^(n−m−1))k) = k + ⋯ + (1/x^(n−m−1))k as claimed, and the bottom entries are in ((1/x)k + ⋯ + (1/x^(n−m−1))k) + ((1/x^2)k + ⋯ + (1/x^(n−m))k) = (1/x)k + ⋯ + (1/x^(n−m))k as claimed.

Theorem 4.4. S_{n−1} ⋯ S_m ∈ (1, 0; {2 − (n−m), …, n−m−2, n−m}, {1, −1}) if n > m ≥ 0.

In other words, the bottom-left entry is between −(n−m) (exclusive) and n−m (inclusive) and has the same parity as n−m. There are exactly 2(n−m) possibilities for the product. Beware that the theorem statement again relies on n > m.

Proof. If n − m = 1 then {2 − (n−m), …, n−m−2, n−m} = {1}, and the product is S_m, which is (1, 0; 1, ±1) by definition.

For n − m > 1, assume inductively that

S_{n−1} ⋯ S_{m+1} ∈ (1, 0; {2 − (n−m−1), …, n−m−3, n−m−1}, {1, −1})

and multiply on the right by S_m = (1, 0; 1, ±1). The top two entries of the product are again 1 and 0. The bottom-left entry is in {2 − (n−m−1), …, n−m−3, n−m−1} + {1, −1} = {2 − (n−m), …, n−m−2, n−m}. The bottom-right entry is again 1 or −1.

The next theorem compares δ_m, f_m, g_m, T_m, S_m to δ′_m, f′_m, g′_m, T′_m, S′_m defined in the same way starting from (δ′, f′, g′) ∈ Z × k[[x]]^* × k[[x]].

Theorem 4.5. Let m, t be nonnegative integers. Assume that δ′_m = δ_m, f′_m ≡ f_m (mod x^t), and g′_m ≡ g_m (mod x^t). Then for each integer n with m < n ≤ m + t: T′_{n−1} = T_{n−1}, S′_{n−1} = S_{n−1}, δ′_n = δ_n, f′_n ≡ f_n (mod x^(t−(n−m−1))), and g′_n ≡ g_n (mod x^(t−(n−m))).

In other words, δ_m and the first t coefficients of f_m and g_m determine all of the following:

• the matrices T_m, T_{m+1}, …, T_{m+t−1};

• the matrices S_m, S_{m+1}, …, S_{m+t−1};

• δ_{m+1}, …, δ_{m+t};

• the first t coefficients of f_{m+1}, the first t − 1 coefficients of f_{m+2}, the first t − 2 coefficients of f_{m+3}, and so on through the first coefficient of f_{m+t};


• the first t − 1 coefficients of g_{m+1}, the first t − 2 coefficients of g_{m+2}, the first t − 3 coefficients of g_{m+3}, and so on through the first coefficient of g_{m+t−1}.

Our main algorithm computes (δ_n, f_n mod x^(t−(n−m−1)), g_n mod x^(t−(n−m))) for any selected n ∈ {m + 1, …, m + t}, along with the product of the matrices T_m, …, T_{n−1}. See Section 5.

Proof. See Figure 3.3 for the intuition. Formally, fix m, t and induct on n. Observe that (δ′_{n−1}, f′_{n−1}(0), g′_{n−1}(0)) equals (δ_{n−1}, f_{n−1}(0), g_{n−1}(0)):

• If n = m + 1: By assumption f′_m ≡ f_m (mod x^t), and n ≤ m + t, so t ≥ 1, so in particular f′_m(0) = f_m(0). Similarly g′_m(0) = g_m(0). By assumption δ′_m = δ_m.

• If n > m + 1: f′_{n−1} ≡ f_{n−1} (mod x^(t−(n−m−2))) by the inductive hypothesis, and n ≤ m + t, so t − (n − m − 2) ≥ 2, so in particular f′_{n−1}(0) = f_{n−1}(0); similarly g′_{n−1} ≡ g_{n−1} (mod x^(t−(n−m−1))) by the inductive hypothesis, and t − (n − m − 1) ≥ 1, so in particular g′_{n−1}(0) = g_{n−1}(0); and δ′_{n−1} = δ_{n−1} by the inductive hypothesis.

Now use the fact that T and S inspect only the first coefficient of each series to see that

T′_{n−1} = T(δ′_{n−1}, f′_{n−1}, g′_{n−1}) = T(δ_{n−1}, f_{n−1}, g_{n−1}) = T_{n−1},
S′_{n−1} = S(δ′_{n−1}, f′_{n−1}, g′_{n−1}) = S(δ_{n−1}, f_{n−1}, g_{n−1}) = S_{n−1},

as claimed.

Multiply (1; δ′_{n−1}) = (1; δ_{n−1}) by S′_{n−1} = S_{n−1} to see that δ′_n = δ_n as claimed.

By the inductive hypothesis (T′_m, …, T′_{n−2}) = (T_m, …, T_{n−2}), and T′_{n−1} = T_{n−1}, so T′_{n−1} ⋯ T′_m = T_{n−1} ⋯ T_m. Abbreviate T_{n−1} ⋯ T_m as P. Then (f′_n; g′_n) = P(f′_m; g′_m) and (f_n; g_n) = P(f_m; g_m) by Theorem 4.2, so

(f′_n − f_n; g′_n − g_n) = P(f′_m − f_m; g′_m − g_m).

By Theorem 4.3, P ∈ (k + ⋯ + (1/x^(n−m−1))k, k + ⋯ + (1/x^(n−m−1))k; (1/x)k + ⋯ + (1/x^(n−m))k, (1/x)k + ⋯ + (1/x^(n−m))k). By assumption, f′_m − f_m and g′_m − g_m are multiples of x^t. Multiply to see that f′_n − f_n is a multiple of x^t/x^(n−m−1) = x^(t−(n−m−1)) as claimed, and g′_n − g_n is a multiple of x^t/x^(n−m) = x^(t−(n−m)) as claimed.

5. Fast computation of iterates of x-adic division steps

Our goal in this section is to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). We are given the power series f, g to precision t, t respectively, and we compute f_n, g_n to precision t − n + 1, t − n respectively if n ≥ 1; see Theorem 4.5. We also compute the n-step transition matrix T_{n−1} ⋯ T_0.

We begin with an algorithm that simply computes one divstep iterate after another, multiplying by scalars in k and dividing by x exactly as in the divstep definition. We specify this algorithm in Figure 5.1 as a function divstepsx in the Sage [58] computer-algebra system. The quantities u, v, q, r ∈ k[1/x] inside the algorithm keep track of the coefficients of the current f, g in terms of the original f, g.

The rest of this section considers (1) constant-time computation, (2) "jumping" through division steps, and (3) further techniques to save time inside the computations.

5.2. Constant-time computation. The underlying Sage functions do not take constant time, but they can be replaced with functions that do take constant time. The case distinction can be replaced with constant-time bit operations, for example obtaining the new (δ, f, g) as follows:


def divstepsx(n, t, delta, f, g):
    assert t >= n and n >= 0
    f, g = f.truncate(t), g.truncate(t)
    kx = f.parent()
    x = kx.gen()
    u, v, q, r = kx(1), kx(0), kx(0), kx(1)

    while n > 0:
        f = f.truncate(t)
        if delta > 0 and g[0] != 0:
            delta, f, g, u, v, q, r = -delta, g, f, q, r, u, v
        f0, g0 = f[0], g[0]
        delta, g, q, r = 1+delta, (f0*g-g0*f)/x, (f0*q-g0*u)/x, (f0*r-g0*v)/x
        n, t = n-1, t-1
        g = kx(g).truncate(t)

    M2kx = MatrixSpace(kx.fraction_field(), 2)
    return delta, f, g, M2kx((u, v, q, r))

Figure 5.1: Algorithm divstepsx to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). Inputs: n, t ∈ Z with 0 ≤ n ≤ t; δ ∈ Z; f, g ∈ k[[x]] to precision at least t. Outputs: δ_n; f_n to precision t if n = 0, or t − (n − 1) if n ≥ 1; g_n to precision t − n; T_{n−1} ⋯ T_0.

• Let s be the "sign bit" of −δ, meaning 0 if −δ ≥ 0 or 1 if −δ < 0.

• Let z be the "nonzero bit" of g(0), meaning 0 if g(0) = 0 or 1 if g(0) ≠ 0.

• Compute and output 1 + (1 − 2sz)δ. For example, shift δ to obtain 2δ, "AND" with −sz to obtain 2szδ, and subtract from 1 + δ.

• Compute and output f ⊕ sz(f ⊕ g), obtaining f if sz = 0 or g if sz = 1.

• Compute and output (1 − 2sz)(f(0)g − g(0)f)/x. Alternatively, compute and output simply (f(0)g − g(0)f)/x; see the discussion of decomposition and scaling in Section 3.

Standard techniques minimize the exact cost of these computations for various representations of the inputs. The details depend on the platform. See our case study in Section 7.
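For concreteness, here is a minimal branchless sketch of the control logic above (ours, not the paper's software), written in Python for readability; it assumes δ and the packed coefficient words fit in 64 bits, and a real implementation would operate on machine words:

    def branchless_control(delta, f_word, g_word, g0_nonzero_bit):
        s = ((-delta) >> 63) & 1                   # sign bit of -delta: 1 exactly when delta > 0
        z = g0_nonzero_bit                         # nonzero bit of g(0), supplied by the caller
        sz = s & z
        delta = (1 + delta) - ((2 * delta) & -sz)  # 1 + (1 - 2sz)*delta
        t = (f_word ^ g_word) & -sz                # conditional swap of packed coefficients
        f_word ^= t
        g_word ^= t
        return delta, f_word, g_word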

5.3. Jumping through division steps. Here is a more general divide-and-conquer algorithm to compute (δ_n, f_n, g_n) and the n-step transition matrix T_{n−1} ⋯ T_0:

• If n ≤ 1, use the definition of divstep and stop.

• Choose j ∈ {1, 2, …, n − 1}.

• Jump j steps from δ, f, g to δ_j, f_j, g_j. Specifically, compute the j-step transition matrix T_{j−1} ⋯ T_0, and then multiply by (f; g) to obtain (f_j; g_j). To compute the j-step transition matrix, call the same algorithm recursively. The critical point here is that this recursive call uses the inputs to precision only j; this determines the j-step transition matrix by Theorem 4.5.

• Similarly jump another n − j steps from δ_j, f_j, g_j to δ_n, f_n, g_n.


from divstepsx import divstepsx

def jumpdivstepsx(n, t, delta, f, g):
    assert t >= n and n >= 0
    kx = f.truncate(t).parent()

    if n <= 1: return divstepsx(n, t, delta, f, g)

    j = n//2

    delta, f1, g1, P1 = jumpdivstepsx(j, j, delta, f, g)
    f, g = P1*vector((f, g))
    f, g = kx(f).truncate(t-j), kx(g).truncate(t-j)

    delta, f2, g2, P2 = jumpdivstepsx(n-j, n-j, delta, f, g)
    f, g = P2*vector((f, g))
    f, g = kx(f).truncate(t-n+1), kx(g).truncate(t-n)

    return delta, f, g, P2*P1

Figure 5.4: Algorithm jumpdivstepsx to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). Same inputs and outputs as in Figure 5.1.

This splits an (n, t) problem into two smaller problems, namely a (j, j) problem and an (n − j, n − j) problem, at the cost of O(1) polynomial multiplications involving O(t + n) coefficients.

If j is always chosen as 1 then this algorithm ends up performing the same computations as Figure 5.1: compute δ_1, f_1, g_1 to precision t − 1; compute δ_2, f_2, g_2 to precision t − 2; etc. But there are many more possibilities for j.

Our jumpdivstepsx algorithm in Figure 5.4 takes j = ⌊n/2⌋, balancing the (j, j) problem with the (n − j, n − j) problem. In particular, with subquadratic polynomial multiplication the algorithm is subquadratic. With FFT-based polynomial multiplication, the algorithm uses (t + n)(log(t + n))^(1+o(1)) + n(log n)^(2+o(1)) operations, and hence n(log n)^(2+o(1)) operations if t is bounded by n(log n)^(1+o(1)). For comparison, Figure 5.1 takes Θ(n(t + n)) operations.

We emphasize that, unlike standard fast-gcd algorithms such as [66], this algorithm does not call a polynomial-division subroutine between its two recursive calls.

5.5. More speedups. Our jumpdivstepsx algorithm is asymptotically Θ(log n) times slower than fast polynomial multiplication. The best previous gcd algorithms also have a Θ(log n) slowdown, both in the worst case and in the average case.^10 This might suggest that there is very little room for improvement.

However, Θ says nothing about constant factors. There are several standard ideas for constant-factor speedups in gcd computation, beyond lower-level speedups in multiplication. We now apply the ideas to jumpdivstepsx.

The final polynomial multiplications in jumpdivstepsx produce six large outputs: f, g, and four entries of a matrix product P. Callers do not always use all of these outputs; for example, the recursive calls to jumpdivstepsx do not use the resulting f

^10 There are faster cases: Strassen [66] gave an algorithm that replaces the log factor with the entropy of the list of degrees in the corresponding continued fraction; there are occasional inputs where this entropy is o(log n). For example, computing gcd{f, 0} takes linear time. However, these speedups are not useful for constant-time computations.


and g. In principle there is no difficulty applying dead-value elimination and higher-level optimizations to each of the 63 possible combinations of desired outputs.

The recursive calls in jumpdivstepsx produce two products P_1, P_2 of transition matrices, which are then (if desired) multiplied to obtain P = P_2 P_1. In Figure 5.4, jumpdivstepsx multiplies f and g first by P_1 and then by P_2 to obtain f_n and g_n. An alternative is to multiply by P.

The inputs f and g are truncated to t coefficients, and the resulting f_n and g_n are truncated to t − n + 1 and t − n coefficients respectively. Often the caller (for example, jumpdivstepsx itself) wants more coefficients of P(f; g). Rather than performing an entirely new multiplication by P, one can add the truncation errors back into the results: multiply P by the f and g truncation errors, multiply P_2 by the f_j and g_j truncation errors, and skip the final truncations of f_n and g_n.

There are standard speedups for multiplication in the context of truncated products; sums of products (e.g., in matrix multiplication); partially known products (e.g., the coefficients of x^−1, x^−2, x^−3, … in P_1(f; g) are known to be 0); repeated inputs (e.g., each entry of P_1 is multiplied by two entries of P_2 and by one of f, g); and outputs being reused as inputs. See, e.g., the discussions of "FFT addition", "FFT caching", and "FFT doubling" in [11].

Any choice of j between 1 and n − 1 produces correct results. Values of j close to the edges do not take advantage of fast polynomial multiplication, but this does not imply that j = ⌊n/2⌋ is optimal. The number of combinations of choices of j and other options above is small enough that, for each small (t, n) in turn, one can simply measure all options and select the best. The results depend on the context (which outputs are needed) and on the exact performance of polynomial multiplication.

6. Fast polynomial gcd computation and modular inversion

This section presents three algorithms: gcdx computes gcd{R_0, R_1}, where R_0 is a polynomial of degree d > 0 and R_1 is a polynomial of degree <d; gcddegree computes merely deg gcd{R_0, R_1}; recipx computes the reciprocal of R_1 modulo R_0 if gcd{R_0, R_1} = 1. Figure 6.1 specifies all three algorithms, and Theorem 6.2 justifies all three algorithms. The theorem is proven in Appendix A.

Each algorithm calls divstepsx from Section 5 to compute (δ_{2d−1}, f_{2d−1}, g_{2d−1}) = divstep^(2d−1)(1, f, g). Here f = x^d R_0(1/x) is the reversal of R_0, and g = x^(d−1) R_1(1/x). Of course, one can replace divstepsx here with jumpdivstepsx, which has the same outputs. The algorithms vary in the input-precision parameter t passed to divstepsx, and vary in how they use the outputs:

• gcddegree uses input precision t = 2d − 1 to compute δ_{2d−1}, and then uses one part of Theorem 6.2, which states that deg gcd{R_0, R_1} = δ_{2d−1}/2.

• gcdx uses precision t = 3d − 1 to compute δ_{2d−1} and d + 1 coefficients of f_{2d−1}. The algorithm then uses another part of Theorem 6.2, which states that f_{2d−1}/f_{2d−1}(0) is the reversal of gcd{R_0, R_1}.

• recipx uses t = 2d − 1 to compute δ_{2d−1}, 1 coefficient of f_{2d−1}, and the product P of transition matrices. It then uses the last part of Theorem 6.2, which (in the coprime case) states that the top-right corner of P is f_{2d−1}(0)/x^(2d−2) times the degree-(d − 1) reversal of the desired reciprocal.

One can generalize to compute gcd{R_0, R_1} for any polynomials R_0, R_1 of degree ≤d, in time that depends only on d: scan the inputs to find the top degree, and always perform


from divstepsx import divstepsx

def gcddegree(R0, R1):
    d = R0.degree()
    assert d > 0 and d > R1.degree()
    f, g = R0.reverse(d), R1.reverse(d-1)
    delta, f, g, P = divstepsx(2*d-1, 2*d-1, 1, f, g)
    return delta//2

def gcdx(R0, R1):
    d = R0.degree()
    assert d > 0 and d > R1.degree()
    f, g = R0.reverse(d), R1.reverse(d-1)
    delta, f, g, P = divstepsx(2*d-1, 3*d-1, 1, f, g)
    return f.reverse(delta//2)/f[0]

def recipx(R0, R1):
    d = R0.degree()
    assert d > 0 and d > R1.degree()
    f, g = R0.reverse(d), R1.reverse(d-1)
    delta, f, g, P = divstepsx(2*d-1, 2*d-1, 1, f, g)
    if delta != 0: return
    kx = f.parent()
    x = kx.gen()
    return kx(x^(2*d-2)*P[0][1]/f[0]).reverse(d-1)

Figure 6.1: Algorithm gcdx to compute gcd{R_0, R_1}; algorithm gcddegree to compute deg gcd{R_0, R_1}; algorithm recipx to compute the reciprocal of R_1 modulo R_0 when gcd{R_0, R_1} = 1. All three algorithms assume that R_0, R_1 ∈ k[x], that deg R_0 > 0, and that deg R_0 > deg R_1.

2d iterations of divstep. The extra iteration handles the case that both R_0 and R_1 have degree d. For simplicity we focus on the case deg R_0 = d > deg R_1, with d > 0.

One can similarly characterize gcd and modular reciprocal in terms of subsequent iterates of divstep, with δ increased by the number of extra steps. Implementors might, for example, prefer to use an iteration count that is divisible by a large power of 2.

Theorem 6.2. Let k be a field. Let d be a positive integer. Let R_0, R_1 be elements of the polynomial ring k[x] with deg R_0 = d > deg R_1. Define G = gcd{R_0, R_1}, and let V be the unique polynomial of degree < d − deg G such that V R_1 ≡ G (mod R_0). Define f = x^d R_0(1/x), g = x^(d−1) R_1(1/x), (δ_n, f_n, g_n) = divstep^n(1, f, g), T_n = T(δ_n, f_n, g_n), and (u_n, v_n; q_n, r_n) = T_{n−1} ⋯ T_0. Then

deg G = δ_{2d−1}/2,
G = x^(deg G) f_{2d−1}(1/x)/f_{2d−1}(0),
V = x^(−d+1+deg G) v_{2d−1}(1/x)/f_{2d−1}(0).

6.3. A numerical example. Take k = F_7, d = 7, R_0 = 2x^7 + 7x^6 + 1x^5 + 8x^4 + 2x^3 + 8x^2 + 1x + 8, and R_1 = 3x^6 + 1x^5 + 4x^4 + 1x^3 + 5x^2 + 9x + 2. Then f = 2 + 7x + 1x^2 + 8x^3 + 2x^4 + 8x^5 + 1x^6 + 8x^7 and g = 3 + 1x + 4x^2 + 1x^3 + 5x^4 + 9x^5 + 2x^6. The gcdx algorithm computes (δ_13, f_13, g_13) = divstep^13(1, f, g) = (0, 2, 6); see Table 4.1


for all iterates of divstep on these inputs. Dividing f_13 by its constant coefficient produces 1, the reversal of gcd{R_0, R_1}. The gcddegree algorithm simply computes δ_13 = 0, which is enough information to see that gcd{R_0, R_1} = 1.

6.4. Constant-factor speedups. Compared to the general power-series setting in divstepsx, the inputs to gcddegree, gcdx, and recipx are more limited. For example, the f input is all 0 after the first d + 1 coefficients, so computations involving subsequent coefficients can be skipped. Similarly, the top-right entry of the matrix P in recipx is guaranteed to have coefficient 0 for 1, 1/x, 1/x^2, and so on through 1/x^(d−2), so one can save time in the polynomial computations that produce this entry. See Theorems A.1 and A.2 for bounds on the sizes of all intermediate results.

6.5. Bottom-up gcd and inversion. Our division steps eliminate bottom coefficients of the inputs. Appendices B, C, and D relate this to a traditional Euclid–Stevin computation, which gradually eliminates top coefficients of the inputs. The reversal of inputs and outputs in gcddegree, gcdx, and recipx exchanges the role of top coefficients and bottom coefficients. The following theorem states an alternative: skip the reversal, and instead directly eliminate the bottom coefficients of the original inputs.

Theorem 6.6. Let k be a field. Let d be a positive integer. Let f, g be elements of the polynomial ring k[x] with f(0) ≠ 0, deg f ≤ d, and deg g < d. Define γ = gcd{f, g}. Define (δ_n, f_n, g_n) = divstep^n(1, f, g), T_n = T(δ_n, f_n, g_n), and (u_n, v_n; q_n, r_n) = T_{n−1} ⋯ T_0. Then

deg γ = deg f_{2d−1},
γ = f_{2d−1}/ℓ,
x^(2d−2) γ ≡ x^(2d−2) v_{2d−1} g/ℓ (mod f),

where ℓ is the leading coefficient of f_{2d−1}.

Proof. Note that γ(0) ≠ 0, since f is a multiple of γ.

Define R_0 = x^d f(1/x). Then R_0 ∈ k[x]; deg R_0 = d since f(0) ≠ 0; and f = x^d R_0(1/x). Define R_1 = x^(d−1) g(1/x). Then R_1 ∈ k[x]; deg R_1 < d; and g = x^(d−1) R_1(1/x).

Define G = gcd{R_0, R_1}. We will show below that γ = c x^(deg G) G(1/x) for some c ∈ k^*. Then γ = c f_{2d−1}/f_{2d−1}(0) by Theorem 6.2, i.e., γ is a constant multiple of f_{2d−1}. Hence deg γ = deg f_{2d−1} as claimed. Furthermore γ has leading coefficient 1, so 1 = cℓ/f_{2d−1}(0), i.e., γ = f_{2d−1}/ℓ as claimed.

By Theorem 4.2, f_{2d−1} = u_{2d−1} f + v_{2d−1} g. Also x^(2d−2) u_{2d−1}, x^(2d−2) v_{2d−1} ∈ k[x] by Theorem 4.3, so

x^(2d−2) γ = (x^(2d−2) u_{2d−1}/ℓ) f + (x^(2d−2) v_{2d−1}/ℓ) g ≡ (x^(2d−2) v_{2d−1}/ℓ) g (mod f)

as claimed.

All that remains is to show that γ = c x^(deg G) G(1/x) for some c ∈ k^*. If R_1 = 0 then G = gcd{R_0, 0} = R_0/R_0[d], so x^(deg G) G(1/x) = x^d R_0(1/x)/R_0[d] = f/f[0]; also g = 0, so γ = f/f[deg f] = (f[0]/f[deg f]) x^(deg G) G(1/x). Assume from now on that R_1 ≠ 0.

We have R_1 = QG for some Q ∈ k[x]. Note that deg R_1 = deg Q + deg G. Also R_1(1/x) = Q(1/x)G(1/x), so x^(deg R_1) R_1(1/x) = x^(deg Q) Q(1/x)·x^(deg G) G(1/x) since deg R_1 = deg Q + deg G. Hence x^(deg G) G(1/x) divides the polynomial x^(deg R_1) R_1(1/x), which in turn divides x^(d−1) R_1(1/x) = g. Similarly, x^(deg G) G(1/x) divides f. Hence x^(deg G) G(1/x) divides gcd{f, g} = γ.

Furthermore, G = A R_0 + B R_1 for some A, B ∈ k[x]. Take any integer m above all of deg G, deg A, deg B; then

x^(m+d−deg G)·x^(deg G) G(1/x) = x^(m+d) G(1/x)
  = x^m A(1/x)·x^d R_0(1/x) + x^(m+1) B(1/x)·x^(d−1) R_1(1/x)
  = x^(m−deg A)·x^(deg A) A(1/x)·f + x^(m+1−deg B)·x^(deg B) B(1/x)·g,


so x^(m+d−deg G)·x^(deg G) G(1/x) is a multiple of γ, so the polynomial γ/(x^(deg G) G(1/x)) divides x^(m+d−deg G), so γ/(x^(deg G) G(1/x)) = c x^i for some c ∈ k^* and some i ≥ 0. We must have i = 0 since γ(0) ≠ 0.
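As a small Sage sketch (ours, not one of the paper's figures), Theorem 6.6 can be turned into a bottom-up gcd routine directly on top of divstepsx from Figure 5.1; the precision 3d − 1 suffices because all iterates here are polynomials of degree at most d:

    def gcd_bottom_up(f, g, d):
        # gcd{f, g} for polynomials with f(0) != 0, deg f <= d, deg g < d (Theorem 6.6)
        assert f[0] != 0 and f.degree() <= d and g.degree() < d
        delta, f2, g2, P = divstepsx(2*d-1, 3*d-1, 1, f, g)
        return f2.monic()   # f_{2d-1} divided by its leading coefficient

    # e.g., over GF(7) with the f, g of Table 4.1 and d = 7 this returns 1,
    # matching gcd{R_0, R_1} = 1 in the numerical example of Section 6.3.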

6.7. How to divide by x^{2d-2} modulo f. Unlike top-down inversion (and top-down gcd and bottom-up gcd), bottom-up inversion naturally scales its result by a power of x. We now explain various ways to remove this scaling.

For simplicity we focus here on the case gcd{f, g} = 1. The divstep computation in Theorem 6.6 naturally produces the polynomial x^{2d-2} v_{2d-1}/ℓ, which has degree at most d - 1 by Theorem 6.2. Our objective is to divide this polynomial by x^{2d-2} modulo f, obtaining the reciprocal of g modulo f by Theorem 6.6. One way to do this is to multiply by a reciprocal of x^{2d-2} modulo f. This reciprocal depends only on d and f, i.e., it can be precomputed before g is available.

Here are several standard strategies to compute 1/x^{2d-2} modulo f, or more generally 1/x^e modulo f for any nonnegative integer e:

• Compute the polynomial (1 - f/f(0))/x, which is 1/x modulo f, and then raise this to the eth power modulo f. This takes a logarithmic number of multiplications modulo f.

• Start from 1/f(0), the reciprocal of f modulo x. Compute the reciprocals of f modulo x^2, x^4, x^8, etc. by Newton (Hensel) iteration. After finding the reciprocal r of f modulo x^e, divide 1 - rf by x^e to obtain the reciprocal of x^e modulo f. This takes a constant number of full-size multiplications, a constant number of half-size multiplications, etc. The total number of multiplications is again logarithmic, but if e is linear, as in the case e = 2d - 2, then the total size of the multiplications is linear, without an extra logarithmic factor. (A small sketch follows this list.)

• Combine the powering strategy with the Hensel strategy. For example, use the reciprocal of f modulo x^{d-1} to compute the reciprocal of x^{d-1} modulo f, and then square modulo f to obtain the reciprocal of x^{2d-2} modulo f, rather than using the reciprocal of f modulo x^{2d-2}.

• Write down a simple formula for the reciprocal of x^e modulo f if f has a special structure that makes this easy. For example, if f = x^d - 1, as in original NTRU, and d ≥ 3, then the reciprocal of x^{2d-2} is simply x^2. As another example, if f = x^d + x^{d-1} + · · · + 1 and d ≥ 5, then the reciprocal of x^{2d-2} is x^4.

In most cases the division by x^{2d-2} modulo f involves extra arithmetic that is avoided by the top-down reversal approach. On the other hand, the case f = x^d - 1 does not involve any extra arithmetic. There are also applications of inversion that can easily handle a scaled result.
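As an illustration of the Hensel strategy in the list above, the following Sage sketch (our own example, not the paper's code; the function name is ours) computes 1/x^e modulo f by first computing the reciprocal r of f modulo x^e via Newton iteration and then dividing 1 - rf by x^e:

    def xe_reciprocal_mod_f(f, e):
        # requires f(0) != 0; returns s with x^e * s == 1 (mod f)
        R = f.parent(); x = R.gen()
        r = R(1 / f[0])                    # reciprocal of f modulo x
        prec = 1
        while prec < e:                    # Newton (Hensel) iteration
            prec = min(2 * prec, e)
            r = (r * (2 - r * f)) % x**prec
        return ((1 - r * f) // x**e) % f   # 1 - r*f is divisible by x^e

    # Example (assumes Sage):
    # R.<x> = GF(3)[]; f = x^7 + x + 1
    # assert (xe_reciprocal_mod_f(f, 12) * x^12) % f == 1

Since 1 - rf is a multiple of f's multiple rf plus 1, it is congruent to 1 modulo f, and by construction it is divisible by x^e; the quotient is therefore the desired reciprocal of x^e.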

7 Software case study for polynomial modular inversion

As a concrete case study, we consider the problem of inversion in the ring (Z/3)[x]/(x^700 + x^699 + · · · + x + 1) on an Intel Haswell CPU core. The situation before our work was that this inversion consumed half of the key-generation time in the ntruhrss701 cryptosystem: about 150000 cycles out of 300000 cycles. This cryptosystem was introduced by Hülsing, Rijneveld, Schanck, and Schwabe [39] at CHES 2017.^11

^11 NTRU-HRSS is another round-2 submission in NIST's ongoing post-quantum competition. Google has also selected NTRU-HRSS for its "CECPQ2" post-quantum experiment. CECPQ2 includes a slightly modified version of ntruhrss701, and the NTRU-HRSS authors have also announced their intention to tweak NTRU-HRSS; these changes do not affect the performance of key generation.


We saved 60000 cycles out of these 150000 cycles, reducing the total key-generation time to 240000 cycles. Our approach also simplifies the code, for example eliminating the series of conditional power-of-2 rotations in [39].

7.1. Details of the new computation. Our computation works with one integer δ and four polynomials v, r, f, g ∈ (Z/3)[x]. Initially δ = 1; v = 0; r = 1; f = x^700 + x^699 + · · · + x + 1; and g = g_699 x^699 + · · · + g_0 x^0 is obtained by reversing the 700 input coefficients. We actually store -δ instead of δ; this turns out to be marginally more efficient for this platform.

We then apply 2 · 700 - 1 = 1399 iterations of a main loop. Each iteration works as follows, with the order of operations determined experimentally to try to minimize latency issues:

• Replace v with xv.

• Compute a swap mask as -1 if δ > 0 and g(0) ≠ 0, otherwise 0.

• Compute c ∈ Z/3 as f(0)g(0).

• Replace δ with -δ if the swap mask is set.

• Add 1 to δ.

• Replace (f, g) with (g, f) if the swap mask is set.

• Replace g with (g - cf)/x.

• Replace (v, r) with (r, v) if the swap mask is set.

• Replace r with r - cv.

This multiplies the transition matrix T(δ, f, g) into the vector (v/x^{n-1}, r/x^n)^T, where n is the number of iterations, and at the same time replaces (δ, f, g) with divstep(δ, f, g). Actually, we are slightly modifying divstep here (and making the corresponding modification to T), replacing the original coefficients f(0) and g(0) with the scaled coefficients 1 and g(0)/f(0) = f(0)g(0), as in Section 3.4. In other words, if f(0) = -1 then we negate both f and g. We suppress further comments on this deviation from the definition of divstep.
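A plain-Python model of one such iteration (our sketch of the bullet list above, using representatives 0, 1, 2 for Z/3 and fixed-length coefficient lists rather than the vectorized signed-bit representation of the actual software) looks as follows:

    def iterate(delta, f, g, v, r):
        # One main-loop iteration. All polynomials are coefficient lists of a
        # common length N (constant term first), padded with zeros.
        N = len(f)
        v = [0] + v[:N-1]                          # v <- x*v
        swap = delta > 0 and g[0] % 3 != 0         # swap mask
        c = (f[0] * g[0]) % 3                      # c = f(0)*g(0) in Z/3
        if swap:
            delta, f, g, v, r = -delta, g, f, r, v # conditional swaps
        delta += 1
        g = [(g[i] - c * f[i]) % 3 for i in range(N)][1:] + [0]   # (g - c*f)/x
        r = [(r[i] - c * v[i]) % 3 for i in range(N)]             # r - c*v
        return delta, f, g, v, r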

The effect of n iterations is to replace the original (δ, f, g) with divstep^n(δ, f, g). The matrix product T_{n-1} · · · T_0 has the form

    ( · · ·   v/x^{n-1} )
    ( · · ·   r/x^n     ).

The new f and g are congruent to v/x^{n-1} and r/x^n respectively, times the original g, modulo x^700 + x^699 + · · · + x + 1.

After 1399 iterations, we have δ = 0 if and only if the input is invertible modulo x^700 + x^699 + · · · + x + 1, by Theorem 6.2. Furthermore, if δ = 0 then the inverse is x^{-699} v_1399(1/x)/f(0), where v_1399 = v/x^1398; i.e., the inverse is x^699 v(1/x)/f(0). We thus multiply v by f(0) (which equals dividing by f(0), since f(0) = ±1) and take the coefficients of x^0, . . . , x^699 in reverse order to obtain the desired inverse.

We use signed representatives -1, 0, 1 for elements of Z/3. We use 2-bit two's-complement representations of -1, 0, 1: the bottom bit is 1, 0, 1 respectively, and the top bit is 1, 0, 0. We wrote a simple superoptimizer to find optimal sequences of 3, 6, 6 bit operations respectively for multiplication, addition, and subtraction. We do not claim novelty for the approach in this paragraph: Boothby and Bradshaw in [20, Section 4.1] suggested the same 2-bit representation for Z/3 (without explaining it as signed two's-complement) and found the same operation counts by a similar search.
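We do not reproduce the specific bit-operation sequences here, but the following small Python harness (our own illustration with hypothetical helper names, not the paper's superoptimizer) shows how a candidate bitwise formula for addition in Z/3 under this signed 2-bit encoding can be checked exhaustively:

    ENCODE = {-1: (1, 1), 0: (0, 0), 1: (0, 1)}     # value -> (top bit, bottom bit)
    DECODE = {bits: val for val, bits in ENCODE.items()}

    def check_add(candidate):
        """candidate takes (a_top, a_bot, b_top, b_bot) and returns (c_top, c_bot);
        returns True iff it computes addition in Z/3 for all nine input pairs."""
        for a in (-1, 0, 1):
            for b in (-1, 0, 1):
                want = ((a + b + 1) % 3) - 1        # sum mapped back to -1,0,1
                if DECODE[candidate(*ENCODE[a], *ENCODE[b])] != want:
                    return False
        return True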

We store each of the four polynomials v, r, f, g as six 256-bit vectors: for example, the coefficients of x^0, . . . , x^255 in v are stored as a 256-bit vector of bottom bits and a 256-bit vector of top bits. Within each 256-bit vector, we use the first 64-bit word to


store the coefficients of x^0, x^4, . . . , x^252; the next 64-bit word to store the coefficients of x^1, x^5, . . . , x^253; etc. Multiplying by x thus moves the first 64-bit word to the second, moves the second to the third, moves the third to the fourth, moves the fourth to the first, and shifts the first by 1 bit; the bit that falls off the end of the first word is inserted into the next vector.
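The following plain-Python sketch (our model of one bit-plane of this layout, not the AVX2 code; the interpretation of the word movements is ours) shows the multiplication by x just described:

    MASK64 = (1 << 64) - 1

    def mul_x(vectors):
        """vectors: list of 4-tuples of 64-bit ints (one bit plane), each vector
        covering 256 consecutive coefficients in the strided layout above."""
        out, carry = [], 0
        for (w0, w1, w2, w3) in vectors:
            new0 = ((w3 << 1) | carry) & MASK64   # fourth word moves to first, shifted
            carry = w3 >> 63                      # overflow bit goes to the next vector
            out.append((new0, w0, w1, w2))        # remaining words rotate forward
        return out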

The first 256 iterations involve only the first 256-bit vectors for v and r, so we simply leave the second and third vectors as 0. The next 256 iterations involve only the first two 256-bit vectors for v and r. Similarly, we manipulate only the first 256-bit vectors for f and g in the final 256 iterations, and we manipulate only the first and second 256-bit vectors for f and g in the previous 256 iterations. Our main loop thus has five sections: 256 iterations involving (1, 1, 3, 3) vectors for (v, r, f, g); 256 iterations involving (2, 2, 3, 3) vectors; 375 iterations involving (3, 3, 3, 3) vectors; 256 iterations involving (3, 3, 2, 2) vectors; and 256 iterations involving (3, 3, 1, 1) vectors. This makes the code longer but saves almost 20% of the vector manipulations.

7.2. Comparison to the old computation. Many features of our computation are also visible in the software from [39]. We emphasize the ways that our computation is more streamlined.

There are essentially the same four polynomials v, r, f, g in [39] (labeled "c", "b", "g", "f"). Instead of a single integer δ (plus a loop counter), there are four integers "degf", "degg", "k", and "done" (plus a loop counter). One cannot simply eliminate "degf" and "degg" in favor of their difference δ, since "done" is set when "degf" reaches 0, and "done" influences "k", which in turn influences the results of the computation: the final result is multiplied by x^{701-k}.

Multiplication by x^{701-k} is a conceptually simple matter of rotating by k positions within 701 positions and then reducing modulo x^700 + x^699 + · · · + x + 1, since x^701 - 1 is a multiple of x^700 + x^699 + · · · + x + 1. However, since k is a variable, the obvious method of rotating by k positions does not take constant time. [39] instead performs conditional rotations by 1 position, 2 positions, 4 positions, etc., where the condition bits are the bits of k.

The inputs and outputs are not reversed in [39]. This affects the power of x multiplied into the final results. Our approach skips the powers of x entirely.

Each polynomial in [39] is represented as six 256-bit vectors, with the five-section pattern skipping manipulations of some vectors as explained above. The order of coefficients in each vector is simply x^0, x^1, . . .; multiplication by x in [39] uses more arithmetic instructions than our approach does.

At a lower level, [39] uses unsigned representatives 0, 1, 2 of Z/3, and uses a manually designed sequence of 19 bit operations for a multiply-add operation in Z/3. Langley [43] suggested a sequence of 16 bit operations, or 14 if an "and-not" operation is available. As in [20], we use just 9 bit operations.

The inversion software from [39] is 798 lines (32273 bytes) of Python generating assembly, producing 26634 bytes of object code. Our inversion software is 541 lines (14675 bytes) of C with intrinsics, compiled with gcc 8.2.1 to generate assembly, producing 10545 bytes of object code.

7.3. More NTRU examples. More than 1/3 of the key-generation time in [39] is consumed by a second inversion, which replaces the modulus 3 with 2^13. This is handled in [39] with an initial inversion modulo 2, followed by 8 multiplications modulo 2^13 for a Newton iteration.^12 The initial inversion modulo 2 is handled in [39] by exponentiation,

^12 This use of Newton iteration follows the NTRU key-generation strategy suggested in [61], but we point out a difference in the details. All 8 multiplications in [39] are carried out modulo 2^13, while [61] instead follows the traditional precision doubling in Newton iteration: compute an inverse modulo 2^2, then 2^4, then 2^8 (or 2^7), then 2^13. Perhaps limiting the precision of the first 6 multiplications would save time in [39].


taking just 10332 cycles; the point here is that squaring modulo 2 is particularly efficient.

Much more inversion time is spent in key generation in sntrup4591761, a slightly larger cryptosystem from the NTRU Prime [14] family.^13 NTRU Prime uses a prime modulus, such as 4591 for sntrup4591761, to avoid concerns that a power-of-2 modulus could allow attacks (see generally [14]), but this modulus also makes arithmetic slower. Before our work, the key-generation software for sntrup4591761 took 6 million cycles, partly for inversion in (Z/3)[x]/(x^761 - x - 1) but mostly for inversion in (Z/4591)[x]/(x^761 - x - 1). We reimplemented both of these inversion steps, reducing the total key-generation time to just 940852 cycles.^14 Almost all of our inversion code modulo 3 is shared between sntrup4591761 and ntruhrss701.

^13 NTRU Prime is another round-2 NIST submission. One other NTRU-based encryption system, NTRUEncrypt [29], was submitted to the competition but has been merged into NTRU-HRSS for round 2; see [59]. For completeness we mention that NTRUEncrypt key generation took 980760 cycles for ntrukem443 and 2639756 cycles for ntrukem743. As far as we know, the NTRUEncrypt software was not designed to run in constant time.

^14 This is a median cycle count from the SUPERCOP benchmarking framework. There is considerable variation in the cycle counts, more than 3% between quartiles, presumably depending on the mapping from virtual addresses to physical addresses. There is no dependence on the secret input being inverted.

7.4. Does lattice-based cryptography need fast inversion? Lyubashevsky, Peikert, and Regev [46] published an NTRU alternative that avoids inversions.^15 However, this cryptosystem has slower encryption than NTRU, has slower decryption than NTRU, and appears to be covered by a 2010 patent [35] by Gaborit and Aguilar-Melchor. It is easy to envision applications that will (1) select NTRU and (2) benefit from faster inversions in NTRU key generation.^16

Lyubashevsky and Seiler have very recently announced "NTTRU" [47], a variant of NTRU with a ring chosen to allow very fast NTT-based (FFT-based) multiplication and division. This variant triggers many of the security concerns described in [14], but if the variant survives security analysis then it will eliminate concerns about key-generation speed in NTRU.

^15 The LPR cryptosystem is the basis for several round-2 NIST submissions: Kyber, LAC, NewHope, Round5, Saber, ThreeBears, and the "NTRU LPRime" option in NTRU Prime.

^16 Google, for example, selected NTRU-HRSS for CECPQ2, as noted above, with the following explanation: "Schemes with a quotient-style key (like HRSS) will probably have faster encap/decap operations at the cost of much slower key-generation. Since there will be many uses outside TLS where keys can be reused, this is interesting as long as the key-generation speed is still reasonable for TLS." See [42].

8 Definition of 2-adic division steps

Define divstep : Z × Z_2^* × Z_2 → Z × Z_2^* × Z_2 as follows:

    divstep(δ, f, g) =
        (1 - δ, g, (g - f)/2)               if δ > 0 and g is odd,
        (1 + δ, f, (g + (g mod 2)f)/2)      otherwise.

Note that g - f is even in the first case (since both f and g are odd), and g + (g mod 2)f is even in the second case (since f is odd).
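For reference, a direct Python transcription of this definition (our sketch; constant-time software of course avoids the data-dependent branch) is:

    def divstep(delta, f, g):
        assert f & 1                      # f must be odd
        if delta > 0 and g & 1:
            return 1 - delta, g, (g - f) // 2
        return 1 + delta, f, (g + (g & 1) * f) // 2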

8.1. Transition matrices. Write (δ_1, f_1, g_1) = divstep(δ, f, g). Then

    (f_1, g_1)^T = T(δ, f, g) (f, g)^T    and    (1, δ_1)^T = S(δ, f, g) (1, δ)^T,


where T : Z × Z_2^* × Z_2 → M_2(Z[1/2]) is defined by

    T(δ, f, g) =
        (  0     1  )
        ( -1/2  1/2 )                  if δ > 0 and g is odd,

        (      1        0  )
        ( (g mod 2)/2  1/2 )           otherwise,

and S : Z × Z_2^* × Z_2 → M_2(Z) is defined by

    S(δ, f, g) =
        ( 1   0 )
        ( 1  -1 )                      if δ > 0 and g is odd,

        ( 1   0 )
        ( 1   1 )                      otherwise.

Note that both T(δ, f, g) and S(δ, f, g) are defined entirely by δ, the bottom bit of f (which is always 1), and the bottom bit of g; they do not depend on the remaining bits of f and g.

8.2. Decomposition. This 2-adic divstep, like the power-series divstep, is a composition of two simpler functions. Specifically, it is a conditional swap that replaces (δ, f, g) with (-δ, g, -f) if δ > 0 and g is odd, followed by an elimination that replaces (δ, f, g) with (1 + δ, f, (g + (g mod 2)f)/2).

8.3. Sizes. Write (δ_1, f_1, g_1) = divstep(δ, f, g). If f and g are integers that fit into n + 1 bits two's-complement, in the sense that -2^n ≤ f < 2^n and -2^n ≤ g < 2^n, then also -2^n ≤ f_1 < 2^n and -2^n ≤ g_1 < 2^n. Similarly, if -2^n < f < 2^n and -2^n < g < 2^n, then -2^n < f_1 < 2^n and -2^n < g_1 < 2^n. Beware, however, that intermediate results such as g - f need an extra bit.
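A small plain-Python transcription of T and S (our own illustration, using Fraction for the 1/2 entries; the function names are ours) makes explicit that only δ and the bottom bits are inspected:

    from fractions import Fraction

    def transition_T(delta, f, g):
        half = Fraction(1, 2)
        if delta > 0 and g & 1:
            return ((0, 1), (-half, half))
        return ((1, 0), ((g & 1) * half, half))

    def transition_S(delta, f, g):
        if delta > 0 and g & 1:
            return ((1, 0), (1, -1))
        return ((1, 0), (1, 1))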

As in the polynomial case, an important feature of divstep is that iterating divstep on integer inputs eventually sends g to 0. Concretely, we will show that (D + o(1))n iterations are sufficient, where D = 2/(10 - log_2 633) = 2.882100569...; see Theorem 11.2 for exact bounds. The proof technique cannot do better than (d + o(1))n iterations, where d = 14/(15 - log_2(561 + √249185)) = 2.828339631.... Numerical evidence is consistent with the possibility that (d + o(1))n iterations are required in the worst case, which is what matters in the context of constant-time algorithms.

8.4. A non-functioning variant. Many variations are possible in the definition of divstep. However, some care is required, as the following variant illustrates.

Define posdivstep : Z × Z_2^* × Z_2 → Z × Z_2^* × Z_2 as follows:

    posdivstep(δ, f, g) =
        (1 - δ, g, (g + f)/2)               if δ > 0 and g is odd,
        (1 + δ, f, (g + (g mod 2)f)/2)      otherwise.

This is just like divstep, except that it eliminates the negation of f in the swapped case. It is not true that iterating posdivstep on integer inputs eventually sends g to 0. For example, posdivstep^2(1, 1, 1) = (1, 1, 1). For comparison, in the polynomial case we were free to negate f (or g) at any moment.

8.5. A centered variant. As another variant of divstep, we define cdivstep : Z × Z_2^* × Z_2 → Z × Z_2^* × Z_2 as follows:

    cdivstep(δ, f, g) =
        (1 - δ, g, (f - g)/2)               if δ > 0 and g is odd,
        (1 + δ, f, (g + (g mod 2)f)/2)      if δ = 0,
        (1 + δ, f, (g - (g mod 2)f)/2)      otherwise.


Iterating cdivstep produces the same intermediate results as a variable-time gcd algorithm introduced by Stehlé and Zimmermann in [62]. The analogous gcd algorithm for divstep is a new variant of the Stehlé–Zimmermann algorithm, and we analyze the performance of this variant to prove bounds on the performance of divstep; see Appendices E, F, and G. As an analogy, one of our proofs in the polynomial case relies on relating x-adic divstep to the Euclid–Stevin algorithm.

The extra case distinctions make each cdivstep iteration more complicated than each divstep iteration. Similarly, the Stehlé–Zimmermann algorithm is more complicated than the new variant. To understand the motivation for cdivstep, compare divstep^3(-2, f, g) to cdivstep^3(-2, f, g). One has divstep^3(-2, f, g) = (1, f, g_3) where

    8g_3 ∈ {g, g + f, g + 2f, g + 3f, g + 4f, g + 5f, g + 6f, g + 7f}.

One has cdivstep^3(-2, f, g) = (1, f, g'_3) where

    8g'_3 ∈ {g - 3f, g - 2f, g - f, g, g + f, g + 2f, g + 3f, g + 4f}.

It seems intuitively clear that the "centered" output g'_3 is smaller than the "uncentered" output g_3, that more generally cdivstep has smaller outputs than divstep, and that cdivstep thus uses fewer iterations than divstep.

However, our analyses do not support this intuition. All available evidence indicates that, surprisingly, the worst case of cdivstep uses almost 10% more iterations than the worst case of divstep. Concretely, our proof technique cannot do better than (c + o(1))n iterations for cdivstep, where c = 2/(log_2(√17 - 1) - 1) = 3.110510062..., and numerical evidence is consistent with the possibility that (c + o(1))n iterations are required.

8.6. A plus-or-minus variant. As yet another example, the following variant of divstep is essentially^17 the Brent–Kung "Algorithm PM" from [24]:

    pmdivstep(δ, f, g) =
        (-δ, g, (f - g)/2)        if δ ≥ 0 and (g - f) mod 4 = 0,
        (-δ, g, (f + g)/2)        if δ ≥ 0 and (g + f) mod 4 = 0,
        (δ, f, (g - f)/2)         if δ < 0 and (g - f) mod 4 = 0,
        (δ, f, (g + f)/2)         if δ < 0 and (g + f) mod 4 = 0,
        (1 + δ, f, g/2)           otherwise.

There are even more cases here than in cdivstep, although splitting out an initial conditional swap reduces the number of cases to 3. Another complication here is that the linear combinations used in pmdivstep depend on the bottom two bits of f and g, rather than just the bottom bit.

To understand the motivation for pmdivstep, note that after each addition or subtraction there are always at least two halvings, as in the "sliding window" approach to exponentiation. Again it seems intuitively clear that this reduces the number of iterations required. Consider, however, the following example (which is from [24, Theorem 3], modulo minor details): pmdivstep needs 3b - 1 iterations to reach g = 0 starting from (1, 1, 3 · 2^{b-2}). Specifically, the first b - 2 iterations produce

    (2, 1, 3 · 2^{b-3}), (3, 1, 3 · 2^{b-4}), . . . , (b - 2, 1, 3 · 2), (b - 1, 1, 3),

and then the next 2b + 1 iterations produce

    (1 - b, 3, 2), (2 - b, 3, 1), (2 - b, 3, 2), (3 - b, 3, 1), (3 - b, 3, 2), . . . ,
    (-1, 3, 1), (-1, 3, 2), (0, 3, 1), (0, 1, 2), (1, 1, 1), (-1, 1, 0).

^17 The Brent–Kung algorithm has an extra loop to allow the case that both inputs f, g are even. Compare [24, Figure 4] to the definition of pmdivstep.


Brent and Kung prove that ⌈cn⌉ systolic cells are enough for their algorithm, where c = 3.110510062... is the constant mentioned above, and they conjecture that (3 + o(1))n cells are enough, so presumably (3 + o(1))n iterations of pmdivstep are enough. Our analysis shows that fewer iterations are sufficient for divstep, even though divstep is simpler than pmdivstep.

9 Iterates of 2-adic division steps

The following results are analogous to the x-adic results in Section 4. We state these results separately to support verification.

Starting from (δ, f, g) ∈ Z × Z_2^* × Z_2, define (δ_n, f_n, g_n) = divstep^n(δ, f, g) ∈ Z × Z_2^* × Z_2, T_n = T(δ_n, f_n, g_n) ∈ M_2(Z[1/2]), and S_n = S(δ_n, f_n, g_n) ∈ M_2(Z).

Theorem 9.1. (f_n, g_n)^T = T_{n-1} · · · T_m (f_m, g_m)^T and (1, δ_n)^T = S_{n-1} · · · S_m (1, δ_m)^T if n ≥ m ≥ 0.

Proof. This follows from (f_{m+1}, g_{m+1})^T = T_m (f_m, g_m)^T and (1, δ_{m+1})^T = S_m (1, δ_m)^T. □

Theorem 9.2. If n > m ≥ 0 then

    T_{n-1} · · · T_m ∈ ( (1/2^{n-m-1})Z   (1/2^{n-m-1})Z )
                        ( (1/2^{n-m})Z     (1/2^{n-m})Z   ).

Proof. As in Theorem 4.3. □

Theorem 9.3. If n > m ≥ 0 then

    S_{n-1} · · · S_m ∈ ( 1                                          0        )
                        ( {2 - (n-m), . . . , n-m-2} ∪ {n-m}      {1, -1}  ).

Proof. As in Theorem 4.4. □

Theorem 9.4. Let m, t be nonnegative integers. Let (δ', f', g') be another element of Z × Z_2^* × Z_2; define (δ'_n, f'_n, g'_n) = divstep^n(δ', f', g'), T'_n = T(δ'_n, f'_n, g'_n), and S'_n = S(δ'_n, f'_n, g'_n). Assume that δ'_m = δ_m, f'_m ≡ f_m (mod 2^t), and g'_m ≡ g_m (mod 2^t). Then, for each integer n with m < n ≤ m + t: T'_{n-1} = T_{n-1}; S'_{n-1} = S_{n-1}; δ'_n = δ_n; f'_n ≡ f_n (mod 2^{t-(n-m-1)}); and g'_n ≡ g_n (mod 2^{t-(n-m)}).

Proof. Fix m, t and induct on n. Observe that (δ'_{n-1}, f'_{n-1} mod 2, g'_{n-1} mod 2) equals (δ_{n-1}, f_{n-1} mod 2, g_{n-1} mod 2):

• If n = m + 1: By assumption f'_m ≡ f_m (mod 2^t), and n ≤ m + t so t ≥ 1, so in particular f'_m mod 2 = f_m mod 2. Similarly g'_m mod 2 = g_m mod 2. By assumption δ'_m = δ_m.

• If n > m + 1: f'_{n-1} ≡ f_{n-1} (mod 2^{t-(n-m-2)}) by the inductive hypothesis, and n ≤ m + t so t - (n-m-2) ≥ 2, so in particular f'_{n-1} mod 2 = f_{n-1} mod 2; similarly g'_{n-1} ≡ g_{n-1} (mod 2^{t-(n-m-1)}) by the inductive hypothesis and t - (n-m-1) ≥ 1, so in particular g'_{n-1} mod 2 = g_{n-1} mod 2; and δ'_{n-1} = δ_{n-1} by the inductive hypothesis.

Now use the fact that T and S inspect only the bottom bit of each number to see that

    T'_{n-1} = T(δ'_{n-1}, f'_{n-1}, g'_{n-1}) = T(δ_{n-1}, f_{n-1}, g_{n-1}) = T_{n-1},
    S'_{n-1} = S(δ'_{n-1}, f'_{n-1}, g'_{n-1}) = S(δ_{n-1}, f_{n-1}, g_{n-1}) = S_{n-1},

as claimed. Multiply (1, δ'_{n-1})^T = (1, δ_{n-1})^T by S'_{n-1} = S_{n-1} to see that δ'_n = δ_n, as claimed.


    def truncate(f,t):
      if t == 0: return 0
      twot = 1 << (t-1)
      return ((f + twot) & (2*twot - 1)) - twot

    def divsteps2(n,t,delta,f,g):
      assert t >= n and n >= 0
      f,g = truncate(f,t),truncate(g,t)
      u,v,q,r = 1,0,0,1

      while n > 0:
        f = truncate(f,t)
        if delta > 0 and g&1: delta,f,g,u,v,q,r = -delta,g,-f,q,r,-u,-v
        g0 = g&1
        delta,g,q,r = 1+delta,(g+g0*f)/2,(q+g0*u)/2,(r+g0*v)/2
        n,t = n-1,t-1
        g = truncate(ZZ(g),t)

      M2Q = MatrixSpace(QQ,2)
      return delta,f,g,M2Q((u,v,q,r))

Figure 10.1: Algorithm divsteps2 to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). Inputs: n, t ∈ Z with 0 ≤ n ≤ t; δ ∈ Z; at least bottom t bits of f, g ∈ Z_2. Outputs: δ_n; bottom t bits of f_n if n = 0, or t - (n - 1) bits if n ≥ 1; bottom t - n bits of g_n; T_{n-1} · · · T_0.

By the inductive hypothesis, (T'_m, . . . , T'_{n-2}) = (T_m, . . . , T_{n-2}), and T'_{n-1} = T_{n-1}, so T'_{n-1} · · · T'_m = T_{n-1} · · · T_m. Abbreviate T_{n-1} · · · T_m as P. Then (f'_n, g'_n)^T = P (f'_m, g'_m)^T and (f_n, g_n)^T = P (f_m, g_m)^T by Theorem 9.1, so

    (f'_n - f_n, g'_n - g_n)^T = P (f'_m - f_m, g'_m - g_m)^T.

By Theorem 9.2,

    P ∈ ( (1/2^{n-m-1})Z   (1/2^{n-m-1})Z )
        ( (1/2^{n-m})Z     (1/2^{n-m})Z   ).

By assumption, f'_m - f_m and g'_m - g_m are multiples of 2^t. Multiply to see that f'_n - f_n is a multiple of 2^t/2^{n-m-1} = 2^{t-(n-m-1)}, as claimed, and g'_n - g_n is a multiple of 2^t/2^{n-m} = 2^{t-(n-m)}, as claimed. □

10 Fast computation of iterates of 2-adic division steps

We describe, just as in Section 5, two algorithms to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). We are given the bottom t bits of f, g, and compute f_n, g_n to t - n + 1, t - n bits respectively if n ≥ 1; see Theorem 9.4. We also compute the product of the relevant T matrices. We specify these algorithms in Figures 10.1–10.2 as functions divsteps2 and jumpdivsteps2 in Sage.

The first function, divsteps2, simply takes divsteps one after another, analogously to divstepsx. The second function, jumpdivsteps2, uses a straightforward divide-and-conquer strategy, analogously to jumpdivstepsx. More generally, j in jumpdivsteps2 can be taken anywhere between 1 and n - 1; this generalization includes divsteps2 as a special case.

As in Section 5, the Sage functions are not constant-time, but it is easy to build constant-time versions of the same computations. The comments on performance in Section 5 are applicable here, with integer arithmetic replacing polynomial arithmetic. The same basic optimization techniques are also applicable here.


    from divsteps2 import divsteps2, truncate

    def jumpdivsteps2(n,t,delta,f,g):
      assert t >= n and n >= 0
      if n <= 1: return divsteps2(n,t,delta,f,g)

      j = n//2

      delta,f1,g1,P1 = jumpdivsteps2(j,j,delta,f,g)
      f,g = P1*vector((f,g))
      f,g = truncate(ZZ(f),t-j),truncate(ZZ(g),t-j)

      delta,f2,g2,P2 = jumpdivsteps2(n-j,n-j,delta,f,g)
      f,g = P2*vector((f,g))
      f,g = truncate(ZZ(f),t-n+1),truncate(ZZ(g),t-n)

      return delta,f,g,P2*P1

Figure 10.2: Algorithm jumpdivsteps2 to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). Same inputs and outputs as in Figure 10.1.
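As an illustrative consistency check (our own example, assuming the two files above are loaded in Sage), the two functions can be compared directly; by construction they return identical outputs:

    # 100 divsteps on 2-adic inputs known to 300 bits
    delta1, f1, g1, P1 = divsteps2(100, 300, 1, 2^255 - 19, 5)
    delta2, f2, g2, P2 = jumpdivsteps2(100, 300, 1, 2^255 - 19, 5)
    assert (delta1, f1, g1, P1) == (delta2, f2, g2, P2)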

11 Fast integer gcd computation and modular inversion

This section presents two algorithms. If f is an odd integer and g is an integer then gcd2 computes gcd{f, g}. If also gcd{f, g} = 1 then recip2 computes the reciprocal of g modulo f. Figure 11.1 specifies both algorithms, and Theorem 11.2 justifies both algorithms.

The algorithms take the smallest nonnegative integer d such that |f| < 2^d and |g| < 2^d. More generally, one can take d as an extra input; the algorithms then take constant time if d is constant. Theorem 11.2 relies on f^2 + 4g^2 ≤ 5 · 2^{2d} (which is certainly true if |f| < 2^d and |g| < 2^d) but does not rely on d being minimal. One can further generalize to allow even f by first finding the number of shared powers of 2 in f and g, and then reducing to the odd case.

The algorithms then choose a positive integer m as a particular function of d, and call divsteps2 (which can be transparently replaced with jumpdivsteps2) to compute (δ_m, f_m, g_m) = divstep^m(1, f, g). The choice of m guarantees g_m = 0 and f_m = ± gcd{f, g} by Theorem 11.2.

The gcd2 algorithm uses input precision m + d to obtain d + 1 bits of f_m, i.e., to obtain the signed remainder of f_m modulo 2^{d+1}. This signed remainder is exactly f_m = ± gcd{f, g}, since the assumptions |f| < 2^d and |g| < 2^d imply |gcd{f, g}| < 2^d.

The recip2 algorithm uses input precision just m + 1 to obtain the signed remainder of f_m modulo 4. This algorithm assumes gcd{f, g} = 1, so f_m = ±1, so the signed remainder is exactly f_m. The algorithm also extracts the top-right corner v_m of the transition matrix, and multiplies by f_m, obtaining a reciprocal of g modulo f by Theorem 11.2.

Both gcd2 and recip2 are bottom-up algorithms, and in particular recip2 naturally obtains this reciprocal v_m f_m as a fraction with denominator 2^{m-1}. It multiplies by 2^{m-1} to obtain an integer, and then multiplies by ((f + 1)/2)^{m-1} modulo f to divide by 2^{m-1} modulo f. The reciprocal of 2^{m-1} can be computed ahead of time. Another way to compute this reciprocal is through Hensel lifting to compute f^{-1} (mod 2^{m-1}); for details and further techniques see Section 6.7.

We emphasize that recip2 assumes that its inputs are coprime: it does not notice if this assumption is violated. One can check the assumption by computing f_m to higher precision (as in gcd2), or by multiplying the supposed reciprocal by g and seeing whether


    from divsteps2 import divsteps2

    def iterations(d):
      return (49*d+80)//17 if d<46 else (49*d+57)//17

    def gcd2(f,g):
      assert f & 1
      d = max(f.nbits(),g.nbits())
      m = iterations(d)
      delta,fm,gm,P = divsteps2(m,m+d,1,f,g)
      return abs(fm)

    def recip2(f,g):
      assert f & 1
      d = max(f.nbits(),g.nbits())
      m = iterations(d)
      precomp = Integers(f)((f+1)/2)^(m-1)
      delta,fm,gm,P = divsteps2(m,m+1,1,f,g)
      V = sign(fm)*ZZ(P[0][1]*2^(m-1))
      return ZZ(V*precomp)

Figure 11.1: Algorithm gcd2 to compute gcd{f, g}; algorithm recip2 to compute the reciprocal of g modulo f when gcd{f, g} = 1. Both algorithms assume that f is odd.

the result is 1 modulo f. For comparison, the recip algorithm from Section 6 does notice if its polynomial inputs are coprime, using a quick test of δ_m.
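For example (our own illustration in a Sage session, using the Curve25519 prime from Section 12), Theorem 11.2 guarantees that the following checks succeed:

    p = 2^255 - 19
    assert gcd2(p, 5) == 1            # p is prime, so gcd{p, 5} = 1
    assert recip2(p, 5) * 5 % p == 1  # recip2 returns the inverse of 5 modulo p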

Theorem 11.2. Let f be an odd integer. Let g be an integer. Define (δ_n, f_n, g_n) = divstep^n(1, f, g), T_n = T(δ_n, f_n, g_n), and

    (u_n  v_n)
    (q_n  r_n)  =  T_{n-1} · · · T_0.

Let d be a real number. Assume that f^2 + 4g^2 ≤ 5 · 2^{2d}. Let m be an integer. Assume that m ≥ ⌊(49d + 80)/17⌋ if d < 46, and that m ≥ ⌊(49d + 57)/17⌋ if d ≥ 46. Then m ≥ 1; g_m = 0; f_m = ± gcd{f, g}; 2^{m-1} v_m ∈ Z; and 2^{m-1} v_m g ≡ 2^{m-1} f_m (mod f).

In particular, if gcd{f, g} = 1, then f_m = ±1, and dividing 2^{m-1} v_m f_m by 2^{m-1} modulo f produces the reciprocal of g modulo f.

Proof. First f^2 ≥ 1, so 1 ≤ f^2 + 4g^2 ≤ 5 · 2^{2d} < 2^{2d+7/3} since 5^3 < 2^7. Hence d > -7/6. If d < 46 then m ≥ ⌊(49d + 80)/17⌋ ≥ ⌊(49(-7/6) + 80)/17⌋ = 1. If d ≥ 46 then m ≥ ⌊(49d + 57)/17⌋ ≥ ⌊(49 · 46 + 57)/17⌋ = 135. Either way m ≥ 1 as claimed.

Define R_0 = f, R_1 = 2g, and G = gcd{R_0, R_1}. Then G = gcd{f, g} since f is odd. Define b = log_2 √(R_0^2 + R_1^2) = log_2 √(f^2 + 4g^2) ≥ 0. Then 2^{2b} = f^2 + 4g^2 ≤ 5 · 2^{2d}, so b ≤ d + log_4 5. We now split into three cases, in each case defining n as in Theorem G.6.

• If b ≤ 21: Define n = ⌊19b/7⌋. Then n ≤ ⌊49b/17⌋ ≤ ⌊49(d + log_4 5)/17⌋ ≤ ⌊(49d + 57)/17⌋ ≤ m, since 5^49 ≤ 4^57.

• If 21 < b ≤ 46: Define n = ⌊(49b + 23)/17⌋. If d ≥ 46 then n ≤ ⌊(49d + 23)/17⌋ ≤ ⌊(49d + 57)/17⌋ ≤ m. Otherwise d < 46, so n ≤ ⌊(49(d + log_4 5) + 23)/17⌋ ≤ ⌊(49d + 80)/17⌋ ≤ m.

• If b > 46: Define n = ⌊49b/17⌋. Then n ≤ ⌊49b/17⌋ ≤ ⌊49(d + log_4 5)/17⌋ ≤ ⌊(49d + 57)/17⌋ ≤ m.


In all cases n ≤ m.

Now divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0} by Theorem G.6, so divstep^m(1, R_0, R_1/2) ∈ Z × Z × {0}, so

    (δ_m, f_m, g_m) = divstep^m(1, f, g) = divstep^m(1, R_0, R_1/2) ∈ Z × {G, -G} × {0}

by Theorem G.3. Hence g_m = 0 as claimed, and f_m = ±G = ± gcd{f, g} as claimed.

Both u_m and v_m are in (1/2^{m-1})Z by Theorem 9.2. In particular 2^{m-1} v_m ∈ Z as claimed. Finally, f_m = u_m f + v_m g by Theorem 9.1, so 2^{m-1} f_m = 2^{m-1} u_m f + 2^{m-1} v_m g, and 2^{m-1} u_m ∈ Z, so 2^{m-1} f_m ≡ 2^{m-1} v_m g (mod f), as claimed. □

12 Software case study for integer modular inversion

As emphasized in Section 1, we have selected an inversion problem to be extremely favorable to Fermat's method. We use Intel platforms with 64-bit multipliers. We focus on the problem of inverting modulo the well-known Curve25519 prime p = 2^255 - 19. There has been a long line of Curve25519 implementation papers, starting with [10] in 2006; we compare to the latest speed records [54] (in assembly) for inversion modulo this prime.

Despite all these Fermat advantages, we achieve better speeds using our 2-adic division steps. As mentioned in Section 1, we take 10050 cycles, 8778 cycles, and 8543 cycles on Haswell, Skylake, and Kaby Lake, where [54] used 11854 cycles, 9301 cycles, and 8971 cycles. This section looks more closely at how the new computation works.

12.1. Details of the new computation. Theorem 11.2 guarantees that 738 divstep iterations suffice, since ⌊(49 · 255 + 57)/17⌋ = 738. To simplify the computation we actually use 744 = 12 · 62 iterations.

The decisions in the first 62 iterations are determined entirely by the bottom 62 bits of the inputs. We perform these iterations entirely in 64-bit words, apply the resulting 62-step transition matrix to the entire original inputs to jump to the result of 62 divstep iterations, and then repeat.

This is similar in two important ways to Lehmer's gcd computation from [44]. One of Lehmer's ideas, mentioned before, is to carry out some steps in limited precision. Another of Lehmer's ideas is for this limit to be a small constant. Our advantage over Lehmer's computation is that we have a completely regular constant-time data flow.

There are many other possibilities for which precision to use at each step. All of these possibilities can be described using the jumping framework described in Section 5.3, which applies to both the x-adic case and the 2-adic case. Figure 12.2 gives three examples of jump strategies. The first strategy computes each divstep in turn to full precision, as in the case study in Section 7. The second strategy is what we use in this case study. The third strategy illustrates an asymptotically faster choice of jump distances. The third strategy might be more efficient than the second strategy in hardware, but the second strategy seems optimal for 64-bit CPUs.

In more detail, we invert a 255-bit x modulo p as follows:

1. Set f = p, g = x, δ = 1, i = 1.

2. Set f_0 = f (mod 2^64), g_0 = g (mod 2^64).

3. Compute (δ', f_1, g_1) = divstep^62(δ, f_0, g_0) and obtain a scaled transition matrix T_i such that

       (T_i/2^62) (f_0, g_0)^T = (f_1, g_1)^T.

   The 63-bit signed entries of T_i fit into 64-bit registers. We call this step jump64divsteps2.

4. Compute (f, g) ← T_i (f, g)^T/2^62. Set δ = δ'.


Figure 12.2: Three jump strategies for 744 divstep iterations on 255-bit integers. Left strategy: j = 1, i.e., computing each divstep in turn to full precision. Middle strategy: j = 1 for ≤62 requested iterations, else j = 62; i.e., using 62 bottom bits to compute 62 iterations, jumping 62 iterations, using 62 new bottom bits to compute 62 more iterations, etc. Right strategy: j = 1 for 2 requested iterations, j = 2 for 3 or 4 requested iterations, j = 4 for 5 or 6 or 7 or 8 requested iterations, etc. Vertical axis: 0 on top through 744 on bottom, number of iterations. Horizontal axis (within each strategy): 0 on left through 254 on right, bit positions used in the computation.

5. Set i ← i + 1. Go back to step 3 if i ≤ 12.

6. Compute v mod p, where v is the top-right corner of T_12 T_11 · · · T_1:

   (a) Compute pair-products T_{2i} T_{2i-1}, with entries being 126-bit signed integers, which fit into two 64-bit limbs (radix 2^64).

   (b) Compute quad-products T_{4i} T_{4i-1} T_{4i-2} T_{4i-3}, with entries being 252-bit signed integers (four 64-bit limbs, radix 2^64).

   (c) At this point the three quad-products are converted into unsigned integers modulo p, divided into 4 64-bit limbs.

   (d) Compute final vector-matrix-vector multiplications modulo p.

7. Compute x^{-1} = sgn(f) · v · 2^{-744} (mod p), where 2^{-744} is precomputed.
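A plain-Python model of one jump64divsteps2 call (our sketch of steps 2–4, not the vectorized assembly; the name jump64divsteps2_model is ours, and the 63-bit size claim for the scaled entries is simply taken from the text above) looks as follows:

    def jump64divsteps2_model(delta, f, g):
        # 62 divsteps driven by the bottom bits of f (odd) and g; returns the new
        # delta and the scaled matrix 2^62 * T_61 ... T_0 with integer entries.
        u, v, q, r = 1, 0, 0, 1          # scaled matrix; scale is 2^i after i steps
        for _ in range(62):
            if delta > 0 and g & 1:      # conditional swap (no change of scale)
                delta, f, g, u, v, q, r = -delta, g, -f, q, r, -u, -v
            g0 = g & 1
            delta, g = 1 + delta, (g + g0 * f) >> 1   # elimination and halving
            q, r = q + g0 * u, r + g0 * v             # bottom row, rescaled by 2
            u, v = 2 * u, 2 * v                        # top row unchanged, rescaled by 2
        return delta, ((u, v), (q, r))

The caller applies the returned matrix to the full-size f and g and divides by 2^62, as in step 4.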

12.3. Other CPUs. Our advantage is larger on platforms with smaller multipliers. For example, we have written software to invert modulo 2^255 - 19 on an ARM Cortex-A7, obtaining a median cycle count of 35277 cycles. The best previous Cortex-A7 result we have found in the literature is 62648 cycles, reported by Fujii and Aranha [34] using Fermat's method. Internally, our software does repeated calls to jump32divsteps2, units


of 30 iterations, with the resulting transition matrix fitting inside 32-bit general-purpose registers.

12.4. Other primes. Nath and Sarkar present inversion speeds for 20 different special primes, 12 of which were considered in previous work. For concreteness we consider inversion modulo the double-size prime 2^511 - 187. Aranha–Barreto–Pereira–Ricardini [8] selected this prime for their "M-511" curve.

Nath and Sarkar report that their Fermat inversion software for 2^511 - 187 takes 72804 Haswell cycles, 47062 Skylake cycles, or 45014 Kaby Lake cycles. The slowdown from 2^255 - 19, more than 5× in each case, is easy to explain: the exponent is twice as long, producing almost twice as many multiplications; each multiplication handles twice as many bits; and multiplication cost per bit is not constant.

Our 511-bit software is solidly faster: under 30000 cycles on all of these platforms. Most of the cycles in our computation are in evaluating jump64divsteps2, and the number of calls to jump64divsteps2 scales linearly with the number of bits in the input. There is some overhead for multiplications, so a modulus of twice the size takes more than twice as long, but our scalability to large moduli is much better than the Fermat scalability.

Our advantage is also larger in applications that use "random" primes rather than special primes: we pay the extra cost for Montgomery multiplication only at the end of our computation, whereas Fermat inversion pays the extra cost in each step.

References

[1] — (no editor), Actes du congrès international des mathématiciens, tome 3, Gauthier-Villars Éditeur, Paris, 1971. See [41].

[2] — (no editor), CWIT 2011: 12th Canadian workshop on information theory, Kelowna, British Columbia, May 17–20, 2011, Institute of Electrical and Electronics Engineers, 2011. ISBN 978-1-4577-0743-8. See [51].

[3] Carlisle Adams, Jan Camenisch (editors), Selected areas in cryptography—SAC 2017, 24th international conference, Ottawa, ON, Canada, August 16–18, 2017, revised selected papers, Lecture Notes in Computer Science 10719, Springer, 2018. ISBN 978-3-319-72564-2. See [15].

[4] Carlos Aguilar Melchor, Nicolas Aragon, Slim Bettaieb, Loïc Bidoux, Olivier Blazy, Jean-Christophe Deneuville, Philippe Gaborit, Edoardo Persichetti, Gilles Zémor, Hamming Quasi-Cyclic (HQC) (2017). URL: https://pqc-hqc.org/doc/hqc-specification_2017-11-30.pdf. Citations in this document: §1.

[5] Alfred V. Aho (chairman), Proceedings of fifth annual ACM symposium on theory of computing, Austin, Texas, April 30–May 2, 1973, ACM, New York, 1973. See [53].

[6] Martin Albrecht, Carlos Cid, Kenneth G. Paterson, CJ Tjhai, Martin Tomlinson, NTS-KEM, "The main document submitted to NIST" (2017). URL: https://nts-kem.io. Citations in this document: §1.

[7] Alejandro Cabrera Aldaya, Cesar Pereida García, Luis Manuel Alvarez Tapia, Billy Bob Brumley, Cache-timing attacks on RSA key generation (2018). URL: https://eprint.iacr.org/2018/367. Citations in this document: §1.

[8] Diego F. Aranha, Paulo S. L. M. Barreto, Geovandro C. C. F. Pereira, Jefferson E. Ricardini, A note on high-security general-purpose elliptic curves (2013). URL: https://eprint.iacr.org/2013/647. Citations in this document: §12.4.


[9] Elwyn R. Berlekamp, Algebraic coding theory, McGraw-Hill, 1968. Citations in this document: §1.

[10] Daniel J. Bernstein, Curve25519: new Diffie-Hellman speed records, in PKC 2006 [70] (2006), 207–228. URL: https://cr.yp.to/papers.html#curve25519. Citations in this document: §1, §12.

[11] Daniel J. Bernstein, Fast multiplication and its applications, in [27] (2008), 325–384. URL: https://cr.yp.to/papers.html#multapps. Citations in this document: §5.5.

[12] Daniel J. Bernstein, Tung Chou, Tanja Lange, Ingo von Maurich, Rafael Misoczki, Ruben Niederhagen, Edoardo Persichetti, Christiane Peters, Peter Schwabe, Nicolas Sendrier, Jakub Szefer, Wen Wang, Classic McEliece: conservative code-based cryptography, "Supporting Documentation" (2017). URL: https://classic.mceliece.org/nist.html. Citations in this document: §1.

[13] Daniel J. Bernstein, Tung Chou, Peter Schwabe, McBits: fast constant-time code-based cryptography, in CHES 2013 [18] (2013), 250–272. URL: https://binary.cr.yp.to/mcbits.html. Citations in this document: §1, §1.

[14] Daniel J. Bernstein, Chitchanok Chuengsatiansup, Tanja Lange, Christine van Vredendaal, NTRU Prime: reducing attack surface at low cost, full version of [15] (2017). URL: https://ntruprime.cr.yp.to/papers.html. Citations in this document: §1, §1, §1.1, §7.3, §7.3, §7.4.

[15] Daniel J. Bernstein, Chitchanok Chuengsatiansup, Tanja Lange, Christine van Vredendaal, NTRU Prime: reducing attack surface at low cost, in SAC 2017 [3], abbreviated version of [14] (2018), 235–260.

[16] Daniel J. Bernstein, Niels Duif, Tanja Lange, Peter Schwabe, Bo-Yin Yang, High-speed high-security signatures, Journal of Cryptographic Engineering 2 (2012), 77–89; see also older version at CHES 2011. URL: https://ed25519.cr.yp.to/ed25519-20110926.pdf. Citations in this document: §1.

[17] Daniel J. Bernstein, Peter Schwabe, NEON crypto, in CHES 2012 [56] (2012), 320–339. URL: https://cr.yp.to/papers.html#neoncrypto. Citations in this document: §1, §1.

[18] Guido Bertoni, Jean-Sébastien Coron (editors), Cryptographic hardware and embedded systems—CHES 2013—15th international workshop, Santa Barbara, CA, USA, August 20–23, 2013, proceedings, Lecture Notes in Computer Science 8086, Springer, 2013. ISBN 978-3-642-40348-4. See [13].

[19] Adam W. Bojanczyk, Richard P. Brent, A systolic algorithm for extended GCD computation, Computers & Mathematics with Applications 14 (1987), 233–238. ISSN 0898-1221. URL: https://maths-people.anu.edu.au/~brent/pd/rpb096i.pdf. Citations in this document: §1.3, §1.3.

[20] Tomas J. Boothby, Robert W. Bradshaw, Bitslicing and the method of four Russians over larger finite fields (2009). URL: http://arxiv.org/abs/0901.1413. Citations in this document: §7.1, §7.2.

[21] Joppe W. Bos, Constant time modular inversion, Journal of Cryptographic Engineering 4 (2014), 275–281. URL: http://joppebos.com/files/CTInversion.pdf. Citations in this document: §1, §1, §1, §1, §1, §1.3.


[22] Richard P. Brent, Fred G. Gustavson, David Y. Y. Yun, Fast solution of Toeplitz systems of equations and computation of Padé approximants, Journal of Algorithms 1 (1980), 259–295. ISSN 0196-6774. URL: https://maths-people.anu.edu.au/~brent/pd/rpb059i.pdf. Citations in this document: §1.4, §1.4.

[23] Richard P. Brent, Hsiang-Tsung Kung, Systolic VLSI arrays for polynomial GCD computation, IEEE Transactions on Computers C-33 (1984), 731–736. URL: https://maths-people.anu.edu.au/~brent/pd/rpb073i.pdf. Citations in this document: §1.3, §1.3, §3.2.

[24] Richard P. Brent, Hsiang-Tsung Kung, A systolic VLSI array for integer GCD computation, ARITH-7 (1985), 118–125. URL: https://maths-people.anu.edu.au/~brent/pd/rpb077i.pdf. Citations in this document: §1.3, §1.3, §1.3, §1.3, §1.4, §8.6, §8.6, §8.6.

[25] Richard P. Brent, Paul Zimmermann, An O(M(n) log n) algorithm for the Jacobi symbol, in ANTS 2010 [37] (2010), 83–95. URL: https://arxiv.org/pdf/1004.2091.pdf. Citations in this document: §1.4.

[26] Duncan A. Buell (editor), Algorithmic number theory, 6th international symposium, ANTS-VI, Burlington, VT, USA, June 13–18, 2004, proceedings, Lecture Notes in Computer Science 3076, Springer, 2004. ISBN 3-540-22156-5. See [62].

[27] Joe P. Buhler, Peter Stevenhagen (editors), Surveys in algorithmic number theory, Mathematical Sciences Research Institute Publications 44, Cambridge University Press, New York, 2008. See [11].

[28] Wouter Castryck, Tanja Lange, Chloe Martindale, Lorenz Panny, Joost Renes, CSIDH: an efficient post-quantum commutative group action, in Asiacrypt 2018 [55] (2018), 395–427. URL: https://csidh.isogeny.org/csidh-20181118.pdf. Citations in this document: §1.

[29] Cong Chen, Jeffrey Hoffstein, William Whyte, Zhenfei Zhang, NIST PQ Submission: NTRUEncrypt: a lattice based encryption algorithm (2017). URL: https://csrc.nist.gov/Projects/Post-Quantum-Cryptography/Round-1-Submissions. Citations in this document: §7.3.

[30] Tung Chou, McBits revisited, in CHES 2017 [33] (2017), 213–231. URL: https://tungchou.github.io/mcbits. Citations in this document: §1.

[31] Benoît Daireaux, Véronique Maume-Deschamps, Brigitte Vallée, The Lyapunov tortoise and the dyadic hare, in [49] (2005), 71–94. URL: http://emis.ams.org/journals/DMTCS/pdfpapers/dmAD0108.pdf. Citations in this document: §E.5, §F.2.

[32] Jean Louis Dornstetter, On the equivalence between Berlekamp's and Euclid's algorithms, IEEE Transactions on Information Theory 33 (1987), 428–431. Citations in this document: §1.

[33] Wieland Fischer, Naofumi Homma (editors), Cryptographic hardware and embedded systems—CHES 2017—19th international conference, Taipei, Taiwan, September 25–28, 2017, proceedings, Lecture Notes in Computer Science 10529, Springer, 2017. ISBN 978-3-319-66786-7. See [30], [39].

[34] Hayato Fujii, Diego F. Aranha, Curve25519 for the Cortex-M4 and beyond, LATINCRYPT 2017, to appear (2017). URL: https://www.lasca.ic.unicamp.br/media/publications/paper39.pdf. Citations in this document: §1.1, §12.3.


[35] Philippe Gaborit, Carlos Aguilar Melchor, Cryptographic method for communicating confidential information, patent EP2537284B1 (2010). URL: https://patents.google.com/patent/EP2537284B1/en. Citations in this document: §7.4.

[36] Mariya Georgieva, Frédéric de Portzamparc, Toward secure implementation of McEliece decryption, in COSADE 2015 [48] (2015), 141–156. URL: https://eprint.iacr.org/2015/271. Citations in this document: §1.

[37] Guillaume Hanrot, François Morain, Emmanuel Thomé (editors), Algorithmic number theory, 9th international symposium, ANTS-IX, Nancy, France, July 19–23, 2010, proceedings, Lecture Notes in Computer Science 6197, 2010. ISBN 978-3-642-14517-9. See [25].

[38] Agnes E. Heydtmann, Jørn M. Jensen, On the equivalence of the Berlekamp-Massey and the Euclidean algorithms for decoding, IEEE Transactions on Information Theory 46 (2000), 2614–2624. Citations in this document: §1.

[39] Andreas Hülsing, Joost Rijneveld, John M. Schanck, Peter Schwabe, High-speed key encapsulation from NTRU, in CHES 2017 [33] (2017), 232–252. URL: https://eprint.iacr.org/2017/667. Citations in this document: §1, §1.1, §1.1, §1.3, §7, §7, §7.2, §7.2, §7.2, §7.2, §7.2, §7.2, §7.2, §7.2, §7.3, §7.3, §7.3, §7.3, §7.3.

[40] Burton S. Kaliski, The Montgomery inverse and its applications, IEEE Transactions on Computers 44 (1995), 1064–1065. Citations in this document: §1.

[41] Donald E. Knuth, The analysis of algorithms, in [1] (1971), 269–274. Citations in this document: §1, §1.4, §1.4.

[42] Adam Langley, CECPQ2 (2018). URL: https://www.imperialviolet.org/2018/12/12/cecpq2.html. Citations in this document: §7.4.

[43] Adam Langley, Add initial HRSS support (2018). URL: https://boringssl.googlesource.com/boringssl/+/7b935937b18215294e7dbf6404742855e3349092/crypto/hrss/hrss.c. Citations in this document: §7.2.

[44] Derrick H. Lehmer, Euclid's algorithm for large numbers, American Mathematical Monthly 45 (1938), 227–233. ISSN 0002-9890. URL: http://links.jstor.org/sici?sici=0002-9890(193804)45:4<227:EAFLN>2.0.CO;2-Y. Citations in this document: §1, §1.4, §12.1.

[45] Xianhui Lu, Yamin Liu, Dingding Jia, Haiyang Xue, Jingnan He, Zhenfei Zhang, LAC: lattice-based cryptosystems (2017). URL: https://csrc.nist.gov/Projects/Post-Quantum-Cryptography/Round-1-Submissions. Citations in this document: §1.

[46] Vadim Lyubashevsky, Chris Peikert, Oded Regev, On ideal lattices and learning with errors over rings, Journal of the ACM 60 (2013), Article 43, 35 pages. URL: https://eprint.iacr.org/2012/230. Citations in this document: §7.4.

[47] Vadim Lyubashevsky, Gregor Seiler, NTTRU: truly fast NTRU using NTT (2019). URL: https://eprint.iacr.org/2019/040. Citations in this document: §7.4.

[48] Stefan Mangard, Axel Y. Poschmann (editors), Constructive side-channel analysis and secure design—6th international workshop, COSADE 2015, Berlin, Germany, April 13–14, 2015, revised selected papers, Lecture Notes in Computer Science 9064, Springer, 2015. ISBN 978-3-319-21475-7. See [36].


[49] Conrado Martínez (editor), 2005 international conference on analysis of algorithms, papers from the conference (AofA'05) held in Barcelona, June 6–10, 2005, The Association Discrete Mathematics & Theoretical Computer Science (DMTCS), Nancy, 2005. URL: http://www.emis.ams.org/journals/DMTCS/proceedings/dmAD01ind.html. See [31].

[50] James Massey, Shift-register synthesis and BCH decoding, IEEE Transactions on Information Theory 15 (1969), 122–127. ISSN 0018-9448. Citations in this document: §1.

[51] Todd D. Mateer, On the equivalence of the Berlekamp-Massey and the Euclidean algorithms for algebraic decoding, in CWIT 2011 [2] (2011), 139–142. Citations in this document: §1.

[52] Niels Möller, On Schönhage's algorithm and subquadratic integer GCD computation, Mathematics of Computation 77 (2008), 589–607. URL: https://www.lysator.liu.se/~nisse/archive/S0025-5718-07-02017-0.pdf. Citations in this document: §1.4.

[53] Robert T. Moenck, Fast computation of GCDs, in STOC 1973 [5] (1973), 142–151. Citations in this document: §1.4, §1.4, §1.4, §1.4.

[54] Kaushik Nath, Palash Sarkar, Efficient inversion in (pseudo-)Mersenne prime order fields (2018). URL: https://eprint.iacr.org/2018/985. Citations in this document: §1.1, §12, §12.

[55] Thomas Peyrin, Steven D. Galbraith, Advances in cryptology—ASIACRYPT 2018—24th international conference on the theory and application of cryptology and information security, Brisbane, QLD, Australia, December 2–6, 2018, proceedings, part I, Lecture Notes in Computer Science 11272, Springer, 2018. ISBN 978-3-030-03325-5. See [28].

[56] Emmanuel Prouff, Patrick Schaumont (editors), Cryptographic hardware and embedded systems—CHES 2012—14th international workshop, Leuven, Belgium, September 9–12, 2012, proceedings, Lecture Notes in Computer Science 7428, Springer, 2012. ISBN 978-3-642-33026-1. See [17].

[57] Martin Roetteler, Michael Naehrig, Krysta M. Svore, Kristin E. Lauter, Quantum resource estimates for computing elliptic curve discrete logarithms, in ASIACRYPT 2017 [67] (2017), 241–270. URL: https://eprint.iacr.org/2017/598. Citations in this document: §1.1, §1.3.

[58] The Sage Developers (editor), SageMath, the Sage Mathematics Software System (Version 8.0), 2017. URL: https://www.sagemath.org. Citations in this document: §1.2, §1.2, §2.2, §5.

[59] John Schanck, Announcement of NTRU-HRSS-KEM and NTRUEncrypt merger (2018). URL: https://groups.google.com/a/list.nist.gov/d/msg/pqc-forum/SrFO_vK3xbI/mSmjY0HZCgAJ. Citations in this document: §7.3.

[60] Arnold Schönhage, Schnelle Berechnung von Kettenbruchentwicklungen, Acta Informatica 1 (1971), 139–144. ISSN 0001-5903. Citations in this document: §1, §1.4, §1.4.

[61] Joseph H. Silverman, Almost inverses and fast NTRU key creation, NTRU Cryptosystems (Technical Note #014) (1999). URL: https://assets.securityinnovation.com/static/downloads/NTRU/resources/NTRUTech014.pdf. Citations in this document: §7.3, §7.3.


[62] Damien Stehlé, Paul Zimmermann, A binary recursive gcd algorithm, in ANTS 2004 [26] (2004), 411–425. URL: https://perso.ens-lyon.fr/damien.stehle/BINARY.html. Citations in this document: §1.4, §1.4, §8.5, §E, §E.6, §E.6, §E.6, §F.2, §F.2, §F.2, §H, §H.

[63] Josef Stein, Computational problems associated with Racah algebra, Journal of Computational Physics 1 (1967), 397–405. Citations in this document: §1.

[64] Simon Stevin, L'arithmétique, Imprimerie de Christophle Plantin, 1585. URL: http://www.dwc.knaw.nl/pub/bronnen/Simon_Stevin-[II_B]_The_Principal_Works_of_Simon_Stevin_Mathematics.pdf. Citations in this document: §1.

[65] Volker Strassen, The computational complexity of continued fractions, in SYM-SAC 1981 [69] (1981), 51–67; see also newer version [66].

[66] Volker Strassen, The computational complexity of continued fractions, SIAM Journal on Computing 12 (1983), 1–27; see also older version [65]. ISSN 0097-5397. Citations in this document: §1.4, §1.4, §5.3, §5.5.

[67] Tsuyoshi Takagi, Thomas Peyrin (editors), Advances in cryptology—ASIACRYPT 2017—23rd international conference on the theory and applications of cryptology and information security, Hong Kong, China, December 3–7, 2017, proceedings, part II, Lecture Notes in Computer Science 10625, Springer, 2017. ISBN 978-3-319-70696-2. See [57].

[68] Emmanuel Thomé, Subquadratic computation of vector generating polynomials and improvement of the block Wiedemann algorithm, Journal of Symbolic Computation 33 (2002), 757–775. URL: https://hal.inria.fr/inria-00103417/document. Citations in this document: §1.4.

[69] Paul S. Wang (editor), SYM-SAC '81: proceedings of the 1981 ACM Symposium on Symbolic and Algebraic Computation, Snowbird, Utah, August 5–7, 1981, Association for Computing Machinery, New York, 1981. ISBN 0-89791-047-8. See [65].

[70] Moti Yung, Yevgeniy Dodis, Aggelos Kiayias, Tal Malkin (editors), Public key cryptography—9th international conference on theory and practice in public-key cryptography, New York, NY, USA, April 24–26, 2006, proceedings, Lecture Notes in Computer Science 3958, Springer, 2006. ISBN 978-3-540-33851-2. See [10].

A Proof of main gcd theorem for polynomials

We give two separate proofs of the facts stated earlier relating gcd and modular reciprocal to a particular iterate of divstep in the polynomial case.

• The second proof consists of a review of the Euclid–Stevin gcd algorithm (Appendix B); a description of exactly how the intermediate results in the Euclid–Stevin algorithm relate to iterates of divstep (Appendix C); and finally a translation of gcd/reciprocal results from the Euclid–Stevin context to the divstep context (Appendix D).

• The first proof, given in this appendix, is shorter: it works directly with divstep, without a detour through the Euclid–Stevin labeling of intermediate results.


Theorem A.1. Let k be a field. Let d be a positive integer. Let R_0, R_1 be elements of the polynomial ring k[x] with deg R_0 = d > deg R_1. Define f = x^d R_0(1/x), g = x^{d-1} R_1(1/x), and (δ_n, f_n, g_n) = divstep^n(1, f, g) for n ≥ 0. Then

    2 deg f_n ≤ 2d - 1 - n + δ_n    for n ≥ 0,
    2 deg g_n ≤ 2d - 1 - n - δ_n    for n ≥ 0.

Proof. Induct on n. If n = 0 then (δ_n, f_n, g_n) = (1, f, g), so 2 deg f_n = 2 deg f ≤ 2d = 2d - 1 - n + δ_n and 2 deg g_n = 2 deg g ≤ 2(d - 1) = 2d - 1 - n - δ_n.

Assume from now on that n ≥ 1. By the inductive hypothesis, 2 deg f_{n-1} ≤ 2d - n + δ_{n-1} and 2 deg g_{n-1} ≤ 2d - n - δ_{n-1}.

Case 1: δ_{n-1} ≤ 0. Then

    (δ_n, f_n, g_n) = (1 + δ_{n-1}, f_{n-1}, (f_{n-1}(0) g_{n-1} - g_{n-1}(0) f_{n-1})/x).

Both f_{n-1} and g_{n-1} have degree at most (2d - n - δ_{n-1})/2, so the polynomial x g_n = f_{n-1}(0) g_{n-1} - g_{n-1}(0) f_{n-1} also has degree at most (2d - n - δ_{n-1})/2, so 2 deg g_n ≤ 2d - n - δ_{n-1} - 2 = 2d - 1 - n - δ_n. Also 2 deg f_n = 2 deg f_{n-1} ≤ 2d - 1 - n + δ_n.

Case 2: δ_{n-1} > 0 and g_{n-1}(0) = 0. Then

    (δ_n, f_n, g_n) = (1 + δ_{n-1}, f_{n-1}, f_{n-1}(0) g_{n-1}/x),

so 2 deg g_n = 2 deg g_{n-1} - 2 ≤ 2d - n - δ_{n-1} - 2 = 2d - 1 - n - δ_n. As before, 2 deg f_n = 2 deg f_{n-1} ≤ 2d - 1 - n + δ_n.

Case 3: δ_{n-1} > 0 and g_{n-1}(0) ≠ 0. Then

    (δ_n, f_n, g_n) = (1 - δ_{n-1}, g_{n-1}, (g_{n-1}(0) f_{n-1} - f_{n-1}(0) g_{n-1})/x).

This time the polynomial x g_n = g_{n-1}(0) f_{n-1} - f_{n-1}(0) g_{n-1} also has degree at most (2d - n + δ_{n-1})/2, so 2 deg g_n ≤ 2d - n + δ_{n-1} - 2 = 2d - 1 - n - δ_n. Also 2 deg f_n = 2 deg g_{n-1} ≤ 2d - n - δ_{n-1} = 2d - 1 - n + δ_n. □

Theorem A.2. In the situation of Theorem A.1, define T_n = T(δ_n, f_n, g_n) and

    (u_n  v_n)
    (q_n  r_n)  =  T_{n-1} · · · T_0.

Then x^n(u_n r_n - v_n q_n) ∈ k^* for n ≥ 0; x^{n-1} u_n ∈ k[x] for n ≥ 1; x^{n-1} v_n, x^n q_n, x^n r_n ∈ k[x] for n ≥ 0; and

    2 deg(x^{n-1} u_n) ≤ n - 3 + δ_n    for n ≥ 1,
    2 deg(x^{n-1} v_n) ≤ n - 1 + δ_n    for n ≥ 0,
    2 deg(x^n q_n) ≤ n - 1 - δ_n        for n ≥ 0,
    2 deg(x^n r_n) ≤ n + 1 - δ_n        for n ≥ 0.

Proof. Each matrix T_n has determinant in (1/x)k^* by construction, so the product of n such matrices has determinant in (1/x^n)k^*. In particular u_n r_n - v_n q_n ∈ (1/x^n)k^*.

Induct on n. For n = 0 we have δ_n = 1, u_n = 1, v_n = 0, q_n = 0, r_n = 1, so x^{n-1} v_n = 0 ∈ k[x], x^n q_n = 0 ∈ k[x], x^n r_n = 1 ∈ k[x], 2 deg(x^{n-1} v_n) = -∞ ≤ n - 1 + δ_n, 2 deg(x^n q_n) = -∞ ≤ n - 1 - δ_n, and 2 deg(x^n r_n) = 0 ≤ n + 1 - δ_n.

Assume from now on that n ≥ 1. By Theorem 4.3, u_n, v_n ∈ (1/x^{n-1})k[x] and q_n, r_n ∈ (1/x^n)k[x], so all that remains is to prove the degree bounds.

Case 1: δ_{n-1} ≤ 0. Then δ_n = 1 + δ_{n-1}, u_n = u_{n-1}, v_n = v_{n-1}, q_n = (f_{n-1}(0) q_{n-1} - g_{n-1}(0) u_{n-1})/x, and r_n = (f_{n-1}(0) r_{n-1} - g_{n-1}(0) v_{n-1})/x.


Note that n > 1 since δ_0 > 0. By the inductive hypothesis, 2 deg(x^{n-2} u_{n-1}) ≤ n - 4 + δ_{n-1}, so 2 deg(x^{n-1} u_n) ≤ n - 2 + δ_{n-1} = n - 3 + δ_n. Also 2 deg(x^{n-2} v_{n-1}) ≤ n - 2 + δ_{n-1}, so 2 deg(x^{n-1} v_n) ≤ n + δ_{n-1} = n - 1 + δ_n.

For q_n, note that both x^{n-1} q_{n-1} and x^{n-1} u_{n-1} have degree at most (n - 2 - δ_{n-1})/2, so the same is true for their k-linear combination x^n q_n. Hence 2 deg(x^n q_n) ≤ n - 2 - δ_{n-1} = n - 1 - δ_n. Similarly 2 deg(x^n r_n) ≤ n - δ_{n-1} = n + 1 - δ_n.

Case 2: δ_{n-1} > 0 and g_{n-1}(0) = 0. Then δ_n = 1 + δ_{n-1}, u_n = u_{n-1}, v_n = v_{n-1}, q_n = f_{n-1}(0) q_{n-1}/x, and r_n = f_{n-1}(0) r_{n-1}/x.

If n = 1 then u_n = u_0 = 1, so 2 deg(x^{n-1} u_n) = 0 = n - 3 + δ_n. Otherwise 2 deg(x^{n-2} u_{n-1}) ≤ n - 4 + δ_{n-1} by the inductive hypothesis, so 2 deg(x^{n-1} u_n) ≤ n - 3 + δ_n as above.

Also 2 deg(x^{n-1} v_n) ≤ n - 1 + δ_n as above; 2 deg(x^n q_n) = 2 deg(x^{n-1} q_{n-1}) ≤ n - 2 - δ_{n-1} = n - 1 - δ_n; and 2 deg(x^n r_n) = 2 deg(x^{n-1} r_{n-1}) ≤ n - δ_{n-1} = n + 1 - δ_n.

Case 3: δ_{n-1} > 0 and g_{n-1}(0) ≠ 0. Then δ_n = 1 - δ_{n-1}, u_n = q_{n-1}, v_n = r_{n-1}, q_n = (g_{n-1}(0) u_{n-1} - f_{n-1}(0) q_{n-1})/x, and r_n = (g_{n-1}(0) v_{n-1} - f_{n-1}(0) r_{n-1})/x.

If n = 1 then u_n = q_0 = 0, so 2 deg(x^{n-1} u_n) = -∞ ≤ n - 3 + δ_n. Otherwise 2 deg(x^{n-1} q_{n-1}) ≤ n - 2 - δ_{n-1} by the inductive hypothesis, so 2 deg(x^{n-1} u_n) ≤ n - 2 - δ_{n-1} = n - 3 + δ_n. Similarly 2 deg(x^{n-1} r_{n-1}) ≤ n - δ_{n-1} by the inductive hypothesis, so 2 deg(x^{n-1} v_n) ≤ n - δ_{n-1} = n - 1 + δ_n.

Finally, for q_n, note that both x^{n-1} u_{n-1} and x^{n-1} q_{n-1} have degree at most (n - 2 + δ_{n-1})/2, so the same is true for their k-linear combination x^n q_n, so 2 deg(x^n q_n) ≤ n - 2 + δ_{n-1} = n - 1 - δ_n. Similarly 2 deg(x^n r_n) ≤ n + δ_{n-1} = n + 1 - δ_n. □

First proof of Theorem 62 Write ϕ = f2dminus1 By Theorem A1 2 degϕ le δ2dminus1 and2 deg g2dminus1 le minusδ2dminus1 Each fn is nonzero by construction so degϕ ge 0 so δ2dminus1 ge 0

By Theorem A2 x2dminus2u2dminus1 is a polynomial of degree at most dminus 2 + δ2dminus12 DefineC0 = xminusd+δ2dminus12u2dminus1(1x) then C0 is a polynomial of degree at most dminus 2 + δ2dminus12Similarly define C1 = xminusd+1+δ2dminus12v2dminus1(1x) and observe that C1 is a polynomial ofdegree at most dminus 1 + δ2dminus12

Multiply C0 by R0 = xdf(1x) multiply C1 by R1 = xdminus1g(1x) and add to obtain

C0R0 + C1R1 = xδ2dminus12(u2dminus1(1x)f(1x) + v2dminus1(1x)g(1x))= xδ2dminus12ϕ(1x)

by Theorem 42Define Γ as the monic polynomial xδ2dminus12ϕ(1x)ϕ(0) of degree δ2dminus12 We have

just shown that Γ isin R0k[x] + R1k[x] specifically Γ = (C0ϕ(0))R0 + (C1ϕ(0))R1 Tofinish the proof we will show that R0 isin Γk[x] This forces Γ = gcdR0 R1 = G which inturn implies degG = δ2dminus12 and V = C1ϕ(0)

Observe that δ2d = 1 + δ2dminus1 and f2d = ϕ the ldquoswappedrdquo case is impossible here sinceδ2dminus1 gt 0 implies g2dminus1 = 0 By Theorem A1 again 2 deg g2d le minus1minus δ2d = minus2minus δ2dminus1so g2d = 0

By Theorem 42 ϕ = f2d = u2df + v2dg and 0 = g2d = q2df + r2dg so x2dr2dϕ = ∆fwhere ∆ = x2d(u2dr2dminus v2dq2d) By Theorem A2 ∆ isin klowast and p = x2dr2d is a polynomialof degree at most (2d+ 1minus δ2d)2 = dminus δ2dminus12 The polynomials xdminusδ2dminus12p(1x) andxδ2dminus12ϕ(1x) = Γ have product ∆xdf(1x) = ∆R0 so Γ divides R0 as claimed

B Review of the EuclidndashStevin algorithmThis appendix reviews the EuclidndashStevin algorithm to compute polynomial gcd includingthe extended version of the algorithm that writes each remainder as a linear combinationof the two inputs

38 Fast constant-time gcd computation and modular inversion

Theorem B1 (the EuclidndashStevin algorithm) Let k be a field Let R0 R1 be ele-ments of the polynomial ring k[x] Recursively define a finite sequence of polynomialsR2 R3 Rr isin k[x] as follows if i ge 1 and Ri = 0 then r = i if i ge 1 and Ri 6= 0 thenRi+1 = Riminus1 mod Ri Define di = degRi and `i = Ri[di] Then

bull d1 gt d2 gt middot middot middot gt dr = minusinfin

bull `i 6= 0 if 1 le i le r minus 1

bull if `rminus1 6= 0 then gcdR0 R1 = Rrminus1`rminus1 and

bull if `rminus1 = 0 then R0 = R1 = gcdR0 R1 = 0 and r = 1

Proof If 1 le i le r minus 1 then Ri 6= 0 so `i = Ri[degRi] 6= 0 Also Ri+1 = Riminus1 mod Ri sodegRi+1 lt degRi ie di+1 lt di Hence d1 gt d2 gt middot middot middot gt dr and dr = degRr = deg 0 =minusinfin

If `rminus1 = 0 then r must be 1 so R1 = 0 (since Rr = 0) and R0 = 0 (since `0 = 0) sogcdR0 R1 = 0 Assume from now on that `rminus1 6= 0 The divisibility of Ri+1 minus Riminus1by Ri implies gcdRiminus1 Ri = gcdRi Ri+1 so gcdR0 R1 = gcdR1 R2 = middot middot middot =gcdRrminus1 Rr = gcdRrminus1 0 = Rrminus1`rminus1

Theorem B2 (the extended EuclidndashStevin algorithm) In the situation of Theorem B1define U0 V0 Ur Ur isin k[x] as follows U0 = 1 V0 = 0 U1 = 0 V1 = 1 Ui+1 = Uiminus1minusbRiminus1RicUi for i isin 1 r minus 1 Vi+1 = Viminus1 minus bRiminus1RicVi for i isin 1 r minus 1Then

bull Ri = UiR0 + ViR1 for i isin 0 r

bull UiVi+1 minus Ui+1Vi = (minus1)i for i isin 0 r minus 1

bull Vi+1Ri minus ViRi+1 = (minus1)iR0 for i isin 0 r minus 1 and

bull if d0 gt d1 then deg Vi+1 = d0 minus di for i isin 0 r minus 1

Proof U0R0 + V0R1 = 1R0 + 0R1 = R0 and U1R0 + V1R1 = 0R0 + 1R1 = R1 Fori isin 1 r minus 1 assume inductively that Riminus1 = Uiminus1R0+Viminus1R1 and Ri = UiR0+ViR1and writeQi = bRiminus1Ric Then Ri+1 = Riminus1 mod Ri = Riminus1minusQiRi = (Uiminus1minusQiUi)R0+(Viminus1 minusQiVi)R1 = Ui+1R0 + Vi+1R1

If i = 0 then UiVi+1minusUi+1Vi = 1 middot 1minus 0 middot 0 = 1 = (minus1)i For i isin 1 r minus 1 assumeinductively that Uiminus1Vi minus UiViminus1 = (minus1)iminus1 then UiVi+1 minus Ui+1Vi = Ui(Viminus1 minus QVi) minus(Uiminus1 minusQUi)Vi = minus(Uiminus1Vi minus UiViminus1) = (minus1)i

Multiply Ri = UiR0 + ViR1 by Vi+1 multiply Ri+1 = Ui+1R0 + Vi+1R1 by Ui+1 andsubtract to see that Vi+1Ri minus ViRi+1 = (minus1)iR0

Finally assume d0 gt d1 If i = 0 then deg Vi+1 = deg 1 = 0 = d0 minus di Fori isin 1 r minus 1 assume inductively that deg Vi = d0 minus diminus1 Then deg ViRi+1 lt d0since degRi+1 = di+1 lt diminus1 so deg Vi+1Ri = deg(ViRi+1 +(minus1)iR0) = d0 so deg Vi+1 =d0 minus di

C EuclidndashStevin results as iterates of divstepIf the EuclidndashStevin algorithm is written as a sequence of polynomial divisions andeach polynomial division is written as a sequence of coefficient-elimination steps theneach coefficient-elimination step corresponds to one application of divstep The exactcorrespondence is stated in Theorem C2 This generalizes the known BerlekampndashMasseyequivalence mentioned in Section 1

Theorem C1 is a warmup showing how a single polynomial division relates to asequence of division steps

Daniel J Bernstein and Bo-Yin Yang 39

Theorem C1 (polynomial division as a sequence of division steps) Let k be a fieldLet R0 R1 be elements of the polynomial ring k[x] Write d0 = degR0 and d1 = degR1Assume that d0 gt d1 ge 0 Then

divstep2d0minus2d1(1 xd0R0(1x) xd0minus1R1(1x))= (1 `d0minusd1minus1

0 xd1R1(1x) (`d0minusd1minus10 `1)d0minusd1+1xd1minus1R2(1x))

where `0 = R0[d0] `1 = R1[d1] and R2 = R0 mod R1

Proof Part 1 The first d0 minus d1 minus 1 iterations If δ isin 1 2 d0 minus d1 minus 1 thend0minusδ gt d1 so R1[d0minusδ] = 0 so the polynomial `δminus1

0 xd0minusδR1(1x) has constant coefficient0 The polynomial xd0R0(1x) has constant coefficient `0 Hence

divstep(δ xd0R0(1x) `δminus10 xd0minusδR1(1x)) = (1 + δ xd0R0(1x) `δ0xd0minusδminus1R1(1x))

by definition of divstep By induction

divstepd0minusd1minus1(1 xd0R0(1x) xd0minus1R1(1x))= (d0 minus d1 x

d0R0(1x) `d0minusd1minus10 xd1R1(1x))

= (d0 minus d1 xd0R0(1x) xd1Rprime1(1x))

where Rprime1 = `d0minusd1minus10 R1 Note that Rprime1[d1] = `prime1 where `prime1 = `d0minusd1minus1

0 `1

Part 2 The swap iteration and the last d0 minus d1 iterations Define

Zd0minusd1 = R0 mod xd0minusd1Rprime1Zd0minusd1+1 = R0 mod xd0minusd1minus1Rprime1

Z2d0minus2d1 = R0 mod Rprime1 = R0 mod R1 = R2

ThenZd0minusd1 = R0 minus (R0[d0]`prime1)xd0minusd1Rprime1

Zd0minusd1+1 = Zd0minusd1 minus (Zd0minusd1 [d0 minus 1]`prime1)xd0minusd1minus1Rprime1

Z2d0minus2d1 = Z2d0minus2d1minus1 minus (Z2d0minus2d1minus1[d1]`prime1)Rprime1

Substitute 1x for x to see that

xd0Zd0minusd1(1x) = xd0R0(1x)minus (R0[d0]`prime1)xd1Rprime1(1x)xd0minus1Zd0minusd1+1(1x) = xd0minus1Zd0minusd1(1x)minus (Zd0minusd1 [d0 minus 1]`prime1)xd1Rprime1(1x)

xd1Z2d0minus2d1(1x) = xd1Z2d0minus2d1minus1(1x)minus (Z2d0minus2d1minus1[d1]`prime1)xd1Rprime1(1x)

We will use these equations to show that the d0 minus d1 2d0 minus 2d1 iterates of divstepcompute `prime1xd0minus1Zd0minusd1 (`prime1)d0minusd1+1xd1minus1Z2d0minus2d1 respectively

The constant coefficients of xd0R0(1x) and xd1Rprime1(1x) are R0[d0] and `prime1 6= 0 respec-tively and `prime1xd0R0(1x)minusR0[d0]xd1Rprime1(1x) = `prime1x

d0Zd0minusd1(1x) so

divstep(d0 minus d1 xd0R0(1x) xd1Rprime1(1x))

= (1minus (d0 minus d1) xd1Rprime1(1x) `prime1xd0minus1Zd0minusd1(1x))

40 Fast constant-time gcd computation and modular inversion

The constant coefficients of xd1Rprime1(1x) and `prime1xd0minus1Zd0minusd1(1x) are `prime1 and `prime1Zd0minusd1 [d0minus1]respectively and

(`prime1)2xd0minus1Zd0minusd1(1x)minus `prime1Zd0minusd1 [d0 minus 1]xd1Rprime1(1x) = (`prime1)2xd0minus1Zd0minusd1+1(1x)

sodivstep(1minus (d0 minus d1) xd1Rprime1(1x) `prime1xd0minus1Zd0minusd1(1x))

= (2minus (d0 minus d1) xd1Rprime1(1x) (`prime1)2xd0minus2Zd0minusd1+1(1x))

Continue in this way to see that

divstepi(d0 minus d1 xd0R0(1x) xd1Rprime1(1x))

= (iminus (d0 minus d1) xd1Rprime1(1x) (`prime1)ixd0minusiZd0minusd1+iminus1(1x))

for 1 le i le d0 minus d1 + 1 In particular

divstep2d0minus2d1(1 xd0R0(1x) xd0minus1R1(1x))= divstepd0minusd1+1(d0 minus d1 x

d0R0(1x) xd1Rprime1(1x))= (1 xd1Rprime1(1x) (`prime1)d0minusd1+1xd1minus1R2(1x))= (1 `d0minusd1minus1

0 xd1R1(1x) (`d0minusd1minus10 `1)d0minusd1+1xd1minus1R2(1x))

as claimed

Theorem C2 In the situation of Theorem B2 assume that d0 gt d1 Define f =xd0R0(1x) and g = xd0minus1R1(1x) Then f g isin k[x] Define (δn fn gn) = divstepn(1 f g)and Tn = T (δn fn gn)

Part A If i isin 0 1 r minus 2 and 2d0 minus 2di le n lt 2d0 minus di minus di+1 then

δn = 1 + nminus (2d0 minus 2di) gt 0fn = anx

diRi(1x)gn = bnx

2d0minusdiminusnminus1Ri+1(1x)

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdiUi(1x) an(1x)d0minusdiminus1Vi(1x)bn(1x)nminus(d0minusdi)+1Ui+1(1x) bn(1x)nminus(d0minusdi)Vi+1(1x)

)for some an bn isin klowast

Part B If i isin 0 1 r minus 2 and 2d0 minus di minus di+1 le n lt 2d0 minus 2di+1 then

δn = 1 + nminus (2d0 minus 2di+1) le 0fn = anx

di+1Ri+1(1x)gn = bnx

2d0minusdi+1minusnminus1Zn(1x)

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdi+1Ui+1(1x) an(1x)d0minusdi+1minus1Vi+1(1x)bn(1x)nminus(d0minusdi+1)+1Xn(1x) bn(1x)nminus(d0minusdi+1)Yn(1x)

)for some an bn isin klowast where

Zn = Ri mod x2d0minus2di+1minusnRi+1

Xn = Ui minuslfloorRix

2d0minus2di+1minusnRi+1rfloorx2d0minus2di+1minusnUi+1

Yn = Vi minuslfloorRix

2d0minus2di+1minusnRi+1rfloorx2d0minus2di+1minusnVi+1

Daniel J Bernstein and Bo-Yin Yang 41

Part C If n ge 2d0 minus 2drminus1 then

δn = 1 + nminus (2d0 minus 2drminus1) gt 0fn = anx

drminus1Rrminus1(1x)gn = 0

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdrminus1Urminus1(1x) an(1x)d0minusdrminus1minus1Vrminus1(1x)bn(1x)nminus(d0minusdrminus1)+1Ur(1x) bn(1x)nminus(d0minusdrminus1)Vr(1x)

)for some an bn isin klowast

The proof also gives recursive formulas for an bn namely (a0 b0) = (1 1) forthe base case (an bn) = (bnminus1 gnminus1(0)anminus1) for the swapped case and (an bn) =(anminus1 fnminus1(0)bnminus1) for the unswapped case One can if desired augment divstep totrack an and bn these are the denominators eliminated in a ldquofraction-freerdquo computation

Proof By hypothesis degR0 = d0 gt d1 = degR1 Hence d0 is a nonnegative integer andf = R0[d0] +R0[d0 minus 1]x+ middot middot middot+R0[0]xd0 isin k[x] Similarly g = R1[d0 minus 1] +R1[d0 minus 2]x+middot middot middot+R1[0]xd0minus1 isin k[x] this is also true for d0 = 0 with R1 = 0 and g = 0

We induct on n and split the analysis of n into six cases There are three differentformulas (A B C) applicable to different ranges of n and within each range we considerthe first n separately For brevity we write Ri = Ri(1x) U i = Ui(1x) V i = Vi(1x)Xn = Xn(1x) Y n = Yn(1x) Zn = Zn(1x)

Case A1 n = 2d0 minus 2di for some i isin 0 1 r minus 2If n = 0 then i = 0 By assumption δn = 1 as claimed fn = f = xd0R0 so fn = anx

d0R0as claimed where an = 1 gn = g = xd0minus1R1 so gn = bnx

d0minus1R1 as claimed where bn = 1Also Ui = 1 Ui+1 = 0 Vi = 0 Vi+1 = 1 and d0 minus di = 0 so(

an(1x)d0minusdiU i an(1x)d0minusdiminus1V ibn(1x)d0minusdi+1U i+1 bn(1x)d0minusdiV i+1

)=(

1 00 1

)= Tnminus1 middot middot middot T0

as claimedOtherwise n gt 0 so i gt 0 Now 2d0minusdiminus1minusdi le nminus 1 lt 2d0minus 2di so by the inductive

hypothesis δnminus1 = 1 + (nminus 1)minus (2d0 minus 2di) = 0 fnminus1 = anminus1xdiRi for some anminus1 isin klowast

gnminus1 = bnminus1xdiZnminus1 for some bnminus1 isin klowast where Znminus1 = Riminus1 mod xRi and

Tnminus2 middot middot middot T0 =(

anminus1(1x)d0minusdiU i anminus1(1x)d0minusdiminus1V ibnminus1(1x)d0minusdiXnminus1 bnminus1(1x)d0minusdiminus1Y nminus1

)where Xn = Uiminus1 minus bRiminus1xRicxUi and Yn = Viminus1 minus bRiminus1xRicxVi

Note that degZnminus1 le di Observe that

Ri+1 = Riminus1 mod Ri = Znminus1 minus (Znminus1[di]`i)RiUi+1 = Xnminus1 minus (Znminus1[di]`i)UiVi+1 = Ynminus1 minus (Znminus1[di]`i)Vi

The point is that subtracting (Znminus1[di]`i)Ri from Znminus1 eliminates the coefficient of xdi

and produces a polynomial of degree at most diminus 1 namely Znminus1 mod Ri = Riminus1 mod RiStarting from Ri+1 = Znminus1 minus (Znminus1[di]`i)Ri substitute 1x for x and multiply by

anminus1bnminus1`ixdi to see that

anminus1bnminus1`ixdiRi+1 = anminus1`ignminus1 minus bnminus1Znminus1[di]fnminus1

= fnminus1(0)gnminus1 minus gnminus1(0)fnminus1

42 Fast constant-time gcd computation and modular inversion

Now(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1)

= (1 + δnminus1 fnminus1 (fnminus1(0)gnminus1 minus gnminus1(0)fnminus1)x)

Hence δn = 1 = 1 + nminus (2d0 minus 2di) gt 0 as claimed fn = anxdiRi where an = anminus1 isin klowast

as claimed and gn = bnxdiminus1Ri+1 where bn = anminus1bnminus1`i isin klowast as claimed

Finally U i+1 = Xnminus1 minus (Znminus1[di]`i)U i and V i+1 = Y nminus1 minus (Znminus1[di]`i)V i Substi-tute these into the matrix(

an(1x)d0minusdiU i an(1x)d0minusdiminus1V ibn(1x)d0minusdi+1U i+1 bn(1x)d0minusdiV i+1

)

along with an = anminus1 and bn = anminus1bnminus1`i to see that this matrix is

Tnminus1 =(

1 0minusbnminus1Znminus1[di]x anminus1`ix

)times Tnminus2 middot middot middot T0 shown above

Case A2 2d0 minus 2di lt n lt 2d0 minus di minus di+1 for some i isin 0 1 r minus 2By the inductive hypothesis δnminus1 = n minus (2d0 minus 2di) gt 0 fnminus1 = anminus1x

diRi whereanminus1 isin klowast gnminus1 = bnminus1x

2d0minusdiminusnRi+1 where bnminus1 isin klowast and

Tnminus2 middot middot middot T0 =(

anminus1(1x)d0minusdiU i anminus1(1x)d0minusdiminus1V ibnminus1(1x)nminus(d0minusdi)U i+1 bnminus1(1x)nminus(d0minusdi)minus1V i+1

)

Note that gnminus1(0) = bnminus1Ri+1[2d0 minus di minus n] = 0 since 2d0 minus di minus n gt di+1 = degRi+1Thus

(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1) = (1 + δnminus1 fnminus1 fnminus1(0)gnminus1x)

Hence δn = 1 + n minus (2d0 minus 2di) gt 0 as claimed fn = anxdiRi where an = anminus1 isin klowast

as claimed and gn = bnx2d0minusdiminusnminus1Ri+1 where bn = fnminus1(0)bnminus1 = anminus1bnminus1`i isin klowast as

claimedAlso Tnminus1 =

(1 00 anminus1`ix

) Multiply by Tnminus2 middot middot middot T0 above to see that

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdiU i an(1x)d0minusdiminus1V ibn(1x)nminus(d0minusdi)+1U i+1 bn`i(1x)nminus(d0minusdi)V i+1

)as claimed

Case B1 n = 2d0 minus di minus di+1 for some i isin 0 1 r minus 2Note that 2d0 minus 2di le nminus 1 lt 2d0 minus di minus di+1 By the inductive hypothesis δnminus1 =

diminusdi+1 gt 0 fnminus1 = anminus1xdiRi where anminus1 isin klowast gnminus1 = bnminus1x

di+1Ri+1 where bnminus1 isin klowastand

Tnminus2 middot middot middot T0 =(

anminus1(1x)d0minusdiU i anminus1(1x)d0minusdiminus1V ibnminus1(1x)d0minusdi+1U i+1 bnminus1`i(1x)d0minusdi+1minus1V i+1

)

Observe thatZn = Ri minus (`i`i+1)xdiminusdi+1Ri+1

Xn = Ui minus (`i`i+1)xdiminusdi+1Ui+1

Yn = Vi minus (`i`i+1)xdiminusdi+1Vi+1

Substitute 1x for x in the Zn equation and multiply by anminus1bnminus1`i+1xdi to see that

anminus1bnminus1`i+1xdiZn = bnminus1`i+1fnminus1 minus anminus1`ignminus1

= gnminus1(0)fnminus1 minus fnminus1(0)gnminus1

Daniel J Bernstein and Bo-Yin Yang 43

Also note that gnminus1(0) = bnminus1`i+1 6= 0 so

(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1)= (1minus δnminus1 gnminus1 (gnminus1(0)fnminus1 minus fnminus1(0)gnminus1)x)

Hence δn = 1 minus (di minus di+1) le 0 as claimed fn = anxdi+1Ri+1 where an = bnminus1 isin klowast as

claimed and gn = bnxdiminus1Zn where bn = anminus1bnminus1`i+1 isin klowast as claimed

Finally substitute Xn = U i minus (`i`i+1)(1x)diminusdi+1U i+1 and substitute Y n = V i minus(`i`i+1)(1x)diminusdi+1V i+1 into the matrix(

an(1x)d0minusdi+1U i+1 an(1x)d0minusdi+1minus1V i+1bn(1x)d0minusdi+1Xn bn(1x)d0minusdiY n

)

along with an = bnminus1 and bn = anminus1bnminus1`i+1 to see that this matrix is Tnminus1 =(0 1

bnminus1`i+1x minusanminus1`ix

)times Tnminus2 middot middot middot T0 shown above

Case B2 2d0 minus di minus di+1 lt n lt 2d0 minus 2di+1 for some i isin 0 1 r minus 2Now 2d0minusdiminusdi+1 le nminus1 lt 2d0minus2di+1 By the inductive hypothesis δnminus1 = nminus(2d0minus

2di+1) le 0 fnminus1 = anminus1xdi+1Ri+1 for some anminus1 isin klowast gnminus1 = bnminus1x

2d0minusdi+1minusnZnminus1 forsome bnminus1 isin klowast where Znminus1 = Ri mod x2d0minus2di+1minusn+1Ri+1 and

Tnminus2 middot middot middot T0 =(

anminus1(1x)d0minusdi+1U i+1 anminus1(1x)d0minusdi+1minus1V i+1bnminus1(1x)nminus(d0minusdi+1)Xnminus1 bnminus1(1x)nminus(d0minusdi+1)minus1Y nminus1

)where Xnminus1 = Ui minus

lfloorRix

2d0minus2di+1minusn+1Ri+1rfloorx2d0minus2di+1minusn+1Ui+1 and Ynminus1 = Vi minuslfloor

Rix2d0minus2di+1minusn+1Ri+1

rfloorx2d0minus2di+1minusn+1Vi+1

Observe that

Zn = Ri mod x2d0minus2di+1minusnRi+1

= Znminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnRi+1

Xn = Xnminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnUi+1

Yn = Ynminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnVi+1

The point is that degZnminus1 le 2d0 minus di+1 minus n and subtracting

(Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnRi+1

from Znminus1 eliminates the coefficient of x2d0minusdi+1minusnStarting from

Zn = Znminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnRi+1

substitute 1x for x and multiply by anminus1bnminus1`i+1x2d0minusdi+1minusn to see that

anminus1bnminus1`i+1x2d0minusdi+1minusnZn = anminus1`i+1gnminus1 minus bnminus1Znminus1[2d0 minus di+1 minus n]fnminus1

= fnminus1(0)gnminus1 minus gnminus1(0)fnminus1

Now(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1)

= (1 + δnminus1 fnminus1 (fnminus1(0)gnminus1 minus gnminus1(0)fnminus1)x)

Hence δn = 1 +nminus (2d0minus 2di+1) le 0 as claimed fn = anxdi+1Ri+1 where an = anminus1 isin klowast

as claimed and gn = bnx2d0minusdi+1minusnminus1Zn where bn = anminus1bnminus1`i+1 isin klowast as claimed

44 Fast constant-time gcd computation and modular inversion

Finally substitute

Xn = Xnminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)(1x)2d0minus2di+1minusnU i+1

andY n = Y nminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)(1x)2d0minus2di+1minusnV i+1

into the matrix (an(1x)d0minusdi+1U i+1 an(1x)d0minusdi+1minus1V i+1

bn(1x)nminus(d0minusdi+1)+1Xn bn(1x)nminus(d0minusdi+1)Y n

)

along with an = anminus1 and bn = anminus1bnminus1`i+1 to see that this matrix is Tnminus1 =(1 0

minusbnminus1Znminus1[2d0 minus di+1 minus n]x anminus1`i+1x

)times Tnminus2 middot middot middot T0 shown above

Case C1 n = 2d0 minus 2drminus1 This is essentially the same as Case A1 except that i isreplaced with r minus 1 and gn ends up as 0 For completeness we spell out the details

If n = 0 then r = 1 and R1 = 0 so g = 0 By assumption δn = 1 as claimedfn = f = xd0R0 so fn = anx

d0R0 as claimed where an = 1 gn = g = 0 as claimed Definebn = 1 Now Urminus1 = 1 Ur = 0 Vrminus1 = 0 Vr = 1 and d0 minus drminus1 = 0 so(

an(1x)d0minusdrminus1Urminus1 an(1x)d0minusdrminus1minus1V rminus1bn(1x)d0minusdrminus1+1Ur bn(1x)d0minusdrminus1V r

)=(

1 00 1

)= Tnminus1 middot middot middot T0

as claimedOtherwise n gt 0 so r ge 2 Now 2d0 minus drminus2 minus drminus1 le n minus 1 lt 2d0 minus 2drminus1 so

by the inductive hypothesis (for i = r minus 2) δnminus1 = 1 + (n minus 1) minus (2d0 minus 2drminus1) = 0fnminus1 = anminus1x

drminus1Rrminus1 for some anminus1 isin klowast gnminus1 = bnminus1xdrminus1Znminus1 for some bnminus1 isin klowast

where Znminus1 = Rrminus2 mod xRrminus1 and

Tnminus2 middot middot middot T0 =(anminus1(1x)d0minusdrminus1Urminus1 anminus1(1x)d0minusdrminus1minus1V rminus1bnminus1(1x)d0minusdrminus1Xnminus1 bnminus1(1x)d0minusdrminus1minus1Y nminus1

)where Xn = Urminus2 minus bRrminus2xRrminus1cxUrminus1 and Yn = Vrminus2 minus bRrminus2xRrminus1cxVrminus1

Observe that

0 = Rr = Rrminus2 mod Rrminus1 = Znminus1 minus (Znminus1[drminus1]`rminus1)Rrminus1

Ur = Xnminus1 minus (Znminus1[drminus1]`rminus1)Urminus1

Vr = Ynminus1 minus (Znminus1[drminus1]`rminus1)Vrminus1

Starting from Znminus1 minus (Znminus1[drminus1]`rminus1)Rrminus1 = 0 substitute 1x for x and multiplyby anminus1bnminus1`rminus1x

drminus1 to see that

anminus1`rminus1gnminus1 minus bnminus1Znminus1[drminus1]fnminus1 = 0

ie fnminus1(0)gnminus1 minus gnminus1(0)fnminus1 = 0 Now

(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1) = (1 + δnminus1 fnminus1 0)

Hence δn = 1 = 1 + n minus (2d0 minus 2drminus1) gt 0 as claimed fn = anxdrminus1Rrminus1 where

an = anminus1 isin klowast as claimed and gn = 0 as claimedDefine bn = anminus1bnminus1`rminus1 isin klowast and substitute Ur = Xnminus1 minus (Znminus1[drminus1]`rminus1)Urminus1

and V r = Y nminus1 minus (Znminus1[drminus1]`rminus1)V rminus1 into the matrix(an(1x)d0minusdrminus1Urminus1(1x) an(1x)d0minusdrminus1minus1Vrminus1(1x)bn(1x)d0minusdrminus1+1Ur(1x) bn(1x)d0minusdrminus1Vr(1x)

)

Daniel J Bernstein and Bo-Yin Yang 45

to see that this matrix is Tnminus1 =(

1 0minusbnminus1Znminus1[drminus1]x anminus1`rminus1x

)times Tnminus2 middot middot middot T0

shown above

Case C2 n gt 2d0 minus 2drminus1By the inductive hypothesis δnminus1 = n minus (2d0 minus 2drminus1) fnminus1 = anminus1x

drminus1Rrminus1 forsome anminus1 isin klowast gnminus1 = 0 and

Tnminus2 middot middot middot T0 =(anminus1(1x)d0minusdrminus1Urminus1(1x) anminus1(1x)d0minusdrminus1minus1Vrminus1(1x)bnminus1(1x)nminus(d0minusdrminus1)Ur(1x) bnminus1(1x)nminus(d0minusdrminus1)minus1Vr(1x)

)for some bnminus1 isin klowast Hence δn = 1 + δn = 1 + n minus (2d0 minus 2drminus1) gt 0 fn = fnminus1 =

anxdrminus1Rrminus1 where an = anminus1 and gn = 0 Also Tnminus1 =

(1 00 anminus1`rminus1x

)so

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdrminus1Urminus1(1x) an(1x)d0minusdrminus1minus1Vrminus1(1x)bn(1x)nminus(d0minusdrminus1)+1Ur(1x) bn(1x)nminus(d0minusdrminus1)Vr(1x)

)where bn = anminus1bnminus1`rminus1 isin klowast

D Alternative proof of main gcd theorem for polynomialsThis appendix gives another proof of Theorem 62 using the relationship in Theorem C2between the EuclidndashStevin algorithm and iterates of divstep

Second proof of Theorem 62 Define R2 R3 Rr and di and `i as in Theorem B1 anddefine Ui and Vi as in Theorem B2 Then d0 = d gt d1 and all of the hypotheses ofTheorems B1 B2 and C2 are satisfied

By Theorem B1 G = gcdR0 R1 = Rrminus1`rminus1 (since R0 6= 0) so degG = drminus1Also by Theorem B2

Urminus1R0 + Vrminus1R1 = Rrminus1 = `rminus1G

so (Vrminus1`rminus1)R1 equiv G (mod R0) and deg Vrminus1 = d0 minus drminus2 lt d0 minus drminus1 = dminus degG soV = Vrminus1`rminus1

Case 1 drminus1 ge 1 Part C of Theorem C2 applies to n = 2dminus 1 since n = 2d0 minus 1 ge2d0 minus 2drminus1 In particular δn = 1 + n minus (2d0 minus 2drminus1) = 2drminus1 = 2 degG f2dminus1 =a2dminus1x

drminus1Rrminus1(1x) and v2dminus1 = a2dminus1(1x)d0minusdrminus1minus1Vrminus1(1x)Case 2 drminus1 = 0 Then r ge 2 and drminus2 ge 1 Part B of Theorem C2 applies to

i = r minus 2 and n = 2dminus 1 = 2d0 minus 1 since 2d0 minus di minus di+1 = 2d0 minus drminus2 le 2d0 minus 1 = n lt2d0 = 2d0 minus 2di+1 In particular δn = 1 + n minus (2d0 minus 2di+1) = 2drminus1 = 2 degG againf2dminus1 = a2dminus1x

drminus1Rrminus1(1x) and again v2dminus1 = a2dminus1(1x)d0minusdrminus1minus1Vrminus1(1x)In both cases f2dminus1(0) = a2dminus1`rminus1 so xdegGf2dminus1(1x)f2dminus1(0) = Rrminus1`rminus1 = G

and xminusd+1+degGv2dminus1(1x)f2dminus1(0) = Vrminus1`rminus1 = V

E A gcd algorithm for integersThis appendix presents a variable-time algorithm to compute the gcd of two integersThis appendix also explains how a modification to the algorithm produces the StehleacutendashZimmermann algorithm [62]

Theorem E1 Let e be a positive integer Let f be an element of Zlowast2 Let g be anelement of 2eZlowast2 Then there is a unique element q isin

1 3 5 2e+1 minus 1

such that

qg2e minus f isin 2e+1Z2

46 Fast constant-time gcd computation and modular inversion

We define f div2 g = q2e isin Q and we define f mod2 g = f minus qg2e isin 2e+1Z2 Thenf = (f div2 g)g + (f mod2 g) Note that e = ord2 g so we are not creating any ambiguityby omitting e from the f div2 g and f mod2 g notation

Proof First note that the quotient f(g2e) is odd Define q = (f(g2e)) mod 2e+1 thenq isin

1 3 5 2e+1 and qg2e minus f isin 2e+1Z2 by construction To see uniqueness note

that if qprime isin

1 3 5 2e+1 minus 1also satisfies qprimeg2e minus f isin 2e+1Z2 then (q minus qprime)g2e isin

2e+1Z2 so q minus qprime isin 2e+1Z2 since g2e isin Zlowast2 so q mod 2e+1 = qprime mod 2e+1 so q = qprime

Theorem E2 Let R0 be a nonzero element of Z2 Let R1 be an element of 2Z2Recursively define e0 e1 e2 e3 isin Z and R2 R3 isin 2Z2 as follows

bull If i ge 0 and Ri 6= 0 then ei = ord2 Ri

bull If i ge 1 and Ri 6= 0 then Ri+1 = minus((Riminus12eiminus1)mod2 Ri)2ei isin 2Z2

bull If i ge 1 and Ri = 0 then ei Ri+1 ei+1 Ri+2 ei+2 are undefined

Then (Ri2ei

Ri+1

)=(

0 12ei

minus12ei ((Riminus12eiminus1)div2 Ri)2ei

)(Riminus12eiminus1

Ri

)for each i ge 1 where Ri+1 is defined

Proof We have (Riminus12eiminus1)mod2 Ri = Riminus12eiminus1 minus ((Riminus12eiminus1)div2 Ri)Ri Negateand divide by 2ei to obtain the matrix equation

Theorem E3 Let R0 be an odd element of Z Let R1 be an element of 2Z Definee0 e1 e2 e3 and R2 R3 as in Theorem E2 Then Ri isin 2Z for each i ge 1 whereRi is defined Furthermore if t ge 0 and Rt+1 = 0 then |Rt2et | = gcdR0 R1

Proof Write g = gcdR0 R1 isin Z Then g is odd since R0 is oddWe first show by induction on i that Ri isin gZ for each i ge 0 For i isin 0 1 this follows

from the definition of g For i ge 2 we have Riminus2 isin gZ and Riminus1 isin gZ by the inductivehypothesis Now (Riminus22eiminus2)div2 Riminus1 has the form q2eiminus1 for q isin Z by definition ofdiv2 and 22eiminus1+eiminus2Ri = 2eiminus2qRiminus1 minus 2eiminus1Riminus2 by definition of Ri The right side ofthis equation is in gZ so 2middotmiddotmiddotRi is in gZ This implies first that Ri isin Q but also Ri isin Z2so Ri isin Z This also implies that Ri isin gZ since g is odd

By construction Ri isin 2Z2 for each i ge 1 and Ri isin Z for each i ge 0 so Ri isin 2Z foreach i ge 1

Finally assume that t ge 0 and Rt+1 = 0 Then all of R0 R1 Rt are nonzero Writegprime = |Rt2et | Then 2etgprime = |Rt| isin gZ and g is odd so gprime isin gZ

The equation 2eiminus1Riminus2 = 2eiminus2qRiminus1minus22eiminus1+eiminus2Ri shows that if Riminus1 Ri isin gprimeZ thenRiminus2 isin gprimeZ since gprime is odd By construction Rt Rt+1 isin gprimeZ so Rtminus1 Rtminus2 R1 R0 isingprimeZ Hence g = gcdR0 R1 isin gprimeZ Both g and gprime are positive so gprime = g

E4 The gcd algorithm Here is the algorithm featured in this appendixGiven an odd integer R0 and an even integer R1 compute the even integers R2 R3

defined above In other words for each i ge 1 where Ri 6= 0 compute ei = ord2 Rifind q isin

1 3 5 2ei+1 minus 1

such that qRi2ei minus Riminus12eiminus1 isin 2ei+1Z and compute

Ri+1 = (qRi2ei minusRiminus12eiminus1)2ei isin 2Z If Rt+1 = 0 for some t ge 0 output |Rt2et | andstop

We will show in Theorem F26 that this algorithm terminates The output is gcdR0 R1by Theorem E3 This algorithm is closely related to iterating divstep and we will usethis relationship to analyze iterates of divstep see Appendix G

More generally if R0 is an odd integer and R1 is any integer one can computegcdR0 R1 by using the above algorithm to compute gcdR0 2R1 Even more generally

Daniel J Bernstein and Bo-Yin Yang 47

one can compute the gcd of any two integers by first handling the case gcd0 0 = 0 thenhandling powers of 2 in both inputs then applying the odd-input algorithmE5 A non-functioning variant Imagine replacing the subtractions above withadditions find q isin

1 3 5 2ei+1 minus 1

such that qRi2ei +Riminus12eiminus1 isin 2ei+1Z and

compute Ri+1 = (qRi2ei + Riminus12eiminus1)2ei isin 2Z If R0 and R1 are positive then allRi are positive so the algorithm does not terminate This non-terminating algorithm isrelated to posdivstep (see Section 84) in the same way that the algorithm above is relatedto divstep

Daireaux Maume-Deschamps and Valleacutee in [31 Section 6] mentioned this algorithm asa ldquonon-centered LSB algorithmrdquo said that one can easily add a ldquosupplementary stoppingconditionrdquo to ensure termination and dismissed the resulting algorithm as being ldquocertainlyslower than the centered versionrdquoE6 A centered variant the StehleacutendashZimmermann algorithm Imagine usingcentered remainders 1minus 2e minus3minus1 1 3 2e minus 1 modulo 2e+1 rather than uncen-tered remainders

1 3 5 2e+1 minus 1

The above gcd algorithm then turns into the

StehleacutendashZimmermann gcd algorithm from [62]Specifically Stehleacute and Zimmermann begin with integers a b having ord2 a lt ord2 b

If b 6= 0 then ldquogeneralized binary divisionrdquo in [62 Lemma 1] defines q as the unique oddinteger with |q| lt 2ord2 bminusord2 a such that r = a+ qb2ord2 bminusord2 a is divisible by 2ord2 b+1The gcd algorithm in [62 Algorithm Elementary-GB] repeatedly sets (a b)larr (b r) Whenb = 0 the algorithm outputs the odd part of a

It is equivalent to set (a b)larr (b2ord2 b r2ord2 b) since this simply divides all subse-quent results by 2ord2 b In other words starting from an odd a0 and an even b0 recursivelydefine an odd ai+1 and an even bi+1 by the following formulas stopping when bi = 0bull ai+1 = bi2e where e = ord2 bi

bull bi+1 = (ai + qbi2e)2e isin 2Z for a unique q isin 1minus 2e minus3minus1 1 3 2e minus 1Now relabel a0 b0 b1 b2 b3 as R0 R1 R2 R3 R4 to obtain the following recursivedefinition Ri+1 = (qRi2ei +Riminus12eiminus1)2ei isin 2Z and thus(

Ri2ei

Ri+1

)=(

0 12ei

12ei q22ei

)(Riminus12eiminus1

Ri

)

for a unique q isin 1minus 2ei minus3minus1 1 3 2ei minus 1To summarize starting from the StehleacutendashZimmermann algorithm one can obtain the

gcd algorithm featured in this appendix by making the following three changes Firstdivide out 2ei instead of allowing powers of 2 to accumulate Second add Riminus12eiminus1

instead of subtracting it this does not affect the number of iterations for centered q Thirdswitch from centered q to uncentered q

Intuitively centered remainders are smaller than uncentered remainders and it iswell known that centered remainders improve the worst-case performance of Euclidrsquosalgorithm so one might expect that the StehleacutendashZimmermann algorithm performs betterthan our uncentered variant However as noted earlier our analysis produces the oppositeconclusion See Appendix F

F Performance of the gcd algorithm for integersThe possible matrices in Theorem E2 are(

0 12minus12 14

)

(0 12minus12 34

)for ei = 1(

0 14minus14 116

)

(0 14minus14 316

)

(0 14minus14 516

)

(0 14minus14 716

)for ei = 2

48 Fast constant-time gcd computation and modular inversion

etc In this appendix we show that a product of these matrices shrinks the input by a factorexponential in the weight

sumei This implies that

sumei in the algorithm of Appendix E is

O(b) where b is the number of input bits

F1 Lower bounds It is easy to see that each of these matrices has two eigenvaluesof absolute value 12ei This does not imply however that a weight-w product of thesematrices shrinks the input by a factor 12w Concretely the weight-7 product

147

(328 minus380minus158 233

)=(

0 14minus14 116

)(0 12minus12 34

)2( 0 12minus12 14

)(0 12minus12 34

)2

of 6 matrices has spectral radius (maximum absolute value of eigenvalues)

(561 +radic

249185)215 = 003235425826 = 061253803919 7 = 124949900586

The (w7)th power of this matrix is expressed as a weight-w product and has spectralradius 1207071286552w Consequently this proof strategy cannot obtain an O constantbetter than 7(15minus log2(561 +

radic249185)) = 1414169815 We prove a bound twice the

weight on the number of divstep iterations (see Appendix G) so this proof strategy cannotobtain an O constant better than 14(15 minus log2(561 +

radic249185)) = 2828339631 for

the number of divstep iterationsRotating the list of 6 matrices shown above gives 6 weight-7 products with the same

spectral radius In a small search we did not find any weight-w matrices with spectralradius above 061253803919 w Our upper bounds (see below) are 06181640625w for alllarge w

For comparison the non-terminating variant in Appendix E5 allows the matrix(0 12

12 34

) which has spectral radius 1 with eigenvector

(12

) The StehleacutendashZimmermann

algorithm reviewed in Appendix E6 allows the matrix(

0 1212 14

) which has spectral

radius (1 +radic

17)8 = 06403882032 so this proof strategy cannot obtain an O constantbetter than 2(log2(

radic17minus 1)minus 1) = 3110510062 for the number of cdivstep iterations

F2 A warning about worst-case inputs DefineM as the matrix 4minus7(

328 minus380minus158 233

)shown above Then Mminus3 =

(60321097 11336806047137246 88663112

) Applying our gcd algorithm to

inputs (60321097 47137246) applies a series of 18 matrices of total weight 21 in the

notation of Theorem E2 the matrix(

0 12minus12 34

)twice then the matrix

(0 12minus12 14

)

then the matrix(

0 12minus12 34

)twice then the weight-2 matrix

(0 14minus14 116

) all of

these matrices again and all of these matrices a third timeOne might think that these are worst-case inputs to our algorithm analogous to

a standard matrix construction of Fibonacci numbers as worst-case inputs to Euclidrsquosalgorithm We emphasize that these are not worst-case inputs the inputs (22293minus128330)are much smaller and nevertheless require 19 steps of total weight 23 We now explainwhy the analogy fails

The matrix M3 has an eigenvector (1minus053 ) with eigenvalue 1221middot0707 =12148 and an eigenvector (1 078 ) with eigenvalue 1221middot129 = 12271 so ithas spectral radius around 12148 Our proof strategy thus cannot guarantee a gainof more than 148 bits from weight 21 What we actually show below is that weight21 is guaranteed to gain more than 1397 bits (The quantity α21 in Figure F14 is133569231 = 2minus13972)

Daniel J Bernstein and Bo-Yin Yang 49

The matrix Mminus3 has spectral radius 2271 It is not surprising that each column ofMminus3 is close to 27 bits in size the only way to avoid this would be for the other eigenvectorto be pointing in almost exactly the same direction as (1 0) or (0 1) For our inputs takenfrom the first column of Mminus3 the algorithm gains nearly 27 bits from weight 21 almostdouble the guaranteed gain In short these are good inputs not bad inputs

Now replace M with the matrix F =(

0 11 minus1

) which reduces worst-case inputs to

Euclidrsquos algorithm The eigenvalues of F are (radic

5minus 1)2 = 12069 and minus(radic

5 + 1)2 =minus2069 The entries of powers of Fminus1 are Fibonacci numbers again with sizes wellpredicted by the spectral radius of Fminus1 namely 2069 However the spectral radius ofF is larger than 1 The reason that this large eigenvalue does not spoil the terminationof Euclidrsquos algorithm is that Euclidrsquos algorithm inspects sizes and chooses quotients todecrease sizes at each step

Stehleacute and Zimmermann in [62 Section 62] measure the performance of their algorithm

for ldquoworst caserdquo inputs18 obtained from negative powers of the matrix(

0 1212 14

) The

statement that these inputs are ldquoworst caserdquo is not justified On the contrary if these inputshave b bits then they take (073690 + o(1))b steps of the StehleacutendashZimmermann algorithmand thus (14738 + o(1))b cdivsteps which is considerably fewer steps than manyother inputs that we have tried19 for various values of b The quantity 073690 here islog(2) log((

radic17 + 1)2) coming from the smaller eigenvalue minus2(

radic17 + 1) = minus039038

of the above matrix

F3 Norms We use the standard definitions of the 2-norm of a real vector and the2-norm of a real matrix We summarize the basic theory of 2-norms here to keep thispaper self-contained

Definition F4 If v isin R2 then |v|2 is defined asradicv2

1 + v22 where v = (v1 v2)

Definition F5 If P isinM2(R) then |P |2 is defined as max|Pv|2 v isin R2 |v|2 = 1

This maximum exists becausev isin R2 |v|2 = 1

is a compact set (namely the unit

circle) and |Pv|2 is a continuous function of v See also Theorem F11 for an explicitformula for |P |2

Beware that the literature sometimes uses the notation |P |2 for the entrywise 2-normradicP 2

11 + P 212 + P 2

21 + P 222 We do not use entrywise norms of matrices in this paper

Theorem F6 Let v be an element of Z2 If v 6= 0 then |v|2 ge 1

Proof Write v as (v1 v2) with v1 v2 isin Z If v1 6= 0 then v21 ge 1 so v2

1 + v22 ge 1 If v2 6= 0

then v22 ge 1 so v2

1 + v22 ge 1 Either way |v|2 =

radicv2

1 + v22 ge 1

Theorem F7 Let v be an element of R2 Let r be an element of R Then |rv|2 = |r||v|2

Proof Write v as (v1 v2) with v1 v2 isin R Then rv = (rv1 rv2) so |rv|2 =radicr2v2

1 + r2v22 =radic

r2radicv2

1 + v22 = |r||v|2

18Specifically Stehleacute and Zimmermann claim that ldquothe worst case of the binary variantrdquo (ie of theirgcd algorithm) is for inputs ldquoGn and 2Gnminus1 where G0 = 0 G1 = 1 Gn = minusGnminus1 + 4Gnminus2 which givesall binary quotients equal to 1rdquo Daireaux Maume-Deschamps and Valleacutee in [31 Section 23] state thatStehleacute and Zimmermann ldquoexhibited the precise worst-case number of iterationsrdquo and give an equivalentcharacterization of this case

19For example ldquoGnrdquo from [62] is minus5467345 for n = 18 and 2135149 for n = 17 The inputs (minus5467345 2 middot2135149) use 17 iterations of the ldquoGBrdquo divisions from [62] while the smaller inputs (minus1287513 2middot622123) use24 ldquoGBrdquo iterations The inputs (1minus5467345 2135149) use 34 cdivsteps the inputs (1minus2718281 3141592)use 45 cdivsteps the inputs (1minus1287513 622123) use 58 cdivsteps

50 Fast constant-time gcd computation and modular inversion

Theorem F8 Let P be an element of M2(R) Let r be an element of R Then|rP |2 = |r||P |2

Proof By definition |rP |2 = max|rPv|2 v isin R2 |v|2 = 1

By Theorem F7 this is the

same as max|r||Pv|2 = |r|max|Pv|2 = |r||P |2

Theorem F9 Let P be an element of M2(R) Let v be an element of R2 Then|Pv|2 le |P |2|v|2

Proof If v = 0 then Pv = 0 and |Pv|2 = 0 = |P |2|v|2Otherwise |v|2 6= 0 Write w = v|v|2 Then |w|2 = 1 so |Pw|2 le |P |2 so |Pv|2 =

|Pw|2|v|2 le |P |2|v|2

Theorem F10 Let PQ be elements of M2(R) Then |PQ|2 le |P |2|Q|2

Proof If v isin R2 and |v|2 = 1 then |PQv|2 le |P |2|Qv|2 le |P |2|Q|2|v|2

Theorem F11 Let P be an element of M2(R) Assume that P lowastP =(a bc d

)where P lowast

is the transpose of P Then |P |2 =radic

a+d+radic

(aminusd)2+4b2

2

Proof P lowastP isin M2(R) is its own transpose so by the spectral theorem it has twoorthonormal eigenvectors with real eigenvalues In other words R2 has a basis e1 e2 suchthat

bull |e1|2 = 1

bull |e2|2 = 1

bull the dot product elowast1e2 is 0 (here we view vectors as column matrices)

bull P lowastPe1 = λ1e1 for some λ1 isin R and

bull P lowastPe2 = λ2e2 for some λ2 isin R

Any v isin R2 with |v|2 = 1 can be written uniquely as c1e1 + c2e2 for some c1 c2 isin R withc2

1 + c22 = 1 explicitly c1 = vlowaste1 and c2 = vlowaste2 Then P lowastPv = λ1c1e1 + λ2c2e2

By definition |P |22 is the maximum of |Pv|22 over v isin R2 with |v|2 = 1 Note that|Pv|22 = (Pv)lowast(Pv) = vlowastP lowastPv = λ1c

21 + λ2c

22 with c1 c2 defined as above For example

0 le |Pe1|22 = λ1 and 0 le |Pe2|22 = λ2 Thus |Pv|22 achieves maxλ1 λ2 It cannot belarger than this if c2

1 +c22 = 1 then λ1c

21 +λ2c

22 le maxλ1 λ2 Hence |P |22 = maxλ1 λ2

To finish the proof we make the spectral theorem explicit for the matrix P lowastP =(a bc d

)isinM2(R) Note that (P lowastP )lowast = P lowastP so b = c The characteristic polynomial of

this matrix is (xminus a)(xminus d)minus b2 = x2 minus (a+ d)x+ adminus b2 with roots

a+ dplusmnradic

(a+ d)2 minus 4(adminus b2)2 =

a+ dplusmnradic

(aminus d)2 + 4b2

2

The largest eigenvalue of P lowastP is thus (a+ d+radic

(aminus d)2 + 4b2)2 and |P |2 is the squareroot of this

Theorem F12 Let P be an element of M2(R) Assume that P lowastP =(a bc d

)where P lowast

is the transpose of P Let N be an element of R Then N ge |P |2 if and only if N ge 02N2 ge a+ d and (2N2 minus aminus d)2 ge (aminus d)2 + 4b2

Daniel J Bernstein and Bo-Yin Yang 51

earlybounds=011126894912^2037794112^2148808332^2251652192^206977232^2078823132^2483067332^239920452^22104392132^25112816812^25126890072^27138243032^28142578172^27156342292^29163862452^29179429512^31185834332^31197136532^32204328912^32211335692^31223282932^33238004212^35244892332^35256040592^36267388892^37271122152^35282767752^3729849732^36308292972^40312534432^39326254052^4133956252^39344650552^42352865672^42361759512^42378586372^4538656472^4239404692^4240247512^42412409172^46425934112^48433643372^48448890152^50455437912^5046418992^47472050052^50489977912^53493071912^52507544232^5451575272^51522815152^54536940732^56542122492^55552582732^56566360932^58577810812^59589529592^60592914752^59607185992^61618789972^62625348212^62633292852^62644043412^63659866332^65666035532^65

defalpha(w)ifwgt=67return(6331024)^wreturnearlybounds[w]

assertall(alpha(w)^49lt2^(-(34w-23))forwinrange(31100))assertmin((6331024)^walpha(w)forwinrange(68))==633^5(2^30165219)

Figure F14 Definition of αw for w isin 0 1 2 computer verificationthat αw lt 2minus(34wminus23)49 for 31 le w le 99 and computer verification thatmin(6331024)wαw 0 le w le 67 = 6335(230165219) If w ge 67 then αw =(6331024)w

Our upper-bound computations work with matrices with rational entries We useTheorem F12 to compare the 2-norms of these matrices to various rational bounds N Allof the computations here use exact arithmetic avoiding the correctness questions thatwould have shown up if we had used approximations to the square roots in Theorem F11Sage includes exact representations of algebraic numbers such as |P |2 but switching toTheorem F12 made our upper-bound computations an order of magnitude faster

Proof Assume thatN ge 0 that 2N2 ge a+d and that (2N2minusaminusd)2 ge (aminusd)2+4b2 Then2N2minusaminusd ge

radic(aminus d)2 + 4b2 since 2N2minusaminusd ge 0 so N2 ge (a+d+

radic(aminus d)2 + 4b2)2

so N geradic

(a+ d+radic

(aminus d)2 + 4b2)2 since N ge 0 so N ge |P |2 by Theorem F11Conversely assume that N ge |P |2 Then N ge 0 Also N2 ge |P |22 so 2N2 ge

2|P |22 = a + d +radic

(aminus d)2 + 4b2 by Theorem F11 In particular 2N2 ge a + d and(2N2 minus aminus d)2 ge (aminus d)2 + 4b2

F13 Upper bounds We now show that every weight-w product of the matrices inTheorem E2 has norm at most αw Here α0 α1 α2 is an exponentially decreasingsequence of positive real numbers defined in Figure F14

Definition F15 For e isin 1 2 and q isin

1 3 5 2e+1 minus 1 define Meq =(

0 12eminus12e q22e

)

52 Fast constant-time gcd computation and modular inversion

Theorem F16 |Meq|2 lt (1 +radic

2)2e if e isin 1 2 and q isin

1 3 5 2e+1 minus 1

Proof Define(a bc d

)= MlowasteqMeq Then a = 122e b = c = minusq23e and d = 122e +

q224e so a+ d = 222e + q224e lt 622e and (aminus d)2 + 4b2 = 4q226e + q428e le 3224eBy Theorem F12 |Meq|2 =

radic(a+ d+

radic(aminus d)2 + 4b2)2 le

radic(622e +

radic3222e)2 =

(1 +radic

2)2e

Definition F17 Define βw = infαw+jαj j ge 0 for w isin 0 1 2

Theorem F18 If w isin 0 1 2 then βw = minαw+jαj 0 le j le 67 If w ge 67then βw = (6331024)w6335(230165219)

The first statement reduces the computation of βw to a finite computation Thecomputation can be further optimized but is not a bottleneck for us Similar commentsapply to γwe in Theorem F20

Proof By definition αw = (6331024)w for all w ge 67 If j ge 67 then also w + j ge 67 soαw+jαj = (6331024)w+j(6331024)j = (6331024)w In short all terms αw+jαj forj ge 67 are identical so the infimum of αw+jαj for j ge 0 is the same as the minimum for0 le j le 67

If w ge 67 then αw+jαj = (6331024)w+jαj The quotient (6331024)jαj for0 le j le 67 has minimum value 6335(230165219)

Definition F19 Define γwe = infβw+j2j70169 j ge e

for w isin 0 1 2 and

e isin 1 2 3

This quantity γwe is designed to be as large as possible subject to two conditions firstγwe le βw+e2e70169 used in Theorem F21 second γwe le γwe+1 used in Theorem F22

Theorem F20 γwe = minβw+j2j70169 e le j le e+ 67

if w isin 0 1 2 and

e isin 1 2 3 If w + e ge 67 then γwe = 2e(6331024)w+e(70169)6335(230165219)

Proof If w + j ge 67 then βw+j = (6331024)w+j6335(230165219) so βw+j2j70169 =(6331024)w(633512)j(70169)6335(230165219) which increases monotonically with j

In particular if j ge e+ 67 then w + j ge 67 so βw+j2j70169 ge βw+e+672e+6770169Hence γwe = inf

βw+j2j70169 j ge e

= min

βw+j2j70169 e le j le e+ 67

Also if

w + e ge 67 then w + j ge 67 for all j ge e so

γwe =(

6331024

)w (633512

)e 70169

6335

230165219 = 2e(

6331024

)w+e 70169

6335

230165219

as claimed

Theorem F21 Define S as the smallest subset of ZtimesM2(Q) such that

bull(0(

1 00 1))isin S and

bull if (wP ) isin S and w = 0 or |P |2 gt βw and e isin 1 2 3 and |P |2 gt γwe andq isin

1 3 5 2e+1 minus 1

then (e+ wM(e q)P ) isin S

Assume that |P |2 le αw for each (wP ) isin S Then |M(ej qj) middot middot middotM(e1 q1)|2 le αe1+middotmiddotmiddot+ej

for all j isin 0 1 2 all e1 ej isin 1 2 3 and all q1 qj isin Z such thatqi isin

1 3 5 2ei+1 minus 1

for each i

Daniel J Bernstein and Bo-Yin Yang 53

The idea here is that instead of searching through all products P of matrices M(e q)of total weight w to see whether |P |2 le αw we prune the search in a way that makesthe search finite across all w (see Theorem F22) while still being sure to find a minimalcounterexample if one exists The first layer of pruning skips all descendants QP of P ifw gt 0 and |P |2 le βw the point is that any such counterexample QP implies a smallercounterexample Q The second layer of pruning skips weight-(w + e) children M(e q)P ofP if |P |2 le γwe the point is that any such child is eliminated by the first layer of pruning

Proof Induct on jIf j = 0 then M(ej qj) middot middot middotM(e1 q1) =

(1 00 1) with norm 1 = α0 Assume from now on

that j ge 1Case 1 |M(ei qi) middot middot middotM(e1 q1)|2 le βe1+middotmiddotmiddot+ei

for some i isin 1 2 jThen j minus i lt j so |M(ej qj) middot middot middotM(ei+1 qi+1)|2 le αei+1+middotmiddotmiddot+ej by the inductive

hypothesis so |M(ej qj) middot middot middotM(e1 q1)|2 le βe1+middotmiddotmiddot+eiαei+1+middotmiddotmiddot+ej by Theorem F10 butβe1+middotmiddotmiddot+ei

αei+1+middotmiddotmiddot+ejle αe1+middotmiddotmiddot+ej

by definition of βCase 2 |M(ei qi) middot middot middotM(e1 q1)|2 gt βe1+middotmiddotmiddot+ei for every i isin 1 2 jSuppose that |M(eiminus1 qiminus1) middot middot middotM(e1 q1)|2 le γe1+middotmiddotmiddot+eiminus1ei

for some i isin 1 2 jBy Theorem F16 |M(ei qi)|2 lt (1 +

radic2)2ei lt (16970)2ei By Theorem F10

|M(ei qi) middot middot middotM(e1 q1)|2 le |M(ei qi)|2γe1+middotmiddotmiddot+eiminus1ei le169γe1+middotmiddotmiddot+eiminus1ei

70 middot 2ei+1 le βe1+middotmiddotmiddot+ei

contradiction Thus |M(eiminus1 qiminus1) middot middot middotM(e1 q1)|2 gt γe1+middotmiddotmiddot+eiminus1eifor all i isin 1 2 j

Now (e1 + middot middot middot + eiM(ei qi) middot middot middotM(e1 q1)) isin S for each i isin 0 1 2 j Indeedinduct on i The base case i = 0 is simply

(0(

1 00 1))isin S If i ge 1 then (wP ) isin S by

the inductive hypothesis where w = e1 + middot middot middot+ eiminus1 and P = M(eiminus1 qiminus1) middot middot middotM(e1 q1)Furthermore

bull w = 0 (when i = 1) or |P |2 gt βw (when i ge 2 by definition of this case)

bull ei isin 1 2 3

bull |P |2 gt γwei(as shown above) and

bull qi isin

1 3 5 2ei+1 minus 1

Hence (e1 + middot middot middot+ eiM(ei qi) middot middot middotM(e1 q1)) = (ei + wM(ei qi)P ) isin S by definition of SIn particular (wP ) isin S with w = e1 + middot middot middot + ej and P = M(ej qj) middot middot middotM(e1 q1) so

|P |2 le αw by hypothesis

Theorem F22 In Theorem F21 S is finite and |P |2 le αw for each (wP ) isin S

Proof This is a computer proof We run the Sage script in Figure F23 We observe thatit terminates successfully (in slightly over 2546 minutes using Sage 86 on one core of a35GHz Intel Xeon E3-1275 v3 CPU) and prints 3787975117

What the script does is search depth-first through the elements of S verifying thateach (wP ) isin S satisfies |P |2 le αw The output is the number of elements searched so Shas at most 3787975117 elements

Specifically if (wP ) isin S then verify(wP) checks that |P |2 le αw and recursivelycalls verify on all descendants of (wP ) Actually for efficiency the script scales eachmatrix to have integer entries it works with 22eM(e q) (labeled scaledM(eq) in thescript) instead of M(e q) and it works with 4wP (labeled P in the script) instead of P

The script computes βw as shown in Theorem F18 computes γw as shown in Theo-rem F20 and compares |P |2 to various bounds as shown in Theorem F12

If w gt 0 and |P |2 le βw then verify(wP) returns There are no descendants of (wP )in this case Also |P |2 le αw since βw le αwα0 = αw

54 Fast constant-time gcd computation and modular inversion

fromalphaimportalphafrommemoizedimportmemoized

R=MatrixSpace(ZZ2)defscaledM(eq)returnR((02^e-2^eq))

memoizeddefbeta(w)returnmin(alpha(w+j)alpha(j)forjinrange(68))

memoizeddefgamma(we)returnmin(beta(w+j)2^j70169forjinrange(ee+68))

defspectralradiusisatmost(PPN)assumingPPhasformPtranspose()P(ab)(cd)=PPX=2N^2-a-dreturnNgt=0andXgt=0andX^2gt=(a-d)^2+4b^2

defverify(wP)nodes=1PP=Ptranspose()Pifwgt0andspectralradiusisatmost(PP4^wbeta(w))returnnodesassertspectralradiusisatmost(PP4^walpha(w))foreinPositiveIntegers()ifspectralradiusisatmost(PP4^wgamma(we))returnnodesforqinrange(12^(e+1)2)nodes+=verify(e+wscaledM(eq)P)

printverify(0R(1))

Figure F23 Sage script verifying |P |2 le αw for each (wP ) isin S Output is number ofpairs (wP ) considered if verification succeeds or an assertion failure if verification fails

Otherwise verify(wP) asserts that |P |2 le αw ie it checks that |P |2 le αw andraises an exception if not It then tries successively e = 1 e = 2 e = 3 etc continuingas long as |P |2 gt γwe For each e where |P |2 gt γwe verify(wP) recursively handles(e+ wM(e q)P ) for each q isin

1 3 5 2e+1 minus 1

Note that γwe le γwe+1 le middot middot middot so if |P |2 le γwe then also |P |2 le γwe+1 etc This iswhere it is important that γwe is not simply defined as βw+e2e70169

To summarize verify(wP) recursively enumerates all descendants of (wP ) Henceverify(0R(1)) recursively enumerates all elements (wP ) of S checking |P |2 le αw foreach (wP )

Theorem F24 If j isin 0 1 2 e1 ej isin 1 2 3 q1 qj isin Z and qi isin1 3 5 2ei+1 minus 1

for each i then |M(ej qj) middot middot middotM(e1 q1)|2 le αe1+middotmiddotmiddot+ej

Proof This is the conclusion of Theorem F21 so it suffices to check the assumptionsDefine S as in Theorem F21 then |P |2 le αw for each (wP ) isin S by Theorem F22

Daniel J Bernstein and Bo-Yin Yang 55

Theorem F25 In the situation of Theorem E3 assume that R0 R1 R2 Rj+1 aredefined Then |(Rj2ej Rj+1)|2 le αe1+middotmiddotmiddot+ej

|(R0 R1)|2 and if e1 + middot middot middot + ej ge 67 thene1 + middot middot middot+ ej le log1024633 |(R0 R1)|2

If R0 and R1 are each bounded by 2b in absolute value then |(R0 R1)|2 is bounded by2b+05 so log1024633 |(R0 R1)|2 le (b+ 05) log2(1024633) = (b+ 05)(14410 )

Proof By Theorem E2 (Ri2ei

Ri+1

)= M(ei qi)

(Riminus12eiminus1

Ri

)where qi = 2ei ((Riminus12eiminus1)div2 Ri) isin

1 3 5 2ei+1 minus 1

Hence(

Rj2ej

Rj+1

)= M(ej qj) middot middot middotM(e1 q1)

(R0R1

)

The product M(ej qj) middot middot middotM(e1 q1) has 2-norm at most αw where w = e1 + middot middot middot+ ej so|(Rj2ej Rj+1)|2 le αw|(R0 R1)|2 by Theorem F9

By Theorem F6 |(Rj2ej Rj+1)|2 ge 1 If e1 + middot middot middot + ej ge 67 then αe1+middotmiddotmiddot+ej =(6331024)e1+middotmiddotmiddot+ej so 1 le (6331024)e1+middotmiddotmiddot+ej |(R0 R1)|2

Theorem F26 In the situation of Theorem E3 there exists t ge 0 such that Rt+1 = 0

Proof Suppose not Then R0 R1 Rj Rj+1 are all defined for arbitrarily large j Takej ge 67 with j gt log1024633 |(R0 R1)|2 Then e1 + middot middot middot + ej ge 67 so j le e1 + middot middot middot + ej lelog1024633 |(R0 R1)|2 lt j by Theorem F25 contradiction

G Proof of main gcd theorem for integersThis appendix shows how the gcd algorithm from Appendix E is related to iteratingdivstep This appendix then uses the analysis from Appendix F to put bounds on thenumber of divstep iterations needed

Theorem G1 (2-adic division as a sequence of division steps) In the situation ofTheorem E1 divstep2e(1 f g2) = (1 g2e (qg2e minus f)2e+1)

In other words divstep2e(1 f g2) = (1 g2eminus(f mod2 g)2e+1)

Proof Defineze = (g2e minus f)2 isin Z2

ze+1 = (ze + (ze mod 2)g2e)2 isin Z2

ze+2 = (ze+1 + (ze+1 mod 2)g2e)2 isin Z2

z2e = (z2eminus1 + (z2eminus1 mod 2)g2e)2 isin Z2

Thenze = (g2e minus f)2

ze+1 = ((1 + 2(ze mod 2))g2e minus f)4ze+2 = ((1 + 2(ze mod 2) + 4(ze+1 mod 2))g2e minus f)8

z2e = ((1 + 2(ze mod 2) + middot middot middot+ 2e(z2eminus1 mod 2))g2e minus f)2e+1

56 Fast constant-time gcd computation and modular inversion

In short z2e = (qg2e minus f)2e+1 where

qprime = 1 + 2(ze mod 2) + middot middot middot+ 2e(z2eminus1 mod 2) isin

1 3 5 2e+1 minus 1

Hence qprimeg2eminusf = 2e+1z2e isin 2e+1Z2 so qprime = q by the uniqueness statement in Theorem E1Note that this construction gives an alternative proof of the existence of q in Theorem E1

All that remains is to prove divstep2e(1 f g2) = (1 g2e (qg2e minus f)2e+1) iedivstep2e(1 f g2) = (1 g2e z2e)

If 1 le i lt e then g2i isin 2Z2 so divstep(i f g2i) = (i + 1 f g2i+1) By in-duction divstepeminus1(1 f g2) = (e f g2e) By assumption e gt 0 and g2e is odd sodivstep(e f g2e) = (1 minus e g2e (g2e minus f)2) = (1 minus e g2e ze) What remains is toprove that divstepe(1minus e g2e ze) = (1 g2e z2e)

If 0 le i le eminus1 then 1minuse+i le 0 so divstep(1minuse+i g2e ze+i) = (2minuse+i g2e (ze+i+(ze+i mod 2)g2e)2) = (2minus e+ i g2e ze+i+1) By induction divstepi(1minus e g2e ze) =(1minus e+ i g2e ze+i) for 0 le i le e In particular divstepe(1minus e g2e ze) = (1 g2e z2e)as claimed

Theorem G2 (2-adic gcd as a sequence of division steps) In the situation of Theo-rem E3 assume that R0 R1 Rj Rj+1 are defined Then divstep2w(1 R0 R12) =(1 Rj2ej Rj+12) where w = e1 + middot middot middot+ ej

Therefore if Rt+1 = 0 then divstepn(1 R0 R12) = (1 + nminus 2wRt2et 0) for alln ge 2w where w = e1 + middot middot middot+ et

Proof Induct on j If j = 0 then w = 0 so divstep2w(1 R0 R12) = (1 R0 R12) andR0 = R02e0 since R0 is odd by hypothesis

If j ge 1 then by definition Rj+1 = minus((Rjminus12ejminus1)mod2 Rj)2ej Furthermore

divstep2e1+middotmiddotmiddot+2ejminus1(1 R0 R12) = (1 Rjminus12ejminus1 Rj2)

by the inductive hypothesis so

divstep2e1+middotmiddotmiddot+2ej (1 R0 R12) = divstep2ej (1 Rjminus12ejminus1 Rj2) = (1 Rj2ej Rj+12)

by Theorem G1

Theorem G3 Let R0 be an odd element of Z Let R1 be an element of 2Z Then thereexists n isin 0 1 2 such that divstepn(1 R0 R12) isin Ztimes Ztimes 0 Furthermore anysuch n has divstepn(1 R0 R12) isin Ztimes GminusG times 0 where G = gcdR0 R1

Proof Define e0 e1 e2 e3 and R2 R3 as in Theorem E2 By Theorem F26 thereexists t ge 0 such that Rt+1 = 0 By Theorem E3 Rt2et isin GminusG By Theorem G2divstep2w(1 R0 R12) = (1 Rt2et 0) isin Ztimes GminusG times 0 where w = e1 + middot middot middot+ et

Conversely assume that n isin 0 1 2 and that divstepn(1 R0 R12) isin ZtimesZtimes0Write divstepn(1 R0 R12) as (δ f 0) Then divstepn+k(1 R0 R12) = (k + δ f 0) forall k isin 0 1 2 so in particular divstep2w+n(1 R0 R12) = (2w + δ f 0) Alsodivstep2w+k(1 R0 R12) = (k + 1 Rt2et 0) for all k isin 0 1 2 so in particu-lar divstep2w+n(1 R0 R12) = (n + 1 Rt2et 0) Hence f = Rt2et isin GminusG anddivstepn(1 R0 R12) isin Ztimes GminusG times 0 as claimed

Theorem G4 Let R0 be an odd element of Z Let R1 be an element of 2Z Defineb = log2

radicR2

0 +R21 Assume that b le 21 Then divstepb19b7c(1 R0 R12) isin ZtimesZtimes 0

Proof If divstep(δ f g) = (δ1 f1 g1) then divstep(δminusfminusg) = (δ1minusf1minusg1) Hencedivstepn(1 R0 R12) isin ZtimesZtimes0 if and only if divstepn(1minusR0minusR12) isin ZtimesZtimes0From now on assume without loss of generality that R0 gt 0

Daniel J Bernstein and Bo-Yin Yang 57

s Ws R0 R10 1 1 01 5 1 minus22 5 1 minus23 5 1 minus24 17 1 minus45 17 1 minus46 65 1 minus87 65 1 minus88 157 11 minus69 181 9 minus1010 421 15 minus1411 541 21 minus1012 1165 29 minus1813 1517 29 minus2614 2789 17 minus5015 3653 47 minus3816 8293 47 minus7817 8293 47 minus7818 24245 121 minus9819 27361 55 minus15620 56645 191 minus14221 79349 215 minus18222 132989 283 minus23023 213053 133 minus44224 415001 451 minus46025 613405 501 minus60226 1345061 719 minus91027 1552237 1021 minus71428 2866525 789 minus1498

s Ws R0 R129 4598269 1355 minus166230 7839829 777 minus269031 12565573 2433 minus257832 22372085 2969 minus368233 35806445 5443 minus248634 71013905 4097 minus736435 74173637 5119 minus692636 205509533 9973 minus1029837 226964725 11559 minus966238 487475029 11127 minus1907039 534274997 13241 minus1894640 1543129037 24749 minus3050641 1639475149 20307 minus3503042 3473731181 39805 minus4346643 4500780005 49519 minus4526244 9497198125 38051 minus8971845 12700184149 82743 minus7651046 29042662405 152287 minus7649447 36511782869 152713 minus11485048 82049276437 222249 minus18070649 90638999365 220417 minus20507450 234231548149 229593 minus42605051 257130797053 338587 minus37747852 466845672077 301069 minus61335453 622968125533 437163 minus65715854 1386285660565 600809 minus101257855 1876581808109 604547 minus122927056 4208169535453 1352357 minus1542498

Figure G5 For each s isin 0 1 56 Minimum R20 + R2

1 where R0 is an odd integerR1 is an even integer and (R0 R1) requires ges iterations of divstep Also an example of(R0 R1) reaching this minimum

The rest of the proof is a computer proof We enumerate all pairs (R0 R1) where R0is a positive odd integer R1 is an even integer and R2

0 + R21 le 242 For each (R0 R1)

we iterate divstep to find the minimum n ge 0 where divstepn(1 R0 R12) isin Ztimes Ztimes 0We then say that (R0 R1) ldquoneeds n stepsrdquo

For each s ge 0 define Ws as the smallest R20 +R2

1 where (R0 R1) needs ges steps Thecomputation shows that each (R0 R1) needs le56 steps and also shows the values Ws

for each s isin 0 1 56 We display these values in Figure G5 We check for eachs isin 0 1 56 that 214s leW 19

s Finally we claim that each (R0 R1) needs leb19b7c steps where b = log2

radicR2

0 +R21

If not then (R0 R1) needs ges steps where s = b19b7c + 1 We have s le 56 since(R0 R1) needs le56 steps We also have Ws le R2

0 + R21 by definition of Ws Hence

214s19 le R20 +R2

1 so 14s19 le 2b so s le 19b7 contradiction

Theorem G6 (Bounds on the number of divsteps required) Let R0 be an odd elementof Z Let R1 be an element of 2Z Define b = log2

radicR2

0 +R21 Define n = b19b7c

if b le 21 n = b(49b+ 23)17c if 21 lt b le 46 and n = b49b17c if b gt 46 Thendivstepn(1 R0 R12) isin Ztimes Ztimes 0

58 Fast constant-time gcd computation and modular inversion

All three cases have n le b(49b+ 23)17c since 197 lt 4917

Proof If b le 21 then divstepn(1 R0 R12) isin Ztimes Ztimes 0 by Theorem G4Assume from now on that b gt 21 This implies n ge b49b17c ge 60Define e0 e1 e2 e3 and R2 R3 as in Theorem E2 By Theorem F26 there

exists t ge 0 such that Rt+1 = 0 By Theorem E3 Rt2et isin GminusG By Theorem G2divstep2w(1 R0 R12) = (1 Rt2et 0) isin Ztimes GminusG times 0 where w = e1 + middot middot middot+ et

To finish the proof we will show that 2w le n Hence divstepn(1 R0 R12) isin ZtimesZtimes0as claimed

Suppose that 2w gt n Both 2w and n are integers so 2w ge n+ 1 ge 61 so w ge 31By Theorem F6 and Theorem F25 1 le |(Rt2et Rt+1)|2 le αw|(R0 R1)|2 = αw2b so

log2(1αw) le bCase 1 w ge 67 By definition 1αw = (1024633)w so w log2(1024633) le b We have

3449 lt log2(1024633) so 34w49 lt b so n + 1 le 2w lt 49b17 lt (49b + 23)17 Thiscontradicts the definition of n as the largest integer below 49b17 if b gt 46 and as thelargest integer below (49b+ 23)17 if b le 46

Case 2 w le 66 Then αw lt 2minus(34wminus23)49 since w ge 31 so b ge lg(1αw) gt(34w minus 23)49 so n + 1 le 2w le (49b + 23)17 This again contradicts the definitionof n if b le 46 If b gt 46 then n ge b49 middot 4617c = 132 so 2w ge n + 1 gt 132 ge 2wcontradiction

H Comparison to performance of cdivstepThis appendix briefly compares the analysis of divstep from Appendix G to an analogousanalysis of cdivstep

First modify Theorem E1 to take the unique element q isin plusmn1plusmn3 plusmn(2e minus 1)such that f + qg2e isin 2e+1Z2 The corresponding modification of Theorem G1 saysthat cdivstep2e(1 f g2) = (1 g2e (f + qg2e)2e+1) The proof proceeds identicallyexcept zj = (zjminus1 minus (zjminus1 mod 2)g2e)2 isin Z2 for e lt j lt 2e and z2e = (z2eminus1 +(z2eminus1 mod 2)g2e)2 isin Z2 Also we have q = minus1minus2(ze mod 2)minusmiddot middot middotminus2eminus1(z2eminus2 mod 2)+2e(z2eminus1 mod 2) isin plusmn1plusmn3 plusmn(2e minus 1) In the notation of [62] GB(f g) = (q r) wherer = f + qg2e = 2e+1z2e

The corresponding gcd algorithm is the centered algorithm from Appendix E6 withaddition instead of subtraction Recall that except for inessential scalings by powers of 2this is the same as the StehleacutendashZimmermann gcd algorithm from [62]

The corresponding set of transition matrices is different from the divstep case As men-tioned in Appendix F there are weight-w products with spectral radius 06403882032 wso the proof strategy for cdivstep cannot obtain weight-w norm bounds below this spectralradius This is worse than our bounds for divstep

Enumerating small pairs (R0 R1) as in Theorem G4 also produces worse results forcdivstep than for divstep See Figure H1

Daniel J Bernstein and Bo-Yin Yang 59

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bullbull

bull

bull

bull

bull

bull

bullbullbullbullbullbull

bull

bullbullbullbullbullbullbull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bullbull

bull

bullbull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

minus3

minus2

minus1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21

Figure H1 For each s isin 1 2 56 red dot at (b sminus 28283396b) where b is minimumlog2

radicR2

0 +R21 where R0 is an odd integer R1 is an even integer and (R0 R1) requires ges

iterations of divstep For each s isin 1 2 58 blue dot at (b sminus 28283396b) where b isminimum log2

radicR2

0 +R21 where R0 is an odd integer R1 is an even integer and (R0 R1)

requires ges iterations of cdivstep

  • Introduction
  • Organization of the paper
  • Definition of x-adic division steps
  • Iterates of x-adic division steps
  • Fast computation of iterates of x-adic division steps
  • Fast polynomial gcd computation and modular inversion
  • Software case study for polynomial modular inversion
  • Definition of 2-adic division steps
  • Iterates of 2-adic division steps
  • Fast computation of iterates of 2-adic division steps
  • Fast integer gcd computation and modular inversion
  • Software case study for integer modular inversion
  • Proof of main gcd theorem for polynomials
  • Review of the EuclidndashStevin algorithm
  • EuclidndashStevin results as iterates of divstep
  • Alternative proof of main gcd theorem for polynomials
  • A gcd algorithm for integers
  • Performance of the gcd algorithm for integers
  • Proof of main gcd theorem for integers
  • Comparison to performance of cdivstep
Page 9: Fast constant-time gcd computation and modular inversion

Daniel J Bernstein and Bo-Yin Yang 9

A δ f [0] g[0] f [1] g[1] f [2] g[2] f [3] g[3] middot middot middot

minusδ swap

B δprime f prime[0] gprime[0] f prime[1] gprime[1] f prime[2] gprime[2] f prime[3] gprime[3] middot middot middot

C 1 + δprime f primeprime[0] gprimeprime[0] f primeprime[1] gprimeprime[1] f primeprime[2] gprimeprime[2] f primeprime[3] middot middot middot

Figure 33 Data flow in an x-adic division step decomposed as a conditional swap (A toB) followed by an elimination (B to C) The swap bit is set if δ gt 0 and g[0] 6= 0 Theg outputs are f prime[0]gprime[1]minus gprime[0]f prime[1] f prime[0]gprime[2]minus gprime[0]f prime[2] f prime[0]gprime[3]minus gprime[0]f prime[3] etc Colorscheme consider eg linear combination cg minus df of power series f g where c = f [0] andd = g[0] inputs f g affect jth output coefficient cg[j]minus df [j] via f [j] g[j] (black) and viachoice of c d (red+blue) Green operations are special creating the swap bit

See Figure 33This decomposition is our motivation for defining divstep in the first casemdashthe swapped

casemdashto compute (g(0)f minus f(0)g)x Using (f(0)g minus g(0)f)x in both cases might seemsimpler but would require the conditional swap to forward a conditional negation to theelimination

One can also compose these functions in the opposite order This is compatible withsome of what we say later about iterating the composition However the correspondingtransition matrices would depend on another coefficient of g This would complicate ouralgorithm statements

Another possibility as in [23] is to keep f and g in place using the sign of δ to decidewhether to add a multiple of f into g or to add a multiple of g into f One way to dothis in constant time is to perform two multiplications and two additions Another way isto perform conditional swaps before and after one addition and one multiplication Ourapproach is more efficient

34 Scaling One can define a variant of divstep that computes f minus (f(0)g(0))g in thefirst case (the swapped case) and g minus (g(0)f(0))f in the second case

This variant has two advantages First it multiplies only one series rather than twoby a scalar in k Second multiplying both f and g by any u isin k[[x]]lowast has the same effecton the output so this variant induces a function on fewer variables (δ gf) the ratioρ = gf is replaced by (1ρ minus 1ρ(0))x when δ gt 0 and ρ(0) 6= 0 and by (ρ minus ρ(0))xotherwise while δ is replaced by 1minus δ or 1+ δ respectively This is the composition of firsta conditional inversion that replaces (δ ρ) with (minusδ 1ρ) if δ gt 0 and ρ(0) 6= 0 second anelimination that replaces ρ with (ρminus ρ(0))x while adding 1 to δ

On the other hand working projectively with f g is generally more efficient than workingwith ρ Working with f(0)gminus g(0)f rather than gminus (g(0)f(0))f has the virtue of being aldquofraction-freerdquo computation it involves only ring operations in k not divisions This makesthe effect of scaling only slightly more difficult to track Specifically say divstep(δ f g) =

10 Fast constant-time gcd computation and modular inversion

Table 41 Iterates (δn fn gn) = divstepn(δ f g) for k = F7 δ = 1 f = 2 + 7x+ 1x2 +8x3 + 2x4 + 8x5 + 1x6 + 8x7 and g = 3 + 1x+ 4x2 + 1x3 + 5x4 + 9x5 + 2x6 Each powerseries is displayed as a row of coefficients of x0 x1 etc

n δn fn gnx0 x1 x2 x3 x4 x5 x6 x7 x8 x9 x0 x1 x2 x3 x4 x5 x6 x7 x8 x9

0 1 2 0 1 1 2 1 1 1 0 0 3 1 4 1 5 2 2 0 0 0 1 0 3 1 4 1 5 2 2 0 0 0 5 2 1 3 6 6 3 0 0 0 2 1 3 1 4 1 5 2 2 0 0 0 1 4 4 0 1 6 0 0 0 0 3 0 1 4 4 0 1 6 0 0 0 0 3 6 1 2 5 2 0 0 0 0 4 1 1 4 4 0 1 6 0 0 0 0 1 3 2 2 5 0 0 0 0 0 5 0 1 3 2 2 5 0 0 0 0 0 1 2 5 3 6 0 0 0 0 0 6 1 1 3 2 2 5 0 0 0 0 0 6 3 1 1 0 0 0 0 0 0 7 0 6 3 1 1 0 0 0 0 0 0 1 4 4 2 0 0 0 0 0 0 8 1 6 3 1 1 0 0 0 0 0 0 0 2 4 0 0 0 0 0 0 0 9 2 6 3 1 1 0 0 0 0 0 0 5 3 0 0 0 0 0 0 0 0

10 minus1 5 3 0 0 0 0 0 0 0 0 4 5 5 0 0 0 0 0 0 0 11 0 5 3 0 0 0 0 0 0 0 0 6 4 0 0 0 0 0 0 0 0 12 1 5 3 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 13 0 2 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 14 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16 3 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 17 4 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 18 5 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 19 6 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

(δprime f prime gprime) If u isin k[[x]] with u(0) = 1 then divstep(δ uf ug) = (δprime uf prime ugprime) If u isin klowast then

divstep(δ uf g) = (δprime f prime ugprime) in the first (swapped) casedivstep(δ uf g) = (δprime uf prime ugprime) in the second casedivstep(δ f ug) = (δprime uf prime ugprime) in the first casedivstep(δ f ug) = (δprime f prime ugprime) in the second case

and divstep(δ uf ug) = (δprime uf prime u2gprime)One can also scale g(0)f minus f(0)g by other nonzero constants For example for typical

fields k one can quickly find ldquohalf-sizerdquo a b such that ab = g(0)f(0) Computing af minus bgmay be more efficient than computing g(0)f minus f(0)g

We use the g(0)f(0) variant in our case study in Section 7 for the field k = F3Dividing by f(0) in this field is the same as multiplying by f(0)

4 Iterates of x-adic division stepsStarting from (δ f g) isin Z times k[[x]]lowast times k[[x]] define (δn fn gn) = divstepn(δ f g) isinZ times k[[x]]lowast times k[[x]] for each n ge 0 Also define Tn = T (δn fn gn) isin M2(k[1x]) andSn = S(δn fn gn) isinM2(Z) for each n ge 0 Our main algorithm relies on various simplemathematical properties of these objects this section states those properties Table 41shows all δn fn gn for one example of (δ f g)

Theorem 42(fngn

)= Tnminus1 middot middot middot Tm

(fmgm

)and

(1δn

)= Snminus1 middot middot middot Sm

(1δm

)if n ge m ge 0

Daniel J Bernstein and Bo-Yin Yang 11

Proof This follows from(fm+1gm+1

)= Tm

(fmgm

)and

(1

δm+1

)= Sm

(1δm

)

Theorem 43 Tnminus1 middot middot middot Tm isin(k + middot middot middot+ 1

xnminusmminus1 k k + middot middot middot+ 1xnminusmminus1 k

1xk + middot middot middot+ 1

xnminusm k1xk + middot middot middot+ 1

xnminusm k

)if n gt m ge 0

A product of nminusm of these T matrices can thus be stored using 4(nminusm) coefficientsfrom k Beware that the theorem statement relies on n gt m for n = m the product is theidentity matrix

Proof If n minus m = 1 then the product is Tm which is in(

k k1xk

1xk

)by definition For

nminusm gt 1 assume inductively that

Tnminus1 middot middot middot Tm+1 isin(k + middot middot middot+ 1

xnminusmminus2 k k + middot middot middot+ 1xnminusmminus2 k

1xk + middot middot middot+ 1

xnminusmminus1 k1xk + middot middot middot+ 1

xnminusmminus1 k

)

Multiply on the right by Tm isin(

k k1xk

1xk

) The top entries of the product are in (k + middot middot middot+

1xnminusmminus2 k) + ( 1

xk+ middot middot middot+ 1xnminusmminus1 k) = k+ middot middot middot+ 1

xnminusmminus1 k as claimed and the bottom entriesare in ( 1

xk + middot middot middot+ 1xnminusmminus1 k) + ( 1

x2 k + middot middot middot+ 1xnminusm k) = 1

xk + middot middot middot+ 1xnminusm k as claimed

Theorem 44 Snminus1 middot middot middot Sm isin(

1 02minus (nminusm) nminusmminus 2 nminusm 1minus1

)if n gt

m ge 0

In other words the bottom-left entry is between minus(nminusm) exclusive and nminusm inclusiveand has the same parity as nminusm There are exactly 2(nminusm) possibilities for the productBeware that the theorem statement again relies on n gt m

Proof If nminusm = 1 then 2minus (nminusm) nminusmminus 2 nminusm = 1 and the product isSm which is

( 1 01 plusmn1

)by definition

For nminusm gt 1 assume inductively that

Snminus1 middot middot middot Sm+1 isin(

1 02minus (nminusmminus 1) nminusmminus 3 nminusmminus 1 1minus1

)

and multiply on the right by Sm =( 1 0

1 plusmn1) The top two entries of the product are again 1

and 0 The bottom-left entry is in 2minus (nminusmminus 1) nminusmminus 3 nminusmminus 1+1minus1 =2minus (nminusm) nminusmminus 2 nminusm The bottom-right entry is again 1 or minus1

The next theorem compares δm fm gm TmSm to δprimem f primem gprimem T primemS primem defined in thesame way starting from (δprime f prime gprime) isin Ztimes k[[x]]lowast times k[[x]]

Theorem 45 Let m t be nonnegative integers Assume that δprimem = δm f primem equiv fm(mod xt) and gprimem equiv gm (mod xt) Then for each integer n with m lt n le m+ t T primenminus1 =Tnminus1 S primenminus1 = Snminus1 δprimen = δn f primen equiv fn (mod xtminus(nminusmminus1)) and gprimen equiv gn (mod xtminus(nminusm))

In other words δm and the first t coefficients of fm and gm determine all of thefollowing

bull the matrices Tm Tm+1 Tm+tminus1

bull the matrices SmSm+1 Sm+tminus1

bull δm+1 δm+t

bull the first t coefficients of fm+1 the first t minus 1 coefficients of fm+2 the first t minus 2coefficients of fm+3 and so on through the first coefficient of fm+t

12 Fast constant-time gcd computation and modular inversion

bull the first tminus 1 coefficients of gm+1 the first tminus 2 coefficients of gm+2 the first tminus 3coefficients of gm+3 and so on through the first coefficient of gm+tminus1

Our main algorithm computes (δn fn mod xtminus(nminusmminus1) gn mod xtminus(nminusm)) for any selectedn isin m+ 1 m+ t along with the product of the matrices Tm Tnminus1 See Sec-tion 5

Proof See Figure 33 for the intuition Formally fix m t and induct on n Observe that(δprimenminus1 f

primenminus1(0) gprimenminus1(0)) equals (δnminus1 fnminus1(0) gnminus1(0))

bull If n = m + 1 By assumption f primem equiv fm (mod xt) and n le m + t so t ge 1 so inparticular f primem(0) = fm(0) Similarly gprimem(0) = gm(0) By assumption δprimem = δm

bull If n gt m + 1 f primenminus1 equiv fnminus1 (mod xtminus(nminusmminus2)) by the inductive hypothesis andn le m + t so t minus (n minus m minus 2) ge 2 so in particular f primenminus1(0) = fnminus1(0) similarlygprimenminus1 equiv gnminus1 (mod xtminus(nminusmminus1)) by the inductive hypothesis and tminus (nminusmminus1) ge 1so in particular gprimenminus1(0) = gnminus1(0) and δprimenminus1 = δnminus1 by the inductive hypothesis

Now use the fact that T and S inspect only the first coefficient of each series to seethat

T primenminus1 = T (δprimenminus1 fprimenminus1 g

primenminus1) = T (δnminus1 fnminus1 gnminus1) = Tnminus1

S primenminus1 = S(δprimenminus1 fprimenminus1 g

primenminus1) = S(δnminus1 fnminus1 gnminus1) = Snminus1

as claimedMultiply

(1

δprimenminus1

)=(

1δnminus1

)by S primenminus1 = Snminus1 to see that δprimen = δn as claimed

By the inductive hypothesis (T primem T primenminus2) = (Tm Tnminus2) and T primenminus1 = Tnminus1 so

T primenminus1 middot middot middot T primem = Tnminus1 middot middot middot Tm Abbreviate Tnminus1 middot middot middot Tm as P Then(f primengprimen

)= P

(f primemgprimem

)and(

fngn

)= P

(fmgm

)by Theorem 42 so

(f primen minus fngprimen minus gn

)= P

(f primem minus fmgprimem minus gm

)

By Theorem 43 P isin(k + middot middot middot+ 1

xnminusmminus1 k k + middot middot middot+ 1xnminusmminus1 k

1xk + middot middot middot+ 1

xnminusm k1xk + middot middot middot+ 1

xnminusm k

) By assumption

f primem minus fm and gprimem minus gm are multiples of xt Multiply to see that f primen minus fn is a multiple ofxtxnminusmminus1 = xtminus(nminusmminus1) as claimed and gprimen minus gn is a multiple of xtxnminusm = xtminus(nminusm)

as claimed

5 Fast computation of iterates of x-adic division stepsOur goal in this section is to compute (δn fn gn) = divstepn(δ f g) We are giventhe power series f g to precision t t respectively and we compute fn gn to precisiont minus n + 1 t minus n respectively if n ge 1 see Theorem 45 We also compute the n-steptransition matrix Tnminus1 middot middot middot T0

We begin with an algorithm that simply computes one divstep iterate after anothermultiplying by scalars in k and dividing by x exactly as in the divstep definition We specifythis algorithm in Figure 51 as a function divstepsx in the Sage [58] computer-algebrasystem The quantities u v q r isin k[1x] inside the algorithm keep track of the coefficientsof the current f g in terms of the original f g

The rest of this section considers (1) constant-time computation (2) ldquojumpingrdquo throughdivision steps and (3) further techniques to save time inside the computations

52 Constant-time computation The underlying Sage functions do not take constanttime but they can be replaced with functions that do take constant time The casedistinction can be replaced with constant-time bit operations for example obtaining thenew (δ f g) as follows

Daniel J Bernstein and Bo-Yin Yang 13

defdivstepsx(ntdeltafg)asserttgt=nandngt=0fg=ftruncate(t)gtruncate(t)kx=fparent()x=kxgen()uvqr=kx(1)kx(0)kx(0)kx(1)

whilengt0f=ftruncate(t)ifdeltagt0andg[0]=0deltafguvqr=-deltagfqruvf0g0=f[0]g[0]deltagqr=1+delta(f0g-g0f)x(f0q-g0u)x(f0r-g0v)xnt=n-1t-1g=kx(g)truncate(t)

M2kx=MatrixSpace(kxfraction_field()2)returndeltafgM2kx((uvqr))

Figure 51 Algorithm divstepsx to compute (δn fn gn) = divstepn(δ f g) Inputsn t isin Z with 0 le n le t δ isin Z f g isin k[[x]] to precision at least t Outputs δn fn toprecision t if n = 0 or tminus (nminus 1) if n ge 1 gn to precision tminus n Tnminus1 middot middot middot T0

bull Let s be the ldquosign bitrdquo of minusδ meaning 0 if minusδ ge 0 or 1 if minusδ lt 0

bull Let z be the ldquononzero bitrdquo of g(0) meaning 0 if g(0) = 0 or 1 if g(0) 6= 0

bull Compute and output 1 + (1minus 2sz)δ For example shift δ to obtain 2δ ldquoANDrdquo withminussz to obtain 2szδ and subtract from 1 + δ

bull Compute and output f oplus sz(f oplus g) obtaining f if sz = 0 or g if sz = 1

bull Compute and output (1minus 2sz)(f(0)gminus g(0)f)x Alternatively compute and outputsimply (f(0)gminusg(0)f)x See the discussion of decomposition and scaling in Section 3

Standard techniques minimize the exact cost of these computations for various repre-sentations of the inputs The details depend on the platform See our case study inSection 7

53 Jumping through division steps Here is a more general divide-and-conqueralgorithm to compute (δn fn gn) and the n-step transition matrix Tnminus1 middot middot middot T0

bull If n le 1 use the definition of divstep and stop

bull Choose j isin 1 2 nminus 1

bull Jump j steps from δ f g to δj fj gj Specifically compute the j-step transition

matrix Tjminus1 middot middot middot T0 and then multiply by(fg

)to obtain

(fjgj

) To compute the

j-step transition matrix call the same algorithm recursively The critical point hereis that this recursive call uses the inputs to precision only j this determines thej-step transition matrix by Theorem 45

bull Similarly jump another nminus j steps from δj fj gj to δn fn gn

14 Fast constant-time gcd computation and modular inversion

fromdivstepsximportdivstepsx

defjumpdivstepsx(ntdeltafg)asserttgt=nandngt=0kx=ftruncate(t)parent()

ifnlt=1returndivstepsx(ntdeltafg)

j=n2

deltaf1g1P1=jumpdivstepsx(jjdeltafg)fg=P1vector((fg))fg=kx(f)truncate(t-j)kx(g)truncate(t-j)

deltaf2g2P2=jumpdivstepsx(n-jn-jdeltafg)fg=P2vector((fg))fg=kx(f)truncate(t-n+1)kx(g)truncate(t-n)

returndeltafgP2P1

Figure 54 Algorithm jumpdivstepsx to compute (δn fn gn) = divstepn(δ f g) Sameinputs and outputs as in Figure 51

This splits an (n t) problem into two smaller problems namely a (j j) problem and an(nminus j nminus j) problem at the cost of O(1) polynomial multiplications involving O(t+ n)coefficients

If j is always chosen as 1 then this algorithm ends up performing the same computationsas Figure 51 compute δ1 f1 g1 to precision tminus 1 compute δ2 f2 g2 to precision tminus 2etc But there are many more possibilities for j

Our jumpdivstepsx algorithm in Figure 54 takes j = bn2c balancing the (j j)problem with the (n minus j n minus j) problem In particular with subquadratic polynomialmultiplication the algorithm is subquadratic With FFT-based polynomial multiplica-tion the algorithm uses (t+ n)(log(t+ n))1+o(1) + n(logn)2+o(1) operations and hencen(logn)2+o(1) operations if t is bounded by n(logn)1+o(1) For comparison Figure 51takes Θ(n(t+ n)) operations

We emphasize that unlike standard fast-gcd algorithms such as [66] this algorithmdoes not call a polynomial-division subroutine between its two recursive calls

55 More speedups Our jumpdivstepsx algorithm is asymptotically Θ(logn) timesslower than fast polynomial multiplication The best previous gcd algorithms also have aΘ(logn) slowdown both in the worst case and in the average case10 This might suggestthat there is very little room for improvement

However Θ says nothing about constant factors There are several standard ideas forconstant-factor speedups in gcd computation beyond lower-level speedups in multiplicationWe now apply the ideas to jumpdivstepsx

The final polynomial multiplications in jumpdivstepsx produce six large outputsf g and four entries of a matrix product P Callers do not always use all of theseoutputs for example the recursive calls to jumpdivstepsx do not use the resulting f

10There are faster cases Strassen [66] gave an algorithm that replaces the log factor with the entropy ofthe list of degrees in the corresponding continued fraction there are occasional inputs where this entropyis o(log n) For example computing gcd 0 takes linear time However these speedups are not usefulfor constant-time computations

Daniel J Bernstein and Bo-Yin Yang 15

and g In principle there is no difficulty applying dead-value elimination and higher-leveloptimizations to each of the 63 possible combinations of desired outputs

The recursive calls in jumpdivstepsx produce two products P1 P2 of transitionmatrices which are then (if desired) multiplied to obtain P = P2P1 In Figure 54jumpdivstepsx multiplies f and g first by P1 and then by P2 to obtain fn and gn Analternative is to multiply by P

The inputs f and g are truncated to t coefficients and the resulting fn and gn aretruncated to tminus n+ 1 and tminus n coefficients respectively Often the caller (for example

jumpdivstepsx itself) wants more coefficients of P(fg

) Rather than performing an

entirely new multiplication by P one can add the truncation errors back into the resultsmultiply P by the f and g truncation errors multiply P2 by the fj and gj truncationerrors and skip the final truncations of fn and gn

There are standard speedups for multiplication in the context of truncated productssums of products (eg in matrix multiplication) partially known products (eg the

coefficients of xminus1 xminus2 xminus3 in P1

(fg

)are known to be 0) repeated inputs (eg each

entry of P1 is multiplied by two entries of P2 and by one of f g) and outputs being reusedas inputs See eg the discussions of ldquoFFT additionrdquo ldquoFFT cachingrdquo and ldquoFFT doublingrdquoin [11]

Any choice of j between 1 and nminus 1 produces correct results Values of j close to theedges do not take advantage of fast polynomial multiplication but this does not imply thatj = bn2c is optimal The number of combinations of choices of j and other options aboveis small enough that for each small (t n) in turn one can simply measure all options andselect the best The results depend on the contextmdashwhich outputs are neededmdashand onthe exact performance of polynomial multiplication

6 Fast polynomial gcd computation and modular inversionThis section presents three algorithms gcdx computes gcdR0 R1 where R0 is a poly-nomial of degree d gt 0 and R1 is a polynomial of degree ltd gcddegree computes merelydeg gcdR0 R1 recipx computes the reciprocal of R1 modulo R0 if gcdR0 R1 = 1Figure 61 specifies all three algorithms and Theorem 62 justifies all three algorithmsThe theorem is proven in Appendix A

Each algorithm calls divstepsx from Section 5 to compute (δ2dminus1 f2dminus1 g2dminus1) =divstep2dminus1(1 f g) Here f = xdR0(1x) is the reversal of R0 and g = xdminus1R1(1x) Ofcourse one can replace divstepsx here with jumpdivstepsx which has the same outputsThe algorithms vary in the input-precision parameter t passed to divstepsx and vary inhow they use the outputs

bull gcddegree uses input precision t = 2dminus 1 to compute δ2dminus1 and then uses one partof Theorem 62 which states that deg gcdR0 R1 = δ2dminus12

bull gcdx uses precision t = 3dminus 1 to compute δ2dminus1 and d+ 1 coefficients of f2dminus1 Thealgorithm then uses another part of Theorem 62 which states that f2dminus1f2dminus1(0)is the reversal of gcdR0 R1

bull recipx uses t = 2dminus1 to compute δ2dminus1 1 coefficient of f2dminus1 and the product P oftransition matrices It then uses the last part of Theorem 62 which (in the coprimecase) states that the top-right corner of P is f2dminus1(0)x2dminus2 times the degree-(dminus 1)reversal of the desired reciprocal

One can generalize to compute gcdR0 R1 for any polynomials R0 R1 of degree ledin time that depends only on d scan the inputs to find the top degree and always perform

16 Fast constant-time gcd computation and modular inversion

fromdivstepsximportdivstepsx

defgcddegree(R0R1)d=R0degree()assertdgt0anddgtR1degree()fg=R0reverse(d)R1reverse(d-1)deltafgP=divstepsx(2d-12d-11fg)returndelta2

defgcdx(R0R1)d=R0degree()assertdgt0anddgtR1degree()fg=R0reverse(d)R1reverse(d-1)deltafgP=divstepsx(2d-13d-11fg)returnfreverse(delta2)f[0]

defrecipx(R0R1)d=R0degree()assertdgt0anddgtR1degree()fg=R0reverse(d)R1reverse(d-1)deltafgP=divstepsx(2d-12d-11fg)ifdelta=0returnkx=fparent()x=kxgen()returnkx(x^(2d-2)P[0][1]f[0])reverse(d-1)

Figure 61 Algorithm gcdx to compute gcdR0 R1 algorithm gcddegree to computedeg gcdR0 R1 algorithm recipx to compute the reciprocal of R1 modulo R0 whengcdR0 R1 = 1 All three algorithms assume that R0 R1 isin k[x] that degR0 gt 0 andthat degR0 gt degR1

2d iterations of divstep The extra iteration handles the case that both R0 and R1 havedegree d For simplicity we focus on the case degR0 = d gt degR1 with d gt 0

One can similarly characterize gcd and modular reciprocal in terms of subsequentiterates of divstep with δ increased by the number of extra steps Implementors mightfor example prefer to use an iteration count that is divisible by a large power of 2

Theorem 62 Let k be a field Let d be a positive integer Let R0 R1 be elements ofthe polynomial ring k[x] with degR0 = d gt degR1 Define G = gcdR0 R1 and let Vbe the unique polynomial of degree lt d minus degG such that V R1 equiv G (mod R0) Definef = xdR0(1x) g = xdminus1R1(1x) (δn fn gn) = divstepn(1 f g) Tn = T (δn fn gn) and(un vnqn rn

)= Tnminus1 middot middot middot T0 Then

degG = δ2dminus12G = xdegGf2dminus1(1x)f2dminus1(0)V = xminusd+1+degGv2dminus1(1x)f2dminus1(0)

63 A numerical example Take k = F7 d = 7 R0 = 2x7 + 7x6 + 1x5 + 8x4 +2x3 + 8x2 + 1x + 8 and R1 = 3x6 + 1x5 + 4x4 + 1x3 + 5x2 + 9x + 2 Then f =2 + 7x+ 1x2 + 8x3 + 2x4 + 8x5 + 1x6 + 8x7 and g = 3 + 1x+ 4x2 + 1x3 + 5x4 + 9x5 + 2x6The gcdx algorithm computes (δ13 f13 g13) = divstep13(1 f g) = (0 2 6) see Table 41

Daniel J Bernstein and Bo-Yin Yang 17

for all iterates of divstep on these inputs Dividing f13 by its constant coefficient produces1 the reversal of gcdR0 R1 The gcddegree algorithm simply computes δ13 = 0 whichis enough information to see that gcdR0 R1 = 164 Constant-factor speedups Compared to the general power-series setting indivstepsx the inputs to gcddegree gcdx and recipx are more limited For examplethe f input is all 0 after the first d+ 1 coefficients so computations involving subsequentcoefficients can be skipped Similarly the top-right entry of the matrix P in recipx isguaranteed to have coefficient 0 for 1 1x 1x2 and so on through 1xdminus2 so one cansave time in the polynomial computations that produce this entry See Theorems A1and A2 for bounds on the sizes of all intermediate results65 Bottom-up gcd and inversion Our division steps eliminate bottom coefficients ofthe inputs Appendices B C and D relate this to a traditional EuclidndashStevin computationwhich gradually eliminates top coefficients of the inputs The reversal of inputs andoutputs in gcddegree gcdx and recipx exchanges the role of top coefficients and bottomcoefficients The following theorem states an alternative skip the reversal and insteaddirectly eliminate the bottom coefficients of the original inputsTheorem 66 Let k be a field Let d be a positive integer Let f g be elements of thepolynomial ring k[x] with f(0) 6= 0 deg f le d and deg g lt d Define γ = gcdf g

Define (δn fn gn) = divstepn(1 f g) Tn = T (δn fn gn) and(un vnqn rn

)= Tnminus1 middot middot middot T0

Thendeg γ = deg f2dminus1

γ = f2dminus1`x2dminus2γ equiv x2dminus2v2dminus1g` (mod f)

where ` is the leading coefficient of f2dminus1Proof Note that γ(0) 6= 0 since f is a multiple of γ

Define R0 = xdf(1x) Then R0 isin k[x] degR0 = d since f(0) 6= 0 and f = xdR0(1x)Define R1 = xdminus1g(1x) Then R1 isin k[x] degR1 lt d and g = xdminus1R1(1x)Define G = gcdR0 R1 We will show below that γ = cxdegGG(1x) for some c isin klowast

Then γ = cf2dminus1f2dminus1(0) by Theorem 62 ie γ is a constant multiple of f2dminus1 Hencedeg γ = deg f2dminus1 as claimed Furthermore γ has leading coefficient 1 so 1 = c`f2dminus1(0)ie γ = f2dminus1` as claimed

By Theorem 42 f2dminus1 = u2dminus1f + v2dminus1g Also x2dminus2u2dminus1 x2dminus2v2dminus1 isin k[x] by

Theorem 43 so

x2dminus2γ = (x2dminus2u2dminus1`)f + (x2dminus2v2dminus1`)g equiv (x2dminus2v2dminus1`)g (mod f)

as claimedAll that remains is to show that γ = cxdegGG(1x) for some c isin klowast If R1 = 0 then

G = gcdR0 0 = R0R0[d] so xdegGG(1x) = xdR0(1x)R0[d] = ff [0] also g = 0 soγ = ff [deg f ] = (f [0]f [deg f ])xdegGG(1x) Assume from now on that R1 6= 0

We have R1 = QG for some Q isin k[x] Note that degR1 = degQ + degG AlsoR1(1x) = Q(1x)G(1x) so xdegR1R1(1x) = xdegQQ(1x)xdegGG(1x) since degR1 =degQ+ degG Hence xdegGG(1x) divides the polynomial xdegR1R1(1x) which in turndivides xdminus1R1(1x) = g Similarly xdegGG(1x) divides f Hence xdegGG(1x) dividesgcdf g = γ

Furthermore G = AR0 +BR1 for some AB isin k[x] Take any integer m above all ofdegGdegAdegB then

xm+dminusdegGxdegGG(1x) = xm+dG(1x)= xmA(1x)xdR0(1x) + xm+1B(1x)xdminus1R1(1x)= xmminusdegAxdegAA(1x)f + xm+1minusdegBxdegBB(1x)g

18 Fast constant-time gcd computation and modular inversion

so xm+dminusdegGxdegGG(1x) is a multiple of γ so the polynomial γxdegGG(1x) dividesxm+dminusdegG so γxdegGG(1x) = cxi for some c isin klowast and some i ge 0 We must have i = 0since γ(0) 6= 0

67 How to divide by x2dminus2 modulo f Unlike top-down inversion (and top-downgcd and bottom-up gcd) bottom-up inversion naturally scales its result by a power of xWe now explain various ways to remove this scaling

For simplicity we focus here on the case gcdf g = 1 The divstep computation inTheorem 66 naturally produces the polynomial x2dminus2v2dminus1` which has degree at mostd minus 1 by Theorem 62 Our objective is to divide this polynomial by x2dminus2 modulo f obtaining the reciprocal of g modulo f by Theorem 66 One way to do this is to multiplyby a reciprocal of x2dminus2 modulo f This reciprocal depends only on d and f ie it can beprecomputed before g is available

Here are several standard strategies to compute 1x2dminus2 modulo f or more generally1xe modulo f for any nonnegative integer e

bull Compute the polynomial (1 minus ff(0))x which is 1x modulo f and then raisethis to the eth power modulo f This takes a logarithmic number of multiplicationsmodulo f

bull Start from 1f(0) the reciprocal of f modulo x Compute the reciprocals of fmodulo x2 x4 x8 etc by Newton (Hensel) iteration After finding the reciprocal rof f modulo xe divide 1minus rf by xe to obtain the reciprocal of xe modulo f Thistakes a constant number of full-size multiplications a constant number of half-sizemultiplications etc The total number of multiplications is again logarithmic butif e is linear as in the case e = 2dminus 2 then the total size of the multiplications islinear without an extra logarithmic factor

bull Combine the powering strategy with the Hensel strategy For example use thereciprocal of f modulo xdminus1 to compute the reciprocal of xdminus1 modulo f and thensquare modulo f to obtain the reciprocal of x2dminus2 modulo f rather than using thereciprocal of f modulo x2dminus2

bull Write down a simple formula for the reciprocal of xe modulo f if f has a specialstructure that makes this easy For example if f = xd minus 1 as in original NTRUand d ge 3 then the reciprocal of x2dminus2 is simply x2 As another example iff = xd + xdminus1 + middot middot middot+ 1 and d ge 5 then the reciprocal of x2dminus2 is x4

In most cases the division by x2dminus2 modulo f involves extra arithmetic that is avoided bythe top-down reversal approach On the other hand the case f = xd minus 1 does not involveany extra arithmetic There are also applications of inversion that can easily handle ascaled result

7 Software case study for polynomial modular inversionAs a concrete case study we consider the problem of inversion in the ring (Z3)[x](x700 +x699 + middot middot middot+ x+ 1) on an Intel Haswell CPU core The situation before our work was thatthis inversion consumed half of the key-generation time in the ntruhrss701 cryptosystemabout 150000 cycles out of 300000 cycles This cryptosystem was introduced by HuumllsingRijneveld Schanck and Schwabe [39] at CHES 201711

11NTRU-HRSS is another round-2 submission in NISTrsquos ongoing post-quantum competition Googlehas also selected NTRU-HRSS for its ldquoCECPQ2rdquo post-quantum experiment CECPQ2 includes a slightlymodified version of ntruhrss701 and the NTRU-HRSS authors have also announced their intention totweak NTRU-HRSS these changes do not affect the performance of key generation

Daniel J Bernstein and Bo-Yin Yang 19

We saved 60000 cycles out of these 150000 cycles reducing the total key-generationtime to 240000 cycles Our approach also simplifies code for example eliminating theseries of conditional power-of-2 rotations in [39]

71 Details of the new computation Our computation works with one integer δ andfour polynomials v r f g isin (Z3)[x] Initially δ = 1 v = 0 r = 1 f = x700 + x699 + middot middot middot+x + 1 and g = g699x

699 + middot middot middot + g0x0 is obtained by reversing the 700 input coefficients

We actually store minusδ instead of δ this turns out to be marginally more efficient for thisplatform

We then apply 2 middot 700minus 1 = 1399 iterations of a main loop Each iteration works asfollows with the order of operations determined experimentally to try to minimize latencyissues

bull Replace v with xv

bull Compute a swap mask as minus1 if δ gt 0 and g(0) 6= 0 otherwise 0

bull Compute c isin Z3 as f(0)g(0)

bull Replace δ with minusδ if the swap mask is set

bull Add 1 to δ

bull Replace (f g) with (g f) if the swap mask is set

bull Replace g with (g minus cf)x

bull Replace (v r) with (r v) if the swap mask is set

bull Replace r with r minus cv

This multiplies the transition matrix T (δ f g) into the vector(vxnminus1

rxn

)where n is the

number of iterations and at the same time replaces (δ f g) with divstep(δ f g) Actuallywe are slightly modifying divstep here (and making the corresponding modification toT ) replacing the original coefficients f(0) and g(0) with the scaled coefficients 1 andg(0)f(0) = f(0)g(0) as in Section 34 In other words if f(0) = minus1 then we negate bothf and g We suppress further comments on this deviation from the definition of divstep

The effect of n iterations is to replace the original (δ f g) with divstepn(δ f g) The

matrix product Tnminus1 middot middot middot T0 has the form(middot middot middot vxnminus1

middot middot middot rxn

) The new f and g are congruent

to vxnminus1 and rxn respectively times the original g modulo x700 + x699 + middot middot middot+ x+ 1After 1399 iterations we have δ = 0 if and only if the input is invertible modulo

x700 + x699 + middot middot middot + x + 1 by Theorem 62 Furthermore if δ = 0 then the inverse isxminus699v1399(1x)f(0) where v1399 = vx1398 ie the inverse is x699v(1x)f(0) We thusmultiply v by f(0) and take the coefficients of x0 x699 in reverse order to obtain thedesired inverse

We use signed representatives minus1 0 1 for elements of Z3 We use 2-bit tworsquos-complement representations of minus1 0 1 the bottom bit is 1 0 1 respectively and thetop bit is 1 0 0 We wrote a simple superoptimizer to find optimal sequences of 3 6 6bit operations respectively for multiplication addition and subtraction We do notclaim novelty for the approach in this paragraph Boothby and Bradshaw in [20 Section41] suggested the same 2-bit representation for Z3 (without explaining it as signedtworsquos-complement) and found the same operation counts by a similar search

We store each of the four polynomials v r f g as six 256-bit vectors for examplethe coefficients of x0 x255 in v are stored as a 256-bit vector of bottom bits and a256-bit vector of top bits Within each 256-bit vector we use the first 64-bit word to

20 Fast constant-time gcd computation and modular inversion

store the coefficients of x0 x4 x252 the next 64-bit word to store the coefficients ofx1 x5 x253 etc Multiplying by x thus moves the first 64-bit word to the secondmoves the second to the third moves the third to the fourth moves the fourth to the firstand shifts the first by 1 bit the bit that falls off the end of the first word is inserted intothe next vector

The first 256 iterations involve only the first 256-bit vectors for v and r so we simplyleave the second and third vectors as 0 The next 256 iterations involve only the first two256-bit vectors for v and r Similarly we manipulate only the first 256-bit vectors for f andg in the final 256 iterations and we manipulate only the first and second 256-bit vectors forf and g in the previous 256 iterations Our main loop thus has five sections 256 iterationsinvolving (1 1 3 3) vectors for (v r f g) 256 iterations involving (2 2 3 3) vectors 375iterations involving (3 3 3 3) vectors 256 iterations involving (3 3 2 2) vectors and 256iterations involving (3 3 1 1) vectors This makes the code longer but saves almost 20of the vector manipulations

72 Comparison to the old computation Many features of our computation arealso visible in the software from [39] We emphasize the ways that our computation ismore streamlined

There are essentially the same four polynomials v r f g in [39] (labeled ldquocrdquo ldquobrdquo ldquogrdquoldquofrdquo) Instead of a single integer δ (plus a loop counter) there are four integers ldquodegfrdquoldquodeggrdquo ldquokrdquo and ldquodonerdquo (plus a loop counter) One cannot simply eliminate ldquodegfrdquo andldquodeggrdquo in favor of their difference δ since ldquodonerdquo is set when ldquodegfrdquo reaches 0 and ldquodonerdquoinfluences ldquokrdquo which in turn influences the results of the computation the final result ismultiplied by x701minusk

Multiplication by x701minusk is a conceptually simple matter of rotating by k positionswithin 701 positions and then reducing modulo x700 + x699 + middot middot middot+ x+ 1 since x701 minus 1 isa multiple of x700 + x699 + middot middot middot+ x+ 1 However since k is a variable the obvious methodof rotating by k positions does not take constant time [39] instead performs conditionalrotations by 1 position 2 positions 4 positions etc where the condition bits are the bitsof k

The inputs and outputs are not reversed in [39] This affects the power of x multipliedinto the final results Our approach skips the powers of x entirely

Each polynomial in [39] is represented as six 256-bit vectors with the five-section patternskipping manipulations of some vectors as explained above The order of coefficients in eachvector is simply x0 x1 multiplication by x in [39] uses more arithmetic instructionsthan our approach does

At a lower level [39] uses unsigned representatives 0 1 2 of Z3 and uses a manuallydesigned sequence of 19 bit operations for a multiply-add operation in Z3 Langley [43]suggested a sequence of 16 bit operations or 14 if an ldquoand-notrdquo operation is available Asin [20] we use just 9 bit operations

The inversion software from [39] is 798 lines (32273 bytes) of Python generatingassembly producing 26634 bytes of object code Our inversion software is 541 lines (14675bytes) of C with intrinsics compiled with gcc 821 to generate assembly producing 10545bytes of object code

73 More NTRU examples More than 13 of the key-generation time in [39] isconsumed by a second inversion which replaces the modulus 3 with 213 This is handledin [39] with an initial inversion modulo 2 followed by 8 multiplications modulo 213 for aNewton iteration12 The initial inversion modulo 2 is handled in [39] by exponentiation

12This use of Newton iteration follows the NTRU key-generation strategy suggested in [61] but we pointout a difference in the details All 8 multiplications in [39] are carried out modulo 213 while [61] insteadfollows the traditional precision doubling in Newton iteration compute an inverse modulo 22 then 24then 28 (or 27) then 213 Perhaps limiting the precision of the first 6 multiplications would save timein [39]

Daniel J Bernstein and Bo-Yin Yang 21

taking just 10332 cycles the point here is that squaring modulo 2 is particularly efficientMuch more inversion time is spent in key generation in sntrup4591761 a slightly larger

cryptosystem from the NTRU Prime [14] family13 NTRU Prime uses a prime modulussuch as 4591 for sntrup4591761 to avoid concerns that a power-of-2 modulus could allowattacks (see generally [14]) but this modulus also makes arithmetic slower Before ourwork the key-generation software for sntrup4591761 took 6 million cycles partly forinversion in (Z3)[x](x761 minus xminus 1) but mostly for inversion in (Z4591)[x](x761 minus xminus 1)We reimplemented both of these inversion steps reducing the total key-generation timeto just 940852 cycles14 Almost all of our inversion code modulo 3 is shared betweensntrup4591761 and ntruhrss701

74 Does lattice-based cryptography need fast inversion Lyubashevsky Peikertand Regev [46] published an NTRU alternative that avoids inversions15 However thiscryptosystem has slower encryption than NTRU has slower decryption than NTRU andappears to be covered by a 2010 patent [35] by Gaborit and Aguilar-Melchor It is easy toenvision applications that will (1) select NTRU and (2) benefit from faster inversions inNTRU key generation16

Lyubashevsky and Seiler have very recently announced ldquoNTTRUrdquo [47] a variant ofNTRU with a ring chosen to allow very fast NTT-based (FFT-based) multiplication anddivision This variant triggers many of the security concerns described in [14] but if thevariant survives security analysis then it will eliminate concerns about key-generationspeed in NTRU

8 Definition of 2-adic division steps

Define divstep Ztimes Zlowast2 times Z2 rarr Ztimes Zlowast2 times Z2 as follows

divstep(δ f g) =

(1minus δ g (g minus f)2) if δ gt 0 and g is odd(1 + δ f (g + (g mod 2)f)2) otherwise

Note that g minus f is even in the first case (since both f and g are odd) and g + (g mod 2)fis even in the second case (since f is odd)

81 Transition matrices Write (δ1 f1 g1) = divstep(δ f g) Then(f1g1

)= T (δ f g)

(fg

)and

(1δ1

)= S(δ f g)

(1δ

)13NTRU Prime is another round-2 NIST submission One other NTRU-based encryption system

NTRUEncrypt [29] was submitted to the competition but has been merged into NTRU-HRSS for round2 see [59] For completeness we mention that NTRUEncrypt key generation took 980760 cycles forntrukem443 and 2639756 cycles for ntrukem743 As far as we know the NTRUEncrypt software was notdesigned to run in constant time

14This is a median cycle count from the SUPERCOP benchmarking framework There is considerablevariation in the cycle counts more than 3 between quartiles presumably depending on the mappingfrom virtual addresses to physical addresses There is no dependence on the secret input being inverted

15The LPR cryptosystem is the basis for several round-2 NIST submissions Kyber LAC NewHopeRound5 Saber ThreeBears and the ldquoNTRU LPRimerdquo option in NTRU Prime

16Google for example selected NTRU-HRSS for CECPQ2 as noted above with the following explanationldquoSchemes with a quotient-style key (like HRSS) will probably have faster encapdecap operations at thecost of much slower key-generation Since there will be many uses outside TLS where keys can be reusedthis is interesting as long as the key-generation speed is still reasonable for TLSrdquo See [42]

22 Fast constant-time gcd computation and modular inversion

where T Ztimes Zlowast2 times Z2 rarrM2(Z[12]) is defined by

T (δ f g) =

(0 1minus12

12

)if δ gt 0 and g is odd

(1 0

g mod 22

12

)otherwise

and S Ztimes Zlowast2 times Z2 rarrM2(Z) is defined by

S(δ f g) =

(1 01 minus1

)if δ gt 0 and g is odd

(1 01 1

)otherwise

Note that both T (δ f g) and S(δ f g) are defined entirely by δ the bottom bit of f(which is always 1) and the bottom bit of g they do not depend on the remaining bits off and g82 Decomposition This 2-adic divstep like the power-series divstep is a compositionof two simpler functions Specifically it is a conditional swap that replaces (δ f g) with(minusδ gminusf) if δ gt 0 and g is odd followed by an elimination that replaces (δ f g) with(1 + δ f (g + (g mod 2)f)2)83 Sizes Write (δ1 f1 g1) = divstep(δ f g) If f and g are integers that fit into n+ 1bits tworsquos-complement in the sense that minus2n le f lt 2n and minus2n le g lt 2n then alsominus2n le f1 lt 2n and minus2n le g1 lt 2n Similarly if minus2n lt f lt 2n and minus2n lt g lt 2n thenminus2n lt f1 lt 2n and minus2n lt g1 lt 2n Beware however that intermediate results such asg minus f need an extra bit

As in the polynomial case an important feature of divstep is that iterating divstepon integer inputs eventually sends g to 0 Concretely we will show that (D + o(1))niterations are sufficient where D = 2(10minus log2 633) = 2882100569 see Theorem 112for exact bounds The proof technique cannot do better than (d+ o(1))n iterations whered = 14(15minus log2(561 +

radic249185)) = 2828339631 Numerical evidence is consistent

with the possibility that (d + o(1))n iterations are required in the worst case which iswhat matters in the context of constant-time algorithms84 A non-functioning variant Many variations are possible in the definition ofdivstep However some care is required as the following variant illustrates

Define posdivstep Ztimes Zlowast2 times Z2 rarr Ztimes Zlowast2 times Z2 as follows

posdivstep(δ f g) =

(1minus δ g (g + f)2) if δ gt 0 and g is odd(1 + δ f (g + (g mod 2)f)2) otherwise

This is just like divstep except that it eliminates the negation of f in the swapped caseIt is not true that iterating posdivstep on integer inputs eventually sends g to 0 For

example posdivstep2(1 1 1) = (1 1 1) For comparison in the polynomial case we werefree to negate f (or g) at any moment85 A centered variant As another variant of divstep we define cdivstep Ztimes Zlowast2 timesZ2 rarr Ztimes Zlowast2 times Z2 as follows

cdivstep(δ f g) =

(1minus δ g (f minus g)2) if δ gt 0 and g is odd(1 + δ f (g + (g mod 2)f)2) if δ = 0(1 + δ f (g minus (g mod 2)f)2) otherwise

Daniel J Bernstein and Bo-Yin Yang 23

Iterating cdivstep produces the same intermediate results as a variable-time gcd algorithmintroduced by Stehleacute and Zimmermann in [62] The analogous gcd algorithm for divstep isa new variant of the StehleacutendashZimmermann algorithm and we analyze the performance ofthis variant to prove bounds on the performance of divstep see Appendices E F and GAs an analogy one of our proofs in the polynomial case relies on relating x-adic divstep tothe EuclidndashStevin algorithm

The extra case distinctions make each cdivstep iteration more complicated than eachdivstep iteration Similarly the StehleacutendashZimmermann algorithm is more complicated thanthe new variant To understand the motivation for cdivstep compare divstep3(minus2 f g)to cdivstep3(minus2 f g) One has divstep3(minus2 f g) = (1 f g3) where

8g3 isin g g + f g + 2f g + 3f g + 4f g + 5f g + 6f g + 7f

One has cdivstep3(minus2 f g) = (1 f gprime3) where

8gprime3 isin g minus 3f g minus 2f g minus f g g + f g + 2f g + 3f g + 4f

It seems intuitively clear that the ldquocenteredrdquo output gprime3 is smaller than the ldquouncenteredrdquooutput g3 that more generally cdivstep has smaller outputs than divstep and thatcdivstep thus uses fewer iterations than divstep

However our analyses do not support this intuition All available evidence indicatesthat surprisingly the worst case of cdivstep uses almost 10 more iterations than theworst case of divstep Concretely our proof technique cannot do better than (c+ o(1))niterations for cdivstep where c = 2(log2(

radic17minus 1)minus 1) = 3110510062 and numerical

evidence is consistent with the possibility that (c+ o(1))n iterations are required

86 A plus-or-minus variant As yet another example the following variant of divstepis essentially17 the BrentndashKung ldquoAlgorithm PMrdquo from [24]

pmdivstep(δ f g) =

(minusδ g (f minus g)2) if δ ge 0 and (g minus f) mod 4 = 0(minusδ g (f + g)2) if δ ge 0 and (g + f) mod 4 = 0(δ f (g minus f)2) if δ lt 0 and (g minus f) mod 4 = 0(δ f (g + f)2) if δ lt 0 and (g + f) mod 4 = 0(1 + δ f g2) otherwise

There are even more cases here than in cdivstep although splitting out an initial conditionalswap reduces the number of cases to 3 Another complication here is that the linearcombinations used in pmdivstep depend on the bottom two bits of f and g rather thanjust the bottom bit

To understand the motivation for pmdivstep note that after each addition or subtractionthere are always at least two halvings as in the ldquosliding windowrdquo approach to exponentiationAgain it seems intuitively clear that this reduces the number of iterations required Considerhowever the following example (which is from [24 Theorem 3] modulo minor details)pmdivstep needs 3bminus 1 iterations to reach g = 0 starting from (1 1 3 middot 2bminus2) Specificallythe first bminus 2 iterations produce

(2 1 3 middot 2bminus3) (3 1 3 middot 2bminus4) (bminus 2 1 3 middot 2) (bminus 1 1 3)

and then the next 2b+ 1 iterations produce

(1minus b 3 2) (2minus b 3 1) (2minus b 3 2) (3minus b 3 1) (3minus b 3 2) (minus1 3 1) (minus1 3 2) (0 3 1) (0 1 2) (1 1 1) (minus1 1 0)

17The BrentndashKung algorithm has an extra loop to allow the case that both inputs f g are even Compare[24 Figure 4] to the definition of pmdivstep

24 Fast constant-time gcd computation and modular inversion

Brent and Kung prove that dcne systolic cells are enough for their algorithm wherec = 3110510062 is the constant mentioned above and they conjecture that (3 + o(1))ncells are enough so presumably (3 + o(1))n iterations of pmdivstep are enough Ouranalysis shows that fewer iterations are sufficient for divstep even though divstep issimpler than pmdivstep

9 Iterates of 2-adic division stepsThe following results are analogous to the x-adic results in Section 4 We state theseresults separately to support verification

Starting from (δ f g) isin ZtimesZlowast2timesZ2 define (δn fn gn) = divstepn(δ f g) isin ZtimesZlowast2timesZ2Tn = T (δn fn gn) isinM2(Z[12]) and Sn = S(δn fn gn) isinM2(Z)

Theorem 91(fngn

)= Tnminus1 middot middot middot Tm

(fmgm

)and

(1δn

)= Snminus1 middot middot middot Sm

(1δm

)if n ge m ge 0

Proof This follows from(fm+1gm+1

)= Tm

(fmgm

)and

(1

δm+1

)= Sm

(1δm

)

Theorem 92 Tnminus1 middot middot middot Tm isin( 1

2nminusmminus1Z 12nminusmminus1Z

12nminusmZ 1

2nminusmZ

)if n gt m ge 0

Proof As in Theorem 43

Theorem 93 Snminus1 middot middot middot Sm isin(

1 02minus (nminusm) nminusmminus 2 nminusm 1minus1

)if n gt

m ge 0

Proof As in Theorem 44

Theorem 94 Let m t be nonnegative integers Assume that δprimem = δm f primem equiv fm(mod 2t) and gprimem equiv gm (mod 2t) Then for each integer n with m lt n le m+ t T primenminus1 =Tnminus1 S primenminus1 = Snminus1 δprimen = δn f primen equiv fn (mod 2tminus(nminusmminus1)) and gprimen equiv gn (mod 2tminus(nminusm))

Proof Fix m t and induct on n Observe that (δprimenminus1 fprimenminus1 mod 2 gprimenminus1 mod 2) equals

(δnminus1 fnminus1 mod 2 gnminus1 mod 2)

bull If n = m + 1 By assumption f primem equiv fm (mod 2t) and n le m + t so t ge 1 so inparticular f primem mod 2 = fm mod 2 Similarly gprimem mod 2 = gm mod 2 By assumptionδprimem = δm

bull If n gt m + 1 f primenminus1 equiv fnminus1 (mod 2tminus(nminusmminus2)) by the inductive hypothesis andn le m+ t so tminus(nminusmminus2) ge 2 so in particular f primenminus1 mod 2 = fnminus1 mod 2 similarlygprimenminus1 equiv gnminus1 (mod 2tminus(nminusmminus1)) by the inductive hypothesis and tminus (nminusmminus1) ge 1so in particular gprimenminus1 mod 2 = gnminus1 mod 2 and δprimenminus1 = δnminus1 by the inductivehypothesis

Now use the fact that T and S inspect only the bottom bit of each number to see that

T primenminus1 = T (δprimenminus1 fprimenminus1 g

primenminus1) = T (δnminus1 fnminus1 gnminus1) = Tnminus1

S primenminus1 = S(δprimenminus1 fprimenminus1 g

primenminus1) = S(δnminus1 fnminus1 gnminus1) = Snminus1

as claimedMultiply

(1

δprimenminus1

)=(

1δnminus1

)by S primenminus1 = Snminus1 to see that δprimen = δn as claimed

Daniel J Bernstein and Bo-Yin Yang 25

deftruncate(ft)ift==0return0twot=1ltlt(t-1)return((f+twot)amp(2twot-1))-twot

defdivsteps2(ntdeltafg)asserttgt=nandngt=0fg=truncate(ft)truncate(gt)uvqr=1001

whilengt0f=truncate(ft)ifdeltagt0andgamp1deltafguvqr=-deltag-fqr-u-vg0=gamp1deltagqr=1+delta(g+g0f)2(q+g0u)2(r+g0v)2nt=n-1t-1g=truncate(ZZ(g)t)

M2Q=MatrixSpace(QQ2)returndeltafgM2Q((uvqr))

Figure 101 Algorithm divsteps2 to compute (δn fn gn) = divstepn(δ f g) Inputsn t isin Z with 0 le n le t δ isin Z at least bottom t bits of f g isin Z2 Outputs δn bottom tbits of fn if n = 0 or tminus (nminus 1) bits if n ge 1 bottom tminus n bits of gn Tnminus1 middot middot middot T0

By the inductive hypothesis (T′_m, …, T′_{n−2}) = (T_m, …, T_{n−2}), and T′_{n−1} = T_{n−1}, so T′_{n−1} ⋯ T′_m = T_{n−1} ⋯ T_m. Abbreviate T_{n−1} ⋯ T_m as P. Then

$$\begin{pmatrix} f'_n \\ g'_n \end{pmatrix} = P \begin{pmatrix} f'_m \\ g'_m \end{pmatrix} \quad\text{and}\quad \begin{pmatrix} f_n \\ g_n \end{pmatrix} = P \begin{pmatrix} f_m \\ g_m \end{pmatrix}$$

by Theorem 9.1, so

$$\begin{pmatrix} f'_n - f_n \\ g'_n - g_n \end{pmatrix} = P \begin{pmatrix} f'_m - f_m \\ g'_m - g_m \end{pmatrix}.$$

By Theorem 9.2,

$$P \in \begin{pmatrix} \frac{1}{2^{n-m-1}}\mathbf{Z} & \frac{1}{2^{n-m-1}}\mathbf{Z} \\ \frac{1}{2^{n-m}}\mathbf{Z} & \frac{1}{2^{n-m}}\mathbf{Z} \end{pmatrix}.$$

By assumption f′_m − f_m and g′_m − g_m are multiples of 2^t. Multiply to see that f′_n − f_n is a multiple of 2^t/2^{n−m−1} = 2^{t−(n−m−1)}, as claimed, and g′_n − g_n is a multiple of 2^t/2^{n−m} = 2^{t−(n−m)}, as claimed.

10 Fast computation of iterates of 2-adic division steps

We describe, just as in Section 5, two algorithms to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). We are given the bottom t bits of f, g, and compute f_n, g_n to t − n + 1, t − n bits respectively if n ≥ 1; see Theorem 9.4. We also compute the product of the relevant T matrices. We specify these algorithms in Figures 10.1–10.2 as functions divsteps2 and jumpdivsteps2 in Sage.

The first function, divsteps2, simply takes divsteps one after another, analogously to divstepsx. The second function, jumpdivsteps2, uses a straightforward divide-and-conquer strategy, analogously to jumpdivstepsx. More generally, j in jumpdivsteps2 can be taken anywhere between 1 and n − 1; this generalization includes divsteps2 as a special case.
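For example (an illustrative check, not from the paper): calling divsteps2 with a sufficiently large precision t reproduces the exact iterates, and the returned matrix satisfies Theorem 9.1 with m = 0.

# Illustrative Sage check of divsteps2 from Figure 10.1 on small inputs.
delta6, f6, g6, P = divsteps2(6, 20, 1, 21, 10)
assert vector((f6, g6)) == P * vector((21, 10))   # Theorem 9.1 with m = 0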

As in Section 5, the Sage functions are not constant-time, but it is easy to build constant-time versions of the same computations. The comments on performance in Section 5 are applicable here, with integer arithmetic replacing polynomial arithmetic. The same basic optimization techniques are also applicable here.
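As one illustration of the branchless style used by such constant-time versions, here is a sketch (ours, not the paper's software) of a single division step in which the swap condition is turned into an arithmetic mask; real constant-time code must additionally use constant-time multi-limb arithmetic, as in the case study of Section 12.

# Branchless sketch of one division step; Python integers behave as two's
# complement for & and ^, so "mask" is -1 exactly when delta > 0 and g is odd.
def divstep_branchless(delta, f, g):
    mask = -(int(delta > 0) & (g & 1))
    delta = (delta ^ mask) - mask                        # conditional negation of delta
    f, g = f ^ ((f ^ g) & mask), g ^ ((g ^ -f) & mask)   # conditional (f, g) <- (g, -f)
    g0 = g & 1
    return 1 + delta, f, (g + g0 * f) >> 1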


from divsteps2 import divsteps2, truncate

def jumpdivsteps2(n, t, delta, f, g):
    assert t >= n and n >= 0
    if n <= 1: return divsteps2(n, t, delta, f, g)

    j = n // 2

    delta, f1, g1, P1 = jumpdivsteps2(j, j, delta, f, g)
    f, g = P1 * vector((f, g))
    f, g = truncate(ZZ(f), t-j), truncate(ZZ(g), t-j)

    delta, f2, g2, P2 = jumpdivsteps2(n-j, n-j, delta, f, g)
    f, g = P2 * vector((f, g))
    f, g = truncate(ZZ(f), t-n+1), truncate(ZZ(g), t-n)

    return delta, f, g, P2 * P1

Figure 10.2: Algorithm jumpdivsteps2 to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). Same inputs and outputs as in Figure 10.1.

11 Fast integer gcd computation and modular inversion

This section presents two algorithms. If f is an odd integer and g is an integer then gcd2 computes gcd{f, g}. If also gcd{f, g} = 1 then recip2 computes the reciprocal of g modulo f. Figure 11.1 specifies both algorithms, and Theorem 11.2 justifies both algorithms.

The algorithms take the smallest nonnegative integer d such that |f| < 2^d and |g| < 2^d. More generally, one can take d as an extra input; the algorithms then take constant time if d is constant. Theorem 11.2 relies on f^2 + 4g^2 ≤ 5·2^{2d} (which is certainly true if |f| < 2^d and |g| < 2^d) but does not rely on d being minimal. One can further generalize to allow even f, by first finding the number of shared powers of 2 in f and g and then reducing to the odd case.

The algorithms then choose a positive integer m as a particular function of d, and call divsteps2 (which can be transparently replaced with jumpdivsteps2) to compute (δ_m, f_m, g_m) = divstep^m(1, f, g). The choice of m guarantees g_m = 0 and f_m = ±gcd{f, g} by Theorem 11.2.

The gcd2 algorithm uses input precision m + d to obtain d + 1 bits of f_m, i.e., to obtain the signed remainder of f_m modulo 2^{d+1}. This signed remainder is exactly f_m = ±gcd{f, g}, since the assumptions |f| < 2^d and |g| < 2^d imply |gcd{f, g}| < 2^d.

The recip2 algorithm uses input precision just m + 1 to obtain the signed remainder of f_m modulo 4. This algorithm assumes gcd{f, g} = 1, so f_m = ±1, so the signed remainder is exactly f_m. The algorithm also extracts the top-right corner v_m of the transition matrix, and multiplies by f_m, obtaining a reciprocal of g modulo f by Theorem 11.2.

Both gcd2 and recip2 are bottom-up algorithms, and in particular recip2 naturally obtains this reciprocal v_m f_m as a fraction with denominator 2^{m−1}. It multiplies by 2^{m−1} to obtain an integer, and then multiplies by ((f + 1)/2)^{m−1} modulo f to divide by 2^{m−1} modulo f. The reciprocal of 2^{m−1} modulo f can be computed ahead of time. Another way to compute this reciprocal is through Hensel lifting to compute f^{−1} (mod 2^{m−1}); for details and further techniques see Section 6.7.

We emphasize that recip2 assumes that its inputs are coprime: it does not notice if this assumption is violated. One can check the assumption by computing f_m to higher precision (as in gcd2), or by multiplying the supposed reciprocal by g and seeing whether the result is 1 modulo f.


from divsteps2 import divsteps2

def iterations(d):
    return (49*d + 80)//17 if d < 46 else (49*d + 57)//17

def gcd2(f, g):
    assert f & 1
    d = max(f.nbits(), g.nbits())
    m = iterations(d)
    delta, fm, gm, P = divsteps2(m, m+d, 1, f, g)
    return abs(fm)

def recip2(f, g):
    assert f & 1
    d = max(f.nbits(), g.nbits())
    m = iterations(d)
    precomp = Integers(f)((f+1)/2)^(m-1)
    delta, fm, gm, P = divsteps2(m, m+1, 1, f, g)
    V = sign(fm)*ZZ(P[0][1]*2^(m-1))
    return ZZ(V*precomp)

Figure 11.1: Algorithm gcd2 to compute gcd{f, g}; algorithm recip2 to compute the reciprocal of g modulo f when gcd{f, g} = 1. Both algorithms assume that f is odd.

For comparison, the recip algorithm from Section 6 does notice if its polynomial inputs are coprime, using a quick test of δ_m.

Theorem 11.2. Let f be an odd integer. Let g be an integer. Define (δ_n, f_n, g_n) = divstep^n(1, f, g), T_n = T(δ_n, f_n, g_n), and

$$\begin{pmatrix} u_n & v_n \\ q_n & r_n \end{pmatrix} = T_{n-1} \cdots T_0.$$

Let d be a real number. Assume that f^2 + 4g^2 ≤ 5·2^{2d}. Let m be an integer. Assume that m ≥ ⌊(49d + 80)/17⌋ if d < 46, and that m ≥ ⌊(49d + 57)/17⌋ if d ≥ 46. Then m ≥ 1, g_m = 0, f_m = ±gcd{f, g}, 2^{m−1}v_m ∈ Z, and 2^{m−1}v_m g ≡ 2^{m−1}f_m (mod f).

In particular, if gcd{f, g} = 1, then f_m = ±1, and dividing 2^{m−1}v_m f_m by 2^{m−1} modulo f produces the reciprocal of g modulo f.

Proof. First f^2 ≥ 1, so 1 ≤ f^2 + 4g^2 ≤ 5·2^{2d} < 2^{2d+7/3} since 5^3 < 2^7. Hence d > −7/6. If d < 46 then m ≥ ⌊(49d + 80)/17⌋ ≥ ⌊(49·(−7/6) + 80)/17⌋ = 1. If d ≥ 46 then m ≥ ⌊(49d + 57)/17⌋ ≥ ⌊(49·46 + 57)/17⌋ = 135. Either way m ≥ 1 as claimed.

Define R_0 = f, R_1 = 2g, and G = gcd{R_0, R_1}. Then G = gcd{f, g} since f is odd. Define b = log_2 √(R_0^2 + R_1^2) = log_2 √(f^2 + 4g^2) ≥ 0. Then 2^{2b} = f^2 + 4g^2 ≤ 5·2^{2d}, so b ≤ d + log_4 5. We now split into three cases, in each case defining n as in Theorem G.6.

• If b ≤ 21: Define n = ⌊19b/7⌋. Then n ≤ ⌊49b/17⌋ ≤ ⌊49(d + log_4 5)/17⌋ ≤ ⌊(49d + 57)/17⌋ ≤ m, since 5^49 ≤ 4^57.

• If 21 < b ≤ 46: Define n = ⌊(49b + 23)/17⌋. If d ≥ 46 then n ≤ ⌊(49d + 23)/17⌋ ≤ ⌊(49d + 57)/17⌋ ≤ m. Otherwise d < 46, so n ≤ ⌊(49(d + log_4 5) + 23)/17⌋ ≤ ⌊(49d + 80)/17⌋ ≤ m.

• If b > 46: Define n = ⌊49b/17⌋. Then n ≤ ⌊49(d + log_4 5)/17⌋ ≤ ⌊(49d + 57)/17⌋ ≤ m.


In all cases n ≤ m.

Now divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0} by Theorem G.6, so divstep^m(1, R_0, R_1/2) ∈ Z × Z × {0}, so

(δ_m, f_m, g_m) = divstep^m(1, f, g) = divstep^m(1, R_0, R_1/2) ∈ Z × {G, −G} × {0}

by Theorem G.3. Hence g_m = 0 as claimed, and f_m = ±G = ±gcd{f, g} as claimed. Both u_m and v_m are in (1/2^{m−1})Z by Theorem 9.2. In particular 2^{m−1}v_m ∈ Z as claimed. Finally, f_m = u_m f + v_m g by Theorem 9.1, so 2^{m−1}f_m = 2^{m−1}u_m f + 2^{m−1}v_m g, and 2^{m−1}u_m ∈ Z, so 2^{m−1}f_m ≡ 2^{m−1}v_m g (mod f) as claimed.

12 Software case study for integer modular inversion

As emphasized in Section 1, we have selected an inversion problem to be extremely favorable to Fermat's method. We use Intel platforms with 64-bit multipliers. We focus on the problem of inverting modulo the well-known Curve25519 prime p = 2^255 − 19. There has been a long line of Curve25519 implementation papers, starting with [10] in 2006; we compare to the latest speed records [54] (in assembly) for inversion modulo this prime.

Despite all these Fermat advantages, we achieve better speeds using our 2-adic division steps. As mentioned in Section 1, we take 10050 cycles, 8778 cycles, and 8543 cycles on Haswell, Skylake, and Kaby Lake respectively, where [54] used 11854 cycles, 9301 cycles, and 8971 cycles. This section looks more closely at how the new computation works.

12.1 Details of the new computation. Theorem 11.2 guarantees that 738 divstep iterations suffice, since ⌊(49·255 + 57)/17⌋ = 738. To simplify the computation we actually use 744 = 12·62 iterations.

The decisions in the first 62 iterations are determined entirely by the bottom 62 bits of the inputs. We perform these iterations entirely in 64-bit words, apply the resulting 62-step transition matrix to the entire original inputs to jump to the result of 62 divstep iterations, and then repeat.

This is similar in two important ways to Lehmer's gcd computation from [44]. One of Lehmer's ideas, mentioned before, is to carry out some steps in limited precision. Another of Lehmer's ideas is for this limit to be a small constant. Our advantage over Lehmer's computation is that we have a completely regular, constant-time data flow.

There are many other possibilities for which precision to use at each step. All of these possibilities can be described using the jumping framework described in Section 5.3, which applies to both the x-adic case and the 2-adic case. Figure 12.2 gives three examples of jump strategies. The first strategy computes each divstep in turn, to full precision, as in the case study in Section 7. The second strategy is what we use in this case study. The third strategy illustrates an asymptotically faster choice of jump distances. The third strategy might be more efficient than the second strategy in hardware, but the second strategy seems optimal for 64-bit CPUs.

In more detail, we invert a 255-bit x modulo p as follows.

1. Set f = p, g = x, δ = 1, i = 1.

2. Set f_0 = f (mod 2^64), g_0 = g (mod 2^64).

3. Compute (δ′, f_1, g_1) = divstep^62(δ, f_0, g_0), and obtain a scaled transition matrix T_i such that

$$\frac{T_i}{2^{62}}\begin{pmatrix} f_0 \\ g_0 \end{pmatrix} = \begin{pmatrix} f_1 \\ g_1 \end{pmatrix}.$$

The 63-bit signed entries of T_i fit into 64-bit registers. We call this step jump64divsteps2.

4. Compute (f, g) ← T_i(f, g)/2^62. Set δ = δ′.


Figure 12.2: Three jump strategies for 744 divstep iterations on 255-bit integers. Left strategy: j = 1, i.e., computing each divstep in turn to full precision. Middle strategy: j = 1 for ≤62 requested iterations, else j = 62; i.e., using 62 bottom bits to compute 62 iterations, jumping 62 iterations, using 62 new bottom bits to compute 62 more iterations, etc. Right strategy: j = 1 for 2 requested iterations, j = 2 for 3 or 4 requested iterations, j = 4 for 5 or 6 or 7 or 8 requested iterations, etc. Vertical axis: 0 on top through 744 on bottom, number of iterations. Horizontal axis (within each strategy): 0 on left through 254 on right, bit positions used in the computation.

5. Set i ← i + 1. Go back to step 3 if i ≤ 12.

6. Compute v mod p, where v is the top-right corner of T_12 T_11 ⋯ T_1:

(a) Compute pair products T_{2i}T_{2i−1}, with entries being 126-bit signed integers, each fitting into two 64-bit limbs (radix 2^64).

(b) Compute quad products T_{4i}T_{4i−1}T_{4i−2}T_{4i−3}, with entries being 252-bit signed integers (four 64-bit limbs, radix 2^64).

(c) At this point the three quad products are converted into unsigned integers modulo p, divided into four 64-bit limbs.

(d) Compute the final vector-matrix-vector multiplications modulo p.

7. Compute x^{−1} = sgn(f) · v · 2^{−744} (mod p), where 2^{−744} mod p is precomputed.
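The data flow of steps 1–7 can be sketched compactly in Sage on top of divsteps2 from Figure 10.1. The sketch below is hypothetical and variable-time: it uses full-width integers where the actual software uses 64-bit limbs, so it reproduces the 12 × 62 jump structure and the final correction by 2^{−744}, but none of the constant-time behavior or the performance of the assembly. It assumes 0 < x < p with gcd{x, p} = 1.

# Hypothetical Sage sketch of the 12 x 62 jump structure for inverting x mod p.
def invert25519_sketch(x):
    p = 2^255 - 19
    delta, f, g = 1, p, x
    P = matrix(QQ, [[1, 0], [0, 1]])              # accumulated transition matrix
    for i in range(12):
        # the returned matrix depends only on the bottom 62 bits (Theorem 9.4)
        delta, f62, g62, T = divsteps2(62, 64, delta, f, g)
        fg = T * vector((f, g))                   # jump 62 iterations on the full-size f, g
        f, g = ZZ(fg[0]), ZZ(fg[1])
        P = T * P
    v = P[0][1]                                   # top-right corner; 2^744 * v is an integer
    return ZZ(sign(f) * ZZ(2^744 * v) * inverse_mod(2^744, p)) % p

On coprime inputs this returns x^{−1} mod p; the real software replaces the rational matrix arithmetic by the fixed-width limb arithmetic described in step 6.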

12.3 Other CPUs. Our advantage is larger on platforms with smaller multipliers. For example, we have written software to invert modulo 2^255 − 19 on an ARM Cortex-A7, obtaining a median cycle count of 35277 cycles. The best previous Cortex-A7 result we have found in the literature is 62648 cycles, reported by Fujii and Aranha [34] using Fermat's method. Internally, our software makes repeated calls to jump32divsteps2, units of 30 iterations, with the resulting transition matrix fitting inside 32-bit general-purpose registers.

12.4 Other primes. Nath and Sarkar present inversion speeds for 20 different special primes, 12 of which were considered in previous work. For concreteness we consider inversion modulo the double-size prime 2^511 − 187. Aranha–Barreto–Pereira–Ricardini [8] selected this prime for their "M-511" curve.

Nath and Sarkar report that their Fermat inversion software for 2^511 − 187 takes 72804 Haswell cycles, 47062 Skylake cycles, or 45014 Kaby Lake cycles. The slowdown from 2^255 − 19, more than 5× in each case, is easy to explain: the exponent is twice as long, producing almost twice as many multiplications; each multiplication handles twice as many bits; and multiplication cost per bit is not constant.

Our 511-bit software is solidly faster: under 30000 cycles on all of these platforms. Most of the cycles in our computation are in evaluating jump64divsteps2, and the number of calls to jump64divsteps2 scales linearly with the number of bits in the input. There is some overhead for multiplications, so a modulus of twice the size takes more than twice as long, but our scalability to large moduli is much better than the Fermat scalability.

Our advantage is also larger in applications that use "random" primes rather than special primes: we pay the extra cost for Montgomery multiplication only at the end of our computation, whereas Fermat inversion pays the extra cost in each step.

References

[1] — (no editor), Actes du congrès international des mathématiciens, tome 3, Gauthier-Villars Éditeur, Paris, 1971. See [41].

[2] mdash (no editor) CWIT 2011 12th Canadian workshop on information theoryKelowna British Columbia May 17ndash20 2011 Institute of Electrical and ElectronicsEngineers 2011 ISBN 978-1-4577-0743-8 See [51]

[3] Carlisle Adams Jan Camenisch (editors) Selected areas in cryptographymdashSAC 201724th international conference Ottawa ON Canada August 16ndash18 2017 revisedselected papers Lecture Notes in Computer Science 10719 Springer 2018 ISBN978-3-319-72564-2 See [15]

[4] Carlos Aguilar Melchor Nicolas Aragon Slim Bettaieb Loiumlc Bidoux OlivierBlazy Jean-Christophe Deneuville Philippe Gaborit Edoardo Persichetti GillesZeacutemor Hamming Quasi-Cyclic (HQC) (2017) URL httpspqc-hqcorgdochqc-specification_2017-11-30pdf Citations in this document sect1

[5] Alfred V Aho (chairman) Proceedings of fifth annual ACM symposium on theory ofcomputing Austin Texas April 30ndashMay 2 1973 ACM New York 1973 See [53]

[6] Martin Albrecht Carlos Cid Kenneth G Paterson CJ Tjhai Martin TomlinsonNTS-KEM ldquoThe main document submitted to NISTrdquo (2017) URL httpsnts-kemio Citations in this document sect1

[7] Alejandro Cabrera Aldaya Cesar Pereida Garciacutea Luis Manuel Alvarez Tapia BillyBob Brumley Cache-timing attacks on RSA key generation (2018) URL httpseprintiacrorg2018367 Citations in this document sect1

[8] Diego F Aranha Paulo S L M Barreto Geovandro C C F Pereira JeffersonE Ricardini A note on high-security general-purpose elliptic curves (2013) URLhttpseprintiacrorg2013647 Citations in this document sect124


[9] Elwyn R Berlekamp Algebraic coding theory McGraw-Hill 1968 Citations in thisdocument sect1

[10] Daniel J Bernstein Curve25519 new Diffie-Hellman speed records in PKC 2006 [70](2006) 207ndash228 URL httpscryptopapershtmlcurve25519 Citations inthis document sect1 sect12

[11] Daniel J Bernstein Fast multiplication and its applications in [27] (2008) 325ndash384 URL httpscryptopapershtmlmultapps Citations in this documentsect55

[12] Daniel J Bernstein Tung Chou Tanja Lange Ingo von Maurich Rafael MisoczkiRuben Niederhagen Edoardo Persichetti Christiane Peters Peter Schwabe NicolasSendrier Jakub Szefer Wen Wang Classic McEliece conservative code-based cryp-tography ldquoSupporting Documentationrdquo (2017) URL httpsclassicmcelieceorgnisthtml Citations in this document sect1

[13] Daniel J Bernstein Tung Chou Peter Schwabe McBits fast constant-time code-based cryptography in CHES 2013 [18] (2013) 250ndash272 URL httpsbinarycryptomcbitshtml Citations in this document sect1 sect1

[14] Daniel J Bernstein Chitchanok Chuengsatiansup Tanja Lange Christine vanVredendaal NTRU Prime reducing attack surface at low cost full version of[15] (2017) URL httpsntruprimecryptopapershtml Citations in thisdocument sect1 sect1 sect11 sect73 sect73 sect74

[15] Daniel J Bernstein Chitchanok Chuengsatiansup Tanja Lange Christine vanVredendaal NTRU Prime reducing attack surface at low cost in SAC 2017 [3]abbreviated version of [14] (2018) 235ndash260

[16] Daniel J Bernstein Niels Duif Tanja Lange Peter Schwabe Bo-Yin Yang High-speed high-security signatures Journal of Cryptographic Engineering 2 (2012)77ndash89 see also older version at CHES 2011 URL httpsed25519cryptoed25519-20110926pdf Citations in this document sect1

[17] Daniel J Bernstein Peter Schwabe NEON crypto in CHES 2012 [56] (2012) 320ndash339URL httpscryptopapershtmlneoncrypto Citations in this documentsect1 sect1

[18] Guido Bertoni Jean-Seacutebastien Coron (editors) Cryptographic hardware and embeddedsystemsmdashCHES 2013mdash15th international workshop Santa Barbara CA USAAugust 20ndash23 2013 proceedings Lecture Notes in Computer Science 8086 Springer2013 ISBN 978-3-642-40348-4 See [13]

[19] Adam W Bojanczyk Richard P Brent A systolic algorithm for extended GCDcomputation Computers amp Mathematics with Applications 14 (1987) 233ndash238ISSN 0898-1221 URL httpsmaths-peopleanueduau~brentpdrpb096ipdf Citations in this document sect13 sect13

[20] Tomas J Boothby and Robert W Bradshaw Bitslicing and the method of fourRussians over larger finite fields (2009) URL httparxivorgabs09011413Citations in this document sect71 sect72

[21] Joppe W Bos Constant time modular inversion Journal of Cryptographic Engi-neering 4 (2014) 275ndash281 URL httpjoppeboscomfilesCTInversionpdfCitations in this document sect1 sect1 sect1 sect1 sect1 sect13


[22] Richard P. Brent, Fred G. Gustavson, David Y. Y. Yun, Fast solution of Toeplitz systems of equations and computation of Padé approximants, Journal of Algorithms 1 (1980), 259–295. ISSN 0196-6774. URL: https://maths-people.anu.edu.au/~brent/pd/rpb059i.pdf. Citations in this document: §1.4, §1.4.

[23] Richard P Brent Hsiang-Tsung Kung Systolic VLSI arrays for polynomial GCDcomputation IEEE Transactions on Computers C-33 (1984) 731ndash736 URLhttpsmaths-peopleanueduau~brentpdrpb073ipdf Citations in thisdocument sect13 sect13 sect32

[24] Richard P Brent Hsiang-Tsung Kung A systolic VLSI array for integer GCDcomputation ARITH-7 (1985) 118ndash125 URL httpsmaths-peopleanueduau~brentpdrpb077ipdf Citations in this document sect13 sect13 sect13 sect13 sect14sect86 sect86 sect86

[25] Richard P Brent Paul Zimmermann An O(M(n) logn) algorithm for the Jacobisymbol in ANTS 2010 [37] (2010) 83ndash95 URL httpsarxivorgpdf10042091pdf Citations in this document sect14

[26] Duncan A Buell (editor) Algorithmic number theory 6th international symposiumANTS-VI Burlington VT USA June 13ndash18 2004 proceedings Lecture Notes inComputer Science 3076 Springer 2004 ISBN 3-540-22156-5 See [62]

[27] Joe P Buhler Peter Stevenhagen (editors) Surveys in algorithmic number theoryMathematical Sciences Research Institute Publications 44 Cambridge UniversityPress New York 2008 See [11]

[28] Wouter Castryck Tanja Lange Chloe Martindale Lorenz Panny Joost RenesCSIDH an efficient post-quantum commutative group action in Asiacrypt 2018[55] (2018) 395ndash427 URL httpscsidhisogenyorgcsidh-20181118pdfCitations in this document sect1

[29] Cong Chen Jeffrey Hoffstein William Whyte Zhenfei Zhang NIST PQ SubmissionNTRUEncrypt A lattice based encryption algorithm (2017) URL httpscsrcnistgovProjectsPost-Quantum-CryptographyRound-1-Submissions Cita-tions in this document sect73

[30] Tung Chou McBits revisited in CHES 2017 [33] (2017) 213ndash231 URL httpstungchougithubiomcbits Citations in this document sect1

[31] Benoît Daireaux, Véronique Maume-Deschamps, Brigitte Vallée, The Lyapunov tortoise and the dyadic hare, in [49] (2005), 71–94. URL: http://emis.ams.org/journals/DMTCS/pdfpapers/dmAD0108.pdf. Citations in this document: §E.5, §F.2.

[32] Jean Louis Dornstetter On the equivalence between Berlekamprsquos and Euclidrsquos algo-rithms IEEE Transactions on Information Theory 33 (1987) 428ndash431 Citations inthis document sect1

[33] Wieland Fischer Naofumi Homma (editors) Cryptographic hardware and embeddedsystemsmdashCHES 2017mdash19th international conference Taipei Taiwan September25ndash28 2017 proceedings Lecture Notes in Computer Science 10529 Springer 2017ISBN 978-3-319-66786-7 See [30] [39]

[34] Hayato Fujii Diego F Aranha Curve25519 for the Cortex-M4 and beyond LATIN-CRYPT 2017 to appear (2017) URL httpswwwlascaicunicampbrmediapublicationspaper39pdf Citations in this document sect11 sect123


[35] Philippe Gaborit Carlos Aguilar Melchor Cryptographic method for communicatingconfidential information patent EP2537284B1 (2010) URL httpspatentsgooglecompatentEP2537284B1en Citations in this document sect74

[36] Mariya Georgieva, Frédéric de Portzamparc, Toward secure implementation of McEliece decryption, in COSADE 2015 [48] (2015), 141–156. URL: https://eprint.iacr.org/2015/271. Citations in this document: §1.

[37] Guillaume Hanrot, François Morain, Emmanuel Thomé (editors), Algorithmic number theory, 9th international symposium, ANTS-IX, Nancy, France, July 19–23, 2010, proceedings, Lecture Notes in Computer Science 6197, 2010. ISBN 978-3-642-14517-9. See [25].

[38] Agnes E. Heydtmann, Jørn M. Jensen, On the equivalence of the Berlekamp–Massey and the Euclidean algorithms for decoding, IEEE Transactions on Information Theory 46 (2000), 2614–2624. Citations in this document: §1.

[39] Andreas Hülsing, Joost Rijneveld, John M. Schanck, Peter Schwabe, High-speed key encapsulation from NTRU, in CHES 2017 [33] (2017), 232–252. URL: https://eprint.iacr.org/2017/667. Citations in this document: §1, §1.1, §1.1, §1.3, §7, §7, §7.2, §7.2, §7.2, §7.2, §7.2, §7.2, §7.2, §7.2, §7.3, §7.3, §7.3, §7.3, §7.3.

[40] Burton S Kaliski The Montgomery inverse and its applications IEEE Transactionson Computers 44 (1995) 1064ndash1065 Citations in this document sect1

[41] Donald E Knuth The analysis of algorithms in [1] (1971) 269ndash274 Citations inthis document sect1 sect14 sect14

[42] Adam Langley CECPQ2 (2018) URL httpswwwimperialvioletorg20181212cecpq2html Citations in this document sect74

[43] Adam Langley Add initial HRSS support (2018)URL httpsboringsslgooglesourcecomboringssl+7b935937b18215294e7dbf6404742855e3349092cryptohrsshrssc Citationsin this document sect72

[44] Derrick H Lehmer Euclidrsquos algorithm for large numbers American MathematicalMonthly 45 (1938) 227ndash233 ISSN 0002-9890 URL httplinksjstororgsicisici=0002-9890(193804)454lt227EAFLNgt20CO2-Y Citations in thisdocument sect1 sect14 sect121

[45] Xianhui Lu Yamin Liu Dingding Jia Haiyang Xue Jingnan He Zhenfei ZhangLAC lattice-based cryptosystems (2017) URL httpscsrcnistgovProjectsPost-Quantum-CryptographyRound-1-Submissions Citations in this documentsect1

[46] Vadim Lyubashevsky Chris Peikert Oded Regev On ideal lattices and learningwith errors over rings Journal of the ACM 60 (2013) Article 43 35 pages URLhttpseprintiacrorg2012230 Citations in this document sect74

[47] Vadim Lyubashevsky Gregor Seiler NTTRU truly fast NTRU using NTT (2019)URL httpseprintiacrorg2019040 Citations in this document sect74

[48] Stefan Mangard Axel Y Poschmann (editors) Constructive side-channel analysisand secure designmdash6th international workshop COSADE 2015 Berlin GermanyApril 13ndash14 2015 revised selected papers Lecture Notes in Computer Science 9064Springer 2015 ISBN 978-3-319-21475-7 See [36]


[49] Conrado Martiacutenez (editor) 2005 international conference on analysis of algorithmspapers from the conference (AofArsquo05) held in Barcelona June 6ndash10 2005 TheAssociation Discrete Mathematics amp Theoretical Computer Science (DMTCS)Nancy 2005 URL httpwwwemisamsorgjournalsDMTCSproceedingsdmAD01indhtml See [31]

[50] James Massey Shift-register synthesis and BCH decoding IEEE Transactions onInformation Theory 15 (1969) 122ndash127 ISSN 0018-9448 Citations in this documentsect1

[51] Todd D Mateer On the equivalence of the Berlekamp-Massey and the euclideanalgorithms for algebraic decoding in CWIT 2011 [2] (2011) 139ndash142 Citations inthis document sect1

[52] Niels Möller, On Schönhage's algorithm and subquadratic integer GCD computation, Mathematics of Computation 77 (2008), 589–607. URL: https://www.lysator.liu.se/~nisse/archive/S0025-5718-07-02017-0.pdf. Citations in this document: §1.4.

[53] Robert T Moenck Fast computation of GCDs in STOC 1973 [5] (1973) 142ndash151Citations in this document sect14 sect14 sect14 sect14

[54] Kaushik Nath Palash Sarkar Efficient inversion in (pseudo-)Mersenne primeorder fields (2018) URL httpseprintiacrorg2018985 Citations in thisdocument sect11 sect12 sect12

[55] Thomas Peyrin Steven D Galbraith Advances in cryptologymdashASIACRYPT 2018mdash24th international conference on the theory and application of cryptology and infor-mation security Brisbane QLD Australia December 2ndash6 2018 proceedings part ILecture Notes in Computer Science 11272 Springer 2018 ISBN 978-3-030-03325-5See [28]

[56] Emmanuel Prouff Patrick Schaumont (editors) Cryptographic hardware and embed-ded systemsmdashCHES 2012mdash14th international workshop Leuven Belgium September9ndash12 2012 proceedings Lecture Notes in Computer Science 7428 Springer 2012ISBN 978-3-642-33026-1 See [17]

[57] Martin Roetteler Michael Naehrig Krysta M Svore Kristin E Lauter Quantumresource estimates for computing elliptic curve discrete logarithms in ASIACRYPT2017 [67] (2017) 241ndash270 URL httpseprintiacrorg2017598 Citationsin this document sect11 sect13

[58] The Sage Developers (editor) SageMath the Sage Mathematics Software System(Version 80) 2017 URL httpswwwsagemathorg Citations in this documentsect12 sect12 sect22 sect5

[59] John Schanck Announcement of NTRU-HRSS-KEM and NTRUEncrypt merger(2018) URL httpsgroupsgooglecomalistnistgovdmsgpqc-forumSrFO_vK3xbImSmjY0HZCgAJ Citations in this document sect73

[60] Arnold Schönhage, Schnelle Berechnung von Kettenbruchentwicklungen, Acta Informatica 1 (1971), 139–144. ISSN 0001-5903. Citations in this document: §1, §1.4, §1.4.

[61] Joseph H Silverman Almost inverses and fast NTRU key creation NTRU Cryptosys-tems (Technical Note014) (1999) URL httpsassetssecurityinnovationcomstaticdownloadsNTRUresourcesNTRUTech014pdf Citations in this doc-ument sect73 sect73


[62] Damien Stehlé, Paul Zimmermann, A binary recursive gcd algorithm, in ANTS 2004 [26] (2004), 411–425. URL: https://perso.ens-lyon.fr/damien.stehle/BINARY.html. Citations in this document: §1.4, §1.4, §8.5, §E, §E.6, §E.6, §E.6, §F.2, §F.2, §F.2, §H, §H.

[63] Josef Stein Computational problems associated with Racah algebra Journal ofComputational Physics 1 (1967) 397ndash405 Citations in this document sect1

[64] Simon Stevin, L'arithmétique, Imprimerie de Christophle Plantin, 1585. URL: http://www.dwc.knaw.nl/pub/bronnen/Simon_Stevin-[II_B]_The_Principal_Works_of_Simon_Stevin_Mathematics.pdf. Citations in this document: §1.

[65] Volker Strassen The computational complexity of continued fractions in SYM-SAC1981 [69] (1981) 51ndash67 see also newer version [66]

[66] Volker Strassen The computational complexity of continued fractions SIAM Journalon Computing 12 (1983) 1ndash27 see also older version [65] ISSN 0097-5397 Citationsin this document sect14 sect14 sect53 sect55

[67] Tsuyoshi Takagi Thomas Peyrin (editors) Advances in cryptologymdashASIACRYPT2017mdash23rd international conference on the theory and applications of cryptology andinformation security Hong Kong China December 3ndash7 2017 proceedings part IILecture Notes in Computer Science 10625 Springer 2017 ISBN 978-3-319-70696-2See [57]

[68] Emmanuel Thomé, Subquadratic computation of vector generating polynomials and improvement of the block Wiedemann algorithm, Journal of Symbolic Computation 33 (2002), 757–775. URL: https://hal.inria.fr/inria-00103417/document. Citations in this document: §1.4.

[69] Paul S Wang (editor) SYM-SAC rsquo81 proceedings of the 1981 ACM Symposium onSymbolic and Algebraic Computation Snowbird Utah August 5ndash7 1981 Associationfor Computing Machinery New York 1981 ISBN 0-89791-047-8 See [65]

[70] Moti Yung Yevgeniy Dodis Aggelos Kiayias Tal Malkin (editors) Public keycryptographymdash9th international conference on theory and practice in public-keycryptography New York NY USA April 24ndash26 2006 proceedings Lecture Notesin Computer Science 3958 Springer 2006 ISBN 978-3-540-33851-2 See [10]

A Proof of main gcd theorem for polynomials

We give two separate proofs of the facts stated earlier relating gcd and modular reciprocal to a particular iterate of divstep in the polynomial case.

• The second proof consists of a review of the Euclid–Stevin gcd algorithm (Appendix B), a description of exactly how the intermediate results in the Euclid–Stevin algorithm relate to iterates of divstep (Appendix C), and finally a translation of gcd/reciprocal results from the Euclid–Stevin context to the divstep context (Appendix D).

• The first proof, given in this appendix, is shorter: it works directly with divstep, without a detour through the Euclid–Stevin labeling of intermediate results.


Theorem A.1. Let k be a field. Let d be a positive integer. Let R_0, R_1 be elements of the polynomial ring k[x] with deg R_0 = d > deg R_1. Define f = x^d R_0(1/x), g = x^{d−1} R_1(1/x), and (δ_n, f_n, g_n) = divstep^n(1, f, g) for n ≥ 0. Then

2 deg f_n ≤ 2d − 1 − n + δ_n for n ≥ 0,
2 deg g_n ≤ 2d − 1 − n − δ_n for n ≥ 0.

Proof. Induct on n. If n = 0 then (δ_n, f_n, g_n) = (1, f, g), so 2 deg f_n = 2 deg f ≤ 2d = 2d − 1 − n + δ_n and 2 deg g_n = 2 deg g ≤ 2(d − 1) = 2d − 1 − n − δ_n.

Assume from now on that n ≥ 1. By the inductive hypothesis, 2 deg f_{n−1} ≤ 2d − n + δ_{n−1} and 2 deg g_{n−1} ≤ 2d − n − δ_{n−1}.

Case 1: δ_{n−1} ≤ 0. Then

(δ_n, f_n, g_n) = (1 + δ_{n−1}, f_{n−1}, (f_{n−1}(0)g_{n−1} − g_{n−1}(0)f_{n−1})/x).

Both f_{n−1} and g_{n−1} have degree at most (2d − n − δ_{n−1})/2, so the polynomial xg_n = f_{n−1}(0)g_{n−1} − g_{n−1}(0)f_{n−1} also has degree at most (2d − n − δ_{n−1})/2, so 2 deg g_n ≤ 2d − n − δ_{n−1} − 2 = 2d − 1 − n − δ_n. Also 2 deg f_n = 2 deg f_{n−1} ≤ 2d − 1 − n + δ_n.

Case 2: δ_{n−1} > 0 and g_{n−1}(0) = 0. Then

(δ_n, f_n, g_n) = (1 + δ_{n−1}, f_{n−1}, f_{n−1}(0)g_{n−1}/x),

so 2 deg g_n = 2 deg g_{n−1} − 2 ≤ 2d − n − δ_{n−1} − 2 = 2d − 1 − n − δ_n. As before, 2 deg f_n = 2 deg f_{n−1} ≤ 2d − 1 − n + δ_n.

Case 3: δ_{n−1} > 0 and g_{n−1}(0) ≠ 0. Then

(δ_n, f_n, g_n) = (1 − δ_{n−1}, g_{n−1}, (g_{n−1}(0)f_{n−1} − f_{n−1}(0)g_{n−1})/x).

This time the polynomial xg_n = g_{n−1}(0)f_{n−1} − f_{n−1}(0)g_{n−1} also has degree at most (2d − n + δ_{n−1})/2, so 2 deg g_n ≤ 2d − n + δ_{n−1} − 2 = 2d − 1 − n − δ_n. Also 2 deg f_n = 2 deg g_{n−1} ≤ 2d − n − δ_{n−1} = 2d − 1 − n + δ_n.

Theorem A.2. In the situation of Theorem A.1, define T_n = T(δ_n, f_n, g_n) and

$$\begin{pmatrix} u_n & v_n \\ q_n & r_n \end{pmatrix} = T_{n-1} \cdots T_0.$$

Then x^n(u_n r_n − v_n q_n) ∈ k^* for n ≥ 0; x^{n−1}u_n ∈ k[x] for n ≥ 1; x^{n−1}v_n, x^n q_n, x^n r_n ∈ k[x] for n ≥ 0; and

2 deg(x^{n−1}u_n) ≤ n − 3 + δ_n for n ≥ 1,
2 deg(x^{n−1}v_n) ≤ n − 1 + δ_n for n ≥ 0,
2 deg(x^n q_n) ≤ n − 1 − δ_n for n ≥ 0,
2 deg(x^n r_n) ≤ n + 1 − δ_n for n ≥ 0.

Proof. Each matrix T_n has determinant in (1/x)k^* by construction, so the product of n such matrices has determinant in (1/x^n)k^*. In particular u_n r_n − v_n q_n ∈ (1/x^n)k^*.

Induct on n. For n = 0 we have δ_n = 1, u_n = 1, v_n = 0, q_n = 0, r_n = 1, so x^{n−1}v_n = 0 ∈ k[x], x^n q_n = 0 ∈ k[x], x^n r_n = 1 ∈ k[x], 2 deg(x^{n−1}v_n) = −∞ ≤ n − 1 + δ_n, 2 deg(x^n q_n) = −∞ ≤ n − 1 − δ_n, and 2 deg(x^n r_n) = 0 ≤ n + 1 − δ_n.

Assume from now on that n ≥ 1. By Theorem 4.3, u_n, v_n ∈ (1/x^{n−1})k[x] and q_n, r_n ∈ (1/x^n)k[x], so all that remains is to prove the degree bounds.

Case 1: δ_{n−1} ≤ 0. Then δ_n = 1 + δ_{n−1}, u_n = u_{n−1}, v_n = v_{n−1}, q_n = (f_{n−1}(0)q_{n−1} − g_{n−1}(0)u_{n−1})/x, and r_n = (f_{n−1}(0)r_{n−1} − g_{n−1}(0)v_{n−1})/x.


Note that n > 1, since δ_0 > 0. By the inductive hypothesis, 2 deg(x^{n−2}u_{n−1}) ≤ n − 4 + δ_{n−1}, so 2 deg(x^{n−1}u_n) ≤ n − 2 + δ_{n−1} = n − 3 + δ_n. Also 2 deg(x^{n−2}v_{n−1}) ≤ n − 2 + δ_{n−1}, so 2 deg(x^{n−1}v_n) ≤ n + δ_{n−1} = n − 1 + δ_n.

For q_n, note that both x^{n−1}q_{n−1} and x^{n−1}u_{n−1} have degree at most (n − 2 − δ_{n−1})/2, so the same is true for their k-linear combination x^n q_n. Hence 2 deg(x^n q_n) ≤ n − 2 − δ_{n−1} = n − 1 − δ_n. Similarly 2 deg(x^n r_n) ≤ n − δ_{n−1} = n + 1 − δ_n.

Case 2: δ_{n−1} > 0 and g_{n−1}(0) = 0. Then δ_n = 1 + δ_{n−1}, u_n = u_{n−1}, v_n = v_{n−1}, q_n = f_{n−1}(0)q_{n−1}/x, and r_n = f_{n−1}(0)r_{n−1}/x.

If n = 1 then u_n = u_0 = 1, so 2 deg(x^{n−1}u_n) = 0 = n − 3 + δ_n. Otherwise 2 deg(x^{n−2}u_{n−1}) ≤ n − 4 + δ_{n−1} by the inductive hypothesis, so 2 deg(x^{n−1}u_n) ≤ n − 3 + δ_n as above.

Also 2 deg(x^{n−1}v_n) ≤ n − 1 + δ_n as above; 2 deg(x^n q_n) = 2 deg(x^{n−1}q_{n−1}) ≤ n − 2 − δ_{n−1} = n − 1 − δ_n; and 2 deg(x^n r_n) = 2 deg(x^{n−1}r_{n−1}) ≤ n − δ_{n−1} = n + 1 − δ_n.

Case 3: δ_{n−1} > 0 and g_{n−1}(0) ≠ 0. Then δ_n = 1 − δ_{n−1}, u_n = q_{n−1}, v_n = r_{n−1}, q_n = (g_{n−1}(0)u_{n−1} − f_{n−1}(0)q_{n−1})/x, and r_n = (g_{n−1}(0)v_{n−1} − f_{n−1}(0)r_{n−1})/x.

If n = 1 then u_n = q_0 = 0, so 2 deg(x^{n−1}u_n) = −∞ ≤ n − 3 + δ_n. Otherwise 2 deg(x^{n−1}q_{n−1}) ≤ n − 2 − δ_{n−1} by the inductive hypothesis, so 2 deg(x^{n−1}u_n) ≤ n − 2 − δ_{n−1} = n − 3 + δ_n. Similarly 2 deg(x^{n−1}r_{n−1}) ≤ n − δ_{n−1} by the inductive hypothesis, so 2 deg(x^{n−1}v_n) ≤ n − δ_{n−1} = n − 1 + δ_n.

Finally, for q_n, note that both x^{n−1}u_{n−1} and x^{n−1}q_{n−1} have degree at most (n − 2 + δ_{n−1})/2, so the same is true for their k-linear combination x^n q_n, so 2 deg(x^n q_n) ≤ n − 2 + δ_{n−1} = n − 1 − δ_n. Similarly 2 deg(x^n r_n) ≤ n + δ_{n−1} = n + 1 − δ_n.

First proof of Theorem 6.2. Write ϕ = f_{2d−1}. By Theorem A.1, 2 deg ϕ ≤ δ_{2d−1} and 2 deg g_{2d−1} ≤ −δ_{2d−1}. Each f_n is nonzero by construction, so deg ϕ ≥ 0, so δ_{2d−1} ≥ 0.

By Theorem A.2, x^{2d−2}u_{2d−1} is a polynomial of degree at most d − 2 + δ_{2d−1}/2. Define C_0 = x^{−d+δ_{2d−1}/2}u_{2d−1}(1/x); then C_0 is a polynomial of degree at most d − 2 + δ_{2d−1}/2. Similarly define C_1 = x^{−d+1+δ_{2d−1}/2}v_{2d−1}(1/x), and observe that C_1 is a polynomial of degree at most d − 1 + δ_{2d−1}/2.

Multiply C_0 by R_0 = x^d f(1/x), multiply C_1 by R_1 = x^{d−1}g(1/x), and add, to obtain

C_0R_0 + C_1R_1 = x^{δ_{2d−1}/2}(u_{2d−1}(1/x)f(1/x) + v_{2d−1}(1/x)g(1/x)) = x^{δ_{2d−1}/2}ϕ(1/x)

by Theorem 4.2. Define Γ as the monic polynomial x^{δ_{2d−1}/2}ϕ(1/x)/ϕ(0) of degree δ_{2d−1}/2. We have just shown that Γ ∈ R_0k[x] + R_1k[x]: specifically, Γ = (C_0/ϕ(0))R_0 + (C_1/ϕ(0))R_1. To finish the proof we will show that R_0 ∈ Γk[x]. This forces Γ = gcd{R_0, R_1} = G, which in turn implies deg G = δ_{2d−1}/2 and V = C_1/ϕ(0).

Observe that δ_{2d} = 1 + δ_{2d−1} and f_{2d} = ϕ; the "swapped" case is impossible here, since δ_{2d−1} > 0 implies g_{2d−1} = 0. By Theorem A.1 again, 2 deg g_{2d} ≤ −1 − δ_{2d} = −2 − δ_{2d−1}, so g_{2d} = 0.

By Theorem 4.2, ϕ = f_{2d} = u_{2d}f + v_{2d}g and 0 = g_{2d} = q_{2d}f + r_{2d}g, so x^{2d}r_{2d}ϕ = ∆f, where ∆ = x^{2d}(u_{2d}r_{2d} − v_{2d}q_{2d}). By Theorem A.2, ∆ ∈ k^*, and p = x^{2d}r_{2d} is a polynomial of degree at most (2d + 1 − δ_{2d})/2 = d − δ_{2d−1}/2. The polynomials x^{d−δ_{2d−1}/2}p(1/x) and x^{δ_{2d−1}/2}ϕ(1/x) = ϕ(0)Γ have product ∆x^d f(1/x) = ∆R_0, so Γ divides R_0 as claimed.

B Review of the Euclid–Stevin algorithm

This appendix reviews the Euclid–Stevin algorithm to compute polynomial gcd, including the extended version of the algorithm that writes each remainder as a linear combination of the two inputs.


Theorem B.1 (the Euclid–Stevin algorithm). Let k be a field. Let R_0, R_1 be elements of the polynomial ring k[x]. Recursively define a finite sequence of polynomials R_2, R_3, …, R_r ∈ k[x] as follows: if i ≥ 1 and R_i = 0 then r = i; if i ≥ 1 and R_i ≠ 0 then R_{i+1} = R_{i−1} mod R_i. Define d_i = deg R_i and ℓ_i = R_i[d_i]. Then

• d_1 > d_2 > ⋯ > d_r = −∞;
• ℓ_i ≠ 0 if 1 ≤ i ≤ r − 1;
• if ℓ_{r−1} ≠ 0 then gcd{R_0, R_1} = R_{r−1}/ℓ_{r−1}; and
• if ℓ_{r−1} = 0 then R_0 = R_1 = gcd{R_0, R_1} = 0 and r = 1.

Proof. If 1 ≤ i ≤ r − 1 then R_i ≠ 0, so ℓ_i = R_i[deg R_i] ≠ 0. Also R_{i+1} = R_{i−1} mod R_i, so deg R_{i+1} < deg R_i, i.e., d_{i+1} < d_i. Hence d_1 > d_2 > ⋯ > d_r, and d_r = deg R_r = deg 0 = −∞.

If ℓ_{r−1} = 0 then r must be 1, so R_1 = 0 (since R_r = 0) and R_0 = 0 (since ℓ_0 = 0), so gcd{R_0, R_1} = 0. Assume from now on that ℓ_{r−1} ≠ 0. The divisibility of R_{i+1} − R_{i−1} by R_i implies gcd{R_{i−1}, R_i} = gcd{R_i, R_{i+1}}, so gcd{R_0, R_1} = gcd{R_1, R_2} = ⋯ = gcd{R_{r−1}, R_r} = gcd{R_{r−1}, 0} = R_{r−1}/ℓ_{r−1}.

Theorem B.2 (the extended Euclid–Stevin algorithm). In the situation of Theorem B.1, define U_0, V_0, …, U_r, V_r ∈ k[x] as follows: U_0 = 1; V_0 = 0; U_1 = 0; V_1 = 1; U_{i+1} = U_{i−1} − ⌊R_{i−1}/R_i⌋U_i for i ∈ {1, …, r − 1}; V_{i+1} = V_{i−1} − ⌊R_{i−1}/R_i⌋V_i for i ∈ {1, …, r − 1}. Then

• R_i = U_iR_0 + V_iR_1 for i ∈ {0, …, r};
• U_iV_{i+1} − U_{i+1}V_i = (−1)^i for i ∈ {0, …, r − 1};
• V_{i+1}R_i − V_iR_{i+1} = (−1)^iR_0 for i ∈ {0, …, r − 1}; and
• if d_0 > d_1 then deg V_{i+1} = d_0 − d_i for i ∈ {0, …, r − 1}.

Proof. U_0R_0 + V_0R_1 = 1·R_0 + 0·R_1 = R_0, and U_1R_0 + V_1R_1 = 0·R_0 + 1·R_1 = R_1. For i ∈ {1, …, r − 1}, assume inductively that R_{i−1} = U_{i−1}R_0 + V_{i−1}R_1 and R_i = U_iR_0 + V_iR_1, and write Q_i = ⌊R_{i−1}/R_i⌋. Then R_{i+1} = R_{i−1} mod R_i = R_{i−1} − Q_iR_i = (U_{i−1} − Q_iU_i)R_0 + (V_{i−1} − Q_iV_i)R_1 = U_{i+1}R_0 + V_{i+1}R_1.

If i = 0 then U_iV_{i+1} − U_{i+1}V_i = 1·1 − 0·0 = 1 = (−1)^i. For i ∈ {1, …, r − 1}, assume inductively that U_{i−1}V_i − U_iV_{i−1} = (−1)^{i−1}; then U_iV_{i+1} − U_{i+1}V_i = U_i(V_{i−1} − Q_iV_i) − (U_{i−1} − Q_iU_i)V_i = −(U_{i−1}V_i − U_iV_{i−1}) = (−1)^i.

Multiply R_i = U_iR_0 + V_iR_1 by V_{i+1}, multiply R_{i+1} = U_{i+1}R_0 + V_{i+1}R_1 by V_i, and subtract, to see that V_{i+1}R_i − V_iR_{i+1} = (−1)^iR_0.

Finally, assume d_0 > d_1. If i = 0 then deg V_{i+1} = deg 1 = 0 = d_0 − d_i. For i ∈ {1, …, r − 1}, assume inductively that deg V_i = d_0 − d_{i−1}. Then deg V_iR_{i+1} < d_0, since deg R_{i+1} = d_{i+1} < d_{i−1}, so deg V_{i+1}R_i = deg(V_iR_{i+1} + (−1)^iR_0) = d_0, so deg V_{i+1} = d_0 − d_i.

C Euclid–Stevin results as iterates of divstep

If the Euclid–Stevin algorithm is written as a sequence of polynomial divisions, and each polynomial division is written as a sequence of coefficient-elimination steps, then each coefficient-elimination step corresponds to one application of divstep. The exact correspondence is stated in Theorem C.2. This generalizes the known Berlekamp–Massey equivalence mentioned in Section 1.

Theorem C.1 is a warmup, showing how a single polynomial division relates to a sequence of division steps.


Theorem C.1 (polynomial division as a sequence of division steps). Let k be a field. Let R_0, R_1 be elements of the polynomial ring k[x]. Write d_0 = deg R_0 and d_1 = deg R_1. Assume that d_0 > d_1 ≥ 0. Then

divstep^{2d_0−2d_1}(1, x^{d_0}R_0(1/x), x^{d_0−1}R_1(1/x)) = (1, ℓ_0^{d_0−d_1−1}x^{d_1}R_1(1/x), (ℓ_0^{d_0−d_1−1}ℓ_1)^{d_0−d_1+1}x^{d_1−1}R_2(1/x)),

where ℓ_0 = R_0[d_0], ℓ_1 = R_1[d_1], and R_2 = R_0 mod R_1.

Proof Part 1 The first d0 minus d1 minus 1 iterations If δ isin 1 2 d0 minus d1 minus 1 thend0minusδ gt d1 so R1[d0minusδ] = 0 so the polynomial `δminus1

0 xd0minusδR1(1x) has constant coefficient0 The polynomial xd0R0(1x) has constant coefficient `0 Hence

divstep(δ xd0R0(1x) `δminus10 xd0minusδR1(1x)) = (1 + δ xd0R0(1x) `δ0xd0minusδminus1R1(1x))

by definition of divstep By induction

divstepd0minusd1minus1(1 xd0R0(1x) xd0minus1R1(1x))= (d0 minus d1 x

d0R0(1x) `d0minusd1minus10 xd1R1(1x))

= (d0 minus d1 xd0R0(1x) xd1Rprime1(1x))

where Rprime1 = `d0minusd1minus10 R1 Note that Rprime1[d1] = `prime1 where `prime1 = `d0minusd1minus1

0 `1

Part 2 The swap iteration and the last d0 minus d1 iterations Define

Zd0minusd1 = R0 mod xd0minusd1Rprime1Zd0minusd1+1 = R0 mod xd0minusd1minus1Rprime1

Z2d0minus2d1 = R0 mod Rprime1 = R0 mod R1 = R2

ThenZd0minusd1 = R0 minus (R0[d0]`prime1)xd0minusd1Rprime1

Zd0minusd1+1 = Zd0minusd1 minus (Zd0minusd1 [d0 minus 1]`prime1)xd0minusd1minus1Rprime1

Z2d0minus2d1 = Z2d0minus2d1minus1 minus (Z2d0minus2d1minus1[d1]`prime1)Rprime1

Substitute 1x for x to see that

xd0Zd0minusd1(1x) = xd0R0(1x)minus (R0[d0]`prime1)xd1Rprime1(1x)xd0minus1Zd0minusd1+1(1x) = xd0minus1Zd0minusd1(1x)minus (Zd0minusd1 [d0 minus 1]`prime1)xd1Rprime1(1x)

xd1Z2d0minus2d1(1x) = xd1Z2d0minus2d1minus1(1x)minus (Z2d0minus2d1minus1[d1]`prime1)xd1Rprime1(1x)

We will use these equations to show that the d0 minus d1 2d0 minus 2d1 iterates of divstepcompute `prime1xd0minus1Zd0minusd1 (`prime1)d0minusd1+1xd1minus1Z2d0minus2d1 respectively

The constant coefficients of xd0R0(1x) and xd1Rprime1(1x) are R0[d0] and `prime1 6= 0 respec-tively and `prime1xd0R0(1x)minusR0[d0]xd1Rprime1(1x) = `prime1x

d0Zd0minusd1(1x) so

divstep(d0 minus d1 xd0R0(1x) xd1Rprime1(1x))

= (1minus (d0 minus d1) xd1Rprime1(1x) `prime1xd0minus1Zd0minusd1(1x))


The constant coefficients of xd1Rprime1(1x) and `prime1xd0minus1Zd0minusd1(1x) are `prime1 and `prime1Zd0minusd1 [d0minus1]respectively and

(`prime1)2xd0minus1Zd0minusd1(1x)minus `prime1Zd0minusd1 [d0 minus 1]xd1Rprime1(1x) = (`prime1)2xd0minus1Zd0minusd1+1(1x)

sodivstep(1minus (d0 minus d1) xd1Rprime1(1x) `prime1xd0minus1Zd0minusd1(1x))

= (2minus (d0 minus d1) xd1Rprime1(1x) (`prime1)2xd0minus2Zd0minusd1+1(1x))

Continue in this way to see that

divstepi(d0 minus d1 xd0R0(1x) xd1Rprime1(1x))

= (iminus (d0 minus d1) xd1Rprime1(1x) (`prime1)ixd0minusiZd0minusd1+iminus1(1x))

for 1 le i le d0 minus d1 + 1 In particular

divstep2d0minus2d1(1 xd0R0(1x) xd0minus1R1(1x))= divstepd0minusd1+1(d0 minus d1 x

d0R0(1x) xd1Rprime1(1x))= (1 xd1Rprime1(1x) (`prime1)d0minusd1+1xd1minus1R2(1x))= (1 `d0minusd1minus1

0 xd1R1(1x) (`d0minusd1minus10 `1)d0minusd1+1xd1minus1R2(1x))

as claimed

Theorem C.2. In the situation of Theorem B.2, assume that d_0 > d_1. Define f = x^{d_0}R_0(1/x) and g = x^{d_0−1}R_1(1/x). Then f, g ∈ k[x]. Define (δ_n, f_n, g_n) = divstep^n(1, f, g) and T_n = T(δ_n, f_n, g_n).

Part A. If i ∈ {0, 1, …, r − 2} and 2d_0 − 2d_i ≤ n < 2d_0 − d_i − d_{i+1}, then

δ_n = 1 + n − (2d_0 − 2d_i) > 0,
f_n = a_n x^{d_i} R_i(1/x),
g_n = b_n x^{2d_0−d_i−n−1} R_{i+1}(1/x),

$$T_{n-1}\cdots T_0 = \begin{pmatrix} a_n(1/x)^{d_0-d_i}U_i(1/x) & a_n(1/x)^{d_0-d_i-1}V_i(1/x) \\ b_n(1/x)^{n-(d_0-d_i)+1}U_{i+1}(1/x) & b_n(1/x)^{n-(d_0-d_i)}V_{i+1}(1/x) \end{pmatrix}$$

for some a_n, b_n ∈ k^*.

Part B. If i ∈ {0, 1, …, r − 2} and 2d_0 − d_i − d_{i+1} ≤ n < 2d_0 − 2d_{i+1}, then

δ_n = 1 + n − (2d_0 − 2d_{i+1}) ≤ 0,
f_n = a_n x^{d_{i+1}} R_{i+1}(1/x),
g_n = b_n x^{2d_0−d_{i+1}−n−1} Z_n(1/x),

$$T_{n-1}\cdots T_0 = \begin{pmatrix} a_n(1/x)^{d_0-d_{i+1}}U_{i+1}(1/x) & a_n(1/x)^{d_0-d_{i+1}-1}V_{i+1}(1/x) \\ b_n(1/x)^{n-(d_0-d_{i+1})+1}X_n(1/x) & b_n(1/x)^{n-(d_0-d_{i+1})}Y_n(1/x) \end{pmatrix}$$

for some a_n, b_n ∈ k^*, where

Z_n = R_i mod x^{2d_0−2d_{i+1}−n}R_{i+1},
X_n = U_i − ⌊R_i/(x^{2d_0−2d_{i+1}−n}R_{i+1})⌋ x^{2d_0−2d_{i+1}−n}U_{i+1},
Y_n = V_i − ⌊R_i/(x^{2d_0−2d_{i+1}−n}R_{i+1})⌋ x^{2d_0−2d_{i+1}−n}V_{i+1}.

Part C. If n ≥ 2d_0 − 2d_{r−1}, then

δ_n = 1 + n − (2d_0 − 2d_{r−1}) > 0,
f_n = a_n x^{d_{r−1}} R_{r−1}(1/x),
g_n = 0,

$$T_{n-1}\cdots T_0 = \begin{pmatrix} a_n(1/x)^{d_0-d_{r-1}}U_{r-1}(1/x) & a_n(1/x)^{d_0-d_{r-1}-1}V_{r-1}(1/x) \\ b_n(1/x)^{n-(d_0-d_{r-1})+1}U_r(1/x) & b_n(1/x)^{n-(d_0-d_{r-1})}V_r(1/x) \end{pmatrix}$$

for some a_n, b_n ∈ k^*.

The proof also gives recursive formulas for a_n, b_n: namely (a_0, b_0) = (1, 1) for the base case, (a_n, b_n) = (b_{n−1}, g_{n−1}(0)a_{n−1}) for the swapped case, and (a_n, b_n) = (a_{n−1}, f_{n−1}(0)b_{n−1}) for the unswapped case. One can, if desired, augment divstep to track a_n and b_n; these are the denominators eliminated in a "fraction-free" computation.

Proof By hypothesis degR0 = d0 gt d1 = degR1 Hence d0 is a nonnegative integer andf = R0[d0] +R0[d0 minus 1]x+ middot middot middot+R0[0]xd0 isin k[x] Similarly g = R1[d0 minus 1] +R1[d0 minus 2]x+middot middot middot+R1[0]xd0minus1 isin k[x] this is also true for d0 = 0 with R1 = 0 and g = 0

We induct on n and split the analysis of n into six cases There are three differentformulas (A B C) applicable to different ranges of n and within each range we considerthe first n separately For brevity we write Ri = Ri(1x) U i = Ui(1x) V i = Vi(1x)Xn = Xn(1x) Y n = Yn(1x) Zn = Zn(1x)

Case A1 n = 2d0 minus 2di for some i isin 0 1 r minus 2If n = 0 then i = 0 By assumption δn = 1 as claimed fn = f = xd0R0 so fn = anx

d0R0as claimed where an = 1 gn = g = xd0minus1R1 so gn = bnx

d0minus1R1 as claimed where bn = 1Also Ui = 1 Ui+1 = 0 Vi = 0 Vi+1 = 1 and d0 minus di = 0 so(

an(1x)d0minusdiU i an(1x)d0minusdiminus1V ibn(1x)d0minusdi+1U i+1 bn(1x)d0minusdiV i+1

)=(

1 00 1

)= Tnminus1 middot middot middot T0

as claimedOtherwise n gt 0 so i gt 0 Now 2d0minusdiminus1minusdi le nminus 1 lt 2d0minus 2di so by the inductive

hypothesis δnminus1 = 1 + (nminus 1)minus (2d0 minus 2di) = 0 fnminus1 = anminus1xdiRi for some anminus1 isin klowast

gnminus1 = bnminus1xdiZnminus1 for some bnminus1 isin klowast where Znminus1 = Riminus1 mod xRi and

Tnminus2 middot middot middot T0 =(

anminus1(1x)d0minusdiU i anminus1(1x)d0minusdiminus1V ibnminus1(1x)d0minusdiXnminus1 bnminus1(1x)d0minusdiminus1Y nminus1

)where Xn = Uiminus1 minus bRiminus1xRicxUi and Yn = Viminus1 minus bRiminus1xRicxVi

Note that degZnminus1 le di Observe that

Ri+1 = Riminus1 mod Ri = Znminus1 minus (Znminus1[di]`i)RiUi+1 = Xnminus1 minus (Znminus1[di]`i)UiVi+1 = Ynminus1 minus (Znminus1[di]`i)Vi

The point is that subtracting (Znminus1[di]`i)Ri from Znminus1 eliminates the coefficient of xdi

and produces a polynomial of degree at most diminus 1 namely Znminus1 mod Ri = Riminus1 mod RiStarting from Ri+1 = Znminus1 minus (Znminus1[di]`i)Ri substitute 1x for x and multiply by

anminus1bnminus1`ixdi to see that

anminus1bnminus1`ixdiRi+1 = anminus1`ignminus1 minus bnminus1Znminus1[di]fnminus1

= fnminus1(0)gnminus1 minus gnminus1(0)fnminus1


Now(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1)

= (1 + δnminus1 fnminus1 (fnminus1(0)gnminus1 minus gnminus1(0)fnminus1)x)

Hence δn = 1 = 1 + nminus (2d0 minus 2di) gt 0 as claimed fn = anxdiRi where an = anminus1 isin klowast

as claimed and gn = bnxdiminus1Ri+1 where bn = anminus1bnminus1`i isin klowast as claimed

Finally U i+1 = Xnminus1 minus (Znminus1[di]`i)U i and V i+1 = Y nminus1 minus (Znminus1[di]`i)V i Substi-tute these into the matrix(

an(1x)d0minusdiU i an(1x)d0minusdiminus1V ibn(1x)d0minusdi+1U i+1 bn(1x)d0minusdiV i+1

)

along with an = anminus1 and bn = anminus1bnminus1`i to see that this matrix is

Tnminus1 =(

1 0minusbnminus1Znminus1[di]x anminus1`ix

)times Tnminus2 middot middot middot T0 shown above

Case A2 2d0 minus 2di lt n lt 2d0 minus di minus di+1 for some i isin 0 1 r minus 2By the inductive hypothesis δnminus1 = n minus (2d0 minus 2di) gt 0 fnminus1 = anminus1x

diRi whereanminus1 isin klowast gnminus1 = bnminus1x

2d0minusdiminusnRi+1 where bnminus1 isin klowast and

Tnminus2 middot middot middot T0 =(

anminus1(1x)d0minusdiU i anminus1(1x)d0minusdiminus1V ibnminus1(1x)nminus(d0minusdi)U i+1 bnminus1(1x)nminus(d0minusdi)minus1V i+1

)

Note that gnminus1(0) = bnminus1Ri+1[2d0 minus di minus n] = 0 since 2d0 minus di minus n gt di+1 = degRi+1Thus

(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1) = (1 + δnminus1 fnminus1 fnminus1(0)gnminus1x)

Hence δn = 1 + n minus (2d0 minus 2di) gt 0 as claimed fn = anxdiRi where an = anminus1 isin klowast

as claimed and gn = bnx2d0minusdiminusnminus1Ri+1 where bn = fnminus1(0)bnminus1 = anminus1bnminus1`i isin klowast as

claimedAlso Tnminus1 =

(1 00 anminus1`ix

) Multiply by Tnminus2 middot middot middot T0 above to see that

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdiU i an(1x)d0minusdiminus1V ibn(1x)nminus(d0minusdi)+1U i+1 bn`i(1x)nminus(d0minusdi)V i+1

)as claimed

Case B1 n = 2d0 minus di minus di+1 for some i isin 0 1 r minus 2Note that 2d0 minus 2di le nminus 1 lt 2d0 minus di minus di+1 By the inductive hypothesis δnminus1 =

diminusdi+1 gt 0 fnminus1 = anminus1xdiRi where anminus1 isin klowast gnminus1 = bnminus1x

di+1Ri+1 where bnminus1 isin klowastand

Tnminus2 middot middot middot T0 =(

anminus1(1x)d0minusdiU i anminus1(1x)d0minusdiminus1V ibnminus1(1x)d0minusdi+1U i+1 bnminus1`i(1x)d0minusdi+1minus1V i+1

)

Observe thatZn = Ri minus (`i`i+1)xdiminusdi+1Ri+1

Xn = Ui minus (`i`i+1)xdiminusdi+1Ui+1

Yn = Vi minus (`i`i+1)xdiminusdi+1Vi+1

Substitute 1x for x in the Zn equation and multiply by anminus1bnminus1`i+1xdi to see that

anminus1bnminus1`i+1xdiZn = bnminus1`i+1fnminus1 minus anminus1`ignminus1

= gnminus1(0)fnminus1 minus fnminus1(0)gnminus1


Also note that gnminus1(0) = bnminus1`i+1 6= 0 so

(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1)= (1minus δnminus1 gnminus1 (gnminus1(0)fnminus1 minus fnminus1(0)gnminus1)x)

Hence δn = 1 minus (di minus di+1) le 0 as claimed fn = anxdi+1Ri+1 where an = bnminus1 isin klowast as

claimed and gn = bnxdiminus1Zn where bn = anminus1bnminus1`i+1 isin klowast as claimed

Finally substitute Xn = U i minus (`i`i+1)(1x)diminusdi+1U i+1 and substitute Y n = V i minus(`i`i+1)(1x)diminusdi+1V i+1 into the matrix(

an(1x)d0minusdi+1U i+1 an(1x)d0minusdi+1minus1V i+1bn(1x)d0minusdi+1Xn bn(1x)d0minusdiY n

)

along with an = bnminus1 and bn = anminus1bnminus1`i+1 to see that this matrix is Tnminus1 =(0 1

bnminus1`i+1x minusanminus1`ix

)times Tnminus2 middot middot middot T0 shown above

Case B2 2d0 minus di minus di+1 lt n lt 2d0 minus 2di+1 for some i isin 0 1 r minus 2Now 2d0minusdiminusdi+1 le nminus1 lt 2d0minus2di+1 By the inductive hypothesis δnminus1 = nminus(2d0minus

2di+1) le 0 fnminus1 = anminus1xdi+1Ri+1 for some anminus1 isin klowast gnminus1 = bnminus1x

2d0minusdi+1minusnZnminus1 forsome bnminus1 isin klowast where Znminus1 = Ri mod x2d0minus2di+1minusn+1Ri+1 and

Tnminus2 middot middot middot T0 =(

anminus1(1x)d0minusdi+1U i+1 anminus1(1x)d0minusdi+1minus1V i+1bnminus1(1x)nminus(d0minusdi+1)Xnminus1 bnminus1(1x)nminus(d0minusdi+1)minus1Y nminus1

)where Xnminus1 = Ui minus

lfloorRix

2d0minus2di+1minusn+1Ri+1rfloorx2d0minus2di+1minusn+1Ui+1 and Ynminus1 = Vi minuslfloor

Rix2d0minus2di+1minusn+1Ri+1

rfloorx2d0minus2di+1minusn+1Vi+1

Observe that

Zn = Ri mod x2d0minus2di+1minusnRi+1

= Znminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnRi+1

Xn = Xnminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnUi+1

Yn = Ynminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnVi+1

The point is that degZnminus1 le 2d0 minus di+1 minus n and subtracting

(Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnRi+1

from Znminus1 eliminates the coefficient of x2d0minusdi+1minusnStarting from

Zn = Znminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnRi+1

substitute 1x for x and multiply by anminus1bnminus1`i+1x2d0minusdi+1minusn to see that

anminus1bnminus1`i+1x2d0minusdi+1minusnZn = anminus1`i+1gnminus1 minus bnminus1Znminus1[2d0 minus di+1 minus n]fnminus1

= fnminus1(0)gnminus1 minus gnminus1(0)fnminus1

Now(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1)

= (1 + δnminus1 fnminus1 (fnminus1(0)gnminus1 minus gnminus1(0)fnminus1)x)

Hence δn = 1 +nminus (2d0minus 2di+1) le 0 as claimed fn = anxdi+1Ri+1 where an = anminus1 isin klowast

as claimed and gn = bnx2d0minusdi+1minusnminus1Zn where bn = anminus1bnminus1`i+1 isin klowast as claimed


Finally substitute

Xn = Xnminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)(1x)2d0minus2di+1minusnU i+1

andY n = Y nminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)(1x)2d0minus2di+1minusnV i+1

into the matrix (an(1x)d0minusdi+1U i+1 an(1x)d0minusdi+1minus1V i+1

bn(1x)nminus(d0minusdi+1)+1Xn bn(1x)nminus(d0minusdi+1)Y n

)

along with an = anminus1 and bn = anminus1bnminus1`i+1 to see that this matrix is Tnminus1 =(1 0

minusbnminus1Znminus1[2d0 minus di+1 minus n]x anminus1`i+1x

)times Tnminus2 middot middot middot T0 shown above

Case C1 n = 2d0 minus 2drminus1 This is essentially the same as Case A1 except that i isreplaced with r minus 1 and gn ends up as 0 For completeness we spell out the details

If n = 0 then r = 1 and R1 = 0 so g = 0 By assumption δn = 1 as claimedfn = f = xd0R0 so fn = anx

d0R0 as claimed where an = 1 gn = g = 0 as claimed Definebn = 1 Now Urminus1 = 1 Ur = 0 Vrminus1 = 0 Vr = 1 and d0 minus drminus1 = 0 so(

an(1x)d0minusdrminus1Urminus1 an(1x)d0minusdrminus1minus1V rminus1bn(1x)d0minusdrminus1+1Ur bn(1x)d0minusdrminus1V r

)=(

1 00 1

)= Tnminus1 middot middot middot T0

as claimedOtherwise n gt 0 so r ge 2 Now 2d0 minus drminus2 minus drminus1 le n minus 1 lt 2d0 minus 2drminus1 so

by the inductive hypothesis (for i = r minus 2) δnminus1 = 1 + (n minus 1) minus (2d0 minus 2drminus1) = 0fnminus1 = anminus1x

drminus1Rrminus1 for some anminus1 isin klowast gnminus1 = bnminus1xdrminus1Znminus1 for some bnminus1 isin klowast

where Znminus1 = Rrminus2 mod xRrminus1 and

Tnminus2 middot middot middot T0 =(anminus1(1x)d0minusdrminus1Urminus1 anminus1(1x)d0minusdrminus1minus1V rminus1bnminus1(1x)d0minusdrminus1Xnminus1 bnminus1(1x)d0minusdrminus1minus1Y nminus1

)where Xn = Urminus2 minus bRrminus2xRrminus1cxUrminus1 and Yn = Vrminus2 minus bRrminus2xRrminus1cxVrminus1

Observe that

0 = Rr = Rrminus2 mod Rrminus1 = Znminus1 minus (Znminus1[drminus1]`rminus1)Rrminus1

Ur = Xnminus1 minus (Znminus1[drminus1]`rminus1)Urminus1

Vr = Ynminus1 minus (Znminus1[drminus1]`rminus1)Vrminus1

Starting from Znminus1 minus (Znminus1[drminus1]`rminus1)Rrminus1 = 0 substitute 1x for x and multiplyby anminus1bnminus1`rminus1x

drminus1 to see that

anminus1`rminus1gnminus1 minus bnminus1Znminus1[drminus1]fnminus1 = 0

ie fnminus1(0)gnminus1 minus gnminus1(0)fnminus1 = 0 Now

(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1) = (1 + δnminus1 fnminus1 0)

Hence δn = 1 = 1 + n minus (2d0 minus 2drminus1) gt 0 as claimed fn = anxdrminus1Rrminus1 where

an = anminus1 isin klowast as claimed and gn = 0 as claimedDefine bn = anminus1bnminus1`rminus1 isin klowast and substitute Ur = Xnminus1 minus (Znminus1[drminus1]`rminus1)Urminus1

and V r = Y nminus1 minus (Znminus1[drminus1]`rminus1)V rminus1 into the matrix(an(1x)d0minusdrminus1Urminus1(1x) an(1x)d0minusdrminus1minus1Vrminus1(1x)bn(1x)d0minusdrminus1+1Ur(1x) bn(1x)d0minusdrminus1Vr(1x)

)


to see that this matrix is Tnminus1 =(

1 0minusbnminus1Znminus1[drminus1]x anminus1`rminus1x

)times Tnminus2 middot middot middot T0

shown above

Case C2 n gt 2d0 minus 2drminus1By the inductive hypothesis δnminus1 = n minus (2d0 minus 2drminus1) fnminus1 = anminus1x

drminus1Rrminus1 forsome anminus1 isin klowast gnminus1 = 0 and

Tnminus2 middot middot middot T0 =(anminus1(1x)d0minusdrminus1Urminus1(1x) anminus1(1x)d0minusdrminus1minus1Vrminus1(1x)bnminus1(1x)nminus(d0minusdrminus1)Ur(1x) bnminus1(1x)nminus(d0minusdrminus1)minus1Vr(1x)

)for some bnminus1 isin klowast Hence δn = 1 + δn = 1 + n minus (2d0 minus 2drminus1) gt 0 fn = fnminus1 =

anxdrminus1Rrminus1 where an = anminus1 and gn = 0 Also Tnminus1 =

(1 00 anminus1`rminus1x

)so

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdrminus1Urminus1(1x) an(1x)d0minusdrminus1minus1Vrminus1(1x)bn(1x)nminus(d0minusdrminus1)+1Ur(1x) bn(1x)nminus(d0minusdrminus1)Vr(1x)

)where bn = anminus1bnminus1`rminus1 isin klowast

D Alternative proof of main gcd theorem for polynomials

This appendix gives another proof of Theorem 6.2, using the relationship in Theorem C.2 between the Euclid–Stevin algorithm and iterates of divstep.

Second proof of Theorem 6.2. Define R_2, R_3, …, R_r and d_i and ℓ_i as in Theorem B.1, and define U_i and V_i as in Theorem B.2. Then d_0 = d > d_1, and all of the hypotheses of Theorems B.1, B.2, and C.2 are satisfied.

By Theorem B1 G = gcdR0 R1 = Rrminus1`rminus1 (since R0 6= 0) so degG = drminus1Also by Theorem B2

Urminus1R0 + Vrminus1R1 = Rrminus1 = `rminus1G

so (Vrminus1`rminus1)R1 equiv G (mod R0) and deg Vrminus1 = d0 minus drminus2 lt d0 minus drminus1 = dminus degG soV = Vrminus1`rminus1

Case 1 drminus1 ge 1 Part C of Theorem C2 applies to n = 2dminus 1 since n = 2d0 minus 1 ge2d0 minus 2drminus1 In particular δn = 1 + n minus (2d0 minus 2drminus1) = 2drminus1 = 2 degG f2dminus1 =a2dminus1x

drminus1Rrminus1(1x) and v2dminus1 = a2dminus1(1x)d0minusdrminus1minus1Vrminus1(1x)Case 2 drminus1 = 0 Then r ge 2 and drminus2 ge 1 Part B of Theorem C2 applies to

i = r minus 2 and n = 2dminus 1 = 2d0 minus 1 since 2d0 minus di minus di+1 = 2d0 minus drminus2 le 2d0 minus 1 = n lt2d0 = 2d0 minus 2di+1 In particular δn = 1 + n minus (2d0 minus 2di+1) = 2drminus1 = 2 degG againf2dminus1 = a2dminus1x

drminus1Rrminus1(1x) and again v2dminus1 = a2dminus1(1x)d0minusdrminus1minus1Vrminus1(1x)In both cases f2dminus1(0) = a2dminus1`rminus1 so xdegGf2dminus1(1x)f2dminus1(0) = Rrminus1`rminus1 = G

and xminusd+1+degGv2dminus1(1x)f2dminus1(0) = Vrminus1`rminus1 = V

E   A gcd algorithm for integers

This appendix presents a variable-time algorithm to compute the gcd of two integers. This appendix also explains how a modification to the algorithm produces the Stehlé–Zimmermann algorithm [62].

Theorem E.1. Let e be a positive integer. Let f be an element of Z*_2. Let g be an element of 2^e Z*_2. Then there is a unique element q ∈ {1, 3, 5, ..., 2^{e+1} − 1} such that qg/2^e − f ∈ 2^{e+1}Z_2.

We define f div_2 g = q/2^e ∈ Q, and we define f mod_2 g = f − qg/2^e ∈ 2^{e+1}Z_2. Then f = (f div_2 g)g + (f mod_2 g). Note that e = ord_2 g, so we are not creating any ambiguity by omitting e from the f div_2 g and f mod_2 g notation.

Proof. First note that the quotient f/(g/2^e) is odd. Define q = (f/(g/2^e)) mod 2^{e+1}; then q ∈ {1, 3, 5, ..., 2^{e+1} − 1}, and qg/2^e − f ∈ 2^{e+1}Z_2 by construction. To see uniqueness, note that if q′ ∈ {1, 3, 5, ..., 2^{e+1} − 1} also satisfies q′g/2^e − f ∈ 2^{e+1}Z_2, then (q − q′)g/2^e ∈ 2^{e+1}Z_2, so q − q′ ∈ 2^{e+1}Z_2 since g/2^e ∈ Z*_2, so q mod 2^{e+1} = q′ mod 2^{e+1}, so q = q′.
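To make these 2-adic quantities concrete for integer inputs, here is a small Python sketch (our illustration; the helper names ord2 and div2_mod2 are ours, not the paper's software) computing q, and hence f div_2 g = q/2^e and f mod_2 g:

def ord2(x):
    # 2-adic valuation of a nonzero integer
    e = 0
    while x % 2 == 0:
        x //= 2
        e += 1
    return e

def div2_mod2(f, g):
    # f odd, g nonzero and even; returns (q, r) with q in {1, 3, ..., 2^(e+1)-1},
    # f div_2 g = q / 2^e, and f mod_2 g = r = f - q*(g >> e), divisible by 2^(e+1).
    assert f % 2 == 1 and g != 0 and g % 2 == 0
    e = ord2(g)
    g0 = g >> e                                            # odd part of g
    q = (f * pow(g0, -1, 2 ** (e + 1))) % 2 ** (e + 1)     # requires Python 3.8+ for pow(..., -1, m)
    r = f - q * g0
    assert r % 2 ** (e + 1) == 0
    return q, r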

Theorem E.2. Let R_0 be a nonzero element of Z_2. Let R_1 be an element of 2Z_2. Recursively define e_0, e_1, e_2, e_3, ... ∈ Z and R_2, R_3, ... ∈ 2Z_2 as follows:

• If i ≥ 0 and R_i ≠ 0 then e_i = ord_2 R_i.
• If i ≥ 1 and R_i ≠ 0 then R_{i+1} = −((R_{i−1}/2^{e_{i−1}}) mod_2 R_i)/2^{e_i} ∈ 2Z_2.
• If i ≥ 1 and R_i = 0 then e_i, R_{i+1}, e_{i+1}, R_{i+2}, e_{i+2}, ... are undefined.

Then

[ R_i/2^{e_i} ]   [ 0              1/2^{e_i}                                ] [ R_{i−1}/2^{e_{i−1}} ]
[ R_{i+1}     ] = [ −1/2^{e_i}     ((R_{i−1}/2^{e_{i−1}}) div_2 R_i)/2^{e_i} ] [ R_i                 ]

for each i ≥ 1 where R_{i+1} is defined.

Proof. We have (R_{i−1}/2^{e_{i−1}}) mod_2 R_i = R_{i−1}/2^{e_{i−1}} − ((R_{i−1}/2^{e_{i−1}}) div_2 R_i)R_i. Negate and divide by 2^{e_i} to obtain the matrix equation.

Theorem E.3. Let R_0 be an odd element of Z. Let R_1 be an element of 2Z. Define e_0, e_1, e_2, e_3, ... and R_2, R_3, ... as in Theorem E.2. Then R_i ∈ 2Z for each i ≥ 1 where R_i is defined. Furthermore, if t ≥ 0 and R_{t+1} = 0, then |R_t/2^{e_t}| = gcd{R_0, R_1}.

Proof. Write g = gcd{R_0, R_1} ∈ Z. Then g is odd since R_0 is odd.

We first show by induction on i that R_i ∈ gZ for each i ≥ 0. For i ∈ {0, 1} this follows from the definition of g. For i ≥ 2: we have R_{i−2} ∈ gZ and R_{i−1} ∈ gZ by the inductive hypothesis. Now (R_{i−2}/2^{e_{i−2}}) div_2 R_{i−1} has the form q/2^{e_{i−1}} for q ∈ Z by definition of div_2, and 2^{2e_{i−1}+e_{i−2}}R_i = 2^{e_{i−2}}qR_{i−1} − 2^{e_{i−1}}R_{i−2} by definition of R_i. The right side of this equation is in gZ, so 2^{2e_{i−1}+e_{i−2}}R_i is in gZ. This implies first that R_i ∈ Q; but also R_i ∈ Z_2; so R_i ∈ Z. This also implies that R_i ∈ gZ, since g is odd.

By construction R_i ∈ 2Z_2 for each i ≥ 1, and R_i ∈ Z for each i ≥ 0, so R_i ∈ 2Z for each i ≥ 1.

Finally, assume that t ≥ 0 and R_{t+1} = 0. Then all of R_0, R_1, ..., R_t are nonzero. Write g′ = |R_t/2^{e_t}|. Then 2^{e_t}g′ = |R_t| ∈ gZ and g is odd, so g′ ∈ gZ.

The equation 2^{e_{i−1}}R_{i−2} = 2^{e_{i−2}}qR_{i−1} − 2^{2e_{i−1}+e_{i−2}}R_i shows that if R_{i−1}, R_i ∈ g′Z then R_{i−2} ∈ g′Z, since g′ is odd. By construction R_t, R_{t+1} ∈ g′Z, so R_{t−1}, R_{t−2}, ..., R_1, R_0 ∈ g′Z. Hence g = gcd{R_0, R_1} ∈ g′Z. Both g and g′ are positive, so g′ = g.

E.4. The gcd algorithm. Here is the algorithm featured in this appendix.

Given an odd integer R_0 and an even integer R_1, compute the even integers R_2, R_3, ... defined above. In other words, for each i ≥ 1 where R_i ≠ 0: compute e_i = ord_2 R_i; find q ∈ {1, 3, 5, ..., 2^{e_i+1} − 1} such that qR_i/2^{e_i} − R_{i−1}/2^{e_{i−1}} ∈ 2^{e_i+1}Z; and compute R_{i+1} = (qR_i/2^{e_i} − R_{i−1}/2^{e_{i−1}})/2^{e_i} ∈ 2Z. If R_{t+1} = 0 for some t ≥ 0, output |R_t/2^{e_t}| and stop.

We will show in Theorem F.26 that this algorithm terminates. The output is gcd{R_0, R_1} by Theorem E.3. This algorithm is closely related to iterating divstep, and we will use this relationship to analyze iterates of divstep; see Appendix G.
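For illustration, here is a short Python sketch (ours, not the paper's software) of the algorithm just described, operating directly on Python integers:

def gcd_appendix_E(R0, R1):
    # Variable-time gcd of an odd R0 and an even R1, following the description above.
    assert R0 % 2 == 1 and R1 % 2 == 0
    prev, prev_e = R0, 0            # R_{i-1} and e_{i-1}; e_0 = 0 since R0 is odd
    cur = R1                        # R_i
    while cur != 0:
        e = (cur & -cur).bit_length() - 1              # e_i = ord_2(R_i)
        odd = cur >> e                                 # R_i / 2^{e_i}, an odd integer
        m = 2 ** (e + 1)
        q = ((prev >> prev_e) * pow(odd, -1, m)) % m   # q in {1, 3, ..., 2^{e_i+1} - 1}
        nxt = (q * odd - (prev >> prev_e)) >> e        # R_{i+1}, an even integer
        prev, prev_e, cur = cur, e, nxt
    return abs(prev >> prev_e)      # |R_t / 2^{e_t}| = gcd(R0, R1)

# example: gcd_appendix_E(135, 90) returns 45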

More generally, if R_0 is an odd integer and R_1 is any integer, one can compute gcd{R_0, R_1} by using the above algorithm to compute gcd{R_0, 2R_1}. Even more generally, one can compute the gcd of any two integers by first handling the case gcd{0, 0} = 0, then handling powers of 2 in both inputs, then applying the odd-input algorithm.

E.5. A non-functioning variant. Imagine replacing the subtractions above with additions: find q ∈ {1, 3, 5, ..., 2^{e_i+1} − 1} such that qR_i/2^{e_i} + R_{i−1}/2^{e_{i−1}} ∈ 2^{e_i+1}Z, and compute R_{i+1} = (qR_i/2^{e_i} + R_{i−1}/2^{e_{i−1}})/2^{e_i} ∈ 2Z. If R_0 and R_1 are positive then all R_i are positive, so the algorithm does not terminate. This non-terminating algorithm is related to posdivstep (see Section 8.4) in the same way that the algorithm above is related to divstep.

Daireaux, Maume-Deschamps, and Vallée in [31, Section 6] mentioned this algorithm as a "non-centered LSB algorithm", said that one can easily add a "supplementary stopping condition" to ensure termination, and dismissed the resulting algorithm as being "certainly slower than the centered version".

E.6. A centered variant: the Stehlé–Zimmermann algorithm. Imagine using centered remainders {1 − 2^e, ..., −3, −1, 1, 3, ..., 2^e − 1} modulo 2^{e+1}, rather than uncentered remainders {1, 3, 5, ..., 2^{e+1} − 1}. The above gcd algorithm then turns into the Stehlé–Zimmermann gcd algorithm from [62].

Specifically, Stehlé and Zimmermann begin with integers a, b having ord_2 a < ord_2 b. If b ≠ 0, then "generalized binary division" in [62, Lemma 1] defines q as the unique odd integer with |q| < 2^{ord_2 b − ord_2 a} such that r = a + qb/2^{ord_2 b − ord_2 a} is divisible by 2^{ord_2 b + 1}. The gcd algorithm in [62, Algorithm Elementary-GB] repeatedly sets (a, b) ← (b, r). When b = 0, the algorithm outputs the odd part of a.

It is equivalent to set (a, b) ← (b/2^{ord_2 b}, r/2^{ord_2 b}), since this simply divides all subsequent results by 2^{ord_2 b}. In other words, starting from an odd a_0 and an even b_0, recursively define an odd a_{i+1} and an even b_{i+1} by the following formulas, stopping when b_i = 0:

• a_{i+1} = b_i/2^e where e = ord_2 b_i;
• b_{i+1} = (a_i + qb_i/2^e)/2^e ∈ 2Z for a unique q ∈ {1 − 2^e, ..., −3, −1, 1, 3, ..., 2^e − 1}.

Now relabel a_0, b_0, b_1, b_2, b_3, ... as R_0, R_1, R_2, R_3, R_4, ... to obtain the following recursive definition: R_{i+1} = (qR_i/2^{e_i} + R_{i−1}/2^{e_{i−1}})/2^{e_i} ∈ 2Z, and thus

[ R_i/2^{e_i} ]   [ 0             1/2^{e_i}   ] [ R_{i−1}/2^{e_{i−1}} ]
[ R_{i+1}     ] = [ 1/2^{e_i}     q/2^{2e_i}  ] [ R_i                 ]

for a unique q ∈ {1 − 2^{e_i}, ..., −3, −1, 1, 3, ..., 2^{e_i} − 1}.

To summarize, starting from the Stehlé–Zimmermann algorithm, one can obtain the gcd algorithm featured in this appendix by making the following three changes. First, divide out 2^{e_i}, instead of allowing powers of 2 to accumulate. Second, subtract R_{i−1}/2^{e_{i−1}} instead of adding it; this does not affect the number of iterations for centered q. Third, switch from centered q to uncentered q.
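For comparison with the uncentered sketch above, one step of the (relabeled) centered variant can be written as follows; this is our own illustrative Python sketch, not code from [62]:

def centered_step(a, b):
    # One step of the relabeled Stehle-Zimmermann iteration: returns
    # (a', b') = (b / 2^e, (a + q*b/2^e) / 2^e) with e = ord_2(b) and
    # q odd, centered: q in {1-2^e, ..., -3, -1, 1, 3, ..., 2^e - 1}.
    assert a % 2 == 1 and b % 2 == 0 and b != 0
    e = (b & -b).bit_length() - 1
    odd = b >> e
    m = 2 ** (e + 1)
    q = (-a * pow(odd, -1, m)) % m      # uncentered odd representative
    if q > 2 ** e:                      # move to the centered range
        q -= m
    return odd, (a + q * odd) >> e

Iterating (a, b) ← centered_step(a, b) until b = 0 and taking |a| then reproduces, up to the rescaling by powers of 2 described above, the Stehlé–Zimmermann output.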

Intuitively, centered remainders are smaller than uncentered remainders, and it is well known that centered remainders improve the worst-case performance of Euclid's algorithm, so one might expect that the Stehlé–Zimmermann algorithm performs better than our uncentered variant. However, as noted earlier, our analysis produces the opposite conclusion. See Appendix F.

F   Performance of the gcd algorithm for integers

The possible matrices in Theorem E.2 are

[ 0      1/2 ]   [ 0      1/2 ]
[ −1/2   1/4 ] , [ −1/2   3/4 ]      for e_i = 1;

[ 0      1/4  ]   [ 0      1/4  ]   [ 0      1/4  ]   [ 0      1/4  ]
[ −1/4   1/16 ] , [ −1/4   3/16 ] , [ −1/4   5/16 ] , [ −1/4   7/16 ]      for e_i = 2;

etc. In this appendix we show that a product of these matrices shrinks the input by a factor exponential in the weight Σe_i. This implies that Σe_i in the algorithm of Appendix E is O(b), where b is the number of input bits.

F.1. Lower bounds. It is easy to see that each of these matrices has two eigenvalues of absolute value 1/2^{e_i}. This does not imply, however, that a weight-w product of these matrices shrinks the input by a factor 1/2^w. Concretely, the weight-7 product

(1/4^7) [ 328    −380 ]   [ 0      1/4  ] [ 0      1/2 ]^2 [ 0      1/2 ] [ 0      1/2 ]^2
        [ −158    233 ] = [ −1/4   1/16 ] [ −1/2   3/4 ]   [ −1/2   1/4 ] [ −1/2   3/4 ]

of 6 matrices has spectral radius (maximum absolute value of eigenvalues)

(561 + √249185)/2^15 = 0.0323542582... = (0.6125380391...)^7 = 1/2^{4.9499005861...}.

The (w/7)th power of this matrix is expressed as a weight-w product and has spectral radius 1/2^{0.7071286551...w}. Consequently this proof strategy cannot obtain an O constant better than 7/(15 − log_2(561 + √249185)) = 1.4141698154... We prove a bound twice the weight on the number of divstep iterations (see Appendix G), so this proof strategy cannot obtain an O constant better than 14/(15 − log_2(561 + √249185)) = 2.8283396308... for the number of divstep iterations.

Rotating the list of 6 matrices shown above gives 6 weight-7 products with the same spectral radius. In a small search we did not find any weight-w matrices with spectral radius above (0.6125380391...)^w. Our upper bounds (see below) are (0.6181640625)^w for all large w.

For comparison, the non-terminating variant in Appendix E.5 allows the matrix

[ 0     1/2 ]
[ 1/2   3/4 ],

which has spectral radius 1, with eigenvector (1, 2). The Stehlé–Zimmermann algorithm reviewed in Appendix E.6 allows the matrix

[ 0     1/2 ]
[ 1/2   1/4 ],

which has spectral radius (1 + √17)/8 = 0.6403882032..., so this proof strategy cannot obtain an O constant better than 2/(log_2(√17 − 1) − 1) = 3.1105100627... for the number of cdivstep iterations.

F.2. A warning about worst-case inputs. Define M as the matrix

4^{−7} [ 328    −380 ]
       [ −158    233 ]

shown above. Then

M^{−3} = [ 60321097    113368060 ]
         [ 47137246     88663112 ].

Applying our gcd algorithm to inputs (60321097, 47137246) applies a series of 18 matrices of total weight 21: in the notation of Theorem E.2, the matrix (0, 1/2; −1/2, 3/4) twice, then the matrix (0, 1/2; −1/2, 1/4), then the matrix (0, 1/2; −1/2, 3/4) twice, then the weight-2 matrix (0, 1/4; −1/4, 1/16); all of these matrices again; and all of these matrices a third time.

One might think that these are worst-case inputs to our algorithm, analogous to a standard matrix construction of Fibonacci numbers as worst-case inputs to Euclid's algorithm. We emphasize that these are not worst-case inputs: the inputs (22293, −128330) are much smaller and nevertheless require 19 steps, of total weight 23. We now explain why the analogy fails.

The matrix M^3 has an eigenvector (1, −0.53...) with eigenvalue 1/2^{21·0.707...} = 1/2^{14.8...}, and an eigenvector (1, 0.78...) with eigenvalue 1/2^{21·1.29...} = 1/2^{27.1...}, so it has spectral radius around 1/2^{14.8}. Our proof strategy thus cannot guarantee a gain of more than 14.8 bits from weight 21. What we actually show below is that weight 21 is guaranteed to gain more than 13.97 bits. (The quantity α_21 in Figure F.14 is 133569/2^31 = 2^{−13.972...}.)

The matrix M^{−3} has spectral radius 2^{27.1...}. It is not surprising that each column of M^{−3} is close to 27 bits in size: the only way to avoid this would be for the other eigenvector to be pointing in almost exactly the same direction as (1, 0) or (0, 1). For our inputs, taken from the first column of M^{−3}, the algorithm gains nearly 27 bits from weight 21, almost double the guaranteed gain. In short, these are good inputs, not bad inputs.

Now replace M with the matrix F = (0, 1; 1, −1), which reduces worst-case inputs to Euclid's algorithm. The eigenvalues of F are (√5 − 1)/2 = 1/2^{0.69...} and −(√5 + 1)/2 = −2^{0.69...}. The entries of powers of F^{−1} are Fibonacci numbers, again with sizes well predicted by the spectral radius of F^{−1}, namely 2^{0.69...}. However, the spectral radius of F is larger than 1. The reason that this large eigenvalue does not spoil the termination of Euclid's algorithm is that Euclid's algorithm inspects sizes, and chooses quotients to decrease sizes at each step.

Stehlé and Zimmermann in [62, Section 6.2] measure the performance of their algorithm for "worst case" inputs^18 obtained from negative powers of the matrix (0, 1/2; 1/2, 1/4). The statement that these inputs are "worst case" is not justified. On the contrary, if these inputs have b bits then they take (0.73690... + o(1))b steps of the Stehlé–Zimmermann algorithm, and thus (1.4738... + o(1))b cdivsteps, which is considerably fewer steps than many other inputs that we have tried^19 for various values of b. The quantity 0.73690... here is log(2)/log((√17 + 1)/2), coming from the smaller eigenvalue −2/(√17 + 1) = −0.39038... of the above matrix.

F.3. Norms. We use the standard definitions of the 2-norm of a real vector and the 2-norm of a real matrix. We summarize the basic theory of 2-norms here to keep this paper self-contained.

Definition F.4. If v ∈ R^2 then |v|_2 is defined as √(v_1^2 + v_2^2), where v = (v_1, v_2).

Definition F.5. If P ∈ M_2(R) then |P|_2 is defined as max{|Pv|_2 : v ∈ R^2, |v|_2 = 1}.

This maximum exists because {v ∈ R^2 : |v|_2 = 1} is a compact set (namely the unit circle) and |Pv|_2 is a continuous function of v. See also Theorem F.11 for an explicit formula for |P|_2.

Beware that the literature sometimes uses the notation |P|_2 for the entrywise 2-norm √(P_11^2 + P_12^2 + P_21^2 + P_22^2). We do not use entrywise norms of matrices in this paper.

Theorem F.6. Let v be an element of Z^2. If v ≠ 0 then |v|_2 ≥ 1.

Proof. Write v as (v_1, v_2) with v_1, v_2 ∈ Z. If v_1 ≠ 0 then v_1^2 ≥ 1, so v_1^2 + v_2^2 ≥ 1. If v_2 ≠ 0 then v_2^2 ≥ 1, so v_1^2 + v_2^2 ≥ 1. Either way |v|_2 = √(v_1^2 + v_2^2) ≥ 1.

Theorem F.7. Let v be an element of R^2. Let r be an element of R. Then |rv|_2 = |r||v|_2.

Proof. Write v as (v_1, v_2) with v_1, v_2 ∈ R. Then rv = (rv_1, rv_2), so |rv|_2 = √(r^2v_1^2 + r^2v_2^2) = √(r^2)√(v_1^2 + v_2^2) = |r||v|_2.

^18 Specifically, Stehlé and Zimmermann claim that "the worst case of the binary variant" (i.e., of their gcd algorithm) is for inputs "G_n and 2G_{n−1}, where G_0 = 0, G_1 = 1, G_n = −G_{n−1} + 4G_{n−2}, which gives all binary quotients equal to 1". Daireaux, Maume-Deschamps, and Vallée in [31, Section 2.3] state that Stehlé and Zimmermann "exhibited the precise worst-case number of iterations" and give an equivalent characterization of this case.

^19 For example, "G_n" from [62] is −5467345 for n = 18 and 2135149 for n = 17. The inputs (−5467345, 2·2135149) use 17 iterations of the "GB" divisions from [62], while the smaller inputs (−1287513, 2·622123) use 24 "GB" iterations. The inputs (1, −5467345, 2135149) use 34 cdivsteps; the inputs (1, −2718281, 3141592) use 45 cdivsteps; the inputs (1, −1287513, 622123) use 58 cdivsteps.


Theorem F.8. Let P be an element of M_2(R). Let r be an element of R. Then |rP|_2 = |r||P|_2.

Proof. By definition |rP|_2 = max{|rPv|_2 : v ∈ R^2, |v|_2 = 1}. By Theorem F.7 this is the same as max{|r||Pv|_2} = |r| max{|Pv|_2} = |r||P|_2.

Theorem F.9. Let P be an element of M_2(R). Let v be an element of R^2. Then |Pv|_2 ≤ |P|_2|v|_2.

Proof. If v = 0 then Pv = 0 and |Pv|_2 = 0 = |P|_2|v|_2. Otherwise |v|_2 ≠ 0. Write w = v/|v|_2. Then |w|_2 = 1, so |Pw|_2 ≤ |P|_2, so |Pv|_2 = |Pw|_2|v|_2 ≤ |P|_2|v|_2.

Theorem F.10. Let P, Q be elements of M_2(R). Then |PQ|_2 ≤ |P|_2|Q|_2.

Proof. If v ∈ R^2 and |v|_2 = 1 then |PQv|_2 ≤ |P|_2|Qv|_2 ≤ |P|_2|Q|_2|v|_2.

Theorem F.11. Let P be an element of M_2(R). Assume that P*P = (a, b; c, d), where P* is the transpose of P. Then |P|_2 = √((a + d + √((a − d)^2 + 4b^2))/2).

Proof. P*P ∈ M_2(R) is its own transpose, so by the spectral theorem it has two orthonormal eigenvectors with real eigenvalues. In other words, R^2 has a basis e_1, e_2 such that

• |e_1|_2 = 1;
• |e_2|_2 = 1;
• the dot product e_1*e_2 is 0 (here we view vectors as column matrices);
• P*Pe_1 = λ_1e_1 for some λ_1 ∈ R; and
• P*Pe_2 = λ_2e_2 for some λ_2 ∈ R.

Any v ∈ R^2 with |v|_2 = 1 can be written uniquely as c_1e_1 + c_2e_2 for some c_1, c_2 ∈ R with c_1^2 + c_2^2 = 1; explicitly, c_1 = v*e_1 and c_2 = v*e_2. Then P*Pv = λ_1c_1e_1 + λ_2c_2e_2.

By definition |P|_2^2 is the maximum of |Pv|_2^2 over v ∈ R^2 with |v|_2 = 1. Note that |Pv|_2^2 = (Pv)*(Pv) = v*P*Pv = λ_1c_1^2 + λ_2c_2^2, with c_1, c_2 defined as above. For example, 0 ≤ |Pe_1|_2^2 = λ_1 and 0 ≤ |Pe_2|_2^2 = λ_2. Thus |Pv|_2^2 achieves max{λ_1, λ_2}. It cannot be larger than this: if c_1^2 + c_2^2 = 1 then λ_1c_1^2 + λ_2c_2^2 ≤ max{λ_1, λ_2}. Hence |P|_2^2 = max{λ_1, λ_2}.

To finish the proof, we make the spectral theorem explicit for the matrix P*P = (a, b; c, d) ∈ M_2(R). Note that (P*P)* = P*P, so b = c. The characteristic polynomial of this matrix is (x − a)(x − d) − b^2 = x^2 − (a + d)x + ad − b^2, with roots

(a + d ± √((a + d)^2 − 4(ad − b^2)))/2 = (a + d ± √((a − d)^2 + 4b^2))/2.

The largest eigenvalue of P*P is thus (a + d + √((a − d)^2 + 4b^2))/2, and |P|_2 is the square root of this.

Theorem F.12. Let P be an element of M_2(R). Assume that P*P = (a, b; c, d), where P* is the transpose of P. Let N be an element of R. Then N ≥ |P|_2 if and only if N ≥ 0, 2N^2 ≥ a + d, and (2N^2 − a − d)^2 ≥ (a − d)^2 + 4b^2.


earlybounds = [...]    # table of rational constants alpha_0, alpha_1, ... for small w; exact values omitted

def alpha(w):
    if w >= 67: return (633/1024)^w
    return earlybounds[w]

assert all(alpha(w)^49 < 2^(-(34*w-23)) for w in range(31, 100))
assert min((633/1024)^w/alpha(w) for w in range(68)) == 633^5/(2^30*165219)

Figure F.14: Definition of α_w for w ∈ {0, 1, 2, ...}; computer verification that α_w < 2^{−(34w−23)/49} for 31 ≤ w ≤ 99; and computer verification that min{(633/1024)^w/α_w : 0 ≤ w ≤ 67} = 633^5/(2^30·165219). If w ≥ 67 then α_w = (633/1024)^w. (The table of constants earlybounds, giving α_w for small w, is abbreviated here.)

Our upper-bound computations work with matrices with rational entries. We use Theorem F.12 to compare the 2-norms of these matrices to various rational bounds N. All of the computations here use exact arithmetic, avoiding the correctness questions that would have shown up if we had used approximations to the square roots in Theorem F.11. Sage includes exact representations of algebraic numbers such as |P|_2, but switching to Theorem F.12 made our upper-bound computations an order of magnitude faster.

Proof. Assume that N ≥ 0, that 2N^2 ≥ a + d, and that (2N^2 − a − d)^2 ≥ (a − d)^2 + 4b^2. Then 2N^2 − a − d ≥ √((a − d)^2 + 4b^2) since 2N^2 − a − d ≥ 0, so N^2 ≥ (a + d + √((a − d)^2 + 4b^2))/2, so N ≥ √((a + d + √((a − d)^2 + 4b^2))/2) since N ≥ 0, so N ≥ |P|_2 by Theorem F.11.

Conversely, assume that N ≥ |P|_2. Then N ≥ 0. Also N^2 ≥ |P|_2^2, so 2N^2 ≥ 2|P|_2^2 = a + d + √((a − d)^2 + 4b^2) by Theorem F.11. In particular 2N^2 ≥ a + d, and (2N^2 − a − d)^2 ≥ (a − d)^2 + 4b^2.

F.13. Upper bounds. We now show that every weight-w product of the matrices in Theorem E.2 has norm at most α_w. Here α_0, α_1, α_2, ... is an exponentially decreasing sequence of positive real numbers defined in Figure F.14.

Definition F.15. For e ∈ {1, 2, ...} and q ∈ {1, 3, 5, ..., 2^{e+1} − 1}, define

M(e, q) = [ 0          1/2^e     ]
          [ −1/2^e     q/2^{2e}  ].


Theorem F.16. |M(e, q)|_2 < (1 + √2)/2^e if e ∈ {1, 2, ...} and q ∈ {1, 3, 5, ..., 2^{e+1} − 1}.

Proof. Define (a, b; c, d) = M(e, q)*M(e, q). Then a = 1/2^{2e}, b = c = −q/2^{3e}, and d = 1/2^{2e} + q^2/2^{4e}, so a + d = 2/2^{2e} + q^2/2^{4e} < 6/2^{2e}, and (a − d)^2 + 4b^2 = 4q^2/2^{6e} + q^4/2^{8e} ≤ 32/2^{4e}. By Theorem F.11, |M(e, q)|_2 = √((a + d + √((a − d)^2 + 4b^2))/2) ≤ √((6/2^{2e} + √32/2^{2e})/2) = (1 + √2)/2^e.

Definition F.17. Define β_w = inf{α_{w+j}/α_j : j ≥ 0} for w ∈ {0, 1, 2, ...}.

Theorem F.18. If w ∈ {0, 1, 2, ...} then β_w = min{α_{w+j}/α_j : 0 ≤ j ≤ 67}. If w ≥ 67 then β_w = (633/1024)^w · 633^5/(2^30·165219).

The first statement reduces the computation of β_w to a finite computation. The computation can be further optimized, but is not a bottleneck for us. Similar comments apply to γ_{w,e} in Theorem F.20.

Proof. By definition α_w = (633/1024)^w for all w ≥ 67. If j ≥ 67 then also w + j ≥ 67, so α_{w+j}/α_j = (633/1024)^{w+j}/(633/1024)^j = (633/1024)^w. In short, all terms α_{w+j}/α_j for j ≥ 67 are identical, so the infimum of α_{w+j}/α_j for j ≥ 0 is the same as the minimum for 0 ≤ j ≤ 67.

If w ≥ 67 then α_{w+j}/α_j = (633/1024)^{w+j}/α_j. The quotient (633/1024)^j/α_j for 0 ≤ j ≤ 67 has minimum value 633^5/(2^30·165219).

Definition F.19. Define γ_{w,e} = inf{β_{w+j}·2^j·70/169 : j ≥ e} for w ∈ {0, 1, 2, ...} and e ∈ {1, 2, 3, ...}.

This quantity γ_{w,e} is designed to be as large as possible subject to two conditions: first, γ_{w,e} ≤ β_{w+e}·2^e·70/169, used in Theorem F.21; second, γ_{w,e} ≤ γ_{w,e+1}, used in Theorem F.22.

Theorem F.20. γ_{w,e} = min{β_{w+j}·2^j·70/169 : e ≤ j ≤ e + 67} if w ∈ {0, 1, 2, ...} and e ∈ {1, 2, 3, ...}. If w + e ≥ 67 then γ_{w,e} = 2^e(633/1024)^{w+e}(70/169)·633^5/(2^30·165219).

Proof. If w + j ≥ 67 then β_{w+j} = (633/1024)^{w+j}·633^5/(2^30·165219), so β_{w+j}·2^j·70/169 = (633/1024)^w(633/512)^j(70/169)·633^5/(2^30·165219), which increases monotonically with j.

In particular, if j ≥ e + 67 then w + j ≥ 67, so β_{w+j}·2^j·70/169 ≥ β_{w+e+67}·2^{e+67}·70/169. Hence γ_{w,e} = inf{β_{w+j}·2^j·70/169 : j ≥ e} = min{β_{w+j}·2^j·70/169 : e ≤ j ≤ e + 67}. Also, if w + e ≥ 67 then w + j ≥ 67 for all j ≥ e, so

γ_{w,e} = (633/1024)^w (633/512)^e (70/169) · 633^5/(2^30·165219) = 2^e (633/1024)^{w+e} (70/169) · 633^5/(2^30·165219)

as claimed.

Theorem F.21. Define S as the smallest subset of Z × M_2(Q) such that

• (0, (1, 0; 0, 1)) ∈ S, and
• if (w, P) ∈ S and (w = 0 or |P|_2 > β_w) and e ∈ {1, 2, 3, ...} and |P|_2 > γ_{w,e} and q ∈ {1, 3, 5, ..., 2^{e+1} − 1}, then (e + w, M(e, q)P) ∈ S.

Assume that |P|_2 ≤ α_w for each (w, P) ∈ S. Then |M(e_j, q_j) ··· M(e_1, q_1)|_2 ≤ α_{e_1+···+e_j} for all j ∈ {0, 1, 2, ...}, all e_1, ..., e_j ∈ {1, 2, 3, ...}, and all q_1, ..., q_j ∈ Z such that q_i ∈ {1, 3, 5, ..., 2^{e_i+1} − 1} for each i.


The idea here is that instead of searching through all products P of matrices M(e, q) of total weight w to see whether |P|_2 ≤ α_w, we prune the search in a way that makes the search finite across all w (see Theorem F.22), while still being sure to find a minimal counterexample if one exists. The first layer of pruning skips all descendants QP of P if w > 0 and |P|_2 ≤ β_w; the point is that any such counterexample QP implies a smaller counterexample Q. The second layer of pruning skips weight-(w + e) children M(e, q)P of P if |P|_2 ≤ γ_{w,e}; the point is that any such child is eliminated by the first layer of pruning.

Proof. Induct on j.

If j = 0 then M(e_j, q_j) ··· M(e_1, q_1) = (1, 0; 0, 1), with norm 1 = α_0. Assume from now on that j ≥ 1.

Case 1: |M(e_i, q_i) ··· M(e_1, q_1)|_2 ≤ β_{e_1+···+e_i} for some i ∈ {1, 2, ..., j}.

Then j − i < j, so |M(e_j, q_j) ··· M(e_{i+1}, q_{i+1})|_2 ≤ α_{e_{i+1}+···+e_j} by the inductive hypothesis, so |M(e_j, q_j) ··· M(e_1, q_1)|_2 ≤ β_{e_1+···+e_i}·α_{e_{i+1}+···+e_j} by Theorem F.10; but β_{e_1+···+e_i}·α_{e_{i+1}+···+e_j} ≤ α_{e_1+···+e_j} by definition of β.

Case 2: |M(e_i, q_i) ··· M(e_1, q_1)|_2 > β_{e_1+···+e_i} for every i ∈ {1, 2, ..., j}.

Suppose that |M(e_{i−1}, q_{i−1}) ··· M(e_1, q_1)|_2 ≤ γ_{e_1+···+e_{i−1},e_i} for some i ∈ {1, 2, ..., j}. By Theorem F.16, |M(e_i, q_i)|_2 < (1 + √2)/2^{e_i} < (169/70)/2^{e_i}. By Theorem F.10,

|M(e_i, q_i) ··· M(e_1, q_1)|_2 ≤ |M(e_i, q_i)|_2 · γ_{e_1+···+e_{i−1},e_i} ≤ 169γ_{e_1+···+e_{i−1},e_i}/(70·2^{e_i}) ≤ β_{e_1+···+e_i},

contradiction. Thus |M(e_{i−1}, q_{i−1}) ··· M(e_1, q_1)|_2 > γ_{e_1+···+e_{i−1},e_i} for all i ∈ {1, 2, ..., j}.

Now (e_1 + ··· + e_i, M(e_i, q_i) ··· M(e_1, q_1)) ∈ S for each i ∈ {0, 1, 2, ..., j}. Indeed, induct on i. The base case i = 0 is simply (0, (1, 0; 0, 1)) ∈ S. If i ≥ 1 then (w, P) ∈ S by the inductive hypothesis, where w = e_1 + ··· + e_{i−1} and P = M(e_{i−1}, q_{i−1}) ··· M(e_1, q_1). Furthermore:

• w = 0 (when i = 1) or |P|_2 > β_w (when i ≥ 2, by definition of this case);
• e_i ∈ {1, 2, 3, ...};
• |P|_2 > γ_{w,e_i} (as shown above); and
• q_i ∈ {1, 3, 5, ..., 2^{e_i+1} − 1}.

Hence (e_1 + ··· + e_i, M(e_i, q_i) ··· M(e_1, q_1)) = (e_i + w, M(e_i, q_i)P) ∈ S by definition of S.

In particular (w, P) ∈ S with w = e_1 + ··· + e_j and P = M(e_j, q_j) ··· M(e_1, q_1), so |P|_2 ≤ α_w by hypothesis.

Theorem F.22. In Theorem F.21, S is finite, and |P|_2 ≤ α_w for each (w, P) ∈ S.

Proof. This is a computer proof. We run the Sage script in Figure F.23. We observe that it terminates successfully (in slightly over 2546 minutes using Sage 8.6 on one core of a 3.5GHz Intel Xeon E3-1275 v3 CPU) and prints 3787975117.

What the script does is search depth-first through the elements of S, verifying that each (w, P) ∈ S satisfies |P|_2 ≤ α_w. The output is the number of elements searched, so S has at most 3787975117 elements.

Specifically, if (w, P) ∈ S, then verify(w,P) checks that |P|_2 ≤ α_w and recursively calls verify on all descendants of (w, P). Actually, for efficiency, the script scales each matrix to have integer entries: it works with 2^{2e}M(e, q) (labeled scaledM(e,q) in the script) instead of M(e, q), and it works with 4^w P (labeled P in the script) instead of P.

The script computes β_w as shown in Theorem F.18, computes γ_{w,e} as shown in Theorem F.20, and compares |P|_2 to various bounds as shown in Theorem F.12.

If w > 0 and |P|_2 ≤ β_w then verify(w,P) returns. There are no descendants of (w, P) in this case. Also |P|_2 ≤ α_w, since β_w ≤ α_w/α_0 = α_w.


from alpha import alpha
from memoized import memoized

R = MatrixSpace(ZZ, 2)
def scaledM(e, q): return R((0, 2^e, -2^e, q))

@memoized
def beta(w):
    return min(alpha(w + j)/alpha(j) for j in range(68))

@memoized
def gamma(w, e):
    return min(beta(w + j)*2^j*70/169 for j in range(e, e + 68))

def spectralradiusisatmost(PP, N):   # assuming PP has form P.transpose()*P
    (a, b), (c, d) = PP
    X = 2*N^2 - a - d
    return N >= 0 and X >= 0 and X^2 >= (a - d)^2 + 4*b^2

def verify(w, P):
    nodes = 1
    PP = P.transpose()*P
    if w > 0 and spectralradiusisatmost(PP, 4^w*beta(w)): return nodes
    assert spectralradiusisatmost(PP, 4^w*alpha(w))
    for e in PositiveIntegers():
        if spectralradiusisatmost(PP, 4^w*gamma(w, e)): return nodes
        for q in range(1, 2^(e + 1), 2):
            nodes += verify(e + w, scaledM(e, q)*P)

print(verify(0, R(1)))

Figure F.23: Sage script verifying |P|_2 ≤ α_w for each (w, P) ∈ S. Output is the number of pairs (w, P) considered, if verification succeeds, or an assertion failure, if verification fails.

Otherwise verify(w,P) asserts that |P|_2 ≤ α_w, i.e., it checks that |P|_2 ≤ α_w and raises an exception if not. It then tries successively e = 1, e = 2, e = 3, etc., continuing as long as |P|_2 > γ_{w,e}. For each e where |P|_2 > γ_{w,e}, verify(w,P) recursively handles (e + w, M(e, q)P) for each q ∈ {1, 3, 5, ..., 2^{e+1} − 1}.

Note that γ_{w,e} ≤ γ_{w,e+1} ≤ ···, so if |P|_2 ≤ γ_{w,e} then also |P|_2 ≤ γ_{w,e+1}, etc. This is where it is important that γ_{w,e} is not simply defined as β_{w+e}·2^e·70/169.

To summarize, verify(w,P) recursively enumerates all descendants of (w, P). Hence verify(0,R(1)) recursively enumerates all elements (w, P) of S, checking |P|_2 ≤ α_w for each (w, P).

Theorem F.24. If j ∈ {0, 1, 2, ...}; e_1, ..., e_j ∈ {1, 2, 3, ...}; q_1, ..., q_j ∈ Z; and q_i ∈ {1, 3, 5, ..., 2^{e_i+1} − 1} for each i; then |M(e_j, q_j) ··· M(e_1, q_1)|_2 ≤ α_{e_1+···+e_j}.

Proof. This is the conclusion of Theorem F.21, so it suffices to check the assumptions. Define S as in Theorem F.21; then |P|_2 ≤ α_w for each (w, P) ∈ S by Theorem F.22.


Theorem F.25. In the situation of Theorem E.3, assume that R_0, R_1, R_2, ..., R_{j+1} are defined. Then |(R_j/2^{e_j}, R_{j+1})|_2 ≤ α_{e_1+···+e_j}|(R_0, R_1)|_2, and if e_1 + ··· + e_j ≥ 67 then e_1 + ··· + e_j ≤ log_{1024/633}|(R_0, R_1)|_2.

If R_0 and R_1 are each bounded by 2^b in absolute value, then |(R_0, R_1)|_2 is bounded by 2^{b+0.5}, so log_{1024/633}|(R_0, R_1)|_2 ≤ (b + 0.5)/log_2(1024/633) = (b + 0.5)·1.4410...

Proof. By Theorem E.2,

[ R_i/2^{e_i} ]              [ R_{i−1}/2^{e_{i−1}} ]
[ R_{i+1}     ] = M(e_i, q_i)[ R_i                 ]

where q_i = 2^{e_i}((R_{i−1}/2^{e_{i−1}}) div_2 R_i) ∈ {1, 3, 5, ..., 2^{e_i+1} − 1}. Hence

[ R_j/2^{e_j} ]                                [ R_0 ]
[ R_{j+1}     ] = M(e_j, q_j) ··· M(e_1, q_1)  [ R_1 ].

The product M(e_j, q_j) ··· M(e_1, q_1) has 2-norm at most α_w, where w = e_1 + ··· + e_j, so |(R_j/2^{e_j}, R_{j+1})|_2 ≤ α_w|(R_0, R_1)|_2 by Theorem F.9.

By Theorem F.6, |(R_j/2^{e_j}, R_{j+1})|_2 ≥ 1. If e_1 + ··· + e_j ≥ 67 then α_{e_1+···+e_j} = (633/1024)^{e_1+···+e_j}, so 1 ≤ (633/1024)^{e_1+···+e_j}|(R_0, R_1)|_2.

Theorem F.26. In the situation of Theorem E.3, there exists t ≥ 0 such that R_{t+1} = 0.

Proof. Suppose not. Then R_0, R_1, ..., R_j, R_{j+1} are all defined for arbitrarily large j. Take j ≥ 67 with j > log_{1024/633}|(R_0, R_1)|_2. Then e_1 + ··· + e_j ≥ 67, so j ≤ e_1 + ··· + e_j ≤ log_{1024/633}|(R_0, R_1)|_2 < j by Theorem F.25; contradiction.

G   Proof of main gcd theorem for integers

This appendix shows how the gcd algorithm from Appendix E is related to iterating divstep. This appendix then uses the analysis from Appendix F to put bounds on the number of divstep iterations needed.

Theorem G.1 (2-adic division as a sequence of division steps). In the situation of Theorem E.1, divstep^{2e}(1, f, g/2) = (1, g/2^e, (qg/2^e − f)/2^{e+1}). In other words, divstep^{2e}(1, f, g/2) = (1, g/2^e, −(f mod_2 g)/2^{e+1}).

Proof. Define

z_e = (g/2^e − f)/2 ∈ Z_2,
z_{e+1} = (z_e + (z_e mod 2)g/2^e)/2 ∈ Z_2,
z_{e+2} = (z_{e+1} + (z_{e+1} mod 2)g/2^e)/2 ∈ Z_2,
...
z_{2e} = (z_{2e−1} + (z_{2e−1} mod 2)g/2^e)/2 ∈ Z_2.

Then

z_e = (g/2^e − f)/2,
z_{e+1} = ((1 + 2(z_e mod 2))g/2^e − f)/4,
z_{e+2} = ((1 + 2(z_e mod 2) + 4(z_{e+1} mod 2))g/2^e − f)/8,
...
z_{2e} = ((1 + 2(z_e mod 2) + ··· + 2^e(z_{2e−1} mod 2))g/2^e − f)/2^{e+1}.

In short, z_{2e} = (q′g/2^e − f)/2^{e+1} where

q′ = 1 + 2(z_e mod 2) + ··· + 2^e(z_{2e−1} mod 2) ∈ {1, 3, 5, ..., 2^{e+1} − 1}.

Hence q′g/2^e − f = 2^{e+1}z_{2e} ∈ 2^{e+1}Z_2, so q′ = q by the uniqueness statement in Theorem E.1. Note that this construction gives an alternative proof of the existence of q in Theorem E.1.

All that remains is to prove divstep^{2e}(1, f, g/2) = (1, g/2^e, (qg/2^e − f)/2^{e+1}), i.e., divstep^{2e}(1, f, g/2) = (1, g/2^e, z_{2e}).

If 1 ≤ i < e then g/2^i ∈ 2Z_2, so divstep(i, f, g/2^i) = (i + 1, f, g/2^{i+1}). By induction, divstep^{e−1}(1, f, g/2) = (e, f, g/2^e). By assumption e > 0 and g/2^e is odd, so divstep(e, f, g/2^e) = (1 − e, g/2^e, (g/2^e − f)/2) = (1 − e, g/2^e, z_e). What remains is to prove that divstep^e(1 − e, g/2^e, z_e) = (1, g/2^e, z_{2e}).

If 0 ≤ i ≤ e − 1 then 1 − e + i ≤ 0, so divstep(1 − e + i, g/2^e, z_{e+i}) = (2 − e + i, g/2^e, (z_{e+i} + (z_{e+i} mod 2)g/2^e)/2) = (2 − e + i, g/2^e, z_{e+i+1}). By induction, divstep^i(1 − e, g/2^e, z_e) = (1 − e + i, g/2^e, z_{e+i}) for 0 ≤ i ≤ e. In particular, divstep^e(1 − e, g/2^e, z_e) = (1, g/2^e, z_{2e}), as claimed.

Theorem G.2 (2-adic gcd as a sequence of division steps). In the situation of Theorem E.3, assume that R_0, R_1, ..., R_j, R_{j+1} are defined. Then divstep^{2w}(1, R_0, R_1/2) = (1, R_j/2^{e_j}, R_{j+1}/2), where w = e_1 + ··· + e_j.

Therefore, if R_{t+1} = 0, then divstep^n(1, R_0, R_1/2) = (1 + n − 2w, R_t/2^{e_t}, 0) for all n ≥ 2w, where w = e_1 + ··· + e_t.

Proof. Induct on j. If j = 0 then w = 0, so divstep^{2w}(1, R_0, R_1/2) = (1, R_0, R_1/2), and R_0 = R_0/2^{e_0} since R_0 is odd by hypothesis.

If j ≥ 1 then by definition R_{j+1} = −((R_{j−1}/2^{e_{j−1}}) mod_2 R_j)/2^{e_j}. Furthermore

divstep^{2e_1+···+2e_{j−1}}(1, R_0, R_1/2) = (1, R_{j−1}/2^{e_{j−1}}, R_j/2)

by the inductive hypothesis, so

divstep^{2e_1+···+2e_j}(1, R_0, R_1/2) = divstep^{2e_j}(1, R_{j−1}/2^{e_{j−1}}, R_j/2) = (1, R_j/2^{e_j}, R_{j+1}/2)

by Theorem G.1.

Theorem G.3. Let R_0 be an odd element of Z. Let R_1 be an element of 2Z. Then there exists n ∈ {0, 1, 2, ...} such that divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0}. Furthermore, any such n has divstep^n(1, R_0, R_1/2) ∈ Z × {G, −G} × {0}, where G = gcd{R_0, R_1}.

Proof. Define e_0, e_1, e_2, e_3, ... and R_2, R_3, ... as in Theorem E.2. By Theorem F.26 there exists t ≥ 0 such that R_{t+1} = 0. By Theorem E.3, R_t/2^{e_t} ∈ {G, −G}. By Theorem G.2, divstep^{2w}(1, R_0, R_1/2) = (1, R_t/2^{e_t}, 0) ∈ Z × {G, −G} × {0}, where w = e_1 + ··· + e_t.

Conversely, assume that n ∈ {0, 1, 2, ...} and that divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0}. Write divstep^n(1, R_0, R_1/2) as (δ, f, 0). Then divstep^{n+k}(1, R_0, R_1/2) = (k + δ, f, 0) for all k ∈ {0, 1, 2, ...}, so in particular divstep^{2w+n}(1, R_0, R_1/2) = (2w + δ, f, 0). Also divstep^{2w+k}(1, R_0, R_1/2) = (k + 1, R_t/2^{e_t}, 0) for all k ∈ {0, 1, 2, ...}, so in particular divstep^{2w+n}(1, R_0, R_1/2) = (n + 1, R_t/2^{e_t}, 0). Hence f = R_t/2^{e_t} ∈ {G, −G}, and divstep^n(1, R_0, R_1/2) ∈ Z × {G, −G} × {0} as claimed.
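To see Theorem G.3 in action, one can iterate divstep directly. The following Python sketch is ours; the divstep formula written out here is the 2-adic division step whose two cases are exactly the ones used in the proof of Theorem G.1:

def divstep(delta, f, g):
    # One 2-adic division step (f stays odd throughout).
    if delta > 0 and g % 2 == 1:
        return 1 - delta, g, (g - f) // 2
    return 1 + delta, f, (g + (g % 2) * f) // 2

def iterate_to_gcd(R0, R1):
    # Iterate divstep on (1, R0, R1/2) until g = 0;
    # by Theorem G.3 the second component is then +-gcd(R0, R1).
    assert R0 % 2 == 1 and R1 % 2 == 0
    delta, f, g = 1, R0, R1 // 2
    n = 0
    while g != 0:
        delta, f, g = divstep(delta, f, g)
        n += 1
    return n, f

print(iterate_to_gcd(1, 0))       # (0, 1)
print(iterate_to_gcd(135, 90))    # second component is 45 or -45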

Theorem G.4. Let R_0 be an odd element of Z. Let R_1 be an element of 2Z. Define b = log_2 √(R_0^2 + R_1^2). Assume that b ≤ 21. Then divstep^{⌊19b/7⌋}(1, R_0, R_1/2) ∈ Z × Z × {0}.

Proof. If divstep(δ, f, g) = (δ_1, f_1, g_1), then divstep(δ, −f, −g) = (δ_1, −f_1, −g_1). Hence divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0} if and only if divstep^n(1, −R_0, −R_1/2) ∈ Z × Z × {0}. From now on assume without loss of generality that R_0 > 0.


s    W_s              R_0       R_1
0    1                1         0
1    5                1         −2
2    5                1         −2
3    5                1         −2
4    17               1         −4
5    17               1         −4
6    65               1         −8
7    65               1         −8
8    157              11        −6
9    181              9         −10
10   421              15        −14
11   541              21        −10
12   1165             29        −18
13   1517             29        −26
14   2789             17        −50
15   3653             47        −38
16   8293             47        −78
17   8293             47        −78
18   24245            121       −98
19   27361            55        −156
20   56645            191       −142
21   79349            215       −182
22   132989           283       −230
23   213053           133       −442
24   415001           451       −460
25   613405           501       −602
26   1345061          719       −910
27   1552237          1021      −714
28   2866525          789       −1498
29   4598269          1355      −1662
30   7839829          777       −2690
31   12565573         2433      −2578
32   22372085         2969      −3682
33   35806445         5443      −2486
34   71013905         4097      −7364
35   74173637         5119      −6926
36   205509533        9973      −10298
37   226964725        11559     −9662
38   487475029        11127     −19070
39   534274997        13241     −18946
40   1543129037       24749     −30506
41   1639475149       20307     −35030
42   3473731181       39805     −43466
43   4500780005       49519     −45262
44   9497198125       38051     −89718
45   12700184149      82743     −76510
46   29042662405      152287    −76494
47   36511782869      152713    −114850
48   82049276437      222249    −180706
49   90638999365      220417    −205074
50   234231548149     229593    −426050
51   257130797053     338587    −377478
52   466845672077     301069    −613354
53   622968125533     437163    −657158
54   1386285660565    600809    −1012578
55   1876581808109    604547    −1229270
56   4208169535453    1352357   −1542498

Figure G.5: For each s ∈ {0, 1, ..., 56}: minimum R_0^2 + R_1^2 where R_0 is an odd integer, R_1 is an even integer, and (R_0, R_1) requires ≥s iterations of divstep. Also an example of (R_0, R_1) reaching this minimum.

The rest of the proof is a computer proof. We enumerate all pairs (R_0, R_1) where R_0 is a positive odd integer, R_1 is an even integer, and R_0^2 + R_1^2 ≤ 2^{42}. For each (R_0, R_1) we iterate divstep to find the minimum n ≥ 0 where divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0}. We then say that (R_0, R_1) "needs n steps".

For each s ≥ 0, define W_s as the smallest R_0^2 + R_1^2 where (R_0, R_1) needs ≥s steps. The computation shows that each (R_0, R_1) needs ≤56 steps, and also shows the values W_s for each s ∈ {0, 1, ..., 56}. We display these values in Figure G.5. We check for each s ∈ {0, 1, ..., 56} that 2^{14s} ≤ W_s^{19}.

Finally, we claim that each (R_0, R_1) needs ≤⌊19b/7⌋ steps, where b = log_2 √(R_0^2 + R_1^2). If not, then (R_0, R_1) needs ≥s steps where s = ⌊19b/7⌋ + 1. We have s ≤ 56 since (R_0, R_1) needs ≤56 steps. We also have W_s ≤ R_0^2 + R_1^2 by definition of W_s. Hence 2^{14s/19} ≤ R_0^2 + R_1^2, so 14s/19 ≤ 2b, so s ≤ 19b/7; contradiction.

Theorem G.6 (Bounds on the number of divsteps required). Let R_0 be an odd element of Z. Let R_1 be an element of 2Z. Define b = log_2 √(R_0^2 + R_1^2). Define n = ⌊19b/7⌋ if b ≤ 21; n = ⌊(49b + 23)/17⌋ if 21 < b ≤ 46; and n = ⌊49b/17⌋ if b > 46. Then divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0}.

All three cases have n ≤ ⌊(49b + 23)/17⌋, since 19/7 < 49/17.

Proof. If b ≤ 21 then divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0} by Theorem G.4.

Assume from now on that b > 21. This implies n ≥ ⌊49b/17⌋ ≥ 60.

Define e_0, e_1, e_2, e_3, ... and R_2, R_3, ... as in Theorem E.2. By Theorem F.26 there exists t ≥ 0 such that R_{t+1} = 0. By Theorem E.3, R_t/2^{e_t} ∈ {G, −G}. By Theorem G.2, divstep^{2w}(1, R_0, R_1/2) = (1, R_t/2^{e_t}, 0) ∈ Z × {G, −G} × {0}, where w = e_1 + ··· + e_t.

To finish the proof, we will show that 2w ≤ n. Hence divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0} as claimed.

Suppose that 2w > n. Both 2w and n are integers, so 2w ≥ n + 1 ≥ 61, so w ≥ 31.

By Theorem F.6 and Theorem F.25, 1 ≤ |(R_t/2^{e_t}, R_{t+1})|_2 ≤ α_w|(R_0, R_1)|_2 = α_w·2^b, so log_2(1/α_w) ≤ b.

Case 1: w ≥ 67. By definition 1/α_w = (1024/633)^w, so w·log_2(1024/633) ≤ b. We have 34/49 < log_2(1024/633), so 34w/49 < b, so n + 1 ≤ 2w < 49b/17 < (49b + 23)/17. This contradicts the definition of n as the largest integer below 49b/17 if b > 46, and as the largest integer below (49b + 23)/17 if b ≤ 46.

Case 2: w ≤ 66. Then α_w < 2^{−(34w−23)/49} since w ≥ 31, so b ≥ log_2(1/α_w) > (34w − 23)/49, so n + 1 ≤ 2w ≤ (49b + 23)/17. This again contradicts the definition of n if b ≤ 46. If b > 46 then n ≥ ⌊49·46/17⌋ = 132, so 2w ≥ n + 1 > 132 ≥ 2w; contradiction.
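The bound of Theorem G.6 is easy to evaluate; here is a small Python sketch (ours; for very large inputs, b should be computed exactly rather than with floating point, to avoid rounding near the breakpoints):

from math import floor, log2

def divstep_bound(R0, R1):
    # Iteration count n from Theorem G.6 for odd R0 and even R1.
    b = log2(R0 * R0 + R1 * R1) / 2
    if b <= 21:
        return floor(19 * b / 7)
    if b <= 46:
        return floor((49 * b + 23) / 17)
    return floor(49 * b / 17)

print(divstep_bound(135, 90))   # an upper bound on the n needed in Theorem G.3 for these inputs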

H   Comparison to performance of cdivstep

This appendix briefly compares the analysis of divstep from Appendix G to an analogous analysis of cdivstep.

First modify Theorem E.1 to take the unique element q ∈ {±1, ±3, ..., ±(2^e − 1)} such that f + qg/2^e ∈ 2^{e+1}Z_2. The corresponding modification of Theorem G.1 says that cdivstep^{2e}(1, f, g/2) = (1, g/2^e, (f + qg/2^e)/2^{e+1}). The proof proceeds identically, except that z_j = (z_{j−1} − (z_{j−1} mod 2)g/2^e)/2 ∈ Z_2 for e < j < 2e, and z_{2e} = (z_{2e−1} + (z_{2e−1} mod 2)g/2^e)/2 ∈ Z_2. Also we have q = −1 − 2(z_e mod 2) − ··· − 2^{e−1}(z_{2e−2} mod 2) + 2^e(z_{2e−1} mod 2) ∈ {±1, ±3, ..., ±(2^e − 1)}. In the notation of [62], GB(f, g) = (q, r) where r = f + qg/2^e = 2^{e+1}z_{2e}.

The corresponding gcd algorithm is the centered algorithm from Appendix E.6, with addition instead of subtraction. Recall that, except for inessential scalings by powers of 2, this is the same as the Stehlé–Zimmermann gcd algorithm from [62].

The corresponding set of transition matrices is different from the divstep case. As mentioned in Appendix F, there are weight-w products with spectral radius (0.6403882032...)^w, so the proof strategy for cdivstep cannot obtain weight-w norm bounds below this spectral radius. This is worse than our bounds for divstep.

Enumerating small pairs (R_0, R_1) as in Theorem G.4 also produces worse results for cdivstep than for divstep. See Figure H.1.


Figure H.1 (scatter plot, not reproduced here): For each s ∈ {1, 2, ..., 56}, red dot at (b, s − 2.8283396·b), where b is the minimum log_2 √(R_0^2 + R_1^2) where R_0 is an odd integer, R_1 is an even integer, and (R_0, R_1) requires ≥s iterations of divstep. For each s ∈ {1, 2, ..., 58}, blue dot at (b, s − 2.8283396·b), where b is the minimum log_2 √(R_0^2 + R_1^2) where R_0 is an odd integer, R_1 is an even integer, and (R_0, R_1) requires ≥s iterations of cdivstep.


Table 4.1: Iterates (δ_n, f_n, g_n) = divstep^n(δ, f, g) for k = F_7, δ = 1, f = 2 + 7x + 1x^2 + 8x^3 + 2x^4 + 8x^5 + 1x^6 + 8x^7, and g = 3 + 1x + 4x^2 + 1x^3 + 5x^4 + 9x^5 + 2x^6. Each power series is displayed as a row of coefficients of x^0, x^1, etc.

n    δ_n   f_n (coefficients of x^0, ..., x^9)   g_n (coefficients of x^0, ..., x^9)
0     1    2 0 1 1 2 1 1 1 0 0                   3 1 4 1 5 2 2 0 0 0
1     0    3 1 4 1 5 2 2 0 0 0                   5 2 1 3 6 6 3 0 0 0
2     1    3 1 4 1 5 2 2 0 0 0                   1 4 4 0 1 6 0 0 0 0
3     0    1 4 4 0 1 6 0 0 0 0                   3 6 1 2 5 2 0 0 0 0
4     1    1 4 4 0 1 6 0 0 0 0                   1 3 2 2 5 0 0 0 0 0
5     0    1 3 2 2 5 0 0 0 0 0                   1 2 5 3 6 0 0 0 0 0
6     1    1 3 2 2 5 0 0 0 0 0                   6 3 1 1 0 0 0 0 0 0
7     0    6 3 1 1 0 0 0 0 0 0                   1 4 4 2 0 0 0 0 0 0
8     1    6 3 1 1 0 0 0 0 0 0                   0 2 4 0 0 0 0 0 0 0
9     2    6 3 1 1 0 0 0 0 0 0                   5 3 0 0 0 0 0 0 0 0
10   −1    5 3 0 0 0 0 0 0 0 0                   4 5 5 0 0 0 0 0 0 0
11    0    5 3 0 0 0 0 0 0 0 0                   6 4 0 0 0 0 0 0 0 0
12    1    5 3 0 0 0 0 0 0 0 0                   2 0 0 0 0 0 0 0 0 0
13    0    2 0 0 0 0 0 0 0 0 0                   6 0 0 0 0 0 0 0 0 0
14    1    2 0 0 0 0 0 0 0 0 0                   0 0 0 0 0 0 0 0 0 0
15    2    2 0 0 0 0 0 0 0 0 0                   0 0 0 0 0 0 0 0 0 0
16    3    2 0 0 0 0 0 0 0 0 0                   0 0 0 0 0 0 0 0 0 0
17    4    2 0 0 0 0 0 0 0 0 0                   0 0 0 0 0 0 0 0 0 0
18    5    2 0 0 0 0 0 0 0 0 0                   0 0 0 0 0 0 0 0 0 0
19    6    2 0 0 0 0 0 0 0 0 0                   0 0 0 0 0 0 0 0 0 0

(δ′, f′, g′). If u ∈ k[[x]] with u(0) = 1, then divstep(δ, uf, ug) = (δ′, uf′, ug′). If u ∈ k*, then

divstep(δ, uf, g) = (δ′, f′, ug′) in the first (swapped) case;
divstep(δ, uf, g) = (δ′, uf′, ug′) in the second case;
divstep(δ, f, ug) = (δ′, uf′, ug′) in the first case;
divstep(δ, f, ug) = (δ′, f′, ug′) in the second case;

and divstep(δ, uf, ug) = (δ′, uf′, u^2g′).

One can also scale g(0)f − f(0)g by other nonzero constants. For example, for typical fields k, one can quickly find "half-size" a, b such that a/b = g(0)/f(0). Computing af − bg may be more efficient than computing g(0)f − f(0)g.

We use the g(0)/f(0) variant in our case study in Section 7 for the field k = F_3. Dividing by f(0) in this field is the same as multiplying by f(0).

4   Iterates of x-adic division steps

Starting from (δ, f, g) ∈ Z × k[[x]]* × k[[x]], define (δ_n, f_n, g_n) = divstep^n(δ, f, g) ∈ Z × k[[x]]* × k[[x]] for each n ≥ 0. Also define T_n = T(δ_n, f_n, g_n) ∈ M_2(k[1/x]) and S_n = S(δ_n, f_n, g_n) ∈ M_2(Z) for each n ≥ 0. Our main algorithm relies on various simple mathematical properties of these objects; this section states those properties. Table 4.1 shows all δ_n, f_n, g_n for one example of (δ, f, g).

Theorem 4.2.

[ f_n ]                    [ f_m ]        [ 1   ]                    [ 1   ]
[ g_n ] = T_{n−1} ··· T_m  [ g_m ]  and   [ δ_n ] = S_{n−1} ··· S_m  [ δ_m ]

if n ≥ m ≥ 0.

Proof. This follows from

[ f_{m+1} ]       [ f_m ]        [ 1       ]       [ 1   ]
[ g_{m+1} ] = T_m [ g_m ]  and   [ δ_{m+1} ] = S_m [ δ_m ].

Theorem 4.3.

T_{n−1} ··· T_m ∈ [ k + ··· + (1/x^{n−m−1})k       k + ··· + (1/x^{n−m−1})k    ]
                  [ (1/x)k + ··· + (1/x^{n−m})k    (1/x)k + ··· + (1/x^{n−m})k ]

if n > m ≥ 0.

A product of n − m of these T matrices can thus be stored using 4(n − m) coefficients from k. Beware that the theorem statement relies on n > m; for n = m the product is the identity matrix.

Proof. If n − m = 1 then the product is T_m, which is in [k, k; (1/x)k, (1/x)k] by definition. For n − m > 1, assume inductively that

T_{n−1} ··· T_{m+1} ∈ [ k + ··· + (1/x^{n−m−2})k       k + ··· + (1/x^{n−m−2})k    ]
                      [ (1/x)k + ··· + (1/x^{n−m−1})k  (1/x)k + ··· + (1/x^{n−m−1})k ].

Multiply on the right by T_m ∈ [k, k; (1/x)k, (1/x)k]. The top entries of the product are in (k + ··· + (1/x^{n−m−2})k) + ((1/x)k + ··· + (1/x^{n−m−1})k) = k + ··· + (1/x^{n−m−1})k, as claimed, and the bottom entries are in ((1/x)k + ··· + (1/x^{n−m−1})k) + ((1/x^2)k + ··· + (1/x^{n−m})k) = (1/x)k + ··· + (1/x^{n−m})k, as claimed.

Theorem 4.4.

S_{n−1} ··· S_m ∈ [ 1                                         0       ]
                  [ {2 − (n − m), ..., n − m − 2, n − m}       {1, −1} ]

if n > m ≥ 0.

In other words, the bottom-left entry is between −(n − m) (exclusive) and n − m (inclusive) and has the same parity as n − m. There are exactly 2(n − m) possibilities for the product. Beware that the theorem statement again relies on n > m.

Proof. If n − m = 1 then {2 − (n − m), ..., n − m − 2, n − m} = {1}, and the product is S_m, which is [1, 0; 1, ±1] by definition.

For n − m > 1, assume inductively that

S_{n−1} ··· S_{m+1} ∈ [ 1                                               0       ]
                      [ {2 − (n − m − 1), ..., n − m − 3, n − m − 1}    {1, −1} ]

and multiply on the right by S_m = [1, 0; 1, ±1]. The top two entries of the product are again 1 and 0. The bottom-left entry is in {2 − (n − m − 1), ..., n − m − 3, n − m − 1} + {1, −1} = {2 − (n − m), ..., n − m − 2, n − m}. The bottom-right entry is again 1 or −1.

The next theorem compares δ_m, f_m, g_m, T_m, S_m to δ′_m, f′_m, g′_m, T′_m, S′_m defined in the same way starting from (δ′, f′, g′) ∈ Z × k[[x]]* × k[[x]].

Theorem 4.5. Let m, t be nonnegative integers. Assume that δ′_m = δ_m, f′_m ≡ f_m (mod x^t), and g′_m ≡ g_m (mod x^t). Then, for each integer n with m < n ≤ m + t: T′_{n−1} = T_{n−1}; S′_{n−1} = S_{n−1}; δ′_n = δ_n; f′_n ≡ f_n (mod x^{t−(n−m−1)}); and g′_n ≡ g_n (mod x^{t−(n−m)}).

In other words, δ_m and the first t coefficients of f_m and g_m determine all of the following:

• the matrices T_m, T_{m+1}, ..., T_{m+t−1};
• the matrices S_m, S_{m+1}, ..., S_{m+t−1};
• δ_{m+1}, ..., δ_{m+t};
• the first t coefficients of f_{m+1}, the first t − 1 coefficients of f_{m+2}, the first t − 2 coefficients of f_{m+3}, and so on through the first coefficient of f_{m+t};
• the first t − 1 coefficients of g_{m+1}, the first t − 2 coefficients of g_{m+2}, the first t − 3 coefficients of g_{m+3}, and so on through the first coefficient of g_{m+t−1}.

Our main algorithm computes (δ_n, f_n mod x^{t−(n−m−1)}, g_n mod x^{t−(n−m)}) for any selected n ∈ {m + 1, ..., m + t}, along with the product of the matrices T_m, ..., T_{n−1}. See Section 5.

Proof. See Figure 3.3 for the intuition. Formally, fix m, t and induct on n. Observe that (δ′_{n−1}, f′_{n−1}(0), g′_{n−1}(0)) equals (δ_{n−1}, f_{n−1}(0), g_{n−1}(0)):

• If n = m + 1: By assumption f′_m ≡ f_m (mod x^t) and n ≤ m + t, so t ≥ 1, so in particular f′_m(0) = f_m(0). Similarly g′_m(0) = g_m(0). By assumption δ′_m = δ_m.
• If n > m + 1: f′_{n−1} ≡ f_{n−1} (mod x^{t−(n−m−2)}) by the inductive hypothesis, and n ≤ m + t, so t − (n − m − 2) ≥ 2, so in particular f′_{n−1}(0) = f_{n−1}(0); similarly g′_{n−1} ≡ g_{n−1} (mod x^{t−(n−m−1)}) by the inductive hypothesis, and t − (n − m − 1) ≥ 1, so in particular g′_{n−1}(0) = g_{n−1}(0); and δ′_{n−1} = δ_{n−1} by the inductive hypothesis.

Now use the fact that T and S inspect only the first coefficient of each series to see that

T′_{n−1} = T(δ′_{n−1}, f′_{n−1}, g′_{n−1}) = T(δ_{n−1}, f_{n−1}, g_{n−1}) = T_{n−1},
S′_{n−1} = S(δ′_{n−1}, f′_{n−1}, g′_{n−1}) = S(δ_{n−1}, f_{n−1}, g_{n−1}) = S_{n−1},

as claimed. Multiply (1; δ′_{n−1}) = (1; δ_{n−1}) by S′_{n−1} = S_{n−1} to see that δ′_n = δ_n, as claimed.

By the inductive hypothesis, (T′_m, ..., T′_{n−2}) = (T_m, ..., T_{n−2}), and T′_{n−1} = T_{n−1}, so T′_{n−1} ··· T′_m = T_{n−1} ··· T_m. Abbreviate T_{n−1} ··· T_m as P. Then

[ f′_n ]     [ f′_m ]        [ f_n ]     [ f_m ]
[ g′_n ] = P [ g′_m ]  and   [ g_n ] = P [ g_m ]

by Theorem 4.2, so

[ f′_n − f_n ]     [ f′_m − f_m ]
[ g′_n − g_n ] = P [ g′_m − g_m ].

By Theorem 4.3,

P ∈ [ k + ··· + (1/x^{n−m−1})k       k + ··· + (1/x^{n−m−1})k    ]
    [ (1/x)k + ··· + (1/x^{n−m})k    (1/x)k + ··· + (1/x^{n−m})k ].

By assumption, f′_m − f_m and g′_m − g_m are multiples of x^t. Multiply to see that f′_n − f_n is a multiple of x^t/x^{n−m−1} = x^{t−(n−m−1)}, as claimed, and g′_n − g_n is a multiple of x^t/x^{n−m} = x^{t−(n−m)}, as claimed.

5   Fast computation of iterates of x-adic division steps

Our goal in this section is to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). We are given the power series f, g to precision t, t respectively, and we compute f_n, g_n to precision t − n + 1, t − n respectively if n ≥ 1; see Theorem 4.5. We also compute the n-step transition matrix T_{n−1} ··· T_0.

We begin with an algorithm that simply computes one divstep iterate after another, multiplying by scalars in k and dividing by x, exactly as in the divstep definition. We specify this algorithm in Figure 5.1 as a function divstepsx in the Sage [58] computer-algebra system. The quantities u, v, q, r ∈ k[1/x] inside the algorithm keep track of the coefficients of the current f, g in terms of the original f, g.

The rest of this section considers (1) constant-time computation, (2) "jumping" through division steps, and (3) further techniques to save time inside the computations.

5.2. Constant-time computation. The underlying Sage functions do not take constant time, but they can be replaced with functions that do take constant time. The case distinction can be replaced with constant-time bit operations, for example obtaining the new (δ, f, g) as follows:

def divstepsx(n, t, delta, f, g):
    assert t >= n and n >= 0
    f, g = f.truncate(t), g.truncate(t)
    kx = f.parent()
    x = kx.gen()
    u, v, q, r = kx(1), kx(0), kx(0), kx(1)

    while n > 0:
        f = f.truncate(t)
        if delta > 0 and g[0] != 0: delta, f, g, u, v, q, r = -delta, g, f, q, r, u, v
        f0, g0 = f[0], g[0]
        delta, g, q, r = 1 + delta, (f0*g - g0*f)/x, (f0*q - g0*u)/x, (f0*r - g0*v)/x
        n, t = n - 1, t - 1
        g = kx(g).truncate(t)

    M2kx = MatrixSpace(kx.fraction_field(), 2)
    return delta, f, g, M2kx((u, v, q, r))

Figure 5.1: Algorithm divstepsx to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). Inputs: n, t ∈ Z with 0 ≤ n ≤ t; δ ∈ Z; f, g ∈ k[[x]] to precision at least t. Outputs: δ_n; f_n to precision t if n = 0, or t − (n − 1) if n ≥ 1; g_n to precision t − n; T_{n−1} ··· T_0.

• Let s be the "sign bit" of −δ, meaning 0 if −δ ≥ 0 or 1 if −δ < 0.
• Let z be the "nonzero bit" of g(0), meaning 0 if g(0) = 0 or 1 if g(0) ≠ 0.
• Compute and output 1 + (1 − 2sz)δ. For example, shift δ to obtain 2δ, "AND" with −sz to obtain 2szδ, and subtract from 1 + δ.
• Compute and output f ⊕ sz(f ⊕ g), obtaining f if sz = 0 or g if sz = 1.
• Compute and output (1 − 2sz)(f(0)g − g(0)f)/x. Alternatively, compute and output simply (f(0)g − g(0)f)/x; see the discussion of decomposition and scaling in Section 3.

Standard techniques minimize the exact cost of these computations for various representations of the inputs. The details depend on the platform. See our case study in Section 7.
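The selection pattern in the bullets above can be written out concretely. The following Python sketch is our illustration only; genuinely constant-time code, as in the case study of Section 7, would use fixed-width words and masks in a lower-level language:

def divstep_branchless(delta, f, g):
    # One x-adic division step, using the branchless selection pattern above.
    # f, g are equal-length coefficient lists over a field, with f[0] != 0;
    # multiplications by c (0 or 1) replace the data-dependent branch.
    s = 1 if -delta < 0 else 0              # "sign bit" of -delta
    z = 1 if g[0] != 0 else 0               # "nonzero bit" of g(0)
    c = s * z                               # 1 exactly when a swap happens
    sign = 1 - 2 * c                        # +1 or -1
    new_delta = 1 + sign * delta
    new_f = [fi + c * (gi - fi) for fi, gi in zip(f, g)]            # f or g
    t = [sign * (f[0] * gi - g[0] * fi) for fi, gi in zip(f, g)]    # (1-2c)(f(0)g - g(0)f)
    new_g = t[1:]                           # dividing by x drops the constant term, which is 0
    return new_delta, new_f, new_g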

5.3. Jumping through division steps. Here is a more general divide-and-conquer algorithm to compute (δ_n, f_n, g_n) and the n-step transition matrix T_{n−1} ··· T_0:

• If n ≤ 1, use the definition of divstep and stop.
• Choose j ∈ {1, 2, ..., n − 1}.
• Jump j steps from δ, f, g to δ_j, f_j, g_j. Specifically, compute the j-step transition matrix T_{j−1} ··· T_0, and then multiply by (f; g) to obtain (f_j; g_j). To compute the j-step transition matrix, call the same algorithm recursively. The critical point here is that this recursive call uses the inputs to precision only j; this determines the j-step transition matrix by Theorem 4.5.
• Similarly, jump another n − j steps from δ_j, f_j, g_j to δ_n, f_n, g_n.

from divstepsx import divstepsx

def jumpdivstepsx(n, t, delta, f, g):
    assert t >= n and n >= 0
    kx = f.truncate(t).parent()

    if n <= 1: return divstepsx(n, t, delta, f, g)

    j = n//2

    delta, f1, g1, P1 = jumpdivstepsx(j, j, delta, f, g)
    f, g = P1*vector((f, g))
    f, g = kx(f).truncate(t - j), kx(g).truncate(t - j)

    delta, f2, g2, P2 = jumpdivstepsx(n - j, n - j, delta, f, g)
    f, g = P2*vector((f, g))
    f, g = kx(f).truncate(t - n + 1), kx(g).truncate(t - n)

    return delta, f, g, P2*P1

Figure 5.4: Algorithm jumpdivstepsx to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). Same inputs and outputs as in Figure 5.1.

This splits an (n, t) problem into two smaller problems, namely a (j, j) problem and an (n − j, n − j) problem, at the cost of O(1) polynomial multiplications involving O(t + n) coefficients.

If j is always chosen as 1, then this algorithm ends up performing the same computations as Figure 5.1: compute δ_1, f_1, g_1 to precision t − 1; compute δ_2, f_2, g_2 to precision t − 2; etc. But there are many more possibilities for j.

Our jumpdivstepsx algorithm in Figure 5.4 takes j = ⌊n/2⌋, balancing the (j, j) problem with the (n − j, n − j) problem. In particular, with subquadratic polynomial multiplication the algorithm is subquadratic. With FFT-based polynomial multiplication, the algorithm uses (t + n)(log(t + n))^{1+o(1)} + n(log n)^{2+o(1)} operations, and hence n(log n)^{2+o(1)} operations if t is bounded by n(log n)^{1+o(1)}. For comparison, Figure 5.1 takes Θ(n(t + n)) operations.

We emphasize that, unlike standard fast-gcd algorithms such as [66], this algorithm does not call a polynomial-division subroutine between its two recursive calls.

5.5. More speedups. Our jumpdivstepsx algorithm is asymptotically Θ(log n) times slower than fast polynomial multiplication. The best previous gcd algorithms also have a Θ(log n) slowdown, both in the worst case and in the average case.^10 This might suggest that there is very little room for improvement.

However, Θ says nothing about constant factors. There are several standard ideas for constant-factor speedups in gcd computation, beyond lower-level speedups in multiplication. We now apply the ideas to jumpdivstepsx.

The final polynomial multiplications in jumpdivstepsx produce six large outputs: f, g, and four entries of a matrix product P. Callers do not always use all of these outputs; for example, the recursive calls to jumpdivstepsx do not use the resulting f

^10 There are faster cases. Strassen [66] gave an algorithm that replaces the log factor with the entropy of the list of degrees in the corresponding continued fraction; there are occasional inputs where this entropy is o(log n). For example, computing the gcd of an input and 0 takes linear time. However, these speedups are not useful for constant-time computations.


and g. In principle there is no difficulty applying dead-value elimination and higher-level optimizations to each of the 63 possible combinations of desired outputs.

The recursive calls in jumpdivstepsx produce two products P_1, P_2 of transition matrices, which are then (if desired) multiplied to obtain P = P_2P_1. In Figure 5.4, jumpdivstepsx multiplies f and g first by P_1 and then by P_2 to obtain f_n and g_n. An alternative is to multiply by P.

The inputs f and g are truncated to t coefficients, and the resulting f_n and g_n are truncated to t − n + 1 and t − n coefficients respectively. Often the caller (for example, jumpdivstepsx itself) wants more coefficients of P(f; g). Rather than performing an entirely new multiplication by P, one can add the truncation errors back into the results: multiply P by the f and g truncation errors, multiply P_2 by the f_j and g_j truncation errors, and skip the final truncations of f_n and g_n.

There are standard speedups for multiplication in the context of truncated products; sums of products (e.g., in matrix multiplication); partially known products (e.g., the coefficients of x^{−1}, x^{−2}, x^{−3}, ... in P_1(f; g) are known to be 0); repeated inputs (e.g., each entry of P_1 is multiplied by two entries of P_2 and by one of f, g); and outputs being reused as inputs. See, e.g., the discussions of "FFT addition", "FFT caching", and "FFT doubling" in [11].

Any choice of j between 1 and n − 1 produces correct results. Values of j close to the edges do not take advantage of fast polynomial multiplication, but this does not imply that j = ⌊n/2⌋ is optimal. The number of combinations of choices of j and other options above is small enough that, for each small (t, n) in turn, one can simply measure all options and select the best. The results depend on the context (which outputs are needed) and on the exact performance of polynomial multiplication.

6 Fast polynomial gcd computation and modular inversionThis section presents three algorithms gcdx computes gcdR0 R1 where R0 is a poly-nomial of degree d gt 0 and R1 is a polynomial of degree ltd gcddegree computes merelydeg gcdR0 R1 recipx computes the reciprocal of R1 modulo R0 if gcdR0 R1 = 1Figure 61 specifies all three algorithms and Theorem 62 justifies all three algorithmsThe theorem is proven in Appendix A

Each algorithm calls divstepsx from Section 5 to compute (δ2dminus1 f2dminus1 g2dminus1) =divstep2dminus1(1 f g) Here f = xdR0(1x) is the reversal of R0 and g = xdminus1R1(1x) Ofcourse one can replace divstepsx here with jumpdivstepsx which has the same outputsThe algorithms vary in the input-precision parameter t passed to divstepsx and vary inhow they use the outputs

bull gcddegree uses input precision t = 2dminus 1 to compute δ2dminus1 and then uses one partof Theorem 62 which states that deg gcdR0 R1 = δ2dminus12

bull gcdx uses precision t = 3dminus 1 to compute δ2dminus1 and d+ 1 coefficients of f2dminus1 Thealgorithm then uses another part of Theorem 62 which states that f2dminus1f2dminus1(0)is the reversal of gcdR0 R1

bull recipx uses t = 2dminus1 to compute δ2dminus1 1 coefficient of f2dminus1 and the product P oftransition matrices It then uses the last part of Theorem 62 which (in the coprimecase) states that the top-right corner of P is f2dminus1(0)x2dminus2 times the degree-(dminus 1)reversal of the desired reciprocal

One can generalize to compute gcdR0 R1 for any polynomials R0 R1 of degree ledin time that depends only on d scan the inputs to find the top degree and always perform


from divstepsx import divstepsx

def gcddegree(R0,R1):
  d = R0.degree()
  assert d > 0 and d > R1.degree()
  f,g = R0.reverse(d),R1.reverse(d-1)
  delta,f,g,P = divstepsx(2*d-1,2*d-1,1,f,g)
  return delta//2

def gcdx(R0,R1):
  d = R0.degree()
  assert d > 0 and d > R1.degree()
  f,g = R0.reverse(d),R1.reverse(d-1)
  delta,f,g,P = divstepsx(2*d-1,3*d-1,1,f,g)
  return f.reverse(delta//2)/f[0]

def recipx(R0,R1):
  d = R0.degree()
  assert d > 0 and d > R1.degree()
  f,g = R0.reverse(d),R1.reverse(d-1)
  delta,f,g,P = divstepsx(2*d-1,2*d-1,1,f,g)
  if delta != 0: return
  kx = f.parent()
  x = kx.gen()
  return kx(x^(2*d-2)*P[0][1]/f[0]).reverse(d-1)

Figure 6.1: Algorithm gcdx to compute gcd{R_0, R_1}; algorithm gcddegree to compute deg gcd{R_0, R_1}; algorithm recipx to compute the reciprocal of R_1 modulo R_0 when gcd{R_0, R_1} = 1. All three algorithms assume that R_0, R_1 ∈ k[x], that deg R_0 > 0, and that deg R_0 > deg R_1.

2d iterations of divstep. The extra iteration handles the case that both R_0 and R_1 have degree d. For simplicity we focus on the case deg R_0 = d > deg R_1 with d > 0.

One can similarly characterize gcd and modular reciprocal in terms of subsequent iterates of divstep, with δ increased by the number of extra steps. Implementors might, for example, prefer to use an iteration count that is divisible by a large power of 2.

Theorem 6.2. Let k be a field. Let d be a positive integer. Let R_0, R_1 be elements of the polynomial ring k[x] with deg R_0 = d > deg R_1. Define G = gcd{R_0, R_1}, and let V be the unique polynomial of degree < d − deg G such that V R_1 ≡ G (mod R_0). Define f = x^d R_0(1/x); g = x^{d−1} R_1(1/x); (δ_n, f_n, g_n) = divstep^n(1, f, g); T_n = T(δ_n, f_n, g_n); and (u_n, v_n; q_n, r_n) = T_{n−1} ··· T_0. Then

  deg G = δ_{2d−1}/2,
  G = x^{deg G} f_{2d−1}(1/x)/f_{2d−1}(0),
  V = x^{−d+1+deg G} v_{2d−1}(1/x)/f_{2d−1}(0).

6.3. A numerical example. Take k = F_7, d = 7, R_0 = 2x^7 + 7x^6 + 1x^5 + 8x^4 + 2x^3 + 8x^2 + 1x + 8, and R_1 = 3x^6 + 1x^5 + 4x^4 + 1x^3 + 5x^2 + 9x + 2. Then f = 2 + 7x + 1x^2 + 8x^3 + 2x^4 + 8x^5 + 1x^6 + 8x^7 and g = 3 + 1x + 4x^2 + 1x^3 + 5x^4 + 9x^5 + 2x^6. The gcdx algorithm computes (δ_13, f_13, g_13) = divstep^13(1, f, g) = (0, 2, 6); see Table 4.1


for all iterates of divstep on these inputs. Dividing f_13 by its constant coefficient produces 1, the reversal of gcd{R_0, R_1}. The gcddegree algorithm simply computes δ_13 = 0, which is enough information to see that gcd{R_0, R_1} = 1.

6.4. Constant-factor speedups. Compared to the general power-series setting in divstepsx, the inputs to gcddegree, gcdx, and recipx are more limited. For example, the f input is all 0 after the first d + 1 coefficients, so computations involving subsequent coefficients can be skipped. Similarly, the top-right entry of the matrix P in recipx is guaranteed to have coefficient 0 for 1, 1/x, 1/x^2, and so on through 1/x^{d−2}, so one can save time in the polynomial computations that produce this entry. See Theorems A.1 and A.2 for bounds on the sizes of all intermediate results.

6.5. Bottom-up gcd and inversion. Our division steps eliminate bottom coefficients of the inputs. Appendices B, C, and D relate this to a traditional Euclid–Stevin computation, which gradually eliminates top coefficients of the inputs. The reversal of inputs and outputs in gcddegree, gcdx, and recipx exchanges the role of top coefficients and bottom coefficients. The following theorem states an alternative: skip the reversal, and instead directly eliminate the bottom coefficients of the original inputs.

Theorem 6.6. Let k be a field. Let d be a positive integer. Let f, g be elements of the polynomial ring k[x] with f(0) ≠ 0, deg f ≤ d, and deg g < d. Define γ = gcd{f, g}. Define (δ_n, f_n, g_n) = divstep^n(1, f, g); T_n = T(δ_n, f_n, g_n); and (u_n, v_n; q_n, r_n) = T_{n−1} ··· T_0. Then

  deg γ = deg f_{2d−1},
  γ = f_{2d−1}/ℓ,
  x^{2d−2} γ ≡ x^{2d−2} v_{2d−1} g/ℓ (mod f),

where ℓ is the leading coefficient of f_{2d−1}.

Proof. Note that γ(0) ≠ 0, since f is a multiple of γ.

Define R_0 = x^d f(1/x). Then R_0 ∈ k[x]; deg R_0 = d since f(0) ≠ 0; and f = x^d R_0(1/x). Define R_1 = x^{d−1} g(1/x). Then R_1 ∈ k[x]; deg R_1 < d; and g = x^{d−1} R_1(1/x). Define G = gcd{R_0, R_1}. We will show below that γ = c x^{deg G} G(1/x) for some c ∈ k*.

Then γ = c f_{2d−1}/f_{2d−1}(0) by Theorem 6.2, i.e., γ is a constant multiple of f_{2d−1}. Hence deg γ = deg f_{2d−1} as claimed. Furthermore γ has leading coefficient 1, so 1 = cℓ/f_{2d−1}(0), i.e., γ = f_{2d−1}/ℓ as claimed.

By Theorem 4.2, f_{2d−1} = u_{2d−1} f + v_{2d−1} g. Also x^{2d−2} u_{2d−1}, x^{2d−2} v_{2d−1} ∈ k[x] by Theorem 4.3, so

  x^{2d−2} γ = (x^{2d−2} u_{2d−1}/ℓ) f + (x^{2d−2} v_{2d−1}/ℓ) g ≡ (x^{2d−2} v_{2d−1}/ℓ) g (mod f)

as claimed.

All that remains is to show that γ = c x^{deg G} G(1/x) for some c ∈ k*. If R_1 = 0 then G = gcd{R_0, 0} = R_0/R_0[d], so x^{deg G} G(1/x) = x^d R_0(1/x)/R_0[d] = f/f[0]; also g = 0, so γ = f/f[deg f] = (f[0]/f[deg f]) x^{deg G} G(1/x). Assume from now on that R_1 ≠ 0.

We have R_1 = QG for some Q ∈ k[x]. Note that deg R_1 = deg Q + deg G. Also R_1(1/x) = Q(1/x) G(1/x), so x^{deg R_1} R_1(1/x) = x^{deg Q} Q(1/x) x^{deg G} G(1/x) since deg R_1 = deg Q + deg G. Hence x^{deg G} G(1/x) divides the polynomial x^{deg R_1} R_1(1/x), which in turn divides x^{d−1} R_1(1/x) = g. Similarly x^{deg G} G(1/x) divides f. Hence x^{deg G} G(1/x) divides gcd{f, g} = γ.

Furthermore G = A R_0 + B R_1 for some A, B ∈ k[x]. Take any integer m above all of deg G, deg A, deg B; then

  x^{m+d−deg G} x^{deg G} G(1/x) = x^{m+d} G(1/x)
    = x^m A(1/x) x^d R_0(1/x) + x^{m+1} B(1/x) x^{d−1} R_1(1/x)
    = x^{m−deg A} x^{deg A} A(1/x) f + x^{m+1−deg B} x^{deg B} B(1/x) g,


so x^{m+d−deg G} x^{deg G} G(1/x) is a multiple of γ, so the polynomial γ/(x^{deg G} G(1/x)) divides x^{m+d−deg G}, so γ/(x^{deg G} G(1/x)) = c x^i for some c ∈ k* and some i ≥ 0. We must have i = 0, since γ(0) ≠ 0.

6.7. How to divide by x^{2d−2} modulo f. Unlike top-down inversion (and top-down gcd and bottom-up gcd), bottom-up inversion naturally scales its result by a power of x. We now explain various ways to remove this scaling.

For simplicity we focus here on the case gcd{f, g} = 1. The divstep computation in Theorem 6.6 naturally produces the polynomial x^{2d−2} v_{2d−1}/ℓ, which has degree at most d − 1 by Theorem 6.2. Our objective is to divide this polynomial by x^{2d−2} modulo f, obtaining the reciprocal of g modulo f by Theorem 6.6. One way to do this is to multiply by a reciprocal of x^{2d−2} modulo f. This reciprocal depends only on d and f, i.e., it can be precomputed before g is available.

Here are several standard strategies to compute 1/x^{2d−2} modulo f, or more generally 1/x^e modulo f for any nonnegative integer e:

• Compute the polynomial (1 − f/f(0))/x, which is 1/x modulo f, and then raise this to the eth power modulo f. This takes a logarithmic number of multiplications modulo f.

• Start from 1/f(0), the reciprocal of f modulo x. Compute the reciprocals of f modulo x^2, x^4, x^8, etc. by Newton (Hensel) iteration. After finding the reciprocal r of f modulo x^e, divide 1 − rf by x^e to obtain the reciprocal of x^e modulo f. This takes a constant number of full-size multiplications, a constant number of half-size multiplications, etc. The total number of multiplications is again logarithmic, but if e is linear, as in the case e = 2d − 2, then the total size of the multiplications is linear, without an extra logarithmic factor.

• Combine the powering strategy with the Hensel strategy. For example, use the reciprocal of f modulo x^{d−1} to compute the reciprocal of x^{d−1} modulo f, and then square modulo f to obtain the reciprocal of x^{2d−2} modulo f, rather than using the reciprocal of f modulo x^{2d−2}.

• Write down a simple formula for the reciprocal of x^e modulo f, if f has a special structure that makes this easy. For example, if f = x^d − 1, as in original NTRU, and d ≥ 3, then the reciprocal of x^{2d−2} is simply x^2. As another example, if f = x^d + x^{d−1} + ··· + 1 and d ≥ 5, then the reciprocal of x^{2d−2} is x^4.

In most cases the division by x^{2d−2} modulo f involves extra arithmetic that is avoided by the top-down reversal approach. On the other hand, the case f = x^d − 1 does not involve any extra arithmetic. There are also applications of inversion that can easily handle a scaled result.
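The following Sage sketch (ours, for illustration; not code from the software described later) shows the second strategy above: compute the reciprocal r of f modulo x^e by Newton (Hensel) iteration, then divide 1 − rf by x^e. The final assertion checks the f = x^d − 1 case from the last bullet, over k = F_3 with d = 7 and e = 2d − 2 = 12.

def recip_xe_mod_f(f, e):
    kx = f.parent()
    x = kx.gen()
    r = kx(1/f[0])                      # reciprocal of f modulo x
    prec = 1
    while prec < e:                     # Newton (Hensel) iteration doubles the precision
        prec = min(2*prec, e)
        r = (r*(2 - r*f)).truncate(prec)
    # now r*f is 1 modulo x^e, so 1 - r*f = x^e*s with deg s < deg f,
    # and x^e*s is 1 modulo f, i.e., s is the reciprocal of x^e modulo f
    return (1 - r*f)//x^e

k = GF(3)
kx.<x> = k[]
assert recip_xe_mod_f(x^7 - 1, 12) == x^2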

7 Software case study for polynomial modular inversion

As a concrete case study, we consider the problem of inversion in the ring (Z/3)[x]/(x^700 + x^699 + ··· + x + 1) on an Intel Haswell CPU core. The situation before our work was that this inversion consumed half of the key-generation time in the ntruhrss701 cryptosystem: about 150000 cycles out of 300000 cycles. This cryptosystem was introduced by Hülsing, Rijneveld, Schanck, and Schwabe [39] at CHES 2017.^11

^11 NTRU-HRSS is another round-2 submission in NIST's ongoing post-quantum competition. Google has also selected NTRU-HRSS for its "CECPQ2" post-quantum experiment. CECPQ2 includes a slightly modified version of ntruhrss701, and the NTRU-HRSS authors have also announced their intention to tweak NTRU-HRSS; these changes do not affect the performance of key generation.


We saved 60000 cycles out of these 150000 cycles, reducing the total key-generation time to 240000 cycles. Our approach also simplifies code, for example eliminating the series of conditional power-of-2 rotations in [39].

7.1. Details of the new computation. Our computation works with one integer δ and four polynomials v, r, f, g ∈ (Z/3)[x]. Initially δ = 1; v = 0; r = 1; f = x^700 + x^699 + ··· + x + 1; and g = g_699 x^699 + ··· + g_0 x^0 is obtained by reversing the 700 input coefficients. We actually store −δ instead of δ; this turns out to be marginally more efficient for this platform.

We then apply 2·700 − 1 = 1399 iterations of a main loop. Each iteration works as follows, with the order of operations determined experimentally to try to minimize latency issues:

• Replace v with xv.

• Compute a swap mask as −1 if δ > 0 and g(0) ≠ 0, otherwise 0.

• Compute c ∈ Z/3 as f(0)g(0).

• Replace δ with −δ if the swap mask is set.

• Add 1 to δ.

• Replace (f, g) with (g, f) if the swap mask is set.

• Replace g with (g − cf)/x.

• Replace (v, r) with (r, v) if the swap mask is set.

• Replace r with r − cv.

This multiplies the transition matrix T(δ, f, g) into the vector (v/x^{n−1}; r/x^n), where n is the number of iterations, and at the same time replaces (δ, f, g) with divstep(δ, f, g). Actually, we are slightly modifying divstep here (and making the corresponding modification to T), replacing the original coefficients f(0) and g(0) with the scaled coefficients 1 and g(0)/f(0) = f(0)g(0), as in Section 3.4. In other words, if f(0) = −1 then we negate both f and g. We suppress further comments on this deviation from the definition of divstep.

The effect of n iterations is to replace the original (δ, f, g) with divstep^n(δ, f, g). The matrix product T_{n−1} ··· T_0 has the form (···, v/x^{n−1}; ···, r/x^n). The new f and g are congruent to v/x^{n−1} and r/x^n respectively, times the original g, modulo x^700 + x^699 + ··· + x + 1.

After 1399 iterations we have δ = 0 if and only if the input is invertible modulo x^700 + x^699 + ··· + x + 1, by Theorem 6.2. Furthermore, if δ = 0, then the inverse is x^{−699} v_{1399}(1/x)/f(0), where v_{1399} = v/x^{1398}; i.e., the inverse is x^{699} v(1/x)/f(0). We thus multiply v by f(0) and take the coefficients of x^0, ..., x^699 in reverse order to obtain the desired inverse.

We use signed representatives −1, 0, 1 for elements of Z/3. We use 2-bit two's-complement representations of −1, 0, 1: the bottom bit is 1, 0, 1 respectively, and the top bit is 1, 0, 0. We wrote a simple superoptimizer to find optimal sequences of 3, 6, 6 bit operations respectively for multiplication, addition, and subtraction. We do not claim novelty for the approach in this paragraph: Boothby and Bradshaw in [20, Section 4.1] suggested the same 2-bit representation for Z/3 (without explaining it as signed two's-complement) and found the same operation counts by a similar search.
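For concreteness, here is one set of bitwise formulas achieving those operation counts, together with a brute-force check. This is our own Python illustration, not the superoptimizer's output or the vectorized C code; subtraction is analogous (negate b by replacing its top bit b1 with b0 XOR b1).

def mul3(a0, a1, b0, b1):       # 3 bit operations
    c0 = a0 & b0
    c1 = (a1 ^ b1) & c0
    return c0, c1

def add3(a0, a1, b0, b1):       # 6 bit operations
    t = a0 ^ b1
    c1 = t & (a1 ^ b0)
    c0 = (a0 ^ b0) | (a1 ^ t)
    return c0, c1

# encoding: -1 = (1,1), 0 = (0,0), 1 = (1,0) as (bottom bit, top bit);
# each variable above can equally well be a 256-bit vector of such bits
enc = {-1: (1, 1), 0: (0, 0), 1: (1, 0)}
dec = {v: k for k, v in enc.items()}
for a in (-1, 0, 1):
    for b in (-1, 0, 1):
        assert dec[mul3(*enc[a], *enc[b])] == (a*b + 1) % 3 - 1
        assert dec[add3(*enc[a], *enc[b])] == (a + b + 1) % 3 - 1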

We store each of the four polynomials v, r, f, g as six 256-bit vectors: for example, the coefficients of x^0, ..., x^255 in v are stored as a 256-bit vector of bottom bits and a 256-bit vector of top bits. Within each 256-bit vector, we use the first 64-bit word to


store the coefficients of x^0, x^4, ..., x^252; the next 64-bit word to store the coefficients of x^1, x^5, ..., x^253; etc. Multiplying by x thus moves the first 64-bit word to the second, moves the second to the third, moves the third to the fourth, moves the fourth to the first, and shifts the first by 1 bit; the bit that falls off the end of the first word is inserted into the next vector.

The first 256 iterations involve only the first 256-bit vectors for v and r, so we simply leave the second and third vectors as 0. The next 256 iterations involve only the first two 256-bit vectors for v and r. Similarly, we manipulate only the first 256-bit vectors for f and g in the final 256 iterations, and we manipulate only the first and second 256-bit vectors for f and g in the previous 256 iterations. Our main loop thus has five sections: 256 iterations involving (1, 1, 3, 3) vectors for (v, r, f, g); 256 iterations involving (2, 2, 3, 3) vectors; 375 iterations involving (3, 3, 3, 3) vectors; 256 iterations involving (3, 3, 2, 2) vectors; and 256 iterations involving (3, 3, 1, 1) vectors. This makes the code longer but saves almost 20% of the vector manipulations.

7.2. Comparison to the old computation. Many features of our computation are also visible in the software from [39]. We emphasize the ways that our computation is more streamlined.

There are essentially the same four polynomials v, r, f, g in [39] (labeled "c", "b", "g", "f"). Instead of a single integer δ (plus a loop counter) there are four integers "degf", "degg", "k", and "done" (plus a loop counter). One cannot simply eliminate "degf" and "degg" in favor of their difference δ, since "done" is set when "degf" reaches 0, and "done" influences "k", which in turn influences the results of the computation: the final result is multiplied by x^{701−k}.

Multiplication by x^{701−k} is a conceptually simple matter of rotating by k positions within 701 positions and then reducing modulo x^700 + x^699 + ··· + x + 1, since x^701 − 1 is a multiple of x^700 + x^699 + ··· + x + 1. However, since k is a variable, the obvious method of rotating by k positions does not take constant time. [39] instead performs conditional rotations by 1 position, 2 positions, 4 positions, etc., where the condition bits are the bits of k.

The inputs and outputs are not reversed in [39]. This affects the power of x multiplied into the final results. Our approach skips the powers of x entirely.

Each polynomial in [39] is represented as six 256-bit vectors, with the five-section pattern skipping manipulations of some vectors as explained above. The order of coefficients in each vector is simply x^0, x^1, ...; multiplication by x in [39] uses more arithmetic instructions than our approach does.

At a lower level, [39] uses unsigned representatives 0, 1, 2 of Z/3, and uses a manually designed sequence of 19 bit operations for a multiply-add operation in Z/3. Langley [43] suggested a sequence of 16 bit operations, or 14 if an "and-not" operation is available. As in [20], we use just 9 bit operations.

The inversion software from [39] is 798 lines (32273 bytes) of Python generating assembly, producing 26634 bytes of object code. Our inversion software is 541 lines (14675 bytes) of C with intrinsics, compiled with gcc 8.2.1 to generate assembly, producing 10545 bytes of object code.

7.3. More NTRU examples. More than 1/3 of the key-generation time in [39] is consumed by a second inversion, which replaces the modulus 3 with 2^13. This is handled in [39] with an initial inversion modulo 2 followed by 8 multiplications modulo 2^13 for a Newton iteration.^12 The initial inversion modulo 2 is handled in [39] by exponentiation,

^12 This use of Newton iteration follows the NTRU key-generation strategy suggested in [61], but we point out a difference in the details. All 8 multiplications in [39] are carried out modulo 2^13, while [61] instead follows the traditional precision doubling in Newton iteration: compute an inverse modulo 2^2, then 2^4, then 2^8 (or 2^7), then 2^13. Perhaps limiting the precision of the first 6 multiplications would save time in [39].


taking just 10332 cycles; the point here is that squaring modulo 2 is particularly efficient.

Much more inversion time is spent in key generation in sntrup4591761, a slightly larger cryptosystem from the NTRU Prime [14] family.^13 NTRU Prime uses a prime modulus, such as 4591 for sntrup4591761, to avoid concerns that a power-of-2 modulus could allow attacks (see generally [14]), but this modulus also makes arithmetic slower. Before our work, the key-generation software for sntrup4591761 took 6 million cycles, partly for inversion in (Z/3)[x]/(x^761 − x − 1) but mostly for inversion in (Z/4591)[x]/(x^761 − x − 1). We reimplemented both of these inversion steps, reducing the total key-generation time to just 940852 cycles.^14 Almost all of our inversion code modulo 3 is shared between sntrup4591761 and ntruhrss701.
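As an illustration of the precision-doubling strategy mentioned in footnote 12 (our own sketch, not the code of [39] or [61]): given an odd integer a, its inverse modulo 2 is 1, and each Newton step doubles the number of correct bottom bits.

def invert_mod_2_13(a):
    assert a & 1
    x = 1                           # inverse of a modulo 2
    for bits in (2, 4, 8, 13):      # precision doubling, capped at 13 bits
        x = x*(2 - a*x) % 2**bits
    return x

assert 4591 * invert_mod_2_13(4591) % 2**13 == 1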

7.4. Does lattice-based cryptography need fast inversion? Lyubashevsky, Peikert, and Regev [46] published an NTRU alternative that avoids inversions.^15 However, this cryptosystem has slower encryption than NTRU, has slower decryption than NTRU, and appears to be covered by a 2010 patent [35] by Gaborit and Aguilar-Melchor. It is easy to envision applications that will (1) select NTRU and (2) benefit from faster inversions in NTRU key generation.^16

Lyubashevsky and Seiler have very recently announced "NTTRU" [47], a variant of NTRU with a ring chosen to allow very fast NTT-based (FFT-based) multiplication and division. This variant triggers many of the security concerns described in [14], but if the variant survives security analysis then it will eliminate concerns about key-generation speed in NTRU.

8 Definition of 2-adic division steps

Define divstep : Z × Z_2^* × Z_2 → Z × Z_2^* × Z_2 as follows:

  divstep(δ, f, g) =
    (1 − δ, g, (g − f)/2)             if δ > 0 and g is odd,
    (1 + δ, f, (g + (g mod 2)f)/2)    otherwise.

Note that g − f is even in the first case (since both f and g are odd), and g + (g mod 2)f is even in the second case (since f is odd).
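For integer inputs (with f odd), the definition can be transcribed directly; the following Python sketch is ours, for illustration only:

def divstep(delta, f, g):
    assert f & 1
    if delta > 0 and g & 1:
        return 1 - delta, g, (g - f)//2        # g - f is even: both f and g are odd
    return 1 + delta, f, (g + (g & 1)*f)//2    # g + (g mod 2)*f is even: f is odd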

^13 NTRU Prime is another round-2 NIST submission. One other NTRU-based encryption system, NTRUEncrypt [29], was submitted to the competition but has been merged into NTRU-HRSS for round 2; see [59]. For completeness we mention that NTRUEncrypt key generation took 980760 cycles for ntrukem443 and 2639756 cycles for ntrukem743. As far as we know, the NTRUEncrypt software was not designed to run in constant time.

^14 This is a median cycle count from the SUPERCOP benchmarking framework. There is considerable variation in the cycle counts, more than 3% between quartiles, presumably depending on the mapping from virtual addresses to physical addresses. There is no dependence on the secret input being inverted.

^15 The LPR cryptosystem is the basis for several round-2 NIST submissions: Kyber, LAC, NewHope, Round5, Saber, ThreeBears, and the "NTRU LPRime" option in NTRU Prime.

^16 Google, for example, selected NTRU-HRSS for CECPQ2, as noted above, with the following explanation: "Schemes with a quotient-style key (like HRSS) will probably have faster encap/decap operations at the cost of much slower key-generation. Since there will be many uses outside TLS where keys can be reused, this is interesting as long as the key-generation speed is still reasonable for TLS." See [42].

8.1. Transition matrices. Write (δ_1, f_1, g_1) = divstep(δ, f, g). Then

  (f_1; g_1) = T(δ, f, g) (f; g)   and   (1; δ_1) = S(δ, f, g) (1; δ),


where T : Z × Z_2^* × Z_2 → M_2(Z[1/2]) is defined by

  T(δ, f, g) =
    ( 0      1   )
    ( −1/2   1/2 )                if δ > 0 and g is odd,

    ( 1             0   )
    ( (g mod 2)/2   1/2 )         otherwise,

and S : Z × Z_2^* × Z_2 → M_2(Z) is defined by

  S(δ, f, g) =
    ( 1    0 )
    ( 1   −1 )                    if δ > 0 and g is odd,

    ( 1    0 )
    ( 1    1 )                    otherwise.

Note that both T(δ, f, g) and S(δ, f, g) are defined entirely by δ, the bottom bit of f (which is always 1), and the bottom bit of g; they do not depend on the remaining bits of f and g.

8.2. Decomposition. This 2-adic divstep, like the power-series divstep, is a composition of two simpler functions. Specifically, it is a conditional swap that replaces (δ, f, g) with (−δ, g, −f) if δ > 0 and g is odd, followed by an elimination that replaces (δ, f, g) with (1 + δ, f, (g + (g mod 2)f)/2).

8.3. Sizes. Write (δ_1, f_1, g_1) = divstep(δ, f, g). If f and g are integers that fit into n + 1 bits two's-complement, in the sense that −2^n ≤ f < 2^n and −2^n ≤ g < 2^n, then also −2^n ≤ f_1 < 2^n and −2^n ≤ g_1 < 2^n. Similarly, if −2^n < f < 2^n and −2^n < g < 2^n, then −2^n < f_1 < 2^n and −2^n < g_1 < 2^n. Beware, however, that intermediate results such as g − f need an extra bit.

As in the polynomial case, an important feature of divstep is that iterating divstep on integer inputs eventually sends g to 0. Concretely, we will show that (D + o(1))n iterations are sufficient, where D = 2/(10 − log_2 633) = 2.882100569...; see Theorem 11.2 for exact bounds. The proof technique cannot do better than (d + o(1))n iterations, where d = 14/(15 − log_2(561 + √249185)) = 2.828339631.... Numerical evidence is consistent with the possibility that (d + o(1))n iterations are required in the worst case, which is what matters in the context of constant-time algorithms.

8.4. A non-functioning variant. Many variations are possible in the definition of divstep. However, some care is required, as the following variant illustrates.

Define posdivstep : Z × Z_2^* × Z_2 → Z × Z_2^* × Z_2 as follows:

  posdivstep(δ, f, g) =
    (1 − δ, g, (g + f)/2)             if δ > 0 and g is odd,
    (1 + δ, f, (g + (g mod 2)f)/2)    otherwise.

This is just like divstep except that it eliminates the negation of f in the swapped case. It is not true that iterating posdivstep on integer inputs eventually sends g to 0. For example, posdivstep^2(1, 1, 1) = (1, 1, 1). For comparison, in the polynomial case we were free to negate f (or g) at any moment.

8.5. A centered variant. As another variant of divstep, we define cdivstep : Z × Z_2^* × Z_2 → Z × Z_2^* × Z_2 as follows:

  cdivstep(δ, f, g) =
    (1 − δ, g, (f − g)/2)             if δ > 0 and g is odd,
    (1 + δ, f, (g + (g mod 2)f)/2)    if δ = 0,
    (1 + δ, f, (g − (g mod 2)f)/2)    otherwise.


Iterating cdivstep produces the same intermediate results as a variable-time gcd algorithm introduced by Stehlé and Zimmermann in [62]. The analogous gcd algorithm for divstep is a new variant of the Stehlé–Zimmermann algorithm, and we analyze the performance of this variant to prove bounds on the performance of divstep; see Appendices E, F, and G. As an analogy, one of our proofs in the polynomial case relies on relating x-adic divstep to the Euclid–Stevin algorithm.

The extra case distinctions make each cdivstep iteration more complicated than each divstep iteration. Similarly, the Stehlé–Zimmermann algorithm is more complicated than the new variant. To understand the motivation for cdivstep, compare divstep^3(−2, f, g) to cdivstep^3(−2, f, g). One has divstep^3(−2, f, g) = (1, f, g_3) where

  8g_3 ∈ {g, g + f, g + 2f, g + 3f, g + 4f, g + 5f, g + 6f, g + 7f}.

One has cdivstep^3(−2, f, g) = (1, f, g′_3) where

  8g′_3 ∈ {g − 3f, g − 2f, g − f, g, g + f, g + 2f, g + 3f, g + 4f}.

It seems intuitively clear that the "centered" output g′_3 is smaller than the "uncentered" output g_3, that more generally cdivstep has smaller outputs than divstep, and that cdivstep thus uses fewer iterations than divstep.

However, our analyses do not support this intuition. All available evidence indicates that, surprisingly, the worst case of cdivstep uses almost 10% more iterations than the worst case of divstep. Concretely, our proof technique cannot do better than (c + o(1))n iterations for cdivstep, where c = 2/(log_2(√17 − 1) − 1) = 3.110510062..., and numerical evidence is consistent with the possibility that (c + o(1))n iterations are required.

8.6. A plus-or-minus variant. As yet another example, the following variant of divstep is essentially^17 the Brent–Kung "Algorithm PM" from [24]:

  pmdivstep(δ, f, g) =
    (−δ, g, (f − g)/2)      if δ ≥ 0 and (g − f) mod 4 = 0,
    (−δ, g, (f + g)/2)      if δ ≥ 0 and (g + f) mod 4 = 0,
    (δ, f, (g − f)/2)       if δ < 0 and (g − f) mod 4 = 0,
    (δ, f, (g + f)/2)       if δ < 0 and (g + f) mod 4 = 0,
    (1 + δ, f, g/2)         otherwise.

There are even more cases here than in cdivstep, although splitting out an initial conditional swap reduces the number of cases to 3. Another complication here is that the linear combinations used in pmdivstep depend on the bottom two bits of f and g, rather than just the bottom bit.

To understand the motivation for pmdivstep, note that after each addition or subtraction there are always at least two halvings, as in the "sliding window" approach to exponentiation. Again it seems intuitively clear that this reduces the number of iterations required. Consider, however, the following example (which is from [24, Theorem 3], modulo minor details): pmdivstep needs 3b − 1 iterations to reach g = 0 starting from (1, 1, 3·2^{b−2}). Specifically, the first b − 2 iterations produce

  (2, 1, 3·2^{b−3}), (3, 1, 3·2^{b−4}), ..., (b − 2, 1, 3·2), (b − 1, 1, 3),

and then the next 2b + 1 iterations produce

  (1 − b, 3, 2), (2 − b, 3, 1), (2 − b, 3, 2), (3 − b, 3, 1), (3 − b, 3, 2), ...,
  (−1, 3, 1), (−1, 3, 2), (0, 3, 1), (0, 1, 2), (1, 1, 1), (−1, 1, 0).

^17 The Brent–Kung algorithm has an extra loop to allow the case that both inputs f, g are even. Compare [24, Figure 4] to the definition of pmdivstep.


Brent and Kung prove that ⌈cn⌉ systolic cells are enough for their algorithm, where c = 3.110510062... is the constant mentioned above, and they conjecture that (3 + o(1))n cells are enough, so presumably (3 + o(1))n iterations of pmdivstep are enough. Our analysis shows that fewer iterations are sufficient for divstep, even though divstep is simpler than pmdivstep.

9 Iterates of 2-adic division steps

The following results are analogous to the x-adic results in Section 4. We state these results separately to support verification.

Starting from (δ, f, g) ∈ Z × Z_2^* × Z_2, define (δ_n, f_n, g_n) = divstep^n(δ, f, g) ∈ Z × Z_2^* × Z_2, T_n = T(δ_n, f_n, g_n) ∈ M_2(Z[1/2]), and S_n = S(δ_n, f_n, g_n) ∈ M_2(Z).

Theorem 9.1. (f_n; g_n) = T_{n−1} ··· T_m (f_m; g_m) and (1; δ_n) = S_{n−1} ··· S_m (1; δ_m) if n ≥ m ≥ 0.

Proof. This follows from (f_{m+1}; g_{m+1}) = T_m (f_m; g_m) and (1; δ_{m+1}) = S_m (1; δ_m).

Theorem 9.2. T_{n−1} ··· T_m ∈ ( (1/2^{n−m−1})Z  (1/2^{n−m−1})Z ; (1/2^{n−m})Z  (1/2^{n−m})Z ) if n > m ≥ 0.

Proof. As in Theorem 4.3.

Theorem 9.3. S_{n−1} ··· S_m ∈ ( 1  0 ; {2 − (n − m), ..., n − m − 2, n − m}  {1, −1} ) if n > m ≥ 0.

Proof. As in Theorem 4.4.

Theorem 9.4. Let m, t be nonnegative integers. Assume that δ′_m = δ_m, f′_m ≡ f_m (mod 2^t), and g′_m ≡ g_m (mod 2^t). Then, for each integer n with m < n ≤ m + t: T′_{n−1} = T_{n−1}; S′_{n−1} = S_{n−1}; δ′_n = δ_n; f′_n ≡ f_n (mod 2^{t−(n−m−1)}); and g′_n ≡ g_n (mod 2^{t−(n−m)}).

Proof. Fix m, t and induct on n. Observe that (δ′_{n−1}, f′_{n−1} mod 2, g′_{n−1} mod 2) equals (δ_{n−1}, f_{n−1} mod 2, g_{n−1} mod 2):

• If n = m + 1: By assumption f′_m ≡ f_m (mod 2^t), and n ≤ m + t, so t ≥ 1, so in particular f′_m mod 2 = f_m mod 2. Similarly g′_m mod 2 = g_m mod 2. By assumption δ′_m = δ_m.

• If n > m + 1: f′_{n−1} ≡ f_{n−1} (mod 2^{t−(n−m−2)}) by the inductive hypothesis, and n ≤ m + t, so t − (n − m − 2) ≥ 2, so in particular f′_{n−1} mod 2 = f_{n−1} mod 2; similarly g′_{n−1} ≡ g_{n−1} (mod 2^{t−(n−m−1)}) by the inductive hypothesis, and t − (n − m − 1) ≥ 1, so in particular g′_{n−1} mod 2 = g_{n−1} mod 2; and δ′_{n−1} = δ_{n−1} by the inductive hypothesis.

Now use the fact that T and S inspect only the bottom bit of each number to see that

  T′_{n−1} = T(δ′_{n−1}, f′_{n−1}, g′_{n−1}) = T(δ_{n−1}, f_{n−1}, g_{n−1}) = T_{n−1},
  S′_{n−1} = S(δ′_{n−1}, f′_{n−1}, g′_{n−1}) = S(δ_{n−1}, f_{n−1}, g_{n−1}) = S_{n−1},

as claimed.

Multiply (1; δ′_{n−1}) = (1; δ_{n−1}) by S′_{n−1} = S_{n−1} to see that δ′_n = δ_n as claimed.


def truncate(f,t):
  if t == 0: return 0
  twot = 1<<(t-1)
  return ((f+twot)&(2*twot-1))-twot

def divsteps2(n,t,delta,f,g):
  assert t >= n and n >= 0
  f,g = truncate(f,t),truncate(g,t)
  u,v,q,r = 1,0,0,1

  while n > 0:
    f = truncate(f,t)
    if delta > 0 and g&1: delta,f,g,u,v,q,r = -delta,g,-f,q,r,-u,-v
    g0 = g&1
    delta,g,q,r = 1+delta,(g+g0*f)/2,(q+g0*u)/2,(r+g0*v)/2
    n,t = n-1,t-1
    g = truncate(ZZ(g),t)

  M2Q = MatrixSpace(QQ,2)
  return delta,f,g,M2Q((u,v,q,r))

Figure 10.1: Algorithm divsteps2 to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). Inputs: n, t ∈ Z with 0 ≤ n ≤ t; δ ∈ Z; at least bottom t bits of f, g ∈ Z_2. Outputs: δ_n; bottom t bits of f_n if n = 0, or t − (n − 1) bits if n ≥ 1; bottom t − n bits of g_n; T_{n−1} ··· T_0.

By the inductive hypothesis, (T′_m, ..., T′_{n−2}) = (T_m, ..., T_{n−2}), and T′_{n−1} = T_{n−1}, so T′_{n−1} ··· T′_m = T_{n−1} ··· T_m. Abbreviate T_{n−1} ··· T_m as P. Then (f′_n; g′_n) = P (f′_m; g′_m) and (f_n; g_n) = P (f_m; g_m) by Theorem 9.1, so

  (f′_n − f_n; g′_n − g_n) = P (f′_m − f_m; g′_m − g_m).

By Theorem 9.2, P ∈ ( (1/2^{n−m−1})Z  (1/2^{n−m−1})Z ; (1/2^{n−m})Z  (1/2^{n−m})Z ). By assumption, f′_m − f_m and g′_m − g_m are multiples of 2^t. Multiply to see that f′_n − f_n is a multiple of 2^t/2^{n−m−1} = 2^{t−(n−m−1)} as claimed, and g′_n − g_n is a multiple of 2^t/2^{n−m} = 2^{t−(n−m)} as claimed.

10 Fast computation of iterates of 2-adic division steps

We describe, just as in Section 5, two algorithms to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). We are given the bottom t bits of f, g, and compute f_n, g_n to t − n + 1, t − n bits respectively if n ≥ 1; see Theorem 9.4. We also compute the product of the relevant T matrices. We specify these algorithms in Figures 10.1–10.2 as functions divsteps2 and jumpdivsteps2 in Sage.

The first function, divsteps2, simply takes divsteps one after another, analogously to divstepsx. The second function, jumpdivsteps2, uses a straightforward divide-and-conquer strategy, analogously to jumpdivstepsx. More generally, j in jumpdivsteps2 can be taken anywhere between 1 and n − 1; this generalization includes divsteps2 as a special case.

As in Section 5, the Sage functions are not constant-time, but it is easy to build constant-time versions of the same computations. The comments on performance in Section 5 are applicable here, with integer arithmetic replacing polynomial arithmetic. The same basic optimization techniques are also applicable here.


from divsteps2 import divsteps2,truncate

def jumpdivsteps2(n,t,delta,f,g):
  assert t >= n and n >= 0

  if n <= 1: return divsteps2(n,t,delta,f,g)

  j = n//2

  delta,f1,g1,P1 = jumpdivsteps2(j,j,delta,f,g)
  f,g = P1*vector((f,g))
  f,g = truncate(ZZ(f),t-j),truncate(ZZ(g),t-j)

  delta,f2,g2,P2 = jumpdivsteps2(n-j,n-j,delta,f,g)
  f,g = P2*vector((f,g))
  f,g = truncate(ZZ(f),t-n+1),truncate(ZZ(g),t-n)

  return delta,f,g,P2*P1

Figure 10.2: Algorithm jumpdivsteps2 to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). Same inputs and outputs as in Figure 10.1.

11 Fast integer gcd computation and modular inversion

This section presents two algorithms. If f is an odd integer and g is an integer then gcd2 computes gcd{f, g}. If also gcd{f, g} = 1 then recip2 computes the reciprocal of g modulo f. Figure 11.1 specifies both algorithms, and Theorem 11.2 justifies both algorithms.

The algorithms take the smallest nonnegative integer d such that |f| < 2^d and |g| < 2^d. More generally, one can take d as an extra input; the algorithms then take constant time if d is constant. Theorem 11.2 relies on f^2 + 4g^2 ≤ 5·2^{2d} (which is certainly true if |f| < 2^d and |g| < 2^d) but does not rely on d being minimal. One can further generalize to allow even f, by first finding the number of shared powers of 2 in f and g and then reducing to the odd case.

The algorithms then choose a positive integer m as a particular function of d, and call divsteps2 (which can be transparently replaced with jumpdivsteps2) to compute (δ_m, f_m, g_m) = divstep^m(1, f, g). The choice of m guarantees g_m = 0 and f_m = ± gcd{f, g} by Theorem 11.2.

The gcd2 algorithm uses input precision m + d to obtain d + 1 bits of f_m, i.e., to obtain the signed remainder of f_m modulo 2^{d+1}. This signed remainder is exactly f_m = ± gcd{f, g}, since the assumptions |f| < 2^d and |g| < 2^d imply |gcd{f, g}| < 2^d.

The recip2 algorithm uses input precision just m + 1 to obtain the signed remainder of f_m modulo 4. This algorithm assumes gcd{f, g} = 1, so f_m = ±1, so the signed remainder is exactly f_m. The algorithm also extracts the top-right corner v_m of the transition matrix, and multiplies by f_m, obtaining a reciprocal of g modulo f by Theorem 11.2.

Both gcd2 and recip2 are bottom-up algorithms, and in particular recip2 naturally obtains this reciprocal v_m f_m as a fraction with denominator 2^{m−1}. It multiplies by 2^{m−1} to obtain an integer, and then multiplies by ((f + 1)/2)^{m−1} modulo f to divide by 2^{m−1} modulo f. The reciprocal of 2^{m−1} can be computed ahead of time. Another way to compute this reciprocal is through Hensel lifting to compute f^{−1} (mod 2^{m−1}); for details and further techniques see Section 6.7.

We emphasize that recip2 assumes that its inputs are coprime; it does not notice if this assumption is violated. One can check the assumption by computing f_m to higher precision (as in gcd2), or by multiplying the supposed reciprocal by g and seeing whether


from divsteps2 import divsteps2

def iterations(d):
  return (49*d+80)//17 if d<46 else (49*d+57)//17

def gcd2(f,g):
  assert f & 1
  d = max(f.nbits(),g.nbits())
  m = iterations(d)
  delta,fm,gm,P = divsteps2(m,m+d,1,f,g)
  return abs(fm)

def recip2(f,g):
  assert f & 1
  d = max(f.nbits(),g.nbits())
  m = iterations(d)
  precomp = Integers(f)((f+1)/2)^(m-1)
  delta,fm,gm,P = divsteps2(m,m+1,1,f,g)
  V = sign(fm)*ZZ(P[0][1]*2^(m-1))
  return ZZ(V*precomp)

Figure 11.1: Algorithm gcd2 to compute gcd{f, g}; algorithm recip2 to compute the reciprocal of g modulo f when gcd{f, g} = 1. Both algorithms assume that f is odd.

the result is 1 modulo f. For comparison, the recipx algorithm from Section 6 does notice if its polynomial inputs are coprime, using a quick test of δ_m.
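A small usage sketch in Sage, assuming divsteps2 from Figure 10.1 and the functions of Figure 11.1 have been loaded:

assert gcd2(21, 15) == 3
assert recip2(7, 3) == 5                    # 3*5 = 15 is congruent to 1 modulo 7
assert recip2(2^255 - 19, 2) == 2^254 - 9   # inverse of 2 modulo the Curve25519 prime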

Theorem 11.2. Let f be an odd integer. Let g be an integer. Define (δ_n, f_n, g_n) = divstep^n(1, f, g); T_n = T(δ_n, f_n, g_n); and (u_n, v_n; q_n, r_n) = T_{n−1} ··· T_0. Let d be a real number. Assume that f^2 + 4g^2 ≤ 5·2^{2d}. Let m be an integer. Assume that m ≥ ⌊(49d + 80)/17⌋ if d < 46, and that m ≥ ⌊(49d + 57)/17⌋ if d ≥ 46. Then m ≥ 1; g_m = 0; f_m = ± gcd{f, g}; 2^{m−1} v_m ∈ Z; and 2^{m−1} v_m g ≡ 2^{m−1} f_m (mod f).

In particular, if gcd{f, g} = 1, then f_m = ±1, and dividing 2^{m−1} v_m f_m by 2^{m−1} modulo f produces the reciprocal of g modulo f.

Proof. First f^2 ≥ 1, so 1 ≤ f^2 + 4g^2 ≤ 5·2^{2d} < 2^{2d+7/3} since 5^3 < 2^7. Hence d > −7/6. If d < 46 then m ≥ ⌊(49d + 80)/17⌋ ≥ ⌊(49(−7/6) + 80)/17⌋ = 1. If d ≥ 46 then m ≥ ⌊(49d + 57)/17⌋ ≥ ⌊(49·46 + 57)/17⌋ = 135. Either way m ≥ 1 as claimed.

Define R_0 = f, R_1 = 2g, and G = gcd{R_0, R_1}. Then G = gcd{f, g} since f is odd. Define b = log_2 √(R_0^2 + R_1^2) = log_2 √(f^2 + 4g^2) ≥ 0. Then 2^{2b} = f^2 + 4g^2 ≤ 5·2^{2d}, so b ≤ d + log_4 5. We now split into three cases, in each case defining n as in Theorem G.6:

• If b ≤ 21: Define n = ⌊19b/7⌋. Then n ≤ ⌊49b/17⌋ ≤ ⌊49(d + log_4 5)/17⌋ ≤ ⌊(49d + 57)/17⌋ ≤ m, since 5^49 ≤ 4^57.

• If 21 < b ≤ 46: Define n = ⌊(49b + 23)/17⌋. If d ≥ 46 then n ≤ ⌊(49d + 23)/17⌋ ≤ ⌊(49d + 57)/17⌋ ≤ m. Otherwise d < 46, so n ≤ ⌊(49(d + log_4 5) + 23)/17⌋ ≤ ⌊(49d + 80)/17⌋ ≤ m.

• If b > 46: Define n = ⌊49b/17⌋. Then n ≤ ⌊49b/17⌋ ≤ ⌊49(d + log_4 5)/17⌋ ≤ ⌊(49d + 57)/17⌋ ≤ m.


In all cases n ≤ m.

Now divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0} by Theorem G.6, so divstep^m(1, R_0, R_1/2) ∈ Z × Z × {0}, so

  (δ_m, f_m, g_m) = divstep^m(1, f, g) = divstep^m(1, R_0, R_1/2) ∈ Z × {G, −G} × {0}

by Theorem G.3. Hence g_m = 0 as claimed, and f_m = ±G = ± gcd{f, g} as claimed.

Both u_m and v_m are in (1/2^{m−1})Z by Theorem 9.2. In particular 2^{m−1} v_m ∈ Z as claimed. Finally, f_m = u_m f + v_m g by Theorem 9.1, so 2^{m−1} f_m = 2^{m−1} u_m f + 2^{m−1} v_m g, and 2^{m−1} u_m ∈ Z, so 2^{m−1} f_m ≡ 2^{m−1} v_m g (mod f) as claimed.

12 Software case study for integer modular inversion

As emphasized in Section 1, we have selected an inversion problem to be extremely favorable to Fermat's method. We use Intel platforms with 64-bit multipliers. We focus on the problem of inverting modulo the well-known Curve25519 prime p = 2^255 − 19. There has been a long line of Curve25519 implementation papers starting with [10] in 2006; we compare to the latest speed records [54] (in assembly) for inversion modulo this prime.

Despite all these Fermat advantages, we achieve better speeds using our 2-adic division steps. As mentioned in Section 1, we take 10050 cycles, 8778 cycles, and 8543 cycles on Haswell, Skylake, and Kaby Lake, where [54] used 11854 cycles, 9301 cycles, and 8971 cycles. This section looks more closely at how the new computation works.

12.1. Details of the new computation. Theorem 11.2 guarantees that 738 divstep iterations suffice, since ⌊(49·255 + 57)/17⌋ = 738. To simplify the computation we actually use 744 = 12·62 iterations.

The decisions in the first 62 iterations are determined entirely by the bottom 62 bits of the inputs. We perform these iterations entirely in 64-bit words, apply the resulting 62-step transition matrix to the entire original inputs to jump to the result of 62 divstep iterations, and then repeat.
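The following Python sketch (ours; the production code is vectorized assembly) models one such jump: it computes 62 divsteps using only the bottom 62 bits of f and g, and returns δ together with the scaled 62-step transition matrix 2^62 · T_61 ··· T_0, whose entries are integers.

def jump62(delta, f, g):
    f, g = f % 2**62, g % 2**62        # only the bottom 62 bits influence the decisions
    u, v, q, r = 1, 0, 0, 1            # entries of 2^i times the transition-matrix product
    for _ in range(62):
        if delta > 0 and g & 1:
            delta, f, g, u, v, q, r = -delta, g, -f, q, r, -u, -v
        g0 = g & 1
        delta, g = 1 + delta, (g + g0*f) >> 1
        q, r = q + g0*u, r + g0*v
        u, v = 2*u, 2*v
    return delta, (u, v, q, r)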

This is similar in two important ways to Lehmer's gcd computation from [44]. One of Lehmer's ideas, mentioned before, is to carry out some steps in limited precision. Another of Lehmer's ideas is for this limit to be a small constant. Our advantage over Lehmer's computation is that we have a completely regular constant-time data flow.

There are many other possibilities for which precision to use at each step. All of these possibilities can be described using the jumping framework described in Section 5.3, which applies to both the x-adic case and the 2-adic case. Figure 12.2 gives three examples of jump strategies. The first strategy computes each divstep in turn to full precision, as in the case study in Section 7. The second strategy is what we use in this case study. The third strategy illustrates an asymptotically faster choice of jump distances. The third strategy might be more efficient than the second strategy in hardware, but the second strategy seems optimal for 64-bit CPUs.

In more detail, we invert a 255-bit x modulo p as follows:

1. Set f = p, g = x, δ = 1, i = 1.

2. Set f_0 = f (mod 2^64), g_0 = g (mod 2^64).

3. Compute (δ′, f_1, g_1) = divstep^62(δ, f_0, g_0), and obtain a scaled transition matrix T_i such that (T_i/2^62)(f_0; g_0) = (f_1; g_1). The 63-bit signed entries of T_i fit into 64-bit registers. We call this step jump64divsteps2.

4. Compute (f, g) ← T_i(f, g)/2^62. Set δ = δ′.


Figure 12.2: Three jump strategies for 744 divstep iterations on 255-bit integers. Left strategy: j = 1, i.e., computing each divstep in turn to full precision. Middle strategy: j = 1 for ≤62 requested iterations, else j = 62; i.e., using 62 bottom bits to compute 62 iterations, jumping 62 iterations, using 62 new bottom bits to compute 62 more iterations, etc. Right strategy: j = 1 for 2 requested iterations; j = 2 for 3 or 4 requested iterations; j = 4 for 5, 6, 7, or 8 requested iterations; etc. Vertical axis: 0 on top through 744 on bottom, number of iterations. Horizontal axis (within each strategy): 0 on left through 254 on right, bit positions used in computation.

5. Set i ← i + 1. Go back to step 3 if i ≤ 12.

6. Compute v mod p, where v is the top-right corner of T_12 T_11 ··· T_1:

(a) Compute pair-products T_{2i} T_{2i−1}, with entries being 126-bit signed integers, each fitting into two 64-bit limbs (radix 2^64).

(b) Compute quad-products T_{4i} T_{4i−1} T_{4i−2} T_{4i−3}, with entries being 252-bit signed integers (four 64-bit limbs, radix 2^64).

(c) At this point the three quad-products are converted into unsigned integers modulo p, divided into 4 64-bit limbs.

(d) Compute the final vector-matrix-vector multiplications modulo p.

7. Compute x^{−1} = sgn(f) · v · 2^{−744} (mod p), where 2^{−744} is precomputed.

12.3. Other CPUs. Our advantage is larger on platforms with smaller multipliers. For example, we have written software to invert modulo 2^255 − 19 on an ARM Cortex-A7, obtaining a median cycle count of 35277 cycles. The best previous Cortex-A7 result we have found in the literature is 62648 cycles, reported by Fujii and Aranha [34] using Fermat's method. Internally, our software does repeated calls to jump32divsteps2, units


of 30 iterations, with the resulting transition matrix fitting inside 32-bit general-purpose registers.

12.4. Other primes. Nath and Sarkar present inversion speeds for 20 different special primes, 12 of which were considered in previous work. For concreteness we consider inversion modulo the double-size prime 2^511 − 187. Aranha–Barreto–Pereira–Ricardini [8] selected this prime for their "M-511" curve.

Nath and Sarkar report that their Fermat inversion software for 2^511 − 187 takes 72804 Haswell cycles, 47062 Skylake cycles, or 45014 Kaby Lake cycles. The slowdown from 2^255 − 19, more than 5× in each case, is easy to explain: the exponent is twice as long, producing almost twice as many multiplications; each multiplication handles twice as many bits; and multiplication cost per bit is not constant.

Our 511-bit software is solidly faster: under 30000 cycles on all of these platforms. Most of the cycles in our computation are in evaluating jump64divsteps2, and the number of calls to jump64divsteps2 scales linearly with the number of bits in the input. There is some overhead for multiplications, so a modulus of twice the size takes more than twice as long, but our scalability to large moduli is much better than the Fermat scalability.

Our advantage is also larger in applications that use "random" primes rather than special primes: we pay the extra cost for Montgomery multiplication only at the end of our computation, whereas Fermat inversion pays the extra cost in each step.

References

[1] — (no editor), Actes du congrès international des mathématiciens, tome 3, Gauthier-Villars Éditeur, Paris, 1971. See [41].

[2] — (no editor), CWIT 2011: 12th Canadian workshop on information theory, Kelowna, British Columbia, May 17–20, 2011, Institute of Electrical and Electronics Engineers, 2011. ISBN 978-1-4577-0743-8. See [51].

[3] Carlisle Adams, Jan Camenisch (editors), Selected areas in cryptography – SAC 2017, 24th international conference, Ottawa, ON, Canada, August 16–18, 2017, revised selected papers, Lecture Notes in Computer Science 10719, Springer, 2018. ISBN 978-3-319-72564-2. See [15].

[4] Carlos Aguilar Melchor, Nicolas Aragon, Slim Bettaieb, Loïc Bidoux, Olivier Blazy, Jean-Christophe Deneuville, Philippe Gaborit, Edoardo Persichetti, Gilles Zémor, Hamming Quasi-Cyclic (HQC) (2017). URL: https://pqc-hqc.org/doc/hqc-specification_2017-11-30.pdf. Citations in this document: §1.

[5] Alfred V. Aho (chairman), Proceedings of fifth annual ACM symposium on theory of computing, Austin, Texas, April 30–May 2, 1973, ACM, New York, 1973. See [53].

[6] Martin Albrecht, Carlos Cid, Kenneth G. Paterson, CJ Tjhai, Martin Tomlinson, NTS-KEM, "The main document submitted to NIST" (2017). URL: https://nts-kem.io. Citations in this document: §1.

[7] Alejandro Cabrera Aldaya, Cesar Pereida García, Luis Manuel Alvarez Tapia, Billy Bob Brumley, Cache-timing attacks on RSA key generation (2018). URL: https://eprint.iacr.org/2018/367. Citations in this document: §1.

[8] Diego F. Aranha, Paulo S. L. M. Barreto, Geovandro C. C. F. Pereira, Jefferson E. Ricardini, A note on high-security general-purpose elliptic curves (2013). URL: https://eprint.iacr.org/2013/647. Citations in this document: §12.4.


[9] Elwyn R. Berlekamp, Algebraic coding theory, McGraw-Hill, 1968. Citations in this document: §1.

[10] Daniel J. Bernstein, Curve25519: new Diffie-Hellman speed records, in PKC 2006 [70] (2006), 207–228. URL: https://cr.yp.to/papers.html#curve25519. Citations in this document: §1, §12.

[11] Daniel J. Bernstein, Fast multiplication and its applications, in [27] (2008), 325–384. URL: https://cr.yp.to/papers.html#multapps. Citations in this document: §5.5.

[12] Daniel J. Bernstein, Tung Chou, Tanja Lange, Ingo von Maurich, Rafael Misoczki, Ruben Niederhagen, Edoardo Persichetti, Christiane Peters, Peter Schwabe, Nicolas Sendrier, Jakub Szefer, Wen Wang, Classic McEliece: conservative code-based cryptography, "Supporting Documentation" (2017). URL: https://classic.mceliece.org/nist.html. Citations in this document: §1.

[13] Daniel J. Bernstein, Tung Chou, Peter Schwabe, McBits: fast constant-time code-based cryptography, in CHES 2013 [18] (2013), 250–272. URL: https://binary.cr.yp.to/mcbits.html. Citations in this document: §1, §1.

[14] Daniel J. Bernstein, Chitchanok Chuengsatiansup, Tanja Lange, Christine van Vredendaal, NTRU Prime: reducing attack surface at low cost, full version of [15] (2017). URL: https://ntruprime.cr.yp.to/papers.html. Citations in this document: §1, §1, §1.1, §7.3, §7.3, §7.4.

[15] Daniel J. Bernstein, Chitchanok Chuengsatiansup, Tanja Lange, Christine van Vredendaal, NTRU Prime: reducing attack surface at low cost, in SAC 2017 [3], abbreviated version of [14] (2018), 235–260.

[16] Daniel J. Bernstein, Niels Duif, Tanja Lange, Peter Schwabe, Bo-Yin Yang, High-speed high-security signatures, Journal of Cryptographic Engineering 2 (2012), 77–89; see also older version at CHES 2011. URL: https://ed25519.cr.yp.to/ed25519-20110926.pdf. Citations in this document: §1.

[17] Daniel J. Bernstein, Peter Schwabe, NEON crypto, in CHES 2012 [56] (2012), 320–339. URL: https://cr.yp.to/papers.html#neoncrypto. Citations in this document: §1, §1.

[18] Guido Bertoni, Jean-Sébastien Coron (editors), Cryptographic hardware and embedded systems – CHES 2013 – 15th international workshop, Santa Barbara, CA, USA, August 20–23, 2013, proceedings, Lecture Notes in Computer Science 8086, Springer, 2013. ISBN 978-3-642-40348-4. See [13].

[19] Adam W. Bojanczyk, Richard P. Brent, A systolic algorithm for extended GCD computation, Computers & Mathematics with Applications 14 (1987), 233–238. ISSN 0898-1221. URL: https://maths-people.anu.edu.au/~brent/pd/rpb096i.pdf. Citations in this document: §1.3, §1.3.

[20] Tomas J. Boothby and Robert W. Bradshaw, Bitslicing and the method of four Russians over larger finite fields (2009). URL: http://arxiv.org/abs/0901.1413. Citations in this document: §7.1, §7.2.

[21] Joppe W. Bos, Constant time modular inversion, Journal of Cryptographic Engineering 4 (2014), 275–281. URL: http://joppebos.com/files/CTInversion.pdf. Citations in this document: §1, §1, §1, §1, §1, §1.3.


[22] Richard P. Brent, Fred G. Gustavson, David Y. Y. Yun, Fast solution of Toeplitz systems of equations and computation of Padé approximants, Journal of Algorithms 1 (1980), 259–295. ISSN 0196-6774. URL: https://maths-people.anu.edu.au/~brent/pd/rpb059i.pdf. Citations in this document: §1.4, §1.4.

[23] Richard P. Brent, Hsiang-Tsung Kung, Systolic VLSI arrays for polynomial GCD computation, IEEE Transactions on Computers C-33 (1984), 731–736. URL: https://maths-people.anu.edu.au/~brent/pd/rpb073i.pdf. Citations in this document: §1.3, §1.3, §3.2.

[24] Richard P. Brent, Hsiang-Tsung Kung, A systolic VLSI array for integer GCD computation, ARITH-7 (1985), 118–125. URL: https://maths-people.anu.edu.au/~brent/pd/rpb077i.pdf. Citations in this document: §1.3, §1.3, §1.3, §1.3, §1.4, §8.6, §8.6, §8.6.

[25] Richard P. Brent, Paul Zimmermann, An O(M(n) log n) algorithm for the Jacobi symbol, in ANTS 2010 [37] (2010), 83–95. URL: https://arxiv.org/pdf/1004.2091.pdf. Citations in this document: §1.4.

[26] Duncan A. Buell (editor), Algorithmic number theory, 6th international symposium, ANTS-VI, Burlington, VT, USA, June 13–18, 2004, proceedings, Lecture Notes in Computer Science 3076, Springer, 2004. ISBN 3-540-22156-5. See [62].

[27] Joe P. Buhler, Peter Stevenhagen (editors), Surveys in algorithmic number theory, Mathematical Sciences Research Institute Publications 44, Cambridge University Press, New York, 2008. See [11].

[28] Wouter Castryck, Tanja Lange, Chloe Martindale, Lorenz Panny, Joost Renes, CSIDH: an efficient post-quantum commutative group action, in Asiacrypt 2018 [55] (2018), 395–427. URL: https://csidh.isogeny.org/csidh-20181118.pdf. Citations in this document: §1.

[29] Cong Chen, Jeffrey Hoffstein, William Whyte, Zhenfei Zhang, NIST PQ Submission: NTRUEncrypt: a lattice based encryption algorithm (2017). URL: https://csrc.nist.gov/Projects/Post-Quantum-Cryptography/Round-1-Submissions. Citations in this document: §7.3.

[30] Tung Chou, McBits revisited, in CHES 2017 [33] (2017), 213–231. URL: https://tungchou.github.io/mcbits/. Citations in this document: §1.

[31] Benoît Daireaux, Véronique Maume-Deschamps, Brigitte Vallée, The Lyapunov tortoise and the dyadic hare, in [49] (2005), 71–94. URL: http://emis.ams.org/journals/DMTCS/pdfpapers/dmAD0108.pdf. Citations in this document: §E.5, §F.2.

[32] Jean Louis Dornstetter, On the equivalence between Berlekamp's and Euclid's algorithms, IEEE Transactions on Information Theory 33 (1987), 428–431. Citations in this document: §1.

[33] Wieland Fischer, Naofumi Homma (editors), Cryptographic hardware and embedded systems – CHES 2017 – 19th international conference, Taipei, Taiwan, September 25–28, 2017, proceedings, Lecture Notes in Computer Science 10529, Springer, 2017. ISBN 978-3-319-66786-7. See [30], [39].

[34] Hayato Fujii, Diego F. Aranha, Curve25519 for the Cortex-M4 and beyond, LATINCRYPT 2017, to appear (2017). URL: https://www.lasca.ic.unicamp.br/media/publications/paper39.pdf. Citations in this document: §1.1, §12.3.


[35] Philippe Gaborit, Carlos Aguilar Melchor, Cryptographic method for communicating confidential information, patent EP2537284B1 (2010). URL: https://patents.google.com/patent/EP2537284B1/en. Citations in this document: §7.4.

[36] Mariya Georgieva, Frédéric de Portzamparc, Toward secure implementation of McEliece decryption, in COSADE 2015 [48] (2015), 141–156. URL: https://eprint.iacr.org/2015/271. Citations in this document: §1.

[37] Guillaume Hanrot, François Morain, Emmanuel Thomé (editors), Algorithmic number theory, 9th international symposium, ANTS-IX, Nancy, France, July 19–23, 2010, proceedings, Lecture Notes in Computer Science 6197, 2010. ISBN 978-3-642-14517-9. See [25].

[38] Agnes E. Heydtmann, Jørn M. Jensen, On the equivalence of the Berlekamp-Massey and the Euclidean algorithms for decoding, IEEE Transactions on Information Theory 46 (2000), 2614–2624. Citations in this document: §1.

[39] Andreas Hülsing, Joost Rijneveld, John M. Schanck, Peter Schwabe, High-speed key encapsulation from NTRU, in CHES 2017 [33] (2017), 232–252. URL: https://eprint.iacr.org/2017/667. Citations in this document: §1, §1.1, §1.1, §1.3, §7, §7, §7.2, §7.2, §7.2, §7.2, §7.2, §7.2, §7.2, §7.2, §7.3, §7.3, §7.3, §7.3, §7.3.

[40] Burton S. Kaliski, The Montgomery inverse and its applications, IEEE Transactions on Computers 44 (1995), 1064–1065. Citations in this document: §1.

[41] Donald E. Knuth, The analysis of algorithms, in [1] (1971), 269–274. Citations in this document: §1, §1.4, §1.4.

[42] Adam Langley, CECPQ2 (2018). URL: https://www.imperialviolet.org/2018/12/12/cecpq2.html. Citations in this document: §7.4.

[43] Adam Langley, Add initial HRSS support (2018). URL: https://boringssl.googlesource.com/boringssl/+/7b935937b18215294e7dbf6404742855e3349092/crypto/hrss/hrss.c. Citations in this document: §7.2.

[44] Derrick H. Lehmer, Euclid's algorithm for large numbers, American Mathematical Monthly 45 (1938), 227–233. ISSN 0002-9890. URL: http://links.jstor.org/sici?sici=0002-9890(193804)45:4<227:EAFLN>2.0.CO;2-Y. Citations in this document: §1, §1.4, §12.1.

[45] Xianhui Lu, Yamin Liu, Dingding Jia, Haiyang Xue, Jingnan He, Zhenfei Zhang, LAC: lattice-based cryptosystems (2017). URL: https://csrc.nist.gov/Projects/Post-Quantum-Cryptography/Round-1-Submissions. Citations in this document: §1.

[46] Vadim Lyubashevsky, Chris Peikert, Oded Regev, On ideal lattices and learning with errors over rings, Journal of the ACM 60 (2013), Article 43, 35 pages. URL: https://eprint.iacr.org/2012/230. Citations in this document: §7.4.

[47] Vadim Lyubashevsky, Gregor Seiler, NTTRU: truly fast NTRU using NTT (2019). URL: https://eprint.iacr.org/2019/040. Citations in this document: §7.4.

[48] Stefan Mangard, Axel Y. Poschmann (editors), Constructive side-channel analysis and secure design – 6th international workshop, COSADE 2015, Berlin, Germany, April 13–14, 2015, revised selected papers, Lecture Notes in Computer Science 9064, Springer, 2015. ISBN 978-3-319-21475-7. See [36].


[49] Conrado Martínez (editor), 2005 international conference on analysis of algorithms, papers from the conference (AofA'05) held in Barcelona, June 6–10, 2005, The Association Discrete Mathematics & Theoretical Computer Science (DMTCS), Nancy, 2005. URL: http://www.emis.ams.org/journals/DMTCS/proceedings/dmAD01ind.html. See [31].

[50] James Massey, Shift-register synthesis and BCH decoding, IEEE Transactions on Information Theory 15 (1969), 122–127. ISSN 0018-9448. Citations in this document: §1.

[51] Todd D. Mateer, On the equivalence of the Berlekamp-Massey and the euclidean algorithms for algebraic decoding, in CWIT 2011 [2] (2011), 139–142. Citations in this document: §1.

[52] Niels Möller, On Schönhage's algorithm and subquadratic integer GCD computation, Mathematics of Computation 77 (2008), 589–607. URL: https://www.lysator.liu.se/~nisse/archive/S0025-5718-07-02017-0.pdf. Citations in this document: §1.4.

[53] Robert T. Moenck, Fast computation of GCDs, in STOC 1973 [5] (1973), 142–151. Citations in this document: §1.4, §1.4, §1.4, §1.4.

[54] Kaushik Nath, Palash Sarkar, Efficient inversion in (pseudo-)Mersenne prime order fields (2018). URL: https://eprint.iacr.org/2018/985. Citations in this document: §1.1, §12, §12.

[55] Thomas Peyrin, Steven D. Galbraith, Advances in cryptology – ASIACRYPT 2018 – 24th international conference on the theory and application of cryptology and information security, Brisbane, QLD, Australia, December 2–6, 2018, proceedings, part I, Lecture Notes in Computer Science 11272, Springer, 2018. ISBN 978-3-030-03325-5. See [28].

[56] Emmanuel Prouff, Patrick Schaumont (editors), Cryptographic hardware and embedded systems – CHES 2012 – 14th international workshop, Leuven, Belgium, September 9–12, 2012, proceedings, Lecture Notes in Computer Science 7428, Springer, 2012. ISBN 978-3-642-33026-1. See [17].

[57] Martin Roetteler, Michael Naehrig, Krysta M. Svore, Kristin E. Lauter, Quantum resource estimates for computing elliptic curve discrete logarithms, in ASIACRYPT 2017 [67] (2017), 241–270. URL: https://eprint.iacr.org/2017/598. Citations in this document: §1.1, §1.3.

[58] The Sage Developers (editor), SageMath, the Sage Mathematics Software System (Version 8.0), 2017. URL: https://www.sagemath.org. Citations in this document: §1.2, §1.2, §2.2, §5.

[59] John Schanck, Announcement of NTRU-HRSS-KEM and NTRUEncrypt merger (2018). URL: https://groups.google.com/a/list.nist.gov/d/msg/pqc-forum/SrFO_vK3xbI/mSmjY0HZCgAJ. Citations in this document: §7.3.

[60] Arnold Schönhage, Schnelle Berechnung von Kettenbruchentwicklungen, Acta Informatica 1 (1971), 139–144. ISSN 0001-5903. Citations in this document: §1, §1.4, §1.4.

[61] Joseph H. Silverman, Almost inverses and fast NTRU key creation, NTRU Cryptosystems (Technical Note 014) (1999). URL: https://assets.securityinnovation.com/static/downloads/NTRU/resources/NTRUTech014.pdf. Citations in this document: §7.3, §7.3.

Daniel J Bernstein and Bo-Yin Yang 35

[62] Damien Stehleacute Paul Zimmermann A binary recursive gcd algorithm in ANTS 2004[26] (2004) 411ndash425 URL httpspersoens-lyonfrdamienstehleBINARYhtml Citations in this document sect14 sect14 sect85 sectE sectE6 sectE6 sectE6 sectF2 sectF2sectF2 sectH sectH

[63] Josef Stein Computational problems associated with Racah algebra Journal ofComputational Physics 1 (1967) 397ndash405 Citations in this document sect1

[64] Simon Stevin Lrsquoarithmeacutetique Imprimerie de Christophle Plantin 1585 URLhttpwwwdwcknawnlpubbronnenSimon_Stevin-[II_B]_The_Principal_Works_of_Simon_Stevin_Mathematicspdf Citations in this document sect1

[65] Volker Strassen The computational complexity of continued fractions in SYM-SAC1981 [69] (1981) 51ndash67 see also newer version [66]

[66] Volker Strassen The computational complexity of continued fractions SIAM Journalon Computing 12 (1983) 1ndash27 see also older version [65] ISSN 0097-5397 Citationsin this document sect14 sect14 sect53 sect55

[67] Tsuyoshi Takagi Thomas Peyrin (editors) Advances in cryptologymdashASIACRYPT2017mdash23rd international conference on the theory and applications of cryptology andinformation security Hong Kong China December 3ndash7 2017 proceedings part IILecture Notes in Computer Science 10625 Springer 2017 ISBN 978-3-319-70696-2See [57]

[68] Emmanuel Thomeacute Subquadratic computation of vector generating polynomials andimprovement of the block Wiedemann algorithm Journal of Symbolic Computation33 (2002) 757ndash775 URL httpshalinriafrinria-00103417document Ci-tations in this document sect14

[69] Paul S Wang (editor) SYM-SAC rsquo81 proceedings of the 1981 ACM Symposium onSymbolic and Algebraic Computation Snowbird Utah August 5ndash7 1981 Associationfor Computing Machinery New York 1981 ISBN 0-89791-047-8 See [65]

[70] Moti Yung Yevgeniy Dodis Aggelos Kiayias Tal Malkin (editors) Public keycryptographymdash9th international conference on theory and practice in public-keycryptography New York NY USA April 24ndash26 2006 proceedings Lecture Notesin Computer Science 3958 Springer 2006 ISBN 978-3-540-33851-2 See [10]

A Proof of main gcd theorem for polynomialsWe give two separate proofs of the facts stated earlier relating gcd and modular reciprocalto a particular iterate of divstep in the polynomial case

bull The second proof consists of a review of the EuclidndashStevin gcd algorithm (Ap-pendix B) a description of exactly how the intermediate results in the EuclidndashStevinalgorithm relate to iterates of divstep (Appendix C) and finally a translationof gcdreciprocal results from the EuclidndashStevin context to the divstep context(Appendix D)

bull The first proof given in this appendix is shorter it works directly with divstepwithout a detour through the EuclidndashStevin labeling of intermediate results

36 Fast constant-time gcd computation and modular inversion

Theorem A1 Let k be a field Let d be a positive integer Let R0 R1 be elements of thepolynomial ring k[x] with degR0 = d gt degR1 Define f = xdR0(1x) g = xdminus1R1(1x)and (δn fn gn) = divstepn(1 f g) for n ge 0 Then

2 deg fn le 2dminus 1minus n+ δn for n ge 02 deg gn le 2dminus 1minus nminus δn for n ge 0

Proof Induct on n If n = 0 then (δn fn gn) = (1 f g) so 2 deg fn = 2 deg f le 2d =2dminus 1minus n+ δn and 2 deg gn = 2 deg g le 2(dminus 1) = 2dminus 1minus nminus δn

Assume from now on that n ge 1 By the inductive hypothesis 2 deg fnminus1 le 2dminusn+δnminus1and 2 deg gnminus1 le 2dminus nminus δnminus1

Case 1 δnminus1 le 0 Then

(δn fn gn) = (1 + δnminus1 fnminus1 (fnminus1(0)gnminus1 minus gnminus1(0)fnminus1)x)

Both fnminus1 and gnminus1 have degree at most (2d minus n minus δnminus1)2 so the polynomial xgn =fnminus1(0)gnminus1 minus gnminus1(0)fnminus1 also has degree at most (2d minus n minus δnminus1)2 so 2 deg gn le2dminus nminus δnminus1 minus 2 = 2dminus 1minus nminus δn Also 2 deg fn = 2 deg fnminus1 le 2dminus 1minus n+ δn

Case 2 δnminus1 gt 0 and gnminus1(0) = 0 Then

(δn fn gn) = (1 + δnminus1 fnminus1 fnminus1(0)gnminus1x)

so 2 deg gn = 2 deg gnminus1 minus 2 le 2dminus nminus δnminus1 minus 2 = 2dminus 1minus nminus δn As before 2 deg fn =2 deg fnminus1 le 2dminus 1minus n+ δn

Case 3 δnminus1 gt 0 and gnminus1(0) 6= 0 Then

(δn fn gn) = (1minus δnminus1 gnminus1 (gnminus1(0)fnminus1 minus fnminus1(0)gnminus1)x)

This time the polynomial xgn = gnminus1(0)fnminus1 minus fnminus1(0)gnminus1 also has degree at most(2d minus n + δnminus1)2 so 2 deg gn le 2d minus n + δnminus1 minus 2 = 2d minus 1 minus n minus δn Also 2 deg fn =2 deg gnminus1 le 2dminus nminus δnminus1 = 2dminus 1minus n+ δn

Theorem A2 In the situation of Theorem A1 define Tn = T (δn fn gn) and(un vnqn rn

)= Tnminus1 middot middot middot T0

Then xn(unrnminusvnqn) isin klowast for n ge 0 xnminus1un isin k[x] for n ge 1 xnminus1vn xnqn x

nrn isin k[x]for n ge 0 and

2 deg(xnminus1un) le nminus 3 + δn for n ge 12 deg(xnminus1vn) le nminus 1 + δn for n ge 0

2 deg(xnqn) le nminus 1minus δn for n ge 02 deg(xnrn) le n+ 1minus δn for n ge 0

Proof Each matrix Tn has determinant in (1x)klowast by construction so the product of nsuch matrices has determinant in (1xn)klowast In particular unrn minus vnqn isin (1xn)klowast

Induct on n For n = 0 we have δn = 1 un = 1 vn = 0 qn = 0 rn = 1 soxnminus1vn = 0 isin k[x] xnqn = 0 isin k[x] xnrn = 1 isin k[x] 2 deg(xnminus1vn) = minusinfin le nminus 1 + δn2 deg(xnqn) = minusinfin le nminus 1minus δn and 2 deg(xnrn) = 0 le n+ 1minus δn

Assume from now on that n ge 1 By Theorem 43 un vn isin (1xnminus1)k[x] andqn rn isin (1xn)k[x] so all that remains is to prove the degree bounds

Case 1 δnminus1 le 0 Then δn = 1 + δnminus1 un = unminus1 vn = vnminus1 qn = (fnminus1(0)qnminus1 minusgnminus1(0)unminus1)x and rn = (fnminus1(0)rnminus1 minus gnminus1(0)vnminus1)x

Daniel J Bernstein and Bo-Yin Yang 37

Note that n gt 1 since δ0 gt 0 By the inductive hypothesis 2 deg(xnminus2unminus1) le nminus 4 +δnminus1 so 2 deg(xnminus1un) le nminus 2 + δnminus1 = nminus 3 + δn Also 2 deg(xnminus2vnminus1) le nminus 2 + δnminus1so 2 deg(xnminus1vn) le n+ δnminus1 = nminus 1 + δn

For qn note that both xnminus1qnminus1 and xnminus1unminus1 have degree at most (nminus2minusδnminus1)2 sothe same is true for their k-linear combination xnqn Hence 2 deg(xnqn) le nminus 2minus δnminus1 =nminus 1minus δn Similarly 2 deg(xnrn) le nminus δnminus1 = n+ 1minus δn

Case 2 δnminus1 gt 0 and gnminus1(0) = 0 Then δn = 1 + δnminus1 un = unminus1 vn = vnminus1qn = fnminus1(0)qnminus1x and rn = fnminus1(0)rnminus1x

If n = 1 then un = u0 = 1 so 2 deg(xnminus1un) = 0 = n minus 3 + δn Otherwise2 deg(xnminus2unminus1) le nminus4 + δnminus1 by the inductive hypothesis so 2 deg(xnminus1un) le nminus3 + δnas above

Also 2 deg(xnminus1vn) le nminus 1 + δn as above 2 deg(xnqn) = 2 deg(xnminus1qnminus1) le nminus 2minusδnminus1 = nminus 1minus δn and 2 deg(xnrn) = 2 deg(xnminus1rnminus1) le nminus δnminus1 = n+ 1minus δn

Case 3 δnminus1 gt 0 and gnminus1(0) 6= 0 Then δn = 1 minus δnminus1 un = qnminus1 vn = rnminus1qn = (gnminus1(0)unminus1 minus fnminus1(0)qnminus1)x and rn = (gnminus1(0)vnminus1 minus fnminus1(0)rnminus1)x

If n = 1 then un = q0 = 0 so 2 deg(xnminus1un) = minusinfin le n minus 3 + δn Otherwise2 deg(xnminus1qnminus1) le nminus 2minus δnminus1 by the inductive hypothesis so 2 deg(xnminus1un) le nminus 2minusδnminus1 = nminus 3 + δn Similarly 2 deg(xnminus1rnminus1) le nminus δnminus1 by the inductive hypothesis so2 deg(xnminus1vn) le nminus δnminus1 = nminus 1 + δn

Finally for qn note that both xnminus1unminus1 and xnminus1qnminus1 have degree at most (n minus2 + δnminus1)2 so the same is true for their k-linear combination xnqn so 2 deg(xnqn) lenminus 2 + δnminus1 = nminus 1minus δn Similarly 2 deg(xnrn) le n+ δnminus1 = n+ 1minus δn

First proof of Theorem 62 Write ϕ = f2dminus1 By Theorem A1 2 degϕ le δ2dminus1 and2 deg g2dminus1 le minusδ2dminus1 Each fn is nonzero by construction so degϕ ge 0 so δ2dminus1 ge 0

By Theorem A2 x2dminus2u2dminus1 is a polynomial of degree at most dminus 2 + δ2dminus12 DefineC0 = xminusd+δ2dminus12u2dminus1(1x) then C0 is a polynomial of degree at most dminus 2 + δ2dminus12Similarly define C1 = xminusd+1+δ2dminus12v2dminus1(1x) and observe that C1 is a polynomial ofdegree at most dminus 1 + δ2dminus12

Multiply C0 by R0 = xdf(1x) multiply C1 by R1 = xdminus1g(1x) and add to obtain

C0R0 + C1R1 = xδ2dminus12(u2dminus1(1x)f(1x) + v2dminus1(1x)g(1x))= xδ2dminus12ϕ(1x)

by Theorem 42Define Γ as the monic polynomial xδ2dminus12ϕ(1x)ϕ(0) of degree δ2dminus12 We have

just shown that Γ isin R0k[x] + R1k[x] specifically Γ = (C0ϕ(0))R0 + (C1ϕ(0))R1 Tofinish the proof we will show that R0 isin Γk[x] This forces Γ = gcdR0 R1 = G which inturn implies degG = δ2dminus12 and V = C1ϕ(0)

Observe that δ2d = 1 + δ2dminus1 and f2d = ϕ the ldquoswappedrdquo case is impossible here sinceδ2dminus1 gt 0 implies g2dminus1 = 0 By Theorem A1 again 2 deg g2d le minus1minus δ2d = minus2minus δ2dminus1so g2d = 0

By Theorem 42 ϕ = f2d = u2df + v2dg and 0 = g2d = q2df + r2dg so x2dr2dϕ = ∆fwhere ∆ = x2d(u2dr2dminus v2dq2d) By Theorem A2 ∆ isin klowast and p = x2dr2d is a polynomialof degree at most (2d+ 1minus δ2d)2 = dminus δ2dminus12 The polynomials xdminusδ2dminus12p(1x) andxδ2dminus12ϕ(1x) = Γ have product ∆xdf(1x) = ∆R0 so Γ divides R0 as claimed

B Review of the EuclidndashStevin algorithmThis appendix reviews the EuclidndashStevin algorithm to compute polynomial gcd includingthe extended version of the algorithm that writes each remainder as a linear combinationof the two inputs

38 Fast constant-time gcd computation and modular inversion

Theorem B1 (the EuclidndashStevin algorithm) Let k be a field Let R0 R1 be ele-ments of the polynomial ring k[x] Recursively define a finite sequence of polynomialsR2 R3 Rr isin k[x] as follows if i ge 1 and Ri = 0 then r = i if i ge 1 and Ri 6= 0 thenRi+1 = Riminus1 mod Ri Define di = degRi and `i = Ri[di] Then

bull d1 gt d2 gt middot middot middot gt dr = minusinfin

bull `i 6= 0 if 1 le i le r minus 1

bull if `rminus1 6= 0 then gcdR0 R1 = Rrminus1`rminus1 and

bull if `rminus1 = 0 then R0 = R1 = gcdR0 R1 = 0 and r = 1

Proof If 1 le i le r minus 1 then Ri 6= 0 so `i = Ri[degRi] 6= 0 Also Ri+1 = Riminus1 mod Ri sodegRi+1 lt degRi ie di+1 lt di Hence d1 gt d2 gt middot middot middot gt dr and dr = degRr = deg 0 =minusinfin

If `rminus1 = 0 then r must be 1 so R1 = 0 (since Rr = 0) and R0 = 0 (since `0 = 0) sogcdR0 R1 = 0 Assume from now on that `rminus1 6= 0 The divisibility of Ri+1 minus Riminus1by Ri implies gcdRiminus1 Ri = gcdRi Ri+1 so gcdR0 R1 = gcdR1 R2 = middot middot middot =gcdRrminus1 Rr = gcdRrminus1 0 = Rrminus1`rminus1

Theorem B2 (the extended EuclidndashStevin algorithm) In the situation of Theorem B1define U0 V0 Ur Ur isin k[x] as follows U0 = 1 V0 = 0 U1 = 0 V1 = 1 Ui+1 = Uiminus1minusbRiminus1RicUi for i isin 1 r minus 1 Vi+1 = Viminus1 minus bRiminus1RicVi for i isin 1 r minus 1Then

bull Ri = UiR0 + ViR1 for i isin 0 r

bull UiVi+1 minus Ui+1Vi = (minus1)i for i isin 0 r minus 1

bull Vi+1Ri minus ViRi+1 = (minus1)iR0 for i isin 0 r minus 1 and

bull if d0 gt d1 then deg Vi+1 = d0 minus di for i isin 0 r minus 1

Proof U0R0 + V0R1 = 1R0 + 0R1 = R0 and U1R0 + V1R1 = 0R0 + 1R1 = R1 Fori isin 1 r minus 1 assume inductively that Riminus1 = Uiminus1R0+Viminus1R1 and Ri = UiR0+ViR1and writeQi = bRiminus1Ric Then Ri+1 = Riminus1 mod Ri = Riminus1minusQiRi = (Uiminus1minusQiUi)R0+(Viminus1 minusQiVi)R1 = Ui+1R0 + Vi+1R1

If i = 0 then UiVi+1minusUi+1Vi = 1 middot 1minus 0 middot 0 = 1 = (minus1)i For i isin 1 r minus 1 assumeinductively that Uiminus1Vi minus UiViminus1 = (minus1)iminus1 then UiVi+1 minus Ui+1Vi = Ui(Viminus1 minus QVi) minus(Uiminus1 minusQUi)Vi = minus(Uiminus1Vi minus UiViminus1) = (minus1)i

Multiply Ri = UiR0 + ViR1 by Vi+1 multiply Ri+1 = Ui+1R0 + Vi+1R1 by Ui+1 andsubtract to see that Vi+1Ri minus ViRi+1 = (minus1)iR0

Finally assume d0 gt d1 If i = 0 then deg Vi+1 = deg 1 = 0 = d0 minus di Fori isin 1 r minus 1 assume inductively that deg Vi = d0 minus diminus1 Then deg ViRi+1 lt d0since degRi+1 = di+1 lt diminus1 so deg Vi+1Ri = deg(ViRi+1 +(minus1)iR0) = d0 so deg Vi+1 =d0 minus di

C EuclidndashStevin results as iterates of divstepIf the EuclidndashStevin algorithm is written as a sequence of polynomial divisions andeach polynomial division is written as a sequence of coefficient-elimination steps theneach coefficient-elimination step corresponds to one application of divstep The exactcorrespondence is stated in Theorem C2 This generalizes the known BerlekampndashMasseyequivalence mentioned in Section 1

Theorem C1 is a warmup showing how a single polynomial division relates to asequence of division steps

Daniel J Bernstein and Bo-Yin Yang 39

Theorem C1 (polynomial division as a sequence of division steps) Let k be a fieldLet R0 R1 be elements of the polynomial ring k[x] Write d0 = degR0 and d1 = degR1Assume that d0 gt d1 ge 0 Then

divstep2d0minus2d1(1 xd0R0(1x) xd0minus1R1(1x))= (1 `d0minusd1minus1

0 xd1R1(1x) (`d0minusd1minus10 `1)d0minusd1+1xd1minus1R2(1x))

where `0 = R0[d0] `1 = R1[d1] and R2 = R0 mod R1

Proof Part 1 The first d0 minus d1 minus 1 iterations If δ isin 1 2 d0 minus d1 minus 1 thend0minusδ gt d1 so R1[d0minusδ] = 0 so the polynomial `δminus1

0 xd0minusδR1(1x) has constant coefficient0 The polynomial xd0R0(1x) has constant coefficient `0 Hence

divstep(δ xd0R0(1x) `δminus10 xd0minusδR1(1x)) = (1 + δ xd0R0(1x) `δ0xd0minusδminus1R1(1x))

by definition of divstep By induction

divstepd0minusd1minus1(1 xd0R0(1x) xd0minus1R1(1x))= (d0 minus d1 x

d0R0(1x) `d0minusd1minus10 xd1R1(1x))

= (d0 minus d1 xd0R0(1x) xd1Rprime1(1x))

where Rprime1 = `d0minusd1minus10 R1 Note that Rprime1[d1] = `prime1 where `prime1 = `d0minusd1minus1

0 `1

Part 2 The swap iteration and the last d0 minus d1 iterations Define

Zd0minusd1 = R0 mod xd0minusd1Rprime1Zd0minusd1+1 = R0 mod xd0minusd1minus1Rprime1

Z2d0minus2d1 = R0 mod Rprime1 = R0 mod R1 = R2

ThenZd0minusd1 = R0 minus (R0[d0]`prime1)xd0minusd1Rprime1

Zd0minusd1+1 = Zd0minusd1 minus (Zd0minusd1 [d0 minus 1]`prime1)xd0minusd1minus1Rprime1

Z2d0minus2d1 = Z2d0minus2d1minus1 minus (Z2d0minus2d1minus1[d1]`prime1)Rprime1

Substitute 1x for x to see that

xd0Zd0minusd1(1x) = xd0R0(1x)minus (R0[d0]`prime1)xd1Rprime1(1x)xd0minus1Zd0minusd1+1(1x) = xd0minus1Zd0minusd1(1x)minus (Zd0minusd1 [d0 minus 1]`prime1)xd1Rprime1(1x)

xd1Z2d0minus2d1(1x) = xd1Z2d0minus2d1minus1(1x)minus (Z2d0minus2d1minus1[d1]`prime1)xd1Rprime1(1x)

We will use these equations to show that the d0 minus d1 2d0 minus 2d1 iterates of divstepcompute `prime1xd0minus1Zd0minusd1 (`prime1)d0minusd1+1xd1minus1Z2d0minus2d1 respectively

The constant coefficients of xd0R0(1x) and xd1Rprime1(1x) are R0[d0] and `prime1 6= 0 respec-tively and `prime1xd0R0(1x)minusR0[d0]xd1Rprime1(1x) = `prime1x

d0Zd0minusd1(1x) so

divstep(d0 minus d1 xd0R0(1x) xd1Rprime1(1x))

= (1minus (d0 minus d1) xd1Rprime1(1x) `prime1xd0minus1Zd0minusd1(1x))

40 Fast constant-time gcd computation and modular inversion

The constant coefficients of xd1Rprime1(1x) and `prime1xd0minus1Zd0minusd1(1x) are `prime1 and `prime1Zd0minusd1 [d0minus1]respectively and

(`prime1)2xd0minus1Zd0minusd1(1x)minus `prime1Zd0minusd1 [d0 minus 1]xd1Rprime1(1x) = (`prime1)2xd0minus1Zd0minusd1+1(1x)

sodivstep(1minus (d0 minus d1) xd1Rprime1(1x) `prime1xd0minus1Zd0minusd1(1x))

= (2minus (d0 minus d1) xd1Rprime1(1x) (`prime1)2xd0minus2Zd0minusd1+1(1x))

Continue in this way to see that

divstepi(d0 minus d1 xd0R0(1x) xd1Rprime1(1x))

= (iminus (d0 minus d1) xd1Rprime1(1x) (`prime1)ixd0minusiZd0minusd1+iminus1(1x))

for 1 le i le d0 minus d1 + 1 In particular

divstep2d0minus2d1(1 xd0R0(1x) xd0minus1R1(1x))= divstepd0minusd1+1(d0 minus d1 x

d0R0(1x) xd1Rprime1(1x))= (1 xd1Rprime1(1x) (`prime1)d0minusd1+1xd1minus1R2(1x))= (1 `d0minusd1minus1

0 xd1R1(1x) (`d0minusd1minus10 `1)d0minusd1+1xd1minus1R2(1x))

as claimed

Theorem C2 In the situation of Theorem B2 assume that d0 gt d1 Define f =xd0R0(1x) and g = xd0minus1R1(1x) Then f g isin k[x] Define (δn fn gn) = divstepn(1 f g)and Tn = T (δn fn gn)

Part A If i isin 0 1 r minus 2 and 2d0 minus 2di le n lt 2d0 minus di minus di+1 then

δn = 1 + nminus (2d0 minus 2di) gt 0fn = anx

diRi(1x)gn = bnx

2d0minusdiminusnminus1Ri+1(1x)

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdiUi(1x) an(1x)d0minusdiminus1Vi(1x)bn(1x)nminus(d0minusdi)+1Ui+1(1x) bn(1x)nminus(d0minusdi)Vi+1(1x)

)for some an bn isin klowast

Part B If i isin 0 1 r minus 2 and 2d0 minus di minus di+1 le n lt 2d0 minus 2di+1 then

δn = 1 + nminus (2d0 minus 2di+1) le 0fn = anx

di+1Ri+1(1x)gn = bnx

2d0minusdi+1minusnminus1Zn(1x)

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdi+1Ui+1(1x) an(1x)d0minusdi+1minus1Vi+1(1x)bn(1x)nminus(d0minusdi+1)+1Xn(1x) bn(1x)nminus(d0minusdi+1)Yn(1x)

)for some an bn isin klowast where

Zn = Ri mod x2d0minus2di+1minusnRi+1

Xn = Ui minuslfloorRix

2d0minus2di+1minusnRi+1rfloorx2d0minus2di+1minusnUi+1

Yn = Vi minuslfloorRix

2d0minus2di+1minusnRi+1rfloorx2d0minus2di+1minusnVi+1

Daniel J Bernstein and Bo-Yin Yang 41

Part C If n ge 2d0 minus 2drminus1 then

δn = 1 + nminus (2d0 minus 2drminus1) gt 0fn = anx

drminus1Rrminus1(1x)gn = 0

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdrminus1Urminus1(1x) an(1x)d0minusdrminus1minus1Vrminus1(1x)bn(1x)nminus(d0minusdrminus1)+1Ur(1x) bn(1x)nminus(d0minusdrminus1)Vr(1x)

)for some an bn isin klowast

The proof also gives recursive formulas for an bn namely (a0 b0) = (1 1) forthe base case (an bn) = (bnminus1 gnminus1(0)anminus1) for the swapped case and (an bn) =(anminus1 fnminus1(0)bnminus1) for the unswapped case One can if desired augment divstep totrack an and bn these are the denominators eliminated in a ldquofraction-freerdquo computation

Proof By hypothesis degR0 = d0 gt d1 = degR1 Hence d0 is a nonnegative integer andf = R0[d0] +R0[d0 minus 1]x+ middot middot middot+R0[0]xd0 isin k[x] Similarly g = R1[d0 minus 1] +R1[d0 minus 2]x+middot middot middot+R1[0]xd0minus1 isin k[x] this is also true for d0 = 0 with R1 = 0 and g = 0

We induct on n and split the analysis of n into six cases There are three differentformulas (A B C) applicable to different ranges of n and within each range we considerthe first n separately For brevity we write Ri = Ri(1x) U i = Ui(1x) V i = Vi(1x)Xn = Xn(1x) Y n = Yn(1x) Zn = Zn(1x)

Case A1 n = 2d0 minus 2di for some i isin 0 1 r minus 2If n = 0 then i = 0 By assumption δn = 1 as claimed fn = f = xd0R0 so fn = anx

d0R0as claimed where an = 1 gn = g = xd0minus1R1 so gn = bnx

d0minus1R1 as claimed where bn = 1Also Ui = 1 Ui+1 = 0 Vi = 0 Vi+1 = 1 and d0 minus di = 0 so(

an(1x)d0minusdiU i an(1x)d0minusdiminus1V ibn(1x)d0minusdi+1U i+1 bn(1x)d0minusdiV i+1

)=(

1 00 1

)= Tnminus1 middot middot middot T0

as claimedOtherwise n gt 0 so i gt 0 Now 2d0minusdiminus1minusdi le nminus 1 lt 2d0minus 2di so by the inductive

hypothesis δnminus1 = 1 + (nminus 1)minus (2d0 minus 2di) = 0 fnminus1 = anminus1xdiRi for some anminus1 isin klowast

gnminus1 = bnminus1xdiZnminus1 for some bnminus1 isin klowast where Znminus1 = Riminus1 mod xRi and

Tnminus2 middot middot middot T0 =(

anminus1(1x)d0minusdiU i anminus1(1x)d0minusdiminus1V ibnminus1(1x)d0minusdiXnminus1 bnminus1(1x)d0minusdiminus1Y nminus1

)where Xn = Uiminus1 minus bRiminus1xRicxUi and Yn = Viminus1 minus bRiminus1xRicxVi

Note that degZnminus1 le di Observe that

Ri+1 = Riminus1 mod Ri = Znminus1 minus (Znminus1[di]`i)RiUi+1 = Xnminus1 minus (Znminus1[di]`i)UiVi+1 = Ynminus1 minus (Znminus1[di]`i)Vi

The point is that subtracting (Znminus1[di]`i)Ri from Znminus1 eliminates the coefficient of xdi

and produces a polynomial of degree at most diminus 1 namely Znminus1 mod Ri = Riminus1 mod RiStarting from Ri+1 = Znminus1 minus (Znminus1[di]`i)Ri substitute 1x for x and multiply by

anminus1bnminus1`ixdi to see that

anminus1bnminus1`ixdiRi+1 = anminus1`ignminus1 minus bnminus1Znminus1[di]fnminus1

= fnminus1(0)gnminus1 minus gnminus1(0)fnminus1

42 Fast constant-time gcd computation and modular inversion

Now(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1)

= (1 + δnminus1 fnminus1 (fnminus1(0)gnminus1 minus gnminus1(0)fnminus1)x)

Hence δn = 1 = 1 + nminus (2d0 minus 2di) gt 0 as claimed fn = anxdiRi where an = anminus1 isin klowast

as claimed and gn = bnxdiminus1Ri+1 where bn = anminus1bnminus1`i isin klowast as claimed

Finally U i+1 = Xnminus1 minus (Znminus1[di]`i)U i and V i+1 = Y nminus1 minus (Znminus1[di]`i)V i Substi-tute these into the matrix(

an(1x)d0minusdiU i an(1x)d0minusdiminus1V ibn(1x)d0minusdi+1U i+1 bn(1x)d0minusdiV i+1

)

along with an = anminus1 and bn = anminus1bnminus1`i to see that this matrix is

Tnminus1 =(

1 0minusbnminus1Znminus1[di]x anminus1`ix

)times Tnminus2 middot middot middot T0 shown above

Case A2 2d0 minus 2di lt n lt 2d0 minus di minus di+1 for some i isin 0 1 r minus 2By the inductive hypothesis δnminus1 = n minus (2d0 minus 2di) gt 0 fnminus1 = anminus1x

diRi whereanminus1 isin klowast gnminus1 = bnminus1x

2d0minusdiminusnRi+1 where bnminus1 isin klowast and

Tnminus2 middot middot middot T0 =(

anminus1(1x)d0minusdiU i anminus1(1x)d0minusdiminus1V ibnminus1(1x)nminus(d0minusdi)U i+1 bnminus1(1x)nminus(d0minusdi)minus1V i+1

)

Note that gnminus1(0) = bnminus1Ri+1[2d0 minus di minus n] = 0 since 2d0 minus di minus n gt di+1 = degRi+1Thus

(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1) = (1 + δnminus1 fnminus1 fnminus1(0)gnminus1x)

Hence δn = 1 + n minus (2d0 minus 2di) gt 0 as claimed fn = anxdiRi where an = anminus1 isin klowast

as claimed and gn = bnx2d0minusdiminusnminus1Ri+1 where bn = fnminus1(0)bnminus1 = anminus1bnminus1`i isin klowast as

claimedAlso Tnminus1 =

(1 00 anminus1`ix

) Multiply by Tnminus2 middot middot middot T0 above to see that

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdiU i an(1x)d0minusdiminus1V ibn(1x)nminus(d0minusdi)+1U i+1 bn`i(1x)nminus(d0minusdi)V i+1

)as claimed

Case B1 n = 2d0 minus di minus di+1 for some i isin 0 1 r minus 2Note that 2d0 minus 2di le nminus 1 lt 2d0 minus di minus di+1 By the inductive hypothesis δnminus1 =

diminusdi+1 gt 0 fnminus1 = anminus1xdiRi where anminus1 isin klowast gnminus1 = bnminus1x

di+1Ri+1 where bnminus1 isin klowastand

Tnminus2 middot middot middot T0 =(

anminus1(1x)d0minusdiU i anminus1(1x)d0minusdiminus1V ibnminus1(1x)d0minusdi+1U i+1 bnminus1`i(1x)d0minusdi+1minus1V i+1

)

Observe thatZn = Ri minus (`i`i+1)xdiminusdi+1Ri+1

Xn = Ui minus (`i`i+1)xdiminusdi+1Ui+1

Yn = Vi minus (`i`i+1)xdiminusdi+1Vi+1

Substitute 1x for x in the Zn equation and multiply by anminus1bnminus1`i+1xdi to see that

anminus1bnminus1`i+1xdiZn = bnminus1`i+1fnminus1 minus anminus1`ignminus1

= gnminus1(0)fnminus1 minus fnminus1(0)gnminus1

Daniel J Bernstein and Bo-Yin Yang 43

Also note that gnminus1(0) = bnminus1`i+1 6= 0 so

(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1)= (1minus δnminus1 gnminus1 (gnminus1(0)fnminus1 minus fnminus1(0)gnminus1)x)

Hence δn = 1 minus (di minus di+1) le 0 as claimed fn = anxdi+1Ri+1 where an = bnminus1 isin klowast as

claimed and gn = bnxdiminus1Zn where bn = anminus1bnminus1`i+1 isin klowast as claimed

Finally substitute Xn = U i minus (`i`i+1)(1x)diminusdi+1U i+1 and substitute Y n = V i minus(`i`i+1)(1x)diminusdi+1V i+1 into the matrix(

an(1x)d0minusdi+1U i+1 an(1x)d0minusdi+1minus1V i+1bn(1x)d0minusdi+1Xn bn(1x)d0minusdiY n

)

along with an = bnminus1 and bn = anminus1bnminus1`i+1 to see that this matrix is Tnminus1 =(0 1

bnminus1`i+1x minusanminus1`ix

)times Tnminus2 middot middot middot T0 shown above

Case B2 2d0 minus di minus di+1 lt n lt 2d0 minus 2di+1 for some i isin 0 1 r minus 2Now 2d0minusdiminusdi+1 le nminus1 lt 2d0minus2di+1 By the inductive hypothesis δnminus1 = nminus(2d0minus

2di+1) le 0 fnminus1 = anminus1xdi+1Ri+1 for some anminus1 isin klowast gnminus1 = bnminus1x

2d0minusdi+1minusnZnminus1 forsome bnminus1 isin klowast where Znminus1 = Ri mod x2d0minus2di+1minusn+1Ri+1 and

Tnminus2 middot middot middot T0 =(

anminus1(1x)d0minusdi+1U i+1 anminus1(1x)d0minusdi+1minus1V i+1bnminus1(1x)nminus(d0minusdi+1)Xnminus1 bnminus1(1x)nminus(d0minusdi+1)minus1Y nminus1

)where Xnminus1 = Ui minus

lfloorRix

2d0minus2di+1minusn+1Ri+1rfloorx2d0minus2di+1minusn+1Ui+1 and Ynminus1 = Vi minuslfloor

Rix2d0minus2di+1minusn+1Ri+1

rfloorx2d0minus2di+1minusn+1Vi+1

Observe that

Zn = Ri mod x2d0minus2di+1minusnRi+1

= Znminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnRi+1

Xn = Xnminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnUi+1

Yn = Ynminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnVi+1

The point is that degZnminus1 le 2d0 minus di+1 minus n and subtracting

(Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnRi+1

from Znminus1 eliminates the coefficient of x2d0minusdi+1minusnStarting from

Zn = Znminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnRi+1

substitute 1x for x and multiply by anminus1bnminus1`i+1x2d0minusdi+1minusn to see that

anminus1bnminus1`i+1x2d0minusdi+1minusnZn = anminus1`i+1gnminus1 minus bnminus1Znminus1[2d0 minus di+1 minus n]fnminus1

= fnminus1(0)gnminus1 minus gnminus1(0)fnminus1

Now(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1)

= (1 + δnminus1 fnminus1 (fnminus1(0)gnminus1 minus gnminus1(0)fnminus1)x)

Hence δn = 1 +nminus (2d0minus 2di+1) le 0 as claimed fn = anxdi+1Ri+1 where an = anminus1 isin klowast

as claimed and gn = bnx2d0minusdi+1minusnminus1Zn where bn = anminus1bnminus1`i+1 isin klowast as claimed

44 Fast constant-time gcd computation and modular inversion

Finally substitute

Xn = Xnminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)(1x)2d0minus2di+1minusnU i+1

andY n = Y nminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)(1x)2d0minus2di+1minusnV i+1

into the matrix (an(1x)d0minusdi+1U i+1 an(1x)d0minusdi+1minus1V i+1

bn(1x)nminus(d0minusdi+1)+1Xn bn(1x)nminus(d0minusdi+1)Y n

)

along with an = anminus1 and bn = anminus1bnminus1`i+1 to see that this matrix is Tnminus1 =(1 0

minusbnminus1Znminus1[2d0 minus di+1 minus n]x anminus1`i+1x

)times Tnminus2 middot middot middot T0 shown above

Case C1 n = 2d0 minus 2drminus1 This is essentially the same as Case A1 except that i isreplaced with r minus 1 and gn ends up as 0 For completeness we spell out the details

If n = 0 then r = 1 and R1 = 0 so g = 0 By assumption δn = 1 as claimedfn = f = xd0R0 so fn = anx

d0R0 as claimed where an = 1 gn = g = 0 as claimed Definebn = 1 Now Urminus1 = 1 Ur = 0 Vrminus1 = 0 Vr = 1 and d0 minus drminus1 = 0 so(

an(1x)d0minusdrminus1Urminus1 an(1x)d0minusdrminus1minus1V rminus1bn(1x)d0minusdrminus1+1Ur bn(1x)d0minusdrminus1V r

)=(

1 00 1

)= Tnminus1 middot middot middot T0

as claimedOtherwise n gt 0 so r ge 2 Now 2d0 minus drminus2 minus drminus1 le n minus 1 lt 2d0 minus 2drminus1 so

by the inductive hypothesis (for i = r minus 2) δnminus1 = 1 + (n minus 1) minus (2d0 minus 2drminus1) = 0fnminus1 = anminus1x

drminus1Rrminus1 for some anminus1 isin klowast gnminus1 = bnminus1xdrminus1Znminus1 for some bnminus1 isin klowast

where Znminus1 = Rrminus2 mod xRrminus1 and

Tnminus2 middot middot middot T0 =(anminus1(1x)d0minusdrminus1Urminus1 anminus1(1x)d0minusdrminus1minus1V rminus1bnminus1(1x)d0minusdrminus1Xnminus1 bnminus1(1x)d0minusdrminus1minus1Y nminus1

)where Xn = Urminus2 minus bRrminus2xRrminus1cxUrminus1 and Yn = Vrminus2 minus bRrminus2xRrminus1cxVrminus1

Observe that

0 = Rr = Rrminus2 mod Rrminus1 = Znminus1 minus (Znminus1[drminus1]`rminus1)Rrminus1

Ur = Xnminus1 minus (Znminus1[drminus1]`rminus1)Urminus1

Vr = Ynminus1 minus (Znminus1[drminus1]`rminus1)Vrminus1

Starting from Znminus1 minus (Znminus1[drminus1]`rminus1)Rrminus1 = 0 substitute 1x for x and multiplyby anminus1bnminus1`rminus1x

drminus1 to see that

anminus1`rminus1gnminus1 minus bnminus1Znminus1[drminus1]fnminus1 = 0

ie fnminus1(0)gnminus1 minus gnminus1(0)fnminus1 = 0 Now

(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1) = (1 + δnminus1 fnminus1 0)

Hence δn = 1 = 1 + n minus (2d0 minus 2drminus1) gt 0 as claimed fn = anxdrminus1Rrminus1 where

an = anminus1 isin klowast as claimed and gn = 0 as claimedDefine bn = anminus1bnminus1`rminus1 isin klowast and substitute Ur = Xnminus1 minus (Znminus1[drminus1]`rminus1)Urminus1

and V r = Y nminus1 minus (Znminus1[drminus1]`rminus1)V rminus1 into the matrix(an(1x)d0minusdrminus1Urminus1(1x) an(1x)d0minusdrminus1minus1Vrminus1(1x)bn(1x)d0minusdrminus1+1Ur(1x) bn(1x)d0minusdrminus1Vr(1x)

)

Daniel J Bernstein and Bo-Yin Yang 45

to see that this matrix is Tnminus1 =(

1 0minusbnminus1Znminus1[drminus1]x anminus1`rminus1x

)times Tnminus2 middot middot middot T0

shown above

Case C2 n gt 2d0 minus 2drminus1By the inductive hypothesis δnminus1 = n minus (2d0 minus 2drminus1) fnminus1 = anminus1x

drminus1Rrminus1 forsome anminus1 isin klowast gnminus1 = 0 and

Tnminus2 middot middot middot T0 =(anminus1(1x)d0minusdrminus1Urminus1(1x) anminus1(1x)d0minusdrminus1minus1Vrminus1(1x)bnminus1(1x)nminus(d0minusdrminus1)Ur(1x) bnminus1(1x)nminus(d0minusdrminus1)minus1Vr(1x)

)for some bnminus1 isin klowast Hence δn = 1 + δn = 1 + n minus (2d0 minus 2drminus1) gt 0 fn = fnminus1 =

anxdrminus1Rrminus1 where an = anminus1 and gn = 0 Also Tnminus1 =

(1 00 anminus1`rminus1x

)so

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdrminus1Urminus1(1x) an(1x)d0minusdrminus1minus1Vrminus1(1x)bn(1x)nminus(d0minusdrminus1)+1Ur(1x) bn(1x)nminus(d0minusdrminus1)Vr(1x)

)where bn = anminus1bnminus1`rminus1 isin klowast

D Alternative proof of main gcd theorem for polynomialsThis appendix gives another proof of Theorem 62 using the relationship in Theorem C2between the EuclidndashStevin algorithm and iterates of divstep

Second proof of Theorem 62 Define R2 R3 Rr and di and `i as in Theorem B1 anddefine Ui and Vi as in Theorem B2 Then d0 = d gt d1 and all of the hypotheses ofTheorems B1 B2 and C2 are satisfied

By Theorem B1 G = gcdR0 R1 = Rrminus1`rminus1 (since R0 6= 0) so degG = drminus1Also by Theorem B2

Urminus1R0 + Vrminus1R1 = Rrminus1 = `rminus1G

so (Vrminus1`rminus1)R1 equiv G (mod R0) and deg Vrminus1 = d0 minus drminus2 lt d0 minus drminus1 = dminus degG soV = Vrminus1`rminus1

Case 1 drminus1 ge 1 Part C of Theorem C2 applies to n = 2dminus 1 since n = 2d0 minus 1 ge2d0 minus 2drminus1 In particular δn = 1 + n minus (2d0 minus 2drminus1) = 2drminus1 = 2 degG f2dminus1 =a2dminus1x

drminus1Rrminus1(1x) and v2dminus1 = a2dminus1(1x)d0minusdrminus1minus1Vrminus1(1x)Case 2 drminus1 = 0 Then r ge 2 and drminus2 ge 1 Part B of Theorem C2 applies to

i = r minus 2 and n = 2dminus 1 = 2d0 minus 1 since 2d0 minus di minus di+1 = 2d0 minus drminus2 le 2d0 minus 1 = n lt2d0 = 2d0 minus 2di+1 In particular δn = 1 + n minus (2d0 minus 2di+1) = 2drminus1 = 2 degG againf2dminus1 = a2dminus1x

drminus1Rrminus1(1x) and again v2dminus1 = a2dminus1(1x)d0minusdrminus1minus1Vrminus1(1x)In both cases f2dminus1(0) = a2dminus1`rminus1 so xdegGf2dminus1(1x)f2dminus1(0) = Rrminus1`rminus1 = G

and xminusd+1+degGv2dminus1(1x)f2dminus1(0) = Vrminus1`rminus1 = V

E A gcd algorithm for integersThis appendix presents a variable-time algorithm to compute the gcd of two integersThis appendix also explains how a modification to the algorithm produces the StehleacutendashZimmermann algorithm [62]

Theorem E1 Let e be a positive integer Let f be an element of Zlowast2 Let g be anelement of 2eZlowast2 Then there is a unique element q isin

1 3 5 2e+1 minus 1

such that

qg2e minus f isin 2e+1Z2

46 Fast constant-time gcd computation and modular inversion

We define f div2 g = q2e isin Q and we define f mod2 g = f minus qg2e isin 2e+1Z2 Thenf = (f div2 g)g + (f mod2 g) Note that e = ord2 g so we are not creating any ambiguityby omitting e from the f div2 g and f mod2 g notation

Proof First note that the quotient f(g2e) is odd Define q = (f(g2e)) mod 2e+1 thenq isin

1 3 5 2e+1 and qg2e minus f isin 2e+1Z2 by construction To see uniqueness note

that if qprime isin

1 3 5 2e+1 minus 1also satisfies qprimeg2e minus f isin 2e+1Z2 then (q minus qprime)g2e isin

2e+1Z2 so q minus qprime isin 2e+1Z2 since g2e isin Zlowast2 so q mod 2e+1 = qprime mod 2e+1 so q = qprime

Theorem E2 Let R0 be a nonzero element of Z2 Let R1 be an element of 2Z2Recursively define e0 e1 e2 e3 isin Z and R2 R3 isin 2Z2 as follows

bull If i ge 0 and Ri 6= 0 then ei = ord2 Ri

bull If i ge 1 and Ri 6= 0 then Ri+1 = minus((Riminus12eiminus1)mod2 Ri)2ei isin 2Z2

bull If i ge 1 and Ri = 0 then ei Ri+1 ei+1 Ri+2 ei+2 are undefined

Then (Ri2ei

Ri+1

)=(

0 12ei

minus12ei ((Riminus12eiminus1)div2 Ri)2ei

)(Riminus12eiminus1

Ri

)for each i ge 1 where Ri+1 is defined

Proof We have (Riminus12eiminus1)mod2 Ri = Riminus12eiminus1 minus ((Riminus12eiminus1)div2 Ri)Ri Negateand divide by 2ei to obtain the matrix equation

Theorem E3 Let R0 be an odd element of Z Let R1 be an element of 2Z Definee0 e1 e2 e3 and R2 R3 as in Theorem E2 Then Ri isin 2Z for each i ge 1 whereRi is defined Furthermore if t ge 0 and Rt+1 = 0 then |Rt2et | = gcdR0 R1

Proof Write g = gcdR0 R1 isin Z Then g is odd since R0 is oddWe first show by induction on i that Ri isin gZ for each i ge 0 For i isin 0 1 this follows

from the definition of g For i ge 2 we have Riminus2 isin gZ and Riminus1 isin gZ by the inductivehypothesis Now (Riminus22eiminus2)div2 Riminus1 has the form q2eiminus1 for q isin Z by definition ofdiv2 and 22eiminus1+eiminus2Ri = 2eiminus2qRiminus1 minus 2eiminus1Riminus2 by definition of Ri The right side ofthis equation is in gZ so 2middotmiddotmiddotRi is in gZ This implies first that Ri isin Q but also Ri isin Z2so Ri isin Z This also implies that Ri isin gZ since g is odd

By construction Ri isin 2Z2 for each i ge 1 and Ri isin Z for each i ge 0 so Ri isin 2Z foreach i ge 1

Finally assume that t ge 0 and Rt+1 = 0 Then all of R0 R1 Rt are nonzero Writegprime = |Rt2et | Then 2etgprime = |Rt| isin gZ and g is odd so gprime isin gZ

The equation 2eiminus1Riminus2 = 2eiminus2qRiminus1minus22eiminus1+eiminus2Ri shows that if Riminus1 Ri isin gprimeZ thenRiminus2 isin gprimeZ since gprime is odd By construction Rt Rt+1 isin gprimeZ so Rtminus1 Rtminus2 R1 R0 isingprimeZ Hence g = gcdR0 R1 isin gprimeZ Both g and gprime are positive so gprime = g

E4 The gcd algorithm Here is the algorithm featured in this appendixGiven an odd integer R0 and an even integer R1 compute the even integers R2 R3

defined above In other words for each i ge 1 where Ri 6= 0 compute ei = ord2 Rifind q isin

1 3 5 2ei+1 minus 1

such that qRi2ei minus Riminus12eiminus1 isin 2ei+1Z and compute

Ri+1 = (qRi2ei minusRiminus12eiminus1)2ei isin 2Z If Rt+1 = 0 for some t ge 0 output |Rt2et | andstop

We will show in Theorem F26 that this algorithm terminates The output is gcdR0 R1by Theorem E3 This algorithm is closely related to iterating divstep and we will usethis relationship to analyze iterates of divstep see Appendix G

More generally if R0 is an odd integer and R1 is any integer one can computegcdR0 R1 by using the above algorithm to compute gcdR0 2R1 Even more generally

Daniel J Bernstein and Bo-Yin Yang 47

one can compute the gcd of any two integers by first handling the case gcd0 0 = 0 thenhandling powers of 2 in both inputs then applying the odd-input algorithmE5 A non-functioning variant Imagine replacing the subtractions above withadditions find q isin

1 3 5 2ei+1 minus 1

such that qRi2ei +Riminus12eiminus1 isin 2ei+1Z and

compute Ri+1 = (qRi2ei + Riminus12eiminus1)2ei isin 2Z If R0 and R1 are positive then allRi are positive so the algorithm does not terminate This non-terminating algorithm isrelated to posdivstep (see Section 84) in the same way that the algorithm above is relatedto divstep

Daireaux Maume-Deschamps and Valleacutee in [31 Section 6] mentioned this algorithm asa ldquonon-centered LSB algorithmrdquo said that one can easily add a ldquosupplementary stoppingconditionrdquo to ensure termination and dismissed the resulting algorithm as being ldquocertainlyslower than the centered versionrdquoE6 A centered variant the StehleacutendashZimmermann algorithm Imagine usingcentered remainders 1minus 2e minus3minus1 1 3 2e minus 1 modulo 2e+1 rather than uncen-tered remainders

1 3 5 2e+1 minus 1

The above gcd algorithm then turns into the

StehleacutendashZimmermann gcd algorithm from [62]Specifically Stehleacute and Zimmermann begin with integers a b having ord2 a lt ord2 b

If b 6= 0 then ldquogeneralized binary divisionrdquo in [62 Lemma 1] defines q as the unique oddinteger with |q| lt 2ord2 bminusord2 a such that r = a+ qb2ord2 bminusord2 a is divisible by 2ord2 b+1The gcd algorithm in [62 Algorithm Elementary-GB] repeatedly sets (a b)larr (b r) Whenb = 0 the algorithm outputs the odd part of a

It is equivalent to set (a b)larr (b2ord2 b r2ord2 b) since this simply divides all subse-quent results by 2ord2 b In other words starting from an odd a0 and an even b0 recursivelydefine an odd ai+1 and an even bi+1 by the following formulas stopping when bi = 0bull ai+1 = bi2e where e = ord2 bi

bull bi+1 = (ai + qbi2e)2e isin 2Z for a unique q isin 1minus 2e minus3minus1 1 3 2e minus 1Now relabel a0 b0 b1 b2 b3 as R0 R1 R2 R3 R4 to obtain the following recursivedefinition Ri+1 = (qRi2ei +Riminus12eiminus1)2ei isin 2Z and thus(

Ri2ei

Ri+1

)=(

0 12ei

12ei q22ei

)(Riminus12eiminus1

Ri

)

for a unique q isin 1minus 2ei minus3minus1 1 3 2ei minus 1To summarize starting from the StehleacutendashZimmermann algorithm one can obtain the

gcd algorithm featured in this appendix by making the following three changes Firstdivide out 2ei instead of allowing powers of 2 to accumulate Second add Riminus12eiminus1

instead of subtracting it this does not affect the number of iterations for centered q Thirdswitch from centered q to uncentered q

Intuitively centered remainders are smaller than uncentered remainders and it iswell known that centered remainders improve the worst-case performance of Euclidrsquosalgorithm so one might expect that the StehleacutendashZimmermann algorithm performs betterthan our uncentered variant However as noted earlier our analysis produces the oppositeconclusion See Appendix F

F Performance of the gcd algorithm for integersThe possible matrices in Theorem E2 are(

0 12minus12 14

)

(0 12minus12 34

)for ei = 1(

0 14minus14 116

)

(0 14minus14 316

)

(0 14minus14 516

)

(0 14minus14 716

)for ei = 2

48 Fast constant-time gcd computation and modular inversion

etc In this appendix we show that a product of these matrices shrinks the input by a factorexponential in the weight

sumei This implies that

sumei in the algorithm of Appendix E is

O(b) where b is the number of input bits

F1 Lower bounds It is easy to see that each of these matrices has two eigenvaluesof absolute value 12ei This does not imply however that a weight-w product of thesematrices shrinks the input by a factor 12w Concretely the weight-7 product

147

(328 minus380minus158 233

)=(

0 14minus14 116

)(0 12minus12 34

)2( 0 12minus12 14

)(0 12minus12 34

)2

of 6 matrices has spectral radius (maximum absolute value of eigenvalues)

(561 +radic

249185)215 = 003235425826 = 061253803919 7 = 124949900586

The (w7)th power of this matrix is expressed as a weight-w product and has spectralradius 1207071286552w Consequently this proof strategy cannot obtain an O constantbetter than 7(15minus log2(561 +

radic249185)) = 1414169815 We prove a bound twice the

weight on the number of divstep iterations (see Appendix G) so this proof strategy cannotobtain an O constant better than 14(15 minus log2(561 +

radic249185)) = 2828339631 for

the number of divstep iterationsRotating the list of 6 matrices shown above gives 6 weight-7 products with the same

spectral radius In a small search we did not find any weight-w matrices with spectralradius above 061253803919 w Our upper bounds (see below) are 06181640625w for alllarge w

For comparison the non-terminating variant in Appendix E5 allows the matrix(0 12

12 34

) which has spectral radius 1 with eigenvector

(12

) The StehleacutendashZimmermann

algorithm reviewed in Appendix E6 allows the matrix(

0 1212 14

) which has spectral

radius (1 +radic

17)8 = 06403882032 so this proof strategy cannot obtain an O constantbetter than 2(log2(

radic17minus 1)minus 1) = 3110510062 for the number of cdivstep iterations

F2 A warning about worst-case inputs DefineM as the matrix 4minus7(

328 minus380minus158 233

)shown above Then Mminus3 =

(60321097 11336806047137246 88663112

) Applying our gcd algorithm to

inputs (60321097 47137246) applies a series of 18 matrices of total weight 21 in the

notation of Theorem E2 the matrix(

0 12minus12 34

)twice then the matrix

(0 12minus12 14

)

then the matrix(

0 12minus12 34

)twice then the weight-2 matrix

(0 14minus14 116

) all of

these matrices again and all of these matrices a third timeOne might think that these are worst-case inputs to our algorithm analogous to

a standard matrix construction of Fibonacci numbers as worst-case inputs to Euclidrsquosalgorithm We emphasize that these are not worst-case inputs the inputs (22293minus128330)are much smaller and nevertheless require 19 steps of total weight 23 We now explainwhy the analogy fails

The matrix M3 has an eigenvector (1minus053 ) with eigenvalue 1221middot0707 =12148 and an eigenvector (1 078 ) with eigenvalue 1221middot129 = 12271 so ithas spectral radius around 12148 Our proof strategy thus cannot guarantee a gainof more than 148 bits from weight 21 What we actually show below is that weight21 is guaranteed to gain more than 1397 bits (The quantity α21 in Figure F14 is133569231 = 2minus13972)

Daniel J Bernstein and Bo-Yin Yang 49

The matrix Mminus3 has spectral radius 2271 It is not surprising that each column ofMminus3 is close to 27 bits in size the only way to avoid this would be for the other eigenvectorto be pointing in almost exactly the same direction as (1 0) or (0 1) For our inputs takenfrom the first column of Mminus3 the algorithm gains nearly 27 bits from weight 21 almostdouble the guaranteed gain In short these are good inputs not bad inputs

Now replace M with the matrix F =(

0 11 minus1

) which reduces worst-case inputs to

Euclidrsquos algorithm The eigenvalues of F are (radic

5minus 1)2 = 12069 and minus(radic

5 + 1)2 =minus2069 The entries of powers of Fminus1 are Fibonacci numbers again with sizes wellpredicted by the spectral radius of Fminus1 namely 2069 However the spectral radius ofF is larger than 1 The reason that this large eigenvalue does not spoil the terminationof Euclidrsquos algorithm is that Euclidrsquos algorithm inspects sizes and chooses quotients todecrease sizes at each step

Stehleacute and Zimmermann in [62 Section 62] measure the performance of their algorithm

for ldquoworst caserdquo inputs18 obtained from negative powers of the matrix(

0 1212 14

) The

statement that these inputs are ldquoworst caserdquo is not justified On the contrary if these inputshave b bits then they take (073690 + o(1))b steps of the StehleacutendashZimmermann algorithmand thus (14738 + o(1))b cdivsteps which is considerably fewer steps than manyother inputs that we have tried19 for various values of b The quantity 073690 here islog(2) log((

radic17 + 1)2) coming from the smaller eigenvalue minus2(

radic17 + 1) = minus039038

of the above matrix

F3 Norms We use the standard definitions of the 2-norm of a real vector and the2-norm of a real matrix We summarize the basic theory of 2-norms here to keep thispaper self-contained

Definition F4 If v isin R2 then |v|2 is defined asradicv2

1 + v22 where v = (v1 v2)

Definition F5 If P isinM2(R) then |P |2 is defined as max|Pv|2 v isin R2 |v|2 = 1

This maximum exists becausev isin R2 |v|2 = 1

is a compact set (namely the unit

circle) and |Pv|2 is a continuous function of v See also Theorem F11 for an explicitformula for |P |2

Beware that the literature sometimes uses the notation |P |2 for the entrywise 2-normradicP 2

11 + P 212 + P 2

21 + P 222 We do not use entrywise norms of matrices in this paper

Theorem F6 Let v be an element of Z2 If v 6= 0 then |v|2 ge 1

Proof Write v as (v1 v2) with v1 v2 isin Z If v1 6= 0 then v21 ge 1 so v2

1 + v22 ge 1 If v2 6= 0

then v22 ge 1 so v2

1 + v22 ge 1 Either way |v|2 =

radicv2

1 + v22 ge 1

Theorem F7 Let v be an element of R2 Let r be an element of R Then |rv|2 = |r||v|2

Proof Write v as (v1 v2) with v1 v2 isin R Then rv = (rv1 rv2) so |rv|2 =radicr2v2

1 + r2v22 =radic

r2radicv2

1 + v22 = |r||v|2

18Specifically Stehleacute and Zimmermann claim that ldquothe worst case of the binary variantrdquo (ie of theirgcd algorithm) is for inputs ldquoGn and 2Gnminus1 where G0 = 0 G1 = 1 Gn = minusGnminus1 + 4Gnminus2 which givesall binary quotients equal to 1rdquo Daireaux Maume-Deschamps and Valleacutee in [31 Section 23] state thatStehleacute and Zimmermann ldquoexhibited the precise worst-case number of iterationsrdquo and give an equivalentcharacterization of this case

19For example ldquoGnrdquo from [62] is minus5467345 for n = 18 and 2135149 for n = 17 The inputs (minus5467345 2 middot2135149) use 17 iterations of the ldquoGBrdquo divisions from [62] while the smaller inputs (minus1287513 2middot622123) use24 ldquoGBrdquo iterations The inputs (1minus5467345 2135149) use 34 cdivsteps the inputs (1minus2718281 3141592)use 45 cdivsteps the inputs (1minus1287513 622123) use 58 cdivsteps

50 Fast constant-time gcd computation and modular inversion

Theorem F8 Let P be an element of M2(R) Let r be an element of R Then|rP |2 = |r||P |2

Proof By definition |rP |2 = max|rPv|2 v isin R2 |v|2 = 1

By Theorem F7 this is the

same as max|r||Pv|2 = |r|max|Pv|2 = |r||P |2

Theorem F9 Let P be an element of M2(R) Let v be an element of R2 Then|Pv|2 le |P |2|v|2

Proof If v = 0 then Pv = 0 and |Pv|2 = 0 = |P |2|v|2Otherwise |v|2 6= 0 Write w = v|v|2 Then |w|2 = 1 so |Pw|2 le |P |2 so |Pv|2 =

|Pw|2|v|2 le |P |2|v|2

Theorem F10 Let PQ be elements of M2(R) Then |PQ|2 le |P |2|Q|2

Proof If v isin R2 and |v|2 = 1 then |PQv|2 le |P |2|Qv|2 le |P |2|Q|2|v|2

Theorem F11 Let P be an element of M2(R) Assume that P lowastP =(a bc d

)where P lowast

is the transpose of P Then |P |2 =radic

a+d+radic

(aminusd)2+4b2

2

Proof P lowastP isin M2(R) is its own transpose so by the spectral theorem it has twoorthonormal eigenvectors with real eigenvalues In other words R2 has a basis e1 e2 suchthat

bull |e1|2 = 1

bull |e2|2 = 1

bull the dot product elowast1e2 is 0 (here we view vectors as column matrices)

bull P lowastPe1 = λ1e1 for some λ1 isin R and

bull P lowastPe2 = λ2e2 for some λ2 isin R

Any v isin R2 with |v|2 = 1 can be written uniquely as c1e1 + c2e2 for some c1 c2 isin R withc2

1 + c22 = 1 explicitly c1 = vlowaste1 and c2 = vlowaste2 Then P lowastPv = λ1c1e1 + λ2c2e2

By definition |P |22 is the maximum of |Pv|22 over v isin R2 with |v|2 = 1 Note that|Pv|22 = (Pv)lowast(Pv) = vlowastP lowastPv = λ1c

21 + λ2c

22 with c1 c2 defined as above For example

0 le |Pe1|22 = λ1 and 0 le |Pe2|22 = λ2 Thus |Pv|22 achieves maxλ1 λ2 It cannot belarger than this if c2

1 +c22 = 1 then λ1c

21 +λ2c

22 le maxλ1 λ2 Hence |P |22 = maxλ1 λ2

To finish the proof we make the spectral theorem explicit for the matrix P lowastP =(a bc d

)isinM2(R) Note that (P lowastP )lowast = P lowastP so b = c The characteristic polynomial of

this matrix is (xminus a)(xminus d)minus b2 = x2 minus (a+ d)x+ adminus b2 with roots

a+ dplusmnradic

(a+ d)2 minus 4(adminus b2)2 =

a+ dplusmnradic

(aminus d)2 + 4b2

2

The largest eigenvalue of P lowastP is thus (a+ d+radic

(aminus d)2 + 4b2)2 and |P |2 is the squareroot of this

Theorem F12 Let P be an element of M2(R) Assume that P lowastP =(a bc d

)where P lowast

is the transpose of P Let N be an element of R Then N ge |P |2 if and only if N ge 02N2 ge a+ d and (2N2 minus aminus d)2 ge (aminus d)2 + 4b2

Daniel J Bernstein and Bo-Yin Yang 51

earlybounds=011126894912^2037794112^2148808332^2251652192^206977232^2078823132^2483067332^239920452^22104392132^25112816812^25126890072^27138243032^28142578172^27156342292^29163862452^29179429512^31185834332^31197136532^32204328912^32211335692^31223282932^33238004212^35244892332^35256040592^36267388892^37271122152^35282767752^3729849732^36308292972^40312534432^39326254052^4133956252^39344650552^42352865672^42361759512^42378586372^4538656472^4239404692^4240247512^42412409172^46425934112^48433643372^48448890152^50455437912^5046418992^47472050052^50489977912^53493071912^52507544232^5451575272^51522815152^54536940732^56542122492^55552582732^56566360932^58577810812^59589529592^60592914752^59607185992^61618789972^62625348212^62633292852^62644043412^63659866332^65666035532^65

defalpha(w)ifwgt=67return(6331024)^wreturnearlybounds[w]

assertall(alpha(w)^49lt2^(-(34w-23))forwinrange(31100))assertmin((6331024)^walpha(w)forwinrange(68))==633^5(2^30165219)

Figure F14 Definition of αw for w isin 0 1 2 computer verificationthat αw lt 2minus(34wminus23)49 for 31 le w le 99 and computer verification thatmin(6331024)wαw 0 le w le 67 = 6335(230165219) If w ge 67 then αw =(6331024)w

Our upper-bound computations work with matrices with rational entries We useTheorem F12 to compare the 2-norms of these matrices to various rational bounds N Allof the computations here use exact arithmetic avoiding the correctness questions thatwould have shown up if we had used approximations to the square roots in Theorem F11Sage includes exact representations of algebraic numbers such as |P |2 but switching toTheorem F12 made our upper-bound computations an order of magnitude faster

Proof Assume thatN ge 0 that 2N2 ge a+d and that (2N2minusaminusd)2 ge (aminusd)2+4b2 Then2N2minusaminusd ge

radic(aminus d)2 + 4b2 since 2N2minusaminusd ge 0 so N2 ge (a+d+

radic(aminus d)2 + 4b2)2

so N geradic

(a+ d+radic

(aminus d)2 + 4b2)2 since N ge 0 so N ge |P |2 by Theorem F11Conversely assume that N ge |P |2 Then N ge 0 Also N2 ge |P |22 so 2N2 ge

2|P |22 = a + d +radic

(aminus d)2 + 4b2 by Theorem F11 In particular 2N2 ge a + d and(2N2 minus aminus d)2 ge (aminus d)2 + 4b2

F13 Upper bounds We now show that every weight-w product of the matrices inTheorem E2 has norm at most αw Here α0 α1 α2 is an exponentially decreasingsequence of positive real numbers defined in Figure F14

Definition F15 For e isin 1 2 and q isin

1 3 5 2e+1 minus 1 define Meq =(

0 12eminus12e q22e

)

52 Fast constant-time gcd computation and modular inversion

Theorem F16 |Meq|2 lt (1 +radic

2)2e if e isin 1 2 and q isin

1 3 5 2e+1 minus 1

Proof Define(a bc d

)= MlowasteqMeq Then a = 122e b = c = minusq23e and d = 122e +

q224e so a+ d = 222e + q224e lt 622e and (aminus d)2 + 4b2 = 4q226e + q428e le 3224eBy Theorem F12 |Meq|2 =

radic(a+ d+

radic(aminus d)2 + 4b2)2 le

radic(622e +

radic3222e)2 =

(1 +radic

2)2e

Definition F17 Define βw = infαw+jαj j ge 0 for w isin 0 1 2

Theorem F18 If w isin 0 1 2 then βw = minαw+jαj 0 le j le 67 If w ge 67then βw = (6331024)w6335(230165219)

The first statement reduces the computation of βw to a finite computation Thecomputation can be further optimized but is not a bottleneck for us Similar commentsapply to γwe in Theorem F20

Proof By definition αw = (6331024)w for all w ge 67 If j ge 67 then also w + j ge 67 soαw+jαj = (6331024)w+j(6331024)j = (6331024)w In short all terms αw+jαj forj ge 67 are identical so the infimum of αw+jαj for j ge 0 is the same as the minimum for0 le j le 67

If w ge 67 then αw+jαj = (6331024)w+jαj The quotient (6331024)jαj for0 le j le 67 has minimum value 6335(230165219)

Definition F19 Define γwe = infβw+j2j70169 j ge e

for w isin 0 1 2 and

e isin 1 2 3

This quantity γwe is designed to be as large as possible subject to two conditions firstγwe le βw+e2e70169 used in Theorem F21 second γwe le γwe+1 used in Theorem F22

Theorem F20 γwe = minβw+j2j70169 e le j le e+ 67

if w isin 0 1 2 and

e isin 1 2 3 If w + e ge 67 then γwe = 2e(6331024)w+e(70169)6335(230165219)

Proof If w + j ge 67 then βw+j = (6331024)w+j6335(230165219) so βw+j2j70169 =(6331024)w(633512)j(70169)6335(230165219) which increases monotonically with j

In particular if j ge e+ 67 then w + j ge 67 so βw+j2j70169 ge βw+e+672e+6770169Hence γwe = inf

βw+j2j70169 j ge e

= min

βw+j2j70169 e le j le e+ 67

Also if

w + e ge 67 then w + j ge 67 for all j ge e so

γwe =(

6331024

)w (633512

)e 70169

6335

230165219 = 2e(

6331024

)w+e 70169

6335

230165219

as claimed

Theorem F21 Define S as the smallest subset of ZtimesM2(Q) such that

bull(0(

1 00 1))isin S and

bull if (wP ) isin S and w = 0 or |P |2 gt βw and e isin 1 2 3 and |P |2 gt γwe andq isin

1 3 5 2e+1 minus 1

then (e+ wM(e q)P ) isin S

Assume that |P |2 le αw for each (wP ) isin S Then |M(ej qj) middot middot middotM(e1 q1)|2 le αe1+middotmiddotmiddot+ej

for all j isin 0 1 2 all e1 ej isin 1 2 3 and all q1 qj isin Z such thatqi isin

1 3 5 2ei+1 minus 1

for each i

Daniel J Bernstein and Bo-Yin Yang 53

The idea here is that instead of searching through all products P of matrices M(e q)of total weight w to see whether |P |2 le αw we prune the search in a way that makesthe search finite across all w (see Theorem F22) while still being sure to find a minimalcounterexample if one exists The first layer of pruning skips all descendants QP of P ifw gt 0 and |P |2 le βw the point is that any such counterexample QP implies a smallercounterexample Q The second layer of pruning skips weight-(w + e) children M(e q)P ofP if |P |2 le γwe the point is that any such child is eliminated by the first layer of pruning

Proof. Induct on j.

If j = 0 then M(e_j, q_j) ··· M(e_1, q_1) = (1 0; 0 1), with norm 1 = α_0. Assume from now on that j ≥ 1.

Case 1: |M(e_i, q_i) ··· M(e_1, q_1)|_2 ≤ β_{e_1+···+e_i} for some i ∈ {1, 2, ..., j}. Then j − i < j, so |M(e_j, q_j) ··· M(e_{i+1}, q_{i+1})|_2 ≤ α_{e_{i+1}+···+e_j} by the inductive hypothesis, so |M(e_j, q_j) ··· M(e_1, q_1)|_2 ≤ β_{e_1+···+e_i} α_{e_{i+1}+···+e_j} by Theorem F.10; but β_{e_1+···+e_i} α_{e_{i+1}+···+e_j} ≤ α_{e_1+···+e_j} by definition of β.

Case 2: |M(e_i, q_i) ··· M(e_1, q_1)|_2 > β_{e_1+···+e_i} for every i ∈ {1, 2, ..., j}.

Suppose that |M(e_{i−1}, q_{i−1}) ··· M(e_1, q_1)|_2 ≤ γ_{e_1+···+e_{i−1}, e_i} for some i ∈ {1, 2, ..., j}. By Theorem F.16, |M(e_i, q_i)|_2 < (1 + √2)/2^{e_i} < (169/70)/2^{e_i}. By Theorem F.10,

    |M(e_i, q_i) ··· M(e_1, q_1)|_2 ≤ |M(e_i, q_i)|_2 · γ_{e_1+···+e_{i−1}, e_i} ≤ 169 γ_{e_1+···+e_{i−1}, e_i}/(70 · 2^{e_i}) ≤ β_{e_1+···+e_i},

contradiction. Thus |M(e_{i−1}, q_{i−1}) ··· M(e_1, q_1)|_2 > γ_{e_1+···+e_{i−1}, e_i} for all i ∈ {1, 2, ..., j}.

Now (e_1 + ··· + e_i, M(e_i, q_i) ··· M(e_1, q_1)) ∈ S for each i ∈ {0, 1, 2, ..., j}. Indeed, induct on i. The base case i = 0 is simply (0, (1 0; 0 1)) ∈ S. If i ≥ 1 then (w, P) ∈ S by the inductive hypothesis, where w = e_1 + ··· + e_{i−1} and P = M(e_{i−1}, q_{i−1}) ··· M(e_1, q_1). Furthermore:

• w = 0 (when i = 1) or |P|_2 > β_w (when i ≥ 2, by definition of this case);

• e_i ∈ {1, 2, 3, ...};

• |P|_2 > γ_{w,e_i} (as shown above); and

• q_i ∈ {1, 3, 5, ..., 2^{e_i+1} − 1}.

Hence (e_1 + ··· + e_i, M(e_i, q_i) ··· M(e_1, q_1)) = (e_i + w, M(e_i, q_i)P) ∈ S by definition of S.

In particular (w, P) ∈ S with w = e_1 + ··· + e_j and P = M(e_j, q_j) ··· M(e_1, q_1), so |P|_2 ≤ α_w by hypothesis.

Theorem F.22. In Theorem F.21, S is finite and |P|_2 ≤ α_w for each (w, P) ∈ S.

Proof. This is a computer proof. We run the Sage script in Figure F.23. We observe that it terminates successfully (in slightly over 2546 minutes using Sage 8.6 on one core of a 3.5GHz Intel Xeon E3-1275 v3 CPU) and prints 3787975117.

What the script does is search depth-first through the elements of S, verifying that each (w, P) ∈ S satisfies |P|_2 ≤ α_w. The output is the number of elements searched, so S has at most 3787975117 elements.

Specifically, if (w, P) ∈ S, then verify(w,P) checks that |P|_2 ≤ α_w and recursively calls verify on all descendants of (w, P). Actually, for efficiency, the script scales each matrix to have integer entries: it works with 2^{2e} M(e, q) (labeled scaledM(e,q) in the script) instead of M(e, q), and it works with 4^w P (labeled P in the script) instead of P.

The script computes β_w as shown in Theorem F.18, computes γ_{w,e} as shown in Theorem F.20, and compares |P|_2 to various bounds as shown in Theorem F.12.

If w > 0 and |P|_2 ≤ β_w then verify(w,P) returns. There are no descendants of (w, P) in this case. Also |P|_2 ≤ α_w since β_w ≤ α_w/α_0 = α_w.


from alpha import alpha
from memoized import memoized

R = MatrixSpace(ZZ, 2)
def scaledM(e, q): return R((0, 2^e, -2^e, q))

@memoized
def beta(w): return min(alpha(w+j)/alpha(j) for j in range(68))

@memoized
def gamma(w, e): return min(beta(w+j)*2^j*70/169 for j in range(e, e+68))

def spectralradiusisatmost(PP, N):  # assuming PP has form P.transpose()*P
  (a, b), (c, d) = PP
  X = 2*N^2 - a - d
  return N >= 0 and X >= 0 and X^2 >= (a-d)^2 + 4*b^2

def verify(w, P):
  nodes = 1
  PP = P.transpose()*P
  if w > 0 and spectralradiusisatmost(PP, 4^w*beta(w)): return nodes
  assert spectralradiusisatmost(PP, 4^w*alpha(w))
  for e in PositiveIntegers():
    if spectralradiusisatmost(PP, 4^w*gamma(w, e)): return nodes
    for q in range(1, 2^(e+1), 2):
      nodes += verify(e+w, scaledM(e, q)*P)

print verify(0, R(1))

Figure F.23: Sage script verifying |P|_2 ≤ α_w for each (w, P) ∈ S. Output is the number of pairs (w, P) considered if verification succeeds, or an assertion failure if verification fails.

Otherwise verify(w,P) asserts that |P|_2 ≤ α_w, i.e., it checks that |P|_2 ≤ α_w and raises an exception if not. It then tries successively e = 1, e = 2, e = 3, etc., continuing as long as |P|_2 > γ_{w,e}. For each e where |P|_2 > γ_{w,e}, verify(w,P) recursively handles (e + w, M(e, q)P) for each q ∈ {1, 3, 5, ..., 2^{e+1} − 1}.

Note that γ_{w,e} ≤ γ_{w,e+1} ≤ ···, so if |P|_2 ≤ γ_{w,e} then also |P|_2 ≤ γ_{w,e+1} etc. This is where it is important that γ_{w,e} is not simply defined as β_{w+e} · 2^e · 70/169.

To summarize, verify(w,P) recursively enumerates all descendants of (w, P). Hence verify(0,R(1)) recursively enumerates all elements (w, P) of S, checking |P|_2 ≤ α_w for each (w, P).

Theorem F.24. If j ∈ {0, 1, 2, ...}; e_1, ..., e_j ∈ {1, 2, 3, ...}; q_1, ..., q_j ∈ Z; and q_i ∈ {1, 3, 5, ..., 2^{e_i+1} − 1} for each i; then |M(e_j, q_j) ··· M(e_1, q_1)|_2 ≤ α_{e_1+···+e_j}.

Proof. This is the conclusion of Theorem F.21, so it suffices to check the assumptions. Define S as in Theorem F.21; then |P|_2 ≤ α_w for each (w, P) ∈ S by Theorem F.22.


Theorem F.25. In the situation of Theorem E.3, assume that R_0, R_1, R_2, ..., R_{j+1} are defined. Then |(R_j/2^{e_j}, R_{j+1})|_2 ≤ α_{e_1+···+e_j} |(R_0, R_1)|_2, and if e_1 + ··· + e_j ≥ 67 then e_1 + ··· + e_j ≤ log_{1024/633} |(R_0, R_1)|_2.

If R_0 and R_1 are each bounded by 2^b in absolute value, then |(R_0, R_1)|_2 is bounded by 2^{b+0.5}, so log_{1024/633} |(R_0, R_1)|_2 ≤ (b + 0.5)/log_2(1024/633) = (b + 0.5) · 1.4410….

Proof. By Theorem E.2, (R_i/2^{e_i}, R_{i+1}) = M(e_i, q_i) (R_{i−1}/2^{e_{i−1}}, R_i), where q_i = 2^{e_i} ((R_{i−1}/2^{e_{i−1}}) div_2 R_i) ∈ {1, 3, 5, ..., 2^{e_i+1} − 1}. Hence

    (R_j/2^{e_j}, R_{j+1}) = M(e_j, q_j) ··· M(e_1, q_1) (R_0, R_1).

The product M(e_j, q_j) ··· M(e_1, q_1) has 2-norm at most α_w, where w = e_1 + ··· + e_j, so |(R_j/2^{e_j}, R_{j+1})|_2 ≤ α_w |(R_0, R_1)|_2 by Theorem F.9.

By Theorem F.6, |(R_j/2^{e_j}, R_{j+1})|_2 ≥ 1. If e_1 + ··· + e_j ≥ 67 then α_{e_1+···+e_j} = (633/1024)^{e_1+···+e_j}, so 1 ≤ (633/1024)^{e_1+···+e_j} |(R_0, R_1)|_2.

Theorem F.26. In the situation of Theorem E.3, there exists t ≥ 0 such that R_{t+1} = 0.

Proof. Suppose not. Then R_0, R_1, ..., R_j, R_{j+1} are all defined for arbitrarily large j. Take j ≥ 67 with j > log_{1024/633} |(R_0, R_1)|_2. Then e_1 + ··· + e_j ≥ 67, so j ≤ e_1 + ··· + e_j ≤ log_{1024/633} |(R_0, R_1)|_2 < j by Theorem F.25, contradiction.

G Proof of main gcd theorem for integers

This appendix shows how the gcd algorithm from Appendix E is related to iterating divstep. This appendix then uses the analysis from Appendix F to put bounds on the number of divstep iterations needed.

Theorem G.1 (2-adic division as a sequence of division steps). In the situation of Theorem E.1, divstep^{2e}(1, f, g/2) = (1, g/2^e, (qg/2^e − f)/2^{e+1}).

In other words, divstep^{2e}(1, f, g/2) = (1, g/2^e, −(f mod_2 g)/2^{e+1}).

Proof. Define

    z_e = (g/2^e − f)/2 ∈ Z_2,
    z_{e+1} = (z_e + (z_e mod 2) g/2^e)/2 ∈ Z_2,
    z_{e+2} = (z_{e+1} + (z_{e+1} mod 2) g/2^e)/2 ∈ Z_2,
    ...
    z_{2e} = (z_{2e−1} + (z_{2e−1} mod 2) g/2^e)/2 ∈ Z_2.

Then

    z_e = (g/2^e − f)/2,
    z_{e+1} = ((1 + 2(z_e mod 2)) g/2^e − f)/4,
    z_{e+2} = ((1 + 2(z_e mod 2) + 4(z_{e+1} mod 2)) g/2^e − f)/8,
    ...
    z_{2e} = ((1 + 2(z_e mod 2) + ··· + 2^e(z_{2e−1} mod 2)) g/2^e − f)/2^{e+1}.

In short, z_{2e} = (q′g/2^e − f)/2^{e+1} where

    q′ = 1 + 2(z_e mod 2) + ··· + 2^e(z_{2e−1} mod 2) ∈ {1, 3, 5, ..., 2^{e+1} − 1}.

Hence q′g/2^e − f = 2^{e+1} z_{2e} ∈ 2^{e+1}Z_2, so q′ = q by the uniqueness statement in Theorem E.1. Note that this construction gives an alternative proof of the existence of q in Theorem E.1.

All that remains is to prove divstep^{2e}(1, f, g/2) = (1, g/2^e, (qg/2^e − f)/2^{e+1}), i.e., divstep^{2e}(1, f, g/2) = (1, g/2^e, z_{2e}).

If 1 ≤ i < e then g/2^i ∈ 2Z_2, so divstep(i, f, g/2^i) = (i + 1, f, g/2^{i+1}). By induction, divstep^{e−1}(1, f, g/2) = (e, f, g/2^e). By assumption e > 0 and g/2^e is odd, so divstep(e, f, g/2^e) = (1 − e, g/2^e, (g/2^e − f)/2) = (1 − e, g/2^e, z_e). What remains is to prove that divstep^e(1 − e, g/2^e, z_e) = (1, g/2^e, z_{2e}).

If 0 ≤ i ≤ e − 1 then 1 − e + i ≤ 0, so divstep(1 − e + i, g/2^e, z_{e+i}) = (2 − e + i, g/2^e, (z_{e+i} + (z_{e+i} mod 2) g/2^e)/2) = (2 − e + i, g/2^e, z_{e+i+1}). By induction, divstep^i(1 − e, g/2^e, z_e) = (1 − e + i, g/2^e, z_{e+i}) for 0 ≤ i ≤ e. In particular divstep^e(1 − e, g/2^e, z_e) = (1, g/2^e, z_{2e}), as claimed.

Theorem G.2 (2-adic gcd as a sequence of division steps). In the situation of Theorem E.3, assume that R_0, R_1, ..., R_j, R_{j+1} are defined. Then divstep^{2w}(1, R_0, R_1/2) = (1, R_j/2^{e_j}, R_{j+1}/2), where w = e_1 + ··· + e_j.

Therefore, if R_{t+1} = 0, then divstep^n(1, R_0, R_1/2) = (1 + n − 2w, R_t/2^{e_t}, 0) for all n ≥ 2w, where w = e_1 + ··· + e_t.

Proof. Induct on j. If j = 0 then w = 0, so divstep^{2w}(1, R_0, R_1/2) = (1, R_0, R_1/2), and R_0 = R_0/2^{e_0} since R_0 is odd by hypothesis.

If j ≥ 1 then by definition R_{j+1} = −((R_{j−1}/2^{e_{j−1}}) mod_2 R_j)/2^{e_j}. Furthermore

    divstep^{2e_1+···+2e_{j−1}}(1, R_0, R_1/2) = (1, R_{j−1}/2^{e_{j−1}}, R_j/2)

by the inductive hypothesis, so

    divstep^{2e_1+···+2e_j}(1, R_0, R_1/2) = divstep^{2e_j}(1, R_{j−1}/2^{e_{j−1}}, R_j/2) = (1, R_j/2^{e_j}, R_{j+1}/2)

by Theorem G.1.

Theorem G.3. Let R_0 be an odd element of Z. Let R_1 be an element of 2Z. Then there exists n ∈ {0, 1, 2, ...} such that divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0}. Furthermore, any such n has divstep^n(1, R_0, R_1/2) ∈ Z × {G, −G} × {0}, where G = gcd{R_0, R_1}.

Proof. Define e_0, e_1, e_2, e_3, ... and R_2, R_3, ... as in Theorem E.2. By Theorem F.26 there exists t ≥ 0 such that R_{t+1} = 0. By Theorem E.3, R_t/2^{e_t} ∈ {G, −G}. By Theorem G.2, divstep^{2w}(1, R_0, R_1/2) = (1, R_t/2^{e_t}, 0) ∈ Z × {G, −G} × {0}, where w = e_1 + ··· + e_t.

Conversely, assume that n ∈ {0, 1, 2, ...} and that divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0}. Write divstep^n(1, R_0, R_1/2) as (δ, f, 0). Then divstep^{n+k}(1, R_0, R_1/2) = (k + δ, f, 0) for all k ∈ {0, 1, 2, ...}, so in particular divstep^{2w+n}(1, R_0, R_1/2) = (2w + δ, f, 0). Also divstep^{2w+k}(1, R_0, R_1/2) = (k + 1, R_t/2^{e_t}, 0) for all k ∈ {0, 1, 2, ...}, so in particular divstep^{2w+n}(1, R_0, R_1/2) = (n + 1, R_t/2^{e_t}, 0). Hence f = R_t/2^{e_t} ∈ {G, −G} and divstep^n(1, R_0, R_1/2) ∈ Z × {G, −G} × {0}, as claimed.

Theorem G.4. Let R_0 be an odd element of Z. Let R_1 be an element of 2Z. Define b = log_2 √(R_0² + R_1²). Assume that b ≤ 21. Then divstep^{⌊19b/7⌋}(1, R_0, R_1/2) ∈ Z × Z × {0}.

Proof. If divstep(δ, f, g) = (δ_1, f_1, g_1) then divstep(δ, −f, −g) = (δ_1, −f_1, −g_1). Hence divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0} if and only if divstep^n(1, −R_0, −R_1/2) ∈ Z × Z × {0}. From now on assume without loss of generality that R_0 > 0.


 s   W_s              R_0      R_1
 0   1                1        0
 1   5                1        -2
 2   5                1        -2
 3   5                1        -2
 4   17               1        -4
 5   17               1        -4
 6   65               1        -8
 7   65               1        -8
 8   157              11       -6
 9   181              9        -10
 10  421              15       -14
 11  541              21       -10
 12  1165             29       -18
 13  1517             29       -26
 14  2789             17       -50
 15  3653             47       -38
 16  8293             47       -78
 17  8293             47       -78
 18  24245            121      -98
 19  27361            55       -156
 20  56645            191      -142
 21  79349            215      -182
 22  132989           283      -230
 23  213053           133      -442
 24  415001           451      -460
 25  613405           501      -602
 26  1345061          719      -910
 27  1552237          1021     -714
 28  2866525          789      -1498
 29  4598269          1355     -1662
 30  7839829          777      -2690
 31  12565573         2433     -2578
 32  22372085         2969     -3682
 33  35806445         5443     -2486
 34  71013905         4097     -7364
 35  74173637         5119     -6926
 36  205509533        9973     -10298
 37  226964725        11559    -9662
 38  487475029        11127    -19070
 39  534274997        13241    -18946
 40  1543129037       24749    -30506
 41  1639475149       20307    -35030
 42  3473731181       39805    -43466
 43  4500780005       49519    -45262
 44  9497198125       38051    -89718
 45  12700184149      82743    -76510
 46  29042662405      152287   -76494
 47  36511782869      152713   -114850
 48  82049276437      222249   -180706
 49  90638999365      220417   -205074
 50  234231548149     229593   -426050
 51  257130797053     338587   -377478
 52  466845672077     301069   -613354
 53  622968125533     437163   -657158
 54  1386285660565    600809   -1012578
 55  1876581808109    604547   -1229270
 56  4208169535453    1352357  -1542498

Figure G.5: For each s ∈ {0, 1, ..., 56}: the minimum value W_s of R_0² + R_1², where R_0 is an odd integer, R_1 is an even integer, and (R_0, R_1) requires ≥ s iterations of divstep; also an example of (R_0, R_1) reaching this minimum.

The rest of the proof is a computer proof. We enumerate all pairs (R_0, R_1) where R_0 is a positive odd integer, R_1 is an even integer, and R_0² + R_1² ≤ 2^42. For each (R_0, R_1) we iterate divstep to find the minimum n ≥ 0 where divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0}. We then say that (R_0, R_1) "needs n steps".
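As a minimal illustration of this "needs n steps" computation, here is a self-contained Python sketch for a single pair (R_0, R_1), reusing the integer version of divstep defined in Section 8; the exhaustive enumeration up to R_0² + R_1² ≤ 2^42 and the resulting table of W_s values are not reproduced here.

    def divstep(delta, f, g):
        # one 2-adic division step on integer inputs (see Section 8)
        if delta > 0 and g & 1:
            return 1 - delta, g, (g - f) // 2
        return 1 + delta, f, (g + (g & 1) * f) // 2

    def steps_needed(R0, R1):
        # minimum n >= 0 with divstep^n(1, R0, R1/2) in Z x Z x {0}
        delta, f, g, n = 1, R0, R1 // 2, 0
        while g != 0:
            delta, f, g = divstep(delta, f, g)
            n += 1
        return n

    print(steps_needed(1, -2))   # prints 3; (1, -2) is the example listed for s = 1, 2, 3 in Figure G.5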

For each s ≥ 0, define W_s as the smallest R_0² + R_1² where (R_0, R_1) needs ≥ s steps. The computation shows that each (R_0, R_1) needs ≤ 56 steps, and also shows the values W_s for each s ∈ {0, 1, ..., 56}. We display these values in Figure G.5. We check for each s ∈ {0, 1, ..., 56} that 2^{14s} ≤ W_s^{19}.

Finally, we claim that each (R_0, R_1) needs ≤ ⌊19b/7⌋ steps, where b = log_2 √(R_0² + R_1²). If not, then (R_0, R_1) needs ≥ s steps where s = ⌊19b/7⌋ + 1. We have s ≤ 56 since (R_0, R_1) needs ≤ 56 steps. We also have W_s ≤ R_0² + R_1² by definition of W_s. Hence 2^{14s/19} ≤ R_0² + R_1², so 14s/19 ≤ 2b, so s ≤ 19b/7, contradiction.

Theorem G.6 (Bounds on the number of divsteps required). Let R_0 be an odd element of Z. Let R_1 be an element of 2Z. Define b = log_2 √(R_0² + R_1²). Define n = ⌊19b/7⌋ if b ≤ 21; n = ⌊(49b + 23)/17⌋ if 21 < b ≤ 46; and n = ⌊49b/17⌋ if b > 46. Then divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0}.


All three cases have n ≤ ⌊(49b + 23)/17⌋, since 19/7 < 49/17.

Proof. If b ≤ 21 then divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0} by Theorem G.4.

Assume from now on that b > 21. This implies n ≥ ⌊49b/17⌋ ≥ 60.

Define e_0, e_1, e_2, e_3, ... and R_2, R_3, ... as in Theorem E.2. By Theorem F.26 there exists t ≥ 0 such that R_{t+1} = 0. By Theorem E.3, R_t/2^{e_t} ∈ {G, −G}. By Theorem G.2, divstep^{2w}(1, R_0, R_1/2) = (1, R_t/2^{e_t}, 0) ∈ Z × {G, −G} × {0}, where w = e_1 + ··· + e_t.

To finish the proof we will show that 2w ≤ n; hence divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0} as claimed.

Suppose that 2w > n. Both 2w and n are integers, so 2w ≥ n + 1 ≥ 61, so w ≥ 31.

By Theorem F.6 and Theorem F.25, 1 ≤ |(R_t/2^{e_t}, R_{t+1})|_2 ≤ α_w |(R_0, R_1)|_2 = α_w 2^b, so log_2(1/α_w) ≤ b.

Case 1: w ≥ 67. By definition 1/α_w = (1024/633)^w, so w log_2(1024/633) ≤ b. We have 34/49 < log_2(1024/633), so 34w/49 < b, so n + 1 ≤ 2w < 49b/17 < (49b + 23)/17. This contradicts the definition of n as the largest integer below 49b/17 if b > 46, and as the largest integer below (49b + 23)/17 if b ≤ 46.

Case 2: w ≤ 66. Then α_w < 2^{−(34w−23)/49} since w ≥ 31, so b ≥ log_2(1/α_w) > (34w − 23)/49, so n + 1 ≤ 2w ≤ (49b + 23)/17. This again contradicts the definition of n if b ≤ 46. If b > 46 then n ≥ ⌊49 · 46/17⌋ = 132, so 2w ≥ n + 1 > 132 ≥ 2w, contradiction.

H Comparison to performance of cdivstep

This appendix briefly compares the analysis of divstep from Appendix G to an analogous analysis of cdivstep.

First, modify Theorem E.1 to take the unique element q ∈ {±1, ±3, ..., ±(2^e − 1)} such that f + qg/2^e ∈ 2^{e+1}Z_2. The corresponding modification of Theorem G.1 says that cdivstep^{2e}(1, f, g/2) = (1, g/2^e, (f + qg/2^e)/2^{e+1}). The proof proceeds identically, except that z_j = (z_{j−1} − (z_{j−1} mod 2) g/2^e)/2 ∈ Z_2 for e < j < 2e and z_{2e} = (z_{2e−1} + (z_{2e−1} mod 2) g/2^e)/2 ∈ Z_2. Also, we have q = −1 − 2(z_e mod 2) − ··· − 2^{e−1}(z_{2e−2} mod 2) + 2^e(z_{2e−1} mod 2) ∈ {±1, ±3, ..., ±(2^e − 1)}. In the notation of [62], GB(f, g) = (q, r) where r = f + qg/2^e = 2^{e+1} z_{2e}.

The corresponding gcd algorithm is the centered algorithm from Appendix E.6, with addition instead of subtraction. Recall that, except for inessential scalings by powers of 2, this is the same as the Stehlé–Zimmermann gcd algorithm from [62].

The corresponding set of transition matrices is different from the divstep case. As mentioned in Appendix F, there are weight-w products with spectral radius 0.6403882032...^w, so the proof strategy for cdivstep cannot obtain weight-w norm bounds below this spectral radius. This is worse than our bounds for divstep.

Enumerating small pairs (R_0, R_1) as in Theorem G.4 also produces worse results for cdivstep than for divstep. See Figure H.1.


[Figure H.1: scatter plot.] For each s ∈ {1, 2, ..., 56}: red dot at (b, s − 2.8283396b), where b is the minimum of log_2 √(R_0² + R_1²) over odd integers R_0 and even integers R_1 such that (R_0, R_1) requires ≥ s iterations of divstep. For each s ∈ {1, 2, ..., 58}: blue dot at (b, s − 2.8283396b), where b is the minimum of log_2 √(R_0² + R_1²) over odd integers R_0 and even integers R_1 such that (R_0, R_1) requires ≥ s iterations of cdivstep.


Proof. This follows from (f_{m+1}, g_{m+1}) = T_m (f_m, g_m) and (1, δ_{m+1}) = S_m (1, δ_m).

Theorem 4.3.

    T_{n−1} ··· T_m ∈ ( k + ··· + (1/x^{n−m−1})k        k + ··· + (1/x^{n−m−1})k    )
                      ( (1/x)k + ··· + (1/x^{n−m})k     (1/x)k + ··· + (1/x^{n−m})k )

if n > m ≥ 0.

A product of n − m of these T matrices can thus be stored using 4(n − m) coefficients from k. Beware that the theorem statement relies on n > m; for n = m the product is the identity matrix.

Proof If n minus m = 1 then the product is Tm which is in(

k k1xk

1xk

)by definition For

nminusm gt 1 assume inductively that

Tnminus1 middot middot middot Tm+1 isin(k + middot middot middot+ 1

xnminusmminus2 k k + middot middot middot+ 1xnminusmminus2 k

1xk + middot middot middot+ 1

xnminusmminus1 k1xk + middot middot middot+ 1

xnminusmminus1 k

)

Multiply on the right by Tm isin(

k k1xk

1xk

) The top entries of the product are in (k + middot middot middot+

1xnminusmminus2 k) + ( 1

xk+ middot middot middot+ 1xnminusmminus1 k) = k+ middot middot middot+ 1

xnminusmminus1 k as claimed and the bottom entriesare in ( 1

xk + middot middot middot+ 1xnminusmminus1 k) + ( 1

x2 k + middot middot middot+ 1xnminusm k) = 1

xk + middot middot middot+ 1xnminusm k as claimed

Theorem 4.4. S_{n−1} ··· S_m ∈ ( {1}  {0} ; {2 − (n − m), ..., n − m − 2, n − m}  {1, −1} ) if n > m ≥ 0.

In other words, the bottom-left entry is between −(n − m) exclusive and n − m inclusive, and has the same parity as n − m. There are exactly 2(n − m) possibilities for the product. Beware that the theorem statement again relies on n > m.

Proof. If n − m = 1 then {2 − (n − m), ..., n − m − 2, n − m} = {1} and the product is S_m, which is ( 1  0 ; 1  ±1 ) by definition.

For n − m > 1, assume inductively that

    S_{n−1} ··· S_{m+1} ∈ ( {1}  {0} ; {2 − (n − m − 1), ..., n − m − 3, n − m − 1}  {1, −1} ),

and multiply on the right by S_m = ( 1  0 ; 1  ±1 ). The top two entries of the product are again 1 and 0. The bottom-left entry is in {2 − (n − m − 1), ..., n − m − 3, n − m − 1} + {1, −1} = {2 − (n − m), ..., n − m − 2, n − m}. The bottom-right entry is again 1 or −1.

The next theorem compares δ_m, f_m, g_m, T_m, S_m to δ′_m, f′_m, g′_m, T′_m, S′_m defined in the same way starting from (δ′, f′, g′) ∈ Z × k[[x]]^* × k[[x]].

Theorem 4.5. Let m, t be nonnegative integers. Assume that δ′_m = δ_m, f′_m ≡ f_m (mod x^t), and g′_m ≡ g_m (mod x^t). Then, for each integer n with m < n ≤ m + t: T′_{n−1} = T_{n−1}; S′_{n−1} = S_{n−1}; δ′_n = δ_n; f′_n ≡ f_n (mod x^{t−(n−m−1)}); and g′_n ≡ g_n (mod x^{t−(n−m)}).

In other words, δ_m and the first t coefficients of f_m and g_m determine all of the following:

• the matrices T_m, T_{m+1}, ..., T_{m+t−1};

• the matrices S_m, S_{m+1}, ..., S_{m+t−1};

• δ_{m+1}, ..., δ_{m+t};

• the first t coefficients of f_{m+1}, the first t − 1 coefficients of f_{m+2}, the first t − 2 coefficients of f_{m+3}, and so on through the first coefficient of f_{m+t};

• the first t − 1 coefficients of g_{m+1}, the first t − 2 coefficients of g_{m+2}, the first t − 3 coefficients of g_{m+3}, and so on through the first coefficient of g_{m+t−1}.

Our main algorithm computes (δ_n, f_n mod x^{t−(n−m−1)}, g_n mod x^{t−(n−m)}) for any selected n ∈ {m + 1, ..., m + t}, along with the product of the matrices T_m, ..., T_{n−1}. See Section 5.

Proof. See Figure 3.3 for the intuition. Formally, fix m, t and induct on n. Observe that (δ′_{n−1}, f′_{n−1}(0), g′_{n−1}(0)) equals (δ_{n−1}, f_{n−1}(0), g_{n−1}(0)):

• If n = m + 1: By assumption f′_m ≡ f_m (mod x^t), and n ≤ m + t, so t ≥ 1, so in particular f′_m(0) = f_m(0). Similarly g′_m(0) = g_m(0). By assumption δ′_m = δ_m.

• If n > m + 1: f′_{n−1} ≡ f_{n−1} (mod x^{t−(n−m−2)}) by the inductive hypothesis, and n ≤ m + t, so t − (n − m − 2) ≥ 2, so in particular f′_{n−1}(0) = f_{n−1}(0); similarly g′_{n−1} ≡ g_{n−1} (mod x^{t−(n−m−1)}) by the inductive hypothesis, and t − (n − m − 1) ≥ 1, so in particular g′_{n−1}(0) = g_{n−1}(0); and δ′_{n−1} = δ_{n−1} by the inductive hypothesis.

Now use the fact that T and S inspect only the first coefficient of each series to see that

    T′_{n−1} = T(δ′_{n−1}, f′_{n−1}, g′_{n−1}) = T(δ_{n−1}, f_{n−1}, g_{n−1}) = T_{n−1},
    S′_{n−1} = S(δ′_{n−1}, f′_{n−1}, g′_{n−1}) = S(δ_{n−1}, f_{n−1}, g_{n−1}) = S_{n−1}

as claimed. Multiply (1, δ′_{n−1}) = (1, δ_{n−1}) by S′_{n−1} = S_{n−1} to see that δ′_n = δ_n as claimed.

By the inductive hypothesis, (T′_m, ..., T′_{n−2}) = (T_m, ..., T_{n−2}), and T′_{n−1} = T_{n−1}, so T′_{n−1} ··· T′_m = T_{n−1} ··· T_m. Abbreviate T_{n−1} ··· T_m as P. Then (f′_n, g′_n) = P (f′_m, g′_m) and (f_n, g_n) = P (f_m, g_m) by Theorem 4.2, so

    (f′_n − f_n, g′_n − g_n) = P (f′_m − f_m, g′_m − g_m).

By Theorem 4.3, P ∈ ( k + ··· + (1/x^{n−m−1})k   k + ··· + (1/x^{n−m−1})k ; (1/x)k + ··· + (1/x^{n−m})k   (1/x)k + ··· + (1/x^{n−m})k ). By assumption, f′_m − f_m and g′_m − g_m are multiples of x^t. Multiply to see that f′_n − f_n is a multiple of x^t/x^{n−m−1} = x^{t−(n−m−1)} as claimed, and g′_n − g_n is a multiple of x^t/x^{n−m} = x^{t−(n−m)} as claimed.

5 Fast computation of iterates of x-adic division steps

Our goal in this section is to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). We are given the power series f, g to precision t, t respectively, and we compute f_n, g_n to precision t − n + 1, t − n respectively if n ≥ 1; see Theorem 4.5. We also compute the n-step transition matrix T_{n−1} ··· T_0.

We begin with an algorithm that simply computes one divstep iterate after another, multiplying by scalars in k and dividing by x exactly as in the divstep definition. We specify this algorithm in Figure 5.1 as a function divstepsx in the Sage [58] computer-algebra system. The quantities u, v, q, r ∈ k[1/x] inside the algorithm keep track of the coefficients of the current f, g in terms of the original f, g.

The rest of this section considers (1) constant-time computation, (2) "jumping" through division steps, and (3) further techniques to save time inside the computations.

5.2. Constant-time computation. The underlying Sage functions do not take constant time, but they can be replaced with functions that do take constant time. The case distinction can be replaced with constant-time bit operations, for example obtaining the new (δ, f, g) as follows:


def divstepsx(n, t, delta, f, g):
  assert t >= n and n >= 0
  f, g = f.truncate(t), g.truncate(t)
  kx = f.parent()
  x = kx.gen()
  u, v, q, r = kx(1), kx(0), kx(0), kx(1)

  while n > 0:
    f = f.truncate(t)
    if delta > 0 and g[0] != 0: delta,f,g,u,v,q,r = -delta,g,f,q,r,u,v
    f0, g0 = f[0], g[0]
    delta, g, q, r = 1+delta, (f0*g-g0*f)/x, (f0*q-g0*u)/x, (f0*r-g0*v)/x
    n, t = n-1, t-1
    g = kx(g).truncate(t)

  M2kx = MatrixSpace(kx.fraction_field(), 2)
  return delta, f, g, M2kx((u, v, q, r))

Figure 5.1: Algorithm divstepsx to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). Inputs: n, t ∈ Z with 0 ≤ n ≤ t; δ ∈ Z; f, g ∈ k[[x]] to precision at least t. Outputs: δ_n; f_n to precision t if n = 0, or t − (n − 1) if n ≥ 1; g_n to precision t − n; T_{n−1} ··· T_0.

• Let s be the "sign bit" of −δ, meaning 0 if −δ ≥ 0 or 1 if −δ < 0.

• Let z be the "nonzero bit" of g(0), meaning 0 if g(0) = 0 or 1 if g(0) ≠ 0.

• Compute and output 1 + (1 − 2sz)δ. For example, shift δ to obtain 2δ, "AND" with −sz to obtain 2szδ, and subtract from 1 + δ.

• Compute and output f ⊕ sz(f ⊕ g), obtaining f if sz = 0 or g if sz = 1.

• Compute and output (1 − 2sz)(f(0)g − g(0)f)/x. Alternatively, compute and output simply (f(0)g − g(0)f)/x. See the discussion of decomposition and scaling in Section 3.

Standard techniques minimize the exact cost of these computations for various representations of the inputs. The details depend on the platform. See our case study in Section 7.
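As a minimal sketch of this recipe (our own illustration, not the paper's production code), the following Python function models one branchless division step for coefficients in a small prime field. The field size q is an arbitrary example, Python integers are not themselves constant-time, and the sketch assumes |δ| < 2^63; polynomials are coefficient lists with the constant term first, all of the same length.

    q = 7   # illustrative prime field GF(q)

    def divstep_branchless(delta, f, g):
        s = (-delta >> 63) & 1                # sign bit of -delta: 1 exactly when delta > 0
        z = ((g[0] | -g[0]) >> 63) & 1        # nonzero bit of g(0)
        sz = s & z
        mask = -sz                            # all-ones when swapping, else zero
        newdelta = 1 + (1 - 2 * sz) * delta
        newf = [fi ^ (mask & (fi ^ gi)) for fi, gi in zip(f, g)]   # f, or g if swapping
        sign = 1 - 2 * sz
        # (1 - 2sz)(f(0)g - g(0)f)/x, truncated to one fewer coefficient
        newg = [(sign * (f[0] * g[i] - g[0] * f[i])) % q for i in range(1, len(f))]
        return newdelta, newf, newg

When sz = 1 the sign factor turns f(0)g − g(0)f into g(0)f − f(0)g, matching the swapped case of divstep.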

5.3. Jumping through division steps. Here is a more general divide-and-conquer algorithm to compute (δ_n, f_n, g_n) and the n-step transition matrix T_{n−1} ··· T_0:

• If n ≤ 1, use the definition of divstep and stop.

• Choose j ∈ {1, 2, ..., n − 1}.

• Jump j steps from δ, f, g to δ_j, f_j, g_j. Specifically, compute the j-step transition matrix T_{j−1} ··· T_0, and then multiply by (f, g) to obtain (f_j, g_j). To compute the j-step transition matrix, call the same algorithm recursively. The critical point here is that this recursive call uses the inputs to precision only j; this determines the j-step transition matrix by Theorem 4.5.

• Similarly jump another n − j steps from δ_j, f_j, g_j to δ_n, f_n, g_n.


from divstepsx import divstepsx

def jumpdivstepsx(n, t, delta, f, g):
  assert t >= n and n >= 0
  kx = f.truncate(t).parent()

  if n <= 1: return divstepsx(n, t, delta, f, g)

  j = n//2

  delta, f1, g1, P1 = jumpdivstepsx(j, j, delta, f, g)
  f, g = P1*vector((f, g))
  f, g = kx(f).truncate(t-j), kx(g).truncate(t-j)

  delta, f2, g2, P2 = jumpdivstepsx(n-j, n-j, delta, f, g)
  f, g = P2*vector((f, g))
  f, g = kx(f).truncate(t-n+1), kx(g).truncate(t-n)

  return delta, f, g, P2*P1

Figure 5.4: Algorithm jumpdivstepsx to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). Same inputs and outputs as in Figure 5.1.

This splits an (n, t) problem into two smaller problems, namely a (j, j) problem and an (n − j, n − j) problem, at the cost of O(1) polynomial multiplications involving O(t + n) coefficients.

If j is always chosen as 1, then this algorithm ends up performing the same computations as Figure 5.1: compute δ_1, f_1, g_1 to precision t − 1; compute δ_2, f_2, g_2 to precision t − 2; etc. But there are many more possibilities for j.

Our jumpdivstepsx algorithm in Figure 5.4 takes j = ⌊n/2⌋, balancing the (j, j) problem with the (n − j, n − j) problem. In particular, with subquadratic polynomial multiplication the algorithm is subquadratic. With FFT-based polynomial multiplication, the algorithm uses (t + n)(log(t + n))^{1+o(1)} + n(log n)^{2+o(1)} operations, and hence n(log n)^{2+o(1)} operations if t is bounded by n(log n)^{1+o(1)}. For comparison, Figure 5.1 takes Θ(n(t + n)) operations.

We emphasize that, unlike standard fast-gcd algorithms such as [66], this algorithm does not call a polynomial-division subroutine between its two recursive calls.
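As a quick consistency check (our own usage sketch, assuming divstepsx and jumpdivstepsx from Figures 5.1 and 5.4 are available on the Sage load path), one can compare the two algorithms on the small example from Section 6.3:

    k = GF(7)
    kx.<x> = k[]
    f = kx([2,7,1,8,2,8,1,8])   # reversal of R_0 from Section 6.3
    g = kx([3,1,4,1,5,9,2])     # reversal of R_1 from Section 6.3
    print(divstepsx(13,13,1,f,g) == jumpdivstepsx(13,13,1,f,g))   # expected: True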

5.5. More speedups. Our jumpdivstepsx algorithm is asymptotically Θ(log n) times slower than fast polynomial multiplication. The best previous gcd algorithms also have a Θ(log n) slowdown, both in the worst case and in the average case.^10 This might suggest that there is very little room for improvement.

^10 There are faster cases. Strassen [66] gave an algorithm that replaces the log factor with the entropy of the list of degrees in the corresponding continued fraction; there are occasional inputs where this entropy is o(log n). For example, computing a gcd with 0 takes linear time. However, these speedups are not useful for constant-time computations.

However, Θ says nothing about constant factors. There are several standard ideas for constant-factor speedups in gcd computation, beyond lower-level speedups in multiplication. We now apply the ideas to jumpdivstepsx.

The final polynomial multiplications in jumpdivstepsx produce six large outputs: f, g, and four entries of a matrix product P. Callers do not always use all of these outputs; for example, the recursive calls to jumpdivstepsx do not use the resulting f


and g. In principle there is no difficulty applying dead-value elimination and higher-level optimizations to each of the 63 possible combinations of desired outputs.

The recursive calls in jumpdivstepsx produce two products P_1, P_2 of transition matrices, which are then (if desired) multiplied to obtain P = P_2 P_1. In Figure 5.4, jumpdivstepsx multiplies f and g first by P_1 and then by P_2 to obtain f_n and g_n. An alternative is to multiply by P.

The inputs f and g are truncated to t coefficients, and the resulting f_n and g_n are truncated to t − n + 1 and t − n coefficients respectively. Often the caller (for example, jumpdivstepsx itself) wants more coefficients of P(f, g). Rather than performing an entirely new multiplication by P, one can add the truncation errors back into the results: multiply P by the f and g truncation errors, multiply P_2 by the f_j and g_j truncation errors, and skip the final truncations of f_n and g_n.

There are standard speedups for multiplication in the context of truncated products; sums of products (e.g., in matrix multiplication); partially known products (e.g., the coefficients of x^{−1}, x^{−2}, x^{−3}, ... in P_1(f, g) are known to be 0); repeated inputs (e.g., each entry of P_1 is multiplied by two entries of P_2 and by one of f, g); and outputs being reused as inputs. See, e.g., the discussions of "FFT addition", "FFT caching", and "FFT doubling" in [11].

Any choice of j between 1 and n − 1 produces correct results. Values of j close to the edges do not take advantage of fast polynomial multiplication, but this does not imply that j = ⌊n/2⌋ is optimal. The number of combinations of choices of j and other options above is small enough that, for each small (t, n) in turn, one can simply measure all options and select the best. The results depend on the context (which outputs are needed) and on the exact performance of polynomial multiplication.

6 Fast polynomial gcd computation and modular inversion

This section presents three algorithms: gcdx computes gcd{R_0, R_1}, where R_0 is a polynomial of degree d > 0 and R_1 is a polynomial of degree < d; gcddegree computes merely deg gcd{R_0, R_1}; recipx computes the reciprocal of R_1 modulo R_0 if gcd{R_0, R_1} = 1. Figure 6.1 specifies all three algorithms, and Theorem 6.2 justifies all three algorithms. The theorem is proven in Appendix A.

Each algorithm calls divstepsx from Section 5 to compute (δ_{2d−1}, f_{2d−1}, g_{2d−1}) = divstep^{2d−1}(1, f, g). Here f = x^d R_0(1/x) is the reversal of R_0, and g = x^{d−1} R_1(1/x). Of course, one can replace divstepsx here with jumpdivstepsx, which has the same outputs. The algorithms vary in the input-precision parameter t passed to divstepsx, and vary in how they use the outputs:

• gcddegree uses input precision t = 2d − 1 to compute δ_{2d−1}, and then uses one part of Theorem 6.2, which states that deg gcd{R_0, R_1} = δ_{2d−1}/2.

• gcdx uses precision t = 3d − 1 to compute δ_{2d−1} and d + 1 coefficients of f_{2d−1}. The algorithm then uses another part of Theorem 6.2, which states that f_{2d−1}/f_{2d−1}(0) is the reversal of gcd{R_0, R_1}.

• recipx uses t = 2d − 1 to compute δ_{2d−1}, 1 coefficient of f_{2d−1}, and the product P of transition matrices. It then uses the last part of Theorem 6.2, which (in the coprime case) states that the top-right corner of P is f_{2d−1}(0)/x^{2d−2} times the degree-(d − 1) reversal of the desired reciprocal.

One can generalize to compute gcd{R_0, R_1} for any polynomials R_0, R_1 of degree ≤ d, in time that depends only on d: scan the inputs to find the top degree, and always perform


from divstepsx import divstepsx

def gcddegree(R0, R1):
  d = R0.degree()
  assert d > 0 and d > R1.degree()
  f, g = R0.reverse(d), R1.reverse(d-1)
  delta, f, g, P = divstepsx(2*d-1, 2*d-1, 1, f, g)
  return delta//2

def gcdx(R0, R1):
  d = R0.degree()
  assert d > 0 and d > R1.degree()
  f, g = R0.reverse(d), R1.reverse(d-1)
  delta, f, g, P = divstepsx(2*d-1, 3*d-1, 1, f, g)
  return f.reverse(delta//2)/f[0]

def recipx(R0, R1):
  d = R0.degree()
  assert d > 0 and d > R1.degree()
  f, g = R0.reverse(d), R1.reverse(d-1)
  delta, f, g, P = divstepsx(2*d-1, 2*d-1, 1, f, g)
  if delta != 0: return
  kx = f.parent()
  x = kx.gen()
  return kx(x^(2*d-2)*P[0][1]/f[0]).reverse(d-1)

Figure 6.1: Algorithm gcdx to compute gcd{R_0, R_1}; algorithm gcddegree to compute deg gcd{R_0, R_1}; algorithm recipx to compute the reciprocal of R_1 modulo R_0 when gcd{R_0, R_1} = 1. All three algorithms assume that R_0, R_1 ∈ k[x], that deg R_0 > 0, and that deg R_0 > deg R_1.

2d iterations of divstep. The extra iteration handles the case that both R_0 and R_1 have degree d. For simplicity we focus on the case deg R_0 = d > deg R_1 with d > 0.

One can similarly characterize gcd and modular reciprocal in terms of subsequent iterates of divstep, with δ increased by the number of extra steps. Implementors might, for example, prefer to use an iteration count that is divisible by a large power of 2.

Theorem 6.2. Let k be a field. Let d be a positive integer. Let R_0, R_1 be elements of the polynomial ring k[x] with deg R_0 = d > deg R_1. Define G = gcd{R_0, R_1}, and let V be the unique polynomial of degree < d − deg G such that V R_1 ≡ G (mod R_0). Define f = x^d R_0(1/x), g = x^{d−1} R_1(1/x), (δ_n, f_n, g_n) = divstep^n(1, f, g), T_n = T(δ_n, f_n, g_n), and (u_n v_n; q_n r_n) = T_{n−1} ··· T_0. Then

    deg G = δ_{2d−1}/2,
    G = x^{deg G} f_{2d−1}(1/x)/f_{2d−1}(0),
    V = x^{−d+1+deg G} v_{2d−1}(1/x)/f_{2d−1}(0).

6.3. A numerical example. Take k = F_7, d = 7, R_0 = 2x^7 + 7x^6 + 1x^5 + 8x^4 + 2x^3 + 8x^2 + 1x + 8, and R_1 = 3x^6 + 1x^5 + 4x^4 + 1x^3 + 5x^2 + 9x + 2. Then f = 2 + 7x + 1x^2 + 8x^3 + 2x^4 + 8x^5 + 1x^6 + 8x^7 and g = 3 + 1x + 4x^2 + 1x^3 + 5x^4 + 9x^5 + 2x^6. The gcdx algorithm computes (δ_13, f_13, g_13) = divstep^13(1, f, g) = (0, 2, 6); see Table 4.1


for all iterates of divstep on these inputs. Dividing f_13 by its constant coefficient produces 1, the reversal of gcd{R_0, R_1}. The gcddegree algorithm simply computes δ_13 = 0, which is enough information to see that gcd{R_0, R_1} = 1.

6.4. Constant-factor speedups. Compared to the general power-series setting in divstepsx, the inputs to gcddegree, gcdx, and recipx are more limited. For example, the f input is all 0 after the first d + 1 coefficients, so computations involving subsequent coefficients can be skipped. Similarly, the top-right entry of the matrix P in recipx is guaranteed to have coefficient 0 for 1, 1/x, 1/x^2, and so on through 1/x^{d−2}, so one can save time in the polynomial computations that produce this entry. See Theorems A.1 and A.2 for bounds on the sizes of all intermediate results.

6.5. Bottom-up gcd and inversion. Our division steps eliminate bottom coefficients of the inputs. Appendices B, C, and D relate this to a traditional Euclid–Stevin computation, which gradually eliminates top coefficients of the inputs. The reversal of inputs and outputs in gcddegree, gcdx, and recipx exchanges the role of top coefficients and bottom coefficients. The following theorem states an alternative: skip the reversal, and instead directly eliminate the bottom coefficients of the original inputs.

Theorem 6.6. Let k be a field. Let d be a positive integer. Let f, g be elements of the polynomial ring k[x] with f(0) ≠ 0, deg f ≤ d, and deg g < d. Define γ = gcd{f, g}.

Define (δ_n, f_n, g_n) = divstep^n(1, f, g), T_n = T(δ_n, f_n, g_n), and (u_n v_n; q_n r_n) = T_{n−1} ··· T_0. Then

    deg γ = deg f_{2d−1},
    γ = f_{2d−1}/ℓ,
    x^{2d−2} γ ≡ (x^{2d−2} v_{2d−1}/ℓ) g (mod f),

where ℓ is the leading coefficient of f_{2d−1}.

Proof. Note that γ(0) ≠ 0, since f is a multiple of γ.

Define R_0 = x^d f(1/x). Then R_0 ∈ k[x]; deg R_0 = d since f(0) ≠ 0; and f = x^d R_0(1/x). Define R_1 = x^{d−1} g(1/x). Then R_1 ∈ k[x]; deg R_1 < d; and g = x^{d−1} R_1(1/x).

Define G = gcd{R_0, R_1}. We will show below that γ = c x^{deg G} G(1/x) for some c ∈ k^*. Then γ = c f_{2d−1}/f_{2d−1}(0) by Theorem 6.2, i.e., γ is a constant multiple of f_{2d−1}. Hence deg γ = deg f_{2d−1} as claimed. Furthermore, γ has leading coefficient 1, so 1 = cℓ/f_{2d−1}(0), i.e., γ = f_{2d−1}/ℓ as claimed.

By Theorem 4.2, f_{2d−1} = u_{2d−1} f + v_{2d−1} g. Also x^{2d−2} u_{2d−1}, x^{2d−2} v_{2d−1} ∈ k[x] by Theorem 4.3, so

    x^{2d−2} γ = (x^{2d−2} u_{2d−1}/ℓ) f + (x^{2d−2} v_{2d−1}/ℓ) g ≡ (x^{2d−2} v_{2d−1}/ℓ) g (mod f)

as claimed.

All that remains is to show that γ = c x^{deg G} G(1/x) for some c ∈ k^*. If R_1 = 0 then G = gcd{R_0, 0} = R_0/R_0[d], so x^{deg G} G(1/x) = x^d R_0(1/x)/R_0[d] = f/f[0]; also g = 0, so γ = f/f[deg f] = (f[0]/f[deg f]) x^{deg G} G(1/x). Assume from now on that R_1 ≠ 0.

We have R_1 = QG for some Q ∈ k[x]. Note that deg R_1 = deg Q + deg G. Also R_1(1/x) = Q(1/x) G(1/x), so x^{deg R_1} R_1(1/x) = x^{deg Q} Q(1/x) · x^{deg G} G(1/x) since deg R_1 = deg Q + deg G. Hence x^{deg G} G(1/x) divides the polynomial x^{deg R_1} R_1(1/x), which in turn divides x^{d−1} R_1(1/x) = g. Similarly, x^{deg G} G(1/x) divides f. Hence x^{deg G} G(1/x) divides gcd{f, g} = γ.

Furthermore, G = A R_0 + B R_1 for some A, B ∈ k[x]. Take any integer m above all of deg G, deg A, deg B; then

    x^{m+d−deg G} · x^{deg G} G(1/x) = x^{m+d} G(1/x)
      = x^m A(1/x) · x^d R_0(1/x) + x^{m+1} B(1/x) · x^{d−1} R_1(1/x)
      = x^{m−deg A} x^{deg A} A(1/x) · f + x^{m+1−deg B} x^{deg B} B(1/x) · g,

so x^{m+d−deg G} · x^{deg G} G(1/x) is a multiple of γ, so the polynomial γ/(x^{deg G} G(1/x)) divides x^{m+d−deg G}, so γ/(x^{deg G} G(1/x)) = c x^i for some c ∈ k^* and some i ≥ 0. We must have i = 0, since γ(0) ≠ 0.

6.7. How to divide by x^{2d−2} modulo f. Unlike top-down inversion (and top-down gcd and bottom-up gcd), bottom-up inversion naturally scales its result by a power of x. We now explain various ways to remove this scaling.

For simplicity we focus here on the case gcd{f, g} = 1. The divstep computation in Theorem 6.6 naturally produces the polynomial x^{2d−2} v_{2d−1}/ℓ, which has degree at most d − 1 by Theorem 6.2. Our objective is to divide this polynomial by x^{2d−2} modulo f, obtaining the reciprocal of g modulo f by Theorem 6.6. One way to do this is to multiply by a reciprocal of x^{2d−2} modulo f. This reciprocal depends only on d and f, i.e., it can be precomputed before g is available.

Here are several standard strategies to compute 1/x^{2d−2} modulo f, or more generally 1/x^e modulo f for any nonnegative integer e:

• Compute the polynomial (1 − f/f(0))/x, which is 1/x modulo f, and then raise this to the eth power modulo f (a sketch of this strategy appears after this list). This takes a logarithmic number of multiplications modulo f.

• Start from 1/f(0), the reciprocal of f modulo x. Compute the reciprocals of f modulo x^2, x^4, x^8, etc. by Newton (Hensel) iteration. After finding the reciprocal r of f modulo x^e, divide 1 − rf by x^e to obtain the reciprocal of x^e modulo f. This takes a constant number of full-size multiplications, a constant number of half-size multiplications, etc. The total number of multiplications is again logarithmic, but if e is linear, as in the case e = 2d − 2, then the total size of the multiplications is linear, without an extra logarithmic factor.

• Combine the powering strategy with the Hensel strategy. For example, use the reciprocal of f modulo x^{d−1} to compute the reciprocal of x^{d−1} modulo f, and then square modulo f to obtain the reciprocal of x^{2d−2} modulo f, rather than using the reciprocal of f modulo x^{2d−2}.

• Write down a simple formula for the reciprocal of x^e modulo f if f has a special structure that makes this easy. For example, if f = x^d − 1, as in original NTRU, and d ≥ 3, then the reciprocal of x^{2d−2} is simply x^2. As another example, if f = x^d + x^{d−1} + ··· + 1 and d ≥ 5, then the reciprocal of x^{2d−2} is x^4.

In most cases the division by x^{2d−2} modulo f involves extra arithmetic that is avoided by the top-down reversal approach. On the other hand, the case f = x^d − 1 does not involve any extra arithmetic. There are also applications of inversion that can easily handle a scaled result.
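As an illustration of the first strategy, here is a small Sage sketch; the function name recip_x_power is ours, not part of the paper's software. It assumes f(0) ≠ 0 and uses plain square-and-multiply modulo f.

    def recip_x_power(f, e):
        # reciprocal of x^e modulo f, assuming f(0) != 0
        R = f.parent()
        invx = (1 - f * f[0]**-1) // R.gen()   # (1 - f/f(0))/x is 1/x modulo f
        result = R(1)
        for bit in bin(e)[2:]:                 # left-to-right square and multiply
            result = result * result % f
            if bit == '1':
                result = result * invx % f
        return result

    # example use: k = GF(3); R.<x> = k[]; f = sum(x^i for i in range(701)); recip_x_power(f, 2*700-2)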

7 Software case study for polynomial modular inversion

As a concrete case study, we consider the problem of inversion in the ring (Z/3)[x]/(x^700 + x^699 + ··· + x + 1) on an Intel Haswell CPU core. The situation before our work was that this inversion consumed half of the key-generation time in the ntruhrss701 cryptosystem: about 150000 cycles out of 300000 cycles. This cryptosystem was introduced by Hülsing, Rijneveld, Schanck, and Schwabe [39] at CHES 2017.^11

^11 NTRU-HRSS is another round-2 submission in NIST's ongoing post-quantum competition. Google has also selected NTRU-HRSS for its "CECPQ2" post-quantum experiment. CECPQ2 includes a slightly modified version of ntruhrss701, and the NTRU-HRSS authors have also announced their intention to tweak NTRU-HRSS; these changes do not affect the performance of key generation.


We saved 60000 cycles out of these 150000 cycles, reducing the total key-generation time to 240000 cycles. Our approach also simplifies the code, for example eliminating the series of conditional power-of-2 rotations in [39].

7.1. Details of the new computation. Our computation works with one integer δ and four polynomials v, r, f, g ∈ (Z/3)[x]. Initially δ = 1, v = 0, r = 1, f = x^700 + x^699 + ··· + x + 1, and g = g_699 x^699 + ··· + g_0 x^0 is obtained by reversing the 700 input coefficients. We actually store −δ instead of δ; this turns out to be marginally more efficient for this platform.

We then apply 2 · 700 − 1 = 1399 iterations of a main loop. Each iteration works as follows, with the order of operations determined experimentally to try to minimize latency issues:

• Replace v with xv.

• Compute a swap mask as −1 if δ > 0 and g(0) ≠ 0, otherwise 0.

• Compute c ∈ Z/3 as f(0)g(0).

• Replace δ with −δ if the swap mask is set.

• Add 1 to δ.

• Replace (f, g) with (g, f) if the swap mask is set.

• Replace g with (g − cf)/x.

• Replace (v, r) with (r, v) if the swap mask is set.

• Replace r with r − cv.

This multiplies the transition matrix T(δ, f, g) into the vector (v/x^{n−1}, r/x^n), where n is the number of iterations, and at the same time replaces (δ, f, g) with divstep(δ, f, g). Actually, we are slightly modifying divstep here (and making the corresponding modification to T), replacing the original coefficients f(0) and g(0) with the scaled coefficients 1 and g(0)/f(0) = f(0)g(0), as in Section 3.4. In other words, if f(0) = −1 then we negate both f and g. We suppress further comments on this deviation from the definition of divstep.

The effect of n iterations is to replace the original (δ, f, g) with divstep^n(δ, f, g). The matrix product T_{n−1} ··· T_0 has the form ( ···  v/x^{n−1} ; ···  r/x^n ). The new f and g are congruent to v/x^{n−1} and r/x^n respectively, times the original g, modulo x^700 + x^699 + ··· + x + 1.

After 1399 iterations we have δ = 0 if and only if the input is invertible modulo x^700 + x^699 + ··· + x + 1, by Theorem 6.2. Furthermore, if δ = 0, then the inverse is x^{−699} v_{1399}(1/x)/f(0), where v_{1399} = v/x^{1398}; i.e., the inverse is x^699 v(1/x)/f(0). We thus multiply v by f(0) and take the coefficients of x^0, ..., x^699 in reverse order to obtain the desired inverse.

We use signed representatives −1, 0, 1 for elements of Z/3. We use 2-bit two's-complement representations of −1, 0, 1: the bottom bit is 1, 0, 1 respectively, and the top bit is 1, 0, 0. We wrote a simple superoptimizer to find optimal sequences of 3, 6, 6 bit operations respectively for multiplication, addition, and subtraction. We do not claim novelty for the approach in this paragraph: Boothby and Bradshaw in [20, Section 4.1] suggested the same 2-bit representation for Z/3 (without explaining it as signed two's-complement) and found the same operation counts by a similar search.
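For illustration, here is a small Python model of this representation (our own sketch, not the paper's superoptimized sequences). Each value is a pair (bottom bit, top bit): 0 is (0,0), 1 is (1,0), −1 is (1,1); in the real software each variable is a machine word holding one bit position of many coefficients in parallel. The multiplication below happens to use 3 bit operations, matching the count above, while the addition is written for clarity rather than to reproduce the 6-operation sequence.

    def mul(ab, at, bb, bt):
        cb = ab & bb              # product is nonzero iff both factors are nonzero
        ct = (at ^ bt) & cb       # product is -1 iff the signs differ (and product is nonzero)
        return cb, ct             # 3 bit operations

    def add(ab, at, bb, bt):
        one = ab ^ bb             # exactly one summand is nonzero
        both = ab & bb            # both summands are nonzero
        neg = at | bt             # some summand is -1
        cb = one | (both & ~(at ^ bt))      # sum is 0 iff 0+0 or x+(-x)
        ct = (one & neg) | (both & ~neg)    # sum is -1 iff 0+(-1), (-1)+0, or 1+1
        return cb, ct

    # exhaustive check against integer arithmetic mod 3 (centered representatives)
    def decode(b, t): return (b & 1) * (1 - 2 * (t & 1))
    def encode(v): return (1, 0) if v == 1 else (1, 1) if v == -1 else (0, 0)
    for x in (-1, 0, 1):
        for y in (-1, 0, 1):
            assert decode(*mul(*encode(x), *encode(y))) == (x * y + 1) % 3 - 1
            assert decode(*add(*encode(x), *encode(y))) == (x + y + 1) % 3 - 1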

We store each of the four polynomials v, r, f, g as six 256-bit vectors: for example, the coefficients of x^0, ..., x^255 in v are stored as a 256-bit vector of bottom bits and a 256-bit vector of top bits. Within each 256-bit vector, we use the first 64-bit word to


store the coefficients of x^0, x^4, ..., x^252; the next 64-bit word to store the coefficients of x^1, x^5, ..., x^253; etc. Multiplying by x thus moves the first 64-bit word to the second, moves the second to the third, moves the third to the fourth, moves the fourth to the first, and shifts the first by 1 bit; the bit that falls off the end of the first word is inserted into the next vector.

The first 256 iterations involve only the first 256-bit vectors for v and r, so we simply leave the second and third vectors as 0. The next 256 iterations involve only the first two 256-bit vectors for v and r. Similarly, we manipulate only the first 256-bit vectors for f and g in the final 256 iterations, and we manipulate only the first and second 256-bit vectors for f and g in the previous 256 iterations. Our main loop thus has five sections: 256 iterations involving (1, 1, 3, 3) vectors for (v, r, f, g); 256 iterations involving (2, 2, 3, 3) vectors; 375 iterations involving (3, 3, 3, 3) vectors; 256 iterations involving (3, 3, 2, 2) vectors; and 256 iterations involving (3, 3, 1, 1) vectors. This makes the code longer but saves almost 20% of the vector manipulations.

7.2. Comparison to the old computation. Many features of our computation are also visible in the software from [39]. We emphasize the ways that our computation is more streamlined.

There are essentially the same four polynomials v, r, f, g in [39] (labeled "c", "b", "g", "f"). Instead of a single integer δ (plus a loop counter), there are four integers "degf", "degg", "k", and "done" (plus a loop counter). One cannot simply eliminate "degf" and "degg" in favor of their difference δ, since "done" is set when "degf" reaches 0, and "done" influences "k", which in turn influences the results of the computation: the final result is multiplied by x^{701−k}.

Multiplication by x^{701−k} is a conceptually simple matter of rotating by k positions within 701 positions and then reducing modulo x^700 + x^699 + ··· + x + 1, since x^701 − 1 is a multiple of x^700 + x^699 + ··· + x + 1. However, since k is a variable, the obvious method of rotating by k positions does not take constant time; [39] instead performs conditional rotations by 1 position, 2 positions, 4 positions, etc., where the condition bits are the bits of k.

The inputs and outputs are not reversed in [39]. This affects the power of x multiplied into the final results. Our approach skips the powers of x entirely.

Each polynomial in [39] is represented as six 256-bit vectors, with the five-section pattern skipping manipulations of some vectors as explained above. The order of coefficients in each vector is simply x^0, x^1, ...; multiplication by x in [39] uses more arithmetic instructions than our approach does.

At a lower level, [39] uses unsigned representatives 0, 1, 2 of Z/3, and uses a manually designed sequence of 19 bit operations for a multiply-add operation in Z/3. Langley [43] suggested a sequence of 16 bit operations, or 14 if an "and-not" operation is available. As in [20], we use just 9 bit operations.

The inversion software from [39] is 798 lines (32273 bytes) of Python generating assembly, producing 26634 bytes of object code. Our inversion software is 541 lines (14675 bytes) of C with intrinsics, compiled with gcc 8.2.1 to generate assembly, producing 10545 bytes of object code.

7.3. More NTRU examples. More than 1/3 of the key-generation time in [39] is consumed by a second inversion, which replaces the modulus 3 with 2^13. This is handled in [39] with an initial inversion modulo 2 followed by 8 multiplications modulo 2^13 for a Newton iteration.^12 The initial inversion modulo 2 is handled in [39] by exponentiation, taking just 10332 cycles; the point here is that squaring modulo 2 is particularly efficient.

^12 This use of Newton iteration follows the NTRU key-generation strategy suggested in [61], but we point out a difference in the details. All 8 multiplications in [39] are carried out modulo 2^13, while [61] instead follows the traditional precision doubling in Newton iteration: compute an inverse modulo 2^2, then 2^4, then 2^8 (or 2^7), then 2^13. Perhaps limiting the precision of the first 6 multiplications would save time in [39].

Much more inversion time is spent in key generation in sntrup4591761, a slightly larger cryptosystem from the NTRU Prime [14] family.^13 NTRU Prime uses a prime modulus, such as 4591 for sntrup4591761, to avoid concerns that a power-of-2 modulus could allow attacks (see generally [14]), but this modulus also makes arithmetic slower. Before our work, the key-generation software for sntrup4591761 took 6 million cycles, partly for inversion in (Z/3)[x]/(x^761 − x − 1) but mostly for inversion in (Z/4591)[x]/(x^761 − x − 1). We reimplemented both of these inversion steps, reducing the total key-generation time to just 940852 cycles.^14 Almost all of our inversion code modulo 3 is shared between sntrup4591761 and ntruhrss701.
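The Newton (Hensel) iteration mentioned above can be sketched as follows; this is a generic illustration under our own naming, not the code from [39] or [61]. Given a ring element a and an approximation v with av ≡ 1 modulo 2, each step v ← v(2 − av) doubles the 2-adic precision of v, since av − 1 is squared.

    def lift_inverse(a, v, target_bits):
        # a, v live in a commutative ring where working modulo powers of 2 makes
        # sense, e.g. polynomials with coefficients modulo 2^13; v is assumed to
        # be an inverse of a modulo 2
        bits = 1
        while bits < target_bits:
            v = v * (2 - a * v)   # now an inverse of a modulo 2^(2*bits)
            bits *= 2
        return v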

7.4. Does lattice-based cryptography need fast inversion? Lyubashevsky, Peikert, and Regev [46] published an NTRU alternative that avoids inversions.^15 However, this cryptosystem has slower encryption than NTRU, has slower decryption than NTRU, and appears to be covered by a 2010 patent [35] by Gaborit and Aguilar-Melchor. It is easy to envision applications that will (1) select NTRU and (2) benefit from faster inversions in NTRU key generation.^16

Lyubashevsky and Seiler have very recently announced "NTTRU" [47], a variant of NTRU with a ring chosen to allow very fast NTT-based (FFT-based) multiplication and division. This variant triggers many of the security concerns described in [14], but if the variant survives security analysis then it will eliminate concerns about key-generation speed in NTRU.

8 Definition of 2-adic division steps

Define divstep: Z × Z_2^* × Z_2 → Z × Z_2^* × Z_2 as follows:

    divstep(δ, f, g) =
        (1 − δ, g, (g − f)/2)               if δ > 0 and g is odd,
        (1 + δ, f, (g + (g mod 2)f)/2)      otherwise.

Note that g − f is even in the first case (since both f and g are odd), and g + (g mod 2)f is even in the second case (since f is odd).

8.1. Transition matrices. Write (δ_1, f_1, g_1) = divstep(δ, f, g). Then

    (f_1, g_1) = T(δ, f, g) (f, g)   and   (1, δ_1) = S(δ, f, g) (1, δ),

where T: Z × Z_2^* × Z_2 → M_2(Z[1/2]) is defined by

    T(δ, f, g) = (  0    1  )
                 ( −1/2  1/2 )      if δ > 0 and g is odd,

    T(δ, f, g) = (  1            0  )
                 ( (g mod 2)/2   1/2 )      otherwise,

and S: Z × Z_2^* × Z_2 → M_2(Z) is defined by

    S(δ, f, g) = ( 1   0 )
                 ( 1  −1 )      if δ > 0 and g is odd,

    S(δ, f, g) = ( 1   0 )
                 ( 1   1 )      otherwise.

^13 NTRU Prime is another round-2 NIST submission. One other NTRU-based encryption system, NTRUEncrypt [29], was submitted to the competition but has been merged into NTRU-HRSS for round 2; see [59]. For completeness, we mention that NTRUEncrypt key generation took 980760 cycles for ntrukem443 and 2639756 cycles for ntrukem743. As far as we know, the NTRUEncrypt software was not designed to run in constant time.

^14 This is a median cycle count from the SUPERCOP benchmarking framework. There is considerable variation in the cycle counts, more than 3% between quartiles, presumably depending on the mapping from virtual addresses to physical addresses. There is no dependence on the secret input being inverted.

^15 The LPR cryptosystem is the basis for several round-2 NIST submissions: Kyber, LAC, NewHope, Round5, Saber, ThreeBears, and the "NTRU LPRime" option in NTRU Prime.

^16 Google, for example, selected NTRU-HRSS for CECPQ2, as noted above, with the following explanation: "Schemes with a quotient-style key (like HRSS) will probably have faster encap/decap operations at the cost of much slower key-generation. Since there will be many uses outside TLS where keys can be reused, this is interesting as long as the key-generation speed is still reasonable for TLS." See [42].

Note that both T(δ, f, g) and S(δ, f, g) are defined entirely by δ, the bottom bit of f (which is always 1), and the bottom bit of g; they do not depend on the remaining bits of f and g.

8.2. Decomposition. This 2-adic divstep, like the power-series divstep, is a composition of two simpler functions: a conditional swap that replaces (δ, f, g) with (−δ, g, −f) if δ > 0 and g is odd, followed by an elimination that replaces (δ, f, g) with (1 + δ, f, (g + (g mod 2)f)/2).

8.3. Sizes. Write (δ_1, f_1, g_1) = divstep(δ, f, g). If f and g are integers that fit into n + 1 bits two's-complement, in the sense that −2^n ≤ f < 2^n and −2^n ≤ g < 2^n, then also −2^n ≤ f_1 < 2^n and −2^n ≤ g_1 < 2^n. Similarly, if −2^n < f < 2^n and −2^n < g < 2^n, then −2^n < f_1 < 2^n and −2^n < g_1 < 2^n. Beware, however, that intermediate results such as g − f need an extra bit.

As in the polynomial case, an important feature of divstep is that iterating divstep on integer inputs eventually sends g to 0. Concretely, we will show that (D + o(1))n iterations are sufficient, where D = 2/(10 − log_2 633) = 2.882100569...; see Theorem 11.2 for exact bounds. The proof technique cannot do better than (d + o(1))n iterations, where d = 14/(15 − log_2(561 + √249185)) = 2.828339631.... Numerical evidence is consistent with the possibility that (d + o(1))n iterations are required in the worst case, which is what matters in the context of constant-time algorithms.

8.4. A non-functioning variant. Many variations are possible in the definition of divstep. However, some care is required, as the following variant illustrates.

Define posdivstep: Z × Z_2^* × Z_2 → Z × Z_2^* × Z_2 as follows:

    posdivstep(δ, f, g) =
        (1 − δ, g, (g + f)/2)               if δ > 0 and g is odd,
        (1 + δ, f, (g + (g mod 2)f)/2)      otherwise.

This is just like divstep, except that it eliminates the negation of f in the swapped case.

It is not true that iterating posdivstep on integer inputs eventually sends g to 0. For example, posdivstep^2(1, 1, 1) = (1, 1, 1). For comparison, in the polynomial case we were free to negate f (or g) at any moment.

8.5. A centered variant. As another variant of divstep, we define cdivstep: Z × Z_2^* × Z_2 → Z × Z_2^* × Z_2 as follows:

    cdivstep(δ, f, g) =
        (1 − δ, g, (f − g)/2)               if δ > 0 and g is odd,
        (1 + δ, f, (g + (g mod 2)f)/2)      if δ = 0,
        (1 + δ, f, (g − (g mod 2)f)/2)      otherwise.


Iterating cdivstep produces the same intermediate results as a variable-time gcd algorithm introduced by Stehlé and Zimmermann in [62]. The analogous gcd algorithm for divstep is a new variant of the Stehlé–Zimmermann algorithm, and we analyze the performance of this variant to prove bounds on the performance of divstep; see Appendices E, F, and G. As an analogy, one of our proofs in the polynomial case relies on relating x-adic divstep to the Euclid–Stevin algorithm.

The extra case distinctions make each cdivstep iteration more complicated than each divstep iteration. Similarly, the Stehlé–Zimmermann algorithm is more complicated than the new variant. To understand the motivation for cdivstep, compare divstep^3(−2, f, g) to cdivstep^3(−2, f, g). One has divstep^3(−2, f, g) = (1, f, g_3) where

    8g_3 ∈ {g, g + f, g + 2f, g + 3f, g + 4f, g + 5f, g + 6f, g + 7f}.

One has cdivstep^3(−2, f, g) = (1, f, g′_3) where

    8g′_3 ∈ {g − 3f, g − 2f, g − f, g, g + f, g + 2f, g + 3f, g + 4f}.

It seems intuitively clear that the "centered" output g′_3 is smaller than the "uncentered" output g_3, that more generally cdivstep has smaller outputs than divstep, and that cdivstep thus uses fewer iterations than divstep.

However, our analyses do not support this intuition. All available evidence indicates that, surprisingly, the worst case of cdivstep uses almost 10% more iterations than the worst case of divstep. Concretely, our proof technique cannot do better than (c + o(1))n iterations for cdivstep, where c = 2/(log_2(√17 − 1) − 1) = 3.110510062..., and numerical evidence is consistent with the possibility that (c + o(1))n iterations are required.

8.6. A plus-or-minus variant. As yet another example, the following variant of divstep is essentially^17 the Brent–Kung "Algorithm PM" from [24]:

    pmdivstep(δ, f, g) =
        (−δ, g, (f − g)/2)      if δ ≥ 0 and (g − f) mod 4 = 0,
        (−δ, g, (f + g)/2)      if δ ≥ 0 and (g + f) mod 4 = 0,
        (δ, f, (g − f)/2)       if δ < 0 and (g − f) mod 4 = 0,
        (δ, f, (g + f)/2)       if δ < 0 and (g + f) mod 4 = 0,
        (1 + δ, f, g/2)         otherwise.

There are even more cases here than in cdivstep, although splitting out an initial conditional swap reduces the number of cases to 3. Another complication here is that the linear combinations used in pmdivstep depend on the bottom two bits of f and g, rather than just the bottom bit.

To understand the motivation for pmdivstep, note that after each addition or subtraction there are always at least two halvings, as in the "sliding window" approach to exponentiation. Again it seems intuitively clear that this reduces the number of iterations required. Consider, however, the following example (which is from [24, Theorem 3], modulo minor details): pmdivstep needs 3b − 1 iterations to reach g = 0 starting from (1, 1, 3 · 2^{b−2}). Specifically, the first b − 2 iterations produce

    (2, 1, 3 · 2^{b−3}), (3, 1, 3 · 2^{b−4}), ..., (b − 2, 1, 3 · 2), (b − 1, 1, 3),

and then the next 2b + 1 iterations produce

    (1 − b, 3, 2), (2 − b, 3, 1), (2 − b, 3, 2), (3 − b, 3, 1), (3 − b, 3, 2), ..., (−1, 3, 1), (−1, 3, 2), (0, 3, 1), (0, 1, 2), (1, 1, 1), (−1, 1, 0).

^17 The Brent–Kung algorithm has an extra loop to allow the case that both inputs f, g are even. Compare [24, Figure 4] to the definition of pmdivstep.

24 Fast constant-time gcd computation and modular inversion

Brent and Kung prove that dcne systolic cells are enough for their algorithm wherec = 3110510062 is the constant mentioned above and they conjecture that (3 + o(1))ncells are enough so presumably (3 + o(1))n iterations of pmdivstep are enough Ouranalysis shows that fewer iterations are sufficient for divstep even though divstep issimpler than pmdivstep

9 Iterates of 2-adic division stepsThe following results are analogous to the x-adic results in Section 4 We state theseresults separately to support verification

Starting from (δ f g) isin ZtimesZlowast2timesZ2 define (δn fn gn) = divstepn(δ f g) isin ZtimesZlowast2timesZ2Tn = T (δn fn gn) isinM2(Z[12]) and Sn = S(δn fn gn) isinM2(Z)

Theorem 91(fngn

)= Tnminus1 middot middot middot Tm

(fmgm

)and

(1δn

)= Snminus1 middot middot middot Sm

(1δm

)if n ge m ge 0

Proof This follows from(fm+1gm+1

)= Tm

(fmgm

)and

(1

δm+1

)= Sm

(1δm

)

Theorem 92 Tnminus1 middot middot middot Tm isin( 1

2nminusmminus1Z 12nminusmminus1Z

12nminusmZ 1

2nminusmZ

)if n gt m ge 0

Proof As in Theorem 43

Theorem 93 Snminus1 middot middot middot Sm isin(

1 02minus (nminusm) nminusmminus 2 nminusm 1minus1

)if n gt

m ge 0

Proof As in Theorem 44

Theorem 94 Let m t be nonnegative integers Assume that δprimem = δm f primem equiv fm(mod 2t) and gprimem equiv gm (mod 2t) Then for each integer n with m lt n le m+ t T primenminus1 =Tnminus1 S primenminus1 = Snminus1 δprimen = δn f primen equiv fn (mod 2tminus(nminusmminus1)) and gprimen equiv gn (mod 2tminus(nminusm))

Proof Fix m t and induct on n Observe that (δprimenminus1 fprimenminus1 mod 2 gprimenminus1 mod 2) equals

(δnminus1 fnminus1 mod 2 gnminus1 mod 2)

bull If n = m + 1 By assumption f primem equiv fm (mod 2t) and n le m + t so t ge 1 so inparticular f primem mod 2 = fm mod 2 Similarly gprimem mod 2 = gm mod 2 By assumptionδprimem = δm

bull If n gt m + 1 f primenminus1 equiv fnminus1 (mod 2tminus(nminusmminus2)) by the inductive hypothesis andn le m+ t so tminus(nminusmminus2) ge 2 so in particular f primenminus1 mod 2 = fnminus1 mod 2 similarlygprimenminus1 equiv gnminus1 (mod 2tminus(nminusmminus1)) by the inductive hypothesis and tminus (nminusmminus1) ge 1so in particular gprimenminus1 mod 2 = gnminus1 mod 2 and δprimenminus1 = δnminus1 by the inductivehypothesis

Now use the fact that T and S inspect only the bottom bit of each number to see that

T primenminus1 = T (δprimenminus1 fprimenminus1 g

primenminus1) = T (δnminus1 fnminus1 gnminus1) = Tnminus1

S primenminus1 = S(δprimenminus1 fprimenminus1 g

primenminus1) = S(δnminus1 fnminus1 gnminus1) = Snminus1

as claimedMultiply

(1

δprimenminus1

)=(

1δnminus1

)by S primenminus1 = Snminus1 to see that δprimen = δn as claimed

Daniel J Bernstein and Bo-Yin Yang 25

deftruncate(ft)ift==0return0twot=1ltlt(t-1)return((f+twot)amp(2twot-1))-twot

defdivsteps2(ntdeltafg)asserttgt=nandngt=0fg=truncate(ft)truncate(gt)uvqr=1001

whilengt0f=truncate(ft)ifdeltagt0andgamp1deltafguvqr=-deltag-fqr-u-vg0=gamp1deltagqr=1+delta(g+g0f)2(q+g0u)2(r+g0v)2nt=n-1t-1g=truncate(ZZ(g)t)

M2Q=MatrixSpace(QQ2)returndeltafgM2Q((uvqr))

Figure 101 Algorithm divsteps2 to compute (δn fn gn) = divstepn(δ f g) Inputsn t isin Z with 0 le n le t δ isin Z at least bottom t bits of f g isin Z2 Outputs δn bottom tbits of fn if n = 0 or tminus (nminus 1) bits if n ge 1 bottom tminus n bits of gn Tnminus1 middot middot middot T0

By the inductive hypothesis (T primem T primenminus2) = (Tm Tnminus2) and T primenminus1 = Tnminus1 so

T primenminus1 middot middot middot T primem = Tnminus1 middot middot middot Tm Abbreviate Tnminus1 middot middot middot Tm as P Then(f primengprimen

)= P

(f primemgprimem

)and(

fngn

)= P

(fmgm

)by Theorem 91 so

(f primen minus fngprimen minus gn

)= P

(f primem minus fmgprimem minus gm

)

By Theorem 92 P isin( 1

2nminusmminus1Z 12nminusmminus1Z

12nminusmZ 1

2nminusmZ

) By assumption f primem minus fm and gprimem minus gm

are multiples of 2t Multiply to see that f primen minus fn is a multiple of 2t2nminusmminus1 = 2tminus(nminusmminus1)

as claimed and gprimen minus gn is a multiple of 2t2nminusm = 2tminus(nminusm) as claimed

10 Fast computation of iterates of 2-adic division stepsWe describe just as in Section 5 two algorithms to compute (δn fn gn) = divstepn(δ f g)We are given the bottom t bits of f g and compute fn gn to tminusn+1 tminusn bits respectivelyif n ge 1 see Theorem 94 We also compute the product of the relevant T matrices Wespecify these algorithms in Figures 101ndash102 as functions divsteps2 and jumpdivsteps2in Sage

The first function divsteps2 simply takes divsteps one after another analogouslyto divstepsx The second function jumpdivsteps2 uses a straightforward divide-and-conquer strategy analogously to jumpdivstepsx More generally j in jumpdivsteps2can be taken anywhere between 1 and nminus 1 this generalization includes divsteps2 as aspecial case

As in Section 5 the Sage functions are not constant-time but it is easy to buildconstant-time versions of the same computations The comments on performance inSection 5 are applicable here with integer arithmetic replacing polynomial arithmeticThe same basic optimization techniques are also applicable here

26 Fast constant-time gcd computation and modular inversion

fromdivsteps2importdivsteps2truncate

defjumpdivsteps2(ntdeltafg)asserttgt=nandngt=0ifnlt=1returndivsteps2(ntdeltafg)

j=n2

deltaf1g1P1=jumpdivsteps2(jjdeltafg)fg=P1vector((fg))fg=truncate(ZZ(f)t-j)truncate(ZZ(g)t-j)

deltaf2g2P2=jumpdivsteps2(n-jn-jdeltafg)fg=P2vector((fg))fg=truncate(ZZ(f)t-n+1)truncate(ZZ(g)t-n)

returndeltafgP2P1

Figure 102 Algorithm jumpdivsteps2 to compute (δn fn gn) = divstepn(δ f g) Sameinputs and outputs as in Figure 101

11 Fast integer gcd computation and modular inversionThis section presents two algorithms If f is an odd integer and g is an integer then gcd2computes gcdf g If also gcdf g = 1 then recip2 computes the reciprocal of g modulof Figure 111 specifies both algorithms and Theorem 112 justifies both algorithms

The algorithms take the smallest nonnegative integer d such that |f | lt 2d and |g| lt 2dMore generally one can take d as an extra input the algorithms then take constant time ifd is constant Theorem 112 relies on f2 + 4g2 le 5 middot 22d (which is certainly true if |f | lt 2dand |g| lt 2d) but does not rely on d being minimal One can further generalize to alloweven f by first finding the number of shared powers of 2 in f and g and then reducing tothe odd case

The algorithms then choose a positive integer m as a particular function of d andcall divsteps2 (which can be transparently replaced with jumpdivsteps2) to compute(δm fm gm) = divstepm(1 f g) The choice of m guarantees gm = 0 and fm = plusmn gcdf gby Theorem 112

The gcd2 algorithm uses input precisionm+d to obtain d+1 bits of fm ie to obtain thesigned remainder of fm modulo 2d+1 This signed remainder is exactly fm = plusmn gcdf gsince the assumptions |f | lt 2d and |g| lt 2d imply |gcdf g| lt 2d

The recip2 algorithm uses input precision just m+ 1 to obtain the signed remainder offm modulo 4 This algorithm assumes gcdf g = 1 so fm = plusmn1 so the signed remainderis exactly fm The algorithm also extracts the top-right corner vm of the transition matrixand multiplies by fm obtaining a reciprocal of g modulo f by Theorem 112

Both gcd2 and recip2 are bottom-up algorithms and in particular recip2 naturallyobtains this reciprocal vmfm as a fraction with denominator 2mminus1 It multiplies by 2mminus1

to obtain an integer and then multiplies by ((f + 1)2)mminus1 modulo f to divide by 2mminus1

modulo f The reciprocal of 2mminus1 can be computed ahead of time Another way tocompute this reciprocal is through Hensel lifting to compute fminus1 (mod 2mminus1) for detailsand further techniques see Section 67

We emphasize that recip2 assumes that its inputs are coprime it does not notice ifthis assumption is violated One can check the assumption by computing fm to higherprecision (as in gcd2) or by multiplying the supposed reciprocal by g and seeing whether

Daniel J Bernstein and Bo-Yin Yang 27

fromdivsteps2importdivsteps2

defiterations(d)return(49d+80)17ifdlt46else(49d+57)17

defgcd2(fg)assertfamp1d=max(fnbits()gnbits())m=iterations(d)deltafmgmP=divsteps2(mm+d1fg)returnabs(fm)

defrecip2(fg)assertfamp1d=max(fnbits()gnbits())m=iterations(d)precomp=Integers(f)((f+1)2)^(m-1)deltafmgmP=divsteps2(mm+11fg)V=sign(fm)ZZ(P[0][1]2^(m-1))returnZZ(Vprecomp)

Figure 111 Algorithm gcd2 to compute gcdf g algorithm recip2 to compute thereciprocal of g modulo f when gcdf g = 1 Both algorithms assume that f is odd

the result is 1 modulo f For comparison the recip algorithm from Section 6 does noticeif its polynomial inputs are coprime using a quick test of δm

Theorem 112 Let f be an odd integer Let g be an integer Define (δn fn gn) =

divstepn(1 f g) Tn = T (δn fn gn) and(un vnqn rn

)= Tnminus1 middot middot middot T0 Let d be a real number

Assume that f2 + 4g2 le 5 middot 22d Let m be an integer Assume that m ge b(49d+ 80)17c ifd lt 46 and that m ge b(49d+ 57)17c if d ge 46 Then m ge 1 gm = 0 fm = plusmn gcdf g2mminus1vm isin Z and 2mminus1vmg equiv 2mminus1fm (mod f)

In particular if gcdf g = 1 then fm = plusmn1 and dividing 2mminus1vmfm by 2mminus1 modulof produces the reciprocal of g modulo f

Proof First f2 ge 1 so 1 le f2 + 4g2 le 5 middot 22d lt 22d+73 since 53 lt 27 Hence d gt minus76If d lt 46 then m ge b(49d+ 80)17c ge b(49(minus76) + 80)17c = 1 If d ge 46 thenm ge b(49d+ 57)17c ge b(49 middot 46 + 57)17c = 135 Either way m ge 1 as claimed

Define R0 = f R1 = 2g and G = gcdR0 R1 Then G = gcdf g since f is oddDefine b = log2

radicR2

0 +R21 = log2

radicf2 + 4g2 ge 0 Then 22b = f2 + 4g2 le 5 middot 22d so

b le d+ log4 5 We now split into three cases in each case defining n as in Theorem G6

bull If b le 21 Define n = b19b7c Then n le b49b17c le b49(d+ log4 5)17c leb(49d+ 57)17c le m since 549 le 457

bull If 21 lt b le 46 Define n = b(49b+ 23)17c If d ge 46 then n le b(49d+ 23)17c leb(49d+ 57)17c le m Otherwise d lt 46 so n le b(49(d+ log4 5) + 23)17c leb(49d+ 80)17c le m

bull If b gt 46 Define n = b49b17c Then n le b49b17c le b49(d+ log4 5)17c leb(49d+ 57)17c le m

28 Fast constant-time gcd computation and modular inversion

In all cases n le mNow divstepn(1 R0 R12) isin ZtimesZtimes0 by Theorem G6 so divstepm(1 R0 R12) isin

Ztimes Ztimes 0 so

(δm fm gm) = divstepm(1 f g) = divstepm(1 R0 R12) isin Ztimes GminusG times 0

by Theorem G3 Hence gm = 0 as claimed and fm = plusmnG = plusmn gcdf g as claimedBoth um and vm are in (12mminus1)Z by Theorem 92 In particular 2mminus1vm isin Z as

claimed Finally fm = umf + vmg by Theorem 91 so 2mminus1fm = 2mminus1umf + 2mminus1vmgand 2mminus1um isin Z so 2mminus1fm equiv 2mminus1vmg (mod f) as claimed

12 Software case study for integer modular inversionAs emphasized in Section 1 we have selected an inversion problem to be extremely favorableto Fermatrsquos method We use Intel platforms with 64-bit multipliers We focus on theproblem of inverting modulo the well-known Curve25519 prime p = 2255 minus 19 Therehas been a long line of Curve25519 implementation papers starting with [10] in 2006 wecompare to the latest speed records [54] (in assembly) for inversion modulo this prime

Despite all these Fermat advantages we achieve better speeds using our 2-adic divisionsteps As mentioned in Section 1 we take 10050 cycles 8778 cycles and 8543 cycles onHaswell Skylake and Kaby Lake where [54] used 11854 cycles 9301 cycles and 8971cycles This section looks more closely at how the new computation works

121 Details of the new computation Theorem 112 guarantees that 738 divstepiterations suffice since b(49 middot 255 + 57)17c = 738 To simplify the computation weactually use 744 = 12 middot 62 iterations

The decisions in the first 62 iterations are determined entirely by the bottom 62 bitsof the inputs We perform these iterations entirely in 64-bit words apply the resulting62-step transition matrix to the entire original inputs to jump to the result of 62 divstepiterations and then repeat

This is similar in two important ways to Lehmerrsquos gcd computation from [44] One ofLehmerrsquos ideas mentioned before is to carry out some steps in limited precision Anotherof Lehmerrsquos ideas is for this limit to be a small constant Our advantage over Lehmerrsquoscomputation is that we have a completely regular constant-time data flow

There are many other possibilities for which precision to use at each step All of thesepossibilities can be described using the jumping framework described in Section 53 whichapplies to both the x-adic case and the 2-adic case Figure 122 gives three examples ofjump strategies The first strategy computes each divstep in turn to full precision as inthe case study in Section 7 The second strategy is what we use in this case study Thethird strategy illustrates an asymptotically faster choice of jump distances The thirdstrategy might be more efficient than the second strategy in hardware but the secondstrategy seems optimal for 64-bit CPUs

In more detail we invert a 255-bit x modulo p as follows

1 Set f = p g = x δ = 1 i = 1

2 Set f0 = f (mod 264) g0 = g (mod 264)

3 Compute (δprime f1 g1) = divstep62(δ f0 g0) and obtain a scaled transition matrix Tist Ti

262

(f0g0

)=(f1g1

) The 63-bit signed entries of Ti fit into 64-bit registers We

call this step jump64divsteps2

4 Compute (f g)larr Ti(f g)262 Set δ = δprime

Daniel J Bernstein and Bo-Yin Yang 29

Figure 122 Three jump strategies for 744 divstep iterations on 255-bit integers Leftstrategy j = 1 ie computing each divstep in turn to full precision Middle strategyj = 1 for le62 requested iterations else j = 62 ie using 62 bottom bits to compute 62iterations jumping 62 iterations using 62 new bottom bits to compute 62 more iterationsetc Right strategy j = 1 for 2 requested iterations j = 2 for 3 or 4 requested iterationsj = 4 for 5 or 6 or 7 or 8 requested iterations etc Vertical axis 0 on top through 744 onbottom number of iterations Horizontal axis (within each strategy) 0 on left through254 on right bit positions used in computation

5 Set ilarr i+ 1 Go back to step 3 if i le 12

6 Compute v mod p where v is the top-right corner of T12T11 middot middot middot T1

(a) Compute pair-products T2iT2iminus1 with entries being 126-bit signed integers whichfits into two 64-bit limbs (radix 264)

(b) Compute quad-products T4iT4iminus1T4iminus2T4iminus3 with entries being 252-bit signedintegers (four 64-bit limbs radix 264)

(c) At this point the three quad-products are converted into unsigned integersmodulo p divided into 4 64-bit limbs

(d) Compute final vector-matrix-vector multiplications modulo p

7 Compute xminus1 = sgn(f) middot v middot 2minus744 (mod p) where 2minus744 is precomputed

123 Other CPUs Our advantage is larger on platforms with smaller multipliers Forexample we have written software to invert modulo 2255 minus 19 on an ARM Cortex-A7obtaining a median cycle count of 35277 cycles The best previous Cortex-A7 result wehave found in the literature is 62648 cycles reported by Fujii and Aranha [34] usingFermatrsquos method Internally our software does repeated calls to jump32divsteps2 units

30 Fast constant-time gcd computation and modular inversion

of 30 iterations with the resulting transition matrix fitting inside 32-bit general-purposeregisters

124 Other primes Nath and Sarkar present inversion speeds for 20 different specialprimes 12 of which were considered in previous work For concreteness we considerinversion modulo the double-size prime 2511 minus 187 AranhandashBarretondashPereirandashRicardini [8]selected this prime for their ldquoM-511rdquo curve

Nath and Sarkar report that their Fermat inversion software for 2511 minus 187 takes 72804Haswell cycles 47062 Skylake cycles or 45014 Kaby Lake cycles The slowdown from2255 minus 19 more than 5times in each case is easy to explain the exponent is twice as longproducing almost twice as many multiplications each multiplication handles twice as manybits and multiplication cost per bit is not constant

Our 511-bit software is solidly faster under 30000 cycles on all of these platforms Mostof the cycles in our computation are in evaluating jump64divsteps2 and the number ofcalls to jump64divsteps2 scales linearly with the number of bits in the input There issome overhead for multiplications so a modulus of twice the size takes more than twice aslong but our scalability to large moduli is much better than the Fermat scalability

Our advantage is also larger in applications that use ldquorandomrdquo primes rather thanspecial primes we pay the extra cost for Montgomery multiplication only at the end ofour computation whereas Fermat inversion pays the extra cost in each step

References[1] mdash (no editor) Actes du congregraves international des matheacutematiciens tome 3 Gauthier-

Villars Eacutediteur Paris 1971 See [41]

[2] mdash (no editor) CWIT 2011 12th Canadian workshop on information theoryKelowna British Columbia May 17ndash20 2011 Institute of Electrical and ElectronicsEngineers 2011 ISBN 978-1-4577-0743-8 See [51]

[3] Carlisle Adams Jan Camenisch (editors) Selected areas in cryptographymdashSAC 201724th international conference Ottawa ON Canada August 16ndash18 2017 revisedselected papers Lecture Notes in Computer Science 10719 Springer 2018 ISBN978-3-319-72564-2 See [15]

[4] Carlos Aguilar Melchor Nicolas Aragon Slim Bettaieb Loiumlc Bidoux OlivierBlazy Jean-Christophe Deneuville Philippe Gaborit Edoardo Persichetti GillesZeacutemor Hamming Quasi-Cyclic (HQC) (2017) URL httpspqc-hqcorgdochqc-specification_2017-11-30pdf Citations in this document sect1

[5] Alfred V Aho (chairman) Proceedings of fifth annual ACM symposium on theory ofcomputing Austin Texas April 30ndashMay 2 1973 ACM New York 1973 See [53]

[6] Martin Albrecht Carlos Cid Kenneth G Paterson CJ Tjhai Martin TomlinsonNTS-KEM ldquoThe main document submitted to NISTrdquo (2017) URL httpsnts-kemio Citations in this document sect1

[7] Alejandro Cabrera Aldaya Cesar Pereida Garciacutea Luis Manuel Alvarez Tapia BillyBob Brumley Cache-timing attacks on RSA key generation (2018) URL httpseprintiacrorg2018367 Citations in this document sect1

[8] Diego F Aranha Paulo S L M Barreto Geovandro C C F Pereira JeffersonE Ricardini A note on high-security general-purpose elliptic curves (2013) URLhttpseprintiacrorg2013647 Citations in this document sect124

Daniel J Bernstein and Bo-Yin Yang 31

[9] Elwyn R Berlekamp Algebraic coding theory McGraw-Hill 1968 Citations in thisdocument sect1

[10] Daniel J Bernstein Curve25519 new Diffie-Hellman speed records in PKC 2006 [70](2006) 207ndash228 URL httpscryptopapershtmlcurve25519 Citations inthis document sect1 sect12

[11] Daniel J Bernstein Fast multiplication and its applications in [27] (2008) 325ndash384 URL httpscryptopapershtmlmultapps Citations in this documentsect55

[12] Daniel J Bernstein Tung Chou Tanja Lange Ingo von Maurich Rafael MisoczkiRuben Niederhagen Edoardo Persichetti Christiane Peters Peter Schwabe NicolasSendrier Jakub Szefer Wen Wang Classic McEliece conservative code-based cryp-tography ldquoSupporting Documentationrdquo (2017) URL httpsclassicmcelieceorgnisthtml Citations in this document sect1

[13] Daniel J Bernstein Tung Chou Peter Schwabe McBits fast constant-time code-based cryptography in CHES 2013 [18] (2013) 250ndash272 URL httpsbinarycryptomcbitshtml Citations in this document sect1 sect1

[14] Daniel J Bernstein Chitchanok Chuengsatiansup Tanja Lange Christine vanVredendaal NTRU Prime reducing attack surface at low cost full version of[15] (2017) URL httpsntruprimecryptopapershtml Citations in thisdocument sect1 sect1 sect11 sect73 sect73 sect74

[15] Daniel J Bernstein Chitchanok Chuengsatiansup Tanja Lange Christine vanVredendaal NTRU Prime reducing attack surface at low cost in SAC 2017 [3]abbreviated version of [14] (2018) 235ndash260

[16] Daniel J Bernstein Niels Duif Tanja Lange Peter Schwabe Bo-Yin Yang High-speed high-security signatures Journal of Cryptographic Engineering 2 (2012)77ndash89 see also older version at CHES 2011 URL httpsed25519cryptoed25519-20110926pdf Citations in this document sect1

[17] Daniel J Bernstein Peter Schwabe NEON crypto in CHES 2012 [56] (2012) 320ndash339URL httpscryptopapershtmlneoncrypto Citations in this documentsect1 sect1

[18] Guido Bertoni Jean-Seacutebastien Coron (editors) Cryptographic hardware and embeddedsystemsmdashCHES 2013mdash15th international workshop Santa Barbara CA USAAugust 20ndash23 2013 proceedings Lecture Notes in Computer Science 8086 Springer2013 ISBN 978-3-642-40348-4 See [13]

[19] Adam W Bojanczyk Richard P Brent A systolic algorithm for extended GCDcomputation Computers amp Mathematics with Applications 14 (1987) 233ndash238ISSN 0898-1221 URL httpsmaths-peopleanueduau~brentpdrpb096ipdf Citations in this document sect13 sect13

[20] Tomas J Boothby and Robert W Bradshaw Bitslicing and the method of fourRussians over larger finite fields (2009) URL httparxivorgabs09011413Citations in this document sect71 sect72

[21] Joppe W Bos Constant time modular inversion Journal of Cryptographic Engi-neering 4 (2014) 275ndash281 URL httpjoppeboscomfilesCTInversionpdfCitations in this document sect1 sect1 sect1 sect1 sect1 sect13

32 Fast constant-time gcd computation and modular inversion

[22] Richard P Brent Fred G Gustavson David Y Y Yun Fast solution of Toeplitzsystems of equations and computation of Padeacute approximants Journal of Algorithms1 (1980) 259ndash295 ISSN 0196-6774 URL httpsmaths-peopleanueduau~brentpdrpb059ipdf Citations in this document sect14 sect14

[23] Richard P Brent Hsiang-Tsung Kung Systolic VLSI arrays for polynomial GCDcomputation IEEE Transactions on Computers C-33 (1984) 731ndash736 URLhttpsmaths-peopleanueduau~brentpdrpb073ipdf Citations in thisdocument sect13 sect13 sect32

[24] Richard P Brent Hsiang-Tsung Kung A systolic VLSI array for integer GCDcomputation ARITH-7 (1985) 118ndash125 URL httpsmaths-peopleanueduau~brentpdrpb077ipdf Citations in this document sect13 sect13 sect13 sect13 sect14sect86 sect86 sect86

[25] Richard P Brent Paul Zimmermann An O(M(n) logn) algorithm for the Jacobisymbol in ANTS 2010 [37] (2010) 83ndash95 URL httpsarxivorgpdf10042091pdf Citations in this document sect14

[26] Duncan A Buell (editor) Algorithmic number theory 6th international symposiumANTS-VI Burlington VT USA June 13ndash18 2004 proceedings Lecture Notes inComputer Science 3076 Springer 2004 ISBN 3-540-22156-5 See [62]

[27] Joe P Buhler Peter Stevenhagen (editors) Surveys in algorithmic number theoryMathematical Sciences Research Institute Publications 44 Cambridge UniversityPress New York 2008 See [11]

[28] Wouter Castryck Tanja Lange Chloe Martindale Lorenz Panny Joost RenesCSIDH an efficient post-quantum commutative group action in Asiacrypt 2018[55] (2018) 395ndash427 URL httpscsidhisogenyorgcsidh-20181118pdfCitations in this document sect1

[29] Cong Chen Jeffrey Hoffstein William Whyte Zhenfei Zhang NIST PQ SubmissionNTRUEncrypt A lattice based encryption algorithm (2017) URL httpscsrcnistgovProjectsPost-Quantum-CryptographyRound-1-Submissions Cita-tions in this document sect73

[30] Tung Chou McBits revisited in CHES 2017 [33] (2017) 213ndash231 URL httpstungchougithubiomcbits Citations in this document sect1

[31] Benoicirct Daireaux Veacuteronique Maume-Deschamps Brigitte Valleacutee The Lyapunovtortoise and the dyadic hare in [49] (2005) 71ndash94 URL httpemisamsorgjournalsDMTCSpdfpapersdmAD0108pdf Citations in this document sectE5sectF2

[32] Jean Louis Dornstetter On the equivalence between Berlekamprsquos and Euclidrsquos algo-rithms IEEE Transactions on Information Theory 33 (1987) 428ndash431 Citations inthis document sect1

[33] Wieland Fischer Naofumi Homma (editors) Cryptographic hardware and embeddedsystemsmdashCHES 2017mdash19th international conference Taipei Taiwan September25ndash28 2017 proceedings Lecture Notes in Computer Science 10529 Springer 2017ISBN 978-3-319-66786-7 See [30] [39]

[34] Hayato Fujii Diego F Aranha Curve25519 for the Cortex-M4 and beyond LATIN-CRYPT 2017 to appear (2017) URL httpswwwlascaicunicampbrmediapublicationspaper39pdf Citations in this document sect11 sect123

Daniel J Bernstein and Bo-Yin Yang 33

[35] Philippe Gaborit Carlos Aguilar Melchor Cryptographic method for communicatingconfidential information patent EP2537284B1 (2010) URL httpspatentsgooglecompatentEP2537284B1en Citations in this document sect74

[36] Mariya Georgieva Freacutedeacuteric de Portzamparc Toward secure implementation ofMcEliece decryption in COSADE 2015 [48] (2015) 141ndash156 URL httpseprintiacrorg2015271 Citations in this document sect1

[37] Guillaume Hanrot Franccedilois Morain Emmanuel Thomeacute (editors) Algorithmic numbertheory 9th international symposium ANTS-IX Nancy France July 19ndash23 2010proceedings Lecture Notes in Computer Science 6197 2010 ISBN 978-3-642-14517-9See [25]

[38] Agnes E Heydtmann Joslashrn M Jensen On the equivalence of the Berlekamp-Masseyand the Euclidean algorithms for decoding IEEE Transactions on Information Theory46 (2000) 2614ndash2624 Citations in this document sect1

[39] Andreas Huumllsing Joost Rijneveld John M Schanck Peter Schwabe High-speedkey encapsulation from NTRU in CHES 2017 [33] (2017) 232ndash252 URL httpseprintiacrorg2017667 Citations in this document sect1 sect11 sect11 sect13 sect7sect7 sect72 sect72 sect72 sect72 sect72 sect72 sect72 sect72 sect73 sect73 sect73 sect73 sect73

[40] Burton S Kaliski The Montgomery inverse and its applications IEEE Transactionson Computers 44 (1995) 1064ndash1065 Citations in this document sect1

[41] Donald E Knuth The analysis of algorithms in [1] (1971) 269ndash274 Citations inthis document sect1 sect14 sect14

[42] Adam Langley CECPQ2 (2018) URL httpswwwimperialvioletorg20181212cecpq2html Citations in this document sect74

[43] Adam Langley Add initial HRSS support (2018)URL httpsboringsslgooglesourcecomboringssl+7b935937b18215294e7dbf6404742855e3349092cryptohrsshrssc Citationsin this document sect72

[44] Derrick H Lehmer Euclidrsquos algorithm for large numbers American MathematicalMonthly 45 (1938) 227ndash233 ISSN 0002-9890 URL httplinksjstororgsicisici=0002-9890(193804)454lt227EAFLNgt20CO2-Y Citations in thisdocument sect1 sect14 sect121

[45] Xianhui Lu Yamin Liu Dingding Jia Haiyang Xue Jingnan He Zhenfei ZhangLAC lattice-based cryptosystems (2017) URL httpscsrcnistgovProjectsPost-Quantum-CryptographyRound-1-Submissions Citations in this documentsect1

[46] Vadim Lyubashevsky Chris Peikert Oded Regev On ideal lattices and learningwith errors over rings Journal of the ACM 60 (2013) Article 43 35 pages URLhttpseprintiacrorg2012230 Citations in this document sect74

[47] Vadim Lyubashevsky Gregor Seiler NTTRU truly fast NTRU using NTT (2019)URL httpseprintiacrorg2019040 Citations in this document sect74

[48] Stefan Mangard Axel Y Poschmann (editors) Constructive side-channel analysisand secure designmdash6th international workshop COSADE 2015 Berlin GermanyApril 13ndash14 2015 revised selected papers Lecture Notes in Computer Science 9064Springer 2015 ISBN 978-3-319-21475-7 See [36]

34 Fast constant-time gcd computation and modular inversion

[49] Conrado Martiacutenez (editor) 2005 international conference on analysis of algorithmspapers from the conference (AofArsquo05) held in Barcelona June 6ndash10 2005 TheAssociation Discrete Mathematics amp Theoretical Computer Science (DMTCS)Nancy 2005 URL httpwwwemisamsorgjournalsDMTCSproceedingsdmAD01indhtml See [31]

[50] James Massey Shift-register synthesis and BCH decoding IEEE Transactions onInformation Theory 15 (1969) 122ndash127 ISSN 0018-9448 Citations in this documentsect1

[51] Todd D Mateer On the equivalence of the Berlekamp-Massey and the euclideanalgorithms for algebraic decoding in CWIT 2011 [2] (2011) 139ndash142 Citations inthis document sect1

[52] Niels Moumlller On Schoumlnhagersquos algorithm and subquadratic integer GCD computationMathematics of Computation 77 (2008) 589ndash607 URL httpswwwlysatorliuse~nissearchiveS0025-5718-07-02017-0pdf Citations in this documentsect14

[53] Robert T Moenck Fast computation of GCDs in STOC 1973 [5] (1973) 142ndash151Citations in this document sect14 sect14 sect14 sect14

[54] Kaushik Nath Palash Sarkar Efficient inversion in (pseudo-)Mersenne primeorder fields (2018) URL httpseprintiacrorg2018985 Citations in thisdocument sect11 sect12 sect12

[55] Thomas Peyrin Steven D Galbraith Advances in cryptologymdashASIACRYPT 2018mdash24th international conference on the theory and application of cryptology and infor-mation security Brisbane QLD Australia December 2ndash6 2018 proceedings part ILecture Notes in Computer Science 11272 Springer 2018 ISBN 978-3-030-03325-5See [28]

[56] Emmanuel Prouff Patrick Schaumont (editors) Cryptographic hardware and embed-ded systemsmdashCHES 2012mdash14th international workshop Leuven Belgium September9ndash12 2012 proceedings Lecture Notes in Computer Science 7428 Springer 2012ISBN 978-3-642-33026-1 See [17]

[57] Martin Roetteler Michael Naehrig Krysta M Svore Kristin E Lauter Quantumresource estimates for computing elliptic curve discrete logarithms in ASIACRYPT2017 [67] (2017) 241ndash270 URL httpseprintiacrorg2017598 Citationsin this document sect11 sect13

[58] The Sage Developers (editor) SageMath the Sage Mathematics Software System(Version 80) 2017 URL httpswwwsagemathorg Citations in this documentsect12 sect12 sect22 sect5

[59] John Schanck Announcement of NTRU-HRSS-KEM and NTRUEncrypt merger(2018) URL httpsgroupsgooglecomalistnistgovdmsgpqc-forumSrFO_vK3xbImSmjY0HZCgAJ Citations in this document sect73

[60] Arnold Schoumlnhage Schnelle Berechnung von Kettenbruchentwicklungen Acta Infor-matica 1 (1971) 139ndash144 ISSN 0001-5903 Citations in this document sect1 sect14sect14

[61] Joseph H Silverman Almost inverses and fast NTRU key creation NTRU Cryptosys-tems (Technical Note014) (1999) URL httpsassetssecurityinnovationcomstaticdownloadsNTRUresourcesNTRUTech014pdf Citations in this doc-ument sect73 sect73

Daniel J Bernstein and Bo-Yin Yang 35

[62] Damien Stehleacute Paul Zimmermann A binary recursive gcd algorithm in ANTS 2004[26] (2004) 411ndash425 URL httpspersoens-lyonfrdamienstehleBINARYhtml Citations in this document sect14 sect14 sect85 sectE sectE6 sectE6 sectE6 sectF2 sectF2sectF2 sectH sectH

[63] Josef Stein Computational problems associated with Racah algebra Journal ofComputational Physics 1 (1967) 397ndash405 Citations in this document sect1

[64] Simon Stevin Lrsquoarithmeacutetique Imprimerie de Christophle Plantin 1585 URLhttpwwwdwcknawnlpubbronnenSimon_Stevin-[II_B]_The_Principal_Works_of_Simon_Stevin_Mathematicspdf Citations in this document sect1

[65] Volker Strassen The computational complexity of continued fractions in SYM-SAC1981 [69] (1981) 51ndash67 see also newer version [66]

[66] Volker Strassen The computational complexity of continued fractions SIAM Journalon Computing 12 (1983) 1ndash27 see also older version [65] ISSN 0097-5397 Citationsin this document sect14 sect14 sect53 sect55

[67] Tsuyoshi Takagi Thomas Peyrin (editors) Advances in cryptologymdashASIACRYPT2017mdash23rd international conference on the theory and applications of cryptology andinformation security Hong Kong China December 3ndash7 2017 proceedings part IILecture Notes in Computer Science 10625 Springer 2017 ISBN 978-3-319-70696-2See [57]

[68] Emmanuel Thomeacute Subquadratic computation of vector generating polynomials andimprovement of the block Wiedemann algorithm Journal of Symbolic Computation33 (2002) 757ndash775 URL httpshalinriafrinria-00103417document Ci-tations in this document sect14

[69] Paul S Wang (editor) SYM-SAC rsquo81 proceedings of the 1981 ACM Symposium onSymbolic and Algebraic Computation Snowbird Utah August 5ndash7 1981 Associationfor Computing Machinery New York 1981 ISBN 0-89791-047-8 See [65]

[70] Moti Yung Yevgeniy Dodis Aggelos Kiayias Tal Malkin (editors) Public keycryptographymdash9th international conference on theory and practice in public-keycryptography New York NY USA April 24ndash26 2006 proceedings Lecture Notesin Computer Science 3958 Springer 2006 ISBN 978-3-540-33851-2 See [10]

A Proof of main gcd theorem for polynomialsWe give two separate proofs of the facts stated earlier relating gcd and modular reciprocalto a particular iterate of divstep in the polynomial case

bull The second proof consists of a review of the EuclidndashStevin gcd algorithm (Ap-pendix B) a description of exactly how the intermediate results in the EuclidndashStevinalgorithm relate to iterates of divstep (Appendix C) and finally a translationof gcdreciprocal results from the EuclidndashStevin context to the divstep context(Appendix D)

bull The first proof given in this appendix is shorter it works directly with divstepwithout a detour through the EuclidndashStevin labeling of intermediate results

36 Fast constant-time gcd computation and modular inversion

Theorem A1 Let k be a field Let d be a positive integer Let R0 R1 be elements of thepolynomial ring k[x] with degR0 = d gt degR1 Define f = xdR0(1x) g = xdminus1R1(1x)and (δn fn gn) = divstepn(1 f g) for n ge 0 Then

2 deg fn le 2dminus 1minus n+ δn for n ge 02 deg gn le 2dminus 1minus nminus δn for n ge 0

Proof Induct on n If n = 0 then (δn fn gn) = (1 f g) so 2 deg fn = 2 deg f le 2d =2dminus 1minus n+ δn and 2 deg gn = 2 deg g le 2(dminus 1) = 2dminus 1minus nminus δn

Assume from now on that n ge 1 By the inductive hypothesis 2 deg fnminus1 le 2dminusn+δnminus1and 2 deg gnminus1 le 2dminus nminus δnminus1

Case 1 δnminus1 le 0 Then

(δn fn gn) = (1 + δnminus1 fnminus1 (fnminus1(0)gnminus1 minus gnminus1(0)fnminus1)x)

Both fnminus1 and gnminus1 have degree at most (2d minus n minus δnminus1)2 so the polynomial xgn =fnminus1(0)gnminus1 minus gnminus1(0)fnminus1 also has degree at most (2d minus n minus δnminus1)2 so 2 deg gn le2dminus nminus δnminus1 minus 2 = 2dminus 1minus nminus δn Also 2 deg fn = 2 deg fnminus1 le 2dminus 1minus n+ δn

Case 2 δnminus1 gt 0 and gnminus1(0) = 0 Then

(δn fn gn) = (1 + δnminus1 fnminus1 fnminus1(0)gnminus1x)

so 2 deg gn = 2 deg gnminus1 minus 2 le 2dminus nminus δnminus1 minus 2 = 2dminus 1minus nminus δn As before 2 deg fn =2 deg fnminus1 le 2dminus 1minus n+ δn

Case 3 δnminus1 gt 0 and gnminus1(0) 6= 0 Then

(δn fn gn) = (1minus δnminus1 gnminus1 (gnminus1(0)fnminus1 minus fnminus1(0)gnminus1)x)

This time the polynomial xgn = gnminus1(0)fnminus1 minus fnminus1(0)gnminus1 also has degree at most(2d minus n + δnminus1)2 so 2 deg gn le 2d minus n + δnminus1 minus 2 = 2d minus 1 minus n minus δn Also 2 deg fn =2 deg gnminus1 le 2dminus nminus δnminus1 = 2dminus 1minus n+ δn

Theorem A2 In the situation of Theorem A1 define Tn = T (δn fn gn) and(un vnqn rn

)= Tnminus1 middot middot middot T0

Then xn(unrnminusvnqn) isin klowast for n ge 0 xnminus1un isin k[x] for n ge 1 xnminus1vn xnqn x

nrn isin k[x]for n ge 0 and

2 deg(xnminus1un) le nminus 3 + δn for n ge 12 deg(xnminus1vn) le nminus 1 + δn for n ge 0

2 deg(xnqn) le nminus 1minus δn for n ge 02 deg(xnrn) le n+ 1minus δn for n ge 0

Proof Each matrix Tn has determinant in (1x)klowast by construction so the product of nsuch matrices has determinant in (1xn)klowast In particular unrn minus vnqn isin (1xn)klowast

Induct on n For n = 0 we have δn = 1 un = 1 vn = 0 qn = 0 rn = 1 soxnminus1vn = 0 isin k[x] xnqn = 0 isin k[x] xnrn = 1 isin k[x] 2 deg(xnminus1vn) = minusinfin le nminus 1 + δn2 deg(xnqn) = minusinfin le nminus 1minus δn and 2 deg(xnrn) = 0 le n+ 1minus δn

Assume from now on that n ge 1 By Theorem 43 un vn isin (1xnminus1)k[x] andqn rn isin (1xn)k[x] so all that remains is to prove the degree bounds

Case 1 δnminus1 le 0 Then δn = 1 + δnminus1 un = unminus1 vn = vnminus1 qn = (fnminus1(0)qnminus1 minusgnminus1(0)unminus1)x and rn = (fnminus1(0)rnminus1 minus gnminus1(0)vnminus1)x

Daniel J Bernstein and Bo-Yin Yang 37

Note that n gt 1 since δ0 gt 0 By the inductive hypothesis 2 deg(xnminus2unminus1) le nminus 4 +δnminus1 so 2 deg(xnminus1un) le nminus 2 + δnminus1 = nminus 3 + δn Also 2 deg(xnminus2vnminus1) le nminus 2 + δnminus1so 2 deg(xnminus1vn) le n+ δnminus1 = nminus 1 + δn

For qn note that both xnminus1qnminus1 and xnminus1unminus1 have degree at most (nminus2minusδnminus1)2 sothe same is true for their k-linear combination xnqn Hence 2 deg(xnqn) le nminus 2minus δnminus1 =nminus 1minus δn Similarly 2 deg(xnrn) le nminus δnminus1 = n+ 1minus δn

Case 2 δnminus1 gt 0 and gnminus1(0) = 0 Then δn = 1 + δnminus1 un = unminus1 vn = vnminus1qn = fnminus1(0)qnminus1x and rn = fnminus1(0)rnminus1x

If n = 1 then un = u0 = 1 so 2 deg(xnminus1un) = 0 = n minus 3 + δn Otherwise2 deg(xnminus2unminus1) le nminus4 + δnminus1 by the inductive hypothesis so 2 deg(xnminus1un) le nminus3 + δnas above

Also 2 deg(xnminus1vn) le nminus 1 + δn as above 2 deg(xnqn) = 2 deg(xnminus1qnminus1) le nminus 2minusδnminus1 = nminus 1minus δn and 2 deg(xnrn) = 2 deg(xnminus1rnminus1) le nminus δnminus1 = n+ 1minus δn

Case 3 δnminus1 gt 0 and gnminus1(0) 6= 0 Then δn = 1 minus δnminus1 un = qnminus1 vn = rnminus1qn = (gnminus1(0)unminus1 minus fnminus1(0)qnminus1)x and rn = (gnminus1(0)vnminus1 minus fnminus1(0)rnminus1)x

If n = 1 then un = q0 = 0 so 2 deg(xnminus1un) = minusinfin le n minus 3 + δn Otherwise2 deg(xnminus1qnminus1) le nminus 2minus δnminus1 by the inductive hypothesis so 2 deg(xnminus1un) le nminus 2minusδnminus1 = nminus 3 + δn Similarly 2 deg(xnminus1rnminus1) le nminus δnminus1 by the inductive hypothesis so2 deg(xnminus1vn) le nminus δnminus1 = nminus 1 + δn

Finally for qn note that both xnminus1unminus1 and xnminus1qnminus1 have degree at most (n minus2 + δnminus1)2 so the same is true for their k-linear combination xnqn so 2 deg(xnqn) lenminus 2 + δnminus1 = nminus 1minus δn Similarly 2 deg(xnrn) le n+ δnminus1 = n+ 1minus δn

First proof of Theorem 62 Write ϕ = f2dminus1 By Theorem A1 2 degϕ le δ2dminus1 and2 deg g2dminus1 le minusδ2dminus1 Each fn is nonzero by construction so degϕ ge 0 so δ2dminus1 ge 0

By Theorem A2 x2dminus2u2dminus1 is a polynomial of degree at most dminus 2 + δ2dminus12 DefineC0 = xminusd+δ2dminus12u2dminus1(1x) then C0 is a polynomial of degree at most dminus 2 + δ2dminus12Similarly define C1 = xminusd+1+δ2dminus12v2dminus1(1x) and observe that C1 is a polynomial ofdegree at most dminus 1 + δ2dminus12

Multiply C0 by R0 = xdf(1x) multiply C1 by R1 = xdminus1g(1x) and add to obtain

C0R0 + C1R1 = xδ2dminus12(u2dminus1(1x)f(1x) + v2dminus1(1x)g(1x))= xδ2dminus12ϕ(1x)

by Theorem 42Define Γ as the monic polynomial xδ2dminus12ϕ(1x)ϕ(0) of degree δ2dminus12 We have

just shown that Γ isin R0k[x] + R1k[x] specifically Γ = (C0ϕ(0))R0 + (C1ϕ(0))R1 Tofinish the proof we will show that R0 isin Γk[x] This forces Γ = gcdR0 R1 = G which inturn implies degG = δ2dminus12 and V = C1ϕ(0)

Observe that δ2d = 1 + δ2dminus1 and f2d = ϕ the ldquoswappedrdquo case is impossible here sinceδ2dminus1 gt 0 implies g2dminus1 = 0 By Theorem A1 again 2 deg g2d le minus1minus δ2d = minus2minus δ2dminus1so g2d = 0

By Theorem 42 ϕ = f2d = u2df + v2dg and 0 = g2d = q2df + r2dg so x2dr2dϕ = ∆fwhere ∆ = x2d(u2dr2dminus v2dq2d) By Theorem A2 ∆ isin klowast and p = x2dr2d is a polynomialof degree at most (2d+ 1minus δ2d)2 = dminus δ2dminus12 The polynomials xdminusδ2dminus12p(1x) andxδ2dminus12ϕ(1x) = Γ have product ∆xdf(1x) = ∆R0 so Γ divides R0 as claimed

B Review of the EuclidndashStevin algorithmThis appendix reviews the EuclidndashStevin algorithm to compute polynomial gcd includingthe extended version of the algorithm that writes each remainder as a linear combinationof the two inputs

38 Fast constant-time gcd computation and modular inversion

Theorem B1 (the EuclidndashStevin algorithm) Let k be a field Let R0 R1 be ele-ments of the polynomial ring k[x] Recursively define a finite sequence of polynomialsR2 R3 Rr isin k[x] as follows if i ge 1 and Ri = 0 then r = i if i ge 1 and Ri 6= 0 thenRi+1 = Riminus1 mod Ri Define di = degRi and `i = Ri[di] Then

bull d1 gt d2 gt middot middot middot gt dr = minusinfin

bull `i 6= 0 if 1 le i le r minus 1

bull if `rminus1 6= 0 then gcdR0 R1 = Rrminus1`rminus1 and

bull if `rminus1 = 0 then R0 = R1 = gcdR0 R1 = 0 and r = 1

Proof If 1 le i le r minus 1 then Ri 6= 0 so `i = Ri[degRi] 6= 0 Also Ri+1 = Riminus1 mod Ri sodegRi+1 lt degRi ie di+1 lt di Hence d1 gt d2 gt middot middot middot gt dr and dr = degRr = deg 0 =minusinfin

If `rminus1 = 0 then r must be 1 so R1 = 0 (since Rr = 0) and R0 = 0 (since `0 = 0) sogcdR0 R1 = 0 Assume from now on that `rminus1 6= 0 The divisibility of Ri+1 minus Riminus1by Ri implies gcdRiminus1 Ri = gcdRi Ri+1 so gcdR0 R1 = gcdR1 R2 = middot middot middot =gcdRrminus1 Rr = gcdRrminus1 0 = Rrminus1`rminus1

Theorem B2 (the extended EuclidndashStevin algorithm) In the situation of Theorem B1define U0 V0 Ur Ur isin k[x] as follows U0 = 1 V0 = 0 U1 = 0 V1 = 1 Ui+1 = Uiminus1minusbRiminus1RicUi for i isin 1 r minus 1 Vi+1 = Viminus1 minus bRiminus1RicVi for i isin 1 r minus 1Then

bull Ri = UiR0 + ViR1 for i isin 0 r

bull UiVi+1 minus Ui+1Vi = (minus1)i for i isin 0 r minus 1

bull Vi+1Ri minus ViRi+1 = (minus1)iR0 for i isin 0 r minus 1 and

bull if d0 gt d1 then deg Vi+1 = d0 minus di for i isin 0 r minus 1

Proof U0R0 + V0R1 = 1R0 + 0R1 = R0 and U1R0 + V1R1 = 0R0 + 1R1 = R1 Fori isin 1 r minus 1 assume inductively that Riminus1 = Uiminus1R0+Viminus1R1 and Ri = UiR0+ViR1and writeQi = bRiminus1Ric Then Ri+1 = Riminus1 mod Ri = Riminus1minusQiRi = (Uiminus1minusQiUi)R0+(Viminus1 minusQiVi)R1 = Ui+1R0 + Vi+1R1

If i = 0 then UiVi+1minusUi+1Vi = 1 middot 1minus 0 middot 0 = 1 = (minus1)i For i isin 1 r minus 1 assumeinductively that Uiminus1Vi minus UiViminus1 = (minus1)iminus1 then UiVi+1 minus Ui+1Vi = Ui(Viminus1 minus QVi) minus(Uiminus1 minusQUi)Vi = minus(Uiminus1Vi minus UiViminus1) = (minus1)i

Multiply Ri = UiR0 + ViR1 by Vi+1 multiply Ri+1 = Ui+1R0 + Vi+1R1 by Ui+1 andsubtract to see that Vi+1Ri minus ViRi+1 = (minus1)iR0

Finally assume d0 gt d1 If i = 0 then deg Vi+1 = deg 1 = 0 = d0 minus di Fori isin 1 r minus 1 assume inductively that deg Vi = d0 minus diminus1 Then deg ViRi+1 lt d0since degRi+1 = di+1 lt diminus1 so deg Vi+1Ri = deg(ViRi+1 +(minus1)iR0) = d0 so deg Vi+1 =d0 minus di

C EuclidndashStevin results as iterates of divstepIf the EuclidndashStevin algorithm is written as a sequence of polynomial divisions andeach polynomial division is written as a sequence of coefficient-elimination steps theneach coefficient-elimination step corresponds to one application of divstep The exactcorrespondence is stated in Theorem C2 This generalizes the known BerlekampndashMasseyequivalence mentioned in Section 1

Theorem C1 is a warmup showing how a single polynomial division relates to asequence of division steps

Daniel J Bernstein and Bo-Yin Yang 39

Theorem C1 (polynomial division as a sequence of division steps) Let k be a fieldLet R0 R1 be elements of the polynomial ring k[x] Write d0 = degR0 and d1 = degR1Assume that d0 gt d1 ge 0 Then

divstep2d0minus2d1(1 xd0R0(1x) xd0minus1R1(1x))= (1 `d0minusd1minus1

0 xd1R1(1x) (`d0minusd1minus10 `1)d0minusd1+1xd1minus1R2(1x))

where `0 = R0[d0] `1 = R1[d1] and R2 = R0 mod R1

Proof Part 1 The first d0 minus d1 minus 1 iterations If δ isin 1 2 d0 minus d1 minus 1 thend0minusδ gt d1 so R1[d0minusδ] = 0 so the polynomial `δminus1

0 xd0minusδR1(1x) has constant coefficient0 The polynomial xd0R0(1x) has constant coefficient `0 Hence

divstep(δ xd0R0(1x) `δminus10 xd0minusδR1(1x)) = (1 + δ xd0R0(1x) `δ0xd0minusδminus1R1(1x))

by definition of divstep By induction

divstepd0minusd1minus1(1 xd0R0(1x) xd0minus1R1(1x))= (d0 minus d1 x

d0R0(1x) `d0minusd1minus10 xd1R1(1x))

= (d0 minus d1 xd0R0(1x) xd1Rprime1(1x))

where Rprime1 = `d0minusd1minus10 R1 Note that Rprime1[d1] = `prime1 where `prime1 = `d0minusd1minus1

0 `1

Part 2 The swap iteration and the last d0 minus d1 iterations Define

Zd0minusd1 = R0 mod xd0minusd1Rprime1Zd0minusd1+1 = R0 mod xd0minusd1minus1Rprime1

Z2d0minus2d1 = R0 mod Rprime1 = R0 mod R1 = R2

ThenZd0minusd1 = R0 minus (R0[d0]`prime1)xd0minusd1Rprime1

Zd0minusd1+1 = Zd0minusd1 minus (Zd0minusd1 [d0 minus 1]`prime1)xd0minusd1minus1Rprime1

Z2d0minus2d1 = Z2d0minus2d1minus1 minus (Z2d0minus2d1minus1[d1]`prime1)Rprime1

Substitute 1x for x to see that

xd0Zd0minusd1(1x) = xd0R0(1x)minus (R0[d0]`prime1)xd1Rprime1(1x)xd0minus1Zd0minusd1+1(1x) = xd0minus1Zd0minusd1(1x)minus (Zd0minusd1 [d0 minus 1]`prime1)xd1Rprime1(1x)

xd1Z2d0minus2d1(1x) = xd1Z2d0minus2d1minus1(1x)minus (Z2d0minus2d1minus1[d1]`prime1)xd1Rprime1(1x)

We will use these equations to show that the d0 minus d1 2d0 minus 2d1 iterates of divstepcompute `prime1xd0minus1Zd0minusd1 (`prime1)d0minusd1+1xd1minus1Z2d0minus2d1 respectively

The constant coefficients of xd0R0(1x) and xd1Rprime1(1x) are R0[d0] and `prime1 6= 0 respec-tively and `prime1xd0R0(1x)minusR0[d0]xd1Rprime1(1x) = `prime1x

d0Zd0minusd1(1x) so

divstep(d0 minus d1 xd0R0(1x) xd1Rprime1(1x))

= (1minus (d0 minus d1) xd1Rprime1(1x) `prime1xd0minus1Zd0minusd1(1x))

40 Fast constant-time gcd computation and modular inversion

The constant coefficients of xd1Rprime1(1x) and `prime1xd0minus1Zd0minusd1(1x) are `prime1 and `prime1Zd0minusd1 [d0minus1]respectively and

(`prime1)2xd0minus1Zd0minusd1(1x)minus `prime1Zd0minusd1 [d0 minus 1]xd1Rprime1(1x) = (`prime1)2xd0minus1Zd0minusd1+1(1x)

sodivstep(1minus (d0 minus d1) xd1Rprime1(1x) `prime1xd0minus1Zd0minusd1(1x))

= (2minus (d0 minus d1) xd1Rprime1(1x) (`prime1)2xd0minus2Zd0minusd1+1(1x))

Continue in this way to see that

divstepi(d0 minus d1 xd0R0(1x) xd1Rprime1(1x))

= (iminus (d0 minus d1) xd1Rprime1(1x) (`prime1)ixd0minusiZd0minusd1+iminus1(1x))

for 1 le i le d0 minus d1 + 1 In particular

divstep2d0minus2d1(1 xd0R0(1x) xd0minus1R1(1x))= divstepd0minusd1+1(d0 minus d1 x

d0R0(1x) xd1Rprime1(1x))= (1 xd1Rprime1(1x) (`prime1)d0minusd1+1xd1minus1R2(1x))= (1 `d0minusd1minus1

0 xd1R1(1x) (`d0minusd1minus10 `1)d0minusd1+1xd1minus1R2(1x))

as claimed

Theorem C2 In the situation of Theorem B2 assume that d0 gt d1 Define f =xd0R0(1x) and g = xd0minus1R1(1x) Then f g isin k[x] Define (δn fn gn) = divstepn(1 f g)and Tn = T (δn fn gn)

Part A If i isin 0 1 r minus 2 and 2d0 minus 2di le n lt 2d0 minus di minus di+1 then

δn = 1 + nminus (2d0 minus 2di) gt 0fn = anx

diRi(1x)gn = bnx

2d0minusdiminusnminus1Ri+1(1x)

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdiUi(1x) an(1x)d0minusdiminus1Vi(1x)bn(1x)nminus(d0minusdi)+1Ui+1(1x) bn(1x)nminus(d0minusdi)Vi+1(1x)

)for some an bn isin klowast

Part B If i isin 0 1 r minus 2 and 2d0 minus di minus di+1 le n lt 2d0 minus 2di+1 then

δn = 1 + nminus (2d0 minus 2di+1) le 0fn = anx

di+1Ri+1(1x)gn = bnx

2d0minusdi+1minusnminus1Zn(1x)

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdi+1Ui+1(1x) an(1x)d0minusdi+1minus1Vi+1(1x)bn(1x)nminus(d0minusdi+1)+1Xn(1x) bn(1x)nminus(d0minusdi+1)Yn(1x)

)for some an bn isin klowast where

Zn = Ri mod x2d0minus2di+1minusnRi+1

Xn = Ui minuslfloorRix

2d0minus2di+1minusnRi+1rfloorx2d0minus2di+1minusnUi+1

Yn = Vi minuslfloorRix

2d0minus2di+1minusnRi+1rfloorx2d0minus2di+1minusnVi+1

Daniel J Bernstein and Bo-Yin Yang 41

Part C If n ge 2d0 minus 2drminus1 then

δn = 1 + nminus (2d0 minus 2drminus1) gt 0fn = anx

drminus1Rrminus1(1x)gn = 0

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdrminus1Urminus1(1x) an(1x)d0minusdrminus1minus1Vrminus1(1x)bn(1x)nminus(d0minusdrminus1)+1Ur(1x) bn(1x)nminus(d0minusdrminus1)Vr(1x)

)for some an bn isin klowast

The proof also gives recursive formulas for an bn namely (a0 b0) = (1 1) forthe base case (an bn) = (bnminus1 gnminus1(0)anminus1) for the swapped case and (an bn) =(anminus1 fnminus1(0)bnminus1) for the unswapped case One can if desired augment divstep totrack an and bn these are the denominators eliminated in a ldquofraction-freerdquo computation

Proof By hypothesis degR0 = d0 gt d1 = degR1 Hence d0 is a nonnegative integer andf = R0[d0] +R0[d0 minus 1]x+ middot middot middot+R0[0]xd0 isin k[x] Similarly g = R1[d0 minus 1] +R1[d0 minus 2]x+middot middot middot+R1[0]xd0minus1 isin k[x] this is also true for d0 = 0 with R1 = 0 and g = 0

We induct on n and split the analysis of n into six cases There are three differentformulas (A B C) applicable to different ranges of n and within each range we considerthe first n separately For brevity we write Ri = Ri(1x) U i = Ui(1x) V i = Vi(1x)Xn = Xn(1x) Y n = Yn(1x) Zn = Zn(1x)

Case A1 n = 2d0 minus 2di for some i isin 0 1 r minus 2If n = 0 then i = 0 By assumption δn = 1 as claimed fn = f = xd0R0 so fn = anx

d0R0as claimed where an = 1 gn = g = xd0minus1R1 so gn = bnx

d0minus1R1 as claimed where bn = 1Also Ui = 1 Ui+1 = 0 Vi = 0 Vi+1 = 1 and d0 minus di = 0 so(

an(1x)d0minusdiU i an(1x)d0minusdiminus1V ibn(1x)d0minusdi+1U i+1 bn(1x)d0minusdiV i+1

)=(

1 00 1

)= Tnminus1 middot middot middot T0

as claimedOtherwise n gt 0 so i gt 0 Now 2d0minusdiminus1minusdi le nminus 1 lt 2d0minus 2di so by the inductive

hypothesis δnminus1 = 1 + (nminus 1)minus (2d0 minus 2di) = 0 fnminus1 = anminus1xdiRi for some anminus1 isin klowast

gnminus1 = bnminus1xdiZnminus1 for some bnminus1 isin klowast where Znminus1 = Riminus1 mod xRi and

Tnminus2 middot middot middot T0 =(

anminus1(1x)d0minusdiU i anminus1(1x)d0minusdiminus1V ibnminus1(1x)d0minusdiXnminus1 bnminus1(1x)d0minusdiminus1Y nminus1

)where Xn = Uiminus1 minus bRiminus1xRicxUi and Yn = Viminus1 minus bRiminus1xRicxVi

Note that degZnminus1 le di Observe that

Ri+1 = Riminus1 mod Ri = Znminus1 minus (Znminus1[di]`i)RiUi+1 = Xnminus1 minus (Znminus1[di]`i)UiVi+1 = Ynminus1 minus (Znminus1[di]`i)Vi

The point is that subtracting (Znminus1[di]`i)Ri from Znminus1 eliminates the coefficient of xdi

and produces a polynomial of degree at most diminus 1 namely Znminus1 mod Ri = Riminus1 mod RiStarting from Ri+1 = Znminus1 minus (Znminus1[di]`i)Ri substitute 1x for x and multiply by

anminus1bnminus1`ixdi to see that

anminus1bnminus1`ixdiRi+1 = anminus1`ignminus1 minus bnminus1Znminus1[di]fnminus1

= fnminus1(0)gnminus1 minus gnminus1(0)fnminus1

42 Fast constant-time gcd computation and modular inversion

Now(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1)

= (1 + δnminus1 fnminus1 (fnminus1(0)gnminus1 minus gnminus1(0)fnminus1)x)

Hence δn = 1 = 1 + nminus (2d0 minus 2di) gt 0 as claimed fn = anxdiRi where an = anminus1 isin klowast

as claimed and gn = bnxdiminus1Ri+1 where bn = anminus1bnminus1`i isin klowast as claimed

Finally U i+1 = Xnminus1 minus (Znminus1[di]`i)U i and V i+1 = Y nminus1 minus (Znminus1[di]`i)V i Substi-tute these into the matrix(

an(1x)d0minusdiU i an(1x)d0minusdiminus1V ibn(1x)d0minusdi+1U i+1 bn(1x)d0minusdiV i+1

)

along with an = anminus1 and bn = anminus1bnminus1`i to see that this matrix is

Tnminus1 =(

1 0minusbnminus1Znminus1[di]x anminus1`ix

)times Tnminus2 middot middot middot T0 shown above

Case A2 2d0 minus 2di lt n lt 2d0 minus di minus di+1 for some i isin 0 1 r minus 2By the inductive hypothesis δnminus1 = n minus (2d0 minus 2di) gt 0 fnminus1 = anminus1x

diRi whereanminus1 isin klowast gnminus1 = bnminus1x

2d0minusdiminusnRi+1 where bnminus1 isin klowast and

Tnminus2 middot middot middot T0 =(

anminus1(1x)d0minusdiU i anminus1(1x)d0minusdiminus1V ibnminus1(1x)nminus(d0minusdi)U i+1 bnminus1(1x)nminus(d0minusdi)minus1V i+1

)

Note that gnminus1(0) = bnminus1Ri+1[2d0 minus di minus n] = 0 since 2d0 minus di minus n gt di+1 = degRi+1Thus

(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1) = (1 + δnminus1 fnminus1 fnminus1(0)gnminus1x)

Hence δn = 1 + n minus (2d0 minus 2di) gt 0 as claimed fn = anxdiRi where an = anminus1 isin klowast

as claimed and gn = bnx2d0minusdiminusnminus1Ri+1 where bn = fnminus1(0)bnminus1 = anminus1bnminus1`i isin klowast as

claimedAlso Tnminus1 =

(1 00 anminus1`ix

) Multiply by Tnminus2 middot middot middot T0 above to see that

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdiU i an(1x)d0minusdiminus1V ibn(1x)nminus(d0minusdi)+1U i+1 bn`i(1x)nminus(d0minusdi)V i+1

)as claimed

Case B1 n = 2d0 minus di minus di+1 for some i isin 0 1 r minus 2Note that 2d0 minus 2di le nminus 1 lt 2d0 minus di minus di+1 By the inductive hypothesis δnminus1 =

diminusdi+1 gt 0 fnminus1 = anminus1xdiRi where anminus1 isin klowast gnminus1 = bnminus1x

di+1Ri+1 where bnminus1 isin klowastand

Tnminus2 middot middot middot T0 =(

anminus1(1x)d0minusdiU i anminus1(1x)d0minusdiminus1V ibnminus1(1x)d0minusdi+1U i+1 bnminus1`i(1x)d0minusdi+1minus1V i+1

)

Observe thatZn = Ri minus (`i`i+1)xdiminusdi+1Ri+1

Xn = Ui minus (`i`i+1)xdiminusdi+1Ui+1

Yn = Vi minus (`i`i+1)xdiminusdi+1Vi+1

Substitute 1x for x in the Zn equation and multiply by anminus1bnminus1`i+1xdi to see that

anminus1bnminus1`i+1xdiZn = bnminus1`i+1fnminus1 minus anminus1`ignminus1

= gnminus1(0)fnminus1 minus fnminus1(0)gnminus1

Daniel J Bernstein and Bo-Yin Yang 43

Also note that gnminus1(0) = bnminus1`i+1 6= 0 so

(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1)= (1minus δnminus1 gnminus1 (gnminus1(0)fnminus1 minus fnminus1(0)gnminus1)x)

Hence δn = 1 minus (di minus di+1) le 0 as claimed fn = anxdi+1Ri+1 where an = bnminus1 isin klowast as

claimed and gn = bnxdiminus1Zn where bn = anminus1bnminus1`i+1 isin klowast as claimed

Finally substitute Xn = U i minus (`i`i+1)(1x)diminusdi+1U i+1 and substitute Y n = V i minus(`i`i+1)(1x)diminusdi+1V i+1 into the matrix(

an(1x)d0minusdi+1U i+1 an(1x)d0minusdi+1minus1V i+1bn(1x)d0minusdi+1Xn bn(1x)d0minusdiY n

)

along with an = bnminus1 and bn = anminus1bnminus1`i+1 to see that this matrix is Tnminus1 =(0 1

bnminus1`i+1x minusanminus1`ix

)times Tnminus2 middot middot middot T0 shown above

Case B2 2d0 minus di minus di+1 lt n lt 2d0 minus 2di+1 for some i isin 0 1 r minus 2Now 2d0minusdiminusdi+1 le nminus1 lt 2d0minus2di+1 By the inductive hypothesis δnminus1 = nminus(2d0minus

2di+1) le 0 fnminus1 = anminus1xdi+1Ri+1 for some anminus1 isin klowast gnminus1 = bnminus1x

2d0minusdi+1minusnZnminus1 forsome bnminus1 isin klowast where Znminus1 = Ri mod x2d0minus2di+1minusn+1Ri+1 and

Tnminus2 middot middot middot T0 =(

anminus1(1x)d0minusdi+1U i+1 anminus1(1x)d0minusdi+1minus1V i+1bnminus1(1x)nminus(d0minusdi+1)Xnminus1 bnminus1(1x)nminus(d0minusdi+1)minus1Y nminus1

)where Xnminus1 = Ui minus

lfloorRix

2d0minus2di+1minusn+1Ri+1rfloorx2d0minus2di+1minusn+1Ui+1 and Ynminus1 = Vi minuslfloor

Rix2d0minus2di+1minusn+1Ri+1

rfloorx2d0minus2di+1minusn+1Vi+1

Observe that

Zn = Ri mod x2d0minus2di+1minusnRi+1

= Znminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnRi+1

Xn = Xnminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnUi+1

Yn = Ynminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnVi+1

The point is that degZnminus1 le 2d0 minus di+1 minus n and subtracting

(Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnRi+1

from Znminus1 eliminates the coefficient of x2d0minusdi+1minusnStarting from

Zn = Znminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)x2d0minus2di+1minusnRi+1

substitute 1x for x and multiply by anminus1bnminus1`i+1x2d0minusdi+1minusn to see that

anminus1bnminus1`i+1x2d0minusdi+1minusnZn = anminus1`i+1gnminus1 minus bnminus1Znminus1[2d0 minus di+1 minus n]fnminus1

= fnminus1(0)gnminus1 minus gnminus1(0)fnminus1

Now(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1)

= (1 + δnminus1 fnminus1 (fnminus1(0)gnminus1 minus gnminus1(0)fnminus1)x)

Hence δn = 1 +nminus (2d0minus 2di+1) le 0 as claimed fn = anxdi+1Ri+1 where an = anminus1 isin klowast

as claimed and gn = bnx2d0minusdi+1minusnminus1Zn where bn = anminus1bnminus1`i+1 isin klowast as claimed

44 Fast constant-time gcd computation and modular inversion

Finally substitute

Xn = Xnminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)(1x)2d0minus2di+1minusnU i+1

andY n = Y nminus1 minus (Znminus1[2d0 minus di+1 minus n]`i+1)(1x)2d0minus2di+1minusnV i+1

into the matrix (an(1x)d0minusdi+1U i+1 an(1x)d0minusdi+1minus1V i+1

bn(1x)nminus(d0minusdi+1)+1Xn bn(1x)nminus(d0minusdi+1)Y n

)

along with an = anminus1 and bn = anminus1bnminus1`i+1 to see that this matrix is Tnminus1 =(1 0

minusbnminus1Znminus1[2d0 minus di+1 minus n]x anminus1`i+1x

)times Tnminus2 middot middot middot T0 shown above

Case C1 n = 2d0 minus 2drminus1 This is essentially the same as Case A1 except that i isreplaced with r minus 1 and gn ends up as 0 For completeness we spell out the details

If n = 0 then r = 1 and R1 = 0 so g = 0 By assumption δn = 1 as claimedfn = f = xd0R0 so fn = anx

d0R0 as claimed where an = 1 gn = g = 0 as claimed Definebn = 1 Now Urminus1 = 1 Ur = 0 Vrminus1 = 0 Vr = 1 and d0 minus drminus1 = 0 so(

an(1x)d0minusdrminus1Urminus1 an(1x)d0minusdrminus1minus1V rminus1bn(1x)d0minusdrminus1+1Ur bn(1x)d0minusdrminus1V r

)=(

1 00 1

)= Tnminus1 middot middot middot T0

as claimedOtherwise n gt 0 so r ge 2 Now 2d0 minus drminus2 minus drminus1 le n minus 1 lt 2d0 minus 2drminus1 so

by the inductive hypothesis (for i = r minus 2) δnminus1 = 1 + (n minus 1) minus (2d0 minus 2drminus1) = 0fnminus1 = anminus1x

drminus1Rrminus1 for some anminus1 isin klowast gnminus1 = bnminus1xdrminus1Znminus1 for some bnminus1 isin klowast

where Znminus1 = Rrminus2 mod xRrminus1 and

Tnminus2 middot middot middot T0 =(anminus1(1x)d0minusdrminus1Urminus1 anminus1(1x)d0minusdrminus1minus1V rminus1bnminus1(1x)d0minusdrminus1Xnminus1 bnminus1(1x)d0minusdrminus1minus1Y nminus1

)where Xn = Urminus2 minus bRrminus2xRrminus1cxUrminus1 and Yn = Vrminus2 minus bRrminus2xRrminus1cxVrminus1

Observe that

0 = Rr = Rrminus2 mod Rrminus1 = Znminus1 minus (Znminus1[drminus1]`rminus1)Rrminus1

Ur = Xnminus1 minus (Znminus1[drminus1]`rminus1)Urminus1

Vr = Ynminus1 minus (Znminus1[drminus1]`rminus1)Vrminus1

Starting from Znminus1 minus (Znminus1[drminus1]`rminus1)Rrminus1 = 0 substitute 1x for x and multiplyby anminus1bnminus1`rminus1x

drminus1 to see that

anminus1`rminus1gnminus1 minus bnminus1Znminus1[drminus1]fnminus1 = 0

ie fnminus1(0)gnminus1 minus gnminus1(0)fnminus1 = 0 Now

(δn fn gn) = divstep(δnminus1 fnminus1 gnminus1) = (1 + δnminus1 fnminus1 0)

Hence δn = 1 = 1 + n minus (2d0 minus 2drminus1) gt 0 as claimed fn = anxdrminus1Rrminus1 where

an = anminus1 isin klowast as claimed and gn = 0 as claimedDefine bn = anminus1bnminus1`rminus1 isin klowast and substitute Ur = Xnminus1 minus (Znminus1[drminus1]`rminus1)Urminus1

and V r = Y nminus1 minus (Znminus1[drminus1]`rminus1)V rminus1 into the matrix(an(1x)d0minusdrminus1Urminus1(1x) an(1x)d0minusdrminus1minus1Vrminus1(1x)bn(1x)d0minusdrminus1+1Ur(1x) bn(1x)d0minusdrminus1Vr(1x)

)

Daniel J Bernstein and Bo-Yin Yang 45

to see that this matrix is Tnminus1 =(

1 0minusbnminus1Znminus1[drminus1]x anminus1`rminus1x

)times Tnminus2 middot middot middot T0

shown above

Case C2 n gt 2d0 minus 2drminus1By the inductive hypothesis δnminus1 = n minus (2d0 minus 2drminus1) fnminus1 = anminus1x

drminus1Rrminus1 forsome anminus1 isin klowast gnminus1 = 0 and

Tnminus2 middot middot middot T0 =(anminus1(1x)d0minusdrminus1Urminus1(1x) anminus1(1x)d0minusdrminus1minus1Vrminus1(1x)bnminus1(1x)nminus(d0minusdrminus1)Ur(1x) bnminus1(1x)nminus(d0minusdrminus1)minus1Vr(1x)

)for some bnminus1 isin klowast Hence δn = 1 + δn = 1 + n minus (2d0 minus 2drminus1) gt 0 fn = fnminus1 =

anxdrminus1Rrminus1 where an = anminus1 and gn = 0 Also Tnminus1 =

(1 00 anminus1`rminus1x

)so

Tnminus1 middot middot middot T0 =(

an(1x)d0minusdrminus1Urminus1(1x) an(1x)d0minusdrminus1minus1Vrminus1(1x)bn(1x)nminus(d0minusdrminus1)+1Ur(1x) bn(1x)nminus(d0minusdrminus1)Vr(1x)

)where bn = anminus1bnminus1`rminus1 isin klowast

D Alternative proof of main gcd theorem for polynomialsThis appendix gives another proof of Theorem 62 using the relationship in Theorem C2between the EuclidndashStevin algorithm and iterates of divstep

Second proof of Theorem 62 Define R2 R3 Rr and di and `i as in Theorem B1 anddefine Ui and Vi as in Theorem B2 Then d0 = d gt d1 and all of the hypotheses ofTheorems B1 B2 and C2 are satisfied

By Theorem B1 G = gcdR0 R1 = Rrminus1`rminus1 (since R0 6= 0) so degG = drminus1Also by Theorem B2

Urminus1R0 + Vrminus1R1 = Rrminus1 = `rminus1G

so (Vrminus1`rminus1)R1 equiv G (mod R0) and deg Vrminus1 = d0 minus drminus2 lt d0 minus drminus1 = dminus degG soV = Vrminus1`rminus1

Case 1 drminus1 ge 1 Part C of Theorem C2 applies to n = 2dminus 1 since n = 2d0 minus 1 ge2d0 minus 2drminus1 In particular δn = 1 + n minus (2d0 minus 2drminus1) = 2drminus1 = 2 degG f2dminus1 =a2dminus1x

drminus1Rrminus1(1x) and v2dminus1 = a2dminus1(1x)d0minusdrminus1minus1Vrminus1(1x)Case 2 drminus1 = 0 Then r ge 2 and drminus2 ge 1 Part B of Theorem C2 applies to

i = r minus 2 and n = 2dminus 1 = 2d0 minus 1 since 2d0 minus di minus di+1 = 2d0 minus drminus2 le 2d0 minus 1 = n lt2d0 = 2d0 minus 2di+1 In particular δn = 1 + n minus (2d0 minus 2di+1) = 2drminus1 = 2 degG againf2dminus1 = a2dminus1x

drminus1Rrminus1(1x) and again v2dminus1 = a2dminus1(1x)d0minusdrminus1minus1Vrminus1(1x)In both cases f2dminus1(0) = a2dminus1`rminus1 so xdegGf2dminus1(1x)f2dminus1(0) = Rrminus1`rminus1 = G

and xminusd+1+degGv2dminus1(1x)f2dminus1(0) = Vrminus1`rminus1 = V

E A gcd algorithm for integersThis appendix presents a variable-time algorithm to compute the gcd of two integersThis appendix also explains how a modification to the algorithm produces the StehleacutendashZimmermann algorithm [62]

Theorem E1 Let e be a positive integer Let f be an element of Zlowast2 Let g be anelement of 2eZlowast2 Then there is a unique element q isin

1 3 5 2e+1 minus 1

such that

qg2e minus f isin 2e+1Z2

46 Fast constant-time gcd computation and modular inversion

We define f div2 g = q2e isin Q and we define f mod2 g = f minus qg2e isin 2e+1Z2 Thenf = (f div2 g)g + (f mod2 g) Note that e = ord2 g so we are not creating any ambiguityby omitting e from the f div2 g and f mod2 g notation

Proof First note that the quotient f(g2e) is odd Define q = (f(g2e)) mod 2e+1 thenq isin

1 3 5 2e+1 and qg2e minus f isin 2e+1Z2 by construction To see uniqueness note

that if qprime isin

1 3 5 2e+1 minus 1also satisfies qprimeg2e minus f isin 2e+1Z2 then (q minus qprime)g2e isin

2e+1Z2 so q minus qprime isin 2e+1Z2 since g2e isin Zlowast2 so q mod 2e+1 = qprime mod 2e+1 so q = qprime

Theorem E2 Let R0 be a nonzero element of Z2 Let R1 be an element of 2Z2Recursively define e0 e1 e2 e3 isin Z and R2 R3 isin 2Z2 as follows

bull If i ge 0 and Ri 6= 0 then ei = ord2 Ri

bull If i ge 1 and Ri 6= 0 then Ri+1 = minus((Riminus12eiminus1)mod2 Ri)2ei isin 2Z2

bull If i ge 1 and Ri = 0 then ei Ri+1 ei+1 Ri+2 ei+2 are undefined

Then (Ri2ei

Ri+1

)=(

0 12ei

minus12ei ((Riminus12eiminus1)div2 Ri)2ei

)(Riminus12eiminus1

Ri

)for each i ge 1 where Ri+1 is defined

Proof We have (Riminus12eiminus1)mod2 Ri = Riminus12eiminus1 minus ((Riminus12eiminus1)div2 Ri)Ri Negateand divide by 2ei to obtain the matrix equation

Theorem E3 Let R0 be an odd element of Z Let R1 be an element of 2Z Definee0 e1 e2 e3 and R2 R3 as in Theorem E2 Then Ri isin 2Z for each i ge 1 whereRi is defined Furthermore if t ge 0 and Rt+1 = 0 then |Rt2et | = gcdR0 R1

Proof Write g = gcdR0 R1 isin Z Then g is odd since R0 is oddWe first show by induction on i that Ri isin gZ for each i ge 0 For i isin 0 1 this follows

from the definition of g For i ge 2 we have Riminus2 isin gZ and Riminus1 isin gZ by the inductivehypothesis Now (Riminus22eiminus2)div2 Riminus1 has the form q2eiminus1 for q isin Z by definition ofdiv2 and 22eiminus1+eiminus2Ri = 2eiminus2qRiminus1 minus 2eiminus1Riminus2 by definition of Ri The right side ofthis equation is in gZ so 2middotmiddotmiddotRi is in gZ This implies first that Ri isin Q but also Ri isin Z2so Ri isin Z This also implies that Ri isin gZ since g is odd

By construction Ri isin 2Z2 for each i ge 1 and Ri isin Z for each i ge 0 so Ri isin 2Z foreach i ge 1

Finally assume that t ge 0 and Rt+1 = 0 Then all of R0 R1 Rt are nonzero Writegprime = |Rt2et | Then 2etgprime = |Rt| isin gZ and g is odd so gprime isin gZ

The equation 2eiminus1Riminus2 = 2eiminus2qRiminus1minus22eiminus1+eiminus2Ri shows that if Riminus1 Ri isin gprimeZ thenRiminus2 isin gprimeZ since gprime is odd By construction Rt Rt+1 isin gprimeZ so Rtminus1 Rtminus2 R1 R0 isingprimeZ Hence g = gcdR0 R1 isin gprimeZ Both g and gprime are positive so gprime = g

E4 The gcd algorithm Here is the algorithm featured in this appendixGiven an odd integer R0 and an even integer R1 compute the even integers R2 R3

defined above In other words for each i ge 1 where Ri 6= 0 compute ei = ord2 Rifind q isin

1 3 5 2ei+1 minus 1

such that qRi2ei minus Riminus12eiminus1 isin 2ei+1Z and compute

Ri+1 = (qRi2ei minusRiminus12eiminus1)2ei isin 2Z If Rt+1 = 0 for some t ge 0 output |Rt2et | andstop

We will show in Theorem F26 that this algorithm terminates The output is gcdR0 R1by Theorem E3 This algorithm is closely related to iterating divstep and we will usethis relationship to analyze iterates of divstep see Appendix G

More generally if R0 is an odd integer and R1 is any integer one can computegcdR0 R1 by using the above algorithm to compute gcdR0 2R1 Even more generally

Daniel J Bernstein and Bo-Yin Yang 47

one can compute the gcd of any two integers by first handling the case gcd0 0 = 0 thenhandling powers of 2 in both inputs then applying the odd-input algorithmE5 A non-functioning variant Imagine replacing the subtractions above withadditions find q isin

1 3 5 2ei+1 minus 1

such that qRi2ei +Riminus12eiminus1 isin 2ei+1Z and

compute Ri+1 = (qRi2ei + Riminus12eiminus1)2ei isin 2Z If R0 and R1 are positive then allRi are positive so the algorithm does not terminate This non-terminating algorithm isrelated to posdivstep (see Section 84) in the same way that the algorithm above is relatedto divstep

Daireaux Maume-Deschamps and Valleacutee in [31 Section 6] mentioned this algorithm asa ldquonon-centered LSB algorithmrdquo said that one can easily add a ldquosupplementary stoppingconditionrdquo to ensure termination and dismissed the resulting algorithm as being ldquocertainlyslower than the centered versionrdquoE6 A centered variant the StehleacutendashZimmermann algorithm Imagine usingcentered remainders 1minus 2e minus3minus1 1 3 2e minus 1 modulo 2e+1 rather than uncen-tered remainders

1 3 5 2e+1 minus 1

The above gcd algorithm then turns into the

StehleacutendashZimmermann gcd algorithm from [62]Specifically Stehleacute and Zimmermann begin with integers a b having ord2 a lt ord2 b

If b 6= 0 then ldquogeneralized binary divisionrdquo in [62 Lemma 1] defines q as the unique oddinteger with |q| lt 2ord2 bminusord2 a such that r = a+ qb2ord2 bminusord2 a is divisible by 2ord2 b+1The gcd algorithm in [62 Algorithm Elementary-GB] repeatedly sets (a b)larr (b r) Whenb = 0 the algorithm outputs the odd part of a

It is equivalent to set (a b)larr (b2ord2 b r2ord2 b) since this simply divides all subse-quent results by 2ord2 b In other words starting from an odd a0 and an even b0 recursivelydefine an odd ai+1 and an even bi+1 by the following formulas stopping when bi = 0bull ai+1 = bi2e where e = ord2 bi

bull bi+1 = (ai + qbi2e)2e isin 2Z for a unique q isin 1minus 2e minus3minus1 1 3 2e minus 1Now relabel a0 b0 b1 b2 b3 as R0 R1 R2 R3 R4 to obtain the following recursivedefinition Ri+1 = (qRi2ei +Riminus12eiminus1)2ei isin 2Z and thus(

Ri2ei

Ri+1

)=(

0 12ei

12ei q22ei

)(Riminus12eiminus1

Ri

)

for a unique q isin 1minus 2ei minus3minus1 1 3 2ei minus 1To summarize starting from the StehleacutendashZimmermann algorithm one can obtain the

gcd algorithm featured in this appendix by making the following three changes Firstdivide out 2ei instead of allowing powers of 2 to accumulate Second add Riminus12eiminus1

instead of subtracting it this does not affect the number of iterations for centered q Thirdswitch from centered q to uncentered q

Intuitively centered remainders are smaller than uncentered remainders and it iswell known that centered remainders improve the worst-case performance of Euclidrsquosalgorithm so one might expect that the StehleacutendashZimmermann algorithm performs betterthan our uncentered variant However as noted earlier our analysis produces the oppositeconclusion See Appendix F

F Performance of the gcd algorithm for integersThe possible matrices in Theorem E2 are(

0 12minus12 14

)

(0 12minus12 34

)for ei = 1(

0 14minus14 116

)

(0 14minus14 316

)

(0 14minus14 516

)

(0 14minus14 716

)for ei = 2

48 Fast constant-time gcd computation and modular inversion

etc In this appendix we show that a product of these matrices shrinks the input by a factorexponential in the weight

sumei This implies that

sumei in the algorithm of Appendix E is

O(b) where b is the number of input bits

F1 Lower bounds It is easy to see that each of these matrices has two eigenvaluesof absolute value 12ei This does not imply however that a weight-w product of thesematrices shrinks the input by a factor 12w Concretely the weight-7 product

147

(328 minus380minus158 233

)=(

0 14minus14 116

)(0 12minus12 34

)2( 0 12minus12 14

)(0 12minus12 34

)2

of 6 matrices has spectral radius (maximum absolute value of eigenvalues)

(561 +radic

249185)215 = 003235425826 = 061253803919 7 = 124949900586

The (w7)th power of this matrix is expressed as a weight-w product and has spectralradius 1207071286552w Consequently this proof strategy cannot obtain an O constantbetter than 7(15minus log2(561 +

radic249185)) = 1414169815 We prove a bound twice the

weight on the number of divstep iterations (see Appendix G) so this proof strategy cannotobtain an O constant better than 14(15 minus log2(561 +

radic249185)) = 2828339631 for

the number of divstep iterationsRotating the list of 6 matrices shown above gives 6 weight-7 products with the same

spectral radius In a small search we did not find any weight-w matrices with spectralradius above 061253803919 w Our upper bounds (see below) are 06181640625w for alllarge w

For comparison the non-terminating variant in Appendix E5 allows the matrix(0 12

12 34

) which has spectral radius 1 with eigenvector

(12

) The StehleacutendashZimmermann

algorithm reviewed in Appendix E6 allows the matrix(

0 1212 14

) which has spectral

radius (1 +radic

17)8 = 06403882032 so this proof strategy cannot obtain an O constantbetter than 2(log2(

radic17minus 1)minus 1) = 3110510062 for the number of cdivstep iterations

F2 A warning about worst-case inputs DefineM as the matrix 4minus7(

328 minus380minus158 233

)shown above Then Mminus3 =

(60321097 11336806047137246 88663112

) Applying our gcd algorithm to

inputs (60321097 47137246) applies a series of 18 matrices of total weight 21 in the

notation of Theorem E2 the matrix(

0 12minus12 34

)twice then the matrix

(0 12minus12 14

)

then the matrix(

0 12minus12 34

)twice then the weight-2 matrix

(0 14minus14 116

) all of

these matrices again and all of these matrices a third timeOne might think that these are worst-case inputs to our algorithm analogous to

a standard matrix construction of Fibonacci numbers as worst-case inputs to Euclidrsquosalgorithm We emphasize that these are not worst-case inputs the inputs (22293minus128330)are much smaller and nevertheless require 19 steps of total weight 23 We now explainwhy the analogy fails

The matrix M3 has an eigenvector (1minus053 ) with eigenvalue 1221middot0707 =12148 and an eigenvector (1 078 ) with eigenvalue 1221middot129 = 12271 so ithas spectral radius around 12148 Our proof strategy thus cannot guarantee a gainof more than 148 bits from weight 21 What we actually show below is that weight21 is guaranteed to gain more than 1397 bits (The quantity α21 in Figure F14 is133569231 = 2minus13972)

Daniel J Bernstein and Bo-Yin Yang 49

The matrix Mminus3 has spectral radius 2271 It is not surprising that each column ofMminus3 is close to 27 bits in size the only way to avoid this would be for the other eigenvectorto be pointing in almost exactly the same direction as (1 0) or (0 1) For our inputs takenfrom the first column of Mminus3 the algorithm gains nearly 27 bits from weight 21 almostdouble the guaranteed gain In short these are good inputs not bad inputs

Now replace M with the matrix F =(

0 11 minus1

) which reduces worst-case inputs to

Euclidrsquos algorithm The eigenvalues of F are (radic

5minus 1)2 = 12069 and minus(radic

5 + 1)2 =minus2069 The entries of powers of Fminus1 are Fibonacci numbers again with sizes wellpredicted by the spectral radius of Fminus1 namely 2069 However the spectral radius ofF is larger than 1 The reason that this large eigenvalue does not spoil the terminationof Euclidrsquos algorithm is that Euclidrsquos algorithm inspects sizes and chooses quotients todecrease sizes at each step

Stehleacute and Zimmermann in [62 Section 62] measure the performance of their algorithm

for ldquoworst caserdquo inputs18 obtained from negative powers of the matrix(

0 1212 14

) The

statement that these inputs are ldquoworst caserdquo is not justified On the contrary if these inputshave b bits then they take (073690 + o(1))b steps of the StehleacutendashZimmermann algorithmand thus (14738 + o(1))b cdivsteps which is considerably fewer steps than manyother inputs that we have tried19 for various values of b The quantity 073690 here islog(2) log((

radic17 + 1)2) coming from the smaller eigenvalue minus2(

radic17 + 1) = minus039038

of the above matrix

F3 Norms We use the standard definitions of the 2-norm of a real vector and the2-norm of a real matrix We summarize the basic theory of 2-norms here to keep thispaper self-contained

Definition F4 If v isin R2 then |v|2 is defined asradicv2

1 + v22 where v = (v1 v2)

Definition F5 If P isinM2(R) then |P |2 is defined as max|Pv|2 v isin R2 |v|2 = 1

This maximum exists becausev isin R2 |v|2 = 1

is a compact set (namely the unit

circle) and |Pv|2 is a continuous function of v See also Theorem F11 for an explicitformula for |P |2

Beware that the literature sometimes uses the notation |P |2 for the entrywise 2-normradicP 2

11 + P 212 + P 2

21 + P 222 We do not use entrywise norms of matrices in this paper

Theorem F6 Let v be an element of Z2 If v 6= 0 then |v|2 ge 1

Proof Write v as (v1 v2) with v1 v2 isin Z If v1 6= 0 then v21 ge 1 so v2

1 + v22 ge 1 If v2 6= 0

then v22 ge 1 so v2

1 + v22 ge 1 Either way |v|2 =

radicv2

1 + v22 ge 1

Theorem F7 Let v be an element of R2 Let r be an element of R Then |rv|2 = |r||v|2

Proof Write v as (v1 v2) with v1 v2 isin R Then rv = (rv1 rv2) so |rv|2 =radicr2v2

1 + r2v22 =radic

r2radicv2

1 + v22 = |r||v|2

18Specifically Stehleacute and Zimmermann claim that ldquothe worst case of the binary variantrdquo (ie of theirgcd algorithm) is for inputs ldquoGn and 2Gnminus1 where G0 = 0 G1 = 1 Gn = minusGnminus1 + 4Gnminus2 which givesall binary quotients equal to 1rdquo Daireaux Maume-Deschamps and Valleacutee in [31 Section 23] state thatStehleacute and Zimmermann ldquoexhibited the precise worst-case number of iterationsrdquo and give an equivalentcharacterization of this case

19For example ldquoGnrdquo from [62] is minus5467345 for n = 18 and 2135149 for n = 17 The inputs (minus5467345 2 middot2135149) use 17 iterations of the ldquoGBrdquo divisions from [62] while the smaller inputs (minus1287513 2middot622123) use24 ldquoGBrdquo iterations The inputs (1minus5467345 2135149) use 34 cdivsteps the inputs (1minus2718281 3141592)use 45 cdivsteps the inputs (1minus1287513 622123) use 58 cdivsteps

50 Fast constant-time gcd computation and modular inversion

Theorem F8 Let P be an element of M2(R) Let r be an element of R Then|rP |2 = |r||P |2

Proof By definition |rP |2 = max|rPv|2 v isin R2 |v|2 = 1

By Theorem F7 this is the

same as max|r||Pv|2 = |r|max|Pv|2 = |r||P |2

Theorem F9 Let P be an element of M2(R) Let v be an element of R2 Then|Pv|2 le |P |2|v|2

Proof If v = 0 then Pv = 0 and |Pv|2 = 0 = |P |2|v|2Otherwise |v|2 6= 0 Write w = v|v|2 Then |w|2 = 1 so |Pw|2 le |P |2 so |Pv|2 =

|Pw|2|v|2 le |P |2|v|2

Theorem F10 Let PQ be elements of M2(R) Then |PQ|2 le |P |2|Q|2

Proof If v isin R2 and |v|2 = 1 then |PQv|2 le |P |2|Qv|2 le |P |2|Q|2|v|2

Theorem F11 Let P be an element of M2(R) Assume that P lowastP =(a bc d

)where P lowast

is the transpose of P Then |P |2 =radic

a+d+radic

(aminusd)2+4b2

2

Proof P lowastP isin M2(R) is its own transpose so by the spectral theorem it has twoorthonormal eigenvectors with real eigenvalues In other words R2 has a basis e1 e2 suchthat

bull |e1|2 = 1

bull |e2|2 = 1

bull the dot product elowast1e2 is 0 (here we view vectors as column matrices)

bull P lowastPe1 = λ1e1 for some λ1 isin R and

bull P lowastPe2 = λ2e2 for some λ2 isin R

Any v isin R2 with |v|2 = 1 can be written uniquely as c1e1 + c2e2 for some c1 c2 isin R withc2

1 + c22 = 1 explicitly c1 = vlowaste1 and c2 = vlowaste2 Then P lowastPv = λ1c1e1 + λ2c2e2

By definition |P |22 is the maximum of |Pv|22 over v isin R2 with |v|2 = 1 Note that|Pv|22 = (Pv)lowast(Pv) = vlowastP lowastPv = λ1c

21 + λ2c

22 with c1 c2 defined as above For example

0 le |Pe1|22 = λ1 and 0 le |Pe2|22 = λ2 Thus |Pv|22 achieves maxλ1 λ2 It cannot belarger than this if c2

1 +c22 = 1 then λ1c

21 +λ2c

22 le maxλ1 λ2 Hence |P |22 = maxλ1 λ2

To finish the proof we make the spectral theorem explicit for the matrix P lowastP =(a bc d

)isinM2(R) Note that (P lowastP )lowast = P lowastP so b = c The characteristic polynomial of

this matrix is (xminus a)(xminus d)minus b2 = x2 minus (a+ d)x+ adminus b2 with roots

a+ dplusmnradic

(a+ d)2 minus 4(adminus b2)2 =

a+ dplusmnradic

(aminus d)2 + 4b2

2

The largest eigenvalue of P lowastP is thus (a+ d+radic

(aminus d)2 + 4b2)2 and |P |2 is the squareroot of this

Theorem F12 Let P be an element of M2(R) Assume that P lowastP =(a bc d

)where P lowast

is the transpose of P Let N be an element of R Then N ge |P |2 if and only if N ge 02N2 ge a+ d and (2N2 minus aminus d)2 ge (aminus d)2 + 4b2

Daniel J Bernstein and Bo-Yin Yang 51

earlybounds=011126894912^2037794112^2148808332^2251652192^206977232^2078823132^2483067332^239920452^22104392132^25112816812^25126890072^27138243032^28142578172^27156342292^29163862452^29179429512^31185834332^31197136532^32204328912^32211335692^31223282932^33238004212^35244892332^35256040592^36267388892^37271122152^35282767752^3729849732^36308292972^40312534432^39326254052^4133956252^39344650552^42352865672^42361759512^42378586372^4538656472^4239404692^4240247512^42412409172^46425934112^48433643372^48448890152^50455437912^5046418992^47472050052^50489977912^53493071912^52507544232^5451575272^51522815152^54536940732^56542122492^55552582732^56566360932^58577810812^59589529592^60592914752^59607185992^61618789972^62625348212^62633292852^62644043412^63659866332^65666035532^65

defalpha(w)ifwgt=67return(6331024)^wreturnearlybounds[w]

assertall(alpha(w)^49lt2^(-(34w-23))forwinrange(31100))assertmin((6331024)^walpha(w)forwinrange(68))==633^5(2^30165219)

Figure F14 Definition of αw for w isin 0 1 2 computer verificationthat αw lt 2minus(34wminus23)49 for 31 le w le 99 and computer verification thatmin(6331024)wαw 0 le w le 67 = 6335(230165219) If w ge 67 then αw =(6331024)w

Our upper-bound computations work with matrices with rational entries We useTheorem F12 to compare the 2-norms of these matrices to various rational bounds N Allof the computations here use exact arithmetic avoiding the correctness questions thatwould have shown up if we had used approximations to the square roots in Theorem F11Sage includes exact representations of algebraic numbers such as |P |2 but switching toTheorem F12 made our upper-bound computations an order of magnitude faster

Proof Assume thatN ge 0 that 2N2 ge a+d and that (2N2minusaminusd)2 ge (aminusd)2+4b2 Then2N2minusaminusd ge

radic(aminus d)2 + 4b2 since 2N2minusaminusd ge 0 so N2 ge (a+d+

radic(aminus d)2 + 4b2)2

so N geradic

(a+ d+radic

(aminus d)2 + 4b2)2 since N ge 0 so N ge |P |2 by Theorem F11Conversely assume that N ge |P |2 Then N ge 0 Also N2 ge |P |22 so 2N2 ge

2|P |22 = a + d +radic

(aminus d)2 + 4b2 by Theorem F11 In particular 2N2 ge a + d and(2N2 minus aminus d)2 ge (aminus d)2 + 4b2

F13 Upper bounds We now show that every weight-w product of the matrices inTheorem E2 has norm at most αw Here α0 α1 α2 is an exponentially decreasingsequence of positive real numbers defined in Figure F14

Definition F15 For e isin 1 2 and q isin

1 3 5 2e+1 minus 1 define Meq =(

0 12eminus12e q22e

)

52 Fast constant-time gcd computation and modular inversion

Theorem F16 |Meq|2 lt (1 +radic

2)2e if e isin 1 2 and q isin

1 3 5 2e+1 minus 1

Proof Define(a bc d

)= MlowasteqMeq Then a = 122e b = c = minusq23e and d = 122e +

q224e so a+ d = 222e + q224e lt 622e and (aminus d)2 + 4b2 = 4q226e + q428e le 3224eBy Theorem F12 |Meq|2 =

radic(a+ d+

radic(aminus d)2 + 4b2)2 le

radic(622e +

radic3222e)2 =

(1 +radic

2)2e

Definition F17 Define βw = infαw+jαj j ge 0 for w isin 0 1 2

Theorem F18 If w isin 0 1 2 then βw = minαw+jαj 0 le j le 67 If w ge 67then βw = (6331024)w6335(230165219)

The first statement reduces the computation of βw to a finite computation Thecomputation can be further optimized but is not a bottleneck for us Similar commentsapply to γwe in Theorem F20

Proof By definition αw = (6331024)w for all w ge 67 If j ge 67 then also w + j ge 67 soαw+jαj = (6331024)w+j(6331024)j = (6331024)w In short all terms αw+jαj forj ge 67 are identical so the infimum of αw+jαj for j ge 0 is the same as the minimum for0 le j le 67

If w ge 67 then αw+jαj = (6331024)w+jαj The quotient (6331024)jαj for0 le j le 67 has minimum value 6335(230165219)

Definition F19 Define γwe = infβw+j2j70169 j ge e

for w isin 0 1 2 and

e isin 1 2 3

This quantity γwe is designed to be as large as possible subject to two conditions firstγwe le βw+e2e70169 used in Theorem F21 second γwe le γwe+1 used in Theorem F22

Theorem F20 γwe = minβw+j2j70169 e le j le e+ 67

if w isin 0 1 2 and

e isin 1 2 3 If w + e ge 67 then γwe = 2e(6331024)w+e(70169)6335(230165219)

Proof If w + j ge 67 then βw+j = (6331024)w+j6335(230165219) so βw+j2j70169 =(6331024)w(633512)j(70169)6335(230165219) which increases monotonically with j

In particular if j ge e+ 67 then w + j ge 67 so βw+j2j70169 ge βw+e+672e+6770169Hence γwe = inf

βw+j2j70169 j ge e

= min

βw+j2j70169 e le j le e+ 67

Also if

w + e ge 67 then w + j ge 67 for all j ge e so

γwe =(

6331024

)w (633512

)e 70169

6335

230165219 = 2e(

6331024

)w+e 70169

6335

230165219

as claimed

Theorem F21 Define S as the smallest subset of ZtimesM2(Q) such that

bull(0(

1 00 1))isin S and

bull if (wP ) isin S and w = 0 or |P |2 gt βw and e isin 1 2 3 and |P |2 gt γwe andq isin

1 3 5 2e+1 minus 1

then (e+ wM(e q)P ) isin S

Assume that |P |2 le αw for each (wP ) isin S Then |M(ej qj) middot middot middotM(e1 q1)|2 le αe1+middotmiddotmiddot+ej

for all j isin 0 1 2 all e1 ej isin 1 2 3 and all q1 qj isin Z such thatqi isin

1 3 5 2ei+1 minus 1

for each i

Daniel J Bernstein and Bo-Yin Yang 53

The idea here is that instead of searching through all products P of matrices M(e q)of total weight w to see whether |P |2 le αw we prune the search in a way that makesthe search finite across all w (see Theorem F22) while still being sure to find a minimalcounterexample if one exists The first layer of pruning skips all descendants QP of P ifw gt 0 and |P |2 le βw the point is that any such counterexample QP implies a smallercounterexample Q The second layer of pruning skips weight-(w + e) children M(e q)P ofP if |P |2 le γwe the point is that any such child is eliminated by the first layer of pruning

Proof Induct on jIf j = 0 then M(ej qj) middot middot middotM(e1 q1) =

(1 00 1) with norm 1 = α0 Assume from now on

that j ge 1Case 1 |M(ei qi) middot middot middotM(e1 q1)|2 le βe1+middotmiddotmiddot+ei

for some i isin 1 2 jThen j minus i lt j so |M(ej qj) middot middot middotM(ei+1 qi+1)|2 le αei+1+middotmiddotmiddot+ej by the inductive

hypothesis so |M(ej qj) middot middot middotM(e1 q1)|2 le βe1+middotmiddotmiddot+eiαei+1+middotmiddotmiddot+ej by Theorem F10 butβe1+middotmiddotmiddot+ei

αei+1+middotmiddotmiddot+ejle αe1+middotmiddotmiddot+ej

by definition of βCase 2 |M(ei qi) middot middot middotM(e1 q1)|2 gt βe1+middotmiddotmiddot+ei for every i isin 1 2 jSuppose that |M(eiminus1 qiminus1) middot middot middotM(e1 q1)|2 le γe1+middotmiddotmiddot+eiminus1ei

for some i isin 1 2 jBy Theorem F16 |M(ei qi)|2 lt (1 +

radic2)2ei lt (16970)2ei By Theorem F10

|M(ei qi) middot middot middotM(e1 q1)|2 le |M(ei qi)|2γe1+middotmiddotmiddot+eiminus1ei le169γe1+middotmiddotmiddot+eiminus1ei

70 middot 2ei+1 le βe1+middotmiddotmiddot+ei

contradiction Thus |M(eiminus1 qiminus1) middot middot middotM(e1 q1)|2 gt γe1+middotmiddotmiddot+eiminus1eifor all i isin 1 2 j

Now (e1 + middot middot middot + eiM(ei qi) middot middot middotM(e1 q1)) isin S for each i isin 0 1 2 j Indeedinduct on i The base case i = 0 is simply

(0(

1 00 1))isin S If i ge 1 then (wP ) isin S by

the inductive hypothesis where w = e1 + middot middot middot+ eiminus1 and P = M(eiminus1 qiminus1) middot middot middotM(e1 q1)Furthermore

bull w = 0 (when i = 1) or |P |2 gt βw (when i ge 2 by definition of this case)

bull ei isin 1 2 3

bull |P |2 gt γwei(as shown above) and

bull qi isin

1 3 5 2ei+1 minus 1

Hence (e1 + middot middot middot+ eiM(ei qi) middot middot middotM(e1 q1)) = (ei + wM(ei qi)P ) isin S by definition of SIn particular (wP ) isin S with w = e1 + middot middot middot + ej and P = M(ej qj) middot middot middotM(e1 q1) so

|P |2 le αw by hypothesis

Theorem F22 In Theorem F21 S is finite and |P |2 le αw for each (wP ) isin S

Proof This is a computer proof We run the Sage script in Figure F23 We observe thatit terminates successfully (in slightly over 2546 minutes using Sage 86 on one core of a35GHz Intel Xeon E3-1275 v3 CPU) and prints 3787975117

What the script does is search depth-first through the elements of S verifying thateach (wP ) isin S satisfies |P |2 le αw The output is the number of elements searched so Shas at most 3787975117 elements

Specifically if (wP ) isin S then verify(wP) checks that |P |2 le αw and recursivelycalls verify on all descendants of (wP ) Actually for efficiency the script scales eachmatrix to have integer entries it works with 22eM(e q) (labeled scaledM(eq) in thescript) instead of M(e q) and it works with 4wP (labeled P in the script) instead of P

The script computes βw as shown in Theorem F18 computes γw as shown in Theo-rem F20 and compares |P |2 to various bounds as shown in Theorem F12

If w gt 0 and |P |2 le βw then verify(wP) returns There are no descendants of (wP )in this case Also |P |2 le αw since βw le αwα0 = αw

54 Fast constant-time gcd computation and modular inversion

fromalphaimportalphafrommemoizedimportmemoized

R=MatrixSpace(ZZ2)defscaledM(eq)returnR((02^e-2^eq))

memoizeddefbeta(w)returnmin(alpha(w+j)alpha(j)forjinrange(68))

memoizeddefgamma(we)returnmin(beta(w+j)2^j70169forjinrange(ee+68))

defspectralradiusisatmost(PPN)assumingPPhasformPtranspose()P(ab)(cd)=PPX=2N^2-a-dreturnNgt=0andXgt=0andX^2gt=(a-d)^2+4b^2

defverify(wP)nodes=1PP=Ptranspose()Pifwgt0andspectralradiusisatmost(PP4^wbeta(w))returnnodesassertspectralradiusisatmost(PP4^walpha(w))foreinPositiveIntegers()ifspectralradiusisatmost(PP4^wgamma(we))returnnodesforqinrange(12^(e+1)2)nodes+=verify(e+wscaledM(eq)P)

printverify(0R(1))

Figure F23 Sage script verifying |P |2 le αw for each (wP ) isin S Output is number ofpairs (wP ) considered if verification succeeds or an assertion failure if verification fails

Otherwise verify(wP) asserts that |P |2 le αw ie it checks that |P |2 le αw andraises an exception if not It then tries successively e = 1 e = 2 e = 3 etc continuingas long as |P |2 gt γwe For each e where |P |2 gt γwe verify(wP) recursively handles(e+ wM(e q)P ) for each q isin

1 3 5 2e+1 minus 1

Note that γwe le γwe+1 le middot middot middot so if |P |2 le γwe then also |P |2 le γwe+1 etc This iswhere it is important that γwe is not simply defined as βw+e2e70169

To summarize verify(wP) recursively enumerates all descendants of (wP ) Henceverify(0R(1)) recursively enumerates all elements (wP ) of S checking |P |2 le αw foreach (wP )

Theorem F24 If j isin 0 1 2 e1 ej isin 1 2 3 q1 qj isin Z and qi isin1 3 5 2ei+1 minus 1

for each i then |M(ej qj) middot middot middotM(e1 q1)|2 le αe1+middotmiddotmiddot+ej

Proof This is the conclusion of Theorem F21 so it suffices to check the assumptionsDefine S as in Theorem F21 then |P |2 le αw for each (wP ) isin S by Theorem F22

Daniel J Bernstein and Bo-Yin Yang 55

Theorem F25 In the situation of Theorem E3 assume that R0 R1 R2 Rj+1 aredefined Then |(Rj2ej Rj+1)|2 le αe1+middotmiddotmiddot+ej

|(R0 R1)|2 and if e1 + middot middot middot + ej ge 67 thene1 + middot middot middot+ ej le log1024633 |(R0 R1)|2

If R0 and R1 are each bounded by 2b in absolute value then |(R0 R1)|2 is bounded by2b+05 so log1024633 |(R0 R1)|2 le (b+ 05) log2(1024633) = (b+ 05)(14410 )

Proof By Theorem E2 (Ri2ei

Ri+1

)= M(ei qi)

(Riminus12eiminus1

Ri

)where qi = 2ei ((Riminus12eiminus1)div2 Ri) isin

1 3 5 2ei+1 minus 1

Hence(

Rj2ej

Rj+1

)= M(ej qj) middot middot middotM(e1 q1)

(R0R1

)

The product M(ej qj) middot middot middotM(e1 q1) has 2-norm at most αw where w = e1 + middot middot middot+ ej so|(Rj2ej Rj+1)|2 le αw|(R0 R1)|2 by Theorem F9

By Theorem F6 |(Rj2ej Rj+1)|2 ge 1 If e1 + middot middot middot + ej ge 67 then αe1+middotmiddotmiddot+ej =(6331024)e1+middotmiddotmiddot+ej so 1 le (6331024)e1+middotmiddotmiddot+ej |(R0 R1)|2

Theorem F26 In the situation of Theorem E3 there exists t ge 0 such that Rt+1 = 0

Proof Suppose not Then R0 R1 Rj Rj+1 are all defined for arbitrarily large j Takej ge 67 with j gt log1024633 |(R0 R1)|2 Then e1 + middot middot middot + ej ge 67 so j le e1 + middot middot middot + ej lelog1024633 |(R0 R1)|2 lt j by Theorem F25 contradiction

G Proof of main gcd theorem for integersThis appendix shows how the gcd algorithm from Appendix E is related to iteratingdivstep This appendix then uses the analysis from Appendix F to put bounds on thenumber of divstep iterations needed

Theorem G1 (2-adic division as a sequence of division steps) In the situation ofTheorem E1 divstep2e(1 f g2) = (1 g2e (qg2e minus f)2e+1)

In other words divstep2e(1 f g2) = (1 g2eminus(f mod2 g)2e+1)

Proof Defineze = (g2e minus f)2 isin Z2

ze+1 = (ze + (ze mod 2)g2e)2 isin Z2

ze+2 = (ze+1 + (ze+1 mod 2)g2e)2 isin Z2

z2e = (z2eminus1 + (z2eminus1 mod 2)g2e)2 isin Z2

Thenze = (g2e minus f)2

ze+1 = ((1 + 2(ze mod 2))g2e minus f)4ze+2 = ((1 + 2(ze mod 2) + 4(ze+1 mod 2))g2e minus f)8

z2e = ((1 + 2(ze mod 2) + middot middot middot+ 2e(z2eminus1 mod 2))g2e minus f)2e+1

56 Fast constant-time gcd computation and modular inversion

In short z2e = (qg2e minus f)2e+1 where

qprime = 1 + 2(ze mod 2) + middot middot middot+ 2e(z2eminus1 mod 2) isin

1 3 5 2e+1 minus 1

Hence qprimeg2eminusf = 2e+1z2e isin 2e+1Z2 so qprime = q by the uniqueness statement in Theorem E1Note that this construction gives an alternative proof of the existence of q in Theorem E1

All that remains is to prove divstep2e(1 f g2) = (1 g2e (qg2e minus f)2e+1) iedivstep2e(1 f g2) = (1 g2e z2e)

If 1 le i lt e then g2i isin 2Z2 so divstep(i f g2i) = (i + 1 f g2i+1) By in-duction divstepeminus1(1 f g2) = (e f g2e) By assumption e gt 0 and g2e is odd sodivstep(e f g2e) = (1 minus e g2e (g2e minus f)2) = (1 minus e g2e ze) What remains is toprove that divstepe(1minus e g2e ze) = (1 g2e z2e)

If 0 le i le eminus1 then 1minuse+i le 0 so divstep(1minuse+i g2e ze+i) = (2minuse+i g2e (ze+i+(ze+i mod 2)g2e)2) = (2minus e+ i g2e ze+i+1) By induction divstepi(1minus e g2e ze) =(1minus e+ i g2e ze+i) for 0 le i le e In particular divstepe(1minus e g2e ze) = (1 g2e z2e)as claimed

Theorem G2 (2-adic gcd as a sequence of division steps) In the situation of Theo-rem E3 assume that R0 R1 Rj Rj+1 are defined Then divstep2w(1 R0 R12) =(1 Rj2ej Rj+12) where w = e1 + middot middot middot+ ej

Therefore if Rt+1 = 0 then divstepn(1 R0 R12) = (1 + nminus 2wRt2et 0) for alln ge 2w where w = e1 + middot middot middot+ et

Proof Induct on j If j = 0 then w = 0 so divstep2w(1 R0 R12) = (1 R0 R12) andR0 = R02e0 since R0 is odd by hypothesis

If j ge 1 then by definition Rj+1 = minus((Rjminus12ejminus1)mod2 Rj)2ej Furthermore

divstep2e1+middotmiddotmiddot+2ejminus1(1 R0 R12) = (1 Rjminus12ejminus1 Rj2)

by the inductive hypothesis so

divstep2e1+middotmiddotmiddot+2ej (1 R0 R12) = divstep2ej (1 Rjminus12ejminus1 Rj2) = (1 Rj2ej Rj+12)

by Theorem G1

Theorem G3 Let R0 be an odd element of Z Let R1 be an element of 2Z Then thereexists n isin 0 1 2 such that divstepn(1 R0 R12) isin Ztimes Ztimes 0 Furthermore anysuch n has divstepn(1 R0 R12) isin Ztimes GminusG times 0 where G = gcdR0 R1

Proof Define e0 e1 e2 e3 and R2 R3 as in Theorem E2 By Theorem F26 thereexists t ge 0 such that Rt+1 = 0 By Theorem E3 Rt2et isin GminusG By Theorem G2divstep2w(1 R0 R12) = (1 Rt2et 0) isin Ztimes GminusG times 0 where w = e1 + middot middot middot+ et

Conversely assume that n isin 0 1 2 and that divstepn(1 R0 R12) isin ZtimesZtimes0Write divstepn(1 R0 R12) as (δ f 0) Then divstepn+k(1 R0 R12) = (k + δ f 0) forall k isin 0 1 2 so in particular divstep2w+n(1 R0 R12) = (2w + δ f 0) Alsodivstep2w+k(1 R0 R12) = (k + 1 Rt2et 0) for all k isin 0 1 2 so in particu-lar divstep2w+n(1 R0 R12) = (n + 1 Rt2et 0) Hence f = Rt2et isin GminusG anddivstepn(1 R0 R12) isin Ztimes GminusG times 0 as claimed

Theorem G4 Let R0 be an odd element of Z Let R1 be an element of 2Z Defineb = log2

radicR2

0 +R21 Assume that b le 21 Then divstepb19b7c(1 R0 R12) isin ZtimesZtimes 0

Proof If divstep(δ f g) = (δ1 f1 g1) then divstep(δminusfminusg) = (δ1minusf1minusg1) Hencedivstepn(1 R0 R12) isin ZtimesZtimes0 if and only if divstepn(1minusR0minusR12) isin ZtimesZtimes0From now on assume without loss of generality that R0 gt 0

Daniel J Bernstein and Bo-Yin Yang 57

s Ws R0 R10 1 1 01 5 1 minus22 5 1 minus23 5 1 minus24 17 1 minus45 17 1 minus46 65 1 minus87 65 1 minus88 157 11 minus69 181 9 minus1010 421 15 minus1411 541 21 minus1012 1165 29 minus1813 1517 29 minus2614 2789 17 minus5015 3653 47 minus3816 8293 47 minus7817 8293 47 minus7818 24245 121 minus9819 27361 55 minus15620 56645 191 minus14221 79349 215 minus18222 132989 283 minus23023 213053 133 minus44224 415001 451 minus46025 613405 501 minus60226 1345061 719 minus91027 1552237 1021 minus71428 2866525 789 minus1498

s Ws R0 R129 4598269 1355 minus166230 7839829 777 minus269031 12565573 2433 minus257832 22372085 2969 minus368233 35806445 5443 minus248634 71013905 4097 minus736435 74173637 5119 minus692636 205509533 9973 minus1029837 226964725 11559 minus966238 487475029 11127 minus1907039 534274997 13241 minus1894640 1543129037 24749 minus3050641 1639475149 20307 minus3503042 3473731181 39805 minus4346643 4500780005 49519 minus4526244 9497198125 38051 minus8971845 12700184149 82743 minus7651046 29042662405 152287 minus7649447 36511782869 152713 minus11485048 82049276437 222249 minus18070649 90638999365 220417 minus20507450 234231548149 229593 minus42605051 257130797053 338587 minus37747852 466845672077 301069 minus61335453 622968125533 437163 minus65715854 1386285660565 600809 minus101257855 1876581808109 604547 minus122927056 4208169535453 1352357 minus1542498

Figure G5 For each s isin 0 1 56 Minimum R20 + R2

1 where R0 is an odd integerR1 is an even integer and (R0 R1) requires ges iterations of divstep Also an example of(R0 R1) reaching this minimum

The rest of the proof is a computer proof We enumerate all pairs (R0 R1) where R0is a positive odd integer R1 is an even integer and R2

0 + R21 le 242 For each (R0 R1)

we iterate divstep to find the minimum n ge 0 where divstepn(1 R0 R12) isin Ztimes Ztimes 0We then say that (R0 R1) ldquoneeds n stepsrdquo

For each s ge 0 define Ws as the smallest R20 +R2

1 where (R0 R1) needs ges steps Thecomputation shows that each (R0 R1) needs le56 steps and also shows the values Ws

for each s isin 0 1 56 We display these values in Figure G5 We check for eachs isin 0 1 56 that 214s leW 19

s Finally we claim that each (R0 R1) needs leb19b7c steps where b = log2

radicR2

0 +R21

If not then (R0 R1) needs ges steps where s = b19b7c + 1 We have s le 56 since(R0 R1) needs le56 steps We also have Ws le R2

0 + R21 by definition of Ws Hence

214s19 le R20 +R2

1 so 14s19 le 2b so s le 19b7 contradiction

Theorem G6 (Bounds on the number of divsteps required) Let R0 be an odd elementof Z Let R1 be an element of 2Z Define b = log2

radicR2

0 +R21 Define n = b19b7c

if b le 21 n = b(49b+ 23)17c if 21 lt b le 46 and n = b49b17c if b gt 46 Thendivstepn(1 R0 R12) isin Ztimes Ztimes 0

58 Fast constant-time gcd computation and modular inversion

All three cases have n le b(49b+ 23)17c since 197 lt 4917

Proof If b le 21 then divstepn(1 R0 R12) isin Ztimes Ztimes 0 by Theorem G4Assume from now on that b gt 21 This implies n ge b49b17c ge 60Define e0 e1 e2 e3 and R2 R3 as in Theorem E2 By Theorem F26 there

exists t ge 0 such that Rt+1 = 0 By Theorem E3 Rt2et isin GminusG By Theorem G2divstep2w(1 R0 R12) = (1 Rt2et 0) isin Ztimes GminusG times 0 where w = e1 + middot middot middot+ et

To finish the proof we will show that 2w le n Hence divstepn(1 R0 R12) isin ZtimesZtimes0as claimed

Suppose that 2w gt n Both 2w and n are integers so 2w ge n+ 1 ge 61 so w ge 31By Theorem F6 and Theorem F25 1 le |(Rt2et Rt+1)|2 le αw|(R0 R1)|2 = αw2b so

log2(1αw) le bCase 1 w ge 67 By definition 1αw = (1024633)w so w log2(1024633) le b We have

3449 lt log2(1024633) so 34w49 lt b so n + 1 le 2w lt 49b17 lt (49b + 23)17 Thiscontradicts the definition of n as the largest integer below 49b17 if b gt 46 and as thelargest integer below (49b+ 23)17 if b le 46

Case 2 w le 66 Then αw lt 2minus(34wminus23)49 since w ge 31 so b ge lg(1αw) gt(34w minus 23)49 so n + 1 le 2w le (49b + 23)17 This again contradicts the definitionof n if b le 46 If b gt 46 then n ge b49 middot 4617c = 132 so 2w ge n + 1 gt 132 ge 2wcontradiction

H Comparison to performance of cdivstepThis appendix briefly compares the analysis of divstep from Appendix G to an analogousanalysis of cdivstep

First modify Theorem E1 to take the unique element q isin plusmn1plusmn3 plusmn(2e minus 1)such that f + qg2e isin 2e+1Z2 The corresponding modification of Theorem G1 saysthat cdivstep2e(1 f g2) = (1 g2e (f + qg2e)2e+1) The proof proceeds identicallyexcept zj = (zjminus1 minus (zjminus1 mod 2)g2e)2 isin Z2 for e lt j lt 2e and z2e = (z2eminus1 +(z2eminus1 mod 2)g2e)2 isin Z2 Also we have q = minus1minus2(ze mod 2)minusmiddot middot middotminus2eminus1(z2eminus2 mod 2)+2e(z2eminus1 mod 2) isin plusmn1plusmn3 plusmn(2e minus 1) In the notation of [62] GB(f g) = (q r) wherer = f + qg2e = 2e+1z2e

The corresponding gcd algorithm is the centered algorithm from Appendix E6 withaddition instead of subtraction Recall that except for inessential scalings by powers of 2this is the same as the StehleacutendashZimmermann gcd algorithm from [62]

The corresponding set of transition matrices is different from the divstep case As men-tioned in Appendix F there are weight-w products with spectral radius 06403882032 wso the proof strategy for cdivstep cannot obtain weight-w norm bounds below this spectralradius This is worse than our bounds for divstep

Enumerating small pairs (R0 R1) as in Theorem G4 also produces worse results forcdivstep than for divstep See Figure H1

Daniel J Bernstein and Bo-Yin Yang 59

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bullbull

bull

bull

bull

bull

bull

bullbullbullbullbullbull

bull

bullbullbullbullbullbullbull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bullbull

bull

bullbull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

bull

minus3

minus2

minus1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21

Figure H1 For each s isin 1 2 56 red dot at (b sminus 28283396b) where b is minimumlog2

radicR2

0 +R21 where R0 is an odd integer R1 is an even integer and (R0 R1) requires ges

iterations of divstep For each s isin 1 2 58 blue dot at (b sminus 28283396b) where b isminimum log2

radicR2

0 +R21 where R0 is an odd integer R1 is an even integer and (R0 R1)

requires ges iterations of cdivstep

  • Introduction
  • Organization of the paper
  • Definition of x-adic division steps
  • Iterates of x-adic division steps
  • Fast computation of iterates of x-adic division steps
  • Fast polynomial gcd computation and modular inversion
  • Software case study for polynomial modular inversion
  • Definition of 2-adic division steps
  • Iterates of 2-adic division steps
  • Fast computation of iterates of 2-adic division steps
  • Fast integer gcd computation and modular inversion
  • Software case study for integer modular inversion
  • Proof of main gcd theorem for polynomials
  • Review of the EuclidndashStevin algorithm
  • EuclidndashStevin results as iterates of divstep
  • Alternative proof of main gcd theorem for polynomials
  • A gcd algorithm for integers
  • Performance of the gcd algorithm for integers
  • Proof of main gcd theorem for integers
  • Comparison to performance of cdivstep
Page 12: Fast constant-time gcd computation and modular inversion

12 Fast constant-time gcd computation and modular inversion

bull the first tminus 1 coefficients of gm+1 the first tminus 2 coefficients of gm+2 the first tminus 3coefficients of gm+3 and so on through the first coefficient of gm+tminus1

Our main algorithm computes (δn fn mod xtminus(nminusmminus1) gn mod xtminus(nminusm)) for any selectedn isin m+ 1 m+ t along with the product of the matrices Tm Tnminus1 See Sec-tion 5

Proof See Figure 33 for the intuition Formally fix m t and induct on n Observe that(δprimenminus1 f

primenminus1(0) gprimenminus1(0)) equals (δnminus1 fnminus1(0) gnminus1(0))

bull If n = m + 1 By assumption f primem equiv fm (mod xt) and n le m + t so t ge 1 so inparticular f primem(0) = fm(0) Similarly gprimem(0) = gm(0) By assumption δprimem = δm

bull If n gt m + 1 f primenminus1 equiv fnminus1 (mod xtminus(nminusmminus2)) by the inductive hypothesis andn le m + t so t minus (n minus m minus 2) ge 2 so in particular f primenminus1(0) = fnminus1(0) similarlygprimenminus1 equiv gnminus1 (mod xtminus(nminusmminus1)) by the inductive hypothesis and tminus (nminusmminus1) ge 1so in particular gprimenminus1(0) = gnminus1(0) and δprimenminus1 = δnminus1 by the inductive hypothesis

Now use the fact that T and S inspect only the first coefficient of each series to seethat

T primenminus1 = T (δprimenminus1 fprimenminus1 g

primenminus1) = T (δnminus1 fnminus1 gnminus1) = Tnminus1

S primenminus1 = S(δprimenminus1 fprimenminus1 g

primenminus1) = S(δnminus1 fnminus1 gnminus1) = Snminus1

as claimedMultiply

(1

δprimenminus1

)=(

1δnminus1

)by S primenminus1 = Snminus1 to see that δprimen = δn as claimed

By the inductive hypothesis (T primem T primenminus2) = (Tm Tnminus2) and T primenminus1 = Tnminus1 so

T primenminus1 middot middot middot T primem = Tnminus1 middot middot middot Tm Abbreviate Tnminus1 middot middot middot Tm as P Then(f primengprimen

)= P

(f primemgprimem

)and(

fngn

)= P

(fmgm

)by Theorem 42 so

(f primen minus fngprimen minus gn

)= P

(f primem minus fmgprimem minus gm

)

By Theorem 43 P isin(k + middot middot middot+ 1

xnminusmminus1 k k + middot middot middot+ 1xnminusmminus1 k

1xk + middot middot middot+ 1

xnminusm k1xk + middot middot middot+ 1

xnminusm k

) By assumption

f primem minus fm and gprimem minus gm are multiples of xt Multiply to see that f primen minus fn is a multiple ofxtxnminusmminus1 = xtminus(nminusmminus1) as claimed and gprimen minus gn is a multiple of xtxnminusm = xtminus(nminusm)

as claimed

5 Fast computation of iterates of x-adic division stepsOur goal in this section is to compute (δn fn gn) = divstepn(δ f g) We are giventhe power series f g to precision t t respectively and we compute fn gn to precisiont minus n + 1 t minus n respectively if n ge 1 see Theorem 45 We also compute the n-steptransition matrix Tnminus1 middot middot middot T0

We begin with an algorithm that simply computes one divstep iterate after anothermultiplying by scalars in k and dividing by x exactly as in the divstep definition We specifythis algorithm in Figure 51 as a function divstepsx in the Sage [58] computer-algebrasystem The quantities u v q r isin k[1x] inside the algorithm keep track of the coefficientsof the current f g in terms of the original f g

The rest of this section considers (1) constant-time computation (2) ldquojumpingrdquo throughdivision steps and (3) further techniques to save time inside the computations

52 Constant-time computation The underlying Sage functions do not take constanttime but they can be replaced with functions that do take constant time The casedistinction can be replaced with constant-time bit operations for example obtaining thenew (δ f g) as follows

Daniel J Bernstein and Bo-Yin Yang 13

defdivstepsx(ntdeltafg)asserttgt=nandngt=0fg=ftruncate(t)gtruncate(t)kx=fparent()x=kxgen()uvqr=kx(1)kx(0)kx(0)kx(1)

whilengt0f=ftruncate(t)ifdeltagt0andg[0]=0deltafguvqr=-deltagfqruvf0g0=f[0]g[0]deltagqr=1+delta(f0g-g0f)x(f0q-g0u)x(f0r-g0v)xnt=n-1t-1g=kx(g)truncate(t)

M2kx=MatrixSpace(kxfraction_field()2)returndeltafgM2kx((uvqr))

Figure 51 Algorithm divstepsx to compute (δn fn gn) = divstepn(δ f g) Inputsn t isin Z with 0 le n le t δ isin Z f g isin k[[x]] to precision at least t Outputs δn fn toprecision t if n = 0 or tminus (nminus 1) if n ge 1 gn to precision tminus n Tnminus1 middot middot middot T0

bull Let s be the ldquosign bitrdquo of minusδ meaning 0 if minusδ ge 0 or 1 if minusδ lt 0

bull Let z be the ldquononzero bitrdquo of g(0) meaning 0 if g(0) = 0 or 1 if g(0) 6= 0

bull Compute and output 1 + (1minus 2sz)δ For example shift δ to obtain 2δ ldquoANDrdquo withminussz to obtain 2szδ and subtract from 1 + δ

bull Compute and output f oplus sz(f oplus g) obtaining f if sz = 0 or g if sz = 1

bull Compute and output (1minus 2sz)(f(0)gminus g(0)f)x Alternatively compute and outputsimply (f(0)gminusg(0)f)x See the discussion of decomposition and scaling in Section 3

Standard techniques minimize the exact cost of these computations for various repre-sentations of the inputs The details depend on the platform See our case study inSection 7

53 Jumping through division steps Here is a more general divide-and-conqueralgorithm to compute (δn fn gn) and the n-step transition matrix Tnminus1 middot middot middot T0

bull If n le 1 use the definition of divstep and stop

bull Choose j isin 1 2 nminus 1

bull Jump j steps from δ f g to δj fj gj Specifically compute the j-step transition

matrix Tjminus1 middot middot middot T0 and then multiply by(fg

)to obtain

(fjgj

) To compute the

j-step transition matrix call the same algorithm recursively The critical point hereis that this recursive call uses the inputs to precision only j this determines thej-step transition matrix by Theorem 45

bull Similarly jump another nminus j steps from δj fj gj to δn fn gn

14 Fast constant-time gcd computation and modular inversion

fromdivstepsximportdivstepsx

defjumpdivstepsx(ntdeltafg)asserttgt=nandngt=0kx=ftruncate(t)parent()

ifnlt=1returndivstepsx(ntdeltafg)

j=n2

deltaf1g1P1=jumpdivstepsx(jjdeltafg)fg=P1vector((fg))fg=kx(f)truncate(t-j)kx(g)truncate(t-j)

deltaf2g2P2=jumpdivstepsx(n-jn-jdeltafg)fg=P2vector((fg))fg=kx(f)truncate(t-n+1)kx(g)truncate(t-n)

returndeltafgP2P1

Figure 54 Algorithm jumpdivstepsx to compute (δn fn gn) = divstepn(δ f g) Sameinputs and outputs as in Figure 51

This splits an (n t) problem into two smaller problems namely a (j j) problem and an(nminus j nminus j) problem at the cost of O(1) polynomial multiplications involving O(t+ n)coefficients

If j is always chosen as 1 then this algorithm ends up performing the same computationsas Figure 51 compute δ1 f1 g1 to precision tminus 1 compute δ2 f2 g2 to precision tminus 2etc But there are many more possibilities for j

Our jumpdivstepsx algorithm in Figure 54 takes j = bn2c balancing the (j j)problem with the (n minus j n minus j) problem In particular with subquadratic polynomialmultiplication the algorithm is subquadratic With FFT-based polynomial multiplica-tion the algorithm uses (t+ n)(log(t+ n))1+o(1) + n(logn)2+o(1) operations and hencen(logn)2+o(1) operations if t is bounded by n(logn)1+o(1) For comparison Figure 51takes Θ(n(t+ n)) operations

We emphasize that unlike standard fast-gcd algorithms such as [66] this algorithmdoes not call a polynomial-division subroutine between its two recursive calls

55 More speedups Our jumpdivstepsx algorithm is asymptotically Θ(logn) timesslower than fast polynomial multiplication The best previous gcd algorithms also have aΘ(logn) slowdown both in the worst case and in the average case10 This might suggestthat there is very little room for improvement

However Θ says nothing about constant factors There are several standard ideas forconstant-factor speedups in gcd computation beyond lower-level speedups in multiplicationWe now apply the ideas to jumpdivstepsx

The final polynomial multiplications in jumpdivstepsx produce six large outputsf g and four entries of a matrix product P Callers do not always use all of theseoutputs for example the recursive calls to jumpdivstepsx do not use the resulting f

10There are faster cases Strassen [66] gave an algorithm that replaces the log factor with the entropy ofthe list of degrees in the corresponding continued fraction there are occasional inputs where this entropyis o(log n) For example computing gcd 0 takes linear time However these speedups are not usefulfor constant-time computations

Daniel J Bernstein and Bo-Yin Yang 15

and g In principle there is no difficulty applying dead-value elimination and higher-leveloptimizations to each of the 63 possible combinations of desired outputs

The recursive calls in jumpdivstepsx produce two products P1 P2 of transitionmatrices which are then (if desired) multiplied to obtain P = P2P1 In Figure 54jumpdivstepsx multiplies f and g first by P1 and then by P2 to obtain fn and gn Analternative is to multiply by P

The inputs f and g are truncated to t coefficients and the resulting fn and gn aretruncated to tminus n+ 1 and tminus n coefficients respectively Often the caller (for example

jumpdivstepsx itself) wants more coefficients of P(fg

) Rather than performing an

entirely new multiplication by P one can add the truncation errors back into the resultsmultiply P by the f and g truncation errors multiply P2 by the fj and gj truncationerrors and skip the final truncations of fn and gn

There are standard speedups for multiplication in the context of truncated productssums of products (eg in matrix multiplication) partially known products (eg the

coefficients of xminus1 xminus2 xminus3 in P1

(fg

)are known to be 0) repeated inputs (eg each

entry of P1 is multiplied by two entries of P2 and by one of f g) and outputs being reusedas inputs See eg the discussions of ldquoFFT additionrdquo ldquoFFT cachingrdquo and ldquoFFT doublingrdquoin [11]

Any choice of j between 1 and nminus 1 produces correct results Values of j close to theedges do not take advantage of fast polynomial multiplication but this does not imply thatj = bn2c is optimal The number of combinations of choices of j and other options aboveis small enough that for each small (t n) in turn one can simply measure all options andselect the best The results depend on the contextmdashwhich outputs are neededmdashand onthe exact performance of polynomial multiplication

6 Fast polynomial gcd computation and modular inversion

This section presents three algorithms: gcdx computes gcd{R0, R1}, where R0 is a polynomial of degree d > 0 and R1 is a polynomial of degree <d; gcddegree computes merely deg gcd{R0, R1}; and recipx computes the reciprocal of R1 modulo R0 if gcd{R0, R1} = 1. Figure 6.1 specifies all three algorithms, and Theorem 6.2 justifies all three algorithms. The theorem is proven in Appendix A.

Each algorithm calls divstepsx from Section 5 to compute (δ_{2d−1}, f_{2d−1}, g_{2d−1}) = divstep^{2d−1}(1, f, g). Here f = x^d R0(1/x) is the reversal of R0, and g = x^{d−1} R1(1/x). Of course, one can replace divstepsx here with jumpdivstepsx, which has the same outputs. The algorithms vary in the input-precision parameter t passed to divstepsx, and vary in how they use the outputs:

• gcddegree uses input precision t = 2d − 1 to compute δ_{2d−1}, and then uses one part of Theorem 6.2, which states that deg gcd{R0, R1} = δ_{2d−1}/2.

• gcdx uses precision t = 3d − 1 to compute δ_{2d−1} and d + 1 coefficients of f_{2d−1}. The algorithm then uses another part of Theorem 6.2, which states that f_{2d−1}/f_{2d−1}(0) is the reversal of gcd{R0, R1}.

• recipx uses t = 2d − 1 to compute δ_{2d−1}, 1 coefficient of f_{2d−1}, and the product P of transition matrices. It then uses the last part of Theorem 6.2, which (in the coprime case) states that the top-right corner of P is f_{2d−1}(0)/x^{2d−2} times the degree-(d − 1) reversal of the desired reciprocal.

One can generalize to compute gcd{R0, R1} for any polynomials R0, R1 of degree ≤ d in time that depends only on d: scan the inputs to find the top degree, and always perform


from divstepsx import divstepsx

def gcddegree(R0, R1):
    d = R0.degree()
    assert d > 0 and d > R1.degree()
    f, g = R0.reverse(d), R1.reverse(d - 1)
    delta, f, g, P = divstepsx(2*d - 1, 2*d - 1, 1, f, g)
    return delta // 2

def gcdx(R0, R1):
    d = R0.degree()
    assert d > 0 and d > R1.degree()
    f, g = R0.reverse(d), R1.reverse(d - 1)
    delta, f, g, P = divstepsx(2*d - 1, 3*d - 1, 1, f, g)
    return f.reverse(delta // 2) / f[0]

def recipx(R0, R1):
    d = R0.degree()
    assert d > 0 and d > R1.degree()
    f, g = R0.reverse(d), R1.reverse(d - 1)
    delta, f, g, P = divstepsx(2*d - 1, 2*d - 1, 1, f, g)
    if delta != 0: return
    kx = f.parent()
    x = kx.gen()
    return kx(x^(2*d - 2) * P[0][1] / f[0]).reverse(d - 1)

Figure 6.1: Algorithm gcdx to compute gcd{R0, R1}; algorithm gcddegree to compute deg gcd{R0, R1}; algorithm recipx to compute the reciprocal of R1 modulo R0 when gcd{R0, R1} = 1. All three algorithms assume that R0, R1 ∈ k[x], that deg R0 > 0, and that deg R0 > deg R1.

2d iterations of divstep. The extra iteration handles the case that both R0 and R1 have degree d. For simplicity we focus on the case deg R0 = d > deg R1 with d > 0.

One can similarly characterize gcd and modular reciprocal in terms of subsequent iterates of divstep, with δ increased by the number of extra steps. Implementors might, for example, prefer to use an iteration count that is divisible by a large power of 2.
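For readers who want to try the algorithms in Figure 6.1 directly, here is a small usage sketch; it is not part of the paper's software, the polynomials are arbitrary examples, and it assumes that Figure 6.1 and the divstepsx module from Section 5 have been loaded into Sage.

k = GF(7)
R.<x> = k[]
R0 = x^7 + 2*x^5 + 3*x^2 + 1          # degree d = 7 > 0
R1 = 5*x^6 + x^3 + 4                  # degree < d
print(gcddegree(R0, R1))              # deg gcd{R0, R1}
V = recipx(R0, R1)                    # None unless gcd{R0, R1} = 1
if V is not None:
    assert (V * R1) % R0 == 1         # V is the reciprocal of R1 modulo R0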

Theorem 6.2. Let k be a field. Let d be a positive integer. Let R0, R1 be elements of the polynomial ring k[x] with deg R0 = d > deg R1. Define G = gcd{R0, R1}, and let V be the unique polynomial of degree < d − deg G such that V·R1 ≡ G (mod R0). Define f = x^d R0(1/x), g = x^{d−1} R1(1/x), (δ_n, f_n, g_n) = divstep^n(1, f, g), T_n = T(δ_n, f_n, g_n), and

  [ u_n  v_n ]
  [ q_n  r_n ]  =  T_{n−1} ··· T_0.

Then

  deg G = δ_{2d−1}/2,
  G = x^{deg G} f_{2d−1}(1/x) / f_{2d−1}(0),
  V = x^{−d+1+deg G} v_{2d−1}(1/x) / f_{2d−1}(0).

6.3. A numerical example. Take k = F_7, d = 7, R0 = 2x^7 + 7x^6 + 1x^5 + 8x^4 + 2x^3 + 8x^2 + 1x + 8, and R1 = 3x^6 + 1x^5 + 4x^4 + 1x^3 + 5x^2 + 9x + 2. Then f = 2 + 7x + 1x^2 + 8x^3 + 2x^4 + 8x^5 + 1x^6 + 8x^7 and g = 3 + 1x + 4x^2 + 1x^3 + 5x^4 + 9x^5 + 2x^6. The gcdx algorithm computes (δ_13, f_13, g_13) = divstep^13(1, f, g) = (0, 2, 6); see Table 4.1


for all iterates of divstep on these inputs. Dividing f_13 by its constant coefficient produces 1, the reversal of gcd{R0, R1}. The gcddegree algorithm simply computes δ_13 = 0, which is enough information to see that gcd{R0, R1} = 1.

6.4. Constant-factor speedups. Compared to the general power-series setting in divstepsx, the inputs to gcddegree, gcdx, and recipx are more limited. For example, the f input is all 0 after the first d + 1 coefficients, so computations involving subsequent coefficients can be skipped. Similarly, the top-right entry of the matrix P in recipx is guaranteed to have coefficient 0 for 1, 1/x, 1/x^2, and so on through 1/x^{d−2}, so one can save time in the polynomial computations that produce this entry. See Theorems A.1 and A.2 for bounds on the sizes of all intermediate results.

6.5. Bottom-up gcd and inversion. Our division steps eliminate bottom coefficients of the inputs. Appendices B, C, and D relate this to a traditional Euclid–Stevin computation, which gradually eliminates top coefficients of the inputs. The reversal of inputs and outputs in gcddegree, gcdx, and recipx exchanges the role of top coefficients and bottom coefficients. The following theorem states an alternative: skip the reversal, and instead directly eliminate the bottom coefficients of the original inputs.

Theorem 6.6. Let k be a field. Let d be a positive integer. Let f, g be elements of the polynomial ring k[x] with f(0) ≠ 0, deg f ≤ d, and deg g < d. Define γ = gcd{f, g}. Define (δ_n, f_n, g_n) = divstep^n(1, f, g), T_n = T(δ_n, f_n, g_n), and

  [ u_n  v_n ]
  [ q_n  r_n ]  =  T_{n−1} ··· T_0.

Then

  deg γ = deg f_{2d−1},
  γ = f_{2d−1}/ℓ,
  x^{2d−2} γ ≡ x^{2d−2} v_{2d−1} g/ℓ  (mod f),

where ℓ is the leading coefficient of f_{2d−1}.

Proof. Note that γ(0) ≠ 0, since f is a multiple of γ.

Define R0 = x^d f(1/x). Then R0 ∈ k[x], deg R0 = d since f(0) ≠ 0, and f = x^d R0(1/x). Define R1 = x^{d−1} g(1/x). Then R1 ∈ k[x], deg R1 < d, and g = x^{d−1} R1(1/x). Define G = gcd{R0, R1}. We will show below that γ = c x^{deg G} G(1/x) for some c ∈ k*.

Then γ = c f_{2d−1}/f_{2d−1}(0) by Theorem 6.2, i.e., γ is a constant multiple of f_{2d−1}. Hence deg γ = deg f_{2d−1} as claimed. Furthermore, γ has leading coefficient 1, so 1 = cℓ/f_{2d−1}(0), i.e., γ = f_{2d−1}/ℓ as claimed.

By Theorem 4.2, f_{2d−1} = u_{2d−1} f + v_{2d−1} g. Also x^{2d−2} u_{2d−1}, x^{2d−2} v_{2d−1} ∈ k[x] by Theorem 4.3, so

  x^{2d−2} γ = (x^{2d−2} u_{2d−1}/ℓ) f + (x^{2d−2} v_{2d−1}/ℓ) g ≡ (x^{2d−2} v_{2d−1}/ℓ) g  (mod f)

as claimed.

All that remains is to show that γ = c x^{deg G} G(1/x) for some c ∈ k*. If R1 = 0 then G = gcd{R0, 0} = R0/R0[d], so x^{deg G} G(1/x) = x^d R0(1/x)/R0[d] = f/f[0]; also g = 0, so γ = f/f[deg f] = (f[0]/f[deg f]) x^{deg G} G(1/x). Assume from now on that R1 ≠ 0.

We have R1 = QG for some Q ∈ k[x]. Note that deg R1 = deg Q + deg G. Also R1(1/x) = Q(1/x) G(1/x), so x^{deg R1} R1(1/x) = x^{deg Q} Q(1/x) · x^{deg G} G(1/x) since deg R1 = deg Q + deg G. Hence x^{deg G} G(1/x) divides the polynomial x^{deg R1} R1(1/x), which in turn divides x^{d−1} R1(1/x) = g. Similarly, x^{deg G} G(1/x) divides f. Hence x^{deg G} G(1/x) divides gcd{f, g} = γ.

Furthermore, G = A R0 + B R1 for some A, B ∈ k[x]. Take any integer m above all of deg G, deg A, deg B; then

  x^{m+d−deg G} x^{deg G} G(1/x) = x^{m+d} G(1/x)
                                 = x^m A(1/x) x^d R0(1/x) + x^{m+1} B(1/x) x^{d−1} R1(1/x)
                                 = x^{m−deg A} x^{deg A} A(1/x) f + x^{m+1−deg B} x^{deg B} B(1/x) g,


so x^{m+d−deg G} x^{deg G} G(1/x) is a multiple of γ, so the polynomial γ/(x^{deg G} G(1/x)) divides x^{m+d−deg G}, so γ/(x^{deg G} G(1/x)) = c x^i for some c ∈ k* and some i ≥ 0. We must have i = 0 since γ(0) ≠ 0.

6.7. How to divide by x^{2d−2} modulo f. Unlike top-down inversion (and top-down gcd and bottom-up gcd), bottom-up inversion naturally scales its result by a power of x. We now explain various ways to remove this scaling.

For simplicity we focus here on the case gcd{f, g} = 1. The divstep computation in Theorem 6.6 naturally produces the polynomial x^{2d−2} v_{2d−1}/ℓ, which has degree at most d − 1 by Theorem 6.2. Our objective is to divide this polynomial by x^{2d−2} modulo f, obtaining the reciprocal of g modulo f by Theorem 6.6. One way to do this is to multiply by a reciprocal of x^{2d−2} modulo f. This reciprocal depends only on d and f, i.e., it can be precomputed before g is available.

Here are several standard strategies to compute 1/x^{2d−2} modulo f, or more generally 1/x^e modulo f for any nonnegative integer e:

• Compute the polynomial (1 − f/f(0))/x, which is 1/x modulo f, and then raise this to the eth power modulo f. This takes a logarithmic number of multiplications modulo f.

• Start from 1/f(0), the reciprocal of f modulo x. Compute the reciprocals of f modulo x^2, x^4, x^8, etc. by Newton (Hensel) iteration. After finding the reciprocal r of f modulo x^e, divide 1 − rf by x^e to obtain the reciprocal of x^e modulo f. This takes a constant number of full-size multiplications, a constant number of half-size multiplications, etc. The total number of multiplications is again logarithmic, but if e is linear, as in the case e = 2d − 2, then the total size of the multiplications is linear, without an extra logarithmic factor.

• Combine the powering strategy with the Hensel strategy. For example, use the reciprocal of f modulo x^{d−1} to compute the reciprocal of x^{d−1} modulo f, and then square modulo f to obtain the reciprocal of x^{2d−2} modulo f, rather than using the reciprocal of f modulo x^{2d−2}.

• Write down a simple formula for the reciprocal of x^e modulo f if f has a special structure that makes this easy. For example, if f = x^d − 1 as in original NTRU, and d ≥ 3, then the reciprocal of x^{2d−2} is simply x^2. As another example, if f = x^d + x^{d−1} + ··· + 1 and d ≥ 5, then the reciprocal of x^{2d−2} is x^4.

In most cases the division by x^{2d−2} modulo f involves extra arithmetic that is avoided by the top-down reversal approach. On the other hand, the case f = x^d − 1 does not involve any extra arithmetic. There are also applications of inversion that can easily handle a scaled result.
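The powering strategy in the first bullet above is easy to express in Sage. The following sketch is ours (the name recip_x_power is not from the paper); it assumes f ∈ k[x] over a field with f(0) ≠ 0 and a nonnegative integer e.

def recip_x_power(f, e):
    # reciprocal of x^e modulo f via the powering strategy:
    # (1 - f/f(0))/x is 1/x modulo f; raise it to the e-th power modulo f
    R = f.parent()
    x = R.gen()
    h = (1 - f * f(0)^-1) // x        # exact division: the constant term cancels
    Q = R.quotient(f)                 # the ring k[x]/f
    return (Q(h)^e).lift()

# sanity check: (x^e * recip_x_power(f, e)) % f == 1 whenever f(0) != 0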

7 Software case study for polynomial modular inversion

As a concrete case study, we consider the problem of inversion in the ring (Z/3)[x]/(x^700 + x^699 + ··· + x + 1) on an Intel Haswell CPU core. The situation before our work was that this inversion consumed half of the key-generation time in the ntruhrss701 cryptosystem, about 150000 cycles out of 300000 cycles. This cryptosystem was introduced by Hülsing, Rijneveld, Schanck, and Schwabe [39] at CHES 2017.^11

^11 NTRU-HRSS is another round-2 submission in NIST's ongoing post-quantum competition. Google has also selected NTRU-HRSS for its "CECPQ2" post-quantum experiment. CECPQ2 includes a slightly modified version of ntruhrss701, and the NTRU-HRSS authors have also announced their intention to tweak NTRU-HRSS; these changes do not affect the performance of key generation.


We saved 60000 cycles out of these 150000 cycles, reducing the total key-generation time to 240000 cycles. Our approach also simplifies the code, for example eliminating the series of conditional power-of-2 rotations in [39].

7.1. Details of the new computation. Our computation works with one integer δ and four polynomials v, r, f, g ∈ (Z/3)[x]. Initially δ = 1, v = 0, r = 1, f = x^700 + x^699 + ··· + x + 1, and g = g_699 x^699 + ··· + g_0 x^0 is obtained by reversing the 700 input coefficients. We actually store −δ instead of δ; this turns out to be marginally more efficient for this platform.

We then apply 2·700 − 1 = 1399 iterations of a main loop. Each iteration works as follows, with the order of operations determined experimentally to try to minimize latency issues:

• Replace v with xv.

• Compute a swap mask as −1 if δ > 0 and g(0) ≠ 0, otherwise 0.

• Compute c ∈ Z/3 as f(0)g(0).

• Replace δ with −δ if the swap mask is set.

• Add 1 to δ.

• Replace (f, g) with (g, f) if the swap mask is set.

• Replace g with (g − cf)/x.

• Replace (v, r) with (r, v) if the swap mask is set.

• Replace r with r − cv.

This multiplies the transition matrix T(δ, f, g) into the vector

  [ v/x^{n−1} ]
  [ r/x^n     ]

where n is the number of iterations, and at the same time replaces (δ, f, g) with divstep(δ, f, g). Actually, we are slightly modifying divstep here (and making the corresponding modification to T), replacing the original coefficients f(0) and g(0) with the scaled coefficients 1 and g(0)/f(0) = f(0)g(0), as in Section 3.4. In other words, if f(0) = −1 then we negate both f and g. We suppress further comments on this deviation from the definition of divstep.
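The following plain-Python sketch models the mathematical effect of one such iteration on coefficient lists. It is ours and is for exposition only: it ignores the vectorization, the bit-level encoding, and the latency-motivated instruction order described below. We use 768 coefficient slots per polynomial, matching the three 256-coefficient vectors per polynomial in the software.

N = 768   # coefficient slots per polynomial, low degree first

def mod3(a):
    # signed representative -1, 0, 1 of a modulo 3
    a %= 3
    return a - 3 if a == 2 else a

def iterate(delta, v, r, f, g):
    # v, r, f, g are length-N lists of coefficients in {-1, 0, 1}
    v = [0] + v[:N-1]                              # v <- x*v
    swap = delta > 0 and g[0] != 0                 # swap mask
    c = mod3(f[0] * g[0])                          # c = f(0)*g(0) in Z/3
    delta = (-delta if swap else delta) + 1        # negate delta if swapping, then add 1
    if swap:
        f, g = g, f                                # (f, g) <- (g, f)
        v, r = r, v                                # (v, r) <- (r, v)
    g = [mod3(g[i] - c*f[i]) for i in range(1, N)] + [0]   # g <- (g - c*f)/x
    r = [mod3(r[i] - c*v[i]) for i in range(N)]            # r <- r - c*v
    return delta, v, r, f, g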

The effect of n iterations is to replace the original (δ, f, g) with divstep^n(δ, f, g). The matrix product T_{n−1} ··· T_0 has the form

  [ ···  v/x^{n−1} ]
  [ ···  r/x^n     ].

The new f and g are congruent to v/x^{n−1} and r/x^n, respectively, times the original g, modulo x^700 + x^699 + ··· + x + 1.

After 1399 iterations we have δ = 0 if and only if the input is invertible modulo x^700 + x^699 + ··· + x + 1, by Theorem 6.2. Furthermore, if δ = 0, then the inverse is x^{−699} v_1399(1/x)/f(0), where v_1399 = v/x^1398; i.e., the inverse is x^699 v(1/x)/f(0). We thus multiply v by f(0) and take the coefficients of x^0, ..., x^699 in reverse order to obtain the desired inverse.

We use signed representatives −1, 0, 1 for elements of Z/3. We use 2-bit two's-complement representations of −1, 0, 1: the bottom bit is 1, 0, 1 respectively, and the top bit is 1, 0, 0. We wrote a simple superoptimizer to find optimal sequences of 3, 6, 6 bit operations respectively for multiplication, addition, and subtraction. We do not claim novelty for the approach in this paragraph: Boothby and Bradshaw in [20, Section 4.1] suggested the same 2-bit representation for Z/3 (without explaining it as signed two's-complement) and found the same operation counts by a similar search.
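To make the encoding concrete, here is one valid set of branch-free formulas in this signed 2-bit representation, together with an exhaustive check. This is our own sketch in plain Python (note that ^ is the Python XOR operator here, so this block is not meant for the Sage preparser); the exact sequences produced by the superoptimizer may differ, although the operation counts (3 for multiplication, 6 for addition) match.

# encoding: value = b - 2*t, with (b, t) = (0,0), (1,0), (1,1) for 0, 1, -1
def enc(a):    return (a & 1, 1 if a == -1 else 0)
def dec(b, t): return b - 2*t

def mul(xb, xt, yb, yt):          # 3 bit operations
    zb = xb & yb
    zt = (xt ^ yt) & zb
    return zb, zt

def add(xb, xt, yb, yt):          # 6 bit operations
    a = xb ^ yt
    b = yb ^ xt
    zt = a & b
    zb = (xb ^ yb) | (b ^ yt)
    return zb, zt

for x in (-1, 0, 1):
    for y in (-1, 0, 1):
        assert dec(*mul(*enc(x), *enc(y))) == (x*y + 1) % 3 - 1
        assert dec(*add(*enc(x), *enc(y))) == (x + y + 1) % 3 - 1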

We store each of the four polynomials v, r, f, g as six 256-bit vectors: for example, the coefficients of x^0, ..., x^255 in v are stored as a 256-bit vector of bottom bits and a 256-bit vector of top bits. Within each 256-bit vector we use the first 64-bit word to


store the coefficients of x^0, x^4, ..., x^252; the next 64-bit word to store the coefficients of x^1, x^5, ..., x^253; etc. Multiplying by x thus moves the first 64-bit word to the second, moves the second to the third, moves the third to the fourth, moves the fourth to the first, and shifts this new first word by 1 bit; the bit that falls off the end of that word is inserted into the next vector.
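A short plain-Python model of this multiplication by x may help clarify the word layout; it operates on one bit-plane of one polynomial, given as a list of vectors, each vector being four 64-bit words in the interleaved order just described. The function name and the treatment of the final carry are ours.

MASK64 = (1 << 64) - 1

def times_x(planes):
    # planes[v][j] is a 64-bit word; bit i of word j in vector v holds the
    # coefficient of x^(256*v + 4*i + j)
    carry = 0                                 # bit arriving from the previous vector
    out = []
    for w0, w1, w2, w3 in planes:
        new0 = ((w3 << 1) | carry) & MASK64   # fourth word -> first word, shifted by 1 bit
        carry = w3 >> 63                      # bit that falls off, for the next vector
        out.append([new0, w0, w1, w2])        # first/second/third -> second/third/fourth
    return out                                # carry out of the last vector is dropped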

The first 256 iterations involve only the first 256-bit vectors for v and r, so we simply leave the second and third vectors as 0. The next 256 iterations involve only the first two 256-bit vectors for v and r. Similarly, we manipulate only the first 256-bit vectors for f and g in the final 256 iterations, and we manipulate only the first and second 256-bit vectors for f and g in the previous 256 iterations. Our main loop thus has five sections: 256 iterations involving (1, 1, 3, 3) vectors for (v, r, f, g); 256 iterations involving (2, 2, 3, 3) vectors; 375 iterations involving (3, 3, 3, 3) vectors; 256 iterations involving (3, 3, 2, 2) vectors; and 256 iterations involving (3, 3, 1, 1) vectors. This makes the code longer but saves almost 20% of the vector manipulations.

7.2. Comparison to the old computation. Many features of our computation are also visible in the software from [39]. We emphasize the ways that our computation is more streamlined.

There are essentially the same four polynomials v, r, f, g in [39] (labeled "c", "b", "g", "f"). Instead of a single integer δ (plus a loop counter), there are four integers "degf", "degg", "k", and "done" (plus a loop counter). One cannot simply eliminate "degf" and "degg" in favor of their difference δ, since "done" is set when "degf" reaches 0, and "done" influences "k", which in turn influences the results of the computation: the final result is multiplied by x^{701−k}.

Multiplication by x^{701−k} is a conceptually simple matter of rotating by k positions within 701 positions and then reducing modulo x^700 + x^699 + ··· + x + 1, since x^701 − 1 is a multiple of x^700 + x^699 + ··· + x + 1. However, since k is a variable, the obvious method of rotating by k positions does not take constant time. [39] instead performs conditional rotations by 1 position, 2 positions, 4 positions, etc., where the condition bits are the bits of k.

The inputs and outputs are not reversed in [39]. This affects the power of x multiplied into the final results. Our approach skips the powers of x entirely.

Each polynomial in [39] is represented as six 256-bit vectors, with the five-section pattern skipping manipulations of some vectors as explained above. The order of coefficients in each vector is simply x^0, x^1, ...; multiplication by x in [39] uses more arithmetic instructions than our approach does.

At a lower level, [39] uses unsigned representatives 0, 1, 2 of Z/3, and uses a manually designed sequence of 19 bit operations for a multiply-add operation in Z/3. Langley [43] suggested a sequence of 16 bit operations, or 14 if an "and-not" operation is available. As in [20], we use just 9 bit operations.

The inversion software from [39] is 798 lines (32273 bytes) of Python generating assembly, producing 26634 bytes of object code. Our inversion software is 541 lines (14675 bytes) of C with intrinsics, compiled with gcc 8.2.1 to generate assembly, producing 10545 bytes of object code.

7.3. More NTRU examples. More than 1/3 of the key-generation time in [39] is consumed by a second inversion, which replaces the modulus 3 with 2^13. This is handled in [39] with an initial inversion modulo 2, followed by 8 multiplications modulo 2^13 for a Newton iteration.^12 The initial inversion modulo 2 is handled in [39] by exponentiation,

^12 This use of Newton iteration follows the NTRU key-generation strategy suggested in [61], but we point out a difference in the details. All 8 multiplications in [39] are carried out modulo 2^13, while [61] instead follows the traditional precision doubling in Newton iteration: compute an inverse modulo 2^2, then 2^4, then 2^8 (or 2^7), then 2^13. Perhaps limiting the precision of the first 6 multiplications would save time in [39].


taking just 10332 cycles; the point here is that squaring modulo 2 is particularly efficient.

Much more inversion time is spent in key generation in sntrup4591761, a slightly larger cryptosystem from the NTRU Prime [14] family.^13 NTRU Prime uses a prime modulus, such as 4591 for sntrup4591761, to avoid concerns that a power-of-2 modulus could allow attacks (see generally [14]), but this modulus also makes arithmetic slower. Before our work, the key-generation software for sntrup4591761 took 6 million cycles, partly for inversion in (Z/3)[x]/(x^761 − x − 1) but mostly for inversion in (Z/4591)[x]/(x^761 − x − 1). We reimplemented both of these inversion steps, reducing the total key-generation time to just 940852 cycles.^14 Almost all of our inversion code modulo 3 is shared between sntrup4591761 and ntruhrss701.

7.4. Does lattice-based cryptography need fast inversion? Lyubashevsky, Peikert, and Regev [46] published an NTRU alternative that avoids inversions.^15 However, this cryptosystem has slower encryption than NTRU, has slower decryption than NTRU, and appears to be covered by a 2010 patent [35] by Gaborit and Aguilar-Melchor. It is easy to envision applications that will (1) select NTRU and (2) benefit from faster inversions in NTRU key generation.^16

Lyubashevsky and Seiler have very recently announced "NTTRU" [47], a variant of NTRU with a ring chosen to allow very fast NTT-based (FFT-based) multiplication and division. This variant triggers many of the security concerns described in [14], but if the variant survives security analysis then it will eliminate concerns about key-generation speed in NTRU.

8 Definition of 2-adic division steps

Define divstep: Z × Z_2^* × Z_2 → Z × Z_2^* × Z_2 as follows:

  divstep(δ, f, g) =
    (1 − δ, g, (g − f)/2)             if δ > 0 and g is odd,
    (1 + δ, f, (g + (g mod 2)f)/2)    otherwise.

Note that g − f is even in the first case (since both f and g are odd), and g + (g mod 2)f is even in the second case (since f is odd).
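In code this map is only a few lines; the following plain-Python helper (ours, not part of the paper's software) acts on integers f, g with f odd, and both divisions are exact by the note above.

def divstep(delta, f, g):
    if delta > 0 and g & 1:
        return 1 - delta, g, (g - f) // 2
    return 1 + delta, f, (g + (g & 1) * f) // 2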

8.1. Transition matrices. Write (δ_1, f_1, g_1) = divstep(δ, f, g). Then

  [ f_1 ]                   [ f ]             [ 1   ]                   [ 1 ]
  [ g_1 ]  =  T(δ, f, g) ·  [ g ]    and      [ δ_1 ]  =  S(δ, f, g) ·  [ δ ],

^13 NTRU Prime is another round-2 NIST submission. One other NTRU-based encryption system, NTRUEncrypt [29], was submitted to the competition but has been merged into NTRU-HRSS for round 2; see [59]. For completeness we mention that NTRUEncrypt key generation took 980760 cycles for ntrukem443 and 2639756 cycles for ntrukem743. As far as we know, the NTRUEncrypt software was not designed to run in constant time.

^14 This is a median cycle count from the SUPERCOP benchmarking framework. There is considerable variation in the cycle counts, more than 3% between quartiles, presumably depending on the mapping from virtual addresses to physical addresses. There is no dependence on the secret input being inverted.

^15 The LPR cryptosystem is the basis for several round-2 NIST submissions: Kyber, LAC, NewHope, Round5, Saber, ThreeBears, and the "NTRU LPRime" option in NTRU Prime.

^16 Google, for example, selected NTRU-HRSS for CECPQ2 as noted above, with the following explanation: "Schemes with a quotient-style key (like HRSS) will probably have faster encap/decap operations at the cost of much slower key-generation. Since there will be many uses outside TLS where keys can be reused, this is interesting as long as the key-generation speed is still reasonable for TLS." See [42].


where T: Z × Z_2^* × Z_2 → M_2(Z[1/2]) is defined by

  T(δ, f, g) =
    [ 0     1   ]
    [ −1/2  1/2 ]                  if δ > 0 and g is odd,

    [ 1            0   ]
    [ (g mod 2)/2  1/2 ]           otherwise,

and S: Z × Z_2^* × Z_2 → M_2(Z) is defined by

  S(δ, f, g) =
    [ 1   0 ]
    [ 1  −1 ]                      if δ > 0 and g is odd,

    [ 1  0 ]
    [ 1  1 ]                       otherwise.

Note that both T(δ, f, g) and S(δ, f, g) are defined entirely by δ, the bottom bit of f (which is always 1), and the bottom bit of g; they do not depend on the remaining bits of f and g.

8.2. Decomposition. This 2-adic divstep, like the power-series divstep, is a composition of two simpler functions: a conditional swap that replaces (δ, f, g) with (−δ, g, −f) if δ > 0 and g is odd, followed by an elimination that replaces (δ, f, g) with (1 + δ, f, (g + (g mod 2)f)/2).

8.3. Sizes. Write (δ_1, f_1, g_1) = divstep(δ, f, g). If f and g are integers that fit into n + 1 bits two's-complement, in the sense that −2^n ≤ f < 2^n and −2^n ≤ g < 2^n, then also −2^n ≤ f_1 < 2^n and −2^n ≤ g_1 < 2^n. Similarly, if −2^n < f < 2^n and −2^n < g < 2^n, then −2^n < f_1 < 2^n and −2^n < g_1 < 2^n. Beware, however, that intermediate results such as g − f need an extra bit.

As in the polynomial case, an important feature of divstep is that iterating divstep on integer inputs eventually sends g to 0. Concretely, we will show that (D + o(1))n iterations are sufficient, where D = 2/(10 − log_2 633) = 2.882100569…; see Theorem 11.2 for exact bounds. The proof technique cannot do better than (d + o(1))n iterations, where d = 14/(15 − log_2(561 + √249185)) = 2.828339631…. Numerical evidence is consistent with the possibility that (d + o(1))n iterations are required in the worst case, which is what matters in the context of constant-time algorithms.

8.4. A non-functioning variant. Many variations are possible in the definition of divstep. However, some care is required, as the following variant illustrates.

Define posdivstep: Z × Z_2^* × Z_2 → Z × Z_2^* × Z_2 as follows:

  posdivstep(δ, f, g) =
    (1 − δ, g, (g + f)/2)             if δ > 0 and g is odd,
    (1 + δ, f, (g + (g mod 2)f)/2)    otherwise.

This is just like divstep, except that it eliminates the negation of f in the swapped case. It is not true that iterating posdivstep on integer inputs eventually sends g to 0. For example, posdivstep^2(1, 1, 1) = (1, 1, 1). For comparison, in the polynomial case we were free to negate f (or g) at any moment.

8.5. A centered variant. As another variant of divstep, we define cdivstep: Z × Z_2^* × Z_2 → Z × Z_2^* × Z_2 as follows:

  cdivstep(δ, f, g) =
    (1 − δ, g, (f − g)/2)             if δ > 0 and g is odd,
    (1 + δ, f, (g + (g mod 2)f)/2)    if δ = 0,
    (1 + δ, f, (g − (g mod 2)f)/2)    otherwise.


Iterating cdivstep produces the same intermediate results as a variable-time gcd algorithm introduced by Stehlé and Zimmermann in [62]. The analogous gcd algorithm for divstep is a new variant of the Stehlé–Zimmermann algorithm, and we analyze the performance of this variant to prove bounds on the performance of divstep; see Appendices E, F, and G. As an analogy, one of our proofs in the polynomial case relies on relating x-adic divstep to the Euclid–Stevin algorithm.

The extra case distinctions make each cdivstep iteration more complicated than each divstep iteration. Similarly, the Stehlé–Zimmermann algorithm is more complicated than the new variant. To understand the motivation for cdivstep, compare divstep^3(−2, f, g) to cdivstep^3(−2, f, g). One has divstep^3(−2, f, g) = (1, f, g_3) where

  8g_3 ∈ {g, g + f, g + 2f, g + 3f, g + 4f, g + 5f, g + 6f, g + 7f}.

One has cdivstep^3(−2, f, g) = (1, f, g′_3) where

  8g′_3 ∈ {g − 3f, g − 2f, g − f, g, g + f, g + 2f, g + 3f, g + 4f}.

It seems intuitively clear that the "centered" output g′_3 is smaller than the "uncentered" output g_3, that more generally cdivstep has smaller outputs than divstep, and that cdivstep thus uses fewer iterations than divstep.

However, our analyses do not support this intuition. All available evidence indicates that, surprisingly, the worst case of cdivstep uses almost 10% more iterations than the worst case of divstep. Concretely, our proof technique cannot do better than (c + o(1))n iterations for cdivstep, where c = 2/(log_2(√17 − 1) − 1) = 3.110510062…, and numerical evidence is consistent with the possibility that (c + o(1))n iterations are required.

8.6. A plus-or-minus variant. As yet another example, the following variant of divstep is essentially^17 the Brent–Kung "Algorithm PM" from [24]:

  pmdivstep(δ, f, g) =
    (−δ, g, (f − g)/2)     if δ ≥ 0 and (g − f) mod 4 = 0,
    (−δ, g, (f + g)/2)     if δ ≥ 0 and (g + f) mod 4 = 0,
    (δ, f, (g − f)/2)      if δ < 0 and (g − f) mod 4 = 0,
    (δ, f, (g + f)/2)      if δ < 0 and (g + f) mod 4 = 0,
    (1 + δ, f, g/2)        otherwise.

There are even more cases here than in cdivstep, although splitting out an initial conditional swap reduces the number of cases to 3. Another complication here is that the linear combinations used in pmdivstep depend on the bottom two bits of f and g, rather than just the bottom bit.

To understand the motivation for pmdivstep, note that after each addition or subtraction there are always at least two halvings, as in the "sliding window" approach to exponentiation. Again it seems intuitively clear that this reduces the number of iterations required. Consider, however, the following example (which is from [24, Theorem 3], modulo minor details): pmdivstep needs 3b − 1 iterations to reach g = 0 starting from (1, 1, 3·2^{b−2}). Specifically, the first b − 2 iterations produce

  (2, 1, 3·2^{b−3}), (3, 1, 3·2^{b−4}), …, (b − 2, 1, 3·2), (b − 1, 1, 3),

and then the next 2b + 1 iterations produce

  (1 − b, 3, 2), (2 − b, 3, 1), (2 − b, 3, 2), (3 − b, 3, 1), (3 − b, 3, 2), …, (−1, 3, 1), (−1, 3, 2), (0, 3, 1), (0, 1, 2), (1, 1, 1), (−1, 1, 0).

^17 The Brent–Kung algorithm has an extra loop to allow the case that both inputs f, g are even. Compare [24, Figure 4] to the definition of pmdivstep.


Brent and Kung prove that ⌈cn⌉ systolic cells are enough for their algorithm, where c = 3.110510062… is the constant mentioned above, and they conjecture that (3 + o(1))n cells are enough, so presumably (3 + o(1))n iterations of pmdivstep are enough. Our analysis shows that fewer iterations are sufficient for divstep, even though divstep is simpler than pmdivstep.

9 Iterates of 2-adic division steps

The following results are analogous to the x-adic results in Section 4. We state these results separately to support verification.

Starting from (δ, f, g) ∈ Z × Z_2^* × Z_2, define (δ_n, f_n, g_n) = divstep^n(δ, f, g) ∈ Z × Z_2^* × Z_2, T_n = T(δ_n, f_n, g_n) ∈ M_2(Z[1/2]), and S_n = S(δ_n, f_n, g_n) ∈ M_2(Z).

Theorem 9.1.

  [ f_n ]                      [ f_m ]            [ 1   ]                      [ 1   ]
  [ g_n ]  =  T_{n−1} ··· T_m  [ g_m ]    and     [ δ_n ]  =  S_{n−1} ··· S_m  [ δ_m ]

if n ≥ m ≥ 0.

Proof. This follows from

  [ f_{m+1} ]         [ f_m ]            [ 1       ]         [ 1   ]
  [ g_{m+1} ]  =  T_m [ g_m ]    and     [ δ_{m+1} ]  =  S_m [ δ_m ].

Theorem 9.2.

  T_{n−1} ··· T_m  ∈  [ (1/2^{n−m−1})Z   (1/2^{n−m−1})Z ]
                      [ (1/2^{n−m})Z     (1/2^{n−m})Z   ]

if n > m ≥ 0.

Proof. As in Theorem 4.3.

Theorem 9.3.

  S_{n−1} ··· S_m  ∈  [ {1}                                 {0}      ]
                      [ {2 − (n−m), …, n−m−2, n−m}          {1, −1}  ]

if n > m ≥ 0.

Proof. As in Theorem 4.4.

Theorem 9.4. Let m, t be nonnegative integers. Assume that δ′_m = δ_m, f′_m ≡ f_m (mod 2^t), and g′_m ≡ g_m (mod 2^t). Then, for each integer n with m < n ≤ m + t: T′_{n−1} = T_{n−1}, S′_{n−1} = S_{n−1}, δ′_n = δ_n, f′_n ≡ f_n (mod 2^{t−(n−m−1)}), and g′_n ≡ g_n (mod 2^{t−(n−m)}).

Proof. Fix m, t and induct on n. Observe that (δ′_{n−1}, f′_{n−1} mod 2, g′_{n−1} mod 2) equals (δ_{n−1}, f_{n−1} mod 2, g_{n−1} mod 2):

• If n = m + 1: By assumption f′_m ≡ f_m (mod 2^t) and n ≤ m + t, so t ≥ 1, so in particular f′_m mod 2 = f_m mod 2. Similarly g′_m mod 2 = g_m mod 2. By assumption δ′_m = δ_m.

• If n > m + 1: f′_{n−1} ≡ f_{n−1} (mod 2^{t−(n−m−2)}) by the inductive hypothesis and n ≤ m + t, so t − (n−m−2) ≥ 2, so in particular f′_{n−1} mod 2 = f_{n−1} mod 2; similarly g′_{n−1} ≡ g_{n−1} (mod 2^{t−(n−m−1)}) by the inductive hypothesis and t − (n−m−1) ≥ 1, so in particular g′_{n−1} mod 2 = g_{n−1} mod 2; and δ′_{n−1} = δ_{n−1} by the inductive hypothesis.

Now use the fact that T and S inspect only the bottom bit of each number to see that

  T′_{n−1} = T(δ′_{n−1}, f′_{n−1}, g′_{n−1}) = T(δ_{n−1}, f_{n−1}, g_{n−1}) = T_{n−1},
  S′_{n−1} = S(δ′_{n−1}, f′_{n−1}, g′_{n−1}) = S(δ_{n−1}, f_{n−1}, g_{n−1}) = S_{n−1},

as claimed. Multiply

  [ 1        ]     [ 1       ]
  [ δ′_{n−1} ]  =  [ δ_{n−1} ]

by S′_{n−1} = S_{n−1} to see that δ′_n = δ_n as claimed.


def truncate(f, t):
    if t == 0: return 0
    twot = 1 << (t - 1)
    return ((f + twot) & (2 * twot - 1)) - twot

def divsteps2(n, t, delta, f, g):
    assert t >= n and n >= 0
    f, g = truncate(f, t), truncate(g, t)
    u, v, q, r = 1, 0, 0, 1

    while n > 0:
        f = truncate(f, t)
        if delta > 0 and g & 1:
            delta, f, g, u, v, q, r = -delta, g, -f, q, r, -u, -v
        g0 = g & 1
        delta, g, q, r = 1 + delta, (g + g0*f)/2, (q + g0*u)/2, (r + g0*v)/2
        n, t = n - 1, t - 1
        g = truncate(ZZ(g), t)

    M2Q = MatrixSpace(QQ, 2)
    return delta, f, g, M2Q((u, v, q, r))

Figure 10.1: Algorithm divsteps2 to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). Inputs: n, t ∈ Z with 0 ≤ n ≤ t; δ ∈ Z; at least bottom t bits of f, g ∈ Z_2. Outputs: δ_n; bottom t bits of f_n if n = 0, or t − (n − 1) bits if n ≥ 1; bottom t − n bits of g_n; T_{n−1} ··· T_0.
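A quick way to exercise divsteps2 is to check Theorem 9.1 on inputs small enough that the internal truncations cannot discard information; the following Sage lines are such a sanity check of ours, not part of the figure.

delta, fn, gn, P = divsteps2(10, 20, 1, 7, 12)     # n = 10, t = 20, f = 7 (odd), g = 12
assert P * vector((7, 12)) == vector((fn, gn))     # P = T_9 ... T_0 maps (f, g) to (f_10, g_10)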

By the inductive hypothesis, (T′_m, ..., T′_{n−2}) = (T_m, ..., T_{n−2}), and T′_{n−1} = T_{n−1}, so T′_{n−1} ··· T′_m = T_{n−1} ··· T_m. Abbreviate T_{n−1} ··· T_m as P. Then

  [ f′_n ]        [ f′_m ]            [ f_n ]        [ f_m ]
  [ g′_n ]  =  P  [ g′_m ]    and     [ g_n ]  =  P  [ g_m ]

by Theorem 9.1, so

  [ f′_n − f_n ]        [ f′_m − f_m ]
  [ g′_n − g_n ]  =  P  [ g′_m − g_m ].

By Theorem 9.2,

  P  ∈  [ (1/2^{n−m−1})Z   (1/2^{n−m−1})Z ]
        [ (1/2^{n−m})Z     (1/2^{n−m})Z   ].

By assumption, f′_m − f_m and g′_m − g_m are multiples of 2^t. Multiply to see that f′_n − f_n is a multiple of 2^t/2^{n−m−1} = 2^{t−(n−m−1)} as claimed, and g′_n − g_n is a multiple of 2^t/2^{n−m} = 2^{t−(n−m)} as claimed.

10 Fast computation of iterates of 2-adic division steps

We describe, just as in Section 5, two algorithms to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). We are given the bottom t bits of f, g, and compute f_n, g_n to t − n + 1, t − n bits respectively if n ≥ 1; see Theorem 9.4. We also compute the product of the relevant T matrices. We specify these algorithms in Figures 10.1–10.2 as functions divsteps2 and jumpdivsteps2 in Sage.

The first function, divsteps2, simply takes divsteps one after another, analogously to divstepsx. The second function, jumpdivsteps2, uses a straightforward divide-and-conquer strategy, analogously to jumpdivstepsx. More generally, j in jumpdivsteps2 can be taken anywhere between 1 and n − 1; this generalization includes divsteps2 as a special case.

As in Section 5, the Sage functions are not constant-time, but it is easy to build constant-time versions of the same computations. The comments on performance in Section 5 are applicable here, with integer arithmetic replacing polynomial arithmetic. The same basic optimization techniques are also applicable here.


from divsteps2 import divsteps2, truncate

def jumpdivsteps2(n, t, delta, f, g):
    assert t >= n and n >= 0
    if n <= 1: return divsteps2(n, t, delta, f, g)

    j = n // 2

    delta, f1, g1, P1 = jumpdivsteps2(j, j, delta, f, g)
    f, g = P1 * vector((f, g))
    f, g = truncate(ZZ(f), t - j), truncate(ZZ(g), t - j)

    delta, f2, g2, P2 = jumpdivsteps2(n - j, n - j, delta, f, g)
    f, g = P2 * vector((f, g))
    f, g = truncate(ZZ(f), t - n + 1), truncate(ZZ(g), t - n)

    return delta, f, g, P2 * P1

Figure 10.2: Algorithm jumpdivsteps2 to compute (δ_n, f_n, g_n) = divstep^n(δ, f, g). Same inputs and outputs as in Figure 10.1.
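Since Figure 10.2 is specified to have the same input/output behavior as Figure 10.1, comparing the two on small inputs is a convenient consistency check; the following Sage loop is our own test, not part of the figure.

for n in range(0, 8):
    for g in range(-20, 20):
        assert divsteps2(n, n + 6, 1, 9, g) == jumpdivsteps2(n, n + 6, 1, 9, g)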

11 Fast integer gcd computation and modular inversion

This section presents two algorithms. If f is an odd integer and g is an integer, then gcd2 computes gcd{f, g}. If also gcd{f, g} = 1, then recip2 computes the reciprocal of g modulo f. Figure 11.1 specifies both algorithms, and Theorem 11.2 justifies both algorithms.

The algorithms take the smallest nonnegative integer d such that |f| < 2^d and |g| < 2^d. More generally, one can take d as an extra input; the algorithms then take constant time if d is constant. Theorem 11.2 relies on f^2 + 4g^2 ≤ 5·2^{2d} (which is certainly true if |f| < 2^d and |g| < 2^d) but does not rely on d being minimal. One can further generalize to allow even f by first finding the number of shared powers of 2 in f and g, and then reducing to the odd case.

The algorithms then choose a positive integer m as a particular function of d, and call divsteps2 (which can be transparently replaced with jumpdivsteps2) to compute (δ_m, f_m, g_m) = divstep^m(1, f, g). The choice of m guarantees g_m = 0 and f_m = ± gcd{f, g} by Theorem 11.2.

The gcd2 algorithm uses input precision m + d to obtain d + 1 bits of f_m, i.e., to obtain the signed remainder of f_m modulo 2^{d+1}. This signed remainder is exactly f_m = ± gcd{f, g}, since the assumptions |f| < 2^d and |g| < 2^d imply |gcd{f, g}| < 2^d.

The recip2 algorithm uses input precision just m + 1 to obtain the signed remainder of f_m modulo 4. This algorithm assumes gcd{f, g} = 1, so f_m = ±1, so the signed remainder is exactly f_m. The algorithm also extracts the top-right corner v_m of the transition matrix, and multiplies by f_m, obtaining a reciprocal of g modulo f by Theorem 11.2.

Both gcd2 and recip2 are bottom-up algorithms, and in particular recip2 naturally obtains this reciprocal v_m/f_m as a fraction with denominator 2^{m−1}. It multiplies by 2^{m−1} to obtain an integer, and then multiplies by ((f + 1)/2)^{m−1} modulo f to divide by 2^{m−1} modulo f. The reciprocal of 2^{m−1} can be computed ahead of time. Another way to compute this reciprocal is through Hensel lifting to compute f^{−1} (mod 2^{m−1}); for details and further techniques see Section 6.7.
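The precomputation uses the elementary fact that (f + 1)/2 is a reciprocal of 2 modulo any odd f, since 2·(f + 1)/2 = f + 1 ≡ 1 (mod f). A two-line Sage check, with an arbitrary odd modulus and exponent of our choosing:

f, m = 2^255 - 19, 738
assert Integers(f)((f + 1) / 2)^(m - 1) * 2^(m - 1) == 1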

We emphasize that recip2 assumes that its inputs are coprime; it does not notice if this assumption is violated. One can check the assumption by computing f_m to higher precision (as in gcd2), or by multiplying the supposed reciprocal by g and seeing whether


from divsteps2 import divsteps2

def iterations(d):
    return (49*d + 80) // 17 if d < 46 else (49*d + 57) // 17

def gcd2(f, g):
    assert f & 1
    d = max(f.nbits(), g.nbits())
    m = iterations(d)
    delta, fm, gm, P = divsteps2(m, m + d, 1, f, g)
    return abs(fm)

def recip2(f, g):
    assert f & 1
    d = max(f.nbits(), g.nbits())
    m = iterations(d)
    precomp = Integers(f)((f + 1) / 2)^(m - 1)
    delta, fm, gm, P = divsteps2(m, m + 1, 1, f, g)
    V = sign(fm) * ZZ(P[0][1] * 2^(m - 1))
    return ZZ(V * precomp)

Figure 11.1: Algorithm gcd2 to compute gcd{f, g}; algorithm recip2 to compute the reciprocal of g modulo f when gcd{f, g} = 1. Both algorithms assume that f is odd.
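As a usage sketch (assuming divsteps2 from Figure 10.1 is importable), one can invert modulo the Curve25519 prime from Section 12 and verify the result; this Sage-level check is far slower than the vectorized software described there, but it exercises the same algorithm. The value of g is an arbitrary example of ours.

p = 2^255 - 19
g = 2^200 + 12345
assert gcd2(p, g) == 1
r = recip2(p, g)
assert (r * g) % p == 1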

the result is 1 modulo f. For comparison, the recipx algorithm from Section 6 does notice if its polynomial inputs are coprime, using a quick test of δ_m.

Theorem 11.2. Let f be an odd integer. Let g be an integer. Define (δ_n, f_n, g_n) = divstep^n(1, f, g), T_n = T(δ_n, f_n, g_n), and

  [ u_n  v_n ]
  [ q_n  r_n ]  =  T_{n−1} ··· T_0.

Let d be a real number. Assume that f^2 + 4g^2 ≤ 5·2^{2d}. Let m be an integer. Assume that m ≥ ⌊(49d + 80)/17⌋ if d < 46, and that m ≥ ⌊(49d + 57)/17⌋ if d ≥ 46. Then m ≥ 1, g_m = 0, f_m = ± gcd{f, g}, 2^{m−1} v_m ∈ Z, and 2^{m−1} v_m g ≡ 2^{m−1} f_m (mod f).

In particular, if gcd{f, g} = 1, then f_m = ±1, and dividing 2^{m−1} v_m/f_m by 2^{m−1} modulo f produces the reciprocal of g modulo f.

Proof. First f^2 ≥ 1, so 1 ≤ f^2 + 4g^2 ≤ 5·2^{2d} < 2^{2d+7/3} since 5^3 < 2^7. Hence d > −7/6. If d < 46 then m ≥ ⌊(49d + 80)/17⌋ ≥ ⌊(49(−7/6) + 80)/17⌋ = 1. If d ≥ 46 then m ≥ ⌊(49d + 57)/17⌋ ≥ ⌊(49·46 + 57)/17⌋ = 135. Either way m ≥ 1 as claimed.

Define R_0 = f, R_1 = 2g, and G = gcd{R_0, R_1}. Then G = gcd{f, g} since f is odd. Define b = log_2 √(R_0^2 + R_1^2) = log_2 √(f^2 + 4g^2) ≥ 0. Then 2^{2b} = f^2 + 4g^2 ≤ 5·2^{2d}, so b ≤ d + log_4 5. We now split into three cases, in each case defining n as in Theorem G.6.

• If b ≤ 21: Define n = ⌊19b/7⌋. Then n ≤ ⌊49b/17⌋ ≤ ⌊49(d + log_4 5)/17⌋ ≤ ⌊(49d + 57)/17⌋ ≤ m since 5^{49} ≤ 4^{57}.

• If 21 < b ≤ 46: Define n = ⌊(49b + 23)/17⌋. If d ≥ 46 then n ≤ ⌊(49d + 23)/17⌋ ≤ ⌊(49d + 57)/17⌋ ≤ m. Otherwise d < 46, so n ≤ ⌊(49(d + log_4 5) + 23)/17⌋ ≤ ⌊(49d + 80)/17⌋ ≤ m.

• If b > 46: Define n = ⌊49b/17⌋. Then n ≤ ⌊49b/17⌋ ≤ ⌊49(d + log_4 5)/17⌋ ≤ ⌊(49d + 57)/17⌋ ≤ m.


In all cases n ≤ m.

Now divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0} by Theorem G.6, so divstep^m(1, R_0, R_1/2) ∈ Z × Z × {0}, so

  (δ_m, f_m, g_m) = divstep^m(1, f, g) = divstep^m(1, R_0, R_1/2) ∈ Z × {G, −G} × {0}

by Theorem G.3. Hence g_m = 0 as claimed, and f_m = ±G = ± gcd{f, g} as claimed.

Both u_m and v_m are in (1/2^{m−1})Z by Theorem 9.2. In particular, 2^{m−1} v_m ∈ Z as claimed. Finally, f_m = u_m f + v_m g by Theorem 9.1, so 2^{m−1} f_m = 2^{m−1} u_m f + 2^{m−1} v_m g, and 2^{m−1} u_m ∈ Z, so 2^{m−1} f_m ≡ 2^{m−1} v_m g (mod f) as claimed.

12 Software case study for integer modular inversion

As emphasized in Section 1, we have selected an inversion problem to be extremely favorable to Fermat's method. We use Intel platforms with 64-bit multipliers. We focus on the problem of inverting modulo the well-known Curve25519 prime p = 2^255 − 19. There has been a long line of Curve25519 implementation papers, starting with [10] in 2006; we compare to the latest speed records [54] (in assembly) for inversion modulo this prime.

Despite all these Fermat advantages, we achieve better speeds using our 2-adic division steps. As mentioned in Section 1, we take 10050 cycles, 8778 cycles, and 8543 cycles on Haswell, Skylake, and Kaby Lake, where [54] used 11854 cycles, 9301 cycles, and 8971 cycles. This section looks more closely at how the new computation works.

12.1. Details of the new computation. Theorem 11.2 guarantees that 738 divstep iterations suffice, since ⌊(49·255 + 57)/17⌋ = 738. To simplify the computation, we actually use 744 = 12·62 iterations.

The decisions in the first 62 iterations are determined entirely by the bottom 62 bits of the inputs. We perform these iterations entirely in 64-bit words, apply the resulting 62-step transition matrix to the entire original inputs to jump to the result of 62 divstep iterations, and then repeat.

This is similar in two important ways to Lehmer's gcd computation from [44]. One of Lehmer's ideas, mentioned before, is to carry out some steps in limited precision. Another of Lehmer's ideas is for this limit to be a small constant. Our advantage over Lehmer's computation is that we have a completely regular constant-time data flow.

There are many other possibilities for which precision to use at each step. All of these possibilities can be described using the jumping framework described in Section 5.3, which applies to both the x-adic case and the 2-adic case. Figure 12.2 gives three examples of jump strategies. The first strategy computes each divstep in turn to full precision, as in the case study in Section 7. The second strategy is what we use in this case study. The third strategy illustrates an asymptotically faster choice of jump distances. The third strategy might be more efficient than the second strategy in hardware, but the second strategy seems optimal for 64-bit CPUs.

In more detail, we invert a 255-bit x modulo p as follows:

1. Set f = p, g = x, δ = 1, i = 1.

2. Set f_0 = f (mod 2^64), g_0 = g (mod 2^64).

3. Compute (δ′, f_1, g_1) = divstep^62(δ, f_0, g_0), and obtain a scaled transition matrix T_i such that (T_i/2^62)·(f_0, g_0) = (f_1, g_1). The 63-bit signed entries of T_i fit into 64-bit registers. We call this step jump64divsteps2. (A sketch of this building block appears after this list.)

4. Compute (f, g) ← T_i(f, g)/2^62. Set δ = δ′.


Figure 12.2: Three jump strategies for 744 divstep iterations on 255-bit integers. Left strategy: j = 1, i.e., computing each divstep in turn to full precision. Middle strategy: j = 1 for ≤62 requested iterations, else j = 62, i.e., using 62 bottom bits to compute 62 iterations, jumping 62 iterations, using 62 new bottom bits to compute 62 more iterations, etc. Right strategy: j = 1 for 2 requested iterations, j = 2 for 3 or 4 requested iterations, j = 4 for 5, 6, 7, or 8 requested iterations, etc. Vertical axis: 0 on top through 744 on bottom, number of iterations. Horizontal axis (within each strategy): 0 on left through 254 on right, bit positions used in computation.

5. Set i ← i + 1. Go back to step 3 if i ≤ 12.

6. Compute v mod p, where v is the top-right corner of T_12 T_11 ··· T_1.

   (a) Compute pair-products T_{2i} T_{2i−1}, with entries being 126-bit signed integers, which fit into two 64-bit limbs (radix 2^64).

   (b) Compute quad-products T_{4i} T_{4i−1} T_{4i−2} T_{4i−3}, with entries being 252-bit signed integers (four 64-bit limbs, radix 2^64).

   (c) At this point the three quad-products are converted into unsigned integers modulo p, divided into 4 64-bit limbs.

   (d) Compute the final vector-matrix-vector multiplications modulo p.

7. Compute x^{−1} = sgn(f) · v · 2^{−744} (mod p), where 2^{−744} is precomputed.
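The following plain-Python sketch models the jump64divsteps2 building block from step 3, under the convention that the returned matrix is the 62-step transition-matrix product scaled by 2^62, so that all entries are integers. The names and conventions are ours; the real software keeps everything in 64-bit registers rather than Python integers.

def jump62divsteps2(delta, f, g):
    # 62 divsteps driven by the low bits of f, g; returns (delta', [u, v, q, r]) with
    # [[u, v], [q, r]] = 2^62 * (T_61 ... T_0), i.e.
    # 2^62 * f_62 = u*f + v*g and 2^62 * g_62 = q*f + r*g
    u, v, q, r = 1, 0, 0, 1
    for _ in range(62):
        if delta > 0 and g & 1:
            delta, f, g = 1 - delta, g, (g - f) // 2
            u, v, q, r = 2*q, 2*r, q - u, r - v
        else:
            g0 = g & 1
            delta, g = 1 + delta, (g + g0*f) // 2
            u, v, q, r = 2*u, 2*v, q + g0*u, r + g0*v
    return delta, [u, v, q, r]

# applying the jump to the full-size f, g (step 4); both shifts are exact:
#   f, g = (u*f + v*g) >> 62, (q*f + r*g) >> 62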

12.3. Other CPUs. Our advantage is larger on platforms with smaller multipliers. For example, we have written software to invert modulo 2^255 − 19 on an ARM Cortex-A7, obtaining a median cycle count of 35277 cycles. The best previous Cortex-A7 result we have found in the literature is 62648 cycles, reported by Fujii and Aranha [34] using Fermat's method. Internally, our software does repeated calls to jump32divsteps2, units


of 30 iterations, with the resulting transition matrix fitting inside 32-bit general-purpose registers.

12.4. Other primes. Nath and Sarkar present inversion speeds for 20 different special primes, 12 of which were considered in previous work. For concreteness, we consider inversion modulo the double-size prime 2^511 − 187. Aranha–Barreto–Pereira–Ricardini [8] selected this prime for their "M-511" curve.

Nath and Sarkar report that their Fermat inversion software for 2^511 − 187 takes 72804 Haswell cycles, 47062 Skylake cycles, or 45014 Kaby Lake cycles. The slowdown from 2^255 − 19, more than 5× in each case, is easy to explain: the exponent is twice as long, producing almost twice as many multiplications; each multiplication handles twice as many bits; and multiplication cost per bit is not constant.

Our 511-bit software is solidly faster: under 30000 cycles on all of these platforms. Most of the cycles in our computation are in evaluating jump64divsteps2, and the number of calls to jump64divsteps2 scales linearly with the number of bits in the input. There is some overhead for multiplications, so a modulus of twice the size takes more than twice as long, but our scalability to large moduli is much better than the Fermat scalability.

Our advantage is also larger in applications that use "random" primes rather than special primes: we pay the extra cost for Montgomery multiplication only at the end of our computation, whereas Fermat inversion pays the extra cost in each step.

References

[1] — (no editor), Actes du congrès international des mathématiciens, tome 3, Gauthier-Villars Éditeur, Paris, 1971. See [41].

[2] — (no editor), CWIT 2011: 12th Canadian workshop on information theory, Kelowna, British Columbia, May 17–20, 2011, Institute of Electrical and Electronics Engineers, 2011. ISBN 978-1-4577-0743-8. See [51].

[3] Carlisle Adams, Jan Camenisch (editors), Selected areas in cryptography – SAC 2017, 24th international conference, Ottawa, ON, Canada, August 16–18, 2017, revised selected papers, Lecture Notes in Computer Science 10719, Springer, 2018. ISBN 978-3-319-72564-2. See [15].

[4] Carlos Aguilar Melchor, Nicolas Aragon, Slim Bettaieb, Loïc Bidoux, Olivier Blazy, Jean-Christophe Deneuville, Philippe Gaborit, Edoardo Persichetti, Gilles Zémor, Hamming Quasi-Cyclic (HQC) (2017). URL: https://pqc-hqc.org/doc/hqc-specification_2017-11-30.pdf. Citations in this document: §1.

[5] Alfred V. Aho (chairman), Proceedings of fifth annual ACM symposium on theory of computing, Austin, Texas, April 30–May 2, 1973, ACM, New York, 1973. See [53].

[6] Martin Albrecht, Carlos Cid, Kenneth G. Paterson, C. J. Tjhai, Martin Tomlinson, NTS-KEM, "The main document submitted to NIST" (2017). URL: https://nts-kem.io. Citations in this document: §1.

[7] Alejandro Cabrera Aldaya, Cesar Pereida García, Luis Manuel Alvarez Tapia, Billy Bob Brumley, Cache-timing attacks on RSA key generation (2018). URL: https://eprint.iacr.org/2018/367. Citations in this document: §1.

[8] Diego F. Aranha, Paulo S. L. M. Barreto, Geovandro C. C. F. Pereira, Jefferson E. Ricardini, A note on high-security general-purpose elliptic curves (2013). URL: https://eprint.iacr.org/2013/647. Citations in this document: §12.4.


[9] Elwyn R. Berlekamp, Algebraic coding theory, McGraw-Hill, 1968. Citations in this document: §1.

[10] Daniel J. Bernstein, Curve25519: new Diffie-Hellman speed records, in PKC 2006 [70] (2006), 207–228. URL: https://cr.yp.to/papers.html#curve25519. Citations in this document: §1, §12.

[11] Daniel J. Bernstein, Fast multiplication and its applications, in [27] (2008), 325–384. URL: https://cr.yp.to/papers.html#multapps. Citations in this document: §5.5.

[12] Daniel J. Bernstein, Tung Chou, Tanja Lange, Ingo von Maurich, Rafael Misoczki, Ruben Niederhagen, Edoardo Persichetti, Christiane Peters, Peter Schwabe, Nicolas Sendrier, Jakub Szefer, Wen Wang, Classic McEliece: conservative code-based cryptography, "Supporting Documentation" (2017). URL: https://classic.mceliece.org/nist.html. Citations in this document: §1.

[13] Daniel J. Bernstein, Tung Chou, Peter Schwabe, McBits: fast constant-time code-based cryptography, in CHES 2013 [18] (2013), 250–272. URL: https://binary.cr.yp.to/mcbits.html. Citations in this document: §1, §1.

[14] Daniel J. Bernstein, Chitchanok Chuengsatiansup, Tanja Lange, Christine van Vredendaal, NTRU Prime: reducing attack surface at low cost, full version of [15] (2017). URL: https://ntruprime.cr.yp.to/papers.html. Citations in this document: §1, §1, §1.1, §7.3, §7.3, §7.4.

[15] Daniel J. Bernstein, Chitchanok Chuengsatiansup, Tanja Lange, Christine van Vredendaal, NTRU Prime: reducing attack surface at low cost, in SAC 2017 [3], abbreviated version of [14] (2018), 235–260.

[16] Daniel J. Bernstein, Niels Duif, Tanja Lange, Peter Schwabe, Bo-Yin Yang, High-speed high-security signatures, Journal of Cryptographic Engineering 2 (2012), 77–89; see also older version at CHES 2011. URL: https://ed25519.cr.yp.to/ed25519-20110926.pdf. Citations in this document: §1.

[17] Daniel J. Bernstein, Peter Schwabe, NEON crypto, in CHES 2012 [56] (2012), 320–339. URL: https://cr.yp.to/papers.html#neoncrypto. Citations in this document: §1, §1.

[18] Guido Bertoni, Jean-Sébastien Coron (editors), Cryptographic hardware and embedded systems – CHES 2013 – 15th international workshop, Santa Barbara, CA, USA, August 20–23, 2013, proceedings, Lecture Notes in Computer Science 8086, Springer, 2013. ISBN 978-3-642-40348-4. See [13].

[19] Adam W. Bojanczyk, Richard P. Brent, A systolic algorithm for extended GCD computation, Computers & Mathematics with Applications 14 (1987), 233–238. ISSN 0898-1221. URL: https://maths-people.anu.edu.au/~brent/pd/rpb096i.pdf. Citations in this document: §1.3, §1.3.

[20] Tomas J. Boothby and Robert W. Bradshaw, Bitslicing and the method of four Russians over larger finite fields (2009). URL: http://arxiv.org/abs/0901.1413. Citations in this document: §7.1, §7.2.

[21] Joppe W. Bos, Constant time modular inversion, Journal of Cryptographic Engineering 4 (2014), 275–281. URL: http://joppebos.com/files/CTInversion.pdf. Citations in this document: §1, §1, §1, §1, §1, §1.3.


[22] Richard P. Brent, Fred G. Gustavson, David Y. Y. Yun, Fast solution of Toeplitz systems of equations and computation of Padé approximants, Journal of Algorithms 1 (1980), 259–295. ISSN 0196-6774. URL: https://maths-people.anu.edu.au/~brent/pd/rpb059i.pdf. Citations in this document: §1.4, §1.4.

[23] Richard P. Brent, Hsiang-Tsung Kung, Systolic VLSI arrays for polynomial GCD computation, IEEE Transactions on Computers C-33 (1984), 731–736. URL: https://maths-people.anu.edu.au/~brent/pd/rpb073i.pdf. Citations in this document: §1.3, §1.3, §3.2.

[24] Richard P. Brent, Hsiang-Tsung Kung, A systolic VLSI array for integer GCD computation, ARITH-7 (1985), 118–125. URL: https://maths-people.anu.edu.au/~brent/pd/rpb077i.pdf. Citations in this document: §1.3, §1.3, §1.3, §1.3, §1.4, §8.6, §8.6, §8.6.

[25] Richard P. Brent, Paul Zimmermann, An O(M(n) log n) algorithm for the Jacobi symbol, in ANTS 2010 [37] (2010), 83–95. URL: https://arxiv.org/pdf/1004.2091.pdf. Citations in this document: §1.4.

[26] Duncan A. Buell (editor), Algorithmic number theory, 6th international symposium, ANTS-VI, Burlington, VT, USA, June 13–18, 2004, proceedings, Lecture Notes in Computer Science 3076, Springer, 2004. ISBN 3-540-22156-5. See [62].

[27] Joe P. Buhler, Peter Stevenhagen (editors), Surveys in algorithmic number theory, Mathematical Sciences Research Institute Publications 44, Cambridge University Press, New York, 2008. See [11].

[28] Wouter Castryck, Tanja Lange, Chloe Martindale, Lorenz Panny, Joost Renes, CSIDH: an efficient post-quantum commutative group action, in Asiacrypt 2018 [55] (2018), 395–427. URL: https://csidh.isogeny.org/csidh-20181118.pdf. Citations in this document: §1.

[29] Cong Chen, Jeffrey Hoffstein, William Whyte, Zhenfei Zhang, NIST PQ Submission: NTRUEncrypt, a lattice based encryption algorithm (2017). URL: https://csrc.nist.gov/Projects/Post-Quantum-Cryptography/Round-1-Submissions. Citations in this document: §7.3.

[30] Tung Chou, McBits revisited, in CHES 2017 [33] (2017), 213–231. URL: https://tungchou.github.io/mcbits. Citations in this document: §1.

[31] Benoît Daireaux, Véronique Maume-Deschamps, Brigitte Vallée, The Lyapunov tortoise and the dyadic hare, in [49] (2005), 71–94. URL: http://emis.ams.org/journals/DMTCS/pdfpapers/dmAD0108.pdf. Citations in this document: §E.5, §F.2.

[32] Jean Louis Dornstetter, On the equivalence between Berlekamp's and Euclid's algorithms, IEEE Transactions on Information Theory 33 (1987), 428–431. Citations in this document: §1.

[33] Wieland Fischer, Naofumi Homma (editors), Cryptographic hardware and embedded systems – CHES 2017 – 19th international conference, Taipei, Taiwan, September 25–28, 2017, proceedings, Lecture Notes in Computer Science 10529, Springer, 2017. ISBN 978-3-319-66786-7. See [30], [39].

[34] Hayato Fujii, Diego F. Aranha, Curve25519 for the Cortex-M4 and beyond, LATINCRYPT 2017, to appear (2017). URL: https://www.lasca.ic.unicamp.br/media/publications/paper39.pdf. Citations in this document: §1.1, §12.3.


[35] Philippe Gaborit, Carlos Aguilar Melchor, Cryptographic method for communicating confidential information, patent EP2537284B1 (2010). URL: https://patents.google.com/patent/EP2537284B1/en. Citations in this document: §7.4.

[36] Mariya Georgieva, Frédéric de Portzamparc, Toward secure implementation of McEliece decryption, in COSADE 2015 [48] (2015), 141–156. URL: https://eprint.iacr.org/2015/271. Citations in this document: §1.

[37] Guillaume Hanrot, François Morain, Emmanuel Thomé (editors), Algorithmic number theory, 9th international symposium, ANTS-IX, Nancy, France, July 19–23, 2010, proceedings, Lecture Notes in Computer Science 6197, 2010. ISBN 978-3-642-14517-9. See [25].

[38] Agnes E. Heydtmann, Jørn M. Jensen, On the equivalence of the Berlekamp-Massey and the Euclidean algorithms for decoding, IEEE Transactions on Information Theory 46 (2000), 2614–2624. Citations in this document: §1.

[39] Andreas Hülsing, Joost Rijneveld, John M. Schanck, Peter Schwabe, High-speed key encapsulation from NTRU, in CHES 2017 [33] (2017), 232–252. URL: https://eprint.iacr.org/2017/667. Citations in this document: §1, §1.1, §1.1, §1.3, §7, §7, §7.2, §7.2, §7.2, §7.2, §7.2, §7.2, §7.2, §7.2, §7.3, §7.3, §7.3, §7.3, §7.3.

[40] Burton S. Kaliski, The Montgomery inverse and its applications, IEEE Transactions on Computers 44 (1995), 1064–1065. Citations in this document: §1.

[41] Donald E. Knuth, The analysis of algorithms, in [1] (1971), 269–274. Citations in this document: §1, §1.4, §1.4.

[42] Adam Langley, CECPQ2 (2018). URL: https://www.imperialviolet.org/2018/12/12/cecpq2.html. Citations in this document: §7.4.

[43] Adam Langley, Add initial HRSS support (2018). URL: https://boringssl.googlesource.com/boringssl/+/7b935937b18215294e7dbf6404742855e3349092/crypto/hrss/hrss.c. Citations in this document: §7.2.

[44] Derrick H. Lehmer, Euclid's algorithm for large numbers, American Mathematical Monthly 45 (1938), 227–233. ISSN 0002-9890. URL: http://links.jstor.org/sici?sici=0002-9890(193804)45:4<227:EAFLN>2.0.CO;2-Y. Citations in this document: §1, §1.4, §12.1.

[45] Xianhui Lu, Yamin Liu, Dingding Jia, Haiyang Xue, Jingnan He, Zhenfei Zhang, LAC: lattice-based cryptosystems (2017). URL: https://csrc.nist.gov/Projects/Post-Quantum-Cryptography/Round-1-Submissions. Citations in this document: §1.

[46] Vadim Lyubashevsky, Chris Peikert, Oded Regev, On ideal lattices and learning with errors over rings, Journal of the ACM 60 (2013), Article 43, 35 pages. URL: https://eprint.iacr.org/2012/230. Citations in this document: §7.4.

[47] Vadim Lyubashevsky, Gregor Seiler, NTTRU: truly fast NTRU using NTT (2019). URL: https://eprint.iacr.org/2019/040. Citations in this document: §7.4.

[48] Stefan Mangard, Axel Y. Poschmann (editors), Constructive side-channel analysis and secure design – 6th international workshop, COSADE 2015, Berlin, Germany, April 13–14, 2015, revised selected papers, Lecture Notes in Computer Science 9064, Springer, 2015. ISBN 978-3-319-21475-7. See [36].


[49] Conrado Martínez (editor), 2005 international conference on analysis of algorithms, papers from the conference (AofA'05) held in Barcelona, June 6–10, 2005, The Association Discrete Mathematics & Theoretical Computer Science (DMTCS), Nancy, 2005. URL: http://www.emis.ams.org/journals/DMTCS/proceedings/dmAD01ind.html. See [31].

[50] James Massey, Shift-register synthesis and BCH decoding, IEEE Transactions on Information Theory 15 (1969), 122–127. ISSN 0018-9448. Citations in this document: §1.

[51] Todd D. Mateer, On the equivalence of the Berlekamp-Massey and the euclidean algorithms for algebraic decoding, in CWIT 2011 [2] (2011), 139–142. Citations in this document: §1.

[52] Niels Möller, On Schönhage's algorithm and subquadratic integer GCD computation, Mathematics of Computation 77 (2008), 589–607. URL: https://www.lysator.liu.se/~nisse/archive/S0025-5718-07-02017-0.pdf. Citations in this document: §1.4.

[53] Robert T. Moenck, Fast computation of GCDs, in STOC 1973 [5] (1973), 142–151. Citations in this document: §1.4, §1.4, §1.4, §1.4.

[54] Kaushik Nath, Palash Sarkar, Efficient inversion in (pseudo-)Mersenne prime order fields (2018). URL: https://eprint.iacr.org/2018/985. Citations in this document: §1.1, §12, §12.

[55] Thomas Peyrin, Steven D. Galbraith (editors), Advances in cryptology – ASIACRYPT 2018 – 24th international conference on the theory and application of cryptology and information security, Brisbane, QLD, Australia, December 2–6, 2018, proceedings, part I, Lecture Notes in Computer Science 11272, Springer, 2018. ISBN 978-3-030-03325-5. See [28].

[56] Emmanuel Prouff, Patrick Schaumont (editors), Cryptographic hardware and embedded systems – CHES 2012 – 14th international workshop, Leuven, Belgium, September 9–12, 2012, proceedings, Lecture Notes in Computer Science 7428, Springer, 2012. ISBN 978-3-642-33026-1. See [17].

[57] Martin Roetteler, Michael Naehrig, Krysta M. Svore, Kristin E. Lauter, Quantum resource estimates for computing elliptic curve discrete logarithms, in ASIACRYPT 2017 [67] (2017), 241–270. URL: https://eprint.iacr.org/2017/598. Citations in this document: §1.1, §1.3.

[58] The Sage Developers (editor), SageMath, the Sage Mathematics Software System (Version 8.0), 2017. URL: https://www.sagemath.org. Citations in this document: §1.2, §1.2, §2.2, §5.

[59] John Schanck, Announcement of NTRU-HRSS-KEM and NTRUEncrypt merger (2018). URL: https://groups.google.com/a/list.nist.gov/d/msg/pqc-forum/SrFO_vK3xbI/mSmjY0HZCgAJ. Citations in this document: §7.3.

[60] Arnold Schönhage, Schnelle Berechnung von Kettenbruchentwicklungen, Acta Informatica 1 (1971), 139–144. ISSN 0001-5903. Citations in this document: §1, §1.4, §1.4.

[61] Joseph H. Silverman, Almost inverses and fast NTRU key creation, NTRU Cryptosystems (Technical Note #014) (1999). URL: https://assets.securityinnovation.com/static/downloads/NTRU/resources/NTRUTech014.pdf. Citations in this document: §7.3, §7.3.


[62] Damien Stehlé, Paul Zimmermann, A binary recursive gcd algorithm, in ANTS 2004 [26] (2004), 411–425. URL: https://perso.ens-lyon.fr/damien.stehle/BINARY.html. Citations in this document: §1.4, §1.4, §8.5, §E, §E.6, §E.6, §E.6, §F.2, §F.2, §H, §H.

[63] Josef Stein, Computational problems associated with Racah algebra, Journal of Computational Physics 1 (1967), 397–405. Citations in this document: §1.

[64] Simon Stevin, L'arithmétique, Imprimerie de Christophle Plantin, 1585. URL: http://www.dwc.knaw.nl/pub/bronnen/Simon_Stevin-[II_B]_The_Principal_Works_of_Simon_Stevin_Mathematics.pdf. Citations in this document: §1.

[65] Volker Strassen, The computational complexity of continued fractions, in SYMSAC 1981 [69] (1981), 51–67; see also newer version [66].

[66] Volker Strassen, The computational complexity of continued fractions, SIAM Journal on Computing 12 (1983), 1–27; see also older version [65]. ISSN 0097-5397. Citations in this document: §1.4, §1.4, §5.3, §5.5.

[67] Tsuyoshi Takagi, Thomas Peyrin (editors), Advances in cryptology – ASIACRYPT 2017 – 23rd international conference on the theory and applications of cryptology and information security, Hong Kong, China, December 3–7, 2017, proceedings, part II, Lecture Notes in Computer Science 10625, Springer, 2017. ISBN 978-3-319-70696-2. See [57].

[68] Emmanuel Thomé, Subquadratic computation of vector generating polynomials and improvement of the block Wiedemann algorithm, Journal of Symbolic Computation 33 (2002), 757–775. URL: https://hal.inria.fr/inria-00103417/document. Citations in this document: §1.4.

[69] Paul S. Wang (editor), SYM-SAC '81: proceedings of the 1981 ACM Symposium on Symbolic and Algebraic Computation, Snowbird, Utah, August 5–7, 1981, Association for Computing Machinery, New York, 1981. ISBN 0-89791-047-8. See [65].

[70] Moti Yung, Yevgeniy Dodis, Aggelos Kiayias, Tal Malkin (editors), Public key cryptography – 9th international conference on theory and practice in public-key cryptography, New York, NY, USA, April 24–26, 2006, proceedings, Lecture Notes in Computer Science 3958, Springer, 2006. ISBN 978-3-540-33851-2. See [10].

A Proof of main gcd theorem for polynomials

We give two separate proofs of the facts stated earlier relating gcd and modular reciprocal to a particular iterate of divstep in the polynomial case.

• The second proof consists of a review of the Euclid–Stevin gcd algorithm (Appendix B); a description of exactly how the intermediate results in the Euclid–Stevin algorithm relate to iterates of divstep (Appendix C); and finally a translation of gcd/reciprocal results from the Euclid–Stevin context to the divstep context (Appendix D).

• The first proof, given in this appendix, is shorter: it works directly with divstep, without a detour through the Euclid–Stevin labeling of intermediate results.


Theorem A.1. Let k be a field. Let d be a positive integer. Let R_0, R_1 be elements of the polynomial ring k[x] with deg R_0 = d > deg R_1. Define f = x^d R_0(1/x), g = x^{d−1} R_1(1/x), and (δ_n, f_n, g_n) = divstep^n(1, f, g) for n ≥ 0. Then

  2 deg f_n ≤ 2d − 1 − n + δ_n  for n ≥ 0,
  2 deg g_n ≤ 2d − 1 − n − δ_n  for n ≥ 0.

Proof. Induct on n. If n = 0 then (δ_n, f_n, g_n) = (1, f, g), so 2 deg f_n = 2 deg f ≤ 2d = 2d − 1 − n + δ_n and 2 deg g_n = 2 deg g ≤ 2(d − 1) = 2d − 1 − n − δ_n.

Assume from now on that n ≥ 1. By the inductive hypothesis, 2 deg f_{n−1} ≤ 2d − n + δ_{n−1} and 2 deg g_{n−1} ≤ 2d − n − δ_{n−1}.

Case 1: δ_{n−1} ≤ 0. Then

  (δ_n, f_n, g_n) = (1 + δ_{n−1}, f_{n−1}, (f_{n−1}(0)g_{n−1} − g_{n−1}(0)f_{n−1})/x).

Both f_{n−1} and g_{n−1} have degree at most (2d − n − δ_{n−1})/2, so the polynomial x·g_n = f_{n−1}(0)g_{n−1} − g_{n−1}(0)f_{n−1} also has degree at most (2d − n − δ_{n−1})/2, so 2 deg g_n ≤ 2d − n − δ_{n−1} − 2 = 2d − 1 − n − δ_n. Also 2 deg f_n = 2 deg f_{n−1} ≤ 2d − 1 − n + δ_n.

Case 2: δ_{n−1} > 0 and g_{n−1}(0) = 0. Then

  (δ_n, f_n, g_n) = (1 + δ_{n−1}, f_{n−1}, f_{n−1}(0)g_{n−1}/x),

so 2 deg g_n = 2 deg g_{n−1} − 2 ≤ 2d − n − δ_{n−1} − 2 = 2d − 1 − n − δ_n. As before, 2 deg f_n = 2 deg f_{n−1} ≤ 2d − 1 − n + δ_n.

Case 3: δ_{n−1} > 0 and g_{n−1}(0) ≠ 0. Then

  (δ_n, f_n, g_n) = (1 − δ_{n−1}, g_{n−1}, (g_{n−1}(0)f_{n−1} − f_{n−1}(0)g_{n−1})/x).

This time the polynomial x·g_n = g_{n−1}(0)f_{n−1} − f_{n−1}(0)g_{n−1} also has degree at most (2d − n + δ_{n−1})/2, so 2 deg g_n ≤ 2d − n + δ_{n−1} − 2 = 2d − 1 − n − δ_n. Also 2 deg f_n = 2 deg g_{n−1} ≤ 2d − n − δ_{n−1} = 2d − 1 − n + δ_n.

Theorem A.2. In the situation of Theorem A.1, define T_n = T(δ_n, f_n, g_n) and

  ( u_n  v_n )
  ( q_n  r_n ) = T_{n−1} ··· T_0.

Then x^n(u_n r_n − v_n q_n) ∈ k* for n ≥ 0; x^{n−1}u_n ∈ k[x] for n ≥ 1; x^{n−1}v_n, x^n q_n, x^n r_n ∈ k[x] for n ≥ 0; and

  2 deg(x^{n−1}u_n) ≤ n − 3 + δ_n  for n ≥ 1,
  2 deg(x^{n−1}v_n) ≤ n − 1 + δ_n  for n ≥ 0,
  2 deg(x^n q_n) ≤ n − 1 − δ_n  for n ≥ 0,
  2 deg(x^n r_n) ≤ n + 1 − δ_n  for n ≥ 0.

Proof. Each matrix T_n has determinant in (1/x)k* by construction, so the product of n such matrices has determinant in (1/x^n)k*. In particular u_n r_n − v_n q_n ∈ (1/x^n)k*.

Induct on n. For n = 0 we have δ_n = 1, u_n = 1, v_n = 0, q_n = 0, r_n = 1, so x^{n−1}v_n = 0 ∈ k[x], x^n q_n = 0 ∈ k[x], x^n r_n = 1 ∈ k[x], 2 deg(x^{n−1}v_n) = −∞ ≤ n − 1 + δ_n, 2 deg(x^n q_n) = −∞ ≤ n − 1 − δ_n, and 2 deg(x^n r_n) = 0 ≤ n + 1 − δ_n.

Assume from now on that n ≥ 1. By Theorem 4.3, u_n, v_n ∈ (1/x^{n−1})k[x] and q_n, r_n ∈ (1/x^n)k[x], so all that remains is to prove the degree bounds.

Case 1: δ_{n−1} ≤ 0. Then δ_n = 1 + δ_{n−1}, u_n = u_{n−1}, v_n = v_{n−1}, q_n = (f_{n−1}(0)q_{n−1} − g_{n−1}(0)u_{n−1})/x, and r_n = (f_{n−1}(0)r_{n−1} − g_{n−1}(0)v_{n−1})/x.


Note that n > 1 since δ_0 > 0. By the inductive hypothesis 2 deg(x^{n−2}u_{n−1}) ≤ n − 4 + δ_{n−1}, so 2 deg(x^{n−1}u_n) ≤ n − 2 + δ_{n−1} = n − 3 + δ_n. Also 2 deg(x^{n−2}v_{n−1}) ≤ n − 2 + δ_{n−1}, so 2 deg(x^{n−1}v_n) ≤ n + δ_{n−1} = n − 1 + δ_n.

For q_n, note that both x^{n−1}q_{n−1} and x^{n−1}u_{n−1} have degree at most (n − 2 − δ_{n−1})/2, so the same is true for their k-linear combination x^n q_n. Hence 2 deg(x^n q_n) ≤ n − 2 − δ_{n−1} = n − 1 − δ_n. Similarly 2 deg(x^n r_n) ≤ n − δ_{n−1} = n + 1 − δ_n.

Case 2: δ_{n−1} > 0 and g_{n−1}(0) = 0. Then δ_n = 1 + δ_{n−1}, u_n = u_{n−1}, v_n = v_{n−1}, q_n = f_{n−1}(0)q_{n−1}/x, and r_n = f_{n−1}(0)r_{n−1}/x.

If n = 1 then u_n = u_0 = 1, so 2 deg(x^{n−1}u_n) = 0 = n − 3 + δ_n. Otherwise 2 deg(x^{n−2}u_{n−1}) ≤ n − 4 + δ_{n−1} by the inductive hypothesis, so 2 deg(x^{n−1}u_n) ≤ n − 3 + δ_n as above.

Also 2 deg(x^{n−1}v_n) ≤ n − 1 + δ_n as above; 2 deg(x^n q_n) = 2 deg(x^{n−1}q_{n−1}) ≤ n − 2 − δ_{n−1} = n − 1 − δ_n; and 2 deg(x^n r_n) = 2 deg(x^{n−1}r_{n−1}) ≤ n − δ_{n−1} = n + 1 − δ_n.

Case 3: δ_{n−1} > 0 and g_{n−1}(0) ≠ 0. Then δ_n = 1 − δ_{n−1}, u_n = q_{n−1}, v_n = r_{n−1}, q_n = (g_{n−1}(0)u_{n−1} − f_{n−1}(0)q_{n−1})/x, and r_n = (g_{n−1}(0)v_{n−1} − f_{n−1}(0)r_{n−1})/x.

If n = 1 then u_n = q_0 = 0, so 2 deg(x^{n−1}u_n) = −∞ ≤ n − 3 + δ_n. Otherwise 2 deg(x^{n−1}q_{n−1}) ≤ n − 2 − δ_{n−1} by the inductive hypothesis, so 2 deg(x^{n−1}u_n) ≤ n − 2 − δ_{n−1} = n − 3 + δ_n. Similarly 2 deg(x^{n−1}r_{n−1}) ≤ n − δ_{n−1} by the inductive hypothesis, so 2 deg(x^{n−1}v_n) ≤ n − δ_{n−1} = n − 1 + δ_n.

Finally, for q_n, note that both x^{n−1}u_{n−1} and x^{n−1}q_{n−1} have degree at most (n − 2 + δ_{n−1})/2, so the same is true for their k-linear combination x^n q_n, so 2 deg(x^n q_n) ≤ n − 2 + δ_{n−1} = n − 1 − δ_n. Similarly 2 deg(x^n r_n) ≤ n + δ_{n−1} = n + 1 − δ_n.

First proof of Theorem 6.2. Write φ = f_{2d−1}. By Theorem A.1, 2 deg φ ≤ δ_{2d−1} and 2 deg g_{2d−1} ≤ −δ_{2d−1}. Each f_n is nonzero by construction, so deg φ ≥ 0, so δ_{2d−1} ≥ 0.

By Theorem A.2, x^{2d−2}u_{2d−1} is a polynomial of degree at most d − 2 + δ_{2d−1}/2. Define C_0 = x^{−d+δ_{2d−1}/2} u_{2d−1}(1/x); then C_0 is a polynomial of degree at most d − 2 + δ_{2d−1}/2. Similarly define C_1 = x^{−d+1+δ_{2d−1}/2} v_{2d−1}(1/x), and observe that C_1 is a polynomial of degree at most d − 1 + δ_{2d−1}/2.

Multiply C_0 by R_0 = x^d f(1/x), multiply C_1 by R_1 = x^{d−1} g(1/x), and add, to obtain

  C_0 R_0 + C_1 R_1 = x^{δ_{2d−1}/2}(u_{2d−1}(1/x)f(1/x) + v_{2d−1}(1/x)g(1/x)) = x^{δ_{2d−1}/2} φ(1/x)

by Theorem 4.2.

Define Γ as the monic polynomial x^{δ_{2d−1}/2} φ(1/x)/φ(0) of degree δ_{2d−1}/2. We have just shown that Γ ∈ R_0 k[x] + R_1 k[x]: specifically, Γ = (C_0/φ(0))R_0 + (C_1/φ(0))R_1. To finish the proof we will show that R_0 ∈ Γ k[x]. This forces Γ = gcd{R_0, R_1} = G, which in turn implies deg G = δ_{2d−1}/2 and V = C_1/φ(0).

Observe that δ_{2d} = 1 + δ_{2d−1} and f_{2d} = φ: the "swapped" case is impossible here, since δ_{2d−1} > 0 implies g_{2d−1} = 0. By Theorem A.1 again, 2 deg g_{2d} ≤ −1 − δ_{2d} = −2 − δ_{2d−1}, so g_{2d} = 0.

By Theorem 4.2, φ = f_{2d} = u_{2d}f + v_{2d}g and 0 = g_{2d} = q_{2d}f + r_{2d}g, so x^{2d}r_{2d}φ = ∆f where ∆ = x^{2d}(u_{2d}r_{2d} − v_{2d}q_{2d}). By Theorem A.2, ∆ ∈ k*, and p = x^{2d}r_{2d} is a polynomial of degree at most (2d + 1 − δ_{2d})/2 = d − δ_{2d−1}/2. The polynomials x^{d−δ_{2d−1}/2} p(1/x) and x^{δ_{2d−1}/2} φ(1/x) = φ(0)Γ have product ∆ x^d f(1/x) = ∆R_0, so Γ divides R_0 as claimed.

B Review of the Euclid–Stevin algorithm

This appendix reviews the Euclid–Stevin algorithm to compute polynomial gcd, including the extended version of the algorithm that writes each remainder as a linear combination of the two inputs.


Theorem B.1 (the Euclid–Stevin algorithm). Let k be a field. Let R_0, R_1 be elements of the polynomial ring k[x]. Recursively define a finite sequence of polynomials R_2, R_3, ..., R_r ∈ k[x] as follows: if i ≥ 1 and R_i = 0 then r = i; if i ≥ 1 and R_i ≠ 0 then R_{i+1} = R_{i−1} mod R_i. Define d_i = deg R_i and ℓ_i = R_i[d_i]. Then

• d_1 > d_2 > ··· > d_r = −∞;

• ℓ_i ≠ 0 if 1 ≤ i ≤ r − 1;

• if ℓ_{r−1} ≠ 0 then gcd{R_0, R_1} = R_{r−1}/ℓ_{r−1}; and

• if ℓ_{r−1} = 0 then R_0 = R_1 = gcd{R_0, R_1} = 0 and r = 1.

Proof. If 1 ≤ i ≤ r − 1 then R_i ≠ 0, so ℓ_i = R_i[deg R_i] ≠ 0. Also R_{i+1} = R_{i−1} mod R_i, so deg R_{i+1} < deg R_i, i.e., d_{i+1} < d_i. Hence d_1 > d_2 > ··· > d_r, and d_r = deg R_r = deg 0 = −∞.

If ℓ_{r−1} = 0 then r must be 1, so R_1 = 0 (since R_r = 0) and R_0 = 0 (since ℓ_0 = 0), so gcd{R_0, R_1} = 0. Assume from now on that ℓ_{r−1} ≠ 0. The divisibility of R_{i+1} − R_{i−1} by R_i implies gcd{R_{i−1}, R_i} = gcd{R_i, R_{i+1}}, so gcd{R_0, R_1} = gcd{R_1, R_2} = ··· = gcd{R_{r−1}, R_r} = gcd{R_{r−1}, 0} = R_{r−1}/ℓ_{r−1}.

Theorem B.2 (the extended Euclid–Stevin algorithm). In the situation of Theorem B.1, define U_0, V_0, ..., U_r, V_r ∈ k[x] as follows: U_0 = 1; V_0 = 0; U_1 = 0; V_1 = 1; U_{i+1} = U_{i−1} − ⌊R_{i−1}/R_i⌋U_i for i ∈ {1, ..., r − 1}; V_{i+1} = V_{i−1} − ⌊R_{i−1}/R_i⌋V_i for i ∈ {1, ..., r − 1}. Then

• R_i = U_i R_0 + V_i R_1 for i ∈ {0, ..., r};

• U_i V_{i+1} − U_{i+1} V_i = (−1)^i for i ∈ {0, ..., r − 1};

• V_{i+1} R_i − V_i R_{i+1} = (−1)^i R_0 for i ∈ {0, ..., r − 1}; and

• if d_0 > d_1 then deg V_{i+1} = d_0 − d_i for i ∈ {0, ..., r − 1}.

Proof. U_0 R_0 + V_0 R_1 = 1·R_0 + 0·R_1 = R_0 and U_1 R_0 + V_1 R_1 = 0·R_0 + 1·R_1 = R_1. For i ∈ {1, ..., r − 1}, assume inductively that R_{i−1} = U_{i−1}R_0 + V_{i−1}R_1 and R_i = U_i R_0 + V_i R_1, and write Q_i = ⌊R_{i−1}/R_i⌋. Then R_{i+1} = R_{i−1} mod R_i = R_{i−1} − Q_i R_i = (U_{i−1} − Q_i U_i)R_0 + (V_{i−1} − Q_i V_i)R_1 = U_{i+1}R_0 + V_{i+1}R_1.

If i = 0 then U_i V_{i+1} − U_{i+1} V_i = 1·1 − 0·0 = 1 = (−1)^i. For i ∈ {1, ..., r − 1}, assume inductively that U_{i−1}V_i − U_i V_{i−1} = (−1)^{i−1}; then U_i V_{i+1} − U_{i+1} V_i = U_i(V_{i−1} − Q_i V_i) − (U_{i−1} − Q_i U_i)V_i = −(U_{i−1}V_i − U_i V_{i−1}) = (−1)^i.

Multiply R_i = U_i R_0 + V_i R_1 by V_{i+1}, multiply R_{i+1} = U_{i+1}R_0 + V_{i+1}R_1 by V_i, and subtract, to see that V_{i+1}R_i − V_i R_{i+1} = (−1)^i R_0.

Finally, assume d_0 > d_1. If i = 0 then deg V_{i+1} = deg 1 = 0 = d_0 − d_i. For i ∈ {1, ..., r − 1}, assume inductively that deg V_i = d_0 − d_{i−1}. Then deg V_i R_{i+1} < d_0, since deg R_{i+1} = d_{i+1} < d_{i−1}, so deg V_{i+1}R_i = deg(V_i R_{i+1} + (−1)^i R_0) = d_0, so deg V_{i+1} = d_0 − d_i.

C Euclid–Stevin results as iterates of divstep

If the Euclid–Stevin algorithm is written as a sequence of polynomial divisions, and each polynomial division is written as a sequence of coefficient-elimination steps, then each coefficient-elimination step corresponds to one application of divstep. The exact correspondence is stated in Theorem C.2. This generalizes the known Berlekamp–Massey equivalence mentioned in Section 1.

Theorem C.1 is a warmup, showing how a single polynomial division relates to a sequence of division steps.


Theorem C.1 (polynomial division as a sequence of division steps). Let k be a field. Let R_0, R_1 be elements of the polynomial ring k[x]. Write d_0 = deg R_0 and d_1 = deg R_1. Assume that d_0 > d_1 ≥ 0. Then

  divstep^{2d_0−2d_1}(1, x^{d_0}R_0(1/x), x^{d_0−1}R_1(1/x))
    = (1, ℓ_0^{d_0−d_1−1} x^{d_1}R_1(1/x), (ℓ_0^{d_0−d_1−1}ℓ_1)^{d_0−d_1+1} x^{d_1−1}R_2(1/x))

where ℓ_0 = R_0[d_0], ℓ_1 = R_1[d_1], and R_2 = R_0 mod R_1.

Proof. Part 1: the first d_0 − d_1 − 1 iterations. If δ ∈ {1, 2, ..., d_0 − d_1 − 1} then d_0 − δ > d_1, so R_1[d_0 − δ] = 0, so the polynomial ℓ_0^{δ−1} x^{d_0−δ}R_1(1/x) has constant coefficient 0. The polynomial x^{d_0}R_0(1/x) has constant coefficient ℓ_0. Hence

  divstep(δ, x^{d_0}R_0(1/x), ℓ_0^{δ−1} x^{d_0−δ}R_1(1/x)) = (1 + δ, x^{d_0}R_0(1/x), ℓ_0^δ x^{d_0−δ−1}R_1(1/x))

by definition of divstep. By induction,

  divstep^{d_0−d_1−1}(1, x^{d_0}R_0(1/x), x^{d_0−1}R_1(1/x))
    = (d_0 − d_1, x^{d_0}R_0(1/x), ℓ_0^{d_0−d_1−1} x^{d_1}R_1(1/x))
    = (d_0 − d_1, x^{d_0}R_0(1/x), x^{d_1}R'_1(1/x))

where R'_1 = ℓ_0^{d_0−d_1−1}R_1. Note that R'_1[d_1] = ℓ'_1 where ℓ'_1 = ℓ_0^{d_0−d_1−1}ℓ_1.

Part 2: the swap iteration and the last d_0 − d_1 iterations. Define

  Z_{d_0−d_1} = R_0 mod x^{d_0−d_1}R'_1,
  Z_{d_0−d_1+1} = R_0 mod x^{d_0−d_1−1}R'_1,
  ...,
  Z_{2d_0−2d_1} = R_0 mod R'_1 = R_0 mod R_1 = R_2.

Then

  Z_{d_0−d_1} = R_0 − (R_0[d_0]/ℓ'_1) x^{d_0−d_1}R'_1,
  Z_{d_0−d_1+1} = Z_{d_0−d_1} − (Z_{d_0−d_1}[d_0 − 1]/ℓ'_1) x^{d_0−d_1−1}R'_1,
  ...,
  Z_{2d_0−2d_1} = Z_{2d_0−2d_1−1} − (Z_{2d_0−2d_1−1}[d_1]/ℓ'_1) R'_1.

Substitute 1/x for x to see that

  x^{d_0}Z_{d_0−d_1}(1/x) = x^{d_0}R_0(1/x) − (R_0[d_0]/ℓ'_1) x^{d_1}R'_1(1/x),
  x^{d_0−1}Z_{d_0−d_1+1}(1/x) = x^{d_0−1}Z_{d_0−d_1}(1/x) − (Z_{d_0−d_1}[d_0 − 1]/ℓ'_1) x^{d_1}R'_1(1/x),
  ...,
  x^{d_1}Z_{2d_0−2d_1}(1/x) = x^{d_1}Z_{2d_0−2d_1−1}(1/x) − (Z_{2d_0−2d_1−1}[d_1]/ℓ'_1) x^{d_1}R'_1(1/x).

We will use these equations to show that the (d_0 − d_1)th, ..., (2d_0 − 2d_1)th iterates of divstep compute ℓ'_1 x^{d_0−1}Z_{d_0−d_1}(1/x), ..., (ℓ'_1)^{d_0−d_1+1} x^{d_1−1}Z_{2d_0−2d_1}(1/x) respectively.

The constant coefficients of x^{d_0}R_0(1/x) and x^{d_1}R'_1(1/x) are R_0[d_0] and ℓ'_1 ≠ 0 respectively, and ℓ'_1 x^{d_0}R_0(1/x) − R_0[d_0] x^{d_1}R'_1(1/x) = ℓ'_1 x^{d_0}Z_{d_0−d_1}(1/x), so

  divstep(d_0 − d_1, x^{d_0}R_0(1/x), x^{d_1}R'_1(1/x))
    = (1 − (d_0 − d_1), x^{d_1}R'_1(1/x), ℓ'_1 x^{d_0−1}Z_{d_0−d_1}(1/x)).


The constant coefficients of x^{d_1}R'_1(1/x) and ℓ'_1 x^{d_0−1}Z_{d_0−d_1}(1/x) are ℓ'_1 and ℓ'_1 Z_{d_0−d_1}[d_0 − 1] respectively, and

  (ℓ'_1)^2 x^{d_0−1}Z_{d_0−d_1}(1/x) − ℓ'_1 Z_{d_0−d_1}[d_0 − 1] x^{d_1}R'_1(1/x) = (ℓ'_1)^2 x^{d_0−1}Z_{d_0−d_1+1}(1/x),

so

  divstep(1 − (d_0 − d_1), x^{d_1}R'_1(1/x), ℓ'_1 x^{d_0−1}Z_{d_0−d_1}(1/x))
    = (2 − (d_0 − d_1), x^{d_1}R'_1(1/x), (ℓ'_1)^2 x^{d_0−2}Z_{d_0−d_1+1}(1/x)).

Continue in this way to see that

  divstep^i(d_0 − d_1, x^{d_0}R_0(1/x), x^{d_1}R'_1(1/x))
    = (i − (d_0 − d_1), x^{d_1}R'_1(1/x), (ℓ'_1)^i x^{d_0−i}Z_{d_0−d_1+i−1}(1/x))

for 1 ≤ i ≤ d_0 − d_1 + 1. In particular

  divstep^{2d_0−2d_1}(1, x^{d_0}R_0(1/x), x^{d_0−1}R_1(1/x))
    = divstep^{d_0−d_1+1}(d_0 − d_1, x^{d_0}R_0(1/x), x^{d_1}R'_1(1/x))
    = (1, x^{d_1}R'_1(1/x), (ℓ'_1)^{d_0−d_1+1} x^{d_1−1}R_2(1/x))
    = (1, ℓ_0^{d_0−d_1−1} x^{d_1}R_1(1/x), (ℓ_0^{d_0−d_1−1}ℓ_1)^{d_0−d_1+1} x^{d_1−1}R_2(1/x))

as claimed.

Theorem C.2. In the situation of Theorem B.2, assume that d_0 > d_1. Define f = x^{d_0}R_0(1/x) and g = x^{d_0−1}R_1(1/x). Then f, g ∈ k[x]. Define (δ_n, f_n, g_n) = divstep^n(1, f, g) and T_n = T(δ_n, f_n, g_n).

Part A. If i ∈ {0, 1, ..., r − 2} and 2d_0 − 2d_i ≤ n < 2d_0 − d_i − d_{i+1} then

  δ_n = 1 + n − (2d_0 − 2d_i) > 0,
  f_n = a_n x^{d_i}R_i(1/x),
  g_n = b_n x^{2d_0−d_i−n−1}R_{i+1}(1/x),

  T_{n−1} ··· T_0 = ( a_n (1/x)^{d_0−d_i} U_i(1/x)            a_n (1/x)^{d_0−d_i−1} V_i(1/x)      )
                    ( b_n (1/x)^{n−(d_0−d_i)+1} U_{i+1}(1/x)   b_n (1/x)^{n−(d_0−d_i)} V_{i+1}(1/x) )

for some a_n, b_n ∈ k*.

Part B. If i ∈ {0, 1, ..., r − 2} and 2d_0 − d_i − d_{i+1} ≤ n < 2d_0 − 2d_{i+1} then

  δ_n = 1 + n − (2d_0 − 2d_{i+1}) ≤ 0,
  f_n = a_n x^{d_{i+1}}R_{i+1}(1/x),
  g_n = b_n x^{2d_0−d_{i+1}−n−1}Z_n(1/x),

  T_{n−1} ··· T_0 = ( a_n (1/x)^{d_0−d_{i+1}} U_{i+1}(1/x)        a_n (1/x)^{d_0−d_{i+1}−1} V_{i+1}(1/x) )
                    ( b_n (1/x)^{n−(d_0−d_{i+1})+1} X_n(1/x)       b_n (1/x)^{n−(d_0−d_{i+1})} Y_n(1/x)   )

for some a_n, b_n ∈ k*, where

  Z_n = R_i mod x^{2d_0−2d_{i+1}−n}R_{i+1},
  X_n = U_i − ⌊R_i / x^{2d_0−2d_{i+1}−n}R_{i+1}⌋ x^{2d_0−2d_{i+1}−n} U_{i+1},
  Y_n = V_i − ⌊R_i / x^{2d_0−2d_{i+1}−n}R_{i+1}⌋ x^{2d_0−2d_{i+1}−n} V_{i+1}.


Part C. If n ≥ 2d_0 − 2d_{r−1} then

  δ_n = 1 + n − (2d_0 − 2d_{r−1}) > 0,
  f_n = a_n x^{d_{r−1}}R_{r−1}(1/x),
  g_n = 0,

  T_{n−1} ··· T_0 = ( a_n (1/x)^{d_0−d_{r−1}} U_{r−1}(1/x)        a_n (1/x)^{d_0−d_{r−1}−1} V_{r−1}(1/x) )
                    ( b_n (1/x)^{n−(d_0−d_{r−1})+1} U_r(1/x)       b_n (1/x)^{n−(d_0−d_{r−1})} V_r(1/x)   )

for some a_n, b_n ∈ k*.

The proof also gives recursive formulas for a_n, b_n, namely (a_0, b_0) = (1, 1) for the base case; (a_n, b_n) = (b_{n−1}, g_{n−1}(0)a_{n−1}) for the swapped case; and (a_n, b_n) = (a_{n−1}, f_{n−1}(0)b_{n−1}) for the unswapped case. One can, if desired, augment divstep to track a_n and b_n: these are the denominators eliminated in a "fraction-free" computation.

Proof. By hypothesis deg R_0 = d_0 > d_1 = deg R_1. Hence d_0 is a nonnegative integer, and f = R_0[d_0] + R_0[d_0 − 1]x + ··· + R_0[0]x^{d_0} ∈ k[x]. Similarly g = R_1[d_0 − 1] + R_1[d_0 − 2]x + ··· + R_1[0]x^{d_0−1} ∈ k[x]; this is also true for d_0 = 0, with R_1 = 0 and g = 0.

We induct on n, and split the analysis of n into six cases. There are three different formulas (A, B, C) applicable to different ranges of n, and within each range we consider the first n separately. For brevity we write R̄_i = R_i(1/x), Ū_i = U_i(1/x), V̄_i = V_i(1/x), X̄_n = X_n(1/x), Ȳ_n = Y_n(1/x), Z̄_n = Z_n(1/x).

Case A1: n = 2d_0 − 2d_i for some i ∈ {0, 1, ..., r − 2}.

If n = 0 then i = 0. By assumption δ_n = 1 as claimed; f_n = f = x^{d_0}R̄_0, so f_n = a_n x^{d_0}R̄_0 as claimed where a_n = 1; g_n = g = x^{d_0−1}R̄_1, so g_n = b_n x^{d_0−1}R̄_1 as claimed where b_n = 1. Also U_i = 1, U_{i+1} = 0, V_i = 0, V_{i+1} = 1, and d_0 − d_i = 0, so

  ( a_n (1/x)^{d_0−d_i} Ū_i          a_n (1/x)^{d_0−d_i−1} V̄_i   )   ( 1  0 )
  ( b_n (1/x)^{d_0−d_i+1} Ū_{i+1}    b_n (1/x)^{d_0−d_i} V̄_{i+1} ) = ( 0  1 ) = T_{n−1} ··· T_0

as claimed.

as claimedOtherwise n gt 0 so i gt 0 Now 2d0minusdiminus1minusdi le nminus 1 lt 2d0minus 2di so by the inductive

hypothesis δnminus1 = 1 + (nminus 1)minus (2d0 minus 2di) = 0 fnminus1 = anminus1xdiRi for some anminus1 isin klowast

gnminus1 = bnminus1xdiZnminus1 for some bnminus1 isin klowast where Znminus1 = Riminus1 mod xRi and

Tnminus2 middot middot middot T0 =(

anminus1(1x)d0minusdiU i anminus1(1x)d0minusdiminus1V ibnminus1(1x)d0minusdiXnminus1 bnminus1(1x)d0minusdiminus1Y nminus1

)where Xn = Uiminus1 minus bRiminus1xRicxUi and Yn = Viminus1 minus bRiminus1xRicxVi

Note that degZnminus1 le di Observe that

Ri+1 = Riminus1 mod Ri = Znminus1 minus (Znminus1[di]`i)RiUi+1 = Xnminus1 minus (Znminus1[di]`i)UiVi+1 = Ynminus1 minus (Znminus1[di]`i)Vi

The point is that subtracting (Znminus1[di]`i)Ri from Znminus1 eliminates the coefficient of xdi

and produces a polynomial of degree at most diminus 1 namely Znminus1 mod Ri = Riminus1 mod RiStarting from Ri+1 = Znminus1 minus (Znminus1[di]`i)Ri substitute 1x for x and multiply by

anminus1bnminus1`ixdi to see that

anminus1bnminus1`ixdiRi+1 = anminus1`ignminus1 minus bnminus1Znminus1[di]fnminus1

= fnminus1(0)gnminus1 minus gnminus1(0)fnminus1


Now

  (δ_n, f_n, g_n) = divstep(δ_{n−1}, f_{n−1}, g_{n−1})
    = (1 + δ_{n−1}, f_{n−1}, (f_{n−1}(0)g_{n−1} − g_{n−1}(0)f_{n−1})/x).

Hence δ_n = 1 = 1 + n − (2d_0 − 2d_i) > 0 as claimed; f_n = a_n x^{d_i}R̄_i where a_n = a_{n−1} ∈ k* as claimed; and g_n = b_n x^{d_i−1}R̄_{i+1} where b_n = a_{n−1}b_{n−1}ℓ_i ∈ k* as claimed.

Finally, Ū_{i+1} = X̄_{n−1} − (Z_{n−1}[d_i]/ℓ_i)Ū_i and V̄_{i+1} = Ȳ_{n−1} − (Z_{n−1}[d_i]/ℓ_i)V̄_i. Substitute these into the matrix

  ( a_n (1/x)^{d_0−d_i} Ū_i          a_n (1/x)^{d_0−d_i−1} V̄_i   )
  ( b_n (1/x)^{d_0−d_i+1} Ū_{i+1}    b_n (1/x)^{d_0−d_i} V̄_{i+1} )

along with a_n = a_{n−1} and b_n = a_{n−1}b_{n−1}ℓ_i to see that this matrix is

  T_{n−1} = (            1                    0         )
            ( −b_{n−1}Z_{n−1}[d_i]/x    a_{n−1}ℓ_i/x )

times T_{n−2} ··· T_0 shown above.

Case A2: 2d_0 − 2d_i < n < 2d_0 − d_i − d_{i+1} for some i ∈ {0, 1, ..., r − 2}.

By the inductive hypothesis δ_{n−1} = n − (2d_0 − 2d_i) > 0; f_{n−1} = a_{n−1} x^{d_i}R̄_i where a_{n−1} ∈ k*; g_{n−1} = b_{n−1} x^{2d_0−d_i−n}R̄_{i+1} where b_{n−1} ∈ k*; and

  T_{n−2} ··· T_0 = ( a_{n−1} (1/x)^{d_0−d_i} Ū_i                a_{n−1} (1/x)^{d_0−d_i−1} V̄_i        )
                    ( b_{n−1} (1/x)^{n−(d_0−d_i)} Ū_{i+1}         b_{n−1} (1/x)^{n−(d_0−d_i)−1} V̄_{i+1} ).

Note that g_{n−1}(0) = b_{n−1}R_{i+1}[2d_0 − d_i − n] = 0, since 2d_0 − d_i − n > d_{i+1} = deg R_{i+1}. Thus

  (δ_n, f_n, g_n) = divstep(δ_{n−1}, f_{n−1}, g_{n−1}) = (1 + δ_{n−1}, f_{n−1}, f_{n−1}(0)g_{n−1}/x).

Hence δ_n = 1 + n − (2d_0 − 2d_i) > 0 as claimed; f_n = a_n x^{d_i}R̄_i where a_n = a_{n−1} ∈ k* as claimed; and g_n = b_n x^{2d_0−d_i−n−1}R̄_{i+1} where b_n = f_{n−1}(0)b_{n−1} = a_{n−1}b_{n−1}ℓ_i ∈ k* as claimed.

Also T_{n−1} = ( 1, 0 ; 0, a_{n−1}ℓ_i/x ). Multiply by T_{n−2} ··· T_0 above to see that

  T_{n−1} ··· T_0 = ( a_n (1/x)^{d_0−d_i} Ū_i                a_n (1/x)^{d_0−d_i−1} V̄_i        )
                    ( b_n (1/x)^{n−(d_0−d_i)+1} Ū_{i+1}       b_n (1/x)^{n−(d_0−d_i)} V̄_{i+1}   )

as claimed.

Case B1: n = 2d_0 − d_i − d_{i+1} for some i ∈ {0, 1, ..., r − 2}.

Note that 2d_0 − 2d_i ≤ n − 1 < 2d_0 − d_i − d_{i+1}. By the inductive hypothesis δ_{n−1} = d_i − d_{i+1} > 0; f_{n−1} = a_{n−1} x^{d_i}R̄_i where a_{n−1} ∈ k*; g_{n−1} = b_{n−1} x^{d_{i+1}}R̄_{i+1} where b_{n−1} ∈ k*; and

  T_{n−2} ··· T_0 = ( a_{n−1} (1/x)^{d_0−d_i} Ū_i             a_{n−1} (1/x)^{d_0−d_i−1} V̄_i        )
                    ( b_{n−1} (1/x)^{d_0−d_{i+1}} Ū_{i+1}      b_{n−1} (1/x)^{d_0−d_{i+1}−1} V̄_{i+1} ).

Observe that

  Z_n = R_i − (ℓ_i/ℓ_{i+1}) x^{d_i−d_{i+1}} R_{i+1},
  X_n = U_i − (ℓ_i/ℓ_{i+1}) x^{d_i−d_{i+1}} U_{i+1},
  Y_n = V_i − (ℓ_i/ℓ_{i+1}) x^{d_i−d_{i+1}} V_{i+1}.

Substitute 1/x for x in the Z_n equation and multiply by a_{n−1}b_{n−1}ℓ_{i+1} x^{d_i} to see that

  a_{n−1}b_{n−1}ℓ_{i+1} x^{d_i}Z̄_n = b_{n−1}ℓ_{i+1} f_{n−1} − a_{n−1}ℓ_i g_{n−1}
    = g_{n−1}(0)f_{n−1} − f_{n−1}(0)g_{n−1}.


Also note that g_{n−1}(0) = b_{n−1}ℓ_{i+1} ≠ 0, so

  (δ_n, f_n, g_n) = divstep(δ_{n−1}, f_{n−1}, g_{n−1})
    = (1 − δ_{n−1}, g_{n−1}, (g_{n−1}(0)f_{n−1} − f_{n−1}(0)g_{n−1})/x).

Hence δ_n = 1 − (d_i − d_{i+1}) ≤ 0 as claimed; f_n = a_n x^{d_{i+1}}R̄_{i+1} where a_n = b_{n−1} ∈ k* as claimed; and g_n = b_n x^{d_i−1}Z̄_n where b_n = a_{n−1}b_{n−1}ℓ_{i+1} ∈ k* as claimed.

Finally, substitute X̄_n = Ū_i − (ℓ_i/ℓ_{i+1})(1/x)^{d_i−d_{i+1}}Ū_{i+1} and Ȳ_n = V̄_i − (ℓ_i/ℓ_{i+1})(1/x)^{d_i−d_{i+1}}V̄_{i+1} into the matrix

  ( a_n (1/x)^{d_0−d_{i+1}} Ū_{i+1}     a_n (1/x)^{d_0−d_{i+1}−1} V̄_{i+1} )
  ( b_n (1/x)^{d_0−d_i+1} X̄_n           b_n (1/x)^{d_0−d_i} Ȳ_n           )

along with a_n = b_{n−1} and b_n = a_{n−1}b_{n−1}ℓ_{i+1} to see that this matrix is

  T_{n−1} = (        0                 1          )
            ( b_{n−1}ℓ_{i+1}/x   −a_{n−1}ℓ_i/x )

times T_{n−2} ··· T_0 shown above.

Case B2: 2d_0 − d_i − d_{i+1} < n < 2d_0 − 2d_{i+1} for some i ∈ {0, 1, ..., r − 2}.

Now 2d_0 − d_i − d_{i+1} ≤ n − 1 < 2d_0 − 2d_{i+1}. By the inductive hypothesis δ_{n−1} = n − (2d_0 − 2d_{i+1}) ≤ 0; f_{n−1} = a_{n−1} x^{d_{i+1}}R̄_{i+1} for some a_{n−1} ∈ k*; g_{n−1} = b_{n−1} x^{2d_0−d_{i+1}−n}Z̄_{n−1} for some b_{n−1} ∈ k*, where Z_{n−1} = R_i mod x^{2d_0−2d_{i+1}−n+1}R_{i+1}; and

  T_{n−2} ··· T_0 = ( a_{n−1} (1/x)^{d_0−d_{i+1}} Ū_{i+1}          a_{n−1} (1/x)^{d_0−d_{i+1}−1} V̄_{i+1} )
                    ( b_{n−1} (1/x)^{n−(d_0−d_{i+1})} X̄_{n−1}       b_{n−1} (1/x)^{n−(d_0−d_{i+1})−1} Ȳ_{n−1} )

where X_{n−1} = U_i − ⌊R_i/x^{2d_0−2d_{i+1}−n+1}R_{i+1}⌋ x^{2d_0−2d_{i+1}−n+1} U_{i+1} and Y_{n−1} = V_i − ⌊R_i/x^{2d_0−2d_{i+1}−n+1}R_{i+1}⌋ x^{2d_0−2d_{i+1}−n+1} V_{i+1}.

Observe that

  Z_n = R_i mod x^{2d_0−2d_{i+1}−n}R_{i+1}
      = Z_{n−1} − (Z_{n−1}[2d_0 − d_{i+1} − n]/ℓ_{i+1}) x^{2d_0−2d_{i+1}−n} R_{i+1},
  X_n = X_{n−1} − (Z_{n−1}[2d_0 − d_{i+1} − n]/ℓ_{i+1}) x^{2d_0−2d_{i+1}−n} U_{i+1},
  Y_n = Y_{n−1} − (Z_{n−1}[2d_0 − d_{i+1} − n]/ℓ_{i+1}) x^{2d_0−2d_{i+1}−n} V_{i+1}.

The point is that deg Z_{n−1} ≤ 2d_0 − d_{i+1} − n, and subtracting

  (Z_{n−1}[2d_0 − d_{i+1} − n]/ℓ_{i+1}) x^{2d_0−2d_{i+1}−n} R_{i+1}

from Z_{n−1} eliminates the coefficient of x^{2d_0−d_{i+1}−n}.

Starting from

  Z_n = Z_{n−1} − (Z_{n−1}[2d_0 − d_{i+1} − n]/ℓ_{i+1}) x^{2d_0−2d_{i+1}−n} R_{i+1},

substitute 1/x for x and multiply by a_{n−1}b_{n−1}ℓ_{i+1} x^{2d_0−d_{i+1}−n} to see that

  a_{n−1}b_{n−1}ℓ_{i+1} x^{2d_0−d_{i+1}−n}Z̄_n = a_{n−1}ℓ_{i+1} g_{n−1} − b_{n−1}Z_{n−1}[2d_0 − d_{i+1} − n] f_{n−1}
    = f_{n−1}(0)g_{n−1} − g_{n−1}(0)f_{n−1}.

Now

  (δ_n, f_n, g_n) = divstep(δ_{n−1}, f_{n−1}, g_{n−1})
    = (1 + δ_{n−1}, f_{n−1}, (f_{n−1}(0)g_{n−1} − g_{n−1}(0)f_{n−1})/x).

Hence δ_n = 1 + n − (2d_0 − 2d_{i+1}) ≤ 0 as claimed; f_n = a_n x^{d_{i+1}}R̄_{i+1} where a_n = a_{n−1} ∈ k* as claimed; and g_n = b_n x^{2d_0−d_{i+1}−n−1}Z̄_n where b_n = a_{n−1}b_{n−1}ℓ_{i+1} ∈ k* as claimed.


Finally, substitute

  X̄_n = X̄_{n−1} − (Z_{n−1}[2d_0 − d_{i+1} − n]/ℓ_{i+1})(1/x)^{2d_0−2d_{i+1}−n}Ū_{i+1}

and

  Ȳ_n = Ȳ_{n−1} − (Z_{n−1}[2d_0 − d_{i+1} − n]/ℓ_{i+1})(1/x)^{2d_0−2d_{i+1}−n}V̄_{i+1}

into the matrix

  ( a_n (1/x)^{d_0−d_{i+1}} Ū_{i+1}            a_n (1/x)^{d_0−d_{i+1}−1} V̄_{i+1} )
  ( b_n (1/x)^{n−(d_0−d_{i+1})+1} X̄_n           b_n (1/x)^{n−(d_0−d_{i+1})} Ȳ_n   )

along with a_n = a_{n−1} and b_n = a_{n−1}b_{n−1}ℓ_{i+1} to see that this matrix is

  T_{n−1} = (                 1                                0            )
            ( −b_{n−1}Z_{n−1}[2d_0 − d_{i+1} − n]/x    a_{n−1}ℓ_{i+1}/x )

times T_{n−2} ··· T_0 shown above.

Case C1: n = 2d_0 − 2d_{r−1}. This is essentially the same as Case A1, except that i is replaced with r − 1, and g_n ends up as 0. For completeness we spell out the details.

If n = 0 then r = 1 and R_1 = 0, so g = 0. By assumption δ_n = 1 as claimed; f_n = f = x^{d_0}R̄_0, so f_n = a_n x^{d_0}R̄_0 as claimed where a_n = 1; g_n = g = 0 as claimed. Define b_n = 1. Now U_{r−1} = 1, U_r = 0, V_{r−1} = 0, V_r = 1, and d_0 − d_{r−1} = 0, so

  ( a_n (1/x)^{d_0−d_{r−1}} Ū_{r−1}        a_n (1/x)^{d_0−d_{r−1}−1} V̄_{r−1} )   ( 1  0 )
  ( b_n (1/x)^{d_0−d_{r−1}+1} Ū_r          b_n (1/x)^{d_0−d_{r−1}} V̄_r       ) = ( 0  1 ) = T_{n−1} ··· T_0

as claimed.

as claimedOtherwise n gt 0 so r ge 2 Now 2d0 minus drminus2 minus drminus1 le n minus 1 lt 2d0 minus 2drminus1 so

by the inductive hypothesis (for i = r minus 2) δnminus1 = 1 + (n minus 1) minus (2d0 minus 2drminus1) = 0fnminus1 = anminus1x

drminus1Rrminus1 for some anminus1 isin klowast gnminus1 = bnminus1xdrminus1Znminus1 for some bnminus1 isin klowast

where Znminus1 = Rrminus2 mod xRrminus1 and

Tnminus2 middot middot middot T0 =(anminus1(1x)d0minusdrminus1Urminus1 anminus1(1x)d0minusdrminus1minus1V rminus1bnminus1(1x)d0minusdrminus1Xnminus1 bnminus1(1x)d0minusdrminus1minus1Y nminus1

)where Xn = Urminus2 minus bRrminus2xRrminus1cxUrminus1 and Yn = Vrminus2 minus bRrminus2xRrminus1cxVrminus1

Observe that

  0 = R_r = R_{r−2} mod R_{r−1} = Z_{n−1} − (Z_{n−1}[d_{r−1}]/ℓ_{r−1})R_{r−1},
  U_r = X_{n−1} − (Z_{n−1}[d_{r−1}]/ℓ_{r−1})U_{r−1},
  V_r = Y_{n−1} − (Z_{n−1}[d_{r−1}]/ℓ_{r−1})V_{r−1}.

Starting from Z_{n−1} − (Z_{n−1}[d_{r−1}]/ℓ_{r−1})R_{r−1} = 0, substitute 1/x for x and multiply by a_{n−1}b_{n−1}ℓ_{r−1} x^{d_{r−1}} to see that

  a_{n−1}ℓ_{r−1} g_{n−1} − b_{n−1}Z_{n−1}[d_{r−1}] f_{n−1} = 0,

i.e., f_{n−1}(0)g_{n−1} − g_{n−1}(0)f_{n−1} = 0. Now

  (δ_n, f_n, g_n) = divstep(δ_{n−1}, f_{n−1}, g_{n−1}) = (1 + δ_{n−1}, f_{n−1}, 0).

Hence δ_n = 1 = 1 + n − (2d_0 − 2d_{r−1}) > 0 as claimed; f_n = a_n x^{d_{r−1}}R̄_{r−1} where a_n = a_{n−1} ∈ k* as claimed; and g_n = 0 as claimed.

Define b_n = a_{n−1}b_{n−1}ℓ_{r−1} ∈ k*, and substitute Ū_r = X̄_{n−1} − (Z_{n−1}[d_{r−1}]/ℓ_{r−1})Ū_{r−1} and V̄_r = Ȳ_{n−1} − (Z_{n−1}[d_{r−1}]/ℓ_{r−1})V̄_{r−1} into the matrix

  ( a_n (1/x)^{d_0−d_{r−1}} Ū_{r−1}        a_n (1/x)^{d_0−d_{r−1}−1} V̄_{r−1} )
  ( b_n (1/x)^{d_0−d_{r−1}+1} Ū_r          b_n (1/x)^{d_0−d_{r−1}} V̄_r       )


to see that this matrix is

  T_{n−1} = (               1                       0           )
            ( −b_{n−1}Z_{n−1}[d_{r−1}]/x    a_{n−1}ℓ_{r−1}/x )

times T_{n−2} ··· T_0 shown above.

Case C2: n > 2d_0 − 2d_{r−1}.

By the inductive hypothesis δ_{n−1} = n − (2d_0 − 2d_{r−1}); f_{n−1} = a_{n−1} x^{d_{r−1}}R̄_{r−1} for some a_{n−1} ∈ k*; g_{n−1} = 0; and

  T_{n−2} ··· T_0 = ( a_{n−1} (1/x)^{d_0−d_{r−1}} Ū_{r−1}        a_{n−1} (1/x)^{d_0−d_{r−1}−1} V̄_{r−1} )
                    ( b_{n−1} (1/x)^{n−(d_0−d_{r−1})} Ū_r         b_{n−1} (1/x)^{n−(d_0−d_{r−1})−1} V̄_r  )

for some b_{n−1} ∈ k*. Hence δ_n = 1 + δ_{n−1} = 1 + n − (2d_0 − 2d_{r−1}) > 0; f_n = f_{n−1} = a_n x^{d_{r−1}}R̄_{r−1} where a_n = a_{n−1}; and g_n = 0. Also T_{n−1} = ( 1, 0 ; 0, a_{n−1}ℓ_{r−1}/x ), so

  T_{n−1} ··· T_0 = ( a_n (1/x)^{d_0−d_{r−1}} Ū_{r−1}        a_n (1/x)^{d_0−d_{r−1}−1} V̄_{r−1} )
                    ( b_n (1/x)^{n−(d_0−d_{r−1})+1} Ū_r       b_n (1/x)^{n−(d_0−d_{r−1})} V̄_r   )

where b_n = a_{n−1}b_{n−1}ℓ_{r−1} ∈ k*.

D Alternative proof of main gcd theorem for polynomials

This appendix gives another proof of Theorem 6.2, using the relationship in Theorem C.2 between the Euclid–Stevin algorithm and iterates of divstep.

Second proof of Theorem 6.2. Define R_2, R_3, ..., R_r and d_i and ℓ_i as in Theorem B.1, and define U_i and V_i as in Theorem B.2. Then d_0 = d > d_1, and all of the hypotheses of Theorems B.1, B.2, and C.2 are satisfied.

By Theorem B.1, G = gcd{R_0, R_1} = R_{r−1}/ℓ_{r−1} (since R_0 ≠ 0), so deg G = d_{r−1}. Also, by Theorem B.2,

  U_{r−1}R_0 + V_{r−1}R_1 = R_{r−1} = ℓ_{r−1}G,

so (V_{r−1}/ℓ_{r−1})R_1 ≡ G (mod R_0), and deg V_{r−1} = d_0 − d_{r−2} < d_0 − d_{r−1} = d − deg G, so V = V_{r−1}/ℓ_{r−1}.

Case 1: d_{r−1} ≥ 1. Part C of Theorem C.2 applies to n = 2d − 1, since n = 2d_0 − 1 ≥ 2d_0 − 2d_{r−1}. In particular δ_n = 1 + n − (2d_0 − 2d_{r−1}) = 2d_{r−1} = 2 deg G, f_{2d−1} = a_{2d−1} x^{d_{r−1}}R_{r−1}(1/x), and v_{2d−1} = a_{2d−1}(1/x)^{d_0−d_{r−1}−1}V_{r−1}(1/x).

Case 2: d_{r−1} = 0. Then r ≥ 2 and d_{r−2} ≥ 1. Part B of Theorem C.2 applies to i = r − 2 and n = 2d − 1 = 2d_0 − 1, since 2d_0 − d_i − d_{i+1} = 2d_0 − d_{r−2} ≤ 2d_0 − 1 = n < 2d_0 = 2d_0 − 2d_{i+1}. In particular δ_n = 1 + n − (2d_0 − 2d_{i+1}) = 2d_{r−1} = 2 deg G, again f_{2d−1} = a_{2d−1} x^{d_{r−1}}R_{r−1}(1/x), and again v_{2d−1} = a_{2d−1}(1/x)^{d_0−d_{r−1}−1}V_{r−1}(1/x).

In both cases f_{2d−1}(0) = a_{2d−1}ℓ_{r−1}, so x^{deg G}f_{2d−1}(1/x)/f_{2d−1}(0) = R_{r−1}/ℓ_{r−1} = G and x^{−d+1+deg G}v_{2d−1}(1/x)/f_{2d−1}(0) = V_{r−1}/ℓ_{r−1} = V.

E A gcd algorithm for integers

This appendix presents a variable-time algorithm to compute the gcd of two integers. This appendix also explains how a modification to the algorithm produces the Stehlé–Zimmermann algorithm [62].

Theorem E.1. Let e be a positive integer. Let f be an element of Z_2^*. Let g be an element of 2^e Z_2^*. Then there is a unique element q ∈ {1, 3, 5, ..., 2^{e+1} − 1} such that

  qg/2^e − f ∈ 2^{e+1}Z_2.


We define f div₂ g = q/2^e ∈ Q, and we define f mod₂ g = f − qg/2^e ∈ 2^{e+1}Z_2. Then f = (f div₂ g)g + (f mod₂ g). Note that e = ord_2 g, so we are not creating any ambiguity by omitting e from the f div₂ g and f mod₂ g notation.

Proof. First note that the quotient f/(g/2^e) is odd. Define q = (f/(g/2^e)) mod 2^{e+1}; then q ∈ {1, 3, 5, ..., 2^{e+1} − 1} and qg/2^e − f ∈ 2^{e+1}Z_2 by construction. To see uniqueness, note that if q' ∈ {1, 3, 5, ..., 2^{e+1} − 1} also satisfies q'g/2^e − f ∈ 2^{e+1}Z_2, then (q − q')g/2^e ∈ 2^{e+1}Z_2, so q − q' ∈ 2^{e+1}Z_2 since g/2^e ∈ Z_2^*, so q mod 2^{e+1} = q' mod 2^{e+1}, so q = q'.

Theorem E.2. Let R_0 be a nonzero element of Z_2. Let R_1 be an element of 2Z_2. Recursively define e_0, e_1, e_2, e_3, ... ∈ Z and R_2, R_3, ... ∈ 2Z_2 as follows:

• If i ≥ 0 and R_i ≠ 0 then e_i = ord_2 R_i.

• If i ≥ 1 and R_i ≠ 0 then R_{i+1} = −((R_{i−1}/2^{e_{i−1}}) mod₂ R_i)/2^{e_i} ∈ 2Z_2.

• If i ≥ 1 and R_i = 0 then e_i, R_{i+1}, e_{i+1}, R_{i+2}, e_{i+2}, ... are undefined.

Then

  ( R_i/2^{e_i} )   (     0                         1/2^{e_i}                   ) ( R_{i−1}/2^{e_{i−1}} )
  ( R_{i+1}     ) = ( −1/2^{e_i}   ((R_{i−1}/2^{e_{i−1}}) div₂ R_i)/2^{e_i} ) ( R_i                 )

for each i ≥ 1 where R_{i+1} is defined.

Proof. We have (R_{i−1}/2^{e_{i−1}}) mod₂ R_i = R_{i−1}/2^{e_{i−1}} − ((R_{i−1}/2^{e_{i−1}}) div₂ R_i)R_i. Negate and divide by 2^{e_i} to obtain the matrix equation.

Theorem E.3. Let R_0 be an odd element of Z. Let R_1 be an element of 2Z. Define e_0, e_1, e_2, e_3, ... and R_2, R_3, ... as in Theorem E.2. Then R_i ∈ 2Z for each i ≥ 1 where R_i is defined. Furthermore, if t ≥ 0 and R_{t+1} = 0, then |R_t/2^{e_t}| = gcd{R_0, R_1}.

Proof. Write g = gcd{R_0, R_1} ∈ Z. Then g is odd, since R_0 is odd.

We first show by induction on i that R_i ∈ gZ for each i ≥ 0. For i ∈ {0, 1} this follows from the definition of g. For i ≥ 2, we have R_{i−2} ∈ gZ and R_{i−1} ∈ gZ by the inductive hypothesis. Now (R_{i−2}/2^{e_{i−2}}) div₂ R_{i−1} has the form q/2^{e_{i−1}} for q ∈ Z by definition of div₂, and 2^{2e_{i−1}+e_{i−2}}R_i = 2^{e_{i−2}}qR_{i−1} − 2^{e_{i−1}}R_{i−2} by definition of R_i. The right side of this equation is in gZ, so 2^{2e_{i−1}+e_{i−2}}R_i is in gZ. This implies first that R_i ∈ Q, but also R_i ∈ Z_2, so R_i ∈ Z. This also implies that R_i ∈ gZ, since g is odd.

By construction R_i ∈ 2Z_2 for each i ≥ 1, and R_i ∈ Z for each i ≥ 0, so R_i ∈ 2Z for each i ≥ 1.

Finally, assume that t ≥ 0 and R_{t+1} = 0. Then all of R_0, R_1, ..., R_t are nonzero. Write g' = |R_t/2^{e_t}|. Then 2^{e_t}g' = |R_t| ∈ gZ, and g is odd, so g' ∈ gZ.

The equation 2^{e_{i−1}}R_{i−2} = 2^{e_{i−2}}qR_{i−1} − 2^{2e_{i−1}+e_{i−2}}R_i shows that if R_{i−1}, R_i ∈ g'Z then R_{i−2} ∈ g'Z, since g' is odd. By construction R_t, R_{t+1} ∈ g'Z, so R_{t−1}, R_{t−2}, ..., R_1, R_0 ∈ g'Z. Hence g = gcd{R_0, R_1} ∈ g'Z. Both g and g' are positive, so g' = g.

E.4. The gcd algorithm. Here is the algorithm featured in this appendix.

Given an odd integer R_0 and an even integer R_1, compute the even integers R_2, R_3, ... defined above. In other words, for each i ≥ 1 where R_i ≠ 0: compute e_i = ord_2 R_i; find q ∈ {1, 3, 5, ..., 2^{e_i+1} − 1} such that qR_i/2^{e_i} − R_{i−1}/2^{e_{i−1}} ∈ 2^{e_i+1}Z; and compute R_{i+1} = (qR_i/2^{e_i} − R_{i−1}/2^{e_{i−1}})/2^{e_i} ∈ 2Z. If R_{t+1} = 0 for some t ≥ 0, output |R_t/2^{e_t}| and stop.

We will show in Theorem F.26 that this algorithm terminates. The output is gcd{R_0, R_1} by Theorem E.3. This algorithm is closely related to iterating divstep, and we will use this relationship to analyze iterates of divstep; see Appendix G.
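As an illustration, here is a direct Sage transcription of the algorithm just described (our sketch, not the constant-time software of this paper). The variable rprev holds R_{i−1}/2^{e_{i−1}} and R holds R_i.

# The Appendix E.4 gcd algorithm for an odd R0 and an even R1 (our sketch).
def gcd2(R0, R1):
    assert R0 % 2 == 1 and R1 % 2 == 0
    rprev, R = ZZ(R0), ZZ(R1)
    while R != 0:
        e = R.valuation(2)                                  # e_i = ord_2 R_i
        r = R // 2^e                                        # R_i/2^{e_i}, odd
        q = (rprev * inverse_mod(r, 2^(e+1))) % 2^(e+1)     # the quotient of Theorem E.1
        rprev, R = r, ZZ((q*r - rprev) / 2^e)               # R_{i+1} = (q R_i/2^{e_i} - R_{i-1}/2^{e_{i-1}})/2^{e_i}
    return abs(rprev)                                       # |R_t/2^{e_t}| = gcd(R_0, R_1)

# Example: gcd2(1234567, 2468) returns 1 = gcd(1234567, 2468).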

More generally, if R_0 is an odd integer and R_1 is any integer, one can compute gcd{R_0, R_1} by using the above algorithm to compute gcd{R_0, 2R_1}. Even more generally, one can compute the gcd of any two integers by first handling the case gcd{0, 0} = 0, then handling powers of 2 in both inputs, then applying the odd-input algorithm.

E.5. A non-functioning variant. Imagine replacing the subtractions above with additions: find q ∈ {1, 3, 5, ..., 2^{e_i+1} − 1} such that qR_i/2^{e_i} + R_{i−1}/2^{e_{i−1}} ∈ 2^{e_i+1}Z, and compute R_{i+1} = (qR_i/2^{e_i} + R_{i−1}/2^{e_{i−1}})/2^{e_i} ∈ 2Z. If R_0 and R_1 are positive then all R_i are positive, so the algorithm does not terminate. This non-terminating algorithm is related to posdivstep (see Section 8.4) in the same way that the algorithm above is related to divstep.

Daireaux, Maume-Deschamps, and Vallée in [31, Section 6] mentioned this algorithm as a "non-centered LSB algorithm", said that one can easily add a "supplementary stopping condition" to ensure termination, and dismissed the resulting algorithm as being "certainly slower than the centered version".

E.6. A centered variant: the Stehlé–Zimmermann algorithm. Imagine using centered remainders {1 − 2^e, ..., −3, −1, 1, 3, ..., 2^e − 1} modulo 2^{e+1}, rather than uncentered remainders {1, 3, 5, ..., 2^{e+1} − 1}. The above gcd algorithm then turns into the Stehlé–Zimmermann gcd algorithm from [62].

Specifically, Stehlé and Zimmermann begin with integers a, b having ord_2 a < ord_2 b. If b ≠ 0 then "generalized binary division" in [62, Lemma 1] defines q as the unique odd integer with |q| < 2^{ord_2 b − ord_2 a} such that r = a + qb/2^{ord_2 b − ord_2 a} is divisible by 2^{ord_2 b + 1}. The gcd algorithm in [62, Algorithm Elementary-GB] repeatedly sets (a, b) ← (b, r). When b = 0, the algorithm outputs the odd part of a.

It is equivalent to set (a, b) ← (b/2^{ord_2 b}, r/2^{ord_2 b}), since this simply divides all subsequent results by 2^{ord_2 b}. In other words, starting from an odd a_0 and an even b_0, recursively define an odd a_{i+1} and an even b_{i+1} by the following formulas, stopping when b_i = 0:

• a_{i+1} = b_i/2^e where e = ord_2 b_i;

• b_{i+1} = (a_i + qb_i/2^e)/2^e ∈ 2Z for a unique q ∈ {1 − 2^e, ..., −3, −1, 1, 3, ..., 2^e − 1}.

Now relabel a_0, b_0, b_1, b_2, b_3, ... as R_0, R_1, R_2, R_3, R_4, ... to obtain the following recursive definition: R_{i+1} = (qR_i/2^{e_i} + R_{i−1}/2^{e_{i−1}})/2^{e_i} ∈ 2Z, and thus

  ( R_i/2^{e_i} )   (     0          1/2^{e_i}   ) ( R_{i−1}/2^{e_{i−1}} )
  ( R_{i+1}     ) = ( 1/2^{e_i}     q/2^{2e_i}  ) ( R_i                 )

for a unique q ∈ {1 − 2^{e_i}, ..., −3, −1, 1, 3, ..., 2^{e_i} − 1}.

To summarize, starting from the Stehlé–Zimmermann algorithm, one can obtain the gcd algorithm featured in this appendix by making the following three changes. First, divide out 2^{e_i}, instead of allowing powers of 2 to accumulate. Second, add R_{i−1}/2^{e_{i−1}} instead of subtracting it; this does not affect the number of iterations for centered q. Third, switch from centered q to uncentered q.

Intuitively, centered remainders are smaller than uncentered remainders, and it is well known that centered remainders improve the worst-case performance of Euclid's algorithm, so one might expect that the Stehlé–Zimmermann algorithm performs better than our uncentered variant. However, as noted earlier, our analysis produces the opposite conclusion. See Appendix F.

F Performance of the gcd algorithm for integers

The possible matrices in Theorem E.2 are

  (   0    1/2 )   (   0    1/2 )
  ( −1/2   1/4 ),  ( −1/2   3/4 )   for e_i = 1;

  (   0    1/4  )  (   0    1/4  )  (   0    1/4  )  (   0    1/4  )
  ( −1/4   1/16 ), ( −1/4   3/16 ), ( −1/4   5/16 ), ( −1/4   7/16 )   for e_i = 2;


etc. In this appendix we show that a product of these matrices shrinks the input by a factor exponential in the weight Σe_i. This implies that Σe_i in the algorithm of Appendix E is O(b), where b is the number of input bits.

F.1. Lower bounds. It is easy to see that each of these matrices has two eigenvalues of absolute value 1/2^{e_i}. This does not imply, however, that a weight-w product of these matrices shrinks the input by a factor 1/2^w. Concretely, the weight-7 product

   1   (  328  −380 )   (   0    1/4  ) (   0    1/2 )^2 (   0    1/2 ) (   0    1/2 )^2
  --- ( −158   233 ) = ( −1/4   1/16 ) ( −1/2   3/4 )   ( −1/2   1/4 ) ( −1/2   3/4 )
  4^7

of 6 matrices has spectral radius (maximum absolute value of eigenvalues)

  (561 + √249185)/2^15 = 0.03235425826... = 0.61253803919...^7 = 1/2^{4.949900586...}.

The (w/7)th power of this matrix is expressed as a weight-w product and has spectral radius 1/2^{0.7071286552...w}. Consequently this proof strategy cannot obtain an O constant better than 7/(15 − log_2(561 + √249185)) = 1.414169815.... We prove a bound of twice the weight on the number of divstep iterations (see Appendix G), so this proof strategy cannot obtain an O constant better than 14/(15 − log_2(561 + √249185)) = 2.828339631... for the number of divstep iterations.

Rotating the list of 6 matrices shown above gives 6 weight-7 products with the same spectral radius. In a small search we did not find any weight-w matrices with spectral radius above 0.61253803919...^w. Our upper bounds (see below) are 0.6181640625...^w for all large w.
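The weight-7 product and its spectral radius are easy to reproduce; here is a short Sage sketch (ours, not the paper's software).

# The weight-7 product of Appendix F.1 and its spectral radius (our sketch).
def M(e, q): return matrix(QQ, [[0, 1/2^e], [-1/2^e, q/2^(2*e)]])

P = M(2,1) * M(1,3)^2 * M(1,1) * M(1,3)^2
assert P == matrix(QQ, [[328, -380], [-158, 233]]) / 4^7
rho = max(abs(ev) for ev in P.eigenvalues())     # = (561 + sqrt(249185))/2^15
print(n(rho), n(rho)^(1/7))                      # about 0.0323542..., 0.6125380...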

For comparison, the non-terminating variant in Appendix E.5 allows the matrix ( 0, 1/2 ; 1/2, 3/4 ), which has spectral radius 1, with eigenvector (1, 2). The Stehlé–Zimmermann algorithm reviewed in Appendix E.6 allows the matrix ( 0, 1/2 ; 1/2, 1/4 ), which has spectral radius (1 + √17)/8 = 0.6403882032..., so this proof strategy cannot obtain an O constant better than 2/(log_2(√17 − 1) − 1) = 3.110510062... for the number of cdivstep iterations.

F.2. A warning about worst-case inputs. Define M as the matrix 4^{−7}( 328, −380 ; −158, 233 ) shown above. Then M^{−3} = ( 60321097, 113368060 ; 47137246, 88663112 ). Applying our gcd algorithm to inputs (60321097, 47137246) applies a series of 18 matrices of total weight 21: in the notation of Theorem E.2, the matrix ( 0, 1/2 ; −1/2, 3/4 ) twice, then the matrix ( 0, 1/2 ; −1/2, 1/4 ), then the matrix ( 0, 1/2 ; −1/2, 3/4 ) twice, then the weight-2 matrix ( 0, 1/4 ; −1/4, 1/16 ); all of these matrices again; and all of these matrices a third time.

One might think that these are worst-case inputs to our algorithm, analogous to a standard matrix construction of Fibonacci numbers as worst-case inputs to Euclid's algorithm. We emphasize that these are not worst-case inputs: the inputs (22293, −128330) are much smaller and nevertheless require 19 steps, of total weight 23. We now explain why the analogy fails.

The matrix M^3 has an eigenvector (1, −0.53...) with eigenvalue 1/2^{21·0.707...} = 1/2^{14.8...}, and an eigenvector (1, 0.78...) with eigenvalue 1/2^{21·1.29...} = 1/2^{27.1...}, so it has spectral radius around 1/2^{14.8}. Our proof strategy thus cannot guarantee a gain of more than 14.8 bits from weight 21. What we actually show below is that weight 21 is guaranteed to gain more than 13.97 bits. (The quantity α_21 in Figure F.14 is 133569/2^{31} = 2^{−13.972...}.)


The matrix M^{−3} has spectral radius 2^{27.1...}. It is not surprising that each column of M^{−3} is close to 27 bits in size; the only way to avoid this would be for the other eigenvector to be pointing in almost exactly the same direction as (1, 0) or (0, 1). For our inputs, taken from the first column of M^{−3}, the algorithm gains nearly 27 bits from weight 21, almost double the guaranteed gain. In short, these are good inputs, not bad inputs.
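The step counts and weights quoted above are easy to recompute; the following variant of the gcd2 sketch from Appendix E.4 (ours, not the paper's software) tracks the number of matrices applied and the total weight Σe_i.

# Counting steps and total weight sum(e_i) of the Appendix E.4 algorithm (our sketch).
def weight(R0, R1):
    rprev, R, steps, w = ZZ(R0), ZZ(R1), 0, 0
    while R != 0:
        e = R.valuation(2)
        r = R // 2^e
        q = (rprev * inverse_mod(r, 2^(e+1))) % 2^(e+1)
        rprev, R = r, ZZ((q*r - rprev) / 2^e)
        steps += 1; w += e
    return steps, w

print(weight(60321097, 47137246))    # (18, 21), per the discussion above
print(weight(22293, -128330))        # (19, 23)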

Now replace M with the matrix F = ( 0, 1 ; 1, −1 ), which reduces worst-case inputs to Euclid's algorithm. The eigenvalues of F are (√5 − 1)/2 = 1/2^{0.69...} and −(√5 + 1)/2 = −2^{0.69...}. The entries of powers of F^{−1} are Fibonacci numbers, again with sizes well predicted by the spectral radius of F^{−1}, namely 2^{0.69...}. However, the spectral radius of F is larger than 1. The reason that this large eigenvalue does not spoil the termination of Euclid's algorithm is that Euclid's algorithm inspects sizes and chooses quotients to decrease sizes at each step.

Stehlé and Zimmermann in [62, Section 6.2] measure the performance of their algorithm for "worst case" inputs^18 obtained from negative powers of the matrix ( 0, 1/2 ; 1/2, 1/4 ). The statement that these inputs are "worst case" is not justified. On the contrary, if these inputs have b bits then they take (0.73690... + o(1))b steps of the Stehlé–Zimmermann algorithm and thus (1.4738... + o(1))b cdivsteps, which is considerably fewer steps than many other inputs that we have tried^19 for various values of b. The quantity 0.73690... here is log(2)/log((√17 + 1)/2), coming from the smaller eigenvalue −2/(√17 + 1) = −0.39038... of the above matrix.

F.3. Norms. We use the standard definitions of the 2-norm of a real vector and the 2-norm of a real matrix. We summarize the basic theory of 2-norms here to keep this paper self-contained.

Definition F.4. If v ∈ R^2 then |v|_2 is defined as √(v_1^2 + v_2^2), where v = (v_1, v_2).

Definition F.5. If P ∈ M_2(R) then |P|_2 is defined as max{|Pv|_2 : v ∈ R^2, |v|_2 = 1}.

This maximum exists because {v ∈ R^2 : |v|_2 = 1} is a compact set (namely the unit circle) and |Pv|_2 is a continuous function of v. See also Theorem F.11 for an explicit formula for |P|_2.

Beware that the literature sometimes uses the notation |P|_2 for the entrywise 2-norm √(P_11^2 + P_12^2 + P_21^2 + P_22^2). We do not use entrywise norms of matrices in this paper.

Theorem F.6. Let v be an element of Z^2. If v ≠ 0 then |v|_2 ≥ 1.

Proof. Write v as (v_1, v_2) with v_1, v_2 ∈ Z. If v_1 ≠ 0 then v_1^2 ≥ 1, so v_1^2 + v_2^2 ≥ 1. If v_2 ≠ 0 then v_2^2 ≥ 1, so v_1^2 + v_2^2 ≥ 1. Either way |v|_2 = √(v_1^2 + v_2^2) ≥ 1.

Theorem F.7. Let v be an element of R^2. Let r be an element of R. Then |rv|_2 = |r||v|_2.

Proof. Write v as (v_1, v_2) with v_1, v_2 ∈ R. Then rv = (rv_1, rv_2), so |rv|_2 = √(r^2 v_1^2 + r^2 v_2^2) = √(r^2)·√(v_1^2 + v_2^2) = |r||v|_2.

^18 Specifically, Stehlé and Zimmermann claim that "the worst case of the binary variant" (i.e., of their gcd algorithm) is for inputs "G_n and 2G_{n−1}, where G_0 = 0, G_1 = 1, G_n = −G_{n−1} + 4G_{n−2}, which gives all binary quotients equal to 1". Daireaux, Maume-Deschamps, and Vallée in [31, Section 2.3] state that Stehlé and Zimmermann "exhibited the precise worst-case number of iterations" and give an equivalent characterization of this case.

^19 For example, "G_n" from [62] is −5467345 for n = 18, and 2135149 for n = 17. The inputs (−5467345, 2·2135149) use 17 iterations of the "GB" divisions from [62], while the smaller inputs (−1287513, 2·622123) use 24 "GB" iterations. The inputs (1, −5467345, 2135149) use 34 cdivsteps; the inputs (1, −2718281, 3141592) use 45 cdivsteps; the inputs (1, −1287513, 622123) use 58 cdivsteps.


Theorem F.8. Let P be an element of M_2(R). Let r be an element of R. Then |rP|_2 = |r||P|_2.

Proof. By definition |rP|_2 = max{|rPv|_2 : v ∈ R^2, |v|_2 = 1}. By Theorem F.7 this is the same as max{|r||Pv|_2} = |r| max{|Pv|_2} = |r||P|_2.

Theorem F.9. Let P be an element of M_2(R). Let v be an element of R^2. Then |Pv|_2 ≤ |P|_2|v|_2.

Proof. If v = 0 then Pv = 0 and |Pv|_2 = 0 = |P|_2|v|_2. Otherwise |v|_2 ≠ 0. Write w = v/|v|_2. Then |w|_2 = 1, so |Pw|_2 ≤ |P|_2, so |Pv|_2 = |Pw|_2|v|_2 ≤ |P|_2|v|_2.

Theorem F.10. Let P, Q be elements of M_2(R). Then |PQ|_2 ≤ |P|_2|Q|_2.

Proof. If v ∈ R^2 and |v|_2 = 1 then |PQv|_2 ≤ |P|_2|Qv|_2 ≤ |P|_2|Q|_2|v|_2.

Theorem F.11. Let P be an element of M_2(R). Assume that P*P = ( a, b ; c, d ), where P* is the transpose of P. Then

  |P|_2 = √((a + d + √((a − d)^2 + 4b^2))/2).

Proof P lowastP isin M2(R) is its own transpose so by the spectral theorem it has twoorthonormal eigenvectors with real eigenvalues In other words R2 has a basis e1 e2 suchthat

bull |e1|2 = 1

bull |e2|2 = 1

bull the dot product elowast1e2 is 0 (here we view vectors as column matrices)

bull P lowastPe1 = λ1e1 for some λ1 isin R and

bull P lowastPe2 = λ2e2 for some λ2 isin R

Any v isin R2 with |v|2 = 1 can be written uniquely as c1e1 + c2e2 for some c1 c2 isin R withc2

1 + c22 = 1 explicitly c1 = vlowaste1 and c2 = vlowaste2 Then P lowastPv = λ1c1e1 + λ2c2e2

By definition |P |22 is the maximum of |Pv|22 over v isin R2 with |v|2 = 1 Note that|Pv|22 = (Pv)lowast(Pv) = vlowastP lowastPv = λ1c

21 + λ2c

22 with c1 c2 defined as above For example

0 le |Pe1|22 = λ1 and 0 le |Pe2|22 = λ2 Thus |Pv|22 achieves maxλ1 λ2 It cannot belarger than this if c2

1 +c22 = 1 then λ1c

21 +λ2c

22 le maxλ1 λ2 Hence |P |22 = maxλ1 λ2

To finish the proof we make the spectral theorem explicit for the matrix P lowastP =(a bc d

)isinM2(R) Note that (P lowastP )lowast = P lowastP so b = c The characteristic polynomial of

this matrix is (xminus a)(xminus d)minus b2 = x2 minus (a+ d)x+ adminus b2 with roots

a+ dplusmnradic

(a+ d)2 minus 4(adminus b2)2 =

a+ dplusmnradic

(aminus d)2 + 4b2

2

The largest eigenvalue of P lowastP is thus (a+ d+radic

(aminus d)2 + 4b2)2 and |P |2 is the squareroot of this

Theorem F.12. Let P be an element of M_2(R). Assume that P*P = ( a, b ; c, d ), where P* is the transpose of P. Let N be an element of R. Then N ≥ |P|_2 if and only if N ≥ 0, 2N^2 ≥ a + d, and (2N^2 − a − d)^2 ≥ (a − d)^2 + 4b^2.


earlybounds = { ... }   # table of the 67 rationals alpha_0, ..., alpha_66
                        # (the exact values are garbled in this copy of the paper)

def alpha(w):
    if w >= 67: return (633/1024)^w
    return earlybounds[w]

assert all(alpha(w)^49 < 2^(-(34*w-23)) for w in range(31, 100))
assert min((633/1024)^w/alpha(w) for w in range(68)) == 633^5/(2^30*165219)

Figure F.14: Definition of α_w for w ∈ {0, 1, 2, ...}; computer verification that α_w < 2^{−(34w−23)/49} for 31 ≤ w ≤ 99; and computer verification that min{(633/1024)^w/α_w : 0 ≤ w ≤ 67} = 633^5/(2^30·165219). If w ≥ 67 then α_w = (633/1024)^w.

Our upper-bound computations work with matrices with rational entries. We use Theorem F.12 to compare the 2-norms of these matrices to various rational bounds N. All of the computations here use exact arithmetic, avoiding the correctness questions that would have shown up if we had used approximations to the square roots in Theorem F.11. Sage includes exact representations of algebraic numbers such as |P|_2, but switching to Theorem F.12 made our upper-bound computations an order of magnitude faster.

Proof. Assume that N ≥ 0, that 2N^2 ≥ a + d, and that (2N^2 − a − d)^2 ≥ (a − d)^2 + 4b^2. Then 2N^2 − a − d ≥ √((a − d)^2 + 4b^2) since 2N^2 − a − d ≥ 0, so N^2 ≥ (a + d + √((a − d)^2 + 4b^2))/2, so N ≥ √((a + d + √((a − d)^2 + 4b^2))/2) since N ≥ 0, so N ≥ |P|_2 by Theorem F.11.

Conversely, assume that N ≥ |P|_2. Then N ≥ 0. Also N^2 ≥ |P|_2^2, so 2N^2 ≥ 2|P|_2^2 = a + d + √((a − d)^2 + 4b^2) by Theorem F.11. In particular 2N^2 ≥ a + d, and (2N^2 − a − d)^2 ≥ (a − d)^2 + 4b^2.

F.13. Upper bounds. We now show that every weight-w product of the matrices in Theorem E.2 has norm at most α_w. Here α_0, α_1, α_2, ... is an exponentially decreasing sequence of positive real numbers defined in Figure F.14.

Definition F.15. For e ∈ {1, 2, ...} and q ∈ {1, 3, 5, ..., 2^{e+1} − 1}, define

  M(e, q) = (    0        1/2^e    )
            ( −1/2^e    q/2^{2e}  ).


Theorem F.16. |M(e, q)|_2 < (1 + √2)/2^e if e ∈ {1, 2, ...} and q ∈ {1, 3, 5, ..., 2^{e+1} − 1}.

Proof. Define ( a, b ; c, d ) = M(e, q)*M(e, q). Then a = 1/2^{2e}, b = c = −q/2^{3e}, and d = 1/2^{2e} + q^2/2^{4e}, so a + d = 2/2^{2e} + q^2/2^{4e} < 6/2^{2e} and (a − d)^2 + 4b^2 = 4q^2/2^{6e} + q^4/2^{8e} ≤ 32/2^{4e}. By Theorem F.11, |M(e, q)|_2 = √((a + d + √((a − d)^2 + 4b^2))/2) ≤ √((6/2^{2e} + √32/2^{2e})/2) = (1 + √2)/2^e.
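The bound of Theorem F.16 is easy to check numerically. The following Sage sketch (ours, not the paper's verification software) evaluates the explicit norm formula of Theorem F.11 for small e and all admissible q.

# Numerical sanity check of Theorem F.16 (our sketch).
def M(e, q): return matrix(QQ, [[0, 1/2^e], [-1/2^e, q/2^(2*e)]])

def norm2(P):
    ((a, b), (c, d)) = P.transpose()*P      # b = c since P*P is symmetric
    return sqrt((a + d + sqrt((a - d)^2 + 4*b^2))/2)

for e in (1, 2, 3):
    for q in range(1, 2^(e+1), 2):
        assert RDF(norm2(M(e, q))) < RDF((1 + sqrt(2))/2^e)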

Definition F.17. Define β_w = inf{α_{w+j}/α_j : j ≥ 0} for w ∈ {0, 1, 2, ...}.

Theorem F.18. If w ∈ {0, 1, 2, ...} then β_w = min{α_{w+j}/α_j : 0 ≤ j ≤ 67}. If w ≥ 67 then β_w = (633/1024)^w · 633^5/(2^30·165219).

The first statement reduces the computation of β_w to a finite computation. The computation can be further optimized, but is not a bottleneck for us. Similar comments apply to γ_{w,e} in Theorem F.20.

Proof. By definition α_w = (633/1024)^w for all w ≥ 67. If j ≥ 67 then also w + j ≥ 67, so α_{w+j}/α_j = (633/1024)^{w+j}/(633/1024)^j = (633/1024)^w. In short, all terms α_{w+j}/α_j for j ≥ 67 are identical, so the infimum of α_{w+j}/α_j for j ≥ 0 is the same as the minimum for 0 ≤ j ≤ 67.

If w ≥ 67 then α_{w+j}/α_j = (633/1024)^{w+j}/α_j. The quotient (633/1024)^j/α_j for 0 ≤ j ≤ 67 has minimum value 633^5/(2^30·165219).

Definition F.19. Define γ_{w,e} = inf{β_{w+j}2^j·70/169 : j ≥ e} for w ∈ {0, 1, 2, ...} and e ∈ {1, 2, 3, ...}.

This quantity γ_{w,e} is designed to be as large as possible subject to two conditions: first, γ_{w,e} ≤ β_{w+e}2^e·70/169, used in Theorem F.21; second, γ_{w,e} ≤ γ_{w,e+1}, used in Theorem F.22.

Theorem F.20. γ_{w,e} = min{β_{w+j}2^j·70/169 : e ≤ j ≤ e + 67} if w ∈ {0, 1, 2, ...} and e ∈ {1, 2, 3, ...}. If w + e ≥ 67 then γ_{w,e} = 2^e(633/1024)^{w+e}(70/169)·633^5/(2^30·165219).

Proof. If w + j ≥ 67 then β_{w+j} = (633/1024)^{w+j}·633^5/(2^30·165219), so β_{w+j}2^j·70/169 = (633/1024)^w(633/512)^j(70/169)·633^5/(2^30·165219), which increases monotonically with j.

In particular, if j ≥ e + 67 then w + j ≥ 67, so β_{w+j}2^j·70/169 ≥ β_{w+e+67}2^{e+67}·70/169. Hence γ_{w,e} = inf{β_{w+j}2^j·70/169 : j ≥ e} = min{β_{w+j}2^j·70/169 : e ≤ j ≤ e + 67}. Also, if w + e ≥ 67 then w + j ≥ 67 for all j ≥ e, so

  γ_{w,e} = (633/1024)^w (633/512)^e (70/169) · 633^5/(2^30·165219) = 2^e (633/1024)^{w+e} (70/169) · 633^5/(2^30·165219)

as claimed.

Theorem F.21. Define S as the smallest subset of Z × M_2(Q) such that

• (0, ( 1, 0 ; 0, 1 )) ∈ S, and

• if (w, P) ∈ S and (w = 0 or |P|_2 > β_w) and e ∈ {1, 2, 3, ...} and |P|_2 > γ_{w,e} and q ∈ {1, 3, 5, ..., 2^{e+1} − 1}, then (e + w, M(e, q)P) ∈ S.

Assume that |P|_2 ≤ α_w for each (w, P) ∈ S. Then |M(e_j, q_j) ··· M(e_1, q_1)|_2 ≤ α_{e_1+···+e_j} for all j ∈ {0, 1, 2, ...}, all e_1, ..., e_j ∈ {1, 2, 3, ...}, and all q_1, ..., q_j ∈ Z such that q_i ∈ {1, 3, 5, ..., 2^{e_i+1} − 1} for each i.


The idea here is that instead of searching through all products P of matrices M(e, q) of total weight w to see whether |P|_2 ≤ α_w, we prune the search in a way that makes the search finite across all w (see Theorem F.22), while still being sure to find a minimal counterexample if one exists. The first layer of pruning skips all descendants QP of P if w > 0 and |P|_2 ≤ β_w; the point is that any such counterexample QP implies a smaller counterexample Q. The second layer of pruning skips weight-(w + e) children M(e, q)P of P if |P|_2 ≤ γ_{w,e}; the point is that any such child is eliminated by the first layer of pruning.

Proof. Induct on j.

If j = 0 then M(e_j, q_j) ··· M(e_1, q_1) = ( 1, 0 ; 0, 1 ), with norm 1 = α_0. Assume from now on that j ≥ 1.

Case 1: |M(e_i, q_i) ··· M(e_1, q_1)|_2 ≤ β_{e_1+···+e_i} for some i ∈ {1, 2, ..., j}.

Then j − i < j, so |M(e_j, q_j) ··· M(e_{i+1}, q_{i+1})|_2 ≤ α_{e_{i+1}+···+e_j} by the inductive hypothesis, so |M(e_j, q_j) ··· M(e_1, q_1)|_2 ≤ β_{e_1+···+e_i} α_{e_{i+1}+···+e_j} by Theorem F.10; but β_{e_1+···+e_i} α_{e_{i+1}+···+e_j} ≤ α_{e_1+···+e_j} by definition of β.

Case 2: |M(e_i, q_i) ··· M(e_1, q_1)|_2 > β_{e_1+···+e_i} for every i ∈ {1, 2, ..., j}.

Suppose that |M(e_{i−1}, q_{i−1}) ··· M(e_1, q_1)|_2 ≤ γ_{e_1+···+e_{i−1},e_i} for some i ∈ {1, 2, ..., j}. By Theorem F.16, |M(e_i, q_i)|_2 < (1 + √2)/2^{e_i} < (169/70)/2^{e_i}. By Theorem F.10,

  |M(e_i, q_i) ··· M(e_1, q_1)|_2 ≤ |M(e_i, q_i)|_2 γ_{e_1+···+e_{i−1},e_i} ≤ 169 γ_{e_1+···+e_{i−1},e_i}/(70·2^{e_i}) ≤ β_{e_1+···+e_i},

contradiction. Thus |M(e_{i−1}, q_{i−1}) ··· M(e_1, q_1)|_2 > γ_{e_1+···+e_{i−1},e_i} for all i ∈ {1, 2, ..., j}.

Now (e_1 + ··· + e_i, M(e_i, q_i) ··· M(e_1, q_1)) ∈ S for each i ∈ {0, 1, 2, ..., j}. Indeed, induct on i. The base case i = 0 is simply (0, ( 1, 0 ; 0, 1 )) ∈ S. If i ≥ 1 then (w, P) ∈ S by the inductive hypothesis, where w = e_1 + ··· + e_{i−1} and P = M(e_{i−1}, q_{i−1}) ··· M(e_1, q_1). Furthermore:

• w = 0 (when i = 1) or |P|_2 > β_w (when i ≥ 2, by definition of this case);

• e_i ∈ {1, 2, 3, ...};

• |P|_2 > γ_{w,e_i} (as shown above); and

• q_i ∈ {1, 3, 5, ..., 2^{e_i+1} − 1}.

Hence (e_1 + ··· + e_i, M(e_i, q_i) ··· M(e_1, q_1)) = (e_i + w, M(e_i, q_i)P) ∈ S by definition of S.

In particular (w, P) ∈ S with w = e_1 + ··· + e_j and P = M(e_j, q_j) ··· M(e_1, q_1), so |P|_2 ≤ α_w by hypothesis.

Theorem F.22. In Theorem F.21, S is finite, and |P|_2 ≤ α_w for each (w, P) ∈ S.

Proof. This is a computer proof. We run the Sage script in Figure F.23. We observe that it terminates successfully (in slightly over 2546 minutes using Sage 8.6 on one core of a 3.5GHz Intel Xeon E3-1275 v3 CPU) and prints 3787975117.

What the script does is search depth-first through the elements of S, verifying that each (w, P) ∈ S satisfies |P|_2 ≤ α_w. The output is the number of elements searched, so S has at most 3787975117 elements.

Specifically, if (w, P) ∈ S then verify(w,P) checks that |P|_2 ≤ α_w and recursively calls verify on all descendants of (w, P). Actually, for efficiency, the script scales each matrix to have integer entries: it works with 2^{2e}M(e, q) (labeled scaledM(e,q) in the script) instead of M(e, q), and it works with 4^w P (labeled P in the script) instead of P.

The script computes β_w as shown in Theorem F.18, computes γ_{w,e} as shown in Theorem F.20, and compares |P|_2 to various bounds as shown in Theorem F.12.

If w > 0 and |P|_2 ≤ β_w then verify(w,P) returns. There are no descendants of (w, P) in this case. Also |P|_2 ≤ α_w, since β_w ≤ α_w/α_0 = α_w.


from alpha import alpha
from memoized import memoized

R = MatrixSpace(ZZ, 2)
def scaledM(e, q): return R((0, 2^e, -2^e, q))

@memoized
def beta(w): return min(alpha(w + j) / alpha(j) for j in range(68))

@memoized
def gamma(w, e): return min(beta(w + j) * 2^j * 70 / 169 for j in range(e, e + 68))

def spectralradiusisatmost(PP, N):  # assuming PP has form P.transpose()*P
  (a, b), (c, d) = PP
  X = 2*N^2 - a - d
  return N >= 0 and X >= 0 and X^2 >= (a - d)^2 + 4*b^2

def verify(w, P):
  nodes = 1
  PP = P.transpose() * P
  if w > 0 and spectralradiusisatmost(PP, 4^w * beta(w)): return nodes
  assert spectralradiusisatmost(PP, 4^w * alpha(w))
  for e in PositiveIntegers():
    if spectralradiusisatmost(PP, 4^w * gamma(w, e)): return nodes
    for q in range(1, 2^(e + 1), 2):
      nodes += verify(e + w, scaledM(e, q) * P)

print verify(0, R(1))

Figure F23: Sage script verifying |P|_2 ≤ α^w for each (w, P) ∈ S. Output is the number of pairs (w, P) considered if verification succeeds, or an assertion failure if verification fails.

Otherwise verify(w,P) asserts that |P|_2 ≤ α^w, i.e., it checks that |P|_2 ≤ α^w and raises an exception if not. It then tries successively e = 1, e = 2, e = 3, etc., continuing as long as |P|_2 > γ_{w,e}. For each e where |P|_2 > γ_{w,e}, verify(w,P) recursively handles (e + w, M(e, q)P) for each q ∈ {1, 3, 5, ..., 2^{e+1} − 1}.

Note that γ_{w,e} ≤ γ_{w,e+1} ≤ ···, so if |P|_2 ≤ γ_{w,e} then also |P|_2 ≤ γ_{w,e+1} etc. This is where it is important that γ_{w,e} is not simply defined as β^{w+e} · 2^e · 70/169.

To summarize, verify(w,P) recursively enumerates all descendants of (w, P). Hence verify(0,R(1)) recursively enumerates all elements (w, P) of S, checking |P|_2 ≤ α^w for each (w, P).
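The heart of the script is the exact norm comparison in spectralradiusisatmost, which never touches floating-point arithmetic. The following plain-Python sketch (ours, not part of the paper's software; all names are ours) spells out the same 2×2 computation and cross-checks it against a floating-point norm on random small matrices.

# Sketch (ours): the exact test |P|_2 <= N behind spectralradiusisatmost.
# For a 2x2 integer matrix P, write P^T P = ((a, b), (b, d)).  Its largest
# eigenvalue is (a + d + sqrt((a-d)^2 + 4b^2))/2, and this is <= N^2 exactly
# when N >= 0, X = 2N^2 - a - d >= 0, and X^2 >= (a-d)^2 + 4b^2.  With exact
# (integer or rational) N, a, b, d, the whole test is exact.
from math import hypot, sqrt
from random import randrange

def two_norm_is_at_most(P, N):
    (p, q), (r, s) = P
    a, b, d = p*p + r*r, p*q + r*s, q*q + s*s   # P^T P = ((a, b), (b, d))
    X = 2*N*N - a - d
    return N >= 0 and X >= 0 and X*X >= (a - d)**2 + 4*b*b

def two_norm(P):                                 # floating-point 2-norm, for comparison only
    (p, q), (r, s) = P
    a, b, d = p*p + r*r, p*q + r*s, q*q + s*s
    return sqrt((a + d + hypot(a - d, 2*b)) / 2)

for _ in range(1000):
    P = ((randrange(-9, 10), randrange(-9, 10)),
         (randrange(-9, 10), randrange(-9, 10)))
    for N in range(30):
        # the 1e-9 slack only absorbs floating-point rounding; the exact test is the reference
        assert two_norm_is_at_most(P, N) == (two_norm(P) <= N + 1e-9)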

Theorem F24. If j ∈ {0, 1, 2, ...}, e_1, ..., e_j ∈ {1, 2, 3, ...}, q_1, ..., q_j ∈ Z, and q_i ∈ {1, 3, 5, ..., 2^{e_i+1} − 1} for each i, then |M(e_j, q_j)···M(e_1, q_1)|_2 ≤ α^{e_1+···+e_j}.

Proof. This is the conclusion of Theorem F21, so it suffices to check the assumptions. Define S as in Theorem F21; then |P|_2 ≤ α^w for each (w, P) ∈ S by Theorem F22.


Theorem F25. In the situation of Theorem E3, assume that R_0, R_1, R_2, ..., R_{j+1} are defined. Then |(R_j/2^{e_j}, R_{j+1})|_2 ≤ α^{e_1+···+e_j} |(R_0, R_1)|_2, and if e_1 + ··· + e_j ≥ 67 then e_1 + ··· + e_j ≤ log_{1024/633} |(R_0, R_1)|_2.

If R_0 and R_1 are each bounded by 2^b in absolute value then |(R_0, R_1)|_2 is bounded by 2^{b+0.5}, so log_{1024/633} |(R_0, R_1)|_2 ≤ (b + 0.5)/log_2(1024/633) = (b + 0.5)(1.4410...).

Proof. By Theorem E2,

(R_i/2^{e_i}, R_{i+1}) = M(e_i, q_i) (R_{i−1}/2^{e_{i−1}}, R_i)

where q_i = 2^{e_i}((R_{i−1}/2^{e_{i−1}}) div_2 R_i) ∈ {1, 3, 5, ..., 2^{e_i+1} − 1}. Hence

(R_j/2^{e_j}, R_{j+1}) = M(e_j, q_j)···M(e_1, q_1) (R_0, R_1).

The product M(e_j, q_j)···M(e_1, q_1) has 2-norm at most α^w where w = e_1 + ··· + e_j, so |(R_j/2^{e_j}, R_{j+1})|_2 ≤ α^w |(R_0, R_1)|_2 by Theorem F9.

By Theorem F6, |(R_j/2^{e_j}, R_{j+1})|_2 ≥ 1. If e_1 + ··· + e_j ≥ 67 then α^{e_1+···+e_j} = (633/1024)^{e_1+···+e_j}, so 1 ≤ (633/1024)^{e_1+···+e_j} |(R_0, R_1)|_2; that is, (1024/633)^{e_1+···+e_j} ≤ |(R_0, R_1)|_2, so e_1 + ··· + e_j ≤ log_{1024/633} |(R_0, R_1)|_2.

Theorem F26. In the situation of Theorem E3, there exists t ≥ 0 such that R_{t+1} = 0.

Proof. Suppose not. Then R_0, R_1, ..., R_j, R_{j+1} are all defined for arbitrarily large j. Take j ≥ 67 with j > log_{1024/633} |(R_0, R_1)|_2. Then e_1 + ··· + e_j ≥ 67, so j ≤ e_1 + ··· + e_j ≤ log_{1024/633} |(R_0, R_1)|_2 < j by Theorem F25, contradiction.

G Proof of main gcd theorem for integers

This appendix shows how the gcd algorithm from Appendix E is related to iterating divstep. This appendix then uses the analysis from Appendix F to put bounds on the number of divstep iterations needed.

Theorem G1 (2-adic division as a sequence of division steps). In the situation of Theorem E1, divstep^{2e}(1, f, g/2) = (1, g/2^e, (qg/2^e − f)/2^{e+1}).

In other words, divstep^{2e}(1, f, g/2) = (1, g/2^e, −(f mod_2 g)/2^{e+1}).

Proof. Define

z_e = (g/2^e − f)/2 ∈ Z_2,
z_{e+1} = (z_e + (z_e mod 2)g/2^e)/2 ∈ Z_2,
z_{e+2} = (z_{e+1} + (z_{e+1} mod 2)g/2^e)/2 ∈ Z_2,
...
z_{2e} = (z_{2e−1} + (z_{2e−1} mod 2)g/2^e)/2 ∈ Z_2.

Then

z_e = (g/2^e − f)/2,
z_{e+1} = ((1 + 2(z_e mod 2))g/2^e − f)/4,
z_{e+2} = ((1 + 2(z_e mod 2) + 4(z_{e+1} mod 2))g/2^e − f)/8,
...
z_{2e} = ((1 + 2(z_e mod 2) + ··· + 2^e(z_{2e−1} mod 2))g/2^e − f)/2^{e+1}.

In short, z_{2e} = (q′g/2^e − f)/2^{e+1} where

q′ = 1 + 2(z_e mod 2) + ··· + 2^e(z_{2e−1} mod 2) ∈ {1, 3, 5, ..., 2^{e+1} − 1}.

Hence q′g/2^e − f = 2^{e+1}z_{2e} ∈ 2^{e+1}Z_2, so q′ = q by the uniqueness statement in Theorem E1. Note that this construction gives an alternative proof of the existence of q in Theorem E1.

All that remains is to prove divstep^{2e}(1, f, g/2) = (1, g/2^e, (qg/2^e − f)/2^{e+1}), i.e., divstep^{2e}(1, f, g/2) = (1, g/2^e, z_{2e}).

If 1 ≤ i < e then g/2^i ∈ 2Z_2, so divstep(i, f, g/2^i) = (i + 1, f, g/2^{i+1}). By induction, divstep^{e−1}(1, f, g/2) = (e, f, g/2^e). By assumption e > 0 and g/2^e is odd, so divstep(e, f, g/2^e) = (1 − e, g/2^e, (g/2^e − f)/2) = (1 − e, g/2^e, z_e). What remains is to prove that divstep^e(1 − e, g/2^e, z_e) = (1, g/2^e, z_{2e}).

If 0 ≤ i ≤ e − 1 then 1 − e + i ≤ 0, so divstep(1 − e + i, g/2^e, z_{e+i}) = (2 − e + i, g/2^e, (z_{e+i} + (z_{e+i} mod 2)g/2^e)/2) = (2 − e + i, g/2^e, z_{e+i+1}). By induction, divstep^i(1 − e, g/2^e, z_e) = (1 − e + i, g/2^e, z_{e+i}) for 0 ≤ i ≤ e. In particular divstep^e(1 − e, g/2^e, z_e) = (1, g/2^e, z_{2e}), as claimed.
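For concreteness, here is a small plain-Python check of this statement (a sketch of ours, not the paper's software). It uses divstep as defined earlier in the paper and, for small odd f and even nonzero g, compares 2e division steps against the closed form above; q is computed as in Theorem E1, as the unique element of {1, 3, ..., 2^{e+1} − 1} with qg/2^e ≡ f (mod 2^{e+1}).

# Sketch (ours): numeric check of Theorem G1 over integers.
#   divstep(delta, f, g) = (1 - delta, g, (g - f)/2)            if delta > 0 and g odd
#                        = (1 + delta, f, (g + (g mod 2)f)/2)   otherwise
def divstep(delta, f, g):
    if delta > 0 and g % 2 != 0:
        return 1 - delta, g, (g - f) // 2
    return 1 + delta, f, (g + (g % 2) * f) // 2

def check_G1(f, g):                      # f odd, g even and nonzero
    e = (g & -g).bit_length() - 1        # e = 2-adic valuation of g, so e >= 1
    g0 = g >> e                          # g/2^e, odd
    # unique q in {1, 3, ..., 2^(e+1)-1} with q*(g/2^e) congruent to f mod 2^(e+1)
    q = (f * pow(g0, -1, 2 ** (e + 1))) % 2 ** (e + 1)   # pow(x, -1, m) needs Python >= 3.8
    state = (1, f, g // 2)
    for _ in range(2 * e):
        state = divstep(*state)
    assert state == (1, g0, (q * g0 - f) // 2 ** (e + 1))

for f in range(-99, 100, 2):
    for g in range(-98, 100, 2):
        if g != 0:
            check_G1(f, g)

The assertion exercises exactly the identity proved above; the ranges shown cover a few thousand cases and run in well under a second.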

Theorem G2 (2-adic gcd as a sequence of division steps). In the situation of Theorem E3, assume that R_0, R_1, ..., R_j, R_{j+1} are defined. Then divstep^{2w}(1, R_0, R_1/2) = (1, R_j/2^{e_j}, R_{j+1}/2) where w = e_1 + ··· + e_j.

Therefore if R_{t+1} = 0 then divstep^n(1, R_0, R_1/2) = (1 + n − 2w, R_t/2^{e_t}, 0) for all n ≥ 2w, where w = e_1 + ··· + e_t.

Proof. Induct on j. If j = 0 then w = 0, so divstep^{2w}(1, R_0, R_1/2) = (1, R_0, R_1/2), and R_0 = R_0/2^{e_0} since R_0 is odd by hypothesis.

If j ≥ 1 then by definition R_{j+1} = −((R_{j−1}/2^{e_{j−1}}) mod_2 R_j)/2^{e_j}. Furthermore

divstep^{2e_1+···+2e_{j−1}}(1, R_0, R_1/2) = (1, R_{j−1}/2^{e_{j−1}}, R_j/2)

by the inductive hypothesis, so

divstep^{2e_1+···+2e_j}(1, R_0, R_1/2) = divstep^{2e_j}(1, R_{j−1}/2^{e_{j−1}}, R_j/2) = (1, R_j/2^{e_j}, R_{j+1}/2)

by Theorem G1.

Theorem G3. Let R_0 be an odd element of Z. Let R_1 be an element of 2Z. Then there exists n ∈ {0, 1, 2, ...} such that divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0}. Furthermore, any such n has divstep^n(1, R_0, R_1/2) ∈ Z × {G, −G} × {0}, where G = gcd{R_0, R_1}.

Proof. Define e_0, e_1, e_2, e_3, ... and R_2, R_3, ... as in Theorem E2. By Theorem F26 there exists t ≥ 0 such that R_{t+1} = 0. By Theorem E3, R_t/2^{e_t} ∈ {G, −G}. By Theorem G2, divstep^{2w}(1, R_0, R_1/2) = (1, R_t/2^{e_t}, 0) ∈ Z × {G, −G} × {0} where w = e_1 + ··· + e_t.

Conversely, assume that n ∈ {0, 1, 2, ...} and that divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0}. Write divstep^n(1, R_0, R_1/2) as (δ, f, 0). Then divstep^{n+k}(1, R_0, R_1/2) = (k + δ, f, 0) for all k ∈ {0, 1, 2, ...}, so in particular divstep^{2w+n}(1, R_0, R_1/2) = (2w + δ, f, 0). Also divstep^{2w+k}(1, R_0, R_1/2) = (k + 1, R_t/2^{e_t}, 0) for all k ∈ {0, 1, 2, ...}, so in particular divstep^{2w+n}(1, R_0, R_1/2) = (n + 1, R_t/2^{e_t}, 0). Hence f = R_t/2^{e_t} ∈ {G, −G} and divstep^n(1, R_0, R_1/2) ∈ Z × {G, −G} × {0}, as claimed.

Theorem G4. Let R_0 be an odd element of Z. Let R_1 be an element of 2Z. Define b = log_2 √(R_0^2 + R_1^2). Assume that b ≤ 21. Then divstep^{⌊19b/7⌋}(1, R_0, R_1/2) ∈ Z × Z × {0}.

Proof. If divstep(δ, f, g) = (δ_1, f_1, g_1) then divstep(δ, −f, −g) = (δ_1, −f_1, −g_1). Hence divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0} if and only if divstep^n(1, −R_0, −R_1/2) ∈ Z × Z × {0}. From now on, assume without loss of generality that R_0 > 0.


s   W_s   R_0   R_1
0   1   1   0
1   5   1   -2
2   5   1   -2
3   5   1   -2
4   17   1   -4
5   17   1   -4
6   65   1   -8
7   65   1   -8
8   157   11   -6
9   181   9   -10
10   421   15   -14
11   541   21   -10
12   1165   29   -18
13   1517   29   -26
14   2789   17   -50
15   3653   47   -38
16   8293   47   -78
17   8293   47   -78
18   24245   121   -98
19   27361   55   -156
20   56645   191   -142
21   79349   215   -182
22   132989   283   -230
23   213053   133   -442
24   415001   451   -460
25   613405   501   -602
26   1345061   719   -910
27   1552237   1021   -714
28   2866525   789   -1498
29   4598269   1355   -1662
30   7839829   777   -2690
31   12565573   2433   -2578
32   22372085   2969   -3682
33   35806445   5443   -2486
34   71013905   4097   -7364
35   74173637   5119   -6926
36   205509533   9973   -10298
37   226964725   11559   -9662
38   487475029   11127   -19070
39   534274997   13241   -18946
40   1543129037   24749   -30506
41   1639475149   20307   -35030
42   3473731181   39805   -43466
43   4500780005   49519   -45262
44   9497198125   38051   -89718
45   12700184149   82743   -76510
46   29042662405   152287   -76494
47   36511782869   152713   -114850
48   82049276437   222249   -180706
49   90638999365   220417   -205074
50   234231548149   229593   -426050
51   257130797053   338587   -377478
52   466845672077   301069   -613354
53   622968125533   437163   -657158
54   1386285660565   600809   -1012578
55   1876581808109   604547   -1229270
56   4208169535453   1352357   -1542498

Figure G5: For each s ∈ {0, 1, ..., 56}: the minimum R_0^2 + R_1^2 where R_0 is an odd integer, R_1 is an even integer, and (R_0, R_1) requires ≥ s iterations of divstep; also an example of (R_0, R_1) reaching this minimum.

The rest of the proof is a computer proof. We enumerate all pairs (R_0, R_1) where R_0 is a positive odd integer, R_1 is an even integer, and R_0^2 + R_1^2 ≤ 2^42. For each (R_0, R_1) we iterate divstep to find the minimum n ≥ 0 where divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0}. We then say that (R_0, R_1) "needs n steps".

For each s ≥ 0, define W_s as the smallest R_0^2 + R_1^2 where (R_0, R_1) needs ≥ s steps. The computation shows that each (R_0, R_1) needs ≤ 56 steps, and also shows the values W_s for each s ∈ {0, 1, ..., 56}. We display these values in Figure G5. We check for each s ∈ {0, 1, ..., 56} that 2^{14s} ≤ W_s^{19}.

Finally, we claim that each (R_0, R_1) needs ≤ ⌊19b/7⌋ steps, where b = log_2 √(R_0^2 + R_1^2).

If not, then (R_0, R_1) needs ≥ s steps where s = ⌊19b/7⌋ + 1. We have s ≤ 56 since (R_0, R_1) needs ≤ 56 steps. We also have W_s ≤ R_0^2 + R_1^2 by definition of W_s. Hence 2^{14s/19} ≤ R_0^2 + R_1^2, so 14s/19 ≤ 2b, so s ≤ 19b/7, contradiction.
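This enumeration is easy to reproduce on a smaller scale. The following plain-Python sketch (ours; the paper's computation runs to 2^42, and the limit below is deliberately much smaller) brute-forces W_s and should reproduce rows 0 through 25 of Figure G5, i.e., the rows whose W_s lies below the reduced limit.

# Sketch (ours): brute-force the first rows of Figure G5 on a reduced range.
def divstep(delta, f, g):
    if delta > 0 and g % 2 != 0:
        return 1 - delta, g, (g - f) // 2
    return 1 + delta, f, (g + (g % 2) * f) // 2

def steps(R0, R1):            # minimum n with divstep^n(1, R0, R1/2) in Z x Z x {0}
    delta, f, g, n = 1, R0, R1 // 2, 0
    while g != 0:
        delta, f, g = divstep(delta, f, g)
        n += 1
    return n

LIMIT = 2 ** 20               # the paper uses 2^42; this range takes on the order of a minute
best = {}                     # best[s] = (smallest R0^2 + R1^2 needing >= s steps, R0, R1)
for R0 in range(1, 1025, 2):
    for R1 in range(0, 1025, 2):
        if R0 * R0 + R1 * R1 > LIMIT:
            break
        for r1 in ([R1, -R1] if R1 else [0]):
            w, n = R0 * R0 + r1 * r1, steps(R0, r1)
            for s in range(n + 1):
                if s not in best or w < best[s][0]:
                    best[s] = (w, R0, r1)

for s in sorted(best):
    print(s, *best[s])        # rows with best[s][0] close to LIMIT are artifacts of the cutoff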

Theorem G6 (Bounds on the number of divsteps required). Let R_0 be an odd element of Z. Let R_1 be an element of 2Z. Define b = log_2 √(R_0^2 + R_1^2). Define n = ⌊19b/7⌋ if b ≤ 21; n = ⌊(49b + 23)/17⌋ if 21 < b ≤ 46; and n = ⌊49b/17⌋ if b > 46. Then divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0}.


All three cases have n ≤ ⌊(49b + 23)/17⌋ since 19/7 < 49/17.

Proof. If b ≤ 21 then divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0} by Theorem G4.

Assume from now on that b > 21. This implies n ≥ ⌊49b/17⌋ ≥ 60.

Define e_0, e_1, e_2, e_3, ... and R_2, R_3, ... as in Theorem E2. By Theorem F26 there exists t ≥ 0 such that R_{t+1} = 0. By Theorem E3, R_t/2^{e_t} ∈ {G, −G}. By Theorem G2, divstep^{2w}(1, R_0, R_1/2) = (1, R_t/2^{e_t}, 0) ∈ Z × {G, −G} × {0} where w = e_1 + ··· + e_t.

To finish the proof, we will show that 2w ≤ n. Hence divstep^n(1, R_0, R_1/2) ∈ Z × Z × {0} as claimed.

Suppose that 2w > n. Both 2w and n are integers, so 2w ≥ n + 1 ≥ 61, so w ≥ 31. By Theorem F6 and Theorem F25, 1 ≤ |(R_t/2^{e_t}, R_{t+1})|_2 ≤ α^w |(R_0, R_1)|_2 = α^w 2^b, so log_2(1/α^w) ≤ b.

Case 1: w ≥ 67. By definition 1/α^w = (1024/633)^w, so w log_2(1024/633) ≤ b. We have 34/49 < log_2(1024/633), so 34w/49 < b, so n + 1 ≤ 2w < 49b/17 < (49b + 23)/17. This contradicts the definition of n as the largest integer at most 49b/17 if b > 46, and as the largest integer at most (49b + 23)/17 if b ≤ 46.

Case 2: w ≤ 66. Then α^w < 2^{−(34w−23)/49} since w ≥ 31, so b ≥ log_2(1/α^w) > (34w − 23)/49, so n + 1 ≤ 2w ≤ (49b + 23)/17. This again contradicts the definition of n if b ≤ 46. If b > 46 then n ≥ ⌊49 · 46/17⌋ = 132, so 2w ≥ n + 1 > 132 ≥ 2w, contradiction.
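As a quick end-to-end sanity check of this bound (ours, not part of the paper's proofs), one can evaluate the three-case formula for n and confirm that n division steps reach a zero g-entry with the f-entry equal to ±gcd{R_0, R_1}, as Theorem G3 predicts.

# Sketch (ours): spot-check Theorem G6 (and the gcd claim of Theorem G3).
from math import floor, gcd, log2
from random import randrange

def divstep(delta, f, g):
    if delta > 0 and g % 2 != 0:
        return 1 - delta, g, (g - f) // 2
    return 1 + delta, f, (g + (g % 2) * f) // 2

def iterations(R0, R1):                  # n as defined in Theorem G6 (float log2 is accurate enough here)
    b = log2(R0 * R0 + R1 * R1) / 2
    if b <= 21:
        return floor(19 * b / 7)
    if b <= 46:
        return floor((49 * b + 23) / 17)
    return floor(49 * b / 17)

for _ in range(100):
    k = randrange(1, 300)                # arbitrary input sizes, exercising all three cases
    R0 = 2 * randrange(0, 2 ** k) + 1    # odd
    R1 = 2 * randrange(-2 ** k, 2 ** k)  # even
    delta, f, g = 1, R0, R1 // 2
    for _ in range(iterations(R0, R1)):
        delta, f, g = divstep(delta, f, g)
    assert g == 0 and abs(f) == gcd(R0, R1)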

H Comparison to performance of cdivstep

This appendix briefly compares the analysis of divstep from Appendix G to an analogous analysis of cdivstep.

First modify Theorem E1 to take the unique element q ∈ {±1, ±3, ..., ±(2^e − 1)} such that f + qg/2^e ∈ 2^{e+1}Z_2. The corresponding modification of Theorem G1 says that cdivstep^{2e}(1, f, g/2) = (1, g/2^e, (f + qg/2^e)/2^{e+1}). The proof proceeds identically, except z_j = (z_{j−1} − (z_{j−1} mod 2)g/2^e)/2 ∈ Z_2 for e < j < 2e, and z_{2e} = (z_{2e−1} + (z_{2e−1} mod 2)g/2^e)/2 ∈ Z_2. Also, we have q = −1 − 2(z_e mod 2) − ··· − 2^{e−1}(z_{2e−2} mod 2) + 2^e(z_{2e−1} mod 2) ∈ {±1, ±3, ..., ±(2^e − 1)}. In the notation of [62], GB(f, g) = (q, r) where r = f + qg/2^e = 2^{e+1}z_{2e}.

The corresponding gcd algorithm is the centered algorithm from Appendix E6, with addition instead of subtraction. Recall that, except for inessential scalings by powers of 2, this is the same as the Stehlé–Zimmermann gcd algorithm from [62].

The corresponding set of transition matrices is different from the divstep case. As mentioned in Appendix F, there are weight-w products with spectral radius 0.6403882032...^w, so the proof strategy for cdivstep cannot obtain weight-w norm bounds below this spectral radius. This is worse than our bounds for divstep.

Enumerating small pairs (R_0, R_1) as in Theorem G4 also produces worse results for cdivstep than for divstep. See Figure H1.


Figure H1: For each s ∈ {1, 2, ..., 56}, a red dot at (b, s − 2.8283396b), where b is the minimum log_2 √(R_0^2 + R_1^2) over odd integers R_0 and even integers R_1 such that (R_0, R_1) requires ≥ s iterations of divstep. For each s ∈ {1, 2, ..., 58}, a blue dot at (b, s − 2.8283396b), where b is the minimum log_2 √(R_0^2 + R_1^2) over odd integers R_0 and even integers R_1 such that (R_0, R_1) requires ≥ s iterations of cdivstep.
