This is gmp.info, produced by makeinfo version 4.2 from gmp.texi. This manual describes how to install and use the GNU multiple precision arithmetic library, version 4.1.2. Copyright 1991, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002 Free Software Foundation, Inc. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with no Invariant Sections, with the Front-Cover Texts being "A GNU Manual", and with the Back-Cover Texts being "You have freedom to copy and modify this GNU Manual, like GNU software". A copy of the license is included in *Note GNU Free Documentation License::. INFO-DIR-SECTION GNU libraries START-INFO-DIR-ENTRY * gmp: (gmp). GNU Multiple Precision Arithmetic Library. END-INFO-DIR-ENTRY  File: gmp.info, Node: FFT Multiplication, Next: Other Multiplication, Prev: Toom-Cook 3-Way Multiplication, Up: Multiplication Algorithms FFT Multiplication ------------------ At large to very large sizes a Fermat style FFT multiplication is used, following Schönhage and Strassen (*note References::). Descriptions of FFTs in various forms can be found in many textbooks, for instance Knuth section 4.3.3 part C or Lipson chapter IX. A brief description of the form used in GMP is given here. The multiplication done is x*y mod 2^N+1, for a given N. A full product x*y is obtained by choosing N>=bits(x)+bits(y) and padding x and y with high zero limbs. The modular product is the native form for the algorithm, so padding to get a full product is unavoidable. The algorithm follows a split, evaluate, pointwise multiply, interpolate and combine similar to that described above for Karatsuba and Toom-3. A k parameter controls the split, with an FFT-k splitting into 2^k pieces of M=N/2^k bits each. N must be a multiple of (2^k)*mp_bits_per_limb so the split falls on limb boundaries, avoiding bit shifts in the split and combine stages. The evaluations, pointwise multiplications, and interpolation, are all done modulo 2^N'+1 where N' is 2M+k+3 rounded up to a multiple of 2^k and of `mp_bits_per_limb'. The results of interpolation will be the following negacyclic convolution of the input pieces, and the choice of N' ensures these sums aren't truncated.

     w[n] = sum of (-1)^b * x[i]*y[j]  over all i+j == b*2^k+n, with b = 0 or 1

The points used for the evaluation are g^i for i=0 to 2^k-1 where g=2^(2N'/2^k). g is a 2^k'th root of unity mod 2^N'+1, which produces necessary cancellations at the interpolation stage, and it's also a power of 2 so the fast fourier transforms used for the evaluation and interpolation do only shifts, adds and negations. The pointwise multiplications are done modulo 2^N'+1 and either recurse into a further FFT or use a plain multiplication (Toom-3, Karatsuba or basecase), whichever is optimal at the size N'. The interpolation is an inverse fast fourier transform. The resulting set of sums of x[i]*y[j] are added at appropriate offsets to give the final result. Squaring is the same, but x is the only input so it's one transform at the evaluate stage and the pointwise multiplies are squares. The interpolation is the same. For a mod 2^N+1 product, an FFT-k is an O(N^(k/(k-1))) algorithm, the exponent representing 2^k recursed modular multiplies each 1/2^(k-1) the size of the original. Each successive k is an asymptotic improvement, but overheads mean each is only faster at bigger and bigger sizes.
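As an illustration of what this modular product means (this is not GMP's internal FFT code, and the function name is invented for the example), x*y mod 2^N+1 can be formed at the `mpz' level from a full product: since 2^N == -1 mod 2^N+1, the high N bits of the product simply subtract from the low N bits.

     #include <gmp.h>

     /* Illustration only: x*y mod 2^N+1 from a full product.
        Assumes x and y are already residues, ie. 0 <= x,y <= 2^N. */
     void mulmod_2expp1 (mpz_t r, const mpz_t x, const mpz_t y, unsigned long N)
     {
       mpz_t prod, hi, m;
       mpz_init (prod); mpz_init (hi); mpz_init (m);

       mpz_set_ui (m, 1);
       mpz_mul_2exp (m, m, N);
       mpz_add_ui (m, m, 1);            /* m = 2^N+1 */

       mpz_mul (prod, x, y);            /* full product, up to 2N bits */
       mpz_fdiv_q_2exp (hi, prod, N);   /* high half */
       mpz_fdiv_r_2exp (r, prod, N);    /* low half */
       mpz_sub (r, r, hi);              /* 2^N == -1, so high subtracts */
       if (mpz_sgn (r) < 0)
         mpz_add (r, r, m);             /* bring back into [0, 2^N] */

       mpz_clear (prod); mpz_clear (hi); mpz_clear (m);
     }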
In the code, `MUL_FFT_TABLE' and `SQR_FFT_TABLE' are the thresholds where each k is used. Each new k effectively swaps some multiplying for some shifts, adds and overheads. A mod 2^N+1 product can be formed with a normal NxN->2N bit multiply plus a subtraction, so an FFT and Toom-3 etc can be compared directly. A k=4 FFT at O(N^1.333) can be expected to be the first faster than Toom-3 at O(N^1.465). In practice this is what's found, with `MUL_FFT_MODF_THRESHOLD' and `SQR_FFT_MODF_THRESHOLD' being between 300 and 1000 limbs, depending on the CPU. So far it's been found that only very large FFTs recurse into pointwise multiplies above these sizes. When an FFT is to give a full product, the change of N to 2N doesn't alter the theoretical complexity for a given k, but for the purposes of considering where an FFT might be first used it can be assumed that the FFT is recursing into a normal multiply and that on that basis it's doing 2^k recursed multiplies each 1/2^(k-2) the size of the inputs, making it O(N^(k/(k-2))). This would mean k=7 at O(N^1.4) would be the first FFT faster than Toom-3. In practice `MUL_FFT_THRESHOLD' and `SQR_FFT_THRESHOLD' have been found to be in the k=8 range, somewhere between 3000 and 10000 limbs. The way N is split into 2^k pieces and then 2M+k+3 is rounded up to a multiple of 2^k and `mp_bits_per_limb' means that when 2^k>=mp_bits_per_limb the effective N is a multiple of 2^(2k-1) bits. The +k+3 means some values of N just under such a multiple will be rounded to the next. The complexity calculations above assume that a favourable size is used, meaning one which isn't padded through rounding, and it's also assumed that the extra +k+3 bits are negligible at typical FFT sizes. The practical effect of the 2^(2k-1) constraint is to introduce a step-effect into measured speeds. For example k=8 will round N up to a multiple of 32768 bits, so for a 32-bit limb there'll be 512 limb groups of sizes for which `mpn_mul_n' runs at the same speed. Or for k=9 groups of 2048 limbs, k=10 groups of 8192 limbs, etc. In practice it's been found each k is used at quite small multiples of its size constraint and so the step effect is quite noticeable in a time versus size graph. The threshold determinations currently measure at the mid-points of size steps, but this is sub-optimal since at the start of a new step it can happen that it's better to go back to the previous k for a while. Something more sophisticated for `MUL_FFT_TABLE' and `SQR_FFT_TABLE' will be needed.  File: gmp.info, Node: Other Multiplication, Prev: FFT Multiplication, Up: Multiplication Algorithms Other Multiplication -------------------- The 3-way Toom-Cook algorithm described above (*note Toom-Cook 3-Way Multiplication::) generalizes to split into an arbitrary number of pieces, as per Knuth section 4.3.3 algorithm C. This is not currently used, though it's possible a Toom-4 might fit in between Toom-3 and the FFTs. The notes here are merely for interest. In general a split into r+1 pieces is made, and evaluations and pointwise multiplications done at 2*r+1 points. A 4-way split does 7 pointwise multiplies, 5-way does 9, etc. Asymptotically an (r+1)-way algorithm is O(N^(log(2*r+1)/log(r+1))). Only the pointwise multiplications count towards big-O complexity, but the time spent in the evaluate and interpolate stages grows with r and has a significant practical impact, with the asymptotic advantage of each r realized only at bigger and bigger sizes. 
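The parameter arithmetic just described can be sketched as follows. This is only an illustration of the rounding rules, not code from GMP; the limb size is assumed to be a power of 2 and the function name is invented for the example.

     /* Illustration only: the pointwise modulus size N' for a product
        done mod 2^N+1 with FFT parameter k.  N is assumed to already be
        a multiple of 2^k * BITS_PER_LIMB. */
     #define BITS_PER_LIMB  32

     static unsigned long round_up (unsigned long a, unsigned long m)
     {
       return ((a + m - 1) / m) * m;
     }

     unsigned long fft_nprime (unsigned long N, int k)
     {
       unsigned long M = N >> k;                  /* piece size, in bits */
       unsigned long step = (1UL << k) > BITS_PER_LIMB
                            ? (1UL << k) : BITS_PER_LIMB;
       /* 2M+k+3 rounded up to a multiple of 2^k and of the limb size;
          both are powers of 2, so rounding to the larger does both */
       return round_up (2*M + k + 3, step);
     }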
The overheads grow as O(N*r), whereas in an r=2^k FFT they grow only as O(N*log(r)). Knuth algorithm C evaluates at points 0,1,2,...,2*r, but exercise 4 uses -r,...,0,...,r and the latter saves some small multiplies in the evaluate stage (or rather trades them for additions), and has a further saving of nearly half the interpolate steps. The idea is to separate odd and even final coefficients and then perform algorithm C steps C7 and C8 on them separately. The divisors at step C7 become j^2 and the multipliers at C8 become 2*t*j-j^2. Splitting odd and even parts through positive and negative points can be thought of as using -1 as a square root of unity. If a 4th root of unity was available then a further split and speedup would be possible, but no such root exists for plain integers. Going to complex integers with i=sqrt(-1) doesn't help, essentially because in cartesian form it takes three real multiplies to do a complex multiply. The existence of 2^k'th roots of unity in a suitable ring or field lets the fast fourier transform keep splitting and get to O(N*log(r)). Floating point FFTs use complex numbers approximating Nth roots of unity. Some processors have special support for such FFTs. But these are not used in GMP since it's very difficult to guarantee an exact result (to some number of bits). An occasional difference of 1 in the last bit might not matter to a typical signal processing algorithm, but is of course of vital importance to GMP.  File: gmp.info, Node: Division Algorithms, Next: Greatest Common Divisor Algorithms, Prev: Multiplication Algorithms, Up: Algorithms Division Algorithms =================== * Menu: * Single Limb Division:: * Basecase Division:: * Divide and Conquer Division:: * Exact Division:: * Exact Remainder:: * Small Quotient Division::  File: gmp.info, Node: Single Limb Division, Next: Basecase Division, Prev: Division Algorithms, Up: Division Algorithms Single Limb Division -------------------- Nx1 division is implemented using repeated 2x1 divisions from high to low, either with a hardware divide instruction or a multiplication by inverse, whichever is best on a given CPU. The multiply by inverse follows section 8 of "Division by Invariant Integers using Multiplication" by Granlund and Montgomery (*note References::) and is implemented as `udiv_qrnnd_preinv' in `gmp-impl.h'. The idea is to have a fixed-point approximation to 1/d (see `invert_limb') and then multiply by the high limb (plus one bit) of the dividend to get a quotient q. With d normalized (high bit set), q is no more than 1 too small. Subtracting q*d from the dividend gives a remainder, and reveals whether q or q-1 is correct. The result is a division done with two multiplications and four or five arithmetic operations. On CPUs with low latency multipliers this can be much faster than a hardware divide, though the cost of calculating the inverse at the start may mean it's only better on inputs bigger than say 4 or 5 limbs. When a divisor must be normalized, either for the generic C `__udiv_qrnnd_c' or the multiply by inverse, the division performed is actually a*2^k by d*2^k where a is the dividend and k is the power necessary to have the high bit of d*2^k set. The bit shifts for the dividend are usually accomplished "on the fly" meaning by extracting the appropriate bits at each step. Done this way the quotient limbs come out aligned ready to store. 
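The structure of the repeated 2x1 division is sketched below, using a compiler provided 128-bit type to stand in for a hardware 2x1 divide. This is an illustration only, not GMP's code; `udiv_qrnnd_preinv' instead replaces the divide with multiplications by a precomputed inverse, which is not shown here.

     #include <stdint.h>

     /* Illustration only: Nx1 division as repeated 2x1 divisions from
        high to low, with 64-bit limbs.  Returns the remainder. */
     uint64_t nx1_divrem (uint64_t *qp, const uint64_t *ap, long n, uint64_t d)
     {
       uint64_t r = 0;                  /* running remainder, always < d */
       for (long i = n - 1; i >= 0; i--)
         {
           unsigned __int128 t = ((unsigned __int128) r << 64) | ap[i];
           qp[i] = (uint64_t) (t / d);
           r     = (uint64_t) (t % d);
         }
       return r;
     }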
When only the remainder is wanted, an alternative is to take the dividend limbs unshifted and calculate r = a mod d*2^k followed by an extra final step r*2^k mod d*2^k. This can help on CPUs with poor bit shifts or few registers. The multiply by inverse can be done two limbs at a time. The calculation is basically the same, but the inverse is two limbs and the divisor treated as if padded with a low zero limb. This means more work, since the inverse will need a 2x2 multiply, but the four 1x1s to do that are independent and can therefore be done partly or wholly in parallel. Likewise for a 2x1 calculating q*d. The net effect is to process two limbs with roughly the same two multiplies worth of latency that one limb at a time gives. This extends to 3 or 4 limbs at a time, though the extra work to apply the inverse will almost certainly soon reach the limits of multiplier throughput. A similar approach in reverse can be taken to process just half a limb at a time if the divisor is only a half limb. In this case the 1x1 multiply for the inverse effectively becomes two (1/2)x1 for each limb, which can be a saving on CPUs with a fast half limb multiply, or in fact if the only multiply is a half limb, and especially if it's not pipelined.  File: gmp.info, Node: Basecase Division, Next: Divide and Conquer Division, Prev: Single Limb Division, Up: Division Algorithms Basecase Division ----------------- Basecase NxM division is like long division done by hand, but in base 2^mp_bits_per_limb. See Knuth section 4.3.1 algorithm D, and `mpn/generic/sb_divrem_mn.c'. Briefly stated, while the dividend remains larger than the divisor, a high quotient limb is formed and the Nx1 product q*d subtracted at the top end of the dividend. With a normalized divisor (most significant bit set), each quotient limb can be formed with a 2x1 division and a 1x1 multiplication plus some subtractions. The 2x1 division is by the high limb of the divisor and is done either with a hardware divide or a multiply by inverse (the same as in *Note Single Limb Division::) whichever is faster. Such a quotient is sometimes one too big, requiring an addback of the divisor, but that happens rarely. With Q=N-M being the number of quotient limbs, this is an O(Q*M) algorithm and will run at a speed similar to a basecase QxM multiplication, differing in fact only in the extra multiply and divide for each of the Q quotient limbs.  File: gmp.info, Node: Divide and Conquer Division, Next: Exact Division, Prev: Basecase Division, Up: Division Algorithms Divide and Conquer Division --------------------------- For divisors larger than `DIV_DC_THRESHOLD', division is done by dividing. Or to be precise by a recursive divide and conquer algorithm based on work by Moenck and Borodin, Jebelean, and Burnikel and Ziegler (*note References::). The algorithm consists essentially of recognising that a 2NxN division can be done with the basecase division algorithm (*note Basecase Division::), but using N/2 limbs as a base, not just a single limb. This way the multiplications that arise are (N/2)x(N/2) and can take advantage of Karatsuba and higher multiplication algorithms (*note Multiplication Algorithms::). The "digits" of the quotient are formed by recursive Nx(N/2) divisions. If the (N/2)x(N/2) multiplies are done with a basecase multiplication then the work is about the same as a basecase division, but with more function call overheads and with some subtractions separated from the multiplies. 
These overheads mean that it's only when N/2 is above `MUL_KARATSUBA_THRESHOLD' that divide and conquer is of use. `DIV_DC_THRESHOLD' is based on the divisor size N, so it will be somewhere above twice `MUL_KARATSUBA_THRESHOLD', but how much above depends on the CPU. An optimized `mpn_mul_basecase' can lower `DIV_DC_THRESHOLD' a little by offering a ready-made advantage over repeated `mpn_submul_1' calls. Divide and conquer is asymptotically O(M(N)*log(N)) where M(N) is the time for an NxN multiplication done with FFTs. The actual time is a sum over multiplications of the recursed sizes, as can be seen near the end of section 2.2 of Burnikel and Ziegler. For example, within the Toom-3 range, divide and conquer is 2.63*M(N). With higher algorithms the M(N) term improves and the multiplier tends to log(N). In practice, at moderate to large sizes, a 2NxN division is about 2 to 4 times slower than an NxN multiplication. Newton's method used for division is asymptotically O(M(N)) and should therefore be superior to divide and conquer, but it's believed this would only be for large to very large N.  File: gmp.info, Node: Exact Division, Next: Exact Remainder, Prev: Divide and Conquer Division, Up: Division Algorithms Exact Division -------------- A so-called exact division is when the dividend is known to be an exact multiple of the divisor. Jebelean's exact division algorithm uses this knowledge to make some significant optimizations (*note References::). The idea can be illustrated in decimal for example with 368154 divided by 543. Because the low digit of the dividend is 4, the low digit of the quotient must be 8. This is arrived at from 4*7 mod 10, using the fact 7 is the modular inverse of 3 (the low digit of the divisor), since 3*7 == 1 mod 10. So 8*543=4344 can be subtracted from the dividend leaving 363810. Notice the low digit has become zero. The procedure is repeated at the second digit, with the next quotient digit 7 (7 == 1*7 mod 10), subtracting 7*543=3801, leaving 325800. And finally at the third digit with quotient digit 6 (8*7 mod 10), subtracting 6*543=3258 leaving 0. So the quotient is 678. Notice however that the multiplies and subtractions don't need to extend past the low three digits of the dividend, since that's enough to determine the three quotient digits. For the last quotient digit no subtraction is needed at all. On a 2NxN division like this one, only about half the work of a normal basecase division is necessary. For an NxM exact division producing Q=N-M quotient limbs, the saving over a normal basecase division is in two parts. Firstly, each of the Q quotient limbs needs only one multiply, not a 2x1 divide and multiply. Secondly, the crossproducts are reduced when Q>M to Q*M-M*(M+1)/2, or when Q<=M to Q*(Q-1)/2. Notice the savings are complementary. If Q is big then many divisions are saved, or if Q is small then the crossproducts reduce to a small number. The modular inverse used is calculated efficiently by `modlimb_invert' in `gmp-impl.h'. This does four multiplies for a 32-bit limb, or six for a 64-bit limb. `tune/modlinv.c' has some alternate implementations that might suit processors better at bit twiddling than multiplying. The sub-quadratic exact division described by Jebelean in "Exact Division with Karatsuba Complexity" is not currently implemented. It uses a rearrangement similar to the divide and conquer for normal division (*note Divide and Conquer Division::), but operating from low to high. 
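The single limb case of this exact division can be sketched as follows. This is only an illustration in the same spirit as the routines described above, not GMP's code: function names are invented, limbs are assumed to be 64 bits, the divisor is assumed odd and the division exact. Each quotient limb is one multiply by the inverse of d mod 2^64, and only the high half of q*d is needed to form the borrow for the next limb.

     #include <stdint.h>

     /* Inverse of an odd d modulo 2^64 by Newton iterations; each step
        doubles the number of correct low bits (3, 6, 12, ...). */
     static uint64_t limb_invert (uint64_t d)
     {
       uint64_t x = d;                  /* correct to 3 bits for odd d */
       for (int i = 0; i < 6; i++)
         x *= 2 - d * x;
       return x;                        /* d * x == 1 mod 2^64 */
     }

     /* Illustration only: exact Nx1 division by an odd limb d,
        quotient limbs produced from low to high. */
     void divexact_nx1 (uint64_t *qp, const uint64_t *ap, long n, uint64_t d)
     {
       uint64_t inv = limb_invert (d), borrow = 0;
       for (long i = 0; i < n; i++)
         {
           uint64_t s = ap[i] - borrow;
           uint64_t c = ap[i] < borrow;        /* borrow out of the subtract */
           uint64_t q = s * inv;               /* quotient limb, mod 2^64 */
           qp[i] = q;
           borrow = (uint64_t) (((unsigned __int128) q * d) >> 64) + c;
         }
     }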
A further possibility not currently implemented is "Bidirectional Exact Integer Division" by Krandick and Jebelean which forms quotient limbs from both the high and low ends of the dividend, and can halve once more the number of crossproducts needed in a 2NxN division. A special case exact division by 3 exists in `mpn_divexact_by3', supporting Toom-3 multiplication and `mpq' canonicalizations. It forms quotient digits with a multiply by the modular inverse of 3 (which is `0xAA..AAB') and uses two comparisons to determine a borrow for the next limb. The multiplications don't need to be on the dependent chain, as long as the effect of the borrows is applied. Only a few optimized assembler implementations currently exist.  File: gmp.info, Node: Exact Remainder, Next: Small Quotient Division, Prev: Exact Division, Up: Division Algorithms Exact Remainder --------------- If the exact division algorithm is done with a full subtraction at each stage and the dividend isn't a multiple of the divisor, then low zero limbs are produced but with a remainder in the high limbs. For dividend a, divisor d, quotient q, and b = 2^mp_bits_per_limb, then this remainder r is of the form

     a = q*d + r*b^n

n represents the number of zero limbs produced by the subtractions, that being the number of limbs produced for q. r will be in the range 0<=r<d and can be viewed as a remainder, but one shifted up by a factor of b^n.  File: gmp.info, Node: Greatest Common Divisor Algorithms, Next: Powering Algorithms, Prev: Division Algorithms, Up: Algorithms Greatest Common Divisor ======================= * Menu: * Binary GCD:: * Accelerated GCD:: * Extended GCD:: * Jacobi Symbol::  File: gmp.info, Node: Binary GCD, Next: Accelerated GCD, Prev: Greatest Common Divisor Algorithms, Up: Greatest Common Divisor Algorithms Binary GCD ---------- At small sizes GMP uses an O(N^2) binary style GCD. This is described in many textbooks, for example Knuth section 4.5.2 algorithm B. It simply consists of successively reducing operands a and b using gcd(a,b) = gcd(min(a,b),abs(a-b)), and also that if a and b are first made odd then abs(a-b) is even and factors of two can be discarded. Variants like letting a-b become negative and doing a different next step are of interest only as far as they suit particular CPUs, since on small operands it's machine dependent factors that determine performance. The Euclidean GCD algorithm, as per Knuth algorithms E and A, reduces using a mod b but this has so far been found to be slower everywhere. One reason the binary method does well is that the implied quotient at each step is usually small, so often only one or two subtractions are needed to get the same effect as a division. Quotients 1, 2 and 3 for example occur 67.7% of the time, see Knuth section 4.5.3 Theorem E. When the implied quotient is large, meaning b is much smaller than a, then a division is worthwhile. This is the basis for the initial a mod b reductions in `mpn_gcd' and `mpn_gcd_1' (the latter for both Nx1 and 1x1 cases). But after that initial reduction, big quotients occur too rarely to make it worth checking for them.  File: gmp.info, Node: Accelerated GCD, Next: Extended GCD, Prev: Binary GCD, Up: Greatest Common Divisor Algorithms Accelerated GCD --------------- For sizes above `GCD_ACCEL_THRESHOLD', GMP uses the Accelerated GCD algorithm described independently by Weber and Jebelean (the latter as the "Generalized Binary" algorithm), *note References::.
This algorithm is still O(N^2), but is much faster than the binary algorithm since it does fewer multi-precision operations. It consists of alternating the k-ary reduction by Sorenson, and a "dmod" exact remainder reduction. For operands u and v the k-ary reduction replaces u with n*v-d*u where n and d are single limb values chosen to give two trailing zero limbs on that value, which can be stripped. n and d are calculated using an algorithm similar to half of a two limb GCD (see `find_a' in `mpn/generic/gcd.c'). When u and v differ in size by more than a certain number of bits, a dmod is performed to zero out bits at the low end of the larger. It consists of an exact remainder style division applied to an appropriate number of bits (*note Exact Division::, and *note Exact Remainder::). This is faster than a k-ary reduction but useful only when the operands differ in size. There's a dmod after each k-ary reduction, and if the dmod leaves the operands still differing in size then it's repeated. The k-ary reduction step can introduce spurious factors into the GCD calculated, and these are eliminated at the end by taking GCDs with the original inputs gcd(u,gcd(v,g)) using the binary algorithm. Since g is almost always small this takes very little time. At small sizes the algorithm needs a good implementation of `find_a'. At larger sizes it's dominated by `mpn_addmul_1' applying n and d.  File: gmp.info, Node: Extended GCD, Next: Jacobi Symbol, Prev: Accelerated GCD, Up: Greatest Common Divisor Algorithms Extended GCD ------------ The extended GCD calculates gcd(a,b) and also cofactors x and y satisfying a*x+b*y=gcd(a,b). Lehmer's multi-step improvement of the extended Euclidean algorithm is used. See Knuth section 4.5.2 algorithm L, and `mpn/generic/gcdext.c'. This is an O(N^2) algorithm. The multipliers at each step are found using single limb calculations for sizes up to `GCDEXT_THRESHOLD', or double limb calculations above that. The single limb code is faster but doesn't produce full-limb multipliers, hence not making full use of the `mpn_addmul_1' calls. When a CPU has a data-dependent multiplier, meaning one which is faster on operands with fewer bits, the extra work in the double-limb calculation might only save some looping overheads, leading to a large `GCDEXT_THRESHOLD'. Currently the single limb calculation doesn't optimize for the small quotients that often occur, and this can lead to unusually low values of `GCDEXT_THRESHOLD', depending on the CPU. An analysis of double-limb calculations can be found in "A Double-Digit Lehmer-Euclid Algorithm" by Jebelean (*note References::). The code in GMP was developed independently. It should be noted that when a double limb calculation is used, it's used for the whole of that GCD, it doesn't fall back to single limb part way through. This is because as the algorithm proceeds, the inputs a and b are reduced, but the cofactors x and y grow, so the multipliers at each step are applied to a roughly constant total number of limbs.  File: gmp.info, Node: Jacobi Symbol, Prev: Extended GCD, Up: Greatest Common Divisor Algorithms Jacobi Symbol ------------- `mpz_jacobi' and `mpz_kronecker' are currently implemented with a simple binary algorithm similar to that described for the GCDs (*note Binary GCD::). They're not very fast when both inputs are large. Lehmer's multi-step improvement or a binary based multi-step algorithm is likely to be better. 
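The extended GCD described above is available through the documented `mpz_gcdext'. A small usage example follows, with arbitrary sample operands, checking the cofactor identity a*s + b*t == g.

     #include <stdio.h>
     #include <gmp.h>

     int main (void)
     {
       mpz_t a, b, g, s, t, check;
       mpz_init_set_str (a, "123456789012345678901234567890", 10);
       mpz_init_set_str (b, "987654321098765432109876543210", 10);
       mpz_init (g); mpz_init (s); mpz_init (t); mpz_init (check);

       mpz_gcdext (g, s, t, a, b);      /* g = gcd(a,b), a*s + b*t == g */

       mpz_mul (check, a, s);
       mpz_addmul (check, b, t);        /* check = a*s + b*t */
       gmp_printf ("g=%Zd  ok=%d\n", g, mpz_cmp (check, g) == 0);

       mpz_clear (a); mpz_clear (b); mpz_clear (g);
       mpz_clear (s); mpz_clear (t); mpz_clear (check);
       return 0;
     }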
When one operand fits a single limb, and that includes `mpz_kronecker_ui' and friends, an initial reduction is done with either `mpn_mod_1' or `mpn_modexact_1_odd', followed by the binary algorithm on a single limb. The binary algorithm is well suited to a single limb, and the whole calculation in this case is quite efficient. In all the routines sign changes for the result are accumulated using some bit twiddling, avoiding table lookups or conditional jumps.  File: gmp.info, Node: Powering Algorithms, Next: Root Extraction Algorithms, Prev: Greatest Common Divisor Algorithms, Up: Algorithms Powering Algorithms =================== * Menu: * Normal Powering Algorithm:: * Modular Powering Algorithm::  File: gmp.info, Node: Normal Powering Algorithm, Next: Modular Powering Algorithm, Prev: Powering Algorithms, Up: Powering Algorithms Normal Powering --------------- Normal `mpz' or `mpf' powering uses a simple binary algorithm, successively squaring and then multiplying by the base when a 1 bit is seen in the exponent, as per Knuth section 4.6.3. The "left to right" variant described there is used rather than algorithm A, since it's just as easy and can be done with somewhat less temporary memory.  File: gmp.info, Node: Modular Powering Algorithm, Prev: Normal Powering Algorithm, Up: Powering Algorithms Modular Powering ---------------- Modular powering is implemented using a 2^k-ary sliding window algorithm, as per "Handbook of Applied Cryptography" algorithm 14.85 (*note References::). k is chosen according to the size of the exponent. Larger exponents use larger values of k, the choice being made to minimize the average number of multiplications that must supplement the squaring. The modular multiplies and squares use either a simple division or the REDC method by Montgomery (*note References::). REDC is a little faster, essentially saving N single limb divisions in a fashion similar to an exact remainder (*note Exact Remainder::). The current REDC has some limitations. It's only O(N^2) so above `POWM_THRESHOLD' division becomes faster and is used. It doesn't attempt to detect small bases, but rather always uses a REDC form, which is usually a full size operand. And lastly it's only applied to odd moduli.  File: gmp.info, Node: Root Extraction Algorithms, Next: Radix Conversion Algorithms, Prev: Powering Algorithms, Up: Algorithms Root Extraction Algorithms ========================== * Menu: * Square Root Algorithm:: * Nth Root Algorithm:: * Perfect Square Algorithm:: * Perfect Power Algorithm::  File: gmp.info, Node: Square Root Algorithm, Next: Nth Root Algorithm, Prev: Root Extraction Algorithms, Up: Root Extraction Algorithms Square Root ----------- Square roots are taken using the "Karatsuba Square Root" algorithm by Paul Zimmermann (*note References::). This is expressed in a divide and conquer form, but as noted in the paper it can also be viewed as a discrete variant of Newton's method. In the Karatsuba multiplication range this is an O(1.5*M(N/2)) algorithm, where M(n) is the time to multiply two numbers of n limbs. In the FFT multiplication range this grows to a bound of O(6*M(N/2)). In practice a factor of about 1.5 to 1.8 is found in the Karatsuba and Toom-3 ranges, growing to 2 or 3 in the FFT range. The algorithm does all its calculations in integers and the resulting `mpn_sqrtrem' is used for both `mpz_sqrt' and `mpf_sqrt'. The extended precision given by `mpf_sqrt_ui' is obtained by padding with zero limbs.  
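For illustration, the plain left to right binary powering described above (*note Normal Powering Algorithm::) can be sketched at the `mpz' level as follows. This is only the basic method, without the sliding window, modular reduction or REDC refinements used by `mpz_powm'; the function name is invented for the example and r must be a variable distinct from base.

     #include <gmp.h>

     /* Illustration only: square for every exponent bit, and multiply in
        the base whenever a 1 bit is seen, working from high to low. */
     void pow_binary (mpz_t r, const mpz_t base, unsigned long e)
     {
       mpz_set_ui (r, 1);
       if (e == 0)
         return;

       int bits = 0;
       for (unsigned long t = e; t != 0; t >>= 1)
         bits++;                            /* position of the high 1 bit */

       for (int i = bits - 1; i >= 0; i--)
         {
           mpz_mul (r, r, r);               /* square */
           if ((e >> i) & 1)
             mpz_mul (r, r, base);          /* multiply by the base */
         }
     }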
File: gmp.info, Node: Nth Root Algorithm, Next: Perfect Square Algorithm, Prev: Square Root Algorithm, Up: Root Extraction Algorithms Nth Root -------- Integer Nth roots are taken using Newton's method with the following iteration, where A is the input and n is the root to be taken.

     a[i+1] = ( A/a[i]^(n-1) + (n-1)*a[i] ) / n

The initial approximation a[1] is generated bitwise by successively powering a trial root with or without new 1 bits, aiming to be just above the true root. The iteration converges quadratically when started from a good approximation. When n is large more initial bits are needed to get good convergence. The current implementation is not particularly well optimized.  File: gmp.info, Node: Perfect Square Algorithm, Next: Perfect Power Algorithm, Prev: Nth Root Algorithm, Up: Root Extraction Algorithms Perfect Square -------------- `mpz_perfect_square_p' is able to quickly exclude most non-squares by checking whether the input is a quadratic residue modulo some small integers. The first test is modulo 256 which means simply examining the least significant byte. Only 44 different values occur as the low byte of a square, so 82.8% of non-squares can be immediately excluded. Similar tests modulo primes from 3 to 29 exclude 99.5% of those remaining, or if a limb is 64 bits then primes up to 53 are used, excluding 99.99%. A single Nx1 remainder using `PP' from `gmp-impl.h' quickly gives all these remainders. A square root must still be taken for any value that passes the residue tests, to verify it's really a square and not one of the 0.086% (or 0.000156% for 64 bits) non-squares that get through. *Note Square Root Algorithm::.  File: gmp.info, Node: Perfect Power Algorithm, Prev: Perfect Square Algorithm, Up: Root Extraction Algorithms Perfect Power ------------- Detecting perfect powers is required by some factorization algorithms. Currently `mpz_perfect_power_p' is implemented using repeated Nth root extractions, though naturally only prime roots need to be considered. (*Note Nth Root Algorithm::.) If a prime divisor p with multiplicity e can be found, then only roots which are divisors of e need to be considered, much reducing the work necessary. To this end divisibility by a set of small primes is checked.  File: gmp.info, Node: Radix Conversion Algorithms, Next: Other Algorithms, Prev: Root Extraction Algorithms, Up: Algorithms Radix Conversion ================ Radix conversions are less important than other algorithms. A program dominated by conversions should probably use a different data representation. * Menu: * Binary to Radix:: * Radix to Binary::  File: gmp.info, Node: Binary to Radix, Next: Radix to Binary, Prev: Radix Conversion Algorithms, Up: Radix Conversion Algorithms Binary to Radix --------------- Conversions from binary to a power-of-2 radix use a simple and fast O(N) bit extraction algorithm. Conversions from binary to other radices use one of two algorithms. Sizes below `GET_STR_PRECOMPUTE_THRESHOLD' use a basic O(N^2) method. Repeated divisions by b^n are made, where b is the radix and n is the biggest power that fits in a limb. But instead of simply using the remainder r from such divisions, an extra divide step is done to give a fractional limb representing r/b^n. The digits of r can then be extracted using multiplications by b rather than divisions. Special case code is provided for decimal, allowing multiplications by 10 to optimize to shifts and adds. Above `GET_STR_PRECOMPUTE_THRESHOLD' a sub-quadratic algorithm is used.
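The quadratic residue filtering described under Perfect Square above might look like the following at single limb size. This is an illustration only, not `mpz_perfect_square_p' itself: the mod 256 table is built at run time, and a floating point estimate is fixed up before the final exact check.

     #include <stdint.h>
     #include <math.h>

     /* Illustration only: reject most non-squares with a mod 256 residue
        test, then take a square root only for the values that pass. */
     int is_square (uint64_t n)
     {
       static unsigned char sq256[256];
       static int built = 0;
       if (!built)
         {
           for (unsigned i = 0; i < 256; i++)
             sq256[(i * i) & 0xff] = 1;     /* the 44 square residues */
           built = 1;
         }

       if (!sq256[n & 0xff])
         return 0;                          /* 82.8% of non-squares stop here */

       uint64_t r = (uint64_t) sqrt ((double) n);
       if (r > 0xffffffffUL) r = 0xffffffffUL;
       while (r * r > n) r--;               /* fix up the float estimate */
       while (r < 0xffffffffUL && (r + 1) * (r + 1) <= n) r++;
       return r * r == n;                   /* exact verification */
     }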
For an input t, powers b^(n*2^i) of the radix are calculated, until a power between t and sqrt(t) is reached. t is then divided by that largest power, giving a quotient which is the digits above that power, and a remainder which is those below. These two parts are in turn divided by the second highest power, and so on recursively. When a piece has been divided down to less than `GET_STR_DC_THRESHOLD' limbs, the basecase algorithm described above is used. The advantage of this algorithm is that big divisions can make use of the sub-quadratic divide and conquer division (*note Divide and Conquer Division::), and big divisions tend to have less overheads than lots of separate single limb divisions anyway. But in any case the cost of calculating the powers b^(n*2^i) must first be overcome. `GET_STR_PRECOMPUTE_THRESHOLD' and `GET_STR_DC_THRESHOLD' represent the same basic thing, the point where it becomes worth doing a big division to cut the input in half. `GET_STR_PRECOMPUTE_THRESHOLD' includes the cost of calculating the radix power required, whereas `GET_STR_DC_THRESHOLD' assumes that's already available, which is the case when recursing. Since the base case produces digits from least to most significant but they want to be stored from most to least, it's necessary to calculate in advance how many digits there will be, or at least be sure not to underestimate that. For GMP the number of input bits is multiplied by `chars_per_bit_exactly' from `mp_bases', rounding up. The result is either correct or one too big. Examining some of the high bits of the input could increase the chance of getting the exact number of digits, but an exact result every time would not be practical, since in general the difference between numbers 100... and 99... is only in the last few bits and the work to identify 99... might well be almost as much as a full conversion. `mpf_get_str' doesn't currently use the algorithm described here, it multiplies or divides by a power of b to move the radix point to just above the highest non-zero digit (or at worst one above that location), then multiplies by b^n to bring out digits. This is O(N^2) and is certainly not optimal. The r/b^n scheme described above for using multiplications to bring out digits might be useful for more than a single limb. Some brief experiments with it on the base case when recursing didn't give a noticeable improvement, but perhaps that was only due to the implementation. Something similar would work for the sub-quadratic divisions too, though there would be the cost of calculating a bigger radix power. Another possible improvement for the sub-quadratic part would be to arrange for radix powers that balanced the sizes of quotient and remainder produced, ie. the highest power would be a b^(n*k) approximately equal to sqrt(t), not restricted to a 2^i factor. That ought to smooth out a graph of times against sizes, but may or may not be a net speedup.  File: gmp.info, Node: Radix to Binary, Prev: Binary to Radix, Up: Radix Conversion Algorithms Radix to Binary --------------- Conversions from a power-of-2 radix into binary use a simple and fast O(N) bitwise concatenation algorithm. Conversions from other radices use one of two algorithms. Sizes below `SET_STR_THRESHOLD' use a basic O(N^2) method. Groups of n digits are converted to limbs, where n is the biggest power of the base b which will fit in a limb, then those groups are accumulated into the result by multiplying by b^n and adding.
This saves multi-precision operations, as per Knuth section 4.4 part E (*note References::). Some special case code is provided for decimal, giving the compiler a chance to optimize multiplications by 10. Above `SET_STR_THRESHOLD' a sub-quadratic algorithm is used. First groups of n digits are converted into limbs. Then adjacent limbs are combined into limb pairs with x*b^n+y, where x and y are the limbs. Adjacent limb pairs are combined into quads similarly with x*b^(2n)+y. This continues until a single block remains, that being the result. The advantage of this method is that the multiplications for each x are big blocks, allowing Karatsuba and higher algorithms to be used. But the cost of calculating the powers b^(n*2^i) must be overcome. `SET_STR_THRESHOLD' usually ends up quite big, around 5000 digits, and on some processors much bigger still. `SET_STR_THRESHOLD' is based on the input digits (and tuned for decimal), though it might be better based on a limb count, so as to be independent of the base. But that sort of count isn't used by the base case and so would need some sort of initial calculation or estimate. The main reason `SET_STR_THRESHOLD' is so much bigger than the corresponding `GET_STR_PRECOMPUTE_THRESHOLD' is that `mpn_mul_1' is much faster than `mpn_divrem_1' (often by a factor of 10, or more).  File: gmp.info, Node: Other Algorithms, Next: Assembler Coding, Prev: Radix Conversion Algorithms, Up: Algorithms Other Algorithms ================ * Menu: * Factorial Algorithm:: * Binomial Coefficients Algorithm:: * Fibonacci Numbers Algorithm:: * Lucas Numbers Algorithm::  File: gmp.info, Node: Factorial Algorithm, Next: Binomial Coefficients Algorithm, Prev: Other Algorithms, Up: Other Algorithms Factorial --------- Factorials n! are calculated by a simple product from 1 to n, but arranged into certain sub-products. First as many factors as fit in a limb are accumulated, then two of those multiplied to give a 2-limb product. When two 2-limb products are ready they're multiplied to a 4-limb product, and when two 4-limbs are ready they're multiplied to an 8-limb product, etc. A stack of outstanding products is built up, with two of the same size multiplied together when ready. Arranging for multiplications to have operands the same (or nearly the same) size means the Karatsuba and higher multiplication algorithms can be used. And even on sizes below the Karatsuba threshold an NxN multiply will give a basecase multiply more to work on. An obvious improvement not currently implemented would be to strip factors of 2 from the products and apply them at the end with a bit shift. Another possibility would be to determine the prime factorization of the result (which can be done easily), and use a powering method, at each stage squaring then multiplying in those primes with a 1 in their exponent at that point. The advantage would be some multiplies turned into squares.  File: gmp.info, Node: Binomial Coefficients Algorithm, Next: Fibonacci Numbers Algorithm, Prev: Factorial Algorithm, Up: Other Algorithms Binomial Coefficients --------------------- Binomial coefficients C(n,k) are calculated by first arranging k <= n/2 using C(n,k) = C(n,n-k) if necessary, and then evaluating the following product simply from i=2 to i=k.

     C(n,k) = (n-k+1) * product for i=2 to k of (n-k+i)/i

It's easy to show that each denominator i will divide the product so far, so the exact division algorithm is used (*note Exact Division::).
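A sketch of that product evaluated term by term with exact division follows. It is only an illustration (the function name is invented, and k is assumed to satisfy 1 <= k <= n/2), not the `mpz_bin_ui' code, which also accumulates limb sized factors as described next.

     #include <gmp.h>

     /* Illustration only: C(n,k) by the product above.  After each
        multiply the running value is divisible by i, so mpz_divexact_ui
        can be used for the division. */
     void binom (mpz_t r, unsigned long n, unsigned long k)
     {
       mpz_set_ui (r, n - k + 1);
       for (unsigned long i = 2; i <= k; i++)
         {
           mpz_mul_ui (r, r, n - k + i);    /* next numerator */
           mpz_divexact_ui (r, r, i);       /* divides exactly */
         }
     }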
The numerators n-k+i and denominators i are first accumulated into as many as fit a limb, to save multi-precision operations, though for `mpz_bin_ui' this applies only to the divisors, since n is an `mpz_t' and n-k+i in general won't fit in a limb at all. An obvious improvement would be to strip factors of 2 from each multiplier and divisor and count them separately, to be applied with a bit shift at the end. Factors of 3 and perhaps 5 could even be handled similarly. Another possibility, if n is not too big, would be to determine the prime factorization of the result based on the factorials involved, and power up those primes appropriately. This would help most when k is near n/2.  File: gmp.info, Node: Fibonacci Numbers Algorithm, Next: Lucas Numbers Algorithm, Prev: Binomial Coefficients Algorithm, Up: Other Algorithms Fibonacci Numbers ----------------- The Fibonacci functions `mpz_fib_ui' and `mpz_fib2_ui' are designed for calculating isolated F[n] or F[n],F[n-1] values efficiently. For small n, a table of single limb values in `__gmp_fib_table' is used. On a 32-bit limb this goes up to F[47], or on a 64-bit limb up to F[93]. For convenience the table starts at F[-1]. Beyond the table, values are generated with a binary powering algorithm, calculating a pair F[n] and F[n-1] working from high to low across the bits of n. The formulas used are

     F[2k+1] = 4*F[k]^2 - F[k-1]^2 + 2*(-1)^k
     F[2k-1] = F[k]^2 + F[k-1]^2
     F[2k]   = F[2k+1] - F[2k-1]

At each step, k is the high b bits of n. If the next bit of n is 0 then F[2k],F[2k-1] is used, or if it's a 1 then F[2k+1],F[2k] is used, and the process repeated until all bits of n are incorporated. Notice these formulas require just two squares per bit of n. It'd be possible to handle the first few n above the single limb table with simple additions, using the defining Fibonacci recurrence F[k+1]=F[k]+F[k-1], but this is not done since it usually turns out to be faster for only about 10 or 20 values of n, and including a block of code for just those doesn't seem worthwhile. If they really mattered it'd be better to extend the data table. Using a table avoids lots of calculations on small numbers, and makes small n go fast. A bigger table would make more small n go fast, it's just a question of balancing size against desired speed. For GMP the code is kept compact, with the emphasis primarily on a good powering algorithm. `mpz_fib2_ui' returns both F[n] and F[n-1], but `mpz_fib_ui' is only interested in F[n]. In this case the last step of the algorithm can become one multiply instead of two squares. One of the following two formulas is used, according as n is odd or even.

     F[2k]   = F[k]*(F[k]+2F[k-1])
     F[2k+1] = (2F[k]+F[k-1])*(2F[k]-F[k-1]) + 2*(-1)^k

F[2k+1] here is the same as above, just rearranged to be a multiply. For interest, the 2*(-1)^k term both here and above can be applied just to the low limb of the calculation, without a carry or borrow into further limbs, which saves some code size. See comments with `mpz_fib_ui' and the internal `mpn_fib2_ui' for how this is done.  File: gmp.info, Node: Lucas Numbers Algorithm, Prev: Fibonacci Numbers Algorithm, Up: Other Algorithms Lucas Numbers ------------- `mpz_lucnum2_ui' derives a pair of Lucas numbers from a pair of Fibonacci numbers with the following simple formulas.

     L[k]   = F[k] + 2*F[k-1]
     L[k-1] = 2*F[k] - F[k-1]

`mpz_lucnum_ui' is only interested in L[n], and some work can be saved. Trailing zero bits on n can be handled with a single square each.
L[2k] = L[k]^2 - 2*(-1)^k And the lowest 1 bit can be handled with one multiply of a pair of Fibonacci numbers, similar to what `mpz_fib_ui' does. L[2k+1] = 5*F[k-1]*(2*F[k]+F[k-1]) - 4*(-1)^k  File: gmp.info, Node: Assembler Coding, Prev: Other Algorithms, Up: Algorithms Assembler Coding ================ The assembler subroutines in GMP are the most significant source of speed at small to moderate sizes. At larger sizes algorithm selection becomes more important, but of course speedups in low level routines will still speed up everything proportionally. Carry handling and widening multiplies that are important for GMP can't be easily expressed in C. GCC `asm' blocks help a lot and are provided in `longlong.h', but hand coding low level routines invariably offers a speedup over generic C by a factor of anything from 2 to 10. * Menu: * Assembler Code Organisation:: * Assembler Basics:: * Assembler Carry Propagation:: * Assembler Cache Handling:: * Assembler Floating Point:: * Assembler SIMD Instructions:: * Assembler Software Pipelining:: * Assembler Loop Unrolling::  File: gmp.info, Node: Assembler Code Organisation, Next: Assembler Basics, Prev: Assembler Coding, Up: Assembler Coding Code Organisation ----------------- The various `mpn' subdirectories contain machine-dependent code, written in C or assembler. The `mpn/generic' subdirectory contains default code, used when there's no machine-specific version of a particular file. Each `mpn' subdirectory is for an ISA family. Generally 32-bit and 64-bit variants in a family cannot share code and will have separate directories. Within a family further subdirectories may exist for CPU variants.  File: gmp.info, Node: Assembler Basics, Next: Assembler Carry Propagation, Prev: Assembler Code Organisation, Up: Assembler Coding Assembler Basics ---------------- `mpn_addmul_1' and `mpn_submul_1' are the most important routines for overall GMP performance. All multiplications and divisions come down to repeated calls to these. `mpn_add_n', `mpn_sub_n', `mpn_lshift' and `mpn_rshift' are next most important. On some CPUs assembler versions of the internal functions `mpn_mul_basecase' and `mpn_sqr_basecase' give significant speedups, mainly through avoiding function call overheads. They can also potentially make better use of a wide superscalar processor. The restrictions on overlaps between sources and destinations (*note Low-level Functions::) are designed to facilitate a variety of implementations. For example, knowing `mpn_add_n' won't have partly overlapping sources and destination means reading can be done far ahead of writing on superscalar processors, and loops can be vectorized on a vector processor, depending on the carry handling.  File: gmp.info, Node: Assembler Carry Propagation, Next: Assembler Cache Handling, Prev: Assembler Basics, Up: Assembler Coding Carry Propagation ----------------- The problem that presents most challenges in GMP is propagating carries from one limb to the next. In functions like `mpn_addmul_1' and `mpn_add_n', carries are the only dependencies between limb operations. On processors with carry flags, a straightforward CISC style `adc' is generally best. AMD K6 `mpn_addmul_1' however is an example of an unusual set of circumstances where a branch works out better. On RISC processors generally an add and compare for overflow is used. This sort of thing can be seen in `mpn/generic/aors_n.c'. 
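In C, that add-and-compare idiom looks like the following sketch of an `mpn_add_n' style loop. This is an illustration only, not the code in `mpn/generic/aors_n.c'; 64-bit limbs are assumed.

     #include <stdint.h>

     /* Illustration only: limb vector addition with the carry detected
        by comparing each sum against an operand.  Returns the carry out
        of the top limb. */
     uint64_t add_n (uint64_t *rp, const uint64_t *ap, const uint64_t *bp, long n)
     {
       uint64_t cy = 0;
       for (long i = 0; i < n; i++)
         {
           uint64_t a = ap[i], b = bp[i];
           uint64_t s = a + b;
           uint64_t c1 = s < a;          /* carry out of a + b */
           uint64_t r = s + cy;
           uint64_t c2 = r < s;          /* carry out of adding the carry-in */
           rp[i] = r;
           cy = c1 | c2;                 /* at most one of them can be set */
         }
       return cy;
     }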
Some carry propagation schemes require 4 instructions, meaning at least 4 cycles per limb, but other schemes may use just 1 or 2. On wide superscalar processors performance may be completely determined by the number of dependent instructions between carry-in and carry-out for each limb. On vector processors good use can be made of the fact that a carry bit only very rarely propagates more than one limb. When adding a single bit to a limb, there's only a carry out if that limb was `0xFF...FF' which on random data will be only 1 in 2^mp_bits_per_limb. `mpn/cray/add_n.c' is an example of this, it adds all limbs in parallel, adds one set of carry bits in parallel and then only rarely needs to fall through to a loop propagating further carries. On the x86s, GCC (as of version 2.95.2) doesn't generate particularly good code for the RISC style idioms that are necessary to handle carry bits in C. Often conditional jumps are generated where `adc' or `sbb' forms would be better. And so unfortunately almost any loop involving carry bits needs to be coded in assembler for best results.
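The rarely-propagating scheme can be sketched in C as follows. This is only an illustration of the idea, not the Cray code: a first pass does independent adds and records the carry flags, and a second pass folds those carries in, looping further only in the rare case that a limb is all ones.

     #include <stdint.h>

     /* Illustration only: two-pass addition in which pass 1 is free of
        carry dependencies between limbs.  Assumes n >= 1. */
     uint64_t add_n_twopass (uint64_t *rp, const uint64_t *ap,
                             const uint64_t *bp, long n)
     {
       unsigned char cy[n];                 /* C99 variable length array */
       for (long i = 0; i < n; i++)         /* pass 1: independent adds */
         {
           rp[i] = ap[i] + bp[i];
           cy[i] = rp[i] < ap[i];
         }

       uint64_t cout = cy[n - 1];
       for (long i = 0; i + 1 < n; i++)     /* pass 2: apply the carries */
         if (cy[i])
           {
             long j = i + 1;
             while (j < n && rp[j] == ~(uint64_t) 0)
               rp[j++] = 0;                 /* rare: carry keeps propagating */
             if (j < n)
               rp[j]++;
             else
               cout = 1;                    /* carried out of the top limb */
           }
       return cout;
     }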