GMP MULTIPLICATION


This file describes briefly the multiplication and squaring used in GMP.
The code is likely to be hard to understand without knowing something about
the algorithms.

GMP does NxN limb multiplications and squares using one of four algorithms,
according to the size N.

         Algorithm      Sizes

         basecase    < KARATSUBA_MUL_THRESHOLD
	 karatsuba   >= KARATSUBA_MUL_THRESHOLD
	 toom3       >= TOOM3_MUL_THRESHOLD
	 fft         >= FFT_MUL_THRESHOLD

Similarly for squaring, with the SQR thresholds.  Note though that the FFT
is only used if GMP is configured with --enable-fft.

MxN multiplications of operands with different sizes are currently done by
splitting it up into various NxN pieces while above KARATSUBA_MUL_THRESHOLD.
The karatsuba and toom3 routines then operate only on equal size operands.
This is rather inefficient, and is slated for improvement in the future.


BASECASE MULTIPLY

This a straightforward rectangular set of cross-products, the same as long
multiplication done by hand and for that reason sometimes known as the
schoolbook or grammar school method.

See Knuth (reference in gmp.texi) volume 2 section 4.3.1 algorithm M, or the
mpn/generic/mul_basecase.c code.

Assembler implementations of mul_basecase are essentially the same as the
generic C version, but have all the usual assembler tricks and obscurities
introduced for speed.


BASECASE SQUARE

A square can be done in roughly half the time of a multiply, by using the
fact that the cross products above and below the diagonal are the same.  In
practice squaring isn't 2x faster than multiplying, but it's always faster
by a decent factor.

             u0  u1  u2  u3  u4
           +---+---+---+---+---+
        u0 | x |   |   |   |   |
           +---+---+---+---+---+
        u1 |   | x |   |   |   |
           +---+---+---+---+---+
        u2 |   |   | x |   |   |
           +---+---+---+---+---+
        u3 |   |   |   | x |   |
           +---+---+---+---+---+
        u4 |   |   |   |   | x |
           +---+---+---+---+---+

The basic algorithm is to calculate a triangle of products below the
diagonal, double it (left shift by one bit), and add in the products on the
diagonal.  This can be seen in mpn/generic/sqr_basecase.c.  Again the
assembler implementations take essentially this same approach.


KARATSUBA MULTIPLY

The Karatsuba multiplication algorithm is described in Knuth volume 2
section 4.3.3 part A, and in other texts like Geddes et al. section 4.3
(reference below).  A brief description is given here.

The Karatsuba algorithm treats its inputs x and y as each split in two parts
of equal length (or the most significant part one limb shorter if N is odd).

         high              low
        +----------+----------+
        |    x1    |    x0    |
        +----------+----------+

        +----------+----------+
        |    y1    |    y0    |
        +----------+----------+

Let b be the power of 2 where the split occurs, ie. if x0 is k limbs (y0 the
same) then b=2^(k*mp_bits_per_limb).  Then x=x1*b+x0 and y=y1*b+y0, and the
following holds,

	x*y = (b^2+b)*x1*y1 + b*(x1-x0)*(y1-y0) + (b+1)*x0*y0

This formula means doing only three multiplies of (N/2)x(N/2) limbs, whereas
a basecase multiply of NxN limbs is roughly equivalent to four multiplies of
(N/2)x(N/2).

The factors (b^2+b) etc in the formula look worse than they are,
representing simply the positions where the products must be added in.

         high                              low
        +--------+--------+ +--------+--------+
        |      x1*y1      | |      x0*y0      |
        +--------+--------+ +--------+--------+
                  +--------+--------+
                  |      x1*y1      |
                  +--------+--------+
                  +--------+--------+
                  |      x0*y0      |
                  +--------+--------+
                  +--------+--------+
                  | (x1-x0)*(y1-y0) |
                  +--------+--------+

The term (x1-x0)*(y1-y0) can be negative, meaning a subtraction, but the
final result is of course always positive.

The use of three multiplies of N/2 limbs each leads to an asymptotic speed
O(N^1.585).  (The exponent is log(3)/log(2).)  This is a big improvement
over the basecase multiply at O(N^2) and the algorithmic advantage soon
overcomes the extra additions Karatsuba must perform.


KARATSUBA SQUARE

A square is very similar to a multiply, but with x==y the formula reduces to
an equivalent with three squares,

        x^2 = (b^2+b)*x1^2 + b*(x1-x0)^2 + (b+1)*x0^2
 
The final result is accumulated from those three squares the same way as for
the three multiplies above.  The middle term (x1-x0)^2 however is now always
positive.


TOOM-COOK 3-WAY MULTIPLY

The Karatsuba formula is part of a general approach to splitting inputs
leading to both Toom-Cook and FFT algorithms.  A description of Toom-Cook
can be found in Knuth volume 2 section 4.3.3, with an example 3-way
calculation after Theorem A.

Toom-Cook 3-way treats the operands as split into 3 pieces of equal size (or
the most significant part 1 or 2 limbs shorter than the others).

         high                         low
        +----------+----------+----------+
        |    x2    |    x1    |    x0    |
        +----------+----------+----------+

        +----------+----------+----------+
        |    y2    |    y1    |    y0    |
        +----------+----------+----------+

These parts are treated as the coefficients of two polynomials

	X(t) = x2*t^2 + x1*t + x0
	Y(t) = y2*t^2 + y1*t + y0

Again let b equal the power of 2 which is the size of the x0, x1, y0 and y1
pieces, ie. if they're k limbs each then b=2^(k*mp_bits_per_limb).  With
this then x=X(b) and y=Y(b).

Let a polynomial W(t)=X(t)*Y(t) and suppose it's coefficients are

	W(t) = w4*t^4 + w3*t^3 + w2*t^2 + w1*t + w0

The w[i] are going to be determined, and when they are they'll give the
final result using w=W(b), since x*y=X(b)*Y(b)=W(b).  The coefficients will
each be roughly b^2, so the final W(b) will be an addition like,

         high                                        low
        +-------+-------+
        |       w4      |
        +-------+-------+
               +--------+-------+
               |        w3      |
               +--------+-------+
                       +--------+-------+
                       |        w2      |
                       +--------+-------+
                               +--------+-------+
                               |        w1      |
                               +--------+-------+
                                        +-------+-------+
                                        |       w0      |
                                        +-------+-------+
        -------------------------------------------------


The w[i] coefficients could be formed by a simple set of cross products,
like w4=x2*x2, w3=x2*y1+x1*y2, w2=x2*y0+x1*y1+x0*y2 etc, but this would need
all nine x[i]*y[j] for i,j=0,1,2, and would be equivalent merely to a
basecase multiply.  Instead the following approach is used.

X(t) and Y(t) are evaluated and multiplied at 5 points, giving values of
W(t) at those points.  The points used can be chosen in various ways, but in
GMP the following are used
  
	t=0    meaning x0*y0, which gives w0 immediately
        t=2    meaning (4*x2+2*x1*x0)*(4*y2+2*y1+y0)
        t=1    meaning (x2+x1+x0)*(y2+y1+y0)
        t=1/2  meaning (x2+2*x1+4*x0)*(y2+2*y1+4*y0)
        t=inf  meaning x2*y2, which gives w4 immediately

At t=1/2 the value calculated is actually 4*X(1/2)*Y(1/2), giving a value
for 16*W(1/2) (this is always an integer).  At t=inf the value is actually
X(t)*Y(t)/t^2 in the limit as t approaches infinity, but it's much easier to
think of that as simply x2*y2 giving w4 immediately (much like at t=0 x0*y0
gives w0 immediately).

Now each of the points substituted into W(t)=w4*t^4+...+w0 gives a linear
combination of the w[i] coefficients, and the value of those combinations
has just been calculated.

           W(0)   =                                 w0
        16*W(1/2) =    w4 + 2*w3 + 4*w2 + 8*w1 + 16*w0
           W(1)   =    w4 +   w3 +   w2 +   w1 +    w0
           W(2)   = 16*w4 + 8*w3 + 4*w2 + 2*w1 +    w0
           W(inf) =    w4

This is a set of five equations in five unknowns, and some elementary linear
algebra quickly isolates each w[i], by subtracting multiples of one equation
from another.

In the code the set of five values W(0),...,W(inf) will represent those
certain linear combinations.  By adding or subtracting one from another as
necessary, values which are each w[i] alone are arrived at.  This involves
only a few subtractions of small multiples (some of which are powers of 2),
and so is very fast.  A couple of divisions remain by powers of 2 and one
division by 3 (or by 6 rather), and that last uses the special fast
mpn_divexact_by3.

In the code the values w4, w2 and w0 are formed in the destination, and w3
and w1 are added to them.  There's an extra word at the high end of w3, w2
and w1 that are handled separately.  With A=w0,B=w1,...,E=w4, the additions
are as follows.

         high                                        low
        +-------+-------+-------+-------+-------+-------+
        |       E       |       C       |       A       |
        +-------+-------+-------+-------+-------+-------+
                 +------+-------++------+-------+
                 |      D       ||      B       |
                 +------+-------++------+-------+
              --      --      --
             |tD|    |tC|    |tB|        
              --      --      --
        -------------------------------------------------


The conversion of W(t) values to the coefficients is called interpolation.
A polynomial of degree 5 like W(t) is uniquely determined by values known at
5 different points.  The points can be chosen to make the linear equations
come out with a convenient set of steps for isolating the w[i]s.

In mpn/generic/mul_n.c the interpolate3() routine performs the
interpolation.  The open-coded one-pass version may be a bit hard to
understand, the steps performed can be better seen in the USE_MORE_MPN
version.

The use of five multiplies of N/3 limbs each leads to an asymptotic speed
O(N^1.465).  (The exponent is log(5)/log(3).)  This is an improvement over
Karatsuba at O(N^1.585), though Toom-Cook does more work in the evaluation
and interpolation and so it's only above a certain size that Toom-Cook
realizes its advantage.

The formula given above for the Karatsuba algorithm has an equivalent for
Toom-Cook 3-way, involving only five multiplies, but this would be
complicated and unenlightening.

An alternate view of Toom-Cook 3-way can be found in Zuras (reference
below).  He uses a vector to represent the x and y splits and a matrix
multiplication for the evaluation and interpolation stages.  The matrix
inverses are not meant to be actually used, and they have elements with
values much greater than in fact arise in the interpolation steps.  The
diagram shown for the 3-way is attractive, but again it doesn't have to be
implemented that way and for example with a bit of rearrangement just one
division by 6 (not two of them) can be done.


TOOM-COOK 3-WAY SQUARE

Toom-Cook squaring follows the same procedure as multiplication, but there's
only one X(t) and it's evaluated at 5 points, and those values squared to
give values of W(t).  The interpolation is then identical, and in fact the
same interpolate3() subroutine is used for both squaring and multiplying.


FFT MULTIPLY

At large to very large sizes a Fermat style FFT is used, meaning an FFT in a
ring of integers modulo 2^M+1.  This is asymptotically fast, but overheads
mean it's only worthwhile for large products.

Some brief notes are given here.  Full explanations can be found in various
texts.  Knuth section 4.3.3 part C describes the method, but using complex
numbers.  In the references below Schonhage and Strassen is the original
paper, and Pollard gives some of the mathematics for a finite field as used
here.

The FFT does its multiplication modulo 2^N+1, but by choosing
N>=bits(A)+bits(B), a full product A*B is obtained.  The algorithm splits
the inputs into 2^k pieces, for a chosen k, and will recursively perform 2^k
pointwise multiplications modulo 2^M+1, where M=N/2^k.  N must be a multiple
of 2^k.  Those multiplications are either done by recursing into a further
FFT, or by a plain toom3 etc multiplication, whichever is optimal at the
resultant size.  Note that in the current implementation M is always a
multiple of the limb size.

The steps leading to the pointwise products are like the evaluation and
interpolation stages of the Karatsuba and Toom-Cook algorithms, but use a
convolution, which can be efficiently calculated because 2^(2N/2^k) is a
2^k'th root of unity.

As operand sizes gets bigger, bigger splits are used.  Each time a bigger k
is used some multiplying is effectively swapped for some shifts, adds and
overheads.  A table of thresholds gives the points where a k+1 FFT is faster
than a k FFT.  A separate threshold gives the point where a mod 2^N+1 FFT
first becomes faster than a plain multiply of that size, and this normally
happens in the k=4 range.  A further threshold gives the point where an
N/2xN/2 multiply done with an FFT mod 2^N+1 is faster than a plain multiply
of that size, and this is normally in the k=7 or k=8 range.


TOOM-COOK N-WAY (NOT USED)

The 3-way Toom-Cook procedure described above generalizes to split into an
arbitrary number of pieces, as per Knuth volume 2 section 4.3.3 algorithm C.
Some code implementing this for GMP exists, but is not present since it has
yet to prove worthwhile.  The notes here are merely for interest.

In general a split into r+1 pieces will be made, and evaluations and
pointwise multiplications done at 2*r+1 points.  So a 4-way split does 7
pointwise multiplies, 5-way does 9, etc.

Asymptotically an r+1-way algorithm is O(n^(log(2*r+1)/log(r+1)).  Only the
pointwise multiplications count towards big O() complexity, but the time
spent in the evaluate and interpolate stages grows with r and has a
significant practical impact, with the asymptotic advantage of each r
realized only at bigger and bigger sizes.

Knuth algorithm C presents a version evaluating at points 0,1,2,...,2*r, but
exercise 4 uses -r,...,0,...,r and the latter saves some small multiplies in
the evaluate stage (or rather trades them for additions), and has a further
saving of nearly half the interpolate steps.  The answer to the exercise
doesn't give full details of the interpolate, but essentially the idea is to
separate odd and even final coefficients and then perform algorithm C steps
C7 and C8 on them separately.  The multipliers and divisors at step C7 then
become j^2 and 2*t*j-j*j respectively.

In the references below, Zuras presents 3-way and 4-way Toom-Cook methods,
and compares them to small order FFTs.  Hollerbach presents an N-way
algorithm from first principles.  Bernstein presents the N-way algorithm in
a style that facilitates comparing its mathematics to other multiplication
algorithms.


REFERENCES

"Algorithms for Computer Algebra", Keith O. Geddes, Stephen R. Czapor,
George Labahn, Kluwer Academic Publishers, 1992, ISBN 0-7923-9259-0.

"Schnelle Multiplikation grosser Zahlen", by Arnold Schonhage and Volker
Strassen, Computing 7, p. 281-292, 1971.

"The Fast Fourier Transform in a Finite Field", J.M. Pollard, Mathematics of
Computation, vol 25, num 114, April 1971.

"On Squaring and Multiplying Large Integers", Dan Zuras, ARITH-11: IEEE
Symposium on Computer Arithmetic, 1993, pages 260 to 271.  And reprinted as
"More on Multiplying and Squaring Large Integers", IEEE Transactions on
Computers, August 1994.

"Fast Multiplication & Division of Very Large Numbers", Uwe Hollerbach, post
to sci.math.research, Jan 1996, archived at Swarthmore,
http://forum.swarthmore.edu/epigone/sci.math.research/zhouyimpzimp/x1ybdbxz5w4v@forum.swarthmore.edu

"Multidigit Multiplication for Mathematicians", Daniel J. Bernstein,
preprint available at http://koobera.math.uic.edu/www/papers.  (Every known
multiplication technique, many references.)


----------------
Local variables:
mode: text
fill-column: 76
End: