GMP MULTIPLICATION This file describes briefly the multiplication and squaring used in GMP. The code is likely to be hard to understand without knowing something about the algorithms. GMP does NxN limb multiplications and squares using one of four algorithms, according to the size N. Algorithm Sizes basecase < KARATSUBA_MUL_THRESHOLD karatsuba >= KARATSUBA_MUL_THRESHOLD toom3 >= TOOM3_MUL_THRESHOLD fft >= FFT_MUL_THRESHOLD Similarly for squaring, with the SQR thresholds. Note though that the FFT is only used if GMP is configured with --enable-fft. MxN multiplications of operands with different sizes are currently done by splitting it up into various NxN pieces while above KARATSUBA_MUL_THRESHOLD. The karatsuba and toom3 routines then operate only on equal size operands. This is rather inefficient, and is slated for improvement in the future. BASECASE MULTIPLY This a straightforward rectangular set of cross-products, the same as long multiplication done by hand and for that reason sometimes known as the schoolbook or grammar school method. See Knuth (reference in gmp.texi) volume 2 section 4.3.1 algorithm M, or the mpn/generic/mul_basecase.c code. Assembler implementations of mul_basecase are essentially the same as the generic C version, but have all the usual assembler tricks and obscurities introduced for speed. BASECASE SQUARE A square can be done in roughly half the time of a multiply, by using the fact that the cross products above and below the diagonal are the same. In practice squaring isn't 2x faster than multiplying, but it's always faster by a decent factor. u0 u1 u2 u3 u4 +---+---+---+---+---+ u0 | x | | | | | +---+---+---+---+---+ u1 | | x | | | | +---+---+---+---+---+ u2 | | | x | | | +---+---+---+---+---+ u3 | | | | x | | +---+---+---+---+---+ u4 | | | | | x | +---+---+---+---+---+ The basic algorithm is to calculate a triangle of products below the diagonal, double it (left shift by one bit), and add in the products on the diagonal. This can be seen in mpn/generic/sqr_basecase.c. Again the assembler implementations take essentially this same approach. KARATSUBA MULTIPLY The Karatsuba multiplication algorithm is described in Knuth volume 2 section 4.3.3 part A, and in other texts like Geddes et al. section 4.3 (reference below). A brief description is given here. The Karatsuba algorithm treats its inputs x and y as each split in two parts of equal length (or the most significant part one limb shorter if N is odd). high low +----------+----------+ | x1 | x0 | +----------+----------+ +----------+----------+ | y1 | y0 | +----------+----------+ Let b be the power of 2 where the split occurs, ie. if x0 is k limbs (y0 the same) then b=2^(k*mp_bits_per_limb). Then x=x1*b+x0 and y=y1*b+y0, and the following holds, x*y = (b^2+b)*x1*y1 + b*(x1-x0)*(y1-y0) + (b+1)*x0*y0 This formula means doing only three multiplies of (N/2)x(N/2) limbs, whereas a basecase multiply of NxN limbs is roughly equivalent to four multiplies of (N/2)x(N/2). The factors (b^2+b) etc in the formula look worse than they are, representing simply the positions where the products must be added in. high low +--------+--------+ +--------+--------+ | x1*y1 | | x0*y0 | +--------+--------+ +--------+--------+ +--------+--------+ | x1*y1 | +--------+--------+ +--------+--------+ | x0*y0 | +--------+--------+ +--------+--------+ | (x1-x0)*(y1-y0) | +--------+--------+ The term (x1-x0)*(y1-y0) can be negative, meaning a subtraction, but the final result is of course always positive. The use of three multiplies of N/2 limbs each leads to an asymptotic speed O(N^1.585). (The exponent is log(3)/log(2).) This is a big improvement over the basecase multiply at O(N^2) and the algorithmic advantage soon overcomes the extra additions Karatsuba must perform. KARATSUBA SQUARE A square is very similar to a multiply, but with x==y the formula reduces to an equivalent with three squares, x^2 = (b^2+b)*x1^2 + b*(x1-x0)^2 + (b+1)*x0^2 The final result is accumulated from those three squares the same way as for the three multiplies above. The middle term (x1-x0)^2 however is now always positive. TOOM-COOK 3-WAY MULTIPLY The Karatsuba formula is part of a general approach to splitting inputs leading to both Toom-Cook and FFT algorithms. A description of Toom-Cook can be found in Knuth volume 2 section 4.3.3, with an example 3-way calculation after Theorem A. Toom-Cook 3-way treats the operands as split into 3 pieces of equal size (or the most significant part 1 or 2 limbs shorter than the others). high low +----------+----------+----------+ | x2 | x1 | x0 | +----------+----------+----------+ +----------+----------+----------+ | y2 | y1 | y0 | +----------+----------+----------+ These parts are treated as the coefficients of two polynomials X(t) = x2*t^2 + x1*t + x0 Y(t) = y2*t^2 + y1*t + y0 Again let b equal the power of 2 which is the size of the x0, x1, y0 and y1 pieces, ie. if they're k limbs each then b=2^(k*mp_bits_per_limb). With this then x=X(b) and y=Y(b). Let a polynomial W(t)=X(t)*Y(t) and suppose it's coefficients are W(t) = w4*t^4 + w3*t^3 + w2*t^2 + w1*t + w0 The w[i] are going to be determined, and when they are they'll give the final result using w=W(b), since x*y=X(b)*Y(b)=W(b). The coefficients will each be roughly b^2, so the final W(b) will be an addition like, high low +-------+-------+ | w4 | +-------+-------+ +--------+-------+ | w3 | +--------+-------+ +--------+-------+ | w2 | +--------+-------+ +--------+-------+ | w1 | +--------+-------+ +-------+-------+ | w0 | +-------+-------+ ------------------------------------------------- The w[i] coefficients could be formed by a simple set of cross products, like w4=x2*x2, w3=x2*y1+x1*y2, w2=x2*y0+x1*y1+x0*y2 etc, but this would need all nine x[i]*y[j] for i,j=0,1,2, and would be equivalent merely to a basecase multiply. Instead the following approach is used. X(t) and Y(t) are evaluated and multiplied at 5 points, giving values of W(t) at those points. The points used can be chosen in various ways, but in GMP the following are used t=0 meaning x0*y0, which gives w0 immediately t=2 meaning (4*x2+2*x1*x0)*(4*y2+2*y1+y0) t=1 meaning (x2+x1+x0)*(y2+y1+y0) t=1/2 meaning (x2+2*x1+4*x0)*(y2+2*y1+4*y0) t=inf meaning x2*y2, which gives w4 immediately At t=1/2 the value calculated is actually 4*X(1/2)*Y(1/2), giving a value for 16*W(1/2) (this is always an integer). At t=inf the value is actually X(t)*Y(t)/t^2 in the limit as t approaches infinity, but it's much easier to think of that as simply x2*y2 giving w4 immediately (much like at t=0 x0*y0 gives w0 immediately). Now each of the points substituted into W(t)=w4*t^4+...+w0 gives a linear combination of the w[i] coefficients, and the value of those combinations has just been calculated. W(0) = w0 16*W(1/2) = w4 + 2*w3 + 4*w2 + 8*w1 + 16*w0 W(1) = w4 + w3 + w2 + w1 + w0 W(2) = 16*w4 + 8*w3 + 4*w2 + 2*w1 + w0 W(inf) = w4 This is a set of five equations in five unknowns, and some elementary linear algebra quickly isolates each w[i], by subtracting multiples of one equation from another. In the code the set of five values W(0),...,W(inf) will represent those certain linear combinations. By adding or subtracting one from another as necessary, values which are each w[i] alone are arrived at. This involves only a few subtractions of small multiples (some of which are powers of 2), and so is very fast. A couple of divisions remain by powers of 2 and one division by 3 (or by 6 rather), and that last uses the special fast mpn_divexact_by3. In the code the values w4, w2 and w0 are formed in the destination, and w3 and w1 are added to them. There's an extra word at the high end of w3, w2 and w1 that are handled separately. With A=w0,B=w1,...,E=w4, the additions are as follows. high low +-------+-------+-------+-------+-------+-------+ | E | C | A | +-------+-------+-------+-------+-------+-------+ +------+-------++------+-------+ | D || B | +------+-------++------+-------+ -- -- -- |tD| |tC| |tB| -- -- -- ------------------------------------------------- The conversion of W(t) values to the coefficients is called interpolation. A polynomial of degree 5 like W(t) is uniquely determined by values known at 5 different points. The points can be chosen to make the linear equations come out with a convenient set of steps for isolating the w[i]s. In mpn/generic/mul_n.c the interpolate3() routine performs the interpolation. The open-coded one-pass version may be a bit hard to understand, the steps performed can be better seen in the USE_MORE_MPN version. The use of five multiplies of N/3 limbs each leads to an asymptotic speed O(N^1.465). (The exponent is log(5)/log(3).) This is an improvement over Karatsuba at O(N^1.585), though Toom-Cook does more work in the evaluation and interpolation and so it's only above a certain size that Toom-Cook realizes its advantage. The formula given above for the Karatsuba algorithm has an equivalent for Toom-Cook 3-way, involving only five multiplies, but this would be complicated and unenlightening. An alternate view of Toom-Cook 3-way can be found in Zuras (reference below). He uses a vector to represent the x and y splits and a matrix multiplication for the evaluation and interpolation stages. The matrix inverses are not meant to be actually used, and they have elements with values much greater than in fact arise in the interpolation steps. The diagram shown for the 3-way is attractive, but again it doesn't have to be implemented that way and for example with a bit of rearrangement just one division by 6 (not two of them) can be done. TOOM-COOK 3-WAY SQUARE Toom-Cook squaring follows the same procedure as multiplication, but there's only one X(t) and it's evaluated at 5 points, and those values squared to give values of W(t). The interpolation is then identical, and in fact the same interpolate3() subroutine is used for both squaring and multiplying. FFT MULTIPLY At large to very large sizes a Fermat style FFT is used, meaning an FFT in a ring of integers modulo 2^M+1. This is asymptotically fast, but overheads mean it's only worthwhile for large products. Some brief notes are given here. Full explanations can be found in various texts. Knuth section 4.3.3 part C describes the method, but using complex numbers. In the references below Schonhage and Strassen is the original paper, and Pollard gives some of the mathematics for a finite field as used here. The FFT does its multiplication modulo 2^N+1, but by choosing N>=bits(A)+bits(B), a full product A*B is obtained. The algorithm splits the inputs into 2^k pieces, for a chosen k, and will recursively perform 2^k pointwise multiplications modulo 2^M+1, where M=N/2^k. N must be a multiple of 2^k. Those multiplications are either done by recursing into a further FFT, or by a plain toom3 etc multiplication, whichever is optimal at the resultant size. Note that in the current implementation M is always a multiple of the limb size. The steps leading to the pointwise products are like the evaluation and interpolation stages of the Karatsuba and Toom-Cook algorithms, but use a convolution, which can be efficiently calculated because 2^(2N/2^k) is a 2^k'th root of unity. As operand sizes gets bigger, bigger splits are used. Each time a bigger k is used some multiplying is effectively swapped for some shifts, adds and overheads. A table of thresholds gives the points where a k+1 FFT is faster than a k FFT. A separate threshold gives the point where a mod 2^N+1 FFT first becomes faster than a plain multiply of that size, and this normally happens in the k=4 range. A further threshold gives the point where an N/2xN/2 multiply done with an FFT mod 2^N+1 is faster than a plain multiply of that size, and this is normally in the k=7 or k=8 range. TOOM-COOK N-WAY (NOT USED) The 3-way Toom-Cook procedure described above generalizes to split into an arbitrary number of pieces, as per Knuth volume 2 section 4.3.3 algorithm C. Some code implementing this for GMP exists, but is not present since it has yet to prove worthwhile. The notes here are merely for interest. In general a split into r+1 pieces will be made, and evaluations and pointwise multiplications done at 2*r+1 points. So a 4-way split does 7 pointwise multiplies, 5-way does 9, etc. Asymptotically an r+1-way algorithm is O(n^(log(2*r+1)/log(r+1)). Only the pointwise multiplications count towards big O() complexity, but the time spent in the evaluate and interpolate stages grows with r and has a significant practical impact, with the asymptotic advantage of each r realized only at bigger and bigger sizes. Knuth algorithm C presents a version evaluating at points 0,1,2,...,2*r, but exercise 4 uses -r,...,0,...,r and the latter saves some small multiplies in the evaluate stage (or rather trades them for additions), and has a further saving of nearly half the interpolate steps. The answer to the exercise doesn't give full details of the interpolate, but essentially the idea is to separate odd and even final coefficients and then perform algorithm C steps C7 and C8 on them separately. The multipliers and divisors at step C7 then become j^2 and 2*t*j-j*j respectively. In the references below, Zuras presents 3-way and 4-way Toom-Cook methods, and compares them to small order FFTs. Hollerbach presents an N-way algorithm from first principles. Bernstein presents the N-way algorithm in a style that facilitates comparing its mathematics to other multiplication algorithms. REFERENCES "Algorithms for Computer Algebra", Keith O. Geddes, Stephen R. Czapor, George Labahn, Kluwer Academic Publishers, 1992, ISBN 0-7923-9259-0. "Schnelle Multiplikation grosser Zahlen", by Arnold Schonhage and Volker Strassen, Computing 7, p. 281-292, 1971. "The Fast Fourier Transform in a Finite Field", J.M. Pollard, Mathematics of Computation, vol 25, num 114, April 1971. "On Squaring and Multiplying Large Integers", Dan Zuras, ARITH-11: IEEE Symposium on Computer Arithmetic, 1993, pages 260 to 271. And reprinted as "More on Multiplying and Squaring Large Integers", IEEE Transactions on Computers, August 1994. "Fast Multiplication & Division of Very Large Numbers", Uwe Hollerbach, post to sci.math.research, Jan 1996, archived at Swarthmore, http://forum.swarthmore.edu/epigone/sci.math.research/zhouyimpzimp/x1ybdbxz5w4v@forum.swarthmore.edu "Multidigit Multiplication for Mathematicians", Daniel J. Bernstein, preprint available at http://koobera.math.uic.edu/www/papers. (Every known multiplication technique, many references.) ---------------- Local variables: mode: text fill-column: 76 End: