OpenXM_contrib/gmp/doc/multiplication, Revision 1.1

GMP MULTIPLICATION


This file describes briefly the multiplication and squaring used in GMP.
The code is likely to be hard to understand without knowing something about
the algorithms.

GMP does NxN limb multiplications and squares using one of four algorithms,
according to the size N.

	Algorithm      Sizes

	basecase       <  KARATSUBA_MUL_THRESHOLD
	karatsuba      >= KARATSUBA_MUL_THRESHOLD
	toom3          >= TOOM3_MUL_THRESHOLD
	fft            >= FFT_MUL_THRESHOLD

Similarly for squaring, with the SQR thresholds.  Note though that the FFT
is only used if GMP is configured with --enable-fft.

MxN multiplications of operands with different sizes are currently done by
splitting them up into various NxN pieces while above
KARATSUBA_MUL_THRESHOLD.  The karatsuba and toom3 routines then operate
only on equal size operands.  This is rather inefficient, and is slated for
improvement in the future.



BASECASE MULTIPLY

This is a straightforward rectangular set of cross-products, the same as
long multiplication done by hand and for that reason sometimes known as the
schoolbook or grammar school method.

See Knuth (reference in gmp.texi) volume 2 section 4.3.1 algorithm M, or the
mpn/generic/mul_basecase.c code.

Assembler implementations of mul_basecase are essentially the same as the
generic C version, but have all the usual assembler tricks and obscurities
introduced for speed.
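
The cross-product loop can be sketched as follows.  This is an illustration
only, in Python with a hypothetical 32-bit limb, not the actual
mpn_mul_basecase interface.

```python
# Illustrative sketch of schoolbook multiplication on little-endian limb
# arrays, assuming a 32-bit limb.  The real mpn_mul_basecase works on
# machine words with hardware carry handling.

LIMB_BITS = 32
LIMB_MASK = (1 << LIMB_BITS) - 1

def mul_basecase(u, v):
    """Return the N+M limb product of an N limb u and an M limb v."""
    w = [0] * (len(u) + len(v))
    for i, ui in enumerate(u):
        carry = 0
        for j, vj in enumerate(v):
            t = w[i + j] + ui * vj + carry   # one cross-product, accumulated
            w[i + j] = t & LIMB_MASK
            carry = t >> LIMB_BITS
        w[i + len(v)] = carry                # top limb of this row
    return w
```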



BASECASE SQUARE

A square can be done in roughly half the time of a multiply, by using the
fact that the cross products above and below the diagonal are the same.  In
practice squaring isn't 2x faster than multiplying, but it's always faster
by a decent factor.

	     u0  u1  u2  u3  u4
	    +---+---+---+---+---+
	 u0 | x |   |   |   |   |
	    +---+---+---+---+---+
	 u1 |   | x |   |   |   |
	    +---+---+---+---+---+
	 u2 |   |   | x |   |   |
	    +---+---+---+---+---+
	 u3 |   |   |   | x |   |
	    +---+---+---+---+---+
	 u4 |   |   |   |   | x |
	    +---+---+---+---+---+

The basic algorithm is to calculate a triangle of products below the
diagonal, double it (left shift by one bit), and add in the products on the
diagonal.  This can be seen in mpn/generic/sqr_basecase.c.  Again the
assembler implementations take essentially this same approach.
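
The triangle-plus-diagonal scheme can be sketched like this; again a Python
illustration with an assumed 32-bit limb and a Python integer standing in
for the result area, not the limb-by-limb mpn code.

```python
# Sketch of basecase squaring: one triangle of cross-products below the
# diagonal, doubled, plus the squares on the diagonal.

LIMB_BITS = 32

def sqr_basecase(u):
    """Square the little-endian limb array u, returning a Python int."""
    tri = 0
    for i in range(len(u)):
        for j in range(i):                    # products below the diagonal
            tri += (u[i] * u[j]) << (LIMB_BITS * (i + j))
    acc = tri << 1                            # double the triangle
    for i in range(len(u)):                   # add the diagonal squares
        acc += (u[i] * u[i]) << (2 * LIMB_BITS * i)
    return acc
```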




KARATSUBA MULTIPLY

The Karatsuba multiplication algorithm is described in Knuth volume 2
section 4.3.3 part A, and in other texts like Geddes et al. section 4.3
(reference below).  A brief description is given here.

The Karatsuba algorithm treats its inputs x and y as each split in two parts
of equal length (or the most significant part one limb shorter if N is odd).

	 high                low
	+----------+----------+
	|    x1    |    x0    |
	+----------+----------+

	+----------+----------+
	|    y1    |    y0    |
	+----------+----------+

Let b be the power of 2 where the split occurs, ie. if x0 is k limbs (y0 the
same) then b=2^(k*mp_bits_per_limb).  Then x=x1*b+x0 and y=y1*b+y0, and the
following holds,

	x*y = (b^2+b)*x1*y1 - b*(x1-x0)*(y1-y0) + (b+1)*x0*y0

This formula means doing only three multiplies of (N/2)x(N/2) limbs, whereas
a basecase multiply of NxN limbs is roughly equivalent to four multiplies of
(N/2)x(N/2).

The factors (b^2+b) etc in the formula look worse than they are,
representing simply the positions where the products must be added in.

	 high                              low
	+--------+--------+ +--------+--------+
	|      x1*y1      | |      x0*y0      |
	+--------+--------+ +--------+--------+
	          +--------+--------+
	          |      x1*y1      |
	          +--------+--------+
	          +--------+--------+
	          |      x0*y0      |
	          +--------+--------+
	          +--------+--------+
	          | (x1-x0)*(y1-y0) |
	          +--------+--------+

The product (x1-x0)*(y1-y0) can itself be negative, in which case
subtracting it means an addition, but the final result is of course always
positive.

The use of three multiplies of N/2 limbs each leads to an asymptotic speed
O(N^1.585).  (The exponent is log(3)/log(2).)  This is a big improvement
over the basecase multiply at O(N^2) and the algorithmic advantage soon
overcomes the extra additions Karatsuba must perform.
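
The recursion can be sketched on Python integers, splitting at a bit
position rather than a limb count; the 64-bit cutoff is an arbitrary
stand-in for KARATSUBA_MUL_THRESHOLD, and this is not GMP's limb-oriented
code.

```python
# Karatsuba sketch on nonnegative Python ints, using the identity
#   x*y = (b^2+b)*x1*y1 - b*(x1-x0)*(y1-y0) + (b+1)*x0*y0

def karatsuba(x, y):
    if x < (1 << 64) or y < (1 << 64):
        return x * y                          # basecase
    k = max(x.bit_length(), y.bit_length()) // 2
    b = 1 << k                                # the split point
    x1, x0 = x >> k, x & (b - 1)
    y1, y0 = y >> k, y & (b - 1)
    hi = karatsuba(x1, y1)
    lo = karatsuba(x0, y0)
    d1, d2 = x1 - x0, y1 - y0
    mid = karatsuba(abs(d1), abs(d2))         # |x1-x0| * |y1-y0|
    if (d1 < 0) != (d2 < 0):                  # restore the sign
        mid = -mid
    return (b * b + b) * hi - b * mid + (b + 1) * lo
```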




KARATSUBA SQUARE

A square is very similar to a multiply, but with x==y the formula reduces to
an equivalent with three squares,

	x^2 = (b^2+b)*x1^2 - b*(x1-x0)^2 + (b+1)*x0^2

The final result is accumulated from those three squares the same way as for
the three multiplies above.  The middle term (x1-x0)^2 is now always
positive, so it is always a subtraction.
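
The squaring variant of the same sketch needs only three recursive squares
(the 64-bit cutoff is again an arbitrary illustrative threshold):

```python
# Karatsuba squaring sketch on nonnegative Python ints, using
#   x^2 = (b^2+b)*x1^2 - b*(x1-x0)^2 + (b+1)*x0^2

def karatsuba_sqr(x):
    if x < (1 << 64):
        return x * x                          # basecase
    k = x.bit_length() // 2
    b = 1 << k
    x1, x0 = x >> k, x & (b - 1)
    return ((b * b + b) * karatsuba_sqr(x1)
            - b * karatsuba_sqr(abs(x1 - x0))  # always subtracted
            + (b + 1) * karatsuba_sqr(x0))
```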




TOOM-COOK 3-WAY MULTIPLY

The Karatsuba formula is part of a general approach to splitting inputs
leading to both Toom-Cook and FFT algorithms.  A description of Toom-Cook
can be found in Knuth volume 2 section 4.3.3, with an example 3-way
calculation after Theorem A.

Toom-Cook 3-way treats the operands as split into 3 pieces of equal size (or
the most significant part 1 or 2 limbs shorter than the others).

	 high                           low
	+----------+----------+----------+
	|    x2    |    x1    |    x0    |
	+----------+----------+----------+

	+----------+----------+----------+
	|    y2    |    y1    |    y0    |
	+----------+----------+----------+

These parts are treated as the coefficients of two polynomials

	X(t) = x2*t^2 + x1*t + x0
	Y(t) = y2*t^2 + y1*t + y0

Again let b equal the power of 2 which is the size of the x0, x1, y0 and y1
pieces, ie. if they're k limbs each then b=2^(k*mp_bits_per_limb).  With
this then x=X(b) and y=Y(b).

Let a polynomial W(t)=X(t)*Y(t) and suppose its coefficients are

	W(t) = w4*t^4 + w3*t^3 + w2*t^2 + w1*t + w0

The w[i] are going to be determined, and when they are they'll give the
final result using w=W(b), since x*y=X(b)*Y(b)=W(b).  The coefficients will
each be roughly b^2, so the final W(b) will be an addition like,

	 high                                         low
	+-------+-------+
	|       w4      |
	+-------+-------+
	       +--------+-------+
	       |       w3       |
	       +--------+-------+
	               +--------+-------+
	               |       w2       |
	               +--------+-------+
	                       +--------+-------+
	                       |       w1       |
	                       +--------+-------+
	                                +-------+-------+
	                                |       w0      |
	                                +-------+-------+
	-------------------------------------------------


The w[i] coefficients could be formed by a simple set of cross products,
like w4=x2*y2, w3=x2*y1+x1*y2, w2=x2*y0+x1*y1+x0*y2 etc, but this would
need all nine x[i]*y[j] for i,j=0,1,2, and would be equivalent merely to a
basecase multiply.  Instead the following approach is used.

X(t) and Y(t) are evaluated and multiplied at 5 points, giving values of
W(t) at those points.  The points used can be chosen in various ways, but in
GMP the following are used

	t=0     meaning x0*y0, which gives w0 immediately
	t=2     meaning (4*x2+2*x1+x0)*(4*y2+2*y1+y0)
	t=1     meaning (x2+x1+x0)*(y2+y1+y0)
	t=1/2   meaning (x2+2*x1+4*x0)*(y2+2*y1+4*y0)
	t=inf   meaning x2*y2, which gives w4 immediately

At t=1/2 the value calculated is actually 4*X(1/2) times 4*Y(1/2), giving a
value for 16*W(1/2) (this is always an integer).  At t=inf the value is
actually X(t)*Y(t)/t^4 in the limit as t approaches infinity, but it's much
easier to think of that as simply x2*y2 giving w4 immediately (much like at
t=0 x0*y0 gives w0 immediately).

Now each of the points substituted into W(t)=w4*t^4+...+w0 gives a linear
combination of the w[i] coefficients, and the value of those combinations
has just been calculated.

	W(0)      =                                 w0
	16*W(1/2) =    w4 +  2*w3 +  4*w2 + 8*w1 + 16*w0
	W(1)      =    w4 +    w3 +    w2 +   w1 +    w0
	W(2)      = 16*w4 +  8*w3 +  4*w2 + 2*w1 +    w0
	W(inf)    =    w4

This is a set of five equations in five unknowns, and some elementary linear
algebra quickly isolates each w[i], by subtracting multiples of one equation
from another.

In the code the set of five values W(0),...,W(inf) starts out holding those
linear combinations.  By adding or subtracting one from another as
necessary, values which are each w[i] alone are arrived at.  This involves
only a few subtractions of small multiples (some of which are powers of 2),
and so is very fast.  A couple of divisions remain by powers of 2 and one
division by 3 (or by 6 rather), and that last uses the special fast
mpn_divexact_by3.
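
The whole evaluate, pointwise multiply, interpolate cycle can be sketched
on Python integers.  The fixed k-bit split and the use of Python's * for
the pointwise products are simplifications for illustration, not the mpn
code; note the exact divisions by 2 and by 6 matching the description
above.

```python
# Toom-Cook 3-way sketch: split x and y at k bits per piece, evaluate at
# t = 0, 2, 1, 1/2, inf, multiply pointwise, then interpolate w0..w4.
# Requires x, y < 2^(3k).

def toom3(x, y, k):
    mask = (1 << k) - 1
    x0, x1, x2 = x & mask, (x >> k) & mask, x >> (2 * k)
    y0, y1, y2 = y & mask, (y >> k) & mask, y >> (2 * k)
    W0   = x0 * y0                                     # W(0)
    W2   = (4*x2 + 2*x1 + x0) * (4*y2 + 2*y1 + y0)     # W(2)
    W1   = (x2 + x1 + x0) * (y2 + y1 + y0)             # W(1)
    Wh   = (x2 + 2*x1 + 4*x0) * (y2 + 2*y1 + 4*y0)     # 16*W(1/2)
    Winf = x2 * y2                                     # W(inf)
    # interpolation: subtract and divide to isolate each w[i]
    w0, w4 = W0, Winf
    th = Wh - w4 - 16*w0          #  2*w3 + 4*w2 + 8*w1
    t1 = W1 - w4 - w0             #    w3 +   w2 +   w1
    t2 = W2 - 16*w4 - w0          #  8*w3 + 4*w2 + 2*w1
    w2 = (10*t1 - th - t2) // 2   # exact division by 2
    d  = (th - t2) // 6           # exact division by 6: d = w1 - w3
    w1 = (t1 - w2 + d) // 2       # exact division by 2
    w3 = t1 - w2 - w1
    return (w4 << 4*k) + (w3 << 3*k) + (w2 << 2*k) + (w1 << k) + w0
```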

In the code the values w4, w2 and w0 are formed in the destination, and w3
and w1 are added to them.  There's an extra word at the high end of each of
w3, w2 and w1, and these are handled separately.  With A=w0, B=w1, ...,
E=w4, the additions are as follows.

	 high                                         low
	+-------+-------+-------+-------+-------+-------+
	|       E       |       C       |       A       |
	+-------+-------+-------+-------+-------+-------+
	        +------+-------++------+-------+
	        |      D       ||      B       |
	        +------+-------++------+-------+
	      --      --      --
	     |tD|    |tC|    |tB|
	      --      --      --
	-------------------------------------------------


The conversion of W(t) values to the coefficients is called interpolation.
A polynomial of degree 4 like W(t) is uniquely determined by values known at
5 different points.  The points can be chosen to make the linear equations
come out with a convenient set of steps for isolating the w[i]s.

In mpn/generic/mul_n.c the interpolate3() routine performs the
interpolation.  The open-coded one-pass version may be a bit hard to
understand; the steps performed can be seen more clearly in the
USE_MORE_MPN version.

The use of five multiplies of N/3 limbs each leads to an asymptotic speed
O(N^1.465).  (The exponent is log(5)/log(3).)  This is an improvement over
Karatsuba at O(N^1.585), though Toom-Cook does more work in the evaluation
and interpolation and so it's only above a certain size that Toom-Cook
realizes its advantage.

The formula given above for the Karatsuba algorithm has an equivalent for
Toom-Cook 3-way, involving only five multiplies, but this would be
complicated and unenlightening.

An alternate view of Toom-Cook 3-way can be found in Zuras (reference
below).  He uses a vector to represent the x and y splits and a matrix
multiplication for the evaluation and interpolation stages.  The matrix
inverses are not meant to be actually used, and they have elements with
values much greater than in fact arise in the interpolation steps.  The
diagram shown for the 3-way is attractive, but again it doesn't have to be
implemented that way and for example with a bit of rearrangement just one
division by 6 (not two of them) can be done.




TOOM-COOK 3-WAY SQUARE

Toom-Cook squaring follows the same procedure as multiplication, but there's
only one X(t) and it's evaluated at 5 points, and those values squared to
give values of W(t).  The interpolation is then identical, and in fact the
same interpolate3() subroutine is used for both squaring and multiplying.




FFT MULTIPLY

At large to very large sizes a Fermat style FFT is used, meaning an FFT in a
ring of integers modulo 2^M+1.  This is asymptotically fast, but overheads
mean it's only worthwhile for large products.

Some brief notes are given here.  Full explanations can be found in various
texts.  Knuth section 4.3.3 part C describes the method, but using complex
numbers.  In the references below Schonhage and Strassen is the original
paper, and Pollard gives some of the mathematics for a finite field as used
here.

The FFT does its multiplication modulo 2^N+1, but by choosing
N>=bits(A)+bits(B), a full product A*B is obtained.  The algorithm splits
the inputs into 2^k pieces, for a chosen k, and will recursively perform 2^k
pointwise multiplications modulo 2^M+1, where M=N/2^k.  N must be a multiple
of 2^k.  Those multiplications are either done by recursing into a further
FFT, or by a plain toom3 etc multiplication, whichever is optimal at the
resultant size.  Note that in the current implementation M is always a
multiple of the limb size.

The steps leading to the pointwise products are like the evaluation and
interpolation stages of the Karatsuba and Toom-Cook algorithms, but use a
convolution, which can be efficiently calculated because 2^(2N/2^k) is a
2^k'th root of unity.
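
That root of unity fact is easy to check numerically; the values N=64 and
k=4 below are chosen only for illustration, not GMP's tuned choices.

```python
# Check that g = 2^(2N/2^k) is a primitive 2^k'th root of unity mod 2^N+1.
# This works because 2^(2N) = (2^N)^2 == (-1)^2 == 1 mod 2^N+1.

N, k = 64, 4
m = (1 << N) + 1                  # the modulus 2^N+1
g = pow(2, (2 * N) >> k, m)       # g = 2^(2N/2^k) mod 2^N+1

assert pow(g, 1 << k, m) == 1         # g^(2^k) == 1
assert pow(g, 1 << (k - 1), m) != 1   # and no smaller power of 2 gives 1
```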

As operand sizes get bigger, bigger splits are used.  Each time a bigger k
is used some multiplying is effectively swapped for some shifts, adds and
overheads.  A table of thresholds gives the points where a k+1 FFT is faster
than a k FFT.  A separate threshold gives the point where a mod 2^N+1 FFT
first becomes faster than a plain multiply of that size, and this normally
happens in the k=4 range.  A further threshold gives the point where an
N/2xN/2 multiply done with an FFT mod 2^N+1 is faster than a plain multiply
of that size, and this is normally in the k=7 or k=8 range.




TOOM-COOK N-WAY (NOT USED)

The 3-way Toom-Cook procedure described above generalizes to split into an
arbitrary number of pieces, as per Knuth volume 2 section 4.3.3 algorithm C.
Some code implementing this for GMP exists, but is not present since it has
yet to prove worthwhile.  The notes here are merely for interest.

In general a split into r+1 pieces will be made, and evaluations and
pointwise multiplications done at 2*r+1 points.  So a 4-way split does 7
pointwise multiplies, 5-way does 9, etc.

Asymptotically an r+1-way algorithm is O(N^(log(2*r+1)/log(r+1))).  Only the
pointwise multiplications count towards big O() complexity, but the time
spent in the evaluate and interpolate stages grows with r and has a
significant practical impact, with the asymptotic advantage of each r
realized only at bigger and bigger sizes.
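
These exponents are easy to tabulate (plain arithmetic, nothing GMP
specific):

```python
# Asymptotic exponent log(2r+1)/log(r+1) for an (r+1)-way split doing
# 2r+1 pointwise multiplies: r=1 is Karatsuba, r=2 is toom3, and so on.

from math import log

def toom_exponent(r):
    return log(2 * r + 1) / log(r + 1)
```

For r=1,2,3,4 this gives roughly 1.585, 1.465, 1.404 and 1.365, showing the
diminishing returns from each bigger split.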

Knuth algorithm C presents a version evaluating at points 0,1,2,...,2*r, but
exercise 4 uses -r,...,0,...,r and the latter saves some small multiplies in
the evaluate stage (or rather trades them for additions), and has a further
saving of nearly half the interpolate steps.  The answer to the exercise
doesn't give full details of the interpolate, but essentially the idea is to
separate odd and even final coefficients and then perform algorithm C steps
C7 and C8 on them separately.  The multipliers and divisors at step C7 then
become j^2 and 2*t*j-j*j respectively.

In the references below, Zuras presents 3-way and 4-way Toom-Cook methods,
and compares them to small order FFTs.  Hollerbach presents an N-way
algorithm from first principles.  Bernstein presents the N-way algorithm in
a style that facilitates comparing its mathematics to other multiplication
algorithms.




REFERENCES

"Algorithms for Computer Algebra", Keith O. Geddes, Stephen R. Czapor,
George Labahn, Kluwer Academic Publishers, 1992, ISBN 0-7923-9259-0.

"Schnelle Multiplikation grosser Zahlen", Arnold Schonhage and Volker
Strassen, Computing 7, p. 281-292, 1971.

"The Fast Fourier Transform in a Finite Field", J.M. Pollard, Mathematics of
Computation, vol 25, num 114, April 1971.

"On Squaring and Multiplying Large Integers", Dan Zuras, ARITH-11: IEEE
Symposium on Computer Arithmetic, 1993, pages 260 to 271.  Reprinted as
"More on Multiplying and Squaring Large Integers", IEEE Transactions on
Computers, August 1994.

"Fast Multiplication & Division of Very Large Numbers", Uwe Hollerbach, post
to sci.math.research, Jan 1996, archived at Swarthmore,
http://forum.swarthmore.edu/epigone/sci.math.research/zhouyimpzimp/x1ybdbxz5w4v@forum.swarthmore.edu

"Multidigit Multiplication for Mathematicians", Daniel J. Bernstein,
preprint available at http://koobera.math.uic.edu/www/papers.  (Every known
multiplication technique, many references.)




----------------
Local variables:
mode: text
fill-column: 76
End: