
                        GMP MULTIPLICATION


This file describes briefly the multiplication and squaring used in GMP.
The code is likely to be hard to understand without knowing something about
the algorithms.

GMP does NxN limb multiplications and squares using one of four algorithms,
according to the size N.

        Algorithm    Sizes

        basecase     < KARATSUBA_MUL_THRESHOLD
        karatsuba    >= KARATSUBA_MUL_THRESHOLD
        toom3        >= TOOM3_MUL_THRESHOLD
        fft          >= FFT_MUL_THRESHOLD

Similarly for squaring, with the SQR thresholds.  Note though that the FFT
is only used if GMP is configured with --enable-fft.
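
To make the table concrete, here is a rough sketch of the size-based
selection it describes.  The threshold values below are made-up
placeholders (the real ones are tuned per CPU), and mul_algorithm() is an
invented name; the actual selection logic lives in mpn/generic/mul_n.c and
differs in detail.

        /* Sketch only: placeholder thresholds, not GMP's real values.  */
        #define KARATSUBA_MUL_THRESHOLD    32
        #define TOOM3_MUL_THRESHOLD       128
        #define FFT_MUL_THRESHOLD        4096

        static const char *
        mul_algorithm (long n)      /* n = operand size in limbs */
        {
          if (n < KARATSUBA_MUL_THRESHOLD)
            return "basecase";      /* schoolbook cross-products */
          else if (n < TOOM3_MUL_THRESHOLD)
            return "karatsuba";     /* three (N/2)x(N/2) multiplies */
          else if (n < FFT_MUL_THRESHOLD)
            return "toom3";         /* five (N/3)x(N/3) multiplies */
          else
            return "fft";           /* only with --enable-fft */
        }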

MxN multiplications of operands with different sizes are currently done by
splitting them up into various NxN pieces while above
KARATSUBA_MUL_THRESHOLD.  The karatsuba and toom3 routines then operate
only on equal size operands.  This is rather inefficient, and is slated for
improvement in the future.



BASECASE MULTIPLY

This is a straightforward rectangular set of cross-products, the same as
long multiplication done by hand and for that reason sometimes known as the
schoolbook or grammar school method.

See Knuth (reference in gmp.texi) volume 2 section 4.3.1 algorithm M, or the
mpn/generic/mul_basecase.c code.

Assembler implementations of mul_basecase are essentially the same as the
generic C version, but have all the usual assembler tricks and obscurities
introduced for speed.
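
As a rough illustration, the cross-product loops can be sketched as below,
assuming 32-bit limbs held in uint32_t and a 64-bit temporary for the
two-limb products.  sketch_mul_basecase() is an invented name; the real
mpn_mul_basecase works on mp_limb_t arrays and uses carry primitives such
as umul_ppmm.

        #include <stdint.h>

        /* rp[] must have room for un+vn limbs; up has un limbs, vp has vn.  */
        static void
        sketch_mul_basecase (uint32_t *rp, const uint32_t *up, int un,
                             const uint32_t *vp, int vn)
        {
          int i, j;

          for (i = 0; i < un + vn; i++)
            rp[i] = 0;

          for (i = 0; i < vn; i++)        /* one row of cross-products per limb of v */
            {
              uint64_t carry = 0;
              for (j = 0; j < un; j++)
                {
                  uint64_t t = (uint64_t) up[j] * vp[i] + rp[i + j] + carry;
                  rp[i + j] = (uint32_t) t;   /* low limb stays in place */
                  carry = t >> 32;            /* high limb propagates */
                }
              rp[i + un] = (uint32_t) carry;  /* top limb of this row */
            }
        }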



BASECASE SQUARE

A square can be done in roughly half the time of a multiply, by using the
fact that the cross products above and below the diagonal are the same.  In
practice squaring isn't 2x faster than multiplying, but it's always faster
by a decent factor.

             u0  u1  u2  u3  u4
           +---+---+---+---+---+
        u0 | x |   |   |   |   |
           +---+---+---+---+---+
        u1 |   | x |   |   |   |
           +---+---+---+---+---+
        u2 |   |   | x |   |   |
           +---+---+---+---+---+
        u3 |   |   |   | x |   |
           +---+---+---+---+---+
        u4 |   |   |   |   | x |
           +---+---+---+---+---+

The basic algorithm is to calculate a triangle of products below the
diagonal, double it (left shift by one bit), and add in the products on the
diagonal.  This can be seen in mpn/generic/sqr_basecase.c.  Again the
assembler implementations take essentially this same approach.
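
Under the same 32-bit limb assumptions as the multiply sketch above, the
triangle/double/diagonal steps look roughly as follows.
sketch_sqr_basecase() is an invented name, and mpn_sqr_basecase is
organized rather differently for speed.

        #include <stdint.h>
        #include <string.h>

        /* rp[] must have room for 2*n limbs; up has n limbs.  */
        static void
        sketch_sqr_basecase (uint32_t *rp, const uint32_t *up, int n)
        {
          int i, j;
          uint64_t t, carry;

          memset (rp, 0, 2 * n * sizeof (uint32_t));

          /* Triangle of cross-products below the diagonal: up[i]*up[j], j < i.  */
          for (i = 1; i < n; i++)
            {
              carry = 0;
              for (j = 0; j < i; j++)
                {
                  t = (uint64_t) up[i] * up[j] + rp[i + j] + carry;
                  rp[i + j] = (uint32_t) t;
                  carry = t >> 32;
                }
              rp[i + i] = (uint32_t) carry;
            }

          /* Double the triangle (left shift by one bit).  */
          carry = 0;
          for (i = 0; i < 2 * n; i++)
            {
              t = ((uint64_t) rp[i] << 1) | carry;
              rp[i] = (uint32_t) t;
              carry = t >> 32;
            }

          /* Add in the diagonal products up[i]^2.  */
          carry = 0;
          for (i = 0; i < n; i++)
            {
              t = (uint64_t) up[i] * up[i] + rp[2 * i] + carry;
              rp[2 * i] = (uint32_t) t;
              carry = t >> 32;
              t = (uint64_t) rp[2 * i + 1] + carry;
              rp[2 * i + 1] = (uint32_t) t;
              carry = t >> 32;
            }
        }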



KARATSUBA MULTIPLY

The Karatsuba multiplication algorithm is described in Knuth volume 2
section 4.3.3 part A, and in other texts like Geddes et al. section 4.3
(reference below).  A brief description is given here.

The Karatsuba algorithm treats its inputs x and y as each split in two parts
of equal length (or the most significant part one limb shorter if N is odd).

         high              low
        +----------+----------+
        |    x1    |    x0    |
        +----------+----------+

        +----------+----------+
        |    y1    |    y0    |
        +----------+----------+

Let b be the power of 2 where the split occurs, ie. if x0 is k limbs (y0 the
same) then b=2^(k*mp_bits_per_limb).  Then x=x1*b+x0 and y=y1*b+y0, and the
following holds,

        x*y = (b^2+b)*x1*y1 - b*(x1-x0)*(y1-y0) + (b+1)*x0*y0

This formula means doing only three multiplies of (N/2)x(N/2) limbs, whereas
a basecase multiply of NxN limbs is roughly equivalent to four multiplies of
(N/2)x(N/2).

The factors (b^2+b) etc in the formula look worse than they are,
representing simply the positions where the products must be added in or
subtracted.

         high                              low
        +--------+--------+ +--------+--------+
        |      x1*y1      | |      x0*y0      |
        +--------+--------+ +--------+--------+
                  +--------+--------+
                  |      x1*y1      |
                  +--------+--------+
                  +--------+--------+
                  |      x0*y0      |
                  +--------+--------+
                  +--------+--------+
                  | (x1-x0)*(y1-y0) |
                  +--------+--------+

The middle term (x1-x0)*(y1-y0) is subtracted, and since it can itself be
negative that subtraction sometimes turns into an addition, but the final
result is of course always positive.

The use of three multiplies of N/2 limbs each leads to an asymptotic speed
O(N^1.585).  (The exponent is log(3)/log(2).)  This is a big improvement
over the basecase multiply at O(N^2) and the algorithmic advantage soon
overcomes the extra additions Karatsuba must perform.
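
The identity is easy to check numerically.  The following sketch applies it
to 32-bit operands split into 16-bit halves (so b = 2^16); kara32() is an
invented name, and GMP of course applies the same identity recursively to
multi-limb operands in mpn/generic/mul_n.c.  All arithmetic below is modulo
2^64, which is harmless because the true product fits in 64 bits.

        #include <assert.h>
        #include <stdint.h>

        static uint64_t
        kara32 (uint32_t x, uint32_t y)
        {
          uint32_t x1 = x >> 16, x0 = x & 0xffff;
          uint32_t y1 = y >> 16, y0 = y & 0xffff;

          uint64_t hh = (uint64_t) x1 * y1;            /* x1*y1 */
          uint64_t ll = (uint64_t) x0 * y0;            /* x0*y0 */
          uint64_t dx = x1 >= x0 ? x1 - x0 : x0 - x1;  /* |x1-x0| */
          uint64_t dy = y1 >= y0 ? y1 - y0 : y0 - y1;  /* |y1-y0| */
          uint64_t mm = dx * dy;
          int      neg = (x1 >= x0) != (y1 >= y0);     /* sign of (x1-x0)*(y1-y0) */

          /* (b^2+b)*x1*y1 - b*(x1-x0)*(y1-y0) + (b+1)*x0*y0, with b = 2^16 */
          uint64_t r = (hh << 32) + (hh << 16) + (ll << 16) + ll;
          return neg ? r + (mm << 16) : r - (mm << 16);
        }

        int
        main (void)
        {
          assert (kara32 (123456789u, 987654321u)
                  == (uint64_t) 123456789u * 987654321u);
          return 0;
        }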



KARATSUBA SQUARE

A square is very similar to a multiply, but with x==y the formula reduces to
an equivalent with three squares,

        x^2 = (b^2+b)*x1^2 - b*(x1-x0)^2 + (b+1)*x0^2

The final result is accumulated from those three squares the same way as for
the three multiplies above.  The middle term (x1-x0)^2 however is now always
positive, so the middle step is always a subtraction.
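
The squaring variant of the numeric check above, again with b = 2^16 and
the invented name kara32_sqr(), purely for illustration; as before,
intermediate wraparound modulo 2^64 is harmless since the true square fits
in 64 bits.

        #include <assert.h>
        #include <stdint.h>

        static uint64_t
        kara32_sqr (uint32_t x)
        {
          uint32_t x1 = x >> 16, x0 = x & 0xffff;
          uint64_t hh = (uint64_t) x1 * x1;
          uint64_t ll = (uint64_t) x0 * x0;
          uint64_t d  = x1 >= x0 ? x1 - x0 : x0 - x1;
          uint64_t mm = d * d;          /* (x1-x0)^2, never negative */

          /* (b^2+b)*x1^2 - b*(x1-x0)^2 + (b+1)*x0^2 */
          return (hh << 32) + (hh << 16) - (mm << 16) + (ll << 16) + ll;
        }

        int
        main (void)
        {
          assert (kara32_sqr (3000000000u)
                  == (uint64_t) 3000000000u * 3000000000u);
          return 0;
        }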



TOOM-COOK 3-WAY MULTIPLY

The Karatsuba formula is part of a general approach to splitting inputs
leading to both Toom-Cook and FFT algorithms.  A description of Toom-Cook
can be found in Knuth volume 2 section 4.3.3, with an example 3-way
calculation after Theorem A.

Toom-Cook 3-way treats the operands as split into 3 pieces of equal size (or
the most significant part 1 or 2 limbs shorter than the others).

         high                         low
        +----------+----------+----------+
        |    x2    |    x1    |    x0    |
        +----------+----------+----------+

        +----------+----------+----------+
        |    y2    |    y1    |    y0    |
        +----------+----------+----------+

These parts are treated as the coefficients of two polynomials

        X(t) = x2*t^2 + x1*t + x0
        Y(t) = y2*t^2 + y1*t + y0

Again let b equal the power of 2 which is the size of the x0, x1, y0 and y1
pieces, ie. if they're k limbs each then b=2^(k*mp_bits_per_limb).  With
this then x=X(b) and y=Y(b).

Let a polynomial W(t)=X(t)*Y(t) and suppose its coefficients are

        W(t) = w4*t^4 + w3*t^3 + w2*t^2 + w1*t + w0

The w[i] are going to be determined, and when they are they'll give the
final result using w=W(b), since x*y=X(b)*Y(b)=W(b).  The coefficients will
each be roughly b^2, so the final W(b) will be an addition like,

         high                                        low
        +-------+-------+
        |       w4      |
        +-------+-------+
                +-------+-------+
                |       w3      |
                +-------+-------+
                        +-------+-------+
                        |       w2      |
                        +-------+-------+
                                +-------+-------+
                                |       w1      |
                                +-------+-------+
                                        +-------+-------+
                                        |       w0      |
                                        +-------+-------+
        -------------------------------------------------


The w[i] coefficients could be formed by a simple set of cross products,
like w4=x2*y2, w3=x2*y1+x1*y2, w2=x2*y0+x1*y1+x0*y2 etc, but this would need
all nine x[i]*y[j] for i,j=0,1,2, and would be equivalent merely to a
basecase multiply.  Instead the following approach is used.

X(t) and Y(t) are evaluated and multiplied at 5 points, giving values of
W(t) at those points.  The points used can be chosen in various ways, but in
GMP the following are used

        t=0    meaning x0*y0, which gives w0 immediately
        t=2    meaning (4*x2+2*x1+x0)*(4*y2+2*y1+y0)
        t=1    meaning (x2+x1+x0)*(y2+y1+y0)
        t=1/2  meaning (x2+2*x1+4*x0)*(y2+2*y1+4*y0)
        t=inf  meaning x2*y2, which gives w4 immediately

At t=1/2 the value calculated is actually 4*X(1/2)*Y(1/2), giving a value
for 16*W(1/2) (this is always an integer).  At t=inf the value is actually
X(t)*Y(t)/t^2 in the limit as t approaches infinity, but it's much easier to
think of that as simply x2*y2 giving w4 immediately (much like at t=0 x0*y0
gives w0 immediately).
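
For illustration, the evaluation step can be sketched with the pieces as
small machine integers (in GMP each piece is a multi-limb number, and the
struct and function names here are invented):

        #include <stdint.h>

        struct toom3_points { int64_t v0, vhalf16, v1, v2, vinf; };

        /* The five pointwise products: W(0), 16*W(1/2), W(1), W(2), W(inf).  */
        static struct toom3_points
        toom3_evaluate (int64_t x0, int64_t x1, int64_t x2,
                        int64_t y0, int64_t y1, int64_t y2)
        {
          struct toom3_points w;
          w.v0      = x0 * y0;                                  /* gives w0 */
          w.vhalf16 = (x2 + 2*x1 + 4*x0) * (y2 + 2*y1 + 4*y0);
          w.v1      = (x2 + x1 + x0) * (y2 + y1 + y0);
          w.v2      = (4*x2 + 2*x1 + x0) * (4*y2 + 2*y1 + y0);
          w.vinf    = x2 * y2;                                  /* gives w4 */
          return w;
        }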

Now each of the points substituted into W(t)=w4*t^4+...+w0 gives a linear
combination of the w[i] coefficients, and the value of those combinations
has just been calculated.

           W(0)   =                                  w0
        16*W(1/2) =    w4 +  2*w3 +  4*w2 +  8*w1 + 16*w0
           W(1)   =    w4 +    w3 +    w2 +    w1 +    w0
           W(2)   = 16*w4 +  8*w3 +  4*w2 +  2*w1 +    w0
           W(inf) =    w4

This is a set of five equations in five unknowns, and some elementary linear
algebra quickly isolates each w[i], by subtracting multiples of one equation
from another.

In the code the set of five values W(0),...,W(inf) will represent those
certain linear combinations.  By adding or subtracting one from another as
necessary, values which are each w[i] alone are arrived at.  This involves
only a few subtractions of small multiples (some of which are powers of 2),
and so is very fast.  A couple of divisions remain by powers of 2 and one
division by 3 (or by 6 rather), and that last uses the special fast
mpn_divexact_by3.
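
A concrete way to see the interpolation is the following sketch, again with
the values as small machine integers.  The particular sequence of steps and
the names are invented for illustration; interpolate3() is organized
differently, but like it, this uses only subtractions of small multiples
and exact divisions by 2 and 3.

        #include <assert.h>
        #include <stdint.h>

        static void
        toom3_interpolate (int64_t v0, int64_t vhalf16, int64_t v1,
                           int64_t v2, int64_t vinf, int64_t w[5])
        {
          int64_t b, d, c, s, diff;

          w[0] = v0;                          /* W(0)   = w0 */
          w[4] = vinf;                        /* W(inf) = w4 */

          b = (vhalf16 - 16*v0 - vinf) / 2;   /* w3 + 2*w2 + 4*w1 */
          d = (v2 - v0 - 16*vinf) / 2;        /* 4*w3 + 2*w2 + w1  */
          c = v1 - v0 - vinf;                 /* w3 + w2 + w1      */

          s    = b + d - 4*c;                 /* w3 + w1                         */
          diff = (b - d) / 3;                 /* w1 - w3, an exact division by 3 */

          w[1] = (s + diff) / 2;
          w[3] = (s - diff) / 2;
          w[2] = c - s;
        }

        int
        main (void)
        {
          /* Check against a direct product, with pieces x2,x1,x0 = 9,27,41 and
             y2,y1,y0 = 12,5,73, and b = 100 playing the role of the limb base.  */
          int64_t x0 = 41, x1 = 27, x2 = 9, y0 = 73, y1 = 5, y2 = 12, w[5];

          toom3_interpolate (x0*y0,
                             (x2 + 2*x1 + 4*x0) * (y2 + 2*y1 + 4*y0),
                             (x2 + x1 + x0) * (y2 + y1 + y0),
                             (4*x2 + 2*x1 + x0) * (4*y2 + 2*y1 + y0),
                             x2*y2, w);

          int64_t b = 100, x = x2*b*b + x1*b + x0, y = y2*b*b + y1*b + y0;
          assert (w[4]*b*b*b*b + w[3]*b*b*b + w[2]*b*b + w[1]*b + w[0] == x*y);
          return 0;
        }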

In the code the values w4, w2 and w0 are formed in the destination, and w3
and w1 are added to them.  There's an extra word at the high end of each of
w3, w2 and w1 that is handled separately.  With A=w0,B=w1,...,E=w4, the
additions are as follows.

         high                                        low
        +-------+-------+-------+-------+-------+-------+
        |       E       |       C       |       A       |
        +-------+-------+-------+-------+-------+-------+
                 +------+-------++------+-------+
                 |      D       ||      B       |
                 +------+-------++------+-------+
              --      --      --
             |tD|    |tC|    |tB|
              --      --      --
        -------------------------------------------------


The conversion of W(t) values to the coefficients is called interpolation.
A polynomial of degree 4 like W(t) is uniquely determined by values known at
5 different points.  The points can be chosen to make the linear equations
come out with a convenient set of steps for isolating the w[i]s.

In mpn/generic/mul_n.c the interpolate3() routine performs the
interpolation.  The open-coded one-pass version may be a bit hard to
understand, but the steps performed can be better seen in the USE_MORE_MPN
version.

The use of five multiplies of N/3 limbs each leads to an asymptotic speed
O(N^1.465).  (The exponent is log(5)/log(3).)  This is an improvement over
Karatsuba at O(N^1.585), though Toom-Cook does more work in the evaluation
and interpolation and so it's only above a certain size that Toom-Cook
realizes its advantage.

The formula given above for the Karatsuba algorithm has an equivalent for
Toom-Cook 3-way, involving only five multiplies, but this would be
complicated and unenlightening.

An alternate view of Toom-Cook 3-way can be found in Zuras (reference
below).  He uses a vector to represent the x and y splits and a matrix
multiplication for the evaluation and interpolation stages.  The matrix
inverses are not meant to be actually used, and they have elements with
values much greater than in fact arise in the interpolation steps.  The
diagram shown for the 3-way is attractive, but again it doesn't have to be
implemented that way and for example with a bit of rearrangement just one
division by 6 (not two of them) can be done.




TOOM-COOK 3-WAY SQUARE

Toom-Cook squaring follows the same procedure as multiplication, but there's
only one X(t) and it's evaluated at 5 points, and those values squared to
give values of W(t).  The interpolation is then identical, and in fact the
same interpolate3() subroutine is used for both squaring and multiplying.




FFT MULTIPLY

At large to very large sizes a Fermat style FFT is used, meaning an FFT in a
ring of integers modulo 2^M+1.  This is asymptotically fast, but overheads
mean it's only worthwhile for large products.

Some brief notes are given here.  Full explanations can be found in various
texts.  Knuth section 4.3.3 part C describes the method, but using complex
numbers.  In the references below Schonhage and Strassen is the original
paper, and Pollard gives some of the mathematics for a finite field as used
here.

The FFT does its multiplication modulo 2^N+1, but by choosing
N>=bits(A)+bits(B), a full product A*B is obtained.  The algorithm splits
the inputs into 2^k pieces, for a chosen k, and will recursively perform 2^k
pointwise multiplications modulo 2^M+1, where M=N/2^k.  N must be a multiple
of 2^k.  Those multiplications are either done by recursing into a further
FFT, or by a plain toom3 etc multiplication, whichever is optimal at the
resultant size.  Note that in the current implementation M is always a
multiple of the limb size.

The steps leading to the pointwise products are like the evaluation and
interpolation stages of the Karatsuba and Toom-Cook algorithms, but use a
convolution, which can be efficiently calculated because 2^(2N/2^k) is a
2^k'th root of unity.
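
The root of unity fact is easy to verify for a toy size.  With N=16 and
k=2, r = 2^(2N/2^k) = 2^8 satisfies r^(2^k) = 1 modulo 2^N+1 = 65537; a
real FFT of course uses a multi-limb N.

        #include <assert.h>
        #include <stdint.h>

        int
        main (void)
        {
          const uint64_t m = (1u << 16) + 1;   /* 2^N+1 with N=16 */
          uint64_t r = 1u << 8;                /* 2^(2N/2^k) with k=2 */
          uint64_t p = 1;
          for (int i = 0; i < 4; i++)          /* raise to the 2^k = 4th power */
            p = (p * r) % m;
          assert (p == 1);
          return 0;
        }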

As operand sizes get bigger, bigger splits are used.  Each time a bigger k
is used some multiplying is effectively swapped for some shifts, adds and
overheads.  A table of thresholds gives the points where a k+1 FFT is faster
than a k FFT.  A separate threshold gives the point where a mod 2^N+1 FFT
first becomes faster than a plain multiply of that size, and this normally
happens in the k=4 range.  A further threshold gives the point where an
N/2xN/2 multiply done with an FFT mod 2^N+1 is faster than a plain multiply
of that size, and this is normally in the k=7 or k=8 range.




TOOM-COOK N-WAY (NOT USED)

The 3-way Toom-Cook procedure described above generalizes to split into an
arbitrary number of pieces, as per Knuth volume 2 section 4.3.3 algorithm C.
Some code implementing this for GMP exists, but is not present since it has
yet to prove worthwhile.  The notes here are merely for interest.

In general a split into r+1 pieces will be made, and evaluations and
pointwise multiplications done at 2*r+1 points.  So a 4-way split does 7
pointwise multiplies, 5-way does 9, etc.

Asymptotically an r+1-way algorithm is O(N^(log(2*r+1)/log(r+1))).  Only the
pointwise multiplications count towards big O() complexity, but the time
spent in the evaluate and interpolate stages grows with r and has a
significant practical impact, with the asymptotic advantage of each r
realized only at bigger and bigger sizes.
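
The exponents mentioned here and above are easy to reproduce, for example:

        #include <math.h>
        #include <stdio.h>

        int
        main (void)
        {
          /* r=1 is Karatsuba (1.585), r=2 is Toom-3 (1.465), and so on.  */
          for (int r = 1; r <= 4; r++)
            printf ("%d-way: O(N^%.3f)\n", r + 1, log (2*r + 1) / log (r + 1));
          return 0;
        }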

Knuth algorithm C presents a version evaluating at points 0,1,2,...,2*r, but
exercise 4 uses -r,...,0,...,r and the latter saves some small multiplies in
the evaluate stage (or rather trades them for additions), and has a further
saving of nearly half the interpolate steps.  The answer to the exercise
doesn't give full details of the interpolate, but essentially the idea is to
separate odd and even final coefficients and then perform algorithm C steps
C7 and C8 on them separately.  The multipliers and divisors at step C7 then
become j^2 and 2*t*j-j*j respectively.

In the references below, Zuras presents 3-way and 4-way Toom-Cook methods,
and compares them to small order FFTs.  Hollerbach presents an N-way
algorithm from first principles.  Bernstein presents the N-way algorithm in
a style that facilitates comparing its mathematics to other multiplication
algorithms.




REFERENCES

"Algorithms for Computer Algebra", Keith O. Geddes, Stephen R. Czapor,
George Labahn, Kluwer Academic Publishers, 1992, ISBN 0-7923-9259-0.

"Schnelle Multiplikation grosser Zahlen", Arnold Schonhage and Volker
Strassen, Computing 7, p. 281-292, 1971.

"The Fast Fourier Transform in a Finite Field", J.M. Pollard, Mathematics of
Computation, vol 25, num 114, April 1971.

"On Squaring and Multiplying Large Integers", Dan Zuras, ARITH-11: IEEE
Symposium on Computer Arithmetic, 1993, pages 260 to 271.  And reprinted as
"More on Multiplying and Squaring Large Integers", IEEE Transactions on
Computers, August 1994.

"Fast Multiplication & Division of Very Large Numbers", Uwe Hollerbach, post
to sci.math.research, Jan 1996, archived at Swarthmore,
http://forum.swarthmore.edu/epigone/sci.math.research/zhouyimpzimp/x1ybdbxz5w4v@forum.swarthmore.edu

"Multidigit Multiplication for Mathematicians", Daniel J. Bernstein,
preprint available at http://koobera.math.uic.edu/www/papers.  (Every known
multiplication technique, many references.)




----------------
Local variables:
mode: text
fill-column: 76
End:
