This is gmp.info, produced by makeinfo version 4.2 from gmp.texi.

This manual describes how to install and use the GNU multiple precision
arithmetic library, version 4.1.2.

Copyright 1991, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000,
2001, 2002 Free Software Foundation, Inc.

Permission is granted to copy, distribute and/or modify this
document under the terms of the GNU Free Documentation License, Version
1.1 or any later version published by the Free Software Foundation;
with no Invariant Sections, with the Front-Cover Texts being "A GNU
Manual", and with the Back-Cover Texts being "You have freedom to copy
and modify this GNU Manual, like GNU software". A copy of the license
is included in *Note GNU Free Documentation License::.
INFO-DIR-SECTION GNU libraries
START-INFO-DIR-ENTRY
* gmp: (gmp).                   GNU Multiple Precision Arithmetic Library.
END-INFO-DIR-ENTRY


File: gmp.info, Node: FFT Multiplication, Next: Other Multiplication, Prev: Toom-Cook 3-Way Multiplication, Up: Multiplication Algorithms

FFT Multiplication
------------------

At large to very large sizes a Fermat style FFT multiplication is
used, following Schönhage and Strassen (*note References::).
Descriptions of FFTs in various forms can be found in many textbooks,
for instance Knuth section 4.3.3 part C or Lipson chapter IX. A brief
description of the form used in GMP is given here.

The multiplication done is x*y mod 2^N+1, for a given N. A full
product x*y is obtained by choosing N>=bits(x)+bits(y) and padding x
and y with high zero limbs. The modular product is the native form for
the algorithm, so padding to get a full product is unavoidable.

The algorithm follows a split, evaluate, pointwise multiply,
interpolate and combine sequence similar to that described above for
Karatsuba and Toom-3. A k parameter controls the split, with an FFT-k
splitting into 2^k pieces of M=N/2^k bits each. N must be a multiple of
(2^k)*mp_bits_per_limb so the split falls on limb boundaries, avoiding
bit shifts in the split and combine stages.

The evaluations, pointwise multiplications and interpolation are
all done modulo 2^N'+1 where N' is 2M+k+3 rounded up to a multiple of
2^k and of `mp_bits_per_limb'. The results of interpolation will be
the following negacyclic convolution of the input pieces, and the
choice of N' ensures these sums aren't truncated.

            ---
            \          b
     w[n] = /      (-1)  * x[i] * y[j]
            ---
        i+j == b*2^k + n
             b = 0,1

The points used for the evaluation are g^i for i=0 to 2^k-1 where
g=2^(2N'/2^k). g is a 2^k'th root of unity mod 2^N'+1, which produces
necessary cancellations at the interpolation stage, and it's also a
power of 2 so the fast Fourier transforms used for the evaluation and
interpolation do only shifts, adds and negations.

The pointwise multiplications are done modulo 2^N'+1 and either
recurse into a further FFT or use a plain multiplication (Toom-3,
Karatsuba or basecase), whichever is optimal at the size N'. The
interpolation is an inverse fast Fourier transform. The resulting set
of sums of x[i]*y[j] are added at appropriate offsets to give the final
result.

Squaring is the same, but x is the only input so it's one transform
at the evaluate stage and the pointwise multiplies are squares. The
interpolation is the same.

For a mod 2^N+1 product, an FFT-k is an O(N^(k/(k-1))) algorithm,
the exponent representing 2^k recursed modular multiplies each
1/2^(k-1) the size of the original. Each successive k is an asymptotic
improvement, but overheads mean each is only faster at bigger and
bigger sizes. In the code, `MUL_FFT_TABLE' and `SQR_FFT_TABLE' are the
thresholds where each k is used. Each new k effectively swaps some
multiplying for some shifts, adds and overheads.

A mod 2^N+1 product can be formed with a normal NxN->2N bit multiply
plus a subtraction, so an FFT and Toom-3 etc can be compared directly.
A k=4 FFT at O(N^1.333) can be expected to be the first faster than
Toom-3 at O(N^1.465). In practice this is what's found, with
`MUL_FFT_MODF_THRESHOLD' and `SQR_FFT_MODF_THRESHOLD' being between 300
and 1000 limbs, depending on the CPU. So far it's been found that only
very large FFTs recurse into pointwise multiplies above these sizes.

When an FFT is to give a full product, the change of N to 2N doesn't
alter the theoretical complexity for a given k, but for the purposes of
considering where an FFT might be first used it can be assumed that the
FFT is recursing into a normal multiply and that on that basis it's
doing 2^k recursed multiplies each 1/2^(k-2) the size of the inputs,
making it O(N^(k/(k-2))). This would mean k=7 at O(N^1.4) would be the
first FFT faster than Toom-3. In practice `MUL_FFT_THRESHOLD' and
`SQR_FFT_THRESHOLD' have been found to be in the k=8 range, somewhere
between 3000 and 10000 limbs.

The way N is split into 2^k pieces and then 2M+k+3 is rounded up to
a multiple of 2^k and `mp_bits_per_limb' means that when
2^k>=mp_bits_per_limb the effective N is a multiple of 2^(2k-1) bits.
The +k+3 means some values of N just under such a multiple will be
rounded to the next. The complexity calculations above assume that a
favourable size is used, meaning one which isn't padded through
rounding, and it's also assumed that the extra +k+3 bits are negligible
at typical FFT sizes.

The practical effect of the 2^(2k-1) constraint is to introduce a
step-effect into measured speeds. For example k=8 will round N up to a
multiple of 32768 bits, so for a 32-bit limb there'll be 512 limb
groups of sizes for which `mpn_mul_n' runs at the same speed. Or for
k=9 groups of 2048 limbs, k=10 groups of 8192 limbs, etc. In practice
it's been found each k is used at quite small multiples of its size
constraint and so the step effect is quite noticeable in a time versus
size graph.

The threshold determinations currently measure at the mid-points of
size steps, but this is sub-optimal since at the start of a new step it
can happen that it's better to go back to the previous k for a while.
Something more sophisticated for `MUL_FFT_TABLE' and `SQR_FFT_TABLE'
will be needed.


File: gmp.info, Node: Other Multiplication, Prev: FFT Multiplication, Up: Multiplication Algorithms

Other Multiplication
--------------------

The 3-way Toom-Cook algorithm described above (*note Toom-Cook 3-Way
Multiplication::) generalizes to split into an arbitrary number of
pieces, as per Knuth section 4.3.3 algorithm C. This is not currently
used, though it's possible a Toom-4 might fit in between Toom-3 and the
FFTs. The notes here are merely for interest.

In general a split into r+1 pieces is made, and evaluations and
pointwise multiplications done at 2*r+1 points. A 4-way split does 7
pointwise multiplies, 5-way does 9, etc. Asymptotically an (r+1)-way
algorithm is O(N^(log(2*r+1)/log(r+1))). Only the pointwise
multiplications count towards big-O complexity, but the time spent in
the evaluate and interpolate stages grows with r and has a significant
practical impact, with the asymptotic advantage of each r realized only
at bigger and bigger sizes. The overheads grow as O(N*r), whereas in
an r=2^k FFT they grow only as O(N*log(r)).

Knuth algorithm C evaluates at points 0,1,2,...,2*r, but exercise 4
uses -r,...,0,...,r and the latter saves some small multiplies in the
evaluate stage (or rather trades them for additions), and has a further
saving of nearly half the interpolate steps. The idea is to separate
odd and even final coefficients and then perform algorithm C steps C7
and C8 on them separately. The divisors at step C7 become j^2 and the
multipliers at C8 become 2*t*j-j^2.

Splitting odd and even parts through positive and negative points
can be thought of as using -1 as a square root of unity. If a 4th root
of unity were available then a further split and speedup would be
possible, but no such root exists for plain integers. Going to complex
integers with i=sqrt(-1) doesn't help, essentially because in Cartesian
form it takes three real multiplies to do a complex multiply. The
existence of 2^k'th roots of unity in a suitable ring or field lets the
fast Fourier transform keep splitting and get to O(N*log(r)).

Floating point FFTs use complex numbers approximating Nth roots of
unity. Some processors have special support for such FFTs. But these
are not used in GMP since it's very difficult to guarantee an exact
result (to some number of bits). An occasional difference of 1 in the
last bit might not matter to a typical signal processing algorithm, but
is of course of vital importance to GMP.


File: gmp.info, Node: Division Algorithms, Next: Greatest Common Divisor Algorithms, Prev: Multiplication Algorithms, Up: Algorithms

Division Algorithms
===================

* Menu:

* Single Limb Division::
* Basecase Division::
* Divide and Conquer Division::
* Exact Division::
* Exact Remainder::
* Small Quotient Division::


File: gmp.info, Node: Single Limb Division, Next: Basecase Division, Prev: Division Algorithms, Up: Division Algorithms

Single Limb Division
--------------------

Nx1 division is implemented using repeated 2x1 divisions from high
to low, either with a hardware divide instruction or a multiplication by
inverse, whichever is best on a given CPU.

The multiply by inverse follows section 8 of "Division by Invariant
Integers using Multiplication" by Granlund and Montgomery (*note
References::) and is implemented as `udiv_qrnnd_preinv' in
`gmp-impl.h'. The idea is to have a fixed-point approximation to 1/d
(see `invert_limb') and then multiply by the high limb (plus one bit)
of the dividend to get a quotient q. With d normalized (high bit set),
q is no more than 1 too small. Subtracting q*d from the dividend gives
a remainder, and reveals whether q or q-1 is correct.

The result is a division done with two multiplications and four or
five arithmetic operations. On CPUs with low latency multipliers this
can be much faster than a hardware divide, though the cost of
calculating the inverse at the start may mean it's only better on
inputs bigger than say 4 or 5 limbs.

When a divisor must be normalized, either for the generic C
`__udiv_qrnnd_c' or the multiply by inverse, the division performed is
actually a*2^k by d*2^k where a is the dividend and k is the power
necessary to have the high bit of d*2^k set. The bit shifts for the
dividend are usually accomplished "on the fly" meaning by extracting
the appropriate bits at each step. Done this way the quotient limbs
come out aligned ready to store. When only the remainder is wanted, an
alternative is to take the dividend limbs unshifted and calculate r = a
mod d*2^k followed by an extra final step r*2^k mod d*2^k. This can
help on CPUs with poor bit shifts or few registers.

The multiply by inverse can be done two limbs at a time. The
calculation is basically the same, but the inverse is two limbs and the
divisor treated as if padded with a low zero limb. This means more
work, since the inverse will need a 2x2 multiply, but the four 1x1s to
do that are independent and can therefore be done partly or wholly in
parallel. Likewise for a 2x1 calculating q*d. The net effect is to
process two limbs with roughly the same two multiplies worth of latency
that one limb at a time gives. This extends to 3 or 4 limbs at a time,
though the extra work to apply the inverse will almost certainly soon
reach the limits of multiplier throughput.

A similar approach in reverse can be taken to process just half a
limb at a time if the divisor is only a half limb. In this case the
1x1 multiply for the inverse effectively becomes two (1/2)x1 for each
limb, which can be a saving on CPUs with a fast half limb multiply, or
in fact if the only multiply is a half limb, and especially if it's not
pipelined.


File: gmp.info, Node: Basecase Division, Next: Divide and Conquer Division, Prev: Single Limb Division, Up: Division Algorithms

Basecase Division
-----------------

Basecase NxM division is like long division done by hand, but in base
2^mp_bits_per_limb. See Knuth section 4.3.1 algorithm D, and
`mpn/generic/sb_divrem_mn.c'.

Briefly stated, while the dividend remains larger than the divisor,
a high quotient limb is formed and the Nx1 product q*d subtracted at
the top end of the dividend. With a normalized divisor (most
significant bit set), each quotient limb can be formed with a 2x1
division and a 1x1 multiplication plus some subtractions. The 2x1
division is by the high limb of the divisor and is done either with a
hardware divide or a multiply by inverse (the same as in *Note Single
Limb Division::) whichever is faster. Such a quotient is sometimes one
too big, requiring an addback of the divisor, but that happens rarely.

With Q=N-M being the number of quotient limbs, this is an O(Q*M)
algorithm and will run at a speed similar to a basecase QxM
multiplication, differing in fact only in the extra multiply and divide
for each of the Q quotient limbs.


File: gmp.info, Node: Divide and Conquer Division, Next: Exact Division, Prev: Basecase Division, Up: Division Algorithms

Divide and Conquer Division
---------------------------

For divisors larger than `DIV_DC_THRESHOLD', division is done by
dividing. Or to be precise by a recursive divide and conquer algorithm
based on work by Moenck and Borodin, Jebelean, and Burnikel and Ziegler
(*note References::).

The algorithm consists essentially of recognising that a 2NxN
division can be done with the basecase division algorithm (*note
Basecase Division::), but using N/2 limbs as a base, not just a single
limb. This way the multiplications that arise are (N/2)x(N/2) and can
take advantage of Karatsuba and higher multiplication algorithms (*note
Multiplication Algorithms::). The "digits" of the quotient are formed
by recursive Nx(N/2) divisions.

If the (N/2)x(N/2) multiplies are done with a basecase multiplication
then the work is about the same as a basecase division, but with more
function call overheads and with some subtractions separated from the
multiplies. These overheads mean that it's only when N/2 is above
`MUL_KARATSUBA_THRESHOLD' that divide and conquer is of use.

`DIV_DC_THRESHOLD' is based on the divisor size N, so it will be
somewhere above twice `MUL_KARATSUBA_THRESHOLD', but how much above
depends on the CPU. An optimized `mpn_mul_basecase' can lower
`DIV_DC_THRESHOLD' a little by offering a ready-made advantage over
repeated `mpn_submul_1' calls.

Divide and conquer is asymptotically O(M(N)*log(N)) where M(N) is
the time for an NxN multiplication done with FFTs. The actual time is
a sum over multiplications of the recursed sizes, as can be seen near
the end of section 2.2 of Burnikel and Ziegler. For example, within
the Toom-3 range, divide and conquer is 2.63*M(N). With higher
algorithms the M(N) term improves and the multiplier tends to log(N).
In practice, at moderate to large sizes, a 2NxN division is about 2 to
4 times slower than an NxN multiplication.

Newton's method used for division is asymptotically O(M(N)) and
should therefore be superior to divide and conquer, but it's believed
this would only be for large to very large N.


File: gmp.info, Node: Exact Division, Next: Exact Remainder, Prev: Divide and Conquer Division, Up: Division Algorithms

Exact Division
--------------

A so-called exact division is when the dividend is known to be an
exact multiple of the divisor. Jebelean's exact division algorithm
uses this knowledge to make some significant optimizations (*note
References::).

The idea can be illustrated in decimal for example with 368154
divided by 543. Because the low digit of the dividend is 4, the low
digit of the quotient must be 8. This is arrived at from 4*7 mod 10,
using the fact 7 is the modular inverse of 3 (the low digit of the
divisor), since 3*7 == 1 mod 10. So 8*543=4344 can be subtracted from
the dividend leaving 363810. Notice the low digit has become zero.

The procedure is repeated at the second digit, with the next
quotient digit 7 (7 == 1*7 mod 10), subtracting 7*543=3801, leaving
325800. And finally at the third digit with quotient digit 6 (8*7 mod
10), subtracting 6*543=3258 leaving 0. So the quotient is 678.

Notice however that the multiplies and subtractions don't need to
extend past the low three digits of the dividend, since that's enough
to determine the three quotient digits. For the last quotient digit no
subtraction is needed at all. On a 2NxN division like this one, only
about half the work of a normal basecase division is necessary.

For an NxM exact division producing Q=N-M quotient limbs, the saving
over a normal basecase division is in two parts. Firstly, each of the
Q quotient limbs needs only one multiply, not a 2x1 divide and
multiply. Secondly, the crossproducts are reduced when Q>M to
Q*M-M*(M+1)/2, or when Q<=M to Q*(Q-1)/2. Notice the savings are
complementary. If Q is big then many divisions are saved, or if Q is
small then the crossproducts reduce to a small number.

The modular inverse used is calculated efficiently by
`modlimb_invert' in `gmp-impl.h'. This does four multiplies for a
32-bit limb, or six for a 64-bit limb. `tune/modlinv.c' has some
alternate implementations that might suit processors better at bit
twiddling than multiplying.

The sub-quadratic exact division described by Jebelean in "Exact
Division with Karatsuba Complexity" is not currently implemented. It
uses a rearrangement similar to the divide and conquer for normal
division (*note Divide and Conquer Division::), but operating from low
to high. A further possibility not currently implemented is
"Bidirectional Exact Integer Division" by Krandick and Jebelean which
forms quotient limbs from both the high and low ends of the dividend,
and can halve once more the number of crossproducts needed in a 2NxN
division.

A special case exact division by 3 exists in `mpn_divexact_by3',
supporting Toom-3 multiplication and `mpq' canonicalizations. It forms
quotient digits with a multiply by the modular inverse of 3 (which is
`0xAA..AAB') and uses two comparisons to determine a borrow for the next
limb. The multiplications don't need to be on the dependent chain, as
long as the effect of the borrows is applied. Only a few optimized
assembler implementations currently exist.


File: gmp.info, Node: Exact Remainder, Next: Small Quotient Division, Prev: Exact Division, Up: Division Algorithms

Exact Remainder
---------------

If the exact division algorithm is done with a full subtraction at
each stage and the dividend isn't a multiple of the divisor, then low
zero limbs are produced but with a remainder in the high limbs. With
dividend a, divisor d, quotient q, and b = 2^mp_bits_per_limb, this
remainder r is of the form

     a = q*d + r*b^n

n represents the number of zero limbs produced by the subtractions,
that being the number of limbs produced for q. r will be in the range
0<=r<d and can be viewed as a remainder, but one shifted up by a factor
of b^n.

Carrying out full subtractions at each stage means the same number
of cross products must be done as a normal division, but there are
still some single limb divisions saved. When d is a single limb some
simplifications arise, providing good speedups on a number of
processors.

`mpn_bdivmod', `mpn_divexact_by3', `mpn_modexact_1_odd' and the
`redc' function in `mpz_powm' differ subtly in how they return r,
leading to some negations in the above formula, but all are essentially
the same.

Clearly r is zero when a is a multiple of d, and this leads to
divisibility or congruence tests which are potentially more efficient
than a normal division.

The factor of b^n on r can be ignored in a GCD when d is odd, hence
the use of `mpn_bdivmod' in `mpn_gcd', and the use of
`mpn_modexact_1_odd' by `mpn_gcd_1' and `mpz_kronecker_ui' etc (*note
Greatest Common Divisor Algorithms::).

Montgomery's REDC method for modular multiplications uses operands
of the form of x*b^-n and y*b^-n and on calculating (x*b^-n)*(y*b^-n)
uses the factor of b^n in the exact remainder to reach a product in the
same form (x*y)*b^-n (*note Modular Powering Algorithm::).

Notice that r generally gives no useful information about the
ordinary remainder a mod d since b^n mod d could be anything. If
however b^n == 1 mod d, then r is the negative of the ordinary
remainder. This occurs whenever d is a factor of b^n-1, as for example
with 3 in `mpn_divexact_by3'. Other such factors include 5, 17 and
257, but no particular use has been found for this.


File: gmp.info, Node: Small Quotient Division, Prev: Exact Remainder, Up: Division Algorithms

Small Quotient Division
-----------------------

An NxM division where the number of quotient limbs Q=N-M is small
can be optimized somewhat.

An ordinary basecase division normalizes the divisor by shifting it
to make the high bit set, shifting the dividend accordingly, and
shifting the remainder back down at the end of the calculation. This
is wasteful if only a few quotient limbs are to be formed. Instead a
division of just the top 2*Q limbs of the dividend by the top Q limbs
of the divisor can be used to form a trial quotient. This requires
only those limbs normalized, not the whole of the divisor and dividend.

A multiply and subtract then applies the trial quotient to the M-Q
unused limbs of the divisor and N-Q dividend limbs (which includes Q
limbs remaining from the trial quotient division). The starting trial
quotient can be 1 or 2 too big, but all cases of 2 too big and most
cases of 1 too big are detected by first comparing the most significant
limbs that will arise from the subtraction. An addback is done if the
quotient still turns out to be 1 too big.

This whole procedure is essentially the same as one step of the
basecase algorithm done in a Q limb base, though with the trial
quotient test done only with the high limbs, not an entire Q limb
"digit" product. The correctness of this weaker test can be
established by following the argument of Knuth section 4.3.1 exercise
20 but with the v2*q>b*r+u2 condition appropriately relaxed.


File: gmp.info, Node: Greatest Common Divisor Algorithms, Next: Powering Algorithms, Prev: Division Algorithms, Up: Algorithms

Greatest Common Divisor
=======================

* Menu:

* Binary GCD::
* Accelerated GCD::
* Extended GCD::
* Jacobi Symbol::


File: gmp.info, Node: Binary GCD, Next: Accelerated GCD, Prev: Greatest Common Divisor Algorithms, Up: Greatest Common Divisor Algorithms

Binary GCD
----------

At small sizes GMP uses an O(N^2) binary style GCD. This is
described in many textbooks, for example Knuth section 4.5.2 algorithm
B. It simply consists of successively reducing operands a and b using
gcd(a,b) = gcd(min(a,b),abs(a-b)), together with the fact that if a and
b are first made odd then abs(a-b) is even and factors of two can be
discarded.

Variants like letting a-b become negative and doing a different next
step are of interest only as far as they suit particular CPUs, since on
small operands it's machine dependent factors that determine
performance.

The Euclidean GCD algorithm, as per Knuth algorithms E and A,
reduces using a mod b but this has so far been found to be slower
everywhere. One reason the binary method does well is that the implied
quotient at each step is usually small, so often only one or two
subtractions are needed to get the same effect as a division.
Quotients 1, 2 and 3 for example occur 67.7% of the time, see Knuth
section 4.5.3 Theorem E.

When the implied quotient is large, meaning b is much smaller than
a, then a division is worthwhile. This is the basis for the initial a
mod b reductions in `mpn_gcd' and `mpn_gcd_1' (the latter for both Nx1
and 1x1 cases). But after that initial reduction, big quotients occur
too rarely to make it worth checking for them.


File: gmp.info, Node: Accelerated GCD, Next: Extended GCD, Prev: Binary GCD, Up: Greatest Common Divisor Algorithms

Accelerated GCD
---------------

For sizes above `GCD_ACCEL_THRESHOLD', GMP uses the Accelerated GCD
algorithm described independently by Weber and Jebelean (the latter as
the "Generalized Binary" algorithm), *note References::. This
algorithm is still O(N^2), but is much faster than the binary algorithm
since it does fewer multi-precision operations. It consists of
alternating the k-ary reduction by Sorenson, and a "dmod" exact
remainder reduction.

For operands u and v the k-ary reduction replaces u with n*v-d*u
where n and d are single limb values chosen to give two trailing zero
limbs on that value, which can be stripped. n and d are calculated
using an algorithm similar to half of a two limb GCD (see `find_a' in
`mpn/generic/gcd.c').

When u and v differ in size by more than a certain number of bits, a
dmod is performed to zero out bits at the low end of the larger. It
consists of an exact remainder style division applied to an appropriate
number of bits (*note Exact Division::, and *note Exact Remainder::).
This is faster than a k-ary reduction but useful only when the operands
differ in size. There's a dmod after each k-ary reduction, and if the
dmod leaves the operands still differing in size then it's repeated.

The k-ary reduction step can introduce spurious factors into the GCD
calculated, and these are eliminated at the end by taking GCDs with the
original inputs gcd(u,gcd(v,g)) using the binary algorithm. Since g is
almost always small this takes very little time.

At small sizes the algorithm needs a good implementation of
`find_a'. At larger sizes it's dominated by `mpn_addmul_1' applying n
and d.


File: gmp.info, Node: Extended GCD, Next: Jacobi Symbol, Prev: Accelerated GCD, Up: Greatest Common Divisor Algorithms

Extended GCD
------------

The extended GCD calculates gcd(a,b) and also cofactors x and y
satisfying a*x+b*y=gcd(a,b). Lehmer's multi-step improvement of the
extended Euclidean algorithm is used. See Knuth section 4.5.2
algorithm L, and `mpn/generic/gcdext.c'. This is an O(N^2) algorithm.

The multipliers at each step are found using single limb
calculations for sizes up to `GCDEXT_THRESHOLD', or double limb
calculations above that. The single limb code is faster but doesn't
produce full-limb multipliers, hence not making full use of the
`mpn_addmul_1' calls.

When a CPU has a data-dependent multiplier, meaning one which is
faster on operands with fewer bits, the extra work in the double-limb
calculation might only save some looping overheads, leading to a large
`GCDEXT_THRESHOLD'.

Currently the single limb calculation doesn't optimize for the small
quotients that often occur, and this can lead to unusually low values of
`GCDEXT_THRESHOLD', depending on the CPU.

An analysis of double-limb calculations can be found in "A
Double-Digit Lehmer-Euclid Algorithm" by Jebelean (*note References::).
The code in GMP was developed independently.

It should be noted that when a double limb calculation is used, it's
used for the whole of that GCD; it doesn't fall back to single limb
part way through. This is because as the algorithm proceeds, the
inputs a and b are reduced, but the cofactors x and y grow, so the
multipliers at each step are applied to a roughly constant total number
of limbs.


File: gmp.info, Node: Jacobi Symbol, Prev: Extended GCD, Up: Greatest Common Divisor Algorithms

Jacobi Symbol
-------------

`mpz_jacobi' and `mpz_kronecker' are currently implemented with a
simple binary algorithm similar to that described for the GCDs (*note
Binary GCD::). They're not very fast when both inputs are large.
Lehmer's multi-step improvement or a binary based multi-step algorithm
is likely to be better.

When one operand fits a single limb, and that includes
`mpz_kronecker_ui' and friends, an initial reduction is done with
either `mpn_mod_1' or `mpn_modexact_1_odd', followed by the binary
algorithm on a single limb. The binary algorithm is well suited to a
single limb, and the whole calculation in this case is quite efficient.

In all the routines sign changes for the result are accumulated
using some bit twiddling, avoiding table lookups or conditional jumps.


! 593: File: gmp.info, Node: Powering Algorithms, Next: Root Extraction Algorithms, Prev: Greatest Common Divisor Algorithms, Up: Algorithms
! 594:
! 595: Powering Algorithms
! 596: ===================
! 597:
! 598: * Menu:
! 599:
! 600: * Normal Powering Algorithm::
! 601: * Modular Powering Algorithm::
! 602:
! 603:
! 604: File: gmp.info, Node: Normal Powering Algorithm, Next: Modular Powering Algorithm, Prev: Powering Algorithms, Up: Powering Algorithms
! 605:
! 606: Normal Powering
! 607: ---------------
! 608:
! 609: Normal `mpz' or `mpf' powering uses a simple binary algorithm,
! 610: successively squaring and then multiplying by the base when a 1 bit is
! 611: seen in the exponent, as per Knuth section 4.6.3. The "left to right"
! 612: variant described there is used rather than algorithm A, since it's
! 613: just as easy and can be done with somewhat less temporary memory.
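The left to right variant can be sketched as follows (a Python sketch of the
method, not GMP code; an exponent e >= 1 is assumed):

```python
def power(base, e):
    # left-to-right binary powering: square for every exponent bit below
    # the top one, and multiply by the base wherever a 1 bit appears
    result = base
    for i in range(e.bit_length() - 2, -1, -1):
        result *= result
        if (e >> i) & 1:
            result *= base
    return result
```

The exponent's top bit is consumed by the initial `result = base`, so the
loop needs no special first iteration.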
! 614:
! 615:
! 616: File: gmp.info, Node: Modular Powering Algorithm, Prev: Normal Powering Algorithm, Up: Powering Algorithms
! 617:
! 618: Modular Powering
! 619: ----------------
! 620:
! 621: Modular powering is implemented using a 2^k-ary sliding window
! 622: algorithm, as per "Handbook of Applied Cryptography" algorithm 14.85
! 623: (*note References::). k is chosen according to the size of the
! 624: exponent. Larger exponents use larger values of k, the choice being
! 625: made to minimize the average number of multiplications that must
! 626: supplement the squaring.
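A sliding window exponentiation can be sketched as follows.  This is a
hedged illustration with a fixed window size `k` and plain modular
reduction, whereas GMP picks k from the exponent size and uses REDC where
possible:

```python
def powm_sliding(b, e, m, k=4):
    # precompute the odd powers b^1, b^3, ..., b^(2^k - 1) mod m
    b2 = b * b % m
    odd = {1: b % m}
    for i in range(3, 1 << k, 2):
        odd[i] = odd[i - 2] * b2 % m
    result = 1
    i = e.bit_length() - 1
    while i >= 0:
        if not (e >> i) & 1:
            result = result * result % m          # square past 0 bits
            i -= 1
            continue
        j = max(i - k + 1, 0)                     # window of at most k bits,
        while not (e >> j) & 1:                   # ending in a 1 bit so the
            j += 1                                # window value is odd
        w = (e >> j) & ((1 << (i - j + 1)) - 1)
        for _ in range(i - j + 1):
            result = result * result % m
        result = result * odd[w] % m              # one multiply per window
        i = j - 1
    return result
```

Because each window is odd, only half the powers need precomputing, and on
average fewer multiplies supplement the squarings than in plain binary
powering.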
! 627:
! 628: The modular multiplies and squares use either a simple division or
! 629: the REDC method by Montgomery (*note References::). REDC is a little
! 630: faster, essentially saving N single limb divisions in a fashion similar
! 631: to an exact remainder (*note Exact Remainder::). The current REDC has
! 632: some limitations.  It's only O(N^2), so above `POWM_THRESHOLD' division
! 633: becomes faster and is used. It doesn't attempt to detect small bases,
! 634: but rather always uses a REDC form, which is usually a full size
! 635: operand. And lastly it's only applied to odd moduli.
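The REDC step itself can be sketched as follows, assuming Python 3.8+ for
`pow(N, -1, R)`; the function names are illustrative, and GMP of course
works limb by limb rather than on whole Python integers:

```python
def redc(T, N, R, Ninv):
    # Montgomery reduction: for 0 <= T < N*R, returns T * R^-1 mod N.
    # R is a power of 2 exceeding N, N is odd, and Ninv = -N^-1 mod R.
    m = ((T % R) * Ninv) % R        # chosen so T + m*N is divisible by R
    t = (T + m * N) // R            # exact division by R: just a shift
    return t - N if t >= N else t

def redc_mulmod(a, b, N, R):
    # multiply a*b mod N entirely in Montgomery form
    Ninv = (-pow(N, -1, R)) % R
    aR, bR = a * R % N, b * R % N   # convert inputs to Montgomery form
    abR = redc(aR * bR, N, R, Ninv) # product stays in Montgomery form
    return redc(abR, N, R, Ninv)    # convert back: a*b mod N
```

The division by R is exact by construction, which is what saves the single
limb divisions a plain remainder would need.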
! 636:
! 637:
! 638: File: gmp.info, Node: Root Extraction Algorithms, Next: Radix Conversion Algorithms, Prev: Powering Algorithms, Up: Algorithms
! 639:
! 640: Root Extraction Algorithms
! 641: ==========================
! 642:
! 643: * Menu:
! 644:
! 645: * Square Root Algorithm::
! 646: * Nth Root Algorithm::
! 647: * Perfect Square Algorithm::
! 648: * Perfect Power Algorithm::
! 649:
! 650:
! 651: File: gmp.info, Node: Square Root Algorithm, Next: Nth Root Algorithm, Prev: Root Extraction Algorithms, Up: Root Extraction Algorithms
! 652:
! 653: Square Root
! 654: -----------
! 655:
! 656: Square roots are taken using the "Karatsuba Square Root" algorithm
! 657: by Paul Zimmermann (*note References::). This is expressed in a divide
! 658: and conquer form, but as noted in the paper it can also be viewed as a
! 659: discrete variant of Newton's method.
! 660:
! 661: In the Karatsuba multiplication range this is an O(1.5*M(N/2))
! 662: algorithm, where M(n) is the time to multiply two numbers of n limbs.
! 663: In the FFT multiplication range this grows to a bound of O(6*M(N/2)).
! 664: In practice a factor of about 1.5 to 1.8 is found in the Karatsuba and
! 665: Toom-3 ranges, growing to 2 or 3 in the FFT range.
! 666:
! 667: The algorithm does all its calculations in integers and the resulting
! 668: `mpn_sqrtrem' is used for both `mpz_sqrt' and `mpf_sqrt'. The extended
! 669: precision given by `mpf_sqrt_ui' is obtained by padding with zero limbs.
! 670:
! 671:
! 672: File: gmp.info, Node: Nth Root Algorithm, Next: Perfect Square Algorithm, Prev: Square Root Algorithm, Up: Root Extraction Algorithms
! 673:
! 674: Nth Root
! 675: --------
! 676:
! 677: Integer Nth roots are taken using Newton's method with the following
! 678: iteration, where A is the input and n is the root to be taken.
! 679:
! 680: 1 A
! 681: a[i+1] = - * ( --------- + (n-1)*a[i] )
! 682: n a[i]^(n-1)
! 683:
! 684: The initial approximation a[1] is generated bitwise by successively
! 685: powering a trial root with or without new 1 bits, aiming to be just
! 686: above the true root. The iteration converges quadratically when
! 687: started from a good approximation. When n is large more initial bits
! 688: are needed to get good convergence. The current implementation is not
! 689: particularly well optimized.
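A direct transcription of this iteration, with the bitwise initial
approximation replaced by a simple power-of-two guess at or above the true
root (a sketch, assuming A >= 1 and n >= 1):

```python
def nth_root(A, n):
    # Newton iteration a[i+1] = (A/a[i]^(n-1) + (n-1)*a[i]) / n in integers.
    # Starting at or above the true root the sequence decreases
    # monotonically, so stop as soon as it fails to decrease.
    a = 1 << -(-A.bit_length() // n)          # 2^ceil(bits/n) >= true root
    while True:
        b = (A // a ** (n - 1) + (n - 1) * a) // n
        if b >= a:
            return a
        a = b
```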
! 690:
! 691:
! 692: File: gmp.info, Node: Perfect Square Algorithm, Next: Perfect Power Algorithm, Prev: Nth Root Algorithm, Up: Root Extraction Algorithms
! 693:
! 694: Perfect Square
! 695: --------------
! 696:
! 697: `mpz_perfect_square_p' is able to quickly exclude most non-squares by
! 698: checking whether the input is a quadratic residue modulo some small
! 699: integers.
! 700:
! 701: The first test is modulo 256 which means simply examining the least
! 702: significant byte. Only 44 different values occur as the low byte of a
! 703: square, so 82.8% of non-squares can be immediately excluded. Similar
! 704: tests modulo primes from 3 to 29 exclude 99.5% of those remaining, or
! 705: if a limb is 64 bits then primes up to 53 are used, excluding 99.99%.
! 706: A single Nx1 remainder using `PP' from `gmp-impl.h' quickly gives all
! 707: these remainders.
! 708:
! 709: A square root must still be taken for any value that passes the
! 710: residue tests, to verify it's really a square and not one of the 0.086%
! 711: (or 0.000156% for 64 bits) non-squares that get through. *Note Square
! 712: Root Algorithm::.
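The mod 256 test and the final square root verification can be modelled as
follows (a sketch using Python's `math.isqrt` in place of `mpn_sqrtrem`, and
omitting the further tests modulo small primes):

```python
import math

# only 44 of the 256 possible low bytes can end a square
SQUARES_MOD_256 = {(i * i) & 0xFF for i in range(256)}

def perfect_square_p(t):
    if t < 0:
        return False
    if (t & 0xFF) not in SQUARES_MOD_256:
        return False              # about 82.8% of non-squares stop here
    r = math.isqrt(t)             # the unavoidable square root check
    return r * r == t
```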
! 713:
! 714:
! 715: File: gmp.info, Node: Perfect Power Algorithm, Prev: Perfect Square Algorithm, Up: Root Extraction Algorithms
! 716:
! 717: Perfect Power
! 718: -------------
! 719:
! 720: Detecting perfect powers is required by some factorization
! 721: algorithms. Currently `mpz_perfect_power_p' is implemented using
! 722: repeated Nth root extractions, though naturally only prime roots need
! 723: to be considered. (*Note Nth Root Algorithm::.)
! 724:
! 725: If a prime divisor p with multiplicity e can be found, then only
! 726: roots which are divisors of e need to be considered, much reducing the
! 727: work necessary. To this end divisibility by a set of small primes is
! 728: checked.
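A minimal version of this strategy, with the small-prime divisibility step
omitted and a binary-search root in place of Newton's method (the names are
illustrative; t > 1 is assumed):

```python
def iroot(t, p):
    # floor integer p-th root, by binary search for simplicity
    lo, hi = 1, 1 << (t.bit_length() // p + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** p <= t:
            lo = mid
        else:
            hi = mid - 1
    return lo

def perfect_power_p(t):
    # t = r^p is possible only for prime p <= log2(t)
    for p in range(2, t.bit_length() + 1):
        if any(p % q == 0 for q in range(2, int(p ** 0.5) + 1)):
            continue              # composite p: covered by its prime factors
        r = iroot(t, p)
        if r ** p == t:
            return True
    return False
```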
! 729:
! 730:
! 731: File: gmp.info, Node: Radix Conversion Algorithms, Next: Other Algorithms, Prev: Root Extraction Algorithms, Up: Algorithms
! 732:
! 733: Radix Conversion
! 734: ================
! 735:
! 736: Radix conversions are less important than other algorithms. A
! 737: program dominated by conversions should probably use a different data
! 738: representation.
! 739:
! 740: * Menu:
! 741:
! 742: * Binary to Radix::
! 743: * Radix to Binary::
! 744:
! 745:
! 746: File: gmp.info, Node: Binary to Radix, Next: Radix to Binary, Prev: Radix Conversion Algorithms, Up: Radix Conversion Algorithms
! 747:
! 748: Binary to Radix
! 749: ---------------
! 750:
! 751: Conversions from binary to a power-of-2 radix use a simple and fast
! 752: O(N) bit extraction algorithm.
! 753:
! 754: Conversions from binary to other radices use one of two algorithms.
! 755: Sizes below `GET_STR_PRECOMPUTE_THRESHOLD' use a basic O(N^2) method.
! 756: Repeated divisions by b^n are made, where b is the radix and n is the
! 757: largest exponent for which b^n fits in a limb.  But instead of simply using the
! 758: remainder r from such divisions, an extra divide step is done to give a
! 759: fractional limb representing r/b^n. The digits of r can then be
! 760: extracted using multiplications by b rather than divisions. Special
! 761: case code is provided for decimal, allowing multiplications by 10 to
! 762: optimize to shifts and adds.
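The r/b^n trick can be illustrated as follows, with `limb_bits` standing in
for the machine word size.  Rounding the fraction up by one ensures no digit
ever comes out too low, and the accumulated error stays below one digit
provided base^n fits in a limb:

```python
def digits_by_mul(r, n, base=10, limb_bits=64):
    # represent r/base^n as a fixed-point fraction scaled by 2^limb_bits
    assert 0 <= r < base ** n <= 1 << limb_bits
    f = ((r << limb_bits) // base ** n) + 1   # round up: never a digit low
    mask = (1 << limb_bits) - 1
    out = []
    for _ in range(n):
        f *= base                   # one multiply per digit, no division
        out.append(f >> limb_bits)  # the integer part is the next digit
        f &= mask
    return out
```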
! 763:
! 764: Above `GET_STR_PRECOMPUTE_THRESHOLD' a sub-quadratic algorithm is
! 765: used. For an input t, powers b^(n*2^i) of the radix are calculated,
! 766: until a power between t and sqrt(t) is reached. t is then divided by
! 767: that largest power, giving a quotient which is the digits above that
! 768: power, and a remainder which is those below. These two parts are in
! 769: turn divided by the second highest power, and so on recursively. When
! 770: a piece has been divided down to less than `GET_STR_DC_THRESHOLD'
! 771: limbs, the basecase algorithm described above is used.
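The recursion can be sketched as follows, using one digit per piece in place
of the n-digits-per-limb basecase, and Python strings in place of digit
arrays:

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"

def get_str(t, base=10):
    # precompute base, base^2, base^4, ... until the last power lies
    # between sqrt(t) and t
    powers = [base]
    while powers[-1] ** 2 <= t:
        powers.append(powers[-1] ** 2)

    def recurse(t, i):
        if i < 0:
            return DIGITS[t]                    # basecase: a single digit
        q, r = divmod(t, powers[i])             # split above/below the power
        # the low half must be zero padded to exactly 2^i digits
        return recurse(q, i - 1) + recurse(r, i - 1).rjust(2 ** i, "0")

    return recurse(t, len(powers) - 1).lstrip("0") or "0"
```

The big `divmod` calls are where the sub-quadratic division does its work in
the real implementation.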
! 772:
! 773: The advantage of this algorithm is that big divisions can make use
! 774: of the sub-quadratic divide and conquer division (*note Divide and
! 775: Conquer Division::), and big divisions tend to have less overheads than
! 776: lots of separate single limb divisions anyway. But in any case the
! 777: cost of calculating the powers b^(n*2^i) must first be overcome.
! 778:
! 779: `GET_STR_PRECOMPUTE_THRESHOLD' and `GET_STR_DC_THRESHOLD' represent
! 780: the same basic thing, the point where it becomes worth doing a big
! 781: division to cut the input in half. `GET_STR_PRECOMPUTE_THRESHOLD'
! 782: includes the cost of calculating the radix power required, whereas
! 783: `GET_STR_DC_THRESHOLD' assumes that's already available, which is the
! 784: case when recursing.
! 785:
! 786: Since the base case produces digits from least to most significant
! 787: but they want to be stored from most to least, it's necessary to
! 788: calculate in advance how many digits there will be, or at least be sure
! 789: not to underestimate that. For GMP the number of input bits is
! 790: multiplied by `chars_per_bit_exactly' from `mp_bases', rounding up.
! 791: The result is either correct or one too big.
! 792:
! 793: Examining some of the high bits of the input could increase the
! 794: chance of getting the exact number of digits, but an exact result every
! 795: time would not be practical, since in general the difference between
! 796: numbers 100... and 99... is only in the last few bits and the work to
! 797: identify 99... might well be almost as much as a full conversion.
! 798:
! 799: `mpf_get_str' doesn't currently use the algorithm described here; it
! 800: multiplies or divides by a power of b to move the radix point to
! 801: just above the highest non-zero digit (or at worst one above that
! 802: location), then multiplies by b^n to bring out digits. This is O(N^2)
! 803: and is certainly not optimal.
! 804:
! 805: The r/b^n scheme described above for using multiplications to bring
! 806: out digits might be useful for more than a single limb. Some brief
! 807: experiments with it on the base case when recursing didn't give a
! 808: noticeable improvement, but perhaps that was only due to the
! 809: implementation. Something similar would work for the sub-quadratic
! 810: divisions too, though there would be the cost of calculating a bigger
! 811: radix power.
! 812:
! 813: Another possible improvement for the sub-quadratic part would be to
! 814: arrange for radix powers that balanced the sizes of quotient and
! 815: remainder produced, i.e. the highest power would be a b^(n*k)
! 816: approximately equal to sqrt(t), not restricted to a 2^i factor. That
! 817: ought to smooth out a graph of times against sizes, but may or may not
! 818: be a net speedup.
! 819:
! 820:
! 821: File: gmp.info, Node: Radix to Binary, Prev: Binary to Radix, Up: Radix Conversion Algorithms
! 822:
! 823: Radix to Binary
! 824: ---------------
! 825:
! 826: Conversions from a power-of-2 radix into binary use a simple and fast
! 827: O(N) bitwise concatenation algorithm.
! 828:
! 829: Conversions from other radices use one of two algorithms. Sizes
! 830: below `SET_STR_THRESHOLD' use a basic O(N^2) method. Groups of n
! 831: digits are converted to limbs, where n is the largest exponent for
! 832: which b^n fits in a limb, then those groups are accumulated into the
! 833: result by multiplying by b^n and adding. This saves multi-precision
! 834: operations, as per Knuth section 4.4 part E (*note References::). Some
! 835: special case code is provided for decimal, giving the compiler a chance
! 836: to optimize multiplications by 10.
! 837:
! 838: Above `SET_STR_THRESHOLD' a sub-quadratic algorithm is used. First
! 839: groups of n digits are converted into limbs. Then adjacent limbs are
! 840: combined into limb pairs with x*b^n+y, where x and y are the limbs.
! 841: Adjacent limb pairs are combined into quads similarly with x*b^(2n)+y.
! 842: This continues until a single block remains, that being the result.
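A sketch of this combining scheme, using one block per digit for simplicity
where GMP first packs n digits per limb:

```python
def set_str(s, base=10):
    # one block per digit, then combine adjacent blocks pairwise as
    # x*power + y until a single block remains
    blocks = [int(c, base) for c in s]
    power = base                      # value of one block's worth of digits
    while len(blocks) > 1:
        if len(blocks) % 2:
            blocks.insert(0, 0)       # pad the high end so pairs align
        blocks = [x * power + y for x, y in zip(blocks[0::2], blocks[1::2])]
        power *= power                # each pass doubles the block size
    return blocks[0]
```

Each pass halves the number of blocks, and the multiplications by `power`
are the big balanced products that Karatsuba and higher algorithms speed up.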
! 843:
! 844: The advantage of this method is that the multiplications for each x
! 845: are big blocks, allowing Karatsuba and higher algorithms to be used.
! 846: But the cost of calculating the powers b^(n*2^i) must be overcome.
! 847: `SET_STR_THRESHOLD' usually ends up quite big, around 5000 digits, and
! 848: on some processors much bigger still.
! 849:
! 850: `SET_STR_THRESHOLD' is based on the input digits (and tuned for
! 851: decimal), though it might be better based on a limb count, so as to be
! 852: independent of the base. But that sort of count isn't used by the base
! 853: case and so would need some sort of initial calculation or estimate.
! 854:
! 855: The main reason `SET_STR_THRESHOLD' is so much bigger than the
! 856: corresponding `GET_STR_PRECOMPUTE_THRESHOLD' is that `mpn_mul_1' is
! 857: much faster than `mpn_divrem_1' (often by a factor of 10, or more).
! 858:
! 859:
! 860: File: gmp.info, Node: Other Algorithms, Next: Assembler Coding, Prev: Radix Conversion Algorithms, Up: Algorithms
! 861:
! 862: Other Algorithms
! 863: ================
! 864:
! 865: * Menu:
! 866:
! 867: * Factorial Algorithm::
! 868: * Binomial Coefficients Algorithm::
! 869: * Fibonacci Numbers Algorithm::
! 870: * Lucas Numbers Algorithm::
! 871:
! 872:
! 873: File: gmp.info, Node: Factorial Algorithm, Next: Binomial Coefficients Algorithm, Prev: Other Algorithms, Up: Other Algorithms
! 874:
! 875: Factorial
! 876: ---------
! 877:
! 878: Factorials n! are calculated by a simple product from 1 to n, but
! 879: arranged into certain sub-products.
! 880:
! 881: First as many factors as fit in a limb are accumulated, then two of
! 882: those multiplied to give a 2-limb product. When two 2-limb products
! 883: are ready they're multiplied to a 4-limb product, and when two 4-limbs
! 884: are ready they're multiplied to an 8-limb product, etc. A stack of
! 885: outstanding products is built up, with two of the same size multiplied
! 886: together when ready.
! 887:
! 888: Arranging for multiplications to have operands the same (or nearly
! 889: the same) size means the Karatsuba and higher multiplication algorithms
! 890: can be used. And even on sizes below the Karatsuba threshold an NxN
! 891: multiply will give a basecase multiply more to work on.
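The product stack can be sketched as follows, counting factors rather than
limbs for the size of each sub-product (a simplified model of the
arrangement, not GMP's code):

```python
def factorial(n):
    # multiply 1..n, pairing up sub-products of equal size from a stack
    stack = []                        # list of (product, number of factors)
    for i in range(1, n + 1):
        p, size = i, 1
        while stack and stack[-1][1] == size:
            q, _ = stack.pop()        # combine two equal-sized products
            p, size = p * q, size * 2
        stack.append((p, size))
    result = 1
    for p, _ in stack:                # multiply the leftover products
        result *= p
    return result
```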
! 892:
! 893: An obvious improvement not currently implemented would be to strip
! 894: factors of 2 from the products and apply them at the end with a bit
! 895: shift. Another possibility would be to determine the prime
! 896: factorization of the result (which can be done easily), and use a
! 897: powering method, at each stage squaring then multiplying in those
! 898: primes with a 1 in their exponent at that point. The advantage would
! 899: be some multiplies turned into squares.
! 900:
! 901:
! 902: File: gmp.info, Node: Binomial Coefficients Algorithm, Next: Fibonacci Numbers Algorithm, Prev: Factorial Algorithm, Up: Other Algorithms
! 903:
! 904: Binomial Coefficients
! 905: ---------------------
! 906:
! 907: Binomial coefficients C(n,k) are calculated by first arranging k <=
! 908: n/2 using C(n,k) = C(n,n-k) if necessary, and then evaluating the
! 909: following product simply from i=2 to i=k.
! 910:
! 911: k (n-k+i)
! 912: C(n,k) = (n-k+1) * prod -------
! 913: i=2 i
! 914:
! 915: It's easy to show that each denominator i will divide the product so
! 916: far, so the exact division algorithm is used (*note Exact Division::).
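Putting the formula and the exact divisions together (a sketch; Python's
`//` stands in for the exact division algorithm):

```python
def binomial(n, k):
    # evaluate C(n,k) = prod_{i=1..k} (n-k+i)/i; each i divides the
    # accumulated product exactly, so // loses nothing
    if k < 0 or k > n:
        return 0
    k = min(k, n - k)                 # arrange k <= n/2 via C(n,k) = C(n,n-k)
    c = 1
    for i in range(1, k + 1):
        c = c * (n - k + i) // i
    return c
```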
! 917:
! 918: The numerators n-k+i and denominators i are first accumulated into
! 919: as many as fit in a limb, to save multi-precision operations, though for
! 920: `mpz_bin_ui' this applies only to the divisors, since n is an `mpz_t'
! 921: and n-k+i in general won't fit in a limb at all.
! 922:
! 923: An obvious improvement would be to strip factors of 2 from each
! 924: multiplier and divisor and count them separately, to be applied with a
! 925: bit shift at the end. Factors of 3 and perhaps 5 could even be handled
! 926: similarly. Another possibility, if n is not too big, would be to
! 927: determine the prime factorization of the result based on the factorials
! 928: involved, and power up those primes appropriately. This would help
! 929: most when k is near n/2.
! 930:
! 931:
! 932: File: gmp.info, Node: Fibonacci Numbers Algorithm, Next: Lucas Numbers Algorithm, Prev: Binomial Coefficients Algorithm, Up: Other Algorithms
! 933:
! 934: Fibonacci Numbers
! 935: -----------------
! 936:
! 937: The Fibonacci functions `mpz_fib_ui' and `mpz_fib2_ui' are designed
! 938: for calculating isolated F[n] or F[n],F[n-1] values efficiently.
! 939:
! 940: For small n, a table of single limb values in `__gmp_fib_table' is
! 941: used. On a 32-bit limb this goes up to F[47], or on a 64-bit limb up
! 942: to F[93]. For convenience the table starts at F[-1].
! 943:
! 944: Beyond the table, values are generated with a binary powering
! 945: algorithm, calculating a pair F[n] and F[n-1] working from high to low
! 946: across the bits of n. The formulas used are
! 947:
! 948: F[2k+1] = 4*F[k]^2 - F[k-1]^2 + 2*(-1)^k
! 949: F[2k-1] = F[k]^2 + F[k-1]^2
! 950:
! 951: F[2k] = F[2k+1] - F[2k-1]
! 952:
! 953: At each step, k is the high b bits of n. If the next bit of n is 0
! 954: then F[2k],F[2k-1] is used, or if it's a 1 then F[2k+1],F[2k] is used,
! 955: and the process repeated until all bits of n are incorporated. Notice
! 956: these formulas require just two squares per bit of n.
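A transcription of this powering, tracking k explicitly so the 2*(-1)^k
term gets the right sign (a sketch without the small-n table; n >= 1 is
assumed):

```python
def fib_pair(n):
    # F[n], F[n-1] by the doubling formulas, scanning n's bits high to low
    assert n >= 1
    f, f1, k = 1, 0, 1                     # F[k], F[k-1] for k = top bit
    for i in range(n.bit_length() - 2, -1, -1):
        sign = 2 if k % 2 == 0 else -2     # 2*(-1)^k
        f2k1 = 4 * f * f - f1 * f1 + sign  # F[2k+1], from two squares...
        f2km1 = f * f + f1 * f1            # ...shared with F[2k-1]
        f2k = f2k1 - f2km1                 # F[2k] by subtraction
        if (n >> i) & 1:
            f, f1, k = f2k1, f2k, 2 * k + 1
        else:
            f, f1, k = f2k, f2km1, 2 * k
    return f, f1
```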
! 957:
! 958: It'd be possible to handle the first few n above the single limb
! 959: table with simple additions, using the defining Fibonacci recurrence
! 960: F[k+1]=F[k]+F[k-1], but this is not done since it usually turns out to
! 961: be faster for only about 10 or 20 values of n, and including a block of
! 962: code for just those doesn't seem worthwhile. If they really mattered
! 963: it'd be better to extend the data table.
! 964:
! 965: Using a table avoids lots of calculations on small numbers, and
! 966: makes small n go fast.  A bigger table would make more small n go fast;
! 967: it's just a question of balancing size against desired speed. For GMP
! 968: the code is kept compact, with the emphasis primarily on a good
! 969: powering algorithm.
! 970:
! 971: `mpz_fib2_ui' returns both F[n] and F[n-1], but `mpz_fib_ui' is only
! 972: interested in F[n]. In this case the last step of the algorithm can
! 973: become one multiply instead of two squares. One of the following two
! 974: formulas is used, according as n is odd or even.
! 975:
! 976: F[2k] = F[k]*(F[k]+2F[k-1])
! 977:
! 978: F[2k+1] = (2F[k]+F[k-1])*(2F[k]-F[k-1]) + 2*(-1)^k
! 979:
! 980: F[2k+1] here is the same as above, just rearranged to be a multiply.
! 981: For interest, the 2*(-1)^k term both here and above can be applied
! 982: just to the low limb of the calculation, without a carry or borrow into
! 983: further limbs, which saves some code size. See comments with
! 984: `mpz_fib_ui' and the internal `mpn_fib2_ui' for how this is done.
! 985:
! 986:
! 987: File: gmp.info, Node: Lucas Numbers Algorithm, Prev: Fibonacci Numbers Algorithm, Up: Other Algorithms
! 988:
! 989: Lucas Numbers
! 990: -------------
! 991:
! 992: `mpz_lucnum2_ui' derives a pair of Lucas numbers from a pair of
! 993: Fibonacci numbers with the following simple formulas.
! 994:
! 995: L[k] = F[k] + 2*F[k-1]
! 996: L[k-1] = 2*F[k] - F[k-1]
! 997:
! 998: `mpz_lucnum_ui' is only interested in L[n], and some work can be
! 999: saved. Trailing zero bits on n can be handled with a single square
! 1000: each.
! 1001:
! 1002: L[2k] = L[k]^2 - 2*(-1)^k
! 1003:
! 1004: And the lowest 1 bit can be handled with one multiply of a pair of
! 1005: Fibonacci numbers, similar to what `mpz_fib_ui' does.
! 1006:
! 1007: L[2k+1] = 5*F[k-1]*(2*F[k]+F[k-1]) - 4*(-1)^k
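These formulas can be checked with a short sketch, generating the Fibonacci
pair by plain iteration for clarity rather than by the powering algorithm:

```python
def lucas_pair(k):
    # L[k], L[k-1] from the Fibonacci pair F[k], F[k-1], for k >= 1
    f1, f = 0, 1                   # F[0], F[1]
    for _ in range(k - 1):
        f1, f = f, f + f1
    # L[k] = F[k] + 2*F[k-1]  and  L[k-1] = 2*F[k] - F[k-1]
    return f + 2 * f1, 2 * f - f1
```

The squaring identity for trailing zero bits, L[2k] = L[k]^2 - 2*(-1)^k, can
be verified against the same sequence.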
! 1008:
! 1009:
! 1010: File: gmp.info, Node: Assembler Coding, Prev: Other Algorithms, Up: Algorithms
! 1011:
! 1012: Assembler Coding
! 1013: ================
! 1014:
! 1015: The assembler subroutines in GMP are the most significant source of
! 1016: speed at small to moderate sizes. At larger sizes algorithm selection
! 1017: becomes more important, but of course speedups in low level routines
! 1018: will still speed up everything proportionally.
! 1019:
! 1020: Carry handling and widening multiplies that are important for GMP
! 1021: can't be easily expressed in C. GCC `asm' blocks help a lot and are
! 1022: provided in `longlong.h', but hand coding low level routines invariably
! 1023: offers a speedup over generic C by a factor of anything from 2 to 10.
! 1024:
! 1025: * Menu:
! 1026:
! 1027: * Assembler Code Organisation::
! 1028: * Assembler Basics::
! 1029: * Assembler Carry Propagation::
! 1030: * Assembler Cache Handling::
! 1031: * Assembler Floating Point::
! 1032: * Assembler SIMD Instructions::
! 1033: * Assembler Software Pipelining::
! 1034: * Assembler Loop Unrolling::
! 1035:
! 1036:
! 1037: File: gmp.info, Node: Assembler Code Organisation, Next: Assembler Basics, Prev: Assembler Coding, Up: Assembler Coding
! 1038:
! 1039: Code Organisation
! 1040: -----------------
! 1041:
! 1042: The various `mpn' subdirectories contain machine-dependent code,
! 1043: written in C or assembler. The `mpn/generic' subdirectory contains
! 1044: default code, used when there's no machine-specific version of a
! 1045: particular file.
! 1046:
! 1047: Each `mpn' subdirectory is for an ISA family. Generally 32-bit and
! 1048: 64-bit variants in a family cannot share code and will have separate
! 1049: directories. Within a family further subdirectories may exist for CPU
! 1050: variants.
! 1051:
! 1052:
! 1053: File: gmp.info, Node: Assembler Basics, Next: Assembler Carry Propagation, Prev: Assembler Code Organisation, Up: Assembler Coding
! 1054:
! 1055: Assembler Basics
! 1056: ----------------
! 1057:
! 1058: `mpn_addmul_1' and `mpn_submul_1' are the most important routines
! 1059: for overall GMP performance. All multiplications and divisions come
! 1060: down to repeated calls to these. `mpn_add_n', `mpn_sub_n',
! 1061: `mpn_lshift' and `mpn_rshift' are next most important.
! 1062:
! 1063: On some CPUs assembler versions of the internal functions
! 1064: `mpn_mul_basecase' and `mpn_sqr_basecase' give significant speedups,
! 1065: mainly through avoiding function call overheads. They can also
! 1066: potentially make better use of a wide superscalar processor.
! 1067:
! 1068: The restrictions on overlaps between sources and destinations (*note
! 1069: Low-level Functions::) are designed to facilitate a variety of
! 1070: implementations. For example, knowing `mpn_add_n' won't have partly
! 1071: overlapping sources and destination means reading can be done far ahead
! 1072: of writing on superscalar processors, and loops can be vectorized on a
! 1073: vector processor, depending on the carry handling.
! 1074:
! 1075:
! 1076: File: gmp.info, Node: Assembler Carry Propagation, Next: Assembler Cache Handling, Prev: Assembler Basics, Up: Assembler Coding
! 1077:
! 1078: Carry Propagation
! 1079: -----------------
! 1080:
! 1081: The problem that presents most challenges in GMP is propagating
! 1082: carries from one limb to the next. In functions like `mpn_addmul_1' and
! 1083: `mpn_add_n', carries are the only dependencies between limb operations.
! 1084:
! 1085: On processors with carry flags, a straightforward CISC style `adc' is
! 1086: generally best. AMD K6 `mpn_addmul_1' however is an example of an
! 1087: unusual set of circumstances where a branch works out better.
! 1088:
! 1089: On RISC processors generally an add and compare for overflow is
! 1090: used. This sort of thing can be seen in `mpn/generic/aors_n.c'. Some
! 1091: carry propagation schemes require 4 instructions, meaning at least 4
! 1092: cycles per limb, but other schemes may use just 1 or 2. On wide
! 1093: superscalar processors performance may be completely determined by the
! 1094: number of dependent instructions between carry-in and carry-out for
! 1095: each limb.
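The add-and-compare idiom can be modelled in Python, where the wrapped sum
being smaller than an addend signals a carry (a model of the technique, not
GMP's C or assembler):

```python
LIMB_BITS = 64
LIMB_MASK = (1 << LIMB_BITS) - 1

def add_n(xs, ys):
    # RISC-style carry handling: no carry flag, so detect overflow by
    # comparing the wrapped sum against an addend
    out, carry = [], 0
    for x, y in zip(xs, ys):
        s = (x + y) & LIMB_MASK
        c1 = s < x                    # x + y overflowed iff the sum wrapped
        s = (s + carry) & LIMB_MASK
        c2 = carry and s == 0         # adding the carry bit overflowed
        out.append(s)
        carry = 1 if c1 or c2 else 0
    return out, carry
```

The two comparisons here are the extra dependent instructions between
carry-in and carry-out that an `adc` instruction avoids.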
! 1096:
! 1097: On vector processors good use can be made of the fact that a carry
! 1098: bit only very rarely propagates more than one limb. When adding a
! 1099: single bit to a limb, there's only a carry out if that limb was
! 1100: `0xFF...FF' which on random data will be only 1 in 2^mp_bits_per_limb.
! 1101: `mpn/cray/add_n.c' is an example of this: it adds all limbs in
! 1102: parallel, adds one set of carry bits in parallel and then only rarely
! 1103: needs to fall through to a loop propagating further carries.
! 1104:
! 1105: On the x86s, GCC (as of version 2.95.2) doesn't generate
! 1106: particularly good code for the RISC style idioms that are necessary to
! 1107: handle carry bits in C. Often conditional jumps are generated where
! 1108: `adc' or `sbb' forms would be better. And so unfortunately almost any
! 1109: loop involving carry bits needs to be coded in assembler for best
! 1110: results.
! 1111: