This is gmp.info, produced by makeinfo version 4.2 from gmp.texi.

This manual describes how to install and use the GNU multiple precision
arithmetic library, version 4.1.2.

   Copyright 1991, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000,
2001, 2002 Free Software Foundation, Inc.

   Permission is granted to copy, distribute and/or modify this
document under the terms of the GNU Free Documentation License, Version
1.1 or any later version published by the Free Software Foundation;
with no Invariant Sections, with the Front-Cover Texts being "A GNU
Manual", and with the Back-Cover Texts being "You have freedom to copy
and modify this GNU Manual, like GNU software".  A copy of the license
is included in *Note GNU Free Documentation License::.
INFO-DIR-SECTION GNU libraries
START-INFO-DIR-ENTRY
* gmp: (gmp).                   GNU Multiple Precision Arithmetic Library.
END-INFO-DIR-ENTRY


File: gmp.info,  Node: FFT Multiplication,  Next: Other Multiplication,  Prev: Toom-Cook 3-Way Multiplication,  Up: Multiplication Algorithms

FFT Multiplication
------------------

   At large to very large sizes a Fermat style FFT multiplication is
used, following Schönhage and Strassen (*note References::).
Descriptions of FFTs in various forms can be found in many textbooks,
for instance Knuth section 4.3.3 part C or Lipson chapter IX.  A brief
description of the form used in GMP is given here.

   The multiplication done is x*y mod 2^N+1, for a given N.  A full
product x*y is obtained by choosing N>=bits(x)+bits(y) and padding x
and y with high zero limbs.  The modular product is the native form for
the algorithm, so padding to get a full product is unavoidable.

   The algorithm follows a split, evaluate, pointwise multiply,
interpolate and combine sequence similar to that described above for
Karatsuba and Toom-3.  A k parameter controls the split, with an FFT-k
splitting into 2^k pieces of M=N/2^k bits each.  N must be a multiple
of (2^k)*mp_bits_per_limb so the split falls on limb boundaries,
avoiding bit shifts in the split and combine stages.

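   As a rough illustration of the size arithmetic just described (not
GMP's actual code, and the function name is hypothetical), the
following sketch pads a desired product size up to a multiple of
(2^k)*mp_bits_per_limb, giving pieces of M=N/2^k bits.

     /* Pad a desired product size in bits up to a multiple of
        2^k * bits_per_limb, so an FFT-k split into 2^k pieces of
        M = N/2^k bits falls on limb boundaries.  */
     unsigned long
     fft_pad_N (unsigned long product_bits, int k, unsigned long bits_per_limb)
     {
       unsigned long step = ((unsigned long) 1 << k) * bits_per_limb;
       return ((product_bits + step - 1) / step) * step;
     }

   For instance k=4 with 32-bit limbs gives a step of 512 bits, so a
product needing N>=1000 would use N=1024, split into sixteen pieces of
M=64 bits each.
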
   The evaluations, pointwise multiplications, and interpolation, are
all done modulo 2^N'+1 where N' is 2M+k+3 rounded up to a multiple of
2^k and of `mp_bits_per_limb'.  The results of interpolation will be
the following negacyclic convolution of the input pieces, and the
choice of N' ensures these sums aren't truncated.

                ---
                \         b
     w[n] =     /     (-1) * x[i] * y[j]
                ---
            i+j==b*2^k+n
               b=0,1

   The points used for the evaluation are g^i for i=0 to 2^k-1 where
g=2^(2N'/2^k).  g is a 2^k'th root of unity mod 2^N'+1, which produces
necessary cancellations at the interpolation stage, and it's also a
power of 2 so the fast fourier transforms used for the evaluation and
interpolation do only shifts, adds and negations.

   The pointwise multiplications are done modulo 2^N'+1 and either
recurse into a further FFT or use a plain multiplication (Toom-3,
Karatsuba or basecase), whichever is optimal at the size N'.  The
interpolation is an inverse fast fourier transform.  The resulting set
of sums of x[i]*y[j] are added at appropriate offsets to give the final
result.

   Squaring is the same, but x is the only input so it's one transform
at the evaluate stage and the pointwise multiplies are squares.  The
interpolation is the same.

   For a mod 2^N+1 product, an FFT-k is an O(N^(k/(k-1))) algorithm,
the exponent representing 2^k recursed modular multiplies each
1/2^(k-1) the size of the original.  Each successive k is an asymptotic
improvement, but overheads mean each is only faster at bigger and
bigger sizes.  In the code, `MUL_FFT_TABLE' and `SQR_FFT_TABLE' are the
thresholds where each k is used.  Each new k effectively swaps some
multiplying for some shifts, adds and overheads.

   A mod 2^N+1 product can be formed with a normal NxN->2N bit multiply
plus a subtraction, so an FFT and Toom-3 etc can be compared directly.
A k=4 FFT at O(N^1.333) can be expected to be the first faster than
Toom-3 at O(N^1.465).  In practice this is what's found, with
`MUL_FFT_MODF_THRESHOLD' and `SQR_FFT_MODF_THRESHOLD' being between 300
and 1000 limbs, depending on the CPU.  So far it's been found that only
very large FFTs recurse into pointwise multiplies above these sizes.

   When an FFT is to give a full product, the change of N to 2N doesn't
alter the theoretical complexity for a given k, but for the purposes of
considering where an FFT might be first used it can be assumed that the
FFT is recursing into a normal multiply and that on that basis it's
doing 2^k recursed multiplies each 1/2^(k-2) the size of the inputs,
making it O(N^(k/(k-2))).  This would mean k=7 at O(N^1.4) would be the
first FFT faster than Toom-3.  In practice `MUL_FFT_THRESHOLD' and
`SQR_FFT_THRESHOLD' have been found to be in the k=8 range, somewhere
between 3000 and 10000 limbs.

   The way N is split into 2^k pieces and then 2M+k+3 is rounded up to
a multiple of 2^k and `mp_bits_per_limb' means that when
2^k>=mp_bits_per_limb the effective N is a multiple of 2^(2k-1) bits.
The +k+3 means some values of N just under such a multiple will be
rounded to the next.  The complexity calculations above assume that a
favourable size is used, meaning one which isn't padded through
rounding, and it's also assumed that the extra +k+3 bits are negligible
at typical FFT sizes.

   The practical effect of the 2^(2k-1) constraint is to introduce a
step-effect into measured speeds.  For example k=8 will round N up to a
multiple of 32768 bits, so for a 32-bit limb there'll be 512 limb
groups of sizes for which `mpn_mul_n' runs at the same speed.  Or for
k=9 groups of 2048 limbs, k=10 groups of 8192 limbs, etc.  In practice
it's been found each k is used at quite small multiples of its size
constraint and so the step effect is quite noticeable in a time versus
size graph.

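   To see those step sizes concretely, the following hedged sketch (not
GMP code) gives the operand-size granularity in limbs for a given k,
using the fact that a full product of a given operand size works mod
2^N+1 with N about twice that size.

     /* Operand-size step, in limbs, produced by the 2^(2k-1) bit
        constraint on N.  E.g. k=8 with 32-bit limbs gives 512.  */
     unsigned long
     fft_step_limbs (int k, unsigned long bits_per_limb)
     {
       unsigned long n_step_bits = (unsigned long) 1 << (2*k - 1);
       return n_step_bits / 2 / bits_per_limb;
     }
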
   The threshold determinations currently measure at the mid-points of
size steps, but this is sub-optimal since at the start of a new step it
can happen that it's better to go back to the previous k for a while.
Something more sophisticated for `MUL_FFT_TABLE' and `SQR_FFT_TABLE'
will be needed.


File: gmp.info,  Node: Other Multiplication,  Prev: FFT Multiplication,  Up: Multiplication Algorithms

Other Multiplication
--------------------

   The 3-way Toom-Cook algorithm described above (*note Toom-Cook 3-Way
Multiplication::) generalizes to split into an arbitrary number of
pieces, as per Knuth section 4.3.3 algorithm C.  This is not currently
used, though it's possible a Toom-4 might fit in between Toom-3 and the
FFTs.  The notes here are merely for interest.

   In general a split into r+1 pieces is made, and evaluations and
pointwise multiplications done at 2*r+1 points.  A 4-way split does 7
pointwise multiplies, 5-way does 9, etc.  Asymptotically an (r+1)-way
algorithm is O(N^(log(2*r+1)/log(r+1))).  Only the pointwise
multiplications count towards big-O complexity, but the time spent in
the evaluate and interpolate stages grows with r and has a significant
practical impact, with the asymptotic advantage of each r realized only
at bigger and bigger sizes.  The overheads grow as O(N*r), whereas in
an r=2^k FFT they grow only as O(N*log(r)).

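   For reference, that exponent can be evaluated directly; the
following small sketch (not part of GMP) computes it.

     #include <math.h>

     /* Asymptotic exponent of an (r+1)-way Toom split,
        log(2*r+1)/log(r+1).  r=2 (Toom-3) gives 1.465, r=3 gives
        1.404, r=4 gives 1.365.  */
     double
     toom_exponent (int r)
     {
       return log (2.0*r + 1.0) / log (r + 1.0);
     }
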
   Knuth algorithm C evaluates at points 0,1,2,...,2*r, but exercise 4
uses -r,...,0,...,r and the latter saves some small multiplies in the
evaluate stage (or rather trades them for additions), and has a further
saving of nearly half the interpolate steps.  The idea is to separate
odd and even final coefficients and then perform algorithm C steps C7
and C8 on them separately.  The divisors at step C7 become j^2 and the
multipliers at C8 become 2*t*j-j^2.

   Splitting odd and even parts through positive and negative points
can be thought of as using -1 as a square root of unity.  If a 4th root
of unity was available then a further split and speedup would be
possible, but no such root exists for plain integers.  Going to complex
integers with i=sqrt(-1) doesn't help, essentially because in cartesian
form it takes three real multiplies to do a complex multiply.  The
existence of 2^k'th roots of unity in a suitable ring or field lets the
fast fourier transform keep splitting and get to O(N*log(r)).

   Floating point FFTs use complex numbers approximating Nth roots of
unity.  Some processors have special support for such FFTs.  But these
are not used in GMP since it's very difficult to guarantee an exact
result (to some number of bits).  An occasional difference of 1 in the
last bit might not matter to a typical signal processing algorithm, but
is of course of vital importance to GMP.


File: gmp.info,  Node: Division Algorithms,  Next: Greatest Common Divisor Algorithms,  Prev: Multiplication Algorithms,  Up: Algorithms

Division Algorithms
===================

* Menu:

* Single Limb Division::
* Basecase Division::
* Divide and Conquer Division::
* Exact Division::
* Exact Remainder::
* Small Quotient Division::


File: gmp.info,  Node: Single Limb Division,  Next: Basecase Division,  Prev: Division Algorithms,  Up: Division Algorithms

Single Limb Division
--------------------

   Nx1 division is implemented using repeated 2x1 divisions from high
to low, either with a hardware divide instruction or a multiplication
by inverse, whichever is best on a given CPU.

   The multiply by inverse follows section 8 of "Division by Invariant
Integers using Multiplication" by Granlund and Montgomery (*note
References::) and is implemented as `udiv_qrnnd_preinv' in
`gmp-impl.h'.  The idea is to have a fixed-point approximation to 1/d
(see `invert_limb') and then multiply by the high limb (plus one bit)
of the dividend to get a quotient q.  With d normalized (high bit set),
q is no more than 1 too small.  Subtracting q*d from the dividend gives
a remainder, and reveals whether q or q-1 is correct.

   The result is a division done with two multiplications and four or
five arithmetic operations.  On CPUs with low latency multipliers this
can be much faster than a hardware divide, though the cost of
calculating the inverse at the start may mean it's only better on
inputs bigger than say 4 or 5 limbs.

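   The following is a hedged sketch of the idea with 32-bit limbs and
64-bit intermediates, not GMP's `udiv_qrnnd_preinv' (which differs in
detail and works with single-limb operations only); the function names
are hypothetical.  The approximation here comes from the high limb
alone, without the extra bit, so the fix-up loop can run a couple of
times.

     #include <stdint.h>

     /* Inverse of a normalized d (high bit set): floor((B^2-1)/d) - B,
        with B = 2^32.  */
     uint32_t
     invert_limb32 (uint32_t d)
     {
       return (uint32_t) (0xFFFFFFFFFFFFFFFFull / d - 0x100000000ull);
     }

     /* Divide nh:nl by d, with d normalized and nh < d.  */
     uint32_t
     div_2by1_preinv (uint32_t nh, uint32_t nl, uint32_t d, uint32_t dinv,
                      uint32_t *rem)
     {
       uint64_t n = ((uint64_t) nh << 32) | nl;
       /* Approximate quotient from the high limb; never too big, at
          most a small constant too small.  */
       uint64_t q = (((uint64_t) nh * dinv) >> 32) + nh;
       uint64_t r = n - q * d;
       while (r >= d)
         {
           q++;
           r -= d;
         }
       *rem = (uint32_t) r;
       return (uint32_t) q;
     }
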
   When a divisor must be normalized, either for the generic C
`__udiv_qrnnd_c' or the multiply by inverse, the division performed is
actually a*2^k by d*2^k where a is the dividend and k is the power
necessary to have the high bit of d*2^k set.  The bit shifts for the
dividend are usually accomplished "on the fly" meaning by extracting
the appropriate bits at each step.  Done this way the quotient limbs
come out aligned ready to store.  When only the remainder is wanted, an
alternative is to take the dividend limbs unshifted and calculate r = a
mod d*2^k followed by an extra final step r*2^k mod d*2^k.  This can
help on CPUs with poor bit shifts or few registers.

   The multiply by inverse can be done two limbs at a time.  The
calculation is basically the same, but the inverse is two limbs and the
divisor treated as if padded with a low zero limb.  This means more
work, since the inverse will need a 2x2 multiply, but the four 1x1s to
do that are independent and can therefore be done partly or wholly in
parallel.  Likewise for a 2x1 calculating q*d.  The net effect is to
process two limbs with roughly the same two multiplies worth of latency
that one limb at a time gives.  This extends to 3 or 4 limbs at a time,
though the extra work to apply the inverse will almost certainly soon
reach the limits of multiplier throughput.

   A similar approach in reverse can be taken to process just half a
limb at a time if the divisor is only a half limb.  In this case the
1x1 multiply for the inverse effectively becomes two (1/2)x1 for each
limb, which can be a saving on CPUs with a fast half limb multiply, or
in fact if the only multiply is a half limb, and especially if it's not
pipelined.


File: gmp.info,  Node: Basecase Division,  Next: Divide and Conquer Division,  Prev: Single Limb Division,  Up: Division Algorithms

Basecase Division
-----------------

   Basecase NxM division is like long division done by hand, but in base
2^mp_bits_per_limb.  See Knuth section 4.3.1 algorithm D, and
`mpn/generic/sb_divrem_mn.c'.

   Briefly stated, while the dividend remains larger than the divisor,
a high quotient limb is formed and the Nx1 product q*d subtracted at
the top end of the dividend.  With a normalized divisor (most
significant bit set), each quotient limb can be formed with a 2x1
division and a 1x1 multiplication plus some subtractions.  The 2x1
division is by the high limb of the divisor and is done either with a
hardware divide or a multiply by inverse (the same as in *Note Single
Limb Division::) whichever is faster.  Such a quotient is sometimes one
too big, requiring an addback of the divisor, but that happens rarely.

   With Q=N-M being the number of quotient limbs, this is an O(Q*M)
algorithm and will run at a speed similar to a basecase QxM
multiplication, differing in fact only in the extra multiply and divide
for each of the Q quotient limbs.


File: gmp.info,  Node: Divide and Conquer Division,  Next: Exact Division,  Prev: Basecase Division,  Up: Division Algorithms

Divide and Conquer Division
---------------------------

   For divisors larger than `DIV_DC_THRESHOLD', division is done by
dividing.  Or to be precise by a recursive divide and conquer algorithm
based on work by Moenck and Borodin, Jebelean, and Burnikel and Ziegler
(*note References::).

   The algorithm consists essentially of recognising that a 2NxN
division can be done with the basecase division algorithm (*note
Basecase Division::), but using N/2 limbs as a base, not just a single
limb.  This way the multiplications that arise are (N/2)x(N/2) and can
take advantage of Karatsuba and higher multiplication algorithms (*note
Multiplication Algorithms::).  The "digits" of the quotient are formed
by recursive Nx(N/2) divisions.

   If the (N/2)x(N/2) multiplies are done with a basecase multiplication
then the work is about the same as a basecase division, but with more
function call overheads and with some subtractions separated from the
multiplies.  These overheads mean that it's only when N/2 is above
`MUL_KARATSUBA_THRESHOLD' that divide and conquer is of use.

   `DIV_DC_THRESHOLD' is based on the divisor size N, so it will be
somewhere above twice `MUL_KARATSUBA_THRESHOLD', but how much above
depends on the CPU.  An optimized `mpn_mul_basecase' can lower
`DIV_DC_THRESHOLD' a little by offering a ready-made advantage over
repeated `mpn_submul_1' calls.

   Divide and conquer is asymptotically O(M(N)*log(N)) where M(N) is
the time for an NxN multiplication done with FFTs.  The actual time is
a sum over multiplications of the recursed sizes, as can be seen near
the end of section 2.2 of Burnikel and Ziegler.  For example, within
the Toom-3 range, divide and conquer is 2.63*M(N).  With higher
algorithms the M(N) term improves and the multiplier tends to log(N).
In practice, at moderate to large sizes, a 2NxN division is about 2 to
4 times slower than an NxN multiplication.

   Newton's method used for division is asymptotically O(M(N)) and
should therefore be superior to divide and conquer, but it's believed
this would only be for large to very large N.


File: gmp.info,  Node: Exact Division,  Next: Exact Remainder,  Prev: Divide and Conquer Division,  Up: Division Algorithms

Exact Division
--------------

   A so-called exact division is when the dividend is known to be an
exact multiple of the divisor.  Jebelean's exact division algorithm
uses this knowledge to make some significant optimizations (*note
References::).

   The idea can be illustrated in decimal for example with 368154
divided by 543.  Because the low digit of the dividend is 4, the low
digit of the quotient must be 8.  This is arrived at from 4*7 mod 10,
using the fact 7 is the modular inverse of 3 (the low digit of the
divisor), since 3*7 == 1 mod 10.  So 8*543=4344 can be subtracted from
the dividend leaving 363810.  Notice the low digit has become zero.

   The procedure is repeated at the second digit, with the next
quotient digit 7 (7 == 1*7 mod 10), subtracting 7*543=3801, leaving
325800.  And finally at the third digit with quotient digit 6 (8*7 mod
10), subtracting 6*543=3258 leaving 0.  So the quotient is 678.

   Notice however that the multiplies and subtractions don't need to
extend past the low three digits of the dividend, since that's enough
to determine the three quotient digits.  For the last quotient digit no
subtraction is needed at all.  On a 2NxN division like this one, only
about half the work of a normal basecase division is necessary.

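   The following hedged sketch shows this low-to-high process in the
simplest case, a single odd 32-bit divisor, assuming the division is
exact; the names are hypothetical and GMP's own routines differ in
detail.  `dinv' is the inverse of d modulo 2^32 (see the sketch of
computing such an inverse below).

     #include <stdint.h>

     /* Exact division of the n-limb number a by the odd limb d, where
        d*dinv == 1 mod 2^32.  Each quotient limb comes from one
        multiply by dinv; only borrows propagate, never remainders.  */
     void
     divexact_by_odd_limb (uint32_t *q, const uint32_t *a, int n,
                           uint32_t d, uint32_t dinv)
     {
       uint32_t c = 0;                            /* running borrow */
       int i;
       for (i = 0; i < n; i++)
         {
           uint32_t l = a[i] - c;                 /* apply the borrow */
           uint32_t c1 = a[i] < c;                /* borrow out of that */
           uint32_t qi = l * dinv;                /* quotient limb mod 2^32 */
           q[i] = qi;
           /* high half of qi*d is what gets borrowed from the next limb */
           c = (uint32_t) (((uint64_t) qi * d) >> 32) + c1;
         }
     }
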
   For an NxM exact division producing Q=N-M quotient limbs, the saving
over a normal basecase division is in two parts.  Firstly, each of the
Q quotient limbs needs only one multiply, not a 2x1 divide and
multiply.  Secondly, the crossproducts are reduced when Q>M to
Q*M-M*(M+1)/2, or when Q<=M to Q*(Q-1)/2.  Notice the savings are
complementary.  If Q is big then many divisions are saved, or if Q is
small then the crossproducts reduce to a small number.

   The modular inverse used is calculated efficiently by
`modlimb_invert' in `gmp-impl.h'.  This does four multiplies for a
32-bit limb, or six for a 64-bit limb.  `tune/modlinv.c' has some
alternate implementations that might suit processors better at bit
twiddling than multiplying.

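   As a sketch of the idea (not GMP's `modlimb_invert', which starts
from a small table and so needs fewer multiplies), an inverse modulo
2^32 can be built by Newton/Hensel lifting, each step doubling the
number of correct low bits.

     /* Inverse of an odd d modulo 2^32, so that d*x == 1 mod 2^32.  */
     uint32_t
     modlimb_invert32 (uint32_t d)
     {
       uint32_t x = (3 * d) ^ 2;      /* correct to 4 bits */
       x *= 2 - d * x;                /* 8 bits */
       x *= 2 - d * x;                /* 16 bits */
       x *= 2 - d * x;                /* 32 bits */
       return x;
     }
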
   The sub-quadratic exact division described by Jebelean in "Exact
Division with Karatsuba Complexity" is not currently implemented.  It
uses a rearrangement similar to the divide and conquer for normal
division (*note Divide and Conquer Division::), but operating from low
to high.  A further possibility not currently implemented is
"Bidirectional Exact Integer Division" by Krandick and Jebelean which
forms quotient limbs from both the high and low ends of the dividend,
and can halve once more the number of crossproducts needed in a 2NxN
division.

   A special case exact division by 3 exists in `mpn_divexact_by3',
supporting Toom-3 multiplication and `mpq' canonicalizations.  It forms
quotient digits with a multiply by the modular inverse of 3 (which is
`0xAA..AAB') and uses two comparisons to determine a borrow for the next
limb.  The multiplications don't need to be on the dependent chain, as
long as the effect of the borrows is applied.  Only a few optimized
assembler implementations currently exist.


File: gmp.info,  Node: Exact Remainder,  Next: Small Quotient Division,  Prev: Exact Division,  Up: Division Algorithms

Exact Remainder
---------------

   If the exact division algorithm is done with a full subtraction at
each stage and the dividend isn't a multiple of the divisor, then low
zero limbs are produced but with a remainder in the high limbs.  For
dividend a, divisor d, quotient q, and b = 2^mp_bits_per_limb, this
remainder r is of the form

     a = q*d + r*b^n

   n represents the number of zero limbs produced by the subtractions,
that being the number of limbs produced for q.  r will be in the range
0<=r<d and can be viewed as a remainder, but one shifted up by a factor
of b^n.

   Carrying out full subtractions at each stage means the same number
of cross products must be done as a normal division, but there are
still some single limb divisions saved.  When d is a single limb some
simplifications arise, providing good speedups on a number of
processors.

   `mpn_bdivmod', `mpn_divexact_by3', `mpn_modexact_1_odd' and the
`redc' function in `mpz_powm' differ subtly in how they return r,
leading to some negations in the above formula, but all are essentially
the same.

   Clearly r is zero when a is a multiple of d, and this leads to
divisibility or congruence tests which are potentially more efficient
than a normal division.

   The factor of b^n on r can be ignored in a GCD when d is odd, hence
the use of `mpn_bdivmod' in `mpn_gcd', and the use of
`mpn_modexact_1_odd' by `mpn_gcd_1' and `mpz_kronecker_ui' etc (*note
Greatest Common Divisor Algorithms::).

   Montgomery's REDC method for modular multiplications uses operands
of the form of x*b^-n and y*b^-n and on calculating (x*b^-n)*(y*b^-n)
uses the factor of b^n in the exact remainder to reach a product in the
same form (x*y)*b^-n (*note Modular Powering Algorithm::).

   Notice that r generally gives no useful information about the
ordinary remainder a mod d since b^n mod d could be anything.  If
however b^n == 1 mod d, then r is the negative of the ordinary
remainder.  This occurs whenever d is a factor of b^n-1, as for example
with 3 in `mpn_divexact_by3'.  Other such factors include 5, 17 and
257, but no particular use has been found for this.


File: gmp.info,  Node: Small Quotient Division,  Prev: Exact Remainder,  Up: Division Algorithms

Small Quotient Division
-----------------------

   An NxM division where the number of quotient limbs Q=N-M is small
can be optimized somewhat.

   An ordinary basecase division normalizes the divisor by shifting it
to make the high bit set, shifting the dividend accordingly, and
shifting the remainder back down at the end of the calculation.  This
is wasteful if only a few quotient limbs are to be formed.  Instead a
division of just the top 2*Q limbs of the dividend by the top Q limbs
of the divisor can be used to form a trial quotient.  This requires
only those limbs normalized, not the whole of the divisor and dividend.

   A multiply and subtract then applies the trial quotient to the M-Q
unused limbs of the divisor and N-Q dividend limbs (which includes Q
limbs remaining from the trial quotient division).  The starting trial
quotient can be 1 or 2 too big, but all cases of 2 too big and most
cases of 1 too big are detected by first comparing the most significant
limbs that will arise from the subtraction.  An addback is done if the
quotient still turns out to be 1 too big.

   This whole procedure is essentially the same as one step of the
basecase algorithm done in a Q limb base, though with the trial
quotient test done only with the high limbs, not an entire Q limb
"digit" product.  The correctness of this weaker test can be
established by following the argument of Knuth section 4.3.1 exercise
20 but with the v2*q>b*r+u2 condition appropriately relaxed.


File: gmp.info,  Node: Greatest Common Divisor Algorithms,  Next: Powering Algorithms,  Prev: Division Algorithms,  Up: Algorithms

Greatest Common Divisor
=======================

* Menu:

* Binary GCD::
* Accelerated GCD::
* Extended GCD::
* Jacobi Symbol::


File: gmp.info,  Node: Binary GCD,  Next: Accelerated GCD,  Prev: Greatest Common Divisor Algorithms,  Up: Greatest Common Divisor Algorithms

Binary GCD
----------

   At small sizes GMP uses an O(N^2) binary style GCD.  This is
described in many textbooks, for example Knuth section 4.5.2 algorithm
B.  It simply consists of successively reducing operands a and b using
gcd(a,b) = gcd(min(a,b),abs(a-b)), together with the fact that if a and
b are first made odd then abs(a-b) is even and factors of two can be
discarded.

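   A hedged single-word sketch of those steps (GMP's `mpn_gcd' applies
the same reductions to multi-limb operands):

     #include <stdint.h>

     uint32_t
     binary_gcd (uint32_t a, uint32_t b)
     {
       int shift = 0;
       uint32_t t;
       if (a == 0) return b;
       if (b == 0) return a;
       while (((a | b) & 1) == 0)          /* common factors of 2 */
         { a >>= 1; b >>= 1; shift++; }
       while ((a & 1) == 0) a >>= 1;       /* make a odd */
       while (b != 0)
         {
           while ((b & 1) == 0) b >>= 1;   /* discard factors of 2 */
           if (a > b) { t = a; a = b; b = t; }
           b -= a;                         /* gcd(a,b) = gcd(a,abs(a-b)) */
         }
       return a << shift;                  /* restore common factors of 2 */
     }
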
   Variants like letting a-b become negative and doing a different next
step are of interest only as far as they suit particular CPUs, since on
small operands it's machine dependent factors that determine
performance.

   The Euclidean GCD algorithm, as per Knuth algorithms E and A,
reduces using a mod b but this has so far been found to be slower
everywhere.  One reason the binary method does well is that the implied
quotient at each step is usually small, so often only one or two
subtractions are needed to get the same effect as a division.
Quotients 1, 2 and 3 for example occur 67.7% of the time, see Knuth
section 4.5.3 Theorem E.

   When the implied quotient is large, meaning b is much smaller than
a, then a division is worthwhile.  This is the basis for the initial a
mod b reductions in `mpn_gcd' and `mpn_gcd_1' (the latter for both Nx1
and 1x1 cases).  But after that initial reduction, big quotients occur
too rarely to make it worth checking for them.


File: gmp.info,  Node: Accelerated GCD,  Next: Extended GCD,  Prev: Binary GCD,  Up: Greatest Common Divisor Algorithms

Accelerated GCD
---------------

   For sizes above `GCD_ACCEL_THRESHOLD', GMP uses the Accelerated GCD
algorithm described independently by Weber and Jebelean (the latter as
the "Generalized Binary" algorithm), *note References::.  This
algorithm is still O(N^2), but is much faster than the binary algorithm
since it does fewer multi-precision operations.  It consists of
alternating the k-ary reduction by Sorenson, and a "dmod" exact
remainder reduction.

   For operands u and v the k-ary reduction replaces u with n*v-d*u
where n and d are single limb values chosen to give two trailing zero
limbs on that value, which can be stripped.  n and d are calculated
using an algorithm similar to half of a two limb GCD (see `find_a' in
`mpn/generic/gcd.c').

   When u and v differ in size by more than a certain number of bits, a
dmod is performed to zero out bits at the low end of the larger.  It
consists of an exact remainder style division applied to an appropriate
number of bits (*note Exact Division::, and *note Exact Remainder::).
This is faster than a k-ary reduction but useful only when the operands
differ in size.  There's a dmod after each k-ary reduction, and if the
dmod leaves the operands still differing in size then it's repeated.

   The k-ary reduction step can introduce spurious factors into the GCD
calculated, and these are eliminated at the end by taking GCDs with the
original inputs gcd(u,gcd(v,g)) using the binary algorithm.  Since g is
almost always small this takes very little time.

   At small sizes the algorithm needs a good implementation of
`find_a'.  At larger sizes it's dominated by `mpn_addmul_1' applying n
and d.


File: gmp.info,  Node: Extended GCD,  Next: Jacobi Symbol,  Prev: Accelerated GCD,  Up: Greatest Common Divisor Algorithms

Extended GCD
------------

   The extended GCD calculates gcd(a,b) and also cofactors x and y
satisfying a*x+b*y=gcd(a,b).  Lehmer's multi-step improvement of the
extended Euclidean algorithm is used.  See Knuth section 4.5.2
algorithm L, and `mpn/generic/gcdext.c'.  This is an O(N^2) algorithm.

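   A hedged word-size sketch of the underlying extended Euclidean
scheme and the cofactor relation it maintains (GMP itself uses the
Lehmer multi-step method just described, working on limb vectors):

     #include <stdint.h>

     /* Returns gcd(a,b) for a,b >= 0 and sets *x, *y so that
        a*(*x) + b*(*y) == gcd(a,b).  */
     int64_t
     gcdext_word (int64_t a, int64_t b, int64_t *x, int64_t *y)
     {
       int64_t x0 = 1, y0 = 0, x1 = 0, y1 = 1, q, t;
       while (b != 0)
         {
           q = a / b;
           t = a - q*b;   a = b;   b = t;    /* Euclidean step */
           t = x0 - q*x1; x0 = x1; x1 = t;   /* carry the cofactors along */
           t = y0 - q*y1; y0 = y1; y1 = t;
         }
       *x = x0;
       *y = y0;
       return a;
     }
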
   The multipliers at each step are found using single limb
calculations for sizes up to `GCDEXT_THRESHOLD', or double limb
calculations above that.  The single limb code is faster but doesn't
produce full-limb multipliers, hence not making full use of the
`mpn_addmul_1' calls.

   When a CPU has a data-dependent multiplier, meaning one which is
faster on operands with fewer bits, the extra work in the double-limb
calculation might only save some looping overheads, leading to a large
`GCDEXT_THRESHOLD'.

   Currently the single limb calculation doesn't optimize for the small
quotients that often occur, and this can lead to unusually low values of
`GCDEXT_THRESHOLD', depending on the CPU.

   An analysis of double-limb calculations can be found in "A
Double-Digit Lehmer-Euclid Algorithm" by Jebelean (*note References::).
The code in GMP was developed independently.

   It should be noted that when a double limb calculation is used, it's
used for the whole of that GCD; it doesn't fall back to single limb
part way through.  This is because as the algorithm proceeds, the
inputs a and b are reduced, but the cofactors x and y grow, so the
multipliers at each step are applied to a roughly constant total number
of limbs.


File: gmp.info,  Node: Jacobi Symbol,  Prev: Extended GCD,  Up: Greatest Common Divisor Algorithms

Jacobi Symbol
-------------

   `mpz_jacobi' and `mpz_kronecker' are currently implemented with a
simple binary algorithm similar to that described for the GCDs (*note
Binary GCD::).  They're not very fast when both inputs are large.
Lehmer's multi-step improvement or a binary based multi-step algorithm
is likely to be better.

   When one operand fits a single limb, and that includes
`mpz_kronecker_ui' and friends, an initial reduction is done with
either `mpn_mod_1' or `mpn_modexact_1_odd', followed by the binary
algorithm on a single limb.  The binary algorithm is well suited to a
single limb, and the whole calculation in this case is quite efficient.

   In all the routines sign changes for the result are accumulated
using some bit twiddling, avoiding table lookups or conditional jumps.

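   As an illustration of the kind of bit twiddling involved, here is a
hedged word-size sketch of the Jacobi symbol (a/b) for odd b.  It
reduces with a remainder rather than the subtraction-based binary steps
GMP uses, but the sign bookkeeping is the same idea: (2/b) flips the
sign when b == 3 or 5 mod 8, and reciprocity flips it when both
operands are 3 mod 4.

     #include <stdint.h>

     int
     jacobi_word (uint64_t a, uint64_t b)       /* b odd, b >= 1 */
     {
       int s = 0;                               /* accumulated sign, bit 0 */
       uint64_t t;
       a %= b;
       while (a != 0)
         {
           while ((a & 1) == 0)
             {
               a >>= 1;
               s ^= (int) (((b >> 1) ^ (b >> 2)) & 1);  /* (2/b) factor */
             }
           if (a & b & 2)                       /* both 3 mod 4 */
             s ^= 1;                            /* reciprocity sign */
           t = b % a;  b = a;  a = t;           /* (a/b) -> (b mod a / a) */
         }
       return b == 1 ? (s & 1 ? -1 : 1) : 0;    /* 0 when gcd(a,b) != 1 */
     }
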

File: gmp.info,  Node: Powering Algorithms,  Next: Root Extraction Algorithms,  Prev: Greatest Common Divisor Algorithms,  Up: Algorithms

Powering Algorithms
===================

* Menu:

* Normal Powering Algorithm::
* Modular Powering Algorithm::


File: gmp.info,  Node: Normal Powering Algorithm,  Next: Modular Powering Algorithm,  Prev: Powering Algorithms,  Up: Powering Algorithms

Normal Powering
---------------

   Normal `mpz' or `mpf' powering uses a simple binary algorithm,
successively squaring and then multiplying by the base when a 1 bit is
seen in the exponent, as per Knuth section 4.6.3.  The "left to right"
variant described there is used rather than algorithm A, since it's
just as easy and can be done with somewhat less temporary memory.


File: gmp.info,  Node: Modular Powering Algorithm,  Prev: Normal Powering Algorithm,  Up: Powering Algorithms

Modular Powering
----------------

   Modular powering is implemented using a 2^k-ary sliding window
algorithm, as per "Handbook of Applied Cryptography" algorithm 14.85
(*note References::).  k is chosen according to the size of the
exponent.  Larger exponents use larger values of k, the choice being
made to minimize the average number of multiplications that must
supplement the squaring.

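   For the overall shape of an exponentiation loop, here is a hedged
word-size sketch using plain binary square-and-multiply; the sliding
window method processes k exponent bits at a time from a table of odd
powers, cutting the number of multiplies, but the squaring backbone is
the same.

     #include <stdint.h>

     uint32_t
     powm_word (uint32_t b, uint32_t e, uint32_t m)
     {
       uint64_t r = 1 % m;                 /* result accumulator */
       uint64_t x = b % m;                 /* running power of the base */
       while (e != 0)
         {
           if (e & 1)
             r = r * x % m;                /* multiply when the bit is 1 */
           x = x * x % m;                  /* square for every bit */
           e >>= 1;
         }
       return (uint32_t) r;
     }
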
   The modular multiplies and squares use either a simple division or
the REDC method by Montgomery (*note References::).  REDC is a little
faster, essentially saving N single limb divisions in a fashion similar
to an exact remainder (*note Exact Remainder::).  The current REDC has
some limitations.  It's only O(N^2) so above `POWM_THRESHOLD' division
becomes faster and is used.  It doesn't attempt to detect small bases,
but rather always uses a REDC form, which is usually a full size
operand.  And lastly it's only applied to odd moduli.

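   A hedged sketch of a single REDC step on a one-limb modulus (the
`redc' in `mpz_powm' works on N-limb operands in the same spirit):

     #include <stdint.h>

     /* Given t < m*2^32, m odd, and minv == -1/m mod 2^32, return
        t * 2^-32 mod m.  Adding u*m makes the low 32 bits of the sum
        zero, so dividing by 2^32 is just taking the high halves.  */
     uint32_t
     redc32 (uint64_t t, uint32_t m, uint32_t minv)
     {
       uint32_t u = (uint32_t) t * minv;
       uint64_t um = (uint64_t) u * m;
       /* add high halves plus the carry out of the cancelling low halves */
       uint64_t s = (t >> 32) + (um >> 32) + ((uint32_t) t != 0);
       return (uint32_t) (s >= m ? s - m : s);
     }
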

File: gmp.info,  Node: Root Extraction Algorithms,  Next: Radix Conversion Algorithms,  Prev: Powering Algorithms,  Up: Algorithms

Root Extraction Algorithms
==========================

* Menu:

* Square Root Algorithm::
* Nth Root Algorithm::
* Perfect Square Algorithm::
* Perfect Power Algorithm::


File: gmp.info,  Node: Square Root Algorithm,  Next: Nth Root Algorithm,  Prev: Root Extraction Algorithms,  Up: Root Extraction Algorithms

Square Root
-----------

   Square roots are taken using the "Karatsuba Square Root" algorithm
by Paul Zimmermann (*note References::).  This is expressed in a divide
and conquer form, but as noted in the paper it can also be viewed as a
discrete variant of Newton's method.

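   For flavour, the discrete Newton iteration on a single word looks
like the following hedged sketch (`mpn_sqrtrem' itself works on limb
arrays and returns the remainder as well):

     #include <stdint.h>

     /* floor(sqrt(a)) by the integer Newton iteration x <- (x + a/x)/2,
        starting from a guess known to be >= sqrt(a).  */
     uint64_t
     isqrt_word (uint64_t a)
     {
       uint64_t x, y;
       if (a < 2)
         return a;
       x = a/2 + 1;                        /* initial guess >= sqrt(a) */
       y = (x + a/x) / 2;
       while (y < x)                       /* strictly decreasing until done */
         {
           x = y;
           y = (x + a/x) / 2;
         }
       return x;
     }
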
   In the Karatsuba multiplication range this is an O(1.5*M(N/2))
algorithm, where M(n) is the time to multiply two numbers of n limbs.
In the FFT multiplication range this grows to a bound of O(6*M(N/2)).
In practice a factor of about 1.5 to 1.8 is found in the Karatsuba and
Toom-3 ranges, growing to 2 or 3 in the FFT range.

   The algorithm does all its calculations in integers and the resulting
`mpn_sqrtrem' is used for both `mpz_sqrt' and `mpf_sqrt'.  The extended
precision given by `mpf_sqrt_ui' is obtained by padding with zero limbs.


File: gmp.info,  Node: Nth Root Algorithm,  Next: Perfect Square Algorithm,  Prev: Square Root Algorithm,  Up: Root Extraction Algorithms

Nth Root
--------

   Integer Nth roots are taken using Newton's method with the following
iteration, where A is the input and n is the root to be taken.

              1         A
     a[i+1] = - * ( --------- + (n-1)*a[i] )
              n     a[i]^(n-1)

   The initial approximation a[1] is generated bitwise by successively
powering a trial root with or without new 1 bits, aiming to be just
above the true root.  The iteration converges quadratically when
started from a good approximation.  When n is large more initial bits
are needed to get good convergence.  The current implementation is not
particularly well optimized.


File: gmp.info,  Node: Perfect Square Algorithm,  Next: Perfect Power Algorithm,  Prev: Nth Root Algorithm,  Up: Root Extraction Algorithms

Perfect Square
--------------

   `mpz_perfect_square_p' is able to quickly exclude most non-squares by
checking whether the input is a quadratic residue modulo some small
integers.

   The first test is modulo 256 which means simply examining the least
significant byte.  Only 44 different values occur as the low byte of a
square, so 82.8% of non-squares can be immediately excluded.  Similar
tests modulo primes from 3 to 29 exclude 99.5% of those remaining, or
if a limb is 64 bits then primes up to 53 are used, excluding 99.99%.
A single Nx1 remainder using `PP' from `gmp-impl.h' quickly gives all
these remainders.

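   The mod 256 test amounts to a 256-entry table of possible low bytes,
as in this hedged sketch (GMP packs the information differently):

     /* Nonzero if x's low byte can be the low byte of a square; only 44
        of the 256 bytes qualify.  */
     int
     maybe_square_mod_256 (unsigned long x)
     {
       static unsigned char ok[256];
       static int initialized = 0;
       int i;
       if (!initialized)
         {
           for (i = 0; i < 256; i++)
             ok[(i * i) & 0xFF] = 1;
           initialized = 1;
         }
       return ok[x & 0xFF];
     }
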
   A square root must still be taken for any value that passes the
residue tests, to verify it's really a square and not one of the 0.086%
(or 0.000156% for 64 bits) non-squares that get through.  *Note Square
Root Algorithm::.


File: gmp.info,  Node: Perfect Power Algorithm,  Prev: Perfect Square Algorithm,  Up: Root Extraction Algorithms

Perfect Power
-------------

   Detecting perfect powers is required by some factorization
algorithms.  Currently `mpz_perfect_power_p' is implemented using
repeated Nth root extractions, though naturally only prime roots need
to be considered.  (*Note Nth Root Algorithm::.)

   If a prime divisor p with multiplicity e can be found, then only
roots which are divisors of e need to be considered, much reducing the
work necessary.  To this end divisibility by a set of small primes is
checked.


File: gmp.info,  Node: Radix Conversion Algorithms,  Next: Other Algorithms,  Prev: Root Extraction Algorithms,  Up: Algorithms

Radix Conversion
================

   Radix conversions are less important than other algorithms.  A
program dominated by conversions should probably use a different data
representation.

* Menu:

* Binary to Radix::
* Radix to Binary::


File: gmp.info,  Node: Binary to Radix,  Next: Radix to Binary,  Prev: Radix Conversion Algorithms,  Up: Radix Conversion Algorithms

Binary to Radix
---------------

   Conversions from binary to a power-of-2 radix use a simple and fast
O(N) bit extraction algorithm.

   Conversions from binary to other radices use one of two algorithms.
Sizes below `GET_STR_PRECOMPUTE_THRESHOLD' use a basic O(N^2) method.
Repeated divisions by b^n are made, where b is the radix and n is the
biggest power that fits in a limb.  But instead of simply using the
remainder r from such divisions, an extra divide step is done to give a
fractional limb representing r/b^n.  The digits of r can then be
extracted using multiplications by b rather than divisions.  Special
case code is provided for decimal, allowing multiplications by 10 to
optimize to shifts and adds.

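   A hedged sketch of that fractional-limb trick for decimal with
32-bit limbs, where the biggest power of 10 in a limb is 10^9 (the name
and fixed-point choices here are illustrative, not GMP's):

     #include <stdint.h>

     /* Write the 9 decimal digits of r (0 <= r < 10^9) to out, most
        significant first, using one divide to form the fixed-point
        fraction r/10^9 and then only multiplications by 10.  out is
        not NUL terminated.  */
     void
     digits_from_fraction (char *out, uint32_t r)
     {
       /* ceil(r * 2^32 / 10^9), a 32-bit fixed-point fraction */
       uint64_t frac = (((uint64_t) r << 32) + 999999999) / 1000000000;
       int i;
       for (i = 0; i < 9; i++)
         {
           frac *= 10;
           out[i] = (char) ('0' + (frac >> 32));  /* integer part = digit */
           frac &= 0xFFFFFFFFul;                  /* keep the fraction */
         }
     }
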
   Above `GET_STR_PRECOMPUTE_THRESHOLD' a sub-quadratic algorithm is
used.  For an input t, powers b^(n*2^i) of the radix are calculated,
until a power between t and sqrt(t) is reached.  t is then divided by
that largest power, giving a quotient which is the digits above that
power, and a remainder which is those below.  These two parts are in
turn divided by the second highest power, and so on recursively.  When
a piece has been divided down to less than `GET_STR_DC_THRESHOLD'
limbs, the basecase algorithm described above is used.

   The advantage of this algorithm is that big divisions can make use
of the sub-quadratic divide and conquer division (*note Divide and
Conquer Division::), and big divisions tend to have less overheads than
lots of separate single limb divisions anyway.  But in any case the
cost of calculating the powers b^(n*2^i) must first be overcome.

   `GET_STR_PRECOMPUTE_THRESHOLD' and `GET_STR_DC_THRESHOLD' represent
the same basic thing, the point where it becomes worth doing a big
division to cut the input in half.  `GET_STR_PRECOMPUTE_THRESHOLD'
includes the cost of calculating the radix power required, whereas
`GET_STR_DC_THRESHOLD' assumes that's already available, which is the
case when recursing.

   Since the base case produces digits from least to most significant
but they want to be stored from most to least, it's necessary to
calculate in advance how many digits there will be, or at least be sure
not to underestimate that.  For GMP the number of input bits is
multiplied by `chars_per_bit_exactly' from `mp_bases', rounding up.
The result is either correct or one too big.

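   In other words the estimate is ceil(bits/log2(base)); a hedged
floating point sketch of it (GMP uses a carefully rounded constant from
`mp_bases' rather than calling `log2' at run time):

     #include <math.h>
     #include <stddef.h>

     /* Upper bound on the digit count of a `bits'-bit number in the
        given base; exact or one too big, up to floating point
        rounding.  */
     size_t
     digit_count_bound (size_t bits, int base)
     {
       return (size_t) ceil (bits / log2 ((double) base));
     }
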
        !           793:    Examining some of the high bits of the input could increase the
        !           794: chance of getting the exact number of digits, but an exact result every
        !           795: time would not be practical, since in general the difference between
        !           796: numbers 100... and 99... is only in the last few bits and the work to
        !           797: identify 99...  might well be almost as much as a full conversion.
        !           798:
   `mpf_get_str' doesn't currently use the algorithm described here.
Instead it multiplies or divides by a power of b to move the radix
point to just above the highest non-zero digit (or at worst one above
that location), then multiplies by b^n to bring out digits.  This is
O(N^2) and is certainly not optimal.
        !           804:
        !           805:    The r/b^n scheme described above for using multiplications to bring
        !           806: out digits might be useful for more than a single limb.  Some brief
experiments with it on the base case when recursing didn't give a
noticeable improvement, but perhaps that was only due to the
        !           809: implementation.  Something similar would work for the sub-quadratic
        !           810: divisions too, though there would be the cost of calculating a bigger
        !           811: radix power.
        !           812:
        !           813:    Another possible improvement for the sub-quadratic part would be to
arrange for radix powers that balance the sizes of quotient and
remainder produced, i.e. the highest power would be a b^(n*k)
approximately equal to sqrt(t), not restricted to a 2^i factor.  That
        !           817: ought to smooth out a graph of times against sizes, but may or may not
        !           818: be a net speedup.
        !           819:
        !           820: 
        !           821: File: gmp.info,  Node: Radix to Binary,  Prev: Binary to Radix,  Up: Radix Conversion Algorithms
        !           822:
        !           823: Radix to Binary
        !           824: ---------------
        !           825:
        !           826:    Conversions from a power-of-2 radix into binary use a simple and fast
        !           827: O(N) bitwise concatenation algorithm.
        !           828:
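   For example base 16 digits carry exactly 4 bits each, so limbs can
be filled with shifts and ORs alone.  A rough sketch, with a stand-in
limb type and no stripping of high zero limbs, might be

     #include <string.h>

     typedef unsigned long limb_t;          /* stand-in for mp_limb_t */
     #define LIMB_BITS  (sizeof (limb_t) * 8)

     /* Pack the hex string s (most significant digit first, assumed
        valid) into limbs, least significant limb first.  rp needs
        (strlen(s)*4 + LIMB_BITS - 1) / LIMB_BITS entries. */
     size_t hex_to_limbs (limb_t *rp, const char *s)
     {
       size_t   len = strlen (s), size = 0, i;
       unsigned bits = 0;
       limb_t   acc = 0;

       for (i = len; i > 0; i--)            /* low digits first */
         {
           char   c = s[i - 1];
           limb_t d = (c <= '9') ? c - '0' : (c | 0x20) - 'a' + 10;
           acc |= d << bits;
           bits += 4;
           if (bits == LIMB_BITS)           /* limb full */
             {
               rp[size++] = acc;
               acc = 0;
               bits = 0;
             }
         }
       if (bits != 0 || size == 0)
         rp[size++] = acc;                  /* final partial limb */
       return size;
     }
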
        !           829:    Conversions from other radices use one of two algorithms.  Sizes
below `SET_STR_THRESHOLD' use a basic O(N^2) method.  Groups of n
digits are converted to limbs, where b^n is the biggest power of the
base b which will fit in a limb, then those groups are accumulated into
the result by multiplying by b^n and adding.  This saves multi-precision
        !           834: operations, as per Knuth section 4.4 part E (*note References::).  Some
        !           835: special case code is provided for decimal, giving the compiler a chance
        !           836: to optimize multiplications by 10.
        !           837:
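   A rough rendering of the basecase at the `mpz' level (the real
code of course works directly on the limb array) might be

     #include <stdio.h>
     #include <string.h>
     #include <gmp.h>

     /* Convert a string of decimal digits (assumed valid), taking
        them in groups of 9 since 10^9 fits in a 32-bit limb. */
     void dec_to_mpz (mpz_t r, const char *s)
     {
       size_t len = strlen (s);
       size_t lead = len % 9;               /* a first, shorter group */
       unsigned long group = 0;
       size_t i;
       int j;

       for (i = 0; i < lead; i++)
         group = 10 * group + (s[i] - '0');
       mpz_set_ui (r, group);

       for (i = lead; i < len; i += 9)
         {
           group = 0;
           for (j = 0; j < 9; j++)          /* one group to an integer */
             group = 10 * group + (s[i + j] - '0');
           mpz_mul_ui (r, r, 1000000000);   /* r = r*10^9 + group */
           mpz_add_ui (r, r, group);
         }
     }

     int main (void)
     {
       mpz_t x;
       mpz_init (x);
       dec_to_mpz (x, "123456789012345678901234567890");
       gmp_printf ("%Zd\n", x);
       mpz_clear (x);
       return 0;
     }
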
   Above `SET_STR_THRESHOLD' a sub-quadratic algorithm is used.  First,
groups of n digits are converted into limbs.  Then adjacent limbs are
        !           840: combined into limb pairs with x*b^n+y, where x and y are the limbs.
        !           841: Adjacent limb pairs are combined into quads similarly with x*b^(2n)+y.
        !           842: This continues until a single block remains, that being the result.
        !           843:
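   The combining step might look like the following at the `mpz'
level, assuming the digit groups are already in an array v[0..m-1],
least significant group first, each less than 10^9 (the array contents
are consumed).

     #include <gmp.h>

     void combine (mpz_t result, mpz_t *v, size_t m)
     {
       mpz_t power;                          /* b^(n*2^i) at level i */
       mpz_init_set_ui (power, 1000000000);  /* 10^9 to start */

       while (m > 1)
         {
           size_t j, k = 0;
           for (j = 0; j + 1 < m; j += 2, k++)
             {
               mpz_addmul (v[j], v[j + 1], power);  /* y + x*b^(n*2^i) */
               if (k != j)
                 mpz_set (v[k], v[j]);
             }
           if (j < m)                        /* odd block waits for the */
             mpz_set (v[k++], v[j]);         /* next level up           */
           m = k;
           mpz_mul (power, power, power);    /* square for next level */
         }

       mpz_set (result, v[0]);
       mpz_clear (power);
     }
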
        !           844:    The advantage of this method is that the multiplications for each x
        !           845: are big blocks, allowing Karatsuba and higher algorithms to be used.
        !           846: But the cost of calculating the powers b^(n*2^i) must be overcome.
        !           847: `SET_STR_THRESHOLD' usually ends up quite big, around 5000 digits, and
        !           848: on some processors much bigger still.
        !           849:
        !           850:    `SET_STR_THRESHOLD' is based on the input digits (and tuned for
        !           851: decimal), though it might be better based on a limb count, so as to be
        !           852: independent of the base.  But that sort of count isn't used by the base
        !           853: case and so would need some sort of initial calculation or estimate.
        !           854:
        !           855:    The main reason `SET_STR_THRESHOLD' is so much bigger than the
        !           856: corresponding `GET_STR_PRECOMPUTE_THRESHOLD' is that `mpn_mul_1' is
        !           857: much faster than `mpn_divrem_1' (often by a factor of 10, or more).
        !           858:
        !           859: 
        !           860: File: gmp.info,  Node: Other Algorithms,  Next: Assembler Coding,  Prev: Radix Conversion Algorithms,  Up: Algorithms
        !           861:
        !           862: Other Algorithms
        !           863: ================
        !           864:
        !           865: * Menu:
        !           866:
        !           867: * Factorial Algorithm::
        !           868: * Binomial Coefficients Algorithm::
        !           869: * Fibonacci Numbers Algorithm::
        !           870: * Lucas Numbers Algorithm::
        !           871:
        !           872: 
        !           873: File: gmp.info,  Node: Factorial Algorithm,  Next: Binomial Coefficients Algorithm,  Prev: Other Algorithms,  Up: Other Algorithms
        !           874:
        !           875: Factorial
        !           876: ---------
        !           877:
        !           878:    Factorials n! are calculated by a simple product from 1 to n, but
        !           879: arranged into certain sub-products.
        !           880:
        !           881:    First as many factors as fit in a limb are accumulated, then two of
        !           882: those multiplied to give a 2-limb product.  When two 2-limb products
are ready they're multiplied to a 4-limb product, and when two 4-limb
products are ready they're multiplied to an 8-limb product, etc.  A
stack of outstanding products is built up, with two of the same size
multiplied together when ready.
        !           887:
        !           888:    Arranging for multiplications to have operands the same (or nearly
        !           889: the same) size means the Karatsuba and higher multiplication algorithms
        !           890: can be used.  And even on sizes below the Karatsuba threshold an NxN
        !           891: multiply will give a basecase multiply more to work on.
        !           892:
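   The benefit of balanced operands shows up even in a simple
recursive splitting at the `mpz' level.  This sketch is not GMP's
code, which accumulates limb-sized factors and merges a stack of equal
sized partial products, but it multiplies similarly sized pieces in
the same spirit.

     #include <stdio.h>
     #include <gmp.h>

     /* r = lo * (lo+1) * ... * hi, splitting into halves so that the
        multiplications get similarly sized operands. */
     static void prod_range (mpz_t r, unsigned long lo, unsigned long hi)
     {
       if (hi - lo < 8)                 /* small range: direct product */
         {
           unsigned long i;
           mpz_set_ui (r, lo);
           for (i = lo + 1; i <= hi; i++)
             mpz_mul_ui (r, r, i);
           return;
         }
       {
         unsigned long mid = lo + (hi - lo) / 2;
         mpz_t left;
         mpz_init (left);
         prod_range (left, lo, mid);
         prod_range (r, mid + 1, hi);
         mpz_mul (r, r, left);          /* two similarly sized halves */
         mpz_clear (left);
       }
     }

     int main (void)
     {
       mpz_t f;
       mpz_init (f);
       prod_range (f, 1, 100);          /* 100! */
       gmp_printf ("%Zd\n", f);
       mpz_clear (f);
       return 0;
     }
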
        !           893:    An obvious improvement not currently implemented would be to strip
        !           894: factors of 2 from the products and apply them at the end with a bit
        !           895: shift.  Another possibility would be to determine the prime
        !           896: factorization of the result (which can be done easily), and use a
        !           897: powering method, at each stage squaring then multiplying in those
        !           898: primes with a 1 in their exponent at that point.  The advantage would
        !           899: be some multiplies turned into squares.
        !           900:
        !           901: 
        !           902: File: gmp.info,  Node: Binomial Coefficients Algorithm,  Next: Fibonacci Numbers Algorithm,  Prev: Factorial Algorithm,  Up: Other Algorithms
        !           903:
        !           904: Binomial Coefficients
        !           905: ---------------------
        !           906:
        !           907:    Binomial coefficients C(n,k) are calculated by first arranging k <=
        !           908: n/2 using C(n,k) = C(n,n-k) if necessary, and then evaluating the
        !           909: following product simply from i=2 to i=k.
        !           910:
        !           911:                            k  (n-k+i)
        !           912:      C(n,k) =  (n-k+1) * prod -------
        !           913:                           i=2    i
        !           914:
        !           915:    It's easy to show that each denominator i will divide the product so
        !           916: far, so the exact division algorithm is used (*note Exact Division::).
        !           917:
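   Written with the `mpz' functions, the product and its exact
divisions might look as follows, assuming k <= n.  `mpz_bin_uiui'
itself also batches numerator and denominator factors into limbs
first.

     #include <stdio.h>
     #include <gmp.h>

     void binom (mpz_t r, unsigned long n, unsigned long k)
     {
       unsigned long i;
       if (k > n - k)
         k = n - k;                        /* C(n,k) = C(n,n-k) */
       mpz_set_ui (r, k == 0 ? 1 : n - k + 1);
       for (i = 2; i <= k; i++)
         {
           mpz_mul_ui (r, r, n - k + i);   /* multiply in a numerator */
           mpz_divexact_ui (r, r, i);      /* i divides exactly here */
         }
     }

     int main (void)
     {
       mpz_t r;
       mpz_init (r);
       binom (r, 100, 50);
       gmp_printf ("C(100,50) = %Zd\n", r);
       mpz_clear (r);
       return 0;
     }
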
   The numerators n-k+i and denominators i are first accumulated into
as many as fit in a limb, to save multi-precision operations, though
for `mpz_bin_ui' this applies only to the divisors, since n is an
`mpz_t' and n-k+i in general won't fit in a limb at all.
        !           922:
        !           923:    An obvious improvement would be to strip factors of 2 from each
        !           924: multiplier and divisor and count them separately, to be applied with a
        !           925: bit shift at the end.  Factors of 3 and perhaps 5 could even be handled
        !           926: similarly.  Another possibility, if n is not too big, would be to
        !           927: determine the prime factorization of the result based on the factorials
        !           928: involved, and power up those primes appropriately.  This would help
        !           929: most when k is near n/2.
        !           930:
        !           931: 
        !           932: File: gmp.info,  Node: Fibonacci Numbers Algorithm,  Next: Lucas Numbers Algorithm,  Prev: Binomial Coefficients Algorithm,  Up: Other Algorithms
        !           933:
        !           934: Fibonacci Numbers
        !           935: -----------------
        !           936:
        !           937:    The Fibonacci functions `mpz_fib_ui' and `mpz_fib2_ui' are designed
        !           938: for calculating isolated F[n] or F[n],F[n-1] values efficiently.
        !           939:
        !           940:    For small n, a table of single limb values in `__gmp_fib_table' is
        !           941: used.  On a 32-bit limb this goes up to F[47], or on a 64-bit limb up
        !           942: to F[93].  For convenience the table starts at F[-1].
        !           943:
        !           944:    Beyond the table, values are generated with a binary powering
        !           945: algorithm, calculating a pair F[n] and F[n-1] working from high to low
        !           946: across the bits of n.  The formulas used are
        !           947:
        !           948:      F[2k+1] = 4*F[k]^2 - F[k-1]^2 + 2*(-1)^k
        !           949:      F[2k-1] =   F[k]^2 + F[k-1]^2
        !           950:
        !           951:      F[2k] = F[2k+1] - F[2k-1]
        !           952:
   At each step, k is the value of the high bits of n processed so far.
If the next bit of n is 0 then F[2k],F[2k-1] is used, or if it's a 1
then F[2k+1],F[2k] is used,
        !           955: and the process repeated until all bits of n are incorporated.  Notice
        !           956: these formulas require just two squares per bit of n.
        !           957:
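   An `mpz' level sketch of the powering loop follows.  The internal
`mpn_fib2_ui' does the same on limb arrays and starts from the table
rather than from F[1],F[0]; in this sketch n must be at least 1.

     #include <stdio.h>
     #include <gmp.h>

     /* Set fn = F[n] and fn1 = F[n-1], n >= 1, using the squaring
        formulas above, working down the bits of n. */
     void fib2 (mpz_t fn, mpz_t fn1, unsigned long n)
     {
       mpz_t a, b, t;                 /* a = F[k], b = F[k-1] */
       unsigned long m, k = 1;
       int bits = 0, i;

       mpz_init_set_ui (a, 1);        /* start at k = 1, the top bit */
       mpz_init_set_ui (b, 0);
       mpz_init (t);

       for (m = n; m > 1; m >>= 1)    /* count bits below the top one */
         bits++;

       for (i = bits - 1; i >= 0; i--)
         {
           mpz_mul (t, a, a);         /* t = F[2k-1] = F[k]^2+F[k-1]^2 */
           mpz_addmul (t, b, b);

           mpz_mul (a, a, a);         /* a = F[2k+1]                   */
           mpz_mul_ui (a, a, 4);      /*   = 4*F[k]^2 - F[k-1]^2       */
           mpz_submul (a, b, b);      /*     + 2*(-1)^k                */
           if (k & 1)
             mpz_sub_ui (a, a, 2);
           else
             mpz_add_ui (a, a, 2);

           mpz_sub (b, a, t);         /* b = F[2k] = F[2k+1] - F[2k-1] */

           if ((n >> i) & 1)
             k = 2 * k + 1;           /* keep F[2k+1],F[2k] */
           else
             {
               mpz_set (a, b);        /* drop to F[2k],F[2k-1] */
               mpz_set (b, t);
               k = 2 * k;
             }
         }

       mpz_set (fn, a);
       mpz_set (fn1, b);
       mpz_clear (a);
       mpz_clear (b);
       mpz_clear (t);
     }

     int main (void)
     {
       mpz_t f, f1;
       mpz_init (f);
       mpz_init (f1);
       fib2 (f, f1, 100);
       gmp_printf ("F[100] = %Zd\n", f);
       mpz_clear (f);
       mpz_clear (f1);
       return 0;
     }
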
        !           958:    It'd be possible to handle the first few n above the single limb
        !           959: table with simple additions, using the defining Fibonacci recurrence
        !           960: F[k+1]=F[k]+F[k-1], but this is not done since it usually turns out to
        !           961: be faster for only about 10 or 20 values of n, and including a block of
        !           962: code for just those doesn't seem worthwhile.  If they really mattered
        !           963: it'd be better to extend the data table.
        !           964:
        !           965:    Using a table avoids lots of calculations on small numbers, and
makes small n go fast.  A bigger table would make more small n go fast;
it's just a question of balancing size against desired speed.  For GMP
        !           968: the code is kept compact, with the emphasis primarily on a good
        !           969: powering algorithm.
        !           970:
        !           971:    `mpz_fib2_ui' returns both F[n] and F[n-1], but `mpz_fib_ui' is only
        !           972: interested in F[n].  In this case the last step of the algorithm can
become one multiply instead of two squares.  One of the following two
formulas is used, according to whether n is even or odd.
        !           975:
        !           976:      F[2k]   = F[k]*(F[k]+2F[k-1])
        !           977:
        !           978:      F[2k+1] = (2F[k]+F[k-1])*(2F[k]-F[k-1]) + 2*(-1)^k
        !           979:
        !           980:    F[2k+1] here is the same as above, just rearranged to be a multiply.
        !           981: For interest, the 2*(-1)^k term both here and above can be applied
        !           982: just to the low limb of the calculation, without a carry or borrow into
        !           983: further limbs, which saves some code size.  See comments with
        !           984: `mpz_fib_ui' and the internal `mpn_fib2_ui' for how this is done.
        !           985:
        !           986: 
        !           987: File: gmp.info,  Node: Lucas Numbers Algorithm,  Prev: Fibonacci Numbers Algorithm,  Up: Other Algorithms
        !           988:
        !           989: Lucas Numbers
        !           990: -------------
        !           991:
        !           992:    `mpz_lucnum2_ui' derives a pair of Lucas numbers from a pair of
        !           993: Fibonacci numbers with the following simple formulas.
        !           994:
        !           995:      L[k]   =   F[k] + 2*F[k-1]
        !           996:      L[k-1] = 2*F[k] -   F[k-1]
        !           997:
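   The same derivation can be written with the public functions,
getting F[k] and F[k-1] from `mpz_fib2_ui' and combining them
(`mpz_lucnum2_ui' of course does this directly).

     #include <stdio.h>
     #include <gmp.h>

     int main (void)
     {
       unsigned long k = 50;
       mpz_t fk, fk1, lk, lk1;
       mpz_init (fk);
       mpz_init (fk1);
       mpz_init (lk);
       mpz_init (lk1);

       mpz_fib2_ui (fk, fk1, k);        /* F[k], F[k-1] */

       mpz_mul_2exp (lk, fk1, 1);       /* L[k]   =   F[k] + 2*F[k-1] */
       mpz_add (lk, lk, fk);
       mpz_mul_2exp (lk1, fk, 1);       /* L[k-1] = 2*F[k] -   F[k-1] */
       mpz_sub (lk1, lk1, fk1);

       gmp_printf ("L[%lu] = %Zd\nL[%lu] = %Zd\n", k, lk, k - 1, lk1);

       mpz_clear (fk);
       mpz_clear (fk1);
       mpz_clear (lk);
       mpz_clear (lk1);
       return 0;
     }
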
        !           998:    `mpz_lucnum_ui' is only interested in L[n], and some work can be
        !           999: saved.  Trailing zero bits on n can be handled with a single square
        !          1000: each.
        !          1001:
        !          1002:      L[2k] = L[k]^2 - 2*(-1)^k
        !          1003:
        !          1004:    And the lowest 1 bit can be handled with one multiply of a pair of
        !          1005: Fibonacci numbers, similar to what `mpz_fib_ui' does.
        !          1006:
        !          1007:      L[2k+1] = 5*F[k-1]*(2*F[k]+F[k-1]) - 4*(-1)^k
        !          1008:
        !          1009: 
        !          1010: File: gmp.info,  Node: Assembler Coding,  Prev: Other Algorithms,  Up: Algorithms
        !          1011:
        !          1012: Assembler Coding
        !          1013: ================
        !          1014:
        !          1015:    The assembler subroutines in GMP are the most significant source of
        !          1016: speed at small to moderate sizes.  At larger sizes algorithm selection
        !          1017: becomes more important, but of course speedups in low level routines
        !          1018: will still speed up everything proportionally.
        !          1019:
        !          1020:    Carry handling and widening multiplies that are important for GMP
        !          1021: can't be easily expressed in C.  GCC `asm' blocks help a lot and are
        !          1022: provided in `longlong.h', but hand coding low level routines invariably
        !          1023: offers a speedup over generic C by a factor of anything from 2 to 10.
        !          1024:
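   For instance, a full NxN->2N limb multiply in portable C has to
split the operands into half limbs and patch up a carry, roughly as
below (with a stand-in limb type).  A single widening multiply
instruction, reached through a `longlong.h' macro such as
`umul_ppmm', does the whole job where the processor provides it.

     typedef unsigned long limb_t;          /* stand-in for mp_limb_t */
     #define HALF_BITS  (sizeof (limb_t) * 4)
     #define LOW_MASK   ((((limb_t) 1) << HALF_BITS) - 1)

     /* *hi,*lo = u*v, the full double-limb product */
     void umul (limb_t *hi, limb_t *lo, limb_t u, limb_t v)
     {
       limb_t ul = u & LOW_MASK, uh = u >> HALF_BITS;
       limb_t vl = v & LOW_MASK, vh = v >> HALF_BITS;

       limb_t x0 = ul * vl;
       limb_t x1 = ul * vh;
       limb_t x2 = uh * vl;
       limb_t x3 = uh * vh;

       x1 += x0 >> HALF_BITS;        /* cannot overflow */
       x1 += x2;                     /* but this can ... */
       if (x1 < x2)
         x3 += LOW_MASK + 1;         /* ... so carry into the high part */

       *lo = (x1 << HALF_BITS) + (x0 & LOW_MASK);
       *hi = x3 + (x1 >> HALF_BITS);
     }
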
        !          1025: * Menu:
        !          1026:
        !          1027: * Assembler Code Organisation::
        !          1028: * Assembler Basics::
        !          1029: * Assembler Carry Propagation::
        !          1030: * Assembler Cache Handling::
        !          1031: * Assembler Floating Point::
        !          1032: * Assembler SIMD Instructions::
        !          1033: * Assembler Software Pipelining::
        !          1034: * Assembler Loop Unrolling::
        !          1035:
        !          1036: 
        !          1037: File: gmp.info,  Node: Assembler Code Organisation,  Next: Assembler Basics,  Prev: Assembler Coding,  Up: Assembler Coding
        !          1038:
        !          1039: Code Organisation
        !          1040: -----------------
        !          1041:
        !          1042:    The various `mpn' subdirectories contain machine-dependent code,
        !          1043: written in C or assembler.  The `mpn/generic' subdirectory contains
        !          1044: default code, used when there's no machine-specific version of a
        !          1045: particular file.
        !          1046:
        !          1047:    Each `mpn' subdirectory is for an ISA family.  Generally 32-bit and
        !          1048: 64-bit variants in a family cannot share code and will have separate
        !          1049: directories.  Within a family further subdirectories may exist for CPU
        !          1050: variants.
        !          1051:
        !          1052: 
        !          1053: File: gmp.info,  Node: Assembler Basics,  Next: Assembler Carry Propagation,  Prev: Assembler Code Organisation,  Up: Assembler Coding
        !          1054:
        !          1055: Assembler Basics
        !          1056: ----------------
        !          1057:
        !          1058:    `mpn_addmul_1' and `mpn_submul_1' are the most important routines
        !          1059: for overall GMP performance.  All multiplications and divisions come
        !          1060: down to repeated calls to these.  `mpn_add_n', `mpn_sub_n',
        !          1061: `mpn_lshift' and `mpn_rshift' are next most important.
        !          1062:
        !          1063:    On some CPUs assembler versions of the internal functions
        !          1064: `mpn_mul_basecase' and `mpn_sqr_basecase' give significant speedups,
        !          1065: mainly through avoiding function call overheads.  They can also
        !          1066: potentially make better use of a wide superscalar processor.
        !          1067:
        !          1068:    The restrictions on overlaps between sources and destinations (*note
        !          1069: Low-level Functions::) are designed to facilitate a variety of
        !          1070: implementations.  For example, knowing `mpn_add_n' won't have partly
        !          1071: overlapping sources and destination means reading can be done far ahead
        !          1072: of writing on superscalar processors, and loops can be vectorized on a
        !          1073: vector processor, depending on the carry handling.
        !          1074:
        !          1075: 
        !          1076: File: gmp.info,  Node: Assembler Carry Propagation,  Next: Assembler Cache Handling,  Prev: Assembler Basics,  Up: Assembler Coding
        !          1077:
        !          1078: Carry Propagation
        !          1079: -----------------
        !          1080:
   The problem that presents the most challenges in GMP is propagating
        !          1082: carries from one limb to the next.  In functions like `mpn_addmul_1' and
        !          1083: `mpn_add_n', carries are the only dependencies between limb operations.
        !          1084:
        !          1085:    On processors with carry flags, a straightforward CISC style `adc' is
generally best.  The AMD K6 `mpn_addmul_1', however, is an example of
an unusual set of circumstances where a branch works out better.
        !          1088:
        !          1089:    On RISC processors generally an add and compare for overflow is
        !          1090: used.  This sort of thing can be seen in `mpn/generic/aors_n.c'.  Some
        !          1091: carry propagation schemes require 4 instructions, meaning at least 4
        !          1092: cycles per limb, but other schemes may use just 1 or 2.  On wide
        !          1093: superscalar processors performance may be completely determined by the
        !          1094: number of dependent instructions between carry-in and carry-out for
        !          1095: each limb.
        !          1096:
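   In C the add-and-compare idiom looks roughly like the following
simplified add_n, with a stand-in limb type.

     #include <stddef.h>

     typedef unsigned long limb_t;        /* stand-in for mp_limb_t */

     limb_t add_n (limb_t *rp, const limb_t *up, const limb_t *vp,
                   size_t n)
     {
       limb_t cy = 0;
       size_t i;
       for (i = 0; i < n; i++)
         {
           limb_t u = up[i], v = vp[i];
           limb_t s = u + v;              /* may wrap around */
           limb_t c = s < u;              /* unsigned compare finds it */
           s += cy;                       /* add the incoming carry */
           c += s < cy;                   /* which can carry too */
           rp[i] = s;
           cy = c;                        /* 0 or 1 */
         }
       return cy;
     }
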
        !          1097:    On vector processors good use can be made of the fact that a carry
        !          1098: bit only very rarely propagates more than one limb.  When adding a
        !          1099: single bit to a limb, there's only a carry out if that limb was
        !          1100: `0xFF...FF' which on random data will be only 1 in 2^mp_bits_per_limb.
`mpn/cray/add_n.c' is an example of this: it adds all limbs in
        !          1102: parallel, adds one set of carry bits in parallel and then only rarely
        !          1103: needs to fall through to a loop propagating further carries.
        !          1104:
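   The shape of such a scheme, though not the contents of
`mpn/cray/add_n.c' itself, can be sketched in C as independent passes
over the limbs followed by a rarely taken fixup loop (sizes up to a
fixed limit are assumed here).

     typedef unsigned long limb_t;          /* stand-in for mp_limb_t */
     #define MAXN 1024                      /* sketch: n <= MAXN */

     limb_t add_n_parallel (limb_t *rp, const limb_t *up,
                            const limb_t *vp, long n)
     {
       limb_t cy[MAXN];                     /* per-position carry bits */
       long i;
       int again;

       for (i = 0; i < n; i++)              /* all sums, independent */
         rp[i] = up[i] + vp[i];
       for (i = 0; i < n; i++)              /* all carries, independent */
         cy[i] = rp[i] < up[i];

       do
         {
           again = 0;
           /* add each carry into the limb above; a further carry out
              needs that limb to have been 0xFF...FF, which is rare,
              so this loop is rarely repeated */
           for (i = n - 1; i >= 1; i--)
             {
               limb_t c = cy[i - 1];
               cy[i - 1] = 0;
               rp[i] += c;
               if (c && rp[i] == 0)
                 {
                   cy[i] = 1;
                   again = 1;
                 }
             }
         }
       while (again);

       return cy[n - 1];                    /* carry out of the top */
     }
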
        !          1105:    On the x86s, GCC (as of version 2.95.2) doesn't generate
        !          1106: particularly good code for the RISC style idioms that are necessary to
        !          1107: handle carry bits in C.  Often conditional jumps are generated where
        !          1108: `adc' or `sbb' forms would be better.  And so unfortunately almost any
        !          1109: loop involving carry bits needs to be coded in assembler for best
        !          1110: results.
        !          1111:
