
This is gmp.info, produced by makeinfo version 4.2 from gmp.texi.

This manual describes how to install and use the GNU multiple precision
arithmetic library, version 4.1.2.

   Copyright 1991, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000,
2001, 2002 Free Software Foundation, Inc.

   Permission is granted to copy, distribute and/or modify this
document under the terms of the GNU Free Documentation License, Version
1.1 or any later version published by the Free Software Foundation;
with no Invariant Sections, with the Front-Cover Texts being "A GNU
Manual", and with the Back-Cover Texts being "You have freedom to copy
and modify this GNU Manual, like GNU software".  A copy of the license
is included in *Note GNU Free Documentation License::.
INFO-DIR-SECTION GNU libraries
START-INFO-DIR-ENTRY
* gmp: (gmp).                   GNU Multiple Precision Arithmetic Library.
END-INFO-DIR-ENTRY


File: gmp.info,  Node: FFT Multiplication,  Next: Other Multiplication,  Prev: Toom-Cook 3-Way Multiplication,  Up: Multiplication Algorithms

FFT Multiplication
------------------

   At large to very large sizes a Fermat style FFT multiplication is
used, following Schönhage and Strassen (*note References::).
Descriptions of FFTs in various forms can be found in many textbooks,
for instance Knuth section 4.3.3 part C or Lipson chapter IX.  A brief
description of the form used in GMP is given here.

   The multiplication done is x*y mod 2^N+1, for a given N.  A full
product x*y is obtained by choosing N>=bits(x)+bits(y) and padding x
and y with high zero limbs.  The modular product is the native form for
the algorithm, so padding to get a full product is unavoidable.

   The algorithm follows a split, evaluate, pointwise multiply,
interpolate and combine sequence similar to that described above for
Karatsuba and Toom-3.  A k parameter controls the split, with an FFT-k
splitting into 2^k pieces of M=N/2^k bits each.  N must be a multiple
of (2^k)*mp_bits_per_limb so the split falls on limb boundaries,
avoiding bit shifts in the split and combine stages.

   The evaluations, pointwise multiplications and interpolation are
all done modulo 2^N'+1, where N' is 2M+k+3 rounded up to a multiple of
2^k and of `mp_bits_per_limb'.  The result of the interpolation is the
following negacyclic convolution of the input pieces, and the choice of
N' ensures these sums aren't truncated.

                ---
                \         b
     w[n] =     /     (-1) * x[i] * y[j]
                ---
            i+j==b*2^k+n
               b=0,1

   The points used for the evaluation are g^i for i=0 to 2^k-1, where
g=2^(2N'/2^k).  g is a 2^k'th root of unity mod 2^N'+1, which produces
the necessary cancellations at the interpolation stage, and it's also a
power of 2, so the fast Fourier transforms used for the evaluation and
interpolation do only shifts, adds and negations.

   The pointwise multiplications are done modulo 2^N'+1 and either
recurse into a further FFT or use a plain multiplication (Toom-3,
Karatsuba or basecase), whichever is optimal at the size N'.  The
interpolation is an inverse fast Fourier transform.  The resulting set
of sums of x[i]*y[j] are added at appropriate offsets to give the final
result.

   Squaring is the same, but x is the only input so it's one transform
at the evaluate stage and the pointwise multiplies are squares.  The
interpolation is the same.

   For a mod 2^N+1 product, an FFT-k is an O(N^(k/(k-1))) algorithm,
the exponent representing 2^k recursed modular multiplies each
1/2^(k-1) the size of the original.  Each successive k is an asymptotic
improvement, but overheads mean each is only faster at bigger and
bigger sizes.  In the code, `MUL_FFT_TABLE' and `SQR_FFT_TABLE' are the
thresholds where each k is used.  Each new k effectively swaps some
multiplying for some shifts, adds and overheads.

   A mod 2^N+1 product can be formed with a normal NxN->2N bit multiply
plus a subtraction, so an FFT and Toom-3 etc can be compared directly.
A k=4 FFT at O(N^1.333) can be expected to be the first to be faster
than Toom-3 at O(N^1.465).  In practice this is what's found, with
`MUL_FFT_MODF_THRESHOLD' and `SQR_FFT_MODF_THRESHOLD' being between 300
and 1000 limbs, depending on the CPU.  So far it's been found that only
very large FFTs recurse into pointwise multiplies above these sizes.

   When an FFT is to give a full product, the change of N to 2N doesn't
alter the theoretical complexity for a given k, but for the purposes of
considering where an FFT might first be used it can be assumed that the
FFT is recursing into a normal multiply and that on that basis it's
doing 2^k recursed multiplies each 1/2^(k-2) the size of the inputs,
making it O(N^(k/(k-2))).  This would mean k=7 at O(N^1.4) would be the
first FFT faster than Toom-3.  In practice `MUL_FFT_THRESHOLD' and
`SQR_FFT_THRESHOLD' have been found to be in the k=8 range, somewhere
between 3000 and 10000 limbs.

   The way N is split into 2^k pieces and then 2M+k+3 is rounded up to
a multiple of 2^k and `mp_bits_per_limb' means that when
2^k>=mp_bits_per_limb the effective N is a multiple of 2^(2k-1) bits.
The +k+3 means some values of N just under such a multiple will be
rounded to the next.  The complexity calculations above assume that a
favourable size is used, meaning one which isn't padded through
rounding, and it's also assumed that the extra +k+3 bits are negligible
at typical FFT sizes.

   The practical effect of the 2^(2k-1) constraint is to introduce a
step effect into measured speeds.  For example k=8 will round N up to a
multiple of 32768 bits, so for a 32-bit limb sizes fall into groups 512
limbs wide for which `mpn_mul_n' runs at the same speed.  Or for k=9
groups 2048 limbs wide, for k=10 groups 8192 limbs wide, etc.  In
practice it's been found each k is used at quite small multiples of its
size constraint, so the step effect is quite noticeable in a time
versus size graph.

   The threshold determinations currently measure at the mid-points of
size steps, but this is sub-optimal since at the start of a new step it
can happen that it's better to go back to the previous k for a while.
Something more sophisticated for `MUL_FFT_TABLE' and `SQR_FFT_TABLE'
will be needed.


File: gmp.info,  Node: Other Multiplication,  Prev: FFT Multiplication,  Up: Multiplication Algorithms

Other Multiplication
--------------------

   The 3-way Toom-Cook algorithm described above (*note Toom-Cook 3-Way
Multiplication::) generalizes to split into an arbitrary number of
pieces, as per Knuth section 4.3.3 algorithm C.  This is not currently
used, though it's possible a Toom-4 might fit in between Toom-3 and the
FFTs.  The notes here are merely for interest.

   In general a split into r+1 pieces is made, and evaluations and
pointwise multiplications done at 2*r+1 points.  A 4-way split does 7
pointwise multiplies, 5-way does 9, etc.  Asymptotically an (r+1)-way
algorithm is O(N^(log(2*r+1)/log(r+1))).  Only the pointwise
multiplications count towards big-O complexity, but the time spent in
the evaluate and interpolate stages grows with r and has a significant
practical impact, with the asymptotic advantage of each r realized only
at bigger and bigger sizes.  The overheads grow as O(N*r), whereas in
an r=2^k FFT they grow only as O(N*log(r)).

   Knuth algorithm C evaluates at points 0,1,2,...,2*r, but exercise 4
uses -r,...,0,...,r and the latter saves some small multiplies in the
evaluate stage (or rather trades them for additions), and has a further
saving of nearly half the interpolate steps.  The idea is to separate
odd and even final coefficients and then perform algorithm C steps C7
and C8 on them separately.  The divisors at step C7 become j^2 and the
multipliers at C8 become 2*t*j-j^2.

   Splitting odd and even parts through positive and negative points
can be thought of as using -1 as a square root of unity.  If a 4th root
of unity were available then a further split and speedup would be
possible, but no such root exists for plain integers.  Going to complex
integers with i=sqrt(-1) doesn't help, essentially because in Cartesian
form it takes three real multiplies to do a complex multiply.  The
existence of 2^k'th roots of unity in a suitable ring or field lets the
fast Fourier transform keep splitting and get to O(N*log(r)).

   Floating point FFTs use complex numbers approximating Nth roots of
unity.  Some processors have special support for such FFTs.  But these
are not used in GMP since it's very difficult to guarantee an exact
result (to some number of bits).  An occasional difference of 1 in the
last bit might not matter to a typical signal processing algorithm, but
is of course of vital importance to GMP.


File: gmp.info,  Node: Division Algorithms,  Next: Greatest Common Divisor Algorithms,  Prev: Multiplication Algorithms,  Up: Algorithms

Division Algorithms
===================

* Menu:

* Single Limb Division::
* Basecase Division::
* Divide and Conquer Division::
* Exact Division::
* Exact Remainder::
* Small Quotient Division::


File: gmp.info,  Node: Single Limb Division,  Next: Basecase Division,  Prev: Division Algorithms,  Up: Division Algorithms

Single Limb Division
--------------------

   Nx1 division is implemented using repeated 2x1 divisions from high
to low, either with a hardware divide instruction or a multiplication
by inverse, whichever is best on a given CPU.

   The multiply by inverse follows section 8 of "Division by Invariant
Integers using Multiplication" by Granlund and Montgomery (*note
References::) and is implemented as `udiv_qrnnd_preinv' in
`gmp-impl.h'.  The idea is to have a fixed-point approximation to 1/d
(see `invert_limb') and then multiply by the high limb (plus one bit)
of the dividend to get a quotient q.  With d normalized (high bit set),
q is no more than 1 too small.  Subtracting q*d from the dividend gives
a remainder, and reveals whether q or q-1 is correct.

   The result is a division done with two multiplications and four or
five arithmetic operations.  On CPUs with low latency multipliers this
can be much faster than a hardware divide, though the cost of
calculating the inverse at the start may mean it's only better on
inputs bigger than, say, 4 or 5 limbs.

   When a divisor must be normalized, either for the generic C
`__udiv_qrnnd_c' or the multiply by inverse, the division performed is
actually a*2^k by d*2^k, where a is the dividend and k is the power
necessary to have the high bit of d*2^k set.  The bit shifts for the
dividend are usually accomplished "on the fly", meaning by extracting
the appropriate bits at each step.  Done this way the quotient limbs
come out aligned ready to store.  When only the remainder is wanted, an
alternative is to take the dividend limbs unshifted and calculate r = a
mod d*2^k, followed by an extra final step r*2^k mod d*2^k.  This can
help on CPUs with poor bit shifts or few registers.

   The multiply by inverse can be done two limbs at a time.  The
calculation is basically the same, but the inverse is two limbs and the
divisor treated as if padded with a low zero limb.  This means more
work, since the inverse will need a 2x2 multiply, but the four 1x1s to
do that are independent and can therefore be done partly or wholly in
parallel.  Likewise for a 2x1 calculating q*d.  The net effect is to
process two limbs with roughly the same two multiplies worth of latency
that one limb at a time gives.  This extends to 3 or 4 limbs at a time,
though the extra work to apply the inverse will almost certainly soon
reach the limits of multiplier throughput.

   A similar approach in reverse can be taken to process just half a
limb at a time if the divisor is only a half limb.  In this case the
1x1 multiply for the inverse effectively becomes two (1/2)x1s for each
limb, which can be a saving on CPUs with a fast half limb multiply, or
in fact if the only multiply available is a half limb one, and
especially if it's not pipelined.


File: gmp.info,  Node: Basecase Division,  Next: Divide and Conquer Division,  Prev: Single Limb Division,  Up: Division Algorithms

Basecase Division
-----------------

   Basecase NxM division is like long division done by hand, but in base
2^mp_bits_per_limb.  See Knuth section 4.3.1 algorithm D, and
`mpn/generic/sb_divrem_mn.c'.

   Briefly stated, while the dividend remains larger than the divisor,
a high quotient limb is formed and the Nx1 product q*d subtracted at
the top end of the dividend.  With a normalized divisor (most
significant bit set), each quotient limb can be formed with a 2x1
division and a 1x1 multiplication plus some subtractions.  The 2x1
division is by the high limb of the divisor and is done either with a
hardware divide or a multiply by inverse (the same as in *Note Single
Limb Division::), whichever is faster.  Such a quotient is sometimes
one too big, requiring an addback of the divisor, but that happens
rarely.

   With Q=N-M being the number of quotient limbs, this is an O(Q*M)
algorithm and will run at a speed similar to a basecase QxM
multiplication, differing in fact only in the extra multiply and divide
for each of the Q quotient limbs.


File: gmp.info,  Node: Divide and Conquer Division,  Next: Exact Division,  Prev: Basecase Division,  Up: Division Algorithms

Divide and Conquer Division
---------------------------

   For divisors larger than `DIV_DC_THRESHOLD', division is done by
dividing.  Or to be precise by a recursive divide and conquer algorithm
based on work by Moenck and Borodin, Jebelean, and Burnikel and Ziegler
(*note References::).

   The algorithm consists essentially of recognising that a 2NxN
division can be done with the basecase division algorithm (*note
Basecase Division::), but using N/2 limbs as a base, not just a single
limb.  This way the multiplications that arise are (N/2)x(N/2) and can
take advantage of Karatsuba and higher multiplication algorithms (*note
Multiplication Algorithms::).  The "digits" of the quotient are formed
by recursive Nx(N/2) divisions.

   If the (N/2)x(N/2) multiplies are done with a basecase multiplication
then the work is about the same as a basecase division, but with more
function call overheads and with some subtractions separated from the
multiplies.  These overheads mean that it's only when N/2 is above
`MUL_KARATSUBA_THRESHOLD' that divide and conquer is of use.

   `DIV_DC_THRESHOLD' is based on the divisor size N, so it will be
somewhere above twice `MUL_KARATSUBA_THRESHOLD', but how much above
depends on the CPU.  An optimized `mpn_mul_basecase' can lower
`DIV_DC_THRESHOLD' a little by offering a ready-made advantage over
repeated `mpn_submul_1' calls.

   Divide and conquer is asymptotically O(M(N)*log(N)) where M(N) is
the time for an NxN multiplication done with FFTs.  The actual time is
a sum over multiplications of the recursed sizes, as can be seen near
the end of section 2.2 of Burnikel and Ziegler.  For example, within
the Toom-3 range, divide and conquer is 2.63*M(N).  With higher
algorithms the M(N) term improves and the multiplier tends to log(N).
In practice, at moderate to large sizes, a 2NxN division is about 2 to
4 times slower than an NxN multiplication.

   Newton's method used for division is asymptotically O(M(N)) and
should therefore be superior to divide and conquer, but it's believed
this would only be for large to very large N.


File: gmp.info,  Node: Exact Division,  Next: Exact Remainder,  Prev: Divide and Conquer Division,  Up: Division Algorithms

Exact Division
--------------

   A so-called exact division is when the dividend is known to be an
exact multiple of the divisor.  Jebelean's exact division algorithm
uses this knowledge to make some significant optimizations (*note
References::).

   The idea can be illustrated in decimal for example with 368154
divided by 543.  Because the low digit of the dividend is 4, the low
digit of the quotient must be 8.  This is arrived at from 4*7 mod 10,
using the fact that 7 is the modular inverse of 3 (the low digit of the
divisor), since 3*7 == 1 mod 10.  So 8*543=4344 can be subtracted from
the dividend leaving 363810.  Notice the low digit has become zero.

   The procedure is repeated at the second digit, with the next
quotient digit 7 (7 == 1*7 mod 10), subtracting 7*543=3801, leaving
325800.  And finally at the third digit with quotient digit 6 (8*7 mod
10), subtracting 6*543=3258 leaving 0.  So the quotient is 678.

   Notice however that the multiplies and subtractions don't need to
extend past the low three digits of the dividend, since that's enough
to determine the three quotient digits.  For the last quotient digit no
subtraction is needed at all.  On a 2NxN division like this one, only
about half the work of a normal basecase division is necessary.

   For an NxM exact division producing Q=N-M quotient limbs, the saving
over a normal basecase division is in two parts.  Firstly, each of the
Q quotient limbs needs only one multiply, not a 2x1 divide and
multiply.  Secondly, the crossproducts are reduced when Q>M to
Q*M-M*(M+1)/2, or when Q<=M to Q*(Q-1)/2.  Notice the savings are
complementary.  If Q is big then many divisions are saved, or if Q is
small then the crossproducts reduce to a small number.

   The modular inverse used is calculated efficiently by
`modlimb_invert' in `gmp-impl.h'.  This does four multiplies for a
32-bit limb, or six for a 64-bit limb.  `tune/modlinv.c' has some
alternate implementations that might suit processors better at bit
twiddling than multiplying.

   The sub-quadratic exact division described by Jebelean in "Exact
Division with Karatsuba Complexity" is not currently implemented.  It
uses a rearrangement similar to the divide and conquer for normal
division (*note Divide and Conquer Division::), but operating from low
to high.  A further possibility not currently implemented is
"Bidirectional Exact Integer Division" by Krandick and Jebelean, which
forms quotient limbs from both the high and low ends of the dividend,
and can halve once more the number of crossproducts needed in a 2NxN
division.

   A special case exact division by 3 exists in `mpn_divexact_by3',
supporting Toom-3 multiplication and `mpq' canonicalizations.  It forms
quotient digits with a multiply by the modular inverse of 3 (which is
`0xAA..AAB') and uses two comparisons to determine a borrow for the next
limb.  The multiplications don't need to be on the dependent chain, as
long as the effect of the borrows is applied.  Only a few optimized
assembler implementations currently exist.


File: gmp.info,  Node: Exact Remainder,  Next: Small Quotient Division,  Prev: Exact Division,  Up: Division Algorithms

Exact Remainder
---------------

   If the exact division algorithm is done with a full subtraction at
each stage and the dividend isn't a multiple of the divisor, then low
zero limbs are produced but with a remainder in the high limbs.  For
dividend a, divisor d, quotient q, and b = 2^mp_bits_per_limb, this
remainder r is of the form

     a = q*d + r*b^n

   n represents the number of zero limbs produced by the subtractions,
that being the number of limbs produced for q.  r will be in the range
0<=r<d and can be viewed as a remainder, but one shifted up by a factor
of b^n.

   Carrying out full subtractions at each stage means the same number
of cross products must be done as in a normal division, but there are
still some single limb divisions saved.  When d is a single limb some
simplifications arise, providing good speedups on a number of
processors.

   `mpn_bdivmod', `mpn_divexact_by3', `mpn_modexact_1_odd' and the
`redc' function in `mpz_powm' differ subtly in how they return r,
leading to some negations in the above formula, but all are essentially
the same.

   Clearly r is zero when a is a multiple of d, and this leads to
divisibility or congruence tests which are potentially more efficient
than a normal division.

   The factor of b^n on r can be ignored in a GCD when d is odd, hence
the use of `mpn_bdivmod' in `mpn_gcd', and the use of
`mpn_modexact_1_odd' by `mpn_gcd_1' and `mpz_kronecker_ui' etc (*note
Greatest Common Divisor Algorithms::).

   Montgomery's REDC method for modular multiplications uses operands
of the form x*b^-n and y*b^-n and on calculating (x*b^-n)*(y*b^-n)
uses the factor of b^n in the exact remainder to reach a product in the
same form (x*y)*b^-n (*note Modular Powering Algorithm::).

   Notice that r generally gives no useful information about the
ordinary remainder a mod d, since b^n mod d could be anything.  If
however b^n == 1 mod d, then r is the negative of the ordinary
remainder.  This occurs whenever d is a factor of b^n-1, as for example
with 3 in `mpn_divexact_by3'.  Other such factors include 5, 17 and
257, but no particular use has been found for this.


File: gmp.info,  Node: Small Quotient Division,  Prev: Exact Remainder,  Up: Division Algorithms

Small Quotient Division
-----------------------

   An NxM division where the number of quotient limbs Q=N-M is small
can be optimized somewhat.

   An ordinary basecase division normalizes the divisor by shifting it
to make the high bit set, shifting the dividend accordingly, and
shifting the remainder back down at the end of the calculation.  This
is wasteful if only a few quotient limbs are to be formed.  Instead a
division of just the top 2*Q limbs of the dividend by the top Q limbs
of the divisor can be used to form a trial quotient.  This requires
only those limbs normalized, not the whole of the divisor and dividend.

   A multiply and subtract then applies the trial quotient to the M-Q
unused limbs of the divisor and N-Q dividend limbs (which includes Q
limbs remaining from the trial quotient division).  The starting trial
quotient can be 1 or 2 too big, but all cases of 2 too big and most
cases of 1 too big are detected by first comparing the most significant
limbs that will arise from the subtraction.  An addback is done if the
quotient still turns out to be 1 too big.

   This whole procedure is essentially the same as one step of the
basecase algorithm done in a Q limb base, though with the trial
quotient test done only with the high limbs, not an entire Q limb
"digit" product.  The correctness of this weaker test can be
established by following the argument of Knuth section 4.3.1 exercise
20, but with the v2*q>b*r+u2 condition appropriately relaxed.


File: gmp.info,  Node: Greatest Common Divisor Algorithms,  Next: Powering Algorithms,  Prev: Division Algorithms,  Up: Algorithms

Greatest Common Divisor
=======================
                    458:
                    459: * Menu:
                    460:
                    461: * Binary GCD::
                    462: * Accelerated GCD::
                    463: * Extended GCD::
                    464: * Jacobi Symbol::
                    465:
                    466: 
                    467: File: gmp.info,  Node: Binary GCD,  Next: Accelerated GCD,  Prev: Greatest Common Divisor Algorithms,  Up: Greatest Common Divisor Algorithms
                    468:
                    469: Binary GCD
                    470: ----------
                    471:
                    472:    At small sizes GMP uses an O(N^2) binary style GCD.  This is
                    473: described in many textbooks, for example Knuth section 4.5.2 algorithm
                    474: B.  It simply consists of successively reducing operands a and b using
                    475: gcd(a,b) = gcd(min(a,b),abs(a-b)), and also that if a and b are first
                    476: made odd then abs(a-b) is even and factors of two can be discarded.
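   As an illustration, the reduction above can be sketched on single
machine words (GMP's real code works on multi-limb operands;
`gcd_binary' here is a hypothetical name):

```c
/* Binary GCD on machine words: a word-sized sketch of the reduction
   described above, not GMP's limb-array implementation.  */
unsigned long
gcd_binary (unsigned long a, unsigned long b)
{
  int twos = 0;

  if (a == 0) return b;
  if (b == 0) return a;

  /* strip common factors of two, to be restored at the end */
  while (((a | b) & 1) == 0)
    { a >>= 1; b >>= 1; twos++; }

  /* make both operands odd */
  while ((a & 1) == 0) a >>= 1;
  while ((b & 1) == 0) b >>= 1;

  /* gcd(a,b) = gcd(min(a,b),abs(a-b)); abs(a-b) is even, so its
     factors of two are discarded immediately */
  while (a != b)
    {
      if (a > b)
        { a -= b; do a >>= 1; while ((a & 1) == 0); }
      else
        { b -= a; do b >>= 1; while ((b & 1) == 0); }
    }
  return a << twos;
}
```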
                    477:
                    478:    Variants like letting a-b become negative and doing a different next
                    479: step are of interest only as far as they suit particular CPUs, since on
                    480: small operands it's machine dependent factors that determine
                    481: performance.
                    482:
                    483:    The Euclidean GCD algorithm, as per Knuth algorithms E and A,
                    484: reduces using a mod b but this has so far been found to be slower
                    485: everywhere.  One reason the binary method does well is that the implied
                    486: quotient at each step is usually small, so often only one or two
                    487: subtractions are needed to get the same effect as a division.
                    488: Quotients 1, 2 and 3 for example occur 67.7% of the time, see Knuth
                    489: section 4.5.3 Theorem E.
                    490:
                    491:    When the implied quotient is large, meaning b is much smaller than
                    492: a, then a division is worthwhile.  This is the basis for the initial a
                    493: mod b reductions in `mpn_gcd' and `mpn_gcd_1' (the latter for both Nx1
                    494: and 1x1 cases).  But after that initial reduction, big quotients occur
                    495: too rarely to make it worth checking for them.
                    496:
                    497: 
                    498: File: gmp.info,  Node: Accelerated GCD,  Next: Extended GCD,  Prev: Binary GCD,  Up: Greatest Common Divisor Algorithms
                    499:
                    500: Accelerated GCD
                    501: ---------------
                    502:
                    503:    For sizes above `GCD_ACCEL_THRESHOLD', GMP uses the Accelerated GCD
                    504: algorithm described independently by Weber and Jebelean (the latter as
                    505: the "Generalized Binary" algorithm), *note References::.  This
                    506: algorithm is still O(N^2), but is much faster than the binary algorithm
                    507: since it does fewer multi-precision operations.  It consists of
                    508: alternating the k-ary reduction by Sorenson, and a "dmod" exact
                    509: remainder reduction.
                    510:
                    511:    For operands u and v the k-ary reduction replaces u with n*v-d*u
                    512: where n and d are single limb values chosen to give two trailing zero
                    513: limbs on that value, which can be stripped.  n and d are calculated
                    514: using an algorithm similar to half of a two limb GCD (see `find_a' in
                    515: `mpn/generic/gcd.c').
                    516:
                    517:    When u and v differ in size by more than a certain number of bits, a
                    518: dmod is performed to zero out bits at the low end of the larger.  It
                    519: consists of an exact remainder style division applied to an appropriate
                    520: number of bits (*note Exact Division::, and *note Exact Remainder::).
                    521: This is faster than a k-ary reduction but useful only when the operands
                    522: differ in size.  There's a dmod after each k-ary reduction, and if the
                    523: dmod leaves the operands still differing in size then it's repeated.
                    524:
                    525:    The k-ary reduction step can introduce spurious factors into the GCD
                    526: calculated, and these are eliminated at the end by taking GCDs with the
                    527: original inputs gcd(u,gcd(v,g)) using the binary algorithm.  Since g is
                    528: almost always small this takes very little time.
                    529:
                    530:    At small sizes the algorithm needs a good implementation of
                    531: `find_a'.  At larger sizes it's dominated by `mpn_addmul_1' applying n
                    532: and d.
                    533:
                    534: 
                    535: File: gmp.info,  Node: Extended GCD,  Next: Jacobi Symbol,  Prev: Accelerated GCD,  Up: Greatest Common Divisor Algorithms
                    536:
                    537: Extended GCD
                    538: ------------
                    539:
                    540:    The extended GCD calculates gcd(a,b) and also cofactors x and y
                    541: satisfying a*x+b*y=gcd(a,b).  Lehmer's multi-step improvement of the
                    542: extended Euclidean algorithm is used.  See Knuth section 4.5.2
                    543: algorithm L, and `mpn/generic/gcdext.c'.  This is an O(N^2) algorithm.
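   For illustration, the plain one-quotient-per-step extended Euclidean
recurrence on machine words looks as follows (`gcd_ext' is a
hypothetical name; GMP's `mpn/generic/gcdext.c' applies Lehmer's
multi-step improvement on limb arrays):

```c
/* One-step extended Euclid on words: maintains x,y so that
   a0*x + b0*y equals the current remainder, hence on termination
   a0*x + b0*y = gcd(a0,b0).  A sketch only, not GMP's code.  */
long
gcd_ext (long a, long b, long *x, long *y)
{
  long x0 = 1, y0 = 0;  /* cofactors of the current a */
  long x1 = 0, y1 = 1;  /* cofactors of the current b */

  while (b != 0)
    {
      long q = a / b, t;
      t = a - q * b;   a = b;   b = t;
      t = x0 - q * x1; x0 = x1; x1 = t;
      t = y0 - q * y1; y0 = y1; y1 = t;
    }
  *x = x0;
  *y = y0;
  return a;
}
```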
                    544:
                    545:    The multipliers at each step are found using single limb
                    546: calculations for sizes up to `GCDEXT_THRESHOLD', or double limb
                    547: calculations above that.  The single limb code is faster but doesn't
                    548: produce full-limb multipliers, hence not making full use of the
                    549: `mpn_addmul_1' calls.
                    550:
                    551:    When a CPU has a data-dependent multiplier, meaning one which is
                    552: faster on operands with fewer bits, the extra work in the double-limb
                    553: calculation might only save some looping overheads, leading to a large
                    554: `GCDEXT_THRESHOLD'.
                    555:
                    556:    Currently the single limb calculation doesn't optimize for the small
                    557: quotients that often occur, and this can lead to unusually low values of
                    558: `GCDEXT_THRESHOLD', depending on the CPU.
                    559:
                    560:    An analysis of double-limb calculations can be found in "A
                    561: Double-Digit Lehmer-Euclid Algorithm" by Jebelean (*note References::).
                    562: The code in GMP was developed independently.
                    563:
                    564:    It should be noted that when a double limb calculation is used, it's
                    565: used for the whole of that GCD, it doesn't fall back to single limb
                    566: part way through.  This is because as the algorithm proceeds, the
                    567: inputs a and b are reduced, but the cofactors x and y grow, so the
                    568: multipliers at each step are applied to a roughly constant total number
                    569: of limbs.
                    570:
                    571: 
                    572: File: gmp.info,  Node: Jacobi Symbol,  Prev: Extended GCD,  Up: Greatest Common Divisor Algorithms
                    573:
                    574: Jacobi Symbol
                    575: -------------
                    576:
                    577:    `mpz_jacobi' and `mpz_kronecker' are currently implemented with a
                    578: simple binary algorithm similar to that described for the GCDs (*note
                    579: Binary GCD::).  They're not very fast when both inputs are large.
                    580: Lehmer's multi-step improvement or a binary based multi-step algorithm
                    581: is likely to be better.
                    582:
                    583:    When one operand fits a single limb, and that includes
                    584: `mpz_kronecker_ui' and friends, an initial reduction is done with
                    585: either `mpn_mod_1' or `mpn_modexact_1_odd', followed by the binary
                    586: algorithm on a single limb.  The binary algorithm is well suited to a
                    587: single limb, and the whole calculation in this case is quite efficient.
                    588:
                    589:    In all the routines sign changes for the result are accumulated
                    590: using some bit twiddling, avoiding table lookups or conditional jumps.
                    591:
                    592: 
                    593: File: gmp.info,  Node: Powering Algorithms,  Next: Root Extraction Algorithms,  Prev: Greatest Common Divisor Algorithms,  Up: Algorithms
                    594:
                    595: Powering Algorithms
                    596: ===================
                    597:
                    598: * Menu:
                    599:
                    600: * Normal Powering Algorithm::
                    601: * Modular Powering Algorithm::
                    602:
                    603: 
                    604: File: gmp.info,  Node: Normal Powering Algorithm,  Next: Modular Powering Algorithm,  Prev: Powering Algorithms,  Up: Powering Algorithms
                    605:
                    606: Normal Powering
                    607: ---------------
                    608:
                    609:    Normal `mpz' or `mpf' powering uses a simple binary algorithm,
                    610: successively squaring and then multiplying by the base when a 1 bit is
                    611: seen in the exponent, as per Knuth section 4.6.3.  The "left to right"
                    612: variant described there is used rather than algorithm A, since it's
                    613: just as easy and can be done with somewhat less temporary memory.
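   The left to right scan can be sketched on machine words as follows
(`pow_lr' is a hypothetical name, and a word result overflows for all
but small powers, unlike the `mpz' routines):

```c
/* Left to right binary powering as per Knuth section 4.6.3: square
   for every exponent bit, and multiply in the base on a 1 bit.
   Word-sized sketch only.  */
unsigned long
pow_lr (unsigned long base, unsigned long exp)
{
  unsigned long r = 1;
  int i = 0;

  if (exp == 0)
    return 1;

  /* find the position of the highest 1 bit */
  while ((exp >> i) > 1)
    i++;

  for (; i >= 0; i--)
    {
      r = r * r;                 /* square for every bit */
      if ((exp >> i) & 1)
        r = r * base;            /* multiply on a 1 bit */
    }
  return r;
}
```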
                    614:
                    615: 
                    616: File: gmp.info,  Node: Modular Powering Algorithm,  Prev: Normal Powering Algorithm,  Up: Powering Algorithms
                    617:
                    618: Modular Powering
                    619: ----------------
                    620:
                    621:    Modular powering is implemented using a 2^k-ary sliding window
                    622: algorithm, as per "Handbook of Applied Cryptography" algorithm 14.85
                    623: (*note References::).  k is chosen according to the size of the
                    624: exponent.  Larger exponents use larger values of k, the choice being
                    625: made to minimize the average number of multiplications that must
                    626: supplement the squaring.
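   A fixed k=2 window shows the idea (a sketch only: GMP uses a sliding
window with k chosen from the exponent size, and REDC or division for
the reductions; `powm_2ary' is a hypothetical name):

```c
/* 2^k-ary modular powering with a fixed k = 2 window: precompute the
   small powers of the base, then per 2-bit exponent digit do two
   squarings and one table multiply.  Moduli must be below 2^32 so
   the intermediate products fit in 64 bits.  */
unsigned long long
powm_2ary (unsigned long long b, unsigned long long e, unsigned long long m)
{
  unsigned long long tab[4], r = 1 % m;
  int i = 0;

  b %= m;
  tab[0] = 1 % m;
  tab[1] = b;
  tab[2] = b * b % m;
  tab[3] = tab[2] * b % m;

  /* find the top 2-bit digit of the exponent */
  while (i < 64 && (e >> i) != 0)
    i += 2;

  for (i -= 2; i >= 0; i -= 2)
    {
      r = r * r % m;                  /* two squarings per digit */
      r = r * r % m;
      r = r * tab[(e >> i) & 3] % m;  /* multiply by b^digit */
    }
  return r;
}
```

A sliding window additionally skips runs of zero bits and stores only
the odd powers, which is what makes larger k worthwhile.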
                    627:
                    628:    The modular multiplies and squares use either a simple division or
                    629: the REDC method by Montgomery (*note References::).  REDC is a little
                    630: faster, essentially saving N single limb divisions in a fashion similar
                    631: to an exact remainder (*note Exact Remainder::).  The current REDC has
                    632: some limitations.  It's only O(N^2) so above `POWM_THRESHOLD' division
                    633: becomes faster and is used.  It doesn't attempt to detect small bases,
                    634: but rather always uses a REDC form, which is usually a full size
                    635: operand.  And lastly it's only applied to odd moduli.
                    636:
                    637: 
                    638: File: gmp.info,  Node: Root Extraction Algorithms,  Next: Radix Conversion Algorithms,  Prev: Powering Algorithms,  Up: Algorithms
                    639:
                    640: Root Extraction Algorithms
                    641: ==========================
                    642:
                    643: * Menu:
                    644:
                    645: * Square Root Algorithm::
                    646: * Nth Root Algorithm::
                    647: * Perfect Square Algorithm::
                    648: * Perfect Power Algorithm::
                    649:
                    650: 
                    651: File: gmp.info,  Node: Square Root Algorithm,  Next: Nth Root Algorithm,  Prev: Root Extraction Algorithms,  Up: Root Extraction Algorithms
                    652:
                    653: Square Root
                    654: -----------
                    655:
                    656:    Square roots are taken using the "Karatsuba Square Root" algorithm
                    657: by Paul Zimmermann (*note References::).  This is expressed in a divide
                    658: and conquer form, but as noted in the paper it can also be viewed as a
                    659: discrete variant of Newton's method.
                    660:
                    661:    In the Karatsuba multiplication range this is an O(1.5*M(N/2))
                    662: algorithm, where M(n) is the time to multiply two numbers of n limbs.
                    663: In the FFT multiplication range this grows to a bound of O(6*M(N/2)).
                    664: In practice a factor of about 1.5 to 1.8 is found in the Karatsuba and
                    665: Toom-3 ranges, growing to 2 or 3 in the FFT range.
                    666:
                    667:    The algorithm does all its calculations in integers and the resulting
                    668: `mpn_sqrtrem' is used for both `mpz_sqrt' and `mpf_sqrt'.  The extended
                    669: precision given by `mpf_sqrt_ui' is obtained by padding with zero limbs.
                    670:
                    671: 
                    672: File: gmp.info,  Node: Nth Root Algorithm,  Next: Perfect Square Algorithm,  Prev: Square Root Algorithm,  Up: Root Extraction Algorithms
                    673:
                    674: Nth Root
                    675: --------
                    676:
                    677:    Integer Nth roots are taken using Newton's method with the following
                    678: iteration, where A is the input and n is the root to be taken.
                    679:
                    680:               1         A
                    681:      a[i+1] = - * ( --------- + (n-1)*a[i] )
                    682:               n     a[i]^(n-1)
                    683:
                    684:    The initial approximation a[1] is generated bitwise by successively
                    685: powering a trial root with or without new 1 bits, aiming to be just
                    686: above the true root.  The iteration converges quadratically when
                    687: started from a good approximation.  When n is large more initial bits
                    688: are needed to get good convergence.  The current implementation is not
                    689: particularly well optimized.
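   A word-sized sketch of this iteration (hypothetical `root_n'; the
intermediate a^(n-1) overflows for large operands, where GMP's `mpn'
code would not):

```c
/* Integer nth root by the Newton iteration above, started from a
   power of 2 known to be >= the true root, so the floor divisions
   converge downwards onto floor(A^(1/n)).  Word-sized sketch.  */
unsigned long long
root_n (unsigned long long A, unsigned n)
{
  unsigned long long a, next, p;
  unsigned i, bits = 0;

  if (A < 2 || n == 1)
    return A;

  while (bits < 64 && (A >> bits) != 0)
    bits++;

  /* 2^ceil(bits/n) is just above the true root */
  a = 1ULL << ((bits + n - 1) / n);

  for (;;)
    {
      for (p = 1, i = 0; i < n - 1; i++)
        p *= a;                           /* a^(n-1) */
      next = ((n - 1) * a + A / p) / n;   /* the iteration above */
      if (next >= a)
        return a;                         /* stopped decreasing */
      a = next;
    }
}
```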
                    690:
                    691: 
                    692: File: gmp.info,  Node: Perfect Square Algorithm,  Next: Perfect Power Algorithm,  Prev: Nth Root Algorithm,  Up: Root Extraction Algorithms
                    693:
                    694: Perfect Square
                    695: --------------
                    696:
                    697:    `mpz_perfect_square_p' is able to quickly exclude most non-squares by
                    698: checking whether the input is a quadratic residue modulo some small
                    699: integers.
                    700:
                    701:    The first test is modulo 256 which means simply examining the least
                    702: significant byte.  Only 44 different values occur as the low byte of a
                    703: square, so 82.8% of non-squares can be immediately excluded.  Similar
                    704: tests modulo primes from 3 to 29 exclude 99.5% of those remaining, or
                    705: if a limb is 64 bits then primes up to 53 are used, excluding 99.99%.
                    706: A single Nx1 remainder using `PP' from `gmp-impl.h' quickly gives all
                    707: these remainders.
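   The mod 256 test amounts to the following (an illustrative sketch,
not GMP's actual table code; `init_tab' and `maybe_square' are
hypothetical names):

```c
/* Mark which of the 256 possible low bytes occur as the low byte of
   a square.  44 bytes get marked, so a single table lookup rejects
   82.8% of non-squares immediately.  */
static unsigned char sq_mod256[256];

static void
init_tab (void)
{
  int i;
  for (i = 0; i < 256; i++)
    sq_mod256[i] = 0;
  for (i = 0; i < 256; i++)
    sq_mod256[(i * i) & 0xff] = 1;   /* (i*i) mod 256 */
}

/* 1 if n might be a square, 0 if it certainly is not */
static int
maybe_square (unsigned long n)
{
  return sq_mod256[n & 0xff];
}
```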
                    708:
                    709:    A square root must still be taken for any value that passes the
                    710: residue tests, to verify it's really a square and not one of the 0.086%
                    711: (or 0.000156% for 64 bits) non-squares that get through.  *Note Square
                    712: Root Algorithm::.
                    713:
                    714: 
                    715: File: gmp.info,  Node: Perfect Power Algorithm,  Prev: Perfect Square Algorithm,  Up: Root Extraction Algorithms
                    716:
                    717: Perfect Power
                    718: -------------
                    719:
                    720:    Detecting perfect powers is required by some factorization
                    721: algorithms.  Currently `mpz_perfect_power_p' is implemented using
                    722: repeated Nth root extractions, though naturally only prime roots need
                    723: to be considered.  (*Note Nth Root Algorithm::.)
                    724:
                    725:    If a prime divisor p with multiplicity e can be found, then only
                    726: roots which are divisors of e need to be considered, much reducing the
                    727: work necessary.  To this end divisibility by a set of small primes is
                    728: checked.
                    729:
                    730: 
                    731: File: gmp.info,  Node: Radix Conversion Algorithms,  Next: Other Algorithms,  Prev: Root Extraction Algorithms,  Up: Algorithms
                    732:
                    733: Radix Conversion
                    734: ================
                    735:
                    736:    Radix conversions are less important than other algorithms.  A
                    737: program dominated by conversions should probably use a different data
                    738: representation.
                    739:
                    740: * Menu:
                    741:
                    742: * Binary to Radix::
                    743: * Radix to Binary::
                    744:
                    745: 
                    746: File: gmp.info,  Node: Binary to Radix,  Next: Radix to Binary,  Prev: Radix Conversion Algorithms,  Up: Radix Conversion Algorithms
                    747:
                    748: Binary to Radix
                    749: ---------------
                    750:
                    751:    Conversions from binary to a power-of-2 radix use a simple and fast
                    752: O(N) bit extraction algorithm.
                    753:
                    754:    Conversions from binary to other radices use one of two algorithms.
                    755: Sizes below `GET_STR_PRECOMPUTE_THRESHOLD' use a basic O(N^2) method.
                    756: Repeated divisions by b^n are made, where b is the radix and n is the
                    757: biggest power that fits in a limb.  But instead of simply using the
                    758: remainder r from such divisions, an extra divide step is done to give a
                    759: fractional limb representing r/b^n.  The digits of r can then be
                    760: extracted using multiplications by b rather than divisions.  Special
                    761: case code is provided for decimal, allowing multiplications by 10 to
                    762: optimize to shifts and adds.
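   On a small scale the fractional trick looks like this: for a
remainder below 10^4, the extra divide is replaced here by one scaled
reciprocal multiply, after which each digit costs only a multiply by
10 (`four_digits' is a hypothetical name; GMP does the equivalent per
limb with b^n):

```c
#include <stdint.h>

/* Digits by multiplication: represent r/10^4 as a 32-bit fraction
   rounded up, then each multiply by 10 pushes the next digit into
   the integer part.  Exact for all 0 <= r < 10^4.  */
static void
four_digits (uint32_t r, char buf[4])
{
  uint64_t x = (uint64_t) r * 429497;   /* 429497 = ceil(2^32 / 10^4) */
  int i;

  for (i = 0; i < 4; i++)
    {
      x *= 10;
      buf[i] = '0' + (char) (x >> 32);  /* integer part is the digit */
      x &= 0xffffffff;                  /* keep only the fraction */
    }
}
```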
                    763:
                    764:    Above `GET_STR_PRECOMPUTE_THRESHOLD' a sub-quadratic algorithm is
                    765: used.  For an input t, powers b^(n*2^i) of the radix are calculated,
                    766: until a power between t and sqrt(t) is reached.  t is then divided by
                    767: that largest power, giving a quotient which is the digits above that
                    768: power, and a remainder which is those below.  These two parts are in
                    769: turn divided by the second highest power, and so on recursively.  When
                    770: a piece has been divided down to less than `GET_STR_DC_THRESHOLD'
                    771: limbs, the basecase algorithm described above is used.
                    772:
                    773:    The advantage of this algorithm is that big divisions can make use
                    774: of the sub-quadratic divide and conquer division (*note Divide and
                    775: Conquer Division::), and big divisions tend to have less overheads than
                    776: lots of separate single limb divisions anyway.  But in any case the
                    777: cost of calculating the powers b^(n*2^i) must first be overcome.
                    778:
                    779:    `GET_STR_PRECOMPUTE_THRESHOLD' and `GET_STR_DC_THRESHOLD' represent
                    780: the same basic thing, the point where it becomes worth doing a big
                    781: division to cut the input in half.  `GET_STR_PRECOMPUTE_THRESHOLD'
                    782: includes the cost of calculating the radix power required, whereas
                    783: `GET_STR_DC_THRESHOLD' assumes that's already available, which is the
                    784: case when recursing.
                    785:
                    786:    Since the base case produces digits from least to most significant
                    787: but they want to be stored from most to least, it's necessary to
                    788: calculate in advance how many digits there will be, or at least be sure
                    789: not to underestimate that.  For GMP the number of input bits is
                    790: multiplied by `chars_per_bit_exactly' from `mp_bases', rounding up.
                    791: The result is either correct or one too big.
                    792:
                    793:    Examining some of the high bits of the input could increase the
                    794: chance of getting the exact number of digits, but an exact result every
                    795: time would not be practical, since in general the difference between
                    796: numbers 100... and 99... is only in the last few bits and the work to
                    797: identify 99...  might well be almost as much as a full conversion.
                    798:
                     799:    `mpf_get_str' doesn't currently use the algorithm described here;
                     800: instead it multiplies or divides by a power of b to move the radix
                     801: point to just above the highest non-zero digit (or at worst one above that
                    802: location), then multiplies by b^n to bring out digits.  This is O(N^2)
                    803: and is certainly not optimal.
                    804:
                    805:    The r/b^n scheme described above for using multiplications to bring
                    806: out digits might be useful for more than a single limb.  Some brief
                    807: experiments with it on the base case when recursing didn't give a
                     808: noticeable improvement, but perhaps that was only due to the
                    809: implementation.  Something similar would work for the sub-quadratic
                    810: divisions too, though there would be the cost of calculating a bigger
                    811: radix power.
                    812:
                    813:    Another possible improvement for the sub-quadratic part would be to
                    814: arrange for radix powers that balanced the sizes of quotient and
                     815: remainder produced, ie. the highest power would be a b^(n*k)
                    816: approximately equal to sqrt(t), not restricted to a 2^i factor.  That
                    817: ought to smooth out a graph of times against sizes, but may or may not
                    818: be a net speedup.
                    819:
                    820: 
                    821: File: gmp.info,  Node: Radix to Binary,  Prev: Binary to Radix,  Up: Radix Conversion Algorithms
                    822:
                    823: Radix to Binary
                    824: ---------------
                    825:
                    826:    Conversions from a power-of-2 radix into binary use a simple and fast
                    827: O(N) bitwise concatenation algorithm.
                    828:
                    829:    Conversions from other radices use one of two algorithms.  Sizes
                    830: below `SET_STR_THRESHOLD' use a basic O(N^2) method.  Groups of n
                    831: digits are converted to limbs, where n is the biggest power of the base
                    832: b which will fit in a limb, then those groups are accumulated into the
                    833: result by multiplying by b^n and adding.  This saves multi-precision
                    834: operations, as per Knuth section 4.4 part E (*note References::).  Some
                    835: special case code is provided for decimal, giving the compiler a chance
                    836: to optimize multiplications by 10.
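   A word-sized sketch of the grouping (hypothetical `str_to_ul'; GMP
accumulates into a limb array, and uses the largest group that fits a
limb, e.g. 9 decimal digits for 32 bits, where 4 here keeps the sketch
in range):

```c
/* Basecase radix to binary: convert groups of digits to a word,
   then fold each group in with one multiply and an add, saving
   multi-precision operations as per Knuth section 4.4 part E.
   Word-sized sketch for decimal input.  */
unsigned long
str_to_ul (const char *s)
{
  unsigned long r = 0;

  while (*s != '\0')
    {
      unsigned long group = 0, pow = 1;
      int i;
      for (i = 0; i < 4 && *s != '\0'; i++, s++)
        {
          group = group * 10 + (unsigned long) (*s - '0');
          pow *= 10;
        }
      r = r * pow + group;   /* one multiply and add per group */
    }
  return r;
}
```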
                    837:
                    838:    Above `SET_STR_THRESHOLD' a sub-quadratic algorithm is used.  First
                    839: groups of n digits are converted into limbs.  Then adjacent limbs are
                    840: combined into limb pairs with x*b^n+y, where x and y are the limbs.
                    841: Adjacent limb pairs are combined into quads similarly with x*b^(2n)+y.
                    842: This continues until a single block remains, that being the result.
                    843:
                    844:    The advantage of this method is that the multiplications for each x
                    845: are big blocks, allowing Karatsuba and higher algorithms to be used.
                    846: But the cost of calculating the powers b^(n*2^i) must be overcome.
                    847: `SET_STR_THRESHOLD' usually ends up quite big, around 5000 digits, and
                    848: on some processors much bigger still.
                    849:
                    850:    `SET_STR_THRESHOLD' is based on the input digits (and tuned for
                    851: decimal), though it might be better based on a limb count, so as to be
                    852: independent of the base.  But that sort of count isn't used by the base
                    853: case and so would need some sort of initial calculation or estimate.
                    854:
                    855:    The main reason `SET_STR_THRESHOLD' is so much bigger than the
                    856: corresponding `GET_STR_PRECOMPUTE_THRESHOLD' is that `mpn_mul_1' is
                    857: much faster than `mpn_divrem_1' (often by a factor of 10, or more).
                    858:
                    859: 
                    860: File: gmp.info,  Node: Other Algorithms,  Next: Assembler Coding,  Prev: Radix Conversion Algorithms,  Up: Algorithms
                    861:
                    862: Other Algorithms
                    863: ================
                    864:
                    865: * Menu:
                    866:
                    867: * Factorial Algorithm::
                    868: * Binomial Coefficients Algorithm::
                    869: * Fibonacci Numbers Algorithm::
                    870: * Lucas Numbers Algorithm::
                    871:
                    872: 
                    873: File: gmp.info,  Node: Factorial Algorithm,  Next: Binomial Coefficients Algorithm,  Prev: Other Algorithms,  Up: Other Algorithms
                    874:
                    875: Factorial
                    876: ---------
                    877:
                    878:    Factorials n! are calculated by a simple product from 1 to n, but
                    879: arranged into certain sub-products.
                    880:
                    881:    First as many factors as fit in a limb are accumulated, then two of
                    882: those multiplied to give a 2-limb product.  When two 2-limb products
                    883: are ready they're multiplied to a 4-limb product, and when two 4-limbs
                    884: are ready they're multiplied to an 8-limb product, etc.  A stack of
                    885: outstanding products is built up, with two of the same size multiplied
                    886: together when ready.
                    887:
                    888:    Arranging for multiplications to have operands the same (or nearly
                    889: the same) size means the Karatsuba and higher multiplication algorithms
                    890: can be used.  And even on sizes below the Karatsuba threshold an NxN
                    891: multiply will give a basecase multiply more to work on.
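   The balancing effect, though not the actual stack arrangement used,
can be seen in a divide and conquer product over the range
(word-sized, so only small factorials fit; `range_prod' is a
hypothetical name):

```c
/* Product of lo..hi by splitting the range in half: each multiply
   then has operands of roughly equal size, which is the property the
   sub-product scheme described above is arranged to provide.  */
unsigned long long
range_prod (unsigned lo, unsigned hi)
{
  unsigned mid;

  if (lo == hi)
    return lo;
  mid = lo + (hi - lo) / 2;
  return range_prod (lo, mid) * range_prod (mid + 1, hi);
}

/* n! is then range_prod (1, n) */
```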
                    892:
                    893:    An obvious improvement not currently implemented would be to strip
                    894: factors of 2 from the products and apply them at the end with a bit
                    895: shift.  Another possibility would be to determine the prime
                    896: factorization of the result (which can be done easily), and use a
                    897: powering method, at each stage squaring then multiplying in those
                    898: primes with a 1 in their exponent at that point.  The advantage would
                    899: be some multiplies turned into squares.
                    900:
                    901: 
                    902: File: gmp.info,  Node: Binomial Coefficients Algorithm,  Next: Fibonacci Numbers Algorithm,  Prev: Factorial Algorithm,  Up: Other Algorithms
                    903:
                    904: Binomial Coefficients
                    905: ---------------------
                    906:
                    907:    Binomial coefficients C(n,k) are calculated by first arranging k <=
                    908: n/2 using C(n,k) = C(n,n-k) if necessary, and then evaluating the
                    909: following product simply from i=2 to i=k.
                    910:
                    911:                            k  (n-k+i)
                    912:      C(n,k) =  (n-k+1) * prod -------
                    913:                           i=2    i
                    914:
                    915:    It's easy to show that each denominator i will divide the product so
                    916: far, so the exact division algorithm is used (*note Exact Division::).
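A word-sized sketch of this product (hypothetical code, not `mpz_bin_uiui'; it overflows for large n, where GMP instead uses multi-precision operands and the exact division algorithm):

```c
#include <stdint.h>

/* Illustrative sketch of C(n,k) by the product formula.  Each division
   is exact because i consecutive integers always contain a multiple of
   every j <= i, so i! divides the accumulated numerator.  */
static uint64_t binomial(uint64_t n, uint64_t k)
{
    if (k > n)
        return 0;
    if (k > n - k)              /* arrange k <= n/2 via C(n,k) = C(n,n-k) */
        k = n - k;
    if (k == 0)
        return 1;
    uint64_t c = n - k + 1;     /* the leading factor, i.e. the i=1 term */
    for (uint64_t i = 2; i <= k; i++) {
        c *= n - k + i;         /* numerator */
        c /= i;                 /* exact: i divides the product so far */
    }
    return c;
}
```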
                    917:
                    918:    The numerators n-k+i and denominators i are first accumulated into
                    919: as many as fit in a limb, to save multi-precision operations, though for
                    920: `mpz_bin_ui' this applies only to the divisors, since n is an `mpz_t'
                    921: and n-k+i in general won't fit in a limb at all.
                    922:
                    923:    An obvious improvement would be to strip factors of 2 from each
                    924: multiplier and divisor and count them separately, to be applied with a
                    925: bit shift at the end.  Factors of 3 and perhaps 5 could even be handled
                    926: similarly.  Another possibility, if n is not too big, would be to
                    927: determine the prime factorization of the result based on the factorials
                    928: involved, and power up those primes appropriately.  This would help
                    929: most when k is near n/2.
                    930:
                    931: 
                    932: File: gmp.info,  Node: Fibonacci Numbers Algorithm,  Next: Lucas Numbers Algorithm,  Prev: Binomial Coefficients Algorithm,  Up: Other Algorithms
                    933:
                    934: Fibonacci Numbers
                    935: -----------------
                    936:
                    937:    The Fibonacci functions `mpz_fib_ui' and `mpz_fib2_ui' are designed
                    938: for calculating isolated F[n] or F[n],F[n-1] values efficiently.
                    939:
                    940:    For small n, a table of single limb values in `__gmp_fib_table' is
                    941: used.  On a 32-bit limb this goes up to F[47], or on a 64-bit limb up
                    942: to F[93].  For convenience the table starts at F[-1].
                    943:
                    944:    Beyond the table, values are generated with a binary powering
                    945: algorithm, calculating a pair F[n] and F[n-1] working from high to low
                    946: across the bits of n.  The formulas used are
                    947:
                    948:      F[2k+1] = 4*F[k]^2 - F[k-1]^2 + 2*(-1)^k
                    949:      F[2k-1] =   F[k]^2 + F[k-1]^2
                    950:
                    951:      F[2k] = F[2k+1] - F[2k-1]
                    952:
                    953:    At each step, k is the value of the high bits of n processed so far.  If the next bit of n is 0
                    954: then F[2k],F[2k-1] is used, or if it's a 1 then F[2k+1],F[2k] is used,
                    955: and the process repeated until all bits of n are incorporated.  Notice
                    956: these formulas require just two squares per bit of n.
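A single-word sketch of the powering loop follows (illustrative only, not `mpn_fib2_ui': it assumes a GCC-style `__builtin_clz' and 64-bit words, and overflows somewhere past F[90]):

```c
#include <stdint.h>

/* Illustrative sketch: compute F[n] and F[n-1] by the doubling
   formulas, working across the bits of n from high to low.  */
static void fib2(unsigned n, uint64_t *fn, uint64_t *fn1)
{
    if (n == 0) { *fn = 0; *fn1 = 1; return; }  /* F[0]=0, F[-1]=1 */
    uint64_t fk = 1, fk1 = 0;   /* F[1], F[0]; k starts as the top bit */
    int highbit = 31 - __builtin_clz(n);
    for (int i = highbit - 1; i >= 0; i--) {
        unsigned k = n >> (i + 1);          /* bits of n processed so far */
        int64_t sign = (k & 1) ? -1 : 1;    /* (-1)^k */
        uint64_t f2k1 = 4*fk*fk - fk1*fk1 + 2*sign;  /* F[2k+1] */
        uint64_t f2km1 = fk*fk + fk1*fk1;            /* F[2k-1] */
        uint64_t f2k = f2k1 - f2km1;                 /* F[2k] */
        if ((n >> i) & 1) { fk = f2k1; fk1 = f2k; }  /* next bit 1 */
        else              { fk = f2k;  fk1 = f2km1; }/* next bit 0 */
    }
    *fn = fk; *fn1 = fk1;
}
```

Each loop iteration consumes one bit of n at the cost of the two squares F[k]^2 and F[k-1]^2, matching the operation count claimed above.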
                    957:
                    958:    It'd be possible to handle the first few n above the single limb
                    959: table with simple additions, using the defining Fibonacci recurrence
                    960: F[k+1]=F[k]+F[k-1], but this is not done since it usually turns out to
                    961: be faster for only about 10 or 20 values of n, and including a block of
                    962: code for just those doesn't seem worthwhile.  If they really mattered
                    963: it'd be better to extend the data table.
                    964:
                    965:    Using a table avoids lots of calculations on small numbers, and
                    966: makes small n go fast.  A bigger table would make more small n go fast;
                    967: it's just a question of balancing size against desired speed.  For GMP
                    968: the code is kept compact, with the emphasis primarily on a good
                    969: powering algorithm.
                    970:
                    971:    `mpz_fib2_ui' returns both F[n] and F[n-1], but `mpz_fib_ui' is only
                    972: interested in F[n].  In this case the last step of the algorithm can
                    973: become one multiply instead of two squares.  One of the following two
                    974: formulas is used, according as n is odd or even.
                    975:
                    976:      F[2k]   = F[k]*(F[k]+2F[k-1])
                    977:
                    978:      F[2k+1] = (2F[k]+F[k-1])*(2F[k]-F[k-1]) + 2*(-1)^k
                    979:
                    980:    F[2k+1] here is the same as above, just rearranged to be a multiply.
                    981: For interest, the 2*(-1)^k term both here and above can be applied
                    982: just to the low limb of the calculation, without a carry or borrow into
                    983: further limbs, which saves some code size.  See comments with
                    984: `mpz_fib_ui' and the internal `mpn_fib2_ui' for how this is done.
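These final-step identities are easy to check with small word-sized helpers (hypothetical names, plain integers rather than mpn operands; the `k_odd' argument supplies the (-1)^k term):

```c
#include <stdint.h>

/* Illustrative: the last doubling as a single multiply, given F[k]
   and F[k-1].  */
static uint64_t fib_double(uint64_t fk, uint64_t fk1)        /* F[2k] */
{
    return fk * (fk + 2 * fk1);
}

static uint64_t fib_double_plus1(uint64_t fk, uint64_t fk1,  /* F[2k+1] */
                                 int k_odd)
{
    return (2*fk + fk1) * (2*fk - fk1) + (k_odd ? -2 : 2);
}
```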
                    985:
                    986: 
                    987: File: gmp.info,  Node: Lucas Numbers Algorithm,  Prev: Fibonacci Numbers Algorithm,  Up: Other Algorithms
                    988:
                    989: Lucas Numbers
                    990: -------------
                    991:
                    992:    `mpz_lucnum2_ui' derives a pair of Lucas numbers from a pair of
                    993: Fibonacci numbers with the following simple formulas.
                    994:
                    995:      L[k]   =   F[k] + 2*F[k-1]
                    996:      L[k-1] = 2*F[k] -   F[k-1]
                    997:
                    998:    `mpz_lucnum_ui' is only interested in L[n], and some work can be
                    999: saved.  Trailing zero bits on n can be handled with a single square
                   1000: each.
                   1001:
                   1002:      L[2k] = L[k]^2 - 2*(-1)^k
                   1003:
                   1004:    And the lowest 1 bit can be handled with one multiply of a pair of
                   1005: Fibonacci numbers, similar to what `mpz_fib_ui' does.
                   1006:
                   1007:      L[2k+1] = 5*F[k-1]*(2*F[k]+F[k-1]) - 4*(-1)^k
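The same identities in word-sized C (hypothetical helper names, illustrative only; the `k_odd' argument supplies the (-1)^k term):

```c
#include <stdint.h>

/* Illustrative: derive Lucas numbers from Fibonacci numbers, and
   the doubling/final-step identities quoted above.  */
static uint64_t lucas_from_fib(uint64_t fk, uint64_t fk1)    /* L[k] */
{
    return fk + 2 * fk1;
}

static uint64_t lucas_double(uint64_t lk, int k_odd)         /* L[2k] */
{
    return lk * lk - (k_odd ? -2 : 2);
}

static uint64_t lucas_double_plus1(uint64_t fk, uint64_t fk1,/* L[2k+1] */
                                   int k_odd)
{
    return 5 * fk1 * (2*fk + fk1) - (k_odd ? -4 : 4);
}
```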
                   1008:
                   1009: 
                   1010: File: gmp.info,  Node: Assembler Coding,  Prev: Other Algorithms,  Up: Algorithms
                   1011:
                   1012: Assembler Coding
                   1013: ================
                   1014:
                   1015:    The assembler subroutines in GMP are the most significant source of
                   1016: speed at small to moderate sizes.  At larger sizes algorithm selection
                   1017: becomes more important, but of course speedups in low level routines
                   1018: will still speed up everything proportionally.
                   1019:
                   1020:    Carry handling and widening multiplies that are important for GMP
                   1021: can't be easily expressed in C.  GCC `asm' blocks help a lot and are
                   1022: provided in `longlong.h', but hand coding low level routines invariably
                   1023: offers a speedup over generic C by a factor of anything from 2 to 10.
                   1024:
                   1025: * Menu:
                   1026:
                   1027: * Assembler Code Organisation::
                   1028: * Assembler Basics::
                   1029: * Assembler Carry Propagation::
                   1030: * Assembler Cache Handling::
                   1031: * Assembler Floating Point::
                   1032: * Assembler SIMD Instructions::
                   1033: * Assembler Software Pipelining::
                   1034: * Assembler Loop Unrolling::
                   1035:
                   1036: 
                   1037: File: gmp.info,  Node: Assembler Code Organisation,  Next: Assembler Basics,  Prev: Assembler Coding,  Up: Assembler Coding
                   1038:
                   1039: Code Organisation
                   1040: -----------------
                   1041:
                   1042:    The various `mpn' subdirectories contain machine-dependent code,
                   1043: written in C or assembler.  The `mpn/generic' subdirectory contains
                   1044: default code, used when there's no machine-specific version of a
                   1045: particular file.
                   1046:
                   1047:    Each `mpn' subdirectory is for an ISA family.  Generally 32-bit and
                   1048: 64-bit variants in a family cannot share code and will have separate
                   1049: directories.  Within a family further subdirectories may exist for CPU
                   1050: variants.
                   1051:
                   1052: 
                   1053: File: gmp.info,  Node: Assembler Basics,  Next: Assembler Carry Propagation,  Prev: Assembler Code Organisation,  Up: Assembler Coding
                   1054:
                   1055: Assembler Basics
                   1056: ----------------
                   1057:
                   1058:    `mpn_addmul_1' and `mpn_submul_1' are the most important routines
                   1059: for overall GMP performance.  All multiplications and divisions come
                   1060: down to repeated calls to these.  `mpn_add_n', `mpn_sub_n',
                   1061: `mpn_lshift' and `mpn_rshift' are next most important.
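For reference, a portable C rendition of the `mpn_addmul_1' inner loop might look as follows.  This is a sketch assuming 64-bit limbs and a GCC-style `unsigned __int128' for the widening multiply, not GMP's actual code, which is assembler on most CPUs precisely because C compilers handle this pattern poorly:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative: rp[0..n-1] += up[0..n-1] * v, returning the carry
   limb out the top.  The 128-bit product captures both the low limb
   and the carry into the next position.  */
static uint64_t mpn_addmul_1_sketch(uint64_t *rp, const uint64_t *up,
                                    size_t n, uint64_t v)
{
    uint64_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned __int128 p = (unsigned __int128)up[i] * v + rp[i] + cy;
        rp[i] = (uint64_t)p;        /* low limb of the result */
        cy = (uint64_t)(p >> 64);   /* carry into the next limb */
    }
    return cy;
}
```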
                   1062:
                   1063:    On some CPUs assembler versions of the internal functions
                   1064: `mpn_mul_basecase' and `mpn_sqr_basecase' give significant speedups,
                   1065: mainly through avoiding function call overheads.  They can also
                   1066: potentially make better use of a wide superscalar processor.
                   1067:
                   1068:    The restrictions on overlaps between sources and destinations (*note
                   1069: Low-level Functions::) are designed to facilitate a variety of
                   1070: implementations.  For example, knowing `mpn_add_n' won't have partly
                   1071: overlapping sources and destination means reading can be done far ahead
                   1072: of writing on superscalar processors, and loops can be vectorized on a
                   1073: vector processor, depending on the carry handling.
                   1074:
                   1075: 
                   1076: File: gmp.info,  Node: Assembler Carry Propagation,  Next: Assembler Cache Handling,  Prev: Assembler Basics,  Up: Assembler Coding
                   1077:
                   1078: Carry Propagation
                   1079: -----------------
                   1080:
                   1081:    The problem that presents most challenges in GMP is propagating
                   1082: carries from one limb to the next.  In functions like `mpn_addmul_1' and
                   1083: `mpn_add_n', carries are the only dependencies between limb operations.
                   1084:
                   1085:    On processors with carry flags, a straightforward CISC style `adc' is
                   1086: generally best.  AMD K6 `mpn_addmul_1' however is an example of an
                   1087: unusual set of circumstances where a branch works out better.
                   1088:
                   1089:    On RISC processors generally an add and compare for overflow is
                   1090: used.  This sort of thing can be seen in `mpn/generic/aors_n.c'.  Some
                   1091: carry propagation schemes require 4 instructions, meaning at least 4
                   1092: cycles per limb, but other schemes may use just 1 or 2.  On wide
                   1093: superscalar processors performance may be completely determined by the
                   1094: number of dependent instructions between carry-in and carry-out for
                   1095: each limb.
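The add-and-compare-for-overflow idiom can be sketched in C as follows (an illustrative 64-bit version in the style of `mpn/generic/aors_n.c', which is generic over limb size):

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative: rp[] = up[] + vp[], returning the carry out the top.
   An unsigned add wrapped around exactly when the result is smaller
   than an operand, which is how the carry is detected without flags.  */
static uint64_t mpn_add_n_sketch(uint64_t *rp, const uint64_t *up,
                                 const uint64_t *vp, size_t n)
{
    uint64_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t ul = up[i], vl = vp[i];
        uint64_t sl = ul + vl;        /* sum, may wrap */
        uint64_t c1 = sl < ul;        /* carry from the add */
        uint64_t rl = sl + cy;        /* add the incoming carry */
        uint64_t c2 = rl < sl;        /* carry from that */
        cy = c1 | c2;                 /* at most one of the two occurs */
        rp[i] = rl;
    }
    return cy;
}
```

The chain sl, c1, rl, c2, cy from carry-in to carry-out is the dependent-instruction sequence whose length bounds the per-limb cycle count mentioned above.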
                   1096:
                   1097:    On vector processors good use can be made of the fact that a carry
                   1098: bit only very rarely propagates more than one limb.  When adding a
                   1099: single bit to a limb, there's only a carry out if that limb was
                   1100: `0xFF...FF' which on random data will be only 1 in 2^mp_bits_per_limb.
                    1101: `mpn/cray/add_n.c' is an example of this: it adds all limbs in
                   1102: parallel, adds one set of carry bits in parallel and then only rarely
                   1103: needs to fall through to a loop propagating further carries.
                   1104:
                   1105:    On the x86s, GCC (as of version 2.95.2) doesn't generate
                   1106: particularly good code for the RISC style idioms that are necessary to
                   1107: handle carry bits in C.  Often conditional jumps are generated where
                   1108: `adc' or `sbb' forms would be better.  And so unfortunately almost any
                   1109: loop involving carry bits needs to be coded in assembler for best
                   1110: results.
                   1111:
