     1: This is gmp.info, produced by makeinfo version 4.2 from gmp.texi.
2:
3: This manual describes how to install and use the GNU multiple precision
4: arithmetic library, version 4.1.2.
5:
6: Copyright 1991, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000,
7: 2001, 2002 Free Software Foundation, Inc.
8:
9: Permission is granted to copy, distribute and/or modify this
10: document under the terms of the GNU Free Documentation License, Version
11: 1.1 or any later version published by the Free Software Foundation;
12: with no Invariant Sections, with the Front-Cover Texts being "A GNU
13: Manual", and with the Back-Cover Texts being "You have freedom to copy
14: and modify this GNU Manual, like GNU software". A copy of the license
15: is included in *Note GNU Free Documentation License::.
16: INFO-DIR-SECTION GNU libraries
17: START-INFO-DIR-ENTRY
18: * gmp: (gmp). GNU Multiple Precision Arithmetic Library.
19: END-INFO-DIR-ENTRY
20:
21:
22: File: gmp.info, Node: FFT Multiplication, Next: Other Multiplication, Prev: Toom-Cook 3-Way Multiplication, Up: Multiplication Algorithms
23:
24: FFT Multiplication
25: ------------------
26:
27: At large to very large sizes a Fermat style FFT multiplication is
    28: used, following Schönhage and Strassen (*note References::).
29: Descriptions of FFTs in various forms can be found in many textbooks,
30: for instance Knuth section 4.3.3 part C or Lipson chapter IX. A brief
31: description of the form used in GMP is given here.
32:
33: The multiplication done is x*y mod 2^N+1, for a given N. A full
34: product x*y is obtained by choosing N>=bits(x)+bits(y) and padding x
35: and y with high zero limbs. The modular product is the native form for
36: the algorithm, so padding to get a full product is unavoidable.
37:
38: The algorithm follows a split, evaluate, pointwise multiply,
    39: interpolate and combine sequence similar to that described above for Karatsuba
40: and Toom-3. A k parameter controls the split, with an FFT-k splitting
41: into 2^k pieces of M=N/2^k bits each. N must be a multiple of
42: (2^k)*mp_bits_per_limb so the split falls on limb boundaries, avoiding
43: bit shifts in the split and combine stages.
44:
45: The evaluations, pointwise multiplications, and interpolation, are
46: all done modulo 2^N'+1 where N' is 2M+k+3 rounded up to a multiple of
47: 2^k and of `mp_bits_per_limb'. The results of interpolation will be
48: the following negacyclic convolution of the input pieces, and the
49: choice of N' ensures these sums aren't truncated.
50:
    51:      w[n] = sum of (-1)^b * x[i] * y[j]
    52:             over i+j == b*2^k + n, for b = 0 and 1
57:
58: The points used for the evaluation are g^i for i=0 to 2^k-1 where
59: g=2^(2N'/2^k). g is a 2^k'th root of unity mod 2^N'+1, which produces
60: necessary cancellations at the interpolation stage, and it's also a
    61: power of 2 so the fast Fourier transforms used for the evaluation and
62: interpolation do only shifts, adds and negations.
63:
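Because g and its powers are powers of 2, multiplying by g^i modulo 2^N'+1
needs only a shift and a subtraction.  As an illustration, here is that
primitive on word-sized values (the name and the single-word restriction
are assumptions of this sketch; GMP does the same thing limb by limb):

     #include <stdint.h>

     /* x*2^e mod 2^N+1, for 0 <= e < N, 0 <= x <= 2^N and N <= 62.
        Writing x = hi*2^(N-e) + lo gives x*2^e = hi*2^N + lo*2^e, and
        2^N == -1 mod 2^N+1, so the result is lo*2^e - hi.  */
     static uint64_t mul_2exp_mod (uint64_t x, unsigned e, unsigned N)
     {
       uint64_t m  = ((uint64_t) 1 << N) + 1;
       uint64_t lo = x & (((uint64_t) 1 << (N - e)) - 1);
       uint64_t hi = x >> (N - e);
       uint64_t t  = lo << e;                    /* lo*2^e < 2^N */
       return (t >= hi ? t - hi : t + m - hi);   /* wrap modulo 2^N+1 */
     }

A negation modulo 2^N+1 is similarly just m-x, so the transform butterflies
really do reduce to shifts, adds and negations.
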
64: The pointwise multiplications are done modulo 2^N'+1 and either
65: recurse into a further FFT or use a plain multiplication (Toom-3,
66: Karatsuba or basecase), whichever is optimal at the size N'. The
    67: interpolation is an inverse fast Fourier transform. The resulting set
68: of sums of x[i]*y[j] are added at appropriate offsets to give the final
69: result.
70:
71: Squaring is the same, but x is the only input so it's one transform
72: at the evaluate stage and the pointwise multiplies are squares. The
73: interpolation is the same.
74:
75: For a mod 2^N+1 product, an FFT-k is an O(N^(k/(k-1))) algorithm,
76: the exponent representing 2^k recursed modular multiplies each
77: 1/2^(k-1) the size of the original. Each successive k is an asymptotic
78: improvement, but overheads mean each is only faster at bigger and
79: bigger sizes. In the code, `MUL_FFT_TABLE' and `SQR_FFT_TABLE' are the
80: thresholds where each k is used. Each new k effectively swaps some
81: multiplying for some shifts, adds and overheads.
82:
83: A mod 2^N+1 product can be formed with a normal NxN->2N bit multiply
84: plus a subtraction, so an FFT and Toom-3 etc can be compared directly.
85: A k=4 FFT at O(N^1.333) can be expected to be the first faster than
86: Toom-3 at O(N^1.465). In practice this is what's found, with
87: `MUL_FFT_MODF_THRESHOLD' and `SQR_FFT_MODF_THRESHOLD' being between 300
88: and 1000 limbs, depending on the CPU. So far it's been found that only
89: very large FFTs recurse into pointwise multiplies above these sizes.
90:
91: When an FFT is to give a full product, the change of N to 2N doesn't
92: alter the theoretical complexity for a given k, but for the purposes of
93: considering where an FFT might be first used it can be assumed that the
94: FFT is recursing into a normal multiply and that on that basis it's
95: doing 2^k recursed multiplies each 1/2^(k-2) the size of the inputs,
96: making it O(N^(k/(k-2))). This would mean k=7 at O(N^1.4) would be the
97: first FFT faster than Toom-3. In practice `MUL_FFT_THRESHOLD' and
98: `SQR_FFT_THRESHOLD' have been found to be in the k=8 range, somewhere
99: between 3000 and 10000 limbs.
100:
101: The way N is split into 2^k pieces and then 2M+k+3 is rounded up to
102: a multiple of 2^k and `mp_bits_per_limb' means that when
103: 2^k>=mp_bits_per_limb the effective N is a multiple of 2^(2k-1) bits.
104: The +k+3 means some values of N just under such a multiple will be
105: rounded to the next. The complexity calculations above assume that a
106: favourable size is used, meaning one which isn't padded through
107: rounding, and it's also assumed that the extra +k+3 bits are negligible
108: at typical FFT sizes.
109:
110: The practical effect of the 2^(2k-1) constraint is to introduce a
111: step-effect into measured speeds. For example k=8 will round N up to a
112: multiple of 32768 bits, so for a 32-bit limb there'll be 512 limb
113: groups of sizes for which `mpn_mul_n' runs at the same speed. Or for
114: k=9 groups of 2048 limbs, k=10 groups of 8192 limbs, etc. In practice
115: it's been found each k is used at quite small multiples of its size
116: constraint and so the step effect is quite noticeable in a time versus
117: size graph.
118:
119: The threshold determinations currently measure at the mid-points of
120: size steps, but this is sub-optimal since at the start of a new step it
121: can happen that it's better to go back to the previous k for a while.
122: Something more sophisticated for `MUL_FFT_TABLE' and `SQR_FFT_TABLE'
123: will be needed.
124:
125:
126: File: gmp.info, Node: Other Multiplication, Prev: FFT Multiplication, Up: Multiplication Algorithms
127:
128: Other Multiplication
129: --------------------
130:
131: The 3-way Toom-Cook algorithm described above (*note Toom-Cook 3-Way
132: Multiplication::) generalizes to split into an arbitrary number of
133: pieces, as per Knuth section 4.3.3 algorithm C. This is not currently
134: used, though it's possible a Toom-4 might fit in between Toom-3 and the
135: FFTs. The notes here are merely for interest.
136:
137: In general a split into r+1 pieces is made, and evaluations and
138: pointwise multiplications done at 2*r+1 points. A 4-way split does 7
139: pointwise multiplies, 5-way does 9, etc. Asymptotically an (r+1)-way
140: algorithm is O(N^(log(2*r+1)/log(r+1))). Only the pointwise
141: multiplications count towards big-O complexity, but the time spent in
142: the evaluate and interpolate stages grows with r and has a significant
143: practical impact, with the asymptotic advantage of each r realized only
144: at bigger and bigger sizes. The overheads grow as O(N*r), whereas in
145: an r=2^k FFT they grow only as O(N*log(r)).
146:
147: Knuth algorithm C evaluates at points 0,1,2,...,2*r, but exercise 4
148: uses -r,...,0,...,r and the latter saves some small multiplies in the
149: evaluate stage (or rather trades them for additions), and has a further
150: saving of nearly half the interpolate steps. The idea is to separate
151: odd and even final coefficients and then perform algorithm C steps C7
152: and C8 on them separately. The divisors at step C7 become j^2 and the
153: multipliers at C8 become 2*t*j-j^2.
154:
155: Splitting odd and even parts through positive and negative points
156: can be thought of as using -1 as a square root of unity. If a 4th root
157: of unity was available then a further split and speedup would be
158: possible, but no such root exists for plain integers. Going to complex
   159: integers with i=sqrt(-1) doesn't help, essentially because in Cartesian
160: form it takes three real multiplies to do a complex multiply. The
161: existence of 2^k'th roots of unity in a suitable ring or field lets the
   162: fast Fourier transform keep splitting and get to O(N*log(r)).
163:
164: Floating point FFTs use complex numbers approximating Nth roots of
165: unity. Some processors have special support for such FFTs. But these
166: are not used in GMP since it's very difficult to guarantee an exact
167: result (to some number of bits). An occasional difference of 1 in the
168: last bit might not matter to a typical signal processing algorithm, but
169: is of course of vital importance to GMP.
170:
171:
172: File: gmp.info, Node: Division Algorithms, Next: Greatest Common Divisor Algorithms, Prev: Multiplication Algorithms, Up: Algorithms
173:
174: Division Algorithms
175: ===================
176:
177: * Menu:
178:
179: * Single Limb Division::
180: * Basecase Division::
181: * Divide and Conquer Division::
182: * Exact Division::
183: * Exact Remainder::
184: * Small Quotient Division::
185:
186:
187: File: gmp.info, Node: Single Limb Division, Next: Basecase Division, Prev: Division Algorithms, Up: Division Algorithms
188:
189: Single Limb Division
190: --------------------
191:
192: Nx1 division is implemented using repeated 2x1 divisions from high
193: to low, either with a hardware divide instruction or a multiplication by
194: inverse, whichever is best on a given CPU.
195:
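With a hardware divide the loop is just long division in base
2^mp_bits_per_limb.  A sketch with 32-bit limbs, leaning on the compiler's
64-bit division (illustrative code, not GMP's mpn_divrem_1):

     #include <stdint.h>

     /* Divide the n-limb value a[0..n-1] (least significant limb first)
        by the single limb d.  Quotient limbs go to q, the remainder is
        returned.  Each iteration is one 2x1 division (r,a[i]) / d.  */
     static uint32_t divrem_1 (uint32_t *q, const uint32_t *a, int n,
                               uint32_t d)
     {
       uint64_t r = 0;
       for (int i = n - 1; i >= 0; i--)
         {
           uint64_t num = (r << 32) | a[i];   /* two-limb numerator, r < d */
           q[i] = (uint32_t) (num / d);
           r = num % d;
         }
       return (uint32_t) r;
     }
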
196: The multiply by inverse follows section 8 of "Division by Invariant
197: Integers using Multiplication" by Granlund and Montgomery (*note
198: References::) and is implemented as `udiv_qrnnd_preinv' in
199: `gmp-impl.h'. The idea is to have a fixed-point approximation to 1/d
200: (see `invert_limb') and then multiply by the high limb (plus one bit)
201: of the dividend to get a quotient q. With d normalized (high bit set),
202: q is no more than 1 too small. Subtracting q*d from the dividend gives
   203: a remainder, and reveals whether q or q+1 is correct.
204:
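The shape of the multiply by inverse, for 32-bit limbs, is sketched below.
The estimate comes from one high multiply by a precomputed inverse.  This
simplified version omits the extra dividend bit, so its estimate can be a
couple too small and a short correction loop is used, where
`udiv_qrnnd_preinv' needs at most the single adjustment described above.
The names here are illustrative, not GMP's:

     #include <stdint.h>

     /* Fixed-point approximation to 1/d for a normalized divisor
        2^31 <= d < 2^32:  v = floor((2^64-1)/d) - 2^32, which fits a
        limb.  */
     static uint32_t invert_limb32 (uint32_t d)
     {
       return (uint32_t) (UINT64_MAX / d - ((uint64_t) 1 << 32));
     }

     /* Divide n1*2^32 + n0 by d, with n1 < d.  The estimate q is never
        too big, and at most a few too small.  */
     static uint32_t div_2by1_preinv (uint32_t *r, uint32_t n1, uint32_t n0,
                                      uint32_t d, uint32_t v)
     {
       uint64_t n   = ((uint64_t) n1 << 32) | n0;
       uint64_t q   = n1 + (((uint64_t) v * n1) >> 32);
       uint64_t rem = n - q * d;
       while (rem >= d)          /* at most a few rounds */
         {
           q++;
           rem -= d;
         }
       *r = (uint32_t) rem;
       return (uint32_t) q;
     }
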
205: The result is a division done with two multiplications and four or
206: five arithmetic operations. On CPUs with low latency multipliers this
207: can be much faster than a hardware divide, though the cost of
208: calculating the inverse at the start may mean it's only better on
209: inputs bigger than say 4 or 5 limbs.
210:
211: When a divisor must be normalized, either for the generic C
212: `__udiv_qrnnd_c' or the multiply by inverse, the division performed is
213: actually a*2^k by d*2^k where a is the dividend and k is the power
214: necessary to have the high bit of d*2^k set. The bit shifts for the
215: dividend are usually accomplished "on the fly" meaning by extracting
216: the appropriate bits at each step. Done this way the quotient limbs
217: come out aligned ready to store. When only the remainder is wanted, an
218: alternative is to take the dividend limbs unshifted and calculate r = a
219: mod d*2^k followed by an extra final step r*2^k mod d*2^k. This can
220: help on CPUs with poor bit shifts or few registers.
221:
222: The multiply by inverse can be done two limbs at a time. The
223: calculation is basically the same, but the inverse is two limbs and the
224: divisor treated as if padded with a low zero limb. This means more
225: work, since the inverse will need a 2x2 multiply, but the four 1x1s to
226: do that are independent and can therefore be done partly or wholly in
227: parallel. Likewise for a 2x1 calculating q*d. The net effect is to
228: process two limbs with roughly the same two multiplies worth of latency
229: that one limb at a time gives. This extends to 3 or 4 limbs at a time,
230: though the extra work to apply the inverse will almost certainly soon
231: reach the limits of multiplier throughput.
232:
233: A similar approach in reverse can be taken to process just half a
234: limb at a time if the divisor is only a half limb. In this case the
235: 1x1 multiply for the inverse effectively becomes two (1/2)x1 for each
236: limb, which can be a saving on CPUs with a fast half limb multiply, or
237: in fact if the only multiply is a half limb, and especially if it's not
238: pipelined.
239:
240:
241: File: gmp.info, Node: Basecase Division, Next: Divide and Conquer Division, Prev: Single Limb Division, Up: Division Algorithms
242:
243: Basecase Division
244: -----------------
245:
246: Basecase NxM division is like long division done by hand, but in base
247: 2^mp_bits_per_limb. See Knuth section 4.3.1 algorithm D, and
248: `mpn/generic/sb_divrem_mn.c'.
249:
250: Briefly stated, while the dividend remains larger than the divisor,
251: a high quotient limb is formed and the Nx1 product q*d subtracted at
252: the top end of the dividend. With a normalized divisor (most
253: significant bit set), each quotient limb can be formed with a 2x1
254: division and a 1x1 multiplication plus some subtractions. The 2x1
255: division is by the high limb of the divisor and is done either with a
256: hardware divide or a multiply by inverse (the same as in *Note Single
257: Limb Division::) whichever is faster. Such a quotient is sometimes one
258: too big, requiring an addback of the divisor, but that happens rarely.
259:
260: With Q=N-M being the number of quotient limbs, this is an O(Q*M)
261: algorithm and will run at a speed similar to a basecase QxM
262: multiplication, differing in fact only in the extra multiply and divide
263: for each of the Q quotient limbs.
264:
265:
266: File: gmp.info, Node: Divide and Conquer Division, Next: Exact Division, Prev: Basecase Division, Up: Division Algorithms
267:
268: Divide and Conquer Division
269: ---------------------------
270:
271: For divisors larger than `DIV_DC_THRESHOLD', division is done by
272: dividing. Or to be precise by a recursive divide and conquer algorithm
273: based on work by Moenck and Borodin, Jebelean, and Burnikel and Ziegler
274: (*note References::).
275:
276: The algorithm consists essentially of recognising that a 2NxN
277: division can be done with the basecase division algorithm (*note
278: Basecase Division::), but using N/2 limbs as a base, not just a single
279: limb. This way the multiplications that arise are (N/2)x(N/2) and can
280: take advantage of Karatsuba and higher multiplication algorithms (*note
281: Multiplication Algorithms::). The "digits" of the quotient are formed
282: by recursive Nx(N/2) divisions.
283:
284: If the (N/2)x(N/2) multiplies are done with a basecase multiplication
285: then the work is about the same as a basecase division, but with more
286: function call overheads and with some subtractions separated from the
287: multiplies. These overheads mean that it's only when N/2 is above
288: `MUL_KARATSUBA_THRESHOLD' that divide and conquer is of use.
289:
290: `DIV_DC_THRESHOLD' is based on the divisor size N, so it will be
291: somewhere above twice `MUL_KARATSUBA_THRESHOLD', but how much above
292: depends on the CPU. An optimized `mpn_mul_basecase' can lower
293: `DIV_DC_THRESHOLD' a little by offering a ready-made advantage over
294: repeated `mpn_submul_1' calls.
295:
296: Divide and conquer is asymptotically O(M(N)*log(N)) where M(N) is
297: the time for an NxN multiplication done with FFTs. The actual time is
298: a sum over multiplications of the recursed sizes, as can be seen near
299: the end of section 2.2 of Burnikel and Ziegler. For example, within
300: the Toom-3 range, divide and conquer is 2.63*M(N). With higher
301: algorithms the M(N) term improves and the multiplier tends to log(N).
302: In practice, at moderate to large sizes, a 2NxN division is about 2 to
303: 4 times slower than an NxN multiplication.
304:
305: Newton's method used for division is asymptotically O(M(N)) and
306: should therefore be superior to divide and conquer, but it's believed
307: this would only be for large to very large N.
308:
309:
310: File: gmp.info, Node: Exact Division, Next: Exact Remainder, Prev: Divide and Conquer Division, Up: Division Algorithms
311:
312: Exact Division
313: --------------
314:
315: A so-called exact division is when the dividend is known to be an
316: exact multiple of the divisor. Jebelean's exact division algorithm
317: uses this knowledge to make some significant optimizations (*note
318: References::).
319:
320: The idea can be illustrated in decimal for example with 368154
321: divided by 543. Because the low digit of the dividend is 4, the low
322: digit of the quotient must be 8. This is arrived at from 4*7 mod 10,
323: using the fact 7 is the modular inverse of 3 (the low digit of the
324: divisor), since 3*7 == 1 mod 10. So 8*543=4344 can be subtracted from
325: the dividend leaving 363810. Notice the low digit has become zero.
326:
327: The procedure is repeated at the second digit, with the next
328: quotient digit 7 (7 == 1*7 mod 10), subtracting 7*543=3801, leaving
329: 325800. And finally at the third digit with quotient digit 6 (8*7 mod
330: 10), subtracting 6*543=3258 leaving 0. So the quotient is 678.
331:
332: Notice however that the multiplies and subtractions don't need to
333: extend past the low three digits of the dividend, since that's enough
334: to determine the three quotient digits. For the last quotient digit no
335: subtraction is needed at all. On a 2NxN division like this one, only
336: about half the work of a normal basecase division is necessary.
337:
338: For an NxM exact division producing Q=N-M quotient limbs, the saving
339: over a normal basecase division is in two parts. Firstly, each of the
340: Q quotient limbs needs only one multiply, not a 2x1 divide and
341: multiply. Secondly, the crossproducts are reduced when Q>M to
342: Q*M-M*(M+1)/2, or when Q<=M to Q*(Q-1)/2. Notice the savings are
343: complementary. If Q is big then many divisions are saved, or if Q is
344: small then the crossproducts reduce to a small number.
345:
346: The modular inverse used is calculated efficiently by
347: `modlimb_invert' in `gmp-impl.h'. This does four multiplies for a
348: 32-bit limb, or six for a 64-bit limb. `tune/modlinv.c' has some
349: alternate implementations that might suit processors better at bit
350: twiddling than multiplying.
351:
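For an odd single-limb divisor the whole scheme is only a few lines.  A
sketch with 32-bit limbs, using a plain Hensel-lifting inverse (GMP's
`modlimb_invert' starts from a better initial value and so needs fewer
multiplies, and `mpn_divexact_1' is organized differently):

     #include <stdint.h>

     /* Inverse of an odd d modulo 2^32.  x = d is correct to 3 bits, and
        each Newton step x = x*(2-d*x) doubles the number of good bits.  */
     static uint32_t modinv32 (uint32_t d)
     {
       uint32_t x = d;
       for (int i = 0; i < 4; i++)
         x *= 2 - d * x;
       return x;                             /* d*x == 1 mod 2^32 */
     }

     /* q = a/d for odd d, assuming d divides a exactly.  Low to high,
        one multiply by the inverse per quotient limb.  */
     static void divexact_1 (uint32_t *q, const uint32_t *a, int n,
                             uint32_t d)
     {
       uint32_t inv = modinv32 (d);
       uint32_t c = 0;                       /* owed to the next limb */
       for (int i = 0; i < n; i++)
         {
           uint32_t l = a[i] - c;            /* low limb of what remains */
           uint32_t borrow = a[i] < c;
           q[i] = l * inv;
           uint64_t p = (uint64_t) q[i] * d; /* low half of p equals l */
           c = (uint32_t) (p >> 32) + borrow;
         }
       /* c is zero here when the division really was exact */
     }
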
352: The sub-quadratic exact division described by Jebelean in "Exact
353: Division with Karatsuba Complexity" is not currently implemented. It
354: uses a rearrangement similar to the divide and conquer for normal
355: division (*note Divide and Conquer Division::), but operating from low
356: to high. A further possibility not currently implemented is
357: "Bidirectional Exact Integer Division" by Krandick and Jebelean which
358: forms quotient limbs from both the high and low ends of the dividend,
359: and can halve once more the number of crossproducts needed in a 2NxN
360: division.
361:
362: A special case exact division by 3 exists in `mpn_divexact_by3',
363: supporting Toom-3 multiplication and `mpq' canonicalizations. It forms
364: quotient digits with a multiply by the modular inverse of 3 (which is
365: `0xAA..AAB') and uses two comparisons to determine a borrow for the next
366: limb. The multiplications don't need to be on the dependent chain, as
367: long as the effect of the borrows is applied. Only a few optimized
368: assembler implementations currently exist.
369:
370:
371: File: gmp.info, Node: Exact Remainder, Next: Small Quotient Division, Prev: Exact Division, Up: Division Algorithms
372:
373: Exact Remainder
374: ---------------
375:
376: If the exact division algorithm is done with a full subtraction at
377: each stage and the dividend isn't a multiple of the divisor, then low
378: zero limbs are produced but with a remainder in the high limbs. For
379: dividend a, divisor d, quotient q, and b = 2^mp_bits_per_limb, then this
380: remainder r is of the form
381:
382: a = q*d + r*b^n
383:
384: n represents the number of zero limbs produced by the subtractions,
385: that being the number of limbs produced for q. r will be in the range
386: 0<=r<d and can be viewed as a remainder, but one shifted up by a factor
387: of b^n.
388:
389: Carrying out full subtractions at each stage means the same number
390: of cross products must be done as a normal division, but there's still
391: some single limb divisions saved. When d is a single limb some
392: simplifications arise, providing good speedups on a number of
393: processors.
394:
395: `mpn_bdivmod', `mpn_divexact_by3', `mpn_modexact_1_odd' and the
396: `redc' function in `mpz_powm' differ subtly in how they return r,
397: leading to some negations in the above formula, but all are essentially
398: the same.
399:
400: Clearly r is zero when a is a multiple of d, and this leads to
401: divisibility or congruence tests which are potentially more efficient
402: than a normal division.
403:
404: The factor of b^n on r can be ignored in a GCD when d is odd, hence
405: the use of `mpn_bdivmod' in `mpn_gcd', and the use of
406: `mpn_modexact_1_odd' by `mpn_gcd_1' and `mpz_kronecker_ui' etc (*note
407: Greatest Common Divisor Algorithms::).
408:
409: Montgomery's REDC method for modular multiplications uses operands
410: of the form of x*b^-n and y*b^-n and on calculating (x*b^-n)*(y*b^-n)
411: uses the factor of b^n in the exact remainder to reach a product in the
412: same form (x*y)*b^-n (*note Modular Powering Algorithm::).
413:
414: Notice that r generally gives no useful information about the
415: ordinary remainder a mod d since b^n mod d could be anything. If
416: however b^n == 1 mod d, then r is the negative of the ordinary
417: remainder. This occurs whenever d is a factor of b^n-1, as for example
418: with 3 in `mpn_divexact_by3'. Other such factors include 5, 17 and
419: 257, but no particular use has been found for this.
420:
421:
422: File: gmp.info, Node: Small Quotient Division, Prev: Exact Remainder, Up: Division Algorithms
423:
424: Small Quotient Division
425: -----------------------
426:
427: An NxM division where the number of quotient limbs Q=N-M is small
428: can be optimized somewhat.
429:
430: An ordinary basecase division normalizes the divisor by shifting it
431: to make the high bit set, shifting the dividend accordingly, and
432: shifting the remainder back down at the end of the calculation. This
433: is wasteful if only a few quotient limbs are to be formed. Instead a
434: division of just the top 2*Q limbs of the dividend by the top Q limbs
435: of the divisor can be used to form a trial quotient. This requires
436: only those limbs normalized, not the whole of the divisor and dividend.
437:
438: A multiply and subtract then applies the trial quotient to the M-Q
439: unused limbs of the divisor and N-Q dividend limbs (which includes Q
440: limbs remaining from the trial quotient division). The starting trial
441: quotient can be 1 or 2 too big, but all cases of 2 too big and most
442: cases of 1 too big are detected by first comparing the most significant
443: limbs that will arise from the subtraction. An addback is done if the
444: quotient still turns out to be 1 too big.
445:
446: This whole procedure is essentially the same as one step of the
447: basecase algorithm done in a Q limb base, though with the trial
448: quotient test done only with the high limbs, not an entire Q limb
449: "digit" product. The correctness of this weaker test can be
450: established by following the argument of Knuth section 4.3.1 exercise
451: 20 but with the v2*q>b*r+u2 condition appropriately relaxed.
452:
453:
454: File: gmp.info, Node: Greatest Common Divisor Algorithms, Next: Powering Algorithms, Prev: Division Algorithms, Up: Algorithms
455:
456: Greatest Common Divisor
457: =======================
458:
459: * Menu:
460:
461: * Binary GCD::
462: * Accelerated GCD::
463: * Extended GCD::
464: * Jacobi Symbol::
465:
466:
467: File: gmp.info, Node: Binary GCD, Next: Accelerated GCD, Prev: Greatest Common Divisor Algorithms, Up: Greatest Common Divisor Algorithms
468:
469: Binary GCD
470: ----------
471:
472: At small sizes GMP uses an O(N^2) binary style GCD. This is
473: described in many textbooks, for example Knuth section 4.5.2 algorithm
474: B. It simply consists of successively reducing operands a and b using
475: gcd(a,b) = gcd(min(a,b),abs(a-b)), and also that if a and b are first
476: made odd then abs(a-b) is even and factors of two can be discarded.
477:
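On single words the whole method is just a few lines; a sketch
(illustrative only, not GMP's code):

     #include <stdint.h>

     /* Binary GCD: strip common factors of 2, make both operands odd,
        then repeatedly replace the larger by the (even) difference with
        its factors of 2 removed.  */
     static uint64_t binary_gcd (uint64_t a, uint64_t b)
     {
       if (a == 0) return b;
       if (b == 0) return a;
       unsigned twos = 0;
       while (((a | b) & 1) == 0)            /* common factors of 2 */
         { a >>= 1;  b >>= 1;  twos++; }
       while ((a & 1) == 0) a >>= 1;
       while ((b & 1) == 0) b >>= 1;
       while (a != b)
         {
           if (a > b)
             { a -= b;  do a >>= 1; while ((a & 1) == 0); }
           else
             { b -= a;  do b >>= 1; while ((b & 1) == 0); }
         }
       return a << twos;
     }
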
478: Variants like letting a-b become negative and doing a different next
479: step are of interest only as far as they suit particular CPUs, since on
480: small operands it's machine dependent factors that determine
481: performance.
482:
483: The Euclidean GCD algorithm, as per Knuth algorithms E and A,
484: reduces using a mod b but this has so far been found to be slower
485: everywhere. One reason the binary method does well is that the implied
486: quotient at each step is usually small, so often only one or two
487: subtractions are needed to get the same effect as a division.
488: Quotients 1, 2 and 3 for example occur 67.7% of the time, see Knuth
489: section 4.5.3 Theorem E.
490:
491: When the implied quotient is large, meaning b is much smaller than
492: a, then a division is worthwhile. This is the basis for the initial a
493: mod b reductions in `mpn_gcd' and `mpn_gcd_1' (the latter for both Nx1
494: and 1x1 cases). But after that initial reduction, big quotients occur
495: too rarely to make it worth checking for them.
496:
497:
498: File: gmp.info, Node: Accelerated GCD, Next: Extended GCD, Prev: Binary GCD, Up: Greatest Common Divisor Algorithms
499:
500: Accelerated GCD
501: ---------------
502:
503: For sizes above `GCD_ACCEL_THRESHOLD', GMP uses the Accelerated GCD
504: algorithm described independently by Weber and Jebelean (the latter as
505: the "Generalized Binary" algorithm), *note References::. This
506: algorithm is still O(N^2), but is much faster than the binary algorithm
507: since it does fewer multi-precision operations. It consists of
508: alternating the k-ary reduction by Sorenson, and a "dmod" exact
509: remainder reduction.
510:
511: For operands u and v the k-ary reduction replaces u with n*v-d*u
512: where n and d are single limb values chosen to give two trailing zero
513: limbs on that value, which can be stripped. n and d are calculated
514: using an algorithm similar to half of a two limb GCD (see `find_a' in
515: `mpn/generic/gcd.c').
516:
517: When u and v differ in size by more than a certain number of bits, a
518: dmod is performed to zero out bits at the low end of the larger. It
519: consists of an exact remainder style division applied to an appropriate
520: number of bits (*note Exact Division::, and *note Exact Remainder::).
521: This is faster than a k-ary reduction but useful only when the operands
522: differ in size. There's a dmod after each k-ary reduction, and if the
523: dmod leaves the operands still differing in size then it's repeated.
524:
525: The k-ary reduction step can introduce spurious factors into the GCD
526: calculated, and these are eliminated at the end by taking GCDs with the
527: original inputs gcd(u,gcd(v,g)) using the binary algorithm. Since g is
528: almost always small this takes very little time.
529:
530: At small sizes the algorithm needs a good implementation of
531: `find_a'. At larger sizes it's dominated by `mpn_addmul_1' applying n
532: and d.
533:
534:
535: File: gmp.info, Node: Extended GCD, Next: Jacobi Symbol, Prev: Accelerated GCD, Up: Greatest Common Divisor Algorithms
536:
537: Extended GCD
538: ------------
539:
540: The extended GCD calculates gcd(a,b) and also cofactors x and y
541: satisfying a*x+b*y=gcd(a,b). Lehmer's multi-step improvement of the
542: extended Euclidean algorithm is used. See Knuth section 4.5.2
543: algorithm L, and `mpn/generic/gcdext.c'. This is an O(N^2) algorithm.
544:
545: The multipliers at each step are found using single limb
546: calculations for sizes up to `GCDEXT_THRESHOLD', or double limb
547: calculations above that. The single limb code is faster but doesn't
548: produce full-limb multipliers, hence not making full use of the
549: `mpn_addmul_1' calls.
550:
551: When a CPU has a data-dependent multiplier, meaning one which is
552: faster on operands with fewer bits, the extra work in the double-limb
553: calculation might only save some looping overheads, leading to a large
554: `GCDEXT_THRESHOLD'.
555:
556: Currently the single limb calculation doesn't optimize for the small
557: quotients that often occur, and this can lead to unusually low values of
558: `GCDEXT_THRESHOLD', depending on the CPU.
559:
560: An analysis of double-limb calculations can be found in "A
561: Double-Digit Lehmer-Euclid Algorithm" by Jebelean (*note References::).
562: The code in GMP was developed independently.
563:
564: It should be noted that when a double limb calculation is used, it's
565: used for the whole of that GCD, it doesn't fall back to single limb
566: part way through. This is because as the algorithm proceeds, the
567: inputs a and b are reduced, but the cofactors x and y grow, so the
568: multipliers at each step are applied to a roughly constant total number
569: of limbs.
570:
571:
572: File: gmp.info, Node: Jacobi Symbol, Prev: Extended GCD, Up: Greatest Common Divisor Algorithms
573:
574: Jacobi Symbol
575: -------------
576:
577: `mpz_jacobi' and `mpz_kronecker' are currently implemented with a
578: simple binary algorithm similar to that described for the GCDs (*note
579: Binary GCD::). They're not very fast when both inputs are large.
580: Lehmer's multi-step improvement or a binary based multi-step algorithm
581: is likely to be better.
582:
583: When one operand fits a single limb, and that includes
584: `mpz_kronecker_ui' and friends, an initial reduction is done with
585: either `mpn_mod_1' or `mpn_modexact_1_odd', followed by the binary
586: algorithm on a single limb. The binary algorithm is well suited to a
587: single limb, and the whole calculation in this case is quite efficient.
588:
589: In all the routines sign changes for the result are accumulated
590: using some bit twiddling, avoiding table lookups or conditional jumps.
591:
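For a single limb the whole calculation is short.  The sketch below uses
an ordinary remainder step and quadratic reciprocity, rather than GMP's
binary reduction and bit-twiddled sign handling, but computes the same
symbol:

     #include <stdint.h>

     /* Jacobi symbol (a/n) for odd n >= 1.  Returns -1, 0 or 1.  */
     static int jacobi_1 (uint64_t a, uint64_t n)
     {
       int s = 1;
       a %= n;
       while (a != 0)
         {
           while ((a & 1) == 0)     /* (2/n) = -1 when n == 3 or 5 mod 8 */
             {
               a >>= 1;
               if ((n & 7) == 3 || (n & 7) == 5)
                 s = -s;
             }
           uint64_t t = a;  a = n;  n = t;   /* reciprocity */
           if ((a & 3) == 3 && (n & 3) == 3)
             s = -s;
           a %= n;
         }
       return n == 1 ? s : 0;
     }
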
592:
593: File: gmp.info, Node: Powering Algorithms, Next: Root Extraction Algorithms, Prev: Greatest Common Divisor Algorithms, Up: Algorithms
594:
595: Powering Algorithms
596: ===================
597:
598: * Menu:
599:
600: * Normal Powering Algorithm::
601: * Modular Powering Algorithm::
602:
603:
604: File: gmp.info, Node: Normal Powering Algorithm, Next: Modular Powering Algorithm, Prev: Powering Algorithms, Up: Powering Algorithms
605:
606: Normal Powering
607: ---------------
608:
609: Normal `mpz' or `mpf' powering uses a simple binary algorithm,
610: successively squaring and then multiplying by the base when a 1 bit is
611: seen in the exponent, as per Knuth section 4.6.3. The "left to right"
612: variant described there is used rather than algorithm A, since it's
613: just as easy and can be done with somewhat less temporary memory.
614:
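On single words the left to right method looks as follows (a sketch that
simply ignores overflow; `mpz_pow_ui' does the same thing on `mpz_t'
operands):

     #include <stdint.h>

     /* One squaring per exponent bit, plus a multiply by the base for
        each 1 bit, scanning the exponent from the top down.  */
     static uint64_t pow_u64 (uint64_t base, uint32_t e)
     {
       uint64_t r = 1;
       for (int bit = 31; bit >= 0; bit--)
         {
           r *= r;
           if ((e >> bit) & 1)
             r *= base;
         }
       return r;
     }
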
615:
616: File: gmp.info, Node: Modular Powering Algorithm, Prev: Normal Powering Algorithm, Up: Powering Algorithms
617:
618: Modular Powering
619: ----------------
620:
621: Modular powering is implemented using a 2^k-ary sliding window
622: algorithm, as per "Handbook of Applied Cryptography" algorithm 14.85
623: (*note References::). k is chosen according to the size of the
624: exponent. Larger exponents use larger values of k, the choice being
625: made to minimize the average number of multiplications that must
626: supplement the squaring.
627:
628: The modular multiplies and squares use either a simple division or
629: the REDC method by Montgomery (*note References::). REDC is a little
630: faster, essentially saving N single limb divisions in a fashion similar
631: to an exact remainder (*note Exact Remainder::). The current REDC has
632: some limitations. It's only O(N^2) so above `POWM_THRESHOLD' division
633: becomes faster and is used. It doesn't attempt to detect small bases,
634: but rather always uses a REDC form, which is usually a full size
635: operand. And lastly it's only applied to odd moduli.
636:
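The REDC step itself is compact.  A single-limb sketch with R=2^32,
keeping the modulus below 2^31 so the intermediate sum fits in 64 bits
(GMP's `redc' does the same thing limb by limb on full-size operands):

     #include <stdint.h>

     /* Montgomery reduction: for odd m < 2^31 and T < 2^32 * m, return
        T / 2^32 mod m.  minv is -1/m mod 2^32, obtainable by the same
        Hensel lifting used for exact division.  */
     static uint32_t redc (uint64_t T, uint32_t m, uint32_t minv)
     {
       uint32_t u = (uint32_t) T * minv;          /* makes the low limb cancel */
       uint64_t t = (T + (uint64_t) u * m) >> 32; /* exact shift, t < 2m */
       return t >= m ? (uint32_t) (t - m) : (uint32_t) t;
     }

A multiply of two values held in Montgomery form, x*2^32 mod m and
y*2^32 mod m, is then just redc applied to their 64-bit product, and the
result is again in Montgomery form.
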
637:
638: File: gmp.info, Node: Root Extraction Algorithms, Next: Radix Conversion Algorithms, Prev: Powering Algorithms, Up: Algorithms
639:
640: Root Extraction Algorithms
641: ==========================
642:
643: * Menu:
644:
645: * Square Root Algorithm::
646: * Nth Root Algorithm::
647: * Perfect Square Algorithm::
648: * Perfect Power Algorithm::
649:
650:
651: File: gmp.info, Node: Square Root Algorithm, Next: Nth Root Algorithm, Prev: Root Extraction Algorithms, Up: Root Extraction Algorithms
652:
653: Square Root
654: -----------
655:
656: Square roots are taken using the "Karatsuba Square Root" algorithm
657: by Paul Zimmermann (*note References::). This is expressed in a divide
658: and conquer form, but as noted in the paper it can also be viewed as a
659: discrete variant of Newton's method.
660:
661: In the Karatsuba multiplication range this is an O(1.5*M(N/2))
662: algorithm, where M(n) is the time to multiply two numbers of n limbs.
663: In the FFT multiplication range this grows to a bound of O(6*M(N/2)).
664: In practice a factor of about 1.5 to 1.8 is found in the Karatsuba and
665: Toom-3 ranges, growing to 2 or 3 in the FFT range.
666:
667: The algorithm does all its calculations in integers and the resulting
668: `mpn_sqrtrem' is used for both `mpz_sqrt' and `mpf_sqrt'. The extended
669: precision given by `mpf_sqrt_ui' is obtained by padding with zero limbs.
670:
671:
672: File: gmp.info, Node: Nth Root Algorithm, Next: Perfect Square Algorithm, Prev: Square Root Algorithm, Up: Root Extraction Algorithms
673:
674: Nth Root
675: --------
676:
677: Integer Nth roots are taken using Newton's method with the following
678: iteration, where A is the input and n is the root to be taken.
679:
   680:      a[i+1] = ( A/a[i]^(n-1) + (n-1)*a[i] ) / n
683:
684: The initial approximation a[1] is generated bitwise by successively
685: powering a trial root with or without new 1 bits, aiming to be just
686: above the true root. The iteration converges quadratically when
687: started from a good approximation. When n is large more initial bits
688: are needed to get good convergence. The current implementation is not
689: particularly well optimized.
690:
691:
692: File: gmp.info, Node: Perfect Square Algorithm, Next: Perfect Power Algorithm, Prev: Nth Root Algorithm, Up: Root Extraction Algorithms
693:
694: Perfect Square
695: --------------
696:
697: `mpz_perfect_square_p' is able to quickly exclude most non-squares by
698: checking whether the input is a quadratic residue modulo some small
699: integers.
700:
701: The first test is modulo 256 which means simply examining the least
702: significant byte. Only 44 different values occur as the low byte of a
703: square, so 82.8% of non-squares can be immediately excluded. Similar
704: tests modulo primes from 3 to 29 exclude 99.5% of those remaining, or
705: if a limb is 64 bits then primes up to 53 are used, excluding 99.99%.
706: A single Nx1 remainder using `PP' from `gmp-impl.h' quickly gives all
707: these remainders.
708:
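The byte test can be sketched as follows; anything that survives it still
has to face the remaining residue tests and the square root described
below (illustrative code, not GMP's):

     #include <stdint.h>

     /* Return non-zero if a could be a square, judging only by its low
        byte.  Only 44 of the 256 byte values are marked.  */
     static int maybe_square_mod256 (uint64_t a)
     {
       static unsigned char ok[256];
       static int ready = 0;
       if (!ready)
         {
           for (unsigned i = 0; i < 256; i++)
             ok[(i * i) & 0xFF] = 1;
           ready = 1;
         }
       return ok[a & 0xFF];
     }
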
709: A square root must still be taken for any value that passes the
710: residue tests, to verify it's really a square and not one of the 0.086%
711: (or 0.000156% for 64 bits) non-squares that get through. *Note Square
712: Root Algorithm::.
713:
714:
715: File: gmp.info, Node: Perfect Power Algorithm, Prev: Perfect Square Algorithm, Up: Root Extraction Algorithms
716:
717: Perfect Power
718: -------------
719:
720: Detecting perfect powers is required by some factorization
721: algorithms. Currently `mpz_perfect_power_p' is implemented using
722: repeated Nth root extractions, though naturally only prime roots need
723: to be considered. (*Note Nth Root Algorithm::.)
724:
725: If a prime divisor p with multiplicity e can be found, then only
726: roots which are divisors of e need to be considered, much reducing the
727: work necessary. To this end divisibility by a set of small primes is
728: checked.
729:
730:
731: File: gmp.info, Node: Radix Conversion Algorithms, Next: Other Algorithms, Prev: Root Extraction Algorithms, Up: Algorithms
732:
733: Radix Conversion
734: ================
735:
736: Radix conversions are less important than other algorithms. A
737: program dominated by conversions should probably use a different data
738: representation.
739:
740: * Menu:
741:
742: * Binary to Radix::
743: * Radix to Binary::
744:
745:
746: File: gmp.info, Node: Binary to Radix, Next: Radix to Binary, Prev: Radix Conversion Algorithms, Up: Radix Conversion Algorithms
747:
748: Binary to Radix
749: ---------------
750:
751: Conversions from binary to a power-of-2 radix use a simple and fast
752: O(N) bit extraction algorithm.
753:
754: Conversions from binary to other radices use one of two algorithms.
755: Sizes below `GET_STR_PRECOMPUTE_THRESHOLD' use a basic O(N^2) method.
756: Repeated divisions by b^n are made, where b is the radix and n is the
757: biggest power that fits in a limb. But instead of simply using the
758: remainder r from such divisions, an extra divide step is done to give a
759: fractional limb representing r/b^n. The digits of r can then be
760: extracted using multiplications by b rather than divisions. Special
761: case code is provided for decimal, allowing multiplications by 10 to
762: optimize to shifts and adds.
763:
764: Above `GET_STR_PRECOMPUTE_THRESHOLD' a sub-quadratic algorithm is
765: used. For an input t, powers b^(n*2^i) of the radix are calculated,
766: until a power between t and sqrt(t) is reached. t is then divided by
767: that largest power, giving a quotient which is the digits above that
768: power, and a remainder which is those below. These two parts are in
769: turn divided by the second highest power, and so on recursively. When
770: a piece has been divided down to less than `GET_STR_DC_THRESHOLD'
771: limbs, the basecase algorithm described above is used.
772:
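The recursive splitting can be illustrated on a single word, where the
divisions are ordinary / and % rather than bignum divisions by the
precomputed powers b^(n*2^i) (illustrative code only):

     #include <stdint.h>

     /* Write exactly `digits' decimal digits of t, with leading zeros,
        into out, splitting recursively at a power of the radix.  */
     static void to_dec (char *out, uint64_t t, int digits)
     {
       if (digits <= 4)
         {                                   /* basecase: peel digits */
           for (int i = digits - 1; i >= 0; i--)
             { out[i] = (char) ('0' + t % 10);  t /= 10; }
           return;
         }
       int low = digits / 2;                 /* digits below the split */
       uint64_t pow = 1;
       for (int i = 0; i < low; i++)
         pow *= 10;                          /* the radix power */
       to_dec (out, t / pow, digits - low);  /* high digits */
       to_dec (out + (digits - low), t % pow, low);
     }
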
773: The advantage of this algorithm is that big divisions can make use
774: of the sub-quadratic divide and conquer division (*note Divide and
775: Conquer Division::), and big divisions tend to have less overheads than
776: lots of separate single limb divisions anyway. But in any case the
777: cost of calculating the powers b^(n*2^i) must first be overcome.
778:
779: `GET_STR_PRECOMPUTE_THRESHOLD' and `GET_STR_DC_THRESHOLD' represent
780: the same basic thing, the point where it becomes worth doing a big
781: division to cut the input in half. `GET_STR_PRECOMPUTE_THRESHOLD'
782: includes the cost of calculating the radix power required, whereas
783: `GET_STR_DC_THRESHOLD' assumes that's already available, which is the
784: case when recursing.
785:
786: Since the base case produces digits from least to most significant
787: but they want to be stored from most to least, it's necessary to
788: calculate in advance how many digits there will be, or at least be sure
789: not to underestimate that. For GMP the number of input bits is
790: multiplied by `chars_per_bit_exactly' from `mp_bases', rounding up.
791: The result is either correct or one too big.
792:
793: Examining some of the high bits of the input could increase the
794: chance of getting the exact number of digits, but an exact result every
795: time would not be practical, since in general the difference between
796: numbers 100... and 99... is only in the last few bits and the work to
797: identify 99... might well be almost as much as a full conversion.
798:
799: `mpf_get_str' doesn't currently use the algorithm described here, it
   800: multiplies or divides by a power of b to move the radix point to
801: just above the highest non-zero digit (or at worst one above that
802: location), then multiplies by b^n to bring out digits. This is O(N^2)
803: and is certainly not optimal.
804:
805: The r/b^n scheme described above for using multiplications to bring
806: out digits might be useful for more than a single limb. Some brief
807: experiments with it on the base case when recursing didn't give a
   808: noticeable improvement, but perhaps that was only due to the
809: implementation. Something similar would work for the sub-quadratic
810: divisions too, though there would be the cost of calculating a bigger
811: radix power.
812:
813: Another possible improvement for the sub-quadratic part would be to
814: arrange for radix powers that balanced the sizes of quotient and
   815: remainder produced, ie. the highest power would be a b^(n*k)
816: approximately equal to sqrt(t), not restricted to a 2^i factor. That
817: ought to smooth out a graph of times against sizes, but may or may not
818: be a net speedup.
819:
820:
821: File: gmp.info, Node: Radix to Binary, Prev: Binary to Radix, Up: Radix Conversion Algorithms
822:
823: Radix to Binary
824: ---------------
825:
826: Conversions from a power-of-2 radix into binary use a simple and fast
827: O(N) bitwise concatenation algorithm.
828:
829: Conversions from other radices use one of two algorithms. Sizes
830: below `SET_STR_THRESHOLD' use a basic O(N^2) method. Groups of n
831: digits are converted to limbs, where n is the biggest power of the base
832: b which will fit in a limb, then those groups are accumulated into the
833: result by multiplying by b^n and adding. This saves multi-precision
834: operations, as per Knuth section 4.4 part E (*note References::). Some
835: special case code is provided for decimal, giving the compiler a chance
836: to optimize multiplications by 10.
837:
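Each accumulation step is an `mpn_mul_1' style pass over the partial
result.  A sketch with 32-bit limbs (illustrative code; the group value
would come from converting, say, 9 decimal digits with plain arithmetic):

     #include <stdint.h>

     /* r = r * b_pow + group, limb by limb, least significant limb first.
        rn is the current size of r; the array must have room to grow by
        one limb.  */
     static void mul_1_add (uint32_t *r, int *rn, uint32_t b_pow,
                            uint32_t group)
     {
       uint64_t carry = group;
       for (int i = 0; i < *rn; i++)
         {
           uint64_t t = (uint64_t) r[i] * b_pow + carry;
           r[i] = (uint32_t) t;
           carry = t >> 32;
         }
       if (carry != 0)
         r[(*rn)++] = (uint32_t) carry;
     }
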
838: Above `SET_STR_THRESHOLD' a sub-quadratic algorithm is used. First
839: groups of n digits are converted into limbs. Then adjacent limbs are
840: combined into limb pairs with x*b^n+y, where x and y are the limbs.
841: Adjacent limb pairs are combined into quads similarly with x*b^(2n)+y.
842: This continues until a single block remains, that being the result.
843:
844: The advantage of this method is that the multiplications for each x
845: are big blocks, allowing Karatsuba and higher algorithms to be used.
846: But the cost of calculating the powers b^(n*2^i) must be overcome.
847: `SET_STR_THRESHOLD' usually ends up quite big, around 5000 digits, and
848: on some processors much bigger still.
849:
850: `SET_STR_THRESHOLD' is based on the input digits (and tuned for
851: decimal), though it might be better based on a limb count, so as to be
852: independent of the base. But that sort of count isn't used by the base
853: case and so would need some sort of initial calculation or estimate.
854:
855: The main reason `SET_STR_THRESHOLD' is so much bigger than the
856: corresponding `GET_STR_PRECOMPUTE_THRESHOLD' is that `mpn_mul_1' is
857: much faster than `mpn_divrem_1' (often by a factor of 10, or more).
858:
859:
860: File: gmp.info, Node: Other Algorithms, Next: Assembler Coding, Prev: Radix Conversion Algorithms, Up: Algorithms
861:
862: Other Algorithms
863: ================
864:
865: * Menu:
866:
867: * Factorial Algorithm::
868: * Binomial Coefficients Algorithm::
869: * Fibonacci Numbers Algorithm::
870: * Lucas Numbers Algorithm::
871:
872:
873: File: gmp.info, Node: Factorial Algorithm, Next: Binomial Coefficients Algorithm, Prev: Other Algorithms, Up: Other Algorithms
874:
875: Factorial
876: ---------
877:
878: Factorials n! are calculated by a simple product from 1 to n, but
879: arranged into certain sub-products.
880:
881: First as many factors as fit in a limb are accumulated, then two of
882: those multiplied to give a 2-limb product. When two 2-limb products
883: are ready they're multiplied to a 4-limb product, and when two 4-limbs
884: are ready they're multiplied to an 8-limb product, etc. A stack of
885: outstanding products is built up, with two of the same size multiplied
886: together when ready.
887:
888: Arranging for multiplications to have operands the same (or nearly
889: the same) size means the Karatsuba and higher multiplication algorithms
890: can be used. And even on sizes below the Karatsuba threshold an NxN
891: multiply will give a basecase multiply more to work on.
892:
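A recursive variant of the same idea, multiplying balanced halves of the
range so the operands stay similar in size, can be written directly with
the documented `mpz' functions (a sketch, not `mpz_fac_ui''s actual
arrangement):

     #include <gmp.h>

     /* r = lo * (lo+1) * ... * hi, with r already initialized.
        For n >= 1, n! is range_product (r, 1, n).  */
     static void range_product (mpz_t r, unsigned long lo, unsigned long hi)
     {
       if (hi - lo < 8)
         {                                  /* small range: plain product */
           mpz_set_ui (r, lo);
           for (unsigned long i = lo + 1; i <= hi; i++)
             mpz_mul_ui (r, r, i);
           return;
         }
       unsigned long mid = lo + (hi - lo) / 2;
       mpz_t left, right;
       mpz_init (left);
       mpz_init (right);
       range_product (left, lo, mid);
       range_product (right, mid + 1, hi);
       mpz_mul (r, left, right);            /* two similarly sized operands */
       mpz_clear (left);
       mpz_clear (right);
     }
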
893: An obvious improvement not currently implemented would be to strip
894: factors of 2 from the products and apply them at the end with a bit
895: shift. Another possibility would be to determine the prime
896: factorization of the result (which can be done easily), and use a
897: powering method, at each stage squaring then multiplying in those
898: primes with a 1 in their exponent at that point. The advantage would
899: be some multiplies turned into squares.
900:
901:
902: File: gmp.info, Node: Binomial Coefficients Algorithm, Next: Fibonacci Numbers Algorithm, Prev: Factorial Algorithm, Up: Other Algorithms
903:
904: Binomial Coefficients
905: ---------------------
906:
907: Binomial coefficients C(n,k) are calculated by first arranging k <=
908: n/2 using C(n,k) = C(n,n-k) if necessary, and then evaluating the
909: following product simply from i=2 to i=k.
910:
   911:      C(n,k) = (n-k+1) * prod(i=2..k) (n-k+i)/i
914:
915: It's easy to show that each denominator i will divide the product so
916: far, so the exact division algorithm is used (*note Exact Division::).
917:
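In code the formula with exact divisions is simply the following, using
the documented `mpz' functions (a sketch, without the limb accumulation
described next, and not `mpz_bin_uiui''s actual implementation):

     #include <gmp.h>

     /* r = C(n,k), for k <= n, with r already initialized.  */
     static void binomial (mpz_t r, unsigned long n, unsigned long k)
     {
       if (k > n - k)
         k = n - k;                       /* use C(n,k) = C(n,n-k) */
       mpz_set_ui (r, 1);
       for (unsigned long i = 1; i <= k; i++)
         {
           mpz_mul_ui (r, r, n - k + i);
           mpz_divexact_ui (r, r, i);     /* i always divides exactly */
         }
     }
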
918: The numerators n-k+i and denominators i are first accumulated into
919: as many fit a limb, to save multi-precision operations, though for
920: `mpz_bin_ui' this applies only to the divisors, since n is an `mpz_t'
921: and n-k+i in general won't fit in a limb at all.
922:
923: An obvious improvement would be to strip factors of 2 from each
924: multiplier and divisor and count them separately, to be applied with a
925: bit shift at the end. Factors of 3 and perhaps 5 could even be handled
926: similarly. Another possibility, if n is not too big, would be to
927: determine the prime factorization of the result based on the factorials
928: involved, and power up those primes appropriately. This would help
929: most when k is near n/2.
930:
931:
932: File: gmp.info, Node: Fibonacci Numbers Algorithm, Next: Lucas Numbers Algorithm, Prev: Binomial Coefficients Algorithm, Up: Other Algorithms
933:
934: Fibonacci Numbers
935: -----------------
936:
937: The Fibonacci functions `mpz_fib_ui' and `mpz_fib2_ui' are designed
938: for calculating isolated F[n] or F[n],F[n-1] values efficiently.
939:
940: For small n, a table of single limb values in `__gmp_fib_table' is
941: used. On a 32-bit limb this goes up to F[47], or on a 64-bit limb up
942: to F[93]. For convenience the table starts at F[-1].
943:
944: Beyond the table, values are generated with a binary powering
945: algorithm, calculating a pair F[n] and F[n-1] working from high to low
946: across the bits of n. The formulas used are
947:
948: F[2k+1] = 4*F[k]^2 - F[k-1]^2 + 2*(-1)^k
949: F[2k-1] = F[k]^2 + F[k-1]^2
950:
951: F[2k] = F[2k+1] - F[2k-1]
952:
953: At each step, k is the high b bits of n. If the next bit of n is 0
954: then F[2k],F[2k-1] is used, or if it's a 1 then F[2k+1],F[2k] is used,
955: and the process repeated until all bits of n are incorporated. Notice
956: these formulas require just two squares per bit of n.
957:
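A word-sized sketch of this powering, exact up to F[93] with 64-bit
arithmetic (which is roughly where the table plus `mpn' code take over in
GMP):

     #include <stdint.h>

     /* Return F[n] and F[n-1], consuming the bits of n from high to low
        with the doubling formulas above.  */
     static void fib2 (unsigned n, uint64_t *fn, uint64_t *fn1)
     {
       if (n == 0) { *fn = 0;  *fn1 = 1;  return; }   /* F[0], F[-1] */
       int top = 0;
       while ((n >> top) > 1)
         top++;                           /* position of the highest 1 bit */
       uint64_t f = 1, f1 = 0;            /* k = 1: F[1], F[0] */
       unsigned k = 1;
       for (int i = top - 1; i >= 0; i--)
         {
           uint64_t f2k1 = 4*f*f - f1*f1; /* F[2k+1] before the 2*(-1)^k */
           if (k & 1) f2k1 -= 2; else f2k1 += 2;
           uint64_t f2km1 = f*f + f1*f1;  /* F[2k-1] */
           uint64_t f2k = f2k1 - f2km1;   /* F[2k] */
           if ((n >> i) & 1)
             { f = f2k1;  f1 = f2k;  k = 2*k + 1; }
           else
             { f = f2k;  f1 = f2km1;  k = 2*k; }
         }
       *fn = f;
       *fn1 = f1;
     }
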
958: It'd be possible to handle the first few n above the single limb
959: table with simple additions, using the defining Fibonacci recurrence
960: F[k+1]=F[k]+F[k-1], but this is not done since it usually turns out to
961: be faster for only about 10 or 20 values of n, and including a block of
962: code for just those doesn't seem worthwhile. If they really mattered
963: it'd be better to extend the data table.
964:
965: Using a table avoids lots of calculations on small numbers, and
966: makes small n go fast. A bigger table would make more small n go fast,
967: it's just a question of balancing size against desired speed. For GMP
968: the code is kept compact, with the emphasis primarily on a good
969: powering algorithm.
970:
971: `mpz_fib2_ui' returns both F[n] and F[n-1], but `mpz_fib_ui' is only
972: interested in F[n]. In this case the last step of the algorithm can
973: become one multiply instead of two squares. One of the following two
974: formulas is used, according as n is odd or even.
975:
976: F[2k] = F[k]*(F[k]+2F[k-1])
977:
978: F[2k+1] = (2F[k]+F[k-1])*(2F[k]-F[k-1]) + 2*(-1)^k
979:
980: F[2k+1] here is the same as above, just rearranged to be a multiply.
981: For interest, the 2*(-1)^k term both here and above can be applied
982: just to the low limb of the calculation, without a carry or borrow into
983: further limbs, which saves some code size. See comments with
984: `mpz_fib_ui' and the internal `mpn_fib2_ui' for how this is done.
985:
986:
987: File: gmp.info, Node: Lucas Numbers Algorithm, Prev: Fibonacci Numbers Algorithm, Up: Other Algorithms
988:
989: Lucas Numbers
990: -------------
991:
992: `mpz_lucnum2_ui' derives a pair of Lucas numbers from a pair of
993: Fibonacci numbers with the following simple formulas.
994:
995: L[k] = F[k] + 2*F[k-1]
996: L[k-1] = 2*F[k] - F[k-1]
997:
998: `mpz_lucnum_ui' is only interested in L[n], and some work can be
999: saved. Trailing zero bits on n can be handled with a single square
1000: each.
1001:
1002: L[2k] = L[k]^2 - 2*(-1)^k
1003:
1004: And the lowest 1 bit can be handled with one multiply of a pair of
1005: Fibonacci numbers, similar to what `mpz_fib_ui' does.
1006:
1007: L[2k+1] = 5*F[k-1]*(2*F[k]+F[k-1]) - 4*(-1)^k
1008:
1009:
1010: File: gmp.info, Node: Assembler Coding, Prev: Other Algorithms, Up: Algorithms
1011:
1012: Assembler Coding
1013: ================
1014:
1015: The assembler subroutines in GMP are the most significant source of
1016: speed at small to moderate sizes. At larger sizes algorithm selection
1017: becomes more important, but of course speedups in low level routines
1018: will still speed up everything proportionally.
1019:
1020: Carry handling and widening multiplies that are important for GMP
1021: can't be easily expressed in C. GCC `asm' blocks help a lot and are
1022: provided in `longlong.h', but hand coding low level routines invariably
1023: offers a speedup over generic C by a factor of anything from 2 to 10.
1024:
1025: * Menu:
1026:
1027: * Assembler Code Organisation::
1028: * Assembler Basics::
1029: * Assembler Carry Propagation::
1030: * Assembler Cache Handling::
1031: * Assembler Floating Point::
1032: * Assembler SIMD Instructions::
1033: * Assembler Software Pipelining::
1034: * Assembler Loop Unrolling::
1035:
1036:
1037: File: gmp.info, Node: Assembler Code Organisation, Next: Assembler Basics, Prev: Assembler Coding, Up: Assembler Coding
1038:
1039: Code Organisation
1040: -----------------
1041:
1042: The various `mpn' subdirectories contain machine-dependent code,
1043: written in C or assembler. The `mpn/generic' subdirectory contains
1044: default code, used when there's no machine-specific version of a
1045: particular file.
1046:
1047: Each `mpn' subdirectory is for an ISA family. Generally 32-bit and
1048: 64-bit variants in a family cannot share code and will have separate
1049: directories. Within a family further subdirectories may exist for CPU
1050: variants.
1051:
1052:
1053: File: gmp.info, Node: Assembler Basics, Next: Assembler Carry Propagation, Prev: Assembler Code Organisation, Up: Assembler Coding
1054:
1055: Assembler Basics
1056: ----------------
1057:
1058: `mpn_addmul_1' and `mpn_submul_1' are the most important routines
1059: for overall GMP performance. All multiplications and divisions come
1060: down to repeated calls to these. `mpn_add_n', `mpn_sub_n',
1061: `mpn_lshift' and `mpn_rshift' are next most important.
1062:
1063: On some CPUs assembler versions of the internal functions
1064: `mpn_mul_basecase' and `mpn_sqr_basecase' give significant speedups,
1065: mainly through avoiding function call overheads. They can also
1066: potentially make better use of a wide superscalar processor.
1067:
1068: The restrictions on overlaps between sources and destinations (*note
1069: Low-level Functions::) are designed to facilitate a variety of
1070: implementations. For example, knowing `mpn_add_n' won't have partly
1071: overlapping sources and destination means reading can be done far ahead
1072: of writing on superscalar processors, and loops can be vectorized on a
1073: vector processor, depending on the carry handling.
1074:
1075:
1076: File: gmp.info, Node: Assembler Carry Propagation, Next: Assembler Cache Handling, Prev: Assembler Basics, Up: Assembler Coding
1077:
1078: Carry Propagation
1079: -----------------
1080:
1081: The problem that presents most challenges in GMP is propagating
1082: carries from one limb to the next. In functions like `mpn_addmul_1' and
1083: `mpn_add_n', carries are the only dependencies between limb operations.
1084:
1085: On processors with carry flags, a straightforward CISC style `adc' is
1086: generally best. AMD K6 `mpn_addmul_1' however is an example of an
1087: unusual set of circumstances where a branch works out better.
1088:
1089: On RISC processors generally an add and compare for overflow is
1090: used. This sort of thing can be seen in `mpn/generic/aors_n.c'. Some
1091: carry propagation schemes require 4 instructions, meaning at least 4
1092: cycles per limb, but other schemes may use just 1 or 2. On wide
1093: superscalar processors performance may be completely determined by the
1094: number of dependent instructions between carry-in and carry-out for
1095: each limb.
1096:
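The add-and-compare idiom looks like this for 32-bit limbs (a sketch of
the aors_n.c style, not that file's actual code):

     #include <stdint.h>

     /* r = a + b over n limbs, least significant first, returning the
        carry out.  Each limb needs an add and a compare for each of the
        two possible carries.  */
     static uint32_t add_n (uint32_t *r, const uint32_t *a,
                            const uint32_t *b, int n)
     {
       uint32_t carry = 0;
       for (int i = 0; i < n; i++)
         {
           uint32_t s = a[i] + carry;
           carry = s < carry;          /* carry from adding the old carry */
           s += b[i];
           carry += s < b[i];          /* carry from adding b */
           r[i] = s;
         }
       return carry;
     }
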
1097: On vector processors good use can be made of the fact that a carry
1098: bit only very rarely propagates more than one limb. When adding a
1099: single bit to a limb, there's only a carry out if that limb was
1100: `0xFF...FF' which on random data will be only 1 in 2^mp_bits_per_limb.
1101: `mpn/cray/add_n.c' is an example of this, it adds all limbs in
1102: parallel, adds one set of carry bits in parallel and then only rarely
1103: needs to fall through to a loop propagating further carries.
1104:
1105: On the x86s, GCC (as of version 2.95.2) doesn't generate
1106: particularly good code for the RISC style idioms that are necessary to
1107: handle carry bits in C. Often conditional jumps are generated where
1108: `adc' or `sbb' forms would be better. And so unfortunately almost any
1109: loop involving carry bits needs to be coded in assembler for best
1110: results.
1111: