GMP SPEED MEASURING AND PARAMETER TUNING


The programs in this directory are for knowledgeable users who want to make
measurements of the speed of GMP routines on their machine, and perhaps
tweak some settings or identify things that can be improved.

The programs here are tools, not ready-to-run solutions. Nothing is built
in a normal "make all", but various Makefile targets described below exist.

Relatively few systems and CPUs have been tested, so be sure to verify that
you're getting sensible results before relying on them.




MISCELLANEOUS NOTES

Don't configure with --enable-assert when using the things here, since the
extra code added by assertion checking may influence measurements.

Some effort has been made to accommodate CPUs with direct mapped caches, but
it will depend on TMP_ALLOC using a proper alloca, and even then it may or
may not be enough.

The sparc32/v9 addmul_1 code runs at noticeably different speeds on
successive sizes, and this has a bad effect on the tune program's
determinations of the multiply and square thresholds.




PARAMETER TUNING

The "tuneup" program runs some tests designed to find the best settings for
various thresholds, like KARATSUBA_MUL_THRESHOLD. Its output can be put
into gmp-mparam.h. The program can be built and run with

        make tune

If the thresholds indicated are grossly different from the values in the
selected gmp-mparam.h then you may get a performance boost in relevant size
ranges by changing gmp-mparam.h accordingly.
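
For illustration, the useful part of the output is a set of #define lines
that can be pasted into gmp-mparam.h. The values below are made-up
placeholders, not measurements from any particular CPU:

        #define KARATSUBA_MUL_THRESHOLD    20
        #define KARATSUBA_SQR_THRESHOLD    32
        #define TOOM3_MUL_THRESHOLD       100
        #define TOOM3_SQR_THRESHOLD       120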

If your CPU has specific tuned parameters coming from a gmp-mparam.h in one
of the mpn subdirectories then the values from "make tune" should be
similar. You can submit new values if it looks like the current ones are
out of date or wildly wrong. But first check that you're on the right CPU
target and that there aren't any machine-specific effects causing a
difference.

It's hoped the compiler and options used won't have too much effect on
thresholds, since for most CPUs they ultimately come down to comparisons
between assembler subroutines. Missing out on the longlong.h macros by not
using gcc will probably have an effect.

Some thresholds produced by the tune program are merely single values chosen
from what's actually a range of sizes where two algorithms are pretty much
the same speed. When this happens the program is likely to give slightly
different values on successive runs. This is noticeable on the toom3
thresholds, for instance.




SPEED PROGRAM

The "speed" program can be used for measuring and comparing various
routines, and producing tables of data or gnuplot graphs. Compile it with

        make speed

Here are some examples of how to use it. Check the code for all the
options.

Draw a graph of mpn_mul_n, stepping through sizes by 10 or a factor of 1.05
(whichever is greater).

        ./speed -s 10-5000 -t 10 -f 1.05 -P foo mpn_mul_n
        gnuplot foo.gnuplot

Compare mpn_add_n and mpn_lshift by 1, showing times in cycles and showing
under mpn_lshift the difference between it and mpn_add_n.

        ./speed -s 1-40 -c -d mpn_add_n mpn_lshift.1

Using option -c for times in cycles is interesting but normally only
necessary when looking carefully at assembler subroutines. You might think
it would always give an integer value, but this doesn't happen in practice,
probably due to overheads in the time measurements.

In the free-form output the "#" symbol against a measurement means the
corresponding routine is fastest at that size. This is a convenient visual
cue when comparing different routines. The graph data files <name>.data
don't get this since it would upset gnuplot or other data viewers.



TIME BASE

The time measuring method is determined in time.c, based on what the
configured target has available. A microsecond-accurate gettimeofday() will
work well, but there's code to use better methods, such as the cycle
counters on various CPUs.

Currently, all methods except possibly the alpha cycle counter depend on the
machine being otherwise idle, or rather on other jobs not stealing CPU time
from the measuring program. Short routines (that complete within a
timeslice) should work even on a busy machine. Some trouble is taken by
speed_measure() in common.c to avoid the ill effects of sporadic interrupts,
or other intermittent things (like cron waking up every minute). But
generally you'll want an idle machine to be sure of consistent results.

The CPU frequency is needed if times in cycles are to be displayed, and it's
always needed when using a cycle counter time base. time.c knows how to get
the frequency on some systems, but when that fails, or needs to be
overridden, an environment variable GMP_CPU_FREQUENCY can be used (in
Hertz). For example in "bash" on a 650 MHz machine,

        export GMP_CPU_FREQUENCY=650e6
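
The variable can also be set just for a single run; this is ordinary shell
syntax rather than a feature of the speed program itself,

        GMP_CPU_FREQUENCY=650e6 ./speed -s 10 -c mpn_add_n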

A high precision time base makes it possible to get accurate measurements in
a shorter time. Support for systems and CPUs not already covered is wanted.

When setting up a method, be sure not to claim a higher accuracy than is
really available. For example the default gettimeofday() code is set for
microsecond accuracy, but if only 10 ms or 55 ms resolution is actually
available then inconsistent results can be expected.




EXAMPLE COMPARISONS

Here are some ideas for things you can do with the speed program.

There's always going to be a certain amount of overhead in the time
measurements, due to reading the time base, and in the loop that runs a
routine enough times to get a reading of the desired precision. Noop
functions taking various arguments are available to measure this. The
"overhead" figure the speed program prints in its intro each time is this
"noop" routine; note that it's for information only, and isn't deducted
from the times printed.

        ./speed -s 1 noop noop_wxs noop_wxys

If you want to know how many cycles per limb a routine is taking, look at
the time increase when the size increments, using option -D. This avoids
fixed overheads in the measuring. Also, remember many of the assembler
routines have unrolled loops, so it might be necessary to compare times at,
say, 16, 32, 48, 64, etc, to see what the unrolled part is taking, as
opposed to any finishing off.

        ./speed -s 16-64 -t 16 -C -D mpn_add_n

The -C option on its own gives cycles per limb, but is really only useful at
big sizes where fixed overheads are small compared to the code doing the
real work. Remember of course memory caching and/or page swapping will
affect results at large sizes.

        ./speed -s 500000 -C mpn_add_n

Once a calculation stops fitting in the CPU data cache, it's going to start
taking longer. Exactly where this happens depends on the cache priming in
the measuring routines, and on what sort of "least recently used"
replacement the hardware does. Here's an example for a CPU with a 16 kbyte
L1 data cache and 32-bit limbs, showing a suddenly steeper curve for
mpn_add_n at about 2000 limbs.

        ./speed -s 1-4000 -t 5 -f 1.02 -P foo mpn_add_n
        gnuplot foo.gnuplot

When a routine has an unrolled loop for, say, multiples of 8 limbs and then
an ordinary loop for the remainder, it can happen that it's actually faster
to do an operation on, say, 8 limbs than it is on 7 limbs. Here's an
example drawing a graph of mpn_sub_n, which you can look at to see if times
smoothly increase with size.

        ./speed -s 1-100 -c -P foo mpn_sub_n
        gnuplot foo.gnuplot

If mpn_lshift and mpn_rshift for your CPU have special-case code for shifts
by 1, that ought to be faster (or at least not slower) than shifting by,
say, 2 bits.

        ./speed -s 1-200 -c mpn_rshift.1 mpn_rshift.2

An mpn_lshift by 1 can be done by mpn_add_n adding a number to itself, and
if the lshift isn't faster there's an obvious improvement that's possible.

        ./speed -s 1-200 -c mpn_lshift.1 mpn_add_n_self

On some CPUs (AMD K6 for example) an "in-place" mpn_add_n where the
destination is one of the sources is faster than with a separate
destination. Here's an example to see this. (mpn_add_n_inplace is a
special measuring routine, not available for other operations.)

        ./speed -s 1-200 -c mpn_add_n mpn_add_n_inplace

The gmp manual recommends that divisions by powers of two be done using a
right shift because it'll be significantly faster. Here's how you can see
by what factor mpn_rshift is faster, using division by 32 as an example.

        ./speed -s 10-20 -r mpn_rshift.5 mpn_divrem_1.32

mul_basecase takes an "r" parameter that's the first (larger) size
parameter. For example to show speeds for 20x1 up to 20x15 in cycles,

        ./speed -s 1-15 -c mpn_mul_basecase.20

mul_basecase with no parameter does an NxN multiply, so for example to show
speeds in cycles for 1x1, 2x2, 3x3, etc, up to 20x20,

        ./speed -s 1-20 -c mpn_mul_basecase

sqr_basecase is implemented by a "triangular" method on most CPUs, making it
up to twice as fast as mul_basecase. In practice loop overheads and the
products on the diagonal mean it falls short of this. Here's an example
running the two and showing by what factor an NxN mul_basecase is slower
than an NxN sqr_basecase. (Some versions of sqr_basecase only allow sizes
below KARATSUBA_SQR_THRESHOLD, so if it crashes at that point don't worry.)

        ./speed -s 1-20 -r mpn_sqr_basecase mpn_mul_basecase

The technique described above with -CD for showing the time difference in
cycles per limb between two sizes can be done on an NxN mul_basecase using
-E to change the basis for the size increment to N*N. For instance a 20x20
operation is taken to be doing 400 limbs, and a 16x16 doing 256 limbs. The
following therefore shows the per-crossproduct speed of mul_basecase and
sqr_basecase at around 20x20 limbs.

        ./speed -s 16-20 -t 4 -CDE mpn_mul_basecase mpn_sqr_basecase

Of course sqr_basecase isn't really doing NxN crossproducts, but it can be
interesting to compare it to mul_basecase as if it were. For sqr_basecase
the -F option can be used to base the deltas on N*(N+1)/2 operations, which
is the number of triangular products sqr_basecase does. For example,

        ./speed -s 16-20 -t 4 -CDF mpn_sqr_basecase

Both -E and -F are preliminary and might change. A consistent approach to
using them when claiming certain per-crossproduct or per-triangular-product
speeds hasn't really been established, but measuring the increment over the
range of sizes karatsuba will actually call, namely k/2 to k, seems
sensible. For instance, if the karatsuba threshold were 20 for the multiply
and 30 for the square,

        ./speed -s 10-20 -t 10 -CDE mpn_mul_basecase
        ./speed -s 15-30 -t 15 -CDF mpn_sqr_basecase

The gmp manual recommends that application programs avoid excessive
initializing and clearing of mpz_t variables (and mpq_t and mpf_t too).
Every new variable will at a minimum go through an init, a realloc for its
first store, and finally a clear. Quite how long that takes depends on the
C library. The following compares an mpz_init/realloc/clear to a 10 limb
mpz_add.

        ./speed -s 10 -c mpz_init_realloc_clear mpz_add

The normal libtool link of the speed program does a static link to libgmp.la
and libspeed.la, but will end up dynamically linked to libc. Depending on
the system, a dynamically linked malloc may be noticeably slower than a
statically linked one, and you may want to re-run the libtool link
invocation to static link libc for comparison. The example below does a 10
limb malloc/free or malloc/realloc/free to test the C library. Of course a
real world program has big problems if it's doing so many mallocs and frees
that it gets slowed down by a dynamically linked malloc.

        ./speed -s 10 -c malloc_free malloc_realloc_free



SPEED PROGRAM EXTENSIONS

Potentially lots of things could be made available in the program, but it's
been limited to the things that have actually been wanted and are likely to
be reasonably useful in the future.

Extensions should be fairly easy to make, though. speed-ext.c is an
example, in a style that should suit one-off tests, or new code fragments
under development.
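
As a rough illustration only, a measuring routine conventionally takes a
parameter block, repeats the code under test enough times, and returns the
elapsed time. The struct and field names below, and the exact timing
calls, are assumptions from memory; check speed.h and speed-ext.c for the
real interface before copying anything.

    /* A hypothetical extension, not part of the distribution.  The names
       struct speed_params, s->reps, s->size, s->xp, speed_starttime and
       speed_endtime should all be checked against speed.h.  */

    double
    speed_my_limb_sum (struct speed_params *s)
    {
      static volatile mp_limb_t  sink;  /* defeat optimization */
      mp_limb_t  sum;
      mp_size_t  j;
      unsigned   i;

      speed_starttime ();
      i = s->reps;
      do
        {
          sum = 0;
          for (j = 0; j < s->size; j++)
            sum += s->xp[j];            /* the operation being timed */
          sink = sum;
        }
      while (--i != 0);

      return speed_endtime ();
    }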



THRESHOLD EXAMINING

The speed program can be used to examine the speeds of different algorithms
to check that the tune program has done the right thing. For example to
examine the karatsuba multiply threshold,

        ./speed -s 5-40 mpn_mul_basecase mpn_kara_mul_n

When examining the toom3 threshold, remember it depends on the karatsuba
threshold, so the right karatsuba threshold needs to be compiled into the
library first. The tune program uses special recompiled versions of
mpn/mul_n.c etc for this reason, but the speed program simply uses the
normal libgmp.la.
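
For example, assuming the toom3 entry point is known to the speed program
as mpn_toom3_mul_n (check the routine names it accepts),

        ./speed -s 20-300 -f 1.1 mpn_kara_mul_n mpn_toom3_mul_n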

Note further that the various routines may recurse into themselves on sizes
far enough above the applicable thresholds. For example, mpn_kara_mul_n
will recurse into itself on sizes greater than twice the compiled-in
KARATSUBA_MUL_THRESHOLD.

When doing the above comparison between mul_basecase and kara_mul_n, what's
probably of interest is mul_basecase versus a kara_mul_n that does one level
of Karatsuba then calls mul_basecase, but this only happens on sizes less
than twice the compiled-in KARATSUBA_MUL_THRESHOLD. A larger value for that
setting can be compiled in to avoid the problem if necessary. The same
applies to toom3 and BZ, though in a trickier fashion.

There are some upper limits on some of the thresholds, arising from arrays
dimensioned according to a threshold (mpn_mul_n), or asm code with
certain-sized displacements (some x86 versions of sqr_basecase). So putting
huge values for the thresholds, even just for testing, may fail.




THINGS AFFECTING THRESHOLDS

The following are some general notes on some things that can affect the
various algorithm thresholds.

KARATSUBA_MUL_THRESHOLD

    At size 2N, karatsuba does three NxN multiplies and some adds and
    shifts, compared to a 2Nx2N basecase multiply which will be roughly
    equivalent to four NxN multiplies.
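
    Roughly speaking, if M(N) is the time for an NxN basecase multiply and
    c*N the time for karatsuba's extra adds and shifts, karatsuba at size
    2N wins once

        3*M(N) + c*N < 4*M(N)

    i.e. once one NxN basecase multiply costs more than the extra linear
    work.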

    Fast mul - increases threshold

        If the CPU has a fast multiply, the basecase multiplies are going
        to stay faster than the karatsuba overheads for longer.
        Conversely, if the CPU has a slow multiply, the karatsuba method's
        trading of some multiplies for adds becomes worthwhile sooner.

        Remember it's "addmul" performance that's of interest here. This
        may differ from a simple "mul" instruction in the CPU. For example
        K6 has a 3 cycle mul but takes nearly 8 cycles/limb for an addmul,
        and K7 has a 6 cycle mul latency but has a 4 cycle/limb addmul due
        to pipelining.
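
        One way to see the mul versus addmul difference on your own CPU,
        assuming the speed program accepts these routine names, is

                ./speed -s 1-30 -c mpn_mul_1 mpn_addmul_1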

    Unrolled addmul - increases threshold

        If the CPU addmul routine (or the addmul part of the mul_basecase
        routine) is unrolled it can mean that a 2Nx2N multiply is a bit
        faster than four NxN multiplies, due to proportionally lower
        looping overheads. This can be thought of as the addmul warming to
        its task on bigger sizes, and keeping the basecase better than
        karatsuba for longer.

    Karatsuba overheads - increases threshold

        Fairly obviously, anything gained or lost in the karatsuba extra
        calculations will translate directly to the threshold. But
        remember the extra calculations are likely to always be a
        relatively small fraction of the total multiply time, and in that
        sense the basecase code is the best place to be looking for
        optimizations.

KARATSUBA_SQR_THRESHOLD

    Squaring is essentially the same as multiplying, so the above applies
    to squaring too. Fixed overheads will, proportionally, be bigger when
    squaring, usually leading to a higher threshold.

    mpn/generic/sqr_basecase.c

        This relies on a reasonable umul_ppmm, and if the generic C code is
        being used it may badly affect the speed. Don't bother paying
        attention to the square thresholds until you have either a good
        umul_ppmm or an assembler sqr_basecase.

TOOM3_MUL_THRESHOLD

    At size N, toom3 does five (N/3)x(N/3) multiplies and some extra
    calculations, compared to karatsuba doing three (N/2)x(N/2) multiplies
    and some extra calculations (fewer). Toom3 will become better before
    long, being O(n^1.465) versus karatsuba at O(n^1.585), but exactly
    where depends a great deal on the implementations of all the relevant
    bits of extra calculation.
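
    (These exponents come from the recursions: three multiplies of size N/2
    give log(3)/log(2), about 1.585, while five multiplies of size N/3 give
    log(5)/log(3), about 1.465.)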

    In practice the curves for time versus size on toom3 and karatsuba
    have similar slopes near their crossover, leading to a range of sizes
    where there's very little difference between the two. Choosing a
    single value from the range is a bit arbitrary and will lead to
    slightly different values on successive runs of the tune program.

    divexact_by3 - used by toom3

        Toom3 does a divexact_by3 which at size N is roughly equivalent to
        a chain of N dependent multiplies, with a further couple of extra
        instructions between each. CPUs with a low-latency multiply and a
        good divexact_by3 implementation should see the toom3 threshold
        lowered. But note this is unlikely to have much effect on total
        multiply times.
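
        The routine can be timed on its own, assuming the speed program in
        your build knows it under this name,

                ./speed -s 1-20 -c mpn_divexact_by3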

    Asymptotic behaviour

        At the fairly small sizes where the thresholds occur it's worth
        remembering that the asymptotic behaviour for karatsuba and toom3
        can't be expected to make accurate predictions, due of course to
        the big influence of all sorts of overheads, and the fact that only
        a few recursions of each are being performed.

        Even at large sizes there's a good chance machine-dependent effects
        like cache architecture will mean actual performance deviates from
        what might be predicted. This is why the rather positivist
        approach of just measuring things has been adopted, in general.

TOOM3_SQR_THRESHOLD

    The same factors apply to squaring as to multiplying, though with
    overheads being proportionally a bit bigger.

FFT_MUL_THRESHOLD, etc

    When configured with --enable-fft, a Fermat-style FFT is used for
    multiplication above FFT_MUL_THRESHOLD, and a further threshold
    FFT_MODF_MUL_THRESHOLD exists for where the FFT is used for a modulo
    2^N+1 multiply. FFT_MUL_TABLE gives the thresholds at which each split
    size "k" is used in the FFT.

    step effect - coarse grained thresholds

        The FFT has size restrictions that mean it rounds up sizes to
        certain multiples and therefore does the same amount of work for a
        range of different-sized operands. For example at k=8 the size is
        internally rounded to a multiple of 1024 limbs. The current single
        values for the various thresholds are set to give good average
        performance, but in the future multiple values might be wanted to
        take into account the different step sizes for different "k"s.

FFT_SQR_THRESHOLD, etc

    The same considerations apply as for multiplications, plus the
    following.

    similarity to mul thresholds

        On some CPUs the squaring thresholds are nearly the same as those
        for multiplying. It's not quite clear why this is; it might be due
        to similarly shaped size/time curves for the muls and sqrs recursed
        into.

BZ_THRESHOLD

    The B-Z division algorithm rearranges a traditional multi-precision
    long division so that NxN multiplies can be done rather than repeated
    Nx1 multiplies, thereby exploiting the algorithmic advantages of
    karatsuba and toom3, and leading to significant speedups.

    fast mul_basecase - decreases threshold

        CPUs with an optimized mul_basecase can expect a lower B-Z
        threshold due to the helping hand such a mul_basecase gives B-Z,
        as compared with the submul_1 used in the schoolbook method.

GCD_ACCEL_THRESHOLD

    Below this threshold a simple binary subtract-and-shift GCD is used;
    above it, Ken Weber's accelerated algorithm is used. The accelerated
    GCD performs far fewer steps than the binary GCD and will normally kick
    in at quite small sizes.

    modlimb_invert and find_a - affect threshold

        At small sizes the performance of modlimb_invert and find_a will
        affect the accelerated algorithm, and CPUs where those routines are
        not well optimized may see a higher threshold. (At large sizes
        mpn_addmul_1 and mpn_submul_1 come to dominate the accelerated
        algorithm.)

GCDEXT_THRESHOLD

    mpn/generic/gcdext.c is based on Lehmer's multi-step improvement of
    Euclid's algorithm. The multipliers are found using single limb
    calculations below GCDEXT_THRESHOLD, or double limb calculations
    above. The single limb code is fast but doesn't produce full-limb
    multipliers.

    data-dependent multiplier - big threshold

        If multiplications done by mpn_mul_1, addmul_1 and submul_1 run
        slower when there are more bits in the multiplier, then producing
        bigger multipliers with the double limb calculation doesn't save
        much more than some looping and function call overheads. A large
        threshold can then be expected.

    slow division - low threshold

        The single limb calculation does some plain "/" divisions, whereas
        the double limb calculation has a divide routine optimized for the
        small quotients that often occur. Until the single limb code does
        something similar, a slow hardware divide will count against it.




FUTURE

Make a program to check that the time base is working properly, for small
and large measurements. Make it able to test each available method,
including perhaps the apparent resolution of each.

Add versions of the toom3 multiplication using either the mpn calls or the
open-coded style, so the two can be compared.

Add versions of the generic C mpn_divrem_1 using straight division versus a
multiply by inverse, so the two can be compared. Include the branch-free
version of multiply by inverse too.

Make an option in struct speed_parameters to specify operand overlap,
perhaps 0 for none, 1 for dst=src1, 2 for dst=src2, 3 for dst1=src1
dst2=src2, 4 for dst1=src2 dst2=src1. This is done for addsub_n with the r
parameter (though addsub_n isn't yet enabled), and could be done for add_n,
xor_n, etc too.

When speed_measure() divides the total time measured by the repetitions
performed, it also divides the fixed overheads imposed by speed_starttime()
and speed_endtime(). When different routines are run with different
repetition counts the overhead is then counted differently. It would
improve precision to try to avoid this. Currently the idea is just to set
speed_precision big enough that the effect is insignificant compared to the
routines being measured.



----------------
Local variables:
mode: text
fill-column: 76
End: