Annotation of OpenXM_contrib/gmp/tune/README, Revision 1.1.1.2
1.1.1.2 ! ohara 1: Copyright 2000, 2001, 2002 Free Software Foundation, Inc.
! 2:
! 3: This file is part of the GNU MP Library.
! 4:
! 5: The GNU MP Library is free software; you can redistribute it and/or modify
! 6: it under the terms of the GNU Lesser General Public License as published by
! 7: the Free Software Foundation; either version 2.1 of the License, or (at your
! 8: option) any later version.
! 9:
! 10: The GNU MP Library is distributed in the hope that it will be useful, but
! 11: WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
! 12: or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public
! 13: License for more details.
! 14:
! 15: You should have received a copy of the GNU Lesser General Public License
! 16: along with the GNU MP Library; see the file COPYING.LIB. If not, write to
! 17: the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
! 18: 02111-1307, USA.
! 19:
! 20:
! 21:
! 22:
1.1 maekawa 23:
24: GMP SPEED MEASURING AND PARAMETER TUNING
25:
26:
1.1.1.2 ! ohara 27: The programs in this directory are for knowledgeable users who want to
! 28: measure GMP routines on their machine, and perhaps tweak some settings or
! 29: identify things that can be improved.
1.1 maekawa 30:
31: The programs here are tools, not ready to run solutions. Nothing is built
32: in a normal "make all", but various Makefile targets described below exist.
33:
34: Relatively few systems and CPUs have been tested, so be sure to verify that
1.1.1.2 ! ohara 35: results are sensible before relying on them.
1.1 maekawa 36:
37:
38:
39:
40: MISCELLANEOUS NOTES
41:
1.1.1.2 ! ohara 42: --enable-assert
! 43:
! 44: Don't configure with --enable-assert, since the extra code added by
! 45: assertion checking may influence measurements.
! 46:
! 47: Direct mapped caches
! 48:
! 49: Some effort has been made to accommodate CPUs with direct mapped caches,
! 50: by putting data blocks more or less contiguously on the stack. But this
! 51: will depend on TMP_ALLOC using alloca, and even then it may or may not
! 52: be enough.
! 53:
! 54: FreeBSD 4.2 i486 getrusage
! 55:
! 56: This getrusage seems to be a bit doubtful, it looks like it's
! 57: microsecond accurate, but sometimes ru_utime remains unchanged after a
! 58: time of many microseconds has elapsed. It'd be good to detect this in
! 59: the time.c initializations, but for now the suggestion is to pretend it
! 60: doesn't exist.
! 61:
! 62: ./configure ac_cv_func_getrusage=no
! 63:
! 64: NetBSD 1.4.1 m68k macintosh time base
1.1 maekawa 65:
1.1.1.2 ! ohara 66: On this system it's been found getrusage often goes backwards, making it
! 67: unusable (configure is setup to ignore it). gettimeofday sometimes
! 68: doesn't update atomically when it crosses a 1 second boundary. Not sure
! 69: what to do about this. Expect intermittent failures.
1.1 maekawa 70:
1.1.1.2 ! ohara 71: SCO OpenUNIX 8 /etc/hw
! 72:
! 73: /etc/hw takes about a second to return the cpu frequency, which suggests
! 74: perhaps it's measuring each time it runs. If this is annoying when
! 75: running the speed program repeatedly then set a GMP_CPU_FREQUENCY
! 76: environment variable (see TIME BASE section below).
! 77:
! 78: Low resolution timebase
! 79:
! 80: Parameter tuning can be very time consuming if the only timebase
! 81: available is a 10 millisecond clock tick, to the point of being
! 82: unusable. This is currently the case on VAX and ARM systems.
1.1 maekawa 83:
84:
85:
86:
87: PARAMETER TUNING
88:
89: The "tuneup" program runs some tests designed to find the best settings for
1.1.1.2 ! ohara 90: various thresholds, like MUL_KARATSUBA_THRESHOLD. Its output can be put
! 91: into gmp-mparam.h. The program is built and run with
1.1 maekawa 92:
93: make tune
94:
95: If the thresholds indicated are grossly different from the values in the
1.1.1.2 ! ohara 96: selected gmp-mparam.h then there may be a performance boost in applicable
! 97: size ranges by changing gmp-mparam.h accordingly.
1.1 maekawa 98:
1.1.1.2 ! ohara 99: Be sure to do a full reconfigure and rebuild to get any newly set thresholds
! 100: to take effect. A partial rebuild is enough sometimes, but a fresh
! 101: configure and make is certain to be correct.
! 102:
! 103: If a CPU has specific tuned parameters coming from a gmp-mparam.h in one of
! 104: the mpn subdirectories then the values from "make tune" should be similar.
! 105: But check that the configured CPU is right and there are no machine specific
! 106: effects causing a difference.
1.1 maekawa 107:
108: It's hoped the compiler and options used won't have too much effect on
109: thresholds, since for most CPUs they ultimately come down to comparisons
110: between assembler subroutines. Missing out on the longlong.h macros by not
111: using gcc will probably have an effect.
112:
113: Some thresholds produced by the tune program are merely single values chosen
1.1.1.2 ! ohara 114: from what's a range of sizes where two algorithms are pretty much the same
! 115: speed. When this happens the program is likely to give somewhat different
! 116: values on successive runs. This is noticeable on the toom3 thresholds for
! 117: instance.
1.1 maekawa 118:
119:
120:
121:
122: SPEED PROGRAM
123:
124: The "speed" program can be used for measuring and comparing various
125: routines, and producing tables of data or gnuplot graphs. Compile it with
126:
127: make speed
128:
1.1.1.2 ! ohara 129: (Or on DOS systems "make speed.exe".)
! 130:
1.1 maekawa 131: Here are some examples of how to use it. Check the code for all the
132: options.
133:
134: Draw a graph of mpn_mul_n, stepping through sizes by 10 or a factor of 1.05
135: (whichever is greater).
136:
137: ./speed -s 10-5000 -t 10 -f 1.05 -P foo mpn_mul_n
138: gnuplot foo.gnuplot
139:
1.1.1.2 ! ohara 140: Compare mpn_add_n and an mpn_lshift by 1, showing times in cycles and
! 141: showing under mpn_lshift the difference between it and mpn_add_n.
1.1 maekawa 142:
143: ./speed -s 1-40 -c -d mpn_add_n mpn_lshift.1
144:
145: Using option -c for times in cycles is interesting but normally only
146: necessary when looking carefully at assembler subroutines. You might think
147: it would always give an integer value, but this doesn't happen in practice,
148: probably due to overheads in the time measurements.
149:
150: In the free-form output the "#" symbol against a measurement means the
151: corresponding routine is fastest at that size. This is a convenient visual
152: cue when comparing different routines. The graph data files <name>.data
153: don't get this since it would upset gnuplot or other data viewers.
154:
155:
156:
157:
158: TIME BASE
159:
160: The time measuring method is determined in time.c, based on what the
1.1.1.2 ! ohara 161: configured host has available. A cycle counter is preferred, possibly
! 162: supplemented by another method if the counter has a limited range. A
! 163: microsecond accurate getrusage() or gettimeofday() will work quite well too.
! 164:
! 165: The cycle counters (except possibly on alpha) and gettimeofday() will depend
! 166: on the machine being otherwise idle, or rather on other jobs not stealing
! 167: CPU time from the measuring program. Short routines (those that complete
! 168: within a timeslice) should work even on a busy machine.
! 169:
! 170: Some trouble is taken by speed_measure() in common.c to avoid ill effects
! 171: from sporadic interrupts, or other intermittent things (like cron waking up
! 172: every minute). But generally an idle machine will be necessary to be
! 173: certain of consistent results.
! 174:
! 175: The CPU frequency is needed to convert between cycles and seconds, or for
! 176: when a cycle counter is supplemented by getrusage() etc. The speed program
! 177: will convert as necessary according to the output format requested. The
! 178: tune program will work with either cycles or seconds.
! 179:
! 180: freq.c knows how to get the frequency on some systems, or can measure a
! 181: cycle counter against gettimeofday() or getrusage(), but when that fails, or
! 182: needs to be overridden, an environment variable GMP_CPU_FREQUENCY can be
! 183: used (in Hertz). For example in "bash" on a 650 MHz machine,
1.1 maekawa 184:
185: export GMP_CPU_FREQUENCY=650e6
186:
187: A high precision time base makes it possible to get accurate measurements in
1.1.1.2 ! ohara 188: a shorter time.
1.1 maekawa 189:
190:
191:
192:
1.1.1.2 ! ohara 193: EXAMPLE COMPARISONS - VARIOUS
1.1 maekawa 194:
1.1.1.2 ! ohara 195: Here are some ideas for things that can be done with the speed program.
1.1 maekawa 196:
197: There's always going to be a certain amount of overhead in the time
198: measurements, due to reading the time base, and in the loop that runs a
199: routine enough times to get a reading of the desired precision. Noop
200: functions taking various arguments are available to measure this. The
201: "overhead" printed by the speed program each time in its intro is the "noop"
202: routine, but note that this is just for information, it isn't deducted from
203: the times printed or anything.
204:
205: ./speed -s 1 noop noop_wxs noop_wxys
206:
1.1.1.2 ! ohara 207: To see how many cycles per limb a routine is taking, look at the time
! 208: increase when the size increments, using option -D. This avoids fixed
! 209: overheads in the measuring. Also, remember many of the assembler routines
! 210: have unrolled loops, so it might be necessary to compare times at, say, 16,
! 211: 32, 48, 64 etc to see what the unrolled part is taking, as opposed to any
! 212: finishing off.
1.1 maekawa 213:
214: ./speed -s 16-64 -t 16 -C -D mpn_add_n
215:
216: The -C option on its own gives cycles per limb, but is really only useful at
217: big sizes where fixed overheads are small compared to the code doing the
218: real work. Remember of course memory caching and/or page swapping will
219: affect results at large sizes.
220:
221: ./speed -s 500000 -C mpn_add_n
222:
223: Once a calculation stops fitting in the CPU data cache, it's going to start
224: taking longer. Exactly where this happens depends on the cache priming in
225: the measuring routines, and on what sort of "least recently used" the
226: hardware does. Here's an example for a CPU with a 16kbyte L1 data cache and
227: 32-bit limb, showing a suddenly steeper curve for mpn_add_n at about 2000
228: limbs.
229:
230: ./speed -s 1-4000 -t 5 -f 1.02 -P foo mpn_add_n
231: gnuplot foo.gnuplot
232:
233: When a routine has an unrolled loop for, say, multiples of 8 limbs and then
234: an ordinary loop for the remainder, it can happen that it's actually faster
1.1.1.2 ! ohara 235: to do an operation on, say, 8 limbs than it is on 7 limbs. The following
! 236: draws a graph of mpn_sub_n, to see whether times smoothly increase with
! 237: size.
1.1 maekawa 238:
239: ./speed -s 1-100 -c -P foo mpn_sub_n
240: gnuplot foo.gnuplot
241:
1.1.1.2 ! ohara 242: If mpn_lshift and mpn_rshift have special case code for shifts by 1, it
! 243: ought to be faster (or at least not slower) than shifting by, say, 2 bits.
1.1 maekawa 244:
245: ./speed -s 1-200 -c mpn_rshift.1 mpn_rshift.2
246:
247: An mpn_lshift by 1 can be done by mpn_add_n adding a number to itself, and
248: if the lshift isn't faster there's an obvious improvement that's possible.
249:
250: ./speed -s 1-200 -c mpn_lshift.1 mpn_add_n_self
251:
252: On some CPUs (AMD K6 for example) an "in-place" mpn_add_n where the
253: destination is one of the sources is faster than a separate destination.
1.1.1.2 ! ohara 254: Here's an example to see this. ".1" selects dst==src1 for mpn_add_n (and
! 255: mpn_sub_n), for other values see speed.h SPEED_ROUTINE_MPN_BINARY_N_CALL.
1.1 maekawa 256:
1.1.1.2 ! ohara 257: ./speed -s 1-200 -c mpn_add_n mpn_add_n.1
1.1 maekawa 258:
1.1.1.2 ! ohara 259: The gmp manual points out that divisions by powers of two should be done
! 260: using a right shift because it'll be significantly faster than an actual
! 261: division. The following shows by what factor mpn_rshift is faster than
! 262: mpn_divrem_1, using division by 32 as an example.
1.1 maekawa 263:
264: ./speed -s 10-20 -r mpn_rshift.5 mpn_divrem_1.32
265:
1.1.1.2 ! ohara 266:
! 267:
! 268:
! 269: EXAMPLE COMPARISONS - MULTIPLICATION
! 270:
! 271: mul_basecase takes a ".<r>" parameter which is the first (larger) size
1.1 maekawa 272: parameter. For example to show speeds for 20x1 up to 20x15 in cycles,
273:
274: ./speed -s 1-15 -c mpn_mul_basecase.20
275:
276: mul_basecase with no parameter does an NxN multiply, so for example to show
277: speeds in cycles for 1x1, 2x2, 3x3, etc, up to 20x20, in cycles,
278:
279: ./speed -s 1-20 -c mpn_mul_basecase
280:
281: sqr_basecase is implemented by a "triangular" method on most CPUs, making it
282: up to twice as fast as mul_basecase. In practice loop overheads and the
283: products on the diagonal mean it falls short of this. Here's an example
284: running the two and showing by what factor an NxN mul_basecase is slower
285: than an NxN sqr_basecase. (Some versions of sqr_basecase only allow sizes
1.1.1.2 ! ohara 286: below SQR_KARATSUBA_THRESHOLD, so if it crashes at that point don't worry.)
1.1 maekawa 287:
288: ./speed -s 1-20 -r mpn_sqr_basecase mpn_mul_basecase
289:
290: The technique described above with -CD for showing the time difference in
291: cycles per limb between two size operations can be done on an NxN
292: mul_basecase using -E to change the basis for the size increment to N*N.
293: For instance a 20x20 operation is taken to be doing 400 limbs, and a 16x16
294: doing 256 limbs. The following therefore shows the per crossproduct speed
295: of mul_basecase and sqr_basecase at around 20x20 limbs.
296:
297: ./speed -s 16-20 -t 4 -CDE mpn_mul_basecase mpn_sqr_basecase
298:
299: Of course sqr_basecase isn't really doing NxN crossproducts, but it can be
300: interesting to compare it to mul_basecase as if it was. For sqr_basecase
301: the -F option can be used to base the deltas on N*(N+1)/2 operations, which
302: is the triangular products sqr_basecase does. For example,
303:
304: ./speed -s 16-20 -t 4 -CDF mpn_sqr_basecase
305:
306: Both -E and -F are preliminary and might change. A consistent approach to
307: using them when claiming certain per crossproduct or per triangularproduct
308: speeds hasn't really been established, but the increment between speeds in
309: the range karatsuba will call seems sensible, that being k to k/2. For
310: instance, if the karatsuba threshold was 20 for the multiply and 30 for the
311: square,
312:
313: ./speed -s 10-20 -t 10 -CDE mpn_mul_basecase
314: ./speed -s 15-30 -t 15 -CDF mpn_sqr_basecase
315:
1.1.1.2 ! ohara 316: Two versions of toom3 interpolation and evaluation are available in
! 317: mpn/generic/mul_n.c, using either a one-pass open-coded style or simple mpn
! 318: subroutine calls. The former is used on RISCs with lots of registers, the
! 319: latter on other CPUs. The two can be compared directly to check which is
! 320: best. Naturally it's sizes where toom3 is faster than karatsuba that are of
! 321: interest.
! 322:
! 323: ./speed -s 80-120 -c mpn_toom3_mul_n_mpn mpn_toom3_mul_n_open
! 324: ./speed -s 80-120 -c mpn_toom3_sqr_n_mpn mpn_toom3_sqr_n_open
! 325:
! 326:
! 327:
! 328:
! 329: EXAMPLE COMPARISONS - MALLOC
! 330:
1.1 maekawa 331: The gmp manual recommends application programs avoid excessive initializing
332: and clearing of mpz_t variables (and mpq_t and mpf_t too). Every new
333: variable will at a minimum go through an init, a realloc for its first
334: store, and finally a clear. Quite how long that takes depends on the C
335: library. The following compares an mpz_init/realloc/clear to a 10 limb
1.1.1.2 ! ohara 336: mpz_add. Don't be surprised if the mallocing is quite slow.
1.1 maekawa 337:
338: ./speed -s 10 -c mpz_init_realloc_clear mpz_add
339:
1.1.1.2 ! ohara 340: On some systems malloc and free are much slower when dynamic linked. The
! 341: speed-dynamic program can be used to see this. For example the following
! 342: measures malloc/free, first static then dynamic.
! 343:
! 344: ./speed -s 10 -c malloc_free
! 345: ./speed-dynamic -s 10 -c malloc_free
! 346:
! 347: Of course a real world program has big problems if it's doing so many
! 348: mallocs and frees that it gets slowed down by a dynamic linked malloc.
! 349:
! 350:
! 351:
! 352:
! 353:
! 354: EXAMPLE COMPARISONS - STRING CONVERSIONS
! 355:
! 356: mpn_get_str does a binary to string conversion. The base is specified with
! 357: a ".<r>" parameter, or decimal by default. Power of 2 bases are much faster
! 358: than general bases. The following compares decimal and hex for instance.
! 359:
! 360: ./speed -s 1-20 -c mpn_get_str mpn_get_str.16
! 361:
! 362: Smaller bases need more divisions to split a given size number, and so are
! 363: slower. The following compares base 3 and base 9. On small operands 9 will
! 364: be nearly twice as fast, though at bigger sizes this reduces since in the
! 365: current implementation both divide repeatedly by 3^20 (or 3^40 for 64 bit
! 366: limbs) and those divisions come to dominate.
1.1 maekawa 367:
1.1.1.2 ! ohara 368: ./speed -s 1-20 -cr mpn_get_str.3 mpn_get_str.9
! 369:
! 370: mpn_set_str does a string to binary conversion. The base is specified with
! 371: a ".<r>" parameter, or decimal by default. Power of 2 bases are faster than
! 372: general bases on large conversions.
! 373:
! 374: ./speed -s 1-512 -f 2 -c mpn_set_str.8 mpn_set_str.10
! 375:
! 376: mpn_set_str also has some special case code for decimal which is a bit
! 377: faster than the general case, basically by giving the compiler a chance to
! 378: optimize some multiplications by 10.
! 379:
! 380: ./speed -s 20-40 -c mpn_set_str.9 mpn_set_str.10 mpn_set_str.11
! 381:
! 382:
! 383:
! 384:
! 385: EXAMPLE COMPARISONS - GCDs
! 386:
! 387: mpn_gcd_1 has a threshold for when to reduce using an initial x%y when both
! 388: x and y are single limbs. This isn't tuned currently, but a value can be
! 389: established by a measurement like
! 390:
! 391: ./speed -s 10-32 mpn_gcd_1.10
! 392:
! 393: This runs src[0] from 10 to 32 bits, and y fixed at 10 bits. If the div
! 394: threshold is high, say 31 so it's effectively disabled then a 32x10 bit gcd
! 395: is done by nibbling away at the 32-bit operands bit-by-bit. When the
! 396: threshold is small, say 1 bit, then an initial x%y is done to reduce it to a
! 397: 10x10 bit operation.
! 398:
! 399: The threshold in mpn/generic/gcd_1.c or the various assembler
! 400: implementations can be tweaked up or down until there's no more speedups on
! 401: interesting combinations of sizes. Note that this affects only a 1x1 limb
! 402: operation and so isn't very important. (An Nx1 limb operation always does
! 403: an initial modular reduction, using mpn_mod_1 or mpn_modexact_1_odd.)
1.1 maekawa 404:
405:
406:
407:
408: SPEED PROGRAM EXTENSIONS
409:
410: Potentially lots of things could be made available in the program, but it's
411: been left at only the things that have actually been wanted and are likely
412: to be reasonably useful in the future.
413:
414: Extensions should be fairly easy to make though. speed-ext.c is an example,
415: in a style that should suit one-off tests, or new code fragments under
416: development.
417:
1.1.1.2 ! ohara 418: many.pl is a script for generating a new speed program supplemented with
! 419: alternate versions of the standard routines. It can be used for measuring
! 420: experimental code, or for comparing different implementations that exist
! 421: within a CPU family.
! 422:
1.1 maekawa 423:
424:
425:
426: THRESHOLD EXAMINING
427:
428: The speed program can be used to examine the speeds of different algorithms
429: to check the tune program has done the right thing. For example to examine
430: the karatsuba multiply threshold,
431:
432: ./speed -s 5-40 mpn_mul_basecase mpn_kara_mul_n
433:
434: When examining the toom3 threshold, remember it depends on the karatsuba
435: threshold, so the right karatsuba threshold needs to be compiled into the
1.1.1.2 ! ohara 436: library first. The tune program uses specially recompiled versions of
1.1 maekawa 437: mpn/mul_n.c etc for this reason, but the speed program simply uses the
438: normal libgmp.la.
439:
440: Note further that the various routines may recurse into themselves on sizes
441: far enough above applicable thresholds. For example, mpn_kara_mul_n will
442: recurse into itself on sizes greater than twice the compiled-in
1.1.1.2 ! ohara 443: MUL_KARATSUBA_THRESHOLD.
1.1 maekawa 444:
445: When doing the above comparison between mul_basecase and kara_mul_n what's
446: probably of interest is mul_basecase versus a kara_mul_n that does one level
447: of Karatsuba then calls to mul_basecase, but this only happens on sizes less
1.1.1.2 ! ohara 448: than twice the compiled MUL_KARATSUBA_THRESHOLD. A larger value for that
1.1 maekawa 449: setting can be compiled-in to avoid the problem if necessary. The same
1.1.1.2 ! ohara 450: applies to toom3 and DC, though in a trickier fashion.
1.1 maekawa 451:
452: There are some upper limits on some of the thresholds, arising from arrays
453: dimensioned according to a threshold (mpn_mul_n), or asm code with certain
454: sized displacements (some x86 versions of sqr_basecase). So putting huge
455: values for the thresholds, even just for testing, may fail.
456:
457:
458:
459:
460: FUTURE
461:
462: Make a program to check the time base is working properly, for small and
463: large measurements. Make it able to test each available method, including
464: perhaps the apparent resolution of each.
465:
1.1.1.2 ! ohara 466: Make a general mechanism for specifying operand overlap, and a syntax like
! 467: maybe "mpn_add_n.dst=src2" to select it. Some measuring routines do this
! 468: sort of thing with the "r" parameter currently.
1.1 maekawa 469:
470:
471:
472: ----------------
473: Local variables:
474: mode: text
475: fill-column: 76
476: End:
FreeBSD-CVSweb <freebsd-cvsweb@FreeBSD.org>