Annotation of OpenXM_contrib/gmp/tune/README, Revision 1.1

		GMP SPEED MEASURING AND PARAMETER TUNING


The programs in this directory are for knowledgeable users who want to make
measurements of the speed of GMP routines on their machine, and perhaps
tweak some settings or identify things that can be improved.

The programs here are tools, not ready-to-run solutions.  Nothing is built
in a normal "make all", but various Makefile targets described below exist.

Relatively few systems and CPUs have been tested, so be sure to verify that
you're getting sensible results before relying on them.



MISCELLANEOUS NOTES

Don't configure with --enable-assert when using the things here, since the
extra code added by assertion checking may influence measurements.

Some effort has been made to accommodate CPUs with direct mapped caches, but
it will depend on TMP_ALLOC using a proper alloca, and even then it may or
may not be enough.

The sparc32/v9 addmul_1 code runs at noticeably different speeds on
successive sizes, and this has a bad effect on the tune program's
determinations of the multiply and square thresholds.



PARAMETER TUNING

The "tuneup" program runs some tests designed to find the best settings for
various thresholds, like KARATSUBA_MUL_THRESHOLD.  Its output can be put
into gmp-mparam.h.  The program can be built and run with

	make tune

If the thresholds indicated are grossly different from the values in the
selected gmp-mparam.h then you may get a performance boost in relevant size
ranges by changing gmp-mparam.h accordingly.

If your CPU has specific tuned parameters coming from a gmp-mparam.h in one
of the mpn subdirectories then the values from "make tune" should be
similar.  You can submit new values if it looks like the current ones are
out of date or wildly wrong.  But check you're on the right CPU target and
there aren't any machine-specific effects causing a difference.

It's hoped the compiler and options used won't have too much effect on
thresholds, since for most CPUs they ultimately come down to comparisons
between assembler subroutines.  Missing out on the longlong.h macros by not
using gcc will probably have an effect, though.

Some thresholds produced by the tune program are merely single values chosen
from what's actually a range of sizes where two algorithms are pretty much
the same speed.  When this happens the program is likely to give slightly
different values on successive runs.  This is noticeable on the toom3
thresholds, for instance.



SPEED PROGRAM

The "speed" program can be used for measuring and comparing various
routines, and producing tables of data or gnuplot graphs.  Compile it with

	make speed

Here are some examples of how to use it.  Check the code for all the
options.

Draw a graph of mpn_mul_n, stepping through sizes by 10 or a factor of 1.05
(whichever is greater).

	./speed -s 10-5000 -t 10 -f 1.05 -P foo mpn_mul_n
	gnuplot foo.gnuplot

Compare mpn_add_n and mpn_lshift by 1, showing times in cycles and showing
under mpn_lshift the difference between it and mpn_add_n.

	./speed -s 1-40 -c -d mpn_add_n mpn_lshift.1

Using option -c for times in cycles is interesting but normally only
necessary when looking carefully at assembler subroutines.  You might think
it would always give an integer value, but this doesn't happen in practice,
probably due to overheads in the time measurements.

In the free-form output the "#" symbol against a measurement means the
corresponding routine is fastest at that size.  This is a convenient visual
cue when comparing different routines.  The graph data files <name>.data
don't get this since it would upset gnuplot or other data viewers.



TIME BASE

The time measuring method is determined in time.c, based on what the
configured target has available.  A microsecond-accurate gettimeofday()
will work well, but there's code to use better methods, such as the cycle
counters on various CPUs.

Currently, all methods except possibly the alpha cycle counter depend on the
machine being otherwise idle, or rather on other jobs not stealing CPU time
from the measuring program.  Short routines (that complete within a
timeslice) should work even on a busy machine.  Some trouble is taken by
speed_measure() in common.c to avoid the ill effects of sporadic interrupts,
or other intermittent things (like cron waking up every minute).  But
generally you'll want an idle machine to be sure of consistent results.

The CPU frequency is needed if times in cycles are to be displayed, and it's
always needed when using a cycle counter time base.  time.c knows how to get
the frequency on some systems, but when that fails, or needs to be
overridden, the environment variable GMP_CPU_FREQUENCY can be used (in
Hertz).  For example in "bash" on a 650 MHz machine,

	export GMP_CPU_FREQUENCY=650e6

A high precision time base makes it possible to get accurate measurements in
a shorter time.  Support for systems and CPUs not already covered is wanted.

When setting up a method, be sure not to claim a higher accuracy than is
really available.  For example the default gettimeofday() code is set for
microsecond accuracy, but if only 10 ms or 55 ms resolution is available
then inconsistent results can be expected.



EXAMPLE COMPARISONS

Here are some ideas for things you can do with the speed program.

There's always going to be a certain amount of overhead in the time
measurements, due to reading the time base, and in the loop that runs a
routine enough times to get a reading of the desired precision.  Noop
functions taking various arguments are available to measure this.  The
"overhead" printed by the speed program in its intro each time is that of
the "noop" routine, but note that this is just for information; it isn't
deducted from the times printed.

	./speed -s 1 noop noop_wxs noop_wxys

If you want to know how many cycles per limb a routine is taking, look at
the time increase when the size increments, using option -D.  This avoids
fixed overheads in the measuring.  Also, remember many of the assembler
routines have unrolled loops, so it might be necessary to compare times at,
say, 16, 32, 48, 64 etc to see what the unrolled part is taking, as opposed
to any finishing off.

	./speed -s 16-64 -t 16 -C -D mpn_add_n
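
The differencing that -D does can be sketched numerically.  Assuming a
linear cost model t(n) = overhead + c*n (the 50 and 2.25 below are made-up
illustrative numbers, not measurements), the slope between two sizes
recovers the per-limb cost c while the fixed overhead cancels out:

```python
def per_limb_cycles(t1, n1, t2, n2):
    """Estimate cycles per limb from two timings, as speed's -D option
    does: differencing cancels the fixed per-call overhead."""
    return (t2 - t1) / (n2 - n1)

# Hypothetical linear cost model: 50 cycles fixed overhead plus
# 2.25 cycles per limb (made-up numbers, for illustration only).
def t(n):
    return 50 + 2.25 * n

print(per_limb_cycles(t(16), 16, t(32), 32))  # slope: 2.25
```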

The -C option on its own gives cycles per limb, but is really only useful
at big sizes, where fixed overheads are small compared to the code doing the
real work.  Remember of course that memory caching and/or page swapping will
affect results at large sizes.

	./speed -s 500000 -C mpn_add_n

Once a calculation stops fitting in the CPU data cache, it's going to start
taking longer.  Exactly where this happens depends on the cache priming in
the measuring routines, and on what sort of "least recently used"
replacement the hardware does.  Here's an example for a CPU with a 16 kbyte
L1 data cache and 32-bit limbs, showing a suddenly steeper curve for
mpn_add_n at about 2000 limbs.

	./speed -s 1-4000 -t 5 -f 1.02 -P foo mpn_add_n
	gnuplot foo.gnuplot

When a routine has an unrolled loop for, say, multiples of 8 limbs and then
an ordinary loop for the remainder, it can happen that it's actually faster
to do an operation on, say, 8 limbs than it is on 7 limbs.  Here's an
example drawing a graph of mpn_sub_n, which you can look at to see whether
times increase smoothly with size.

	./speed -s 1-100 -c -P foo mpn_sub_n
	gnuplot foo.gnuplot

If mpn_lshift and mpn_rshift for your CPU have special case code for shifts
by 1, it ought to be faster (or at least not slower) than shifting by, say,
2 bits.

	./speed -s 1-200 -c mpn_rshift.1 mpn_rshift.2

An mpn_lshift by 1 can be done by mpn_add_n adding a number to itself, and
if the lshift isn't faster there's an obvious improvement that's possible.

	./speed -s 1-200 -c mpn_lshift.1 mpn_add_n_self
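
The identity this comparison relies on — a left shift by one bit is the
same as adding a number to itself — can be checked in a couple of lines
(Python integers standing in for multi-limb mpn operands):

```python
# Shifting left by one bit doubles a number, which is exactly what
# adding the number to itself does -- hence mpn_lshift.1 can be
# compared directly against mpn_add_n with both sources the same.
x = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  # a 128-bit "two-limb" operand
assert x << 1 == x + x
```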

On some CPUs (AMD K6 for example) an "in-place" mpn_add_n, where the
destination is one of the sources, is faster than with a separate
destination.  Here's an example to see this.  (mpn_add_n_inplace is a
special measuring routine, not available for other operations.)

	./speed -s 1-200 -c mpn_add_n mpn_add_n_inplace

The gmp manual recommends that divisions by powers of two be done using a
right shift, because it'll be significantly faster.  Here's how you can see
by what factor mpn_rshift is faster, using division by 32 (a shift by 5
bits) as an example.

	./speed -s 10-20 -r mpn_rshift.5 mpn_divrem_1.32
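
The underlying identity: a right shift by k bits is a division by 2^k, with
the k bits shifted out forming the remainder (sketched here with Python
integers standing in for multi-limb operands):

```python
# Right shift by 5 == divide by 32: the shift gives the quotient, and
# the 5 low bits shifted out are the remainder.
n = 987654321
q, r = n >> 5, n & 31
assert (q, r) == divmod(n, 32)
```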

mul_basecase takes an "r" parameter that's the first (larger) size
parameter.  For example to show speeds for 20x1 up to 20x15 in cycles,

	./speed -s 1-15 -c mpn_mul_basecase.20

mul_basecase with no parameter does an NxN multiply, so for example to show
speeds in cycles for 1x1, 2x2, 3x3, etc, up to 20x20,

	./speed -s 1-20 -c mpn_mul_basecase

sqr_basecase is implemented by a "triangular" method on most CPUs, making it
up to twice as fast as mul_basecase.  In practice loop overheads and the
products on the diagonal mean it falls short of this.  Here's an example
running the two and showing by what factor an NxN mul_basecase is slower
than an NxN sqr_basecase.  (Some versions of sqr_basecase only allow sizes
below KARATSUBA_SQR_THRESHOLD, so if it crashes at that point don't worry.)

	./speed -s 1-20 -r mpn_sqr_basecase mpn_mul_basecase

The technique described above with -CD for showing the time difference in
cycles per limb between two operation sizes can be applied to an NxN
mul_basecase using -E to change the basis for the size increment to N*N.
For instance a 20x20 operation is taken to be doing 400 limbs, and a 16x16
doing 256 limbs.  The following therefore shows the per-crossproduct speed
of mul_basecase and sqr_basecase at around 20x20 limbs.

	./speed -s 16-20 -t 4 -CDE mpn_mul_basecase mpn_sqr_basecase

Of course sqr_basecase isn't really doing NxN crossproducts, but it can be
interesting to compare it to mul_basecase as if it were.  For sqr_basecase
the -F option can be used to base the deltas on N*(N+1)/2 operations, which
is the number of triangular products sqr_basecase does.  For example,

	./speed -s 16-20 -t 4 -CDF mpn_sqr_basecase

Both -E and -F are preliminary and might change.  A consistent approach to
using them when claiming certain per-crossproduct or per-triangular-product
speeds hasn't really been established, but taking the increment between the
sizes karatsuba will actually call, namely k/2 to k, seems sensible.  For
instance, if the karatsuba threshold were 20 for the multiply and 30 for the
square,

	./speed -s 10-20 -t 10 -CDE mpn_mul_basecase
	./speed -s 15-30 -t 15 -CDF mpn_sqr_basecase

The gmp manual recommends that application programs avoid excessive
initializing and clearing of mpz_t variables (and mpq_t and mpf_t too).
Every new variable will at a minimum go through an init, a realloc for its
first store, and finally a clear.  Quite how long that takes depends on the
C library.  The following compares an mpz_init/realloc/clear to a 10 limb
mpz_add.

	./speed -s 10 -c mpz_init_realloc_clear mpz_add

The normal libtool link of the speed program does a static link to libgmp.la
and libspeed.la, but will end up dynamically linked to libc.  Depending on
the system, a dynamically linked malloc may be noticeably slower than a
statically linked one, and you may want to re-run the libtool link
invocation to static link libc for comparison.  The example below does a 10
limb malloc/free or malloc/realloc/free to test the C library.  Of course a
real world program has big problems if it's doing so many mallocs and frees
that it gets slowed down by a dynamically linked malloc.

	./speed -s 10 -c malloc_free malloc_realloc_free



SPEED PROGRAM EXTENSIONS

Potentially lots of things could be made available in the program, but it's
been left at only the things that have actually been wanted and are likely
to be reasonably useful in the future.

Extensions should be fairly easy to make, though.  speed-ext.c is an
example, in a style that should suit one-off tests, or new code fragments
under development.



THRESHOLD EXAMINING

The speed program can be used to examine the speeds of different algorithms
to check that the tune program has done the right thing.  For example to
examine the karatsuba multiply threshold,

	./speed -s 5-40 mpn_mul_basecase mpn_kara_mul_n

When examining the toom3 threshold, remember it depends on the karatsuba
threshold, so the right karatsuba threshold needs to be compiled into the
library first.  The tune program uses specially recompiled versions of
mpn/mul_n.c etc for this reason, but the speed program simply uses the
normal libgmp.la.

Note further that the various routines may recurse into themselves on sizes
far enough above applicable thresholds.  For example, mpn_kara_mul_n will
recurse into itself on sizes greater than twice the compiled-in
KARATSUBA_MUL_THRESHOLD.

When doing the above comparison between mul_basecase and kara_mul_n what's
probably of interest is mul_basecase versus a kara_mul_n that does one level
of Karatsuba then calls mul_basecase, but this only happens on sizes less
than twice the compiled-in KARATSUBA_MUL_THRESHOLD.  A larger value for that
setting can be compiled in to avoid the problem if necessary.  The same
applies to toom3 and BZ, though in a trickier fashion.

There are some upper limits on some of the thresholds, arising from arrays
dimensioned according to a threshold (mpn_mul_n), or asm code with certain
sized displacements (some x86 versions of sqr_basecase).  So putting huge
values for the thresholds, even just for testing, may fail.



THINGS AFFECTING THRESHOLDS

The following are some general notes on some things that can affect the
various algorithm thresholds.

KARATSUBA_MUL_THRESHOLD

    At size 2N, karatsuba does three NxN multiplies and some adds and
    shifts, compared to a 2Nx2N basecase multiply which will be roughly
    equivalent to four NxN multiplies.
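
The three-multiplies-instead-of-four trade can be seen in a minimal sketch
of Karatsuba's identity, with Python integers standing in for limb arrays
and an arbitrary 32-bit cutoff playing the role of KARATSUBA_MUL_THRESHOLD
(a sketch only, not GMP's mpn_kara_mul_n):

```python
def kara_mul(x, y):
    """Minimal Karatsuba sketch: three half-size multiplies plus some
    adds and shifts, versus four for a schoolbook split."""
    if x < 2**32 or y < 2**32:
        return x * y                        # "basecase" multiply
    h = max(x.bit_length(), y.bit_length()) // 2
    xh, xl = x >> h, x & ((1 << h) - 1)
    yh, yl = y >> h, y & ((1 << h) - 1)
    lo = kara_mul(xl, yl)                   # multiply 1
    hi = kara_mul(xh, yh)                   # multiply 2
    # (xl+xh)(yl+yh) - lo - hi == xh*yl + xl*yh, the two cross terms
    mid = kara_mul(xl + xh, yl + yh) - lo - hi   # multiply 3
    return (hi << (2 * h)) + (mid << h) + lo

assert kara_mul(2**200 + 12345, 2**180 + 67890) == \
       (2**200 + 12345) * (2**180 + 67890)
```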

    Fast mul - increases threshold

        If the CPU has a fast multiply, the basecase multiplies are going
        to stay faster than the karatsuba overheads for longer.  Conversely
        if the CPU has a slow multiply the karatsuba method trading some
        multiplies for adds will become worthwhile sooner.

        Remember it's "addmul" performance that's of interest here.  This
        may differ from a simple "mul" instruction in the CPU.  For example
        K6 has a 3 cycle mul but takes nearly 8 cycles/limb for an addmul,
        and K7 has a 6 cycle mul latency but has a 4 cycle/limb addmul due
        to pipelining.

    Unrolled addmul - increases threshold

        If the CPU addmul routine (or the addmul part of the mul_basecase
        routine) is unrolled it can mean that a 2Nx2N multiply is a bit
        faster than four NxN multiplies, due to proportionally less looping
        overheads.  This can be thought of as the addmul warming to its
        task on bigger sizes, and keeping the basecase better than
        karatsuba for longer.

    Karatsuba overheads - increases threshold

        Fairly obviously anything gained or lost in the karatsuba extra
        calculations will translate directly to the threshold.  But
        remember the extra calculations are likely to always be a
        relatively small fraction of the total multiply time and in that
        sense the basecase code is the best place to be looking for
        optimizations.

KARATSUBA_SQR_THRESHOLD

    Squaring is essentially the same as multiplying, so the above applies
    to squaring too.  Fixed overheads will, proportionally, be bigger when
    squaring, usually leading to a higher threshold.

    mpn/generic/sqr_basecase.c

        This relies on a reasonable umul_ppmm, and if the generic C code is
        being used it may badly affect the speed.  Don't bother paying
        attention to the square thresholds until you have either a good
        umul_ppmm or an assembler sqr_basecase.

TOOM3_MUL_THRESHOLD

    At size N, toom3 does five (N/3)x(N/3) multiplies and some extra
    calculations, compared to karatsuba doing three (N/2)x(N/2)
    multiplies and some extra calculations (fewer).  Toom3 will become
    better before long, being O(n^1.465) versus karatsuba at O(n^1.585),
    but exactly where depends a great deal on the implementations of all
    the relevant bits of extra calculation.
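
Those exponents come straight from the recurrences: karatsuba replaces a
size-N multiply by 3 of size N/2, giving O(n^(log2 3)), and toom3 by 5 of
size N/3, giving O(n^(log3 5)).  A quick check of the quoted values:

```python
import math

# karatsuba: 3 multiplies of half size  -> exponent log2(3)
# toom3:     5 multiplies of third size -> exponent log3(5)
kara_exp = math.log(3, 2)
toom3_exp = math.log(5, 3)

print(round(kara_exp, 3), round(toom3_exp, 3))  # 1.585 1.465
```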

    In practice the curves for time versus size on toom3 and karatsuba
    have similar slopes near their crossover, leading to a range of sizes
    where there's very little difference between the two.  Choosing a
    single value from the range is a bit arbitrary and will lead to
    slightly different values on successive runs of the tune program.

    divexact_by3 - used by toom3

        Toom3 does a divexact_by3 which at size N is roughly equivalent to
        N successively dependent multiplies with a further couple of extra
        instructions in between.  CPUs with a low latency multiply and good
        divexact_by3 implementation should see the toom3 threshold lowered.
        But note this is unlikely to have much effect on total multiply
        times.
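
The trick behind an exact division by 3 is that 3 has a multiplicative
inverse modulo the limb modulus, so each limb needs only one multiply
instead of a division.  A one-limb sketch, assuming 32-bit limbs (GMP's
real divexact_by3 also propagates a borrow from limb to limb):

```python
# 0xAAAAAAAB is the inverse of 3 mod 2^32: 3 * 0xAAAAAAAB == 1 (mod 2^32).
INV3 = 0xAAAAAAAB
assert (3 * INV3) % 2**32 == 1

def divexact_by3_limb(n):
    """Exact division of a 32-bit limb by 3, done with one multiply.
    Only valid when n really is a multiple of 3."""
    return (n * INV3) % 2**32

assert divexact_by3_limb(999999999) == 333333333
```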

    Asymptotic behaviour

        At the fairly small sizes where the thresholds occur it's worth
        remembering that the asymptotic behaviour for karatsuba and toom3
        can't be expected to make accurate predictions, due of course to
        the big influence of all sorts of overheads, and the fact that only
        a few recursions of each are being performed.

        Even at large sizes there's a good chance machine dependent effects
        like cache architecture will mean actual performance deviates from
        what might be predicted.  This is why the rather positivist
        approach of just measuring things has been adopted, in general.

TOOM3_SQR_THRESHOLD

    The same factors apply to squaring as to multiplying, though with
    overheads being proportionally a bit bigger.

FFT_MUL_THRESHOLD, etc

    When configured with --enable-fft, a Fermat style FFT is used for
    multiplication above FFT_MUL_THRESHOLD, and a further threshold
    FFT_MODF_MUL_THRESHOLD exists for where FFT is used for a modulo 2^N+1
    multiply.  FFT_MUL_TABLE gives the thresholds at which each split size
    "k" is used in the FFT.

    step effect - coarse grained thresholds

        The FFT has size restrictions that mean it rounds up sizes to
        certain multiples and therefore does the same amount of work for a
        range of different sized operands.  For example at k=8 the size is
        internally rounded to a multiple of 1024 limbs.  The current single
        values for the various thresholds are set to give good average
        performance, but in the future multiple values might be wanted to
        take into account the different step sizes for different "k"s.

FFT_SQR_THRESHOLD, etc

    The same considerations apply as for multiplications, plus the
    following.

    similarity to mul thresholds

        On some CPUs the squaring thresholds are nearly the same as those
        for multiplying.  It's not quite clear why this is; it might be due
        to similarly shaped size/time graphs for the muls and sqrs recursed
        into.

BZ_THRESHOLD

    The B-Z division algorithm rearranges a traditional multi-precision
    long division so that NxN multiplies can be done rather than repeated
    Nx1 multiplies, thereby exploiting the algorithmic advantages of
    karatsuba and toom3, and leading to significant speedups.

    fast mul_basecase - decreases threshold

        CPUs with an optimized mul_basecase can expect a lower B-Z
        threshold due to the helping hand such a mul_basecase will give to
        B-Z as compared to submul_1 used in the schoolbook method.

GCD_ACCEL_THRESHOLD

    Below this threshold a simple binary subtract and shift is used; above
    it Ken Weber's accelerated algorithm is used.  The accelerated GCD
    performs far fewer steps than the binary GCD and will normally kick in
    at quite small sizes.

    modlimb_invert and find_a - affect threshold

        At small sizes the performance of modlimb_invert and find_a will
        affect the accelerated algorithm, and CPUs where those routines are
        not well optimized may see a higher threshold.  (At large sizes
        mpn_addmul_1 and mpn_submul_1 come to dominate the accelerated
        algorithm.)

GCDEXT_THRESHOLD

    mpn/generic/gcdext.c is based on Lehmer's multi-step improvement of
    Euclid's algorithm.  The multipliers are found using single limb
    calculations below GCDEXT_THRESHOLD, or double limb calculations
    above.  The single limb code is fast but doesn't produce full-limb
    multipliers.

    data-dependent multiplier - big threshold

        If multiplications done by mpn_mul_1, addmul_1 and submul_1 run
        slower when there are more bits in the multiplier, then producing
        bigger multipliers with the double limb calculation doesn't save
        much more than some looping and function call overheads.  A large
        threshold can then be expected.

    slow division - low threshold

        The single limb calculation does some plain "/" divisions, whereas
        the double limb calculation has a divide routine optimized for the
        small quotients that often occur.  Until the single limb code does
        something similar a slow hardware divide will count against it.




FUTURE

Make a program to check the time base is working properly, for small and
large measurements.  Make it able to test each available method, including
perhaps the apparent resolution of each.

Add versions of the toom3 multiplication using either the mpn calls or the
open-coded style, so the two can be compared.

Add versions of the generic C mpn_divrem_1 using straight division versus a
multiply by inverse, so the two can be compared.  Include the branch-free
version of multiply by inverse too.

Make an option in struct speed_parameters to specify operand overlap,
perhaps 0 for none, 1 for dst=src1, 2 for dst=src2, 3 for dst1=src1
dst2=src2, 4 for dst1=src2 dst2=src1.  This is done for addsub_n with the r
parameter (though addsub_n isn't yet enabled), and could be done for add_n,
xor_n, etc too.

When speed_measure() divides the total time measured by repetitions
performed, it also divides the fixed overheads imposed by speed_starttime()
and speed_endtime().  When different routines are run with different
repetition counts the overhead will then be counted differently.  It would
improve precision to try to avoid this.  Currently the idea is just to set
speed_precision big enough that the effect is insignificant compared to the
routines being measured.




----------------
Local variables:
mode: text
fill-column: 76
End: