GMP SPEED MEASURING AND PARAMETER TUNING


The programs in this directory are for knowledgeable users who want to make
measurements of the speed of GMP routines on their machine, and perhaps
tweak some settings or identify things that can be improved.

The programs here are tools, not ready-to-run solutions. Nothing is built
in a normal "make all", but various Makefile targets described below exist.

Relatively few systems and CPUs have been tested, so be sure to verify that
you're getting sensible results before relying on them.




MISCELLANEOUS NOTES

Don't configure with --enable-assert when using the things here, since the
extra code added by assertion checking may influence measurements.

Some effort has been made to accommodate CPUs with direct mapped caches, but
it will depend on TMP_ALLOC using a proper alloca, and even then it may or
may not be enough.

The sparc32/v9 addmul_1 code runs at noticeably different speeds on
successive sizes, and this has a bad effect on the tune program's
determinations of the multiply and square thresholds.




PARAMETER TUNING

The "tuneup" program runs some tests designed to find the best settings for
various thresholds, like KARATSUBA_MUL_THRESHOLD. Its output can be put
into gmp-mparam.h. The program can be built and run with

        make tune

If the thresholds indicated are grossly different from the values in the
selected gmp-mparam.h then you may get a performance boost in relevant size
ranges by changing gmp-mparam.h accordingly.
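
For illustration, the useful part of the output is a set of #define lines
that can be pasted into gmp-mparam.h. The values below are made-up
placeholders, not measurements from any particular CPU:

        #define KARATSUBA_MUL_THRESHOLD    20
        #define KARATSUBA_SQR_THRESHOLD    32
        #define TOOM3_MUL_THRESHOLD       100
        #define TOOM3_SQR_THRESHOLD       120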

If your CPU has specific tuned parameters coming from a gmp-mparam.h in one
of the mpn subdirectories then the values from "make tune" should be
similar. You can submit new values if it looks like the current ones are
out of date or wildly wrong. But first check that you're on the right CPU
target and that there aren't any machine-specific effects causing a
difference.

It's hoped the compiler and options used won't have too much effect on
thresholds, since for most CPUs they ultimately come down to comparisons
between assembler subroutines. Missing out on the longlong.h macros by not
using gcc will probably have an effect.

Some thresholds produced by the tune program are merely single values chosen
from what's actually a range of sizes where two algorithms are pretty much
the same speed. When this happens the program is likely to give slightly
different values on successive runs. This is noticeable on the toom3
thresholds, for instance.




SPEED PROGRAM

The "speed" program can be used for measuring and comparing various
routines, and producing tables of data or gnuplot graphs. Compile it with

        make speed

Here are some examples of how to use it. Check the code for all the
options.

Draw a graph of mpn_mul_n, stepping through sizes by 10 or a factor of 1.05
(whichever is greater).

        ./speed -s 10-5000 -t 10 -f 1.05 -P foo mpn_mul_n
        gnuplot foo.gnuplot

Compare mpn_add_n and mpn_lshift by 1, showing times in cycles and showing
under mpn_lshift the difference between it and mpn_add_n.

        ./speed -s 1-40 -c -d mpn_add_n mpn_lshift.1

Using option -c for times in cycles is interesting but normally only
necessary when looking carefully at assembler subroutines. You might think
it would always give an integer value, but this doesn't happen in practice,
probably due to overheads in the time measurements.

In the free-form output the "#" symbol against a measurement means the
corresponding routine is fastest at that size. This is a convenient visual
cue when comparing different routines. The graph data files <name>.data
don't get this since it would upset gnuplot or other data viewers.



TIME BASE

The time measuring method is determined in time.c, based on what the
configured target has available. A microsecond-accurate gettimeofday() will
work well, but there's code to use better methods, such as the cycle
counters on various CPUs.

Currently, all methods except possibly the alpha cycle counter depend on the
machine being otherwise idle, or rather on other jobs not stealing CPU time
from the measuring program. Short routines (that complete within a
timeslice) should work even on a busy machine. Some trouble is taken by
speed_measure() in common.c to avoid the ill effects of sporadic interrupts,
or other intermittent things (like cron waking up every minute). But
generally you'll want an idle machine to be sure of consistent results.

The CPU frequency is needed if times in cycles are to be displayed, and it's
always needed when using a cycle counter time base. time.c knows how to get
the frequency on some systems, but when that fails, or needs to be
overridden, an environment variable GMP_CPU_FREQUENCY can be used (in
Hertz). For example in "bash" on a 650 MHz machine,

        export GMP_CPU_FREQUENCY=650e6
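
The variable can also be set just for a single run; this is ordinary shell
syntax rather than a feature of the speed program itself,

        GMP_CPU_FREQUENCY=650e6 ./speed -s 10 -c mpn_add_n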

A high precision time base makes it possible to get accurate measurements in
a shorter time. Support for systems and CPUs not already covered is wanted.

When setting up a method, be sure not to claim a higher accuracy than is
really available. For example the default gettimeofday() code is set for
microsecond accuracy, but if only 10 ms or 55 ms resolution is actually
available then inconsistent results can be expected.




EXAMPLE COMPARISONS

Here are some ideas for things you can do with the speed program.

There's always going to be a certain amount of overhead in the time
measurements, due to reading the time base, and in the loop that runs a
routine enough times to get a reading of the desired precision. Noop
functions taking various arguments are available to measure this. The
"overhead" figure the speed program prints in its intro each time is this
"noop" routine; note that it's for information only, and isn't deducted
from the times printed.

        ./speed -s 1 noop noop_wxs noop_wxys

If you want to know how many cycles per limb a routine is taking, look at
the time increase when the size increments, using option -D. This avoids
fixed overheads in the measuring. Also, remember many of the assembler
routines have unrolled loops, so it might be necessary to compare times at,
say, 16, 32, 48, 64, etc, to see what the unrolled part is taking, as
opposed to any finishing off.

        ./speed -s 16-64 -t 16 -C -D mpn_add_n

The -C option on its own gives cycles per limb, but is really only useful at
big sizes where fixed overheads are small compared to the code doing the
real work. Remember of course memory caching and/or page swapping will
affect results at large sizes.

        ./speed -s 500000 -C mpn_add_n

Once a calculation stops fitting in the CPU data cache, it's going to start
taking longer. Exactly where this happens depends on the cache priming in
the measuring routines, and on what sort of "least recently used"
replacement the hardware does. Here's an example for a CPU with a 16 kbyte
L1 data cache and 32-bit limbs, showing a suddenly steeper curve for
mpn_add_n at about 2000 limbs.

        ./speed -s 1-4000 -t 5 -f 1.02 -P foo mpn_add_n
        gnuplot foo.gnuplot

When a routine has an unrolled loop for, say, multiples of 8 limbs and then
an ordinary loop for the remainder, it can happen that it's actually faster
to do an operation on, say, 8 limbs than it is on 7 limbs. Here's an
example drawing a graph of mpn_sub_n, which you can look at to see if times
smoothly increase with size.

        ./speed -s 1-100 -c -P foo mpn_sub_n
        gnuplot foo.gnuplot

If mpn_lshift and mpn_rshift for your CPU have special-case code for shifts
by 1, that ought to be faster (or at least not slower) than shifting by,
say, 2 bits.

        ./speed -s 1-200 -c mpn_rshift.1 mpn_rshift.2

An mpn_lshift by 1 can be done by mpn_add_n adding a number to itself, and
if the lshift isn't faster there's an obvious improvement that's possible.

        ./speed -s 1-200 -c mpn_lshift.1 mpn_add_n_self

On some CPUs (AMD K6 for example) an "in-place" mpn_add_n where the
destination is one of the sources is faster than with a separate
destination. Here's an example to see this. (mpn_add_n_inplace is a
special measuring routine, not available for other operations.)

        ./speed -s 1-200 -c mpn_add_n mpn_add_n_inplace

The gmp manual recommends that divisions by powers of two be done using a
right shift because it'll be significantly faster. Here's how you can see
by what factor mpn_rshift is faster, using division by 32 as an example.

        ./speed -s 10-20 -r mpn_rshift.5 mpn_divrem_1.32

mul_basecase takes an "r" parameter that's the first (larger) size
parameter. For example to show speeds for 20x1 up to 20x15 in cycles,

        ./speed -s 1-15 -c mpn_mul_basecase.20

mul_basecase with no parameter does an NxN multiply, so for example to show
speeds in cycles for 1x1, 2x2, 3x3, etc, up to 20x20,

        ./speed -s 1-20 -c mpn_mul_basecase

sqr_basecase is implemented by a "triangular" method on most CPUs, making it
up to twice as fast as mul_basecase. In practice loop overheads and the
products on the diagonal mean it falls short of this. Here's an example
running the two and showing by what factor an NxN mul_basecase is slower
than an NxN sqr_basecase. (Some versions of sqr_basecase only allow sizes
below KARATSUBA_SQR_THRESHOLD, so if it crashes at that point don't worry.)

        ./speed -s 1-20 -r mpn_sqr_basecase mpn_mul_basecase

The technique described above with -CD for showing the time difference in
cycles per limb between two sizes can be done on an NxN mul_basecase using
-E to change the basis for the size increment to N*N. For instance a 20x20
operation is taken to be doing 400 limbs, and a 16x16 doing 256 limbs. The
following therefore shows the per-crossproduct speed of mul_basecase and
sqr_basecase at around 20x20 limbs.

        ./speed -s 16-20 -t 4 -CDE mpn_mul_basecase mpn_sqr_basecase

Of course sqr_basecase isn't really doing NxN crossproducts, but it can be
interesting to compare it to mul_basecase as if it were. For sqr_basecase
the -F option can be used to base the deltas on N*(N+1)/2 operations, which
is the number of triangular products sqr_basecase does. For example,

        ./speed -s 16-20 -t 4 -CDF mpn_sqr_basecase

Both -E and -F are preliminary and might change. A consistent approach to
using them when claiming certain per-crossproduct or per-triangular-product
speeds hasn't really been established, but measuring the increment over the
range of sizes karatsuba will actually call, namely k/2 to k, seems
sensible. For instance, if the karatsuba threshold were 20 for the multiply
and 30 for the square,

        ./speed -s 10-20 -t 10 -CDE mpn_mul_basecase
        ./speed -s 15-30 -t 15 -CDF mpn_sqr_basecase

The gmp manual recommends that application programs avoid excessive
initializing and clearing of mpz_t variables (and mpq_t and mpf_t too).
Every new variable will at a minimum go through an init, a realloc for its
first store, and finally a clear. Quite how long that takes depends on the
C library. The following compares an mpz_init/realloc/clear to a 10 limb
mpz_add.

        ./speed -s 10 -c mpz_init_realloc_clear mpz_add

The normal libtool link of the speed program does a static link to libgmp.la
and libspeed.la, but will end up dynamically linked to libc. Depending on
the system, a dynamically linked malloc may be noticeably slower than a
statically linked one, and you may want to re-run the libtool link
invocation to static link libc for comparison. The example below does a 10
limb malloc/free or malloc/realloc/free to test the C library. Of course a
real world program has big problems if it's doing so many mallocs and frees
that it gets slowed down by a dynamically linked malloc.

        ./speed -s 10 -c malloc_free malloc_realloc_free



SPEED PROGRAM EXTENSIONS

Potentially lots of things could be made available in the program, but it's
been limited to the things that have actually been wanted and are likely to
be reasonably useful in the future.

Extensions should be fairly easy to make, though. speed-ext.c is an
example, in a style that should suit one-off tests, or new code fragments
under development.
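
As a rough illustration only, a measuring routine conventionally takes a
parameter block, repeats the code under test enough times, and returns the
elapsed time. The struct and field names below, and the exact timing
calls, are assumptions from memory; check speed.h and speed-ext.c for the
real interface before copying anything.

    /* A hypothetical extension, not part of the distribution.  The names
       struct speed_params, s->reps, s->size, s->xp, speed_starttime and
       speed_endtime should all be checked against speed.h.  */

    double
    speed_my_limb_sum (struct speed_params *s)
    {
      static volatile mp_limb_t  sink;  /* defeat optimization */
      mp_limb_t  sum;
      mp_size_t  j;
      unsigned   i;

      speed_starttime ();
      i = s->reps;
      do
        {
          sum = 0;
          for (j = 0; j < s->size; j++)
            sum += s->xp[j];            /* the operation being timed */
          sink = sum;
        }
      while (--i != 0);

      return speed_endtime ();
    }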



THRESHOLD EXAMINING

The speed program can be used to examine the speeds of different algorithms
to check that the tune program has done the right thing. For example to
examine the karatsuba multiply threshold,

        ./speed -s 5-40 mpn_mul_basecase mpn_kara_mul_n

When examining the toom3 threshold, remember it depends on the karatsuba
threshold, so the right karatsuba threshold needs to be compiled into the
library first. The tune program uses special recompiled versions of
mpn/mul_n.c etc for this reason, but the speed program simply uses the
normal libgmp.la.
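
For example, assuming the toom3 entry point is known to the speed program
as mpn_toom3_mul_n (check the routine names it accepts),

        ./speed -s 20-300 -f 1.1 mpn_kara_mul_n mpn_toom3_mul_n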

Note further that the various routines may recurse into themselves on sizes
far enough above the applicable thresholds. For example, mpn_kara_mul_n
will recurse into itself on sizes greater than twice the compiled-in
KARATSUBA_MUL_THRESHOLD.

When doing the above comparison between mul_basecase and kara_mul_n, what's
probably of interest is mul_basecase versus a kara_mul_n that does one level
of Karatsuba then calls mul_basecase, but this only happens on sizes less
than twice the compiled-in KARATSUBA_MUL_THRESHOLD. A larger value for that
setting can be compiled in to avoid the problem if necessary. The same
applies to toom3 and BZ, though in a trickier fashion.

There are some upper limits on some of the thresholds, arising from arrays
dimensioned according to a threshold (mpn_mul_n), or asm code with
certain-sized displacements (some x86 versions of sqr_basecase). So putting
huge values for the thresholds, even just for testing, may fail.




THINGS AFFECTING THRESHOLDS

The following are some general notes on some things that can affect the
various algorithm thresholds.

KARATSUBA_MUL_THRESHOLD

    At size 2N, karatsuba does three NxN multiplies and some adds and
    shifts, compared to a 2Nx2N basecase multiply which will be roughly
    equivalent to four NxN multiplies.
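
    Roughly speaking, if M(N) is the time for an NxN basecase multiply and
    c*N the time for karatsuba's extra adds and shifts, karatsuba at size
    2N wins once

        3*M(N) + c*N < 4*M(N)

    i.e. once one NxN basecase multiply costs more than the extra linear
    work.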

    Fast mul - increases threshold

        If the CPU has a fast multiply, the basecase multiplies are going
        to stay faster than the karatsuba overheads for longer.
        Conversely, if the CPU has a slow multiply, the karatsuba method's
        trading of some multiplies for adds becomes worthwhile sooner.

        Remember it's "addmul" performance that's of interest here. This
        may differ from a simple "mul" instruction in the CPU. For example
        K6 has a 3 cycle mul but takes nearly 8 cycles/limb for an addmul,
        and K7 has a 6 cycle mul latency but has a 4 cycle/limb addmul due
        to pipelining.
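
        One way to see the mul versus addmul difference on your own CPU,
        assuming the speed program accepts these routine names, is

                ./speed -s 1-30 -c mpn_mul_1 mpn_addmul_1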

    Unrolled addmul - increases threshold

        If the CPU addmul routine (or the addmul part of the mul_basecase
        routine) is unrolled it can mean that a 2Nx2N multiply is a bit
        faster than four NxN multiplies, due to proportionally lower
        looping overheads. This can be thought of as the addmul warming to
        its task on bigger sizes, and keeping the basecase better than
        karatsuba for longer.

    Karatsuba overheads - increases threshold

        Fairly obviously, anything gained or lost in the karatsuba extra
        calculations will translate directly to the threshold. But
        remember the extra calculations are likely to always be a
        relatively small fraction of the total multiply time, and in that
        sense the basecase code is the best place to be looking for
        optimizations.

KARATSUBA_SQR_THRESHOLD

    Squaring is essentially the same as multiplying, so the above applies
    to squaring too. Fixed overheads will, proportionally, be bigger when
    squaring, usually leading to a higher threshold.

    mpn/generic/sqr_basecase.c

        This relies on a reasonable umul_ppmm, and if the generic C code is
        being used it may badly affect the speed. Don't bother paying
        attention to the square thresholds until you have either a good
        umul_ppmm or an assembler sqr_basecase.

TOOM3_MUL_THRESHOLD

    At size N, toom3 does five (N/3)x(N/3) multiplies and some extra
    calculations, compared to karatsuba doing three (N/2)x(N/2) multiplies
    and some extra calculations (fewer). Toom3 will become better before
    long, being O(n^1.465) versus karatsuba at O(n^1.585), but exactly
    where depends a great deal on the implementations of all the relevant
    bits of extra calculation.
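
    (These exponents come from the recursions: three multiplies of size N/2
    give log(3)/log(2), about 1.585, while five multiplies of size N/3 give
    log(5)/log(3), about 1.465.)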

    In practice the curves for time versus size on toom3 and karatsuba
    have similar slopes near their crossover, leading to a range of sizes
    where there's very little difference between the two. Choosing a
    single value from the range is a bit arbitrary and will lead to
    slightly different values on successive runs of the tune program.

    divexact_by3 - used by toom3

        Toom3 does a divexact_by3 which at size N is roughly equivalent to
        a chain of N dependent multiplies, with a further couple of extra
        instructions between each. CPUs with a low-latency multiply and a
        good divexact_by3 implementation should see the toom3 threshold
        lowered. But note this is unlikely to have much effect on total
        multiply times.
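
        The routine can be timed on its own, assuming the speed program in
        your build knows it under this name,

                ./speed -s 1-20 -c mpn_divexact_by3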

    Asymptotic behaviour

        At the fairly small sizes where the thresholds occur it's worth
        remembering that the asymptotic behaviour for karatsuba and toom3
        can't be expected to make accurate predictions, due of course to
        the big influence of all sorts of overheads, and the fact that only
        a few recursions of each are being performed.

        Even at large sizes there's a good chance machine-dependent effects
        like cache architecture will mean actual performance deviates from
        what might be predicted. This is why the rather positivist
        approach of just measuring things has been adopted, in general.

TOOM3_SQR_THRESHOLD

    The same factors apply to squaring as to multiplying, though with
    overheads being proportionally a bit bigger.

FFT_MUL_THRESHOLD, etc

    When configured with --enable-fft, a Fermat-style FFT is used for
    multiplication above FFT_MUL_THRESHOLD, and a further threshold
    FFT_MODF_MUL_THRESHOLD exists for where the FFT is used for a modulo
    2^N+1 multiply. FFT_MUL_TABLE gives the thresholds at which each split
    size "k" is used in the FFT.

    step effect - coarse grained thresholds

        The FFT has size restrictions that mean it rounds up sizes to
        certain multiples and therefore does the same amount of work for a
        range of different-sized operands. For example at k=8 the size is
        internally rounded to a multiple of 1024 limbs. The current single
        values for the various thresholds are set to give good average
        performance, but in the future multiple values might be wanted to
        take into account the different step sizes for different "k"s.

FFT_SQR_THRESHOLD, etc

    The same considerations apply as for multiplications, plus the
    following.

    similarity to mul thresholds

        On some CPUs the squaring thresholds are nearly the same as those
        for multiplying. It's not quite clear why this is; it might be due
        to similarly shaped size/time curves for the muls and sqrs recursed
        into.

BZ_THRESHOLD

    The B-Z division algorithm rearranges a traditional multi-precision
    long division so that NxN multiplies can be done rather than repeated
    Nx1 multiplies, thereby exploiting the algorithmic advantages of
    karatsuba and toom3, and leading to significant speedups.

    fast mul_basecase - decreases threshold

        CPUs with an optimized mul_basecase can expect a lower B-Z
        threshold due to the helping hand such a mul_basecase gives B-Z,
        as compared with the submul_1 used in the schoolbook method.

GCD_ACCEL_THRESHOLD

    Below this threshold a simple binary subtract-and-shift GCD is used;
    above it, Ken Weber's accelerated algorithm is used. The accelerated
    GCD performs far fewer steps than the binary GCD and will normally kick
    in at quite small sizes.

    modlimb_invert and find_a - affect threshold

        At small sizes the performance of modlimb_invert and find_a will
        affect the accelerated algorithm, and CPUs where those routines are
        not well optimized may see a higher threshold. (At large sizes
        mpn_addmul_1 and mpn_submul_1 come to dominate the accelerated
        algorithm.)

GCDEXT_THRESHOLD

    mpn/generic/gcdext.c is based on Lehmer's multi-step improvement of
    Euclid's algorithm. The multipliers are found using single limb
    calculations below GCDEXT_THRESHOLD, or double limb calculations
    above. The single limb code is fast but doesn't produce full-limb
    multipliers.

    data-dependent multiplier - big threshold

        If multiplications done by mpn_mul_1, addmul_1 and submul_1 run
        slower when there are more bits in the multiplier, then producing
        bigger multipliers with the double limb calculation doesn't save
        much more than some looping and function call overheads. A large
        threshold can then be expected.

    slow division - low threshold

        The single limb calculation does some plain "/" divisions, whereas
        the double limb calculation has a divide routine optimized for the
        small quotients that often occur. Until the single limb code does
        something similar, a slow hardware divide will count against it.




FUTURE

Make a program to check that the time base is working properly, for small
and large measurements. Make it able to test each available method,
including perhaps the apparent resolution of each.

Add versions of the toom3 multiplication using either the mpn calls or the
open-coded style, so the two can be compared.

Add versions of the generic C mpn_divrem_1 using straight division versus a
multiply by inverse, so the two can be compared. Include the branch-free
version of multiply by inverse too.

Make an option in struct speed_parameters to specify operand overlap,
perhaps 0 for none, 1 for dst=src1, 2 for dst=src2, 3 for dst1=src1
dst2=src2, 4 for dst1=src2 dst2=src1. This is done for addsub_n with the r
parameter (though addsub_n isn't yet enabled), and could be done for add_n,
xor_n, etc too.

When speed_measure() divides the total time measured by the repetitions
performed, it also divides the fixed overheads imposed by speed_starttime()
and speed_endtime(). When different routines are run with different
repetition counts the overhead is then counted differently. It would
improve precision to try to avoid this. Currently the idea is just to set
speed_precision big enough that the effect is insignificant compared to the
routines being measured.



----------------
Local variables:
mode: text
fill-column: 76
End: