                GMP SPEED MEASURING AND PARAMETER TUNING


The programs in this directory are for knowledgeable users who want to make
measurements of the speed of GMP routines on their machine, and perhaps
tweak some settings or identify things that can be improved.

The programs here are tools, not ready to run solutions.  Nothing is built
in a normal "make all", but various Makefile targets described below exist.

Relatively few systems and CPUs have been tested, so be sure to verify that
you're getting sensible results before relying on them.




MISCELLANEOUS NOTES

Don't configure with --enable-assert when using the things here, since the
extra code added by assertion checking may influence measurements.

Some effort has been made to accommodate CPUs with direct mapped caches, but
it will depend on TMP_ALLOC using a proper alloca, and even then it may or
may not be enough.

The sparc32/v9 addmul_1 code runs at noticeably different speeds on
successive sizes, and this has a bad effect on the tune program's
determinations of the multiply and square thresholds.




PARAMETER TUNING

The "tuneup" program runs some tests designed to find the best settings for
various thresholds, like KARATSUBA_MUL_THRESHOLD.  Its output can be put
into gmp-mparam.h.  The program can be built and run with

        make tune
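
The output is in the form of C #define lines that can be pasted into
gmp-mparam.h, something like the following (the values here are made up for
illustration, not a recommendation for any particular CPU):

        #define KARATSUBA_MUL_THRESHOLD   26   /* illustrative values only */
        #define TOOM3_MUL_THRESHOLD       89
        #define KARATSUBA_SQR_THRESHOLD   52
        #define TOOM3_SQR_THRESHOLD      130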

If the thresholds indicated are grossly different from the values in the
selected gmp-mparam.h then you may get a performance boost in relevant size
ranges by changing gmp-mparam.h accordingly.

If your CPU has specific tuned parameters coming from a gmp-mparam.h in one
of the mpn subdirectories then the values from "make tune" should be
similar.  You can submit new values if it looks like the current ones are
out of date or wildly wrong.  But check you're on the right CPU target and
there aren't any machine-specific effects causing a difference.

It's hoped the compiler and options used won't have too much effect on
thresholds, since for most CPUs they ultimately come down to comparisons
between assembler subroutines.  Missing out on the longlong.h macros by not
using gcc will probably have an effect.

Some thresholds produced by the tune program are merely single values chosen
from what's actually a range of sizes where two algorithms are pretty much
the same speed.  When this happens the program is likely to give slightly
different values on successive runs.  This is noticeable on the toom3
thresholds for instance.




SPEED PROGRAM

The "speed" program can be used for measuring and comparing various
routines, and producing tables of data or gnuplot graphs.  Compile it with

        make speed

Here are some examples of how to use it.  Check the code for all the
options.

Draw a graph of mpn_mul_n, stepping through sizes by 10 or a factor of 1.05
(whichever is greater).

        ./speed -s 10-5000 -t 10 -f 1.05 -P foo mpn_mul_n
        gnuplot foo.gnuplot

Compare mpn_add_n and mpn_lshift by 1, showing times in cycles and showing
under mpn_lshift the difference between it and mpn_add_n.

        ./speed -s 1-40 -c -d mpn_add_n mpn_lshift.1

Using option -c for times in cycles is interesting but normally only
necessary when looking carefully at assembler subroutines.  You might think
it would always give an integer value, but this doesn't happen in practice,
probably due to overheads in the time measurements.

In the free-form output the "#" symbol against a measurement means the
corresponding routine is fastest at that size.  This is a convenient visual
cue when comparing different routines.  The graph data files <name>.data
don't get this since it would upset gnuplot or other data viewers.




TIME BASE

The time measuring method is determined in time.c, based on what the
configured target has available.  A microsecond accurate gettimeofday() will
work well, but there's code to use better methods, such as the cycle
counters on various CPUs.
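
As a rough illustration (a hedged sketch, not the actual code in time.c),
a microsecond wall-clock reading and an x86 cycle counter read look
something like this:

        #include <sys/time.h>

        /* illustrative only, not the code in time.c */
        double
        wall_seconds (void)
        {
          struct timeval  tv;
          gettimeofday (&tv, NULL);
          return tv.tv_sec + tv.tv_usec * 1e-6;
        }

        #if defined (__GNUC__) && defined (__i386__)
        unsigned long long
        cycle_counter (void)
        {
          unsigned int  lo, hi;
          __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
          return ((unsigned long long) hi << 32) | lo;
        }
        #endif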

Currently, all methods except possibly the alpha cycle counter depend on the
machine being otherwise idle, or rather on other jobs not stealing CPU time
from the measuring program.  Short routines (that complete within a
timeslice) should work even on a busy machine.  Some trouble is taken by
speed_measure() in common.c to avoid the ill effects of sporadic interrupts,
or other intermittent things (like cron waking up every minute).  But
generally you'll want an idle machine to be sure of consistent results.

The CPU frequency is needed if times in cycles are to be displayed, and it's
always needed when using a cycle counter time base.  time.c knows how to get
the frequency on some systems, but when that fails, or needs to be
overridden, an environment variable GMP_CPU_FREQUENCY can be used (in
Hertz).  For example in "bash" on a 650 MHz machine,

        export GMP_CPU_FREQUENCY=650e6

A high precision time base makes it possible to get accurate measurements in
a shorter time.  Support for systems and CPUs not already covered is wanted.

When setting up a method, be sure not to claim a higher accuracy than is
really available.  For example the default gettimeofday() code is set for
microsecond accuracy, but if only 10ms or 55ms is available then
inconsistent results can be expected.




EXAMPLE COMPARISONS

Here are some ideas for things you can do with the speed program.

There's always going to be a certain amount of overhead in the time
measurements, due to reading the time base, and in the loop that runs a
routine enough times to get a reading of the desired precision.  Noop
functions taking various arguments are available to measure this.  The
"overhead" printed by the speed program each time in its intro is the "noop"
routine, but note that this is just for information, it isn't deducted from
the times printed or anything.

        ./speed -s 1 noop noop_wxs noop_wxys

If you want to know how many cycles per limb a routine is taking, look at
the time increase when the size increments, using option -D.  This avoids
fixed overheads in the measuring.  Also, remember many of the assembler
routines have unrolled loops, so it might be necessary to compare times at,
say, 16, 32, 48, 64 etc to see what the unrolled part is taking, as opposed
to any finishing off.

        ./speed -s 16-64 -t 16 -C -D mpn_add_n

The -C option on its own gives cycles per limb, but is really only useful at
big sizes where fixed overheads are small compared to the code doing the
real work.  Remember of course memory caching and/or page swapping will
affect results at large sizes.

        ./speed -s 500000 -C mpn_add_n

Once a calculation stops fitting in the CPU data cache, it's going to start
taking longer.  Exactly where this happens depends on the cache priming in
the measuring routines, and on what sort of "least recently used" the
hardware does.  Here's an example for a CPU with a 16kbyte L1 data cache and
32-bit limbs, showing a suddenly steeper curve for mpn_add_n at about 2000
limbs.

        ./speed -s 1-4000 -t 5 -f 1.02 -P foo mpn_add_n
        gnuplot foo.gnuplot

When a routine has an unrolled loop for, say, multiples of 8 limbs and then
an ordinary loop for the remainder, it can happen that it's actually faster
to do an operation on, say, 8 limbs than it is on 7 limbs.  Here's an
example drawing a graph of mpn_sub_n, which you can look at to see if times
smoothly increase with size.

        ./speed -s 1-100 -c -P foo mpn_sub_n
        gnuplot foo.gnuplot

If mpn_lshift and mpn_rshift for your CPU have special case code for shifts
by 1, it ought to be faster (or at least not slower) than shifting by, say,
2 bits.

        ./speed -s 1-200 -c mpn_rshift.1 mpn_rshift.2

An mpn_lshift by 1 can be done by mpn_add_n adding a number to itself, and
if the lshift isn't faster there's an obvious improvement that's possible.

        ./speed -s 1-200 -c mpn_lshift.1 mpn_add_n_self

On some CPUs (AMD K6 for example) an "in-place" mpn_add_n where the
destination is one of the sources is faster than a separate destination.
Here's an example to see this.  (mpn_add_n_inplace is a special measuring
routine, not available for other operations.)

        ./speed -s 1-200 -c mpn_add_n mpn_add_n_inplace

The gmp manual recommends divisions by powers of two should be done using a
right shift because it'll be significantly faster.  Here's how you can see
by what factor mpn_rshift is faster, using division by 32 as an example.

        ./speed -s 10-20 -r mpn_rshift.5 mpn_divrem_1.32

mul_basecase takes an "r" parameter that's the first (larger) size
parameter.  For example to show speeds for 20x1 up to 20x15 in cycles,

        ./speed -s 1-15 -c mpn_mul_basecase.20

mul_basecase with no parameter does an NxN multiply, so for example to show
speeds in cycles for 1x1, 2x2, 3x3, etc, up to 20x20,

        ./speed -s 1-20 -c mpn_mul_basecase

sqr_basecase is implemented by a "triangular" method on most CPUs, making it
up to twice as fast as mul_basecase.  In practice loop overheads and the
products on the diagonal mean it falls short of this.  Here's an example
running the two and showing by what factor an NxN mul_basecase is slower
than an NxN sqr_basecase.  (Some versions of sqr_basecase only allow sizes
below KARATSUBA_SQR_THRESHOLD, so if it crashes at that point don't worry.)

        ./speed -s 1-20 -r mpn_sqr_basecase mpn_mul_basecase

The technique described above with -CD for showing the time difference in
cycles per limb between two size operations can be done on an NxN
mul_basecase using -E to change the basis for the size increment to N*N.
For instance a 20x20 operation is taken to be doing 400 limbs, and a 16x16
doing 256 limbs.  The following therefore shows the per crossproduct speed
of mul_basecase and sqr_basecase at around 20x20 limbs.

        ./speed -s 16-20 -t 4 -CDE mpn_mul_basecase mpn_sqr_basecase

Of course sqr_basecase isn't really doing NxN crossproducts, but it can be
interesting to compare it to mul_basecase as if it was.  For sqr_basecase
the -F option can be used to base the deltas on N*(N+1)/2 operations, which
is the triangular products sqr_basecase does.  For example,

        ./speed -s 16-20 -t 4 -CDF mpn_sqr_basecase

Both -E and -F are preliminary and might change.  A consistent approach to
using them when claiming certain per crossproduct or per triangularproduct
speeds hasn't really been established, but measuring the increment across
the range of sizes karatsuba will call, namely k/2 to k, seems sensible.
For instance, if the karatsuba threshold was 20 for the multiply and 30 for
the square,

        ./speed -s 10-20 -t 10 -CDE mpn_mul_basecase
        ./speed -s 15-30 -t 15 -CDF mpn_sqr_basecase

The gmp manual recommends application programs avoid excessive initializing
and clearing of mpz_t variables (and mpq_t and mpf_t too).  Every new
variable will at a minimum go through an init, a realloc for its first
store, and finally a clear.  Quite how long that takes depends on the C
library.  The following compares an mpz_init/realloc/clear to a 10 limb
mpz_add.

        ./speed -s 10 -c mpz_init_realloc_clear mpz_add
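
The style the manual recommends looks something like the following sketch
(the function and the calculation are made up for illustration): hoist the
temporary out of the loop so there's one init and one clear in total, rather
than one per iteration.

        #include <gmp.h>

        /* illustration only: one init and one clear for the temporary */
        void
        sum_of_squares (mpz_t sum, unsigned long n)
        {
          mpz_t  t;
          unsigned long  i;
          mpz_init (t);
          mpz_set_ui (sum, 0);
          for (i = 1; i <= n; i++)
            {
              mpz_set_ui (t, i);
              mpz_mul (t, t, t);
              mpz_add (sum, sum, t);
            }
          mpz_clear (t);
        }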

The normal libtool link of the speed program does a static link to libgmp.la
and libspeed.la, but will end up dynamic linked to libc.  Depending on the
system, a dynamic linked malloc may be noticeably slower than static linked,
and you may want to re-run the libtool link invocation to static link libc
for comparison.  The example below does a 10 limb malloc/free or
malloc/realloc/free to test the C library.  Of course a real world program
has big problems if it's doing so many mallocs and frees that it gets slowed
down by a dynamic linked malloc.

        ./speed -s 10 -c malloc_free malloc_realloc_free




SPEED PROGRAM EXTENSIONS

Potentially lots of things could be made available in the program, but it's
been left at only the things that have actually been wanted and are likely
to be reasonably useful in the future.

Extensions should be fairly easy to make though.  speed-ext.c is an example,
in a style that should suit one-off tests, or new code fragments under
development.




THRESHOLD EXAMINING

The speed program can be used to examine the speeds of different algorithms
to check the tune program has done the right thing.  For example to examine
the karatsuba multiply threshold,

        ./speed -s 5-40 mpn_mul_basecase mpn_kara_mul_n

When examining the toom3 threshold, remember it depends on the karatsuba
threshold, so the right karatsuba threshold needs to be compiled into the
library first.  The tune program uses special recompiled versions of
mpn/mul_n.c etc for this reason, but the speed program simply uses the
normal libgmp.la.

Note further that the various routines may recurse into themselves on sizes
far enough above applicable thresholds.  For example, mpn_kara_mul_n will
recurse into itself on sizes greater than twice the compiled-in
KARATSUBA_MUL_THRESHOLD.

When doing the above comparison between mul_basecase and kara_mul_n, what's
probably of interest is mul_basecase versus a kara_mul_n that does one level
of Karatsuba and then calls mul_basecase, but this only happens on sizes
less than twice the compiled KARATSUBA_MUL_THRESHOLD.  A larger value for
that setting can be compiled in to avoid the problem if necessary.  The same
applies to toom3 and BZ, though in a trickier fashion.

There are some upper limits on some of the thresholds, arising from arrays
dimensioned according to a threshold (mpn_mul_n), or asm code with certain
sized displacements (some x86 versions of sqr_basecase).  So putting huge
values for the thresholds, even just for testing, may fail.




THINGS AFFECTING THRESHOLDS

The following are some general notes on some things that can affect the
various algorithm thresholds.

   KARATSUBA_MUL_THRESHOLD

      At size 2N, karatsuba does three NxN multiplies and some adds and
      shifts, compared to a 2Nx2N basecase multiply which will be roughly
      equivalent to four NxN multiplies.
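
      For a concrete check of that count, here is a small standalone C
      program (a sketch on toy one-word halves, nothing to do with the mpn
      code) exercising the identity karatsuba is based on:

      /* With x = x1*B + x0 and y = y1*B + y0,
         x*y = x1*y1*B^2 + ((x1+x0)*(y1+y0) - x1*y1 - x0*y0)*B + x0*y0,
         so only three half-size products are needed.  */
      #include <assert.h>

      int
      main (void)
      {
        unsigned long long  B = 1ULL << 16;    /* toy "limb" base */
        unsigned long long  x1 = 0x1234, x0 = 0x5678;
        unsigned long long  y1 = 0x9ABC, y0 = 0xDEF0;
        unsigned long long  x = x1 * B + x0, y = y1 * B + y0;

        unsigned long long  hh = x1 * y1;      /* three multiplies ... */
        unsigned long long  ll = x0 * y0;
        unsigned long long  mm = (x1 + x0) * (y1 + y0) - hh - ll;

        assert (x * y == hh * B * B + mm * B + ll);
        return 0;
      }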

      Fast mul - increases threshold

         If the CPU has a fast multiply, the basecase multiplies are going
         to stay faster than the karatsuba overheads for longer.  Conversely
         if the CPU has a slow multiply the karatsuba method trading some
         multiplies for adds will become worthwhile sooner.

         Remember it's "addmul" performance that's of interest here.  This
         may differ from a simple "mul" instruction in the CPU.  For example
         K6 has a 3 cycle mul but takes nearly 8 cycles/limb for an addmul,
         and K7 has a 6 cycle mul latency but has a 4 cycle/limb addmul due
         to pipelining.

      Unrolled addmul - increases threshold

         If the CPU addmul routine (or the addmul part of the mul_basecase
         routine) is unrolled it can mean that a 2Nx2N multiply is a bit
         faster than four NxN multiplies, due to proportionally less looping
         overheads.  This can be thought of as the addmul warming to its
         task on bigger sizes, and keeping the basecase better than
         karatsuba for longer.

      Karatsuba overheads - increases threshold

         Fairly obviously anything gained or lost in the karatsuba extra
         calculations will translate directly to the threshold.  But
         remember the extra calculations are likely to always be a
         relatively small fraction of the total multiply time and in that
         sense the basecase code is the best place to be looking for
         optimizations.

   KARATSUBA_SQR_THRESHOLD

      Squaring is essentially the same as multiplying, so the above applies
      to squaring too.  Fixed overheads will, proportionally, be bigger when
      squaring, leading to a higher threshold usually.

      mpn/generic/sqr_basecase.c

         This relies on a reasonable umul_ppmm, and if the generic C code is
         being used it may badly affect the speed.  Don't bother paying
         attention to the square thresholds until you have either a good
         umul_ppmm or an assembler sqr_basecase.
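
         For reference, here is a hedged sketch of what a reasonable
         umul_ppmm amounts to when a double-width type is available
         (assuming 32-bit limbs; the real longlong.h versions are usually
         one or two CPU-specific instructions):

         #include <stdint.h>

         /* returns the high limb of u*v, stores the low limb in *lo */
         uint32_t
         umul_ppmm_sketch (uint32_t *lo, uint32_t u, uint32_t v)
         {
           uint64_t  p = (uint64_t) u * v;
           *lo = (uint32_t) p;
           return (uint32_t) (p >> 32);
         }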

   TOOM3_MUL_THRESHOLD

      At size N, toom3 does five (N/3)x(N/3) multiplies and some extra
      calculations, compared to karatsuba doing three (N/2)x(N/2)
      multiplies and some extra calculations (fewer).  Toom3 will become
      better before long, being O(n^1.465) versus karatsuba at O(n^1.585)
      (those exponents being log base 3 of 5 and log base 2 of 3), but
      exactly where depends a great deal on the implementations of all
      the relevant bits of extra calculation.

      In practice the curves for time versus size on toom3 and karatsuba
      have similar slopes near their crossover, leading to a range of sizes
      where there's very little difference between the two.  Choosing a
      single value from the range is a bit arbitrary and will lead to
      slightly different values on successive runs of the tune program.

      divexact_by3 - used by toom3

         Toom3 does a divexact_by3 which at size N is roughly equivalent to
         N successively dependent multiplies with a further couple of extra
         instructions in between.  CPUs with a low latency multiply and good
         divexact_by3 implementation should see the toom3 threshold lowered.
         But note this is unlikely to have much effect on total multiply
         times.
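
         As an illustration of the trick involved, here is a hedged
         single-limb sketch (not the real mpn code, which also has to
         propagate a small carry from limb to limb): an exact division by
         3 is done with one multiply by the inverse of 3 modulo 2^32.

         #include <stdio.h>
         #include <stdint.h>

         #define INV3  0xAAAAAAABu   /* 3 * 0xAAAAAAAB == 1 (mod 2^32) */

         int
         main (void)
         {
           uint32_t  x = 3 * 123456789u;          /* a multiple of 3 */
           printf ("%lu\n", (unsigned long) (x * INV3));  /* 123456789 */
           return 0;
         }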

      Asymptotic behaviour

         At the fairly small sizes where the thresholds occur it's worth
         remembering that the asymptotic behaviour for karatsuba and toom3
         can't be expected to make accurate predictions, due of course to
         the big influence of all sorts of overheads, and the fact that only
         a few recursions of each are being performed.

         Even at large sizes there's a good chance machine dependent effects
         like cache architecture will mean actual performance deviates from
         what might be predicted.  This is why the rather positivist
         approach of just measuring things has been adopted, in general.

   TOOM3_SQR_THRESHOLD

      The same factors apply to squaring as to multiplying, though with
      overheads being proportionally a bit bigger.

   FFT_MUL_THRESHOLD, etc

      When configured with --enable-fft, a Fermat style FFT is used for
      multiplication above FFT_MUL_THRESHOLD, and a further threshold
      FFT_MODF_MUL_THRESHOLD exists for where FFT is used for a modulo 2^N+1
      multiply.  FFT_MUL_TABLE gives the thresholds at which each split size
      "k" is used in the FFT.

      step effect - coarse grained thresholds

         The FFT has size restrictions that mean it rounds up sizes to
         certain multiples and therefore does the same amount of work for a
         range of different sized operands.  For example at k=8 the size is
         internally rounded to a multiple of 1024 limbs.  The current single
         values for the various thresholds are set to give good average
         performance, but in the future multiple values might be wanted to
         take into account the different step sizes for different "k"s.

   FFT_SQR_THRESHOLD, etc

      The same considerations apply as for multiplications, plus the
      following.

      similarity to mul thresholds

         On some CPUs the squaring thresholds are nearly the same as those
         for multiplying.  It's not quite clear why this is; it might be due
         to similarly shaped size/time graphs for the muls and sqrs recursed
         into.

   BZ_THRESHOLD

      The B-Z division algorithm rearranges a traditional multi-precision
      long division so that NxN multiplies can be done rather than repeated
      Nx1 multiplies, thereby exploiting the algorithmic advantages of
      karatsuba and toom3, and leading to significant speedups.

      fast mul_basecase - decreases threshold

         CPUs with an optimized mul_basecase can expect a lower B-Z
         threshold due to the helping hand such a mul_basecase will give to
         B-Z as compared to submul_1 used in the schoolbook method.

   GCD_ACCEL_THRESHOLD

      Below this threshold a simple binary subtract and shift is used, above
      it Ken Weber's accelerated algorithm is used.  The accelerated GCD
      performs far fewer steps than the binary GCD and will normally kick in
      at quite small sizes.
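
      For illustration, here is a hedged sketch of the binary
      subtract-and-shift idea on single words (not the mpn code): the GCD
      is found using only shifts and subtracts, no division instruction.

      #include <stdio.h>

      unsigned long
      gcd_binary (unsigned long a, unsigned long b)
      {
        int  twos = 0;
        if (a == 0) return b;
        if (b == 0) return a;
        while (((a | b) & 1) == 0)        /* strip common factors of 2 */
          { a >>= 1; b >>= 1; twos++; }
        while ((a & 1) == 0) a >>= 1;
        while (b != 0)
          {
            while ((b & 1) == 0) b >>= 1;
            if (a > b) { unsigned long t = a; a = b; b = t; }
            b -= a;                       /* subtract smaller from larger */
          }
        return a << twos;
      }

      int
      main (void)
      {
        printf ("%lu\n", gcd_binary (48, 18));   /* prints 6 */
        return 0;
      }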

      modlimb_invert and find_a - affect threshold

         At small sizes the performance of modlimb_invert and find_a will
         affect the accelerated algorithm and CPUs where those routines are
         not well optimized may see a higher threshold.  (At large sizes
         mpn_addmul_1 and mpn_submul_1 come to dominate the accelerated
         algorithm.)
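
         For illustration, the idea behind modlimb_invert is a Newton
         iteration for the inverse of an odd limb modulo 2^32; a hedged
         sketch (not GMP's actual code) for 32-bit limbs:

         #include <stdint.h>

         /* n must be odd; n*n == 1 (mod 8) gives 3 correct low bits to
            start with, and each step doubles the number of correct bits */
         uint32_t
         modlimb_invert_sketch (uint32_t n)
         {
           uint32_t  inv = n;            /* 3 bits correct */
           inv *= 2 - n * inv;           /* 6 bits */
           inv *= 2 - n * inv;           /* 12 bits */
           inv *= 2 - n * inv;           /* 24 bits */
           inv *= 2 - n * inv;           /* 48 >= 32 bits */
           return inv;                   /* n * inv == 1 (mod 2^32) */
         }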

   GCDEXT_THRESHOLD

      mpn/generic/gcdext.c is based on Lehmer's multi-step improvement of
      Euclid's algorithm.  The multipliers are found using single limb
      calculations below GCDEXT_THRESHOLD, or double limb calculations
      above.  The single limb code is fast but doesn't produce full-limb
      multipliers.

      data-dependent multiplier - big threshold

         If multiplications done by mpn_mul_1, addmul_1 and submul_1 run
         slower when there are more bits in the multiplier, then producing
         bigger multipliers with the double limb calculation doesn't save
         much more than some looping and function call overheads.  A large
         threshold can then be expected.

      slow division - low threshold

         The single limb calculation does some plain "/" divisions, whereas
         the double limb calculation has a divide routine optimized for the
         small quotients that often occur.  Until the single limb code does
         something similar a slow hardware divide will count against it.





FUTURE

Make a program to check the time base is working properly, for small and
large measurements.  Make it able to test each available method, including
perhaps the apparent resolution of each.

Add versions of the toom3 multiplication using either the mpn calls or the
open-coded style, so the two can be compared.

Add versions of the generic C mpn_divrem_1 using straight division versus a
multiply by inverse, so the two can be compared.  Include the branch-free
version of multiply by inverse too.

Make an option in struct speed_parameters to specify operand overlap,
perhaps 0 for none, 1 for dst=src1, 2 for dst=src2, 3 for dst1=src1
dst2=src2, 4 for dst1=src2 dst2=src1.  This is done for addsub_n with the r
parameter (though addsub_n isn't yet enabled), and could be done for add_n,
xor_n, etc too.

When speed_measure() divides the total time measured by the repetitions
performed, it also divides the fixed overheads imposed by speed_starttime()
and speed_endtime().  When different routines are run with different
repetition counts, that overhead is then counted differently.  It would
improve precision to try to avoid this.  Currently the idea is just to set
speed_precision big enough that the effect is insignificant compared to the
routines being measured.




----------------
Local variables:
mode: text
fill-column: 76
End: