OpenXM_contrib/gmp/tune/README - annotate

Return to README CVS log
Up to [local] / OpenXM_contrib / gmp / tune
Annotation of OpenXM_contrib/gmp/tune/README, Revision 1.1.1.2

1.1.1.2 ! ohara       1: Copyright 2000, 2001, 2002 Free Software Foundation, Inc.
        !             2:
        !             3: This file is part of the GNU MP Library.
        !             4:
        !             5: The GNU MP Library is free software; you can redistribute it and/or modify
        !             6: it under the terms of the GNU Lesser General Public License as published by
        !             7: the Free Software Foundation; either version 2.1 of the License, or (at your
        !             8: option) any later version.
        !             9:
        !            10: The GNU MP Library is distributed in the hope that it will be useful, but
        !            11: WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
        !            12: or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
        !            13: License for more details.
        !            14:
        !            15: You should have received a copy of the GNU Lesser General Public License
        !            16: along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
        !            17: the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
        !            18: 02111-1307, USA.
        !            19:
        !            20:
        !            21:
        !            22:
1.1       maekawa    23:
                     24:                GMP SPEED MEASURING AND PARAMETER TUNING
                     25:
                     26:
1.1.1.2 ! ohara      27: The programs in this directory are for knowledgeable users who want to
        !            28: measure GMP routines on their machine, and perhaps tweak some settings or
        !            29: identify things that can be improved.
1.1       maekawa    30:
                     31: The programs here are tools, not ready to run solutions.  Nothing is built
                     32: in a normal "make all", but various Makefile targets described below exist.
                     33:
                     34: Relatively few systems and CPUs have been tested, so be sure to verify that
1.1.1.2 ! ohara      35: results are sensible before relying on them.
1.1       maekawa    36:
                     37:
                     38:
                     39:
                     40: MISCELLANEOUS NOTES
                     41:
1.1.1.2 ! ohara      42: --enable-assert
        !            43:
        !            44:     Don't configure with --enable-assert, since the extra code added by
        !            45:     assertion checking may influence measurements.
        !            46:
        !            47: Direct mapped caches
        !            48:
        !            49:     Some effort has been made to accommodate CPUs with direct mapped caches,
        !            50:     by putting data blocks more or less contiguously on the stack.  But this
        !            51:     will depend on TMP_ALLOC using alloca, and even then it may or may not
        !            52:     be enough.
        !            53:
        !            54: FreeBSD 4.2 i486 getrusage
        !            55:
        !            56:     This getrusage seems to be a bit doubtful, it looks like it's
        !            57:     microsecond accurate, but sometimes ru_utime remains unchanged after a
        !            58:     time of many microseconds has elapsed.  It'd be good to detect this in
        !            59:     the time.c initializations, but for now the suggestion is to pretend it
        !            60:     doesn't exist.
        !            61:
        !            62:         ./configure ac_cv_func_getrusage=no
        !            63:
        !            64: NetBSD 1.4.1 m68k macintosh time base
1.1       maekawa    65:
1.1.1.2 ! ohara      66:     On this system it's been found getrusage often goes backwards, making it
        !            67:     unusable (configure is setup to ignore it).  gettimeofday sometimes
        !            68:     doesn't update atomically when it crosses a 1 second boundary.  Not sure
        !            69:     what to do about this.  Expect intermittent failures.
1.1       maekawa    70:
1.1.1.2 ! ohara      71: SCO OpenUNIX 8 /etc/hw
        !            72:
        !            73:     /etc/hw takes about a second to return the cpu frequency, which suggests
        !            74:     perhaps it's measuring each time it runs.  If this is annoying when
        !            75:     running the speed program repeatedly then set a GMP_CPU_FREQUENCY
        !            76:     environment variable (see TIME BASE section below).
        !            77:
        !            78: Low resolution timebase
        !            79:
        !            80:     Parameter tuning can be very time consuming if the only timebase
        !            81:     available is a 10 millisecond clock tick, to the point of being
        !            82:     unusable.  This is currently the case on VAX and ARM systems.
1.1       maekawa    83:
                     84:
                     85:
                     86:
                     87: PARAMETER TUNING
                     88:
                     89: The "tuneup" program runs some tests designed to find the best settings for
1.1.1.2 ! ohara      90: various thresholds, like MUL_KARATSUBA_THRESHOLD.  Its output can be put
        !            91: into gmp-mparam.h.  The program is built and run with
1.1       maekawa    92:
                     93:         make tune
                     94:
                     95: If the thresholds indicated are grossly different from the values in the
1.1.1.2 ! ohara      96: selected gmp-mparam.h then there may be a performance boost in applicable
        !            97: size ranges by changing gmp-mparam.h accordingly.
1.1       maekawa    98:
1.1.1.2 ! ohara      99: Be sure to do a full reconfigure and rebuild to get any newly set thresholds
        !           100: to take effect.  A partial rebuild is enough sometimes, but a fresh
        !           101: configure and make is certain to be correct.
        !           102:
        !           103: If a CPU has specific tuned parameters coming from a gmp-mparam.h in one of
        !           104: the mpn subdirectories then the values from "make tune" should be similar.
        !           105: But check that the configured CPU is right and there are no machine specific
        !           106: effects causing a difference.
1.1       maekawa   107:
                    108: It's hoped the compiler and options used won't have too much effect on
                    109: thresholds, since for most CPUs they ultimately come down to comparisons
                    110: between assembler subroutines.  Missing out on the longlong.h macros by not
                    111: using gcc will probably have an effect.
                    112:
                    113: Some thresholds produced by the tune program are merely single values chosen
1.1.1.2 ! ohara     114: from what's a range of sizes where two algorithms are pretty much the same
        !           115: speed.  When this happens the program is likely to give somewhat different
        !           116: values on successive runs.  This is noticeable on the toom3 thresholds for
        !           117: instance.
1.1       maekawa   118:
                    119:
                    120:
                    121:
                    122: SPEED PROGRAM
                    123:
                    124: The "speed" program can be used for measuring and comparing various
                    125: routines, and producing tables of data or gnuplot graphs.  Compile it with
                    126:
                    127:        make speed
                    128:
1.1.1.2 ! ohara     129: (Or on DOS systems "make speed.exe".)
        !           130:
1.1       maekawa   131: Here are some examples of how to use it.  Check the code for all the
                    132: options.
                    133:
                    134: Draw a graph of mpn_mul_n, stepping through sizes by 10 or a factor of 1.05
                    135: (whichever is greater).
                    136:
                    137:         ./speed -s 10-5000 -t 10 -f 1.05 -P foo mpn_mul_n
                    138:        gnuplot foo.gnuplot
                    139:
1.1.1.2 ! ohara     140: Compare mpn_add_n and an mpn_lshift by 1, showing times in cycles and
        !           141: showing under mpn_lshift the difference between it and mpn_add_n.
1.1       maekawa   142:
                    143:        ./speed -s 1-40 -c -d mpn_add_n mpn_lshift.1
                    144:
                    145: Using option -c for times in cycles is interesting but normally only
                    146: necessary when looking carefully at assembler subroutines.  You might think
                    147: it would always give an integer value, but this doesn't happen in practice,
                    148: probably due to overheads in the time measurements.
                    149:
                    150: In the free-form output the "#" symbol against a measurement means the
                    151: corresponding routine is fastest at that size.  This is a convenient visual
                    152: cue when comparing different routines.  The graph data files <name>.data
                    153: don't get this since it would upset gnuplot or other data viewers.
                    154:
                    155:
                    156:
                    157:
                    158: TIME BASE
                    159:
                    160: The time measuring method is determined in time.c, based on what the
1.1.1.2 ! ohara     161: configured host has available.  A cycle counter is preferred, possibly
        !           162: supplemented by another method if the counter has a limited range.  A
        !           163: microsecond accurate getrusage() or gettimeofday() will work quite well too.
        !           164:
        !           165: The cycle counters (except possibly on alpha) and gettimeofday() will depend
        !           166: on the machine being otherwise idle, or rather on other jobs not stealing
        !           167: CPU time from the measuring program.  Short routines (those that complete
        !           168: within a timeslice) should work even on a busy machine.
        !           169:
        !           170: Some trouble is taken by speed_measure() in common.c to avoid ill effects
        !           171: from sporadic interrupts, or other intermittent things (like cron waking up
        !           172: every minute).  But generally an idle machine will be necessary to be
        !           173: certain of consistent results.
        !           174:
        !           175: The CPU frequency is needed to convert between cycles and seconds, or for
        !           176: when a cycle counter is supplemented by getrusage() etc.  The speed program
        !           177: will convert as necessary according to the output format requested.  The
        !           178: tune program will work with either cycles or seconds.
        !           179:
        !           180: freq.c knows how to get the frequency on some systems, or can measure a
        !           181: cycle counter against gettimeofday() or getrusage(), but when that fails, or
        !           182: needs to be overridden, an environment variable GMP_CPU_FREQUENCY can be
        !           183: used (in Hertz).  For example in "bash" on a 650 MHz machine,
1.1       maekawa   184:
                    185:        export GMP_CPU_FREQUENCY=650e6
                    186:
                    187: A high precision time base makes it possible to get accurate measurements in
1.1.1.2 ! ohara     188: a shorter time.
1.1       maekawa   189:
                    190:
                    191:
                    192:
1.1.1.2 ! ohara     193: EXAMPLE COMPARISONS - VARIOUS
1.1       maekawa   194:
1.1.1.2 ! ohara     195: Here are some ideas for things that can be done with the speed program.
1.1       maekawa   196:
                    197: There's always going to be a certain amount of overhead in the time
                    198: measurements, due to reading the time base, and in the loop that runs a
                    199: routine enough times to get a reading of the desired precision.  Noop
                    200: functions taking various arguments are available to measure this.  The
                    201: "overhead" printed by the speed program each time in its intro is the "noop"
                    202: routine, but note that this is just for information, it isn't deducted from
                    203: the times printed or anything.
                    204:
                    205:        ./speed -s 1 noop noop_wxs noop_wxys
                    206:
1.1.1.2 ! ohara     207: To see how many cycles per limb a routine is taking, look at the time
        !           208: increase when the size increments, using option -D.  This avoids fixed
        !           209: overheads in the measuring.  Also, remember many of the assembler routines
        !           210: have unrolled loops, so it might be necessary to compare times at, say, 16,
        !           211: 32, 48, 64 etc to see what the unrolled part is taking, as opposed to any
        !           212: finishing off.
1.1       maekawa   213:
                    214:         ./speed -s 16-64 -t 16 -C -D mpn_add_n
                    215:
                    216: The -C option on its own gives cycles per limb, but is really only useful at
                    217: big sizes where fixed overheads are small compared to the code doing the
                    218: real work.  Remember of course memory caching and/or page swapping will
                    219: affect results at large sizes.
                    220:
                    221:         ./speed -s 500000 -C mpn_add_n
                    222:
                    223: Once a calculation stops fitting in the CPU data cache, it's going to start
                    224: taking longer.  Exactly where this happens depends on the cache priming in
                    225: the measuring routines, and on what sort of "least recently used" the
                    226: hardware does.  Here's an example for a CPU with a 16kbyte L1 data cache and
                    227: 32-bit limb, showing a suddenly steeper curve for mpn_add_n at about 2000
                    228: limbs.
                    229:
                    230:         ./speed -s 1-4000 -t 5 -f 1.02 -P foo mpn_add_n
                    231:        gnuplot foo.gnuplot
                    232:
                    233: When a routine has an unrolled loop for, say, multiples of 8 limbs and then
                    234: an ordinary loop for the remainder, it can happen that it's actually faster
1.1.1.2 ! ohara     235: to do an operation on, say, 8 limbs than it is on 7 limbs.  The following
        !           236: draws a graph of mpn_sub_n, to see whether times smoothly increase with
        !           237: size.
1.1       maekawa   238:
                    239:         ./speed -s 1-100 -c -P foo mpn_sub_n
                    240:        gnuplot foo.gnuplot
                    241:
1.1.1.2 ! ohara     242: If mpn_lshift and mpn_rshift have special case code for shifts by 1, it
        !           243: ought to be faster (or at least not slower) than shifting by, say, 2 bits.
1.1       maekawa   244:
                    245:         ./speed -s 1-200 -c mpn_rshift.1 mpn_rshift.2
                    246:
                    247: An mpn_lshift by 1 can be done by mpn_add_n adding a number to itself, and
                    248: if the lshift isn't faster there's an obvious improvement that's possible.
                    249:
                    250:         ./speed -s 1-200 -c mpn_lshift.1 mpn_add_n_self
                    251:
                    252: On some CPUs (AMD K6 for example) an "in-place" mpn_add_n where the
                    253: destination is one of the sources is faster than a separate destination.
1.1.1.2 ! ohara     254: Here's an example to see this.  ".1" selects dst==src1 for mpn_add_n (and
        !           255: mpn_sub_n), for other values see speed.h SPEED_ROUTINE_MPN_BINARY_N_CALL.
1.1       maekawa   256:
1.1.1.2 ! ohara     257:         ./speed -s 1-200 -c mpn_add_n mpn_add_n.1
1.1       maekawa   258:
1.1.1.2 ! ohara     259: The gmp manual points out that divisions by powers of two should be done
        !           260: using a right shift because it'll be significantly faster than an actual
        !           261: division.  The following shows by what factor mpn_rshift is faster than
        !           262: mpn_divrem_1, using division by 32 as an example.
1.1       maekawa   263:
                    264:         ./speed -s 10-20 -r mpn_rshift.5 mpn_divrem_1.32
                    265:
1.1.1.2 ! ohara     266:
        !           267:
        !           268:
        !           269: EXAMPLE COMPARISONS - MULTIPLICATION
        !           270:
        !           271: mul_basecase takes a ".<r>" parameter which is the first (larger) size
1.1       maekawa   272: parameter.  For example to show speeds for 20x1 up to 20x15 in cycles,
                    273:
                    274:         ./speed -s 1-15 -c mpn_mul_basecase.20
                    275:
                    276: mul_basecase with no parameter does an NxN multiply, so for example to show
                    277: speeds in cycles for 1x1, 2x2, 3x3, etc, up to 20x20, in cycles,
                    278:
                    279:         ./speed -s 1-20 -c mpn_mul_basecase
                    280:
                    281: sqr_basecase is implemented by a "triangular" method on most CPUs, making it
                    282: up to twice as fast as mul_basecase.  In practice loop overheads and the
                    283: products on the diagonal mean it falls short of this.  Here's an example
                    284: running the two and showing by what factor an NxN mul_basecase is slower
                    285: than an NxN sqr_basecase.  (Some versions of sqr_basecase only allow sizes
1.1.1.2 ! ohara     286: below SQR_KARATSUBA_THRESHOLD, so if it crashes at that point don't worry.)
1.1       maekawa   287:
                    288:         ./speed -s 1-20 -r mpn_sqr_basecase mpn_mul_basecase
                    289:
                    290: The technique described above with -CD for showing the time difference in
                    291: cycles per limb between two size operations can be done on an NxN
                    292: mul_basecase using -E to change the basis for the size increment to N*N.
                    293: For instance a 20x20 operation is taken to be doing 400 limbs, and a 16x16
                    294: doing 256 limbs.  The following therefore shows the per crossproduct speed
                    295: of mul_basecase and sqr_basecase at around 20x20 limbs.
                    296:
                    297:         ./speed -s 16-20 -t 4 -CDE mpn_mul_basecase mpn_sqr_basecase
                    298:
                    299: Of course sqr_basecase isn't really doing NxN crossproducts, but it can be
                    300: interesting to compare it to mul_basecase as if it was.  For sqr_basecase
                    301: the -F option can be used to base the deltas on N*(N+1)/2 operations, which
                    302: is the triangular products sqr_basecase does.  For example,
                    303:
                    304:         ./speed -s 16-20 -t 4 -CDF mpn_sqr_basecase
                    305:
                    306: Both -E and -F are preliminary and might change.  A consistent approach to
                    307: using them when claiming certain per crossproduct or per triangularproduct
                    308: speeds hasn't really been established, but the increment between speeds in
                    309: the range karatsuba will call seems sensible, that being k to k/2.  For
                    310: instance, if the karatsuba threshold was 20 for the multiply and 30 for the
                    311: square,
                    312:
                    313:         ./speed -s 10-20 -t 10 -CDE mpn_mul_basecase
                    314:         ./speed -s 15-30 -t 15 -CDF mpn_sqr_basecase
                    315:
1.1.1.2 ! ohara     316: Two versions of toom3 interpolation and evaluation are available in
        !           317: mpn/generic/mul_n.c, using either a one-pass open-coded style or simple mpn
        !           318: subroutine calls.  The former is used on RISCs with lots of registers, the
        !           319: latter on other CPUs.  The two can be compared directly to check which is
        !           320: best.  Naturally it's sizes where toom3 is faster than karatsuba that are of
        !           321: interest.
        !           322:
        !           323:         ./speed -s 80-120 -c mpn_toom3_mul_n_mpn mpn_toom3_mul_n_open
        !           324:         ./speed -s 80-120 -c mpn_toom3_sqr_n_mpn mpn_toom3_sqr_n_open
        !           325:
        !           326:
        !           327:
        !           328:
        !           329: EXAMPLE COMPARISONS - MALLOC
        !           330:
1.1       maekawa   331: The gmp manual recommends application programs avoid excessive initializing
                    332: and clearing of mpz_t variables (and mpq_t and mpf_t too).  Every new
                    333: variable will at a minimum go through an init, a realloc for its first
                    334: store, and finally a clear.  Quite how long that takes depends on the C
                    335: library.  The following compares an mpz_init/realloc/clear to a 10 limb
1.1.1.2 ! ohara     336: mpz_add.  Don't be surprised if the mallocing is quite slow.
1.1       maekawa   337:
                    338:         ./speed -s 10 -c mpz_init_realloc_clear mpz_add
                    339:
1.1.1.2 ! ohara     340: On some systems malloc and free are much slower when dynamic linked.  The
        !           341: speed-dynamic program can be used to see this.  For example the following
        !           342: measures malloc/free, first static then dynamic.
        !           343:
        !           344:         ./speed -s 10 -c malloc_free
        !           345:         ./speed-dynamic -s 10 -c malloc_free
        !           346:
        !           347: Of course a real world program has big problems if it's doing so many
        !           348: mallocs and frees that it gets slowed down by a dynamic linked malloc.
        !           349:
        !           350:
        !           351:
        !           352:
        !           353:
        !           354: EXAMPLE COMPARISONS - STRING CONVERSIONS
        !           355:
        !           356: mpn_get_str does a binary to string conversion.  The base is specified with
        !           357: a ".<r>" parameter, or decimal by default.  Power of 2 bases are much faster
        !           358: than general bases.  The following compares decimal and hex for instance.
        !           359:
        !           360:         ./speed -s 1-20 -c mpn_get_str mpn_get_str.16
        !           361:
        !           362: Smaller bases need more divisions to split a given size number, and so are
        !           363: slower.  The following compares base 3 and base 9.  On small operands 9 will
        !           364: be nearly twice as fast, though at bigger sizes this reduces since in the
        !           365: current implementation both divide repeatedly by 3^20 (or 3^40 for 64 bit
        !           366: limbs) and those divisions come to dominate.
1.1       maekawa   367:
1.1.1.2 ! ohara     368:         ./speed -s 1-20 -cr mpn_get_str.3 mpn_get_str.9
        !           369:
        !           370: mpn_set_str does a string to binary conversion.  The base is specified with
        !           371: a ".<r>" parameter, or decimal by default.  Power of 2 bases are faster than
        !           372: general bases on large conversions.
        !           373:
        !           374:        ./speed -s 1-512 -f 2 -c mpn_set_str.8 mpn_set_str.10
        !           375:
        !           376: mpn_set_str also has some special case code for decimal which is a bit
        !           377: faster than the general case, basically by giving the compiler a chance to
        !           378: optimize some multiplications by 10.
        !           379:
        !           380:        ./speed -s 20-40 -c mpn_set_str.9 mpn_set_str.10 mpn_set_str.11
        !           381:
        !           382:
        !           383:
        !           384:
        !           385: EXAMPLE COMPARISONS - GCDs
        !           386:
        !           387: mpn_gcd_1 has a threshold for when to reduce using an initial x%y when both
        !           388: x and y are single limbs.  This isn't tuned currently, but a value can be
        !           389: established by a measurement like
        !           390:
        !           391:        ./speed -s 10-32 mpn_gcd_1.10
        !           392:
        !           393: This runs src[0] from 10 to 32 bits, and y fixed at 10 bits.  If the div
        !           394: threshold is high, say 31 so it's effectively disabled then a 32x10 bit gcd
        !           395: is done by nibbling away at the 32-bit operands bit-by-bit.  When the
        !           396: threshold is small, say 1 bit, then an initial x%y is done to reduce it to a
        !           397: 10x10 bit operation.
        !           398:
        !           399: The threshold in mpn/generic/gcd_1.c or the various assembler
        !           400: implementations can be tweaked up or down until there's no more speedups on
        !           401: interesting combinations of sizes.  Note that this affects only a 1x1 limb
        !           402: operation and so isn't very important.  (An Nx1 limb operation always does
        !           403: an initial modular reduction, using mpn_mod_1 or mpn_modexact_1_odd.)
1.1       maekawa   404:
                    405:
                    406:
                    407:
                    408: SPEED PROGRAM EXTENSIONS
                    409:
                    410: Potentially lots of things could be made available in the program, but it's
                    411: been left at only the things that have actually been wanted and are likely
                    412: to be reasonably useful in the future.
                    413:
                    414: Extensions should be fairly easy to make though.  speed-ext.c is an example,
                    415: in a style that should suit one-off tests, or new code fragments under
                    416: development.
                    417:
1.1.1.2 ! ohara     418: many.pl is a script for generating a new speed program supplemented with
        !           419: alternate versions of the standard routines.  It can be used for measuring
        !           420: experimental code, or for comparing different implementations that exist
        !           421: within a CPU family.
        !           422:
1.1       maekawa   423:
                    424:
                    425:
                    426: THRESHOLD EXAMINING
                    427:
                    428: The speed program can be used to examine the speeds of different algorithms
                    429: to check the tune program has done the right thing.  For example to examine
                    430: the karatsuba multiply threshold,
                    431:
                    432:        ./speed -s 5-40 mpn_mul_basecase mpn_kara_mul_n
                    433:
                    434: When examining the toom3 threshold, remember it depends on the karatsuba
                    435: threshold, so the right karatsuba threshold needs to be compiled into the
1.1.1.2 ! ohara     436: library first.  The tune program uses specially recompiled versions of
1.1       maekawa   437: mpn/mul_n.c etc for this reason, but the speed program simply uses the
                    438: normal libgmp.la.
                    439:
                    440: Note further that the various routines may recurse into themselves on sizes
                    441: far enough above applicable thresholds.  For example, mpn_kara_mul_n will
                    442: recurse into itself on sizes greater than twice the compiled-in
1.1.1.2 ! ohara     443: MUL_KARATSUBA_THRESHOLD.
1.1       maekawa   444:
                    445: When doing the above comparison between mul_basecase and kara_mul_n what's
                    446: probably of interest is mul_basecase versus a kara_mul_n that does one level
                    447: of Karatsuba then calls to mul_basecase, but this only happens on sizes less
1.1.1.2 ! ohara     448: than twice the compiled MUL_KARATSUBA_THRESHOLD.  A larger value for that
1.1       maekawa   449: setting can be compiled-in to avoid the problem if necessary.  The same
1.1.1.2 ! ohara     450: applies to toom3 and DC, though in a trickier fashion.
1.1       maekawa   451:
                    452: There are some upper limits on some of the thresholds, arising from arrays
                    453: dimensioned according to a threshold (mpn_mul_n), or asm code with certain
                    454: sized displacements (some x86 versions of sqr_basecase).  So putting huge
                    455: values for the thresholds, even just for testing, may fail.
                    456:
                    457:
                    458:
                    459:
                    460: FUTURE
                    461:
                    462: Make a program to check the time base is working properly, for small and
                    463: large measurements.  Make it able to test each available method, including
                    464: perhaps the apparent resolution of each.
                    465:
1.1.1.2 ! ohara     466: Make a general mechanism for specifying operand overlap, and a syntax like
        !           467: maybe "mpn_add_n.dst=src2" to select it.  Some measuring routines do this
        !           468: sort of thing with the "r" parameter currently.
1.1       maekawa   469:
                    470:
                    471:
                    472: ----------------
                    473: Local variables:
                    474: mode: text
                    475: fill-column: 76
                    476: End:
FreeBSD-CVSweb <freebsd-cvsweb@FreeBSD.org>