        GMP SPEED MEASURING AND PARAMETER TUNING


The programs in this directory are for knowledgeable users who want to
make measurements of the speed of GMP routines on their machine, and
perhaps tweak some settings or identify things that can be improved.

The programs here are tools, not ready to run solutions.  Nothing is
built in a normal "make all", but various Makefile targets described
below exist.

Relatively few systems and CPUs have been tested, so be sure to verify
that you're getting sensible results before relying on them.


MISCELLANEOUS NOTES

Don't configure with --enable-assert when using the things here, since
the extra code added by assertion checking may influence measurements.

Some effort has been made to accommodate CPUs with direct mapped caches,
but it will depend on TMP_ALLOC using a proper alloca, and even then it
may or may not be enough.

The sparc32/v9 addmul_1 code runs at noticeably different speeds on
successive sizes, and this has a bad effect on the tune program's
determinations of the multiply and square thresholds.


PARAMETER TUNING

The "tuneup" program runs some tests designed to find the best settings
for various thresholds, like KARATSUBA_MUL_THRESHOLD.  Its output can be
put into gmp-mparam.h.  The program can be built and run with

        make tune

If the thresholds indicated are grossly different from the values in the
selected gmp-mparam.h then you may get a performance boost in relevant
size ranges by changing gmp-mparam.h accordingly.

If your CPU has specific tuned parameters coming from a gmp-mparam.h in
one of the mpn subdirectories then the values from "make tune" should be
similar.  You can submit new values if it looks like the current ones
are out of date or wildly wrong.  But check you're on the right CPU
target and there aren't any machine-specific effects causing a
difference.

It's hoped the compiler and options used won't have too much effect on
thresholds, since for most CPUs they ultimately come down to comparisons
between assembler subroutines.  Missing out on the longlong.h macros by
not using gcc will probably have an effect.

Some thresholds produced by the tune program are merely single values
chosen from what's actually a range of sizes where two algorithms are
pretty much the same speed.  When this happens the program is likely to
give slightly different values on successive runs.  This is noticeable
on the toom3 thresholds for instance.
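To make that concrete, here's a sketch of the sort of #define block the
tune program prints for pasting into gmp-mparam.h.  The threshold names
are real ones discussed in this file, but the values are invented for
the example and shouldn't be copied anywhere; run "make tune" to get
figures for your own CPU.

        /* values here are purely illustrative, not from any real CPU */
        #define KARATSUBA_MUL_THRESHOLD      26
        #define TOOM3_MUL_THRESHOLD         177
        #define KARATSUBA_SQR_THRESHOLD      52
        #define TOOM3_SQR_THRESHOLD         173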
SPEED PROGRAM

The "speed" program can be used for measuring and comparing various
routines, and producing tables of data or gnuplot graphs.  Compile it
with

        make speed

Here are some examples of how to use it.  Check the code for all the
options.

Draw a graph of mpn_mul_n, stepping through sizes by 10 or a factor of
1.05 (whichever is greater).

        ./speed -s 10-5000 -t 10 -f 1.05 -P foo mpn_mul_n
        gnuplot foo.gnuplot

Compare mpn_add_n and mpn_lshift by 1, showing times in cycles and
showing under mpn_lshift the difference between it and mpn_add_n.

        ./speed -s 1-40 -c -d mpn_add_n mpn_lshift.1

Using option -c for times in cycles is interesting but normally only
necessary when looking carefully at assembler subroutines.  You might
think it would always give an integer value, but this doesn't happen in
practice, probably due to overheads in the time measurements.

In the free-form output the "#" symbol against a measurement means the
corresponding routine is fastest at that size.  This is a convenient
visual cue when comparing different routines.  The graph data files
.data don't get this since it would upset gnuplot or other data viewers.


TIME BASE

The time measuring method is determined in time.c, based on what the
configured target has available.  A microsecond accurate gettimeofday()
will work well, but there's code to use better methods, such as the
cycle counters on various CPUs.

Currently, all methods except possibly the alpha cycle counter depend on
the machine being otherwise idle, or rather on other jobs not stealing
CPU time from the measuring program.  Short routines (that complete
within a timeslice) should work even on a busy machine.

Some trouble is taken by speed_measure() in common.c to avoid the ill
effects of sporadic interrupts, or other intermittent things (like cron
waking up every minute).  But generally you'll want an idle machine to
be sure of consistent results.

The CPU frequency is needed if times in cycles are to be displayed, and
it's always needed when using a cycle counter time base.  time.c knows
how to get the frequency on some systems, but when that fails, or needs
to be overridden, an environment variable GMP_CPU_FREQUENCY can be used
(in Hertz).  For example in "bash" on a 650 MHz machine,

        export GMP_CPU_FREQUENCY=650e6

A high precision time base makes it possible to get accurate
measurements in a shorter time.

Support for systems and CPUs not already covered is wanted.  When
setting up a method, be sure not to claim a higher accuracy than is
really available.  For example the default gettimeofday() code is set
for microsecond accuracy, but if only 10ms or 55ms is available then
inconsistent results can be expected.
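For reference, here's a minimal sketch of the kind of gettimeofday()
based measuring that can serve as a baseline time base.  This is only an
illustration, not the actual code in time.c, which also has to deal with
calibration, wraparound and the accuracy claims just described.

        #include <stddef.h>      /* NULL */
        #include <sys/time.h>    /* gettimeofday */

        /* Wall-clock time in seconds, to at best microsecond
           accuracy.  A simplified illustration only.  */
        double
        simple_time (void)
        {
          struct timeval  tv;
          gettimeofday (&tv, NULL);
          return tv.tv_sec + tv.tv_usec * 1e-6;
        }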
EXAMPLE COMPARISONS

Here are some ideas for things you can do with the speed program.

There's always going to be a certain amount of overhead in the time
measurements, due to reading the time base, and in the loop that runs a
routine enough times to get a reading of the desired precision.  Noop
functions taking various arguments are available to measure this.  The
"overhead" printed by the speed program each time in its intro is the
"noop" routine, but note that this is just for information, it isn't
deducted from the times printed or anything.

        ./speed -s 1 noop noop_wxs noop_wxys

If you want to know how many cycles per limb a routine is taking, look
at the time increase when the size increments, using option -D.  This
avoids fixed overheads in the measuring.  Also, remember many of the
assembler routines have unrolled loops, so it might be necessary to
compare times at, say, 16, 32, 48, 64 etc to see what the unrolled part
is taking, as opposed to any finishing off.

        ./speed -s 16-64 -t 16 -C -D mpn_add_n

The -C option on its own gives cycles per limb, but is really only
useful at big sizes where fixed overheads are small compared to the code
doing the real work.  Remember of course memory caching and/or page
swapping will affect results at large sizes.

        ./speed -s 500000 -C mpn_add_n

Once a calculation stops fitting in the CPU data cache, it's going to
start taking longer.  Exactly where this happens depends on the cache
priming in the measuring routines, and on what sort of "least recently
used" the hardware does.  Here's an example for a CPU with a 16kbyte L1
data cache and 32-bit limb, showing a suddenly steeper curve for
mpn_add_n at about 2000 limbs.

        ./speed -s 1-4000 -t 5 -f 1.02 -P foo mpn_add_n
        gnuplot foo.gnuplot

When a routine has an unrolled loop for, say, multiples of 8 limbs and
then an ordinary loop for the remainder, it can happen that it's
actually faster to do an operation on, say, 8 limbs than it is on 7
limbs.  Here's an example drawing a graph of mpn_sub_n, which you can
look at to see if times smoothly increase with size.

        ./speed -s 1-100 -c -P foo mpn_sub_n
        gnuplot foo.gnuplot

If mpn_lshift and mpn_rshift for your CPU have special case code for
shifts by 1, it ought to be faster (or at least not slower) than
shifting by, say, 2 bits.

        ./speed -s 1-200 -c mpn_rshift.1 mpn_rshift.2

An mpn_lshift by 1 can be done by mpn_add_n adding a number to itself,
and if the lshift isn't faster there's an obvious improvement that's
possible.

        ./speed -s 1-200 -c mpn_lshift.1 mpn_add_n_self
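The equivalence being exploited there is just that x+x equals 2*x equals
x shifted left by 1 bit.  Here's a little standalone sketch of that fact
at the mpn level (the size and data are invented for the example); note
the carry out of the add matches the bit shifted out of the lshift.

        #include <stdio.h>
        #include <gmp.h>

        int
        main (void)
        {
          mp_limb_t  x[20], shifted[20], summed[20];
          mp_limb_t  out_shift, out_carry;
          mp_size_t  n = 20;
          mp_size_t  i;

          for (i = 0; i < n; i++)
            x[i] = 123 + 456 * (mp_limb_t) i;

          /* 2*x both ways: a left shift by 1 bit, and x+x */
          out_shift = mpn_lshift (shifted, x, n, 1);
          out_carry = mpn_add_n (summed, x, x, n);

          printf ("limbs %s, top bit %s\n",
                  mpn_cmp (shifted, summed, n) == 0 ? "agree" : "differ",
                  out_shift == out_carry ? "agrees" : "differs");
          return 0;
        }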
On some CPUs (AMD K6 for example) an "in-place" mpn_add_n where the
destination is one of the sources is faster than a separate destination.
Here's an example to see this.  (mpn_add_n_inplace is a special
measuring routine, not available for other operations.)

        ./speed -s 1-200 -c mpn_add_n mpn_add_n_inplace

The gmp manual recommends divisions by powers of two should be done
using a right shift because it'll be significantly faster.  Here's how
you can see by what factor mpn_rshift is faster, using division by 32 as
an example.

        ./speed -s 10-20 -r mpn_rshift.5 mpn_divrem_1.32

mul_basecase takes an "r" parameter that's the first (larger) size
parameter.  For example to show speeds for 20x1 up to 20x15 in cycles,

        ./speed -s 1-15 -c mpn_mul_basecase.20

mul_basecase with no parameter does an NxN multiply, so for example to
show speeds in cycles for 1x1, 2x2, 3x3, etc, up to 20x20,

        ./speed -s 1-20 -c mpn_mul_basecase

sqr_basecase is implemented by a "triangular" method on most CPUs,
making it up to twice as fast as mul_basecase.  In practice loop
overheads and the products on the diagonal mean it falls short of this.
Here's an example running the two and showing by what factor an NxN
mul_basecase is slower than an NxN sqr_basecase.  (Some versions of
sqr_basecase only allow sizes below KARATSUBA_SQR_THRESHOLD, so if it
crashes at that point don't worry.)

        ./speed -s 1-20 -r mpn_sqr_basecase mpn_mul_basecase

The technique described above with -CD for showing the time difference
in cycles per limb between two size operations can be done on an NxN
mul_basecase using -E to change the basis for the size increment to N*N.
For instance a 20x20 operation is taken to be doing 400 limbs, and a
16x16 doing 256 limbs.  The following therefore shows the per
crossproduct speed of mul_basecase and sqr_basecase at around 20x20
limbs.

        ./speed -s 16-20 -t 4 -CDE mpn_mul_basecase mpn_sqr_basecase

Of course sqr_basecase isn't really doing NxN crossproducts, but it can
be interesting to compare it to mul_basecase as if it was.  For
sqr_basecase the -F option can be used to base the deltas on N*(N+1)/2
operations, which is the triangular products sqr_basecase does.  For
example,

        ./speed -s 16-20 -t 4 -CDF mpn_sqr_basecase

Both -E and -F are preliminary and might change.  A consistent approach
to using them when claiming certain per crossproduct or per
triangularproduct speeds hasn't really been established, but taking the
increment between speeds across the range of sizes karatsuba will
actually call, that being k/2 up to k for threshold k, seems sensible.
For instance, if the karatsuba threshold was 20 for the multiply and 30
for the square,

        ./speed -s 10-20 -t 10 -CDE mpn_mul_basecase
        ./speed -s 15-30 -t 15 -CDF mpn_sqr_basecase

The gmp manual recommends application programs avoid excessive
initializing and clearing of mpz_t variables (and mpq_t and mpf_t too).
Every new variable will at a minimum go through an init, a realloc for
its first store, and finally a clear.  Quite how long that takes depends
on the C library.  The following compares an mpz_init/realloc/clear to a
10 limb mpz_add.

        ./speed -s 10 -c mpz_init_realloc_clear mpz_add
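In application code the usual way to avoid those costs is to hoist the
init and clear out of any loop and reuse one variable, letting its
allocation grow to the biggest size needed.  A small sketch of the
pattern (the function and loop body are invented for the example):

        #include <gmp.h>

        /* Set sum to 0^2+1^2+...+(n-1)^2, using one temporary for the
           whole loop rather than an init/realloc/clear on every
           iteration.  */
        void
        sum_of_squares (mpz_t sum, unsigned long n)
        {
          mpz_t  t;
          unsigned long  i;

          mpz_init (t);               /* one init ... */
          mpz_set_ui (sum, 0L);
          for (i = 0; i < n; i++)
            {
              mpz_set_ui (t, i);
              mpz_mul (t, t, t);      /* t = i*i, reusing t's space */
              mpz_add (sum, sum, t);
            }
          mpz_clear (t);              /* ... and one clear */
        }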
The normal libtool link of the speed program does a static link to
libgmp.la and libspeed.la, but will end up dynamic linked to libc.
Depending on the system, a dynamic linked malloc may be noticeably
slower than static linked, and you may want to re-run the libtool link
invocation to static link libc for comparison.  The example below does a
10 limb malloc/free or malloc/realloc/free to test the C library.  Of
course a real world program has big problems if it's doing so many
mallocs and frees that it gets slowed down by a dynamic linked malloc.

        ./speed -s 10 -c malloc_free malloc_realloc_free


SPEED PROGRAM EXTENSIONS

Potentially lots of things could be made available in the program, but
it's been left at only the things that have actually been wanted and are
likely to be reasonably useful in the future.  Extensions should be
fairly easy to make though.  speed-ext.c is an example, in a style that
should suit one-off tests, or new code fragments under development.


THRESHOLD EXAMINING

The speed program can be used to examine the speeds of different
algorithms to check the tune program has done the right thing.  For
example to examine the karatsuba multiply threshold,

        ./speed -s 5-40 mpn_mul_basecase mpn_kara_mul_n

When examining the toom3 threshold, remember it depends on the karatsuba
threshold, so the right karatsuba threshold needs to be compiled into
the library first.  The tune program uses special recompiled versions of
mpn/mul_n.c etc for this reason, but the speed program simply uses the
normal libgmp.la.

Note further that the various routines may recurse into themselves on
sizes far enough above applicable thresholds.  For example,
mpn_kara_mul_n will recurse into itself on sizes greater than twice the
compiled-in KARATSUBA_MUL_THRESHOLD.  When doing the above comparison
between mul_basecase and kara_mul_n what's probably of interest is
mul_basecase versus a kara_mul_n that does one level of karatsuba then
calls mul_basecase, but this only happens on sizes less than twice the
compiled KARATSUBA_MUL_THRESHOLD.  A larger value for that setting can
be compiled-in to avoid the problem if necessary.  The same applies to
toom3 and BZ, though in a trickier fashion.

There are some upper limits on some of the thresholds, arising from
arrays dimensioned according to a threshold (mpn_mul_n), or asm code
with certain sized displacements (some x86 versions of sqr_basecase).
So putting huge values for the thresholds, even just for testing, may
fail.


THINGS AFFECTING THRESHOLDS

The following are some general notes on some things that can affect the
various algorithm thresholds.

KARATSUBA_MUL_THRESHOLD

    At size 2N, karatsuba does three NxN multiplies and some adds and
    shifts, compared to a 2Nx2N basecase multiply which will be roughly
    equivalent to four NxN multiplies.

    Fast mul - increases threshold

        If the CPU has a fast multiply, the basecase multiplies are
        going to stay faster than the karatsuba overheads for longer.
        Conversely if the CPU has a slow multiply the karatsuba method
        trading some multiplies for adds will become worthwhile sooner.

        Remember it's "addmul" performance that's of interest here.
        This may differ from a simple "mul" instruction in the CPU.  For
        example K6 has a 3 cycle mul but takes nearly 8 cycles/limb for
        an addmul, and K7 has a 6 cycle mul latency but has a 4
        cycle/limb addmul due to pipelining.

    Unrolled addmul - increases threshold

        If the CPU addmul routine (or the addmul part of the
        mul_basecase routine) is unrolled it can mean that a 2Nx2N
        multiply is a bit faster than four NxN multiplies, due to
        proportionally less looping overheads.  This can be thought of
        as the addmul warming to its task on bigger sizes, and keeping
        the basecase better than karatsuba for longer.

    Karatsuba overheads - increases threshold

        Fairly obviously anything gained or lost in the karatsuba extra
        calculations will translate directly to the threshold.  But
        remember the extra calculations are likely to always be a
        relatively small fraction of the total multiply time and in that
        sense the basecase code is the best place to be looking for
        optimizations.

KARATSUBA_SQR_THRESHOLD

    Squaring is essentially the same as multiplying, so the above
    applies to squaring too.  Fixed overheads will, proportionally, be
    bigger when squaring, usually leading to a higher threshold.

    mpn/generic/sqr_basecase.c

        This relies on a reasonable umul_ppmm, and if the generic C code
        is being used it may badly affect the speed.  Don't bother
        paying attention to the square thresholds until you have either
        a good umul_ppmm or an assembler sqr_basecase.

TOOM3_MUL_THRESHOLD

    At size N, toom3 does five (N/3)x(N/3) multiplies and some extra
    calculations, compared to karatsuba doing three (N/2)x(N/2)
    multiplies and some extra calculations (fewer).  Toom3 will become
    better before long, being O(n^1.465) versus karatsuba at O(n^1.585),
    but exactly where depends a great deal on the implementations of all
    the relevant bits of extra calculation.

    In practice the curves for time versus size on toom3 and karatsuba
    have similar slopes near their crossover, leading to a range of
    sizes where there's very little difference between the two.
    Choosing a single value from the range is a bit arbitrary and will
    lead to slightly different values on successive runs of the tune
    program.

    divexact_by3 - used by toom3

        Toom3 does a divexact_by3 which at size N is roughly equivalent
        to N successively dependent multiplies with a further couple of
        extra instructions in between.  CPUs with a low latency multiply
        and good divexact_by3 implementation should see the toom3
        threshold lowered.  But note this is unlikely to have much
        effect on total multiply times.
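        The reason an exact division by 3 can run at multiply speed at
        all is that 3 has a multiplicative inverse modulo 2^N (N the
        limb bit count), so each quotient limb needs only a multiply by
        that inverse, not a hardware divide.  A single-limb sketch of
        the idea, written for 32-bit words; 0xAAAAAAAB is the inverse of
        3 since 3*0xAAAAAAAB == 1 mod 2^32.  The real multi-limb code
        also has to propagate borrows between limbs.

            #include <stdio.h>
            #include <stdint.h>

            /* Exact division of a 32-bit word by 3, via a multiply by
               the inverse of 3 mod 2^32.  Only valid when x is a
               multiple of 3, which is the case divexact is for.  */
            uint32_t
            divexact_by3_1 (uint32_t x)
            {
              return x * 0xAAAAAAABu;
            }

            int
            main (void)
            {
              /* prints 12345 */
              printf ("%lu\n", (unsigned long) divexact_by3_1 (3 * 12345u));
              return 0;
            }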
    Asymptotic behaviour

        At the fairly small sizes where the thresholds occur it's worth
        remembering that the asymptotic behaviour for karatsuba and
        toom3 can't be expected to make accurate predictions, due of
        course to the big influence of all sorts of overheads, and the
        fact that only a few recursions of each are being performed.
        Even at large sizes there's a good chance machine dependent
        effects like cache architecture will mean actual performance
        deviates from what might be predicted.  This is why the rather
        positivist approach of just measuring things has been adopted,
        in general.

TOOM3_SQR_THRESHOLD

    The same factors apply to squaring as to multiplying, though with
    overheads being proportionally a bit bigger.

FFT_MUL_THRESHOLD, etc

    When configured with --enable-fft, a Fermat style FFT is used for
    multiplication above FFT_MUL_THRESHOLD, and a further threshold
    FFT_MODF_MUL_THRESHOLD exists for where the FFT is used for a modulo
    2^N+1 multiply.  FFT_MUL_TABLE gives the thresholds at which each
    split size "k" is used in the FFT.

    step effect - coarse grained thresholds

        The FFT has size restrictions that mean it rounds up sizes to
        certain multiples and therefore does the same amount of work for
        a range of different sized operands.  For example at k=8 the
        size is internally rounded to a multiple of 1024 limbs.  The
        current single values for the various thresholds are set to give
        good average performance, but in the future multiple values
        might be wanted to take into account the different step sizes
        for different "k"s.

FFT_SQR_THRESHOLD, etc

    The same considerations apply as for multiplications, plus the
    following.

    similarity to mul thresholds

        On some CPUs the squaring thresholds are nearly the same as
        those for multiplying.  It's not quite clear why this is; it
        might be that the muls and sqrs recursed into have similarly
        shaped size/time graphs.

BZ_THRESHOLD

    The B-Z (Burnikel-Ziegler) division algorithm rearranges a
    traditional multi-precision long division so that NxN multiplies can
    be done rather than repeated Nx1 multiplies, thereby exploiting the
    algorithmic advantages of karatsuba and toom3, and leading to
    significant speedups.

    fast mul_basecase - decreases threshold

        CPUs with an optimized mul_basecase can expect a lower B-Z
        threshold due to the helping hand such a mul_basecase will give
        to B-Z as compared to submul_1 used in the schoolbook method.

GCD_ACCEL_THRESHOLD

    Below this threshold a simple binary subtract and shift is used,
    above it Ken Weber's accelerated algorithm is used.  The accelerated
    GCD performs far fewer steps than the binary GCD and will normally
    kick in at quite small sizes.

    modlimb_invert and find_a - affect threshold

        At small sizes the performance of modlimb_invert and find_a will
        affect the accelerated algorithm, and CPUs where those routines
        are not well optimized may see a higher threshold.  (At large
        sizes mpn_addmul_1 and mpn_submul_1 come to dominate the
        accelerated algorithm.)
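    For reference, the "simple binary subtract and shift" below the
    threshold is the classical binary GCD: strip common factors of 2,
    then repeatedly subtract the smaller odd value from the larger and
    strip the twos from the difference.  Here's a one-word sketch of
    that algorithm (the mpn code does the same on full multi-limb
    operands, of course).

        /* Binary GCD on single words, illustrating the sub-threshold
           algorithm.  */
        unsigned long
        binary_gcd (unsigned long x, unsigned long y)
        {
          int  twos = 0;

          if (x == 0)  return y;
          if (y == 0)  return x;

          while (((x | y) & 1) == 0)   /* common factors of 2 */
            {  x >>= 1;  y >>= 1;  twos++;  }

          while ((x & 1) == 0)  x >>= 1;   /* make both odd */
          while ((y & 1) == 0)  y >>= 1;

          while (x != y)   /* both odd, so the difference is even */
            {
              if (x > y)  { x -= y;  do x >>= 1; while ((x & 1) == 0); }
              else        { y -= x;  do y >>= 1; while ((y & 1) == 0); }
            }
          return x << twos;
        }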
GCDEXT_THRESHOLD

    mpn/generic/gcdext.c is based on Lehmer's multi-step improvement of
    Euclid's algorithm.  The multipliers are found using single limb
    calculations below GCDEXT_THRESHOLD, or double limb calculations
    above.  The single limb code is fast but doesn't produce full-limb
    multipliers.

    data-dependent multiplier - big threshold

        If multiplications done by mpn_mul_1, addmul_1 and submul_1 run
        slower when there's more bits in the multiplier, then producing
        bigger multipliers with the double limb calculation doesn't save
        much more than some looping and function call overheads.  A
        large threshold can then be expected.

    slow division - low threshold

        The single limb calculation does some plain "/" divisions,
        whereas the double limb calculation has a divide routine
        optimized for the small quotients that often occur.  Until the
        single limb code does something similar a slow hardware divide
        will count against it.


FUTURE

Make a program to check the time base is working properly, for small and
large measurements.  Make it able to test each available method,
including perhaps the apparent resolution of each.

Add versions of the toom3 multiplication using either the mpn calls or
the open-coded style, so the two can be compared.

Add versions of the generic C mpn_divrem_1 using straight division
versus a multiply by inverse, so the two can be compared.  Include the
branch-free version of multiply by inverse too.

Make an option in struct speed_parameters to specify operand overlap,
perhaps 0 for none, 1 for dst=src1, 2 for dst=src2, 3 for dst1=src1
dst2=src2, 4 for dst1=src2 dst2=src1.  This is done for addsub_n with
the r parameter (though addsub_n isn't yet enabled), and could be done
for add_n, xor_n, etc too.

When speed_measure() divides the total time measured by the repetitions
performed, the fixed overheads imposed by speed_starttime() and
speed_endtime() get divided along with it, so when different routines
are run with different repetitions that overhead is counted differently.
It would improve precision to try to avoid this.  Currently the idea is
just to set speed_precision big enough that the effect is insignificant
compared to the routines being measured.



----------------
Local variables:
mode: text
fill-column: 76
End: