===================================================================
RCS file: /home/cvs/OpenXM_contrib/gmp/doc/Attic/tasks.html,v
retrieving revision 1.1.1.1
retrieving revision 1.1.1.2
diff -u -p -r1.1.1.1 -r1.1.1.2
--- OpenXM_contrib/gmp/doc/Attic/tasks.html 2000/09/09 14:12:20 1.1.1.1
+++ OpenXM_contrib/gmp/doc/Attic/tasks.html 2003/08/25 16:06:11 1.1.1.2
@@ -13,342 +13,888 @@
+
+Copyright 2000, 2001, 2002 Free Software Foundation, Inc.
+This file is part of the GNU MP Library.
+The GNU MP Library is free software; you can redistribute it and/or modify
+it under the terms of the GNU Lesser General Public License as published
+by the Free Software Foundation; either version 2.1 of the License, or (at
+your option) any later version.
+The GNU MP Library is distributed in the hope that it will be useful, but
+WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public
+License for more details.
+You should have received a copy of the GNU Lesser General Public License
+along with the GNU MP Library; see the file COPYING.LIB. If not, write to
+the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston,
+MA 02111-1307, USA.
+
+
+
-This file lists itemized GMP development tasks.
+These are itemized GMP development tasks.  Not all the tasks listed here
+ are suitable for volunteers, but many of them are.  Please see the
+ projects file for more sizeable projects.
_mpz_realloc
with a small (1 limb) size.
mpz_XXX(a,a,a).
+mpf_t numbers with exponents >2^53 on machines with 64-bit mp_exp_t: the
+ precision of __mp_bases[base].chars_per_bit_exactly is insufficient and
+ mpf_get_str aborts.  Detect and compensate.  Alternately, think
+ seriously about using some sort of fixed-point integer value.  Avoiding
+ unnecessary floating point is probably a good thing in general, and it
+ might be faster on some CPUs.
-mpz_get_si to work properly for MIPS N32 ABI (and other machines that
- use long long for storing limbs.)
mpz_*_ui division routines.  Perhaps make them return the
-real remainder instead?  Changes return type to signed long int.
+unspecified (zero).
mpf_eq
is not always correct, when one operand is
1000000000... and the other operand is 0111111111..., i.e., extremely
close. There is a special case in mpf_sub
for this
situation; put similar code in mpf_eq
.
-__builtin_constant_p is unavailable?  Same problem with MacOS X.
-dump_abort in mp?/tests/*.c.
-mpz_get_si returns 0x80000000 for -0x100000000.
+mpf_eq doesn't implement what gmp.texi specifies.  It should not use
+ just whole limbs, but partial limbs.
+mpf_set_str doesn't validate its exponent: for instance garbage
+ 123.456eX789X is accepted (and an exponent 0 used), and overflow of a
+ long is not detected.
+mpf_add
doesn't check for a carry from truncated portions of
+ the inputs, and in that respect doesn't implement the "infinite precision
+ followed by truncate" specified in the manual.
+mpf_div
of x/x doesn't always give 1, reported by Peter
+ Moulder. Perhaps it suffices to put +1 on the effective divisor prec, so
+ that data bits rather than zeros are shifted in when normalizing. Would
+ prefer to switch to mpn_tdiv_qr
, where all shifting should
+ disappear.
+mpz_add
etc, which doesn't work
+ when those routines are coming from a DLL (because they're effectively
+ function pointer global variables themselves). Need to rearrange perhaps
+ to a set of calls to a test function rather than iterating over an array.
+main
might be
+ clobbered by the longjmp
.
-count_trailing_zeros is used on more or less uniformly distributed
- numbers.  For some CPUs count_trailing_zeros is slow and it's probably
- worth handling the frequently occurring 0 to 2 trailing zeros cases
- specially.
-udiv_qrnnd for inverting limbs to instead use invert_limb.
+mpn_gcdext, mpz_get_d, mpf_get_str: Don't test count_leading_zeros for
+ zero, instead check the high bit of the operand and avoid invoking
+ count_leading_zeros.  This is an optimization on all machines, and
+ significant on machines with slow count_leading_zeros, though it's
+ possible an already normalized operand might not be encountered very
+ often.
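
A minimal sketch of the idea, assuming the usual gmp-impl.h / longlong.h
macros (GMP_LIMB_HIGHBIT, count_leading_zeros); the surrounding code is
not shown here:

    /* Skip count_leading_zeros when the operand is already normalized. */
    if (high_limb & GMP_LIMB_HIGHBIT)
      cnt = 0;                               /* high bit set, no count needed */
    else
      count_leading_zeros (cnt, high_limb);  /* otherwise count as usual */
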
umul_ppmm
to use floating-point for generating the
most significant limb (if BITS_PER_MP_LIMB
<= 52 bits).
(Peter Montgomery has some ideas on this subject.)
umul_ppmm
code in longlong.h: Add partial
products with fewer operations.
-mpn_get_str and mpn_set_str running in the sub O(n^2) range, using some
- divide-and-conquer approach, preferably without using division.
-mpn_get_str to mpf/get_str.  (Talk to Torbjörn about this.)
-mpz_size, mpz_set_ui, mpz_set_q, mpz_clear, mpz_init, mpz_get_ui,
- mpz_scan0, mpz_scan1, mpz_getlimbn, mpz_init_set_ui,
- mpz_perfect_square_p, mpz_popcount, mpf_size, mpf_get_prec,
- mpf_set_prec_raw, mpf_set_ui, mpf_init, mpf_init2, mpf_clear,
- mpf_set_si.
+mpz_set_ui
. This would be both small and
+ fast, especially for compile-time constants, but would make application
+ binaries depend on having 1 limb allocated to an mpz_t
,
+ preventing the "lazy" allocation scheme below.
+mpz_[cft]div_ui
and maybe
+ mpz_[cft]div_r_ui
. A __gmp_divide_by_zero
+ would be needed for the divide by zero test, unless that could be left to
+ mpn_mod_1
(not sure currently whether all the risc chips
+ provoke the right exception there if using mul-by-inverse).
+mpz_fits_s*_p
. The setups for
+ LONG_MAX
etc would need to go into gmp.h, and on Cray it
+ might, unfortunately, be necessary to forcibly include <limits.h>
+ since there's no apparent way to get SHRT_MAX
with an
+ expression (since short
and unsigned short
can
+ be different sizes).
mpz_powm
and mpz_powm_ui
aren't very
fast on one or two limb moduli, due to a lot of function call
overheads. These could perhaps be handled as special cases.
mpz_powm
and mpz_powm_ui
want better
algorithm selection, and the latter should use REDC. Both could
change to use an mpn_powm
and mpn_redc
.
+mpz_powm
REDC should do multiplications by g[]
+ using the division method when they're small, since the REDC form of a
+ small multiplier is normally a full size product. Probably would need a
+ new tuned parameter to say what size multiplier is "small", as a function
+ of the size of the modulus.
+mpz_powm
REDC should handle even moduli if possible. Maybe
+ this would mean for m=n*2^k doing mod n using REDC and an auxiliary
+ calculation mod 2^k, then putting them together at the end.
mpn_gcd
might be able to be sped up on small to
moderate sizes by improving find_a
, possibly just by
providing an alternate implementation for CPUs with slowish
count_leading_zeros
.
-USE_MORE_MPN code.  The necessary right-to-left mpn_divexact_by3c
- exists.
+USE_MORE_MPN could use a low to high cache localized evaluate and
+ interpolate.  The necessary mpn_divexact_by3c exists.
mpn_mul_basecase
on NxM with big N but small M could try for
better cache locality by taking N piece by piece. The current code could
be left available for CPUs without caching. Depending how karatsuba etc
is applied to unequal size operands it might be possible to assume M is
always smallish.
+mpn_perfect_square_p
on small operands might be better off
+ skipping the residue tests and just taking a square root.
+mpz_perfect_power_p
could be improved in a number of ways.
+ Test for Nth power residues modulo small primes like
+ mpn_perfect_square_p
does. Use p-adic arithmetic to find
+ possible roots. Divisibility by other primes should be tested by
+ grouping into a limb like PP
.
+mpz_perfect_power_p
might like to use mpn_gcd_1
+ instead of a private GCD routine. The use it's put to isn't
+ time-critical, and it might help ensure correctness to use the main GCD
+ routine.
+mpz_perfect_power_p
could use
+ mpz_divisible_ui_p
instead of mpz_tdiv_ui
for
+ divisibility testing, the former is faster on a number of systems. (But
+ all that prime test stuff is going to be rewritten some time.)
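
For illustration, the two call styles being compared, both from the
public mpz interface (sketch only):

    #include <gmp.h>

    /* newer predicate, often faster since no remainder value is formed */
    int divisible_7_new (mpz_srcptr n) { return mpz_divisible_ui_p (n, 7) != 0; }

    /* older style, divides and checks the returned remainder */
    int divisible_7_old (mpz_srcptr n) { return mpz_tdiv_ui (n, 7) == 0; }
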
+PP
/PP_INVERTED
into an array of such
+ pairs, listing several hundred primes. Perhaps actually make the
+ products larger than one limb each.
+PP
can have factors of 2 introduced in order to get the high
+ bit set and therefore a PP_INVERTED
existing. The factors
+ of 2 don't affect the way the remainder r = a % ((x*y*z)*2^n) is used,
+ further remainders r%x, r%y, etc, are the same since x, y, etc are odd.
+ The advantage of this is that mpn_preinv_mod_1
can then be
+ used if it's faster than plain mpn_mod_1
. This would be a
+ change only for 16-bit limbs, all the rest already have PP
+ in the right form.
+PP
could have extra factors of 3 or 5 or whatever introduced
+ if they fit, and final remainders mod 9 or 25 or whatever used, thereby
+ making more efficient use of the mpn_mod_1
done. On a
+ 16-bit limb it looks like PP
could take an extra factor of
+ 3.
+mpz_probab_prime_p
, mpn_perfect_square_p
and
+ mpz_perfect_power_p
could use mpn_mod_34lsub1
+ to take a remainder mod 2^24-1 or 2^48-1 and quickly get remainders mod
+ 3, 5, 7, 13 and 17 (factors of 2^24-1). This could either replace the
+ PP
division currently done, or allow PP
to do
+ larger primes, depending how many residue tests seem worthwhile before
+ launching into full root extractions or Miller-Rabin etc.
+mpz_probab_prime_p
(and maybe others) could code the
+ divisibility tests like n%7 == 0
in the form
+
+    #define MP_LIMB_DIVISIBLE_7_P(n) \
+      ((n) * MODLIMB_INVERSE_7 <= MP_LIMB_T_MAX/7)
+
+ This would help compilers which don't know how to optimize divisions by
+ constants, and would help current gcc (3.0) too since gcc forms a whole
+ remainder rather than using a modular inverse and comparing.  This
+ technique works for any odd modulus, and with some tweaks for even
+ moduli too.  See Granlund and Montgomery "Division By Invariant
+ Integers" section 9.
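
As a concrete check of the technique in ordinary C for a 32-bit limb
(the constant 0xB6DB6DB7 is the inverse of 7 mod 2^32; illustration
only, not GMP source):

    #include <stdint.h>

    /* n is divisible by 7 exactly when n * inverse(7) wraps to a value
       no bigger than floor ((2^32-1) / 7).  */
    int
    limb_divisible_7_p (uint32_t n)
    {
      return (uint32_t) (n * UINT32_C (0xB6DB6DB7)) <= UINT32_C (0xFFFFFFFF) / 7;
    }
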
mpz_probab_prime_p
and mpz_nextprime
could
+ offer certainty for primes up to 2^32 by using a one limb miller-rabin
+ test to base 2, combined with a table of actual strong pseudoprimes in
+ that range (2314 of them). If that table is too big then both base 2 and
+ base 3 tests could be done, leaving a table of 104. The test could use
+ REDC and therefore be a modlimb_invert
a remainder (maybe)
+ then two multiplies per bit (successively dependent). Processors with
+ pipelined multipliers could do base 2 and 3 in parallel. Vector systems
+ could do a whole bunch of bases in parallel, and perhaps offer near
+ certainty up to 64-bits (certainty might depend on an exhaustive search
+ of pseudoprimes up to that limit). Obviously 2^32 is not a big number,
+ but an efficient and certain calculation is attractive. It might find
+ other uses internally, and could even be offered as a one limb prime test
+ mpn_probab_prime_1_p
or gmp_probab_prime_ui_p
+ perhaps.
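
A plain-C sketch of the base 2 strong pseudoprime test being discussed,
using 64-bit arithmetic rather than the REDC form (names and structure
are illustrative, not GMP code); combined with a table of base-2 strong
pseudoprimes below 2^32 it would give a certain answer in that range:

    #include <stdint.h>

    /* Return non-zero if odd n > 2 is a strong probable prime to base 2. */
    static int
    strong_prp_base2 (uint32_t n)
    {
      uint32_t d = n - 1;
      int      s = 0;
      while ((d & 1) == 0)          /* write n-1 = d * 2^s with d odd */
        { d >>= 1; s++; }

      uint64_t x = 1, b = 2;        /* compute x = 2^d mod n */
      for (uint32_t e = d; e != 0; e >>= 1)
        {
          if (e & 1) x = x * b % n;
          b = b * b % n;
        }
      if (x == 1 || x == n - 1) return 1;
      for (int i = 1; i < s; i++)   /* square s-1 more times */
        {
          x = x * x % n;
          if (x == n - 1) return 1;
        }
      return 0;
    }
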
+mpz_probab_prime_p
doesn't need to make a copy of
+ n
when the input is negative, it can setup an
+ mpz_t
alias, same data pointer but a positive size. With no
+ need to clear before returning, the recursive function call could be
+ dispensed with too.
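
A rough sketch of the alias idea using the internal field macros from
gmp-impl.h (PTR, SIZ, ALLOC, ABS); whether such a read-only alias is
acceptable everywhere is an assumption:

    mpz_t  nabs;                  /* alias of n with a positive size */
    PTR (nabs)   = PTR (n);
    SIZ (nabs)   = ABS (SIZ (n));
    ALLOC (nabs) = ALLOC (n);
    /* ... use nabs read-only; no copy and no mpz_clear needed ... */
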
+mpf_set_str
produces low zero limbs when a string has a
+ fraction but is exactly representable, eg. 0.5 in decimal. These could be
+ stripped to save work in later operations.
+mpz_and
, mpz_ior
and mpz_xor
should
+ use mpn_and_n
etc for the benefit of the small number of
+ targets with native versions of those routines. Need to be careful not to
+ pass size==0. Is some code sharing possible between the mpz
+ routines?
+mpf_add
: Don't do a copy to avoid overlapping operands
+ unless it's really necessary (currently only sizes are tested, not
+ whether r really is u or v).
+mpf_add
: Under the check for v having no effect on the
+ result, perhaps test for r==u and do nothing in that case, rather than
+ currently it looks like an MPN_COPY_INCR
will be done to
+ reduce prec+1 limbs to prec.
+mpn_divrem_2
could usefully accept unnormalized divisors and
+ shift the dividend on-the-fly, since this should cost nothing on
+ superscalar processors and avoid the need for temporary copying in
+ mpn_tdiv_qr
.
+mpf_sqrt_ui
calculates prec+1 limbs, whereas just prec would
+ satisfy the application requested precision. It should suffice to simply
+ reduce the rsize temporary to 2*prec-1 limbs. mpf_sqrt
+ might be similar.
+invert_limb
generic C: The division could use dividend
+ b*(b-d)-1 which is high:low of (b-1-d):(b-1), instead of the current
+ (b-d):0, where b=2^BITS_PER_MP_LIMB
and d=divisor. The
+ former is per the original paper and is used in the x86 code, the
+ advantage is that the current special case for 0x80..00 could be dropped.
+ The two should be equivalent, but a little check of that would be wanted.
+mpq_cmp_ui
could form the num1*den2
and
+ num2*den1
products limb-by-limb from high to low and look at
+ each step for values differing by more than the possible carry bit from
+ the uncalculated portion.
+mpq_cmp
could do the same high-to-low progressive multiply
+ and compare. The benefits of karatsuba and higher multiplication
+ algorithms are lost, but if it's assumed only a few high limbs will be
+ needed to determine an order then that's fine.
+mpn_add_1
, mpn_sub_1
, mpn_add
,
+ mpn_sub
: Internally use __GMPN_ADD_1
etc
+ instead of the functions, so they get inlined on all compilers, not just
+ gcc and others with inline
recognised in gmp.h.
+ __GMPN_ADD_1
etc are meant mostly to support application
+ inline mpn_add_1
etc and if they don't come out good for
+ internal uses then special forms can be introduced, for instance many
+ internal uses are in-place. Sometimes a block of code is executed based
+ on the carry-out, rather than using it arithmetically, and those places
+ might want to do their own loops entirely.
+__gmp_extract_double
on 64-bit systems could use just one
+ bitfield for the mantissa extraction, not two, when endianness permits.
+ Might depend on the compiler allowing long long
bit fields
+ when that's the only actual 64-bit type.
+mpf_get_d
could be more like mpz_get_d
and do
+ more in integers and give the float conversion as such a chance to round
+ in its preferred direction. Some code sharing ought to be possible. Or
+ if nothing else then for consistency the two ought to give identical
+ results on integer operands (not clear if this is so right now).
+usqr_ppm
or some such could do a widening square in the
+ style of umul_ppmm
. This would help 68000, and be a small
+ improvement for the generic C (which is used on UltraSPARC/64 for
+ instance). GCC recognises the generic C ul*vh and vl*uh are identical,
+ but does two separate additions to the rest of the result.
+TMP_FREE
releases all memory, so
+ there's an allocate and free every time a top-level function using
+ TMP
is called. Would need
+ mp_set_memory_functions
to tell tal-notreent.c to release
+ any cached memory when changing allocation functions though.
+__gmp_tmp_alloc
from tal-notreent.c could be partially
+ inlined. If the current chunk has enough room then a couple of pointers
+ can be updated. Only if more space is required then a call to some sort
+ of __gmp_tmp_increase
would be needed. The requirement that
+ TMP_ALLOC
is an expression might make the implementation a
+ bit ugly and/or a bit sub-optimal.
+
+    #define TMP_ALLOC(n)                                          \
+      ((ROUND_UP(n) > current->end - current->point               \
+        ? __gmp_tmp_increase (ROUND_UP (n)) : 0),                 \
+       current->point += ROUND_UP (n),                            \
+       current->point - ROUND_UP (n))
+
__mp_bases
has a lot of data for bases which are pretty much
+ never used. Perhaps the table should just go up to base 16, and have
+ code to generate data above that, if and when required. Naturally this
+ assumes the code would be smaller than the data saved.
+__mp_bases
field big_base_inverted
is only used
+ if USE_PREINV_DIVREM_1
is true, and could be omitted
+ otherwise, to save space.
+mpf_get_str
and mpf_set_str
call the
+ corresponding, much faster, mpn functions.
+mpn_mod_1
could pre-calculate values of R mod N, R^2 mod N,
+ R^3 mod N, etc, with R=2^BITS_PER_MP_LIMB
, and use them to
+ process multiple limbs at each step by multiplying. Suggested by Peter
+ L. Montgomery.
+mpz_get_str
, mtox
: For power-of-2 bases, which
+ are of course fast, it seems a little silly to make a second pass over
+ the mpn_get_str
output to convert to ASCII. Perhaps combine
+ that with the bit extractions.
+mpz_gcdext
: If the caller requests only the S cofactor (of
+ A), and A<B, then the code ends up generating the cofactor T (of B) and
+ deriving S from that. Perhaps it'd be possible to arrange to get S in
+ the first place by calling mpn_gcdext
with A+B,B. This
+ might only be an advantage if A and B are about the same size.
+mpn_toom3_mul_n
, mpn_toom3_sqr_n
: Temporaries
+ B
and D
are adjacent in memory and at the final
+ coefficient additions look like they could use a single
+ mpn_add_n
of l4
limbs rather than two of
+ l2
limbs.
udiv_qrnnd_preinv2norm
, the branch-free version of
+ udiv_qrnnd_preinv
, might be faster on various pipelined
+ chips. In particular the first if (_xh != 0)
in
+ udiv_qrnnd_preinv
might be roughly a 50/50 chance and might
+ branch predict poorly. (The second test is probably almost always
+ false.) Measuring with the tuneup program would be possible, but perhaps
+ a bit messy. In any case maybe the default should be the branch-free
+ version.
+ udiv_qrnnd_preinv2norm
implementation
+ assumes a right shift will sign extend, which is not guaranteed by the C
+ standards, and doesn't happen on Cray vector systems.
-mpn_addmul_1, mpn_submul_1, and mpn_mul_1 for the 21264.  On 21264, they
- should run at 4, 3, and 3 cycles/limb respectively, if the code is
- unrolled properly.  (Ask Torbjörn for his xm.s and xam.s skeleton
- files.)
-mpn_addmul_1, mpn_submul_1, and mpn_mul_1.  Should use floating-point
- operations, and split the invariant single-limb multiplier into 21-bit
- chunks.  Should give about 18 cycles/limb, but the pipeline will become
- very deep.  (Torbjörn has C code that is useful as a starting point.)
-mpn_lshift and mpn_rshift.  Should give 2 cycles/limb.  (Torbjörn has
- code that just needs to be finished.)
-mpn_addmul_1 and the other multiplies varies so much on successive
- sizes.
+
+    #ifdef __GNUC__
+    #if __GNUC__ == 2 && __GNUC_MINOR__ == 7
+    ...
+    #endif
+    #if __GNUC__ == 2 && __GNUC_MINOR__ == 8
+    ...
+    #endif
+    #ifndef MUL_KARATSUBA_THRESHOLD
+    /* Default GNUC values */
+    ...
+    #endif
+    #else /* system compiler */
+    ...
+    #endif
+
+invert_limb on various processors might benefit from the little Newton
+ iteration done for alpha and ia64.
+mpn_mul_1, mpn_addmul_1, and mpn_submul_1.
+mpn_mul_1, mpn_addmul_1, and mpn_submul_1 for the 21164.  This should
+ use both integer multiplies and floating-point multiplies.  For the
+ floating-point operations, the single-limb multiplier should be split
+ into three 21-bit chunks, or perhaps even better in four 16-bit chunks.
+ Probably possible to reach 9 cycles/limb.
+ctlz and cttz for count_leading_zeros and count_trailing_zeros.  Use
+ inline for gcc, probably want asm files for elsewhere.
+umul_ppmm
to call
+ __umulsidi3
in libgcc. Could be copied straight across, but
+ perhaps ought to be tested.
+clz
instruction can be used for
+ count_leading_zeros
.
+mpn_divexact_by3
isn't particularly important, but
+ the generic C runs at about 27 c/l, whereas with the multiplies off the
+ dependent chain about 3 c/l ought to be possible.
+mpn_hamdist
could be put together based on the
+ current mpn_popcount
.
+popc_limb
in gmp-impl.h could use the
+ popcnt
insn.
+mpn_submul_1
is not implemented directly, only via
+ a combination of mpn_mul_1
and mpn_sub_n
.
+mpn_mul_1
, mpn_addmul_1
,
+ for s2 < 2^32 (or perhaps for any zero 16-bit s2 chunk). Not sure how
+ much this can improve the speed, though, since the symmetry that we rely
+ on is lost. Perhaps we can just gain cycles when s2 < 2^16, or more
+ accurately, when two 16-bit s2 chunks which are 16 bits apart are zero.
+mpn_submul_1
, analogous to
+ mpn_addmul_1
.
+umul_ppmm
. Using four
+ "mulx
"s either with an asm block or via the generic C code is
+ about 90 cycles. Try using fp operations, and also try using karatsuba
+ for just three "mulx
"s.
+mpn_divrem_1
, mpn_mod_1
,
+ mpn_divexact_1
and mpn_modexact_1_odd
could
+ process 32 bits at a time when the divisor fits 32-bits. This will need
+ only 4 mulx
's per limb instead of 8 in the general case.
+mpn_lshift
, mpn_rshift
.
+ Will give 2 cycles/limb. Trivial modifications of mpn/sparc64 should do.
+mulx
for umul_ppmm
if
+ possible (see commented out code in longlong.h). This is unlikely to
+ save more than a couple of cycles, so perhaps isn't worth bothering with.
+__sparc_v9__
+ or anything to indicate V9 support when -mcpu=v9 is selected. See
+ gcc/config/sol2-sld-64.h. Will need to pass something through from
+ ./configure to select the right code in longlong.h. (Currently nothing
+ is lost because mulx
for multiplying is commented out.)
+modlimb_invert
might save a few cycles from
+ masking down to just the useful bits at each point in the calculation,
+ since mulx
speed depends on the highest bit set. Either
+ explicit masks or small types like short
and
+ int
ought to work.
+mpn_popcount
and mpn_hamdist
+ could use popc
currently commented out in gmp-impl.h. This
+ chip reputedly implements popc
properly (see gcc sparc.md),
+ would need to recognise the chip as sparchalr1
or something
+ in configure / config.sub / config.guess.
-mpn_addmul_1, mpn_submul_1, and mpn_mul_1.  The current development code
- runs at 11 cycles/limb, which is already very good.  But it should be
- possible to saturate the cache, which will happen at 7.5 cycles/limb.
-umul_ppmm.  Important in particular for mpn_sqr_basecase.  Using four
- "mulx"s either with an asm block or via the generic C code is about 90
- cycles.
+mpn_addmul_1, mpn_submul_1, and mpn_mul_1.  The current code runs at 11
+ cycles/limb.  It should be possible to saturate the cache, which will
+ happen at 8 cycles/limb (7.5 for mpn_mul_1).  Write special loops for
+ s2 < 2^32; it should be possible to make them run at about 5
+ cycles/limb.
+mpn_addmul_1
, mpn_submul_1
, and
+ mpn_mul_1
. Use both integer and floating-point operations,
+ possibly two floating-point and one integer limb per loop. Split operands
+ into four 16-bit chunks for fast fp operations. Should easily reach 9
+ cycles/limb (using one int + one fp), but perhaps even 7 cycles/limb
+ (using one int + two fp).
+mpn_rshift
could do the same sort of unrolled loop
+ as mpn_lshift
. Some judicious use of m4 might let the two
+ share source code, or with a register to control the loop direction
+ perhaps even share object code.
+mpn_rshift
should do the same sort of unrolled
+ loop as mpn_lshift
.
mpn_mul_basecase
and mpn_sqr_basecase
for important machines. Helping the generic sqr_basecase.c with an
mpn_sqr_diagonal
might be enough for some of the RISCs.
mpn_lshift
/mpn_rshift
.
Will bring time from 1.75 to 1.25 cycles/limb.
-mpn_lshift for shifts by 1.  (See Pentium code.)
-count_leading_zeros.
-udiv_qrnnd.  (Ask Torbjörn for the file test-udiv-preinv.c as a
- starting point.)
-mpn_add_n and mpn_sub_n.  It should just require 3 cycles/limb, but the
- current code propagates carry poorly.  The trick is to add carry-in
- later than we do now, decreasing the number of operations used to
- generate carry-out from 4 to 3.
+mpn_lshift for shifts by 1.  (See Pentium code.)
+rep movs would upset GCC register allocation for the whole function.
+ Is this still true in GCC 3?  It uses rep movs itself for
+ __builtin_memcpy.  Examine the code for some simple and complex
+ functions to find out.  Inlining rep movs would be desirable, it'd be
+ both smaller and faster.
+mpn_lshift
and mpn_rshift
can come
+ down from 6.0 c/l to 5.5 or 5.375 by paying attention to pairing after
+ shrdl
and shldl
, see mpn/x86/pentium/README.
+mpn_lshift
and mpn_rshift
+ might benefit from some destination prefetching.
+mpn_divrem_1
might be able to use a
+ mul-by-inverse, hoping for maybe 30 c/l.
+mpn_add_n
and mpn_sub_n
should be able to go
+ faster than the generic x86 code at 3.5 c/l. The athlon code for instance
+ runs at about 2.7.
+mpn_lshift
and mpn_rshift
might be able to
+ do something branch-free for unaligned startups, and shaving one insn
+ from the loop with alternative indexing might save a cycle.
mpn_lshift
.
- The pipeline is now extremely deep, perhaps unnecessarily deep. Also, r5
- is unused. (Ask Torbjörn for a copy of the current code.)
+ The pipeline is now extremely deep, perhaps unnecessarily deep.
mpn_rshift
based on new mpn_lshift
.
mpn_add_n
and mpn_sub_n
. Should
- run at just 3.25 cycles/limb. (Ask for xxx-add_n.s as a starting point.)
+ run at just 3.25 cycles/limb.
mpn_mul_basecase
and
mpn_sqr_basecase
. This should use a "vertical multiplication
method", to avoid carry propagation. splitting one of the operands in
11-bit chunks.
-mpn_mul_basecase
and
- mpn_sqr_basecase
. Same comment applies to this as to the
- same functions for Fujitsu VPP.
+mpn_lshift
by 31 should use the special rshift
+ by 1 code, and vice versa mpn_rshift
by 31 should use the
+ special lshift by 1. This would be best as a jump across to the other
+ routine, could let both live in lshift.asm and omit rshift.asm on finding
+ mpn_rshift
already provided.
+mpn_com_n
and mpn_and_n
etc very probably
+ wants a pragma like MPN_COPY_INCR
.
+mpn_lshift
, mpn_rshift
,
+ mpn_popcount
and mpn_hamdist
are nice and small
+ and could be inlined to avoid function calls.
+TMP_ALLOC
to use them, or introduce a new scheme. Memory
+ blocks wanted unconditionally are easy enough, those wanted only
+ sometimes are a problem. Perhaps a special size calculation to ask for a
+ dummy length 1 when unwanted, or perhaps an inlined subroutine
+ duplicating code under each conditional. Don't really want to turn
+ everything into a dog's dinner just because Cray don't offer an
+ alloca
.
+mpn_get_str
on power-of-2 bases ought to vectorize.
+ Does it? bits_per_digit
and the inner loop over bits in a
+ limb might prevent it. Perhaps special cases for binary, octal and hex
+ would be worthwhile (very possibly for all processors too).
+popc_limb
could use the Cray _popc
+ intrinsic. That would help mpz_hamdist
and might make the
+ generic C versions of mpn_popcount
and
+ mpn_hamdist
suffice for Cray (if it vectorizes, or can be
+ given a hint to do so).
+mpn_mul_1
, mpn_addmul_1
,
+ mpn_submul_1
: Check for a 16-bit multiplier and use two
+ multiplies per limb, not four.
+mpn_lshift
and mpn_rshift
could use a
+ roll
and mask instead of lsrl
and
+ lsll
. This promises to be a speedup, effectively trading a
+ 6+2*n shift for one or two 4 cycle masks. Suggested by Jean-Charles
+ Meyrignac.
count_leading_zeros
for 64-bit machines:
-
-    if ((x >> W_TYPE_SIZE-W_TYPE_SIZE/2) == 0) { x <<= W_TYPE_SIZE/2; cnt += W_TYPE_SIZE/2; }
-    if ((x >> W_TYPE_SIZE-W_TYPE_SIZE/4) == 0) { x <<= W_TYPE_SIZE/4; cnt += W_TYPE_SIZE/4; }
-    ...
+
+    if ((x >> 32) == 0) { x <<= 32; cnt += 32; }
+    if ((x >> 48) == 0) { x <<= 16; cnt += 16; }
+    ...
+
__inline
which could perhaps
+ be used in __GMP_EXTERN_INLINE
. What would be the right way
+ to identify suitable versions of that compiler?
+double
floats are straightforward and
+ could perhaps be handled directly in __gmp_extract_double
+ and maybe in mpz_get_d
, rather than falling back on the
+ generic code. (Both formats are detected by configure
.)
+mpn_get_str
final divisions by the base with
+ udiv_qrnd_unnorm
could use some sort of multiply-by-inverse
+ on suitable machines. This ends up happening for decimal by presenting
+ the compiler with a run-time constant, but the same for other bases would
+ be good. Perhaps use could be made of the fact base<256.
+mpn_umul_ppmm
, mpn_udiv_qrnnd
: Return a
+ structure like div_t
to avoid going through memory, in
+ particular helping RISCs that don't do store-to-load forwarding. Clearly
+ this is only possible if the ABI returns a structure of two
+ mp_limb_t
s in registers.
mpz_get_nth_ui
. Return the nth word (not necessarily the nth limb).
+mp?_out_raw
and
+ mp?_inp_raw
.
+mpz_get_nth_ui
. Return the nth word (not necessarily the
+ nth limb).
mpz_crr
(Chinese Remainder Reconstruction).
mpq_set_f
for assignment from mpf_t
- (cf. mpq_set_d
).
-mpz_init
(and mpq_init
) do lazy
- allocation. Set ALLOC(var)
to 0, and have
- mpz_realloc
special-handle that case. Update functions that
- rely on a single limb (like mpz_set_ui
,
- mpz_[tfc]div_r_ui
, and others).
+mpz_init
and mpq_init
could do lazy allocation.
+ Set ALLOC(var)
to 0 to indicate nothing allocated, and let
+ _mpz_realloc
do the initial alloc. Set
+ z->_mp_d
to a dummy that mpz_get_ui
and
+ similar can unconditionally fetch from. Niels Möller has had a go at
+ this.
+ mpz_init
and then
+ more or less immediately reallocating.
+ mpz_init
would only store magic values in the
+ mpz_t
fields, and could be inlined.
+ mpz_t z = MPZ_INITIALIZER;
, which might be convenient
+ for globals.
+ mpz_set_ui
and other similar routines needn't check the
+ size allocated and can just store unconditionally.
+ mpz_set_ui
and perhaps others like
+ mpz_tdiv_r_ui
and a prospective
+ mpz_set_ull
could be inlined.
+ mpf_out_raw
and mpf_inp_raw
. Make sure
format is portable between 32-bit and 64-bit machines, and between
little-endian and big-endian machines.
-gmp_errno
.
-gmp_fprintf
, gmp_sprintf
, and
- gmp_snprintf
. Think about some sort of wrapper
- around printf
so it and its several variants don't
- have to be completely reimplemented.
-mpq
input and output functions.
-mpz_kronecker
, leave
- mpz_jacobi
for compatibility.
-mpz_set_str
etc taking string
- lengths rather than null-terminators.
+mpn_and_n
... mpn_copyd
: Perhaps make the mpn
+ logops and copys available in gmp.h, either as library functions or
+ inlines, with the availability of library functions instantiated in the
+ generated gmp.h at build time.
+mpz_set_str
etc variants taking string lengths rather than
+ null-terminators.
<=
" rather than "<
", so a threshold can
be set to MP_SIZE_T_MAX
to get only the simpler code (the
compiler will know size <= MP_SIZE_T_MAX
is always true).
-mpz_cdiv_q_2exp
and mpz_cdiv_r_2exp
- could be implemented to match the corresponding tdiv and fdiv.
- Maybe some code sharing is possible.
+ Alternately it looks like the ABOVE_THRESHOLD
and
+ BELOW_THRESHOLD
macros can do this adequately, and also pick
+ up cases where a threshold of zero should mean only the second algorithm.
+mpz_nthprime
.
+mpz_init2
, initializing and making initial room for
+ N bits. The actual size would be rounded up to a limb, and perhaps an
+ extra limb added since so many mpz
routines need that on
+ their destination.
+mpz_andn
, mpz_iorn
, mpz_nand
,
+ mpz_nior
, mpz_xnor
might be useful additions,
+ if they could share code with the current such functions (which should be
+ possible).
+mpz_and_ui
etc might be of use sometimes. Suggested by
+ Niels Möller.
+mpf_set_str
and mpf_inp_str
could usefully
+ accept 0x, 0b etc when base==0. Perhaps the exponent could default to
+ decimal in this case, with a further 0x, 0b etc allowed there.
+ Eg. 0xFFAA@0x5A. A leading "0" for octal would match the integers, but
+ probably something like "0.123" ought not mean octal.
+GMP_LONG_LONG_LIMB
or some such could become a documented
+ feature of gmp.h, so applications could know whether to
+ printf
a limb using %lu
or %Lu
.
+PRIdMP_LIMB
and similar defines following C99
+ <inttypes.h> might be of use to applications printing limbs.
+ Perhaps they should be defined only if specifically requested, the way
+ <inttypes.h> does. But if GMP_LONG_LONG_LIMB
or
+ whatever is added then perhaps this can easily enough be left to
+ applications.
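
Until such a macro is documented an application can only guess; a sketch
of the sort of thing GMP_LONG_LONG_LIMB (name assumed, not currently
provided) would make reliable:

    #include <stdio.h>
    #include <gmp.h>

    static void
    print_limb (mp_limb_t w)
    {
    #ifdef GMP_LONG_LONG_LIMB          /* hypothetical macro from gmp.h */
      printf ("%llu\n", (unsigned long long) w);
    #else
      printf ("%lu\n", (unsigned long) w);
    #endif
    }
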
+mpf_get_ld
and mpf_set_ld
converting
+ mpf_t
to and from long double
. Other
+ long double
routines would be desirable too, but these would
+ be a start. Often long double
is the same as
+ double
, which is easy but pretty pointless. Should
+ recognise the Intel 80-bit format on i386, and IEEE 128-bit quad on
+ sparc, hppa and power. Might like an ABI sub-option or something when
+ it's a compiler option for 64-bit or 128-bit long double
.
+gmp_printf
could accept %b
for binary output.
+ It'd be nice if it worked for plain int
etc too, not just
+ mpz_t
etc.
+gmp_printf
in fact could usefully accept an arbitrary base,
+ for both integer and float conversions. A base either in the format
+ string or as a parameter with *
should be allowed. Maybe
+ &13b
(b for base) or something like that.
+gmp_printf
could perhaps have a type code for an
+ mp_limb_t
. That would save an application from having to
+ worry whether it's a long
or a long long
.
+gmp_printf
could perhaps accept mpq_t
for float
+ conversions, eg. "%.4Qf"
. This would be merely for
+ convenience, but still might be useful. Rounding would be the same as
+ for an mpf_t
(ie. currently round-to-nearest, but not
+ actually documented). Alternately, perhaps a separate
+ mpq_get_str_point
or some such might be more use. Suggested
+ by Pedro Gimeno.
+gmp_printf
could usefully accept a flag to control the
+ rounding of float conversions. This wouldn't do much for
+ mpf_t
, but would be good if mpfr_t
was
+ supported in the future, or perhaps for mpq_t
. Something
+ like &*r
(r for rounding, and mpfr style
+ GMP_RND
parameter).
+mpz_combit to toggle a bit would be a good companion for mpz_setbit and
+ mpz_clrbit.  Suggested by Niels Möller (who has done some work towards
+ it).
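
Pending a real limb-level implementation, the semantics can be expressed
with the existing public calls (illustrative sketch only):

    #include <gmp.h>

    static void
    my_mpz_combit (mpz_ptr z, unsigned long bit)
    {
      if (mpz_tstbit (z, bit))
        mpz_clrbit (z, bit);       /* bit was 1, clear it */
      else
        mpz_setbit (z, bit);       /* bit was 0, set it */
    }
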
+mpz_scan0_reverse
or mpz_scan0low
or some such
+ searching towards the low end of an integer might match
+ mpz_scan0
nicely. Likewise for scan1
.
+ Suggested by Roberto Bagnara.
+mpz_bit_subset
or some such to test whether one integer is a
+ bitwise subset of another might be of use. Some sort of return value
+ indicating whether it's a proper or non-proper subset would be good and
+ wouldn't cost anything in the implementation. Suggested by Roberto
+ Bagnara.
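
For reference, the test itself is cheap to express with the current
functions, since a is a bitwise subset of b exactly when (a & b) == a; a
proper-subset indication would additionally compare a and b (sketch
only):

    #include <gmp.h>

    static int
    my_mpz_bit_subset (mpz_srcptr a, mpz_srcptr b)
    {
      mpz_t t;
      int   ret;
      mpz_init (t);
      mpz_and (t, a, b);              /* t = a & b */
      ret = (mpz_cmp (t, a) == 0);    /* subset iff nothing of a was lost */
      mpz_clear (t);
      return ret;
    }
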
+gmp_randinit_r
and maybe gmp_randstate_set
to
+ init-and-copy or to just copy a gmp_randstate_t
. Suggested
+ by Pedro Gimeno.
+mpf_get_ld
, mpf_set_ld
: Conversions between
+ mpf_t
and long double
, suggested by Dan
+ Christensen. There'd be some work to be done by configure
+ to recognise the format in use. xlc on aix for instance apparently has
+ an option for either plain double 64-bit or quad 128-bit precision. This
+ might mean library contents vary with the compiler used to build, which
+ is undesirable. It might be possible to detect the mode the application
+ is compiling with, and try to avoid mismatch problems.
+mpz_sqrt_if_perfect_square
: When
+ mpz_perfect_square_p
does its tests it calculates a square
+ root and then discards it. For some applications it might be useful to
+ return that root. Suggested by Jason Moxham.
+mpz_get_ull, mpz_set_ull, mpz_get_sll, mpz_set_sll: Conversions for
+ long long
. These would aid interoperability, though a
+ mixture of GMP and long long
would probably not be too
+ common. Disadvantages of using long long
in libgmp.a would
+ be
+ #ifdef
block to decide if the
+ application compiler could take the long long
+ prototypes.
+ LIBGMP_HAS_LONGLONG
would be wanted to
+ indicate whether the functions are available. (Applications using
+ autoconf could probe the library too.)
+ long long
to
+ application compile time, by having something like
+ mpz_set_2ui
called with two halves of a long
+ long
. Disadvantages of this would be,
+ long
+ long
is normally passed as two halves anyway.
+ mpz_get_ull
would be a rather big inline, or would have
+ to be two function calls.
+ mpz_get_sll
would be a worse inline, and would put the
+ treatment of -0x10..00
into applications (see
+ mpz_get_si
correctness above).
+ long long
is probably the lesser evil, if only
+ because it makes best use of gcc.
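
The "two halves" route is already possible from application code, which
is roughly what an inlined mpz_set_ull would amount to (sketch, assuming
a 64-bit long long and at least 32-bit unsigned long):

    #include <gmp.h>

    static void
    my_mpz_set_ull (mpz_ptr z, unsigned long long v)
    {
      mpz_set_ui (z, (unsigned long) (v >> 32));              /* high half */
      mpz_mul_2exp (z, z, 32);
      mpz_add_ui (z, z, (unsigned long) (v & 0xffffffffUL));  /* low half */
    }
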
+mpz_strtoz
parsing the same as strtol
.
+ Suggested by Alexander Kruppa.
GMP_C_DOUBLE_FORMAT
seems to work
+ well. Get rid of the #ifdef
mess in gmp-impl.h and use the
+ results of the test instead.
+umul_ppmm
in longlong.h always uses umull
,
+ but is that available only for M series chips or some such? Perhaps it
+ should be configured in some way.
+ia64-*-hpux*
. Does GMP need to know anything about that?
+AC_C_BIGENDIAN
seems the best way to handle that for GMP.
+*-*-aix*
. It might be more reliable to do some sort of
+ feature test, examining the compiler output perhaps. It might also be
+ nice to merge the aix.m4 files into powerpc-defs.m4.
+config.guess
recognises various exact sparcs, make
+ use of that information in configure
(work on this is in
+ progress).
+udiv
should be selected
+ according to the CPU target. Currently floating point ends up being
+ used on all sparcs, which is probably not right for generic V7 and V8.
+-xtarget=native
with cc
is
+ incorrect when cross-compiling, the target should be set according to the
+ configured $host
CPU.
+.balignw <n>,0x4e7f
to get
+ nops, if ALIGN
is going to be used for anything that's
+ executed across.
+umul
and udiv
code not being
+ used. Check all such for bit rot and then put umul and udiv in
+ $gmp_mpn_functions_optional
as "standard optional" objects.
+ mpn_umul_ppmm
and mpn_udiv_qrnnd
have a
+ different parameter order than those functions on other CPUs. It might
+ avoid confusion to have them under different names, maybe
+ mpn_umul_ppmm_r
or some such. Prototypes then wouldn't
+ be conditionalized, and the appropriate form could be selected with the
+ HAVE_NATIVE
scheme if/when the code switches to use a
+ PROLOGUE
style.
+DItype
: The setup in gmp-impl.h for non-GCC could use an
+ autoconf test to determine whether long long
is available.
+AC_OUTPUT
+ would work, but it might upset "make" to have things like L$
+ get into the Makefiles through AC_SUBST
.
+ AC_CONFIG_COMMANDS
would be the alternative. With some
+ careful m4 quoting the changequote
calls might not be
+ needed, which might free up the order in which things had to be output.
+make distclean
: Only the mpn directory links which were
+ created are removed, but perhaps all possible links should be removed, in
+ case someone runs configure a second time without a
+ distclean
in between. The only tricky part would be making
+ sure all possible extra_functions
are covered.
+-mno-cygwin
. For --host=*-*-mingw32*
it might
+ be convenient to automatically use that option, if it works. Needs
+ someone with a dual cygwin/mingw setup to test.
+CCAS
, CCASFLAGS
+ scheme. Though we probably wouldn't be using its assembler support we
+ could try to use those variables in compatible ways.
+implver
and
- amask
assembly instructions are better, but that
- doesn't differentiate between ev5 and ev56.
-_gmp_rand
is not particularly fast on the linear
+ congruential algorithm and could stand various improvements.
+ gmp_randstate_t
(or
+ _mp_algdata
rather) to save some copying.
+ 2exp
modulus, to
+ avoid mpn_mul
calls. Perhaps the same for two limbs.
+ lc
code, to avoid a function call and
+ TMP_ALLOC
for every chunk.
+ seedn==0
will be very rarely used,
+ and on that basis seems unnecessary.
+ 2exp
and general LC cases should be split,
+ for clarity (if the general case is retained).
+ gmp_randinit_mers
for a Mersenne Twister generator. It's
+ likely to be more random and about the same speed as Knuth's 55-element
+ Fibonacci generator, and can probably become the default. Pedro Gimeno
+ has started on this.
+gmp_randinit_lc
: Finish or remove. Doing a division for
+ every every step won't be very fast, so check whether the usefulness of
+ this algorithm can be justified. (Consensus is that it's not useful and
+ can be removed.)
+gmp_randinit_bbs
would be wanted, not the currently
+ commented out case in gmp_randinit
.
+_gmp_rand
could be done as a function pointer within
+ gmp_randstate_t
(or rather in the _mp_algdata
+ part), instead of switching on a gmp_randalg_t
. Likewise
+ gmp_randclear
, and perhaps gmp_randseed
if it
+ became algorithm-specific. This would be more modular, and would ensure
+ only code for the desired algorithms is dragged into the link.
+mpz_urandomm
should do something for n<=0, but what?
+mpz_urandomm
implementation looks like it could be improved.
+ Perhaps it's enough to calculate nbits
as ceil(log2(n)) and
+ call _gmp_rand
until a value <n
is obtained.
+gmp_randstate_t
used for parameters perhaps should become
+ gmp_randstate_ptr
the same as other types.
-In general, getting the exact right configuration, passing the exact
- right options to the compiler, etc, might mean that the GMP performance
- more than doubles.
-When testing, make sure to test at least the following for all our
- target machines: (1) Both gcc and cc (and c89).  (2) Both 32-bit mode
- and 64-bit mode (such as -n32 vs -64 under Irix).  (3) Both the system
- `make' and GNU `make'.  (4) With and without GNU binutils.
mpz_div
and mpz_divmod
use rounding
analogous to mpz_mod
. Document, and list as an
incompatibility.
-mcount
call at
- the start of each assembler subroutine (for important targets at
- least).
+mpz_gcdext
and mpn_gcdext
ought to document
+ what range of values the generated cofactors can take, and preferably
+ ensure the definition uniquely specifies the cofactors for given inputs.
+ A basic extended Euclidean algorithm or multi-step variant leads to
+ |x|<|b| and |y|<|a| or something like that, but there's probably
+ two solutions under just those restrictions.
+mpz_invert
should call mpn_gcdext
directly.
+mpz_divisible_ui_p
rather than
+ mpz_tdiv_qr_ui
. (Of course dividing multiple primes at a
+ time would be better still.)
+libgmp
. This establishes good cross-checks, but it might be
+ better to use simple reference routines where possible. Where it's not
+ possible some attention could be paid to the order of the tests, so a
+ libgmp
routine is only used for tests once it seems to be
+ good.
+mpf_set_q
is very similar to mpf_div
, it'd be
+ good for the two to share code. Perhaps mpf_set_q
should
+ make some mpf_t
aliases for its numerator and denominator
+ and just call mpf_div
. Both would be simplified a good deal
+ by switching to mpn_tdiv_qr
perhaps making them small enough
+ not to bother with sharing (especially since mpf_set_q
+ wouldn't need to watch out for overlaps).
+mftb
and
+ mftbu
) could be used for the speed and tune programs. Would
+ need to know its frequency of course. Usually it's 1/4 of bus speed
+ (eg. 25 MHz) but some chips drive it from an external input. Probably
+ have to measure to be sure.
+MUL_FFT_THRESHOLD
etc: the FFT thresholds should allow a
+ return to a previous k at certain sizes. This arises basically due to
+ the step effect caused by size multiples effectively used for each k.
+ Looking at a graph makes it fairly clear.
+__gmp_doprnt_mpf
does a rather unattractive round-to-nearest
+ on the string returned by mpf_get_str
. Perhaps some variant
+ of mpf_get_str
could be made which would better suit.
malloc
- separately for each TMP_ALLOC
block, so a redzoning
- malloc debugger could be used during development.
-ASSERT
s at the start of each user-visible
- mpz/mpq/mpf function to check the validity of each
- mp?_t
parameter, in particular to check they've been
- mp?_init
ed. This might catch elementary mistakes in
- user programs. Care would need to be taken over
- MPZ_TMP_INIT
ed variables used internally.
+ASSERT
s at the start of each user-visible mpz/mpq/mpf
+ function to check the validity of each mp?_t
parameter, in
+ particular to check they've been mp?_init
ed. This might
+ catch elementary mistakes in user programs. Care would need to be taken
+ over MPZ_TMP_INIT
ed variables used internally. If nothing
+ else then consistency checks like size<=alloc, ptr not
+ NULL
and ptr+size not wrapping around the address space,
+ would be possible. A more sophisticated scheme could track
+ _mp_d
pointers and ensure only a valid one is used. Such a
+ scheme probably wouldn't be reentrant, not without some help from the
+ system.
+getrusage
and gettimeofday
are reliable.
+ Currently we pretend in configure that the dodgy m68k netbsd 1.4.1
+ getrusage
doesn't exist. If a test might take a long time
+ to run then perhaps cache the result in a file somewhere.
mpz_inp_str (etc) doesn't say when it stops reading digits.
mpn_umul_ppmm
, and the corresponding umul.asm file could be
+ included in libgmp only in that case, the same as is effectively done for
+ __clz_tab
. Likewise udiv.asm and perhaps cntlz.asm. This
+ would only be a very small space saving, so perhaps not worth the
+ complexity.
+mpz_get_si
returns 0x80000000 for -0x100000000, whereas it's
+ sort of supposed to return the low 31 (or 63) bits. But this is
+ undocumented, and perhaps not too important.
+mpz_*_ui
division routines currently return abs(a%b).
+ Perhaps make them return the real remainder instead? Return type would
+ be signed long int
. But this would be an incompatible
+ change, so it might have to be under newly named functions.
+mpz_init_set*
and mpz_realloc
could allocate
+ say an extra 16 limbs over what's needed, so as to reduce the chance of
+ having to do a reallocate if the mpz_t
grows a bit more.
+ This could only be an option, since it'd badly bloat memory usage in
+ applications using many small values.
+mpq
functions could perhaps check for numerator or
+ denominator equal to 1, on the assumption that integers or
+ denominator-only values might be expected to occur reasonably often.
+count_trailing_zeros
is used on more or less uniformly
+ distributed numbers in a couple of places. For some CPUs
+ count_trailing_zeros
is slow and it's probably worth handling
+ the frequently occurring 0 to 2 trailing zeros cases specially.
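
The special casing would be along these lines (sketch only, using the
longlong.h count_trailing_zeros macro):

    /* Handle the common 0, 1 or 2 trailing zeros without the full count. */
    if (x & 1)
      count = 0;
    else if (x & 2)
      count = 1;
    else if (x & 4)
      count = 2;
    else
      count_trailing_zeros (count, x);
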
+mpf_t
might like to let the exponent be undefined when
+ size==0, instead of requiring it 0 as now. It should be possible to do
+ size==0 tests before paying attention to the exponent. The advantage is
+ not needing to set exp in the various places a zero result can arise,
+ which avoids some tedium but is otherwise perhaps not too important.
+ Currently mpz_set_f
and mpf_cmp_ui
depend on
+ exp==0, maybe elsewhere too.
+__gmp_allocate_func
: Could use GCC __attribute__
+ ((malloc))
on this, though don't know if it'd do much. GCC 3.0
+ allows that attribute on functions, but not function pointers (see info
+ node "Attribute Syntax"), so would need a new autoconf test. This can
+ wait until there's a GCC that supports it.
+mpz_add_ui
contains two __GMPN_COPY
s, one from
+ mpn_add_1
and one from mpn_sub_1
. If those two
+ routines were opened up a bit maybe that code could be shared. When a
+ copy needs to be done there's no carry to append for the add, and if the
+ copy is non-empty no high zero for the sub. mpz_add
+ when one operand is longer than the other, and to mpz_com
+ since it's just -(x+1).
+restrict
'ed pointers: Does the C99 definition of restrict
+ (one writer many readers, or whatever it is) suit the GMP style "same or
+ separate" function parameters? If so, judicious use might improve the
+ code generated a bit. Do any compilers have their own flavour of
+ restrict as "completely unaliased", and is that still usable?
+ABI
+ option, selecting _SHORT_LIMB
in gmp.h. Naturally a new set
+ of asm subroutines would be necessary. Would need new
+ mpz_set_ui
etc since the current code assumes limb>=long,
+ but 2-limb operand forms would find a use for long long
on
+ other processors too.
+BITS_PER_MP_LIMB
) mod
+ m", then for the input limbs x calculating an inner product "sum
+ p[i]*x[i]", and a final 3x1 limb remainder mod m. If those powers take
+ roughly N divide steps to calculate then there'd be an advantage any time
+ the same m is used three or more times. Suggested by Victor Shoup in
+ connection with chinese-remainder style decompositions, but perhaps with
+ other uses.
+