
Diff for /OpenXM_contrib/gmp/doc/Attic/tasks.html between version 1.1.1.1 and 1.1.1.2

version 1.1.1.1, 2000/09/09 14:12:20, to version 1.1.1.2, 2003/08/25 16:06:11
  </h1>
</center>
   
   <font size=-1>
   Copyright 2000, 2001, 2002 Free Software Foundation, Inc. <br><br>
   This file is part of the GNU MP Library. <br><br>
   The GNU MP Library is free software; you can redistribute it and/or modify
   it under the terms of the GNU Lesser General Public License as published
   by the Free Software Foundation; either version 2.1 of the License, or (at
   your option) any later version. <br><br>
   The GNU MP Library is distributed in the hope that it will be useful, but
   WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
   or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
   License for more details. <br><br>
   You should have received a copy of the GNU Lesser General Public License
   along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
   the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston,
   MA 02111-1307, USA.
   </font>
   
   <hr>
   <!-- NB. timestamp updated automatically by emacs -->
<comment>
  This file is current as of 20 May 2002.  An up-to-date version is available at
  <a href="http://www.swox.com/gmp/tasks.html">http://www.swox.com/gmp/tasks.html</a>.
  Please send comments about this page to
  <a href="mailto:bug-gmp@gnu.org">bug-gmp@gnu.org</a>.
</comment>

<p> These are itemized GMP development tasks.  Not all the tasks
    listed here are suitable for volunteers, but many of them are.
    Please see the <a href="projects.html">projects file</a> for more
    sizeable projects.
   
<h4>Correctness and Completeness</h4>
<ul>
  <li> The various reuse.c tests need to force reallocation by calling
       <code>_mpz_realloc</code> with a small (1 limb) size.
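       A minimal sketch of the idea (the size argument is in limbs, and the
       operands are assumed already initialized):
  <pre>
  _mpz_realloc (a, 1);   /* shrink the destination to one limb */
  mpz_mul (a, b, c);     /* now forced to reallocate if b*c needs more */
  </pre>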
  <li> One reuse case is missing from mpX/tests/reuse.c:
       <code>mpz_XXX(a,a,a)</code>.
  <li> When printing <code>mpf_t</code> numbers with exponents &gt;2^53 on
       machines with 64-bit <code>mp_exp_t</code>, the precision of
       <code>__mp_bases[base].chars_per_bit_exactly</code> is insufficient and
       <code>mpf_get_str</code> aborts.  Detect and compensate.  Alternately,
       think seriously about using some sort of fixed-point integer value.
       Avoiding unnecessary floating point is probably a good thing in general,
       and it might be faster on some CPUs.
  <li> Make the string reading functions allow the `0x' prefix when the base is
       explicitly 16.  They currently only allow that prefix when the base is
       unspecified (zero).
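       For illustration, the desired behaviour sketched with a small example
       (the second call currently fails, since `x' is not a hex digit):
  <pre>
  mpz_set_str (z, "0xff", 0);    /* accepted: base taken from the prefix */
  mpz_set_str (z, "0xff", 16);   /* should also be accepted, giving 255 */
  </pre>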
  <li> <code>mpf_eq</code> is not always correct, when one operand is
       1000000000... and the other operand is 0111111111..., i.e., extremely
       close.  There is a special case in <code>mpf_sub</code> for this
       situation; put similar code in <code>mpf_eq</code>.
  <li> <code>mpf_eq</code> doesn't implement what gmp.texi specifies.  It should
       not use just whole limbs, but partial limbs.
  <li> <code>mpf_set_str</code> doesn't validate its exponent, for instance
       garbage 123.456eX789X is accepted (and an exponent 0 used), and overflow
       of a <code>long</code> is not detected.
  <li> <code>mpf_add</code> doesn't check for a carry from truncated portions of
       the inputs, and in that respect doesn't implement the "infinite precision
       followed by truncate" specified in the manual.
  <li> <code>mpf_div</code> of x/x doesn't always give 1, reported by Peter
        Moulder.  Perhaps it suffices to put +1 on the effective divisor prec, so
        that data bits rather than zeros are shifted in when normalizing.  Would
        prefer to switch to <code>mpn_tdiv_qr</code>, where all shifting should
        disappear.
   <li> Windows DLLs: tests/mpz/reuse.c and tests/mpf/reuse.c initialize global
        variables with pointers to <code>mpz_add</code> etc, which doesn't work
        when those routines are coming from a DLL (because they're effectively
        function pointer global variables themselves).  Need to rearrange perhaps
        to a set of calls to a test function rather than iterating over an array.
   <li> demos/pexpr.c: The local variables in <code>main</code> might be
        clobbered by the <code>longjmp</code>.
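       The standard fix, sketched below: C guarantees the values of automatic
       variables modified between <code>setjmp</code> and <code>longjmp</code>
       only if they're <code>volatile</code>-qualified.
  <pre>
  #include &lt;setjmp.h&gt;
  #include &lt;stdio.h&gt;

  static jmp_buf buf;

  int
  main (void)
  {
    volatile int base = 10;   /* volatile: value preserved across longjmp */
    if (setjmp (buf) == 0)
      longjmp (buf, 1);
    printf ("%d\n", base);    /* guaranteed to print 10 */
    return 0;
  }
  </pre>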
</ul>
   
   
   
<h4>Machine Independent Optimization</h4>
<ul>
  <li> <code>mpn_gcdext</code>, <code>mpz_get_d</code>,
       <code>mpf_get_str</code>: Don't test <code>count_leading_zeros</code> for
       zero, instead check the high bit of the operand and avoid invoking
       <code>count_leading_zeros</code>.  This is an optimization on all
       machines, and significant on machines with slow
       <code>count_leading_zeros</code>, though it's possible an already
       normalized operand might not be encountered very often.
  <li> Rewrite <code>umul_ppmm</code> to use floating-point for generating the
       most significant limb (if <code>BITS_PER_MP_LIMB</code> &lt;= 52 bits).
       (Peter Montgomery has some ideas on this subject.)
  <li> Improve the default <code>umul_ppmm</code> code in longlong.h: Add
       partial products with fewer operations.
  <li> Consider inlining <code>mpz_set_ui</code>.  This would be both small and
       fast, especially for compile-time constants, but would make application
       binaries depend on having 1 limb allocated to an <code>mpz_t</code>,
       preventing the "lazy" allocation scheme below.
  <li> Consider inlining <code>mpz_[cft]div_ui</code> and maybe
       <code>mpz_[cft]div_r_ui</code>.  A <code>__gmp_divide_by_zero</code>
       would be needed for the divide by zero test, unless that could be left to
       <code>mpn_mod_1</code> (not sure currently whether all the risc chips
       provoke the right exception there if using mul-by-inverse).
  <li> Consider inlining: <code>mpz_fits_s*_p</code>.  The setups for
       <code>LONG_MAX</code> etc would need to go into gmp.h, and on Cray it
       might, unfortunately, be necessary to forcibly include &lt;limits.h&gt;
       since there's no apparent way to get <code>SHRT_MAX</code> with an
       expression (since <code>short</code> and <code>unsigned short</code> can
       be different sizes).
  <li> <code>mpz_powm</code> and <code>mpz_powm_ui</code> aren't very
       fast on one or two limb moduli, due to a lot of function call
       overheads.  These could perhaps be handled as special cases.
  <li> <code>mpz_powm</code> and <code>mpz_powm_ui</code> want better
       algorithm selection, and the latter should use REDC.  Both could
       change to use an <code>mpn_powm</code> and <code>mpn_redc</code>.
   <li> <code>mpz_powm</code> REDC should do multiplications by <code>g[]</code>
        using the division method when they're small, since the REDC form of a
        small multiplier is normally a full size product.  Probably would need a
        new tuned parameter to say what size multiplier is "small", as a function
        of the size of the modulus.
   <li> <code>mpz_powm</code> REDC should handle even moduli if possible.  Maybe
        this would mean for m=n*2^k doing mod n using REDC and an auxiliary
        calculation mod 2^k, then putting them together at the end.
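       A sketch of the recombination step, everything shown as single limbs
       purely for illustration (ninv is n^-1 mod 2^k, obtainable eg. from
       <code>modlimb_invert</code>, and k &lt; <code>BITS_PER_MP_LIMB</code>
       is assumed):
  <pre>
  /* given r1 = x mod n (n odd) and r2 = x mod 2^k, return x mod n*2^k */
  mp_limb_t
  crt_combine (mp_limb_t r1, mp_limb_t r2, mp_limb_t n, mp_limb_t ninv, int k)
  {
    mp_limb_t mask = ((mp_limb_t) 1 &lt;&lt; k) - 1;
    mp_limb_t t = ((r2 - r1) * ninv) & mask;   /* (r2-r1)/n mod 2^k */
    return r1 + n * t;    /* == r1 mod n, == r2 mod 2^k, and &lt; n*2^k */
  }
  </pre>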
  <li> <code>mpn_gcd</code> might be able to be sped up on small to
       moderate sizes by improving <code>find_a</code>, possibly just by
       providing an alternate implementation for CPUs with slowish
       <code>count_leading_zeros</code>.
  <li> Toom3 <code>USE_MORE_MPN</code> could use a low to high cache localized
       evaluate and interpolate.  The necessary <code>mpn_divexact_by3c</code>
       exists.
  <li> <code>mpn_mul_basecase</code> on NxM with big N but small M could try for
       better cache locality by taking N piece by piece.  The current code could
       be left available for CPUs without caching.  Depending on how karatsuba
       etc is applied to unequal size operands it might be possible to assume M
       is always smallish.
   <li> <code>mpn_perfect_square_p</code> on small operands might be better off
        skipping the residue tests and just taking a square root.
   <li> <code>mpz_perfect_power_p</code> could be improved in a number of ways.
        Test for Nth power residues modulo small primes like
        <code>mpn_perfect_square_p</code> does.  Use p-adic arithmetic to find
        possible roots.  Divisibility by other primes should be tested by
        grouping into a limb like <code>PP</code>.
   <li> <code>mpz_perfect_power_p</code> might like to use <code>mpn_gcd_1</code>
        instead of a private GCD routine.  The use it's put to isn't
       time-critical, and it might help ensure correctness to use the main GCD
        routine.
   <li> <code>mpz_perfect_power_p</code> could use
        <code>mpz_divisible_ui_p</code> instead of <code>mpz_tdiv_ui</code> for
        divisibility testing, the former is faster on a number of systems.  (But
        all that prime test stuff is going to be rewritten some time.)
   <li> Change <code>PP</code>/<code>PP_INVERTED</code> into an array of such
        pairs, listing several hundred primes.  Perhaps actually make the
        products larger than one limb each.
   <li> <code>PP</code> can have factors of 2 introduced in order to get the high
        bit set and therefore a <code>PP_INVERTED</code> existing.  The factors
        of 2 don't affect the way the remainder r = a % ((x*y*z)*2^n) is used,
        further remainders r%x, r%y, etc, are the same since x, y, etc are odd.
        The advantage of this is that <code>mpn_preinv_mod_1</code> can then be
        used if it's faster than plain <code>mpn_mod_1</code>.  This would be a
        change only for 16-bit limbs, all the rest already have <code>PP</code>
        in the right form.
   <li> <code>PP</code> could have extra factors of 3 or 5 or whatever introduced
        if they fit, and final remainders mod 9 or 25 or whatever used, thereby
        making more efficient use of the <code>mpn_mod_1</code> done.  On a
        16-bit limb it looks like <code>PP</code> could take an extra factor of
        3.
   <li> <code>mpz_probab_prime_p</code>, <code>mpn_perfect_square_p</code> and
        <code>mpz_perfect_power_p</code> could use <code>mpn_mod_34lsub1</code>
        to take a remainder mod 2^24-1 or 2^48-1 and quickly get remainders mod
        3, 5, 7, 13 and 17 (factors of 2^24-1).  This could either replace the
        <code>PP</code> division currently done, or allow <code>PP</code> to do
        larger primes, depending how many residue tests seem worthwhile before
        launching into full root extractions or Miller-Rabin etc.
   <li> <code>mpz_probab_prime_p</code> (and maybe others) could code the
        divisibility tests like <code>n%7 == 0</code> in the form
   <pre>
   #define MP_LIMB_DIVISIBLE_7_P(n) \
     ((n) * MODLIMB_INVERSE_7 &lt;= MP_LIMB_T_MAX/7)
   </pre>
        This would help compilers which don't know how to optimize divisions by
        constants, and would help current gcc (3.0) too since gcc forms a whole
        remainder rather than using a modular inverse and comparing.  This
        technique works for any odd modulus, and with some tweaks for even moduli
        too.  See Granlund and Montgomery "Division By Invariant Integers"
        section 9.
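       The same identity works for any odd divisor d: with i = d^-1 mod 2^W
       (W = <code>BITS_PER_MP_LIMB</code>), d divides n exactly when n*i wraps
       into the range 0 to <code>MP_LIMB_T_MAX</code>/d.  A generic sketch:
  <pre>
  /* i must satisfy d*i == 1 (mod 2^W), d odd, eg. from modlimb_invert */
  #define MP_LIMB_DIVISIBLE_P(n, d, i) \
    ((mp_limb_t) ((n) * (i)) &lt;= MP_LIMB_T_MAX / (d))
  </pre>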
   <li> <code>mpz_probab_prime_p</code> and <code>mpz_nextprime</code> could
        offer certainty for primes up to 2^32 by using a one limb miller-rabin
        test to base 2, combined with a table of actual strong pseudoprimes in
        that range (2314 of them).  If that table is too big then both base 2 and
        base 3 tests could be done, leaving a table of 104.  The test could use
       REDC and therefore be a <code>modlimb_invert</code>, a remainder (maybe),
        then two multiplies per bit (successively dependent).  Processors with
        pipelined multipliers could do base 2 and 3 in parallel.  Vector systems
        could do a whole bunch of bases in parallel, and perhaps offer near
        certainty up to 64-bits (certainty might depend on an exhaustive search
        of pseudoprimes up to that limit).  Obviously 2^32 is not a big number,
        but an efficient and certain calculation is attractive.  It might find
        other uses internally, and could even be offered as a one limb prime test
        <code>mpn_probab_prime_1_p</code> or <code>gmp_probab_prime_ui_p</code>
        perhaps.
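       A plain (non-REDC) version of the core test might look like the
       following sketch, where <code>mulmod</code> is an assumed helper
       computing (a*b) mod n on single limbs, eg. with
       <code>umul_ppmm</code> and <code>udiv_qrnnd</code>:
  <pre>
  /* strong pseudoprime test to base 2, for odd n &gt; 2 */
  int
  strong_prp_2 (mp_limb_t n)
  {
    mp_limb_t q = n - 1, x = 1, b = 2, e;
    int k = 0, i;
    while ((q & 1) == 0)              /* write n-1 = q * 2^k, q odd */
      q &gt;&gt;= 1, k++;
    for (e = q; e != 0; e &gt;&gt;= 1)   /* x = 2^q mod n, binary powering */
      {
        if (e & 1)
          x = mulmod (x, b, n);
        b = mulmod (b, b, n);
      }
    if (x == 1 || x == n - 1)
      return 1;
    for (i = 1; i &lt; k; i++)           /* square k-1 times, look for -1 */
      {
        x = mulmod (x, x, n);
        if (x == n - 1)
          return 1;
      }
    return 0;
  }
  </pre>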
   <li> <code>mpz_probab_prime_p</code> doesn't need to make a copy of
        <code>n</code> when the input is negative, it can setup an
        <code>mpz_t</code> alias, same data pointer but a positive size.  With no
        need to clear before returning, the recursive function call could be
        dispensed with too.
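       A sketch with gmp-impl.h macros (no copy and no clear needed; the alias
       must not be passed to anything that might reallocate it):
  <pre>
  mpz_t nabs;               /* alias sharing n's limb data */
  PTR (nabs) = PTR (n);
  SIZ (nabs) = ABSIZ (n);   /* same limbs, positive size */
  ALLOC (nabs) = ALLOC (n);
  </pre>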
   <li> <code>mpf_set_str</code> produces low zero limbs when a string has a
        fraction but is exactly representable, eg. 0.5 in decimal.  These could be
        stripped to save work in later operations.
   <li> <code>mpz_and</code>, <code>mpz_ior</code> and <code>mpz_xor</code> should
        use <code>mpn_and_n</code> etc for the benefit of the small number of
        targets with native versions of those routines.  Need to be careful not to
        pass size==0.  Is some code sharing possible between the <code>mpz</code>
        routines?
   <li> <code>mpf_add</code>: Don't do a copy to avoid overlapping operands
        unless it's really necessary (currently only sizes are tested, not
        whether r really is u or v).
  <li> <code>mpf_add</code>: Under the check for v having no effect on the
       result, perhaps test for r==u and do nothing in that case, rather than
       the <code>MPN_COPY_INCR</code> currently done to reduce prec+1 limbs to
       prec.
   <li> <code>mpn_divrem_2</code> could usefully accept unnormalized divisors and
        shift the dividend on-the-fly, since this should cost nothing on
        superscalar processors and avoid the need for temporary copying in
        <code>mpn_tdiv_qr</code>.
   <li> <code>mpf_sqrt_ui</code> calculates prec+1 limbs, whereas just prec would
        satisfy the application requested precision.  It should suffice to simply
        reduce the rsize temporary to 2*prec-1 limbs.  <code>mpf_sqrt</code>
        might be similar.
   <li> <code>invert_limb</code> generic C: The division could use dividend
        b*(b-d)-1 which is high:low of (b-1-d):(b-1), instead of the current
        (b-d):0, where b=2^<code>BITS_PER_MP_LIMB</code> and d=divisor.  The
        former is per the original paper and is used in the x86 code, the
        advantage is that the current special case for 0x80..00 could be dropped.
        The two should be equivalent, but a little check of that would be wanted.
   <li> <code>mpq_cmp_ui</code> could form the <code>num1*den2</code> and
        <code>num2*den1</code> products limb-by-limb from high to low and look at
        each step for values differing by more than the possible carry bit from
        the uncalculated portion.
   <li> <code>mpq_cmp</code> could do the same high-to-low progressive multiply
        and compare.  The benefits of karatsuba and higher multiplication
        algorithms are lost, but if it's assumed only a few high limbs will be
        needed to determine an order then that's fine.
   <li> <code>mpn_add_1</code>, <code>mpn_sub_1</code>, <code>mpn_add</code>,
        <code>mpn_sub</code>: Internally use <code>__GMPN_ADD_1</code> etc
        instead of the functions, so they get inlined on all compilers, not just
        gcc and others with <code>inline</code> recognised in gmp.h.
        <code>__GMPN_ADD_1</code> etc are meant mostly to support application
        inline <code>mpn_add_1</code> etc and if they don't come out good for
        internal uses then special forms can be introduced, for instance many
        internal uses are in-place.  Sometimes a block of code is executed based
        on the carry-out, rather than using it arithmetically, and those places
        might want to do their own loops entirely.
   <li> <code>__gmp_extract_double</code> on 64-bit systems could use just one
        bitfield for the mantissa extraction, not two, when endianness permits.
        Might depend on the compiler allowing <code>long long</code> bit fields
        when that's the only actual 64-bit type.
   <li> <code>mpf_get_d</code> could be more like <code>mpz_get_d</code> and do
        more in integers and give the float conversion as such a chance to round
        in its preferred direction.  Some code sharing ought to be possible.  Or
        if nothing else then for consistency the two ought to give identical
        results on integer operands (not clear if this is so right now).
   <li> <code>usqr_ppm</code> or some such could do a widening square in the
        style of <code>umul_ppmm</code>.  This would help 68000, and be a small
        improvement for the generic C (which is used on UltraSPARC/64 for
        instance).  GCC recognises the generic C ul*vh and vl*uh are identical,
        but does two separate additions to the rest of the result.
   <li> tal-notreent.c could keep a block of memory permanently allocated.
        Currently the last nested <code>TMP_FREE</code> releases all memory, so
        there's an allocate and free every time a top-level function using
        <code>TMP</code> is called.  Would need
        <code>mp_set_memory_functions</code> to tell tal-notreent.c to release
        any cached memory when changing allocation functions though.
   <li> <code>__gmp_tmp_alloc</code> from tal-notreent.c could be partially
        inlined.  If the current chunk has enough room then a couple of pointers
        can be updated.  Only if more space is required then a call to some sort
        of <code>__gmp_tmp_increase</code> would be needed.  The requirement that
        <code>TMP_ALLOC</code> is an expression might make the implementation a
        bit ugly and/or a bit sub-optimal.
   <pre>
  #define TMP_ALLOC(n)                                       \
    ((ROUND_UP(n) &gt; current-&gt;end - current-&gt;point     \
        ? __gmp_tmp_increase (ROUND_UP (n)) : 0),            \
     current-&gt;point += ROUND_UP (n),                       \
     current-&gt;point - ROUND_UP (n))
   </pre>
   <li> <code>__mp_bases</code> has a lot of data for bases which are pretty much
        never used.  Perhaps the table should just go up to base 16, and have
        code to generate data above that, if and when required.  Naturally this
        assumes the code would be smaller than the data saved.
   <li> <code>__mp_bases</code> field <code>big_base_inverted</code> is only used
        if <code>USE_PREINV_DIVREM_1</code> is true, and could be omitted
        otherwise, to save space.
   <li> Make <code>mpf_get_str</code> and <code>mpf_set_str</code> call the
        corresponding, much faster, mpn functions.
   <li> <code>mpn_mod_1</code> could pre-calculate values of R mod N, R^2 mod N,
        R^3 mod N, etc, with R=2^<code>BITS_PER_MP_LIMB</code>, and use them to
        process multiple limbs at each step by multiplying.  Suggested by Peter
        L. Montgomery.
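       A sketch of the two-limbs-per-division form of this, with hypothetical
       precomputed r1 = R mod d and r2 = R^2 mod d; n is assumed even and d
       normalized, to keep the illustration short:
  <pre>
  mp_limb_t
  mod1_by_mul (mp_srcptr ap, mp_size_t n, mp_limb_t d,
               mp_limb_t r1, mp_limb_t r2)
  {
    mp_limb_t r = 0, q, h, l, h2, l2;
    mp_size_t i;
    for (i = n - 2; i &gt;= 0; i -= 2)
      {
        /* r*R^2 + ap[i+1]*R + ap[i] == r*r2 + ap[i+1]*r1 + ap[i] (mod d) */
        umul_ppmm (h, l, r, r2);
        umul_ppmm (h2, l2, ap[i + 1], r1);
        add_ssaaaa (h, l, h, l, h2, l2);
        add_ssaaaa (h, l, h, l, CNST_LIMB (0), ap[i]);
        while (h &gt;= d)     /* h*R == (h-d)*R mod d, restore h &lt; d */
          h -= d;
        udiv_qrnnd (q, r, h, l, d);   /* one division per two limbs */
      }
    return r;
  }
  </pre>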
   <li> <code>mpz_get_str</code>, <code>mtox</code>: For power-of-2 bases, which
        are of course fast, it seems a little silly to make a second pass over
        the <code>mpn_get_str</code> output to convert to ASCII.  Perhaps combine
        that with the bit extractions.
   <li> <code>mpz_gcdext</code>: If the caller requests only the S cofactor (of
        A), and A&lt;B, then the code ends up generating the cofactor T (of B) and
        deriving S from that.  Perhaps it'd be possible to arrange to get S in
        the first place by calling <code>mpn_gcdext</code> with A+B,B.  This
        might only be an advantage if A and B are about the same size.
   <li> <code>mpn_toom3_mul_n</code>, <code>mpn_toom3_sqr_n</code>: Temporaries
        <code>B</code> and <code>D</code> are adjacent in memory and at the final
        coefficient additions look like they could use a single
        <code>mpn_add_n</code> of <code>l4</code> limbs rather than two of
        <code>l2</code> limbs.
</ul>
   
   
<h4>Machine Dependent Optimization</h4>
<ul>
   <li> <code>udiv_qrnnd_preinv2norm</code>, the branch-free version of
        <code>udiv_qrnnd_preinv</code>, might be faster on various pipelined
        chips.  In particular the first <code>if (_xh != 0)</code> in
        <code>udiv_qrnnd_preinv</code> might be roughly a 50/50 chance and might
        branch predict poorly.  (The second test is probably almost always
        false.)  Measuring with the tuneup program would be possible, but perhaps
        a bit messy.  In any case maybe the default should be the branch-free
        version.
        <br>
        Note that the current <code>udiv_qrnnd_preinv2norm</code> implementation
        assumes a right shift will sign extend, which is not guaranteed by the C
        standards, and doesn't happen on Cray vector systems.
  <li> Run the `tune' utility for more compiler/CPU combinations.  We would like
       to have gmp-mparam.h files in practically every implementation specific
       mpn subdirectory, and repeat each *_THRESHOLD for gcc and the system
       compiler.  See the `tune' top-level directory for more information.
          <pre>
          #if defined (__GNUC__)
          #if __GNUC__ == 2 && __GNUC_MINOR__ == 7
          ...
          #endif
          #if __GNUC__ == 2 && __GNUC_MINOR__ == 8
          ...
          #endif
          #ifndef MUL_KARATSUBA_THRESHOLD
          /* Default GNUC values */
          ...
          #endif
          #else /* system compiler */
          ...
          #endif  </pre>
   <li> <code>invert_limb</code> on various processors might benefit from the
        little Newton iteration done for alpha and ia64.
   <li> Alpha 21264: Improve feed-in code for <code>mpn_mul_1</code>,
        <code>mpn_addmul_1</code>, and <code>mpn_submul_1</code>.
   <li> Alpha 21164: Rewrite <code>mpn_mul_1</code>, <code>mpn_addmul_1</code>,
        and <code>mpn_submul_1</code> for the 21164.  This should use both integer
       multiplies and floating-point multiplies.  For the floating-point
       operations, the single-limb multiplier should be split into three 21-bit
       chunks, or perhaps even better in four 16-bit chunks.  Probably possible
       to reach 9 cycles/limb.
  <li> Alpha 21264 ev67: Use <code>ctlz</code> and <code>cttz</code> for
       <code>count_leading_zeros</code> and <code>count_trailing_zeros</code>.
       Use inline for gcc, probably want asm files for elsewhere.
  <li> ARC: gcc longlong.h sets up <code>umul_ppmm</code> to call
       <code>__umulsidi3</code> in libgcc.  Could be copied straight across, but
       perhaps ought to be tested.
  <li> ARM: On v5 cpus see if the <code>clz</code> instruction can be used for
       <code>count_leading_zeros</code>.
  <li> Itanium: <code>mpn_divexact_by3</code> isn't particularly important, but
       the generic C runs at about 27 c/l, whereas with the multiplies off the
       dependent chain about 3 c/l ought to be possible.
   <li> Itanium: <code>mpn_hamdist</code> could be put together based on the
        current <code>mpn_popcount</code>.
   <li> Itanium: <code>popc_limb</code> in gmp-impl.h could use the
        <code>popcnt</code> insn.
   <li> Itanium: <code>mpn_submul_1</code> is not implemented directly, only via
        a combination of <code>mpn_mul_1</code> and <code>mpn_sub_n</code>.
   <li> UltraSPARC/64: Optimize <code>mpn_mul_1</code>, <code>mpn_addmul_1</code>,
        for s2 &lt; 2^32 (or perhaps for any zero 16-bit s2 chunk).  Not sure how
        much this can improve the speed, though, since the symmetry that we rely
        on is lost.  Perhaps we can just gain cycles when s2 &lt; 2^16, or more
        accurately, when two 16-bit s2 chunks which are 16 bits apart are zero.
   <li> UltraSPARC/64: Write native <code>mpn_submul_1</code>, analogous to
        <code>mpn_addmul_1</code>.
   <li> UltraSPARC/64: Write <code>umul_ppmm</code>.  Using four
        "<code>mulx</code>"s either with an asm block or via the generic C code is
        about 90 cycles.  Try using fp operations, and also try using karatsuba
        for just three "<code>mulx</code>"s.
   <li> UltraSPARC/64: <code>mpn_divrem_1</code>, <code>mpn_mod_1</code>,
        <code>mpn_divexact_1</code> and <code>mpn_modexact_1_odd</code> could
        process 32 bits at a time when the divisor fits 32-bits.  This will need
        only 4 <code>mulx</code>'s per limb instead of 8 in the general case.
   <li> UltraSPARC/32: Rewrite <code>mpn_lshift</code>, <code>mpn_rshift</code>.
        Will give 2 cycles/limb.  Trivial modifications of mpn/sparc64 should do.
   <li> UltraSPARC/32: Write special mpn_Xmul_1 loops for s2 &lt; 2^16.
   <li> UltraSPARC/32: Use <code>mulx</code> for <code>umul_ppmm</code> if
        possible (see commented out code in longlong.h).  This is unlikely to
        save more than a couple of cycles, so perhaps isn't worth bothering with.
   <li> UltraSPARC/32: On Solaris gcc doesn't give us <code>__sparc_v9__</code>
        or anything to indicate V9 support when -mcpu=v9 is selected.  See
        gcc/config/sol2-sld-64.h.  Will need to pass something through from
        ./configure to select the right code in longlong.h.  (Currently nothing
        is lost because <code>mulx</code> for multiplying is commented out.)
   <li> UltraSPARC: <code>modlimb_invert</code> might save a few cycles from
        masking down to just the useful bits at each point in the calculation,
        since <code>mulx</code> speed depends on the highest bit set.  Either
        explicit masks or small types like <code>short</code> and
        <code>int</code> ought to work.
   <li> Sparc64 HAL R1: <code>mpn_popcount</code> and <code>mpn_hamdist</code>
        could use <code>popc</code> currently commented out in gmp-impl.h.  This
        chip reputedly implements <code>popc</code> properly (see gcc sparc.md),
        would need to recognise the chip as <code>sparchalr1</code> or something
        in configure / config.sub / config.guess.
  <li> PA64: Improve <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
       <code>mpn_mul_1</code>.  The current code runs at 11 cycles/limb.  It
       should be possible to saturate the cache, which will happen at 8
       cycles/limb (7.5 for mpn_mul_1).  Write special loops for s2 &lt; 2^32;
       it should be possible to make them run at about 5 cycles/limb.
  <li> PPC630: Rewrite <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
       <code>mpn_mul_1</code>.  Use both integer and floating-point operations,
       possibly two floating-point and one integer limb per loop.  Split operands
       into four 16-bit chunks for fast fp operations.  Should easily reach 9
       cycles/limb (using one int + one fp), but perhaps even 7 cycles/limb
       (using one int + two fp).
   <li> PPC630: <code>mpn_rshift</code> could do the same sort of unrolled loop
        as <code>mpn_lshift</code>.  Some judicious use of m4 might let the two
        share source code, or with a register to control the loop direction
        perhaps even share object code.
   <li> PowerPC-32: <code>mpn_rshift</code> should do the same sort of unrolled
        loop as <code>mpn_lshift</code>.
  <li> Implement <code>mpn_mul_basecase</code> and <code>mpn_sqr_basecase</code>
       for important machines.  Helping the generic sqr_basecase.c with an
       <code>mpn_sqr_diagonal</code> might be enough for some of the RISCs.
  <li> POWER2/POWER2SC: Schedule <code>mpn_lshift</code>/<code>mpn_rshift</code>.
       Will bring time from 1.75 to 1.25 cycles/limb.
  <li> X86: Optimize non-MMX <code>mpn_lshift</code> for shifts by 1.  (See
       Pentium code.)
  <li> X86: Good authority has it that in the past an inline <code>rep
       movs</code> would upset GCC register allocation for the whole function.
       Is this still true in GCC 3?  It uses <code>rep movs</code> itself for
       <code>__builtin_memcpy</code>.  Examine the code for some simple and
       complex functions to find out.  Inlining <code>rep movs</code> would be
       desirable, it'd be both smaller and faster.
  <li> Pentium P54: <code>mpn_lshift</code> and <code>mpn_rshift</code> can come
       down from 6.0 c/l to 5.5 or 5.375 by paying attention to pairing after
       <code>shrdl</code> and <code>shldl</code>, see mpn/x86/pentium/README.
   <li> Pentium P55 MMX: <code>mpn_lshift</code> and <code>mpn_rshift</code>
        might benefit from some destination prefetching.
   <li> PentiumPro: <code>mpn_divrem_1</code> might be able to use a
        mul-by-inverse, hoping for maybe 30 c/l.
   <li> P6: <code>mpn_add_n</code> and <code>mpn_sub_n</code> should be able to go
        faster than the generic x86 code at 3.5 c/l.  The athlon code for instance
        runs at about 2.7.
   <li> K7: <code>mpn_lshift</code> and <code>mpn_rshift</code> might be able to
        do something branch-free for unaligned startups, and shaving one insn
        from the loop with alternative indexing might save a cycle.
  <li> PPC32: Try using fewer registers in the current <code>mpn_lshift</code>.
       The pipeline is now extremely deep, perhaps unnecessarily deep.
  <li> PPC32: Write <code>mpn_rshift</code> based on new <code>mpn_lshift</code>.
  <li> PPC32: Rewrite <code>mpn_add_n</code> and <code>mpn_sub_n</code>.  Should
       run at just 3.25 cycles/limb.
  <li> Fujitsu VPP: Vectorize main functions, perhaps in assembly language.
  <li> Fujitsu VPP: Write <code>mpn_mul_basecase</code> and
       <code>mpn_sqr_basecase</code>.  This should use a "vertical multiplication
       method", to avoid carry propagation, splitting one of the operands into
       11-bit chunks.
  <li> 68k, Pentium: <code>mpn_lshift</code> by 31 should use the special rshift
       by 1 code, and vice versa <code>mpn_rshift</code> by 31 should use the
       special lshift by 1.  This would be best as a jump across to the other
       routine, could let both live in lshift.asm and omit rshift.asm on finding
       <code>mpn_rshift</code> already provided.
   <li> Cray T3E: Experiment with optimization options.  In particular,
        -hpipeline3 seems promising.  We should at least up -O to -O2 or -O3.
   <li> Cray: <code>mpn_com_n</code> and <code>mpn_and_n</code> etc very probably
       want a pragma like <code>MPN_COPY_INCR</code>.
   <li> Cray vector systems: <code>mpn_lshift</code>, <code>mpn_rshift</code>,
        <code>mpn_popcount</code> and <code>mpn_hamdist</code> are nice and small
        and could be inlined to avoid function calls.
   <li> Cray: Variable length arrays seem to be faster than the tal-notreent.c
        scheme.  Not sure why, maybe they merely give the compiler more
        information about aliasing (or the lack thereof).  Would like to modify
        <code>TMP_ALLOC</code> to use them, or introduce a new scheme.  Memory
        blocks wanted unconditionally are easy enough, those wanted only
        sometimes are a problem.  Perhaps a special size calculation to ask for a
        dummy length 1 when unwanted, or perhaps an inlined subroutine
        duplicating code under each conditional.  Don't really want to turn
        everything into a dog's dinner just because Cray don't offer an
        <code>alloca</code>.
   <li> Cray: <code>mpn_get_str</code> on power-of-2 bases ought to vectorize.
        Does it?  <code>bits_per_digit</code> and the inner loop over bits in a
        limb might prevent it.  Perhaps special cases for binary, octal and hex
        would be worthwhile (very possibly for all processors too).
   <li> Cray: <code>popc_limb</code> could use the Cray <code>_popc</code>
        intrinsic.  That would help <code>mpz_hamdist</code> and might make the
        generic C versions of <code>mpn_popcount</code> and
        <code>mpn_hamdist</code> suffice for Cray (if it vectorizes, or can be
        given a hint to do so).
   <li> 68000: <code>mpn_mul_1</code>, <code>mpn_addmul_1</code>,
        <code>mpn_submul_1</code>: Check for a 16-bit multiplier and use two
        multiplies per limb, not four.
   <li> 68000: <code>mpn_lshift</code> and <code>mpn_rshift</code> could use a
        <code>roll</code> and mask instead of <code>lsrl</code> and
        <code>lsll</code>.  This promises to be a speedup, effectively trading a
        6+2*n shift for one or two 4 cycle masks.  Suggested by Jean-Charles
        Meyrignac.
  <li> Improve <code>count_leading_zeros</code> for 64-bit machines:
  <pre>
           if ((x &gt;&gt; 32) == 0) { x &lt;&lt;= 32; cnt += 32; }
           if ((x &gt;&gt; 48) == 0) { x &lt;&lt;= 16; cnt += 16; }
           ... </pre>
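       The remaining steps filled in, as a sketch for a 64-bit limb (assumes
       x is non-zero and cnt starts at 0):
  <pre>
           if ((x &gt;&gt; 32) == 0) { x &lt;&lt;= 32; cnt += 32; }
           if ((x &gt;&gt; 48) == 0) { x &lt;&lt;= 16; cnt += 16; }
           if ((x &gt;&gt; 56) == 0) { x &lt;&lt;= 8; cnt += 8; }
           if ((x &gt;&gt; 60) == 0) { x &lt;&lt;= 4; cnt += 4; }
           if ((x &gt;&gt; 62) == 0) { x &lt;&lt;= 2; cnt += 2; }
           if ((x &gt;&gt; 63) == 0) { cnt += 1; }
  </pre>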
   <li> IRIX 6 MIPSpro compiler has an <code>__inline</code> which could perhaps
        be used in <code>__GMP_EXTERN_INLINE</code>.  What would be the right way
        to identify suitable versions of that compiler?
   <li> VAX D and G format <code>double</code> floats are straightforward and
        could perhaps be handled directly in <code>__gmp_extract_double</code>
        and maybe in <code>mpz_get_d</code>, rather than falling back on the
        generic code.  (Both formats are detected by <code>configure</code>.)
   <li> <code>mpn_get_str</code> final divisions by the base with
        <code>udiv_qrnd_unnorm</code> could use some sort of multiply-by-inverse
        on suitable machines.  This ends up happening for decimal by presenting
        the compiler with a run-time constant, but the same for other bases would
        be good.  Perhaps use could be made of the fact base&lt;256.
   <li> <code>mpn_umul_ppmm</code>, <code>mpn_udiv_qrnnd</code>: Return a
        structure like <code>div_t</code> to avoid going through memory, in
        particular helping RISCs that don't do store-to-load forwarding.  Clearly
        this is only possible if the ABI returns a structure of two
        <code>mp_limb_t</code>s in registers.
</ul>
   
<h4>New Functionality</h4>
<ul>
  <li> Add in-memory versions of <code>mp?_out_raw</code> and
       <code>mp?_inp_raw</code>.
  <li> <code>mpz_get_nth_ui</code>.  Return the nth word (not necessarily the
       nth limb).
  <li> Maybe add <code>mpz_crr</code> (Chinese Remainder Reconstruction).
  <li> Let `0b' and `0B' mean binary input everywhere.
  <li> <code>mpz_init</code> and <code>mpq_init</code> could do lazy allocation.
       Set <code>ALLOC(var)</code> to 0 to indicate nothing allocated, and let
       <code>_mpz_realloc</code> do the initial alloc.  Set
       <code>z-&gt;_mp_d</code> to a dummy that <code>mpz_get_ui</code> and
       similar can unconditionally fetch from.  Niels Möller has had a go at
       this.
       <br>
        The advantages of the lazy scheme would be:
        <ul>
        <li> Initial allocate would be the size required for the first value
             stored, rather than getting 1 limb in <code>mpz_init</code> and then
             more or less immediately reallocating.
        <li> <code>mpz_init</code> would only store magic values in the
             <code>mpz_t</code> fields, and could be inlined.
        <li> A fixed initializer could even be used by applications, like
             <code>mpz_t z = MPZ_INITIALIZER;</code>, which might be convenient
             for globals.
        </ul>
        The advantages of the current scheme are:
        <ul>
        <li> <code>mpz_set_ui</code> and other similar routines needn't check the
             size allocated and can just store unconditionally.
        <li> <code>mpz_set_ui</code> and perhaps others like
             <code>mpz_tdiv_r_ui</code> and a prospective
             <code>mpz_set_ull</code> could be inlined.
        </ul>
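       A sketch of what the lazy <code>mpz_init</code> might reduce to (the
       dummy limb is hypothetical, named here only for illustration):
  <pre>
  static mp_limb_t dummy_limb = 0;    /* always readable as the value 0 */

  void
  mpz_init (mpz_ptr z)
  {
    ALLOC (z) = 0;           /* nothing allocated yet */
    SIZ (z) = 0;
    PTR (z) = &dummy_limb;   /* mpz_get_ui etc can fetch unconditionally */
  }
  </pre>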
  <li> Add <code>mpf_out_raw</code> and <code>mpf_inp_raw</code>.  Make sure
       format is portable between 32-bit and 64-bit machines, and between
       little-endian and big-endian machines.
  <li> <code>mpn_and_n</code> ... <code>mpn_copyd</code>: Perhaps make the mpn
       logops and copys available in gmp.h, either as library functions or
       inlines, with the availability of library functions instantiated in the
       generated gmp.h at build time.
  <li> <code>mpz_set_str</code> etc variants taking string lengths rather than
       null-terminators.
  <li> Consider changing the thresholds to apply the simpler algorithm when
       "<code>&lt;=</code>" rather than "<code>&lt;</code>", so a threshold can
       be set to <code>MP_SIZE_T_MAX</code> to get only the simpler code (the
       compiler will know <code>size &lt;= MP_SIZE_T_MAX</code> is always true).
       Alternately it looks like the <code>ABOVE_THRESHOLD</code> and
       <code>BELOW_THRESHOLD</code> macros can do this adequately, and also pick
       up cases where a threshold of zero should mean only the second algorithm.
   <li> <code>mpz_nthprime</code>.
   <li> Perhaps <code>mpz_init2</code>, initializing and making initial room for
        N bits.  The actual size would be rounded up to a limb, and perhaps an
        extra limb added since so many <code>mpz</code> routines need that on
        their destination.
   <li> <code>mpz_andn</code>, <code>mpz_iorn</code>, <code>mpz_nand</code>,
        <code>mpz_nior</code>, <code>mpz_xnor</code> might be useful additions,
        if they could share code with the current such functions (which should be
        possible).
   <li> <code>mpz_and_ui</code> etc might be of use sometimes.  Suggested by
        Niels Möller.
   <li> <code>mpf_set_str</code> and <code>mpf_inp_str</code> could usefully
        accept 0x, 0b etc when base==0.  Perhaps the exponent could default to
        decimal in this case, with a further 0x, 0b etc allowed there.
        Eg. 0xFFAA@0x5A.  A leading "0" for octal would match the integers, but
        probably something like "0.123" ought not mean octal.
   <li> <code>GMP_LONG_LONG_LIMB</code> or some such could become a documented
        feature of gmp.h, so applications could know whether to
        <code>printf</code> a limb using <code>%lu</code> or <code>%Lu</code>.
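       An application-side sketch, assuming such a macro were provided:
  <pre>
  #ifdef GMP_LONG_LONG_LIMB
    printf ("%llu\n", (unsigned long long) limb);
  #else
    printf ("%lu\n", (unsigned long) limb);
  #endif
  </pre>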
   <li> <code>PRIdMP_LIMB</code> and similar defines following C99
        &lt;inttypes.h&gt; might be of use to applications printing limbs.
        Perhaps they should be defined only if specifically requested, the way
        &lt;inttypes.h&gt; does.  But if <code>GMP_LONG_LONG_LIMB</code> or
        whatever is added then perhaps this can easily enough be left to
        applications.
   <li> <code>mpf_get_ld</code> and <code>mpf_set_ld</code> converting
        <code>mpf_t</code> to and from <code>long double</code>.  Other
        <code>long double</code> routines would be desirable too, but these would
        be a start.  Often <code>long double</code> is the same as
        <code>double</code>, which is easy but pretty pointless.  Should
        recognise the Intel 80-bit format on i386, and IEEE 128-bit quad on
        sparc, hppa and power.  Might like an ABI sub-option or something when
        it's a compiler option for 64-bit or 128-bit <code>long double</code>.
   <li> <code>gmp_printf</code> could accept <code>%b</code> for binary output.
        It'd be nice if it worked for plain <code>int</code> etc too, not just
        <code>mpz_t</code> etc.
   <li> <code>gmp_printf</code> in fact could usefully accept an arbitrary base,
        for both integer and float conversions.  A base either in the format
        string or as a parameter with <code>*</code> should be allowed.  Maybe
        <code>&amp;13b</code> (b for base) or something like that.
   <li> <code>gmp_printf</code> could perhaps have a type code for an
        <code>mp_limb_t</code>.  That would save an application from having to
        worry whether it's a <code>long</code> or a <code>long long</code>.
   <li> <code>gmp_printf</code> could perhaps accept <code>mpq_t</code> for float
        conversions, eg. <code>"%.4Qf"</code>.  This would be merely for
        convenience, but still might be useful.  Rounding would be the same as
        for an <code>mpf_t</code> (ie. currently round-to-nearest, but not
        actually documented).  Alternately, perhaps a separate
        <code>mpq_get_str_point</code> or some such might be more use.  Suggested
        by Pedro Gimeno.
   <li> <code>gmp_printf</code> could usefully accept a flag to control the
       rounding of float conversions.  This wouldn't do much for
        <code>mpf_t</code>, but would be good if <code>mpfr_t</code> was
        supported in the future, or perhaps for <code>mpq_t</code>.  Something
        like <code>&amp;*r</code> (r for rounding, and mpfr style
        <code>GMP_RND</code> parameter).
   <li> <code>mpz_combit</code> to toggle a bit would be a good companion for
        <code>mpz_setbit</code> and <code>mpz_clrbit</code>.  Suggested by Niels
       Möller (who has done some work towards it).
   <li> <code>mpz_scan0_reverse</code> or <code>mpz_scan0low</code> or some such
        searching towards the low end of an integer might match
        <code>mpz_scan0</code> nicely.  Likewise for <code>scan1</code>.
        Suggested by Roberto Bagnara.
   <li> <code>mpz_bit_subset</code> or some such to test whether one integer is a
        bitwise subset of another might be of use.  Some sort of return value
        indicating whether it's a proper or non-proper subset would be good and
        wouldn't cost anything in the implementation.  Suggested by Roberto
        Bagnara.
   <li> <code>gmp_randinit_r</code> and maybe <code>gmp_randstate_set</code> to
        init-and-copy or to just copy a <code>gmp_randstate_t</code>.  Suggested
        by Pedro Gimeno.
   <li> <code>mpf_get_ld</code>, <code>mpf_set_ld</code>: Conversions between
        <code>mpf_t</code> and <code>long double</code>, suggested by Dan
        Christensen.  There'd be some work to be done by <code>configure</code>
        to recognise the format in use.  xlc on aix for instance apparently has
        an option for either plain double 64-bit or quad 128-bit precision.  This
        might mean library contents vary with the compiler used to build, which
        is undesirable.  It might be possible to detect the mode the application
        is compiling with, and try to avoid mismatch problems.
   <li> <code>mpz_sqrt_if_perfect_square</code>: When
        <code>mpz_perfect_square_p</code> does its tests it calculates a square
        root and then discards it.  For some applications it might be useful to
        return that root.  Suggested by Jason Moxham.
   <li> <code>mpz_get_ull</code>, <code>mpz_set_ull</code>,
       <code>mpz_get_sll</code>, <code>mpz_set_sll</code>: Conversions for
        <code>long long</code>.  These would aid interoperability, though a
        mixture of GMP and <code>long long</code> would probably not be too
        common.  Disadvantages of using <code>long long</code> in libgmp.a would
        be
        <ul>
        <li> Library contents vary according to the build compiler.
        <li> gmp.h would need an ugly <code>#ifdef</code> block to decide if the
             application compiler could take the <code>long long</code>
             prototypes.
        <li> Some sort of <code>LIBGMP_HAS_LONGLONG</code> would be wanted to
             indicate whether the functions are available.  (Applications using
             autoconf could probe the library too.)
        </ul>
        It'd be possible to defer the need for <code>long long</code> to
        application compile time, by having something like
        <code>mpz_set_2ui</code> called with two halves of a <code>long
        long</code> (sketched after this list).  Disadvantages of this would be,
        <ul>
        <li> Bigger code in the application, though perhaps not if a <code>long
             long</code> is normally passed as two halves anyway.
        <li> <code>mpz_get_ull</code> would be a rather big inline, or would have
             to be two function calls.
        <li> <code>mpz_get_sll</code> would be a worse inline, and would put the
             treatment of <code>-0x10..00</code> into applications (see
             <code>mpz_get_si</code> correctness above).
        <li> Although having libgmp.a independent of the build compiler is nice,
             it sort of sacrifices the capabilities of a good compiler to
             uniformity with inferior ones.
        </ul>
        Plain use of <code>long long</code> is probably the lesser evil, if only
        because it makes best use of gcc.
   <li> <code>mpz_strtoz</code> parsing the same as <code>strtol</code>.
        Suggested by Alexander Kruppa.
</ul>
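
<p> As a sketch of the suggested <code>mpz_combit</code>, the toggle can be
    built from the existing bit operations (a real implementation would work
    on the limbs directly; the name <code>toggle_bit</code> here is just
    illustrative):

<pre>
#include &lt;gmp.h&gt;

/* Sketch only: toggle bit `index' of x using the existing predicates.
   A real mpz_combit would operate on the limbs directly. */
void
toggle_bit (mpz_t x, unsigned long index)
{
  if (mpz_tstbit (x, index))
    mpz_clrbit (x, index);
  else
    mpz_setbit (x, index);
}
</pre>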
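
<p> Similarly, a minimal sketch of the suggested <code>mpz_bit_subset</code>,
    using the identity that a is a bitwise subset of b exactly when
    (a&amp;b)==a.  The return value convention (0 no, 1 proper subset, 2
    equal) is just one possibility:

<pre>
#include &lt;gmp.h&gt;

/* Sketch only: 0 if a is not a bitwise subset of b, 1 if it is a
   proper subset, 2 if a == b. */
int
bit_subset (const mpz_t a, const mpz_t b)
{
  mpz_t t;
  int subset;

  mpz_init (t);
  mpz_and (t, a, b);               /* t = a &amp; b */
  subset = (mpz_cmp (t, a) == 0);  /* subset iff (a&amp;b) == a */
  mpz_clear (t);

  if (! subset)
    return 0;
  return mpz_cmp (a, b) == 0 ? 2 : 1;
}
</pre>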
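
<p> At the application level the <code>mpz_sqrt_if_perfect_square</code>
    effect can be had today with <code>mpz_sqrtrem</code>, as in the
    following sketch (the function name is made up); a library version would
    instead reuse the root already computed inside
    <code>mpz_perfect_square_p</code>:

<pre>
#include &lt;gmp.h&gt;

/* Sketch only: if x (&gt;= 0) is a perfect square, set root and return
   non-zero, otherwise return zero.  root is clobbered either way. */
int
sqrt_if_perfect_square (mpz_t root, const mpz_t x)
{
  mpz_t rem;
  int perfect;

  mpz_init (rem);
  mpz_sqrtrem (root, rem, x);        /* root = floor(sqrt(x)) */
  perfect = (mpz_sgn (rem) == 0);    /* square iff remainder is zero */
  mpz_clear (rem);
  return perfect;
}
</pre>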
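
<p> And a sketch of the two-halves scheme, assuming <code>long long</code>
    is exactly twice as wide as <code>long</code> (true on common 32-bit
    systems, but an assumption; the names here are made up):

<pre>
#include &lt;limits.h&gt;
#include &lt;gmp.h&gt;

/* Sketch only: z = hi * 2^(bits in unsigned long) + lo, built from
   existing calls.  libgmp itself needs no long long for this. */
void
mpz_set_2ui_sketch (mpz_t z, unsigned long hi, unsigned long lo)
{
  mpz_set_ui (z, hi);
  mpz_mul_2exp (z, z, CHAR_BIT * sizeof (unsigned long));
  mpz_add_ui (z, z, lo);
}

/* Application side: split the long long (assumed twice the width of
   long, which C doesn't guarantee) and pass the halves. */
#define MPZ_SET_ULL(z, v)                                               \
  mpz_set_2ui_sketch (z,                                                \
      (unsigned long) ((v) &gt;&gt; (CHAR_BIT * sizeof (unsigned long))), \
      (unsigned long) (v))
</pre>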
   
   
<h4>Configuration</h4>
   
<ul>
<li> Floating-point format: <code>GMP_C_DOUBLE_FORMAT</code> seems to work
     well.  Get rid of the <code>#ifdef</code> mess in gmp-impl.h and use the
     results of the test instead.
<li> a29k: umul.s and udiv.s exist but don't get used.
   <li> ARM: <code>umul_ppmm</code> in longlong.h always uses <code>umull</code>,
        but is that available only for M series chips or some such?  Perhaps it
        should be configured in some way.
   <li> HPPA: config.guess should recognize 7000, 7100, 7200, and 8x00.
   <li> HPPA 2.0w: gcc is rumoured to support 2.0w as of version 3, though
        perhaps just as a build-time choice.  In any case, figure out how to
        identify a suitable gcc or put it in the right mode, for the GMP compiler
        choices.
   <li> IA64: Latest libtool has some nonsense to detect ELF-32 or ELF-64 on
        <code>ia64-*-hpux*</code>.  Does GMP need to know anything about that?
   <li> Mips: config.guess should say mipsr3000, mipsr4000, mipsr10000, etc.
        "hinv -c processor" gives lots of information on Irix.  Standard
        config.guess appends "el" to indicate endianness, but
        <code>AC_C_BIGENDIAN</code> seems the best way to handle that for GMP.
   <li> PowerPC: The function descriptor nonsense for AIX is currently driven by
        <code>*-*-aix*</code>.  It might be more reliable to do some sort of
        feature test, examining the compiler output perhaps.  It might also be
        nice to merge the aix.m4 files into powerpc-defs.m4.
   <li> Sparc: <code>config.guess</code> recognises various exact sparcs; make
        use of that information in <code>configure</code> (work on this is in
        progress).
   <li> Sparc32: floating point or integer <code>udiv</code> should be selected
        according to the CPU target.  Currently floating point ends up being
        used on all sparcs, which is probably not right for generic V7 and V8.
   <li> Sparc: The use of <code>-xtarget=native</code> with <code>cc</code> is
        incorrect when cross-compiling; the target should be set according to
        the configured <code>$host</code> CPU.
   <li> m68k: config.guess can detect 68000, 68010, CPU32 and 68020, but relies
        on system information for 030, 040 and 060.  Can they be identified by
        running some code?
   <li> m68k: gas 2.11.90.0.1 pads with zero bytes in text segments, which is not
        valid code.  Probably need <code>.balignw &lt;n&gt;,0x4e7f</code> to get
        nops, if <code>ALIGN</code> is going to be used for anything that's
        executed across.
   <li> Some CPUs have <code>umul</code> and <code>udiv</code> code not being
        used.  Check all such for bit rot and then put umul and udiv in
        <code>$gmp_mpn_functions_optional</code> as "standard optional" objects.
        <br> In particular Sparc and SparcV8 on non-gcc should benefit from
        umul.asm enabled; the generic umul is suspected to be making sqr_basecase
        slower than mul_basecase.
   <li> HPPA <code>mpn_umul_ppmm</code> and <code>mpn_udiv_qrnnd</code> have a
        different parameter order than those functions on other CPUs.  It might
        avoid confusion to have them under different names, maybe
        <code>mpn_umul_ppmm_r</code> or some such.  Prototypes then wouldn't
        be conditionalized, and the appropriate form could be selected with the
        <code>HAVE_NATIVE</code> scheme if/when the code switches to use a
        <code>PROLOGUE</code> style.
   <li> <code>DItype</code>: The setup in gmp-impl.h for non-GCC could use an
        autoconf test to determine whether <code>long long</code> is available.
   <li> m88k: Make the assembler code work on non-underscore systems.  Conversion
        to .asm would be desirable.  Ought to be easy, but would want to be
        tested.
   <li> z8k: The use of a 32-bit limb in mpn/z8000x as opposed to 16-bits in
        mpn/z8000 could be an ABI choice.  But this chip is obsolete and nothing
        is likely to be done unless someone is actively using it.
   <li> config.m4 is generated only by the configure script; it won't be
        regenerated by config.status.  Creating it as an <code>AC_OUTPUT</code>
        would work, but it might upset "make" to have things like <code>L$</code>
        get into the Makefiles through <code>AC_SUBST</code>.
        <code>AC_CONFIG_COMMANDS</code> would be the alternative.  With some
        careful m4 quoting the <code>changequote</code> calls might not be
        needed, which might free up the order in which things had to be output.
   <li> <code>make distclean</code>: Only the mpn directory links which were
        created are removed, but perhaps all possible links should be removed, in
        case someone runs configure a second time without a
        <code>distclean</code> in between.  The only tricky part would be making
        sure all possible <code>extra_functions</code> are covered.
   <li> MinGW: Apparently a Cygwin version of gcc can be used by passing
        <code>-mno-cygwin</code>.  For <code>--host=*-*-mingw32*</code> it might
        be convenient to automatically use that option, if it works.  Needs
        someone with a dual cygwin/mingw setup to test.
   <li> Automake: Latest automake has a <code>CCAS</code>, <code>CCASFLAGS</code>
        scheme.  Though we probably wouldn't be using its assembler support we
        could try to use those variables in compatible ways.
   </ul>
   
      For example, "sparc" is not very useful as a machine architecture  
      denotation.  We want to distinguish old 32-bit SPARC without  
      multiply support from newer 32-bit SPARC with such support.  We  
      want to recognize a SuperSPARC, since its implementation of the  
      UDIV instruction is not complete, and will trap to the OS kernel  
      for certain operands.  And we want to recognize 64-bit capable  
      SPARC processors as such.  While the assembly routines can use  
      64-bit operations on all 64-bit SPARC processors, one can not use  
      64-bit limbs under all operating system.  E.g., Solaris 2.5 and  
      2.6 doesn't preserve the upper 32 bits of most processor  
      registers.  For SPARC we therefore sometimes need to choose GMP  
      configuration depending both on processor and operating system.  
   
<h4>Random Numbers</h4>
<ul>
<li> <code>_gmp_rand</code> is not particularly fast on the linear
     congruential algorithm and could stand various improvements.
     <ul>
     <li> Make a second seed area within <code>gmp_randstate_t</code> (or
          <code>_mp_algdata</code> rather) to save some copying.
     <li> Make a special case for a single limb <code>2exp</code> modulus, to
          avoid <code>mpn_mul</code> calls.  Perhaps the same for two limbs.
     <li> Inline the <code>lc</code> code, to avoid a function call and
          <code>TMP_ALLOC</code> for every chunk.
     <li> The special case for <code>seedn==0</code> will be very rarely used,
          and on that basis seems unnecessary.
     <li> Perhaps the <code>2exp</code> and general LC cases should be split,
          for clarity (if the general case is retained).
     </ul>
<li> <code>gmp_randinit_mers</code> for a Mersenne Twister generator.  It's
     likely to be more random and about the same speed as Knuth's 55-element
     Fibonacci generator, and can probably become the default.  Pedro Gimeno
     has started on this.
<li> <code>gmp_randinit_lc</code>: Finish or remove.  Doing a division for
     every step won't be very fast, so check whether the usefulness of this
     algorithm can be justified.  (Consensus is that it's not useful and can
     be removed.)
<li> Blum-Blum-Shub: Finish or remove.  A separate
     <code>gmp_randinit_bbs</code> would be wanted, not the currently
     commented-out case in <code>gmp_randinit</code>.
<li> <code>_gmp_rand</code> could be done as a function pointer within
     <code>gmp_randstate_t</code> (or rather in the <code>_mp_algdata</code>
     part), instead of switching on a <code>gmp_randalg_t</code>.  Likewise
     <code>gmp_randclear</code>, and perhaps <code>gmp_randseed</code> if it
     became algorithm-specific.  This would be more modular, and would ensure
     only code for the desired algorithms is dragged into the link.
<li> <code>mpz_urandomm</code> should do something for n&lt;=0, but what?
<li> <code>mpz_urandomm</code> implementation looks like it could be
     improved.  Perhaps it's enough to calculate <code>nbits</code> as
     ceil(log2(n)) and call <code>_gmp_rand</code> until a value
     <code>&lt;n</code> is obtained (a sketch appears after this list).
<li> <code>gmp_randstate_t</code> used for parameters perhaps should become
     <code>gmp_randstate_ptr</code> the same as other types.
<li> Some of the empirical randomness tests could be included in a "make
     check".  They ought to work everywhere, for a given seed at least.
</ul>
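
<p> A sketch of that rejection approach at the mpz level (the real code
    would call <code>_gmp_rand</code> internally; the name below is made up).
    Since n has <code>nbits</code> significant bits, each pass accepts with
    probability at least 1/2, so the expected number of iterations is at
    most 2:

<pre>
#include &lt;gmp.h&gt;

/* Sketch only: uniform random in [0,n), for n &gt; 0, by drawing nbits
   bits at a time and rejecting values &gt;= n. */
void
urandomm_sketch (mpz_t rop, gmp_randstate_t state, const mpz_t n)
{
  unsigned long nbits = mpz_sizeinbase (n, 2);
  do
    mpz_urandomb (rop, state, nbits);
  while (mpz_cmp (rop, n) &gt;= 0);
}
</pre>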
   
<h4>Miscellaneous</h4>
<ul>
   
<li> Make <code>mpz_div</code> and <code>mpz_divmod</code> use rounding
     analogous to <code>mpz_mod</code>.  Document, and list as an
     incompatibility.
<li> <code>mpz_gcdext</code> and <code>mpn_gcdext</code> ought to document
     what range of values the generated cofactors can take, and preferably
     ensure the definition uniquely specifies the cofactors for given inputs.
     A basic extended Euclidean algorithm or multi-step variant leads to
     |x|&lt;|b| and |y|&lt;|a| or something like that, but there are probably
     two solutions under just those restrictions.
<li> <code>mpz_invert</code> should call <code>mpn_gcdext</code> directly.
  <li> demos/factorize.c: use <code>mpz_divisible_ui_p</code> rather than
       <code>mpz_tdiv_qr_ui</code>, as sketched after this list.  (Of course
       dividing out multiple primes at a time would be better still.)
   <li> The various test programs use quite a bit of the main
        <code>libgmp</code>.  This establishes good cross-checks, but it might be
        better to use simple reference routines where possible.  Where it's not
        possible some attention could be paid to the order of the tests, so a
        <code>libgmp</code> routine is only used for tests once it seems to be
        good.
  <li> <code>mpf_set_q</code> is very similar to <code>mpf_div</code>; it'd be
       good for the two to share code.  Perhaps <code>mpf_set_q</code> should
       make some <code>mpf_t</code> aliases for its numerator and denominator
       and just call <code>mpf_div</code>.  Both would be simplified a good
       deal by switching to <code>mpn_tdiv_qr</code>, perhaps making them
       small enough not to bother with sharing (especially since
       <code>mpf_set_q</code> wouldn't need to watch out for overlaps).
   <li> PowerPC: The cpu time base registers (per <code>mftb</code> and
        <code>mftbu</code>) could be used for the speed and tune programs.  Would
        need to know its frequency of course.  Usually it's 1/4 of bus speed
        (eg. 25 MHz) but some chips drive it from an external input.  Probably
        have to measure to be sure.
   <li> <code>MUL_FFT_THRESHOLD</code> etc: the FFT thresholds should allow a
        return to a previous k at certain sizes.  This arises basically due to
        the step effect caused by size multiples effectively used for each k.
        Looking at a graph makes it fairly clear.
   <li> <code>__gmp_doprnt_mpf</code> does a rather unattractive round-to-nearest
        on the string returned by <code>mpf_get_str</code>.  Perhaps some variant
        of <code>mpf_get_str</code> could be made which would better suit.
</ul>
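
<p> The factorize change might look like this sketch
    (<code>remove_factor</code> is a made-up name);
    <code>mpz_divisible_ui_p</code> computes no quotient, and once
    divisibility is known <code>mpz_divexact_ui</code> is cheaper than a full
    <code>mpz_tdiv_qr_ui</code>:

<pre>
#include &lt;gmp.h&gt;

/* Sketch only: divide out all factors of prime p from n, returning
   the multiplicity. */
unsigned long
remove_factor (mpz_t n, unsigned long p)
{
  unsigned long count = 0;
  while (mpz_divisible_ui_p (n, p))
    {
      mpz_divexact_ui (n, n, p);   /* exact division, no remainder */
      count++;
    }
  return count;
}
</pre>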
   
   
<h4>Aids to Development</h4>
<ul>
<li> Add <code>ASSERT</code>s at the start of each user-visible mpz/mpq/mpf
     function to check the validity of each <code>mp?_t</code> parameter, in
     particular to check they've been <code>mp?_init</code>ed.  This might
     catch elementary mistakes in user programs.  Care would need to be taken
     over <code>MPZ_TMP_INIT</code>ed variables used internally.  If nothing
     else, consistency checks like size&lt;=alloc, ptr not <code>NULL</code>
     and ptr+size not wrapping around the address space would be possible (a
     sketch appears after this list).  A more sophisticated scheme could
     track <code>_mp_d</code> pointers and ensure only a valid one is used.
     Such a scheme probably wouldn't be reentrant, not without some help from
     the system.
   <li> tune/time.c could try to determine at runtime whether
        <code>getrusage</code> and <code>gettimeofday</code> are reliable.
        Currently we pretend in configure that the dodgy m68k netbsd 1.4.1
        <code>getrusage</code> doesn't exist.  If a test might take a long time
        to run then perhaps cache the result in a file somewhere.
</ul>
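
<p> The simple consistency checks might look like the following sketch for
    <code>mpz_t</code> (the macro name is made up; <code>PTR</code>,
    <code>ABSIZ</code>, <code>ALLOC</code> and <code>ASSERT</code> are the
    existing gmp-impl.h internals):

<pre>
/* Sketch only, for gmp-impl.h: sanity-check an mpz_t before use. */
#define MPZ_CHECK_VALID(z)                                  \
  do {                                                      \
    ASSERT (PTR (z) != NULL);                               \
    ASSERT (ALLOC (z) &gt;= 1);                             \
    ASSERT (ABSIZ (z) &lt;= ALLOC (z));                     \
    /* ptr+size mustn't wrap around the address space */    \
    ASSERT ((char *) (PTR (z) + ABSIZ (z))                  \
            &gt;= (char *) PTR (z));                        \
  } while (0)
</pre>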
   
   
<li> <code>mpz_inp_str</code> (etc) doesn't say when it stops reading digits.
</ul>
   
<h4>Bright Ideas</h4>
   The following may or may not be feasible, and aren't likely to get done in the
   near future, but are at least worth thinking about.
   
   <ul>
   <li> Reorganize longlong.h so that we can inline the operations even for the
        system compiler.  When there is no such compiler feature, make calls to
        stub functions.  Write such stub functions for as many machines as
        possible.
   <li> longlong.h could declare when it's using, or would like to use,
        <code>mpn_umul_ppmm</code>, and the corresponding umul.asm file could be
        included in libgmp only in that case, the same as is effectively done for
        <code>__clz_tab</code>.  Likewise udiv.asm and perhaps cntlz.asm.  This
        would only be a very small space saving, so perhaps not worth the
        complexity.
   <li> longlong.h could be built at configure time by concatenating or
        #including fragments from each directory in the mpn path.  This would
        select CPU specific macros the same way as CPU specific assembler code.
        Code used would no longer depend on cpp predefines, and the current
        nested conditionals could be flattened out.
   <li> <code>mpz_get_si</code> returns 0x80000000 for -0x100000000, whereas it's
        sort of supposed to return the low 31 (or 63) bits.  But this is
        undocumented, and perhaps not too important.
   <li> <code>mpz_*_ui</code> division routines currently return abs(a%b).
        Perhaps make them return the real remainder instead?  Return type would
        be <code>signed long int</code>.  But this would be an incompatible
        change, so it might have to be under newly named functions.
   <li> <code>mpz_init_set*</code> and <code>mpz_realloc</code> could allocate
        say an extra 16 limbs over what's needed, so as to reduce the chance of
        having to do a reallocate if the <code>mpz_t</code> grows a bit more.
        This could only be an option, since it'd badly bloat memory usage in
        applications using many small values.
   <li> <code>mpq</code> functions could perhaps check for numerator or
        denominator equal to 1, on the assumption that integers or
        denominator-only values might be expected to occur reasonably often.
   <li> <code>count_trailing_zeros</code> is used on more or less uniformly
        distributed numbers in a couple of places.  For some CPUs
        <code>count_trailing_zeros</code> is slow and it's probably worth handling
       the frequently occurring 0 to 2 trailing zeros cases specially (a
       sketch appears after this list).
   <li> <code>mpf_t</code> might like to let the exponent be undefined when
        size==0, instead of requiring it 0 as now.  It should be possible to do
        size==0 tests before paying attention to the exponent.  The advantage is
        not needing to set exp in the various places a zero result can arise,
        which avoids some tedium but is otherwise perhaps not too important.
        Currently <code>mpz_set_f</code> and <code>mpf_cmp_ui</code> depend on
        exp==0, maybe elsewhere too.
   <li> <code>__gmp_allocate_func</code>: Could use GCC <code>__attribute__
        ((malloc))</code> on this, though don't know if it'd do much.  GCC 3.0
        allows that attribute on functions, but not function pointers (see info
        node "Attribute Syntax"), so would need a new autoconf test.  This can
        wait until there's a GCC that supports it.
   <li> <code>mpz_add_ui</code> contains two <code>__GMPN_COPY</code>s, one from
        <code>mpn_add_1</code> and one from <code>mpn_sub_1</code>.  If those two
        routines were opened up a bit maybe that code could be shared.  When a
        copy needs to be done there's no carry to append for the add, and if the
        copy is non-empty no high zero for the sub. <br> An alternative would be
        to do a copy at the start and then an in-place add or sub.  Obviously
        that duplicates the fetches and stores for carry propagation, but that's
        normally only one or two limbs.  The same applies to <code>mpz_add</code>
        when one operand is longer than the other, and to <code>mpz_com</code>
        since it's just -(x+1).
   <li> <code>restrict</code>'ed pointers: Does the C99 definition of restrict
        (one writer many readers, or whatever it is) suit the GMP style "same or
        separate" function parameters?  If so, judicious use might improve the
        code generated a bit.  Do any compilers have their own flavour of
        restrict as "completely unaliased", and is that still usable?
   <li> 68000: A 16-bit limb might suit 68000 better than 32-bits, since the
        native multiply is only 16x16.  Could have this as an <code>ABI</code>
        option, selecting <code>_SHORT_LIMB</code> in gmp.h.  Naturally a new set
        of asm subroutines would be necessary.  Would need new
        <code>mpz_set_ui</code> etc since the current code assumes limb&gt;=long,
        but 2-limb operand forms would find a use for <code>long long</code> on
        other processors too.
   <li> Nx1 remainders can be taken at multiplier throughput speed by
        pre-calculating an array "p[i] = 2^(i*<code>BITS_PER_MP_LIMB</code>) mod
        m", then for the input limbs x calculating an inner product "sum
        p[i]*x[i]", and a final 3x1 limb remainder mod m.  If those powers take
        roughly N divide steps to calculate then there'd be an advantage any time
        the same m is used three or more times.  Suggested by Victor Shoup in
        connection with chinese-remainder style decompositions, but perhaps with
       other uses.  (A sketch appears after this list.)
   </ul>
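
<p> A sketch of the precomputed-powers remainder, at the mpz level for
    clarity (a serious version would work on limbs with
    <code>umul_ppmm</code>; the function names are made up, and a limb is
    assumed to fit an <code>unsigned long</code>):

<pre>
#include &lt;gmp.h&gt;

/* Sketch only.  Precompute p[i] = 2^(i*mp_bits_per_limb) mod m for
   0 &lt;= i &lt; k, m &gt; 0.  p[] entries must already be mpz_init'ed. */
void
precompute_powers (mpz_t *p, unsigned long k, const mpz_t m)
{
  unsigned long i;
  mpz_set_ui (p[0], 1);
  mpz_mod (p[0], p[0], m);       /* 2^0 mod m, handles m == 1 */
  for (i = 1; i &lt; k; i++)
    {
      mpz_mul_2exp (p[i], p[i-1], mp_bits_per_limb);
      mpz_mod (p[i], p[i], m);
    }
}

/* Sketch only: r = x mod m for x &gt;= 0 with mpz_size(x) &lt;= k, as
   the inner product sum p[i]*x[i] plus one final reduction. */
void
rem_precomp (mpz_t r, const mpz_t x, mpz_t *p, const mpz_t m)
{
  unsigned long i, n = mpz_size (x);
  mpz_t t;
  mpz_init (t);
  mpz_set_ui (r, 0);
  for (i = 0; i &lt; n; i++)
    {
      mpz_mul_ui (t, p[i], (unsigned long) mpz_getlimbn (x, i));
      mpz_add (r, r, t);         /* accumulate p[i]*x[i] */
    }
  mpz_mod (r, r, m);             /* final reduction */
  mpz_clear (t);
}
</pre>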
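
<p> And the <code>count_trailing_zeros</code> special case might look like
    the following sketch, wrapping the existing longlong.h macro (the wrapper
    name is made up); on uniform inputs the three cheap tests cover 7/8 of
    all cases:

<pre>
/* Sketch only: handle 0 to 2 trailing zeros inline, since half of
   uniformly distributed inputs have none, a quarter have one, etc. */
#define COUNT_TRAILING_ZEROS_FAST(count, n)     \
  do {                                          \
    if ((n) &amp; 1)                            \
      (count) = 0;                              \
    else if ((n) &amp; 2)                       \
      (count) = 1;                              \
    else if ((n) &amp; 4)                       \
      (count) = 2;                              \
    else                                        \
      count_trailing_zeros (count, n);          \
  } while (0)
</pre>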
   <hr>
   
</body>
</html>
   
   <!--
   Local variables:
   eval: (add-hook 'write-file-hooks 'time-stamp)
   time-stamp-start: "This file current as of "
   time-stamp-format: "%:d %3b %:y"
   time-stamp-end: "\\."
   time-stamp-line-limit: 50
   End:
   -->
