OpenXM_contrib/gmp/doc/tasks.html - annotate

Return to tasks.html CVS log
Up to [local] / OpenXM_contrib / gmp / doc
Annotation of OpenXM_contrib/gmp/doc/tasks.html, Revision 1.1.1.2

1.1       maekawa     1: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
                      2: <html>
                      3: <head>
                      4:   <title>
                      5:     GMP Itemized Development Tasks
                      6:   </title>
                      7: </head>
                      8: <body bgcolor=lightgreen>
                      9:
                     10: <center>
                     11:   <h1>
                     12:     GMP Itemized Development Tasks
                     13:   </h1>
                     14: </center>
                     15:
1.1.1.2 ! ohara      16: <font size=-1>
        !            17: Copyright 2000, 2001, 2002 Free Software Foundation, Inc. <br><br>
        !            18: This file is part of the GNU MP Library. <br><br>
        !            19: The GNU MP Library is free software; you can redistribute it and/or modify
        !            20: it under the terms of the GNU Lesser General Public License as published
        !            21: by the Free Software Foundation; either version 2.1 of the License, or (at
        !            22: your option) any later version. <br><br>
        !            23: The GNU MP Library is distributed in the hope that it will be useful, but
        !            24: WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
        !            25: or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
        !            26: License for more details. <br><br>
        !            27: You should have received a copy of the GNU Lesser General Public License
        !            28: along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
        !            29: the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston,
        !            30: MA 02111-1307, USA.
        !            31: </font>
        !            32:
        !            33: <hr>
        !            34: <!-- NB. timestamp updated automatically by emacs -->
1.1       maekawa    35: <comment>
1.1.1.2 ! ohara      36:   This file current as of 20 May 2002.  An up-to-date version is available at
1.1       maekawa    37:   <a href="http://www.swox.com/gmp/tasks.html">http://www.swox.com/gmp/tasks.html</a>.
1.1.1.2 ! ohara      38:   Please send comments about this page to
        !            39:   <a href="mailto:bug-gmp@gnu.org">bug-gmp@gnu.org</a>.
1.1       maekawa    40: </comment>
                     41:
1.1.1.2 ! ohara      42: <p> These are itemized GMP development tasks.  Not all the tasks
1.1       maekawa    43:     listed here are suitable for volunteers, but many of them are.
                     44:     Please see the <a href="projects.html">projects file</a> for more
                     45:     sizeable projects.
                     46:
                     47: <h4>Correctness and Completeness</h4>
                     48: <ul>
                     49: <li> The various reuse.c tests need to force reallocation by calling
                     50:      <code>_mpz_realloc</code> with a small (1 limb) size.
1.1.1.2 ! ohara      51: <li> One reuse case is missing from mpX/tests/reuse.c:
        !            52:      <code>mpz_XXX(a,a,a)</code>.
        !            53: <li> When printing <code>mpf_t</code> numbers with exponents &gt;2^53 on
        !            54:      machines with 64-bit <code>mp_exp_t</code>, the precision of
1.1       maekawa    55:      <code>__mp_bases[base].chars_per_bit_exactly</code> is insufficient and
1.1.1.2 ! ohara      56:      <code>mpf_get_str</code> aborts.  Detect and compensate.  Alternately,
        !            57:      think seriously about using some sort of fixed-point integer value.
        !            58:      Avoiding unnecessary floating point is probably a good thing in general,
        !            59:      and it might be faster on some CPUs.
1.1       maekawa    60: <li> Make the string reading functions allow the `0x' prefix when the base is
                     61:      explicitly 16.  They currently only allow that prefix when the base is
1.1.1.2 ! ohara      62:      unspecified (zero).
1.1       maekawa    63: <li> <code>mpf_eq</code> is not always correct, when one operand is
                     64:      1000000000... and the other operand is 0111111111..., i.e., extremely
                     65:      close.  There is a special case in <code>mpf_sub</code> for this
                     66:      situation; put similar code in <code>mpf_eq</code>.
1.1.1.2 ! ohara      67: <li> <code>mpf_eq</code> doesn't implement what gmp.texi specifies.  It should
        !            68:      not use just whole limbs, but partial limbs.
        !            69: <li> <code>mpf_set_str</code> doesn't validate it's exponent, for instance
        !            70:      garbage 123.456eX789X is accepted (and an exponent 0 used), and overflow
        !            71:      of a <code>long</code> is not detected.
        !            72: <li> <code>mpf_add</code> doesn't check for a carry from truncated portions of
        !            73:      the inputs, and in that respect doesn't implement the "infinite precision
        !            74:      followed by truncate" specified in the manual.
        !            75: <li> <code>mpf_div</code> of x/x doesn't always give 1, reported by Peter
        !            76:      Moulder.  Perhaps it suffices to put +1 on the effective divisor prec, so
        !            77:      that data bits rather than zeros are shifted in when normalizing.  Would
        !            78:      prefer to switch to <code>mpn_tdiv_qr</code>, where all shifting should
        !            79:      disappear.
        !            80: <li> Windows DLLs: tests/mpz/reuse.c and tests/mpf/reuse.c initialize global
        !            81:      variables with pointers to <code>mpz_add</code> etc, which doesn't work
        !            82:      when those routines are coming from a DLL (because they're effectively
        !            83:      function pointer global variables themselves).  Need to rearrange perhaps
        !            84:      to a set of calls to a test function rather than iterating over an array.
        !            85: <li> demos/pexpr.c: The local variables in <code>main</code> might be
        !            86:      clobbered by the <code>longjmp</code>.
1.1       maekawa    87: </ul>
                     88:
                     89:
                     90:
                     91: <h4>Machine Independent Optimization</h4>
                     92: <ul>
1.1.1.2 ! ohara      93: <li> <code>mpn_gcdext</code>, <code>mpz_get_d</code>,
        !            94:      <code>mpf_get_str</code>: Don't test <code>count_leading_zeros</code> for
        !            95:      zero, instead check the high bit of the operand and avoid invoking
        !            96:      <code>count_leading_zeros</code>.  This is an optimization on all
        !            97:      machines, and significant on machines with slow
        !            98:      <code>count_leading_zeros</code>, though it's possible an already
        !            99:      normalized operand might not be encountered very often.
1.1       maekawa   100: <li> Rewrite <code>umul_ppmm</code> to use floating-point for generating the
                    101:      most significant limb (if <code>BITS_PER_MP_LIMB</code> &lt= 52 bits).
                    102:      (Peter Montgomery has some ideas on this subject.)
                    103: <li> Improve the default <code>umul_ppmm</code> code in longlong.h: Add partial
                    104:      products with fewer operations.
1.1.1.2 ! ohara     105: <li> Consider inlining <code>mpz_set_ui</code>.  This would be both small and
        !           106:      fast, especially for compile-time constants, but would make application
        !           107:      binaries depend on having 1 limb allocated to an <code>mpz_t</code>,
        !           108:      preventing the "lazy" allocation scheme below.
        !           109: <li> Consider inlining <code>mpz_[cft]div_ui</code> and maybe
        !           110:      <code>mpz_[cft]div_r_ui</code>.  A <code>__gmp_divide_by_zero</code>
        !           111:      would be needed for the divide by zero test, unless that could be left to
        !           112:      <code>mpn_mod_1</code> (not sure currently whether all the risc chips
        !           113:      provoke the right exception there if using mul-by-inverse).
        !           114: <li> Consider inlining: <code>mpz_fits_s*_p</code>.  The setups for
        !           115:      <code>LONG_MAX</code> etc would need to go into gmp.h, and on Cray it
        !           116:      might, unfortunately, be necessary to forcibly include &lt;limits.h&gt;
        !           117:      since there's no apparent way to get <code>SHRT_MAX</code> with an
        !           118:      expression (since <code>short</code> and <code>unsigned short</code> can
        !           119:      be different sizes).
1.1       maekawa   120: <li> <code>mpz_powm</code> and <code>mpz_powm_ui</code> aren't very
                    121:      fast on one or two limb moduli, due to a lot of function call
                    122:      overheads.  These could perhaps be handled as special cases.
                    123: <li> <code>mpz_powm</code> and <code>mpz_powm_ui</code> want better
                    124:      algorithm selection, and the latter should use REDC.  Both could
                    125:      change to use an <code>mpn_powm</code> and <code>mpn_redc</code>.
1.1.1.2 ! ohara     126: <li> <code>mpz_powm</code> REDC should do multiplications by <code>g[]</code>
        !           127:      using the division method when they're small, since the REDC form of a
        !           128:      small multiplier is normally a full size product.  Probably would need a
        !           129:      new tuned parameter to say what size multiplier is "small", as a function
        !           130:      of the size of the modulus.
        !           131: <li> <code>mpz_powm</code> REDC should handle even moduli if possible.  Maybe
        !           132:      this would mean for m=n*2^k doing mod n using REDC and an auxiliary
        !           133:      calculation mod 2^k, then putting them together at the end.
1.1       maekawa   134: <li> <code>mpn_gcd</code> might be able to be sped up on small to
                    135:      moderate sizes by improving <code>find_a</code>, possibly just by
                    136:      providing an alternate implementation for CPUs with slowish
                    137:      <code>count_leading_zeros</code>.
1.1.1.2 ! ohara     138: <li> Toom3 <code>USE_MORE_MPN</code> could use a low to high cache localized
        !           139:      evaluate and interpolate.  The necessary <code>mpn_divexact_by3c</code>
        !           140:      exists.
1.1       maekawa   141: <li> <code>mpn_mul_basecase</code> on NxM with big N but small M could try for
                    142:      better cache locality by taking N piece by piece.  The current code could
                    143:      be left available for CPUs without caching.  Depending how karatsuba etc
                    144:      is applied to unequal size operands it might be possible to assume M is
                    145:      always smallish.
1.1.1.2 ! ohara     146: <li> <code>mpn_perfect_square_p</code> on small operands might be better off
        !           147:      skipping the residue tests and just taking a square root.
        !           148: <li> <code>mpz_perfect_power_p</code> could be improved in a number of ways.
        !           149:      Test for Nth power residues modulo small primes like
        !           150:      <code>mpn_perfect_square_p</code> does.  Use p-adic arithmetic to find
        !           151:      possible roots.  Divisibility by other primes should be tested by
        !           152:      grouping into a limb like <code>PP</code>.
        !           153: <li> <code>mpz_perfect_power_p</code> might like to use <code>mpn_gcd_1</code>
        !           154:      instead of a private GCD routine.  The use it's put to isn't
        !           155:      time-critical, and it might help be ensure correctness to use the main GCD
        !           156:      routine.
        !           157: <li> <code>mpz_perfect_power_p</code> could use
        !           158:      <code>mpz_divisible_ui_p</code> instead of <code>mpz_tdiv_ui</code> for
        !           159:      divisibility testing, the former is faster on a number of systems.  (But
        !           160:      all that prime test stuff is going to be rewritten some time.)
        !           161: <li> Change <code>PP</code>/<code>PP_INVERTED</code> into an array of such
        !           162:      pairs, listing several hundred primes.  Perhaps actually make the
        !           163:      products larger than one limb each.
        !           164: <li> <code>PP</code> can have factors of 2 introduced in order to get the high
        !           165:      bit set and therefore a <code>PP_INVERTED</code> existing.  The factors
        !           166:      of 2 don't affect the way the remainder r = a % ((x*y*z)*2^n) is used,
        !           167:      further remainders r%x, r%y, etc, are the same since x, y, etc are odd.
        !           168:      The advantage of this is that <code>mpn_preinv_mod_1</code> can then be
        !           169:      used if it's faster than plain <code>mpn_mod_1</code>.  This would be a
        !           170:      change only for 16-bit limbs, all the rest already have <code>PP</code>
        !           171:      in the right form.
        !           172: <li> <code>PP</code> could have extra factors of 3 or 5 or whatever introduced
        !           173:      if they fit, and final remainders mod 9 or 25 or whatever used, thereby
        !           174:      making more efficient use of the <code>mpn_mod_1</code> done.  On a
        !           175:      16-bit limb it looks like <code>PP</code> could take an extra factor of
        !           176:      3.
        !           177: <li> <code>mpz_probab_prime_p</code>, <code>mpn_perfect_square_p</code> and
        !           178:      <code>mpz_perfect_power_p</code> could use <code>mpn_mod_34lsub1</code>
        !           179:      to take a remainder mod 2^24-1 or 2^48-1 and quickly get remainders mod
        !           180:      3, 5, 7, 13 and 17 (factors of 2^24-1).  This could either replace the
        !           181:      <code>PP</code> division currently done, or allow <code>PP</code> to do
        !           182:      larger primes, depending how many residue tests seem worthwhile before
        !           183:      launching into full root extractions or Miller-Rabin etc.
        !           184: <li> <code>mpz_probab_prime_p</code> (and maybe others) could code the
        !           185:      divisibility tests like <code>n%7 == 0</code> in the form
        !           186: <pre>
        !           187: #define MP_LIMB_DIVISIBLE_7_P(n) \
        !           188:   ((n) * MODLIMB_INVERSE_7 &lt;= MP_LIMB_T_MAX/7)
        !           189: </pre>
        !           190:      This would help compilers which don't know how to optimize divisions by
        !           191:      constants, and would help current gcc (3.0) too since gcc forms a whole
        !           192:      remainder rather than using a modular inverse and comparing.  This
        !           193:      technique works for any odd modulus, and with some tweaks for even moduli
        !           194:      too.  See Granlund and Montgomery "Division By Invariant Integers"
        !           195:      section 9.
        !           196: <li> <code>mpz_probab_prime_p</code> and <code>mpz_nextprime</code> could
        !           197:      offer certainty for primes up to 2^32 by using a one limb miller-rabin
        !           198:      test to base 2, combined with a table of actual strong pseudoprimes in
        !           199:      that range (2314 of them).  If that table is too big then both base 2 and
        !           200:      base 3 tests could be done, leaving a table of 104.  The test could use
        !           201:      REDC and therefore be a <code>modlimb_invert</code> a remainder (maybe)
        !           202:      then two multiplies per bit (successively dependent).  Processors with
        !           203:      pipelined multipliers could do base 2 and 3 in parallel.  Vector systems
        !           204:      could do a whole bunch of bases in parallel, and perhaps offer near
        !           205:      certainty up to 64-bits (certainty might depend on an exhaustive search
        !           206:      of pseudoprimes up to that limit).  Obviously 2^32 is not a big number,
        !           207:      but an efficient and certain calculation is attractive.  It might find
        !           208:      other uses internally, and could even be offered as a one limb prime test
        !           209:      <code>mpn_probab_prime_1_p</code> or <code>gmp_probab_prime_ui_p</code>
        !           210:      perhaps.
        !           211: <li> <code>mpz_probab_prime_p</code> doesn't need to make a copy of
        !           212:      <code>n</code> when the input is negative, it can setup an
        !           213:      <code>mpz_t</code> alias, same data pointer but a positive size.  With no
        !           214:      need to clear before returning, the recursive function call could be
        !           215:      dispensed with too.
        !           216: <li> <code>mpf_set_str</code> produces low zero limbs when a string has a
        !           217:      fraction but is exactly representable, eg. 0.5 in decimal.  These could be
        !           218:      stripped to save work in later operations.
        !           219: <li> <code>mpz_and</code>, <code>mpz_ior</code> and <code>mpz_xor</code> should
        !           220:      use <code>mpn_and_n</code> etc for the benefit of the small number of
        !           221:      targets with native versions of those routines.  Need to be careful not to
        !           222:      pass size==0.  Is some code sharing possible between the <code>mpz</code>
        !           223:      routines?
        !           224: <li> <code>mpf_add</code>: Don't do a copy to avoid overlapping operands
        !           225:      unless it's really necessary (currently only sizes are tested, not
        !           226:      whether r really is u or v).
        !           227: <li> <code>mpf_add</code>: Under the check for v having no effect on the
        !           228:      result, perhaps test for r==u and do nothing in that case, rather than
        !           229:      currently it looks like an <code>MPN_COPY_INCR</code> will be done to
        !           230:      reduce prec+1 limbs to prec.
        !           231: <li> <code>mpn_divrem_2</code> could usefully accept unnormalized divisors and
        !           232:      shift the dividend on-the-fly, since this should cost nothing on
        !           233:      superscalar processors and avoid the need for temporary copying in
        !           234:      <code>mpn_tdiv_qr</code>.
        !           235: <li> <code>mpf_sqrt_ui</code> calculates prec+1 limbs, whereas just prec would
        !           236:      satisfy the application requested precision.  It should suffice to simply
        !           237:      reduce the rsize temporary to 2*prec-1 limbs.  <code>mpf_sqrt</code>
        !           238:      might be similar.
        !           239: <li> <code>invert_limb</code> generic C: The division could use dividend
        !           240:      b*(b-d)-1 which is high:low of (b-1-d):(b-1), instead of the current
        !           241:      (b-d):0, where b=2^<code>BITS_PER_MP_LIMB</code> and d=divisor.  The
        !           242:      former is per the original paper and is used in the x86 code, the
        !           243:      advantage is that the current special case for 0x80..00 could be dropped.
        !           244:      The two should be equivalent, but a little check of that would be wanted.
        !           245: <li> <code>mpq_cmp_ui</code> could form the <code>num1*den2</code> and
        !           246:      <code>num2*den1</code> products limb-by-limb from high to low and look at
        !           247:      each step for values differing by more than the possible carry bit from
        !           248:      the uncalculated portion.
        !           249: <li> <code>mpq_cmp</code> could do the same high-to-low progressive multiply
        !           250:      and compare.  The benefits of karatsuba and higher multiplication
        !           251:      algorithms are lost, but if it's assumed only a few high limbs will be
        !           252:      needed to determine an order then that's fine.
        !           253: <li> <code>mpn_add_1</code>, <code>mpn_sub_1</code>, <code>mpn_add</code>,
        !           254:      <code>mpn_sub</code>: Internally use <code>__GMPN_ADD_1</code> etc
        !           255:      instead of the functions, so they get inlined on all compilers, not just
        !           256:      gcc and others with <code>inline</code> recognised in gmp.h.
        !           257:      <code>__GMPN_ADD_1</code> etc are meant mostly to support application
        !           258:      inline <code>mpn_add_1</code> etc and if they don't come out good for
        !           259:      internal uses then special forms can be introduced, for instance many
        !           260:      internal uses are in-place.  Sometimes a block of code is executed based
        !           261:      on the carry-out, rather than using it arithmetically, and those places
        !           262:      might want to do their own loops entirely.
        !           263: <li> <code>__gmp_extract_double</code> on 64-bit systems could use just one
        !           264:      bitfield for the mantissa extraction, not two, when endianness permits.
        !           265:      Might depend on the compiler allowing <code>long long</code> bit fields
        !           266:      when that's the only actual 64-bit type.
        !           267: <li> <code>mpf_get_d</code> could be more like <code>mpz_get_d</code> and do
        !           268:      more in integers and give the float conversion as such a chance to round
        !           269:      in its preferred direction.  Some code sharing ought to be possible.  Or
        !           270:      if nothing else then for consistency the two ought to give identical
        !           271:      results on integer operands (not clear if this is so right now).
        !           272: <li> <code>usqr_ppm</code> or some such could do a widening square in the
        !           273:      style of <code>umul_ppmm</code>.  This would help 68000, and be a small
        !           274:      improvement for the generic C (which is used on UltraSPARC/64 for
        !           275:      instance).  GCC recognises the generic C ul*vh and vl*uh are identical,
        !           276:      but does two separate additions to the rest of the result.
        !           277: <li> tal-notreent.c could keep a block of memory permanently allocated.
        !           278:      Currently the last nested <code>TMP_FREE</code> releases all memory, so
        !           279:      there's an allocate and free every time a top-level function using
        !           280:      <code>TMP</code> is called.  Would need
        !           281:      <code>mp_set_memory_functions</code> to tell tal-notreent.c to release
        !           282:      any cached memory when changing allocation functions though.
        !           283: <li> <code>__gmp_tmp_alloc</code> from tal-notreent.c could be partially
        !           284:      inlined.  If the current chunk has enough room then a couple of pointers
        !           285:      can be updated.  Only if more space is required then a call to some sort
        !           286:      of <code>__gmp_tmp_increase</code> would be needed.  The requirement that
        !           287:      <code>TMP_ALLOC</code> is an expression might make the implementation a
        !           288:      bit ugly and/or a bit sub-optimal.
        !           289: <pre>
        !           290: #define TMP_ALLOC(n)
        !           291:   ((ROUND_UP(n) &gt; current-&gt;end - current-&gt;point ?
        !           292:      __gmp_tmp_increase (ROUND_UP (n)) : 0),
        !           293:      current-&gt;point += ROUND_UP (n),
        !           294:      current-&gt;point - ROUND_UP (n))
        !           295: </pre>
        !           296: <li> <code>__mp_bases</code> has a lot of data for bases which are pretty much
        !           297:      never used.  Perhaps the table should just go up to base 16, and have
        !           298:      code to generate data above that, if and when required.  Naturally this
        !           299:      assumes the code would be smaller than the data saved.
        !           300: <li> <code>__mp_bases</code> field <code>big_base_inverted</code> is only used
        !           301:      if <code>USE_PREINV_DIVREM_1</code> is true, and could be omitted
        !           302:      otherwise, to save space.
        !           303: <li> Make <code>mpf_get_str</code> and <code>mpf_set_str</code> call the
        !           304:      corresponding, much faster, mpn functions.
        !           305: <li> <code>mpn_mod_1</code> could pre-calculate values of R mod N, R^2 mod N,
        !           306:      R^3 mod N, etc, with R=2^<code>BITS_PER_MP_LIMB</code>, and use them to
        !           307:      process multiple limbs at each step by multiplying.  Suggested by Peter
        !           308:      L. Montgomery.
        !           309: <li> <code>mpz_get_str</code>, <code>mtox</code>: For power-of-2 bases, which
        !           310:      are of course fast, it seems a little silly to make a second pass over
        !           311:      the <code>mpn_get_str</code> output to convert to ASCII.  Perhaps combine
        !           312:      that with the bit extractions.
        !           313: <li> <code>mpz_gcdext</code>: If the caller requests only the S cofactor (of
        !           314:      A), and A&lt;B, then the code ends up generating the cofactor T (of B) and
        !           315:      deriving S from that.  Perhaps it'd be possible to arrange to get S in
        !           316:      the first place by calling <code>mpn_gcdext</code> with A+B,B.  This
        !           317:      might only be an advantage if A and B are about the same size.
        !           318: <li> <code>mpn_toom3_mul_n</code>, <code>mpn_toom3_sqr_n</code>: Temporaries
        !           319:      <code>B</code> and <code>D</code> are adjacent in memory and at the final
        !           320:      coefficient additions look like they could use a single
        !           321:      <code>mpn_add_n</code> of <code>l4</code> limbs rather than two of
        !           322:      <code>l2</code> limbs.
1.1       maekawa   323: </ul>
                    324:
                    325:
                    326: <h4>Machine Dependent Optimization</h4>
                    327: <ul>
1.1.1.2 ! ohara     328: <li> <code>udiv_qrnnd_preinv2norm</code>, the branch-free version of
        !           329:      <code>udiv_qrnnd_preinv</code>, might be faster on various pipelined
        !           330:      chips.  In particular the first <code>if (_xh != 0)</code> in
        !           331:      <code>udiv_qrnnd_preinv</code> might be roughly a 50/50 chance and might
        !           332:      branch predict poorly.  (The second test is probably almost always
        !           333:      false.)  Measuring with the tuneup program would be possible, but perhaps
        !           334:      a bit messy.  In any case maybe the default should be the branch-free
        !           335:      version.
        !           336:      <br>
        !           337:      Note that the current <code>udiv_qrnnd_preinv2norm</code> implementation
        !           338:      assumes a right shift will sign extend, which is not guaranteed by the C
        !           339:      standards, and doesn't happen on Cray vector systems.
1.1       maekawa   340: <li> Run the `tune' utility for more compiler/CPU combinations.  We would like
                    341:      to have gmp-mparam.h files in practically every implementation specific
                    342:      mpn subdirectory, and repeat each *_THRESHOLD for gcc and the system
                    343:      compiler.  See the `tune' top-level directory for more information.
1.1.1.2 ! ohara     344:        <pre>
        !           345:        #ifdef (__GNUC__)
        !           346:        #if __GNUC__ == 2 && __GNUC_MINOR__ == 7
        !           347:        ...
        !           348:        #endif
        !           349:        #if __GNUC__ == 2 && __GNUC_MINOR__ == 8
        !           350:        ...
        !           351:        #endif
        !           352:        #ifndef MUL_KARATSUBA_THRESHOLD
        !           353:        /* Default GNUC values */
        !           354:        ...
        !           355:        #endif
        !           356:        #else /* system compiler */
        !           357:        ...
        !           358:        #endif  </pre>
        !           359: <li> <code>invert_limb</code> on various processors might benefit from the
        !           360:      little Newton iteration done for alpha and ia64.
        !           361: <li> Alpha 21264: Improve feed-in code for <code>mpn_mul_1</code>,
        !           362:      <code>mpn_addmul_1</code>, and <code>mpn_submul_1</code>.
        !           363: <li> Alpha 21164: Rewrite <code>mpn_mul_1</code>, <code>mpn_addmul_1</code>,
        !           364:      and <code>mpn_submul_1</code> for the 21164.  This should use both integer
1.1       maekawa   365:      multiplies and floating-point multiplies.  For the floating-point
                    366:      operations, the single-limb multiplier should be split into three 21-bit
1.1.1.2 ! ohara     367:      chunks, or perhaps even better in four 16-bit chunks.  Probably possible
        !           368:      to reach 9 cycles/limb.
        !           369: <li> Alpha 21264 ev67: Use <code>ctlz</code> and <code>cttz</code> for
        !           370:      <code>count_leading_zeros</code> and<code>count_trailing_zeros</code>.
        !           371:      Use inline for gcc, probably want asm files for elsewhere.
        !           372: <li> ARC: gcc longlong.h sets up <code>umul_ppmm</code> to call
        !           373:      <code>__umulsidi3</code> in libgcc.  Could be copied straight across, but
        !           374:      perhaps ought to be tested.
        !           375: <li> ARM: On v5 cpus see if the <code>clz</code> instruction can be used for
        !           376:      <code>count_leading_zeros</code>.
        !           377: <li> Itanium: <code>mpn_divexact_by3</code> isn't particularly important, but
        !           378:      the generic C runs at about 27 c/l, whereas with the multiplies off the
        !           379:      dependent chain about 3 c/l ought to be possible.
        !           380: <li> Itanium: <code>mpn_hamdist</code> could be put together based on the
        !           381:      current <code>mpn_popcount</code>.
        !           382: <li> Itanium: <code>popc_limb</code> in gmp-impl.h could use the
        !           383:      <code>popcnt</code> insn.
        !           384: <li> Itanium: <code>mpn_submul_1</code> is not implemented directly, only via
        !           385:      a combination of <code>mpn_mul_1</code> and <code>mpn_sub_n</code>.
        !           386: <li> UltraSPARC/64: Optimize <code>mpn_mul_1</code>, <code>mpn_addmul_1</code>,
        !           387:      for s2 &lt; 2^32 (or perhaps for any zero 16-bit s2 chunk).  Not sure how
        !           388:      much this can improve the speed, though, since the symmetry that we rely
        !           389:      on is lost.  Perhaps we can just gain cycles when s2 &lt; 2^16, or more
        !           390:      accurately, when two 16-bit s2 chunks which are 16 bits apart are zero.
        !           391: <li> UltraSPARC/64: Write native <code>mpn_submul_1</code>, analogous to
        !           392:      <code>mpn_addmul_1</code>.
        !           393: <li> UltraSPARC/64: Write <code>umul_ppmm</code>.  Using four
        !           394:      "<code>mulx</code>"s either with an asm block or via the generic C code is
        !           395:      about 90 cycles.  Try using fp operations, and also try using karatsuba
        !           396:      for just three "<code>mulx</code>"s.
        !           397: <li> UltraSPARC/64: <code>mpn_divrem_1</code>, <code>mpn_mod_1</code>,
        !           398:      <code>mpn_divexact_1</code> and <code>mpn_modexact_1_odd</code> could
        !           399:      process 32 bits at a time when the divisor fits 32-bits.  This will need
        !           400:      only 4 <code>mulx</code>'s per limb instead of 8 in the general case.
        !           401: <li> UltraSPARC/32: Rewrite <code>mpn_lshift</code>, <code>mpn_rshift</code>.
        !           402:      Will give 2 cycles/limb.  Trivial modifications of mpn/sparc64 should do.
        !           403: <li> UltraSPARC/32: Write special mpn_Xmul_1 loops for s2 &lt; 2^16.
        !           404: <li> UltraSPARC/32: Use <code>mulx</code> for <code>umul_ppmm</code> if
        !           405:      possible (see commented out code in longlong.h).  This is unlikely to
        !           406:      save more than a couple of cycles, so perhaps isn't worth bothering with.
        !           407: <li> UltraSPARC/32: On Solaris gcc doesn't give us <code>__sparc_v9__</code>
        !           408:      or anything to indicate V9 support when -mcpu=v9 is selected.  See
        !           409:      gcc/config/sol2-sld-64.h.  Will need to pass something through from
        !           410:      ./configure to select the right code in longlong.h.  (Currently nothing
        !           411:      is lost because <code>mulx</code> for multiplying is commented out.)
        !           412: <li> UltraSPARC: <code>modlimb_invert</code> might save a few cycles from
        !           413:      masking down to just the useful bits at each point in the calculation,
        !           414:      since <code>mulx</code> speed depends on the highest bit set.  Either
        !           415:      explicit masks or small types like <code>short</code> and
        !           416:      <code>int</code> ought to work.
        !           417: <li> Sparc64 HAL R1: <code>mpn_popcount</code> and <code>mpn_hamdist</code>
        !           418:      could use <code>popc</code> currently commented out in gmp-impl.h.  This
        !           419:      chip reputedly implements <code>popc</code> properly (see gcc sparc.md),
        !           420:      would need to recognise the chip as <code>sparchalr1</code> or something
        !           421:      in configure / config.sub / config.guess.
1.1       maekawa   422: <li> PA64: Improve <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
1.1.1.2 ! ohara     423:      <code>mpn_mul_1</code>.  The current code runs at 11 cycles/limb.  It
        !           424:      should be possible to saturate the cache, which will happen at 8
        !           425:      cycles/limb (7.5 for mpn_mul_1).  Write special loops for s2 &lt; 2^32;
        !           426:      it should be possible to make them run at about 5 cycles/limb.
        !           427: <li> PPC630: Rewrite <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
        !           428:      <code>mpn_mul_1</code>.  Use both integer and floating-point operations,
        !           429:      possibly two floating-point and one integer limb per loop.  Split operands
        !           430:      into four 16-bit chunks for fast fp operations.  Should easily reach 9
        !           431:      cycles/limb (using one int + one fp), but perhaps even 7 cycles/limb
        !           432:      (using one int + two fp).
        !           433: <li> PPC630: <code>mpn_rshift</code> could do the same sort of unrolled loop
        !           434:      as <code>mpn_lshift</code>.  Some judicious use of m4 might let the two
        !           435:      share source code, or with a register to control the loop direction
        !           436:      perhaps even share object code.
        !           437: <li> PowerPC-32: <code>mpn_rshift</code> should do the same sort of unrolled
        !           438:      loop as <code>mpn_lshift</code>.
1.1       maekawa   439: <li> Implement <code>mpn_mul_basecase</code> and <code>mpn_sqr_basecase</code>
                    440:      for important machines.  Helping the generic sqr_basecase.c with an
                    441:      <code>mpn_sqr_diagonal</code> might be enough for some of the RISCs.
                    442: <li> POWER2/POWER2SC: Schedule <code>mpn_lshift</code>/<code>mpn_rshift</code>.
                    443:      Will bring time from 1.75 to 1.25 cycles/limb.
1.1.1.2 ! ohara     444: <li> X86: Optimize non-MMX <code>mpn_lshift</code> for shifts by 1.  (See
        !           445:      Pentium code.)
        !           446: <li> X86: Good authority has it that in the past an inline <code>rep
        !           447:      movs</code> would upset GCC register allocation for the whole function.
        !           448:      Is this still true in GCC 3?  It uses <code>rep movs</code> itself for
        !           449:      <code>__builtin_memcpy</code>.  Examine the code for some simple and
        !           450:      complex functions to find out.  Inlining <code>rep movs</code> would be
        !           451:      desirable, it'd be both smaller and faster.
        !           452: <li> Pentium P54: <code>mpn_lshift</code> and <code>mpn_rshift</code> can come
        !           453:      down from 6.0 c/l to 5.5 or 5.375 by paying attention to pairing after
        !           454:      <code>shrdl</code> and <code>shldl</code>, see mpn/x86/pentium/README.
        !           455: <li> Pentium P55 MMX: <code>mpn_lshift</code> and <code>mpn_rshift</code>
        !           456:      might benefit from some destination prefetching.
        !           457: <li> PentiumPro: <code>mpn_divrem_1</code> might be able to use a
        !           458:      mul-by-inverse, hoping for maybe 30 c/l.
        !           459: <li> P6: <code>mpn_add_n</code> and <code>mpn_sub_n</code> should be able to go
        !           460:      faster than the generic x86 code at 3.5 c/l.  The athlon code for instance
        !           461:      runs at about 2.7.
        !           462: <li> K7: <code>mpn_lshift</code> and <code>mpn_rshift</code> might be able to
        !           463:      do something branch-free for unaligned startups, and shaving one insn
        !           464:      from the loop with alternative indexing might save a cycle.
1.1       maekawa   465: <li> PPC32: Try using fewer registers in the current <code>mpn_lshift</code>.
1.1.1.2 ! ohara     466:      The pipeline is now extremely deep, perhaps unnecessarily deep.
1.1       maekawa   467: <li> PPC32: Write <code>mpn_rshift</code> based on new <code>mpn_lshift</code>.
                    468: <li> PPC32: Rewrite <code>mpn_add_n</code> and <code>mpn_sub_n</code>.  Should
1.1.1.2 ! ohara     469:      run at just 3.25 cycles/limb.
1.1       maekawa   470: <li> Fujitsu VPP: Vectorize main functions, perhaps in assembly language.
                    471: <li> Fujitsu VPP: Write <code>mpn_mul_basecase</code> and
                    472:      <code>mpn_sqr_basecase</code>.  This should use a "vertical multiplication
                    473:      method", to avoid carry propagation.  splitting one of the operands in
                    474:      11-bit chunks.
1.1.1.2 ! ohara     475: <li> 68k, Pentium: <code>mpn_lshift</code> by 31 should use the special rshift
        !           476:      by 1 code, and vice versa <code>mpn_rshift</code> by 31 should use the
        !           477:      special lshift by 1.  This would be best as a jump across to the other
        !           478:      routine, could let both live in lshift.asm and omit rshift.asm on finding
        !           479:      <code>mpn_rshift</code> already provided.
        !           480: <li> Cray T3E: Experiment with optimization options.  In particular,
        !           481:      -hpipeline3 seems promising.  We should at least up -O to -O2 or -O3.
        !           482: <li> Cray: <code>mpn_com_n</code> and <code>mpn_and_n</code> etc very probably
        !           483:      wants a pragma like <code>MPN_COPY_INCR</code>.
        !           484: <li> Cray vector systems: <code>mpn_lshift</code>, <code>mpn_rshift</code>,
        !           485:      <code>mpn_popcount</code> and <code>mpn_hamdist</code> are nice and small
        !           486:      and could be inlined to avoid function calls.
        !           487: <li> Cray: Variable length arrays seem to be faster than the tal-notreent.c
        !           488:      scheme.  Not sure why, maybe they merely give the compiler more
        !           489:      information about aliasing (or the lack thereof).  Would like to modify
        !           490:      <code>TMP_ALLOC</code> to use them, or introduce a new scheme.  Memory
        !           491:      blocks wanted unconditionally are easy enough, those wanted only
        !           492:      sometimes are a problem.  Perhaps a special size calculation to ask for a
        !           493:      dummy length 1 when unwanted, or perhaps an inlined subroutine
        !           494:      duplicating code under each conditional.  Don't really want to turn
        !           495:      everything into a dog's dinner just because Cray don't offer an
        !           496:      <code>alloca</code>.
        !           497: <li> Cray: <code>mpn_get_str</code> on power-of-2 bases ought to vectorize.
        !           498:      Does it?  <code>bits_per_digit</code> and the inner loop over bits in a
        !           499:      limb might prevent it.  Perhaps special cases for binary, octal and hex
        !           500:      would be worthwhile (very possibly for all processors too).
        !           501: <li> Cray: <code>popc_limb</code> could use the Cray <code>_popc</code>
        !           502:      intrinsic.  That would help <code>mpz_hamdist</code> and might make the
        !           503:      generic C versions of <code>mpn_popcount</code> and
        !           504:      <code>mpn_hamdist</code> suffice for Cray (if it vectorizes, or can be
        !           505:      given a hint to do so).
        !           506: <li> 68000: <code>mpn_mul_1</code>, <code>mpn_addmul_1</code>,
        !           507:      <code>mpn_submul_1</code>: Check for a 16-bit multiplier and use two
        !           508:      multiplies per limb, not four.
        !           509: <li> 68000: <code>mpn_lshift</code> and <code>mpn_rshift</code> could use a
        !           510:      <code>roll</code> and mask instead of <code>lsrl</code> and
        !           511:      <code>lsll</code>.  This promises to be a speedup, effectively trading a
        !           512:      6+2*n shift for one or two 4 cycle masks.  Suggested by Jean-Charles
        !           513:      Meyrignac.
1.1       maekawa   514: <li> Improve <code>count_leading_zeros</code> for 64-bit machines:
                    515:   <pre>
1.1.1.2 ! ohara     516:           if ((x &gt&gt 32) == 0) { x &lt&lt= 32; cnt += 32; }
        !           517:           if ((x &gt&gt 48) == 0) { x &lt&lt= 16; cnt += 16; }
        !           518:           ... </pre>
        !           519: <li> IRIX 6 MIPSpro compiler has an <code>__inline</code> which could perhaps
        !           520:      be used in <code>__GMP_EXTERN_INLINE</code>.  What would be the right way
        !           521:      to identify suitable versions of that compiler?
        !           522: <li> VAX D and G format <code>double</code> floats are straightforward and
        !           523:      could perhaps be handled directly in <code>__gmp_extract_double</code>
        !           524:      and maybe in <code>mpz_get_d</code>, rather than falling back on the
        !           525:      generic code.  (Both formats are detected by <code>configure</code>.)
        !           526: <li> <code>mpn_get_str</code> final divisions by the base with
        !           527:      <code>udiv_qrnd_unnorm</code> could use some sort of multiply-by-inverse
        !           528:      on suitable machines.  This ends up happening for decimal by presenting
        !           529:      the compiler with a run-time constant, but the same for other bases would
        !           530:      be good.  Perhaps use could be made of the fact base&lt;256.
        !           531: <li> <code>mpn_umul_ppmm</code>, <code>mpn_udiv_qrnnd</code>: Return a
        !           532:      structure like <code>div_t</code> to avoid going through memory, in
        !           533:      particular helping RISCs that don't do store-to-load forwarding.  Clearly
        !           534:      this is only possible if the ABI returns a structure of two
        !           535:      <code>mp_limb_t</code>s in registers.
1.1       maekawa   536: </ul>
                    537:
                    538: <h4>New Functionality</h4>
                    539: <ul>
1.1.1.2 ! ohara     540: <li> Add in-memory versions of <code>mp?_out_raw</code> and
        !           541:      <code>mp?_inp_raw</code>.
        !           542: <li> <code>mpz_get_nth_ui</code>.  Return the nth word (not necessarily the
        !           543:      nth limb).
1.1       maekawa   544: <li> Maybe add <code>mpz_crr</code> (Chinese Remainder Reconstruction).
                    545: <li> Let `0b' and `0B' mean binary input everywhere.
1.1.1.2 ! ohara     546: <li> <code>mpz_init</code> and <code>mpq_init</code> could do lazy allocation.
        !           547:      Set <code>ALLOC(var)</code> to 0 to indicate nothing allocated, and let
        !           548:      <code>_mpz_realloc</code> do the initial alloc.  Set
        !           549:      <code>z-&gt;_mp_d</code> to a dummy that <code>mpz_get_ui</code> and
        !           550:      similar can unconditionally fetch from.  Niels M�ller has had a go at
        !           551:      this.
        !           552:      <br>
        !           553:      The advantages of the lazy scheme would be:
        !           554:      <ul>
        !           555:      <li> Initial allocate would be the size required for the first value
        !           556:           stored, rather than getting 1 limb in <code>mpz_init</code> and then
        !           557:           more or less immediately reallocating.
        !           558:      <li> <code>mpz_init</code> would only store magic values in the
        !           559:           <code>mpz_t</code> fields, and could be inlined.
        !           560:      <li> A fixed initializer could even be used by applications, like
        !           561:           <code>mpz_t z = MPZ_INITIALIZER;</code>, which might be convenient
        !           562:           for globals.
        !           563:      </ul>
        !           564:      The advantages of the current scheme are:
        !           565:      <ul>
        !           566:      <li> <code>mpz_set_ui</code> and other similar routines needn't check the
        !           567:           size allocated and can just store unconditionally.
        !           568:      <li> <code>mpz_set_ui</code> and perhaps others like
        !           569:           <code>mpz_tdiv_r_ui</code> and a prospective
        !           570:           <code>mpz_set_ull</code> could be inlined.
        !           571:      </ul>
1.1       maekawa   572: <li> Add <code>mpf_out_raw</code> and <code>mpf_inp_raw</code>.  Make sure
                    573:      format is portable between 32-bit and 64-bit machines, and between
                    574:      little-endian and big-endian machines.
1.1.1.2 ! ohara     575: <li> <code>mpn_and_n</code> ... <code>mpn_copyd</code>: Perhaps make the mpn
        !           576:      logops and copys available in gmp.h, either as library functions or
        !           577:      inlines, with the availability of library functions instantiated in the
        !           578:      generated gmp.h at build time.
        !           579: <li> <code>mpz_set_str</code> etc variants taking string lengths rather than
        !           580:      null-terminators.
1.1       maekawa   581: <li> Consider changing the thresholds to apply the simpler algorithm when
                    582:      "<code>&lt;=</code>" rather than "<code>&lt;</code>", so a threshold can
                    583:      be set to <code>MP_SIZE_T_MAX</code> to get only the simpler code (the
                    584:      compiler will know <code>size &lt;= MP_SIZE_T_MAX</code> is always true).
1.1.1.2 ! ohara     585:      Alternately it looks like the <code>ABOVE_THRESHOLD</code> and
        !           586:      <code>BELOW_THRESHOLD</code> macros can do this adequately, and also pick
        !           587:      up cases where a threshold of zero should mean only the second algorithm.
        !           588: <li> <code>mpz_nthprime</code>.
        !           589: <li> Perhaps <code>mpz_init2</code>, initializing and making initial room for
        !           590:      N bits.  The actual size would be rounded up to a limb, and perhaps an
        !           591:      extra limb added since so many <code>mpz</code> routines need that on
        !           592:      their destination.
        !           593: <li> <code>mpz_andn</code>, <code>mpz_iorn</code>, <code>mpz_nand</code>,
        !           594:      <code>mpz_nior</code>, <code>mpz_xnor</code> might be useful additions,
        !           595:      if they could share code with the current such functions (which should be
        !           596:      possible).
        !           597: <li> <code>mpz_and_ui</code> etc might be of use sometimes.  Suggested by
        !           598:      Niels M�ller.
        !           599: <li> <code>mpf_set_str</code> and <code>mpf_inp_str</code> could usefully
        !           600:      accept 0x, 0b etc when base==0.  Perhaps the exponent could default to
        !           601:      decimal in this case, with a further 0x, 0b etc allowed there.
        !           602:      Eg. 0xFFAA@0x5A.  A leading "0" for octal would match the integers, but
        !           603:      probably something like "0.123" ought not mean octal.
        !           604: <li> <code>GMP_LONG_LONG_LIMB</code> or some such could become a documented
        !           605:      feature of gmp.h, so applications could know whether to
        !           606:      <code>printf</code> a limb using <code>%lu</code> or <code>%Lu</code>.
        !           607: <li> <code>PRIdMP_LIMB</code> and similar defines following C99
        !           608:      &lt;inttypes.h&gt; might be of use to applications printing limbs.
        !           609:      Perhaps they should be defined only if specifically requested, the way
        !           610:      &lt;inttypes.h&gt; does.  But if <code>GMP_LONG_LONG_LIMB</code> or
        !           611:      whatever is added then perhaps this can easily enough be left to
        !           612:      applications.
        !           613: <li> <code>mpf_get_ld</code> and <code>mpf_set_ld</code> converting
        !           614:      <code>mpf_t</code> to and from <code>long double</code>.  Other
        !           615:      <code>long double</code> routines would be desirable too, but these would
        !           616:      be a start.  Often <code>long double</code> is the same as
        !           617:      <code>double</code>, which is easy but pretty pointless.  Should
        !           618:      recognise the Intel 80-bit format on i386, and IEEE 128-bit quad on
        !           619:      sparc, hppa and power.  Might like an ABI sub-option or something when
        !           620:      it's a compiler option for 64-bit or 128-bit <code>long double</code>.
        !           621: <li> <code>gmp_printf</code> could accept <code>%b</code> for binary output.
        !           622:      It'd be nice if it worked for plain <code>int</code> etc too, not just
        !           623:      <code>mpz_t</code> etc.
        !           624: <li> <code>gmp_printf</code> in fact could usefully accept an arbitrary base,
        !           625:      for both integer and float conversions.  A base either in the format
        !           626:      string or as a parameter with <code>*</code> should be allowed.  Maybe
        !           627:      <code>&amp;13b</code> (b for base) or something like that.
        !           628: <li> <code>gmp_printf</code> could perhaps have a type code for an
        !           629:      <code>mp_limb_t</code>.  That would save an application from having to
        !           630:      worry whether it's a <code>long</code> or a <code>long long</code>.
        !           631: <li> <code>gmp_printf</code> could perhaps accept <code>mpq_t</code> for float
        !           632:      conversions, eg. <code>"%.4Qf"</code>.  This would be merely for
        !           633:      convenience, but still might be useful.  Rounding would be the same as
        !           634:      for an <code>mpf_t</code> (ie. currently round-to-nearest, but not
        !           635:      actually documented).  Alternately, perhaps a separate
        !           636:      <code>mpq_get_str_point</code> or some such might be more use.  Suggested
        !           637:      by Pedro Gimeno.
        !           638: <li> <code>gmp_printf</code> could usefully accept a flag to control the
        !           639:      rounding of float conversions.  The wouldn't do much for
        !           640:      <code>mpf_t</code>, but would be good if <code>mpfr_t</code> was
        !           641:      supported in the future, or perhaps for <code>mpq_t</code>.  Something
        !           642:      like <code>&amp;*r</code> (r for rounding, and mpfr style
        !           643:      <code>GMP_RND</code> parameter).
        !           644: <li> <code>mpz_combit</code> to toggle a bit would be a good companion for
        !           645:      <code>mpz_setbit</code> and <code>mpz_clrbit</code>.  Suggested by Niels
        !           646:      M�ller (and has done some work towards it).
        !           647: <li> <code>mpz_scan0_reverse</code> or <code>mpz_scan0low</code> or some such
        !           648:      searching towards the low end of an integer might match
        !           649:      <code>mpz_scan0</code> nicely.  Likewise for <code>scan1</code>.
        !           650:      Suggested by Roberto Bagnara.
        !           651: <li> <code>mpz_bit_subset</code> or some such to test whether one integer is a
        !           652:      bitwise subset of another might be of use.  Some sort of return value
        !           653:      indicating whether it's a proper or non-proper subset would be good and
        !           654:      wouldn't cost anything in the implementation.  Suggested by Roberto
        !           655:      Bagnara.
        !           656: <li> <code>gmp_randinit_r</code> and maybe <code>gmp_randstate_set</code> to
        !           657:      init-and-copy or to just copy a <code>gmp_randstate_t</code>.  Suggested
        !           658:      by Pedro Gimeno.
        !           659: <li> <code>mpf_get_ld</code>, <code>mpf_set_ld</code>: Conversions between
        !           660:      <code>mpf_t</code> and <code>long double</code>, suggested by Dan
        !           661:      Christensen.  There'd be some work to be done by <code>configure</code>
        !           662:      to recognise the format in use.  xlc on aix for instance apparently has
        !           663:      an option for either plain double 64-bit or quad 128-bit precision.  This
        !           664:      might mean library contents vary with the compiler used to build, which
        !           665:      is undesirable.  It might be possible to detect the mode the application
        !           666:      is compiling with, and try to avoid mismatch problems.
        !           667: <li> <code>mpz_sqrt_if_perfect_square</code>: When
        !           668:      <code>mpz_perfect_square_p</code> does its tests it calculates a square
        !           669:      root and then discards it.  For some applications it might be useful to
        !           670:      return that root.  Suggested by Jason Moxham.
        !           671: <li> <code>mpz_get_ull</code>, <code>mpz_set_ull</code>,
        !           672:      <code>mpz_get_sll</code>, <code>mpz_get_sll</code>: Conversions for
        !           673:      <code>long long</code>.  These would aid interoperability, though a
        !           674:      mixture of GMP and <code>long long</code> would probably not be too
        !           675:      common.  Disadvantages of using <code>long long</code> in libgmp.a would
        !           676:      be
        !           677:      <ul>
        !           678:      <li> Library contents vary according to the build compiler.
        !           679:      <li> gmp.h would need an ugly <code>#ifdef</code> block to decide if the
        !           680:           application compiler could take the <code>long long</code>
        !           681:           prototypes.
        !           682:      <li> Some sort of <code>LIBGMP_HAS_LONGLONG</code> would be wanted to
        !           683:           indicate whether the functions are available.  (Applications using
        !           684:           autoconf could probe the library too.)
        !           685:      </ul>
        !           686:      It'd be possible to defer the need for <code>long long</code> to
        !           687:      application compile time, by having something like
        !           688:      <code>mpz_set_2ui</code> called with two halves of a <code>long
        !           689:      long</code>.  Disadvantages of this would be,
        !           690:      <ul>
        !           691:      <li> Bigger code in the application, though perhaps not if a <code>long
        !           692:           long</code> is normally passed as two halves anyway.
        !           693:      <li> <code>mpz_get_ull</code> would be a rather big inline, or would have
        !           694:           to be two function calls.
        !           695:      <li> <code>mpz_get_sll</code> would be a worse inline, and would put the
        !           696:           treatment of <code>-0x10..00</code> into applications (see
        !           697:           <code>mpz_get_si</code> correctness above).
        !           698:      <li> Although having libgmp.a independent of the build compiler is nice,
        !           699:           it sort of sacrifices the capabilities of a good compiler to
        !           700:           uniformity with inferior ones.
        !           701:      </ul>
        !           702:      Plain use of <code>long long</code> is probably the lesser evil, if only
        !           703:      because it makes best use of gcc.
        !           704: <li> <code>mpz_strtoz</code> parsing the same as <code>strtol</code>.
        !           705:      Suggested by Alexander Kruppa.
1.1       maekawa   706: </ul>
                    707:
                    708:
                    709: <h4>Configuration</h4>
                    710:
                    711: <ul>
1.1.1.2 ! ohara     712: <li> Floating-point format: <code>GMP_C_DOUBLE_FORMAT</code> seems to work
        !           713:      well.  Get rid of the <code>#ifdef</code> mess in gmp-impl.h and use the
        !           714:      results of the test instead.
        !           715: <li> a29k: umul.s and udiv.s exist but don't get used.
        !           716: <li> ARM: <code>umul_ppmm</code> in longlong.h always uses <code>umull</code>,
        !           717:      but is that available only for M series chips or some such?  Perhaps it
        !           718:      should be configured in some way.
        !           719: <li> HPPA: config.guess should recognize 7000, 7100, 7200, and 8x00.
        !           720: <li> HPPA 2.0w: gcc is rumoured to support 2.0w as of version 3, though
        !           721:      perhaps just as a build-time choice.  In any case, figure out how to
        !           722:      identify a suitable gcc or put it in the right mode, for the GMP compiler
        !           723:      choices.
        !           724: <li> IA64: Latest libtool has some nonsense to detect ELF-32 or ELF-64 on
        !           725:      <code>ia64-*-hpux*</code>.  Does GMP need to know anything about that?
        !           726: <li> Mips: config.guess should say mipsr3000, mipsr4000, mipsr10000, etc.
        !           727:      "hinv -c processor" gives lots of information on Irix.  Standard
        !           728:      config.guess appends "el" to indicate endianness, but
        !           729:      <code>AC_C_BIGENDIAN</code> seems the best way to handle that for GMP.
        !           730: <li> PowerPC: The function descriptor nonsense for AIX is currently driven by
        !           731:      <code>*-*-aix*</code>.  It might be more reliable to do some sort of
        !           732:      feature test, examining the compiler output perhaps.  It might also be
        !           733:      nice to merge the aix.m4 files into powerpc-defs.m4.
        !           734: <li> Sparc: <code>config.guess</code> recognises various exact sparcs, make
        !           735:      use of that information in <code>configure</code> (work on this is in
        !           736:      progress).
        !           737: <li> Sparc32: floating point or integer <code>udiv</code> should be selected
        !           738:      according to the CPU target.  Currently floating point ends up being
        !           739:      used on all sparcs, which is probably not right for generic V7 and V8.
        !           740: <li> Sparc: The use of <code>-xtarget=native</code> with <code>cc</code> is
        !           741:      incorrect when cross-compiling, the target should be set according to the
        !           742:      configured <code>$host</code> CPU.
        !           743: <li> m68k: config.guess can detect 68000, 68010, CPU32 and 68020, but relies
        !           744:      on system information for 030, 040 and 060.  Can they be identified by
        !           745:      running some code?
        !           746: <li> m68k: gas 2.11.90.0.1 pads with zero bytes in text segments, which is not
        !           747:      valid code.  Probably need <code>.balignw &lt;n&gt;,0x4e7f</code> to get
        !           748:      nops, if <code>ALIGN</code> is going to be used for anything that's
        !           749:      executed across.
        !           750: <li> Some CPUs have <code>umul</code> and <code>udiv</code> code not being
        !           751:      used.  Check all such for bit rot and then put umul and udiv in
        !           752:      <code>$gmp_mpn_functions_optional</code> as "standard optional" objects.
        !           753:      <br> In particular Sparc and SparcV8 on non-gcc should benefit from
        !           754:      umul.asm enabled; the generic umul is suspected to be making sqr_basecase
        !           755:      slower than mul_basecase.
        !           756: <li> HPPA <code>mpn_umul_ppmm</code> and <code>mpn_udiv_qrnnd</code> have a
        !           757:      different parameter order than those functions on other CPUs.  It might
        !           758:      avoid confusion to have them under different names, maybe
        !           759:      <code>mpn_umul_ppmm_r</code> or some such.  Prototypes then wouldn't
        !           760:      be conditionalized, and the appropriate form could be selected with the
        !           761:      <code>HAVE_NATIVE</code> scheme if/when the code switches to use a
        !           762:      <code>PROLOGUE</code> style.
        !           763: <li> <code>DItype</code>: The setup in gmp-impl.h for non-GCC could use an
        !           764:      autoconf test to determine whether <code>long long</code> is available.
        !           765: <li> m88k: Make the assembler code work on non-underscore systems.  Conversion
        !           766:      to .asm would be desirable.  Ought to be easy, but would want to be
        !           767:      tested.
        !           768: <li> z8k: The use of a 32-bit limb in mpn/z8000x as opposed to 16-bits in
        !           769:      mpn/z8000 could be an ABI choice.  But this chip is obsolete and nothing
        !           770:      is likely to be done unless someone is actively using it.
        !           771: <li> config.m4 is generated only by the configure script, it won't be
        !           772:      regenerated by config.status.  Creating it as an <code>AC_OUTPUT</code>
        !           773:      would work, but it might upset "make" to have things like <code>L$</code>
        !           774:      get into the Makefiles through <code>AC_SUBST</code>.
        !           775:      <code>AC_CONFIG_COMMANDS</code> would be the alternative.  With some
        !           776:      careful m4 quoting the <code>changequote</code> calls might not be
        !           777:      needed, which might free up the order in which things had to be output.
        !           778: <li> <code>make distclean</code>: Only the mpn directory links which were
        !           779:      created are removed, but perhaps all possible links should be removed, in
        !           780:      case someone runs configure a second time without a
        !           781:      <code>distclean</code> in between.  The only tricky part would be making
        !           782:      sure all possible <code>extra_functions</code> are covered.
        !           783: <li> MinGW: Apparently a Cygwin version of gcc can be used by passing
        !           784:      <code>-mno-cygwin</code>.  For <code>--host=*-*-mingw32*</code> it might
        !           785:      be convenient to automatically use that option, if it works.  Needs
        !           786:      someone with a dual cygwin/mingw setup to test.
        !           787: <li> Automake: Latest automake has a <code>CCAS</code>, <code>CCASFLAGS</code>
        !           788:      scheme.  Though we probably wouldn't be using its assembler support we
        !           789:      could try to use those variables in compatible ways.
1.1       maekawa   790: </ul>
                    791:
1.1.1.2 ! ohara     792:
        !           793: <h4>Random Numbers</h4>
        !           794: <ul>
        !           795: <li> <code>_gmp_rand</code> is not particularly fast on the linear
        !           796:      congruential algorithm and could stand various improvements.
        !           797:      <ul>
        !           798:      <li> Make a second seed area within <code>gmp_randstate_t</code> (or
        !           799:           <code>_mp_algdata</code> rather) to save some copying.
        !           800:      <li> Make a special case for a single limb <code>2exp</code> modulus, to
        !           801:           avoid <code>mpn_mul</code> calls.  Perhaps the same for two limbs.
        !           802:      <li> Inline the <code>lc</code> code, to avoid a function call and
        !           803:           <code>TMP_ALLOC</code> for every chunk.
        !           804:      <li> The special case for <code>seedn==0</code> will be very rarely used,
        !           805:           and on that basis seems unnecessary.
        !           806:      <li> Perhaps the <code>2exp</code> and general LC cases should be split,
        !           807:           for clarity (if the general case is retained).
        !           808:      </ul>
        !           809: <li> <code>gmp_randinit_mers</code> for a Mersenne Twister generator.  It's
        !           810:      likely to be more random and about the same speed as Knuth's 55-element
        !           811:      Fibonacci generator, and can probably become the default.  Pedro Gimeno
        !           812:      has started on this.
        !           813: <li> <code>gmp_randinit_lc</code>: Finish or remove.  Doing a division for
        !           814:      every every step won't be very fast, so check whether the usefulness of
        !           815:      this algorithm can be justified.  (Consensus is that it's not useful and
        !           816:      can be removed.)
        !           817: <li> Blum-Blum-Shub: Finish or remove.  A separate
        !           818:      <code>gmp_randinit_bbs</code> would be wanted, not the currently
        !           819:      commented out case in <code>gmp_randinit</code>.
        !           820: <li> <code>_gmp_rand</code> could be done as a function pointer within
        !           821:      <code>gmp_randstate_t</code> (or rather in the <code>_mp_algdata</code>
        !           822:      part), instead of switching on a <code>gmp_randalg_t</code>.  Likewise
        !           823:      <code>gmp_randclear</code>, and perhaps <code>gmp_randseed</code> if it
        !           824:      became algorithm-specific.  This would be more modular, and would ensure
        !           825:      only code for the desired algorithms is dragged into the link.
        !           826: <li> <code>mpz_urandomm</code> should do something for n&lt;=0, but what?
        !           827: <li> <code>mpz_urandomm</code> implementation looks like it could be improved.
        !           828:      Perhaps it's enough to calculate <code>nbits</code> as ceil(log2(n)) and
        !           829:      call <code>_gmp_rand</code> until a value <code>&lt;n</code> is obtained.
        !           830: <li> <code>gmp_randstate_t</code> used for parameters perhaps should become
        !           831:      <code>gmp_randstate_ptr</code> the same as other types.
        !           832: <li> Some of the empirical randomness tests could be included in a "make
        !           833:      check".  They ought to work everywhere, for a given seed at least.
        !           834: </ul>
1.1       maekawa   835:
                    836:
                    837: <h4>Miscellaneous</h4>
                    838: <ul>
                    839: <li> Make <code>mpz_div</code> and <code>mpz_divmod</code> use rounding
                    840:      analogous to <code>mpz_mod</code>.  Document, and list as an
                    841:      incompatibility.
1.1.1.2 ! ohara     842: <li> <code>mpz_gcdext</code> and <code>mpn_gcdext</code> ought to document
        !           843:      what range of values the generated cofactors can take, and preferably
        !           844:      ensure the definition uniquely specifies the cofactors for given inputs.
        !           845:      A basic extended Euclidean algorithm or multi-step variant leads to
        !           846:      |x|&lt;|b| and |y|&lt;|a| or something like that, but there's probably
        !           847:      two solutions under just those restrictions.
        !           848: <li> <code>mpz_invert</code> should call <code>mpn_gcdext</code> directly.
        !           849: <li> demos/factorize.c: use <code>mpz_divisible_ui_p</code> rather than
        !           850:      <code>mpz_tdiv_qr_ui</code>.  (Of course dividing multiple primes at a
        !           851:      time would be better still.)
        !           852: <li> The various test programs use quite a bit of the main
        !           853:      <code>libgmp</code>.  This establishes good cross-checks, but it might be
        !           854:      better to use simple reference routines where possible.  Where it's not
        !           855:      possible some attention could be paid to the order of the tests, so a
        !           856:      <code>libgmp</code> routine is only used for tests once it seems to be
        !           857:      good.
        !           858: <li> <code>mpf_set_q</code> is very similar to <code>mpf_div</code>, it'd be
        !           859:      good for the two to share code.  Perhaps <code>mpf_set_q</code> should
        !           860:      make some <code>mpf_t</code> aliases for its numerator and denominator
        !           861:      and just call <code>mpf_div</code>.  Both would be simplified a good deal
        !           862:      by switching to <code>mpn_tdiv_qr</code> perhaps making them small enough
        !           863:      not to bother with sharing (especially since <code>mpf_set_q</code>
        !           864:      wouldn't need to watch out for overlaps).
        !           865: <li> PowerPC: The cpu time base registers (per <code>mftb</code> and
        !           866:      <code>mftbu</code>) could be used for the speed and tune programs.  Would
        !           867:      need to know its frequency of course.  Usually it's 1/4 of bus speed
        !           868:      (eg. 25 MHz) but some chips drive it from an external input.  Probably
        !           869:      have to measure to be sure.
        !           870: <li> <code>MUL_FFT_THRESHOLD</code> etc: the FFT thresholds should allow a
        !           871:      return to a previous k at certain sizes.  This arises basically due to
        !           872:      the step effect caused by size multiples effectively used for each k.
        !           873:      Looking at a graph makes it fairly clear.
        !           874: <li> <code>__gmp_doprnt_mpf</code> does a rather unattractive round-to-nearest
        !           875:      on the string returned by <code>mpf_get_str</code>.  Perhaps some variant
        !           876:      of <code>mpf_get_str</code> could be made which would better suit.
1.1       maekawa   877: </ul>
                    878:
                    879:
1.1.1.2 ! ohara     880: <h4>Aids to Development</h4>
1.1       maekawa   881: <ul>
1.1.1.2 ! ohara     882: <li> Add <code>ASSERT</code>s at the start of each user-visible mpz/mpq/mpf
        !           883:      function to check the validity of each <code>mp?_t</code> parameter, in
        !           884:      particular to check they've been <code>mp?_init</code>ed.  This might
        !           885:      catch elementary mistakes in user programs.  Care would need to be taken
        !           886:      over <code>MPZ_TMP_INIT</code>ed variables used internally.  If nothing
        !           887:      else then consistency checks like size&lt;=alloc, ptr not
        !           888:      <code>NULL</code> and ptr+size not wrapping around the address space,
        !           889:      would be possible.  A more sophisticated scheme could track
        !           890:      <code>_mp_d</code> pointers and ensure only a valid one is used.  Such a
        !           891:      scheme probably wouldn't be reentrant, not without some help from the
        !           892:      system.
        !           893: <li> tune/time.c could try to determine at runtime whether
        !           894:      <code>getrusage</code> and <code>gettimeofday</code> are reliable.
        !           895:      Currently we pretend in configure that the dodgy m68k netbsd 1.4.1
        !           896:      <code>getrusage</code> doesn't exist.  If a test might take a long time
        !           897:      to run then perhaps cache the result in a file somewhere.
1.1       maekawa   898: </ul>
                    899:
                    900:
                    901: <h4>Documentation</h4>
                    902: <ul>
                    903: <li> Document conventions, like that <code> unsigned long int</code> is used for
                    904:      bit counts/ranges, and that <code>mp_size_t</code> is used for limb counts.
                    905: <li> <code>mpz_inp_str</code> (etc) doesn't say when it stops reading digits.
                    906: </ul>
                    907:
                    908:
1.1.1.2 ! ohara     909: <h4>Bright Ideas</h4>
        !           910:
        !           911: The following may or may not be feasible, and aren't likely to get done in the
        !           912: near future, but are at least worth thinking about.
        !           913:
        !           914: <ul>
        !           915: <li> Reorganize longlong.h so that we can inline the operations even for the
        !           916:      system compiler.  When there is no such compiler feature, make calls to
        !           917:      stub functions.  Write such stub functions for as many machines as
        !           918:      possible.
        !           919: <li> longlong.h could declare when it's using, or would like to use,
        !           920:      <code>mpn_umul_ppmm</code>, and the corresponding umul.asm file could be
        !           921:      included in libgmp only in that case, the same as is effectively done for
        !           922:      <code>__clz_tab</code>.  Likewise udiv.asm and perhaps cntlz.asm.  This
        !           923:      would only be a very small space saving, so perhaps not worth the
        !           924:      complexity.
        !           925: <li> longlong.h could be built at configure time by concatenating or
        !           926:      #including fragments from each directory in the mpn path.  This would
        !           927:      select CPU specific macros the same way as CPU specific assembler code.
        !           928:      Code used would no longer depend on cpp predefines, and the current
        !           929:      nested conditionals could be flattened out.
        !           930: <li> <code>mpz_get_si</code> returns 0x80000000 for -0x100000000, whereas it's
        !           931:      sort of supposed to return the low 31 (or 63) bits.  But this is
        !           932:      undocumented, and perhaps not too important.
        !           933: <li> <code>mpz_*_ui</code> division routines currently return abs(a%b).
        !           934:      Perhaps make them return the real remainder instead?  Return type would
        !           935:      be <code>signed long int</code>.  But this would be an incompatible
        !           936:      change, so it might have to be under newly named functions.
        !           937: <li> <code>mpz_init_set*</code> and <code>mpz_realloc</code> could allocate
        !           938:      say an extra 16 limbs over what's needed, so as to reduce the chance of
        !           939:      having to do a reallocate if the <code>mpz_t</code> grows a bit more.
        !           940:      This could only be an option, since it'd badly bloat memory usage in
        !           941:      applications using many small values.
        !           942: <li> <code>mpq</code> functions could perhaps check for numerator or
        !           943:      denominator equal to 1, on the assumption that integers or
        !           944:      denominator-only values might be expected to occur reasonably often.
        !           945: <li> <code>count_trailing_zeros</code> is used on more or less uniformly
        !           946:      distributed numbers in a couple of places.  For some CPUs
        !           947:      <code>count_trailing_zeros</code> is slow and it's probably worth handling
        !           948:      the frequently occurring 0 to 2 trailing zeros cases specially.
        !           949: <li> <code>mpf_t</code> might like to let the exponent be undefined when
        !           950:      size==0, instead of requiring it 0 as now.  It should be possible to do
        !           951:      size==0 tests before paying attention to the exponent.  The advantage is
        !           952:      not needing to set exp in the various places a zero result can arise,
        !           953:      which avoids some tedium but is otherwise perhaps not too important.
        !           954:      Currently <code>mpz_set_f</code> and <code>mpf_cmp_ui</code> depend on
        !           955:      exp==0, maybe elsewhere too.
        !           956: <li> <code>__gmp_allocate_func</code>: Could use GCC <code>__attribute__
        !           957:      ((malloc))</code> on this, though don't know if it'd do much.  GCC 3.0
        !           958:      allows that attribute on functions, but not function pointers (see info
        !           959:      node "Attribute Syntax"), so would need a new autoconf test.  This can
        !           960:      wait until there's a GCC that supports it.
        !           961: <li> <code>mpz_add_ui</code> contains two <code>__GMPN_COPY</code>s, one from
        !           962:      <code>mpn_add_1</code> and one from <code>mpn_sub_1</code>.  If those two
        !           963:      routines were opened up a bit maybe that code could be shared.  When a
        !           964:      copy needs to be done there's no carry to append for the add, and if the
        !           965:      copy is non-empty no high zero for the sub. <br> An alternative would be
        !           966:      to do a copy at the start and then an in-place add or sub.  Obviously
        !           967:      that duplicates the fetches and stores for carry propagation, but that's
        !           968:      normally only one or two limbs.  The same applies to <code>mpz_add</code>
        !           969:      when one operand is longer than the other, and to <code>mpz_com</code>
        !           970:      since it's just -(x+1).
        !           971: <li> <code>restrict</code>'ed pointers: Does the C99 definition of restrict
        !           972:      (one writer many readers, or whatever it is) suit the GMP style "same or
        !           973:      separate" function parameters?  If so, judicious use might improve the
        !           974:      code generated a bit.  Do any compilers have their own flavour of
        !           975:      restrict as "completely unaliased", and is that still usable?
        !           976: <li> 68000: A 16-bit limb might suit 68000 better than 32-bits, since the
        !           977:      native multiply is only 16x16.  Could have this as an <code>ABI</code>
        !           978:      option, selecting <code>_SHORT_LIMB</code> in gmp.h.  Naturally a new set
        !           979:      of asm subroutines would be necessary.  Would need new
        !           980:      <code>mpz_set_ui</code> etc since the current code assumes limb&gt;=long,
        !           981:      but 2-limb operand forms would find a use for <code>long long</code> on
        !           982:      other processors too.
        !           983: <li> Nx1 remainders can be taken at multiplier throughput speed by
        !           984:      pre-calculating an array "p[i] = 2^(i*<code>BITS_PER_MP_LIMB</code>) mod
        !           985:      m", then for the input limbs x calculating an inner product "sum
        !           986:      p[i]*x[i]", and a final 3x1 limb remainder mod m.  If those powers take
        !           987:      roughly N divide steps to calculate then there'd be an advantage any time
        !           988:      the same m is used three or more times.  Suggested by Victor Shoup in
        !           989:      connection with chinese-remainder style decompositions, but perhaps with
        !           990:      other uses.
        !           991: </ul>
        !           992: <hr>
1.1       maekawa   993:
                    994: </body>
                    995: </html>
1.1.1.2 ! ohara     996:
        !           997: <!--
        !           998: Local variables:
        !           999: eval: (add-hook 'write-file-hooks 'time-stamp)
        !          1000: time-stamp-start: "This file current as of "
        !          1001: time-stamp-format: "%:d %3b %:y"
        !          1002: time-stamp-end: "\\."
        !          1003: time-stamp-line-limit: 50
        !          1004: End:
        !          1005: -->
FreeBSD-CVSweb <freebsd-cvsweb@FreeBSD.org>