Annotation of OpenXM_contrib/gmp/doc/tasks.html, Revision 1.1.1.2
1.1 maekawa 1: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
2: <html>
3: <head>
4: <title>
5: GMP Itemized Development Tasks
6: </title>
7: </head>
8: <body bgcolor=lightgreen>
9:
10: <center>
11: <h1>
12: GMP Itemized Development Tasks
13: </h1>
14: </center>
15:
1.1.1.2 ! ohara 16: <font size=-1>
! 17: Copyright 2000, 2001, 2002 Free Software Foundation, Inc. <br><br>
! 18: This file is part of the GNU MP Library. <br><br>
! 19: The GNU MP Library is free software; you can redistribute it and/or modify
! 20: it under the terms of the GNU Lesser General Public License as published
! 21: by the Free Software Foundation; either version 2.1 of the License, or (at
! 22: your option) any later version. <br><br>
! 23: The GNU MP Library is distributed in the hope that it will be useful, but
! 24: WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
! 25: or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public
! 26: License for more details. <br><br>
! 27: You should have received a copy of the GNU Lesser General Public License
! 28: along with the GNU MP Library; see the file COPYING.LIB. If not, write to
! 29: the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston,
! 30: MA 02111-1307, USA.
! 31: </font>
! 32:
! 33: <hr>
! 34: <!-- NB. timestamp updated automatically by emacs -->
1.1 maekawa 35: <comment>
1.1.1.2 ! ohara 36: This file current as of 20 May 2002. An up-to-date version is available at
1.1 maekawa 37: <a href="http://www.swox.com/gmp/tasks.html">http://www.swox.com/gmp/tasks.html</a>.
1.1.1.2 ! ohara 38: Please send comments about this page to
! 39: <a href="mailto:bug-gmp@gnu.org">bug-gmp@gnu.org</a>.
1.1 maekawa 40: </comment>
41:
1.1.1.2 ! ohara 42: <p> These are itemized GMP development tasks. Not all the tasks
1.1 maekawa 43: listed here are suitable for volunteers, but many of them are.
44: Please see the <a href="projects.html">projects file</a> for more
45: sizeable projects.
46:
47: <h4>Correctness and Completeness</h4>
48: <ul>
49: <li> The various reuse.c tests need to force reallocation by calling
50: <code>_mpz_realloc</code> with a small (1 limb) size.
1.1.1.2 ! ohara 51: <li> One reuse case is missing from mpX/tests/reuse.c:
! 52: <code>mpz_XXX(a,a,a)</code>.
! 53: <li> When printing <code>mpf_t</code> numbers with exponents >2^53 on
! 54: machines with 64-bit <code>mp_exp_t</code>, the precision of
1.1 maekawa 55: <code>__mp_bases[base].chars_per_bit_exactly</code> is insufficient and
1.1.1.2 ! ohara 56: <code>mpf_get_str</code> aborts. Detect and compensate. Alternately,
! 57: think seriously about using some sort of fixed-point integer value.
! 58: Avoiding unnecessary floating point is probably a good thing in general,
! 59: and it might be faster on some CPUs.
1.1 maekawa 60: <li> Make the string reading functions allow the `0x' prefix when the base is
61: explicitly 16. They currently only allow that prefix when the base is
1.1.1.2 ! ohara 62: unspecified (zero).
1.1 maekawa 63: <li> <code>mpf_eq</code> is not always correct, when one operand is
64: 1000000000... and the other operand is 0111111111..., i.e., extremely
65: close. There is a special case in <code>mpf_sub</code> for this
66: situation; put similar code in <code>mpf_eq</code>.
1.1.1.2 ! ohara 67: <li> <code>mpf_eq</code> doesn't implement what gmp.texi specifies. It should
! 68: not use just whole limbs, but partial limbs.
! 69: <li> <code>mpf_set_str</code> doesn't validate it's exponent, for instance
! 70: garbage 123.456eX789X is accepted (and an exponent 0 used), and overflow
! 71: of a <code>long</code> is not detected.
! 72: <li> <code>mpf_add</code> doesn't check for a carry from truncated portions of
! 73: the inputs, and in that respect doesn't implement the "infinite precision
! 74: followed by truncate" specified in the manual.
! 75: <li> <code>mpf_div</code> of x/x doesn't always give 1, reported by Peter
! 76: Moulder. Perhaps it suffices to put +1 on the effective divisor prec, so
! 77: that data bits rather than zeros are shifted in when normalizing. Would
! 78: prefer to switch to <code>mpn_tdiv_qr</code>, where all shifting should
! 79: disappear.
! 80: <li> Windows DLLs: tests/mpz/reuse.c and tests/mpf/reuse.c initialize global
! 81: variables with pointers to <code>mpz_add</code> etc, which doesn't work
! 82: when those routines are coming from a DLL (because they're effectively
! 83: function pointer global variables themselves). Need to rearrange perhaps
! 84: to a set of calls to a test function rather than iterating over an array.
! 85: <li> demos/pexpr.c: The local variables in <code>main</code> might be
! 86: clobbered by the <code>longjmp</code>.
1.1 maekawa 87: </ul>
88:
89:
90:
91: <h4>Machine Independent Optimization</h4>
92: <ul>
1.1.1.2 ! ohara 93: <li> <code>mpn_gcdext</code>, <code>mpz_get_d</code>,
! 94: <code>mpf_get_str</code>: Don't test <code>count_leading_zeros</code> for
! 95: zero, instead check the high bit of the operand and avoid invoking
! 96: <code>count_leading_zeros</code>. This is an optimization on all
! 97: machines, and significant on machines with slow
! 98: <code>count_leading_zeros</code>, though it's possible an already
! 99: normalized operand might not be encountered very often.
1.1 maekawa 100: <li> Rewrite <code>umul_ppmm</code> to use floating-point for generating the
101: most significant limb (if <code>BITS_PER_MP_LIMB</code> <= 52 bits).
102: (Peter Montgomery has some ideas on this subject.)
103: <li> Improve the default <code>umul_ppmm</code> code in longlong.h: Add partial
104: products with fewer operations.
1.1.1.2 ! ohara 105: <li> Consider inlining <code>mpz_set_ui</code>. This would be both small and
! 106: fast, especially for compile-time constants, but would make application
! 107: binaries depend on having 1 limb allocated to an <code>mpz_t</code>,
! 108: preventing the "lazy" allocation scheme below.
! 109: <li> Consider inlining <code>mpz_[cft]div_ui</code> and maybe
! 110: <code>mpz_[cft]div_r_ui</code>. A <code>__gmp_divide_by_zero</code>
! 111: would be needed for the divide by zero test, unless that could be left to
! 112: <code>mpn_mod_1</code> (not sure currently whether all the risc chips
! 113: provoke the right exception there if using mul-by-inverse).
! 114: <li> Consider inlining: <code>mpz_fits_s*_p</code>. The setups for
! 115: <code>LONG_MAX</code> etc would need to go into gmp.h, and on Cray it
! 116: might, unfortunately, be necessary to forcibly include <limits.h>
! 117: since there's no apparent way to get <code>SHRT_MAX</code> with an
! 118: expression (since <code>short</code> and <code>unsigned short</code> can
! 119: be different sizes).
1.1 maekawa 120: <li> <code>mpz_powm</code> and <code>mpz_powm_ui</code> aren't very
121: fast on one or two limb moduli, due to a lot of function call
122: overheads. These could perhaps be handled as special cases.
123: <li> <code>mpz_powm</code> and <code>mpz_powm_ui</code> want better
124: algorithm selection, and the latter should use REDC. Both could
125: change to use an <code>mpn_powm</code> and <code>mpn_redc</code>.
1.1.1.2 ! ohara 126: <li> <code>mpz_powm</code> REDC should do multiplications by <code>g[]</code>
! 127: using the division method when they're small, since the REDC form of a
! 128: small multiplier is normally a full size product. Probably would need a
! 129: new tuned parameter to say what size multiplier is "small", as a function
! 130: of the size of the modulus.
! 131: <li> <code>mpz_powm</code> REDC should handle even moduli if possible. Maybe
! 132: this would mean for m=n*2^k doing mod n using REDC and an auxiliary
! 133: calculation mod 2^k, then putting them together at the end.
1.1 maekawa 134: <li> <code>mpn_gcd</code> might be able to be sped up on small to
135: moderate sizes by improving <code>find_a</code>, possibly just by
136: providing an alternate implementation for CPUs with slowish
137: <code>count_leading_zeros</code>.
1.1.1.2 ! ohara 138: <li> Toom3 <code>USE_MORE_MPN</code> could use a low to high cache localized
! 139: evaluate and interpolate. The necessary <code>mpn_divexact_by3c</code>
! 140: exists.
1.1 maekawa 141: <li> <code>mpn_mul_basecase</code> on NxM with big N but small M could try for
142: better cache locality by taking N piece by piece. The current code could
143: be left available for CPUs without caching. Depending how karatsuba etc
144: is applied to unequal size operands it might be possible to assume M is
145: always smallish.
1.1.1.2 ! ohara 146: <li> <code>mpn_perfect_square_p</code> on small operands might be better off
! 147: skipping the residue tests and just taking a square root.
! 148: <li> <code>mpz_perfect_power_p</code> could be improved in a number of ways.
! 149: Test for Nth power residues modulo small primes like
! 150: <code>mpn_perfect_square_p</code> does. Use p-adic arithmetic to find
! 151: possible roots. Divisibility by other primes should be tested by
! 152: grouping into a limb like <code>PP</code>.
! 153: <li> <code>mpz_perfect_power_p</code> might like to use <code>mpn_gcd_1</code>
! 154: instead of a private GCD routine. The use it's put to isn't
! 155: time-critical, and it might help be ensure correctness to use the main GCD
! 156: routine.
! 157: <li> <code>mpz_perfect_power_p</code> could use
! 158: <code>mpz_divisible_ui_p</code> instead of <code>mpz_tdiv_ui</code> for
! 159: divisibility testing, the former is faster on a number of systems. (But
! 160: all that prime test stuff is going to be rewritten some time.)
! 161: <li> Change <code>PP</code>/<code>PP_INVERTED</code> into an array of such
! 162: pairs, listing several hundred primes. Perhaps actually make the
! 163: products larger than one limb each.
! 164: <li> <code>PP</code> can have factors of 2 introduced in order to get the high
! 165: bit set and therefore a <code>PP_INVERTED</code> existing. The factors
! 166: of 2 don't affect the way the remainder r = a % ((x*y*z)*2^n) is used,
! 167: further remainders r%x, r%y, etc, are the same since x, y, etc are odd.
! 168: The advantage of this is that <code>mpn_preinv_mod_1</code> can then be
! 169: used if it's faster than plain <code>mpn_mod_1</code>. This would be a
! 170: change only for 16-bit limbs, all the rest already have <code>PP</code>
! 171: in the right form.
! 172: <li> <code>PP</code> could have extra factors of 3 or 5 or whatever introduced
! 173: if they fit, and final remainders mod 9 or 25 or whatever used, thereby
! 174: making more efficient use of the <code>mpn_mod_1</code> done. On a
! 175: 16-bit limb it looks like <code>PP</code> could take an extra factor of
! 176: 3.
! 177: <li> <code>mpz_probab_prime_p</code>, <code>mpn_perfect_square_p</code> and
! 178: <code>mpz_perfect_power_p</code> could use <code>mpn_mod_34lsub1</code>
! 179: to take a remainder mod 2^24-1 or 2^48-1 and quickly get remainders mod
! 180: 3, 5, 7, 13 and 17 (factors of 2^24-1). This could either replace the
! 181: <code>PP</code> division currently done, or allow <code>PP</code> to do
! 182: larger primes, depending how many residue tests seem worthwhile before
! 183: launching into full root extractions or Miller-Rabin etc.
! 184: <li> <code>mpz_probab_prime_p</code> (and maybe others) could code the
! 185: divisibility tests like <code>n%7 == 0</code> in the form
! 186: <pre>
! 187: #define MP_LIMB_DIVISIBLE_7_P(n) \
! 188: ((n) * MODLIMB_INVERSE_7 <= MP_LIMB_T_MAX/7)
! 189: </pre>
! 190: This would help compilers which don't know how to optimize divisions by
! 191: constants, and would help current gcc (3.0) too since gcc forms a whole
! 192: remainder rather than using a modular inverse and comparing. This
! 193: technique works for any odd modulus, and with some tweaks for even moduli
! 194: too. See Granlund and Montgomery "Division By Invariant Integers"
! 195: section 9.
! 196: <li> <code>mpz_probab_prime_p</code> and <code>mpz_nextprime</code> could
! 197: offer certainty for primes up to 2^32 by using a one limb miller-rabin
! 198: test to base 2, combined with a table of actual strong pseudoprimes in
! 199: that range (2314 of them). If that table is too big then both base 2 and
! 200: base 3 tests could be done, leaving a table of 104. The test could use
! 201: REDC and therefore be a <code>modlimb_invert</code> a remainder (maybe)
! 202: then two multiplies per bit (successively dependent). Processors with
! 203: pipelined multipliers could do base 2 and 3 in parallel. Vector systems
! 204: could do a whole bunch of bases in parallel, and perhaps offer near
! 205: certainty up to 64-bits (certainty might depend on an exhaustive search
! 206: of pseudoprimes up to that limit). Obviously 2^32 is not a big number,
! 207: but an efficient and certain calculation is attractive. It might find
! 208: other uses internally, and could even be offered as a one limb prime test
! 209: <code>mpn_probab_prime_1_p</code> or <code>gmp_probab_prime_ui_p</code>
! 210: perhaps.
! 211: <li> <code>mpz_probab_prime_p</code> doesn't need to make a copy of
! 212: <code>n</code> when the input is negative, it can setup an
! 213: <code>mpz_t</code> alias, same data pointer but a positive size. With no
! 214: need to clear before returning, the recursive function call could be
! 215: dispensed with too.
! 216: <li> <code>mpf_set_str</code> produces low zero limbs when a string has a
! 217: fraction but is exactly representable, eg. 0.5 in decimal. These could be
! 218: stripped to save work in later operations.
! 219: <li> <code>mpz_and</code>, <code>mpz_ior</code> and <code>mpz_xor</code> should
! 220: use <code>mpn_and_n</code> etc for the benefit of the small number of
! 221: targets with native versions of those routines. Need to be careful not to
! 222: pass size==0. Is some code sharing possible between the <code>mpz</code>
! 223: routines?
! 224: <li> <code>mpf_add</code>: Don't do a copy to avoid overlapping operands
! 225: unless it's really necessary (currently only sizes are tested, not
! 226: whether r really is u or v).
! 227: <li> <code>mpf_add</code>: Under the check for v having no effect on the
! 228: result, perhaps test for r==u and do nothing in that case, rather than
! 229: currently it looks like an <code>MPN_COPY_INCR</code> will be done to
! 230: reduce prec+1 limbs to prec.
! 231: <li> <code>mpn_divrem_2</code> could usefully accept unnormalized divisors and
! 232: shift the dividend on-the-fly, since this should cost nothing on
! 233: superscalar processors and avoid the need for temporary copying in
! 234: <code>mpn_tdiv_qr</code>.
! 235: <li> <code>mpf_sqrt_ui</code> calculates prec+1 limbs, whereas just prec would
! 236: satisfy the application requested precision. It should suffice to simply
! 237: reduce the rsize temporary to 2*prec-1 limbs. <code>mpf_sqrt</code>
! 238: might be similar.
! 239: <li> <code>invert_limb</code> generic C: The division could use dividend
! 240: b*(b-d)-1 which is high:low of (b-1-d):(b-1), instead of the current
! 241: (b-d):0, where b=2^<code>BITS_PER_MP_LIMB</code> and d=divisor. The
! 242: former is per the original paper and is used in the x86 code, the
! 243: advantage is that the current special case for 0x80..00 could be dropped.
! 244: The two should be equivalent, but a little check of that would be wanted.
! 245: <li> <code>mpq_cmp_ui</code> could form the <code>num1*den2</code> and
! 246: <code>num2*den1</code> products limb-by-limb from high to low and look at
! 247: each step for values differing by more than the possible carry bit from
! 248: the uncalculated portion.
! 249: <li> <code>mpq_cmp</code> could do the same high-to-low progressive multiply
! 250: and compare. The benefits of karatsuba and higher multiplication
! 251: algorithms are lost, but if it's assumed only a few high limbs will be
! 252: needed to determine an order then that's fine.
! 253: <li> <code>mpn_add_1</code>, <code>mpn_sub_1</code>, <code>mpn_add</code>,
! 254: <code>mpn_sub</code>: Internally use <code>__GMPN_ADD_1</code> etc
! 255: instead of the functions, so they get inlined on all compilers, not just
! 256: gcc and others with <code>inline</code> recognised in gmp.h.
! 257: <code>__GMPN_ADD_1</code> etc are meant mostly to support application
! 258: inline <code>mpn_add_1</code> etc and if they don't come out good for
! 259: internal uses then special forms can be introduced, for instance many
! 260: internal uses are in-place. Sometimes a block of code is executed based
! 261: on the carry-out, rather than using it arithmetically, and those places
! 262: might want to do their own loops entirely.
! 263: <li> <code>__gmp_extract_double</code> on 64-bit systems could use just one
! 264: bitfield for the mantissa extraction, not two, when endianness permits.
! 265: Might depend on the compiler allowing <code>long long</code> bit fields
! 266: when that's the only actual 64-bit type.
! 267: <li> <code>mpf_get_d</code> could be more like <code>mpz_get_d</code> and do
! 268: more in integers and give the float conversion as such a chance to round
! 269: in its preferred direction. Some code sharing ought to be possible. Or
! 270: if nothing else then for consistency the two ought to give identical
! 271: results on integer operands (not clear if this is so right now).
! 272: <li> <code>usqr_ppm</code> or some such could do a widening square in the
! 273: style of <code>umul_ppmm</code>. This would help 68000, and be a small
! 274: improvement for the generic C (which is used on UltraSPARC/64 for
! 275: instance). GCC recognises the generic C ul*vh and vl*uh are identical,
! 276: but does two separate additions to the rest of the result.
! 277: <li> tal-notreent.c could keep a block of memory permanently allocated.
! 278: Currently the last nested <code>TMP_FREE</code> releases all memory, so
! 279: there's an allocate and free every time a top-level function using
! 280: <code>TMP</code> is called. Would need
! 281: <code>mp_set_memory_functions</code> to tell tal-notreent.c to release
! 282: any cached memory when changing allocation functions though.
! 283: <li> <code>__gmp_tmp_alloc</code> from tal-notreent.c could be partially
! 284: inlined. If the current chunk has enough room then a couple of pointers
! 285: can be updated. Only if more space is required then a call to some sort
! 286: of <code>__gmp_tmp_increase</code> would be needed. The requirement that
! 287: <code>TMP_ALLOC</code> is an expression might make the implementation a
! 288: bit ugly and/or a bit sub-optimal.
! 289: <pre>
! 290: #define TMP_ALLOC(n)
! 291: ((ROUND_UP(n) > current->end - current->point ?
! 292: __gmp_tmp_increase (ROUND_UP (n)) : 0),
! 293: current->point += ROUND_UP (n),
! 294: current->point - ROUND_UP (n))
! 295: </pre>
! 296: <li> <code>__mp_bases</code> has a lot of data for bases which are pretty much
! 297: never used. Perhaps the table should just go up to base 16, and have
! 298: code to generate data above that, if and when required. Naturally this
! 299: assumes the code would be smaller than the data saved.
! 300: <li> <code>__mp_bases</code> field <code>big_base_inverted</code> is only used
! 301: if <code>USE_PREINV_DIVREM_1</code> is true, and could be omitted
! 302: otherwise, to save space.
! 303: <li> Make <code>mpf_get_str</code> and <code>mpf_set_str</code> call the
! 304: corresponding, much faster, mpn functions.
! 305: <li> <code>mpn_mod_1</code> could pre-calculate values of R mod N, R^2 mod N,
! 306: R^3 mod N, etc, with R=2^<code>BITS_PER_MP_LIMB</code>, and use them to
! 307: process multiple limbs at each step by multiplying. Suggested by Peter
! 308: L. Montgomery.
! 309: <li> <code>mpz_get_str</code>, <code>mtox</code>: For power-of-2 bases, which
! 310: are of course fast, it seems a little silly to make a second pass over
! 311: the <code>mpn_get_str</code> output to convert to ASCII. Perhaps combine
! 312: that with the bit extractions.
! 313: <li> <code>mpz_gcdext</code>: If the caller requests only the S cofactor (of
! 314: A), and A<B, then the code ends up generating the cofactor T (of B) and
! 315: deriving S from that. Perhaps it'd be possible to arrange to get S in
! 316: the first place by calling <code>mpn_gcdext</code> with A+B,B. This
! 317: might only be an advantage if A and B are about the same size.
! 318: <li> <code>mpn_toom3_mul_n</code>, <code>mpn_toom3_sqr_n</code>: Temporaries
! 319: <code>B</code> and <code>D</code> are adjacent in memory and at the final
! 320: coefficient additions look like they could use a single
! 321: <code>mpn_add_n</code> of <code>l4</code> limbs rather than two of
! 322: <code>l2</code> limbs.
1.1 maekawa 323: </ul>
324:
325:
326: <h4>Machine Dependent Optimization</h4>
327: <ul>
1.1.1.2 ! ohara 328: <li> <code>udiv_qrnnd_preinv2norm</code>, the branch-free version of
! 329: <code>udiv_qrnnd_preinv</code>, might be faster on various pipelined
! 330: chips. In particular the first <code>if (_xh != 0)</code> in
! 331: <code>udiv_qrnnd_preinv</code> might be roughly a 50/50 chance and might
! 332: branch predict poorly. (The second test is probably almost always
! 333: false.) Measuring with the tuneup program would be possible, but perhaps
! 334: a bit messy. In any case maybe the default should be the branch-free
! 335: version.
! 336: <br>
! 337: Note that the current <code>udiv_qrnnd_preinv2norm</code> implementation
! 338: assumes a right shift will sign extend, which is not guaranteed by the C
! 339: standards, and doesn't happen on Cray vector systems.
1.1 maekawa 340: <li> Run the `tune' utility for more compiler/CPU combinations. We would like
341: to have gmp-mparam.h files in practically every implementation specific
342: mpn subdirectory, and repeat each *_THRESHOLD for gcc and the system
343: compiler. See the `tune' top-level directory for more information.
1.1.1.2 ! ohara 344: <pre>
! 345: #ifdef (__GNUC__)
! 346: #if __GNUC__ == 2 && __GNUC_MINOR__ == 7
! 347: ...
! 348: #endif
! 349: #if __GNUC__ == 2 && __GNUC_MINOR__ == 8
! 350: ...
! 351: #endif
! 352: #ifndef MUL_KARATSUBA_THRESHOLD
! 353: /* Default GNUC values */
! 354: ...
! 355: #endif
! 356: #else /* system compiler */
! 357: ...
! 358: #endif </pre>
! 359: <li> <code>invert_limb</code> on various processors might benefit from the
! 360: little Newton iteration done for alpha and ia64.
! 361: <li> Alpha 21264: Improve feed-in code for <code>mpn_mul_1</code>,
! 362: <code>mpn_addmul_1</code>, and <code>mpn_submul_1</code>.
! 363: <li> Alpha 21164: Rewrite <code>mpn_mul_1</code>, <code>mpn_addmul_1</code>,
! 364: and <code>mpn_submul_1</code> for the 21164. This should use both integer
1.1 maekawa 365: multiplies and floating-point multiplies. For the floating-point
366: operations, the single-limb multiplier should be split into three 21-bit
1.1.1.2 ! ohara 367: chunks, or perhaps even better in four 16-bit chunks. Probably possible
! 368: to reach 9 cycles/limb.
! 369: <li> Alpha 21264 ev67: Use <code>ctlz</code> and <code>cttz</code> for
! 370: <code>count_leading_zeros</code> and<code>count_trailing_zeros</code>.
! 371: Use inline for gcc, probably want asm files for elsewhere.
! 372: <li> ARC: gcc longlong.h sets up <code>umul_ppmm</code> to call
! 373: <code>__umulsidi3</code> in libgcc. Could be copied straight across, but
! 374: perhaps ought to be tested.
! 375: <li> ARM: On v5 cpus see if the <code>clz</code> instruction can be used for
! 376: <code>count_leading_zeros</code>.
! 377: <li> Itanium: <code>mpn_divexact_by3</code> isn't particularly important, but
! 378: the generic C runs at about 27 c/l, whereas with the multiplies off the
! 379: dependent chain about 3 c/l ought to be possible.
! 380: <li> Itanium: <code>mpn_hamdist</code> could be put together based on the
! 381: current <code>mpn_popcount</code>.
! 382: <li> Itanium: <code>popc_limb</code> in gmp-impl.h could use the
! 383: <code>popcnt</code> insn.
! 384: <li> Itanium: <code>mpn_submul_1</code> is not implemented directly, only via
! 385: a combination of <code>mpn_mul_1</code> and <code>mpn_sub_n</code>.
! 386: <li> UltraSPARC/64: Optimize <code>mpn_mul_1</code>, <code>mpn_addmul_1</code>,
! 387: for s2 < 2^32 (or perhaps for any zero 16-bit s2 chunk). Not sure how
! 388: much this can improve the speed, though, since the symmetry that we rely
! 389: on is lost. Perhaps we can just gain cycles when s2 < 2^16, or more
! 390: accurately, when two 16-bit s2 chunks which are 16 bits apart are zero.
! 391: <li> UltraSPARC/64: Write native <code>mpn_submul_1</code>, analogous to
! 392: <code>mpn_addmul_1</code>.
! 393: <li> UltraSPARC/64: Write <code>umul_ppmm</code>. Using four
! 394: "<code>mulx</code>"s either with an asm block or via the generic C code is
! 395: about 90 cycles. Try using fp operations, and also try using karatsuba
! 396: for just three "<code>mulx</code>"s.
! 397: <li> UltraSPARC/64: <code>mpn_divrem_1</code>, <code>mpn_mod_1</code>,
! 398: <code>mpn_divexact_1</code> and <code>mpn_modexact_1_odd</code> could
! 399: process 32 bits at a time when the divisor fits 32-bits. This will need
! 400: only 4 <code>mulx</code>'s per limb instead of 8 in the general case.
! 401: <li> UltraSPARC/32: Rewrite <code>mpn_lshift</code>, <code>mpn_rshift</code>.
! 402: Will give 2 cycles/limb. Trivial modifications of mpn/sparc64 should do.
! 403: <li> UltraSPARC/32: Write special mpn_Xmul_1 loops for s2 < 2^16.
! 404: <li> UltraSPARC/32: Use <code>mulx</code> for <code>umul_ppmm</code> if
! 405: possible (see commented out code in longlong.h). This is unlikely to
! 406: save more than a couple of cycles, so perhaps isn't worth bothering with.
! 407: <li> UltraSPARC/32: On Solaris gcc doesn't give us <code>__sparc_v9__</code>
! 408: or anything to indicate V9 support when -mcpu=v9 is selected. See
! 409: gcc/config/sol2-sld-64.h. Will need to pass something through from
! 410: ./configure to select the right code in longlong.h. (Currently nothing
! 411: is lost because <code>mulx</code> for multiplying is commented out.)
! 412: <li> UltraSPARC: <code>modlimb_invert</code> might save a few cycles from
! 413: masking down to just the useful bits at each point in the calculation,
! 414: since <code>mulx</code> speed depends on the highest bit set. Either
! 415: explicit masks or small types like <code>short</code> and
! 416: <code>int</code> ought to work.
! 417: <li> Sparc64 HAL R1: <code>mpn_popcount</code> and <code>mpn_hamdist</code>
! 418: could use <code>popc</code> currently commented out in gmp-impl.h. This
! 419: chip reputedly implements <code>popc</code> properly (see gcc sparc.md),
! 420: would need to recognise the chip as <code>sparchalr1</code> or something
! 421: in configure / config.sub / config.guess.
1.1 maekawa 422: <li> PA64: Improve <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
1.1.1.2 ! ohara 423: <code>mpn_mul_1</code>. The current code runs at 11 cycles/limb. It
! 424: should be possible to saturate the cache, which will happen at 8
! 425: cycles/limb (7.5 for mpn_mul_1). Write special loops for s2 < 2^32;
! 426: it should be possible to make them run at about 5 cycles/limb.
! 427: <li> PPC630: Rewrite <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
! 428: <code>mpn_mul_1</code>. Use both integer and floating-point operations,
! 429: possibly two floating-point and one integer limb per loop. Split operands
! 430: into four 16-bit chunks for fast fp operations. Should easily reach 9
! 431: cycles/limb (using one int + one fp), but perhaps even 7 cycles/limb
! 432: (using one int + two fp).
! 433: <li> PPC630: <code>mpn_rshift</code> could do the same sort of unrolled loop
! 434: as <code>mpn_lshift</code>. Some judicious use of m4 might let the two
! 435: share source code, or with a register to control the loop direction
! 436: perhaps even share object code.
! 437: <li> PowerPC-32: <code>mpn_rshift</code> should do the same sort of unrolled
! 438: loop as <code>mpn_lshift</code>.
1.1 maekawa 439: <li> Implement <code>mpn_mul_basecase</code> and <code>mpn_sqr_basecase</code>
440: for important machines. Helping the generic sqr_basecase.c with an
441: <code>mpn_sqr_diagonal</code> might be enough for some of the RISCs.
442: <li> POWER2/POWER2SC: Schedule <code>mpn_lshift</code>/<code>mpn_rshift</code>.
443: Will bring time from 1.75 to 1.25 cycles/limb.
1.1.1.2 ! ohara 444: <li> X86: Optimize non-MMX <code>mpn_lshift</code> for shifts by 1. (See
! 445: Pentium code.)
! 446: <li> X86: Good authority has it that in the past an inline <code>rep
! 447: movs</code> would upset GCC register allocation for the whole function.
! 448: Is this still true in GCC 3? It uses <code>rep movs</code> itself for
! 449: <code>__builtin_memcpy</code>. Examine the code for some simple and
! 450: complex functions to find out. Inlining <code>rep movs</code> would be
! 451: desirable, it'd be both smaller and faster.
! 452: <li> Pentium P54: <code>mpn_lshift</code> and <code>mpn_rshift</code> can come
! 453: down from 6.0 c/l to 5.5 or 5.375 by paying attention to pairing after
! 454: <code>shrdl</code> and <code>shldl</code>, see mpn/x86/pentium/README.
! 455: <li> Pentium P55 MMX: <code>mpn_lshift</code> and <code>mpn_rshift</code>
! 456: might benefit from some destination prefetching.
! 457: <li> PentiumPro: <code>mpn_divrem_1</code> might be able to use a
! 458: mul-by-inverse, hoping for maybe 30 c/l.
! 459: <li> P6: <code>mpn_add_n</code> and <code>mpn_sub_n</code> should be able to go
! 460: faster than the generic x86 code at 3.5 c/l. The athlon code for instance
! 461: runs at about 2.7.
! 462: <li> K7: <code>mpn_lshift</code> and <code>mpn_rshift</code> might be able to
! 463: do something branch-free for unaligned startups, and shaving one insn
! 464: from the loop with alternative indexing might save a cycle.
1.1 maekawa 465: <li> PPC32: Try using fewer registers in the current <code>mpn_lshift</code>.
1.1.1.2 ! ohara 466: The pipeline is now extremely deep, perhaps unnecessarily deep.
1.1 maekawa 467: <li> PPC32: Write <code>mpn_rshift</code> based on new <code>mpn_lshift</code>.
468: <li> PPC32: Rewrite <code>mpn_add_n</code> and <code>mpn_sub_n</code>. Should
1.1.1.2 ! ohara 469: run at just 3.25 cycles/limb.
1.1 maekawa 470: <li> Fujitsu VPP: Vectorize main functions, perhaps in assembly language.
471: <li> Fujitsu VPP: Write <code>mpn_mul_basecase</code> and
472: <code>mpn_sqr_basecase</code>. This should use a "vertical multiplication
473: method", to avoid carry propagation. splitting one of the operands in
474: 11-bit chunks.
1.1.1.2 ! ohara 475: <li> 68k, Pentium: <code>mpn_lshift</code> by 31 should use the special rshift
! 476: by 1 code, and vice versa <code>mpn_rshift</code> by 31 should use the
! 477: special lshift by 1. This would be best as a jump across to the other
! 478: routine, could let both live in lshift.asm and omit rshift.asm on finding
! 479: <code>mpn_rshift</code> already provided.
! 480: <li> Cray T3E: Experiment with optimization options. In particular,
! 481: -hpipeline3 seems promising. We should at least up -O to -O2 or -O3.
! 482: <li> Cray: <code>mpn_com_n</code> and <code>mpn_and_n</code> etc very probably
! 483: wants a pragma like <code>MPN_COPY_INCR</code>.
! 484: <li> Cray vector systems: <code>mpn_lshift</code>, <code>mpn_rshift</code>,
! 485: <code>mpn_popcount</code> and <code>mpn_hamdist</code> are nice and small
! 486: and could be inlined to avoid function calls.
! 487: <li> Cray: Variable length arrays seem to be faster than the tal-notreent.c
! 488: scheme. Not sure why, maybe they merely give the compiler more
! 489: information about aliasing (or the lack thereof). Would like to modify
! 490: <code>TMP_ALLOC</code> to use them, or introduce a new scheme. Memory
! 491: blocks wanted unconditionally are easy enough, those wanted only
! 492: sometimes are a problem. Perhaps a special size calculation to ask for a
! 493: dummy length 1 when unwanted, or perhaps an inlined subroutine
! 494: duplicating code under each conditional. Don't really want to turn
! 495: everything into a dog's dinner just because Cray don't offer an
! 496: <code>alloca</code>.
! 497: <li> Cray: <code>mpn_get_str</code> on power-of-2 bases ought to vectorize.
! 498: Does it? <code>bits_per_digit</code> and the inner loop over bits in a
! 499: limb might prevent it. Perhaps special cases for binary, octal and hex
! 500: would be worthwhile (very possibly for all processors too).
! 501: <li> Cray: <code>popc_limb</code> could use the Cray <code>_popc</code>
! 502: intrinsic. That would help <code>mpz_hamdist</code> and might make the
! 503: generic C versions of <code>mpn_popcount</code> and
! 504: <code>mpn_hamdist</code> suffice for Cray (if it vectorizes, or can be
! 505: given a hint to do so).
! 506: <li> 68000: <code>mpn_mul_1</code>, <code>mpn_addmul_1</code>,
! 507: <code>mpn_submul_1</code>: Check for a 16-bit multiplier and use two
! 508: multiplies per limb, not four.
! 509: <li> 68000: <code>mpn_lshift</code> and <code>mpn_rshift</code> could use a
! 510: <code>roll</code> and mask instead of <code>lsrl</code> and
! 511: <code>lsll</code>. This promises to be a speedup, effectively trading a
! 512: 6+2*n shift for one or two 4 cycle masks. Suggested by Jean-Charles
! 513: Meyrignac.
1.1 maekawa 514: <li> Improve <code>count_leading_zeros</code> for 64-bit machines:
515: <pre>
1.1.1.2 ! ohara 516: if ((x >> 32) == 0) { x <<= 32; cnt += 32; }
! 517: if ((x >> 48) == 0) { x <<= 16; cnt += 16; }
! 518: ... </pre>
! 519: <li> IRIX 6 MIPSpro compiler has an <code>__inline</code> which could perhaps
! 520: be used in <code>__GMP_EXTERN_INLINE</code>. What would be the right way
! 521: to identify suitable versions of that compiler?
! 522: <li> VAX D and G format <code>double</code> floats are straightforward and
! 523: could perhaps be handled directly in <code>__gmp_extract_double</code>
! 524: and maybe in <code>mpz_get_d</code>, rather than falling back on the
! 525: generic code. (Both formats are detected by <code>configure</code>.)
! 526: <li> <code>mpn_get_str</code> final divisions by the base with
! 527: <code>udiv_qrnd_unnorm</code> could use some sort of multiply-by-inverse
! 528: on suitable machines. This ends up happening for decimal by presenting
! 529: the compiler with a run-time constant, but the same for other bases would
! 530: be good. Perhaps use could be made of the fact base<256.
! 531: <li> <code>mpn_umul_ppmm</code>, <code>mpn_udiv_qrnnd</code>: Return a
! 532: structure like <code>div_t</code> to avoid going through memory, in
! 533: particular helping RISCs that don't do store-to-load forwarding. Clearly
! 534: this is only possible if the ABI returns a structure of two
! 535: <code>mp_limb_t</code>s in registers.
1.1 maekawa 536: </ul>
537:
538: <h4>New Functionality</h4>
539: <ul>
1.1.1.2 ! ohara 540: <li> Add in-memory versions of <code>mp?_out_raw</code> and
! 541: <code>mp?_inp_raw</code>.
! 542: <li> <code>mpz_get_nth_ui</code>. Return the nth word (not necessarily the
! 543: nth limb).
1.1 maekawa 544: <li> Maybe add <code>mpz_crr</code> (Chinese Remainder Reconstruction).
545: <li> Let `0b' and `0B' mean binary input everywhere.
1.1.1.2 ! ohara 546: <li> <code>mpz_init</code> and <code>mpq_init</code> could do lazy allocation.
! 547: Set <code>ALLOC(var)</code> to 0 to indicate nothing allocated, and let
! 548: <code>_mpz_realloc</code> do the initial alloc. Set
! 549: <code>z->_mp_d</code> to a dummy that <code>mpz_get_ui</code> and
! 550: similar can unconditionally fetch from. Niels Möller has had a go at
! 551: this.
! 552: <br>
! 553: The advantages of the lazy scheme would be:
! 554: <ul>
! 555: <li> Initial allocate would be the size required for the first value
! 556: stored, rather than getting 1 limb in <code>mpz_init</code> and then
! 557: more or less immediately reallocating.
! 558: <li> <code>mpz_init</code> would only store magic values in the
! 559: <code>mpz_t</code> fields, and could be inlined.
! 560: <li> A fixed initializer could even be used by applications, like
! 561: <code>mpz_t z = MPZ_INITIALIZER;</code>, which might be convenient
! 562: for globals.
! 563: </ul>
! 564: The advantages of the current scheme are:
! 565: <ul>
! 566: <li> <code>mpz_set_ui</code> and other similar routines needn't check the
! 567: size allocated and can just store unconditionally.
! 568: <li> <code>mpz_set_ui</code> and perhaps others like
! 569: <code>mpz_tdiv_r_ui</code> and a prospective
! 570: <code>mpz_set_ull</code> could be inlined.
! 571: </ul>
1.1 maekawa 572: <li> Add <code>mpf_out_raw</code> and <code>mpf_inp_raw</code>. Make sure
573: format is portable between 32-bit and 64-bit machines, and between
574: little-endian and big-endian machines.
1.1.1.2 ! ohara 575: <li> <code>mpn_and_n</code> ... <code>mpn_copyd</code>: Perhaps make the mpn
! 576: logops and copys available in gmp.h, either as library functions or
! 577: inlines, with the availability of library functions instantiated in the
! 578: generated gmp.h at build time.
! 579: <li> <code>mpz_set_str</code> etc variants taking string lengths rather than
! 580: null-terminators.
1.1 maekawa 581: <li> Consider changing the thresholds to apply the simpler algorithm when
582: "<code><=</code>" rather than "<code><</code>", so a threshold can
583: be set to <code>MP_SIZE_T_MAX</code> to get only the simpler code (the
584: compiler will know <code>size <= MP_SIZE_T_MAX</code> is always true).
1.1.1.2 ! ohara 585: Alternately it looks like the <code>ABOVE_THRESHOLD</code> and
! 586: <code>BELOW_THRESHOLD</code> macros can do this adequately, and also pick
! 587: up cases where a threshold of zero should mean only the second algorithm.
! 588: <li> <code>mpz_nthprime</code>.
! 589: <li> Perhaps <code>mpz_init2</code>, initializing and making initial room for
! 590: N bits. The actual size would be rounded up to a limb, and perhaps an
! 591: extra limb added since so many <code>mpz</code> routines need that on
! 592: their destination.
! 593: <li> <code>mpz_andn</code>, <code>mpz_iorn</code>, <code>mpz_nand</code>,
! 594: <code>mpz_nior</code>, <code>mpz_xnor</code> might be useful additions,
! 595: if they could share code with the current such functions (which should be
! 596: possible).
! 597: <li> <code>mpz_and_ui</code> etc might be of use sometimes. Suggested by
! 598: Niels Möller.
! 599: <li> <code>mpf_set_str</code> and <code>mpf_inp_str</code> could usefully
! 600: accept 0x, 0b etc when base==0. Perhaps the exponent could default to
! 601: decimal in this case, with a further 0x, 0b etc allowed there.
! 602: Eg. 0xFFAA@0x5A. A leading "0" for octal would match the integers, but
! 603: probably something like "0.123" ought not mean octal.
! 604: <li> <code>GMP_LONG_LONG_LIMB</code> or some such could become a documented
! 605: feature of gmp.h, so applications could know whether to
! 606: <code>printf</code> a limb using <code>%lu</code> or <code>%Lu</code>.
! 607: <li> <code>PRIdMP_LIMB</code> and similar defines following C99
! 608: <inttypes.h> might be of use to applications printing limbs.
! 609: Perhaps they should be defined only if specifically requested, the way
! 610: <inttypes.h> does. But if <code>GMP_LONG_LONG_LIMB</code> or
! 611: whatever is added then perhaps this can easily enough be left to
! 612: applications.
! 613: <li> <code>mpf_get_ld</code> and <code>mpf_set_ld</code> converting
! 614: <code>mpf_t</code> to and from <code>long double</code>. Other
! 615: <code>long double</code> routines would be desirable too, but these would
! 616: be a start. Often <code>long double</code> is the same as
! 617: <code>double</code>, which is easy but pretty pointless. Should
! 618: recognise the Intel 80-bit format on i386, and IEEE 128-bit quad on
! 619: sparc, hppa and power. Might like an ABI sub-option or something when
! 620: it's a compiler option for 64-bit or 128-bit <code>long double</code>.
! 621: <li> <code>gmp_printf</code> could accept <code>%b</code> for binary output.
! 622: It'd be nice if it worked for plain <code>int</code> etc too, not just
! 623: <code>mpz_t</code> etc.
! 624: <li> <code>gmp_printf</code> in fact could usefully accept an arbitrary base,
! 625: for both integer and float conversions. A base either in the format
! 626: string or as a parameter with <code>*</code> should be allowed. Maybe
! 627: <code>&13b</code> (b for base) or something like that.
! 628: <li> <code>gmp_printf</code> could perhaps have a type code for an
! 629: <code>mp_limb_t</code>. That would save an application from having to
! 630: worry whether it's a <code>long</code> or a <code>long long</code>.
! 631: <li> <code>gmp_printf</code> could perhaps accept <code>mpq_t</code> for float
! 632: conversions, eg. <code>"%.4Qf"</code>. This would be merely for
! 633: convenience, but still might be useful. Rounding would be the same as
! 634: for an <code>mpf_t</code> (ie. currently round-to-nearest, but not
! 635: actually documented). Alternately, perhaps a separate
! 636: <code>mpq_get_str_point</code> or some such might be more use. Suggested
! 637: by Pedro Gimeno.
! 638: <li> <code>gmp_printf</code> could usefully accept a flag to control the
! 639: rounding of float conversions. The wouldn't do much for
! 640: <code>mpf_t</code>, but would be good if <code>mpfr_t</code> was
! 641: supported in the future, or perhaps for <code>mpq_t</code>. Something
! 642: like <code>&*r</code> (r for rounding, and mpfr style
! 643: <code>GMP_RND</code> parameter).
! 644: <li> <code>mpz_combit</code> to toggle a bit would be a good companion for
! 645: <code>mpz_setbit</code> and <code>mpz_clrbit</code>. Suggested by Niels
! 646: Möller (and has done some work towards it).
! 647: <li> <code>mpz_scan0_reverse</code> or <code>mpz_scan0low</code> or some such
! 648: searching towards the low end of an integer might match
! 649: <code>mpz_scan0</code> nicely. Likewise for <code>scan1</code>.
! 650: Suggested by Roberto Bagnara.
! 651: <li> <code>mpz_bit_subset</code> or some such to test whether one integer is a
! 652: bitwise subset of another might be of use. Some sort of return value
! 653: indicating whether it's a proper or non-proper subset would be good and
! 654: wouldn't cost anything in the implementation. Suggested by Roberto
! 655: Bagnara.
! 656: <li> <code>gmp_randinit_r</code> and maybe <code>gmp_randstate_set</code> to
! 657: init-and-copy or to just copy a <code>gmp_randstate_t</code>. Suggested
! 658: by Pedro Gimeno.
! 659: <li> <code>mpf_get_ld</code>, <code>mpf_set_ld</code>: Conversions between
! 660: <code>mpf_t</code> and <code>long double</code>, suggested by Dan
! 661: Christensen. There'd be some work to be done by <code>configure</code>
! 662: to recognise the format in use. xlc on aix for instance apparently has
! 663: an option for either plain double 64-bit or quad 128-bit precision. This
! 664: might mean library contents vary with the compiler used to build, which
! 665: is undesirable. It might be possible to detect the mode the application
! 666: is compiling with, and try to avoid mismatch problems.
! 667: <li> <code>mpz_sqrt_if_perfect_square</code>: When
! 668: <code>mpz_perfect_square_p</code> does its tests it calculates a square
! 669: root and then discards it. For some applications it might be useful to
! 670: return that root. Suggested by Jason Moxham.
! 671: <li> <code>mpz_get_ull</code>, <code>mpz_set_ull</code>,
! 672: <code>mpz_get_sll</code>, <code>mpz_get_sll</code>: Conversions for
! 673: <code>long long</code>. These would aid interoperability, though a
! 674: mixture of GMP and <code>long long</code> would probably not be too
! 675: common. Disadvantages of using <code>long long</code> in libgmp.a would
! 676: be
! 677: <ul>
! 678: <li> Library contents vary according to the build compiler.
! 679: <li> gmp.h would need an ugly <code>#ifdef</code> block to decide if the
! 680: application compiler could take the <code>long long</code>
! 681: prototypes.
! 682: <li> Some sort of <code>LIBGMP_HAS_LONGLONG</code> would be wanted to
! 683: indicate whether the functions are available. (Applications using
! 684: autoconf could probe the library too.)
! 685: </ul>
! 686: It'd be possible to defer the need for <code>long long</code> to
! 687: application compile time, by having something like
! 688: <code>mpz_set_2ui</code> called with two halves of a <code>long
! 689: long</code>. Disadvantages of this would be,
! 690: <ul>
! 691: <li> Bigger code in the application, though perhaps not if a <code>long
! 692: long</code> is normally passed as two halves anyway.
! 693: <li> <code>mpz_get_ull</code> would be a rather big inline, or would have
! 694: to be two function calls.
! 695: <li> <code>mpz_get_sll</code> would be a worse inline, and would put the
! 696: treatment of <code>-0x10..00</code> into applications (see
! 697: <code>mpz_get_si</code> correctness above).
! 698: <li> Although having libgmp.a independent of the build compiler is nice,
! 699: it sort of sacrifices the capabilities of a good compiler to
! 700: uniformity with inferior ones.
! 701: </ul>
! 702: Plain use of <code>long long</code> is probably the lesser evil, if only
! 703: because it makes best use of gcc.
! 704: <li> <code>mpz_strtoz</code> parsing the same as <code>strtol</code>.
! 705: Suggested by Alexander Kruppa.
1.1 maekawa 706: </ul>
707:
708:
709: <h4>Configuration</h4>
710:
711: <ul>
1.1.1.2 ! ohara 712: <li> Floating-point format: <code>GMP_C_DOUBLE_FORMAT</code> seems to work
! 713: well. Get rid of the <code>#ifdef</code> mess in gmp-impl.h and use the
! 714: results of the test instead.
! 715: <li> a29k: umul.s and udiv.s exist but don't get used.
! 716: <li> ARM: <code>umul_ppmm</code> in longlong.h always uses <code>umull</code>,
! 717: but is that available only for M series chips or some such? Perhaps it
! 718: should be configured in some way.
! 719: <li> HPPA: config.guess should recognize 7000, 7100, 7200, and 8x00.
! 720: <li> HPPA 2.0w: gcc is rumoured to support 2.0w as of version 3, though
! 721: perhaps just as a build-time choice. In any case, figure out how to
! 722: identify a suitable gcc or put it in the right mode, for the GMP compiler
! 723: choices.
! 724: <li> IA64: Latest libtool has some nonsense to detect ELF-32 or ELF-64 on
! 725: <code>ia64-*-hpux*</code>. Does GMP need to know anything about that?
! 726: <li> Mips: config.guess should say mipsr3000, mipsr4000, mipsr10000, etc.
! 727: "hinv -c processor" gives lots of information on Irix. Standard
! 728: config.guess appends "el" to indicate endianness, but
! 729: <code>AC_C_BIGENDIAN</code> seems the best way to handle that for GMP.
! 730: <li> PowerPC: The function descriptor nonsense for AIX is currently driven by
! 731: <code>*-*-aix*</code>. It might be more reliable to do some sort of
! 732: feature test, examining the compiler output perhaps. It might also be
! 733: nice to merge the aix.m4 files into powerpc-defs.m4.
! 734: <li> Sparc: <code>config.guess</code> recognises various exact sparcs, make
! 735: use of that information in <code>configure</code> (work on this is in
! 736: progress).
! 737: <li> Sparc32: floating point or integer <code>udiv</code> should be selected
! 738: according to the CPU target. Currently floating point ends up being
! 739: used on all sparcs, which is probably not right for generic V7 and V8.
! 740: <li> Sparc: The use of <code>-xtarget=native</code> with <code>cc</code> is
! 741: incorrect when cross-compiling, the target should be set according to the
! 742: configured <code>$host</code> CPU.
! 743: <li> m68k: config.guess can detect 68000, 68010, CPU32 and 68020, but relies
! 744: on system information for 030, 040 and 060. Can they be identified by
! 745: running some code?
! 746: <li> m68k: gas 2.11.90.0.1 pads with zero bytes in text segments, which is not
! 747: valid code. Probably need <code>.balignw <n>,0x4e7f</code> to get
! 748: nops, if <code>ALIGN</code> is going to be used for anything that's
! 749: executed across.
! 750: <li> Some CPUs have <code>umul</code> and <code>udiv</code> code not being
! 751: used. Check all such for bit rot and then put umul and udiv in
! 752: <code>$gmp_mpn_functions_optional</code> as "standard optional" objects.
! 753: <br> In particular Sparc and SparcV8 on non-gcc should benefit from
! 754: umul.asm enabled; the generic umul is suspected to be making sqr_basecase
! 755: slower than mul_basecase.
! 756: <li> HPPA <code>mpn_umul_ppmm</code> and <code>mpn_udiv_qrnnd</code> have a
! 757: different parameter order than those functions on other CPUs. It might
! 758: avoid confusion to have them under different names, maybe
! 759: <code>mpn_umul_ppmm_r</code> or some such. Prototypes then wouldn't
! 760: be conditionalized, and the appropriate form could be selected with the
! 761: <code>HAVE_NATIVE</code> scheme if/when the code switches to use a
! 762: <code>PROLOGUE</code> style.
! 763: <li> <code>DItype</code>: The setup in gmp-impl.h for non-GCC could use an
! 764: autoconf test to determine whether <code>long long</code> is available.
! 765: <li> m88k: Make the assembler code work on non-underscore systems. Conversion
! 766: to .asm would be desirable. Ought to be easy, but would want to be
! 767: tested.
! 768: <li> z8k: The use of a 32-bit limb in mpn/z8000x as opposed to 16-bits in
! 769: mpn/z8000 could be an ABI choice. But this chip is obsolete and nothing
! 770: is likely to be done unless someone is actively using it.
! 771: <li> config.m4 is generated only by the configure script, it won't be
! 772: regenerated by config.status. Creating it as an <code>AC_OUTPUT</code>
! 773: would work, but it might upset "make" to have things like <code>L$</code>
! 774: get into the Makefiles through <code>AC_SUBST</code>.
! 775: <code>AC_CONFIG_COMMANDS</code> would be the alternative. With some
! 776: careful m4 quoting the <code>changequote</code> calls might not be
! 777: needed, which might free up the order in which things had to be output.
! 778: <li> <code>make distclean</code>: Only the mpn directory links which were
! 779: created are removed, but perhaps all possible links should be removed, in
! 780: case someone runs configure a second time without a
! 781: <code>distclean</code> in between. The only tricky part would be making
! 782: sure all possible <code>extra_functions</code> are covered.
! 783: <li> MinGW: Apparently a Cygwin version of gcc can be used by passing
! 784: <code>-mno-cygwin</code>. For <code>--host=*-*-mingw32*</code> it might
! 785: be convenient to automatically use that option, if it works. Needs
! 786: someone with a dual cygwin/mingw setup to test.
! 787: <li> Automake: Latest automake has a <code>CCAS</code>, <code>CCASFLAGS</code>
! 788: scheme. Though we probably wouldn't be using its assembler support we
! 789: could try to use those variables in compatible ways.
1.1 maekawa 790: </ul>
791:
1.1.1.2 ! ohara 792:
! 793: <h4>Random Numbers</h4>
! 794: <ul>
! 795: <li> <code>_gmp_rand</code> is not particularly fast on the linear
! 796: congruential algorithm and could stand various improvements.
! 797: <ul>
! 798: <li> Make a second seed area within <code>gmp_randstate_t</code> (or
! 799: <code>_mp_algdata</code> rather) to save some copying.
! 800: <li> Make a special case for a single limb <code>2exp</code> modulus, to
! 801: avoid <code>mpn_mul</code> calls. Perhaps the same for two limbs.
! 802: <li> Inline the <code>lc</code> code, to avoid a function call and
! 803: <code>TMP_ALLOC</code> for every chunk.
! 804: <li> The special case for <code>seedn==0</code> will be very rarely used,
! 805: and on that basis seems unnecessary.
! 806: <li> Perhaps the <code>2exp</code> and general LC cases should be split,
! 807: for clarity (if the general case is retained).
! 808: </ul>
! 809: <li> <code>gmp_randinit_mers</code> for a Mersenne Twister generator. It's
! 810: likely to be more random and about the same speed as Knuth's 55-element
! 811: Fibonacci generator, and can probably become the default. Pedro Gimeno
! 812: has started on this.
! 813: <li> <code>gmp_randinit_lc</code>: Finish or remove. Doing a division for
! 814: every every step won't be very fast, so check whether the usefulness of
! 815: this algorithm can be justified. (Consensus is that it's not useful and
! 816: can be removed.)
! 817: <li> Blum-Blum-Shub: Finish or remove. A separate
! 818: <code>gmp_randinit_bbs</code> would be wanted, not the currently
! 819: commented out case in <code>gmp_randinit</code>.
! 820: <li> <code>_gmp_rand</code> could be done as a function pointer within
! 821: <code>gmp_randstate_t</code> (or rather in the <code>_mp_algdata</code>
! 822: part), instead of switching on a <code>gmp_randalg_t</code>. Likewise
! 823: <code>gmp_randclear</code>, and perhaps <code>gmp_randseed</code> if it
! 824: became algorithm-specific. This would be more modular, and would ensure
! 825: only code for the desired algorithms is dragged into the link.
! 826: <li> <code>mpz_urandomm</code> should do something for n<=0, but what?
! 827: <li> <code>mpz_urandomm</code> implementation looks like it could be improved.
! 828: Perhaps it's enough to calculate <code>nbits</code> as ceil(log2(n)) and
! 829: call <code>_gmp_rand</code> until a value <code><n</code> is obtained.
! 830: <li> <code>gmp_randstate_t</code> used for parameters perhaps should become
! 831: <code>gmp_randstate_ptr</code> the same as other types.
! 832: <li> Some of the empirical randomness tests could be included in a "make
! 833: check". They ought to work everywhere, for a given seed at least.
! 834: </ul>
1.1 maekawa 835:
836:
837: <h4>Miscellaneous</h4>
838: <ul>
839: <li> Make <code>mpz_div</code> and <code>mpz_divmod</code> use rounding
840: analogous to <code>mpz_mod</code>. Document, and list as an
841: incompatibility.
1.1.1.2 ! ohara 842: <li> <code>mpz_gcdext</code> and <code>mpn_gcdext</code> ought to document
! 843: what range of values the generated cofactors can take, and preferably
! 844: ensure the definition uniquely specifies the cofactors for given inputs.
! 845: A basic extended Euclidean algorithm or multi-step variant leads to
! 846: |x|<|b| and |y|<|a| or something like that, but there's probably
! 847: two solutions under just those restrictions.
! 848: <li> <code>mpz_invert</code> should call <code>mpn_gcdext</code> directly.
! 849: <li> demos/factorize.c: use <code>mpz_divisible_ui_p</code> rather than
! 850: <code>mpz_tdiv_qr_ui</code>. (Of course dividing multiple primes at a
! 851: time would be better still.)
! 852: <li> The various test programs use quite a bit of the main
! 853: <code>libgmp</code>. This establishes good cross-checks, but it might be
! 854: better to use simple reference routines where possible. Where it's not
! 855: possible some attention could be paid to the order of the tests, so a
! 856: <code>libgmp</code> routine is only used for tests once it seems to be
! 857: good.
! 858: <li> <code>mpf_set_q</code> is very similar to <code>mpf_div</code>, it'd be
! 859: good for the two to share code. Perhaps <code>mpf_set_q</code> should
! 860: make some <code>mpf_t</code> aliases for its numerator and denominator
! 861: and just call <code>mpf_div</code>. Both would be simplified a good deal
! 862: by switching to <code>mpn_tdiv_qr</code> perhaps making them small enough
! 863: not to bother with sharing (especially since <code>mpf_set_q</code>
! 864: wouldn't need to watch out for overlaps).
! 865: <li> PowerPC: The cpu time base registers (per <code>mftb</code> and
! 866: <code>mftbu</code>) could be used for the speed and tune programs. Would
! 867: need to know its frequency of course. Usually it's 1/4 of bus speed
! 868: (eg. 25 MHz) but some chips drive it from an external input. Probably
! 869: have to measure to be sure.
! 870: <li> <code>MUL_FFT_THRESHOLD</code> etc: the FFT thresholds should allow a
! 871: return to a previous k at certain sizes. This arises basically due to
! 872: the step effect caused by size multiples effectively used for each k.
! 873: Looking at a graph makes it fairly clear.
! 874: <li> <code>__gmp_doprnt_mpf</code> does a rather unattractive round-to-nearest
! 875: on the string returned by <code>mpf_get_str</code>. Perhaps some variant
! 876: of <code>mpf_get_str</code> could be made which would better suit.
1.1 maekawa 877: </ul>
878:
879:
1.1.1.2 ! ohara 880: <h4>Aids to Development</h4>
1.1 maekawa 881: <ul>
1.1.1.2 ! ohara 882: <li> Add <code>ASSERT</code>s at the start of each user-visible mpz/mpq/mpf
! 883: function to check the validity of each <code>mp?_t</code> parameter, in
! 884: particular to check they've been <code>mp?_init</code>ed. This might
! 885: catch elementary mistakes in user programs. Care would need to be taken
! 886: over <code>MPZ_TMP_INIT</code>ed variables used internally. If nothing
! 887: else then consistency checks like size<=alloc, ptr not
! 888: <code>NULL</code> and ptr+size not wrapping around the address space,
! 889: would be possible. A more sophisticated scheme could track
! 890: <code>_mp_d</code> pointers and ensure only a valid one is used. Such a
! 891: scheme probably wouldn't be reentrant, not without some help from the
! 892: system.
! 893: <li> tune/time.c could try to determine at runtime whether
! 894: <code>getrusage</code> and <code>gettimeofday</code> are reliable.
! 895: Currently we pretend in configure that the dodgy m68k netbsd 1.4.1
! 896: <code>getrusage</code> doesn't exist. If a test might take a long time
! 897: to run then perhaps cache the result in a file somewhere.
1.1 maekawa 898: </ul>
899:
900:
901: <h4>Documentation</h4>
902: <ul>
903: <li> Document conventions, like that <code> unsigned long int</code> is used for
904: bit counts/ranges, and that <code>mp_size_t</code> is used for limb counts.
905: <li> <code>mpz_inp_str</code> (etc) doesn't say when it stops reading digits.
906: </ul>
907:
908:
1.1.1.2 ! ohara 909: <h4>Bright Ideas</h4>
! 910:
! 911: The following may or may not be feasible, and aren't likely to get done in the
! 912: near future, but are at least worth thinking about.
! 913:
! 914: <ul>
! 915: <li> Reorganize longlong.h so that we can inline the operations even for the
! 916: system compiler. When there is no such compiler feature, make calls to
! 917: stub functions. Write such stub functions for as many machines as
! 918: possible.
! 919: <li> longlong.h could declare when it's using, or would like to use,
! 920: <code>mpn_umul_ppmm</code>, and the corresponding umul.asm file could be
! 921: included in libgmp only in that case, the same as is effectively done for
! 922: <code>__clz_tab</code>. Likewise udiv.asm and perhaps cntlz.asm. This
! 923: would only be a very small space saving, so perhaps not worth the
! 924: complexity.
! 925: <li> longlong.h could be built at configure time by concatenating or
! 926: #including fragments from each directory in the mpn path. This would
! 927: select CPU specific macros the same way as CPU specific assembler code.
! 928: Code used would no longer depend on cpp predefines, and the current
! 929: nested conditionals could be flattened out.
! 930: <li> <code>mpz_get_si</code> returns 0x80000000 for -0x100000000, whereas it's
! 931: sort of supposed to return the low 31 (or 63) bits. But this is
! 932: undocumented, and perhaps not too important.
! 933: <li> <code>mpz_*_ui</code> division routines currently return abs(a%b).
! 934: Perhaps make them return the real remainder instead? Return type would
! 935: be <code>signed long int</code>. But this would be an incompatible
! 936: change, so it might have to be under newly named functions.
! 937: <li> <code>mpz_init_set*</code> and <code>mpz_realloc</code> could allocate
! 938: say an extra 16 limbs over what's needed, so as to reduce the chance of
! 939: having to do a reallocate if the <code>mpz_t</code> grows a bit more.
! 940: This could only be an option, since it'd badly bloat memory usage in
! 941: applications using many small values.
! 942: <li> <code>mpq</code> functions could perhaps check for numerator or
! 943: denominator equal to 1, on the assumption that integers or
! 944: denominator-only values might be expected to occur reasonably often.
! 945: <li> <code>count_trailing_zeros</code> is used on more or less uniformly
! 946: distributed numbers in a couple of places. For some CPUs
! 947: <code>count_trailing_zeros</code> is slow and it's probably worth handling
! 948: the frequently occurring 0 to 2 trailing zeros cases specially.
! 949: <li> <code>mpf_t</code> might like to let the exponent be undefined when
! 950: size==0, instead of requiring it 0 as now. It should be possible to do
! 951: size==0 tests before paying attention to the exponent. The advantage is
! 952: not needing to set exp in the various places a zero result can arise,
! 953: which avoids some tedium but is otherwise perhaps not too important.
! 954: Currently <code>mpz_set_f</code> and <code>mpf_cmp_ui</code> depend on
! 955: exp==0, maybe elsewhere too.
! 956: <li> <code>__gmp_allocate_func</code>: Could use GCC <code>__attribute__
! 957: ((malloc))</code> on this, though don't know if it'd do much. GCC 3.0
! 958: allows that attribute on functions, but not function pointers (see info
! 959: node "Attribute Syntax"), so would need a new autoconf test. This can
! 960: wait until there's a GCC that supports it.
! 961: <li> <code>mpz_add_ui</code> contains two <code>__GMPN_COPY</code>s, one from
! 962: <code>mpn_add_1</code> and one from <code>mpn_sub_1</code>. If those two
! 963: routines were opened up a bit maybe that code could be shared. When a
! 964: copy needs to be done there's no carry to append for the add, and if the
! 965: copy is non-empty no high zero for the sub. <br> An alternative would be
! 966: to do a copy at the start and then an in-place add or sub. Obviously
! 967: that duplicates the fetches and stores for carry propagation, but that's
! 968: normally only one or two limbs. The same applies to <code>mpz_add</code>
! 969: when one operand is longer than the other, and to <code>mpz_com</code>
! 970: since it's just -(x+1).
! 971: <li> <code>restrict</code>'ed pointers: Does the C99 definition of restrict
! 972: (one writer many readers, or whatever it is) suit the GMP style "same or
! 973: separate" function parameters? If so, judicious use might improve the
! 974: code generated a bit. Do any compilers have their own flavour of
! 975: restrict as "completely unaliased", and is that still usable?
! 976: <li> 68000: A 16-bit limb might suit 68000 better than 32-bits, since the
! 977: native multiply is only 16x16. Could have this as an <code>ABI</code>
! 978: option, selecting <code>_SHORT_LIMB</code> in gmp.h. Naturally a new set
! 979: of asm subroutines would be necessary. Would need new
! 980: <code>mpz_set_ui</code> etc since the current code assumes limb>=long,
! 981: but 2-limb operand forms would find a use for <code>long long</code> on
! 982: other processors too.
! 983: <li> Nx1 remainders can be taken at multiplier throughput speed by
! 984: pre-calculating an array "p[i] = 2^(i*<code>BITS_PER_MP_LIMB</code>) mod
! 985: m", then for the input limbs x calculating an inner product "sum
! 986: p[i]*x[i]", and a final 3x1 limb remainder mod m. If those powers take
! 987: roughly N divide steps to calculate then there'd be an advantage any time
! 988: the same m is used three or more times. Suggested by Victor Shoup in
! 989: connection with chinese-remainder style decompositions, but perhaps with
! 990: other uses.
! 991: </ul>
! 992: <hr>
1.1 maekawa 993:
994: </body>
995: </html>
1.1.1.2 ! ohara 996:
! 997: <!--
! 998: Local variables:
! 999: eval: (add-hook 'write-file-hooks 'time-stamp)
! 1000: time-stamp-start: "This file current as of "
! 1001: time-stamp-format: "%:d %3b %:y"
! 1002: time-stamp-end: "\\."
! 1003: time-stamp-line-limit: 50
! 1004: End:
! 1005: -->
FreeBSD-CVSweb <freebsd-cvsweb@FreeBSD.org>