Annotation of OpenXM_contrib/gmp/doc/tasks.html, Revision 1.1
1.1 ! maekawa 1: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
! 2: <html>
! 3: <head>
! 4: <title>
! 5: GMP Itemized Development Tasks
! 6: </title>
! 7: </head>
! 8: <body bgcolor=lightgreen>
! 9:
! 10: <center>
! 11: <h1>
! 12: GMP Itemized Development Tasks
! 13: </h1>
! 14: </center>
! 15:
! 16: <comment>
! 17: An up-to-date html version of this file is available at
! 18: <a href="http://www.swox.com/gmp/tasks.html">http://www.swox.com/gmp/tasks.html</a>.
! 19: </comment>
! 20:
! 21: <p> This file lists itemized GMP development tasks. Not all the tasks
! 22: listed here are suitable for volunteers, but many of them are.
! 23: Please see the <a href="projects.html">projects file</a> for more
! 24: sizeable projects.
! 25:
! 26: <h4>Correctness and Completeness</h4>
! 27: <ul>
! 28: <li> HPUX 10.20 assembler requires a `.LEVEL 1.1' directive for accepting the
! 29: new instructions. Unfortunately, the HPUX 9 assembler as well as earlier
! 30: assemblers reject that directive. How very clever of HP! We will have to
! 31: pass assembler options, and make sure it works with new and old systems
! 32: and GNU assembler.
! 33: <li> The various reuse.c tests need to force reallocation by calling
! 34: <code>_mpz_realloc</code> with a small (1 limb) size.
! 35: <li> One reuse case is missing from mpX/tests/reuse.c: <code>mpz_XXX(a,a,a)</code>.
! 36: <li> When printing mpf_t numbers with exponents > 2^53 on machines with 64-bit
! 37: <code>mp_exp_t</code>, the precision of
! 38: <code>__mp_bases[base].chars_per_bit_exactly</code> is insufficient and
! 39: <code>mpf_get_str</code> aborts. Detect and compensate.
! 40: <li> Fix <code>mpz_get_si</code> to work properly for MIPS N32 ABI (and other
! 41: machines that use <code>long long</code> for storing limbs.)
! 42: <li> Make the string reading functions allow the `0x' prefix when the base is
! 43: explicitly 16. They currently only allow that prefix when the base is
! 44: unspecified.
! 45: <li> In the development sources, we return abs(a%b) in the
! 46: <code>mpz_*_ui</code> division routines. Perhaps make them return the
! 47: real remainder instead? Changes return type to <code>signed long int</code>.
! 48: <li> <code>mpf_eq</code> is not always correct, when one operand is
! 49: 1000000000... and the other operand is 0111111111..., i.e., extremely
! 50: close. There is a special case in <code>mpf_sub</code> for this
! 51: situation; put similar code in <code>mpf_eq</code>.
! 52: <li> mpf_eq doesn't implement what gmp.texi specifies. It should not use just
! 53: whole limbs, but partial limbs.
! 54: <li> Install Alpha assembly changes (prec/gmp-alpha-patches).
! 55: <li> NeXT has problems with newlines in asm strings in longlong.h. Also,
! 56: <code>__builtin_constant_p</code> is unavailable? Same problem with MacOS
! 57: X.
! 58: <li> Shut up SGI's compiler by declaring <code>dump_abort</code> in
! 59: mp?/tests/*.c.
! 60: <li> <code>mpz_get_si</code> returns 0x80000000 for -0x100000000.
! 61: </ul>
! 62:
! 63:
! 64:
! 65: <h4>Machine Independent Optimization</h4>
! 66: <ul>
! 67: <li> In hundreds of places in the code, we invoke count_leading_zeros and then
! 68: check if the returned count is zero. Instead check the most significant
! 69: bit of the operand, and avoid invoking <code>count_leading_zeros</code> if
! 70: the bit is set. This is an optimization on all machines, and significant
! 71: on machines with slow <code>count_leading_zeros</code>.
! 72: <li> In a couple of places <code>count_trailing_zeros</code> is used
! 73: on more or less uniformly distributed numbers. For some CPUs
! 74: <code>count_trailing_zeros</code> is slow and it's probably worth
! 75: handling the frequently occurring 0 to 2 trailing zeros cases specially.
! 76: <li> Change all places that use <code>udiv_qrnnd</code> for inverting limbs to
! 77: instead use <code>invert_limb</code>.
! 78: <li> Reorganize longlong.h so that we can inline the operations even for the
! 79: system compiler. When there is no such compiler feature, make calls to
! 80: stub functions. Write such stub functions for as many machines as
! 81: possible.
! 82: <li> Rewrite <code>umul_ppmm</code> to use floating-point for generating the
! 83: most significant limb (if <code>BITS_PER_MP_LIMB</code> <= 52 bits).
! 84: (Peter Montgomery has some ideas on this subject.)
! 85: <li> Improve the default <code>umul_ppmm</code> code in longlong.h: Add partial
! 86: products with fewer operations.
! 87: <li> Write new <code>mpn_get_str</code> and <code>mpn_set_str</code> running in
! 88: the sub O(n^2) range, using some divide-and-conquer approach, preferably
! 89: without using division.
! 90: <li> Copy tricky code for converting a limb from development version of
! 91: <code>mpn_get_str</code> to mpf/get_str. (Talk to Torbjörn about this.)
! 92: <li> Consider inlining these functions: <code>mpz_size</code>,
! 93: <code>mpz_set_ui</code>, <code>mpz_set_q</code>, <code>mpz_clear</code>,
! 94: <code>mpz_init</code>, <code>mpz_get_ui</code>, <code>mpz_scan0</code>,
! 95: <code>mpz_scan1</code>, <code>mpz_getlimbn</code>,
! 96: <code>mpz_init_set_ui</code>, <code>mpz_perfect_square_p</code>,
! 97: <code>mpz_popcount</code>, <code>mpf_size</code>,
! 98: <code>mpf_get_prec</code>, <code>mpf_set_prec_raw</code>,
! 99: <code>mpf_set_ui</code>, <code>mpf_init</code>, <code>mpf_init2</code>,
! 100: <code>mpf_clear</code>, <code>mpf_set_si</code>.
! 101: <li> <code>mpz_powm</code> and <code>mpz_powm_ui</code> aren't very
! 102: fast on one or two limb moduli, due to a lot of function call
! 103: overheads. These could perhaps be handled as special cases.
! 104: <li> <code>mpz_powm</code> and <code>mpz_powm_ui</code> want better
! 105: algorithm selection, and the latter should use REDC. Both could
! 106: change to use an <code>mpn_powm</code> and <code>mpn_redc</code>.
! 107: <li> <code>mpn_gcd</code> might be able to be sped up on small to
! 108: moderate sizes by improving <code>find_a</code>, possibly just by
! 109: providing an alternate implementation for CPUs with slowish
! 110: <code>count_leading_zeros</code>.
! 111: <li> Implement a cache localized evaluate and interpolate for the
! 112: toom3 <code>USE_MORE_MPN</code> code. The necessary
! 113: right-to-left <code>mpn_divexact_by3c</code> exists.
! 114: <li> <code>mpn_mul_basecase</code> on NxM with big N but small M could try for
! 115: better cache locality by taking N piece by piece. The current code could
! 116: be left available for CPUs without caching. Depending how karatsuba etc
! 117: is applied to unequal size operands it might be possible to assume M is
! 118: always smallish.
! 119: </ul>
! 120:
! 121:
! 122: <h4>Machine Dependent Optimization</h4>
! 123: <ul>
! 124: <li> Run the `tune' utility for more compiler/CPU combinations. We would like
! 125: to have gmp-mparam.h files in practically every implementation specific
! 126: mpn subdirectory, and repeat each *_THRESHOLD for gcc and the system
! 127: compiler. See the `tune' top-level directory for more information.
! 128: <li> Alpha: Rewrite <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
! 129: <code>mpn_mul_1</code> for the 21264. On 21264, they should run at 4, 3,
! 130: and 3 cycles/limb respectively, if the code is unrolled properly. (Ask
! 131: Torbjörn for his xm.s and xam.s skeleton files.)
! 132: <li> Alpha: Rewrite <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
! 133: <code>mpn_mul_1</code> for the 21164. This should use both integer
! 134: multiplies and floating-point multiplies. For the floating-point
! 135: operations, the single-limb multiplier should be split into three 21-bit
! 136: chunks.
! 137: <li> UltraSPARC: Rewrite 64-bit <code>mpn_addmul_1</code>,
! 138: <code>mpn_submul_1</code>, and <code>mpn_mul_1</code>. Should use
! 139: floating-point operations, and split the invariant single-limb multiplier
! 140: into 21-bit chunks. Should give about 18 cycles/limb, but the pipeline
! 141: will become very deep. (Torbjörn has C code that is useful as a starting
! 142: point.)
! 143: <li> UltraSPARC: Rewrite <code>mpn_lshift</code> and <code>mpn_rshift</code>.
! 144: Should give 2 cycles/limb. (Torbjörn has code that just needs to be
! 145: finished.)
! 146: <li> SPARC32/V9: Find out why the speed of <code>mpn_addmul_1</code>
! 147: and the other multiplies varies so much on successive sizes.
! 148: <li> PA64: Improve <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
! 149: <code>mpn_mul_1</code>. The current development code runs at 11
! 150: cycles/limb, which is already very good. But it should be possible to
! 151: saturate the cache, which will happen at 7.5 cycles/limb.
! 152: <li> Sparc & SparcV8: Enable umul.asm for native cc. The generic
! 153: longlong.h umul_ppmm is suspected to be causing sqr_basecase to
! 154: be slower than mul_basecase.
! 155: <li> UltraSPARC: Write <code>umul_ppmm</code>. Important in particular for
! 156: <code>mpn_sqr_basecase</code>. Using four "<code>mulx</code>"s either
! 157: with an asm block or via the generic C code is about 90 cycles.
! 158: <li> Implement <code>mpn_mul_basecase</code> and <code>mpn_sqr_basecase</code>
! 159: for important machines. Helping the generic sqr_basecase.c with an
! 160: <code>mpn_sqr_diagonal</code> might be enough for some of the RISCs.
! 161: <li> POWER2/POWER2SC: Schedule <code>mpn_lshift</code>/<code>mpn_rshift</code>.
! 162: Will bring time from 1.75 to 1.25 cycles/limb.
! 163: <li> X86: Optimize non-MMX <code>mpn_lshift</code> for shifts by 1. (See Pentium code.)
! 164: <li> Alpha: Optimize <code>count_leading_zeros</code>.
! 165: <li> Alpha: Optimize <code>udiv_qrnnd</code>. (Ask Torbjörn for the file
! 166: test-udiv-preinv.c as a starting point.)
! 167: <li> R10000/R12000: Rewrite <code>mpn_add_n</code> and <code>mpn_sub_n</code>.
! 168: It should just require 3 cycles/limb, but the current code propagates
! 169: carry poorly. The trick is to add carry-in later than we do now,
! 170: decreasing the number of operations used to generate carry-out from 4 to
! 171: to 3.
! 172: <li> PPC32: Try using fewer registers in the current <code>mpn_lshift</code>.
! 173: The pipeline is now extremely deep, perhaps unnecessarily deep. Also, r5
! 174: is unused. (Ask Torbjörn for a copy of the current code.)
! 175: <li> PPC32: Write <code>mpn_rshift</code> based on new <code>mpn_lshift</code>.
! 176: <li> PPC32: Rewrite <code>mpn_add_n</code> and <code>mpn_sub_n</code>. Should
! 177: run at just 3.25 cycles/limb. (Ask for xxx-add_n.s as a starting point.)
! 178: <li> Fujitsu VPP: Vectorize main functions, perhaps in assembly language.
! 179: <li> Fujitsu VPP: Write <code>mpn_mul_basecase</code> and
! 180: <code>mpn_sqr_basecase</code>. This should use a "vertical multiplication
! 181: method", to avoid carry propagation. splitting one of the operands in
! 182: 11-bit chunks.
! 183: <li> Cray: Vectorize main functions, perhaps in assembly language.
! 184: <li> Cray: Write <code>mpn_mul_basecase</code> and
! 185: <code>mpn_sqr_basecase</code>. Same comment applies to this as to the
! 186: same functions for Fujitsu VPP.
! 187: <li> Improve <code>count_leading_zeros</code> for 64-bit machines:
! 188:
! 189: <pre>
! 190: if ((x >> W_TYPE_SIZE-W_TYPE_SIZE/2) == 0) { x <<= W_TYPE_SIZE/2; cnt += W_TYPE_SIZE/2}
! 191: if ((x >> W_TYPE_SIZE-W_TYPE_SIZE/4) == 0) { x <<= W_TYPE_SIZE/4; cnt += W_TYPE_SIZE/4}
! 192: ... </pre>
! 193:
! 194: </ul>
! 195:
! 196: <h4>New Functionality</h4>
! 197: <ul>
! 198: <li> <code>mpz_get_nth_ui</code>. Return the nth word (not necessarily the nth limb).
! 199: <li> Maybe add <code>mpz_crr</code> (Chinese Remainder Reconstruction).
! 200: <li> Let `0b' and `0B' mean binary input everywhere.
! 201: <li> Add <code>mpq_set_f</code> for assignment from <code>mpf_t</code>
! 202: (cf. <code>mpq_set_d</code>).
! 203: <li> Maybe make <code>mpz_init</code> (and <code>mpq_init</code>) do lazy
! 204: allocation. Set <code>ALLOC(var)</code> to 0, and have
! 205: <code>mpz_realloc</code> special-handle that case. Update functions that
! 206: rely on a single limb (like <code>mpz_set_ui</code>,
! 207: <code>mpz_[tfc]div_r_ui</code>, and others).
! 208: <li> Add <code>mpf_out_raw</code> and <code>mpf_inp_raw</code>. Make sure
! 209: format is portable between 32-bit and 64-bit machines, and between
! 210: little-endian and big-endian machines.
! 211: <li> Handle numeric exceptions: Call an error handler, and/or set
! 212: <code>gmp_errno</code>.
! 213: <li> Implement <code>gmp_fprintf</code>, <code>gmp_sprintf</code>, and
! 214: <code>gmp_snprintf</code>. Think about some sort of wrapper
! 215: around <code>printf</code> so it and its several variants don't
! 216: have to be completely reimplemented.
! 217: <li> Implement some <code>mpq</code> input and output functions.
! 218: <li> Implement a full precision <code>mpz_kronecker</code>, leave
! 219: <code>mpz_jacobi</code> for compatibility.
! 220: <li> Make the mpn logops and copys available in gmp.h. Since they can
! 221: be either library functions or inlines, gmp.h would need to be
! 222: generated from a gmp.in based on what's in the library. gmp.h
! 223: would still be compiler-independent though.
! 224: <li> Make versions of <code>mpz_set_str</code> etc taking string
! 225: lengths rather than null-terminators.
! 226: <li> Consider changing the thresholds to apply the simpler algorithm when
! 227: "<code><=</code>" rather than "<code><</code>", so a threshold can
! 228: be set to <code>MP_SIZE_T_MAX</code> to get only the simpler code (the
! 229: compiler will know <code>size <= MP_SIZE_T_MAX</code> is always true).
! 230: <li> <code>mpz_cdiv_q_2exp</code> and <code>mpz_cdiv_r_2exp</code>
! 231: could be implemented to match the corresponding tdiv and fdiv.
! 232: Maybe some code sharing is possible.
! 233: </ul>
! 234:
! 235:
! 236: <h4>Configuration</h4>
! 237:
! 238: <ul>
! 239: <li> Improve config.guess. We want to recognize the processor very
! 240: accurately, more accurately than other GNU packages.
! 241: config.guess does not currently make the distinctions we would
! 242: like it to do and a --target often needs to be set explicitly.
! 243:
! 244: For example, "sparc" is not very useful as a machine architecture
! 245: denotation. We want to distinguish old 32-bit SPARC without
! 246: multiply support from newer 32-bit SPARC with such support. We
! 247: want to recognize a SuperSPARC, since its implementation of the
! 248: UDIV instruction is not complete, and will trap to the OS kernel
! 249: for certain operands. And we want to recognize 64-bit capable
! 250: SPARC processors as such. While the assembly routines can use
! 251: 64-bit operations on all 64-bit SPARC processors, one can not use
! 252: 64-bit limbs under all operating system. E.g., Solaris 2.5 and
! 253: 2.6 doesn't preserve the upper 32 bits of most processor
! 254: registers. For SPARC we therefore sometimes need to choose GMP
! 255: configuration depending both on processor and operating system.
! 256:
! 257: <li> Remember to make sure config.sub accepts any output from config.guess.
! 258:
! 259: <li> Find out whether there's an alloca available and how to use it.
! 260: AC_FUNC_ALLOCA has various system dependencies covered, but we
! 261: don't want its alloca.c replacement. (One thing current cpp
! 262: tests don't cover: HPUX 10 C compiler supports alloca, but
! 263: cannot find any symbol to test in order to know if we're on
! 264: HPUX 10. Damn.)
! 265: <li> Identify Mips processor under Irix: `hinv -c processor'.
! 266: config.guess should say mips2, mips3, and mips4.
! 267: <li> Identify Alpha processor under OSF: "/usr/sbin/sizer -c".
! 268: Unfortunately, sizer is not available before some revision of
! 269: Dec Unix 4.0, and it also returns some rather cryptic names for
! 270: processors. Perhaps the <code>implver</code> and
! 271: <code>amask</code> assembly instructions are better, but that
! 272: doesn't differentiate between ev5 and ev56.
! 273: <li> Identify Sparc processors. config.guess should say supersparc,
! 274: microsparc, ultrasparc1, ultrasparc2, etc.
! 275: <li> Identify HPPA processors similarly.
! 276: <li> Get lots of information about a Solaris system: prtconf -vp
! 277: <li> For some target machines and some compilers, specific options
! 278: are needed (sparcv8/gcc needs -mv8, sparcv8/cc needs -cg92,
! 279: Irix64/cc needs -64, Irix32/cc might need -n32, etc). Some are
! 280: set already, add more, see configure.in.
! 281: <li> Options to be passed to the assembler (via the compiler, using
! 282: whatever syntax the compiler uses for passing options to the
! 283: assembler).
! 284: <li> On Solaris 7, check if gcc supports native v9 64-bit
! 285: arithmetic. If not compile using "cc -fast -xarch=v9".
! 286: (Problem: -fast requires that we link with -fast too, which
! 287: might not be very good. Pass "-xO4 -xtarget=native" instead?)
! 288: <li> Extend the "optional" compiler arguments to choose the first
! 289: that works from from a set, so when gcc gets athlon support it
! 290: can try -mcpu=athlon, -mcpu=pentiumpro, or -mcpu=i486,
! 291: whichever works.
! 292: <li> Detect gcc >=2.96 and enable -march=pentiumpro for relevant
! 293: x86s. (A bug in gcc 2.95.2 prevents it being used
! 294: unconditionally.)
! 295: <li> Build multiple variants of the library under certain systems.
! 296: An example is -n32, -o32, and -64 on Irix.
! 297: <li> There's a few filenames that don't fit in 14 chars, if this
! 298: matters.
! 299: <li> Enable support for FORTRAN versions of mpn files (eg. for
! 300: mpn/cray/mulww.f). Add "f" to the mpn path searching, run AC_PROG_F77 if
! 301: such a file is found. Automake will generate some of what's needed in the
! 302: makefiles, but libtool doesn't know fortran and so rules like the current
! 303: ".asm.lo" will be needed.
! 304: <li> Only run GMP_PROG_M4 if it's needed, ie. if there's .asm files
! 305: selected from the mpn path. This might help say a generic C
! 306: build on weird systems.
! 307: </ul>
! 308:
! 309: <p> In general, getting the exact right configuration, passing the
! 310: exact right options to the compiler, etc, might mean that the GMP
! 311: performance more than doubles.
! 312:
! 313: <p> When testing, make sure to test at least the following for all out
! 314: target machines: (1) Both gcc and cc (and c89). (2) Both 32-bit mode
! 315: and 64-bit mode (such as -n32 vs -64 under Irix). (3) Both the system
! 316: `make' and GNU `make'. (4) With and without GNU binutils.
! 317:
! 318:
! 319: <h4>Miscellaneous</h4>
! 320: <ul>
! 321:
! 322: <li> Work on the way we build the library. We now do it building
! 323: convenience libraries but then listing all the object files a
! 324: second time in the top level Makefile.am.
! 325: <li> Get rid of mp[zq]/sub.c, and instead define a compile parameter to
! 326: mp[zq]/add.c to decide whether it will add or subtract. Will decrease
! 327: redundancy. Similarly in other places.
! 328: <li> Make <code>mpz_div</code> and <code>mpz_divmod</code> use rounding
! 329: analogous to <code>mpz_mod</code>. Document, and list as an
! 330: incompatibility.
! 331: <li> Maybe make mpz_pow_ui.c more like mpz/ui_pow_ui.c, or write new
! 332: mpn/generic/pow_ui.
! 333: <li> Make mpz_invert call mpn_gcdext directly.
! 334: <li> Make a build option to enable execution profiling with gprof. In
! 335: particular look at getting the right <code>mcount</code> call at
! 336: the start of each assembler subroutine (for important targets at
! 337: least).
! 338: </ul>
! 339:
! 340:
! 341: <h4>Aids to Debugging</h4>
! 342: <ul>
! 343: <li> Make an option for stack-alloc.c to call <code>malloc</code>
! 344: separately for each <code>TMP_ALLOC</code> block, so a redzoning
! 345: malloc debugger could be used during development.
! 346: <li> Add <code>ASSERT</code>s at the start of each user-visible
! 347: mpz/mpq/mpf function to check the validity of each
! 348: <code>mp?_t</code> parameter, in particular to check they've been
! 349: <code>mp?_init</code>ed. This might catch elementary mistakes in
! 350: user programs. Care would need to be taken over
! 351: <code>MPZ_TMP_INIT</code>ed variables used internally.
! 352: </ul>
! 353:
! 354:
! 355: <h4>Documentation</h4>
! 356: <ul>
! 357: <li> Document conventions, like that <code> unsigned long int</code> is used for
! 358: bit counts/ranges, and that <code>mp_size_t</code> is used for limb counts.
! 359: <li> <code>mpz_inp_str</code> (etc) doesn't say when it stops reading digits.
! 360: </ul>
! 361:
! 362: <hr>
! 363:
! 364: <table width="100%">
! 365: <tr>
! 366: <td>
! 367: <font size=2>
! 368: Please send comments about this page to
! 369: <a href="mailto:tege@swox.com">tege@swox.com</a>.<br>
! 370: Copyright (C) 1999, 2000 Torbjörn Granlund.
! 371: </font>
! 372: </td>
! 373: <td align=right>
! 374: </td>
! 375: </tr>
! 376: </table>
! 377:
! 378: </body>
! 379: </html>
FreeBSD-CVSweb <freebsd-cvsweb@FreeBSD.org>