[BACK]Return to tasks.html CVS log [TXT][DIR] Up to [local] / OpenXM_contrib / gmp / doc

Annotation of OpenXM_contrib/gmp/doc/tasks.html, Revision 1.1.1.1

1.1       maekawa     1: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
                      2: <html>
                      3: <head>
                      4:   <title>
                      5:     GMP Itemized Development Tasks
                      6:   </title>
                      7: </head>
                      8: <body bgcolor=lightgreen>
                      9:
                     10: <center>
                     11:   <h1>
                     12:     GMP Itemized Development Tasks
                     13:   </h1>
                     14: </center>
                     15:
                     16: <comment>
                     17:   An up-to-date html version of this file is available at
                     18:   <a href="http://www.swox.com/gmp/tasks.html">http://www.swox.com/gmp/tasks.html</a>.
                     19: </comment>
                     20:
                     21: <p> This file lists itemized GMP development tasks.  Not all the tasks
                     22:     listed here are suitable for volunteers, but many of them are.
                     23:     Please see the <a href="projects.html">projects file</a> for more
                     24:     sizeable projects.
                     25:
                     26: <h4>Correctness and Completeness</h4>
                     27: <ul>
                     28: <li> HPUX 10.20 assembler requires a `.LEVEL 1.1' directive for accepting the
                     29:      new instructions.  Unfortunately, the HPUX 9 assembler as well as earlier
                     30:      assemblers reject that directive.  How very clever of HP!  We will have to
                     31:      pass assembler options, and make sure it works with new and old systems
                     32:      and GNU assembler.
                     33: <li> The various reuse.c tests need to force reallocation by calling
                     34:      <code>_mpz_realloc</code> with a small (1 limb) size.
                     35: <li> One reuse case is missing from mpX/tests/reuse.c: <code>mpz_XXX(a,a,a)</code>.
                     36: <li> When printing mpf_t numbers with exponents > 2^53 on machines with 64-bit
                     37:      <code>mp_exp_t</code>, the precision of
                     38:      <code>__mp_bases[base].chars_per_bit_exactly</code> is insufficient and
                     39:      <code>mpf_get_str</code> aborts.  Detect and compensate.
                     40: <li> Fix <code>mpz_get_si</code> to work properly for MIPS N32 ABI (and other
                     41:      machines that use <code>long long</code> for storing limbs.)
                     42: <li> Make the string reading functions allow the `0x' prefix when the base is
                     43:      explicitly 16.  They currently only allow that prefix when the base is
                     44:      unspecified.
                     45: <li> In the development sources, we return abs(a%b) in the
                     46:      <code>mpz_*_ui</code> division routines.  Perhaps make them return the
                     47:      real remainder instead?  Changes return type to <code>signed long int</code>.
                     48: <li> <code>mpf_eq</code> is not always correct, when one operand is
                     49:      1000000000... and the other operand is 0111111111..., i.e., extremely
                     50:      close.  There is a special case in <code>mpf_sub</code> for this
                     51:      situation; put similar code in <code>mpf_eq</code>.
                     52: <li> mpf_eq doesn't implement what gmp.texi specifies.  It should not use just
                     53:      whole limbs, but partial limbs.
                     54: <li> Install Alpha assembly changes (prec/gmp-alpha-patches).
                     55: <li> NeXT has problems with newlines in asm strings in longlong.h.  Also,
                     56:      <code>__builtin_constant_p</code> is unavailable?  Same problem with MacOS
                     57:      X.
                     58: <li> Shut up SGI's compiler by declaring <code>dump_abort</code> in
                     59:      mp?/tests/*.c.
                     60: <li> <code>mpz_get_si</code> returns 0x80000000 for -0x100000000.
                     61: </ul>
                     62:
                     63:
                     64:
                     65: <h4>Machine Independent Optimization</h4>
                     66: <ul>
                     67: <li> In hundreds of places in the code, we invoke count_leading_zeros and then
                     68:      check if the returned count is zero.  Instead check the most significant
                     69:      bit of the operand, and avoid invoking <code>count_leading_zeros</code> if
                     70:      the bit is set.  This is an optimization on all machines, and significant
                     71:      on machines with slow <code>count_leading_zeros</code>.
                     72: <li> In a couple of places <code>count_trailing_zeros</code> is used
                     73:      on more or less uniformly distributed numbers.  For some CPUs
                     74:      <code>count_trailing_zeros</code> is slow and it's probably worth
                     75:      handling the frequently occurring 0 to 2 trailing zeros cases specially.
                     76: <li> Change all places that use <code>udiv_qrnnd</code> for inverting limbs to
                     77:      instead use <code>invert_limb</code>.
                     78: <li> Reorganize longlong.h so that we can inline the operations even for the
                     79:      system compiler.  When there is no such compiler feature, make calls to
                     80:      stub functions.  Write such stub functions for as many machines as
                     81:      possible.
                     82: <li> Rewrite <code>umul_ppmm</code> to use floating-point for generating the
                     83:      most significant limb (if <code>BITS_PER_MP_LIMB</code> &lt= 52 bits).
                     84:      (Peter Montgomery has some ideas on this subject.)
                     85: <li> Improve the default <code>umul_ppmm</code> code in longlong.h: Add partial
                     86:      products with fewer operations.
                     87: <li> Write new <code>mpn_get_str</code> and <code>mpn_set_str</code> running in
                     88:      the sub O(n^2) range, using some divide-and-conquer approach, preferably
                     89:      without using division.
                     90: <li> Copy tricky code for converting a limb from development version of
                     91:      <code>mpn_get_str</code> to mpf/get_str.  (Talk to Torbjörn about this.)
                     92: <li> Consider inlining these functions: <code>mpz_size</code>,
                     93:      <code>mpz_set_ui</code>, <code>mpz_set_q</code>, <code>mpz_clear</code>,
                     94:      <code>mpz_init</code>, <code>mpz_get_ui</code>, <code>mpz_scan0</code>,
                     95:      <code>mpz_scan1</code>, <code>mpz_getlimbn</code>,
                     96:      <code>mpz_init_set_ui</code>, <code>mpz_perfect_square_p</code>,
                     97:      <code>mpz_popcount</code>, <code>mpf_size</code>,
                     98:      <code>mpf_get_prec</code>, <code>mpf_set_prec_raw</code>,
                     99:      <code>mpf_set_ui</code>, <code>mpf_init</code>, <code>mpf_init2</code>,
                    100:      <code>mpf_clear</code>, <code>mpf_set_si</code>.
                    101: <li> <code>mpz_powm</code> and <code>mpz_powm_ui</code> aren't very
                    102:      fast on one or two limb moduli, due to a lot of function call
                    103:      overheads.  These could perhaps be handled as special cases.
                    104: <li> <code>mpz_powm</code> and <code>mpz_powm_ui</code> want better
                    105:      algorithm selection, and the latter should use REDC.  Both could
                    106:      change to use an <code>mpn_powm</code> and <code>mpn_redc</code>.
                    107: <li> <code>mpn_gcd</code> might be able to be sped up on small to
                    108:      moderate sizes by improving <code>find_a</code>, possibly just by
                    109:      providing an alternate implementation for CPUs with slowish
                    110:      <code>count_leading_zeros</code>.
                    111: <li> Implement a cache localized evaluate and interpolate for the
                    112:      toom3 <code>USE_MORE_MPN</code> code.  The necessary
                    113:      right-to-left <code>mpn_divexact_by3c</code> exists.
                    114: <li> <code>mpn_mul_basecase</code> on NxM with big N but small M could try for
                    115:      better cache locality by taking N piece by piece.  The current code could
                    116:      be left available for CPUs without caching.  Depending how karatsuba etc
                    117:      is applied to unequal size operands it might be possible to assume M is
                    118:      always smallish.
                    119: </ul>
                    120:
                    121:
                    122: <h4>Machine Dependent Optimization</h4>
                    123: <ul>
                    124: <li> Run the `tune' utility for more compiler/CPU combinations.  We would like
                    125:      to have gmp-mparam.h files in practically every implementation specific
                    126:      mpn subdirectory, and repeat each *_THRESHOLD for gcc and the system
                    127:      compiler.  See the `tune' top-level directory for more information.
                    128: <li> Alpha: Rewrite <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
                    129:      <code>mpn_mul_1</code> for the 21264.  On 21264, they should run at 4, 3,
                    130:      and 3 cycles/limb respectively, if the code is unrolled properly.  (Ask
                    131:      Torbjörn for his xm.s and xam.s skeleton files.)
                    132: <li> Alpha: Rewrite <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
                    133:      <code>mpn_mul_1</code> for the 21164.  This should use both integer
                    134:      multiplies and floating-point multiplies.  For the floating-point
                    135:      operations, the single-limb multiplier should be split into three 21-bit
                    136:      chunks.
                    137: <li> UltraSPARC: Rewrite 64-bit <code>mpn_addmul_1</code>,
                    138:      <code>mpn_submul_1</code>, and <code>mpn_mul_1</code>.  Should use
                    139:      floating-point operations, and split the invariant single-limb multiplier
                    140:      into 21-bit chunks.  Should give about 18 cycles/limb, but the pipeline
                    141:      will become very deep.  (Torbjörn has C code that is useful as a starting
                    142:      point.)
                    143: <li> UltraSPARC: Rewrite <code>mpn_lshift</code> and <code>mpn_rshift</code>.
                    144:      Should give 2 cycles/limb.  (Torbjörn has code that just needs to be
                    145:      finished.)
                    146: <li> SPARC32/V9: Find out why the speed of <code>mpn_addmul_1</code>
                    147:      and the other multiplies varies so much on successive sizes.
                    148: <li> PA64: Improve <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
                    149:      <code>mpn_mul_1</code>.  The current development code runs at 11
                    150:      cycles/limb, which is already very good.  But it should be possible to
                    151:      saturate the cache, which will happen at 7.5 cycles/limb.
                    152: <li> Sparc & SparcV8: Enable umul.asm for native cc.  The generic
                    153:      longlong.h umul_ppmm is suspected to be causing sqr_basecase to
                    154:      be slower than mul_basecase.
                    155: <li> UltraSPARC: Write <code>umul_ppmm</code>.  Important in particular for
                    156:      <code>mpn_sqr_basecase</code>.  Using four "<code>mulx</code>"s either
                    157:      with an asm block or via the generic C code is about 90 cycles.
                    158: <li> Implement <code>mpn_mul_basecase</code> and <code>mpn_sqr_basecase</code>
                    159:      for important machines.  Helping the generic sqr_basecase.c with an
                    160:      <code>mpn_sqr_diagonal</code> might be enough for some of the RISCs.
                    161: <li> POWER2/POWER2SC: Schedule <code>mpn_lshift</code>/<code>mpn_rshift</code>.
                    162:      Will bring time from 1.75 to 1.25 cycles/limb.
                    163: <li> X86: Optimize non-MMX <code>mpn_lshift</code> for shifts by 1.  (See Pentium code.)
                    164: <li> Alpha: Optimize <code>count_leading_zeros</code>.
                    165: <li> Alpha: Optimize <code>udiv_qrnnd</code>.  (Ask Torbjörn for the file
                    166:      test-udiv-preinv.c as a starting point.)
                    167: <li> R10000/R12000: Rewrite <code>mpn_add_n</code> and <code>mpn_sub_n</code>.
                    168:      It should just require 3 cycles/limb, but the current code propagates
                    169:      carry poorly.  The trick is to add carry-in later than we do now,
                    170:      decreasing the number of operations used to generate carry-out from 4 to
                    171:      to 3.
                    172: <li> PPC32: Try using fewer registers in the current <code>mpn_lshift</code>.
                    173:      The pipeline is now extremely deep, perhaps unnecessarily deep.  Also, r5
                    174:      is unused.  (Ask Torbjörn for a copy of the current code.)
                    175: <li> PPC32: Write <code>mpn_rshift</code> based on new <code>mpn_lshift</code>.
                    176: <li> PPC32: Rewrite <code>mpn_add_n</code> and <code>mpn_sub_n</code>.  Should
                    177:      run at just 3.25 cycles/limb.  (Ask for xxx-add_n.s as a starting point.)
                    178: <li> Fujitsu VPP: Vectorize main functions, perhaps in assembly language.
                    179: <li> Fujitsu VPP: Write <code>mpn_mul_basecase</code> and
                    180:      <code>mpn_sqr_basecase</code>.  This should use a "vertical multiplication
                    181:      method", to avoid carry propagation.  splitting one of the operands in
                    182:      11-bit chunks.
                    183: <li> Cray: Vectorize main functions, perhaps in assembly language.
                    184: <li> Cray: Write <code>mpn_mul_basecase</code> and
                    185:      <code>mpn_sqr_basecase</code>.  Same comment applies to this as to the
                    186:      same functions for Fujitsu VPP.
                    187: <li> Improve <code>count_leading_zeros</code> for 64-bit machines:
                    188:
                    189:   <pre>
                    190:   if ((x &gt&gt W_TYPE_SIZE-W_TYPE_SIZE/2) == 0) { x &lt&lt= W_TYPE_SIZE/2; cnt += W_TYPE_SIZE/2}
                    191:   if ((x &gt&gt W_TYPE_SIZE-W_TYPE_SIZE/4) == 0) { x &lt&lt= W_TYPE_SIZE/4; cnt += W_TYPE_SIZE/4}
                    192:   ... </pre>
                    193:
                    194: </ul>
                    195:
                    196: <h4>New Functionality</h4>
                    197: <ul>
                    198: <li> <code>mpz_get_nth_ui</code>.  Return the nth word (not necessarily the nth limb).
                    199: <li> Maybe add <code>mpz_crr</code> (Chinese Remainder Reconstruction).
                    200: <li> Let `0b' and `0B' mean binary input everywhere.
                    201: <li> Add <code>mpq_set_f</code> for assignment from <code>mpf_t</code>
                    202:      (cf. <code>mpq_set_d</code>).
                    203: <li> Maybe make <code>mpz_init</code> (and <code>mpq_init</code>) do lazy
                    204:      allocation.  Set <code>ALLOC(var)</code> to 0, and have
                    205:      <code>mpz_realloc</code> special-handle that case.  Update functions that
                    206:      rely on a single limb (like <code>mpz_set_ui</code>,
                    207:      <code>mpz_[tfc]div_r_ui</code>, and others).
                    208: <li> Add <code>mpf_out_raw</code> and <code>mpf_inp_raw</code>.  Make sure
                    209:      format is portable between 32-bit and 64-bit machines, and between
                    210:      little-endian and big-endian machines.
                    211: <li> Handle numeric exceptions: Call an error handler, and/or set
                    212:      <code>gmp_errno</code>.
                    213: <li> Implement <code>gmp_fprintf</code>, <code>gmp_sprintf</code>, and
                    214:      <code>gmp_snprintf</code>.  Think about some sort of wrapper
                    215:      around <code>printf</code> so it and its several variants don't
                    216:      have to be completely reimplemented.
                    217: <li> Implement some <code>mpq</code> input and output functions.
                    218: <li> Implement a full precision <code>mpz_kronecker</code>, leave
                    219:      <code>mpz_jacobi</code> for compatibility.
                    220: <li> Make the mpn logops and copys available in gmp.h.  Since they can
                    221:      be either library functions or inlines, gmp.h would need to be
                    222:      generated from a gmp.in based on what's in the library.  gmp.h
                    223:      would still be compiler-independent though.
                    224: <li> Make versions of <code>mpz_set_str</code> etc taking string
                    225:      lengths rather than null-terminators.
                    226: <li> Consider changing the thresholds to apply the simpler algorithm when
                    227:      "<code>&lt;=</code>" rather than "<code>&lt;</code>", so a threshold can
                    228:      be set to <code>MP_SIZE_T_MAX</code> to get only the simpler code (the
                    229:      compiler will know <code>size &lt;= MP_SIZE_T_MAX</code> is always true).
                    230: <li> <code>mpz_cdiv_q_2exp</code> and <code>mpz_cdiv_r_2exp</code>
                    231:      could be implemented to match the corresponding tdiv and fdiv.
                    232:      Maybe some code sharing is possible.
                    233: </ul>
                    234:
                    235:
                    236: <h4>Configuration</h4>
                    237:
                    238: <ul>
                    239: <li> Improve config.guess.  We want to recognize the processor very
                    240:      accurately, more accurately than other GNU packages.
                    241:      config.guess does not currently make the distinctions we would
                    242:      like it to do and a --target often needs to be set explicitly.
                    243:
                    244:      For example, "sparc" is not very useful as a machine architecture
                    245:      denotation.  We want to distinguish old 32-bit SPARC without
                    246:      multiply support from newer 32-bit SPARC with such support.  We
                    247:      want to recognize a SuperSPARC, since its implementation of the
                    248:      UDIV instruction is not complete, and will trap to the OS kernel
                    249:      for certain operands.  And we want to recognize 64-bit capable
                    250:      SPARC processors as such.  While the assembly routines can use
                    251:      64-bit operations on all 64-bit SPARC processors, one can not use
                    252:      64-bit limbs under all operating system.  E.g., Solaris 2.5 and
                    253:      2.6 doesn't preserve the upper 32 bits of most processor
                    254:      registers.  For SPARC we therefore sometimes need to choose GMP
                    255:      configuration depending both on processor and operating system.
                    256:
                    257: <li> Remember to make sure config.sub accepts any output from config.guess.
                    258:
                    259: <li> Find out whether there's an alloca available and how to use it.
                    260:      AC_FUNC_ALLOCA has various system dependencies covered, but we
                    261:      don't want its alloca.c replacement.  (One thing current cpp
                    262:      tests don't cover: HPUX 10 C compiler supports alloca, but
                    263:      cannot find any symbol to test in order to know if we're on
                    264:      HPUX 10.  Damn.)
                    265: <li> Identify Mips processor under Irix: `hinv -c processor'.
                    266:      config.guess should say mips2, mips3, and mips4.
                    267: <li> Identify Alpha processor under OSF: "/usr/sbin/sizer -c".
                    268:      Unfortunately, sizer is not available before some revision of
                    269:      Dec Unix 4.0, and it also returns some rather cryptic names for
                    270:      processors.  Perhaps the <code>implver</code> and
                    271:      <code>amask</code> assembly instructions are better, but that
                    272:      doesn't differentiate between ev5 and ev56.
                    273: <li> Identify Sparc processors.  config.guess should say supersparc,
                    274:      microsparc, ultrasparc1, ultrasparc2, etc.
                    275: <li> Identify HPPA processors similarly.
                    276: <li> Get lots of information about a Solaris system: prtconf -vp
                    277: <li> For some target machines and some compilers, specific options
                    278:      are needed (sparcv8/gcc needs -mv8, sparcv8/cc needs -cg92,
                    279:      Irix64/cc needs -64, Irix32/cc might need -n32, etc).  Some are
                    280:      set already, add more, see configure.in.
                    281: <li> Options to be passed to the assembler (via the compiler, using
                    282:      whatever syntax the compiler uses for passing options to the
                    283:      assembler).
                    284: <li> On Solaris 7, check if gcc supports native v9 64-bit
                    285:      arithmetic.  If not compile using "cc -fast -xarch=v9".
                    286:      (Problem: -fast requires that we link with -fast too, which
                    287:      might not be very good.  Pass "-xO4 -xtarget=native" instead?)
                    288: <li> Extend the "optional" compiler arguments to choose the first
                    289:      that works from from a set, so when gcc gets athlon support it
                    290:      can try -mcpu=athlon, -mcpu=pentiumpro, or -mcpu=i486,
                    291:      whichever works.
                    292: <li> Detect gcc >=2.96 and enable -march=pentiumpro for relevant
                    293:      x86s.  (A bug in gcc 2.95.2 prevents it being used
                    294:      unconditionally.)
                    295: <li> Build multiple variants of the library under certain systems.
                    296:      An example is -n32, -o32, and -64 on Irix.
                    297: <li> There's a few filenames that don't fit in 14 chars, if this
                    298:      matters.
                    299: <li> Enable support for FORTRAN versions of mpn files (eg. for
                    300:      mpn/cray/mulww.f).  Add "f" to the mpn path searching, run AC_PROG_F77 if
                    301:      such a file is found.  Automake will generate some of what's needed in the
                    302:      makefiles, but libtool doesn't know fortran and so rules like the current
                    303:      ".asm.lo" will be needed.
                    304: <li> Only run GMP_PROG_M4 if it's needed, ie. if there's .asm files
                    305:      selected from the mpn path.  This might help say a generic C
                    306:      build on weird systems.
                    307: </ul>
                    308:
                    309: <p> In general, getting the exact right configuration, passing the
                    310: exact right options to the compiler, etc, might mean that the GMP
                    311: performance more than doubles.
                    312:
                    313: <p> When testing, make sure to test at least the following for all out
                    314: target machines: (1) Both gcc and cc (and c89).  (2) Both 32-bit mode
                    315: and 64-bit mode (such as -n32 vs -64 under Irix). (3) Both the system
                    316: `make' and GNU `make'. (4) With and without GNU binutils.
                    317:
                    318:
                    319: <h4>Miscellaneous</h4>
                    320: <ul>
                    321:
                    322: <li> Work on the way we build the library.  We now do it building
                    323:      convenience libraries but then listing all the object files a
                    324:      second time in the top level Makefile.am.
                    325: <li> Get rid of mp[zq]/sub.c, and instead define a compile parameter to
                    326:      mp[zq]/add.c to decide whether it will add or subtract.  Will decrease
                    327:      redundancy.  Similarly in other places.
                    328: <li> Make <code>mpz_div</code> and <code>mpz_divmod</code> use rounding
                    329:      analogous to <code>mpz_mod</code>.  Document, and list as an
                    330:      incompatibility.
                    331: <li> Maybe make mpz_pow_ui.c more like mpz/ui_pow_ui.c, or write new
                    332:      mpn/generic/pow_ui.
                    333: <li> Make mpz_invert call mpn_gcdext directly.
                    334: <li> Make a build option to enable execution profiling with gprof.  In
                    335:      particular look at getting the right <code>mcount</code> call at
                    336:      the start of each assembler subroutine (for important targets at
                    337:      least).
                    338: </ul>
                    339:
                    340:
                    341: <h4>Aids to Debugging</h4>
                    342: <ul>
                    343: <li> Make an option for stack-alloc.c to call <code>malloc</code>
                    344:      separately for each <code>TMP_ALLOC</code> block, so a redzoning
                    345:      malloc debugger could be used during development.
                    346: <li> Add <code>ASSERT</code>s at the start of each user-visible
                    347:      mpz/mpq/mpf function to check the validity of each
                    348:      <code>mp?_t</code> parameter, in particular to check they've been
                    349:      <code>mp?_init</code>ed.  This might catch elementary mistakes in
                    350:      user programs.  Care would need to be taken over
                    351:      <code>MPZ_TMP_INIT</code>ed variables used internally.
                    352: </ul>
                    353:
                    354:
                    355: <h4>Documentation</h4>
                    356: <ul>
                    357: <li> Document conventions, like that <code> unsigned long int</code> is used for
                    358:      bit counts/ranges, and that <code>mp_size_t</code> is used for limb counts.
                    359: <li> <code>mpz_inp_str</code> (etc) doesn't say when it stops reading digits.
                    360: </ul>
                    361:
                    362: <hr>
                    363:
                    364: <table width="100%">
                    365:   <tr>
                    366:     <td>
                    367:       <font size=2>
                    368:       Please send comments about this page to
                    369:       <a href="mailto:tege@swox.com">tege@swox.com</a>.<br>
                    370:       Copyright (C) 1999, 2000 Torbjörn Granlund.
                    371:       </font>
                    372:     </td>
                    373:     <td align=right>
                    374:     </td>
                    375:   </tr>
                    376: </table>
                    377:
                    378: </body>
                    379: </html>

FreeBSD-CVSweb <freebsd-cvsweb@FreeBSD.org>