
Annotation of OpenXM/src/kan96xx/gmp-2.0.2/PROJECTS, Revision 1.1.1.1

IDEAS ABOUT THINGS TO WORK ON

* mpq_cmp: Maybe the most sensible thing to do would be to multiply the, say,
  4 most significant limbs of each operand and compare them.  If that is not
  sufficient, do the same for 8 limbs, etc.

* Write mpi, the Multiple Precision Interval Arithmetic layer.

* Write `mpX_eval' functions that take lambda-like expressions and a list
  of operands.

* As a general rule, recognize special operand values in mpz and mpf, and
  use shortcuts for speed.  Examples: Recognize (small or all) 2^n in
  multiplication and division.  Recognize small bases in mpz_pow_ui.

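The 2^n recognition above only needs a cheap single-word test; a minimal
sketch in plain C (illustrative only, not GMP internals):

```c
#include <assert.h>

/* A power of two has exactly one bit set, so clearing its lowest set
   bit must leave zero.  */
static int
is_power_of_two (unsigned long x)
{
  return x != 0 && (x & (x - 1)) == 0;
}
```

For an mpz operand the same test would apply to the most significant limb,
with all lower limbs required to be zero.
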
* Implement lazy allocation?  mpz->d == 0 would mean no allocation made yet.

* Maybe store one-limb numbers according to Per Bothner's idea:
    struct {
      mp_ptr d;
      union {
        mp_limb val;    /* if (d == NULL).  */
        mp_size size;   /* Length of data array, if (d != NULL).  */
      } u;
    };
  Problem: We can't normalize to that format unless we free the space
  pointed to by d, and therefore small values will not be stored in a
  canonical way.

* Document complexity of all functions.

* Add predicate functions mpz_fits_signedlong_p, mpz_fits_unsignedlong_p,
  mpz_fits_signedint_p, etc.

  mpz_floor (mpz, mpq), mpz_trunc (mpz, mpq), mpz_round (mpz, mpq).

* Better random number generators.  There should be fast (like mpz_random),
  very good (mpz_veryrandom), and special purpose (like mpz_random2).  Sizes
  in *bits*, not in limbs.

* It'd be possible to have an interface "s = add(a,b)" with automatic GC.
  If the mpz_xinit routine remembers the address of the variable we could
  walk-and-mark the list of remembered variables, and free the space
  occupied by the remembered variables that didn't get marked.  Fairly
  standard.

* Improve speed for non-gcc compilers by defining umul_ppmm, udiv_qrnnd,
  etc., to call __umul_ppmm, __udiv_qrnnd.  A typical definition for
  umul_ppmm would be
  #define umul_ppmm(ph,pl,m0,m1) \
    {unsigned long __ph; (pl) = __umul_ppmm (&__ph, (m0), (m1)); (ph) = __ph;}
  In order to maintain just one version of longlong.h (gmp and gcc), this
  has to be done outside of longlong.h.

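For compilers with no wide multiply at all, __umul_ppmm itself can fall back
on half-limb arithmetic.  A sketch for 32-bit limbs, written as a function
rather than the macro above (the name and signature here are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* What umul_ppmm computes: the full 64-bit product of two 32-bit limbs,
   delivered as a (high, low) word pair, built from 16-bit half-limb
   products as a portable fallback would do it.  */
static void
umul_ppmm_32 (uint32_t *ph, uint32_t *pl, uint32_t m0, uint32_t m1)
{
  uint32_t m0l = m0 & 0xffff, m0h = m0 >> 16;
  uint32_t m1l = m1 & 0xffff, m1h = m1 >> 16;
  uint32_t ll = m0l * m1l;
  uint32_t lh = m0l * m1h;
  uint32_t hl = m0h * m1l;
  uint32_t hh = m0h * m1h;
  uint32_t mid = lh + hl;                    /* middle partial products */
  uint32_t carry = (mid < lh) ? 0x10000 : 0; /* carry out of mid is 2^48 */
  uint32_t lo = ll + (mid << 16);
  uint32_t hi = hh + (mid >> 16) + carry + (lo < ll);
  *ph = hi;
  *pl = lo;
}
```
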
Bennet Yee at CMU proposes:
* mpz_{put,get}_raw for memory oriented I/O like other *_raw functions.
* A function mpfatal that is called for exceptions.  Let the user override
  a default definition.

* Make all computation mpz_* functions return a signed int indicating if the
  result was zero, positive, or negative?

* Implement mpz_cmpabs, mpz_xor, mpz_to_double, mpz_to_si, mpz_lcm, mpz_dpb,
  mpz_ldb, various bit string operations.  Also mpz_@_si for most @??

* Add macros for looping efficiently over a number's limbs:
       MPZ_LOOP_OVER_LIMBS_INCREASING(num,limb)
         { user code manipulating limb }
       MPZ_LOOP_OVER_LIMBS_DECREASING(num,limb)
         { user code manipulating limb }

Brian Beuning proposes:
   1. An array of small primes
   2. A function to factor an mpz_t.  [How do we return the factors?  Maybe
      we just return one arbitrary factor?  In the latter case, we have to
      use a data structure that records the state of the factoring routine.]
   3. A routine to look for "small" divisors of an mpz_t
   4. A 'multiply mod n' routine based on Montgomery's algorithm.

Doug Lea proposes:
   1. A way to find out if an integer fits into a signed int, and if so, a
      way to convert it out.
   2. Similarly for double precision float conversion.
   3. A function to convert the ratio of two integers to a double.  This
      can be useful for mixed mode operations with integers, rationals, and
      doubles.

Elliptic curve method description in the Chapter `Algorithms in Number
Theory' in the Handbook of Theoretical Computer Science, Elsevier,
Amsterdam, 1990.  Also in Carl Pomerance's lecture notes on Cryptology and
Computational Number Theory, 1990.

* Harald Kirsh suggests:
  mpq_set_str (MP_RAT *r, char *numerator, char *denominator).

* New function: mpq_get_ifstr (int_str, frac_str, base,
  precision_in_some_way, rational_number).  Convert RATIONAL_NUMBER to a
  string in BASE and put the integer part in INT_STR and the fraction part
  in FRAC_STR.  (This function would do a division of the numerator and the
  denominator.)

* Should mpz_powm* handle negative exponents?

* udiv_qrnnd: If the denominator is normalized, the n0 argument has very
  little effect on the quotient.  Maybe we can assume it is 0, and
  compensate at a later stage?

* Better sqrt: First calculate the reciprocal square root, then multiply by
  the operand to get the square root.  The reciprocal square root can be
  obtained through Newton-Raphson without division.  To compute sqrt(A), the
  iteration is

       x[i+1] = x[i] * (3 - A * x[i]^2) / 2,

  which converges to 1/sqrt(A).  The final result can then be computed
  without division using

       sqrt(A) = A * x[n].

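In double precision the iteration looks as follows (illustrative only; the
real code would iterate on mpf/mpn operands, where the division by 2 is a
one-bit shift):

```c
#include <assert.h>
#include <math.h>

/* Division-free Newton-Raphson for 1/sqrt(A).  x0 must be a rough guess
   of 1/sqrt(A); each iteration roughly doubles the number of correct
   digits (quadratic convergence).  */
static double
recip_sqrt (double A, double x0, int iters)
{
  double x = x0;
  for (int i = 0; i < iters; i++)
    x = x * (3.0 - A * x * x) / 2.0;
  return x;
}
```

The caller then forms sqrt(A) as A * recip_sqrt(A, x0, n), with no
division anywhere.
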
* Newton-Raphson using multiplication: We get twice as many correct digits
  in each iteration.  So if we square x(k) as part of the iteration, the
  result will have the leading digits in common with the entire result from
  iteration k-1.  A _mpn_mul_lowpart could help us take advantage of this.

* Peter Montgomery: If 0 <= a, b < p < 2^31 and I want a modular product
  a*b modulo p and the long long type is unavailable, then I can write

         typedef   signed long slong;
         typedef unsigned long ulong;
         slong a, b, p, quot, rem;

         quot = (slong) (0.5 + (double)a * (double)b / (double)p);
         rem =  (slong)((ulong)a * (ulong)b - (ulong)p * (ulong)quot);
         if (rem < 0) {rem += p; quot--;}

* Speed up modular arithmetic, using Montgomery's method or my pre-inversion
  method.  In either case, special arithmetic calls would be needed,
  mpz_mmmul, mpz_mmadd, mpz_mmsub, plus some kind of initialization
  functions.  Better yet: Write a new mpr layer.

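A single-limb sketch of the Montgomery machinery that an mpz_mmmul would be
built on (function names are made up here; assumes a compiler providing
unsigned __int128 and an odd modulus n < 2^63):

```c
#include <assert.h>
#include <stdint.h>

/* -n^-1 mod 2^64 by Newton iteration on the 2-adic inverse (n odd).  */
static uint64_t
mont_ninv (uint64_t n)
{
  uint64_t inv = n;           /* correct to 3 bits for odd n */
  for (int i = 0; i < 5; i++)
    inv *= 2 - n * inv;       /* each step doubles the correct bit count */
  return -inv;
}

/* REDC: for T < n*2^64, return T * 2^-64 mod n without dividing by n.  */
static uint64_t
mont_redc (unsigned __int128 T, uint64_t n, uint64_t ninv)
{
  uint64_t m = (uint64_t) T * ninv;  /* makes T + m*n divisible by 2^64 */
  uint64_t t = (uint64_t) ((T + (unsigned __int128) m * n) >> 64);
  return t >= n ? t - n : t;
}

/* a*b mod n via the Montgomery domain: bring a into the domain (one real
   division, which batched code would amortize), then one REDC of the
   product cancels the extra 2^64 factor.  */
static uint64_t
mont_mulmod (uint64_t a, uint64_t b, uint64_t n)
{
  uint64_t ninv = mont_ninv (n);
  uint64_t aR = (uint64_t) (((unsigned __int128) a << 64) % n);
  return mont_redc ((unsigned __int128) aR * b, n, ninv);
}
```

The initialization the text mentions would precompute ninv (and powers of
2^64 mod n) once per modulus, so the per-multiply cost is just the REDC.
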
* mpz_powm* should not use division to reduce the result in the loop, but
  instead pre-compute the reciprocal of the MOD argument and do reduced_val
  = val - val*reciprocal(MOD)*MOD, or use Montgomery's method.

* mpz_mod_2expplussi -- to reduce a bignum modulo (2**n)+s

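The (2**n)+s reduction works because 2^n is congruent to -s modulo 2^n + s,
so high bits can be folded back with a multiply instead of a division.  A
word-sized sketch (name hypothetical; assumes x < 2^63, n < 32, and s < 2^n
so each folding step shrinks the value):

```c
#include <assert.h>
#include <stdint.h>

/* Reduce x modulo m = 2^n + s without dividing by m: split x = q*2^n + r
   at bit n and replace it by r - q*s, repeating until 0 <= t < m.  */
static uint64_t
mod_2exp_plus_s (uint64_t x, unsigned n, uint64_t s)
{
  int64_t m = (int64_t) (((uint64_t) 1 << n) + s);
  int64_t t = (int64_t) x;
  while (t < 0 || t >= m)
    {
      if (t < 0)
        t += m;                               /* final small corrections */
      else
        t = (t & ((1LL << n) - 1)) - (t >> n) * (int64_t) s;
    }
  return (uint64_t) t;
}
```
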
* It would be quite an important feature never to allocate more memory than
  is really necessary for a result.  Sometimes we can achieve this cheaply,
  by deferring reallocation until the result size is known.

* New macro in longlong.h: shift_rhl that extracts a word by shifting two
  words as a unit.  (Supported by i386, i860, HP-PA, POWER, 29k.)  Useful
  for shifting multiple precision numbers.

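In portable C the same extraction is two shifts and an OR; a sketch of what
the macro would expand to (written as a function for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Portable equivalent of a double-word shift instruction (x86 SHLD-style):
   from the 64-bit pair (hi,lo), extract the 32-bit word that starts cnt
   bits into it.  Requires 0 < cnt < 32, matching the usual hardware
   restriction on the shift count.  */
static uint32_t
shift_rhl (uint32_t hi, uint32_t lo, unsigned cnt)
{
  return (hi << cnt) | (lo >> (32 - cnt));
}
```

Shifting an n-limb number left by cnt bits is then n such extractions, one
per adjacent limb pair.
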
* The installation procedure should make a test run of multiplication to
  decide the threshold values for algorithm switching between the available
  methods.

* Fast output conversion of x to base B:
    1. Find n, such that (B^n > x).
    2. Set y to (x*2^m)/(B^n), where m is large enough to make 2^m ~~ B^n.
    3. Multiply the low half of y by B^(n/2), and recursively convert the
       result.  Truncate the low half of y and convert that recursively.
  Complexity: O(M(n)log(n))+O(D(n))!

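The divide-and-conquer structure behind this can be shown on machine words
(here a plain split by B^(n/2) with real divisions; on bignums the two
half-size conversions plus one fast division are what make it subquadratic):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Recursive conversion of x to exactly ndigits decimal characters (plus a
   terminating NUL) in out: split x by 10^(ndigits/2) and convert each half,
   zero-padding the low half.  x must have at most ndigits digits.  */
static void
to_decimal (unsigned long long x, unsigned ndigits, char *out)
{
  if (ndigits <= 4)
    {
      sprintf (out, "%0*llu", (int) ndigits, x);  /* small base case */
      return;
    }
  unsigned lo_digits = ndigits / 2;
  unsigned long long pow = 1;
  for (unsigned i = 0; i < lo_digits; i++)
    pow *= 10;                                    /* 10^lo_digits */
  to_decimal (x / pow, ndigits - lo_digits, out);               /* high half */
  to_decimal (x % pow, lo_digits, out + (ndigits - lo_digits)); /* low half */
}
```
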
* Improve division using Newton-Raphson.  Check out "Newton Iteration and
  Integer Division" by Stephen Tate in "Synthesis of Parallel Algorithms",
  Morgan Kaufmann, 1993 ("beware of some errors"...)

* Improve implementation of Karatsuba's algorithm.  For most operand sizes,
  we can reduce the number of operations by splitting differently.

* Faster multiplication: The best approach is to first implement Toom-Cook.
  People report that it beats Karatsuba's algorithm already at about 100
  limbs.  FFT would probably never beat a well-written Toom-Cook (not even
  for millions of bits).

FFT:
{
  * Multiplication could be done with Montgomery's method combined with
    the "three primes" method described in Lipson.  Maybe this would be
    faster than Nussbaumer's method with 3 (simple) moduli?

  * Maybe the modular tricks below are not needed: We are using very
    special numbers, Fermat numbers with a small base and a large exponent,
    and maybe it's possible to just subtract and add?

  * Modify Nussbaumer's convolution algorithm, to use 3 words for each
    coefficient, calculating in 3 relatively prime moduli (e.g.
    0xffffffff, 0x100000000, and 0x7fff on a 32-bit computer).  All
    operations and the CRR would be very fast with such numbers.

  * Optimize the Schoenhage-Strassen multiplication algorithm.  Take
    advantage of the real valued input to save half of the operations and
    half of the memory.  Use recursive FFT with large base cases, since
    recursive FFT has better memory locality.  A normal FFT gets 100% cache
    misses for large enough operands.

  * In the 3-prime convolution method, it might sometimes be a win to use 2,
    3, or 5 primes.  Imagine that using 3 primes would require a transform
    length of 2^n.  But 2 primes might still sometimes give us correct
    results with that same transform length, or 5 primes might allow us to
    decrease the transform size to 2^(n-1).

  To optimize floating-point based complex FFT we have to think of:

  1. The normal implementation accesses all input exactly once for each of
     the log(n) passes.  This means that we will get 0% cache hits when n
     exceeds our cache size.  Remedy: Reorganize the computation to compute
     partial passes, maybe similar to a standard recursive FFT
     implementation.  Use a large `base case' to make any extra overhead of
     this organization negligible.

  2. Use base-4, base-8 and base-16 FFT instead of just radix-2.  This can
     reduce the number of operations by 2x.

  3. Inputs are real-valued.  According to Knuth's "Seminumerical
     Algorithms", exercise 4.6.4-14, we can save half the memory and half
     the operations if we take advantage of that.

  4. Maybe make it possible to write the innermost loop in assembly, since
     that could win us another 2x speedup.  (If we write our FFT to avoid
     cache misses (see #1 above) it might be logical to write the `base
     case' in assembly.)

  5. Avoid multiplication by 1, i, -1, -i.  Similarly, optimize
     multiplication by the eighth roots of unity (+-1 +- i)/sqrt(2).

  6. Put as many bits as possible in each double (but don't waste time if
     that doesn't make the transform size become smaller).

  7. For n > some large number, we will get accuracy problems because of the
     limited precision of our floating point arithmetic.  This can easily be
     solved by using the Karatsuba trick a few times until our operands
     become small enough.

  8. Precompute the roots-of-unity and store them in a vector.
}

* When a division result is going to be just one limb (i.e., nsize-dsize is
  small), normalization could be done in the division loop.

* Never allocate temporary space for a source param that overlaps with a
  destination param needing reallocation.  Instead malloc a new block for
  the destination (and free the source before returning to the caller).

* Parallel addition.  Since each processor only has to tell the next
  processor that it is ready, we can use simplified synchronization, and
  actually write it in C.  For each processor (apart from the least
  significant):

       while (*svar != my_number)
         ;
       *svar = my_number + 1;

  The least significant processor does this:

       *svar = my_number + 1;  /* i.e., *svar = 1 */

  Before starting the addition, one processor has to store 0 in *svar.

  Other things to think about for parallel addition: To avoid false
  (cache-line) sharing, allocate blocks on cache-line boundaries.


Local Variables:
mode: text
fill-column: 77
fill-prefix: "  "
version-control: never
End:
