OpenXM_contrib/gmp/PROJECTS - annotate

Return to PROJECTS CVS log
Up to [local] / OpenXM_contrib / gmp
Annotation of OpenXM_contrib/gmp/PROJECTS, Revision 1.1

1.1     ! maekawa     1: IDEAS ABOUT THINGS TO WORK ON
        !             2:
        !             3: * mpq_cmp: Maybe the most sensible thing to do would be to multiply the, say,
        !             4:   4 most significant limbs of each operand and compare them.  If that is not
        !             5:   sufficient, do the same for 8 limbs, etc.
        !             6:
        !             7: * Write mpi, the Multiple Precision Interval Arithmetic layer.
        !             8:
        !             9: * Write `mpX_eval' that take lambda-like expressions and a list of operands.
        !            10:
        !            11: * As a general rule, recognize special operand values in mpz and mpf, and
        !            12:   use shortcuts for speed.  Examples: Recognize (small or all) 2^n in
        !            13:   multiplication and division.  Recognize small bases in mpz_pow_ui.
        !            14:
        !            15: * Implement lazy allocation?  mpz->d == 0 would mean no allocation made yet.
        !            16:
        !            17: * Maybe store one-limb numbers according to Per Bothner's idea:
        !            18:     struct {
        !            19:       mp_ptr d;
        !            20:       union {
        !            21:        mp_limb val;    /* if (d == NULL).  */
        !            22:         mp_size size;   /* Length of data array, if (d != NULL).  */
        !            23:       } u;
        !            24:     };
        !            25:   Problem: We can't normalize to that format unless we free the space
        !            26:   pointed to by d, and therefore small values will not be stored in a
        !            27:   canonical way.
        !            28:
        !            29: * Document complexity of all functions.
        !            30:
        !            31: * Add predicate functions mpz_fits_signedlong_p, mpz_fits_unsignedlong_p,
        !            32:   mpz_fits_signedint_p, etc.
        !            33:
        !            34:   mpz_floor (mpz, mpq), mpz_trunc (mpz, mpq), mpz_round (mpz, mpq).
        !            35:
        !            36: * Better random number generators.  There should be fast (like mpz_random),
        !            37:   very good (mpz_veryrandom), and special purpose (like mpz_random2).  Sizes
        !            38:   in *bits*, not in limbs.
        !            39:
        !            40: * It'd be possible to have an interface "s = add(a,b)" with automatic GC.
        !            41:   If the mpz_xinit routine remembers the address of the variable we could
        !            42:   walk-and-mark the list of remembered variables, and free the space
        !            43:   occupied by the remembered variables that didn't get marked.  Fairly
        !            44:   standard.
        !            45:
        !            46: * Improve speed for non-gcc compilers by defining umul_ppmm, udiv_qrnnd,
        !            47:   etc, to call __umul_ppmm, __udiv_qrnnd.  A typical definition for
        !            48:   umul_ppmm would be
        !            49:   #define umul_ppmm(ph,pl,m0,m1) \
        !            50:     {unsigned long __ph; (pl) = __umul_ppmm (&__ph, (m0), (m1)); (ph) = __ph;}
        !            51:   In order to maintain just one version of longlong.h (gmp and gcc), this
        !            52:   has to be done outside of longlong.h.
        !            53:
        !            54: Bennet Yee at CMU proposes:
        !            55: * mpz_{put,get}_raw for memory oriented I/O like other *_raw functions.
        !            56: * A function mpfatal that is called for exceptions.  Let the user override
        !            57:   a default definition.
        !            58:
        !            59: * Make all computation mpz_* functions return a signed int indicating if the
        !            60:   result was zero, positive, or negative?
        !            61:
        !            62: * Implement mpz_cmpabs, mpz_xor, mpz_to_double, mpz_to_si, mpz_lcm, mpz_dpb,
        !            63:   mpz_ldb, various bit string operations.  Also mpz_@_si for most @??
        !            64:
        !            65: * Add macros for looping efficiently over a number's limbs:
        !            66:        MPZ_LOOP_OVER_LIMBS_INCREASING(num,limb)
        !            67:          { user code manipulating limb}
        !            68:        MPZ_LOOP_OVER_LIMBS_DECREASING(num,limb)
        !            69:          { user code manipulating limb}
        !            70:
        !            71: Brian Beuning proposes:
        !            72:    1. An array of small primes
        !            73:    3. A function to factor a mpz_t.  [How do we return the factors?  Maybe
        !            74:       we just return one arbitrary factor?  In the latter case, we have to
        !            75:       use a data structure that records the state of the factoring routine.]
        !            76:    4. A routine to look for "small" divisors of an mpz_t
        !            77:    5. A 'multiply mod n' routine based on Montgomery's algorithm.
        !            78:
        !            79: Dough Lea proposes:
        !            80:    1. A way to find out if an integer fits into a signed int, and if so, a
        !            81:       way to convert it out.
        !            82:    2. Similarly for double precision float conversion.
        !            83:    3. A function to convert the ratio of two integers to a double.  This
        !            84:       can be useful for mixed mode operations with integers, rationals, and
        !            85:       doubles.
        !            86:
        !            87: Elliptic curve method description in the Chapter `Algorithms in Number
        !            88: Theory' in the Handbook of Theoretical Computer Science, Elsevier,
        !            89: Amsterdam, 1990.  Also in Carl Pomerance's lecture notes on Cryptology and
        !            90: Computational Number Theory, 1990.
        !            91:
        !            92: * Harald Kirsh suggests:
        !            93:   mpq_set_str (MP_RAT *r, char *numerator, char *denominator).
        !            94:
        !            95: * New function: mpq_get_ifstr (int_str, frac_str, base,
        !            96:   precision_in_som_way, rational_number).  Convert RATIONAL_NUMBER to a
        !            97:   string in BASE and put the integer part in INT_STR and the fraction part
        !            98:   in FRAC_STR.  (This function would do a division of the numerator and the
        !            99:   denominator.)
        !           100:
        !           101: * Should mpz_powm* handle negative exponents?
        !           102:
        !           103: * udiv_qrnnd: If the denominator is normalized, the n0 argument has very
        !           104:   little effect on the quotient.  Maybe we can assume it is 0, and
        !           105:   compensate at a later stage?
        !           106:
        !           107: * Better sqrt: First calculate the reciprocal square root, then multiply by
        !           108:   the operand to get the square root.  The reciprocal square root can be
        !           109:   obtained through Newton-Raphson without division.  To compute sqrt(A), the
        !           110:   iteration is,
        !           111:
        !           112:                                    2
        !           113:                   x   = x  (3 - A x )/2.
        !           114:                    i+1   i         i
        !           115:
        !           116:   The final result can be computed without division using,
        !           117:
        !           118:                     sqrt(A) = A x .
        !           119:                                  n
        !           120:
        !           121: * Newton-Raphson using multiplication: We get twice as many correct digits
        !           122:   in each iteration.  So if we square x(k) as part of the iteration, the
        !           123:   result will have the leading digits in common with the entire result from
        !           124:   iteration k-1.  A _mpn_mul_lowpart could help us take advantage of this.
        !           125:
        !           126: * Peter Montgomery: If 0 <= a, b < p < 2^31 and I want a modular product
        !           127:   a*b modulo p and the long long type is unavailable, then I can write
        !           128:
        !           129:          typedef   signed long slong;
        !           130:          typedef unsigned long ulong;
        !           131:          slong a, b, p, quot, rem;
        !           132:
        !           133:          quot = (slong) (0.5 + (double)a * (double)b / (double)p);
        !           134:          rem =  (slong)((ulong)a * (ulong)b - (ulong)p * (ulong)quot);
        !           135:          if (rem < 0} {rem += p; quot--;}
        !           136:
        !           137: * Speed modulo arithmetic, using Montgomery's method or my pre-inversion
        !           138:   method.  In either case, special arithmetic calls would be needed,
        !           139:   mpz_mmmul, mpz_mmadd, mpz_mmsub, plus some kind of initialization
        !           140:   functions.  Better yet: Write a new mpr layer.
        !           141:
        !           142: * mpz_powm* should not use division to reduce the result in the loop, but
        !           143:   instead pre-compute the reciprocal of the MOD argument and do reduced_val
        !           144:   = val-val*reciprocal(MOD)*MOD, or use Montgomery's method.
        !           145:
        !           146: * mpz_mod_2expplussi -- to reduce a bignum modulo (2**n)+s
        !           147:
        !           148: * It would be a quite important feature never to allocate more memory than
        !           149:   really necessary for a result.  Sometimes we can achieve this cheaply, by
        !           150:   deferring reallocation until the result size is known.
        !           151:
        !           152: * New macro in longlong.h: shift_rhl that extracts a word by shifting two
        !           153:   words as a unit.  (Supported by i386, i860, HP-PA, POWER, 29k.)  Useful
        !           154:   for shifting multiple precision numbers.
        !           155:
        !           156: * The installation procedure should make a test run of multiplication to
        !           157:   decide the threshold values for algorithm switching between the available
        !           158:   methods.
        !           159:
        !           160: * Fast output conversion of x to base B:
        !           161:     1. Find n, such that (B^n > x).
        !           162:     2. Set y to (x*2^m)/(B^n), where m large enough to make 2^n ~~ B^n
        !           163:     3. Multiply the low half of y by B^(n/2), and recursively convert the
        !           164:        result.  Truncate the low half of y and convert that recursively.
        !           165:   Complexity: O(M(n)log(n))+O(D(n))!
        !           166:
        !           167: * Improve division using Newton-Raphson.  Check out "Newton Iteration and
        !           168:   Integer Division" by Stephen Tate in "Synthesis of Parallel Algorithms",
        !           169:   Morgan Kaufmann, 1993 ("beware of some errors"...)
        !           170:
        !           171: * Improve implementation of Karatsuba's algorithm.  For most operand sizes,
        !           172:   we can reduce the number of operations by splitting differently.
        !           173:
        !           174: * Faster multiplication: The best approach is to first implement Toom-Cook.
        !           175:   People report that it beats Karatsuba's algorithm already at about 100
        !           176:   limbs.  FFT would probably never beat a well-written Toom-Cook (not even for
        !           177:   millions of bits).
        !           178:
        !           179: FFT:
        !           180: {
        !           181:   * Multiplication could be done with Montgomery's method combined with
        !           182:     the "three primes" method described in Lipson.  Maybe this would be
        !           183:     faster than to Nussbaumer's method with 3 (simple) moduli?
        !           184:
        !           185:   * Maybe the modular tricks below are not needed: We are using very
        !           186:     special numbers, Fermat numbers with a small base and a large exponent,
        !           187:     and maybe it's possible to just subtract and add?
        !           188:
        !           189:   * Modify Nussbaumer's convolution algorithm, to use 3 words for each
        !           190:     coefficient, calculating in 3 relatively prime moduli (e.g.
        !           191:     0xffffffff, 0x100000000, and 0x7fff on a 32-bit computer).  Both all
        !           192:     operations and CRR would be very fast with such numbers.
        !           193:
        !           194:   * Optimize the Schoenhage-Stassen multiplication algorithm.  Take advantage
        !           195:     of the real valued input to save half of the operations and half of the
        !           196:     memory.  Use recursive FFT with large base cases, since recursive FFT has
        !           197:     better memory locality.  A normal FFT get 100% cache misses for large
        !           198:     enough operands.
        !           199:
        !           200:   * In the 3-prime convolution method, it might sometimes be a win to use 2,
        !           201:     3, or 5 primes.  Imagine that using 3 primes would require a transform
        !           202:     length of 2^n.  But 2 primes might still sometimes give us correct
        !           203:     results with that same transform length, or 5 primes might allow us to
        !           204:     decrease the transform size to 2^(n-1).
        !           205:
        !           206:   To optimize floating-point based complex FFT we have to think of:
        !           207:
        !           208:   1. The normal implementation accesses all input exactly once for each of
        !           209:      the log(n) passes.  This means that we will get 0% cache hit when n >
        !           210:      our cache.  Remedy: Reorganize computation to compute partial passes,
        !           211:      maybe similar to a standard recursive FFT implementation.  Use a large
        !           212:      `base case' to make any extra overhead of this organization negligible.
        !           213:
        !           214:   2. Use base-4, base-8 and base-16 FFT instead of just radix-2.  This can
        !           215:      reduce the number of operations by 2x.
        !           216:
        !           217:   3. Inputs are real-valued.  According to Knuth's "Seminumerical
        !           218:      Algorithms", exercise 4.6.4-14, we can save half the memory and half
        !           219:      the operations if we take advantage of that.
        !           220:
        !           221:   4. Maybe make it possible to write the innermost loop in assembly, since
        !           222:      that could win us another 2x speedup.  (If we write our FFT to avoid
        !           223:      cache-miss (see #1 above) it might be logical to write the `base case'
        !           224:      in assembly.)
        !           225:
        !           226:   5. Avoid multiplication by 1, i, -1, -i.  Similarly, optimize
        !           227:      multiplication by (+-\/2 +- i\/2).
        !           228:
        !           229:   6. Put as many bits as possible in each double (but don't waste time if
        !           230:      that doesn't make the transform size become smaller).
        !           231:
        !           232:   7. For n > some large number, we will get accuracy problems because of the
        !           233:      limited precision of our floating point arithmetic.  This can easily be
        !           234:      solved by using the Karatsuba trick a few times until our operands
        !           235:      become small enough.
        !           236:
        !           237:   8. Precompute the roots-of-unity and store them in a vector.
        !           238: }
        !           239:
        !           240: * When a division result is going to be just one limb, (i.e. nsize-dsize is
        !           241:   small) normalization could be done in the division loop.
        !           242:
        !           243: * Never allocate temporary space for a source param that overlaps with a
        !           244:   destination param needing reallocation.  Instead malloc a new block for
        !           245:   the destination (and free the source before returning to the caller).
        !           246:
        !           247: * Parallel addition.  Since each processors have to tell it is ready to the
        !           248:   next processor, we can use simplified synchronization, and actually write
        !           249:   it in C:  For each processor (apart from the least significant):
        !           250:
        !           251:        while (*svar != my_number)
        !           252:          ;
        !           253:        *svar = my_number + 1;
        !           254:
        !           255:   The least significant processor does this:
        !           256:
        !           257:        *svar = my_number + 1;  /* i.e., *svar = 1 */
        !           258:
        !           259:   Before starting the addition, one processor has to store 0 in *svar.
        !           260:
        !           261:   Other things to think about for parallel addition: To avoid false
        !           262:   (cache-line) sharing, allocate blocks on cache-line boundaries.
        !           263:
        !           264: 
        !           265: Local Variables:
        !           266: mode: text
        !           267: fill-column: 77
        !           268: fill-prefix: "  "
        !           269: version-control: never
        !           270: End:
FreeBSD-CVSweb <freebsd-cvsweb@FreeBSD.org>