Annotation of OpenXM_contrib/gmp/PROJECTS, Revision 1.1
1.1 ! maekawa 1: IDEAS ABOUT THINGS TO WORK ON
! 2:
! 3: * mpq_cmp: Maybe the most sensible thing to do would be to multiply the, say,
! 4: 4 most significant limbs of each operand and compare them. If that is not
! 5: sufficient, do the same for 8 limbs, etc.
! 6:
! 7: * Write mpi, the Multiple Precision Interval Arithmetic layer.
! 8:
! 9: * Write `mpX_eval' that take lambda-like expressions and a list of operands.
! 10:
! 11: * As a general rule, recognize special operand values in mpz and mpf, and
! 12: use shortcuts for speed. Examples: Recognize (small or all) 2^n in
! 13: multiplication and division. Recognize small bases in mpz_pow_ui.
! 14:
! 15: * Implement lazy allocation? mpz->d == 0 would mean no allocation made yet.
! 16:
! 17: * Maybe store one-limb numbers according to Per Bothner's idea:
! 18: struct {
! 19: mp_ptr d;
! 20: union {
! 21: mp_limb val; /* if (d == NULL). */
! 22: mp_size size; /* Length of data array, if (d != NULL). */
! 23: } u;
! 24: };
! 25: Problem: We can't normalize to that format unless we free the space
! 26: pointed to by d, and therefore small values will not be stored in a
! 27: canonical way.
! 28:
! 29: * Document complexity of all functions.
! 30:
! 31: * Add predicate functions mpz_fits_signedlong_p, mpz_fits_unsignedlong_p,
! 32: mpz_fits_signedint_p, etc.
! 33:
! 34: mpz_floor (mpz, mpq), mpz_trunc (mpz, mpq), mpz_round (mpz, mpq).
! 35:
! 36: * Better random number generators. There should be fast (like mpz_random),
! 37: very good (mpz_veryrandom), and special purpose (like mpz_random2). Sizes
! 38: in *bits*, not in limbs.
! 39:
! 40: * It'd be possible to have an interface "s = add(a,b)" with automatic GC.
! 41: If the mpz_xinit routine remembers the address of the variable we could
! 42: walk-and-mark the list of remembered variables, and free the space
! 43: occupied by the remembered variables that didn't get marked. Fairly
! 44: standard.
! 45:
! 46: * Improve speed for non-gcc compilers by defining umul_ppmm, udiv_qrnnd,
! 47: etc, to call __umul_ppmm, __udiv_qrnnd. A typical definition for
! 48: umul_ppmm would be
! 49: #define umul_ppmm(ph,pl,m0,m1) \
! 50: {unsigned long __ph; (pl) = __umul_ppmm (&__ph, (m0), (m1)); (ph) = __ph;}
! 51: In order to maintain just one version of longlong.h (gmp and gcc), this
! 52: has to be done outside of longlong.h.
! 53:
! 54: Bennet Yee at CMU proposes:
! 55: * mpz_{put,get}_raw for memory oriented I/O like other *_raw functions.
! 56: * A function mpfatal that is called for exceptions. Let the user override
! 57: a default definition.
! 58:
! 59: * Make all computation mpz_* functions return a signed int indicating if the
! 60: result was zero, positive, or negative?
! 61:
! 62: * Implement mpz_cmpabs, mpz_xor, mpz_to_double, mpz_to_si, mpz_lcm, mpz_dpb,
! 63: mpz_ldb, various bit string operations. Also mpz_@_si for most @??
! 64:
! 65: * Add macros for looping efficiently over a number's limbs:
! 66: MPZ_LOOP_OVER_LIMBS_INCREASING(num,limb)
! 67: { user code manipulating limb}
! 68: MPZ_LOOP_OVER_LIMBS_DECREASING(num,limb)
! 69: { user code manipulating limb}
! 70:
! 71: Brian Beuning proposes:
! 72: 1. An array of small primes
! 73: 3. A function to factor a mpz_t. [How do we return the factors? Maybe
! 74: we just return one arbitrary factor? In the latter case, we have to
! 75: use a data structure that records the state of the factoring routine.]
! 76: 4. A routine to look for "small" divisors of an mpz_t
! 77: 5. A 'multiply mod n' routine based on Montgomery's algorithm.
! 78:
! 79: Dough Lea proposes:
! 80: 1. A way to find out if an integer fits into a signed int, and if so, a
! 81: way to convert it out.
! 82: 2. Similarly for double precision float conversion.
! 83: 3. A function to convert the ratio of two integers to a double. This
! 84: can be useful for mixed mode operations with integers, rationals, and
! 85: doubles.
! 86:
! 87: Elliptic curve method description in the Chapter `Algorithms in Number
! 88: Theory' in the Handbook of Theoretical Computer Science, Elsevier,
! 89: Amsterdam, 1990. Also in Carl Pomerance's lecture notes on Cryptology and
! 90: Computational Number Theory, 1990.
! 91:
! 92: * Harald Kirsh suggests:
! 93: mpq_set_str (MP_RAT *r, char *numerator, char *denominator).
! 94:
! 95: * New function: mpq_get_ifstr (int_str, frac_str, base,
! 96: precision_in_som_way, rational_number). Convert RATIONAL_NUMBER to a
! 97: string in BASE and put the integer part in INT_STR and the fraction part
! 98: in FRAC_STR. (This function would do a division of the numerator and the
! 99: denominator.)
! 100:
! 101: * Should mpz_powm* handle negative exponents?
! 102:
! 103: * udiv_qrnnd: If the denominator is normalized, the n0 argument has very
! 104: little effect on the quotient. Maybe we can assume it is 0, and
! 105: compensate at a later stage?
! 106:
! 107: * Better sqrt: First calculate the reciprocal square root, then multiply by
! 108: the operand to get the square root. The reciprocal square root can be
! 109: obtained through Newton-Raphson without division. To compute sqrt(A), the
! 110: iteration is,
! 111:
! 112: 2
! 113: x = x (3 - A x )/2.
! 114: i+1 i i
! 115:
! 116: The final result can be computed without division using,
! 117:
! 118: sqrt(A) = A x .
! 119: n
! 120:
! 121: * Newton-Raphson using multiplication: We get twice as many correct digits
! 122: in each iteration. So if we square x(k) as part of the iteration, the
! 123: result will have the leading digits in common with the entire result from
! 124: iteration k-1. A _mpn_mul_lowpart could help us take advantage of this.
! 125:
! 126: * Peter Montgomery: If 0 <= a, b < p < 2^31 and I want a modular product
! 127: a*b modulo p and the long long type is unavailable, then I can write
! 128:
! 129: typedef signed long slong;
! 130: typedef unsigned long ulong;
! 131: slong a, b, p, quot, rem;
! 132:
! 133: quot = (slong) (0.5 + (double)a * (double)b / (double)p);
! 134: rem = (slong)((ulong)a * (ulong)b - (ulong)p * (ulong)quot);
! 135: if (rem < 0} {rem += p; quot--;}
! 136:
! 137: * Speed modulo arithmetic, using Montgomery's method or my pre-inversion
! 138: method. In either case, special arithmetic calls would be needed,
! 139: mpz_mmmul, mpz_mmadd, mpz_mmsub, plus some kind of initialization
! 140: functions. Better yet: Write a new mpr layer.
! 141:
! 142: * mpz_powm* should not use division to reduce the result in the loop, but
! 143: instead pre-compute the reciprocal of the MOD argument and do reduced_val
! 144: = val-val*reciprocal(MOD)*MOD, or use Montgomery's method.
! 145:
! 146: * mpz_mod_2expplussi -- to reduce a bignum modulo (2**n)+s
! 147:
! 148: * It would be a quite important feature never to allocate more memory than
! 149: really necessary for a result. Sometimes we can achieve this cheaply, by
! 150: deferring reallocation until the result size is known.
! 151:
! 152: * New macro in longlong.h: shift_rhl that extracts a word by shifting two
! 153: words as a unit. (Supported by i386, i860, HP-PA, POWER, 29k.) Useful
! 154: for shifting multiple precision numbers.
! 155:
! 156: * The installation procedure should make a test run of multiplication to
! 157: decide the threshold values for algorithm switching between the available
! 158: methods.
! 159:
! 160: * Fast output conversion of x to base B:
! 161: 1. Find n, such that (B^n > x).
! 162: 2. Set y to (x*2^m)/(B^n), where m large enough to make 2^n ~~ B^n
! 163: 3. Multiply the low half of y by B^(n/2), and recursively convert the
! 164: result. Truncate the low half of y and convert that recursively.
! 165: Complexity: O(M(n)log(n))+O(D(n))!
! 166:
! 167: * Improve division using Newton-Raphson. Check out "Newton Iteration and
! 168: Integer Division" by Stephen Tate in "Synthesis of Parallel Algorithms",
! 169: Morgan Kaufmann, 1993 ("beware of some errors"...)
! 170:
! 171: * Improve implementation of Karatsuba's algorithm. For most operand sizes,
! 172: we can reduce the number of operations by splitting differently.
! 173:
! 174: * Faster multiplication: The best approach is to first implement Toom-Cook.
! 175: People report that it beats Karatsuba's algorithm already at about 100
! 176: limbs. FFT would probably never beat a well-written Toom-Cook (not even for
! 177: millions of bits).
! 178:
! 179: FFT:
! 180: {
! 181: * Multiplication could be done with Montgomery's method combined with
! 182: the "three primes" method described in Lipson. Maybe this would be
! 183: faster than to Nussbaumer's method with 3 (simple) moduli?
! 184:
! 185: * Maybe the modular tricks below are not needed: We are using very
! 186: special numbers, Fermat numbers with a small base and a large exponent,
! 187: and maybe it's possible to just subtract and add?
! 188:
! 189: * Modify Nussbaumer's convolution algorithm, to use 3 words for each
! 190: coefficient, calculating in 3 relatively prime moduli (e.g.
! 191: 0xffffffff, 0x100000000, and 0x7fff on a 32-bit computer). Both all
! 192: operations and CRR would be very fast with such numbers.
! 193:
! 194: * Optimize the Schoenhage-Stassen multiplication algorithm. Take advantage
! 195: of the real valued input to save half of the operations and half of the
! 196: memory. Use recursive FFT with large base cases, since recursive FFT has
! 197: better memory locality. A normal FFT get 100% cache misses for large
! 198: enough operands.
! 199:
! 200: * In the 3-prime convolution method, it might sometimes be a win to use 2,
! 201: 3, or 5 primes. Imagine that using 3 primes would require a transform
! 202: length of 2^n. But 2 primes might still sometimes give us correct
! 203: results with that same transform length, or 5 primes might allow us to
! 204: decrease the transform size to 2^(n-1).
! 205:
! 206: To optimize floating-point based complex FFT we have to think of:
! 207:
! 208: 1. The normal implementation accesses all input exactly once for each of
! 209: the log(n) passes. This means that we will get 0% cache hit when n >
! 210: our cache. Remedy: Reorganize computation to compute partial passes,
! 211: maybe similar to a standard recursive FFT implementation. Use a large
! 212: `base case' to make any extra overhead of this organization negligible.
! 213:
! 214: 2. Use base-4, base-8 and base-16 FFT instead of just radix-2. This can
! 215: reduce the number of operations by 2x.
! 216:
! 217: 3. Inputs are real-valued. According to Knuth's "Seminumerical
! 218: Algorithms", exercise 4.6.4-14, we can save half the memory and half
! 219: the operations if we take advantage of that.
! 220:
! 221: 4. Maybe make it possible to write the innermost loop in assembly, since
! 222: that could win us another 2x speedup. (If we write our FFT to avoid
! 223: cache-miss (see #1 above) it might be logical to write the `base case'
! 224: in assembly.)
! 225:
! 226: 5. Avoid multiplication by 1, i, -1, -i. Similarly, optimize
! 227: multiplication by (+-\/2 +- i\/2).
! 228:
! 229: 6. Put as many bits as possible in each double (but don't waste time if
! 230: that doesn't make the transform size become smaller).
! 231:
! 232: 7. For n > some large number, we will get accuracy problems because of the
! 233: limited precision of our floating point arithmetic. This can easily be
! 234: solved by using the Karatsuba trick a few times until our operands
! 235: become small enough.
! 236:
! 237: 8. Precompute the roots-of-unity and store them in a vector.
! 238: }
! 239:
! 240: * When a division result is going to be just one limb, (i.e. nsize-dsize is
! 241: small) normalization could be done in the division loop.
! 242:
! 243: * Never allocate temporary space for a source param that overlaps with a
! 244: destination param needing reallocation. Instead malloc a new block for
! 245: the destination (and free the source before returning to the caller).
! 246:
! 247: * Parallel addition. Since each processors have to tell it is ready to the
! 248: next processor, we can use simplified synchronization, and actually write
! 249: it in C: For each processor (apart from the least significant):
! 250:
! 251: while (*svar != my_number)
! 252: ;
! 253: *svar = my_number + 1;
! 254:
! 255: The least significant processor does this:
! 256:
! 257: *svar = my_number + 1; /* i.e., *svar = 1 */
! 258:
! 259: Before starting the addition, one processor has to store 0 in *svar.
! 260:
! 261: Other things to think about for parallel addition: To avoid false
! 262: (cache-line) sharing, allocate blocks on cache-line boundaries.
! 263:
! 264:
! 265: Local Variables:
! 266: mode: text
! 267: fill-column: 77
! 268: fill-prefix: " "
! 269: version-control: never
! 270: End:
FreeBSD-CVSweb <freebsd-cvsweb@FreeBSD.org>