Annotation of OpenXM_contrib/gmp/doc/tasks.html, Revision 1.1.1.1
1.1 maekawa 1: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
2: <html>
3: <head>
4: <title>
5: GMP Itemized Development Tasks
6: </title>
7: </head>
8: <body bgcolor=lightgreen>
9:
10: <center>
11: <h1>
12: GMP Itemized Development Tasks
13: </h1>
14: </center>
15:
16: <comment>
17: An up-to-date html version of this file is available at
18: <a href="http://www.swox.com/gmp/tasks.html">http://www.swox.com/gmp/tasks.html</a>.
19: </comment>
20:
21: <p> This file lists itemized GMP development tasks. Not all the tasks
22: listed here are suitable for volunteers, but many of them are.
23: Please see the <a href="projects.html">projects file</a> for more
24: sizeable projects.
25:
26: <h4>Correctness and Completeness</h4>
27: <ul>
28: <li> HPUX 10.20 assembler requires a `.LEVEL 1.1' directive for accepting the
29: new instructions. Unfortunately, the HPUX 9 assembler as well as earlier
30: assemblers reject that directive. How very clever of HP! We will have to
31: pass assembler options, and make sure it works with new and old systems
32: and GNU assembler.
33: <li> The various reuse.c tests need to force reallocation by calling
34: <code>_mpz_realloc</code> with a small (1 limb) size.
35: <li> One reuse case is missing from mpX/tests/reuse.c: <code>mpz_XXX(a,a,a)</code>.
36: <li> When printing mpf_t numbers with exponents > 2^53 on machines with 64-bit
37: <code>mp_exp_t</code>, the precision of
38: <code>__mp_bases[base].chars_per_bit_exactly</code> is insufficient and
39: <code>mpf_get_str</code> aborts. Detect and compensate.
40: <li> Fix <code>mpz_get_si</code> to work properly for MIPS N32 ABI (and other
41: machines that use <code>long long</code> for storing limbs.)
42: <li> Make the string reading functions allow the `0x' prefix when the base is
43: explicitly 16. They currently only allow that prefix when the base is
44: unspecified.
45: <li> In the development sources, we return abs(a%b) in the
46: <code>mpz_*_ui</code> division routines. Perhaps make them return the
47: real remainder instead? Changes return type to <code>signed long int</code>.
48: <li> <code>mpf_eq</code> is not always correct, when one operand is
49: 1000000000... and the other operand is 0111111111..., i.e., extremely
50: close. There is a special case in <code>mpf_sub</code> for this
51: situation; put similar code in <code>mpf_eq</code>.
52: <li> mpf_eq doesn't implement what gmp.texi specifies. It should not use just
53: whole limbs, but partial limbs.
54: <li> Install Alpha assembly changes (prec/gmp-alpha-patches).
55: <li> NeXT has problems with newlines in asm strings in longlong.h. Also,
56: <code>__builtin_constant_p</code> is unavailable? Same problem with MacOS
57: X.
58: <li> Shut up SGI's compiler by declaring <code>dump_abort</code> in
59: mp?/tests/*.c.
60: <li> <code>mpz_get_si</code> returns 0x80000000 for -0x100000000.
61: </ul>
62:
63:
64:
65: <h4>Machine Independent Optimization</h4>
66: <ul>
67: <li> In hundreds of places in the code, we invoke count_leading_zeros and then
68: check if the returned count is zero. Instead check the most significant
69: bit of the operand, and avoid invoking <code>count_leading_zeros</code> if
70: the bit is set. This is an optimization on all machines, and significant
71: on machines with slow <code>count_leading_zeros</code>.
72: <li> In a couple of places <code>count_trailing_zeros</code> is used
73: on more or less uniformly distributed numbers. For some CPUs
74: <code>count_trailing_zeros</code> is slow and it's probably worth
75: handling the frequently occurring 0 to 2 trailing zeros cases specially.
76: <li> Change all places that use <code>udiv_qrnnd</code> for inverting limbs to
77: instead use <code>invert_limb</code>.
78: <li> Reorganize longlong.h so that we can inline the operations even for the
79: system compiler. When there is no such compiler feature, make calls to
80: stub functions. Write such stub functions for as many machines as
81: possible.
82: <li> Rewrite <code>umul_ppmm</code> to use floating-point for generating the
83: most significant limb (if <code>BITS_PER_MP_LIMB</code> <= 52 bits).
84: (Peter Montgomery has some ideas on this subject.)
85: <li> Improve the default <code>umul_ppmm</code> code in longlong.h: Add partial
86: products with fewer operations.
87: <li> Write new <code>mpn_get_str</code> and <code>mpn_set_str</code> running in
88: the sub O(n^2) range, using some divide-and-conquer approach, preferably
89: without using division.
90: <li> Copy tricky code for converting a limb from development version of
91: <code>mpn_get_str</code> to mpf/get_str. (Talk to Torbjörn about this.)
92: <li> Consider inlining these functions: <code>mpz_size</code>,
93: <code>mpz_set_ui</code>, <code>mpz_set_q</code>, <code>mpz_clear</code>,
94: <code>mpz_init</code>, <code>mpz_get_ui</code>, <code>mpz_scan0</code>,
95: <code>mpz_scan1</code>, <code>mpz_getlimbn</code>,
96: <code>mpz_init_set_ui</code>, <code>mpz_perfect_square_p</code>,
97: <code>mpz_popcount</code>, <code>mpf_size</code>,
98: <code>mpf_get_prec</code>, <code>mpf_set_prec_raw</code>,
99: <code>mpf_set_ui</code>, <code>mpf_init</code>, <code>mpf_init2</code>,
100: <code>mpf_clear</code>, <code>mpf_set_si</code>.
101: <li> <code>mpz_powm</code> and <code>mpz_powm_ui</code> aren't very
102: fast on one or two limb moduli, due to a lot of function call
103: overheads. These could perhaps be handled as special cases.
104: <li> <code>mpz_powm</code> and <code>mpz_powm_ui</code> want better
105: algorithm selection, and the latter should use REDC. Both could
106: change to use an <code>mpn_powm</code> and <code>mpn_redc</code>.
107: <li> <code>mpn_gcd</code> might be able to be sped up on small to
108: moderate sizes by improving <code>find_a</code>, possibly just by
109: providing an alternate implementation for CPUs with slowish
110: <code>count_leading_zeros</code>.
111: <li> Implement a cache localized evaluate and interpolate for the
112: toom3 <code>USE_MORE_MPN</code> code. The necessary
113: right-to-left <code>mpn_divexact_by3c</code> exists.
114: <li> <code>mpn_mul_basecase</code> on NxM with big N but small M could try for
115: better cache locality by taking N piece by piece. The current code could
116: be left available for CPUs without caching. Depending how karatsuba etc
117: is applied to unequal size operands it might be possible to assume M is
118: always smallish.
119: </ul>
120:
121:
122: <h4>Machine Dependent Optimization</h4>
123: <ul>
124: <li> Run the `tune' utility for more compiler/CPU combinations. We would like
125: to have gmp-mparam.h files in practically every implementation specific
126: mpn subdirectory, and repeat each *_THRESHOLD for gcc and the system
127: compiler. See the `tune' top-level directory for more information.
128: <li> Alpha: Rewrite <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
129: <code>mpn_mul_1</code> for the 21264. On 21264, they should run at 4, 3,
130: and 3 cycles/limb respectively, if the code is unrolled properly. (Ask
131: Torbjörn for his xm.s and xam.s skeleton files.)
132: <li> Alpha: Rewrite <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
133: <code>mpn_mul_1</code> for the 21164. This should use both integer
134: multiplies and floating-point multiplies. For the floating-point
135: operations, the single-limb multiplier should be split into three 21-bit
136: chunks.
137: <li> UltraSPARC: Rewrite 64-bit <code>mpn_addmul_1</code>,
138: <code>mpn_submul_1</code>, and <code>mpn_mul_1</code>. Should use
139: floating-point operations, and split the invariant single-limb multiplier
140: into 21-bit chunks. Should give about 18 cycles/limb, but the pipeline
141: will become very deep. (Torbjörn has C code that is useful as a starting
142: point.)
143: <li> UltraSPARC: Rewrite <code>mpn_lshift</code> and <code>mpn_rshift</code>.
144: Should give 2 cycles/limb. (Torbjörn has code that just needs to be
145: finished.)
146: <li> SPARC32/V9: Find out why the speed of <code>mpn_addmul_1</code>
147: and the other multiplies varies so much on successive sizes.
148: <li> PA64: Improve <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
149: <code>mpn_mul_1</code>. The current development code runs at 11
150: cycles/limb, which is already very good. But it should be possible to
151: saturate the cache, which will happen at 7.5 cycles/limb.
152: <li> Sparc & SparcV8: Enable umul.asm for native cc. The generic
153: longlong.h umul_ppmm is suspected to be causing sqr_basecase to
154: be slower than mul_basecase.
155: <li> UltraSPARC: Write <code>umul_ppmm</code>. Important in particular for
156: <code>mpn_sqr_basecase</code>. Using four "<code>mulx</code>"s either
157: with an asm block or via the generic C code is about 90 cycles.
158: <li> Implement <code>mpn_mul_basecase</code> and <code>mpn_sqr_basecase</code>
159: for important machines. Helping the generic sqr_basecase.c with an
160: <code>mpn_sqr_diagonal</code> might be enough for some of the RISCs.
161: <li> POWER2/POWER2SC: Schedule <code>mpn_lshift</code>/<code>mpn_rshift</code>.
162: Will bring time from 1.75 to 1.25 cycles/limb.
163: <li> X86: Optimize non-MMX <code>mpn_lshift</code> for shifts by 1. (See Pentium code.)
164: <li> Alpha: Optimize <code>count_leading_zeros</code>.
165: <li> Alpha: Optimize <code>udiv_qrnnd</code>. (Ask Torbjörn for the file
166: test-udiv-preinv.c as a starting point.)
167: <li> R10000/R12000: Rewrite <code>mpn_add_n</code> and <code>mpn_sub_n</code>.
168: It should just require 3 cycles/limb, but the current code propagates
169: carry poorly. The trick is to add carry-in later than we do now,
170: decreasing the number of operations used to generate carry-out from 4 to
171: to 3.
172: <li> PPC32: Try using fewer registers in the current <code>mpn_lshift</code>.
173: The pipeline is now extremely deep, perhaps unnecessarily deep. Also, r5
174: is unused. (Ask Torbjörn for a copy of the current code.)
175: <li> PPC32: Write <code>mpn_rshift</code> based on new <code>mpn_lshift</code>.
176: <li> PPC32: Rewrite <code>mpn_add_n</code> and <code>mpn_sub_n</code>. Should
177: run at just 3.25 cycles/limb. (Ask for xxx-add_n.s as a starting point.)
178: <li> Fujitsu VPP: Vectorize main functions, perhaps in assembly language.
179: <li> Fujitsu VPP: Write <code>mpn_mul_basecase</code> and
180: <code>mpn_sqr_basecase</code>. This should use a "vertical multiplication
181: method", to avoid carry propagation. splitting one of the operands in
182: 11-bit chunks.
183: <li> Cray: Vectorize main functions, perhaps in assembly language.
184: <li> Cray: Write <code>mpn_mul_basecase</code> and
185: <code>mpn_sqr_basecase</code>. Same comment applies to this as to the
186: same functions for Fujitsu VPP.
187: <li> Improve <code>count_leading_zeros</code> for 64-bit machines:
188:
189: <pre>
190: if ((x >> W_TYPE_SIZE-W_TYPE_SIZE/2) == 0) { x <<= W_TYPE_SIZE/2; cnt += W_TYPE_SIZE/2}
191: if ((x >> W_TYPE_SIZE-W_TYPE_SIZE/4) == 0) { x <<= W_TYPE_SIZE/4; cnt += W_TYPE_SIZE/4}
192: ... </pre>
193:
194: </ul>
195:
196: <h4>New Functionality</h4>
197: <ul>
198: <li> <code>mpz_get_nth_ui</code>. Return the nth word (not necessarily the nth limb).
199: <li> Maybe add <code>mpz_crr</code> (Chinese Remainder Reconstruction).
200: <li> Let `0b' and `0B' mean binary input everywhere.
201: <li> Add <code>mpq_set_f</code> for assignment from <code>mpf_t</code>
202: (cf. <code>mpq_set_d</code>).
203: <li> Maybe make <code>mpz_init</code> (and <code>mpq_init</code>) do lazy
204: allocation. Set <code>ALLOC(var)</code> to 0, and have
205: <code>mpz_realloc</code> special-handle that case. Update functions that
206: rely on a single limb (like <code>mpz_set_ui</code>,
207: <code>mpz_[tfc]div_r_ui</code>, and others).
208: <li> Add <code>mpf_out_raw</code> and <code>mpf_inp_raw</code>. Make sure
209: format is portable between 32-bit and 64-bit machines, and between
210: little-endian and big-endian machines.
211: <li> Handle numeric exceptions: Call an error handler, and/or set
212: <code>gmp_errno</code>.
213: <li> Implement <code>gmp_fprintf</code>, <code>gmp_sprintf</code>, and
214: <code>gmp_snprintf</code>. Think about some sort of wrapper
215: around <code>printf</code> so it and its several variants don't
216: have to be completely reimplemented.
217: <li> Implement some <code>mpq</code> input and output functions.
218: <li> Implement a full precision <code>mpz_kronecker</code>, leave
219: <code>mpz_jacobi</code> for compatibility.
220: <li> Make the mpn logops and copys available in gmp.h. Since they can
221: be either library functions or inlines, gmp.h would need to be
222: generated from a gmp.in based on what's in the library. gmp.h
223: would still be compiler-independent though.
224: <li> Make versions of <code>mpz_set_str</code> etc taking string
225: lengths rather than null-terminators.
226: <li> Consider changing the thresholds to apply the simpler algorithm when
227: "<code><=</code>" rather than "<code><</code>", so a threshold can
228: be set to <code>MP_SIZE_T_MAX</code> to get only the simpler code (the
229: compiler will know <code>size <= MP_SIZE_T_MAX</code> is always true).
230: <li> <code>mpz_cdiv_q_2exp</code> and <code>mpz_cdiv_r_2exp</code>
231: could be implemented to match the corresponding tdiv and fdiv.
232: Maybe some code sharing is possible.
233: </ul>
234:
235:
236: <h4>Configuration</h4>
237:
238: <ul>
239: <li> Improve config.guess. We want to recognize the processor very
240: accurately, more accurately than other GNU packages.
241: config.guess does not currently make the distinctions we would
242: like it to do and a --target often needs to be set explicitly.
243:
244: For example, "sparc" is not very useful as a machine architecture
245: denotation. We want to distinguish old 32-bit SPARC without
246: multiply support from newer 32-bit SPARC with such support. We
247: want to recognize a SuperSPARC, since its implementation of the
248: UDIV instruction is not complete, and will trap to the OS kernel
249: for certain operands. And we want to recognize 64-bit capable
250: SPARC processors as such. While the assembly routines can use
251: 64-bit operations on all 64-bit SPARC processors, one can not use
252: 64-bit limbs under all operating system. E.g., Solaris 2.5 and
253: 2.6 doesn't preserve the upper 32 bits of most processor
254: registers. For SPARC we therefore sometimes need to choose GMP
255: configuration depending both on processor and operating system.
256:
257: <li> Remember to make sure config.sub accepts any output from config.guess.
258:
259: <li> Find out whether there's an alloca available and how to use it.
260: AC_FUNC_ALLOCA has various system dependencies covered, but we
261: don't want its alloca.c replacement. (One thing current cpp
262: tests don't cover: HPUX 10 C compiler supports alloca, but
263: cannot find any symbol to test in order to know if we're on
264: HPUX 10. Damn.)
265: <li> Identify Mips processor under Irix: `hinv -c processor'.
266: config.guess should say mips2, mips3, and mips4.
267: <li> Identify Alpha processor under OSF: "/usr/sbin/sizer -c".
268: Unfortunately, sizer is not available before some revision of
269: Dec Unix 4.0, and it also returns some rather cryptic names for
270: processors. Perhaps the <code>implver</code> and
271: <code>amask</code> assembly instructions are better, but that
272: doesn't differentiate between ev5 and ev56.
273: <li> Identify Sparc processors. config.guess should say supersparc,
274: microsparc, ultrasparc1, ultrasparc2, etc.
275: <li> Identify HPPA processors similarly.
276: <li> Get lots of information about a Solaris system: prtconf -vp
277: <li> For some target machines and some compilers, specific options
278: are needed (sparcv8/gcc needs -mv8, sparcv8/cc needs -cg92,
279: Irix64/cc needs -64, Irix32/cc might need -n32, etc). Some are
280: set already, add more, see configure.in.
281: <li> Options to be passed to the assembler (via the compiler, using
282: whatever syntax the compiler uses for passing options to the
283: assembler).
284: <li> On Solaris 7, check if gcc supports native v9 64-bit
285: arithmetic. If not compile using "cc -fast -xarch=v9".
286: (Problem: -fast requires that we link with -fast too, which
287: might not be very good. Pass "-xO4 -xtarget=native" instead?)
288: <li> Extend the "optional" compiler arguments to choose the first
289: that works from from a set, so when gcc gets athlon support it
290: can try -mcpu=athlon, -mcpu=pentiumpro, or -mcpu=i486,
291: whichever works.
292: <li> Detect gcc >=2.96 and enable -march=pentiumpro for relevant
293: x86s. (A bug in gcc 2.95.2 prevents it being used
294: unconditionally.)
295: <li> Build multiple variants of the library under certain systems.
296: An example is -n32, -o32, and -64 on Irix.
297: <li> There's a few filenames that don't fit in 14 chars, if this
298: matters.
299: <li> Enable support for FORTRAN versions of mpn files (eg. for
300: mpn/cray/mulww.f). Add "f" to the mpn path searching, run AC_PROG_F77 if
301: such a file is found. Automake will generate some of what's needed in the
302: makefiles, but libtool doesn't know fortran and so rules like the current
303: ".asm.lo" will be needed.
304: <li> Only run GMP_PROG_M4 if it's needed, ie. if there's .asm files
305: selected from the mpn path. This might help say a generic C
306: build on weird systems.
307: </ul>
308:
309: <p> In general, getting the exact right configuration, passing the
310: exact right options to the compiler, etc, might mean that the GMP
311: performance more than doubles.
312:
313: <p> When testing, make sure to test at least the following for all out
314: target machines: (1) Both gcc and cc (and c89). (2) Both 32-bit mode
315: and 64-bit mode (such as -n32 vs -64 under Irix). (3) Both the system
316: `make' and GNU `make'. (4) With and without GNU binutils.
317:
318:
319: <h4>Miscellaneous</h4>
320: <ul>
321:
322: <li> Work on the way we build the library. We now do it building
323: convenience libraries but then listing all the object files a
324: second time in the top level Makefile.am.
325: <li> Get rid of mp[zq]/sub.c, and instead define a compile parameter to
326: mp[zq]/add.c to decide whether it will add or subtract. Will decrease
327: redundancy. Similarly in other places.
328: <li> Make <code>mpz_div</code> and <code>mpz_divmod</code> use rounding
329: analogous to <code>mpz_mod</code>. Document, and list as an
330: incompatibility.
331: <li> Maybe make mpz_pow_ui.c more like mpz/ui_pow_ui.c, or write new
332: mpn/generic/pow_ui.
333: <li> Make mpz_invert call mpn_gcdext directly.
334: <li> Make a build option to enable execution profiling with gprof. In
335: particular look at getting the right <code>mcount</code> call at
336: the start of each assembler subroutine (for important targets at
337: least).
338: </ul>
339:
340:
341: <h4>Aids to Debugging</h4>
342: <ul>
343: <li> Make an option for stack-alloc.c to call <code>malloc</code>
344: separately for each <code>TMP_ALLOC</code> block, so a redzoning
345: malloc debugger could be used during development.
346: <li> Add <code>ASSERT</code>s at the start of each user-visible
347: mpz/mpq/mpf function to check the validity of each
348: <code>mp?_t</code> parameter, in particular to check they've been
349: <code>mp?_init</code>ed. This might catch elementary mistakes in
350: user programs. Care would need to be taken over
351: <code>MPZ_TMP_INIT</code>ed variables used internally.
352: </ul>
353:
354:
355: <h4>Documentation</h4>
356: <ul>
357: <li> Document conventions, like that <code> unsigned long int</code> is used for
358: bit counts/ranges, and that <code>mp_size_t</code> is used for limb counts.
359: <li> <code>mpz_inp_str</code> (etc) doesn't say when it stops reading digits.
360: </ul>
361:
362: <hr>
363:
364: <table width="100%">
365: <tr>
366: <td>
367: <font size=2>
368: Please send comments about this page to
369: <a href="mailto:tege@swox.com">tege@swox.com</a>.<br>
370: Copyright (C) 1999, 2000 Torbjörn Granlund.
371: </font>
372: </td>
373: <td align=right>
374: </td>
375: </tr>
376: </table>
377:
378: </body>
379: </html>
FreeBSD-CVSweb <freebsd-cvsweb@FreeBSD.org>