Annotation of OpenXM_contrib/gmp/doc/assembly_code, Revision 1.1.1.1
Most mpn subdirectories contain machine-dependent code, written in
assembly or C. The `generic' subdirectory contains default code, used
when there is no machine-dependent replacement for a particular
machine.

There is one subdirectory for each ISA family. Note that, for example,
32-bit SPARC and 64-bit SPARC are very different ISAs, and thus cannot
share any code.

A particular compile will only use code from one ISA-specific
subdirectory and from the `generic' subdirectory. The ISA-specific
subdirectories contain hierarchies of directories for various
architecture variants and implementations; the top-most level contains
code that runs correctly on all variants.

HOW TO WRITE FAST ASSEMBLY CODE FOR GMP

[This should ultimately be made into a chapter of the GMP manual.]

The most basic techniques are software pipelining and loop unrolling.

Software pipelining is the technique of scheduling instructions around
the branch point in a loop, so that consecutive iterations overlap: the
early instructions of one iteration are issued before the late
instructions of the previous iteration have completed. It is very much
like juggling.

Unrolling is useful when software pipelining does not get us close
enough to the peak performance of a processor's pipeline. Unrolling
decreases the loop overhead, but also often allows a more even load on
a processor's functional units.

For processors with very few registers, software pipelining is often
not feasible, since it keeps values from more than one iteration live
at once and so increases register pressure.

For superscalar machines, it is often the case that not all available
execution capabilities are used. Scheduling some instructions onto
these otherwise unused resources costs us nothing.

Try to determine which alternative instruction sequences can be used on
a particular processor. For GMP, the most challenging problem is
propagating the carry from one iteration to the next. Explore the
different possibilities for doing that with the available
instructions!

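To make "different possibilities" concrete, here is a sketch in C of
two ways to compute the carry out of a single limb addition with carry
in, for a machine without a usable carry flag. Both are ordinary
textbook formulations rather than anything taken from GMP's sources,
and the names limb_t, addc_compare and addc_bitwise are invented for
the example. Which one maps to fewer or cheaper instructions depends
on the instruction set at hand.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t limb_t;  /* hypothetical stand-in for a machine limb */

/* Alternative 1: detect each wrap-around with an unsigned compare.
   Stores a + b + cin in *r and returns the carry out (0 or 1). */
limb_t addc_compare(limb_t *r, limb_t a, limb_t b, limb_t cin)
{
    assert(cin <= 1);
    limb_t s  = a + b;
    limb_t c1 = s < a;       /* a + b wrapped */
    limb_t t  = s + cin;
    limb_t c2 = t < s;       /* adding the carry in wrapped */
    *r = t;
    return c1 | c2;          /* at most one of the two can be set */
}

/* Alternative 2: derive the carry from the top bits of the operands
   and the result, using only logical operations and a shift. */
limb_t addc_bitwise(limb_t *r, limb_t a, limb_t b, limb_t cin)
{
    assert(cin <= 1);
    limb_t t = a + b + cin;
    *r = t;
    /* Carry out of the top bit: both operand bits set, or one operand
       bit set while the result bit came out clear. */
    return ((a & b) | ((a | b) & ~t)) >> 63;
}
```
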
For wide superscalar processors, the performance might be completely
determined by the number of dependent instructions required from
accepting carry-in from the previous iteration until producing
carry-out for the next iteration. This is particularly true for
simple operations like mpn_add_n and mpn_sub_n. Some carry
propagation schemes require four instructions, translating to at least
four cycles per iteration. Other schemes can propagate carry in two
cycles or even just one cycle.

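The following C sketch shows one way to keep the carry chain short
for an addition loop in the style of mpn_add_n. It is not GMP's
implementation; add_n_sketch and limb_t are hypothetical names. The
idea is to compute everything that does not depend on the incoming
carry ahead of time (the sum, a "generate" bit and a "propagate" bit),
so that only an AND and an OR lie between carry-in and carry-out. How
many machine instructions and cycles that chain costs depends on the
processor and on what the compiler emits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;  /* hypothetical stand-in for a machine limb */

/* r = a + b over n limbs, least significant limb first.
   Returns the carry out of the most significant limb (0 or 1). */
limb_t add_n_sketch(limb_t *r, const limb_t *a, const limb_t *b, size_t n)
{
    assert(n >= 1);

    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* Independent of the incoming carry: these can be scheduled
           well before the carry from limb i-1 is known. */
        limb_t s = a[i] + b[i];
        limb_t g = s < a[i];          /* generate: a + b alone wrapped */
        limb_t p = (s == UINT64_MAX); /* propagate: a carry in would wrap */

        /* Dependent on the incoming carry. The store is off the
           critical path; only the AND and OR feed the next limb. */
        r[i] = s + carry;
        carry = g | (p & carry);
    }
    return carry;
}
```
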
Therefore, for wide superscalar processors, finding methods with
"shallow" carry propagation given an instruction set is often the
central problem we need to address. The rest is just hard coding
work.

[Describe: First find issue maps with desired performance
Then schedule for latency]