
Annotation of OpenXM_contrib/gmp/doc/assembly_code, Revision 1.1.1.1

1.1       maekawa     1: Most mpn subdirectories contain machine-dependent code, written in
                      2: assembly or C.  The `generic' subdirectory contains default code, used
                      3: when there is no machine-dependent replacement for a particular
                      4: machine.
                      5:
                      6: There is one subdirectory for each ISA family.  Note that, e.g., 32-bit SPARC
                      7: and 64-bit SPARC are very different ISAs, and thus cannot share any code.
                      8:
                      9: A particular compile will only use code from one ISA-specific subdirectory
                     10: and from the `generic' subdirectory.  Each ISA-specific subdirectory contains a
                     11: hierarchy of directories for architecture variants and implementations; the
                     12: top-most level contains code that runs correctly on all variants.
                     13:
                     14: HOW TO WRITE FAST ASSEMBLY CODE FOR GMP
                     15:
                     16: [This should ultimately be made into a chapter of the GMP manual.]
                     17:
                     18: The most basic techniques are software pipelining and loop unrolling.
                     19:
                     20: Software pipelining is the technique of scheduling instructions around
                     21: the branch point in a loop, so that consecutive iterations overlap.
                     22: It is very much like juggling.
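As a rough illustration of the idea in portable C rather than assembly, the sketch below overlaps the loads of iteration i+1 with the arithmetic and store of iteration i.  The names limb_t and sketch_add_n_pipelined are hypothetical and are not part of GMP; the function assumes n >= 1.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;   /* hypothetical limb type, not GMP's own */

/* rp[i] = ap[i] + bp[i] + carry for i in [0, n); returns the final carry.
   Requires n >= 1.  Hypothetical helper, not the GMP routine. */
limb_t sketch_add_n_pipelined(limb_t *rp, const limb_t *ap,
                              const limb_t *bp, size_t n)
{
    limb_t carry = 0;

    /* Prologue: start iteration 0 by loading its operands. */
    limb_t a = ap[0];
    limb_t b = bp[0];

    for (size_t i = 0; i + 1 < n; i++) {
        /* Loads that belong to iteration i+1 are issued first ... */
        limb_t a_next = ap[i + 1];
        limb_t b_next = bp[i + 1];

        /* ... and overlap with the arithmetic and store of iteration i. */
        limb_t s = a + b;
        limb_t c1 = s < a;
        limb_t r = s + carry;
        limb_t c2 = r < s;
        rp[i] = r;
        carry = c1 | c2;

        a = a_next;
        b = b_next;
    }

    /* Epilogue: finish the last iteration, whose loads are already done. */
    limb_t s = a + b;
    limb_t c1 = s < a;
    limb_t r = s + carry;
    rp[n - 1] = r;
    return c1 | (r < s);
}
```

A C compiler is free to reorder these statements, so this only shows the shape of the schedule (prologue, overlapped loop body, epilogue); in assembly the placement of the loads is under our control.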
                     23:
                     24: Unrolling is useful when software pipelining does not get us close
                     25: enough to the peak performance of a processor's pipeline.  Unrolling
                     26: decreases the loop overhead, but also often allows a more even load on
                     27: a processor's functional units.
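A minimal sketch of four-way unrolling, again in C and with hypothetical names (limb_t, add_limb, sketch_add_n_unrolled4) that are not GMP's: the main loop pays for one counter update and one branch per four limbs, and a cleanup loop handles the remaining limbs.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;   /* hypothetical limb type, not GMP's own */

/* One limb of add-with-carry; *carry is 0 or 1 on entry and on exit. */
static inline limb_t add_limb(limb_t a, limb_t b, limb_t *carry)
{
    limb_t s = a + b;
    limb_t r = s + *carry;
    *carry = (s < a) | (r < s);
    return r;
}

/* rp[i] = ap[i] + bp[i] + carry for i in [0, n); returns the final carry. */
limb_t sketch_add_n_unrolled4(limb_t *rp, const limb_t *ap,
                              const limb_t *bp, size_t n)
{
    limb_t carry = 0;
    size_t i = 0;

    /* Main loop: one counter update and one branch per four limbs. */
    for (; i + 4 <= n; i += 4) {
        rp[i + 0] = add_limb(ap[i + 0], bp[i + 0], &carry);
        rp[i + 1] = add_limb(ap[i + 1], bp[i + 1], &carry);
        rp[i + 2] = add_limb(ap[i + 2], bp[i + 2], &carry);
        rp[i + 3] = add_limb(ap[i + 3], bp[i + 3], &carry);
    }

    /* Cleanup loop for the remaining n % 4 limbs. */
    for (; i < n; i++)
        rp[i] = add_limb(ap[i], bp[i], &carry);

    return carry;
}
```

The unroll factor of four is an arbitrary choice for the sketch; the right factor depends on the processor and on how many registers the unrolled body needs.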
                     28:
                     29: For processors with very few registers, software pipelining may not be
                     30: feasible, as it increases register pressure.
                     31:
                     32: For superscalar machines, it is often the case that not all available
                     33: execution capabilities are used.  Scheduling some instructions
                     34: for these otherwise unused resources will never cost us anything.
                     35:
                     36: Try to determine the alternative instructions that can be used for a
                     37: particular processor.  For GMP, the problem that presents the most
                     38: challenges is propagating carry from one iteration to the next.
                     39: Explore the different possibilities for doing that with the available
                     40: instructions!
                     41:
                     42: For wide superscalar processors, the performance might be completely
                     43: determined by the number of dependent instructions required from
                     44: accepting carry-in from the previous iteration until producing
                     45: carry-out for the next iteration.  This is particularly true for
                     46: simple operations like mpn_add_n and mpn_sub_n.  Some carry
                     47: propagation schemes require four instructions, translating to at least
                     48: four cycles per iteration.  Other schemes can propagate carry in two
                     49: cycles or even just one cycle.
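To make the notion of a dependent chain concrete, here is a sketch in C of two ways to add one limb with carry.  The function names and limb_t are hypothetical, cin is assumed to be 0 or 1, and the operation counts in the comments refer to the C operations, not to the instructions of any particular processor.  In the first scheme three dependent operations separate carry-in from carry-out; in the second, everything except an AND and an OR is computed without looking at carry-in.

```c
#include <stdint.h>

typedef uint64_t limb_t;   /* hypothetical limb type, not GMP's own */

/* Scheme A: carry-in passes through an add, a compare and an OR. */
static inline limb_t add_carry_deep(limb_t a, limb_t b, limb_t cin,
                                    limb_t *cout)
{
    limb_t s  = a + b;      /* independent of cin             */
    limb_t c1 = s < a;      /* independent of cin             */
    limb_t r  = s + cin;    /* cin -> r     (1st dependent op) */
    limb_t c2 = r < s;      /* r   -> c2    (2nd dependent op) */
    *cout = c1 | c2;        /* c2  -> cout  (3rd dependent op) */
    return r;
}

/* Scheme B: generate/propagate.  Only an AND and an OR lie between
   carry-in and carry-out; the result limb is off the carry chain. */
static inline limb_t add_carry_shallow(limb_t a, limb_t b, limb_t cin,
                                       limb_t *cout)
{
    limb_t s = a + b;                /* independent of cin              */
    limb_t g = s < a;                /* generate: a + b overflowed      */
    limb_t p = (s == UINT64_MAX);    /* propagate: s + 1 would overflow */
    *cout = g | (p & cin);           /* cin -> AND -> OR -> cout        */
    return s + cin;                  /* not needed for the next carry   */
}
```

Both functions return the same limb and the same carry for cin in {0, 1}; they differ only in how much work sits on the path from one iteration's carry to the next.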
                     50:
                     51: Therefore, for wide superscalar processors, finding methods with
                     52: "shallow" carry propagation given an instruction set is often the
                     53: central problem we need to address.  The rest is just hard coding
                     54: work.
                     55:
                     56: [Describe: First find issue maps with desired performance
                     57:           Then schedule for latency]
