Annotation of OpenXM_contrib/gmp/doc/assembly_code, Revision 1.1
1.1 ! maekawa 1: Most mpn subdirectories contain machine-dependent code, written in
! 2: assembly or C. The `generic' subdirectory contains default code, used
! 3: when there is no machine-dependent replacement for a particular
! 4: machine.
! 5:
! 6: There is one subdirectory for each ISA family. Note that, for example, 32-bit SPARC
! 7: and 64-bit SPARC are very different ISAs, and thus cannot share any code.
! 8:
! 9: A particular build uses code from only one ISA-specific subdirectory, plus the
! 10: `generic' subdirectory. The ISA-specific subdirectories contain hierarchies of
! 11: directories for various architecture variants and implementations; the
! 12: top-most level contains code that runs correctly on all variants.
! 13:
! 14: HOW TO WRITE FAST ASSEMBLY CODE FOR GMP
! 15:
! 16: [This should ultimately be made into a chapter of the GMP manual.]
! 17:
! 18: The most basic techniques are software pipelining and loop unrolling.
! 19:
! 20: Software pipelining is the technique of scheduling instructions around
! 21: the branch point in a loop, so that consecutive iterations overlap.
! 22: It is very much like juggling.
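
The overlap described above can be sketched in C, although real GMP code would do this in assembly. The function name com_n_pipelined and the use of uint64_t as the limb type are assumptions made for illustration only. The load for iteration i+1 is issued before the store for iteration i, with a prologue and an epilogue to start and drain the pipeline. A C compiler is free to reschedule this, so treat it as a picture of the structure rather than a tuned routine.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical example: rp[i] = ~ap[i] for 0 <= i < n, written so that
   the load for the next iteration overlaps the store for this one. */
void com_n_pipelined(uint64_t *rp, const uint64_t *ap, size_t n)
{
    if (n == 0)
        return;

    uint64_t x = ap[0];                 /* prologue: load for iteration 0 */

    for (size_t i = 0; i + 1 < n; i++) {
        uint64_t next = ap[i + 1];      /* load for iteration i+1 */
        rp[i] = ~x;                     /* compute and store for iteration i */
        x = next;
    }

    rp[n - 1] = ~x;                     /* epilogue: finish the last iteration */
}
```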
! 23:
! 24: Unrolling is useful when software pipelining does not get us close
! 25: enough to the peak performance of a processor's pipeline. Unrolling
! 26: decreases the loop overhead, but also often allows a more even load on
! 27: a processor's functional units.
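
As a minimal sketch of unrolling, again in C rather than assembly, the loop below handles four limbs per trip, so the counter update and branch are paid once per four limbs instead of once per limb. The name and_n_unrolled, the unroll factor of four, and the uint64_t limb type are assumptions chosen for illustration.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical example: rp[i] = ap[i] & bp[i] for 0 <= i < n,
   unrolled four times, with a cleanup loop for the leftover limbs. */
void and_n_unrolled(uint64_t *rp, const uint64_t *ap,
                    const uint64_t *bp, size_t n)
{
    size_t i = 0;

    /* Main loop: four independent operations per trip. */
    for (; i + 4 <= n; i += 4) {
        uint64_t r0 = ap[i]     & bp[i];
        uint64_t r1 = ap[i + 1] & bp[i + 1];
        uint64_t r2 = ap[i + 2] & bp[i + 2];
        uint64_t r3 = ap[i + 3] & bp[i + 3];
        rp[i]     = r0;
        rp[i + 1] = r1;
        rp[i + 2] = r2;
        rp[i + 3] = r3;
    }

    /* Cleanup loop: at most three remaining limbs. */
    for (; i < n; i++)
        rp[i] = ap[i] & bp[i];
}
```

The four operations in the main loop do not depend on one another, which is what gives a scheduler room to spread them across functional units.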
! 28:
! 29: For processors with very few registers, software pipelining is not
! 30: feasible as it increases register pressure.
! 31:
! 32: For superscalar machines, it is often the case that not all available
! 33: execution capabilities are used. Scheduling some instructions
! 34: for these otherwise unused resources costs us nothing.
! 35:
! 36: Try to determine the alternative instructions that can be used for a
! 37: particular processor. For GMP, the problem that presents the most
! 38: challenges is propagating carry from one iteration to the next.
! 39: Explore the different possibilities for doing that with the available
! 40: instructions!
! 41:
! 42: For wide superscalar processors, the performance might be completely
! 43: determined by the number of dependent instructions required from
! 44: accepting carry-in from the previous iteration until producing
! 45: carry-out for the next iteration. This is particularly true for
! 46: simple operations like mpn_add_n and mpn_sub_n. Some carry
! 47: propagation schemes require four instructions, translating to at least
! 48: four cycles per iteration. Other schemes can propagate carry in two
! 49: cycles or even just one cycle.
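
To make the dependent chain concrete, here is a portable C sketch of an addition loop in the style of mpn_add_n. It is not GMP's implementation; the name add_n_sketch and the limb_t typedef are assumptions for illustration. The comments mark which operations wait on the carry-in. In this scheme the chain from carry-in to carry-out is add, compare, or, so it is three operations deep, while the first add and compare can be done ahead of time.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint64_t limb_t;   /* hypothetical limb type, not GMP's mp_limb_t */

/* Adds {ap,n} and {bp,n}, stores the sum at rp, and returns the
   carry-out (0 or 1). */
limb_t add_n_sketch(limb_t *rp, const limb_t *ap, const limb_t *bp, size_t n)
{
    limb_t cy = 0;

    for (size_t i = 0; i < n; i++) {
        limb_t a  = ap[i];
        limb_t s  = a + bp[i];   /* independent of the carry-in */
        limb_t c1 = s < a;       /* independent of the carry-in */
        limb_t r  = s + cy;      /* 1st dependent operation */
        limb_t c2 = r < s;       /* 2nd dependent operation */
        cy = c1 | c2;            /* 3rd dependent operation: carry-out */
        rp[i] = r;
    }

    return cy;
}
```

On an ISA with an add-with-carry instruction the same chain can be much shorter, which is the kind of alternative worth exploring.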
! 50:
! 51: Therefore, for wide superscalar processors, finding methods with
! 52: "shallow" carry propagation given an instruction set is often the
! 53: central problem we need to address. The rest is just hard coding
! 54: work.
! 55:
! 56: [Describe: First find issue maps with desired performance
! 57: Then schedule for latency]