Most mpn subdirectories contain machine-dependent code, written in
assembly or C.  The `generic' subdirectory contains default code, used
when there is no machine-dependent replacement for a particular
machine.

There is one subdirectory for each ISA family.  Note that e.g., 32-bit SPARC
and 64-bit SPARC are very different ISA's, and thus cannot share any code.

A particular compile will only use code from one subdirectory, and the
`generic' subdirectory.  The ISA-specific subdirectories contain hierarchies of
directories for various architecture variants and implementations; the
top-most level contains code that runs correctly on all variants.

HOW TO WRITE FAST ASSEMBLY CODE FOR GMP

[This should ultimately be made into a chapter of the GMP manual.]

The most basic techniques are software pipelining and loop unrolling.

Software pipelining is the technique of scheduling instructions around
the branch point in a loop, so that consecutive iterations overlap.
It is very much like juggling.

Unrolling is useful when software pipelining does not get us close
enough to the peek performance of a processor's pipeline.  Unrolling
decreases the loop overhead, but also often allows a more even load on
a processor's functional units.

For processors with very few registers, software pipelining is not
feasible as it increases register pressure.

For superscalar machines, it is often the case that all available
execution capabilities are not used.  Scheduling some instructions
for these otherwise unused resources will never cost us anything.

Try to determine the alternative instructions that can be used for a
particular processor.  For GMP, the problem that presents most
challenges is propagating carry from one iteration to the next.
Explore the different possibilities for doing that with the available
instructions!

For wide superscalar processors, the performance might be completely
determined by the number of dependent instruction required from
accepting carry-in from the previous iteration until producing
carry-out for the next iteration.  This is particularly true for
simple operations like mpn_add_n and mpn_sub_n.  Some carry
propagation schemes require 4 instructions, translating to at least
four cycles per iterations.  Other schemes can propagate carry in two
cycles or even just one cycle.

Therefore, for wide superscalar processors, finding methods with
"shallow" carry propagation given an instruction set is often the
central problem we need to address.  The rest is just is hard coding
work.

[Describe: First find issue maps with desired performance
	   Then schedule for latency]