Most mpn subdirectories contain machine-dependent code, written in assembly or C. The `generic' subdirectory contains default code, used when there is no machine-dependent replacement for a particular machine.

There is one subdirectory for each ISA family. Note that, e.g., 32-bit SPARC and 64-bit SPARC are very different ISAs, and thus cannot share any code.

A particular compile will only use code from one ISA subdirectory and from the `generic' subdirectory. The ISA-specific subdirectories contain hierarchies of directories for various architecture variants and implementations; the top-most level contains code that runs correctly on all variants.


HOW TO WRITE FAST ASSEMBLY CODE FOR GMP

[This should ultimately be made into a chapter of the GMP manual.]

The most basic techniques are software pipelining and loop unrolling.

Software pipelining is the technique of scheduling instructions around the branch point in a loop, so that consecutive iterations overlap. It is very much like juggling.

Unrolling is useful when software pipelining does not get us close enough to the peak performance of a processor's pipeline. Unrolling decreases the loop overhead, and often also allows a more even load on a processor's functional units.

For processors with very few registers, software pipelining is not feasible, since it increases register pressure.

For superscalar machines, it is often the case that not all available execution capabilities are used. Scheduling some instructions for these otherwise unused resources will never cost us anything. Try to determine the alternative instructions that can be used for a particular processor.

For GMP, the problem that presents the most challenges is propagating carry from one iteration to the next. Explore the different possibilities for doing that with the available instructions!

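To make the carry problem concrete, here is a minimal sketch in portable C of a limb-by-limb addition loop. It is an illustration only, not GMP's actual code: the names limb_t and sketch_add_n are hypothetical, and a 64-bit limb is assumed. The comments mark which operations wait on the carry from the previous iteration.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;   /* hypothetical stand-in for a 64-bit limb type */

/* rp[0..n-1] = up[0..n-1] + vp[0..n-1]; returns the carry out (0 or 1).
   Sketch only: assumes the operands are stored least significant limb first. */
static limb_t
sketch_add_n (limb_t *rp, const limb_t *up, const limb_t *vp, size_t n)
{
  limb_t cy = 0;
  for (size_t i = 0; i < n; i++)
    {
      limb_t u = up[i];
      limb_t v = vp[i];
      limb_t s = u + v;      /* does not depend on cy */
      limb_t c1 = s < u;     /* does not depend on cy */
      limb_t r = s + cy;     /* 1st operation waiting on carry-in */
      limb_t c2 = r < s;     /* 2nd operation waiting on carry-in */
      cy = c1 | c2;          /* 3rd operation: carry-out */
      rp[i] = r;
    }
  return cy;
}
```

In this formulation three C-level operations sit between carry-in and carry-out. How many machine instructions that becomes depends on the compiler and the ISA; a processor with an add-with-carry instruction can usually do much better in hand-written assembly.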
For wide superscalar processors, the performance might be completely determined by the number of dependent instructions required from accepting carry-in from the previous iteration until producing carry-out for the next iteration. This is particularly true for simple operations like mpn_add_n and mpn_sub_n. Some carry propagation schemes require four dependent instructions, translating to at least four cycles per iteration. Other schemes can propagate carry in two cycles or even just one cycle.

Therefore, for wide superscalar processors, finding methods with "shallow" carry propagation for a given instruction set is often the central problem we need to address. The rest is just hard coding work.

[Describe: First find issue maps with the desired performance. Then schedule for latency.]
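As a rough illustration of what "shallow" means, the sketch below expresses one possible scheme in portable C, using generate and propagate signals. The names limb_t and sketch_add_n_gp are hypothetical, a 64-bit limb is assumed, and nothing here is taken from GMP's sources. A straightforward formulation adds the carry to the sum, compares, and then ORs, which puts three dependent operations on the carry path. Here only an AND and an OR wait on the incoming carry; everything else can be computed ahead of time on otherwise idle execution units.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;   /* hypothetical stand-in for a 64-bit limb type */

/* rp[0..n-1] = up[0..n-1] + vp[0..n-1]; returns the carry out (0 or 1).
   Sketch only: trades extra independent work for a shorter carry chain. */
static limb_t
sketch_add_n_gp (limb_t *rp, const limb_t *up, const limb_t *vp, size_t n)
{
  limb_t cy = 0;
  for (size_t i = 0; i < n; i++)
    {
      limb_t u = up[i];
      limb_t v = vp[i];
      limb_t s = u + v;              /* off the carry path */
      limb_t g = s < u;              /* generate: u + v itself overflowed */
      limb_t p = (s == UINT64_MAX);  /* propagate: s + 1 would overflow */
      rp[i] = s + cy;                /* uses cy, but the next carry does not wait on it */
      cy = g | (p & cy);             /* carry path: one AND, then one OR */
    }
  return cy;
}
```

Whether a compiler keeps this structure, and whether it is actually faster on a given processor, has to be measured. The sketch only shows that the same result can be reached with fewer dependent operations between carry-in and carry-out, at the cost of more instructions in total.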