
Annotation of OpenXM_contrib/gmp/mpn/pa64/README, Revision 1.1.1.1

This directory contains mpn functions for 64-bit PA-RISC 2.0.

RELEVANT OPTIMIZATION ISSUES

The PA8000 has a multi-issue pipeline with large buffers for instructions
awaiting pending results.  Therefore, no latency scheduling is necessary
(and might actually be harmful).

Two 64-bit loads can be completed per cycle.  One 64-bit store can be
completed per cycle.  A store cannot complete in the same cycle as a load.

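The load/store rules above are enough to reproduce the cache bandwidth
figures quoted under STATUS.  The following is a minimal sketch in C, not
part of the GMP sources; the function name cycles_per_limb and the
per-limb operation counts are assumptions made for illustration.

```c
/* Cycles per limb imposed by the memory pipeline alone, under the rules
   stated above: two loads per cycle, one store per cycle, and no store
   completing in the same cycle as a load.  Hypothetical helper, not a
   GMP function.  */
static double
cycles_per_limb (int loads, int stores)
{
  return loads / 2.0 + stores;
}

/* Assumed operation counts per limb:
     shift:       1 load (source limb),      1 store (result limb)
     add/sub:     2 loads (two source limbs), 1 store (result limb)
   A count of 5 loads and 5 stores gives 7.5, which matches the figure
   quoted for multiplication, but that breakdown is a guess and is not
   stated in this README.  */
```

With these counts, cycles_per_limb (1, 1) is 1.5 and cycles_per_limb (2, 1)
is 2.0, in agreement with the shift and add/subtract numbers.
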
STATUS

* mpn_lshift, mpn_rshift, mpn_add_n, mpn_sub_n are all well-tuned and run at
  the peak cache bandwidth; 1.5 cycles/limb for shifting and 2.0 cycles/limb
  for add/subtract.

* The multiplication functions run at 11 cycles/limb.  The cache bandwidth
  allows 7.5 cycles/limb.  Perhaps it would be possible, using unrolling or
  better scheduling, to get closer to the cache bandwidth limit.

* xaddmul_1.S contains a quicker method for forming the 128-bit product.  It
  uses somewhat fewer operations, and keeps the carry flag live across the
  loop boundary.  But it seems hard to make it run more than 1/4 cycle faster
  than the old code.  Perhaps we really ought to unroll this loop by 2x?
  2x should suffice since register latency scheduling is never needed,
  but the unrolling would hide the store-load latency.  Here is a sketch,
  where A and B are the two copies of the loop body:

       1. A multiply and store 64-bit products
       2. B sum 64-bit products into 128-bit product
       3. B load 64-bit products to integer registers
       4. B multiply and store 64-bit products
       5. A sum 64-bit products into 128-bit product
       6. A load 64-bit products to integer registers
       7. goto 1

  In practice, adjacent groups (1 and 2, 2 and 3, etc.) will be interleaved
  for better instruction mix.
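
The data flow of the 2x unrolled loop can be illustrated in portable C.
This is only a sketch under stated assumptions, not the code in
xaddmul_1.S: it assumes that the "64-bit products" are the four 32x32
partial products of each limb multiplication, and the names mul_64x64,
accumulate and addmul_1_2x are hypothetical.  C cannot express the
store/load transfer through memory or the instruction interleaving, so
only the arithmetic and the A/B grouping are shown.

```c
#include <stddef.h>
#include <stdint.h>

/* Form the 128-bit product u * v from four 32x32->64 partial products.
   Returns the low limb and stores the high limb in *hi.  */
static uint64_t
mul_64x64 (uint64_t u, uint64_t v, uint64_t *hi)
{
  uint64_t ul = u & 0xffffffff, uh = u >> 32;
  uint64_t vl = v & 0xffffffff, vh = v >> 32;
  uint64_t p00 = ul * vl, p01 = ul * vh, p10 = uh * vl, p11 = uh * vh;

  /* Middle column; at most 3 * (2^32 - 1), so it cannot overflow.  */
  uint64_t mid = (p00 >> 32) + (p01 & 0xffffffff) + (p10 & 0xffffffff);

  *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
  return (mid << 32) | (p00 & 0xffffffff);
}

/* Add the product hi:lo and the incoming carry cy to *r.  Returns the
   new carry limb; it cannot overflow, since
   (2^64-1)^2 + 2*(2^64-1) = 2^128 - 1.  */
static uint64_t
accumulate (uint64_t *r, uint64_t lo, uint64_t hi, uint64_t cy)
{
  lo += cy;
  hi += lo < cy;
  uint64_t s = *r + lo;
  hi += s < lo;
  *r = s;
  return hi;
}

/* rp[0..n-1] += up[0..n-1] * v, two limbs per iteration.  Returns the
   carry-out limb.  */
static uint64_t
addmul_1_2x (uint64_t *rp, const uint64_t *up, size_t n, uint64_t v)
{
  uint64_t cy = 0;
  size_t i = 0;

  for (; i + 2 <= n; i += 2)
    {
      uint64_t ahi, bhi;
      uint64_t alo = mul_64x64 (up[i], v, &ahi);      /* group A */
      uint64_t blo = mul_64x64 (up[i + 1], v, &bhi);  /* group B */
      cy = accumulate (&rp[i], alo, ahi, cy);
      cy = accumulate (&rp[i + 1], blo, bhi, cy);
    }
  if (i < n)                    /* odd limb count: one trailing limb */
    {
      uint64_t hi;
      uint64_t lo = mul_64x64 (up[i], v, &hi);
      cy = accumulate (&rp[i], lo, hi, cy);
    }
  return cy;
}
```

Because the two multiplications in the loop body do not depend on each
other, the work of group B can overlap the wait for group A's products,
which is the effect the unrolling is meant to achieve.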
