OpenXM_contrib/gmp/mpn/pa64/README, Revision 1.1 (maekawa)

This directory contains mpn functions for 64-bit PA-RISC 2.0.

RELEVANT OPTIMIZATION ISSUES

The PA8000 has a multi-issue pipeline with large buffers for instructions
awaiting results.  Therefore, no latency scheduling is necessary (and it
might actually be harmful).

Two 64-bit loads can be completed per cycle.  One 64-bit store can be
completed per cycle.  A store cannot complete in the same cycle as a load.
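
These rates give a simple lower bound on cycles/limb for a memory-bound
loop.  The following is a minimal sketch in C, assuming that loads and
stores never share a cycle and that each limb costs a fixed number of
64-bit loads and stores.  The function name pa8000_mem_bound and the
per-limb counts in the comments are illustrative assumptions, not taken
from the GMP sources.

```c
/* Lower bound on cycles/limb under the rules above: two loads per cycle,
   one store per cycle, and no cycle shared between a load and a store.
   Hypothetical helper, for illustration only.  */
static double
pa8000_mem_bound (unsigned loads_per_limb, unsigned stores_per_limb)
{
  return loads_per_limb / 2.0 + stores_per_limb;
}

/* Assumed per-limb memory traffic:
     shift:         1 load, 1 store   -> pa8000_mem_bound (1, 1)
     add/subtract:  2 loads, 1 store  -> pa8000_mem_bound (2, 1)  */
```

With these assumed counts the bound comes to 1.5 cycles/limb for shifting
and 2.0 cycles/limb for add/subtract, which agrees with the figures given
under STATUS.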

STATUS

* mpn_lshift, mpn_rshift, mpn_add_n, mpn_sub_n are all well-tuned and run at
  the peak cache bandwidth: 1.5 cycles/limb for shifting and 2.0 cycles/limb
  for add/subtract.

* The multiplication functions run at 11 cycles/limb.  The cache bandwidth
  allows 7.5 cycles/limb.  Perhaps it would be possible, using unrolling or
  better scheduling, to get closer to the cache bandwidth limit.

* xaddmul_1.S contains a quicker method for forming the 128-bit product.  It
  uses somewhat fewer operations and keeps the carry flag live across the
  loop boundary.  But it seems hard to make it run more than 1/4 cycle faster
  than the old code.  Perhaps we really ought to unroll this loop by 2x?
  2x should suffice since register latency scheduling is never needed,
  but the unrolling would hide the store-load latency.  Here is a sketch:

        1. A: multiply and store 64-bit products
        2. B: sum 64-bit products into 128-bit product
        3. B: load 64-bit products to integer registers
        4. B: multiply and store 64-bit products
        5. A: sum 64-bit products into 128-bit product
        6. A: load 64-bit products to integer registers
        7. goto 1

  In practice, adjacent groups (1 and 2, 2 and 3, etc.) will be interleaved
  for a better instruction mix.
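
  The arithmetic behind the "multiply" and "sum" steps can be sketched in
  portable C.  This is only an illustration under stated assumptions: it
  takes "64-bit products" to mean the four 32x32->64 partial products of
  two 64-bit limbs, and the names umul_128_sketch, addmul_limb_sketch and
  addmul_1_sketch are hypothetical, not functions from this directory.  C
  cannot express the store-load traffic or the instruction interleaving;
  only the A/B structure of a 2x unrolled loop is shown.

```c
#include <stddef.h>
#include <stdint.h>

/* Form the 128-bit product u * v from four 64-bit partial products.  */
static void
umul_128_sketch (uint64_t u, uint64_t v, uint64_t *hi, uint64_t *lo)
{
  uint64_t ul = u & 0xffffffffu, uh = u >> 32;
  uint64_t vl = v & 0xffffffffu, vh = v >> 32;

  /* "Multiply" step: four 32x32->64 products.  */
  uint64_t p00 = ul * vl;
  uint64_t p01 = ul * vh;
  uint64_t p10 = uh * vl;
  uint64_t p11 = uh * vh;

  /* "Sum" step: the middle column is at most 3 * (2^32 - 1), so it
     cannot overflow 64 bits.  */
  uint64_t mid = (p00 >> 32) + (p01 & 0xffffffffu) + (p10 & 0xffffffffu);

  *lo = (mid << 32) | (p00 & 0xffffffffu);
  *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}

/* One limb of rp[] += up[] * v; returns the carry into the next limb.  */
static uint64_t
addmul_limb_sketch (uint64_t *r, uint64_t u, uint64_t v, uint64_t carry)
{
  uint64_t hi, lo;

  umul_128_sketch (u, v, &hi, &lo);
  lo += carry;
  hi += lo < carry;     /* carry out of the low limb */
  lo += *r;
  hi += lo < *r;        /* hi cannot overflow: the sum fits in 128 bits */
  *r = lo;
  return hi;
}

/* rp[0..n-1] += up[0..n-1] * v, unrolled 2x; returns the carry-out limb.  */
static uint64_t
addmul_1_sketch (uint64_t *rp, const uint64_t *up, size_t n, uint64_t v)
{
  uint64_t carry = 0;
  size_t i = 0;

  for (; i + 1 < n; i += 2)
    {
      carry = addmul_limb_sketch (&rp[i], up[i], v, carry);         /* A */
      carry = addmul_limb_sketch (&rp[i + 1], up[i + 1], v, carry); /* B */
    }
  if (i < n)            /* odd limb count: one limb left over */
    carry = addmul_limb_sketch (&rp[i], up[i], v, carry);
  return carry;
}
```

  In the assembly loop the A and B groups would overlap in time as in the
  numbered sketch above; here they simply run one after the other.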