Annotation of OpenXM_contrib/gmp/mpn/pa64/README, Revision 1.1 (maekawa)

This directory contains mpn functions for 64-bit PA-RISC 2.0.

RELEVANT OPTIMIZATION ISSUES

The PA8000 has a multi-issue pipeline with large buffers for instructions
awaiting results. Therefore, no latency scheduling is necessary (and it
might actually be harmful).

Two 64-bit loads can be completed per cycle. One 64-bit store can be
completed per cycle. A store cannot complete in the same cycle as a load.

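As a rough illustration of the load/store rules above, the following C
sketch computes the cache-bandwidth bound in cycles/limb for a loop that
performs a given number of 64-bit loads and stores per limb. The function
name cache_bound and the per-limb counts passed to it are assumptions made
for illustration; they are not taken from the GMP sources. The model simply
charges half a cycle per load and one full cycle per store, since a store
cannot share a cycle with a load.

```c
/* Hypothetical helper (not part of GMP): lower bound, in cycles per limb,
   imposed by cache bandwidth under the rules stated above.
     - two 64-bit loads complete per cycle  -> 0.5 cycle per load
     - one 64-bit store completes per cycle -> 1.0 cycle per store
     - a store never shares a cycle with a load, so the costs add up */
static double cache_bound(int loads_per_limb, int stores_per_limb)
{
    return loads_per_limb / 2.0 + stores_per_limb;
}
```

With one load and one store per limb (shifting) the sketch gives 1.5, and
with two loads and one store per limb (add/subtract) it gives 2.0. A loop
doing five loads and five stores per limb, which is only an assumed count,
would give 7.5.
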
STATUS

* mpn_lshift, mpn_rshift, mpn_add_n, mpn_sub_n are all well-tuned and run at
  the peak cache bandwidth: 1.5 cycles/limb for shifting and 2.0 cycles/limb
  for add/subtract.

* The multiplication functions run at 11 cycles/limb. The cache bandwidth
  allows 7.5 cycles/limb. Perhaps it would be possible, using unrolling or
  better scheduling, to get closer to the cache bandwidth limit.

* xaddmul_1.S contains a quicker method for forming the 128-bit product. It
  uses somewhat fewer operations, and keeps the carry flag live across the
  loop boundary. But it seems hard to make it run more than 1/4 cycle faster
  than the old code. Perhaps we really ought to unroll this loop by 2x?
  2x should suffice since register latency scheduling is never needed,
  but the unrolling would hide the store-load latency. Here is a sketch:

    1. A multiply and store 64-bit products
    2. B sum 64-bit products into 128-bit product
    3. B load 64-bit products to integer registers
    4. B multiply and store 64-bit products
    5. A sum 64-bit products into 128-bit product
    6. A load 64-bit products to integer registers
    7. goto 1

  In practice, adjacent groups (1 and 2, 2 and 3, etc.) will be interleaved
  for a better instruction mix.
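
The 2x unrolling sketched above can be illustrated in portable C. This is
only a simplified model under stated assumptions, not the PA-RISC code in
this directory: the names limb_t, partial_t, multiply_partials,
sum_partials, accumulate and addmul_1_2x are all hypothetical, the four
32x32->64 partial products stand in for the "64-bit products", and the
separate "load" and "sum" steps are merged because C has no explicit
transfer between floating-point and integer registers. The point it shows
is the ordering: the multiply for one group (A or B) is issued before the
other group's products are summed.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;

/* The four 32x32->64 partial products of one 64x64 multiplication. */
typedef struct { uint64_t ll, lh, hl, hh; } partial_t;

/* "Multiply and store 64-bit products". */
static partial_t multiply_partials(limb_t u, limb_t v)
{
    uint64_t ul = u & 0xffffffffu, uh = u >> 32;
    uint64_t vl = v & 0xffffffffu, vh = v >> 32;
    partial_t p = { ul * vl, ul * vh, uh * vl, uh * vh };
    return p;
}

/* "Sum 64-bit products into 128-bit product" (hi:lo).
   Neither intermediate sum can overflow 64 bits. */
static void sum_partials(partial_t p, limb_t *hi, limb_t *lo)
{
    uint64_t mid  = p.lh + (p.ll >> 32);
    uint64_t mid2 = p.hl + (mid & 0xffffffffu);
    *lo = (mid2 << 32) | (p.ll & 0xffffffffu);
    *hi = p.hh + (mid >> 32) + (mid2 >> 32);
}

/* Add product and incoming carry limb to *rp; return outgoing carry limb. */
static limb_t accumulate(limb_t *rp, partial_t p, limb_t carry_in)
{
    limb_t hi, lo, r;
    sum_partials(p, &hi, &lo);
    lo += carry_in;
    hi += (lo < carry_in);      /* carry out of lo + carry_in */
    r = *rp + lo;
    hi += (r < lo);             /* carry out of *rp + lo */
    *rp = r;
    return hi;
}

/* rp[0..n-1] += up[0..n-1] * v, unrolled 2x with groups A and B.
   Returns the most significant (carry) limb. */
limb_t addmul_1_2x(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    size_t i = 0;
    partial_t a, b;

    if (n == 0)
        return 0;
    a = multiply_partials(up[0], v);               /* prologue: A multiply */
    while (i + 1 < n) {
        b = multiply_partials(up[i + 1], v);       /* B multiply and store */
        carry = accumulate(&rp[i], a, carry);      /* A load and sum */
        if (i + 2 < n)
            a = multiply_partials(up[i + 2], v);   /* A multiply and store */
        carry = accumulate(&rp[i + 1], b, carry);  /* B load and sum */
        i += 2;
    }
    if (i < n)
        carry = accumulate(&rp[i], a, carry);      /* epilogue for odd n */
    return carry;
}
```

In the real loop the gap between a group's store and its later load is
what hides the store-load latency; in this model that gap appears only as
the distance between multiply_partials and the matching accumulate call.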