Annotation of OpenXM_contrib/gmp/mpn/pa64/README, Revision 1.1.1.1
1.1 maekawa 1: This directory contains mpn functions for 64-bit PA-RISC 2.0.
2:
3: RELEVANT OPTIMIZATION ISSUES
4:
5: The PA8000 has a multi-issue pipeline with large buffers for instructions
6: awaiting pending results. Therefore, no latency scheduling is necessary
7: (and might actually be harmful).
8:
9: Two 64-bit loads can be completed per cycle. One 64-bit store can be
10: completed per cycle. A store cannot complete in the same cycle as a load.
11:
12: STATUS
13:
14: * mpn_lshift, mpn_rshift, mpn_add_n, mpn_sub_n are all well-tuned and run at
15: the peak cache bandwidth: 1.5 cycles/limb for shifting and 2.0 cycles/limb
16: for add/subtract.
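
The 1.5 and 2.0 figures follow from the load/store limits stated under
RELEVANT OPTIMIZATION ISSUES. The sketch below is only an illustration of that
arithmetic; the helper name cycles_per_limb is hypothetical, and it assumes a
shift reads one limb and writes one limb, while add/subtract reads two limbs
and writes one.

```c
/* Memory-bound cycles per limb under the stated limits: two 64-bit loads
   per cycle, one 64-bit store per cycle, and no store completing in the
   same cycle as a load.  Hypothetical helper, for illustration only. */
static double cycles_per_limb(int loads_per_limb, int stores_per_limb)
{
    /* Loads pair up two per cycle; each store needs a cycle of its own. */
    return loads_per_limb / 2.0 + stores_per_limb;
}
```

With these assumptions, cycles_per_limb(1, 1) gives 1.5 for the shifts and
cycles_per_limb(2, 1) gives 2.0 for add/subtract.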
17:
18: * The multiplication functions run at 11 cycles/limb. The cache bandwidth
19: allows 7.5 cycles/limb. Perhaps it would be possible, using unrolling or
20: better scheduling, to get closer to the cache bandwidth limit.
21:
22: * xaddmul_1.S contains a quicker method for forming the 128-bit product. It
23: uses somewhat fewer operations and keeps the carry flag live across the loop
24: boundary. But it seems hard to make it run more than 1/4 cycle faster
25: than the old code. Perhaps we really ought to unroll this loop by 2x?
26: 2x should suffice since register latency scheduling is never needed,
27: but the unrolling would hide the store-load latency. Here is a sketch:
28:
29: 1. A multiply and store 64-bit products
30: 2. B sum 64-bit products into 128-bit product
31: 3. B load 64-bit products to integer registers
32: 4. B multiply and store 64-bit products
33: 5. A sum 64-bit products into 128-bit product
34: 6. A load 64-bit products to integer registers
35: 7. goto 1
36:
37: In practice, adjacent groups (1 and 2, 2 and 3, etc) will be interleaved
38: for better instruction mix.
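
The dataflow of the sketch can be written out in portable C. This is a rough
illustration, not the code in xaddmul_1.S: it assumes the 64-bit products are
the four 32x32 partial products of each limb, and the names struct parts,
mul_parts, accumulate and addmul_1_2x are hypothetical. A C compiler is free
to reorder these statements, so the A/B ordering shows only which values
depend on which, not an actual schedule.

```c
#include <stddef.h>
#include <stdint.h>

/* The four 64-bit partial products of one 64x64 multiplication. */
struct parts { uint64_t p00, p01, p10, p11; };

/* "multiply": form the 64-bit products from 32-bit halves. */
static struct parts mul_parts(uint64_t u, uint64_t v)
{
    uint64_t ul = (uint32_t)u, uh = u >> 32;
    uint64_t vl = (uint32_t)v, vh = v >> 32;
    struct parts p = { ul * vl, ul * vh, uh * vl, uh * vh };
    return p;
}

/* "sum": combine the products into a 128-bit product, add the incoming
   carry limb and *rp, store the low limb, and return the new carry limb. */
static uint64_t accumulate(uint64_t *rp, struct parts p, uint64_t carry)
{
    uint64_t mid = p.p01 + (p.p00 >> 32);   /* cannot overflow */
    uint64_t hi = p.p11;
    uint64_t lo;

    mid += p.p10;
    if (mid < p.p10)                        /* carry out of the middle sum */
        hi += (uint64_t)1 << 32;
    lo = (mid << 32) | (uint32_t)p.p00;
    hi += mid >> 32;

    lo += carry;
    hi += lo < carry;                       /* carry from adding carry-in */
    lo += *rp;
    hi += lo < *rp;                         /* carry from adding *rp */
    *rp = lo;
    return hi;
}

/* rp[0..n-1] += up[0..n-1] * v, unrolled 2x as limb streams A and B.
   Returns the most significant (carry-out) limb. */
uint64_t addmul_1_2x(uint64_t *rp, const uint64_t *up, size_t n, uint64_t v)
{
    uint64_t carry = 0;
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        struct parts a = mul_parts(up[i], v);       /* A multiply */
        struct parts b = mul_parts(up[i + 1], v);   /* B multiply */
        carry = accumulate(&rp[i], a, carry);       /* A sum */
        carry = accumulate(&rp[i + 1], b, carry);   /* B sum */
    }
    if (i < n)                                      /* odd limb count */
        carry = accumulate(&rp[i], mul_parts(up[i], v), carry);
    return carry;
}
```

Only the carry limb links the A and B halves, so the multiply for one limb
can overlap the summation for the other.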