Annotation of OpenXM_contrib/gmp/mpn/sparc64/README, Revision 1.1
1.1 ! maekawa 1: This directory contains mpn functions for 64-bit V9 SPARC
! 2:
! 3: RELEVANT OPTIMIZATION ISSUES
! 4:
! 5: The Ultra I/II pipeline executes up to two simple integer arithmetic operations
! 6: per cycle. The 64-bit integer multiply instruction mulx takes from 5 cycles to
! 7: 35 cycles, depending on the position of the most significant bit of the 1st
! 8: source operand. It cannot overlap with other instructions. For our use of
! 9: mulx, it will take from 5 to 20 cycles.
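The README does not say why "our use of mulx" stays within 5 to 20 cycles. One plausible reading, offered here as an assumption and not taken from the source, is that the operands handed to mulx are 32-bit pieces: mulx returns only the low 64 bits of a product, so a full 64x64 -> 128-bit limb product has to be assembled from 32-bit halves anyway. A minimal C sketch of that assembly follows; the function name mul_64x64_128 is hypothetical and this is not GMP's actual code.

```c
#include <stdint.h>

/* Hypothetical helper (not from GMP): 64x64 -> 128-bit product built
   from four 32x32 -> 64-bit multiplies, each of which would map to one
   mulx whose operands are both below 2^32. */
void mul_64x64_128(uint64_t u, uint64_t v, uint64_t *hi, uint64_t *lo)
{
    uint64_t u0 = u & 0xffffffffu, u1 = u >> 32;   /* low / high halves of u */
    uint64_t v0 = v & 0xffffffffu, v1 = v >> 32;   /* low / high halves of v */

    uint64_t p00 = u0 * v0;   /* weight 2^0  */
    uint64_t p01 = u0 * v1;   /* weight 2^32 */
    uint64_t p10 = u1 * v0;   /* weight 2^32 */
    uint64_t p11 = u1 * v1;   /* weight 2^64 */

    /* Sum of the 2^32 column; three 32-bit values, so it cannot
       overflow 64 bits. */
    uint64_t mid = (p00 >> 32) + (p01 & 0xffffffffu) + (p10 & 0xffffffffu);

    *lo = (mid << 32) | (p00 & 0xffffffffu);
    *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}
```

If the scalar operand is known to be below 2^32, v1 is zero and the p01 and p11 products drop out, which halves the number of multiplies.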
! 10:
! 11: Integer conditional move instructions cannot dual-issue with other integer
! 12: instructions. No conditional move can issue 1-5 cycles after a load. (Or
! 13: something similarly bizarre.)
! 14:
! 15: Integer branches can issue with two integer arithmetic instructions. Likewise
! 16: for integer loads. Four instructions may issue (arith, arith, ld/st, branch)
! 17: but only if the branch is last.
! 18:
! 19: (The V9 architecture manual recommends that the 2nd operand of a multiply
! 20: instruction be the smaller one. For UltraSPARC, they got things backwards and
! 21: optimize for the wrong operand! Really helpful, given that multiply
! 22: is incredibly slow on these CPUs!)
! 23:
! 24: STATUS
! 25:
! 26: There is new code in ~/prec/gmp-remote/sparc64. Not tested or completed, but
! 27: the pipelines are worked out. Here are the timings:
! 28:
! 29: * lshift, rshift: The code is well-optimized and runs at 2.0 cycles/limb.
! 30:
! 31: * add_n, sub_n: add3.s currently runs at 6 cycles/limb. We use a bizarre
! 32: scheme of compares and branches (with some nops and fnops to align things)
! 33: and carefully stay away from the instructions intended for this application
! 34: (i.e., movcs and movcc).
! 35:
! 36: Using movcc/movcs, even with deep unrolling, seems to get down to 7
! 37: cycles/limb.
! 38:
! 39: The most promising approach is to split operands into 32-bit pieces using
! 40: srlx, then use two addccc, and finally combine the results with sllx+or.
! 41: The result could run at 5 cycles/limb, I think. It might be possible to
! 42: do without unrolling, or with minimal unrolling.
! 43:
! 44: * addmul_1/submul_1: Should optimize for when scalar operand < 2^32.
! 45: * addmul_1/submul_1: Since mulx is horrendously slow on UltraSPARC I/II,
! 46: Karatsuba's method should save up to 16 cycles (i.e. > 20%).
! 47: * mul_1 (and possibly the other multiply functions): Handle carry in the
! 48: same tricky way as add_n, sub_n.
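
The 32-bit split proposed for add_n can be modelled in portable C. This is only a sketch of the idea, not the unfinished assembly the notes refer to: the function name add_n_split32 is hypothetical, the interface is assumed to mirror mpn_add_n (result limbs written to rp, carry-out returned), and the carry that addccc would keep in the condition codes is modelled here as bit 32 of each partial sum.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: rp[0..n-1] = up[0..n-1] + vp[0..n-1], returning
   the carry out (0 or 1). Each 64-bit limb is split into 32-bit halves
   so that the carry out of each half shows up as bit 32 of an ordinary
   64-bit sum. */
uint64_t add_n_split32(uint64_t *rp, const uint64_t *up,
                       const uint64_t *vp, size_t n)
{
    uint64_t cy = 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t u = up[i], v = vp[i];

        /* Low halves plus incoming carry: at most 33 significant bits. */
        uint64_t lo = (u & 0xffffffffu) + (v & 0xffffffffu) + cy;

        /* High halves (the srlx step) plus the carry out of the low half. */
        uint64_t hi = (u >> 32) + (v >> 32) + (lo >> 32);

        /* Recombine the two 32-bit results (the sllx + or step). */
        rp[i] = (hi << 32) | (lo & 0xffffffffu);

        /* Carry into the next limb is bit 32 of the high-half sum. */
        cy = hi >> 32;
    }
    return cy;
}
```

Because each limb is read before its result is stored, the sketch also works in place, with rp equal to up or vp.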