
Annotation of OpenXM_contrib/gmp/mpn/sparc64/README, Revision 1.1.1.1

This directory contains mpn functions for 64-bit V9 SPARC.

RELEVANT OPTIMIZATION ISSUES

The Ultra I/II pipeline executes up to two simple integer arithmetic operations
per cycle.  The 64-bit integer multiply instruction mulx takes from 5 cycles to
35 cycles, depending on the position of the most significant bit of the first
source operand.  It cannot overlap with other instructions.  For our use of
mulx, it will take from 5 to 20 cycles.

Integer conditional move instructions cannot dual-issue with other integer
instructions.  No conditional move can issue 1-5 cycles after a load.  (Or
something similarly bizarre.)

Integer branches can issue with two integer arithmetic instructions.  Likewise
for integer loads.  Four instructions may issue (arith, arith, ld/st, branch)
but only if the branch is last.

(The V9 architecture manual recommends that the 2nd operand of a multiply
instruction be the smaller one.  For UltraSPARC, they got things backwards and
optimize for the wrong operand!  Really helpful, given that multiply is
incredibly slow on these CPUs!)

STATUS

There is new code in ~/prec/gmp-remote/sparc64.  Not tested or completed, but
the pipelines are worked out.  Here are the timings:

* lshift, rshift: The code is well-optimized and runs at 2.0 cycles/limb.

* add_n, sub_n: add3.s currently runs at 6 cycles/limb.  We use a bizarre
  scheme of compares and branches (with some nops and fnops to align things)
  and carefully stay away from the instructions intended for this application
  (i.e., movcs and movcc).

  Using movcc/movcs, even with deep unrolling, seems to get down to only 7
  cycles/limb.

  The most promising approach is to split operands into 32-bit pieces using
  srlx, then use two addccc, and finally combine the results with sllx+or.
  The result could run at 5 cycles/limb, I think.  It might be possible to
  do without unrolling, or with minimal unrolling.

* addmul_1/submul_1: Should optimize for when scalar operand < 2^32.
* addmul_1/submul_1: Since mulx is horrendously slow on UltraSPARC I/II,
  Karatsuba's method should save up to 16 cycles (i.e. > 20%).
* mul_1 (and possibly the other multiply functions): Handle carry in the
  same tricky way as add_n, sub_n.
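
The 32-bit split proposed for add_n above can be modelled in portable C.  This
is only a sketch of the data flow, not the assembly itself: the function name
add_n_split32 and its signature are made up for illustration, and the two
additions per limb stand in for the pair of addccc instructions.  It assumes
limbs are stored least significant first.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: rp[0..n-1] = ap[0..n-1] + bp[0..n-1], limbs stored
   least significant first.  Returns the carry out of the top limb (0 or 1). */
uint64_t add_n_split32(uint64_t *rp, const uint64_t *ap, const uint64_t *bp,
                       size_t n)
{
    const uint64_t mask = UINT64_C(0xffffffff);
    uint64_t cy = 0;                      /* carry into the current limb */

    for (size_t i = 0; i < n; i++) {
        uint64_t a = ap[i], b = bp[i];

        /* Low halves: the sum is at most 33 bits, so bit 32 is the carry. */
        uint64_t lo = (a & mask) + (b & mask) + cy;

        /* High halves (the srlx step), plus the carry from the low half. */
        uint64_t hi = (a >> 32) + (b >> 32) + (lo >> 32);

        /* Recombine the two 32-bit results (the sllx+or step). */
        rp[i] = (hi << 32) | (lo & mask);

        /* Bit 32 of the high sum is the carry into the next limb. */
        cy = hi >> 32;
    }
    return cy;
}
```

Because each half-sum fits comfortably in 64 bits, the carry is always visible
as bit 32, so no compare-and-branch or conditional move is needed to recover
it.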

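As a rough illustration of the Karatsuba idea mentioned for addmul_1/submul_1,
a 64x64-bit product can be assembled from three 32x32-bit products instead of
four.  The sketch below uses the subtractive variant, chosen here only because
every partial product then fits in 64 bits; the README does not say which
variant was intended.  The name mul64_karatsuba is hypothetical, and no cycle
counts are claimed for it.

```c
#include <stdint.h>

/* Hypothetical helper: (*hi:*lo) = a * b, using three 32x32->64 multiplies. */
void mul64_karatsuba(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    const uint64_t mask = UINT64_C(0xffffffff);
    uint64_t a0 = a & mask, a1 = a >> 32;
    uint64_t b0 = b & mask, b1 = b >> 32;

    uint64_t ll = a0 * b0;                /* low  x low  */
    uint64_t hh = a1 * b1;                /* high x high */

    /* |a1 - a0| * |b1 - b0|, with the sign of the product tracked apart. */
    uint64_t da = a1 >= a0 ? a1 - a0 : a0 - a1;
    uint64_t db = b1 >= b0 ? b1 - b0 : b0 - b1;
    int neg = (a1 >= a0) != (b1 >= b0);   /* (a1-a0)*(b1-b0) is negative */
    uint64_t m = da * db;

    /* mid = a1*b0 + a0*b1 = hh + ll - (a1-a0)*(b1-b0).  It can need 65
       bits, so it is kept as mid_hi:mid_lo with mid_hi being 0 or 1. */
    uint64_t mid_lo = hh + ll;
    uint64_t mid_hi = mid_lo < hh;        /* carry out of hh + ll */
    if (neg) {
        mid_lo += m;                      /* subtracting a negative term */
        mid_hi += mid_lo < m;
    } else {
        mid_hi -= mid_lo < m;             /* borrow; mid itself is never < 0 */
        mid_lo -= m;
    }

    /* product = hh * 2^64 + mid * 2^32 + ll */
    uint64_t t = mid_lo << 32;
    *lo = ll + t;
    *hi = hh + (mid_lo >> 32) + (mid_hi << 32) + (*lo < t);
}
```

When the scalar operand is below 2^32, its high half is zero and the high
product and the difference term simplify away, which is one way to read the
"scalar operand < 2^32" item above.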