OpenXM_contrib/gmp/mpn/sparc64/README, Revision 1.1 (maekawa)

This directory contains mpn functions for 64-bit V9 SPARC.

RELEVANT OPTIMIZATION ISSUES

The Ultra I/II pipeline executes up to two simple integer arithmetic operations
per cycle.  The 64-bit integer multiply instruction mulx takes from 5 cycles to
35 cycles, depending on the position of the most significant bit of the 1st
source operand.  It cannot overlap with other instructions.  For our use of
mulx, it will take from 5 to 20 cycles.

Integer conditional move instructions cannot dual-issue with other integer
instructions.  No conditional move can issue 1-5 cycles after a load.  (Or
something similarly bizarre.)

Integer branches can issue with two integer arithmetic instructions.  Likewise
for integer loads.  Four instructions may issue (arith, arith, ld/st, branch)
but only if the branch is last.

(The V9 architecture manual recommends that the 2nd operand of a multiply
instruction be the smaller one.  For UltraSPARC, they got things backwards and
optimize for the wrong operand!  Really helpful, given that multiply is
incredibly slow on these CPUs!)

STATUS

There is new code in ~/prec/gmp-remote/sparc64.  Not tested or completed, but
the pipelines are worked out.  Here are the timings:

* lshift, rshift: The code is well-optimized and runs at 2.0 cycles/limb.

* add_n, sub_n: add3.s currently runs at 6 cycles/limb.  We use a bizarre
  scheme of compares and branches (with some nops and fnops to align things)
  and carefully stay away from the instructions intended for this application
  (i.e., movcs and movcc).

  Using movcc/movcs, even with deep unrolling, seems to get down only to 7
  cycles/limb.

  The most promising approach is to split operands into 32-bit pieces using
  srlx, then use two addccc, and finally combine the results with sllx+or.
  The result could run at 5 cycles/limb, I think.  It might be possible to
  do without unrolling, or with minimal unrolling.

* addmul_1/submul_1: Should optimize for when scalar operand < 2^32.
* addmul_1/submul_1: Since mulx is horrendously slow on UltraSPARC I/II,
  Karatsuba's method should save up to 16 cycles (i.e. > 20%).
* mul_1 (and possibly the other multiply functions): Handle carry in the
  same tricky way as add_n, sub_n.
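
The 32-bit split proposed for add_n/sub_n can be sketched in C instead of
SPARC assembly.  This is only an illustration, assuming 64-bit limbs; the name
add_n_split32 is hypothetical and is neither GMP's mpn_add_n nor the contents
of add3.s.  The shifts and masks stand in for srlx, the two 33-bit additions
for the pair of addccc instructions (which work with the 32-bit carry), and
the final shift-and-or for sllx+or.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: rp[] = ap[] + bp[] over n 64-bit limbs, least
   significant limb first.  Returns the carry out (0 or 1). */
static uint64_t add_n_split32(uint64_t *rp, const uint64_t *ap,
                              const uint64_t *bp, size_t n)
{
    uint64_t cy = 0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        /* Split each limb into 32-bit halves (srlx in the assembly). */
        uint64_t alo = ap[i] & 0xffffffffu, ahi = ap[i] >> 32;
        uint64_t blo = bp[i] & 0xffffffffu, bhi = bp[i] >> 32;

        /* Two additions whose carries land in bit 32 of each sum. */
        uint64_t lo = alo + blo + cy;
        uint64_t hi = ahi + bhi + (lo >> 32);

        /* Recombine the halves (sllx + or in the assembly). */
        rp[i] = (hi << 32) | (lo & 0xffffffffu);
        cy = hi >> 32;
    }
    return cy;
}
```

The sketch says nothing about scheduling; whether this reaches the hoped-for
5 cycles/limb depends on how the instructions pair on the real pipeline.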
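
The Karatsuba idea for addmul_1/submul_1 can be illustrated on a single
64x64 -> 128-bit limb product.  This is a sketch under assumptions: the README
does not spell out the decomposition, so the code below simply takes the usual
route of replacing one of the four 32x32 partial products with additions and
subtractions.  It uses the subtractive variant so that each of the three
products fits in 64 bits.  The names u128 and mul64_karatsuba are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 128-bit result type: value = hi * 2^64 + lo. */
typedef struct { uint64_t hi, lo; } u128;

/* Sketch of a 64x64 -> 128-bit multiply using three 32x32 products. */
static u128 mul64_karatsuba(uint64_t a, uint64_t b)
{
    uint64_t a1 = a >> 32, a0 = a & 0xffffffffu;
    uint64_t b1 = b >> 32, b0 = b & 0xffffffffu;

    uint64_t z2 = a1 * b1;                      /* high product */
    uint64_t z0 = a0 * b0;                      /* low product */

    /* Third product: |a1 - a0| * |b1 - b0|, with its sign kept apart. */
    uint64_t da = a1 >= a0 ? a1 - a0 : a0 - a1;
    uint64_t db = b1 >= b0 ? b1 - b0 : b0 - b1;
    int neg = (a1 >= a0) != (b1 >= b0);         /* (a1-a0)(b1-b0) < 0 ? */
    uint64_t m = da * db;

    /* mid = a1*b0 + a0*b1 = z2 + z0 - (a1-a0)(b1-b0), a 65-bit value
       held as mid_hi (0 or 1) and mid_lo. */
    uint64_t mid_lo = z2 + z0;
    uint64_t mid_hi = mid_lo < z2;
    if (neg) {
        mid_lo += m;
        mid_hi += mid_lo < m;
    } else {
        mid_hi -= mid_lo < m;                   /* borrow, taken first */
        mid_lo -= m;
    }
    assert(mid_hi <= 1);

    /* Assemble z2 * 2^64 + mid * 2^32 + z0. */
    u128 r;
    r.lo = z0 + (mid_lo << 32);
    r.hi = z2 + (mid_lo >> 32) + (mid_hi << 32) + (r.lo < z0);
    return r;
}
```

Whether the extra additions, compares and sign handling really beat a single
mulx depends on the operands and on instruction pairing, so any saving would
have to be measured on the hardware.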