
Annotation of OpenXM_contrib/gmp/mpn/sparc64/README, Revision 1.1.1.2

1.1.1.2 ! ohara       1: Copyright 1997, 1999, 2000, 2001, 2002 Free Software Foundation, Inc.
        !             2:
        !             3: This file is part of the GNU MP Library.
        !             4:
        !             5: The GNU MP Library is free software; you can redistribute it and/or modify
        !             6: it under the terms of the GNU Lesser General Public License as published by
        !             7: the Free Software Foundation; either version 2.1 of the License, or (at your
        !             8: option) any later version.
        !             9:
        !            10: The GNU MP Library is distributed in the hope that it will be useful, but
        !            11: WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
        !            12: or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
        !            13: License for more details.
        !            14:
        !            15: You should have received a copy of the GNU Lesser General Public License
        !            16: along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
        !            17: the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
        !            18: 02111-1307, USA.
        !            19:
        !            20:
        !            21:
        !            22:
        !            23:
1.1       maekawa    24: This directory contains mpn functions for 64-bit V9 SPARC.
                     25:
                     26: RELEVANT OPTIMIZATION ISSUES
                     27:
1.1.1.2 ! ohara      28: Notation:
        !            29:   IANY = shift/add/sub/logical/sethi
        !            30:   IADDLOG = add/sub/logical/sethi
        !            31:   MEM = ld*/st*
        !            32:   FA = fadd*/fsub*/f*to*/fmov*
        !            33:   FM = fmul*
        !            34:
        !            35: UltraSPARC-1/2 can issue four instructions per cycle, with these restrictions:
        !            36: * Two IANY instructions, but only one of these may be a shift.  If there is a
        !            37:   shift and an IANY instruction, the shift must precede the IANY instruction.
        !            38: * One FA.
        !            39: * One FM.
        !            40: * One branch.
        !            41: * One MEM.
        !            42: * IANY/IADDLOG/MEM must be insn 1, 2, or 3 in an issue bundle.  Taken branches
        !            43:   should not be in slot 4, since that makes the delay insn come from a
        !            44:   separate bundle.
        !            45: * If two IANY/IADDLOG instructions are to be executed in the same cycle and one
        !            46:   of these is setting the condition codes, that instruction must be the second
        !            47:   one.
        !            48:
        !            49: To summarize, ignoring branches, these are the bundles that can reach the peak
        !            50: execution speed:
        !            51:
        !            52: insn1  iany    iany    mem     iany    iany    mem     iany    iany    mem
        !            53: insn2  iaddlog mem     iany    mem     iaddlog iany    mem     iaddlog iany
        !            54: insn3  mem     iaddlog iaddlog fa      fa      fa      fm      fm      fm
        !            55: insn4  fa/fm   fa/fm   fa/fm   fm      fm      fm      fa      fa      fa
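The issue restrictions above can be sketched as a small checker in C.  This is
only an illustration of the rules as listed here, not code from GMP: the names
`insn_class', `struct insn' and `bundle_ok' are hypothetical, IANY is modelled
as "SHIFT or IADDLOG", and a bundle whose first integer instruction sets the
condition codes is rejected whenever a second integer instruction follows.
The advice about taken branches in slot 4 is a performance hint rather than an
issue rule, so it is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Instruction classes from the Notation list; IANY = SHIFT or IADDLOG. */
enum insn_class { SHIFT, IADDLOG, MEM, FA, FM, BRANCH };

struct insn {
    enum insn_class cls;
    bool sets_cc;               /* true if the insn writes the condition codes */
};

/* Return true if b[0..n-1] may issue as one bundle under the listed rules. */
bool bundle_ok(const struct insn *b, size_t n)
{
    int n_int = 0, n_mem = 0, n_fa = 0, n_fm = 0, n_br = 0;
    bool first_int_sets_cc = false;

    assert(n <= 4);             /* at most four instructions per cycle */

    for (size_t i = 0; i < n; i++) {
        enum insn_class c = b[i].cls;

        if (c == SHIFT || c == IADDLOG) {
            if (i == 3)
                return false;   /* integer insns must be insn 1, 2, or 3 */
            if (c == SHIFT && n_int > 0)
                return false;   /* one shift only, and it must come first */
            if (n_int == 0)
                first_int_sets_cc = b[i].sets_cc;
            n_int++;
        } else if (c == MEM) {
            if (i == 3)
                return false;   /* MEM must be insn 1, 2, or 3 */
            n_mem++;
        } else if (c == FA) {
            n_fa++;
        } else if (c == FM) {
            n_fm++;
        } else {
            n_br++;
        }
    }

    if (n_int > 2 || n_mem > 1 || n_fa > 1 || n_fm > 1 || n_br > 1)
        return false;
    /* With two integer insns, a cc-setting one must be the second. */
    if (n_int == 2 && first_int_sets_cc)
        return false;
    return true;
}
```

Treating every `iany' entry as a shift (the most restrictive case), each of
the nine columns in the table above should pass this check.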
        !            56:
        !            57: The 64-bit integer multiply instruction mulx takes from 5 cycles to 35 cycles,
        !            58: depending on the position of the most significant bit of the first source
        !            59: operand.  When used for 32x32->64 multiplication, it needs 20 cycles.
        !            60: Furthermore, it stalls the processor while executing.  We stay away from that
        !            61: instruction, and instead use floating-point operations.
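To illustrate how floating-point operations can stand in for mulx, here is a
minimal portable C sketch.  It assumes one particular split: the first operand
is cut into two 32-bit halves and the second into four 16-bit pieces, so every
partial product is below 2^48 and is exact in an IEEE double (53-bit
significand).  That split is an assumption made for this example; the names
`acc_shifted' and `fp_umul_64x64' are hypothetical, and the actual assembly in
this directory is scheduled very differently.

```c
#include <assert.h>
#include <stdint.h>

/* Add x << shift into the 128-bit accumulator hi:lo. */
static void acc_shifted(uint64_t *hi, uint64_t *lo, uint64_t x, unsigned shift)
{
    uint64_t xl, xh;

    assert(shift <= 80 && x < ((uint64_t)1 << 48));
    if (shift == 0) {
        xl = x;
        xh = 0;
    } else if (shift < 64) {
        xl = x << shift;
        xh = x >> (64 - shift);
    } else {
        xl = 0;
        xh = x << (shift - 64);
    }
    *lo += xl;
    *hi += xh + (*lo < xl);     /* (*lo < xl) is the carry out of the low word */
}

/* 64x64->128 unsigned multiply using only double-precision multiplies. */
void fp_umul_64x64(uint64_t u, uint64_t v, uint64_t *hi, uint64_t *lo)
{
    *hi = 0;
    *lo = 0;
    for (unsigned i = 0; i < 2; i++) {
        double uf = (double)(uint32_t)(u >> (32 * i));      /* 32-bit half */
        for (unsigned j = 0; j < 4; j++) {
            double vf = (double)(uint16_t)(v >> (16 * j));  /* 16-bit piece */
            uint64_t p = (uint64_t)(uf * vf);               /* exact, p < 2^48 */
            acc_shifted(hi, lo, p, 32 * i + 16 * j);
        }
    }
}
```

Adding two such 48-bit products gives a value of up to 49 bits, which still
fits comfortably in a double or a 64-bit integer register.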
1.1       maekawa    62:
                     63: Integer conditional move instructions cannot dual-issue with other integer
                     64: instructions.  No conditional move can issue 1-5 cycles after a load.  (Or
1.1.1.2 ! ohara      65: something similarly bizarre.)  Useless.
1.1       maekawa    66:
1.1.1.2 ! ohara      67: The UltraSPARC-3 pipeline seems similar, but is somewhat more rigid.  Branches
        !            68: execute slower, and there may be other new stalls.  Integer multiply doesn't
        !            69: halt the CPU and also has a much lower latency.  But it's still not pipelined,
        !            70: and thus useless for our needs.
1.1       maekawa    71:
                     72: STATUS
                     73:
1.1.1.2 ! ohara      74: (Timings are for UltraSPARC-1/2.  UltraSPARC-3 is a few cycles slower.)
        !            75:
        !            76: * mpn_lshift, mpn_rshift: The current code runs at 2.0 cycles/limb.  The IEU0
        !            77:   functional unit is saturated with shifts.
        !            78:
        !            79: * mpn_add_n, mpn_sub_n: The current code runs at 4 cycles/limb.  The
        !            80:   4-instruction recurrence is the speed limiter.
1.1       maekawa    81:
1.1.1.2 ! ohara      82: * mpn_addmul_1: The current code runs at 14 cycles/limb asymptotically.  The
        !            83:   code sustains 4 instructions/cycle.  It might be possible to invent a better
        !            84:   way of summing the intermediate 49-bit operands, but it is unlikely that it
        !            85:   will save enough instructions to save an entire cycle.
        !            86:
        !            87:   The load-use of the `rlimb' operand is not scheduled far enough apart for
        !            88:   good L2 cache performance.  Since the UltraSPARC-1/2 L1 cache is direct
        !            89:   mapped, we miss to L2 very often.  The load-use of the std/ldx pairs via the
        !            90:   stack is somewhat over-scheduled.
        !            91:
        !            92:   It would be possible to save two instructions: (1) The `mov' could be avoided
        !            93:   if the std/ldx were less scheduled.  (2) The ldx of `rlimb' could be split
        !            94:   into two `ld' instructions, saving the shifts/masks.
        !            95:
        !            96: * mpn_mul_1: The current code is a straightforward edit of the mpn_addmul_1
        !            97:   code.  It would be possible to shave one or two cycles from it, with some
        !            98:   labour.
        !            99:
        !           100: * mpn_submul_1: Braindead code just calling mpn_mul_1 + mpn_sub_n.  It would be
        !           101:   possible to either match the mpn_addmul_1 performance, or in the worst case
        !           102:   use one more instruction group.
1.1       maekawa   103:
1.1.1.2 ! ohara     104: * mpn_Xmul_2: These could be made to run at 9 cycles/limb.  Straightforward
        !           105:   generalization of mpn_Xmul_1.
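
The per-limb work behind the mpn_lshift and mpn_add_n entries above can be
sketched in portable C.  These are reference loops written for this note, with
64-bit limbs assumed and the hypothetical names `ref_lshift' and `ref_add_n';
they are not the assembly in this directory.  In `ref_lshift' each result limb
needs two shifts, which is consistent with a 2.0 cycles/limb bound when only
one shift can issue per cycle.  In `ref_add_n' the carry into limb i depends on
the carry out of limb i-1, so the loop is bound by that dependency chain and
not by issue width.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shift {up,n} left by cnt bits (1 <= cnt < 64) into {rp,n}.
   Returns the bits shifted out of the most significant limb. */
uint64_t ref_lshift(uint64_t *rp, const uint64_t *up, size_t n, unsigned cnt)
{
    assert(n >= 1 && cnt >= 1 && cnt < 64);

    unsigned tnc = 64 - cnt;
    uint64_t high = up[n - 1];
    uint64_t ret = high >> tnc;

    for (size_t i = n - 1; i > 0; i--) {
        uint64_t low = up[i - 1];
        rp[i] = (high << cnt) | (low >> tnc);   /* two shifts per limb */
        high = low;
    }
    rp[0] = high << cnt;
    return ret;
}

/* Add {up,n} and {vp,n} into {rp,n}.  Returns the carry out (0 or 1). */
uint64_t ref_add_n(uint64_t *rp, const uint64_t *up, const uint64_t *vp,
                   size_t n)
{
    uint64_t cy = 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t s = up[i] + vp[i];
        uint64_t c1 = s < up[i];    /* carry from u + v */
        uint64_t r = s + cy;        /* depends on the previous limb's carry */
        uint64_t c2 = r < s;        /* carry from adding cy */
        rp[i] = r;
        cy = c1 | c2;               /* at most one of c1, c2 is set */
    }
    return cy;
}
```

Both loops follow the usual mpn conventions: least significant limb first, and
the return value carries whatever does not fit in the result.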
