Annotation of OpenXM_contrib/gmp/mpn/sparc64/README, Revision 1.1.1.2
1.1.1.2 ! ohara 1: Copyright 1997, 1999, 2000, 2001, 2002 Free Software Foundation, Inc.
! 2:
! 3: This file is part of the GNU MP Library.
! 4:
! 5: The GNU MP Library is free software; you can redistribute it and/or modify
! 6: it under the terms of the GNU Lesser General Public License as published by
! 7: the Free Software Foundation; either version 2.1 of the License, or (at your
! 8: option) any later version.
! 9:
! 10: The GNU MP Library is distributed in the hope that it will be useful, but
! 11: WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
! 12: or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public
! 13: License for more details.
! 14:
! 15: You should have received a copy of the GNU Lesser General Public License
! 16: along with the GNU MP Library; see the file COPYING.LIB. If not, write to
! 17: the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
! 18: 02111-1307, USA.
! 19:
! 20:
! 21:
! 22:
! 23:
1.1 maekawa 24: This directory contains mpn functions for 64-bit V9 SPARC.
25:
26: RELEVANT OPTIMIZATION ISSUES
27:
1.1.1.2 ! ohara 28: Notation:
! 29: IANY = shift/add/sub/logical/sethi
! 30: IADDLOG = add/sub/logical/sethi
! 31: MEM = ld*/st*
! 32: FA = fadd*/fsub*/f*to*/fmov*
! 33: FM = fmul*
! 34:
! 35: UltraSPARC-1/2 can issue four instructions per cycle, with these restrictions:
! 36: * Two IANY instructions, but only one of these may be a shift. If there is a
! 37: shift and an IANY instruction, the shift must precede the IANY instruction.
! 38: * One FA.
! 39: * One FM.
! 40: * One branch.
! 41: * One MEM.
! 42: * IANY/IADDLOG/MEM must be insn 1, 2, or 3 in an issue bundle. Taken branches
! 43: should not be in slot 4, since that makes the delay insn come from a separate
! 44: bundle.
! 45: * If two IANY/IADDLOG instructions are to be executed in the same cycle and one
! 46: of these is setting the condition codes, that instruction must be the second
! 47: one.
! 48:
! 49: To summarize, ignoring branches, these are the bundles that can reach the peak
! 50: execution speed:
! 51:
! 52: insn1    iany     iany     mem      iany     iany     mem      iany     iany     mem
! 53: insn2    iaddlog  mem      iany     mem      iaddlog  iany     mem      iaddlog  iany
! 54: insn3    mem      iaddlog  iaddlog  fa       fa       fa       fm       fm       fm
! 55: insn4    fa/fm    fa/fm    fa/fm    fm       fm       fm       fa       fa       fa
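
The issue restrictions listed above can be expressed as a small checker. This is
only a sketch in C for experimenting with candidate schedules, not part of GMP:
the names `insn_class', `struct insn' and `bundle_ok' are made up for this
example, IANY is modelled as SHIFT or IADDLOG, and the advisory rule about taken
branches in slot 4 is not enforced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Instruction classes from the Notation list; IANY = SHIFT or IADDLOG. */
enum insn_class { SHIFT, IADDLOG, MEM, FA, FM, BRANCH };

struct insn {
    enum insn_class cls;
    bool sets_cc;               /* true for cc-setting forms such as addcc */
};

/* Return true if b[0..n-1] could issue together under the listed rules. */
static bool bundle_ok(const struct insn *b, size_t n)
{
    int n_int = 0, n_mem = 0, n_fa = 0, n_fm = 0, n_br = 0;
    bool first_int_sets_cc = false;

    assert(b != NULL);
    if (n > 4)
        return false;

    for (size_t i = 0; i < n; i++) {
        switch (b[i].cls) {
        case SHIFT:
        case IADDLOG:
            if (i > 2)                      /* integer insns: slots 1-3 only */
                return false;
            if (b[i].cls == SHIFT && n_int > 0)
                return false;               /* a shift must come first */
            if (n_int == 0)
                first_int_sets_cc = b[i].sets_cc;
            else if (first_int_sets_cc)
                return false;               /* cc setter must be the second */
            n_int++;
            break;
        case MEM:
            if (i > 2)                      /* MEM: slots 1-3 only */
                return false;
            n_mem++;
            break;
        case FA:     n_fa++; break;
        case FM:     n_fm++; break;
        case BRANCH: n_br++; break;
        }
    }
    return n_int <= 2 && n_mem <= 1 && n_fa <= 1 && n_fm <= 1 && n_br <= 1;
}
```

Every column of the table is accepted by this checker; swapping the shift behind
the other integer instruction, or moving MEM to slot 4, makes it reject the
bundle.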
! 56:
! 57: The 64-bit integer multiply instruction mulx takes from 5 cycles to 35 cycles,
! 58: depending on the position of the most significant bit of the first source
! 59: operand. When used for 32x32->64 multiplication, it needs 20 cycles.
! 60: Furthermore, it stalls the processor while executing. We stay away from that
! 61: instruction, and instead use floating-point operations.
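
To illustrate how a 64x64->128 multiply can be done with floating-point
operations, here is a sketch in portable C. It is not the code in this
directory. It assumes one operand is split into 32-bit halves and the other
into 16-bit pieces, so that every partial product is below 2^48 and therefore
exact in an IEEE double. The names `acc128' and `umul_fp' are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Add p << shift into the 128-bit accumulator hi:lo (p < 2^48, shift <= 80). */
static void acc128(uint64_t *hi, uint64_t *lo, uint64_t p, unsigned shift)
{
    uint64_t plo, phi;

    assert(shift <= 80 && p < ((uint64_t)1 << 48));
    if (shift == 0)      { plo = p;          phi = 0; }
    else if (shift < 64) { plo = p << shift; phi = p >> (64 - shift); }
    else                 { plo = 0;          phi = p << (shift - 64); }

    *lo += plo;
    *hi += phi + (*lo < plo);   /* carry out of the low word */
}

/* hi:lo = u * v, using only double-precision multiplies for the products. */
static void umul_fp(uint64_t u, uint64_t v, uint64_t *hi, uint64_t *lo)
{
    *hi = 0;
    *lo = 0;
    for (unsigned i = 0; i < 2; i++) {              /* 32-bit halves of u */
        double ui = (double)(uint32_t)(u >> (32 * i));
        for (unsigned j = 0; j < 4; j++) {          /* 16-bit pieces of v */
            double vj = (double)(uint16_t)(v >> (16 * j));
            uint64_t p = (uint64_t)(ui * vj);       /* exact: below 2^48 */
            acc128(hi, lo, p, 32 * i + 16 * j);
        }
    }
}
```

For example, umul_fp(3, 5, &hi, &lo) leaves hi = 0 and lo = 15. The assembly
code keeps the partial products in floating-point registers and sums them with
integer instructions; this sketch only shows why the products are exact.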
1.1 maekawa 62:
63: Integer conditional move instructions cannot dual-issue with other integer
64: instructions. No conditional move can issue 1-5 cycles after a load. (Or
1.1.1.2 ! ohara 65: something similarly bizarre.) Useless.
1.1 maekawa 66:
1.1.1.2 ! ohara 67: The UltraSPARC-3 pipeline seems similar, but is somewhat more rigid. Branches
! 68: execute slower, and there may be other new stalls. Integer multiply doesn't
! 69: halt the CPU and also has a much lower latency. But it's still not pipelined,
! 70: and thus useless for our needs.
1.1 maekawa 71:
72: STATUS
73:
1.1.1.2 ! ohara 74: (Timings are for UltraSPARC-1/2. UltraSPARC-3 is a few cycles slower.)
! 75:
! 76: * mpn_lshift, mpn_rshift: The current code runs at 2.0 cycles/limb. The IEU0
! 77: functional unit is saturated with shifts.
! 78:
! 79: * mpn_add_n, mpn_sub_n: The current code runs at 4 cycles/limb. The
! 80: 4-instruction recurrence is the speed limiter.
1.1 maekawa 81:
1.1.1.2 ! ohara 82: * mpn_addmul_1: The current code runs at 14 cycles/limb asymptotically. The
! 83: code sustains 4 instructions/cycle. It might be possible to invent a better
! 84: way of summing the intermediate 49-bit operands, but it is unlikely that it
! 85: will save enough instructions to save an entire cycle.
! 86:
! 87: The load of the `rlimb' operand is not scheduled far enough ahead of its use
! 88: for good L2 cache performance. Since the UltraSPARC-1/2 L1 cache is direct
! 89: mapped, we miss to L2 very often. The std/ldx pairs via the stack, by
! 90: contrast, are somewhat over-scheduled.
! 91:
! 92: It would be possible to save two instructions: (1) The `mov' could be avoided
! 93: if the std/ldx were less scheduled. (2) The ldx of `rlimb' could be split
! 94: into two `ld' instructions, saving the shifts/masks.
! 95:
! 96: * mpn_mul_1: The current code is a straightforward edit of the mpn_addmul_1
! 97: code. It would be possible to shave one or two cycles from it, with some
! 98: labour.
! 99:
! 100: * mpn_submul_1: Braindead code just calling mpn_mul_1 + mpn_sub_n. It would be
! 101: possible to either match the mpn_addmul_1 performance, or in the worst case
! 102: use one more instruction group.
1.1 maekawa 103:
1.1.1.2 ! ohara 104: * mpn_Xmul_2: These could be made to run at 9 cycles/limb. Straightforward
! 105: generalization of mpn_Xmul_1.
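
The recurrence mentioned for mpn_add_n can be seen in a plain C reference loop.
This is a sketch only, assuming 64-bit limbs; `ref_add_n' is a hypothetical
name and not the assembly in this directory. Each limb's carry-out depends on
the previous limb's carry-in, so the carry chain is serial however many
instructions the CPU can issue per cycle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* rp[0..n-1] = up[0..n-1] + vp[0..n-1]; returns the carry out (0 or 1). */
static uint64_t ref_add_n(uint64_t *rp, const uint64_t *up,
                          const uint64_t *vp, size_t n)
{
    uint64_t cy = 0;

    assert(rp != NULL && up != NULL && vp != NULL);
    for (size_t i = 0; i < n; i++) {
        uint64_t s  = up[i] + vp[i];    /* independent of earlier limbs */
        uint64_t c1 = s < up[i];        /* carry from the limb addition */
        uint64_t r  = s + cy;           /* depends on the previous carry */
        uint64_t c2 = r < s;            /* carry from adding the carry-in */
        rp[i] = r;
        cy = c1 | c2;                   /* at most one of c1, c2 is set */
    }
    return cy;
}
```

Only the last three steps of the loop body sit on the loop-carried path; the
limb addition and its carry test can be computed ahead of time.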