=================================================================== RCS file: /home/cvs/OpenXM_contrib/gmp/mpn/sparc64/Attic/README,v retrieving revision 1.1.1.1 retrieving revision 1.1.1.2 diff -u -p -r1.1.1.1 -r1.1.1.2 --- OpenXM_contrib/gmp/mpn/sparc64/Attic/README 2000/09/09 14:12:41 1.1.1.1 +++ OpenXM_contrib/gmp/mpn/sparc64/Attic/README 2003/08/25 16:06:26 1.1.1.2 @@ -1,48 +1,105 @@ +Copyright 1997, 1999, 2000, 2001, 2002 Free Software Foundation, Inc. + +This file is part of the GNU MP Library. + +The GNU MP Library is free software; you can redistribute it and/or modify +it under the terms of the GNU Lesser General Public License as published by +the Free Software Foundation; either version 2.1 of the License, or (at your +option) any later version. + +The GNU MP Library is distributed in the hope that it will be useful, but +WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY +or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public +License for more details. + +You should have received a copy of the GNU Lesser General Public License +along with the GNU MP Library; see the file COPYING.LIB. If not, write to +the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA +02111-1307, USA. + + + + + This directory contains mpn functions for 64-bit V9 SPARC RELEVANT OPTIMIZATION ISSUES -The Ultra I/II pipeline executes up to two simple integer arithmetic operations -per cycle. The 64-bit integer multiply instruction mulx takes from 5 cycles to -35 cycles, depending on the position of the most significant bit of the 1st -source operand. It cannot overlap with other instructions. For our use of -mulx, it will take from 5 to 20 cycles. +Notation: + IANY = shift/add/sub/logical/sethi + IADDLOG = add/sub/logical/sethi + MEM = ld*/st* + FA = fadd*/fsub*/f*to*/fmov* + FM = fmul* +UltraSPARC-1/2 can issue four instructions per cycle, with these restrictions: +* Two IANY instructions, but only one of these may be a shift. If there is a + shift and an IANY instruction, the shift must precede the IANY instruction. +* One FA. +* One FM. +* One branch. +* One MEM. +* IANY/IADDLOG/MEM must be insn 1, 2, or 3 in an issue bundle. Taken branches + should not be in slot 4, since that makes the delay insn come from separate + bundle. +* If two IANY/IADDLOG instructions are to be executed in the same cycle and one + of these is setting the condition codes, that instruction must be the second + one. + +To summarize, ignoring branches, these are the bundles that can reach the peak +execution speed: + +insn1 iany iany mem iany iany mem iany iany mem +insn2 iaddlog mem iany mem iaddlog iany mem iaddlog iany +insn3 mem iaddlog iaddlog fa fa fa fm fm fm +insn4 fa/fm fa/fm fa/fm fm fm fm fa fa fa + +The 64-bit integer multiply instruction mulx takes from 5 cycles to 35 cycles, +depending on the position of the most significant bit of the first source +operand. When used for 32x32->64 multiplication, it needs 20 cycles. +Furthermore, it stalls the processor while executing. We stay away from that +instruction, and instead use floating-point operations. + Integer conditional move instructions cannot dual-issue with other integer instructions. No conditional move can issue 1-5 cycles after a load. (Or -something such bizzare.) +something such bizarre.) Useless. -Integer branches can issue with two integer arithmetic instructions. Likewise -for integer loads. Four instructions may issue (arith, arith, ld/st, branch) -but only if the branch is last. +The UltraSPARC-3 pipeline seems similar, but is somewhat more rigid. Branches +execute slower, and there may be other new stalls. Integer multiply doesn't +halt the CPU and also has a much lower latency. But it's still not pipelined, +and thus useless for our needs. -(The V9 architecture manual recommends that the 2nd operand of a multiply -instruction be the smaller one. For UltraSPARC, they got things backwards and -optimize for the wrong operand! Really helpful in the light of that multiply -is incredibly slow on these CPUs!) - STATUS -There is new code in ~/prec/gmp-remote/sparc64. Not tested or completed, but -the pipelines are worked out. Here are the timings: +(Timings are for UltraSPARC-1/2. UltraSPARC-3 is a few cycles slower.) -* lshift, rshift: The code is well-optimized and runs at 2.0 cycles/limb. +* mpn_lshift, mpn_rshift: The current code runs at 2.0 cycles/limb. The IEU0 + functional unit is saturated with shifts. -* add_n, sub_n: add3.s currently runs at 6 cycles/limb. We use a bizarre - scheme of compares and branches (with some nops and fnops to align things) - and carefully stay away from the instructions intended for this application - (i.e., movcs and movcc). +* mpn_add_n, mpn_sub_n: The current code runs at 4 cycles/limb. The 4 + instruction recurrency is the speed limiter. - Using movcc/movcs, even with deep unrolling, seems to get down to 7 - cycles/limb. +* mpn_addmul_1: The current code runs at 14 cycles/limb asymptotically. The + code sustains 4 instructions/cycle. It might be possible to invent a better + way of summing the intermediate 49-bit operands, but it is unlikely that it + will save enough instructions to save an entire cycle. - The most promising approach is to split operands in 32-bit pieces using - srlx, then use two addccc, and finally compile the results with sllx+or. - The result could run at 5 cycles/limb, I think. It might be possible to - do without unrolling, or with minimal unrolling. + The load-use of the `rlimb' operand is not enough scheduled for good L2 cache + performance. Since UltraSPARC-1/2 L1 cache is direct mapped, we miss to L2 + very often. The load-use of the std/ldx pairs via the stack are somewhat + over-scheduled. -* addmul_1/submul_1: Should optimize for when scalar operand < 2^32. -* addmul_1/submul_1: Since mulx is horrendously slow on UltraSPARC I/II, - Karatsuba's method should save up to 16 cycles (i.e. > 20%). -* mul_1 (and possibly the other multiply functions): Handle carry in the - same tricky way as add_n,sub_n. + It would be possible to save two instructions: (1) The `mov' could be avoided + if ths std/ldx were less scheduled. (2) The ldx of `rlimb' could be split + into two `ld' instructions, saving the shifts/masks. + +* mpn_mul_1: The current code is a straightforward edit of the mpn_addmul_1 + code. It would be possible to shave one or two cycles from it, with some + labour. + +* mpn_submul_1: Braindead code just calling mpn_mul_1 + mpn_sub_n. It would be + possible to either match the mpn_addmul_1 performance, or in the worst case + use one more instruction group. + +* mpn_Xmul_2: These could be made to run at 9 cycles/limb. Straightforward + generalization of mpn_Xmul_1.