OpenXM_contrib/gmp/mpn/sparc64/README - diff


Diff for /OpenXM_contrib/gmp/mpn/sparc64/Attic/README between version 1.1.1.1 and 1.1.1.2

-version 1.1.1.1, 2000/09/09 14:12:41
+version 1.1.1.2, 2003/08/25 16:06:26
 Line 1
+ Copyright 1997, 1999, 2000, 2001, 2002 Free Software Foundation, Inc.
+ This file is part of the GNU MP Library.
+ The GNU MP Library is free software; you can redistribute it and/or modify
+ it under the terms of the GNU Lesser General Public License as published by
+ the Free Software Foundation; either version 2.1 of the License, or (at your
+ option) any later version.
+ The GNU MP Library is distributed in the hope that it will be useful, but
+ WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+ or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
+ License for more details.
+ You should have received a copy of the GNU Lesser General Public License
+ along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
+ the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
+ 02111-1307, USA.
  This directory contains mpn functions for 64-bit V9 SPARC
  RELEVANT OPTIMIZATION ISSUES
- The Ultra I/II pipeline executes up to two simple integer arithmetic operations
+ Notation:
- per cycle.  The 64-bit integer multiply instruction mulx takes from 5 cycles to
+   IANY = shift/add/sub/logical/sethi
- 35 cycles, depending on the position of the most significant bit of the 1st
+   IADDLOG = add/sub/logical/sethi
- source operand.  It cannot overlap with other instructions.  For our use of
+   MEM = ld*/st*
- mulx, it will take from 5 to 20 cycles.
+   FA = fadd*/fsub*/f*to*/fmov*
+   FM = fmul*
+ UltraSPARC-1/2 can issue four instructions per cycle, with these restrictions:
+ * Two IANY instructions, but only one of these may be a shift.  If there is a
+   shift and an IANY instruction, the shift must precede the IANY instruction.
+ * One FA.
+ * One FM.
+ * One branch.
+ * One MEM.
+ * IANY/IADDLOG/MEM must be insn 1, 2, or 3 in an issue bundle.  Taken branches
+   should not be in slot 4, since that makes the delay insn come from a separate
+   bundle.
+ * If two IANY/IADDLOG instructions are to be executed in the same cycle and one
+   of these is setting the condition codes, that instruction must be the second
+   one.
+ To summarize, ignoring branches, these are the bundles that can reach the peak
+ execution speed:
+ insn1   iany    iany    mem     iany    iany    mem     iany    iany    mem
+ insn2   iaddlog mem     iany    mem     iaddlog iany    mem     iaddlog iany
+ insn3   mem     iaddlog iaddlog fa      fa      fa      fm      fm      fm
+ insn4   fa/fm   fa/fm   fa/fm   fm      fm      fm      fa      fa      fa
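
The grouping restrictions above can be written down as a small checker, which is handy when hand-scheduling a loop on paper. This is only a sketch of the rules as listed in this README, not a model of the real pipeline: the names `insn_class`, `struct insn` and `bundle_ok` are hypothetical, the advice about taken branches in slot 4 is not modelled, and a bundle in which the first of two integer instructions sets the condition codes is simply rejected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Instruction classes from the notation above.  SHIFT and IADDLOG together
   make up IANY.  All names here are illustrative only. */
enum insn_class { SHIFT, IADDLOG, MEM, FA, FM, BRANCH };

struct insn {
    enum insn_class cls;
    bool sets_cc;               /* true if the insn writes the condition codes */
};

static bool is_int(enum insn_class c)
{
    return c == SHIFT || c == IADDLOG;
}

/* Return true if b[0..n-1], in program order, satisfies the listed rules. */
bool bundle_ok(const struct insn *b, size_t n)
{
    assert(b != NULL || n == 0);
    if (n > 4)
        return false;                       /* at most four insns per cycle */

    int count[6] = {0};                     /* per-class counts, non-integer */
    int nint = 0;                           /* integer insns seen so far */
    bool first_sets_cc = false;

    for (size_t i = 0; i < n; i++) {
        enum insn_class c = b[i].cls;

        /* IANY/IADDLOG/MEM must be insn 1, 2, or 3. */
        if (i == 3 && (is_int(c) || c == MEM))
            return false;

        if (is_int(c)) {
            if (nint == 2)
                return false;               /* only two integer insns */
            /* A second integer insn may not be a shift (this covers both
               "one shift only" and "shift must come first"), and the first
               one may not be the one setting the condition codes. */
            if (nint == 1 && (c == SHIFT || first_sets_cc))
                return false;
            if (nint == 0)
                first_sets_cc = b[i].sets_cc;
            nint++;
        } else if (++count[c] > 1) {
            return false;                   /* one each of MEM, FA, FM, branch */
        }
    }
    return true;
}
```

For example, the first column of the table (iany, iaddlog, mem, fa) passes, while the same bundle with the shift moved to slot 2 is rejected.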
+ The 64-bit integer multiply instruction mulx takes from 5 cycles to 35 cycles,
+ depending on the position of the most significant bit of the first source
+ operand.  When used for 32x32->64 multiplication, it needs 20 cycles.
+ Furthermore, it stalls the processor while executing.  We stay away from that
+ instruction, and instead use floating-point operations.
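
As a rough illustration of how a 64x64->128 multiplication can be done with floating-point operations instead of mulx, the C sketch below splits one operand into two 32-bit halves and the other into four 16-bit pieces. That particular split is an assumption, inferred from the mention of 49-bit intermediate operands in the STATUS section: each 32x16 product is below 2^48 and the sum of two such products is below 2^49, so everything stays exact in the 53-bit mantissa of an IEEE double. The names `umul_fp` and `acc_shifted` are hypothetical, and the actual assembly code may organize the work differently.

```c
#include <assert.h>
#include <stdint.h>

/* Add (x << shift) into the 128-bit accumulator hi:lo.
   shift must be a multiple of 16 below 96. */
static void acc_shifted(uint64_t *hi, uint64_t *lo, uint64_t x, unsigned shift)
{
    assert(shift < 96 && shift % 16 == 0);
    if (shift >= 64) {
        *hi += x << (shift - 64);
        return;
    }
    uint64_t low = x << shift;
    uint64_t high = shift ? x >> (64 - shift) : 0;   /* avoid a shift by 64 */
    *lo += low;
    *hi += high + (*lo < low);      /* carry out of the low word */
}

/* Compute hi:lo = u * v using only double-precision multiplies and adds
   for the partial products. */
void umul_fp(uint64_t u, uint64_t v, uint64_t *hi, uint64_t *lo)
{
    assert(hi != 0 && lo != 0);

    /* u as two 32-bit halves, v as four 16-bit pieces, all exact doubles. */
    double ul = (double)(u & 0xffffffffu);
    double uh = (double)(u >> 32);
    double v0 = (double)(v & 0xffff);
    double v1 = (double)((v >> 16) & 0xffff);
    double v2 = (double)((v >> 32) & 0xffff);
    double v3 = (double)(v >> 48);

    *hi = 0;
    *lo = 0;
    /* Partial products grouped by weight; each sum is below 2^49. */
    acc_shifted(hi, lo, (uint64_t)(ul * v0), 0);
    acc_shifted(hi, lo, (uint64_t)(ul * v1), 16);
    acc_shifted(hi, lo, (uint64_t)(ul * v2 + uh * v0), 32);
    acc_shifted(hi, lo, (uint64_t)(ul * v3 + uh * v1), 48);
    acc_shifted(hi, lo, (uint64_t)(uh * v2), 64);
    acc_shifted(hi, lo, (uint64_t)(uh * v3), 80);
}
```

The point of the split is that the floating-point multiplier is pipelined, so independent partial products can overlap, whereas mulx stalls the processor for its whole latency.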
  Integer conditional move instructions cannot dual-issue with other integer
  instructions.  No conditional move can issue 1-5 cycles after a load.  (Or
- something such bizzare.)
+ something such bizarre.)  Useless.
- Integer branches can issue with two integer arithmetic instructions.  Likewise
+ The UltraSPARC-3 pipeline seems similar, but is somewhat more rigid.  Branches
- for integer loads.  Four instructions may issue (arith, arith, ld/st, branch)
+ execute slower, and there may be other new stalls.  Integer multiply doesn't
- but only if the branch is last.
+ halt the CPU and also has a much lower latency.  But it's still not pipelined,
+ and thus useless for our needs.
- (The V9 architecture manual recommends that the 2nd operand of a multiply
- instruction be the smaller one.  For UltraSPARC, they got things backwards and
- optimize for the wrong operand!  Really helpful in the light of that multiply
- is incredibly slow on these CPUs!)
  STATUS
- There is new code in ~/prec/gmp-remote/sparc64.  Not tested or completed, but
+ (Timings are for UltraSPARC-1/2.  UltraSPARC-3 is a few cycles slower.)
- the pipelines are worked out.  Here are the timings:
- * lshift, rshift: The code is well-optimized and runs at 2.0 cycles/limb.
+ * mpn_lshift, mpn_rshift: The current code runs at 2.0 cycles/limb.  The IEU0
+   functional unit is saturated with shifts.
- * add_n, sub_n: add3.s currently runs at 6 cycles/limb.  We use a bizarre
+ * mpn_add_n, mpn_sub_n: The current code runs at 4 cycles/limb.  The 4
-   scheme of compares and branches (with some nops and fnops to align things)
+   instruction recurrency is the speed limiter.
-   and carefully stay away from the instructions intended for this application
-   (i.e., movcs and movcc).
-   Using movcc/movcs, even with deep unrolling, seems to get down to 7
+ * mpn_addmul_1: The current code runs at 14 cycles/limb asymptotically.  The
-   cycles/limb.
+   code sustains 4 instructions/cycle.  It might be possible to invent a better
+   way of summing the intermediate 49-bit operands, but it is unlikely that it
+   will save enough instructions to save an entire cycle.
-   The most promising approach is to split operands in 32-bit pieces using
+   The load-use of the `rlimb' operand is not enough scheduled for good L2 cache
-   srlx, then use two addccc, and finally compile the results with sllx+or.
+   performance.  Since UltraSPARC-1/2 L1 cache is direct mapped, we miss to L2
-   The result could run at 5 cycles/limb, I think.  It might be possible to
+   very often.  The load-use of the std/ldx pairs via the stack are somewhat
-   do without unrolling, or with minimal unrolling.
+   over-scheduled.
- * addmul_1/submul_1: Should optimize for when scalar operand < 2^32.
+   It would be possible to save two instructions: (1) The `mov' could be avoided
- * addmul_1/submul_1: Since mulx is horrendously slow on UltraSPARC I/II,
+   if the std/ldx were less scheduled.  (2) The ldx of `rlimb' could be split
-   Karatsuba's method should save up to 16 cycles (i.e. > 20%).
+   into two `ld' instructions, saving the shifts/masks.
- * mul_1 (and possibly the other multiply functions): Handle carry in the
-   same tricky way as add_n,sub_n.
+ * mpn_mul_1: The current code is a straightforward edit of the mpn_addmul_1
+   code.  It would be possible to shave one or two cycles from it, with some
+   labour.
+ * mpn_submul_1: Braindead code just calling mpn_mul_1 + mpn_sub_n.  It would be
+   possible to either match the mpn_addmul_1 performance, or in the worst case
+   use one more instruction group.
+ * mpn_Xmul_2: These could be made to run at 9 cycles/limb.  Straightforward
+   generalization of mpn_Xmul_1.
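
To make the mpn_add_n remark about the carry recurrence more concrete, here is a plain C reference for limb-wise addition in which the carry is recovered with compares rather than a carry flag. It is a sketch only: `limb_t` and `ref_add_n` are hypothetical names, 64-bit limbs are assumed, and the C statements are not claimed to correspond one-to-one to the instructions of the assembly loop. What it does show is that each limb's result depends on the previous limb's carry, so the carry chain bounds the speed regardless of how many instructions can issue per cycle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;        /* stand-in for GMP's mp_limb_t */

/* rp[0..n-1] = up[0..n-1] + vp[0..n-1]; returns the carry out (0 or 1).
   rp may be the same array as up or vp. */
limb_t ref_add_n(limb_t *rp, const limb_t *up, const limb_t *vp, size_t n)
{
    assert(n > 0);
    limb_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        limb_t u = up[i], v = vp[i];
        limb_t s = u + v;       /* independent of the carry chain */
        limb_t c1 = s < u;      /* carry out of u + v */
        limb_t r = s + cy;      /* depends on the previous iteration */
        limb_t c2 = r < s;      /* carry out of s + cy */
        rp[i] = r;
        cy = c1 | c2;           /* at most one of c1, c2 can be set */
    }
    return cy;
}
```

Only the last three operations in the loop body sit on the carry path; the first add and compare can be computed ahead of time, which is the kind of overlap the scheduled assembly exploits.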
