
Diff for /OpenXM_contrib/gmp/mpn/sparc64/Attic/README between version 1.1.1.1 and 1.1.1.2

version 1.1.1.1 (2000/09/09 14:12:41) to version 1.1.1.2 (2003/08/25 16:06:26)
   Copyright 1997, 1999, 2000, 2001, 2002 Free Software Foundation, Inc.
   
   This file is part of the GNU MP Library.
   
   The GNU MP Library is free software; you can redistribute it and/or modify
   it under the terms of the GNU Lesser General Public License as published by
   the Free Software Foundation; either version 2.1 of the License, or (at your
   option) any later version.
   
   The GNU MP Library is distributed in the hope that it will be useful, but
   WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
   or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
   License for more details.
   
   You should have received a copy of the GNU Lesser General Public License
   along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
   the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
   02111-1307, USA.
   
   
   
   
   
  This directory contains mpn functions for 64-bit V9 SPARC
   
  RELEVANT OPTIMIZATION ISSUES
   
  Notation:
    IANY = shift/add/sub/logical/sethi
    IADDLOG = add/sub/logical/sethi
    MEM = ld*/st*
    FA = fadd*/fsub*/f*to*/fmov*
    FM = fmul*
   
   UltraSPARC-1/2 can issue four instructions per cycle, with these restrictions:
   * Two IANY instructions, but only one of these may be a shift.  If there is a
     shift and an IANY instruction, the shift must precede the IANY instruction.
   * One FA.
   * One FM.
   * One branch.
   * One MEM.
  * IANY/IADDLOG/MEM must be insn 1, 2, or 3 in an issue bundle.  Taken branches
    should not be in slot 4, since that makes the delay insn come from a separate
    bundle.
   * If two IANY/IADDLOG instructions are to be executed in the same cycle and one
     of these is setting the condition codes, that instruction must be the second
     one.
   
   To summarize, ignoring branches, these are the bundles that can reach the peak
   execution speed:
   
   insn1   iany    iany    mem     iany    iany    mem     iany    iany    mem
   insn2   iaddlog mem     iany    mem     iaddlog iany    mem     iaddlog iany
   insn3   mem     iaddlog iaddlog fa      fa      fa      fm      fm      fm
   insn4   fa/fm   fa/fm   fa/fm   fm      fm      fm      fa      fa      fa
   
   The 64-bit integer multiply instruction mulx takes from 5 cycles to 35 cycles,
   depending on the position of the most significant bit of the first source
   operand.  When used for 32x32->64 multiplication, it needs 20 cycles.
   Furthermore, it stalls the processor while executing.  We stay away from that
   instruction, and instead use floating-point operations.
   
  Integer conditional move instructions cannot dual-issue with other integer
  instructions.  No conditional move can issue 1-5 cycles after a load.  (Or
  something similarly bizarre.)  Useless.
   
  The UltraSPARC-3 pipeline seems similar, but is somewhat more rigid.  Branches
  execute slower, and there may be other new stalls.  Integer multiply doesn't
  halt the CPU and also has a much lower latency.  But it's still not pipelined,
  and thus useless for our needs.
   
   
  STATUS
   
  (Timings are for UltraSPARC-1/2.  UltraSPARC-3 is a few cycles slower.)
   
  * mpn_lshift, mpn_rshift: The current code runs at 2.0 cycles/limb.  The IEU0
    functional unit is saturated with shifts.
   
  * mpn_add_n, mpn_sub_n: The current code runs at 4 cycles/limb.  The
    4-instruction recurrency is the speed limiter.
   
  * mpn_addmul_1: The current code runs at 14 cycles/limb asymptotically.  The
    code sustains 4 instructions/cycle.  It might be possible to invent a better
    way of summing the intermediate 49-bit operands, but it is unlikely that it
    will save enough instructions to save an entire cycle.
   
    The load-use of the `rlimb' operand is not scheduled early enough for good
    L2 cache performance.  Since the UltraSPARC-1/2 L1 cache is direct-mapped,
    we miss to L2 very often.  The load-use of the std/ldx pairs via the stack
    is somewhat over-scheduled.
   
    It would be possible to save two instructions: (1) the `mov' could be
    avoided if the std/ldx were less scheduled; (2) the ldx of `rlimb' could be
    split into two `ld' instructions, saving the shifts/masks.

  * mpn_mul_1: The current code is a straightforward edit of the mpn_addmul_1
    code.  It would be possible to shave one or two cycles from it, with some
    labour.
   
   * mpn_submul_1: Braindead code just calling mpn_mul_1 + mpn_sub_n.  It would be
     possible to either match the mpn_addmul_1 performance, or in the worst case
     use one more instruction group.
   
   * mpn_Xmul_2: These could be made to run at 9 cycles/limb.  Straightforward
     generalization of mpn_Xmul_1.
