version 1.1.1.1, 2000/09/09 14:12:41
version 1.1.1.2, 2003/08/25 16:06:26

Copyright 1997, 1999, 2000, 2001, 2002 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 2.1 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library; see the file COPYING.LIB. If not, write to
the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
02111-1307, USA.


This directory contains mpn functions for 64-bit V9 SPARC


RELEVANT OPTIMIZATION ISSUES

Version 1.1.1.1:

The Ultra I/II pipeline executes up to two simple integer arithmetic operations
per cycle. The 64-bit integer multiply instruction mulx takes from 5 cycles to
35 cycles, depending on the position of the most significant bit of the 1st
source operand. It cannot overlap with other instructions. For our use of
mulx, it will take from 5 to 20 cycles.

Version 1.1.1.2:

Notation:
  IANY = shift/add/sub/logical/sethi
  IADDLOG = add/sub/logical/sethi
  MEM = ld*/st*
  FA = fadd*/fsub*/f*to*/fmov*
  FM = fmul*

UltraSPARC-1/2 can issue four instructions per cycle, with these restrictions:
* Two IANY instructions, but only one of these may be a shift. If there is a
  shift and an IANY instruction, the shift must precede the IANY instruction.
* One FA.
* One FM.
* One branch.
* One MEM.
* IANY/IADDLOG/MEM must be insn 1, 2, or 3 in an issue bundle. Taken branches
  should not be in slot 4, since that makes the delay insn come from a
  separate bundle.
* If two IANY/IADDLOG instructions are to be executed in the same cycle and one
  of these is setting the condition codes, that instruction must be the second
  one.

To summarize, ignoring branches, these are the bundles that can reach the peak
execution speed:

insn1   iany     iany     mem      iany  iany     mem   iany  iany     mem
insn2   iaddlog  mem      iany     mem   iaddlog  iany  mem   iaddlog  iany
insn3   mem      iaddlog  iaddlog  fa    fa       fa    fm    fm       fm
insn4   fa/fm    fa/fm    fa/fm    fm    fm       fm    fa    fa       fa

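The issue restrictions listed above can be turned into a small checker, which
may help when laying out a loop by hand. The following C sketch is only a model
of the rules as written in this file. The names `insn_class` and `bundle_ok`
are made up for the illustration, the split of IANY into SHIFT and IADDLOG is
an assumption of the model, and the condition-code ordering rule is not
modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Instruction classes in the notation of this file.  SHIFT and IADDLOG
   together make up IANY.  (Hypothetical names, for illustration only.) */
enum insn_class { SHIFT, IADDLOG, MEM, FA, FM, BRANCH, NCLASS };

/* Return true if a bundle of n instructions (slots 1..n, n <= 4) obeys the
   issue restrictions listed above.  The condition-code rule is ignored. */
bool bundle_ok(const enum insn_class *b, size_t n)
{
    int count[NCLASS] = { 0 };

    assert(b != NULL);
    if (n > 4)
        return false;

    for (size_t i = 0; i < n; i++) {
        bool is_int = (b[i] == SHIFT || b[i] == IADDLOG);

        /* IANY/IADDLOG/MEM must be insn 1, 2, or 3 (not slot 4). */
        if ((is_int || b[i] == MEM) && i == 3)
            return false;

        /* A shift must be the first integer instruction in the bundle.
           This also rejects two shifts in one bundle. */
        if (b[i] == SHIFT && count[SHIFT] + count[IADDLOG] > 0)
            return false;

        count[b[i]]++;
    }

    /* At most two IANY, and at most one each of MEM, FA, FM, branch. */
    return count[SHIFT] + count[IADDLOG] <= 2
        && count[MEM] <= 1 && count[FA] <= 1
        && count[FM] <= 1 && count[BRANCH] <= 1;
}
```

With this model, a bundle such as shift, iaddlog, mem, fa is accepted, while
iaddlog, shift, mem, fa is rejected because the shift does not come first.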

The 64-bit integer multiply instruction mulx takes from 5 cycles to 35 cycles,
depending on the position of the most significant bit of the first source
operand. When used for 32x32->64 multiplication, it needs 20 cycles.
Furthermore, it stalls the processor while executing. We stay away from that
instruction, and instead use floating-point operations.

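Floating-point multiplication can stand in for mulx because a double has a
53-bit significand: if the operands are cut into small enough pieces, every
partial product is exact. The following C sketch illustrates the idea for a
64x64->128 multiply. The split (32-bit pieces of one operand, 16-bit pieces of
the other, giving products below 2^48) is an assumption chosen for the
illustration, and the names `umul_fp` and `add_shifted` are made up here. It is
not the scheduling or the summation scheme of the assembly code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: add (p << shift) into the 128-bit value hi:lo.
   p must be below 2^48; shift is a multiple of 16 between 0 and 80. */
static void add_shifted(uint64_t *hi, uint64_t *lo, uint64_t p, unsigned shift)
{
    uint64_t low = 0, high = 0;

    assert(shift <= 80 && shift % 16 == 0);
    if (shift >= 64) {
        high = p << (shift - 64);
    } else {
        low = p << shift;
        if (shift != 0)
            high = p >> (64 - shift);
    }
    *lo += low;
    *hi += high + (*lo < low);      /* propagate carry out of the low word */
}

/* Compute hi:lo = u * v using only double-precision multiplies. */
void umul_fp(uint64_t u, uint64_t v, uint64_t *hi, uint64_t *lo)
{
    double up[2], vp[4];

    /* Two 32-bit pieces of u, four 16-bit pieces of v. */
    up[0] = (double)(u & 0xffffffffu);
    up[1] = (double)(u >> 32);
    for (int j = 0; j < 4; j++)
        vp[j] = (double)((v >> (16 * j)) & 0xffff);

    *hi = 0;
    *lo = 0;
    for (int i = 0; i < 2; i++) {
        for (int j = 0; j < 4; j++) {
            /* Product is below 2^48, so it is exact in a double. */
            double prod = up[i] * vp[j];
            add_shifted(hi, lo, (uint64_t)prod, 32u * i + 16u * j);
        }
    }
}
```

The eight multiplies are independent of each other, which is what lets a
pipelined floating-point unit overlap them, unlike mulx.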

Integer conditional move instructions cannot dual-issue with other integer
instructions. No conditional move can issue 1-5 cycles after a load. (Or
something similarly bizarre.) Useless.

Version 1.1.1.2:

The UltraSPARC-3 pipeline seems similar, but is somewhat more rigid. Branches
execute slower, and there may be other new stalls. Integer multiply doesn't
halt the CPU and also has a much lower latency. But it's still not pipelined,
and thus useless for our needs.

Version 1.1.1.1:

Integer branches can issue with two integer arithmetic instructions. Likewise
for integer loads. Four instructions may issue (arith, arith, ld/st, branch)
but only if the branch is last.

(The V9 architecture manual recommends that the 2nd operand of a multiply
instruction be the smaller one. For UltraSPARC, they got things backwards and
optimize for the wrong operand! Really helpful, given that multiply is
incredibly slow on these CPUs!)


STATUS

Version 1.1.1.1:

There is new code in ~/prec/gmp-remote/sparc64. Not tested or completed, but
the pipelines are worked out. Here are the timings:

* lshift, rshift: The code is well-optimized and runs at 2.0 cycles/limb.

* add_n, sub_n: add3.s currently runs at 6 cycles/limb. We use a bizarre
  scheme of compares and branches (with some nops and fnops to align things)
  and carefully stay away from the instructions intended for this application
  (i.e., movcs and movcc).

  Using movcc/movcs, even with deep unrolling, seems to get down to 7
  cycles/limb.

  The most promising approach is to split operands into 32-bit pieces using
  srlx, then use two addccc, and finally combine the results with sllx+or.
  The result could run at 5 cycles/limb, I think. It might be possible to
  do without unrolling, or with minimal unrolling.

* addmul_1/submul_1: Should optimize for when scalar operand < 2^32.
* addmul_1/submul_1: Since mulx is horrendously slow on UltraSPARC I/II,
  Karatsuba's method should save up to 16 cycles (i.e. > 20%).
* mul_1 (and possibly the other multiply functions): Handle carry in the
  same tricky way as add_n,sub_n.

Version 1.1.1.2:

(Timings are for UltraSPARC-1/2. UltraSPARC-3 is a few cycles slower.)

* mpn_lshift, mpn_rshift: The current code runs at 2.0 cycles/limb. The IEU0
  functional unit is saturated with shifts.

* mpn_add_n, mpn_sub_n: The current code runs at 4 cycles/limb. The 4
  instruction recurrency is the speed limiter.

* mpn_addmul_1: The current code runs at 14 cycles/limb asymptotically. The
  code sustains 4 instructions/cycle. It might be possible to invent a better
  way of summing the intermediate 49-bit operands, but it is unlikely that it
  will save enough instructions to save an entire cycle.

  The load-use of the `rlimb' operand is not sufficiently scheduled for good
  L2 cache performance. Since UltraSPARC-1/2 L1 cache is direct mapped, we
  miss to L2 very often. The load-use of the std/ldx pairs via the stack is
  somewhat over-scheduled.

  It would be possible to save two instructions: (1) The `mov' could be avoided
  if the std/ldx were less scheduled. (2) The ldx of `rlimb' could be split
  into two `ld' instructions, saving the shifts/masks.

* mpn_mul_1: The current code is a straightforward edit of the mpn_addmul_1
  code. It would be possible to shave one or two cycles from it, with some
  labour.

* mpn_submul_1: Braindead code just calling mpn_mul_1 + mpn_sub_n. It would be
  possible to either match the mpn_addmul_1 performance, or in the worst case
  use one more instruction group.

* mpn_Xmul_2: These could be made to run at 9 cycles/limb. Straightforward
  generalization of mpn_Xmul_1.
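
The add_n notes above mention splitting operands into 32-bit pieces, adding
the halves separately, and recombining them with a shift and an or. A minimal
C sketch of that idea follows. C has no carry flag, so the carry is taken from
bit 32 of each partial sum rather than from addccc. The function names
`add_limb_split` and `add_n_split` are made up for the illustration, and this
is not the code the notes refer to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add two 64-bit limbs plus a carry-in (0 or 1) through 32-bit halves.
   Returns the sum limb and stores the carry-out (0 or 1) in *cout. */
uint64_t add_limb_split(uint64_t a, uint64_t b, unsigned cin, unsigned *cout)
{
    const uint64_t mask = 0xffffffffu;
    uint64_t lo, hi;

    assert(cin <= 1);
    lo = (a & mask) + (b & mask) + cin;        /* low halves, at most 33 bits */
    hi = (a >> 32) + (b >> 32) + (lo >> 32);   /* high halves plus carry */
    *cout = (unsigned)(hi >> 32);              /* carry out of the limb */
    return (hi << 32) | (lo & mask);           /* recombine (sllx + or) */
}

/* Hypothetical add_n-style loop: rp[] = ap[] + bp[], returns the carry. */
unsigned add_n_split(uint64_t *rp, const uint64_t *ap, const uint64_t *bp,
                     size_t n)
{
    unsigned carry = 0;

    for (size_t i = 0; i < n; i++)
        rp[i] = add_limb_split(ap[i], bp[i], carry, &carry);
    return carry;
}
```

The point of the split is that the carry travels as ordinary data in a
register, so the loop needs neither compare-and-branch sequences nor
conditional moves.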