Copyright 1996, 1997, 1999, 2000, 2001, 2002 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License as published by the
Free Software Foundation; either version 2.1 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public License
for more details.

You should have received a copy of the GNU Lesser General Public License along
with the GNU MP Library; see the file COPYING.LIB.  If not, write to the Free
Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307,
USA.




This directory contains mpn functions optimized for DEC Alpha processors.


ALPHA ASSEMBLY RULES AND REGULATIONS

The `.prologue N' pseudo op marks the end of the instructions that need
special handling by the unwinder.  It also says whether $27 is really needed
for computing the gp.  The `.mask M' pseudo op says which registers are saved
on the stack, and at what offset in the frame.

Cray T3 code is very, very different...


RELEVANT OPTIMIZATION ISSUES

EV4

1. This chip has very limited store bandwidth.  The on-chip L1 cache is write-
   through, and a cache line is transferred from the store buffer to the off-
   chip L2 in as many as 15 cycles on most systems.  This delay hurts
   mpn_add_n, mpn_sub_n, mpn_lshift, and mpn_rshift.

2. Pairing is possible between memory instructions and integer arithmetic
   instructions.

3. mulq and umulh are documented to have a latency of 23 cycles, but 2 of
   these cycles are pipelined.  Thus, multiply instructions can be issued at
   a rate of one each 21st cycle.
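   For readers less familiar with Alpha: mulq yields the low 64 bits of the
   full 64x64-bit product and umulh the high 64 bits, so the latency figures
   above (and in the EV5 notes below) always concern this pair.  A minimal C
   model, assuming a compiler that provides a 128-bit integer type; this is
   an illustration only, not GMP's actual umul_ppmm machinery:

        #include <stdint.h>

        typedef unsigned __int128 uint128;

        /* Model of the Alpha mulq/umulh pair: together they form the full
           64x64->128 bit product.  On EV4 they issue at most one each 21st
           cycle; EV5 pipelines them considerably better.  */
        static uint64_t mulq  (uint64_t a, uint64_t b) { return a * b; }
        static uint64_t umulh (uint64_t a, uint64_t b)
        {
          return (uint64_t) (((uint128) a * b) >> 64);
        }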
EV5

1. The memory bandwidth of this chip is good, both for loads and stores.  The
   L1 cache can handle two loads or one store per cycle, but for two cycles
   after a store, no ld can issue.

2. mulq has a latency of 12 cycles and an issue rate of 1 each 8th cycle.
   umulh has a latency of 14 cycles and an issue rate of 1 each 10th cycle.
   (Note that published documentation gets these numbers slightly wrong.)

3. mpn_add_n.  With 4-fold unrolling, we need 37 instructions, whereof 12 are
   memory operations.  This will take at least

	ceil(37/2) [dual issue] + 1 [taken branch] = 19 cycles

   We have 12 memory cycles, plus 4 after-store conflict cycles, or 16 data
   cache cycles, which should be completely hidden in the 19 issue cycles.
   The computation is inherently serial, with these dependencies:

	      ldq  ldq
	       \  /\
	  (or) addq |
	   |\  /  \ |
	   | addq  cmpult
	    \  |      |
	     cmpult   |
	        \    /
	          or

   I.e., 3 operations are needed between carry-in and carry-out, making 12
   cycles the absolute minimum for the 4 limbs.  We could replace the `or'
   with a cmoveq/cmovne, which could issue one cycle earlier than the `or',
   but that might waste a cycle on EV4.  The total depth remains unaffected,
   since cmov has a latency of 2 cycles.  (A C rendition of this carry chain
   is sketched after this list.)

   Montgomery has a slightly different way of computing carry that requires
   one less instruction, but has depth 4 (instead of the current 3).  Since
   the code is currently instruction issue bound, Montgomery's idea should
   save us 1/2 cycle per limb, or bring us down to a total of 17 cycles or
   4.25 cycles/limb.  Unfortunately, this method will not be good for the
   EV6.

4. addmul_1 and friends: We previously had a scheme for splitting the
   single-limb operand into 21-bit chunks and the multi-limb operand into
   32-bit chunks, and then using FP operations for every 2nd multiply and
   integer operations for the remaining multiplies.

   But it seems much better to split the single-limb operand into 16-bit
   chunks, since we save many integer shifts and adds that way.  See
   powerpc64/README for some more details, and the decomposition sketch
   after this list.
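As promised in item 3, a C rendition of the carry chain shown in the diagram:
two addq, two cmpult and an or per limb, with only three operations on the
serial path from carry-in to carry-out.  This is a hedged reference sketch of
the per-limb computation, not the unrolled assembly loop; the name ref_add_n
is illustrative:

        #include <stddef.h>
        #include <stdint.h>

        typedef uint64_t mp_limb_t;

        /* Reference carry chain for mpn_add_n.  The serial path is
           or -> addq -> cmpult -> or, i.e. 3 operations per limb.  */
        mp_limb_t
        ref_add_n (mp_limb_t *rp, const mp_limb_t *up, const mp_limb_t *vp,
                   size_t n)
        {
          mp_limb_t cy = 0;
          for (size_t i = 0; i < n; i++)
            {
              mp_limb_t u = up[i], v = vp[i];   /* ldq, ldq */
              mp_limb_t s = u + v;              /* addq     */
              mp_limb_t c1 = s < u;             /* cmpult   */
              mp_limb_t r = s + cy;             /* addq     */
              mp_limb_t c2 = r < s;             /* cmpult   */
              rp[i] = r;                        /* stq      */
              cy = c1 | c2;                     /* or       */
            }
          return cy;
        }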
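And a sketch of the 16-bit decomposition behind item 4.  The chunk sizes
matter because a 32x16-bit partial product is at most 48 bits wide and
therefore fits the 53-bit mantissa of an IEEE double, which is what makes an
FP-pipeline addmul_1 of this shape feasible.  The function below is a
hypothetical integer-only illustration of the arithmetic, not GMP code:

        #include <stdint.h>

        typedef unsigned __int128 uint128;

        /* Recompute u*v from four 16-bit chunks of the invariant limb v and
           two 32-bit halves of u.  Every partial product u_half * v_chunk
           is at most 32+16 = 48 bits wide.  */
        static uint128
        mul_via_chunks (uint64_t u, uint64_t v)
        {
          uint64_t u0 = u & 0xffffffffu, u1 = u >> 32;
          uint128 acc = 0;
          for (int k = 0; k < 4; k++)
            {
              uint64_t vk = (v >> (16 * k)) & 0xffff;
              acc += (uint128) (u0 * vk) << (16 * k);        /* <= 48 bits */
              acc += (uint128) (u1 * vk) << (16 * k + 32);   /* <= 48 bits */
            }
          return acc;   /* equals (uint128) u * v */
        }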
EV6

Here we have a truly parallel pipeline, capable of issuing up to 4 integer
instructions per cycle.  In actual practice, it is never possible to sustain
more than 3.5 integer insns/cycle due to rename register shortage.  One
integer multiply instruction can issue each cycle.  To get optimal speed, we
need to pretend we are vectorizing the code, i.e., minimize the depth of
recurrences.

There are two kinds of dependencies to watch out for: 1) address arithmetic
dependencies, and 2) carry propagation dependencies.

We can avoid serializing due to address arithmetic by unrolling loops, so that
addresses don't depend heavily on an index variable.  Avoiding serialization
because of carry propagation is trickier; the ultimate performance of the code
will be determined by the number of latency cycles it takes from accepting
carry-in at a vector point until we can generate carry-out.

Most integer instructions can execute in either the L0, U0, L1, or U1
pipelines.  Shifts only execute in U0 and U1, and multiplies only in U1.

CMOV instructions split into two internal instructions, CMOV1 and CMOV2, and
this split complicates the mapping process (see pg 2-26 in cmpwrgd.pdf),
suggesting that a CMOV should always be placed as the last instruction of an
aligned 4-instruction block, or perhaps simply avoided.

Perhaps the most important issue is the latency between the L0/U0 and L1/U1
clusters; a result obtained in either cluster has an extra cycle of latency
for consumers in the opposite cluster.  Because of the dynamic nature of the
implementation, it is hard to predict where an instruction will execute.
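To make the address-arithmetic point concrete, here is a hedged C-level shape
of the unrolling described above (the real loops are of course hand-scheduled
assembly; the function name and the 4-way factor are illustrative):

        #include <stddef.h>
        #include <stdint.h>

        typedef uint64_t mp_limb_t;

        /* 4-way unrolled skeleton: the four limb addresses in the body are
           formed from one base pointer plus constant displacements, so they
           do not chain through an incremented index.  The only recurrences
           are the single i += 4 update (and, in real mpn loops, the carry).
           Assumes n % 4 == 0 for the sake of illustration.  */
        void
        unrolled_copy (mp_limb_t *rp, const mp_limb_t *up, size_t n)
        {
          for (size_t i = 0; i < n; i += 4)
            {
              rp[i + 0] = up[i + 0];
              rp[i + 1] = up[i + 1];
              rp[i + 2] = up[i + 2];
              rp[i + 3] = up[i + 3];
            }
        }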