===================================================================
RCS file: /home/cvs/OpenXM_contrib/gmp/mpn/alpha/Attic/README,v
retrieving revision 1.1.1.1
retrieving revision 1.1.1.2
diff -u -p -r1.1.1.1 -r1.1.1.2
--- OpenXM_contrib/gmp/mpn/alpha/Attic/README	2000/01/10 15:35:22	1.1.1.1
+++ OpenXM_contrib/gmp/mpn/alpha/Attic/README	2000/09/09 14:12:21	1.1.1.2
@@ -1,53 +1,224 @@
 This directory contains mpn functions optimized for DEC Alpha processors.
 
+ALPHA ASSEMBLY RULES AND REGULATIONS
+
+The `.prologue N' pseudo op marks the end of the instructions that need
+special handling during stack unwinding.  It also says whether $27 is
+really needed for computing the gp.  The `.mask M' pseudo op says which
+registers are saved on the stack, and at what offset in the frame.
+
+Cray code is very very different...
+
+
 RELEVANT OPTIMIZATION ISSUES
 
 EV4
 
 1. This chip has very limited store bandwidth.  The on-chip L1 cache is
-write-through, and a cache line is transfered from the store buffer to the
-off-chip L2 in as much 15 cycles on most systems.  This delay hurts
-mpn_add_n, mpn_sub_n, mpn_lshift, and mpn_rshift.
+   write-through, and a cache line is transferred from the store buffer to
+   the off-chip L2 in as much as 15 cycles on most systems.  This delay
+   hurts mpn_add_n, mpn_sub_n, mpn_lshift, and mpn_rshift.
 
 2. Pairing is possible between memory instructions and integer arithmetic
-instructions.
+   instructions.
 
-3. mulq and umulh is documented to have a latency of 23 cycles, but 2 of
-these cycles are pipelined.  Thus, multiply instructions can be issued at a
-rate of one each 21nd cycle.
+3. mulq and umulh are documented to have a latency of 23 cycles, but 2 of
+   these cycles are pipelined.  Thus, multiply instructions can be issued
+   at a rate of one every 21 cycles.
 
 EV5
 
 1. The memory bandwidth of this chip seems excellent, both for loads and
-stores.  Even when the working set is larger than the on-chip L1 and L2
-caches, the perfromance remain almost unaffected.
+   stores.  Even when the working set is larger than the on-chip L1 and L2
+   caches, the performance remains almost unaffected.
 
-2. mulq has a measured latency of 13 cycles and an issue rate of 1 each 8th
-cycle.  umulh has a measured latency of 15 cycles and an issue rate of 1
-each 10th cycle.  But the exact timing is somewhat confusing.
+2. mulq has a latency of 12 cycles and an issue rate of one every 8 cycles.
+   umulh has a measured latency of 14 cycles and an issue rate of one every
+   10 cycles.  But the exact timing is somewhat confusing.
 
 3. mpn_add_n.  With 4-fold unrolling, we need 37 instructions, whereof 12
    are memory operations.  This will take at least
-	ceil(37/2) [dual issue] + 1 [taken branch] = 20 cycles
+	ceil(37/2) [dual issue] + 1 [taken branch] = 19 cycles
    We have 12 memory cycles, plus 4 after-store conflict cycles, or 16 data
-   cache cycles, which should be completely hidden in the 20 issue cycles.
+   cache cycles, which should be completely hidden in the 19 issue cycles.
   The computation is inherently serial, with these dependencies:
+
+	      ldq  ldq
+	        \  /\
+	 (or)   addq |
+	   |\  /   \ |
+	   | addq  cmpult
+	    \  |      |
+	     cmpult   |
+	        \    /
+	          or
+
+   I.e., 3 operations are needed between carry-in and carry-out, making 12
+   cycles the absolute minimum for the 4 limbs.  We could replace the `or'
+   with a cmoveq/cmovne, which could issue one cycle earlier than the `or',
+   but that might waste a cycle on EV4.  The total depth remains unaffected,
+   since cmov has a latency of 2 cycles.
+
 	      addq
 	     /    \
 	addq  cmpult
-	  |      |
-	cmpult   |
-	   \    /
-	    or
-   I.e., there is a 4 cycle path for each limb, making 16 cycles the absolute
-   minimum.  We could replace the `or' with a cmoveq/cmovne, which would save
-   a cycle on EV5, but that might waste a cycle on EV4.  Also, cmov takes 2
-   cycles.
-	      addq
-	     /    \
-	addq  cmpult
 	  |       \
 	cmpult -> cmovne
-
-STATUS
+Montgomery has a slightly different way of computing carry that requires one
+less instruction, but has depth 4 (instead of the current 3).  Since the
+code is currently instruction issue bound, Montgomery's idea should save us
+1/2 cycle per limb, or bring us down to a total of 17 cycles or 4.25
+cycles/limb.  Unfortunately, this method will not be good for the EV6.
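In plain C, the per-limb carry chain just described looks roughly as follows.
This is only an illustrative sketch, not the code GMP generates; the function
name add_limb and the uint64_t limb typedef are invented for the example, and
a compiler is expected to map the two comparisons onto cmpult and the final
merge onto or (or a cmov):

    #include <stdint.h>

    typedef uint64_t mp_limb_t;         /* stand-in for GMP's limb type */

    /* One limb of add-with-carry: two addq, two cmpult, one or.
       cy_in and the return value are 0 or 1.  */
    static mp_limb_t
    add_limb (mp_limb_t *rp, mp_limb_t a, mp_limb_t b, mp_limb_t cy_in)
    {
      mp_limb_t s1 = a + b;             /* addq                           */
      mp_limb_t c1 = s1 < a;            /* cmpult: carry out of a + b     */
      mp_limb_t s2 = s1 + cy_in;        /* addq: add the incoming carry   */
      mp_limb_t c2 = s2 < s1;           /* cmpult: carry from adding cy   */
      *rp = s2;
      return c1 | c2;                   /* or: merged carry-out           */
    }

Only the second addq, the two cmpult results, and the or sit on the
carry-in-to-carry-out path, which is the 3-operation depth discussed above;
the rest of the work belongs to the current limb only.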
+
+EV6
+
+Here we have a really parallel pipeline, capable of issuing up to 4 integer
+instructions per cycle.  One integer multiply instruction can issue each
+cycle.  To get optimal speed, we need to pretend we are vectorizing the
+code, i.e., minimize the iterative dependencies.
+
+There are two dependencies to watch out for:  1) address arithmetic
+dependencies, and 2) carry propagation dependencies.
+
+We can avoid serializing due to address arithmetic by unrolling the loop,
+so that addresses don't depend heavily on an index variable.  Avoiding
+serializing because of carry propagation is trickier; the ultimate
+performance of the code will be determined by the number of latency cycles
+it takes from accepting carry-in at a vector point until we can generate
+carry-out.
+
+Most integer instructions can execute in either the L0, U0, L1, or U1
+pipelines.  Shifts only execute in U0 and U1, and multiplies only in U1.
+
+CMOV instructions split into two internal instructions, CMOV1 and CMOV2,
+but they execute efficiently.  However, CMOV splits the mapping process
+(see pg 2-26 in cmpwrgd.pdf), suggesting that CMOV should always be placed
+as the last instruction of an aligned 4-instruction block (?).
+
+Perhaps the most important issue is the latency between the L0/U0 and L1/U1
+clusters; a result obtained on either cluster has an extra cycle of latency
+for consumers in the opposite cluster.  Because of the dynamic nature of the
+implementation, it is hard to predict where an instruction will execute.
+
+The shift loops need (per limb):
+	1 load    (Lx pipes)
+	1 store   (Lx pipes)
+	2 shift   (Ux pipes)
+	1 iaddlog (Lx pipes, Ux pipes)
+Obviously, since the pipes are so evenly loaded, we should get 4 insn/cycle,
+or 1.25 cycles/limb.
+
+For mpn_add_n, we currently have
+	2 load    (Lx pipes)
+	1 store   (Lx pipes)
+	5 iaddlog (Lx pipes, Ux pipes)
+
+Again, we have a perfect balance and will be limited by carry propagation
+delays, currently three cycles.  The superoptimizer indicates that there
+might be sequences that--using a final cmov--have a carry propagation delay
+of just two.  Montgomery's subtraction sequence could perhaps be used, by
+complementing some operands.  All in all, we should get down to 2 cycles
+without much trouble.
+
+For mpn_mul_1, we could do, just like for mpn_add_n:
+	not	newlo,notnewlo
+	addq	cylimb,newlo,newlo  ||  cmpult	cylimb,notnewlo,cyout
+	addq	cyout,newhi,cylimb
+and get 2-cycle carry propagation.  The instructions needed will be
+	1 ld      (Lx pipes)
+	1 st      (Lx pipes)
+	2 mul     (U1 pipe)
+	4 iaddlog (Lx pipes, Ux pipes)
+issue1:	addq	not	mul	ld
+issue2:	cmpult	addq	mul	st
+Conclusion: no cluster delays and 2-cycle carry delays will give us
+2 cycles/limb!
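A C rendering of the mpn_mul_1 carry idea above may make the complement trick
clearer.  This is only a sketch with invented names (mul_1_step and the
uint64_t limb typedef); hi:lo stands for the umulh/mulq halves of the
limb-by-multiplier product, and the point is that the cmpult against the
complemented low half is independent of the addq that forms the result limb:

    #include <stdint.h>

    typedef uint64_t mp_limb_t;         /* stand-in for GMP's limb type */

    /* One limb of mul_1: result limb and next carry limb, given the
       product halves hi:lo and the incoming carry limb cy.  */
    static mp_limb_t
    mul_1_step (mp_limb_t *rp, mp_limb_t hi, mp_limb_t lo, mp_limb_t cy)
    {
      mp_limb_t notlo = ~lo;            /* not                                */
      mp_limb_t newlo = cy + lo;        /* addq (pairs with the cmpult below) */
      mp_limb_t cyout = notlo < cy;     /* cmpult: carry out of cy + lo       */
      *rp = newlo;                      /* st                                 */
      return hi + cyout;                /* addq: next carry limb              */
    }

Since hi is at most 2^64 - 2, adding the 0/1 carry bit to it cannot overflow,
so the single trailing addq is enough.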
+
+Last, we have mpn_addmul_1.  Almost certainly, we will get down to 3
+cycles/limb, which would be absolutely awesome.
+
+Old, perhaps obsolete addmul_1 dependency diagram (needs a 175 column wide
+screen):
+
   [175-column ASCII diagram: for each limb ("point") it charts, per issue
    cycle (-1 through 28), the ldq -> mulq/umulh chains and the
    addq/cmpult/stq carry chain, with left-hand columns for live values,
    issued instructions, and instruction types (4calm, 4casm, 3aal).]
+
+As many as 6 consecutive points will be under execution simultaneously, or
+if we schedule loads even further away, maybe 7 or 8.  But the number of
+live quantities is reasonable, and can easily be satisfied.
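For reference, the per-point work that the diagram tracks corresponds to the
following plain-C version of mpn_addmul_1.  This is an illustrative sketch
only, not GMP's code; the uint64_t limb typedef, the name addmul_1_ref, and
the use of unsigned __int128 in place of mulq/umulh are assumptions of the
example:

    #include <stddef.h>
    #include <stdint.h>

    typedef uint64_t mp_limb_t;         /* stand-in for GMP's limb type */

    /* rp[0..n-1] += up[0..n-1] * v; returns the final carry limb.
       Each iteration is one "point" in the diagram: ldq, mulq/umulh,
       the addq/cmpult carry chain, and stq.  */
    static mp_limb_t
    addmul_1_ref (mp_limb_t *rp, const mp_limb_t *up, size_t n, mp_limb_t v)
    {
      mp_limb_t cy = 0;
      for (size_t i = 0; i < n; i++)
        {
          unsigned __int128 prod = (unsigned __int128) up[i] * v;  /* mulq+umulh */
          mp_limb_t lo = (mp_limb_t) prod;
          mp_limb_t hi = (mp_limb_t) (prod >> 64);

          mp_limb_t t  = lo + cy;       /* addq   */
          mp_limb_t c1 = t < lo;        /* cmpult */
          mp_limb_t r  = rp[i] + t;     /* addq   */
          mp_limb_t c2 = r < t;         /* cmpult */

          rp[i] = r;                    /* stq    */
          cy = hi + c1 + c2;            /* carry limb for the next point */
        }
      return cy;
    }

With six or more such points in flight, the loads and multiplies of later
points overlap the carry chains of earlier ones, which is exactly the overlap
the diagram was drawn to show.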