This directory contains mpn functions optimized for DEC Alpha processors.


ALPHA ASSEMBLY RULES AND REGULATIONS

The `.prologue N' pseudo op marks the end of instructions that need
special handling by unwinding.  It also says whether $27 is really
needed for computing the gp.  The `.mask M' pseudo op says which
registers are saved on the stack, and at what offset in the frame.

Cray code is very, very different...


RELEVANT OPTIMIZATION ISSUES

EV4

1. This chip has very limited store bandwidth.  The on-chip L1 cache is
write-through, and a cache line is transferred from the store buffer to
the off-chip L2 in as much as 15 cycles on most systems.  This delay hurts
mpn_add_n, mpn_sub_n, mpn_lshift, and mpn_rshift.

2. Pairing is possible between memory instructions and integer arithmetic
instructions.

3. mulq and umulh are documented to have a latency of 23 cycles, but 2 of
these cycles are pipelined.  Thus, multiply instructions can be issued at
a rate of one each 21st cycle.

EV5

1. The memory bandwidth of this chip seems excellent, both for loads and
stores.  Even when the working set is larger than the on-chip L1 and L2
caches, the performance remains almost unaffected.

2. mulq has a latency of 12 cycles and an issue rate of 1 each 8th cycle.
umulh has a measured latency of 14 cycles and an issue rate of 1 each
10th cycle.  But the exact timing is somewhat confusing.

3. mpn_add_n.  With 4-fold unrolling, we need 37 instructions, whereof 12
are memory operations.  This will take at least
	ceil(37/2) [dual issue] + 1 [taken branch] = 19 cycles
We have 12 memory cycles, plus 4 after-store conflict cycles, or 16 data
cache cycles, which should be completely hidden in the 19 issue cycles.
The computation is inherently serial, with these dependencies:

	      ldq  ldq
	       \  /\
	  (or) addq |
	   |\   /  \|
	   | addq  cmpult
	    \  |      |
	     cmpult   |
	        \    /
	          or

I.e., 3 operations are needed between carry-in and carry-out, making 12
cycles the absolute minimum for the 4 limbs.  We could replace the `or'
with a cmoveq/cmovne, which could issue one cycle earlier than the `or',
but that might waste a cycle on EV4.  The total depth remains unaffected,
since cmov has a latency of 2 cycles.

	   addq
	  /    \
	 addq  cmpult
	   |      \
	  cmpult -> cmovne

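In C terms, the per-limb carry chain above can be modeled as follows.  This is
a reference sketch, not GMP's actual code, and the helper name add_n_ref is
made up; each limb costs two addq, two cmpult-style compares, and an `or' (or
cmov) to merge the two possible carry sources:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Reference model of mpn_add_n's per-limb carry chain (hypothetical name).
   Returns the carry out of the most significant limb. */
static uint64_t add_n_ref(uint64_t *rp, const uint64_t *up,
                          const uint64_t *vp, size_t n)
{
  uint64_t cy = 0;
  for (size_t i = 0; i < n; i++)
    {
      uint64_t s = up[i] + vp[i];    /* addq                      */
      uint64_t c1 = s < up[i];       /* cmpult: carry from u+v    */
      s += cy;                       /* addq: add carry-in        */
      uint64_t c2 = s < cy;          /* cmpult: carry from +cy    */
      cy = c1 | c2;                  /* or (could be a cmov)      */
      rp[i] = s;
    }
  return cy;
}
```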
Montgomery has a slightly different way of computing carry that requires one
less instruction, but has depth 4 (instead of the current 3).  Since the
code is currently instruction issue bound, Montgomery's idea should save us
1/2 cycle per limb, or bring us down to a total of 17 cycles or 4.25
cycles/limb.  Unfortunately, this method will not be good for the EV6.


EV6

Here we have a really parallel pipeline, capable of issuing up to 4 integer
instructions per cycle.  One integer multiply instruction can issue each
cycle.  To get optimal speed, we need to pretend we are vectorizing the
code, i.e., minimize the iterative dependencies.

There are two dependencies to watch out for: 1) address arithmetic
dependencies, and 2) carry propagation dependencies.

We can avoid serializing due to address arithmetic by unrolling the loop, so
that addresses don't depend heavily on an index variable.  Avoiding
serializing because of carry propagation is trickier; the ultimate
performance of the code will be determined by the number of latency cycles
it takes from accepting carry-in at a vector point until we can generate
carry-out.

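As a C-level illustration of the address-arithmetic point (hypothetical
helper, not GMP code): with 4-fold unrolling, the four accesses per iteration
use constant displacements, so the only address dependency carried between
iterations is a single index update.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Illustrative 4-fold unrolled limb copy.  The four loads/stores are at
   constant offsets from the base index, so address arithmetic within the
   body does not serialize on the loop counter. */
static void copy4_ref(uint64_t *rp, const uint64_t *up, size_t n)
{
  size_t i = 0;
  for (; i + 4 <= n; i += 4)     /* one index update per 4 limbs */
    {
      rp[i + 0] = up[i + 0];
      rp[i + 1] = up[i + 1];
      rp[i + 2] = up[i + 2];
      rp[i + 3] = up[i + 3];
    }
  for (; i < n; i++)             /* tail limbs */
    rp[i] = up[i];
}
```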
Most integer instructions can execute in either the L0, U0, L1, or U1
pipelines.  Shifts only execute in U0 and U1, and multiply only in U1.

CMOV instructions split into two internal instructions, CMOV1 and CMOV2,
but they execute efficiently.  However, CMOV splits the mapping process
(see pg 2-26 in cmpwrgd.pdf), suggesting that CMOV should always be placed
as the last instruction of an aligned 4-instruction block (?).

Perhaps the most important issue is the latency between the L0/U0 and L1/U1
clusters; a result obtained on either cluster has an extra cycle of latency
for consumers in the opposite cluster.  Because of the dynamic nature of the
implementation, it is hard to predict where an instruction will execute.

The shift loops need (per limb):
	1 load    (Lx pipes)
	1 store   (Lx pipes)
	2 shift   (Ux pipes)
	1 iaddlog (Lx pipes, Ux pipes)
Since the pipes are evenly loaded, we should get 4 insn/cycle, or
1.25 cycles/limb.

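That per-limb recipe corresponds to a plain-C shift loop like this sketch
(the name lshift_ref is made up; GMP's real routine differs).  Each limb
needs one load, two shifts, one `or' to combine, and one store:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Reference left shift of an n-limb number by cnt bits, 0 < cnt < 64.
   Returns the bits shifted out of the top limb. */
static uint64_t lshift_ref(uint64_t *rp, const uint64_t *up,
                           size_t n, unsigned cnt)
{
  uint64_t retval = up[n - 1] >> (64 - cnt);
  for (size_t i = n - 1; i > 0; i--)
    /* 1 load of up[i-1] per step; 2 shifts; 1 or; 1 store */
    rp[i] = (up[i] << cnt) | (up[i - 1] >> (64 - cnt));
  rp[0] = up[0] << cnt;
  return retval;
}
```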
For mpn_add_n, we currently have:
	2 load    (Lx pipes)
	1 store   (Lx pipes)
	5 iaddlog (Lx pipes, Ux pipes)

Again, we have a perfect balance and will be limited by carry propagation
delays, currently three cycles.  The superoptimizer indicates that there
might be sequences that--using a final cmov--have a carry propagation delay
of just two.  Montgomery's subtraction sequence could perhaps be used, by
complementing some operands.  All in all, we should get down to 2 cycles
without much trouble.

For mpn_mul_1, we could do, just like for mpn_add_n:
	not	newlo,notnewlo
	addq	cylimb,newlo,newlo  ||  cmpult	cylimb,notnewlo,cyout
	addq	cyout,newhi,cylimb
and get 2-cycle carry propagation.  The instructions needed will be
	1 ld      (Lx pipes)
	1 st      (Lx pipes)
	2 mul     (U1 pipe)
	4 iaddlog (Lx pipes, Ux pipes)
	issue1:	addq	not	mul	ld
	issue2:	cmpult	addq	mul	st
Conclusion: no cluster delays and 2-cycle carry delays will give us
2 cycles/limb!

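The intent of that carry step can be modeled in C as follows (a sketch with
made-up names mirroring the registers above).  The useful identity is that
cylimb + newlo wraps exactly when cylimb > ~newlo, so the compare needs only
the complement, not the sum, and can run in parallel with the add:

```c
#include <assert.h>
#include <stdint.h>

/* One carry-propagation step of mpn_mul_1: folds carry limb `cylimb' into
   the low product word *newlo_io and returns the next carry limb. */
static uint64_t mul1_carry_step(uint64_t *newlo_io, uint64_t cylimb,
                                uint64_t newhi)
{
  uint64_t newlo = *newlo_io;
  uint64_t notnewlo = ~newlo;          /* not   newlo,notnewlo          */
  uint64_t cyout = cylimb > notnewlo;  /* compare, independent of the + */
  newlo = cylimb + newlo;              /* addq  cylimb,newlo,newlo      */
  *newlo_io = newlo;
  return newhi + cyout;                /* addq  cyout,newhi,cylimb      */
}
```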
Last, we have mpn_addmul_1.  Almost certainly, we will get down to 3
cycles/limb, which would be absolutely awesome.

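For reference, the contract of mpn_addmul_1 (rp[] += up[] * v, returning the
out carry) can be written in plain C.  This sketch uses the GCC/Clang
unsigned __int128 extension in place of the mulq/umulh pair, and is only a
model of the function's semantics, not the optimized code discussed here:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Plain-C reference for mpn_addmul_1 (hypothetical name): adds up[]*v into
   rp[] and returns the most significant carry limb. */
static uint64_t addmul_1_ref(uint64_t *rp, const uint64_t *up,
                             size_t n, uint64_t v)
{
  uint64_t cy = 0;
  for (size_t i = 0; i < n; i++)
    {
      unsigned __int128 p = (unsigned __int128) up[i] * v; /* mulq+umulh */
      uint64_t lo = (uint64_t) p + cy;
      cy = (uint64_t) (p >> 64) + (lo < (uint64_t) p);     /* carry limb */
      uint64_t s = rp[i] + lo;
      cy += s < lo;                                        /* add into rp */
      rp[i] = s;
    }
  return cy;
}
```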
Old, perhaps obsolete addmul_1 dependency diagram (needs a 175-column-wide
screen):

i
s
s i
u n
e s
d t
r
i u
l n c
i s t
v t i
e r o
u n
v c
a t t
l i y
u o p
e n e
s s s
issue
in
cycle
-1 ldq
/ \
0 | \
| \
1 | |
| |
2 | | ldq
| | / \
3 | mulq | \
| \ | \
4 umulh \ | |
| | | |
5 | | | | ldq
| | | | / \
4calm 6 | | ldq | mulq | \
| | / | \ | \
4casm 7 | | / umulh \ | |
6 | || | | | |
3aal 8 | || | | | | ldq
7 | || | | | | / \
4calm 9 | || | | ldq | mulq | \
9 | || | | / | \ | \
4casm 10 | || | | / umulh \ | |
9 | || | || | | | |
3aal 11 | addq | || | | | | ldq
9 | // \ | || | | | | / \
4calm 12 \ cmpult addq<-cy | || | | ldq | mulq | \
13 \ / // \ | || | | / | \ | \
4casm 13 addq cmpult stq | || | | / umulh \ | |
11 \ / | || | || | | | |
3aal 14 addq | addq | || | | | | ldq
10 \ | // \ | || | | | | / \
4calm 15 cy ----> \ cmpult addq<-cy | || | | ldq | mulq | \
13 \ / // \ | || | | / | \ | \
4casm 16 addq cmpult stq | || | | / umulh \ | |
11 \ / | || | || | | | |
3aal 17 addq | addq | || | | | |
10 \ | // \ | || | | | |
4calm 18 cy ----> \ cmpult addq<-cy | || | | ldq | mulq
13 \ / // \ | || | | / | \
4casm 19 addq cmpult stq | || | | / umulh \
11 \ / | || | || | |
3aal 20 addq | addq | || | |
10 \ | // \ | || | |
4calm 21 cy ----> \ cmpult addq<-cy | || | | ldq
\ / // \ | || | | /
22 addq cmpult stq | || | | /
\ / | || | ||
23 addq | addq | ||
\ | // \ | ||
24 cy ----> \ cmpult addq<-cy | ||
\ / // \ | ||
25 addq cmpult stq | ||
\ / | ||
26 addq | addq
\ | // \
27 cy ----> \ cmpult addq<-cy
\ / // \
28 addq cmpult stq
\ /
addq
\
cy ---->

As many as 6 consecutive points will be under execution simultaneously, or if
we schedule loads even further away, maybe 7 or 8.  But the number of live
quantities is reasonable, and can easily be satisfied.