|
|
|
Copyright 1996, 1997, 1999, 2000, 2001, 2002 Free Software Foundation, Inc. |
|
|
|
This file is part of the GNU MP Library. |
|
|
|
The GNU MP Library is free software; you can redistribute it and/or modify it |
|
under the terms of the GNU Lesser General Public License as published by the |
|
Free Software Foundation; either version 2.1 of the License, or (at your |
|
option) any later version. |
|
|
|
The GNU MP Library is distributed in the hope that it will be useful, but |
|
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or |
|
FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License |
|
for more details. |
|
|
|
You should have received a copy of the GNU Lesser General Public License along |
|
with the GNU MP Library; see the file COPYING.LIB. If not, write to the Free |
|
Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, |
|
USA. |
|
|
|
|
|
|
|
|
|
|
This directory contains mpn functions optimized for DEC Alpha processors.
|
|
ALPHA ASSEMBLY RULES AND REGULATIONS
|
|
The `.prologue N' pseudo op marks the end of the instructions that need
special handling for stack unwinding.  It also says whether $27 is really
needed for computing the gp.  The `.mask M' pseudo op says which registers
are saved on the stack, and at what offset in the frame.
|
|
Cray T3 code is very very different...
|
|
|
|
RELEVANT OPTIMIZATION ISSUES
|
|
EV4
|
|
1. This chip has very limited store bandwidth.  The on-chip L1 cache is
   write-through, and a cache line is transferred from the store buffer to
   the off-chip L2 in as much as 15 cycles on most systems.  This delay
   hurts mpn_add_n, mpn_sub_n, mpn_lshift, and mpn_rshift.
|
|
2. Pairing is possible between memory instructions and integer arithmetic
   instructions.
|
|
3. mulq and umulh are documented to have a latency of 23 cycles, but 2 of
   these cycles are pipelined.  Thus, multiply instructions can be issued at
   a rate of one each 21st cycle.
|
|
EV5
|
|
1. The memory bandwidth of this chip is good, both for loads and stores.
   The L1 cache can handle two loads or one store per cycle, but two cycles
   after a store, no ld can issue.
|
|
2. mulq has a latency of 12 cycles and an issue rate of 1 each 8th cycle.
   umulh has a latency of 14 cycles and an issue rate of 1 each 10th cycle.
   (Note that published documentation gets these numbers slightly wrong.)
|
|
3. mpn_add_n.  With 4-fold unrolling, we need 37 instructions, whereof 12
   are memory operations.  This will take at least

	or
|
|
   I.e., 3 operations are needed between carry-in and carry-out, making 12
   cycles the absolute minimum for the 4 limbs.  We could replace the `or'
   with a cmoveq/cmovne, which could issue one cycle earlier than the `or',
   but that might waste a cycle on EV4.  The total depth remains unaffected,
   since cmov has a latency of 2 cycles.
|
|
	  addq
	 /    \
	|      \
	cmpult -> cmovne
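
The carry recurrence can be modeled in C (an illustrative sketch, not GMP's
actual code; the helper name `add_limb' is made up here):

```c
#include <stdint.h>

/* One limb of a multi-limb add, following the addq/cmpult chain in the
   diagram above.  cy_in and the returned carry are 0 or 1.  Illustrative
   only -- the real loops are written in Alpha assembly. */
static uint64_t add_limb(uint64_t a, uint64_t b, uint64_t cy_in,
                         uint64_t *cy_out)
{
    uint64_t s1 = a + b;        /* addq   */
    uint64_t c1 = s1 < a;       /* cmpult: carry from a+b        */
    uint64_t s2 = s1 + cy_in;   /* addq                          */
    uint64_t c2 = s2 < s1;      /* cmpult: carry from adding cy  */
    *cy_out = c1 | c2;          /* or (or a cmovne, as discussed) */
    return s2;
}
```

From carry-in to carry-out the chain is addq -> cmpult -> or, i.e. the 3
operations the text counts.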
|
|
   Montgomery has a slightly different way of computing carry that requires
   one less instruction, but has depth 4 (instead of the current 3).  Since
   the code is currently instruction issue bound, Montgomery's idea should
   save us 1/2 cycle per limb, or bring us down to a total of 17 cycles or
   4.25 cycles/limb.  Unfortunately, this method will not be good for the
   EV6.
|
|
|
4. addmul_1 and friends: We previously had a scheme for splitting the
   single-limb operand in 21-bit chunks and the multi-limb operand in 32-bit
   chunks, and then using FP operations for every other multiply, and
   integer operations for the rest.

   But it seems much better to split the single-limb operand in 16-bit
   chunks, since we save many integer shifts and adds that way.  See
   powerpc64/README for some more details.
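
The point of those chunk sizes is that a 32-bit x 16-bit partial product is
at most 48 bits, and hence exact in an IEEE double (53-bit mantissa).  A
hedged C check of that arithmetic (the function name is made up; real code
would use Alpha FP multiplies and converts rather than C doubles):

```c
#include <stdint.h>

/* Compute the low 64 bits of a*b from 32-bit chunks of a and 16-bit chunks
   of b, doing each partial multiply in a double.  Each ai*bj is below 2^48,
   so the double holds it exactly.  Illustration only. */
static uint64_t mul_low_via_chunks(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 2; i++) {              /* 32-bit chunks of a */
        uint64_t ai = (a >> (32 * i)) & 0xffffffffu;
        for (int j = 0; j < 4; j++) {          /* 16-bit chunks of b */
            uint64_t bj = (b >> (16 * j)) & 0xffffu;
            int sh = 32 * i + 16 * j;
            if (sh >= 64)
                continue;                      /* beyond the low limb */
            double p = (double)ai * (double)bj;  /* exact: < 2^48 */
            r += (uint64_t)p << sh;            /* wraps mod 2^64 */
        }
    }
    return r;
}
```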
|
|
EV6
|
|
Here we have a really parallel pipeline, capable of issuing up to 4 integer
instructions per cycle.  In actual practice, it is never possible to sustain
more than 3.5 integer insns/cycle due to rename register shortage.  One
integer multiply instruction can issue each cycle.  To get optimal speed, we
need to pretend we are vectorizing the code, i.e., minimize the depth of
recurrences.
|
|
There are two dependencies to watch out for: 1) address arithmetic
dependencies, and 2) carry propagation dependencies.
|
|
We can avoid serializing due to address arithmetic by unrolling loops, so
that addresses don't depend heavily on an index variable.  Avoiding
serializing because of carry propagation is trickier; the ultimate
performance of the code will be determined by the number of latency cycles
it takes from accepting carry-in at a vector point until we can generate
carry-out.
|
|
Most integer instructions can execute in either the L0, U0, L1, or U1
pipelines.  Shifts only execute in U0 and U1, and multiply only in U1.
|
|
CMOV instructions split into two internal instructions, CMOV1 and CMOV2.
CMOV splits the mapping process (see pg 2-26 in cmpwrgd.pdf), suggesting
that a CMOV should always be placed as the last instruction of an aligned
4-instruction block, or perhaps simply avoided.
|
|
Perhaps the most important issue is the latency between the L0/U0 and L1/U1
clusters; a result obtained on either cluster has an extra cycle of latency
for consumers in the opposite cluster.  Because of the dynamic nature of the
implementation, it is hard to predict where an instruction will execute.
|
|
The shift loops need (per limb):

	1 load     (Lx pipes)
	1 store    (Lx pipes)
	2 shift    (Ux pipes)
	1 iaddlog  (Lx pipes, Ux pipes)

Obviously, since the pipes are evenly loaded, we should get 4 insn/cycle, or
1.25 cycles/limb.
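
The per-limb operation count corresponds to this plain C model of a left
shift (an illustrative sketch, not the actual assembly; assumes n >= 1 limbs
of 64 bits and 0 < cnt < 64):

```c
#include <stdint.h>
#include <stddef.h>

/* Shift {up,n} left by cnt bits into {rp,n}; return the bits shifted out
   of the top limb.  Per limb: one load, one store, two shifts and one
   `or' -- exactly the budget counted above. */
static uint64_t lshift(uint64_t *rp, const uint64_t *up, size_t n,
                       unsigned cnt)
{
    unsigned tnc = 64 - cnt;
    uint64_t retval = up[n - 1] >> tnc;
    for (size_t i = n - 1; i > 0; i--)
        rp[i] = (up[i] << cnt) | (up[i - 1] >> tnc);  /* 2 shifts + or */
    rp[0] = up[0] << cnt;
    return retval;
}
```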
|
|
|
For mpn_add_n, we currently have:

	2 load     (Lx pipes)
	1 store    (Lx pipes)
	5 iaddlog  (Lx pipes, Ux pipes)
|
|
|
Again, we have a perfect balance and will be limited by carry propagation
delays, currently three cycles.  The superoptimizer indicates that there
might be sequences that--using a final cmov--have a carry propagation delay
of just two.  Montgomery's subtraction sequence could perhaps be used, by
complementing some operands.  All in all, we should get down to 2 cycles
without many problems.
|
|
|
For mpn_mul_1, we could do, just like for mpn_add_n:

	not	newlo,notnewlo
	addq	cylimb,newlo,newlo  ||  cmpult	cylimb,notnewlo,cyout
	addq	cyout,newhi,cylimb

and get 2-cycle carry propagation.  The instructions needed will be:

	1 ld	(Lx pipes)
	1 st	(Lx pipes)
	2 mul	(U1 pipe)
	4 iaddlog (Lx pipes, Ux pipes)

	issue1:	addq	not	mul	ld
	issue2:	cmpult	addq	mul	st

Conclusion: no cluster delays and 2-cycle carry delays will give us 2
cycles/limb!
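
The `not'/`cmpult' pairing works because the carry out of x + y equals
(~y < x), a compare that does not depend on the sum and so can issue in the
same cycle as the addq.  A C sketch of just that identity (names are made
up for illustration):

```c
#include <stdint.h>

/* Return the carry out of x + y, computed from ~y rather than from the
   sum, and store the sum.  The compare and the add are independent, which
   is what lets them dual-issue on Alpha. */
static uint64_t add_carry_parallel(uint64_t x, uint64_t y, uint64_t *sum)
{
    uint64_t noty = ~y;   /* not    */
    *sum = x + y;         /* addq   */
    return noty < x;      /* cmpult -- does not read *sum */
}
```

The identity holds since x + y >= 2^64 exactly when x > 2^64 - 1 - y = ~y.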
|
|
|
Last, we have mpn_addmul_1.  Almost certainly, we will get down to 3
cycles/limb, which would be absolutely awesome.
|
|
|
Old, perhaps obsolete addmul_1 dependency diagram (needs a 175-column-wide
screen):
|
|
|
i
s
s i
u n
e s
d t
r
i u
l n c
i s t
v t i
e r o
u n
v c
a t t
l i y
u o p
e n e
s s s
issue
in
cycle
-1 ldq
/ \
0 | \
| \
1 | |
| |
2 | | ldq
| | / \
3 | mulq | \
| \ | \
4 umulh \ | |
| | | |
5 | | | | ldq
| | | | / \
4calm 6 | | ldq | mulq | \
| | / | \ | \
4casm 7 | | / umulh \ | |
6 | || | | | |
3aal 8 | || | | | | ldq
7 | || | | | | / \
4calm 9 | || | | ldq | mulq | \
9 | || | | / | \ | \
4casm 10 | || | | / umulh \ | |
9 | || | || | | | |
3aal 11 | addq | || | | | | ldq
9 | // \ | || | | | | / \
4calm 12 \ cmpult addq<-cy | || | | ldq | mulq | \
13 \ / // \ | || | | / | \ | \
4casm 13 addq cmpult stq | || | | / umulh \ | |
11 \ / | || | || | | | |
3aal 14 addq | addq | || | | | | ldq
10 \ | // \ | || | | | | / \
4calm 15 cy ----> \ cmpult addq<-cy | || | | ldq | mulq | \
13 \ / // \ | || | | / | \ | \
4casm 16 addq cmpult stq | || | | / umulh \ | |
11 \ / | || | || | | | |
3aal 17 addq | addq | || | | | |
10 \ | // \ | || | | | |
4calm 18 cy ----> \ cmpult addq<-cy | || | | ldq | mulq
13 \ / // \ | || | | / | \
4casm 19 addq cmpult stq | || | | / umulh \
11 \ / | || | || | |
3aal 20 addq | addq | || | |
10 \ | // \ | || | |
4calm 21 cy ----> \ cmpult addq<-cy | || | | ldq
\ / // \ | || | | /
22 addq cmpult stq | || | | /
\ / | || | ||
23 addq | addq | ||
\ | // \ | ||
24 cy ----> \ cmpult addq<-cy | ||
\ / // \ | ||
25 addq cmpult stq | ||
\ / | ||
26 addq | addq
\ | // \
27 cy ----> \ cmpult addq<-cy
\ / // \
28 addq cmpult stq
\ /
|
addq
\
cy ---->

As many as 6 consecutive points will be under execution simultaneously, or
if we schedule loads even further away, maybe 7 or 8.  But the number of
live quantities is reasonable, and can easily be satisfied.
|