
Diff for /OpenXM_contrib/gmp/mpn/alpha/Attic/README between version 1.1.1.2 and 1.1.1.3

version 1.1.1.2, 2000/09/09 14:12:21 version 1.1.1.3, 2003/08/25 16:06:18
Line 1 
Line 1 
   Copyright 1996, 1997, 1999, 2000, 2001, 2002 Free Software Foundation, Inc.
   
   This file is part of the GNU MP Library.
   
   The GNU MP Library is free software; you can redistribute it and/or modify it
   under the terms of the GNU Lesser General Public License as published by the
   Free Software Foundation; either version 2.1 of the License, or (at your
   option) any later version.
   
   The GNU MP Library is distributed in the hope that it will be useful, but
   WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
   FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public License
   for more details.
   
   You should have received a copy of the GNU Lesser General Public License along
   with the GNU MP Library; see the file COPYING.LIB.  If not, write to the Free
   Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307,
   USA.
   
   
   
   
   
 This directory contains mpn functions optimized for DEC Alpha processors.  This directory contains mpn functions optimized for DEC Alpha processors.
   
 ALPHA ASSEMBLY RULES AND REGULATIONS  ALPHA ASSEMBLY RULES AND REGULATIONS
   
 The `.prologue N' pseudo op marks the end of the instructions that need  The `.prologue N' pseudo op marks the end of the instructions that need special
 special handling by unwinding.  It also says whether $27 is really  handling by unwinding.  It also says whether $27 is really needed for computing
 needed for computing the gp.  The `.mask M' pseudo op says which  the gp.  The `.mask M' pseudo op says which registers are saved on the stack,
 registers are saved on the stack, and at what offset in the frame.  and at what offset in the frame.
   
 Cray code is very very different...  Cray T3 code is very very different...
   
   
 RELEVANT OPTIMIZATION ISSUES  RELEVANT OPTIMIZATION ISSUES
   
 EV4  EV4
   
 1. This chip has very limited store bandwidth.  The on-chip L1 cache is  1. This chip has very limited store bandwidth.  The on-chip L1 cache is write-
    write-through, and a cache line is transferred from the store buffer to     through, and a cache line is transferred from the store buffer to the off-
    the off-chip L2 in as much as 15 cycles on most systems.  This delay hurts     chip L2 in as much as 15 cycles on most systems.  This delay hurts mpn_add_n,
    mpn_add_n, mpn_sub_n, mpn_lshift, and mpn_rshift.     mpn_sub_n, mpn_lshift, and mpn_rshift.
   
 2. Pairing is possible between memory instructions and integer arithmetic  2. Pairing is possible between memory instructions and integer arithmetic
    instructions.     instructions.
   
 3. mulq and umulh are documented to have a latency of 23 cycles, but 2 of  3. mulq and umulh are documented to have a latency of 23 cycles, but 2 of these
    these cycles are pipelined.  Thus, multiply instructions can be issued at     cycles are pipelined.  Thus, multiply instructions can be issued at a rate
    a rate of one each 21st cycle.     of one each 21st cycle.
   
 EV5  EV5
   
 1. The memory bandwidth of this chip seems excellent, both for loads and  1. The memory bandwidth of this chip is good, both for loads and stores.  The
    stores.  Even when the working set is larger than the on-chip L1 and L2     L1 cache can handle two loads or one store per cycle, but two cycles after a
    caches, the performance remains almost unaffected.     store, no ld can issue.
   
 2. mulq has a latency of 12 cycles and an issue rate of 1 each 8th cycle.  2. mulq has a latency of 12 cycles and an issue rate of 1 each 8th cycle.
    umulh has a measured latency of 14 cycles and an issue rate of 1 each     umulh has a latency of 14 cycles and an issue rate of 1 each 10th cycle.
    10th cycle.  But the exact timing is somewhat confusing.     (Note that published documentation gets these numbers slightly wrong.)
   
 3. mpn_add_n.  With 4-fold unrolling, we need 37 instructions, whereof 12  3. mpn_add_n.  With 4-fold unrolling, we need 37 instructions, whereof 12
    are memory operations.  This will take at least     are memory operations.  This will take at least
Line 54  EV5
Line 77  EV5
                   or                    or
   
    I.e., 3 operations are needed between carry-in and carry-out, making 12     I.e., 3 operations are needed between carry-in and carry-out, making 12
    cycles the absolute minimum for the 4 limbs.  We could replace the `or'     cycles the absolute minimum for the 4 limbs.  We could replace the `or' with
    with a cmoveq/cmovne, which could issue one cycle earlier than the `or',     a cmoveq/cmovne, which could issue one cycle earlier than the `or', but that
    but that might waste a cycle on EV4.  The total depth remains unaffected,     might waste a cycle on EV4.  The total depth remains unaffected, since cmov
    since cmov has a latency of 2 cycles.     has a latency of 2 cycles.
   
      addq       addq
      /   \       /   \
Line 65  EV5
Line 88  EV5
      |      \       |      \
    cmpult -> cmovne     cmpult -> cmovne
   
 Montgomery has a slightly different way of computing carry that requires one    Montgomery has a slightly different way of computing carry that requires one
 less instruction, but has depth 4 (instead of the current 3).  Since the    less instruction, but has depth 4 (instead of the current 3).  Since the code
 code is currently instruction issue bound, Montgomery's idea should save us    is currently instruction issue bound, Montgomery's idea should save us 1/2
 1/2 cycle per limb, or bring us down to a total of 17 cycles or 4.25    cycle per limb, or bring us down to a total of 17 cycles or 4.25 cycles/limb.
 cycles/limb.  Unfortunately, this method will not be good for the EV6.    Unfortunately, this method will not be good for the EV6.
   
   4. addmul_1 and friends: We previously had a scheme for splitting the single-
      limb operand into 21-bit chunks and the multi-limb operand into 32-bit chunks,
      and then using FP operations for every 2nd multiply and integer operations
      for the remaining multiplies.
   
      But it seems much better to split the single-limb operand in 16-bit chunks,
      since we save many integer shifts and adds that way.  See powerpc64/README
      for some more details.
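
The carry handling described under item 3 (an addq that accepts carry-in, a
cmpult that detects wraparound, and an `or' that merges the two partial
carries) can be sketched in portable C.  This is only an illustration of the
3-operation carry path, not the assembly in this directory; the names
limb_t and add_n_sketch are made up for the example, and a 64-bit limb is
assumed.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint64_t limb_t;   /* hypothetical name; assumes 64-bit limbs */

/* Add {up,n} and {vp,n} plus carry-in cy, store the result at {rp,n},
   and return the carry-out (0 or 1).  */
limb_t
add_n_sketch (limb_t *rp, const limb_t *up, const limb_t *vp,
              size_t n, limb_t cy)
{
  for (size_t i = 0; i < n; i++)
    {
      limb_t s  = up[i] + vp[i];   /* addq: does not depend on carry-in */
      limb_t c1 = s < vp[i];       /* cmpult: also off the carry path */
      limb_t r  = s + cy;          /* addq: carry-in enters here */
      limb_t c2 = r < cy;          /* cmpult: wraparound from carry-in */
      cy = c1 | c2;                /* or: at most one of c1, c2 is set */
      rp[i] = r;
    }
  return cy;
}
```

Only the last three operations of each iteration depend on the incoming
carry, which is where the figure of 3 operations per limb, or 12 cycles for
4 limbs, comes from.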
   
 EV6  EV6
   
 Here we have a really parallel pipeline, capable of issuing up to 4 integer  Here we have a really parallel pipeline, capable of issuing up to 4 integer
 instructions per cycle.  One integer multiply instruction can issue each  instructions per cycle.  In actual practice, it is never possible to sustain
 cycle.  To get optimal speed, we need to pretend we are vectorizing the code,  more than 3.5 integer insns/cycle due to rename register shortage.  One integer
 i.e., minimize the iterative dependencies.  multiply instruction can issue each cycle.  To get optimal speed, we need to
   pretend we are vectorizing the code, i.e., minimize the depth of recurrences.
   
 There are two dependencies to watch out for.  1) Address arithmetic  There are two dependencies to watch out for.  1) Address arithmetic
 dependencies, and 2) carry propagation dependencies.  dependencies, and 2) carry propagation dependencies.
   
 We can avoid serializing due to address arithmetic by unrolling the loop, so  We can avoid serializing due to address arithmetic by unrolling loops, so that
 that addresses don't depend heavily on an index variable.  Avoiding  addresses don't depend heavily on an index variable.  Avoiding serializing
 serializing because of carry propagation is trickier; the ultimate performance  because of carry propagation is trickier; the ultimate performance of the code
 of the code will be determined by the number of latency cycles it takes from  will be determined by the number of latency cycles it takes from accepting
 accepting carry-in to a vector point until we can generate carry-out.  carry-in to a vector point until we can generate carry-out.
   
 Most integer instructions can execute in either the L0, U0, L1, or U1  Most integer instructions can execute in either the L0, U0, L1, or U1
 pipelines.  Shifts only execute in U0 and U1, and multiply only in U1.  pipelines.  Shifts only execute in U0 and U1, and multiply only in U1.
   
 CMOV instructions split into two internal instructions, CMOV1 and CMOV2, but  CMOV instructions split into two internal instructions, CMOV1 and CMOV2.  CMOV
 they execute efficiently.  But CMOV splits the mapping process (see pg 2-26 in  splits the mapping process (see pg 2-26 in cmpwrgd.pdf), suggesting the CMOV
 cmpwrgd.pdf), suggesting the CMOV should always be placed as the last  should always be placed as the last instruction of an aligned 4 instruction
 instruction of an aligned 4 instruction block (?).  block, or perhaps simply avoided.
   
 Perhaps the most important issue is the latency between the L0/U0 and L1/U1  Perhaps the most important issue is the latency between the L0/U0 and L1/U1
 clusters; a result obtained on either cluster has an extra cycle of latency  clusters; a result obtained on either cluster has an extra cycle of latency for
 for consumers in the opposite cluster.  Because of the dynamic nature of the  consumers in the opposite cluster.  Because of the dynamic nature of the
 implementation, it is hard to predict where an instruction will execute.  implementation, it is hard to predict where an instruction will execute.
   
 The shift loops need (per limb):  
     1 load (Lx pipes)  
     1 store (Lx pipes)  
     2 shift (Ux pipes)  
     1 iaddlog (Lx pipes, Ux pipes)  
 Obviously, since the pipes are very equally loaded, we should get 4 insn/cycle, or 1.25 cycles/limb.  
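
As a rough illustration of the per-limb work counted above, here is a plain C
sketch of a left shift loop.  It is not the EV6 assembly; lshift_sketch is a
made-up name, 64-bit limbs are assumed, and the shift count is assumed to be
in the range 1..63 with n >= 1.

```c
#include <stdint.h>
#include <stddef.h>

/* Shift {up,n} left by cnt bits (1 <= cnt <= 63, n >= 1), store the result
   at {rp,n}, and return the bits shifted out of the most significant limb.
   Hypothetical helper for illustration only.  */
uint64_t
lshift_sketch (uint64_t *rp, const uint64_t *up, size_t n, unsigned cnt)
{
  unsigned tnc = 64 - cnt;
  uint64_t retval = up[n - 1] >> tnc;

  /* Work from the top down so that rp may overlap up with rp >= up.  */
  for (size_t i = n - 1; i > 0; i--)
    {
      uint64_t hi = up[i] << cnt;       /* shift */
      uint64_t lo = up[i - 1] >> tnc;   /* shift */
      rp[i] = hi | lo;                  /* logical op, then store */
    }
  rp[0] = up[0] << cnt;
  return retval;
}
```

In a scheduled loop each source limb would be loaded once and reused for both
shifts, giving the 1 load, 1 store, 2 shifts and 1 logical operation per limb
listed above, i.e. 5 instructions or 1.25 cycles/limb at 4 insn/cycle.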
   
 For mpn_add_n, we currently have  
     2 load (Lx pipes)  
     1 store (Lx pipes)  
     5 iaddlog (Lx pipes, Ux pipes)  
   
 Again, we have a perfect balance and will be limited by carry propagation  
 delays, currently three cycles.  The superoptimizer indicates that there  
 might be sequences that--using a final cmov--have a carry propagation delay  
 of just two.  Montgomery's subtraction sequence could perhaps be used, by  
 complementing some operands.  All in all, we should get down to 2 cycles  
 without much trouble.  
   
 For mpn_mul_1, we could do, just like for mpn_add_n:  
     not         newlo,notnewlo  
     addq        cylimb,newlo,newlo  ||    cmpult        cylimb,notnewlo,cyout  
     addq        cyout,newhi,cylimb  
 and get 2-cycle carry propagation.  The instructions needed will be  
     1 ld (Lx pipes)  
     1 st (Lx pipes)  
     2 mul (U1 pipe)  
     4 iaddlog (Lx pipes, Ux pipes)  
 issue1: addq not mul ld  
 issue2: cmpult addq mul st  
 Conclusion: no cluster delays and 2-cycle carry delays will give us 2 cycles/limb!  
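
The 2-cycle carry idea for mpn_mul_1 can be sketched in C as follows.  The
point is that the carry out of cylimb + newlo is computed from cylimb and the
complement of newlo, so the compare does not have to wait for the sum.  The
struct and function names are invented for this example, 64-bit limbs are
assumed, and the comparison is written so that cyout is 1 exactly when the
addition wraps around.

```c
#include <stdint.h>

/* Result of one mul_1 step: the limb to store and the carry limb passed
   on to the next step.  Hypothetical type for illustration only.  */
typedef struct
{
  uint64_t lo;
  uint64_t cylimb;
} mul1_step_t;

/* newhi:newlo is the 128-bit product for the current limb (umulh/mulq),
   cylimb is the carry limb coming from the previous step.  */
mul1_step_t
mul1_carry_step (uint64_t newhi, uint64_t newlo, uint64_t cylimb)
{
  uint64_t notnewlo = ~newlo;            /* not: off the carry path */
  uint64_t lo = cylimb + newlo;          /* addq: runs alongside the compare */
  uint64_t cyout = cylimb > notnewlo;    /* cmpult: 1 iff cylimb + newlo wraps */
  mul1_step_t r;
  r.lo = lo;
  r.cylimb = cyout + newhi;              /* addq: newhi <= 2^64-2, no overflow */
  return r;
}
```

From carry-in to carry-out only the compare and the final add are on the
dependent path, which is the 2-cycle recurrence referred to above.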
   
 Last, we have mpn_addmul_1.  Almost certainly, we will get down to 3  
 cycles/limb, which would be absolutely awesome.  
   
 Old, perhaps obsolete addmul_1 dependency diagram (needs 175 columns wide screen):  
   
    i  
    s  
    s  i  
    u  n  
    e  s  
    d  t  
       r  
    i  u  
 l  n  c  
 i  s  t  
 v  t  i  
 e  r  o  
    u  n  
 v  c  
 a  t  t  
 l  i  y  
 u  o  p  
 e  n  e  
 s  s  s  
         issue  
          in  
         cycle  
          -1     ldq  
                /    \  
           0   |      \  
               |       \  
           1   |        |  
               |        |  
           2   |        |                   ldq  
               |        |                  /    \  
           3   |       mulq               |      \  
               |           \              |       \  
           4  umulh         \             |        |  
                |            |            |        |  
           5    |            |            |        |                   ldq  
                |            |            |        |                  /    \  
     4calm 6    |            |   ldq      |       mulq               |      \  
                |            |  /         |           \              |       \  
     4casm 7    |            | /         umulh         \             |        |  
 6              |            ||            |            |            |        |  
     3aal  8    |            ||            |            |            |        |                   ldq  
 7              |            ||            |            |            |        |                  /    \  
     4calm 9    |            ||            |            |   ldq      |       mulq               |      \  
 9              |            ||            |            |  /         |           \              |       \  
     4casm 10   |            ||            |            | /         umulh         \             |        |  
 9              |            ||            |            ||            |            |            |        |  
     3aal  11   |           addq           |            ||            |            |            |        |                   ldq  
 9              |          //   \          |            ||            |            |            |        |                  /    \  
     4calm 12    \     cmpult    addq<-cy  |            ||            |            |   ldq      |       mulq               |      \  
 13               \    /       //   \      |            ||            |            |  /         |           \              |       \  
     4casm 13      addq   cmpult     stq   |            ||            |            | /         umulh         \             |        |  
 11                    \  /                |            ||            |            ||            |            |            |        |  
     3aal  14          addq                |           addq           |            ||            |            |            |        |                   ldq  
 10                        \               |          //   \          |            ||            |            |            |        |                  /    \  
     4calm 15                cy ---->       \     cmpult    addq<-cy  |            ||            |            |   ldq      |       mulq               |      \  
 13                                          \    /       //   \      |            ||            |            |  /         |           \              |       \  
     4casm 16                                 addq   cmpult     stq   |            ||            |            | /         umulh         \             |        |  
 11                                               \  /                |            ||            |            ||            |            |            |        |  
     3aal  17                                     addq                |           addq           |            ||            |            |            |        |  
 10                                                   \               |          //   \          |            ||            |            |            |        |  
     4calm 18                                           cy ---->       \     cmpult    addq<-cy  |            ||            |            |   ldq      |       mulq  
 13                                                                     \    /       //   \      |            ||            |            |  /         |           \  
     4casm 19                                                            addq   cmpult     stq   |            ||            |            | /         umulh         \  
 11                                                                          \  /                |            ||            |            ||            |            |  
     3aal  20                                                                addq                |           addq           |            ||            |            |  
 10                                                                              \               |          //   \          |            ||            |            |  
     4calm 21                                                                      cy ---->       \     cmpult    addq<-cy  |            ||            |            |   ldq  
                                                                                                   \    /       //   \      |            ||            |            |  /  
           22                                                                                       addq   cmpult     stq   |            ||            |            | /  
                                                                                                        \  /                |            ||            |            ||  
           23                                                                                           addq                |           addq           |            ||  
                                                                                                            \               |          //   \          |            ||  
           24                                                                                                 cy ---->       \     cmpult    addq<-cy  |            ||  
                                                                                                                              \    /       //   \      |            ||  
           25                                                                                                                  addq   cmpult     stq   |            ||  
                                                                                                                                   \  /                |            ||  
           26                                                                                                                      addq                |           addq  
                                                                                                                                       \               |          //   \  
           27                                                                                                                            cy ---->       \     cmpult    addq<-cy  
                                                                                                                                                         \    /       //   \  
           28                                                                                                                                             addq   cmpult     stq  
                                                                                                                                                              \  /  
 As many as 6 consecutive points will be under execution simultaneously, or if we                                                                             addq  
 schedule loads even further away, maybe 7 or 8.  But the number of live quantities                                                                               \  
 is reasonable, and can easily be satisfied.                                                                                                                        cy ---->  

