Copyright 1996, 1997, 1999, 2000, 2001, 2002 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License as published by the
Free Software Foundation; either version 2.1 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License
for more details.

You should have received a copy of the GNU Lesser General Public License along
with the GNU MP Library; see the file COPYING.LIB. If not, write to the Free
Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307,
USA.

This directory contains mpn functions optimized for DEC Alpha processors.


ALPHA ASSEMBLY RULES AND REGULATIONS

The `.prologue N' pseudo op marks the end of instructions that need special
handling by unwinding. It also says whether $27 is really needed for computing
the gp. The `.mask M' pseudo op says which registers are saved on the stack,
and at what offset in the frame.
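
A minimal sketch of a typical prologue using these pseudo ops (the function
name, register choices, and frame size here are only illustrative):

        .text
        .align  3
        .globl  foo
        .ent    foo
foo:
        ldgp    $29, 0($27)     # compute gp from the procedure value in $27
        lda     $30, -16($30)   # allocate a 16-byte stack frame
        stq     $26, 0($30)     # save the return-address register
        .mask   0x04000000, -16 # $26 is saved, at offset -16 in the frame
        .frame  $30, 16, $26
        .prologue 1             # end of prologue; 1 = gp is computed from $27
        # ... function body ...
        ldq     $26, 0($30)
        lda     $30, 16($30)
        ret     $31, ($26), 1
        .end    foo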

Cray T3 code is very, very different...


RELEVANT OPTIMIZATION ISSUES

EV4

1. This chip has very limited store bandwidth. The on-chip L1 cache is
   write-through, and a cache line is transferred from the store buffer to the
   off-chip L2 in as much as 15 cycles on most systems. This delay hurts
   mpn_add_n, mpn_sub_n, mpn_lshift, and mpn_rshift.

2. Pairing is possible between memory instructions and integer arithmetic
   instructions (see the sketch after this list).

3. mulq and umulh are documented to have a latency of 23 cycles, but 2 of
   these cycles are pipelined. Thus, multiply instructions can be issued at a
   rate of one every 21 cycles.
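
   To illustrate the pairing noted in item 2, a sequence that alternates
   memory and integer instructions can dual issue (registers here are only
   illustrative):

        ldq     $0, 0($16)      # memory instruction  \  can issue together
        addq    $1, $2, $3      # integer arithmetic  /  in one cycle
        stq     $3, 0($17)      # memory instruction  \  likewise pairable
        addq    $4, $5, $6      # integer arithmetic  /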

EV5

1. The memory bandwidth of this chip is good, both for loads and stores. The
   L1 cache can handle two loads or one store per cycle, but two cycles after
   a store, no ld can issue.

2. mulq has a latency of 12 cycles and an issue rate of one every 8 cycles.
   umulh has a latency of 14 cycles and an issue rate of one every 10 cycles.
   (Note that published documentation gets these numbers slightly wrong.)

3. mpn_add_n. With 4-fold unrolling, we need 37 instructions, whereof 12
   are memory operations. This will take at least
        ceil(37/2) [dual issue] + 1 [taken branch] = 20 cycles
   We have 12 memory cycles, plus 4 after-store conflict cycles, or 16 data
   cache cycles, which should be completely hidden in the 20 issue cycles.
   The computation is inherently serial, with these dependencies:

           ldq  ldq
             \  / \
        (or) addq  |
          |\ /  \  |
          | addq  cmpult
           \  |     |
           cmpult   |
               \   /
                or

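   In code, this dependency graph corresponds to something like the following
   (register assignments are only illustrative):

        ldq     $0, 0($17)      # a = *up
        ldq     $1, 0($18)      # b = *vp
        addq    $0, $1, $2      # s = a + b
        cmpult  $2, $1, $3      # c1 = carry out of a + b
        addq    $2, $19, $4     # s += cy (carry-in)
        cmpult  $4, $19, $5     # c2 = carry out of s + cy
        or      $3, $5, $19     # carry-out = c1 | c2
        stq     $4, 0($16)      # *rp = s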

   I.e., 3 operations are needed between carry-in and carry-out, making 12
   cycles the absolute minimum for the 4 limbs. We could replace the `or' with
   a cmoveq/cmovne, which could issue one cycle earlier than the `or', but
   that might waste a cycle on EV4. The total depth remains unaffected, since
   cmov has a latency of 2 cycles.

             addq
            /    \
        addq      cmpult
          |          \
        cmpult  ->  cmovne

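   A sketch of this variant: since at most one of the two partial carries can
   be set, a cmovne can merge them in place of the `or' (registers again only
   illustrative):

        addq    $0, $1, $2      # s = a + b
        cmpult  $2, $1, $3      # c1 = carry out of a + b
        addq    $2, $19, $4     # s += cy
        cmpult  $4, $19, $5     # c2 = carry out of s + cy
        cmovne  $5, $5, $3      # if (c2 != 0) c1 = c2; $3 is carry-out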

   Montgomery has a slightly different way of computing carry that requires
   one fewer instruction, but has depth 4 (instead of the current 3). Since
   the code is currently instruction issue bound, Montgomery's idea should
   save us 1/2 cycle per limb, or bring us down to a total of 18 cycles or
   4.5 cycles/limb. Unfortunately, this method will not be good for the EV6.

4. addmul_1 and friends: We previously had a scheme for splitting the
   single-limb operand into 21-bit chunks and the multi-limb operand into
   32-bit chunks, and then using FP operations for every 2nd multiply and
   integer operations for the remaining multiplies.

   But it seems much better to split the single-limb operand into 16-bit
   chunks, since we save many integer shifts and adds that way. See
   powerpc64/README for some more details.
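
   For instance, the four 16-bit chunks of the single-limb operand v can be
   extracted with extwl and constant byte offsets, avoiding explicit
   shift/mask pairs (registers are only illustrative):

        extwl   $17, 0, $1      # v0 = v & 0xffff
        extwl   $17, 2, $2      # v1 = (v >> 16) & 0xffff
        extwl   $17, 4, $3      # v2 = (v >> 32) & 0xffff
        extwl   $17, 6, $4      # v3 = (v >> 48) & 0xffff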

EV6

Here we have a really parallel pipeline, capable of issuing up to 4 integer
instructions per cycle. In actual practice, it is never possible to sustain
more than 3.5 integer insns/cycle due to rename register shortage. One integer
multiply instruction can issue each cycle. To get optimal speed, we need to
pretend we are vectorizing the code, i.e., minimize the depth of recurrences.

There are two kinds of dependencies to watch out for: 1) address arithmetic
dependencies, and 2) carry propagation dependencies.

We can avoid serializing due to address arithmetic by unrolling loops, so that
addresses don't depend heavily on an index variable. Avoiding serializing
because of carry propagation is trickier; the ultimate performance of the code
will be determined by the number of latency cycles it takes from accepting
carry-in at a vector point until we can generate carry-out.
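
With 4-way unrolling, for example, the four loads can use constant
displacements from a single pointer, so only one pointer update is needed per
iteration (a sketch; registers are only illustrative):

        ldq     $0, 0($17)      # up[0]
        ldq     $1, 8($17)      # up[1]
        ldq     $2, 16($17)     # up[2]
        ldq     $3, 24($17)     # up[3]
        lda     $17, 32($17)    # one pointer update per 4 limbs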

Most integer instructions can execute in any of the L0, U0, L1, or U1
pipelines. Shifts only execute in U0 and U1, and multiply only in U1.

CMOV instructions split into two internal instructions, CMOV1 and CMOV2; the
split happens during the mapping stage (see pg 2-26 in cmpwrgd.pdf),
suggesting that a CMOV should always be placed as the last instruction of an
aligned 4-instruction block, or perhaps simply avoided.
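
Following that suggestion, a cmov would occupy the final slot of an aligned
4-instruction fetch block; the surrounding instructions below are only
placeholders:

        .align  4               # align to a 16-byte (4-instruction) block
        addq    $0, $1, $2      # slot 0
        cmpult  $2, $1, $3      # slot 1
        addq    $4, $5, $6      # slot 2
        cmovne  $3, $6, $7      # slot 3: the cmov goes last in the block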

Perhaps the most important issue is the latency between the L0/U0 and L1/U1
clusters; a result obtained on either cluster has an extra cycle of latency
for consumers in the opposite cluster. Because of the dynamic nature of the
implementation, it is hard to predict where an instruction will execute.