Copyright 1996, 1997, 1999, 2000, 2001, 2002 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License as published by the
Free Software Foundation; either version 2.1 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License
for more details.

You should have received a copy of the GNU Lesser General Public License along
with the GNU MP Library; see the file COPYING.LIB. If not, write to the Free
Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307,
USA.

This directory contains mpn functions optimized for DEC Alpha processors.


ALPHA ASSEMBLY RULES AND REGULATIONS

The `.prologue N' pseudo op marks the end of instructions that need special
handling by unwinding. It also says whether $27 is really needed for computing
the gp. The `.mask M' pseudo op says which registers are saved on the stack,
and at what offset in the frame.
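
A minimal sketch of a typical prologue using these pseudo ops (the function
name, register choices, and frame size here are only illustrative):

        .text
        .align  3
        .globl  foo
        .ent    foo
foo:
        ldgp    $29, 0($27)     # compute gp from the procedure value in $27
        lda     $30, -16($30)   # allocate a 16-byte stack frame
        stq     $26, 0($30)     # save the return-address register
        .mask   0x04000000, -16 # $26 is saved, at offset -16 in the frame
        .frame  $30, 16, $26
        .prologue 1             # end of prologue; 1 = gp is computed from $27
        # ... function body ...
        ldq     $26, 0($30)
        lda     $30, 16($30)
        ret     $31, ($26), 1
        .end    foo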

Cray T3 code is very, very different...


RELEVANT OPTIMIZATION ISSUES

EV4

1. This chip has very limited store bandwidth. The on-chip L1 cache is
   write-through, and a cache line is transferred from the store buffer to the
   off-chip L2 in as much as 15 cycles on most systems. This delay hurts
   mpn_add_n, mpn_sub_n, mpn_lshift, and mpn_rshift.

2. Pairing is possible between memory instructions and integer arithmetic
   instructions (see the sketch after this list).

3. mulq and umulh are documented to have a latency of 23 cycles, but 2 of
   these cycles are pipelined. Thus, multiply instructions can be issued at a
   rate of one every 21 cycles.
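
   To illustrate the pairing noted in item 2, a sequence that alternates
   memory and integer instructions can dual issue (registers here are only
   illustrative):

        ldq     $0, 0($16)      # memory instruction  \  can issue together
        addq    $1, $2, $3      # integer arithmetic  /  in one cycle
        stq     $3, 0($17)      # memory instruction  \  likewise pairable
        addq    $4, $5, $6      # integer arithmetic  /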

EV5

1. The memory bandwidth of this chip is good, both for loads and stores. The
   L1 cache can handle two loads or one store per cycle, but two cycles after
   a store, no ld can issue.

2. mulq has a latency of 12 cycles and an issue rate of one every 8 cycles.
   umulh has a latency of 14 cycles and an issue rate of one every 10 cycles.
   (Note that published documentation gets these numbers slightly wrong.)

3. mpn_add_n. With 4-fold unrolling, we need 37 instructions, whereof 12
   are memory operations. This will take at least
        ceil(37/2) [dual issue] + 1 [taken branch] = 20 cycles
   We have 12 memory cycles, plus 4 after-store conflict cycles, or 16 data
   cache cycles, which should be completely hidden in the 20 issue cycles.
   The computation is inherently serial, with these dependencies:

           ldq  ldq
             \  / \
        (or) addq  |
          |\ /  \  |
          | addq  cmpult
           \  |     |
           cmpult   |
               \   /
                or

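   In code, this dependency graph corresponds to something like the following
   (register assignments are only illustrative):

        ldq     $0, 0($17)      # a = *up
        ldq     $1, 0($18)      # b = *vp
        addq    $0, $1, $2      # s = a + b
        cmpult  $2, $1, $3      # c1 = carry out of a + b
        addq    $2, $19, $4     # s += cy (carry-in)
        cmpult  $4, $19, $5     # c2 = carry out of s + cy
        or      $3, $5, $19     # carry-out = c1 | c2
        stq     $4, 0($16)      # *rp = s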

   I.e., 3 operations are needed between carry-in and carry-out, making 12
   cycles the absolute minimum for the 4 limbs. We could replace the `or' with
   a cmoveq/cmovne, which could issue one cycle earlier than the `or', but
   that might waste a cycle on EV4. The total depth remains unaffected, since
   cmov has a latency of 2 cycles.

             addq
            /    \
        addq      cmpult
          |          \
        cmpult  ->  cmovne

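   A sketch of this variant: since at most one of the two partial carries can
   be set, a cmovne can merge them in place of the `or' (registers again only
   illustrative):

        addq    $0, $1, $2      # s = a + b
        cmpult  $2, $1, $3      # c1 = carry out of a + b
        addq    $2, $19, $4     # s += cy
        cmpult  $4, $19, $5     # c2 = carry out of s + cy
        cmovne  $5, $5, $3      # if (c2 != 0) c1 = c2; $3 is carry-out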

   Montgomery has a slightly different way of computing carry that requires
   one fewer instruction, but has depth 4 (instead of the current 3). Since
   the code is currently instruction issue bound, Montgomery's idea should
   save us 1/2 cycle per limb, or bring us down to a total of 18 cycles or
   4.5 cycles/limb. Unfortunately, this method will not be good for the EV6.

4. addmul_1 and friends: We previously had a scheme for splitting the
   single-limb operand into 21-bit chunks and the multi-limb operand into
   32-bit chunks, and then using FP operations for every 2nd multiply and
   integer operations for the remaining multiplies.

   But it seems much better to split the single-limb operand into 16-bit
   chunks, since we save many integer shifts and adds that way. See
   powerpc64/README for some more details.
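
   For instance, the four 16-bit chunks of the single-limb operand v can be
   extracted with extwl and constant byte offsets, avoiding explicit
   shift/mask pairs (registers are only illustrative):

        extwl   $17, 0, $1      # v0 = v & 0xffff
        extwl   $17, 2, $2      # v1 = (v >> 16) & 0xffff
        extwl   $17, 4, $3      # v2 = (v >> 32) & 0xffff
        extwl   $17, 6, $4      # v3 = (v >> 48) & 0xffff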

EV6

Here we have a really parallel pipeline, capable of issuing up to 4 integer
instructions per cycle. In actual practice, it is never possible to sustain
more than 3.5 integer insns/cycle due to rename register shortage. One integer
multiply instruction can issue each cycle. To get optimal speed, we need to
pretend we are vectorizing the code, i.e., minimize the depth of recurrences.

There are two kinds of dependencies to watch out for: 1) address arithmetic
dependencies, and 2) carry propagation dependencies.

We can avoid serializing due to address arithmetic by unrolling loops, so that
addresses don't depend heavily on an index variable. Avoiding serializing
because of carry propagation is trickier; the ultimate performance of the code
will be determined by the number of latency cycles it takes from accepting
carry-in at a vector point until we can generate carry-out.
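
With 4-way unrolling, for example, the four loads can use constant
displacements from a single pointer, so only one pointer update is needed per
iteration (a sketch; registers are only illustrative):

        ldq     $0, 0($17)      # up[0]
        ldq     $1, 8($17)      # up[1]
        ldq     $2, 16($17)     # up[2]
        ldq     $3, 24($17)     # up[3]
        lda     $17, 32($17)    # one pointer update per 4 limbs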

Most integer instructions can execute in any of the L0, U0, L1, or U1
pipelines. Shifts only execute in U0 and U1, and multiply only in U1.

CMOV instructions split into two internal instructions, CMOV1 and CMOV2; the
split happens during the mapping stage (see pg 2-26 in cmpwrgd.pdf),
suggesting that a CMOV should always be placed as the last instruction of an
aligned 4-instruction block, or perhaps simply avoided.
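
Following that suggestion, a cmov would occupy the final slot of an aligned
4-instruction fetch block; the surrounding instructions below are only
placeholders:

        .align  4               # align to a 16-byte (4-instruction) block
        addq    $0, $1, $2      # slot 0
        cmpult  $2, $1, $3      # slot 1
        addq    $4, $5, $6      # slot 2
        cmovne  $3, $6, $7      # slot 3: the cmov goes last in the block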

Perhaps the most important issue is the latency between the L0/U0 and L1/U1
clusters; a result obtained on either cluster has an extra cycle of latency
for consumers in the opposite cluster. Because of the dynamic nature of the
implementation, it is hard to predict where an instruction will execute.