Copyright 1996, 1997, 1999, 2000, 2001, 2002 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License as published by the
Free Software Foundation; either version 2.1 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public License
for more details.

You should have received a copy of the GNU Lesser General Public License along
with the GNU MP Library; see the file COPYING.LIB.  If not, write to the Free
Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307,
USA.



This directory contains mpn functions optimized for DEC Alpha processors.


ALPHA ASSEMBLY RULES AND REGULATIONS

The `.prologue N' pseudo op marks the end of the instructions that need
special handling for stack unwinding.  It also says whether $27 is really
needed for computing the gp.  The `.mask M' pseudo op says which registers
are saved on the stack, and at what offset in the frame.
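
For concreteness, here is a minimal, hypothetical prologue using these pseudo
ops.  The function name `foo', the 16-byte frame, and the choice to save only
$26 are arbitrary illustration, not code from this directory:

	.text
	.align	3
	.globl	foo
	.ent	foo
foo:
	ldgp	$29, 0($27)	# compute gp from $27, hence `.prologue 1'
	lda	$30, -16($30)	# allocate a 16-byte frame
	stq	$26, 0($30)	# save the return-address register
	.frame	$30, 16, $26
	.mask	0x04000000, -16	# bit 26 set: $26 is saved, at offset -16
	.prologue 1		# end of the unwind-relevant instructions
	# ... function body ...
	ldq	$26, 0($30)	# restore the return address
	lda	$30, 16($30)	# deallocate the frame
	ret	$31, ($26), 1
	.end	foo
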
Cray T3 code is very very different...


RELEVANT OPTIMIZATION ISSUES

EV4

1. This chip has very limited store bandwidth.  The on-chip L1 cache is write-
   through, and a cache line is transferred from the store buffer to the off-
   chip L2 in as many as 15 cycles on most systems.  This delay hurts
   mpn_add_n, mpn_sub_n, mpn_lshift, and mpn_rshift.

2. Pairing is possible between memory instructions and integer arithmetic
   instructions.

3. mulq and umulh are documented to have a latency of 23 cycles, but 2 of
   these cycles are pipelined.  Thus, multiply instructions can be issued at
   a rate of one each 21st cycle.

EV5

1. The memory bandwidth of this chip is good, both for loads and stores.  The
   L1 cache can handle two loads or one store per cycle, but for two cycles
   after a store, no ld can issue.

2. mulq has a latency of 12 cycles and an issue rate of 1 each 8th cycle.
   umulh has a latency of 14 cycles and an issue rate of 1 each 10th cycle.
   (Note that published documentation gets these numbers slightly wrong.)

3. mpn_add_n.  With 4-fold unrolling, we need 37 instructions, whereof 12
   are memory operations.  This will take at least

	floor(37/2) [dual issue] + 1 [taken branch] = 19 cycles

   We have 12 memory cycles, plus 4 after-store conflict cycles, or 16 data
   cache cycles, which should be completely hidden in the 19 issue cycles.
   The computation is inherently serial, with these dependencies:

	      ldq  ldq
	       \  /\
	  (or) addq |
	   |\  /  \ |
	   | addq  cmpult
	    \  |      |
	     cmpult   |
	        \    /
	          or

   I.e., 3 operations are needed between carry-in and carry-out, making 12
   cycles the absolute minimum for the 4 limbs.  (The per-limb carry
   computation is sketched in code right after this list.)  We could replace
   the `or' with a cmoveq/cmovne, which could issue one cycle earlier than
   the `or', but that might waste a cycle on EV4.  The total depth remains
   unaffected, since cmov has a latency of 2 cycles.

   Montgomery has a slightly different way of computing carry that requires
   one less instruction, but has depth 4 (instead of the current 3).  Since
   the code is currently instruction issue bound, Montgomery's idea should
   save us 1/2 cycle per limb, or bring us down to a total of 17 cycles or
   4.25 cycles/limb.  Unfortunately, this method will not be good for the
   EV6.

4. addmul_1 and friends: We previously had a scheme for splitting the single-
   limb operand into 21-bit chunks and the multi-limb operand into 32-bit
   chunks (a 32-bit by 21-bit product fits in 53 bits, i.e., in an IEEE
   double mantissa), and then using FP operations for every 2nd multiply and
   integer operations for the others.

   But it seems much better to split the single-limb operand into 16-bit
   chunks, since we save many integer shifts and adds that way.  See
   powerpc64/README for some more details.
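
To make the dependency picture in EV5 note 3 concrete, here is the per-limb
pattern it describes, written out as a sketch.  Register assignments are
arbitrary except for $16-$18, which hold the rp, up, vp arguments in the
standard calling convention; the real loop is 4-way unrolled and scheduled
quite differently:

	ldq	$1, 0($17)	# a = up[i]
	ldq	$2, 0($18)	# b = vp[i]
	addq	$1, $2, $3	# s = a + b
	cmpult	$3, $2, $4	# c1 = (s < b), carry out of a + b
	addq	$3, $0, $3	# s += cy (carry-in, 0 or 1)
	cmpult	$3, $0, $5	# c2 = (s < cy), carry out of s + cy
	or	$4, $5, $0	# carry-out = c1 | c2
	stq	$3, 0($16)	# rp[i] = s

The carry-in in $0 passes through addq, cmpult, and or before the carry-out
is ready; those are the 3 operations counted above.
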
EV6

Here we have a really parallel pipeline, capable of issuing up to 4 integer
instructions per cycle.  In actual practice, it is never possible to sustain
more than 3.5 integer insns/cycle, due to rename register shortage.  One
integer multiply instruction can issue each cycle.  To get optimal speed, we
need to pretend we are vectorizing the code, i.e., minimize the depth of
recurrences.

There are two kinds of dependencies to watch out for: 1) address arithmetic
dependencies, and 2) carry propagation dependencies.

We can avoid serializing due to address arithmetic by unrolling loops, so that
addresses don't depend heavily on an index variable (see the sketch at the end
of this file).  Avoiding serializing because of carry propagation is trickier;
the ultimate performance of the code will be determined by the number of
latency cycles it takes from accepting carry-in at some point until we can
generate the corresponding carry-out.

Most integer instructions can execute in either the L0, U0, L1, or U1
pipelines.  Shifts only execute in U0 and U1, and multiply only in U1.

CMOV instructions split into two internal instructions, CMOV1 and CMOV2, in
the mapping stage (see pg 2-26 in cmpwrgd.pdf), suggesting that a CMOV should
always be placed as the last instruction of an aligned 4-instruction block,
or perhaps simply avoided.

Perhaps the most important issue is the latency between the L0/U0 and L1/U1
clusters; a result obtained in either cluster has an extra cycle of latency
for consumers in the opposite cluster.  Because of the dynamic nature of the
implementation, it is hard to predict where an instruction will execute.
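
Here is what the unrolling advice above amounts to in practice: a schematic
4-way unrolled skeleton, not code from this directory.  The four loads depend
only on the pointer in $17, not on one another or on an index update, so only
the single pointer update per iteration enters the address recurrence:

loop:	ldq	$0, 0($17)	# four mutually independent loads
	ldq	$1, 8($17)
	ldq	$2, 16($17)
	ldq	$3, 24($17)
	# ... process $0-$3 here ...
	lda	$17, 32($17)	# one pointer update per iteration
	lda	$18, -4($18)	# count down four limbs
	bgt	$18, loop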