===================================================================
RCS file: /home/cvs/OpenXM_contrib/gmp/mpn/alpha/Attic/README,v
retrieving revision 1.1.1.2
retrieving revision 1.1.1.3
diff -u -p -r1.1.1.2 -r1.1.1.3
--- OpenXM_contrib/gmp/mpn/alpha/Attic/README	2000/09/09 14:12:21	1.1.1.2
+++ OpenXM_contrib/gmp/mpn/alpha/Attic/README	2003/08/25 16:06:18	1.1.1.3
@@ -1,40 +1,63 @@
+Copyright 1996, 1997, 1999, 2000, 2001, 2002 Free Software Foundation, Inc.
+
+This file is part of the GNU MP Library.
+
+The GNU MP Library is free software; you can redistribute it and/or modify it
+under the terms of the GNU Lesser General Public License as published by the
+Free Software Foundation; either version 2.1 of the License, or (at your
+option) any later version.
+
+The GNU MP Library is distributed in the hope that it will be useful, but
+WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public License
+for more details.
+
+You should have received a copy of the GNU Lesser General Public License along
+with the GNU MP Library; see the file COPYING.LIB.  If not, write to the Free
+Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307,
+USA.
+
+
+
+
 This directory contains mpn functions optimized for DEC Alpha processors.
 
 ALPHA ASSEMBLY RULES AND REGULATIONS
 
-The `.prologue N' pseudo op marks the end of instruction that needs
-special handling by unwinding.  It also says whether $27 is really
-needed for computing the gp.  The `.mask M' pseudo op says which
-registers are saved on the stack, and at what offset in the frame.
+The `.prologue N' pseudo op marks the end of the instructions that need special
+handling by unwinding.  It also says whether $27 is really needed for computing
+the gp.  The `.mask M' pseudo op says which registers are saved on the stack,
+and at what offset in the frame.
 
-Cray code is very very different...
+Cray T3 code is very very different...
 
 RELEVANT OPTIMIZATION ISSUES
 
 EV4
 
-1. This chip has very limited store bandwidth.  The on-chip L1 cache is
-   write-through, and a cache line is transfered from the store buffer to
-   the off-chip L2 in as much 15 cycles on most systems.  This delay hurts
-   mpn_add_n, mpn_sub_n, mpn_lshift, and mpn_rshift.
+1. This chip has very limited store bandwidth.  The on-chip L1 cache is write-
+   through, and a cache line is transferred from the store buffer to the off-
+   chip L2 in as many as 15 cycles on most systems.  This delay hurts
+   mpn_add_n, mpn_sub_n, mpn_lshift, and mpn_rshift.
 
 2. Pairing is possible between memory instructions and integer arithmetic
    instructions.
 
-3. mulq and umulh are documented to have a latency of 23 cycles, but 2 of
-   these cycles are pipelined.  Thus, multiply instructions can be issued at
-   a rate of one each 21st cycle.
+3. mulq and umulh are documented to have a latency of 23 cycles, but 2 of these
+   cycles are pipelined.  Thus, multiply instructions can be issued at a rate
+   of one each 21st cycle.
 
 EV5
 
-1. The memory bandwidth of this chip seems excellent, both for loads and
-   stores.  Even when the working set is larger than the on-chip L1 and L2
-   caches, the performance remain almost unaffected.
+1. The memory bandwidth of this chip is good, both for loads and stores.  The
+   L1 cache can handle two loads or one store per cycle, but two cycles after a
+   store, no ld can issue.
 
 2. mulq has a latency of 12 cycles and an issue rate of 1 each 8th cycle.
-   umulh has a measured latency of 14 cycles and an issue rate of 1 each
-   10th cycle.  But the exact timing is somewhat confusing.
+   umulh has a latency of 14 cycles and an issue rate of 1 each 10th cycle.
+   (Note that published documentation gets these numbers slightly wrong.)
 
 3. mpn_add_n.  With 4-fold unrolling, we need 37 instructions, whereof 12
    are memory operations.
   This will take at least
@@ -54,10 +77,10 @@ EV5
 	or
 
    I.e., 3 operations are needed between carry-in and carry-out, making 12
-   cycles the absolute minimum for the 4 limbs.  We could replace the `or'
-   with a cmoveq/cmovne, which could issue one cycle earlier that the `or',
-   but that might waste a cycle on EV4.  The total depth remain unaffected,
-   since cmov has a latency of 2 cycles.
+   cycles the absolute minimum for the 4 limbs.  We could replace the `or' with
+   a cmoveq/cmovne, which could issue one cycle earlier than the `or', but that
+   might waste a cycle on EV4.  The total depth remains unaffected, since cmov
+   has a latency of 2 cycles.
 
 	  addq
 	 /    \
@@ -65,160 +88,47 @@ EV5
 	|      \
 	cmpult -> cmovne
 
-Montgomery has a slightly different way of computing carry that requires one
-less instruction, but has depth 4 (instead of the current 3).  Since the
-code is currently instruction issue bound, Montgomery's idea should save us
-1/2 cycle per limb, or bring us down to a total of 17 cycles or 4.25
-cycles/limb.  Unfortunately, this method will not be good for the EV6.
+   Montgomery has a slightly different way of computing carry that requires one
+   less instruction, but has depth 4 (instead of the current 3).  Since the code
+   is currently instruction issue bound, Montgomery's idea should save us 1/2
+   cycle per limb, or bring us down to a total of 17 cycles or 4.25 cycles/limb.
+   Unfortunately, this method will not be good for the EV6.
+
+4. addmul_1 and friends:  We previously had a scheme for splitting the single-
+   limb operand in 21-bit chunks and the multi-limb operand in 32-bit chunks,
+   and then using FP operations for every 2nd multiply, and integer operations
+   for the other multiplies.
+
+   But it seems much better to split the single-limb operand in 16-bit chunks,
+   since we save many integer shifts and adds that way.  See powerpc64/README
+   for some more details.
+
 
 EV6
 
 Here we have a really parallel pipeline, capable of issuing up to 4 integer
-instructions per cycle.
-One integer multiply instruction can issue each
-cycle.  To get optimal speed, we need to pretend we are vectorizing the code,
-i.e., minimize the iterative dependencies.
+instructions per cycle.  In actual practice, it is never possible to sustain
+more than 3.5 integer insns/cycle due to rename register shortage.  One integer
+multiply instruction can issue each cycle.  To get optimal speed, we need to
+pretend we are vectorizing the code, i.e., minimize the depth of recurrences.
 
 There are two dependencies to watch out for.  1) Address arithmetic
 dependencies, and 2) carry propagation dependencies.
 
-We can avoid serializing due to address arithmetic by unrolling the loop, so
-that addresses don't depend heavily on an index variable.  Avoiding
-serializing because of carry propagation is trickier; the ultimate performance
-of the code will be determined of the number of latency cycles it takes from
-accepting carry-in to a vector point until we can generate carry-out.
+We can avoid serializing due to address arithmetic by unrolling loops, so that
+addresses don't depend heavily on an index variable.  Avoiding serializing
+because of carry propagation is trickier; the ultimate performance of the code
+will be determined by the number of latency cycles it takes from accepting
+carry-in at a vector point until we can generate carry-out.
 
 Most integer instructions can execute in either the L0, U0, L1, or U1
 pipelines.  Shifts only execute in U0 and U1, and multiply only in U1.
 
-CMOV instructions split into two internal instructions, CMOV1 and CMOV2, but
-the execute efficiently.  But CMOV split the mapping process (see pg 2-26 in
-cmpwrgd.pdf), suggesting the CMOV should always be placed as the last
-instruction of an aligned 4 instruction block (?).
+CMOV instructions split into two internal instructions, CMOV1 and CMOV2.
+CMOV split the mapping process (see pg 2-26 in cmpwrgd.pdf), suggesting the
+CMOV should always be placed as the last instruction of an aligned 4
+instruction block, or perhaps simply avoided.
 
 Perhaps the most important issue is the latency between the L0/U0 and L1/U1
-clusters; a result obtained on either cluster has an extra cycle of latency
-for consumers in the opposite cluster.  Because of the dynamic nature of the
+clusters; a result obtained on either cluster has an extra cycle of latency for
+consumers in the opposite cluster.  Because of the dynamic nature of the
 implementation, it is hard to predict where an instruction will execute.
-
-The shift loops need (per limb):
-    1 load (Lx pipes)
-    1 store (Lx pipes)
-    2 shift (Ux pipes)
-    1 iaddlog (Lx pipes, Ux pipes)
-Obviously, since the pipes are very equally loaded, we should get 4 insn/cycle, or 1.25 cycles/limb.
-
-For mpn_add_n, we currently have
-    2 load (Lx pipes)
-    1 store (Lx pipes)
-    5 iaddlog (Lx pipes, Ux pipes)
-
-Again, we have a perfect balance and will be limited by carry propagation
-delays, currently three cycles.  The superoptimizer indicates that ther
-might be sequences that--using a final cmov--have a carry propagation delay
-of just two.  Montgomery's subtraction sequence could perhaps be used, by
-complementing some operands.  All in all, we should get down to 2 cycles
-without much problems.
-
-For mpn_mul_1, we could do, just like for mpn_add_n:
-	not	newlo,notnewlo
-	addq	cylimb,newlo,newlo  ||  cmpult	cylimb,notnewlo,cyout
-	addq	cyout,newhi,cylimb
-and get 2-cycle carry propagation.  The instructions needed will be
-    1 ld (Lx pipes)
-    1 st (Lx pipes)
-    2 mul (U1 pipe)
-    4 iaddlog (Lx pipes, Ux pipes)
-issue1: addq not mul ld
-issue2: cmpult addq mul st
-Conclusion: no cluster delays and 2-cycle carry delays will give us 2 cycles/limb!
-
-Last, we have mpn_addmul_1.  Almost certainly, we will get down to 3
-cycles/limb, which would be absolutely awesome.
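[Editorial note] The removed mpn_mul_1 sequence above can be sketched in portable C. `mul_1_ref` is a hypothetical reference routine, not GMP's actual code; the 128-bit product (a gcc/clang extension) stands in for the mulq/umulh pair. The point of the `not`/`cmpult` pairing: the carry-out of newlo + cylimb equals (cylimb > ~newlo), so the compare can issue in parallel with the add instead of waiting for its result, giving the 2-cycle carry recurrence claimed in the text.

```c
#include <stdint.h>

typedef uint64_t mp_limb;

/* rp[0..n-1] = up[0..n-1] * v; returns the high limb.
   Hypothetical reference routine illustrating the quoted carry trick. */
static mp_limb mul_1_ref (mp_limb *rp, const mp_limb *up, long n, mp_limb v)
{
  mp_limb cylimb = 0;
  for (long i = 0; i < n; i++)
    {
      unsigned __int128 p = (unsigned __int128) up[i] * v;
      mp_limb newlo = (mp_limb) p;          /* mulq  */
      mp_limb newhi = (mp_limb) (p >> 64);  /* umulh */
      mp_limb notnewlo = ~newlo;            /* not   */
      /* carry-out of newlo + cylimb, computed without the sum: */
      mp_limb cyout = cylimb > notnewlo;    /* cmpult, parallel with addq */
      newlo = newlo + cylimb;               /* addq  */
      rp[i] = newlo;
      cylimb = newhi + cyout;               /* addq: carry limb for next i */
    }
  return cylimb;
}
```

For example, with up[0] = v = 2^64-1, the product is (2^64-1)^2, so rp[0] becomes 1 and the returned high limb is 2^64-2.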
-
-Old, perhaps obsolete addmul_1 dependency diagram (needs 175 columns wide screen):
-
-  [ASCII dependency diagram: per-limb ldq -> mulq/umulh chains feeding
-   addq/cmpult/stq carry chains, annotated with issue cycles and instruction
-   latencies; the diagram's column alignment was lost in extraction and is
-   not reproduced here.]
-
-As many as 6 consecutive points will be under execution simultaneously, or if
-we schedule loads even further away, maybe 7 or 8.  But the number of live
-quantities is reasonable, and can easily be satisfied.
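[Editorial note] The recurrence behind the old addmul_1 diagram can be sketched in C for reference. `addmul_1_ref` is a hypothetical illustration, not GMP's mpn_addmul_1; the 128-bit product (gcc/clang extension) stands in for the ldq-fed mulq/umulh pair, and each addq/cmpult pair below corresponds to one of the carry chains the diagram interleaves across limbs.

```c
#include <stdint.h>

typedef uint64_t mp_limb;

/* rp[0..n-1] += up[0..n-1] * v; returns the high limb carried out.
   Hypothetical reference for the per-limb dependency chain. */
static mp_limb addmul_1_ref (mp_limb *rp, const mp_limb *up, long n, mp_limb v)
{
  mp_limb cy = 0;
  for (long i = 0; i < n; i++)
    {
      unsigned __int128 p = (unsigned __int128) up[i] * v;  /* mulq + umulh */
      mp_limb lo = (mp_limb) p;
      mp_limb hi = (mp_limb) (p >> 64);
      mp_limb s = lo + cy;            /* addq:   add incoming carry limb  */
      hi += s < lo;                   /* cmpult: carry into the high limb */
      mp_limb r = rp[i] + s;          /* addq:   accumulate into rp[i]    */
      hi += r < s;                    /* cmpult: second carry             */
      rp[i] = r;
      cy = hi;                        /* carry limb into the next point   */
    }
  return cy;
}
```

The carry limb `cy` is the value the diagram labels `cy ---->` between successive points; its latency from carry-in to carry-out is what bounds cycles/limb.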