|
|
|
Copyright 1996, 1997, 1999, 2000, 2001, 2002 Free Software Foundation, Inc. |
|
|
|
This file is part of the GNU MP Library. |
|
|
|
The GNU MP Library is free software; you can redistribute it and/or modify it |
|
under the terms of the GNU Lesser General Public License as published by the |
|
Free Software Foundation; either version 2.1 of the License, or (at your |
|
option) any later version. |
|
|
|
The GNU MP Library is distributed in the hope that it will be useful, but |
|
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or |
|
FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License |
|
for more details. |
|
|
|
You should have received a copy of the GNU Lesser General Public License along |
|
with the GNU MP Library; see the file COPYING.LIB. If not, write to the Free |
|
Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, |
|
USA. |
|
|
|
|
|
|
|
|
|
|
This directory contains mpn functions optimized for DEC Alpha processors.
|
|
ALPHA ASSEMBLY RULES AND REGULATIONS
|
|
The `.prologue N' pseudo op marks the end of the instructions that need
special handling for stack unwinding.  It also says whether $27 is really
needed for computing the gp.  The `.mask M' pseudo op says which registers
are saved on the stack, and at what offset in the frame.
|
|
Cray T3 code is very very different...
|
|
|
|
RELEVANT OPTIMIZATION ISSUES
|
|
EV4
|
|
1. This chip has very limited store bandwidth.  The on-chip L1 cache is
   write-through, and a cache line is transferred from the store buffer to
   the off-chip L2 in as much as 15 cycles on most systems.  This delay
   hurts mpn_add_n, mpn_sub_n, mpn_lshift, and mpn_rshift.
|
|
2. Pairing is possible between memory instructions and integer arithmetic
   instructions.
|
|
3. mulq and umulh are documented to have a latency of 23 cycles, but 2 of
   these cycles are pipelined.  Thus, multiply instructions can be issued at
   a rate of one each 21st cycle.
|
|
EV5
|
|
1. The memory bandwidth of this chip is good, both for loads and stores.
   The L1 cache can handle two loads or one store per cycle, but two cycles
   after a store, no ld can issue.
|
|
2. mulq has a latency of 12 cycles and an issue rate of 1 each 8th cycle.
   umulh has a latency of 14 cycles and an issue rate of 1 each 10th cycle.
   (Note that published documentation gets these numbers slightly wrong.)
|
|
3. mpn_add_n.  With 4-fold unrolling, we need 37 instructions, whereof 12
   are memory operations.  This will take at least

	or
|
|
   I.e., 3 operations are needed between carry-in and carry-out, making 12
   cycles the absolute minimum for the 4 limbs.  We could replace the `or'
   with a cmoveq/cmovne, which could issue one cycle earlier than the `or',
   but that might waste a cycle on EV4.  The total depth remains unaffected,
   since cmov has a latency of 2 cycles.
|
|
	  addq
	 /    \
	|      \
	cmpult -> cmovne
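
The carry recurrence can be modeled in C (an illustrative sketch, not GMP's
actual code; the helper name `add_limb' is made up here):

```c
#include <stdint.h>

/* One limb of a multi-limb add, following the addq/cmpult chain in the
   diagram above.  cy_in and the returned carry are 0 or 1.  Illustrative
   only -- the real loops are written in Alpha assembly. */
static uint64_t add_limb(uint64_t a, uint64_t b, uint64_t cy_in,
                         uint64_t *cy_out)
{
    uint64_t s1 = a + b;        /* addq   */
    uint64_t c1 = s1 < a;       /* cmpult: carry from a+b        */
    uint64_t s2 = s1 + cy_in;   /* addq                          */
    uint64_t c2 = s2 < s1;      /* cmpult: carry from adding cy  */
    *cy_out = c1 | c2;          /* or (or a cmovne, as discussed) */
    return s2;
}
```

From carry-in to carry-out the chain is addq -> cmpult -> or, i.e. the 3
operations the text counts.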
|
|
   Montgomery has a slightly different way of computing carry that requires
   one less instruction, but has depth 4 (instead of the current 3).  Since
   the code is currently instruction issue bound, Montgomery's idea should
   save us 1/2 cycle per limb, or bring us down to a total of 17 cycles or
   4.25 cycles/limb.  Unfortunately, this method will not be good for the
   EV6.
|
|
|
4. addmul_1 and friends: We previously had a scheme for splitting the
   single-limb operand in 21-bit chunks and the multi-limb operand in 32-bit
   chunks, and then using FP operations for every other multiply, and
   integer operations for the rest.

   But it seems much better to split the single-limb operand in 16-bit
   chunks, since we save many integer shifts and adds that way.  See
   powerpc64/README for some more details.
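
The point of those chunk sizes is that a 32-bit x 16-bit partial product is
at most 48 bits, and hence exact in an IEEE double (53-bit mantissa).  A
hedged C check of that arithmetic (the function name is made up; real code
would use Alpha FP multiplies and converts rather than C doubles):

```c
#include <stdint.h>

/* Compute the low 64 bits of a*b from 32-bit chunks of a and 16-bit chunks
   of b, doing each partial multiply in a double.  Each ai*bj is below 2^48,
   so the double holds it exactly.  Illustration only. */
static uint64_t mul_low_via_chunks(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 2; i++) {              /* 32-bit chunks of a */
        uint64_t ai = (a >> (32 * i)) & 0xffffffffu;
        for (int j = 0; j < 4; j++) {          /* 16-bit chunks of b */
            uint64_t bj = (b >> (16 * j)) & 0xffffu;
            int sh = 32 * i + 16 * j;
            if (sh >= 64)
                continue;                      /* beyond the low limb */
            double p = (double)ai * (double)bj;  /* exact: < 2^48 */
            r += (uint64_t)p << sh;            /* wraps mod 2^64 */
        }
    }
    return r;
}
```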
|
|
EV6
|
|
Here we have a really parallel pipeline, capable of issuing up to 4 integer
instructions per cycle.  In actual practice, it is never possible to sustain
more than 3.5 integer insns/cycle due to rename register shortage.  One
integer multiply instruction can issue each cycle.  To get optimal speed, we
need to pretend we are vectorizing the code, i.e., minimize the depth of
recurrences.
|
|
There are two dependencies to watch out for: 1) address arithmetic
dependencies, and 2) carry propagation dependencies.
|
|
We can avoid serializing due to address arithmetic by unrolling loops, so
that addresses don't depend heavily on an index variable.  Avoiding
serializing because of carry propagation is trickier; the ultimate
performance of the code will be determined by the number of latency cycles
it takes from accepting carry-in at a vector point until we can generate
carry-out.
|
|
Most integer instructions can execute in either the L0, U0, L1, or U1
pipelines.  Shifts only execute in U0 and U1, and multiply only in U1.
|
|
CMOV instructions split into two internal instructions, CMOV1 and CMOV2.
CMOV splits the mapping process (see pg 2-26 in cmpwrgd.pdf), suggesting
that a CMOV should always be placed as the last instruction of an aligned
4-instruction block, or perhaps simply avoided.
|
|
Perhaps the most important issue is the latency between the L0/U0 and L1/U1
clusters; a result obtained on either cluster has an extra cycle of latency
for consumers in the opposite cluster.  Because of the dynamic nature of the
implementation, it is hard to predict where an instruction will execute.
|
|
The shift loops need (per limb):

	1 load     (Lx pipes)
	1 store    (Lx pipes)
	2 shift    (Ux pipes)
	1 iaddlog  (Lx pipes, Ux pipes)

Obviously, since the pipes are evenly loaded, we should get 4 insn/cycle, or
1.25 cycles/limb.
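
The per-limb operation count corresponds to this plain C model of a left
shift (an illustrative sketch, not the actual assembly; assumes n >= 1 limbs
of 64 bits and 0 < cnt < 64):

```c
#include <stdint.h>
#include <stddef.h>

/* Shift {up,n} left by cnt bits into {rp,n}; return the bits shifted out
   of the top limb.  Per limb: one load, one store, two shifts and one
   `or' -- exactly the budget counted above. */
static uint64_t lshift(uint64_t *rp, const uint64_t *up, size_t n,
                       unsigned cnt)
{
    unsigned tnc = 64 - cnt;
    uint64_t retval = up[n - 1] >> tnc;
    for (size_t i = n - 1; i > 0; i--)
        rp[i] = (up[i] << cnt) | (up[i - 1] >> tnc);  /* 2 shifts + or */
    rp[0] = up[0] << cnt;
    return retval;
}
```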
|
|
|
For mpn_add_n, we currently have:

	2 load     (Lx pipes)
	1 store    (Lx pipes)
	5 iaddlog  (Lx pipes, Ux pipes)
|
|
|
Again, we have a perfect balance and will be limited by carry propagation
delays, currently three cycles.  The superoptimizer indicates that there
might be sequences that--using a final cmov--have a carry propagation delay
of just two.  Montgomery's subtraction sequence could perhaps be used, by
complementing some operands.  All in all, we should get down to 2 cycles
without many problems.
|
|
|
For mpn_mul_1, we could do, just like for mpn_add_n:

	not	newlo,notnewlo
	addq	cylimb,newlo,newlo  ||  cmpult	cylimb,notnewlo,cyout
	addq	cyout,newhi,cylimb

and get 2-cycle carry propagation.  The instructions needed will be:

	1 ld	(Lx pipes)
	1 st	(Lx pipes)
	2 mul	(U1 pipe)
	4 iaddlog (Lx pipes, Ux pipes)

	issue1:	addq	not	mul	ld
	issue2:	cmpult	addq	mul	st

Conclusion: no cluster delays and 2-cycle carry delays will give us 2
cycles/limb!
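
The `not'/`cmpult' pairing works because the carry out of x + y equals
(~y < x), a compare that does not depend on the sum and so can issue in the
same cycle as the addq.  A C sketch of just that identity (names are made
up for illustration):

```c
#include <stdint.h>

/* Return the carry out of x + y, computed from ~y rather than from the
   sum, and store the sum.  The compare and the add are independent, which
   is what lets them dual-issue on Alpha. */
static uint64_t add_carry_parallel(uint64_t x, uint64_t y, uint64_t *sum)
{
    uint64_t noty = ~y;   /* not    */
    *sum = x + y;         /* addq   */
    return noty < x;      /* cmpult -- does not read *sum */
}
```

The identity holds since x + y >= 2^64 exactly when x > 2^64 - 1 - y = ~y.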
|
|
|
Last, we have mpn_addmul_1.  Almost certainly, we will get down to 3
cycles/limb, which would be absolutely awesome.
|
|
|
Old, perhaps obsolete addmul_1 dependency diagram (needs a 175-column-wide
screen):
|
|
|
i
s
s i
u n
e s
d t
r
i u
l n c
i s t
v t i
e r o
u n
v c
a t t
l i y
u o p
e n e
s s s
issue
in
cycle
-1 ldq
/ \
0 | \
| \
1 | |
| |
2 | | ldq
| | / \
3 | mulq | \
| \ | \
4 umulh \ | |
| | | |
5 | | | | ldq
| | | | / \
4calm 6 | | ldq | mulq | \
| | / | \ | \
4casm 7 | | / umulh \ | |
6 | || | | | |
3aal 8 | || | | | | ldq
7 | || | | | | / \
4calm 9 | || | | ldq | mulq | \
9 | || | | / | \ | \
4casm 10 | || | | / umulh \ | |
9 | || | || | | | |
3aal 11 | addq | || | | | | ldq
9 | // \ | || | | | | / \
4calm 12 \ cmpult addq<-cy | || | | ldq | mulq | \
13 \ / // \ | || | | / | \ | \
4casm 13 addq cmpult stq | || | | / umulh \ | |
11 \ / | || | || | | | |
3aal 14 addq | addq | || | | | | ldq
10 \ | // \ | || | | | | / \
4calm 15 cy ----> \ cmpult addq<-cy | || | | ldq | mulq | \
13 \ / // \ | || | | / | \ | \
4casm 16 addq cmpult stq | || | | / umulh \ | |
11 \ / | || | || | | | |
3aal 17 addq | addq | || | | | |
10 \ | // \ | || | | | |
4calm 18 cy ----> \ cmpult addq<-cy | || | | ldq | mulq
13 \ / // \ | || | | / | \
4casm 19 addq cmpult stq | || | | / umulh \
11 \ / | || | || | |
3aal 20 addq | addq | || | |
10 \ | // \ | || | |
4calm 21 cy ----> \ cmpult addq<-cy | || | | ldq
\ / // \ | || | | /
22 addq cmpult stq | || | | /
\ / | || | ||
23 addq | addq | ||
\ | // \ | ||
24 cy ----> \ cmpult addq<-cy | ||
\ / // \ | ||
25 addq cmpult stq | ||
\ / | ||
26 addq | addq
\ | // \
27 cy ----> \ cmpult addq<-cy
\ / // \
28 addq cmpult stq
\ /
|
addq
\
cy ---->

As many as 6 consecutive points will be under execution simultaneously, or
if we schedule loads even further away, maybe 7 or 8.  But the number of
live quantities is reasonable, and can easily be satisfied.
|