version 1.1.1.3, 2003/08/25 16:06:18

Copyright 1996, 1997, 1999, 2000, 2001, 2002 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License as published by the
Free Software Foundation; either version 2.1 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public License
for more details.

You should have received a copy of the GNU Lesser General Public License along
with the GNU MP Library; see the file COPYING.LIB.  If not, write to the Free
Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307,
USA.


This directory contains mpn functions optimized for DEC Alpha processors.

ALPHA ASSEMBLY RULES AND REGULATIONS

The `.prologue N' pseudo op marks the end of the instructions that need
special handling by the unwinder.  It also says whether $27 is really needed
for computing the gp.  The `.mask M' pseudo op says which registers are saved
on the stack, and at what offset in the frame.

Cray T3 code is very very different...


RELEVANT OPTIMIZATION ISSUES

EV4

1. This chip has very limited store bandwidth.  The on-chip L1 cache is
   write-through, and a cache line is transferred from the store buffer to
   the off-chip L2 in as much as 15 cycles on most systems.  This delay hurts
   mpn_add_n, mpn_sub_n, mpn_lshift, and mpn_rshift.

2. Pairing is possible between memory instructions and integer arithmetic
   instructions.

3. mulq and umulh are documented to have a latency of 23 cycles, but 2 of
   these cycles are pipelined.  Thus, multiply instructions can be issued at
   a rate of one each 21st cycle.

EV5

1. The memory bandwidth of this chip is good, both for loads and stores.  The
   L1 cache can handle two loads or one store per cycle, but two cycles after
   a store, no ld can issue.

2. mulq has a latency of 12 cycles and an issue rate of 1 each 8th cycle.
   umulh has a latency of 14 cycles and an issue rate of 1 each 10th cycle.
   (Note that published documentation gets these numbers slightly wrong.)

3. mpn_add_n.  With 4-fold unrolling, we need 37 instructions, whereof 12
   are memory operations.  This will take at least

	ceil(37/2) [dual issue] + 1 [taken branch] = 19 cycles

   We have 12 memory cycles, plus 4 after-store conflict cycles, or 16 data
   cache cycles, which should be completely hidden in the 19 issue cycles.
   The computation is inherently serial, with these dependencies:

	       ldq  ldq
	        \  /\
	  (or) addq  |
	   |\  /  \  |
	   | addq  cmpult
	    \  |     |
	    cmpult   |
	       \    /
	         or

   I.e., 3 operations are needed between carry-in and carry-out, making 12
   cycles the absolute minimum for the 4 limbs.  We could replace the `or'
   with a cmoveq/cmovne, which could issue one cycle earlier than the `or',
   but that might waste a cycle on EV4.  The total depth remains unaffected,
   since cmov has a latency of 2 cycles.
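
   The same carry chain can be sketched in portable C (a hypothetical helper
   for illustration, not GMP's actual code); the three operations on the
   critical path from carry-in to carry-out correspond to the addq, cmpult,
   and or in the diagram above:

	#include <stdint.h>

	/* One limb of a carry-propagating add, mirroring the dependency
	   diagram: addq, cmpult, addq, cmpult, or.  */
	static uint64_t add_limb (uint64_t a, uint64_t b, uint64_t cy_in,
				  uint64_t *cy_out)
	{
	  uint64_t s1 = a + b;		/* addq			     */
	  uint64_t c1 = s1 < a;		/* cmpult: did a+b wrap?     */
	  uint64_t s2 = s1 + cy_in;	/* addq: add incoming carry  */
	  uint64_t c2 = s2 < s1;	/* cmpult: did +cy wrap?     */
	  *cy_out = c1 | c2;		/* or: combined carry-out    */
	  return s2;
	}

   Only one of c1 and c2 can ever be set, which is why a plain `or' (or the
   cmov variant discussed above) suffices to combine them.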
   Montgomery has a slightly different way of computing carry that requires
   one less instruction, but has depth 4 (instead of the current 3).  Since
   the code is currently instruction issue bound, Montgomery's idea should
   save us 1/2 cycle per limb, or bring us down to a total of 17 cycles or
   4.25 cycles/limb.  Unfortunately, this method will not be good for the
   EV6.

4. addmul_1 and friends: We previously had a scheme for splitting the
   single-limb operand in 21-bit chunks and the multi-limb operand in 32-bit
   chunks, and then using FP operations for every 2nd multiply, and integer
   operations for every 2nd multiply.

   But it seems much better to split the single-limb operand in 16-bit
   chunks, since we save many integer shifts and adds that way.  See
   powerpc64/README for some more details.
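
   The chunk idea can be sketched in portable C (an illustration only, not
   the actual assembly): each 32x16-bit partial product is at most 48 bits
   wide, small enough to be computed exactly in an IEEE double, which is
   what makes the FP multiplier usable for this.  The hypothetical function
   below reconstructs the low limb of a product from such chunks using
   integer arithmetic:

	#include <stdint.h>

	/* Low 64 bits of u*v, built from 32x16-bit partial products.
	   Shifts of 64 or more are skipped; those terms vanish mod 2^64.  */
	static uint64_t mul_low_by_chunks (uint64_t u, uint64_t v)
	{
	  uint64_t u0 = u & 0xffffffff, u1 = u >> 32;
	  uint64_t acc = 0;
	  int i;
	  for (i = 0; i < 4; i++)
	    {
	      uint64_t vi = (v >> (16 * i)) & 0xffff;	/* 16-bit chunk */
	      acc += (u0 * vi) << (16 * i);		/* <= 48-bit product */
	      if (16 * i + 32 < 64)
		acc += (u1 * vi) << (16 * i + 32);
	    }
	  return acc;	/* equals (u * v) mod 2^64 */
	}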


EV6

Here we have a really parallel pipeline, capable of issuing up to 4 integer
instructions per cycle.  In actual practice, it is never possible to sustain
more than 3.5 integer insns/cycle due to rename register shortage.  One
integer multiply instruction can issue each cycle.  To get optimal speed, we
need to pretend we are vectorizing the code, i.e., minimize the depth of
recurrences.

There are two dependencies to watch out for: 1) address arithmetic
dependencies, and 2) carry propagation dependencies.

We can avoid serializing due to address arithmetic by unrolling loops, so
that addresses don't depend heavily on an index variable.  Avoiding
serializing because of carry propagation is trickier; the ultimate
performance of the code will be determined by the number of latency cycles
it takes from accepting carry-in to a vector point until we can generate
carry-out.
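
The unrolling point can be illustrated in plain C (a hypothetical sketch, not
the actual assembly): with 4-fold unrolling, every address in an iteration is
a constant offset from one base index, so the four loads need not serialize
on successive index-variable updates.

	#include <stddef.h>
	#include <stdint.h>

	/* 4-fold unrolled limb copy; all addresses derive from up+i and
	   rp+i with constant offsets, leaving the loads independent.  */
	static void copy_unrolled (uint64_t *rp, const uint64_t *up, size_t n)
	{
	  size_t i;
	  for (i = 0; i + 4 <= n; i += 4)
	    {
	      uint64_t a = up[i + 0];
	      uint64_t b = up[i + 1];
	      uint64_t c = up[i + 2];
	      uint64_t d = up[i + 3];
	      rp[i + 0] = a;
	      rp[i + 1] = b;
	      rp[i + 2] = c;
	      rp[i + 3] = d;
	    }
	  for (; i < n; i++)		/* remaining 0-3 limbs */
	    rp[i] = up[i];
	}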

Most integer instructions can execute in either the L0, U0, L1, or U1
pipelines.  Shifts only execute in U0 and U1, and multiply only in U1.

CMOV instructions split into two internal instructions, CMOV1 and CMOV2.
CMOV splits the mapping process (see pg 2-26 in cmpwrgd.pdf), suggesting
that CMOV should always be placed as the last instruction of an aligned
4-instruction block, or perhaps simply avoided.

Perhaps the most important issue is the latency between the L0/U0 and L1/U1
clusters; a result obtained on either cluster has an extra cycle of latency
for consumers in the opposite cluster.  Because of the dynamic nature of the
implementation, it is hard to predict where an instruction will execute.