This directory contains mpn functions optimized for DEC Alpha processors.


ALPHA ASSEMBLY RULES AND REGULATIONS

The `.prologue N' pseudo op marks the end of instructions that need
special handling by unwinding.  It also says whether $27 is really
needed for computing the gp.  The `.mask M' pseudo op says which
registers are saved on the stack, and at what offset in the frame.

Cray code is very, very different...


RELEVANT OPTIMIZATION ISSUES

EV4

1. This chip has very limited store bandwidth.  The on-chip L1 cache is
write-through, and a cache line is transferred from the store buffer to
the off-chip L2 in as much as 15 cycles on most systems.  This delay hurts
mpn_add_n, mpn_sub_n, mpn_lshift, and mpn_rshift.

2. Pairing is possible between memory instructions and integer arithmetic
instructions.

3. mulq and umulh are documented to have a latency of 23 cycles, but 2 of
these cycles are pipelined.  Thus, multiply instructions can be issued at
a rate of one each 21st cycle.

EV5

1. The memory bandwidth of this chip seems excellent, both for loads and
stores.  Even when the working set is larger than the on-chip L1 and L2
caches, the performance remains almost unaffected.

2. mulq has a latency of 12 cycles and an issue rate of 1 each 8th cycle.
umulh has a measured latency of 14 cycles and an issue rate of 1 each
10th cycle.  But the exact timing is somewhat confusing.

3. mpn_add_n.  With 4-fold unrolling, we need 37 instructions, whereof 12
are memory operations.  This will take at least
	ceil(37/2) [dual issue] + 1 [taken branch] = 19 cycles
We have 12 memory cycles, plus 4 after-store conflict cycles, or 16 data
cache cycles, which should be completely hidden in the 19 issue cycles.
The computation is inherently serial, with these dependencies:

	      ldq  ldq
	       \  /\
	  (or) addq |
	   |\   /  \|
	   | addq  cmpult
	    \  |      |
	     cmpult   |
	        \    /
	          or

I.e., 3 operations are needed between carry-in and carry-out, making 12
cycles the absolute minimum for the 4 limbs.  We could replace the `or'
with a cmoveq/cmovne, which could issue one cycle earlier than the `or',
but that might waste a cycle on EV4.  The total depth remains unaffected,
since cmov has a latency of 2 cycles.

	   addq
	  /    \
	 addq  cmpult
	   |      \
	  cmpult -> cmovne

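In C terms, the per-limb carry chain above can be modeled as follows.  This is
a reference sketch, not GMP's actual code, and the helper name add_n_ref is
made up; each limb costs two addq, two cmpult-style compares, and an `or' (or
cmov) to merge the two possible carry sources:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Reference model of mpn_add_n's per-limb carry chain (hypothetical name).
   Returns the carry out of the most significant limb. */
static uint64_t add_n_ref(uint64_t *rp, const uint64_t *up,
                          const uint64_t *vp, size_t n)
{
  uint64_t cy = 0;
  for (size_t i = 0; i < n; i++)
    {
      uint64_t s = up[i] + vp[i];    /* addq                      */
      uint64_t c1 = s < up[i];       /* cmpult: carry from u+v    */
      s += cy;                       /* addq: add carry-in        */
      uint64_t c2 = s < cy;          /* cmpult: carry from +cy    */
      cy = c1 | c2;                  /* or (could be a cmov)      */
      rp[i] = s;
    }
  return cy;
}
```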
Montgomery has a slightly different way of computing carry that requires one
less instruction, but has depth 4 (instead of the current 3).  Since the
code is currently instruction issue bound, Montgomery's idea should save us
1/2 cycle per limb, or bring us down to a total of 17 cycles or 4.25
cycles/limb.  Unfortunately, this method will not be good for the EV6.


EV6

Here we have a really parallel pipeline, capable of issuing up to 4 integer
instructions per cycle.  One integer multiply instruction can issue each
cycle.  To get optimal speed, we need to pretend we are vectorizing the
code, i.e., minimize the iterative dependencies.

There are two dependencies to watch out for: 1) address arithmetic
dependencies, and 2) carry propagation dependencies.

We can avoid serializing due to address arithmetic by unrolling the loop, so
that addresses don't depend heavily on an index variable.  Avoiding
serializing because of carry propagation is trickier; the ultimate
performance of the code will be determined by the number of latency cycles
it takes from accepting carry-in at a vector point until we can generate
carry-out.

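As a C-level illustration of the address-arithmetic point (hypothetical
helper, not GMP code): with 4-fold unrolling, the four accesses per iteration
use constant displacements, so the only address dependency carried between
iterations is a single index update.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Illustrative 4-fold unrolled limb copy.  The four loads/stores are at
   constant offsets from the base index, so address arithmetic within the
   body does not serialize on the loop counter. */
static void copy4_ref(uint64_t *rp, const uint64_t *up, size_t n)
{
  size_t i = 0;
  for (; i + 4 <= n; i += 4)     /* one index update per 4 limbs */
    {
      rp[i + 0] = up[i + 0];
      rp[i + 1] = up[i + 1];
      rp[i + 2] = up[i + 2];
      rp[i + 3] = up[i + 3];
    }
  for (; i < n; i++)             /* tail limbs */
    rp[i] = up[i];
}
```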
Most integer instructions can execute in either the L0, U0, L1, or U1
pipelines.  Shifts only execute in U0 and U1, and multiply only in U1.

CMOV instructions split into two internal instructions, CMOV1 and CMOV2,
but they execute efficiently.  However, CMOV splits the mapping process
(see pg 2-26 in cmpwrgd.pdf), suggesting that CMOV should always be placed
as the last instruction of an aligned 4-instruction block (?).

Perhaps the most important issue is the latency between the L0/U0 and L1/U1
clusters; a result obtained on either cluster has an extra cycle of latency
for consumers in the opposite cluster.  Because of the dynamic nature of the
implementation, it is hard to predict where an instruction will execute.

The shift loops need (per limb):
	1 load    (Lx pipes)
	1 store   (Lx pipes)
	2 shift   (Ux pipes)
	1 iaddlog (Lx pipes, Ux pipes)
Since the pipes are evenly loaded, we should get 4 insn/cycle, or
1.25 cycles/limb.

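That per-limb recipe corresponds to a plain-C shift loop like this sketch
(the name lshift_ref is made up; GMP's real routine differs).  Each limb
needs one load, two shifts, one `or' to combine, and one store:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Reference left shift of an n-limb number by cnt bits, 0 < cnt < 64.
   Returns the bits shifted out of the top limb. */
static uint64_t lshift_ref(uint64_t *rp, const uint64_t *up,
                           size_t n, unsigned cnt)
{
  uint64_t retval = up[n - 1] >> (64 - cnt);
  for (size_t i = n - 1; i > 0; i--)
    /* 1 load of up[i-1] per step; 2 shifts; 1 or; 1 store */
    rp[i] = (up[i] << cnt) | (up[i - 1] >> (64 - cnt));
  rp[0] = up[0] << cnt;
  return retval;
}
```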
For mpn_add_n, we currently have:
	2 load    (Lx pipes)
	1 store   (Lx pipes)
	5 iaddlog (Lx pipes, Ux pipes)

Again, we have a perfect balance and will be limited by carry propagation
delays, currently three cycles.  The superoptimizer indicates that there
might be sequences that--using a final cmov--have a carry propagation delay
of just two.  Montgomery's subtraction sequence could perhaps be used, by
complementing some operands.  All in all, we should get down to 2 cycles
without much trouble.

For mpn_mul_1, we could do, just like for mpn_add_n:
	not	newlo,notnewlo
	addq	cylimb,newlo,newlo  ||  cmpult	cylimb,notnewlo,cyout
	addq	cyout,newhi,cylimb
and get 2-cycle carry propagation.  The instructions needed will be
	1 ld      (Lx pipes)
	1 st      (Lx pipes)
	2 mul     (U1 pipe)
	4 iaddlog (Lx pipes, Ux pipes)
	issue1:	addq	not	mul	ld
	issue2:	cmpult	addq	mul	st
Conclusion: no cluster delays and 2-cycle carry delays will give us
2 cycles/limb!

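The intent of that carry step can be modeled in C as follows (a sketch with
made-up names mirroring the registers above).  The useful identity is that
cylimb + newlo wraps exactly when cylimb > ~newlo, so the compare needs only
the complement, not the sum, and can run in parallel with the add:

```c
#include <assert.h>
#include <stdint.h>

/* One carry-propagation step of mpn_mul_1: folds carry limb `cylimb' into
   the low product word *newlo_io and returns the next carry limb. */
static uint64_t mul1_carry_step(uint64_t *newlo_io, uint64_t cylimb,
                                uint64_t newhi)
{
  uint64_t newlo = *newlo_io;
  uint64_t notnewlo = ~newlo;          /* not   newlo,notnewlo          */
  uint64_t cyout = cylimb > notnewlo;  /* compare, independent of the + */
  newlo = cylimb + newlo;              /* addq  cylimb,newlo,newlo      */
  *newlo_io = newlo;
  return newhi + cyout;                /* addq  cyout,newhi,cylimb      */
}
```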
Last, we have mpn_addmul_1.  Almost certainly, we will get down to 3
cycles/limb, which would be absolutely awesome.

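For reference, the contract of mpn_addmul_1 (rp[] += up[] * v, returning the
out carry) can be written in plain C.  This sketch uses the GCC/Clang
unsigned __int128 extension in place of the mulq/umulh pair, and is only a
model of the function's semantics, not the optimized code discussed here:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Plain-C reference for mpn_addmul_1 (hypothetical name): adds up[]*v into
   rp[] and returns the most significant carry limb. */
static uint64_t addmul_1_ref(uint64_t *rp, const uint64_t *up,
                             size_t n, uint64_t v)
{
  uint64_t cy = 0;
  for (size_t i = 0; i < n; i++)
    {
      unsigned __int128 p = (unsigned __int128) up[i] * v; /* mulq+umulh */
      uint64_t lo = (uint64_t) p + cy;
      cy = (uint64_t) (p >> 64) + (lo < (uint64_t) p);     /* carry limb */
      uint64_t s = rp[i] + lo;
      cy += s < lo;                                        /* add into rp */
      rp[i] = s;
    }
  return cy;
}
```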
Old, perhaps obsolete addmul_1 dependency diagram (needs a 175-column-wide
screen):

i
s
s i
u n
e s
d t
r
i u
l n c
i s t
v t i
e r o
u n
v c
a t t
l i y
u o p
e n e
s s s
issue
in
cycle
-1 ldq
/ \
0 | \
| \
1 | |
| |
2 | | ldq
| | / \
3 | mulq | \
| \ | \
4 umulh \ | |
| | | |
5 | | | | ldq
| | | | / \
4calm 6 | | ldq | mulq | \
| | / | \ | \
4casm 7 | | / umulh \ | |
6 | || | | | |
3aal 8 | || | | | | ldq
7 | || | | | | / \
4calm 9 | || | | ldq | mulq | \
9 | || | | / | \ | \
4casm 10 | || | | / umulh \ | |
9 | || | || | | | |
3aal 11 | addq | || | | | | ldq
9 | // \ | || | | | | / \
4calm 12 \ cmpult addq<-cy | || | | ldq | mulq | \
13 \ / // \ | || | | / | \ | \
4casm 13 addq cmpult stq | || | | / umulh \ | |
11 \ / | || | || | | | |
3aal 14 addq | addq | || | | | | ldq
10 \ | // \ | || | | | | / \
4calm 15 cy ----> \ cmpult addq<-cy | || | | ldq | mulq | \
13 \ / // \ | || | | / | \ | \
4casm 16 addq cmpult stq | || | | / umulh \ | |
11 \ / | || | || | | | |
3aal 17 addq | addq | || | | | |
10 \ | // \ | || | | | |
4calm 18 cy ----> \ cmpult addq<-cy | || | | ldq | mulq
13 \ / // \ | || | | / | \
4casm 19 addq cmpult stq | || | | / umulh \
11 \ / | || | || | |
3aal 20 addq | addq | || | |
10 \ | // \ | || | |
4calm 21 cy ----> \ cmpult addq<-cy | || | | ldq
\ / // \ | || | | /
22 addq cmpult stq | || | | /
\ / | || | ||
23 addq | addq | ||
\ | // \ | ||
24 cy ----> \ cmpult addq<-cy | ||
\ / // \ | ||
25 addq cmpult stq | ||
\ / | ||
26 addq | addq
\ | // \
27 cy ----> \ cmpult addq<-cy
\ / // \
28 addq cmpult stq
\ /
addq
\
cy ---->

As many as 6 consecutive points will be under execution simultaneously, or if
we schedule loads even further away, maybe 7 or 8.  But the number of live
quantities is reasonable, and can easily be satisfied.