version 1.1.1.1, 2000/01/10 15:35:26 |
version 1.1.1.2, 2000/09/09 14:12:44 |

                      INTEL PENTIUM P5 MPN SUBROUTINES


This directory contains mpn functions optimized for Intel Pentium (P5,P54)
processors.  The mmx subdirectory has code for Pentium with MMX (P55).


STATUS

                                cycles/limb

        mpn_add_n/sub_n            2.375

        mpn_copyi/copyd            1.0

        mpn_divrem_1               44.0
        mpn_mod_1                  44.0
        mpn_divexact_by3           15.0

        mpn_l/rshift               5.375 normal (6.0 on P54)
                                   1.875 special shift by 1 bit

        mpn_mul_1                  13.0
        mpn_add/submul_1           14.0

        mpn_mul_basecase           14.2 cycles/crossproduct (approx)

        mpn_sqr_basecase           8 cycles/crossproduct (approx)
                                   or 15.5 cycles/triangleproduct (approx)

Pentium MMX gets the following improvements

        mpn_l/rshift               1.75


1. mpn_lshift and mpn_rshift run at about 6 cycles/limb on P5 and P54, but the
documentation indicates that they should take only 43/8 = 5.375 cycles/limb,
or 5 cycles/limb asymptotically.  The P55 runs them at the expected speed.

2. mpn_add_n and mpn_sub_n run at asymptotically 2 cycles/limb.  Due to loop
overhead and other delays (cache refill?), they run at or near 2.5 cycles/limb.

3. mpn_mul_1, mpn_addmul_1, mpn_submul_1 all run 1 cycle faster than they
should.  Intel documentation says a mul instruction is 10 cycles, but it
measures 9 and the routines using it run with it as 9.
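
To make the cycles/limb figures concrete, here is a minimal portable C
sketch of what an add_n and a mul_1 style routine compute per limb.  It is
not the assembly in this directory.  The names limb_t, ref_add_n and
ref_mul_1 are hypothetical, and 32-bit limbs are assumed to match the
Pentium.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;        /* assumption: one limb is 32 bits */

/* rp[0..n-1] = s1[0..n-1] + s2[0..n-1], least significant limb first.
   Returns the carry out of the top limb (0 or 1). */
static limb_t ref_add_n(limb_t *rp, const limb_t *s1, const limb_t *s2,
                        size_t n)
{
    limb_t carry = 0;
    assert(n >= 1);
    for (size_t i = 0; i < n; i++) {
        limb_t a = s1[i];
        limb_t sum = a + s2[i];         /* may wrap around */
        limb_t c1 = sum < a;            /* carry from a + s2[i] */
        limb_t res = sum + carry;
        limb_t c2 = res < sum;          /* carry from adding carry-in */
        rp[i] = res;
        carry = c1 | c2;                /* at most one of them is set */
    }
    return carry;
}

/* rp[0..n-1] = low limbs of s1[0..n-1] * v.
   Returns the most significant limb of the product. */
static limb_t ref_mul_1(limb_t *rp, const limb_t *s1, size_t n, limb_t v)
{
    limb_t carry = 0;
    assert(n >= 1);
    for (size_t i = 0; i < n; i++) {
        /* 32x32 -> 64 bit product plus carry-in cannot overflow 64 bits */
        uint64_t p = (uint64_t)s1[i] * v + carry;
        rp[i] = (limb_t)p;
        carry = (limb_t)(p >> 32);
    }
    return carry;
}
```

Each loop iteration handles one limb, and ref_mul_1 needs one multiply per
limb, which is why the cost of the mul instruction shows up directly in the
mpn_mul_1 figure.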


RELEVANT OPTIMIZATION ISSUES

1. Pentium doesn't allocate cache lines on writes, unlike most other modern
[...]

[...] to different cache banks.  The simplest way to ensure [...]
two words from the same object.  If we make operations on different objects,
they might or might not be to the same cache bank.
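
Since writes do not allocate cache lines on this processor, one common
workaround is to load from the destination before storing to it, so that
the line is already in the cache when the store happens.  The C sketch
below only illustrates that idea and is an assumption about how a loop
could be arranged, not the code in this directory.  The names limb_t and
copy_touch_dst are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;        /* assumption: one limb is 32 bits */

/* Copy n limbs from src to dst, low to high, loading from each
   destination limb before storing to it. */
static void copy_touch_dst(limb_t *dst, const limb_t *src, size_t n)
{
    assert(n >= 1);
    for (size_t i = 0; i < n; i++) {
        /* Volatile load so the compiler does not drop the read.  The
           value is discarded; only the cache side effect is wanted. */
        (void)*(volatile limb_t *)&dst[i];
        dst[i] = src[i];
    }
}
```

Touching every limb is more than is needed, because one load brings in a
whole cache line.  It is done here only to keep the sketch short.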


REFERENCES

"Intel Architecture Optimization Manual", 1997, order number 242816.  This
is mostly about P5, the parts about P6 aren't relevant.  Available on-line:

        http://download.intel.com/design/PentiumII/manuals/242816.htm



----------------
Local variables:
mode: text
fill-column: 76
End: