
Annotation of OpenXM_contrib/gmp/mpn/x86/p6/README, Revision 1.1.1.1

                      INTEL P6 MPN SUBROUTINES



This directory contains code optimized for Intel P6 class CPUs, meaning
PentiumPro, Pentium II and Pentium III.  The mmx and p3mmx subdirectories
have routines using MMX instructions.



STATUS

Times for the loops, with all code and data in L1 cache, are as follows.
Some of these can probably be improved.

                               cycles/limb

       mpn_add_n/sub_n           3.7

       mpn_copyi                 0.75
       mpn_copyd                 2.4

       mpn_divrem_1             39.0
       mpn_mod_1                39.0
       mpn_divexact_by3          8.5

       mpn_mul_1                 5.5
       mpn_addmul/submul_1       6.35

       mpn_l/rshift              2.5

       mpn_mul_basecase          8.2 cycles/crossproduct (approx)
       mpn_sqr_basecase          4.0 cycles/crossproduct (approx)
                                 or 7.75 cycles/triangleproduct (approx)

Pentium II and III have MMX, which improves the following times
(cycles/limb).

       mpn_divrem_1             25.0 integer part, 17.5 fractional part
       mpn_mod_1                24.0

       mpn_l/rshift              1.75

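As a rough illustration of how the figures above combine into cycle
estimates, here is a minimal C sketch.  The function names are hypothetical
and not part of GMP.  The triangle product count n*(n+1)/2 (the products
a[i]*a[j] with i <= j) is an assumption about what "triangleproduct" counts,
and the estimates cover only the loops, not call or setup overhead.

```c
#include <assert.h>

/* Figures copied from the STATUS table (plain P6, no MMX). */
#define MUL_BASECASE_PER_CROSS  8.2    /* cycles per crossproduct */
#define SQR_BASECASE_PER_CROSS  4.0    /* cycles per crossproduct */
#define SQR_BASECASE_PER_TRI    7.75   /* cycles per triangle product */

/* Loop cycles for a routine rated in cycles/limb, run over n limbs. */
static double limb_loop_cycles(double cycles_per_limb, unsigned long n)
{
    assert(cycles_per_limb >= 0.0);
    return cycles_per_limb * (double) n;
}

/* An n by m limb basecase multiply forms n*m crossproducts. */
static double mul_basecase_cycles(unsigned long n, unsigned long m)
{
    return MUL_BASECASE_PER_CROSS * (double) n * (double) m;
}

/* An n limb square, counted as n*n crossproducts. */
static double sqr_basecase_cycles_cross(unsigned long n)
{
    return SQR_BASECASE_PER_CROSS * (double) n * (double) n;
}

/* The same square, counted in triangle products.
   ASSUMPTION: the triangle has n*(n+1)/2 products (diagonal included). */
static double sqr_basecase_cycles_tri(unsigned long n)
{
    return SQR_BASECASE_PER_TRI * (double) (n * (n + 1) / 2);
}
```

Under these assumptions a 20 limb square comes out at 1600 cycles by the
crossproduct figure and 1627.5 cycles by the triangle figure, so the two
ways of stating the mpn_sqr_basecase speed agree to within the stated
approximation.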



NOTES

The L1 data cache is write-allocate, so prefetching of destinations is
unnecessary.

Mispredicted branches have a penalty of between 9 and 15 cycles, and even up
to 26 cycles depending on how far speculative execution has gone.  The 9
cycle minimum penalty comes from the issue pipeline being 9 stages long.

A copy with rep movs seems to copy 16 bytes at a time, since speeds for 4,
5, 6 or 7 limb operations are all the same.  The 0.75 cycles/limb would then
be 3 cycles per 16 byte block of four 4-byte limbs.

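The block arithmetic in the rep movs note can be written out as a small C
sketch.  The names are hypothetical, a 4 byte limb is assumed (32-bit x86),
and treating the number of whole 16 byte blocks as what determines the time
is only one reading of the observation above, not a measured model.

```c
#include <assert.h>

#define LIMB_BYTES   4    /* ASSUMPTION: 32-bit x86 limb */
#define BLOCK_BYTES  16   /* apparent rep movs granularity, per the note */

/* Limbs covered by one 16 byte block. */
static unsigned long limbs_per_block(void)
{
    assert(BLOCK_BYTES % LIMB_BYTES == 0);
    return BLOCK_BYTES / LIMB_BYTES;
}

/* Cycles per block implied by a cycles/limb figure. */
static double cycles_per_block(double cycles_per_limb)
{
    return cycles_per_limb * (double) limbs_per_block();
}

/* Whole blocks in an n limb copy.  Sizes sharing the same count (such as
   4, 5, 6 and 7 limbs) would be expected to take the same time. */
static unsigned long whole_blocks(unsigned long n)
{
    return n / limbs_per_block();
}
```

With the 0.75 cycles/limb figure for mpn_copyi this gives 3 cycles per
block, and 4, 5, 6 and 7 limb copies all contain exactly one whole block.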



CODING

Instructions in general code are shown grouped when they can execute
together, which means up to three instructions with no dependencies between
successive instructions, and with only the first being a multiple micro-op
instruction.

P6 has out-of-order execution, so the groupings really only show dependent
paths where some reordering might allow latencies to be hidden.




REFERENCES

"Intel Architecture Optimization Reference Manual", 1999, revision 001 dated
02/99, order number 245127 (order number 730795-001 is in the document too).
Available on-line:

       http://download.intel.com/design/PentiumII/manuals/245127.htm

"Intel Architecture Optimization Manual", 1997, order number 242816.  This
is an older document mostly about P5 and not as good as the above.
Available on-line:

       http://download.intel.com/design/PentiumII/manuals/242816.htm



----------------
Local variables:
mode: text
fill-column: 76
End:
