Annotation of OpenXM_contrib/gmp/mpn/x86/p6/README, Revision 1.1
1.1 ! maekawa 1:
! 2: INTEL P6 MPN SUBROUTINES
! 3:
! 4:
! 5:
! 6: This directory contains code optimized for Intel P6 class CPUs, meaning
! 7: PentiumPro, Pentium II and Pentium III. The mmx and p3mmx subdirectories
! 8: have routines using MMX instructions.
! 9:
! 10:
! 11:
! 12: STATUS
! 13:
! 14: Times for the loops, with all code and data in L1 cache, are as follows.
! 15: Some of these could probably be improved.
! 16:
! 17:                          cycles/limb
! 18:
! 19: mpn_add_n/sub_n          3.7
! 20:
! 21: mpn_copyi                0.75
! 22: mpn_copyd                2.4
! 23:
! 24: mpn_divrem_1             39.0
! 25: mpn_mod_1                39.0
! 26: mpn_divexact_by3         8.5
! 27:
! 28: mpn_mul_1                5.5
! 29: mpn_addmul/submul_1      6.35
! 30:
! 31: mpn_l/rshift             2.5
! 32:
! 33: mpn_mul_basecase         8.2 cycles/crossproduct (approx)
! 34: mpn_sqr_basecase         4.0 cycles/crossproduct (approx)
! 35:                       or 7.75 cycles/triangleproduct (approx)
! 36:
! 37: Pentium II and III have MMX and get the following improvements.
! 38:
! 39: mpn_divrem_1             25.0 integer part, 17.5 fractional part
! 40: mpn_mod_1                24.0
! 41:
! 42: mpn_l/rshift             1.75
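
The figures above can be turned into rough loop-only cycle estimates. The
following is a minimal sketch in C, not part of GMP: the helper names are
hypothetical, call overhead and cache misses are ignored, and the
crossproduct count for mpn_sqr_basecase is assumed to be n*n.

```c
#include <stddef.h>

/* Figures copied from the table above (plain P6, no MMX). */
#define P6_ADD_N_CPL        3.7   /* cycles/limb, mpn_add_n/sub_n */
#define P6_MOD_1_CPL        39.0  /* cycles/limb, mpn_mod_1 */
#define P6_MOD_1_MMX_CPL    24.0  /* cycles/limb, mpn_mod_1 with MMX */
#define P6_MUL_BASECASE_CPX 8.2   /* cycles/crossproduct (approx) */
#define P6_SQR_BASECASE_CPX 4.0   /* cycles/crossproduct (approx) */

/* Hypothetical helper: loop-only estimate for a routine whose cost is
   linear in the number of limbs. */
static double p6_est_linear(double cycles_per_limb, size_t n)
{
    return cycles_per_limb * (double) n;
}

/* Hypothetical helper: an xn by yn basecase multiply forms one
   crossproduct for every pair of limbs. */
static double p6_est_mul_basecase(size_t xn, size_t yn)
{
    return P6_MUL_BASECASE_CPX * (double) xn * (double) yn;
}

/* Hypothetical helper: assumes the crossproduct count for an n-limb
   square is n*n, so the lower figure reflects work saved by symmetry. */
static double p6_est_sqr_basecase(size_t n)
{
    return P6_SQR_BASECASE_CPX * (double) n * (double) n;
}
```

Under these assumptions a 10x10 limb mpn_mul_basecase comes to about 820
cycles and a 10-limb mpn_sqr_basecase to about 400, while a 100-limb
mpn_mod_1 drops from about 3900 cycles to about 2400 with MMX.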
! 43:
! 44:
! 45:
! 46:
! 47: NOTES
! 48:
! 49: The L1 data cache is write-allocate, so prefetching destinations is
! 49: unnecessary.
! 50:
! 51: Mispredicted branches have a penalty of between 9 and 15 cycles, and even up
! 52: to 26 cycles depending on how far speculative execution has gone. The
! 53: 9-cycle minimum penalty comes from the issue pipeline being 9 stages long.
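
With a penalty of this size, a data-dependent branch inside a per-limb loop
can cost more than the arithmetic itself. As an illustration only, here is a
C sketch of limb addition in which the carry is computed by comparison
rather than by a conditional jump. The names limb_t and sketch_add_n are
hypothetical, this is not GMP's P6 code (which is assembly and uses the
carry flag), and whether a compiler emits branch-free code for it is not
guaranteed.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* hypothetical name; 32-bit limbs as on x86 */

/* Add two n-limb numbers stored least significant limb first.
   Writes the n-limb sum to rp and returns the carry out (0 or 1). */
static limb_t sketch_add_n(limb_t *rp, const limb_t *ap, const limb_t *bp,
                           size_t n)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        limb_t a = ap[i];
        limb_t s = a + bp[i];      /* wraps modulo 2^32 */
        limb_t c1 = (s < a);       /* 1 if a + b overflowed */
        limb_t r = s + carry;
        limb_t c2 = (r < s);       /* 1 if adding the carry-in overflowed */
        rp[i] = r;
        carry = c1 | c2;           /* at most one of c1, c2 can be set */
    }
    return carry;
}
```

The only branch left is the loop condition, which is taken on every
iteration except the last and so is normally predicted well.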
! 54:
! 55: A copy with rep movs seems to copy 16 bytes at a time, since speeds for 4,
! 56: 5, 6 or 7 limb operations are all the same. The 0.75 cycles/limb would be 3
! 57: cycles per 16 byte block.
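
The arithmetic behind that figure can be spelled out as a short C sketch.
The helper names are hypothetical, and the 16-byte granularity is the
inference made above from timings, not a documented property of rep movs.

```c
#include <stddef.h>

enum {
    BYTES_PER_LIMB = 4,    /* 32-bit limbs on x86 */
    BLOCK_BYTES    = 16    /* apparent rep movs granularity (inferred) */
};

/* Number of whole 16-byte blocks contained in an n-limb copy. */
static size_t full_blocks(size_t n_limbs)
{
    return (n_limbs * BYTES_PER_LIMB) / BLOCK_BYTES;
}

/* Convert a cycles/limb figure into cycles per 16-byte block. */
static double cycles_per_block(double cycles_per_limb)
{
    return cycles_per_limb * (BLOCK_BYTES / BYTES_PER_LIMB);
}
```

Copies of 4, 5, 6 and 7 limbs all contain exactly one whole block, which is
consistent with their identical timings, and 0.75 cycles/limb times 4 limbs
per block gives the 3 cycles per block quoted above.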
! 58:
! 59:
! 60:
! 61:
! 62: CODING
! 63:
! 64: Instructions in general code are shown grouped when they can execute
! 65: together, which means up to three instructions, none depending on an
! 66: earlier one in the group, and only the first being a multiple micro-op.
! 67:
! 68: P6 has out-of-order execution, so the groupings really only show
! 69: dependent paths where some shuffling might allow some latencies to be
! 70: hidden.
! 71:
! 72:
! 73:
! 74:
! 75: REFERENCES
! 76:
! 77: "Intel Architecture Optimization Reference Manual", 1999, revision 001 dated
! 78: 02/99, order number 245127 (order number 730795-001 is in the document too).
! 79: Available on-line:
! 80:
! 81: http://download.intel.com/design/PentiumII/manuals/245127.htm
! 82:
! 83: "Intel Architecture Optimization Manual", 1997, order number 242816. This
! 84: is an older document mostly about P5 and not as good as the above.
! 85: Available on-line:
! 86:
! 87: http://download.intel.com/design/PentiumII/manuals/242816.htm
! 88:
! 89:
! 90:
! 91: ----------------
! 92: Local variables:
! 93: mode: text
! 94: fill-column: 76
! 95: End: