Annotation of OpenXM_contrib/gmp/mpn/x86/p6/README, Revision 1.1.1.1
1.1 maekawa 1:
2: INTEL P6 MPN SUBROUTINES
3:
4:
5:
6: This directory contains code optimized for Intel P6 class CPUs, meaning
7: PentiumPro, Pentium II and Pentium III. The mmx and p3mmx subdirectories
8: have routines using MMX instructions.
9:
10:
11:
12: STATUS
13:
14: Times for the loops, with all code and data in L1 cache, are as follows.
15: Some of these can probably be improved.
16:
17:                             cycles/limb
18:
19:     mpn_add_n/sub_n          3.7
20:
21:     mpn_copyi                0.75
22:     mpn_copyd                2.4
23:
24:     mpn_divrem_1            39.0
25:     mpn_mod_1               39.0
26:     mpn_divexact_by3         8.5
27:
28:     mpn_mul_1                5.5
29:     mpn_addmul/submul_1      6.35
30:
31:     mpn_l/rshift             2.5
32:
33:     mpn_mul_basecase         8.2  cycles/crossproduct (approx)
34:     mpn_sqr_basecase         4.0  cycles/crossproduct (approx)
35:                          or  7.75 cycles/triangleproduct (approx)
36:
37: Pentium II and III have MMX and get the following improvements.
38:
39:     mpn_divrem_1            25.0  integer part, 17.5 fractional part
40:     mpn_mod_1               24.0
41:
42:     mpn_l/rshift             1.75
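
As a rough illustration of how the tables can be read, the sketch below
multiplies a cycles/limb figure by an operand size to estimate loop time.
The name estimate_cycles is hypothetical and not part of GMP. The estimate
assumes all code and data are in L1 cache, as the timings above do, and it
ignores call and loop setup overhead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of GMP: estimate the loop time in cycles
   for an n-limb operand from a cycles/limb figure taken from the tables.
   Call overhead and loop setup are deliberately ignored.  */
double
estimate_cycles (double cycles_per_limb, size_t n)
{
  assert (cycles_per_limb > 0.0);
  return cycles_per_limb * (double) n;
}
```

For example, a 100 limb shift would come out at about 250 cycles with the
plain code (2.5 cycles/limb) and about 175 cycles with the MMX code (1.75
cycles/limb).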
43:
44:
45:
46:
47: NOTES
48:
49: The L1 data cache is write-allocate, so prefetching destinations is unnecessary.
50:
51: Mispredicted branches have a penalty of between 9 and 15 cycles, and even up
52: to 26 cycles depending on how far speculative execution has gone. The 9 cycle
53: minimum penalty comes from the issue pipeline being 9 stages long.
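
To put the penalty in per-iteration terms, the following sketch computes
the average number of cycles lost per execution of a branch. It is a
simplified model rather than a measurement: the name
expected_branch_penalty and the idea of a fixed misprediction rate are
assumptions made for illustration.

```c
#include <assert.h>

/* Hypothetical model: average cycles lost per execution of a branch,
   given the fraction of executions that are mispredicted (0.0 to 1.0)
   and the penalty in cycles for each misprediction.  */
double
expected_branch_penalty (double mispredict_rate, double penalty_cycles)
{
  assert (mispredict_rate >= 0.0 && mispredict_rate <= 1.0);
  assert (penalty_cycles >= 0.0);
  return mispredict_rate * penalty_cycles;
}
```

Under this model a branch mispredicted one time in four loses between
2.25 and 3.75 cycles per execution on average at the 9 to 15 cycle
penalties, and 6.5 cycles at the 26 cycle worst case.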
54:
55: A copy with rep movs seems to copy 16 bytes at a time, since speeds for 4,
56: 5, 6 or 7 limb operations are all the same. With 4 byte limbs, the 0.75
57: cycles/limb would be 3 cycles per 16 byte (4 limb) block.
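
The block arithmetic can be sketched as below. The name cycles_per_block
and the LIMB_BYTES constant are hypothetical. The 4 byte limb size is an
assumption that matches 32-bit x86.

```c
#include <assert.h>

/* Assumed limb size in bytes for 32-bit x86.  */
#define LIMB_BYTES 4

/* Hypothetical helper: convert a cycles/limb figure into cycles per
   block of block_bytes bytes, where the block holds a whole number of
   limbs.  */
double
cycles_per_block (double cycles_per_limb, unsigned block_bytes)
{
  assert (block_bytes % LIMB_BYTES == 0);
  return cycles_per_limb * (double) (block_bytes / LIMB_BYTES);
}
```

With the mpn_copyi figure of 0.75 cycles/limb and a 16 byte block, this
gives 3 cycles per block.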
58:
59:
60:
61:
62: CODING
63:
64: Instructions in general code are shown grouped if they can execute
65: together, which means up to three instructions with no successive
66: dependencies, and with only the first being a multiple micro-op instruction.
67:
68: P6 has out-of-order execution, so the groupings really only show
69: dependent paths where some shuffling might allow some latencies to be
70: hidden.
71:
72:
73:
74:
75: REFERENCES
76:
77: "Intel Architecture Optimization Reference Manual", 1999, revision 001 dated
78: 02/99, order number 245127 (order number 730795-001 is in the document too).
79: Available on-line:
80:
81: http://download.intel.com/design/PentiumII/manuals/245127.htm
82:
83: "Intel Architecture Optimization Manual", 1997, order number 242816. This
84: is an older document mostly about P5 and not as good as the above.
85: Available on-line:
86:
87: http://download.intel.com/design/PentiumII/manuals/242816.htm
88:
89:
90:
91: ----------------
92: Local variables:
93: mode: text
94: fill-column: 76
95: End: