Annotation of OpenXM_contrib/gmp/mpn/x86/p6/README, Revision 1.1.1.2
1.1.1.2 ! ohara 1: Copyright 2000, 2001 Free Software Foundation, Inc.
! 2:
! 3: This file is part of the GNU MP Library.
! 4:
! 5: The GNU MP Library is free software; you can redistribute it and/or modify
! 6: it under the terms of the GNU Lesser General Public License as published by
! 7: the Free Software Foundation; either version 2.1 of the License, or (at your
! 8: option) any later version.
! 9:
! 10: The GNU MP Library is distributed in the hope that it will be useful, but
! 11: WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
! 12: or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public
! 13: License for more details.
! 14:
! 15: You should have received a copy of the GNU Lesser General Public License
! 16: along with the GNU MP Library; see the file COPYING.LIB. If not, write to
! 17: the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
! 18: 02111-1307, USA.
! 19:
! 20:
! 21:
! 22:
1.1 maekawa 23:
24: INTEL P6 MPN SUBROUTINES
25:
26:
27:
28: This directory contains code optimized for Intel P6 class CPUs, meaning
29: PentiumPro, Pentium II and Pentium III. The mmx and p3mmx subdirectories
30: have routines using MMX instructions.
31:
32:
33:
34: STATUS
35:
36: Times for the loops, with all code and data in L1 cache, are as follows.
37: Some of these could probably be improved.
38:
39: cycles/limb
40:
41: mpn_add_n/sub_n 3.7
42:
43: mpn_copyi 0.75
1.1.1.2 ! ohara 44: mpn_copyd 1.75 (or 0.75 if no overlap)
1.1 maekawa 45:
46: mpn_divrem_1 39.0
1.1.1.2 ! ohara 47: mpn_mod_1 21.5
1.1 maekawa 48: mpn_divexact_by3 8.5
49:
50: mpn_mul_1 5.5
51: mpn_addmul/submul_1 6.35
52:
53: mpn_l/rshift 2.5
54:
55: mpn_mul_basecase 8.2 cycles/crossproduct (approx)
56: mpn_sqr_basecase 4.0 cycles/crossproduct (approx)
57: or 7.75 cycles/triangleproduct (approx)
58:
59: Pentium II and III have MMX and get the following improvements.
60:
61: mpn_divrem_1 25.0 integer part, 17.5 fractional part
62:
63: mpn_l/rshift 1.75
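
The figures above can be turned into rough loop-time estimates.  The
following is a minimal sketch, not part of GMP: the helper names are
hypothetical, the constants are copied from the table, and it assumes all
code and data are in L1 cache and ignores call and loop setup overhead.  It
also assumes the schoolbook basecase multiply forms one crossproduct per
pair of limbs, i.e. xsize*ysize crossproducts.

```c
#include <stddef.h>

/* Figures copied from the STATUS table (plain P6, no MMX). */
#define P6_ADD_N_CYCLES_PER_LIMB          3.7
#define P6_MUL_BASECASE_CYCLES_PER_CROSS  8.2   /* approximate */

/* Hypothetical helper: estimated loop cycles for mpn_add_n or mpn_sub_n
   on n limbs, assuming everything is in L1 cache. */
double
p6_est_add_n_cycles (size_t n)
{
  return P6_ADD_N_CYCLES_PER_LIMB * (double) n;
}

/* Hypothetical helper: estimated cycles for mpn_mul_basecase, assuming
   one crossproduct per (x limb, y limb) pair. */
double
p6_est_mul_basecase_cycles (size_t xsize, size_t ysize)
{
  return P6_MUL_BASECASE_CYCLES_PER_CROSS * (double) xsize * (double) ysize;
}
```

For example, a 10x10 limb basecase multiply would be estimated at about 820
cycles under these assumptions.  Real timings should be measured rather
than predicted this way.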
64:
65:
66:
67:
68: NOTES
69:
70: The L1 data cache is write-allocate, so prefetching of destinations is
unnecessary.
71:
72: Mispredicted branches have a penalty of between 9 and 15 cycles, and even up
73: to 26 cycles depending on how far speculative execution has gone. The 9
74: cycle minimum penalty comes from the issue pipeline being 9 stages.
75:
76: A copy with rep movs seems to copy 16 bytes at a time, since speeds for 4,
77: 5, 6 or 7 limb operations are all the same. The 0.75 cycles/limb would be 3
78: cycles per 16 byte block.
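
The arithmetic behind that figure can be sketched as follows.  This is only
an illustration: the names are hypothetical, a limb is taken to be 4 bytes
(32-bit x86), and the whole-block model is an assumption inferred from the
observation that 4, 5, 6 and 7 limb copies take the same time.

```c
/* Assumptions: 4-byte limbs on 32-bit x86, 16-byte copy blocks. */
#define LIMB_BYTES   4
#define BLOCK_BYTES  16

/* Convert a cycles/limb figure into cycles per 16-byte block. */
double
cycles_per_block (double cycles_per_limb)
{
  return cycles_per_limb * (BLOCK_BYTES / LIMB_BYTES);
}

/* Assumed model: number of whole 16-byte blocks in a copy of the given
   number of limbs.  Sizes 4 to 7 all contain exactly one whole block. */
unsigned long
whole_blocks (unsigned long limbs)
{
  return limbs * LIMB_BYTES / BLOCK_BYTES;
}
```

With the measured 0.75 cycles/limb, cycles_per_block gives 3 cycles per
block, matching the figure quoted above.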
79:
80:
81:
82:
83: CODING
84:
85: Instructions in general code have been shown grouped if they can execute
86: together, which means up to three instructions with no successive
87: dependencies, and with only the first allowed to be a multiple micro-op
instruction.
88:
89: P6 has out-of-order execution, so the groupings are really only showing
90: dependent paths where some shuffling might allow some latencies to be
91: hidden.
92:
93:
94:
95:
96: REFERENCES
97:
98: "Intel Architecture Optimization Reference Manual", 1999, revision 001 dated
99: 02/99, order number 245127 (order number 730795-001 is in the document too).
100: Available on-line:
101:
102: http://download.intel.com/design/PentiumII/manuals/245127.htm
103:
104: "Intel Architecture Optimization Manual", 1997, order number 242816. This
105: is an older document mostly about P5 and not as good as the above.
106: Available on-line:
107:
108: http://download.intel.com/design/PentiumII/manuals/242816.htm
109:
110:
111:
112: ----------------
113: Local variables:
114: mode: text
115: fill-column: 76
116: End: