Annotation of OpenXM_contrib/gmp/mpn/x86/pentium4/README, Revision 1.1.1.1
Copyright 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 2.1 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library; see the file COPYING.LIB. If not, write to
the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
02111-1307, USA.



INTEL PENTIUM-4 MPN SUBROUTINES


This directory contains mpn functions optimized for Intel Pentium-4.

The mmx subdirectory has routines using MMX instructions, and the sse2
subdirectory has routines using SSE2 instructions. All P4s have both; the
separate directories are just so configure can omit that code if the
assembler doesn't support it.


STATUS

routine                 cycles/limb

mpn_add_n/sub_n         4 normal, 6 in-place

mpn_mul_1               4 normal, 6 in-place
mpn_addmul_1            6
mpn_submul_1            7

mpn_mul_basecase        6 cycles/crossproduct (approx)

mpn_sqr_basecase        3.5 cycles/crossproduct (approx)
                        or 7.0 cycles/triangleproduct (approx)

mpn_l/rshift            1.75


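As a rough illustration of how to read the table, the figures can be turned
into cycle estimates for a given operand size. The sketch below is not part
of GMP; the helper names are hypothetical, and it assumes that an n-limb
basecase multiply or square is counted as n*n crossproducts (so a square has
roughly n*n/2 triangle products, which is how 3.5 and 7.0 would describe the
same speed). Loop overhead and call setup are ignored.

```c
#include <assert.h>

/* Hypothetical helpers, not GMP functions.  The per-unit costs are the
   approximate figures from the STATUS table. */

/* Linear routines (add_n, mul_1, addmul_1, shifts): cost * limbs. */
double est_linear_cycles(double cycles_per_limb, unsigned long n)
{
    assert(n >= 1);
    return cycles_per_limb * (double)n;
}

/* mpn_mul_basecase on two n-limb operands: assumed n*n crossproducts
   at about 6 cycles each. */
double est_mul_basecase_cycles(unsigned long n)
{
    assert(n >= 1);
    return 6.0 * (double)n * (double)n;
}

/* mpn_sqr_basecase on an n-limb operand: assumed n*n crossproducts
   at about 3.5 cycles each. */
double est_sqr_basecase_cycles(unsigned long n)
{
    assert(n >= 1);
    return 3.5 * (double)n * (double)n;
}
```

For example, under these assumptions a 10-limb multiply would be estimated
at about 600 cycles and a 10-limb square at about 350.
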
The shifts ought to be able to go at 1.5 c/l, but not much effort has been
applied to them yet.

In-place operations, and all addmul, submul, mul_basecase and sqr_basecase
calls, suffer from pipeline anomalies associated with write combining and
movd reads and writes to the same or nearby locations. The movq
instructions do not trigger the same hardware problems. Unfortunately,
using movq and splitting/combining seems to require too many extra
instructions to help. Perhaps future chip steppings will be better.



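A plausible reason addmul and submul are always affected is that they read
each destination limb and write it back, so every call behaves like an
in-place operation. The portable C sketch below shows that access pattern
for 32-bit limbs. It is only an illustration: addmul_1_sketch is a
hypothetical name, and the real routines are assembler, not this loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference loop, not GMP's code.
   Computes rp[0..n-1] += up[0..n-1] * v and returns the carry-out limb. */
uint32_t addmul_1_sketch(uint32_t *rp, const uint32_t *up, size_t n,
                         uint32_t v)
{
    uint32_t carry = 0;

    assert(n >= 1);
    for (size_t i = 0; i < n; i++) {
        /* (2^32-1)^2 + 2*(2^32-1) = 2^64-1, so this cannot overflow. */
        uint64_t t = (uint64_t)up[i] * v + rp[i] + carry;

        rp[i] = (uint32_t)t;            /* write back to the limb just read */
        carry = (uint32_t)(t >> 32);    /* high half feeds the next limb */
    }
    return carry;
}
```

Each iteration loads rp[i] and stores to the same address a few instructions
later, which is the read-then-write-nearby pattern described above.
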
NOTES

The Pentium-4 pipeline, "Netburst", provides quite a number of surprises.
Many traditional x86 instructions run very slowly, requiring use of
alternative instructions for acceptable performance.

adcl and sbbl are quite slow at 8 cycles for reg->reg. paddq of 32 bits
within a 64-bit mmx register seems better, though the combination
paddq/psrlq when propagating a carry still has a 4 cycle latency.

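The paddq/psrlq idea can be sketched in portable C, with a uint64_t standing
in for the 64-bit mmx register: 32-bit limbs are added in 64 bits, the low
half is stored, and a shift right by 32 leaves the carry ready for the next
limb. This is an illustration under those assumptions, not the assembler in
this directory, and add_n_sketch is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not GMP's code.
   Computes rp[] = ap[] + bp[] over n 32-bit limbs and returns the
   carry out (0 or 1) without using the carry flag. */
uint32_t add_n_sketch(uint32_t *rp, const uint32_t *ap, const uint32_t *bp,
                      size_t n)
{
    uint64_t acc = 0;   /* stands in for a 64-bit mmx register */

    assert(n >= 1);
    for (size_t i = 0; i < n; i++) {
        acc += (uint64_t)ap[i] + bp[i];  /* like paddq: 64-bit add */
        rp[i] = (uint32_t)acc;           /* like movd: store low 32 bits */
        acc >>= 32;                      /* like psrlq $32: carry to bit 0 */
    }
    return (uint32_t)acc;
}
```

The add and the shift form the dependent chain from one limb to the next,
which is where the 4 cycle carry latency mentioned above comes from.
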
incl and decl should be avoided; use add $1 and sub $1 instead. Apparently
the carry flag is not separately renamed, so incl and decl depend on all
previous flags-setting instructions.

shll and shrl have a 4 cycle latency, or 8 times the latency of the fastest
integer instructions (addl, subl, orl, andl, and some more). shldl and
shrdl seem to have 13 and 15 cycles latency, respectively. Bizarre.

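What shldl computes can also be obtained from a single 64-bit shift of two
adjacent limbs, which is the kind of operation psllq and psrlq perform on an
mmx register. The C sketch below shows a left shift written that way. It is
an assumption that the mmx shift routines work along these lines; the name
lshift_sketch is hypothetical and the code is only meant to show the data
flow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not GMP's code.
   Shifts the n-limb number at up left by cnt bits (1 <= cnt <= 31),
   stores the result at rp and returns the bits shifted out of the top. */
uint32_t lshift_sketch(uint32_t *rp, const uint32_t *up, size_t n,
                       unsigned cnt)
{
    assert(n >= 1 && cnt >= 1 && cnt <= 31);

    uint32_t out = up[n - 1] >> (32 - cnt);

    /* Work from the high limb down so that rp == up is allowed. */
    for (size_t i = n - 1; i > 0; i--) {
        /* Two adjacent limbs in one 64-bit value, shifted together.
           The high half equals (up[i] << cnt) | (up[i-1] >> (32 - cnt)),
           the value shldl would produce. */
        uint64_t pair = ((uint64_t)up[i] << 32) | up[i - 1];

        rp[i] = (uint32_t)((pair << cnt) >> 32);
    }
    rp[0] = up[0] << cnt;
    return out;
}
```
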
movq mmx -> mmx does have 6 cycle latency, as noted in the documentation.
pxor/por or similar combination at 2 cycles latency can be used instead.
The movq however executes in the float unit, thereby saving MMX execution
resources. With the right juggling, data moves shouldn't be on a dependent
chain.

L1 is write-through, but the write-combining sounds like it does enough to
not require explicit destination prefetching.

xmm registers so far haven't found a use, but not much effort has been
expended. A configure test for whether the operating system knows
fxsave/fxrstor will be needed if they're used.



REFERENCES

Intel Pentium-4 processor manuals,

http://developer.intel.com/design/pentium4/manuals

"Intel Pentium 4 Processor Optimization Reference Manual", Intel, 2001,
order number 248966. Available on-line:

http://developer.intel.com/design/pentium4/manuals/248966.htm



----------------
Local variables:
mode: text
fill-column: 76
End: