Annotation of OpenXM_contrib/gmp/mpn/x86/k6/README, Revision 1.1.1.2
Copyright 2000, 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 2.1 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
02111-1307, USA.




                      AMD K6 MPN SUBROUTINES



This directory contains code optimized for AMD K6 CPUs, meaning K6, K6-2 and
K6-3.

The mmx subdirectory has MMX code suiting plain K6, the k62mmx subdirectory
has MMX code suiting K6-2 and K6-3.  All chips in the K6 family have MMX;
the separate directories exist only so that ./configure can omit them if
the assembler doesn't support MMX.




STATUS

Times for the loops, with all code and data in L1 cache, are as follows.

                                 cycles/limb

        mpn_add_n/sub_n            3.25 normal, 2.75 in-place

        mpn_mul_1                  6.25
        mpn_add/submul_1           7.65-8.4  (varying with data values)

        mpn_mul_basecase           9.25 cycles/crossproduct (approx)
        mpn_sqr_basecase           4.7  cycles/crossproduct (approx)
                                or 9.2  cycles/triangleproduct (approx)

        mpn_l/rshift               3.0

        mpn_divrem_1              20.0
        mpn_mod_1                 20.0
        mpn_divexact_by3          11.0

        mpn_copyi                  1.0
        mpn_copyd                  1.0
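
As a rough illustration of how to read the table, the per-limb figures can
be scaled by operand size.  The C sketch below does this for mpn_add_n and
mpn_mul_basecase.  It assumes everything stays in L1 cache, ignores call and
loop setup overhead, and assumes an xsize by ysize multiply forms
xsize*ysize crossproducts.  The function and macro names are invented here
and are not part of GMP.  Results are in hundredths of a cycle so the
arithmetic stays in integers.

```c
#include <stdint.h>

/* Rates copied from the STATUS table, scaled by 100. */
#define ADD_N_NORMAL_X100   325u   /* 3.25 cycles/limb */
#define ADD_N_INPLACE_X100  275u   /* 2.75 cycles/limb */
#define MUL_BASECASE_X100   925u   /* 9.25 cycles/crossproduct (approx) */

/* Hypothetical helper: estimated cost of mpn_add_n on "size" limbs. */
uint64_t est_add_n_x100(uint64_t size, int in_place)
{
    return size * (in_place ? ADD_N_INPLACE_X100 : ADD_N_NORMAL_X100);
}

/* Hypothetical helper: estimated cost of mpn_mul_basecase, assuming
   xsize*ysize crossproducts. */
uint64_t est_mul_basecase_x100(uint64_t xsize, uint64_t ysize)
{
    return xsize * ysize * MUL_BASECASE_X100;
}
```

On these figures a 100 limb mpn_add_n would come to about 325 cycles, and a
10x10 limb mpn_mul_basecase to about 925 cycles.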


K6-2 and K6-3 have dual-issue MMX and get the following improvements.

        mpn_l/rshift               1.75


Prefetching of sources hasn't yet given any joy.  With the 3DNow "prefetch"
instruction, code seems to run slower, and with just "mov" loads it doesn't
seem faster.  Results so far are inconsistent.  The K6 does a hardware
prefetch of the second cache line in a sector, so the penalty for not
prefetching in software is reduced.




NOTES

All K6 family chips have MMX, but only K6-2 and K6-3 have 3DNow.

Plain K6 executes MMX instructions only in the X pipe, but K6-2 and K6-3 can
execute them in both X and Y (and in both together).

Branch misprediction penalty is 1 to 4 cycles (Optimization Manual
chapter 6 table 12).

Write-allocate L1 data cache means prefetching of destinations is unnecessary.
Store queue is 7 entries of 64 bits each.

Floating point multiplications can be done in parallel with integer
multiplications, but there doesn't seem to be any way to make use of this.



OPTIMIZATIONS

Unrolled loops are used to reduce looping overhead.  The unrolling is
configurable up to 32 limbs/loop for most routines, up to 64 for some.

Sometimes computed jumps into the unrolling are used to handle sizes not a
multiple of the unrolling.  An attractive feature of this is that times
smoothly increase with operand size, but an indirect jump is about 6 cycles
and the setups about another 6, so whether a computed jump ought to be used
depends on how much faster the unrolled code is than a simple loop.
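
The idea of jumping into the middle of an unrolled loop can be sketched in C
with a switch into a 4-way unrolled loop (Duff's device style).  This only
illustrates the control flow and is not the assembler code in this
directory.  The names limb_t, ADD_LIMB and add_n_unrolled are invented for
the example, 32-bit limbs are assumed, and a C compiler is free to implement
the switch without an indirect jump.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* assumed 32-bit limb, as on x86 */

/* One limb of addition with carry in and carry out. */
#define ADD_LIMB()                              \
    do {                                        \
        limb_t x = *ap++, y = *bp++;            \
        limb_t s = x + y;                       \
        limb_t r = s + carry;                   \
        carry = (limb_t)((s < x) | (r < s));    \
        *dp++ = r;                              \
    } while (0)

/* Add n limbs at ap and bp, store at dp, return the carry out.
   The switch enters the unrolled body part way through, so sizes that
   are not a multiple of 4 need no separate cleanup loop. */
limb_t add_n_unrolled(limb_t *dp, const limb_t *ap, const limb_t *bp,
                      size_t n)
{
    limb_t carry = 0;
    size_t passes = (n + 3) / 4;

    if (n == 0)
        return 0;

    switch (n % 4) {
    case 0: do { ADD_LIMB();
    case 3:      ADD_LIMB();
    case 2:      ADD_LIMB();
    case 1:      ADD_LIMB();
            } while (--passes > 0);
    }
    return carry;
}
```

With n = 5 the switch enters at "case 1", does one limb, then makes one full
pass of four limbs.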

Position independent code is implemented using a call to get eip for
computed jumps.  A ret is always done, rather than an addl $4,%esp or a
popl, so that the CPU return address branch prediction stack stays
synchronised with the actual stack in memory.  Such a call however still
costs 4 to 7 cycles.

Branch prediction, in absence of any history, will guess forward jumps are
not taken and backward jumps are taken.  Where possible it's arranged that
the less likely or less important case is under a taken forward jump.



MMX

Putting emms or femms as late as possible in a routine seems to be fastest.
Perhaps an emms or femms stalls until all outstanding MMX instructions have
completed, so putting it later gives them a chance to complete on their own,
in parallel with other operations (like register popping).

The Optimization Manual chapter 5 recommends using a femms on K6-2 and K6-3
at the start of a routine, in case it's been preceded by x87 floating point
operations.  This isn't done because in gmp programs it's expected that x87
floating point won't be much used and that chances are an mpn routine won't
have been preceded by any x87 code.



CODING

Instructions in general code are shown paired if they can decode and execute
together, meaning two short decode instructions with the second not
depending on the first, only the first using the shifter, no more than one
load, and no more than one store.
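
The pairing conditions above can be written down as a small predicate.  The
following C sketch is only a restatement of those rules, not a model of the
real decoder: struct insn and can_pair are hypothetical, register
dependencies are reduced to simple read/write bitmasks, and the flags
register is ignored.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of one instruction. */
struct insn {
    bool    short_decode;   /* decodes via the short decoders */
    bool    uses_shifter;   /* needs the shifter */
    bool    is_load;        /* reads memory */
    bool    is_store;       /* writes memory */
    uint8_t reads;          /* bitmask of registers read */
    uint8_t writes;         /* bitmask of registers written */
};

/* True if "first" and "second" satisfy the pairing rules listed above. */
bool can_pair(const struct insn *first, const struct insn *second)
{
    if (!first->short_decode || !second->short_decode)
        return false;               /* both must be short decode */
    if (second->reads & first->writes)
        return false;               /* second must not depend on first */
    if (second->uses_shifter)
        return false;               /* only the first may use the shifter */
    if (first->is_load && second->is_load)
        return false;               /* no more than one load */
    if (first->is_store && second->is_store)
        return false;               /* no more than one store */
    return true;
}
```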

K6 does some out of order execution, so the pairings aren't essential; they
just show what slots might be available.  When decoding is the limiting
factor things can be scheduled that might not execute until later.



NOTES

Code alignment

- if an opcode/modrm or 0Fh/opcode/modrm crosses a cache line boundary,
  short decode is inhibited.  The cross.pl script detects this.

- loops and branch targets should be aligned to 16 bytes, or ensure at least
  2 instructions before a 32 byte boundary.  This makes use of the 16 byte
  cache in the BTB.
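
The first alignment condition is simple address arithmetic.  The C sketch
below shows the kind of test involved; it is not the cross.pl logic.  It
assumes 32-byte cache lines, assumes "addr" is the address of the first of
the opcode bytes (the 0Fh byte when present), and ignores any other
prefixes.  LINE_SIZE and decode_bytes_cross_line are names invented here.

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_SIZE 32u   /* assumed cache line size in bytes */

/* True if the opcode/modrm pair, or the 0Fh/opcode/modrm triple when
   has_0f is set, starting at addr straddles a cache line boundary. */
bool decode_bytes_cross_line(uint32_t addr, bool has_0f)
{
    uint32_t modrm_addr = addr + (has_0f ? 2u : 1u);
    return addr / LINE_SIZE != modrm_addr / LINE_SIZE;
}
```

For instance an opcode at offset 31 has its modrm in the next line, and a
0Fh byte at offset 30 puts the modrm at offset 32.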

Addressing modes

- (%esi) degrades decoding from short to vector.  0(%esi) doesn't have this
  problem, and can be used as an equivalent, or easier is just to use a
  different register, like %ebx.

- K6 and pre-CXT core K6-2 have the following problem.  (K6-2 CXT and K6-3
  have it fixed, these being cpuid function 1 signatures 0x588 to 0x58F).

  If more than 3 bytes are needed to determine instruction length then
  decoding degrades from direct to long, or from long to vector.  This
  happens with forms like "0F opcode mod/rm" with mod/rm=00-xxx-100 since
  with mod=00 the sib determines whether there's a displacement.

  This affects all MMX and 3DNow instructions, and others with an 0F prefix,
  like movzbl.  The modes affected are anything with an index and no
  displacement, or an index but no base, and this includes (%esp) which is
  really (,%esp,1).

  The cross.pl script detects problem cases.  The workaround is to always
  use a displacement, and to do this with Zdisp if it's zero so the
  assembler doesn't discard it.

  See Optimization Manual rev D page 67 and 3DNow Porting Guide rev B pages
  13-14 and 36-37.
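
The mod/rm=00-xxx-100 pattern can be tested with a mask.  The C sketch below
checks only the modrm byte itself, using the standard x86 layout (mod in
bits 7-6, reg in bits 5-3, r/m in bits 2-0); whether the instruction also
has an 0F prefix is left to the caller.  The function name is invented here.

```c
#include <stdbool.h>
#include <stdint.h>

/* True for mod=00, r/m=100: a SIB byte follows and, because mod is 00,
   the SIB must be examined to know whether a displacement is present. */
bool modrm_needs_sib_for_length(uint8_t modrm)
{
    return (modrm & 0xC7u) == 0x04u;   /* mask out the reg field */
}
```

For example (%esp) with register field 0 encodes as modrm 0x04 and matches,
while the same operand written with an explicit 8-bit displacement has
mod=01, giving modrm 0x44, which doesn't.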

Calls

- indirect jumps and calls are not branch predicted, they measure about 6
  cycles.

Various

- adcl      2 cycles of decode, maybe 2 cycles executing in the X pipe
- bsf       12-27 cycles
- emms      5 cycles
- femms     3 cycles
- jecxz     2 cycles taken, 13 not taken (optimization manual says 7 not taken)
- divl      20 cycles back-to-back
- imull     2 decode, 3 execute
- mull      2 decode, 3 execute (optimization manual decoding sample)
- prefetch  2 cycles
- rcll/rcrl implicit by one bit: 2 cycles
            immediate or %cl count: 11 + 2 per bit for dword
                                    13 + 4 per bit for byte
- setCC     2 cycles
- xchgl     %eax,reg  1.5 cycles, back-to-back (strange)
            reg,reg   2 cycles, back-to-back
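
The rcl/rcr figures for an immediate or %cl count can be read as a simple
linear formula.  The C sketch below just evaluates it, taking "per bit" to
mean per bit of rotate count, which is an interpretation rather than
something the measurements above spell out.  Only the dword and byte forms
are listed, so other widths return 0.  The function name is invented here.

```c
/* Estimated cycles for rcl/rcr with an immediate or %cl count.
   operand_bits is 32 (dword) or 8 (byte); anything else is not covered
   by the figures above and gives 0. */
unsigned rcl_count_cycles(unsigned operand_bits, unsigned count)
{
    if (operand_bits == 32)
        return 11u + 2u * count;
    if (operand_bits == 8)
        return 13u + 4u * count;
    return 0;
}
```

On this reading a dword rotate by 4 comes to 19 cycles, against 8 cycles for
four of the implicit one-bit forms at 2 cycles each.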




REFERENCES

"AMD-K6 Processor Code Optimization Application Note", AMD publication
number 21924, revision D amendment 0, January 2000.  This describes K6-2 and
K6-3.  Available on-line,

        http://www.amd.com/K6/k6docs/pdf/21924.pdf

"AMD-K6 MMX Enhanced Processor x86 Code Optimization Application Note", AMD
publication number 21828, revision A amendment 0, August 1997.  This is an
older edition of the above document, describing plain K6.  Available
on-line,

        http://www.amd.com/K6/k6docs/pdf/21828.pdf

"3DNow Technology Manual", AMD publication number 21928F/0-August 1999.
This describes the femms and prefetch instructions, but nothing else from
3DNow has been used.  Available on-line,

        http://www.amd.com/K6/k6docs/pdf/21928.pdf

"3DNow Instruction Porting Guide", AMD publication number 22621, revision B,
August 1999.  This has some notes on general K6 optimizations as well as
3DNow.  Available on-line,

        http://www.amd.com/products/cpg/athlon/techdocs/pdf/22621.pdf



----------------
Local variables:
mode: text
fill-column: 76
End: