Annotation of OpenXM_contrib/gmp/mpn/x86/k6/README, Revision 1.1.1.1
1.1 maekawa 1:
2: AMD K6 MPN SUBROUTINES
3:
4:
5:
6: This directory contains code optimized for AMD K6 CPUs, meaning K6, K6-2 and
7: K6-3.
8:
9: The mmx and k62mmx subdirectories have routines using MMX instructions. All
10: K6s have MMX; the separate directories exist only so that ./configure can
11: omit them if the assembler doesn't support MMX.
12:
13:
14:
15:
16: STATUS
17:
18: Times for the loops, with all code and data in L1 cache, are as follows.
19:
20:                             cycles/limb
21:
22:     mpn_add_n/sub_n           3.25 normal, 2.75 in-place
23:
24:     mpn_mul_1                 6.25
25:     mpn_add/submul_1          7.65-8.4  (varying with data values)
26:
27:     mpn_mul_basecase          9.25  cycles/crossproduct (approx)
28:     mpn_sqr_basecase          4.7   cycles/crossproduct (approx)
29:                            or 9.2   cycles/triangleproduct (approx)
30:
31:     mpn_divrem_1              20.0
32:     mpn_mod_1                 20.0
33:     mpn_divexact_by3          11.0
34:
35:     mpn_l/rshift              3.0
36:
37:     mpn_copyi/copyd           1.0
38:
39:     mpn_com_n                 1.5-1.85  \
40:     mpn_and/andn/ior/xor_n    1.5-1.75  | varying with
41:     mpn_iorn/xnor_n           2.0-2.25  | data alignment
42:     mpn_nand/nior_n           2.0-2.25  /
43:
44:     mpn_popcount              12.5
45:     mpn_hamdist               13.0
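
As a rough illustration of how the cycles/limb figures above translate into
running time, the following C sketch multiplies them out. The function names
are hypothetical helpers, not part of GMP, and the estimates assume
everything is in L1 cache and ignore call overhead and loop setup.

```c
#include <stddef.h>

/* Hypothetical helpers, not part of GMP: rough K6 cycle estimates taken
   directly from the cycles/limb figures quoted in this README. */

/* mpn_add_n or mpn_sub_n on n limbs.  A non-zero in_place selects the
   2.75 cycles/limb figure, otherwise the 3.25 "normal" figure is used. */
double k6_est_add_n_cycles(size_t n, int in_place)
{
    return (in_place ? 2.75 : 3.25) * (double) n;
}

/* mpn_mul_basecase on xsize by ysize limbs, at 9.25 cycles for each
   crossproduct (one crossproduct per pair of limbs). */
double k6_est_mul_basecase_cycles(size_t xsize, size_t ysize)
{
    return 9.25 * (double) xsize * (double) ysize;
}
```

For example, under these assumptions a 10 by 10 limb basecase multiply would
come to about 925 cycles.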
46:
47:
48: K6-2 and K6-3 have dual-issue MMX and get the following improvements.
49:
50:     mpn_l/rshift              1.75
51:
52:     mpn_copyi/copyd           0.56 or 1.0  \
53:                                            |
54:     mpn_com_n                 1.0-1.2      | varying with
55:     mpn_and/andn/ior/xor_n    1.2-1.5      | data alignment
56:     mpn_iorn/xnor_n           1.5-2.0      |
57:     mpn_nand/nior_n           1.75-2.0     /
58:
59:     mpn_popcount              9.0
60:     mpn_hamdist               11.5
61:
62:
63: Prefetching of sources hasn't yet given any joy. With the 3DNow "prefetch"
64: instruction, code seems to run slower, and with just "mov" loads it doesn't
65: seem faster. Results so far are inconsistent. The K6 does a hardware
66: prefetch of the second cache line in a sector, so the penalty for not
67: prefetching in software is reduced.
68:
69:
70:
71:
72: NOTES
73:
74: All K6 family chips have MMX, but only K6-2 and K6-3 have 3DNow.
75:
76: Plain K6 executes MMX instructions only in the X pipe, but K6-2 and K6-3 can
77: execute them in both X and Y (and in both at once).
78:
79: Branch misprediction penalty is 1 to 4 cycles (Optimization Manual
80: chapter 6 table 12).
81:
82: Write-allocate L1 data cache means prefetching of destinations is unnecessary.
83: Store queue is 7 entries of 64 bits each.
84:
85: Floating point multiplications can be done in parallel with integer
86: multiplications, but there doesn't seem to be any way to make use of this.
87:
88:
89:
90: OPTIMIZATIONS
91:
92: Unrolled loops are used to reduce looping overhead. The unrolling is
93: configurable up to 32 limbs/loop for most routines, up to 64 for some.
94:
95: Sometimes computed jumps into the unrolling are used to handle sizes not a
96: multiple of the unrolling. An attractive feature of this is that times
97: increase smoothly with operand size. However, an indirect jump is about 6
98: cycles and the setup about another 6, so whether a computed jump ought to be
99: used depends on how much faster the unrolled code is than a simple loop.
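
The idea of jumping into the middle of an unrolled loop can be sketched in C
with a switch that enters the loop body part-way through (Duff's device).
This is only an analogue of the technique, not the assembler the routines
actually use; limb_t and xor_n_unrolled are hypothetical names, and the
unrolling factor of 4 is chosen just to keep the example short.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* hypothetical stand-in for a 32-bit limb */

/* dst[i] = a[i] ^ b[i] for i in 0..n-1, unrolled 4 times.  The switch
   jumps to the point in the unrolled body that leaves exactly n % 4
   operations in the first pass, so no separate cleanup loop is needed. */
void xor_n_unrolled(limb_t *dst, const limb_t *a, const limb_t *b, size_t n)
{
    size_t passes;

    if (n == 0)
        return;

    passes = (n + 3) / 4;          /* passes through the loop, first one partial */
    switch (n % 4) {
    case 0: do { *dst++ = *a++ ^ *b++;
    case 3:      *dst++ = *a++ ^ *b++;
    case 2:      *dst++ = *a++ ^ *b++;
    case 1:      *dst++ = *a++ ^ *b++;
            } while (--passes > 0);
    }
}
```

With n = 5 the switch enters at "case 1", does one limb, then makes one full
pass of four. The cost of the switch dispatch corresponds to the indirect
jump and setup overhead discussed above.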
100:
101: For computed jumps, position independent code is implemented using a call
102: to get eip. A ret is always done, rather than an addl $4,%esp or a popl,
103: so the CPU return address branch prediction stack stays synchronised with
104: the actual stack in memory. Such a call however still costs 4 to 7
105: cycles.
106:
107: Branch prediction, in the absence of any history, will guess forward jumps
108: are not taken and backward jumps are taken. Where possible it's arranged
109: that the less likely or less important case is under a taken forward jump.
110:
111:
112:
113: MMX
114:
115: Putting emms or femms as late as possible in a routine seems to be fastest.
116: Perhaps an emms or femms stalls until all outstanding MMX instructions have
117: completed, so putting it later gives them a chance to complete on their own,
118: in parallel with other operations (like register popping).
119:
120: The Optimization Manual chapter 5 recommends using a femms on K6-2 and K6-3
121: at the start of a routine, in case it's been preceded by x87 floating point
122: operations. This isn't done because in gmp programs it's expected that x87
123: floating point won't be much used and that chances are an mpn routine won't
124: have been preceded by any x87 code.
125:
126:
127:
128: CODING
129:
130: Instructions in general code are shown paired if they can decode and execute
131: together, meaning two short decode instructions with the second not
132: depending on the first, only the first using the shifter, no more than one
133: load, and no more than one store.
134:
135: K6 does some out of order execution so the pairings aren't essential; they
136: just show what slots might be available. When decoding is the limiting
137: factor, things can be scheduled that might not execute until later.
138:
139:
140:
141: NOTES
142:
143: Code alignment
144:
145: - if an opcode/modrm or 0Fh/opcode/modrm crosses a cache line boundary,
146: short decode is inhibited. The cross.pl script detects this.
147:
148: - loops and branch targets should be aligned to 16 bytes, or should have at
149: least 2 instructions before a 32 byte boundary. This makes use of the 16
150: byte cache in the BTB.
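
The cache line rule above comes down to a simple address test. The C sketch
below is not the cross.pl script, only an illustration of the check it is
described as making. It assumes a 32 byte cache line and an instruction
with no prefixes other than an optional 0Fh; the function names are
hypothetical.

```c
#include <stdint.h>

#define K6_CACHE_LINE 32u   /* assumed L1 cache line size in bytes */

/* Non-zero if the len bytes starting at addr span two cache lines. */
int crosses_cache_line(uint32_t addr, unsigned len)
{
    return len > 0
        && (addr / K6_CACHE_LINE) != ((addr + len - 1) / K6_CACHE_LINE);
}

/* Non-zero if the opcode/modrm pair (or 0Fh/opcode/modrm triple, when
   has_0f is non-zero) of an instruction starting at addr crosses a cache
   line boundary, which is the case said to inhibit short decode. */
int short_decode_inhibited(uint32_t addr, int has_0f)
{
    return crosses_cache_line(addr, has_0f ? 3u : 2u);
}
```

An instruction at offset 31 has its modrm byte in the next line and is
flagged; one at offset 30 is flagged only if it has the 0Fh byte.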
151:
152: Addressing modes
153:
154: - (%esi) degrades decoding from short to vector. 0(%esi) doesn't have this
155: problem, and can be used as an equivalent, or easier is just to use a
156: different register, like %ebx.
157:
158: - K6 and pre-CXT core K6-2 have the following problem. (K6-2 CXT and K6-3
159: have it fixed, these being cpuid function 1 signatures 0x588 to 0x58F).
160:
161: If more than 3 bytes are needed to determine instruction length then
162: decoding degrades from short to long, or from long to vector. This
163: happens with forms like "0F opcode mod/rm" with mod/rm=00-xxx-100 since
164: with mod=00 the sib determines whether there's a displacement.
165:
166: This affects all MMX and 3DNow instructions, and others with an 0F prefix
167: like movzbl. The modes affected are anything with an index and no
168: displacement, or an index but no base, and this includes (%esp) which is
169: really (,%esp,1).
170:
171: The cross.pl script detects problem cases. The workaround is to always
172: use a displacement, and to do this with Zdisp if it's zero so the
173: assembler doesn't discard it.
174:
175: See Optimization Manual rev D page 67 and 3DNow Porting Guide rev B pages
176: 13-14 and 36-37.
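
The mod/rm pattern described above can be tested with a mask. The following
C sketch is not taken from cross.pl; the function names are hypothetical and
the signature range is simply the one quoted in the text.

```c
#include <stdint.h>

/* Non-zero if a mod/rm byte has mod=00 and r/m=100, i.e. the form
   00-xxx-100 where a SIB byte follows and decides whether there is a
   displacement.  The reg field (the xxx bits) is ignored. */
int modrm_is_00_xxx_100(uint8_t modrm)
{
    return (modrm & 0xC7) == 0x04;
}

/* Non-zero if a cpuid function 1 signature lies in the range 0x588 to
   0x58F given in the text for chips with the problem fixed. */
int k6_sig_in_fixed_range(uint32_t signature)
{
    return signature >= 0x588 && signature <= 0x58F;
}
```

Forcing a displacement, as in the Zdisp workaround, changes mod to 01 (for
example mod/rm 0x44 instead of 0x04), so the pattern no longer matches.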
177:
178: Calls
179:
180: - indirect jumps and calls are not branch predicted; they measure about 6
181: cycles.
182:
183: Various
184:
185: - adcl       2 cycles of decode, maybe 2 cycles executing in the X pipe
186: - bsf        12-27 cycles
187: - emms       5 cycles
188: - femms      3 cycles
189: - jecxz      2 cycles taken, 13 not taken (optimization manual says 7 not taken)
190: - divl       20 cycles back-to-back
191: - imull      2 decode, 2 execute
192: - mull       2 decode, 3 execute (optimization manual decoding sample)
193: - prefetch   2 cycles
194: - rcll/rcrl  implicit by one bit: 2 cycles
195:              immediate or %cl count: 11 + 2 per bit for dword
196:                                      13 + 4 per bit for byte
197: - setCC      2 cycles
198: - xchgl      %eax,reg  1.5 cycles, back-to-back (strange)
199:              reg,reg   2 cycles, back-to-back
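
The rcll/rcrl figures in the list above can be worked through as plain
arithmetic. The C sketch below just encodes those measured figures; the
function names are hypothetical, and the implicit one bit form is only
modelled for dword operands since that is the only case listed.

```c
/* Cycles for a dword rcl/rcr, per the measurements listed above.  A
   non-zero implicit_one means the implicit shift-by-one form, for which
   count is ignored; otherwise count is the immediate or %cl bit count. */
unsigned k6_rcl_dword_cycles(unsigned count, int implicit_one)
{
    if (implicit_one)
        return 2;
    return 11 + 2 * count;
}

/* Cycles for a byte rcl/rcr with an immediate or %cl count. */
unsigned k6_rcl_byte_cycles(unsigned count)
{
    return 13 + 4 * count;
}
```

Taking the formula at face value, a dword rotate by an immediate count of 3
comes to 11 + 2*3 = 17 cycles, against 3 * 2 = 6 cycles for three of the
implicit one bit form.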
200:
201:
202:
203:
204: REFERENCES
205:
206: "AMD-K6 Processor Code Optimization Application Note", AMD publication
207: number 21924, revision D amendment 0, January 2000. This describes K6-2 and
208: K6-3. Available on-line,
209:
210: http://www.amd.com/K6/k6docs/pdf/21924.pdf
211:
212: "AMD-K6 MMX Enhanced Processor x86 Code Optimization Application Note", AMD
213: publication number 21828, revision A amendment 0, August 1997. This is an
214: older edition of the above document, describing plain K6. Available
215: on-line,
216:
217: http://www.amd.com/K6/k6docs/pdf/21828.pdf
218:
219: "3DNow Technology Manual", AMD publication number 21928F/0-August 1999.
220: This describes the femms and prefetch instructions, but nothing else from
221: 3DNow has been used. Available on-line,
222:
223: http://www.amd.com/K6/k6docs/pdf/21928.pdf
224:
225: "3DNow Instruction Porting Guide", AMD publication number 22621, revision B,
226: August 1999. This has some notes on general K6 optimizations as well as
227: 3DNow. Available on-line,
228:
229: http://www.amd.com/products/cpg/athlon/techdocs/pdf/22621.pdf
230:
231:
232:
233: ----------------
234: Local variables:
235: mode: text
236: fill-column: 76
237: End: