Annotation of OpenXM_contrib/gmp/mpn/x86/README.family, Revision 1.1.1.1
1:
2: X86 CPU FAMILY MPN SUBROUTINES
3:
4:
5: This file has some notes on things common to all the x86 family code.
6:
7:
8:
9: ASM FILES
10:
11: The x86 .asm files are BSD style x86 assembler code, first put through m4
12: for macro processing. The generic mpn/asm-defs.m4 is used, together with
13: mpn/x86/x86-defs.m4. Detailed notes are in those files.
14:
15: The code is meant for use with GNU "gas" or a system "as". There's no
16: support for assemblers that demand Intel style, and with gas freely
17: available and easy to use that shouldn't be a problem.
18:
19:
20:
21: STACK FRAME
22:
23: m4 macros are used to define the parameters passed on the stack, and these
24: act like comments on what the stack frame looks like too. For example,
25: mpn_mul_1() has the following.
26:
27: defframe(PARAM_MULTIPLIER, 16)
28: defframe(PARAM_SIZE, 12)
29: defframe(PARAM_SRC, 8)
30: defframe(PARAM_DST, 4)
31:
32: Here PARAM_MULTIPLIER gets defined as `FRAME+16(%esp)', and the others
33: similarly. The return address is at offset 0, but there's not normally any
34: need to access that.
35:
36: FRAME is redefined as necessary through the code so it's the number of bytes
37: pushed on the stack, and hence the offsets in the parameter macros stay
38: correct. At the start of a routine FRAME should be zero.
39:
40: deflit(`FRAME',0)
41: ...
42: deflit(`FRAME',4)
43: ...
44: deflit(`FRAME',8)
45: ...
46:
47: Helper macros FRAME_pushl(), FRAME_popl(), FRAME_addl_esp() and
48: FRAME_subl_esp() exist to adjust FRAME for the effect of those instructions,
49: and can be used instead of explicit definitions if preferred.
50: defframe_pushl() is a combination of FRAME_pushl() and defframe().
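
The FRAME bookkeeping described above can be modelled outside m4. What follows is a minimal sketch in C, not code from GMP: struct frame_model, model_pushl(), model_popl() and esp_disp() are hypothetical names, and the sketch only mirrors the arithmetic of `FRAME+offset(%esp)' for the mpn_mul_1() parameters.

```c
#include <assert.h>

/* Hypothetical model: frame is the number of bytes pushed since function
   entry, playing the role of the m4 FRAME definition. */
struct frame_model { int frame; };

/* Offsets taken from the defframe() lines for mpn_mul_1(). */
enum { PARAM_DST = 4, PARAM_SRC = 8, PARAM_SIZE = 12, PARAM_MULTIPLIER = 16 };

/* Effect of a "pushl" on %esp, as FRAME_pushl() would record it. */
static void model_pushl(struct frame_model *f)
{
    f->frame += 4;
}

/* Effect of a "popl" on %esp, as FRAME_popl() would record it. */
static void model_popl(struct frame_model *f)
{
    assert(f->frame >= 4);
    f->frame -= 4;
}

/* Displacement from the current %esp, that is FRAME+offset. */
static int esp_disp(const struct frame_model *f, int offset)
{
    return f->frame + offset;
}
```

Starting from frame = 0, two calls to model_pushl() move PARAM_SIZE from 12(%esp) to 20(%esp), which is the adjustment the parameter macros make automatically when FRAME is kept up to date.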
51:
52: There's generally some slackness in redefining FRAME. If new values aren't
53: going to get used, then the redefinitions are omitted to keep from
54: cluttering up the code. This happens for instance at the end of a routine,
55: where there might be just four register pops and then a ret, so FRAME isn't
56: getting used.
57:
58: Local variables and saved registers can be similarly defined, with negative
59: offsets representing stack space below the initial stack pointer. For
60: example,
61:
62: defframe(SAVE_ESI, -4)
63: defframe(SAVE_EDI, -8)
64: defframe(VAR_COUNTER,-12)
65:
66: deflit(STACK_SPACE, 12)
67:
68: Here STACK_SPACE gets used in a "subl $STACK_SPACE, %esp" to allocate the
69: space, and that instruction must be followed by a redefinition of FRAME
70: (setting it equal to STACK_SPACE) to reflect the change in %esp.
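
As a rough check of the negative offsets, the following C sketch (an illustration only, with the hypothetical helpers frame_after_alloc(), local_disp() and local_in_allocated_space()) computes where each local lands relative to %esp once the "subl $STACK_SPACE, %esp" has been done and FRAME set to STACK_SPACE.

```c
#include <assert.h>

enum { STACK_SPACE = 12 };   /* deflit(STACK_SPACE, 12) */

/* Offsets from the defframe() lines above. */
enum { SAVE_ESI = -4, SAVE_EDI = -8, VAR_COUNTER = -12 };

/* FRAME value after "subl $STACK_SPACE, %esp" at function entry. */
static int frame_after_alloc(void)
{
    return STACK_SPACE;
}

/* FRAME+offset, the displacement from the adjusted %esp. */
static int local_disp(int offset)
{
    return frame_after_alloc() + offset;
}

/* A 4-byte local is inside the allocated space when its displacement is
   in the range 0 to STACK_SPACE-4. */
static int local_in_allocated_space(int offset)
{
    int d = local_disp(offset);
    return d >= 0 && d + 4 <= STACK_SPACE;
}
```

With these values SAVE_ESI is at 8(%esp), SAVE_EDI at 4(%esp) and VAR_COUNTER at 0(%esp), all inside the 12 bytes allocated.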
71:
72: Definitions for pushed registers are only put in when they're going to be
73: used. If registers are just saved and restored with pushes and pops then
74: definitions aren't made.
75:
76:
77:
78: ASSEMBLER EXPRESSIONS
79:
80: Only addition and subtraction seem to be universally available; certainly
81: that's all the Solaris 8 "as" seems to accept. If other expressions are
82: wanted then m4 eval() should be used.
83:
84: In particular note that a "/" anywhere in a line starts a comment in Solaris
85: "as", and in some configurations of gas too.
86:
87: addl $32/2, %eax <-- wrong
88:
89: addl $eval(32/2), %eax <-- right
90:
91: Binutils gas/config/tc-i386.c has a choice between "/" being a comment
92: anywhere in a line, or only at the start. FreeBSD patches 2.9.1 to select
93: the latter, and as of 2.9.5 it's the default for GNU/Linux too.
94:
95:
96:
97: ASSEMBLER COMMENTS
98:
99: Solaris "as" doesn't support "#" commenting, using /* */ instead,
100: unfortunately. For that reason "C" commenting is used (see asm-defs.m4) and
101: the intermediate ".s" files have no comments.
102:
103:
104:
105: ZERO DISPLACEMENTS
106:
107: In a couple of places addressing modes like 0(%ebx) with a byte-sized zero
108: displacement are wanted, rather than (%ebx) with no displacement. These are
109: either for computed jumps or to get desirable code alignment. Explicit
110: .byte sequences are used to ensure the assembler doesn't turn 0(%ebx) into
111: (%ebx). The Zdisp() macro in x86-defs.m4 is used for this.
112:
113: Current gas 2.9.5 or recent 2.9.1 leave 0(%ebx) as written, but old gas
114: 1.92.3 changes it. In general changing would be the sort of "optimization"
115: an assembler might perform, hence explicit ".byte"s are used where
116: necessary.
117:
118:
119:
120: SHLD/SHRD INSTRUCTIONS
121:
122: The %cl count forms of double shift instructions like "shldl %cl,%eax,%ebx"
123: must be written "shldl %eax,%ebx" for some assemblers. gas takes either
124: form and Solaris "as" doesn't allow %cl; gcc generates %cl for gas and NeXT
125: (which is gas), and omits %cl elsewhere.
126:
127: For GMP an autoconf test is used to determine whether %cl should be used and
128: the macros shldl, shrdl, shldw and shrdw in mpn/x86/x86-defs.m4 then pass
129: through or omit %cl as necessary. See comments with those macros for usage.
130:
131:
132:
133: DIRECTION FLAG
134:
135: The x86 calling conventions say that the direction flag should be clear at
136: function entry and exit. (See iBCS2 and SVR4 ABI books, references below.)
137:
138: Although this has been so since the year dot, it's not absolutely clear
139: whether it's universally respected. Since it's better to be safe than
140: sorry, gmp follows glibc and does a "cld" if it depends on the direction
141: flag being clear. This happens only in a few places.
142:
143:
144:
145: POSITION INDEPENDENT CODE
146:
147: Defining the symbol PIC in m4 processing selects position independent code.
148: This mainly affects computed jumps, and these are implemented in a
149: self-contained fashion (without using the global offset table). The few
150: calls from assembly code to global functions use the normal procedure
151: linkage table.
152:
153: PIC is necessary for ELF shared libraries because they can be mapped into
154: different processes at different virtual addresses. Text relocations in
155: shared libraries are allowed, but that presumably means a page with such a
156: relocation isn't shared. The use of the PLT for PIC adds a fixed cost to
157: every function call, which is small but might be noticeable when working with
158: small operands.
159:
160: Calls from one library function to another don't need to go through the PLT,
161: since of course the call instruction uses a displacement, not an absolute
162: address, and the relative locations of object files are known when libgmp.so
163: is created. "ld -Bsymbolic" (or "gcc -Wl,-Bsymbolic") will resolve calls
164: this way, so that there's no jump through the PLT, but of course leaving
165: setups of the GOT address in %ebx that may be unnecessary.
166:
167: The %ebx setup could be avoided in assembly if a separate option controlled
168: PIC for calls as opposed to computed jumps etc. But there's only ever
169: likely to be a handful of calls out of assembler, and getting the same
170: optimization for C intra-library calls would be more important. There seems
171: no easy way to tell gcc that certain functions can be called non-PIC, and
172: unfortunately many gmp functions use the global memory allocation variables,
173: so they need the GOT anyway. Object files with no global data references
174: and only intra-library calls could go into the library as non-PIC under
175: -Bsymbolic. Integrating this into libtool and automake is left as an
176: exercise for the reader.
177:
178:
179:
180: SIMPLE LOOPS
181:
182: The overheads in setting up for an unrolled loop can mean that at small
183: sizes a simple loop is faster. Making small sizes go fast is important,
184: even if it adds a cycle or two to bigger sizes. To this end various
185: routines choose between a simple loop and an unrolled loop according to
186: operand size. The path to the simple loop, or to special case code for
187: small sizes, is always as fast as possible.
188:
189: Adding a simple loop requires a conditional jump to choose between the
190: simple and unrolled code. The size of a branch misprediction penalty
191: affects whether a simple loop is worthwhile.
192:
193: The convention is for an m4 definition UNROLL_THRESHOLD to set the crossover
194: point, with sizes < UNROLL_THRESHOLD using the simple loop, sizes >=
195: UNROLL_THRESHOLD using the unrolled loop. If position independent code adds
196: a couple of cycles to an unrolled loop setup, the threshold will vary with
197: PIC or non-PIC. Something like the following is typical.
198:
199: ifdef(`PIC',`
200: deflit(UNROLL_THRESHOLD, 10)
201: ',`
202: deflit(UNROLL_THRESHOLD, 8)
203: ')
204:
205: There's no automated way to determine the threshold. Setting it to a small
206: value and then to a big value makes it possible to measure the simple and
207: unrolled loops each over a range of sizes, from which the crossover point
208: can be determined. Alternatively, just adjust the threshold up or down until
209: there are no more speedups.
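
The choice between the two loops can be pictured in C. This is a sketch under stated assumptions rather than any GMP routine: limb_t stands in for the limb type, copy_limbs() is a hypothetical function, the unrolling is only 4 limbs, and the threshold value 8 is purely illustrative.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long limb_t;   /* stand-in for the limb type */

#define UNROLL_THRESHOLD 8      /* illustrative value, not a measured one */

/* Copy size limbs from src to dst, low to high.  Returns 1 if the
   unrolled path was taken and 0 if the simple loop was used, so the
   dispatch can be observed by a caller. */
static int copy_limbs(limb_t *dst, const limb_t *src, size_t size)
{
    size_t i = 0;

    if (size < UNROLL_THRESHOLD) {
        /* Simple loop: no setup cost, best for small sizes. */
        for (; i < size; i++)
            dst[i] = src[i];
        return 0;
    }

    /* Unrolled loop, 4 limbs per iteration. */
    for (; i + 4 <= size; i += 4) {
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }
    /* Leftover limbs when size is not a multiple of 4. */
    for (; i < size; i++)
        dst[i] = src[i];
    return 1;
}
```

The single compare and branch at the top is the conditional jump mentioned above. The assembly routines generally use a computed jump into the unrolled loop rather than a separate leftover loop.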
210:
211:
212:
213: UNROLLED LOOP CODING
214:
215: The x86 addressing modes allow a byte displacement of -128 to +127, making
216: it possible to access 256 bytes, which is 64 limbs, without adjusting
217: pointer registers within the loop. Dword sized displacements can be used
218: too, but they increase code size, and unrolling to 64 ought to be enough.
219:
220: When unrolling to the full 64 limbs/loop, the limb at the top of the loop
221: will have a displacement of -128, so pointers have to have a corresponding
222: +128 added before entering the loop. When unrolling to 32 limbs/loop
223: displacements 0 to 127 can be used with 0 at the top of the loop and no
224: adjustment needed to the pointers.
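
The displacement arithmetic can be checked with a few lines of C. The helper names here (pointer_bias(), limb_disp(), fits_in_byte()) are hypothetical and the sketch only covers the cases described above: a +128 pointer adjustment for more than 32 limbs/loop, none otherwise.

```c
#include <assert.h>

enum { BYTES_PER_LIMB = 4 };

/* Bytes added to the pointers before entering the loop: +128 when
   unrolling past 32 limbs/loop (that is, 64), otherwise nothing. */
static int pointer_bias(int limbs_per_loop)
{
    return limbs_per_loop > 32 ? 128 : 0;
}

/* Displacement used for limb i of the loop body, with i = 0 being the
   limb at the top of the loop. */
static int limb_disp(int limbs_per_loop, int i)
{
    assert(i >= 0 && i < limbs_per_loop);
    return i * BYTES_PER_LIMB - pointer_bias(limbs_per_loop);
}

/* True if disp can be encoded as a signed byte displacement. */
static int fits_in_byte(int disp)
{
    return disp >= -128 && disp <= 127;
}
```

For 64 limbs/loop the displacements run from -128 to +124, and for 32 limbs/loop from 0 to +124, so in both cases every access uses a byte displacement.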
225:
226: Where 64 limbs/loop is supported, the +128 adjustment is done only when 64
227: limbs/loop is selected. Usually the gain in speed using 64 instead of 32 or
228: 16 is small, so support for 64 limbs/loop is generally only for comparison.
229:
230:
231:
232: COMPUTED JUMPS
233:
234: When working from least significant limb to most significant limb (most
235: routines) the computed jump and pointer calculations in preparation for an
236: unrolled loop are as follows.
237:
238: S = operand size in limbs
239: N = number of limbs per loop (UNROLL_COUNT)
240: L = log2 of unrolling (UNROLL_LOG2)
241: M = mask for unrolling (UNROLL_MASK)
242: C = code bytes per limb in the loop
243: B = bytes per limb (4 for x86)
244:
245: computed jump (-S & M) * C + entrypoint
246: subtract from pointers (-S & M) * B
247: initial loop counter (S-1) >> L
248: displacements 0 to B*(N-1)
249:
250: The loop counter is decremented at the end of each loop, and the looping
251: stops when the decrement takes the counter to -1. The displacements are for
252: the addressing modes accessing each limb, eg. a load with "movl disp(%ebx), %eax".
253:
254: Usually the multiply by "C" can be handled without an imul, using instead an
255: leal, or a shift and subtract.
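
The formulas above can be sanity checked numerically. The C below is a sketch with hypothetical names (struct unroll_setup, compute_unroll_setup(), limbs_processed()), using S, L, C and B as in the table. It assumes S >= 1 and that the jump offset is measured in code bytes from the loop entrypoint.

```c
#include <assert.h>

struct unroll_setup {
    unsigned jump_offset;   /* (-S & M) * C, added to the entrypoint */
    unsigned ptr_adjust;    /* (-S & M) * B, subtracted from pointers */
    unsigned counter;       /* (S-1) >> L, initial loop counter */
};

/* S = size in limbs, L = log2 of the unrolling, C = code bytes per limb,
   B = bytes per limb. */
static struct unroll_setup compute_unroll_setup(unsigned S, unsigned L,
                                                unsigned C, unsigned B)
{
    struct unroll_setup u;
    unsigned M = (1u << L) - 1;      /* UNROLL_MASK */
    unsigned skip = (0u - S) & M;    /* limbs skipped on the first pass */

    assert(S >= 1);
    u.jump_offset = skip * C;
    u.ptr_adjust = skip * B;
    u.counter = (S - 1) >> L;
    return u;
}

/* Limbs handled in total: the loop body runs counter+1 times, N limbs
   each time, less the limbs skipped by the jump on the first pass. */
static unsigned limbs_processed(unsigned S, unsigned L)
{
    unsigned N = 1u << L;
    /* With C = 1 the jump offset is simply the number of limbs skipped. */
    struct unroll_setup u = compute_unroll_setup(S, L, 1, 1);
    return (u.counter + 1) * N - u.jump_offset;
}
```

For example S=5 with 4 limbs/loop skips 3 limbs on the first pass and starts the counter at 1, so the loop body is entered twice and 2*4-3 = 5 limbs are processed.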
256:
257: When working from most significant to least significant limb (eg. mpn_lshift
258: and mpn_copyd), the calculations change as follows.
259:
260: add to pointers (-S & M) * B
261: displacements 0 to -B*(N-1)
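
A C analogue of the most significant to least significant case is a switch into the middle of an unrolled loop (the construct often called Duff's device). The following is only an illustration, not a GMP routine: copyd_unrolled() and limb_t are hypothetical names, the unrolling is fixed at 4 limbs, and array indices are used in place of adjusted pointers so the C stays within defined behaviour.

```c
#include <assert.h>

typedef unsigned long limb_t;   /* stand-in for the limb type */

enum { UNROLL_LOG2 = 2, UNROLL_MASK = 3 };   /* 4 limbs per loop */

/* Copy size limbs from src to dst, starting at the most significant
   limb.  size must be at least 1. */
static void copyd_unrolled(limb_t *dst, const limb_t *src, unsigned size)
{
    unsigned skip;
    long counter, top;

    assert(size >= 1);
    skip = (0u - size) & UNROLL_MASK;              /* (-S & M) */
    counter = (long)((size - 1) >> UNROLL_LOG2);   /* (S-1) >> L */
    /* "add to pointers": start from the top limb and move up by the
       skipped limbs, so displacements 0 to -(N-1) line up. */
    top = (long)(size - 1) + (long)skip;

    switch (skip) {                                /* the computed jump */
    case 0: do { dst[top]     = src[top];
    case 1:      dst[top - 1] = src[top - 1];
    case 2:      dst[top - 2] = src[top - 2];
    case 3:      dst[top - 3] = src[top - 3];
                 top -= 4;
            } while (--counter >= 0);   /* stops when counter hits -1 */
    }
}
```

Because the copy runs from high to low it tolerates overlap with dst above src, for instance copyd_unrolled(a+1, a, 8) on a 9-limb array, which is the direction mpn_copyd is meant for.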
262:
263:
264:
265: OLD GAS 1.92.3
266:
267: This version comes with FreeBSD 2.2.8 and has a couple of gremlins that
268: affect gmp code.
269:
270: Firstly, an expression involving two forward references to labels comes out
271: as zero. For example,
272:
273: addl $bar-foo, %eax
274: foo:
275: nop
276: bar:
277:
278: This should lead to "addl $1, %eax", but it comes out as "addl $0, %eax".
279: When only one forward reference is involved, it works correctly, as for
280: example,
281:
282: foo:
283: addl $bar-foo, %eax
284: nop
285: bar:
286:
287: Secondly, an expression involving two labels can't be used as the
288: displacement for an leal. For example,
289:
290: foo:
291: nop
292: bar:
293: leal bar-foo(%eax,%ebx,8), %ecx
294:
295: A slightly cryptic error is given, "Unimplemented segment type 0 in
296: parse_operand". When only one label is used it's ok, and the label can be a
297: forward reference too, as for example,
298:
299: leal foo(%eax,%ebx,8), %ecx
300: nop
301: foo:
302:
303: These problems only affect PIC computed jump calculations. The workarounds
304: are just to do an leal without a displacement and then an addl, and to make
305: sure the code is placed so that there's at most one forward reference in the
306: addl.
307:
308:
309:
310: REFERENCES
311:
312: "Intel Architecture Software Developer's Manual", volumes 1 to 3, 1999,
313: order numbers 243190, 243191 and 243192. Available on-line,
314:
315: ftp://download.intel.com/design/PentiumII/manuals/243190.htm
316: ftp://download.intel.com/design/PentiumII/manuals/243191.htm
317: ftp://download.intel.com/design/PentiumII/manuals/243192.htm
318:
319: "Intel386 Family Binary Compatibility Specification 2", Intel Corporation,
320: published by McGraw-Hill, 1991, ISBN 0-07-031219-2.
321:
322: "System V Application Binary Interface", Unix System Laboratories Inc, 1992,
323: published by Prentice Hall, ISBN 0-13-880410-9. And the "Intel386 Processor
324: Supplement", AT&T, 1991, ISBN 0-13-877689-X. (These have details of ELF
325: shared library PIC coding.)
326:
327:
328:
329: ----------------
330: Local variables:
331: mode: text
332: fill-column: 76
333: End: