Annotation of OpenXM_contrib/gmp/mpn/x86/README, Revision 1.1.1.2
1.1.1.2 ! ohara 1: Copyright 1999, 2000, 2001 Free Software Foundation, Inc.
! 2:
! 3: This file is part of the GNU MP Library.
! 4:
! 5: The GNU MP Library is free software; you can redistribute it and/or modify
! 6: it under the terms of the GNU Lesser General Public License as published by
! 7: the Free Software Foundation; either version 2.1 of the License, or (at your
! 8: option) any later version.
! 9:
! 10: The GNU MP Library is distributed in the hope that it will be useful, but
! 11: WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
! 12: or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public
! 13: License for more details.
! 14:
! 15: You should have received a copy of the GNU Lesser General Public License
! 16: along with the GNU MP Library; see the file COPYING.LIB. If not, write to
! 17: the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
! 18: 02111-1307, USA.
! 19:
! 20:
! 21:
! 22:
1.1 maekawa 23:
24: X86 MPN SUBROUTINES
25:
26:
27: This directory contains mpn functions for various 80x86 chips.
28:
29:
30: CODE ORGANIZATION
31:
        x86                 i386, generic
        x86/i486            i486
        x86/pentium         Intel Pentium (P5, P54)
        x86/pentium/mmx     Intel Pentium with MMX (P55)
        x86/p6              Intel Pentium Pro
        x86/p6/mmx          Intel Pentium II, III
        x86/p6/p3mmx        Intel Pentium III
        x86/k6              \ AMD K6
        x86/k6/mmx          /
        x86/k6/k62mmx       AMD K6-2
        x86/k7              \ AMD Athlon
        x86/k7/mmx          /
        x86/pentium4        \
        x86/pentium4/mmx    | Intel Pentium 4
        x86/pentium4/sse2   /
1.1 maekawa 47:
48:
1.1.1.2 ! ohara 49: The top-level x86 directory contains blended style code, meant to be
! 50: reasonable on all x86s.
1.1 maekawa 51:
52:
53:
54: STATUS
55:
1.1.1.2 ! ohara 56: The code is well-optimized for AMD and Intel chips, but there's nothing
! 57: specific for Cyrix chips, nor for actual 80386 and 80486 chips.
! 58:
! 59:
! 60:
! 61: ASM FILES
! 62:
! 63: The x86 .asm files are BSD style assembler code, first put through m4 for
! 64: macro processing. The generic mpn/asm-defs.m4 is used, together with
! 65: mpn/x86/x86-defs.m4. See comments in those files.
! 66:
! 67: The code is meant for use with GNU "gas" or a system "as". There's no
! 68: support for assemblers that demand Intel style code.
! 69:
! 70:
! 71:
! 72: STACK FRAME
! 73:
! 74: m4 macros are used to define the parameters passed on the stack, and these
! 75: act like comments on what the stack frame looks like too. For example,
! 76: mpn_mul_1() has the following.
! 77:
        defframe(PARAM_MULTIPLIER, 16)
        defframe(PARAM_SIZE,       12)
        defframe(PARAM_SRC,         8)
        defframe(PARAM_DST,         4)
! 82:
! 83: PARAM_MULTIPLIER becomes `FRAME+16(%esp)', and the others similarly. The
! 84: return address is at offset 0, but there's not normally any need to access
! 85: that.
! 86:
! 87: FRAME is redefined as necessary through the code so it's the number of bytes
! 88: pushed on the stack, and hence the offsets in the parameter macros stay
! 89: correct. At the start of a routine FRAME should be zero.
! 90:
        deflit(`FRAME',0)
        ...
        deflit(`FRAME',4)
        ...
        deflit(`FRAME',8)
        ...
! 97:
Helper macros FRAME_pushl(), FRAME_popl(), FRAME_addl_esp() and
FRAME_subl_esp() exist to adjust FRAME for the effect of those instructions,
and can be used instead of explicit definitions if preferred.
defframe_pushl() is a combination of FRAME_pushl() and defframe().
! 102:
! 103: There's generally some slackness in redefining FRAME. If new values aren't
! 104: going to get used then the redefinitions are omitted to keep from cluttering
! 105: up the code. This happens for instance at the end of a routine, where there
! 106: might be just four pops and then a ret, so FRAME isn't getting used.
! 107:
! 108: Local variables and saved registers can be similarly defined, with negative
! 109: offsets representing stack space below the initial stack pointer. For
! 110: example,
! 111:
        defframe(SAVE_ESI,     -4)
        defframe(SAVE_EDI,     -8)
        defframe(VAR_COUNTER, -12)

        deflit(STACK_SPACE, 12)
! 117:
! 118: Here STACK_SPACE gets used in a "subl $STACK_SPACE, %esp" to allocate the
! 119: space, and that instruction must be followed by a redefinition of FRAME
! 120: (setting it equal to STACK_SPACE) to reflect the change in %esp.
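
As an illustration only, the following C sketch models the bookkeeping
described above. It is not GMP code: the sketch_ names and the toy stack
are invented here, and only the PARAM_, SAVE_ and VAR_ offsets come from the
examples above. It shows that a slot addressed as FRAME+offset(%esp) stays
the same slot however many bytes have been pushed, provided FRAME is kept up
to date.

```c
#include <assert.h>
#include <stdint.h>

/* Offsets as in the defframe() examples above. */
enum { PARAM_MULTIPLIER = 16, PARAM_SIZE = 12, PARAM_SRC = 8, PARAM_DST = 4,
       SAVE_ESI = -4, SAVE_EDI = -8, VAR_COUNTER = -12, STACK_SPACE = 12 };

enum { SKETCH_STACK_BYTES = 64 };   /* hypothetical toy stack size */

struct sketch_stack {
    uint32_t mem[SKETCH_STACK_BYTES / 4];
    int esp;      /* byte offset into mem, standing in for %esp */
    int frame;    /* bytes pushed since entry, standing in for FRAME */
};

/* State at function entry: return address at 0(%esp), parameters above. */
void sketch_enter(struct sketch_stack *s, uint32_t dst, uint32_t src,
                  uint32_t size, uint32_t multiplier)
{
    s->esp = SKETCH_STACK_BYTES - 20;
    s->frame = 0;                             /* deflit(`FRAME',0) */
    s->mem[s->esp / 4] = 0;                   /* return address placeholder */
    s->mem[(s->esp + PARAM_DST) / 4] = dst;
    s->mem[(s->esp + PARAM_SRC) / 4] = src;
    s->mem[(s->esp + PARAM_SIZE) / 4] = size;
    s->mem[(s->esp + PARAM_MULTIPLIER) / 4] = multiplier;
}

/* The slot that defframe(NAME, offset) names: FRAME+offset(%esp). */
uint32_t *sketch_slot(struct sketch_stack *s, int offset)
{
    int addr = s->frame + offset + s->esp;
    assert(addr >= 0 && addr < SKETCH_STACK_BYTES && addr % 4 == 0);
    return &s->mem[addr / 4];
}

/* Models pushl followed by FRAME_pushl(). */
void sketch_pushl(struct sketch_stack *s, uint32_t value)
{
    s->esp -= 4;
    s->frame += 4;
    s->mem[s->esp / 4] = value;
}

/* Models "subl $bytes, %esp" followed by FRAME_subl_esp(bytes). */
void sketch_subl_esp(struct sketch_stack *s, int bytes)
{
    s->esp -= bytes;
    s->frame += bytes;
}
```

After sketch_subl_esp(&s, STACK_SPACE), sketch_slot(&s, SAVE_ESI) is the
word just below the return address and sketch_slot(&s, VAR_COUNTER) is the
word at the new stack pointer, while the PARAM_ slots are unchanged.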
! 121:
! 122: Definitions for pushed registers are only put in when they're going to be
! 123: used. If registers are just saved and restored with pushes and pops then
! 124: definitions aren't made.
! 125:
! 126:
! 127:
! 128: ASSEMBLER EXPRESSIONS
! 129:
Only addition and subtraction seem to be universally available; certainly
that's all the Solaris 8 "as" seems to accept. If other operators are
wanted then m4 eval() should be used.
! 133:
! 134: In particular note that a "/" anywhere in a line starts a comment in Solaris
! 135: "as", and in some configurations of gas too.
! 136:
        addl    $32/2, %eax           <-- wrong

        addl    $eval(32/2), %eax     <-- right
! 140:
! 141: Binutils gas/config/tc-i386.c has a choice between "/" being a comment
! 142: anywhere in a line, or only at the start. FreeBSD patches 2.9.1 to select
! 143: the latter, and from 2.9.5 it's the default for GNU/Linux too.
! 144:
! 145:
! 146:
! 147: ASSEMBLER COMMENTS
! 148:
! 149: Solaris "as" doesn't support "#" commenting, using /* */ instead. For that
! 150: reason "C" commenting is used (see asm-defs.m4) and the intermediate ".s"
! 151: files have no comments.
! 152:
! 153: Any comments before include(`../config.m4') must use m4 "dnl", since it's
! 154: only after the include that "C" is available. By convention "dnl" is also
! 155: used for comments about m4 macros.
! 156:
! 157:
! 158:
! 159: TEMPORARY LABELS
! 160:
! 161: Temporary numbered labels like "1:" used as "1f" or "1b" are available in
! 162: "gas" and Solaris "as", but not in SCO "as". Normal L() labels should be
! 163: used instead, possibly with a counter to make them unique, see jadcl0() for
! 164: instance. A separate counter for each macro makes it possible to nest them,
! 165: for instance movl_text_address() can be used within an ASSERT().
! 166:
! 167: "1:" etc must be avoided in gcc __asm__ blocks too. "%=" for generating a
! 168: unique number looks like a good alternative, but is that actually a
! 169: documented feature? In any case this problem doesn't currently arise.
! 170:
! 171:
! 172:
! 173: ZERO DISPLACEMENTS
! 174:
! 175: In a couple of places addressing modes like 0(%ebx) with a byte-sized zero
! 176: displacement are wanted, rather than (%ebx) with no displacement. These are
! 177: either for computed jumps or to get desirable code alignment. Explicit
! 178: .byte sequences are used to ensure the assembler doesn't turn 0(%ebx) into
! 179: (%ebx). The Zdisp() macro in x86-defs.m4 is used for this.
! 180:
! 181: Current gas 2.9.5 or recent 2.9.1 leave 0(%ebx) as written, but old gas
! 182: 1.92.3 changes it. In general changing would be the sort of "optimization"
! 183: an assembler might perform, hence explicit ".byte"s are used where
! 184: necessary.
! 185:
! 186:
! 187:
! 188: SHLD/SHRD INSTRUCTIONS
! 189:
! 190: The %cl count forms of double shift instructions like "shldl %cl,%eax,%ebx"
! 191: must be written "shldl %eax,%ebx" for some assemblers. gas takes either,
! 192: Solaris "as" doesn't allow %cl, gcc generates %cl for gas and NeXT (which is
! 193: gas), and omits %cl elsewhere.
! 194:
! 195: For GMP an autoconf test GMP_ASM_X86_SHLDL_CL is used to determine whether
! 196: %cl should be used, and the macros shldl, shrdl, shldw and shrdw in
! 197: mpn/x86/x86-defs.m4 pass through or omit %cl as necessary. See the comments
! 198: with those macros for usage.
! 199:
! 200:
! 201:
! 202: IMUL INSTRUCTION
! 203:
! 204: GCC config/i386/i386.md (cvs rev 1.187, 21 Oct 00) under *mulsi3_1 notes
! 205: that the following two forms produce identical object code
! 206:
        imul    $12, %eax
        imul    $12, %eax, %eax
! 209:
! 210: but that the former isn't accepted by some assemblers, in particular the SCO
! 211: OSR5 COFF assembler. GMP follows GCC and uses only the latter form.
! 212:
(This applies only to immediate operands; the three operand form is valid
only with an immediate.)
! 215:
! 216:
! 217:
! 218: DIRECTION FLAG
! 219:
! 220: The x86 calling conventions say that the direction flag should be clear at
! 221: function entry and exit. (See iBCS2 and SVR4 ABI books, references below.)
! 222: Although this has been so since the year dot, it's not absolutely clear
! 223: whether it's universally respected. Since it's better to be safe than
! 224: sorry, GMP follows glibc and does a "cld" if it depends on the direction
! 225: flag being clear. This happens only in a few places.
! 226:
! 227:
! 228:
! 229: POSITION INDEPENDENT CODE
! 230:
! 231: Defining the symbol PIC in m4 processing selects SVR4 / ELF style position
! 232: independent code. This is necessary for shared libraries because they can
! 233: be mapped into different processes at different virtual addresses. Actually
! 234: relocations are allowed, but presumably pages with relocations aren't
! 235: shared, defeating the purpose of a shared library.
! 236:
! 237: The use of the PLT adds a fixed cost to every function call, and the GOT
! 238: adds a cost to any function accessing global variables. These are small but
! 239: might be noticeable when working with small operands.
! 240:
! 241: Calls from one library function to another don't need to go through the PLT,
! 242: since of course the call instruction uses a displacement, not an absolute
! 243: address, and the relative locations of object files are known when libgmp.so
! 244: is created. "ld -Bsymbolic" (or "gcc -Wl,-Bsymbolic") will resolve calls
! 245: this way, so that there's no jump through the PLT, but of course leaving
! 246: setups of the GOT address in %ebx that may be unnecessary.
! 247:
! 248: The %ebx setup could be avoided in assembly if a separate option controlled
! 249: PIC for calls as opposed to computed jumps etc. But there's only ever
! 250: likely to be a handful of calls out of assembler, and getting the same
! 251: optimization for C intra-library calls would be more important. There seems
! 252: no easy way to tell gcc that certain functions can be called non-PIC, and
! 253: unfortunately many GMP functions use the global memory allocation variables,
! 254: so they need the GOT anyway. Object files with no global data references
! 255: and only intra-library calls could go into the library as non-PIC under
! 256: -Bsymbolic. Integrating this into libtool and automake is left as an
! 257: exercise for the reader.
! 258:
! 259:
! 260:
! 261: GLOBAL OFFSET TABLE CODING
! 262:
! 263: It's believed the magic _GLOBAL_OFFSET_TABLE_ used by code establishing the
! 264: address of the GOT should be written without a GSYM_PREFIX, ie. that it's
! 265: the same "_GLOBAL_OFFSET_TABLE_" on an underscore or non-underscore system.
! 266: Certainly this is true for instance of NetBSD 1.4 which is an underscore
! 267: system but requires "_GLOBAL_OFFSET_TABLE_".
! 268:
! 269: Old gas 1.92.3 which comes with FreeBSD 2.2.8 gets a segmentation fault when
! 270: asked to assemble the following,
! 271:
        L1:
                addl    $_GLOBAL_OFFSET_TABLE_+[.-L1], %ebx
! 274:
! 275: It seems that using the label in the same instruction it refers to is the
! 276: problem, since a nop in between works. But the simplest workaround is to
! 277: follow gcc and omit the +[.-L1] since it does nothing,
! 278:
                addl    $_GLOBAL_OFFSET_TABLE_, %ebx
! 280:
! 281: Current gas 2.10 generates incorrect object code when %eax is used in such a
! 282: construction (with or without +[.-L1]),
! 283:
                addl    $_GLOBAL_OFFSET_TABLE_, %eax
! 285:
! 286: The R_386_GOTPC gets a displacement of 2 rather than the 1 appropriate for
! 287: the 1 byte opcode of "addl $n,%eax". The best workaround is just to use any
! 288: other register, since then it's a two byte opcode+mod/rm. GCC for example
! 289: always uses %ebx (which is needed for calls through the PLT).
! 290:
! 291: A similar problem occurs in an leal (again with or without a +[.-L1]),
! 292:
                leal    _GLOBAL_OFFSET_TABLE_(%edi), %ebx
! 294:
! 295: This time the R_386_GOTPC gets a displacement of 0 rather than the 2
! 296: appropriate for the opcode and mod/rm, making this form unusable.
! 297:
! 298:
! 299:
! 300: SIMPLE LOOPS
! 301:
! 302: The overheads in setting up for an unrolled loop can mean that at small
! 303: sizes a simple loop is faster. Making small sizes go fast is important,
! 304: even if it adds a cycle or two to bigger sizes. To this end various
! 305: routines choose between a simple loop and an unrolled loop according to
! 306: operand size. The path to the simple loop, or to special case code for
! 307: small sizes, is always as fast as possible.
! 308:
! 309: Adding a simple loop requires a conditional jump to choose between the
! 310: simple and unrolled code. The size of a branch misprediction penalty
! 311: affects whether a simple loop is worthwhile.
! 312:
! 313: The convention is for an m4 definition UNROLL_THRESHOLD to set the crossover
! 314: point, with sizes < UNROLL_THRESHOLD using the simple loop, sizes >=
! 315: UNROLL_THRESHOLD using the unrolled loop. If position independent code adds
! 316: a couple of cycles to an unrolled loop setup, the threshold will vary with
! 317: PIC or non-PIC. Something like the following is typical.
! 318:
        deflit(UNROLL_THRESHOLD, ifdef(`PIC',10,8))
! 320:
There's no automated way to determine the threshold. Setting it to a small
value and then to a big value makes it possible to measure the simple and
unrolled loops each over a range of sizes, from which the crossover point
can be determined. Alternatively, just adjust the threshold up or down until
there are no more speedups.
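
The shape of such a routine can be sketched in C. This is only an
illustration of the convention, not the assembler code: the name
sketch_mul_1, the threshold value 8 and the 4-way unrolling are all
assumptions made for the example, and in the real code the threshold is an
m4 deflit found by measurement.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_UNROLL_THRESHOLD 8   /* hypothetical value, found by measurement */

/* One limb of dst = src * multiplier + carry; returns the new carry limb. */
static uint32_t sketch_step(uint32_t *dst, uint32_t src,
                            uint32_t multiplier, uint32_t carry)
{
    uint64_t product = (uint64_t) src * multiplier + carry;
    *dst = (uint32_t) product;
    return (uint32_t) (product >> 32);
}

/* Multiply size limbs at src by multiplier, store at dst, return the carry. */
uint32_t sketch_mul_1(uint32_t *dst, const uint32_t *src, size_t size,
                      uint32_t multiplier)
{
    uint32_t carry = 0;
    size_t i = 0;

    if (size >= SKETCH_UNROLL_THRESHOLD) {
        /* Unrolled path: only entered when the setup cost is worthwhile. */
        for (; size - i >= 4; i += 4) {
            carry = sketch_step(&dst[i + 0], src[i + 0], multiplier, carry);
            carry = sketch_step(&dst[i + 1], src[i + 1], multiplier, carry);
            carry = sketch_step(&dst[i + 2], src[i + 2], multiplier, carry);
            carry = sketch_step(&dst[i + 3], src[i + 3], multiplier, carry);
        }
    }

    /* Simple loop: the whole of a small operand, or any leftover limbs. */
    for (; i < size; i++)
        carry = sketch_step(&dst[i], src[i], multiplier, carry);

    return carry;
}
```

The single comparison against the threshold corresponds to the conditional
jump mentioned above. Small sizes fall straight through to the simple loop.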
! 326:
! 327:
! 328:
! 329: UNROLLED LOOP CODING
! 330:
! 331: The x86 addressing modes allow a byte displacement of -128 to +127, making
! 332: it possible to access 256 bytes, which is 64 limbs, without adjusting
! 333: pointer registers within the loop. Dword sized displacements can be used
! 334: too, but they increase code size, and unrolling to 64 ought to be enough.
! 335:
! 336: When unrolling to the full 64 limbs/loop, the limb at the top of the loop
! 337: will have a displacement of -128, so pointers have to have a corresponding
! 338: +128 added before entering the loop. When unrolling to 32 limbs/loop
! 339: displacements 0 to 127 can be used with 0 at the top of the loop and no
! 340: adjustment needed to the pointers.
! 341:
! 342: Where 64 limbs/loop is supported, the +128 adjustment is done only when 64
! 343: limbs/loop is selected. Usually the gain in speed using 64 instead of 32 or
! 344: 16 is small, so support for 64 limbs/loop is generally only for comparison.
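
The displacement arithmetic can be checked with a few lines of C. This is
just a worked example of the numbers above; the function name
sketch_displacement is invented for it, and the rule that the +128 bias is
applied only at 64 limbs/loop is taken from the text.

```c
#include <assert.h>

/* Byte displacement used for limb number `limb' (0 is the top of the loop)
   when unrolling to unroll_count limbs/loop with 4-byte limbs.  At 64
   limbs/loop the pointers are assumed to have had +128 added beforehand,
   so the displacements are biased by -128. */
int sketch_displacement(int unroll_count, int limb)
{
    int bias = (unroll_count == 64) ? 128 : 0;
    int disp = 4 * limb - bias;

    assert(limb >= 0 && limb < unroll_count);
    assert(disp >= -128 && disp <= 127);   /* must fit a byte displacement */
    return disp;
}
```

With 64 limbs/loop the displacements run from -128 to +124, and with 32
limbs/loop from 0 to +124, so in both cases every limb is reachable with a
byte displacement.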
! 345:
! 346:
! 347:
! 348: COMPUTED JUMPS
! 349:
! 350: When working from least significant limb to most significant limb (most
! 351: routines) the computed jump and pointer calculations in preparation for an
! 352: unrolled loop are as follows.
! 353:
        S = operand size in limbs
        N = number of limbs per loop (UNROLL_COUNT)
        L = log2 of unrolling (UNROLL_LOG2)
        M = mask for unrolling (UNROLL_MASK)
        C = code bytes per limb in the loop
        B = bytes per limb (4 for x86)

        computed jump            (-S & M) * C + entrypoint
        subtract from pointers   (-S & M) * B
        initial loop counter     (S-1) >> L
        displacements            0 to B*(N-1)
! 365:
! 366: The loop counter is decremented at the end of each loop, and the looping
! 367: stops when the decrement takes the counter to -1. The displacements are for
! 368: the addressing accessing each limb, eg. a load with "movl disp(%ebx), %eax".
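
The same calculations can be followed in a C sketch, with a switch into the
middle of an unrolled loop standing in for the computed jump. This is an
illustration under assumptions, not GMP code: sketch_copyi and the SKETCH_
constants are invented names, the unrolling is fixed at 4 limbs/loop, and an
index "base" plays the part of the adjusted pointers. "C" has no counterpart
here, since the compiler decides the code bytes per limb.

```c
#include <stddef.h>
#include <stdint.h>

enum { SKETCH_UNROLL_LOG2 = 2,
       SKETCH_UNROLL_COUNT = 1 << SKETCH_UNROLL_LOG2,
       SKETCH_UNROLL_MASK = SKETCH_UNROLL_COUNT - 1 };

/* Copy size limbs from src to dst, least significant limb first. */
void sketch_copyi(uint32_t *dst, const uint32_t *src, size_t size)
{
    size_t skip;
    ptrdiff_t base, counter;

    if (size == 0)
        return;

    skip = (0 - size) & SKETCH_UNROLL_MASK;       /* (-S & M) */
    base = -(ptrdiff_t) skip;                     /* "subtract from pointers" */
    counter = (ptrdiff_t) ((size - 1) >> SKETCH_UNROLL_LOG2);

    switch (skip) {                               /* the computed jump */
    case 0: do { dst[base + 0] = src[base + 0];   /* fall through */
    case 1:      dst[base + 1] = src[base + 1];   /* fall through */
    case 2:      dst[base + 2] = src[base + 2];   /* fall through */
    case 3:      dst[base + 3] = src[base + 3];
                 base += SKETCH_UNROLL_COUNT;
            } while (--counter >= 0);             /* stops at -1 */
    }
}
```

For example S=5 gives a skip of 3 and an initial counter of 1, so the first
pass handles one limb and the second pass four. The index expression
base+k is never negative when it is actually used, since entry is at k=skip.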
! 369:
! 370: Usually the multiply by "C" can be handled without an imul, using instead an
! 371: leal, or a shift and subtract.
! 372:
! 373: When working from most significant to least significant limb (eg. mpn_lshift
! 374: and mpn_copyd), the calculations change as follows.
! 375:
        add to pointers          (-S & M) * B
        displacements            0 to -B*(N-1)
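
A C sketch of the decreasing direction, again only as an illustration: a
switch into an unrolled loop stands in for the computed jump, the name
sketch_copyd and the SKETCH_ constants are invented, the unrolling is fixed
at 4 limbs/loop, and an index "base" plays the part of pointers that start
at the most significant limb.

```c
#include <stddef.h>
#include <stdint.h>

enum { SKETCH_UNROLL_LOG2 = 2,
       SKETCH_UNROLL_COUNT = 1 << SKETCH_UNROLL_LOG2,
       SKETCH_UNROLL_MASK = SKETCH_UNROLL_COUNT - 1 };

/* Copy size limbs from src to dst, most significant limb first. */
void sketch_copyd(uint32_t *dst, const uint32_t *src, size_t size)
{
    size_t skip;
    ptrdiff_t base, counter;

    if (size == 0)
        return;

    skip = (0 - size) & SKETCH_UNROLL_MASK;           /* (-S & M) */
    base = (ptrdiff_t) (size - 1) + (ptrdiff_t) skip; /* "add to pointers" */
    counter = (ptrdiff_t) ((size - 1) >> SKETCH_UNROLL_LOG2);

    switch (skip) {                                   /* the computed jump */
    case 0: do { dst[base - 0] = src[base - 0];       /* fall through */
    case 1:      dst[base - 1] = src[base - 1];       /* fall through */
    case 2:      dst[base - 2] = src[base - 2];       /* fall through */
    case 3:      dst[base - 3] = src[base - 3];
                 base -= SKETCH_UNROLL_COUNT;
            } while (--counter >= 0);                 /* stops at -1 */
    }
}
```

Working downwards like this is what allows an overlapping copy where dst is
above src, for instance sketch_copyd(buf+1, buf, 7) to move seven limbs up
by one place.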
! 378:
! 379:
! 380:
! 381: OLD GAS 1.92.3
! 382:
! 383: This version comes with FreeBSD 2.2.8 and has a couple of gremlins that
! 384: affect GMP code.
! 385:
! 386: Firstly, an expression involving two forward references to labels comes out
! 387: as zero. For example,
! 388:
                addl    $bar-foo, %eax
        foo:
                nop
        bar:
! 393:
! 394: This should lead to "addl $1, %eax", but it comes out as "addl $0, %eax".
! 395: When only one forward reference is involved, it works correctly, as for
! 396: example,
! 397:
        foo:
                addl    $bar-foo, %eax
                nop
        bar:
! 402:
! 403: Secondly, an expression involving two labels can't be used as the
! 404: displacement for an leal. For example,
! 405:
        foo:
                nop
        bar:
                leal    bar-foo(%eax,%ebx,8), %ecx
! 410:
! 411: A slightly cryptic error is given, "Unimplemented segment type 0 in
! 412: parse_operand". When only one label is used it's ok, and the label can be a
! 413: forward reference too, as for example,
! 414:
                leal    foo(%eax,%ebx,8), %ecx
                nop
        foo:
! 418:
! 419: These problems only affect PIC computed jump calculations. The workarounds
! 420: are just to do an leal without a displacement and then an addl, and to make
! 421: sure the code is placed so that there's at most one forward reference in the
! 422: addl.
! 423:
! 424:
! 425:
! 426: REFERENCES
! 427:
! 428: "Intel Architecture Software Developer's Manual", volumes 1 to 3, 2001,
! 429: order numbers 245470, 245471 and 245472. Available on-line,
! 430:
! 431: http://developer.intel.com/design/pentium4/manuals/245470.htm
! 432: http://developer.intel.com/design/pentium4/manuals/245471.htm
! 433: http://developer.intel.com/design/pentium4/manuals/245472.htm
! 434:
! 435: "System V Application Binary Interface", Unix System Laboratories Inc, 1992,
! 436: published by Prentice Hall, ISBN 0-13-880410-9. And the "Intel386 Processor
! 437: Supplement", AT&T, 1991, ISBN 0-13-877689-X. These have details of calling
! 438: conventions and ELF shared library PIC coding. Versions of both available
! 439: on-line,
! 440:
! 441: http://www.sco.com/developer/devspecs
1.1 maekawa 442:
1.1.1.2 ! ohara 443: "Intel386 Family Binary Compatibility Specification 2", Intel Corporation,
! 444: published by McGraw-Hill, 1991, ISBN 0-07-031219-2. (Same as the above 386
! 445: ABI supplement.)
1.1 maekawa 446:
447:
448:
1.1.1.2 ! ohara 449: ----------------
! 450: Local variables:
! 451: mode: text
! 452: fill-column: 76
! 453: End: