Annotation of OpenXM_contrib/gmp/mpn/x86/README.family, Revision 1.1
1.1 ! maekawa 1:
! 2: X86 CPU FAMILY MPN SUBROUTINES
! 3:
! 4:
! 5: This file has some notes on things common to all the x86 family code.
! 6:
! 7:
! 8:
! 9: ASM FILES
! 10:
! 11: The x86 .asm files are BSD style x86 assembler code, first put through m4
! 12: for macro processing. The generic mpn/asm-defs.m4 is used, together with
! 13: mpn/x86/x86-defs.m4. Detailed notes are in those files.
! 14:
! 15: The code is meant for use with GNU "gas" or a system "as". There's no
! 16: support for assemblers that demand Intel style, and with gas freely
! 17: available and easy to use that shouldn't be a problem.
! 18:
! 19:
! 20:
! 21: STACK FRAME
! 22:
! 23: m4 macros are used to define the parameters passed on the stack, and these
! 24: act like comments on what the stack frame looks like too. For example,
! 25: mpn_mul_1() has the following.
! 26:
! 27:         defframe(PARAM_MULTIPLIER, 16)
! 28:         defframe(PARAM_SIZE, 12)
! 29:         defframe(PARAM_SRC, 8)
! 30:         defframe(PARAM_DST, 4)
! 31:
! 32: Here PARAM_MULTIPLIER gets defined as `FRAME+16(%esp)', and the others
! 33: similarly. The return address is at offset 0, but there's not normally any
! 34: need to access that.
! 35:
! 36: FRAME is redefined as necessary through the code so it's the number of bytes
! 37: pushed on the stack, and hence the offsets in the parameter macros stay
! 38: correct. At the start of a routine FRAME should be zero.
! 39:
! 40:         deflit(`FRAME',0)
! 41:         ...
! 42:         deflit(`FRAME',4)
! 43:         ...
! 44:         deflit(`FRAME',8)
! 45:         ...
! 46:
! 47: Helper macros FRAME_pushl(), FRAME_popl(), FRAME_addl_esp() and
! 48: FRAME_subl_esp() exist to adjust FRAME for the effect of those instructions,
! 49: and can be used instead of explicit definitions if preferred.
! 50: defframe_pushl() is a combination of FRAME_pushl() and defframe().
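
The FRAME bookkeeping can be modelled outside the assembler. The following C
sketch is only an illustration and is not GMP code: struct frame_model,
model_enter(), model_pushl() and model_slot() are hypothetical names, and the
stack is a small array addressed by byte offset. Under those assumptions it
shows that FRAME+offset(%esp) keeps naming the same slot as long as FRAME
tracks the number of bytes pushed.

```c
#include <assert.h>
#include <stdint.h>

/* Parameter offsets as in the mpn_mul_1() defframe() example. */
enum { PARAM_DST = 4, PARAM_SRC = 8, PARAM_SIZE = 12, PARAM_MULTIPLIER = 16 };

/* Hypothetical model: 16 four-byte stack slots, %esp held as a byte offset. */
struct frame_model {
    uint32_t mem[16];
    int esp;     /* current stack pointer, in bytes from mem[0] */
    int frame;   /* FRAME: bytes pushed since function entry */
};

/* State at function entry: return address at 0(%esp), parameters above. */
static void model_enter(struct frame_model *m, uint32_t dst, uint32_t src,
                        uint32_t size, uint32_t multiplier)
{
    m->esp = 32;                    /* leaves 8 free slots below for pushes */
    m->frame = 0;
    m->mem[m->esp / 4] = 0;         /* return address placeholder */
    m->mem[(m->esp + PARAM_DST) / 4] = dst;
    m->mem[(m->esp + PARAM_SRC) / 4] = src;
    m->mem[(m->esp + PARAM_SIZE) / 4] = size;
    m->mem[(m->esp + PARAM_MULTIPLIER) / 4] = multiplier;
}

/* pushl followed by FRAME_pushl(): %esp drops by 4, FRAME rises by 4. */
static void model_pushl(struct frame_model *m, uint32_t value)
{
    assert(m->esp >= 4);
    m->esp -= 4;
    m->mem[m->esp / 4] = value;
    m->frame += 4;
}

/* Read the slot that defframe(NAME, offset) names: FRAME+offset(%esp). */
static uint32_t model_slot(const struct frame_model *m, int offset)
{
    return m->mem[(m->esp + m->frame + offset) / 4];
}
```

After two model_pushl() calls FRAME is 8, model_slot(&m, PARAM_SRC) still
returns the src argument, and offsets -4 and -8 name the two pushed values.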
! 51:
! 52: There's generally some slackness in redefining FRAME. If new values aren't
! 53: going to get used, then the redefinitions are omitted to keep from
! 54: cluttering up the code. This happens for instance at the end of a routine,
! 55: where there might be just four register pops and then a ret, so FRAME isn't
! 56: getting used.
! 57:
! 58: Local variables and saved registers can be similarly defined, with negative
! 59: offsets representing stack space below the initial stack pointer. For
! 60: example,
! 61:
! 62:         defframe(SAVE_ESI, -4)
! 63:         defframe(SAVE_EDI, -8)
! 64:         defframe(VAR_COUNTER,-12)
! 65:
! 66:         deflit(STACK_SPACE, 12)
! 67:
! 68: Here STACK_SPACE gets used in a "subl $STACK_SPACE, %esp" to allocate the
! 69: space, and that instruction must be followed by a redefinition of FRAME
! 70: (setting it equal to STACK_SPACE) to reflect the change in %esp.
! 71:
! 72: Definitions for pushed registers are only put in when they're going to be
! 73: used. If registers are just saved and restored with pushes and pops then
! 74: definitions aren't made.
! 75:
! 76:
! 77:
! 78: ASSEMBLER EXPRESSIONS
! 79:
! 80: Only addition and subtraction seem to be universally available; certainly
! 81: that's all the Solaris 8 "as" seems to accept. If other operators are wanted
! 82: then m4 eval() should be used.
! 83:
! 84: In particular note that a "/" anywhere in a line starts a comment in Solaris
! 85: "as", and in some configurations of gas too.
! 86:
! 87:         addl    $32/2, %eax           <-- wrong
! 88:
! 89:         addl    $eval(32/2), %eax     <-- right
! 90:
! 91: Binutils gas/config/tc-i386.c has a choice between "/" being a comment
! 92: anywhere in a line, or only at the start. FreeBSD patches 2.9.1 to select
! 93: the latter, and as of 2.9.5 it's the default for GNU/Linux too.
! 94:
! 95:
! 96:
! 97: ASSEMBLER COMMENTS
! 98:
! 99: Solaris "as" doesn't support "#" commenting, using /* */ instead,
! 100: unfortunately. For that reason "C" commenting is used (see asm-defs.m4) and
! 101: the intermediate ".s" files have no comments.
! 102:
! 103:
! 104:
! 105: ZERO DISPLACEMENTS
! 106:
! 107: In a couple of places addressing modes like 0(%ebx) with a byte-sized zero
! 108: displacement are wanted, rather than (%ebx) with no displacement. These are
! 109: either for computed jumps or to get desirable code alignment. Explicit
! 110: .byte sequences are used to ensure the assembler doesn't turn 0(%ebx) into
! 111: (%ebx). The Zdisp() macro in x86-defs.m4 is used for this.
! 112:
! 113: Current gas 2.9.5 or recent 2.9.1 leave 0(%ebx) as written, but old gas
! 114: 1.92.3 changes it. In general, dropping a zero displacement is the sort of
! 115: "optimization" an assembler might perform, hence explicit ".byte"s are used
! 116: where necessary.
! 117:
! 118:
! 119:
! 120: SHLD/SHRD INSTRUCTIONS
! 121:
! 122: The %cl count forms of double shift instructions like "shldl %cl,%eax,%ebx"
! 123: must be written "shldl %eax,%ebx" for some assemblers. gas accepts either
! 124: form, Solaris "as" doesn't allow %cl, and gcc generates %cl for gas and NeXT
! 125: (which is gas) and omits %cl elsewhere.
! 126:
! 127: For GMP an autoconf test is used to determine whether %cl should be used and
! 128: the macros shldl, shrdl, shldw and shrdw in mpn/x86/x86-defs.m4 then pass
! 129: through or omit %cl as necessary. See comments with those macros for usage.
! 130:
! 131:
! 132:
! 133: DIRECTION FLAG
! 134:
! 135: The x86 calling conventions say that the direction flag should be clear at
! 136: function entry and exit. (See iBCS2 and SVR4 ABI books, references below.)
! 137:
! 138: Although this has been so since the year dot, it's not absolutely clear
! 139: whether it's universally respected. Since it's better to be safe than
! 140: sorry, gmp follows glibc and does a "cld" if it depends on the direction
! 141: flag being clear. This happens only in a few places.
! 142:
! 143:
! 144:
! 145: POSITION INDEPENDENT CODE
! 146:
! 147: Defining the symbol PIC in m4 processing selects position independent code.
! 148: This mainly affects computed jumps, and these are implemented in a
! 149: self-contained fashion (without using the global offset table). The few
! 150: calls from assembly code to global functions use the normal procedure
! 151: linkage table.
! 152:
! 153: PIC is necessary for ELF shared libraries because they can be mapped into
! 154: different processes at different virtual addresses. Text relocations in
! 155: shared libraries are allowed, but that presumably means a page with such a
! 156: relocation isn't shared. The use of the PLT for PIC adds a fixed cost to
! 157: every function call, which is small but might be noticeable when working with
! 158: small operands.
! 159:
! 160: Calls from one library function to another don't need to go through the PLT,
! 161: since of course the call instruction uses a displacement, not an absolute
! 162: address, and the relative locations of object files are known when libgmp.so
! 163: is created. "ld -Bsymbolic" (or "gcc -Wl,-Bsymbolic") will resolve calls
! 164: this way, so that there's no jump through the PLT, though this still leaves
! 165: setups of the GOT address in %ebx that may be unnecessary.
! 166:
! 167: The %ebx setup could be avoided in assembly if a separate option controlled
! 168: PIC for calls as opposed to computed jumps etc. But there's only ever
! 169: likely to be a handful of calls out of assembler, and getting the same
! 170: optimization for C intra-library calls would be more important. There seems
! 171: no easy way to tell gcc that certain functions can be called non-PIC, and
! 172: unfortunately many gmp functions use the global memory allocation variables,
! 173: so they need the GOT anyway. Object files with no global data references
! 174: and only intra-library calls could go into the library as non-PIC under
! 175: -Bsymbolic. Integrating this into libtool and automake is left as an
! 176: exercise for the reader.
! 177:
! 178:
! 179:
! 180: SIMPLE LOOPS
! 181:
! 182: The overheads in setting up for an unrolled loop can mean that at small
! 183: sizes a simple loop is faster. Making small sizes go fast is important,
! 184: even if it adds a cycle or two to bigger sizes. To this end various
! 185: routines choose between a simple loop and an unrolled loop according to
! 186: operand size. The path to the simple loop, or to special case code for
! 187: small sizes, is always as fast as possible.
! 188:
! 189: Adding a simple loop requires a conditional jump to choose between the
! 190: simple and unrolled code. The size of a branch misprediction penalty
! 191: affects whether a simple loop is worthwhile.
! 192:
! 193: The convention is for an m4 definition UNROLL_THRESHOLD to set the crossover
! 194: point, with sizes < UNROLL_THRESHOLD using the simple loop, sizes >=
! 195: UNROLL_THRESHOLD using the unrolled loop. If position independent code adds
! 196: a couple of cycles to an unrolled loop setup, the threshold will vary with
! 197: PIC or non-PIC. Something like the following is typical.
! 198:
! 199:         ifdef(`PIC',`
! 200:         deflit(UNROLL_THRESHOLD, 10)
! 201:         ',`
! 202:         deflit(UNROLL_THRESHOLD, 8)
! 203:         ')
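
The choice between the two paths can be pictured in C. This is a simplified
sketch under stated assumptions, not how the assembly is written:
sketch_copyi() is a hypothetical name, the threshold of 8 is just the non-PIC
value from the example above, and the unrolled path finishes leftover limbs
with a plain loop rather than entering the unrolled loop through a computed
jump.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNROLL_THRESHOLD 8   /* illustrative value only */

/* Copy size limbs from src to dst, low limb first.
   Returns 1 if the unrolled path was taken, 0 for the simple loop. */
static int sketch_copyi(uint32_t *dst, const uint32_t *src, size_t size)
{
    size_t i = 0;

    assert(size >= 1);

    /* One conditional jump selects the path; small sizes fall straight
       into the simple loop with no further setup. */
    if (size < UNROLL_THRESHOLD) {
        for (; i < size; i++)
            dst[i] = src[i];
        return 0;
    }

    /* Unrolled path: four limbs per pass. */
    for (; i + 4 <= size; i += 4) {
        dst[i + 0] = src[i + 0];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }
    /* Leftover limbs (a simplification, see the lead-in). */
    for (; i < size; i++)
        dst[i] = src[i];
    return 1;
}
```

Sizes below UNROLL_THRESHOLD pay only for the single comparison, which is the
property the simple loop is meant to preserve.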
! 204:
! 205: There's no automated way to determine the threshold. Setting it to a small
! 206: value and then to a big value makes it possible to measure the simple and
! 207: unrolled loops each over a range of sizes, from which the crossover point
! 208: can be determined. Alternatively, just adjust the threshold up or down until
! 209: there are no further speedups.
! 210:
! 211:
! 212:
! 213: UNROLLED LOOP CODING
! 214:
! 215: The x86 addressing modes allow a byte displacement of -128 to +127, making
! 216: it possible to access 256 bytes, which is 64 limbs, without adjusting
! 217: pointer registers within the loop. Dword sized displacements can be used
! 218: too, but they increase code size, and unrolling to 64 ought to be enough.
! 219:
! 220: When unrolling to the full 64 limbs/loop, the limb at the top of the loop
! 221: will have a displacement of -128, so pointers have to have a corresponding
! 222: +128 added before entering the loop. When unrolling to 32 limbs/loop
! 223: displacements 0 to 127 can be used with 0 at the top of the loop and no
! 224: adjustment needed to the pointers.
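
The displacement arithmetic can be checked with a few lines of C. This is a
sketch only: pointer_bias(), limb_disp() and fits_in_byte() are hypothetical
helper names, and the rule "bias by 128 whenever the unrolled span exceeds 128
bytes" is an assumption drawn from the two cases described above.

```c
#include <assert.h>

enum { BYTES_PER_LIMB = 4 };

/* Amount added to the pointers before entering the loop: +128 only when
   the unrolled span would not fit in displacements 0 to 127. */
static int pointer_bias(int unroll_count)
{
    return (unroll_count * BYTES_PER_LIMB > 128) ? 128 : 0;
}

/* Displacement used for limb i (0 = top of the loop) of one pass. */
static int limb_disp(int unroll_count, int i)
{
    assert(i >= 0 && i < unroll_count);
    return i * BYTES_PER_LIMB - pointer_bias(unroll_count);
}

/* True if disp can be encoded as a signed byte displacement. */
static int fits_in_byte(int disp)
{
    return disp >= -128 && disp <= 127;
}
```

With 64 limbs/loop the displacements run from -128 to 124, and with 32
limbs/loop from 0 to 124, so both stay within a signed byte. Without the bias
the last limb of a 64 limb pass would need displacement 252.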
! 225:
! 226: Where 64 limbs/loop is supported, the +128 adjustment is done only when 64
! 227: limbs/loop is selected. Usually the gain in speed using 64 instead of 32 or
! 228: 16 is small, so support for 64 limbs/loop is generally only for comparison.
! 229:
! 230:
! 231:
! 232: COMPUTED JUMPS
! 233:
! 234: When working from least significant limb to most significant limb (most
! 235: routines) the computed jump and pointer calculations in preparation for an
! 236: unrolled loop are as follows.
! 237:
! 238:         S = operand size in limbs
! 239:         N = number of limbs per loop (UNROLL_COUNT)
! 240:         L = log2 of unrolling (UNROLL_LOG2)
! 241:         M = mask for unrolling (UNROLL_MASK)
! 242:         C = code bytes per limb in the loop
! 243:         B = bytes per limb (4 for x86)
! 244:
! 245:         computed jump            (-S & M) * C + entrypoint
! 246:         subtract from pointers   (-S & M) * B
! 247:         initial loop counter     (S-1) >> L
! 248:         displacements            0 to B*(N-1)
! 249:
! 250: The loop counter is decremented at the end of each loop, and the looping
! 251: stops when the decrement takes the counter to -1. The displacements are those
! 252: used to address each limb, e.g. a load with "movl disp(%ebx), %eax".
! 253:
! 254: Usually the multiply by "C" can be handled without an imul, using instead an
! 255: leal, or a shift and subtract.
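
The same arithmetic can be tried out in C, with a switch into the middle of a
loop standing in for the computed jump. This is a sketch under stated
assumptions, not the assembly: sketch_copyi_unrolled() is a hypothetical name,
the unrolling is fixed at N = 4, and the pointer adjustment is kept as a
signed limb index (base) instead of actually moving the pointers, so no
pointer ever refers to memory before the operands.

```c
#include <assert.h>
#include <stdint.h>

enum {
    UNROLL_LOG2  = 2,                    /* L */
    UNROLL_COUNT = 1 << UNROLL_LOG2,     /* N */
    UNROLL_MASK  = UNROLL_COUNT - 1      /* M */
};

/* Copy size limbs from src to dst, least significant limb first. */
static void sketch_copyi_unrolled(uint32_t *dst, const uint32_t *src,
                                  unsigned long size)
{
    unsigned long skip;
    long base, counter;

    assert(size >= 1);

    skip    = (0UL - size) & UNROLL_MASK;           /* (-S & M) limbs */
    base    = -(long) skip;                         /* "subtract from pointers" */
    counter = (long) ((size - 1) >> UNROLL_LOG2);   /* (S-1) >> L */

    /* Enter the unrolled loop skip limbs down from its top. */
    switch (skip) {
    case 0: do { dst[base + 0] = src[base + 0];
    case 1:      dst[base + 1] = src[base + 1];
    case 2:      dst[base + 2] = src[base + 2];
    case 3:      dst[base + 3] = src[base + 3];
                 base += UNROLL_COUNT;
            } while (--counter >= 0);   /* stop when the counter reaches -1 */
    }
}
```

For S = 5, for instance, skip is 3 and the counter starts at 1: the first
pass enters at the last slot and copies one limb, and the second pass copies
the remaining four.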
! 256:
! 257: When working from most significant to least significant limb (eg. mpn_lshift
! 258: and mpn_copyd), the calculations change as follows.
! 259:
! 260:         add to pointers          (-S & M) * B
! 261:         displacements            0 to -B*(N-1)
! 262:
! 263:
! 264:
! 265: OLD GAS 1.92.3
! 266:
! 267: This version comes with FreeBSD 2.2.8 and has a couple of gremlins that
! 268: affect gmp code.
! 269:
! 270: Firstly, an expression involving two forward references to labels comes out
! 271: as zero. For example,
! 272:
! 273:                 addl    $bar-foo, %eax
! 274:         foo:
! 275:                 nop
! 276:         bar:
! 277:
! 278: This should lead to "addl $1, %eax", but it comes out as "addl $0, %eax".
! 279: When only one forward reference is involved, it works correctly, as for
! 280: example,
! 281:
! 282:         foo:
! 283:                 addl    $bar-foo, %eax
! 284:                 nop
! 285:         bar:
! 286:
! 287: Secondly, an expression involving two labels can't be used as the
! 288: displacement for an leal. For example,
! 289:
! 290:         foo:
! 291:                 nop
! 292:         bar:
! 293:                 leal    bar-foo(%eax,%ebx,8), %ecx
! 294:
! 295: A slightly cryptic error is given, "Unimplemented segment type 0 in
! 296: parse_operand". When only one label is used it's ok, and the label can be a
! 297: forward reference too, as for example,
! 298:
! 299:                 leal    foo(%eax,%ebx,8), %ecx
! 300:                 nop
! 301:         foo:
! 302:
! 303: These problems only affect PIC computed jump calculations. The workarounds
! 304: are just to do an leal without a displacement and then an addl, and to make
! 305: sure the code is placed so that there's at most one forward reference in the
! 306: addl.
! 307:
! 308:
! 309:
! 310: REFERENCES
! 311:
! 312: "Intel Architecture Software Developer's Manual", volumes 1 to 3, 1999,
! 313: order numbers 243190, 243191 and 243192. Available on-line,
! 314:
! 315: ftp://download.intel.com/design/PentiumII/manuals/243190.htm
! 316: ftp://download.intel.com/design/PentiumII/manuals/243191.htm
! 317: ftp://download.intel.com/design/PentiumII/manuals/243192.htm
! 318:
! 319: "Intel386 Family Binary Compatibility Specification 2", Intel Corporation,
! 320: published by McGraw-Hill, 1991, ISBN 0-07-031219-2.
! 321:
! 322: "System V Application Binary Interface", Unix System Laboratories Inc, 1992,
! 323: published by Prentice Hall, ISBN 0-13-880410-9. And the "Intel386 Processor
! 324: Supplement", AT&T, 1991, ISBN 0-13-877689-X. (These have details of ELF
! 325: shared library PIC coding.)
! 326:
! 327:
! 328:
! 329: ----------------
! 330: Local variables:
! 331: mode: text
! 332: fill-column: 76
! 333: End: