Annotation of OpenXM_contrib/gmp/mpn/x86/README.family, Revision 1.1.1.1

1.1       maekawa     1:
                      2:                     X86 CPU FAMILY MPN SUBROUTINES
                      3:
                      4:
                      5: This file has some notes on things common to all the x86 family code.
                      6:
                      7:
                      8:
                      9: ASM FILES
                     10:
                     11: The x86 .asm files are BSD style x86 assembler code, first put through m4
                     12: for macro processing.  The generic mpn/asm-defs.m4 is used, together with
                     13: mpn/x86/x86-defs.m4.  Detailed notes are in those files.
                     14:
                     15: The code is meant for use with GNU "gas" or a system "as".  There's no
                     16: support for assemblers that demand Intel style, and with gas freely
                     17: available and easy to use that shouldn't be a problem.
                     18:
                     19:
                     20:
                     21: STACK FRAME
                     22:
                     23: m4 macros are used to define the parameters passed on the stack, and these
                     24: act like comments on what the stack frame looks like too.  For example,
                     25: mpn_mul_1() has the following.
                     26:
                     27:         defframe(PARAM_MULTIPLIER, 16)
                     28:         defframe(PARAM_SIZE,       12)
                     29:         defframe(PARAM_SRC,         8)
                     30:         defframe(PARAM_DST,         4)
                     31:
                     32: Here PARAM_MULTIPLIER gets defined as `FRAME+16(%esp)', and the others
                     33: similarly.  The return address is at offset 0, but there's not normally any
                     34: need to access that.
                     35:
                     36: FRAME is redefined as necessary through the code so it's the number of bytes
                     37: pushed on the stack, and hence the offsets in the parameter macros stay
                     38: correct.  At the start of a routine FRAME should be zero.
                     39:
                     40:         deflit(`FRAME',0)
                     41:        ...
                     42:        deflit(`FRAME',4)
                     43:        ...
                     44:        deflit(`FRAME',8)
                     45:        ...
                     46:
                     47: Helper macros FRAME_pushl(), FRAME_popl(), FRAME_addl_esp() and
                     48: FRAME_subl_esp() exist to adjust FRAME for the effect of those instructions,
                     49: and can be used instead of explicit definitions if preferred.
                     50: defframe_pushl() is a combination of FRAME_pushl() and defframe().
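
The FRAME bookkeeping described above can be modelled outside m4.  The
following Python sketch is not GMP code: the FrameTracker class and its
method names are hypothetical, and only mimic what defframe() and
FRAME_pushl() are described as doing.  Under that assumption it shows that
a parameter macro keeps naming the same stack slot so long as FRAME equals
the number of bytes pushed.

```python
class FrameTracker:
    """Toy model of FRAME and defframe() offsets (hypothetical, not GMP code)."""

    def __init__(self, entry_esp):
        self.esp = entry_esp    # %esp at function entry, return address here
        self.frame = 0          # deflit(`FRAME',0) at the start of a routine
        self.offsets = {}

    def defframe(self, name, offset):
        # The offset is relative to %esp as it was at function entry.
        self.offsets[name] = offset

    def pushl(self):
        # A pushl moves %esp down by 4, and FRAME_pushl() adds 4 to FRAME.
        self.esp -= 4
        self.frame += 4

    def operand(self, name):
        # Text the parameter macro would stand for: FRAME+offset(%esp).
        return "%d(%%esp)" % (self.frame + self.offsets[name])

    def address(self, name):
        # Absolute address that operand refers to.
        return self.esp + self.frame + self.offsets[name]


t = FrameTracker(entry_esp=0x1000)
t.defframe("PARAM_MULTIPLIER", 16)

before = t.address("PARAM_MULTIPLIER")
t.pushl()                       # e.g. saving two registers
t.pushl()
after = t.address("PARAM_MULTIPLIER")

print(t.operand("PARAM_MULTIPLIER"), before == after)
```

After two pushes the operand text changes from 16(%esp) to 24(%esp), but
the address it refers to does not move.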
                     51:
                     52: There's generally some slackness in redefining FRAME.  If new values aren't
                     53: going to get used, then the redefinitions are omitted to keep from
                     54: cluttering up the code.  This happens for instance at the end of a routine,
                     55: where there might be just four register pops and then a ret, so FRAME isn't
                     56: getting used.
                     57:
                     58: Local variables and saved registers can be similarly defined, with negative
                     59: offsets representing stack space below the initial stack pointer.  For
                     60: example,
                     61:
                     62:        defframe(SAVE_ESI,   -4)
                     63:        defframe(SAVE_EDI,   -8)
                     64:        defframe(VAR_COUNTER,-12)
                     65:
                     66:        deflit(STACK_SPACE, 12)
                     67:
                     68: Here STACK_SPACE gets used in a "subl $STACK_SPACE, %esp" to allocate the
                     69: space, and that instruction must be followed by a redefinition of FRAME
                     70: (setting it equal to STACK_SPACE) to reflect the change in %esp.
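
As a rough check of the arithmetic, the Python sketch below (an
illustration only, with a hypothetical helper name esp_offset) works out
where the negative-offset slots from the example land once FRAME has been
set to STACK_SPACE.  Each slot should fall inside the 12 bytes just
allocated.

```python
STACK_SPACE = 12
FRAME = STACK_SPACE     # value of FRAME after "subl $STACK_SPACE, %esp"

# Offsets as given to defframe() in the example above.
slots = {"SAVE_ESI": -4, "SAVE_EDI": -8, "VAR_COUNTER": -12}


def esp_offset(frame, offset):
    """Displacement from the current %esp for a slot defined with defframe()."""
    return frame + offset


for name, offset in slots.items():
    print(name, "%d(%%esp)" % esp_offset(FRAME, offset))

# Every slot lies within the space allocated by the subl.
in_range = all(0 <= esp_offset(FRAME, o) < STACK_SPACE for o in slots.values())
print(in_range)
```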
                     71:
                     72: Definitions for pushed registers are only put in when they're going to be
                     73: used.  If registers are just saved and restored with pushes and pops then
                     74: definitions aren't made.
                     75:
                     76:
                     77:
                     78: ASSEMBLER EXPRESSIONS
                     79:
                     80: Only addition and subtraction seem to be universally available; certainly
                     81: that's all the Solaris 8 "as" seems to accept.  If other expressions are
                     82: wanted then m4 eval() should be used.
                     83:
                     84: In particular note that a "/" anywhere in a line starts a comment in Solaris
                     85: "as", and in some configurations of gas too.
                     86:
                     87:        addl    $32/2, %eax           <-- wrong
                     88:
                     89:        addl    $eval(32/2), %eax     <-- right
                     90:
                     91: Binutils gas/config/tc-i386.c has a choice between "/" being a comment
                     92: anywhere in a line, or only at the start.  FreeBSD patches 2.9.1 to select
                     93: the latter, and as of 2.9.5 it's the default for GNU/Linux too.
                     94:
                     95:
                     96:
                     97: ASSEMBLER COMMENTS
                     98:
                     99: Solaris "as" unfortunately doesn't support "#" commenting, using /* */
                    100: instead.  For that reason "C" commenting is used, handled at the m4 stage
                    101: (see asm-defs.m4), and so the intermediate ".s" files have no comments.
                    102:
                    103:
                    104:
                    105: ZERO DISPLACEMENTS
                    106:
                    107: In a couple of places addressing modes like 0(%ebx) with a byte-sized zero
                    108: displacement are wanted, rather than (%ebx) with no displacement.  These are
                    109: either for computed jumps or to get desirable code alignment.  Explicit
                    110: .byte sequences are used to ensure the assembler doesn't turn 0(%ebx) into
                    111: (%ebx).  The Zdisp() macro in x86-defs.m4 is used for this.
                    112:
                    113: Current gas 2.9.5 and recent 2.9.1 leave 0(%ebx) as written, but old gas
                    114: 1.92.3 changes it.  In general such a change is the sort of "optimization"
                    115: an assembler might perform, hence explicit ".byte"s are used where
                    116: necessary.
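
To make the size difference concrete, the Python sketch below builds the
standard IA-32 encodings of a load with and without a byte-sized zero
displacement.  It is an illustration only and says nothing about how
Zdisp() itself is written; the function name movl_load and the register
table are hypothetical.  %esp and %ebp are left out as base registers
because their encodings have special cases.

```python
# Register numbers as used in the ModRM byte.
REG = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3, "esi": 6, "edi": 7}


def movl_load(base, dest, force_disp8=False):
    """Encode "movl (base), dest", optionally as "movl 0(base), dest"."""
    opcode = 0x8B                           # MOV r32, r/m32
    mod = 0b01 if force_disp8 else 0b00     # 01: disp8 follows, 00: none
    modrm = (mod << 6) | (REG[dest] << 3) | REG[base]
    code = [opcode, modrm]
    if force_disp8:
        code.append(0x00)                   # the explicit zero displacement
    return bytes(code)


print(movl_load("ebx", "eax").hex())                    # movl (%ebx), %eax
print(movl_load("ebx", "eax", force_disp8=True).hex())  # movl 0(%ebx), %eax
```

The two forms differ by one byte, which is what matters for computed
jumps and for alignment.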
                    117:
                    118:
                    119:
                    120: SHLD/SHRD INSTRUCTIONS
                    121:
                    122: The %cl count forms of double shift instructions like "shldl %cl,%eax,%ebx"
                    123: must be written "shldl %eax,%ebx" for some assemblers.  gas takes either
                    124: form, but Solaris "as" doesn't allow %cl.  gcc generates %cl for gas and
                    125: NeXT (which is gas), and omits %cl elsewhere.
                    126:
                    127: For GMP an autoconf test is used to determine whether %cl should be used and
                    128: the macros shldl, shrdl, shldw and shrdw in mpn/x86/x86-defs.m4 then pass
                    129: through or omit %cl as necessary.  See comments with those macros for usage.
                    130:
                    131:
                    132:
                    133: DIRECTION FLAG
                    134:
                    135: The x86 calling conventions say that the direction flag should be clear at
                    136: function entry and exit.  (See iBCS2 and SVR4 ABI books, references below.)
                    137:
                    138: Although this has been so since the year dot, it's not absolutely clear
                    139: whether it's universally respected.  Since it's better to be safe than
                    140: sorry, gmp follows glibc and does a "cld" if it depends on the direction
                    141: flag being clear.  This happens only in a few places.
                    142:
                    143:
                    144:
                    145: POSITION INDEPENDENT CODE
                    146:
                    147: Defining the symbol PIC in m4 processing selects position independent code.
                    148: This mainly affects computed jumps, and these are implemented in a
                    149: self-contained fashion (without using the global offset table).  The few
                    150: calls from assembly code to global functions use the normal procedure
                    151: linkage table.
                    152:
                    153: PIC is necessary for ELF shared libraries because they can be mapped into
                    154: different processes at different virtual addresses.  Text relocations in
                    155: shared libraries are allowed, but that presumably means a page with such a
                    156: relocation isn't shared.  The use of the PLT for PIC adds a fixed cost to
                    157: every function call, which is small but might be noticeable when working with
                    158: small operands.
                    159:
                    160: Calls from one library function to another don't need to go through the PLT,
                    161: since of course the call instruction uses a displacement, not an absolute
                    162: address, and the relative locations of object files are known when libgmp.so
                    163: is created.  "ld -Bsymbolic" (or "gcc -Wl,-Bsymbolic") will resolve calls
                    164: this way, so that there's no jump through the PLT, though this still leaves
                    165: setups of the GOT address in %ebx that may be unnecessary.
                    166:
                    167: The %ebx setup could be avoided in assembly if a separate option controlled
                    168: PIC for calls as opposed to computed jumps etc.  But there's only ever
                    169: likely to be a handful of calls out of assembler, and getting the same
                    170: optimization for C intra-library calls would be more important.  There seems
                    171: no easy way to tell gcc that certain functions can be called non-PIC, and
                    172: unfortunately many gmp functions use the global memory allocation variables,
                    173: so they need the GOT anyway.  Object files with no global data references
                    174: and only intra-library calls could go into the library as non-PIC under
                    175: -Bsymbolic.  Integrating this into libtool and automake is left as an
                    176: exercise for the reader.
                    177:
                    178:
                    179:
                    180: SIMPLE LOOPS
                    181:
                    182: The overheads in setting up for an unrolled loop can mean that at small
                    183: sizes a simple loop is faster.  Making small sizes go fast is important,
                    184: even if it adds a cycle or two to bigger sizes.  To this end various
                    185: routines choose between a simple loop and an unrolled loop according to
                    186: operand size.  The path to the simple loop, or to special case code for
                    187: small sizes, is always as fast as possible.
                    188:
                    189: Adding a simple loop requires a conditional jump to choose between the
                    190: simple and unrolled code.  The size of a branch misprediction penalty
                    191: affects whether a simple loop is worthwhile.
                    192:
                    193: The convention is for an m4 definition UNROLL_THRESHOLD to set the crossover
                    194: point, with sizes < UNROLL_THRESHOLD using the simple loop, sizes >=
                    195: UNROLL_THRESHOLD using the unrolled loop.  If position independent code adds
                    196: a couple of cycles to an unrolled loop setup, the threshold will vary with
                    197: PIC or non-PIC.  Something like the following is typical.
                    198:
                    199:        ifdef(`PIC',`
                    200:        deflit(UNROLL_THRESHOLD, 10)
                    201:        ',`
                    202:        deflit(UNROLL_THRESHOLD, 8)
                    203:        ')
                    204:
                    205: There's no automated way to determine the threshold.  Setting it to a small
                    206: value and then to a big value makes it possible to measure the simple and
                    207: unrolled loops each over a range of sizes, from which the crossover point
                    208: can be determined.  Alternatively, just adjust the threshold up or down
                    209: until there are no more speedups.
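
The first approach might be carried out along the following lines.  This
Python sketch is only an illustration: the crossover() helper is
hypothetical and the cycle counts are made up, standing in for
measurements taken with the threshold forced high (simple loop) and low
(unrolled loop).

```python
def crossover(simple_cycles, unrolled_cycles):
    """Return the smallest size from which the unrolled loop is never slower.

    Both arguments map operand size in limbs to measured cycles.  None is
    returned if the unrolled loop is slower at the largest size measured.
    """
    sizes = sorted(set(simple_cycles) & set(unrolled_cycles))
    threshold = None
    for size in reversed(sizes):
        if unrolled_cycles[size] <= simple_cycles[size]:
            threshold = size
        else:
            break
    return threshold


# Made-up figures: simple loop 10 + 4 cycles/limb, unrolled 30 + 2 cycles/limb.
simple = {n: 10 + 4 * n for n in range(1, 21)}
unrolled = {n: 30 + 2 * n for n in range(1, 21)}

print(crossover(simple, unrolled))
```

With these figures the two loops tie at 10 limbs, so 10 would be the
candidate UNROLL_THRESHOLD.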
                    210:
                    211:
                    212:
                    213: UNROLLED LOOP CODING
                    214:
                    215: The x86 addressing modes allow a byte displacement of -128 to +127, making
                    216: it possible to access 256 bytes, which is 64 limbs, without adjusting
                    217: pointer registers within the loop.  Dword sized displacements can be used
                    218: too, but they increase code size, and unrolling to 64 ought to be enough.
                    219:
                    220: When unrolling to the full 64 limbs/loop, the limb at the top of the loop
                    221: will have a displacement of -128, so pointers have to have a corresponding
                    222: +128 added before entering the loop.  When unrolling to 32 limbs/loop
                    223: displacements 0 to 127 can be used with 0 at the top of the loop and no
                    224: adjustment needed to the pointers.
                    225:
                    226: Where 64 limbs/loop is supported, the +128 adjustment is done only when 64
                    227: limbs/loop is selected.  Usually the gain in speed using 64 instead of 32 or
                    228: 16 is small, so support for 64 limbs/loop is generally only for comparison.
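
The displacement ranges above can be checked with a few lines of Python.
This is only an arithmetic illustration, and the helper names are
hypothetical.

```python
BYTES_PER_LIMB = 4


def displacements(limbs_per_loop, first):
    """Byte displacements for each limb of one pass, starting at `first`."""
    return [first + BYTES_PER_LIMB * i for i in range(limbs_per_loop)]


def fits_disp8(disps):
    """True if every displacement fits a signed byte, -128 to +127."""
    return all(-128 <= d <= 127 for d in disps)


d64 = displacements(64, -128)   # 64 limbs/loop, pointers biased by +128
d32 = displacements(32, 0)      # 32 limbs/loop, no pointer adjustment

print(d64[0], d64[-1], fits_disp8(d64))
print(d32[0], d32[-1], fits_disp8(d32))
print(fits_disp8(displacements(64, 0)))     # 64 limbs starting at 0
```

The last line shows why the +128 adjustment is needed at 64 limbs/loop:
starting from 0, the upper half of the limbs would need displacements
beyond +127.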
                    229:
                    230:
                    231:
                    232: COMPUTED JUMPS
                    233:
                    234: When working from least significant limb to most significant limb (most
                    235: routines) the computed jump and pointer calculations in preparation for an
                    236: unrolled loop are as follows.
                    237:
                    238:        S = operand size in limbs
                    239:        N = number of limbs per loop (UNROLL_COUNT)
                    240:        L = log2 of unrolling (UNROLL_LOG2)
                    241:        M = mask for unrolling (UNROLL_MASK)
                    242:        C = code bytes per limb in the loop
                    243:        B = bytes per limb (4 for x86)
                    244:
                    245:        computed jump            (-S & M) * C + entrypoint
                    246:        subtract from pointers   (-S & M) * B
                    247:        initial loop counter     (S-1) >> L
                    248:        displacements            0 to B*(N-1)
                    249:
                    250: The loop counter is decremented at the end of each loop, and the looping
                    251: stops when the decrement takes the counter to -1.  The displacements are
                    252: those used to address each limb, e.g. a load with "movl disp(%ebx), %eax".
                    253:
                    254: Usually the multiply by "C" can be handled without an imul, using instead an
                    255: leal, or a shift and subtract.
                    256:
                    257: When working from most significant to least significant limb (eg. mpn_lshift
                    258: and mpn_copyd), the calculations change as follows.
                    259:
                    260:        add to pointers          (-S & M) * B
                    261:        displacements            0 to -B*(N-1)
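
The formulas above can be exercised with a small simulation.  The Python
sketch below is a model, not GMP code: the function name is hypothetical,
and it assumes that pointers move by N*B bytes after each pass of the
loop, which the notes above do not state explicitly.  It records the byte
offset of every limb touched, relative to the first limb processed.

```python
def unrolled_accesses(S, L, B=4, C=1, most_significant_first=False):
    """Return (jump offset in code bytes, list of byte offsets touched)."""
    N = 1 << L              # limbs per loop (UNROLL_COUNT)
    M = N - 1               # mask for unrolling (UNROLL_MASK)
    skip = -S & M           # limb slots jumped over on the first pass
    step = -B if most_significant_first else B

    jump = skip * C         # added to the loop entrypoint
    pointer = -skip * step  # subtract (or add) (-S & M) * B
    counter = (S - 1) >> L  # initial loop counter

    touched = []
    slot = skip
    while counter >= 0:
        while slot < N:
            # Slot k uses displacement k*B, or -k*B when going downwards.
            touched.append(pointer + slot * step)
            slot += 1
        pointer += N * step     # assumed per-pass pointer update
        slot = 0
        counter -= 1            # looping stops when this reaches -1
    return jump, touched


# 5 limbs, unrolled by 4, with a made-up 15 code bytes per limb.
print(unrolled_accesses(5, 2, C=15))
print(unrolled_accesses(5, 2, most_significant_first=True)[1])
```

In both directions exactly S limbs are touched, each once and in order,
which is the property the pointer and counter formulas are there to
ensure.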
                    262:
                    263:
                    264:
                    265: OLD GAS 1.92.3
                    266:
                    267: This version comes with FreeBSD 2.2.8 and has a couple of gremlins that
                    268: affect gmp code.
                    269:
                    270: Firstly, an expression involving two forward references to labels comes out
                    271: as zero.  For example,
                    272:
                    273:                addl    $bar-foo, %eax
                    274:        foo:
                    275:                nop
                    276:        bar:
                    277:
                    278: This should lead to "addl $1, %eax", but it comes out as "addl $0, %eax".
                    279: When only one forward reference is involved, it works correctly, as for
                    280: example,
                    281:
                    282:        foo:
                    283:                addl    $bar-foo, %eax
                    284:                nop
                    285:        bar:
                    286:
                    287: Secondly, an expression involving two labels can't be used as the
                    288: displacement for an leal.  For example,
                    289:
                    290:        foo:
                    291:                nop
                    292:        bar:
                    293:                leal    bar-foo(%eax,%ebx,8), %ecx
                    294:
                    295: A slightly cryptic error is given, "Unimplemented segment type 0 in
                    296: parse_operand".  When only one label is used it's ok, and the label can be a
                    297: forward reference too, as for example,
                    298:
                    299:                leal    foo(%eax,%ebx,8), %ecx
                    300:                nop
                    301:        foo:
                    302:
                    303: These problems only affect PIC computed jump calculations.  The workarounds
                    304: are just to do an leal without a displacement and then an addl, and to make
                    305: sure the code is placed so that there's at most one forward reference in the
                    306: addl.
                    307:
                    308:
                    309:
                    310: REFERENCES
                    311:
                    312: "Intel Architecture Software Developer's Manual", volumes 1 to 3, 1999,
                    313: order numbers 243190, 243191 and 243192.  Available on-line,
                    314:
                    315:        ftp://download.intel.com/design/PentiumII/manuals/243190.htm
                    316:        ftp://download.intel.com/design/PentiumII/manuals/243191.htm
                    317:        ftp://download.intel.com/design/PentiumII/manuals/243192.htm
                    318:
                    319: "Intel386 Family Binary Compatibility Specification 2", Intel Corporation,
                    320: published by McGraw-Hill, 1991, ISBN 0-07-031219-2.
                    321:
                    322: "System V Application Binary Interface", Unix System Laboratories Inc, 1992,
                    323: published by Prentice Hall, ISBN 0-13-880410-9.  And the "Intel386 Processor
                    324: Supplement", AT&T, 1991, ISBN 0-13-877689-X.  (These have details of ELF
                    325: shared library PIC coding.)
                    326:
                    327:
                    328:
                    329: ----------------
                    330: Local variables:
                    331: mode: text
                    332: fill-column: 76
                    333: End: