
Annotation of OpenXM_contrib/gmp/mpn/x86/README.family, Revision 1.1

1.1     ! maekawa     1:
        !             2:                     X86 CPU FAMILY MPN SUBROUTINES
        !             3:
        !             4:
        !             5: This file has some notes on things common to all the x86 family code.
        !             6:
        !             7:
        !             8:
        !             9: ASM FILES
        !            10:
        !            11: The x86 .asm files are BSD style x86 assembler code, first put through m4
        !            12: for macro processing.  The generic mpn/asm-defs.m4 is used, together with
        !            13: mpn/x86/x86-defs.m4.  Detailed notes are in those files.
        !            14:
        !            15: The code is meant for use with GNU "gas" or a system "as".  There's no
        !            16: support for assemblers that demand Intel style, and with gas freely
        !            17: available and easy to use that shouldn't be a problem.
        !            18:
        !            19:
        !            20:
        !            21: STACK FRAME
        !            22:
        !            23: m4 macros are used to define the parameters passed on the stack, and these
        !            24: act like comments on what the stack frame looks like too.  For example,
        !            25: mpn_mul_1() has the following.
        !            26:
        !            27:         defframe(PARAM_MULTIPLIER, 16)
        !            28:         defframe(PARAM_SIZE,       12)
        !            29:         defframe(PARAM_SRC,         8)
        !            30:         defframe(PARAM_DST,         4)
        !            31:
        !            32: Here PARAM_MULTIPLIER gets defined as `FRAME+16(%esp)', and the others
        !            33: similarly.  The return address is at offset 0, but there's not normally any
        !            34: need to access that.
        !            35:
        !            36: FRAME is redefined as necessary through the code so it's the number of bytes
        !            37: pushed on the stack, and hence the offsets in the parameter macros stay
        !            38: correct.  At the start of a routine FRAME should be zero.
        !            39:
        !            40:         deflit(`FRAME',0)
        !            41:        ...
        !            42:        deflit(`FRAME',4)
        !            43:        ...
        !            44:        deflit(`FRAME',8)
        !            45:        ...
        !            46:
        !            47: Helper macros FRAME_pushl(), FRAME_popl(), FRAME_addl_esp() and
        !            48: FRAME_subl_esp() exist to adjust FRAME for the effect of those instructions,
        !            49: and can be used instead of explicit definitions if preferred.
        !            50: defframe_pushl() is a combination of FRAME_pushl() and defframe().
        !            51:
        !            52: There's generally some slackness in redefining FRAME.  If new values aren't
        !            53: going to get used, then the redefinitions are omitted to keep from
        !            54: cluttering up the code.  This happens for instance at the end of a routine,
        !            55: where there might be just four register pops and then a ret, so FRAME isn't
        !            56: getting used.
        !            57:
        !            58: Local variables and saved registers can be similarly defined, with negative
        !            59: offsets representing stack space below the initial stack pointer.  For
        !            60: example,
        !            61:
        !            62:        defframe(SAVE_ESI,   -4)
        !            63:        defframe(SAVE_EDI,   -8)
        !            64:        defframe(VAR_COUNTER,-12)
        !            65:
        !            66:        deflit(STACK_SPACE, 12)
        !            67:
        !            68: Here STACK_SPACE gets used in a "subl $STACK_SPACE, %esp" to allocate the
        !            69: space, and that instruction must be followed by a redefinition of FRAME
        !            70: (setting it equal to STACK_SPACE) to reflect the change in %esp.
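
The FRAME bookkeeping described above can be modelled in C.  The following
is only a sketch under stated assumptions, not GMP code: the stack_model
type and the model_* functions are hypothetical names invented for
illustration, and the stack is a small array rather than real memory.  It
shows that an operand written FRAME+offset(%esp) keeps naming the same slot
after pushes or a "subl", for parameters (positive offsets) and locals
(negative offsets) alike.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model, not GMP code: a small downward-growing stack. */
enum { STACK_BYTES = 64 };

typedef struct {
    uint32_t mem[STACK_BYTES / 4];
    int esp;    /* byte offset of the stack top within mem */
    int frame;  /* plays the role of FRAME: bytes below the entry %esp */
} stack_model;

/* Stack as seen on entry: return address at 0(%esp), parameters above it. */
void model_enter(stack_model *s, uint32_t dst, uint32_t src,
                 uint32_t size, uint32_t multiplier)
{
    s->esp = STACK_BYTES - 5 * 4;
    s->frame = 0;
    s->mem[s->esp / 4 + 0] = 0;          /* return address placeholder */
    s->mem[s->esp / 4 + 1] = dst;        /* PARAM_DST,         4 */
    s->mem[s->esp / 4 + 2] = src;        /* PARAM_SRC,         8 */
    s->mem[s->esp / 4 + 3] = size;       /* PARAM_SIZE,       12 */
    s->mem[s->esp / 4 + 4] = multiplier; /* PARAM_MULTIPLIER, 16 */
}

/* Models "pushl" plus the matching FRAME adjustment. */
void model_pushl(stack_model *s, uint32_t value)
{
    assert(s->esp >= 4);
    s->esp -= 4;
    s->frame += 4;
    s->mem[s->esp / 4] = value;
}

/* Models "subl $bytes, %esp" plus the matching FRAME adjustment. */
void model_subl_esp(stack_model *s, int bytes)
{
    assert(bytes % 4 == 0 && s->esp >= bytes);
    s->esp -= bytes;
    s->frame += bytes;
}

/* Word index of the slot named by an operand FRAME+offset(%esp). */
int model_slot(const stack_model *s, int offset)
{
    int addr = s->esp + s->frame + offset;
    assert(addr >= 0 && addr % 4 == 0 && addr < STACK_BYTES);
    return addr / 4;
}
```

In this model, offset 16 selects the multiplier both before and after two
pushes, and after a 12 byte "subl" offset -12 selects the slot at the new
stack top.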
        !            71:
        !            72: Definitions for pushed registers are only put in when they're going to be
        !            73: used.  If registers are just saved and restored with pushes and pops then
        !            74: definitions aren't made.
        !            75:
        !            76:
        !            77:
        !            78: ASSEMBLER EXPRESSIONS
        !            79:
        !            80: Only addition and subtraction seem to be universally available; certainly
        !            81: that's all the Solaris 8 "as" seems to accept.  If other expressions are
        !            82: wanted then m4 eval() should be used.
        !            83:
        !            84: In particular note that a "/" anywhere in a line starts a comment in Solaris
        !            85: "as", and in some configurations of gas too.
        !            86:
        !            87:        addl    $32/2, %eax           <-- wrong
        !            88:
        !            89:        addl    $eval(32/2), %eax     <-- right
        !            90:
        !            91: Binutils gas/config/tc-i386.c has a choice between "/" being a comment
        !            92: anywhere in a line, or only at the start.  FreeBSD patches 2.9.1 to select
        !            93: the latter, and as of 2.9.5 it's the default for GNU/Linux too.
        !            94:
        !            95:
        !            96:
        !            97: ASSEMBLER COMMENTS
        !            98:
        !            99: Unfortunately Solaris "as" doesn't support "#" commenting, using /* */
        !           100: instead.  For that reason "C" commenting is used (see asm-defs.m4) and
        !           101: the intermediate ".s" files have no comments.
        !           102:
        !           103:
        !           104:
        !           105: ZERO DISPLACEMENTS
        !           106:
        !           107: In a couple of places addressing modes like 0(%ebx) with a byte-sized zero
        !           108: displacement are wanted, rather than (%ebx) with no displacement.  These are
        !           109: either for computed jumps or to get desirable code alignment.  Explicit
        !           110: .byte sequences are used to ensure the assembler doesn't turn 0(%ebx) into
        !           111: (%ebx).  The Zdisp() macro in x86-defs.m4 is used for this.
        !           112:
        !           113: Current gas 2.9.5 or recent 2.9.1 leave 0(%ebx) as written, but old gas
        !           114: 1.92.3 changes it.  In general such a change is the sort of "optimization"
        !           115: an assembler might perform, hence explicit ".byte"s are used where
        !           116: necessary.
        !           117:
        !           118:
        !           119:
        !           120: SHLD/SHRD INSTRUCTIONS
        !           121:
        !           122: The %cl count forms of double shift instructions like "shldl %cl,%eax,%ebx"
        !           123: must be written "shldl %eax,%ebx" for some assemblers.  gas takes either
        !           124: form, Solaris "as" doesn't allow %cl, and gcc generates %cl for gas and
        !           125: NeXT (which is gas) but omits it elsewhere.
        !           126:
        !           127: For GMP an autoconf test is used to determine whether %cl should be used and
        !           128: the macros shldl, shrdl, shldw and shrdw in mpn/x86/x86-defs.m4 then pass
        !           129: through or omit %cl as necessary.  See comments with those macros for usage.
        !           130:
        !           131:
        !           132:
        !           133: DIRECTION FLAG
        !           134:
        !           135: The x86 calling conventions say that the direction flag should be clear at
        !           136: function entry and exit.  (See iBCS2 and SVR4 ABI books, references below.)
        !           137:
        !           138: Although this has been so since the year dot, it's not absolutely clear
        !           139: whether it's universally respected.  Since it's better to be safe than
        !           140: sorry, gmp follows glibc and does a "cld" if it depends on the direction
        !           141: flag being clear.  This happens only in a few places.
        !           142:
        !           143:
        !           144:
        !           145: POSITION INDEPENDENT CODE
        !           146:
        !           147: Defining the symbol PIC in m4 processing selects position independent code.
        !           148: This mainly affects computed jumps, and these are implemented in a
        !           149: self-contained fashion (without using the global offset table).  The few
        !           150: calls from assembly code to global functions use the normal procedure
        !           151: linkage table.
        !           152:
        !           153: PIC is necessary for ELF shared libraries because they can be mapped into
        !           154: different processes at different virtual addresses.  Text relocations in
        !           155: shared libraries are allowed, but that presumably means a page with such a
        !           156: relocation isn't shared.  The use of the PLT for PIC adds a fixed cost to
        !           157: every function call, which is small but might be noticeable when working with
        !           158: small operands.
        !           159:
        !           160: Calls from one library function to another don't need to go through the PLT,
        !           161: since of course the call instruction uses a displacement, not an absolute
        !           162: address, and the relative locations of object files are known when libgmp.so
        !           163: is created.  "ld -Bsymbolic" (or "gcc -Wl,-Bsymbolic") will resolve calls
        !           164: this way, so that there's no jump through the PLT, but of course leaving
        !           165: setups of the GOT address in %ebx that may be unnecessary.
        !           166:
        !           167: The %ebx setup could be avoided in assembly if a separate option controlled
        !           168: PIC for calls as opposed to computed jumps etc.  But there's only ever
        !           169: likely to be a handful of calls out of assembler, and getting the same
        !           170: optimization for C intra-library calls would be more important.  There seems
        !           171: no easy way to tell gcc that certain functions can be called non-PIC, and
        !           172: unfortunately many gmp functions use the global memory allocation variables,
        !           173: so they need the GOT anyway.  Object files with no global data references
        !           174: and only intra-library calls could go into the library as non-PIC under
        !           175: -Bsymbolic.  Integrating this into libtool and automake is left as an
        !           176: exercise for the reader.
        !           177:
        !           178:
        !           179:
        !           180: SIMPLE LOOPS
        !           181:
        !           182: The overheads in setting up for an unrolled loop can mean that at small
        !           183: sizes a simple loop is faster.  Making small sizes go fast is important,
        !           184: even if it adds a cycle or two to bigger sizes.  To this end various
        !           185: routines choose between a simple loop and an unrolled loop according to
        !           186: operand size.  The path to the simple loop, or to special case code for
        !           187: small sizes, is always as fast as possible.
        !           188:
        !           189: Adding a simple loop requires a conditional jump to choose between the
        !           190: simple and unrolled code.  The size of a branch misprediction penalty
        !           191: affects whether a simple loop is worthwhile.
        !           192:
        !           193: The convention is for an m4 definition UNROLL_THRESHOLD to set the crossover
        !           194: point, with sizes < UNROLL_THRESHOLD using the simple loop, sizes >=
        !           195: UNROLL_THRESHOLD using the unrolled loop.  If position independent code adds
        !           196: a couple of cycles to an unrolled loop setup, the threshold will vary with
        !           197: PIC or non-PIC.  Something like the following is typical.
        !           198:
        !           199:        ifdef(`PIC',`
        !           200:        deflit(UNROLL_THRESHOLD, 10)
        !           201:        ',`
        !           202:        deflit(UNROLL_THRESHOLD, 8)
        !           203:        ')
        !           204:
        !           205: There's no automated way to determine the threshold.  Setting it to a small
        !           206: value and then to a big value makes it possible to measure the simple and
        !           207: unrolled loops each over a range of sizes, from which the crossover point
        !           208: can be determined.  Alternatively, just adjust the threshold up or down
        !           209: until there are no more speedups.
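
Once both loops have been timed over a range of sizes, picking the
crossover is simple arithmetic.  The sketch below is one possible way to do
it, not part of GMP: find_unroll_threshold() is a hypothetical helper, and
the two arrays are assumed to hold measurements (in any consistent unit)
for sizes 1, 2, 3 and so on, obtained with the threshold forced large and
then small.

```c
#include <assert.h>
#include <stddef.h>

/* simple[i] and unrolled[i] hold the measured time for operand size i+1.
   Returns the smallest size T such that the unrolled loop is at least as
   fast as the simple loop for every measured size >= T.  If the unrolled
   loop loses at the largest measured size, returns n_sizes+1, meaning the
   crossover lies beyond the measured range.  */
size_t find_unroll_threshold(const unsigned long *simple,
                             const unsigned long *unrolled,
                             size_t n_sizes)
{
    assert(simple != NULL && unrolled != NULL);

    size_t threshold = n_sizes + 1;

    /* Walk down from the largest size while the unrolled loop still wins. */
    for (size_t i = n_sizes; i-- > 0; ) {
        if (unrolled[i] > simple[i])
            break;
        threshold = i + 1;
    }
    return threshold;
}
```

The result is a candidate for UNROLL_THRESHOLD, to be confirmed by
re-measuring with that value in place, and separately for PIC and non-PIC
builds if the setup costs differ.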
        !           210:
        !           211:
        !           212:
        !           213: UNROLLED LOOP CODING
        !           214:
        !           215: The x86 addressing modes allow a byte displacement of -128 to +127, making
        !           216: it possible to access 256 bytes, which is 64 limbs, without adjusting
        !           217: pointer registers within the loop.  Dword sized displacements can be used
        !           218: too, but they increase code size, and unrolling to 64 ought to be enough.
        !           219:
        !           220: When unrolling to the full 64 limbs/loop, the limb at the top of the loop
        !           221: will have a displacement of -128, so pointers have to have a corresponding
        !           222: +128 added before entering the loop.  When unrolling to 32 limbs/loop
        !           223: displacements 0 to 127 can be used with 0 at the top of the loop and no
        !           224: adjustment needed to the pointers.
        !           225:
        !           226: Where 64 limbs/loop is supported, the +128 adjustment is done only when 64
        !           227: limbs/loop is selected.  Usually the gain in speed using 64 instead of 32 or
        !           228: 16 is small, so support for 64 limbs/loop is generally only for comparison.
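
The displacement ranges above can be checked with a little arithmetic.
This is an illustrative C sketch only; limb_displacement() and
unroll_fits_byte_disp() are hypothetical names, and "bias" stands for the
number of bytes added to the pointers before entering the loop (128 for 64
limbs/loop, 0 for 32 limbs/loop).

```c
#include <assert.h>

enum { BYTES_PER_LIMB = 4 };  /* 4 bytes per limb on x86 */

/* Displacement used for the i-th limb of one loop pass (i counted from 0
   at the top of the loop), after the pointers have had "bias" bytes added
   before entering the loop.  */
int limb_displacement(int i, int bias)
{
    assert(i >= 0 && bias >= 0);
    return i * BYTES_PER_LIMB - bias;
}

/* Nonzero if every limb of one pass is reachable with a byte-sized
   displacement, that is one in the range -128 to +127.  */
int unroll_fits_byte_disp(int limbs_per_loop, int bias)
{
    assert(limbs_per_loop >= 1);

    int first = limb_displacement(0, bias);
    int last = limb_displacement(limbs_per_loop - 1, bias);

    return first >= -128 && last <= 127;
}
```

With a bias of 128 the 64 limbs span displacements -128 to 124, which fits.
Without the bias the last limb would need displacement 252, which doesn't.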
        !           229:
        !           230:
        !           231:
        !           232: COMPUTED JUMPS
        !           233:
        !           234: When working from least significant limb to most significant limb (most
        !           235: routines) the computed jump and pointer calculations in preparation for an
        !           236: unrolled loop are as follows.
        !           237:
        !           238:        S = operand size in limbs
        !           239:        N = number of limbs per loop (UNROLL_COUNT)
        !           240:        L = log2 of unrolling (UNROLL_LOG2)
        !           241:        M = mask for unrolling (UNROLL_MASK)
        !           242:        C = code bytes per limb in the loop
        !           243:        B = bytes per limb (4 for x86)
        !           244:
        !           245:        computed jump            (-S & M) * C + entrypoint
        !           246:        subtract from pointers   (-S & M) * B
        !           247:        initial loop counter     (S-1) >> L
        !           248:        displacements            0 to B*(N-1)
        !           249:
        !           250: The loop counter is decremented at the end of each loop, and the looping
        !           251: stops when the decrement takes the counter to -1.  The displacements are
        !           252: for the addressing mode of each limb access, eg. "movl disp(%ebx), %eax".
        !           253:
        !           254: Usually the multiply by "C" can be handled without an imul, using instead an
        !           255: leal, or a shift and subtract.
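
The formulas above can be tried out in C, with a switch into the middle of
a loop standing in for the computed jump.  This is a sketch for
illustration, not how any GMP routine is written: model_copyi() is a
hypothetical name, the unrolling is fixed at 4 limbs per loop, and an index
"i" is adjusted instead of the pointers so that the C code never forms an
out-of-range pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    UNROLL_LOG2  = 2,                 /* L */
    UNROLL_COUNT = 1 << UNROLL_LOG2,  /* N */
    UNROLL_MASK  = UNROLL_COUNT - 1   /* M */
};

/* Copy "size" limbs from src to dst, least significant limb first. */
void model_copyi(uint32_t *dst, const uint32_t *src, size_t size)
{
    assert(size >= 1);

    /* (-S & M): limbs of the first pass to skip. */
    size_t skip = (0 - size) & UNROLL_MASK;

    /* (S-1) >> L: initial loop counter; looping stops when it hits -1. */
    ptrdiff_t counter = (ptrdiff_t) ((size - 1) >> UNROLL_LOG2);

    /* "subtract from pointers": done here on an index, counted in limbs. */
    ptrdiff_t i = -(ptrdiff_t) skip;

    /* The switch plays the role of the computed jump into the loop. */
    switch (skip) {
    case 0: do { dst[i + 0] = src[i + 0];
    case 1:      dst[i + 1] = src[i + 1];
    case 2:      dst[i + 2] = src[i + 2];
    case 3:      dst[i + 3] = src[i + 3];
                 i += UNROLL_COUNT;
            } while (--counter >= 0);
    }
}
```

For example with S=5 the skip is 3 and the counter starts at 1, so the
first pass handles one limb and the second pass handles the remaining four.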
        !           256:
        !           257: When working from most significant to least significant limb (eg. mpn_lshift
        !           258: and mpn_copyd), the calculations change as follows.
        !           259:
        !           260:        add to pointers          (-S & M) * B
        !           261:        displacements            0 to -B*(N-1)
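
The downward variant can be modelled in C the same way.  Again this is only
a sketch: model_copyd() is a hypothetical name, the unrolling is fixed at 4
limbs per loop, and it assumes the pointers start at the most significant
limb before the adjustment, which the notes above don't state explicitly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    UNROLL_LOG2  = 2,                 /* L */
    UNROLL_COUNT = 1 << UNROLL_LOG2,  /* N */
    UNROLL_MASK  = UNROLL_COUNT - 1   /* M */
};

/* Copy "size" limbs from src to dst, most significant limb first. */
void model_copyd(uint32_t *dst, const uint32_t *src, size_t size)
{
    assert(size >= 1);

    size_t skip = (0 - size) & UNROLL_MASK;                       /* -S & M */
    ptrdiff_t counter = (ptrdiff_t) ((size - 1) >> UNROLL_LOG2);  /* (S-1)>>L */

    /* Assumed start at the top limb, then "add to pointers", done here on
       an index counted in limbs.  */
    ptrdiff_t i = (ptrdiff_t) (size - 1) + (ptrdiff_t) skip;

    /* Displacements run from 0 down to -(N-1) limbs. */
    switch (skip) {
    case 0: do { dst[i - 0] = src[i - 0];
    case 1:      dst[i - 1] = src[i - 1];
    case 2:      dst[i - 2] = src[i - 2];
    case 3:      dst[i - 3] = src[i - 3];
                 i -= UNROLL_COUNT;
            } while (--counter >= 0);
    }
}
```

The computed jump target and the initial loop counter are the same as in
the upward case; only the sign of the pointer adjustment and of the
displacements changes.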
        !           262:
        !           263:
        !           264:
        !           265: OLD GAS 1.92.3
        !           266:
        !           267: This version comes with FreeBSD 2.2.8 and has a couple of gremlins that
        !           268: affect gmp code.
        !           269:
        !           270: Firstly, an expression involving two forward references to labels comes out
        !           271: as zero.  For example,
        !           272:
        !           273:                addl    $bar-foo, %eax
        !           274:        foo:
        !           275:                nop
        !           276:        bar:
        !           277:
        !           278: This should lead to "addl $1, %eax", but it comes out as "addl $0, %eax".
        !           279: When only one forward reference is involved, it works correctly, as for
        !           280: example,
        !           281:
        !           282:        foo:
        !           283:                addl    $bar-foo, %eax
        !           284:                nop
        !           285:        bar:
        !           286:
        !           287: Secondly, an expression involving two labels can't be used as the
        !           288: displacement for an leal.  For example,
        !           289:
        !           290:        foo:
        !           291:                nop
        !           292:        bar:
        !           293:                leal    bar-foo(%eax,%ebx,8), %ecx
        !           294:
        !           295: A slightly cryptic error is given, "Unimplemented segment type 0 in
        !           296: parse_operand".  When only one label is used it's ok, and the label can be a
        !           297: forward reference too, as for example,
        !           298:
        !           299:                leal    foo(%eax,%ebx,8), %ecx
        !           300:                nop
        !           301:        foo:
        !           302:
        !           303: These problems only affect PIC computed jump calculations.  The workarounds
        !           304: are just to do an leal without a displacement and then an addl, and to make
        !           305: sure the code is placed so that there's at most one forward reference in the
        !           306: addl.
        !           307:
        !           308:
        !           309:
        !           310: REFERENCES
        !           311:
        !           312: "Intel Architecture Software Developer's Manual", volumes 1 to 3, 1999,
        !           313: order numbers 243190, 243191 and 243192.  Available on-line,
        !           314:
        !           315:        ftp://download.intel.com/design/PentiumII/manuals/243190.htm
        !           316:        ftp://download.intel.com/design/PentiumII/manuals/243191.htm
        !           317:        ftp://download.intel.com/design/PentiumII/manuals/243192.htm
        !           318:
        !           319: "Intel386 Family Binary Compatibility Specification 2", Intel Corporation,
        !           320: published by McGraw-Hill, 1991, ISBN 0-07-031219-2.
        !           321:
        !           322: "System V Application Binary Interface", Unix System Laboratories Inc, 1992,
        !           323: published by Prentice Hall, ISBN 0-13-880410-9.  And the "Intel386 Processor
        !           324: Supplement", AT&T, 1991, ISBN 0-13-877689-X.  (These have details of ELF
        !           325: shared library PIC coding.)
        !           326:
        !           327:
        !           328:
        !           329: ----------------
        !           330: Local variables:
        !           331: mode: text
        !           332: fill-column: 76
        !           333: End:
