
Annotation of OpenXM_contrib/gmp/mpn/x86/README, Revision 1.1.1.2

1.1.1.2 ! ohara       1: Copyright 1999, 2000, 2001 Free Software Foundation, Inc.
        !             2:
        !             3: This file is part of the GNU MP Library.
        !             4:
        !             5: The GNU MP Library is free software; you can redistribute it and/or modify
        !             6: it under the terms of the GNU Lesser General Public License as published by
        !             7: the Free Software Foundation; either version 2.1 of the License, or (at your
        !             8: option) any later version.
        !             9:
        !            10: The GNU MP Library is distributed in the hope that it will be useful, but
        !            11: WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
        !            12: or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
        !            13: License for more details.
        !            14:
        !            15: You should have received a copy of the GNU Lesser General Public License
        !            16: along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
        !            17: the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
        !            18: 02111-1307, USA.
        !            19:
        !            20:
        !            21:
        !            22:
1.1       maekawa    23:
                     24:                       X86 MPN SUBROUTINES
                     25:
                     26:
                     27: This directory contains mpn functions for various 80x86 chips.
                     28:
                     29:
                     30: CODE ORGANIZATION
                     31:
1.1.1.2 ! ohara      32:        x86               i386, generic
        !            33:        x86/i486          i486
        !            34:        x86/pentium       Intel Pentium (P5, P54)
        !            35:        x86/pentium/mmx   Intel Pentium with MMX (P55)
        !            36:        x86/p6            Intel Pentium Pro
        !            37:        x86/p6/mmx        Intel Pentium II, III
        !            38:        x86/p6/p3mmx      Intel Pentium III
        !            39:        x86/k6            \ AMD K6
        !            40:        x86/k6/mmx        /
        !            41:        x86/k6/k62mmx     AMD K6-2
        !            42:        x86/k7            \ AMD Athlon
        !            43:        x86/k7/mmx        /
        !            44:        x86/pentium4      \
        !            45:        x86/pentium4/mmx  | Intel Pentium 4
        !            46:        x86/pentium4/sse2 /
1.1       maekawa    47:
                     48:
1.1.1.2 ! ohara      49: The top-level x86 directory contains blended style code, meant to be
        !            50: reasonable on all x86s.
1.1       maekawa    51:
                     52:
                     53:
                     54: STATUS
                     55:
1.1.1.2 ! ohara      56: The code is well-optimized for AMD and Intel chips, but there's nothing
        !            57: specific for Cyrix chips, nor for actual 80386 and 80486 chips.
        !            58:
        !            59:
        !            60:
        !            61: ASM FILES
        !            62:
        !            63: The x86 .asm files are BSD style assembler code, first put through m4 for
        !            64: macro processing.  The generic mpn/asm-defs.m4 is used, together with
        !            65: mpn/x86/x86-defs.m4.  See comments in those files.
        !            66:
        !            67: The code is meant for use with GNU "gas" or a system "as".  There's no
        !            68: support for assemblers that demand Intel style code.
        !            69:
        !            70:
        !            71:
        !            72: STACK FRAME
        !            73:
        !            74: m4 macros are used to define the parameters passed on the stack, and these
        !            75: act like comments on what the stack frame looks like too.  For example,
        !            76: mpn_mul_1() has the following.
        !            77:
        !            78:         defframe(PARAM_MULTIPLIER, 16)
        !            79:         defframe(PARAM_SIZE,       12)
        !            80:         defframe(PARAM_SRC,         8)
        !            81:         defframe(PARAM_DST,         4)
        !            82:
        !            83: PARAM_MULTIPLIER becomes `FRAME+16(%esp)', and the others similarly.  The
        !            84: return address is at offset 0, but there's not normally any need to access
        !            85: that.
        !            86:
        !            87: FRAME is redefined as necessary through the code so it's the number of bytes
        !            88: pushed on the stack, and hence the offsets in the parameter macros stay
        !            89: correct.  At the start of a routine FRAME should be zero.
        !            90:
        !            91:        deflit(`FRAME',0)
        !            92:        ...
        !            93:        deflit(`FRAME',4)
        !            94:        ...
        !            95:        deflit(`FRAME',8)
        !            96:        ...
        !            97:
        !            98: Helper macros FRAME_pushl(), FRAME_popl(), FRAME_addl_esp() and
        !            99: FRAME_subl_esp() exist to adjust FRAME for the effect of those instructions,
        !           100: and can be used instead of explicit definitions if preferred.
        !           101: defframe_pushl() is a combination of FRAME_pushl() and defframe().
        !           102:
        !           103: There's generally some slackness in redefining FRAME.  If new values aren't
        !           104: going to get used then the redefinitions are omitted to keep from cluttering
        !           105: up the code.  This happens for instance at the end of a routine, where there
        !           106: might be just four pops and then a ret, so FRAME isn't getting used.
        !           107:
        !           108: Local variables and saved registers can be similarly defined, with negative
        !           109: offsets representing stack space below the initial stack pointer.  For
        !           110: example,
        !           111:
        !           112:        defframe(SAVE_ESI,   -4)
        !           113:        defframe(SAVE_EDI,   -8)
        !           114:        defframe(VAR_COUNTER,-12)
        !           115:
        !           116:        deflit(STACK_SPACE, 12)
        !           117:
        !           118: Here STACK_SPACE gets used in a "subl $STACK_SPACE, %esp" to allocate the
        !           119: space, and that instruction must be followed by a redefinition of FRAME
        !           120: (setting it equal to STACK_SPACE) to reflect the change in %esp.
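
As a rough illustration of why FRAME keeps the parameter macros valid, the
following C model treats the stack as an array and mimics what an operand
built by defframe() and a FRAME redefinition work out to.  The type
model_stack and the helpers slot() and sub_esp() are hypothetical names
invented for this sketch; they are not part of x86-defs.m4.

```c
#include <assert.h>
#include <stdint.h>

/* Offsets as in the defframe() examples above. */
enum {
    PARAM_DST = 4, PARAM_SRC = 8, PARAM_SIZE = 12, PARAM_MULTIPLIER = 16,
    SAVE_ESI = -4, SAVE_EDI = -8, VAR_COUNTER = -12,
    STACK_SPACE = 12
};

/* Hypothetical model: a small stack, %esp as a byte offset into it, and
   frame as the number of bytes pushed since function entry (like FRAME). */
typedef struct {
    uint32_t mem[16];
    int esp;
    int frame;
} model_stack;

/* Equivalent of an operand FRAME+offset(%esp). */
uint32_t *slot(model_stack *s, int offset)
{
    int byte = s->esp + s->frame + offset;
    assert(byte >= 0 && byte < (int) sizeof s->mem && byte % 4 == 0);
    return &s->mem[byte / 4];
}

/* Equivalent of "subl $bytes, %esp" followed by redefining FRAME. */
void sub_esp(model_stack *s, int bytes)
{
    s->esp -= bytes;
    s->frame += bytes;
}
```

Starting from esp = 32 and frame = 0, slot(&s, PARAM_SIZE) names the same
word before and after sub_esp(&s, STACK_SPACE), and slot(&s, VAR_COUNTER)
lands exactly at the new stack pointer.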
        !           121:
        !           122: Definitions for pushed registers are only put in when they're going to be
        !           123: used.  If registers are just saved and restored with pushes and pops then
        !           124: definitions aren't made.
        !           125:
        !           126:
        !           127:
        !           128: ASSEMBLER EXPRESSIONS
        !           129:
        !           130: Only addition and subtraction seem to be universally available; certainly
        !           131: that's all the Solaris 8 "as" seems to accept.  If other operators are
        !           132: wanted then m4 eval() should be used to do the arithmetic before assembly.
        !           133:
        !           134: In particular note that a "/" anywhere in a line starts a comment in Solaris
        !           135: "as", and in some configurations of gas too.
        !           136:
        !           137:        addl    $32/2, %eax           <-- wrong
        !           138:
        !           139:        addl    $eval(32/2), %eax     <-- right
        !           140:
        !           141: Binutils gas/config/tc-i386.c has a choice between "/" being a comment
        !           142: anywhere in a line, or only at the start.  FreeBSD patches 2.9.1 to select
        !           143: the latter, and from 2.9.5 it's the default for GNU/Linux too.
        !           144:
        !           145:
        !           146:
        !           147: ASSEMBLER COMMENTS
        !           148:
        !           149: Solaris "as" doesn't support "#" commenting, using /* */ instead.  For that
        !           150: reason "C" commenting is used (see asm-defs.m4) and the intermediate ".s"
        !           151: files have no comments.
        !           152:
        !           153: Any comments before include(`../config.m4') must use m4 "dnl", since it's
        !           154: only after the include that "C" is available.  By convention "dnl" is also
        !           155: used for comments about m4 macros.
        !           156:
        !           157:
        !           158:
        !           159: TEMPORARY LABELS
        !           160:
        !           161: Temporary numbered labels like "1:" used as "1f" or "1b" are available in
        !           162: "gas" and Solaris "as", but not in SCO "as".  Normal L() labels should be
        !           163: used instead, possibly with a counter to make them unique, see jadcl0() for
        !           164: instance.  A separate counter for each macro makes it possible to nest them,
        !           165: for instance movl_text_address() can be used within an ASSERT().
        !           166:
        !           167: "1:" etc must be avoided in gcc __asm__ blocks too.  "%=" for generating a
        !           168: unique number looks like a good alternative, but is that actually a
        !           169: documented feature?  In any case this problem doesn't currently arise.
        !           170:
        !           171:
        !           172:
        !           173: ZERO DISPLACEMENTS
        !           174:
        !           175: In a couple of places addressing modes like 0(%ebx) with a byte-sized zero
        !           176: displacement are wanted, rather than (%ebx) with no displacement.  These are
        !           177: either for computed jumps or to get desirable code alignment.  Explicit
        !           178: .byte sequences are used to ensure the assembler doesn't turn 0(%ebx) into
        !           179: (%ebx).  The Zdisp() macro in x86-defs.m4 is used for this.
        !           180:
        !           181: Current gas 2.9.5 or recent 2.9.1 leave 0(%ebx) as written, but old gas
        !           182: 1.92.3 changes it.  In general changing would be the sort of "optimization"
        !           183: an assembler might perform, hence explicit ".byte"s are used where
        !           184: necessary.
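
The difference between the two forms is one byte of encoding.  The C sketch
below builds the machine code for a 32-bit register load both ways.
encode_movl_load() is a hypothetical helper written for this note, not
something in GMP, and it only handles base registers that need neither a SIB
byte nor a mandatory displacement.

```c
#include <assert.h>
#include <stddef.h>

/* Encode "movl (%base), %dst", or "movl 0(%base), %dst" with an explicit
   byte-sized zero displacement when force_disp8 is nonzero.  Registers use
   the hardware numbering: 0 = %eax, 1 = %ecx, 2 = %edx, 3 = %ebx, ...
   Returns the number of bytes written to out. */
size_t encode_movl_load(unsigned char *out, int dst, int base, int force_disp8)
{
    assert(dst >= 0 && dst <= 7);
    /* Base 4 (%esp) needs a SIB byte and base 5 (%ebp) has no
       displacement-free form, so neither is modelled here. */
    assert(base >= 0 && base <= 7 && base != 4 && base != 5);

    out[0] = 0x8B;                      /* MOV r32, r/m32 */
    if (force_disp8) {
        out[1] = (unsigned char) (0x40 | (dst << 3) | base);  /* mod = 01 */
        out[2] = 0x00;                  /* the zero displacement byte */
        return 3;
    }
    out[1] = (unsigned char) ((dst << 3) | base);             /* mod = 00 */
    return 2;
}
```

For dst = 0 (%eax) and base = 3 (%ebx) the forced form is the three bytes
8B 43 00, and the form with no displacement is the two bytes 8B 03.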
        !           185:
        !           186:
        !           187:
        !           188: SHLD/SHRD INSTRUCTIONS
        !           189:
        !           190: The %cl count forms of double shift instructions like "shldl %cl,%eax,%ebx"
        !           191: must be written "shldl %eax,%ebx" for some assemblers.  gas takes either,
        !           192: Solaris "as" doesn't allow %cl, gcc generates %cl for gas and NeXT (which is
        !           193: gas), and omits %cl elsewhere.
        !           194:
        !           195: For GMP an autoconf test GMP_ASM_X86_SHLDL_CL is used to determine whether
        !           196: %cl should be used, and the macros shldl, shrdl, shldw and shrdw in
        !           197: mpn/x86/x86-defs.m4 pass through or omit %cl as necessary.  See the comments
        !           198: with those macros for usage.
        !           199:
        !           200:
        !           201:
        !           202: IMUL INSTRUCTION
        !           203:
        !           204: GCC config/i386/i386.md (cvs rev 1.187, 21 Oct 00) under *mulsi3_1 notes
        !           205: that the following two forms produce identical object code
        !           206:
        !           207:        imul    $12, %eax
        !           208:        imul    $12, %eax, %eax
        !           209:
        !           210: but that the former isn't accepted by some assemblers, in particular the SCO
        !           211: OSR5 COFF assembler.  GMP follows GCC and uses only the latter form.
        !           212:
        !           213: (This applies only to immediate operands, since the three operand form is
        !           214: only valid with an immediate.)
        !           215:
        !           216:
        !           217:
        !           218: DIRECTION FLAG
        !           219:
        !           220: The x86 calling conventions say that the direction flag should be clear at
        !           221: function entry and exit.  (See iBCS2 and SVR4 ABI books, references below.)
        !           222: Although this has been so since the year dot, it's not absolutely clear
        !           223: whether it's universally respected.  Since it's better to be safe than
        !           224: sorry, GMP follows glibc and does a "cld" if it depends on the direction
        !           225: flag being clear.  This happens only in a few places.
        !           226:
        !           227:
        !           228:
        !           229: POSITION INDEPENDENT CODE
        !           230:
        !           231: Defining the symbol PIC in m4 processing selects SVR4 / ELF style position
        !           232: independent code.  This is necessary for shared libraries because they can
        !           233: be mapped into different processes at different virtual addresses.  Actually
        !           234: relocations are allowed, but presumably pages with relocations aren't
        !           235: shared, defeating the purpose of a shared library.
        !           236:
        !           237: The use of the PLT adds a fixed cost to every function call, and the GOT
        !           238: adds a cost to any function accessing global variables.  These are small but
        !           239: might be noticeable when working with small operands.
        !           240:
        !           241: Calls from one library function to another don't need to go through the PLT,
        !           242: since of course the call instruction uses a displacement, not an absolute
        !           243: address, and the relative locations of object files are known when libgmp.so
        !           244: is created.  "ld -Bsymbolic" (or "gcc -Wl,-Bsymbolic") will resolve calls
        !           245: this way, so that there's no jump through the PLT, but of course leaving
        !           246: setups of the GOT address in %ebx that may be unnecessary.
        !           247:
        !           248: The %ebx setup could be avoided in assembly if a separate option controlled
        !           249: PIC for calls as opposed to computed jumps etc.  But there's only ever
        !           250: likely to be a handful of calls out of assembler, and getting the same
        !           251: optimization for C intra-library calls would be more important.  There seems
        !           252: no easy way to tell gcc that certain functions can be called non-PIC, and
        !           253: unfortunately many GMP functions use the global memory allocation variables,
        !           254: so they need the GOT anyway.  Object files with no global data references
        !           255: and only intra-library calls could go into the library as non-PIC under
        !           256: -Bsymbolic.  Integrating this into libtool and automake is left as an
        !           257: exercise for the reader.
        !           258:
        !           259:
        !           260:
        !           261: GLOBAL OFFSET TABLE CODING
        !           262:
        !           263: It's believed the magic _GLOBAL_OFFSET_TABLE_ used by code establishing the
        !           264: address of the GOT should be written without a GSYM_PREFIX, ie. that it's
        !           265: the same "_GLOBAL_OFFSET_TABLE_" on an underscore or non-underscore system.
        !           266: Certainly this is true for instance of NetBSD 1.4 which is an underscore
        !           267: system but requires "_GLOBAL_OFFSET_TABLE_".
        !           268:
        !           269: Old gas 1.92.3 which comes with FreeBSD 2.2.8 gets a segmentation fault when
        !           270: asked to assemble the following,
        !           271:
        !           272:         L1:
        !           273:             addl  $_GLOBAL_OFFSET_TABLE_+[.-L1], %ebx
        !           274:
        !           275: It seems that using the label in the same instruction it refers to is the
        !           276: problem, since a nop in between works.  But the simplest workaround is to
        !           277: follow gcc and omit the +[.-L1] since it does nothing,
        !           278:
        !           279:             addl  $_GLOBAL_OFFSET_TABLE_, %ebx
        !           280:
        !           281: Current gas 2.10 generates incorrect object code when %eax is used in such a
        !           282: construction (with or without +[.-L1]),
        !           283:
        !           284:             addl  $_GLOBAL_OFFSET_TABLE_, %eax
        !           285:
        !           286: The R_386_GOTPC gets a displacement of 2 rather than the 1 appropriate for
        !           287: the 1 byte opcode of "addl $n,%eax".  The best workaround is just to use any
        !           288: other register, since then it's a two byte opcode+mod/rm.  GCC for example
        !           289: always uses %ebx (which is needed for calls through the PLT).
        !           290:
        !           291: A similar problem occurs in an leal (again with or without a +[.-L1]),
        !           292:
        !           293:             leal  _GLOBAL_OFFSET_TABLE_(%edi), %ebx
        !           294:
        !           295: This time the R_386_GOTPC gets a displacement of 0 rather than the 2
        !           296: appropriate for the opcode and mod/rm, making this form unusable.
        !           297:
        !           298:
        !           299:
        !           300: SIMPLE LOOPS
        !           301:
        !           302: The overheads in setting up for an unrolled loop can mean that at small
        !           303: sizes a simple loop is faster.  Making small sizes go fast is important,
        !           304: even if it adds a cycle or two to bigger sizes.  To this end various
        !           305: routines choose between a simple loop and an unrolled loop according to
        !           306: operand size.  The path to the simple loop, or to special case code for
        !           307: small sizes, is always as fast as possible.
        !           308:
        !           309: Adding a simple loop requires a conditional jump to choose between the
        !           310: simple and unrolled code.  The size of a branch misprediction penalty
        !           311: affects whether a simple loop is worthwhile.
        !           312:
        !           313: The convention is for an m4 definition UNROLL_THRESHOLD to set the crossover
        !           314: point, with sizes < UNROLL_THRESHOLD using the simple loop, sizes >=
        !           315: UNROLL_THRESHOLD using the unrolled loop.  If position independent code adds
        !           316: a couple of cycles to an unrolled loop setup, the threshold will vary with
        !           317: PIC or non-PIC.  Something like the following is typical.
        !           318:
        !           319:        deflit(UNROLL_THRESHOLD, ifdef(`PIC',10,8))
        !           320:
        !           321: There's no automated way to determine the threshold.  Setting it to a small
        !           322: value and then to a big value makes it possible to measure the simple and
        !           323: unrolled loops each over a range of sizes, from which the crossover point
        !           324: can be determined.  Alternatively, just adjust the threshold up or down
        !           325: until there are no more speedups.
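
The same dispatch can be sketched in C.  The function limbs_complement() and
the threshold value 8 are illustrative assumptions, not taken from any GMP
routine.  The C version also mops up leftover limbs with the simple loop,
where the assembler code would normally use a computed jump into the
unrolled loop instead.

```c
#include <stddef.h>

#define UNROLL_THRESHOLD 8   /* illustrative value only; tune by measurement */

/* dst[i] = ~src[i] for 0 <= i < size. */
void limbs_complement(unsigned long *dst, const unsigned long *src, size_t size)
{
    size_t i = 0;

    if (size >= UNROLL_THRESHOLD) {
        /* Unrolled path: four limbs per iteration. */
        for (; i + 4 <= size; i += 4) {
            dst[i]     = ~src[i];
            dst[i + 1] = ~src[i + 1];
            dst[i + 2] = ~src[i + 2];
            dst[i + 3] = ~src[i + 3];
        }
    }

    /* Simple loop: the whole operand when size < UNROLL_THRESHOLD,
       otherwise only the limbs left over from the unrolled path. */
    for (; i < size; i++)
        dst[i] = ~src[i];
}
```

Small operands pay only for one comparison before reaching the simple loop,
which is the property the threshold is meant to protect.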
        !           326:
        !           327:
        !           328:
        !           329: UNROLLED LOOP CODING
        !           330:
        !           331: The x86 addressing modes allow a byte displacement of -128 to +127, making
        !           332: it possible to access 256 bytes, which is 64 limbs, without adjusting
        !           333: pointer registers within the loop.  Dword sized displacements can be used
        !           334: too, but they increase code size, and unrolling to 64 ought to be enough.
        !           335:
        !           336: When unrolling to the full 64 limbs/loop, the limb at the top of the loop
        !           337: will have a displacement of -128, so pointers have to have a corresponding
        !           338: +128 added before entering the loop.  When unrolling to 32 limbs/loop
        !           339: displacements 0 to 127 can be used with 0 at the top of the loop and no
        !           340: adjustment needed to the pointers.
        !           341:
        !           342: Where 64 limbs/loop is supported, the +128 adjustment is done only when 64
        !           343: limbs/loop is selected.  Usually the gain in speed using 64 instead of 32 or
        !           344: 16 is small, so support for 64 limbs/loop is generally only for comparison.
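
The displacement arithmetic can be checked with a few lines of C.  The
helpers limb_disp() and fits_disp8() are hypothetical names for this sketch,
and the rule that any unrolling above 32 limbs takes the -128 bias is an
assumption generalizing the 64 limb case described above.

```c
#include <assert.h>

enum { BYTES_PER_LIMB = 4 };

/* Byte displacement of limb i (0 = top of the loop) when unrolling to
   `unroll` limbs per loop.  Above 32 limbs the pointers are assumed to have
   had +128 added beforehand, so every displacement is biased by -128. */
int limb_disp(int unroll, int i)
{
    assert(unroll >= 1 && unroll <= 64 && i >= 0 && i < unroll);
    int bias = (unroll > 32) ? -128 : 0;
    return bias + i * BYTES_PER_LIMB;
}

/* Nonzero if disp fits the signed byte displacement encoding. */
int fits_disp8(int disp)
{
    return disp >= -128 && disp <= 127;
}
```

At 64 limbs/loop the displacements run from -128 to +124, and at 32
limbs/loop from 0 to +124, so both stay within a byte.  Without the bias the
last limb of a 64 limb loop would need displacement 252, which does not fit.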
        !           345:
        !           346:
        !           347:
        !           348: COMPUTED JUMPS
        !           349:
        !           350: When working from least significant limb to most significant limb (most
        !           351: routines) the computed jump and pointer calculations in preparation for an
        !           352: unrolled loop are as follows.
        !           353:
        !           354:        S = operand size in limbs
        !           355:        N = number of limbs per loop (UNROLL_COUNT)
        !           356:        L = log2 of unrolling (UNROLL_LOG2)
        !           357:        M = mask for unrolling (UNROLL_MASK)
        !           358:        C = code bytes per limb in the loop
        !           359:        B = bytes per limb (4 for x86)
        !           360:
        !           361:        computed jump            (-S & M) * C + entrypoint
        !           362:        subtract from pointers   (-S & M) * B
        !           363:        initial loop counter     (S-1) >> L
        !           364:        displacements            0 to B*(N-1)
        !           365:
        !           366: The loop counter is decremented at the end of each loop, and the looping
        !           367: stops when the decrement takes the counter to -1.  The displacements are
        !           368: those used to address each limb, eg. a load with "movl disp(%ebx), %eax".
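
In C the nearest analogue of a computed jump into an unrolled loop is a
switch that enters the loop body part way through (Duff's device).  The
sketch below applies the formulas above with N = 4 to a limb copy.  The
function copy_incr() and the choice of a copy operation are illustrative
assumptions, not a GMP routine, and a signed index stands in for the
adjusted pointers so that no out-of-range pointer is ever formed.

```c
#include <assert.h>
#include <stddef.h>

enum {
    UNROLL_LOG2  = 2,                   /* L */
    UNROLL_COUNT = 1 << UNROLL_LOG2,    /* N */
    UNROLL_MASK  = UNROLL_COUNT - 1     /* M */
};

/* Copy size limbs (size >= 1) from src to dst, least significant first. */
void copy_incr(unsigned long *dst, const unsigned long *src, size_t size)
{
    assert(size >= 1);

    size_t skip = (0 - size) & UNROLL_MASK;        /* (-S & M) */
    ptrdiff_t base = -(ptrdiff_t) skip;            /* "subtract from pointers" */
    ptrdiff_t counter = (ptrdiff_t) ((size - 1) >> UNROLL_LOG2);

    /* The switch plays the part of the computed jump: it enters the loop
       body skip limbs in, so the first pass handles N - skip limbs. */
    switch (skip) {
    case 0: do { dst[base + 0] = src[base + 0];  /* fall through */
    case 1:      dst[base + 1] = src[base + 1];  /* fall through */
    case 2:      dst[base + 2] = src[base + 2];  /* fall through */
    case 3:      dst[base + 3] = src[base + 3];
                 base += UNROLL_COUNT;
            } while (--counter >= 0);   /* stop once the counter reaches -1 */
    }
}
```

For S = 5 this gives skip = 3 and an initial counter of 1, so the first pass
copies one limb and the second pass copies the remaining four.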
        !           369:
        !           370: Usually the multiply by "C" can be handled without an imul, using instead an
        !           371: leal, or a shift and subtract.
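
As an arithmetic illustration only, here are the two identities in C.  The
values 15 and 9 for "C" are examples chosen for this note, not figures from
any particular GMP routine, and times15() and times9() are hypothetical
names.

```c
/* x * 15 as a shift and subtract: (x << 4) - x. */
unsigned times15(unsigned x)
{
    return (x << 4) - x;
}

/* x * 9 in the shape of "leal (%eax,%eax,8), %eax": x + x*8. */
unsigned times9(unsigned x)
{
    return x + (x << 3);
}
```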
        !           372:
        !           373: When working from most significant to least significant limb (eg. mpn_lshift
        !           374: and mpn_copyd), the calculations change as follows.
        !           375:
        !           376:        add to pointers          (-S & M) * B
        !           377:        displacements            0 to -B*(N-1)
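
A C sketch of the high to low case, again using a switch in place of the
computed jump and N = 4.  It assumes the pointers start at the most
significant limb, which is how the formulas above are read here; copy_decr()
is a hypothetical name, and an index is used in place of adjusted pointers.

```c
#include <assert.h>
#include <stddef.h>

enum {
    UNROLL_LOG2  = 2,                   /* L */
    UNROLL_COUNT = 1 << UNROLL_LOG2,    /* N */
    UNROLL_MASK  = UNROLL_COUNT - 1     /* M */
};

/* Copy size limbs (size >= 1) from src to dst, most significant first. */
void copy_decr(unsigned long *dst, const unsigned long *src, size_t size)
{
    assert(size >= 1);

    size_t skip = (0 - size) & UNROLL_MASK;        /* (-S & M) */
    /* Start at the top limb, then "add to pointers". */
    ptrdiff_t base = (ptrdiff_t) (size - 1) + (ptrdiff_t) skip;
    ptrdiff_t counter = (ptrdiff_t) ((size - 1) >> UNROLL_LOG2);

    /* Displacements run 0, -1, -2, -3 limbs from base. */
    switch (skip) {
    case 0: do { dst[base - 0] = src[base - 0];  /* fall through */
    case 1:      dst[base - 1] = src[base - 1];  /* fall through */
    case 2:      dst[base - 2] = src[base - 2];  /* fall through */
    case 3:      dst[base - 3] = src[base - 3];
                 base -= UNROLL_COUNT;
            } while (--counter >= 0);   /* stop once the counter reaches -1 */
    }
}
```

Working downwards makes the copy safe when dst overlaps src at a higher
address, for example copy_decr(a + 1, a, 4) shifts four limbs up by one.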
        !           378:
        !           379:
        !           380:
        !           381: OLD GAS 1.92.3
        !           382:
        !           383: This version comes with FreeBSD 2.2.8 and has a couple of gremlins that
        !           384: affect GMP code.
        !           385:
        !           386: Firstly, an expression involving two forward references to labels comes out
        !           387: as zero.  For example,
        !           388:
        !           389:                addl    $bar-foo, %eax
        !           390:        foo:
        !           391:                nop
        !           392:        bar:
        !           393:
        !           394: This should lead to "addl $1, %eax", but it comes out as "addl $0, %eax".
        !           395: When only one forward reference is involved, it works correctly, as for
        !           396: example,
        !           397:
        !           398:        foo:
        !           399:                addl    $bar-foo, %eax
        !           400:                nop
        !           401:        bar:
        !           402:
        !           403: Secondly, an expression involving two labels can't be used as the
        !           404: displacement for an leal.  For example,
        !           405:
        !           406:        foo:
        !           407:                nop
        !           408:        bar:
        !           409:                leal    bar-foo(%eax,%ebx,8), %ecx
        !           410:
        !           411: A slightly cryptic error is given, "Unimplemented segment type 0 in
        !           412: parse_operand".  When only one label is used it's ok, and the label can be a
        !           413: forward reference too, as for example,
        !           414:
        !           415:                leal    foo(%eax,%ebx,8), %ecx
        !           416:                nop
        !           417:        foo:
        !           418:
        !           419: These problems only affect PIC computed jump calculations.  The workarounds
        !           420: are just to do an leal without a displacement and then an addl, and to make
        !           421: sure the code is placed so that there's at most one forward reference in the
        !           422: addl.
        !           423:
        !           424:
        !           425:
        !           426: REFERENCES
        !           427:
        !           428: "Intel Architecture Software Developer's Manual", volumes 1 to 3, 2001,
        !           429: order numbers 245470, 245471 and 245472.  Available on-line,
        !           430:
        !           431:        http://developer.intel.com/design/pentium4/manuals/245470.htm
        !           432:        http://developer.intel.com/design/pentium4/manuals/245471.htm
        !           433:        http://developer.intel.com/design/pentium4/manuals/245472.htm
        !           434:
        !           435: "System V Application Binary Interface", Unix System Laboratories Inc, 1992,
        !           436: published by Prentice Hall, ISBN 0-13-880410-9.  And the "Intel386 Processor
        !           437: Supplement", AT&T, 1991, ISBN 0-13-877689-X.  These have details of calling
        !           438: conventions and ELF shared library PIC coding.  Versions of both available
        !           439: on-line,
        !           440:
        !           441:        http://www.sco.com/developer/devspecs
1.1       maekawa   442:
1.1.1.2 ! ohara     443: "Intel386 Family Binary Compatibility Specification 2", Intel Corporation,
        !           444: published by McGraw-Hill, 1991, ISBN 0-07-031219-2.  (Same as the above 386
        !           445: ABI supplement.)
1.1       maekawa   446:
                    447:
                    448:
1.1.1.2 ! ohara     449: ----------------
        !           450: Local variables:
        !           451: mode: text
        !           452: fill-column: 76
        !           453: End:
