
Diff for /OpenXM_contrib/gmp/mpn/x86/Attic/README between version 1.1.1.1 and 1.1.1.2

version 1.1.1.1, 2000/09/09 14:12:42 version 1.1.1.2, 2003/08/25 16:06:27
   Copyright 1999, 2000, 2001 Free Software Foundation, Inc.
   
   This file is part of the GNU MP Library.
   
   The GNU MP Library is free software; you can redistribute it and/or modify
   it under the terms of the GNU Lesser General Public License as published by
   the Free Software Foundation; either version 2.1 of the License, or (at your
   option) any later version.
   
   The GNU MP Library is distributed in the hope that it will be useful, but
   WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
   or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
   License for more details.
   
   You should have received a copy of the GNU Lesser General Public License
   along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
   the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
   02111-1307, USA.
   
   
   
   
   
                       X86 MPN SUBROUTINES
   
   
Line 7  This directory contains mpn functions for various 80x8
Line 29  This directory contains mpn functions for various 80x8
   
   CODE ORGANIZATION

   (version 1.1.1.1)

           x86              i386, i486, generic
           x86/pentium      Intel Pentium (P5, P54)
           x86/pentium/mmx  Intel Pentium with MMX (P55)
           x86/p6           Intel Pentium Pro
           x86/p6/mmx       Intel Pentium II, III
           x86/p6/p3mmx     Intel Pentium III
           x86/k6           AMD K6, K6-2, K6-3
           x86/k6/mmx
           x86/k6/k62mmx    AMD K6-2
           x86/k7           AMD Athlon
           x86/k7/mmx

   (version 1.1.1.2)

           x86               i386, generic
           x86/i486          i486
           x86/pentium       Intel Pentium (P5, P54)
           x86/pentium/mmx   Intel Pentium with MMX (P55)
           x86/p6            Intel Pentium Pro
           x86/p6/mmx        Intel Pentium II, III
           x86/p6/p3mmx      Intel Pentium III
           x86/k6            \ AMD K6
           x86/k6/mmx        /
           x86/k6/k62mmx     AMD K6-2
           x86/k7            \ AMD Athlon
           x86/k7/mmx        /
           x86/pentium4      \
           x86/pentium4/mmx  | Intel Pentium 4
           x86/pentium4/sse2 /
   
   
   (version 1.1.1.1)

   The x86 directory is also the main support for P6 at the moment, and
   is something of a blended style, meant to be reasonable on all x86s.

   (version 1.1.1.2)

   The top-level x86 directory contains blended style code, meant to be
   reasonable on all x86s.
   
   
   
   STATUS

   (version 1.1.1.1)

   The code is well-optimized for AMD and Intel chips, but not so well
   optimized for Cyrix chips.

   (version 1.1.1.2)

   The code is well-optimized for AMD and Intel chips, but there's nothing
   specific for Cyrix chips, nor for actual 80386 and 80486 chips.
   
   
   
   RELEVANT OPTIMIZATION ISSUES  (version 1.1.1.1 only)

   For implementations with slow double shift instructions (SHLD and
   SHRD), it might be better to mimic their operation with SHL+SHR+OR.
   (M2 is likely to benefit from that, but not Pentium due to its slow
   plain SHL and SHR.)



   ASM FILES  (version 1.1.1.2)

   The x86 .asm files are BSD style assembler code, first put through m4 for
   macro processing.  The generic mpn/asm-defs.m4 is used, together with
   mpn/x86/x86-defs.m4.  See comments in those files.

   The code is meant for use with GNU "gas" or a system "as".  There's no
   support for assemblers that demand Intel style code.
   
   
   
   STACK FRAME
   
   m4 macros are used to define the parameters passed on the stack, and these
   act like comments on what the stack frame looks like too.  For example,
   mpn_mul_1() has the following.
   
           defframe(PARAM_MULTIPLIER, 16)
           defframe(PARAM_SIZE,       12)
           defframe(PARAM_SRC,         8)
           defframe(PARAM_DST,         4)
   
   PARAM_MULTIPLIER becomes `FRAME+16(%esp)', and the others similarly.  The
   return address is at offset 0, but there's not normally any need to access
   that.
   
   FRAME is redefined as necessary through the code so it's the number of bytes
   pushed on the stack, and hence the offsets in the parameter macros stay
   correct.  At the start of a routine FRAME should be zero.
   
           deflit(`FRAME',0)
           ...
           deflit(`FRAME',4)
           ...
           deflit(`FRAME',8)
           ...
   
   Helper macros FRAME_pushl(), FRAME_popl(), FRAME_addl_esp() and
   FRAME_subl_esp() exist to adjust FRAME for the effect of those instructions,
   and can be used instead of explicit definitions if preferred.
   defframe_pushl() is a combination of FRAME_pushl() and defframe().
   
   There's generally some slackness in redefining FRAME.  If new values aren't
   going to get used then the redefinitions are omitted to keep from cluttering
   up the code.  This happens for instance at the end of a routine, where there
   might be just four pops and then a ret, so FRAME isn't getting used.
   
   Local variables and saved registers can be similarly defined, with negative
   offsets representing stack space below the initial stack pointer.  For
   example,
   
           defframe(SAVE_ESI,   -4)
           defframe(SAVE_EDI,   -8)
           defframe(VAR_COUNTER,-12)
   
           deflit(STACK_SPACE, 12)
   
   Here STACK_SPACE gets used in a "subl $STACK_SPACE, %esp" to allocate the
   space, and that instruction must be followed by a redefinition of FRAME
   (setting it equal to STACK_SPACE) to reflect the change in %esp.
   
   Definitions for pushed registers are only put in when they're going to be
   used.  If registers are just saved and restored with pushes and pops then
   definitions aren't made.
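
   The offset arithmetic described in this section can be modelled in a few
   lines of C.  This is only a sketch of the bookkeeping, not GMP code: the
   function names frame_addr() and param_src_after_two_pushes() and the
   stack addresses are hypothetical, chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a defframe() entry: a slot declared at `offset'
   from the entry-time %esp is reached at FRAME+offset from the current
   %esp, where FRAME is the number of bytes pushed so far.  */
static uint32_t
frame_addr (uint32_t esp_now, int32_t frame, int32_t offset)
{
  return esp_now + (uint32_t) frame + (uint32_t) offset;
}

/* Model two pushl instructions, each followed by the matching FRAME
   update, and return the address PARAM_SRC (offset 8) then resolves to.  */
static uint32_t
param_src_after_two_pushes (uint32_t esp_entry)
{
  uint32_t esp = esp_entry;
  int32_t frame = 0;            /* FRAME is zero at the start of a routine */

  esp -= 4; frame += 4;         /* first pushl, FRAME becomes 4 */
  esp -= 4; frame += 4;         /* second pushl, FRAME becomes 8 */

  assert (frame == 8);
  return frame_addr (esp, frame, 8);
}
```

   Whatever has been pushed, the parameter resolves to entry %esp plus 8.
   The same holds for negative offsets: after "subl $12, %esp" with FRAME
   set to 12, a slot declared at -4 is found 4 bytes below the entry %esp.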
   
   
   
   ASSEMBLER EXPRESSIONS
   
   Only addition and subtraction seem to be universally available; certainly
   that's all the Solaris 8 "as" seems to accept.  If other expressions are
   wanted then m4 eval() should be used.
   
   In particular note that a "/" anywhere in a line starts a comment in Solaris
   "as", and in some configurations of gas too.
   
           addl    $32/2, %eax           <-- wrong
   
           addl    $eval(32/2), %eax     <-- right
   
   Binutils gas/config/tc-i386.c has a choice between "/" being a comment
   anywhere in a line, or only at the start.  FreeBSD patches 2.9.1 to select
   the latter, and from 2.9.5 it's the default for GNU/Linux too.
   
   
   
   ASSEMBLER COMMENTS
   
   Solaris "as" doesn't support "#" commenting, using /* */ instead.  For that
   reason "C" commenting is used (see asm-defs.m4) and the intermediate ".s"
   files have no comments.
   
   Any comments before include(`../config.m4') must use m4 "dnl", since it's
   only after the include that "C" is available.  By convention "dnl" is also
   used for comments about m4 macros.
   
   
   
   TEMPORARY LABELS
   
   Temporary numbered labels like "1:" used as "1f" or "1b" are available in
   "gas" and Solaris "as", but not in SCO "as".  Normal L() labels should be
   used instead, possibly with a counter to make them unique, see jadcl0() for
   instance.  A separate counter for each macro makes it possible to nest them,
   for instance movl_text_address() can be used within an ASSERT().
   
   "1:" etc must be avoided in gcc __asm__ blocks too.  "%=" for generating a
   unique number looks like a good alternative, but is that actually a
   documented feature?  In any case this problem doesn't currently arise.
   
   
   
   ZERO DISPLACEMENTS
   
   In a couple of places addressing modes like 0(%ebx) with a byte-sized zero
   displacement are wanted, rather than (%ebx) with no displacement.  These are
   either for computed jumps or to get desirable code alignment.  Explicit
   .byte sequences are used to ensure the assembler doesn't turn 0(%ebx) into
   (%ebx).  The Zdisp() macro in x86-defs.m4 is used for this.
   
   Current gas 2.9.5 or recent 2.9.1 leave 0(%ebx) as written, but old gas
   1.92.3 changes it.  In general changing would be the sort of "optimization"
   an assembler might perform, hence explicit ".byte"s are used where
   necessary.
   
   
   
   SHLD/SHRD INSTRUCTIONS
   
   The %cl count forms of double shift instructions like "shldl %cl,%eax,%ebx"
   must be written "shldl %eax,%ebx" for some assemblers.  gas takes either,
   Solaris "as" doesn't allow %cl, gcc generates %cl for gas and NeXT (which is
   gas), and omits %cl elsewhere.
   
   For GMP an autoconf test GMP_ASM_X86_SHLDL_CL is used to determine whether
   %cl should be used, and the macros shldl, shrdl, shldw and shrdw in
   mpn/x86/x86-defs.m4 pass through or omit %cl as necessary.  See the comments
   with those macros for usage.
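
   For reference, the operation performed by the %cl forms can be modelled
   in C as below.  This is a sketch of the instruction semantics only,
   assuming 32-bit operands; the names shldl_model() and shrdl_model() are
   hypothetical and do not appear in GMP.

```c
#include <stdint.h>

/* Model of "shldl %cl,%eax,%ebx": %ebx is shifted left and the vacated
   low bits are filled from the high bits of %eax.  Returns the new %ebx.  */
static uint32_t
shldl_model (uint32_t ebx, uint32_t eax, unsigned cl)
{
  unsigned count = cl & 31;     /* the count is taken modulo 32 */

  if (count == 0)
    return ebx;                 /* also avoids an undefined 32-bit shift in C */
  return (ebx << count) | (eax >> (32 - count));
}

/* Model of "shrdl %cl,%eax,%ebx": %ebx is shifted right and the vacated
   high bits are filled from the low bits of %eax.  Returns the new %ebx.  */
static uint32_t
shrdl_model (uint32_t ebx, uint32_t eax, unsigned cl)
{
  unsigned count = cl & 31;

  if (count == 0)
    return ebx;
  return (ebx >> count) | (eax << (32 - count));
}
```

   Each model is a plain shift left, shift right and OR, which is the same
   combination an implementation would need if it avoided the double shift
   instructions.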
   
   
   
   IMUL INSTRUCTION
   
   GCC config/i386/i386.md (cvs rev 1.187, 21 Oct 00) under *mulsi3_1 notes
   that the following two forms produce identical object code
   
           imul    $12, %eax
           imul    $12, %eax, %eax
   
   but that the former isn't accepted by some assemblers, in particular the SCO
   OSR5 COFF assembler.  GMP follows GCC and uses only the latter form.
   
   (This applies only to immediate operands, since the three operand form is
   only valid with an immediate.)
   
   
   
   DIRECTION FLAG
   
   The x86 calling conventions say that the direction flag should be clear at
   function entry and exit.  (See iBCS2 and SVR4 ABI books, references below.)
   Although this has been so since the year dot, it's not absolutely clear
   whether it's universally respected.  Since it's better to be safe than
   sorry, GMP follows glibc and does a "cld" if it depends on the direction
   flag being clear.  This happens only in a few places.
   
   
   
   POSITION INDEPENDENT CODE
   
   Defining the symbol PIC in m4 processing selects SVR4 / ELF style position
   independent code.  This is necessary for shared libraries because they can
   be mapped into different processes at different virtual addresses.  Actually
   relocations are allowed, but presumably pages with relocations aren't
   shared, defeating the purpose of a shared library.
   
   The use of the PLT adds a fixed cost to every function call, and the GOT
   adds a cost to any function accessing global variables.  These are small but
   might be noticeable when working with small operands.
   
   Calls from one library function to another don't need to go through the PLT,
   since of course the call instruction uses a displacement, not an absolute
   address, and the relative locations of object files are known when libgmp.so
   is created.  "ld -Bsymbolic" (or "gcc -Wl,-Bsymbolic") will resolve calls
   this way, so that there's no jump through the PLT, but of course leaving
   setups of the GOT address in %ebx that may be unnecessary.
   
   The %ebx setup could be avoided in assembly if a separate option controlled
   PIC for calls as opposed to computed jumps etc.  But there's only ever
   likely to be a handful of calls out of assembler, and getting the same
   optimization for C intra-library calls would be more important.  There seems
   no easy way to tell gcc that certain functions can be called non-PIC, and
   unfortunately many GMP functions use the global memory allocation variables,
   so they need the GOT anyway.  Object files with no global data references
   and only intra-library calls could go into the library as non-PIC under
   -Bsymbolic.  Integrating this into libtool and automake is left as an
   exercise for the reader.
   
   
   
   GLOBAL OFFSET TABLE CODING
   
   It's believed the magic _GLOBAL_OFFSET_TABLE_ used by code establishing the
   address of the GOT should be written without a GSYM_PREFIX, ie. that it's
   the same "_GLOBAL_OFFSET_TABLE_" on an underscore or non-underscore system.
   Certainly this is true for instance of NetBSD 1.4 which is an underscore
   system but requires "_GLOBAL_OFFSET_TABLE_".
   
   Old gas 1.92.3 which comes with FreeBSD 2.2.8 gets a segmentation fault when
   asked to assemble the following,
   
           L1:
               addl  $_GLOBAL_OFFSET_TABLE_+[.-L1], %ebx
   
   It seems that using the label in the same instruction it refers to is the
   problem, since a nop in between works.  But the simplest workaround is to
   follow gcc and omit the +[.-L1] since it does nothing,
   
               addl  $_GLOBAL_OFFSET_TABLE_, %ebx
   
   Current gas 2.10 generates incorrect object code when %eax is used in such a
   construction (with or without +[.-L1]),
   
               addl  $_GLOBAL_OFFSET_TABLE_, %eax
   
   The R_386_GOTPC gets a displacement of 2 rather than the 1 appropriate for
   the 1 byte opcode of "addl $n,%eax".  The best workaround is just to use any
   other register, since then it's a two byte opcode+mod/rm.  GCC for example
   always uses %ebx (which is needed for calls through the PLT).
   
   A similar problem occurs in an leal (again with or without a +[.-L1]),
   
               leal  _GLOBAL_OFFSET_TABLE_(%edi), %ebx
   
   This time the R_386_GOTPC gets a displacement of 0 rather than the 2
   appropriate for the opcode and mod/rm, making this form unusable.
   
   
   
   SIMPLE LOOPS
   
   The overheads in setting up for an unrolled loop can mean that at small
   sizes a simple loop is faster.  Making small sizes go fast is important,
   even if it adds a cycle or two to bigger sizes.  To this end various
   routines choose between a simple loop and an unrolled loop according to
   operand size.  The path to the simple loop, or to special case code for
   small sizes, is always as fast as possible.
   
   Adding a simple loop requires a conditional jump to choose between the
   simple and unrolled code.  The size of a branch misprediction penalty
   affects whether a simple loop is worthwhile.
   
   The convention is for an m4 definition UNROLL_THRESHOLD to set the crossover
   point, with sizes < UNROLL_THRESHOLD using the simple loop, sizes >=
   UNROLL_THRESHOLD using the unrolled loop.  If position independent code adds
   a couple of cycles to an unrolled loop setup, the threshold will vary with
   PIC or non-PIC.  Something like the following is typical.
   
           deflit(UNROLL_THRESHOLD, ifdef(`PIC',10,8))
   
   There's no automated way to determine the threshold.  Setting it to a small
   value and then to a big value makes it possible to measure the simple and
   unrolled loops each over a range of sizes, from which the crossover point
   can be determined.  Alternatively, just adjust the threshold up or down
   until there are no further speedups.
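
   The shape of such a routine can be sketched in C as follows.  This is an
   illustration of the threshold convention only, not a GMP routine: the
   name copy_limbs(), the 4-way unrolling and the threshold value of 8 are
   assumptions made for the example, and the return value exists purely so
   the chosen path can be observed.

```c
#include <stddef.h>
#include <stdint.h>

#define UNROLL_THRESHOLD 8      /* illustrative value, not a measured one */

typedef uint32_t limb_t;

/* Copy `size' limbs from src to dst.  Returns 0 if the simple loop was
   used and 1 if the unrolled loop was used.  */
static int
copy_limbs (limb_t *dst, const limb_t *src, size_t size)
{
  size_t i = 0;

  /* Sizes below the threshold take the short path straight away.  */
  if (size < UNROLL_THRESHOLD)
    {
      for (; i < size; i++)
        dst[i] = src[i];
      return 0;
    }

  /* Sizes at or above the threshold use a loop unrolled 4 times.  */
  for (; i + 4 <= size; i += 4)
    {
      dst[i]     = src[i];
      dst[i + 1] = src[i + 1];
      dst[i + 2] = src[i + 2];
      dst[i + 3] = src[i + 3];
    }

  /* Up to 3 leftover limbs are finished one at a time.  */
  for (; i < size; i++)
    dst[i] = src[i];
  return 1;
}
```

   Forcing UNROLL_THRESHOLD very low or very high makes every size take one
   path, which is the measuring technique described above.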
   
   
   
   UNROLLED LOOP CODING
   
   The x86 addressing modes allow a byte displacement of -128 to +127, making
   it possible to access 256 bytes, which is 64 limbs, without adjusting
   pointer registers within the loop.  Dword sized displacements can be used
   too, but they increase code size, and unrolling to 64 ought to be enough.
   
   When unrolling to the full 64 limbs/loop, the limb at the top of the loop
   will have a displacement of -128, so pointers have to have a corresponding
   +128 added before entering the loop.  When unrolling to 32 limbs/loop
   displacements 0 to 127 can be used with 0 at the top of the loop and no
   adjustment needed to the pointers.
   
   Where 64 limbs/loop is supported, the +128 adjustment is done only when 64
   limbs/loop is selected.  Usually the gain in speed using 64 instead of 32 or
   16 is small, so support for 64 limbs/loop is generally only for comparison.
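
   The displacement ranges quoted above can be checked with a small C
   model.  It is a sketch under the stated scheme (a +128 pointer bias only
   for 64 limbs/loop); the names limb_disp() and fits_in_byte() are
   hypothetical.

```c
#include <assert.h>

#define BYTES_PER_LIMB 4

/* Displacement of limb i (0 is the top of the loop) within one pass of a
   loop unrolled to `unroll' limbs.  The pointers carry a +128 bias only
   when unrolling to the full 64 limbs/loop.  */
static int
limb_disp (int unroll, int i)
{
  int bias = (unroll == 64) ? 128 : 0;

  assert (i >= 0 && i < unroll);
  return i * BYTES_PER_LIMB - bias;
}

/* Non-zero if disp can be encoded as a signed byte displacement.  */
static int
fits_in_byte (int disp)
{
  return disp >= -128 && disp <= 127;
}
```

   With the bias, 64 limbs/loop runs from -128 to +124.  Without it the
   last limb would need a displacement of 252, which no longer fits in a
   byte.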
   
   
   
   COMPUTED JUMPS
   
   When working from least significant limb to most significant limb (most
   routines) the computed jump and pointer calculations in preparation for an
   unrolled loop are as follows.
   
           S = operand size in limbs
           N = number of limbs per loop (UNROLL_COUNT)
           L = log2 of unrolling (UNROLL_LOG2)
           M = mask for unrolling (UNROLL_MASK)
           C = code bytes per limb in the loop
           B = bytes per limb (4 for x86)
   
           computed jump            (-S & M) * C + entrypoint
           subtract from pointers   (-S & M) * B
           initial loop counter     (S-1) >> L
           displacements            0 to B*(N-1)
   
   The loop counter is decremented at the end of each loop, and the looping
   stops when the decrement takes the counter to -1.  The displacements are
   those used in the addressing mode for each limb access, eg. a load with
   "movl disp(%ebx), %eax".
   
   Usually the multiply by "C" can be handled without an imul, using instead an
   leal, or a shift and subtract.
   
   When working from most significant to least significant limb (eg. mpn_lshift
   and mpn_copyd), the calculations change as follows.
   
           add to pointers          (-S & M) * B
           displacements            0 to -B*(N-1)
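
   The least significant first calculations can be checked with a C model
   such as the following.  It is a sketch, not GMP code: the name
   copy_computed_entry() and the unrolling of 4 are assumptions, pointer
   adjustments are counted in limbs rather than bytes, and the computed
   jump is represented by starting the first pass at limb (-S & M) instead
   of limb 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;

#define UNROLL_LOG2   2                     /* L */
#define UNROLL_COUNT  (1 << UNROLL_LOG2)    /* N */
#define UNROLL_MASK   (UNROLL_COUNT - 1)    /* M */

/* Copy size >= 1 limbs, least significant first, entering the first pass
   of the unrolled loop part way through.  Returns the limbs copied.  */
static size_t
copy_computed_entry (limb_t *dst, const limb_t *src, size_t size)
{
  ptrdiff_t skip, base, k, counter;
  size_t copied = 0;

  assert (size >= 1);

  skip = (ptrdiff_t) ((0 - size) & UNROLL_MASK);      /* -S & M */
  counter = (ptrdiff_t) ((size - 1) >> UNROLL_LOG2);  /* (S-1) >> L */
  base = -skip;        /* pointers moved back by (-S & M) limbs */
  k = skip;            /* entry point within the first pass */

  do
    {
      /* One pass of the unrolled body, limbs at base+0 .. base+N-1.  */
      for (; k < UNROLL_COUNT; k++)
        {
          dst[base + k] = src[base + k];
          copied++;
        }
      k = 0;
      base += UNROLL_COUNT;
    }
  while (--counter >= 0);      /* stop when the counter reaches -1 */

  return copied;
}
```

   For S = 5 and N = 4 the entry skips 3 limbs, the first pass copies one
   limb, and the counter of 1 gives one further full pass, for 5 limbs in
   total.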
   
   
   
   OLD GAS 1.92.3
   
   This version comes with FreeBSD 2.2.8 and has a couple of gremlins that
   affect GMP code.
   
   Firstly, an expression involving two forward references to labels comes out
   as zero.  For example,
   
                   addl    $bar-foo, %eax
           foo:
                   nop
           bar:
   
   This should lead to "addl $1, %eax", but it comes out as "addl $0, %eax".
   When only one forward reference is involved, it works correctly, as for
   example,
   
           foo:
                   addl    $bar-foo, %eax
                   nop
           bar:
   
   Secondly, an expression involving two labels can't be used as the
   displacement for an leal.  For example,
   
           foo:
                   nop
           bar:
                   leal    bar-foo(%eax,%ebx,8), %ecx
   
   A slightly cryptic error is given, "Unimplemented segment type 0 in
   parse_operand".  When only one label is used it's ok, and the label can be a
   forward reference too, as for example,
   
                   leal    foo(%eax,%ebx,8), %ecx
                   nop
           foo:
   
   These problems only affect PIC computed jump calculations.  The workarounds
   are just to do an leal without a displacement and then an addl, and to make
   sure the code is placed so that there's at most one forward reference in the
   addl.
   
   
   
   REFERENCES
   
   "Intel Architecture Software Developer's Manual", volumes 1 to 3, 2001,
   order numbers 245470, 245471 and 245472.  Available on-line,
   
           http://developer.intel.com/design/pentium4/manuals/245470.htm
           http://developer.intel.com/design/pentium4/manuals/245471.htm
           http://developer.intel.com/design/pentium4/manuals/245472.htm
   
   "System V Application Binary Interface", Unix System Laboratories Inc, 1992,
   published by Prentice Hall, ISBN 0-13-880410-9.  And the "Intel386 Processor
   Supplement", AT&T, 1991, ISBN 0-13-877689-X.  These have details of calling
   conventions and ELF shared library PIC coding.  Versions of both are
   available on-line,
   
           http://www.sco.com/developer/devspecs
   
   "Intel386 Family Binary Compatibility Specification 2", Intel Corporation,
   published by McGraw-Hill, 1991, ISBN 0-07-031219-2.  (Same as the above 386
   ABI supplement.)
   
   
   
   ----------------
   Local variables:
   mode: text
   fill-column: 76
   End:
