OpenXM_contrib/gmp/mpn/x86/README - diff

Return to README CVS log

Up to [local] / OpenXM_contrib / gmp / mpn / x86

Diff for /OpenXM_contrib/gmp/mpn/x86/Attic/README between version 1.1.1.1 and 1.1.1.2

-version 1.1.1.1, 2000/09/09 14:12:42
+version 1.1.1.2, 2003/08/25 16:06:27
 Line 1
 Line 1
 Line 1
+ Copyright 1999, 2000, 2001 Free Software Foundation, Inc.
+ This file is part of the GNU MP Library.
+ The GNU MP Library is free software; you can redistribute it and/or modify
+ it under the terms of the GNU Lesser General Public License as published by
+ the Free Software Foundation; either version 2.1 of the License, or (at your
+ option) any later version.
+ The GNU MP Library is distributed in the hope that it will be useful, but
+ WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+ or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
+ License for more details.
+ You should have received a copy of the GNU Lesser General Public License
+ along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
+ the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
+-1307, USA.
                        X86 MPN SUBROUTINES
-Line 7  This directory contains mpn functions for various 80x8
+Line 29  This directory contains mpn functions for various 80x8
 Line 7  This directory contains mpn functions for various 80x8
 Line 29  This directory contains mpn functions for various 80x8
  CODE ORGANIZATION
-         x86              i386, i486, generic
+         x86               i386, generic
-         x86/pentium      Intel Pentium (P5, P54)
+         x86/i486          i486
-         x86/pentium/mmx  Intel Pentium with MMX (P55)
+         x86/pentium       Intel Pentium (P5, P54)
-         x86/p6           Intel Pentium Pro
+         x86/pentium/mmx   Intel Pentium with MMX (P55)
-         x86/p6/mmx       Intel Pentium II, III
+         x86/p6            Intel Pentium Pro
-         x86/p6/p3mmx     Intel Pentium III
+         x86/p6/mmx        Intel Pentium II, III
-         x86/k6           AMD K6, K6-2, K6-3
+         x86/p6/p3mmx      Intel Pentium III
-         x86/k6/mmx
+         x86/k6            \ AMD K6
-         x86/k6/k62mmx    AMD K6-2
+         x86/k6/mmx        /
-         x86/k7           AMD Athlon
+         x86/k6/k62mmx     AMD K6-2
-         x86/k7/mmx
+         x86/k7            \ AMD Athlon
+         x86/k7/mmx        /
+         x86/pentium4      \
+         x86/pentium4/mmx  | Intel Pentium 4
+         x86/pentium4/sse2 /
- The x86 directory is also the main support for P6 at the moment, and
+ The top-level x86 directory contains blended style code, meant to be
- is something of a blended style, meant to be reasonable on all x86s.
+ reasonable on all x86s.
  STATUS
- The code is well-optimized for AMD and Intel chips, but not so well
+ The code is well-optimized for AMD and Intel chips, but there's nothing
- optimized for Cyrix chips.
+ specific for Cyrix chips, nor for actual 80386 and 80486 chips.
- RELEVANT OPTIMIZATION ISSUES
+ ASM FILES
- For implementations with slow double shift instructions (SHLD and
+ The x86 .asm files are BSD style assembler code, first put through m4 for
- SHRD), it might be better to mimic their operation with SHL+SHR+OR.
+ macro processing.  The generic mpn/asm-defs.m4 is used, together with
- (M2 is likely to benefit from that, but not Pentium due to its slow
+ mpn/x86/x86-defs.m4.  See comments in those files.
- plain SHL and SHR.)
+ The code is meant for use with GNU "gas" or a system "as".  There's no
+ support for assemblers that demand Intel style code.
+ STACK FRAME
+ m4 macros are used to define the parameters passed on the stack, and these
+ act like comments on what the stack frame looks like too.  For example,
+ mpn_mul_1() has the following.
+         defframe(PARAM_MULTIPLIER, 16)
+         defframe(PARAM_SIZE,       12)
+         defframe(PARAM_SRC,         8)
+         defframe(PARAM_DST,         4)
+ PARAM_MULTIPLIER becomes `FRAME+16(%esp)', and the others similarly.  The
+ return address is at offset 0, but there's not normally any need to access
+ that.
+ FRAME is redefined as necessary through the code so it's the number of bytes
+ pushed on the stack, and hence the offsets in the parameter macros stay
+ correct.  At the start of a routine FRAME should be zero.
+         deflit(`FRAME',0)
+         ...
+         deflit(`FRAME',4)
+         ...
+         deflit(`FRAME',8)
+         ...
+ Helper macros FRAME_pushl(), FRAME_popl(), FRAME_addl_esp() and
+ FRAME_subl_esp() exist to adjust FRAME for the effect of those instructions,
+ and can be used instead of explicit definitions if preferred.
+ defframe_pushl() is a combination FRAME_pushl() and defframe().
+ There's generally some slackness in redefining FRAME.  If new values aren't
+ going to get used then the redefinitions are omitted to keep from cluttering
+ up the code.  This happens for instance at the end of a routine, where there
+ might be just four pops and then a ret, so FRAME isn't getting used.
+ Local variables and saved registers can be similarly defined, with negative
+ offsets representing stack space below the initial stack pointer.  For
+ example,
+         defframe(SAVE_ESI,   -4)
+         defframe(SAVE_EDI,   -8)
+         defframe(VAR_COUNTER,-12)
+         deflit(STACK_SPACE, 12)
+ Here STACK_SPACE gets used in a "subl $STACK_SPACE, %esp" to allocate the
+ space, and that instruction must be followed by a redefinition of FRAME
+ (setting it equal to STACK_SPACE) to reflect the change in %esp.
+ Definitions for pushed registers are only put in when they're going to be
+ used.  If registers are just saved and restored with pushes and pops then
+ definitions aren't made.
+ ASSEMBLER EXPRESSIONS
+ Only addition and subtraction seem to be universally available, certainly
+ that's all the Solaris 8 "as" seems to accept.  If expressions are wanted
+ then m4 eval() should be used.
+ In particular note that a "/" anywhere in a line starts a comment in Solaris
+ "as", and in some configurations of gas too.
+         addl    $32/2, %eax           <-- wrong
+         addl    $eval(32/2), %eax     <-- right
+ Binutils gas/config/tc-i386.c has a choice between "/" being a comment
+ anywhere in a line, or only at the start.  FreeBSD patches 2.9.1 to select
+ the latter, and from 2.9.5 it's the default for GNU/Linux too.
+ ASSEMBLER COMMENTS
+ Solaris "as" doesn't support "#" commenting, using /* */ instead.  For that
+ reason "C" commenting is used (see asm-defs.m4) and the intermediate ".s"
+ files have no comments.
+ Any comments before include(`../config.m4') must use m4 "dnl", since it's
+ only after the include that "C" is available.  By convention "dnl" is also
+ used for comments about m4 macros.
+ TEMPORARY LABELS
+ Temporary numbered labels like "1:" used as "1f" or "1b" are available in
+ "gas" and Solaris "as", but not in SCO "as".  Normal L() labels should be
+ used instead, possibly with a counter to make them unique, see jadcl0() for
+ instance.  A separate counter for each macro makes it possible to nest them,
+ for instance movl_text_address() can be used within an ASSERT().
+ "1:" etc must be avoided in gcc __asm__ blocks too.  "%=" for generating a
+ unique number looks like a good alternative, but is that actually a
+ documented feature?  In any case this problem doesn't currently arise.
+ ZERO DISPLACEMENTS
+ In a couple of places addressing modes like 0(%ebx) with a byte-sized zero
+ displacement are wanted, rather than (%ebx) with no displacement.  These are
+ either for computed jumps or to get desirable code alignment.  Explicit
+ .byte sequences are used to ensure the assembler doesn't turn 0(%ebx) into
+ (%ebx).  The Zdisp() macro in x86-defs.m4 is used for this.
+ Current gas 2.9.5 or recent 2.9.1 leave 0(%ebx) as written, but old gas
+.92.3 changes it.  In general changing would be the sort of "optimization"
+ an assembler might perform, hence explicit ".byte"s are used where
+ necessary.
+ SHLD/SHRD INSTRUCTIONS
+ The %cl count forms of double shift instructions like "shldl %cl,%eax,%ebx"
+ must be written "shldl %eax,%ebx" for some assemblers.  gas takes either,
+ Solaris "as" doesn't allow %cl, gcc generates %cl for gas and NeXT (which is
+ gas), and omits %cl elsewhere.
+ For GMP an autoconf test GMP_ASM_X86_SHLDL_CL is used to determine whether
+ %cl should be used, and the macros shldl, shrdl, shldw and shrdw in
+ mpn/x86/x86-defs.m4 pass through or omit %cl as necessary.  See the comments
+ with those macros for usage.
+ IMUL INSTRUCTION
+ GCC config/i386/i386.md (cvs rev 1.187, 21 Oct 00) under *mulsi3_1 notes
+ that the following two forms produce identical object code
+         imul    $12, %eax
+         imul    $12, %eax, %eax
+ but that the former isn't accepted by some assemblers, in particular the SCO
+ OSR5 COFF assembler.  GMP follows GCC and uses only the latter form.
+ (This applies only to immediate operands, the three operand form is only
+ valid with an immediate.)
+ DIRECTION FLAG
+ The x86 calling conventions say that the direction flag should be clear at
+ function entry and exit.  (See iBCS2 and SVR4 ABI books, references below.)
+ Although this has been so since the year dot, it's not absolutely clear
+ whether it's universally respected.  Since it's better to be safe than
+ sorry, GMP follows glibc and does a "cld" if it depends on the direction
+ flag being clear.  This happens only in a few places.
+ POSITION INDEPENDENT CODE
+ Defining the symbol PIC in m4 processing selects SVR4 / ELF style position
+ independent code.  This is necessary for shared libraries because they can
+ be mapped into different processes at different virtual addresses.  Actually
+ relocations are allowed, but presumably pages with relocations aren't
+ shared, defeating the purpose of a shared library.
+ The use of the PLT adds a fixed cost to every function call, and the GOT
+ adds a cost to any function accessing global variables.  These are small but
+ might be noticeable when working with small operands.
+ Calls from one library function to another don't need to go through the PLT,
+ since of course the call instruction uses a displacement, not an absolute
+ address, and the relative locations of object files are known when libgmp.so
+ is created.  "ld -Bsymbolic" (or "gcc -Wl,-Bsymbolic") will resolve calls
+ this way, so that there's no jump through the PLT, but of course leaving
+ setups of the GOT address in %ebx that may be unnecessary.
+ The %ebx setup could be avoided in assembly if a separate option controlled
+ PIC for calls as opposed to computed jumps etc.  But there's only ever
+ likely to be a handful of calls out of assembler, and getting the same
+ optimization for C intra-library calls would be more important.  There seems
+ no easy way to tell gcc that certain functions can be called non-PIC, and
+ unfortunately many GMP functions use the global memory allocation variables,
+ so they need the GOT anyway.  Object files with no global data references
+ and only intra-library calls could go into the library as non-PIC under
+ -Bsymbolic.  Integrating this into libtool and automake is left as an
+ exercise for the reader.
+ GLOBAL OFFSET TABLE CODING
+ It's believed the magic _GLOBAL_OFFSET_TABLE_ used by code establishing the
+ address of the GOT should be written without a GSYM_PREFIX, ie. that it's
+ the same "_GLOBAL_OFFSET_TABLE_" on an underscore or non-underscore system.
+ Certainly this is true for instance of NetBSD 1.4 which is an underscore
+ system but requires "_GLOBAL_OFFSET_TABLE_".
+ Old gas 1.92.3 which comes with FreeBSD 2.2.8 gets a segmentation fault when
+ asked to assemble the following,
+         L1:
+             addl  $_GLOBAL_OFFSET_TABLE_+[.-L1], %ebx
+ It seems that using the label in the same instruction it refers to is the
+ problem, since a nop in between works.  But the simplest workaround is to
+ follow gcc and omit the +[.-L1] since it does nothing,
+             addl  $_GLOBAL_OFFSET_TABLE_, %ebx
+ Current gas 2.10 generates incorrect object code when %eax is used in such a
+ construction (with or without +[.-L1]),
+             addl  $_GLOBAL_OFFSET_TABLE_, %eax
+ The R_386_GOTPC gets a displacement of 2 rather than the 1 appropriate for
+ the 1 byte opcode of "addl $n,%eax".  The best workaround is just to use any
+ other register, since then it's a two byte opcode+mod/rm.  GCC for example
+ always uses %ebx (which is needed for calls through the PLT).
+ A similar problem occurs in an leal (again with or without a +[.-L1]),
+             leal  _GLOBAL_OFFSET_TABLE_(%edi), %ebx
+ This time the R_386_GOTPC gets a displacement of 0 rather than the 2
+ appropriate for the opcode and mod/rm, making this form unusable.
+ SIMPLE LOOPS
+ The overheads in setting up for an unrolled loop can mean that at small
+ sizes a simple loop is faster.  Making small sizes go fast is important,
+ even if it adds a cycle or two to bigger sizes.  To this end various
+ routines choose between a simple loop and an unrolled loop according to
+ operand size.  The path to the simple loop, or to special case code for
+ small sizes, is always as fast as possible.
+ Adding a simple loop requires a conditional jump to choose between the
+ simple and unrolled code.  The size of a branch misprediction penalty
+ affects whether a simple loop is worthwhile.
+ The convention is for an m4 definition UNROLL_THRESHOLD to set the crossover
+ point, with sizes < UNROLL_THRESHOLD using the simple loop, sizes >=
+ UNROLL_THRESHOLD using the unrolled loop.  If position independent code adds
+ a couple of cycles to an unrolled loop setup, the threshold will vary with
+ PIC or non-PIC.  Something like the following is typical.
+         deflit(UNROLL_THRESHOLD, ifdef(`PIC',10,8))
+ There's no automated way to determine the threshold.  Setting it to a small
+ value and then to a big value makes it possible to measure the simple and
+ unrolled loops each over a range of sizes, from which the crossover point
+ can be determined.  Alternately, just adjust the threshold up or down until
+ there's no more speedups.
+ UNROLLED LOOP CODING
+ The x86 addressing modes allow a byte displacement of -128 to +127, making
+ it possible to access 256 bytes, which is 64 limbs, without adjusting
+ pointer registers within the loop.  Dword sized displacements can be used
+ too, but they increase code size, and unrolling to 64 ought to be enough.
+ When unrolling to the full 64 limbs/loop, the limb at the top of the loop
+ will have a displacement of -128, so pointers have to have a corresponding
+ +128 added before entering the loop.  When unrolling to 32 limbs/loop
+ displacements 0 to 127 can be used with 0 at the top of the loop and no
+ adjustment needed to the pointers.
+ Where 64 limbs/loop is supported, the +128 adjustment is done only when 64
+ limbs/loop is selected.  Usually the gain in speed using 64 instead of 32 or
+is small, so support for 64 limbs/loop is generally only for comparison.
+ COMPUTED JUMPS
+ When working from least significant limb to most significant limb (most
+ routines) the computed jump and pointer calculations in preparation for an
+ unrolled loop are as follows.
+         S = operand size in limbs
+         N = number of limbs per loop (UNROLL_COUNT)
+         L = log2 of unrolling (UNROLL_LOG2)
+         M = mask for unrolling (UNROLL_MASK)
+         C = code bytes per limb in the loop
+         B = bytes per limb (4 for x86)
+         computed jump            (-S & M) * C + entrypoint
+         subtract from pointers   (-S & M) * B
+         initial loop counter     (S-1) >> L
+         displacements            0 to B*(N-1)
+ The loop counter is decremented at the end of each loop, and the looping
+ stops when the decrement takes the counter to -1.  The displacements are for
+ the addressing accessing each limb, eg. a load with "movl disp(%ebx), %eax".
+ Usually the multiply by "C" can be handled without an imul, using instead an
+ leal, or a shift and subtract.
+ When working from most significant to least significant limb (eg. mpn_lshift
+ and mpn_copyd), the calculations change as follows.
+         add to pointers          (-S & M) * B
+         displacements            0 to -B*(N-1)
+ OLD GAS 1.92.3
+ This version comes with FreeBSD 2.2.8 and has a couple of gremlins that
+ affect GMP code.
+ Firstly, an expression involving two forward references to labels comes out
+ as zero.  For example,
+                 addl    $bar-foo, %eax
+         foo:
+                 nop
+         bar:
+ This should lead to "addl $1, %eax", but it comes out as "addl $0, %eax".
+ When only one forward reference is involved, it works correctly, as for
+ example,
+         foo:
+                 addl    $bar-foo, %eax
+                 nop
+         bar:
+ Secondly, an expression involving two labels can't be used as the
+ displacement for an leal.  For example,
+         foo:
+                 nop
+         bar:
+                 leal    bar-foo(%eax,%ebx,8), %ecx
+ A slightly cryptic error is given, "Unimplemented segment type 0 in
+ parse_operand".  When only one label is used it's ok, and the label can be a
+ forward reference too, as for example,
+                 leal    foo(%eax,%ebx,8), %ecx
+                 nop
+         foo:
+ These problems only affect PIC computed jump calculations.  The workarounds
+ are just to do an leal without a displacement and then an addl, and to make
+ sure the code is placed so that there's at most one forward reference in the
+ addl.
+ REFERENCES
+ "Intel Architecture Software Developer's Manual", volumes 1 to 3, 2001,
+ order numbers 245470, 245471 and 245472.  Available on-line,
+         http://developer.intel.com/design/pentium4/manuals/245470.htm
+         http://developer.intel.com/design/pentium4/manuals/245471.htm
+         http://developer.intel.com/design/pentium4/manuals/245472.htm
+ "System V Application Binary Interface", Unix System Laboratories Inc, 1992,
+ published by Prentice Hall, ISBN 0-13-880410-9.  And the "Intel386 Processor
+ Supplement", AT&T, 1991, ISBN 0-13-877689-X.  These have details of calling
+ conventions and ELF shared library PIC coding.  Versions of both available
+ on-line,
+         http://www.sco.com/developer/devspecs
+ "Intel386 Family Binary Compatibility Specification 2", Intel Corporation,
+ published by McGraw-Hill, 1991, ISBN 0-07-031219-2.  (Same as the above 386
+ ABI supplement.)
+ ----------------
+ Local variables:
+ mode: text
+ fill-column: 76
+ End:

FreeBSD-CVSweb <freebsd-cvsweb@FreeBSD.org>