version 1.1.1.1, 2000/09/09 14:12:42
version 1.1.1.2, 2003/08/25 16:06:27

Copyright 1999, 2000, 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 2.1 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library; see the file COPYING.LIB. If not, write to
the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
02111-1307, USA.



X86 MPN SUBROUTINES


This directory contains mpn functions for various 80x8


CODE ORGANIZATION

        x86                   i386, generic
        x86/i486              i486
        x86/pentium           Intel Pentium (P5, P54)
        x86/pentium/mmx       Intel Pentium with MMX (P55)
        x86/p6                Intel Pentium Pro
        x86/p6/mmx            Intel Pentium II, III
        x86/p6/p3mmx          Intel Pentium III
        x86/k6                \ AMD K6
        x86/k6/mmx            /
        x86/k6/k62mmx         AMD K6-2
        x86/k7                \ AMD Athlon
        x86/k7/mmx            /
        x86/pentium4          \
        x86/pentium4/mmx      | Intel Pentium 4
        x86/pentium4/sse2     /

The top-level x86 directory contains blended style code, meant to be
reasonable on all x86s.


STATUS

The code is well-optimized for AMD and Intel chips, but there's nothing
specific for Cyrix chips, nor for actual 80386 and 80486 chips.


RELEVANT OPTIMIZATION ISSUES

For implementations with slow double shift instructions (SHLD and
SHRD), it might be better to mimic their operation with SHL+SHR+OR.
(M2 is likely to benefit from that, but not Pentium due to its slow
plain SHL and SHR.)


ASM FILES

The x86 .asm files are BSD style assembler code, first put through m4 for
macro processing. The generic mpn/asm-defs.m4 is used, together with
mpn/x86/x86-defs.m4. See comments in those files.

The code is meant for use with GNU "gas" or a system "as". There's no
support for assemblers that demand Intel style code.


STACK FRAME

m4 macros are used to define the parameters passed on the stack, and these
act like comments on what the stack frame looks like too. For example,
mpn_mul_1() has the following.

        defframe(PARAM_MULTIPLIER, 16)
        defframe(PARAM_SIZE, 12)
        defframe(PARAM_SRC, 8)
        defframe(PARAM_DST, 4)

PARAM_MULTIPLIER becomes `FRAME+16(%esp)', and the others similarly. The
return address is at offset 0, but there's not normally any need to access
that.

FRAME is redefined as necessary through the code so it's the number of bytes
pushed on the stack, and hence the offsets in the parameter macros stay
correct. At the start of a routine FRAME should be zero.

        deflit(`FRAME',0)
        ...
        deflit(`FRAME',4)
        ...
        deflit(`FRAME',8)
        ...

Helper macros FRAME_pushl(), FRAME_popl(), FRAME_addl_esp() and
FRAME_subl_esp() exist to adjust FRAME for the effect of those instructions,
and can be used instead of explicit definitions if preferred.
defframe_pushl() is a combination of FRAME_pushl() and defframe().
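
The FRAME bookkeeping can be modelled in a few lines of C. This is only a
sketch of the idea, not GMP code: the toy_stack type and the toy_*()
functions are hypothetical names invented for the illustration, and only the
PARAM_* offsets come from the defframe() lines for mpn_mul_1(). It shows
that FRAME+offset(%esp) keeps naming the same parameter as long as FRAME is
adjusted by 4 at each push or pop.

```c
#include <assert.h>
#include <stdint.h>

/* Parameter offsets, as in the defframe() lines for mpn_mul_1(). */
enum { PARAM_DST = 4, PARAM_SRC = 8, PARAM_SIZE = 12, PARAM_MULTIPLIER = 16 };

/* Hypothetical toy stack: byte offsets into a small array of 32-bit slots. */
typedef struct {
    uint32_t mem[16];
    int esp;    /* stack pointer, as a byte offset into mem */
    int frame;  /* FRAME: bytes pushed since function entry */
} toy_stack;

/* Lay out the stack as on function entry: return address at 0(%esp),
   followed by the four parameters. */
static void toy_enter(toy_stack *s, uint32_t dst, uint32_t src,
                      uint32_t size, uint32_t multiplier)
{
    s->esp = 32;                 /* leaves room for 8 pushes */
    s->frame = 0;                /* deflit(`FRAME',0) */
    s->mem[s->esp / 4] = 0;      /* return address, not used here */
    s->mem[(s->esp + PARAM_DST) / 4] = dst;
    s->mem[(s->esp + PARAM_SRC) / 4] = src;
    s->mem[(s->esp + PARAM_SIZE) / 4] = size;
    s->mem[(s->esp + PARAM_MULTIPLIER) / 4] = multiplier;
}

/* pushl plus FRAME_pushl(): %esp drops by 4 and FRAME grows by 4. */
static void toy_pushl(toy_stack *s, uint32_t value)
{
    assert(s->esp >= 4);
    s->esp -= 4;
    s->mem[s->esp / 4] = value;
    s->frame += 4;
}

/* popl plus FRAME_popl(): the reverse adjustment. */
static uint32_t toy_popl(toy_stack *s)
{
    assert(s->frame >= 4);
    uint32_t value = s->mem[s->esp / 4];
    s->esp += 4;
    s->frame -= 4;
    return value;
}

/* Read the operand FRAME+offset(%esp). */
static uint32_t toy_param(const toy_stack *s, int offset)
{
    return s->mem[(s->esp + s->frame + offset) / 4];
}
```

After toy_enter() and any number of balanced toy_pushl() calls,
toy_param(&s, PARAM_SIZE) still returns the size argument, which is the
property the defframe() offsets rely on.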

There's generally some slackness in redefining FRAME. If new values aren't
going to get used then the redefinitions are omitted to keep from cluttering
up the code. This happens for instance at the end of a routine, where there
might be just four pops and then a ret, so FRAME isn't getting used.

Local variables and saved registers can be similarly defined, with negative
offsets representing stack space below the initial stack pointer. For
example,

        defframe(SAVE_ESI, -4)
        defframe(SAVE_EDI, -8)
        defframe(VAR_COUNTER,-12)

        deflit(STACK_SPACE, 12)

Here STACK_SPACE gets used in a "subl $STACK_SPACE, %esp" to allocate the
space, and that instruction must be followed by a redefinition of FRAME
(setting it equal to STACK_SPACE) to reflect the change in %esp.

Definitions for pushed registers are only put in when they're going to be
used. If registers are just saved and restored with pushes and pops then
definitions aren't made.


ASSEMBLER EXPRESSIONS

Only addition and subtraction seem to be universally available; certainly
that's all the Solaris 8 "as" seems to accept. If other expressions are
wanted then m4 eval() should be used.

In particular note that a "/" anywhere in a line starts a comment in Solaris
"as", and in some configurations of gas too.

        addl    $32/2, %eax           <-- wrong

        addl    $eval(32/2), %eax     <-- right

Binutils gas/config/tc-i386.c has a choice between "/" being a comment
anywhere in a line, or only at the start. FreeBSD patches 2.9.1 to select
the latter, and from 2.9.5 it's the default for GNU/Linux too.


ASSEMBLER COMMENTS

Solaris "as" doesn't support "#" commenting, using /* */ instead. For that
reason "C" commenting is used (see asm-defs.m4) and the intermediate ".s"
files have no comments.

Any comments before include(`../config.m4') must use m4 "dnl", since it's
only after the include that "C" is available. By convention "dnl" is also
used for comments about m4 macros.


TEMPORARY LABELS

Temporary numbered labels like "1:" used as "1f" or "1b" are available in
"gas" and Solaris "as", but not in SCO "as". Normal L() labels should be
used instead, possibly with a counter to make them unique, see jadcl0() for
instance. A separate counter for each macro makes it possible to nest them,
for instance movl_text_address() can be used within an ASSERT().

"1:" etc must be avoided in gcc __asm__ blocks too. "%=" for generating a
unique number looks like a good alternative, but is that actually a
documented feature? In any case this problem doesn't currently arise.


ZERO DISPLACEMENTS

In a couple of places addressing modes like 0(%ebx) with a byte-sized zero
displacement are wanted, rather than (%ebx) with no displacement. These are
either for computed jumps or to get desirable code alignment. Explicit
.byte sequences are used to ensure the assembler doesn't turn 0(%ebx) into
(%ebx). The Zdisp() macro in x86-defs.m4 is used for this.

Current gas 2.9.5 or recent 2.9.1 leave 0(%ebx) as written, but old gas
1.92.3 changes it. In general changing would be the sort of "optimization"
an assembler might perform, hence explicit ".byte"s are used where
necessary.
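
The difference between the two forms is one byte of encoding, which is why
it matters for computed jumps and alignment. The following C sketch builds
the bytes for a 32-bit register load both ways. It is an illustration of the
standard mod/rm encoding only; encode_movl_load() is a hypothetical helper,
not part of GMP, and it is not claimed to match what Zdisp() expands to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register numbers as used in the mod/rm byte. */
enum { REG_EAX = 0, REG_ECX = 1, REG_EDX = 2, REG_EBX = 3,
       REG_ESI = 6, REG_EDI = 7 };

/* Encode "movl (%base), %dst" when byte_disp is zero, or
   "movl disp(%base), %dst" with an explicit byte displacement otherwise.
   Returns the number of bytes written to out. */
static size_t encode_movl_load(uint8_t out[3], unsigned dst, unsigned base,
                               int byte_disp, int8_t disp)
{
    /* %esp as a base needs a SIB byte and %ebp has no displacement-free
       form, so this sketch leaves both out. */
    assert(dst < 8 && base < 8 && base != 4 && base != 5);

    out[0] = 0x8B;                                     /* MOV r32, r/m32 */
    if (!byte_disp) {
        out[1] = (uint8_t)((0u << 6) | (dst << 3) | base);   /* mod = 00 */
        return 2;
    }
    out[1] = (uint8_t)((1u << 6) | (dst << 3) | base);       /* mod = 01 */
    out[2] = (uint8_t)disp;                    /* the displacement byte */
    return 3;
}
```

For %ebx as base and %eax as destination the short form is the two bytes
8B 03, and the form with an explicit zero byte displacement is the three
bytes 8B 43 00.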


SHLD/SHRD INSTRUCTIONS

The %cl count forms of double shift instructions like "shldl %cl,%eax,%ebx"
must be written "shldl %eax,%ebx" for some assemblers. gas takes either,
Solaris "as" doesn't allow %cl, gcc generates %cl for gas and NeXT (which is
gas), and omits %cl elsewhere.

For GMP an autoconf test GMP_ASM_X86_SHLDL_CL is used to determine whether
%cl should be used, and the macros shldl, shrdl, shldw and shrdw in
mpn/x86/x86-defs.m4 pass through or omit %cl as necessary. See the comments
with those macros for usage.
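
For readers less familiar with the double shifts, the following C sketch
models what the 32-bit forms compute. The function names shld32() and
shrd32() are hypothetical, chosen for this illustration. In "shldl
%cl,%eax,%ebx" the destination is %ebx and the bits shifted in come from
%eax; the count is taken modulo 32, and a zero count leaves the destination
unchanged.

```c
#include <assert.h>
#include <stdint.h>

/* Model of shldl: shift dst left by count, filling the vacated low bits
   from the high bits of src. */
static uint32_t shld32(uint32_t dst, uint32_t src, unsigned count)
{
    count &= 31;              /* the hardware uses only the low 5 bits */
    if (count == 0)
        return dst;           /* also avoids an undefined shift by 32 in C */
    return (dst << count) | (src >> (32 - count));
}

/* Model of shrdl: shift dst right by count, filling the vacated high bits
   from the low bits of src. */
static uint32_t shrd32(uint32_t dst, uint32_t src, unsigned count)
{
    count &= 31;
    if (count == 0)
        return dst;
    return (dst >> count) | (src << (32 - count));
}
```

Each result is a shift left, a shift right and an OR, which is the sequence
a plain-shift replacement for the double shift would use.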


IMUL INSTRUCTION

GCC config/i386/i386.md (cvs rev 1.187, 21 Oct 00) under *mulsi3_1 notes
that the following two forms produce identical object code

        imul    $12, %eax
        imul    $12, %eax, %eax

but that the former isn't accepted by some assemblers, in particular the SCO
OSR5 COFF assembler. GMP follows GCC and uses only the latter form.

(This applies only to immediate operands; the three operand form is only
valid with an immediate.)


DIRECTION FLAG

The x86 calling conventions say that the direction flag should be clear at
function entry and exit. (See iBCS2 and SVR4 ABI books, references below.)
Although this has been so since the year dot, it's not absolutely clear
whether it's universally respected. Since it's better to be safe than
sorry, GMP follows glibc and does a "cld" if it depends on the direction
flag being clear. This happens only in a few places.


POSITION INDEPENDENT CODE

Defining the symbol PIC in m4 processing selects SVR4 / ELF style position
independent code. This is necessary for shared libraries because they can
be mapped into different processes at different virtual addresses. Actually
relocations are allowed, but presumably pages with relocations aren't
shared, defeating the purpose of a shared library.

The use of the PLT adds a fixed cost to every function call, and the GOT
adds a cost to any function accessing global variables. These are small but
might be noticeable when working with small operands.

Calls from one library function to another don't need to go through the PLT,
since of course the call instruction uses a displacement, not an absolute
address, and the relative locations of object files are known when libgmp.so
is created. "ld -Bsymbolic" (or "gcc -Wl,-Bsymbolic") will resolve calls
this way, so that there's no jump through the PLT, but of course leaving
setups of the GOT address in %ebx that may be unnecessary.

The %ebx setup could be avoided in assembly if a separate option controlled
PIC for calls as opposed to computed jumps etc. But there's only ever
likely to be a handful of calls out of assembler, and getting the same
optimization for C intra-library calls would be more important. There seems
no easy way to tell gcc that certain functions can be called non-PIC, and
unfortunately many GMP functions use the global memory allocation variables,
so they need the GOT anyway. Object files with no global data references
and only intra-library calls could go into the library as non-PIC under
-Bsymbolic. Integrating this into libtool and automake is left as an
exercise for the reader.


GLOBAL OFFSET TABLE CODING

It's believed the magic _GLOBAL_OFFSET_TABLE_ used by code establishing the
address of the GOT should be written without a GSYM_PREFIX, ie. that it's
the same "_GLOBAL_OFFSET_TABLE_" on an underscore or non-underscore system.
Certainly this is true for instance of NetBSD 1.4 which is an underscore
system but requires "_GLOBAL_OFFSET_TABLE_".

Old gas 1.92.3 which comes with FreeBSD 2.2.8 gets a segmentation fault when
asked to assemble the following,

        L1:
                addl    $_GLOBAL_OFFSET_TABLE_+[.-L1], %ebx

It seems that using the label in the same instruction it refers to is the
problem, since a nop in between works. But the simplest workaround is to
follow gcc and omit the +[.-L1] since it does nothing,

                addl    $_GLOBAL_OFFSET_TABLE_, %ebx

Current gas 2.10 generates incorrect object code when %eax is used in such a
construction (with or without +[.-L1]),

                addl    $_GLOBAL_OFFSET_TABLE_, %eax

The R_386_GOTPC gets a displacement of 2 rather than the 1 appropriate for
the 1 byte opcode of "addl $n,%eax". The best workaround is just to use any
other register, since then it's a two byte opcode+mod/rm. GCC for example
always uses %ebx (which is needed for calls through the PLT).

A similar problem occurs in an leal (again with or without a +[.-L1]),

                leal    _GLOBAL_OFFSET_TABLE_(%edi), %ebx

This time the R_386_GOTPC gets a displacement of 0 rather than the 2
appropriate for the opcode and mod/rm, making this form unusable.


SIMPLE LOOPS

The overheads in setting up for an unrolled loop can mean that at small
sizes a simple loop is faster. Making small sizes go fast is important,
even if it adds a cycle or two to bigger sizes. To this end various
routines choose between a simple loop and an unrolled loop according to
operand size. The path to the simple loop, or to special case code for
small sizes, is always as fast as possible.

Adding a simple loop requires a conditional jump to choose between the
simple and unrolled code. The size of a branch misprediction penalty
affects whether a simple loop is worthwhile.

The convention is for an m4 definition UNROLL_THRESHOLD to set the crossover
point, with sizes < UNROLL_THRESHOLD using the simple loop, sizes >=
UNROLL_THRESHOLD using the unrolled loop. If position independent code adds
a couple of cycles to an unrolled loop setup, the threshold will vary with
PIC or non-PIC. Something like the following is typical.

        deflit(UNROLL_THRESHOLD, ifdef(`PIC',10,8))

There's no automated way to determine the threshold. Setting it to a small
value and then to a big value makes it possible to measure the simple and
unrolled loops each over a range of sizes, from which the crossover point
can be determined. Alternatively, just adjust the threshold up or down until
there are no more speedups.
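
The shape of such a routine can be sketched in C. This is an illustration
under stated assumptions, not GMP code: toy_com() is a hypothetical routine
that stores the one's complement of each limb, the threshold of 8 is just
the non-PIC value from the example above, and the return value exists only
so the chosen path can be observed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNROLL_THRESHOLD 8   /* assumed value, would be tuned per CPU */

/* Store ~src[i] into dst[i] for i in 0..n-1.  Returns 0 if the simple loop
   was used, 1 if the unrolled loop was used. */
static int toy_com(uint32_t *dst, const uint32_t *src, size_t n)
{
    size_t i = 0;

    assert(n == 0 || (dst != NULL && src != NULL));

    if (n < UNROLL_THRESHOLD) {
        /* Small sizes: no setup cost, one limb per iteration. */
        for (; i < n; i++)
            dst[i] = ~src[i];
        return 0;
    }

    /* Larger sizes: four limbs per iteration. */
    for (; i + 4 <= n; i += 4) {
        dst[i]     = ~src[i];
        dst[i + 1] = ~src[i + 1];
        dst[i + 2] = ~src[i + 2];
        dst[i + 3] = ~src[i + 3];
    }
    /* Leftover limbs when n isn't a multiple of 4. */
    for (; i < n; i++)
        dst[i] = ~src[i];
    return 1;
}
```

The single comparison against UNROLL_THRESHOLD is the conditional jump
mentioned above, so a mispredicted branch there is the price of having both
paths.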


UNROLLED LOOP CODING

The x86 addressing modes allow a byte displacement of -128 to +127, making
it possible to access 256 bytes, which is 64 limbs, without adjusting
pointer registers within the loop. Dword sized displacements can be used
too, but they increase code size, and unrolling to 64 ought to be enough.

When unrolling to the full 64 limbs/loop, the limb at the top of the loop
will have a displacement of -128, so pointers have to have a corresponding
+128 added before entering the loop. When unrolling to 32 limbs/loop
displacements 0 to 127 can be used with 0 at the top of the loop and no
adjustment needed to the pointers.

Where 64 limbs/loop is supported, the +128 adjustment is done only when 64
limbs/loop is selected. Usually the gain in speed using 64 instead of 32 or
16 is small, so support for 64 limbs/loop is generally only for comparison.
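
The displacement arithmetic can be checked with a short C sketch. The
helper names are hypothetical, and applying the -128 bias to every unroll
count above 32 is an assumption made for the illustration; the text above
only describes the 64 and 32 limb cases.

```c
#include <assert.h>

#define BYTES_PER_LIMB 4

/* Displacement used for limb i (0 at the top of the loop) when unrolling
   to unroll_count limbs per loop.  Above 32 limbs the range is biased by
   -128, matching a +128 adjustment made to the pointers beforehand. */
static int limb_displacement(int unroll_count, int i)
{
    assert(unroll_count >= 1 && unroll_count <= 64);
    assert(i >= 0 && i < unroll_count);

    int bias = (unroll_count * BYTES_PER_LIMB > 128) ? -128 : 0;
    return bias + BYTES_PER_LIMB * i;
}

/* True if disp can be encoded as a signed byte displacement. */
static int fits_byte_displacement(int disp)
{
    return disp >= -128 && disp <= 127;
}
```

With 64 limbs/loop the displacements run from -128 to 124, and with 32
limbs/loop from 0 to 124, so both stay within a signed byte.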


COMPUTED JUMPS

When working from least significant limb to most significant limb (most
routines) the computed jump and pointer calculations in preparation for an
unrolled loop are as follows.

        S = operand size in limbs
        N = number of limbs per loop (UNROLL_COUNT)
        L = log2 of unrolling (UNROLL_LOG2)
        M = mask for unrolling (UNROLL_MASK)
        C = code bytes per limb in the loop
        B = bytes per limb (4 for x86)

        computed jump            (-S & M) * C + entrypoint
        subtract from pointers   (-S & M) * B
        initial loop counter     (S-1) >> L
        displacements            0 to B*(N-1)

The loop counter is decremented at the end of each loop, and the looping
stops when the decrement takes the counter to -1. The displacements are for
the addressing mode accessing each limb, eg. a load with
"movl disp(%ebx), %eax".

Usually the multiply by "C" can be handled without an imul, using instead an
leal, or a shift and subtract.

When working from most significant to least significant limb (eg. mpn_lshift
and mpn_copyd), the calculations change as follows.

        add to pointers          (-S & M) * B
        displacements            0 to -B*(N-1)
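
The least-significant-first calculations can be tried out in C, where a
switch into the middle of a loop plays the role of the computed jump. This
is a sketch only: toy_copyi(), unroll_skip() and unroll_counter() are
hypothetical names, the unrolling is fixed at 4 limbs, and a signed index is
used in place of moving the pointers back, since C doesn't allow a pointer
before the start of an array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNROLL_LOG2  2
#define UNROLL_COUNT (1 << UNROLL_LOG2)
#define UNROLL_MASK  (UNROLL_COUNT - 1)

/* (-S & M): the number of limb slots skipped on the first pass. */
static size_t unroll_skip(size_t s)
{
    return (0 - s) & UNROLL_MASK;
}

/* (S-1) >> L: the initial loop counter. */
static size_t unroll_counter(size_t s)
{
    assert(s >= 1);
    return (s - 1) >> UNROLL_LOG2;
}

/* Copy s limbs from src to dst, low to high, entering a 4-way unrolled
   loop part way through on the first pass. */
static void toy_copyi(uint32_t *dst, const uint32_t *src, size_t s)
{
    assert(s >= 1);

    size_t skip = unroll_skip(s);
    ptrdiff_t base = -(ptrdiff_t)skip;  /* "subtract from pointers" */
    size_t counter = unroll_counter(s);

    switch (skip) {                     /* the computed jump */
        do {
    case 0: dst[base + 0] = src[base + 0];  /* fall through */
    case 1: dst[base + 1] = src[base + 1];  /* fall through */
    case 2: dst[base + 2] = src[base + 2];  /* fall through */
    case 3: dst[base + 3] = src[base + 3];
            base += UNROLL_COUNT;
        } while (counter-- != 0);       /* stop when counter would go to -1 */
    }
}
```

For S=5 the skip is 3 and the counter is 1, so the first pass handles one
limb and the second pass four, giving five in total. In general the passes
cover (counter+1)*N - skip limbs, which always equals S.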


OLD GAS 1.92.3

This version comes with FreeBSD 2.2.8 and has a couple of gremlins that
affect GMP code.

Firstly, an expression involving two forward references to labels comes out
as zero. For example,

                addl    $bar-foo, %eax
        foo:
                nop
        bar:

This should lead to "addl $1, %eax", but it comes out as "addl $0, %eax".
When only one forward reference is involved, it works correctly, as for
example,

        foo:
                addl    $bar-foo, %eax
                nop
        bar:

Secondly, an expression involving two labels can't be used as the
displacement for an leal. For example,

        foo:
                nop
        bar:
                leal    bar-foo(%eax,%ebx,8), %ecx

A slightly cryptic error is given, "Unimplemented segment type 0 in
parse_operand". When only one label is used it's ok, and the label can be a
forward reference too, as for example,

                leal    foo(%eax,%ebx,8), %ecx
                nop
        foo:

These problems only affect PIC computed jump calculations. The workarounds
are just to do an leal without a displacement and then an addl, and to make
sure the code is placed so that there's at most one forward reference in the
addl.


REFERENCES

"Intel Architecture Software Developer's Manual", volumes 1 to 3, 2001,
order numbers 245470, 245471 and 245472. Available on-line,

        http://developer.intel.com/design/pentium4/manuals/245470.htm
        http://developer.intel.com/design/pentium4/manuals/245471.htm
        http://developer.intel.com/design/pentium4/manuals/245472.htm

"System V Application Binary Interface", Unix System Laboratories Inc, 1992,
published by Prentice Hall, ISBN 0-13-880410-9. And the "Intel386 Processor
Supplement", AT&T, 1991, ISBN 0-13-877689-X. These have details of calling
conventions and ELF shared library PIC coding. Versions of both are
available on-line,

        http://www.sco.com/developer/devspecs

"Intel386 Family Binary Compatibility Specification 2", Intel Corporation,
published by McGraw-Hill, 1991, ISBN 0-07-031219-2. (Same as the above 386
ABI supplement.)



----------------
Local variables:
mode: text
fill-column: 76
End: