OpenXM_contrib/gmp/mpn/x86/k6/README - annotate

Annotation of OpenXM_contrib/gmp/mpn/x86/k6/README, Revision 1.1.1.2

1.1.1.2 ! ohara       1: Copyright 2000, 2001 Free Software Foundation, Inc.
        !             2:
        !             3: This file is part of the GNU MP Library.
        !             4:
        !             5: The GNU MP Library is free software; you can redistribute it and/or modify
        !             6: it under the terms of the GNU Lesser General Public License as published by
        !             7: the Free Software Foundation; either version 2.1 of the License, or (at your
        !             8: option) any later version.
        !             9:
        !            10: The GNU MP Library is distributed in the hope that it will be useful, but
        !            11: WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
        !            12: or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
        !            13: License for more details.
        !            14:
        !            15: You should have received a copy of the GNU Lesser General Public License
        !            16: along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
        !            17: the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
        !            18: 02111-1307, USA.
        !            19:
        !            20:
        !            21:
1.1       maekawa    22:
                     23:                        AMD K6 MPN SUBROUTINES
                     24:
                     25:
                     26:
                     27: This directory contains code optimized for AMD K6 CPUs, meaning K6, K6-2 and
                     28: K6-3.
                     29:
1.1.1.2 ! ohara      30: The mmx subdirectory has MMX code suiting plain K6; the k62mmx subdirectory
        !            31: has MMX code suiting K6-2 and K6-3.  All chips in the K6 family have MMX;
        !            32: the separate directories are just so that ./configure can omit them if the
        !            33: assembler doesn't support MMX.
1.1       maekawa    34:
                     35:
                     36:
                     37:
                     38: STATUS
                     39:
                     40: Times for the loops, with all code and data in L1 cache, are as follows.
                     41:
                     42:                                  cycles/limb
                     43:
                     44:        mpn_add_n/sub_n            3.25 normal, 2.75 in-place
                     45:
                     46:        mpn_mul_1                  6.25
                     47:        mpn_add/submul_1           7.65-8.4  (varying with data values)
                     48:
                     49:        mpn_mul_basecase           9.25 cycles/crossproduct (approx)
                     50:        mpn_sqr_basecase           4.7  cycles/crossproduct (approx)
                     51:                                    or 9.2 cycles/triangleproduct (approx)
                     52:
1.1.1.2 ! ohara      53:        mpn_l/rshift               3.0
        !            54:
1.1       maekawa    55:        mpn_divrem_1              20.0
                     56:        mpn_mod_1                 20.0
                     57:        mpn_divexact_by3          11.0
                     58:
1.1.1.2 ! ohara      59:        mpn_copyi                  1.0
        !            60:        mpn_copyd                  1.0
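
As a rough illustration of how the cycles/limb figures above can be used,
the following C sketch estimates the cost of an mpn_mul_basecase call from
the 9.25 cycles/crossproduct figure.  The helper name is hypothetical and
not part of GMP, and the estimate assumes everything stays in L1 cache and
ignores call and setup overhead.

```c
#include <stddef.h>

/* Approximate figure from the STATUS table; not an exact constant. */
#define K6_MUL_BASECASE_CYCLES_PER_CROSSPRODUCT 9.25

/* Hypothetical helper: rough cycle estimate for multiplying a un-limb
   operand by a vn-limb operand, which needs un*vn crossproducts.
   Assumes all code and data are in L1 and ignores fixed overheads. */
double k6_mul_basecase_cycles(size_t un, size_t vn)
{
    return K6_MUL_BASECASE_CYCLES_PER_CROSSPRODUCT * (double)un * (double)vn;
}
```

For example, a 10x10 limb product would come out at about 925 cycles under
these assumptions.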
1.1       maekawa    61:
                     62:
                     63: K6-2 and K6-3 have dual-issue MMX and get the following improvements.
                     64:
                     65:        mpn_l/rshift               1.75
                     66:
                     67:
                     68: Prefetching of sources hasn't yet given any joy.  With the 3DNow "prefetch"
                     69: instruction, code seems to run slower, and with just "mov" loads it doesn't
                     70: seem faster.  Results so far are inconsistent.  The K6 does a hardware
                     71: prefetch of the second cache line in a sector, so the penalty for not
                     72: prefetching in software is reduced.
                     73:
                     74:
                     75:
                     76:
                     77: NOTES
                     78:
                     79: All K6 family chips have MMX, but only K6-2 and K6-3 have 3DNow.
                     80:
                     81: Plain K6 executes MMX instructions only in the X pipe, but K6-2 and K6-3 can
1.1.1.2 ! ohara      82: execute them in both X and Y (and in both together).
1.1       maekawa    83:
                     84: Branch misprediction penalty is 1 to 4 cycles (Optimization Manual
                     85: chapter 6 table 12).
                     86:
                     87: Write-allocate L1 data cache means prefetching of destinations is unnecessary.
                     88: Store queue is 7 entries of 64 bits each.
                     89:
                     90: Floating point multiplications can be done in parallel with integer
                     91: multiplications, but there doesn't seem to be any way to make use of this.
                     92:
                     93:
                     94:
                     95: OPTIMIZATIONS
                     96:
                     97: Unrolled loops are used to reduce looping overhead.  The unrolling is
                     98: configurable up to 32 limbs/loop for most routines, up to 64 for some.
                     99:
                    100: Sometimes computed jumps into the unrolling are used to handle sizes not a
                    101: multiple of the unrolling.  An attractive feature of this is that times
                    102: increase smoothly with operand size, but an indirect jump costs about 6
                    103: cycles and the setup about another 6, so a computed jump is only worthwhile
                    104: when the unrolled code beats a simple loop by more than that overhead.
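
The idea of jumping into the middle of an unrolled loop can be sketched in
C with a switch whose case labels sit inside the loop body.  This is only
an analogue of the technique, not the K6 assembly: the function name, the
unrolling of 4 and the use of 32-bit limbs are assumptions made for the
illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* One limb step: *dp = *ap + *bp + carry-in, leaving carry-out in cy. */
#define ADD_STEP                                        \
    do {                                                \
        uint64_t s = (uint64_t)*ap++ + *bp++ + cy;      \
        *dp++ = (uint32_t)s;                            \
        cy = (uint32_t)(s >> 32);                       \
    } while (0)

/* Hypothetical add of two n-limb numbers, unrolled 4 limbs per pass.
   The switch enters the unrolled body part way through on the first
   pass, so sizes that are not a multiple of 4 need no separate loop.
   Returns the carry out of the top limb. */
uint32_t add_n_unrolled4(uint32_t *dp, const uint32_t *ap,
                         const uint32_t *bp, size_t n)
{
    uint32_t cy = 0;
    size_t passes;

    if (n == 0)
        return 0;

    passes = (n + 3) / 4;       /* number of times round the loop */
    switch (n % 4) {
    case 0: do { ADD_STEP;
    case 3:      ADD_STEP;
    case 2:      ADD_STEP;
    case 1:      ADD_STEP;
            } while (--passes > 0);
    }
    return cy;
}
```

With n=5 the switch enters at "case 1", does one limb, then goes round
once more for the remaining four.  Whether the compiler turns the switch
into an indirect jump is up to the compiler; in the assembly routines the
jump target is computed explicitly.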
                    105:
                    106: Position independent code for computed jumps uses a call to get eip.  The
                    107: call is always matched by a ret, rather than an addl $4,%esp or a popl, so
                    108: the CPU return address branch prediction stack stays synchronised with the
                    109: actual stack in memory.  Such a call however still costs 4 to 7
                    110: cycles.
                    111:
                    112: Branch prediction, in absence of any history, will guess forward jumps are
                    113: not taken and backward jumps are taken.  Where possible it's arranged that
                    114: the less likely or less important case is under a taken forward jump.
                    115:
                    116:
                    117:
                    118: MMX
                    119:
                    120: Putting emms or femms as late as possible in a routine seems to be fastest.
                    121: Perhaps an emms or femms stalls until all outstanding MMX instructions have
                    122: completed, so putting it later gives them a chance to complete on their own,
                    123: in parallel with other operations (like register popping).
                    124:
                    125: The Optimization Manual chapter 5 recommends using a femms on K6-2 and K6-3
                    126: at the start of a routine, in case it's been preceded by x87 floating point
                    127: operations.  This isn't done because in gmp programs it's expected that x87
                    128: floating point won't be much used and that chances are an mpn routine won't
                    129: have been preceded by any x87 code.
                    130:
                    131:
                    132:
                    133: CODING
                    134:
                    135: Instructions in general code are shown paired if they can decode and execute
                    136: together, meaning two short decode instructions with the second not
                    137: depending on the first, only the first using the shifter, no more than one
                    138: load, and no more than one store.
                    139:
                    140: K6 does some out of order execution, so the pairings aren't essential; they
                    141: just show what slots might be available.  When decoding is the limiting
                    142: factor things can be scheduled that might not execute until later.
                    143:
                    144:
                    145:
                    146: NOTES
                    147:
                    148: Code alignment
                    149:
                    150: - if an opcode/modrm or 0Fh/opcode/modrm crosses a cache line boundary,
                    151:   short decode is inhibited.  The cross.pl script detects this.
                    152:
                    153: - loops and branch targets should be aligned to 16 bytes, or ensure at least
                    154:   2 instructions before a 32 byte boundary.  This makes use of the 16 byte
                    155:   cache in the BTB.
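
The first rule above amounts to a simple address check.  The following C
sketch shows one way to express it; it is not a description of how cross.pl
works.  The function names are hypothetical, a 32 byte cache line is
assumed, and any prefix bytes other than 0Fh are ignored.

```c
#include <stdint.h>

#define K6_CACHE_LINE 32u   /* assumed cache line size in bytes */

/* Nonzero if the len bytes starting at addr straddle a line boundary. */
int crosses_cache_line(uint32_t addr, uint32_t len)
{
    if (len == 0)
        return 0;
    return (addr / K6_CACHE_LINE) != ((addr + len - 1) / K6_CACHE_LINE);
}

/* Hypothetical check for the rule above.  addr is the address of the
   opcode byte, or of the 0Fh byte when has_0f is nonzero.  The bytes
   that must share a cache line are opcode/modrm (2 bytes) or
   0Fh/opcode/modrm (3 bytes). */
int short_decode_inhibited(uint32_t addr, int has_0f)
{
    return crosses_cache_line(addr, has_0f ? 3u : 2u);
}
```

For instance an 0Fh form starting at offset 30 in a line would be flagged,
since its modrm byte falls at offset 32, in the next line.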
                    156:
                    157: Addressing modes
                    158:
                    159: - (%esi) degrades decoding from short to vector.  0(%esi) doesn't have this
                    160:   problem, and can be used as an equivalent, or easier is just to use a
                    161:   different register, like %ebx.
                    162:
                    163: - K6 and pre-CXT core K6-2 have the following problem.  (K6-2 CXT and K6-3
                    164:   have it fixed, these being cpuid function 1 signatures 0x588 to 0x58F).
                    165:
                    166:   If more than 3 bytes are needed to determine instruction length then
                    167:   decoding degrades from direct to long, or from long to vector.  This
                    168:   happens with forms like "0F opcode mod/rm" with mod/rm=00-xxx-100 since
                    169:   with mod=00 the sib determines whether there's a displacement.
                    170:
1.1.1.2 ! ohara     171:   This affects all MMX and 3DNow instructions, and others with an 0F prefix,
1.1       maekawa   172:   like movzbl.  The modes affected are anything with an index and no
                    173:   displacement, or an index but no base, and this includes (%esp) which is
                    174:   really (,%esp,1).
                    175:
                    176:   The cross.pl script detects problem cases.  The workaround is to always
                    177:   use a displacement, and to do this with Zdisp if it's zero so the
                    178:   assembler doesn't discard it.
                    179:
                    180:   See Optimization Manual rev D page 67 and 3DNow Porting Guide rev B pages
                    181:   13-14 and 36-37.
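
The mod/rm=00-xxx-100 pattern described above can be tested with a mask.
The C sketch below is only an illustration of the bit pattern; the
function names are hypothetical and it says nothing about how cross.pl
actually does its checking.

```c
#include <stdint.h>

/* Nonzero if modrm has mod=00 and r/m=100, meaning a sib byte follows
   and the modrm byte alone doesn't settle whether there's a
   displacement.  The reg field (the middle 3 bits) is ignored. */
int modrm_is_sib_no_disp(uint8_t modrm)
{
    return (modrm & 0xC7u) == 0x04u;
}

/* Nonzero if modrm has mod=01 and r/m=100, ie. sib plus an 8-bit
   displacement.  This is the form a forced zero displacement would be
   expected to produce. */
int modrm_is_sib_disp8(uint8_t modrm)
{
    return (modrm & 0xC7u) == 0x44u;
}
```

So a modrm byte of 04h or 3Ch after an 0Fh opcode is a problem case on the
affected chips, whereas 44h is not.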
                    182:
                    183: Calls
                    184:
                    185: - indirect jumps and calls are not branch predicted, they measure about 6
                    186:   cycles.
                    187:
                    188: Various
                    189:
                    190: - adcl      2 cycles of decode, maybe 2 cycles executing in the X pipe
                    191: - bsf       12-27 cycles
                    192: - emms      5 cycles
                    193: - femms     3 cycles
                    194: - jecxz     2 cycles taken, 13 not taken (optimization manual says 7 not taken)
                    195: - divl      20 cycles back-to-back
1.1.1.2 ! ohara     196: - imull     2 decode, 3 execute
1.1       maekawa   197: - mull      2 decode, 3 execute (optimization manual decoding sample)
                    198: - prefetch  2 cycles
                    199: - rcll/rcrl implicit by one bit: 2 cycles
                    200:             immediate or %cl count: 11 + 2 per bit for dword
                    201:                                     13 + 4 per bit for byte
                    202: - setCC            2 cycles
                    203: - xchgl        %eax,reg  1.5 cycles, back-to-back (strange)
                    204:         reg,reg   2 cycles, back-to-back
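
The rcll/rcrl figures above can be put into a small cost model.  The C
sketch below simply restates them as functions, with hypothetical names,
and assumes the measured times add up linearly and that decode time can be
ignored, neither of which has been verified.

```c
/* Cycles for a dword rcll/rcrl with an immediate or %cl count. */
unsigned rcl_count_cycles_dword(unsigned bits)
{
    return 11u + 2u * bits;
}

/* Cycles for a byte rcl/rcr with an immediate or %cl count. */
unsigned rcl_count_cycles_byte(unsigned bits)
{
    return 13u + 4u * bits;
}

/* Cycles for the same rotate done as repeated one-bit rotates, at 2
   cycles each. */
unsigned rcl_by_one_repeated_cycles(unsigned bits)
{
    return 2u * bits;
}
```

Under this model a 4 bit dword rotate through carry is 19 cycles with a
count but 8 cycles as four one-bit rotates, which suggests the one-bit
form is the better choice for small counts.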
                    205:
                    206:
                    207:
                    208:
                    209: REFERENCES
                    210:
                    211: "AMD-K6 Processor Code Optimization Application Note", AMD publication
                    212: number 21924, revision D amendment 0, January 2000.  This describes K6-2 and
                    213: K6-3.  Available on-line,
                    214:
                    215:        http://www.amd.com/K6/k6docs/pdf/21924.pdf
                    216:
                    217: "AMD-K6 MMX Enhanced Processor x86 Code Optimization Application Note", AMD
                    218: publication number 21828, revision A amendment 0, August 1997.  This is an
                    219: older edition of the above document, describing plain K6.  Available
                    220: on-line,
                    221:
                    222:        http://www.amd.com/K6/k6docs/pdf/21828.pdf
                    223:
                    224: "3DNow Technology Manual", AMD publication number 21928F/0-August 1999.
                    225: This describes the femms and prefetch instructions, but nothing else from
                    226: 3DNow has been used.  Available on-line,
                    227:
                    228:        http://www.amd.com/K6/k6docs/pdf/21928.pdf
                    229:
                    230: "3DNow Instruction Porting Guide", AMD publication number 22621, revision B,
                    231: August 1999.  This has some notes on general K6 optimizations as well as
                    232: 3DNow.  Available on-line,
                    233:
                    234:        http://www.amd.com/products/cpg/athlon/techdocs/pdf/22621.pdf
                    235:
                    236:
                    237:
                    238: ----------------
                    239: Local variables:
                    240: mode: text
                    241: fill-column: 76
                    242: End: