
Annotation of OpenXM_contrib/gmp/mpn/x86/k7/README, Revision 1.1.1.2

Copyright 2000, 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 2.1 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
02111-1307, USA.




                      AMD K7 MPN SUBROUTINES


This directory contains code optimized for the AMD Athlon CPU.

The mmx subdirectory has routines using MMX instructions.  All Athlons have
MMX; the separate directory exists only so that configure can omit it if the
assembler doesn't support MMX.



STATUS

Times for the loops, with all code and data in L1 cache.

                               cycles/limb
       mpn_add/sub_n             1.6

       mpn_copyi                 0.75 or 1.0   \ varying with data alignment
       mpn_copyd                 0.75 or 1.0   /

       mpn_divrem_1             17.0 integer part, 15.0 fractional part
       mpn_mod_1                17.0
       mpn_divexact_by3          8.0

       mpn_l/rshift              1.2

       mpn_mul_1                 3.4
       mpn_addmul/submul_1       3.9

       mpn_mul_basecase          4.42 cycles/crossproduct (approx)
       mpn_sqr_basecase          2.3 cycles/crossproduct (approx)
                                 or 4.55 cycles/triangleproduct (approx)

Prefetching of sources hasn't yet been tried.
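
As a rough guide to reading the basecase figures in the table, the sketch
below turns them into cycle estimates.  It assumes (this is an
interpretation, not something stated in this file) that an xn by yn limb
multiply has xn*yn crossproducts and that an n limb square has n*(n+1)/2
triangle products.  The function names are hypothetical and are not part of
GMP.

```c
#include <stddef.h>

/* Hypothetical helpers, not part of GMP.  The constants are the approximate
   figures from the STATUS table; the product counts are assumptions about
   what "crossproduct" and "triangleproduct" mean.  */

/* Limb-by-limb products in a full n x n schoolbook multiply.  */
static size_t
sqr_crossproducts (size_t n)
{
  return n * n;
}

/* Products on and to one side of the diagonal of that n x n square.  */
static size_t
sqr_triangle_products (size_t n)
{
  return n * (n + 1) / 2;
}

static double
est_mul_basecase_cycles (size_t xn, size_t yn)
{
  return 4.42 * (double) xn * (double) yn;
}

static double
est_sqr_cycles_by_crossproducts (size_t n)
{
  return 2.3 * (double) sqr_crossproducts (n);
}

static double
est_sqr_cycles_by_triangle (size_t n)
{
  return 4.55 * (double) sqr_triangle_products (n);
}
```

Under these assumptions the two mpn_sqr_basecase figures describe nearly
the same cost.  For n = 20 the estimates are 920 and 955.5 cycles, and they
move closer as n grows, since n*(n+1)/2 approaches half of n*n.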



NOTES

cmov, MMX, 3DNow and some extensions to MMX and 3DNow are available.

Write-allocate L1 data cache means prefetching of destinations is unnecessary.

Floating point multiplications can be done in parallel with integer
multiplications, but there doesn't seem to be any way to make use of this.

Unsigned "mul"s can be issued every 3 cycles.  This suggests 3 cycles/limb
is a limit on the speed of the multiplication routines.  The documentation
shows mul executing in IEU0 (or maybe in IEU0 and IEU1 together), so it
might be that, to get near 3 cycles, code has to be arranged so that nothing
else is issued to IEU0.  A busy IEU0 could explain why some code takes 4
cycles and other apparently equivalent code takes 5.
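
The reason the mul issue rate bounds these routines can be seen in a plain C
rendering of a multiply-by-one-limb loop.  This is a minimal sketch, not the
K7 assembler: ref_mul_1 is a hypothetical name, and 32-bit limbs stored
least significant first are assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference loop, not GMP's code: multiply {xp,n} by the
   single limb y, store the low n limbs at wp and return the carry-out
   limb.  */
static uint32_t
ref_mul_1 (uint32_t *wp, const uint32_t *xp, size_t n, uint32_t y)
{
  uint32_t carry = 0;
  size_t i;

  for (i = 0; i < n; i++)
    {
      /* Exactly one 32x32->64 multiply per limb.  The sum cannot overflow
         64 bits: (2^32-1)^2 + (2^32-1) < 2^64.  */
      uint64_t t = (uint64_t) xp[i] * y + carry;
      wp[i] = (uint32_t) t;             /* low word is the result limb */
      carry = (uint32_t) (t >> 32);     /* high word feeds the next limb */
    }
  return carry;
}
```

Each limb needs one multiply, so with one mul issued every 3 cycles the loop
cannot go below 3 cycles/limb however the additions and stores around it are
scheduled.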



OPTIMIZATIONS

Unrolled loops are used to reduce looping overhead.  The unrolling is
configurable up to 32 limbs/loop for most routines and up to 64 for some.
The K7 has a 64k L1 code cache, so quite big unrolling is allowable.

Computed jumps into the unrolling are used to handle sizes not a multiple of
the unrolling.  An attractive feature of this is that times increase
smoothly with operand size, but it may be that some routines should just
have simple loops to finish up, especially when PIC adds between 2 and 16
cycles to get %eip.
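
The idea of jumping into an unrolled loop can be sketched in C with a switch
whose cases fall through into the loop body (the construction known as
Duff's device).  This is only an analogy for the assembler's computed jump:
the names ref_add_n and add_limb are hypothetical, the unrolling is fixed at
4 rather than configurable, and 32-bit limbs are assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Add two limbs plus a carry-in, store the sum limb, return the carry-out.  */
static uint32_t
add_limb (uint32_t *w, uint32_t x, uint32_t y, uint32_t cy)
{
  uint64_t t = (uint64_t) x + y + cy;
  *w = (uint32_t) t;
  return (uint32_t) (t >> 32);
}

/* Hypothetical sketch, not GMP's code: {wp,n} = {xp,n} + {yp,n}, returning
   the carry-out, with a 4-way unrolled loop entered part way through.  */
static uint32_t
ref_add_n (uint32_t *wp, const uint32_t *xp, const uint32_t *yp, size_t n)
{
  uint32_t cy = 0;
  size_t passes;

  if (n == 0)
    return 0;

  passes = (n + 3) / 4;         /* trips through the unrolled body */

  switch (n % 4)                /* stands in for the computed jump */
    {
    case 0: do { cy = add_limb (wp++, *xp++, *yp++, cy);
    case 3:      cy = add_limb (wp++, *xp++, *yp++, cy);
    case 2:      cy = add_limb (wp++, *xp++, *yp++, cy);
    case 1:      cy = add_limb (wp++, *xp++, *yp++, cy);
               } while (--passes != 0);
    }
  return cy;
}
```

The leftover n % 4 limbs are handled by the first, partial trip through the
body, so no separate clean-up loop is needed and the cost grows by one limb
at a time as n increases.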

Position independent code is implemented using a call to get %eip for the
computed jumps.  A ret is always done to return from that call, rather than
an addl $4,%esp or a popl, so the CPU return address branch prediction stack
stays synchronised with the actual stack in memory.

Branch prediction, in the absence of any history, will guess forward jumps
are not taken and backward jumps are taken.  Where possible it's arranged
that the less likely or less important case is under a taken forward jump.



CODING

Instructions in general code have been shown grouped if they can execute
together, which means up to three direct-path instructions which have no
successive dependencies.  K7 always decodes three and has out-of-order
execution, but the groupings show what slots might be available and what
dependency chains exist.

When there are vector-path instructions, an effort is made to get triplets
of direct-path instructions in between them, even if there are dependencies,
since this maximizes decoding throughput and might save a cycle or two if
decoding is the limiting factor.



INSTRUCTIONS

adcl       direct
divl       39 cycles back-to-back
lodsl,etc  vector
loop       1 cycle vector (decl/jnz opens up one decode slot)
movd reg   vector
movd mem   direct
mull       issue every 3 cycles, latency 4 cycles low word, 6 cycles high word
popl       vector (use movl for more than one pop)
pushl      direct, will pair with a load
shrdl %cl  vector, 3 cycles, seems to be 3 decode too
xorl r,r   false read dependency recognised



REFERENCES

"AMD Athlon Processor X86 Code Optimization Guide", AMD publication number
22007, revision E, November 1999.  Available on-line,

       http://www.amd.com/products/cpg/athlon/techdocs/pdf/22007.pdf

"3DNow Technology Manual", AMD publication number 21928F/0-August 1999.
This describes the femms and prefetch instructions.  Available on-line,

       http://www.amd.com/K6/k6docs/pdf/21928.pdf

"AMD Extensions to the 3DNow and MMX Instruction Sets Manual", AMD
publication number 22466, revision B, August 1999.  This describes
instructions added in the Athlon processor, such as pswapd and the extra
prefetch forms.  Available on-line,

       http://www.amd.com/products/cpg/athlon/techdocs/pdf/22466.pdf

"3DNow Instruction Porting Guide", AMD publication number 22621, revision B,
August 1999.  This has some notes on general Athlon optimizations as well as
3DNow.  Available on-line,

       http://www.amd.com/products/cpg/athlon/techdocs/pdf/22621.pdf




----------------
Local variables:
mode: text
fill-column: 76
End:
