OpenXM_contrib/gmp/mpn/x86/k7/README - annotate

Annotation of OpenXM_contrib/gmp/mpn/x86/k7/README, Revision 1.1 (maekawa)

                      AMD K7 MPN SUBROUTINES


This directory contains code optimized for the AMD Athlon CPU.

The mmx subdirectory has routines using MMX instructions.  All Athlons have
MMX; the separate directory is just so that configure can omit it if the
assembler doesn't support MMX.



STATUS

Times for the loops, with all code and data in L1 cache.

                                cycles/limb
        mpn_add/sub_n             1.6

        mpn_copyi                 0.75 or 1.0   \ varying with data alignment
        mpn_copyd                 0.75 or 1.0   /

        mpn_divrem_1             17.0 integer part, 15.0 fractional part
        mpn_mod_1                17.0
        mpn_divexact_by3          8.0

        mpn_l/rshift              1.2

        mpn_mul_1                 3.4
        mpn_addmul/submul_1       3.9

        mpn_mul_basecase          4.42 cycles/crossproduct (approx)

        mpn_popcount              5.0
        mpn_hamdist               6.0

Prefetching of sources hasn't yet been tried.
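
As a rough illustration of how to read the table above, the figures can be
turned into loop time estimates by multiplying by the operand size.  The
sketch below is not part of the K7 code: loop_cycles and basecase_cycles
are hypothetical helper names, and call overhead, loop setup and cache
misses are all ignored.

```c
#include <stddef.h>

/* Hypothetical helper: estimated cycles spent in the inner loop of a
   routine quoted in cycles/limb, for an operand of the given size. */
double loop_cycles(size_t limbs, double cycles_per_limb)
{
    return (double)limbs * cycles_per_limb;
}

/* Hypothetical helper: mpn_mul_basecase is quoted per crossproduct, and
   an n x m limb multiplication forms n*m crossproducts. */
double basecase_cycles(size_t n, size_t m, double cycles_per_crossproduct)
{
    return (double)n * (double)m * cycles_per_crossproduct;
}
```

For example, loop_cycles(100, 3.4) estimates about 340 cycles for a 100
limb mpn_mul_1, and basecase_cycles(10, 10, 4.42) about 442 cycles for a
10x10 limb mpn_mul_basecase.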



NOTES

cmov, MMX, 3DNow and some extensions to MMX and 3DNow are available.

The L1 data cache is write-allocate, so prefetching of destinations is
unnecessary.

Floating point multiplications can be done in parallel with integer
multiplications, but there doesn't seem to be any way to make use of this.

Unsigned "mul"s can be issued every 3 cycles.  This suggests 3 cycles per
multiply is a lower limit on the time for the multiplication routines.  The
documentation shows mul executing in IEU0 (or maybe in IEU0 and IEU1
together), so it might be that, to get near 3 cycles, code has to be
arranged so that nothing else is issued to IEU0.  A busy IEU0 could explain
why some code takes 4 cycles and other apparently equivalent code takes 5.
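
To see why the mul issue rate bounds these routines, it may help to look at
what mpn_mul_1 computes.  The following is a portable C sketch assuming
32-bit limbs, not the K7 assembly; ref_mul_1 is a hypothetical name.  Each
limb needs exactly one widening multiply, so a loop cannot go faster than
the rate at which multiplies issue.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference for what mpn_mul_1 computes, with 32-bit limbs:
   dst[0..n-1] = src[0..n-1] * multiplier, returning the carry-out limb. */
uint32_t ref_mul_1(uint32_t *dst, const uint32_t *src, size_t n,
                   uint32_t multiplier)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* One 32x32->64 multiply per limb (a single mull on x86).  Adding
           the incoming carry cannot overflow 64 bits. */
        uint64_t product = (uint64_t)src[i] * multiplier + carry;
        dst[i] = (uint32_t)product;          /* low word is the result limb */
        carry = (uint32_t)(product >> 32);   /* high word feeds the next limb */
    }
    return carry;
}
```

The high word of each product is needed by the next limb, which is where
the longer high word latency of mull matters.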



OPTIMIZATIONS

Unrolled loops are used to reduce looping overhead.  The unrolling is
configurable up to 32 limbs/loop for most routines and up to 64 for some.
The K7 has a 64k L1 code cache, so quite big unrolling is allowable.

Computed jumps into the unrolling are used to handle sizes not a multiple of
the unrolling.  An attractive feature of this is that times increase
smoothly with operand size, but it may be that some routines should just
have simple loops to finish up, especially when PIC adds between 2 and 16
cycles to get %eip.
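
The computed jump idea can be sketched in portable C with a switch that
enters the middle of an unrolled loop (the construct usually called Duff's
device).  This is only an analogy to the assembly, not the code in this
directory: copy_unrolled and UNROLL are hypothetical names, and an
unrolling of 4 is chosen for brevity.

```c
#include <stddef.h>
#include <stdint.h>

#define UNROLL 4   /* illustrative only; the real unrolling is 32 or 64 */

/* Hypothetical limb copy, low to high, with a 4-way unrolled loop.  The
   switch jumps into the loop body so the first pass handles the n % UNROLL
   leftover limbs and every later pass handles a full UNROLL limbs. */
void copy_unrolled(uint32_t *dst, const uint32_t *src, size_t n)
{
    if (n == 0)
        return;
    size_t passes = (n + UNROLL - 1) / UNROLL;   /* loop passes, rounded up */
    switch (n % UNROLL) {                        /* the "computed jump" */
    case 0: do { *dst++ = *src++;
    case 3:      *dst++ = *src++;
    case 2:      *dst++ = *src++;
    case 1:      *dst++ = *src++;
            } while (--passes > 0);
    }
}
```

No separate clean-up loop is needed, which is why times grow smoothly with
size.  The cost is the jump computation itself, which in position
independent assembly includes getting %eip.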

Position independent code uses a call to get %eip for the computed jumps.
The call is always matched by a ret, rather than an addl $4,%esp or a popl,
so that the CPU's return address branch prediction stack stays synchronised
with the actual stack in memory.

Branch prediction, in the absence of any history, will guess forward jumps
are not taken and backward jumps are taken.  Where possible it's arranged
that the less likely or less important case is under a taken forward jump.



CODING

Instructions in general code are shown grouped if they can execute
together, which means up to three direct-path instructions which have no
successive dependencies.  K7 always decodes three and has out-of-order
execution, but the groupings show what slots might be available and what
dependency chains exist.

When there are vector-path instructions, an effort is made to get triplets
of direct-path instructions in between them, even if there are
dependencies, since this maximizes decoding throughput and might save a
cycle or two if decoding is the limiting factor.



INSTRUCTIONS

adcl       direct
divl       39 cycles back-to-back
lodsl,etc  vector
loop       1 cycle vector (decl/jnz opens up one decode slot)
movd reg   vector
movd mem   direct
mull       issue every 3 cycles, latency 4 cycles low word, 6 cycles high word
popl       vector (use movl for more than one pop)
pushl      direct, will pair with a load
shrdl %cl  vector, 3 cycles, seems to be 3 decode too
xorl r,r   false read dependency recognised



REFERENCES

"AMD Athlon Processor X86 Code Optimization Guide", AMD publication number
22007, revision E, November 1999.  Available on-line,

        http://www.amd.com/products/cpg/athlon/techdocs/pdf/22007.pdf

"3DNow Technology Manual", AMD publication number 21928F/0-August 1999.
This describes the femms and prefetch instructions.  Available on-line,

        http://www.amd.com/K6/k6docs/pdf/21928.pdf

"AMD Extensions to the 3DNow and MMX Instruction Sets Manual", AMD
publication number 22466, revision B, August 1999.  This describes
instructions added in the Athlon processor, such as pswapd and the extra
prefetch forms.  Available on-line,

        http://www.amd.com/products/cpg/athlon/techdocs/pdf/22466.pdf

"3DNow Instruction Porting Guide", AMD publication number 22621, revision B,
August 1999.  This has some notes on general Athlon optimizations as well as
3DNow.  Available on-line,

        http://www.amd.com/products/cpg/athlon/techdocs/pdf/22621.pdf




----------------
Local variables:
mode: text
fill-column: 76
End: