Annotation of OpenXM_contrib/gmp/mpn/x86/k7/README, Revision 1.1
1.1 ! maekawa 1:
! 2: AMD K7 MPN SUBROUTINES
! 3:
! 4:
! 5: This directory contains code optimized for the AMD Athlon CPU.
! 6:
! 7: The mmx subdirectory has routines using MMX instructions. All Athlons have
! 8: MMX; the separate directory exists only so that configure can omit it if
! 9: the assembler doesn't support MMX.
! 10:
! 11:
! 12:
! 13: STATUS
! 14:
! 15: Times for the loops, with all code and data in L1 cache.
! 16:
! 17:                              cycles/limb
! 18:     mpn_add/sub_n            1.6
! 19:
! 20:     mpn_copyi                0.75 or 1.0   \ varying with data alignment
! 21:     mpn_copyd                0.75 or 1.0   /
! 22:
! 23:     mpn_divrem_1             17.0 integer part, 15.0 fractional part
! 24:     mpn_mod_1                17.0
! 25:     mpn_divexact_by3         8.0
! 26:
! 27:     mpn_l/rshift             1.2
! 28:
! 29:     mpn_mul_1                3.4
! 30:     mpn_addmul/submul_1      3.9
! 31:
! 32:     mpn_mul_basecase         4.42 cycles/crossproduct (approx)
! 33:
! 34:     mpn_popcount             5.0
! 35:     mpn_hamdist              6.0
! 36:
! 37: Prefetching of sources hasn't yet been tried.
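
As a rough worked example of reading the table, the per-limb figures can be
turned into cycle estimates by simple multiplication. The helper names
below are hypothetical, and the estimates cover only the inner loops with
everything in L1 cache; call overhead and cache misses are ignored.

```c
#include <stddef.h>

/* Loop times copied from the STATUS table (cycles per limb, or per
   crossproduct for mul_basecase).  All data is assumed to be in L1. */
#define K7_ADDMUL_1_CYCLES_PER_LIMB    3.9
#define K7_MUL_BASECASE_CYCLES_PER_XP  4.42

/* Hypothetical helper: estimated cycles for mpn_addmul_1 on n limbs. */
double k7_est_addmul_1(size_t n)
{
    return K7_ADDMUL_1_CYCLES_PER_LIMB * (double) n;
}

/* Hypothetical helper: estimated cycles for a un x vn basecase multiply,
   which forms un*vn limb crossproducts. */
double k7_est_mul_basecase(size_t un, size_t vn)
{
    return K7_MUL_BASECASE_CYCLES_PER_XP * (double) un * (double) vn;
}
```

By this estimate a 10x10 limb basecase multiply costs roughly 442 cycles.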
! 38:
! 39:
! 40:
! 41: NOTES
! 42:
! 43: cmov, MMX, 3DNow and some extensions to MMX and 3DNow are available.
! 44:
! 45: Write-allocate L1 data cache means prefetching of destinations is unnecessary.
! 46:
! 47: Floating point multiplications can be done in parallel with integer
! 48: multiplications, but there doesn't seem to be any way to make use of this.
! 49:
! 50: Unsigned "mul"s can be issued every 3 cycles. This suggests 3 cycles/limb
! 51: is a lower bound for the multiplication routines. The documentation shows
! 52: mul executing in IEU0 (or maybe in IEU0 and IEU1 together), so it might be
! 53: that, to get near 3 cycles, code has to be arranged so that nothing else is
! 54: issued to IEU0. A busy IEU0 could explain why some code takes 4 cycles and
! 55: other apparently equivalent code takes 5.
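
For reference, the operation mpn_mul_1 performs can be sketched in portable
C as below. This only illustrates the one-multiply-per-limb structure that
ties the routine to the mul issue rate; it is not the K7 assembler code.
The name ref_mul_1 and the fixed 32-bit limb type are assumptions made for
the example.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* assumed: 32-bit limbs, as on x86 */

/* Hypothetical reference routine: {dst,n} = {src,n} * mult, returning the
   carry-out limb.  One 32x32->64 multiply is needed per limb, so a loop
   like this cannot run faster than mul can be issued. */
limb_t ref_mul_1(limb_t *dst, const limb_t *src, size_t n, limb_t mult)
{
    limb_t carry = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        /* Cannot overflow: (2^32-1)^2 + (2^32-1) < 2^64. */
        uint64_t prod = (uint64_t) src[i] * mult + carry;
        dst[i] = (limb_t) prod;          /* low word is the result limb */
        carry = (limb_t) (prod >> 32);   /* high word feeds the next limb */
    }
    return carry;
}
```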
! 56:
! 57:
! 58:
! 59: OPTIMIZATIONS
! 60:
! 61: Unrolled loops are used to reduce looping overhead. The unrolling is
! 62: configurable up to 32 limbs/loop for most routines and up to 64 for some.
! 63: The K7 has a 64k L1 code cache, so quite big unrolling is allowable.
! 64:
! 65: Computed jumps into the unrolling are used to handle sizes not a multiple of
! 66: the unrolling. An attractive feature of this is that times increase
! 67: smoothly with operand size, but it may be that some routines should just
! 68: have simple loops to finish up, especially when PIC adds between 2 and 16
! 69: cycles to get %eip.
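
A computed jump into an unrolled loop has a rough C analogue in a switch
that enters the middle of the loop body (Duff's device). The sketch below
applies it to a 4-way unrolled addition with carry. The name ref_add_n, the
unroll factor of 4 and the 32-bit limb type are assumptions for
illustration; the real routines are assembler with much larger unrolling.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* assumed: 32-bit limbs, as on x86 */

/* One limb of add-with-carry; advances all three pointers. */
#define ADD_STEP                                        \
    do {                                                \
        uint64_t s = (uint64_t) *ap++ + *bp++ + carry;  \
        *dp++ = (limb_t) s;                             \
        carry = (limb_t) (s >> 32);                     \
    } while (0)

/* Hypothetical reference routine: {dp,n} = {ap,n} + {bp,n}, returning the
   carry out.  The switch jumps into the unrolled body so that the first
   pass handles n mod 4 limbs (or 4 when n is a multiple of 4) and every
   later pass handles a full 4. */
limb_t ref_add_n(limb_t *dp, const limb_t *ap, const limb_t *bp, size_t n)
{
    limb_t carry = 0;
    size_t passes;

    if (n == 0)
        return 0;

    passes = (n + 3) / 4;
    switch (n % 4) {
    case 0: do { ADD_STEP;
    case 3:      ADD_STEP;
    case 2:      ADD_STEP;
    case 1:      ADD_STEP;
            } while (--passes > 0);
    }
    return carry;
}
```

Since the entry point is the only thing that depends on n mod 4, the time
grows smoothly with n, which is the property noted above.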
! 70:
! 71: Position independent code is implemented using a call to get %eip for the
! 72: computed jumps. A ret is always done, rather than an addl $4,%esp or a
! 73: popl, so the CPU return address branch prediction stack stays synchronised
! 74: with the actual stack in memory.
! 75:
! 76: Branch prediction, in the absence of any history, will guess forward jumps
! 77: are not taken and backward jumps are taken. Where possible it's arranged
! 78: that the less likely or less important case is under a taken forward jump.
! 79:
! 80:
! 81:
! 82: CODING
! 83:
! 84: Instructions in general code have been shown grouped if they can execute
! 85: together, which means up to three direct-path instructions which have no
! 86: successive dependencies. K7 always decodes three and has out-of-order
! 87: execution, but the groupings show what slots might be available and what
! 88: dependency chains exist.
! 89:
! 90: When there are vector-path instructions an effort is made to get triplets
! 91: of direct-path instructions in between them, even if there are
! 92: dependencies, since this maximizes decoding throughput and might save a
! 93: cycle or two if decoding is the limiting factor.
! 94:
! 95:
! 96:
! 97: INSTRUCTIONS
! 98:
! 99:  adcl        direct
! 100: divl        39 cycles back-to-back
! 101: lodsl,etc   vector
! 102: loop        1 cycle vector (decl/jnz opens up one decode slot)
! 103: movd reg    vector
! 104: movd mem    direct
! 105: mull        issue every 3 cycles, latency 4 cycles low word, 6 cycles high word
! 106: popl        vector (use movl for more than one pop)
! 107: pushl       direct, will pair with a load
! 108: shrdl %cl   vector, 3 cycles, seems to be 3 decode too
! 109: xorl r,r    false read dependency recognised
! 110:
! 111:
! 112:
! 113: REFERENCES
! 114:
! 115: "AMD Athlon Processor X86 Code Optimization Guide", AMD publication number
! 116: 22007, revision E, November 1999. Available on-line,
! 117:
! 118: http://www.amd.com/products/cpg/athlon/techdocs/pdf/22007.pdf
! 119:
! 120: "3DNow Technology Manual", AMD publication number 21928F/0-August 1999.
! 121: This describes the femms and prefetch instructions. Available on-line,
! 122:
! 123: http://www.amd.com/K6/k6docs/pdf/21928.pdf
! 124:
! 125: "AMD Extensions to the 3DNow and MMX Instruction Sets Manual", AMD
! 126: publication number 22466, revision B, August 1999. This describes
! 127: instructions added in the Athlon processor, such as pswapd and the extra
! 128: prefetch forms. Available on-line,
! 129:
! 130: http://www.amd.com/products/cpg/athlon/techdocs/pdf/22466.pdf
! 131:
! 132: "3DNow Instruction Porting Guide", AMD publication number 22621, revision B,
! 133: August 1999. This has some notes on general Athlon optimizations as well as
! 134: 3DNow. Available on-line,
! 135:
! 136: http://www.amd.com/products/cpg/athlon/techdocs/pdf/22621.pdf
! 137:
! 138:
! 139:
! 140:
! 141: ----------------
! 142: Local variables:
! 143: mode: text
! 144: fill-column: 76
! 145: End: