Annotation of OpenXM_contrib/gmp/mpn/x86/k6/README, Revision 1.1
1.1 ! maekawa 1:
! 2: AMD K6 MPN SUBROUTINES
! 3:
! 4:
! 5:
! 6: This directory contains code optimized for AMD K6 CPUs, meaning K6, K6-2 and
! 7: K6-3.
! 8:
! 9: The mmx and k62mmx subdirectories have routines using MMX instructions. All
! 10: K6s have MMX; the separate directories exist only so that ./configure can
! 11: omit them if the assembler doesn't support MMX.
! 12:
! 13:
! 14:
! 15:
! 16: STATUS
! 17:
! 18: Times for the loops, with all code and data in L1 cache, are as follows.
! 19:
! 20:                                  cycles/limb
! 21:
! 22:     mpn_add_n/sub_n              3.25 normal, 2.75 in-place
! 23:
! 24:     mpn_mul_1                    6.25
! 25:     mpn_add/submul_1             7.65-8.4  (varying with data values)
! 26:
! 27:     mpn_mul_basecase             9.25  cycles/crossproduct (approx)
! 28:     mpn_sqr_basecase             4.7   cycles/crossproduct (approx)
! 29:                               or 9.2   cycles/triangleproduct (approx)
! 30:
! 31:     mpn_divrem_1                 20.0
! 32:     mpn_mod_1                    20.0
! 33:     mpn_divexact_by3             11.0
! 34:
! 35:     mpn_l/rshift                 3.0
! 36:
! 37:     mpn_copyi/copyd              1.0
! 38:
! 39:     mpn_com_n                    1.5-1.85  \
! 40:     mpn_and/andn/ior/xor_n       1.5-1.75  | varying with
! 41:     mpn_iorn/xnor_n              2.0-2.25  | data alignment
! 42:     mpn_nand/nior_n              2.0-2.25  /
! 43:
! 44:     mpn_popcount                 12.5
! 45:     mpn_hamdist                  13.0
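
As a rough worked example of reading the table above, the following C sketch
turns the mpn_mul_basecase and mpn_sqr_basecase figures into cycle estimates.
The function and macro names are made up for illustration. The count of
n*(n+1)/2 products in a "triangle" is an assumption, not something stated in
this file; with that assumption the two mpn_sqr_basecase figures roughly
agree, since 9.2/4.7 is close to 2.

```c
#include <stddef.h>

/* Figures copied from the table above (plain K6, code and data in L1). */
#define K6_MUL_BASECASE_PER_CROSS  9.25
#define K6_SQR_BASECASE_PER_CROSS  4.7
#define K6_SQR_BASECASE_PER_TRI    9.2

/* Estimated cycles for an xn by yn limb basecase multiply: one
   crossproduct per pair of limbs. */
static double est_mul_basecase_cycles(size_t xn, size_t yn)
{
    return K6_MUL_BASECASE_PER_CROSS * (double) xn * (double) yn;
}

/* ASSUMPTION: a "triangle" of an n limb square is taken to hold
   n*(n+1)/2 products.  The README does not define the term. */
static size_t triangle_products(size_t n)
{
    return n * (n + 1) / 2;
}

/* Estimate for an n limb square from the per-crossproduct figure. */
static double est_sqr_basecase_cycles(size_t n)
{
    return K6_SQR_BASECASE_PER_CROSS * (double) n * (double) n;
}

/* The same estimate from the per-triangleproduct figure. */
static double est_sqr_basecase_cycles_tri(size_t n)
{
    return K6_SQR_BASECASE_PER_TRI * (double) triangle_products(n);
}
```

For n = 32 the two squaring estimates come to about 4813 and 4858 cycles,
against 9472 for a 32 by 32 limb multiply.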
! 46:
! 47:
! 48: K6-2 and K6-3 have dual-issue MMX and get the following improvements.
! 49:
! 50:     mpn_l/rshift                 1.75
! 51:
! 52:     mpn_copyi/copyd              0.56 or 1.0  \
! 53:                                               |
! 54:     mpn_com_n                    1.0-1.2      | varying with
! 55:     mpn_and/andn/ior/xor_n       1.2-1.5      | data alignment
! 56:     mpn_iorn/xnor_n              1.5-2.0      |
! 57:     mpn_nand/nior_n              1.75-2.0     /
! 58:
! 59:     mpn_popcount                 9.0
! 60:     mpn_hamdist                  11.5
! 61:
! 62:
! 63: Prefetching of sources hasn't yet given any speedup. With the 3DNow "prefetch"
! 64: instruction, code seems to run slower, and with just "mov" loads it doesn't
! 65: seem faster. Results so far are inconsistent. The K6 does a hardware
! 66: prefetch of the second cache line in a sector, so the penalty for not
! 67: prefetching in software is reduced.
! 68:
! 69:
! 70:
! 71:
! 72: NOTES
! 73:
! 74: All K6 family chips have MMX, but only K6-2 and K6-3 have 3DNow.
! 75:
! 76: Plain K6 executes MMX instructions only in the X pipe, but K6-2 and K6-3 can
! 77: execute them in both X and Y (and together).
! 78:
! 79: Branch misprediction penalty is 1 to 4 cycles (Optimization Manual
! 80: chapter 6 table 12).
! 81:
! 82: Write-allocate L1 data cache means prefetching of destinations is unnecessary.
! 83: Store queue is 7 entries of 64 bits each.
! 84:
! 85: Floating point multiplications can be done in parallel with integer
! 86: multiplications, but there doesn't seem to be any way to make use of this.
! 87:
! 88:
! 89:
! 90: OPTIMIZATIONS
! 91:
! 92: Unrolled loops are used to reduce looping overhead. The unrolling is
! 93: configurable up to 32 limbs/loop for most routines, up to 64 for some.
! 94:
! 95: Sometimes computed jumps into the unrolling are used to handle sizes not a
! 96: multiple of the unrolling. An attractive feature of this is that times
! 97: increase smoothly with operand size, but an indirect jump costs about 6
! 98: cycles and the setup about another 6, so whether a computed jump is worth
! 99: using depends on how much faster the unrolled code is than a simple loop.
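
The assembly routines themselves are not shown in this file. As a loose C
analogue only, a computed jump into an unrolled loop resembles the familiar
switch-into-loop idiom sketched below. The names limb_t and and_n_unrolled
are hypothetical, and the 4x unrolling is chosen just to keep the sketch
short; it is not a transcription of the K6 code.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* hypothetical stand-in for a 32-bit limb */

/* dst[i] = a[i] & b[i] for 0 <= i < n, unrolled 4 times.  The switch
   plays the role of the computed jump: it enters the unrolled body part
   way through, so the first pass handles the n % 4 leftover limbs and
   every later pass handles a full 4. */
static void and_n_unrolled(limb_t *dst, const limb_t *a, const limb_t *b,
                           size_t n)
{
    size_t passes;

    if (n == 0)
        return;

    passes = (n + 3) / 4;      /* number of trips through the body */
    switch (n % 4) {
    case 0: do { *dst++ = *a++ & *b++;
    case 3:      *dst++ = *a++ & *b++;
    case 2:      *dst++ = *a++ & *b++;
    case 1:      *dst++ = *a++ & *b++;
            } while (--passes > 0);
    }
}
```

Because no separate cleanup loop is needed, the cost grows by one limb's worth
of work per extra limb, which is the smooth increase described above. The
switch dispatch is the part that corresponds to the indirect jump and its
setup.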
! 100:
! 101: Position independent code for computed jumps uses a call to get eip. A
! 102: ret is always done, rather than an addl $4,%esp or a popl, so that the
! 103: CPU return address branch prediction stack stays synchronised with the
! 104: actual stack in memory. Even so, such a call still costs 4 to 7
! 105: cycles.
! 106:
! 107: Branch prediction, in the absence of any history, will guess forward jumps are
! 108: not taken and backward jumps are taken. Where possible it's arranged that
! 109: the less likely or less important case is under a taken forward jump.
! 110:
! 111:
! 112:
! 113: MMX
! 114:
! 115: Putting emms or femms as late as possible in a routine seems to be fastest.
! 116: Perhaps an emms or femms stalls until all outstanding MMX instructions have
! 117: completed, so putting it later gives them a chance to complete on their own,
! 118: in parallel with other operations (like register popping).
! 119:
! 120: The Optimization Manual chapter 5 recommends using a femms on K6-2 and K6-3
! 121: at the start of a routine, in case it's been preceded by x87 floating point
! 122: operations. This isn't done because in gmp programs it's expected that x87
! 123: floating point won't be much used and that chances are an mpn routine won't
! 124: have been preceded by any x87 code.
! 125:
! 126:
! 127:
! 128: CODING
! 129:
! 130: Instructions in general code are shown paired if they can decode and execute
! 131: together, meaning two short decode instructions with the second not
! 132: depending on the first, only the first using the shifter, no more than one
! 133: load, and no more than one store.
! 134:
! 135: K6 does some out of order execution so the pairings aren't essential; they
! 136: just show what slots might be available. When decoding is the limiting
! 137: factor things can be scheduled that might not execute until later.
! 138:
! 139:
! 140:
! 141: NOTES
! 142:
! 143: Code alignment
! 144:
! 145: - if an opcode/modrm or 0Fh/opcode/modrm crosses a cache line boundary,
! 146: short decode is inhibited. The cross.pl script detects this.
! 147:
! 148: - loops and branch targets should be aligned to 16 bytes, or ensure at least
! 149: 2 instructions before a 32 byte boundary. This makes use of the 16 byte
! 150: cache in the BTB.
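
The two code alignment rules above amount to simple arithmetic on code
offsets. The C sketch below is only an illustration of that arithmetic and is
not the cross.pl script. The function names are hypothetical, and the 32 byte
cache line size is an assumption taken from the "32 byte boundary" wording.

```c
#include <stddef.h>
#include <stdint.h>

#define K6_CACHE_LINE  32   /* ASSUMPTION: 32 byte cache lines */

/* Nonzero if the len bytes starting at code offset off do not all lie
   in the same cache line. */
static int crosses_cache_line(uintptr_t off, size_t len)
{
    if (len == 0)
        return 0;
    return (off / K6_CACHE_LINE) != ((off + len - 1) / K6_CACHE_LINE);
}

/* The bytes that matter are opcode/modrm (2 bytes), or 0Fh/opcode/modrm
   (3 bytes) when the instruction has the 0F prefix. */
static int short_decode_inhibited(uintptr_t insn_off, int has_0f_prefix)
{
    return crosses_cache_line(insn_off, has_0f_prefix ? 3 : 2);
}

/* Nonzero if a loop top or branch target sits on a 16 byte boundary. */
static int is_aligned_16(uintptr_t off)
{
    return (off % 16) == 0;
}
```

For instance an opcode/modrm pair at offset 30 fits in one line, while a
0Fh/opcode/modrm sequence at the same offset spills into the next line.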
! 151:
! 152: Addressing modes
! 153:
! 154: - (%esi) degrades decoding from short to vector. 0(%esi) doesn't have this
! 155: problem and can be used as an equivalent, or more simply a different
! 156: register, like %ebx, can be used instead.
! 157:
! 158: - K6 and pre-CXT core K6-2 have the following problem. (K6-2 CXT and K6-3
! 159: have it fixed, these being cpuid function 1 signatures 0x588 to 0x58F).
! 160:
! 161: If more than 3 bytes are needed to determine instruction length then
! 162: decoding degrades from direct to long, or from long to vector. This
! 163: happens with forms like "0F opcode mod/rm" with mod/rm=00-xxx-100 since
! 164: with mod=00 the sib determines whether there's a displacement.
! 165:
! 166: This affects all MMX and 3DNow instructions, and others with an 0F prefix
! 167: like movzbl. The modes affected are anything with an index and no
! 168: displacement, or an index but no base, and this includes (%esp) which is
! 169: really (,%esp,1).
! 170:
! 171: The cross.pl script detects problem cases. The workaround is to always
! 172: use a displacement, and to do this with Zdisp if it's zero so the
! 173: assembler doesn't discard it.
! 174:
! 175: See Optimization Manual rev D page 67 and 3DNow Porting Guide rev B pages
! 176: 13-14 and 36-37.
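
As an illustration of the encoding involved, the C sketch below tests a
mod/rm byte for the mod=00, r/m=100 pattern and splits a cpuid function 1
signature into its fields. The function names are hypothetical. The range
check covers only the 0x588 to 0x58F signatures quoted above, and the sketch
assumes the caller has already obtained the signature (eax from cpuid
function 1).

```c
#include <stdint.h>

/* mod/rm is laid out as mod(2 bits) reg(3 bits) r/m(3 bits).  With
   mod=00 and r/m=100 a SIB byte follows, and the SIB decides whether a
   displacement is present.  The reg field (xxx) is ignored. */
static int modrm_is_mod00_sib(uint8_t modrm)
{
    return (modrm & 0xC7) == 0x04;
}

/* Fields of a cpuid function 1 signature. */
static unsigned sig_family(uint32_t sig)   { return (sig >> 8) & 0xF; }
static unsigned sig_model(uint32_t sig)    { return (sig >> 4) & 0xF; }
static unsigned sig_stepping(uint32_t sig) { return sig & 0xF; }

/* Nonzero for the signatures quoted in the text as having the decode
   problem fixed.  Only the family/model/stepping bits are compared. */
static int sig_in_fixed_range(uint32_t sig)
{
    uint32_t fms = sig & 0xFFF;
    return fms >= 0x588 && fms <= 0x58F;
}
```

Signature 0x588 decodes as family 5, model 8, stepping 8, the first value in
the quoted range.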
! 177:
! 178: Calls
! 179:
! 180: - indirect jumps and calls are not branch predicted and measure about 6
! 181: cycles.
! 182:
! 183: Various
! 184:
! 185: - adcl      2 cycles of decode, maybe 2 cycles executing in the X pipe
! 186: - bsf       12-27 cycles
! 187: - emms      5 cycles
! 188: - femms     3 cycles
! 189: - jecxz     2 cycles taken, 13 not taken (optimization manual says 7 not taken)
! 190: - divl      20 cycles back-to-back
! 191: - imull     2 decode, 2 execute
! 192: - mull      2 decode, 3 execute (optimization manual decoding sample)
! 193: - prefetch  2 cycles
! 194: - rcll/rcrl implicit by one bit: 2 cycles
! 195:             immediate or %cl count: 11 + 2 per bit for dword
! 196:                                     13 + 4 per bit for byte
! 197: - setCC     2 cycles
! 198: - xchgl     %eax,reg  1.5 cycles, back-to-back (strange)
! 199:             reg,reg   2 cycles, back-to-back
! 200:
! 201:
! 202:
! 203:
! 204: REFERENCES
! 205:
! 206: "AMD-K6 Processor Code Optimization Application Note", AMD publication
! 207: number 21924, revision D amendment 0, January 2000. This describes K6-2 and
! 208: K6-3. Available on-line,
! 209:
! 210: http://www.amd.com/K6/k6docs/pdf/21924.pdf
! 211:
! 212: "AMD-K6 MMX Enhanced Processor x86 Code Optimization Application Note", AMD
! 213: publication number 21828, revision A amendment 0, August 1997. This is an
! 214: older edition of the above document, describing plain K6. Available
! 215: on-line,
! 216:
! 217: http://www.amd.com/K6/k6docs/pdf/21828.pdf
! 218:
! 219: "3DNow Technology Manual", AMD publication number 21928F/0-August 1999.
! 220: This describes the femms and prefetch instructions, but nothing else from
! 221: 3DNow has been used. Available on-line,
! 222:
! 223: http://www.amd.com/K6/k6docs/pdf/21928.pdf
! 224:
! 225: "3DNow Instruction Porting Guide", AMD publication number 22621, revision B,
! 226: August 1999. This has some notes on general K6 optimizations as well as
! 227: 3DNow. Available on-line,
! 228:
! 229: http://www.amd.com/products/cpg/athlon/techdocs/pdf/22621.pdf
! 230:
! 231:
! 232:
! 233: ----------------
! 234: Local variables:
! 235: mode: text
! 236: fill-column: 76
! 237: End: