                       AMD K6 MPN SUBROUTINES



This directory contains code optimized for AMD K6 CPUs, meaning K6, K6-2 and
K6-3.

The mmx and k62mmx subdirectories have routines using MMX instructions.  All
K6s have MMX; the separate directories exist only so that ./configure can
omit them if the assembler doesn't support MMX.




STATUS

Times for the loops, with all code and data in L1 cache, are as follows.

                                 cycles/limb

       mpn_add_n/sub_n            3.25 normal, 2.75 in-place

       mpn_mul_1                  6.25
       mpn_add/submul_1           7.65-8.4  (varying with data values)

       mpn_mul_basecase           9.25 cycles/crossproduct (approx)
       mpn_sqr_basecase           4.7  cycles/crossproduct (approx)
                                   or 9.2 cycles/triangleproduct (approx)

       mpn_divrem_1              20.0
       mpn_mod_1                 20.0
       mpn_divexact_by3          11.0

       mpn_l/rshift               3.0

       mpn_copyi/copyd            1.0

       mpn_com_n                  1.5-1.85  \
       mpn_and/andn/ior/xor_n     1.5-1.75  | varying with
       mpn_iorn/xnor_n            2.0-2.25  | data alignment
       mpn_nand/nior_n            2.0-2.25  /

       mpn_popcount              12.5
       mpn_hamdist               13.0

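The two mpn_sqr_basecase figures above describe the same loop counted two
ways.  As a rough check, the sketch below assumes an n-limb square is counted
either as n*n crossproducts, or as n*(n+1)/2 triangle products (the n
diagonal squares plus the n*(n-1)/2 products above the diagonal, each formed
once and doubled).  The function names and this counting convention are
illustrative assumptions, not taken from the gmp sources.

```c
#include <assert.h>

/* Hypothetical helpers: counts of limb products for an n-limb square. */

/* Every pair (i,j), as a general basecase multiply would compute them. */
static unsigned long cross_products(unsigned long n)
{
    return n * n;
}

/* Pairs with i <= j only: assumes products above the diagonal are
   formed once and then doubled. */
static unsigned long triangle_products(unsigned long n)
{
    return n * (n + 1) / 2;
}

/* Estimated cycles from a cycles-per-product figure; ignores call and
   loop setup overhead. */
static double estimate_cycles(double cycles_per_product,
                              unsigned long products)
{
    assert(cycles_per_product > 0.0);
    return cycles_per_product * (double) products;
}
```

For n = 20 this gives 400 crossproducts and 210 triangle products, so the
table's figures predict about 4.7 * 400 = 1880 or 9.2 * 210 = 1932 cycles,
which agree to within the "approx" stated above.
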

K6-2 and K6-3 have dual-issue MMX and get the following improvements.

       mpn_l/rshift               1.75

       mpn_copyi/copyd            0.56 or 1.0  \
                                               |
       mpn_com_n                  1.0-1.2      | varying with
       mpn_and/andn/ior/xor_n     1.2-1.5      | data alignment
       mpn_iorn/xnor_n            1.5-2.0      |
       mpn_nand/nior_n            1.75-2.0     /

       mpn_popcount               9.0
       mpn_hamdist               11.5


Prefetching of sources hasn't yet given any joy.  With the 3DNow "prefetch"
instruction, code seems to run slower, and with just "mov" loads it doesn't
seem faster.  Results so far are inconsistent.  The K6 does a hardware
prefetch of the second cache line in a sector, so the penalty for not
prefetching in software is reduced.




NOTES

All K6 family chips have MMX, but only K6-2 and K6-3 have 3DNow.

Plain K6 executes MMX instructions only in the X pipe, but K6-2 and K6-3 can
execute them in both X and Y (and together).

Branch misprediction penalty is 1 to 4 cycles (Optimization Manual
chapter 6 table 12).

Write-allocate L1 data cache means prefetching of destinations is unnecessary.
Store queue is 7 entries of 64 bits each.

Floating point multiplications can be done in parallel with integer
multiplications, but there doesn't seem to be any way to make use of this.



OPTIMIZATIONS

Unrolled loops are used to reduce looping overhead.  The unrolling is
configurable up to 32 limbs/loop for most routines, up to 64 for some.

Sometimes computed jumps into the unrolling are used to handle sizes not a
multiple of the unrolling.  An attractive feature of this is that times
smoothly increase with operand size, but an indirect jump is about 6 cycles
and the setups about another 6, so whether a computed jump is worthwhile
depends on how much faster the unrolled code is than a simple loop.

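The idea of jumping into the middle of an unrolled loop can be sketched in C
with a switch that enters the loop body part-way (Duff's device).  This is
only an analogy: the routines in this directory are assembler, and a C
compiler is free to implement the switch without an indirect jump.  The
limb_t typedef and the function name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* hypothetical stand-in for a 32-bit limb */

/* dst[i] = a[i] & b[i] for 0 <= i < n, with n >= 1, unrolled 4 limbs
   per pass.  The switch picks the entry point so the first pass does
   n % 4 limbs (or 4 when n is a multiple of 4). */
static void and_n_unrolled(limb_t *dst, const limb_t *a, const limb_t *b,
                           size_t n)
{
    size_t passes = (n + 3) / 4;   /* trips through the unrolled body */

    assert(n >= 1);
    switch (n % 4) {               /* the "computed jump" */
    case 0: do { *dst++ = *a++ & *b++;
    case 3:      *dst++ = *a++ & *b++;
    case 2:      *dst++ = *a++ & *b++;
    case 1:      *dst++ = *a++ & *b++;
            } while (--passes > 0);
    }
}
```

With n = 5 the switch enters at case 1, does one limb, then makes one full
pass of four.  No separate cleanup loop is needed, which is why times grow
smoothly with size, at the cost of the jump and its setup.
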
Position independent code is implemented using a call to get eip for
computed jumps.  A ret is always done, rather than an addl $4,%esp or a
popl, so the CPU return address branch prediction stack stays synchronised
with the actual stack in memory.  Such a call however still costs 4 to 7
cycles.

Branch prediction, in absence of any history, will guess forward jumps are
not taken and backward jumps are taken.  Where possible it's arranged that
the less likely or less important case is under a taken forward jump.



MMX

Putting emms or femms as late as possible in a routine seems to be fastest.
Perhaps an emms or femms stalls until all outstanding MMX instructions have
completed, so putting it later gives them a chance to complete on their own,
in parallel with other operations (like register popping).

The Optimization Manual chapter 5 recommends using a femms on K6-2 and K6-3
at the start of a routine, in case it's been preceded by x87 floating point
operations.  This isn't done because in gmp programs it's expected that x87
floating point won't be much used and that chances are an mpn routine won't
have been preceded by any x87 code.



CODING

Instructions in general code are shown paired if they can decode and execute
together, meaning two short decode instructions with the second not
depending on the first, only the first using the shifter, no more than one
load, and no more than one store.

K6 does some out of order execution, so the pairings aren't essential; they
just show what slots might be available.  When decoding is the limiting
factor, things can be scheduled that might not execute until later.



NOTES

Code alignment

- if an opcode/modrm or 0Fh/opcode/modrm crosses a cache line boundary,
  short decode is inhibited.  The cross.pl script detects this.

- loops and branch targets should be aligned to 16 bytes, or ensure at least
  2 instructions before a 32 byte boundary.  This makes use of the 16 byte
  cache in the BTB.

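The boundary condition in the first point above amounts to a small piece of
arithmetic on byte offsets.  The sketch below is not a description of how
cross.pl works; it is a hypothetical helper showing the test, and the line
size is a parameter because the value of 32 bytes used in the example is an
assumption here.

```c
#include <assert.h>

/* Nonzero if the len bytes starting at byte offset off span a boundary
   between two cache lines of line_size bytes. */
static int crosses_line(unsigned long off, unsigned long len,
                        unsigned long line_size)
{
    assert(len >= 1 && line_size >= 1);
    return off / line_size != (off + len - 1) / line_size;
}
```

An opcode/modrm pair is len 2 and an 0Fh/opcode/modrm sequence is len 3, so
with 32 byte lines an 0Fh instruction starting at offset 30 would be
flagged, while a plain opcode/modrm at the same offset would not.
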
Addressing modes

- (%esi) degrades decoding from short to vector.  0(%esi) doesn't have this
  problem, and can be used as an equivalent, or easier is just to use a
  different register, like %ebx.

- K6 and pre-CXT core K6-2 have the following problem.  (K6-2 CXT and K6-3
  have it fixed, these being cpuid function 1 signatures 0x588 to 0x58F).

  If more than 3 bytes are needed to determine instruction length then
  decoding degrades from direct to long, or from long to vector.  This
  happens with forms like "0F opcode mod/rm" with mod/rm=00-xxx-100 since
  with mod=00 the sib determines whether there's a displacement.

  This affects all MMX and 3DNow instructions, and others with an 0F prefix
  like movzbl.  The modes affected are anything with an index and no
  displacement, or an index but no base, and this includes (%esp) which is
  really (,%esp,1).

  The cross.pl script detects problem cases.  The workaround is to always
  use a displacement, and to do this with Zdisp if it's zero so the
  assembler doesn't discard it.

  See Optimization Manual rev D page 67 and 3DNow Porting Guide rev B pages
  13-14 and 36-37.

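The mod/rm=00-xxx-100 pattern described above is a test on two bit fields of
the mod/rm byte.  The following sketch shows that test; the helper names are
hypothetical, and it only classifies the mod/rm byte, it does not decode
whole instructions.

```c
#include <assert.h>
#include <stdint.h>

/* Build a mod/rm byte from its three fields: mod (2 bits), reg (3 bits)
   and r/m (3 bits). */
static uint8_t modrm_make(unsigned mod, unsigned reg, unsigned rm)
{
    assert(mod < 4 && reg < 8 && rm < 8);
    return (uint8_t) ((mod << 6) | (reg << 3) | rm);
}

/* Nonzero if the mod/rm byte has the form 00-xxx-100, meaning a sib byte
   follows and must be examined to learn whether a displacement is
   present. */
static int modrm_needs_sib_for_length(uint8_t modrm)
{
    unsigned mod = modrm >> 6;   /* top two bits */
    unsigned rm  = modrm & 7;    /* low three bits */
    return mod == 0 && rm == 4;
}
```

A (%esp) operand with register field 0 has mod/rm 0x04 and is flagged.  The
same operand with a zero byte displacement has mod=01, giving 0x44, and is
not, which is the effect of the displacement workaround.
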
Calls

- indirect jumps and calls are not branch predicted; they measure about 6
  cycles.

Various

- adcl      2 cycles of decode, maybe 2 cycles executing in the X pipe
- bsf       12-27 cycles
- emms      5 cycles
- femms     3 cycles
- jecxz     2 cycles taken, 13 not taken (optimization manual says 7 not taken)
- divl      20 cycles back-to-back
- imull     2 decode, 2 execute
- mull      2 decode, 3 execute (optimization manual decoding sample)
- prefetch  2 cycles
- rcll/rcrl implicit by one bit: 2 cycles
            immediate or %cl count: 11 + 2 per bit for dword
                                    13 + 4 per bit for byte
- setCC     2 cycles
- xchgl     %eax,reg  1.5 cycles, back-to-back (strange)
            reg,reg   2 cycles, back-to-back

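As a worked example of the rcll/rcrl figures above, the sketch below compares
one dword rotate with a count against the same rotation done as repeated
rotate-by-one instructions.  It simply sums the measured figures and assumes
they add linearly; decode bandwidth, code size and overlap with other
instructions are ignored, and the function names are hypothetical.

```c
#include <assert.h>

/* Cycles for one dword rcll/rcrl with an immediate or %cl count of bits,
   using the "11 + 2 per bit" figure. */
static unsigned rcl_counted_cycles(unsigned bits)
{
    assert(bits >= 1 && bits <= 31);
    return 11 + 2 * bits;
}

/* Cycles for the same rotation done as bits separate rotate-by-one
   instructions at 2 cycles each. */
static unsigned rcl_by_ones_cycles(unsigned bits)
{
    assert(bits >= 1 && bits <= 31);
    return 2 * bits;
}
```

For a 4 bit rotate this gives 19 cycles against 8, and on these figures the
counted form is always 11 cycles behind.
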



REFERENCES

"AMD-K6 Processor Code Optimization Application Note", AMD publication
number 21924, revision D amendment 0, January 2000.  This describes K6-2 and
K6-3.  Available on-line,

       http://www.amd.com/K6/k6docs/pdf/21924.pdf

"AMD-K6 MMX Enhanced Processor x86 Code Optimization Application Note", AMD
publication number 21828, revision A amendment 0, August 1997.  This is an
older edition of the above document, describing plain K6.  Available
on-line,

       http://www.amd.com/K6/k6docs/pdf/21828.pdf

"3DNow Technology Manual", AMD publication number 21928F/0-August 1999.
This describes the femms and prefetch instructions, but nothing else from
3DNow has been used.  Available on-line,

       http://www.amd.com/K6/k6docs/pdf/21928.pdf

"3DNow Instruction Porting Guide", AMD publication number 22621, revision B,
August 1999.  This has some notes on general K6 optimizations as well as
3DNow.  Available on-line,

       http://www.amd.com/products/cpg/athlon/techdocs/pdf/22621.pdf



----------------
Local variables:
mode: text
fill-column: 76
End:
