OpenXM_contrib/gmp/mpn/x86/k6/README - annotate

Annotation of OpenXM_contrib/gmp/mpn/x86/k6/README, Revision 1.1

1.1     ! maekawa     1:
        !             2:                        AMD K6 MPN SUBROUTINES
        !             3:
        !             4:
        !             5:
        !             6: This directory contains code optimized for AMD K6 CPUs, meaning K6, K6-2 and
        !             7: K6-3.
        !             8:
        !             9: The mmx and k62mmx subdirectories have routines using MMX instructions.  All
        !            10: K6s have MMX; the separate directories exist only so that ./configure can
        !            11: omit them if the assembler doesn't support MMX.
        !            12:
        !            13:
        !            14:
        !            15:
        !            16: STATUS
        !            17:
        !            18: Times for the loops, with all code and data in L1 cache, are as follows.
        !            19:
        !            20:                                  cycles/limb
        !            21:
        !            22:        mpn_add_n/sub_n            3.25 normal, 2.75 in-place
        !            23:
        !            24:        mpn_mul_1                  6.25
        !            25:        mpn_add/submul_1           7.65-8.4  (varying with data values)
        !            26:
        !            27:        mpn_mul_basecase           9.25 cycles/crossproduct (approx)
        !            28:        mpn_sqr_basecase           4.7  cycles/crossproduct (approx)
        !            29:                                    or 9.2 cycles/triangleproduct (approx)
        !            30:
        !            31:        mpn_divrem_1              20.0
        !            32:        mpn_mod_1                 20.0
        !            33:        mpn_divexact_by3          11.0
        !            34:
        !            35:        mpn_l/rshift               3.0
        !            36:
        !            37:        mpn_copyi/copyd            1.0
        !            38:
        !            39:        mpn_com_n                  1.5-1.85  \
        !            40:        mpn_and/andn/ior/xor_n     1.5-1.75  | varying with
        !            41:        mpn_iorn/xnor_n            2.0-2.25  | data alignment
        !            42:        mpn_nand/nior_n            2.0-2.25  /
        !            43:
        !            44:        mpn_popcount              12.5
        !            45:        mpn_hamdist               13.0
        !            46:
        !            47:
        !            48: K6-2 and K6-3 have dual-issue MMX and get the following improvements.
        !            49:
        !            50:        mpn_l/rshift               1.75
        !            51:
        !            52:        mpn_copyi/copyd            0.56 or 1.0  \
        !            53:                                                 |
        !            54:        mpn_com_n                  1.0-1.2      | varying with
        !            55:        mpn_and/andn/ior/xor_n     1.2-1.5      | data alignment
        !            56:        mpn_iorn/xnor_n            1.5-2.0      |
        !            57:        mpn_nand/nior_n            1.75-2.0     /
        !            58:
        !            59:        mpn_popcount               9.0
        !            60:        mpn_hamdist               11.5
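
As a rough illustration of how the plain K6 figures above translate into
total time, the following C sketch simply multiplies the per-limb or
per-crossproduct figures by the operand sizes.  The function names are
hypothetical, a crossproduct is assumed to mean one limb-by-limb product,
and loop setup, call overhead and cache misses are all ignored, so the
results are optimistic estimates only.

```c
#include <stddef.h>

/* Approximate plain K6 figures copied from the STATUS table. */
#define K6_MUL_1_PER_LIMB          6.25   /* cycles/limb */
#define K6_MUL_BASECASE_PER_XP     9.25   /* cycles/crossproduct */
#define K6_SQR_BASECASE_PER_XP     4.7    /* cycles/crossproduct */

/* Estimated cycles for mpn_mul_1 on n limbs. */
double k6_est_mul_1(size_t n)
{
    return K6_MUL_1_PER_LIMB * (double) n;
}

/* Estimated cycles for an un x vn basecase multiply, assuming it
   forms un*vn crossproducts. */
double k6_est_mul_basecase(size_t un, size_t vn)
{
    return K6_MUL_BASECASE_PER_XP * (double) un * (double) vn;
}

/* Estimated cycles for an n x n basecase square, counted as n*n
   crossproducts at the lower per-crossproduct figure. */
double k6_est_sqr_basecase(size_t n)
{
    return K6_SQR_BASECASE_PER_XP * (double) n * (double) n;
}
```

On these assumptions a 10x10 limb multiply comes to about 925 cycles and a
10 limb square to about 470, which is roughly the factor of two the table
suggests.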
        !            61:
        !            62:
        !            63: Prefetching of sources hasn't yet given any joy.  With the 3DNow "prefetch"
        !            64: instruction, code seems to run slower, and with just "mov" loads it doesn't
        !            65: seem faster.  Results so far are inconsistent.  The K6 does a hardware
        !            66: prefetch of the second cache line in a sector, so the penalty for not
        !            67: prefetching in software is reduced.
        !            68:
        !            69:
        !            70:
        !            71:
        !            72: NOTES
        !            73:
        !            74: All K6 family chips have MMX, but only K6-2 and K6-3 have 3DNow.
        !            75:
        !            76: Plain K6 executes MMX instructions only in the X pipe, but K6-2 and K6-3 can
        !            77: execute them in both X and Y (and together).
        !            78:
        !            79: Branch misprediction penalty is 1 to 4 cycles (Optimization Manual
        !            80: chapter 6 table 12).
        !            81:
        !            82: Write-allocate L1 data cache means prefetching of destinations is unnecessary.
        !            83: Store queue is 7 entries of 64 bits each.
        !            84:
        !            85: Floating point multiplications can be done in parallel with integer
        !            86: multiplications, but there doesn't seem to be any way to make use of this.
        !            87:
        !            88:
        !            89:
        !            90: OPTIMIZATIONS
        !            91:
        !            92: Unrolled loops are used to reduce looping overhead.  The unrolling is
        !            93: configurable up to 32 limbs/loop for most routines, up to 64 for some.
        !            94:
        !            95: Sometimes computed jumps into the unrolling are used to handle sizes not a
        !            96: multiple of the unrolling.  An attractive feature of this is that times
        !            97: increase smoothly with operand size.  But an indirect jump costs about 6
        !            98: cycles and the setup about another 6, so a computed jump is only worthwhile
        !            99: when the unrolled code beats a simple loop by more than that overhead.
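
The idea of jumping into the middle of an unrolled loop can be sketched in C
with a switch whose case labels sit inside the loop body (Duff's device).
This is only an analogy for the assembler technique, not the code used in
this directory: limb_t and and_n_unrolled are hypothetical names, the
unrolling is fixed at 4, and a compiler is free to implement the switch
however it likes.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* hypothetical stand-in for a 32-bit limb */

/* dst[i] = a[i] & b[i] for 0 <= i < n, with the loop unrolled 4 times.
   The switch plays the part of the computed jump: it enters the
   unrolled body part way through, so that the first pass handles the
   n % 4 leftover limbs and every later pass handles a full 4. */
void and_n_unrolled(limb_t *dst, const limb_t *a, const limb_t *b, size_t n)
{
    size_t passes;

    if (n == 0)
        return;

    passes = (n + 3) / 4;      /* passes through the body, rounded up */

    switch (n % 4) {
    case 0: do { *dst++ = *a++ & *b++;   /* fall through */
    case 3:      *dst++ = *a++ & *b++;   /* fall through */
    case 2:      *dst++ = *a++ & *b++;   /* fall through */
    case 1:      *dst++ = *a++ & *b++;
            } while (--passes > 0);
    }
}
```

For n = 7 the switch enters at case 3, processes 3 limbs, then makes one
full pass of 4.  No separate cleanup loop is needed, which is why times
grow smoothly with size.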
        !           100:
        !           101: Position independent code uses a call to get eip for computed jumps.  The
        !           102: call is always paired with a ret, rather than discarding the return address
        !           103: with an addl $4,%esp or a popl, so the CPU's return address branch
        !           104: prediction stack stays synchronised with the actual stack in memory.  Such a
        !           105: call however still costs 4 to 7 cycles.
        !           106:
        !           107: Branch prediction, in absence of any history, will guess forward jumps are
        !           108: not taken and backward jumps are taken.  Where possible it's arranged that
        !           109: the less likely or less important case is under a taken forward jump.
        !           110:
        !           111:
        !           112:
        !           113: MMX
        !           114:
        !           115: Putting emms or femms as late as possible in a routine seems to be fastest.
        !           116: Perhaps an emms or femms stalls until all outstanding MMX instructions have
        !           117: completed, so putting it later gives them a chance to complete on their own,
        !           118: in parallel with other operations (like register popping).
        !           119:
        !           120: The Optimization Manual chapter 5 recommends using a femms on K6-2 and K6-3
        !           121: at the start of a routine, in case it's been preceded by x87 floating point
        !           122: operations.  This isn't done because in gmp programs it's expected that x87
        !           123: floating point won't be much used and that chances are an mpn routine won't
        !           124: have been preceded by any x87 code.
        !           125:
        !           126:
        !           127:
        !           128: CODING
        !           129:
        !           130: Instructions in general code are shown paired if they can decode and execute
        !           131: together, meaning two short decode instructions with the second not
        !           132: depending on the first, only the first using the shifter, no more than one
        !           133: load, and no more than one store.
        !           134:
        !           135: K6 does some out of order execution, so the pairings aren't essential; they
        !           136: just show what slots might be available.  When decoding is the limiting
        !           137: factor things can be scheduled that might not execute until later.
        !           138:
        !           139:
        !           140:
        !           141: NOTES
        !           142:
        !           143: Code alignment
        !           144:
        !           145: - if an opcode/modrm or 0Fh/opcode/modrm crosses a cache line boundary,
        !           146:   short decode is inhibited.  The cross.pl script detects this.
        !           147:
        !           148: - loops and branch targets should be aligned to 16 bytes, or ensure at least
        !           149:   2 instructions before a 32 byte boundary.  This makes use of the 16 byte
        !           150:   cache in the BTB.
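
The boundary test in the first point can be sketched as a small C function.
This is not how cross.pl is implemented; it is a hypothetical helper that
assumes a 32 byte cache line (taken from the 32 byte boundary mentioned
above) and that the caller already knows the address and length of the
opcode/modrm or 0Fh/opcode/modrm bytes.

```c
#include <stdint.h>

#define K6_CACHE_LINE 32   /* bytes, assumed from the text above */

/* Return 1 if the len bytes starting at addr straddle a cache line
   boundary, 0 otherwise.  For the rule above len would be 2 for
   opcode/modrm and 3 for 0Fh/opcode/modrm. */
int crosses_cache_line(uint32_t addr, uint32_t len)
{
    if (len == 0)
        return 0;
    /* Compare the line holding the first byte with the line holding
       the last byte. */
    return (addr / K6_CACHE_LINE) != ((addr + len - 1) / K6_CACHE_LINE);
}
```

For example a two byte opcode/modrm at offset 30 fits within one line, but
the same pair at offset 31 crosses into the next.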
        !           151:
        !           152: Addressing modes
        !           153:
        !           154: - (%esi) degrades decoding from short to vector.  0(%esi) doesn't have this
        !           155:   problem, and can be used as an equivalent, or easier is just to use a
        !           156:   different register, like %ebx.
        !           157:
        !           158: - K6 and pre-CXT core K6-2 have the following problem.  (K6-2 CXT and K6-3
        !           159:   have it fixed, these being cpuid function 1 signatures 0x588 to 0x58F).
        !           160:
        !           161:   If more than 3 bytes are needed to determine instruction length then
        !           162:   decoding degrades from direct to long, or from long to vector.  This
        !           163:   happens with forms like "0F opcode mod/rm" with mod/rm=00-xxx-100 since
        !           164:   with mod=00 the sib determines whether there's a displacement.
        !           165:
        !           166:   This affects all MMX and 3DNow instructions, and others with an 0F prefix
        !           167:   like movzbl.  The modes affected are anything with an index and no
        !           168:   displacement, or an index but no base, and this includes (%esp) which is
        !           169:   really (,%esp,1).
        !           170:
        !           171:   The cross.pl script detects problem cases.  The workaround is to always
        !           172:   use a displacement, and to do this with Zdisp if it's zero so the
        !           173:   assembler doesn't discard it.
        !           174:
        !           175:   See Optimization Manual rev D page 67 and 3DNow Porting Guide rev B pages
        !           176:   13-14 and 36-37.
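
The problem mod/rm form described above can be recognised with a single
mask and compare.  The following C sketch is illustrative only; the function
name is hypothetical and the check looks at nothing but the mod/rm byte, so
deciding whether an instruction has an 0F prefix is left to the caller.

```c
#include <stdint.h>

/* Return 1 if modrm has the form 00-xxx-100, meaning mod=00 and
   r/m=100, so a sib byte follows and it is the sib that determines
   whether there is a displacement.  The middle reg field is ignored. */
int modrm_is_00_xxx_100(uint8_t modrm)
{
    return (modrm & 0xC7) == 0x04;   /* keep mod (bits 7-6), r/m (bits 2-0) */
}
```

As an example, "movq (%eax,%ebx,4), %mm0" encodes as 0F 6F 04 98, whose
mod/rm byte 04 matches.  Forcing a zero byte displacement gives
0F 6F 44 98 00, whose mod/rm byte 44 has mod=01 and does not match.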
        !           177:
        !           178: Calls
        !           179:
        !           180: - indirect jumps and calls are not branch predicted, they measure about 6
        !           181:   cycles.
        !           182:
        !           183: Various
        !           184:
        !           185: - adcl      2 cycles of decode, maybe 2 cycles executing in the X pipe
        !           186: - bsf       12-27 cycles
        !           187: - emms      5 cycles
        !           188: - femms     3 cycles
        !           189: - jecxz     2 cycles taken, 13 not taken (optimization manual says 7 not taken)
        !           190: - divl      20 cycles back-to-back
        !           191: - imull     2 decode, 2 execute
        !           192: - mull      2 decode, 3 execute (optimization manual decoding sample)
        !           193: - prefetch  2 cycles
        !           194: - rcll/rcrl implicit by one bit: 2 cycles
        !           195:             immediate or %cl count: 11 + 2 per bit for dword
        !           196:                                     13 + 4 per bit for byte
        !           197: - setCC     2 cycles
        !           198: - xchgl     %eax,reg  1.5 cycles, back-to-back (strange)
        !           199:             reg,reg   2 cycles, back-to-back
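
The rcll/rcrl figures above can be turned into a small cost comparison.
The C sketch below is plain arithmetic on those figures; the function names
are hypothetical, and it assumes that a rotate by several bits can be
replaced by that many by-one rotates issued back to back at 2 cycles each,
which ignores decode limits and any surrounding code.

```c
/* Cycles for a dword rcll/rcrl with an immediate or %cl count. */
unsigned rcl_dword_count_cycles(unsigned bits)
{
    return 11 + 2 * bits;
}

/* Cycles for a byte rcl/rcr with an immediate or %cl count. */
unsigned rcl_byte_count_cycles(unsigned bits)
{
    return 13 + 4 * bits;
}

/* Cycles for the same rotation done as repeated by-one rotates,
   assuming each costs the 2 cycles listed for the implicit form. */
unsigned rcl_by_one_repeated_cycles(unsigned bits)
{
    return 2 * bits;
}
```

On these figures a 3 bit dword rotate through carry costs 17 cycles with a
count but only 6 as three by-one rotates, so the by-one form looks cheaper
at every count.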
        !           200:
        !           201:
        !           202:
        !           203:
        !           204: REFERENCES
        !           205:
        !           206: "AMD-K6 Processor Code Optimization Application Note", AMD publication
        !           207: number 21924, revision D amendment 0, January 2000.  This describes K6-2 and
        !           208: K6-3.  Available on-line,
        !           209:
        !           210:        http://www.amd.com/K6/k6docs/pdf/21924.pdf
        !           211:
        !           212: "AMD-K6 MMX Enhanced Processor x86 Code Optimization Application Note", AMD
        !           213: publication number 21828, revision A amendment 0, August 1997.  This is an
        !           214: older edition of the above document, describing plain K6.  Available
        !           215: on-line,
        !           216:
        !           217:        http://www.amd.com/K6/k6docs/pdf/21828.pdf
        !           218:
        !           219: "3DNow Technology Manual", AMD publication number 21928F/0-August 1999.
        !           220: This describes the femms and prefetch instructions, but nothing else from
        !           221: 3DNow has been used.  Available on-line,
        !           222:
        !           223:        http://www.amd.com/K6/k6docs/pdf/21928.pdf
        !           224:
        !           225: "3DNow Instruction Porting Guide", AMD publication number 22621, revision B,
        !           226: August 1999.  This has some notes on general K6 optimizations as well as
        !           227: 3DNow.  Available on-line,
        !           228:
        !           229:        http://www.amd.com/products/cpg/athlon/techdocs/pdf/22621.pdf
        !           230:
        !           231:
        !           232:
        !           233: ----------------
        !           234: Local variables:
        !           235: mode: text
        !           236: fill-column: 76
        !           237: End: