Annotation of OpenXM_contrib/gmp/mpn/x86/k7/README, Revision 1.1.1.1
1.1 maekawa 1:
2: AMD K7 MPN SUBROUTINES
3:
4:
5: This directory contains code optimized for the AMD Athlon CPU.
6:
7: The mmx subdirectory has routines using MMX instructions. All Athlons have
8: MMX; the separate directory exists only so that configure can omit it if the
9: assembler doesn't support MMX.
10:
11:
12:
13: STATUS
14:
15: Times for the loops, with all code and data in L1 cache.
16:
17:                               cycles/limb
18:     mpn_add/sub_n              1.6
19:
20:     mpn_copyi                  0.75 or 1.0   \ varying with data alignment
21:     mpn_copyd                  0.75 or 1.0   /
22:
23:     mpn_divrem_1              17.0 integer part, 15.0 fractional part
24:     mpn_mod_1                 17.0
25:     mpn_divexact_by3           8.0
26:
27:     mpn_l/rshift               1.2
28:
29:     mpn_mul_1                  3.4
30:     mpn_addmul/submul_1        3.9
31:
32:     mpn_mul_basecase           4.42 cycles/crossproduct (approx)
33:
34:     mpn_popcount               5.0
35:     mpn_hamdist                6.0
36:
37: Prefetching of sources hasn't yet been tried.
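
As a rough illustration of how the cycles/limb figures in the table translate
into loop times, the following C sketch multiplies them out. The helper names
est_cycles and est_mul_basecase_cycles are hypothetical and not part of GMP,
and the estimate assumes everything is in L1 cache and ignores call overhead.

```c
#include <assert.h>

/* Hypothetical helper, not part of GMP: estimated loop cycles for an
   n-limb operation, given a cycles/limb figure from the STATUS table.
   Assumes code and data are in L1 cache and ignores call overhead. */
static double est_cycles(double cycles_per_limb, unsigned long n)
{
    assert(cycles_per_limb > 0.0);
    return cycles_per_limb * (double) n;
}

/* mpn_mul_basecase is quoted per crossproduct; an n x m limb multiply
   forms n*m crossproducts.  4.42 is the approximate figure from the table. */
static double est_mul_basecase_cycles(unsigned long n, unsigned long m)
{
    return 4.42 * (double) n * (double) m;
}
```

For example, est_cycles(1.6, 100) estimates a 100-limb mpn_add_n at about 160
cycles, and est_mul_basecase_cycles(10, 10) estimates a 10x10 limb basecase
multiply at about 442 cycles.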
38:
39:
40:
41: NOTES
42:
43: cmov, MMX, 3DNow and some extensions to MMX and 3DNow are available.
44:
45: Write-allocate L1 data cache means prefetching of destinations is unnecessary.
46:
47: Floating point multiplications can be done in parallel with integer
48: multiplications, but there doesn't seem to be any way to make use of this.
49:
50: Unsigned "mul"s can be issued every 3 cycles. This suggests 3 cycles/limb is
51: the best the multiplication routines can achieve. The documentation shows mul
52: executing in IEU0 (or maybe in IEU0 and IEU1 together), so it may be that
53: getting near 3 cycles requires arranging the code so that nothing else is
54: issued to IEU0. A busy IEU0 could explain why some code takes 4 cycles and
55: other apparently equivalent code takes 5.
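
To see why the mul issue rate bounds these routines, it may help to look at a
plain C rendering of what an mpn_mul_1 style loop does: one widening multiply
per limb. This is only a reference sketch, not the assembler code in this
directory; the names limb_t and ref_mul_1 are hypothetical, and 32-bit limbs
are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* assumption: 32-bit limbs, as on x86 */

/* Hypothetical reference routine: multiply the n-limb number at src by
   multiplier, store the low n limbs at dst and return the carry-out limb.
   Each iteration needs exactly one 32x32->64 multiply, so a CPU issuing
   one such multiply every 3 cycles cannot go below 3 cycles/limb. */
static limb_t ref_mul_1(limb_t *dst, const limb_t *src, size_t n,
                        limb_t multiplier)
{
    limb_t carry = 0;
    size_t i;

    assert(n == 0 || (dst != NULL && src != NULL));

    for (i = 0; i < n; i++) {
        /* (2^32-1)^2 + (2^32-1) < 2^64, so this cannot overflow */
        uint64_t prod = (uint64_t) src[i] * multiplier + carry;
        dst[i] = (limb_t) prod;           /* low word */
        carry = (limb_t) (prod >> 32);    /* high word feeds the next limb */
    }
    return carry;
}
```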
56:
57:
58:
59: OPTIMIZATIONS
60:
61: Unrolled loops are used to reduce looping overhead. The unrolling is
62: configurable up to 32 limbs/loop for most routines and up to 64 for some.
63: The K7 has a 64k L1 code cache, so quite big unrolling is allowable.
64:
65: Computed jumps into the unrolling are used to handle sizes not a multiple of
66: the unrolling. An attractive feature of this is that times increase
67: smoothly with operand size, but it may be that some routines should just
68: have simple loops to finish up, especially when PIC adds between 2 and 16
69: cycles to get %eip.
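
The idea of jumping into the middle of an unrolled loop can be sketched in C
with a switch that enters the loop body part-way through (the construct often
called Duff's device). This is an analogy only: the assembler routines compute
a jump address directly, the unroll factor here is 4 rather than 32 or 64, and
the names limb_t and unrolled_add_n are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* assumption: 32-bit limbs, as on x86 */

/* One limb of addition with carry propagation. */
#define ADD_STEP                                            \
    do {                                                    \
        uint64_t s = (uint64_t) a[i] + b[i] + carry;        \
        dst[i] = (limb_t) s;                                \
        carry = (limb_t) (s >> 32);                         \
        i++;                                                \
    } while (0)

/* Hypothetical sketch: dst = a + b over n limbs, returning the carry out.
   The loop body is unrolled 4 times; the switch jumps into the body so
   that the first pass handles the n % 4 leftover limbs. */
static limb_t unrolled_add_n(limb_t *dst, const limb_t *a, const limb_t *b,
                             size_t n)
{
    limb_t carry = 0;
    size_t i = 0;
    size_t passes = (n + 3) / 4;   /* trips through the unrolled body */

    if (n == 0)
        return 0;
    assert(dst != NULL && a != NULL && b != NULL);

    switch (n % 4) {
    case 0: do { ADD_STEP;   /* fall through */
    case 3:      ADD_STEP;   /* fall through */
    case 2:      ADD_STEP;   /* fall through */
    case 1:      ADD_STEP;
            } while (--passes > 0);
    }
    return carry;
}

#undef ADD_STEP
```

Because the entry point absorbs the leftover limbs, there is no separate
clean-up loop, which is why times grow smoothly with operand size.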
70:
71: Position independent code is implemented using a call to get %eip for the
72: computed jumps. The call is always matched by a ret, rather than an addl
73: $4,%esp or a popl, so that the CPU's return address branch prediction stack
74: stays synchronised with the actual stack in memory.
75:
76: Branch prediction, in the absence of any history, will guess forward jumps
77: are not taken and backward jumps are taken. Where possible it's arranged
78: that the less likely or less important case is under a taken forward jump.
79:
80:
81:
82: CODING
83:
84: Instructions in general code have been shown grouped if they can execute
85: together, which means up to three direct-path instructions which have no
86: successive dependencies. K7 always decodes three and has out-of-order
87: execution, but the groupings show what slots might be available and what
88: dependency chains exist.
89:
90: When there are vector-path instructions an effort is made to get triplets
91: of direct-path instructions in between them, even if there are dependencies,
92: since this maximizes decoding throughput and might save a cycle or two if
93: decoding is the limiting factor.
94:
95:
96:
97: INSTRUCTIONS
98:
99:  adcl       direct
100: divl       39 cycles back-to-back
101: lodsl,etc  vector
102: loop       1 cycle vector (decl/jnz opens up one decode slot)
103: movd reg   vector
104: movd mem   direct
105: mull       issue every 3 cycles, latency 4 cycles low word, 6 cycles high word
106: popl       vector (use movl for more than one pop)
107: pushl      direct, will pair with a load
108: shrdl %cl  vector, 3 cycles, seems to be 3 decode too
109: xorl r,r   false read dependency recognised
110:
111:
112:
113: REFERENCES
114:
115: "AMD Athlon Processor X86 Code Optimization Guide", AMD publication number
116: 22007, revision E, November 1999. Available on-line,
117:
118: http://www.amd.com/products/cpg/athlon/techdocs/pdf/22007.pdf
119:
120: "3DNow Technology Manual", AMD publication number 21928F/0-August 1999.
121: This describes the femms and prefetch instructions. Available on-line,
122:
123: http://www.amd.com/K6/k6docs/pdf/21928.pdf
124:
125: "AMD Extensions to the 3DNow and MMX Instruction Sets Manual", AMD
126: publication number 22466, revision B, August 1999. This describes
127: instructions added in the Athlon processor, such as pswapd and the extra
128: prefetch forms. Available on-line,
129:
130: http://www.amd.com/products/cpg/athlon/techdocs/pdf/22466.pdf
131:
132: "3DNow Instruction Porting Guide", AMD publication number 22621, revision B,
133: August 1999. This has some notes on general Athlon optimizations as well as
134: 3DNow. Available on-line,
135:
136: http://www.amd.com/products/cpg/athlon/techdocs/pdf/22621.pdf
137:
138:
139:
140:
141: ----------------
142: Local variables:
143: mode: text
144: fill-column: 76
145: End: