Annotation of OpenXM_contrib/gmp/mpn/x86/k7/README, Revision 1.1.1.2
Copyright 2000, 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 2.1 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library; see the file COPYING.LIB. If not, write to
the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
02111-1307, USA.




                      AMD K7 MPN SUBROUTINES


This directory contains code optimized for the AMD Athlon CPU.

The mmx subdirectory has routines using MMX instructions. All Athlons have
MMX; the separate directory exists only so that configure can omit it if
the assembler doesn't support MMX.



STATUS

Times for the loops, with all code and data in L1 cache.

                                cycles/limb
        mpn_add/sub_n           1.6

        mpn_copyi               0.75 or 1.0  \ varying with data alignment
        mpn_copyd               0.75 or 1.0  /

        mpn_divrem_1            17.0 integer part, 15.0 fractional part
        mpn_mod_1               17.0
        mpn_divexact_by3        8.0

        mpn_l/rshift            1.2

        mpn_mul_1               3.4
        mpn_addmul/submul_1     3.9

        mpn_mul_basecase        4.42 cycles/crossproduct (approx)
        mpn_sqr_basecase        2.3 cycles/crossproduct (approx)
                                or 4.55 cycles/triangleproduct (approx)

Prefetching of sources hasn't yet been tried.
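
As a rough illustration of how the basecase figures in the table might be
used, the sketch below turns them into inner-loop cycle estimates. It
assumes that a "crossproduct" is one limb-by-limb product, so an xsize by
ysize multiply has xsize*ysize of them and a square of size limbs is counted
as size*size. The function and macro names are hypothetical and are not part
of GMP; call overhead and cache effects are ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Figures copied from the STATUS table (cycles per crossproduct). */
#define K7_MUL_BASECASE_CPX 4.42
#define K7_SQR_BASECASE_CPX 2.3

/* Hypothetical helper: estimated loop cycles for an xsize by ysize
   basecase multiply, assuming xsize*ysize crossproducts. */
double k7_mul_basecase_cycles(size_t xsize, size_t ysize)
{
    assert(xsize >= ysize && ysize >= 1);
    return K7_MUL_BASECASE_CPX * (double) xsize * (double) ysize;
}

/* Hypothetical helper: estimated loop cycles for squaring size limbs,
   assuming the table counts size*size crossproducts. */
double k7_sqr_basecase_cycles(size_t size)
{
    assert(size >= 1);
    return K7_SQR_BASECASE_CPX * (double) size * (double) size;
}
```

Under these assumptions a 10 by 10 limb multiply comes to about 442 cycles
and a 10 limb square to about 230, roughly half as much.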



NOTES

cmov, MMX, 3DNow and some extensions to MMX and 3DNow are available.

The L1 data cache is write-allocate, so prefetching of destinations is
unnecessary.

Floating point multiplications can be done in parallel with integer
multiplications, but there doesn't seem to be any way to make use of this.

Unsigned "mul"s can be issued every 3 cycles. This suggests 3 cycles per
multiply is a limit on the speed of the multiplication routines. The
documentation shows mul executing in IEU0 (or maybe in IEU0 and IEU1
together), so it might be that, to get near 3 cycles, code has to be
arranged so that nothing else is issued to IEU0. A busy IEU0 could explain
why some code takes 4 cycles and other apparently equivalent code takes 5.
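
To see why the mul issue rate bounds these routines, it may help to look at
what an addmul_1 style loop has to do per limb. The following is a portable
C reference sketch, not the K7 assembler code: it assumes 32-bit limbs and
the name ref_addmul_1 is hypothetical. Each iteration needs exactly one
32x32->64 multiply, so with one mul issued every 3 cycles the loop cannot go
below 3 cycles/limb, which is consistent with the 3.4 and 3.9 measured for
mpn_mul_1 and mpn_addmul/submul_1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference sketch (hypothetical name, 32-bit limbs assumed):
   dst[0..n-1] += src[0..n-1] * v, returning the carry-out limb. */
uint32_t ref_addmul_1(uint32_t *dst, const uint32_t *src, size_t n,
                      uint32_t v)
{
    uint32_t carry = 0;
    assert(n >= 1);
    for (size_t i = 0; i < n; i++) {
        /* One multiply per limb. The sum cannot overflow 64 bits:
           (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1. */
        uint64_t t = (uint64_t) src[i] * v + dst[i] + carry;
        dst[i] = (uint32_t) t;          /* low word goes to the result */
        carry = (uint32_t) (t >> 32);   /* high word carries onward */
    }
    return carry;
}
```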



OPTIMIZATIONS

Unrolled loops are used to reduce looping overhead. The unrolling is
configurable up to 32 limbs/loop for most routines and up to 64 for some.
The K7 has a 64k L1 code cache, so quite heavy unrolling is affordable.

Computed jumps into the unrolling are used to handle sizes not a multiple of
the unrolling. An attractive feature of this is that times increase
smoothly with operand size, but it may be that some routines should just
have simple loops to finish up, especially when PIC adds between 2 and 16
cycles to get %eip.
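
The idea of jumping into the middle of an unrolled loop can be sketched in C
with a switch whose case labels sit inside the loop body (the construct
usually called Duff's device). This is only an analogue of what the
assembler routines do with a computed jump: the unrolling factor of 4, the
32-bit limb type and the name copyi_unrolled4 are assumptions made for the
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: copy n limbs, low to high, with the loop unrolled
   4 times.  The switch enters the unrolled body part way through, so the
   first pass handles the n % 4 leftover limbs and every later pass
   handles a full 4. */
void copyi_unrolled4(uint32_t *dst, const uint32_t *src, size_t n)
{
    size_t passes;

    assert(dst != NULL && src != NULL);
    if (n == 0)
        return;

    passes = (n + 3) / 4;       /* full passes plus one partial pass */
    switch (n % 4) {
    case 0: do { *dst++ = *src++;
    case 3:      *dst++ = *src++;
    case 2:      *dst++ = *src++;
    case 1:      *dst++ = *src++;
            } while (--passes > 0);
    }
}
```

Because the entry point absorbs the leftover limbs, there is no separate
clean-up loop, which is why the time grows smoothly with the operand size.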

Position independent code is implemented using a call to get %eip for the
computed jumps. The called code always returns with a ret, rather than
discarding the return address with an addl $4,%esp or a popl, so the CPU
return address branch prediction stack stays synchronised with the actual
stack in memory.

Branch prediction, in the absence of any history, will guess forward jumps
are not taken and backward jumps are taken. Where possible it's arranged
that the less likely or less important case is under a taken forward jump.
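
In C the layout of branches is up to the compiler, but the same intent can
be expressed as a hint. The sketch below assumes GCC or Clang, whose
__builtin_expect marks a condition as unlikely so the compiler may place
that path out of the fall-through line. The UNLIKELY macro and the
add_1_sketch function are hypothetical names, and 32-bit limbs are assumed.
Here the rare case is a carry rippling past the lowest limb.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumes GCC or Clang: hint that x is expected to be false. */
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Hypothetical sketch: dst[0..n-1] = src[0..n-1] + v, returning the
   carry out.  A carry beyond the lowest limb is treated as unlikely. */
uint32_t add_1_sketch(uint32_t *dst, const uint32_t *src, size_t n,
                      uint32_t v)
{
    uint32_t carry;
    size_t i = 1;

    assert(n >= 1);
    dst[0] = src[0] + v;
    carry = (dst[0] < v);       /* wrapped around, so carry out */

    /* Unlikely path: ripple the carry through the higher limbs. */
    while (UNLIKELY(carry) && i < n) {
        uint32_t s = src[i] + 1;
        dst[i] = s;
        carry = (s == 0);
        i++;
    }

    /* Likely path: the remaining limbs are copied unchanged. */
    for (; i < n; i++)
        dst[i] = src[i];

    return carry;
}
```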



CODING

Instructions in general code are shown grouped where they can execute
together, which means up to three direct-path instructions with no
successive dependencies. K7 always decodes three and has out-of-order
execution, but the groupings show what slots might be available and what
dependency chains exist.

Where there are vector-path instructions an effort is made to get triplets
of direct-path instructions in between them, even if there are dependencies,
since this maximizes decoding throughput and might save a cycle or two if
decoding is the limiting factor.



INSTRUCTIONS

adcl               direct
divl               39 cycles back-to-back
lodsl,etc          vector
loop               1 cycle vector (decl/jnz opens up one decode slot)
movd reg           vector
movd mem           direct
mull               issue every 3 cycles, latency 4 cycles low word,
                   6 cycles high word
popl               vector (use movl for more than one pop)
pushl              direct, will pair with a load
shrdl %cl          vector, 3 cycles, seems to be 3 decode too
xorl r,r           false read dependency recognised



REFERENCES

"AMD Athlon Processor X86 Code Optimization Guide", AMD publication number
22007, revision E, November 1999. Available on-line,

        http://www.amd.com/products/cpg/athlon/techdocs/pdf/22007.pdf

"3DNow Technology Manual", AMD publication number 21928F/0-August 1999.
This describes the femms and prefetch instructions. Available on-line,

        http://www.amd.com/K6/k6docs/pdf/21928.pdf

"AMD Extensions to the 3DNow and MMX Instruction Sets Manual", AMD
publication number 22466, revision B, August 1999. This describes
instructions added in the Athlon processor, such as pswapd and the extra
prefetch forms. Available on-line,

        http://www.amd.com/products/cpg/athlon/techdocs/pdf/22466.pdf

"3DNow Instruction Porting Guide", AMD publication number 22621, revision B,
August 1999. This has some notes on general Athlon optimizations as well as
3DNow. Available on-line,

        http://www.amd.com/products/cpg/athlon/techdocs/pdf/22621.pdf




----------------
Local variables:
mode: text
fill-column: 76
End: