Copyright 2000, 2001, 2002 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 2.1 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
02111-1307, USA.


The IA-64 ISA groups instructions three at a time into 128-bit bundles.
Programmers/compilers need to insert explicit stops `;;' where there are
WAW or RAW dependencies.  Such stops can typically only be placed at the
end of a bundle, with some exceptions.

The Itanium and Itanium 2 implementations can under ideal conditions
execute two bundles per cycle.  The Itanium allows 4 of these 6
instructions to be integer operations, while the Itanium 2 allows all
6 to be integer operations.

Taken cloop branches seem to insert a bubble into the pipeline most of
the time.

Loads to the fp registers bypass the L1 cache and thus get extremely
long latencies, 9 cycles on the Itanium and 7 cycles on the Itanium 2.

The software pipelining support built around the br.ctop instruction
causes delays, since many issue slots are taken up by instructions with
zero predicates, and since many extra instructions are needed to set
things up.  These features are designed for code density, not maximum
speed.

Misc pipeline limitations (Itanium):
* The getf.sig instruction can only execute in M0.
* At most four integer instructions/cycle.
* Nops take up resources like any other instruction.

================================================================
mpn_add_n, mpn_sub_n:

The current code runs at 3 cycles/limb.  Unrolling could clearly bring
down the time to 2 cycles/limb.

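For reference, here is the operation these routines compute, as plain C.
This is only a sketch; the function name is ours, the types come from
gmp.h, and the assembly keeps the carry in a predicate register rather
than a general register.

#include <gmp.h>

/* rp[] = up[] + vp[], returning the carry out.  (mpn_sub_n is the
   same with a borrow instead of a carry.)  */
mp_limb_t
ref_add_n (mp_ptr rp, mp_srcptr up, mp_srcptr vp, mp_size_t n)
{
  mp_limb_t cy = 0;
  mp_size_t i;
  for (i = 0; i < n; i++)
    {
      mp_limb_t ul = up[i];
      mp_limb_t sl = ul + vp[i] + cy;
      /* The sum wrapped iff it came out not larger than ul, taking
         the incoming carry into account.  */
      cy = cy ? (sl <= ul) : (sl < ul);
      rp[i] = sl;
    }
  return cy;
}
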
================================================================
mpn_addmul_1:

The current code runs at 3.7 cycles/limb, but that somewhat odd timing
is reached only for huge operands.  It uses the mod-scheduled software
pipelining feature.  The reason for the poor speed for small operands
is that mod-scheduled loops have a very long start-up overhead.
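
For reference, the operation computed is rp[] += up[] * vl with the
high limb returned.  A plain C sketch (our function name; assumes
64-bit limbs and a compiler with a 128-bit integer type):

#include <gmp.h>

mp_limb_t
ref_addmul_1 (mp_ptr rp, mp_srcptr up, mp_size_t n, mp_limb_t vl)
{
  mp_limb_t cl = 0;
  mp_size_t i;
  for (i = 0; i < n; i++)
    {
      /* One 128-bit multiply-accumulate step; the assembly splits
         this across an xma.l/xma.hu pair and an add/cmp chain.  */
      unsigned __int128 t = (unsigned __int128) up[i] * vl + rp[i] + cl;
      rp[i] = (mp_limb_t) t;       /* low product limb */
      cl = (mp_limb_t) (t >> 64);  /* high limb carries into the next */
    }
  return cl;
}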

For optimal speed, we need to load two 64-bit limbs with the ldfp8
instruction, and stay away from mod-scheduled loops.  Since rp and up
might be mutually aligned in two ways, we will need two loop variants,
with the same basic structure:

{ .mfi    getf.sig
          xma.l
    (p6)  cmp.leu p6, p7 =
} { .mfi  stf8
          xma.hu
    (p7)  cmp.ltu p6, p7 =
          ;;
} { .mib  getf.sig
    (p6)  add 1
          nop.b
} { .mib  ldfp8 = [up], 16
    (p7)  add
          nop.b
          ;;
} { .mfi  getf.sig
          xma.l
    (p6)  cmp.leu p6, p7 =
} { .mfi  stf8
          xma.hu
    (p7)  cmp.ltu p6, p7 =
          ;;
} { .mib  getf.sig
    (p6)  add 1
          nop.b
} { .mib  ldfp8 = [rp], 16
    (p7)  add
          br.cloop
          ;;
}

2 limbs/20 instructions
20 insn/max 6 insn/cycle:      3.3 cycles/2limb
 8 memops/max 2 memops/cycle:  4.0 cycles/2limb
 8 intops/max 2 intops/cycle:  4.0 cycles/2limb
 4 fpops/max 2 fpops/cycle:    2.0 cycles/2limb

================================================================
mpn_submul_1:

The current code just calls mpn_mul_1 and mpn_sub_n and thus needs
about 7 cycles/limb.
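
That composition looks like this in C (a sketch; the helper name and
the caller-supplied scratch buffer tp[] of n limbs are our invention):

#include <gmp.h>

mp_limb_t
submul_1_via_mul_sub (mp_ptr rp, mp_srcptr up, mp_size_t n,
                      mp_limb_t vl, mp_ptr tp)
{
  mp_limb_t hi = mpn_mul_1 (tp, up, n, vl);  /* tp[] = up[] * vl */
  mp_limb_t bw = mpn_sub_n (rp, rp, tp, n);  /* rp[] -= tp[], borrow bw */
  return hi + bw;  /* hi <= vl - 1, so this cannot wrap */
}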

We could implement this much like mpn_addmul_1, if we first negate v.
With vln = -vl, i.e., 2^64 - vl for nonzero vl, we have

  ul * vln + x  =  ul * 2^64 + (x - ul*vl)

so the low limb of the xma result is exactly the wanted result limb,
and the borrow falls out of the high limb as ul - phi.  This should
allow us to use the accumulation of xma.  (A zero vl would need
special handling, since -vl is then not 2^64 - vl.)  Here is how it
works:

/* Set ph:pl = m0 * m1 + a.  */
#define umul_ppmma(ph, pl, m0, m1, a) \
  do { \
    UDItype __m0 = (m0), __m1 = (m1), __a = (a); \
    __asm__ ("xma.hu %0 = %1, %2, %3" \
             : "=f" (ph) \
             : "f" (__m0), "f" (__m1), "f" (__a)); \
    (pl) = __m0 * __m1 + __a; \
  } while (0)

mp_limb_t
mpn_submul_1 (mp_ptr rp, mp_srcptr up, mp_size_t n, mp_limb_t vl)
{
  mp_limb_t cl, cy;
  mp_size_t i;
  mp_limb_t phi, plo;
  mp_limb_t x;
  mp_limb_t ul, vln;

  vln = -vl;                    /* negated multiplier */

  cl = 0;
  for (i = n; i != 0; i--)
    {
      ul = *up++;               /* will need this in both fregs and gregs */
      x = *rp;
      /* phi:plo = ul * vln + x = ul*2^64 + (x - ul*vl) */
      umul_ppmma (phi, plo, ul, vln, x);

      /* subtract the incoming borrow from the low limb */
      cy = plo < cl;
      plo -= cl;

      /* outgoing borrow: ul - phi, plus any borrow from above */
      cl = ul - phi;
      cl += cy;

      *rp++ = plo;
    }

  return cl;
}

================================================================
mpn_mul_1:

The current code runs at 3.7 cycles/limb.  The code is very similar to
the mpn_addmul_1 code.  See comments above.
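
For reference, in plain C this computes rp[] = up[] * vl and returns
the high limb (same hedges as the ref_addmul_1 sketch above):

#include <gmp.h>

mp_limb_t
ref_mul_1 (mp_ptr rp, mp_srcptr up, mp_size_t n, mp_limb_t vl)
{
  mp_limb_t cl = 0;
  mp_size_t i;
  for (i = 0; i < n; i++)
    {
      unsigned __int128 t = (unsigned __int128) up[i] * vl + cl;
      rp[i] = (mp_limb_t) t;       /* low product limb */
      cl = (mp_limb_t) (t >> 64);  /* high limb carries into the next */
    }
  return cl;
}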

Faster code wouldn't be too hard to write.  This is one possible
pattern:

{ .mfi    getf.sig
          xma.l
    (p6)  cmp.leu p6, p7 =
} { .mfi  stf8
          xma.hu
    (p7)  cmp.ltu p6, p7 =
          ;;
} { .mib  getf.sig
    (p6)  add 1
          nop.b
} { .mib  ldf8 = [up], 8
    (p7)  add
          br.cloop
          ;;
}

1 limb/12 instructions
12 insn/max 6 insn/cycle:     2.0 cycles/limb
 4 memops/max 2 memops/cycle: 2.0 cycles/limb
 4 intops/max 2 intops/cycle: 2.0 cycles/limb
 2 fpops/max 2 fpops/cycle:   1.0 cycles/limb

================================================================
mpn_mul_8:

The add+cmp+add we use in the other code is optimal for shortening
recurrences (2 cycles), but the sequence takes up 4 execution slots.
When recurrence depth is not critical, a more standard 3-cycle
add+cmp+add is better.
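
In C, one such carry-propagating step looks like this (a sketch; the
helper name is ours, and the real code would keep the carry in a
predicate register):

#include <gmp.h>

/* r = a + b + cy_in, with the carry out in *cy_out.  */
static inline mp_limb_t
add_cc (mp_limb_t a, mp_limb_t b, mp_limb_t cy_in, mp_limb_t *cy_out)
{
  mp_limb_t s = a + b;
  mp_limb_t c = s < a;   /* cmp: carry from the first add */
  s += cy_in;            /* second add folds in the old carry */
  c += s < cy_in;        /* ...which may itself carry out */
  *cy_out = c;           /* at most one of the two carries fires */
  return s;
}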

/* First load the 8 values from v */
        ldfp8   v0, v1 = [r35], 16;;
        ldfp8   v2, v3 = [r35], 16;;
        ldfp8   v4, v5 = [r35], 16;;
        ldfp8   v6, v7 = [r35], 16;;

/* In the inner loop, get a new U limb and store a result limb.  */
        mov     lc = un
Loop:   ldf8    u0 = [r33], 8
        xma.l   lp0 = v0, u0, hp0
        xma.hu  hp0 = v0, u0, hp0
        xma.l   lp1 = v1, u0, hp1
        xma.hu  hp1 = v1, u0, hp1
        xma.l   lp2 = v2, u0, hp2
        xma.hu  hp2 = v2, u0, hp2
        xma.l   lp3 = v3, u0, hp3
        xma.hu  hp3 = v3, u0, hp3
        xma.l   lp4 = v4, u0, hp4
        xma.hu  hp4 = v4, u0, hp4
        xma.l   lp5 = v5, u0, hp5
        xma.hu  hp5 = v5, u0, hp5
        xma.l   lp6 = v6, u0, hp6
        xma.hu  hp6 = v6, u0, hp6
        xma.l   lp7 = v7, u0, hp7
        xma.hu  hp7 = v7, u0, hp7
        getf.sig l0 = lp0
        getf.sig l1 = lp1
        getf.sig l2 = lp2
        getf.sig l3 = lp3
        getf.sig l4 = lp4
        getf.sig l5 = lp5
        getf.sig l6 = lp6
        getf.sig l7 = lp7
        add+cmp+add l0, l0, h0
        add+cmp+add l1, l1, h1
        add+cmp+add l2, l2, h2
        add+cmp+add l3, l3, h3
        add+cmp+add l4, l4, h4
        add+cmp+add l5, l5, h5
        add+cmp+add l6, l6, h6
        add+cmp+add l7, l7, h7
        st8     [r32] = xx, 8
        br.cloop Loop

50 insn at max 6 insn/cycle:         8.33 cycles/limb8
10 memops at max 2 memops/cycle:     5 cycles/limb8
16 fpops at max 2 fpops/cycle:       8 cycles/limb8
24 intops at max 4 intops/cycle:     6 cycles/limb8
10+24 memops+intops at max 4/cycle:  8.5 cycles/limb8
                                     1.0625 cycles/limb

================================================================
mpn_lshift, mpn_rshift:

The current code runs at 2 cycles/limb, but has an overly deep
software pipeline.  The code suffers badly from the 4-cycle latency of
the variable shift instructions.

Using 63 separate loops, we could use the double-word SHRP
instruction.  That instruction has a plain single-cycle latency.  We
need 63 loops since this instruction only accepts an immediate count.
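
For reference, what each shrp-based mpn_lshift iteration computes, in
plain C (a sketch with our function name; assumes 64-bit limbs,
n >= 1, and 0 < cnt < 64):

#include <gmp.h>

mp_limb_t
ref_lshift (mp_ptr rp, mp_srcptr up, mp_size_t n, unsigned cnt)
{
  mp_limb_t retval = up[n - 1] >> (64 - cnt);  /* bits shifted out */
  mp_size_t i;
  for (i = n - 1; i > 0; i--)
    /* one shrp: a 128-bit funnel shift of two adjacent limbs */
    rp[i] = (up[i] << cnt) | (up[i - 1] >> (64 - cnt));
  rp[0] = up[0] << cnt;
  return retval;
}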