Annotation of OpenXM_contrib/gmp/mpn/cray/README, Revision 1.1.1.2
1.1.1.2 ! ohara 1: Copyright 2000, 2001, 2002 Free Software Foundation, Inc.
1.1 maekawa 2:
1.1.1.2 ! ohara 3: This file is part of the GNU MP Library.
1.1 maekawa 4:
1.1.1.2 ! ohara 5: The GNU MP Library is free software; you can redistribute it and/or modify
! 6: it under the terms of the GNU Lesser General Public License as published by
! 7: the Free Software Foundation; either version 2.1 of the License, or (at your
! 8: option) any later version.
1.1 maekawa 9:
1.1.1.2 ! ohara 10: The GNU MP Library is distributed in the hope that it will be useful, but
! 11: WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
! 12: or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public
! 13: License for more details.
! 14:
! 15: You should have received a copy of the GNU Lesser General Public License
! 16: along with the GNU MP Library; see the file COPYING.LIB. If not, write to
! 17: the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
! 18: 02111-1307, USA.
! 19:
! 20:
! 21:
! 22:
! 23:
! 24:
! 25: The code in this directory works for Cray vector systems such as C90,
! 26: J90, T90 (both the CFP variant and the IEEE variant) and SV1. (For
! 27: the T3E and T3D systems, see the `alpha' subdirectory at the same
! 28: level as the directory containing this file.)
! 29:
! 30: The cfp subdirectory is for systems utilizing the traditional Cray
! 31: floating-point format, and the ieee subdirectory is for the newer
! 32: systems that use the IEEE floating-point format.
! 33:
! 34: There are several issues that reduces speed on Cray systems. For
! 35: systems with cfp floating point, the main obstacle is the forming of
! 36: 128-bit products. For IEEE systems, adding, and in particular
! 37: computing carry is the main issue. There are no vectorizing
! 38: unsigned-less-than instructions, and the sequence that implement that
! 39: opetration is very long.
! 40:
! 41: Shifting is the only operation that is simple to make fast. All Cray
! 42: systems have a bitblt instructions (Vi Vj,Vj<Ak and Vi Vj,Vj>Ak) that
! 43: should be really useful.
! 44:
! 45: For best speed for cfp systems, we need a mul_basecase, since that
! 46: reduces the need for carry propagation to a minimum. Depending on the
! 47: size (vn) of the smaller of the two operands (V), we should split U and V
! 48: in different chunk sizes:
! 49:
! 50: U split in 2 32-bit parts
! 51: V split according to the table:
! 52: parts 4 5 6 7 8
! 53: bits/part 16 13 11 10 8
! 54: max allowed vn 1 8 32 64 256
! 55: number of multiplies 8 10 12 14 16
! 56: peak cycles/limb 4 5 6 7 8
! 57:
! 58: U split in 3 22-bit parts
! 59: V split according to the table:
! 60: parts 3 4 5
! 61: bits/part 22 16 13
! 62: max allowed vn 16 1024 8192
! 63: number of multiplies 9 12 15
! 64: peak cycles/limb 4.5 6 7.5
! 65:
! 66: U split in 4 16-bit parts
! 67: V split according to the table:
! 68: parts 4
! 69: bits/part 16
! 70: max allowed vn 65536
! 71: number of multiplies 16
! 72: peak cycles/limb 8
! 73:
! 74: (A T90 CPU can accumulate two products per cycle.)
! 75:
! 76: IDEA:
! 77: * Rewrite mpn_add_n:
! 78: short cy[n + 1];
! 79: #pragma _CRI ivdep
! 80: for (i = 0; i < n; i++)
! 81: { s = up[i] + vp[i];
! 82: rp[i] = s;
! 83: cy[i + 1] = s < up[i]; }
! 84: more_carries = 0;
! 85: #pragma _CRI ivdep
! 86: for (i = 1; i < n; i++)
! 87: { s = rp[i] + cy[i];
! 88: rp[i] = s;
! 89: more_carries += s < cy[i]; }
! 90: cys = 0;
! 91: if (more_carries)
! 92: {
! 93: cys = rp[1] < cy[1];
! 94: for (i = 2; i < n; i++)
! 95: { rp[i] += cys;
! 96: cys = rp[i] < cys; }
! 97: }
! 98: return cys + cy[n];
! 99:
! 100: * Write mpn_add3_n for adding three operands. First add operands 1
! 101: and 2, and generate cy[]. Then add operand 3 to the partial result,
! 102: and accumulate carry into cy[]. Finally propagate carry just like
! 103: in the new mpn_add_n.
! 104:
! 105: IDEA:
! 106:
! 107: Store fewer bits, perhaps 62, per limb. That brings mpn_add_n time
! 108: down to 2.5 cycles/limb and mpn_addmul_1 times to 4 cycles/limb. By
! 109: storing even fewer bits per limb, perhaps 56, it would be possible to
! 110: write a mul_mul_basecase that would run at effectively 1 cycle/limb.
! 111: (Use VM here to better handle the romb-shaped multiply area, perhaps
! 112: rouding operand sizes up to the next power of 2.)
FreeBSD-CVSweb <freebsd-cvsweb@FreeBSD.org>