OpenXM_contrib/gmp/mpn/cray/README - diff

Diff for /OpenXM_contrib/gmp/mpn/cray/Attic/README between version 1.1.1.1 and 1.1.1.2

-version 1.1.1.1, 2000/09/09 14:12:22
+version 1.1.1.2, 2003/08/25 16:06:18
 Line 1
- The (poorly optimized) code in this directory was originally written for a
+ Copyright 2000, 2001, 2002 Free Software Foundation, Inc.
- j90 system, but finished on a c90.  It should work on all Cray vector
- computers.  For the T3E and T3D systems, the `alpha' subdirectory at the
- same level as the directory containing this file, is much better.
- * `+' seems to be faster than `|' when combining carries.
+ This file is part of the GNU MP Library.
- * It is possible that the best multiply performance would be achieved by
+ The GNU MP Library is free software; you can redistribute it and/or modify
-   storing only 24 bits per element, and using lazy carry propagation.  Before
+ it under the terms of the GNU Lesser General Public License as published by
-   calling i24mult, full carry propagation would be needed.
+ the Free Software Foundation; either version 2.1 of the License, or (at your
+ option) any later version.
- * Supply tasking versions of the C loops.
+ The GNU MP Library is distributed in the hope that it will be useful, but
+ WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+ or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
+ License for more details.
+ You should have received a copy of the GNU Lesser General Public License
+ along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
+ the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
+ 02111-1307, USA.
+ The code in this directory works for Cray vector systems such as C90,
+ J90, T90 (both the CFP variant and the IEEE variant) and SV1.  (For
+ the T3E and T3D systems, see the `alpha' subdirectory at the same
+ level as the directory containing this file.)
+ The cfp subdirectory is for systems utilizing the traditional Cray
+ floating-point format, and the ieee subdirectory is for the newer
+ systems that use the IEEE floating-point format.
+ There are several issues that reduce speed on Cray systems.  For
+ systems with cfp floating point, the main obstacle is the forming of
+ 128-bit products.  For IEEE systems, adding, and in particular
+ computing carry, is the main issue.  There are no vectorizing
+ unsigned-less-than instructions, and the sequence that implements that
+ operation is very long.
+ Shifting is the only operation that is simple to make fast.  All Cray
+ systems have bitblt instructions (Vi Vj,Vj<Ak and Vi Vj,Vj>Ak) that
+ should be really useful.
+ For best speed for cfp systems, we need a mul_basecase, since that
+ reduces the need for carry propagation to a minimum.  Depending on the
+ size (vn) of the smaller of the two operands (V), we should split U and V
+ in different chunk sizes:
+ U split in 2 32-bit parts
+ V split according to the table:
+ parts                   4       5       6       7       8
+ bits/part               16      13      11      10      8
+ max allowed vn          1       8       32      64      256
+ number of multiplies    8       10      12      14      16
+ peak cycles/limb        4       5       6       7       8
+ U split in 3 22-bit parts
+ V split according to the table:
+ parts                   3       4       5
+ bits/part               22      16      13
+ max allowed vn          16      1024    8192
+ number of multiplies    9       12      15
+ peak cycles/limb        4.5     6       7.5
+ U split in 4 16-bit parts
+ V split according to the table:
+ parts                   4
+ bits/part               16
+ max allowed vn          65536
+ number of multiplies    16
+ peak cycles/limb        8
+ (A T90 CPU can accumulate two products per cycle.)
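
The "max allowed vn" figures in the tables above are consistent with
accumulating partial products exactly in 48 bits, the mantissa width of the
traditional Cray floating-point format: each product needs ubits + vbits
bits, which leaves 48 - ubits - vbits bits of headroom for summing vn of
them.  The sketch below reproduces the table rows under that assumption.
ACC_BITS and the function names are illustrative and do not appear in the
README; the factor of two in peak_cycles comes from the T90 note above.

```c
#include <assert.h>

/* Assumption: products are summed exactly in a 48-bit mantissa.  This is
   inferred from the table values, not stated in the README. */
#define ACC_BITS 48

/* Largest vn such that vn products of (ubits + vbits) bits each can be
   accumulated without exceeding ACC_BITS. */
unsigned long max_vn(int ubits, int vbits)
{
  assert(ubits > 0 && vbits > 0 && ubits + vbits <= ACC_BITS);
  return 1UL << (ACC_BITS - ubits - vbits);
}

/* Multiplies needed per limb product: every part of U meets every part
   of V once. */
int multiplies(int uparts, int vparts)
{
  assert(uparts > 0 && vparts > 0);
  return uparts * vparts;
}

/* Peak cycles per limb when two products are accumulated per cycle. */
double peak_cycles(int uparts, int vparts)
{
  return multiplies(uparts, vparts) / 2.0;
}
```

For example, max_vn(22, 16) gives 1024 and peak_cycles(3, 4) gives 6,
matching the second table.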
+ IDEA:
+ * Rewrite mpn_add_n:
+     short cy[n + 1];
+     #pragma _CRI ivdep
+       for (i = 0; i < n; i++)
+         { s = up[i] + vp[i];
+           rp[i] = s;
+           cy[i + 1] = s < up[i]; }
+       more_carries = 0;
+     #pragma _CRI ivdep
+       for (i = 1; i < n; i++)
+         { s = rp[i] + cy[i];
+           rp[i] = s;
+           more_carries += s < cy[i]; }
+       cys = 0;
+       if (more_carries)
+         {
+           cys = rp[1] < cy[1];
+           for (i = 2; i < n; i++)
+             { rp[i] += cys;
+               cys = rp[i] < cys; }
+         }
+       return cys + cy[n];
+ * Write mpn_add3_n for adding three operands.  First add operands 1
+   and 2, and generate cy[].  Then add operand 3 to the partial result,
+   and accumulate carry into cy[].  Finally propagate carry just like
+   in the new mpn_add_n.
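
The mpn_add_n idea above can be sketched as a self-contained C function.
This is a portable illustration, not the library's code: limb_t and
add_n_sketch are names made up here, the Cray ivdep pragmas are left out,
and the carry array is heap-allocated for simplicity.  In this version the
final scalar loop also checks whether the second pass wrapped at each
position, so that such a carry is passed on to the next limb; that check
is an addition of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t limb_t;  /* illustrative stand-in for GMP's mp_limb_t */

/* Set rp[0..n-1] = up[] + vp[] for n >= 1 and return the carry out (0 or 1).
   rp may be the same array as up or vp. */
limb_t add_n_sketch(limb_t *rp, const limb_t *up, const limb_t *vp, size_t n)
{
  unsigned char *cy = malloc(n + 1);  /* cy[i]: first-pass carry into limb i */
  limb_t s, u, cys = 0;
  size_t i, more_carries = 0;

  assert(n >= 1 && cy != NULL);
  cy[0] = 0;

  /* Pass 1: independent limb additions, recording each carry out.
     No iteration depends on another, so the loop can vectorize. */
  for (i = 0; i < n; i++)
    {
      u = up[i];
      s = u + vp[i];
      rp[i] = s;
      cy[i + 1] = s < u;
    }

  /* Pass 2: add the recorded carries, again independently per limb,
     and count how many of these additions wrapped around. */
  for (i = 1; i < n; i++)
    {
      s = rp[i] + cy[i];
      rp[i] = s;
      more_carries += s < cy[i];
    }

  /* Pass 3 (rare): ripple the remaining carries with a scalar loop. */
  if (more_carries)
    {
      cys = rp[1] < cy[1];
      for (i = 2; i < n; i++)
        {
          limb_t wrapped = rp[i] < cy[i];  /* did pass 2 wrap at limb i? */
          rp[i] += cys;
          cys = wrapped | (rp[i] < cys);
        }
    }

  cys += cy[n];
  free(cy);
  return cys;
}
```

At most one of the three carry sources can be set for any limb, because
the true carry out of a limb addition never exceeds 1, so combining them
with `|' or `+' gives the same result.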
+ IDEA:
+ Store fewer bits, perhaps 62, per limb.  That brings mpn_add_n time
+ down to 2.5 cycles/limb and mpn_addmul_1 times to 4 cycles/limb.  By
+ storing even fewer bits per limb, perhaps 56, it would be possible to
+ write a mul_mul_basecase that would run at effectively 1 cycle/limb.
+ (Use VM here to better handle the rhomb-shaped multiply area, perhaps
+ rounding operand sizes up to the next power of 2.)
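
One plausible reading of the 62-bit idea is that, with spare bits at the
top of each word, the carry lands in those bits and can be extracted with
a shift instead of an unsigned-less-than comparison.  The sketch below
shows only that carry extraction, in a plain sequential loop; NUMB_BITS,
NUMB_MASK and add_n_62 are names invented for this illustration, and the
README does not say how the quoted cycle counts would be reached.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUMB_BITS 62                                  /* data bits per limb */
#define NUMB_MASK ((UINT64_C(1) << NUMB_BITS) - 1)

/* Set rp[] = up[] + vp[] where every limb holds NUMB_BITS bits in a
   64-bit word.  Returns the carry out of the top limb (0 or 1). */
uint64_t add_n_62(uint64_t *rp, const uint64_t *up, const uint64_t *vp,
                  size_t n)
{
  uint64_t cy = 0;
  size_t i;

  for (i = 0; i < n; i++)
    {
      uint64_t s;

      assert(up[i] <= NUMB_MASK && vp[i] <= NUMB_MASK);
      s = up[i] + vp[i] + cy;   /* below 2^63, so the word never wraps */
      rp[i] = s & NUMB_MASK;    /* keep the low 62 bits */
      cy = s >> NUMB_BITS;      /* carry is read off with a shift */
    }
  return cy;
}
```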
