=================================================================== RCS file: /home/cvs/OpenXM_contrib/gmp/mpn/cray/Attic/README,v retrieving revision 1.1 retrieving revision 1.1.1.2 diff -u -p -r1.1 -r1.1.1.2 --- OpenXM_contrib/gmp/mpn/cray/Attic/README 2000/09/09 14:12:22 1.1 +++ OpenXM_contrib/gmp/mpn/cray/Attic/README 2003/08/25 16:06:18 1.1.1.2 @@ -1,14 +1,112 @@ -The (poorly optimized) code in this directory was originally written for a -j90 system, but finished on a c90. It should work on all Cray vector -computers. For the T3E and T3D systems, the `alpha' subdirectory at the -same level as the directory containing this file, is much better. +Copyright 2000, 2001, 2002 Free Software Foundation, Inc. -* `+' seems to be faster than `|' when combining carries. +This file is part of the GNU MP Library. -* It is possible that the best multiply performance would be achived by - storing only 24 bits per element, and using lazy carry propagation. Before - calling i24mult, full carry propagation would be needed. +The GNU MP Library is free software; you can redistribute it and/or modify +it under the terms of the GNU Lesser General Public License as published by +the Free Software Foundation; either version 2.1 of the License, or (at your +option) any later version. -* Supply tasking versions of the C loops. +The GNU MP Library is distributed in the hope that it will be useful, but +WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY +or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public +License for more details. +You should have received a copy of the GNU Lesser General Public License +along with the GNU MP Library; see the file COPYING.LIB. If not, write to +the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA +02111-1307, USA. + + + + + +The code in this directory works for Cray vector systems such as C90, +J90, T90 (both the CFP variant and the IEEE variant) and SV1. (For +the T3E and T3D systems, see the `alpha' subdirectory at the same +level as the directory containing this file.) + +The cfp subdirectory is for systems utilizing the traditional Cray +floating-point format, and the ieee subdirectory is for the newer +systems that use the IEEE floating-point format. + +There are several issues that reduces speed on Cray systems. For +systems with cfp floating point, the main obstacle is the forming of +128-bit products. For IEEE systems, adding, and in particular +computing carry is the main issue. There are no vectorizing +unsigned-less-than instructions, and the sequence that implement that +opetration is very long. + +Shifting is the only operation that is simple to make fast. All Cray +systems have a bitblt instructions (Vi Vj,VjAk) that +should be really useful. + +For best speed for cfp systems, we need a mul_basecase, since that +reduces the need for carry propagation to a minimum. Depending on the +size (vn) of the smaller of the two operands (V), we should split U and V +in different chunk sizes: + +U split in 2 32-bit parts +V split according to the table: +parts 4 5 6 7 8 +bits/part 16 13 11 10 8 +max allowed vn 1 8 32 64 256 +number of multiplies 8 10 12 14 16 +peak cycles/limb 4 5 6 7 8 + +U split in 3 22-bit parts +V split according to the table: +parts 3 4 5 +bits/part 22 16 13 +max allowed vn 16 1024 8192 +number of multiplies 9 12 15 +peak cycles/limb 4.5 6 7.5 + +U split in 4 16-bit parts +V split according to the table: +parts 4 +bits/part 16 +max allowed vn 65536 +number of multiplies 16 +peak cycles/limb 8 + +(A T90 CPU can accumulate two products per cycle.) + +IDEA: +* Rewrite mpn_add_n: + short cy[n + 1]; + #pragma _CRI ivdep + for (i = 0; i < n; i++) + { s = up[i] + vp[i]; + rp[i] = s; + cy[i + 1] = s < up[i]; } + more_carries = 0; + #pragma _CRI ivdep + for (i = 1; i < n; i++) + { s = rp[i] + cy[i]; + rp[i] = s; + more_carries += s < cy[i]; } + cys = 0; + if (more_carries) + { + cys = rp[1] < cy[1]; + for (i = 2; i < n; i++) + { rp[i] += cys; + cys = rp[i] < cys; } + } + return cys + cy[n]; + +* Write mpn_add3_n for adding three operands. First add operands 1 + and 2, and generate cy[]. Then add operand 3 to the partial result, + and accumulate carry into cy[]. Finally propagate carry just like + in the new mpn_add_n. + +IDEA: + +Store fewer bits, perhaps 62, per limb. That brings mpn_add_n time +down to 2.5 cycles/limb and mpn_addmul_1 times to 4 cycles/limb. By +storing even fewer bits per limb, perhaps 56, it would be possible to +write a mul_mul_basecase that would run at effectively 1 cycle/limb. +(Use VM here to better handle the romb-shaped multiply area, perhaps +rouding operand sizes up to the next power of 2.)