OpenXM_contrib/gmp/mpn/cray/README - annotate

Return to README CVS log
Up to [local] / OpenXM_contrib / gmp / mpn / cray
Annotation of OpenXM_contrib/gmp/mpn/cray/README, Revision 1.1.1.2

1.1.1.2 ! ohara       1: Copyright 2000, 2001, 2002 Free Software Foundation, Inc.
1.1       maekawa     2:
1.1.1.2 ! ohara       3: This file is part of the GNU MP Library.
1.1       maekawa     4:
1.1.1.2 ! ohara       5: The GNU MP Library is free software; you can redistribute it and/or modify
        !             6: it under the terms of the GNU Lesser General Public License as published by
        !             7: the Free Software Foundation; either version 2.1 of the License, or (at your
        !             8: option) any later version.
1.1       maekawa     9:
1.1.1.2 ! ohara      10: The GNU MP Library is distributed in the hope that it will be useful, but
        !            11: WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
        !            12: or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
        !            13: License for more details.
        !            14:
        !            15: You should have received a copy of the GNU Lesser General Public License
        !            16: along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
        !            17: the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
        !            18: 02111-1307, USA.
        !            19:
        !            20:
        !            21:
        !            22:
        !            23:
        !            24:
        !            25: The code in this directory works for Cray vector systems such as C90,
        !            26: J90, T90 (both the CFP variant and the IEEE variant) and SV1.  (For
        !            27: the T3E and T3D systems, see the `alpha' subdirectory at the same
        !            28: level as the directory containing this file.)
        !            29:
        !            30: The cfp subdirectory is for systems utilizing the traditional Cray
        !            31: floating-point format, and the ieee subdirectory is for the newer
        !            32: systems that use the IEEE floating-point format.
        !            33:
        !            34: There are several issues that reduces speed on Cray systems.  For
        !            35: systems with cfp floating point, the main obstacle is the forming of
        !            36: 128-bit products.  For IEEE systems, adding, and in particular
        !            37: computing carry is the main issue.  There are no vectorizing
        !            38: unsigned-less-than instructions, and the sequence that implement that
        !            39: opetration is very long.
        !            40:
        !            41: Shifting is the only operation that is simple to make fast.  All Cray
        !            42: systems have a bitblt instructions (Vi Vj,Vj<Ak and Vi Vj,Vj>Ak) that
        !            43: should be really useful.
        !            44:
        !            45: For best speed for cfp systems, we need a mul_basecase, since that
        !            46: reduces the need for carry propagation to a minimum.  Depending on the
        !            47: size (vn) of the smaller of the two operands (V), we should split U and V
        !            48: in different chunk sizes:
        !            49:
        !            50: U split in 2 32-bit parts
        !            51: V split according to the table:
        !            52: parts                  4       5       6       7       8
        !            53: bits/part              16      13      11      10      8
        !            54: max allowed vn         1       8       32      64      256
        !            55: number of multiplies   8       10      12      14      16
        !            56: peak cycles/limb       4       5       6       7       8
        !            57:
        !            58: U split in 3 22-bit parts
        !            59: V split according to the table:
        !            60: parts                  3       4       5
        !            61: bits/part              22      16      13
        !            62: max allowed vn         16      1024    8192
        !            63: number of multiplies   9       12      15
        !            64: peak cycles/limb       4.5     6       7.5
        !            65:
        !            66: U split in 4 16-bit parts
        !            67: V split according to the table:
        !            68: parts                  4
        !            69: bits/part              16
        !            70: max allowed vn         65536
        !            71: number of multiplies   16
        !            72: peak cycles/limb       8
        !            73:
        !            74: (A T90 CPU can accumulate two products per cycle.)
        !            75:
        !            76: IDEA:
        !            77: * Rewrite mpn_add_n:
        !            78:     short cy[n + 1];
        !            79:     #pragma _CRI ivdep
        !            80:       for (i = 0; i < n; i++)
        !            81:        { s = up[i] + vp[i];
        !            82:          rp[i] = s;
        !            83:          cy[i + 1] = s < up[i]; }
        !            84:       more_carries = 0;
        !            85:     #pragma _CRI ivdep
        !            86:       for (i = 1; i < n; i++)
        !            87:        { s = rp[i] + cy[i];
        !            88:          rp[i] = s;
        !            89:          more_carries += s < cy[i]; }
        !            90:       cys = 0;
        !            91:       if (more_carries)
        !            92:        {
        !            93:          cys = rp[1] < cy[1];
        !            94:          for (i = 2; i < n; i++)
        !            95:            { rp[i] += cys;
        !            96:              cys = rp[i] < cys; }
        !            97:        }
        !            98:       return cys + cy[n];
        !            99:
        !           100: * Write mpn_add3_n for adding three operands.  First add operands 1
        !           101:   and 2, and generate cy[].  Then add operand 3 to the partial result,
        !           102:   and accumulate carry into cy[].  Finally propagate carry just like
        !           103:   in the new mpn_add_n.
        !           104:
        !           105: IDEA:
        !           106:
        !           107: Store fewer bits, perhaps 62, per limb.  That brings mpn_add_n time
        !           108: down to 2.5 cycles/limb and mpn_addmul_1 times to 4 cycles/limb.  By
        !           109: storing even fewer bits per limb, perhaps 56, it would be possible to
        !           110: write a mul_mul_basecase that would run at effectively 1 cycle/limb.
        !           111: (Use VM here to better handle the romb-shaped multiply area, perhaps
        !           112: rouding operand sizes up to the next power of 2.)
FreeBSD-CVSweb <freebsd-cvsweb@FreeBSD.org>