Copyright 2000, 2001, 2002 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 2.1 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
02111-1307, USA.


The IA-64 ISA groups instructions three at a time into 128-bit bundles.
Programmers/compilers need to insert explicit stops `;;' where there are
WAW or RAW dependencies.  Such stops can typically only be placed at the
end of a bundle, with some exceptions.

The Itanium and Itanium 2 implementations can under ideal conditions
execute two bundles per cycle.  The Itanium allows 4 of these 6
instructions to be integer operations, while the Itanium 2 allows all
6 to be integer operations.

Taken cloop branches seem to insert a bubble into the pipeline most of
the time.

Loads to the fp registers bypass the L1 cache and thus get extremely
long latencies, 9 cycles on the Itanium and 7 cycles on the Itanium 2.

The software pipelining support built around the br.ctop instruction
causes delays, since many issue slots are taken up by instructions with
zero predicates, and since many extra instructions are needed to set
things up.  These features are designed for code density, not maximum
speed.

Misc pipeline limitations (Itanium):
* The getf.sig instruction can only execute in M0.
* At most four integer instructions/cycle.
* Nops take up resources like any other instruction.

================================================================
mpn_add_n, mpn_sub_n:

The current code runs at 3 cycles/limb.  Unrolling could clearly bring
down the time to 2 cycles/limb.

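For reference, here is the operation these routines compute, as plain C.
This is only a sketch; the function name is ours, the types come from
gmp.h, and the assembly keeps the carry in a predicate register rather
than a general register.

#include <gmp.h>

/* rp[] = up[] + vp[], returning the carry out.  (mpn_sub_n is the
   same with a borrow instead of a carry.)  */
mp_limb_t
ref_add_n (mp_ptr rp, mp_srcptr up, mp_srcptr vp, mp_size_t n)
{
  mp_limb_t cy = 0;
  mp_size_t i;
  for (i = 0; i < n; i++)
    {
      mp_limb_t ul = up[i];
      mp_limb_t sl = ul + vp[i] + cy;
      /* The sum wrapped iff it came out not larger than ul, taking
         the incoming carry into account.  */
      cy = cy ? (sl <= ul) : (sl < ul);
      rp[i] = sl;
    }
  return cy;
}
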
================================================================
mpn_addmul_1:

The current code runs at 3.7 cycles/limb, but that somewhat odd timing
is reached only for huge operands.  It uses the mod-scheduled software
pipelining feature.  The reason for the poor speed for small operands
is that mod-scheduled loops have a very long start-up overhead.
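
For reference, the operation computed is rp[] += up[] * vl with the
high limb returned.  A plain C sketch (our function name; assumes
64-bit limbs and a compiler with a 128-bit integer type):

#include <gmp.h>

mp_limb_t
ref_addmul_1 (mp_ptr rp, mp_srcptr up, mp_size_t n, mp_limb_t vl)
{
  mp_limb_t cl = 0;
  mp_size_t i;
  for (i = 0; i < n; i++)
    {
      /* One 128-bit multiply-accumulate step; the assembly splits
         this across an xma.l/xma.hu pair and an add/cmp chain.  */
      unsigned __int128 t = (unsigned __int128) up[i] * vl + rp[i] + cl;
      rp[i] = (mp_limb_t) t;       /* low product limb */
      cl = (mp_limb_t) (t >> 64);  /* high limb carries into the next */
    }
  return cl;
}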

For optimal speed, we need to load two 64-bit limbs with the ldfp8
instruction, and stay away from mod-scheduled loops.  Since rp and up
might be mutually aligned in two ways, we will need two loop variants,
with the same basic structure:

{ .mfi    getf.sig
          xma.l
    (p6)  cmp.leu p6, p7 =
} { .mfi  stf8
          xma.hu
    (p7)  cmp.ltu p6, p7 =
          ;;
} { .mib  getf.sig
    (p6)  add 1
          nop.b
} { .mib  ldfp8 = [up], 16
    (p7)  add
          nop.b
          ;;
} { .mfi  getf.sig
          xma.l
    (p6)  cmp.leu p6, p7 =
} { .mfi  stf8
          xma.hu
    (p7)  cmp.ltu p6, p7 =
          ;;
} { .mib  getf.sig
    (p6)  add 1
          nop.b
} { .mib  ldfp8 = [rp], 16
    (p7)  add
          br.cloop
          ;;
}

2 limbs/20 instructions
20 insn/max 6 insn/cycle:      3.3 cycles/2limb
 8 memops/max 2 memops/cycle:  4.0 cycles/2limb
 8 intops/max 2 intops/cycle:  4.0 cycles/2limb
 4 fpops/max 2 fpops/cycle:    2.0 cycles/2limb

================================================================
mpn_submul_1:

The current code just calls mpn_mul_1 and mpn_sub_n and thus needs
about 7 cycles/limb.
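
That composition looks like this in C (a sketch; the helper name and
the caller-supplied scratch buffer tp[] of n limbs are our invention):

#include <gmp.h>

mp_limb_t
submul_1_via_mul_sub (mp_ptr rp, mp_srcptr up, mp_size_t n,
                      mp_limb_t vl, mp_ptr tp)
{
  mp_limb_t hi = mpn_mul_1 (tp, up, n, vl);  /* tp[] = up[] * vl */
  mp_limb_t bw = mpn_sub_n (rp, rp, tp, n);  /* rp[] -= tp[], borrow bw */
  return hi + bw;  /* hi <= vl - 1, so this cannot wrap */
}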

We could implement this much like mpn_addmul_1, if we first negate v.
With vln = -vl, i.e., 2^64 - vl for nonzero vl, we have

  ul * vln + x  =  ul * 2^64 + (x - ul*vl)

so the low limb of the xma result is exactly the wanted result limb,
and the borrow falls out of the high limb as ul - phi.  This should
allow us to use the accumulation of xma.  (A zero vl would need
special handling, since -vl is then not 2^64 - vl.)  Here is how it
works:

/* Set ph:pl = m0 * m1 + a.  */
#define umul_ppmma(ph, pl, m0, m1, a) \
  do { \
    UDItype __m0 = (m0), __m1 = (m1), __a = (a); \
    __asm__ ("xma.hu %0 = %1, %2, %3" \
             : "=f" (ph) \
             : "f" (__m0), "f" (__m1), "f" (__a)); \
    (pl) = __m0 * __m1 + __a; \
  } while (0)

mp_limb_t
mpn_submul_1 (mp_ptr rp, mp_srcptr up, mp_size_t n, mp_limb_t vl)
{
  mp_limb_t cl, cy;
  mp_size_t i;
  mp_limb_t phi, plo;
  mp_limb_t x;
  mp_limb_t ul, vln;

  vln = -vl;                    /* negated multiplier */

  cl = 0;
  for (i = n; i != 0; i--)
    {
      ul = *up++;               /* will need this in both fregs and gregs */
      x = *rp;
      /* phi:plo = ul * vln + x = ul*2^64 + (x - ul*vl) */
      umul_ppmma (phi, plo, ul, vln, x);

      /* subtract the incoming borrow from the low limb */
      cy = plo < cl;
      plo -= cl;

      /* outgoing borrow: ul - phi, plus any borrow from above */
      cl = ul - phi;
      cl += cy;

      *rp++ = plo;
    }

  return cl;
}

================================================================
mpn_mul_1:

The current code runs at 3.7 cycles/limb.  The code is very similar to
the mpn_addmul_1 code.  See comments above.
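
For reference, in plain C this computes rp[] = up[] * vl and returns
the high limb (same hedges as the ref_addmul_1 sketch above):

#include <gmp.h>

mp_limb_t
ref_mul_1 (mp_ptr rp, mp_srcptr up, mp_size_t n, mp_limb_t vl)
{
  mp_limb_t cl = 0;
  mp_size_t i;
  for (i = 0; i < n; i++)
    {
      unsigned __int128 t = (unsigned __int128) up[i] * vl + cl;
      rp[i] = (mp_limb_t) t;       /* low product limb */
      cl = (mp_limb_t) (t >> 64);  /* high limb carries into the next */
    }
  return cl;
}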

Faster code wouldn't be too hard to write.  This is one possible
pattern:

{ .mfi    getf.sig
          xma.l
    (p6)  cmp.leu p6, p7 =
} { .mfi  stf8
          xma.hu
    (p7)  cmp.ltu p6, p7 =
          ;;
} { .mib  getf.sig
    (p6)  add 1
          nop.b
} { .mib  ldf8 = [up], 8
    (p7)  add
          br.cloop
          ;;
}

1 limb/12 instructions
12 insn/max 6 insn/cycle:     2.0 cycles/limb
 4 memops/max 2 memops/cycle: 2.0 cycles/limb
 4 intops/max 2 intops/cycle: 2.0 cycles/limb
 2 fpops/max 2 fpops/cycle:   1.0 cycles/limb

================================================================
mpn_mul_8:

The add+cmp+add we use in the other code is optimal for shortening
recurrences (2 cycles), but the sequence takes up 4 execution slots.
When recurrence depth is not critical, a more standard 3-cycle
add+cmp+add is better.
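
In C, one such carry-propagating step looks like this (a sketch; the
helper name is ours, and the real code would keep the carry in a
predicate register):

#include <gmp.h>

/* r = a + b + cy_in, with the carry out in *cy_out.  */
static inline mp_limb_t
add_cc (mp_limb_t a, mp_limb_t b, mp_limb_t cy_in, mp_limb_t *cy_out)
{
  mp_limb_t s = a + b;
  mp_limb_t c = s < a;   /* cmp: carry from the first add */
  s += cy_in;            /* second add folds in the old carry */
  c += s < cy_in;        /* ...which may itself carry out */
  *cy_out = c;           /* at most one of the two carries fires */
  return s;
}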

/* First load the 8 values from v */
        ldfp8   v0, v1 = [r35], 16;;
        ldfp8   v2, v3 = [r35], 16;;
        ldfp8   v4, v5 = [r35], 16;;
        ldfp8   v6, v7 = [r35], 16;;

/* In the inner loop, get a new U limb and store a result limb.  */
        mov     lc = un
Loop:   ldf8    u0 = [r33], 8
        xma.l   lp0 = v0, u0, hp0
        xma.hu  hp0 = v0, u0, hp0
        xma.l   lp1 = v1, u0, hp1
        xma.hu  hp1 = v1, u0, hp1
        xma.l   lp2 = v2, u0, hp2
        xma.hu  hp2 = v2, u0, hp2
        xma.l   lp3 = v3, u0, hp3
        xma.hu  hp3 = v3, u0, hp3
        xma.l   lp4 = v4, u0, hp4
        xma.hu  hp4 = v4, u0, hp4
        xma.l   lp5 = v5, u0, hp5
        xma.hu  hp5 = v5, u0, hp5
        xma.l   lp6 = v6, u0, hp6
        xma.hu  hp6 = v6, u0, hp6
        xma.l   lp7 = v7, u0, hp7
        xma.hu  hp7 = v7, u0, hp7
        getf.sig l0 = lp0
        getf.sig l1 = lp1
        getf.sig l2 = lp2
        getf.sig l3 = lp3
        getf.sig l4 = lp4
        getf.sig l5 = lp5
        getf.sig l6 = lp6
        getf.sig l7 = lp7
        add+cmp+add l0, l0, h0
        add+cmp+add l1, l1, h1
        add+cmp+add l2, l2, h2
        add+cmp+add l3, l3, h3
        add+cmp+add l4, l4, h4
        add+cmp+add l5, l5, h5
        add+cmp+add l6, l6, h6
        add+cmp+add l7, l7, h7
        st8     [r32] = xx, 8
        br.cloop Loop

50 insn at max 6 insn/cycle:         8.33 cycles/limb8
10 memops at max 2 memops/cycle:     5 cycles/limb8
16 fpops at max 2 fpops/cycle:       8 cycles/limb8
24 intops at max 4 intops/cycle:     6 cycles/limb8
10+24 memops+intops at max 4/cycle:  8.5 cycles/limb8
                                     1.0625 cycles/limb

================================================================
mpn_lshift, mpn_rshift:

The current code runs at 2 cycles/limb, but has an overly deep
software pipeline.  The code suffers badly from the 4-cycle latency of
the variable shift instructions.

Using 63 separate loops, we could use the double-word SHRP
instruction.  That instruction has a plain single-cycle latency.  We
need 63 loops since this instruction only accepts an immediate count.
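
For reference, what each shrp-based mpn_lshift iteration computes, in
plain C (a sketch with our function name; assumes 64-bit limbs,
n >= 1, and 0 < cnt < 64):

#include <gmp.h>

mp_limb_t
ref_lshift (mp_ptr rp, mp_srcptr up, mp_size_t n, unsigned cnt)
{
  mp_limb_t retval = up[n - 1] >> (64 - cnt);  /* bits shifted out */
  mp_size_t i;
  for (i = n - 1; i > 0; i--)
    /* one shrp: a 128-bit funnel shift of two adjacent limbs */
    rp[i] = (up[i] << cnt) | (up[i - 1] >> (64 - cnt));
  rp[0] = up[0] << cnt;
  return retval;
}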