Annotation of OpenXM_contrib/gmp/mpn/x86/pentium/README, Revision 1.1.1.3

1.1.1.3 ! ohara       1: Copyright 1996, 1999, 2000, 2001 Free Software Foundation, Inc.
        !             2:
        !             3: This file is part of the GNU MP Library.
        !             4:
        !             5: The GNU MP Library is free software; you can redistribute it and/or modify
        !             6: it under the terms of the GNU Lesser General Public License as published by
        !             7: the Free Software Foundation; either version 2.1 of the License, or (at your
        !             8: option) any later version.
        !             9:
        !            10: The GNU MP Library is distributed in the hope that it will be useful, but
        !            11: WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
        !            12: or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
        !            13: License for more details.
        !            14:
        !            15: You should have received a copy of the GNU Lesser General Public License
        !            16: along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
        !            17: the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
        !            18: 02111-1307, USA.
        !            19:
        !            20:
        !            21:
        !            22:
1.1.1.2   maekawa    23:
                     24:                    INTEL PENTIUM P5 MPN SUBROUTINES
                     25:
                     26:
                     27: This directory contains mpn functions optimized for Intel Pentium (P5,P54)
1.1.1.3 ! ohara      28: processors.  The mmx subdirectory has additional code for Pentium with MMX
        !            29: (P55).
1.1.1.2   maekawa    30:
                     31:
                     32: STATUS
                     33:
                     34:                                 cycles/limb
                     35:
                     36:        mpn_add_n/sub_n            2.375
                     37:
1.1.1.3 ! ohara      38:        mpn_mul_1                 12.0
1.1.1.2   maekawa    39:        mpn_add/submul_1          14.0
                     40:
                     41:        mpn_mul_basecase          14.2 cycles/crossproduct (approx)
                     42:
                     43:        mpn_sqr_basecase           8 cycles/crossproduct (approx)
                     44:                                    or 15.5 cycles/triangleproduct (approx)
                     45:
1.1.1.3 ! ohara      46:        mpn_l/rshift               5.375 normal (6.0 on P54)
        !            47:                                   1.875 special shift by 1 bit
        !            48:
        !            49:        mpn_divrem_1              44.0
        !            50:        mpn_mod_1                 28.0
        !            51:        mpn_divexact_by3          15.0
        !            52:
        !            53:        mpn_copyi/copyd            1.0
        !            54:
1.1.1.2   maekawa    55: Pentium MMX gets the following improvements
                     56:
                     57:        mpn_l/rshift               1.75
                     58:
                     59:
1.1.1.3 ! ohara      60: 1. mpn_add_n and mpn_sub_n run at asymptotically 2 cycles/limb.  Due to loop
        !            61: overhead and other delays (cache refill?), they run at or near 2.5 cycles/limb.
        !            62:
        !            63: 2. mpn_mul_1, mpn_addmul_1, mpn_submul_1 all run 1 cycle faster than they
        !            64: should.  Intel documentation says a mul instruction is 10 cycles, but it
        !            65: measures 9 and the routines using it run as if it takes 9.
        !            66:
        !            67:
        !            68:
        !            69: P55 MMX AND X87
        !            70:
        !            71: The cost of switching between MMX and x87 floating point on P55 is about 100
        !            72: cycles (fld1/por/emms for instance).  To avoid that cost the two aren't
        !            73: mixed, which currently means using MMX and not x87.
        !            74:
        !            75: MMX offers a big speedup for lshift and rshift, and a nice speedup for
        !            76: 16-bit multipliers in mul_1.  If fast code using x87 is found then perhaps
        !            77: the preference for MMX will be reversed.
        !            78:
        !            79:
        !            80:
        !            81:
        !            82: P54 SHLDL
        !            83:
        !            84: mpn_lshift and mpn_rshift run at about 6 cycles/limb on P5 and P54, but the
1.1.1.2   maekawa    85: documentation indicates that they should take only 43/8 = 5.375 cycles/limb,
                     86: or 5 cycles/limb asymptotically.  The P55 runs them at the expected speed.
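
The 43/8 figure is consistent with a loop unrolled to 8 limbs per iteration
at 5 cycles each plus 3 cycles of per-iteration overhead.  That breakdown is
an assumption made here for illustration, not something stated above, and
cycles_per_limb is a hypothetical helper:

```c
/* Hypothetical helper: average cycles/limb for a loop that handles
   `unroll` limbs per iteration at `per_limb` cycles each, plus
   `overhead` extra cycles per iteration.  The 8-limb/3-cycle split used
   in the note below is an assumed breakdown of 43/8.  */
double
cycles_per_limb (int per_limb, int unroll, int overhead)
{
  return (double) (per_limb * unroll + overhead) / unroll;
}
```

With these assumed parameters, cycles_per_limb (5, 8, 3) is 43/8, and the
value tends to 5 as the overhead is spread over more limbs.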
                     87:
1.1.1.3 ! ohara      88: It seems that on P54 a shldl or shrdl allows pairing in one following cycle,
        !            89: but not two.  For example, back to back repetitions of the following
1.1.1.2   maekawa    90:
1.1.1.3 ! ohara      91:        shldl(  %cl, %eax, %ebx)
        !            92:        xorl    %edx, %edx
        !            93:        xorl    %esi, %esi
        !            94:
        !            95: run at 5 cycles, as expected, but repetitions of the following run at 7
        !            96: cycles, whereas 6 would be expected (and is achieved on P55),
        !            97:
        !            98:        shldl(  %cl, %eax, %ebx)
        !            99:        xorl    %edx, %edx
        !           100:        xorl    %esi, %esi
        !           101:        xorl    %edi, %edi
        !           102:        xorl    %ebp, %ebp
        !           103:
        !           104: Three xorls run at 7 cycles too, so the cause doesn't seem to be simply
        !           105: that pairing is inhibited in the second following cycle.
        !           106:
        !           107: Avoiding this problem would bring P54 shifts down from 6.0 c/l to 5.5 with a
        !           108: pattern of shift, 2 loads, shift, 2 stores, shift, etc.  A start has been
        !           109: made on something like that, but it's not yet complete.
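
For reference, the per-limb work that shldl does in a left shift can be
sketched in C.  This is only a portable model of the arithmetic, assuming
32-bit limbs and a shift count from 1 to 31; shld32 and ref_lshift are
hypothetical names, not the routines in this directory, and the sketch says
nothing about pairing or cycle counts.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of the value shldl leaves in its destination: hi shifted left
   by cnt, with the top cnt bits of lo shifted in.  Needs 1 <= cnt <= 31.  */
uint32_t
shld32 (uint32_t hi, uint32_t lo, unsigned cnt)
{
  return (hi << cnt) | (lo >> (32 - cnt));
}

/* Hypothetical reference left shift of n limbs (n >= 1), least
   significant limb first.  Works from the high limb down and returns
   the bits shifted out of the top limb.  */
uint32_t
ref_lshift (uint32_t *rp, const uint32_t *sp, size_t n, unsigned cnt)
{
  uint32_t out = sp[n - 1] >> (32 - cnt);
  for (size_t i = n - 1; i > 0; i--)
    rp[i] = shld32 (sp[i], sp[i - 1], cnt);
  rp[0] = sp[0] << cnt;
  return out;
}
```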
        !           110:
        !           111:
        !           112:
        !           113:
        !           114: OTHER NOTES
        !           115:
        !           116: Prefetching Destinations
        !           117:
        !           118:     Pentium doesn't allocate cache lines on writes, unlike most other modern
        !           119:     processors.  Since the functions in the mpn class do array writes, we
        !           120:     have to handle allocating the destination cache lines by reading a word
        !           121:     from each line in the loops, to achieve the best performance.
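
The idea can be sketched in C as below.  This is an illustration only: it
assumes 32-bit limbs, 32-byte cache lines and a destination aligned to a
line boundary, copy_touch_dst is a hypothetical name, and a C compiler
gives no control over how the dummy read is scheduled, which is what the
assembly loops actually care about.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical copy loop.  Before the first store into each group of
   8 limbs (one 32-byte line, assuming dst is line aligned), read one
   word from that group so the line is brought into L1 first.  */
void
copy_touch_dst (uint32_t *dst, const uint32_t *src, size_t n)
{
  volatile const uint32_t *touch = dst;   /* volatile keeps the read */
  for (size_t i = 0; i < n; i++)
    {
      if ((i & 7) == 0)
        (void) touch[i];                  /* dummy read, value unused */
      dst[i] = src[i];
    }
}
```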
        !           122:
        !           123: Prefetching Sources
        !           124:
        !           125:     Prefetching of sources is pointless since there are no out-of-order loads.
        !           126:     Any load instruction blocks until the line is brought to L1, so it may
        !           127:     as well be the load that wants the data which blocks.
        !           128:
        !           129: Data Cache Bank Clashes
        !           130:
        !           131:     Pairing of memory operations requires that the two issued operations
        !           132:     refer to different cache banks (i.e. different addresses modulo 32
        !           133:     bytes).  The simplest way to ensure this is to read/write two words from
        !           134:     the same object.  If we make operations on different objects, they might
        !           135:     or might not be to the same cache bank.
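
A bank check along these lines can be sketched in C.  The sketch assumes
eight banks, each 4 bytes wide, selected by address bits 2 to 4; that
layout is an assumption added here to make "modulo 32 bytes" concrete, and
p5_bank and p5_bank_clash are hypothetical names.

```c
#include <stdint.h>

/* Assumed layout: eight 4-byte banks, chosen by address bits 2..4,
   so the bank pattern repeats every 32 bytes.  */
unsigned
p5_bank (uintptr_t addr)
{
  return (unsigned) ((addr >> 2) & 7);
}

/* Nonzero if two addresses fall in the same bank under that layout.  */
int
p5_bank_clash (uintptr_t a, uintptr_t b)
{
  return p5_bank (a) == p5_bank (b);
}
```

Under this model two adjacent words of one object never clash, while words
from two different objects clash whenever their addresses happen to agree
in bits 2 to 4.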
        !           136:
        !           137: PIC %eip Fetching
1.1.1.2   maekawa   138:
1.1.1.3 ! ohara     139:     A simple call $+5 and popl can be used to get %eip; there's no need to
        !           140:     balance calls and returns since P5 doesn't have any return stack branch
        !           141:     prediction.
1.1.1.2   maekawa   142:
1.1.1.3 ! ohara     143: Float Multiplies
1.1       maekawa   144:
1.1.1.3 ! ohara     145:     fmul is pairable and can be issued every 2 cycles (with a 4 cycle
        !           146:     latency for data ready to use).  This is a lot better than integer mull
        !           147:     or imull at 9 cycles non-pairing.  Unfortunately the advantage is
        !           148:     quickly eaten away by needing to throw data through memory back to the
        !           149:     integer registers to adjust for fild and fist being signed, and to do
        !           150:     things like propagating carry bits.
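
The signedness adjustment mentioned above can be sketched in C.  This only
models the arithmetic, assuming 32-bit limbs; limb_to_double is a
hypothetical name, and the real cost on P5 comes from moving the data
through memory, which C does not show.

```c
#include <stdint.h>

/* Hypothetical model of loading an unsigned 32-bit limb through a
   signed integer load such as fild.  A limb with its top bit set
   arrives as a negative value and needs 2^32 added back.  */
double
limb_to_double (uint32_t limb)
{
  /* Conversion to int32_t wraps on the usual two's complement
     targets; strictly it is implementation-defined in C.  */
  double d = (double) (int32_t) limb;
  if (d < 0.0)
    d += 4294967296.0;   /* 2^32 */
  return d;
}
```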
1.1       maekawa   151:
                    152:
                    153:
                    154:
                    155:
1.1.1.2   maekawa   156: REFERENCES
1.1       maekawa   157:
1.1.1.2   maekawa   158: "Intel Architecture Optimization Manual", 1997, order number 242816.  This
                     159: is mostly about P5; the parts about P6 aren't relevant.  Available on-line:
                    160:
                    161:         http://download.intel.com/design/PentiumII/manuals/242816.htm
                    162:
                    163:
                    164:
                    165: ----------------
                    166: Local variables:
                    167: mode: text
                    168: fill-column: 76
                    169: End: