Copyright 1996, 1999, 2000, 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 2.1 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
02111-1307, USA.





                   INTEL PENTIUM P5 MPN SUBROUTINES


This directory contains mpn functions optimized for Intel Pentium (P5,P54)
processors.  The mmx subdirectory has additional code for Pentium with MMX
(P55).


STATUS

                                cycles/limb

        mpn_add_n/sub_n            2.375

        mpn_mul_1                 12.0
        mpn_add/submul_1          14.0

        mpn_mul_basecase          14.2 cycles/crossproduct (approx)

        mpn_sqr_basecase           8 cycles/crossproduct (approx)
                                   or 15.5 cycles/triangleproduct (approx)

        mpn_l/rshift               5.375 normal (6.0 on P54)
                                   1.875 special shift by 1 bit

        mpn_divrem_1              44.0
        mpn_mod_1                 28.0
        mpn_divexact_by3          15.0

        mpn_copyi/copyd            1.0

Pentium MMX gets the following improvements

        mpn_l/rshift               1.75


1. mpn_add_n and mpn_sub_n run at asymptotically 2 cycles/limb.  Due to loop
   overhead and other delays (cache refill?), they run at or near 2.5
   cycles/limb.

2. mpn_mul_1, mpn_addmul_1, mpn_submul_1 all run 1 cycle faster than they
   should.  Intel documentation says a mul instruction is 10 cycles, but it
   measures 9 and the routines using it run as if it takes 9.



P55 MMX AND X87

The cost of switching between MMX and x87 floating point on P55 is about 100
cycles (fld1/por/emms for instance).  To avoid that cost the two aren't
mixed, and currently that means using MMX and not x87.

MMX offers a big speedup for lshift and rshift, and a nice speedup for
16-bit multipliers in mul_1.  If fast code using x87 is found then perhaps
the preference for MMX will be reversed.




P54 SHLDL

mpn_lshift and mpn_rshift run at about 6 cycles/limb on P5 and P54, but the
documentation indicates that they should take only 43/8 = 5.375 cycles/limb,
or 5 cycles/limb asymptotically.  The P55 runs them at the expected speed.

It seems that on P54 a shldl or shrdl allows pairing in one following cycle,
but not two.  For example, back to back repetitions of the following

        shldl(  %cl, %eax, %ebx)
        xorl    %edx, %edx
        xorl    %esi, %esi

run at 5 cycles, as expected, but repetitions of the following run at 7
cycles, whereas 6 would be expected (and is achieved on P55),

        shldl(  %cl, %eax, %ebx)
        xorl    %edx, %edx
        xorl    %esi, %esi
        xorl    %edi, %edi
        xorl    %ebp, %ebp

Three xorls run at 7 cycles too, so the problem doesn't seem to be simply
that pairing is inhibited in the second following cycle.
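
The "expected" figures above can be reproduced with a small model.  This is
only a sketch in C, not part of the library: the function name is
hypothetical, and the costs (4 cycles for a non-pairing shldl by %cl, 1
cycle per pair of independent xorls) are assumptions chosen to match the
numbers quoted in this section.

```c
/* Assumed cost of shldl by %cl: 4 cycles, not pairable. */
enum { SHLDL_CL_CYCLES = 4 };

/* Expected cycles for one shldl followed by n_xorl independent xorl
   instructions, assuming the xorls pair perfectly (two per cycle, with
   an odd one left over taking a cycle of its own). */
static int expected_cycles(int n_xorl)
{
    return SHLDL_CL_CYCLES + (n_xorl + 1) / 2;
}
```

Under these assumptions two xorls give 5 cycles and four give 6, which are
the expected values stated above; three would also give 6, against the 7
measured on P54.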

Avoiding this problem would bring P54 shifts down from 6.0 c/l to 5.5 with a
pattern of shift, 2 loads, shift, 2 stores, shift, etc.  A start has been
made on something like that, but it's not yet complete.




OTHER NOTES

Prefetching Destinations

    Pentium doesn't allocate cache lines on writes, unlike most other modern
    processors.  Since the functions in the mpn class do array writes, we
    have to allocate the destination cache lines ourselves, by reading a
    word from each of them in the loops, to achieve the best performance.
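
    The idea can be illustrated in C, although the real routines do this in
    assembler.  The sketch below is not taken from the library: limb_t,
    LIMBS_PER_LINE and copy_with_dst_touch are hypothetical names, and it
    assumes 4-byte limbs, 32-byte cache lines and a destination aligned to a
    32-byte boundary.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;      /* hypothetical stand-in for a 32-bit limb */
#define LIMBS_PER_LINE 8      /* assumes 32-byte lines and 4-byte limbs */

/* Copy n limbs from src to dst.  Before storing to each destination cache
   line, read one word from it so the line is brought into L1 first.
   Assumes dst is aligned to a cache line boundary. */
static void copy_with_dst_touch(limb_t *dst, const limb_t *src, size_t n)
{
    for (size_t i = 0; i < n; i += LIMBS_PER_LINE) {
        /* The volatile read stops the compiler discarding the load. */
        (void) *(volatile limb_t *) &dst[i];

        size_t end = (n - i < LIMBS_PER_LINE) ? n : i + LIMBS_PER_LINE;
        for (size_t j = i; j < end; j++)
            dst[j] = src[j];
    }
}
```

    The value read is thrown away; only the side effect of filling the cache
    line matters.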

Prefetching Sources

    Prefetching of sources is pointless since there are no out-of-order
    loads.  Any load instruction blocks until the line is brought to L1, so
    it may as well be the load that wants the data which blocks.

Data Cache Bank Clashes

    Pairing of memory operations requires that the two issued operations
    refer to different cache banks (i.e. different addresses modulo 32
    bytes).  The simplest way to ensure this is to read/write two words from
    the same object.  If we make operations on different objects, they might
    or might not be to the same cache bank.
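
    A rough way to reason about clashes is to compute the bank of each
    address.  The C sketch below uses hypothetical helper names and rests on
    an assumption not stated in this file: that the P5 data cache has 8
    banks interleaved on 4-byte boundaries, so address bits 2 to 4 select
    the bank.

```c
#include <stdint.h>

/* Assumption: 8 banks of 4 bytes each, selected by address bits 2..4. */
static unsigned cache_bank(uintptr_t addr)
{
    return (unsigned) ((addr >> 2) & 7);
}

/* Nonzero if two word accesses fall in the same bank, in which case they
   could not be paired. */
static int banks_clash(uintptr_t a, uintptr_t b)
{
    return cache_bank(a) == cache_bank(b);
}
```

    Under that assumption two adjacent words of one object always fall in
    different banks, whereas words from two different objects clash whenever
    their addresses agree in bits 2 to 4.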

PIC %eip Fetching

    A simple call $+5 and popl can be used to get %eip.  There's no need to
    balance calls and returns since P5 doesn't have any return stack branch
    prediction.

Float Multiplies

    fmul is pairable and can be issued every 2 cycles (with a 4 cycle
    latency for data ready to use).  This is a lot better than integer mull
    or imull at 9 cycles non-pairing.  Unfortunately the advantage is
    quickly eaten away by needing to throw data through memory back to the
    integer registers to adjust for fild and fist being signed, and to do
    things like propagating carry bits.





REFERENCES

"Intel Architecture Optimization Manual", 1997, order number 242816.  This
is mostly about P5; the parts about P6 aren't relevant.  Available on-line:

        http://download.intel.com/design/PentiumII/manuals/242816.htm



----------------
Local variables:
mode: text
fill-column: 76
End: