=================================================================== RCS file: /home/cvs/OpenXM_contrib/gmp/mpn/x86/pentium/Attic/README,v retrieving revision 1.1.1.2 retrieving revision 1.1.1.3 diff -u -p -r1.1.1.2 -r1.1.1.3 --- OpenXM_contrib/gmp/mpn/x86/pentium/Attic/README 2000/09/09 14:12:44 1.1.1.2 +++ OpenXM_contrib/gmp/mpn/x86/pentium/Attic/README 2003/08/25 16:06:29 1.1.1.3 @@ -1,9 +1,32 @@ +Copyright 1996, 1999, 2000, 2001 Free Software Foundation, Inc. +This file is part of the GNU MP Library. + +The GNU MP Library is free software; you can redistribute it and/or modify +it under the terms of the GNU Lesser General Public License as published by +the Free Software Foundation; either version 2.1 of the License, or (at your +option) any later version. + +The GNU MP Library is distributed in the hope that it will be useful, but +WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY +or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public +License for more details. + +You should have received a copy of the GNU Lesser General Public License +along with the GNU MP Library; see the file COPYING.LIB. If not, write to +the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA +02111-1307, USA. + + + + + INTEL PENTIUM P5 MPN SUBROUTINES This directory contains mpn functions optimized for Intel Pentium (P5,P54) -processors. The mmx subdirectory has code for Pentium with MMX (P55). +processors. The mmx subdirectory has additional code for Pentium with MMX +(P55). STATUS @@ -12,16 +35,7 @@ STATUS mpn_add_n/sub_n 2.375 - mpn_copyi/copyd 1.0 - - mpn_divrem_1 44.0 - mpn_mod_1 44.0 - mpn_divexact_by3 15.0 - - mpn_l/rshift 5.375 normal (6.0 on P54) - 1.875 special shift by 1 bit - - mpn_mul_1 13.0 + mpn_mul_1 12.0 mpn_add/submul_1 14.0 mpn_mul_basecase 14.2 cycles/crossproduct (approx) @@ -29,35 +43,113 @@ STATUS mpn_sqr_basecase 8 cycles/crossproduct (approx) or 15.5 cycles/triangleproduct (approx) + mpn_l/rshift 5.375 normal (6.0 on P54) + 1.875 special shift by 1 bit + + mpn_divrem_1 44.0 + mpn_mod_1 28.0 + mpn_divexact_by3 15.0 + + mpn_copyi/copyd 1.0 + Pentium MMX gets the following improvements mpn_l/rshift 1.75 -1. mpn_lshift and mpn_rshift run at about 6 cycles/limb on P5 and P54, but the +1. mpn_add_n and mpn_sub_n run at asymptotically 2 cycles/limb. Due to loop +overhead and other delays (cache refill?), they run at or near 2.5 cycles/limb. + +1. mpn_mul_1, mpn_addmul_1, mpn_submul_1 all run 1 cycle faster than they +should. Intel documentation says a mul instruction is 10 cycles, but it +measures 9 and the routines using it run as 9. + + + +P55 MMX AND X87 + +The cost of switching between MMX and x87 floating point on P55 is about 100 +cycles (fld1/por/emms for instance). In order to avoid that the two aren't +mixed and currently that means using MMX and not x87. + +MMX offers a big speedup for lshift and rshift, and a nice speedup for +16-bit multipliers in mul_1. If fast code using x87 is found then perhaps +the preference for MMX will be reversed. + + + + +P54 SHLDL + +mpn_lshift and mpn_rshift run at about 6 cycles/limb on P5 and P54, but the documentation indicates that they should take only 43/8 = 5.375 cycles/limb, or 5 cycles/limb asymptotically. The P55 runs them at the expected speed. -2. mpn_add_n and mpn_sub_n run at asymptotically 2 cycles/limb. Due to loop -overhead and other delays (cache refill?), they run at or near 2.5 cycles/limb. +It seems that on P54 a shldl or shrdl allows pairing in one following cycle, +but not two. For example, back to back repetitions of the following -3. mpn_mul_1, mpn_addmul_1, mpn_submul_1 all run 1 cycle faster than they -should. Intel documentation says a mul instruction is 10 cycles, but it -measures 9 and the routines using it run with it as 9. + shldl( %cl, %eax, %ebx) + xorl %edx, %edx + xorl %esi, %esi +run at 5 cycles, as expected, but repetitions of the following run at 7 +cycles, whereas 6 would be expected (and is achieved on P55), + shldl( %cl, %eax, %ebx) + xorl %edx, %edx + xorl %esi, %esi + xorl %edi, %edi + xorl %ebp, %ebp -RELEVANT OPTIMIZATION ISSUES +Three xorls run at 7 cycles too, so it doesn't seem to be pairing inhibited +only in the second following cycle. -1. Pentium doesn't allocate cache lines on writes, unlike most other modern -processors. Since the functions in the mpn class do array writes, we have to -handle allocating the destination cache lines by reading a word from it in the -loops, to achieve the best performance. +Avoiding this problem would bring P54 shifts down from 6.0 c/l to 5.5 with a +pattern of shift, 2 loads, shift, 2 stores, shift, etc. A start has been +made on something like that, but it's not yet complete. -2. Pairing of memory operations requires that the two issued operations refer -to different cache banks. The simplest way to insure this is to read/write -two words from the same object. If we make operations on different objects, -they might or might not be to the same cache bank. + + + +OTHER NOTES + +Prefetching Destinations + + Pentium doesn't allocate cache lines on writes, unlike most other modern + processors. Since the functions in the mpn class do array writes, we + have to handle allocating the destination cache lines by reading a word + from it in the loops, to achieve the best performance. + +Prefetching Sources + + Prefetching of sources is pointless since there's no out-of-order loads. + Any load instruction blocks until the line is brought to L1, so it may + as well be the load that wants the data which blocks. + +Data Cache Bank Clashes + + Pairing of memory operations requires that the two issued operations + refer to different cache banks (ie. different addresses modulo 32 + bytes). The simplest way to ensure this is to read/write two words from + the same object. If we make operations on different objects, they might + or might not be to the same cache bank. + +PIC %eip Fetching + + A simple call $+5 and popl can be used to get %eip, there's no need to + balance calls and returns since P5 doesn't have any return stack branch + prediction. + +Float Multiplies + + fmul is pairable and can be issued every 2 cycles (with a 4 cycle + latency for data ready to use). This is a lot better than integer mull + or imull at 9 cycles non-pairing. Unfortunately the advantage is + quickly eaten away by needing to throw data through memory back to the + integer registers to adjust for fild and fist being signed, and to do + things like propagating carry bits. + +