
Annotation of OpenXM_contrib/gmp/mpn/x86/p6/README, Revision 1.1.1.2

Copyright 2000, 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 2.1 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
02111-1307, USA.
        !            19:
        !            20:
        !            21:
        !            22:
1.1       maekawa    23:
                     24:                       INTEL P6 MPN SUBROUTINES
                     25:
                     26:
                     27:
                     28: This directory contains code optimized for Intel P6 class CPUs, meaning
                     29: PentiumPro, Pentium II and Pentium III.  The mmx and p3mmx subdirectories
                     30: have routines using MMX instructions.



STATUS

Times for the loops, with all code and data in L1 cache, are as follows.
Some of these could probably be improved.

                               cycles/limb

       mpn_add_n/sub_n           3.7

       mpn_copyi                 0.75
       mpn_copyd                 1.75 (or 0.75 if no overlap)

       mpn_divrem_1             39.0
       mpn_mod_1                21.5
       mpn_divexact_by3          8.5

       mpn_mul_1                 5.5
       mpn_addmul/submul_1       6.35

       mpn_l/rshift              2.5

       mpn_mul_basecase          8.2 cycles/crossproduct (approx)
       mpn_sqr_basecase          4.0 cycles/crossproduct (approx)
                                 or 7.75 cycles/triangleproduct (approx)

Pentium II and III have MMX and get the following improvements.

       mpn_divrem_1             25.0 integer part, 17.5 fractional part

       mpn_l/rshift              1.75
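
As a rough illustration of how the figures above can be read, the following
C sketch turns some of them into loop cycle estimates for a given operand
size.  The helper names are hypothetical and are not part of GMP.  The
sketch assumes an un x vn limb multiply does un*vn crossproducts and an
n limb square counts n*n crossproducts, and it ignores call overhead and
cache misses, so treat the results as ballpark numbers only.

```c
#include <assert.h>
#include <stddef.h>

/* Figures copied from the STATUS table (plain P6 code, no MMX). */
#define P6_ADD_N_CYCLES_PER_LIMB       3.7
#define P6_MUL_1_CYCLES_PER_LIMB       5.5
#define P6_ADDMUL_1_CYCLES_PER_LIMB    6.35
#define P6_MUL_BASECASE_CYCLES_PER_XP  8.2   /* per crossproduct, approx */
#define P6_SQR_BASECASE_CYCLES_PER_XP  4.0   /* per crossproduct, approx */

/* Hypothetical helper: loop cost of a linear routine on n limbs,
   with everything already in L1 cache. */
static double p6_linear_cycles(double cycles_per_limb, size_t n)
{
    assert(cycles_per_limb > 0.0);
    return cycles_per_limb * (double) n;
}

/* Hypothetical helper: un x vn limbs, assumed to be un*vn crossproducts. */
static double p6_mul_basecase_cycles(size_t un, size_t vn)
{
    return P6_MUL_BASECASE_CYCLES_PER_XP * (double) un * (double) vn;
}

/* Hypothetical helper: n limb square, counted as n*n crossproducts. */
static double p6_sqr_basecase_cycles(size_t n)
{
    return P6_SQR_BASECASE_CYCLES_PER_XP * (double) n * (double) n;
}
```

For example, p6_linear_cycles (P6_ADD_N_CYCLES_PER_LIMB, 10) estimates the
loop of a 10 limb mpn_add_n at about 37 cycles, and a 4 limb square comes
out at roughly half the cost of a 4x4 limb multiply.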




NOTES

The L1 data cache is write-allocate, so prefetching of destinations is
unnecessary.

Mispredicted branches have a penalty of between 9 and 15 cycles, and even up
to 26 cycles depending on how far speculative execution has gone.  The 9
cycle minimum penalty comes from the issue pipeline being 9 stages long.

A copy with rep movs seems to copy 16 bytes at a time, since speeds for 4,
5, 6 or 7 limb operations are all the same.  With 4-byte limbs, the 0.75
cycles/limb would be 3 cycles per 16 byte block.
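
The rep movs arithmetic can be spelled out in a few lines of C.  This is
only a sketch of the reasoning in the note: the names are hypothetical, the
4-byte limb is the 32-bit x86 limb size, and counting whole 16 byte blocks
is one reading of the observation that 4 to 7 limbs take the same time,
not a measured timing model.

```c
#include <assert.h>
#include <stddef.h>

#define BYTES_PER_LIMB         4     /* 32-bit x86 limb */
#define REP_MOVS_BLOCK_BYTES   16    /* block size inferred in the note */
#define COPYI_CYCLES_PER_LIMB  0.75  /* mpn_copyi figure from STATUS */

/* Limbs moved per block: 16 / 4 = 4. */
static size_t limbs_per_block(size_t block_bytes, size_t limb_bytes)
{
    assert(limb_bytes > 0 && block_bytes % limb_bytes == 0);
    return block_bytes / limb_bytes;
}

/* Cycles per block: 0.75 cycles/limb * 4 limbs = 3 cycles. */
static double cycles_per_block(double cycles_per_limb, size_t limbs)
{
    return cycles_per_limb * (double) limbs;
}

/* Assumption: an n limb copy is charged for its whole blocks only,
   so 4, 5, 6 and 7 limbs all count as one block. */
static size_t whole_blocks(size_t n, size_t limbs)
{
    assert(limbs > 0);
    return n / limbs;
}
```

With these definitions, limbs_per_block (REP_MOVS_BLOCK_BYTES,
BYTES_PER_LIMB) is 4, and cycles_per_block (COPYI_CYCLES_PER_LIMB, 4) is
3.0, matching the 3 cycles per 16 byte block quoted above.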




CODING

Instructions in general code are shown grouped where they can execute
together, which means up to three instructions with no successive
dependencies, and with only the first allowed to be a multiple micro-op
instruction.

P6 has out-of-order execution, so the groupings really only show dependent
paths where some shuffling might allow some latencies to be hidden.
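
The grouping rule can be modelled with a small C routine.  This is a
simplified sketch and not how the sources were actually annotated: struct
insn and p6_group are hypothetical names, and "successive dependency" is
assumed to mean that an instruction needs the result of the instruction
immediately before it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified description of one instruction. */
struct insn {
    int uops;             /* micro-ops the instruction decodes to */
    int depends_on_prev;  /* nonzero if it needs the previous result */
};

#define MAX_GROUP 3       /* up to three instructions per group */

/* Stores a group index for each of the n instructions in group[] and
   returns the number of groups formed. */
static size_t p6_group(const struct insn *code, size_t n, size_t *group)
{
    size_t ngroups = 0, in_group = 0, i;

    for (i = 0; i < n; i++) {
        int must_start;

        assert(code[i].uops >= 1);

        /* A new group starts when the current one is empty or full,
           when this instruction depends on the one just before it, or
           when it is a multiple micro-op instruction, which is only
           allowed in the first slot. */
        must_start = in_group == 0
            || in_group == MAX_GROUP
            || code[i].depends_on_prev
            || code[i].uops > 1;

        if (must_start) {
            ngroups++;
            in_group = 0;
        }
        group[i] = ngroups - 1;
        in_group++;
    }
    return ngroups;
}
```

Four independent single micro-op instructions come out as a group of three
followed by a group of one.  As the text above says, out-of-order execution
means such groups are only a guide to dependent paths, not a schedule.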




REFERENCES

"Intel Architecture Optimization Reference Manual", 1999, revision 001 dated
02/99, order number 245127 (order number 730795-001 is in the document too).
Available on-line:

       http://download.intel.com/design/PentiumII/manuals/245127.htm

"Intel Architecture Optimization Manual", 1997, order number 242816.  This
is an older document mostly about P5 and not as good as the above.
Available on-line:

       http://download.intel.com/design/PentiumII/manuals/242816.htm



----------------
Local variables:
mode: text
fill-column: 76
End:
