OpenXM_contrib/gmp/mpn/x86/pentium4/README - annotate

Annotation of OpenXM_contrib/gmp/mpn/x86/pentium4/README, Revision 1.1.1.1

1.1       ohara       1: Copyright 2001 Free Software Foundation, Inc.
                      2:
                      3: This file is part of the GNU MP Library.
                      4:
                      5: The GNU MP Library is free software; you can redistribute it and/or modify
                      6: it under the terms of the GNU Lesser General Public License as published by
                      7: the Free Software Foundation; either version 2.1 of the License, or (at your
                      8: option) any later version.
                      9:
                     10: The GNU MP Library is distributed in the hope that it will be useful, but
                     11: WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
                     12: or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
                     13: License for more details.
                     14:
                     15: You should have received a copy of the GNU Lesser General Public License
                     16: along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
                     17: the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
                     18: 02111-1307, USA.
                     19:
                     20:
                     21:
                     22:
                     23:                    INTEL PENTIUM-4 MPN SUBROUTINES
                     24:
                     25:
                     26: This directory contains mpn functions optimized for Intel Pentium-4.
                     27:
                     28: The mmx subdirectory has routines using MMX instructions; the sse2
                     29: subdirectory has routines using SSE2 instructions.  All P4s have these; the
                     30: separate directories are just so configure can omit that code if the
                     31: assembler doesn't support it.
                     32:
                     33:
                     34: STATUS
                     35:
                     36:                                 cycles/limb
                     37:
                     38:        mpn_add_n/sub_n            4 normal, 6 in-place
                     39:
                     40:        mpn_mul_1                  4 normal, 6 in-place
                     41:        mpn_addmul_1               6
                     42:        mpn_submul_1               7
                     43:
                     44:        mpn_mul_basecase           6 cycles/crossproduct (approx)
                     45:
                     46:        mpn_sqr_basecase           3.5 cycles/crossproduct (approx)
                     47:                                    or 7.0 cycles/triangleproduct (approx)
                     48:
                     49:        mpn_l/rshift               1.75
                     50:
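The two figures quoted for mpn_sqr_basecase differ by a factor of two because
a square needs only about half the limb multiplies of a general product.  The
C sketch below illustrates this under stated assumptions: 32-bit limbs stored
least significant first, and a hypothetical helper name (sqr_sketch) that is
not GMP's actual routine.  It counts the diagonal squares among the triangle
products; whether the table above counts them the same way is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not GMP's mpn_sqr_basecase.  Squares the n-limb
   number {ap,n} into the 2*n limbs at rp (rp must not overlap ap).
   Returns the number of 32x32->64 multiplies performed. */
static size_t sqr_sketch(uint32_t *rp, const uint32_t *ap, size_t n)
{
    size_t mults = 0;

    for (size_t i = 0; i < 2 * n; i++)
        rp[i] = 0;

    /* Off-diagonal triangle: a[i]*a[j] for i < j, accumulated at limb i+j. */
    for (size_t i = 0; i < n; i++) {
        uint64_t carry = 0;
        for (size_t j = i + 1; j < n; j++) {
            uint64_t t = (uint64_t)ap[i] * ap[j] + rp[i + j] + carry;
            rp[i + j] = (uint32_t)t;
            carry = t >> 32;
            mults++;
        }
        rp[i + n] = (uint32_t)carry;    /* this limb is still zero here */
    }

    /* Double the triangle sum: each a[i]*a[j] occurs twice in the square. */
    uint32_t top = 0;
    for (size_t i = 0; i < 2 * n; i++) {
        uint32_t w = rp[i];
        rp[i] = (w << 1) | top;
        top = w >> 31;
    }

    /* Add the diagonal squares a[i]*a[i] at limbs 2i and 2i+1. */
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t d = (uint64_t)ap[i] * ap[i];
        mults++;
        uint64_t lo = (uint64_t)rp[2 * i] + (uint32_t)d + carry;
        rp[2 * i] = (uint32_t)lo;
        uint64_t hi = (uint64_t)rp[2 * i + 1] + (d >> 32) + (lo >> 32);
        rp[2 * i + 1] = (uint32_t)hi;
        carry = hi >> 32;
    }
    return mults;
}
```

For n limbs this performs n*(n+1)/2 multiplies against the n*n crossproducts
of a general multiply, so for large n a cost per triangle product is roughly
twice the same cost expressed per crossproduct.
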
                     51:
                     52:
                     53: The shifts ought to be able to go at 1.5 c/l, but not much effort has been
                     54: applied to them yet.
                     55:
                     56: In-place operations, and all addmul, submul, mul_basecase and sqr_basecase
                     57: calls, suffer from pipeline anomalies associated with write combining and
                     58: movd reads and writes to the same or nearby locations.  The movq
                     59: instructions do not trigger the same hardware problems.  Unfortunately,
                     60: using movq and splitting/combining seems to require too many extra
                     61: instructions to help.  Perhaps future chip steppings will be better.
                     62:
                     63:
                     64:
                     65: NOTES
                     66:
                     67: The Pentium-4 pipeline, "Netburst", provides quite a number of surprises.
                     68: Many traditional x86 instructions run very slowly, requiring use of
                     69: alternative instructions for acceptable performance.
                     70:
                     71: adcl and sbbl are quite slow at 8 cycles for reg->reg.  paddq of 32-bit
                     72: values within a 64-bit mmx register seems better, though the combination
                     73: paddq/psrlq when propagating a carry still has a 4 cycle latency.
                     74:
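The paddq/psrlq idea can be modelled in portable C: add 32-bit limbs in a
64-bit accumulator, store the low half, and shift right by 32 so that only
the carry remains for the next limb.  This is a minimal sketch assuming
32-bit limbs stored least significant first; add_n_sketch is a hypothetical
name, not the mpn_add_n code in this directory, and a compiler is free to
turn it back into adcl.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not GMP's mpn_add_n.  Adds {ap,n} and {bp,n},
   writes n limbs to rp, and returns the carry out (0 or 1). */
static uint32_t add_n_sketch(uint32_t *rp, const uint32_t *ap,
                             const uint32_t *bp, size_t n)
{
    uint64_t acc = 0;           /* stands in for a 64-bit mmx register */

    for (size_t i = 0; i < n; i++) {
        acc += ap[i];           /* 64-bit add, like paddq */
        acc += bp[i];           /* 64-bit add, like paddq */
        rp[i] = (uint32_t)acc;  /* low 32 bits are the result limb */
        acc >>= 32;             /* like psrlq $32: only the carry is left */
    }
    return (uint32_t)acc;
}
```

No flags register is involved, so the carry travels as ordinary data; the
loop-carried dependency is the add followed by the shift, which is the
paddq/psrlq chain described above.
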
                     75: incl and decl should be avoided, instead use add $1 and sub $1.  Apparently
                     76: the carry flag is not separately renamed, so incl and decl depend on all
                     77: previous flags-setting instructions.
                     78:
                     79: shll and shrl have a 4 cycle latency, or 8 times the latency of the fastest
                     80: integer instructions (addl, subl, orl, andl, and some more).  shldl and
                     81: shrdl seem to have 13 and 15 cycles latency, respectively.  Bizarre.
                     82:
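For reference, what a double-register shift such as shldl computes for a
multi-limb shift can also be written with two single shifts and an or.  The
C sketch below shows only that arithmetic, assuming 32-bit limbs stored
least significant first and a shift count from 1 to 31; lshift_sketch is a
hypothetical name, and nothing here claims to be how the mpn_lshift code in
this directory is written or how fast it would run.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not GMP's mpn_lshift.  Shifts {ap,n} left by cnt
   bits (1 <= cnt <= 31) into {rp,n} and returns the bits shifted out of
   the top limb.  Works in place when rp == ap. */
static uint32_t lshift_sketch(uint32_t *rp, const uint32_t *ap,
                              size_t n, unsigned cnt)
{
    uint32_t carry = 0;         /* bits carried up from the limb below */

    for (size_t i = 0; i < n; i++) {
        uint32_t w = ap[i];
        rp[i] = (w << cnt) | carry;   /* shift, then merge lower limb's bits */
        carry = w >> (32 - cnt);      /* high bits move up to the next limb */
    }
    return carry;
}
```
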
                     83: movq mmx -> mmx does have 6 cycle latency, as noted in the documentation.
                     84: pxor/por or similar combination at 2 cycles latency can be used instead.
                     85: The movq however executes in the float unit, thereby saving MMX execution
                     86: resources.  With the right juggling, data moves shouldn't be on a dependent
                     87: chain.
                     88:
                     89: L1 is write-through, but the write-combining sounds like it does enough to
                     90: not require explicit destination prefetching.
                     91:
                     92: xmm registers so far haven't found a use, but not much effort has been
                     93: expended.  A configure test for whether the operating system knows
                     94: fxsave/fxrstor will be needed if they're used.
                     95:
                     96:
                     97:
                     98: REFERENCES
                     99:
                    100: Intel Pentium-4 processor manuals,
                    101:
                    102:        http://developer.intel.com/design/pentium4/manuals
                    103:
                    104: "Intel Pentium 4 Processor Optimization Reference Manual", Intel, 2001,
                    105: order number 248966.  Available on-line:
                    106:
                    107:        http://developer.intel.com/design/pentium4/manuals/248966.htm
                    108:
                    109:
                    110:
                    111: ----------------
                    112: Local variables:
                    113: mode: text
                    114: fill-column: 76
                    115: End: