OpenXM_contrib/gmp/mpn/pa64/README - diff

Diff for /OpenXM_contrib/gmp/mpn/pa64/Attic/README between version 1.1.1.1 and 1.1.1.2

-version 1.1.1.1, 2000/09/09 14:12:37
+version 1.1.1.2, 2003/08/25 16:06:23
 Line 1
+ Copyright 1999, 2001, 2002 Free Software Foundation, Inc.
+ This file is part of the GNU MP Library.
+ The GNU MP Library is free software; you can redistribute it and/or modify
+ it under the terms of the GNU Lesser General Public License as published by
+ the Free Software Foundation; either version 2.1 of the License, or (at your
+ option) any later version.
+ The GNU MP Library is distributed in the hope that it will be useful, but
+ WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+ or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
+ License for more details.
+ You should have received a copy of the GNU Lesser General Public License
+ along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
+ the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
+ 02111-1307, USA.
  This directory contains mpn functions for 64-bit PA-RISC 2.0.
- RELEVANT OPTIMIZATION ISSUES
+ PIPELINE SUMMARY
- The PA8000 has a multi-issue pipeline with large buffers for instructions
+ The PA8x00 processors have an orthogonal 4-way out-of-order pipeline.  Each
- awaiting pending results.  Therefore, no latency scheduling is necessary
+ cycle two ALU operations and two MEM operations can issue, but just one of the
- (and might actually be harmful).
+ MEM operations may be a store.  The two ALU operations can be almost any
+ combination of non-memory operations.  Unlike every other processor, integer
+ and fp operations are completely equal here; they both count as just ALU
+ operations.
- Two 64-bit loads can be completed per cycle.  One 64-bit store can be
+ Unfortunately, some operations cause hiccups in the pipeline.  Combining
- completed per cycle.  A store cannot complete in the same cycle as a load.
+ carry-consuming operations like ADD,DC with operations that do not set carry
+ like ADD,L causes long delays.  Skip operations also seem to cause hiccups.  If
+ several ADD,DC are issued consecutively, or if a plain carry-generating ADD feeds
+ ADD,DC, stalling does not occur.  We can effectively issue two ADD,DC
+ operations/cycle.
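
The carry chain discussed above can be sketched in portable C.  This is only
an illustration of the dependency that ADD followed by ADD,DC expresses in
assembly, not GMP's code; the name ref_add_n is hypothetical, and limbs are
assumed to be 64 bits and stored least significant first.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routine (not part of GMP): add two n-limb numbers
   and return the carry out (0 or 1).  Each iteration consumes the carry
   produced by the previous one, which is the job ADD,DC does in the
   assembly loops. */
static uint64_t ref_add_n(uint64_t *rp, const uint64_t *up,
                          const uint64_t *vp, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s = up[i] + vp[i];   /* plain add, may wrap around */
        uint64_t c1 = s < up[i];      /* carry generated by the add */
        uint64_t r = s + carry;       /* consume the incoming carry */
        uint64_t c2 = r < s;          /* carry generated by the increment */
        rp[i] = r;
        carry = c1 | c2;              /* both cannot be set at once */
    }
    return carry;
}
```

In C the carry has to be recomputed with comparisons; in the assembly code it
lives in the carry flag, which is why mixing in operations that do not set
carry is costly.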
- STATUS
+ Latency scheduling is not as important as making sure to have a mix of ALU and
+ MEM operations, but for full pipeline utilization, it is still a good idea to
+ do some amount of latency scheduling.
- * mpn_lshift, mpn_rshift, mpn_add_n, mpn_sub_n are all well-tuned and run at
+ As on all other processors, RAW memory scheduling is critically important.
-   the peak cache bandwidth; 1.5 cycles/limb for shifting and 2.0 cycles/limb
+ Since integer multiplication takes place in the floating-point unit, the GMP
-   for add/subtract.
+ code needs to handle this problem frequently.
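
As a rough illustration of the work involved, here is a C sketch that forms a
128-bit product from 64-bit partial products.  It assumes the multiply
primitive is 32x32->64 bits unsigned (as with the FPU's XMPYU instruction,
which this README does not name); the helper name ref_umul_128 is
hypothetical.  The store and reload of the partial products between the
floating-point and integer registers, which is where the RAW memory latency
comes from, cannot be expressed in C and is only marked by a comment.

```c
#include <stdint.h>

/* Hypothetical helper (not part of GMP): 64x64 -> 128-bit unsigned product
   built from four 32x32 -> 64-bit partial products. */
static void ref_umul_128(uint64_t *hi, uint64_t *lo, uint64_t u, uint64_t v)
{
    uint64_t u0 = (uint32_t)u, u1 = u >> 32;
    uint64_t v0 = (uint32_t)v, v1 = v >> 32;

    /* Four partial products.  On PA-RISC these would be computed in the
       FPU, stored to memory, and loaded back into integer registers. */
    uint64_t p00 = u0 * v0;
    uint64_t p01 = u0 * v1;
    uint64_t p10 = u1 * v0;
    uint64_t p11 = u1 * v1;

    /* Sum the middle column; three 32-bit values cannot overflow 64 bits. */
    uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;

    *lo = (mid << 32) | (uint32_t)p00;
    *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}
```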
- * The multiplication functions run at 11 cycles/limb.  The cache bandwidth
+ STATUS
-   allows 7.5 cycles/limb.  Perhaps it would be possible, using unrolling or
-   better scheduling, to get closer to the cache bandwidth limit.
- * xaddmul_1.S contains a quicker method for forming the 128 bit product.  It
+ * mpn_lshift and mpn_rshift run at 1.5 cycles/limb on PA8000 and at 1.0
-   uses somewhat fewer operations, and keeps the carry flag live across the loop
+   cycles/limb on PA8500.  With latency scheduling, the numbers could be
-   boundary.  But it seems hard to make it run more than 1/4 cycle faster
+   improved to 1.0 cycles/limb for all PA8x00 chips.
-   than the old code.  Perhaps we really ought to unroll this loop by 2x?
-   2x should suffice since register latency scheduling is never needed,
-   but the unrolling would hide the store-load latency.  Here is a sketch:
-       1. A multiply and store 64-bit products
+ * mpn_add_n and mpn_sub_n run at 2.0 cycles/limb on PA8000 and at about 1.9
-       2. B sum 64-bit products 128-bit product
+   cycles/limb on PA8500.  With latency scheduling, this could be improved to
-       3. B load  64-bit products to integer registers
+   1.5 cycles/limb.
-       4. B multiply and store 64-bit products
-       5. A sum 64-bit products 128-bit product
-       6. A load  64-bit products to integer registers
-       7. goto 1
-   In practice, adjacent groups (1 and 2, 2 and 3, etc) will be interleaved
+ * mpn_addmul_1 runs at 6.25 cycles/limb.  The current code uses ADD,DC for
-   for better instruction mix.
+   adjacent limbs, and relies heavily on reordering.
+ * Both mpn_mul_1 and mpn_submul_1 run at around 11 cycles/limb.  There is
+   obviously room for improving these along the lines of mpn_addmul_1.
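
The "could be improved to" figures in the STATUS list are consistent with the
issue limits in the pipeline summary (two ALU and two MEM operations per cycle,
at most one of them a store).  The sketch below computes that lower bound.
The function name issue_bound and the per-limb operation counts in the usage
note are assumptions made for illustration; loop overhead and latencies are
ignored.

```c
/* Hypothetical model (not part of GMP): lower bound on cycles per limb
   given the number of ALU operations, loads, and stores per limb. */
static double issue_bound(int alu_ops, int loads, int stores)
{
    double alu = alu_ops / 2.0;            /* two ALU ops per cycle */
    double mem = (loads + stores) / 2.0;   /* two MEM ops per cycle */
    double st = (double)stores;            /* at most one store per cycle */

    double bound = alu;
    if (mem > bound)
        bound = mem;
    if (st > bound)
        bound = st;
    return bound;
}
```

Assuming one ALU operation, two loads, and one store per limb for mpn_add_n,
issue_bound(1, 2, 1) gives 1.5 cycles/limb.  Assuming one load and one store
per limb for the shifts, issue_bound(1, 1, 1) gives 1.0 cycles/limb.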
