===================================================================
RCS file: /home/cvs/OpenXM_contrib/gmp/mpn/pa64/Attic/README,v
retrieving revision 1.1.1.1
retrieving revision 1.1.1.2
diff -u -p -r1.1.1.1 -r1.1.1.2
--- OpenXM_contrib/gmp/mpn/pa64/Attic/README	2000/09/09 14:12:37	1.1.1.1
+++ OpenXM_contrib/gmp/mpn/pa64/Attic/README	2003/08/25 16:06:23	1.1.1.2
@@ -1,38 +1,63 @@
+Copyright 1999, 2001, 2002 Free Software Foundation, Inc.
+
+This file is part of the GNU MP Library.
+
+The GNU MP Library is free software; you can redistribute it and/or modify
+it under the terms of the GNU Lesser General Public License as published by
+the Free Software Foundation; either version 2.1 of the License, or (at your
+option) any later version.
+
+The GNU MP Library is distributed in the hope that it will be useful, but
+WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
+License for more details.
+
+You should have received a copy of the GNU Lesser General Public License
+along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
+the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
+02111-1307, USA.
+
+
+
+
 This directory contains mpn functions for 64-bit PA-RISC 2.0.
 
-RELEVANT OPTIMIZATION ISSUES
+PIPELINE SUMMARY
 
-The PA8000 has a multi-issue pipeline with large buffers for instructions
-awaiting pending results.  Therefore, no latency scheduling is necessary
-(and might actually be harmful).
+The PA8x00 processors have an orthogonal 4-way out-of-order pipeline.  Each
+cycle two ALU operations and two MEM operations can issue, but just one of the
+MEM operations may be a store.  The two ALU operations can be almost any
+combination of non-memory operations.  Unlike every other processor, integer
+and fp operations are completely equal here; they both count as just ALU
+operations.
 
-Two 64-bit loads can be completed per cycle.  One 64-bit store can be
-completed per cycle.  A store cannot complete in the same cycle as a load.
+Unfortunately, some operations cause hiccups in the pipeline.  Combining
+carry-consuming operations like ADD,DC with operations that do not set carry
+like ADD,L causes long delays.  Skip operations also seem to cause hiccups.
+If several ADD,DC are issued consecutively, or if a plain carry-generating
+ADD feeds ADD,DC, stalling does not occur.  We can effectively issue two
+ADD,DC operations/cycle.
 
-STATUS
+Latency scheduling is not as important as making sure to have a mix of ALU and
+MEM operations, but for full pipeline utilization, it is still a good idea to
+do some amount of latency scheduling.
 
-* mpn_lshift, mpn_rshift, mpn_add_n, mpn_sub_n are all well-tuned and run at
-  the peak cache bandwidth; 1.5 cycles/limb for shifting and 2.0 cycles/limb
-  for add/subtract.
+As on all other processors, RAW (read-after-write) memory scheduling is
+critically important.  Since integer multiplication takes place in the
+floating-point unit, the GMP code needs to handle this problem frequently.
 
-* The multiplication functions run at 11 cycles/limb.  The cache bandwidth
-  allows 7.5 cycles/limb.  Perhaps it would be possible, using unrolling or
-  better scheduling, to get closer to the cache bandwidth limit.
+STATUS
 
-* xaddmul_1.S contains a quicker method for forming the 128 bit product.  It
-  uses some fewer operations, and keep the carry flag live across the loop
-  boundary.  But it seems hard to make it run more than 1/4 cycle faster
-  than the old code.  Perhaps we really ought to unroll this loop be 2x?
-  2x should suffice since register latency schedling is never needed,
-  but the unrolling would hide the store-load latency.  Here is a sketch:
+* mpn_lshift and mpn_rshift run at 1.5 cycles/limb on PA8000 and at 1.0
+  cycles/limb on PA8500.  With latency scheduling, the numbers could be
+  improved to 1.0 cycles/limb for all PA8x00 chips.
 
-	1. A multiply and store 64-bit products
-	2. B sum 64-bit products 128-bit product
-	3. B load  64-bit products to integer registers
-	4. B multiply and store 64-bit products
-	5. A sum 64-bit products 128-bit product
-	6. A load  64-bit products to integer registers
-	7. goto 1
+* mpn_add_n and mpn_sub_n run at 2.0 cycles/limb on PA8000 and at about 1.9
+  cycles/limb on PA8500.  With latency scheduling, this could be improved to
+  1.5 cycles/limb.
 
-  In practice, adjacent groups (1 and 2, 2 and 3, etc) will be interleaved
-  for better instruction mix.
+* mpn_addmul_1 runs at 6.25 cycles/limb.  The current code uses ADD,DC for
+  adjacent limbs, and relies heavily on reordering.
+
+* Both mpn_mul_1 and mpn_submul_1 run at around 11 cycles/limb.  There is
+  obviously room for improving these along the lines of mpn_addmul_1.
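
The carry chain that the pipeline summary describes (a plain carry-generating
ADD feeding consecutive ADD,DC operations) can be illustrated with a portable C
sketch of limb-wise addition.  This is not the pa64 assembly and not GMP's own
code; the name ref_add_n and the use of uint64_t limbs are assumptions made
for illustration only.  Each loop iteration corresponds roughly to what a
single ADD,DC does in hardware: add two limbs plus the incoming carry and
produce the outgoing carry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routine (not GMP's implementation):
   rp[0..n-1] = up[0..n-1] + vp[0..n-1], least significant limb first.
   Returns the carry out of the most significant limb (0 or 1). */
static uint64_t ref_add_n(uint64_t *rp, const uint64_t *up,
                          const uint64_t *vp, size_t n)
{
    assert(n == 0 || (rp != NULL && up != NULL && vp != NULL));

    uint64_t carry = 0;              /* plays the role of the carry flag */
    for (size_t i = 0; i < n; i++) {
        uint64_t s  = up[i] + vp[i]; /* limb sum, may wrap around */
        uint64_t c1 = s < up[i];     /* 1 if the limb sum wrapped */
        uint64_t r  = s + carry;     /* fold in the incoming carry */
        uint64_t c2 = r < s;         /* 1 if adding the carry wrapped */
        rp[i] = r;
        carry = c1 | c2;             /* at most one of c1, c2 can be set */
    }
    return carry;
}
```

In C the carry has to be recomputed with comparisons on every limb, whereas
the assembly keeps it in the carry flag across ADD,DC operations, which is why
the README cares about not interleaving instructions such as ADD,L that break
that chain.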