===================================================================
RCS file: /home/cvs/OpenXM_contrib/gmp/mpn/pa64/Attic/README,v
retrieving revision 1.1
retrieving revision 1.1.1.2
diff -u -p -r1.1 -r1.1.1.2
--- OpenXM_contrib/gmp/mpn/pa64/Attic/README 2000/09/09 14:12:37 1.1
+++ OpenXM_contrib/gmp/mpn/pa64/Attic/README 2003/08/25 16:06:23 1.1.1.2
@@ -1,38 +1,63 @@
+Copyright 1999, 2001, 2002 Free Software Foundation, Inc.
+
+This file is part of the GNU MP Library.
+
+The GNU MP Library is free software; you can redistribute it and/or modify
+it under the terms of the GNU Lesser General Public License as published by
+the Free Software Foundation; either version 2.1 of the License, or (at your
+option) any later version.
+
+The GNU MP Library is distributed in the hope that it will be useful, but
+WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public
+License for more details.
+
+You should have received a copy of the GNU Lesser General Public License
+along with the GNU MP Library; see the file COPYING.LIB. If not, write to
+the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
+02111-1307, USA.
+
+
+
+
 This directory contains mpn functions for 64-bit PA-RISC 2.0.
 
-RELEVANT OPTIMIZATION ISSUES
+PIPELINE SUMMARY
 
-The PA8000 has a multi-issue pipeline with large buffers for instructions
-awaiting pending results. Therefore, no latency scheduling is necessary
-(and might actually be harmful).
+The PA8x00 processors have an orthogonal 4-way out-of-order pipeline. Each
+cycle two ALU operations and two MEM operations can issue, but just one of the
+MEM operations may be a store. The two ALU operations can be almost any
+combination of non-memory operations. Unlike every other processor, integer
+and fp operations are completely equal here; they both count as just ALU
+operations.
 
-Two 64-bit loads can be completed per cycle. One 64-bit store can be
-completed per cycle. A store cannot complete in the same cycle as a load.
+Unfortunately, some operations cause hiccups in the pipeline. Combining
+carry-consuming operations like ADD,DC with operations that do not set carry
+like ADD,L causes long delays. Skip operations also seem to cause hiccups. If
+several ADD,DC are issued consecutively, or if plain carry-generating ADD feed
+ADD,DC, stalling does not occur. We can effectively issue two ADD,DC
+operations/cycle.
 
-STATUS
+Latency scheduling is not as important as making sure to have a mix of ALU and
+MEM operations, but for full pipeline utilization, it is still a good idea to
+do some amount of latency scheduling.
 
-* mpn_lshift, mpn_rshift, mpn_add_n, mpn_sub_n are all well-tuned and run at
-  the peak cache bandwidth; 1.5 cycles/limb for shifting and 2.0 cycles/limb
-  for add/subtract.
+Like for all other processors, RAW memory scheduling is critically important.
+Since integer multiplication takes place in the floating-point unit, the GMP
+code needs to handle this problem frequently.
 
-* The multiplication functions run at 11 cycles/limb. The cache bandwidth
-  allows 7.5 cycles/limb. Perhaps it would be possible, using unrolling or
-  better scheduling, to get closer to the cache bandwidth limit.
+STATUS
 
-* xaddmul_1.S contains a quicker method for forming the 128 bit product. It
-  uses some fewer operations, and keep the carry flag live across the loop
-  boundary. But it seems hard to make it run more than 1/4 cycle faster
-  than the old code. Perhaps we really ought to unroll this loop be 2x?
-  2x should suffice since register latency schedling is never needed,
-  but the unrolling would hide the store-load latency. Here is a sketch:
+* mpn_lshift and mpn_rshift run at 1.5 cycles/limb on PA8000 and at 1.0
+  cycles/limb on PA8500. With latency scheduling, the numbers could be
+  improved to 1.0 cycles/limb for all PA8x00 chips.
 
-  1. A multiply and store 64-bit products
-  2. B sum 64-bit products 128-bit product
-  3. B load 64-bit products to integer registers
-  4. B multiply and store 64-bit products
-  5. A sum 64-bit products 128-bit product
-  6. A load 64-bit products to integer registers
-  7. goto 1
+* mpn_add_n and mpn_sub_n run at 2.0 cycles/limb on PA8000 and at about 1.9
+  cycles/limb on PA8500. With latency scheduling, this could be improved to
+  1.5 cycles/limb.
 
-  In practice, adjacent groups (1 and 2, 2 and 3, etc) will be interleaved
-  for better instruction mix.
+* mpn_addmul_1 runs at 6.25 cycles/limb. The current code uses ADD,DC for
+  adjacent limbs, and relies heavily on reordering.
+
+* Both mpn_mul_1 and mpn_submul_1 run at around 11 cycles/limb. There is
+  obviously room for improving these along the lines of mpn_addmul_1.