Diff for /OpenXM_contrib/gmp/mpn/pa64/Attic/README between version 1.1.1.1 and 1.1.1.2

version 1.1.1.1, 2000/09/09 14:12:37
Line 1

This directory contains mpn functions for 64-bit PA-RISC 2.0.

RELEVANT OPTIMIZATION ISSUES

The PA8000 has a multi-issue pipeline with large buffers for instructions
awaiting pending results.  Therefore, no latency scheduling is necessary
(and might actually be harmful).

Two 64-bit loads can be completed per cycle.  One 64-bit store can be
completed per cycle.  A store cannot complete in the same cycle as a load.

STATUS

* mpn_lshift, mpn_rshift, mpn_add_n, mpn_sub_n are all well-tuned and run at
  the peak cache bandwidth; 1.5 cycles/limb for shifting and 2.0 cycles/limb
  for add/subtract.

* The multiplication functions run at 11 cycles/limb.  The cache bandwidth
  allows 7.5 cycles/limb.  Perhaps it would be possible, using unrolling or
  better scheduling, to get closer to the cache bandwidth limit.

* xaddmul_1.S contains a quicker method for forming the 128 bit product.  It
  uses some fewer operations, and keep the carry flag live across the loop
  boundary.  But it seems hard to make it run more than 1/4 cycle faster
  than the old code.  Perhaps we really ought to unroll this loop be 2x?
  2x should suffice since register latency schedling is never needed,
  but the unrolling would hide the store-load latency.  Here is a sketch:

        1. A multiply and store 64-bit products
        2. B sum 64-bit products 128-bit product
        3. B load  64-bit products to integer registers
        4. B multiply and store 64-bit products
        5. A sum 64-bit products 128-bit product
        6. A load  64-bit products to integer registers
        7. goto 1

  In practice, adjacent groups (1 and 2, 2 and 3, etc) will be interleaved
  for better instruction mix.


version 1.1.1.2, 2003/08/25 16:06:23
Line 1

Copyright 1999, 2001, 2002 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 2.1 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
02111-1307, USA.




This directory contains mpn functions for 64-bit PA-RISC 2.0.

PIPELINE SUMMARY

The PA8x00 processors have an orthogonal 4-way out-of-order pipeline.  Each
cycle two ALU operations and two MEM operations can issue, but just one of the
MEM operations may be a store.  The two ALU operations can be almost any
combination of non-memory operations.  Unlike every other processor, integer
and fp operations are completely equal here; they both count as just ALU
operations.

Unfortunately, some operations cause hickups in the pipeline.  Combining
carry-consuming operations like ADD,DC with operations that does not set carry
like ADD,L cause long delays.  Skip operations also seem to cause hickups.  If
several ADD,DC are issued consecutively, or if plain carry-generating ADD feed
ADD,DC, stalling does not occur.  We can effectively issue two ADD,DC
operations/cycle.

Latency scheduling is not as important as making sure to have a mix of ALU and
MEM operations, but for full pipeline utilization, it is still a good idea to
do some amount of latency scheduling.

Like for all other processors, RAW memory scheduling is critically important.
Since integer multiplication takes place in the floating-point unit, the GMP
code needs to handle this problem frequently.

STATUS

* mpn_lshift and mpn_rshift run at 1.5 cycles/limb on PA8000 and at 1.0
  cycles/limb on PA8500.  With latency scheduling, the numbers could be
  improved to 1.0 cycles/limb for all PA8x00 chips.

* mpn_add_n and mpn_sub_n run at 2.0 cycles/limb on PA8000 and at about 1.9
  cycles/limb on PA8500.  With latency scheduling, this could be improved to
  1.5 cycles/limb.

* The mpn_addmul_1 run at 6.25 cycles/limb.  The current code uses ADD,DC for
  adjacent limbs, and relies heavily on reordering.

* Both mpn_mul_1 and mpn_submul_1 run at around 11 cycles/limb.  There is
  obviously room for improving these along the lines of mpn_addmul_1.
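
Both revisions talk about forming a 128-bit product and chaining carries
across adjacent limbs.  As a rough illustration only, the portable C sketch
below shows that arithmetic.  It assumes a 32x32->64-bit multiply as the
only multiply primitive; the names umul_64x64 and ref_addmul_1 are
hypothetical and are not taken from the GMP sources, and the code says
nothing about how the assembly schedules the work.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: 128-bit product of two 64-bit limbs built from
   four 32x32->64 partial products (assumed multiply primitive). */
static void umul_64x64(uint64_t u, uint64_t v, uint64_t *hi, uint64_t *lo)
{
    uint64_t u0 = u & 0xffffffffu, u1 = u >> 32;
    uint64_t v0 = v & 0xffffffffu, v1 = v >> 32;

    uint64_t p00 = u0 * v0;   /* weight 2^0  */
    uint64_t p01 = u0 * v1;   /* weight 2^32 */
    uint64_t p10 = u1 * v0;   /* weight 2^32 */
    uint64_t p11 = u1 * v1;   /* weight 2^64 */

    /* Middle column: three values below 2^32, so no 64-bit overflow. */
    uint64_t mid = (p00 >> 32) + (p01 & 0xffffffffu) + (p10 & 0xffffffffu);

    *lo = (mid << 32) | (p00 & 0xffffffffu);
    *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}

/* Hypothetical reference loop: rp[0..n-1] += up[0..n-1] * v, returning
   the carry-out limb.  Each comparison recovers the carry bit that an
   add-with-carry instruction would consume directly. */
static uint64_t ref_addmul_1(uint64_t *rp, const uint64_t *up, size_t n,
                             uint64_t v)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t hi, lo;
        umul_64x64(up[i], v, &hi, &lo);
        lo += carry;
        hi += lo < carry;     /* carry out of lo + carry */
        uint64_t r = rp[i] + lo;
        hi += r < lo;         /* carry out of rp[i] + lo */
        rp[i] = r;
        carry = hi;           /* feeds the next, adjacent limb */
    }
    return carry;
}
```

The high limb cannot overflow, because up[i]*v + rp[i] + carry is at most
2^128 - 1 for 64-bit limbs.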

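The "could be improved to" figures for the shift and add/subtract functions
in the 1.1.1.2 STATUS list are consistent with a simple memory-issue bound.
The sketch below works through that arithmetic.  It assumes the issue rules
stated in the 1.1.1.2 pipeline summary (two MEM operations per cycle, at
most one of them a store) and assumes per-limb memory traffic of one load
plus one store for a shift and two loads plus one store for add/subtract.
The function name is hypothetical.

```c
/* Lower bound in cycles/limb from memory issue limits alone.
   Assumed model: at most two MEM operations issue per cycle, and at
   most one of those may be a store. */
static double mem_bound_cycles_per_limb(int loads, int stores)
{
    double by_total = (loads + stores) / 2.0;  /* two MEM ops per cycle */
    double by_store = (double) stores;         /* one store per cycle   */
    return by_total > by_store ? by_total : by_store;
}
```

Under these assumptions, mem_bound_cycles_per_limb(1, 1) gives 1.0 for
mpn_lshift and mpn_rshift, and mem_bound_cycles_per_limb(2, 1) gives 1.5 for
mpn_add_n and mpn_sub_n.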