===================================================================
RCS file: /home/cvs/OpenXM_contrib/gmp/mpn/powerpc64/Attic/README,v
retrieving revision 1.1.1.1
retrieving revision 1.1.1.2
diff -u -p -r1.1.1.1 -r1.1.1.2
--- OpenXM_contrib/gmp/mpn/powerpc64/Attic/README	2000/09/09 14:12:38	1.1.1.1
+++ OpenXM_contrib/gmp/mpn/powerpc64/Attic/README	2003/08/25 16:06:24	1.1.1.2
@@ -1,16 +1,43 @@
+Copyright 1999, 2000, 2001 Free Software Foundation, Inc.
+
+This file is part of the GNU MP Library.
+
+The GNU MP Library is free software; you can redistribute it and/or modify
+it under the terms of the GNU Lesser General Public License as published by
+the Free Software Foundation; either version 2.1 of the License, or (at your
+option) any later version.
+
+The GNU MP Library is distributed in the hope that it will be useful, but
+WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
+License for more details.
+
+You should have received a copy of the GNU Lesser General Public License
+along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
+the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
+02111-1307, USA.
+
+
+
+
+
 PPC630 (aka Power3) pipeline information:
 
 Decoding is 4-way and issue is 8-way with some out-of-order capability.
+Branches are handled separately, and do not count toward the 4-way decode limit.
+
+Functional units:
 LS1  - ld/st unit 1
 LS2  - ld/st unit 2
-FXU1 - integer unit 1, handles any simple integer instructions
-FXU2 - integer unit 2, handles any simple integer instructions
+FXU1 - integer unit 1, handles any simple integer instruction
+FXU2 - integer unit 2, handles any simple integer instruction
 FXU3 - integer unit 3, handles integer multiply and divide
 FPU1 - floating-point unit 1
 FPU2 - floating-point unit 2
 
 Memory:		  Any two memory operations can issue, but memory subsystem
-		  can sustain just one store per cycle.
+		  can sustain just one store per cycle.  No need for data
+		  prefetch; the hardware has very sophisticated prefetch logic.
 Simple integer:	  2 operations (such as add, rl*)
 Integer multiply: 1 operation every 9th cycle worst case; exact timing depends
 		  on 2nd operand most significant bit position (10 bits per
@@ -34,3 +61,33 @@ mul:	      18 cycles (average) unless floating-point o
 	      but that would only help for multiplies of perhaps 10 and more
 	      limbs.
 addmul/submul:Same situation as for mul.
+
+
+IDEAS
+
+*mul_1: Handling one limb using mulld/mulhdu and two limbs using
+floating-point operations should give a performance of about 20 cycles
+for 3 limbs, or 7 cycles/limb.
+
+We should probably split the single-limb operand into 32-bit chunks and
+the multi-limb operand into 16-bit chunks; the 48-bit partial products
+then leave room to accumulate exactly in the 53-bit fp mantissa.
+
+The problem is getting 32-bit or 16-bit words into the fp registers.  Only
+64-bit fp memops copy bits without fiddling with them.  We might
+therefore need to load to integer registers with zero extension, store
+as 64 bits into temp space, and then load to fp regs.  Alternatively,
+load directly to fp space and add well-chosen constants to get
+cancellation.  (The other part is then given by a subsequent subtraction.)
+
+Possible code mix for load-via-intregs variant:
+
+lwz,std,lfd
+fmadd,fmadd,fmul,fmul
+fctidz,stfd,ld,fctidz,stfd,ld
+add,adde
+lwz,std,lfd
+fmadd,fmadd,fmul,fmul
+fctidz,stfd,ld,fctidz,stfd,ld
+add,adde
+srd,sld,add,adde,add,adde
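
The chunk-splitting idea under IDEAS can be sketched in portable C.  This is
only an illustration, not GMP code: the names mul_1_fp_sketch and add_shifted
are hypothetical, plain C conversions stand in for the lwz/std/lfd and
fctidz/stfd/ld sequences, and no attempt is made to match the cycle counts
discussed above.  It assumes IEEE double precision (53-bit mantissa), so each
16-bit by 32-bit partial product, and each sum of two of them, is exact.

```c
#include <stddef.h>
#include <stdint.h>

/* Add t << sh into the 128-bit accumulator hi:lo.
   Assumes sh is one of 0, 16, ..., 80 and that the result fits. */
static void add_shifted(uint64_t *hi, uint64_t *lo, uint64_t t, unsigned sh)
{
    uint64_t tlo, thi;
    if (sh == 0)      { tlo = t;       thi = 0; }
    else if (sh < 64) { tlo = t << sh; thi = t >> (64 - sh); }
    else              { tlo = 0;       thi = t << (sh - 64); }
    *lo += tlo;
    *hi += thi + (*lo < tlo);          /* carry out of the low word */
}

/* rp[0..n-1] = up[0..n-1] * v; returns the carry-out limb. */
uint64_t mul_1_fp_sketch(uint64_t *rp, const uint64_t *up, size_t n,
                         uint64_t v)
{
    /* Single-limb operand in two 32-bit chunks. */
    double v0 = (double)(v & 0xffffffffu);
    double v1 = (double)(v >> 32);
    uint64_t carry = 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t u = up[i];
        /* Multi-limb operand in four 16-bit chunks. */
        double u0 = (double)(u & 0xffff);
        double u1 = (double)((u >> 16) & 0xffff);
        double u2 = (double)((u >> 32) & 0xffff);
        double u3 = (double)(u >> 48);

        /* c[k] collects the partial products of weight 2^(16k).
           Every entry is below 2^49, hence exact in a double. */
        double c[6];
        c[0] = u0 * v0;
        c[1] = u1 * v0;
        c[2] = u2 * v0 + u0 * v1;
        c[3] = u3 * v0 + u1 * v1;
        c[4] = u2 * v1;
        c[5] = u3 * v1;

        /* Back to integers, then shift and add with carry. */
        uint64_t hi = 0, lo = carry;
        for (unsigned k = 0; k < 6; k++)
            add_shifted(&hi, &lo, (uint64_t)c[k], 16 * k);

        rp[i] = lo;
        carry = hi;
    }
    return carry;
}
```

The two-term sums in c[2] and c[3] correspond to the fmadd entries in the
code mix above, and the single products to the fmul entries; whether a
compiler actually emits fused operations here is not guaranteed.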