=================================================================== RCS file: /home/cvs/OpenXM_contrib/gmp/mpn/powerpc64/Attic/README,v retrieving revision 1.1 retrieving revision 1.1.1.2 diff -u -p -r1.1 -r1.1.1.2 --- OpenXM_contrib/gmp/mpn/powerpc64/Attic/README 2000/09/09 14:12:38 1.1 +++ OpenXM_contrib/gmp/mpn/powerpc64/Attic/README 2003/08/25 16:06:24 1.1.1.2 @@ -1,16 +1,43 @@ +Copyright 1999, 2000, 2001 Free Software Foundation, Inc. + +This file is part of the GNU MP Library. + +The GNU MP Library is free software; you can redistribute it and/or modify +it under the terms of the GNU Lesser General Public License as published by +the Free Software Foundation; either version 2.1 of the License, or (at your +option) any later version. + +The GNU MP Library is distributed in the hope that it will be useful, but +WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY +or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public +License for more details. + +You should have received a copy of the GNU Lesser General Public License +along with the GNU MP Library; see the file COPYING.LIB. If not, write to +the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA +02111-1307, USA. + + + + + PPC630 (aka Power3) pipeline information: Decoding is 4-way and issue is 8-way with some out-of-order capability. +Branches are handled separately, and are not part of the 4-way issue limit. + +Functional units: LS1 - ld/st unit 1 LS2 - ld/st unit 2 -FXU1 - integer unit 1, handles any simple integer instructions -FXU2 - integer unit 2, handles any simple integer instructions +FXU1 - integer unit 1, handles any simple integer instruction +FXU2 - integer unit 2, handles any simple integer instruction FXU3 - integer unit 3, handles integer multiply and divide FPU1 - floating-point unit 1 FPU2 - floating-point unit 2 Memory: Any two memory operations can issue, but memory subsystem - can sustain just one store per cycle. + can sustain just one store per cycle. No need for data + prefetch; the hardware has very sophisticated prefetch logic. Simple integer: 2 operations (such as add, rl*) Integer multiply: 1 operation every 9th cycle worst case; exact timing depends on 2nd operand most significant bit position (10 bits per @@ -34,3 +61,33 @@ mul: 18 cycles (average) unless floating-point o but that would only help for multiplies of perhaps 10 and more limbs. addmul/submul:Same situation as for mul. + + +IDEAS + +*mul_1: Handling one limb using mulld/mulhdu and two limbs using +floating-point operations should give a performance of about 20 cycles +for 3 limbs, or 7 cycles/limb. + +We should probably split the single-limb operand in 32-bit chunks, and +the multi-limb operand in 16-bit chunks, allowing us to accumulate +well in fp registers. + +Problem is to get 32-bit or 16-bit words to the fp registers. Only +64-bit fp memops copies bits without fiddling with them. We might +therefore need to load to integer registers with zero extension, store +as 64 bits into temp space, and then load to fp regs. Alternatively, +load directly to fp space and add well-chosen constants to get +cancelation. (Other part after given by subsequent subtraction.) + +Possible code mix for load-via-intregs variant: + +lwz,std,lfd +fmadd,fmadd,fmul,fmul +fctidz,stfd,ld,fctidz,stfd,ld +add,adde +lwz,std,lfd +fmadd,fmadd,fmul,fmul +fctidz,stfd,ld,fctidz,stfd,ld +add,adde +srd,sld,add,adde,add,adde