version 1.1.1.1, 2000/09/09 14:12:38 |
version 1.1.1.2, 2003/08/25 16:06:24 |
|
|
|
Copyright 1999, 2000, 2001 Free Software Foundation, Inc. |
|
|
|
This file is part of the GNU MP Library. |
|
|
|
The GNU MP Library is free software; you can redistribute it and/or modify |
|
it under the terms of the GNU Lesser General Public License as published by |
|
the Free Software Foundation; either version 2.1 of the License, or (at your |
|
option) any later version. |
|
|
|
The GNU MP Library is distributed in the hope that it will be useful, but |
|
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY |
|
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public |
|
License for more details. |
|
|
|
You should have received a copy of the GNU Lesser General Public License |
|
along with the GNU MP Library; see the file COPYING.LIB. If not, write to |
|
the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA |
|
02111-1307, USA. |
|
|
|
|
|
|
|
|
|
|
PPC630 (aka Power3) pipeline information: |
PPC630 (aka Power3) pipeline information: |
|
|
Decoding is 4-way and issue is 8-way with some out-of-order capability. |
Decoding is 4-way and issue is 8-way with some out-of-order capability. |
|
Branches are handled separately, and are not part of the 4-way issue limit. |
|
|
|
Functional units: |
LS1 - ld/st unit 1 |
LS1 - ld/st unit 1 |
LS2 - ld/st unit 2 |
LS2 - ld/st unit 2 |
FXU1 - integer unit 1, handles any simple integer instructions |
FXU1 - integer unit 1, handles any simple integer instruction |
FXU2 - integer unit 2, handles any simple integer instructions |
FXU2 - integer unit 2, handles any simple integer instruction |
FXU3 - integer unit 3, handles integer multiply and divide |
FXU3 - integer unit 3, handles integer multiply and divide |
FPU1 - floating-point unit 1 |
FPU1 - floating-point unit 1 |
FPU2 - floating-point unit 2 |
FPU2 - floating-point unit 2 |
|
|
Memory: Any two memory operations can issue, but memory subsystem |
Memory: Any two memory operations can issue, but memory subsystem |
can sustain just one store per cycle. |
can sustain just one store per cycle. No need for data |
|
prefetch; the hardware has very sophisticated prefetch logic. |
Simple integer: 2 operations (such as add, rl*) |
Simple integer: 2 operations (such as add, rl*) |
Integer multiply: 1 operation every 9th cycle worst case; exact timing depends |
Integer multiply: 1 operation every 9th cycle worst case; exact timing depends |
on 2nd operand most significant bit position (10 bits per |
on 2nd operand most significant bit position (10 bits per |
Line 34 mul: 18 cycles (average) unless floating-point o |
|
Line 61 mul: 18 cycles (average) unless floating-point o |
|
but that would only help for multiplies of perhaps 10 and more |
but that would only help for multiplies of perhaps 10 and more |
limbs. |
limbs. |
addmul/submul:Same situation as for mul. |
addmul/submul:Same situation as for mul. |
|
|
|
|
|
IDEAS |
|
|
|
*mul_1: Handling one limb using mulld/mulhdu and two limbs using |
|
floating-point operations should give a performance of about 20 cycles |
|
for 3 limbs, or 7 cycles/limb. |
|
|
|
We should probably split the single-limb operand in 32-bit chunks, and |
|
the multi-limb operand in 16-bit chunks, allowing us to accumulate |
|
well in fp registers. |
|
|
|
Problem is to get 32-bit or 16-bit words to the fp registers. Only |
|
64-bit fp memops copies bits without fiddling with them. We might |
|
therefore need to load to integer registers with zero extension, store |
|
as 64 bits into temp space, and then load to fp regs. Alternatively, |
|
load directly to fp space and add well-chosen constants to get |
|
cancelation. (Other part after given by subsequent subtraction.) |
|
|
|
Possible code mix for load-via-intregs variant: |
|
|
|
lwz,std,lfd |
|
fmadd,fmadd,fmul,fmul |
|
fctidz,stfd,ld,fctidz,stfd,ld |
|
add,adde |
|
lwz,std,lfd |
|
fmadd,fmadd,fmul,fmul |
|
fctidz,stfd,ld,fctidz,stfd,ld |
|
add,adde |
|
srd,sld,add,adde,add,adde |