version 1.1.1.1, 2000/09/09 14:12:37
version 1.1.1.2, 2003/08/25 16:06:23

Copyright 1999, 2001, 2002 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 2.1 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library; see the file COPYING.LIB. If not, write to
the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
02111-1307, USA.

version 1.1.1.1, 2000/09/09 14:12:37

This directory contains mpn functions for 64-bit PA-RISC 2.0.

RELEVANT OPTIMIZATION ISSUES

The PA8000 has a multi-issue pipeline with large buffers for instructions
awaiting pending results. Therefore, no latency scheduling is necessary
(and might actually be harmful).

Two 64-bit loads can be completed per cycle. One 64-bit store can be
completed per cycle. A store cannot complete in the same cycle as a load.

STATUS

* mpn_lshift, mpn_rshift, mpn_add_n, mpn_sub_n are all well-tuned and run at
  the peak cache bandwidth; 1.5 cycles/limb for shifting and 2.0 cycles/limb
  for add/subtract.

* The multiplication functions run at 11 cycles/limb. The cache bandwidth
  allows 7.5 cycles/limb. Perhaps it would be possible, using unrolling or
  better scheduling, to get closer to the cache bandwidth limit.

* xaddmul_1.S contains a quicker method for forming the 128-bit product. It
  uses somewhat fewer operations, and keeps the carry flag live across the
  loop boundary. But it seems hard to make it run more than 1/4 cycle faster
  than the old code. Perhaps we really ought to unroll this loop by 2x?
  2x should suffice since register latency scheduling is never needed,
  but the unrolling would hide the store-load latency. Here is a sketch:

    1. A multiply and store 64-bit products
    2. B sum 64-bit products into 128-bit product
    3. B load 64-bit products to integer registers
    4. B multiply and store 64-bit products
    5. A sum 64-bit products into 128-bit product
    6. A load 64-bit products to integer registers
    7. goto 1

  In practice, adjacent groups (1 and 2, 2 and 3, etc) will be interleaved
  for better instruction mix.


version 1.1.1.2, 2003/08/25 16:06:23

This directory contains mpn functions for 64-bit PA-RISC 2.0.

PIPELINE SUMMARY

The PA8x00 processors have an orthogonal 4-way out-of-order pipeline. Each
cycle two ALU operations and two MEM operations can issue, but just one of the
MEM operations may be a store. The two ALU operations can be almost any
combination of non-memory operations. Unlike on other processors, integer
and fp operations are completely equal here; they both count as just ALU
operations.

Unfortunately, some operations cause hiccups in the pipeline. Combining
carry-consuming operations like ADD,DC with operations that do not set carry
like ADD,L causes long delays. Skip operations also seem to cause hiccups. If
several ADD,DC are issued consecutively, or if plain carry-generating ADDs feed
ADD,DC, stalling does not occur. We can effectively issue two ADD,DC
operations/cycle.

Latency scheduling is not as important as making sure to have a mix of ALU and
MEM operations, but for full pipeline utilization, it is still a good idea to
do some amount of latency scheduling.

As on all other processors, RAW (read-after-write) memory scheduling is
critically important. Since integer multiplication takes place in the
floating-point unit, the GMP code needs to handle this problem frequently.

STATUS

* mpn_lshift and mpn_rshift run at 1.5 cycles/limb on PA8000 and at 1.0
  cycles/limb on PA8500. With latency scheduling, the numbers could be
  improved to 1.0 cycles/limb for all PA8x00 chips.

* mpn_add_n and mpn_sub_n run at 2.0 cycles/limb on PA8000 and at about 1.9
  cycles/limb on PA8500. With latency scheduling, this could be improved to
  1.5 cycles/limb.

* mpn_addmul_1 runs at 6.25 cycles/limb. The current code uses ADD,DC for
  adjacent limbs, and relies heavily on reordering.

* Both mpn_mul_1 and mpn_submul_1 run at around 11 cycles/limb. There is
  obviously room for improving these along the lines of mpn_addmul_1.
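
The issue limits stated in the pipeline summary (two ALU operations and two
MEM operations per cycle, at most one of the MEM operations a store) give a
quick lower bound on cycles/limb for a loop. The C sketch below is only a
back-of-the-envelope aid and not part of GMP: struct op_mix and issue_bound
are hypothetical names, and the per-limb operation counts are assumptions
that the caller supplies. Loop overhead and latency effects are ignored.

```c
#include <assert.h>

/* Hypothetical description of the work done per limb in a loop. */
struct op_mix {
    double alu;    /* non-memory operations (integer or fp) */
    double loads;  /* memory loads */
    double stores; /* memory stores */
};

static double max2(double a, double b) { return a > b ? a : b; }

/* Lower bound on cycles/limb from the issue limits alone:
   2 ALU ops per cycle, 2 MEM ops per cycle, at most 1 store per cycle. */
double issue_bound(struct op_mix m)
{
    assert(m.alu >= 0 && m.loads >= 0 && m.stores >= 0);
    double alu_cycles   = m.alu / 2.0;
    double mem_cycles   = (m.loads + m.stores) / 2.0;
    double store_cycles = m.stores;
    return max2(alu_cycles, max2(mem_cycles, store_cycles));
}
```

Assuming one load, one store and one ALU operation per limb for a shift, the
bound is 1.0 cycles/limb; assuming two loads, one store and one ALU operation
per limb for add/subtract, it is 1.5 cycles/limb. Both agree with the targets
quoted in STATUS.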
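
The version 1.1.1.1 notes speak of multiplying into 64-bit products and then
summing those into a 128-bit product. A minimal C sketch of that summation
follows, assuming the 64-bit products are the four 32x32-bit partial products
of two 64-bit limbs. umul128 is a hypothetical name, and this is portable C
for illustration only, not the assembly code in this directory.

```c
#include <assert.h>
#include <stdint.h>

/* 64x64 -> 128-bit unsigned product, built from four 32x32 -> 64-bit
   partial products.  Hypothetical helper, for illustration only. */
void umul128(uint64_t u, uint64_t v, uint64_t *hi, uint64_t *lo)
{
    assert(hi != 0 && lo != 0);

    uint64_t u0 = u & 0xffffffffu, u1 = u >> 32;
    uint64_t v0 = v & 0xffffffffu, v1 = v >> 32;

    /* The four 64-bit partial products. */
    uint64_t p00 = u0 * v0;
    uint64_t p01 = u0 * v1;
    uint64_t p10 = u1 * v0;
    uint64_t p11 = u1 * v1;

    /* Middle column: high half of p00 plus the low halves of the cross
       products.  The sum is below 2^34, so it cannot overflow 64 bits. */
    uint64_t mid = (p00 >> 32) + (p01 & 0xffffffffu) + (p10 & 0xffffffffu);

    *lo = (mid << 32) | (p00 & 0xffffffffu);
    /* High limb: p11 plus the high halves of the cross products plus the
       carry out of the middle column. */
    *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}
```

An addmul-style loop would add such a product into the destination limb and
pass the high limb on as carry to the next limb.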