OpenXM_contrib/gmp/mpn/alpha/README - annotate

Annotation of OpenXM_contrib/gmp/mpn/alpha/README, Revision 1.1

1.1     ! maekawa     1: This directory contains mpn functions optimized for DEC Alpha processors.
        !             2:
        !             3: RELEVANT OPTIMIZATION ISSUES
        !             4:
        !             5: EV4
        !             6:
        !             7: 1. This chip has very limited store bandwidth.  The on-chip L1 cache is
        !             8: write-through, and a cache line is transferred from the store buffer to the
        !             9: off-chip L2 in as much as 15 cycles on most systems.  This delay hurts
        !            10: mpn_add_n, mpn_sub_n, mpn_lshift, and mpn_rshift.
        !            11:
        !            12: 2. Pairing is possible between memory instructions and integer arithmetic
        !            13: instructions.
        !            14:
        !            15: 3. mulq and umulh are documented to have a latency of 23 cycles, but 2 of
        !            16: these cycles are pipelined.  Thus, multiply instructions can be issued at a
        !            17: rate of one every 21 cycles.
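The multiply figures above can be turned into a rough lower bound for a run of independent multiplies. This is only a sketch under a simple assumed model (the first result appears after the full latency, and each further multiply issues one interval later); the function name `min_mul_cycles` is hypothetical and not part of GMP.

```c
#include <stdint.h>

/* Hypothetical helper: lower bound, in cycles, for n back-to-back
   independent multiplies, assuming the first result is ready after
   `latency` cycles and one more multiply can issue every `interval`
   cycles.  Real code may be slower because of other stalls. */
static uint64_t min_mul_cycles(uint64_t n, uint64_t latency, uint64_t interval)
{
    if (n == 0)
        return 0;
    return latency + (n - 1) * interval;
}
```

With the documented EV4 numbers (latency 23, one issue every 21 cycles), `min_mul_cycles(4, 23, 21)` gives 23 + 3*21 = 86 cycles under this model.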
        !            18:
        !            19: EV5
        !            20:
        !            21: 1. The memory bandwidth of this chip seems excellent, both for loads and
        !            22: stores.  Even when the working set is larger than the on-chip L1 and L2
        !            23: caches, the performance remains almost unaffected.
        !            24:
        !            25: 2. mulq has a measured latency of 13 cycles and an issue rate of one every
        !            26: 8 cycles.  umulh has a measured latency of 15 cycles and an issue rate of
        !            27: one every 10 cycles.  But the exact timing is somewhat confusing.
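For readers unfamiliar with the two instructions: mulq yields the low 64 bits of a 64x64-bit product and umulh yields the high 64 bits of the unsigned product. The following portable C sketch shows the same two results; the names `mul_lo` and `mul_hi` are illustrative only and say nothing about how the Alpha hardware computes them.

```c
#include <stdint.h>

/* Low 64 bits of a*b, the value mulq produces. */
static uint64_t mul_lo(uint64_t a, uint64_t b)
{
    return a * b;   /* unsigned arithmetic wraps modulo 2^64 */
}

/* High 64 bits of the unsigned product a*b, the value umulh produces.
   Computed from four 32x32->64 partial products. */
static uint64_t mul_hi(uint64_t a, uint64_t b)
{
    uint64_t a_lo = a & 0xffffffffu, a_hi = a >> 32;
    uint64_t b_lo = b & 0xffffffffu, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;
    uint64_t p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo;
    uint64_t p3 = a_hi * b_hi;

    /* Middle column: three values below 2^32 each, so no overflow. */
    uint64_t mid = (p0 >> 32) + (p1 & 0xffffffffu) + (p2 & 0xffffffffu);

    return p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}
```

A full two-limb product therefore needs one mulq and one umulh per limb pair, which is why both latencies matter for the multiply loops.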
        !            28:
        !            29: 3. mpn_add_n.  With 4-fold unrolling, we need 37 instructions, whereof 12
        !            30:    are memory operations.  This will take at least
        !            31:        ceil(37/2) [dual issue] + 1 [taken branch] = 20 cycles
        !            32:    We have 12 memory cycles, plus 4 after-store conflict cycles, or 16 data
        !            33:    cache cycles, which should be completely hidden in the 20 issue cycles.
        !            34:    The computation is inherently serial, with these dependencies:
        !            35:      addq
        !            36:      /   \
        !            37:    addq  cmpult
        !            38:      |     |
        !            39:    cmpult  |
        !            40:        \  /
        !            41:         or
        !            42:    I.e., there is a 4-cycle path for each limb, making 16 cycles the absolute
        !            43:    minimum.  We could replace the `or' with a cmoveq/cmovne, which would save
        !            44:    a cycle on EV5, but that might waste a cycle on EV4.  Also, cmov takes 2
        !            45:    cycles.
        !            46:      addq
        !            47:      /   \
        !            48:    addq  cmpult
        !            49:      |      \
        !            50:    cmpult -> cmovne
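The first dependency graph above (addq, cmpult, addq, cmpult, or) can be sketched in C as follows. This is a plain reference loop meant to show where each operation comes from, not the unrolled assembly in this directory; the name `ref_add_n` and the use of `uint64_t` for a limb are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference version of an n-limb add:
   rp[i] = up[i] + vp[i] + carry-in; returns the final carry (0 or 1). */
static uint64_t ref_add_n(uint64_t *rp, const uint64_t *up,
                          const uint64_t *vp, size_t n)
{
    uint64_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s  = up[i] + vp[i];  /* addq:   sum of the two limbs   */
        uint64_t c1 = s < vp[i];      /* cmpult: carry out of that sum  */
        uint64_t r  = s + cy;         /* addq:   add the incoming carry */
        uint64_t c2 = r < cy;         /* cmpult: carry out of that add  */
        cy = c1 | c2;                 /* or:     both cannot be set     */
        rp[i] = r;
    }
    return cy;
}
```

The carry leaving one limb feeds the second addq of the next limb, which is what makes the computation serial across limbs.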
        !            51:
        !            52: STATUS
        !            53: