OpenXM_contrib/gmp/mpn/alpha/README - annotate

Return to README CVS log
Up to [local] / OpenXM_contrib / gmp / mpn / alpha
Annotation of OpenXM_contrib/gmp/mpn/alpha/README, Revision 1.1.1.2

1.1       maekawa     1: This directory contains mpn functions optimized for DEC Alpha processors.
                      2:
1.1.1.2 ! maekawa     3: ALPHA ASSEMBLY RULES AND REGULATIONS
        !             4:
        !             5: The `.prologue N' pseudo op marks the end of instruction that needs
        !             6: special handling by unwinding.  It also says whether $27 is really
        !             7: needed for computing the gp.  The `.mask M' pseudo op says which
        !             8: registers are saved on the stack, and at what offset in the frame.
        !             9:
        !            10: Cray code is very very different...
        !            11:
        !            12:
1.1       maekawa    13: RELEVANT OPTIMIZATION ISSUES
                     14:
                     15: EV4
                     16:
                     17: 1. This chip has very limited store bandwidth.  The on-chip L1 cache is
1.1.1.2 ! maekawa    18:    write-through, and a cache line is transfered from the store buffer to
        !            19:    the off-chip L2 in as much 15 cycles on most systems.  This delay hurts
        !            20:    mpn_add_n, mpn_sub_n, mpn_lshift, and mpn_rshift.
1.1       maekawa    21:
                     22: 2. Pairing is possible between memory instructions and integer arithmetic
1.1.1.2 ! maekawa    23:    instructions.
1.1       maekawa    24:
1.1.1.2 ! maekawa    25: 3. mulq and umulh are documented to have a latency of 23 cycles, but 2 of
        !            26:    these cycles are pipelined.  Thus, multiply instructions can be issued at
        !            27:    a rate of one each 21st cycle.
1.1       maekawa    28:
                     29: EV5
                     30:
                     31: 1. The memory bandwidth of this chip seems excellent, both for loads and
1.1.1.2 ! maekawa    32:    stores.  Even when the working set is larger than the on-chip L1 and L2
        !            33:    caches, the performance remain almost unaffected.
1.1       maekawa    34:
1.1.1.2 ! maekawa    35: 2. mulq has a latency of 12 cycles and an issue rate of 1 each 8th cycle.
        !            36:    umulh has a measured latency of 14 cycles and an issue rate of 1 each
        !            37:    10th cycle.  But the exact timing is somewhat confusing.
1.1       maekawa    38:
                     39: 3. mpn_add_n.  With 4-fold unrolling, we need 37 instructions, whereof 12
                     40:    are memory operations.  This will take at least
1.1.1.2 ! maekawa    41:        ceil(37/2) [dual issue] + 1 [taken branch] = 19 cycles
1.1       maekawa    42:    We have 12 memory cycles, plus 4 after-store conflict cycles, or 16 data
1.1.1.2 ! maekawa    43:    cache cycles, which should be completely hidden in the 19 issue cycles.
1.1       maekawa    44:    The computation is inherently serial, with these dependencies:
1.1.1.2 ! maekawa    45:
        !            46:               ldq  ldq
        !            47:                 \  /\
        !            48:          (or)   addq |
        !            49:           |\   /   \ |
        !            50:           | addq  cmpult
        !            51:            \  |     |
        !            52:             cmpult  |
        !            53:                 \  /
        !            54:                  or
        !            55:
        !            56:    I.e., 3 operations are needed between carry-in and carry-out, making 12
        !            57:    cycles the absolute minimum for the 4 limbs.  We could replace the `or'
        !            58:    with a cmoveq/cmovne, which could issue one cycle earlier that the `or',
        !            59:    but that might waste a cycle on EV4.  The total depth remain unaffected,
        !            60:    since cmov has a latency of 2 cycles.
        !            61:
1.1       maekawa    62:      addq
                     63:      /   \
                     64:    addq  cmpult
                     65:      |      \
                     66:    cmpult -> cmovne
                     67:
1.1.1.2 ! maekawa    68: Montgomery has a slightly different way of computing carry that requires one
        !            69: less instruction, but has depth 4 (instead of the current 3).  Since the
        !            70: code is currently instruction issue bound, Montgomery's idea should save us
        !            71: 1/2 cycle per limb, or bring us down to a total of 17 cycles or 4.25
        !            72: cycles/limb.  Unfortunately, this method will not be good for the EV6.
        !            73:
        !            74: EV6
        !            75:
        !            76: Here we have a really parallel pipeline, capable of issuing up to 4 integer
        !            77: instructions per cycle.  One integer multiply instruction can issue each
        !            78: cycle.  To get optimal speed, we need to pretend we are vectorizing the code,
        !            79: i.e., minimize the iterative dependencies.
        !            80:
        !            81: There are two dependencies to watch out for.  1) Address arithmetic
        !            82: dependencies, and 2) carry propagation dependencies.
        !            83:
        !            84: We can avoid serializing due to address arithmetic by unrolling the loop, so
        !            85: that addresses don't depend heavily on an index variable.  Avoiding
        !            86: serializing because of carry propagation is trickier; the ultimate performance
        !            87: of the code will be determined of the number of latency cycles it takes from
        !            88: accepting carry-in to a vector point until we can generate carry-out.
        !            89:
        !            90: Most integer instructions can execute in either the L0, U0, L1, or U1
        !            91: pipelines.  Shifts only execute in U0 and U1, and multiply only in U1.
        !            92:
        !            93: CMOV instructions split into two internal instructions, CMOV1 and CMOV2, but
        !            94: the execute efficiently.  But CMOV split the mapping process (see pg 2-26 in
        !            95: cmpwrgd.pdf), suggesting the CMOV should always be placed as the last
        !            96: instruction of an aligned 4 instruction block (?).
        !            97:
        !            98: Perhaps the most important issue is the latency between the L0/U0 and L1/U1
        !            99: clusters; a result obtained on either cluster has an extra cycle of latency
        !           100: for consumers in the opposite cluster.  Because of the dynamic nature of the
        !           101: implementation, it is hard to predict where an instruction will execute.
        !           102:
        !           103: The shift loops need (per limb):
        !           104:     1 load (Lx pipes)
        !           105:     1 store (Lx pipes)
        !           106:     2 shift (Ux pipes)
        !           107:     1 iaddlog (Lx pipes, Ux pipes)
        !           108: Obviously, since the pipes are very equally loaded, we should get 4 insn/cycle, or 1.25 cycles/limb.
        !           109:
        !           110: For mpn_add_n, we currently have
        !           111:     2 load (Lx pipes)
        !           112:     1 store (Lx pipes)
        !           113:     5 iaddlog (Lx pipes, Ux pipes)
        !           114:
        !           115: Again, we have a perfect balance and will be limited by carry propagation
        !           116: delays, currently three cycles.  The superoptimizer indicates that ther
        !           117: might be sequences that--using a final cmov--have a carry propagation delay
        !           118: of just two.  Montgomery's subtraction sequence could perhaps be used, by
        !           119: complementing some operands.  All in all, we should get down to 2 cycles
        !           120: without much problems.
        !           121:
        !           122: For mpn_mul_1, we could do, just like for mpn_add_n:
        !           123:     not                newlo,notnewlo
        !           124:     addq       cylimb,newlo,newlo  ||    cmpult        cylimb,notnewlo,cyout
        !           125:     addq       cyout,newhi,cylimb
        !           126: and get 2-cycle carry propagation.  The instructions needed will be
        !           127:     1 ld (Lx pipes)
        !           128:     1 st (Lx pipes)
        !           129:     2 mul (U1 pipe)
        !           130:     4 iaddlog (Lx pipes, Ux pipes)
        !           131: issue1: addq not mul ld
        !           132: issue2: cmpult addq mul st
        !           133: Conclusion: no cluster delays and 2-cycle carry delays will give us 2 cycles/limb!
        !           134:
        !           135: Last, we have mpn_addmul_1.  Almost certainly, we will get down to 3
        !           136: cycles/limb, which would be absolutely awesome.
        !           137:
        !           138: Old, perhaps obsolete addmul_1 dependency diagram (needs 175 columns wide screen):
        !           139:
        !           140:    i
        !           141:    s
        !           142:    s  i
        !           143:    u  n
        !           144:    e  s
        !           145:    d  t
        !           146:       r
        !           147:    i  u
        !           148: l  n  c
        !           149: i  s  t
        !           150: v  t  i
        !           151: e  r  o
        !           152:    u  n
        !           153: v  c
        !           154: a  t  t
        !           155: l  i  y
        !           156: u  o  p
        !           157: e  n  e
        !           158: s  s  s
        !           159:         issue
        !           160:          in
        !           161:         cycle
        !           162:          -1     ldq
        !           163:                /    \
        !           164:           0   |      \
        !           165:               |       \
        !           166:           1   |        |
        !           167:               |        |
        !           168:           2   |        |                   ldq
        !           169:               |        |                  /    \
        !           170:           3   |       mulq               |      \
        !           171:               |           \              |       \
        !           172:           4  umulh         \             |        |
        !           173:                |            |            |        |
        !           174:           5    |            |            |        |                   ldq
        !           175:                |            |            |        |                  /    \
        !           176:     4calm 6    |            |   ldq      |       mulq               |      \
        !           177:                |            |  /         |           \              |       \
        !           178:     4casm 7    |            | /         umulh         \             |        |
        !           179: 6              |            ||            |            |            |        |
        !           180:     3aal  8    |            ||            |            |            |        |                   ldq
        !           181: 7              |            ||            |            |            |        |                  /    \
        !           182:     4calm 9    |            ||            |            |   ldq      |       mulq               |      \
        !           183: 9              |            ||            |            |  /         |           \              |       \
        !           184:     4casm 10   |            ||            |            | /         umulh         \             |        |
        !           185: 9              |            ||            |            ||            |            |            |        |
        !           186:     3aal  11   |           addq           |            ||            |            |            |        |                   ldq
        !           187: 9              |          //   \          |            ||            |            |            |        |                  /    \
        !           188:     4calm 12    \     cmpult    addq<-cy  |            ||            |            |   ldq      |       mulq               |      \
        !           189: 13               \    /       //   \      |            ||            |            |  /         |           \              |       \
        !           190:     4casm 13      addq   cmpult     stq   |            ||            |            | /         umulh         \             |        |
        !           191: 11                    \  /                |            ||            |            ||            |            |            |        |
        !           192:     3aal  14          addq                |           addq           |            ||            |            |            |        |                   ldq
        !           193: 10                        \               |          //   \          |            ||            |            |            |        |                  /    \
        !           194:     4calm 15                cy ---->       \     cmpult    addq<-cy  |            ||            |            |   ldq      |       mulq               |      \
        !           195: 13                                          \    /       //   \      |            ||            |            |  /         |           \              |       \
        !           196:     4casm 16                                 addq   cmpult     stq   |            ||            |            | /         umulh         \             |        |
        !           197: 11                                               \  /                |            ||            |            ||            |            |            |        |
        !           198:     3aal  17                                     addq                |           addq           |            ||            |            |            |        |
        !           199: 10                                                   \               |          //   \          |            ||            |            |            |        |
        !           200:     4calm 18                                           cy ---->       \     cmpult    addq<-cy  |            ||            |            |   ldq      |       mulq
        !           201: 13                                                                     \    /       //   \      |            ||            |            |  /         |           \
        !           202:     4casm 19                                                            addq   cmpult     stq   |            ||            |            | /         umulh         \
        !           203: 11                                                                          \  /                |            ||            |            ||            |            |
        !           204:     3aal  20                                                                addq                |           addq           |            ||            |            |
        !           205: 10                                                                              \               |          //   \          |            ||            |            |
        !           206:     4calm 21                                                                      cy ---->       \     cmpult    addq<-cy  |            ||            |            |   ldq
        !           207:                                                                                                   \    /       //   \      |            ||            |            |  /
        !           208:           22                                                                                       addq   cmpult     stq   |            ||            |            | /
        !           209:                                                                                                        \  /                |            ||            |            ||
        !           210:           23                                                                                           addq                |           addq           |            ||
        !           211:                                                                                                            \               |          //   \          |            ||
        !           212:           24                                                                                                 cy ---->       \     cmpult    addq<-cy  |            ||
        !           213:                                                                                                                              \    /       //   \      |            ||
        !           214:           25                                                                                                                  addq   cmpult     stq   |            ||
        !           215:                                                                                                                                   \  /                |            ||
        !           216:           26                                                                                                                      addq                |           addq
        !           217:                                                                                                                                       \               |          //   \
        !           218:           27                                                                                                                            cy ---->       \     cmpult    addq<-cy
        !           219:                                                                                                                                                         \    /       //   \
        !           220:           28                                                                                                                                             addq   cmpult     stq
        !           221:                                                                                                                                                              \  /
        !           222: As many as 6 consecutive points will be under execution simultaneously, or if we                                                                             addq
        !           223: schedule loads even further away, maybe 7 or 8.  But the number of live quantities                                                                               \
        !           224: is reasonable, and can easily be satisfied.                                                                                                                        cy ---->
FreeBSD-CVSweb <freebsd-cvsweb@FreeBSD.org>