Annotation of OpenXM_contrib/gmp/mpn/alpha/README, Revision 1.1.1.2
This directory contains mpn functions optimized for DEC Alpha processors.

ALPHA ASSEMBLY RULES AND REGULATIONS

The `.prologue N' pseudo op marks the end of the instructions that need
special handling by the unwinder.  It also says whether $27 is really
needed for computing the gp.  The `.mask M' pseudo op says which
registers are saved on the stack, and at what offset in the frame.

Cray code is very, very different...


RELEVANT OPTIMIZATION ISSUES

EV4

1. This chip has very limited store bandwidth.  The on-chip L1 cache is
write-through, and a cache line is transferred from the store buffer to
the off-chip L2 in as much as 15 cycles on most systems.  This delay hurts
mpn_add_n, mpn_sub_n, mpn_lshift, and mpn_rshift.

2. Pairing is possible between memory instructions and integer arithmetic
instructions.

3. mulq and umulh are documented to have a latency of 23 cycles, but 2 of
these cycles are pipelined.  Thus, multiply instructions can be issued at
a rate of one each 21st cycle.

EV5

1. The memory bandwidth of this chip seems excellent, both for loads and
stores.  Even when the working set is larger than the on-chip L1 and L2
caches, the performance remains almost unaffected.

2. mulq has a latency of 12 cycles and an issue rate of 1 each 8th cycle.
umulh has a measured latency of 14 cycles and an issue rate of 1 each
10th cycle.  But the exact timing is somewhat confusing.

3. mpn_add_n.  With 4-fold unrolling, we need 37 instructions, of which 12
are memory operations.  This will take at least
	ceil(37/2) [dual issue] + 1 [taken branch] = 19 cycles
We have 12 memory cycles, plus 4 after-store conflict cycles, or 16 data
cache cycles, which should be completely hidden in the 19 issue cycles.
The computation is inherently serial, with these dependencies:

	ldq  ldq
	  \  /\
	(or) addq  |
	  |\ /  \  |
	  | addq cmpult
	   \  |    |
	   cmpult  |
	       \  /
	        or

I.e., 3 operations are needed between carry-in and carry-out, making 12
cycles the absolute minimum for the 4 limbs.  We could replace the `or'
with a cmoveq/cmovne, which could issue one cycle earlier than the `or',
but that might waste a cycle on EV4.  The total depth remains unaffected,
since cmov has a latency of 2 cycles.

	   addq
	   /  \
	addq  cmpult
	  |       \
	cmpult -> cmovne
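
The carry chain above can be sketched in C.  This is a hypothetical model,
not the actual assembly; `add_n_model' and `mp_limb_t' are illustrative
names, and each statement is annotated with the Alpha instruction it stands
for:

```c
#include <stdint.h>

typedef uint64_t mp_limb_t;

/* Model of the mpn_add_n carry chain: the operand sum and the carry-in
   addition each produce a sub-carry via cmpult, and an `or' (or a cmovne)
   merges them -- three operations on the carry path per limb.  */
static mp_limb_t
add_n_model (mp_limb_t *rp, const mp_limb_t *up, const mp_limb_t *vp, long n)
{
  mp_limb_t cy = 0;
  for (long i = 0; i < n; i++)
    {
      mp_limb_t s = up[i] + vp[i];	/* addq */
      mp_limb_t c1 = s < up[i];		/* cmpult: did the sum wrap?  */
      s += cy;				/* addq: fold in carry-in */
      mp_limb_t c2 = s < cy;		/* cmpult: did adding cy wrap?  */
      cy = c1 | c2;			/* or: merged carry-out */
      rp[i] = s;
    }
  return cy;
}
```

The two cmpult results are independent, which is what lets the diagram
above keep the carry path at three operations rather than four.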

Montgomery has a slightly different way of computing carry that requires one
fewer instruction, but has depth 4 (instead of the current 3).  Since the
code is currently instruction issue bound, Montgomery's idea should save us
1/2 cycle per limb, or bring us down to a total of 17 cycles or 4.25
cycles/limb.  Unfortunately, this method will not be good for the EV6.

EV6

Here we have a really parallel pipeline, capable of issuing up to 4 integer
instructions per cycle.  One integer multiply instruction can issue each
cycle.  To get optimal speed, we need to pretend we are vectorizing the
code, i.e., minimize the iterative dependencies.

There are two dependencies to watch out for: 1) address arithmetic
dependencies, and 2) carry propagation dependencies.

We can avoid serializing due to address arithmetic by unrolling the loop, so
that addresses don't depend heavily on an index variable.  Avoiding
serializing because of carry propagation is trickier; the ultimate
performance of the code will be determined by the number of latency cycles
it takes from accepting carry-in at a vector point until we can generate
carry-out.

Most integer instructions can execute in either the L0, U0, L1, or U1
pipelines.  Shifts only execute in U0 and U1, and multiplies only in U1.

CMOV instructions split into two internal instructions, CMOV1 and CMOV2,
but they execute efficiently.  However, CMOV splits the mapping process
(see pg 2-26 in cmpwrgd.pdf), suggesting that CMOV should always be placed
as the last instruction of an aligned 4-instruction block (?).

Perhaps the most important issue is the latency between the L0/U0 and L1/U1
clusters; a result obtained in either cluster has an extra cycle of latency
for consumers in the opposite cluster.  Because of the dynamic nature of the
implementation, it is hard to predict where an instruction will execute.

The shift loops need (per limb):
	1 load    (Lx pipes)
	1 store   (Lx pipes)
	2 shifts  (Ux pipes)
	1 iaddlog (Lx pipes, Ux pipes)
Since the pipes are evenly loaded, we should get 4 insn/cycle, or 1.25
cycles/limb.
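
That per-limb budget corresponds to code shaped like this hypothetical C
model of mpn_lshift (`lshift_model' is an illustrative name; the real
routine is hand-written assembly):

```c
#include <stdint.h>

typedef uint64_t mp_limb_t;

/* Per limb: one load, two shifts (Ux pipes), one `or' -- plus the store.
   Works high-to-low; returns the bits shifted out at the top.
   Assumes 0 < cnt < 64, as the mpn interface requires.  */
static mp_limb_t
lshift_model (mp_limb_t *rp, const mp_limb_t *up, long n, unsigned cnt)
{
  mp_limb_t retval = up[n - 1] >> (64 - cnt);	/* bits shifted out */
  for (long i = n - 1; i > 0; i--)
    rp[i] = (up[i] << cnt)			/* sll */
	    | (up[i - 1] >> (64 - cnt));	/* srl, then bis (or) */
  rp[0] = up[0] << cnt;
  return retval;
}
```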

For mpn_add_n, we currently have (per limb):
	2 loads   (Lx pipes)
	1 store   (Lx pipes)
	5 iaddlog (Lx pipes, Ux pipes)

Again, we have a perfect balance and will be limited by carry propagation
delays, currently three cycles.  The superoptimizer indicates that there
might be sequences that--using a final cmov--have a carry propagation delay
of just two.  Montgomery's subtraction sequence could perhaps be used, by
complementing some operands.  All in all, we should get down to 2 cycles
without many problems.

For mpn_mul_1, we could do, just as for mpn_add_n:
	not	newlo,notnewlo
	addq	cylimb,newlo,newlo  ||  cmpult	notnewlo,cylimb,cyout
	addq	cyout,newhi,cylimb
and get 2-cycle carry propagation.  The instructions needed will be
	1 ld      (Lx pipes)
	1 st      (Lx pipes)
	2 mul     (U1 pipe)
	4 iaddlog (Lx pipes, Ux pipes)
	issue1:  addq    not    mul  ld
	issue2:  cmpult  addq   mul  st
Conclusion: no cluster delays and 2-cycle carry delays will give us
2 cycles/limb!
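
The trick is that the carry-out of cylimb + newlo can be computed from the
inputs alone: it occurs exactly when cylimb exceeds ~newlo, so the cmpult
can issue in the same cycle as the addq.  A hypothetical C model
(`mul_1_model' is an illustrative name; unsigned __int128 stands in for the
mulq/umulh pair and assumes a GCC/Clang-style compiler):

```c
#include <stdint.h>

typedef uint64_t mp_limb_t;

/* Model of the proposed 2-cycle mpn_mul_1 carry scheme.  */
static mp_limb_t
mul_1_model (mp_limb_t *rp, const mp_limb_t *up, long n, mp_limb_t v)
{
  mp_limb_t cylimb = 0;
  for (long i = 0; i < n; i++)
    {
      unsigned __int128 p = (unsigned __int128) up[i] * v;
      mp_limb_t newlo = (mp_limb_t) p;		/* mulq  */
      mp_limb_t newhi = (mp_limb_t) (p >> 64);	/* umulh */
      mp_limb_t notnewlo = ~newlo;		/* not   */
      /* cmpult: cylimb + newlo wraps iff cylimb > ~newlo,
	 so cyout needs neither addq result.  */
      mp_limb_t cyout = cylimb > notnewlo;
      rp[i] = cylimb + newlo;			/* addq, issues with cmpult */
      cylimb = cyout + newhi;			/* addq: next carry limb */
    }
  return cylimb;
}
```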

Last, we have mpn_addmul_1.  Almost certainly, we will get down to 3
cycles/limb, which would be absolutely awesome.

Old, perhaps obsolete mpn_addmul_1 dependency diagram (needs a
175-column-wide screen):

i
s
s i
u n
e s
d t
r
i u
l n c
i s t
v t i
e r o
u n
v c
a t t
l i y
u o p
e n e
s s s
issue
in
cycle
-1 ldq
/ \
0 | \
| \
1 | |
| |
2 | | ldq
| | / \
3 | mulq | \
| \ | \
4 umulh \ | |
| | | |
5 | | | | ldq
| | | | / \
4calm 6 | | ldq | mulq | \
| | / | \ | \
4casm 7 | | / umulh \ | |
6 | || | | | |
3aal 8 | || | | | | ldq
7 | || | | | | / \
4calm 9 | || | | ldq | mulq | \
9 | || | | / | \ | \
4casm 10 | || | | / umulh \ | |
9 | || | || | | | |
3aal 11 | addq | || | | | | ldq
9 | // \ | || | | | | / \
4calm 12 \ cmpult addq<-cy | || | | ldq | mulq | \
13 \ / // \ | || | | / | \ | \
4casm 13 addq cmpult stq | || | | / umulh \ | |
11 \ / | || | || | | | |
3aal 14 addq | addq | || | | | | ldq
10 \ | // \ | || | | | | / \
4calm 15 cy ----> \ cmpult addq<-cy | || | | ldq | mulq | \
13 \ / // \ | || | | / | \ | \
4casm 16 addq cmpult stq | || | | / umulh \ | |
11 \ / | || | || | | | |
3aal 17 addq | addq | || | | | |
10 \ | // \ | || | | | |
4calm 18 cy ----> \ cmpult addq<-cy | || | | ldq | mulq
13 \ / // \ | || | | / | \
4casm 19 addq cmpult stq | || | | / umulh \
11 \ / | || | || | |
3aal 20 addq | addq | || | |
10 \ | // \ | || | |
4calm 21 cy ----> \ cmpult addq<-cy | || | | ldq
\ / // \ | || | | /
22 addq cmpult stq | || | | /
\ / | || | ||
23 addq | addq | ||
\ | // \ | ||
24 cy ----> \ cmpult addq<-cy | ||
\ / // \ | ||
25 addq cmpult stq | ||
\ / | ||
26 addq | addq
\ | // \
27 cy ----> \ cmpult addq<-cy
\ / // \
28 addq cmpult stq
\ /
As many as 6 consecutive points will be under execution simultaneously, or if we addq
schedule loads even further away, maybe 7 or 8. But the number of live quantities \
is reasonable, and can easily be satisfied. cy ---->