<HTML>
<HEAD>
<TITLE>Garbage collector scalability</TITLE>
</HEAD>
<BODY>
<H1>Garbage collector scalability</h1>
In its default configuration, the Boehm-Demers-Weiser garbage collector
is not thread-safe.  It can be made thread-safe for a number of environments
by building the collector with the appropriate
<TT>-D</tt><I>XXX</i><TT>_THREADS</tt> compilation
flag.  This has primarily two effects:
<OL>
<LI> It causes the garbage collector to stop all other threads when
it needs to see a consistent memory state.
<LI> It causes the collector to acquire a lock around essentially all
allocation and garbage collection activity.
</ol>
Since a single lock is used for all allocation-related activity, only one
thread can be allocating or collecting at one point.  This inherently
limits performance of multi-threaded applications on multiprocessors.
<P>
On most platforms, the allocator/collector lock is implemented as a
spin lock with exponential back-off.  Longer wait times are implemented
by yielding and/or sleeping.  If a collection is in progress, the pure
spinning stage is skipped.  This has the advantage that uncontested lock
acquisitions, and thus most uniprocessor lock acquisitions, are very cheap.
It has the disadvantage that the application may sleep for small periods
of time even when there is work to be done, and threads may be unnecessarily
woken up for short periods.  Nonetheless, this scheme empirically
outperforms native queue-based mutual exclusion implementations in most
cases, sometimes drastically so.
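<P>
As a concrete illustration of the scheme just described, here is a minimal
sketch of a spin-then-yield-then-sleep lock.  It is not the collector's
actual code: the predicate <TT>collection_in_progress()</tt>, the atomic
primitives (GCC builtins here), and the tuning constants are all
illustrative assumptions.
<PRE>
#include &lt;sched.h&gt;
#include &lt;unistd.h&gt;

static volatile int alloc_lock = 0;          /* 0 = free, 1 = held */
extern int collection_in_progress(void);     /* assumed predicate */

static void acquire_alloc_lock(void)
{
    int max_spins;

    /* Pure spinning with exponential back-off; skipped while a
       collection is in progress, since the lock will be held long. */
    if (!collection_in_progress()) {
        for (max_spins = 1; max_spins <= 1024; max_spins <<= 1) {
            volatile int i;
            if (__sync_lock_test_and_set(&alloc_lock, 1) == 0) return;
            for (i = 0; i < max_spins; ++i) ;   /* busy-wait back-off */
        }
    }
    /* Longer waits: yield the processor, then sleep between attempts. */
    for (;;) {
        if (__sync_lock_test_and_set(&alloc_lock, 1) == 0) return;
        sched_yield();
        if (__sync_lock_test_and_set(&alloc_lock, 1) == 0) return;
        usleep(1000);   /* may sleep briefly even though work is available */
    }
}

static void release_alloc_lock(void)
{
    __sync_lock_release(&alloc_lock);
}
</PRE>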
<H2>Options for enhanced scalability</h2>
Version 6.0 of the collector adds two facilities to enhance collector
scalability on multiprocessors.  As of 6.0alpha1, these are supported
only under Linux on X86 and IA64 processors, though ports to other
Pthreads platforms that the collector already supports should be
straightforward.  They are intended to be used together.
<UL>
<LI>
Building the collector with <TT>-DPARALLEL_MARK</tt> allows the collector to
run the mark phase in parallel in multiple threads, and thus on multiple
processors.  The mark phase typically consumes the large majority of the
collection time.  Thus this largely parallelizes the garbage collector
itself, though not the allocation process.  Currently the marking is
performed by the thread that triggered the collection, together with
<I>N</i>-1 dedicated
threads, where <I>N</i> is the number of processors detected by the collector.
The dedicated threads are created once at initialization time.
<P>
A second effect of this flag is to switch to a more concurrent
implementation of <TT>GC_malloc_many</tt>, so that free lists can be
built, and memory can be cleared, by more than one thread concurrently.
<LI>
Building the collector with <TT>-DTHREAD_LOCAL_ALLOC</tt> adds support for
thread local allocation.  It does not, by itself, cause thread local
allocation to be used.  It simply allows the use of the interface in
<TT>gc_local_alloc.h</tt>.
<P>
Memory returned from thread-local allocators is completely interchangeable
with that returned by the standard allocators.  It may be used by other
threads.  The only difference is that, if the thread allocates enough
memory of a certain kind, it will build a thread-local free list for
objects of that kind, and allocate from that.  This greatly reduces
locking.  The thread-local free lists are refilled using
<TT>GC_malloc_many</tt> (see the sketch after this list).
<P>
An important side effect of this flag is to replace the default
spin-then-sleep lock with a spin-then-queue based implementation.
This <I>reduces performance</i> for the standard allocation functions,
though it usually improves performance when thread-local allocation is
used heavily, since the number of short-duration lock acquisitions
is then greatly reduced.
</ul>
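<P>
To illustrate the role of <TT>GC_malloc_many</tt> mentioned above, the
following sketch shows how a per-thread free list might be refilled with a
single locked call.  Only <TT>GC_malloc_many()</tt> and <TT>GC_NEXT()</tt>
are actual interfaces from <TT>gc.h</tt>; the surrounding code is a
simplified assumption about how such a list might be managed, and a real
implementation would keep one list per thread and per object size.
<PRE>
#include "gc.h"

#define OBJ_BYTES 16

static void *local_free_list = 0;    /* assume one of these per thread */

static void * local_alloc(void)
{
    void *result;

    if (local_free_list == 0) {
        /* One locked call refills the whole list: the returned objects
           are linked through their first word. */
        local_free_list = GC_malloc_many(OBJ_BYTES);
        if (local_free_list == 0) return 0;    /* heap exhausted */
    }
    result = local_free_list;
    local_free_list = GC_NEXT(result);         /* unlink the first object */
    GC_NEXT(result) = 0;                       /* clear the link word */
    return result;
}
</PRE>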
<P>
The easiest way to switch an application to thread-local allocation
(see the example below) is to
<OL>
<LI> Define the macro <TT>GC_REDIRECT_TO_LOCAL</tt>,
and then include the <TT>gc.h</tt>
header in each client source file.
<LI> Invoke <TT>GC_thr_init()</tt> before any allocation.
<LI> Allocate using <TT>GC_MALLOC</tt>, <TT>GC_MALLOC_ATOMIC</tt>,
and/or <TT>GC_GCJ_MALLOC</tt>.
</ol>
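<P>
For example, a client source file converted this way might look roughly as
follows.  The <TT>GC_LINUX_THREADS</tt> and <TT>_REENTRANT</tt> definitions
reflect the Linux configuration used elsewhere in this document and may
differ on other platforms; the worker code itself is purely illustrative.
<PRE>
#define GC_LINUX_THREADS
#define _REENTRANT
#define GC_REDIRECT_TO_LOCAL    /* step 1: redirect GC_MALLOC and friends */
#include "gc.h"
#include &lt;pthread.h&gt;

static void * worker(void *arg)
{
    int i;
    for (i = 0; i < 1000000; ++i) {
        /* step 3: GC_MALLOC now draws from a thread-local free list */
        void *p = GC_MALLOC(16);
        if (p == 0) return 0;
    }
    return arg;
}

int main(void)
{
    pthread_t t1, t2;

    GC_thr_init();              /* step 2: call before any allocation */
    pthread_create(&t1, 0, worker, 0);
    pthread_create(&t2, 0, worker, 0);
    pthread_join(t1, 0);
    pthread_join(t2, 0);
    return 0;
}
</PRE>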
<H2>The Parallel Marking Algorithm</h2>
We use an algorithm similar to
<A HREF="http://www.yl.is.s.u-tokyo.ac.jp/gc/">that developed by
Endo, Taura, and Yonezawa</a> at the University of Tokyo.
However, the data structures and implementation are different,
and represent a smaller change to the original collector source,
probably at the expense of extreme scalability.  Some of
the refinements they suggest, <I>e.g.</i> splitting large
objects, were also incorporated into our approach.
<P>
The global mark stack is transformed into a global work queue.
Unlike the usual case, it never shrinks during a mark phase.
The mark threads remove objects from the queue by copying them to a
local mark stack and changing the global descriptor to zero, indicating
that there is no more work to be done for this entry.
This removal
is done with no synchronization.  Thus it is possible for more than
one worker to remove the same entry, resulting in some work duplication.
<P>
The global work queue grows only if a marker thread decides to
return some of its local mark stack to the global one.  This
is done if the global queue appears to be running low, or if
the local stack is in danger of overflowing.  It does require
synchronization, but should be relatively rare.
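<P>
The following conceptual sketch shows how a marker thread might claim an
entry from the global work queue without synchronization, as described
above.  The structure layout and names are illustrative only; they are not
the collector's actual internals.
<PRE>
struct mark_entry {
    void *start;              /* base of the region to be marked      */
    unsigned long descr;      /* mark descriptor; 0 means "no work"   */
};

/* Copy one global entry onto this marker's local stack, with no locking.
   Two markers may occasionally claim the same entry; the only cost is
   some duplicated marking work. */
static void claim_entry(struct mark_entry *global_entry,
                        struct mark_entry *local_stack, int *local_top)
{
    struct mark_entry copy = *global_entry;

    if (copy.descr == 0) return;             /* someone already took it   */
    local_stack[(*local_top)++] = copy;      /* push onto local mark stack */
    global_entry->descr = 0;                 /* mark the global slot done  */
}
</PRE>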
<P>
The sequential marking code is reused to process local mark stacks.
Hence the amount of additional code required for parallel marking
is minimal.
<P>
It should be possible to use generational collection in the presence of the
parallel collector, by calling <TT>GC_enable_incremental()</tt>.
This does not result in fully incremental collection, since parallel mark
phases cannot currently be interrupted, and doing so may be too
expensive.
<P>
Gcj-style mark descriptors do not currently mix with the combination
of local allocation and incremental collection.  They should work correctly
with one or the other, but not both.
<P>
The number of marker threads is set on startup to the number of
available processors (or to the value of the <TT>GC_NPROCS</tt>
environment variable).  If only a single processor is detected,
parallel marking is disabled.
<P>
Note that setting <TT>GC_NPROCS</tt> to 1 also causes some lock acquisitions
inside the collector to immediately yield the processor instead of busy
waiting first.  In the case of a multiprocessor and a client with multiple
simultaneously runnable threads, this may have disastrous performance
consequences (e.g. a factor of 10 slowdown).
<H2>Performance</h2>
We conducted some simple experiments with a version of
<A HREF="gc_bench.html">our GC benchmark</a> that was slightly modified to
run multiple concurrent client threads in the same address space.
Each client thread does the same work as the original benchmark, but they
share a heap.
This benchmark involves very little work outside of memory allocation.
This was run with GC 6.0alpha3 on a dual processor Pentium III/500 machine
under Linux 2.2.12.
<P>
Running with a thread-unsafe collector, the benchmark ran in 9
seconds.  With the simple thread-safe collector,
built with <TT>-DLINUX_THREADS</tt>, the execution time
increased to 10.3 seconds, or 23.5 elapsed seconds with two clients.
(The times for the <TT>malloc</tt>/<TT>free</tt> version
with glibc <TT>malloc</tt>
are 10.51 (standard library, pthreads not linked),
20.90 (one thread, pthreads linked),
and 24.55 seconds respectively.  The benchmark favors a
garbage collector, since most objects are small.)
<P>
The following table gives execution times for the collector built
with parallel marking and thread-local allocation support
(<TT>-DGC_LINUX_THREADS -DPARALLEL_MARK -DTHREAD_LOCAL_ALLOC</tt>).  We tested
the client using either one or two marker threads, and running
one or two client threads.  Note that the client uses thread local
allocation exclusively.  With <TT>-DTHREAD_LOCAL_ALLOC</tt> the collector
switches to a locking strategy that is better tuned to less frequent
lock acquisition.  The standard allocation primitives thus perform
slightly worse than without <TT>-DTHREAD_LOCAL_ALLOC</tt>, and should be
avoided in time-critical code.
<P>
(The results using <TT>pthread_mutex_lock</tt>
directly for allocation locking would have been worse still, at
least for older versions of linuxthreads.
With <TT>THREAD_LOCAL_ALLOC</tt>, we first repeatedly try to acquire the
lock with <TT>pthread_mutex_trylock()</tt>, busy-waiting between attempts.
After a fixed number of attempts, we use <TT>pthread_mutex_lock()</tt>.)
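<P>
A sketch of that acquisition strategy, using only standard Pthreads calls,
might look as follows; the lock name, retry limit, and busy-wait length are
illustrative guesses rather than the collector's actual tuning.
<PRE>
#include &lt;pthread.h&gt;

static pthread_mutex_t gc_allocate_ml = PTHREAD_MUTEX_INITIALIZER;

static void acquire_allocation_lock(void)
{
    int attempts;

    /* Try to grab the lock without blocking, busy-waiting briefly
       between attempts; this is cheap when contention is rare. */
    for (attempts = 0; attempts < 10; ++attempts) {
        volatile int i;
        if (pthread_mutex_trylock(&gc_allocate_ml) == 0) return;
        for (i = 0; i < 1000; ++i) ;     /* brief busy-wait */
    }
    /* After a fixed number of failed attempts, block in the kernel. */
    pthread_mutex_lock(&gc_allocate_ml);
}

static void release_allocation_lock(void)
{
    pthread_mutex_unlock(&gc_allocate_ml);
}
</PRE>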
<P>
These measurements do not use incremental collection, nor was prefetching
enabled in the marker.  We used the C version of the benchmark.
All measurements are in elapsed seconds on an unloaded machine.
<P>
<TABLE BORDER ALIGN="CENTER">
<TR><TH>Number of client threads</th><TH>1 marker thread (secs.)</th>
<TH>2 marker threads (secs.)</th></tr>
<TR><TD>1 client</td><TD ALIGN="CENTER">10.45</td><TD ALIGN="CENTER">7.85</td></tr>
<TR><TD>2 clients</td><TD ALIGN="CENTER">19.95</td><TD ALIGN="CENTER">12.3</td></tr>
</table>
<P>
The execution time for the single-threaded case is slightly worse than with
simple locking.  However, even the single-threaded benchmark runs faster than
the thread-unsafe version if a second processor is available.
The execution time for two clients with thread-local allocation is
only 1.4 times the sequential execution time for a single thread in a
thread-unsafe environment, even though it involves twice the client work.
That represents close to a
factor of 2 improvement over the 2 client case with the old collector.
The old collector clearly
still suffered from some contention overhead, in spite of the fact that the
locking scheme had been fairly well tuned.
<P>
Full linear speedup (i.e. the same execution time for 1 client on one
processor as 2 clients on 2 processors)
is probably not achievable on this kind of
hardware even with such a small number of processors,
since the memory system is
a major constraint for the garbage collector:
the processors usually share a single memory bus, and thus
the aggregate memory bandwidth does not increase in
proportion to the number of processors.
<P>
These results are likely to be very sensitive to both hardware and OS
issues.  Preliminary experiments with an older Pentium Pro machine running
an older kernel were far less encouraging.

</body>
</html>