<HTML>
<HEAD>
<TITLE>Garbage collector scalability</TITLE>
</HEAD>
<BODY>
<H1>Garbage collector scalability</h1>
In its default configuration, the Boehm-Demers-Weiser garbage collector
is not thread-safe.  It can be made thread-safe for a number of environments
by building the collector with the appropriate
<TT>-D</tt><I>XXX</i><TT>_THREADS</tt> compilation
flag.  This has primarily two effects:
<OL>
<LI> It causes the garbage collector to stop all other threads when
it needs to see a consistent memory state.
<LI> It causes the collector to acquire a lock around essentially all
allocation and garbage collection activity.
</ol>
Since a single lock is used for all allocation-related activity, only one
thread can be allocating or collecting at any one time.  This inherently
limits the performance of multi-threaded applications on multiprocessors.
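<P>
For example, a client built against a thread-safe collector can simply
allocate from several threads; every allocation below contends for the single
allocator lock.  (This is a minimal sketch, assuming a Linux/Pthreads build
with <TT>-DGC_LINUX_THREADS</tt>; the thread count and object size are
arbitrary.)
<PRE>
#define GC_LINUX_THREADS    /* match the flag the collector was built with */
#include &lt;gc.h&gt;             /* also redirects pthread_create so new threads
                               are registered with the collector            */
#include &lt;pthread.h&gt;
#include &lt;stdio.h&gt;

static void *client(void *arg)
{
    int i;
    (void)arg;
    for (i = 0; i &lt; 100000; i++) {
        void *p = GC_MALLOC(16);   /* each call takes the allocator lock */
        if (p == 0) break;         /* out of memory */
    }
    return 0;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, 0, client, 0);
    pthread_create(&t2, 0, client, 0);
    pthread_join(t1, 0);
    pthread_join(t2, 0);
    printf("final heap size: %lu bytes\n", (unsigned long)GC_get_heap_size());
    return 0;
}
</PRE>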
<P>
On most platforms, the allocator/collector lock is implemented as a
spin lock with exponential back-off.  Longer wait times are implemented
by yielding and/or sleeping.  If a collection is in progress, the pure
spinning stage is skipped.  This has the advantage that uncontested lock
acquisitions, and thus most uniprocessor lock acquisitions, are very cheap.
It has the disadvantage that the application may sleep for small periods of
time even when there is work to be done, and threads may be unnecessarily
woken up for short periods.  Nonetheless, this scheme empirically
outperforms native queue-based mutual exclusion implementations in most
cases, sometimes drastically so.
<H2>Options for enhanced scalability</h2>
Version 6.0 of the collector adds two facilities to enhance collector
scalability on multiprocessors.  As of 6.0alpha1, these are supported
only under Linux on X86 and IA64 processors, though ports to other
otherwise supported Pthreads platforms should be straightforward.
They are intended to be used together.
<UL>
<LI>
Building the collector with <TT>-DPARALLEL_MARK</tt> allows the collector to
run the mark phase in parallel in multiple threads, and thus on multiple
processors.  The mark phase typically consumes the large majority of the
collection time.  Thus this largely parallelizes the garbage collector
itself, though not the allocation process.  Currently the marking is
performed by the thread that triggered the collection, together with
<I>N</i>-1 dedicated
threads, where <I>N</i> is the number of processors detected by the collector.
The dedicated threads are created once at initialization time.
<P>
A second effect of this flag is to switch to a more concurrent
implementation of <TT>GC_malloc_many</tt>, so that free lists can be
built, and memory can be cleared, by more than one thread concurrently
(see the sketch following this list).
<LI>
Building the collector with <TT>-DTHREAD_LOCAL_ALLOC</tt> adds support for
thread-local allocation.  It does not, by itself, cause thread-local
allocation to be used.  It simply allows the use of the interface in
<TT>gc_local_alloc.h</tt>.
<P>
Memory returned from thread-local allocators is completely interchangeable
with that returned by the standard allocators.  It may be used by other
threads.  The only difference is that, if the thread allocates enough
memory of a certain kind, it will build a thread-local free list for
objects of that kind, and allocate from that.  This greatly reduces
locking.  The thread-local free lists are refilled using
<TT>GC_malloc_many</tt>.
<P>
An important side effect of this flag is to replace the default
spin-then-sleep lock with a spin-then-queue based implementation.
This <I>reduces performance</i> for the standard allocation functions,
though it usually improves performance when thread-local allocation is
used heavily, and thus the number of short-duration lock acquisitions
is greatly reduced.
</ul>
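<P>
The following is a minimal sketch of how a client (or the thread-local
allocation layer itself) can use <TT>GC_malloc_many</tt> to build a private
free list: objects of the requested size are returned linked through their
first word, and the <TT>GC_NEXT</tt> macro from <TT>gc.h</tt> retrieves the
link.  (The object size here is arbitrary, and a real client would keep one
such list per thread.)
<PRE>
#include &lt;gc.h&gt;

#define OBJ_SZ 32                      /* arbitrary example object size */

static void *free_list = 0;            /* per-thread in a real client   */

void *local_alloc(void)
{
    void *result;
    if (free_list == 0) {
        /* One lock acquisition refills the whole list. */
        free_list = GC_malloc_many(OBJ_SZ);
        if (free_list == 0) return 0;  /* out of memory */
    }
    result = free_list;
    free_list = GC_NEXT(result);
    GC_NEXT(result) = 0;               /* clear the link word before use */
    return result;
}
</PRE>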
<P>
The easiest way to switch an application to thread-local allocation is to:
<OL>
<LI> Define the macro <TT>GC_REDIRECT_TO_LOCAL</tt>,
and then include the <TT>gc.h</tt>
header in each client source file.
<LI> Invoke <TT>GC_thr_init()</tt> before any allocation.
<LI> Allocate using <TT>GC_MALLOC</tt>, <TT>GC_MALLOC_ATOMIC</tt>,
and/or <TT>GC_GCJ_MALLOC</tt>.
</ol>
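<P>
A minimal sketch of those three steps (the data structure and loop bound are
arbitrary):
<PRE>
#define GC_REDIRECT_TO_LOCAL      /* step 1: must precede the gc.h include */
#include &lt;gc.h&gt;

struct node { struct node *next; int value; };

int main(void)
{
    struct node *head = 0;
    int i;

    GC_thr_init();                /* step 2: call before any allocation */
    for (i = 0; i &lt; 1000; i++) {
        /* step 3: GC_MALLOC now goes to the thread-local allocator */
        struct node *n = (struct node *)GC_MALLOC(sizeof(struct node));
        n->value = i;             /* memory is returned cleared */
        n->next = head;
        head = n;
    }
    return 0;
}
</PRE>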
<H2>The Parallel Marking Algorithm</h2>
We use an algorithm similar to
<A HREF="http://www.yl.is.s.u-tokyo.ac.jp/gc/">that developed by
Endo, Taura, and Yonezawa</a> at the University of Tokyo.
However, the data structures and implementation are different,
and represent a smaller change to the original collector source,
probably at the expense of extreme scalability.  Some of
the refinements they suggest, <I>e.g.</i> splitting large
objects, were also incorporated into our approach.
<P>
The global mark stack is transformed into a global work queue.
Unlike the usual case, it never shrinks during a mark phase.
The mark threads remove objects from the queue by copying them to a
local mark stack and changing the global descriptor to zero, indicating
that there is no more work to be done for this entry.
This removal is done with no synchronization.  Thus it is possible for more
than one worker to remove the same entry, resulting in some work duplication.
<P>
The global work queue grows only if a marker thread decides to
return some of its local mark stack to the global one.  This
is done if the global queue appears to be running low, or if
the local stack is in danger of overflowing.  It does require
synchronization, but should be relatively rare.
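<P>
The following is an illustrative sketch (not the collector's actual data
structures or code) of this discipline: entries are claimed from the global
queue without synchronization by zeroing their descriptor, and only returning
work to the global queue takes a lock.
<PRE>
#include &lt;pthread.h&gt;
#include &lt;stddef.h&gt;

struct mark_entry {
    void  *base;         /* object to be scanned                    */
    size_t descriptor;   /* zero means "no work left in this entry" */
};

#define GLOBAL_QUEUE_SIZE 1024

static struct mark_entry global_queue[GLOBAL_QUEUE_SIZE];
static size_t global_top;                      /* grows only under the lock */
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

/* Claim one global entry with no synchronization.  Two markers may copy
 * the same entry before either zeroes its descriptor; the result is some
 * duplicated work, never lost work. */
static int claim_entry(size_t i, struct mark_entry *local)
{
    if (global_queue[i].descriptor == 0) return 0;  /* nothing to do here */
    *local = global_queue[i];
    global_queue[i].descriptor = 0;
    return 1;
}

/* Return part of a local mark stack to the global queue, e.g. when the
 * local stack nears overflow or the global queue is running low.  This is
 * the only operation that needs the lock, and it is comparatively rare. */
static void return_to_global(struct mark_entry *stack, size_t n)
{
    size_t i;
    pthread_mutex_lock(&queue_lock);
    for (i = 0; i &lt; n && global_top &lt; GLOBAL_QUEUE_SIZE; i++)
        global_queue[global_top++] = stack[i];
    pthread_mutex_unlock(&queue_lock);
}
</PRE>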
<P>
The sequential marking code is reused to process local mark stacks.
Hence the amount of additional code required for parallel marking
is minimal.
<P>
It should be possible to use generational collection in the presence of the
parallel collector, by calling <TT>GC_enable_incremental()</tt>.
This does not result in fully incremental collection, since parallel mark
phases cannot currently be interrupted, and doing so may be too
expensive.
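<P>
For example (a sketch; the call is typically made once, early in program
startup):
<PRE>
#include &lt;gc.h&gt;

int main(void)
{
    GC_enable_incremental();  /* request generational/incremental collection */
    /* ... allocate as usual with GC_MALLOC ... */
    return 0;
}
</PRE>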
<P>
Gcj-style mark descriptors do not currently mix with the combination
of local allocation and incremental collection.  They should work correctly
with one or the other, but not both.
<P>
The number of marker threads is set on startup to the number of
available processors (or to the value of the <TT>GC_NPROCS</tt>
environment variable).  If only a single processor is detected,
parallel marking is disabled.
<P>
Note that setting <TT>GC_NPROCS</tt> to 1 also causes some lock acquisitions
inside the collector to immediately yield the processor instead of busy
waiting first.  In the case of a multiprocessor and a client with multiple
simultaneously runnable threads, this may have disastrous performance
consequences (e.g. a factor of 10 slowdown).
<H2>Performance</h2>
We conducted some simple experiments with a version of
<A HREF="gc_bench.html">our GC benchmark</a> that was slightly modified to
run multiple concurrent client threads in the same address space.
Each client thread does the same work as the original benchmark, but they
share a heap.
This benchmark involves very little work outside of memory allocation.
This was run with GC 6.0alpha3 on a dual processor Pentium III/500 machine
under Linux 2.2.12.
<P>
Running with a thread-unsafe collector, the benchmark ran in 9
seconds.  With the simple thread-safe collector,
built with <TT>-DLINUX_THREADS</tt>, the execution time
increased to 10.3 seconds, or 23.5 elapsed seconds with two clients.
(The times for the <TT>malloc</tt>/<TT>free</tt> version
with glibc <TT>malloc</tt>
are 10.51 (standard library, pthreads not linked),
20.90 (one thread, pthreads linked),
and 24.55 seconds respectively.  The benchmark favors a
garbage collector, since most objects are small.)
<P>
The following table gives execution times for the collector built
with parallel marking and thread-local allocation support
(<TT>-DGC_LINUX_THREADS -DPARALLEL_MARK -DTHREAD_LOCAL_ALLOC</tt>).  We tested
the client using either one or two marker threads, and running
one or two client threads.  Note that the client uses thread-local
allocation exclusively.  With <TT>-DTHREAD_LOCAL_ALLOC</tt> the collector
switches to a locking strategy that is better tuned to less frequent
lock acquisition.  The standard allocation primitives thus perform
slightly worse than without <TT>-DTHREAD_LOCAL_ALLOC</tt>, and should be
avoided in time-critical code.
<P>
(The results using <TT>pthread_mutex_lock</tt>
directly for allocation locking would have been worse still, at
least for older versions of linuxthreads.
With <TT>THREAD_LOCAL_ALLOC</tt>, we first repeatedly try to acquire the
lock with <TT>pthread_mutex_trylock()</tt>, busy-waiting between attempts.
After a fixed number of attempts, we use <TT>pthread_mutex_lock()</tt>.)
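<P>
An illustrative sketch of that acquisition strategy (not the collector's
exact code; the attempt bound is arbitrary):
<PRE>
#include &lt;pthread.h&gt;
#include &lt;sched.h&gt;

#define SPIN_ATTEMPTS 10               /* arbitrary bound for the example */

static void lock_with_spin(pthread_mutex_t *m)
{
    int i;
    for (i = 0; i &lt; SPIN_ATTEMPTS; i++) {
        if (pthread_mutex_trylock(m) == 0) return;  /* uncontested: cheap */
        sched_yield();  /* stand-in for the collector's busy-wait/back-off */
    }
    pthread_mutex_lock(m);             /* give up spinning and queue */
}
</PRE>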
<P>
These measurements do not use incremental collection, nor was prefetching
enabled in the marker.  We used the C version of the benchmark.
All measurements are in elapsed seconds on an unloaded machine.
<P>
<TABLE BORDER ALIGN="CENTER">
<TR><TH>Number of client threads</th><TH>1 marker thread (secs.)</th>
<TH>2 marker threads (secs.)</th></tr>
<TR><TD>1 client</td><TD ALIGN="CENTER">10.45</td><TD ALIGN="CENTER">7.85</td></tr>
<TR><TD>2 clients</td><TD ALIGN="CENTER">19.95</td><TD ALIGN="CENTER">12.3</td></tr>
</table>
<P>
The execution time for the single-threaded case is slightly worse than with
simple locking.  However, even the single-threaded benchmark runs faster than
the thread-unsafe version if a second processor is available.
The execution time for two clients with thread-local allocation is
only 1.4 times the sequential execution time for a single thread in a
thread-unsafe environment, even though it involves twice the client work.
That represents close to a
factor of 2 improvement over the 2 client case with the old collector.
The old collector clearly
still suffered from some contention overhead, in spite of the fact that the
locking scheme had been fairly well tuned.
<P>
Full linear speedup (i.e. the same execution time for 1 client on one
processor as 2 clients on 2 processors)
is probably not achievable on this kind of
hardware even with such a small number of processors,
since the memory system is
a major constraint for the garbage collector,
the processors usually share a single memory bus, and thus
the aggregate memory bandwidth does not increase in
proportion to the number of processors.
<P>
These results are likely to be very sensitive to both hardware and OS
issues.  Preliminary experiments with an older Pentium Pro machine running
an older kernel were far less encouraging.
</body>
</html>
