Memory Management in Oxischeme

I've recently been playing with the Rust programming language, and what better way to learn a language than to implement a second language in the language one wishes to learn?! It almost goes without saying that this second language being implemented should be Scheme. Thus, Oxischeme was born.

Why implement Scheme instead of some other language? Scheme is a dialect of LISP and inherits the simple parenthesized list syntax of its LISP-y origins, called s-expressions. Thus, writing a parser for Scheme syntax is rather easy compared to doing the same for most other languages' syntax. Furthermore, Scheme's semantics are also minimal. It is a small language designed for teaching, and writing a metacircular interpreter (ie a Scheme interpreter written in Scheme itself) takes only a few handfuls of lines of code. Finally, Scheme is a beautiful language: its design is rooted in the elegant λ-calculus.

Scheme is not "close to the metal" and doesn't provide direct access to the machine's hardware. Instead, it provides the illusion of infinite memory and its structures are automatically garbage collected rather than explicitly managed by the programmer. When writing a Scheme implementation in Scheme itself, or any other language with garbage collection, one can piggy-back on top of the host language's garbage collector, and use that to manage the Scheme's structures. This is not the situation I found myself in: Rust is close to the metal, and does not have a runtime with garbage collection built in (although it has some other cool ideas regarding lifetimes, ownership, and when it is safe to deallocate an object). Therefore, I had to implement garbage collection myself.

Faced with this task, I had a decision to make: tracing garbage collection or reference counting?

Reference counting is a technique where we keep track of the number of other things holding references to any given object or resource. When new references are taken, the count is incremented. When a reference is dropped, the count is decremented. When the count reaches zero, the resource is deallocated and it decrements the reference count of any other objects it holds a reference to. Reference counting is great because once an object becomes unreachable, it is reclaimed immediately and doesn't sit around consuming valuable memory space while waiting for the garbage collector to clean it up at some later point in time. Additionally, the reclamation process happens incrementally and program execution doesn't halt while every object in the heap is checked for liveness. On the negative side, reference counting runs into trouble when it encounters cycles. Consider the following situation:

A -> B
^    |
|    v
D <- C

A, B, C, and D form a cycle and all have a reference count of one. Nothing from outside of the cycle holds a reference to any of these objects, so it should be safe to collect them. However, because each reference count will never get decremented to zero, none of these objects will be deallocated. In practice, the programmer must explicitly use (potentially unsafe) weak references, or the runtime must provide a means for detecting and reclaiming cycles. The former defeats the general purpose, don't-worry-about-it style of managed memory environments. The latter is equivalent to implementing a tracing collector in addition to the existing reference counting memory management.

Tracing garbage collectors start from a set of roots and recursively traverse object references to discover the set of live objects in the heap graph. Any object that is not an element of the live set cannot be used again in the future, because the program has no way to refer to that object. Therefore, the object is available for reclaiming. This has the advantage of collecting dead cycles, because if the cycle is not reachable from the roots, then it won't be in the live set. The cyclic references don't matter because those edges are never traversed. The disadvantage is that, without a lot of hard work, when the collector is doing its bookkeeping, the program is halted until the collector is finished analyzing the whole heap. This can result in long, unpredictable GC pauses.

Reference counting is to tracing as yin is to yang. The former operates on dead, unreachable objects while the latter operates on live, reachable things. Fun fact: every high performance GC algorithm (such as generational GC or reference counting with "purple" nodes and trial deletion) uses a mixture of both, whether it appears so on the surface or not. "A Unified Theory of Garbage Collection" by Bacon, et all discusses this in depth.

I opted to implement a tracing garbage collector for Oxischeme. In particular, I implemented one of the simplest GC algorithms: stop-the-world mark-and-sweep. The steps are as follows:

Stop the Scheme program execution.
Mark phase. Trace the live heap starting from the roots and add every reachable object to the marked set.
Sweep phase. Iterate over each object x in the heap:
- If x is an element of the marked set, continue.
- If x is not an element of the marked set, reclaim it.
Resume execution of the Scheme program.

Because the garbage collector needs to trace the complete heap graph, any structure that holds references to a garbage collected type must participate in garbage collection by tracing the GC things it is holding alive. In Oxischeme, this is implemented with the oxischeme::heap::Trace trait, whose implementation requires a trace function that returns an iterable of GcThings:

pub trait Trace {
    /// Return an iterable of all of the GC things referenced by
    /// this structure.
    fn trace(&self) -> IterGcThing;
}

Note that Oxischeme separates tracing (generic heap graph traversal) from marking (adding live nodes in the heap graph to a set). This enables using Trace to implement other graph algorithms on top of the heap graph. Examples include computing dominator trees and retained sizes of objects, or finding the set of retaining paths of an object that you expected to be reclaimed by the collector, but hasn't been.

If we were to introduce a Trio type that contained three cons cells, we would implement tracing like this:

struct Trio {
    first: ConsPtr,
    second: ConsPtr,
    third: ConsPtr,
}

impl Trace for Trio {
    fn trace(&self) -> IterGcThing {
        let refs = vec!(GcThing::from_cons_ptr(self.first),
                        GcThing::from_cons_ptr(self.second),
                        GcThing::from_cons_ptr(self.third));
        refs.into_iter()
    }
}

What causes a garbage collection? As we allocate GC things, GC pressure increases. Once that pressure crosses a threshold — BAM! — a collection is triggered. Oxischeme's pressure application and threshold are very naive at the moment: every N allocations a collection is triggered, regardless of size of the heap or size of individual allocations.

A root is an object in the heap graph that is known to be live and reachable. When marking, we start tracing from the set of roots. For example, in Oxischeme, the global environment is a GC root.

In addition to permanent GC roots, like the global environment, sometimes it is necessary to temporarily root GC things referenced by pointers on the stack. Garbage collection can be triggered by any allocation, and it isn't always clear which Rust functions (or other functions called by those functions, or even other functions called by those functions called from the first function, and so on) might allocate a GC thing, triggering collection. The situation we want to avoid is a Rust function using a temporary variable that references a GC thing, then calling another function which triggers a collection and collects the GC thing that was referred to by the temporary variable. That results in the temporary variable becoming a dangling pointer. If the Rust function accesses it again, that is Undefined Behavior: it might still get the value it was pointing at, or it might be a segfault, or it might be a freshly allocated value being used by something else! Not good!

let a = pointer_to_some_gc_thing;
function_which_can_trigger_gc();
// Oops! A collection was triggered and dereferencing this
// pointer leads to Undefined Behavior!
*a;

There are two possible solutions to this problem. The first is conservative garbage collection, where we walk the native stack and if any value on the stack looks like it might be a pointer and if coerced to a pointer happens to point to a GC thing in the heap, we assume that it is in fact a pointer. Under this assumption, it isn't safe to reclaim the object pointed to, and so we treat that GC thing a root. Note that this strategy is simple and easy to retrofit because it doesn't involve changes in any code other than adding the stack scanning, but it results in false positives. The second solution is precise rooting. With precise rooting, it is the responsibility of the Rust function's author to explicitly root and unroot pointers to GC things used in variables on the stack. The advantage this provides is that there are no false positives: you know exactly which stack values are pointers to GC things. The disadvantage is the requirement of explicitly telling the GC about every pointer to a GC thing you ever reference on the stack.

Almost every modern, high performance tracing collector for managed runtimes uses precise rooting because it is a prerequisite^* of a moving collector: a GC that relocates objects while performing collection. Moving collectors are desirable because they can compact the heap, creating a smaller memory footprint for programs and better cache locality. They can also implement pointer bumping allocation, that is both simpler and faster than maintaining a free list. Finally, they can split the heap into generations. Generational GCs gain performance wins from the empirical observation that most allocations are short lived, and those objects that are most recently allocated are most likely to be garbage, so we can focus the GC's efforts on them to get the most bang for our buck. Precise rooting is a requirement for a moving collector because it has to update all references to point to the new address of each moved GC thing. A conservative collector doesn't know for sure if a given value on the stack is a reference to a GC thing or not, and if the value just so happens not to be a reference to a GC thing (it is a false positive), and the collector "helpfully" updates that value to the moved address, then the collector is introducing migraine-inducing bugs into the program execution.

^* Technically, there do exist some moving and generational collectors that are "mostly copying" and conservatively mark the stack but precisely mark the heap. These collectors only move objects which are not conservatively reachable.

Oxischeme uses precise rooting, but is not a moving GC (yet). Precise rooting is implemented with the oxischeme::heap::Rooted<T> smart pointer RAII type, which roots its referent upon construction and unroots it when the smart pointer goes out of scope and is dropped.

Using precise rooting and Rooted, we can solve the dangling stack pointer problem like this:

{
    // The pointed to GC thing gets rooted when wrapped
    // with `Rooted`.
    let a = Rooted::new(heap, pointer_to_some_gc_thing);
    function_which_can_trigger_gc();
    // Dereferencing `a` is now safe, because the referent is
    // a GC root, and can't be collected!
    *a;
}
// `a` is now out of scope, and its referent is unrooted.

With all of that out of the way, here is the implementation of our mark-and-sweep collector:

impl Heap {
    pub fn collect_garbage(&mut self) {
        self.reset_gc_pressure();

        // First, trace the heap graph and mark everything that
        // is reachable.

        let mut pending_trace = self.get_roots();

        while !pending_trace.is_empty() {
            let mut newly_pending_trace = vec!();

            for thing in pending_trace.drain() {
                if !thing.is_marked() {
                    thing.mark();

                    for referent in thing.trace() {
                        newly_pending_trace.push(referent);
                    }
                }
            }

            pending_trace.append(&mut newly_pending_trace);
        }

        // Second, sweep each `ArenaSet`.

        self.strings.sweep();
        self.activations.sweep();
        self.cons_cells.sweep();
        self.procedures.sweep();
    }
}

Why do we have four calls to sweep, one for each type that Oxischeme implements? To explain this, first we need to understand Oxischeme's allocation strategy.

Oxischeme does not allocate each individual object directly from the operating system. In fact, most Scheme "allocations" do not actually perform any allocation from the operating system (eg, call malloc or Box::new). Oxischeme uses a set of oxischeme::heap::Arenas, each of which have a preallocated object pool with each item in the pool either being used by live GC things, or waiting to be used in a future allocation. We keep track of an Arena's available objects with a "free list" of indices into its pool.

type FreeList = Vec<usize>;

/// An arena from which to allocate `T` objects from.
pub struct Arena<T> {
    pool: Vec<T>,

    /// The set of free indices into `pool` that are available
    /// for allocating an object from.
    free: FreeList,

    /// During a GC, if the nth bit of `marked` is set, that
    /// means that the nth object in `pool` has been marked as
    /// reachable.
    marked: Bitv,
}

When the Scheme program allocates a new object, we remove the first entry from the free list and return a pointer to the object at that entry's index in the object pool. If the every Arena is at capacity (ie, its free list is empty), a new Arena is allocated from the operating system and its object pool is used for the requested Scheme allocation.

impl<T: Default> ArenaSet<T> {
    pub fn allocate(&mut self) -> ArenaPtr<T> {
        for arena in self.arenas.iter_mut() {
            if !arena.is_full() {
                return arena.allocate();
            }
        }

        let mut new_arena = Arena::new(self.capacity);
        let result = new_arena.allocate();
        self.arenas.push(new_arena);
        result
    }
}

impl<T: Default> Arena<T> {
    pub fn allocate(&mut self) -> ArenaPtr<T> {
        match self.free.pop() {
            Some(idx) => {
                let self_ptr : *mut Arena<T> = self;
                ArenaPtr::new(self_ptr, idx)
            },
            None => panic!("Arena is at capacity!"),
        }
    }
}

For simplicity, Oxischeme has separate arenas for separate types of objects. This sidesteps the problem of finding an appropriately sized free block of memory when allocating different sized objects from the same pool, the fragmentation that can occur because of that, and lets us use a plain old vector as the object pool. However, this also means that we need a separate ArenaSet<T> for each T object that a Scheme program can allocate and why oxischeme::heap::Heap::collect_garbage has four calls to sweep().

During the sweep phase of Oxischeme's garbage collector, we return the entries of any dead object back to the free list. If the Arena is empty (ie, the free list is full) then we return the Arena's memory to the operating system. This prevents retaining the peak amount of memory used for the rest of the program execution.

impl<T: Default> Arena<T> {
    pub fn sweep(&mut self) {
        self.free = range(0, self.capacity())
            .filter(|&n| {
                !self.marked.get(n)
                    .expect("marked should have length == self.capacity()")
            })
            .collect();

        // Reset `marked` to all zero.
        self.marked.set_all();
        self.marked.negate();
    }
}

impl<T: Default> ArenaSet<T> {
    pub fn sweep(&mut self) {
        for arena in self.arenas.iter_mut() {
            arena.sweep();
        }

        // Deallocate any arenas that do not contain any
        // reachable objects.
        self.arenas.retain(|a| !a.is_empty());
    }
}

This concludes our tour of Oxischeme's current implementation of memory management, allocation, and garbage collection for Scheme programs. In the future, I plan to make Oxischeme's collector a moving collector, which will pave the way for a compacting and generational GC. I might also experiment with incrementalizing marking for lower latency and shorter GC pauses, or making sweeping lazy. Additionally, I intend to declare to the rust compiler that operations on un-rooted GC pointers unsafe, but I haven't settled on an implementation strategy yet. I would also like to experiment with writing a syntax extension for the Rust compiler so that it can derive Trace implementations, and they don't need to be written by hand.

Thanks to Tom Tromey and Zach Carter for reviewing drafts.

References

Oxischeme, a Scheme implementation in Rust
Tracing Garbage Collection — Wikipedia
A Unified Theory of Garbage Collection by Bacon, et all. Great paper!
Improving Python's Memory Allocator by Evan Jones. This was a nice resource when planning Oxischeme's allocation strategy. Interesting side note: CPython uses reference counting and a cycle detector rather than a tracing collector.