Back to the Futu-rr-e: Deterministic Debugging with rr

November 2nd, 2015

What follows is an embloggified version of an introduction to the rr debugger that I gave at a local Linux Users Group meet up on October 6th, 2015.

Hello, everyone! My name is Nick Fitzgerald and I'm here to talk about a (relatively) new kid on the block: the rr debugger. rr is built by some super folks working at Mozilla, and I work at Mozilla too, but not with them and I haven't contributed to rr (yet!) — I'm just a very happy customer!

At Mozilla, I am a member of the Firefox Developer Tools team. I've done a good amount of thinking about bugs, where they come from, the process of debugging, and the tools we use in the process. I like to think that I am a fairly savvy toolsmith.

These days, I'm doing less of building the devtools directly and more of baking APIs into Gecko (Firefox's browser engine) and SpiderMonkey (Firefox's JavaScript engine) that sit underneath and support the devtools. That means I'm writing a lot of C++.

And rr has quickly become the number one tool I reach for when debugging complicated C++ code. rr only runs on Linux and I don't even use Linux as my day-to-day operating system! But rr provides such a great debugging experience, and gives me such a huge productivity boost, that I will reboot into Fedora just to use rr for all but the most trivial bugs. Take note of that, Linux advocates.

But before we dive into rr and why it is so awesome, I'd like to step back and talk a bit about the anatomy of a bug and the process of debugging.

A bug has a symptom, for example the program crashes or something that was supposed to happen didn't, and it has an origin, where things first started going wrong or misbehaving. Between the origin and the symptom is a cause and effect chain, which begins at the bug's origin and ends with the symptom you've observed.

Initially, the cause and effect chain and the origin are hidden from us. When first investigating a bug, we only know its final symptoms.

But what we really want to know is the root cause or origin of the bug, because that is where we need to apply a fix and resolve the bug. To get there, we have to reason backwards through the cause and effect chain. The process of debugging is the process of reasoning backwards through the cause and effect chain from the bug's symptom to its origin.

It follows that all debugging is "reverse debugging", but not all debuggers support reverse execution. And even among those debuggers that do support reverse execution, going backwards and forwards repeatedly can give you different results due to nondeterminism in the program.

Let's take a look at the ergonomics of walking backwards through the cause and effect chain in a traditional stepping debugger without reverse execution.

With a traditional debugger, we slowly reason our way backwards while the program executes forwards, and we end up in a loop that looks something like this:

  1. Pause or break at the point in the program's execution where things start to go wrong. "The program is crashing at D, let's pause at that point."

  2. Poke around, inspect variables, frames on the stack, etc. and discover the immediate cause of the effect you have observed. "Oh, I see: D is being passed a null pointer because of C!"

  3. But we get to the point where in order to find the next immediate cause, we need to either go backwards in time or restart the program's execution. Since the former is not an option in traditional debuggers, we are forced to restart the program. "But what conditions led to C giving out a null pointer? I'll set this breakpoint and restart the program."

  4. Re-execute the program until you hit the breakpoint. Hopefully there aren't many false positives; conditional breakpoints can help cut down on them.

  5. Loop back to (2) until we've found the origin of the bug.

I really hope we can reliably and consistently reproduce this bug, or else (4) will involve re-running the program many times! And if the cause and effect chain is long, that means we are re-running the program many, many, many times! This productivity drain is compounded by false positives and by manual, non-automatable steps to reproduce the bug. Not a good look.

If we shoehorn the debugging process afforded by a traditional debugger into big-O notation, we get O(d∙r) where d is the length of the cause and effect chain, and r is the inverse reproducibility, meaning that we need to execute the program about r times to reproduce the bug once.

When both d and r are very small, discovering the bug's origin is trivial. We aren't interested in this class of bugs; a traditional debugger will already serve just fine. When d or r grow larger, we quickly spend more and more time debugging and our productivity slows to a crawl. When both d and r are large, bugs can feel (and in practice be) impossible to resolve. It can even be so impractical that it's not worth spending time on and makes more business sense to let the bug live on! It's a sad situation to be in.

These really difficult to diagnose and fix bugs are the type we are interested in. If we can better debug this class of bug, we increase our productivity and can even fix bugs that were impossible or impractical to fix before. This results in better, more reliable software that is delivered faster.

rr provides a fundamentally better debugging process than traditional debuggers and lives up to the promise of turning seemingly impossible-to-fix bugs into merely difficult ones. It untangles reproducing a bug from investigating the cause and effect chain by splitting the debugging process into two phases: recording a program's execution, and then replaying and debugging that recording deterministically.

Using rr is kind of like investigating a theft at a corner store. The store has a CCTV system recording everything going on inside. To solve the store's security problem, you could hire someone to watch the CCTV feed in real time, 24 hours a day, and try to spot a thief in action, but that is expensive, manual, and time consuming. Alternatively, you could skip the time and money spent watching the live feed and wait until you've captured a thief on camera. At that point, you can watch the recording at your leisure: rewind, replay, fast forward, and enhance to determine who the thief is and how to make that theft impossible in the future. rr works the same way: you keep recording until you've captured an occurrence of your bug, and then you replay the recording to determine the origin of why that bug happened.


With rr, the first phase is recording an execution of your program that reproduces the bug in question. The recording must include everything needed to replay the program execution exactly. The key observation that rr leverages is that most of a program's execution is deterministic. Therefore, if you record all of the nondeterministic parts, then the rest of the program (the deterministic parts) will by definition always yield the same results. This also turns out to have less recording overhead than alternative techniques.

Examples of the nondeterminism that rr must record include system calls, signals, and the scheduling of threads.

For system calls, ptrace is used to wrap the call, note its occurrence and return value in the recording, and let the call return.

Concurrency and parallelism are a little trickier. True parallelism with shared memory can't be recorded efficiently with today's hardware. Instead, rr only supports concurrency: all threads are pinned to a single core. rr counts the number of instructions executed between preemptions so that it can recreate those preemptions exactly in the replay phase. Unfortunately, this means that very parallel programs will run much slower under rr than they otherwise would.


The second phase is replaying the recording. Mostly, this phase is just running the program again and letting the deterministic majority of it recompute the same things it originally did. That is what lets rr's overhead stay low when replaying.

The nondeterministic parts of the program's execution are emulated from the original execution's recording in a couple of different ways. System calls are intercepted, and the corresponding value from the recording is returned instead of continuing with the actual system call. To emulate the exact preemption points between threads, rr implements its own scheduler and uses hardware performance counters to interrupt after the desired number of instructions (previously noted in the recording) have been executed; rr's scheduler then handles the interrupt and switches contexts. Replaying signals is handled similarly.

If you want more implementation details, this slide deck is a great deep dive into rr's internals for potential contributors. It is from April, 2014, so a little dated, but my understanding is that rr is just more featureful and hasn't been fundamentally re-architected since then.

By splitting out reproducing a failure from investigating the cause and effect chain, rr's debugging process is O(r+d). This is the time spent reproducing the bug once while we are recording, plus the time spent deducing the cause and effect chain. We don't need to continually reproduce the bug anymore! Only once ever! This is a huge improvement over traditional debuggers, and if we can automate recording the steps to reproduce until the bug occurs with a script, then it is only O(d) of our personal time!

This is Treeherder. All of Firefox, Gecko, and SpiderMonkey live inside the big mozilla-central repository. Every day, there are hundreds of commits (from both paid Mozilla employees and volunteer contributors) landing on mozilla-central, one of its integration repositories (mozilla-inbound for platform changes and fx-team for frontend changes), or the Try testing repository. Every one of these commits gets tested with our continuous integration testing. Treeherder displays these test results: green is a test suite that passed, orange means that one or more tests in that suite failed.

We have a lot of tests in mozilla-central. Some quick and dirty shell scripting yielded just under 30,000 test files, and I'm sure that count wasn't exhaustive. Because we're all human, we write tests with bugs, and those tests intermittently fail. If a test is failing consistently, it is usually pretty easy to fix. But if a test only fails one out of a thousand times, it is very hard to debug. Still, it has a real impact: because of the sheer number of commits on and tests within mozilla-central, we get many intermittent failures on every test run. These failures are a drag on developer time, since developers have to check whether each failure is "real" or not, and while they are often just poorly written tests that contain race conditions only in the test code, sometimes they are real bugs that go on to affect real users.

rr puts these bugs that are seemingly impossible to debug within reach. Mozilla infrastructure folks are investigating what is required to optionally run these test suites under rr during our integration tests, throwing away the recording if an intermittent failure doesn't occur, and saving the recording for developers to debug at their leisure if one does.

Wow! Hopefully you're convinced that rr is worth using. Most of the familiar commands from gdb are still present since it uses gdb as a frontend. You should definitely read the wiki page on rr's usage. Here are a few tips and tricks so you can hit the ground running.

rr supports gdb's reverse execution commands: reverse-stepi, reverse-step, reverse-next, reverse-finish, and reverse-continue. These are the same as the normal stepping commands, but they go backwards instead of forwards in the execution: backwards by one instruction, one line, or even a whole frame. What's really neat about this is that going backwards and forwards repeatedly is deterministic, so you'll always get the same results.

Step backwards through the cause and effect chain directly, rather than piecing it back together with repeated forward execution!

Here is a real example from a couple of months ago where reverse stepping saved me a ton of time. I was implementing an extension to the structured cloning algorithm to handle my SavedFrame stack objects so that they could be sent between WebWorkers. The function in question returns nullptr on failure, and it performs a lot of potentially fallible operations: twelve, to be exact. I had a test that was failing. The function was called many times and usually it would succeed, but one time it would fail and return nullptr because one of its fallible operations was failing, and I had no idea which one or why. With a traditional debugger, my options were to set twelve breakpoints, one on every return nullptr;, or to break on every entry of the function (which is called many times and usually succeeds) and then step through line by line. Both options were tedious. With rr, I just set a conditional breakpoint for when my call to the function returned nullptr and then did a single reverse-step command. I was taken directly to the failing operation, and from there the origin of the bug quickly became apparent.
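
As a rough sketch of what that looked like at the rr prompt (the file name, line number, and variable here are hypothetical stand-ins, not the actual SpiderMonkey source):

(rr) break StructuredClone.cpp:512 if !clonedSavedFrame
(rr) continue
(rr) reverse-step

The conditional breakpoint only fires on the one call that actually produced a nullptr, and the single reverse-step lands on whichever return nullptr; was responsible.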

In the end, it turned out to be embarrassingly simple. I was doing this:

result = fallibleOperation();
if (result)
    return nullptr;

instead of this:

result = fallibleOperation();
if (!result)
    return nullptr;

Note the ! in the latter code snippet.

This is the Holy Grail of debuggers: watchpoints combined with reverse-continue. If you could most directly translate "reasoning backwards through the cause and effect chain" into a debugger command, it would be this. When a value has become corrupted, or an instance's member has an unexpected value, we can set a hardware watchpoint on the value or member and then reverse-continue directly to the last point in the program's execution when that value was modified.

This also works with breakpoints and conditional breakpoints in addition to watchpoints.
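
The whole workflow can be as short as this (the member name here is made up):

(rr) watch -l thing->mState
(rr) reverse-continue

Execution runs backwards from the current point and stops at the most recent write to that location, which is usually exactly the line you were trying to find.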

A few months ago, I was doing a little refactoring of the way that SpiderMonkey's Debugger API objects interact with the garbage collector. When SpiderMonkey finishes its final shutdown GC, it asserts that every Debugger API object has been reclaimed. Somehow, my changes broke that invariant. I couldn't use the memory tools we have been building to help find retainers, because this was the middle of shutdown and those APIs weren't designed with that constraint in mind. Additionally, the core marking loop of the GC is very, very hot and is implemented as a bunch of gotos. Using a traditional debugger and setting a conditional breakpoint for when my "leaking" Debugger API object was processed in the marking loop was useless; it gave no clue as to why it was being marked, or who was retaining it and why they were retained in turn. Using rr, I did set a conditional breakpoint for when my Debugger API object was pushed on the "to-mark" stack, but then I reverse-stepped to find what object was being marked and queueing my Debugger API object for marking in turn. Using the combination of conditional breakpoints and rr's reverse execution, I was able to walk the retaining path backwards through the GC's marking loop to figure out why my Debugger API object was sticking around and not getting reclaimed. It turned out that my refactoring marked a global's Debugger API objects unconditionally, when they should only be marked if the global was still live.

Thanks for bearing with me! I hope some of my excitement for rr has been transferred to you. I definitely feel like I can debug more challenging problems than I otherwise would be able to.

Go forth and use rr! Happy debugging!

That was everything I presented to the Linux Users Group, but since I gave that presentation I've accumulated one more good rr debugging anecdote:

We (the devtools team) use a series of protobuf messages to serialize and deserialize heap snapshots to and from core dump files. While we are only just building a frontend to this functionality in the devtools, the APIs for saving and reading heap snapshots have existed for a while and have pretty good test coverage. In the process of testing the new frontend, we would get seemingly random errors originating from within the protobuf library when reading heap snapshots.

I don't know the protobuf internals; the library is a black box to me, and on top of that a lot of it is machine-generated by the protoc compiler. But I rebooted into Fedora, fired up rr, and reproduced the bug. Starting from where the protobuf message parsing API was returning false to denote a parse failure (with no message or error code explaining why parsing failed), I started stepping backwards. It quickly became apparent that I was repeatedly stepping backwards through frames belonging to the same handful of mutually recursive functions (clue #1) that were unwinding the stack and returning false along the way. So I set a breakpoint on the return false; statements I was repeatedly stepping through and then used gdb's breakpoint commands to help automate my debugging process.

(rr) commands <breakpoint number>
Type commands for when breakpoint <breakpoint number> is hit, one per line.
End with a line saying just "end".
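
In this case the command body for each breakpoint was just another reverse-finish, so each definition looked roughly like this (the breakpoint number is whatever gdb assigned):

(rr) commands 2
Type commands for when breakpoint 2 is hit, one per line.
End with a line saying just "end".
> reverse-finish
> end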

After I added this command to each breakpoint, I kicked the process off with a single reverse-finish, which triggered a cascading series of breakpoint hits, each followed by another reverse-finish from the automated breakpoint commands. In a few seconds, I rewound a couple hundred mutually recursive stack frames. I ended up on this line within the protobuf library:

if (!input->IncrementRecursionDepth()) return false;

It turns out that the protobuf library protects against malicious, deeply nested messages blowing the stack by implementing a recursion limit on how deeply messages can nest. Thanks! (Although I still wish it gave some indication of why it was failing…) Our core dump protobuf message format was designed so that we use a lot of individual messages rather than one big nested structure, but there was one case where we were hitting the limit, and that led to this bug. Luckily, it was pretty easy to fix. I can only imagine how hard it would have been to debug this without rr!

Proposal for Encoding Source-Level Environment Information Within Source Maps

July 22nd, 2015

A month ago, I wrote about how source maps are an insufficient debugging format for the web. I tried to avoid too much gloom and doom, and focus on the positive aspect. We can extend the source map format with the environment information needed to provide a rich, source-level debugging experience for languages targeting JavaScript. We can do this while maintaining backwards compatibility.

Today, I'm happy to share that I have a draft proposal and a reference implementation for encoding source-level environment information within source maps. It's backwards compatible, compact, and future extensible. It enables JavaScript debuggers to rematerialize source-level scopes and bindings, and locate any given binding's value, even if that binding does not exist in the compiled JavaScript.

I look forward to the future of debugging languages that target JavaScript.

Interested in getting involved? Join the discussion.

Source Maps are an Insufficient Debugging Format for the Web

June 19th, 2015

I want exceptional debugging support for programs compiled to JavaScript (or wasm) on the web. I want first class developer tools support for these programs. Unfortunately, this isn't the case right now because source maps are currently insufficient.

When boiled down to their essence, stepping debuggers enable users to do two things:

  1. set breakpoints and incrementally step through their program's execution, and
  2. inspect the values of variables and function parameters as they step through their program's execution.

A debugging format encodes information about a program that is lost at compile time, enabling source-level debugging of the compiled program. Source maps — the sole debugging format for compilers targeting JavaScript — can reconstruct source-level location information (filename, line, and column) which enables (1). However, that is the only data source maps encode, and the source map format does not attempt to solve (2). The result is that debugging programs compiled to JavaScript is positively maddening.

The good news is that we can overcome this limitation and extend the source map format to support reconstructing source-level scopes, bindings, and values. Before we do that, we should understand how mature debugging formats encode this information and how it is in turn consumed by debuggers. The prevailing debugging data format for *nix-based systems today is DWARF. It supersedes earlier ad-hoc formats, such as stabs. DWARF's designers worked with earlier formats for years, and used that experience as the foundation upon which DWARF was built. What follows is an exploration of how DWARF encodes scope and binding debugging information while keeping an eye out for lessons we can learn and apply when extending source maps.

Compact Binary Format

DWARF uses a compact binary record format to store structured information, conceptually similar to Cap'n Proto or Message Pack serialization formats. These Debugging Information Entries, or DIEs, are made up of a type tag, a set of attribute type and attribute value pairs, and an optional set of child DIEs. With these children, DIEs form a tree structure.

<DIE tag 1>
    <attribute name and type 1>
    <attribute value 1>
    <attribute name and type 2>
    <attribute value 2>
    <attribute name and type 3>
    <attribute value 3>

<DIE tag 1>
    <attribute name and type 1>
    <attribute value 1>
    <attribute name and type 2>
    <attribute value 2>
    <attribute name and type 3>
    <attribute value 3>

<DIE tag 1>
    <attribute name and type 1>
    <attribute value 1>
    <attribute name and type 2>
    <attribute value 2>
    <attribute name and type 3>
    <attribute value 3>


Furthermore, the metadata describing common attribute shapes can be factored out and de-duplicated with abbreviations. By using abbreviations, that previous series of DIEs can be compressed into this:

<abbreviation definition 1>
    <attribute name and type 1>
    <attribute name and type 2>
    <attribute name and type 3>

<DIE abbreviation tag 1>
    <attribute value 1>
    <attribute value 2>
    <attribute value 3>

<DIE abbreviation tag 1>
    <attribute value 1>
    <attribute value 2>
    <attribute value 3>

<DIE abbreviation tag 1>
    <attribute value 1>
    <attribute value 2>
    <attribute value 3>


With abbreviations, none of the attribute name and type metadata is repeated. This helps keep the debugging data compact. Debugging data can be huge in practice. Source maps can already be many megabytes in size, and the format already strives for as much de-duplication of data as it can while only encoding limited debugging information as it is. This issue is exacerbated for source maps, because they are sent over the network and downloaded by debuggers. We must go to great lengths to keep debugging formats compact.

Additionally, because the set of attributes a DIE contains is not fixed, the format is future extensible. New attributes can be added to provide new debugging information that wasn't previously encoded. If a debugger does not know how to interpret a given attribute, it can skip and ignore it. This is a huge deal for the web, where we tend to take many baby steps rather than a few giant leaps at a time. Any extension to the source map format must strive to remain future extensible.

Lexical Scopes and Bindings

DWARF encodes data about the program's source-level scopes, bindings, and values in a DIE tree.

Consider this example program, fact.c:

#include "stdio.h"

static const int FACT_N = 6;

int main(int argc, char **argv) {
    int i = FACT_N;
    int result = 1;

    while (i > 0) {
        result *= i;

        int decrement = 1;
        i -= decrement;

    printf("%d\n", result);

    return 0;

Within this source, there are three scopes we are interested in: the top level, the main function, and the block within the while loop. These scopes nest to form a tree (it would be a more interesting tree with a more complicated example program). DWARF encodes this tree directly with DIEs: scopes have their variables as child DIEs, and nested scopes are also child DIEs of their parent scope's DIE. Different kinds of bindings, such as formal parameters vs. local variables vs. constants, are declared with separate DIE type tags.

Below is a trimmed down subset of the information clang -g fact.c provides to debuggers via DWARF. It depicts the outline of the scope tree and the various bindings within those scopes. You can use the dwarfdump command line tool to view the text representation of the binary DIEs.

0x00000026:     TAG_variable [2]
                 AT_name( "FACT_N" )

0x0000003e:     TAG_subprogram [5] *
                 AT_name( "main" )

0x0000005e:         TAG_formal_parameter [6]
                     AT_name( "argc" )

0x0000006c:         TAG_formal_parameter [6]
                     AT_name( "argv" )

0x0000007a:         TAG_variable [7]
                     AT_name( "i" )

0x00000088:         TAG_variable [7]
                     AT_name( "result" )

0x00000096:         TAG_lexical_block [8] *

0x000000a7:             TAG_variable [7]
                         AT_name( "decrement" )

0x000000b5:             NULL

0x000000b6:         NULL

0x000000c8:     NULL

Additionally, the start and end target location bounds of these scopes are embedded as well. These bounds are denoted with the AT_low_pc and AT_high_pc DIE attributes. When the debuggee pauses at a given target location, the debugger finds the innermost scope containing the current target location, and can walk up the DIE tree to find every binding that is in scope.

Here is the DWARF data describing the bounds for the main function, and the while loop's block:

0x0000003e:     TAG_subprogram [5] *
                 AT_name( "main" )
                 AT_low_pc( 0x0000000100000f00 )
                 AT_high_pc( 0x0000000100000f75 )

0x00000096:         TAG_lexical_block [8] *
                     AT_low_pc( 0x0000000100000f31 )
                     AT_high_pc( 0x0000000100000f54 )

Locating a Binding's Value

It is not enough to list the names of bindings that are in scope when paused during the program's execution. A debugger must also provide the current values of those bindings so that the user may inspect them for irregularities.

Often, a binding's value is simply at some constant offset from the frame pointer. Other times it is in a specific register. Other times, the situation is a more complicated combination of these cases: for the first n instructions of this function, the parameter's value is located in this register, for the rest it is at that constant offset from the frame pointer. Even other times, a value is not located in one contiguous location! An optimizing compiler is free to explode a struct's members and store one member in a register, another member at this offset from the stack pointer, and to avoid even instantiating a value for the last member because that member is never used by the program.

To handle all of these cases, DWARF employs "location descriptions". These descriptions are expressed in stack-based operations that form a Turing-complete language.
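
For a rough flavor of what these look like (simplified, and written with the standard DWARF operator names rather than dwarfdump's exact output), a local variable living at a fixed offset from the frame base and a parameter living in a register might be described with location attributes along these lines:

DW_AT_location( DW_OP_fbreg -8 )
DW_AT_location( DW_OP_reg5 )

The first says "eight bytes below the frame base"; the second says "the value lives in register 5". More complicated locations are built by composing longer sequences of these stack operations.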

JavaScript does not have frame pointers or registers. It has local, global, lexical, and dynamic (via this) bindings. It has objects and arrays with properties and indices. Because of that mismatch, it doesn't make sense for the source map format to adopt DWARF's location descriptions. Luckily, JavaScript engines already implement a sandboxed language for accessing and manipulating JavaScript values: JavaScript. We can embed snippets of JavaScript within the debugging data to locate a binding's value.


We can extend the source map format to encode the source-level lexical environment, its scopes and bindings, and a way to locate those bindings' values. This enables users compiling their programs to JavaScript to do the other half of their debugging: inspecting variables and values as they step through their programs instead of stepping blindly.

However, we must ensure that source maps remain compact for quick downloads and parse times. The format should also be future extensible so we can add additional features down the line and don't need to reach consensus on everything all at once.

Finally, we should follow DWARF's lead and encode the scopes as a tree and embed within each scope its set of bindings, their symbol names, the scope's location boundaries, types, etc. We can reuse JavaScript as our language for describing the location of a binding's value.

If you are interested in being a part of these discussions, please join in at the Source Map RFC Repository.

Thanks to Tom Tromey and Dave Herman for reading drafts.


A Compact Representation of Captured Stack Frames for SpiderMonkey

April 10th, 2015

Last year, I implemented a new, compact representation for captured JavaScript stacks in SpiderMonkey [1]. This was done to support the allocation tracking I built into SpiderMonkey's Debugger API [2][3], which saves the stack for potentially every Object allocation in the observed globals. That's a lot of stacks, and so it was important to make sure that we incurred as little overhead to memory usage as we could. Even more so, because we'd like to affect memory usage as little as possible while simultaneously instrumenting and measuring it.

The key observation is that while there may be many allocations, there are many fewer allocation sites. Most allocations happen at a few places. Thus, much information is duplicated across stacks. For example, if we run the esprima JavaScript parser on its own JavaScript source, there are approximately 54,700 total Object allocations, but just ~1,400 unique JS stacks at allocation time. There are only ~200 allocation sites if you consider only the youngest stack frame.

Consider the example below:

function a() {
    b();
}

function b() {
    c();
    d();
    e();
}

function c() { new Object; }
function d() { new Object; }
function e() { new Object; }

Disregarding allocations removed by compiler optimizations, arguments object allocations, and engine-internal allocations, calling the a function allocates three Objects. With arrows representing the "called by" relationship from a younger frame to its older, parent frame, this is the set of stacks that existed during an Object allocation:

c -> b -> a
d -> b -> a
e -> b -> a

Instead of duplicating all these saved stack frames, we use a technique called hash consing. Hash consing lets us share older tail stack frames between many unique stacks, similar to the way a trie shares string prefixes. With hash consing, SpiderMonkey internally represents those stacks like this:

c -> b -> a
d ---|
e ---'

Each frame is stored in a hash table. When saving new stacks, we use this table to find pre-existing saved frames. If such an object is already extant, it is reused; otherwise a new saved frame is allocated and inserted into the table. During the garbage collector's sweep phase, we remove old entries from this hash table that are no longer used.
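
As an illustrative sketch of hash consing (in Rust rather than SpiderMonkey's actual C++, with made-up types and none of the GC integration):

use std::collections::HashSet;
use std::rc::Rc;

// A hypothetical saved frame: the data for the youngest frame plus a
// shared pointer to its (already interned) parent, or None for the oldest.
#[derive(PartialEq, Eq, Hash)]
struct SavedFrame {
    function_name: String,
    line: u32,
    parent: Option<Rc<SavedFrame>>,
}

#[derive(Default)]
struct FrameTable {
    frames: HashSet<Rc<SavedFrame>>,
}

impl FrameTable {
    // Return the canonical, shared copy of `frame`, inserting it into the
    // table if this is the first time it has been seen.
    fn intern(&mut self, frame: SavedFrame) -> Rc<SavedFrame> {
        if let Some(existing) = self.frames.get(&frame) {
            return existing.clone();
        }
        let shared = Rc::new(frame);
        self.frames.insert(shared.clone());
        shared
    }
}

Frames are interned from oldest to youngest, so two stacks that share a tail end up sharing the same parent chain; in SpiderMonkey, entries are additionally dropped from the table during the GC's sweep phase, as described above.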

I just landed a series of patches to make Error objects use these hash cons'd stacks internally, and lazily stringify the stack when the stack property is accessed [4]. Previously, every time an Error object was created, the stack string was eagerly computed. This change can result in significant memory savings for JS programs that allocate many Error objects. In one Real World™ test case [5], we dropped memory usage from 1.2GB down to 167MB. Additionally, we use almost half of the memory that V8 uses on this same test case.

Many thanks to Jim Blandy for guidance and reviews. Thanks also to Boris Zbarsky, Bobby Holley, and Jason Orendorff for insight into implementing the security wrappers needed for passing stack frame objects between privileged and un-privileged callers. Thanks to Paolo Amadini for adding support for tracing stacks across asynchronous operations (more about this in the future). Thanks again to Boris Zbarsky for making DOMExceptions use these hash cons'd stacks. Another thanks to Jason Orendorff for reviewing my patches to make the builtin JavaScript Error objects use hash cons'd stacks behind the scenes.

In the future, I would like to make capturing stacks even faster by avoiding re-walking stack frames that we have already walked while capturing another stack [6]. The cheaper capturing a stack is, the more ubiquitous it can be throughout the platform, and that puts more context and debugging information into developers' hands.

Memory Management in Oxischeme

February 22nd, 2015

I've recently been playing with the Rust programming language, and what better way to learn a language than to implement a second language in the language one wishes to learn?! It almost goes without saying that this second language being implemented should be Scheme. Thus, Oxischeme was born.

Why implement Scheme instead of some other language? Scheme is a dialect of LISP and inherits the simple parenthesized list syntax of its LISP-y origins, called s-expressions. Thus, writing a parser for Scheme syntax is rather easy compared to doing the same for most other languages' syntax. Furthermore, Scheme's semantics are also minimal. It is a small language designed for teaching, and writing a metacircular interpreter (ie a Scheme interpreter written in Scheme itself) takes only a few handfuls of lines of code. Finally, Scheme is a beautiful language: its design is rooted in the elegant λ-calculus.

Scheme is not "close to the metal" and doesn't provide direct access to the machine's hardware. Instead, it provides the illusion of infinite memory and its structures are automatically garbage collected rather than explicitly managed by the programmer. When writing a Scheme implementation in Scheme itself, or any other language with garbage collection, one can piggy-back on top of the host language's garbage collector, and use that to manage the Scheme's structures. This is not the situation I found myself in: Rust is close to the metal, and does not have a runtime with garbage collection built in (although it has some other cool ideas regarding lifetimes, ownership, and when it is safe to deallocate an object). Therefore, I had to implement garbage collection myself.

Faced with this task, I had a decision to make: tracing garbage collection or reference counting?

Reference counting is a technique where we keep track of the number of other things holding references to any given object or resource. When new references are taken, the count is incremented. When a reference is dropped, the count is decremented. When the count reaches zero, the resource is deallocated and it decrements the reference count of any other objects it holds a reference to. Reference counting is great because once an object becomes unreachable, it is reclaimed immediately and doesn't sit around consuming valuable memory space while waiting for the garbage collector to clean it up at some later point in time. Additionally, the reclamation process happens incrementally and program execution doesn't halt while every object in the heap is checked for liveness. On the negative side, reference counting runs into trouble when it encounters cycles. Consider the following situation:

A -> B
^    |
|    v
D <- C

A, B, C, and D form a cycle and all have a reference count of one. Nothing from outside of the cycle holds a reference to any of these objects, so it should be safe to collect them. However, because each reference count will never get decremented to zero, none of these objects will be deallocated. In practice, the programmer must explicitly use (potentially unsafe) weak references, or the runtime must provide a means for detecting and reclaiming cycles. The former defeats the general purpose, don't-worry-about-it style of managed memory environments. The latter is equivalent to implementing a tracing collector in addition to the existing reference counting memory management.
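
Rust's own reference-counted pointers make the problem easy to see. Here is a minimal sketch (plain Rust, nothing Oxischeme-specific) of a two-object cycle that is never reclaimed:

use std::cell::RefCell;
use std::rc::{Rc, Weak};

struct Node {
    // A strong reference to the next node keeps that node alive.
    next: RefCell<Option<Rc<Node>>>,
}

fn main() {
    let a = Rc::new(Node { next: RefCell::new(None) });
    let b = Rc::new(Node { next: RefCell::new(None) });

    // a -> b -> a: each node's reference count now stays at least one
    // forever, so neither is deallocated even after `a` and `b` go out
    // of scope. This memory is leaked.
    *a.next.borrow_mut() = Some(b.clone());
    *b.next.borrow_mut() = Some(a.clone());

    // Breaking the cycle requires the programmer to opt into a weak
    // edge somewhere; a `Weak` does not keep its referent alive.
    let _weak_edge: Weak<Node> = Rc::downgrade(&a);
}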

Tracing garbage collectors start from a set of roots and recursively traverse object references to discover the set of live objects in the heap graph. Any object that is not an element of the live set cannot be used again in the future, because the program has no way to refer to that object. Therefore, the object is available for reclaiming. This has the advantage of collecting dead cycles, because if the cycle is not reachable from the roots, then it won't be in the live set. The cyclic references don't matter because those edges are never traversed. The disadvantage is that, without a lot of hard work, when the collector is doing its bookkeeping, the program is halted until the collector is finished analyzing the whole heap. This can result in long, unpredictable GC pauses.

Reference counting is to tracing as yin is to yang. The former operates on dead, unreachable objects, while the latter operates on live, reachable things. Fun fact: every high performance GC algorithm (such as generational GC, or reference counting with "purple" nodes and trial deletion) uses a mixture of both, whether it appears so on the surface or not. "A Unified Theory of Garbage Collection" by Bacon et al. discusses this in depth.

I opted to implement a tracing garbage collector for Oxischeme. In particular, I implemented one of the simplest GC algorithms: stop-the-world mark-and-sweep. The steps are as follows:

  1. Stop the Scheme program execution.
  2. Mark phase. Trace the live heap starting from the roots and add every reachable object to the marked set.
  3. Sweep phase. Iterate over each object x in the heap:
    • If x is an element of the marked set, continue.
    • If x is not an element of the marked set, reclaim it.
  4. Resume execution of the Scheme program.

Because the garbage collector needs to trace the complete heap graph, any structure that holds references to a garbage collected type must participate in garbage collection by tracing the GC things it is holding alive. In Oxischeme, this is implemented with the oxischeme::heap::Trace trait, whose implementation requires a trace function that returns an iterable of GcThings:

pub trait Trace {
    /// Return an iterable of all of the GC things referenced by
    /// this structure.
    fn trace(&self) -> IterGcThing;
}

Note that Oxischeme separates tracing (generic heap graph traversal) from marking (adding live nodes in the heap graph to a set). This enables using Trace to implement other graph algorithms on top of the heap graph. Examples include computing dominator trees and retained sizes of objects, or finding the set of retaining paths of an object that you expected to be reclaimed by the collector, but hasn't been.

If we were to introduce a Trio type that contained three cons cells, we would implement tracing like this:

struct Trio {
    first: ConsPtr,
    second: ConsPtr,
    third: ConsPtr,
}

impl Trace for Trio {
    fn trace(&self) -> IterGcThing {
        let refs = vec!(GcThing::from_cons_ptr(self.first),
                        GcThing::from_cons_ptr(self.second),
                        GcThing::from_cons_ptr(self.third));
        refs.into_iter()
    }
}

What causes a garbage collection? As we allocate GC things, GC pressure increases. Once that pressure crosses a threshold — BAM! — a collection is triggered. Oxischeme's pressure application and threshold are very naive at the moment: every N allocations a collection is triggered, regardless of size of the heap or size of individual allocations.
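
A sketch of what that naive accounting might look like (the constant, field, and method names here are illustrative, not Oxischeme's actual ones):

// Trigger a collection after this many allocations, regardless of the
// size of the heap or of the individual allocations. (Illustrative only.)
const GC_PRESSURE_THRESHOLD: usize = 10_000;

struct Heap {
    allocations_since_gc: usize,
    // ... arenas, roots, and so on.
}

impl Heap {
    fn note_allocation(&mut self) {
        self.allocations_since_gc += 1;
        if self.allocations_since_gc >= GC_PRESSURE_THRESHOLD {
            self.collect_garbage();
            self.allocations_since_gc = 0;
        }
    }

    fn collect_garbage(&mut self) {
        // Mark and sweep, as shown later in this post.
    }
}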

A root is an object in the heap graph that is known to be live and reachable. When marking, we start tracing from the set of roots. For example, in Oxischeme, the global environment is a GC root.

In addition to permanent GC roots, like the global environment, sometimes it is necessary to temporarily root GC things referenced by pointers on the stack. Garbage collection can be triggered by any allocation, and it isn't always clear which Rust functions (or other functions called by those functions, or even other functions called by those functions called from the first function, and so on) might allocate a GC thing, triggering collection. The situation we want to avoid is a Rust function using a temporary variable that references a GC thing, then calling another function which triggers a collection and collects the GC thing that was referred to by the temporary variable. That results in the temporary variable becoming a dangling pointer. If the Rust function accesses it again, that is Undefined Behavior: it might still get the value it was pointing at, or it might be a segfault, or it might be a freshly allocated value being used by something else! Not good!

let a = pointer_to_some_gc_thing;
function_which_can_trigger_a_gc();
// Oops! A collection was triggered and dereferencing this
// pointer leads to Undefined Behavior!
*a;

There are two possible solutions to this problem. The first is conservative garbage collection, where we walk the native stack and, if any value on the stack looks like it might be a pointer and, when coerced to a pointer, happens to point to a GC thing in the heap, we assume that it is in fact a pointer. Under this assumption, it isn't safe to reclaim the object pointed to, and so we treat that GC thing as a root. Note that this strategy is simple and easy to retrofit, because it doesn't involve changes to any code other than adding the stack scanning, but it results in false positives. The second solution is precise rooting. With precise rooting, it is the responsibility of the Rust function's author to explicitly root and unroot pointers to GC things used in variables on the stack. The advantage this provides is that there are no false positives: you know exactly which stack values are pointers to GC things. The disadvantage is the requirement of explicitly telling the GC about every pointer to a GC thing you ever reference on the stack.

Almost every modern, high performance tracing collector for managed runtimes uses precise rooting because it is a prerequisite* of a moving collector: a GC that relocates objects while performing collection. Moving collectors are desirable because they can compact the heap, creating a smaller memory footprint for programs and better cache locality. They can also implement pointer-bumping allocation, which is both simpler and faster than maintaining a free list. Finally, they can split the heap into generations. Generational GCs gain performance wins from the empirical observation that most allocations are short lived, and that the most recently allocated objects are the most likely to be garbage, so we can focus the GC's efforts on them to get the most bang for our buck. Precise rooting is a requirement for a moving collector because the collector has to update all references to point to the new address of each moved GC thing. A conservative collector doesn't know for sure whether a given value on the stack is a reference to a GC thing or not, and if the value just so happens not to be a reference to a GC thing (it is a false positive), and the collector "helpfully" updates that value to the moved address, then the collector is introducing migraine-inducing bugs into the program execution.

* Technically, there do exist some moving and generational collectors that are "mostly copying" and conservatively mark the stack but precisely mark the heap. These collectors only move objects which are not conservatively reachable.

Oxischeme uses precise rooting, but is not a moving GC (yet). Precise rooting is implemented with the oxischeme::heap::Rooted<T> smart pointer RAII type, which roots its referent upon construction and unroots it when the smart pointer goes out of scope and is dropped.
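
As a rough sketch of the shape of such an RAII type (illustrative only; Oxischeme's real Rooted<T> differs in its details and talks to the Heap directly):

use std::cell::RefCell;
use std::rc::Rc;

// The heap's set of temporary roots (purely illustrative).
type RootSet = Rc<RefCell<Vec<*mut ()>>>;

// A smart pointer that keeps its referent rooted for as long as it lives.
struct Rooted<T> {
    roots: RootSet,
    ptr: *mut T,
}

impl<T> Rooted<T> {
    fn new(roots: RootSet, ptr: *mut T) -> Rooted<T> {
        // Root the referent upon construction.
        roots.borrow_mut().push(ptr as *mut ());
        Rooted { roots: roots, ptr: ptr }
    }
}

impl<T> Drop for Rooted<T> {
    fn drop(&mut self) {
        // Unroot the referent when the smart pointer goes out of scope.
        let mut roots = self.roots.borrow_mut();
        if let Some(i) = roots.iter().position(|&p| p == self.ptr as *mut ()) {
            roots.swap_remove(i);
        }
    }
}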

Using precise rooting and Rooted, we can solve the dangling stack pointer problem like this:

{
    // The pointed to GC thing gets rooted when wrapped
    // with `Rooted`.
    let a = Rooted::new(heap, pointer_to_some_gc_thing);
    function_which_can_trigger_a_gc();
    // Dereferencing `a` is now safe, because the referent is
    // a GC root, and can't be collected!
    *a;
}
// `a` is now out of scope, and its referent is unrooted.

With all of that out of the way, here is the implementation of our mark-and-sweep collector:

impl Heap {
    pub fn collect_garbage(&mut self) {
        // First, trace the heap graph and mark everything that
        // is reachable.
        let mut pending_trace = self.get_roots();

        while !pending_trace.is_empty() {
            let mut newly_pending_trace = vec!();

            for thing in pending_trace.drain() {
                if !thing.is_marked() {
                    thing.mark();
                    for referent in thing.trace() {
                        newly_pending_trace.push(referent);
                    }
                }
            }

            pending_trace.append(&mut newly_pending_trace);
        }

        // Second, sweep each `ArenaSet`. (One arena set per allocated
        // type; the field names here stand in for the real ones.)
        self.strings.sweep();
        self.cons_cells.sweep();
        self.activations.sweep();
        self.procedures.sweep();
    }
}

Why do we have four calls to sweep, one for each type that Oxischeme implements? To explain this, first we need to understand Oxischeme's allocation strategy.

Oxischeme does not allocate each individual object directly from the operating system. In fact, most Scheme "allocations" do not actually perform any allocation from the operating system (eg, call malloc or Box::new). Oxischeme uses a set of oxischeme::heap::Arenas, each of which have a preallocated object pool with each item in the pool either being used by live GC things, or waiting to be used in a future allocation. We keep track of an Arena's available objects with a "free list" of indices into its pool.

type FreeList = Vec<usize>;

/// An arena from which to allocate `T` objects from.
pub struct Arena<T> {
    pool: Vec<T>,

    /// The set of free indices into `pool` that are available
    /// for allocating an object from.
    free: FreeList,

    /// During a GC, if the nth bit of `marked` is set, that
    /// means that the nth object in `pool` has been marked as
    /// reachable.
    marked: Bitv,
}

When the Scheme program allocates a new object, we remove the first entry from the free list and return a pointer to the object at that entry's index in the object pool. If every Arena is at capacity (ie, its free list is empty), a new Arena is allocated from the operating system and its object pool is used for the requested Scheme allocation.

impl<T: Default> ArenaSet<T> {
    pub fn allocate(&mut self) -> ArenaPtr<T> {
        for arena in self.arenas.iter_mut() {
            if !arena.is_full() {
                return arena.allocate();
            }
        }

        let mut new_arena = Arena::new(self.capacity);
        let result = new_arena.allocate();
        self.arenas.push(new_arena);
        result
    }
}

impl<T: Default> Arena<T> {
    pub fn allocate(&mut self) -> ArenaPtr<T> {
        // Take an available index from the free list.
        match {
            Some(idx) => {
                let self_ptr : *mut Arena<T> = self;
                ArenaPtr::new(self_ptr, idx)
            },
            None => panic!("Arena is at capacity!"),
        }
    }
}

For simplicity, Oxischeme has separate arenas for separate types of objects. This sidesteps the problem of finding an appropriately sized free block of memory when allocating different sized objects from the same pool, the fragmentation that can occur because of that, and lets us use a plain old vector as the object pool. However, this also means that we need a separate ArenaSet<T> for each T object that a Scheme program can allocate and why oxischeme::heap::Heap::collect_garbage has four calls to sweep().

During the sweep phase of Oxischeme's garbage collector, we return the entries of any dead object back to the free list. If the Arena is empty (ie, the free list is full) then we return the Arena's memory to the operating system. This prevents retaining the peak amount of memory used for the rest of the program execution.

impl<T: Default> Arena<T> {
    pub fn sweep(&mut self) {
        // Any index whose object was not marked as reachable goes
        // back onto the free list. = range(0, self.capacity())
            .filter(|&n| {
                !self.marked.get(n)
                    .expect("marked should have length == self.capacity()")
            })
            .collect();

        // Reset `marked` to all zero.
        self.marked = Bitv::from_elem(self.capacity(), false);
    }
}

impl<T: Default> ArenaSet<T> {
    pub fn sweep(&mut self) {
        for arena in self.arenas.iter_mut() {
            arena.sweep();
        }

        // Deallocate any arenas that do not contain any
        // reachable objects.
        self.arenas.retain(|a| !a.is_empty());
    }
}

This concludes our tour of Oxischeme's current implementation of memory management, allocation, and garbage collection for Scheme programs. In the future, I plan to make Oxischeme's collector a moving collector, which will pave the way for a compacting and generational GC. I might also experiment with incremental marking for lower latency and shorter GC pauses, or with making sweeping lazy. Additionally, I intend to declare to the Rust compiler that operations on un-rooted GC pointers are unsafe, but I haven't settled on an implementation strategy yet. I would also like to experiment with writing a syntax extension for the Rust compiler so that it can derive Trace implementations and they don't need to be written by hand.

Thanks to Tom Tromey and Zach Carter for reviewing drafts.

