A Compact Representation of Captured Stack Frames for SpiderMonkey

April 10th, 2015

Last year, I implemented a new, compact representation for captured JavaScript stacks in SpiderMonkey [1]. This was done to support the allocation tracking I built into SpiderMonkey's Debugger API [2][3], which saves the stack for potentially every Object allocation in the observed globals. That's a lot of stacks, so it was important to incur as little memory overhead as we could. Even more so because we'd like to affect memory usage as little as possible while simultaneously instrumenting and measuring it.

The key observation is that while there may be many allocations, there are many fewer allocation sites. Most allocations happen at a few places. Thus, much information is duplicated across stacks. For example, if we run the esprima JavaScript parser on its own JavaScript source, there are approximately 54,700 total Object allocations, but just ~1,400 unique JS stacks at allocation time. There are only ~200 allocation sites if you consider only the youngest stack frame.

Consider the example below:

function a() {
  b();
}

function b() {
  c();
  d();
  e();
}

function c() { new Object; }
function d() { new Object; }
function e() { new Object; }

Disregarding allocations removed by compiler optimizations, arguments objects, and engine-internal allocations, calling the a function allocates three Objects. With arrows representing the "called by" relationship from a younger frame to its older, parent frame, this is the set of stacks that existed during an Object allocation:

c -> b -> a
d -> b -> a
e -> b -> a

Instead of duplicating all these saved stack frames, we use a technique called hash consing. Hash consing lets us share older tail stack frames between many unique stacks, similar to the way a trie shares string prefixes. With hash consing, SpiderMonkey internally represents those stacks like this:

c -> b -> a
     ^
d ---|
     |
e ---'

Each frame is stored in a hash table. When saving new stacks, we use this table to find pre-existing saved frames. If such a frame already exists, it is reused; otherwise a new saved frame is allocated and inserted into the table. During the garbage collector's sweep phase, we remove entries from this hash table that are no longer used.
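
To make the technique concrete, here is a minimal sketch of hash consing in Rust. All of the names and types below are hypothetical stand-ins: SpiderMonkey's real saved frames are C++ GC things carrying source location information, and its table is swept by the GC rather than reference counted.

use std::collections::HashSet;
use std::rc::Rc;

// A hypothetical saved frame: a function name plus a shared parent.
// For simplicity, equality and hashing recurse through the parent
// chain; because parents are always interned first, shared tails are
// physically the same `Rc` in practice.
#[derive(PartialEq, Eq, Hash)]
struct SavedFrame {
    function_name: String,
    parent: Option<Rc<SavedFrame>>,
}

struct FrameCache {
    frames: HashSet<Rc<SavedFrame>>,
}

impl FrameCache {
    /// Return the canonical copy of `frame`, allocating a new one
    /// only if an identical frame has not been saved before.
    fn intern(&mut self, frame: SavedFrame) -> Rc<SavedFrame> {
        if let Some(existing) = self.frames.get(&frame) {
            return existing.clone();
        }
        let frame = Rc::new(frame);
        self.frames.insert(frame.clone());
        frame
    }
}

Capturing a stack is then a matter of walking the frames from oldest to youngest and interning each one, so that older tail frames are shared automatically.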

I just landed a series of patches to make Error objects use these hash cons'd stacks internally, and to lazily stringify them when the stack property is accessed [4]. Previously, every time an Error object was created, the stack string was eagerly computed. This change can result in significant memory savings for JS programs that allocate many Error objects. In one Real World™ test case [5], we dropped memory usage from 1.2GB down to 167MB. Additionally, we use almost half of the memory that V8 uses on this same test case.

Many thanks to Jim Blandy for guidance and reviews. Thanks also to Boris Zbarsky, Bobby Holley, and Jason Orendorff for insight into implementing the security wrappers needed for passing stack frame objects between privileged and un-privileged callers. Thanks to Paolo Amadini for adding support for tracing stacks across asynchronous operations (more about this in the future). Thanks again to Boris Zbarsky for making DOMExceptions use these hash cons'd stacks. Another thanks to Jason Orendorff for reviewing my patches to make the builtin JavaScript Error objects use hash cons'd stacks behind the scenes.

In the future, I would like to make capturing stacks even faster by avoiding re-walking stack frames that we already walked while capturing another stack [6]. The cheaper it is to capture a stack, the more ubiquitous stacks can be throughout the platform, and that puts more context and debugging information into developers' hands.

Memory Management in Oxischeme

February 22nd, 2015

I've recently been playing with the Rust programming language, and what better way to learn a language than to implement a second language in the language one wishes to learn?! It almost goes without saying that this second language being implemented should be Scheme. Thus, Oxischeme was born.

Why implement Scheme instead of some other language? Scheme is a dialect of LISP and inherits the simple parenthesized list syntax of its LISP-y origins, called s-expressions. Thus, writing a parser for Scheme syntax is rather easy compared to doing the same for most other languages' syntax. Furthermore, Scheme's semantics are also minimal. It is a small language designed for teaching, and writing a metacircular interpreter (ie a Scheme interpreter written in Scheme itself) takes only a few handfuls of lines of code. Finally, Scheme is a beautiful language: its design is rooted in the elegant λ-calculus.

Scheme is not "close to the metal" and doesn't provide direct access to the machine's hardware. Instead, it provides the illusion of infinite memory and its structures are automatically garbage collected rather than explicitly managed by the programmer. When writing a Scheme implementation in Scheme itself, or any other language with garbage collection, one can piggy-back on top of the host language's garbage collector, and use that to manage the Scheme's structures. This is not the situation I found myself in: Rust is close to the metal, and does not have a runtime with garbage collection built in (although it has some other cool ideas regarding lifetimes, ownership, and when it is safe to deallocate an object). Therefore, I had to implement garbage collection myself.

Faced with this task, I had a decision to make: tracing garbage collection or reference counting?

Reference counting is a technique where we keep track of the number of other things holding references to any given object or resource. When new references are taken, the count is incremented. When a reference is dropped, the count is decremented. When the count reaches zero, the resource is deallocated and it decrements the reference count of any other objects it holds a reference to. Reference counting is great because once an object becomes unreachable, it is reclaimed immediately and doesn't sit around consuming valuable memory space while waiting for the garbage collector to clean it up at some later point in time. Additionally, the reclamation process happens incrementally and program execution doesn't halt while every object in the heap is checked for liveness. On the negative side, reference counting runs into trouble when it encounters cycles. Consider the following situation:

A -> B
^    |
|    v
D <- C

A, B, C, and D form a cycle and all have a reference count of one. Nothing from outside of the cycle holds a reference to any of these objects, so it should be safe to collect them. However, because each reference count will never get decremented to zero, none of these objects will be deallocated. In practice, the programmer must explicitly use (potentially unsafe) weak references, or the runtime must provide a means for detecting and reclaiming cycles. The former defeats the purpose of the carefree, don't-worry-about-it style of managed memory environments. The latter is equivalent to implementing a tracing collector in addition to the existing reference counting memory management.
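
Rust's reference-counted Rc pointer makes it easy to demonstrate this failure mode concretely (the Node type below is contrived for the example):

use std::cell::RefCell;
use std::rc::Rc;

struct Node {
    // `RefCell` gives us interior mutability, so we can complete the
    // cycle after both nodes are constructed.
    next: RefCell<Option<Rc<Node>>>,
}

fn main() {
    let a = Rc::new(Node { next: RefCell::new(None) });
    let b = Rc::new(Node { next: RefCell::new(None) });

    // Complete the cycle: a -> b -> a.
    *a.next.borrow_mut() = Some(b.clone());
    *b.next.borrow_mut() = Some(a.clone());

    // When `a` and `b` fall out of scope, each node's reference count
    // drops to one (the other node's link), never to zero, so neither
    // node is ever deallocated. The cycle leaks.
}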

Tracing garbage collectors start from a set of roots and recursively traverse object references to discover the set of live objects in the heap graph. Any object that is not an element of the live set cannot be used again in the future, because the program has no way to refer to that object. Therefore, the object is available for reclaiming. This has the advantage of collecting dead cycles, because if the cycle is not reachable from the roots, then it won't be in the live set. The cyclic references don't matter because those edges are never traversed. The disadvantage is that, without a lot of hard work, when the collector is doing its bookkeeping, the program is halted until the collector is finished analyzing the whole heap. This can result in long, unpredictable GC pauses.

Reference counting is to tracing as yin is to yang. The former operates on dead, unreachable objects while the latter operates on live, reachable things. Fun fact: every high performance GC algorithm (such as generational GC or reference counting with "purple" nodes and trial deletion) uses a mixture of both, whether it appears so on the surface or not. "A Unified Theory of Garbage Collection" by Bacon et al. discusses this in depth.

I opted to implement a tracing garbage collector for Oxischeme. In particular, I implemented one of the simplest GC algorithms: stop-the-world mark-and-sweep. The steps are as follows:

  1. Stop the Scheme program execution.
  2. Mark phase. Trace the live heap starting from the roots and add every reachable object to the marked set.
  3. Sweep phase. Iterate over each object x in the heap:
    • If x is an element of the marked set, continue.
    • If x is not an element of the marked set, reclaim it.
  4. Resume execution of the Scheme program.

Because the garbage collector needs to trace the complete heap graph, any structure that holds references to a garbage collected type must participate in garbage collection by tracing the GC things it is holding alive. In Oxischeme, this is implemented with the oxischeme::heap::Trace trait, whose implementation requires a trace function that returns an iterable of GcThings:

pub trait Trace {
    /// Return an iterable of all of the GC things referenced by
    /// this structure.
    fn trace(&self) -> IterGcThing;
}

Note that Oxischeme separates tracing (generic heap graph traversal) from marking (adding live nodes in the heap graph to a set). This enables using Trace to implement other graph algorithms on top of the heap graph. Examples include computing dominator trees and retained sizes of objects, or finding the set of retaining paths of an object that you expected to be reclaimed by the collector, but hasn't been.

If we were to introduce a Trio type that contained three cons cells, we would implement tracing like this:

struct Trio {
    first: ConsPtr,
    second: ConsPtr,
    third: ConsPtr,
}

impl Trace for Trio {
    fn trace(&self) -> IterGcThing {
        let refs = vec!(GcThing::from_cons_ptr(self.first),
                        GcThing::from_cons_ptr(self.second),
                        GcThing::from_cons_ptr(self.third));
        refs.into_iter()
    }
}

What causes a garbage collection? As we allocate GC things, GC pressure increases. Once that pressure crosses a threshold — BAM! — a collection is triggered. Oxischeme's pressure application and threshold are very naive at the moment: a collection is triggered every N allocations, regardless of the size of the heap or the size of individual allocations.
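
In sketch form, the bookkeeping looks something like the following. The names and the particular threshold are hypothetical, not Oxischeme's actual ones:

const GC_PRESSURE_THRESHOLD: usize = 1024;

struct Heap {
    allocations_since_gc: usize,
    // ... arenas, roots, etc.
}

impl Heap {
    fn bump_gc_pressure(&mut self) {
        self.allocations_since_gc += 1;
        if self.allocations_since_gc >= GC_PRESSURE_THRESHOLD {
            self.collect_garbage();
        }
    }

    fn reset_gc_pressure(&mut self) {
        self.allocations_since_gc = 0;
    }

    fn collect_garbage(&mut self) {
        self.reset_gc_pressure();
        // Mark and sweep, as shown later in this post.
    }
}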

A root is an object in the heap graph that is known to be live and reachable. When marking, we start tracing from the set of roots. For example, in Oxischeme, the global environment is a GC root.

In addition to permanent GC roots, like the global environment, sometimes it is necessary to temporarily root GC things referenced by pointers on the stack. Garbage collection can be triggered by any allocation, and it isn't always clear which Rust functions (or other functions called by those functions, or even other functions called by those functions called from the first function, and so on) might allocate a GC thing, triggering collection. The situation we want to avoid is a Rust function using a temporary variable that references a GC thing, then calling another function which triggers a collection and collects the GC thing that was referred to by the temporary variable. That results in the temporary variable becoming a dangling pointer. If the Rust function accesses it again, that is Undefined Behavior: it might still get the value it was pointing at, or it might be a segfault, or it might be a freshly allocated value being used by something else! Not good!

let a = pointer_to_some_gc_thing;
function_which_can_trigger_gc();
// Oops! A collection was triggered and dereferencing this
// pointer leads to Undefined Behavior!
*a;

There are two possible solutions to this problem. The first is conservative garbage collection, where we walk the native stack and, if any value on the stack looks like it might be a pointer and, coerced to a pointer, happens to point to a GC thing in the heap, we assume that it is in fact a pointer. Under this assumption, it isn't safe to reclaim the object pointed to, and so we treat that GC thing as a root. Note that this strategy is simple and easy to retrofit, because it doesn't involve changes in any code other than adding the stack scanning, but it results in false positives. The second solution is precise rooting. With precise rooting, it is the responsibility of the Rust function's author to explicitly root and unroot pointers to GC things used in variables on the stack. The advantage this provides is that there are no false positives: you know exactly which stack values are pointers to GC things. The disadvantage is the requirement of explicitly telling the GC about every pointer to a GC thing you ever reference on the stack.

Almost every modern, high performance tracing collector for managed runtimes uses precise rooting because it is a prerequisite* of a moving collector: a GC that relocates objects while performing collection. Moving collectors are desirable because they can compact the heap, creating a smaller memory footprint for programs and better cache locality. They can also implement pointer bumping allocation, which is both simpler and faster than maintaining a free list. Finally, they can split the heap into generations. Generational GCs gain performance wins from the empirical observation that most allocations are short lived, and those objects that are most recently allocated are most likely to be garbage, so we can focus the GC's efforts on them to get the most bang for our buck. Precise rooting is a requirement for a moving collector because it has to update all references to point to the new address of each moved GC thing. A conservative collector doesn't know for sure if a given value on the stack is a reference to a GC thing or not, and if the value just so happens not to be a reference to a GC thing (it is a false positive), and the collector "helpfully" updates that value to the moved address, then the collector is introducing migraine-inducing bugs into the program execution.

* Technically, there do exist some moving and generational collectors that are "mostly copying" and conservatively mark the stack but precisely mark the heap. These collectors only move objects which are not conservatively reachable.

Oxischeme uses precise rooting, but is not a moving GC (yet). Precise rooting is implemented with the oxischeme::heap::Rooted<T> smart pointer RAII type, which roots its referent upon construction and unroots it when the smart pointer goes out of scope and is dropped.
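
Here is a minimal sketch of what such an RAII type can look like, using a plain usize as a stand-in for a GC pointer; the real oxischeme::heap::Rooted<T> differs in its details:

use std::cell::RefCell;
use std::ops::Deref;
use std::rc::Rc;

// The set of temporary roots. In Oxischeme, this lives in the heap.
type RootSet = Rc<RefCell<Vec<usize>>>;

struct Rooted {
    roots: RootSet,
    ptr: usize,
}

impl Rooted {
    /// Root the referent upon construction.
    fn new(roots: RootSet, ptr: usize) -> Rooted {
        roots.borrow_mut().push(ptr);
        Rooted { roots: roots, ptr: ptr }
    }
}

impl Deref for Rooted {
    type Target = usize;
    fn deref(&self) -> &usize {
        &self.ptr
    }
}

impl Drop for Rooted {
    /// Unroot the referent when the smart pointer goes out of scope.
    fn drop(&mut self) {
        let mut roots = self.roots.borrow_mut();
        if let Some(idx) = roots.iter().position(|&p| p == self.ptr) {
            roots.swap_remove(idx);
        }
    }
}

Because destructors run deterministically at the end of scope, rooting and unrooting always stay balanced, even across early returns.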

Using precise rooting and Rooted, we can solve the dangling stack pointer problem like this:

{
    // The pointed to GC thing gets rooted when wrapped
    // with `Rooted`.
    let a = Rooted::new(heap, pointer_to_some_gc_thing);
    function_which_can_trigger_gc();
    // Dereferencing `a` is now safe, because the referent is
    // a GC root, and can't be collected!
    *a;
}
// `a` is now out of scope, and its referent is unrooted.

With all of that out of the way, here is the implementation of our mark-and-sweep collector:

impl Heap {
    pub fn collect_garbage(&mut self) {
        self.reset_gc_pressure();

        // First, trace the heap graph and mark everything that
        // is reachable.

        let mut pending_trace = self.get_roots();

        while !pending_trace.is_empty() {
            let mut newly_pending_trace = vec!();

            for thing in pending_trace.drain() {
                if !thing.is_marked() {
                    thing.mark();

                    for referent in thing.trace() {
                        newly_pending_trace.push(referent);
                    }
                }
            }

            pending_trace.append(&mut newly_pending_trace);
        }

        // Second, sweep each `ArenaSet`.

        self.strings.sweep();
        self.activations.sweep();
        self.cons_cells.sweep();
        self.procedures.sweep();
    }
}

Why do we have four calls to sweep, one for each type that Oxischeme implements? To explain this, first we need to understand Oxischeme's allocation strategy.

Oxischeme does not allocate each individual object directly from the operating system. In fact, most Scheme "allocations" do not actually perform any allocation from the operating system (eg, call malloc or Box::new). Oxischeme uses a set of oxischeme::heap::Arenas, each of which has a preallocated object pool, with each item in the pool either being used by a live GC thing or waiting to be used in a future allocation. We keep track of an Arena's available objects with a "free list" of indices into its pool.

type FreeList = Vec<usize>;

/// An arena from which to allocate `T` objects from.
pub struct Arena<T> {
    pool: Vec<T>,

    /// The set of free indices into `pool` that are available
    /// for allocating an object from.
    free: FreeList,

    /// During a GC, if the nth bit of `marked` is set, that
    /// means that the nth object in `pool` has been marked as
    /// reachable.
    marked: Bitv,
}

When the Scheme program allocates a new object, we take an entry from the free list and return a pointer to the object at that entry's index in the object pool. If every Arena is at capacity (ie, its free list is empty), a new Arena is allocated from the operating system and its object pool is used for the requested Scheme allocation.
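
The shape of ArenaSet<T> implied by the allocation code below is roughly the following sketch; the real oxischeme::heap::ArenaSet may carry extra bookkeeping:

pub struct ArenaSet<T> {
    /// All of the arenas for objects of type `T`.
    arenas: Vec<Arena<T>>,

    /// The object pool capacity to use when creating a new `Arena`.
    capacity: usize,
}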

impl<T: Default> ArenaSet<T> {
    pub fn allocate(&mut self) -> ArenaPtr<T> {
        for arena in self.arenas.iter_mut() {
            if !arena.is_full() {
                return arena.allocate();
            }
        }

        let mut new_arena = Arena::new(self.capacity);
        let result = new_arena.allocate();
        self.arenas.push(new_arena);
        result
    }
}

impl<T: Default> Arena<T> {
    pub fn allocate(&mut self) -> ArenaPtr<T> {
        match self.free.pop() {
            Some(idx) => {
                let self_ptr : *mut Arena<T> = self;
                ArenaPtr::new(self_ptr, idx)
            },
            None => panic!("Arena is at capacity!"),
        }
    }
}

For simplicity, Oxischeme has separate arenas for separate types of objects. This sidesteps the problem of finding an appropriately sized free block of memory when allocating different sized objects from the same pool, avoids the fragmentation that can occur because of that, and lets us use a plain old vector as the object pool. However, this also means that we need a separate ArenaSet<T> for each type T that a Scheme program can allocate, which is why oxischeme::heap::Heap::collect_garbage has four calls to sweep().

During the sweep phase of Oxischeme's garbage collector, we return the entries of any dead objects to the free list. If the Arena is empty (ie, the free list is full), then we return the Arena's memory to the operating system. This prevents us from retaining the peak amount of memory used for the rest of the program's execution.

impl<T: Default> Arena<T> {
    pub fn sweep(&mut self) {
        self.free = range(0, self.capacity())
            .filter(|&n| {
                !self.marked.get(n)
                    .expect("marked should have length == self.capacity()")
            })
            .collect();

        // Reset `marked` to all zero.
        self.marked.set_all();
        self.marked.negate();
    }
}

impl<T: Default> ArenaSet<T> {
    pub fn sweep(&mut self) {
        for arena in self.arenas.iter_mut() {
            arena.sweep();
        }

        // Deallocate any arenas that do not contain any
        // reachable objects.
        self.arenas.retain(|a| !a.is_empty());
    }
}

This concludes our tour of Oxischeme's current implementation of memory management, allocation, and garbage collection for Scheme programs. In the future, I plan to make Oxischeme's collector a moving collector, which will pave the way for a compacting and generational GC. I might also experiment with incrementalizing marking for lower latency and shorter GC pauses, or with making sweeping lazy. Additionally, I intend to declare to the Rust compiler that operations on un-rooted GC pointers are unsafe, but I haven't settled on an implementation strategy yet. I would also like to experiment with writing a syntax extension for the Rust compiler so that it can derive Trace implementations, and they don't need to be written by hand.

Thanks to Tom Tromey and Zach Carter for reviewing drafts.

Naming `eval` Scripts with the `//# sourceURL` Directive

December 5th, 2014

In Firefox 36, SpiderMonkey (Firefox's JavaScript engine) now supports the //# sourceURL=my-display-url.js directive. This allows developers to give a name to a script created by eval or new Function, and get better stack traces.

To demonstrate this, let's use a minimal version of John Resig's micro templater. The micro templater compiles template strings into JS source code that it passes to new Function, thus transforming the template string into a function.

function tmpl(str) {
  return new Function("obj",
    "var p=[],print=function(){p.push.apply(p,arguments);};" +

    // Introduce the data as local variables using with(){}
    "with(obj){p.push('" +

    // Convert the template into pure JavaScript
    str
      .replace(/[\r\t\n]/g, " ")
      .split("<%").join("\t")
      .replace(/((^|%>)[^\t]*)'/g, "$1\r")
      .replace(/\t=(.*?)%>/g, "',$1,'")
      .split("\t").join("');")
      .split("%>").join("p.push('")
      .split("\r").join("\\'")

    + "');}return p.join('');");
};

The details of how the template is converted into JavaScript source code aren't important; what is important is that it dynamically creates new scripts via code evaluated in new Function.

We can define a new templater function:

var hello = tmpl("<h1>Hello, <%=name%></h1>");

And use it like this:

hello({ name: "World!" });
// "<h1>Hello, World!</h1>"

When we get an error, SpiderMonkey will generate a name for the evaled (or in this case, new Functioned) script based on the location where the call to eval (or new Function) occurred. For our concrete example, this is the generated name for the hello templater function's frame:

file:///Users/fitzgen/scratch/foo.js line 2 > Function

And here it is in the context of an error with the whole stack trace:

hello({ name: Object.create(null) });
// TypeError: can't convert Object to string
// Stack trace:
// anonymous@file:///Users/fitzgen/scratch/foo.js line 2 > Function:1:107
// @file:///Users/fitzgen/scratch/foo.js:28:3

Despite being a solid improvement over just "eval frame" or something of that sort, these stack traces can still be difficult to read. If there are many different templater functions, the value of the eval script's introduction location is further diminished. It is difficult to determine which of the many functions created by tmpl contains the thrown error: they all end up with the same name, because they were all created at the same location.

We can improve this situation with the //# sourceURL directive.

Consider this version of the tmpl function adapted to use the //# sourceURL directive:

function tmpl(name, str) {
  return new Function("obj",
    "var p=[],print=function(){p.push.apply(p,arguments);};" +

    // Introduce the data as local variables using with(){}
    "with(obj){p.push('" +

    // Convert the template into pure JavaScript
    str
      .replace(/[\r\t\n]/g, " ")
      .split("<%").join("\t")
      .replace(/((^|%>)[^\t]*)'/g, "$1\r")
      .replace(/\t=(.*?)%>/g, "',$1,'")
      .split("\t").join("');")
      .split("%>").join("p.push('")
      .split("\r").join("\\'")

    + "');}return p.join('');"
    + "//# sourceURL=" + name);
};

Note that the function takes a new parameter called name and appends //# sourceURL=<name> to the end of the generated JS code passed to new Function.

With this new version of tmpl, we create our templater function like this:

var hello = tmpl("hello.template", "<h1>Hello <%=name%></h1>");

Now SpiderMonkey will use the name given by the //# sourceURL directive, instead of using a name based on the introduction site of the script:

hello({ name: Object.create(null) });
// TypeError: can't convert Object to string
// Stack trace:
// anonymous@hello.template:1:107
// @file:///Users/fitzgen/scratch/foo.js:25:3

Giving the eval script a name makes it easier for us to debug errors originating from it, and we can give a different name to different scripts created at the same location.

The //# sourceURL directive is also particularly useful for dynamic code- and module-loaders, which fetch source text over the network and then eval it.

Additionally, in Firefox 37, evaled sources with a //# sourceURL directive will be debuggable and labeled with the name specified by the directive in the debugger. (Update: //# sourceURL support in the debugger is also available in Firefox 36!)

wu.js 2.0

August 7th, 2014

On May 21st, 2010, I started experimenting with lazy, functional streams in JavaScript with a library I named after the Wu-Tang Clan.

commit 9d3c5b19a088f6e33888c215f44ab59da4ece302
Author: Nick Fitzgerald <fitzgen@gmail.com>
Date:   Fri May 21 22:56:49 2010 -0700

    First commit

Four years later, the feature-complete, partially-implemented, and soon-to-be-finalized ECMAScript 6 supports lazy streams in the form of generators and its iterator protocol. Unfortunately, ES6 iterators are missing the higher order functions you expect: map, filter, reduce, etc.

Today, I'm happy to announce the release of wu.js version 2.0, which has been completely rewritten for ES6.

wu.js aims to provide higher order functions for ES6 iterables. Some of them you already know (filter, some, reduce) and some of them might be new to you (reductions, takeWhile). wu.js works with all ES6 iterables, including Arrays, Maps, Sets, and generators you write yourself. You don't have to wait for ES6 to be fully implemented by every JS engine: wu.js can be compiled to ES5 with the Traceur compiler.

Here are a couple of small examples:

const factorials = wu.count(1).reductions((last, n) => last * n);
// (1, 2, 6, 24, ...)

const isEven = x => x % 2 === 0;
const evens = wu.filter(isEven);
evens(wu.count());
// (0, 2, 4, 6, ...)

Check out the wu.js documentation here.

Come work with me on Firefox Developer Tools

July 8th, 2014

My team at Mozilla, the half of the larger devtools team that works on JavaScript and performance tools, is looking to hire another software engineer.

We have members of the devtools team in our San Francisco, London, Vancouver, Paris, Toronto, and Portland offices, but many also work remotely.

We are responsible for writing the full stack of the tools we create, from the C++ platform APIs exposing SpiderMonkey and Gecko internals, to the JavaScript/CSS/HTML based frontend that you see when you open the Firefox Developer Tools.

Some of the things we're working on:

One of the most important things for me is that every line of code we write at Mozilla is Free and Open Source from day one, and we're dedicated to keeping the web open.

Apply here!
