A Wee Allocator for WebAssembly

Recently, we introduced WebAssembly (compiled from Rust) into the source-map JavaScript library. We saw parsing and querying source maps get a whole lot faster. But keeping the compiled .wasm code size small was a challenge.

There are a variety of tools available for removing dead code from WebAssembly and compressing .wasm sizes:

wasm-gc constructs the callgraph of all the functions in a .wasm file, determines which functions are never transitively called by an exported function, and removes them.⁰
wasm-snip replaces a WebAssembly function’s body with a single unreachable instruction. This is useful for manually removing functions that will never be called at runtime, but which the compiler and wasm-gc couldn’t statically prove were dead. After snipping a function, other functions that were only called by the snipped function become unreachable, so running wasm-gc again after wasm-snip often yields further code size reductions.
wasm-opt runs binaryen‘s sophisticated optimization passes over a .wasm file, both shrinking its size and improving its runtime performance.

After using these tools, we looked at what was left in our .wasm file, and broke the results down by crate:

Half of our code size was coming from dlmalloc, the allocator that Rust uses by default for the wasm32-unknown-unknown target. Source map parsing requires just a couple of large, long-lived allocations, and then does its heavy lifting without allocating any further. Querying a parsed source map doesn’t require any allocation. So we are paying a lot, in code size, for an allocator that we aren’t really using that much.

Allocator implementations have trade offs, but Rust’s default is the wrong choice for the source map parsing and querying scenario.

Introducing `wee_alloc`

wee_alloc is a work-in-progress memory allocator designed for WebAssembly. It has a tiny code size footprint, compiling down to only a kilobyte of .wasm code.

wee_alloc is designed for scenarios like those described above: where there are a handful of initial, long-lived allocations, after which the code does its heavy lifting without any further allocations. This scenario requires some allocator to exist, but we are more than happy to trade performance for small code size.

In contrast, wee_alloc is not designed for, and would be a poor choice in, scenarios where allocation is a performance bottleneck.

Although WebAssembly is the primary target, wee_alloc also has an mmap-based implementation for unix systems. This enables testing wee_alloc, and using wee_alloc in your own code, without a browser or WebAssembly engine.

How `wee_alloc` Works

Allocating WebAssembly Pages

WebAssembly module instances have a linear memory space¹, and use store and load instructions to access values within it via an index. If an instruction attempts to access the value at an index that is beyond the memory’s bounds, a trap is raised.

There are two instructions for manipulating the linear memory space itself, rather than its contents: current_memory and grow_memory². The current_memory instruction gives the current size of memory, in units of pages. The grow_memory instruction takes an operand n, grows the memory space by n pages, and gives back the old size of memory, units of pages. Alternatively, if growing memory fails, -1 is returned.

WebAssembly does not have any facilities for shrinking memory, at least right now.

To implement allocating n pages of memory to Rust, we need to use LLVM’s intrinsics. Right now, the intrinsic for grow_memory doesn’t expose the return value, but this should be fixed once Rust updates its LLVM. Therefore, our page allocation routine must use current_memory before grow_memory, when it should just use the result of grow_memory instead. It also means we can’t check for failure yet.

extern "C" {
    #[link_name = "llvm.wasm.current.memory.i32"]
    fn current_memory() -> usize;

    #[link_name = "llvm.wasm.grow.memory.i32"]
    fn grow_memory(pages: usize);
}

unsafe fn get_base_pointer() -> *mut u8 {
    (current_memory() * PAGE_SIZE.0) as *mut u8
}

unsafe fn alloc_pages(n: Pages) -> *mut u8 {
    let ptr = get_base_pointer();
    grow_memory(n.0);
    ptr
}

Careful readers will have noticed the Pages type in alloc_pages’s type signature. wee_alloc uses newtypes for units of bytes, words, and pages. Each of these is a thin wrapper around a usize with relevant operator overloads and inter-conversions. This has been very helpful in catching bugs at compile time, like attempts to offset a pointer by two words rather than two bytes, and compiles away to nothing in the emitted .wasm.

Free Lists

But we don’t satisfy individual allocation requests by directly allocating pages. First, the WebAssembly page size is 64KiB, which is much larger than most allocations. Second, because there is no way to return unused pages to the WebAssembly engine, it would be incredibly wasteful if we didn’t reuse pages. Instead, we maintain a free list of blocks of memory we’ve already allocated from the WebAssembly engine.

Free lists have low complexity and are easy to implement. These properties also lend themselves to a small implementation. The basic idea is to maintain an intrusive linked list of memory blocks that are available. Allocation removes a block from the free list. If the block is larger than the requested allocation size, we can split it in two. Or, if there is no suitable block available in the free list, we can fall back to alloc_pages to get a fresh block of memory. Deallocation reinserts the given block back into the free list, so that it can be reused for future allocations. Because a block is only in the free list if it is not allocated and is therefore unused, we can use a word from the data itself to store the free list links, so long as we ensure that the data is always at least a word in size.

Here is a diagram of what the free list looks like in memory, showing a free block, followed by an allocated block, followed by another free block:


--. .--------------------------------------. ,-----------
  | |                                      | |
  V |                                      V |
+----------------\\-----+---------\\-----+---------------
| free ; data... // ... | data... // ... | free ; data...
+----------------\\-----+---------\\-----+---------------

Even after choosing to use free lists, we have more design choices to make. How should we choose which block to use from the free list? The first that can satisfy this allocation, aka first fit? The block that is closest to the requested allocation size, in an effort to cut down on fragmentation (more on this in a minute), aka best fit? Pick up the search where we left off last time, aka next fit? Regardless which of first fit, best fit, and next fit we choose, we are dealing with an O(n) search. Indeed, this is the downside to the trade off we made when choosing free lists for their simplicity of implementation.

A common technique for speeding up free list allocation is to have separate free lists for allocations of different sizes, which is known as segregated fits or having size classes. With this approach, we can guarantee the invariant that every block in the free list for a particular size can satisfy an allocation request of that size. All we need to do is avoid splitting any block into pieces smaller than that size.

Maintaining this invariant gives us amortized O(1) allocation for these sizes with their own free lists. Using first fit within a size’s free list is guaranteed to only look at the first block within the free list, because the invariant tells us that the first block can satisfy this request. It is amortized because we need to fall back to the O(n) allocation to refill this size’s free list from the fallback, main free list whenever it is empty.

If we reuse the same first fit allocation routine for both our size classes’ free lists and our main free list, then we get the benefits of size classes without paying for them in extra code size.

This is the approach that wee_alloc takes.

Fragmentation

Our other main concern is avoiding fragmentation. Fragmentation is the degree of wasted space between allocations in memory. High fragmentation can lead to situations where there exist many free blocks of memory, but none of which can fulfill some allocation request, because each individual free block’s size is too small, even if the sum of their sizes is more than enough for the requested allocation. Therefore, a high degree of fragmentation can effectively break an allocator. It had one job — allocate memory — and it can’t even do that anymore. So wee_alloc really should have some kind of story here; punting 100% on fragmentation is not a practical option.

Once again there are trade offs, and avoiding fragmentation is not a binary choice. On one end of the spectrum, compacting garbage collectors can re-arrange objects in memory and pack them tightly next to each other, effectively leading to zero fragmentation. The cost that you pay is the size and time overhead of a full tracing garbage collector that can enumerate all pointers in the system and patch them to point to moved objects’ new locations. On the opposite end of the spectrum, if we never re-consolidate two blocks of adjacent memory that we previously split from what had originally been a single contiguous block, then we can expect a lot of fragmentation. As we split blocks into smaller blocks for smaller allocations, we get small bits of wasted space between them, and even after we free all these small allocations, we won’t have any large block in the free list for a large allocation. Even if we have multiple adjacent, small blocks in the free list that could be merged together to satisfy the large allocation.

One possibility is keeping the free list sorted by each block’s address, and then deallocating a block would re-insert it into the free list at the sorted location. If either of its neighbors in the free list are immediately adjacent in memory, we could consolidate them. But then deallocation is an O(n) search through the free list, instead of the O(1) push onto its front. We could lower that to O(log n) by representing the free list as a balanced search tree or btree. But the implementation complexity goes up, and I suspect code size will go up with it.

Instead of a free list, we could use bitmaps to track which portions of our heap are free or allocated, and then the consolidation could happen automatically as bits next to each other are reset. But then we need to restrict our allocator to parceling out portions of a single, contiguous region of memory. This implies that only a single, global allocator exists, since if there were multiple, each instance would want to “own” the end of the WebAssembly linear memory, and have the power to grow it to satisfy more and larger allocations. And maybe this is a fair constraint to impose in the context of WebAssembly, where memory is already linear and contiguous. But lifting this constraint, while still using bitmaps, implies a hybrid free list and bitmap implementation. The downside to that is more implementation complexity, and a larger code size foot print.

wee_alloc takes a third approach: trading some space overhead for easy and fast merging. We maintain a sorted, doubly-linked list of all blocks, whether allocated or free. This adds two words of space overhead to every heap allocation. When freeing a block, we check if either of its adjacent blocks are also free, and if so, merge them together with a handful of updates to the next and previous pointers. If neither of the neighbors are free, then we push this block onto the front of the free list. In this way, we keep both O(1) deallocation and our simple free list implementation.

Here is a diagram of what this sorted, doubly-linked list looks like in memory:


 ,---------------------------------.                           ,---------------------
 |                              ,--|---------------------------|--.
 |  X                           |  |                           |  |
 V  |                           V  |                           V  |
+-----------------------\\-----+-----------------------\\-----+----------------------
| prev ; next ; data... // ... | prev ; next ; data... // ... | prev ; next ; data...
+-----------------------\\-----+-----------------------\\-----+----------------------
          |                     ^        |                     ^        |
          |                     |        |                     |        |
          `---------------------'        `---------------------'        `------------

`CellHeader`, `FreeCell`, and `AllocatedCell`

The CellHeader contains the common data members found within both allocated and free memory blocks: the next and previous doubly-linked list pointers.

#[repr(C)]
struct CellHeader {
    next_cell_raw: ptr::NonNull<CellHeader>,
    prev_cell_raw: *mut CellHeader,
}

We use a low bit of the next_cell_raw pointer to distinguish whether the cell is allocated or free, and can consult its value to dynamically cast to an AllocatedCell or FreeCell.

impl CellHeader {
    const IS_ALLOCATED: usize = 0b01;

    fn is_allocated(&self) -> bool {
        self.next_cell_raw.as_ptr() as usize & Self::IS_ALLOCATED != 0
    }

    fn is_free(&self) -> bool {
        !self.is_allocated()
    }

    fn set_allocated(&mut self) {
        let next = self.next_cell_raw.as_ptr() as usize;
        let next = next | Self::IS_ALLOCATED;
        self.next_cell_raw = unsafe {
            ptr::NonNull::new_unchecked(next as *mut CellHeader)
        };
    }

    fn set_free(&mut self) {
        let next = self.next_cell_raw.as_ptr() as usize;
        let next = next & !Self::IS_ALLOCATED;
        self.next_cell_raw = unsafe {
            ptr::NonNull::new_unchecked(next as *mut CellHeader)
        };
    }

    fn as_free_cell_mut(&mut self) -> Option<&mut FreeCell> {
        if self.is_free() {
            Some(unsafe { mem::transmute(self) })
        } else {
            None
        }
    }
}

We use pointer arithmetic to calculate the size of a given cell’s data to avoid another word of space overhead, so the next_cell_raw pointer must always point just after this cell’s data. But, because of that restriction, we can’t use a null pointer as the sentinel for the end of the doubly-linked-list. Therefore, we use the second low bit of the next_cell_raw pointer to distinguish whether the data pointed to by next_cell_raw (after the appropriate masking) is a valid cell, or is garbage memory.

impl CellHeader {
    const NEXT_CELL_IS_INVALID: usize = 0b10;
    const MASK: usize = !0b11;

    fn next_cell_is_invalid(&self) -> bool {
        let next = self.next_cell_raw.as_ptr() as usize;
        next & Self::NEXT_CELL_IS_INVALID != 0
    }

    fn next_cell_unchecked(&self) -> *mut CellHeader {
        let ptr = self.next_cell_raw.as_ptr() as usize;
        let ptr = ptr & Self::MASK;
        let ptr = ptr as *mut CellHeader;
        ptr
    }

    fn next_cell(&self) -> Option<*mut CellHeader> {
        if self.next_cell_is_invalid() {
            None
        } else {
            Some(self.next_cell_unchecked())
        }
    }

    fn size(&self) -> Bytes {
        let data = unsafe { (self as *const CellHeader).offset(1) };
        let data = data as usize;

        let next = self.next_cell_unchecked();
        let next = next as usize;

        Bytes(next - data)
    }
}

An AllocatedCell is a CellHeader followed by data that is allocated.

#[repr(C)]
struct AllocatedCell {
    header: CellHeader,
}

A FreeCell is a CellHeader followed by data that is not in use, and from which we recycle a word for the next link in the free list.

#[repr(C)]
struct FreeCell {
    header: CellHeader,
    next_free_raw: *mut FreeCell,
}

Each of AllocatedCell and FreeCell have methods that make sense only when the cell is allocated or free, respectively, and maintain the invariants required for cells of their state. For example, the method for transforming a FreeCell into an AllocatedCell ensures that the IS_ALLOCATED bit gets set, and the method for transforming an AllocatedCell into a FreeCell unsets that bit.

impl FreeCell {
    unsafe fn into_allocated_cell(&mut self) -> &mut AllocatedCell {
        self.header.set_allocated();
        mem::transmute(self)
    }
}

impl AllocatedCell {
    unsafe fn into_free_cell(&mut self) -> &mut FreeCell {
        self.header.set_free();
        mem::transmute(self)
    }
}

Implementing Allocation

Let’s begin by looking at first fit allocation without any refilling of the free list in the case where there are no available blocks of memory that can satisfy this allocation request. Given the head of a free list, we search for the first block that can fit the requested allocation. Upon finding a suitable block, we determine whether to split the block in two, or use it as is. If we don’t find a suitable block we return an error.

unsafe fn walk_free_list<F, T>(
    head: &mut *mut FreeCell,
    mut callback: F,
) -> Result<T, ()>
where
    F: FnMut(&mut *mut FreeCell, &mut FreeCell) -> Option<T>,
{
    let mut previous_free = head;

    loop {
        let current_free = *previous_free;

        if current_free.is_null() {
            return Err(());
        }

        let mut current_free = &mut *current_free;

        if let Some(result) = callback(previous_free, current_free) {
            return Ok(result);
        }

        previous_free = &mut current_free.next_free_raw;
    }
}

unsafe fn alloc_first_fit(
    size: Words,
    head: &mut *mut FreeCell,
    policy: &AllocPolicy,
) -> Result<*mut u8, ()> {
    walk_free_list(head, |previous, current| {
        // Check whether this cell is large enough to satisfy this allocation.
        if current.header.size() < size.into() {
            return None;
        }

        // The cell is large enough for this allocation -- maybe *too*
        // large. Try splitting it.
        if let Some(allocated) = current.split_alloc(previous, size, policy) {
            return Some(allocated.data());
        }

        // This cell has crazy Goldilocks levels of "just right". Use it as is,
        // without any splitting.
        *previous = current.next_free();
        let allocated = current.into_allocated_cell(policy);
        Some(allocated.data())
    })
}

Splitting a cell in two occurs when a cell has room for both the requested allocation and for another adjacent cell afterwards that is no smaller than some minimum block size. We use the &AllocPolicy trait object to configure this minimum block size, among other things, for different size classes without the code duplication that monomorphization creates. If there is room to split, then we insert the newly split cell into the free list, remove the current cell, and fixup the doubly-linked list of adjacent cells in the headers.

impl FreeCell {
    fn should_split_for(&self, alloc_size: Words, policy: &AllocPolicy) -> bool {
        let self_size = self.header.size();
        let min_cell_size: Bytes = policy.min_cell_size(alloc_size).into();
        let alloc_size: Bytes = alloc_size.into();
        self_size - alloc_size >= min_cell_size + size_of::<CellHeader>()
    }

    unsafe fn split_alloc(
        &mut self,
        previous: &mut *mut FreeCell,
        alloc_size: Words,
        policy: &AllocPolicy,
    ) -> Option<&mut AllocatedCell> {
        if self.should_split_for(alloc_size, policy) {
            let alloc_size: Bytes = alloc_size.into();

            let remainder = {
                let data = (&mut self.header as *mut CellHeader).offset(1) as *mut u8;
                data.offset(alloc_size.0 as isize)
            };

            let remainder = &mut *FreeCell::from_uninitialized(
                remainder,
                self.header.next_cell_raw,
                Some(&mut self.header),
                Some(self.next_free()),
                policy,
            );

            if let Some(next) = self.header.next_cell() {
                (*next).prev_cell_raw = &mut remainder.header;
            }
            self.header.next_cell_raw =
                ptr::NonNull::new_unchecked(&mut remainder.header);

            *previous = remainder;
            Some(self.into_allocated_cell(policy))
        } else {
            None
        }
    }
}

Refilling a free list when there is not a suitable block already in it is easy. For the main free list, we allocate new pages directly from the WebAssembly engine with the alloc_pages function we defined earlier. For a size class’s free list, we allocate a (relatively) large block from the main free list. This logic is encapsulated in the two different AllocPolicy implementations, and the AllocPolicy::new_cell_for_free_list method.

To allocate with a fallback to refill the free list, we do just that: attempt a first fit allocation, if that fails, refill the free list by pushing a new cell onto its front, and then try a first fit allocation once again.

unsafe fn alloc_with_refill(
    size: Words,
    head: &mut *mut FreeCell,
    policy: &AllocPolicy,
) -> Result<*mut u8, ()> {
    if let Ok(result) = alloc_first_fit(size, head, policy) {
        return Ok(result);
    }

    let cell = policy.new_cell_for_free_list(size)?;
    let head = (*cell).insert_into_free_list(head, policy);
    alloc_first_fit(size, head, policy)
}

But where do we get the free list heads from? The WeeAlloc structure holds the head of the main free list, and if size classes are enabled, the size classes’ free list heads.

pub struct WeeAlloc {
    head: imp::Exclusive<*mut FreeCell>,

    #[cfg(feature = "size_classes")]
    size_classes: SizeClasses,
}

struct SizeClasses(
    [imp::Exclusive<*mut FreeCell>; SizeClasses::NUM_SIZE_CLASSES],
);

impl SizeClasses {
    pub const NUM_SIZE_CLASSES: usize = 256;

    pub fn get(&self, size: Words) -> Option<&imp::Exclusive<*mut FreeCell>> {
        self.0.get(size.0 - 1)
    }
}

As you can see, every free list head is wrapped in an imp::Exclusive.

The imp module contains target-specific implementation code and comes in two flavors: imp_wasm32.rs and imp_unix.rs. The alloc_pages function we saw earlier is defined in imp_wasm32.rs. There is another alloc_pages function that uses mmap inside imp_unix.rs. The imp::Exclusive wrapper type guarantees exclusive access to its inner value. For WebAssembly, this is a no-op, since SharedArrayBuffers aren’t shipping and there is no shared-data threading. For unix systems, this protects the inner value in a pthread mutex, and is similar to std::sync::Mutex but provides a FnOnce interface rather than an RAII guard.

If size classes are not enabled, we always use the main free list head and the LargeAllocPolicy. If size classes are enabled, we try to get the appropriate size class’s free list head, and if that works, then we use the SizeClassAllocPolicy with it. If there is no size class for the requested allocation size, then we fall back to the main free list and the LargeAllocPolicy.

impl WeeAlloc {
    #[cfg(feature = "size_classes")]
    unsafe fn with_free_list_and_policy_for_size<F, T>(&self, size: Words, f: F) -> T
    where
        F: for<'a> FnOnce(&'a mut *mut FreeCell, &'a AllocPolicy) -> T,
    {
        if let Some(head) = self.size_classes.get(size) {
            let policy = size_classes::SizeClassAllocPolicy(&self.head);
            let policy = &policy as &AllocPolicy;
            head.with_exclusive_access(|head| f(head, policy))
        } else {
            let policy = &LARGE_ALLOC_POLICY as &AllocPolicy;
            self.head.with_exclusive_access(|head| f(head, policy))
        }
    }

    #[cfg(not(feature = "size_classes"))]
    unsafe fn with_free_list_and_policy_for_size<F, T>(&self, size: Words, f: F) -> T
    where
        F: for<'a> FnOnce(&'a mut *mut FreeCell, &'a AllocPolicy) -> T,
    {
        let policy = &LARGE_ALLOC_POLICY as &AllocPolicy;
        self.head.with_exclusive_access(|head| f(head, policy))
    }
}

Finally, all that is left is to tie everything together to implement the alloc method for the Alloc trait:

unsafe impl<'a> Alloc for &'a WeeAlloc {
    unsafe fn alloc(&mut self, layout: Layout) -> Result<*mut u8, AllocErr> {
        if layout.align() > mem::size_of::<usize>() {
            return Err(AllocErr::Unsupported {
                details: "wee_alloc cannot align to more than word alignment",
            });
        }

        let size = Bytes(layout.size());
        if size.0 == 0 {
            return Ok(0x1 as *mut u8);
        }

        let size: Words = size.round_up_to();
        self.with_free_list_and_policy_for_size(size, |head, policy| {
            alloc_with_refill(size, head, policy)
                .map_err(|()| AllocErr::Exhausted { request: layout })
        })
    }

    ...
}

Implementing Deallocation

Deallocation either merges the just-freed block with one of its adjacent neighbors, if they are also free, or it pushes the block onto the front of the free list.

If we are reinserting a block into a size class’s free list, however, it doesn’t make sense to merge blocks. Because the these free lists are always servicing allocations of a single size, we would just end up re-splitting the merged block back exactly as it is split now. There is no benefit to splitting and merging and splitting again. Therefore, we have the AllocPolicy inform us whether merging is desirable or not.

First, let’s examine deallocation without the details of merging. We get the appropriate free list and allocation policy, and conjure up a reference to the AllocatedCell that sits just before the data being freed. Then (assuming we didn’t merge into another block that is already in the free list) we push the block onto the front of the free list.

unsafe impl<'a> Alloc for &'a WeeAlloc {
    ...

    unsafe fn dealloc(&mut self, ptr: *mut u8, layout: Layout) {
        let size = Bytes(layout.size());

        if size.0 == 0 || ptr.is_null() {
            return;
        }

        let size: Words = size.round_up_to();

        self.with_free_list_and_policy_for_size(size, |head, policy| {
            let cell = (ptr as *mut AllocatedCell).offset(-1);
            let cell = &mut *cell;

            let free = cell.into_free_cell(policy);

            if policy.should_merge_adjacent_free_cells() {
                ...
            }

            free.insert_into_free_list(head, policy);
        });
    }
}

When merging cells, the adjacent neighbor(s) we are merging into are also free, and therefore are already inside the free list. Because our free list is singly-linked, rather than doubly-linked, we can’t arbitrarily splice in new elements when we have a handle to an element that is already in the free list. This causes some hiccups.

Merging with the previous adjacent cell is still easy: it is already in the free list, and we aren’t changing the location of the CellHeader, so folding this cell into it is all that needs to be done. The free list can be left alone.

if policy.should_merge_adjacent_free_cells() {
    if let Some(prev) = free.header
        .prev_cell()
        .and_then(|p| (*p).as_free_cell_mut())
    {
        prev.header.next_cell_raw = free.header.next_cell_raw;
        if let Some(next) = free.header.next_cell() {
            (*next).prev_cell_raw = &mut prev.header;
        }
        return;
    }

    ...
}

Merging with the next adjacent cell is a little harder. It is already in the free list, but we need to splice it out from the free list, since its header will become invalid after consolidation, and it is this cell’s header that needs to be in the free list. But, because the free list is singly-linked, we don’t have access to the pointer pointing to the soon-to-be-invalid header, and therefore can’t update that pointer to point to the new cell header. So instead we have a delayed consolidation scheme. We insert this cell just after the next adjacent cell in the free list, and set the next adjacent cell’s NEXT_FREE_CELL_CAN_MERGE bit.

impl FreeCell {
    const NEXT_FREE_CELL_CAN_MERGE: usize = 0b01;
    const _RESERVED: usize = 0b10;
    const MASK: usize = !0b11;

    fn next_free_can_merge(&self) -> bool {
        self.next_free_raw as usize & Self::NEXT_FREE_CELL_CAN_MERGE != 0
    }

    fn set_next_free_can_merge(&mut self) {
        let next_free = self.next_free_raw as usize;
        let next_free = next_free | Self::NEXT_FREE_CELL_CAN_MERGE;
        self.next_free_raw = next_free as *mut FreeCell;
    }

    fn next_free(&self) -> *mut FreeCell {
        let next_free = self.next_free_raw as usize & Self::MASK;
        next_free as *mut FreeCell
    }
}

unsafe impl<'a> Alloc for &'a WeeAlloc {
    ...

    unsafe fn dealloc(&mut self, ptr: *mut u8, layout: Layout) {
        ...

        if policy.should_merge_adjacent_free_cells() {
            ...

            if let Some(next) = free.header
                .next_cell()
                .and_then(|n| (*n).as_free_cell_mut())
            {
                free.next_free_raw = next.next_free();
                next.next_free_raw = free;
                next.set_next_free_can_merge();
                return;
            }
        }

        ...
    }
}

Then, the next time that we walk the free list for allocation, the bit will be checked and the consolidation will happen at that time. Which means that the walk_free_list definition we showed earlier was incomplete, since it didn’t include the code for consolidation. Here is its complete definition:

unsafe fn walk_free_list<F, T>(
    head: &mut *mut FreeCell,
    policy: &AllocPolicy,
    mut f: F,
) -> Result<T, ()>
where
    F: FnMut(&mut *mut FreeCell, &mut FreeCell) -> Option<T>,
{
    let mut previous_free = head;

    loop {
        let current_free = *previous_free;

        if current_free.is_null() {
            return Err(());
        }

        let mut current_free = &mut *current_free;

        if policy.should_merge_adjacent_free_cells() {
            // Now check if this cell can merge with the next cell in the free
            // list. We do this after the initial allocation attempt so that we
            // don't merge, only to immediately split the cell again right
            // afterwards.
            while current_free.next_free_can_merge() {
                let prev_adjacent = current_free.header.prev_cell_raw as *mut FreeCell;
                let prev_adjacent = &mut *prev_adjacent;

                prev_adjacent.header.next_cell_raw = current_free.header.next_cell_raw;
                if let Some(next) = current_free.header.next_cell() {
                    (*next).prev_cell_raw = &mut prev_adjacent.header;
                }

                *previous_free = prev_adjacent;
                current_free = prev_adjacent;
            }
        }

        if let Some(result) = f(previous_free, current_free) {
            return Ok(result);
        }

        previous_free = &mut current_free.next_free_raw;
    }
}

On the other hand, if both the previous and next adjacent cells are free, we are faced with a dilemma. We cannot merge all the previous, current, and next cells together because our singly-linked free list doesn’t allow for that kind of arbitrary appending and splicing in O(1) time. Instead, we use a heuristic to choose whether to merge with the previous or next adjacent cell. We could choose to merge with whichever neighbor cell is smaller or larger, but we don’t. Right now, we prefer the previous adjacent cell because we can greedily consolidate with it immediately, whereas the consolidating with the next adjacent cell must be delayed, as explained above.

If we made the minimum allocation size two words, then we would have room for a doubly-linked free list, and could support consolidating previous, current, and next free cell neighbors. We could also remove the delayed consolidation scheme, which would further simplify a bunch of code. But it would mean effectively three words of overhead for single word heap allocations. It isn’t clear to me whether or not that trade off is worth it or not. To make an informed decision, we’d need a corpus of allocations and frees made by typical WebAssembly applications.

Conclusion

wee_alloc is a work-in-progress prototype, but it already meets its goal of being small.

However, it is certainly lacking in other areas. Sergey Pepyakin tried bootstrapping rustc with wee_alloc enabled as the global allocator, and it is a couple orders of magnitude slower than bootstrapping rustc with its default jemalloc. On the one hand, bootstrapping rustc isn’t exactly the scenario wee_alloc was designed for, and it isn’t surprising that this new, unoptimized prototype is slower than the mature, production-grade jemalloc. Furthermore, I doubt that we can compete with jemalloc on speed without regressing code size. But even so, I think wee_alloc is slower than it should be, and I suspect that there are some low-hanging fruit waiting to be plucked.

I would love, love, love some help building wee_alloc — it’s a lot of fun! Are you interested in code golfing the smallest .wasm binaries? Do you like nitty-gritty profiling and digging into performance? Is wee_alloc missing some obviously-better technique that you’re familiar with? Want to help Rust be the number one choice for compiling to WebAssembly? Fork the wee_alloc repository on GitHub and then come join us in #rust-wasm on irc.mozilla.org and introduce yourself!

Thanks to Mike Cooper and David Keeler for reading early drafts and providing valuable feedback.

⁰ This will be unnecessary soon, since LLVM’s linker, lld, is gaining support for WebAssembly, and already supports this functionality. The Rust toolchain will also start using lld and then we won’t need wasm-gc anymore. ↩

¹ Right now there is a single linear memory, but it is expected that more address spaces will come. They would be useful for, for example, referencing garbage-collected JavaScript or DOM objects directly from WebAssembly code. ↩

² The current_memory and grow_memory instructions will likely be renamed to mem.size and mem.grow. ↩

Introducing wee_alloc

How wee_alloc Works