Recently, we introduced WebAssembly (compiled from Rust) into our source map library. We saw parsing and querying source maps get a whole lot faster. But keeping the .wasm code size small was a challenge.
There are a variety of tools available for removing dead code from WebAssembly .wasm binaries:

wasm-snip replaces a WebAssembly function’s body with a single unreachable instruction. This is useful for manually removing functions that will never be called at runtime, but which the compiler and wasm-gc couldn’t statically prove were dead. After snipping a function, other functions that were only called by the snipped function become unreachable, so running wasm-gc again afterwards often yields further code size reductions.
wasm-opt runs binaryen‘s sophisticated optimization passes over a .wasm file, both shrinking its size and improving its runtime performance.
After using these tools, we looked at what was left in our
.wasm file, and
broke the results down by crate:
Half of our code size was coming from
dlmalloc, the allocator that Rust uses
by default for the
wasm32-unknown-unknown target. Source map parsing requires
just a couple of large, long-lived allocations, and then does its heavy lifting
without allocating any further. Querying a parsed source map doesn’t require
any allocation. So we are paying a lot, in code size, for an allocator that we
aren’t really using that much.
Allocator implementations have trade-offs, but Rust’s default is the wrong choice for the source map parsing and querying scenario.
wee_alloc is a work-in-progress memory allocator designed for WebAssembly. It has a tiny code size footprint, compiling down to only a kilobyte of .wasm code.
wee_alloc is designed for scenarios like those described above: where there are a handful of initial, long-lived allocations, after which the code does its heavy lifting without any further allocations. This scenario requires some allocator to exist, but we are more than happy to trade allocation performance for small code size. wee_alloc is not designed for, and would be a poor choice in, scenarios where allocation is a performance bottleneck.
Although WebAssembly is the primary target,
wee_alloc also has an
implementation for unix systems. This enables testing
wee_alloc, and using
wee_alloc in your own code, without a browser or WebAssembly engine.
Allocating WebAssembly Pages
WebAssembly module instances have a linear memory space¹, and use store and load instructions to access values within it via an index. If an instruction attempts to access the value at an index that is beyond the memory’s bounds, a trap is raised.
There are two instructions for manipulating the linear memory space itself, rather than its contents: the current_memory instruction gives the current size of memory, in units of pages, and the grow_memory instruction takes an operand n, grows the memory space by n pages, and gives back the old size of memory, in units of pages. Alternatively, if growing memory fails, grow_memory returns -1.
WebAssembly does not have any facilities for shrinking memory, at least right now.
To implement allocating n pages of memory in Rust, we need to use LLVM’s intrinsics. Right now, the intrinsic for grow_memory doesn’t expose the return value, but this should be fixed once Rust updates its LLVM. Therefore, our page allocation routine must call current_memory before grow_memory, when it should just use the result of grow_memory instead. It also means we can’t check for failure yet.
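To make the semantics concrete, here is a portable mock of the page-allocation behavior described above: growing takes a delta in 64 KiB pages and returns the old size in pages, with usize::MAX standing in for WebAssembly’s -1 failure value. Everything here (the type, field, and method names) is illustrative, not wee_alloc’s actual code.

```rust
// A hedged, portable sketch of WebAssembly's page-allocation semantics.

const PAGE_SIZE: usize = 65536; // WebAssembly pages are 64 KiB

struct MockLinearMemory {
    bytes: Vec<u8>,
    max_pages: usize, // an assumed engine-imposed limit
}

impl MockLinearMemory {
    /// Mirrors `current_memory`: the current size, in units of pages.
    fn current_memory(&self) -> usize {
        self.bytes.len() / PAGE_SIZE
    }

    /// Mirrors `grow_memory`: grow by `delta` pages and return the *old*
    /// size in pages, or usize::MAX (wasm's -1) if growing fails.
    fn grow_memory(&mut self, delta: usize) -> usize {
        let old = self.current_memory();
        if old + delta > self.max_pages {
            return usize::MAX;
        }
        self.bytes.resize((old + delta) * PAGE_SIZE, 0);
        old
    }
}
```
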
Careful readers will have noticed the Pages type in the code above.
wee_alloc uses newtypes for units of bytes, words, and pages. Each
of these is a thin wrapper around a
usize with relevant operator overloads and
inter-conversions. This has been very helpful in catching bugs at compile time,
like attempts to offset a pointer by two words rather than two bytes, and
compiles away to nothing in the emitted .wasm.
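A minimal sketch of what such unit newtypes can look like; the operator overloads and conversions shown here are illustrative, not wee_alloc’s exact API.

```rust
// Newtypes for units of bytes, words, and pages. Mixing units (say,
// adding Words to Pages) fails to type-check, which is exactly the class
// of bug this catches at compile time.
use core::mem::size_of;
use core::ops::Add;

const PAGE_SIZE: usize = 65536; // WebAssembly pages are 64 KiB

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Bytes(usize);

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Words(usize);

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Pages(usize);

impl Add for Bytes {
    type Output = Bytes;
    fn add(self, rhs: Bytes) -> Bytes {
        Bytes(self.0 + rhs.0)
    }
}

// Converting up to bytes is always lossless, so `From` is appropriate.
impl From<Words> for Bytes {
    fn from(w: Words) -> Bytes {
        Bytes(w.0 * size_of::<usize>())
    }
}

impl From<Pages> for Bytes {
    fn from(p: Pages) -> Bytes {
        Bytes(p.0 * PAGE_SIZE)
    }
}
```
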
But we don’t satisfy individual allocation requests by directly allocating pages. First, the WebAssembly page size is 64KiB, which is much larger than most allocations. Second, because there is no way to return unused pages to the WebAssembly engine, it would be incredibly wasteful if we didn’t reuse pages. Instead, we maintain a free list of blocks of memory we’ve already allocated from the WebAssembly engine.
Free lists have low complexity and are easy to implement. These properties also
lend themselves to a small implementation. The basic idea is to maintain an
intrusive linked list of memory blocks that are available. Allocation removes a
block from the free list. If the block is larger than the requested allocation
size, we can split it in two. Or, if there is no suitable block available in the
free list, we can fall back to
alloc_pages to get a fresh block of
memory. Deallocation reinserts the given block back into the free list, so that
it can be reused for future allocations. Because a block is only in the free
list if it is not allocated and is therefore unused, we can use a word from the
data itself to store the free list links, so long as we ensure that the data is
always at least a word in size.
Here is a diagram of what the free list looks like in memory, showing a free block, followed by an allocated block, followed by another free block:
--. .--------------------------------------. ,-----------
| | | |
V | V |
| free ; data... // ... | data... // ... | free ; data...
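The first-fit allocation and push-front deallocation described above can be sketched with blocks modeled as (offset, size) records for clarity; the real implementation instead stores the links intrusively, inside the free blocks themselves.

```rust
// A simplified, index-based sketch of first-fit free-list allocation.

#[derive(Debug, PartialEq)]
struct Block {
    offset: usize,
    size: usize,
}

/// First fit: take the first free block large enough, splitting off the
/// remainder back onto the free list when there is room.
fn alloc_first_fit(free_list: &mut Vec<Block>, size: usize) -> Option<usize> {
    let pos = free_list.iter().position(|b| b.size >= size)?;
    let block = free_list.remove(pos);
    if block.size > size {
        // Split: return the remainder to the free list.
        free_list.push(Block {
            offset: block.offset + size,
            size: block.size - size,
        });
    }
    Some(block.offset)
}

/// Deallocation reinserts the block at the front of the free list
/// (no merging in this sketch).
fn dealloc(free_list: &mut Vec<Block>, offset: usize, size: usize) {
    free_list.insert(0, Block { offset, size });
}
```
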
Even after choosing to use free lists, we have more design choices to make. How should we choose which block to use from the free list? The first that can satisfy the allocation, aka first fit? The block that is closest to the requested allocation size, in an effort to cut down on fragmentation (more on this in a minute), aka best fit? Pick up the search where we left off last time, aka next fit? Regardless of which of first fit, best fit, and next fit we choose, we are dealing with an O(n) search. Indeed, this is the downside of the trade-off we made when choosing free lists for their simplicity of implementation.
A common technique for speeding up free list allocation is to have separate free lists for allocations of different sizes, which is known as segregated fits or having size classes. With this approach, we can guarantee the invariant that every block in the free list for a particular size can satisfy an allocation request of that size. All we need to do is avoid splitting any block into pieces smaller than that size.
Maintaining this invariant gives us amortized O(1) allocation for these sizes with their own free lists. Using first fit within a size’s free list is guaranteed to only look at the first block within the free list, because the invariant tells us that the first block can satisfy this request. It is amortized because we need to fall back to the O(n) allocation to refill this size’s free list from the fallback, main free list whenever it is empty.
If we reuse the same first fit allocation routine for both our size classes’ free lists and our main free list, then we get the benefits of size classes without paying for them in extra code size.
This is the approach that wee_alloc takes.
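The size-class routing can be sketched as follows; the number of size classes and the round-up-to-words rule here are assumptions for illustration, not wee_alloc’s actual configuration.

```rust
// Map an allocation request to a size class, or to the main free list.

const WORD: usize = core::mem::size_of::<usize>();
const NUM_SIZE_CLASSES: usize = 8; // assumed: classes for 1..=8 word requests

/// Returns the index of the per-size free list that services requests of
/// `bytes` bytes, or None when the request must go to the main, fallback
/// free list.
fn size_class_index(bytes: usize) -> Option<usize> {
    let words = (bytes + WORD - 1) / WORD; // round up to whole words
    if words >= 1 && words <= NUM_SIZE_CLASSES {
        Some(words - 1)
    } else {
        None
    }
}
```

Every block in the list `classes[i]` would be exactly `i + 1` words, so first fit within it never looks past the first block.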
Our other main concern is avoiding fragmentation. Fragmentation is the degree
of wasted space between allocations in memory. High fragmentation can lead to
situations where there exist many free blocks of memory, none of which can
fulfill some allocation request, because each individual free block’s size is
too small, even if the sum of their sizes is more than enough for the requested
allocation. Therefore, a high degree of fragmentation can effectively break an
allocator. It had one job — allocate memory — and it can’t even do
that anymore. So
wee_alloc really should have some kind of story here;
punting 100% on fragmentation is not a practical option.
Once again there are trade-offs, and avoiding fragmentation is not a binary choice. On one end of the spectrum, compacting garbage collectors can re-arrange objects in memory and pack them tightly next to each other, effectively leading to zero fragmentation. The cost you pay is the size and time overhead of a full tracing garbage collector that can enumerate all pointers in the system and patch them to point to moved objects’ new locations. On the opposite end of the spectrum, if we never re-consolidate two adjacent blocks of memory that we previously split from what had originally been a single contiguous block, then we can expect a lot of fragmentation. As we split blocks into smaller blocks for smaller allocations, we get small bits of wasted space between them, and even after we free all these small allocations, we won’t have any large block in the free list for a large allocation, even if we have multiple adjacent, small blocks in the free list that could be merged together to satisfy it.
One possibility is keeping the free list sorted by each block’s address; deallocating a block would then re-insert it into the free list at the sorted location. If either of its neighbors in the free list is immediately adjacent in memory, we could consolidate them. But then deallocation is an O(n) search through the free list, instead of an O(1) push onto its front. We could lower that to O(log n) by representing the free list as a balanced search tree or B-tree. But the implementation complexity goes up, and I suspect code size would go up with it.
Instead of a free list, we could use bitmaps to track which portions of our heap are free or allocated, and then consolidation could happen automatically as bits next to each other are reset. But then we need to restrict our allocator to parceling out portions of a single, contiguous region of memory. This implies that only a single, global allocator exists, since if there were multiple, each instance would want to “own” the end of the WebAssembly linear memory, and have the power to grow it to satisfy more and larger allocations. And maybe this is a fair constraint to impose in the context of WebAssembly, where memory is already linear and contiguous. But lifting this constraint, while still using bitmaps, implies a hybrid free list and bitmap implementation. The downside to that is more implementation complexity, and a larger code size footprint.
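For comparison, here is a toy sketch of the bitmap approach, with one bit tracking each allocation unit; freeing simply resets bits, and adjacent free runs consolidate automatically. This is purely illustrative and not part of wee_alloc.

```rust
// A toy bitmap allocator: `true` means the unit is allocated.

/// Find `units` consecutive free bits, mark them allocated, and return
/// the starting unit index, or None if no run is long enough.
fn bitmap_alloc(bits: &mut Vec<bool>, units: usize) -> Option<usize> {
    let mut run = 0;
    for i in 0..bits.len() {
        if bits[i] {
            run = 0;
        } else {
            run += 1;
        }
        if run == units {
            let start = i + 1 - units;
            for b in &mut bits[start..=i] {
                *b = true;
            }
            return Some(start);
        }
    }
    None
}

/// Freeing resets the bits; no explicit merge step is ever needed.
fn bitmap_free(bits: &mut Vec<bool>, start: usize, units: usize) {
    for b in &mut bits[start..start + units] {
        *b = false;
    }
}
```
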
wee_alloc takes a third approach: trading some space overhead for easy and
fast merging. We maintain a sorted, doubly-linked list of all blocks, whether
allocated or free. This adds two words of space overhead to every heap
allocation. When freeing a block, we check if either of its adjacent blocks are
also free, and if so, merge them together with a handful of updates to the next
and previous pointers. If neither of the neighbors are free, then we push this
block onto the front of the free list. In this way, we keep both O(1)
deallocation and our simple free list implementation.
Here is a diagram of what this sorted, doubly-linked list looks like in memory:
| X | | | |
V | V | V |
| prev ; next ; data... // ... | prev ; next ; data... // ... | prev ; next ; data...
| ^ | ^ |
| | | | |
`---------------------' `---------------------' `------------
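The merge-on-free idea can be sketched with an address-ordered list of cells standing in for the doubly-linked headers. Note that this simplified version freely merges with both neighbors, while wee_alloc’s actual singly-linked free list makes merging with the next neighbor trickier, as described later in the post.

```rust
// Cells in address order; in wee_alloc this ordering is maintained by
// the sorted, doubly-linked list of CellHeaders.
#[derive(Clone, Debug, PartialEq)]
struct Cell {
    size: usize,
    free: bool,
}

/// Free the cell at index `i`, merging it with any free neighbors.
fn free_and_merge(cells: &mut Vec<Cell>, i: usize) {
    cells[i].free = true;
    // Merge with the next adjacent cell, if it is free.
    if i + 1 < cells.len() && cells[i + 1].free {
        cells[i].size += cells[i + 1].size;
        cells.remove(i + 1);
    }
    // Merge with the previous adjacent cell, if it is free.
    if i > 0 && cells[i - 1].free {
        cells[i - 1].size += cells[i].size;
        cells.remove(i);
    }
}
```
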
CellHeader contains the common data members found within both allocated
and free memory blocks: the next and previous doubly-linked list pointers.
We use a low bit of the
next_cell_raw pointer to distinguish whether the cell
is allocated or free, and can consult its value to dynamically cast to an AllocatedCell or FreeCell.
We use pointer arithmetic to calculate the size of a given cell’s data to avoid
another word of space overhead, so the
next_cell_raw pointer must always point
just after this cell’s data. But, because of that restriction, we can’t use a
null pointer as the sentinel for the end of the doubly-linked-list. Therefore,
we use the second low bit of the
next_cell_raw pointer to distinguish whether
the data pointed to by
next_cell_raw (after the appropriate masking) is a
valid cell, or is garbage memory.
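The low-bit tagging that CellHeader relies on can be sketched as follows; the constant and function names are illustrative rather than wee_alloc’s exact identifiers.

```rust
// Word-aligned addresses always have their two low bits clear, so those
// bits are free to carry metadata flags.

const IS_ALLOCATED: usize = 0b01;
const NEXT_CELL_IS_INVALID: usize = 0b10;
const FLAGS_MASK: usize = 0b11;

/// Pack flags into the low bits of an aligned address.
fn tag(addr: usize, flags: usize) -> usize {
    debug_assert_eq!(addr & FLAGS_MASK, 0, "address must be word-aligned");
    addr | flags
}

/// Mask the flags back off to recover the original address.
fn untagged_addr(tagged: usize) -> usize {
    tagged & !FLAGS_MASK
}

fn is_allocated(tagged: usize) -> bool {
    tagged & IS_ALLOCATED != 0
}

fn next_is_invalid(tagged: usize) -> bool {
    tagged & NEXT_CELL_IS_INVALID != 0
}
```
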
AllocatedCell is a
CellHeader followed by data that is allocated.
FreeCell is a
CellHeader followed by data that is not in use, and from
which we recycle a word for the next link in the free list.
AllocatedCell and FreeCell have methods that make sense only when
the cell is allocated or free, respectively, and maintain the invariants
required for cells of their state. For example, the method for transforming a
FreeCell into an
AllocatedCell ensures that the
IS_ALLOCATED bit gets set,
and the method for transforming an
AllocatedCell into a
FreeCell unsets that bit.
Let’s begin by looking at first fit allocation without any refilling of the free list in the case where there are no available blocks of memory that can satisfy this allocation request. Given the head of a free list, we search for the first block that can fit the requested allocation. Upon finding a suitable block, we determine whether to split the block in two, or use it as is. If we don’t find a suitable block we return an error.
Splitting a cell in two occurs when a cell has room for both the requested
allocation and for another adjacent cell afterwards that is no smaller than some
minimum block size. We use the
&AllocPolicy trait object to configure this
minimum block size, among other things, for different size classes without the
code duplication that monomorphization creates. If there is room to split, then
we insert the newly split cell into the free list, remove the current cell, and
fix up the doubly-linked list of adjacent cells in the headers.
Refilling a free list when there is not a suitable block already in it is
easy. For the main free list, we allocate new pages directly from the
WebAssembly engine with the
alloc_pages function we defined earlier. For a
size class’s free list, we allocate a (relatively) large block from the main
free list. This logic is encapsulated in the two different AllocPolicy implementations.
To allocate with a fallback to refill the free list, we do just that: attempt a first fit allocation, if that fails, refill the free list by pushing a new cell onto its front, and then try a first fit allocation once again.
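That allocate-refill-retry flow can be sketched like this, with blocks modeled as bare sizes and `refill` standing in for either alloc_pages or carving a large block from the main free list; the names are illustrative.

```rust
// First fit over a list of block sizes (simplified: no splitting).
fn first_fit(free_list: &mut Vec<usize>, size: usize) -> Option<usize> {
    let pos = free_list.iter().position(|&b| b >= size)?;
    Some(free_list.remove(pos))
}

/// Attempt a first-fit allocation; if that fails, refill the free list
/// (pushing a fresh block onto its front) and try once more.
fn alloc_with_refill<F>(
    free_list: &mut Vec<usize>,
    size: usize,
    refill: F,
) -> Option<usize>
where
    F: FnOnce(&mut Vec<usize>),
{
    if let Some(block) = first_fit(free_list, size) {
        return Some(block);
    }
    refill(free_list);
    first_fit(free_list, size)
}
```
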
But where do we get the free list heads from? The
WeeAlloc structure holds the
head of the main free list, and if size classes are enabled, the size classes’
free list heads.
As you can see, every free list head is wrapped in an imp::Exclusive.
The imp module contains target-specific implementation code and comes in two flavors: the alloc_pages function we saw earlier is defined in imp_wasm32.rs, and there is another implementation for unix systems.
The imp::Exclusive wrapper type guarantees exclusive access to its inner value. For WebAssembly, this is a no-op, since SharedArrayBuffers aren’t shipping and there is no shared-data threading. For unix systems, this protects the inner value in a pthread mutex,
and is similar to
std::sync::Mutex but provides a
FnOnce interface rather
than an RAII guard.
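A minimal sketch of such an Exclusive-style wrapper, built here on std::sync::Mutex and exposing a FnOnce interface; this illustrates the design and is not wee_alloc’s actual imp module.

```rust
// An Exclusive-style wrapper: callers hand in a closure instead of
// receiving an RAII guard, so the lock can never outlive the call.
use std::sync::Mutex;

struct Exclusive<T>(Mutex<T>);

impl<T> Exclusive<T> {
    fn new(value: T) -> Self {
        Exclusive(Mutex::new(value))
    }

    /// Run `f` with exclusive access to the inner value, returning
    /// whatever `f` returns.
    fn with_exclusive_access<R>(&self, f: impl FnOnce(&mut T) -> R) -> R {
        let mut guard = self.0.lock().unwrap();
        f(&mut *guard)
    }
}
```

On a wasm32 target without threads, the same interface can be provided with no mutex at all, which is why the wrapper is defined per-target in imp.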
If size classes are not enabled, we always use the main free list head and the
LargeAllocPolicy. If size classes are enabled, we try to get the appropriate
size class’s free list head, and if that works, then we use the
SizeClassAllocPolicy with it. If there is no size class for the requested
allocation size, then we fall back to the main free list and the LargeAllocPolicy.
Finally, all that is left is to tie everything together to implement the alloc method for the WeeAlloc type.
Deallocation either merges the just-freed block with one of its adjacent neighbors, if they are also free, or it pushes the block onto the front of the free list.
If we are reinserting a block into a size class’s free list, however, it doesn’t
make sense to merge blocks. Because these free lists are always servicing
allocations of a single size, we would just end up re-splitting the merged block
back exactly as it is split now. There is no benefit to splitting and merging
and splitting again. Therefore, we have the
AllocPolicy inform us whether
merging is desirable or not.
First, let’s examine deallocation without the details of merging. We get the
appropriate free list and allocation policy, and conjure up a reference to the
AllocatedCell that sits just before the data being freed. Then (assuming we
didn’t merge into another block that is already in the free list) we push the
block onto the front of the free list.
When merging cells, the adjacent neighbor(s) we are merging into are also free, and therefore are already inside the free list. Because our free list is singly-linked, rather than doubly-linked, we can’t arbitrarily splice in new elements when we have a handle to an element that is already in the free list. This causes some hiccups.
Merging with the previous adjacent cell is still easy: it is already in the
free list, and we aren’t changing the location of the
CellHeader, so folding
this cell into it is all that needs to be done. The free list can be left alone.
Merging with the next adjacent cell is a little harder. It is already in the
free list, but we need to splice it out from the free list, since its header
will become invalid after consolidation, and it is this cell’s header that
needs to be in the free list. But, because the free list is singly-linked, we
don’t have access to the pointer pointing to the soon-to-be-invalid header, and
therefore can’t update that pointer to point to the new cell header. So instead
we have a delayed consolidation scheme. We insert this cell just after the next adjacent cell in the free list, and set a bit in the next adjacent cell’s header marking it for delayed consolidation.
Then, the next time that we walk the free list for allocation, the bit will be
checked and the consolidation will happen at that time. This means that the
walk_free_list definition we showed earlier was incomplete, since it didn’t
include the code for consolidation. Here is its complete definition:
On the other hand, if both the previous and next adjacent cells are free, we are faced with a dilemma. We cannot merge all the previous, current, and next cells together because our singly-linked free list doesn’t allow for that kind of arbitrary appending and splicing in O(1) time. Instead, we use a heuristic to choose whether to merge with the previous or next adjacent cell. We could choose to merge with whichever neighbor cell is smaller or larger, but we don’t. Right now, we prefer the previous adjacent cell because we can greedily consolidate with it immediately, whereas the consolidating with the next adjacent cell must be delayed, as explained above.
If we made the minimum allocation size two words, then we would have room for a doubly-linked free list, and could support consolidating previous, current, and next free cell neighbors. We could also remove the delayed consolidation scheme, which would further simplify a bunch of code. But it would mean effectively three words of overhead for single word heap allocations. It isn’t clear to me whether that trade-off is worth it. To make an informed decision, we’d need a corpus of allocations and frees made by typical WebAssembly applications.
wee_alloc is a work-in-progress prototype, but it already meets its goal of tiny code size. However, it is certainly lacking in other areas. Sergey Pepyakin tried building rustc with wee_alloc enabled as the global allocator, and it is a couple orders of magnitude slower than rustc with its default jemalloc. On the one hand,
rustc isn’t exactly the scenario
wee_alloc was designed for,
and it isn’t surprising that this new, unoptimized prototype is slower than the mature jemalloc. Furthermore, I doubt that we can compete with jemalloc on speed without regressing code size. But even so, I think
wee_alloc is slower than it should be, and I suspect that there are some
low-hanging fruit waiting to be plucked.
I would love, love, love some help building
wee_alloc — it’s a lot
of fun! Are you interested in code golfing the smallest
.wasm binaries? Do you
like nitty-gritty profiling and digging into performance? Is there some obviously-better technique that you’re familiar with? Want to help Rust be
the number one choice for compiling to WebAssembly? Fork the
repository on GitHub and then come join us on irc.mozilla.org and introduce yourself!
0 This will be unnecessary soon, since LLVM’s linker, lld, is gaining support for WebAssembly, and already supports this functionality. The Rust toolchain will also start using lld, and then we won’t need wasm-gc anymore. ↩