When writing a bump allocator, always bump downwards. That is, allocate from high addresses, down towards lower addresses by decrementing the bump pointer. Although it is perhaps less natural to think about, it is more efficient than incrementing the bump pointer and allocating from lower addresses up to higher ones.

### What is Bump Allocation?

Bump allocation is a super fast method for allocating objects. We have a chunk of memory, and we maintain a “bump pointer” within that memory. Whenever we allocate an object, we do a quick test that we have enough capacity left in the chunk, and then, assuming we have enough room, we move the bump pointer over by `sizeof(object)` bytes and return the pointer to the space we just reserved for the object within the chunk.

That’s it!

Here is some pseudo-code showing off the algorithm:

The trade off with bump allocation is that we can’t deallocate individual objects in the general case. We can deallocate all of them en mass by resetting the bump pointer back to its initial location. We can deallocate in a LIFO, stack order by moving the bump pointer in reverse. But we can’t deallocate an arbitrary object in the middle of the chunk and reclaim its space for new allocations.

Finally, notice that the chunk of memory we are bump allocating within is always split in two: the side holding allocated objects and the side with free memory. The bump pointer separates the two sides. Furthermore, note that I haven’t defined which side of the bump pointer is free or allocated space, and I’ve carefully avoided saying whether the bump pointer is incremented or decremented.

### Bumping Upwards

First, let’s consider what we shouldn’t do: bump upwards by initializing the bump pointer at the low end of our memory chunk and incrementing the bump pointer on each allocation.

We begin with a `struct` that holds the start and end addresses of our chunk of memory, as well as our current bump pointer:

Constructing our upwards bump allocator requires giving it the `start` and `end` pointers, and it will initialize its bump pointer to the `start` address:

To allocate an object, we will begin by grabbing the current bump pointer, and saving it in a temporary variable: this is going to be the pointer to the newly allocated space. Then we increment the bump pointer by the requested size, and check if it is still less than `end`. If so, then we have capacity for the allocation, and can commit the new bump pointer to `self.ptr` and return the temporary pointing to the freshly allocated space.

But first, there is one thing that the pseudo-code ignored, but which a real implementation cannot: alignment. We need to round up the initial bump pointer to a multiple of the requested alignment before we compute the new bump pointer by adding the requested size.0

Put all that together, and it looks like this:

If we compile this allocation routine to x86-64 with optimizations, we get the following code, which I’ve lightly annotated:

I’m not going to explain each individual instruction. What’s important to appreciate here is that this is just a small handful of fast instructions with only a single branch to handle the not-enough-capacity case. This is what makes bump allocation so fast — great!

But before we get too excited, there is another practicality to consider: to maintain memory safety, we must handle potential integer overflows in the allocation procedure, or else we could have bugs where we return pointers outside the bounds of our memory chunk. No good!

There are two opportunities for overflow we must take care of:

1. If the requested allocation’s size is large enough, `aligned + size` can overflow.

2. If the requested allocation’s alignment is large enough, the ```ptr + align - 1``` sub-expression we use when rounding up to the alignment can overflow.

To handle both these cases, we will use checked addition and return a null pointer if either addition overflows. Here is the new Rust source code:

Now that we’re handling overflows in addition to the alignment requirements, let’s take a look at the x86-64 code that `rustc` and LLVM produce for the function now:

Now there are three conditional branches rather than one. The two new branches are from those two new overflow checks that we added. Less than ideal.

Can bumping downwards do better?

### Bumping Downwards

Now let’s implement a bump allocator where the bump pointer is initialized at the end of the memory chunk and is decremented on allocation, so that it moves downwards towards the start of the memory chunk.

The `struct` is identical to the previous version:

Constructing a `BumpDown` is similar to constructing a `BumpUp` except we initialize the `ptr` to `end` rather than `start`:

When we were allocating by incrementing the bump pointer, the original bump pointer value before it was incremented pointed at the space that was about to be reserved for the allocation. When we are allocating by decrementing the bump pointer, the original bump pointer is pointing at either the end of the memory chunk, or at the last allocation we made. What we want to return is the value of the bump pointer after we decrement it down, at which time it will be pointing at our allocated space.

First we subtract the allocation size from the bump pointer. This subtraction might overflow, so we check for that and return a null pointer if that is the case, just like we did in the previous, upward-bumping function. Then, we round that down to the nearest multiple of `align` to ensure that the allocated space has the object’s alignment. At this point, we check if we are down past the start of our memory chunk, in which case we don’t have the capacity to fulfill this allocation, and we return null. Otherwise, we update the bump pointer to its new value and return the pointer!

And here is the x86-64 code generated for this downward-bumping allocation routine!

Because rounding down doesn’t require an addition or subtraction operation, it doesn’t have an associated overflow check. That means one less conditional branch in the generated code, and downward bumping only has two conditional branches versus the three that upward bumping has.

Additionally, because we don’t need to save the original bump pointer value, this version uses fewer registers than the upward-bumping version. Bump allocation functions are designed to be fast paths that are inlined into callers, which means that downward bumping is creating less register pressure at every call site.

Finally, this downwards-bumping version is implemented with eleven instructions, while the upwards-bumping version requires thirteen instructions. In general, fewer instructions implies a shorter run time.

### Benchmarks

I recently switched the `bumpalo` crate from bumping upwards to bumping downwards. It has a nice, little micro-benchmark suite that is written with the excellent, statistics-driven Criterion.rs benchmarking framework. With Criterion’s built-in support for defining a baseline measurement and comparing an alternate implementation of the code against it, I compared the new, downwards-bumping implementation against the original, upwards-bumping implementation.

The new, downwards-bumping implementation has up to 19% better allocation throughput than the original, upwards-bumping implementation! We’re down to 2.7 nanoseconds per allocation.

The plot below shows the probability of allocating 10,000 small objects taking a certain amount of time. The red curve represents the old, upwards-bumping implementation, while the blue curve shows the new, downwards-bumping implementation. The lines represent the mean time.

You can view the complete, nitty-gritty benchmark results in the pull request.

#### The One Downside: Losing a `realloc` Fast Path

`bumpalo` doesn’t only provide an allocation method, it also provides a `realloc` method to resize an existing allocation. `realloc` is O(n) because in the worst-case scenario it needs to allocate a whole new region of memory and copy the data from the old to the new region. But the old, upwards-bumping implementation had a fast path for growing the last allocation: it would add the delta size to the bump pointer, leaving the allocation in place and avoiding that copy. The new, downwards-bumping implementation also has a fast path for resizing the last allocation , but even if we reuse that space, the start of the allocated region of memory has shifted, and so we can’t avoid the data copy.

The loss of that fast path leads to a 4% slow down in our `realloc` benchmark that formats a string into a bump-allocated buffer, triggering a number of `realloc`s as the string is constructed. We felt that this was worth the trade off for faster allocation.

### Less Work with More Alignment?

It is rare for types to require more than word alignment. We could enforce a minimum alignment on the bump pointer at all times that is greater than or equal to the vast majority of our allocations’ alignment requirements. If our allocation routine is monomorphized for the type of the allocation it’s making, or it is aggressively inlined — and it definitely should be — then we should be able to completely avoid generating any code to align the bump pointer in most cases, including the conditional branch on overflow if we are rounding up for upwards bumping.

The trade off is extra memory overhead from introducing wasted space between small allocations that don’t require that extra alignment.

### Conclusion

If you are writing your own bump allocator, you should bump downwards: initialize the bump pointer to the end of the chunk of memory you are allocating from within, and decrement it on each allocation so that it moves down towards the start of the memory chunk. Downwards bumping requires fewer registers, fewer instructions, and fewer conditional branches. Ultimately, that makes it faster than bumping upwards.

The one exception is if, for some reason, you frequently use `realloc` to grow the last allocation you made, in which case you might get more out of a fast path for growing the last allocation in place without copying any data. And if you do decide to bump upwards, then you should strongly consider enforcing a minimum alignment on the bump pointer to recover some of the performance that you’re otherwise leaving on the table.

Finally, I’d like to thank Jim Blandy, Alex Crichton, Jeena Lee, and Jason Orendorff for reading an early draft of this blog post, for discussing these ideas with me, and for being super friends :)

0 The simple way to round `n` up to a multiple of `align` is

``(n + align - 1) / align * align``

Consider the numerator: `n + align - 1`. This is ensuring that if there is any remainder for `n / align`, then the result of the division sub-expression is one greater than `n / align`, and that otherwise we get exactly the same result as `n / align` due to integer division rounding off the remainder. In other words, we only round up if `n` is not aligned to `align`.

However, we know `align` is a power of two, and therefore ```anything / align``` is equivalent to `anything >> log2(align)` and `anything * align` is equivalent to `anything << log2(align)`. We can therefore rewrite our expression into:

``(n + align - 1) >> log2(align) << log2(align)``

But shifting a value right by some number of bits `b` and then shifting it left by that same number of bits `b` is equivalent to clearing the bottom `b` bits of the number. We can clear the bottom `b` bits of a number by bit-wise and’ing the number with the bit-wise not of `2^b - 1`. Plugging this into our equation and simplifying, we get:

``````  (n + align - 1) >> log2(align) << log2(align)
= (n + align - 1) & !(2^log2(align) - 1)
= (n + align - 1) & !(align - 1)``````

And now we have our final version of rounding up to a multiple of a power of two!

If you find these bit twiddling hacks fun, definitely find yourself a copy of Hacker’s Delight. It’s a wonderful book!