Vyacheslav Egorov, who goes by
mraleph on the Web, wrote a response
to my article “Oxidizing Source Maps with Rust and WebAssembly”
titled “Maybe you don’t need Rust and WASM to speed up your JS”.
The “Oxidizing” article recounts my experience integrating Rust
(compiled to WebAssembly) into the source-map JavaScript library. As the
JavaScript implementation accrued speed
improvements, the code became hard to read and maintain. With Rust and its
zero-cost abstractions, we found that there was no trade-off between performance
and clean code.
In his response, mraleph applies his considerable profiling expertise to
heavily optimize the JavaScript implementation of the source-map library.
He specializes in Just-In-Time (JIT) compilers, and has contributed to Google’s
V8 JavaScript engine, among others. Ultimately, he brings the JavaScript
implementation’s speed close to that of
the Rust and WebAssembly implementation:
Here is where we started:
$ d8 bench-shell-bindings.js
...
[Stats samples: 5, total: 24050 ms, mean: 4810 ms, stddev: 155.91063145276527 ms]

$ sm bench-shell-bindings.js
...
[Stats samples: 7, total: 22925 ms, mean: 3275 ms, stddev: 269.5999093306804 ms]
and this is where we are finishing:

$ d8 bench-shell-bindings.js
...
[Stats samples: 22, total: 25158 ms, mean: 1143.5454545454545 ms, stddev: 16.59358125226469 ms]

$ sm bench-shell-bindings.js
...
[Stats samples: 31, total: 25247 ms, mean: 814.4193548387096 ms, stddev: 5.591064299397745 ms]
This is a factor of 4 improvement!
The optimizations he implemented roughly fall into three buckets:
- Algorithmic improvements.
- Avoiding allocations to reduce garbage collections.
- Staying on the JIT’s happy paths to gain compiler optimizations, and subsequently avoiding falling off performance cliffs.
His article is well-measured and a great read. I particularly like that the
opportunities for algorithmic improvements he noticed helped speed
up the Rust and WebAssembly implementation another ~3x over what we had
previously seen. Thanks to his suggestions, the current version of the
source-map library is faster than it has ever been.
Before continuing, I suggest that you go back and read the two articles if you haven’t already.
Staying on the JIT’s Happy Paths
mraleph noticed that some sorting comparator functions had an arity of
three, the last parameter being optional, but were only ever actually invoked
with two arguments in hot code paths. In V8, this arity mismatch results in an
extra trampoline function in the emitted machine code, and fixing the arity
mismatch eliminated that overhead:
Just by fixing the arity mismatch we improved benchmark mean on V8 by 14% from 4774 ms to 4123 ms. If we profile the benchmark again we will discover that
ArgumentsAdaptorTrampoline has completely disappeared from it. Why was it there in the first place?
It turns out that ArgumentsAdaptorTrampoline is V8’s way of coping with JavaScript’s variadic calling convention: you can call a function that has 3 parameters with 2 arguments, in which case the third parameter will be filled with
undefined. V8 does this by creating a new frame on the stack, copying arguments down and then invoking the target function.
Impressive results for such a tiny change! But that’s also part of the problem:
these sorts of JIT-specific pitfalls are invisible in the source code, and they don’t announce
themselves to readers. Worse yet, what is fine in one engine’s JIT might be a
performance cliff in another engine’s JIT. This is the case with this arity
mismatch, as mraleph observes:
For what it is worth argument adaptation overhead seems to be highly V8 specific. When I benchmark my change against SpiderMonkey, I don’t see any significant performance improvement from matching the arity.
mraleph noticed that the generic sorting routine was not being
monomorphized, and the comparator functions passed to it were not being inlined.
To make the call sites eligible
for monomorphization and inlining, he stringified the functions to get their
source text and then re-evaluated that source text with
eval. Each resulting function object is distinct and comes
with its own type inference records. Therefore, from the JIT’s point of view,
the types passed to the new functions are not intermingled with those passed to
the originals, and the JIT will monomorphize them in the machine code it emits.
With Rust and WebAssembly, we have reliable speed without clever tricks.
WebAssembly is designed to perform well without relying on heuristic-based
optimizations, avoiding the performance cliffs that come if code doesn’t meet
those heuristics. It is expected that the compiler emitting the WebAssembly (in
our case, rustc and LLVM) already has sophisticated optimization
infrastructure, that the engine is receiving WebAssembly code that has already
had optimization passes applied, and that the WebAssembly is close to its final
machine code form.
Rust lets us talk about monomorphization and inlining directly, without
circuitous incantations. We can annotate functions with
#[inline], #[inline(always)], and
#[inline(never)] attributes to explicitly communicate
our inlining suggestions to the compiler. With generic functions, we can choose
between monomorphization via type parameters and dynamic dispatch via trait
objects.
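To make that contrast concrete, here is a minimal Rust sketch of both dispatch strategies; the Mapping struct and the comparator are illustrative stand-ins, not the source-map library’s actual types:

```rust
use std::cmp::Ordering;

// Hypothetical mapping record; illustrative only, not the library's type.
#[derive(Clone, Copy)]
struct Mapping {
    generated_line: u32,
    generated_column: u32,
}

// An explicit hint that this small comparator should be inlined.
#[inline(always)]
fn by_generated_location(a: &Mapping, b: &Mapping) -> Ordering {
    a.generated_line
        .cmp(&b.generated_line)
        .then(a.generated_column.cmp(&b.generated_column))
}

// Static dispatch: the compiler monomorphizes a fresh copy of this function
// for each concrete `F`, so the comparator can be inlined into the sort loop.
fn sort_static<F>(mappings: &mut [Mapping], compare: F)
where
    F: Fn(&Mapping, &Mapping) -> Ordering,
{
    mappings.sort_by(|a, b| compare(a, b));
}

// Dynamic dispatch: a single copy of this function exists, and each
// comparison goes through the trait object's vtable instead of being inlined.
fn sort_dynamic(
    mappings: &mut [Mapping],
    compare: &dyn Fn(&Mapping, &Mapping) -> Ordering,
) {
    mappings.sort_by(|a, b| compare(a, b));
}
```

Neither choice depends on a JIT noticing anything at run time; the dispatch strategy is stated directly in the function signature.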
Avoiding Allocation and Reducing Garbage Collection
In the JavaScript implementation, the parser originally allocated a new
Object for every parsed mapping, and each such object must be managed by the garbage collector. To reduce GC
pressure during parsing,
mraleph modified the parser so that, rather than allocating
Objects, it uses a linear typed array buffer and a pointer into
that buffer as a bump allocator. It writes parsed mappings into the typed array,
doubling the typed array’s size whenever it fills up.
With this technique, the JavaScript implementation is effectively doing manual memory management, without
even the abstractions that C provides. In principle, a JIT might optimize
away such allocations, but (once again) that depends on unreliable heuristics,
and JIT engines vary in their effectiveness at removing the allocations.
Since we can’t define any
structs to name the records and their members in the
typed array, we must manually write, in
mraleph’s words, “verbose and error
prone code” to read and write
memory[pointer + static_offset_of_member].
For Rust, structs are bread
and butter. Not only can we name a
struct without boxing
it or requiring a GC, we don’t have to sacrifice any of Rust’s abstractions to
do so. In fact, Rust’s ownership, lifetimes, and borrowing let us comfortably
avoid allocations and copies in ways that would cause severe headaches even in C
or C++.
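As a sketch of what that looks like in practice, parsed records can be plain structs pushed into a Vec; the field names here are hypothetical, not the library’s actual layout:

```rust
// A parsed mapping as a plain struct. Field names are illustrative, not the
// source-map library's actual definition. Stored in a `Vec`, these records
// live contiguously in memory with no boxing and no garbage collector,
// much like the hand-packed typed array, but with named members.
#[derive(Debug, Clone, Copy)]
struct Mapping {
    generated_line: u32,
    generated_column: u32,
    original_line: u32,
    original_column: u32,
    source: u32,
    name: u32,
}

// A stand-in for the parser's inner loop: `Vec` grows by doubling, just like
// the typed-array bump allocator, but the offset arithmetic for each member
// is generated by the compiler instead of written by hand.
fn parse_demo() -> Vec<Mapping> {
    let mut mappings = Vec::new();
    mappings.push(Mapping {
        generated_line: 1,
        generated_column: 2,
        original_line: 1,
        original_column: 0,
        source: 0,
        name: 0,
    });
    mappings
}
```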
Algorithmic Improvements

The final group of optimizations that
mraleph implemented consisted of two
algorithmic improvements: rather than sorting all mappings at once, sort only
subsequences of mappings at a time.
First, mraleph noticed that the encoding for mappings means that they are
already sorted by generated line number as we parse them. It follows that, if we
wish to sort mappings by generated location (i.e. generated line and column,
breaking ties with original location), we only need to sort within each
generated line’s subsequence to sort the entire sequence by generated location.
Second, when sorting by original location, we can bucket mappings by their source file, and then we only need to sort one source file’s bucket at a time. Furthermore, we can lazily sort each source file’s bucket.
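The first of those improvements can be sketched in Rust as follows, assuming (as the encoding guarantees) that mappings arrive already grouped by generated line; the struct is a simplified stand-in for the library’s actual type:

```rust
// Simplified mapping record; illustrative only.
#[derive(Debug, Clone, Copy)]
struct Mapping {
    generated_line: u32,
    generated_column: u32,
}

// Because parsing yields mappings already ordered by generated line, sorting
// each line's subsequence by column sorts the whole sequence by generated
// location. Sorting many short runs is cheaper than one sort over everything.
fn sort_by_generated_location(mappings: &mut [Mapping]) {
    let mut start = 0;
    while start < mappings.len() {
        let line = mappings[start].generated_line;
        // Find the end of this generated line's run of mappings.
        let mut end = start + 1;
        while end < mappings.len() && mappings[end].generated_line == line {
            end += 1;
        }
        // Sort just this subsequence by generated column.
        mappings[start..end].sort_by_key(|m| m.generated_column);
        start = end;
    }
}
```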
I implemented both subsequence sorting optimizations for the current Rust and
WebAssembly version of the
source-map library, and then measured the “setting
a breakpoint for the first time” benchmark once more, using the same methods
and setup as in the “Oxidizing” article. Below are the results;
recall that lower values are better.
Speed improved another ~3x over what we had previously seen. The current
version of the
source-map library is now 5.3x faster than the original JavaScript
implementation. mraleph did not publish his complete set of changes, so I could not
directly compare against his final, fully optimized JavaScript version.
Most of the improvements that
mraleph implemented are desirable regardless of
the programming language that is our medium. Excessive allocation rates make any
garbage collector (or
malloc and free implementation) a bottleneck.
Monomorphization and inlining are crucial to eking out performance in both Rust
and JavaScript.
mraleph concludes with a bit of sound engineering advice:
Obviously each developer and each team are free to choose between spending
N rigorous hours profiling and reading and thinking about their JavaScript code, or to spend
M hours rewriting their stuff in a language X.
I couldn’t agree more! It is important to remember that engineering choices have trade-offs, and it is equally important to be cognizant of the choices in the first place.
We chose to rewrite a portion of our library in Rust and WebAssembly not just for the speed ups, although those are certainly nice, but also for maintainability and to get rid of the kludges added to gain JIT optimizations. We wanted to return to clean code and clearly expressed intent. It has been a great success.
Rewriting the whole library from scratch would not have been tenable either. We avoided that because Rust’s small runtime and freedom from a garbage collector mean that incremental adoption is both possible and practical. We were able to surgically replace just the hottest code paths with Rust and WebAssembly, leaving the rest of the library’s code in place.
Finally, I’d like to re-broadcast
mraleph’s points about profiling:
Profiling and fine grained performance tracking in various shapes and forms is the best way to stay on top of the performance. It allows you to localize hot-spots in your code and also reveals potential issues in the underlying runtime. For this particular reason don’t shy away from using low-level profiling tools like
perf; “friendly” tools might not be telling you the whole story because they hide the lower levels of the system.
Different performance problems require different approaches to profiling and visualizing collected profiles. Make sure to familiarize yourself with a wide spectrum of available tools.
This is sage advice, and his article is an outstanding example of letting a profiler lead your investigations.