Speed Without Wizardry

Vyacheslav Egorov, who goes by mraleph on the Web, wrote a response to my article “Oxidizing Source Maps with Rust and WebAssembly” titled “Maybe you don’t need Rust and WASM to speed up your JS”.

The “Oxidizing” article recounts my experience integrating Rust (compiled to WebAssembly) into the source-map JavaScript library. Although the JavaScript implementation was originally authored in idiomatic JavaScript style, as we profiled and implemented speed improvements, the code became hard to read and maintain. With Rust and its zero-cost abstractions, we found that there was no trade-off between performance and clean code.

mraleph is an established expert on JavaScript performance. He specializes in Just-In-Time (JIT) compilers, and has contributed to Google’s V8 JavaScript engine. Guided by profiling the JIT’s internals and emitted code, he ultimately brought the JavaScript implementation’s performance on par with the Rust and WebAssembly implementation:

Here is where we started:

$ d8 bench-shell-bindings.js
...
[Stats samples: 5, total: 24050 ms, mean: 4810 ms, stddev: 155.91063145276527 ms]
$ sm bench-shell-bindings.js
...
[Stats samples: 7, total: 22925 ms, mean: 3275 ms, stddev: 269.5999093306804 ms]

and this is where we are finishing

$ d8 bench-shell-bindings.js
...
[Stats samples: 22, total: 25158 ms, mean: 1143.5454545454545 ms, stddev: 16.59358125226469 ms]
$ sm bench-shell-bindings.js
...
[Stats samples: 31, total: 25247 ms, mean: 814.4193548387096 ms, stddev: 5.591064299397745 ms]

This is a factor of 4 improvement!

The optimizations he implemented roughly fall into three buckets:

Algorithmic improvements.
Avoiding allocations to reduce garbage collections.
Staying on the JIT’s happy paths to gain compiler optimizations, and subsequently avoiding falling off performance cliffs.

His article is well-measured and a great read. I particularly like it for three reasons:

It is educational to peek into his profiling process, and he shares JavaScript performance techniques that are just plain fun to read about.
The opportunities for algorithmic improvements that he noticed helped speed up the Rust and WebAssembly implementation another 3x over what we had previously seen. Thanks to his suggestions, the current version of the source-map library is now 5.3x faster than the original JavaScript in Chrome, 10.8x faster in Firefox, and 9.4x faster in Safari.
He perfectly demonstrates one of the points my “Oxidizing” article was making: with Rust and WebAssembly we have reliable performance without the wizard-level shenanigans that are required to get the same performance in JavaScript.

Before continuing, I suggest that you go back and read the two articles if you haven’t already.

Staying on the JIT’s Happy Paths

First, mraleph noticed that some sorting comparator functions had an arity of three, the last parameter being optional, but were only ever actually invoked with two arguments in hot code paths. In V8, this arity mismatch results in an extra trampoline function in the emitted machine code, and fixing the arity improved performance:

Just by fixing the arity mismatch we improved benchmark mean on V8 by 14% from 4774 ms to 4123 ms. If we profile the benchmark again we will discover that ArgumentsAdaptorTrampoline has completely disappeared from it. Why was it there in the first place?

It turns out that ArgumentsAdaptorTrampoline is V8’s mechanism for coping with JavaScript’s variadic calling convention: you can call function that has 3 parameters with 2 arguments - in which case the third parameter will be filled with undefined. V8 does this by creating a new frame on the stack, copying arguments down and then invoking the target function.

Impressive results for such a tiny change! But that’s also part of the problem: the reasons for JavaScript’s good or bad performance don’t readily reveal themselves to readers. Worse yet, what is fine in one engine’s JIT might be a performance cliff in another engine’s JIT. This is the case with this arity mismatch, as mraleph notes:

For what it is worth argument adaptation overhead seems to be highly V8 specific. When I benchmark my change against SpiderMonkey, I don’t see any significant performance improvement from matching the arity.

Next, mraleph noticed that the generic sorting routine was not being monomorphized, and the comparator functions passed to it were not being inlined. To mold the JavaScript code into something that would match the JIT’s heuristics for monomorphization and inlining, he stringified the functions to get their source text and then re-evaluated that source text with new Function(...). This causes the JavaScript engine to create distinct functions with their own type inference records. Therefore, from the JIT’s point of view, the types passed to the new functions are not intermingled with those passed to the originals, and the JIT will monomorphize them in the machine code it emits.

Stringifying a function to get its source text and re-evaluating it is a very clever trick, but it is not idiomatic JavaScript style. This introduces more convolutions to the codebase for the sake of performance, and maintainability suffers. Furthermore, although the code might match this JIT’s heuristics for inlining and monomorphization today, those heuristics may change in the future, and other JITs may use different heuristics altogether.

With Rust and WebAssembly, we have reliable speed without clever tricks.

WebAssembly is designed to perform well without relying on heuristic-based optimizations, avoiding the performance cliffs that come if code doesn’t meet those heuristics. It is expected that the compiler emitting the WebAssembly (in this case rustc and LLVM) already has sophisticated optimization infrastructure, that the engine is receiving WebAssembly code that has already had optimization passes applied, and that the WebAssembly is close to its final form.

Rust lets us talk about monomorphization and inlining directly, without circuitous incantations. We can annotate functions with #[inline], #[inline(always)], and #[inline(never)] attributes to explicitly communicate our inlining suggestions to the compiler. With generic functions, we can choose between monomorphization via type parameters and dynamic virtual calls via trait objects.

Avoiding Allocation and Reducing Garbage Collection

The original JavaScript implementation parsed each mapping into a JavaScript Object, which must be allocated by the garbage collector. To reduce GC pressure during parsing, mraleph modified the parser so that, rather than allocating Objects, it uses a linear typed array buffer and a pointer into that buffer as a bump allocator. It writes parsed mappings into the typed array, doubling the typed array’s size whenever it fills up.

Effectively, this technique is like writing C code in JavaScript, but without even the abstractions that C provides. We can’t define any structs because that implies a GC allocation in JavaScript. In some cases, JITs can optimize away such allocations, but (once again) that depends on unreliable heuristics, and JIT engines vary in their effectiveness at removing the allocations.

Since we can’t define any structs to name the records and their members in the typed array, we must manually write, in mraleph’s words, “verbose and error prone code” to read and write memory[pointer + static_offset_of_member]:

exports.compareByOriginalPositionsNoSource =
function (memory, mappingA, mappingB, onlyCompareOriginal) {
  var cmp = memory[mappingA + 3] - memory[mappingB + 3];  // originalLine
  if (cmp !== 0) {
    return cmp;
  }

  cmp = memory[mappingA + 4] - memory[mappingB + 4];  // originalColumn
  if (cmp !== 0 || onlyCompareOriginal) {
    return cmp;
  }

  cmp = memory[mappingA + 1] - memory[mappingB + 1];  // generatedColumn
  if (cmp !== 0) {
    return cmp;
  }

  cmp = memory[mappingA + 0] - memory[mappingB + 0];  // generatedLine
  if (cmp !== 0) {
    return cmp;
  }

  return memory[mappingA + 5] - memory[mappingB + 5];  // name
};

In contrast to JavaScript, safely avoiding garbage collection is Rust’s bread and butter. Not only can we name a struct without boxing it or requiring a GC, we don’t have to sacrifice any of Rust’s abstractions to do so. In fact, Rust’s ownership, lifetimes, and borrowing let us comfortably avoid allocations and copies in ways that would cause severe headaches even in C or C++.

Algorithmic Improvements

The final group of optimizations that mraleph implemented included two algorithmic improvements to avoid sorting all mappings at once, and only sort subsequences of mappings at a time.

First, mraleph noticed that the encoding for mappings means that they are already sorted by generated line number as we parse them. It follows that, if we wish to sort mappings by generated location (i.e. generated line and column, breaking ties with original location), we only need to sort within each generated line’s subsequence to sort the entire sequence by generated location.

Second, when sorting by original location, we can bucket mappings by their source file, and then we only need to sort one source file’s bucket at a time. Furthermore, we can lazily sort each source file’s bucket.

These algorithmic improvements aren’t providing fundamentally smaller big-O running times. In the worst case, there is only one generated line and only one original source file, which means that the subsequences we sort are actually the full sequence in both cases. And this isn’t a terribly rare worst case either: if you minify a single JavaScript file, you’ll likely create this scenario. But in the large Scala.js source map that is an input to the benchmark, and I suspect many other source maps found in the wild, both of these optimizations pay off.

I implemented both subsequence sorting optimizations for the current Rust and WebAssembly version of the source-map library, and then measured the “setting a breakpoint for the first time” benchmark once more, using the same methods and setup as in the “Oxidizing” article. Below are the results; recall that lower values are better.

Speed improved another ~3x over what we had previously seen. The current version of the source-map library is now 5.3x faster than the original JavaScript in Chrome, 10.8x faster in Firefox, and 9.4x faster in Safari!

(mraleph did not publish his complete set of changes, so I could not directly compare his improved JavaScript implementation’s speeds in this graph.)

Conclusion

Most of the improvements that mraleph implemented are desirable regardless of the programming language that is our medium. Excessive allocation rates make any garbage collector (or malloc and free implementation) a bottleneck. Monomorphization and inlining are crucial to eking out performance in both Rust and JavaScript. Algorithms transcend programming languages.

But a distinction between JavaScript and Rust+WebAssembly emerges when we consider the effort required to attain inlining and monomorphization, or to avoid allocations. Rust lets us explicitly state our desires to the compiler, we can rely on the optimizations occurring, and Rust’s natural idioms guide us towards fast code, so we don’t have to be performance wizards to get fast code. In JavaScript, on the other hand, we must communicate with oblique incantations to match each JavaScript engine’s JIT’s heuristics.

mraleph concludes with a bit of sound engineering advice:

Obviously each developer and each team are free to choose between spending N rigorous hours profiling and reading and thinking about their JavaScript code, or to spend M hours rewriting their stuff in a language X.

I couldn’t agree more! It is important to remember that engineering choices have trade-offs, and it is equally important to be cognizant of the choices in the first place.

We chose to rewrite a portion of our library in Rust and WebAssembly not just for the speed ups, although those are certainly nice, but also for maintainability and to get rid of the kludges added to gain JIT optimizations. We wanted to return to clean code and clearly expressed intent. It has been a great success.

Rewriting the whole library from scratch would not have been tenable either. We avoided that because Rust’s small runtime and freedom from a garbage collector mean that incremental adoption is both possible and practical. We were able to surgically replace just the hottest code paths with Rust and WebAssembly, leaving the rest of the library’s code in place.

Finally, I’d like to re-broadcast mraleph’s points about profiling:

Profiling and fine grained performance tracking in various shapes and forms is the best way to stay on top of the performance. It allows you to localize hot-spots in your code and also reveals potential issues in the underlying runtime. For this particular reason don’t shy away from using low-level profiling tools like perf - “friendly” tools might not be telling you the whole story because they hide lower level.

Different performance problems require different approaches to profiling and visualizing collected profiles. Make sure to familiarize yourself with a wide spectrum of available tools.

This is sage advice, and his article is an outstanding example of letting a profiler lead your investigations.