C++ symbols are mangled? Why?
Linkers only support C identifiers for symbol names. They don’t have any knowledge of C++’s namespaces, monomorphization of a template function, overloaded functions, etc. That means that the C++ compiler needs to generate C identifier compatible symbols for C++ constructs. This process is called “mangling”, the resulting symbol is a “mangled symbol”, and reconstructing the original C++ name is “demangling”.
Let’s take an overloaded function as a concrete example. Say there are two
overloads: one that takes a
char* and another that takes an
overloaded is a valid C identifier that the linker could understand,
we can’t name both versions of the function
overloaded because we need to
differentiate them from each other for the linker.
We can see mangling in action if we run
nm on the object file:
$ nm -j overloaded.o __Z10overloadedPc __Z10overloadedi
There is some prefixed gunk and then the types of parameters got mangled
themselves and appended to the end of the function name:
Pc for a pointer to a
Here is a more complex example:
This time the mangled symbols are a bit less human readable, and its a bit less obvious where some of the mangled bits even came from:
$ nm -j complex.o __Z11instantiatev __ZN3foo3BarIPcE11some_methodEPS2_S3_S3_ __ZN3foo3BarIiE11some_methodEPS1_S2_S2_
Do compilers have to agree on the mangling format?
If they want code compiled from one compiler to inter-operate with code compiled from another compiler (or even different versions of the same compiler), then yes, they need to agree on names.
These days, almost every C++ compiler uses the Itanium C++ ABI’s name mangling rules. The notable exception is MSVC, which uses a completely different format.
We’ve been looking at the Itanium C++ ABI style mangled names, and that’s what
Why demangle C++ symbols in Rust?
A few reasons:
I wanted to demangle some C++ symbols for
addr2lineclone. But I also didn’t want to introduce complexity into the build system for some old C code, nor a dependency on a system library outside of the Rust ecosystem, nor spawn a
Tom Tromey, a GNU hacker and buddy of mine, mentioned that historically, the canonical C++ demangler in
gdb) has had tons of classic C bugs: use-after-free, out-of-bounds array accesses, etc, and that it falls over immediately when faced with a fuzzer. In fact, there were so many of these issues that
gdbwent so far as to install a signal handler to catch
SIGSEGVs during demangling. It “recovered” from the segfaults by
longjmping out of the signal handler and printing a warning message before moving along and pretending that nothing happened. My ears perked up. Those are the exact kinds of things Rust protects us from at compile time! A robust alternative might actually be a boon not just for the Rust community, but everybody who wants to demangle C++ symbols.
Finally, I was looking for something to kill a little bit of free time.
What I didn’t anticipate was how complicated parsing mangled C++ symbols actually is! The second bullet point should have been a hint. I hadn’t looked to deeply into how C++ symbols are mangled, and I expected simple replacement rules. But if that was the case, why would the canonical demangler have had so many bugs?
What makes demangling C++ symbols difficult?
Maybe I’ve oversold how complicated it is a little bit: it isn’t terribly difficult, its just more difficult than I expected it was going to be. That said, here are some things I didn’t expect.
The grammar is pretty huge, and has to allow encoding all of C++’s
expressions, e.g. because of
decltype. And its not just the grammar that’s
huge, the symbols themselves are too. Here is a pretty big mangled C++ symbol
That’s 355 bytes!
And here is its demangled form – even bigger, with a whopping 909 bytes!
decltype(fp.match(fp0.as<mozilla::Tuple<js::NativeObject*, JSObject*, js::CrossCompartmentKey::DebuggerObjectKind> >())) mozilla::detail::VariantImplementation<unsigned char, 3ul, mozilla::Tuple<js::NativeObject*, JSObject*, js::CrossCompartmentKey::DebuggerObjectKind> >::match<decltype(fp(static_cast<JSObject**>(std::nullptr_t))) js::CrossCompartmentKey::applyToWrapped<(anonymous namespace)::NeedsSweepUnbarrieredFunctor>((anonymous namespace)::NeedsSweepUnbarrieredFunctor)::WrappedMatcher&, mozilla::Variant<JSObject*, JSString*, mozilla::Tuple<js::NativeObject*, JSScript*>, mozilla::Tuple<js::NativeObject*, JSObject*, js::CrossCompartmentKey::DebuggerObjectKind> > >((anonymous namespace)::NeedsSweepUnbarrieredFunctor&&, mozilla::Variant<JSObject*, JSString*, mozilla::Tuple<js::NativeObject*, JSScript*>, mozilla::Tuple<js::NativeObject*, JSObject*, js::CrossCompartmentKey::DebuggerObjectKind> >&)
The mangled symbol is smaller than the demangled form because the name mangling algorithm also specifies a symbol compression algorithm:
To minimize the length of external names, we use two mechanisms, a substitution encoding to eliminate repetition of name components, and abbreviations for certain common names. Each non-terminal in the grammar above for which
<substitution>appears on the right-hand side is both a source of future substitutions and a candidate for being substituted.
After the first appearance of a type in the mangled symbol, all subsequent
references to it must actually be back references pointing to the first
appearance. The back references are encoded as
S i _ where
i is the i + 1th substitutable AST node in an in-order walk of
the AST (and
S_ is the 0th).
We have to be careful not to mess up the order in which we populate the
substitutions table. If we got it wrong, then all the back references will point
to the wrong component, and the final output of demangling will be
cpp_demangle uses a handwritten recursive descent parser. This makes
it easy to build the substitutions table as we parse the mangled symbol,
avoiding a second back reference resolution pass over the AST after
parsing. Except there’s a catch: the grammar contains left-recursion, which will
cause naive recursive descent parsers to infinitely recurse and blow the
stack. Can we transform the grammar to remove the left-recursion?
Nope. Although we would parse the exact same set of mangled symbols, the
resulting parse tree would be different, invalidating our substitutions table
and back references! Instead, we resign ourselves to peeking ahead a character
and switching on it.
The compression also means that the demangled symbol can be exponentially larger than the mangled symbol in the worst case! Here is an example that makes it plain:
lets_get_exponential symbol is 91 bytes long:
lets_get_exponential symbol is 5902 bytes long; too big to
include here, but you can run that symbol through
c++filt to verify it
The final thing to watch out for with compression and back references is malicious input that creates cycles with the references. If we aren’t checking for and defending against cycles, we could hang indefinitely in the best case, or more likely infinitely recurse and blow the stack.
What’s the implementation status?
cpp_demangle crate can parse every mangled C++ symbol I’ve thrown at it,
and it can demangle them too. However, it doesn’t always match
demangler’s formatting character-for-character. I’m currently in the process of
getting all of
libiberty’s C++ demangling tests passing.
Additionally, I’ve been running American Fuzzy Lop (with afl.rs) on
cpp_demangle overnight. It found a panic involving unhandled integer overflow,
which I fixed. Since then, AFL hasn’t triggered any panics, and its never been
able to find a crash (thanks Rust!) so I think
cpp_demangle is fairly solid
If you aren’t too particular about formatting, then
cpp_demangle is ready for
you right now. If you do care more about formatting, it will be ready for you
- Finish getting all the tests in
libiberty’s test suite passing, so
cpp_demangleis character-for-character identical with
libiberty’s C++ demangler.
- Add a C API, so that non-Rust projects can use
- Add benchmarks and compare performance with
libiberty’s C++ demangler. Make sure
cpp_demangleis faster ;-)
- A couple other things listed in the github issues.
That’s it! See you later, and happy demangling!
Thanks to Tom Tromey for sharing GNU war stories with me, and providing feedback on an early draft of this blog post.