How Fuzzy are Your Fuzzers?

As long as a fuzzer is uncovering a steady stream of bugs, we can have confidence it’s serving its purpose. But a silent fuzzer is harder to interpret: is our program finally free of bugs, or is the fuzzer simply unable to reach the code in which they are hidden?

Code coverage reports can help here: we can manually check which functions and blocks of code the fuzzer has executed. We can see which code we wanted or expected to be covered but isn’t, and then figure out ways to help the fuzzer explore that code. We implement those changes, run the fuzzer again, and check the coverage reports again to verify that our changes had the desired effect.

But how can we be sure that the fuzzer will continue exercising these code paths — especially in evolving code bases with many developers collaborating together? Imagine this scenario: we have a generator that creates test cases that are guaranteed to be syntactically correct, but aren’t guaranteed to type check even if they do in practice 99% of the time. Therefore, our try-and-compile-the-input fuzz target intentionally ignores type errors so it can skip to the next probably-well-typed input, hoping that compiling that next input will trigger an internal compiler assertion or find some other bug. However, some change in one of the generator’s dependencies perturbed the generator so that now it only generates ill-typed programs. After this change, the fuzzer will never exercise our compiler’s mid-end optimizations and backend code generation because it always bounces off the type checker. This is a huge reduction in code exercised by the fuzzer and nothing alerted us to this regression!¹

Manually checking coverage reports every week, month, or whenever you happen to remember is tedious. Even worse, if we accidentally introduce a coverage regression, we won’t catch that until the next manual review. What if we unknowingly cut a release during one of these periods? We could ship bugs that we would otherwise have caught — not good!

This isn’t a hypothetical scenario. We’ve been bitten by it in the Wasmtime project, as detailed in the following quote from one of our security advisories:

This bug was discovered when we discovered that Wasmtime’s fuzz target for exercising GC and stack maps, table_ops, was mistakenly not performing any actual work, and hadn’t been for some time now. This meant that while the fuzzer was reporting success it wasn’t actually doing anything substantive. After the fuzz target was fixed to exercise what it was meant to, it quickly found this issue.

Catching bugs early, before you release them, is much preferable to the alternative! Exposing users to bugs isn’t good and writing security advisories and patches isn’t fun.

A more robust solution than periodic coverage reviews is to manually instrument your code with counters or other metrics and then write tests that run your fuzz target N times and assert that your metrics match your expectations within some margin of error. This technique is really low effort and high reward for fuzz targets that are specifically designed to exercise one corner of your system. Even just a single counter or boolean flag can provide lots of value!
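To make this concrete, here is a minimal sketch of the idea applied to the compiler scenario above. Everything in it is hypothetical and only for illustration: `compile` is a toy stand-in for the system under test, `Outcome` is the one metric we track, and `XorShift` is a tiny hand-rolled PRNG so that the sketch needs no external crates. The test asserts that a healthy fraction of generated inputs gets past the "type checker", with a generous margin of error.

```rust
/// How far a single input got through our (hypothetical) pipeline.
#[derive(Debug, PartialEq)]
enum Outcome {
    /// Rejected by the "type checker"; the fuzz target ignores these.
    IllTyped,
    /// Made it past the front end and exercised the "backend".
    Compiled,
}

/// Toy stand-in for the system under test. An input "type checks"
/// unless it is empty or its first byte is a multiple of ten.
fn compile(input: &[u8]) -> Outcome {
    match input.first() {
        Some(b) if b % 10 != 0 => Outcome::Compiled,
        _ => Outcome::IllTyped,
    }
}

/// Tiny deterministic xorshift64 PRNG (seed must be non-zero).
struct XorShift(u64);

impl XorShift {
    fn next_u64(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }

    fn fill_bytes(&mut self, buf: &mut [u8]) {
        for b in buf.iter_mut() {
            *b = (self.next_u64() >> 32) as u8;
        }
    }
}

/// Runs the toy fuzz target `iters` times and returns the fraction
/// of inputs that got past type checking.
fn compiled_ratio(seed: u64, iters: u32) -> f64 {
    let mut rng = XorShift(seed);
    let mut buf = [0u8; 16];
    let mut compiled = 0u32;
    for _ in 0..iters {
        rng.fill_bytes(&mut buf);
        if compile(&buf) == Outcome::Compiled {
            compiled += 1;
        }
    }
    f64::from(compiled) / f64::from(iters)
}

#[test]
fn most_inputs_reach_the_backend() {
    // We expect roughly 90% here; the low threshold is the
    // "margin of error" that keeps the test from being flaky.
    let ratio = compiled_ratio(0x5EED, 1000);
    assert!(ratio >= 0.5, "only {ratio:.2} of inputs type checked");
}
```

If a change to the generator made every input ill-typed, `compiled_ratio` would drop to zero and this test would fail right away, instead of the regression waiting for the next manual coverage review.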

For example, I wrote a Wasmtime fuzz target to test our backtrace capturing functionality. The fuzz target is composed of two parts:

A generator that creates pseudo-random Wasm programs consisting of a bunch of functions that arbitrarily call each other, all while dynamically maintaining a shadow stack of function activations that always reflects actual execution.
An oracle that takes these generated test cases, runs them in Wasmtime, and asserts that when we capture an actual backtrace in Wasmtime, it matches the generated program’s shadow stack of activations. Crucially, the oracle also returns the length of the deepest backtrace that it captured.

Now, we need to make sure that this fuzz target is actually exercising what we want it to, and isn’t going off the rails by, for example, returning early from the first function every time and therefore never exercising stack capture with many frames on the stack. To do this, I wrote a regular test that generates random buffers of data with an RNG, generates test cases from that random data, runs our oracle on those test cases, and asserts that we capture a stack trace at least ten frames deep within a bounded number of iterations. Easy!

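// `Stacks` (the generator) and `check_stacks` (the oracle) come
// from the project's own fuzzing code and must also be in scope.
use arbitrary::{Arbitrary, Unstructured};
use rand::{rngs::SmallRng, RngCore, SeedableRng};

// We should quickly capture a stack at least this deep. We
// consider this deep enough to be a "non-trivial" stack.
const TARGET_STACK_DEPTH: usize = 10;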

#[test]
fn stacks_smoke_test() {
    // Use a fixed seed so the corpus of generated test cases
    // is deterministic.
    let mut rng = SmallRng::seed_from_u64(0);
    let mut buf = vec![0; 2048];

    for _ in 0..1024 {
        rng.fill_bytes(&mut buf);

        // Generate a new `Stacks` test case from the raw
        // data, using the `arbitrary` crate.
        let u = Unstructured::new(&buf);
        if let Ok(stacks) = Stacks::arbitrary_take_rest(u) {
            // Run the test case through our `check_stacks`
            // oracle.
            let max_stack_depth = check_stacks(stacks);

            // If we reached our target stack depth, then we
            // passed the test!
            if max_stack_depth >= TARGET_STACK_DEPTH {
                return;
            }
        }
    }

    panic!(
        "never generated a `Stacks` test case that reached \
        {TARGET_STACK_DEPTH} deep stack frames",
    );
}

Now we know that we won’t accidentally make a change that silently reduces this fuzz target to only capturing stack traces of depth one. If we tried to make that change, this test would fail, alerting us to the problem.

Of course, this technique isn’t a silver bullet. For more general fuzz targets that are testing basically the whole system, rather than a specific feature, there isn’t a single counter or metric to rely on. Some code paths might take a while to be discovered by the fuzzer, longer than you’d want to wait for in a unit test, even if they should be found eventually. But maybe there are a few counters you can implement as low-hanging fruit and get 80% of the benefits for 20% of the effort?

Finally, that earlier quote from one of our Wasmtime security advisories ends with the following:

Further testing has been added to this fuzz target to ensure that in the future we’ll detect if it’s failing to exercise GC.

We are confident we’ll detect if that fuzz target starts failing to exercise garbage collection because now we count how many garbage collections are triggered in each iteration of the fuzz target, and assert that we trigger at least one garbage collection within a small number of iterations. Simple and easy to implement, but we’ll never have that particular whoops-we-never-triggered-a-GC-in-this-fuzz-target-designed-to-exercise-the-GC egg on our faces again!
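The shape of such a check can be sketched as follows. This is not Wasmtime's actual code: `Heap`, `run_one`, and `first_iteration_with_gc` are hypothetical names, the toy heap simply collects whenever it is full, and the growing all-ones inputs are a deterministic stand-in for RNG-generated test cases.

```rust
/// Toy heap that collects whenever it is full, and counts how
/// often that happens. The counter is our only instrumentation.
struct Heap {
    live: usize,
    capacity: usize,
    gc_count: usize,
}

impl Heap {
    fn new(capacity: usize) -> Self {
        Heap { live: 0, capacity, gc_count: 0 }
    }

    fn alloc(&mut self) {
        if self.live == self.capacity {
            self.gc();
        }
        self.live += 1;
    }

    fn gc(&mut self) {
        self.gc_count += 1;
        // Pretend that every object was garbage.
        self.live = 0;
    }
}

/// One iteration of a hypothetical fuzz target: every odd input
/// byte allocates an object. Returns the number of collections
/// that this iteration triggered.
fn run_one(input: &[u8]) -> usize {
    let mut heap = Heap::new(4);
    for &byte in input {
        if byte % 2 == 1 {
            heap.alloc();
        }
    }
    heap.gc_count
}

/// Index of the first input, among the first `max_iters`, that
/// triggered at least one collection.
fn first_iteration_with_gc<I>(inputs: I, max_iters: usize) -> Option<usize>
where
    I: IntoIterator<Item = Vec<u8>>,
{
    inputs
        .into_iter()
        .take(max_iters)
        .position(|input| run_one(&input) > 0)
}

#[test]
fn gc_smoke_test() {
    // Stand-in for random inputs: iteration `len` allocates
    // `len` objects.
    let inputs = (0usize..).map(|len| vec![1u8; len]);
    assert!(
        first_iteration_with_gc(inputs, 32).is_some(),
        "never triggered a GC within 32 iterations",
    );
}
```

Returning the count from each iteration, rather than reading a global counter, keeps the check deterministic even when the test harness runs tests in parallel.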

Many thanks to Alex Crichton, Chris Fallin, and Jim Blandy for reading drafts of this blog post and providing valuable feedback!

¹ As an aside, a super neat feature for OSS-Fuzz to grow would be automatically filing an issue whenever a fuzz target’s coverage drops dramatically after pulling the latest code from upstream, or something like that.