Beyond Source Maps

There's been some recent talk on es-discuss about standardizing source maps in ECMAScript 7. Before that happens, we should take a moment to reflect on what source maps have done well, where they are lacking, and meditate on what a more perfect debug format for compilers targeting JavaScript might be. My response quickly outgrew an email reply, and so I am collecting my thoughts and posting them here.

Before diving in, it makes sense to tell you where I am coming from. I implemented the mozilla/source-map library and the source map support in the Firefox Developer Tools' debugger. I'm a coauthor of the document specifying source maps, although the actual design of the format was done by John Lenz. I mostly documented and polished the parts that weren't about the format itself, like how to find a script's source map. On top of that, I've debugged scripts with source maps compiled from browserify, the r.js AMD module optimizer, CoffeeScript, ClojureScript, Emscripten, UglifyJS, Closure Compiler, and more.

When do source maps work well?

Source maps work pretty well for:

Most kinds of JavaScript to JavaScript translation: minification, module concatenation, and pretty printing.
Light syntactic sugar sprinkled over JavaScript: CoffeeScript, TypeScript, etc.
CSS preprocessors and transpilers. From what I can tell, this is the most prevalent use of source maps, and what they've ended up being best at!

When do source maps fall short?

Whenever you start doing "real compilation" instead of "transpiling" and your source language's semantics don't match JavaScript's semantics, source maps break down:

When the source language's types don't correspond one to one to JavaScript types. For example ClojureScript's Map isn't a JavaScript Object nor a JavaScript Map, and a C++ char * isn't a JavaScript String. The result is that the debugger user is inspecting the compiler's implementation of the data type rather than the source language data type.
When a source language's compiler does not compile source language variables into JavaScript variables, or when the source environment doesn't correspond exactly with the JavaScript environment. This can be because:
1. Variables are stored on a "heap" typed array instead of in JavaScript variables (eg, asm.js).
2. Scoping or naming information is lost in compilation (eg, regenerator and ClojureScript).
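To make the first case concrete, here is a hand-written imitation of asm.js-style output. It is only a sketch: the name HEAP32, the stack pointer value, and the C-like source in the comment are all assumptions made up for illustration, not the output of any real compiler.

```javascript
// A hand-written imitation of asm.js-style output; not real compiler output.
// The C-like source might have been:  int x = 6; int y = 7; int z = x * y;
const HEAP32 = new Int32Array(16);

function compiledMain() {
  const sp = 4;        // made-up stack pointer, measured in 32-bit words
  HEAP32[sp + 0] = 6;  // x
  HEAP32[sp + 1] = 7;  // y
  HEAP32[sp + 2] = Math.imul(HEAP32[sp + 0], HEAP32[sp + 1]); // z
  return HEAP32[sp + 2];
}

console.log(compiledMain()); // 42
```

A JavaScript debugger paused inside compiledMain can show HEAP32 and sp, but it has no way of knowing that three slots in that array are the source program's x, y, and z.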

A stepping debugger isn't useful without the ability to inspect bindings and values in the environment, and the above issues create humungous hurdles for doing that!

Moreover, using the console as a REPL, adding a watch expression, or setting a new conditional breakpoint must be done with JavaScript, not the source language.

SourceMap.next?

Requirements

An ideal SourceMap.next format should support the existing use of source maps:

It should provide source language location debugging information for the JavaScript object code.
It should list the set of source language files that were compiled to create the JavaScript object code.
It should support debugging dynamically compiled sources which don't live at a remote URL, like those created in the "Try CoffeeScript" online editor.
The format should be compact and space-saving so that the network isn't a bottleneck when debugging remote targets.
Likewise, it should be available as an auxiliary file to the JavaScript object code it provides debugging information for so that there is no overhead when the code is not being debugged.

Additionally, any SourceMap.next format should support debugging JavaScript object code that the existing source map format fails to satisfy:

It should provide a way to list bindings that are in scope in the current environment (whether or not those bindings are implemented as JavaScript variables).
It should provide a way for the JavaScript debugger to display values in a meaningful way.
It should optionally provide an eval capability, for use from a REPL, watch expression, or conditional breakpoint.
The format should be future-extensible. That is, when SourceMap.next v2 rolls out, any SourceMap.next v1 consumer should still be able to parse and use instances of SourceMap.next v2 (although without the new features, of course). DWARF did a great job solving this problem with its Debugging Information Entries.

A Kernel of a Solution

The ideas I present here aren't fully baked yet, but I have confidence in the approach.

Because a SourceMap.next must do more than translate file, line, and column locations, it no longer makes sense to build the format on location mappings. Instead it should annotate the abstract syntax tree of the JavaScript object code with debugging information. Rather than attaching debugging information to a line and column location in the JavaScript object code, which in turn can only be approximately translated into a specific JavaScript statement or expression, annotating the AST attaches debugging information directly to a specific JavaScript statement or expression. It cuts out the middleman while simultaneously simplifying the architecture. Many existing tools are already using this approach: they create a JavaScript AST, annotate it with source language locations, and use escodegen to generate both the JavaScript object code and a source map!

Annotating the AST enables us to take advantage of the hierarchy present in the tree. If you are searching for a specific annotation and the current AST node does not have it, recurse on the node's parent. This makes it easy to provide very detailed information or a less fine-grained summary. For example, in the following AST, rather than requiring source location annotations for each AssignmentExpression, Identifier, BinaryExpression, and Literal AST node, one could simply add the annotation to the AssignmentExpression node. Then, any queries for source location information on any of the other nodes would bubble up to the AssignmentExpression and yield its annotated source location information.

AssignmentExpression
  operator: "="
  left: Identifier
    name: "answer"
  right: BinaryExpression
    operator: "*"
    left: Literal
      value: 6
    right: Literal
      value: 7
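
The bubbling lookup could be sketched roughly as follows. This assumes a hypothetical node shape in which every node has a parent pointer and an optional annotations object; neither property belongs to any existing AST standard, and the helper names linkParents and findAnnotation and the location values are made up for the example.

```javascript
// Hypothetical node shape: { type, parent, annotations, ...children }.

// Link each child node back to its parent so lookups can walk upward.
// Only the `left` and `right` child slots used by this example are handled.
function linkParents(node, parent = null) {
  node.parent = parent;
  for (const key of ["left", "right"]) {
    if (node[key] && typeof node[key] === "object") {
      linkParents(node[key], node);
    }
  }
  return node;
}

// Return the nearest annotation with the given name, starting at `node`
// and bubbling up through its ancestors. Returns undefined if none has it.
function findAnnotation(node, name) {
  for (let current = node; current; current = current.parent) {
    const annotations = current.annotations;
    if (annotations && name in annotations) {
      return annotations[name];
    }
  }
  return undefined;
}

// The tree shown above, annotated only on the AssignmentExpression.
const assignment = linkParents({
  type: "AssignmentExpression",
  operator: "=",
  annotations: { source: 0, line: 12, column: 4 }, // made-up location
  left: { type: "Identifier", name: "answer" },
  right: {
    type: "BinaryExpression",
    operator: "*",
    left: { type: "Literal", value: 6 },
    right: { type: "Literal", value: 7 },
  },
});

const seven = assignment.right.right;
console.log(findAnnotation(seven, "line")); // 12
console.log(findAnnotation(seven, "eval")); // undefined
```

A producer that wants finer detail can annotate the Literal nodes too; the same lookup then stops earlier without any change to the consumer.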

Here are a few debugging annotations that should cover the requirements listed above:

sourceList: An array of sources that were compiled to create this JavaScript object code. This annotation would only be valid on the root AST node. A source would be either a URL pointing to a source file, or the inlined source text, display name, and content type.
source: The source (as an index into the sourceList) from which this AST node was compiled.
line: The line number in the source file that corresponds to this AST node.
column: The column number in the source file that corresponds to this AST node.
getLocals: A JavaScript function (or URL pointing to a function) to get all local variables introduced by this AST node. You can walk up the tree to the root to create the lexical environment chain.
getDisplayValue: A JavaScript function (or URL pointing to a function) that, given a source language value, returns a JavaScript value for display and inspection in the debugger. For example, a ClojureScript Map's display value might be a JavaScript Object because JavaScript debuggers already know how to provide a reasonable view of Objects that would work fine for a ClojureScript Map even though it wouldn't be the best way to implement a ClojureScript Map.
eval: A JavaScript function (or URL pointing to a function) to eval arbitrary snippets of the source language. It would accept an optional parameter indicating which frame the debugger is paused in, if any.
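
As a rough sketch of how a debugger might consume getLocals and getDisplayValue, consider the following. Everything here is an assumption for illustration: the frame object passed to getLocals, the heap layout, the file name demo.c, and the tagged PersistentMap representation are all invented, since the proposal does not pin down any of these details.

```javascript
// Hypothetical annotated nodes. `getLocals` receives whatever the debugger
// exposes for the paused frame; here that is a plain object wrapping a
// typed array that stands in for an asm.js-style heap.
const root = {
  type: "Program",
  parent: null,
  annotations: {
    sourceList: ["demo.c"], // made-up source file name
    getLocals: (frame) => ({ argc: frame.heap[0] }),
  },
};

const body = {
  type: "BlockStatement",
  parent: root,
  annotations: {
    getLocals: (frame) => ({ total: frame.heap[1] }),
  },
};

const statement = { type: "ExpressionStatement", parent: body };

// Walk from the paused node up to the root, collecting one scope per
// node that carries a getLocals annotation. Innermost scope comes first.
function environmentChain(node, frame) {
  const chain = [];
  for (let current = node; current; current = current.parent) {
    const getLocals = current.annotations && current.annotations.getLocals;
    if (typeof getLocals === "function") {
      chain.push(getLocals(frame));
    }
  }
  return chain;
}

// Hypothetical compiler representation of a source language map:
// a tagged object holding a flat array of alternating keys and values.
function getDisplayValue(value) {
  if (value && value.tag === "PersistentMap") {
    const display = {};
    for (let i = 0; i < value.entries.length; i += 2) {
      display[value.entries[i]] = value.entries[i + 1];
    }
    return display;
  }
  return value; // anything else is shown as-is
}

const pausedFrame = { heap: new Int32Array([2, 40]) };
const chain = environmentChain(statement, pausedFrame);
console.log(chain); // [ { total: 40 }, { argc: 2 } ]

const compiledMap = { tag: "PersistentMap", entries: ["a", 1, "b", 2] };
console.log(getDisplayValue(compiledMap)); // { a: 1, b: 2 }
```

The debugger never needs to understand the heap layout or the map implementation itself; it only calls the functions the compiler supplied.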

"Solving the SourceMap.next problem once and for all" reduces to defining a compact binary encoding of JavaScript's AST that supports an arbitrary number of arbitrary debugging annotations on each node. The beauty of this approach is that as long as we take care that this format is future-extensible with new types of debugging annotations, we can resolve any future issues without needing to update the binary format. We just add new debugging annotations and keep the old annotations for legacy consumers until they die off. The only time the binary format changes is when a new version of the ECMAScript standard is released and defines new syntactic forms that the AST must incorporate.

The final affordance of having an extensible debugging format is that compilers which target JavaScript can progressively add more debugging information over time. Since specific annotations are optional, the compiler can simply skip those for which it can't provide information. When a later version of the compiler can provide that information, it can start adding those annotations.
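
On the consumer side, this kind of extensibility might look something like the sketch below, which assumes annotations arrive as a plain object per node. The set of known names, the readAnnotations helper, and the inlinedFrom annotation are all hypothetical.

```javascript
// Annotation names this hypothetical v1 consumer understands.
const KNOWN_V1 = new Set(["source", "line", "column"]);

// Keep the annotations we understand and silently skip the rest, so that
// nodes written by a newer producer still yield usable information.
// Annotations the producer omitted are simply absent from the result.
function readAnnotations(rawAnnotations) {
  const understood = {};
  for (const [name, value] of Object.entries(rawAnnotations)) {
    if (KNOWN_V1.has(name)) {
      understood[name] = value;
    }
  }
  return understood;
}

// From an imagined v2 producer; `inlinedFrom` is a made-up annotation.
const fromNewerProducer = { source: 0, line: 3, column: 10, inlinedFrom: "helper" };
console.log(readAnnotations(fromNewerProducer)); // { source: 0, line: 3, column: 10 }

// From an older compiler that can only provide line numbers so far.
const fromOlderCompiler = { line: 3 };
console.log(readAnnotations(fromOlderCompiler)); // { line: 3 }
```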