Writing a Tree-Sitter Grammar

I recently found a chance to squeeze tree-sitter into a side project of a side project. This post is about how I got to that point and about writing a grammar for a small language.

background

The top-level project is a tool for operating on data definitions. To accomplish this, I'm trying to define an intermediate representation and translate between that and a target format: for example, generating JSON Schema definitions from Rust, or converting an XML schema to Python dataclasses.

Despite working through a few "how to build a compiler" books, most of this is new to me. I've never designed an intermediate representation. However, working through "Writing a C Compiler" introduced me to ASDL (the Zephyr Abstract Syntax Description Language), a compact notation for describing the node types of a tree. I thought it might be a useful tool to help me design and implement an IR. One thing I kept wishing for while working through that book was a tool that took in an ASDL specification and generated all the corresponding structs and enums for me. Faced with the prospect of using ASDL again, I set aside my original project and set out to write that tool.
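To make the idea concrete, here is a rough sketch of the kind of mapping such a tool performs. The ASDL fragment in the comment follows the style used in "Writing a C Compiler"; the Rust types underneath are my guess at a reasonable translation, not the actual output of the tool. Treating `identifier` as `String` and `int` as `i64` are assumptions, as is the `returned_constant` helper, which is only there to show the types in use.

```rust
// ASDL input (illustrative):
//
//   program = Program(function_definition)
//   function_definition = Function(identifier name, statement body)
//   statement = Return(exp)
//   exp = Constant(int)

// Each ASDL type becomes a Rust enum, and each constructor becomes a variant.
#[derive(Debug, Clone, PartialEq)]
pub enum Program {
    Program(FunctionDefinition),
}

// Named ASDL fields map naturally onto struct-like variants.
#[derive(Debug, Clone, PartialEq)]
pub enum FunctionDefinition {
    Function { name: String, body: Statement },
}

#[derive(Debug, Clone, PartialEq)]
pub enum Statement {
    Return(Exp),
}

// Assumption: ASDL's built-in `int` is represented as i64.
#[derive(Debug, Clone, PartialEq)]
pub enum Exp {
    Constant(i64),
}

/// Hypothetical helper: pull the constant out of a `return <constant>` program.
pub fn returned_constant(program: &Program) -> i64 {
    match program {
        Program::Program(FunctionDefinition::Function {
            body: Statement::Return(Exp::Constant(value)),
            ..
        }) => *value,
    }
}
```

Writing these by hand is easy for four lines of ASDL, but it gets tedious and error-prone as the specification grows, which is the case for generating them.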

Those compiler books helped me understand the rough shape of what I needed: convert an ASDL specification to a series of tokens, parse those tokens into an AST, and then convert that AST into Rust code. ASDL is a simple language, so none of the bugs that popped up were too hard to figure out. Even so, I started thinking about what it would take to have a better debugging experience. It would be nice to know what part of the input caused a problem. But I wasn't enthusiastic about breaking out anyhow and thiserror to make things better for myself. I've put work into hand-writing parsers before, and it's just not what I want to spend my time doing.
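As an illustration of the first stage and of what "knowing what part of the input caused a problem" costs, here is a minimal hand-rolled tokenizer that records byte offsets. It is a sketch rather than my actual lexer (which used nom), it covers only a subset of ASDL's punctuation, and every name in it (`TokenKind`, `Token`, `LexError`, `tokenize`) is made up for this example.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum TokenKind {
    Ident(String),
    Equals,
    Pipe,
    LParen,
    RParen,
    Comma,
    Question,
    Star,
}

/// A token plus the byte range it came from, so later stages can report locations.
#[derive(Debug, Clone, PartialEq)]
pub struct Token {
    pub kind: TokenKind,
    pub start: usize,
    pub end: usize,
}

/// The byte offset and character the tokenizer could not handle.
#[derive(Debug, Clone, PartialEq)]
pub struct LexError {
    pub offset: usize,
    pub found: char,
}

pub fn tokenize(input: &str) -> Result<Vec<Token>, LexError> {
    let mut tokens = Vec::new();
    let mut chars = input.char_indices().peekable();

    while let Some(&(start, c)) = chars.peek() {
        if c.is_whitespace() {
            chars.next();
            continue;
        }

        // Identifiers: a letter or underscore, then letters, digits, or underscores.
        if c.is_ascii_alphabetic() || c == '_' {
            let mut end = start;
            while let Some(&(i, ch)) = chars.peek() {
                if ch.is_ascii_alphanumeric() || ch == '_' {
                    end = i + ch.len_utf8();
                    chars.next();
                } else {
                    break;
                }
            }
            let kind = TokenKind::Ident(input[start..end].to_string());
            tokens.push(Token { kind, start, end });
            continue;
        }

        // Single-character punctuation; anything else is an error with its offset.
        let kind = match c {
            '=' => TokenKind::Equals,
            '|' => TokenKind::Pipe,
            '(' => TokenKind::LParen,
            ')' => TokenKind::RParen,
            ',' => TokenKind::Comma,
            '?' => TokenKind::Question,
            '*' => TokenKind::Star,
            other => return Err(LexError { offset: start, found: other }),
        };
        chars.next();
        tokens.push(Token { kind, start, end: start + c.len_utf8() });
    }

    Ok(tokens)
}
```

Even in this small form, the offsets have to be threaded through every token, and the parser and code generator would need to carry them forward as well. That bookkeeping is exactly the work I was hoping a parser generator would take off my hands.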

Which brings us to tree-sitter. I don't have much experience with parser generators, but I figured this would be a good time to learn. Tree-sitter is what I hear about most often these days, so I set about writing a grammar to introduce tree-sitter to ASDL.

tree-sitter-asdl

The process of developing this grammar is more or less the story of me reading tree-sitter's documentation on writing parsers. It helped that I had used nom to write my original lexer, since a tree-sitter grammar is defined in a JavaScript DSL that feels very similar to nom's combinators. Once I had my grammar ready to go, it was just a matter of following the remaining steps in the documentation to publish my crate, tree-sitter-asdl.

further work

I was really impressed with how straightforward it was to add support for a new language in tree-sitter. That said, I still haven't actually used tree-sitter in my ASDL project. I don't know if it will feel different to work with a concrete syntax tree instead of an abstract syntax tree, or if I'd even notice the difference tree-sitter makes with a language as small as ASDL. I also don't know how much I'm really able to control with the grammar, or whether there are ways of defining my rules that make the output any better or worse to work with. These are the things I'll be thinking about as I work tree-sitter into my ASDL tool so I can get back to my schema project.