Addenda
In between part one and part two of this write up, the AST undergoes a few transformation; these transformations are not, in the most platonic sense, strictly necessary for a compiler. It is perfectly plausible to jump from surface AST to machine code, it is also, as we say in the biz, very hard. One of the compiler design tips which I think I internalized the most through this project is that compilation is first and foremost a series of string manipulation passes that allow us to reason about a program in increasingly more primitive ways I’ve personally taken to calling this programmatic deterritorialization, a term which I am sure will impress all and any compiler engineer who moonlights as a postmodern philosopher. In pursuit of a more pure core language to compile, we introduce the following transformations: Special forms are primitives which we privilege early so that we can break them down into constituent parts later, desugaring is the process of breaking up some of those special forms, scoping resolves human named variables to unique numeric variables, lowering introduces the core language which massively simplifies the code generation process.
Special Forms
The set of special forms we support is fixed at compile time and lives in an enum:
// ast.hpp
;
The parser maps the corresponding source strings to those enum values with a lookup table:
// parser.hpp
const std::unordered_map<std::string, Keyword> kwords = ;
When the lexer produces an atom token the parser first tries to parse it as a
number, then a boolean literal, then checks the keyword table. If a keyword is
found the symbol carries a Keyword tag instead of a raw string.
After the initial S-expression tree is built a second pass called
resolve_forms walks it and handles each special form:
// parser.cpp
void
define and lambda are validated and left as-is in the tree. let and
letrec are desugared, rewritten into equivalent combinations of lambda and
set!.
The term “desugaring” comes from referring to certain language features as “syntactic sugar”, these are
parts of the language which do not contribute to the overall expressive power of
the language, but make it easier to program in the language
The goal, at least at the language design level, is to get the language down to
minimal possible representation so code generation is as simple and efficient as
possible.
Desugaring
Let
let introduces a set of local bindings that are all evaluated before any
name comes into scope. This is equivalently the behavior of a lambda which is
invoked immediatly:
(let ((x e1) (y e2)) body) -> ((lambda (x y) body) e1 e2)
In the above x and y, which are set to e1 and e2 respectively, are
accessible only to the body of this let scope, it is hopefully clear how this
is the same as passing e1 and e2 to a lambda which has x and y as
parameters. In the above, the outer expressions, e1 and e2 “do not know”
that the program is using them as variables x and y, in that way, they have
no ability to refer to themselves in their own definitions, which is an important
restriction of let which the above desugar gives us for free. let does not
allow for recursion, but clearly encodes the scope and availability of
a variable
The stronger compilation time knowledge that let affords us
also enables certain optimizations which we will not be discussing in this
article, but are worth looking into
Let Recursive
letrec allows us to use recursion, it’s desguaring rules:
(letrec ((f e)) body)
-> (let ((f nil)) (set! f e) body)
-> ((lambda (f) (set! f e) body) nil)
By adding the second step, f is a name that is available to e while it’s
being evaluated, enabling mutual recursion.
letrec therefore reduces to let plus mutation, and let reduces to
lambda. After resolve_forms the AST contains only define, lambda, if,
and set! as special forms.
Implementations
// parser.cpp — let
SExp
// parser.cpp — letrec
SExp
Scoping
This Lisp uses lexical scoping: the meaning of a name is determined by where it
appears in the source text, not where a function happens to be called from.
Concretely, a lambda introduces a new scope that can see everything in the
surrounding scope. A symbol should then be assigned to the definition as local
as possible. We solve this problem by representing our scopes as tables which
relate to each other with tree semantics:
// scoper.hpp
;
The root table is the global scope, pre-populated with the built-in functions
(+, -, cons, car, cdr, …). Every lambda in the source gets its own
child table. parent pointers let us walk back up to resolve names that were
not declared locally.
The Scoper runs in two phases: run builds the tree and assigns each binding
a unique integer ID; resolve replaces every string identifier in the AST with
that integer.
Building the Scope Tree
// scoper.cpp
case :
Resolving Names
// scoper.cpp
if constexpr
search walks up the parent chain until it finds the name or runs out of
scopes:
// scoper.cpp
Binding
After this pass every std::string symbol in the AST has been replaced by a
SymbolID which is guaranteed to be unique, two references to the same binding share the same integer
and the scope tree can be discarded.
Code Lowering and the Core Language
The AST the parser produces is still quite “surface-y”, it carries all of the
syntactic forms and still uses the SExp / List / Symbol vocabulary. The
core language is a separate, typed IR that strips that away and gives each
concept its own concrete type:
// core.hpp
;
;
;
;
;
;
;
;
Because we already desugared let/letrec into lambdas while resolving those special forms, and resolved
all names to integers, the lowering pass is mostly pattern matching. The
Lowerer::lower_expr function dispatches on the shape of the AST node:
// core.cpp
core::Expr
For lambda, the formals list and body expressions are extracted by position —
the structure is guaranteed by the parser so there is no additional validation:
// core.cpp
core::Lambda
define is only permitted at the top level (the lower_top entry point handles
it separately from lower_expr), which means the core Program is a flat list
of top-level definitions and expressions — nothing nested can introduce a new
global binding.
// core.hpp
using Top = std::variant<Define, Expr>;
using Program = std::vector<Top>;
The core language is what the code generator consumes.