The Stack
The exact phase boundary between an interpreted and compiled language is slightly fuzzy; Java, generally thought of as compiled, targets a virtual machine after several layers of IR, Python, generally thought of as interpreted… Targets a virtual machine after several layers of IR? What’s the difference?
I attribute the fact that a good bunch of languages are a little bit of both is that a purely interpreted language, one which takes your mushy-human-programmer code and immediately executes it on a compatible abstract machine, will usually be a very unpleasant programming experience. Turns out, compilation, or at least intermediate representation of source code, is very useful. Compilation gives the language designer the opportunity to do optimizations, check for correctness, and allow the programmer to use more intricate program organization, think classes and modules. However, compiling all the way down to something that talks to your hardware is not always feasible or even desirable. There are, and I am sorry to tell you this, a lot of different kinds of hardware, and not every language designer wants to spend the time and effort on designing the compiler backend for the sake of portability. Less to the point but still an important motivation for VMs in language design, it is relatively common for a language to be “embedded” in another system, notable examples include Lua, eBPF in the linux kernel, and maybe something like GDScript in Godot. If a domain specific language like this is a good idea or not is a question for the culture that I will not answer.
The Abstract Machine
A stack is an abstract data structure following LIFO semantics, that is, a piece of data enters, and does not leave until all entries added after it have been removed from the stack. The Stack is the name I took to calling the virtual machine that I was targeting for the purposes of this part of the project. Before I describe the full architecture I will do a little bit of apologia for the design. When I started the project I wanted most of the engineering to be “from first principles”, as such, this was probably the most naive part of the project, going off of no real specification, and only vaguely referencing sources like clox from crafting interpreters. This is also why this is not just an SECD machine, though I might implement it as a separate target later. When I started the project I had in mind a pure Forth-like stack machine, without heaps or registers and a relatively limited ISA. As it turns out, and I should have foreseen this from the onset frankly, it’s impossible to implement cool things like closures without dynamic data structures, for reasons that I will explain when we get there. Even giving up a completely static memory model only gets us so far before we need to contend with other difficult bits and pieces of a perfect stack model, namely that it becomes really ugly if a function uses variables in any order that is not precisely LIFO. The generator would need to pop out all of the variables above the variable of interest, pull that one to the top, and push them all back in. So when we say a “stack” vm, we generally mean a structure which enforces the stack semantics for code environments rather than for individual variables. With that little preamble out of the way we can finally get started on sketching out the virtual machine that we will be generating code for.
The virtual machine is built of the following components
- A memory primitive, which the rest of the VM interacts with only through
shared pointers, named
Cell, imagine this as a physical memory location program_memwhich stores our program instructions.- A program counter register,
pc, which points to where inprogram_memthe program currently is - A data “stack”, which is a vector of shared pointers to
Cells - A return stack, which is an actual stack of
Cells - A
frame_basewhich points to where in the data stack our local scope starts, and aframe_base_stackwhich tracks where we should restore the base to after returning from a local scope - A
heapwhich stores heap objects, that is, code environments and list primitives global_tblwhich is a map of global variables and “where they live”, this is in itself a form of dynamic memory similar to heap
This is, largely, isomorphic to an SECD machine the original SECD machine did not support global variables, but later extensions did. If you do not know what SECD machine is, it’s not necessary to understanding the rest of the article, but is a very interesting piece of programming history but the general philosophy and code generation strategy are different enough as to be worth treating this machine as a separate entity.
The Memory Model
The Cell and HeapObject types underpin most of the operations of the VM, so
it’s worth looking at them in greater detail
;
;
;
using HeapObject = std::variant<Pair, CodeEnv>;
Cell is an almost self explanatory single memory unit, it contains one signed,
64 bit integer value, a function pair and null
flag
If, and this is not out of the question, I end up designing something in verilog to model this system, I will probably end up cutting into the most significant bits of the integer to represent the flags.
The CodeEnv struct represents the code environment that a lambda runs in, it’s
necessary for this to be a heap structure, that is, to have a dynamic lifetime,
because the lambda could be called past the lifetime of the variables that it
originally lived in. For that reason, closures/closure environments generally
require heap allocation, Rust being a notable exception (as it typically is),
which gets around the problem of a closure using a dead variable by way of its
ownership model.
Pair is just a cons cell, this object allows us to support lists and list
programming in the virtual machine. The Pair struct has to be a heap object
for a similar reason to the CodeEnv object, namely, we can’t actually say at
comp time how large a tree represented by the cons cells taken together will be.
The ISA
With that, we’re ready to look at the ISA that the program uses to interact with the virtual machine. A lot of these are likely familiar, so we’ll be skipping the arithmetic, logic, and transfer opcodes to focus on the control and list instructions in the next section, where we detail how variables are handled, and then the calling convention I am certain that I will eventually, probably after this particular write up, get around to writing more cohesive documentation .
Arithmetic Operations
| Opcode | Operand | Stack In | Stack Out |
|---|---|---|---|
add | — | 2 | 1 |
sub | — | 2 | 1 |
mul | — | 2 | 1 |
div | — | 2 | 1 |
mod | — | 2 | 1 |
inc | — | 1 | 1 |
dec | — | 1 | 1 |
max | — | 2 | 1 |
min | — | 2 | 1 |
Logic Operations
| Opcode | Operand | Stack In | Stack Out |
|---|---|---|---|
lt | — | 2 | 1 |
le | — | 2 | 1 |
eq | — | 2 | 1 |
ge | — | 2 | 1 |
gt | — | 2 | 1 |
Transfer Operations
| Opcode | Operand | Stack In | Stack Out |
|---|---|---|---|
drop | u64 | 1 | 0 |
dup | — | 1 | 2 |
swap | — | 2 | 2 |
nrot | u64 | 0 | 0 |
push | u64 | 0 | 1 |
Control Operations
| Opcode | Operand | Stack In | Stack Out |
|---|---|---|---|
call | u64 | 0 | 0 |
ret | — | 0 | 0 |
jmp | addr | 0 | 0 |
cjmp | addr | 1 | 0 |
wait | — | 0 | 0 |
halt | — | 0 | 0 |
mkclosure | u64 | 0 | 0 |
mkglobal | u64 | 1 | 0 |
loadglobal | u64 | 0 | 0 |
mutglobal | u64 | 0 | 0 |
enter | u64 | 0 | 0 |
get_local | u64 | 0 | 1 |
set_local | u64 | 1 | 0 |
List Operations
| Opcode | Operand | Stack In | Stack Out |
|---|---|---|---|
cons | — | 2 | 1 |
car | — | 1 | 1 |
cdr | — | 1 | 1 |
pushnil | — | 0 | 1 |
isnull | — | 1 | 1 |
Programming the Machine
Variables
First we should understand how variables work. We can take it as a promise from
the language front end that all variables have a globally unique name, which
we’ll call SymbolID. A variable can be either global or local, let’s begin
with the simpler of the two and inspect how we interact with locally scoped
variables:
-
GETLOCAL nloads thenth variable from theframe_baseup to the top of the stack. On an implementation level, this pushes a copy of the pointer atdata_stack[frame_base + n]up to the top of the stack. -
SETLOCAL npop aCellpointer from the top of the stack and then replaces the dereferencedCellfound atdata_stack[frame_base + n]. This propagates to all other shared pointers which pointed at thatCellpreviously.
Local variables are kind of automatic in the way that if it’s on the stack above the current frame index, it’s accessible as a local variable. Global variables are a bit more tricky as we need to indicate that this global variable should be accessible from everywhere, so we need to add a way to create a new global variable.
MKGLOBAL ncreates a new entry in theglobal_tblwith the keyn, pointing to a shared pointer to theCellwhich holds the value we want to be global.LOADGLOBAL npushes to the stack the variable in thenkey of theglobal_tblMUTGLOBAL nchanges the underlying reference of thenkey of theglobal_tblto the top of the stack.
Since we ultimately are the ones who control the code generation, and we’re
pretty ok that SETLOCAL can interact with a global variable unwittingly.
The Calling Convention
In programming language design, the calling convention is the scheme the
contract for how functions receive and interact with their parameters, and how
functions are expected to provide their outputs. This is almost always
considered to be part of the ABI,abstract binary interface, and determines how
subroutines and programs are expected to interact with one another. I will save
you the details of the three or four revisions of this calling convention.
First, we should internalize what we mean by code environments being subject to
stack semantics rather than the variables. When we CALL a closure, which is
all functions in this language, we expect it’s local variables to be the
arguments passed to it by the callee, and the variables that it references in
its body, formals and captured variables respectively. To achieve this, we
should tell the VM where those start, which we do by setting frame_base, and
allowing the closure to interact with it’s local variables only in reference to
that frame base.
This is, by the way, the same for most ISAs, when we talk about
“stack” allocation, we’re really just talking about a contiguous region of memory with
a pointer to the top which grows in one direction, which we use in a LIFO fashion
But when that called closure has concluded, and the result lies on top of the
stack, the callee, which might be a closure in and of it self, could still need to
do some work, so we have to restore the frame base to what it was prior to the
call, which we achieve through the frame_base_stack! Diagramatically, we have
the following procedure. The caller, just before CALL n is invoked:
+-----------------+
| argn |
+-----------------+
| ... |
+-----------------+
| arg0 |
+-----------------+
| c0 | <- Cell containing CodeEnv
+-----------------+
| ... |
+-----------------+
| argX | <- frame_base
+-----------------+
| Bottom |
+-----------------+
CALL consumes c0, pushes the captured variables, saves frame_base to
frame_base_stack, then moves frame_base up to arg0 — entering the
callee:
+-----------------+
| capm |
+-----------------+
| ... |
+-----------------+
| cap0 |
+-----------------+
| argn |
+-----------------+
| ... |
+-----------------+
| arg0 | <- frame_base (new)
+-----------------+
| ... |
+-----------------+
| argX | <- frame_base (saved)
+-----------------+
| Bottom |
+-----------------+
The closure does what work it needs to, accessing the arguments and captures
through GETLOCAL. After NROT n+1 and DROP n clean the frame slots, RET
pops frame_base_stack, restoring the caller’s view with only the result on
top:
+-----------------+
| result |
+-----------------+
| ... | <- frame_base (restored)
+-----------------+
| Bottom |
+-----------------+
Let’s now walk through what a function call looks like at the assembly level. Starting with the declaration of a function at comp time:
[0] JMP [M+3] ; skip over the body
[1] ENTER n ; n = #formals + #captured-outer-locals
<body>
[M] NROT n+1 ; Function clean up
[M+1] DROP n ; discard the n frame slots, intermediate results of the procedure
; result is now on top for the caller
[M+2] RET ; pop return address, restore old frame_base
[M+3] MKCLOSURE [1] ; heap-allocate a CodeEnv:
; .captured_vars = snapshot of data_stack[frame_base..top)
JMP simply sets the program counter to the operand, in the above, we use it to
skip over the function definition so that we do not accidentally execute the
function before it is invoked, skipping to [M+3], the MKCLOSURE command
copies all of the shared pointers on the stack into a code environment, pushes a
handle to the stack, and points to [1] in the program memory. The ENTER
operation moves the frame base down the stack so that the closure can see its
arguments, with the explicit understanding that on call, all of the necessary
arguments will be provided either by the closure environment or the callee, on
the stack. After that the function body is executed, then NROT n+1 moves the
final result under the arguments, then DROP n drops those. Finally RET pops
a destination from the return stack and sets the program counter to it, then
it sets the frame base to be what it was before the ENTER instruction was
executed.
A Simple Example
To see what a function definition and call site look like, consider a program which takes one argument and increments it by one:
[0] JMP 72
[9] ENTER 1
[18] GET_LOCAL 0
[27] PUSH 1
[36] ADD
[45] NROT 2
[54] DROP 1
[63] RET
[72] MKCLOSURE 9
[81] MKGLOBAL 15
[90] LOADGLOBAL 15
[99] PUSH 3
[108] CALL 1
[117] HALT
This corresponds to the following lisp code
Let’s look at the stack execution state.
pc=0 op=JMP stack=[]
pc=72 op=MKCLOSURE stack=[]
pc=81 op=MKGLOBAL stack=[0f]
pc=90 op=LOADGLOBAL stack=[]
pc=99 op=PUSH stack=[0f]
pc=108 op=CALL stack=[0f, 3]
pc=9 op=ENTER stack=[3]
pc=18 op=GETLOCAL stack=[3]
pc=27 op=PUSH stack=[3, 3]
pc=36 op=ADD stack=[3, 3, 1]
pc=45 op=NROT stack=[3, 4]
pc=54 op=DROP stack=[4, 3]
pc=63 op=RET stack=[4]
pc=117 op=HALT stack=[4]
- First we
JMPpast the function body topc = 72, where we make the closure, since this function only takes 1 argument and doesn’t close over anything, it’sCodeEnv.captured_varsis empty. - Then we
MKGLOBAL 15, that is, we take the value on the top of the stack, which right now is the closure handler, and add it to the heap with global name 15. Now we can call this function from anywhere else in the code - Then we add the function argument, in this case 3, and
CALL 1, the operand to theCALLinstruction tells it how far down in the stack the function we’re calling is. So we take that global variable, look inside and find aCodeEnvwithcode_idx = 9 - We follow that and pull
GET_LOCAL 0intothe stack, which is the first argument to the function, then push 1, andADD. Then we clean up the argument that was passed to the function, leaving only the 4 on top, as desired.
Tree Operations
The last thing that we should discuss before closing this chapter of the project
are the list/tree instructions, CONS, CAR and CDR. We need these
primatives primarily because, at compilation time, we can not determine the size
of the list, so we must give the user a way to access the heap and interact
with it. A brief explaination for each follows
CONScreates a new cons cell in the heap and pushes it’s index into the stack . The cons cell is a heap object which has aleftandrightelement, each one can point to another cons cell, or hold a value (or a function for that matter).CARpops the top element and, if it’s a cons cell handler, pushes theleftelement to the data stackCDRdoes the same but gets therightelement instead
Those are the three orthodox primitives required for a lisp, any target has to somehow support them. Two additional, slightly apocryphal instructions are also added to make it a bit easier to generate code which deals with lists:
PUSHNILcreates a unique null value which, typically terminates a listISNULLchecks the top cell in the stack for thenullflag, and pushes a new cell containing that truth value.
It’s worth thinking a little bit about how targets which aren’t taylored
specifically for a lisp would support this: the cons cell is just a sum type
with pointers to two other objects in memory, CONS then just creates a static
structure which holds two pointers, which in themselves point to two others and
so on and so on. However this creates a lot of structure that might or might not
be accessible, which is where more sophisticated garbage collection mechanisms
come in. There’s also the problem that a naive pointer-to-somewhere could be
incredibly slow on cache miss, which is where rather sophisticated caching comes
into play.
Lets work through an example of the creating a list, a simple list becomes
->
Which produces the following bytecode:
PUSHNIL
PUSH 3
CONS
PUSH 2
CONS
PUSH 1
CONS
After the final CONS, the data stack holds a single handle into the heap,
where three Pair objects form the linked structure:
Stack
+---------+
| h2 ───────► Pair @ h2
+---------+ ├─ head ──► Cell { value: 1 }
└─ tail ──► Pair @ h1
├─ head ──► Cell { value: 2 }
└─ tail ──► Pair @ h0
├─ head ──► Cell { value: 3 }
└─ tail ──► Cell { null: true }
This is the key contrast with the calling convention diagrams above: the stack
stays flat, holding only a single pointer, while the structure fans out entirely
in the heap. CAR and CDR follow that pointer one level at a time, and
ISNULL tests whether the cell at the current node is the null sentinel, giving
the language everything needed to recurse over a list without any stack growth
beyond the call frames themselves.
Conclusion
In the next entry to this only slightly protracted write up we’ll focus on the code generator which turns the AST from part 1 into the ISA in this part. Code generation is probably the most visible and obvious role of the compiler, but ironically the one I had to spend the least time on to get to an MVP.