Making a language
What’s in a programming language?
Programming language implementations vary wildly, but the steps are roughly as follows:
- Parse source code text into abstract syntax tree (AST) such as JSON
- (Optional) Static analysis of AST to report warnings and errors
- Depending on whether you want a compiler or an interpreter:
- Compile to executable, bytecode, JS, etc. OR
- Evaluate the AST to run the program
Note: Regexps are not powerful enough to for the job here. Don’t believe me? Check out the Chomsky hierarchy on Wikipedia, or this funny StackOverflow answer about parsing HTML.
Starting out easy
One of the smallest programming languages I could choose to implement is Lambda Calculus, but it’s pretty hard to see how that relates to everyday programming. So I’ll be walking through a tiny Lisp I made called Duckweed. It’s about 300 lines, so it shouldn’t be too bad to get through!
In the interest of space, I will not be diving into a full explanation of how Parsimmon works. You should check out the repo for more information, or just read this post anyway.
What is Duckweed?
A lot of people get scared at the idea of Lisp because they hear it’s a “hacker language” or some other gatekeeping nonsense. The short of it is, Lisp is just a style of programming language with very simple and regular syntax, perfect for a short example like this blog post.
Some Lisp languages have many features, but Duckweed intentionally has very few.
In terms of JavaScript, Duckweed has operations similar to function
, call()
,
var
, +
, *
, -
, if
, and not really much of anything else.
Here’s the hello world program in Duckweed:
(print "hello world")
Now let’s cover a basic factorial program in JavaScript.
In math factorial can be defined as follows:
factorial(0) = 1
factorial(n) = n * factorial(n - 1)
In English, when factorial
is given 0, it returns 1. When factorial is given
any other number (n), it returns n * factorial(n - 1)
. In JavaScript, this
looks like the following:
function factorial(n) {
if (n === 0) {
return 1;
} else {
return n * factorial(n - 1);
}
}
console.log(factorial(4));
Duckweed doesn’t have the concept of a function declaration, just anonymous functions, so the Duckweed program is a little closer to this JavaScript program:
var factorial = function (n) {
if (n === 0) {
return 1;
} else {
return n * factorial(n - 1);
}
};
console.log(factorial(4));
In Duckweed, let
is the word for var
, and it takes an expression at the end
to evaluate. Also lambda
is the word for function
.
(let ((factorial (lambda (n)
(if (= n 0)
1
(* n (factorial (- n 1)))))))
(print (factorial 4)))
So that’s what Duckweed looks like, and you can see it’s capable of doing at least basic math.
First things first
Let’s start off with the top level of Duckweed: main.js
.
const fs = require("fs");
const util = require("util");
const parse = require("./parse").parse;
const evaluate = require("./evaluate").evaluate;
const globals = require("./globals").globals;
const Scope = require("./scope");
const U = require("./util");
First up we have a lot of things to import. Not much to see here.
const args = process.argv.slice(2);
const filename = args[0];
const code = fs.readFileSync(filename, "utf-8");
const opts = { colors: true, depth: null };
function show(x) {
return console.log(util.inspect(x, opts));
}
Next we read the Duckweed code from the file specified on the command line, and define a helper function to pretty-print objects. This will be useful for inspecting the data generated by the parser.
const result = parse(code);
if (result.status) {
const ast = result.value;
// show(ast);
console.log(U.showSexp(ast));
console.log();
const x = evaluate(globals, ast);
console.log(U.showSexp(x));
} else {
console.error("parse error!");
console.error(result);
}
This is where the magic happens. It all really boils down to parse
and
evaluate
, but we’ve got some extra code in here to show parse errors and
inspect the code being evaluated.
Parsing all those parentheses
One of the reasons Lisp is a good choice is that the syntax is very simple to parse compared to most other languages.
const P = require("parsimmon");
const Comment = P.regex(/;[^\n]*/)
.skip(P.alt(P.string("\n"), P.eof))
.desc("a comment");
const _ = P.alt(P.whitespace, Comment).many().desc("whitespace");
We start off by describing what comments look like (semicolon until the end of line), and what whitespace looks like.
Everything in Lisp is an S-expression (sexp for short). In our very simple Lisp, this just means it’s either a list or an “atom”, which is the Lisp word for something very basic, like numbers and strings.
const SExp = P.lazy(() => P.alt(AList, Atom));
A list is simply an open parenthesis followed by zero or more sexps and
terminated by a closing parenthesis. We tag this information inside an object
with type: "List"
so that later we can easily inspect the things we parsed.
const AList = P.string("(")
.then(_.then(SExp).skip(_).many())
.skip(P.string(")"))
.map((items) => ({ type: "List", items }))
.desc("a list");
For the purpose of brevity, Duckweed strings do not have any escape characters
(e.g. "\n"
).
const AString = P.string('"')
// TODO: Accept escaped characters
.then(P.regex(/[^"]*/))
.skip(P.string('"'))
// TODO: Convert escaped characters back
.map((value) => ({ type: "String", value }))
.desc("a string");
It’s traditional in Lisp to just write #t
for true and #f
for false.
const True = P.string("#t").result({ type: "True" }).desc("#t");
const False = P.string("#f").result({ type: "False" }).desc("#f");
Numbers are numbers. We pass them through the JavaScript Number
function to
convert them from strings.
const ANumber = P.regex(/-?[0-9]+/)
.map((x) => Number(x))
.map((value) => ({ type: "Number", value }))
.desc("a number");
Symbols are plain words in your code that usually represent variables, but can
also represent keywords (such as if
).
const ASymbol = P.regex(/[a-zA-Z_+*=<>?\/-]+/)
.map((name) => ({ type: "Symbol", name }))
.desc("a symbol");
Any Lisp would be remiss without this shorthand for choosing not to evaluate a
sexp. Basically you just prefix any code with '
and Lisp treats it as data
instead.
const Quote = P.string("'")
.skip(_)
.then(SExp)
.map((sexp) => ({
type: "List",
items: [{ type: "Symbol", name: "quote" }, sexp],
}));
Like I said earlier, an atom is just any of the basic non-list data types. And a “file” is just a single sexp with optional whitespace around it. Most Lisps accept zero or more sexps at the top level of a file, but we’re keeping it simple here.
const Atom = P.alt(Quote, AString, ANumber, True, False, ASymbol);
const File = _.then(SExp).skip(_);
function parse(code) {
return File.parse(code);
}
exports.parse = parse;
Scope
Variable scope is one of the most important things to model in a programming language, so we’ll go over that first.
I used algebraic data types for my implementation. Basically there are two kinds of scope: empty and non-empty. We start off with an empty scope and every scope after that is non-empty. A non-empty scope just contains a JavaScript object mapping the variable names to their values.
const Empty = ["Scope.Empty"];
This is an empty scope. We just have an array with a string tag so we know which
kind it is. Then we use the create
function to wrap a scope with another
scope. This is how we create non-empty scopes.
function create(parent, items) {
return ["Scope.Nonempty", items, parent];
}
Again we have a string tag at the beginning to identify which case (empty vs non-empty), but in this case we also have an object containing the values in the current scope, and a reference to the parent scope so we can traverse the scope hierarchy.
A simple scope chain could look like this:
const s1 = Scope.Empty;
const s2 = Scope.create(s1, { a: 1, b: 2 });
const s3 = Scope.create(s2, { a: 3 });
And thus from the scope s3
we could see variables a
and b
, but a
would
have the value 3
from scope s3
’s perspective since it is shadowing
s2
’s variable a
.
Now that we have creation, we need a way to look up variables by their names. It’s a little unwieldy to manually dig through nested scopes, so we make a recursive function to help out.
function lookup(scope, key) {
if (scope[0] === "Scope.Empty") {
throw new Error("no such variable " + key);
}
if (scope[0] !== "Scope.Nonempty") {
throw new Error("not a valid scope");
}
if (scope[1].hasOwnProperty(key)) {
return scope[1][key];
}
return lookup(scope[2], key);
}
It’s a little cryptic accessing the pieces of a scope since I put them in an
array, but remember that the first element is the tag. Basically for empty
scopes we throw an error, then non-empty scopes check to see if the scope
contains the variable, and if not, call lookup
recursively on the parent
scope.
So using the scopes s1
, s2
, and s3
from above, we can use lookup
function like this:
lookup(s3, "a"); // Returns 3
lookup(s2, "a"); // Returns 1
lookup(s1, "a"); // Throws an error
This might seem like enough to model scope, but you also need to be able to update the values in an existing scope. We’re not exposing this in the language through reassignment, but it’s still important for modeling variable scope.
For the purposes of Duckweed it’s sufficient to just update the current scope without going up the chain, but languages with more complicated assignment rules we need to take extra care in this functionality.
function assign(scope, key, value) {
if (scope[0] === "Scope.Nonempty") {
scope[1][key] = value;
} else {
throw new Error("not a valid scope to assign to");
}
}
exports.assign = assign;
exports.lookup = lookup;
exports.create = create;
exports.Empty = Empty;
Evaluation
Most evaluation is straightforward in Duckweed: data is just data. Numbers are just numbers, strings are just strings, true and false are exactly what they sound like. But the way you evaluate lists is where the complication happens.
So here’s the overview of the evaluation file, with the complicated bits taken out for now:
const Scope = require("./scope");
const special = {
// ...
};
const table = {
List(stack, scope, node) {
// ...
},
JSFunction(stack, scope, node) {
return node;
},
True(stack, scope, node) {
return node;
},
False(stack, scope, node) {
return node;
},
String(stack, scope, node) {
return node;
},
Number(stack, scope, node) {
return node;
},
Symbol(stack, scope, node) {
return Scope.lookup(scope, node.name);
},
};
function EVAL(stack, scope, node) {
if (table.hasOwnProperty(node.type)) {
return table[node.type](stack, scope, node);
}
throw new Error("cannot evaluate " + JSON.stringify(node));
}
function evaluate(scope, node) {
return EVAL([], scope, node);
}
exports.EVAL = EVAL;
exports.evaluate = evaluate;
So to evaluate a symbol you look it up in the variable scope, and everything else is just passed through as raw data.
List eval
Now here’s the list evaluation function.
List(stack, scope, node) {
const first = node.items[0];
// Evaluate special form such as `if` or `lambda`
if (first.type === 'Symbol' && special.hasOwnProperty(first.name)) {
return special[first.name](stack, scope, node);
}
// Regular function call
const f = EVAL(stack, scope, first);
// but first, let's make sure we're right here!
if (f.type !== 'JSFunction' && f.type !== 'Function') {
throw new Error('cannot call non-function');
}
const args = node
.items
.slice(1)
.map(x => EVAL(stack, scope, x));
if (f.type === 'JSFunction') {
const args2 = [stack, scope].concat(args);
return f.f.apply(null, args2);
} else if (f.type === 'Function') {
const newStack = stack.concat(['#<lambda>']);
const obj = {};
f.parameters.items.forEach((p, i) => {
obj[p.name] = args[i];
});
const newScope = Scope.create(f.scope, obj);
const values = f.body.map(x => EVAL(newScope, newScope, x));
return values[values.length - 1];
}
},
There are three cases here, and we try to deal with them as early as possible:
-
The symbol is a special form, not a function call (such as
if
). -
The symbol references a native JavaScript function.
-
The symbol references a function created inside Duckweed.
So for the first case we just dispatch to a table with the unevaluated list data. Note that all list evaluations pass through the current call stack and the current variable scope.
Special eval
quote(stack, scope, node) {
// Should be called like (quote foo) with exactly one argument.
if (node.items.length !== 2) {
throw new Error('bad quote syntax');
}
return node.items[1];
},
The Lisp concept of quoting just means “please don’t evaluate me”, so this is
the most simple special form. It just returns the unevaluated data. It can be
used like (eval (quote a))
which is equivalent to simply a
.
list(stack, scope, node) {
const items =
node
.items
.slice(1)
.map(x => EVAL(stack, scope, x));
return {type: 'List', items};
},
To make a list, we just evaluate the arguments and put them into a list data
structure. You can use it like (list 1 2 (+ a b) 4)
.
lambda(stack, scope, node) {
const parameters = node.items[1];
const body = node.items.slice(2);
return {
type: 'Function',
scope,
parameters,
body
};
},
The critical part of lambda
is that we store a reference to the current scope
on the function object. If you’re familiar with the concept of closure,
this is required to implement that.
if(stack, scope, node) {
const condition = node.items[1];
const trueBranch = node.items[2];
const falseBranch = node.items[3];
// The key to `if` is to only evaluate the right branch, not both of them.
const value = EVAL(stack, scope, condition);
if (value.type === 'False') {
return EVAL(stack, scope, falseBranch);
} else {
return EVAL(stack, scope, trueBranch);
}
},
Basically with if
you choose the false branch of the value is false, otherwise
the true branch. And you only evaluate at most one branch.
The most complicated special form is let
. Basically we have to create a new
scope and one-by-one evaluate and assign variables inside it, then evaluate the
body of the let
expression.
Example:
(let ((a 1)
(b (+ a 1)))
(print a b)
(+ a b))
It’s important that the scope grows as we evaluate a
and b
in order, and
then we evaluate (print a b)
, throw it away, then evaluate and return
(+ a b)
.
let(stack, scope, node) {
// We need to actually mutate the scope as we evaluate let-bindings so that
// recursive functions can see themselves.
const pairs = node.items[1];
const newScope = Scope.create(scope, {});
pairs.items.forEach(p => {
const name = p.items[0].name;
const value = EVAL(stack, newScope, p.items[1]);
Scope.assign(newScope, name, value);
});
// Evaluate one or more expressions after the let-bindings and return the
// last one. Useful for side effects.
const values =
node
.items
.slice(2)
.map(x => EVAL(scope, newScope, x));
return values[values.length - 1];
}
Global built-ins
We have all the core operations at this point, but usually programming languages don’t ask you to implement basic math, numbers, and printing yourself, so we’ll provide that from JavaScript via some globals in the evaluator.
Just a few simple imports.
const Scope = require("./scope");
const U = require("./util");
const E = require("./evaluate");
The basic boolean constants we expect to see in any language.
const TRUE = { type: "True" };
const FALSE = { type: "False" };
Numbers are pretty basic too.
function NUMBER(value) {
return { type: "Number", value };
}
You might notice a pattern at this point: “basic” values are usually represented
in an evaluator as a type
field and a value
field if necessary.
This printing function isn’t a whole lot of fun, so just check out util.js
yourself if you want to.
function print(stack, scope, x) {
console.log(U.showSexp(x));
return x;
}
We just need to reach inside the wrapped numbers and then re-wrap them, reusing JavaScript’s operators to accomplish the basic math.
function add(stack, scope, a, b) {
return NUMBER(a.value + b.value);
}
function subtract(stack, scope, a, b) {
return NUMBER(a.value - b.value);
}
function multiply(stack, scope, a, b) {
return NUMBER(a.value * b.value);
}
function lessThan(stack, scope, a, b) {
return a.value < b.value ? TRUE : FALSE;
}
Ideally these math functions should add to the stack also, but it’s not strictly necessary for the evaluator to work. It just would make a stack trace nicer if the program crashed.
eval
is just using the EVAL
function we built-up earlier! We’re just
exposing it for the code to use.
function evaluate(stack, scope, sexp) {
return E.EVAL(stack, scope, sexp);
}
Now we make the actual top level scope the evaluator will user later.
const api = {
print,
"+": add,
"-": subtract,
"*": multiply,
"<": lessThan,
eval: evaluate,
};
Object.keys(api).forEach((k) => {
api[k] = { type: "JSFunction", f: api[k] };
});
const globals = Scope.create(Scope.Empty, api);
exports.globals = globals;
Wrapping up
This covers basically everything about the Duckweed language and evaluator. Check out the repo for the full code. Once you understand that, check out my other language Hibiscus for a (slightly larger) example of a small JS-like language.