Making a language

2016 November 1

How to make a programming language from scratch in JS.

What’s in a programming language?

Programming language implementations vary wildly, but the steps are roughly as follows:

Parse source code text into abstract syntax tree (AST) such as JSON
(Optional) Static analysis of AST to report warnings and errors
Depending on whether you want a compiler or an interpreter:
- Compile to executable, bytecode, JS, etc. OR
- Evaluate the AST to run the program

Note: Regexps are not powerful enough to for the job here. Don’t believe me? Check out the Chomsky hierarchy on Wikipedia, or this funny StackOverflow answer about parsing HTML.

Starting out easy

One of the smallest programming languages I could choose to implement is Lambda Calculus, but it’s pretty hard to see how that relates to everyday programming. So I’ll be walking through a tiny Lisp I made called Duckweed. It’s about 300 lines, so it shouldn’t be too bad to get through!

In the interest of space, I will not be diving into a full explanation of how Parsimmon works. You should check out the repo for more information, or just read this post anyway.

What is Duckweed?

A lot of people get scared at the idea of Lisp because they hear it’s a “hacker language” or some other gatekeeping nonsense. The short of it is, Lisp is just a style of programming language with very simple and regular syntax, perfect for a short example like this blog post.

Some Lisp languages have many features, but Duckweed intentionally has very few. In terms of JavaScript, Duckweed has operations similar to function, call(), var, +, *, -, if, and not really much of anything else.

Here’s the hello world program in Duckweed:

(print "hello world")

Now let’s cover a basic factorial program in JavaScript.

In math factorial can be defined as follows:

factorial(0) = 1
factorial(n) = n * factorial(n - 1)

In English, when factorial is given 0, it returns 1. When factorial is given any other number (n), it returns n * factorial(n - 1). In JavaScript, this looks like the following:

function factorial(n) {
  if (n === 0) {
    return 1;
  } else {
    return n * factorial(n - 1);
  }
}
console.log(factorial(4));

Duckweed doesn’t have the concept of a function declaration, just anonymous functions, so the Duckweed program is a little closer to this JavaScript program:

var factorial = function (n) {
  if (n === 0) {
    return 1;
  } else {
    return n * factorial(n - 1);
  }
};
console.log(factorial(4));

In Duckweed, let is the word for var, and it takes an expression at the end to evaluate. Also lambda is the word for function.

(let ((factorial (lambda (n)
       (if (= n 0)
        1
        (* n (factorial (- n 1)))))))
  (print (factorial 4)))

So that’s what Duckweed looks like, and you can see it’s capable of doing at least basic math.

First things first

Let’s start off with the top level of Duckweed: main.js.

const fs = require("fs");
const util = require("util");
const parse = require("./parse").parse;
const evaluate = require("./evaluate").evaluate;
const globals = require("./globals").globals;
const Scope = require("./scope");
const U = require("./util");

First up we have a lot of things to import. Not much to see here.

const args = process.argv.slice(2);
const filename = args[0];
const code = fs.readFileSync(filename, "utf-8");

const opts = { colors: true, depth: null };
function show(x) {
  return console.log(util.inspect(x, opts));
}

Next we read the Duckweed code from the file specified on the command line, and define a helper function to pretty-print objects. This will be useful for inspecting the data generated by the parser.

const result = parse(code);
if (result.status) {
  const ast = result.value;
  // show(ast);
  console.log(U.showSexp(ast));
  console.log();
  const x = evaluate(globals, ast);
  console.log(U.showSexp(x));
} else {
  console.error("parse error!");
  console.error(result);
}

This is where the magic happens. It all really boils down to parse and evaluate, but we’ve got some extra code in here to show parse errors and inspect the code being evaluated.

Parsing all those parentheses

One of the reasons Lisp is a good choice is that the syntax is very simple to parse compared to most other languages.

const P = require("parsimmon");

const Comment = P.regex(/;[^\n]*/)
  .skip(P.alt(P.string("\n"), P.eof))
  .desc("a comment");

const _ = P.alt(P.whitespace, Comment).many().desc("whitespace");

We start off by describing what comments look like (semicolon until the end of line), and what whitespace looks like.

Everything in Lisp is an S-expression (sexp for short). In our very simple Lisp, this just means it’s either a list or an “atom”, which is the Lisp word for something very basic, like numbers and strings.

const SExp = P.lazy(() => P.alt(AList, Atom));

A list is simply an open parenthesis followed by zero or more sexps and terminated by a closing parenthesis. We tag this information inside an object with type: "List" so that later we can easily inspect the things we parsed.

const AList = P.string("(")
  .then(_.then(SExp).skip(_).many())
  .skip(P.string(")"))
  .map((items) => ({ type: "List", items }))
  .desc("a list");

For the purpose of brevity, Duckweed strings do not have any escape characters (e.g. "\n").

const AString = P.string('"')
  // TODO: Accept escaped characters
  .then(P.regex(/[^"]*/))
  .skip(P.string('"'))
  // TODO: Convert escaped characters back
  .map((value) => ({ type: "String", value }))
  .desc("a string");

It’s traditional in Lisp to just write #t for true and #f for false.

const True = P.string("#t").result({ type: "True" }).desc("#t");

const False = P.string("#f").result({ type: "False" }).desc("#f");

Numbers are numbers. We pass them through the JavaScript Number function to convert them from strings.

const ANumber = P.regex(/-?[0-9]+/)
  .map((x) => Number(x))
  .map((value) => ({ type: "Number", value }))
  .desc("a number");

Symbols are plain words in your code that usually represent variables, but can also represent keywords (such as if).

const ASymbol = P.regex(/[a-zA-Z_+*=<>?\/-]+/)
  .map((name) => ({ type: "Symbol", name }))
  .desc("a symbol");

Any Lisp would be remiss without this shorthand for choosing not to evaluate a sexp. Basically you just prefix any code with ' and Lisp treats it as data instead.

const Quote = P.string("'")
  .skip(_)
  .then(SExp)
  .map((sexp) => ({
    type: "List",
    items: [{ type: "Symbol", name: "quote" }, sexp],
  }));

Like I said earlier, an atom is just any of the basic non-list data types. And a “file” is just a single sexp with optional whitespace around it. Most Lisps accept zero or more sexps at the top level of a file, but we’re keeping it simple here.

const Atom = P.alt(Quote, AString, ANumber, True, False, ASymbol);

const File = _.then(SExp).skip(_);

function parse(code) {
  return File.parse(code);
}

exports.parse = parse;

Scope

Variable scope is one of the most important things to model in a programming language, so we’ll go over that first.

I used algebraic data types for my implementation. Basically there are two kinds of scope: empty and non-empty. We start off with an empty scope and every scope after that is non-empty. A non-empty scope just contains a JavaScript object mapping the variable names to their values.

const Empty = ["Scope.Empty"];

This is an empty scope. We just have an array with a string tag so we know which kind it is. Then we use the create function to wrap a scope with another scope. This is how we create non-empty scopes.

function create(parent, items) {
  return ["Scope.Nonempty", items, parent];
}

Again we have a string tag at the beginning to identify which case (empty vs non-empty), but in this case we also have an object containing the values in the current scope, and a reference to the parent scope so we can traverse the scope hierarchy.

A simple scope chain could look like this:

const s1 = Scope.Empty;
const s2 = Scope.create(s1, { a: 1, b: 2 });
const s3 = Scope.create(s2, { a: 3 });

And thus from the scope s3 we could see variables a and b, but a would have the value 3 from scope s3’s perspective since it is shadowing s2’s variable a.

Now that we have creation, we need a way to look up variables by their names. It’s a little unwieldy to manually dig through nested scopes, so we make a recursive function to help out.

function lookup(scope, key) {
  if (scope[0] === "Scope.Empty") {
    throw new Error("no such variable " + key);
  }
  if (scope[0] !== "Scope.Nonempty") {
    throw new Error("not a valid scope");
  }

  if (scope[1].hasOwnProperty(key)) {
    return scope[1][key];
  }
  return lookup(scope[2], key);
}

It’s a little cryptic accessing the pieces of a scope since I put them in an array, but remember that the first element is the tag. Basically for empty scopes we throw an error, then non-empty scopes check to see if the scope contains the variable, and if not, call lookup recursively on the parent scope.

So using the scopes s1, s2, and s3 from above, we can use lookup function like this:

lookup(s3, "a"); // Returns 3
lookup(s2, "a"); // Returns 1
lookup(s1, "a"); // Throws an error

This might seem like enough to model scope, but you also need to be able to update the values in an existing scope. We’re not exposing this in the language through reassignment, but it’s still important for modeling variable scope.

For the purposes of Duckweed it’s sufficient to just update the current scope without going up the chain, but languages with more complicated assignment rules we need to take extra care in this functionality.

function assign(scope, key, value) {
  if (scope[0] === "Scope.Nonempty") {
    scope[1][key] = value;
  } else {
    throw new Error("not a valid scope to assign to");
  }
}

exports.assign = assign;
exports.lookup = lookup;
exports.create = create;
exports.Empty = Empty;

Evaluation

Most evaluation is straightforward in Duckweed: data is just data. Numbers are just numbers, strings are just strings, true and false are exactly what they sound like. But the way you evaluate lists is where the complication happens.

So here’s the overview of the evaluation file, with the complicated bits taken out for now:

const Scope = require("./scope");

const special = {
  // ...
};

const table = {
  List(stack, scope, node) {
    // ...
  },
  JSFunction(stack, scope, node) {
    return node;
  },
  True(stack, scope, node) {
    return node;
  },
  False(stack, scope, node) {
    return node;
  },
  String(stack, scope, node) {
    return node;
  },
  Number(stack, scope, node) {
    return node;
  },
  Symbol(stack, scope, node) {
    return Scope.lookup(scope, node.name);
  },
};

function EVAL(stack, scope, node) {
  if (table.hasOwnProperty(node.type)) {
    return table[node.type](stack, scope, node);
  }
  throw new Error("cannot evaluate " + JSON.stringify(node));
}

function evaluate(scope, node) {
  return EVAL([], scope, node);
}

exports.EVAL = EVAL;
exports.evaluate = evaluate;

So to evaluate a symbol you look it up in the variable scope, and everything else is just passed through as raw data.

List eval

Now here’s the list evaluation function.

List(stack, scope, node) {
  const first = node.items[0];
  // Evaluate special form such as `if` or `lambda`
  if (first.type === 'Symbol' && special.hasOwnProperty(first.name)) {
    return special[first.name](stack, scope, node);
  }
  // Regular function call
  const f = EVAL(stack, scope, first);
  // but first, let's make sure we're right here!
  if (f.type !== 'JSFunction' && f.type !== 'Function') {
    throw new Error('cannot call non-function');
  }
  const args = node
    .items
    .slice(1)
    .map(x => EVAL(stack, scope, x));
  if (f.type === 'JSFunction') {
    const args2 = [stack, scope].concat(args);
    return f.f.apply(null, args2);
  } else if (f.type === 'Function') {
    const newStack = stack.concat(['#<lambda>']);
    const obj = {};
    f.parameters.items.forEach((p, i) => {
      obj[p.name] = args[i];
    });
    const newScope = Scope.create(f.scope, obj);
    const values = f.body.map(x => EVAL(newScope, newScope, x));
    return values[values.length - 1];
  }
},

There are three cases here, and we try to deal with them as early as possible:

The symbol is a special form, not a function call (such as if).
The symbol references a native JavaScript function.
The symbol references a function created inside Duckweed.

So for the first case we just dispatch to a table with the unevaluated list data. Note that all list evaluations pass through the current call stack and the current variable scope.

Special eval

quote(stack, scope, node) {
  // Should be called like (quote foo) with exactly one argument.
  if (node.items.length !== 2) {
    throw new Error('bad quote syntax');
  }
  return node.items[1];
},

The Lisp concept of quoting just means “please don’t evaluate me”, so this is the most simple special form. It just returns the unevaluated data. It can be used like (eval (quote a)) which is equivalent to simply a.

list(stack, scope, node) {
  const items =
    node
      .items
      .slice(1)
      .map(x => EVAL(stack, scope, x));
  return {type: 'List', items};
},

To make a list, we just evaluate the arguments and put them into a list data structure. You can use it like (list 1 2 (+ a b) 4).

lambda(stack, scope, node) {
  const parameters = node.items[1];
  const body = node.items.slice(2);
  return {
    type: 'Function',
    scope,
    parameters,
    body
  };
},

The critical part of lambda is that we store a reference to the current scope on the function object. If you’re familiar with the concept of closure, this is required to implement that.

if(stack, scope, node) {
  const condition = node.items[1];
  const trueBranch = node.items[2];
  const falseBranch = node.items[3];
  // The key to `if` is to only evaluate the right branch, not both of them.
  const value = EVAL(stack, scope, condition);
  if (value.type === 'False') {
    return EVAL(stack, scope, falseBranch);
  } else {
    return EVAL(stack, scope, trueBranch);
  }
},

Basically with if you choose the false branch of the value is false, otherwise the true branch. And you only evaluate at most one branch.

The most complicated special form is let. Basically we have to create a new scope and one-by-one evaluate and assign variables inside it, then evaluate the body of the let expression.

Example:

(let ((a 1)
      (b (+ a 1)))
  (print a b)
  (+ a b))

It’s important that the scope grows as we evaluate a and b in order, and then we evaluate (print a b), throw it away, then evaluate and return (+ a b).

let(stack, scope, node) {
  // We need to actually mutate the scope as we evaluate let-bindings so that
  // recursive functions can see themselves.
  const pairs = node.items[1];
  const newScope = Scope.create(scope, {});
  pairs.items.forEach(p => {
    const name = p.items[0].name;
    const value = EVAL(stack, newScope, p.items[1]);
    Scope.assign(newScope, name, value);
  });
  // Evaluate one or more expressions after the let-bindings and return the
  // last one. Useful for side effects.
  const values =
    node
      .items
      .slice(2)
      .map(x => EVAL(scope, newScope, x));
  return values[values.length - 1];
}

Global built-ins

We have all the core operations at this point, but usually programming languages don’t ask you to implement basic math, numbers, and printing yourself, so we’ll provide that from JavaScript via some globals in the evaluator.

Just a few simple imports.

const Scope = require("./scope");
const U = require("./util");
const E = require("./evaluate");

The basic boolean constants we expect to see in any language.

const TRUE = { type: "True" };
const FALSE = { type: "False" };

Numbers are pretty basic too.

function NUMBER(value) {
  return { type: "Number", value };
}

You might notice a pattern at this point: “basic” values are usually represented in an evaluator as a type field and a value field if necessary.

This printing function isn’t a whole lot of fun, so just check out util.js yourself if you want to.

function print(stack, scope, x) {
  console.log(U.showSexp(x));
  return x;
}

We just need to reach inside the wrapped numbers and then re-wrap them, reusing JavaScript’s operators to accomplish the basic math.

function add(stack, scope, a, b) {
  return NUMBER(a.value + b.value);
}

function subtract(stack, scope, a, b) {
  return NUMBER(a.value - b.value);
}

function multiply(stack, scope, a, b) {
  return NUMBER(a.value * b.value);
}

function lessThan(stack, scope, a, b) {
  return a.value < b.value ? TRUE : FALSE;
}

Ideally these math functions should add to the stack also, but it’s not strictly necessary for the evaluator to work. It just would make a stack trace nicer if the program crashed.

eval is just using the EVAL function we built-up earlier! We’re just exposing it for the code to use.

function evaluate(stack, scope, sexp) {
  return E.EVAL(stack, scope, sexp);
}

Now we make the actual top level scope the evaluator will user later.

const api = {
  print,
  "+": add,
  "-": subtract,
  "*": multiply,
  "<": lessThan,
  eval: evaluate,
};

Object.keys(api).forEach((k) => {
  api[k] = { type: "JSFunction", f: api[k] };
});

const globals = Scope.create(Scope.Empty, api);

exports.globals = globals;

Wrapping up

This covers basically everything about the Duckweed language and evaluator. Check out the repo for the full code. Once you understand that, check out my other language Hibiscus for a (slightly larger) example of a small JS-like language.