Writing a language

Warning: This blogpost was posted over two years ago. That is a long time in development-world! The story here may no longer be relevant, complete or secure. Code might be incomplete or obsolete, and even my current vision on the subject might have (completely) changed. So please do read on, but use it with caution.
Posted on 17 Aug 2012
Tagged with: [ ast ]  [ bison ]  [ flex ]  [ grammar ]  [ lex ]  [ saffire ]  [ yacc ]

In the last blogpost I talked about a new language in the making. Unfortunately, writing a complete new language - from scratch - isn't easy and takes a fair bit of time. During this development process, I will try to blog a bit about the things we are developing, the problems we are facing and the solutions we are implementing. The current Saffire status: we are able to generate ASTs from Saffire source programs. If you have no clue what I'm talking about, no worries: this blogpost will try to explain it all.

On writing compilers

I've written a fair bit of complex code in my life, including an operating system capable of context switching, memory management, ext2/3, FAT16 and VFS support. Not something that will make it to your desktop soon, but nevertheless the basic and probably most complex parts of any operating system. So I guess it's safe to say I know a thing or two about systems. Unfortunately, all that knowledge is pretty useless when writing a compiler (or at least, the most important bits of one). But help is on its way: everybody who is into compiler development knows about THE book you will need: the (purple) dragon book. If you're a student and use this book in class, you will probably hate it. Other people, like me, who never really went to school and learned it all "on the streets", love this book. It's a bit of dry theory, and you definitely have to put the book away once in a while to try to put it all into practice. It will make sense once you get the hang of it.

Now, something like this cannot be written without the help of some smart people who already know this stuff. So a big thank you to Richard van Velzen, who has helped me a lot with getting things up and running.

Defining your language

Before you can actually run an application in your language, you have to figure out how the language should work. This is the easiest step in the whole process, and the only limit is your own imagination. Just write down what your language should look like, what its rules are, and how you would do certain things. You will run into dead ends a lot, since your idea only works for scenario X, but not for scenario Y. For instance, Saffire only supports methods. There are no global functions or procedures like you can find in other languages. This is done to support a more OO style of programming, so we don't have people calling printf() or str_replace() all over the place. Everything is neatly grouped into classes and objects. But as a side effect, we are still not 100% sure how we will support anonymous functions, lambdas and closures.

In the end, it's a matter of writing down example code, discussing it with others who find flaws in your reasoning, and deciding on a solution. Sooner or later, this will end up in a language specification. This is vital in order to create your compiler: if you have no idea what your language will look like, your compiler won't either. With Saffire, we are still in this process: we have decided on a lot of features and specifications, but a lot of things are left open. Luckily, this doesn't prevent us from continuing with the rest of the process, but sooner or later we need to have everything documented and thought out in order to implement it.

Phase 1: tokenizing your code

There is a reason why writing a compiler isn't easy, and it's the same reason why we implement things like coding standards when dealing with code that is written and/or read by others. If we do things a certain way, it becomes easier to process that information: this is how we define constants, this is how we write an if-else structure, and so on. For a compiler this works pretty much the same way: it needs to figure out what is what, and in order to do so, it follows certain rules.

Now, the first part in all this is converting your source code into tokens. For example, there is a T_IF token (the T_ stands for token). So whenever a programmer has written if ($a), IF($A), if    ($a), iF($a) or any other possible way to write an if-statement, that "if" part is replaced by the T_IF token. Some tokens are just single characters: the parentheses are tokens as well, T_PARENTHESIS_OPEN and T_PARENTHESIS_CLOSE for example. Things like variable names are also converted to tokens, in this case the T_VARIABLE token (a token can hold additional information; here it would be the actual name of the variable, $a).

The conversion of source code to tokens is done by a so-called lexer. There are multiple lexers out there, but we decided to use flex for this job. Flex is a pretty old system, but it works really well and has lots of documentation and tutorials, which makes it easier to get your system up and running. The tokenizer can also do things like removing comments and empty lines: things that make the code visually easier to read, but are of no use to the compiler.
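To give an idea of what this looks like in practice, here is a rough flex sketch for the examples above. The token names, the hypothetical tokens.h header and the rules are illustrative only; the real Saffire lexer file will differ:

```
%{
#include "tokens.h"   /* hypothetical header defining T_IF, T_VARIABLE, ... */
%}

%%
[iI][fF]                    { return T_IF; }   /* "if", "IF", "iF", ... */
"("                         { return T_PARENTHESIS_OPEN; }
")"                         { return T_PARENTHESIS_CLOSE; }
\$[a-zA-Z_][a-zA-Z0-9_]*    { return T_VARIABLE; /* yytext holds the name, e.g. "$a" */ }
"//".*                      { /* strip comments */ }
[ \t\r\n]+                  { /* strip whitespace and empty lines */ }
%%
```

Each rule maps a pattern in the source to a token; whatever matches no "return" rule (comments, whitespace) is simply dropped from the stream.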

So, phase 1 is complete: the source code is translated into a stream of tokens.

Phase 2: parsing

From this point on we don't have to worry about how a programmer has written their code. It's all converted into generic tokens we can work with. But just like in a natural language such as English or Dutch, randomly writing down words doesn't make a sentence people can understand. The parser phase takes the stream of tokens and figures out whether those tokens actually make sense.

For instance, let’s assume we have a small bit of code:  

if ($a){}

The lexer would translate this into tokens:

T_IF T_PARENTHESIS_OPEN T_VARIABLE T_PARENTHESIS_CLOSE T_CURLY_OPEN T_CURLY_CLOSE
But suppose we have written something like this:

 )($a {if}

Our lexer will happily tokenize this into:

T_PARENTHESIS_CLOSE T_PARENTHESIS_OPEN T_VARIABLE T_CURLY_OPEN T_IF T_CURLY_CLOSE
Unless you are defining a REALLY complicated language, this probably isn't something you want.

Now, as said: the parser will take this information and go through its grammar rules to see if it finds a match. For instance, we could have a grammar rule describing if-statements:

if_statement: T_IF '(' T_VARIABLE ')'
            | T_IF '(' ')'
            ;
What this does is tell the parser that an if-statement consists of the T_IF token, followed by a '(', followed by a variable and closed with a ')'. The second line tells the parser that it could also be a T_IF token, followed by a '(' and a ')', which would be an empty if-statement (again, it doesn't make much sense, but suppose you support it).

When we are parsing the tokens, the parser will try to match the tokens against the grammar rules provided. If this were the only place inside our grammar where we use the T_IF token, it would mean that a T_IF token is ALWAYS followed by a '(', and after that there may be a variable OR a ')'. It's pretty easy to write those grammar rules, as most languages have a fairly small number of them.

Now, if we happen to have written "if $a", the tokenizer converts it into T_IF T_VARIABLE, but the parser will not find a rule that matches, since there is no rule matching a T_IF followed by a T_VARIABLE. At that point, it will throw a syntax error (or you can customize the message it returns, for instance: "Error: an if-statement must be followed by an opening parenthesis").

When the complete token stream is parsed correctly, you have pretty much created your first lint-checker: you can verify whether the source code is actually grammatically correct!

Creating the parser from our grammar rules is done automatically by a program called bison. Bison generates a C file from your grammar rules to make your life easier. Never ever write your own parser (unless you know what you are doing, and in that case, you won't write one anyway), but generate it with third-party tools like bison. There are others, like antlr and lemon. I've looked at them, but I decided on bison: just like flex, it might not be the best or the fastest tool, but both come with lots of information, documentation and tutorials, which makes them much easier to work with.

Creating an AST

Now that the code is tokenized and parsed grammatically, we need to figure out how to actually make it "work". How can we make sure that when I enter "$a = $a + 1", the variable $a is actually incremented? Unfortunately, we are still a long way from making that happen, since there are other steps that need to be taken care of. Once we have parsed our grammar, we convert our code into a so-called AST.

An Abstract Syntax Tree (AST) is a tree-like representation of the code that can easily be processed by a computer. A tree begins with a root node: the start of your program. In Saffire, a program consists of 2 parts (both optional): a list of use-statements, and "the rest". Once you have started with the rest (like defining classes, or doing other stuff), you cannot add use-statements anymore. For instance, this is a correct Saffire application:

use io;
use fooFramework as foo;
$a = 1;

But the following isn't:

use io;
$a = 1;
use fooFramework as foo;

This is all handled by the parser, which will throw a syntax error. If everything is all right, we get two branches in our AST root node: a use-list branch, and a top-list branch (the rest of the program). There are a few different kinds of nodes in Saffire at the moment (this is bound to change later, when more and hopefully knowledgeable people tell me I did it all wrong :p): we have a numericalType, a stringType and a variableType. If we have $a somewhere in our code, this will generate the token T_IDENTIFIER, which results in a variableNode with the value $a. There is a classNode, which defines a class (its modifiers: public, private, abstract, final, static, etc., the actual name, a list of extended classes, the interfaces it implements and the actual class body). We have an interfaceNode, a methodNode (which defines actual methods) and a very important oprNode: the operator node.

Suppose we have the code $a = 1;. This code would result in T_IDENTIFIER T_ASSIGN T_NUMERIC. There will be a grammar rule that allows this kind of assignment, and in the end an assignment is nothing more than a left-hand side, an operator and a right-hand side. In our AST, this would be represented by an oprNode with the value T_ASSIGN and 2 child nodes: on the left the T_IDENTIFIER (as a variableNode with value $a) and on the right the T_NUMERIC node (with value 1). Don't swap these two nodes, otherwise you would get 1 = $a, which obviously will not work.

Now, ASTs are very abstract, and since they can be complex and only reside in memory, they CAN be hard to debug while you are creating them. That's why I've created a simple tool that generates DOT files from the AST. They can be converted into PNGs with the Graphviz tool, which gives you a visual representation of the actual tree.

As an example: this is the output of a (very small) program:

You can imagine that ASTs can get very, VERY complex, but parsing them is quite easy: there are only a few types of nodes, and those are all you have to deal with. All the ordering - making sure that $a = 1 + 2 * 3 is evaluated in the correct order and such - is taken care of by the grammar.

Now, these DOT files are generated mainly for debugging purposes, but I figured they look nice anyway. This is why Saffire has the -dot argument, with which you can output the AST to a file, which you can then use to generate trees like the one above (the DOT output file contains information on how to generate PNGs from it).

What’s next?

So this is the point where Saffire currently is: from code to AST.

What we have to do next is build a system that takes the root node of the AST and interprets it. Since the first node is the root node (a special T_PROGRAM node), we know that it has 2 children: on the left a tree with our use-statements (or a NULL node when no use-statements are given), and on the right the rest of the program (or a NULL node when there is no program, only use-statements). We iterate over all nodes top-to-bottom, left-to-right, and if we interpret every node correctly, we have just created our Saffire interpreter! Now, it SOUNDS easy, and in a sense it is, but the idea is to make this FAST, compact and manageable, which will take much more time.

Because of the AST, there are many more things we can do: instead of interpreting (read a node and execute it), we could write a compiler (read a node, generate machine code for it, write it to a file). That way we would end up with an executable generated from a Saffire file (awesome!). Or maybe we could have something in between: a JIT compiler that reads nodes (or bunches of nodes) and generates machine code on the fly.

And maybe we can do even more: we can already output the tree as a graphic, but we could also generate Saffire code back from it (we would only lose formatting and comments, since they are not present in the AST, as they are stripped out by the tokenizer). The possibilities are endless.

Anyway, there are lots of things to do, but it's looking very promising. Don't get your hopes up too soon though: we WILL make it possible to run your Saffire scripts through FastCGI on apache, nginx etc., but we are still a long way from a 1.0 release. Until that time, keep an eye on the site (www.saffire-lang.org), join us on IRC (freenode, channel saffire) and discuss what you would like to see in the language. We're always open to ideas, and every bit of help is appreciated!