Writing a language
Tagged with: [ ast ] [ bison ] [ flex ] [ grammar ] [ lex ] [ saffire ] [ yacc ]
In the last blogpost I talked about a new language in the making. Unfortunately, writing a completely new language - from scratch - isn't easy and takes a fair bit of time. During this development process, I will try to blog a bit about the things we are developing, the problems we are facing and the solutions we are implementing. The current Saffire status: we are able to generate ASTs from Saffire source programs. If you have no clue what I'm talking about, no worries: this blogpost will try to explain it all.
On writing compilers
I've written a fair bit of complex code in my life, including an operating system capable of context switching, memory management, ext2/3, FAT16 and VFS capabilities. Not something that will make it to your desktop soon, but the basic and probably most complex parts of any operating system nevertheless. So I guess it's safe to say I know a thing or two about systems. Unfortunately, all that knowledge is pretty useless when writing a compiler (or at least, the most important bits of a compiler). But help is on its way: everybody who is into compiler development knows about THE book that you will need: the (purple) dragon book. If you're a student and use this book in class, you will probably hate it. Other people, like me, who never really went to school and learned it all "on the streets", love this book. It's a bit of dry theory, and you definitely have to put the book away once in a while to try to put it all into practice. It will make sense once you get the hang of it.
Now, something like this cannot be written without the help of some smart people who already know this stuff. So a big thank you to Richard van Velzen, who has helped me a lot with getting things up and running.
Defining your language
Before you can actually run an application in your language, you have to figure out how the language should work. This is the easiest step in the whole process, and the only limit is your own imagination. Just write down what your language should look like, what its rules are, and how you would do certain things. You will run into dead ends lots of times, since your idea only works for scenario X, but not for scenario Y. For instance, Saffire only supports methods. There are no global functions or procedures like you can find in other languages. This is done to support a more OO style of programming, and so we don't have people calling printf() or str_replace() all over the place. Everything is neatly grouped into classes and objects. But as a side effect, we are still not 100% sure how we would support anonymous functions, lambdas and closures. In the end, it's just a matter of writing down example code, discussing it with others who find flaws in your reasoning, and deciding on a solution. Sooner or later, this will end up in a language specification. This is something that is vital in order to create your compiler: if you have no idea what the language will look like, your compiler won't either. With Saffire, we are still in this process: we have decided on a lot of features and specifications, but a lot of things are left open. Luckily, this doesn't prevent us from continuing with the rest of the process, but sooner or later we need to have everything documented and thought out, in order to implement it.
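To give a feel for the methods-only idea, here is a tiny sketch. This is purely illustrative: the class/method syntax shown is my shorthand based on the description above, not finalized Saffire syntax.

```
// Illustrative only: this syntax is assumed, not final Saffire syntax.
// There is no global printf(); output goes through a method on a class.
class Console {
    public method print(message) {
        // ... implementation lives here ...
    }
}

Console.print("hello world");
```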
Phase 1: Tokenize your code
There is a reason why writing a compiler isn't easy, and it's the same reason why we implement things like coding standards when we deal with code that is written and/or read by others. If we do things a certain way, it makes it easier to process that information: this is how we define constants, this is how we write an if-else structure, and so on. For a compiler this works pretty much the same way: it needs to figure out what is what, and in order to do so, it will follow certain rules.
Now, the first part in all this is converting your source code into tokens. For example, there is a T_IF token (the T_ stands for token). So whenever a programmer has written if ($a), IF($A), if ($a), iF($a) or any other possible way to write an if-statement, that "if" part is replaced by the T_IF token. Some tokens are just single characters: the parentheses are tokens as well, T_PARENTHESIS_OPEN and T_PARENTHESIS_CLOSE for example. Things like variable names are also converted to tokens, which could be the T_VARIABLE token (the token itself can hold additional information; in this case, it would be the actual name of the variable, which would be $a).
The conversion of source code to tokens is done by a so-called lexer. There are multiple lexers out there, but we decided to use flex for this job. Flex is a pretty old system, but it works really well and has lots of documentation and tutorials, which makes it easier to get your system up and running. The tokenizer can also do things like removing comments and empty lines: things that make the code visually easier to read, but are of no use to the compiler.
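To make this concrete, here is a minimal flex sketch of the rules described above. It is a simplified illustration, not the actual Saffire lexer: the yylval fields and the parser.tab.h header name are assumptions about how the bison side would be wired up.

```
%{
#include <stdlib.h>
#include <string.h>
#include "parser.tab.h"   /* token constants, generated by bison (assumed name) */
%}

%option noyywrap case-insensitive

%%
"if"                      { return T_IF; }
"("                       { return T_PARENTHESIS_OPEN; }
")"                       { return T_PARENTHESIS_CLOSE; }
"{"                       { return T_CURLY_OPEN; }
"}"                       { return T_CURLY_CLOSE; }
\$[a-zA-Z_][a-zA-Z0-9_]*  { yylval.sVal = strdup(yytext); return T_VARIABLE; }
[0-9]+                    { yylval.lVal = atol(yytext); return T_NUMERIC; }
"//"[^\n]*                { /* comments produce no token at all */ }
[ \t\r\n]+                { /* whitespace is skipped as well */ }
%%
```

Note the case-insensitive option: that is what makes "if", "IF" and "iF" all end up as the same T_IF token.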
So, phase 1 is complete: the source code is translated into a stream of tokens.
Phase 2: Parsing
From this point on we don't have to worry about how a programmer has written his code. It's all converted into generic tokens which we can work with. But just as in a natural language like English or Dutch, randomly writing down words doesn't make a sentence that people can understand. The parser phase takes the stream of tokens and figures out whether those tokens actually make sense.
For instance, let’s assume we have a small bit of code:
if ($a){}
The lexer would translate this into tokens:
T_IF T_PARENTHESIS_OPEN T_VARIABLE T_PARENTHESIS_CLOSE T_CURLY_OPEN T_CURLY_CLOSE
But suppose we have written something like this:
)($a {if}
Our lexer will happily tokenize this into:
T_PARENTHESIS_CLOSE T_PARENTHESIS_OPEN T_VARIABLE T_CURLY_OPEN T_IF T_CURLY_CLOSE
Unless you are defining a REALLY complicated language, this probably isn't something you would want.
Now, as said: the parser will take this information and try to match it against grammar rules. For instance, we could have a grammar rule describing if-statements:
if-statement: T_IF T_PARENTHESIS_OPEN T_VARIABLE T_PARENTHESIS_CLOSE | T_IF T_PARENTHESIS_OPEN T_PARENTHESIS_CLOSE
What this does is tell the parser that an if-statement consists of the T_IF token, followed by a '(', followed by a variable and closed with a ')'. The second alternative tells the parser that it could also be a T_IF token, followed by a '(' and ')', which would be an empty if-statement (again, it doesn't make much sense, but suppose you support it). When we are parsing the tokens, the parser will try to match the tokens against the grammar rules provided. If this were the only place inside our grammar where we use the T_IF token, it means that a T_IF token is ALWAYS followed by a '(', and after that there might be a variable OR a ')'. It's pretty easy to write those grammar rules, as most languages have a fairly small number of them.
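In bison notation, such a rule might look like the sketch below. This is a stripped-down illustration (no actions, no statement body), not the actual Saffire grammar:

```
%token T_IF T_PARENTHESIS_OPEN T_PARENTHESIS_CLOSE T_VARIABLE

%%
if_statement:
      T_IF T_PARENTHESIS_OPEN T_VARIABLE T_PARENTHESIS_CLOSE
    | T_IF T_PARENTHESIS_OPEN T_PARENTHESIS_CLOSE     /* the "empty" variant */
    ;
%%
```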
Now, if we happen to have written "if $a", the tokenizer converts it into T_IF T_VARIABLE, but the parser will not find a rule that matches, since there is no rule that matches a T_IF followed by a T_VARIABLE. At that point, it will throw a syntax error (or you can customize the message it returns, for instance: "Error: an if-statement must be followed by an opening parenthesis").
When the complete token stream is parsed correctly, you have pretty much created your first lint-checker: you can verify that the source code is actually grammatically correct!
Creating these checks from our grammar rules is done automatically by a program called bison. Bison generates a C file from your grammar rules to make your life easier. Never ever write your own parser (unless you know what you are doing, and in that case, you won't write one by hand anyway), but generate it through third-party tools like bison. There are others, like antlr and lemon. I've looked at them, but I decided on bison: just like flex, it might not be the best or the fastest tool, but it comes with lots of information, documentation and tutorials, which makes it much easier to implement.
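Once flex and bison have generated their C files, wiring them together takes surprisingly little code. A minimal driver sketch, assuming bison's default yyparse() interface (which returns 0 on success):

```c
#include <stdio.h>

int yyparse(void);   /* generated by bison */

/* bison calls this on a syntax error; a real implementation would
   report line numbers and a friendlier message */
void yyerror(const char *msg) {
    fprintf(stderr, "syntax error: %s\n", msg);
}

int main(void) {
    /* a clean return from yyparse() means the whole token stream
       matched the grammar: an instant lint-check */
    if (yyparse() == 0) {
        printf("Syntax OK\n");
        return 0;
    }
    return 1;
}
```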
Creating an AST
Now that the code is tokenized and parsed grammatically, we need to figure out how to actually make it "work". How can we make sure that when I enter "$a = $a + 1", the variable $a is actually incremented? Unfortunately, we are still a long way from making that happen, since there are more steps that need to be taken care of. Once we have parsed our code against the grammar, we will convert it into a so-called AST.
An Abstract Syntax Tree (AST) is a tree-like representation of the code that can easily be processed by a computer. A tree begins with a root node; this is the start of your program. In Saffire, a program consists of two parts (both optional): a list of use-statements, and "the rest". Once you have started with the rest (like defining classes, or doing other stuff), you cannot add use-statements anymore. So a Saffire application that lists all its use-statements first and then defines its classes is correct, but the same program with a use-statement placed after a class definition isn't.
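A sketch of both cases; the exact use/class syntax here is assumed for illustration and may not match final Saffire syntax:

```
// Correct (hypothetical syntax): use-statements come first
use io;

class Foo {
}

// Incorrect: once "the rest" has started, a use-statement
// is no longer allowed and triggers a syntax error
class Foo {
}

use io;
```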
This is all handled by the parser, which will throw a syntax error in the second case. If everything is all right, we get two branches in our AST root node: a use-list branch, and a top-list branch (the rest of the program). There are a few different kinds of nodes in Saffire at the moment (this is bound to change later, when more and hopefully knowledgeable people tell me I did it all wrong :p): we have a numericalType, a stringType and a variableType. If we have $a somewhere in our code, this will generate the token T_IDENTIFIER, and this will result in a variableNode with the value $a. There is a classNode, which defines a class: its modifiers (public, private, abstract, final, static etc.), the actual name, a list of extended classes, the interfaces it implements and the actual class body. We have an interfaceNode, a methodNode (which defines actual methods) and a very important oprNode: the operator node.
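In C, such a node could be represented roughly like this. This is my own sketch of the idea, not Saffire's actual structures; the names simply mirror the node types mentioned above:

```c
typedef enum { typeNumerical, typeString, typeVariable, typeOpr } nodeEnum;

typedef struct node {
    nodeEnum type;
    union {
        long  numerical;            /* numericalType: the literal value   */
        char *string;               /* stringType: the literal text       */
        char *variable;             /* variableType: the name, e.g. "$a"  */
        struct {                    /* oprNode: an operator and children  */
            int oper;               /* the operator token, e.g. T_ASSIGN  */
            int nops;               /* number of child nodes              */
            struct node **ops;      /* the children, left to right        */
        } opr;
    } u;
} node;
```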
Suppose we have the code $a = 1;. This would result in T_IDENTIFIER T_ASSIGN T_NUMERIC. There will be a grammar rule that allows this kind of assignment, and in the end, an assignment is nothing more than a left-hand side, an operator and a right-hand side. In our AST, this would be represented by an oprNode with the value T_ASSIGN and two child nodes: on the left the T_IDENTIFIER (as an identifierNode with value $a), and on the right the T_NUMERIC node (with value 1). Don't swap these two nodes, otherwise you would get 1 = $a, which obviously will not work.
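In the bison grammar, the action for such a rule can build that node directly. A sketch, assuming helper functions like oprNode() and identifierNode() that allocate the node kinds described above:

```
assignment:
    T_IDENTIFIER T_ASSIGN expression
        {
            /* child 0: the left-hand side, child 1: the right-hand side;
               swapping them would indeed give you "1 = $a" */
            $$ = oprNode(T_ASSIGN, 2, identifierNode($1), $3);
        }
    ;
```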
Now, ASTs are very abstract, and since they can be complex and only reside in memory, they CAN be hard to debug while you are creating them. That's why I've created a simple tool that generates DOT files from the AST. They can be converted into PNGs with the Graphviz tool, which gives you a visual representation of the actual tree.
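For a feel of what such a DOT file contains, here is a hand-written sketch for $a = 1 (illustrative, not actual Saffire output):

```
digraph ast {
    node [shape=box];
    assign [label="oprNode: T_ASSIGN"];
    lhs    [label="identifierNode: $a"];
    rhs    [label="numericalNode: 1"];
    assign -> lhs;   // left child first ...
    assign -> rhs;   // ... then the right child
}
```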
As an example, this is the rendered output of a (very small) program: [figure: PNG rendering of the AST, generated via Graphviz from the DOT output]
You can imagine that ASTs can get very, VERY complex, but parsing them is quite easy: there are only a few types of nodes, and you only have to deal with those. All the ordering, making sure that $a = 1 + 2 * 3 is evaluated in the correct order and such, is taken care of by the grammar.
Now, these DOT files are generated mainly for debugging purposes, but I figured they look nice anyway. This is why Saffire has the -dot argument, with which you can output the AST to a file, which you can then use to generate trees like the one above (the DOT output file contains information on how to generate PNGs from it).
What’s next?
So this is the point where Saffire currently is: from code to AST.
What we have to do next is build a system that takes the root node of the AST and interprets it. Since the first node will be the root node (a special T_PROGRAM node), we know that it has two children: on the left, a tree with our use-statements (or a NULL-node when no use-statements are given), and on the right the rest of the program (or a NULL-node when no program is given, but only use-statements are present in the code). We iterate over all nodes top-to-bottom, left-to-right, and if we interpret every node correctly, we have just created our Saffire interpreter! Now, it SOUNDS easy, and in a sense it is, but the idea is to make this FAST, compact and manageable, which will take much more time.
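A minimal sketch of that top-to-bottom, left-to-right walk, reusing the node struct sketched earlier. The symbol-table helpers and the T_ADD case are assumptions; a real interpreter dispatches on many more node and operator types:

```c
/* assumed symbol-table helpers, defined elsewhere */
long lookup_variable(const char *name);
long store_variable(const char *name, long value);

long interpret(node *n) {
    if (n == NULL)                       /* NULL-node: nothing to do */
        return 0;

    switch (n->type) {
    case typeNumerical:
        return n->u.numerical;
    case typeVariable:
        return lookup_variable(n->u.variable);
    case typeOpr:
        switch (n->u.opr.oper) {
        case T_ASSIGN:                   /* left child receives the value of the right */
            return store_variable(n->u.opr.ops[0]->u.variable,
                                  interpret(n->u.opr.ops[1]));
        case T_ADD:                      /* assumed token for '+' */
            return interpret(n->u.opr.ops[0]) + interpret(n->u.opr.ops[1]);
        /* ... every other operator gets its own case ... */
        }
    }
    return 0;
}
```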
Because of the AST, there is much more we can do: instead of interpreting (read a node and execute it), we could write a compiler (read a node, generate machine code for it, write it to a file). That way we would end up with an executable generated from a saffire-file (awesome!). Or maybe we could have something in between: a JIT compiler that reads nodes (or bunches of nodes) and generates machine code on the fly.
And maybe we can do even more: we can already output the tree as a graphic, but we could also generate saffire-code back from it (we would only lose formatting and comments, since they are not present in the AST, as they are stripped off by the tokenizer). The possibilities are endless.
Anyway, there are lots of things to do, but it's looking very promising. Don't get your hopes up just yet though: we WILL make it possible to run your saffire scripts through fastCGI on apache, nginx etc., but we are still a long way from a 1.0 release. Until that time, keep an eye on the site (www.saffire-lang.org), join us on IRC (freenode, channel saffire) and discuss what you would like to see in the language. We're always open to ideas, and every bit of help is appreciated!