A phrase synonymous with the first program you write in a programming language that you haven't used before. But what if it was the first output of a language that you wrote?

If you've never stopped to think about how exactly the complex syntax of your favourite language ultimately does your bidding, maybe now is the time. It might be simpler than you think. The same toolkit is available in almost any language and can be used at a much smaller scale to create simple syntaxes for all sorts of other things.

Let's start by confusing the heck out of you, and work backwards from there…

++++++++[>++++[>++>+++>+++>+<<<<-]>+>+>->>+[<]<-]>>.>---.+++++++..+++.>>.<-.<.+++.------.--------.>>+.>++.

Brainfuck

Yes, it's a real language and it has been around since 1993. Originally created by Urban Müller it's not meant to be used for practical purposes, it's believed his original intention was to create the smallest possible compiler for a programming language. Many compilers have been written that are less than 200 bytes, and at least one compiler is less than 100 bytes!

Brainfuck works by modifying and traversing bytes in memory. It can be used with any amount of bytes, enough to satisfy as many as you need to run the program. Is is made up of only 8 symbols, each of which is a character:

  • < and > move the position of the current byte left and right respectively.
  • + and - increment and decrement the value of the current byte.
  • . will print the current byte (as a character)
  • , will allow the input of a byte to the current location. We will not implement this in the example but it would be very easy to modify the source to allow this.
  • [ and ] are used as loops. The loop exits when the current byte is 0. Loops can be nested, this makes it perfect to demonstrate the AST tree later on.

What's About to Happen

Turns out, there is actually another cool use for this seemingly amusing language; demonstrating how to create a language interpreter from end to end. Even for the most complicated languages they can really be broken down into 4 stages, each of which I will cover in this article.

  1. The lexer splits the source code into a stream of indivisible pieces called tokens.
  2. The tokens run through a parser which outlines rules of the order in which the tokens can be expected and generates an AST.
  3. The AST is the physical (and usually hierarchal) representation of the source code that can now be traversed.
  4. Traversing is done by the compiler to generate the final product, or immediately run by an interpreter, like in this case.

No matter which host language you use (that's the language your writing the parser generator in) those 4 steps will remain the same and even most of the intricacies will even be the same too. I will use Python since it's super easy to read and understand.

There are different algorithms for language parsers that each have their advantages, disadvantages and performance concerns. I won't dive into the details of that for this article. I will be using PLY which is an LR-parser. You don't need to know more than that for the purpose of this article.

1. Lexer

This is the easiest stage and the least likely to change over time. In a nutshell the lexer uses regular expressions to divide the source code into tokens. Here are some tokens with their respective regular expressions:

KEYWORD_VAR = "var"
INTEGER = "\-?[0-9]+"
IDENTIFIER = "[a-zA-Z0-9]+"
OP_PLUS = "\+"
OP_COLON = ";"
OP_ASSIGN = "="

Using the above tokens to parse the following JavaScript:

var a = 3 + 2;

Would generate the tokens:

KEYWORD_VAR IDENTIFIER OP_ASSIGN INTEGER OP_PLUS INTEGER COLON

We not only need the token type, but also the value of the original token. So the INTEGER token must also include the characters that were matched. For example, 3. The parser generator you will be using will take care of this for you.

Here is the definitions of all our tokens in PLY:

tokens = (
'INCREMENT',
'DECREMENT',
'SHIFT_LEFT',
'SHIFT_RIGHT',
'OUTPUT',
'INPUT',
'OPEN_LOOP',
'CLOSE_LOOP',
)

t_INCREMENT = r'\+'
t_DECREMENT = r'-'
t_SHIFT_LEFT = r'<'
t_SHIFT_RIGHT = r'>'
t_OUTPUT = r'\.'
t_INPUT = r','
t_OPEN_LOOP = r'\['
t_CLOSE_LOOP = r'\]'

This is a lexer in the purest sense, the rules can get more complicated if you chose to do translation of tokens or their values while lexing but usually that's often for handling edge cases where the regular expression may conflict with one another or you need to translate internal token types.

One example of a token that has attached logic is the newline character. This allows you to report the line number when an error occurs. PLY makes it easy to connect a token with some arbitrary code by defining a function where the first line represents the regular expression and then any custom code following that:

def t_newline(t):
r'\n+'
t.lexer.lineno += len(t.value)

The lexer will expect that every character perfectly fits into a continuous string of regular expressions. If that doesn't happen an error will be throw, this includes whitespace. It would be a pain to incorporate whitespace into each of the tokens so instead we can use t_ignore to specify any characters that will always be ignored.

t_ignore = ' \t'

Brainfuck should really ignore any character that isn't one of the 8 specified above, but for demonstrating I'll just ignore spaces. Be careful not to ignore the newline as it need to be consumed by the t_newline token.

Finally, if all else fails we need an error handler:

def t_error(t):
print("Illegal character '%s'" % t.value[0])
t.lexer.skip(1)

2. Parser

The parser is where we can define the actual syntax of the language. We use the tokens specified in the lexer to outline the sequence for each rule in the form of:

rule_name : token1 token2 ...

For example, to define a variable in JavaScript you might use:

define_variable : KEYWORD_VAR IDENTIFIER COLON

Notice that the tokens are in upper case but the parser rules are in lower case. Some parser regenerators enforce this (like PLY) and some don't. But it's a good idea anyway because in a more complex language it makes it much clearer to read when your mixing tokens and other rules.

You could also say that a variable in JavaScript could be defined with a default value (let's just pretend for a moment that JavaScript only has integers):

define_variable : KEYWORD_VAR IDENTIFIER OP_ASSIGN INTEGER COLON

We really classify these are different versions of the same rule so it can be collapsed together by using the | operator to represent and OR:

define_variable : KEYWORD_VAR IDENTIFIER COLON
| KEYWORD_VAR IDENTIFIER OP_ASSIGN INTEGER COLON

Most parsers are very simple, so you cannot treat | like a logical operator. It simply allow separate linear sequences of tokens to match the same parser rule.

All the parser rules are defined before the parser starts running. Just like how custom tokens are represented as functions, the same is true for the parser rules. Once a rule is matched it will invoke the function for that rule. The first parameter contains an array of tokens for the rule that matched. This is especially important if you use | to know which rule actually matched.

A command would represent an action, it's not just a single character command but also a loop. So +[<>]- would be three commands since the whole [<>] will be ingested by the loop rule:

def p_command(p):
"""
command : INCREMENT
| DECREMENT
| SHIFT_LEFT
| SHIFT_RIGHT
| OUTPUT
| INPUT
| loop
"""
if isinstance(p[1], str):
p[0] = Command(p[1])
else:
p[0] = p[1]

Keeping the state (effectively the return) is extremely important when dealing with a hierarchal syntax. Rather than returning directly from the function we use the token at index 0 to be the return value. For example, for the following rule:

define_variable : KEYWORD_VAR IDENTIFIER COLON

token[0] would be null, token[1] would be KEYWORD_VAR, token[2] would be IDENTIFIER, etc. The total amount of tokens includes the 0th token, so even though the rule is made up of 3 tokens, the len(p) is 4.

The loop rule looks like:

def p_loop(p):
"""
loop : OPEN_LOOP commands CLOSE_LOOP
"""
p[0] = Loop(p[2])

The commands rule (not to be confused with the command rule) is to match one or more commands. Since all rules are linear sequences of tokens, they can't be repeated, as such. This is one way to represent a list of commands of one or more times. Initially it is confusing, it's easiest to explain with an example. Let say we have the following tokens:

+ + > -

Is reduced by:

  1. commands command -> (+ + >) -
  2. (commands command) command -> ((+ +) >) -
  3. ((command command) command) command -> (((+) +) >) -

This type of parsing is especially useful for operator precedence... which is beyond the scope of this article.

It seems counter intuitive, but the parser actually works from bottom up (most nested elements first, meaning it starts with the 3rd rule above) so we use len(p) == 2 to check if the rule matched is command (since len(p) will be 1 greater than the real amount of tokens) and create the base instance Commands and add the first command. Further rules will be able to append the command to the original Commands instance and keep passing the instance through p[0] for the next layer to pick up.

def p_commands(p):
"""
commands : command
| commands command
"""
if len(p) == 2:
p[0] = Commands()
p[0].commands = [p[1]]
return

if not p[1]:
p[1] = Commands()

p[1].commands.append(p[2])
p[0] = p[1]

Just like with the lexer, the parser has a default error handler if it comes across a sequence of tokens that do no match any rules. It will be up to you to decide if the parser should try and recover or simply abort with a message:

def p_error(p):
print("Syntax error in input!")

3. AST

The Abstract Syntax Tree is the code representation of the structure of the source code (program). That is to say if you has the following expression in JavaScript:

2 * 3 + 4 * 5

With the correct operator precedence would generate a combination of nested instances like:

new Add(new Multiply(2, 3), new Mutliply(4, 5))

For Brainfuck we need only 3 classes to represent the structure of the application:

  1. Commands holds an array of Command.
  2. Command is a single character action, or a Loop instance.
  3. A Loop holds a reference to a Commands for the actions contained in the loop.

Since Loop contains a reference to a Commands, the AST generated could be nested as deeply as needed for each nested loop.

class Commands:
def __init__(self):
self.commands = []

def run(self, program):
for command in self.commands:
command.run(program)

def __str__(self):
return ''.join([str(command) for command in self.commands])

class Command:
def __init__(self, command):
self.command = command

def run(self, program):
if isinstance(self.command, Loop):
self.command.run(program)

if self.command == '+':
program.data[program.location] += 1
if self.command == '-':
program.data[program.location] -= 1
if self.command == '<':
program.location -= 1
if self.command == '>':
program.location += 1
if self.command == '.':
sys.stdout.write(chr(program.data[program.location]))

def __str__(self):
return self.command

class Loop:
def __init__(self, commands):
self.commands = commands

def run(self, program):
while program.data[program.location] != 0:
self.commands.run(program)

def __str__(self):
return '[' + str(self.commands) + ']'

You may notice that the AST classes very closely resemble their parser rules. This is very common, although a single parser rule may output or mutate any type of object.

Included with each of the AST classes is a __str__ method which means you could render the entire program back to one continuous string. If there were spaces and other characters that were ignored they would not appear in this output.

4. Compiler or Interpreter

The interpreter can now traverse the AST instances however it likes, depending on your application logic, of course. It's a good idea to try and use the most out of the parser before you even get to this stage, such as covering error conditions, edge cases and special syntaxes.

In a more complicated parser the AST classes would simply represent the source, but contain no other logic. In this example I've used the same classes to actually run themselves since all commands simply mutate the bytes of memory of the application.



One final class puts it all together by running the source through the lexer and parser, creating the simulated RAM (all 20 bytes of it), setting the starting position to0 (the first byte) and eventually invoking the chain of actions on the AST instances themselves:

class BrainfuckProgram:
def __init__(self, source):
self.source = source

def run(self):
self.data = [0] * 20
self.location = 0
commands = self.parse(self.source)
commands.run(self)

def parse(self, source):
lexer = lex.lex()
parser = yacc.yacc()
return parser.parse(source)

def __str__(self):
return str(self.parse(self.source))

Now we can invoke the runner:

source = '++++++++[>++++[>++>+++>+++>+<<<<-]>+>+>->>+[<]<-]>>.>---.+++++++..+++.>>.<-.<.+++.------.--------.>>+.>++.'
program = BrainfuckProgram(source)
program.run()

Using the complete program in this Gist, you can now run it!

$ python brainfuck.py
Hello World!