Saturday, March 25, 2017

A simple language with indentation based blocks. Part 3: Blocks

In this post we're going to add support for statements containing blocks which is the main goal of these series of posts.

Identifying blocks by using indentation

The technique we're going to use to identify blocks is based the following description from the (excellent) Python documentation:

Before the first line of the file is read, a single zero is pushed on the stack; this will never be popped off again. The numbers pushed on the stack will always be strictly increasing from bottom to top. At the beginning of each logical line, the line's indentation level is compared to the top of the stack. If it is equal, nothing happens. If it is larger, it is pushed on the stack, and one INDENT token is generated. If it is smaller, it must be one of the numbers occurring on the stack; all numbers on the stack that are larger are popped off, and for each number popped off a DEDENT token is generated. At the end of the file, a DEDENT token is generated for each number remaining on the stack that is larger than zero.

This section was taken from: https://docs.python.org/3/reference/lexical_analysis.html#indentation

The IDENT and DEDENT tokens act as begin/end block indicators. They have the same function as '{' and '}' in C-based languages. The following Python grammar fragment shows how this tokens are used.

...
suite: simple_stmt | NEWLINE IDENT stmt+ DEDENT
...
if_stmt: 'if' test ':' suite ...
...

Since our little parser uses parsing combinators without a tokenizer, we need to adjust this strategy a little bit. We're going to jump right ahead to show how we define the parser for the if statement.

   let pBlock =
       newlines   >>
       indent     >>
       pStatement >>= (fun firstStat ->
       (zeroOrMore
           (newlines  >>
            indented  >>
            pStatement) [])  >>= (fun restStats ->
       dedent      >> preturn (firstStat::restStats)))

   let ifParser =
       ifKeyword   >>
       pExpression >>= (fun expr ->
       colon       >>
       pBlock      >>= (fun block ->
       (optionalP pElse []) >>= (fun optElseBlockStats ->
                     preturn (PIf(expr,
                                  block,
                                  (match optElseBlockStats with
                                   | [] -> None
                                   | elements -> Some elements))))))

Here's an example of the generated AST for a simple if statement:

> parse "if x:
-           print(x)
-           return x*x
- " ifParser;;
val it : (PStat * ReaderState) option =
  Some
    (PIf
       (PSymbol "x",
        [PCallStat (PCall ("print",[PSymbol "x"]));
         PReturn (PBinaryOperation (Times,PSymbol "x",PSymbol "x"))],null),
     {Data = "if x:
          print(x)
          return x*x
";
      Position = 45;
      Indentation = [0];})

The key element here is the definition of the pBlock parser. In the definition of this parser we try to emulate the same strategy as the Python grammar definition:

   let pBlock =
       newlines   >>
       indent     >>
       pStatement >>= (fun firstStat ->
       (zeroOrMore
           (newlines  >>
            indented  >>
            pStatement) [])  >>= (fun restStats ->
       dedent      >> preturn (firstStat::restStats)))

We define indent, dedent and indented as follows:

   let indentation =
          pGetIndentation >>=(fun indentation ->                          
                (readZeroOrMoreChars (fun c -> c = ' '))
                >>=
                (fun spaces ->
                   match (spaces.Length,
                          indentation) with
                   | (lessThan, top::_) when lessThan > top ->
                       (pSetIndentation lessThan) >> (preturn "INDENT")
                   | (lessThan, top::_) when lessThan = top ->
preturn "INDENTED"
                   | (identifiedIndentation, top::rest) ->
                          (pSetFullIndentation rest) >> (preturn "DEDENT")
                   | _ -> pfail))

   let indent = indentation >>= (fun result -> if result = "INDENT" then preturn result else pfail)
   let dedent = indentation >>= (fun result -> if result = "DEDENT" then preturn result else pfail)
   let indented = indentation >>= (fun result -> if result = "INDENTED" then preturn result else pfail)

We define each of these indentation elements as:

  • INDENTED verifies that the expected indentation level is present. It consumes this indentation. For example, if expecting 3 space indentation, it will consume and verify that there are three spaces from the current position.
  • INDENT Verifies that there's a new indentation level can be found from the current position. It changes the 'Indentation' value in the current reader state.
  • DEDENT decreases the expected indentation level from the reader.

We can test these combinators to see how they affect the state of the parser.

We start with the following fragment:

> let code = "if x:
  if y:
     return 1
  else:
     return 2
"

We can test parts of our complex parsers by specifying just pieces of others. For example let's get to the then part of the parser.

> parse code (ifKeyword >> pExpression >> colon >> newlines >> indent);;
val it : (string * ReaderState) option =
  Some
    ("INDENT", {Data = "if x:
  if y:
     return 1
  else:
     return 2
";
                Position = 8;
                Indentation = [2; 0];})

As you can see, after executing these parsers we get to position 8. For example:

> code.Substring(8);;
val it : string = "if y:
     return 1
  else:
     return 2
"

Also the indentation stack now has 2 and 0. If we execute these series of parsers again to get to the then section of the first nested if we get:

> parse code (ifKeyword >> pExpression >> colon >> newlines >> indent >>
-             ifKeyword >> pExpression >> colon >> newlines >> indent);;
val it : (string * ReaderState) option =
  Some
    ("INDENT", {Data = "if x:
  if y:
     return 1
  else:
     return 2
";
                Position = 19;
                Indentation = [5; 2; 0];})

Now we have three indentation levels on our stack and we got to position 19.

Blocks

Just as the Python grammar reuses suite production, we can reuse the pBlock parser to define other kinds of statements. For example we can define the while statement as follows:

   let whileParser =
        whileKeyword    >>
        pExpression     >>= (fun condition ->
        colon           >>
        pBlock          >>= (fun block ->
        preturn (PWhile(condition, block))))

Now we can define more interesting statements such as:

val code : string = "if x > 0:
  while x < 10:
    print(x)
    x := x + 1
"

> parse code pStatement;;
val it : (PStat * ReaderState) option =
  Some
    (PIf
       (PBinaryOperation (Gt,PSymbol "x",PNumber "0"),
        [PWhile
           (PBinaryOperation (Lt,PSymbol "x",PNumber "10"),
            [PCallStat (PCall ("print",[PSymbol "x"]));
             PAssignStat
               (PSymbol "x",PBinaryOperation (Plus,PSymbol "x",PNumber "1"))])],
        null),
     {Data = "if x > 0:
  while x < 10:
    print(x)
    x := x + 1
";
      Position = 53;
      Indentation = [0];})

Next steps

Now that we can define blocks we can focus in other parts of our little language parser. One thing we can definetly improve is error messages. For example, let's try to parse the following fragment:

> parse "notanif x y z" ifParser;;
val it : (PStat * ReaderState) option = None

Here's where the Option<'a> type is not that useful since it doesn't allow you to specify a failure value such as a error message or location.

Another thing that we need to implement is comments. Our parser combinators work directly with the text. Because of this it will be interesting to see to introduce comments support without changing all our parsers.

This is part #3 of an ongoing series of posts on building a parser for a small language. The first post can be found here: http://langexplr.blogspot.com/2017/01/a-simple-language-with-indentation.html.