Sunday, January 29, 2017

A simple language with indentation based blocks. Part 1: Parsing combinators

In this post, I'm going to start with the implementation of the parser using the F# programming language. There are several tools for creating parsers for example FParsec http://www.quanttec.com/fparsec/ or FsYacc/FsLex https://en.wikibooks.org/wiki/F_Sharp_Programming/Lexing_and_,arsing. However for this experiment I wanted to create a small set of simple parser combinators, just to learn more about them.

The contents of this post is loosely based on the definition of the Parser Monad . There are many great articles on the net about them. One of this papers is the incredibly nice Monadic Parsing in Haskell http://www.cs.nott.ac.uk/~pszgmh/pearl.pdf .

Let's start by defining the type of the state of our parser:

   type ReaderState = { Data : string;
                        Position: int;
                        Indentation: int list}

This state contains the following:

  • Data : the input string containing the program to parse
  • Position : the current position of the parser inside the Data string
  • Indentation : an indentation stack (more on that in future posts)

And now we're going to define the type of a "parser" :

ReaderState ->  (('a * ReaderState) option )

This means that a "parser" is a function that takes the reader state and returns a tuple of some parsed value ('a) and the new parsing state. Notice that parsers could fail, so the result it's wrapped in an option type https://docs.microsoft.com/en-us/dotnet/articles/fsharp/language-reference/options.

A very simple example of a parser is the following:

   let readChar(state : ReaderState) : (string * ReaderState) option =
       if state.Data.Length > state.Position then
          Some (state.Data.[state.Position].ToString(),
                { state with Position = state.Position + 1 } )
       else
          None

We can test this parser using the following function:

   let parse (input : string) (parser : (ReaderState ->  (('a * ReaderState) option ))) =
       parser {  Data = input ; Position = 0; Indentation = [0] }

For example:

> parse "hello" readChar;;        
val it : (string * ReaderState) option = Some ("h", {Data = "hello";
                                                     Position = 1;
                                                     Indentation = [0];})

Here we can see that the result is a successful optional Some with a string with the single recognized char . Also part of the result is the new reader state .

We can also specify operations for several characters for example:

   let readZeroOrMoreChars (pred:char -> bool) (state : ReaderState) : (string * ReaderState) option =
       let
         secondPosition = readingChar state.Position pred state.Data
         in
           Some ( state.Data.Substring(state.Position,
                                       secondPosition - state.Position),
                 { state with Position = secondPosition })

We can use this function as follows:

> parse "1231abc" (readZeroOrMoreChars System.Char.IsDigit);;
val it : (string * ReaderState) option = Some ("1231", {Data = "1231abc";
                                                        Position = 4;
                                                        Indentation = [0];})

Using this function we can define a parser for whitespace:

   let whitespace = readZeroOrMoreChars System.Char.IsWhiteSpace

Connecting parsers

The parsers I've been showing so far look very simple, and they are!. The goal is to have small parsers and combine them to create more interesting ones. The next operation is called concatParsers2 and it's goal is create a new parser that will execute two parsers in sequence. That is, apply the first one and with the resulting state, execute the second one.

For example, the following parser for recognizing symbols use the readWithConditionOnChar parser with the readZeroOrMoreChars in order to create a PSymbol with the recognized element:

   let symbol =
       concatParsers2
          (readWithConditionOnChar  (fun c -> System.Char.IsLetter(c, 0)))      
          (fun initialChar ->
               concatParsers2
                  (readZeroOrMoreChars (fun c -> System.Char.IsLetter(c) || System.Char.IsDigit(c)))
                  (fun suffixString -> (preturn (PSymbol (initialChar + suffixString))))
           )

In our little language a symbol is represented by :

  • a letter
  • zero or more letters or numbers

We can use the readWithConditionOnChar and readZeroOrMoreChars to archive these goals, for example:

> parse "hello" (readWithConditionOnChar  (fun c -> System.Char.IsLetter(c, 0)))- ;; 
val it : (string * ReaderState) option = Some ("h", {Data = "hello";
                                                     Position = 1;
                                                     Indentation = [0];})
> parse "hi1and2and3" (readZeroOrMoreChars (fun c -> System.Char.IsLetter(c) || - System.Char.IsDigit(c)));;
val it : (string * ReaderState) option =
  Some ("hi1and2and3", {Data = "hi1and2and3";
                        Position = 11;
                        Indentation = [0];})

Notice that we cannot use just readZeroOrMoreChars since it will recognize symbols starting with numbers (like '13hello').

We can create a small parser for recognizing a letter (readWithConditionOnChar) and we can use readZeroOrMoreChars to read the rest. We could combine these two blocks to get another parser. The key for composing our parsers is the concatParsers2 function. This function has the following signature:

> concatParsers2;;
val it :
  ((ReaderState -> ('a * ReaderState) option) ->
     ('a -> ReaderState -> ('b * ReaderState) option) -> ReaderState ->
     ('b * ReaderState) option) = <fun:clo@21-1>
                     

This signature means that concatParsers2 receives a parser that produces a value of type 'a. The second argument is a function that receives a value of type 'a (the result of the first parser) and returns a parser of another type 'b. The result of concatParsers2 is itself another parser that produces a value of type 'b.

The implementation looks like this:

   let concatParsers2 (parser1 : (ReaderState ->  (('a * ReaderState) option )))
                      (parser2N : ('a ->  (ReaderState ->  (('b * ReaderState) option )))) =
       fun input -> match (parser1 input) with
                    | Some (matchedTxt, restState) -> (parser2N matchedTxt) restState
                    | _ -> None

This small function lets us create parsers using small blocks. But before we do that, we need to create another useful operation.

let preturn aValue (state : ReaderState ) = Some (aValue, state)

This operation is very simple, but really important!. It lets us prepare a result. By taking a look at the signature, we can see that it matches our parser signature. We can now used it in conjunction with concatParsers2 to produce results:

>  parse "AXXXC" 
-        (concatParsers2 recognizeABC                     
-                      (fun firstChar ->
-                           (concatParsers2
-                                recognizeXs              
-                                (fun xs -> 
-                                     (concatParsers2
-                                         recognizeABC
-                                         (fun lastChar ->
-                                              preturn (firstChar, xs, lastChar)- ))))));; 
val it : ((string * string * string) * ReaderState) option =
  Some (("A", "XXX", "C"), {Data = "AXXXC";
                            Position = 5;
                            Indentation = [0];})

Given that we define recognizeABC and recognizeXs as follows:

> let recognizeABC = readWithConditionOnChar (fun c -> c = "A" || c = "B" || c= "C");;

val recognizeABC : (ReaderState -> (string * ReaderState) option)

> let recognizeXs = readZeroOrMoreChars (fun c -> c = 'X' ) ;;    

val recognizeXs : (ReaderState -> (string * ReaderState) option)

Using the concatParsers2 function seems a little verbose. We can borrow inspiration from the Haskell's Monad type class https://hackage.haskell.org/package/base-4.9.1.0/docs/Control-Monad.html#t:Monad by defining versions of the >>= operator as follows:

   let inline (>>=) (parser1 : (ReaderState ->  (('a * ReaderState) option )))
                    (parser2N : ('a ->  (ReaderState ->  (('b * ReaderState) option )))) = concatParsers2 parser1 parser2N

Now using this operator we can write:

parse "AXXXC" ( recognizeABC >>= (fun firstChar -> 
                recognizeXs  >>= (fun xs -> 
                recognizeABC >>= (fun lastChar ->
                preturn (firstChar, xs, lastChar)))))

Ignoring output from previous parsers

Another useful operation allows us to connect two parsers but

   let concatParsers (parser1 : (ReaderState ->  (('a * ReaderState) option )))
                     (parser2 : (ReaderState ->  (('b * ReaderState) option ))) =
       fun input -> match (parser1 input) with
                    | Some (_, restState) -> parser2 restState
                    | _ -> None

Notice that it's similar to the concatParsers2 but it ignores the result of parser1. This is useful for discarting whitespace:

let whitespaceNoNl =
    readZeroOrMoreChars
        (fun c -> System.Char.IsWhiteSpace(c) && c <> '\n')

let colon  =
    concatParsers whitespaceNoNl (readSpecificChar ':')
    

As with the concatParsers2 we can use another well known operator for this operation:

   let inline (>>) (parser1 : (ReaderState ->  (('a * ReaderState) option )))
                   (parser2 : (ReaderState ->  (('b * ReaderState) option ))) = concatParsers parser1 parser2

Useful parsing operations

We need to create more operations that help us create complex parsers. For example, we can create a small operation for optional elements:

   let optionalP (parser : (ReaderState ->  (('a * ReaderState) option ))) (defaultValue:'a) =
       fun input -> match (parser input) with
                    | result & Some _ -> result
                    | _ -> (Some (defaultValue, input))

We can use this new operator to parse numbers as follows:

   let number =
              ( (optionalP (readSpecificChar '-') "") >>= (fun neg -> 
                digitP  >>= (fun firstChar -> 
                (readZeroOrMoreChars (fun c ->  System.Char.IsDigit(c))) >>= (fun chars ->
                (optionalP decimalPartP "") >>= (fun dec ->                                                                               
                preturn (PNumber (neg + firstChar + chars + dec))))

Choosing from two options

The following operation helps us to try to use one parser and fallback to another if it didn't work:

   let disjParser (parser1 : (ReaderState ->  (('a * ReaderState) option )))
                  (parser2 : (ReaderState ->  (('a * ReaderState) option ))) =
       fun input -> match (parser1 input) with
                    | result & Some _ -> result
                    | _ -> parser2 input

A simple scenario using this parser looks like this:

> parse "39" (disjParser symbol number);; 
val it : (PExpr * ReaderState) option =
  Some (PNumber "39", {Data = "39";
                       Position = 2;
                       Indentation = [0];})
> parse "hello" (disjParser symbol number);;
val it : (PExpr * ReaderState) option =
  Some (PSymbol "hello", {Data = "hello";
                          Position = 5;
                          Indentation = [0];})

Repetitions

Identifying sequences of elements identified by a parser is very useful. We can define the following operations

   let rec oneOrMore parser accumulated =
       parser >>=
           (fun lastResult ->
              let result = lastResult::accumulated
              in disjParser (oneOrMore parser result) (preturn (List.rev result)))

   let zeroOrMore parser accumulated =
       disjParser (oneOrMore parser accumulated) (preturn accumulated)

A simple example of using this operation looks like this:

> parse "1   2 3  4" (zeroOrMore (whitespace >> number) []);;
val it : (PExpr list * ReaderState) option =
  Some
    ([PNumber "1"; PNumber "2"; PNumber "3"; PNumber "4"],
     {Data = "1   2 3  4";
      Position = 10;
      Indentation = [0];})

The next post will show how to create more interesting things with the pieces we have so far.

A simple language with indentation based blocks (part 0)

One of the first things I noticed when reading about Python, is the use of indentation for defining code blocks. For example:

def foo(x):
   while x < 20:
      if x > 10:
         print "argument is greater than 10"
         print "value: " + x
      else:
         print "the argument is less than or equal to 10"
         print "its value is : " + x

If you're only familiar to C-based language this seems a bit strange. Not using characters like '{}' or words like BEGIN/END seems fragile at first. For example one could expect something like:

def foo(x) {
   while x < 20 {
      if x > 10 {
         print "argument is greater than 10"
         print "value: " + x
      } else {
         print "the argument is less than or equal to 10"
         print "its value is : " + x
   }
}

I've always wondered how do you create a parser for this kind of a language. In the following series of posts I'm going to record my experiences writing a parser for a simple little language. This language will use a style similar to Python. I'm going to write the code using F# .

Posts

  1. Parsing combinators