## Sunday, January 29, 2017

### A simple language with indentation based blocks. Part 1: Parsing combinators

In this post, I'm going to start with the implementation of the parser using the F# programming language. There are several tools for creating parsers for example FParsec http://www.quanttec.com/fparsec/ or FsYacc/FsLex https://en.wikibooks.org/wiki/F_Sharp_Programming/Lexing_and_,arsing. However for this experiment I wanted to create a small set of simple parser combinators, just to learn more about them.

The contents of this post is loosely based on the definition of the Parser Monad . There are many great articles on the net about them. One of this papers is the incredibly nice Monadic Parsing in Haskell http://www.cs.nott.ac.uk/~pszgmh/pearl.pdf .

Let's start by defining the type of the state of our parser:

``````   type ReaderState = { Data : string;
Position: int;
Indentation: int list}``````

This state contains the following:

• Data : the input string containing the program to parse
• Position : the current position of the parser inside the Data string
• Indentation : an indentation stack (more on that in future posts)

And now we're going to define the type of a "parser" :

``ReaderState ->  (('a * ReaderState) option )``

This means that a "parser" is a function that takes the reader state and returns a tuple of some parsed value (`'a`) and the new parsing state. Notice that parsers could fail, so the result it's wrapped in an `option` type https://docs.microsoft.com/en-us/dotnet/articles/fsharp/language-reference/options.

A very simple example of a parser is the following:

``````   let readChar(state : ReaderState) : (string * ReaderState) option =
if state.Data.Length > state.Position then
Some (state.Data.[state.Position].ToString(),
{ state with Position = state.Position + 1 } )
else
None``````

We can test this parser using the following function:

``````   let parse (input : string) (parser : (ReaderState ->  (('a * ReaderState) option ))) =
parser {  Data = input ; Position = 0; Indentation =  }``````

For example:

``````> parse "hello" readChar;;
val it : (string * ReaderState) option = Some ("h", {Data = "hello";
Position = 1;
Indentation = ;})``````

Here we can see that the result is a successful optional `Some` with a string with the single recognized char . Also part of the result is the new reader state .

We can also specify operations for several characters for example:

``````   let readZeroOrMoreChars (pred:char -> bool) (state : ReaderState) : (string * ReaderState) option =
let
secondPosition = readingChar state.Position pred state.Data
in
Some ( state.Data.Substring(state.Position,
secondPosition - state.Position),
{ state with Position = secondPosition })``````

We can use this function as follows:

``````> parse "1231abc" (readZeroOrMoreChars System.Char.IsDigit);;
val it : (string * ReaderState) option = Some ("1231", {Data = "1231abc";
Position = 4;
Indentation = ;})``````

Using this function we can define a parser for whitespace:

``   let whitespace = readZeroOrMoreChars System.Char.IsWhiteSpace``

## Connecting parsers

The parsers I've been showing so far look very simple, and they are!. The goal is to have small parsers and combine them to create more interesting ones. The next operation is called `concatParsers2` and it's goal is create a new parser that will execute two parsers in sequence. That is, apply the first one and with the resulting state, execute the second one.

For example, the following parser for recognizing symbols use the `readWithConditionOnChar` parser with the `readZeroOrMoreChars` in order to create a `PSymbol` with the recognized element:

``````   let symbol =
concatParsers2
(readWithConditionOnChar  (fun c -> System.Char.IsLetter(c, 0)))
(fun initialChar ->
concatParsers2
(readZeroOrMoreChars (fun c -> System.Char.IsLetter(c) || System.Char.IsDigit(c)))
(fun suffixString -> (preturn (PSymbol (initialChar + suffixString))))
)``````

In our little language a symbol is represented by :

• a letter
• zero or more letters or numbers

We can use the `readWithConditionOnChar` and `readZeroOrMoreChars` to archive these goals, for example:

``````> parse "hello" (readWithConditionOnChar  (fun c -> System.Char.IsLetter(c, 0)))- ;;
val it : (string * ReaderState) option = Some ("h", {Data = "hello";
Position = 1;
Indentation = ;})
> parse "hi1and2and3" (readZeroOrMoreChars (fun c -> System.Char.IsLetter(c) || - System.Char.IsDigit(c)));;
val it : (string * ReaderState) option =
Some ("hi1and2and3", {Data = "hi1and2and3";
Position = 11;
Indentation = ;})``````

Notice that we cannot use just `readZeroOrMoreChars` since it will recognize symbols starting with numbers (like '13hello').

We can create a small parser for recognizing a letter (`readWithConditionOnChar`) and we can use `readZeroOrMoreChars` to read the rest. We could combine these two blocks to get another parser. The key for composing our parsers is the `concatParsers2` function. This function has the following signature:

``````> concatParsers2;;
val it :
('b * ReaderState) option) = <fun:clo@21-1>
``````

This signature means that `concatParsers2` receives a parser that produces a value of type `'a`. The second argument is a function that receives a value of type `'a` (the result of the first parser) and returns a parser of another type `'b`. The result of `concatParsers2` is itself another parser that produces a value of type `'b`.

The implementation looks like this:

``````   let concatParsers2 (parser1 : (ReaderState ->  (('a * ReaderState) option )))
(parser2N : ('a ->  (ReaderState ->  (('b * ReaderState) option )))) =
fun input -> match (parser1 input) with
| Some (matchedTxt, restState) -> (parser2N matchedTxt) restState
| _ -> None``````

This small function lets us create parsers using small blocks. But before we do that, we need to create another useful operation.

``let preturn aValue (state : ReaderState ) = Some (aValue, state)``

This operation is very simple, but really important!. It lets us prepare a result. By taking a look at the signature, we can see that it matches our `parser` signature. We can now used it in conjunction with `concatParsers2` to produce results:

``````>  parse "AXXXC"
-        (concatParsers2 recognizeABC
-                      (fun firstChar ->
-                           (concatParsers2
-                                recognizeXs
-                                (fun xs ->
-                                     (concatParsers2
-                                         recognizeABC
-                                         (fun lastChar ->
-                                              preturn (firstChar, xs, lastChar)- ))))));;
val it : ((string * string * string) * ReaderState) option =
Some (("A", "XXX", "C"), {Data = "AXXXC";
Position = 5;
Indentation = ;})``````

Given that we define `recognizeABC` and `recognizeXs` as follows:

``````> let recognizeABC = readWithConditionOnChar (fun c -> c = "A" || c = "B" || c= "C");;

> let recognizeXs = readZeroOrMoreChars (fun c -> c = 'X' ) ;;

Using the `concatParsers2` function seems a little verbose. We can borrow inspiration from the Haskell's Monad type class https://hackage.haskell.org/package/base-4.9.1.0/docs/Control-Monad.html#t:Monad by defining versions of the `>>=` operator as follows:

``````   let inline (>>=) (parser1 : (ReaderState ->  (('a * ReaderState) option )))
(parser2N : ('a ->  (ReaderState ->  (('b * ReaderState) option )))) = concatParsers2 parser1 parser2N``````

Now using this operator we can write:

``````parse "AXXXC" ( recognizeABC >>= (fun firstChar ->
recognizeXs  >>= (fun xs ->
recognizeABC >>= (fun lastChar ->
preturn (firstChar, xs, lastChar)))))``````

## Ignoring output from previous parsers

Another useful operation allows us to connect two parsers but

``````   let concatParsers (parser1 : (ReaderState ->  (('a * ReaderState) option )))
fun input -> match (parser1 input) with
| Some (_, restState) -> parser2 restState
| _ -> None``````

Notice that it's similar to the `concatParsers2` but it ignores the result of `parser1`. This is useful for discarting whitespace:

``````let whitespaceNoNl =
(fun c -> System.Char.IsWhiteSpace(c) && c <> '\n')

let colon  =
``````

As with the `concatParsers2` we can use another well known operator for this operation:

``````   let inline (>>) (parser1 : (ReaderState ->  (('a * ReaderState) option )))
(parser2 : (ReaderState ->  (('b * ReaderState) option ))) = concatParsers parser1 parser2``````

## Useful parsing operations

We need to create more operations that help us create complex parsers. For example, we can create a small operation for optional elements:

``````   let optionalP (parser : (ReaderState ->  (('a * ReaderState) option ))) (defaultValue:'a) =
fun input -> match (parser input) with
| result & Some _ -> result
| _ -> (Some (defaultValue, input))``````

We can use this new operator to parse numbers as follows:

``````   let number =
( (optionalP (readSpecificChar '-') "") >>= (fun neg ->
digitP  >>= (fun firstChar ->
(readZeroOrMoreChars (fun c ->  System.Char.IsDigit(c))) >>= (fun chars ->
(optionalP decimalPartP "") >>= (fun dec ->
preturn (PNumber (neg + firstChar + chars + dec))))``````

## Choosing from two options

The following operation helps us to try to use one parser and fallback to another if it didn't work:

``````   let disjParser (parser1 : (ReaderState ->  (('a * ReaderState) option )))
fun input -> match (parser1 input) with
| result & Some _ -> result
| _ -> parser2 input``````

A simple scenario using this parser looks like this:

``````> parse "39" (disjParser symbol number);;
val it : (PExpr * ReaderState) option =
Some (PNumber "39", {Data = "39";
Position = 2;
Indentation = ;})
> parse "hello" (disjParser symbol number);;
val it : (PExpr * ReaderState) option =
Some (PSymbol "hello", {Data = "hello";
Position = 5;
Indentation = ;})``````

## Repetitions

Identifying sequences of elements identified by a parser is very useful. We can define the following operations

``````   let rec oneOrMore parser accumulated =
parser >>=
(fun lastResult ->
let result = lastResult::accumulated
in disjParser (oneOrMore parser result) (preturn (List.rev result)))

let zeroOrMore parser accumulated =
disjParser (oneOrMore parser accumulated) (preturn accumulated)``````

A simple example of using this operation looks like this:

``````> parse "1   2 3  4" (zeroOrMore (whitespace >> number) []);;
val it : (PExpr list * ReaderState) option =
Some
([PNumber "1"; PNumber "2"; PNumber "3"; PNumber "4"],
{Data = "1   2 3  4";
Position = 10;
Indentation = ;})``````

The next post will show how to create more interesting things with the pieces we have so far.

### A simple language with indentation based blocks (part 0)

One of the first things I noticed when reading about Python, is the use of indentation for defining code blocks. For example:

``````def foo(x):
while x < 20:
if x > 10:
print "argument is greater than 10"
print "value: " + x
else:
print "the argument is less than or equal to 10"
print "its value is : " + x``````

If you're only familiar to C-based language this seems a bit strange. Not using characters like '{}' or words like BEGIN/END seems fragile at first. For example one could expect something like:

``````def foo(x) {
while x < 20 {
if x > 10 {
print "argument is greater than 10"
print "value: " + x
} else {
print "the argument is less than or equal to 10"
print "its value is : " + x
}
}``````

I've always wondered how do you create a parser for this kind of a language. In the following series of posts I'm going to record my experiences writing a parser for a simple little language. This language will use a style similar to Python. I'm going to write the code using F# .