In this post, I'm going to start with the implementation of the parser using the F# programming language. There are several tools for creating parsers for example FParsec http://www.quanttec.com/fparsec/ or FsYacc/FsLex https://en.wikibooks.org/wiki/F_Sharp_Programming/Lexing_and_,arsing. However for this experiment I wanted to create a small set of simple parser combinators, just to learn more about them.
The contents of this post is loosely based on the definition of the Parser Monad . There are many great articles on the net about them. One of this papers is the incredibly nice Monadic Parsing in Haskell http://www.cs.nott.ac.uk/~pszgmh/pearl.pdf .
Let's start by defining the type of the state of our parser:
type ReaderState = { Data : string;
Position: int;
Indentation: int list}
This state contains the following:
- Data : the input string containing the program to parse
- Position : the current position of the parser inside the Data string
- Indentation : an indentation stack (more on that in future posts)
And now we're going to define the type of a "parser" :
ReaderState -> (('a * ReaderState) option )
This means that a "parser" is a function that takes the reader state and returns a tuple of some parsed value ('a
) and the new parsing state. Notice that parsers could fail, so the result it's wrapped in an option
type https://docs.microsoft.com/en-us/dotnet/articles/fsharp/language-reference/options.
A very simple example of a parser is the following:
let readChar(state : ReaderState) : (string * ReaderState) option =
if state.Data.Length > state.Position then
Some (state.Data.[state.Position].ToString(),
{ state with Position = state.Position + 1 } )
else
None
We can test this parser using the following function:
let parse (input : string) (parser : (ReaderState -> (('a * ReaderState) option ))) =
parser { Data = input ; Position = 0; Indentation = [0] }
For example:
> parse "hello" readChar;;
val it : (string * ReaderState) option = Some ("h", {Data = "hello";
Position = 1;
Indentation = [0];})
Here we can see that the result is a successful optional Some
with a string with the single recognized char . Also part of the result is the new reader state .
We can also specify operations for several characters for example:
let readZeroOrMoreChars (pred:char -> bool) (state : ReaderState) : (string * ReaderState) option =
let
secondPosition = readingChar state.Position pred state.Data
in
Some ( state.Data.Substring(state.Position,
secondPosition - state.Position),
{ state with Position = secondPosition })
We can use this function as follows:
> parse "1231abc" (readZeroOrMoreChars System.Char.IsDigit);;
val it : (string * ReaderState) option = Some ("1231", {Data = "1231abc";
Position = 4;
Indentation = [0];})
Using this function we can define a parser for whitespace:
let whitespace = readZeroOrMoreChars System.Char.IsWhiteSpace
Connecting parsers
The parsers I've been showing so far look very simple, and they are!. The goal is to have small parsers and combine them to create more interesting ones. The next operation is called concatParsers2
and it's goal is create a new parser that will execute two parsers in sequence. That is, apply the first one and with the resulting state, execute the second one.
For example, the following parser for recognizing symbols use the readWithConditionOnChar
parser with the readZeroOrMoreChars
in order to create a PSymbol
with the recognized element:
let symbol =
concatParsers2
(readWithConditionOnChar (fun c -> System.Char.IsLetter(c, 0)))
(fun initialChar ->
concatParsers2
(readZeroOrMoreChars (fun c -> System.Char.IsLetter(c) || System.Char.IsDigit(c)))
(fun suffixString -> (preturn (PSymbol (initialChar + suffixString))))
)
In our little language a symbol is represented by :
- a letter
- zero or more letters or numbers
We can use the readWithConditionOnChar
and readZeroOrMoreChars
to archive these goals, for example:
> parse "hello" (readWithConditionOnChar (fun c -> System.Char.IsLetter(c, 0)))- ;;
val it : (string * ReaderState) option = Some ("h", {Data = "hello";
Position = 1;
Indentation = [0];})
> parse "hi1and2and3" (readZeroOrMoreChars (fun c -> System.Char.IsLetter(c) || - System.Char.IsDigit(c)));;
val it : (string * ReaderState) option =
Some ("hi1and2and3", {Data = "hi1and2and3";
Position = 11;
Indentation = [0];})
Notice that we cannot use just readZeroOrMoreChars
since it will recognize symbols starting with numbers (like '13hello').
We can create a small parser for recognizing a letter (readWithConditionOnChar
) and we can use readZeroOrMoreChars
to read the rest. We could combine these two blocks to get another parser. The key for composing our parsers is the concatParsers2
function. This function has the following signature:
> concatParsers2;;
val it :
((ReaderState -> ('a * ReaderState) option) ->
('a -> ReaderState -> ('b * ReaderState) option) -> ReaderState ->
('b * ReaderState) option) = <fun:clo@21-1>
This signature means that concatParsers2
receives a parser that produces a value of type 'a
. The second argument is a function that receives a value of type 'a
(the result of the first parser) and returns a parser of another type 'b
. The result of concatParsers2
is itself another parser that produces a value of type 'b
.
The implementation looks like this:
let concatParsers2 (parser1 : (ReaderState -> (('a * ReaderState) option )))
(parser2N : ('a -> (ReaderState -> (('b * ReaderState) option )))) =
fun input -> match (parser1 input) with
| Some (matchedTxt, restState) -> (parser2N matchedTxt) restState
| _ -> None
This small function lets us create parsers using small blocks. But before we do that, we need to create another useful operation.
let preturn aValue (state : ReaderState ) = Some (aValue, state)
This operation is very simple, but really important!. It lets us prepare a result. By taking a look at the signature, we can see that it matches our parser
signature. We can now used it in conjunction with concatParsers2
to produce results:
> parse "AXXXC"
- (concatParsers2 recognizeABC
- (fun firstChar ->
- (concatParsers2
- recognizeXs
- (fun xs ->
- (concatParsers2
- recognizeABC
- (fun lastChar ->
- preturn (firstChar, xs, lastChar)- ))))));;
val it : ((string * string * string) * ReaderState) option =
Some (("A", "XXX", "C"), {Data = "AXXXC";
Position = 5;
Indentation = [0];})
Given that we define recognizeABC
and recognizeXs
as follows:
> let recognizeABC = readWithConditionOnChar (fun c -> c = "A" || c = "B" || c= "C");;
val recognizeABC : (ReaderState -> (string * ReaderState) option)
> let recognizeXs = readZeroOrMoreChars (fun c -> c = 'X' ) ;;
val recognizeXs : (ReaderState -> (string * ReaderState) option)
Using the concatParsers2
function seems a little verbose. We can borrow inspiration from the Haskell's Monad type class https://hackage.haskell.org/package/base-4.9.1.0/docs/Control-Monad.html#t:Monad by defining versions of the >>=
operator as follows:
let inline (>>=) (parser1 : (ReaderState -> (('a * ReaderState) option )))
(parser2N : ('a -> (ReaderState -> (('b * ReaderState) option )))) = concatParsers2 parser1 parser2N
Now using this operator we can write:
parse "AXXXC" ( recognizeABC >>= (fun firstChar ->
recognizeXs >>= (fun xs ->
recognizeABC >>= (fun lastChar ->
preturn (firstChar, xs, lastChar)))))
Ignoring output from previous parsers
Another useful operation allows us to connect two parsers but
let concatParsers (parser1 : (ReaderState -> (('a * ReaderState) option )))
(parser2 : (ReaderState -> (('b * ReaderState) option ))) =
fun input -> match (parser1 input) with
| Some (_, restState) -> parser2 restState
| _ -> None
Notice that it's similar to the concatParsers2
but it ignores the result of parser1
. This is useful for discarting whitespace:
let whitespaceNoNl =
readZeroOrMoreChars
(fun c -> System.Char.IsWhiteSpace(c) && c <> '\n')
let colon =
concatParsers whitespaceNoNl (readSpecificChar ':')
As with the concatParsers2
we can use another well known operator for this operation:
let inline (>>) (parser1 : (ReaderState -> (('a * ReaderState) option )))
(parser2 : (ReaderState -> (('b * ReaderState) option ))) = concatParsers parser1 parser2
Useful parsing operations
We need to create more operations that help us create complex parsers. For example, we can create a small operation for optional elements:
let optionalP (parser : (ReaderState -> (('a * ReaderState) option ))) (defaultValue:'a) =
fun input -> match (parser input) with
| result & Some _ -> result
| _ -> (Some (defaultValue, input))
We can use this new operator to parse numbers as follows:
let number =
( (optionalP (readSpecificChar '-') "") >>= (fun neg ->
digitP >>= (fun firstChar ->
(readZeroOrMoreChars (fun c -> System.Char.IsDigit(c))) >>= (fun chars ->
(optionalP decimalPartP "") >>= (fun dec ->
preturn (PNumber (neg + firstChar + chars + dec))))
Choosing from two options
The following operation helps us to try to use one parser and fallback to another if it didn't work:
let disjParser (parser1 : (ReaderState -> (('a * ReaderState) option )))
(parser2 : (ReaderState -> (('a * ReaderState) option ))) =
fun input -> match (parser1 input) with
| result & Some _ -> result
| _ -> parser2 input
A simple scenario using this parser looks like this:
> parse "39" (disjParser symbol number);;
val it : (PExpr * ReaderState) option =
Some (PNumber "39", {Data = "39";
Position = 2;
Indentation = [0];})
> parse "hello" (disjParser symbol number);;
val it : (PExpr * ReaderState) option =
Some (PSymbol "hello", {Data = "hello";
Position = 5;
Indentation = [0];})
Repetitions
Identifying sequences of elements identified by a parser is very useful. We can define the following operations
let rec oneOrMore parser accumulated =
parser >>=
(fun lastResult ->
let result = lastResult::accumulated
in disjParser (oneOrMore parser result) (preturn (List.rev result)))
let zeroOrMore parser accumulated =
disjParser (oneOrMore parser accumulated) (preturn accumulated)
A simple example of using this operation looks like this:
> parse "1 2 3 4" (zeroOrMore (whitespace >> number) []);;
val it : (PExpr list * ReaderState) option =
Some
([PNumber "1"; PNumber "2"; PNumber "3"; PNumber "4"],
{Data = "1 2 3 4";
Position = 10;
Indentation = [0];})
The next post will show how to create more interesting things with the pieces we have so far.