Monday, January 23, 2012

Creating a small parser using Parsec

In this post I'm going to show a small parser for a Scheme-like language written in Haskell using the Parsec parsing combinator library.

We're going to start by defining an AST for Scheme-like expressions:

data Expr = ScSymbol String
            | ScString String
            | ScNumber Integer
            | ScDouble Double 
            | ScCons Expr Expr
            | ScNil
            | ScBool Bool
            | ScQuote Expr
     deriving Show

For this experiment I'm going to use the Text.Parsec.Token and Text.Parsec.Language packages which have useful functions for creating language parsers.

idSymbol = oneOf ":!$%&*+/<=>?@\\^|-~"

schemeLanguageDef :: LanguageDef st
schemeLanguageDef = emptyDef { 
                   commentLine = ";;"
                   , identStart = letter <|> idSymbol
                   , identLetter = alphaNum <|> idSymbol
                   , opStart = parserZero
                   , opLetter = parserZero
                 }

schemeTokenParser = makeTokenParser schemeLanguageDef

TokenParser {
   identifier = idParser,
   reservedOp = opParser,
   stringLiteral = stringLiteralParser,
   parens = parParser,
   lexeme = lexParser,
   naturalOrFloat = naturalOrFloatParser
} = schemeTokenParser


Given these definitions we get parsers for identifiers, numbers, string literals "for free".

Boolean literals are parsed using the following parser:

boolLiteral = lexParser (
                 do 
                   char '#'
                   val <- (char 't') <|> (char 'f')
                   return $ ScBool $ val == 't'
             )


Basic quotations are parsed as follows:

quoteParser = lexParser (
              do 
                char '\''
                val <- expressionParser
                return $ ScQuote val
          )


Atomic expressions are parsed as follows:

atomParser =
   (do 
      id <- idParser
      return $ ScSymbol id)
   <|>
    (do 
       fnumber <- naturalOrFloatParser
       return $ case fnumber of
                Left num -> ScNumber num
                Right num -> ScDouble num)
   <|>        
    (do str <- stringLiteralParser
        return $ ScString str)
   <|> boolLiteral
   <|> quoteParser



Next we need to consider S-Expressions. This kind of expressions are combinations of the above elements surrounded by parentheses.

For example:
(+ 1 2 3)

dottedSuffixParser =     
   do 
      dotParser
      finalExpr <- expressionParser
      return finalExpr

parExpressionParser = 
   do (exprs, last) <- parParser 
                (do 
                   seq <- many expressionParser
                   dottedSuffix <- optionMaybe dottedSuffixParser 
                   return (case dottedSuffix of
                            Just lastExpr -> (seq, lastExpr)
                            Nothing -> (seq, ScNil)))
      return $ foldr ScCons last exprs

As shown above, the parser need to verify if there was a dot ('.') before the last element for example:
(1 . 2)
More details on the dot syntax can be found here.
Finally we define expressions:

       expressionParser =
            atomParser <|> parExpressionParser


       parseIt input = parse expressionParser "" input


We can use this function to parse this expressions as follows:

*SCParser> parseIt "(- x 1)"
Right (ScCons (ScSymbol "-") (ScCons (ScSymbol "x") (ScCons (ScNumber 1) ScNil)))


There's a great a series of articles on implementing a Scheme interpreter in Haskell called "Write Yourself a Scheme in 48 Hours/Parsing" which is a great source of information on this topic.