Saturday, December 29, 2018

Using Racklog for parsing

This post shows a small experiment of creating parsing predicates using the Racket's Racklog package . This package enables the use of logic programming features inside Racket.

Logic programming languages, like Prolog, use the Definite clause grammars syntax for this purpose. In this post this technique is not directly used, the goal is to express the grammar in the same way the DCG syntax is expanded. A nice experiment for a future post is to create macros for hiding state change variables.

As always the goal is to experiment with a concept without worrying too much about performance!

1.1 How it looks

Here's an example of a predicate for parsing an if statement for a fictional language.

(define %if-stat
  (%rel (ctxt new-ctxt)
        [(ctxt new-ctxt)
         (%let (ifkw-ctxt lpar-ctxt rpar-ctxt
                cond-ctxt then-ctxt
                stats-ctxt else-ctxt end-ctxt)
               (%and
                ((%literal-id "if")   ctxt ifkw-ctxt) 
                (%lpar                ifkw-ctxt lpar-ctxt)
                (%expression          lpar-ctxt cond-ctxt)
                (%rpar                cond-ctxt rpar-ctxt)
                ((%literal-id "then") rpar-ctxt then-ctxt)
                (%stat                then-ctxt stats-ctxt)
                ((%opt %else-block)   stats-ctxt else-ctxt)
                (%is new-ctxt (with-result
                                `(-if ,(parse-context-result cond-ctxt)
                                      ,(parse-context-result stats-ctxt)
                                      ,(parse-context-result else-ctxt))
                                else-ctxt))))]))

Running this predicate from the REPL using a string with sample code produces the following output:

racklogexperiments.rkt> (parse-context-result
 (cdar
   (%which (result-ctxt)
           (%if-stat (string-context "if (c) then
                                        foo(c);
                                      else
                                         moo(d);")
                 result-ctxt))))
'(-if
  (-id "c")
  (-call-stat (-call "foo" ((-id "c"))))
  (-else (-call-stat (-call "moo" ((-id "d"))))))
racklogexperiments.rkt>

1.2 Creating parsing predicates

An important piece that we need to define our parsers it's the "parsing context". This element is represented as as simple structure which keeps track of :

  1. The result or output of the last parser.
  2. The text used as input for the parser.
  3. The current position inside the string as a zero-based index.
(struct
  parse-context
  (result text position))

(define (string-context str)
        (parse-context '() str 0))

(define (with-result value ctxt)
  (struct-copy parse-context ctxt [result value]))

The way we define parsers using Racklog predicates is that we transform an input parsing context to new context. The new context is going to have a the result of the last parsing operation and it will move the position as many characters as consumed by the last parsing operation.

We use a couple of parsers as the foundation for all must all other parsers. The first one, %a-char has the next available character as the result.

(define %a-char
  (%rel (a-char ctxt new-ctxt)
        [(a-char ctxt new-ctxt)
            (%is #t (has-more-chars? ctxt))
            (%is new-ctxt (get-char-from-context ctxt))
            (%is a-char (parse-context-result new-ctxt))
            ]))

The code for the utility functions used in these parsers is the following:

(define (get-char-from-context ctxt)
  (struct-copy
     parse-context
     ctxt
     [result (string-ref (parse-context-text ctxt) (parse-context-position ctxt))]
     [position (+ 1 (parse-context-position ctxt))]))

(define (has-more-chars? ctx)
   (<
      (parse-context-position ctx)
      (string-length (parse-context-text ctx))))

The get-char-from-context function creates a new context with the current character as result and advances the context to the next position. We use Racklog unification and %and to chain together the parsing contexts. The following interaction illustrates this:

racklogexperiments.rkt> (define result (%which (c1 result-ctxt) 
                                               (%and (%a-char #\a (string-context "abc") c1)
                                                     (%a-char #\b c1 result-ctxt))))
racklogexperiments.rkt> result
'((c1 . #<parse-context>) (result-ctxt . #<parse-context>))
racklogexperiments.rkt> (assoc 'result-ctxt result)
'(result-ctxt . #<parse-context>)
racklogexperiments.rkt> (parse-context-result (cdr (assoc 'result-ctxt result)))
#\b
racklogexperiments.rkt> (parse-context-position (cdr (assoc 'result-ctxt result)))
2
racklogexperiments.rkt>

Here we create a very simple parser that recognizes the sequence: "ab" . As presented above, the position of the resulting parsing context is 2 which is the zero based position inside the string after 'b'.

1.3 Sequences

A very useful parser is one that let's you apply another parser zero or more times. The parser that archives this is the following:

(define %p-simple-sequence
  (λ (pred)
    (%rel (ctxt new-ctxt)
          [(ctxt new-ctxt)
           (%let (tmp-ctxt tmp-result)
                 (%or
                  (%and
                   (pred ctxt tmp-ctxt)
                   (%is tmp-result (with-result                                     
                                     (cons (parse-context-result tmp-ctxt)
                                           (parse-context-result ctxt))
                                     tmp-ctxt))
                   ((%p-simple-sequence pred) tmp-result new-ctxt))
                  (%is new-ctxt ctxt)))])))

(define %p-sequence
  (λ (pred)
    (%rel ( ctxt new-ctxt)
          [(ctxt new-ctxt)
           (%let (tmp-ctxt)
               (%and
                 (%is tmp-ctxt (with-result '() ctxt))
                 ((%p-simple-sequence pred)
                  tmp-ctxt new-ctxt)
                 ))]
          )))

We can apply this parser as follows:

racklogexperiments.rkt> (parse-context-result 
       (cdar (%which (result-ctxt) 
               ((%p-sequence 
                    (%rel (c r) [(c r) (%a-char #\a c r)])) 
                (string-context "aaaa") result-ctxt)))  )
'(#\a #\a #\a #\a)
racklogexperiments.rkt>

1.4 Optionals

Another useful element we need is a way to parse optional elements. We used this in our if example above for the else section.

To implement this we use %or to try to parse the optional parser first or succeed with an empty result. Using this technique will enable multiple solutions (see an example of this below).

(define %opt
  (λ (parser)
    (%rel (ctxt new-ctxt)
          [(ctxt new-ctxt)
           (%let (tmp-ctxt)
                 (%and
                  (%or (parser ctxt new-ctxt)
                       (%is new-ctxt (with-result '() ctxt)))))])))

1.5 Multiple possible ASTs

One interesting possibility of using a Racklog (or Prolog's DCGs) is that you can get many possible interpretations of the grammar. Although it may not be of practical use it looks rather interesting.

An example of these shows up when parsing an if statement with a https://en.wikipedia.org/wiki/Dangling_else .

(define sample-code
  "if (x) then if (y) then foo(); else goo();")

Here there are two possible valid interpretations of this if statement. The default one:

racklogexperiments.rkt> (parse-context-result
 (cdar
  (%which (result-ctxt)
          (%if-stat (string-context sample-code)
                    result-ctxt))))

'(-if
  (-id "x")
  (-if
   (-id "y")
   (-call-stat (-call "foo" ()))
   (-else (-call-stat (-call "goo" ()))))
  ())
racklogexperiments.rkt>

Here's an alternative visualization of this tree:

We can now ask Racklog for another solution using the %more function. See the result here:

racklogexperiments.rkt> (parse-context-result (cdar (%more)))
'(-if
  (-id "x")
  (-if (-id "y") (-call-stat (-call "foo" ())) ())
  (-else (-call-stat (-call "goo" ()))))
racklogexperiments.rkt>

Here's the other alternative visualization:

1.6 Code

The code for this post can be found here.