Sunday, June 13, 2010

Parsing XML documents with namespaces in Newspeak

The previous post showed a basic grammar for XML written using Newspeak parsing combinators. Although Newspeak already includes XML support, it is a great exercise to explore another interesting parts of the language.

The grammar



The grammar defined in the previous post looks like this:

class XmlGrammar = ExecutableGrammar(
"Xml 1.0 grammar with namespaces according to http://www.w3.org/TR/REC-xml/ "
|
openAB = char: $<.
closeAB = char: $>.
amp = char: $&.
semicolon = char: $; .
slash = char: $/.


topenAB = tokenFromChar: $<.
tcloseAB = tokenFromChar: $>.
tslash = tokenFromChar: $/.

comment = openAB , (char: $-),(char: $-),
((charExceptFor: $-) | ((char: $-), (charExceptFor: $-))) plus,
(char: $-),(char: $-),closeAB .

letter = (charBetween: $a and: $z) | (charBetween: $A and: $Z).
digit = charBetween: $0 and: $9.

colon = char: $:.

quote = (char:$') .
dquote = (char:$") .

eq = tokenFromChar: $= .

VersionNum = (char:$1), (char: $.) , (charBetween: $0 and: $9) plus.

VersionInfo = (tokenFromSymbol: #version), eq, ((quote, VersionNum,quote) | (dquote, VersionNum, dquote )).

EncName = letter, (letter | digit) star.

EncodingDecl = (tokenFromSymbol: #enconding) , eq , ((quote, EncName ,quote) | (dquote , EncName , dquote )).

yesNo = (tokenFromSymbol: #yes) | (tokenFromSymbol: #no).

SDDecl = (tokenFromSymbol: #standalone), eq, ((quote, yesNo ,quote) | (dquote , yesNo , dquote )).

XMLDecl = (char: $<) , (char: $?) ,(tokenFromSymbol: #xml), VersionInfo , EncodingDecl opt, SDDecl opt,
(tokenFromChar: $?), (char: $>).

dprolog = XMLDecl.

NameStartChar = letter | (char: $_) .

NameChar = NameStartChar | (char: $-) | (char: $.) | digit.

Name = NameStartChar, NameChar star.

NameWithPrefix = Name, colon, Name.

QName = NameWithPrefix | Name.

TQName = tokenFor: QName.

EntityRef = amp, Name ,semicolon .

CharRef = amp, (char: $#), (((char:$x), (digit | letter) plus) | (digit plus)) ,semicolon .

Reference = EntityRef | CharRef.

AttributeContent1 = (charExceptForCharIn: {$< . $". $&. }) | Reference.


AttributeValue = (dquote, AttributeContent1 star,dquote) |
(quote, AttributeContent1 star,quote).

Attribute = TQName ,eq, AttributeValue.

EmptyElemTag = topenAB ,QName, Attribute star, tslash , closeAB .
STag = topenAB ,QName, Attribute star, tcloseAB .
ETag = topenAB ,slash,QName, tcloseAB .

CDStart = topenAB ,(char: $!),(char:$[),(tokenFromSymbol: #CDATA),(char:$[).
CDEnd = (char: $]),(char: $]),(char: $>).

CDSect = CDStart, (charExceptFor: $]) star , CDEnd.

CharData = tokenFor: ((charExceptForCharIn: {$&. $<}) plus).

content = CharData opt, ((element | Reference | CDSect), CharData opt) star.

ComplexElement = STag, content , ETag.

element = EmptyElemTag | ComplexElement .


|
)
(
...
)



XML nodes



The next step was to add support for the creation of data structures representing the XML document. A technique presented in the Executable Grammars[PDF] paper suggest creating a subclass of the grammar which adds the code for creating the XML node tree.

class XmlParserWithXmlNodes = XmlGrammar (
"Basic XmlParser with XML node AST"
|


|)
('as yet unclassified'
Attribute = (
^ super Attribute wrapper: [:name :eq :value | {name. value.}].
)

AttributeValue = (
|flattenString|
^ super AttributeValue
wrapper: [:q1 :chars :q2 |
flattenString:: (chars collect:
[:c | (c isString)
ifTrue: [c at: 1]
ifFalse: [c]]).
String withAll:flattenString ].
)

CDSect = (
^ super CDSect wrapper: [:cs :data :ce | String withAll: data].
)

CDStart= (
^ super CDStart wrapper: [:t :e :b1 :cdata :b2 | 'cdatastart'].
)

CharData = (
^ super CharData wrapper: [:chars | String withAll: (chars token ) ].

)

CharRef = (
|numberStr|
^ super CharRef
wrapper: [:a1 :p :numberChars :sc |
|base|
numberStr:: ((numberChars at: 1) = $x)
ifTrue: [base:: 16.
String withAll: (numberChars at: 2)]
ifFalse: [base:: 10.
String withAll: numberChars].

(Unicode value: (Number readFrom: numberStr base: base)) asString
].
)

ComplexElement = (
|tn attCollection|
tn:: XmlNodes new.
attCollection:: tn Attributes new.

^ super ComplexElement
wrapper: [:s :childNodes :e |
(e asString = s name asString)
ifFalse: [error: 'Open tag different from closing tag'].
s childNodes: childNodes.
s].
)

ETag = (
^ super ETag wrapper: [:o :s :name :c | name].
)

EmptyElemTag = (

^super EmptyElemTag
wrapper: [:oab :name :atts :s :cab |
|tn attCollection|
tn:: XmlNodes new.
attCollection:: tn Attributes new.

atts do: [:att | attCollection addAttributeWithQName: ((att at: 1) token) value: (att at:2)].
tn Element name: name attributes: attCollection childNodes:{} ].
)

EntityRef = (
^ super EntityRef
wrapper: [:a :name :sc |
(entities valueForName: ((name at: 1) at: 1)) asString ].
)

Name = (
^ super Name wrapper: [:startChar :rest | { { (startChar asString), (String withAll: rest) } } ].
)

NameWithPrefix = (
|tn|
tn:: XmlNodes new.
^ super NameWithPrefix
wrapper: [ :prefix :c :name |
{{(prefix at: 1). (name at: 1).}} ].
)

QName = (
|tn|
tn:: XmlNodes new.
^ super QName wrapper: [ :data | ((data size) = 2)
ifTrue:[tn QualifiedName prefix: ((data at:1) at: 1)
localPart:((data at: 2) at:1)]
ifFalse:[tn QualifiedName prefix: nil localPart: (data at: 1)]]
)

STag = (
^super STag
wrapper: [:oab :name :atts :cab |
|tn attCollection|
tn:: XmlNodes new.
attCollection:: tn Attributes new.
"Add attributes"
atts do: [:att | attCollection addAttributeWithQName: ((att at: 1) token)
value: (att at:2)].
"Create elements"
tn Element name: name attributes: attCollection childNodes:{} ].
)

VersionInfo = (
^super VersionInfo
wrapper: [:v :e :versionTextList | versionTextList at: 2].
)

VersionNum = (
^super VersionNum
wrapper: [:one :dot :num |
Number
readFrom: (String withAll: {one.dot},(String withAll: num))].
)

XMLDecl = (
|tn|
tn:: XmlNodes new.
^super XMLDecl
wrapper: [:c1 :c2 :x :versionInfo :enc :sd :c3 :c4 |
tn TAstXmlDecl version: versionInfo
encoding: enc
standalone: sd] .
)

content = (
|result|

^ super content wrapper: [:chars :cseq |
(chars = nil)
ifTrue: [result:: {}]
ifFalse: [result:: {chars}].
cseq inject: result into: [:total :current |
result:: addCompactingStrings: (current at: 1) to: result .
(current at: 2) ifNotNil: [:c | addCompactingStrings: c to: result ].
result ].
].
)

...

)



XML Namespaces



Support for XML namespace resolution was the next step. As described in the Namespaces in XML 1.0 document, XML can have different namespaces for its elements. For example:

<docs:letter docs:xmlns="http://www.foo.com/docs"
pic:xmlns="http://www.foo.com/docs/links">
<docs:paragraph>some text1</docs:paragraph>
<pic:pictureRef location="goo/moo/loo"/>
<docs:paragraph>some text2</docs:paragraph>
</docs:letter>


In this case the 'docs:xmlns' and 'pic:xmlns' attributes associates the 'docs' and 'pics' prefixes to the specified namespaces. So the namespace of the 'letter' element is 'http://www.foo.com/docs'.

Also default namespaces could be specified, like this:

<letter xmlns="http://www.foo.com/docs">

<paragraph>some text1</paragraph>
<pictureRef location="goo/moo/loo"
xmlns="http://www.foo.com/docs/links" />
<paragraph>some text2</paragraph>
</letter>


The basic idea is that xmlns attributes define an scope where a set of prefix/namepace associations and a default namespace are valid. For each start tag a new scope with new declarations must be in context and each end tag must remove the last scope.

Extending the parsing context



In order to do add this functionality, we need to be able to keep a stack with the scopes of namespace declarations that is modified each time we enter or exit from a element declaration. Fortunately, the parser combinator library includes a ParserContext class which is used for keeping track parsing errors.

One problem to use ParserContext was that an instance of it is created inside the CombinatorialParser parse: method which is part of the parsing library. This instance is configured in certain way to process parsing errors. So in order to create add a extended context we needed to somehow replace the ParserContext class with a new implementation.

Luckly, as described in the Modules as Objects in Newspeak[PDF] paper in Newspeak you can override inner class definitions just like you override methods in any other OO language. This means that I can extend the parser library and since ParserContext is defined as an inner class, override its definition to add my own parsing context. The definition of the extended parsing library looks like this:

class ParserLibraryWithXmlContext = parserLibraryClass usingLib: platform (
"A parser library with overwritten parsing context for Xml namespace resolution"
|
parserLibContext = super ParserContext .
|
)
(

class ParserContext = parserLibContext(
"An Xml parser context that keeps track of namespaces."
|
protected prefixes = collections MutableArrayList new: 10.
|
)
('as yet unclassified'
addPrefix: prefix for: namespace = (
|lastLevel|
lastLevel:: prefixes at: (prefixes size).
lastLevel at: prefix put: namespace.

)

namespaceFor: prefix = (
|levelIndex|

levelIndex:: prefixes findLast: [:lvl | (lvl includesKey: prefix) ].
^(levelIndex > 0)
ifTrue: [(prefixes at: levelIndex) at: prefix]
ifFalse: [nil].
)

popLevel = (
prefixes pop.
)

pushLevel = (
prefixes push: (collections MutableHashedMap new).
)



))


This is pretty nice since it allowed us to inject our own implementation of ParsingContext (which extends the original!). An interesting thing to notice is the definition of the parserLibContext which is bound to the base definition of the ParserContext class which we need in order to inherit from it.

Using the extended context



Now having added the new parsing context the next step is to add the code to manipulate the context. In order to do that, we need to add a some functionality to the STag, EmptyElement and ETag productions of the grammar so we can push and pop namespace scopes for each element. In other to do this without modifying the existing functionality three wrappers were created for each of this productions:


class ParserWithNamespaceForStartTag = CombinatorialParser (
"A parser that takes care of modyfing the parsing context for a start tag"
|
innerparser
|
)
('as yet unclassified'
forParser: p = (
innerparser:: p.
)

parse: input inContext: context ifError: blk = (
|result xmlnsAttributes elementName elementNamespace attName|
result:: innerparser parse: input inContext:context ifError:blk.
"First update the context with the newest prefix declarations"
context pushLevel.
xmlnsAttributes:: result attributes allAttributesWithLocalName: 'xmlns'.
xmlnsAttributes
do: [:aPair | context addPrefix: ((aPair at: 1) prefix)
for: (aPair at: 2)].
"Update the element"
elementName:: result name.
elementNamespace:: context namespaceFor: elementName prefix.
elementName namespace: elementNamespace.

"Update the attributes"
result attributes
do: [:aPair |
attName:: aPair at: 1.
attName namespace:
(context namespaceFor: attName prefix)
].

^result.
)

class ParserWithNamespaceForStartEndTag =ParserWithNamespaceForStartTag(
"Parser that takes care of modifying the parsing context for namespace declarations for self closing tags."
|

|
)
('as yet unclassified'
parse: input inContext: context ifError: blk = (
|result|
result:: super parse: input inContext: context ifError: blk.
context popLevel.
^result
)

)

class ParserWithNamespaceForEndTag = CombinatorialParser (
"A parser that takes care of modyfing the parsing context for a end tag"
|
protected innerParser = nil.

|
)
('as yet unclassified'
forParser: parser = (
innerParser:: parser.
)

parse: input inContext: context ifError: blk = (
|result|
result:: innerParser parse: input inContext: context ifError: blk.
context popLevel.
^result
)


As shown here the start tag parser pushes a new scope with the new namespace declarations, while the end tag parser pops a scope.

Now to add this wrappers I created a new subclass of the parser with nodes to add this functionality.

class XmlParserWithNodesAndNamespaces = XmlParserWithXmlNodes (
"An XML Parser that creates a node tree and that resolves the namespaces."
|

|
)
('as yet unclassified'
ETag = (
|newWrappingParser|
newWrappingParser:: ParserWithNamespaceForEndTag new.
newWrappingParser forParser: (super ETag).
^newWrappingParser
)

EmptyElemTag = (
|newWrappingParser|
newWrappingParser:: ParserWithNamespaceForStartEndTag new.
newWrappingParser forParser: (super EmptyElemTag).
^newWrappingParser
)

STag = (
|tn attCollection newWrappingParser|
newWrappingParser:: ParserWithNamespaceForStartTag new.
newWrappingParser forParser: (super STag).
^newWrappingParser
)

)



Code organization



The code for this experiment is organized as follows:

class  XmlTools withParserLibClass: parserLibraryClass usingLib: platform = 
(
...
) (
class XmlParsing withParsingLib: parserLibrary = (
...
)
(
class XmlGrammar = ExecutableGrammar
( ...) ( ... )
class XmlParserWithXmlNodes = XmlGrammar
( ... ) ( ... )
class XmlParserWithNodesAndNamespaces = XmlParserWithXmlNodes
( ... ) ( ... )
...
)

basicParser = (
|parsingLib xmlparsing|
parsingLib:: parserLibraryClass usingLib: platform.
xmlparsing:: XmlParsing withParsingLib: parsingLib.
^xmlparsing XmlParserWithXmlNodes new.
)

parserWithNamespaceSupport = (
|parsingLib xmlparsing|
parsingLib:: ParserLibraryWithXmlContext new.
xmlparsing:: XmlParsing withParsingLib: parsingLib.
^xmlparsing XmlParserWithNodesAndNamespaces new.
)
...
)


Here the XmlTools class definition will represent a module for XML utilities. Its inner definitions include the XmlParsing inner class which defines the grammar along other parsing utilities. The basicParser parserWithNamespaceSupport show an example of how the parser is created.

The following methods show how the XmlTools class is used to create a concrete parser for a basic XML document:

getTestingNsParser = (
|platform|
platform:: Platform new.
^(XmlTools
withParserLibClass: BlocklessCombinatorialParsing
usingLib: platform) parserWithNamespaceSupport.
)

testElementWithOneChildWithNamespaces = (
|parser r ctxt|
parser:: xmlNsParserWrapper: (getTestingNsParser element) .
r:: parser parse: (streamFromString: '') .

assert:[r childNodes size = 1].
assert:[r name asString = 'myElement'].
assert:[r name namespace = 'http://foo'].
assert:[((r childNodes at: 1) name asString) = 'childElement1'].
assert:[((r childNodes at: 1) name namespace) = 'http://foo'].
)


Final words



The nicest think to notice is that the original class containing the grammar for XML was not modified in order to introduce this feature. In fact the parser with XML nodes without namespaces is also available . Also I really liked the way the ParserContext class was replaced which automatically allowed me to add this new functionality.

Code for this post can be found here.

0 comments: