The
previous post showed a basic grammar for XML written using
Newspeak parsing combinators. Although Newspeak already includes XML support, it is a great exercise to explore another interesting parts of the language.
The grammar
The grammar defined in the previous post looks like this:
class XmlGrammar = ExecutableGrammar(
|
openAB = char: $<.
closeAB = char: $>.
amp = char: $&.
semicolon = char: $; .
slash = char: $/.
topenAB = tokenFromChar: $<.
tcloseAB = tokenFromChar: $>.
tslash = tokenFromChar: $/.
comment = openAB , (char: $-),(char: $-),
((charExceptFor: $-) | ((char: $-), (charExceptFor: $-))) plus,
(char: $-),(char: $-),closeAB .
letter = (charBetween: $a and: $z) | (charBetween: $A and: $Z).
digit = charBetween: $0 and: $9.
colon = char: $:.
quote = (char:$') .
dquote = (char:$") .
eq = tokenFromChar: $= .
VersionNum = (char:$1), (char: $.) , (charBetween: $0 and: $9) plus.
VersionInfo = (tokenFromSymbol: #version), eq, ((quote, VersionNum,quote) | (dquote, VersionNum, dquote )).
EncName = letter, (letter | digit) star.
EncodingDecl = (tokenFromSymbol: #enconding) , eq , ((quote, EncName ,quote) | (dquote , EncName , dquote )).
yesNo = (tokenFromSymbol: #yes) | (tokenFromSymbol: #no).
SDDecl = (tokenFromSymbol: #standalone), eq, ((quote, yesNo ,quote) | (dquote , yesNo , dquote )).
XMLDecl = (char: $<) , (char: $?) ,(tokenFromSymbol: #xml), VersionInfo , EncodingDecl opt, SDDecl opt,
(tokenFromChar: $?), (char: $>).
dprolog = XMLDecl.
NameStartChar = letter | (char: $_) .
NameChar = NameStartChar | (char: $-) | (char: $.) | digit.
Name = NameStartChar, NameChar star.
NameWithPrefix = Name, colon, Name.
QName = NameWithPrefix | Name.
TQName = tokenFor: QName.
EntityRef = amp, Name ,semicolon .
CharRef = amp, (char: $#), (((char:$x), (digit | letter) plus) | (digit plus)) ,semicolon .
Reference = EntityRef | CharRef.
AttributeContent1 = (charExceptForCharIn: {$< . $". $&. }) | Reference.
AttributeValue = (dquote, AttributeContent1 star,dquote) |
(quote, AttributeContent1 star,quote).
Attribute = TQName ,eq, AttributeValue.
EmptyElemTag = topenAB ,QName, Attribute star, tslash , closeAB .
STag = topenAB ,QName, Attribute star, tcloseAB .
ETag = topenAB ,slash,QName, tcloseAB .
CDStart = topenAB ,(char: $!),(char:$[),(tokenFromSymbol: #CDATA),(char:$[).
CDEnd = (char: $]),(char: $]),(char: $>).
CDSect = CDStart, (charExceptFor: $]) star , CDEnd.
CharData = tokenFor: ((charExceptForCharIn: {$&. $<}) plus).
content = CharData opt, ((element | Reference | CDSect), CharData opt) star.
ComplexElement = STag, content , ETag.
element = EmptyElemTag | ComplexElement .
|
)
(
...
)
XML nodes
The next step was to add support for the creation of data structures representing the XML document. A technique presented in the
Executable Grammars[PDF] paper suggest creating a subclass of the grammar which adds the code for creating the XML node tree.
class XmlParserWithXmlNodes = XmlGrammar (
"Basic XmlParser with XML node AST"
|
|)
('as yet unclassified'
Attribute = (
^ super Attribute wrapper: [:name :eq :value | {name. value.}].
)
AttributeValue = (
|flattenString|
^ super AttributeValue
wrapper: [:q1 :chars :q2 |
flattenString:: (chars collect:
[:c | (c isString)
ifTrue: [c at: 1]
ifFalse: [c]]).
String withAll:flattenString ].
)
CDSect = (
^ super CDSect wrapper: [:cs :data :ce | String withAll: data].
)
CDStart= (
^ super CDStart wrapper: [:t :e :b1 :cdata :b2 | 'cdatastart'].
)
CharData = (
^ super CharData wrapper: [:chars | String withAll: (chars token ) ].
)
CharRef = (
|numberStr|
^ super CharRef
wrapper: [:a1 :p :numberChars :sc |
|base|
numberStr:: ((numberChars at: 1) = $x)
ifTrue: [base:: 16.
String withAll: (numberChars at: 2)]
ifFalse: [base:: 10.
String withAll: numberChars].
(Unicode value: (Number readFrom: numberStr base: base)) asString
].
)
ComplexElement = (
|tn attCollection|
tn:: XmlNodes new.
attCollection:: tn Attributes new.
^ super ComplexElement
wrapper: [:s :childNodes :e |
(e asString = s name asString)
ifFalse: [error: 'Open tag different from closing tag'].
s childNodes: childNodes.
s].
)
ETag = (
^ super ETag wrapper: [:o :s :name :c | name].
)
EmptyElemTag = (
^super EmptyElemTag
wrapper: [:oab :name :atts :s :cab |
|tn attCollection|
tn:: XmlNodes new.
attCollection:: tn Attributes new.
atts do: [:att | attCollection addAttributeWithQName: ((att at: 1) token) value: (att at:2)].
tn Element name: name attributes: attCollection childNodes:{} ].
)
EntityRef = (
^ super EntityRef
wrapper: [:a :name :sc |
(entities valueForName: ((name at: 1) at: 1)) asString ].
)
Name = (
^ super Name wrapper: [:startChar :rest | { { (startChar asString), (String withAll: rest) } } ].
)
NameWithPrefix = (
|tn|
tn:: XmlNodes new.
^ super NameWithPrefix
wrapper: [ :prefix :c :name |
{{(prefix at: 1). (name at: 1).}} ].
)
QName = (
|tn|
tn:: XmlNodes new.
^ super QName wrapper: [ :data | ((data size) = 2)
ifTrue:[tn QualifiedName prefix: ((data at:1) at: 1)
localPart:((data at: 2) at:1)]
ifFalse:[tn QualifiedName prefix: nil localPart: (data at: 1)]]
)
STag = (
^super STag
wrapper: [:oab :name :atts :cab |
|tn attCollection|
tn:: XmlNodes new.
attCollection:: tn Attributes new.
"Add attributes"
atts do: [:att | attCollection addAttributeWithQName: ((att at: 1) token)
value: (att at:2)].
"Create elements"
tn Element name: name attributes: attCollection childNodes:{} ].
)
VersionInfo = (
^super VersionInfo
wrapper: [:v :e :versionTextList | versionTextList at: 2].
)
VersionNum = (
^super VersionNum
wrapper: [:one :dot :num |
Number
readFrom: (String withAll: {one.dot},(String withAll: num))].
)
XMLDecl = (
|tn|
tn:: XmlNodes new.
^super XMLDecl
wrapper: [:c1 :c2 :x :versionInfo :enc :sd :c3 :c4 |
tn TAstXmlDecl version: versionInfo
encoding: enc
standalone: sd] .
)
content = (
|result|
^ super content wrapper: [:chars :cseq |
(chars = nil)
ifTrue: [result:: {}]
ifFalse: [result:: {chars}].
cseq inject: result into: [:total :current |
result:: addCompactingStrings: (current at: 1) to: result .
(current at: 2) ifNotNil: [:c | addCompactingStrings: c to: result ].
result ].
].
)
...
)
XML Namespaces
Support for XML namespace resolution was the next step. As described in the
Namespaces in XML 1.0 document, XML can have different namespaces for its elements. For example:
<docs:letter docs:xmlns="http://www.foo.com/docs"
pic:xmlns="http://www.foo.com/docs/links">
<docs:paragraph>some text1</docs:paragraph>
<pic:pictureRef location="goo/moo/loo"/>
<docs:paragraph>some text2</docs:paragraph>
</docs:letter>
In this case the 'docs:xmlns' and 'pic:xmlns' attributes associates the 'docs' and 'pics' prefixes to the specified namespaces. So the namespace of the 'letter' element is 'http://www.foo.com/docs'.
Also default namespaces could be specified, like this:
<letter xmlns="http://www.foo.com/docs">
<paragraph>some text1</paragraph>
<pictureRef location="goo/moo/loo"
xmlns="http://www.foo.com/docs/links" />
<paragraph>some text2</paragraph>
</letter>
The basic idea is that
xmlns
attributes define an scope where a set of prefix/namepace associations and a default namespace are valid. For each start tag a new scope with new declarations must be in context and each end tag must remove the last scope.
Extending the parsing context
In order to do add this functionality, we need to be able to keep a stack with the scopes of namespace declarations that is modified each time we enter or exit from a element declaration. Fortunately, the parser combinator library includes a
ParserContext
class which is used for keeping track parsing errors.
One problem to use
ParserContext
was that an instance of it is created inside the
CombinatorialParser parse:
method which is part of the parsing library. This instance is configured in certain way to process parsing errors. So in order to create add a extended context we needed to somehow replace the
ParserContext
class with a new implementation.
Luckly, as described in the
Modules as Objects in Newspeak[PDF] paper in Newspeak you can override inner class definitions just like you override methods in any other OO language. This means that I can extend the parser library and since
ParserContext
is defined as an inner class, override its definition to add my own parsing context. The definition of the extended parsing library looks like this:
class ParserLibraryWithXmlContext = parserLibraryClass usingLib: platform (
|
parserLibContext = super ParserContext .
|
)
(
class ParserContext = parserLibContext(
|
protected prefixes = collections MutableArrayList new: 10.
|
)
('as yet unclassified'
addPrefix: prefix for: namespace = (
|lastLevel|
lastLevel:: prefixes at: (prefixes size).
lastLevel at: prefix put: namespace.
)
namespaceFor: prefix = (
|levelIndex|
levelIndex:: prefixes findLast: [:lvl | (lvl includesKey: prefix) ].
^(levelIndex > 0)
ifTrue: [(prefixes at: levelIndex) at: prefix]
ifFalse: [nil].
)
popLevel = (
prefixes pop.
)
pushLevel = (
prefixes push: (collections MutableHashedMap new).
)
))
This is pretty nice since it allowed us to
inject our own implementation of
ParsingContext
(which extends the original!). An interesting thing to notice is the definition of the
parserLibContext
which is bound to the base definition of the
ParserContext
class which we need in order to inherit from it.
Using the extended context
Now having added the new parsing context the next step is to add the code to manipulate the context. In order to do that, we need to add a some functionality to the
STag
,
EmptyElement
and
ETag
productions of the grammar so we can push and pop namespace scopes for each element. In other to do this without modifying the existing functionality three wrappers were created for each of this productions:
class ParserWithNamespaceForStartTag = CombinatorialParser (
|
innerparser
|
)
('as yet unclassified'
forParser: p = (
innerparser:: p.
)
parse: input inContext: context ifError: blk = (
|result xmlnsAttributes elementName elementNamespace attName|
result:: innerparser parse: input inContext:context ifError:blk.
context pushLevel.
xmlnsAttributes:: result attributes allAttributesWithLocalName: 'xmlns'.
xmlnsAttributes
do: [:aPair | context addPrefix: ((aPair at: 1) prefix)
for: (aPair at: 2)].
elementName:: result name.
elementNamespace:: context namespaceFor: elementName prefix.
elementName namespace: elementNamespace.
result attributes
do: [:aPair |
attName:: aPair at: 1.
attName namespace:
(context namespaceFor: attName prefix)
].
^result.
)
class ParserWithNamespaceForStartEndTag =ParserWithNamespaceForStartTag(
|
|
)
('as yet unclassified'
parse: input inContext: context ifError: blk = (
|result|
result:: super parse: input inContext: context ifError: blk.
context popLevel.
^result
)
)
class ParserWithNamespaceForEndTag = CombinatorialParser (
"A parser that takes care of modyfing the parsing context for a end tag"
|
protected innerParser = nil.
|
)
('as yet unclassified'
forParser: parser = (
innerParser:: parser.
)
parse: input inContext: context ifError: blk = (
|result|
result:: innerParser parse: input inContext: context ifError: blk.
context popLevel.
^result
)
As shown here the start tag parser
pushes a new scope with the new namespace declarations, while the end tag parser
pops a scope.
Now to add this wrappers I created a new subclass of the parser with nodes to add this functionality.
class XmlParserWithNodesAndNamespaces = XmlParserWithXmlNodes (
|
|
)
('as yet unclassified'
ETag = (
|newWrappingParser|
newWrappingParser:: ParserWithNamespaceForEndTag new.
newWrappingParser forParser: (super ETag).
^newWrappingParser
)
EmptyElemTag = (
|newWrappingParser|
newWrappingParser:: ParserWithNamespaceForStartEndTag new.
newWrappingParser forParser: (super EmptyElemTag).
^newWrappingParser
)
STag = (
|tn attCollection newWrappingParser|
newWrappingParser:: ParserWithNamespaceForStartTag new.
newWrappingParser forParser: (super STag).
^newWrappingParser
)
)
Code organization
The code for this experiment is organized as follows:
class XmlTools withParserLibClass: parserLibraryClass usingLib: platform =
(
...
) (
class XmlParsing withParsingLib: parserLibrary = (
...
)
(
class XmlGrammar = ExecutableGrammar
( ...) ( ... )
class XmlParserWithXmlNodes = XmlGrammar
( ... ) ( ... )
class XmlParserWithNodesAndNamespaces = XmlParserWithXmlNodes
( ... ) ( ... )
...
)
basicParser = (
|parsingLib xmlparsing|
parsingLib:: parserLibraryClass usingLib: platform.
xmlparsing:: XmlParsing withParsingLib: parsingLib.
^xmlparsing XmlParserWithXmlNodes new.
)
parserWithNamespaceSupport = (
|parsingLib xmlparsing|
parsingLib:: ParserLibraryWithXmlContext new.
xmlparsing:: XmlParsing withParsingLib: parsingLib.
^xmlparsing XmlParserWithNodesAndNamespaces new.
)
...
)
Here the
XmlTools
class definition will represent a
module for XML utilities. Its inner definitions include the XmlParsing inner class which defines the grammar along other parsing utilities. The
basicParser
parserWithNamespaceSupport
show an example of how the parser is created.
The following methods show how the
XmlTools
class is used to create a concrete parser for a basic XML document:
getTestingNsParser = (
|platform|
platform:: Platform new.
^(XmlTools
withParserLibClass: BlocklessCombinatorialParsing
usingLib: platform) parserWithNamespaceSupport.
)
testElementWithOneChildWithNamespaces = (
|parser r ctxt|
parser:: xmlNsParserWrapper: (getTestingNsParser element) .
r:: parser parse: (streamFromString: '') .
assert:[r childNodes size = 1].
assert:[r name asString = 'myElement'].
assert:[r name namespace = 'http://foo'].
assert:[((r childNodes at: 1) name asString) = 'childElement1'].
assert:[((r childNodes at: 1) name namespace) = 'http://foo'].
)
Final words
The nicest think to notice is that the original class containing the grammar for XML was not modified in order to introduce this feature. In fact the parser with XML nodes without namespaces is also available . Also I really liked the way the
ParserContext
class was replaced which automatically allowed me to add this new functionality.
Code for this post can be found
here.