Thursday, October 1, 2009

Parsing Javascript using Newspeak parsing combinators

I've been working on a parser for Javascript/Ecmascript using Newspeak parsing combinators. The parser is based on the grammar presented in the ECMAScript Language Specification [PDF] document. It is still incomplete, however it can parse simple statements.

The grammar looks like this:


class JSGrammar = ExecutableGrammar (
"Experiment for JS grammar based on the description from http://www.ecma-international.org/publications/standards/Ecma-262.htm"
|
doubleQuote = (char: $").
backslash = (char: $\).
str = doubleQuote,((backslash, ( char: $" )) |
(backslash, ( char: $/ )) |
(backslash, backslash) |
(backslash, ( char: $r )) |
(backslash, ( char: $n )) |
(backslash, ( char: $t )) |
(charExceptFor: $")) star, doubleQuote.
string = tokenFor: str.

tilde = char: $~.
exclamation = char: $!.
starChar = char: $*.
slash = char: $/.
modulo = char: $%.
pipe = char: $|.
amp = char: $&.
cir = char: $^.
question = char: $?.
colon = char: $:.
semicolon = char: $;.

negSign = (char: $-).
plusSign = (char: $+).
digit = (charBetween: $0 and: $9).
dot = (char: $. ) .
lt = char: $<.
gt = char: $>.
eq = char: $=.
num = negSign opt, digit, digit star, dot opt,digit star, ((char: $e) | (char: $E)) opt, (plusSign | negSign) opt,digit star.
number = tokenFor: num.

tQuestion = tokenFor: question.
tColon = tokenFor: colon.
tplusSign = tokenFor: plusSign.
tnegSign = tokenFor: negSign.
tmodulo = tokenFor: modulo.
tslash = tokenFor: slash.
tstarChar = tokenFor: starChar.
texclamation = tokenFor:exclamation.
tdot = tokenFor:dot.
tLt = tokenFor: lt.
tGt = tokenFor: gt.
tEq = tokenFor: eq.
tAmp = tokenFor: amp.
tPipe = tokenFor: pipe.
tCir = tokenFor: cir.
tSlash = tokenFor: slash.
tSemicolon = tokenFor: semicolon.

tStarEq = tstarChar,eq.
tModEq = tmodulo,eq.
tSlashEq = tSlash,eq.
tPlusEq = tplusSign,eq.
tMinusEq = tnegSign,eq.
tAmpAmp = tAmp,amp.
tPipePipe = tPipe,pipe.
tLtEq = tLt,eq.
tGtEq = tGt,eq.
tleftShift = tLt,lt.
trightShift = tGt,gt.
tsRightShift = tGt,gt,gt.
tEqEq = tEq,eq.
tEqEqEq = tEq,eq,eq.
tNotEq = texclamation,eq.
tNotEqEq = texclamation,eq,eq.
tleftShiftEq = tleftShift,eq.
trightShiftEq = trightShift,eq.
tsRightShiftEq = tsRightShift,eq.
tAmpEq = tAmp,eq.
tPipeEq = tPipe,eq.
tCirEq = tCir,eq.

lineTerminator = (char: (Character lf)) | (char: (Character cr)).

regularExpressionLiteral =
tslash,
( ((backslash, ( charExceptForCharIn: { (Character lf). (Character cr). })) |
(charExceptForCharIn: { (Character lf). (Character cr). $/.})) plus),
slash, (identifierStart star).

leftparen = tokenFromChar: $(.
rightparen =tokenFromChar: $).

leftbrace = tokenFromChar: ${.
rightbrace =tokenFromChar: $}.
comma = tokenFromChar: $,.
propertyName = string | identifier | number.
propertyNameAndValue = propertyName,tColon,value.
obj = leftbrace, (propertyNameAndValue starSeparatedBy: comma),rightbrace.
object = obj.

leftbracket = tokenFromChar: $[.
rightbracket = tokenFromChar: $].
arr = leftbracket, (value starSeparatedBy: comma), rightbracket.
array = tokenFor: arr.

comment = (slash,starChar,blockCommentBody,starChar,slash) | (slash,slash, lineCommentBody).


ttrue = tokenFromSymbol: #true.
tfalse = tokenFromSymbol: #false.
null = tokenFromSymbol: #null.
function = tokenFromSymbol: #function.
tnew = tokenFromSymbol: #new.
break = tokenFromSymbol: #break.
case = tokenFromSymbol: #case.
catch = tokenFromSymbol: #catch.
continue = tokenFromSymbol: #continue.
default = tokenFromSymbol: #default.
delete = tokenFromSymbol: #delete.
do = tokenFromSymbol: #do.
else = tokenFromSymbol: #else.
finally = tokenFromSymbol: #finally.
for = tokenFromSymbol: #for.
if = tokenFromSymbol: #if.
in = tokenFromSymbol: #in.
instanceof = tokenFromSymbol: #instanceof.
return = tokenFromSymbol: #return.
switch = tokenFromSymbol: #switch.
this = tokenFromSymbol: #this.
throw = tokenFromSymbol: #throw.
try = tokenFromSymbol: #try.
typeof = tokenFromSymbol: #typeof.
var = tokenFromSymbol: #var.
void = tokenFromSymbol: #void.
while = tokenFromSymbol: #while.
with = tokenFromSymbol: #with.



letter = (charBetween: $a and: $z) | (charBetween: $A and: $Z).
identifierStart = letter | (char: $$) | (char: $_).
identifier = accept: (tokenFor: (identifierStart), (identifierStart | digit) star) ifNotIn: keywords .

value = assignmentExpression .

literal = null | ttrue | tfalse | number | string | regularExpressionLiteral.

primaryexpression = this | literal | identifier | array | object | parenthesized.

parenthesized = leftparen,expression,rightparen.

functionexpression = function , identifier opt,
leftparen,formalParameterList , rightparen ,
leftbrace,sourceElements,rightbrace.
formalParameterList = identifier starSeparatedBy: comma.

memberexpression = (simplememberexpression ),
(( leftbracket, expression, rightbracket) | ( tdot, identifier)) star.

simplememberexpression = primaryexpression |
functionexpression |
simpleNewExpression.
simpleNewExpression = tnew,memberexpression, arguments.



callExpression = (simplememberexpression ),
( arguments |
( leftbracket, expression, rightbracket) |
( tdot, identifier) ) star.
simpleCallExpression = memberexpression , arguments.
arguments = leftparen ,
(assignmentExpression, (comma, assignmentExpression) star) opt,
rightparen.

newExpression = memberexpression | simpleNewMemberExpression.
simpleNewMemberExpression = tnew, memberexpression.

plusPlus = plusSign,plusSign.
minusMinus = negSign,negSign.

leftHandSideExpression = callExpression | newExpression .
postfixExpression = leftHandSideExpression ,
((plusPlus | minusMinus) star).


unaryExpression = postfixExpression | complexUnaryExpression.

complexUnaryExpression =
(typeof, unaryExpression) |
(delete, unaryExpression) |
(void, unaryExpression) |
(plusPlus, unaryExpression) |
(minusMinus, unaryExpression) |
((tokenFor: ( plusSign | negSign | tilde | exclamation )), unaryExpression).

multiplicativeExpression =
unaryExpression, ((tstarChar | tslash | tmodulo), unaryExpression) star.

additiveExpression =
multiplicativeExpression, ((tplusSign | tnegSign), multiplicativeExpression) star.

shiftExpression =
additiveExpression, ((tsRightShift | tleftShift | trightShift), additiveExpression) star.

relationalExpression =
shiftExpression, (( tLtEq | tGtEq | tLt | tGt | instanceof | in) , shiftExpression) star.

relationalExpressionNoIn =
shiftExpression, (( tLtEq | tGtEq | tLt | tGt | instanceof ) , shiftExpression) star.

equalityExpression =
relationalExpression, ((tEqEqEq | tEqEq | tNotEqEq | tNotEq ), relationalExpression) star.

equalityExpressionNoIn =
relationalExpressionNoIn, ((tEqEqEq | tEqEq | tNotEqEq | tNotEq ), relationalExpressionNoIn) star.

bitwiseANDExpression =
equalityExpression,(tAmp, equalityExpression) star.

bitwiseANDExpressionNoIn =
equalityExpressionNoIn,(tAmp, equalityExpressionNoIn) star.

bitwiseXORExpression =
bitwiseANDExpression,(tCir, bitwiseANDExpression) star.

bitwiseXORExpressionNoIn =
bitwiseANDExpressionNoIn,(tCir, bitwiseANDExpressionNoIn) star.

bitwiseORExpression =
bitwiseXORExpression,(tPipe, bitwiseXORExpression) star.

bitwiseORExpressionNoIn =
bitwiseXORExpressionNoIn,(tPipe, bitwiseXORExpressionNoIn) star.

logicalAndExpression =
bitwiseORExpression, (tAmpAmp,bitwiseORExpression) star.

logicalAndExpressionNoIn =
bitwiseORExpressionNoIn, (tAmpAmp,bitwiseORExpressionNoIn) star.

logicalOrExpression =
logicalAndExpression, (tPipePipe,logicalAndExpression) star.

logicalOrExpressionNoIn =
logicalAndExpressionNoIn, (tPipePipe,logicalAndExpressionNoIn) star.

assignmentOperator =
tEq | tStarEq | tSlashEq | tModEq | tPlusEq | tMinusEq |
tleftShiftEq | tsRightShiftEq | trightShiftEq | tAmpEq | tPipeEq |
tCirEq.

conditionalExpression =
logicalOrExpression, (tQuestion, assignmentExpression,tColon,assignmentExpression) opt.

assignmentExpression =
conditionalExpression, (assignmentOperator,conditionalExpression) star.

conditionalExpressionNoIn =
logicalOrExpressionNoIn, (tQuestion, assignmentExpressionNoIn,tColon,assignmentExpressionNoIn) opt.

assignmentExpressionNoIn =
conditionalExpressionNoIn, (assignmentOperator,conditionalExpressionNoIn) star.

expression = assignmentExpression, (comma , assignmentExpression) star.

expressionNoIn = assignmentExpressionNoIn, (comma , assignmentExpressionNoIn) star.


statement = block | variableStatement | emptyStatement | expressionStatement |
ifStatement | iterationStatement | withStatement | switchStatement |
labelledStatement | tryStatement | throwStatement |
breakStatement | returnStatement.

block = leftbrace, statementList , rightbrace.

statementList = statement star.
variableStatement = var, variableDeclarationList, tSemicolon.

variableDeclarationList = (variableDeclaration plusSeparatedBy: comma).
variableDeclaration = identifier, (tEq,assignmentExpression) opt.

variableDeclarationListNoIn = (variableDeclarationNoIn plusSeparatedBy: comma).
variableDeclarationNoIn = identifier, (tEq,assignmentExpressionNoIn) opt.

emptyStatement = tSemicolon.

expressionStatement = (((function | leftbrace) not) & expression), tSemicolon.

ifStatement = if, leftparen,expression,rightparen,statement,(else,statement) opt.

iterationStatement = doStatement | forStatement | forStatementNoVar |
whileStatement | forInStatement | forInStatementNoVar.

doStatement = do, statement, while, leftparen,expression,rightparen,tSemicolon.

forStatement = for, leftparen,
var, variableDeclarationListNoIn ,
tSemicolon,
(expression opt),
tSemicolon,
(expression opt),
rightparen, statement.

forStatementNoVar = for, leftparen,
(expression opt),
tSemicolon,
(expression opt),
tSemicolon,
(expression opt),
rightparen, statement.


forInStatement = for, leftparen,
var, variableDeclarationListNoIn ,
in,
expression ,
rightparen, statement.

forInStatementNoVar = for, leftparen,
leftHandSideExpression ,
in,
expression ,
rightparen, statement.
whileStatement = while,leftparen,expression,rightparen,statement.

continueStatement = continue, (identifier opt), tSemicolon.

breakStatement = break, (identifier opt), tSemicolon.

returnStatement = return, (expression opt), tSemicolon.

withStatement = with, leftparen, expression ,rightparen, statement.

switchStatement = switch ,leftparen,expression,rightparen, clauseBlock.

clauseBlock = leftbrace,(clause star),(defaultClause opt),rightbrace.

clause = case, expression, tColon,statementList.

defaultClause = default,tColon, statementList.

labelledStatement = identifier,tColon,statement.

throwStatement = throw,expression,tSemicolon.

tryStatement = try, block, (catchBlock opt), (finallyBlock opt).

catchBlock = catch, leftparen, identifier,rightparen, block.

finallyBlock = finally, block.

functionDeclaration = function,identifier,
leftparen,formalParameterList,rightparen,
leftbrace,sourceElements,rightbrace.



sourceElements = (statement | functionDeclaration ) star.

program = sourceElements.
|
)

...


I really like the way you can separate the grammar from the AST creation (as described in the Executable Grammars[PDF] paper). As you can see there's no code specified for this purpose. Along with the source code there's is a 'testing AST' and a parser that inherits from this grammar which is used by the unit tests .

Almost all the grammar was written using the parser combinators from the library. Only charExceptFor: and accept: ifNotIn: were created.

There are still a lot work to do with this parser:

  1. Clean up the code
  2. Work on performance issues
  3. Find a solution for the "Automatic Semicolon Insertion" feature (see section 7.9) of the Ecma document)
  4. Get rid of some repetition (for example the 'NoVar' productions which are also present in the document)
  5. Better AST creation
  6. See if unicode support is possible
  7. More tests!


In order to see the result of using this parser I created a little program to display the 'testing AST'. For example:



The parser with tests and the other code mentioned can be found here .