Creating a JavaScript Parser

I recently decided to implement a JavaScript parser.  To be more specific, I implemented an ECMAScript 5.1 parser.  This post describes the process in detail, including the role of parsers in compiler design and the various design decisions and hurdles I encountered along the way.

Overview

Parsers are responsible for enforcing a language’s syntax rules, which are specified in a grammar.  There are numerous tools, known as parser generators, which can read a grammar and output a corresponding parser.  For this project, I chose the Jison parser generator because it outputs JavaScript parsers which can easily be used on the web.

In addition to enforcing syntax, a parser is also responsible for constructing an abstract syntax tree (AST) representation of a program.  Mozilla has implemented a reflection of its SpiderMonkey parser, known as the Parser API.  The Parser API specifies the structure of the AST and its nodes.  I have chosen to use a similar structure for my parser’s AST.
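
For a sense of what these trees look like, here is roughly the Parser API shape of a one-line program (source location information omitted; my node constructors follow a similar layout, though the exact field names here are an assumption):

// Approximate Parser API tree for:  var answer = 42;
var ast = {
    type: "Program",
    body: [{
        type: "VariableDeclaration",
        kind: "var",
        declarations: [{
            type: "VariableDeclarator",
            id: { type: "Identifier", name: "answer" },
            init: { type: "Literal", value: 42 }
        }]
    }]
};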

Using Jison

Jison is easy to install.  Once you have it installed, you can run it from the command line as shown below.  The resulting parser is placed in a file with the same name as your grammar file and a “js” extension.  For example, my grammar file is named ecmascript.jison, and my parser is named ecmascript.js.  Of course, you will need a grammar file.  You can download my JavaScript grammar file or create your own.

jison grammar_file
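
If Jison was installed from npm (npm install -g jison), the generated file can also be loaded as a Node module.  A minimal usage sketch, assuming the standard parse() entry point that Jison-generated modules expose:

// Load the generated parser and turn source code into an AST.
var parser = require("./ecmascript.js");
var ast = parser.parse("var answer = 42;");
console.log(ast);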

A Word on Conflicts

Non-trivial grammars will likely contain ambiguities which cause Jison to generate warnings about shift-reduce conflicts and reduce-reduce conflicts.  These types of conflicts sometimes arise when working with the shift-reduce style parsers generated by Jison.  Without going into too much detail, shift-reduce parsers work by maintaining a stack of tokens.  Tokens are pushed, or shifted, onto the stack as they are encountered.  When a sequence of tokens corresponding to a grammar production is encountered, the stack is popped, or reduced.  A shift-reduce conflict occurs when, due to an ambiguity in the grammar, the parser could correctly either shift or reduce.  To resolve shift-reduce conflicts, parsers generally default to shifting.  Reduce-reduce conflicts are more serious, as the parser could reduce by multiple productions.  These conflicts should be resolved by removing the problematic ambiguities from the grammar.  If you choose to use my Jison file, please disregard the shift-reduce conflict warning messages.
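
The classic illustration of a shift-reduce conflict is JavaScript’s dangling “else”:

// After reading "if (a) if (b) foo();" the parser sees "else" and could
// either shift it (attaching it to the inner "if") or reduce the inner "if"
// first (attaching it to the outer one).  Defaulting to shift produces the
// behavior JavaScript specifies: the "else" binds to the nearest "if".
var a = true, b = false;
function foo() {}
function bar() {}
if (a) if (b) foo(); else bar();   // bar() runs: the "else" belongs to "if (b)"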

The Lexer

The lexer, also referred to as a scanner, reads source code and returns a stream of tokens to the parser.  Tokens correspond to the terminal symbols of your grammar, and can be thought of as the most basic building blocks of a programming language.  Much like parsers, scanners can also be generated automatically by specifying tokens as regular expressions.
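
As a rough illustration (these token names are invented for the example, not necessarily the ones my grammar declares), the scanner might turn one statement into a stream like this:

// Source text:
//     var answer = 42;
// Token stream handed to the parser:
//     VAR   IDENTIFIER("answer")   "="   NUMERIC_LITERAL(42)   ";"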

Jison conveniently allows you to specify both your scanner and grammar in the same file.  In my Jison file, the scanner is located between the lines labeled “%lex” and “/lex”.  The first section of the scanner consists of a number of named regular expressions (DecimalDigit, HexDigit, etc.).  Named regular expressions can be referenced by placing the name inside of curly braces (e.g. {DecimalDigit}).  This is especially useful for very long regular expressions such as UnicodeIdentifierStart and UnicodeIdentifierPart, which are used in JavaScript identifiers.  Note that I did not write the extremely long Unicode regular expressions myself; I found them after some digging around on Stack Overflow.
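
In Jison’s lexer syntax, these named expressions are simply name/pattern pairs that appear before the rules section.  A shortened, made-up sample (the real definitions in my file are considerably longer):

DecimalDigit            [0-9]
HexDigit                [0-9a-fA-F]
IdentifierStart         [$_a-zA-Z]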

After the named regular expressions comes the following line:

%x REGEXP

This line defines a start condition named “REGEXP”.  Start conditions are used to conditionally activate rules in the scanner.  This particular start condition is used to handle regular expression literals, which are somewhat of a separate language within JavaScript. Start conditions can be specified as inclusive or exclusive using “%s” or “%x”.  Inclusive start conditions can match rules that do not specify a start condition, while exclusive start conditions cannot.  This is important as we want to isolate the regular expression rules from all of the others.  For example, take the following JavaScript code:

var foo = /'/, bar = /'/;

If the “REGEXP” start condition were inclusive, the string literal rules would remain active inside the regular expressions, and the two matching single quotes could incorrectly be scanned as a string literal.

The scanner’s rules are located between the two lines labeled “%%”.  Individual rules are made up of a token regular expression and semantic actions.  When the scanner matches a regular expression, it performs the corresponding semantic actions.  For most of the tokens, the semantic action simply returns the token.  Note that semantic actions which span multiple lines are enclosed between “%{” and “%}”.
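
A few representative rules, with invented token names and heavily simplified patterns, might look like the following.  The last rule shows how a start condition name in angle brackets restricts a rule to the “REGEXP” state:

"if"                      return "IF";
"else"                    return "ELSE";
{DecimalDigit}+           return "NUMERIC_LITERAL";
\s+                       /* skip whitespace */
<REGEXP>[^/\r\n\\]+       return "REGEXP_CONTENTS";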

The final piece of the scanner is a function named lex().  This function is automatically generated by Jison.  For my purposes, I needed to track line breaks.  Therefore, I overwrote the existing lex() function with my own.  My function is simply a wrapper which tracks line breaks and then calls the original lex().
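
The wrapper looks roughly like the following sketch (assuming lexer refers to the generated lexer and parser to the generated parser; the real code is in my Jison file):

// Wrap the generated lex() so that the parser can tell whether the most
// recently returned token was preceded by a line break.  The newLine and
// wasNewLine flags are consulted later by parseError().
var originalLex = lexer.lex;
lexer.lex = function() {
    var startLine = this.yylineno;
    var token = originalLex.call(this);
    parser.wasNewLine = parser.newLine;
    parser.newLine = this.yylineno > startLine;   // crossed a line break?
    return token;
};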

Creating the Grammar

The remainder of my Jison file following the “/lex” line is the ECMAScript grammar.  I took the bulk of my grammar directly from the productions provided in the specification document.  An example Jison production for an “if” statement is shown below.  In this example, “IfStatement” is a non-terminal grammar symbol representing “if” and “if-else” statements. When the parser recognizes an “IfStatement”, it creates an “IfStatementNode” AST node.

IfStatement
    : "IF" "(" Expression ")" Statement
        {
            $$ = new IfStatementNode($3, $5, null, createSourceLocation(null, @1, @5));
        }
    | "IF" "(" Expression ")" Statement "ELSE" Statement
        {
            $$ = new IfStatementNode($3, $5, $7, createSourceLocation(null, @1, @7));
        }
    ;

Unfortunately, there are a few corner cases that required me to extend my grammar beyond the specification.  For example, in my grammar you will notice a number of rules whose names end with “NoBrace” or “NoBF”.  These suffixes stand for “No Curly Brace” and “No Curly Brace or Function”, and they are required because an “ExpressionStatement” cannot begin with an opening curly brace or the “function” keyword.  An opening curly brace at the start of a statement would otherwise be ambiguous between an object literal and a block statement, and the “function” keyword would be ambiguous between a function declaration and a function expression.
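
The ambiguity is easy to demonstrate with a few lines of JavaScript:

{ answer: 42 }       // a block containing the labeled statement "answer: 42"
({ answer: 42 });    // wrapped in parentheses, the same text is an object literal
function f() {}      // at statement level, a function declaration
(function g() {});   // inside an expression, a function expression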

Automatic Semicolon Insertion

I have previously written about the nuisance which is automatic semicolon insertion.  However, after creating a parser, I have a new level of hatred for this language “feature.”  You can read my previous article for a better explanation of the topic, but basically the JavaScript parser automatically inserts semicolons in certain situations.  Here are the rules for automatic semicolon insertion (a couple of examples follow the list):

  • If a token (called the offending token) causes a parsing error, then the parser will insert a semicolon before the offending token, provided the offending token is separated from the previous token by at least one line break.
  • A semicolon is also inserted if the offending token is a closing curly brace or the End-of-File (EOF).
  • If a token that is allowed by a restricted production actually follows a line break at a point where the grammar specifies “[no LineTerminator here]”, then a semicolon is inserted before it.
  • No semicolon can be inserted if the semicolon would be parsed as an empty statement or as one of the two semicolons in the head of a “for” loop.
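
A couple of examples of these rules at work:

// Rule 1: the second "var" is an offending token separated from "1" by a
// line break, so a semicolon is inserted after the 1.
var a = 1
var b = 2

// Rule 3: "return" is a restricted production, so the line break after it
// triggers insertion and this function returns undefined rather than 42.
function f() {
    return
        42;
}
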
To handle the first two conditions, I modified the scanner to track line breaks and overrode the parser’s parseError() method as shown below.

parser.parseError = function(str, hash) {
  if (!((hash.expected && hash.expected.indexOf("';'") >= 0) && 
        (hash.token === "}" || hash.token === "EOF" ||
         parser.newLine || parser.wasNewLine)))
  {
    throw new SyntaxError(str);
  }
};

When a parsing error occurs, I first check to see if a semicolon was expected.  If so, I check if the offending token is a closing curly brace, EOF, or if a line break occurred.  If these conditions are true, then I do not throw a syntax error.  Instead, I introduce productions into the grammar which handle the “error” token.  For example, the “ExpressionStatement” production now looks like this:

ExpressionStatement
    : ExpressionNoBF ";"
        {
            $$ = new ExpressionStatementNode($1, createSourceLocation(null, @1, @2));
        }
    | ExpressionNoBF error
        {
            $$ = new ExpressionStatementNode($1, createSourceLocation(null, @1, @1));
        }
    ;

To handle the “break”, “continue”, “return” and “throw” restricted productions, I created the “parser.restricted” variable.  This variable is set while a restricted production is being processed.  When a line break is encountered during a restricted production, the scanner automatically inserts a semicolon.
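
A minimal sketch of the idea, with assumed token names and without the bookkeeping a real implementation needs (in particular, a real version must hold on to the token it displaces and return it on the next call):

// Could be called from the lex() wrapper shown earlier: if a restricted
// keyword was just seen and a line break follows, hand the parser a ";"
// token in place of the current one.
function checkRestricted(token) {
    if (parser.restricted && parser.newLine) {
        parser.restricted = false;
        return ";";
    }
    parser.restricted = (token === "RETURN" || token === "BREAK" ||
                         token === "CONTINUE" || token === "THROW");
    return token;
}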

The final restricted production involves the postfix increment and decrement operators.  Unlike the other restricted productions, there is no keyword immediately before the “[no LineTerminator here]” annotation.  Instead, I introduce two new tokens, “BR++” and “BR--”, corresponding to whitespace followed by an increment or decrement.  I also updated the parseError() method again to detect these two tokens.
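
The effect is the same as with the keyword-based restricted productions:

var a = 1, b = 2;
a
++
b
// A postfix "++" may not be preceded by a line break, so a semicolon is
// inserted after "a" and this is parsed as:  a;  ++b;   (not  a++;  b;)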

Regular Expression Literals

As I previously mentioned, regular expression literals pose a problem because they are essentially a mini language nested inside of JavaScript.  They are also problematic because their opening slash is ambiguous with the division operators (“/” and “/=”).  Fortunately, regular expression literals and division operators are not used in the same places within any language productions.  To handle regular expression literals, I look for a forward slash in productions involving literals.  When one is detected, the scanner activates the “REGEXP” start condition.  After the regular expression has been handled, the scanner switches back to its default rules.
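
To see the ambiguity, compare the following two statements; when the scanner reaches the slash, the character alone does not say which interpretation is correct:

var a = 6, b = 3, c = 1;
var quotient = a / b / c;      // "/" is the division operator, used twice
var pattern  = /a\/b/g;        // here "/" begins a regular expression literal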

Writing a Print Pass

In compiler speak, a pass is one traversal of the source code or AST.  Compilers often divide work logically between passes.  For example, one pass might perform semantic analysis, while another performs code generation, and several other passes perform code optimizations.

Implementing a pass sometimes requires extending the AST nodes.  In JavaScript this means accessing the AST node constructor functions.  As my parser currently stands, the AST node constructors are wrapped inside the parser.  I did this so that the parser could be used without polluting the global namespace.  Realizing that future passes might require access to the constructors, I exposed them by extending the Jison generated parser.  The parser object now includes a property named “ast” which holds the constructor functions for each type of AST node.  Assuming the parser object is named “parser”, a method named “foo” can be added to the “ProgramNode” constructor like this:

parser.ast.ProgramNode.prototype.foo = function() {};

Using this approach, I implemented a simple print pass which outputs a JavaScript representation of the AST.  By wrapping the entire pass in an immediately-invoked function expression (IIFE), the global namespace is kept clean.  I must point out that there are several noticeable differences between the original source code and the code generated by my print pass.  First, comments and whitespace are not preserved.  Second, the print pass encloses most expressions in parentheses.  This is done to maintain operator precedence, since parentheses from the source code are not kept in the AST.
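
As a sketch of how such a pass is structured (the print() method name and the node field names here are assumptions based on the Parser API; the actual pass is in the repo):

(function(parser) {
    // Give each node type a print() method; calling print() on the root
    // ProgramNode walks the tree recursively.
    parser.ast.ProgramNode.prototype.print = function() {
        return this.body.map(function(statement) {
            return statement.print();
        }).join("\n");
    };

    parser.ast.ExpressionStatementNode.prototype.print = function() {
        // Parenthesize the expression, since the parentheses that appeared
        // in the original source are not recorded in the AST.
        return "(" + this.expression.print() + ");";
    };
})(parser);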

Strict Mode

JavaScript supports a more restrictive mode of execution, known as strict mode.  I have decided that my parser will target non-strict mode, since non-strict mode is a superset of strict mode.  Additionally, I believe that it would be trivial to implement strict mode’s rules in a semantic analysis pass.

The Final Product

You can download all of the source code from the jsparser GitHub repo.  I have also embedded the parser into an example page.  The page allows you to type in arbitrary JavaScript code.  Press the “Parse” button to parse your code and run the print pass.  The print pass output is displayed in another text field below your code.  I have successfully parsed the jQuery 1.8.1 source code, but it is very possible that bugs still exist.  If you find any bugs, please let me know.  Also feel free to suggest a fix!

