Python Code Generator for ANTLR 2.7.5

With the release of ANTLR 2.7.5, you can now generate your Lexers, Parsers and TreeParsers in Python. This feature extends the benefits of ANTLR's predicated-LL(k) parsing technology to the Python language and platform.

To build and use Python Lexers, Parsers and TreeParsers, you need the ANTLR Python runtime library installed in your Python path. The Python runtime model is based on the existing runtime model for Java and should therefore be immediately familiar. The two runtimes are very similar, although there are a number of subtle (and not so subtle) differences, some of which result from differences in the respective runtime environments.

ANTLR Python support was contributed (and is to be maintained) by Wolfgang Haefelinger and Marq Kole.

Building the ANTLR Python Runtime

The ANTLR Python runtime source and build files are completely integrated in the ANTLR build process. The ANTLR runtime support module for Python is located in the lib/python subdirectory of the ANTLR distribution. To enable installation of the Python runtime support, pass the --enable-python option to the configure script, for instance:

./configure --enable-python --prefix=$HOME

With Python support enabled, the current distribution looks for a python executable of version 2.2 or higher. If it finds such a beast, it generates and installs the ANTLR Python runtime as part of the overall ANTLR build and installation process.
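The version rule above can also be checked by hand before running configure. The following is a minimal sketch and not part of the ANTLR build: the names MINIMUM_VERSION and is_supported are made up for illustration, and the code only mirrors the "2.2 or higher" rule stated here, not whatever test the configure script actually performs.

```python
import sys

# Assumption: the minimum version is taken from the text above (2.2).
MINIMUM_VERSION = (2, 2)


def is_supported(version_info=None):
    """Return True if a (major, minor, ...) tuple is at least 2.2.

    With no argument, the running interpreter is checked.
    """
    if version_info is None:
        version_info = sys.version_info
    # Compare only major and minor, e.g. (2, 3) >= (2, 2).
    return tuple(version_info[:2]) >= MINIMUM_VERSION


if __name__ == "__main__":
    major, minor = sys.version_info[0], sys.version_info[1]
    print("python %d.%d acceptable: %s" % (major, minor, is_supported()))
```

Run it with the interpreter you intend to pass to configure. A True result says only that the version number satisfies the rule, not that the runtime has been tested on that interpreter.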

If the python distribution you are using is at an unusual location, perhaps because you are using a local installation instead of a system-wide one, you can provide the location of that python executable using the --with-python=<path> option for the configure script, for instance:

./configure --enable-python --prefix=$HOME --with-python=$HOME/bin/python2.3

Also, if the python executable is at a regular location, but has a name that differs from "python", you can specify the correct name through the $PYTHON environment variable.

export PYTHON=python2.3
./configure --enable-python --prefix=$HOME

All the example grammars can be built and run in one go by running make in the examples/python subdirectory of the ANTLR distribution.

# Build all examples and run them
cd examples/python ; make
# Clean all examples
make clean

Specifying Code Generation

You can instruct ANTLR to generate your Lexers, Parsers and TreeParsers using the Python code generator by adding the following entry to the global options section at the beginning of your grammar file.

options {
    language="Python";
}

After that, things are pretty much the same as in the default Java code generation mode. See the examples in examples/python for some illustrations.

One particular issue that is worth mentioning is the handling of comments in ANTLR Python. Java, C++, and C# all use the same lexical structures to define comments: // for single-line comments, and /* ... */ for block comments. Unfortunately, Python does not handle comments this way. It only knows about single-line comments, and these start off with a # symbol.

Normally, all comments outside of actions are comments in the ANTLR input language. These comments, both block comments and single-line comments, are translated into Python single-line comments.

Comments inside actions, on the other hand, should be comments in the target language, Python in this case. Unfortunately, if the actions contain ANTLR action symbols such as $getText, the code generator seems to choke on Python comments, because the # sign is also used in tree construction. The solution is to use Java/C++-style comments in all actions; ANTLR translates these into Python comments as it checks the actions for predefined action symbols such as $getText.

So, as a general rule: all comments in an ANTLR grammar for the Python target should be in Java/C++ style, not in Python style.
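To make the translation concrete, here is a toy sketch of the kind of rewriting described above. It is not ANTLR's implementation: the function name to_python_comments is hypothetical, and the regular expression is deliberately naive (it does not protect // or /* inside string literals, and it does not preserve the indentation of continuation lines in a block comment).

```python
import re

# Matches either a single-line '//' comment or a '/* ... */' block comment.
_COMMENT = re.compile(r"//([^\n]*)|/\*(.*?)\*/", re.S)


def _rewrite(match):
    """Turn one Java/C++-style comment into Python '#' comment line(s)."""
    if match.group(1) is not None:
        # '// text' becomes '# text'
        return "#" + match.group(1)
    # A block comment becomes one '#' line per line of its body.
    lines = match.group(2).strip().splitlines()
    return "\n".join("# " + line.strip() for line in lines)


def to_python_comments(action_text):
    """Illustrative only: rewrite Java/C++ comments as Python comments."""
    return _COMMENT.sub(_rewrite, action_text)


if __name__ == "__main__":
    print(to_python_comments("x = 1 // set x"))
    print(to_python_comments("/* first line\n   second line */\ny = 2"))
```

The point of the sketch is only that both comment forms end up as single-line # comments in the generated Python source, which is why the grammar itself can stay free of # characters.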

Python-Specific ANTLR Sections

Python-Specific ANTLR Options

A Template Python ANTLR Grammar File

Because Python handles modules (packages, in Java speak) differently from Java, ANTLR's usual approach of naming both the generated file and the class it contains after the grammar is awkward. The Python code generator therefore takes an approach that better reflects Python's handling of modules. The name of the generated Python file is still derived from the name of the grammar, but the name of the class is fixed by the kind of grammar: a lexer grammar generates a class Lexer, a parser grammar generates a class Parser, and a treeparser grammar generates a class Walker.

header {
    // gets inserted in the Python source file before any
    // generated declarations
}
options {
    language  = "Python";
}
{
   // global code stuff that will be included in the 'MyParser.py' source
   // file just before the 'Parser' class below
   ...
}
class MyParser extends Parser;
options {
   exportVocab=My;
}
{
   // additional methods and members for the generated 'Parser' class
   ...
}
... generated RULES go here ...
{
   // global code stuff that will be included in the 'MyLexer.py' source
   // file just before the 'Lexer' class below
   ...
}
class MyLexer extends Lexer;
options {
   exportVocab=My;
}
{
   // additional methods and members for the generated 'Lexer' class
   ...
}
... generated RULES go here ...
{
   // global code stuff that will be included in the 'MyTreeParser.py' source
   // file just before the 'Walker' class below
   ...
}
class MyTreeParser extends TreeParser;
options {
   exportVocab=My;
}
{
   // additional methods and members for the generated 'Walker' class
   ...
}
... generated RULES go here ...
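The naming rule used in the template can be summarized in a few lines of Python. This is a sketch for illustration only: the helper generated_names and the table CLASS_NAME_BY_KIND are hypothetical and not part of ANTLR, and the ".py" file name follows the 'MyParser.py' comment in the template above.

```python
# Fixed class name for each kind of grammar, as described in the text.
# The keys are the superclass names used in 'class X extends Y;'.
CLASS_NAME_BY_KIND = {
    "Lexer": "Lexer",
    "Parser": "Parser",
    "TreeParser": "Walker",
}


def generated_names(grammar_name, grammar_kind):
    """Return (module_file, class_name) for a grammar.

    The file name is derived from the grammar name; the class name
    depends only on the kind of grammar.
    """
    return grammar_name + ".py", CLASS_NAME_BY_KIND[grammar_kind]


if __name__ == "__main__":
    for name, kind in [("MyParser", "Parser"),
                       ("MyLexer", "Lexer"),
                       ("MyTreeParser", "TreeParser")]:
        print("%s (%s) -> %s, class %s"
              % ((name, kind) + generated_names(name, kind)))
```

Under this scheme, client code would presumably import the module named after the grammar and refer to the fixed class name inside it, for example the class Parser in the module MyParser; the examples in examples/python show the actual usage.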

Version numbers in parentheses show the tool version used for development and testing. Older versions may work as well. Python 2.2 or better is required, as I'm using some recent Python features (super(), for example).

Known Bugs and Limitations

E0004

ANTLR requires that a rule's return specification always contains a type and an identifier. Python does not know about types, so the type information is ignored but still needs to be present. Furthermore, ANTLR's API does not give access to the identifier. Therefore the variable 'r' is used for returning values, no matter what identifier is listed.

For example:

expr returns [float f]
{
    r = 0
}
    : #(EXPR r = multexpr())
    ;

Note that 'r' is used even if 'f' is given!

L0001

There's no documentation available but the source code.

L0002

Performance should be improved.

Miscellaneous Notes