{ and } characters in your rules. In general, you are advised to keep the code you embed within these actions, and the grammar itself, to an absolute minimum. Rather than embed code directly in your grammar, you should construct an API that is called from the actions within your grammar. This way you keep the grammar clean and maintainable, and separate the code generators or other code from the definition of the grammar itself.
However, when you wish to call your API functions, or insert small pieces of code that do not warrant external functions, you will need to access elements of tokens, return elements from parser rules and perhaps the internals of the recognizer itself. The C runtime provides a number of macros that you can use within your action code. It also provides a number of performant structures that you may find useful for building symbol tables, lists, tries, stacks, arrays and so on (all of which are managed so that your memory allocation problems are minimized).
You should note that if your parser rule returns more than a single entity, then the return type of the generated rule function is a struct, which is returned by value. This is also the case if your rule is part of a tree building grammar (one that uses the output=AST; option).
Other than the notes above, you can use any pre-declared type as an input or output parameter for your rule.
For performance reasons, and to avoid thrashing the malloc allocation system, memory for many elements of your generated parser is allocated in chunks and parcelled out by factories. For instance, memory for tokens is created as an array of tokens, and a token factory hands out the next available slot to the lexer. When you free the lexer, the allocated memory is returned to the pool. The same applies to 'strings' that contain the token text and various other text elements accessed within the lexer.
The only side effect of this is that after your parse and analysis is complete, if you wish to retain anything generated automatically, you must copy it before freeing the recognizer structures. In practice it is usually practical to retain the recognizer context objects until your processing is complete or to use your own allocation scheme for generating output etc.
The advantage of using object factories is of course that memory leaks and accessing de-allocated memory are bugs that rarely occur within the ANTLR3 C runtime. Further, allocating memory for tokens, trees and so on is very fast.
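The factory idea, and the need to copy anything you want to keep, can be pictured with a small self-contained sketch. This is not the ANTLR3 C runtime's implementation: the names toy_token, toy_factory, factory_next, factory_free and copy_text, and the chunk size of 4, are all hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_SIZE 4            /* tokens per chunk; tiny so the sketch grows quickly */

/* Hypothetical token: a type plus a pointer to its text. */
typedef struct { int type; const char *text; } toy_token;

typedef struct {
    toy_token **chunks;         /* array of pointers to chunks */
    size_t      nchunks;        /* chunks allocated so far */
    size_t      used;           /* tokens handed out in total */
} toy_factory;

/* Hand out the next free slot, allocating a whole new chunk only when needed. */
static toy_token *factory_next(toy_factory *f)
{
    if (f->used == f->nchunks * CHUNK_SIZE) {
        toy_token **grown = realloc(f->chunks, (f->nchunks + 1) * sizeof *grown);
        assert(grown != NULL);
        f->chunks = grown;
        f->chunks[f->nchunks] = calloc(CHUNK_SIZE, sizeof(toy_token));
        assert(f->chunks[f->nchunks] != NULL);
        f->nchunks++;
    }
    toy_token *t = &f->chunks[f->used / CHUNK_SIZE][f->used % CHUNK_SIZE];
    f->used++;
    return t;
}

/* Release every chunk at once; all tokens handed out become invalid. */
static void factory_free(toy_factory *f)
{
    for (size_t i = 0; i < f->nchunks; i++) free(f->chunks[i]);
    free(f->chunks);
    f->chunks = NULL;
    f->nchunks = 0;
    f->used = 0;
}

/* Copy text you want to keep into storage you own (the caller frees it). */
static char *copy_text(const toy_token *t)
{
    size_t n = strlen(t->text) + 1;
    char *copy = malloc(n);
    assert(copy != NULL);
    memcpy(copy, t->text, n);
    return copy;
}
```

In this sketch a caller would take tokens with factory_next(), call copy_text() on anything that must outlive the parse, and only then call factory_free(). Individual tokens are never freed one by one, which is why leaks and dangling single-token frees do not arise in this scheme.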
The context pointer is used because this removes the need for any global/static variables at all, either within the generated code, or the C runtime. This is of course fundamental to creating free threading recognizers. Wherever a function call or rule call requires the ctx parameter, you either reference it via the CTX macro, or the ctx parameter is in fact the value returned by calling the 'constructor' function for your parser/lexer/tree parser (see the code example in "How to build Generated Code").
Note: Macros that act like statements must be terminated with a ';'. The macro body does not supply this, nor should it. Macros that call functions are declared with () even if they have no parameters; macros that reference fields do not have a () declaration.
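The two ideas above, an explicit context pointer and the two macro styles, can be illustrated with a toy sketch. Everything here is hypothetical: toy_ctx, TOY_CONSUME(), TOY_ATEND() and TOY_USER1 are invented names that only mimic the conventions described and are not part of the ANTLR3 C runtime.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a generated recognizer context. */
typedef struct toy_ctx {
    const char *input;   /* text being scanned */
    size_t      pos;     /* index of the next character to consume */
    unsigned    user1;   /* a plain field, exposed through a field-style macro */
} toy_ctx;

/* Function-style macros: declared with () even when they take no arguments. */
#define TOY_CONSUME()  (ctx->pos++)
#define TOY_ATEND()    (ctx->input[ctx->pos] == '\0')

/* Field-style macro: no (), because it names storage you can read or assign. */
#define TOY_USER1      ctx->user1

/* Every function takes ctx explicitly, so there is no global or static state. */
static size_t toy_skip_spaces(toy_ctx *ctx)
{
    size_t skipped = 0;
    while (!TOY_ATEND() && ctx->input[ctx->pos] == ' ') {
        TOY_CONSUME();          /* the ';' comes from the call site, not the macro */
        skipped++;
    }
    TOY_USER1 = (unsigned)skipped;
    return skipped;
}
```

Because all state lives in the structure that ctx points to, two contexts can be scanned independently (for example on different threads) without interfering with each other.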
LEXER
macro returns a pointer to the base lexer object, which is of type pANTLR3_LEXER. This is not the pointer to your generated lexer, which is supplied by the CTX macro, but to the common implementation of a lexer interface, which is supplied to all generated lexers.
LA
macro returns the character at offset n from the current input stream index. The return type is ANTLR3_UINT32. Hence LA(1) returns the character at the current input position (the character that will be consumed next), LA(-1) returns the character that has just been consumed and so on. The LA(n) macro is useful for constructing semantic predicates in lexer rules. The reference LA(0) is undefined and will cause an error in your lexer.
GETCHARINDEX
macro returns the index of the current character position as a 0 based offset from the start of the input stream. It returns a value type of ANTLR3_UINT32.
GETLINE
macro returns the line number of the current character (LA(1)) in the input stream. It returns a value type of ANTLR3_UINT32. Note that the line number is incremented automatically by an input stream when it sees the input character '\n'.
GETTEXT
macro returns the text currently matched by the lexer rule. In general you should use the generic $text reference in ANTLR to retrieve this. The return type is a reference type of pANTLR3_STRING, which allows you to manipulate the text you have retrieved (NB: this does not change the input stream, only the text you copy from the input stream when you use this macro or $text). The reference $text->chars or GETTEXT()->chars will reference a pointer to the '\0' terminated character string that the ANTLR3 pANTLR3_STRING represents. String space is allocated automatically, as is the structure that holds the string. The pANTLR3_STRING_FACTORY associated with the lexer handles this and when you close the lexer, it will automatically free any space allocated for strings and their structures.
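The lookahead offsets described for LA can be checked against a small stand-alone model. This is only a sketch of the indexing rule, not the runtime's code: toy_input, toy_la, toy_consume, toy_at_decimal and the TOY_EOF value are hypothetical, and rejecting an offset of 0 with an assertion is simply this sketch's way of reflecting that LA(0) is undefined.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_EOF (-1)    /* hypothetical end-of-input marker for this sketch */

typedef struct {
    const char *data;   /* the whole input */
    size_t      len;    /* number of characters in data */
    size_t      next;   /* index of the character that will be consumed next */
} toy_input;

/* Look at the character n positions from the cursor without consuming it.
 * n = 1 is the next character to consume, n = -1 the one just consumed.
 * n = 0 is rejected, mirroring the "undefined" rule in the text. */
static int toy_la(const toy_input *in, int n)
{
    assert(n != 0);
    /* For n > 0 the target index is next + n - 1; for n < 0 it is next + n. */
    long idx = (long)in->next + (n > 0 ? n - 1 : n);
    if (idx < 0 || (size_t)idx >= in->len) return TOY_EOF;
    return (unsigned char)in->data[idx];
}

/* Advance the cursor by one character, stopping at the end of input. */
static void toy_consume(toy_input *in)
{
    if (in->next < in->len) in->next++;
}

/* Example predicate: is the cursor at "<digit>." (such as the start of "1.5")? */
static int toy_at_decimal(const toy_input *in)
{
    int c = toy_la(in, 1);
    return c >= '0' && c <= '9' && toy_la(in, 2) == '.';
}
```

A check like toy_at_decimal() is the kind of test a semantic predicate in a lexer rule might perform: it inspects upcoming characters and leaves the input position untouched.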
GETCHARPOSITIONINLINE
returns the zero based offset of character LA(1) from the start of the current input line. See the macro GETLINE for details on what the line number means.
EMIT
causes the text range currently matched by the lexer rule to be emitted immediately as the token for the rule. Subsequent text is matched but ignored. The type used for the token is the name of the lexer rule or, if you have changed this by using $type = XXX;, the type XXX is used.
EMITNEW
causes the supplied token reference t to be used as the token emitted by the rule. The parameter t must be of type pANTLR3_COMMON_TOKEN.
INDEX
macro returns the current input position according to the input stream. It is not guaranteed to be the character offset in the input stream but is instead used as a value for marking and rewinding to specific points in the input stream. Use the macro GETCHARINDEX() to find out the position of LA(1) in the input stream.
PUSHSTREAM
macro, in conjunction with the POPSTREAM macro (usually called internally by the runtime), can be used to stack many input streams to the lexer, and implement constructs such as the C pre-processor #include directive.
An input stream that is pushed on to the stack becomes the current input stream for the lexer and the state of the previous stream is automatically saved. The input stream will be automatically popped from the stack when it is exhausted by the lexer. You may use the macro POPSTREAM to return to the previous input stream prior to exhausting the currently stacked input stream.
Here is an example of using the macro in a lexer to implement the C #include pre-processor directive:
fragment
STRING_GUTS : (~('\\'|'"') )* ;

LINE_COMMAND
    : '#' (' ' | '\t')*
      (
          'include' (' ' | '\t')+ '"' file = STRING_GUTS '"' (' ' | '\t')* '\r'? '\n'
          {
              pANTLR3_STRING        fName;
              pANTLR3_INPUT_STREAM  in;

              // Create an initial string, then take a substring
              // We can do this by messing with the start and end
              // pointers of tokens and so on. This shows a reasonable way to
              // manipulate strings.
              //
              fName = $file.text;
              printf("Including file '\%s'\n", fName->chars);

              // Create a new input stream and take advantage of built in stream stacking
              // in C target runtime.
              //
              in = antlr38BitFileStreamNew(fName->chars);
              PUSHSTREAM(in);

              // Note that the input stream is not closed when it EOFs, I don't bother
              // to do it here, but it is up to you to track streams created like this
              // and destroy them when the whole parse session is complete. Remember that you
              // don't want to do this until all tokens have been manipulated all the way through
              // your tree parsers etc as the token does not store the text it just refers
              // back to the input stream and trying to get the text for it will abort if you
              // close the input stream too early.
              //
          }
        | (('0'..'9')=>('0'..'9'))+ ~('\n'|'\r')* '\r'? '\n'
      )
      {$channel=HIDDEN;}
    ;
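The stacking behaviour that PUSHSTREAM relies on can be modelled in a few lines of plain C. The sketch below is an illustration only and assumes nothing about how the runtime stores its streams: toy_stream, toy_lexer, toy_push, toy_pop, toy_next, MAX_DEPTH and TOY_EOF are hypothetical names, and the "streams" are just in-memory strings rather than files.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPTH 8     /* arbitrary nesting limit for this sketch */
#define TOY_EOF  (-1)   /* hypothetical end-of-input marker */

typedef struct { const char *data; size_t pos; } toy_stream;

typedef struct {
    toy_stream stack[MAX_DEPTH];  /* saved streams; the top one is current */
    int        depth;             /* number of streams on the stack */
} toy_lexer;

/* Make text the current stream; the previous stream keeps its position below it. */
static void toy_push(toy_lexer *lx, const char *text)
{
    assert(lx->depth < MAX_DEPTH);
    lx->stack[lx->depth].data = text;
    lx->stack[lx->depth].pos = 0;
    lx->depth++;
}

/* Drop the current stream and resume the one underneath. */
static void toy_pop(toy_lexer *lx)
{
    assert(lx->depth > 0);
    lx->depth--;
}

/* Return the next character, popping exhausted streams automatically. */
static int toy_next(toy_lexer *lx)
{
    while (lx->depth > 0) {
        toy_stream *s = &lx->stack[lx->depth - 1];
        if (s->data[s->pos] != '\0') return (unsigned char)s->data[s->pos++];
        toy_pop(lx);              /* exhausted: fall back to the saved stream */
    }
    return TOY_EOF;
}
```

Pushing "XY" after reading 'a' from "ab" yields the characters a, X, Y, b in that order, which is the effect an #include directive needs: the included text is read in full and the including stream then resumes where it left off.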
The token fields user1, user2, and user3 are all value types of ANTLR3_UINT32. In the parser you can reference these fields directly from the token:
x=TOKNAME { $x->user1 ...
but when you are building the token in the lexer, you must assign to the fields using the macros USER1, USER2, or USER3, as in:
LEXTOK: 'AAAAA' { USER1 = 99; } ;
PARSER
macro returns a pointer to the base parser or tree parser object, which is of type pANTLR3_PARSER or pANTLR3_TREE_PARSER. This is not the pointer to your generated parser, which is supplied by the CTX macro, but to the common implementation of a parser or tree parser interface, which is supplied to all generated parsers.
INDEX
macro returns the position of the current token ( LT(1) ) in the input token stream. It can be used for MARK and REWIND operations.
LT(n)
returns the pANTLR3_COMMON_TOKEN at offset n from the current token stream input position. The macro LA(n) returns the token type of the token at position n. The value n cannot be zero, and such a reference will return NULL and possibly cause an error. LT(1) is the token that is about to be recognized and LT(-1) is the token that has just been recognized. Values of n that exceed the limits of the token stream boundaries will return NULL.
ADAPTOR
macro returns the reference to the tree adaptor, which is always of type pANTLR3_BASE_TREE_ADAPTOR, even if it is your custom adaptor.
MARK
macro marks the current position in the input stream and returns a value that you can pass to the REWIND macro to return to the marked point in the input.
If you know you will only ever rewind to the last MARK, then you can ignore the return value of this macro and just use the REWINDLAST macro to return to the last MARK that was set in the input stream.
REWIND(m)
macro rewinds the input stream to the mark m supplied to the REWIND(m) macro.
REWINDLAST
macro rewinds the input stream to the point set by the last MARK macro call. Fails silently if there was no prior MARK call.
SEEK(n)
macro seeks to position n in the stream. Works for all input stream types: lexer, parser and tree parser.
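The relationship between marking, rewinding and seeking can be pictured with a toy token stream. This is a sketch under stated assumptions rather than the runtime's behaviour in detail: toy_tokens and the toy_mark, toy_rewind, toy_rewind_last, toy_seek and toy_lookahead_pair functions are hypothetical, and the marker here is simply the saved position.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    const int *types;     /* token types in the stream */
    size_t     count;     /* number of tokens */
    size_t     pos;       /* index of the current token */
    size_t     last_mark; /* position saved by the most recent mark */
    int        has_mark;  /* 0 until the first mark is taken */
} toy_tokens;

/* Remember the current position and return it as the marker value. */
static size_t toy_mark(toy_tokens *ts)
{
    ts->last_mark = ts->pos;
    ts->has_mark = 1;
    return ts->pos;
}

/* Return to a marker previously obtained from toy_mark(). */
static void toy_rewind(toy_tokens *ts, size_t m)
{
    assert(m <= ts->count);
    ts->pos = m;
}

/* Return to the most recent mark; do nothing if none was ever taken. */
static void toy_rewind_last(toy_tokens *ts)
{
    if (ts->has_mark) ts->pos = ts->last_mark;
}

/* Jump straight to an absolute position. */
static void toy_seek(toy_tokens *ts, size_t n)
{
    assert(n <= ts->count);
    ts->pos = n;
}

/* Speculatively check for the sequence a, b and always restore the position. */
static int toy_lookahead_pair(toy_tokens *ts, int a, int b)
{
    size_t m = toy_mark(ts);
    int ok = ts->pos + 1 < ts->count
          && ts->types[ts->pos] == a
          && ts->types[ts->pos + 1] == b;
    if (ok) ts->pos += 2;   /* pretend to consume both tokens */
    toy_rewind(ts, m);      /* undo the speculative consumption */
    return ok;
}
```

The mark-then-rewind pairing in toy_lookahead_pair() is the usual pattern for speculative matching: record where you are, try something, and return to the recorded point whatever the outcome.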