cscg24-lolpython

CSCG 2024 Challenge 'Can I Haz Lolpython?'
git clone https://git.sinitax.com/sinitax/cscg24-lolpython

ply.html (96141B)


      1<html>
      2<head>
      3<title>PLY (Python Lex-Yacc)</title>
      4</head>
      5<body bgcolor="#ffffff">
      6
      7<h1>PLY (Python Lex-Yacc)</h1>
      8 
      9<b>
     10David M. Beazley <br>
     11dave@dabeaz.com<br>
     12</b>
     13
     14<p>
     15<b>PLY Version: 2.2</b>
     16<p>
     17
     18<!-- INDEX -->
     19<div class="sectiontoc">
     20<ul>
     21<li><a href="#ply_nn1">Introduction</a>
     22<li><a href="#ply_nn2">PLY Overview</a>
     23<li><a href="#ply_nn3">Lex</a>
     24<ul>
     25<li><a href="#ply_nn4">Lex Example</a>
     26<li><a href="#ply_nn5">The tokens list</a>
     27<li><a href="#ply_nn6">Specification of tokens</a>
     28<li><a href="#ply_nn7">Token values</a>
     29<li><a href="#ply_nn8">Discarded tokens</a>
     30<li><a href="#ply_nn9">Line numbers and positional information</a>
     31<li><a href="#ply_nn10">Ignored characters</a>
     32<li><a href="#ply_nn11">Literal characters</a>
     33<li><a href="#ply_nn12">Error handling</a>
     34<li><a href="#ply_nn13">Building and using the lexer</a>
     35<li><a href="#ply_nn14">The @TOKEN decorator</a>
     36<li><a href="#ply_nn15">Optimized mode</a>
     37<li><a href="#ply_nn16">Debugging</a>
     38<li><a href="#ply_nn17">Alternative specification of lexers</a>
     39<li><a href="#ply_nn18">Maintaining state</a>
     40<li><a href="#ply_nn19">Duplicating lexers</a>
     41<li><a href="#ply_nn20">Internal lexer state</a>
     42<li><a href="#ply_nn21">Conditional lexing and start conditions</a>
     43<li><a href="#ply_nn21">Miscellaneous Issues</a>
     44</ul>
     45<li><a href="#ply_nn22">Parsing basics</a>
     46<li><a href="#ply_nn23">Yacc reference</a>
     47<ul>
     48<li><a href="#ply_nn24">An example</a>
     49<li><a href="#ply_nn25">Combining Grammar Rule Functions</a>
     50<li><a href="#ply_nn26">Character Literals</a>
     51<li><a href="#ply_nn26">Empty Productions</a>
     52<li><a href="#ply_nn28">Changing the starting symbol</a>
     53<li><a href="#ply_nn27">Dealing With Ambiguous Grammars</a>
     54<li><a href="#ply_nn28">The parser.out file</a>
     55<li><a href="#ply_nn29">Syntax Error Handling</a>
     56<ul>
     57<li><a href="#ply_nn30">Recovery and resynchronization with error rules</a>
     58<li><a href="#ply_nn31">Panic mode recovery</a>
     59<li><a href="#ply_nn32">General comments on error handling</a>
     60</ul>
     61<li><a href="#ply_nn33">Line Number and Position Tracking</a>
     62<li><a href="#ply_nn34">AST Construction</a>
     63<li><a href="#ply_nn35">Embedded Actions</a>
     64<li><a href="#ply_nn36">Yacc implementation notes</a>
     65</ul>
     66<li><a href="#ply_nn37">Parser and Lexer State Management</a>
     67<li><a href="#ply_nn38">Using Python's Optimized Mode</a>
     68<li><a href="#ply_nn39">Where to go from here?</a>
     69</ul>
     70</div>
     71<!-- INDEX -->
     72
     73
     74
     75
     76
     77
     78<H2><a name="ply_nn1"></a>1. Introduction</H2>
     79
     80
     81PLY is a pure-Python implementation of the popular compiler
     82construction tools lex and yacc. The main goal of PLY is to stay
     83fairly faithful to the way in which traditional lex/yacc tools work.
     84This includes supporting LALR(1) parsing as well as providing
     85extensive input validation, error reporting, and diagnostics.  Thus,
     86if you've used yacc in another programming language, it should be
     87relatively straightforward to use PLY.  
     88
     89<p>
     90Early versions of PLY were developed to support an Introduction to
     91Compilers Course I taught in 2001 at the University of Chicago.  In this course,
     92students built a fully functional compiler for a simple Pascal-like
     93language.  Their compiler, implemented entirely in Python, had to
     94include lexical analysis, parsing, type checking, type inference,
     95nested scoping, and code generation for the SPARC processor.
     96Approximately 30 different compiler implementations were completed in
     97this course.  Most of PLY's interface and operation has been influenced by common
     98usability problems encountered by students.
     99
    100<p>
    101Since PLY was primarily developed as an instructional tool, you will
    102find it to be fairly picky about token and grammar rule
    103specification. In part, this
    104added formality is meant to catch common programming mistakes made by
    105novice users.  However, advanced users will also find such features to
    106be useful when building complicated grammars for real programming
    107languages.  It should also be noted that PLY does not provide much in
    108the way of bells and whistles (e.g., automatic construction of
    109abstract syntax trees, tree traversal, etc.). Nor would I consider it
    110to be a parsing framework.  Instead, you will find a bare-bones, yet
    111fully capable lex/yacc implementation written entirely in Python.
    112
    113<p>
     114The rest of this document assumes that you are somewhat familiar with
    115parsing theory, syntax directed translation, and the use of compiler
    116construction tools such as lex and yacc in other programming
     117languages. If you are unfamiliar with these topics, you will probably
    118want to consult an introductory text such as "Compilers: Principles,
    119Techniques, and Tools", by Aho, Sethi, and Ullman.  O'Reilly's "Lex
    120and Yacc" by John Levine may also be handy.  In fact, the O'Reilly book can be
    121used as a reference for PLY as the concepts are virtually identical.
    122
    123<H2><a name="ply_nn2"></a>2. PLY Overview</H2>
    124
    125
     126PLY consists of two separate modules: <tt>lex.py</tt> and
    127<tt>yacc.py</tt>, both of which are found in a Python package
    128called <tt>ply</tt>. The <tt>lex.py</tt> module is used to break input text into a
    129collection of tokens specified by a collection of regular expression
    130rules.  <tt>yacc.py</tt> is used to recognize language syntax that has
    131been specified in the form of a context free grammar. <tt>yacc.py</tt> uses LR parsing and generates its parsing tables
    132using either the LALR(1) (the default) or SLR table generation algorithms.
    133
    134<p>
    135The two tools are meant to work together.  Specifically,
    136<tt>lex.py</tt> provides an external interface in the form of a
    137<tt>token()</tt> function that returns the next valid token on the
    138input stream.  <tt>yacc.py</tt> calls this repeatedly to retrieve
    139tokens and invoke grammar rules.  The output of <tt>yacc.py</tt> is
    140often an Abstract Syntax Tree (AST).  However, this is entirely up to
    141the user.  If desired, <tt>yacc.py</tt> can also be used to implement
    142simple one-pass compilers.  
    143
    144<p>
    145Like its Unix counterpart, <tt>yacc.py</tt> provides most of the
    146features you expect including extensive error checking, grammar
    147validation, support for empty productions, error tokens, and ambiguity
    148resolution via precedence rules.  In fact, everything that is possible in traditional yacc 
    149should be supported in PLY.
    150
    151<p>
    152The primary difference between
    153<tt>yacc.py</tt> and Unix <tt>yacc</tt> is that <tt>yacc.py</tt> 
    154doesn't involve a separate code-generation process. 
    155Instead, PLY relies on reflection (introspection)
    156to build its lexers and parsers.  Unlike traditional lex/yacc which
    157require a special input file that is converted into a separate source
    158file, the specifications given to PLY <em>are</em> valid Python
    159programs.  This means that there are no extra source files nor is
    160there a special compiler construction step (e.g., running yacc to
    161generate Python code for the compiler).  Since the generation of the
    162parsing tables is relatively expensive, PLY caches the results and
    163saves them to a file.  If no changes are detected in the input source,
    164the tables are read from the cache. Otherwise, they are regenerated.
    165
    166<H2><a name="ply_nn3"></a>3. Lex</H2>
    167
    168
    169<tt>lex.py</tt> is used to tokenize an input string.  For example, suppose
    170you're writing a programming language and a user supplied the following input string:
    171
    172<blockquote>
    173<pre>
    174x = 3 + 42 * (s - t)
    175</pre>
    176</blockquote>
    177
    178A tokenizer splits the string into individual tokens
    179
    180<blockquote>
    181<pre>
    182'x','=', '3', '+', '42', '*', '(', 's', '-', 't', ')'
    183</pre>
    184</blockquote>
    185
    186Tokens are usually given names to indicate what they are. For example:
    187
    188<blockquote>
    189<pre>
    190'ID','EQUALS','NUMBER','PLUS','NUMBER','TIMES',
    191'LPAREN','ID','MINUS','ID','RPAREN'
    192</pre>
    193</blockquote>
    194
    195More specifically, the input is broken into pairs of token types and values.  For example:
    196
    197<blockquote>
    198<pre>
    199('ID','x'), ('EQUALS','='), ('NUMBER','3'), 
     200('PLUS','+'), ('NUMBER','42'), ('TIMES','*'),
    201('LPAREN','('), ('ID','s'), ('MINUS','-'),
    202('ID','t'), ('RPAREN',')'
    203</pre>
    204</blockquote>
    205
    206The identification of tokens is typically done by writing a series of regular expression
    207rules.  The next section shows how this is done using <tt>lex.py</tt>.
    208
    209<H3><a name="ply_nn4"></a>3.1 Lex Example</H3>
    210
    211
    212The following example shows how <tt>lex.py</tt> is used to write a simple tokenizer.
    213
    214<blockquote>
    215<pre>
    216# ------------------------------------------------------------
    217# calclex.py
    218#
    219# tokenizer for a simple expression evaluator for
    220# numbers and +,-,*,/
    221# ------------------------------------------------------------
    222import ply.lex as lex
    223
    224# List of token names.   This is always required
    225tokens = (
    226   'NUMBER',
    227   'PLUS',
    228   'MINUS',
    229   'TIMES',
    230   'DIVIDE',
    231   'LPAREN',
    232   'RPAREN',
    233)
    234
    235# Regular expression rules for simple tokens
    236t_PLUS    = r'\+'
    237t_MINUS   = r'-'
    238t_TIMES   = r'\*'
    239t_DIVIDE  = r'/'
    240t_LPAREN  = r'\('
    241t_RPAREN  = r'\)'
    242
    243# A regular expression rule with some action code
    244def t_NUMBER(t):
    245    r'\d+'
    246    try:
    247         t.value = int(t.value)    
    248    except ValueError:
    249         print "Line %d: Number %s is too large!" % (t.lineno,t.value)
     250         t.value = 0
    251    return t
    252
    253# Define a rule so we can track line numbers
    254def t_newline(t):
    255    r'\n+'
    256    t.lexer.lineno += len(t.value)
    257
    258# A string containing ignored characters (spaces and tabs)
    259t_ignore  = ' \t'
    260
    261# Error handling rule
    262def t_error(t):
    263    print "Illegal character '%s'" % t.value[0]
    264    t.lexer.skip(1)
    265
    266# Build the lexer
    267lex.lex()
    268
    269</pre>
    270</blockquote>
    271To use the lexer, you first need to feed it some input text using its <tt>input()</tt> method.  After that, repeated calls to <tt>token()</tt> produce tokens.   The following code shows how this works:
    272
    273<blockquote>
    274<pre>
    275
    276# Test it out
    277data = '''
    2783 + 4 * 10
    279  + -20 *2
    280'''
    281
    282# Give the lexer some input
    283lex.input(data)
    284
    285# Tokenize
    286while 1:
    287    tok = lex.token()
    288    if not tok: break      # No more input
    289    print tok
    290</pre>
    291</blockquote>
    292
    293When executed, the example will produce the following output:
    294
    295<blockquote>
    296<pre>
    297$ python example.py
    298LexToken(NUMBER,3,2,1)
    299LexToken(PLUS,'+',2,3)
    300LexToken(NUMBER,4,2,5)
    301LexToken(TIMES,'*',2,7)
    302LexToken(NUMBER,10,2,10)
    303LexToken(PLUS,'+',3,14)
    304LexToken(MINUS,'-',3,16)
    305LexToken(NUMBER,20,3,18)
    306LexToken(TIMES,'*',3,20)
    307LexToken(NUMBER,2,3,21)
    308</pre>
    309</blockquote>
    310
    311The tokens returned by <tt>lex.token()</tt> are instances
    312of <tt>LexToken</tt>.  This object has
    313attributes <tt>tok.type</tt>, <tt>tok.value</tt>,
    314<tt>tok.lineno</tt>, and <tt>tok.lexpos</tt>.  The following code shows an example of
    315accessing these attributes:
    316
    317<blockquote>
    318<pre>
    319# Tokenize
    320while 1:
    321    tok = lex.token()
    322    if not tok: break      # No more input
     323    print tok.type, tok.value, tok.lineno, tok.lexpos
    324</pre>
    325</blockquote>
    326
    327The <tt>tok.type</tt> and <tt>tok.value</tt> attributes contain the
    328type and value of the token itself. 
     329<tt>tok.lineno</tt> and <tt>tok.lexpos</tt> contain information about
    330the location of the token.  <tt>tok.lexpos</tt> is the index of the
    331token relative to the start of the input text.
    332
    333<H3><a name="ply_nn5"></a>3.2 The tokens list</H3>
    334
    335
    336All lexers must provide a list <tt>tokens</tt> that defines all of the possible token
    337names that can be produced by the lexer.  This list is always required
    338and is used to perform a variety of validation checks.  The tokens list is also used by the
    339<tt>yacc.py</tt> module to identify terminals.
    340
    341<p>
    342In the example, the following code specified the token names:
    343
    344<blockquote>
    345<pre>
    346tokens = (
    347   'NUMBER',
    348   'PLUS',
    349   'MINUS',
    350   'TIMES',
    351   'DIVIDE',
    352   'LPAREN',
    353   'RPAREN',
    354)
    355</pre>
    356</blockquote>
    357
    358<H3><a name="ply_nn6"></a>3.3 Specification of tokens</H3>
    359
    360
     361Each token is specified by writing a regular expression rule.  Each of these rules is
     362defined by making a declaration with a special prefix <tt>t_</tt> to indicate that it
    363defines a token.  For simple tokens, the regular expression can
    364be specified as strings such as this (note: Python raw strings are used since they are the
    365most convenient way to write regular expression strings):
    366
    367<blockquote>
    368<pre>
    369t_PLUS = r'\+'
    370</pre>
    371</blockquote>
    372
    373In this case, the name following the <tt>t_</tt> must exactly match one of the
    374names supplied in <tt>tokens</tt>.   If some kind of action needs to be performed,
    375a token rule can be specified as a function.  For example, this rule matches numbers and
    376converts the string into a Python integer.
    377
    378<blockquote>
    379<pre>
    380def t_NUMBER(t):
    381    r'\d+'
    382    try:
    383         t.value = int(t.value)
    384    except ValueError:
    385         print "Number %s is too large!" % t.value
     386         t.value = 0
    387    return t
    388</pre>
    389</blockquote>
    390
    391When a function is used, the regular expression rule is specified in the function documentation string. 
    392The function always takes a single argument which is an instance of 
    393<tt>LexToken</tt>.   This object has attributes of <tt>t.type</tt> which is the token type (as a string),
    394<tt>t.value</tt> which is the lexeme (the actual text matched), <tt>t.lineno</tt> which is the current line number, and <tt>t.lexpos</tt> which
    395is the position of the token relative to the beginning of the input text.
    396By default, <tt>t.type</tt> is set to the name following the <tt>t_</tt> prefix.  The action
    397function can modify the contents of the <tt>LexToken</tt> object as appropriate.  However, 
    398when it is done, the resulting token should be returned.  If no value is returned by the action
    399function, the token is simply discarded and the next token read.
    400
    401<p>
     402Internally, <tt>lex.py</tt> uses the <tt>re</tt> module to do its pattern matching.  When building the master regular expression,
    403rules are added in the following order:
    404<p>
    405<ol>
    406<li>All tokens defined by functions are added in the same order as they appear in the lexer file.
    407<li>Tokens defined by strings are added next by sorting them in order of decreasing regular expression length (longer expressions
    408are added first).
    409</ol>
    410<p>
    411Without this ordering, it can be difficult to correctly match certain types of tokens.  For example, if you 
    412wanted to have separate tokens for "=" and "==", you need to make sure that "==" is checked first.  By sorting regular
    413expressions in order of decreasing length, this problem is solved for rules defined as strings.  For functions,
    414the order can be explicitly controlled since rules appearing first are checked first.
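
<p>
For example, string rules for separate "=" and "==" tokens can be written in either order
because the longer pattern is added to the master expression first (a sketch; the token names
<tt>EQUALS</tt> and <tt>EQEQ</tt> are made up and would also need to appear in <tt>tokens</tt>):

<blockquote>
<pre>
t_EQUALS = r'='
t_EQEQ   = r'=='     # Longer pattern, so it is checked before t_EQUALS
</pre>
</blockquote>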
    415
    416<p>
    417To handle reserved words, it is usually easier to just match an identifier and do a special name lookup in a function
    418like this:
    419
    420<blockquote>
    421<pre>
    422reserved = {
    423   'if' : 'IF',
    424   'then' : 'THEN',
    425   'else' : 'ELSE',
    426   'while' : 'WHILE',
    427   ...
    428}
    429
    430def t_ID(t):
    431    r'[a-zA-Z_][a-zA-Z_0-9]*'
    432    t.type = reserved.get(t.value,'ID')    # Check for reserved words
    433    return t
    434</pre>
    435</blockquote>
    436
    437This approach greatly reduces the number of regular expression rules and is likely to make things a little faster.
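
<p>
Note that any reserved token names returned this way (e.g., <tt>'IF'</tt>, <tt>'THEN'</tt>) still
need to appear in the <tt>tokens</tt> list, since they can be produced by the lexer.  One possible
arrangement (a sketch based on the example above):

<blockquote>
<pre>
tokens = ['NUMBER','ID','PLUS','MINUS'] + list(reserved.values())
</pre>
</blockquote>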
    438
    439<p>
    440<b>Note:</b> You should avoid writing individual rules for reserved words.  For example, if you write rules like this,
    441
    442<blockquote>
    443<pre>
    444t_FOR   = r'for'
    445t_PRINT = r'print'
    446</pre>
    447</blockquote>
    448
    449those rules will be triggered for identifiers that include those words as a prefix such as "forget" or "printed".  This is probably not
    450what you want.
    451
    452<H3><a name="ply_nn7"></a>3.4 Token values</H3>
    453
    454
    455When tokens are returned by lex, they have a value that is stored in the <tt>value</tt> attribute.    Normally, the value is the text
    456that was matched.   However, the value can be assigned to any Python object.   For instance, when lexing identifiers, you may
    457want to return both the identifier name and information from some sort of symbol table.  To do this, you might write a rule like this:
    458
    459<blockquote>
    460<pre>
    461def t_ID(t):
    462    ...
    463    # Look up symbol table information and return a tuple
    464    t.value = (t.value, symbol_lookup(t.value))
    465    ...
    466    return t
    467</pre>
    468</blockquote>
    469
    470It is important to note that storing data in other attribute names is <em>not</em> recommended.  The <tt>yacc.py</tt> module only exposes the
    471contents of the <tt>value</tt> attribute.  Thus, accessing other attributes may  be unnecessarily awkward.
    472
    473<H3><a name="ply_nn8"></a>3.5 Discarded tokens</H3>
    474
    475
    476To discard a token, such as a comment, simply define a token rule that returns no value.  For example:
    477
    478<blockquote>
    479<pre>
    480def t_COMMENT(t):
    481    r'\#.*'
    482    pass
    483    # No return value. Token discarded
    484</pre>
    485</blockquote>
    486
    487Alternatively, you can include the prefix "ignore_" in the token declaration to force a token to be ignored.  For example:
    488
    489<blockquote>
    490<pre>
    491t_ignore_COMMENT = r'\#.*'
    492</pre>
    493</blockquote>
    494
    495Be advised that if you are ignoring many different kinds of text, you may still want to use functions since these provide more precise
    496control over the order in which regular expressions are matched (i.e., functions are matched in order of specification whereas strings are
    497sorted by regular expression length).
    498
    499<H3><a name="ply_nn9"></a>3.6 Line numbers and positional information</H3>
    500
    501
    502<p>By default, <tt>lex.py</tt> knows nothing about line numbers.  This is because <tt>lex.py</tt> doesn't know anything
    503about what constitutes a "line" of input (e.g., the newline character or even if the input is textual data).
    504To update this information, you need to write a special rule.  In the example, the <tt>t_newline()</tt> rule shows how to do this.
    505
    506<blockquote>
    507<pre>
    508# Define a rule so we can track line numbers
    509def t_newline(t):
    510    r'\n+'
    511    t.lexer.lineno += len(t.value)
    512</pre>
    513</blockquote>
    514Within the rule, the <tt>lineno</tt> attribute of the underlying lexer <tt>t.lexer</tt> is updated.
    515After the line number is updated, the token is simply discarded since nothing is returned.
    516
    517<p>
     518<tt>lex.py</tt> does not perform any kind of automatic column tracking.  However, it does record positional
    519information related to each token in the <tt>lexpos</tt> attribute.   Using this, it is usually possible to compute 
    520column information as a separate step.   For instance, just count backwards until you reach a newline.
    521
    522<blockquote>
    523<pre>
    524# Compute column. 
    525#     input is the input text string
    526#     token is a token instance
    527def find_column(input,token):
    528    i = token.lexpos
    529    while i > 0:
    530        if input[i] == '\n': break
    531        i -= 1
    532    column = (token.lexpos - i)+1
    533    return column
    534</pre>
    535</blockquote>
    536
    537Since column information is often only useful in the context of error handling, calculating the column
    538position can be performed when needed as opposed to doing it for each token.
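
<p>
For instance, here is a sketch of an error handler that reports the column using the
<tt>find_column()</tt> function above:

<blockquote>
<pre>
def t_error(t):
    column = find_column(t.lexer.lexdata, t)
    print "Illegal character '%s' at line %d, column %d" % (t.value[0], t.lineno, column)
    t.lexer.skip(1)
</pre>
</blockquote>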
    539
    540<H3><a name="ply_nn10"></a>3.7 Ignored characters</H3>
    541
    542
    543<p>
    544The special <tt>t_ignore</tt> rule is reserved by <tt>lex.py</tt> for characters
    545that should be completely ignored in the input stream. 
    546Usually this is used to skip over whitespace and other non-essential characters. 
    547Although it is possible to define a regular expression rule for whitespace in a manner
    548similar to <tt>t_newline()</tt>, the use of <tt>t_ignore</tt> provides substantially better
    549lexing performance because it is handled as a special case and is checked in a much
    550more efficient manner than the normal regular expression rules.
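
<p>
For example, the tokenizer shown earlier skips spaces and tabs with:

<blockquote>
<pre>
t_ignore = ' \t'
</pre>
</blockquote>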
    551
    552<H3><a name="ply_nn11"></a>3.8 Literal characters</H3>
    553
    554
    555<p>
    556Literal characters can be specified by defining a variable <tt>literals</tt> in your lexing module.  For example:
    557
    558<blockquote>
    559<pre>
    560literals = [ '+','-','*','/' ]
    561</pre>
    562</blockquote>
    563
    564or alternatively
    565
    566<blockquote>
    567<pre>
    568literals = "+-*/"
    569</pre>
    570</blockquote>
    571
    572A literal character is simply a single character that is returned "as is" when encountered by the lexer.  Literals are checked
    573after all of the defined regular expression rules.  Thus, if a rule starts with one of the literal characters, it will always 
    574take precedence.
    575<p>
    576When a literal token is returned, both its <tt>type</tt> and <tt>value</tt> attributes are set to the character itself. For example, <tt>'+'</tt>.
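
<p>
For example, here is a sketch of a small lexer that handles the arithmetic operators and
parentheses as literals and only defines a regular expression rule for numbers:

<blockquote>
<pre>
import ply.lex as lex

literals = ['+','-','*','/','(',')']
tokens   = ('NUMBER',)

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

t_ignore = ' \t'

def t_error(t):
    print "Illegal character '%s'" % t.value[0]
    t.lexer.skip(1)

lexer = lex.lex()
</pre>
</blockquote>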
    577
    578<H3><a name="ply_nn12"></a>3.9 Error handling</H3>
    579
    580
    581<p>
    582Finally, the <tt>t_error()</tt>
    583function is used to handle lexing errors that occur when illegal
    584characters are detected.  In this case, the <tt>t.value</tt> attribute contains the
    585rest of the input string that has not been tokenized.  In the example, the error function
    586was defined as follows:
    587
    588<blockquote>
    589<pre>
    590# Error handling rule
    591def t_error(t):
    592    print "Illegal character '%s'" % t.value[0]
    593    t.lexer.skip(1)
    594</pre>
    595</blockquote>
    596
    597In this case, we simply print the offending character and skip ahead one character by calling <tt>t.lexer.skip(1)</tt>.
    598
    599<H3><a name="ply_nn13"></a>3.10 Building and using the lexer</H3>
    600
    601
    602<p>
    603To build the lexer, the function <tt>lex.lex()</tt> is used.  This function
     604uses Python reflection (or introspection) to read the regular expression rules
    605out of the calling context and build the lexer. Once the lexer has been built, two functions can
    606be used to control the lexer.
    607
    608<ul>
    609<li><tt>lex.input(data)</tt>.   Reset the lexer and store a new input string.
    610<li><tt>lex.token()</tt>.  Return the next token.  Returns a special <tt>LexToken</tt> instance on success or
    611None if the end of the input text has been reached.
    612</ul>
    613
     614If desired, the lexer can also be used as an object.  The <tt>lex()</tt> function returns a <tt>Lexer</tt> object that
    615can be used for this purpose.  For example:
    616
    617<blockquote>
    618<pre>
    619lexer = lex.lex()
    620lexer.input(sometext)
    621while 1:
    622    tok = lexer.token()
    623    if not tok: break
    624    print tok
    625</pre>
    626</blockquote>
    627
    628<p>
    629This latter technique should be used if you intend to use multiple lexers in your application.  Simply define each
    630lexer in its own module and use the object returned by <tt>lex()</tt> as appropriate.
    631
    632<p>
    633Note: The global functions <tt>lex.input()</tt> and <tt>lex.token()</tt> are bound to the <tt>input()</tt> 
    634and <tt>token()</tt> methods of the last lexer created by the lex module. 
    635
    636<H3><a name="ply_nn14"></a>3.11 The @TOKEN decorator</H3>
    637
    638
     639In some applications, you may want to build tokens from a series of
    640more complex regular expression rules.  For example:
    641
    642<blockquote>
    643<pre>
    644digit            = r'([0-9])'
    645nondigit         = r'([_A-Za-z])'
    646identifier       = r'(' + nondigit + r'(' + digit + r'|' + nondigit + r')*)'        
    647
    648def t_ID(t):
    649    # want docstring to be identifier above. ?????
    650    ...
    651</pre>
    652</blockquote>
    653
    654In this case, we want the regular expression rule for <tt>ID</tt> to be one of the variables above. However, there is no
    655way to directly specify this using a normal documentation string.   To solve this problem, you can use the <tt>@TOKEN</tt>
    656decorator.  For example:
    657
    658<blockquote>
    659<pre>
    660from ply.lex import TOKEN
    661
    662@TOKEN(identifier)
    663def t_ID(t):
    664    ...
    665</pre>
    666</blockquote>
    667
     668This will attach <tt>identifier</tt> to the docstring for <tt>t_ID()</tt>, allowing <tt>lex.py</tt> to work normally.  An alternative
     669approach to this problem is to set the docstring directly, like this:
    670
    671<blockquote>
    672<pre>
    673def t_ID(t):
    674    ...
    675
    676t_ID.__doc__ = identifier
    677</pre>
    678</blockquote>
    679
     680<b>NOTE:</b> Use of <tt>@TOKEN</tt> requires Python 2.4 or newer.  If you're concerned about backwards compatibility with older
    681versions of Python, use the alternative approach of setting the docstring directly.
    682
    683<H3><a name="ply_nn15"></a>3.12 Optimized mode</H3>
    684
    685
    686For improved performance, it may be desirable to use Python's
    687optimized mode (e.g., running Python with the <tt>-O</tt>
    688option). However, doing so causes Python to ignore documentation
    689strings.  This presents special problems for <tt>lex.py</tt>.  To
    690handle this case, you can create your lexer using
    691the <tt>optimize</tt> option as follows:
    692
    693<blockquote>
    694<pre>
    695lexer = lex.lex(optimize=1)
    696</pre>
    697</blockquote>
    698
    699Next, run Python in its normal operating mode.  When you do
    700this, <tt>lex.py</tt> will write a file called <tt>lextab.py</tt> to
    701the current directory.  This file contains all of the regular
    702expression rules and tables used during lexing.  On subsequent
    703executions,
    704<tt>lextab.py</tt> will simply be imported to build the lexer.  This
    705approach substantially improves the startup time of the lexer and it
    706works in Python's optimized mode.
    707
    708<p>
    709To change the name of the lexer-generated file, use the <tt>lextab</tt> keyword argument.  For example:
    710
    711<blockquote>
    712<pre>
    713lexer = lex.lex(optimize=1,lextab="footab")
    714</pre>
    715</blockquote>
    716
    717When running in optimized mode, it is important to note that lex disables most error checking.  Thus, this is really only recommended
    718if you're sure everything is working correctly and you're ready to start releasing production code.
    719
    720<H3><a name="ply_nn16"></a>3.13 Debugging</H3>
    721
    722
    723For the purpose of debugging, you can run <tt>lex()</tt> in a debugging mode as follows:
    724
    725<blockquote>
    726<pre>
    727lexer = lex.lex(debug=1)
    728</pre>
    729</blockquote>
    730
    731This will result in a large amount of debugging information to be printed including all of the added rules and the master
    732regular expressions.
    733
    734In addition, <tt>lex.py</tt> comes with a simple main function which
     735will tokenize input read either from standard input or from a file specified
    736on the command line. To use it, simply put this in your lexer:
    737
    738<blockquote>
    739<pre>
    740if __name__ == '__main__':
    741     lex.runmain()
    742</pre>
    743</blockquote>
    744
    745<H3><a name="ply_nn17"></a>3.14 Alternative specification of lexers</H3>
    746
    747
    748As shown in the example, lexers are specified all within one Python module.   If you want to
    749put token rules in a different module from the one in which you invoke <tt>lex()</tt>, use the
    750<tt>module</tt> keyword argument.
    751
    752<p>
    753For example, you might have a dedicated module that just contains
    754the token rules:
    755
    756<blockquote>
    757<pre>
    758# module: tokrules.py
    759# This module just contains the lexing rules
    760
    761# List of token names.   This is always required
    762tokens = (
    763   'NUMBER',
    764   'PLUS',
    765   'MINUS',
    766   'TIMES',
    767   'DIVIDE',
    768   'LPAREN',
    769   'RPAREN',
    770)
    771
    772# Regular expression rules for simple tokens
    773t_PLUS    = r'\+'
    774t_MINUS   = r'-'
    775t_TIMES   = r'\*'
    776t_DIVIDE  = r'/'
    777t_LPAREN  = r'\('
    778t_RPAREN  = r'\)'
    779
    780# A regular expression rule with some action code
    781def t_NUMBER(t):
    782    r'\d+'
    783    try:
    784         t.value = int(t.value)    
    785    except ValueError:
    786         print "Line %d: Number %s is too large!" % (t.lineno,t.value)
     787         t.value = 0
    788    return t
    789
    790# Define a rule so we can track line numbers
    791def t_newline(t):
    792    r'\n+'
    793    t.lexer.lineno += len(t.value)
    794
    795# A string containing ignored characters (spaces and tabs)
    796t_ignore  = ' \t'
    797
    798# Error handling rule
    799def t_error(t):
    800    print "Illegal character '%s'" % t.value[0]
    801    t.lexer.skip(1)
    802</pre>
    803</blockquote>
    804
    805Now, if you wanted to build a tokenizer from these rules from within a different module, you would do the following (shown for Python interactive mode):
    806
    807<blockquote>
    808<pre>
    809>>> import tokrules
    810>>> <b>lexer = lex.lex(module=tokrules)</b>
    811>>> lexer.input("3 + 4")
    812>>> lexer.token()
     813LexToken(NUMBER,3,1,0)
    814>>> lexer.token()
    815LexToken(PLUS,'+',1,2)
    816>>> lexer.token()
    817LexToken(NUMBER,4,1,4)
    818>>> lexer.token()
    819None
    820>>>
    821</pre>
    822</blockquote>
    823
    824The <tt>object</tt> option can be used to define lexers as a class instead of a module.  For example:
    825
    826<blockquote>
    827<pre>
    828import ply.lex as lex
    829
    830class MyLexer:
    831    # List of token names.   This is always required
    832    tokens = (
    833       'NUMBER',
    834       'PLUS',
    835       'MINUS',
    836       'TIMES',
    837       'DIVIDE',
    838       'LPAREN',
    839       'RPAREN',
    840    )
    841
    842    # Regular expression rules for simple tokens
    843    t_PLUS    = r'\+'
    844    t_MINUS   = r'-'
    845    t_TIMES   = r'\*'
    846    t_DIVIDE  = r'/'
    847    t_LPAREN  = r'\('
    848    t_RPAREN  = r'\)'
    849
    850    # A regular expression rule with some action code
    851    # Note addition of self parameter since we're in a class
    852    def t_NUMBER(self,t):
    853        r'\d+'
    854        try:
    855             t.value = int(t.value)    
    856        except ValueError:
    857             print "Line %d: Number %s is too large!" % (t.lineno,t.value)
    858             t.value = 0
    859        return t
    860
    861    # Define a rule so we can track line numbers
    862    def t_newline(self,t):
    863        r'\n+'
    864        t.lexer.lineno += len(t.value)
    865
    866    # A string containing ignored characters (spaces and tabs)
    867    t_ignore  = ' \t'
    868
    869    # Error handling rule
    870    def t_error(self,t):
    871        print "Illegal character '%s'" % t.value[0]
    872        t.lexer.skip(1)
    873
    874    <b># Build the lexer
    875    def build(self,**kwargs):
    876        self.lexer = lex.lex(object=self, **kwargs)</b>
    877    
     878    # Test its output
    879    def test(self,data):
    880        self.lexer.input(data)
    881        while 1:
     882             tok = self.lexer.token()
    883             if not tok: break
    884             print tok
    885
    886# Build the lexer and try it out
    887m = MyLexer()
    888m.build()           # Build the lexer
    889m.test("3 + 4")     # Test it
    890</pre>
    891</blockquote>
    892
    893For reasons that are subtle, you should <em>NOT</em> invoke <tt>lex.lex()</tt> inside the <tt>__init__()</tt> method of your class.  If you
    894do, it may cause bizarre behavior if someone tries to duplicate a lexer object.  Keep reading.
    895
    896<H3><a name="ply_nn18"></a>3.15 Maintaining state</H3>
    897
    898
    899In your lexer, you may want to maintain a variety of state information.  This might include mode settings, symbol tables, and other details.  There are a few
    900different ways to handle this situation.  First, you could just keep some global variables:
    901
    902<blockquote>
    903<pre>
    904num_count = 0
    905def t_NUMBER(t):
    906    r'\d+'
    907    global num_count
    908    num_count += 1
    909    try:
    910         t.value = int(t.value)    
    911    except ValueError:
    912         print "Line %d: Number %s is too large!" % (t.lineno,t.value)
     913         t.value = 0
    914    return t
    915</pre>
    916</blockquote>
    917
     918Alternatively, you can store this information inside the Lexer object created by <tt>lex()</tt>.  To do this, you can use the <tt>lexer</tt> attribute
    919of tokens passed to the various rules. For example:
    920
    921<blockquote>
    922<pre>
    923def t_NUMBER(t):
    924    r'\d+'
    925    t.lexer.num_count += 1     # Note use of lexer attribute
    926    try:
    927         t.value = int(t.value)    
    928    except ValueError:
    929         print "Line %d: Number %s is too large!" % (t.lineno,t.value)
     930         t.value = 0
    931    return t
    932
    933lexer = lex.lex()
    934lexer.num_count = 0            # Set the initial count
    935</pre>
    936</blockquote>
    937
    938This latter approach has the advantage of storing information inside
    939the lexer itself---something that may be useful if multiple instances
    940of the same lexer have been created.  However, it may also feel kind
     941of "hacky" to purists.  Just to put their minds at ease, all
    942internal attributes of the lexer (with the exception of <tt>lineno</tt>) have names that are prefixed
    943by <tt>lex</tt> (e.g., <tt>lexdata</tt>,<tt>lexpos</tt>, etc.).  Thus,
    944it should be perfectly safe to store attributes in the lexer that
    945don't have names starting with that prefix.
    946
    947<p>
    948A third approach is to define the lexer as a class as shown in the previous example:
    949
    950<blockquote>
    951<pre>
    952class MyLexer:
    953    ...
    954    def t_NUMBER(self,t):
    955        r'\d+'
    956        self.num_count += 1
    957        try:
    958             t.value = int(t.value)    
    959        except ValueError:
    960             print "Line %d: Number %s is too large!" % (t.lineno,t.value)
    961             t.value = 0
    962        return t
    963
    964    def build(self, **kwargs):
    965        self.lexer = lex.lex(object=self,**kwargs)
    966
    967    def __init__(self):
    968        self.num_count = 0
    969
    970# Create a lexer 
    971m = MyLexer()
    972lexer = lex.lex(object=m)
    973</pre>
    974</blockquote>
    975
    976The class approach may be the easiest to manage if your application is going to be creating multiple instances of the same lexer and
    977you need to manage a lot of state.  
    978
    979<H3><a name="ply_nn19"></a>3.16 Duplicating lexers</H3>
    980
    981
    982<b>NOTE: I am thinking about deprecating this feature.  Post comments on <a href="http://groups.google.com/group/ply-hack">ply-hack@googlegroups.com</a> or send me a private email at dave@dabeaz.com.</b>
    983
    984<p>
    985If necessary, a lexer object can be quickly duplicated by invoking its <tt>clone()</tt> method.  For example:
    986
    987<blockquote>
    988<pre>
    989lexer = lex.lex()
    990...
    991newlexer = lexer.clone()
    992</pre>
    993</blockquote>
    994
    995When a lexer is cloned, the copy is identical to the original lexer,
    996including any input text. However, once created, different text can be
    997fed to the clone which can be used independently.  This capability may
    998be useful in situations when you are writing a parser/compiler that
    999involves recursive or reentrant processing.  For instance, if you
   1000needed to scan ahead in the input for some reason, you could create a
   1001clone and use it to look ahead.
   1002
   1003<p>
   1004The advantage of using <tt>clone()</tt> instead of reinvoking <tt>lex()</tt> is
   1005that it is significantly faster.  Namely, it is not necessary to re-examine all of the 
   1006token rules, build a regular expression, and construct internal tables.  All of this
   1007information can simply be reused in the new lexer.
   1008
   1009<p>
   1010Special considerations need to be made when cloning a lexer that is defined as a class.  Previous sections
   1011showed an example of a class <tt>MyLexer</tt>.  If you have the following code:
   1012
   1013<blockquote>
   1014<pre>
   1015m = MyLexer()
   1016a = lex.lex(object=m)      # Create a lexer
   1017
   1018b = a.clone()              # Clone the lexer
   1019</pre>
   1020</blockquote>
   1021
   1022Then both <tt>a</tt> and <tt>b</tt> are going to be bound to the same
   1023object <tt>m</tt>.  If the object <tt>m</tt> contains internal state
   1024related to lexing, this sharing may lead to quite a bit of confusion. To fix this,
   1025the <tt>clone()</tt> method accepts an optional argument that can be used to supply a new object.  This
   1026can be used to clone the lexer and bind it to a new instance.  For example:
   1027
   1028<blockquote>
   1029<pre>
   1030m = MyLexer()              # Create a lexer
   1031a = lex.lex(object=m)
   1032
   1033# Create a clone 
   1034n = MyLexer()              # New instance of MyLexer
   1035b = a.clone(n)             # New lexer bound to n
   1036</pre>
   1037</blockquote>
   1038
   1039It may make sense to encapsulate all of this inside a method:
   1040
   1041<blockquote>
   1042<pre>
   1043class MyLexer:
   1044     ...
   1045     def clone(self):
   1046         c = MyLexer()        # Create a new instance of myself
   1047         # Copy attributes from self to c as appropriate
   1048         ...
   1049         # Clone the lexer
   1050         c.lexer = self.lexer.clone(c)
   1051         return c
   1052</pre>
   1053</blockquote>
   1054
   1055The fact that a new instance of <tt>MyLexer</tt> may be created while cloning a lexer is the reason why you should never
   1056invoke <tt>lex.lex()</tt> inside <tt>__init__()</tt>.  If you do, the lexer will be rebuilt from scratch and you lose
   1057all of the performance benefits of using <tt>clone()</tt> in the first place.
   1058
   1059<H3><a name="ply_nn20"></a>3.17 Internal lexer state</H3>
   1060
   1061
   1062A Lexer object <tt>lexer</tt> has a number of internal attributes that may be useful in certain
   1063situations. 
   1064
   1065<p>
   1066<tt>lexer.lexpos</tt>
   1067<blockquote>
   1068This attribute is an integer that contains the current position within the input text.  If you modify
   1069the value, it will change the result of the next call to <tt>token()</tt>.  Within token rule functions, this points
   1070to the first character <em>after</em> the matched text.  If the value is modified within a rule, the next returned token will be
   1071matched at the new position.
   1072</blockquote>
   1073
   1074<p>
   1075<tt>lexer.lineno</tt>
   1076<blockquote>
   1077The current value of the line number attribute stored in the lexer.  This can be modified as needed to
   1078change the line number.
   1079</blockquote>
   1080
   1081<p>
   1082<tt>lexer.lexdata</tt>
   1083<blockquote>
   1084The current input text stored in the lexer.  This is the string passed with the <tt>input()</tt> method. It
   1085would probably be a bad idea to modify this unless you really know what you're doing.
   1086</blockquote>
   1087
   1088<P>
   1089<tt>lexer.lexmatch</tt>
   1090<blockquote>
   1091This is the raw <tt>Match</tt> object returned by the Python <tt>re.match()</tt> function (used internally by PLY) for the
   1092current token.  If you have written a regular expression that contains named groups, you can use this to retrieve those values.
   1093</blockquote>
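
<p>
As a small example, these attributes can be inspected after a token has been read (assuming the
token rules from the earlier calclex example are in scope):

<blockquote>
<pre>
lexer = lex.lex()
lexer.input("3 + 4")
tok = lexer.token()
print lexer.lexpos                            # Position just past the matched text
print lexer.lexdata[tok.lexpos:lexer.lexpos]  # Raw text of the token just returned
</pre>
</blockquote>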
   1094
   1095<H3><a name="ply_nn21"></a>3.18 Conditional lexing and start conditions</H3>
   1096
   1097
   1098In advanced parsing applications, it may be useful to have different
   1099lexing states. For instance, you may want the occurrence of a certain
   1100token or syntactic construct to trigger a different kind of lexing.
   1101PLY supports a feature that allows the underlying lexer to be put into
   1102a series of different states.  Each state can have its own tokens,
   1103lexing rules, and so forth.  The implementation is based largely on
   1104the "start condition" feature of GNU flex.  Details of this can be found
   1105at <a
    1106href="http://www.gnu.org/software/flex/manual/html_chapter/flex_11.html">http://www.gnu.org/software/flex/manual/html_chapter/flex_11.html</a>.
   1107
   1108<p>
   1109To define a new lexing state, it must first be declared.  This is done by including a "states" declaration in your
   1110lex file.  For example:
   1111
   1112<blockquote>
   1113<pre>
   1114states = (
   1115   ('foo','exclusive'),
   1116   ('bar','inclusive'),
   1117)
   1118</pre>
   1119</blockquote>
   1120
    1121This declares two states, <tt>'foo'</tt>
    1122and <tt>'bar'</tt>.  States may be of two types: <tt>'exclusive'</tt>
   1123and <tt>'inclusive'</tt>.  An exclusive state completely overrides the
   1124default behavior of the lexer.  That is, lex will only return tokens
   1125and apply rules defined specifically for that state.  An inclusive
   1126state adds additional tokens and rules to the default set of rules.
   1127Thus, lex will return both the tokens defined by default in addition
   1128to those defined for the inclusive state.
   1129
   1130<p>
   1131Once a state has been declared, tokens and rules are declared by including the
    1132state name in the token/rule declaration.  For example:
   1133
   1134<blockquote>
   1135<pre>
   1136t_foo_NUMBER = r'\d+'                      # Token 'NUMBER' in state 'foo'        
   1137t_bar_ID     = r'[a-zA-Z_][a-zA-Z0-9_]*'   # Token 'ID' in state 'bar'
   1138
   1139def t_foo_newline(t):
   1140    r'\n'
   1141    t.lexer.lineno += 1
   1142</pre>
   1143</blockquote>
   1144
   1145A token can be declared in multiple states by including multiple state names in the declaration. For example:
   1146
   1147<blockquote>
   1148<pre>
   1149t_foo_bar_NUMBER = r'\d+'         # Defines token 'NUMBER' in both state 'foo' and 'bar'
   1150</pre>
   1151</blockquote>
   1152
    1153Alternatively, a token can be declared in all states by using 'ANY' in the name.
   1154
   1155<blockquote>
   1156<pre>
   1157t_ANY_NUMBER = r'\d+'         # Defines a token 'NUMBER' in all states
   1158</pre>
   1159</blockquote>
   1160
   1161If no state name is supplied, as is normally the case, the token is associated with a special state <tt>'INITIAL'</tt>.  For example,
   1162these two declarations are identical:
   1163
   1164<blockquote>
   1165<pre>
   1166t_NUMBER = r'\d+'
   1167t_INITIAL_NUMBER = r'\d+'
   1168</pre>
   1169</blockquote>
   1170
   1171<p>
   1172States are also associated with the special <tt>t_ignore</tt> and <tt>t_error()</tt> declarations.  For example, if a state treats
   1173these differently, you can declare:
   1174
   1175<blockquote>
   1176<pre>
   1177t_foo_ignore = " \t\n"       # Ignored characters for state 'foo'
   1178
   1179def t_bar_error(t):          # Special error handler for state 'bar'
   1180    pass 
   1181</pre>
   1182</blockquote>
   1183
   1184By default, lexing operates in the <tt>'INITIAL'</tt> state.  This state includes all of the normally defined tokens. 
   1185For users who aren't using different states, this fact is completely transparent.   If, during lexing or parsing, you want to change
   1186the lexing state, use the <tt>begin()</tt> method.   For example:
   1187
   1188<blockquote>
   1189<pre>
   1190def t_begin_foo(t):
   1191    r'start_foo'
   1192    t.lexer.begin('foo')             # Starts 'foo' state
   1193</pre>
   1194</blockquote>
   1195
   1196To get out of a state, you use <tt>begin()</tt> to switch back to the initial state.  For example:
   1197
   1198<blockquote>
   1199<pre>
   1200def t_foo_end(t):
   1201    r'end_foo'
   1202    t.lexer.begin('INITIAL')        # Back to the initial state
   1203</pre>
   1204</blockquote>
   1205
   1206The management of states can also be done with a stack.  For example:
   1207
   1208<blockquote>
   1209<pre>
   1210def t_begin_foo(t):
   1211    r'start_foo'
   1212    t.lexer.push_state('foo')             # Starts 'foo' state
   1213
   1214def t_foo_end(t):
   1215    r'end_foo'
   1216    t.lexer.pop_state()                   # Back to the previous state
   1217</pre>
   1218</blockquote>
   1219
   1220<p>
   1221The use of a stack would be useful in situations where there are many ways of entering a new lexing state and you merely want to go back
   1222to the previous state afterwards.
   1223
   1224<P>
   1225An example might help clarify.  Suppose you were writing a parser and you wanted to grab sections of arbitrary C code enclosed by
   1226curly braces.  That is, whenever you encounter a starting brace '{', you want to read all of the enclosed code up to the ending brace '}' 
   1227and return it as a string.   Doing this with a normal regular expression rule is nearly (if not actually) impossible.  This is because braces can
   1228be nested and can be included in comments and strings.  Thus, simply matching up to the first matching '}' character isn't good enough.  Here is how
   1229you might use lexer states to do this:
   1230
   1231<blockquote>
   1232<pre>
   1233# Declare the state
   1234states = (
   1235  ('ccode','exclusive'),
   1236)
   1237
   1238# Match the first {. Enter ccode state.
   1239def t_ccode(t):
   1240    r'\{'
   1241    t.lexer.code_start = t.lexer.lexpos        # Record the starting position
   1242    t.lexer.level = 1                          # Initial brace level
   1243    t.lexer.begin('ccode')                     # Enter 'ccode' state
   1244
   1245# Rules for the ccode state
   1246def t_ccode_lbrace(t):     
   1247    r'\{'
   1248    t.lexer.level +=1                
   1249
   1250def t_ccode_rbrace(t):
   1251    r'\}'
   1252    t.lexer.level -=1
   1253
   1254    # If closing brace, return the code fragment
   1255    if t.lexer.level == 0:
   1256         t.value = t.lexer.lexdata[t.lexer.code_start:t.lexer.lexpos+1]
   1257         t.type = "CCODE"
   1258         t.lexer.lineno += t.value.count('\n')
   1259         t.lexer.begin('INITIAL')           
   1260         return t
   1261
   1262# C or C++ comment (ignore)    
   1263def t_ccode_comment(t):
    1264    r'(/\*(.|\n)*?\*/)|(//.*)'
   1265    pass
   1266
   1267# C string
   1268def t_ccode_string(t):
   1269   r'\"([^\\\n]|(\\.))*?\"'
   1270
   1271# C character literal
   1272def t_ccode_char(t):
   1273   r'\'([^\\\n]|(\\.))*?\''
   1274
   1275# Any sequence of non-whitespace characters (not braces, strings)
   1276def t_ccode_nonspace(t):
   1277   r'[^\s\{\}\'\"]+'
   1278
   1279# Ignored characters (whitespace)
   1280t_ccode_ignore = " \t\n"
   1281
   1282# For bad characters, we just skip over it
   1283def t_ccode_error(t):
   1284    t.lexer.skip(1)
   1285</pre>
   1286</blockquote>
   1287
   1288In this example, the occurrence of the first '{' causes the lexer to record the starting position and enter a new state <tt>'ccode'</tt>.  A collection of rules then match
   1289various parts of the input that follow (comments, strings, etc.).  All of these rules merely discard the token (by not returning a value).
   1290However, if the closing right brace is encountered, the rule <tt>t_ccode_rbrace</tt> collects all of the code (using the earlier recorded starting
   1291position), stores it, and returns a token 'CCODE' containing all of that text.  When returning the token, the lexing state is restored back to its
   1292initial state.
   1293
   1294<H3><a name="ply_nn21"></a>3.19 Miscellaneous Issues</H3>
   1295
   1296
   1297<P>
   1298<li>The lexer requires input to be supplied as a single input string.  Since most machines have more than enough memory, this 
   1299rarely presents a performance concern.  However, it means that the lexer currently can't be used with streaming data
   1300such as open files or sockets.  This limitation is primarily a side-effect of using the <tt>re</tt> module.
   1301
   1302<p>
    1303<li>The lexer should work properly with Unicode strings given both as token/pattern matching rules and
    1304as input text.
   1305
   1306<p>
   1307<li>If you need to supply optional flags to the re.compile() function, use the reflags option to lex.  For example:
   1308
   1309<blockquote>
   1310<pre>
   1311lex.lex(reflags=re.UNICODE)
   1312</pre>
   1313</blockquote>
   1314
   1315<p>
   1316<li>Since the lexer is written entirely in Python, its performance is
   1317largely determined by that of the Python <tt>re</tt> module.  Although
   1318the lexer has been written to be as efficient as possible, it's not
   1319blazingly fast when used on very large input files.  If
    1320performance is a concern, you might consider upgrading to the most
   1321recent version of Python, creating a hand-written lexer, or offloading
   1322the lexer into a C extension module.  
   1323
   1324<p>
   1325If you are going to create a hand-written lexer and you plan to use it with <tt>yacc.py</tt>, 
   1326it only needs to conform to the following requirements:
   1327
   1328<ul>
   1329<li>It must provide a <tt>token()</tt> method that returns the next token or <tt>None</tt> if no more
   1330tokens are available.
   1331<li>The <tt>token()</tt> method must return an object <tt>tok</tt> that has <tt>type</tt> and <tt>value</tt> attributes.
   1332</ul>
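
<p>
For example, a minimal hand-written lexer meeting these requirements might look like the following
sketch (the <tt>Token</tt> class and the whitespace-splitting strategy are only for illustration):

<blockquote>
<pre>
class Token:
    def __init__(self, type, value):
        self.type  = type
        self.value = value

class WordLexer:
    def input(self, data):
        # Pre-compute the token stream; a real lexer would do something smarter
        self.toks = [Token('WORD', w) for w in data.split()]
        self.pos = 0

    def token(self):
        if self.pos >= len(self.toks):
            return None            # No more tokens
        tok = self.toks[self.pos]
        self.pos += 1
        return tok
</pre>
</blockquote>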
   1333
   1334<H2><a name="ply_nn22"></a>4. Parsing basics</H2>
   1335
   1336
   1337<tt>yacc.py</tt> is used to parse language syntax.  Before showing an
   1338example, there are a few important bits of background that must be
   1339mentioned.  First, <em>syntax</em> is usually specified in terms of a BNF grammar.
   1340For example, if you wanted to parse
   1341simple arithmetic expressions, you might first write an unambiguous
   1342grammar specification like this:
   1343
   1344<blockquote>
   1345<pre> 
   1346expression : expression + term
   1347           | expression - term
   1348           | term
   1349
   1350term       : term * factor
   1351           | term / factor
   1352           | factor
   1353
   1354factor     : NUMBER
   1355           | ( expression )
   1356</pre>
   1357</blockquote>
   1358
   1359In the grammar, symbols such as <tt>NUMBER</tt>, <tt>+</tt>, <tt>-</tt>, <tt>*</tt>, and <tt>/</tt> are known
   1360as <em>terminals</em> and correspond to raw input tokens.  Identifiers such as <tt>term</tt> and <tt>factor</tt> refer to more
   1361complex rules, typically comprised of a collection of tokens.  These identifiers are known as <em>non-terminals</em>.
   1362<P>
   1363The semantic behavior of a language is often specified using a
   1364technique known as syntax directed translation.  In syntax directed
   1365translation, attributes are attached to each symbol in a given grammar
   1366rule along with an action.  Whenever a particular grammar rule is
   1367recognized, the action describes what to do.  For example, given the
   1368expression grammar above, you might write the specification for a
   1369simple calculator like this:
   1370
   1371<blockquote>
   1372<pre> 
   1373Grammar                             Action
   1374--------------------------------    -------------------------------------------- 
   1375expression0 : expression1 + term    expression0.val = expression1.val + term.val
   1376            | expression1 - term    expression0.val = expression1.val - term.val
   1377            | term                  expression0.val = term.val
   1378
   1379term0       : term1 * factor        term0.val = term1.val * factor.val
   1380            | term1 / factor        term0.val = term1.val / factor.val
   1381            | factor                term0.val = factor.val
   1382
   1383factor      : NUMBER                factor.val = int(NUMBER.lexval)
   1384            | ( expression )        factor.val = expression.val
   1385</pre>
   1386</blockquote>
   1387
   1388A good way to think about syntax directed translation is to simply think of each symbol in the grammar as some
   1389kind of object.  The semantics of the language are then expressed as a collection of methods/operations on these
   1390objects. 
   1391
   1392<p>
   1393Yacc uses a parsing technique known as LR-parsing or shift-reduce parsing.  LR parsing is a
   1394bottom up technique that tries to recognize the right-hand-side of various grammar rules.
   1395Whenever a valid right-hand-side is found in the input, the appropriate action code is triggered and the
   1396grammar symbols are replaced by the grammar symbol on the left-hand-side. 
   1397
   1398<p>
   1399LR parsing is commonly implemented by shifting grammar symbols onto a stack and looking at the stack and the next
   1400input token for patterns.   The details of the algorithm can be found in a compiler text, but the
   1401following example illustrates the steps that are performed if you wanted to parse the expression 
   1402<tt>3 + 5 * (10 - 20)</tt> using the grammar defined above:
   1403
   1404<blockquote>
   1405<pre>
   1406Step Symbol Stack           Input Tokens            Action
   1407---- ---------------------  ---------------------   -------------------------------
   14081    $                      3 + 5 * ( 10 - 20 )$    Shift 3
   14092    $ 3                      + 5 * ( 10 - 20 )$    Reduce factor : NUMBER
   14103    $ factor                 + 5 * ( 10 - 20 )$    Reduce term   : factor
   14114    $ term                   + 5 * ( 10 - 20 )$    Reduce expr : term
   14125    $ expr                   + 5 * ( 10 - 20 )$    Shift +
   14136    $ expr +                   5 * ( 10 - 20 )$    Shift 5
   14147    $ expr + 5                   * ( 10 - 20 )$    Reduce factor : NUMBER
   14158    $ expr + factor              * ( 10 - 20 )$    Reduce term   : factor
   14169    $ expr + term                * ( 10 - 20 )$    Shift *
   141710   $ expr + term *                ( 10 - 20 )$    Shift (
   141811   $ expr + term * (                10 - 20 )$    Shift 10
   141912   $ expr + term * ( 10                - 20 )$    Reduce factor : NUMBER
   142013   $ expr + term * ( factor            - 20 )$    Reduce term : factor
   142114   $ expr + term * ( term              - 20 )$    Reduce expr : term
   142215   $ expr + term * ( expr              - 20 )$    Shift -
   142316   $ expr + term * ( expr -              20 )$    Shift 20
   142417   $ expr + term * ( expr - 20              )$    Reduce factor : NUMBER
   142518   $ expr + term * ( expr - factor          )$    Reduce term : factor
   142619   $ expr + term * ( expr - term            )$    Reduce expr : expr - term
   142720   $ expr + term * ( expr                   )$    Shift )
   142821   $ expr + term * ( expr )                  $    Reduce factor : (expr)
   142922   $ expr + term * factor                    $    Reduce term : term * factor
   143023   $ expr + term                             $    Reduce expr : expr + term
   143124   $ expr                                    $    Reduce expr
   143225   $                                         $    Success!
   1433</pre>
   1434</blockquote>
   1435
   1436When parsing the expression, an underlying state machine and the current input token determine what to do next.  
   1437If the next token looks like part of a valid grammar rule (based on other items on the stack), it is generally shifted
   1438onto the stack.  If the top of the stack contains a valid right-hand-side of a grammar rule, it is
   1439usually "reduced" and the symbols replaced with the symbol on the left-hand-side.  When this reduction occurs, the
   1440appropriate action is triggered (if defined).  If the input token can't be shifted and the top of stack doesn't match
   1441any grammar rules, a syntax error has occurred and the parser must take some kind of recovery step (or bail out).
   1442
   1443<p>
   1444It is important to note that the underlying implementation is built around a large finite-state machine that is encoded
   1445in a collection of tables. The construction of these tables is quite complicated and beyond the scope of this discussion.
   1446However, subtle details of this process explain why, in the example above, the parser chooses to shift a token
   1447onto the stack in step 9 rather than reducing the rule <tt>expr : expr + term</tt>.
   1448
   1449<H2><a name="ply_nn23"></a>5. Yacc reference</H2>
   1450
   1451
    1452This section describes how to write parsers in PLY.
   1453
   1454<H3><a name="ply_nn24"></a>5.1 An example</H3>
   1455
   1456
   1457Suppose you wanted to make a grammar for simple arithmetic expressions as previously described.   Here is
   1458how you would do it with <tt>yacc.py</tt>:
   1459
   1460<blockquote>
   1461<pre>
   1462# Yacc example
   1463
   1464import ply.yacc as yacc
   1465
   1466# Get the token map from the lexer.  This is required.
   1467from calclex import tokens
   1468
   1469def p_expression_plus(p):
   1470    'expression : expression PLUS term'
   1471    p[0] = p[1] + p[3]
   1472
   1473def p_expression_minus(p):
   1474    'expression : expression MINUS term'
   1475    p[0] = p[1] - p[3]
   1476
   1477def p_expression_term(p):
   1478    'expression : term'
   1479    p[0] = p[1]
   1480
   1481def p_term_times(p):
   1482    'term : term TIMES factor'
   1483    p[0] = p[1] * p[3]
   1484
   1485def p_term_div(p):
   1486    'term : term DIVIDE factor'
   1487    p[0] = p[1] / p[3]
   1488
   1489def p_term_factor(p):
   1490    'term : factor'
   1491    p[0] = p[1]
   1492
   1493def p_factor_num(p):
   1494    'factor : NUMBER'
   1495    p[0] = p[1]
   1496
   1497def p_factor_expr(p):
   1498    'factor : LPAREN expression RPAREN'
   1499    p[0] = p[2]
   1500
   1501# Error rule for syntax errors
   1502def p_error(p):
   1503    print "Syntax error in input!"
   1504
   1505# Build the parser
   1506yacc.yacc()
   1507
   1508# Use this if you want to build the parser using SLR instead of LALR
   1509# yacc.yacc(method="SLR")
   1510
   1511while 1:
   1512   try:
   1513       s = raw_input('calc > ')
   1514   except EOFError:
   1515       break
   1516   if not s: continue
   1517   result = yacc.parse(s)
   1518   print result
   1519</pre>
   1520</blockquote>
   1521
   1522In this example, each grammar rule is defined by a Python function where the docstring to that function contains the
   1523appropriate context-free grammar specification.  Each function accepts a single
   1524argument <tt>p</tt> that is a sequence containing the values of each grammar symbol in the corresponding rule.  The values of
   1525<tt>p[i]</tt> are mapped to grammar symbols as shown here:
   1526
   1527<blockquote>
   1528<pre>
   1529def p_expression_plus(p):
   1530    'expression : expression PLUS term'
   1531    #   ^            ^        ^    ^
   1532    #  p[0]         p[1]     p[2] p[3]
   1533
   1534    p[0] = p[1] + p[3]
   1535</pre>
   1536</blockquote>
   1537
   1538For tokens, the "value" of the corresponding <tt>p[i]</tt> is the
    1539<em>same</em> as the <tt>t.value</tt> attribute assigned to the token
   1540in the lexer module.  For non-terminals, the value is determined by
   1541whatever is placed in <tt>p[0]</tt> when rules are reduced.  This
    1542value can be anything at all.  However, it is probably most common for
   1543the value to be a simple Python type, a tuple, or an instance.  In this example, we
   1544are relying on the fact that the <tt>NUMBER</tt> token stores an integer value in its value
   1545field.  All of the other rules simply perform various types of integer operations and store
   1546the result.
   1547
   1548<P>
    1549Note: The use of negative indices has a special meaning in yacc---specifically, <tt>p[-1]</tt> does
   1550not have the same value as <tt>p[3]</tt> in this example.  Please see the section on "Embedded Actions" for further
   1551details.
   1552
   1553<p>
   1554The first rule defined in the yacc specification determines the starting grammar
   1555symbol (in this case, a rule for <tt>expression</tt> appears first).  Whenever
   1556the starting rule is reduced by the parser and no more input is available, parsing 
   1557stops and the final value is returned (this value will be whatever the top-most rule
   1558placed in <tt>p[0]</tt>). Note: an alternative starting symbol can be specified using the <tt>start</tt> keyword argument to
   1559<tt>yacc()</tt>.
   1560
   1561<p>The <tt>p_error(p)</tt> rule is defined to catch syntax errors.  See the error handling section
   1562below for more detail.
   1563
   1564<p>
   1565To build the parser, call the <tt>yacc.yacc()</tt> function.  This function
   1566looks at the module and attempts to construct all of the LR parsing tables for the grammar
   1567you have specified.   The first time <tt>yacc.yacc()</tt> is invoked, you will get a message
   1568such as this:
   1569
   1570<blockquote>
   1571<pre>
   1572$ python calcparse.py
   1573yacc: Generating LALR parsing table...  
   1574calc > 
   1575</pre>
   1576</blockquote>
   1577
   1578Since table construction is relatively expensive (especially for large
   1579grammars), the resulting parsing table is written to the current
   1580directory in a file called <tt>parsetab.py</tt>.  In addition, a
   1581debugging file called <tt>parser.out</tt> is created.  On subsequent
   1582executions, <tt>yacc</tt> will reload the table from
   1583<tt>parsetab.py</tt> unless it has detected a change in the underlying
   1584grammar (in which case the tables and <tt>parsetab.py</tt> file are
   1585regenerated).   Note: The names of parser output files can be changed if necessary.  See the notes that follow later.
   1586
   1587<p>
   1588If any errors are detected in your grammar specification, <tt>yacc.py</tt> will produce
   1589diagnostic messages and possibly raise an exception.  Some of the errors that can be detected include:
   1590
   1591<ul>
   1592<li>Duplicated function names (if more than one rule function have the same name in the grammar file).
   1593<li>Shift/reduce and reduce/reduce conflicts generated by ambiguous grammars.
   1594<li>Badly specified grammar rules.
   1595<li>Infinite recursion (rules that can never terminate).
    1596<li>Unused rules and tokens.
    1597<li>Undefined rules and tokens.
   1598</ul>
   1599
   1600The next few sections now discuss a few finer points of grammar construction.
   1601
   1602<H3><a name="ply_nn25"></a>5.2 Combining Grammar Rule Functions</H3>
   1603
   1604
   1605When grammar rules are similar, they can be combined into a single function.
   1606For example, consider the two rules in our earlier example:
   1607
   1608<blockquote>
   1609<pre>
   1610def p_expression_plus(p):
   1611    'expression : expression PLUS term'
   1612    p[0] = p[1] + p[3]
   1613
    1614def p_expression_minus(p):
   1615    'expression : expression MINUS term'
   1616    p[0] = p[1] - p[3]
   1617</pre>
   1618</blockquote>
   1619
   1620Instead of writing two functions, you might write a single function like this:
   1621
   1622<blockquote>
   1623<pre>
   1624def p_expression(p):
   1625    '''expression : expression PLUS term
   1626                  | expression MINUS term'''
   1627    if p[2] == '+':
   1628        p[0] = p[1] + p[3]
   1629    elif p[2] == '-':
   1630        p[0] = p[1] - p[3]
   1631</pre>
   1632</blockquote>
   1633
   1634In general, the doc string for any given function can contain multiple grammar rules.  So, it would
   1635have also been legal (although possibly confusing) to write this:
   1636
   1637<blockquote>
   1638<pre>
   1639def p_binary_operators(p):
   1640    '''expression : expression PLUS term
   1641                  | expression MINUS term
   1642       term       : term TIMES factor
   1643                  | term DIVIDE factor'''
   1644    if p[2] == '+':
   1645        p[0] = p[1] + p[3]
   1646    elif p[2] == '-':
   1647        p[0] = p[1] - p[3]
   1648    elif p[2] == '*':
   1649        p[0] = p[1] * p[3]
   1650    elif p[2] == '/':
   1651        p[0] = p[1] / p[3]
   1652</pre>
   1653</blockquote>
   1654
   1655When combining grammar rules into a single function, it is usually a good idea for all of the rules to have
   1656a similar structure (e.g., the same number of terms).  Otherwise, the corresponding action code may be more 
   1657complicated than necessary.  However, it is possible to handle simple cases using len().  For example:
   1658
   1659<blockquote>
   1660<pre>
   1661def p_expressions(p):
   1662    '''expression : expression MINUS expression
   1663                  | MINUS expression'''
   1664    if (len(p) == 4):
   1665        p[0] = p[1] - p[3]
   1666    elif (len(p) == 3):
   1667        p[0] = -p[2]
   1668</pre>
   1669</blockquote>
   1670
   1671<H3><a name="ply_nn26"></a>5.3 Character Literals</H3>
   1672
   1673
   1674If desired, a grammar may contain tokens defined as single character literals.   For example:
   1675
   1676<blockquote>
   1677<pre>
   1678def p_binary_operators(p):
   1679    '''expression : expression '+' term
   1680                  | expression '-' term
   1681       term       : term '*' factor
   1682                  | term '/' factor'''
   1683    if p[2] == '+':
   1684        p[0] = p[1] + p[3]
   1685    elif p[2] == '-':
   1686        p[0] = p[1] - p[3]
   1687    elif p[2] == '*':
   1688        p[0] = p[1] * p[3]
   1689    elif p[2] == '/':
   1690        p[0] = p[1] / p[3]
   1691</pre>
   1692</blockquote>
   1693
   1694A character literal must be enclosed in quotes such as <tt>'+'</tt>.  In addition, if literals are used, they must be declared in the
   1695corresponding <tt>lex</tt> file through the use of a special <tt>literals</tt> declaration.
   1696
   1697<blockquote>
   1698<pre>
   1699# Literals.  Should be placed in module given to lex()
   1700literals = ['+','-','*','/' ]
   1701</pre>
   1702</blockquote>
   1703
   1704<b>Character literals are limited to a single character</b>.  Thus, it is not legal to specify literals such as <tt>'&lt;='</tt> or <tt>'=='</tt>.  For this, use
   1705the normal lexing rules (e.g., define a rule such as <tt>t_EQ = r'=='</tt>).
   1706
   1707<H3><a name="ply_nn26"></a>5.4 Empty Productions</H3>
   1708
   1709
   1710<tt>yacc.py</tt> can handle empty productions by defining a rule like this:
   1711
   1712<blockquote>
   1713<pre>
   1714def p_empty(p):
   1715    'empty :'
   1716    pass
   1717</pre>
   1718</blockquote>
   1719
   1720Now to use the empty production, simply use 'empty' as a symbol.  For example:
   1721
   1722<blockquote>
   1723<pre>
   1724def p_optitem(p):
    1725    '''optitem : item
    1726               | empty'''
   1727    ...
   1728</pre>
   1729</blockquote>
   1730
   1731Note: You can write empty rules anywhere by simply specifying an empty right hand side.  However, I personally find that
   1732writing an "empty" rule and using "empty" to denote an empty production is easier to read.
   1733
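<p>
When a rule refers to <tt>empty</tt>, the corresponding slot in <tt>p</tt> simply holds <tt>None</tt>,
since <tt>p_empty()</tt> assigns nothing to <tt>p[0]</tt>.  Here is a minimal sketch of an action
that takes this into account:

<blockquote>
<pre>
def p_optitem(p):
    '''optitem : item
               | empty'''
    # p[1] is None when the empty production was used
    p[0] = p[1]
</pre>
</blockquote>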
   1734<H3><a name="ply_nn28"></a>5.5 Changing the starting symbol</H3>
   1735
   1736
   1737Normally, the first rule found in a yacc specification defines the starting grammar rule (top level rule).  To change this, simply
   1738supply a <tt>start</tt> specifier in your file.  For example:
   1739
   1740<blockquote>
   1741<pre>
   1742start = 'foo'
   1743
   1744def p_bar(p):
   1745    'bar : A B'
   1746
   1747# This is the starting rule due to the start specifier above
   1748def p_foo(p):
   1749    'foo : bar X'
   1750...
   1751</pre>
   1752</blockquote>
   1753
   1754The use of a <tt>start</tt> specifier may be useful during debugging since you can use it to have yacc build a subset of
   1755a larger grammar.  For this purpose, it is also possible to specify a starting symbol as an argument to <tt>yacc()</tt>. For example:
   1756
   1757<blockquote>
   1758<pre>
   1759yacc.yacc(start='foo')
   1760</pre>
   1761</blockquote>
   1762
   1763<H3><a name="ply_nn27"></a>5.6 Dealing With Ambiguous Grammars</H3>
   1764
   1765
   1766The expression grammar given in the earlier example has been written in a special format to eliminate ambiguity. 
   1767However, in many situations, it is extremely difficult or awkward to write grammars in this format.  A
   1768much more natural way to express the grammar is in a more compact form like this:
   1769
   1770<blockquote>
   1771<pre>
   1772expression : expression PLUS expression
   1773           | expression MINUS expression
   1774           | expression TIMES expression
   1775           | expression DIVIDE expression
   1776           | LPAREN expression RPAREN
   1777           | NUMBER
   1778</pre>
   1779</blockquote>
   1780
   1781Unfortunately, this grammar specification is ambiguous.  For example, if you are parsing the string
   1782"3 * 4 + 5", there is no way to tell how the operators are supposed to be grouped.
   1783For example, does the expression mean "(3 * 4) + 5" or is it "3 * (4+5)"?
   1784
   1785<p>
    1786When an ambiguous grammar is given to <tt>yacc.py</tt>, it will print messages about "shift/reduce conflicts" 
    1787or "reduce/reduce conflicts".   A shift/reduce conflict is caused when the parser generator can't decide
    1788whether to reduce a rule or shift a symbol onto the parsing stack.   For example, consider
   1789the string "3 * 4 + 5" and the internal parsing stack:
   1790
   1791<blockquote>
   1792<pre>
   1793Step Symbol Stack           Input Tokens            Action
   1794---- ---------------------  ---------------------   -------------------------------
   17951    $                                3 * 4 + 5$    Shift 3
    17962    $ 3                                * 4 + 5$    Reduce expression : NUMBER
   17973    $ expr                             * 4 + 5$    Shift *
   17984    $ expr *                             4 + 5$    Shift 4
    17995    $ expr * 4                             + 5$    Reduce expression : NUMBER
   18006    $ expr * expr                          + 5$    SHIFT/REDUCE CONFLICT ????
   1801</pre>
   1802</blockquote>
   1803
   1804In this case, when the parser reaches step 6, it has two options.  One is to reduce the
   1805rule <tt>expr : expr * expr</tt> on the stack.  The other option is to shift the
    1806token <tt>+</tt> onto the stack.   Both options are perfectly legal according to the rules
    1807of the context-free grammar.
   1808
   1809<p>
   1810By default, all shift/reduce conflicts are resolved in favor of shifting.  Therefore, in the above
   1811example, the parser will always shift the <tt>+</tt> instead of reducing.    Although this
   1812strategy works in many cases (including the ambiguous if-then-else), it is not enough for arithmetic
   1813expressions.  In fact, in the above example, the decision to shift <tt>+</tt> is completely wrong---we should have
   1814reduced <tt>expr * expr</tt> since multiplication has higher mathematical precedence than addition.
   1815
   1816<p>To resolve ambiguity, especially in expression grammars, <tt>yacc.py</tt> allows individual
   1817tokens to be assigned a precedence level and associativity.  This is done by adding a variable
   1818<tt>precedence</tt> to the grammar file like this:
   1819
   1820<blockquote>
   1821<pre>
   1822precedence = (
   1823    ('left', 'PLUS', 'MINUS'),
   1824    ('left', 'TIMES', 'DIVIDE'),
   1825)
   1826</pre>
   1827</blockquote>
   1828
   1829This declaration specifies that <tt>PLUS</tt>/<tt>MINUS</tt> have
   1830the same precedence level and are left-associative and that 
   1831<tt>TIMES</tt>/<tt>DIVIDE</tt> have the same precedence and are left-associative. 
   1832Within the <tt>precedence</tt> declaration, tokens are ordered from lowest to highest precedence. Thus,
   1833this declaration specifies that <tt>TIMES</tt>/<tt>DIVIDE</tt> have higher
   1834precedence than <tt>PLUS</tt>/<tt>MINUS</tt> (since they appear later in the
   1835precedence specification).
   1836
   1837<p>
   1838The precedence specification works by associating a numerical precedence level value and associativity direction to
   1839the listed tokens.  For example, in the above example you get:
   1840
   1841<blockquote>
   1842<pre>
   1843PLUS      : level = 1,  assoc = 'left'
   1844MINUS     : level = 1,  assoc = 'left'
   1845TIMES     : level = 2,  assoc = 'left'
   1846DIVIDE    : level = 2,  assoc = 'left'
   1847</pre>
   1848</blockquote>
   1849
   1850These values are then used to attach a numerical precedence value and associativity direction 
   1851to each grammar rule. <em>This is always determined by looking at the precedence of the right-most terminal symbol.</em>  
   1852For example:
   1853
   1854<blockquote>
   1855<pre>
   1856expression : expression PLUS expression                 # level = 1, left
   1857           | expression MINUS expression                # level = 1, left
   1858           | expression TIMES expression                # level = 2, left
   1859           | expression DIVIDE expression               # level = 2, left
   1860           | LPAREN expression RPAREN                   # level = None (not specified)
   1861           | NUMBER                                     # level = None (not specified)
   1862</pre>
   1863</blockquote>
   1864
   1865When shift/reduce conflicts are encountered, the parser generator resolves the conflict by
   1866looking at the precedence rules and associativity specifiers.
   1867
   1868<p>
   1869<ol>
   1870<li>If the current token has higher precedence, it is shifted.
   1871<li>If the grammar rule on the stack has higher precedence, the rule is reduced.
   1872<li>If the current token and the grammar rule have the same precedence, the
   1873rule is reduced for left associativity, whereas the token is shifted for right associativity.
   1874<li>If nothing is known about the precedence, shift/reduce conflicts are resolved in
   1875favor of shifting (the default).
   1876</ol>
   1877
   1878For example, if "expression PLUS expression" has been parsed and the next token
   1879is "TIMES", the action is going to be a shift because "TIMES" has a higher precedence level than "PLUS".  On the other
   1880hand, if "expression TIMES expression" has been parsed and the next token is "PLUS", the action
   1881is going to be reduce because "PLUS" has a lower precedence than "TIMES."
   1882
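<p>
To see how the pieces fit together, here is a minimal sketch of the ambiguous expression grammar
written with a <tt>precedence</tt> table (the token names are assumed to come from the
<tt>calclex</tt> module used earlier):

<blockquote>
<pre>
precedence = (
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE'),
)

def p_expression_binop(p):
    '''expression : expression PLUS expression
                  | expression MINUS expression
                  | expression TIMES expression
                  | expression DIVIDE expression'''
    if p[2] == '+':
        p[0] = p[1] + p[3]
    elif p[2] == '-':
        p[0] = p[1] - p[3]
    elif p[2] == '*':
        p[0] = p[1] * p[3]
    else:
        p[0] = p[1] / p[3]

def p_expression_group(p):
    'expression : LPAREN expression RPAREN'
    p[0] = p[2]

def p_expression_number(p):
    'expression : NUMBER'
    p[0] = p[1]
</pre>
</blockquote>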
   1883<p>
   1884When shift/reduce conflicts are resolved using the first three techniques (with the help of
   1885precedence rules), <tt>yacc.py</tt> will report no errors or conflicts in the grammar. 
   1886
   1887<p>
   1888One problem with the precedence specifier technique is that it is sometimes necessary to
    1889change the precedence of an operator in certain contexts.  For example, consider a unary-minus operator
   1890in "3 + 4 * -5".  Normally, unary minus has a very high precedence--being evaluated before the multiply.
   1891However, in our precedence specifier, MINUS has a lower precedence than TIMES.  To deal with this,
   1892precedence rules can be given for fictitious tokens like this:
   1893
   1894<blockquote>
   1895<pre>
   1896precedence = (
   1897    ('left', 'PLUS', 'MINUS'),
   1898    ('left', 'TIMES', 'DIVIDE'),
   1899    ('right', 'UMINUS'),            # Unary minus operator
   1900)
   1901</pre>
   1902</blockquote>
   1903
   1904Now, in the grammar file, we can write our unary minus rule like this:
   1905
   1906<blockquote>
   1907<pre>
   1908def p_expr_uminus(p):
   1909    'expression : MINUS expression %prec UMINUS'
   1910    p[0] = -p[2]
   1911</pre>
   1912</blockquote>
   1913
   1914In this case, <tt>%prec UMINUS</tt> overrides the default rule precedence--setting it to that
   1915of UMINUS in the precedence specifier.
   1916
   1917<p>
   1918At first, the use of UMINUS in this example may appear very confusing.
    1919UMINUS is not an input token or a grammar rule.  Instead, you should
   1920think of it as the name of a special marker in the precedence table.   When you use the <tt>%prec</tt> qualifier, you're simply
   1921telling yacc that you want the precedence of the expression to be the same as for this special marker instead of the usual precedence.
   1922
   1923<p>
   1924It is also possible to specify non-associativity in the <tt>precedence</tt> table. This would
   1925be used when you <em>don't</em> want operations to chain together.  For example, suppose
   1926you wanted to support comparison operators like <tt>&lt;</tt> and <tt>&gt;</tt> but you didn't want to allow
   1927combinations like <tt>a &lt; b &lt; c</tt>.   To do this, simply specify a rule like this:
   1928
   1929<blockquote>
   1930<pre>
   1931precedence = (
   1932    ('nonassoc', 'LESSTHAN', 'GREATERTHAN'),  # Nonassociative operators
   1933    ('left', 'PLUS', 'MINUS'),
   1934    ('left', 'TIMES', 'DIVIDE'),
   1935    ('right', 'UMINUS'),            # Unary minus operator
   1936)
   1937</pre>
   1938</blockquote>
   1939
   1940<p>
   1941If you do this, the occurrence of input text such as <tt> a &lt; b &lt; c</tt> will result in a syntax error.  However, simple
   1942expressions such as <tt>a &lt; b</tt> will still be fine.
   1943
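<p>
For completeness, here is a sketch of a grammar rule that uses these nonassociative comparison
tokens (it assumes <tt>LESSTHAN</tt> and <tt>GREATERTHAN</tt> are defined in the lexer):

<blockquote>
<pre>
def p_expression_compare(p):
    '''expression : expression LESSTHAN expression
                  | expression GREATERTHAN expression'''
    if p[2] == '&lt;':
        p[0] = p[1] &lt; p[3]
    else:
        p[0] = p[1] &gt; p[3]
</pre>
</blockquote>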
   1944<p>
   1945Reduce/reduce conflicts are caused when there are multiple grammar
   1946rules that can be applied to a given set of symbols.  This kind of
   1947conflict is almost always bad and is always resolved by picking the
   1948rule that appears first in the grammar file.   Reduce/reduce conflicts
   1949are almost always caused when different sets of grammar rules somehow
   1950generate the same set of symbols.  For example:
   1951
   1952<blockquote>
   1953<pre>
   1954assignment :  ID EQUALS NUMBER
   1955           |  ID EQUALS expression
   1956           
   1957expression : expression PLUS expression
   1958           | expression MINUS expression
   1959           | expression TIMES expression
   1960           | expression DIVIDE expression
   1961           | LPAREN expression RPAREN
   1962           | NUMBER
   1963</pre>
   1964</blockquote>
   1965
   1966In this case, a reduce/reduce conflict exists between these two rules:
   1967
   1968<blockquote>
   1969<pre>
   1970assignment  : ID EQUALS NUMBER
   1971expression  : NUMBER
   1972</pre>
   1973</blockquote>
   1974
   1975For example, if you wrote "a = 5", the parser can't figure out if this
   1976is supposed to be reduced as <tt>assignment : ID EQUALS NUMBER</tt> or
   1977whether it's supposed to reduce the 5 as an expression and then reduce
   1978the rule <tt>assignment : ID EQUALS expression</tt>.
   1979
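<p>
One possible way to eliminate this particular conflict is to drop the redundant rule so that a
<tt>NUMBER</tt> on the right-hand side of an assignment is always reduced through <tt>expression</tt>.
For example:

<blockquote>
<pre>
def p_assignment(p):
    'assignment : ID EQUALS expression'
    ...
</pre>
</blockquote>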
   1980<p>
   1981It should be noted that reduce/reduce conflicts are notoriously difficult to spot
    1982simply by looking at the input grammar.    To locate these, it is usually easier to look at the
   1983<tt>parser.out</tt> debugging file with an appropriately high level of caffeination.
   1984
   1985<H3><a name="ply_nn28"></a>5.7 The parser.out file</H3>
   1986
   1987
   1988Tracking down shift/reduce and reduce/reduce conflicts is one of the finer pleasures of using an LR
   1989parsing algorithm.  To assist in debugging, <tt>yacc.py</tt> creates a debugging file called
   1990'parser.out' when it generates the parsing table.   The contents of this file look like the following:
   1991
   1992<blockquote>
   1993<pre>
   1994Unused terminals:
   1995
   1996
   1997Grammar
   1998
   1999Rule 1     expression -> expression PLUS expression
   2000Rule 2     expression -> expression MINUS expression
   2001Rule 3     expression -> expression TIMES expression
   2002Rule 4     expression -> expression DIVIDE expression
   2003Rule 5     expression -> NUMBER
   2004Rule 6     expression -> LPAREN expression RPAREN
   2005
   2006Terminals, with rules where they appear
   2007
   2008TIMES                : 3
   2009error                : 
   2010MINUS                : 2
   2011RPAREN               : 6
   2012LPAREN               : 6
   2013DIVIDE               : 4
   2014PLUS                 : 1
   2015NUMBER               : 5
   2016
   2017Nonterminals, with rules where they appear
   2018
   2019expression           : 1 1 2 2 3 3 4 4 6 0
   2020
   2021
   2022Parsing method: LALR
   2023
   2024
   2025state 0
   2026
   2027    S' -> . expression
   2028    expression -> . expression PLUS expression
   2029    expression -> . expression MINUS expression
   2030    expression -> . expression TIMES expression
   2031    expression -> . expression DIVIDE expression
   2032    expression -> . NUMBER
   2033    expression -> . LPAREN expression RPAREN
   2034
   2035    NUMBER          shift and go to state 3
   2036    LPAREN          shift and go to state 2
   2037
   2038
   2039state 1
   2040
   2041    S' -> expression .
   2042    expression -> expression . PLUS expression
   2043    expression -> expression . MINUS expression
   2044    expression -> expression . TIMES expression
   2045    expression -> expression . DIVIDE expression
   2046
   2047    PLUS            shift and go to state 6
   2048    MINUS           shift and go to state 5
   2049    TIMES           shift and go to state 4
   2050    DIVIDE          shift and go to state 7
   2051
   2052
   2053state 2
   2054
   2055    expression -> LPAREN . expression RPAREN
   2056    expression -> . expression PLUS expression
   2057    expression -> . expression MINUS expression
   2058    expression -> . expression TIMES expression
   2059    expression -> . expression DIVIDE expression
   2060    expression -> . NUMBER
   2061    expression -> . LPAREN expression RPAREN
   2062
   2063    NUMBER          shift and go to state 3
   2064    LPAREN          shift and go to state 2
   2065
   2066
   2067state 3
   2068
   2069    expression -> NUMBER .
   2070
   2071    $               reduce using rule 5
   2072    PLUS            reduce using rule 5
   2073    MINUS           reduce using rule 5
   2074    TIMES           reduce using rule 5
   2075    DIVIDE          reduce using rule 5
   2076    RPAREN          reduce using rule 5
   2077
   2078
   2079state 4
   2080
   2081    expression -> expression TIMES . expression
   2082    expression -> . expression PLUS expression
   2083    expression -> . expression MINUS expression
   2084    expression -> . expression TIMES expression
   2085    expression -> . expression DIVIDE expression
   2086    expression -> . NUMBER
   2087    expression -> . LPAREN expression RPAREN
   2088
   2089    NUMBER          shift and go to state 3
   2090    LPAREN          shift and go to state 2
   2091
   2092
   2093state 5
   2094
   2095    expression -> expression MINUS . expression
   2096    expression -> . expression PLUS expression
   2097    expression -> . expression MINUS expression
   2098    expression -> . expression TIMES expression
   2099    expression -> . expression DIVIDE expression
   2100    expression -> . NUMBER
   2101    expression -> . LPAREN expression RPAREN
   2102
   2103    NUMBER          shift and go to state 3
   2104    LPAREN          shift and go to state 2
   2105
   2106
   2107state 6
   2108
   2109    expression -> expression PLUS . expression
   2110    expression -> . expression PLUS expression
   2111    expression -> . expression MINUS expression
   2112    expression -> . expression TIMES expression
   2113    expression -> . expression DIVIDE expression
   2114    expression -> . NUMBER
   2115    expression -> . LPAREN expression RPAREN
   2116
   2117    NUMBER          shift and go to state 3
   2118    LPAREN          shift and go to state 2
   2119
   2120
   2121state 7
   2122
   2123    expression -> expression DIVIDE . expression
   2124    expression -> . expression PLUS expression
   2125    expression -> . expression MINUS expression
   2126    expression -> . expression TIMES expression
   2127    expression -> . expression DIVIDE expression
   2128    expression -> . NUMBER
   2129    expression -> . LPAREN expression RPAREN
   2130
   2131    NUMBER          shift and go to state 3
   2132    LPAREN          shift and go to state 2
   2133
   2134
   2135state 8
   2136
   2137    expression -> LPAREN expression . RPAREN
   2138    expression -> expression . PLUS expression
   2139    expression -> expression . MINUS expression
   2140    expression -> expression . TIMES expression
   2141    expression -> expression . DIVIDE expression
   2142
   2143    RPAREN          shift and go to state 13
   2144    PLUS            shift and go to state 6
   2145    MINUS           shift and go to state 5
   2146    TIMES           shift and go to state 4
   2147    DIVIDE          shift and go to state 7
   2148
   2149
   2150state 9
   2151
   2152    expression -> expression TIMES expression .
   2153    expression -> expression . PLUS expression
   2154    expression -> expression . MINUS expression
   2155    expression -> expression . TIMES expression
   2156    expression -> expression . DIVIDE expression
   2157
   2158    $               reduce using rule 3
   2159    PLUS            reduce using rule 3
   2160    MINUS           reduce using rule 3
   2161    TIMES           reduce using rule 3
   2162    DIVIDE          reduce using rule 3
   2163    RPAREN          reduce using rule 3
   2164
   2165  ! PLUS            [ shift and go to state 6 ]
   2166  ! MINUS           [ shift and go to state 5 ]
   2167  ! TIMES           [ shift and go to state 4 ]
   2168  ! DIVIDE          [ shift and go to state 7 ]
   2169
   2170state 10
   2171
   2172    expression -> expression MINUS expression .
   2173    expression -> expression . PLUS expression
   2174    expression -> expression . MINUS expression
   2175    expression -> expression . TIMES expression
   2176    expression -> expression . DIVIDE expression
   2177
   2178    $               reduce using rule 2
   2179    PLUS            reduce using rule 2
   2180    MINUS           reduce using rule 2
   2181    RPAREN          reduce using rule 2
   2182    TIMES           shift and go to state 4
   2183    DIVIDE          shift and go to state 7
   2184
   2185  ! TIMES           [ reduce using rule 2 ]
   2186  ! DIVIDE          [ reduce using rule 2 ]
   2187  ! PLUS            [ shift and go to state 6 ]
   2188  ! MINUS           [ shift and go to state 5 ]
   2189
   2190state 11
   2191
   2192    expression -> expression PLUS expression .
   2193    expression -> expression . PLUS expression
   2194    expression -> expression . MINUS expression
   2195    expression -> expression . TIMES expression
   2196    expression -> expression . DIVIDE expression
   2197
   2198    $               reduce using rule 1
   2199    PLUS            reduce using rule 1
   2200    MINUS           reduce using rule 1
   2201    RPAREN          reduce using rule 1
   2202    TIMES           shift and go to state 4
   2203    DIVIDE          shift and go to state 7
   2204
   2205  ! TIMES           [ reduce using rule 1 ]
   2206  ! DIVIDE          [ reduce using rule 1 ]
   2207  ! PLUS            [ shift and go to state 6 ]
   2208  ! MINUS           [ shift and go to state 5 ]
   2209
   2210state 12
   2211
   2212    expression -> expression DIVIDE expression .
   2213    expression -> expression . PLUS expression
   2214    expression -> expression . MINUS expression
   2215    expression -> expression . TIMES expression
   2216    expression -> expression . DIVIDE expression
   2217
   2218    $               reduce using rule 4
   2219    PLUS            reduce using rule 4
   2220    MINUS           reduce using rule 4
   2221    TIMES           reduce using rule 4
   2222    DIVIDE          reduce using rule 4
   2223    RPAREN          reduce using rule 4
   2224
   2225  ! PLUS            [ shift and go to state 6 ]
   2226  ! MINUS           [ shift and go to state 5 ]
   2227  ! TIMES           [ shift and go to state 4 ]
   2228  ! DIVIDE          [ shift and go to state 7 ]
   2229
   2230state 13
   2231
   2232    expression -> LPAREN expression RPAREN .
   2233
   2234    $               reduce using rule 6
   2235    PLUS            reduce using rule 6
   2236    MINUS           reduce using rule 6
   2237    TIMES           reduce using rule 6
   2238    DIVIDE          reduce using rule 6
   2239    RPAREN          reduce using rule 6
   2240</pre>
   2241</blockquote>
   2242
   2243In the file, each state of the grammar is described.  Within each state the "." indicates the current
   2244location of the parse within any applicable grammar rules.   In addition, the actions for each valid
   2245input token are listed.   When a shift/reduce or reduce/reduce conflict arises, rules <em>not</em> selected
   2246are prefixed with an !.  For example:
   2247
   2248<blockquote>
   2249<pre>
   2250  ! TIMES           [ reduce using rule 2 ]
   2251  ! DIVIDE          [ reduce using rule 2 ]
   2252  ! PLUS            [ shift and go to state 6 ]
   2253  ! MINUS           [ shift and go to state 5 ]
   2254</pre>
   2255</blockquote>
   2256
   2257By looking at these rules (and with a little practice), you can usually track down the source
   2258of most parsing conflicts.  It should also be stressed that not all shift-reduce conflicts are
   2259bad.  However, the only way to be sure that they are resolved correctly is to look at <tt>parser.out</tt>.
   2260  
   2261<H3><a name="ply_nn29"></a>5.8 Syntax Error Handling</H3>
   2262
   2263
   2264When a syntax error occurs during parsing, the error is immediately
   2265detected (i.e., the parser does not read any more tokens beyond the
   2266source of the error).  Error recovery in LR parsers is a delicate
   2267topic that involves ancient rituals and black-magic.   The recovery mechanism
    2268provided by <tt>yacc.py</tt> is comparable to Unix yacc so you may want to
   2269consult a book like O'Reilly's "Lex and Yacc" for some of the finer details.
   2270
   2271<p>
   2272When a syntax error occurs, <tt>yacc.py</tt> performs the following steps:
   2273
   2274<ol>
   2275<li>On the first occurrence of an error, the user-defined <tt>p_error()</tt> function
   2276is called with the offending token as an argument.  Afterwards, the parser enters
   2277an "error-recovery" mode in which it will not make future calls to <tt>p_error()</tt> until it
   2278has successfully shifted at least 3 tokens onto the parsing stack.
   2279
   2280<p>
   2281<li>If no recovery action is taken in <tt>p_error()</tt>, the offending lookahead token is replaced
   2282with a special <tt>error</tt> token.
   2283
   2284<p>
   2285<li>If the offending lookahead token is already set to <tt>error</tt>, the top item of the parsing stack is
   2286deleted.
   2287
   2288<p>
   2289<li>If the entire parsing stack is unwound, the parser enters a restart state and attempts to start
   2290parsing from its initial state.
   2291
   2292<p>
   2293<li>If a grammar rule accepts <tt>error</tt> as a token, it will be
   2294shifted onto the parsing stack.
   2295
   2296<p>
   2297<li>If the top item of the parsing stack is <tt>error</tt>, lookahead tokens will be discarded until the
   2298parser can successfully shift a new symbol or reduce a rule involving <tt>error</tt>.
   2299</ol>
   2300
   2301<H4><a name="ply_nn30"></a>5.8.1 Recovery and resynchronization with error rules</H4>
   2302
   2303
   2304The most well-behaved approach for handling syntax errors is to write grammar rules that include the <tt>error</tt>
   2305token.  For example, suppose your language had a grammar rule for a print statement like this:
   2306
   2307<blockquote>
   2308<pre>
   2309def p_statement_print(p):
   2310     'statement : PRINT expr SEMI'
   2311     ...
   2312</pre>
   2313</blockquote>
   2314
   2315To account for the possibility of a bad expression, you might write an additional grammar rule like this:
   2316
   2317<blockquote>
   2318<pre>
   2319def p_statement_print_error(p):
   2320     'statement : PRINT error SEMI'
   2321     print "Syntax error in print statement. Bad expression"
   2322
   2323</pre>
   2324</blockquote>
   2325
   2326In this case, the <tt>error</tt> token will match any sequence of
   2327tokens that might appear up to the first semicolon that is
   2328encountered.  Once the semicolon is reached, the rule will be
   2329invoked and the <tt>error</tt> token will go away.
   2330
   2331<p>
   2332This type of recovery is sometimes known as parser resynchronization.
   2333The <tt>error</tt> token acts as a wildcard for any bad input text and
   2334the token immediately following <tt>error</tt> acts as a
   2335synchronization token.
   2336
   2337<p>
   2338It is important to note that the <tt>error</tt> token usually does not appear as the last token
   2339on the right in an error rule.  For example:
   2340
   2341<blockquote>
   2342<pre>
   2343def p_statement_print_error(p):
   2344    'statement : PRINT error'
   2345    print "Syntax error in print statement. Bad expression"
   2346</pre>
   2347</blockquote>
   2348
   2349This is because the first bad token encountered will cause the rule to
   2350be reduced--which may make it difficult to recover if more bad tokens
   2351immediately follow.   
   2352
   2353<H4><a name="ply_nn31"></a>5.8.2 Panic mode recovery</H4>
   2354
   2355
    2356An alternative error recovery scheme is panic mode recovery, in which tokens are
   2357discarded to a point where the parser might be able to recover in some sensible manner.
   2358
   2359<p>
   2360Panic mode recovery is implemented entirely in the <tt>p_error()</tt> function.  For example, this
   2361function starts discarding tokens until it reaches a closing '}'.  Then, it restarts the 
   2362parser in its initial state.
   2363
   2364<blockquote>
   2365<pre>
   2366def p_error(p):
   2367    print "Whoa. You are seriously hosed."
   2368    # Read ahead looking for a closing '}'
   2369    while 1:
   2370        tok = yacc.token()             # Get the next token
   2371        if not tok or tok.type == 'RBRACE': break
   2372    yacc.restart()
   2373</pre>
   2374</blockquote>
   2375
   2376<p>
    2377Alternatively, this next version of <tt>p_error()</tt> simply discards the bad token and tells the parser that the error was ok.
   2378
   2379<blockquote>
   2380<pre>
   2381def p_error(p):
   2382    print "Syntax error at token", p.type
   2383    # Just discard the token and tell the parser it's okay.
   2384    yacc.errok()
   2385</pre>
   2386</blockquote>
   2387
   2388<P>
   2389Within the <tt>p_error()</tt> function, three functions are available to control the behavior
   2390of the parser:
   2391<p>
   2392<ul>
   2393<li><tt>yacc.errok()</tt>.  This resets the parser state so it doesn't think it's in error-recovery
   2394mode.   This will prevent an <tt>error</tt> token from being generated and will reset the internal
   2395error counters so that the next syntax error will call <tt>p_error()</tt> again.
   2396
   2397<p>
   2398<li><tt>yacc.token()</tt>.  This returns the next token on the input stream.
   2399
   2400<p>
   2401<li><tt>yacc.restart()</tt>.  This discards the entire parsing stack and resets the parser
   2402to its initial state. 
   2403</ul>
   2404
   2405Note: these functions are only available when invoking <tt>p_error()</tt> and are not available
   2406at any other time.
   2407
   2408<p>
   2409To supply the next lookahead token to the parser, <tt>p_error()</tt> can return a token.  This might be
   2410useful if trying to synchronize on special characters.  For example:
   2411
   2412<blockquote>
   2413<pre>
   2414def p_error(p):
   2415    # Read ahead looking for a terminating ";"
   2416    while 1:
   2417        tok = yacc.token()             # Get the next token
   2418        if not tok or tok.type == 'SEMI': break
   2419    yacc.errok()
   2420
   2421    # Return SEMI to the parser as the next lookahead token
   2422    return tok  
   2423</pre>
   2424</blockquote>
   2425
   2426<H4><a name="ply_nn32"></a>5.8.3 General comments on error handling</H4>
   2427
   2428
   2429For normal types of languages, error recovery with error rules and resynchronization characters is probably the most reliable
   2430technique. This is because you can instrument the grammar to catch errors at selected places where it is relatively easy 
   2431to recover and continue parsing.  Panic mode recovery is really only useful in certain specialized applications where you might want
   2432to discard huge portions of the input text to find a valid restart point.
   2433
   2434<H3><a name="ply_nn33"></a>5.9 Line Number and Position Tracking</H3>
   2435
   2436
   2437<tt>yacc.py</tt> automatically tracks line numbers and positions for all of the grammar symbols and tokens it processes.  To retrieve the line
   2438numbers, two functions are used in grammar rules:
   2439
   2440<ul>
   2441<li><tt>p.lineno(num)</tt>.  Return the starting line number for symbol <em>num</em>
   2442<li><tt>p.linespan(num)</tt>. Return a tuple (startline,endline) with the starting and ending line number for symbol <em>num</em>.
   2443</ul>
   2444
   2445For example:
   2446
   2447<blockquote>
   2448<pre>
   2449def p_expression(p):
   2450    'expression : expression PLUS expression'
   2451    p.lineno(1)        # Line number of the left expression
   2452    p.lineno(2)        # line number of the PLUS operator
   2453    p.lineno(3)        # line number of the right expression
   2454    ...
   2455    start,end = p.linespan(3)    # Start,end lines of the right expression
   2456
   2457</pre>
   2458</blockquote>
   2459
   2460Since line numbers are managed internally by the parser, there is usually no need to modify the line
   2461numbers.  However, if you want to save the line numbers in a parse-tree node, you will need to make your own
   2462private copy.
   2463
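<p>
For example, here is a sketch of saving a private copy of a line number on a parse-tree node
(using a <tt>BinOp</tt> node class like the one shown in the AST construction section below):

<blockquote>
<pre>
def p_expression_plus(p):
    'expression : expression PLUS expression'
    p[0] = BinOp(p[1], '+', p[3])
    p[0].lineno = p.lineno(2)      # Remember the line where PLUS appeared
</pre>
</blockquote>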
   2464<p>
   2465To get positional information about where tokens were lexed, the following two functions are used:
   2466
   2467<ul>
   2468<li><tt>p.lexpos(num)</tt>.  Return the starting lexing position for symbol <em>num</em>
   2469<li><tt>p.lexspan(num)</tt>. Return a tuple (start,end) with the starting and ending positions for symbol <em>num</em>.
   2470</ul>
   2471
   2472For example:
   2473
   2474<blockquote>
   2475<pre>
   2476def p_expression(p):
   2477    'expression : expression PLUS expression'
   2478    p.lexpos(1)        # Lexing position of the left expression
   2479    p.lexpos(2)        # Lexing position of the PLUS operator
   2480    p.lexpos(3)        # Lexing position of the right expression
   2481    ...
   2482    start,end = p.lexspan(3)    # Start,end positions of the right expression
   2483</pre>
   2484</blockquote>
   2485
    2486Note: The <tt>lexspan()</tt> function only returns the range of values up to the start of the last grammar symbol. 
   2487
   2488<H3><a name="ply_nn34"></a>5.10 AST Construction</H3>
   2489
   2490
   2491<tt>yacc.py</tt> provides no special functions for constructing an abstract syntax tree.  However, such
   2492construction is easy enough to do on your own.  Simply create a data structure for abstract syntax tree nodes
   2493and assign nodes to <tt>p[0]</tt> in each rule.
   2494
   2495For example:
   2496
   2497<blockquote>
   2498<pre>
   2499class Expr: pass
   2500
   2501class BinOp(Expr):
   2502    def __init__(self,left,op,right):
   2503        self.type = "binop"
   2504        self.left = left
   2505        self.right = right
   2506        self.op = op
   2507
   2508class Number(Expr):
   2509    def __init__(self,value):
   2510        self.type = "number"
   2511        self.value = value
   2512
   2513def p_expression_binop(p):
   2514    '''expression : expression PLUS expression
   2515                  | expression MINUS expression
   2516                  | expression TIMES expression
   2517                  | expression DIVIDE expression'''
   2518
   2519    p[0] = BinOp(p[1],p[2],p[3])
   2520
   2521def p_expression_group(p):
   2522    'expression : LPAREN expression RPAREN'
   2523    p[0] = p[2]
   2524
   2525def p_expression_number(p):
   2526    'expression : NUMBER'
   2527    p[0] = Number(p[1])
   2528</pre>
   2529</blockquote>
   2530
   2531To simplify tree traversal, it may make sense to pick a very generic tree structure for your parse tree nodes.
   2532For example:
   2533
   2534<blockquote>
   2535<pre>
   2536class Node:
   2537    def __init__(self,type,children=None,leaf=None):
   2538         self.type = type
   2539         if children:
   2540              self.children = children
   2541         else:
   2542              self.children = [ ]
   2543         self.leaf = leaf
   2544	 
   2545def p_expression_binop(p):
   2546    '''expression : expression PLUS expression
   2547                  | expression MINUS expression
   2548                  | expression TIMES expression
   2549                  | expression DIVIDE expression'''
   2550
   2551    p[0] = Node("binop", [p[1],p[3]], p[2])
   2552</pre>
   2553</blockquote>
   2554
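<p>
With a uniform structure like this, tree traversal is just a simple recursion.  Here is a sketch
of a traversal that prints the tree (it assumes every child is itself a <tt>Node</tt>):

<blockquote>
<pre>
def print_tree(node, indent=0):
    print "%s%s %s" % (" " * indent, node.type, node.leaf)
    for child in node.children:
        print_tree(child, indent + 2)
</pre>
</blockquote>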
   2555<H3><a name="ply_nn35"></a>5.11 Embedded Actions</H3>
   2556
   2557
   2558The parsing technique used by yacc only allows actions to be executed at the end of a rule.  For example,
   2559suppose you have a rule like this:
   2560
   2561<blockquote>
   2562<pre>
   2563def p_foo(p):
   2564    "foo : A B C D"
   2565    print "Parsed a foo", p[1],p[2],p[3],p[4]
   2566</pre>
   2567</blockquote>
   2568
   2569<p>
   2570In this case, the supplied action code only executes after all of the
   2571symbols <tt>A</tt>, <tt>B</tt>, <tt>C</tt>, and <tt>D</tt> have been
   2572parsed. Sometimes, however, it is useful to execute small code
   2573fragments during intermediate stages of parsing.  For example, suppose
   2574you wanted to perform some action immediately after <tt>A</tt> has
    2575been parsed. To do this, you can write an empty rule like this:
   2576
   2577<blockquote>
   2578<pre>
   2579def p_foo(p):
   2580    "foo : A seen_A B C D"
   2581    print "Parsed a foo", p[1],p[3],p[4],p[5]
   2582    print "seen_A returned", p[2]
   2583
   2584def p_seen_A(p):
   2585    "seen_A :"
   2586    print "Saw an A = ", p[-1]   # Access grammar symbol to left
   2587    p[0] = some_value            # Assign value to seen_A
   2588
   2589</pre>
   2590</blockquote>
   2591
   2592<p>
   2593In this example, the empty <tt>seen_A</tt> rule executes immediately
   2594after <tt>A</tt> is shifted onto the parsing stack.  Within this
   2595rule, <tt>p[-1]</tt> refers to the symbol on the stack that appears
   2596immediately to the left of the <tt>seen_A</tt> symbol.  In this case,
   2597it would be the value of <tt>A</tt> in the <tt>foo</tt> rule
   2598immediately above.  Like other rules, a value can be returned from an
    2599embedded action by simply assigning it to <tt>p[0]</tt>.
   2600
   2601<p>
   2602The use of embedded actions can sometimes introduce extra shift/reduce conflicts.  For example,
   2603this grammar has no conflicts:
   2604
   2605<blockquote>
   2606<pre>
   2607def p_foo(p):
   2608    """foo : abcd
   2609           | abcx"""
   2610
   2611def p_abcd(p):
   2612    "abcd : A B C D"
   2613
   2614def p_abcx(p):
   2615    "abcx : A B C X"
   2616</pre>
   2617</blockquote>
   2618
   2619However, if you insert an embedded action into one of the rules like this,
   2620
   2621<blockquote>
   2622<pre>
   2623def p_foo(p):
   2624    """foo : abcd
   2625           | abcx"""
   2626
   2627def p_abcd(p):
   2628    "abcd : A B C D"
   2629
   2630def p_abcx(p):
   2631    "abcx : A B seen_AB C X"
   2632
   2633def p_seen_AB(p):
   2634    "seen_AB :"
   2635</pre>
   2636</blockquote>
   2637
   2638an extra shift-reduce conflict will be introduced.  This conflict is caused by the fact that the same symbol <tt>C</tt> appears next in 
   2639both the <tt>abcd</tt> and <tt>abcx</tt> rules.   The parser can either shift the symbol (<tt>abcd</tt> rule) or reduce the empty rule <tt>seen_AB</tt> (<tt>abcx</tt> rule).
   2640
   2641<p>
   2642A common use of embedded rules is to control other aspects of parsing
   2643such as scoping of local variables.  For example, if you were parsing C code, you might
   2644write code like this:
   2645
   2646<blockquote>
   2647<pre>
   2648def p_statements_block(p):
    2649    "statements : LBRACE new_scope statements RBRACE"
   2650    # Action code
   2651    ...
   2652    pop_scope()        # Return to previous scope
   2653
   2654def p_new_scope(p):
   2655    "new_scope :"
   2656    # Create a new scope for local variables
   2657    s = new_scope()
   2658    push_scope(s)
   2659    ...
   2660</pre>
   2661</blockquote>
   2662
   2663In this case, the embedded action <tt>new_scope</tt> executes immediately after a <tt>LBRACE</tt> (<tt>{</tt>) symbol is parsed.   This might 
    2664adjust internal symbol tables and other aspects of the parser.  Upon completion of the <tt>p_statements_block</tt> rule, code might undo the operations performed in the embedded action (e.g., <tt>pop_scope()</tt>).
   2665
   2666<H3><a name="ply_nn36"></a>5.12 Yacc implementation notes</H3>
   2667
   2668
   2669<ul>
   2670<li>The default parsing method is LALR. To use SLR instead, run yacc() as follows:
   2671
   2672<blockquote>
   2673<pre>
   2674yacc.yacc(method="SLR")
   2675</pre>
   2676</blockquote>
   2677Note: LALR table generation takes approximately twice as long as SLR table generation.   There is no
   2678difference in actual parsing performance---the same code is used in both cases.   LALR is preferred when working
   2679with more complicated grammars since it is more powerful.
   2680
   2681<p>
   2682
   2683<li>By default, <tt>yacc.py</tt> relies on <tt>lex.py</tt> for tokenizing.  However, an alternative tokenizer
   2684can be supplied as follows:
   2685
   2686<blockquote>
   2687<pre>
   2688yacc.parse(lexer=x)
   2689</pre>
   2690</blockquote>
    2691In this case, <tt>x</tt> must be a Lexer object that minimally has an <tt>x.token()</tt> method for retrieving the next
    2692token.   If an input string is given to <tt>yacc.parse()</tt>, the lexer must also have an <tt>x.input()</tt> method.  A minimal hand-written lexer sketch appears after this list.
   2693
   2694<p>
    2695<li>By default, yacc generates tables in debugging mode (which produces the parser.out file and other output).
   2696To disable this, use
   2697
   2698<blockquote>
   2699<pre>
   2700yacc.yacc(debug=0)
   2701</pre>
   2702</blockquote>
   2703
   2704<p>
   2705<li>To change the name of the <tt>parsetab.py</tt> file,  use:
   2706
   2707<blockquote>
   2708<pre>
   2709yacc.yacc(tabmodule="foo")
   2710</pre>
   2711</blockquote>
   2712
   2713<p>
   2714<li>To change the directory in which the <tt>parsetab.py</tt> file (and other output files) are written, use:
   2715<blockquote>
   2716<pre>
   2717yacc.yacc(tabmodule="foo",outputdir="somedirectory")
   2718</pre>
   2719</blockquote>
   2720
   2721<p>
   2722<li>To prevent yacc from generating any kind of parser table file, use:
   2723<blockquote>
   2724<pre>
   2725yacc.yacc(write_tables=0)
   2726</pre>
   2727</blockquote>
   2728
   2729Note: If you disable table generation, yacc() will regenerate the parsing tables
   2730each time it runs (which may take awhile depending on how large your grammar is).
   2731
   2732<P>
   2733<li>To print copious amounts of debugging during parsing, use:
   2734
   2735<blockquote>
   2736<pre>
   2737yacc.parse(debug=1)
   2738</pre>
   2739</blockquote>
   2740
   2741<p>
   2742<li>To redirect the debugging output to a filename of your choosing, use:
   2743
   2744<blockquote>
   2745<pre>
   2746yacc.parse(debug=1, debugfile="debugging.out")
   2747</pre>
   2748</blockquote>
   2749
   2750<p>
   2751<li>The <tt>yacc.yacc()</tt> function really returns a parser object.  If you want to support multiple
   2752parsers in the same application, do this:
   2753
   2754<blockquote>
   2755<pre>
   2756p = yacc.yacc()
   2757...
   2758p.parse()
   2759</pre>
   2760</blockquote>
   2761
   2762Note: The function <tt>yacc.parse()</tt> is bound to the last parser that was generated.
   2763
   2764<p>
   2765<li>Since the generation of the LALR tables is relatively expensive, previously generated tables are
   2766cached and reused if possible.  The decision to regenerate the tables is determined by taking an MD5
   2767checksum of all grammar rules and precedence rules.  Only in the event of a mismatch are the tables regenerated.
   2768
   2769<p>
    2770It should be noted that table generation is reasonably efficient, even for grammars that involve around 100 rules
   2771and several hundred states.  For more complex languages such as C, table generation may take 30-60 seconds on a slow
   2772machine.  Please be patient.
   2773
   2774<p>
   2775<li>Since LR parsing is driven by tables, the performance of the parser is largely independent of the
   2776size of the grammar.   The biggest bottlenecks will be the lexer and the complexity of the code in your grammar rules.
   2777</ul>
   2778
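<p>
As mentioned in the notes above, <tt>yacc.parse()</tt> can be driven by any object that provides
a <tt>token()</tt> method (and an <tt>input()</tt> method if a string is supplied).  Here is a minimal
hand-written lexer sketch (the class names and token layout are purely illustrative, not part of PLY);
the token objects only need <tt>type</tt>, <tt>value</tt>, <tt>lineno</tt>, and <tt>lexpos</tt> attributes:

<blockquote>
<pre>
class Tok(object):
    def __init__(self, type, value, lineno=0, lexpos=0):
        self.type, self.value = type, value
        self.lineno, self.lexpos = lineno, lexpos

class MyLexer(object):
    def input(self, data):
        # Whitespace-separated input: digits become NUMBER, '+' becomes PLUS
        self.toks = [Tok('NUMBER', int(w)) if w.isdigit() else Tok('PLUS', w)
                     for w in data.split()]
    def token(self):
        return self.toks.pop(0) if self.toks else None

result = yacc.parse("1 + 2 + 3", lexer=MyLexer())
</pre>
</blockquote>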
   2779<H2><a name="ply_nn37"></a>6. Parser and Lexer State Management</H2>
   2780
   2781
   2782In advanced parsing applications, you may want to have multiple
   2783parsers and lexers.  Furthermore, the parser may want to control the
   2784behavior of the lexer in some way.
   2785
   2786<p>
   2787To do this, it is important to note that both the lexer and parser are
   2788actually implemented as objects.   These objects are returned by the
   2789<tt>lex()</tt> and <tt>yacc()</tt> functions respectively.  For example:
   2790
   2791<blockquote>
   2792<pre>
   2793lexer  = lex.lex()       # Return lexer object
   2794parser = yacc.yacc()     # Return parser object
   2795</pre>
   2796</blockquote>
   2797
    2798To attach the lexer and parser together, make sure you use the <tt>lexer</tt> argument to <tt>parse()</tt>.  For example:
   2799
   2800<blockquote>
   2801<pre>
   2802parser.parse(text,lexer=lexer)
   2803</pre>
   2804</blockquote>
   2805
   2806Within lexer and parser rules, these objects are also available.  In the lexer,
   2807the "lexer" attribute of a token refers to the lexer object in use.  For example:
   2808
   2809<blockquote>
   2810<pre>
   2811def t_NUMBER(t):
   2812   r'\d+'
   2813   ...
   2814   print t.lexer           # Show lexer object
   2815</pre>
   2816</blockquote>
   2817
   2818In the parser, the "lexer" and "parser" attributes refer to the lexer
   2819and parser objects respectively.
   2820
   2821<blockquote>
   2822<pre>
   2823def p_expr_plus(p):
   2824   'expr : expr PLUS expr'
   2825   ...
   2826   print p.parser          # Show parser object
   2827   print p.lexer           # Show lexer object
   2828</pre>
   2829</blockquote>
   2830
   2831If necessary, arbitrary attributes can be attached to the lexer or parser object.
   2832For example, if you wanted to have different parsing modes, you could attach a mode
   2833attribute to the parser object and look at it later.
   2834
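<p>
For example, here is a sketch of such a mode attribute (the <tt>mode</tt> attribute and the strict
check are purely an application-level convention, not part of PLY):

<blockquote>
<pre>
def p_statement_expr(p):
    'statement : expression SEMI'
    if p.parser.mode == "strict":
        # Perform extra application-specific checking here
        pass
    p[0] = p[1]

# Build the parser, then attach the attribute before parsing
parser = yacc.yacc()
parser.mode = "strict"
result = parser.parse(text, lexer=lexer)
</pre>
</blockquote>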
   2835<H2><a name="ply_nn38"></a>7. Using Python's Optimized Mode</H2>
   2836
   2837
   2838Because PLY uses information from doc-strings, parsing and lexing
   2839information must be gathered while running the Python interpreter in
   2840normal mode (i.e., not with the -O or -OO options).  However, if you
   2841specify optimized mode like this:
   2842
   2843<blockquote>
   2844<pre>
   2845lex.lex(optimize=1)
   2846yacc.yacc(optimize=1)
   2847</pre>
   2848</blockquote>
   2849
   2850then PLY can later be used when Python runs in optimized mode. To make this work,
   2851make sure you first run Python in normal mode.  Once the lexing and parsing tables
   2852have been generated the first time, run Python in optimized mode. PLY will use
   2853the tables without the need for doc strings.
   2854
   2855<p>
   2856Beware: running PLY in optimized mode disables a lot of error
   2857checking.  You should only do this when your project has stabilized
   2858and you don't need to do any debugging.
   2859  
   2860<H2><a name="ply_nn39"></a>8. Where to go from here?</H2>
   2861
   2862
   2863The <tt>examples</tt> directory of the PLY distribution contains several simple examples.   Please consult a
    2864compilers textbook for the theory and underlying implementation details of LR parsing.
   2865
   2866</body>
   2867</html>
   2868
   2869
   2870
   2871
   2872
   2873
   2874