A LALR(1) parser is a pushdown automaton for parsing computer languages. LALR parser generators like NYACC have been around since the 1970s. In NYACC the automaton, along with its auxiliary parameters (e.g., actions), is called a machine. The grammar is called the specification. The program that, driven by the machine, processes input tokens to generate a final output, or an error, is the parser.
A more detailed explanation of how to use an LALR parser generator can be found in the Bison manual (see References).
An easy way to introduce you to NYACC is to work through a simple
example. Consider the following. A similar example is in the file
calc.scm in the subdirectory examples/nyacc/lang/calc/
of the source distribution. (To execute the file you need to source
the file env.sh in the top-level sub-directory examples.)
(use-modules (nyacc lalr))
(use-modules (nyacc lex))
(use-modules (nyacc parse))

(define (next) (newline) (display "> ") (force-output))

(define spec
  (lalr-spec
   (prec< (left "+" "-") (left "*" "/"))
   (start prog)
   (grammar
    (prog (stmt-list))
    (stmt-list (stmt) (stmt-list "\n" stmt))
    (stmt ($empty ($$ (next)))
          (expr ($$ (display $1) (next))))
    (expr (expr "+" expr ($$ (+ $1 $3)))
          (expr "-" expr ($$ (- $1 $3)))
          (expr "*" expr ($$ (* $1 $3)))
          (expr "/" expr ($$ (/ $1 $3)))
          ($fixed ($$ (string->number $1)))
          ($float ($$ (string->number $1)))
          ($ident ($$ (module-ref (current-module) (string->symbol $1))))
          ("(" expr ")" ($$ $2))))))

(define mach (make-lalr-machine spec))
(define mtab (lalr-match-table mach))
(define gen-lexer (make-lexer-generator mtab))
(define raw-parse (make-lalr-parser mach))
(define (parse) (raw-parse (gen-lexer)))

(parse)
Here is an explanation of the above code:
use-modules
This Guile syntax imports the NYACC modules used in the example.
next
This procedure is used to generate a prompt for the user.
lalr-spec
This NYACC syntax is used to generate a language specification from the grammar and options provided in the form.
prec<
This directive indicates that the tokens appearing in the sequence of associativity directives should be interpreted in increasing order of precedence. The left associativity statements indicate that the tokens have left associativity. So, in this grammar +, -, *, and / are left associative; * and / have equal precedence; + and - have equal precedence; and * and / have higher precedence than + and -.
start
This directive indicates which left-hand symbol in the grammar is the starting symbol for the grammar.
grammar
This directive is used to specify the production rules. Non-terminals are indicated by symbols (e.g., expr). Terminals, or tokens, are indicated by strings (e.g., "+"), character literals (e.g., #\+), quoted symbols (e.g., '+) or NYACC reserved symbols, which always begin with $. Reserved symbols used in this example are $fixed and $float. Note that tokens or terminals do not need to be declared as in Bison or in the Guile module (system base lalr).
The reserved symbols $fixed and $float indicate an unsigned integer and a floating point number, respectively. The NYACC procedures for generating lexical analyzers will emit these tokens when the corresponding numbers are detected in the input.
$$
The $$ form is used to specify an action associated with the rule. Ordinarily the action appears as the last element of a right-hand side, but mid-rule actions are possible. Inside the $$ form, the variables $1, $2, etc. refer to the semantic value of the corresponding item in the right-hand side of the production rule. See Actions for more details on how those actions are evaluated.
The result of lalr-spec is an association list (a-list); you can peek at the internals using typical Scheme procedures for a-lists.
The variable mach is defined using the procedure make-lalr-machine, which returns an association list. This procedure does the bulk of the work needed to generate an LALR(1) parser. Separate procedures can be called to further compact the tables and hash the automaton. See Hashing and Compacting.
The variable mtab is defined to be the machine's match table.
raw-parse
takes one required argument, a lexical analyzer procedure, and optional keyword arguments.
(define gen-lexer (make-lexer-generator mtab))
Note that the above returns a generator for lexical analyzers: lexical analysis for a call to the parser may require maintaining internal state (e.g., line number, mode). (Sorry, we now deviate from being purely functional. This could be fixed by using continuation passing style, but I digress.) The procedure make-lexer-generator is imported from the module (nyacc lex). Optional arguments to make-lexer-generator allow the user to specify custom readers for identifiers, comments, numbers, etc. See The Lex Module.
(next)
(parse)
The lexical analyzer reads code from (current-input-port). If we want to run this on a string we would need to use with-input-from-string or equivalent. See (guile) Input and Output.
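For instance, here is a minimal sketch reusing the definitions above (the name parse-string is our own, not part of the example file):

(define (parse-string str)
  (with-input-from-string str
    (lambda () (raw-parse (gen-lexer)))))

(parse-string "1 + 2\n")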
$ guile calc.scm
...
> 1 + 1
2
>
If we execute the example file above we should get the following:
$ guile calc1.scm
2 + 2 => 4
$
One of the many cool features of Guile is that it provides a backend infrastructure for evaluation of multiple frontend languages. The files mach.scm, parser.scm and compiler.scm in the examples/nyacc/lang/calc directory and spec.scm in the examples/language/calc directory implement our calculator within this Guile infrastructure. You can run the calculator if you have sourced the env.sh file as described above.
$ guile
...
scheme@(guile-user)> ,L calc
...
Happy hacking with calc!  To switch back, type `,L scheme'.
calc@(guile-user)> (2 + 2)/(1 + 1)
2
calc@(guile-user)>
The evaluator uses SXML as the intermediate representation between the parser and compiler, which generates to Tree-IL. See also the example in the directory examples/language/javascript and examples/nyacc/lang/javascript directories. For details on the Guile backend see
The parser can provide debugging output with the appropriate keyword argument. In calc1.scm there is a modified version of calc1-eval which will print out debugging info:
(define (calc1-eval str)
  (with-input-from-string str
    (lambda () (raw-parser (gen-lexer) #:debug #t))))
To make use of this info you probably want to generate an output file as described in Section Human Readable Output, which provides context for the debugging output. The output looks like
state 0, token "2" => (shift . 3)
state 3, token "+" => (reduce . 5)
state 0, token expr => (shift . 4)
state 4, token "+" => (shift . 5)
state 5, token "2" => (shift . 3)
state 3, token #<eof> => (reduce . 5)
state 5, token expr => (shift . 14)
state 14, token #<eof> => (reduce . 1)
state 0, token expr => (shift . 4)
state 4, token #<eof> => (accept . 0)
2 + 2 => 4
Most of the syntax and procedures for generating skeleton parsers are exported from the module (nyacc lalr). Other modules include:

(nyacc lex)
This module provides procedures for generating lexical analyzers.
(nyacc util)
This module provides utilities used by the other modules.
The syntax for generating specifications is lalr-spec. As mentioned in the previous chapter, this syntax reads a grammar in a Scheme-like syntax and returns an association list, or a-list. The returned a-list is normally used as input for make-lalr-machine. The syntax is of the form
(lalr-spec (specifier …) …)
The order of the specifiers does not matter, but typically the grammar specifier occurs last. The specifiers are:
notice
This is used to push a comment string (e.g., copyright) into the resulting parser tables.
reserve
This is a list of tokens which do not appear in the grammar but should be added to the match table.
prec<, prec>
These specifiers declare the precedence and associativity of symbols.
expect
This specifies the expected number of shift-reduce conflicts.
start
This specifies the top-level starting non-terminal.
alt-start
This specifies alternative start symbols used with restart-spec. Its use prevents warning messages.
grammar
This specifies the grammar; see below.
The notice specifier allows one to provide a comment that will be carried into generated output files (e.g., parse tables generated by write-lalr-tables). For example, if the spec looks like
(define spec
  (lalr-spec
   (notice "last edit: Mar 25, 2015")
   …))
and one generates parse tables from the machine with write-lalr-tables, then the resulting file will look like
;; calctab.scm
;; last edit: Mar 25, 2015
(define calc-len-v
  #(1 1 …
…
The notice is available programmatically as well; the associated procedure prints the notice to the port, if specified, or to (current-output-port).
The reserve specifier allows one to include tokens in the match table that do not appear in the grammar.
In the javascript parser we have added reserved keywords:
(reserve "abstract" "boolean" "byte" "char" "class" "const" …)
and this results in a generated match table (for a hashed machine) that looks like:
… ("abstract" . 86) ("boolean" . 87) ("byte" . 88) ("char" . 89) ("class" . 90) ("const" . 91) …
Recall the following specifier from the calc1 example:
(prec< (left "+" "-") (left "*" "/"))
This declaration indicates precedence among the math operators. The < in prec< indicates that the precedence is in increasing order. An equivalent specification would be as follows:
(prec> (left "*" "/") (left "+" "-"))
The precedence specification can be used along with $prec in the grammar to resolve shift-reduce or reduce-reduce conflicts. The classical case is the if-then-else construct in C, where a conflict occurs on the input
if (expr1) if (expr2) bar(); else baz();
The above could be interpreted as
if (expr1) { if (expr2) bar(); } else baz();
or as
if (expr1) { if (expr2) bar(); else baz(); }
hence a conflict. The language specification indicates the latter, so the parser should shift. This rule can be specified in a C parser as follows:
(lalr-spec
 …
 (prec< 'then "else")   ; "then/else" SR-conflict resolution
 …
 (grammar
  …
  (selection-statement
   ("if" "(" expression ")" statement ($prec 'then)
    ($$ `(if ,$3 ,$5)))
   ("if" "(" expression ")" statement "else" statement
    ($$ `(if ,$3 ,$5 ,$7))))
  …
It is important to note here that we use a quoted symbol 'then rather than a string "then" as the dummy token. Had we used "then" as the dummy, the lex’er would return the associated token when "then" appears in the input, and the C declaration int then(int); would produce a syntax error.
There are default rules for handling shift-reduce conflicts. If you can live with these, it is possible to inhibit the error messages by using the expect specifier, which takes as its argument the expected number of shift-reduce conflicts:
(lalr-spec (expect 3) …
The grammar is a list of production rules. Each production rule takes the form
(lhs (rhs1 …) (rhs2 …) …)
where lhs, the left-hand side, is a non-terminal represented as a Scheme identifier. Each right-hand side is a list of non-terminals, terminals, actions or proxies, represented by Scheme identifiers, Scheme constants, $$-expressions or proxy expressions, respectively. The terminals can be Scheme strings, character constants or quoted symbols, but not numbers. For example, the following is the production rule for a C99 additive expression:
(add-expr
 (mul-expr)
 (add-expr "+" mul-expr ($$ `(add ,$1 ,$3)))
 (add-expr "-" mul-expr ($$ `(sub ,$1 ,$3))))
Here add-expr and mul-expr are non-terminals, "+" and "-" are terminals, and ($$ `(add ,$1 ,$3)) and ($$ `(sub ,$1 ,$3)) are actions. In the actions, $1 refers to the semantic value of the term add-expr.
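So, for example, the input 1 + 2 - 3 would build the nested semantic value below (a sketch, assuming the lexer's $fixed values are passed through as strings):

(sub (add "1" "2") "3")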
All symbols starting with $ are reserved; unused reserved symbols will likely not signal an error. The following reserved symbols are in use:
$prec
This symbol is used in the right hand side of a production rule to indicate precedence (e.g., ($prec 'foo)).
$error
This symbol is used in the right hand side of a production rule to indicate the rule is an error. The associated parser will abort on this rule.
$empty
This symbol can be used in the right hand side of a production to indicate it has no terms.
$ident
This is emitted by the lexical analyzer to indicate an identifier.
$fixed
This is emitted by the lexical analyzer to indicate an unsigned integer.
$float
This is emitted by the lexical analyzer to indicate an unsigned floating point number.
$string
This is emitted by the lexical analyzer to indicate a string.
$code-comm
This is emitted by the lexical analyzer to indicate a comment starting after code appearing on a line.
$lone-comm
This is emitted by the lexical analyzer to indicate a comment starting on a line without preceding code.
$$, $$-ref, $$/ref
These define an action in the right-hand side of a production. They have the forms
($$ body)
($$-ref 'rule12)
($$/ref 'rule12 body)
In an action, body is a Scheme expression, which can refer to the semantic values via the special variables $1, $2, … See Actions for more details on how body is evaluated. The ref forms are used to provide references, for future use, to support other (non-Scheme) languages, where the parser will be equipped to execute reduce actions by reference (e.g., via an associative array).
$1, $2, ...
These appear as arguments to user-supplied actions and will appear in the body shown above. The variables reference the semantic values of right-hand-side symbols of a production rule. Note that mid-rule actions count here, so
(lhs (l-expr ($$ (gen-op)) r-expr ($$ (list $2 $1 $3))))
generates a list from the return value of (gen-op) followed by the semantic value associated with l-expr and then r-expr.
$?, $*, $+
These are (experimental) macros used for grammar specification.
($? foo bar baz) => ``foo bar baz'' occurs never or once
($* foo bar baz) => ``foo bar baz'' occurs zero or more times
($+ foo bar baz) => ``foo bar baz'' occurs one or more times
However, these have hardcoded actions and are considered to be, in current form, unattractive for practical use.
In addition, the following reserved symbols may appear in output files:
$chlit
This is emitted by the lexical analyzer to indicate a character literal.
$start
This is used in the machine specification to indicate the production rule for starting the grammar.
$end
This is emitted by the lexical analyzer to indicate end of input, and appears in the machine to catch the end of input.
$P1, $P2, …
Symbols of the form $P1, $P2, … are used as symbols for proxy productions (e.g., for mid-rule actions). For example, the production rule
(lhs (ex1 ($$ (gen-x)) ex2 ex3 ($$ (list $1 $2 $3 $4))))
will result in the internal p-rules
(lhs (ex1 $P1 ex2 ex3 ($$ (list $1 $2 $3 $4))))
($P1 ($empty ($$ (gen-x))))
$default
This is used in the generated parser to indicate a default action.
A grammar rule’s right-hand side can contain actions:
($$ body)
Here body is a Scheme expression which can refer to the values matched by the rule’s right-hand side via the variables $1, $2, … This expression is evaluated in the top-level environment of the module from which make-lalr-parser is called, so it will “see” the bindings defined there.
When actions appear within a production rule (known as mid-rule actions) they count as right-hand side items, and thus the result of their evaluation is bound to the corresponding $n variable, available for the end-of-rule action.
The grammar specification allows the user to handle some syntax errors so that parsing can continue; the behavior is similar to parser generators like yacc or bison. The following production rule-list allows the user to trap an error.
(line ("\n") (exp "\n") ($error "\n"))
If the current input token does not match the grammar, then the parser
will skip input tokens until a "\n"
is read. The default
behavior is to generate an error message: "syntax error".
To provide a user-defined handler just add an action for the rule:
(line ("\n") (exp "\n") ($error "\n" ($$ (format #t "line error\n"))))
Note that if the action is not at the end of the rule then the default recovery action ("syntax error") will be executed.
Say you have a NYACC specification cspec for the C language and you want to generate a machine for parsing C expressions. You can do this using restart-spec:
(define cxspec (restart-spec cspec 'expression))
(define cxmach (make-lalr-machine cxspec))
Given a specification generated by lalr-spec, this procedure generates an a-list which contains the data required to implement a parser, generated with, for example, make-lalr-parser.
The generated a-list includes the following keys:
pat-v
a vector of parse action procedures
ref-v
a vector of parse action references (for supporting other languages)
len-v
a vector of p-rule lengths
rto-v
a vector of lhs symbols (“reduce to” symbols)
lhs-v
a vector of left hand side symbols
rhs-v
a vector of vectors of right hand side symbols
kis-v
a vector of itemsets
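Since the machine is itself an a-list, these entries can be inspected with ordinary a-list accessors. A minimal sketch, reusing the mach built in the demonstration example:

(use-modules (nyacc lalr))
;; mach was built with (make-lalr-machine spec)
(vector-length (assq-ref mach 'len-v)) ; number of production rules
(vector-ref (assq-ref mach 'rto-v) 1)  ; "reduce to" symbol of p-rule 1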
The lexical analyzer returns tokens to the parser. The parser executes state transitions based on these tokens. When we build a lexical analyzer (via make-lexer) we provide a list of strings to detect along with associated tokens to return to the parser. By default the tokens returned are symbols or characters, but these could as well be integers. Also, the parser uses symbols to represent non-terminals, which are also used to trigger state transitions. We could use integers instead of symbols and characters by mapping the tokens to integers; this makes the parser more efficient. The tokens we are talking about include:
executes state transitions based on these tokens. When we build a
lexical analyzer (via make-lexer
) we provide a list of strings
to detect along with associated tokens to return to the parser. By
default the tokens returned are symbols or characters. But these
could as well be integers. Also, the parser uses symbols to represent
non-terminals, which are also used to trigger state transitions. We
could use integers instead of symbols and characters by mapping the
tokens to integers. This makes the parser more efficient. The tokens
we are talking about include
the $end marker
$ident
$fixed (unsigned integers)
$float (unsigned floats)
$accept
$default
$error
For the hash table we use positive integers for terminals and negative integers for non-terminals. To apply such a hash table we use the procedure hashify-machine, described below.
Note that the parser is hardcoded to assume that the phony token for the default (reduce) action is '$default for an unhashed machine or 1 for a hashed machine.
The actions that the parser executes based on these are shift, reduce and accept. When a shift occurs, the parser must transition to a new state. We can encode the parser transitions using integers.
The NYACC procedure to convert a machine to a hashed machine is hashify-machine.
Convert the machine to use integers instead of symbols. The match table will change from
("abc" . 'abc)
to
("abc" . 2)
and the pax will change from
("abc" . (reduce . 1))
to
(2 . (reduce . 1))
Indicate if the machine has been hashed.
A "filter" to compact the parse table. For each state this will replace
the most populus set of reductions of the same production rule with a
default production. However, reductions triggered by user-specified keepers
and the default keepers – '$error
, '$end
, '$lone-comm
and '$lone-comm
are not counted. The parser will want to treat
errors and comments separately so that they can be trapped (e.g.,
unaccounted comments are skipped).
In some parser generators one declares terminals in the grammar file
and the generator will provide an include file providing the list of
terminals along with the associated “hash codes”. In NYACC the
terminals are detected in the grammar as non-identifiers: strings
(e.g., "for"
), symbols (e.g., '$ident
) or characters
(e.g., #\+
). The machine generation phase of the parser
generates a match table which is an a-list of these objects along with
the token code. These codes are what the lexical analyzer should return.
In the end we have a match table that maps these objects to token values; this table is used by the lexical analyzer generator (e.g., make-lexer-generator). In the case of keywords the keys will be strings (e.g., "for"), whereas in the case of special items, processed in the lexical analyzer by readers (e.g., read-num), the keys will be symbols (e.g., '$float).
The symbol for the default transition is '$default. For hashified machines this is translated to the integer 1.
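One way to see what the lexical analyzer is expected to return is to probe the match table directly. A sketch using the calc example's mtab (the exact values depend on whether the machine was hashified):

(assoc-ref mtab "+")     ; a symbol/char for an unhashed machine, an integer otherwise
(assoc-ref mtab '$fixed) ; token the lexer should emit for unsigned integers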
The lex
module provides a set of procedures to build lexical
analyzers. The approach is to first build a set of readers for
different types of tokens (e.g., numbers, identifiers, character
sequences) and then process input characters (or code points) through
the procedures. The signature of most readers is the following:
(reader ch) => #f | (type . value)
If the reader fails to read a token then #f is returned. If the reader reads more characters from input and fails, then it will push back the characters. So, the basic structure of a lexical analyzer is
(lambda ()
  (let iter ((ch (read-char)))
    (cond
     ((eof-object? ch) '($end . ""))
     ((whitespace-reader ch) (iter (read-char)))
     ((comment-reader ch) (iter (read-char)))
     ((number-reader ch))
     ((keyword-reader ch))
     ((ident-reader ch))
     …
     (else (error)))))
The types of readers used are:
ident-reader
reads an identifier
num-reader
reads a number
string-reader
reads a string literal
chlit-reader
reads a character literal
comm-reader
reads a comment
comm-skipper
same as comm-reader, but the token is thrown away
chseq-reader
a reader for a sequence of characters (e.g., +=)
Note that some of our parsers (e.g., the C99 parser) are crafted to keep some comments in the output syntax tree. So comments may be passed to the parser or skipped, hence the “skipper”.
The Lex Module does not provide lexical analyzers (lex’ers), but lexical analyzer generator generators. The rationale behind this is as follows. A lexical analyzer may have state (e.g., beginning of line state for languages where newline is not whitespace). In addition, our generator uses a default set of readers, but allows the caller to specify other readers. Or, if the user prefers, lex’ers can be rolled from provided readers. Now we introduce our lex’er generator generator:
Returns a lex’er generator from the match table and options. The options are
#:ident-reader reader
Use the provided reader for reading identifiers. The default is a C language ident reader, generated from
(make-ident-reader c:if c:ir)
#:num-reader reader
Use the provided number reader.
#:string-reader reader
Use the provided reader for string literals.
#:chlit-reader reader
Use the provided character literal reader. The default is for C; so, for example, the letter ‘a’ is represented as 'a'.
#:comm-reader reader
Use the provided comment reader to pass comments to the parser.
#:comm-skipper reader
Use the provided comment reader, but throw the token away. The default for this is #f.
space-chars string
not a reader but a string containing the whitespace characters (fix this)
(define gen-lexer (make-lexer-generator mtab #:ident-reader my-id-rdr))
(with-input-from-file "foo"
  (lambda () (raw-parse (gen-lexer))))
(Minor note: The ident-reader will be used to read ident-like keywords from the match table.)
This routine will generate a reader to skip whitespace.
If ch is C whitespace, skip all spaces, then return #t, else return #f.
For identifiers, given the char-set for the first character and the char-set for following characters, return a reader for identifiers. The reader takes a character as input and returns #f or a string. This will generate an exception on #<eof>. On success the input pointer is left at the character following the identifier; on failure (#f) ch remains the last character read.
Generate a predicate, from a reader, that determines if a string qualifies as an identifier.
Determine if a string qualifies as a C identifier.
Generate a reader that uses delim as the delimiter for strings.
TODO: need to handle matlab-type strings.
TODO: need to handle multiple delim’s (like python)
Read octal number.
Read a C-code string. Output to code is write, not display.
Return #f if ch is not ". This reader does not yet read trigraphs.
Generate a reader for character literals. NOT DONE.
For C, this reads 'c' or '\n'.
... 'c' ... => (read-c-chlit #\') => '($ch-lit . #\c)
Generates a procedure to read C numbers, where the output is of the form #f, ($fixed . "1") or ($float . "1.0").
This routine will clean up floats by adding "0" before or after the dot.
Convert a C number-string (e.g., 0x123LL) to a Scheme number-string (e.g., #x123).
Reader for unsigned numbers as used in C (or close to it).
Given an alist of pairs (string . token), return a function that eats chars until (token . string) is returned, or #f if no match is found.
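A sketch of building and using such a reader, assuming the generator is the make-chseq-reader procedure exported by (nyacc lex):

(use-modules (nyacc lex))
(define read-chseq
  (make-chseq-reader '(("+" . plus) ("+=" . add-assign))))
;; With "+=" waiting on the input port:
(read-chseq #\+) ; => (add-assign . "+="), matching the longest sequence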
ch bol -> ('$code-comm "..") | ('$lone-comm "..") | #f
comm-table is a list of cons cells giving (start . end) comment delimiters, e.g., ("--" . "\n") ("/*" . "*/").
Test with "/* hello **/".
If eat-newline is specified as true, then for read comments ending with a newline, the newline is swallowed with the comment. Note: this assumes backslash is never part of the end marker.
The following routines are provided for rolling your own lexical analyzer generator. An example is provided in the file examples/nyacc/lang/matlab.
Filter match-table based on cars of al.
Remove match-table based on cars of al.
Map cars of al.
For test and debug, this procedure will evaluate a reader on a string.
A reader is a procedure that accepts a single character argument, intended to match a specific character sequence. A reader will read more characters by evaluating read-char until it matches or fails. If it fails, it will push back all characters read via read-char and return #f. If it succeeds the input pointer will be at the position following the last matched character.
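A minimal sketch of a hand-rolled reader that follows this protocol; the procedure name and the $arrow token are hypothetical:

;; Match the two-character sequence "=>".
(define (read-arrow ch)
  (if (char=? ch #\=)
      (let ((next (read-char)))
        (if (eqv? next #\>)
            '($arrow . "=>")
            (begin
              (unless (eof-object? next) (unread-char next)) ; push back on failure
              #f)))
      #f))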
Sometimes LALR(1) parsers must be equipped with methods to parse non-context-free grammars. With respect to typenames, C is not context free. Consider the following example.
typedef int foo_t; foo_t x;
The lexical analyzer must identify the first occurrence of foo_t as an identifier and the second occurrence of foo_t as a typename. This can be accomplished by keeping a list of typenames in the parent environment of the parser and lexical analyzer. In the parser, when the first statement is parsed, an action could declare foo_t to now be a typename. In the lexical analyzer, as tokens that look like identifiers are parsed they are checked against the list of typenames; if a match is found, 'typename is returned, otherwise $ident is returned.
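A minimal sketch of this handshake; the names typenames, add-typename! and ident->token are our own, not NYACC API:

(define typenames '())                ; shared by parser and lexer

(define (add-typename! name)          ; called from a parser action
  (set! typenames (cons name typenames)))

(define (ident->token str)            ; called from the lexer
  (if (member str typenames)
      (cons 'typename str)
      (cons '$ident str)))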
Another example of this handshaking is used in the JavaScript parser.
The language allows newline as a statement terminator, but it must
be prevented in certain places, for example between ++
and
an expression in the post-increment operator. We handle this using
a mid-rule action to tell the lexer to skip newline if that is the
next token.
(LeftHandSideExpression ($$ (NSI)) "++" ($$ `(post-inc $1)))
The procedure NSI in the lex’er is as follows:
(define (NSI) ;; no semicolon insertion
  (fluid-set! *insert-semi* #f))
and the newline reader in the lex’er acts as follows:
…
((eqv? ch #\newline)
 (if (fluid-ref *insert-semi*)
     (cons semicolon ";")
     (iter (read-char))))
…
Note that generating a parser requires a machine argument. It is possible to export the machine to a pair of files and later regenerate enough info to create a parser from the tables saved in the machine.
For example, constant tables for a machine can be generated using NYACC procedures as follows:
(write-lalr-actions calc-mach "calc-act.scm" #:prefix "calc-")
(write-lalr-tables calc-mach "calc-tab.scm" #:prefix "calc-")
This saves the variable definition for calc-act-v to the file calc-act.scm and variable definitions for the following to the file calc-tab.scm:
calc-len-v
calc-pat-v
calc-rto-v
calc-mtab
calc-tables
The variable calc-act-v is a vector of procedures associated with each of the production rules, to be executed as the associated production rule is reduced in parsing. The procedures can be modified without editing the grammar file and re-executing lalr-spec and make-lalr-machine.
To create a parser from the generated tables, one can write the following code:
(include "calc-tab.scm") (include "calc-act.scm") (define raw-parser (make-lalr-parser (acons 'act-v calc-act-v calc-tables)))
See the example code in examples/nyacc/lang/calc/parser.scm for more detail.
The NYACC procedure compact-machine will compact the parse tables generated by make-lalr-machine. That is, if multiple tokens generate the same transition, then these will be combined into a single default transition.
Ordinarily NYACC will expect symbols to be emitted from the lexical analyzer. To use integers instead, use the procedure hashify-machine. One can, of course, use both procedures:
(define calc-mach
  (compact-machine
   (hashify-machine
    (make-lalr-machine calc-spec))))
Indicate if the machine has been compacted.
NYACC provides routines for exporting NYACC grammar specifications to other LALR parser generators.
The Bison exporter uses the following rules:
"for"
is
converted to FOR
.
"+"
is converted to '+'
.
#\!
is converted to '!'
.
ChSeq_i_j_k
where
i, j and k are decimal representations of the character
code. For example "+="
is converted to ChSeq_43_61
.
$
and -
are replaced with _
.
This functionality has not been worked on for a while, so I suspect there is a chance it no longer works.
The provided parsers are able to generate debugging information.
You can generate text files which provide human-readable forms of the grammar specification and resulting automaton, akin to what you might get with bison using the ‘-r’ flag.
(with-output-to-file "calc.out" (lambda () (pp-lalr-grammar calc1-mach) (pp-lalr-machine calc1-mach)))
The above code will generate something that looks like
 0 $start => prog
 1 prog => stmt-list
 2 stmt-list => stmt
 3 stmt-list => stmt-list stmt
 4 stmt => "\n"
 5 stmt => expr "\n"
 6 stmt => assn "\n"
 7 expr => expr "+" expr
 8 expr => expr "-" expr
 9 expr => expr "*" expr
10 expr => expr "/" expr
11 expr => '$fixed
12 expr => '$float
13 expr => '$ident
14 expr => "(" expr ")"
15 assn => '$ident "=" expr

0:  $start => . prog
    prog => . stmt-list
    stmt-list => . stmt
    stmt-list => . stmt-list stmt
    stmt => . "\n"
    stmt => . expr "\n"
    stmt => . assn "\n"
    expr => . expr "+" expr
    expr => . expr "-" expr
    expr => . expr "*" expr
    expr => . expr "/" expr
    expr => . '$fixed
    expr => . '$float
    expr => . '$ident
    expr => . "(" expr ")"
    assn => . '$ident "=" expr
      "(" => shift 1
      '$ident => shift 2
      '$float => shift 3
      '$fixed => shift 4
      assn => shift 5
      expr => shift 6
      "\n" => shift 7
      stmt => shift 8
      stmt-list => shift 9
      prog => shift 10
...
26: expr => expr . "/" expr
    expr => expr . "*" expr
    expr => expr . "-" expr
    expr => expr . "+" expr
    expr => expr "+" expr .
      "*" => shift 15
      "/" => shift 16
      $default => reduce 7
      ["+" => shift 13] REMOVED by associativity
      ["-" => shift 14] REMOVED by associativity
      ["*" => reduce 7] REMOVED by precedence
      ["/" => reduce 7] REMOVED by precedence
The parsers generated by NYACC accept an optional keyword argument #:debug. When #t is passed for this argument the parser will display shift and reduce actions as it executes. The calculator demo under the examples directory has an option (in calc.scm) to do this. Here is what the output looks like (with the displayed result edited out).
> 1 + 2
state 0, token "1" => shift, goto 4
state 4, token "+" => reduce 11
state 0, token expr => shift, goto 6
state 6, token "+" => shift, goto 11
state 11, token "2" => shift, goto 4
state 4, token "\n" => reduce 11
state 11, token expr => shift, goto 23
state 23, token "\n" => reduce 7
state 0, token expr => shift, goto 6
state 6, token "\n" => reduce 5
state 0, token stmt => shift, goto 7
state 7, token "\n" => reduce 2
state 0, token stmt-list => shift, goto 8
state 8, token "\n" => shift, goto 10
Currently, when parsers are hashed (using hashify-machine) the non-terminals (e.g., expr above) are reported as integers.
In this chapter we present procedures for generating syntax trees in a uniform format. This format, based on SXML, may use more cons cells than other formats, but upon use you will see that it is easy to produce code in the parser, and one can use all the processing tools that have been written for SXML (e.g., (sxml match), (sxml fold)).
The syntax of SXML trees is simple:
expr => (tag item …) | (tag (@ attr …) item …)
item => string | expr
attr => (tag . string)
where tag names cannot contain the characters ( ) " ' ` , ; ? > < [ ] ~ = ! # $ % & * + / \ @ ^ | { } and cannot begin with -, . or a numeric digit.
For example our Javascript parser given the input
function foo(x, y) { return x + y; }
will produce the following syntax tree:
(Program
 (SourceElements
  (FunctionDeclaration
   (Identifier "foo")
   (FormalParameterList
    (Identifier "x")
    (Identifier "y"))
   (SourceElements
    (EmptyStatement)
    (ReturnStatement
     (add (PrimaryExpression (Identifier "x"))
          (PrimaryExpression (Identifier "y"))))
    (EmptyStatement)))
  (EmptyStatement)))
And, by the way, put through our Tree-IL compiler, which uses foldts*-values from the module (sxml fold), we get
(begin
  (define foo
    (lambda ((name . foo))
      (lambda-case
       ((() #f @args #f () (JS~5575))
        (prompt
         (const return)
         (begin
           (abort (const return)
                  ((apply (@@ (nyacc lang javascript jslib) JS:+)
                          (apply (toplevel list-ref)
                                 (lexical @args JS~5575)
                                 (const 0))
                          (apply (toplevel list-ref)
                                 (lexical @args JS~5575)
                                 (const 1))))
                  (const ())))
         (lambda-case
          (((tag val) #f #f #f () (JS~5576 JS~5577))
           (lexical val JS~5577)))))))))
Parsing actions in NYACC can use tagged lists from the module (nyacc lang util) to help build SXML trees efficiently. Building a statement list for a program might go as follows:
(program
 (stmt-list ($$ `(program ,(tl->list $1)))))
(stmt-list
 (stmt ($$ (make-tl 'stmt-list $1)))
 (stmt-list stmt ($$ (tl-append $1 $2))))
The sequence of calls to the tl- routines goes as follows:
(make-tl 'stmt-list)
Generate a tagged list with tag 'stmt-list.
(tl-append $1 $2)
Append the item $2 (not a list) to the tagged list $1.
(tl->list $1)
Convert the tagged list $1 to a list. It will be of the form '(stmt-list (stmt …) (stmt …) …). The first element of the list will be the tag 'stmt-list. If attributes were added, the list of attributes will be the second element of the list.
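The same sequence can be tried at the REPL with stand-in statement values (a sketch):

(use-modules (nyacc lang util))
(define tl (make-tl 'stmt-list '(stmt "a")))
(tl->list (tl-append tl '(stmt "b")))
;; => (stmt-list (stmt "a") (stmt "b"))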
The following procedures are provided by the module (nyacc lang util):

make-tl
Create a tagged-list structure for tag tag. Any number of additional items can be added.
tl->list
Convert a tagged-list structure to a list. This collects added attributes and puts them right after the (leading) tag, resulting in something like
(<tag> (@ <attr>) <item> …)
tl-insert
Insert an item at the front of the tagged list (but after the tag).
tl-append
Append items at the end of the tagged list.
tl-extend
Extend with a list of items.
tl-extend!
Extend with a list of items. Uses set-cdr!.
tl+attr
Add an attribute to a tagged list. Return the tl.
(tl+attr tl 'type "int")
tl-merge
Merge the guts of the phony-tl tl1 into tl.
To work with the trees described in the last section use
(sx-ref tree 1)
(sx-attr tree)
(sx-attr-ref tree 'item)
(sx-tail tree 2)
Reference the ix-th element of the list, not counting the optional attributes item. If the list is shorter than the index, return #f.
(sx-ref '(abc "def") 1) => "def"
(sx-ref '(abc (@ (foo "1")) "def") 1) => "def"
Return the tag for a tree.
Generate the tag and the attr list if it exists.
Return the ix-th tail starting after the tag and attribute list, where ix must be positive. For example,
(sx-tail '(tag (@ (abc . "123")) (foo) (bar)) 1) => ((foo) (bar))
Without the second argument, ix is 1.
A predicate to determine if sx has attributes.
(sx-attr '(abc (@ (foo "1")) def) 1) => '(@ (foo "1"))
should change this to
(sx-attr sx) => '((a . 1) (b . 2) ...)
Return an attribute value given the key, or #f.
Set an attribute for sx. If no attributes exist, the attribute list is created; if the key does not exist, add it; if it does exist, replace it.
Generate sx with added or changed attributes.
Add key-val pairs. key must be a symbol and val must be a string. Return a new sx.
Find the first matching element (in the first level).
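A quick sketch exercising these accessors on a small tree, using the attribute form shown earlier (and assuming the sx-* procedures are exported by (nyacc lang util)):

(use-modules (nyacc lang util))
(define tree '(decl (@ (type . "int")) (name "x") (init "1")))
(sx-attr-ref tree 'type) ; => "int"
(sx-ref tree 1)          ; => (name "x")
(sx-tail tree 2)         ; => ((init "1"))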
This illustrates translation with foldts*-values and sxml-match.
(define-module (language javascript spec)
  #:export (javascript)
  #:use-module (nyacc lang javascript separser)
  #:use-module (nyacc lang javascript compile-tree-il)
  #:use-module (nyacc lang javascript pprint)
  #:use-module (system base language))

(define-language javascript
  #:title "javascript"
  #:reader js-reader
  #:compilers `((tree-il . ,compile-tree-il))
  #:printer pretty-print-js)
(define-module (nyacc lang javascript compile-tree-il)
  #:export (compile-tree-il)
  #:use-module (nyacc lang javascript jslib)
  #:use-module ((sxml match) #:select (sxml-match))
  #:use-module ((sxml fold) #:select (foldts*-values))
  #:use-module ((srfi srfi-1) #:select (fold))
  #:use-module (language tree-il))
…
(define (compile-tree-il exp env opts)
  (let* ((xrep (js-sxml->tree-il-ext exp env opts))
         (code (parse-tree-il xrep)))
    (values code env env)))
(fmtr 'push)  ;; push indent level
(fmtr 'pop)   ;; pop indent level
(fmtr "fmt" arg1 arg2 ...)
Installation instructions are included in the top-level file INSTALL of the source distribution. If you have an installed Guile then the basic steps are
$ ./configure
$ make install
Help with alternative usage is available with
$ ./configure --help
If Guile is not installed it is possible to install source only:
$ ./configure --site-scm-dir=/path/to/dest --site-scm-go-dir=/dummy
$ make install-srcs
Please report bugs by navigating with your browser to https://savannah.nongnu.org/projects/nyacc and selecting the “Submit New” item under the “Bugs” menu. Alternatively, ask on the Guile user’s mailing list guile-user@gnu.org.
The Free Documentation License is included in the Guile Reference Manual. It is included with the NYACC source as the file COPYING.DOC.
Todo/Notes/Ideas:
add error handling (lalr-spec will now return #f for fatal error)
support other target languages: (write-lalr-parser pgen "foo.py" #:lang 'python)
export functions to allow user to control the flow i.e., something like: (parse-1 state) => state
macros - gotta be scheme macros but how to deal with other stuff
(macro ($? val ...) () (val ...))
(macro ($* val ...) () (_ val ...))
(macro ($+ val ...) (val ...) (_ val ...))
support semantic forms: (1) attribute grammars, (2) translational semantics, (3) operational semantics, (4) denotational semantics
add ($abort) and ($accept)
add a location stack to the parser/lexer
Fix lexical analyzer to return tval, sval pairs using cons-source instead of cons. This will then allow support of location info.
Donnelly, C., and Stallman, R., “Bison: The Yacc-compatible Parser Generator,” https://www.gnu.org/software/bison/manual.
Aho, A.V., Sethi, R., and Ullman, J.D., “Compilers: Principles, Techniques, and Tools,” Addison-Wesley, 1986 (aka the Dragon Book).
DeRemer, F., and Pennello, T., “Efficient Computation of LALR(1) Look-Ahead Sets.” ACM Trans. Prog. Lang. and Systems, Vol. 4, No. 4., Oct. 1982, pp. 615-649.
R. P. Corbett, “Static Semantics and Compiler Error Recovery,” Ph.D. Thesis, UC Berkeley, 1985.