The simplest possible example of a query in the TXR Pattern Language
is a file containing UTF-8 text. Such a file specifies a match for itself:
well, almost.
Care has to be taken to escape any occurrences of the meta-character @ which
introduces all special syntax. This is done by writing it twice: @@ stands for
a single literal @. Thus, a text file which contains no @ signs, or whose
@ signs are properly escaped by being doubled, is a query which matches itself. So for
instance:
Four score and
seven years ago
our fathers brought forth,
is a TXR query which matches the text itself. Actually, it matches more than
just itself: it matches any text which begins with those three lines. Thus it
also matches this text:
Four score and
seven years ago
our fathers brought forth,
upon this continent
Furthermore, spaces actually have a special meaning in TXR. A single
space denotes a match for one or more spaces. So our query also matches
this text, which is a convenient behavior.
Four score and
seven years ago
our fathers brought forth,
upon this continent
We can tighten the query so that it matches exactly three lines, and
only single spaces in the first line.
Four@\ score@\ and
seven years ago
our fathers brought forth,
@(eof)
Here the @ character comes into play. The @\ space syntax encodes a literal
space which doesn't have the "match one or more spaces" meaning. The @(eof)
directive means "match the empty data set, consisting of no lines".
Variables are denoted as identifiers preceded by @, and match
pieces of text in mostly intuitive ways (and sometimes not so
intuitive). Suppose we change the above to this:
Four@\ score@\ and
seven @units ago
our @relatives brought forth,
@(eof)
Now if this query is matched against the original file, the variable units
will capture the character string "years" and relatives will capture
"fathers". Of course, it also matches texts which have words other than
these, such as "seven months ago", or "our mothers brought forth".
As you can see, the basic concept in simple patterns like this very much resembles a "here document": it's a template of text with variables. But of course, this "here document" runs backwards! Rather than generating text by substituting variables, it does the opposite: it matches text and extracts variables. The need for a "here document run backwards" was what prompted the initial development of TXR!
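To make the "here document run backwards" idea concrete, here is a rough analogy in ordinary Python, not TXR itself: a template containing @-variables is compiled into a regular expression with named groups, so that matching a text extracts the variables instead of substituting them. The template syntax mimics the examples above; the helper name compile_template is invented for illustration.

```python
import re

def compile_template(template):
    """Turn a here-document-style template into an extracting regex.

    Each @name becomes a named capture group; everything else is
    matched literally. A single space matches one or more spaces,
    mimicking TXR's treatment of spaces.
    """
    pattern = []
    for part in re.split(r'(@[A-Za-z_]+| )', template):
        if part.startswith('@'):
            pattern.append('(?P<%s>\\S+)' % part[1:])   # variable: capture a word
        elif part == ' ':
            pattern.append(' +')                        # space: one or more spaces
        else:
            pattern.append(re.escape(part))             # literal text
    return re.compile(''.join(pattern))

t = compile_template('seven @units ago our @relatives brought forth,')
m = t.match('seven years ago our fathers brought forth,')
print(m.group('units'), m.group('relatives'))  # years fathers
```

The substitution direction of a true here document would render the template from variable values; here the same template is run in reverse to recover them.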
From this departure point, things get rapidly complicated. The pattern language has numerous directives expressing parallel matching and iteration. Many of the directives work in vertical (line oriented) and horizontal (character oriented) modes. Pattern functions can be defined (horizontal and vertical) and those can be recursive, allowing grammars to be parsed.
The following query reads a stream of comma-separated pairs and generates an
HTML table. A complete version with sample data is available.
@(collect)
@char,@speech
@(end)
@(output :filter :to_html)
<table>
@ (repeat)
<tr>
<td>@char</td>
<td>@speech</td>
</tr>
@ (end)
</table>
@(end)
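For comparison, the same transformation can be sketched in plain Python (an analogy, not TXR; the field names char and speech come from the query above, and html.escape stands in for the query's HTML output filter):

```python
import html

def pairs_to_table(lines):
    """Render 'name,text' lines as an HTML table, escaping the
    field values much as an HTML output filter would."""
    rows = []
    for line in lines:
        char, speech = line.split(',', 1)   # split on the first comma only
        rows.append('<tr><td>%s</td><td>%s</td></tr>'
                    % (html.escape(char), html.escape(speech)))
    return '<table>\n%s\n</table>' % '\n'.join(rows)

# Hypothetical sample data, invented for illustration:
print(pairs_to_table(['A,Hello & welcome', 'B,Goodbye']))
```

Note how the TXR version keeps the input template and the output template in the same textual shape as the data, whereas the procedural version buries both inside string-building code.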
Here is a TXR query which matches an arithmetic expression grammar,
consisting of numbers, identifiers, basic arithmetic operators (+ - * /)
and parentheses. The expression is supplied as a command line argument;
this is done by @(next :args), which redirects the pattern matching to
the argument vector.
Note that most of this code is not literal text; nearly every piece of it
is special syntax. The @; os -> optional space text is a comment.
@(next :args)
@(define os)@/ */@(end)@; os -> optional space
@(define mulop)@(os)@/[*\/]/@(os)@(end)
@(define addop)@(os)@/[+\-]/@(os)@(end)
@(define number)@(os)@/[0-9]+/@(os)@(end)
@(define ident)@(os)@/[A-Za-z]+/@(os)@(end)
@(define factor)@(cases)(@(expr))@(or)@(number)@(or)@(ident)@(end)@(end)
@(define term)@(some)@(factor)@(or)@(factor)@(mulop)@(term)@(or)@(addop)@(factor)@(end)@(end)
@(define expr)@(some)@(term)@(or)@(term)@(addop)@(expr)@(end)@(end)
@(cases)
@ (expr)
@ (output)
parses!
@ (end)
@(or)
@ (expr)@bad
@ (output)
error starting at "@bad"
@ (end)
@(end)
The grammar productions above are represented by horizontal pattern functions.
Horizontal pattern functions are denoted visually by a horizontal syntax:
their elements are written side by side on a single logical line.
Horizontal function definitions can be broken into multiple physical lines and
indented, with the help of the @\
continuation sequence, which
consumes all leading whitespace from the following line, like this:
@(define term)@\
  @(some)@\
    @(factor)@\
  @(or)@\
    @(factor)@(mulop)@(term)@\
  @(or)@\
    @(addop)@(factor)@\
  @(end)@\
@(end)
Sample runs from the Unix command line:
$ txr expr.txr 'a + (3 * b/(c + 4))'
parses!
$ txr expr.txr 'a + (3 * b/(c + 4)))'
error starting at ")"
$ txr expr.txr 'a + (3 * b/(c + 4)'
error starting at "+ (3 * b/(c + 4)"
As you can see, this program matches the longest prefix of the input
which is a well-formed expression. The expression is recognized using
the simple function call @(expr), which could be placed into the middle
of a text template as easily as a variable. The @(cases) directive is
used to recognize two situations: either the argument completely parses,
or there is stray material that is not recognized, which can be captured
into a variable called bad. The grammar itself is straightforward.
Look at the grammar production for factor. It contains two literal
characters: the parentheses around @(expr). Unlike the surrounding
directives, they are ordinary text: they stand for themselves, matching
literal parentheses in the input.
The ability to parse grammars happened in TXR by accident. It's a consequence of combining pattern matching and functions. In creating TXR, I independently discovered a concept known as PEGs: Parsing Expression Grammars.
Note how the program easily deals with lexical analysis and higher
level parsing in one grammar: there is no need to divide the task into
"tokenizing" and "parsing". Tokenizing is necessary with classic
parsers, like LALR(1) machines, because these parsers normally have
only one token of lookahead and avoid backtracking. If they are fed
characters instead of tokens, they cannot do very much, running
into ambiguities arising from complex tokens. By itself, a classic
parser cannot decide whether "i" is the beginning of the C "int"
keyword, or just the start of an identifier like "input". It needs the
tokenizer to scan these (done with a regular language based on regular
expressions) and do the classification, so that the parser sees a
KEYWORD or IDENT token.
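The "int" versus "input" point can be illustrated with a toy character-level matcher in Python (a sketch of the scannerless, ordered-choice idea, not how TXR is implemented): the keyword alternative is tried first as a whole word, and when that fails, the matcher falls back to the general identifier rule, with no separate tokenizer in between.

```python
import re

def classify(text):
    """Character-level classification with ordered choice:
    try 'int' as a whole word first; on failure, backtrack and
    fall back to the general identifier rule."""
    m = re.match(r'int\b', text)            # keyword alternative
    if m:
        return ('KEYWORD', m.group())
    m = re.match(r'[A-Za-z_][A-Za-z0-9_]*', text)  # identifier alternative
    if m:
        return ('IDENT', m.group())
    return None

print(classify('int x'))   # ('KEYWORD', 'int')
print(classify('input'))   # ('IDENT', 'input')
```

Because the failed keyword attempt simply rewinds to the start of the text, the identifier rule sees the same characters again; this is the backtracking that one-token-lookahead parsers avoid.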
Just like the TXR pattern matching primitives are embedded in plain
text, within the pattern matching language, there is an embedded Lisp
dialect. Here is one way to tabulate a frequency histogram of the
letters A-Z, using the pattern language to extract the letters from
the input, and TXR Lisp to tabulate:
@(do (defvar h (hash :equal-based)))
@(collect :vars ())
@(coll :vars ())@\
  @{letter /[A-Za-z]/}@(filter :upcase letter)@\
  @(do (inc [h letter 0]))@\
@(end)
@(end)
@(do (dohash (key value h) (format t "~a: ~a\n" key value)))
For comparison, here is a possible procedural approach using purely TXR Lisp.
Some aspects of the code are self-explanatory; a programmer can guess
that open-file is a function which opens a file and returns a stream.
Lisp programmers will find the let construct for binding variables
familiar. Then there is the strange gun operator. Its name stands for
"generate until nil": it returns a lazy list, possibly infinite, whose
elements are formed by repeated calls to the enclosed expression, in
this case (get-char s). This lazy list of characters can then be
conveniently processed using the each operator.
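Python's two-argument iter offers a close analogue to gun (an analogy, not TXR code): it repeatedly calls a function and stops when a sentinel value comes back, producing the elements lazily.

```python
import io

s = io.StringIO("abc")
# Like (gun (get-char s)): call s.read(1) repeatedly until it
# returns the sentinel '' (end of stream), yielding chars lazily.
chars = iter(lambda: s.read(1), '')
result = list(chars)
print(result)  # ['a', 'b', 'c']
```

Where gun stops on nil, the Python version stops on the explicitly supplied sentinel, here the empty string that read returns at end of stream.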
The square bracket expression (inc [h (chr-toupper ch) 0]) is a
shorthand equivalent for (inc (gethash h (chr-toupper ch) 0)), which
means: increment the value in hash table h corresponding to the key
(chr-toupper ch) (the character ch converted to upper case). If the
entry does not exist, then it is created and initialized with 0, then
incremented.
(let ((h (hash))
      (s (open-file "/usr/share/dict/words" "r")))
  (each ((ch (gun (get-char s))))
    (if (chr-isalpha ch)
      (inc [h (chr-toupper ch) 0])))
  (let ((sorted [sort (hash-pairs h) > second]))
    (each ((pair sorted))
      (tree-bind (key value) pair
        (put-line `@key: @value`)))))
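For readers more at home in Python, the shape of that program translates roughly as follows (an analogy; the helper name letter_histogram and the sample input are invented for illustration):

```python
from collections import Counter
import io

def letter_histogram(stream):
    """Count alphabetic characters case-insensitively: read the
    stream character by character, fold to upper case, tally."""
    counts = Counter()
    for ch in iter(lambda: stream.read(1), ''):  # lazy chars, like gun
        if ch.isalpha():
            counts[ch.upper()] += 1              # like (inc [h ... 0])
    return counts

h = letter_histogram(io.StringIO("abracadabra"))
# Sort by descending count (ties broken alphabetically) and print:
for key, value in sorted(h.items(), key=lambda kv: (-kv[1], kv[0])):
    print("%s: %s" % (key, value))
```

The default-to-zero behavior of Counter plays the role of the 0 default argument in the TXR Lisp hash lookup.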