The PDT stuff described below is also applicable to Tensile (see sl(1)).
The following is just a brief introduction. If you're completely unfamiliar with the automata theory, you'd better learn something more from other sources.
Push-down transducers are a class of formal automata, abstract mathematical structures which may change their internal state and produce some output depending on the input. One of the most generic (and most well-known) examples of automata is the
Turing's machine.
Strictly speaking, the term automata is properly applicable only to those structures which has no output and can only decide whether a given input sequence belong to a certain class or not (the class which is usually called a
language is said to be recognized by the automaton. Those devices which can produce some output are properly called
transducers but the latter term is not very wide spread, so further we will use the both as mutual synomyms.
The most primitive automata class is finite-state automata. Their state and output are determined by a single character of the input and the current state, the sets of input characters and inner states both being
finite.
Push-down automata have in addition an internal stack of unlimited length. The topmost character in the stack determines the behaviour of an automaton alongside with its previous state and the input character. After changing the state and emitting the output, the topmost stack character is replaced by an arbitrary (poss. empty) string of characters. The set of tasks that can be modelled with push-down automata/transducers is much larger than that of finite-state automata so the latter should be clearly distinguished from the former. However, in practice, push-down automata are often called finite-state.
Automata are an ideal device to deal with complex string transformations. Well-known regular expressions are nothing but another notation for finite-state (outputless) automata. However, they may be quite hard to learn, so they are not very popular nowadays (I mean, as a tool for text processing). I personally know the only project which uses this conception – namely, Omega project by John Plaice anf Yannis Haralambous.
There are yet three special operators (`@', `:', and application) which will be described later. Parentheses may be used in a common way. Primitive expressions are integral constant
strtol(3) syntax), character constants (C syntax), register references, and macro names
Range references have the form [=range-name], where range-name is defined by a preceding .range directive
Constant ranges are a comma-separated list of segments, enclosed in brackets. Each segment has the following structure:
start-end/dither*multi where every part except for start may be omitted: if end is omitted, it is implied equal to start; if dither or multi is omitted, 1 is implied This means ``take multi times from start to end each dither element''
start, end, dither, multi should be integral or character constants; there should be no spaces around -, /, *
To check whether a value is in a range, @ operator is used. It has the same precedence as unary operator; its left operand is a range, and the right is some expression. It returns a zero-base index of the first occurrence of a value within the range; or -1 if the value is outside the range.
To select n-th item from a range, application operator is used; it has no symbolic representation; just write a range followed by an expression. The application has the same priority as
@.
The registers may be assigned to with a : operator; its left arg is an expression and its right args is a register number. It has the same precedence as
@
WHAT ARE THOSE PDTS?
EXPRESSIONS
Automata expressions are much like C expressions. However, in the current version they have less precedence levels that the latter. Expressions may contain the following operators (in descending precedence)
RANGES
(aka tables) Ranges may have two syntactic forms: constant ranges and range references.
REGISTERS
The memory of an automaton consists of a stack, a current state, an input characters and 8 temporary registers. They are accessed in the following way:
transition :: | state-expr `;' eat filter-expr1 `;' filter-expr2 `;' newstate-expr `;' `{' expr-list1 `}' `{' expr-list2 `}' |
eat :: | null
`^' |
expr-list :: | null
expr-or-string expr-list |
expr-or-string :: | expr `;'
quoted-string |
A state-expr is evaluated at parse time yielding the number of a state. It must non-negative.
If eat is `^', an input character of the automaton will not change after transition. If filter-exprs are always true, this is equivalent to empty-string transition.
Filter-exprs are evaluated (at run time) in a short-circuit manner as if connected by `&&' operator. If they're both true, a transition for a given state is executed.
NOTE: | Transitions are searched in the order of their definitions. After a matching transition is found, search is complete. |
When a transition is executed, expressions from expr-list1 are evaluated and their values are passed to output. If a list item is a string, it is considered equivalent to a list of character constants.
Then a topmost stack item is eliminated, and expr-list2 is evaluated.
NOTE: | Here string items are, too, equivalent to sequence of characters. As a consequence, a string will be pushed into the stack reversed. |
After all, the state is changed to the value of newstate-expr.
NOTE: | no evaluation occur at the point of definition |
NOTE: | macros are expanded as if their body is surrounded by parentheses. |
NOTE: | filename may be in fact surround by arbitrary characters, but they need be the same – .include <... is not allowed. |
BUG: | implementation is broken |
autoseq :: | null
branch branch `|' autoseq |
branch :: | item
item `&' branch |
item :: | ismaster automaton param-spec outfunc |
ismaster :: | null
`*' |
automaton :: | auto-name state-spec
`{' extension-name extension-param '}' |
param-spec :: | null
`[' params `]' |
params ::= | param
param `,' params |
param ::= | reg-spec `=' number-or-name |
reg-spec ::= | digit
name |
number-or-name :: | number
char-constant name |
auto-name :: | name
apos-string |
state-spec :: | null
`:' name |
extension-name :: | any characters except ] and |
extension-param :: | null
any characters except ] |
outfunc :: | null
`$$' `$1' `$0' `$#' |
Arbitrary spaces are allowed between tokens
A branch is evaluated left to right, the output of the branch being the first non-empty output. However, all the branches are always executed.
An output value of an automaton can be modified by an output function which now can substitute a current state, a topmost stack element or a stack depth for a real output value.
When a sequence is initialized or reset, all its automata states are set to the value of
state-spec or to a value defined by .start if the former is missing. Temporary registers and stack may be initialized by additional
params. A param designator may be either a digit or a macro name which expands to a register reference. A
param value may be either a number or a character constant, or a macro name which is get evaluated at parse time.
By default, when polling a state of a sequence, the last branch of the last sequence item (i. e. the rightmost one) is used. This can be changed by prefixing a given item with `*'.
Evaluation
A sequence is evaluated right to left, the input of an automata branch being the output of the next one (in evaluation order).
SEE ALSO
sl(1), strtol(3)
AUTHOR
Artem V. Andreev
COPYRIGHT
Copyright © 2001, 2002 Artem V. Andreev. See sl(1) for details