Below is the pattern matching version of Ackermann being compiled at several different optimization levels: 0, 2, 4 and 7. The instruction count drops from 36 instructions to 26, 24 and finally 14. TXR 283 was used for this demo.
In the 36 instruction unoptimized version, we see frames being allocated and
lots of v registers used, corresponding to all the temporary variables in in
the code generated by the pattern matching macrology. There is also a
block
instruction with a matching end
because named
function bodies are implicit named blocks.
In the 26 instruction version, variables not captured by closures have been
identified and turned into t registers (pure registers, freshly instantiated in
each function, and not lexically captured). Thus all frame
instructions are gone along with their matching end
.
The block
instruction is gone: the compiler has determined that
it's not used: nothing in the block directly references it, and no function
is called which possibly uses the block. Blocks are dynamically scoped in
TXR Lisp, and are the target of a dynamic control transfer, which makes their
setup expensive.
The most obvious change between the 26 and 24 is some control flow work:
some jmp instructions are now going directly down to 41 instead of chaining
through the program. It's not simple jump threading; it's informed by data: as
in, if we jump here, we will test a register we know to be already false, so we
might as well jump all the way here. The 24 instruction version contains
numerous wasteful register-register moves. It's still storing values into a
register t6
, which is never
referenced.1
Ten more instructions are shaved off (24 to 14) with pattern-matching peephole optimizations, which are informed by data flow analysis. The reductions are mostly local to a basic block, but sometimes involve multiple blocks. In this last version, the forward jumps to the end instruction are gone; the end instruction sequence is hoisted back into those sites. There are no more wasteful mov instructions; in fact, all mov instructions are gone.
The final version is almost identical2 code to what pops out of a non-pattern-matching ordinary Ackermann.
1< (each ((*opt-level* #(0 2 4 7)))
(eval '(defun-match ack
((0 @n) (+ n 1))
((@m 0) (ack (- m 1) 1))
((@m @n) (ack (- m 1) (ack m (- n 1))))))
(disassemble (compile 'ack)))
data:
0: ack
1: 0
2: 1
3: t
syms:
0: +
1: ack
2: -
code:
0: 8C00003A close t2 2 13 58 2 2 nil v00000 v00001
1: 00020002
2: 00020002
3: 0000000D
4: 08010800
5: 4C000039 block t2 d0 57
6: 00020400
7: 04030001 frame 3 1
8: 04040001 frame 4 1
9: 3C000011 ifq v00000 d1 17
10: 08000401
11: 2D000801 movsr v02000 v00001
12: 20020C00 gcall v01000 0 v02000 d2
13: 10000000
14: 00000402
15: 2C060403 movsr t6 d3
16: 34000012 jmp 18
17: 2C060000 movsr t6 nil
18: 10000006 end t6
19: 3C000036 ifq t6 nil 54
20: 00060000
21: 04040001 frame 4 1
22: 3C000021 ifq v00001 d1 33
23: 08010401
24: 2D000800 movsr v02000 v00000
25: 2002000B gcall t11 2 v02000 d2
26: 10000002
27: 00000402
28: 20020C00 gcall v01000 1 t11 d2
29: 000B0001
30: 00000402
31: 2C060403 movsr t6 d3
32: 34000022 jmp 34
33: 2C060000 movsr t6 nil
34: 10000006 end t6
35: 3C000036 ifq t6 nil 54
36: 00060000
37: 04040002 frame 4 2
38: 2D000801 movsr v02000 v00001
39: 2D010800 movsr v02001 v00000
40: 20020008 gcall t8 2 v02001 d2
41: 10010002
42: 00000402
43: 2002000A gcall t10 2 v02000 d2
44: 10000002
45: 00000402
46: 20020009 gcall t9 1 v02001 t10
47: 10010001
48: 0000000A
49: 20020C00 gcall v01000 1 t8 t9
50: 00080001
51: 00000009
52: 10000403 end d3
53: 2C060403 movsr t6 d3
54: 2C020C00 movsr t2 v01000
55: 10000002 end t2
56: 10000002 end t2
57: 10000002 end t2
58: 10000002 end t2
instruction count:
36
entry point:
4
data:
0: ack
1: 0
2: 1
3: t
syms:
0: sys:b+
1: ack
2: sys:b-
code:
0: 8C00002F close t2 0 19 47 2 2 nil t18 t17
1: 00000002
2: 00020002
3: 00000013
4: 00110012
5: 3C00000D ifq t18 d1 13
6: 00120401
7: 2C0C0011 movsr t12 t17
8: 20020010 gcall t16 0 t12 d2
9: 000C0000
10: 00000402
11: 2C060403 movsr t6 d3
12: 3400000E jmp 14
13: 2C060000 movsr t6 nil
14: 3C00002D ifq t6 nil 45
15: 00060000
16: 3C00001B ifq t17 d1 27
17: 00110401
18: 2C0D0012 movsr t13 t18
19: 2002000A gcall t10 2 t13 d2
20: 000D0002
21: 00000402
22: 20020010 gcall t16 1 t10 d2
23: 000A0001
24: 00000402
25: 2C060403 movsr t6 d3
26: 3400001C jmp 28
27: 2C060000 movsr t6 nil
28: 3C00002D ifq t6 nil 45
29: 00060000
30: 2C0F0011 movsr t15 t17
31: 2C0E0012 movsr t14 t18
32: 2002000C gcall t12 2 t14 d2
33: 000E0002
34: 00000402
35: 20020007 gcall t7 2 t15 d2
36: 000F0002
37: 00000402
38: 20020009 gcall t9 1 t14 t7
39: 000E0001
40: 00000007
41: 20020010 gcall t16 1 t12 t9
42: 000C0001
43: 00000009
44: 2C060403 movsr t6 d3
45: 2C020010 movsr t2 t16
46: 10000002 end t2
47: 10000002 end t2
instruction count:
26
entry point:
4
data:
0: ack
1: 0
2: 1
3: t
syms:
0: sys:b+
1: ack
2: sys:b-
code:
0: 8C00002B close t2 0 19 43 2 2 nil t18 t17
1: 00000002
2: 00020002
3: 00000013
4: 00110012
5: 3C00000D ifq t18 d1 13
6: 00120401
7: 2C0C0011 movsr t12 t17
8: 20020010 gcall t16 0 t12 d2
9: 000C0000
10: 00000402
11: 2C060403 movsr t6 d3
12: 34000029 jmp 41
13: 2C060000 movsr t6 nil
14: 3C000019 ifq t17 d1 25
15: 00110401
16: 2C0D0012 movsr t13 t18
17: 2002000A gcall t10 2 t13 d2
18: 000D0002
19: 00000402
20: 20020010 gcall t16 1 t10 d2
21: 000A0001
22: 00000402
23: 2C060403 movsr t6 d3
24: 34000029 jmp 41
25: 2C060000 movsr t6 nil
26: 2C0F0011 movsr t15 t17
27: 2C0E0012 movsr t14 t18
28: 2002000C gcall t12 2 t14 d2
29: 000E0002
30: 00000402
31: 20020007 gcall t7 2 t15 d2
32: 000F0002
33: 00000402
34: 20020009 gcall t9 1 t14 t7
35: 000E0001
36: 00000007
37: 20020010 gcall t16 1 t12 t9
38: 000C0001
39: 00000009
40: 2C060403 movsr t6 d3
41: 2C020010 movsr t2 t16
42: 10000002 end t2
43: 10000002 end t2
instruction count:
24
entry point:
4
data:
0: ack
1: 0
2: 1
3: t
syms:
0: sys:b+
1: ack
2: sys:b-
code:
0: 8C000021 close t2 0 9 33 2 2 nil t2 t3
1: 00000002
2: 00020002
3: 00000009
4: 00030002
5: 3C00000B ifq t2 d1 11
6: 00020401
7: 20020004 gcall t4 0 t3 d2
8: 00030000
9: 00000402
10: 10000004 end t4
11: 3C000014 ifq t3 d1 20
12: 00030401
13: 20020005 gcall t5 2 t2 d2
14: 00020002
15: 00000402
16: 20020004 gcall t4 1 t5 d2
17: 00050001
18: 00000402
19: 10000004 end t4
20: 20020006 gcall t6 2 t2 d2
21: 00020002
22: 00000402
23: 20020007 gcall t7 2 t3 d2
24: 00030002
25: 00000402
26: 20020008 gcall t8 1 t2 t7
27: 00020001
28: 00000007
29: 20020004 gcall t4 1 t6 t8
30: 00060001
31: 00000008
32: 10000004 end t4
33: 10000002 end t2
instruction count:
14
entry point:
4
nil
t6
register arises because the pattern matching
cases are compiled into a or
expression which calculates a
Boolean value: was there a match or not? See the expansion below. But in that
specific situation, the Boolean value is never used; the result of each case
is stored in a generated variable, and that variable's value is returned.
When the or
is compiled, its result is tracked in
t6
. At the higher optimization levels, this wasteful register use
is identified and eliminated.
1> (macroexpand '(defun-match ack
((0 @n) (+ n 1))
((@m 0) (ack (- m 1) 1))
((@m @n) (ack (- m 1) (ack m (- n 1))))))
(defun ack (#:arg-00012
#:arg-10013)
(let (#:result0014)
(or (sys:when-exprs-match ;; result of this (or ...) is t6
(0 @n) (#:arg-00012
#:arg-10013)
(set #:result0014
(progn (+ n 1)))
t) ;; "mov t6 d3"
(sys:when-exprs-match
(@m 0) (#:arg-00012
#:arg-10013)
(set #:result0014
(progn (ack (- m 1)
1)))
t)
(sys:when-exprs-match
(@m @n) (#:arg-00012
#:arg-10013)
(set #:result0014
(progn (ack (- m 1)
(ack m (- n 1)))))
t))
#:result0014))
ack
.
It compiles and optimizes to the same 14 instruction sequence, with
some minor differences. The pattern matching version specifies a larger
frame size of 9 for the closure: in the close
instruction
we see a 9 rather than 8. This is because it uses more registers: up
through t8
. The code block for the main recursive case uses
registers t6
, t7
and t8
, whereas
the same code block from the non-pattern-matching version uses
t5
, t6
and t7
. That use of
registers, though smaller, is far from minimal; several more registers
could be eliminated through reuse. The TXR Lisp compiler and optimizer
implement some measures to reduce the number of registers, but not
doggedly so. Because a large number of registers is available,
a simple register allocation algorithm is used. When the optimizer
eliminates registers, it renumbers the remaining ones so they are
consecutive, and shrinks the frame size.
1> (defun ack (m n)
(cond
((eql 0 m) (+ n 1))
((eql 0 n) (ack (- m 1) 1))
(t (ack (- m 1) (ack m (- n 1))))))
ack
2> (compile 'ack)
#<vm fun: 2 param>
3> (disassemble 'ack)
data:
0: ack
1: 0
2: 1
syms:
0: sys:b+
1: ack
2: sys:b-
code:
0: 8C000021 close t2 0 8 33 2 2 nil t2 t3
1: 00000002
2: 00020002
3: 00000008
4: 00030002
5: 4000000B ifql d1 t2 11
6: 04010002
7: 20020004 gcall t4 0 t3 d2
8: 00030000
9: 00000402
10: 10000004 end t4
11: 40000014 ifql d1 t3 20
12: 04010003
13: 20020005 gcall t5 2 t2 d2
14: 00020002
15: 00000402
16: 20020004 gcall t4 1 t5 d2
17: 00050001
18: 00000402
19: 10000004 end t4
20: 20020005 gcall t5 2 t2 d2
21: 00020002
22: 00000402
23: 20020006 gcall t6 2 t3 d2
24: 00030002
25: 00000402
26: 20020007 gcall t7 1 t2 t6
27: 00020001
28: 00000006
29: 20020004 gcall t4 1 t5 t7
30: 00050001
31: 00000007
32: 10000004 end t4
33: 10000002 end t2
instruction count:
14
entry point:
4
ack