TXR Bytecode Optimization Case Study

Below is the pattern matching version of Ackermann being compiled at several different optimization levels: 0, 2, 4 and 7. The instruction count drops from 36 instructions to 26, 24 and finally 14. TXR 283 was used for this demo.

In the 36 instruction unoptimized version, we see frames being allocated and lots of v registers used, corresponding to all the temporary variables in in the code generated by the pattern matching macrology. There is also a block instruction with a matching end because named function bodies are implicit named blocks.

In the 26 instruction version, variables not captured by closures have been identified and turned into t registers (pure registers, freshly instantiated in each function, and not lexically captured). Thus all frame instructions are gone along with their matching end. The block instruction is gone: the compiler has determined that it's not used: nothing in the block directly references it, and no function is called which possibly uses the block. Blocks are dynamically scoped in TXR Lisp, and are the target of a dynamic control transfer, which makes their setup expensive.

The most obvious change between the 26 and 24 is some control flow work: some jmp instructions are now going directly down to 41 instead of chaining through the program. It's not simple jump threading; it's informed by data: as in, if we jump here, we will test a register we know to be already false, so we might as well jump all the way here. The 24 instruction version contains numerous wasteful register-register moves. It's still storing values into a register t6, which is never referenced.1

Ten more instructions are shaved off (24 to 14) with pattern-matching peephole optimizations, which are informed by data flow analysis. The reductions are mostly local to a basic block, but sometimes involve multiple blocks. In this last version, the forward jumps to the end instruction are gone; the end instruction sequence is hoisted back into those sites. There are no more wasteful mov instructions; in fact, all mov instructions are gone.

The final version is almost identical2 code to what pops out of a non-pattern-matching ordinary Ackermann.

1< (each ((*opt-level* #(0 2 4 7)))
     (eval '(defun-match ack
              ((0 @n) (+ n 1))
              ((@m 0) (ack (- m 1) 1))
              ((@m @n) (ack (- m 1) (ack m (- n 1))))))
     (disassemble (compile 'ack)))
data:
    0: ack
    1: 0
    2: 1
    3: t
syms:
    0: +
    1: ack
    2: -
code:
    0: 8C00003A close t2 2 13 58 2 2 nil v00000 v00001
    1: 00020002
    2: 00020002
    3: 0000000D
    4: 08010800
    5: 4C000039 block t2 d0 57
    6: 00020400
    7: 04030001 frame 3 1
    8: 04040001 frame 4 1
    9: 3C000011 ifq v00000 d1 17
   10: 08000401
   11: 2D000801 movsr v02000 v00001
   12: 20020C00 gcall v01000 0 v02000 d2
   13: 10000000
   14: 00000402
   15: 2C060403 movsr t6 d3
   16: 34000012 jmp 18
   17: 2C060000 movsr t6 nil
   18: 10000006 end t6
   19: 3C000036 ifq t6 nil 54
   20: 00060000
   21: 04040001 frame 4 1
   22: 3C000021 ifq v00001 d1 33
   23: 08010401
   24: 2D000800 movsr v02000 v00000
   25: 2002000B gcall t11 2 v02000 d2
   26: 10000002
   27: 00000402
   28: 20020C00 gcall v01000 1 t11 d2
   29: 000B0001
   30: 00000402
   31: 2C060403 movsr t6 d3
   32: 34000022 jmp 34
   33: 2C060000 movsr t6 nil
   34: 10000006 end t6
   35: 3C000036 ifq t6 nil 54
   36: 00060000
   37: 04040002 frame 4 2
   38: 2D000801 movsr v02000 v00001
   39: 2D010800 movsr v02001 v00000
   40: 20020008 gcall t8 2 v02001 d2
   41: 10010002
   42: 00000402
   43: 2002000A gcall t10 2 v02000 d2
   44: 10000002
   45: 00000402
   46: 20020009 gcall t9 1 v02001 t10
   47: 10010001
   48: 0000000A
   49: 20020C00 gcall v01000 1 t8 t9
   50: 00080001
   51: 00000009
   52: 10000403 end d3
   53: 2C060403 movsr t6 d3
   54: 2C020C00 movsr t2 v01000
   55: 10000002 end t2
   56: 10000002 end t2
   57: 10000002 end t2
   58: 10000002 end t2
instruction count:
   36
entry point:
    4
data:
    0: ack
    1: 0
    2: 1
    3: t
syms:
    0: sys:b+
    1: ack
    2: sys:b-
code:
    0: 8C00002F close t2 0 19 47 2 2 nil t18 t17
    1: 00000002
    2: 00020002
    3: 00000013
    4: 00110012
    5: 3C00000D ifq t18 d1 13
    6: 00120401
    7: 2C0C0011 movsr t12 t17
    8: 20020010 gcall t16 0 t12 d2
    9: 000C0000
   10: 00000402
   11: 2C060403 movsr t6 d3
   12: 3400000E jmp 14
   13: 2C060000 movsr t6 nil
   14: 3C00002D ifq t6 nil 45
   15: 00060000
   16: 3C00001B ifq t17 d1 27
   17: 00110401
   18: 2C0D0012 movsr t13 t18
   19: 2002000A gcall t10 2 t13 d2
   20: 000D0002
   21: 00000402
   22: 20020010 gcall t16 1 t10 d2
   23: 000A0001
   24: 00000402
   25: 2C060403 movsr t6 d3
   26: 3400001C jmp 28
   27: 2C060000 movsr t6 nil
   28: 3C00002D ifq t6 nil 45
   29: 00060000
   30: 2C0F0011 movsr t15 t17
   31: 2C0E0012 movsr t14 t18
   32: 2002000C gcall t12 2 t14 d2
   33: 000E0002
   34: 00000402
   35: 20020007 gcall t7 2 t15 d2
   36: 000F0002
   37: 00000402
   38: 20020009 gcall t9 1 t14 t7
   39: 000E0001
   40: 00000007
   41: 20020010 gcall t16 1 t12 t9
   42: 000C0001
   43: 00000009
   44: 2C060403 movsr t6 d3
   45: 2C020010 movsr t2 t16
   46: 10000002 end t2
   47: 10000002 end t2
instruction count:
   26
entry point:
    4
data:
    0: ack
    1: 0
    2: 1
    3: t
syms:
    0: sys:b+
    1: ack
    2: sys:b-
code:
    0: 8C00002B close t2 0 19 43 2 2 nil t18 t17
    1: 00000002
    2: 00020002
    3: 00000013
    4: 00110012
    5: 3C00000D ifq t18 d1 13
    6: 00120401
    7: 2C0C0011 movsr t12 t17
    8: 20020010 gcall t16 0 t12 d2
    9: 000C0000
   10: 00000402
   11: 2C060403 movsr t6 d3
   12: 34000029 jmp 41
   13: 2C060000 movsr t6 nil
   14: 3C000019 ifq t17 d1 25
   15: 00110401
   16: 2C0D0012 movsr t13 t18
   17: 2002000A gcall t10 2 t13 d2
   18: 000D0002
   19: 00000402
   20: 20020010 gcall t16 1 t10 d2
   21: 000A0001
   22: 00000402
   23: 2C060403 movsr t6 d3
   24: 34000029 jmp 41
   25: 2C060000 movsr t6 nil
   26: 2C0F0011 movsr t15 t17
   27: 2C0E0012 movsr t14 t18
   28: 2002000C gcall t12 2 t14 d2
   29: 000E0002
   30: 00000402
   31: 20020007 gcall t7 2 t15 d2
   32: 000F0002
   33: 00000402
   34: 20020009 gcall t9 1 t14 t7
   35: 000E0001
   36: 00000007
   37: 20020010 gcall t16 1 t12 t9
   38: 000C0001
   39: 00000009
   40: 2C060403 movsr t6 d3
   41: 2C020010 movsr t2 t16
   42: 10000002 end t2
   43: 10000002 end t2
instruction count:
   24
entry point:
    4
data:
    0: ack
    1: 0
    2: 1
    3: t
syms:
    0: sys:b+
    1: ack
    2: sys:b-
code:
    0: 8C000021 close t2 0 9 33 2 2 nil t2 t3
    1: 00000002
    2: 00020002
    3: 00000009
    4: 00030002
    5: 3C00000B ifq t2 d1 11
    6: 00020401
    7: 20020004 gcall t4 0 t3 d2
    8: 00030000
    9: 00000402
   10: 10000004 end t4
   11: 3C000014 ifq t3 d1 20
   12: 00030401
   13: 20020005 gcall t5 2 t2 d2
   14: 00020002
   15: 00000402
   16: 20020004 gcall t4 1 t5 d2
   17: 00050001
   18: 00000402
   19: 10000004 end t4
   20: 20020006 gcall t6 2 t2 d2
   21: 00020002
   22: 00000402
   23: 20020007 gcall t7 2 t3 d2
   24: 00030002
   25: 00000402
   26: 20020008 gcall t8 1 t2 t7
   27: 00020001
   28: 00000007
   29: 20020004 gcall t4 1 t6 t8
   30: 00060001
   31: 00000008
   32: 10000004 end t4
   33: 10000002 end t2
instruction count:
   14
entry point:
    4
nil

Notes:

1
The wasteful t6 register arises because the pattern matching cases are compiled into a or expression which calculates a Boolean value: was there a match or not? See the expansion below. But in that specific situation, the Boolean value is never used; the result of each case is stored in a generated variable, and that variable's value is returned. When the or is compiled, its result is tracked in t6. At the higher optimization levels, this wasteful register use is identified and eliminated.
1> (macroexpand '(defun-match ack
                   ((0 @n) (+ n 1))
                   ((@m 0) (ack (- m 1) 1))
                   ((@m @n) (ack (- m 1) (ack m (- n 1))))))
(defun ack (#:arg-00012
            #:arg-10013)
  (let (#:result0014)
    (or (sys:when-exprs-match     ;; result of this (or ...) is t6
          (0 @n) (#:arg-00012
                  #:arg-10013)
          (set #:result0014
            (progn (+ n 1)))
          t)                      ;; "mov t6 d3"
      (sys:when-exprs-match
        (@m 0) (#:arg-00012
                #:arg-10013)
        (set #:result0014
          (progn (ack (- m 1)
                  1)))
        t)
      (sys:when-exprs-match
        (@m @n) (#:arg-00012
                 #:arg-10013)
        (set #:result0014
          (progn (ack (- m 1)
                  (ack m (- n 1)))))
        t))
    #:result0014))
2
For reference, here is the non-pattern matching version of ack. It compiles and optimizes to the same 14 instruction sequence, with some minor differences. The pattern matching version specifies a larger frame size of 9 for the closure: in the close instruction we see a 9 rather than 8. This is because it uses more registers: up through t8. The code block for the main recursive case uses registers t6, t7 and t8, whereas the same code block from the non-pattern-matching version uses t5, t6 and t7. That use of registers, though smaller, is far from minimal; several more registers could be eliminated through reuse. The TXR Lisp compiler and optimizer implement some measures to reduce the number of registers, but not doggedly so. Because a large number of registers is available, a simple register allocation algorithm is used. When the optimizer eliminates registers, it renumbers the remaining ones so they are consecutive, and shrinks the frame size.
 1> (defun ack (m n)
     (cond
       ((eql 0 m) (+ n 1))
       ((eql 0 n) (ack (- m 1) 1))
       (t (ack (- m 1) (ack m (- n 1))))))
ack
2> (compile 'ack)
#<vm fun: 2 param>
3> (disassemble 'ack)
data:
    0: ack
    1: 0
    2: 1
syms:
    0: sys:b+
    1: ack
    2: sys:b-
code:
    0: 8C000021 close t2 0 8 33 2 2 nil t2 t3
    1: 00000002
    2: 00020002
    3: 00000008
    4: 00030002
    5: 4000000B ifql d1 t2 11
    6: 04010002
    7: 20020004 gcall t4 0 t3 d2
    8: 00030000
    9: 00000402
   10: 10000004 end t4
   11: 40000014 ifql d1 t3 20
   12: 04010003
   13: 20020005 gcall t5 2 t2 d2
   14: 00020002
   15: 00000402
   16: 20020004 gcall t4 1 t5 d2
   17: 00050001
   18: 00000402
   19: 10000004 end t4
   20: 20020005 gcall t5 2 t2 d2
   21: 00020002
   22: 00000402
   23: 20020006 gcall t6 2 t3 d2
   24: 00030002
   25: 00000402
   26: 20020007 gcall t7 1 t2 t6
   27: 00020001
   28: 00000006
   29: 20020004 gcall t4 1 t5 t7
   30: 00050001
   31: 00000007
   32: 10000004 end t4
   33: 10000002 end t2
instruction count:
   14
entry point:
    4
ack