More Optimizations in Perl 5.20-to-be

In a recent post on his blog, Matthew Horsfall explained a few of the optimizations he recently contributed to Perl. On this site, we reported on the fantastic work that Dave Mitchell is doing. The two of them inspired me to try my hand at this optimization game during a recent hackathon at work by improving the Perl optimizer.

But first, let's discuss a bit about how the Perl compiler works and its work products.

The output of the Perl compiler is two-fold. The first part is an in-memory data structure that resembles a traditional compiler's Abstract Syntax Tree. This is called the OP tree, since it consists of struct OPs that generally represent some OPeration to be performed. The second is a linked list that connects (some of) these OPs in the order in which they are to be executed[1].

So for the code my $a = $b + 3, we have this tree

sassign-+-add-+-padsv[$b]
        |     `-const[3]
        `-padsv[$a]

where sassign indicates scalar assignment, and padsv is a lexical variable access. It helps to think of perl as a stack machine, so the linked list for execution would be this sequence (with annotations on what they actually do):

padsv[$b]   (push $b on stack)
const[3]    (push 3 on stack)
add         (pop 3 and $b off of stack, push their sum on stack)
padsv[$a]   (push $a on stack)
sassign     (pop $a and the sum off of stack, set $a)
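The stack-machine execution above can be sketched as a toy interpreter. This is purely an illustration (written in Python; all names and representations here are invented, not perl's actual internals): variables are pushed by name and resolved against a "pad" dictionary.

```python
# Toy model of perl's stack machine running the execution chain for
# my $a = $b + 3 (all names invented for illustration)

stack = []
pad = {"$a": None, "$b": 5}        # the lexical "pad"; pretend $b holds 5

def lookup(x):
    # variables are pushed by name here; constants by value
    return pad[x] if isinstance(x, str) else x

def padsv(name):                   # push a lexical variable on the stack
    stack.append(name)

def const(value):                  # push a constant on the stack
    stack.append(value)

def add():                         # pop two items, push their sum
    right, left = stack.pop(), stack.pop()
    stack.append(lookup(left) + lookup(right))

def sassign():                     # pop target and value, set the target
    target, value = stack.pop(), stack.pop()
    pad[target] = lookup(value)

# the execution order from the linked list above:
padsv("$b"); const(3); add(); padsv("$a"); sassign()
print(pad["$a"])   # 8
```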

Things get just a tiny bit more complicated if you consider operations that take a variable number of items as parameters such as taking a slice of an array. Consider @x[1,2]

aslice-+-pushmark
       |-list-+-pushmark
       |      |-const[1]
       |      `-const[2]
       `-padav[@x]

The array slice needs two pieces of information: a list of indexes to slice from the array, and the array itself. In Perl, the way to pass an arbitrary number of items to an operation is to remember the stack position (which is what pushmark does[2]) and then simply push the items onto the stack. The recipient simply looks at the most recently remembered stack position and pops its parameters off the stack until it reaches this position. aslice is such an operation that receives a variable number of parameters, so the first thing to do is to execute a pushmark. The execution order of operations here is:

pushmark (outer)
pushmark (inner)
const[1]
const[2]
list
padav
aslice
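The mark-stack mechanics can also be sketched as a toy model. Again, this is an invented Python illustration, not perl's actual C implementation; it just shows how aslice recovers its variable number of parameters from the most recently remembered stack position.

```python
# Toy model of the MARK stack and a variable-arity op (aslice);
# all names invented for illustration

stack = []
marks = []                              # the MARK stack

def pushmark():                         # remember the current stack depth
    marks.append(len(stack))

def const(value):
    stack.append(value)

def padav(array):                       # push the array itself
    stack.append(array)

def list_op():                          # list context: pop the mark, keep items
    marks.pop()

def aslice():
    mark = marks.pop()                  # everything since the mark is ours
    *indexes, array = stack[mark:]
    del stack[mark:]
    stack.extend(array[i] for i in indexes)

x = [10, 20, 30, 40]
pushmark()                              # outer mark, for aslice
pushmark()                              # inner mark, for list
const(1); const(2)
list_op()
padav(x)
aslice()
print(stack)                            # [20, 30], i.e. @x[1,2]
```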

This is where it gets a bit weird. One would expect the inner pushmark and list to be unnecessary: simply execute the two consts and the padav directly. That would correspond to this tree

aslice-+-pushmark
       |-const[1]
       |-const[2]
       `-padav[@x]

and this execution order:

pushmark
const[1]
const[2]
padav
aslice

The list operation has a very simple purpose: it looks up the most recently remembered stack position and checks which context it was invoked in (context in the usual Perl sense). If it was invoked in list context, it will simply do nothing and leave the items on the stack. If it was invoked in scalar context, it will leave only the last, topmost item on the stack. Let that sink in for a moment.

Indeed, in list context, list simply undoes the effect of one preceding pushmark. The two operations of the inner pushmark and the list cancel each other out entirely. This is one of the places where the OP tree's dual nature as an AST and as a data structure intended for the actual execution of the program shows.
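To make that concrete, here is a tiny toy model (Python, invented names) of the list op's two behaviors; the mark is the stack position remembered by the preceding pushmark:

```python
# Toy model of the list op: a no-op in list context, "keep only the
# topmost item since the mark" in scalar context

def list_op(stack, mark, context):
    if context == "scalar":
        stack[mark:] = stack[-1:]       # keep only the topmost item
    # in list context: nothing to do -- the items just stay put

s = [1, 2, 3]                           # three items pushed after a mark at 0
list_op(s, 0, "list")
print(s)                                # [1, 2, 3]
list_op(s, 0, "scalar")
print(s)                                # [3]
```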

And here is where my optimization comes into play: in Perl 5.20, the optimizer is now just a tiny bit smarter because it can detect list / pushmark pairs that will be no-ops. It modifies the execution chain so that they are not executed, while leaving the AST intact for decompilers and similar tooling. Because the internet has informed me that optimizations without benchmarks to "prove" their effectiveness are meaningless, let's indulge in some micro-benchmarking, while investing some effort into making the result somewhat meaningful. What we'll benchmark is simply the array slice @y = @x[3,7].

use 5.14.2;
use warnings;
use Dumbbench;
use Dumbbench::CPUFrequencyPinner;

my $opt = '/path/to/optimized/perl';
my $unopt = '/path/to/unoptimized/perl';

my $code = <<'HERE';
my @x = (1..8);
my @y;
@y = @x[3,7] for 1..1e6;
HERE

my $db = Dumbbench->new(
  target_rel_precision => 0.005,
  initial_runs         => 20,
);

$db->add_instances(
  Dumbbench::Instance::Cmd->new(command => [$unopt, '-e', $code],
                                name => "before"),
  Dumbbench::Instance::Cmd->new(command => [$opt, '-e', $code],
                                name => "after"),
);

SCOPE: {
  # Pin CPU frequency down to reduce variability
  my $p = Dumbbench::CPUFrequencyPinner->new;
  $SIG{INT} = sub {undef $p; exit;};
  $p->set_max_frequencies($p->min_frequencies->[0]);
  $db->run;
}

$db->report;
$db->box_plot->show();

Benchmark results for array slicing

This type of graph is generally referred to as a box plot; it shows the measurement results with additional information about the variability of the repeated measurements. "after" refers to a perl built from commit 7d3c8a6837b55fff whereas "before" is a perl built from the commit prior to that. This particular example benchmark shows a statistically significant speed-up of 8% in slicing arrays. You can see similar speed-ups in hash slices, certain map blocks, and a few other list-type operations. For a day's work on a very mature code base, that's not too shabby at all!
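The real optimization lives in Perl's C-level peephole optimizer, so as a closing illustration, here is only a rough Python sketch of the idea described above: splice the no-op pushmark/list pair out of the execution chain (the op_next links) while the tree itself is left alone. The Op class and helper functions are invented for this sketch.

```python
# Toy sketch: relink the execution chain so a no-op pushmark/list pair
# is never executed, without touching the tree structure

class Op:
    def __init__(self, name):
        self.name = name
        self.next = None                # execution-order link ("op_next")

def build_chain(ops):
    for a, b in zip(ops, ops[1:]):
        a.next = b
    return ops[0]

def elide(head, dead):
    """Relink next pointers so ops in `dead` are never executed."""
    while head in dead:
        head = head.next
    op = head
    while op is not None:
        while op.next in dead:
            op.next = op.next.next
        op = op.next
    return head

outer, inner = Op("pushmark"), Op("pushmark")
ops = [outer, inner, Op("const[1]"), Op("const[2]"),
       Op("list"), Op("padav[@x]"), Op("aslice")]
head = build_chain(ops)
head = elide(head, {inner, ops[4]})     # drop the inner pushmark and list

names = []
op = head
while op is not None:
    names.append(op.name)
    op = op.next
print(names)   # ['pushmark', 'const[1]', 'const[2]', 'padav[@x]', 'aslice']
```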

[1] Technically speaking, there are many such linked lists since a logical operation will create a branch. So it's really a tree. But many stretches of code will be contained in linear sections, so please excuse me for the slight oversimplification.

[2] The name pushmark stems from the fact that remembering multiple stack positions in nested operations is done with another stack, the MARK stack. So pushmark does exactly what it says on the tin.
