B.

How We Spent Two Days Making Perl Faster

This is a story about a significant new optimization to the Perl interpreter. It is a story about battling code complexity. And it is a story about wanting to have our cake and eat it too.

A recent Booking.com hackathon provided us the opportunity to investigate speeding up integer allocation in the Perl interpreter. If successful, this could optimize nearly every program we run. We discovered that a naive implementation could work, but would make the code a lot more difficult to maintain. Our path lead us to attempt to leverage the C preprocessor to improve code clarity while opening doors to real gains in program execution speed.

First, some background

As described in perlguts and PerlGuts Illustrated, representations of variables in Perl are usually composed of two parts: a head struct and (optionally) a body struct. The head contains the essential “internal book-keeping” common to all variables (regardless of type) including a pointer to the (optional) body struct. Below is an image of the layout of fields in a head struct as depicted in PerlGuts Illustrated:

SV Head (PerlGuts Illustrated)

The body struct can be quite different, depending upon the type of the variable. The most simple type of variable is the SvNULL, which represents undef and does not need a body struct at all.

A string (called PV for “pointer value”) has a body type of XPV:

SvPV (PerlGuts Illustrated)

But the body struct of a PV is different to a body struct of a PVNV. A PVNV can hold a floating point number and a string representation of that same value:

SvPVNV (PerlGuts Illustrated)

One benefit of this type of design is that all references to the value point to the memory location of the head. As such, Perl is free to change which memory is being used to represent a value by changing the body struct, without having to update any pointer except the pointer contained within the head.

Changing types

Naturally, Perl has an internal function to convert between types of variable. This function is sv_upgrade (“scalar value upgrade”). In essence, whenever we have a variable of some type in Perl (for example, a simple integer) and we need to access it as a different type (for example, a string), the sv_upgrade will convert the type of the variable (for example, into a type that contains both integer and string representations of a value). This change may involve replacing the body struct with a larger struct.

To see how sv_upgrade is implemented, we can look at the Perl_sv_upgrade function in sv.c. We see that this function encapsulates a lot of complexity. There are many comments, which describe subtle corner cases. Since it can take a scalar value of essentially any type and convert it to something capable of representing any other type, perhaps it’s unsurprising that there’s a lot of code here.

Without going line-by-line, an overview of how this function is implemented may be useful. First, there is a switch based on the current type of variable, to determine what it needs for the new type. Shortly thereafter, there is a second switch based on the new type. Inside of the second switch block, there are numerous if blocks for doing different things depending on the old type. Finally, after the new body struct has been set up, and the head struct contains all the correct flags, the memory used by the old body is freed.

Still with me? Good.

A naive optimization

The sv_upgrade function is called from a number of places. Not only is it called in places such as printing integers as strings, but it’s also called when assigning an integer value to a previously cleared variable.

A previously cleared variable is always an undef with no body struct. The reason sv_upgrade is called is to allow the correct setup for the body to occur. This was a very sensible design decision, as it centralizes a lot of “internal book-keeping” behavior, rather than duplicating it. The cost of this centralization is performance: some generic (and in this case superfluous) code is executed. For example, when creating a new integer, sv_upgrade will run an unnecessary check for a conversion of a larger type to a smaller type.

Integer assignment to a cleared variable occurs so frequently that one would imagine it might be worth duplicating the required complexity to achieve the performance improvement. Thus we decided to evaluate that trade-off for integer allocation. After a review of the nearly 300 lines of sv_upgrade internals, we saw that we could remove the call entirely if we could “hoist out” just the essential two lines of code! However, there is a very good reason this had not been done before. Let's look at the two lines.

The first line (since we know the new type) is easy:

SvFLAGS(sv) |= new_type;

But the other line is quite complex:

SvANY(sv) = (XPVIV*)((char*)&(sv->sv_u.svu_iv) - STRUCT_OFFSET(XPVIV, xiv_iv));

If your head is swimming after looking at that last line of code, don’t worry: you’re in good company. Granted, this is described in Illustrated perlguts:

Since 5.10 for a raw IV (without PV) the IVX slot is in the HEAD, there is no xpviv struct ("body") allocated. The SvIVX macro abuses SvANY pointer arithmethic to point to a compile-time calculated negative offset from HEAD-1 to sv_u.svu_iv, so that PVIV and IV can use the same SvIVX macro.

But, even once I thought I maybe understood what that complex line of code supposedly did, I made Steffen patiently sit for more than 15 minutes as I convinced myself with paper and pencil that it actually did what he had described. After that, the drawing from Illustrated Perl Guts made more sense to me:

Bodyless SVIV (PerlGuts Illustrated)

What’s more, I finally understood that this complexity exists in order to avoid executing an if statement that would otherwise be called every time the value is retrieved!

Thus, the previous design trade-offs seemed like the right ones. Yes, we could make Perl faster, but at the cost of leaking a lot of complexity into another part of the code. That complexity would make any additional development much more difficult in any future work.

Having our cake and eating it too

We wanted to encapsulate that very complex bit of code, but without any added runtime performance cost. Since this is C, we looked at using the preprocessor to push the complexity behind a macro, in much the same way that with other languages we might move some hairy lines of code behind a well named function or method:

#define SET_SVANY_FOR_BODYLESS_IV(sv) \
    SvANY(sv) = (XPVIV*)((char*)&(sv->sv_u.svu_iv) - STRUCT_OFFSET(XPVIV, xiv_iv))

And, of course, the advantage of using a macro (rather than a function) is that the cost is paid entirely at compile time, and thus there is zero additional runtime performance lost.

By introducing the macro, the readability of the code would be improved in several places.

So, how did this change our situation? With the macro, the two lines we wanted to “hoist out” looked a lot less complex. All we would need would be a patch which replaces this one computationally heavy function call:

sv_upgrade(dstr, SVt_IV);

with just these two lines:

SET_SVANY_FOR_BODYLESS_IV(dstr);
SvFLAGS(dstr) |= SVt_IV;

By factoring the complexity differently, we opened the doors to a different set of performance/complexity cost trade-offs. We would leak only a small amount of complexity relative to the gain in performance. But is it worth it? With these changes, the cost would be low, but what would the actual benefit be?

Measuring the gain

The micro-benchmark we used is very heavy on exercising one particular code path, however it is a common code path.

  $ dumbbench -i50 --pin-frequency -- \
    ./perl -Ilib -e \
    'for my $x (1..1000){my @a = (1..2000);}'

Here are the results of the micro-benchmark before the optimization:

  Rounded run time per iteration: 2.4311e-01 +/- 1.4e-04

And here are the results after the optimization:

  Rounded run time per iteration: 1.99354e-01 +/- 5.5e-05

That’s 18% faster.

Results

Overall, this hackathon project was a success.

With these measurements, we demonstrated that the optimization was worthwhile. The complexity cost is pretty close to zero: in some places the Perl core is slightly more complex, but the internals of sv_upgrade are certainly more clear. Additionally, we found there were other similar potential optimizations that could leverage the same technique. In the end, five related patches were pushed to Perl:

When Perl 5.22 ships, in part because of this work, many real-world programs will be faster.

comments powered by Disqus