Using multivariant tests to determine performance impact

When we introduce new features on our website, sometimes it’s not simply the behaviour of our users that changes. The behaviour of our own systems can change, too.

For example: A new feature might improve conversion (changing a user’s behaviour) while also slowing down our site rendering (changing the behaviour of our systems). This becomes interesting when you realize that the second effect might influence the first – a rendering slowdown might decrease conversion, for instance.

Sometimes, the opposing effects that turn up in our results make for interesting investigation and developments. At Booking.com, we’ve found a way to use multivariant testing to quantify these opposing effects.

Moving forward from traditional A/B testing

We don't change anything on our website without first validating it (traditionally through A/B split testing) and all but a select few of our experiments run on 100% of our eligible traffic. Small CSS changes, the introduction of new features, and even infrastructure changes like a new availability search engine, a new translation infrastructure, or a backend Perl version upgrade all must first go through the same testing.

It should therefore come as no surprise that we run more than a thousand experiments in parallel. Our in-house experimentation platform is deeply integrated into the foundations of our stack to make this possible. And many of the new features that go through the experimentation process involve changes to several layers of that stack.

Suppose we want to A/B test the effects of adding new useful data to our country landing page (the page where we hope you land after searching your favourite search engine for "Hotels in Italy"). We think that our visitors might be interested in having some Italian visa information available on that page. Such a feature needs a new data store for the visa information, extra effort to collect that data, UX design, and it won’t even be available for all our visitors as it probably won't be fully-formed in the beginning.

For the sake of argument, let's say that getting the Italian visa information is a kind-of-expensive operation. Let's also say that we don't have a whole lot of this content created yet – maybe we only have visa information written up for 10% of the countries our visitors want.

So, before investing effort into optimizing the data store or extending the data set, we'd like to run an A/B test to find out whether our visitors actually even like the feature. Here's the first way we could choose to implement this experiment (let's call it 'Experiment 1'):

Experiment 1: The simple A/B test

# data store query, introduces some performance overhead
have_data = get_visa_data()

# track_experiment() returns False if this user is in the control group, and
# True for the treatment group
if have_data and track_experiment("show_visa_data"):
    render_visa_data()

What's good about this setup is that differences in metrics (such as the number of bookings made or customer service interactions) between the unchanged original and the experiment variant tell us whether users liked this feature.

Unfortunately, it doesn't tell us anything about the business impact of having this feature: the performance impact of ‘get_visa_data()’ may be driving some users away from our website, and we’re not measuring that in the way we’ve decided to set up Experiment 1.

You might then choose to address this issue by implementing the experiment in a different way.

Experiment 2: The A/B test refined

if track_experiment("show_visa_data"):
    have_data = get_visa_data()
    if have_data:
        render_visa_data()

This experiment accounts for and measures business impact, which Experiment 1 couldn’t do. But there's a downside: if we have visa data for only 10% of our visitors, we would expose only 5% of the visitors in this experiment to the visible change that might entice them to convert better. It’s likely that this dilutes the effect so much as to be unmeasurable.
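A quick back-of-the-envelope calculation makes the dilution concrete (all numbers here are hypothetical, chosen only for illustration):

```python
# Hypothetical numbers, for illustration only.
baseline_conversion = 0.040  # assumed base conversion rate
true_lift = 0.05             # assumed +5% relative lift among visitors who see the feature
data_coverage = 0.10         # visa data exists for only 10% of visitors

# Only visitors with data see the change, so the lift we can measure
# across the whole variant is diluted by the coverage factor.
measured_lift = true_lift * data_coverage
variant_conversion = baseline_conversion * (1 + measured_lift)

print(f"measured relative lift: {measured_lift:.2%}")  # 0.50% instead of 5%
print(f"variant conversion:     {variant_conversion:.4f}")
```

A 0.5% relative change in conversion is far harder to detect than the 5% effect we are actually interested in.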

In other words, Experiment 2 is very likely to come out inconclusive. It might even be negative, due to the performance cost incurred for all users in the variant. If it is inconclusive (or negative), that might mean one of these two cases:

  1. Our users don't care about visa information at all;
  2. Our users love visa information, but that effect was diluted by the low availability of the data and/or negated by the negative impact of the performance cost.

It would be very valuable if we could tell (1) from (2) as it would influence our decision-making process and point at what we should do next. In case (1), we would abandon the idea and decide to better focus our effort on the next idea that comes along. In case (2) however, we can invest time in extending the data set and/or optimizing the data store.

We found a solution to this that has since become quite commonplace in our organization. Here's what we do: instead of running an A/B test, we run a multivariant test (an "A/B/C" test):

Experiment 3: The multivariant test

  A. No change;
  B. Get the visa data. Even if there's data, don't render it;
  C. Get the visa data. If there's data for this visitor, render it.
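In the same pseudocode style as before, the three-way split could look like the sketch below. `track_experiment_variant()` and the uniform assignment are assumptions here, not our actual platform API; the stubs just make the sketch self-contained:

```python
import random

# Sketch of Experiment 3. track_experiment_variant() is a hypothetical
# stand-in for our platform's group-assignment call, and get_visa_data()
# / render_visa_data() are stubs so the sketch runs on its own.

def track_experiment_variant(name, variants):
    """Assign this visitor to one of the named groups (uniform split)."""
    return random.choice(variants)

def get_visa_data():
    """Placeholder for the kind-of-expensive data store query."""
    return random.random() < 0.10  # data exists for ~10% of visitors

def render_visa_data():
    print("rendering the visa information block")

variant = track_experiment_variant(
    "show_visa_data", ["base", "fetch_only", "fetch_and_render"])

if variant == "fetch_only":
    # (B): pay the performance cost of the lookup, but never render.
    have_data = get_visa_data()
elif variant == "fetch_and_render":
    # (C): pay the lookup cost and render whenever data exists.
    have_data = get_visa_data()
    if have_data:
        render_visa_data()
# (A) "base": no change at all, not even the data lookup.
```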

The comparison between (A) and (C) now gives the same data as Experiment 2 from the example above and tells you about the complete business impact (the one drawback of this three-way split is that it comes with a slight loss of statistical power compared to a 50/50 A/B split).
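To put a rough number on that loss of power: the standard error of a difference between two groups scales with sqrt(1/n1 + 1/n2), so a simplified, sample-size-only estimate says each pairwise comparison in an equal three-way split has a standard error about 22% larger than in a 50/50 split:

```python
import math

N = 1.0  # total eligible traffic, in relative units

def se_scale(n1, n2):
    """The standard error of a difference between two groups scales
    with sqrt(1/n1 + 1/n2)."""
    return math.sqrt(1 / n1 + 1 / n2)

se_ab = se_scale(N / 2, N / 2)   # 50/50 A/B split
se_abc = se_scale(N / 3, N / 3)  # any pairwise comparison in an equal A/B/C split

print(f"standard error inflation: {se_abc / se_ab:.2f}x")  # 1.22x
```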

The comparison between (A) and (B) tells us about the impact of doing the data lookup. In other words, it tells us whether there may be something to gain by improving performance there.

The comparison between (B) and (C) tells us whether users for whom we had data liked the feature. As it stands, we are quite unlikely to detect this difference, since only 10% of visitors in either group have data available, and so only that fraction of (C) sees the visual change. Fortunately, our platform makes it very easy to "zoom in" on this 10% subgroup and produce a full experiment report that includes only them, giving us much better odds of detecting a difference. In effect, this gives us the same comparison as Experiment 1, without any of the downsides.
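A toy simulation (with made-up coverage, conversion, and lift numbers) shows why zooming in on the subgroup recovers the undiluted comparison:

```python
import random

random.seed(42)

COVERAGE = 0.10    # visa data exists for 10% of visitors (hypothetical)
BASE_RATE = 0.040  # assumed baseline conversion rate
LIFT = 1.25        # assumed +25% relative lift for visitors who see the feature
N = 200_000        # visitors per group

def simulate(group):
    """Simulate one group; return totals for all visitors and for the
    'data available' subgroup."""
    conv_all = conv_data = n_data = 0
    for _ in range(N):
        has_data = random.random() < COVERAGE
        # Only group C renders the feature, and only when data exists.
        rate = BASE_RATE * (LIFT if (group == "C" and has_data) else 1.0)
        converted = random.random() < rate
        conv_all += converted
        if has_data:
            n_data += 1
            conv_data += converted
    return conv_all, conv_data, n_data

b_all, b_data, b_n = simulate("B")
c_all, c_data, c_n = simulate("C")

print(f"all visitors:   B {b_all / N:.4f} vs C {c_all / N:.4f}")        # diluted effect
print(f"data subgroup:  B {b_data / b_n:.4f} vs C {c_data / c_n:.4f}")  # full effect
```

Comparing whole groups, the effect is buried in the 90% of visitors who never saw the change; restricted to the subgroup that had data, the full difference is visible.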

This Experiment 3 multivariant setup adds another exciting possibility. While normal A/B tests can only give us a yes/no decision about enabling the feature for all our users, the new setup has an additional possible outcome akin to saying that a feature is promising but needs improvement.

As a company, we take pride in taking small steps towards optimizing our website, measuring along the way and learning from every result. This multivariant experimentation setup has proved to be a great resource in our toolbox with which to do just that.
