A/B Testing - Concept != Execution

“We tested that, and it failed.”

This typical excuse is rampant in the world of A/B testing, but it can overlook the fact that a concept in and of itself is fundamentally different from the execution of a concept. Ideas often resurface over time. Ones that failed before tend to be labeled as failures, and they never make it out of the gate a second time.

Yeah, I thought of that years ago… Tried it though, and it didn’t work.

This quick shoot-down mentality can be harmful if not checked. Because Booking.com has been performing A/B tests for about a decade now, it sometimes seems that everything has already been tried. Though we've done a lot, failed many times and won a bit along the way too, there's still so much more we can improve for our customers and their experience on the site. That’s why my reaction to dismissive statements like this is typically, “OK, then… What exactly did you try and how long ago?”

Did their A/B test approach the overall idea in a similar way to the proposed new way? And if so, how long did it run? How did it affect user behavior? And are they 110% certain there were no bugs or usability issues introduced in their implementation?

There are far more ways to fail than there are to succeed.

I have a whole laundry list of questions I ask when I hear that a solid concept has failed. This list stems from my experience that there are far more ways to fail at something than there are to succeed.

That statement is rather pessimistic – and with good reason. I’ve done enough A/B tests (from generating the initial concept to implementing it from a nuts-and-bolts technical perspective) to grasp the number of moving parts that could potentially lead to a good idea’s demise.

A seemingly “insignificant change” made off-the-cuff or a hard-to-identify design flaw could have an impact just negative enough to counteract whatever positive effect your change might be having.

Here are some concrete examples of what can make good ideas fail:

Increase in page-load time due to a less-than-ideal technical implementation
Did you add some big images, heavy CSS, or some poorly performing JS? Have you kept an eye on any new errors that might have cropped up?

Keep a very close eye on all vital site statistics because changes “unseen” to users are just as impactful as content and visual changes.

Slightly wrong choice of color, typeface, or font size on key elements
Is the most important information eye-catching and legible?

Even something as seemingly insignificant as a serif font used in the wrong place can have a negative impact.

Improper size or placement of a feature in relationship to other content on the page
Does the thing you’re adding or redesigning take attention away from another key element on the page? Did you remove something else to “make space” for the new feature?

You can’t add, remove, or change anything on a page without it affecting how people interact with everything else. Designers tend to have laser-focus on the new stuff they’re doing and forget it changes how stuff around it is used, too.

Poor timing
Have you implemented a tooltip that disappears after a certain length of time? Are you adding content that only makes sense during a particular time of the year?

Showing certain types of information too soon, too late, or not giving people enough time to absorb content can sometimes have the opposite of your intended effect.

Bugs in edge-case scenarios
Is there a rendering issue in a lesser-used browser or device? Is it just as usable on a tablet as it is on a desktop? If the website is multilingual, is everything translated properly and correctly localised?

If enough of these edge cases combine, their cumulative effect could be negative. This, however, is one of the benefits of a comprehensive A/B testing framework! You can use analytics to see the user agents, browser types, and countries where a test is failing. Each user deserves to have a good experience on your website, and resolving the issues you identify in the data can push the results over the edge. A/B testing can help ensure high quality for everyone.
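As a rough illustration of that kind of breakdown, here is a minimal sketch in Python. It assumes you can export one row per visitor with a segment label (a device type, browser, or country), the variant they saw, and whether they converted. The function names and the sample rows are hypothetical and not tied to any particular analytics tool.

```python
from collections import defaultdict

def conversion_by_segment(rows):
    """rows: iterable of (segment, variant, converted) tuples.
    Returns {segment: {variant: conversion_rate}}."""
    # For every segment and variant, keep [conversions, visitors].
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for segment, variant, converted in rows:
        cell = counts[segment][variant]
        cell[0] += int(converted)
        cell[1] += 1
    return {
        segment: {variant: conv / n for variant, (conv, n) in variants.items()}
        for segment, variants in counts.items()
    }

def lagging_segments(rates, base="A", variant="B"):
    """Segments where the variant converts worse than the base."""
    return sorted(
        segment for segment, r in rates.items()
        if base in r and variant in r and r[variant] < r[base]
    )

# Hypothetical export: (segment, variant, converted)
rows = [
    ("desktop", "A", True), ("desktop", "A", False),
    ("desktop", "B", True), ("desktop", "B", True),
    ("tablet", "A", True), ("tablet", "A", True),
    ("tablet", "B", False), ("tablet", "B", True),
]

rates = conversion_by_segment(rows)
print(lagging_segments(rates))  # segments worth a closer look
```

A segment that shows up here is only a lead, not a verdict. Small segments are noisy, so treat the list as a prompt to go and look at the page in that browser or language rather than as proof of a bug.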

Making large, small and/or unnecessary changes not inherently linked to the raw concept, which may have unintended consequences
Did you change a line of copy, while also making the words bigger/smaller or adding new color?

If so, then you’ve tainted the concept you’re testing. I’ve seen a slight increase in font size while testing a color change cause a solid and repeatable concept to fail.

Noisy tracking
Are you testing a new flow or a different interaction? Is the content you’re adding or changing not immediately visible on the page?

Track users only when they are actually exposed to the visual change or interact with the element that contains the different behavior. If you’re tracking people who haven’t seen or used the concept you’re testing, then they become statistical noise and dilute the results. Enough noise in your tracking and you can’t hear what your users are trying to tell you.
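To make the dilution concrete, here is a small back-of-the-envelope sketch in Python. It rests on a simplifying assumption that is mine, not something measured: visitors who never see the change convert at the same base rate in both groups. The function name and the numbers are purely illustrative.

```python
def diluted_rates(base_rate, true_lift, exposed_fraction):
    """Conversion rates you would observe if every visitor is tracked
    but only `exposed_fraction` of the variant group sees the change.

    Assumes unexposed visitors convert at `base_rate` in both groups.
    """
    control = base_rate
    exposed_rate = base_rate * (1 + true_lift)
    variant = (exposed_fraction * exposed_rate
               + (1 - exposed_fraction) * base_rate)
    return control, variant

# Illustrative numbers: 5% base conversion, a real 10% relative lift,
# but only 1 in 5 tracked visitors ever reaches the changed element.
control, variant = diluted_rates(0.05, 0.10, 0.2)
observed_lift = variant / control - 1
print(f"observed relative lift: {observed_lift:.1%}")
```

Under these assumptions, the 10% improvement among people who saw the change shows up as only about 2% across everyone tracked, which is far harder to tell apart from random variation.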

Who’s the audience?
Who exactly sees the change? Is it exposed to all visitors? Did they come from an email that set a certain expectation of what they would see? Have they come from a paid ad or an organic search? Have they typed in the URL directly? Are they newcomers or returning visitors? Have they made a purchase before? 

A customer’s point of entry and their historical use of the site affect how they interact with content. The more you're able to target messaging to the most relevant users, the more likely you are to create meaningful interactions that make metrics move.

Low traffic
Did your test include only a small group of users that were exposed to a very tiny change?

To pick up a significant result on a low-traffic website, your changes need to be bigger and bolder. In other words, your idea might actually be working, but you just can’t see it in the numbers. The size of the changes you make needs to relate to the amount of traffic you have.
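The link between traffic and the size of a change can be sketched with a standard sample-size estimate. The Python below uses the common normal-approximation formula for comparing two conversion rates. The 5% significance level, 80% power, and the example rates are assumptions chosen for illustration, and the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def visitors_per_variant(base_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate visitors needed in each group to detect a relative
    lift in conversion rate with a two-sided test.

    Uses the normal approximation for two proportions.
    """
    p1 = base_rate
    p2 = base_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Illustrative numbers: 5% base conversion rate.
small = visitors_per_variant(0.05, 0.02)  # subtle change, 2% relative lift
bold = visitors_per_variant(0.05, 0.20)   # bold change, 20% relative lift
print(small, bold)
```

With these inputs, the subtle change needs hundreds of thousands of visitors per variant, while the bold one needs fewer than ten thousand. On a low-traffic site, only the second kind of test has a realistic chance of producing a readable result.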

An idea ahead of its time?
When did you test this concept? Has a decent amount of time passed, but the problem still hasn’t been solved?

What doesn’t work today might work tomorrow, and what worked yesterday might be holding you back today.

The products we design must be just as dynamic as the people we design for.

People are dynamic, and their expectations change as they, and the world around them, evolve. That’s why the products we design have to be just as dynamic.

Designers tend to be ahead of the curve because we keep our fingers on the pulse of what’s going on. Design trends, the newest HTML & CSS tricks, and fancy technology integration might seem cool to us, but most of the time what we expect is far different from what typical users feel comfortable with.

We are designing products for normal people to use TODAY.

I always try to remind myself that I’m designing products for normal people to use TODAY.

People who weren’t used to swiping gestures last year might expect to see them this year. A fancy line of code that used to crash browsers could solve that problem a few years later as hardware becomes more powerful.

These are just a few of the most common issues I’ve stumbled upon during my time designing with data.

That’s why gaining an incredible depth of understanding of the high-level concept you’re testing—as well as having a solid grasp on the complexity of your system coupled with a flawless implementation of the solution—is imperative to an idea’s final success (or failure).

Here are some things to keep in mind for when you set up your next A/B test:

  1. Remember the importance of carefully navigating through complexity to cleanly test your concept.

  2. The results, be they positive, negative, or neutral, can help form future iterations of the same concept or can offer you insights into new hypotheses to be tested.

  3. Hold yourself to a high standard of quality—even when in an easy-win situation.

  4. Every A/B test, regardless of its size or scope, should get the same amount of care.

  5. Understand that a negative or neutral result doesn’t necessarily mean “no.” These results can also possibly mean, “Not quite right” or “Not quite yet.” The more you test, the more you’ll be able to spot when “no” actually means “no.”

But wait! There’s more…

Sometimes, however, a concept is so strong that it can survive even the worst of executions. I’m sure you’ve experienced examples of features or functionalities on major websites that are incredibly useful but lack visual refinement and/or have some unfortunate usability issues. This often leads me to a moment of face-palming.

Sometimes a poor implementation keeps a good idea from succeeding. Conversely, a great idea can succeed in spite of a poor implementation.

So, it goes both ways. Sometimes a poor execution keeps a great idea from succeeding, but sometimes a great idea succeeds in spite of a haphazard implementation.

The difference between “average” and “exceptional” data-driven designers is that the exceptional ones realize that “concept != execution.”
