The Paradox at the Heart of A/B Testing

A/B testing began with beer.

At the turn of the 20th century, William Sealy Gosset was exploring new ways of running experiments on his production line. Gosset was trying to improve the quality of Guinness’s signature stout but couldn’t afford to run large-scale experiments on all the variables in play. Fortunately, in addition to being an astute brewer, Gosset was a statistician; he had a hunch there was a way of studying his small snatches of data to uncover new insights.

Gosset took a year off from his work to study alongside another scientist, Karl Pearson. Together, they developed a new way of comparing small data sets to test hypotheses. They published their work in the leading statistics publication at the time, Biometrika. In “The Probable Error of a Mean,” the t-test, a cornerstone of modern statistical analysis, was born.

Gosset’s scientific approach was the foundation of a 38-year career with Guinness. He invented more ways of using statistics to make business decisions; founded the statistics department at Guinness; led brewing at Guinness’s newest plant in 1935; and finally, in 1937, became the head of all brewing at Guinness.

Since its early days as a tool of science (and beer), statistical decision-making has gone supernova. Today, it is used by every major tech company to make hundreds of thousands of decisions every year. Data-driven tests decide everything from the effectiveness of political ads to a link’s particular shade of blue. New methods like Fisher testing, multivariate testing, and multi-armed bandit testing are all descendants of Gosset’s early innovations. The most popular of these statistical tests is one of the oldest: A/B testing.

An A/B test is a measurement of what happens when you make a single, small, controlled change. In product design, this means changing something in the interface, like the color of a button or the placement of a headline. To run an A/B test, show an unchanged version of the interface (“A”) to a randomly-selected group of users. Show a changed version (“B”) to another randomly-selected group. Measure the difference in the behavior of the two groups using a t-test, and you can confidently predict how the changed version will perform when shown to all your users.
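
As a rough sketch of that analysis step, here's what the comparison might look like in Python using SciPy's two-sample t-test; the conversion data and group sizes below are invented for illustration.

```python
# Minimal sketch of analyzing an A/B test with a two-sample t-test.
# Each value is a hypothetical per-user outcome: 1 if the user converted, 0 if not.
from scipy import stats

group_a = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0]  # control ("A")
group_b = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]  # variation ("B")

# Welch's t-test: doesn't assume the two groups have equal variance.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)

# A small p-value (conventionally below 0.05) suggests the difference
# between the groups is unlikely to be due to chance alone.
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```

In practice the groups would be far larger, and you'd fix the sample size and significance threshold before the test starts rather than peeking as results come in.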

A/B tests are easy to understand, which explains their popularity in modern software development. But their simplicity is deceptive. The fundamental ideas of A/B tests contain a paradox that calls their value into question.

Fredkin’s paradox

In The Society Of Mind, Marvin Minsky explored a phenomenon that I experience every day as a designer: people often prefer one thing over another, even when they can’t explain their preference.

We often speak of this with mixtures of defensiveness and pride.

“Art for Art’s sake.”

“I find it aesthetically pleasing.”

“I just like it.”

“There’s no accounting for it.”

Why do we take refuge in such vague, defiant declarations? “There’s no accounting for it” sounds like a guilty child who’s been told to keep accounts. And “I just like it” sounds like a person who is hiding reasons too unworthy to admit.

Minsky recognized that our capriciousness serves a few purposes: We tend to prefer familiar things over unfamiliar. We prefer consistency to avoid distraction. We prefer the convenience of order to the vulnerability of individualism. All of these explanations boil down to one observation, which Minsky attributes to Edward Fredkin:

Fredkin’s Paradox: The more equally attractive two alternatives seem, the harder it can be to choose between them—no matter that, to the same degree, the choice can only matter less.

Fredkin’s Paradox is about equally attractive options. Picking between a blue shirt and a black shirt is hard when they both look good on you. Choices can be hard when the options are extremely similar, too — see the previous link to Google’s infamous “50 shades of blue” experiment. The paradox is that you spend the most time deliberating when your choice makes no difference.

Parkinson’s law of triviality

In 1955, C. Northcote Parkinson wrote a book called Parkinson’s Law. It’s a satire of government bureaucracy, written when the British Colonial Office was expanding despite the British Empire itself shrinking. In one chapter, Parkinson describes a fictional 11-person meeting with two agenda items: the plans for a nuclear power plant, and the plans for an employee bike shed.

The power plant is estimated to cost $10,000,000. Due to the technical complexity involved, many experts have weighed in. Only two attendees have a full grasp of the budget’s accuracy. These two struggle to discuss the plans, since none of the other attendees can contribute. After a two-minute discussion, the budget is approved.

The group moves on to the bike shed, estimated to cost $2,350. Everyone in the meeting can understand how the bike shed is built. They debate the material the roof is made of — is aluminum too expensive? — and the need for a bike shed at all — what will the employees want next, a garage? — for forty-five minutes. This budget is also approved.

This allegory illustrates what’s called “Parkinson’s law of triviality”:

The time spent on any item of the agenda will be in inverse proportion to the sum [of money] involved.

We can generalize Parkinson’s law: The effort spent discussing a decision will be inversely proportional to the value of making that decision.

When faced with two similar alternatives, Fredkin’s paradox predicts you’ll have a hard time choosing. This is when A/B testing is most appealing: A/B tests settle debates with data instead of deliberation. But our generalization of Parkinson’s law of triviality says that this kind of A/B testing — testing as an alternative to difficult decisions — results in the least value.

Most of the time, A/B testing is worthless. The time spent designing, running, analyzing, and taking action on an A/B test will usually outweigh the value of picking the more desirable option.

Alternatives to A/B testing

Instead of A/B testing, I’ll offer two suggestions. Both are cheaper and more impactful.

Alternative 1: observe five users

Tom Landauer and Jakob Nielsen demonstrated in Why You Only Need to Test with 5 Users that insights about design arrive with sharply diminishing returns: the first five users you study will reveal roughly 85% of your usability issues. Doing a simple observation study with five users is an affordable way of understanding not just how to improve your design, but also why those improvements work. That knowledge will inform future decisions in a way a single A/B test can’t.
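
For a sense of why the returns diminish so quickly, Nielsen and Landauer model the share of problems found after testing n users as 1 - (1 - L)^n, where L is the fraction of issues a single user uncovers (about 31% in their data). A quick sketch:

```python
# Sketch of the Nielsen/Landauer model: share of usability problems found
# after testing with n users, assuming each user reveals a fraction L of them.
L = 0.31  # Nielsen and Landauer's estimate of the problems one user uncovers

def problems_found(n, L=L):
    return 1 - (1 - L) ** n

for n in (1, 3, 5, 10, 15):
    print(f"{n:2d} users -> {problems_found(n):.0%} of problems found")
# Five users already uncover roughly 85% of the problems; each additional
# user mostly re-discovers issues you have already seen.
```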

Alternative 2: A → B testing

The cheapest way to test a small change is to simply make that change and see what happens. Think of it like a really efficient A/B test: instead of showing a small percentage of visitors the variation and waiting patiently for the results to be statistically significant, you’re showing 100% of visitors the variation and getting the results immediately.

A → B testing does not have the statistical rigor that A/B testing claims. But when the changes are small, they can be easily reversed or iterated on. A → B testing embraces the uncertainty of design, and opens the door to faster learning.

When A/B testing is the right tool for the job

A/B testing is worthless most of the time, but there are a few situations where it can be the right tool to use.

  1. When you only have one shot. Sales, event-based websites and apps, or debuts are not the time to iterate. If you’re working against the clock, an A/B test can allow you to confidently make real-time decisions and resolve usability problems.
  2. When there’s a lot of money on the line. If Amazon A → B tested the placement of their checkout button, they could lose millions of dollars in a single minute. High-value user behaviors have slim margins of error. They benefit from the risk mitigation that A/B testing provides.

Conclusion

If there’s a lot on the line in the form of tight timelines or lots of revenue at stake, A/B testing can be useful. When settling a debate over which color of button is better for your email newsletter, leave A/B testing on the shelf. Don’t get caught by the one-two punch of Fredkin’s paradox and Parkinson’s law of triviality — avoid these counterintuitive tendencies by diversifying your testing toolkit.


Measuring UX with HEART

One of the most important steps in developing any digital project is defining what success means. Because we can measure virtually everything, it’s easy to get caught in the trap of measuring all the wrong things. We often throw these numbers around and, of course, the bigger the number, the more impressive it is. We tend to do this with almost everything nowadays: page views, email subscribers, Instagram followers, funding raised, revenue, and even the number of employees. Big is better, right? Don’t get me wrong, in some cases big is better – like profit – but in other cases they’re just vanity metrics.

The only metrics that entrepreneurs should invest energy in collecting are those that help them make decisions. Unfortunately, the majority of data available in off-the-shelf analytics packages are what I call Vanity Metrics. They might make you feel good, but they don’t offer clear guidance for what to do.

When you hear companies doing PR about the billions of messages sent using their product, or the total GDP of their economy, think vanity metrics. But there are examples closer to home. Consider the most basic of all reports: the total number of “hits” to your website. Let’s say you have 10,000. Now what? Do you really know what actions you took in the past that drove those visitors to you, and do you really know which actions to take next? In most cases, I don’t think it’s very helpful.
– Eric Ries on Tim Ferriss

With all of the tools available showing us all the charts, how do we know which are useful and which are just vanity metrics? Hint: the answer is right there in the question — “u-s-e-f-u-l.” Metrics that help us make decisions are useful; metrics we can’t act on are vanity metrics.

It all comes down to one thing: does the metric help you make decisions? When you see the metric, do you know what you need to do? If you don’t, you’re probably looking at a vanity metric. Vanity metrics are all those data points that make us feel good if they go up but don’t help us make decisions.
– Neil Patel, Metrics, Metrics On The Wall, Who’s The Vainest Of Them All?

Measuring UX

While other disciplines can measure their progress in numbers more easily, UX is still tough to pin down. Because UX is so difficult to measure, we still struggle to articulate our value to the organizations that could benefit from our work.

Business metrics are often tied directly to dollars and cents, so they’re metrics almost everyone understands: improved sales and reduced costs mean profit, and everyone loves money. A business that relies on technology already measures plenty of things. Reduced server load, faster software, and lower latency may not be fully understood by the general public, but most people recognize them as good things.

But UX, and design in general, is harder to tackle. One way to measure its effect is through the HEART framework, designed by Kerry Rodden, Hilary Hutchinson, and Xin Fu from Google’s research team.

To make this work in practice it’s important to use the right metrics. Basic traffic metrics (like overall page views or number of unique users) are easy to track and give a good baseline on how your site is doing, but they are often not very useful for evaluating the impact of UX changes. This is because they are very general, and usually don’t relate directly to either the quality of the user experience or the goals of your project — it’s hard to make them actionable.
– How to choose the right UX metrics for your product

The framework is a kind of UX metrics scorecard that’s broken down into 5 factors:

  • Happiness: How do users feel about your product? Happiness is typically measured by user satisfaction surveys, app ratings and reviews, and net promoter score.
  • Engagement: How often do people come back to use the product? Engagement is the level of user involvement, typically measured via behavioral proxies such as frequency, intensity, or depth of interaction over some time period. Examples might include the number of visits per user per week or the number of photos uploaded per user per day.
  • Adoption: How many people successfully complete the onboarding process and become regular users? Adoption is measured by the number of new users over a period of time or the percentage of customers using a new feature.
  • Retention: The rate at which existing users are returning. You can measure how many of the active users from a given time period are still present in some later time period. As a product owner, you may be more interested in failure to retain, commonly known as “churn.”
  • Task success: Can your users achieve their goal or task quickly and easily? Task success is measured by factors like efficiency (how long it takes users to complete the task), effectiveness (percent of tasks completed), and error rate.

Using the HEART framework gives you a better understanding of the user and their relationship with the product. The nice perk of this is that it can be applied to a single feature in your app or to your whole product. It gives you the option to measure just what you need to at any moment.
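
To make those signals concrete, here’s a minimal sketch of how a couple of them might be computed from a raw event log. The file name, columns, and the definition of “active” are assumptions made up for the example, not part of the HEART framework itself.

```python
# Hypothetical sketch: Engagement and Retention signals from an event log
# with columns user_id, event, timestamp. All names here are made up.
import pandas as pd

events = pd.read_csv("events.csv", parse_dates=["timestamp"])
visits = events[events["event"] == "visit"]

# Engagement: average visits per user per week (over weeks the user was active).
weekly = visits.groupby(["user_id", pd.Grouper(key="timestamp", freq="W")]).size()
engagement = weekly.groupby("user_id").mean().mean()

# Retention: share of users active in March who were still active in April.
march = set(visits.loc[visits["timestamp"].dt.month == 3, "user_id"])
april = set(visits.loc[visits["timestamp"].dt.month == 4, "user_id"])
retention = len(march & april) / len(march) if march else 0.0

print(f"Average visits per user per week: {engagement:.1f}")
print(f"March -> April retention: {retention:.0%}")
```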

Informed by data, driven by empathy

The other day, I published this quote from Data-Driven Design is Killing Our Instincts that’s stuck with me.

We’re told all design decisions must be validated by user feedback or business success metrics. Analytics are measuring the design effectiveness of every tweak and change we make. If it can’t be proven to work in a prototype, A/B test, or MVP, it’s not worth trying at all. 

In this cutthroat world of data-driven design, we’re starting to lose sight of something we once cherished: the designer’s instinct. “Trusting your gut” now means “lazy, entitled designer.” When we can ask users what they want directly, there’s no room for instinct and guesswork. 

Or is there?

With access to all of this data, and a general consensus that data should lead the way, I find myself wondering what the future role of the designer will be. What I truly like about the HEART framework is that while its output is data-informed, it’s all done through the eyes and mind of a human.

Just as we need to be careful not to measure everything simply because we can, we need to remember that to really understand people’s behavior, we need to talk with them, not just look at charts. Charts are great for an overview, but human insight is the best way to understand the underlying reason for a chart’s direction.

“Data-driven” is all the rage at the moment; everyone wants a slice of the “big data” cake. Data scientists are the new rock stars, replacing the JavaScript and front-end gurus and ninjas from a few years back. My problem with trends like these is that they cause so-called tunnel vision: thing X is a trend right now, so we should do it too because… you know… everyone’s doing it.
Some companies have already started to realise that data alone can’t answer all the questions they need answered. Booking.com put “Informed by data, driven by empathy” in their design guidelines.
– Measuring and Quantifying User Experience

How do you get from the HEART categories to metrics you can actually implement and track? Unfortunately, there’s no off-the-shelf “HEART dashboard” that will magically do this for you. The most useful metrics are likely to be specific to your particular product or project. Personally, I like to start bigger projects with a UX Strategy. This UX Strategy should define the most important tasks and metrics for your product and, consequently, what failure looks like. How might success or failure in the goals actually manifest itself in user behavior or attitudes? For YouTube, an engagement signal might be the number of videos users watch, but an even better one could be the amount of time they spend watching those videos. A failure signal for YouTube Search might be entering a query but not clicking on any of the results.
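
As a small, hypothetical illustration of that failure signal, you could count the share of searches that end without a result click; the query log structure below is invented for the example.

```python
# Hypothetical sketch of a search failure signal: the share of queries
# that received no result clicks. The list stands in for a real query log.
searches = [
    {"query": "lofi beats", "result_clicks": 2},
    {"query": "asmr rain", "result_clicks": 0},
    {"query": "guitar lesson", "result_clicks": 1},
    {"query": "study playlist", "result_clicks": 0},
]

abandoned = sum(1 for s in searches if s["result_clicks"] == 0)
abandonment_rate = abandoned / len(searches)
print(f"Search abandonment rate: {abandonment_rate:.0%}")  # 50% in this toy data
```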

Even fashion brands are using data to design products that better meet customers’ demands:

“You never design by data, but the data provides a compass as you’re navigating a hunch.”
– Analytics are reshaping fashion’s old-school instincts

So while we as designers might not get an out-of-the-box dashboard with all the charts ready to act on, there are tools and techniques we can use to gather the information we need for a confident start. Let me know if you need help getting started!