The paradox at the heart of A/B testing

A/B testing began with beer.

At the turn of the 20th century, William Sealy Gosset was exploring new ways of running experiments on his production line. Gosset was trying to improve the quality of Guinness’s signature stout but couldn’t afford to run large-scale experiments on all the variables in play. Fortunately, in addition to being an astute brewer, Gosset was a statistician; he had a hunch there was a way of studying his small snatches of data to uncover new insights.

Gosset took a year off from his work to study alongside another scientist, Karl Pearson. Together, they developed a new way of comparing small data sets to test hypotheses. They published their work in the leading statistics publication at the time, Biometrika. In “The Probable Error of a Mean,” the t-test, a cornerstone of modern statistical analysis, was born.

Gosset’s scientific approach was the foundation of a 38-year career with Guinness. He invented more ways of using statistics to make business decisions; founded the statistics department at Guinness; led brewing at Guinness’s newest plant in 1935; and finally, in 1937, became the head of all brewing at Guinness.

Since its early days as a tool of science (and beer), statistical decision-making has gone supernova. Today, it is used by every major tech company to make hundreds of thousands of decisions every year. Data-driven tests decide everything from the effectiveness of political ads to a link’s particular shade of blue. Newer methods like Fisher’s exact test, multivariate testing, and multi-armed bandit testing are all descendants of Gosset’s early innovations. The most popular of these statistical tests is one of the oldest: A/B testing.

An A/B test is a measurement of what happens when you make a single, small, controlled change. In product design, this means changing something in the interface, like the color of a button or the placement of a headline. To run an A/B test, show an unchanged version of the interface (“A”) to a randomly selected group of users. Show a changed version (“B”) to another randomly selected group. Measure the difference in the behavior of the two groups using a t-test, and you can confidently predict how the changed version will perform when shown to all your users.
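As a rough illustration of the comparison step, here is a minimal Python sketch of Welch’s variant of the t-test, which does not assume the two groups have equal variance. The function name `welch_t` and the sample data are hypothetical. Turning the statistic into a p-value requires the t-distribution, which a statistics library can supply (for example, SciPy’s `scipy.stats.ttest_ind` with `equal_var=False`).

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return (t statistic, approximate degrees of freedom) for two samples."""
    # Squared standard error of each group's mean.
    se2_a = variance(group_a) / len(group_a)
    se2_b = variance(group_b) / len(group_b)
    # Difference in means, scaled by the combined standard error.
    t = (mean(group_b) - mean(group_a)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(group_a) - 1) + se2_b ** 2 / (len(group_b) - 1)
    )
    return t, df


# Hypothetical metric (e.g. minutes per session) for each group.
a = [1, 2, 3, 4, 5]  # unchanged interface ("A")
b = [2, 3, 4, 5, 6]  # changed interface ("B")
t, df = welch_t(a, b)
print(t, df)
```

With this made-up data, t comes out at 1.0 with about 8 degrees of freedom, well short of conventional significance thresholds. Five users per group is rarely enough to separate two similar designs.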

A/B tests are easy to understand, which explains their popularity in modern software development. But their simplicity is deceptive. The fundamental ideas of A/B tests contain a paradox that calls their value into question.

Fredkin’s paradox

In The Society of Mind, Marvin Minsky explored a phenomenon that I experience every day as a designer: people often prefer one thing over another, even when they can’t explain their preference.

We often speak of this with mixtures of defensiveness and pride.

“Art for Art’s sake.”

“I find it aesthetically pleasing.”

“I just like it.”

“There’s no accounting for it.”

Why do we take refuge in such vague, defiant declarations? “There’s no accounting for it” sounds like a guilty child who’s been told to keep accounts. And “I just like it” sounds like a person who is hiding reasons too unworthy to admit.

Minsky recognized that our capriciousness serves a few purposes: We tend to prefer familiar things over unfamiliar. We prefer consistency to avoid distraction. We prefer the convenience of order to the vulnerability of individualism. All of these explanations boil down to one observation, which Minsky attributes to Edward Fredkin:

Fredkin’s Paradox: The more equally attractive two alternatives seem, the harder it can be to choose between them—no matter that, to the same degree, the choice can only matter less.

Fredkin’s Paradox is about equally attractive options. Picking between a blue shirt and a black shirt is hard when they both look good on you. Choices can be hard when the options are extremely similar, too: recall Google’s infamous “41 shades of blue” experiment on link colors, mentioned above. The paradox is that you spend the most time deliberating when your choice makes the least difference.

Parkinson’s law of triviality

In 1957, C. Northcote Parkinson published a book called Parkinson’s Law. It’s a satire of government bureaucracy, written when the British Colonial Office was expanding even as the British Empire itself was shrinking. In one chapter, Parkinson describes a fictional 11-person meeting with two agenda items: the plans for a nuclear power plant, and the plans for an employee bike shed.

The power plant is estimated to cost $10,000,000. Due to the technical complexity involved, many experts have weighed in. Only two attendees have a full grasp of the budget’s accuracy. These two struggle to discuss the plans, since none of the other attendees can contribute. After a two-minute discussion, the budget is approved.

The group moves on to the bike shed, estimated to cost $2,350. Everyone in the meeting can understand how the bike shed is built. They debate the material the roof is made of — is aluminum too expensive? — and the need for a bike shed at all — what will the employees want next, a garage? — for forty-five minutes. This budget is also approved.

This allegory illustrates what’s called “Parkinson’s law of triviality”:

The time spent on any item of the agenda will be in inverse proportion to the sum [of money] involved.

We can generalize Parkinson’s law: The effort spent discussing a decision will be inversely proportional to the value of making that decision.

When faced with two similar alternatives, Fredkin’s paradox predicts you’ll have a hard time choosing. This is when A/B testing is most appealing: A/B tests settle debates with data instead of deliberation. But our generalization of Parkinson’s law of triviality says that this kind of A/B testing — testing as an alternative to difficult decisions — results in the least value.

Most of the time, A/B testing is worthless. The time spent designing, running, analyzing, and taking action on an A/B test will usually outweigh the value of picking the more desirable option.
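One way to see why near-ties are expensive to test: under the standard normal-approximation formula for comparing two conversion rates, the number of users each variant needs grows roughly with the inverse square of the difference you want to detect. The sketch below is illustrative only. The function name, the 5% significance level, the 80% power, and the conversion rates are all assumptions, not figures from any real test.

```python
from math import ceil
from statistics import NormalDist


def users_per_variant(rate_a, rate_b, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_power = NormalDist().inv_cdf(power)          # desired statistical power
    # Sum of the binomial variances of the two conversion rates.
    spread = rate_a * (1 - rate_a) + rate_b * (1 - rate_b)
    return ceil((z_alpha + z_power) ** 2 * spread / (rate_a - rate_b) ** 2)


# Hypothetical baseline conversion rate of 10%.
big_gap = users_per_variant(0.10, 0.12)     # clearly different options
small_gap = users_per_variant(0.10, 0.105)  # nearly identical options
print(big_gap, small_gap)
```

With these assumed numbers, the first comparison needs a few thousand users per variant and the second needs tens of thousands. Shrinking the gap to a quarter of its size raises the cost roughly fifteenfold, while the payoff from picking the winner shrinks.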

Alternatives to A/B testing

Instead of A/B testing, I’ll offer two suggestions. Both are cheaper and more impactful.

Alternative 1: observe five users

Tom Landauer and Jakob Nielsen demonstrated in Why You Only Need to Test with 5 Users that usability testing has steeply diminishing returns: the first five users you study will reveal more than 75% of your usability issues, and each additional user mostly repeats what you have already seen. Doing a simple observation study with five users is an affordable way of understanding not just how to improve your design, but also why those improvements work. That knowledge will inform future decisions where a single A/B test can’t.
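The curve behind that claim can be sketched in a few lines of Python. Nielsen and Landauer model the share of problems found after n users as 1 - (1 - L)^n, where L is the chance that a single user exposes any given problem. The default of 0.31 below is the average Nielsen reports across the projects he studied; the function name is hypothetical, and your own product’s rate may differ.

```python
def share_of_problems_found(users, discovery_rate=0.31):
    """Fraction of usability problems found after testing `users` people."""
    # Each user independently misses a given problem with
    # probability (1 - discovery_rate).
    return 1 - (1 - discovery_rate) ** users


for n in (1, 3, 5, 15):
    print(n, round(share_of_problems_found(n), 2))
```

Under this model, five users uncover roughly 84% of the problems, and going from five users to fifteen adds far less than going from one to five did.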

Alternative 2: A → B testing

The cheapest way to test a small change is to simply make that change and see what happens. Think of it like a really efficient A/B test: instead of showing a small percentage of visitors the variation and waiting patiently for the results to be statistically significant, you’re showing 100% of visitors the variation and getting the results immediately.

A → B testing does not have the statistical rigor that A/B testing claims. But when the changes are small, they can be easily reversed or iterated on. A → B testing embraces the uncertainty of design, and opens the door to faster learning.

When A/B testing is the right tool for the job

A/B testing is worthless most of the time, but there are a few situations where it can be the right tool to use.

  1. When you only have one shot. Sales, event-based websites and apps, or debuts are not the time to iterate. If you’re working against the clock, an A/B test can allow you to confidently make real-time decisions and resolve usability problems.
  2. When there’s a lot of money on the line. If Amazon A → B tested the placement of their checkout button, they could lose millions of dollars in a single minute. High-value user behaviors have slim margins of error. They benefit from the risk mitigation that A/B testing provides.

Conclusion

If there’s a lot on the line in the form of tight timelines or lots of revenue at stake, A/B testing can be useful. When settling a debate over which color of button is better for your email newsletter, leave A/B testing on the shelf. Don’t get caught by the one-two punch of Fredkin’s paradox and Parkinson’s law of triviality — avoid these counterintuitive tendencies by diversifying your testing toolkit.


The Aesthetic-Accessibility Paradox

Every interface has a majority and a minority of users. The majority usually have normal vision, while the minority have some form of visual impairment.

There’s a big difference between what normal visioned users see and what color-blind and low-vision users see. The latter tend to experience blurry text and faint elements when text sizes and color contrasts are too low.

The goal of accessibility is to meet the needs of the minority because they’re often forgotten. But what happens when meeting the needs of the minority ends up failing the needs of the majority? This issue occurs when the interface is made too accessible and isn’t balanced with aesthetics.

Aesthetic vs. Accessible

In general, the more accessible an interface is, the less aesthetic appeal it has. Highly accessible interfaces are easier on the eyes of the visually impaired, but harsher on the eyes of the normal visioned. On the flip side, highly aesthetic interfaces are easier on the eyes of the normal visioned, but harsher on the eyes of the visually impaired.

This aesthetic-accessibility paradox is what designers struggle with when they design interfaces. The challenge is to meet the needs of both the majority and the minority. However, if you veer too far into one extreme, you’ll alienate a subset of your users. Most people don’t want to alienate the minority. But alienating the majority of your users is just as bad as alienating the minority.

Below are two forms that illustrate this concept. One form is AAA compliant and accessible to all visually impaired users. The other is not accessible at all but appeals to normal visioned users.

[Image: a highly aesthetic form next to a highly accessible form]

For the normal visioned, the aesthetic form is easy on the eyes, while the accessible form is harsh. However, for the visually impaired, the accessible form is easier on the eyes, while the aesthetic form is harsher. Which form should you use?

The correct answer is neither, because neither form respects the aesthetic-accessibility paradox. Each is designed toward an opposite end of the spectrum, so each will alienate either the majority or the minority.

A truly accessible and aesthetic interface falls somewhere in the middle. Below is the form that respects the aesthetic-accessibility paradox. The color hues, contrasts, font sizes, and weights are AA compliant and balanced to meet the needs of both user groups. The result is an interface that’s easy on the eyes for nearly everyone.

[Image: a form that balances aesthetics and accessibility]

The Majority of the Minority

Why isn’t an interface that’s balanced with aesthetics and accessibility easy on the eyes for everyone? Within the subset of the minority, there’s another majority and minority. The majority of the minority are users who don’t have extreme visual impairments and will be able to use a balanced design. However, the minority of the minority have extreme visual impairments that will still cause them issues.

[Image: the majority and the minority within the visually impaired minority]

Designing for the smallest minority will make your design accessible to users with extreme visual impairments. However, your design will alienate normal visioned users who make up the majority of your base. For this reason, the best design is a balanced one that satisfies the largest minority.

What about the needs of the smallest minority? Most users with extreme visual impairments rely on assistive technology, such as screen readers or the high contrast modes built into operating systems and browsers. These tools allow them to use interfaces that have low contrast. You don’t need to design for the minority of the minority; design for the majority of the minority instead. Designing for the largest minority means making your interface AA compliant.
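For color contrast, AA and AAA compliance can be checked numerically. The Python sketch below follows the WCAG 2.x definitions of relative luminance and contrast ratio, and uses the thresholds for normal-size text: 4.5:1 for AA and 7:1 for AAA (large text has lower thresholds, which this sketch ignores). The function names and sample colors are hypothetical, and contrast is only one part of compliance.

```python
def relative_luminance(hex_color):
    """Relative luminance of an sRGB color like '#1a2b3c', per WCAG 2.x."""
    channels = []
    for i in (1, 3, 5):
        c = int(hex_color[i:i + 2], 16) / 255
        # Undo the sRGB gamma curve to get linear light.
        channels.append(c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4)
    r, g, b = channels
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground, background):
    """WCAG contrast ratio, from 1 (identical colors) to 21 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


def wcag_level(foreground, background):
    """Contrast level for normal-size text: 'AAA', 'AA', or 'fail'."""
    ratio = contrast_ratio(foreground, background)
    if ratio >= 7:
        return "AAA"
    if ratio >= 4.5:
        return "AA"
    return "fail"


print(wcag_level("#000000", "#ffffff"))  # black text on white
print(wcag_level("#767676", "#ffffff"))  # mid gray text on white
print(wcag_level("#aaaaaa", "#ffffff"))  # light gray text on white
```

Black on white reaches AAA, the mid gray just clears AA, and the light gray fails. That middle case is the kind of balanced choice described above: softer than pure black, but still readable for most users with low vision.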

Local High Contrast Mode

Sometimes a highly aesthetic or highly accessible interface is required based on the nature of a project. There’s a way you can provide users with these presentations without alienating any of your audience.

If you want to maintain a highly aesthetic design, you should provide a local high contrast mode on your interface. A local high contrast mode is a toggle button on the page that allows users to enhance the contrast of text and elements. On the other hand, if you want to provide users with a highly accessible design, make your high contrast mode AAA compliant.

However, the challenge is getting users to notice and use it. Make sure it’s visually prominent, or they’ll overlook it. The example below shows a button for high contrast mode, but it’s in an obscure form and location. If you decide to implement a local high contrast mode, follow these guidelines.

[Image: a local high contrast mode button]

The Importance of Aesthetics

Accessibility extremists tend to discount aesthetics. They believe an interface should be as accessible as possible for the minority without considering how it affects the average user. These extremists need to understand and respect the aesthetic-accessibility paradox before demanding the highest degree of accessibility.

Aesthetics isn’t a subjective and trivial attribute used for ornamentation. It serves an important purpose in the user experience. It determines whether users trust your app, perceive it as valuable, or are satisfied using it. In other words, aesthetics affects user engagement and conversion rate. Discounting it is not only bad for users, but bad for business.

Striking a Balance

Balancing aesthetics and accessibility isn’t easy, but it’s necessary for a great user experience. The midpoint of the aesthetic-accessibility spectrum is the balance point for designing interfaces that satisfy the most users. Avoid designing at the extreme ends of the spectrum and respect the aesthetic-accessibility paradox.

[Image: the aesthetic-accessibility paradox]

Being mindful of this paradox will help you make design choices that include the visually impaired, without excluding the normal visioned. When you’re designing for a wide range of people, going to either the aesthetic or the accessible extreme is not the best approach. Finding the middle ground is the best way to reach and satisfy as many users as possible.
