A/B Testing


Author: Abhideep Jain

A/B tests, as they are called, are quite common in the pharmaceutical industry. Large clinical trials are fundamentally split tests: one group receives a novel treatment for a condition while the other group receives the prevailing standard treatment (or a placebo). The results are compared to determine whether the novel treatment is significantly better at treating the medical condition.

But the popularity of these tests soared when product and marketing managers started using them to answer questions like, “How do I ensure people click on my ad campaign?” or “How do I increase the number of people who open my promotional email?”

A/B tests are integral to software product development

From launching a new product, tweaking an existing one, and identifying a new target customer segment to serving existing customers’ needs better – product managers are always looking for data-driven tests to complement their decisions. Measuring the performance of in-app messages, understanding product feature adoption, designing a killer user experience – all such decisions are increasingly becoming scientific rather than gut-based. Top content creators on YouTube today are not just creating content; they are also optimizing video titles, thumbnails, and descriptions based on the results of multiple A/B tests, ensuring that their content never underperforms.

How A/B testing shapes data-driven software product decisions

Let’s look at a hypothetical scenario – the product managers of a popular homestay booking app are trying to figure out why their customers drop off on the payments page. During focus group discussions, they find that people are hesitant to pay before their actual stay. As a controlled test, they come up with two variations of the payment method – “Pay Now” – the control version, and “Reserve Now, Pay Later” – the test version.

Their initial (null) hypothesis was, “The new deferred payment option does not affect the booking rate.”


Figure 1: Control and Test Versions of Payment Method

After running the A/B test for 30 days, here is the booking conversion rate for the two random groups.


Figure 2: Conversion rates for test and control sample for 30 days

Applying a basic two-tailed t-test to this dataset gives a p-value of 0.002, far below 0.05, the most commonly used significance level. The initial null hypothesis was therefore rejected – a difference in booking rates this large would be very unlikely to occur by chance alone, so the evidence strongly suggests that the new payment option did, in fact, increase the booking rate.
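The t-test above can be sketched in a few lines of Python. The daily conversion rates here are synthetic, generated for illustration; the article’s actual p-value of 0.002 comes from the app’s real experiment data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic daily booking conversion rates over the 30-day test window
control = rng.normal(loc=0.08, scale=0.01, size=30)  # "Pay Now"
test = rng.normal(loc=0.10, scale=0.01, size=30)     # "Reserve Now, Pay Later"

# Two-tailed independent-samples t-test
t_stat, p_value = stats.ttest_ind(test, control)

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null hypothesis: the deferred payment option "
          "appears to change the booking rate.")
```

With real experiment data, you would feed in the observed per-day (or per-user) conversions instead of the simulated arrays.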

Factoring in the complexity of the real world

While this was an idealized example of how such tests matter in real-world scenarios, in practice multiple factors affect the performance or non-performance of a given metric. That is when you need multivariate tests, which take into account how multiple variables interact with each other to influence the outcome metric. In the YouTube example, imagine you want to understand how view count is influenced by title, thumbnail, and description – all in a single analysis.
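One common way to set up such a multivariate test is a full-factorial design: every combination of the factors becomes one variant, so interactions between factors can be measured. The sketch below enumerates variants for the YouTube example; the factor values are hypothetical.

```python
from itertools import product

# Hypothetical factor levels for a video's title, thumbnail, and description
titles = ["How I Edit", "Editing Secrets"]
thumbnails = ["face_closeup", "before_after"]
descriptions = ["short", "keyword_rich"]

# Full-factorial design: one variant per combination (2 x 2 x 2 = 8)
variants = [
    {"title": t, "thumbnail": th, "description": d}
    for t, th, d in product(titles, thumbnails, descriptions)
]

print(len(variants), "variants")
for v in variants[:2]:
    print(v)
```

Traffic is then split across all variants, and a model with interaction terms (or simple per-variant comparison) reveals which combination performs best.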

Major pitfalls that Product Managers should be on the lookout for

While A/B tests are an excellent way to improve products incrementally, they often turn into complex experiments in the real world: significant engineering effort is needed to set them up, and unless done right, they may provide no clear direction.

At Tiger Analytics, we’ve worked with various clients who are trying to develop exceptional software products, and based on our experience, here are a few things Product Managers can keep in mind while setting up these tests:

1. Inadvertent selection bias –

The most common difficulty in running A/B tests is ensuring that the test and control groups are truly randomized – the entire purpose of the test is voided if you let selection bias creep into your data. If you let the tests run for too long, factors outside your control may skew your test and control datasets and affect the results. One way to guard against this is to identify the various subgroups in the data and ensure that each subgroup is adequately represented in both the test and control groups.
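Subgroup representation can be enforced with stratified random assignment: shuffle and split users within each subgroup rather than across the whole population. The sketch below assumes hypothetical “mobile” and “desktop” segments.

```python
import random

# Hypothetical user base with two subgroups (segments)
users = [{"id": i, "segment": seg}
         for i, seg in enumerate(["mobile", "desktop"] * 50)]

random.seed(7)

# Group user ids by segment
by_segment = {}
for u in users:
    by_segment.setdefault(u["segment"], []).append(u["id"])

# Stratified split: shuffle within each segment, then halve it
assignment = {}
for segment, ids in by_segment.items():
    random.shuffle(ids)
    half = len(ids) // 2
    for uid in ids[:half]:
        assignment[uid] = "test"
    for uid in ids[half:]:
        assignment[uid] = "control"
```

Every segment now contributes equally to both arms, so a segment-specific trend cannot masquerade as a treatment effect.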

2. Mistaking novelty for higher performance –

Users tend to interact with a new feature as soon as it is introduced simply out of curiosity, but they may not keep using it. Decision scientists should be aware of this possibility and study how adoption of the novel feature wears off over time.
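A quick way to check for a novelty effect is to compare usage right after launch against usage once behavior settles. The weekly numbers below are made up for illustration; the 30% threshold is an arbitrary rule of thumb, not a standard.

```python
# Hypothetical daily feature uses in the launch week vs. a later week
weekly_uses = {
    "week_1": [9, 8, 10, 9, 11, 8, 9],  # spike right after launch
    "week_4": [4, 5, 3, 4, 5, 4, 4],    # settled usage
}

def mean(xs):
    return sum(xs) / len(xs)

early = mean(weekly_uses["week_1"])
late = mean(weekly_uses["week_4"])
decay = (early - late) / early

print(f"Usage dropped {decay:.0%} from week 1 to week 4")
if decay > 0.3:
    print("Likely novelty effect; judge the feature on its settled usage.")
```

In practice you would plot the full adoption curve rather than two points, but even this crude comparison flags features whose launch-week numbers flatter them.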

3. Cannibalizing other features for a novel feature –

In certain cases, the introduction of a new test feature may reduce engagement with other features, ultimately yielding no real long-term benefit. Results from usability tests can help here, revealing the interplay among features as users interact with them.

4. Running too many tests at the same time –

Quite often, under time-to-market pressure, teams end up running too many inference tests on the same set of users simultaneously. Two or more changes can interfere and produce very different user behavior than if each change had been introduced in isolation. When such situations occur, one can either ensure that users don’t take part in multiple tests at once or wait for one test to complete before starting another.

Set up teams for product success

A typical product development pod performs numerous tests to scientifically develop products that exceed customer expectations. Multiple such pods may execute hundreds of tests every few sprints. Automated tools for executing such tests come in handy for business users, but they have their limitations. In such cases, Decision Scientists can help Product Development Teams leverage seamless insights from multiple A/B tests so that there is an increased focus on building great products and interfaces. Figure 3 shows a typical engineering pod that leverages decision science expertise available as a shared service to the engineering organization.


Figure 3: Cross-functional engineering pods

At Tiger Analytics, we use a prebuilt Test and Control tool that accelerates our clients’ work by 30-50%, helping them run a series of controlled experiments with test and control data isolation, post-test analytics, insights, and seamless reporting. From generating hypotheses to designing the experiment, identifying test and control groups, defining the key KPIs for analysis, making recommendations based on those KPIs, and measuring the test – everything happens with zero to minimal coding.

For a large security software client, we championed a robust testing framework that eliminated repetitive statistical tests and automated the creation of certain key metrics. This ultimately led to fast, informed, data-driven decisions that accelerated product development and release.

When done right – under the proper conditions and with the right metrics in place – a well-executed A/B test not only gives DevOps teams clear direction on which solution to scale and when, but also helps leadership make data-driven decisions based on actual customer response.

