Software experimentation sins

Published: November 17, 2024
Tags: Articles
Author: Stephen Wu
This article is for folks who run software experiments and want to learn more about how to avoid common classes of mistakes!
📚 Estimated reading time: 20 minutes.
Experimentation is a powerful tool used in modern software development, but many experiments fall prey to common pitfalls that compromise statistical validity.
Once you’ve committed a few sins in an experiment, you start to lose statistical accuracy and may form the wrong conclusions, despite using what seemed like rigorous statistics.
These are learnings from conducting various experiments, reading experiment reviews, and talking to data science friends — at Meta and Notion.
I’ve personally made all of these experimental mistakes and seen them made many more times! 😅

examples

Imagine you are a product manager at a hypothetical dating web app called Tinge.
Our goal is to build features to help users make romantic connections and increase app revenue.
We’ll be running experiments using an experimentation platform like Statsig or LaunchDarkly, which handles the statistics and test management for us.

process

Modern software experimentation closely follows the scientific method. The typical cycle looks like this:
1. Hypothesis
“We should add new profile prompts to hopefully grow prompt adoption & matching rate.”
2. Test
“Let’s give 50% of all users 10 new prompts and see how that affects our connection metrics.”
3. Analysis
“Over a two-week period, our new prompts were used by 10% of users in the test group, and there was a 2±1% increase in new matches. This helps validate the impact of good prompts on matches.”
4. Decision
“Because we saw positive increases to prompt usage and top-line metrics, we will ship the new prompts to all users.”

In some experiments, we may not have a very specific hypothesis and just want to measure the overall effects of a change. In these cases, we might decide to ship the change anyway, provided the measured results aren't terrible.
Our goals with experimentation are:
  1. To build intuition and understanding of how the world works, so we can make better future choices.
    ‱ “How does better prompt usage lead to more matches?”
  2. To demonstrate the impact of our work and validate changes.
    ‱ “My team’s work helped drive +X% revenue and +Y% matches!”

experimentation sins

Experimentation errors typically fall into three categories:
  1. Incorrect platform foundations.
  2. Incorrect experiment setup.
  3. Incorrect interpretation and analysis.
We might visualize this as layers of a pyramid.
The entire process depends on solid foundations, a correct setup, and accurate interpretation. Errors at each level can compound and erode statistical validity.

i. platform foundations

Metric logging & computation errors

Probably the most common and simplest sin: the logging and computation behind our metrics is set up incorrectly.
Every metric we report comes from an engineer building some logging mechanism or data engineering pipeline. Issues can creep in simply because an engineer forgot to add a log or a piece of metadata, or made a logic mistake.
Examples:
  ‱ We accidentally logged “user liked” as “user disliked” due to a flipped boolean, and nobody noticed until weeks later!
  • We didn’t realize we should include “messages sent” as part of the “daily engaged users” metric, but data scientists assumed this was included. This led to invalid data for experiments that goaled on this metric.
These errors can typically be avoided by auditing the end-to-end flow before the experiment begins and making sure metrics are logged as expected. If any metric looks fishy during analysis, we should investigate it and make sure our understanding lines up with its implementation.
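One way to make this class of bug harder to commit is to log explicit values rather than booleans that can silently flip. Here is a minimal sketch, with made-up event names and a stand-in logger:

```javascript
// A hypothetical logging helper. Spelling the action out as an explicit string
// (rather than encoding it as a boolean) makes a flipped condition much easier
// to spot, both in code review and in the resulting data.
function logSwipe(logger, userId, action) {
  if (action !== "liked" && action !== "disliked") {
    throw new Error(`Unexpected swipe action: ${action}`);
  }
  logger.log("swipe", { userId, action, ts: Date.now() });
}

// Usage with a stand-in logger:
const consoleLogger = { log: (event, payload) => console.log(event, payload) };
logSwipe(consoleLogger, "user_123", "liked");
```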

Incomplete metric coverage

We may fail to capture some important effects that would otherwise change our insights and decisions.
This is especially true of younger companies with immature experiment cultures and data pipelines!
Example: Tinge introduced a new user flow to push users to subscribe!
We measured "new subscriptions" and shipped the change due to a +5% increase. However, we failed to measure "subscription refunds," which skyrocketed by +30%.
Without measuring refunds and setting them as a guardrail metric, we missed crucial information. Had we known about this dramatic increase in refunds, we wouldn't have shipped the change—it clearly indicates that users were subscribing unintentionally and having poor experiences.
Investing in long-term metric foundations and logging helps solve the problem of missing metric coverage. We should try to capture and analyze all relevant effects we care about.

Unmeasurable effects

Some effects are impossible to capture as metrics due to platform limitations.
We generally cannot capture off-platform and second-order effects of user experiences. User sentiment and brand reputation are some of the many qualities that are very difficult to measure but can be greatly affected by product changes!
Examples:
  • Tinge decided to experiment with some đŸŒ¶ïž controversial politically charged prompts. We saw increased prompt usage overall, and more user logins, so we decided to ship the change!
    ‱ Sadly, this later led to app uninstalls, negative news articles, and bad word-of-mouth that we could not measure via our platform. We ended up unshipping the change once the public backlash grew.
A delightful feature could motivate users to recommend the app to their friends or make TikToks. A negative change could lead to bad reviews on Reddit and the App Store, deterring many more users from joining and hurting the brand. But neither of these effects is measurable in any A/B experiment analysis.
It’s impossible for any experiment analysis to capture all the important impacts of a change. We’ll have to use tools other than experimental analysis to make judgement calls for decisions that are likely to have negative off-platform or second-order effects!

Experimentation platform issues

There are often notable constraints or intermittent issues with our experimentation services, like:
  • Getting an accurate user or device identifier may not always be possible (e.g. if the user is logged out, blocks cookies, or using incognito).
  • Ad-blockers may block logging or experimentation APIs for privacy reasons.
Example: Tinge analytics tracking was blocked by a popular adblocker on the web. This skews our results, because adblocking is likely correlated with age and technical proficiency, leading to unrepresentative user proportions.
Let’s dive deeper

Let’s say that ~30% of our users use an ad blocker, and therefore all of their analytics are blocked for experimentation.
This would not be an issue if “being an ad-block user” were equally distributed across all users (i.e. an independent and identically distributed variable).
However, ad-block usage generally correlates with technical proficiency and age!
Imagine we had a user literacy and adblock distribution that looked like this, bucketing users into 4 quartiles of user literacy:
|  | Ad block usage (A) | % of all daily users (U) | Measured users out of 100: M = (1 - A) × U × 100 | % of measured users (M / 70) |
| --- | --- | --- | --- | --- |
| Very tech literate | 60% | 25% | 10 | 14% |
| Moderately tech literate | 36% | 25% | 16 | 22% |
| Somewhat tech literate | 16% | 25% | 21 | 30% |
| Not very tech literate | 8% | 25% | 23 | 33% |
With the distribution above, all of our experimentation would systematically overrepresent less tech-literate users and underrepresent very tech-literate users.
Now let’s say we ran a re-design experiment. We changed the app in a way that really resonated with technically literate users (typically younger, heavy app users), but not with less technically literate users (typically older, more resistant to change). And it led to these results:

|  | “True” increase in daily usage (I) | “True” impact on users (U × I) | Measured impact (M × I) |
| --- | --- | --- | --- |
| Very tech literate | +15% | +3.75 | +1.5 |
| Moderately tech literate | +10% | +2.5 | +1.6 |
| Somewhat tech literate | -7% | -1.75 | -1.47 |
| Not very tech literate | -8% | -2 | -1.84 |
| Total |  | +2.5 | -0.21 |
In terms of true overall impact, this change was actually good for users (+2.5). But our experiment would report a negative overall result (-0.21) because of the bias towards non-adblocked users, and we’d draw the wrong conclusions from it!
For illustrative purposes, we assumed all users are on web.
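For concreteness, here is a small sketch that reproduces the arithmetic from the tables above (these are the illustrative values from this example, not real data):

```javascript
// Reproduces the toy adblock-bias arithmetic from the tables above.
const segments = [
  { name: "Very tech literate",       adblock: 0.60, share: 0.25, trueLift: 0.15 },
  { name: "Moderately tech literate", adblock: 0.36, share: 0.25, trueLift: 0.10 },
  { name: "Somewhat tech literate",   adblock: 0.16, share: 0.25, trueLift: -0.07 },
  { name: "Not very tech literate",   adblock: 0.08, share: 0.25, trueLift: -0.08 },
];

let trueImpact = 0;
let measuredImpact = 0;
for (const s of segments) {
  const usersPer100 = s.share * 100;                    // U (out of 100 users)
  const measuredUsers = (1 - s.adblock) * usersPer100;  // M: users we can actually see
  trueImpact += usersPer100 * s.trueLift;               // U * I
  measuredImpact += measuredUsers * s.trueLift;         // M * I
}

console.log(trueImpact.toFixed(2));     // "2.50": the change is good overall
console.log(measuredImpact.toFixed(2)); // "-0.21": but the experiment reads negative
```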
  • The platforms and pipelines we rely on may fail in other peculiar ways.
Example: Tinge is using popular experimentation platform Statsig, but there was an issue in our SDK implementation for a few weeks, throwing off all experiments that included users on old app versions.

ii. experiment setup

Improperly isolating control & test groups

Scientific validity in experiments depends on properly isolated, randomized control groups to ensure that the treatment group does not influence the control group.
In many fields like biology & medicine, achieving this can be easier. Researchers might separate test and control groups into different physical spaces — ensuring that participants (or lab rats) cannot influence each other. This minimizes interference and helps maintain the integrity of the experimental results.
In social or collaborative software, it’s quite hard to achieve true isolation: test users inevitably interact with control users, and this contamination can lead to odd, experiment-breaking effects.
Example: We introduce a new feature to Tinge called “message reacts” that becomes TikTok-viral. We want to gauge how message reacts affect connections and messaging.
This experiment setup led to some funny quirks

Oddly, we see a spike in messaging from both test and control users, and it is quite hard to isolate the “true impact” of our feature.
  1. Users may be sending message reactions to people who cannot see them, and then be confused why the other person isn’t responding to their reacts.
  2. Users may literally send each other messages discussing the new feature. (How meta!)
  3. Some users in both groups may open messaging just to see whether they have the message reacts feature, affecting metrics.
  4. Control users respond more in general due to elevated messaging from test users.
Because of the contamination of impact from test to control, our metrics become quite diluted.
To address network contamination, we may instead pick a different method of sampling such as using location (zipcode, city, or country) or some other logical user grouping. This is sometimes called cluster assignment or cluster randomized trials.
Example: We instead ship “message reacts” to 1/5 of the major countries we operate in as an experiment. Users cannot message react to somebody who doesn’t have the feature.
This method isn’t perfect, but helps avoid some contamination issues.
  ‱ Previously, with a regular 50-50 randomized split, users in the test group were extremely likely to interact with the control group and contaminate it.
  ‱ Now, test group users with the feature are much less likely to interact with control group users who don’t have it. This isolation makes analysis more accurate.
  • However —
    • Different countries may have different impacts due to cultural differences, sizes, and software usage, so we want to make sure our country selection controls for this reasonably well.
    • Users might travel or message people in adjacent countries and then be confused.
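For illustration, here is a rough sketch of what deterministic country-level (cluster) assignment might look like under the hood. The hashing helper is a toy stand-in; platforms like Statsig handle this kind of bucketing for us.

```javascript
// A toy deterministic bucketing helper for illustration only.
function simpleHash(str) {
  let h = 0;
  for (const ch of str) h = (h * 31 + ch.charCodeAt(0)) >>> 0; // unsigned 32-bit
  return h;
}

// Hash the country code together with the experiment name so that every user
// in the same country gets the same assignment, while different experiments
// get independent assignments. 1 of 5 buckets receives the treatment.
function isInMessageReactsTest(user) {
  return simpleHash(`message_reacts:${user.countryCode}`) % 5 === 0;
}

console.log(isInMessageReactsTest({ countryCode: "BR" })); // same answer for every user in BR
```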
We should consider how to best manage experimental isolation and consider alternative methods in experiments where there are network interactions.
Where it’s not possible to do so, we may have to accept some statistical accuracy issues and user confusion. If we feel confident about the feature, we may also just ship the feature directly to everybody and then evaluate the impact broadly without an experiment.

Setting up exposure & gating logic wrong

Exposure refers to the point when a user becomes marked as part of the test or control group for analysis and treatment. Typically, we want to pick exposure conditions that best match when the user will actually be affected by the product being different.
Gating refers to the conditions that must be met for a user to be eligible for exposure into the experiment.
It’s easy and common to make subtle mistakes in these conditions, such as:
  ‱ Incorrect exposure logic: the conditions intended for exposure differed from the implementation in code, so users were exposed into the test group but never actually received the test treatment.
Example: We re-designed the “Superlikes” feature UI/UX (exclusive to Pro plan users) and exposed all users even without pro plans into the experiment. Depending on what we’re trying to measure, these exposure conditions could be wrong and dilutive.
Learn more

```javascript
// Assume `getExperimentGroup` logs an experiment exposure.
function doesUserHaveSuperlikesRedesign(user) {
  return getExperimentGroup(user, "superlikes_redesign") && userHasPlan(user, "pro");
}
```
In the above code, we’re exposing users into the experiment who don’t have a Pro plan, because we check the experiment group first.
This makes the results harder to interpret: only 10% of test users are actually impacted by the treatment, so the measured delta is diluted by 90%.
|  | Test users | Control users |
| --- | --- | --- |
| Was Pro plan | 10% | 10% |
| Wasn’t Pro plan | 90% | 90% |

Flipping the condition to check the experiment group second changes our exposure conditions.
```javascript
function doesUserHaveSuperlikesRedesign(user) {
  return userHasPlan(user, "pro") && getExperimentGroup(user, "superlikes_redesign");
}
```
|  | Test users | Control users |
| --- | --- | --- |
| Was Pro plan | 100% | 100% |
| Wasn’t Pro plan | 0% | 0% |

The new setup better measures how the redesign specifically affects Pro user behavior and its impact on the test group.
The original setup works better if we want to measure impacts across all users, particularly when the treatment affects Free users too. For instance, if Free users received free Superlikes, the treatment would influence their behavior as well.
  • Versioning: We accidentally exposed users in an experiment on old versions, so test group users got a broken, buggy version of the product that did not have logging set up properly.
  ‱ Internationalization: We exposed all users globally, but our feature was not actually localized yet, leading to negative impacts on test users who got a non-localized version.
  • Imbalanced exposures: Exposure or gating conditions were different for test or control users, leading to experiment imbalances.
Example: We added a [New] badge to the Profile tab for test group users to indicate that we added some new features.
However, we manually logged exposures for users only after they opened the Profile tab and not when they saw the badge, because we wanted to only expose people who actually saw the new Profile settings. This broke our experiment exposures entirely.
Let’s dig deeper

Exposure imbalances happen when we provide some treatment to users without exposing them to the experiment.
In the above example, the “New” badge throws off the entire experiment: test users who see it are more likely to open the Profile tab, and therefore more likely to be exposed.
If the “New” badge makes users +X% more likely to click on the Profile tab on a given day, test users are exposed at far higher rates, leading to an experimental imbalance.
This change has two central issues:
  1. We’ll have far more metrics from test users overall than from control users.
      ‱ We might see results like a +20% increase in total subscriptions simply because there are significantly more users in the test group.
  2. The users in test are biased by the “New” badge in general, and are likely to show metric movement just because of that, not because of the changes in the Profile menu.
      ‱ Users may have +X% Profile editing rates just because the New badge directs them to open Profile at higher rates.
To resolve this, we might consider not using the “New” badge until the feature is actually shipped to all users, or exposing all users into the experiment regardless of whether they open the Profile tab.
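As a minimal sketch of that second fix, here is what logging exposure at the moment the tab button renders (for both groups) might look like, reusing the hypothetical `getExperimentGroup` helper from the earlier snippets:

```javascript
// Reuses the hypothetical `getExperimentGroup` helper from the snippets above,
// which is assumed to log an exposure when called.
function renderProfileTabButton(user) {
  // Exposure fires here, when the treatment first becomes visible, and it
  // fires for both test and control users, so the groups stay balanced.
  const inTest = getExperimentGroup(user, "profile_new_badge");
  return { label: "Profile", badge: inTest ? "New" : null };
}
```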
To avoid these common classes of issues, all experiments should have carefully planned and reviewed exposure and gating guidelines. Auditing the end-to-end exposure flow is also helpful for more complex experiments!

Setting weak hypotheses and evaluation criteria

We may make our hypotheses and evaluation criteria too attainable, without proper guardrails and broader analysis to mitigate negative outcomes.
Example:
We have a feature called Tinge experiences, where you can purchase local and virtual date experiences.
On the Tinge experiences team, we decide to experiment with adding multiple entrypoints and upsells throughout the product to encourage users to purchase experiences.
We decide to use the experiment criteria:
  1. Goal A: Increase purchases of Tinge experiences.
  2. Guardrail B: Do not negatively affect overall monthly active users and subscriptions.
If both these objectives are met over a 2-week experiment, then we will ship the treatment.

Goal A will likely be met — because increasing entrypoints to a surface naturally increases visibility to that surface. But guardrail B is a very hard condition to trigger, as these metrics may be much harder to move.
Meanwhile, our changes might interrupt user flows, draw attention away from other important parts of the app, and come across as overly promotional. It’s possible that this experiment ends up regressing metrics not captured in the goal or guardrail metrics, especially when measured over a longer timeframe.
But because of our weak guardrails and lack of long-term, holistic analysis, we may end up shipping a change that leads to negative user outcomes.
With any experiment, we should make sure our hypotheses and guardrails are rigorous to ensure that the trade-offs of the change are worth it. Riskier product changes are best paired with product design discussions, growth principles, and customer feedback.

Bundling many treatments, breaking attribution

Combining multiple changes into a single experiment makes it difficult to identify which specific change drove the observed results.
Example:
We decide to test a new Tinge redesign! It overhauls the navigation structure, updates typography, and introduces "New" badges and modals throughout the app.
If this bundled redesign experiment shows a 20% increase in product usage, we can't confidently claim "our new navigation drove this improvement" or expect to see this increase sustained long-term.
It's likely that a specific change—such as the popup, New badges, or the novelty effect of all these changes combined—is responsible for most of the positive impact. Some elements of the bundle might have neutral or even negative effects, but we'd be unable to pinpoint which parts were actually beneficial.
It’s likely impractical or silly to isolate every individual change, and statistical power is weakened if we have to make too many experiment groups. But bundling too many treatments risks muddying the waters, making it hard to draw clear, actionable conclusions.
We should avoid bundling unrelated changes together in a single experiment and be open to the possibility that not all bundled changes are beneficial.

iii. interpretation & analysis

Misinterpreting statistical significance

Statistical significance, often measured using a p-value threshold like 0.05 (i.e. 95% confidence), is the standard for determining whether a metric result is meaningful enough to report on. While this threshold is a useful guideline, there are some caveats.
Example of a metric measurement with a confidence interval, from Statsig.
For smaller companies or experiments with limited sample sizes, achieving statistical significance is often challenging. Metrics often appear neutral with wide error bars, leaving inconclusive results. In some cases, metrics may fluctuate wildly day-to-day due to insufficient sample sizes or noise in the data, and we may not have enough statistical power to make sound conclusions.
Example: On Day 1, we saw that the Likes metric increased by +10%! But this changed on Day 2, where it was actually -5%. And then on Day 7 it was +20% again! What’s going on? Results like these prompt a deeper investigation. What’s the confidence interval on these metrics? How large was the analytics sampling rate and how many users are counted in this metric? Does this metric usually have daily or seasonal fluctuations?
Even accounting for statistical significance via p-value calculations has some flaws. For example, we may not have accounted for outliers, and need some methods like Winsorization.
Example: After inspecting some data from the experiment, we realized one particular botfarm was responsible for millions of likes. After removing outliers, we started to see a more consistent impact of +3±1% likes.
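Here is a minimal sketch of the winsorization idea, with toy numbers; real analysis tooling has its own outlier handling:

```javascript
// A toy winsorization helper: clamp values above an upper percentile so a
// single extreme account can't dominate the mean.
function winsorize(values, upperPct = 0.99) {
  const sorted = [...values].sort((a, b) => a - b);
  const cap = sorted[Math.floor(upperPct * (sorted.length - 1))];
  return values.map((v) => Math.min(v, cap));
}

const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;

const likesPerUser = [2, 0, 5, 1, 3, 250000]; // one botfarm account
console.log(mean(likesPerUser));              // ~41668: dominated by the outlier
console.log(mean(winsorize(likesPerUser)));   // ~2.7: much closer to a typical user
```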
To avoid statistical significance pitfalls, we should always include confidence intervals in any effect statement (e.g. 5±3%) and investigate outliers and anomalies that may distort findings.
We should also remember that confidence intervals only capture the probable outcome of a metric, and they do not account for experimental errors described in this article!

Changing evaluation criteria after the experiment begins

It’s natural to see positive metric impacts and want to use them to justify a change, but changing metrics and hypotheses after an experiment begins undermines its statistical integrity.
Example:
In an experiment modifying the messages tab, our hypothesis was that we would increase messages sent. However, after the experiment concluded, we instead saw -4±3% messages sent.
While browsing the list of impacted metrics, we found some positive outcomes in our core metric collection.
  • +1±1% subscriptions
  • +4±1% message reactions
We decide to include them in our report and argue that, because we saw these positive results, we should ship the change, despite our original hypothesis being incorrect.
This is an example of poor experiment methodology, because:
  1. The statistical validity of an experiment relies on conducting the analysis as planned, with a fixed hypothesis and a pre-registered set of metrics.
  2. We are inherently biased towards selecting metrics that validate our hypothesis. Picking results after the fact often introduces significant bias from the experimenter.
Each additional metric or hypothesis shift introduces more uncertainty, increasing the likelihood that some of these confidence intervals will show significant effects purely by chance, not due to any real impact from our changes. It’s kinda like re-rolling the dice on statistical significance every time you swap or add new metrics.
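As a back-of-the-envelope sketch of that dice-rolling effect, assuming independent metrics each tested at a 5% false-positive rate (a simplification):

```javascript
// With a 5% false-positive rate per metric, the chance that at least one of
// k unrelated metrics looks "significant" purely by chance grows quickly with k.
const chanceOfAtLeastOneFalsePositive = (k, alpha = 0.05) => 1 - Math.pow(1 - alpha, k);

console.log(chanceOfAtLeastOneFalsePositive(1).toFixed(2));  // 0.05
console.log(chanceOfAtLeastOneFalsePositive(10).toFixed(2)); // 0.40
console.log(chanceOfAtLeastOneFalsePositive(20).toFixed(2)); // 0.64
```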
That’s not to say these new metrics are necessarily insignificant or that these observed outcomes aren’t real. However, they should not be used as the sole basis to ship an experiment unless they are very significant results (e.g. +20±1%) and the overall impact is positive.
For many changes, it may be very reasonable to argue “we didn’t see the original effect we wanted, but feel this is the right change to ship based on other factors.” It is also reasonable to note these additional metrics in a report.
But for a statistically rigorous experiment, any hypothesis should be evaluated against their original criteria. Reselecting success metrics or changing hypotheses breaks the experimental integrity and reliability of the results.

Changing end date while the experiment is ongoing

Example:
We’re running an experiment and set it to be 2 weeks long.
On day 14, only a few metrics are significant, a few red, a few green.
We want more data since the error bars are quite large, so we decide to keep running until we see more significant metrics.
On day 20, metrics look significant enough! So we decide to end the experiment and write the report.
In How not to run an AB test, Evan Miller explains why this is problematic:
“Repeated significance testing always increases the rate of false positives, that is, you’ll think many insignificant results are significant (but not the other way around). The problem will be present if you ever find yourself “peeking” at the data and stopping an experiment that seems to be giving a significant result. The more you peek, the more your significance levels will be off. For example, if you peek at an ongoing experiment ten times, then what you think is 1% significance is actually just 5% significance. If you run experiments: the best way to avoid repeated significance testing errors is to not test significance repeatedly. Decide on a sample size in advance and wait until the experiment is over before you start believing the “chance of beating original” figures that the A/B testing software gives you. “Peeking” at the data is OK as long as you can restrain yourself from stopping an experiment before it has run its course. I know this goes against something in human nature, so perhaps the best advice is: no peeking!”
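To make this concrete, here is a small Monte Carlo sketch: there is no real effect at all in the simulated data, yet checking significance at several interim points and stopping at the first “significant” reading inflates the false-positive rate well above 5%.

```javascript
// Monte Carlo sketch of the peeking problem, using pure noise (true effect = 0).
function randNormal() {
  // Box-Muller transform for a standard normal sample.
  const u = 1 - Math.random();
  const v = Math.random();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

function simulatePeeking(numSims = 2000, checkpoints = [250, 500, 750, 1000]) {
  let falsePositives = 0;
  for (let sim = 0; sim < numSims; sim++) {
    let sum = 0, sumSq = 0, n = 0;
    for (const target of checkpoints) {
      while (n < target) {
        const x = randNormal(); // per-observation difference, true effect = 0
        sum += x;
        sumSq += x * x;
        n++;
      }
      const mean = sum / n;
      const se = Math.sqrt((sumSq / n - mean * mean) / n);
      if (Math.abs(mean / se) > 1.96) { // "significant" at this peek: stop early
        falsePositives++;
        break;
      }
    }
  }
  return falsePositives / numSims;
}

console.log(simulatePeeking()); // typically ~0.11-0.13, not the nominal 0.05
```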

A note on statistical significance:

Let’s suppose that we broke statistical significance by committing a few sins above, resulting in outcomes that were only about 70% confident. If we were to accept these experiments purely on a statistical basis, we would be accepting a ~30% chance that our experiment result was due to random chance rather than a true effect. With 100 such experiments, about 30 of them would show "significant" results that are actually just variance.
We’d be unknowingly making the product worse with many of our changes, and making bad decisions motivated by inaccurate conclusions. For all the extra work we had to do to run these experiments, this is a pretty terrible outcome. This is why we set a high threshold for statistical significance, and care so much about statistical rigor, avoiding type I (false positive) & II (false negative) statistical errors.

Extrapolating results to the long-term

Tech companies usually settle on an experiment duration of two weeks, which seems to be a sweet spot balancing accuracy and velocity.
But depending on the treatment, too short an experiment underemphasizes long-term effects.
Additionally, the future product (and the future user) is going to be quite different from present conditions, especially at fast-moving tech companies.
Despite these limitations, we often incorrectly treat short-term experimental results as reliable predictors of long-term impact.
Example: We launched Tinge experiences and reported great usage metrics and +10% increased revenue from our 2-week experiment. All our experiment results looked great, so we shipped it! Our claim was that the new feature grows revenue by +10%, based on the data and high feature adoption rates. But two months later, revenue growth has reversed to -10% and all our Experiences metrics look concerning, with low adoption rates.
What happened?
It’s hard to know from this data alone, but it’s possible that any number of these happened:
  1. Much of the initial increased usage was a novelty effect. Users were drawn to the allure of the new surface, captivated by feature upsells, and used it at higher rates than normal.
  2. Users took some time to realize that the feature underdelivered on its promises, and then became less attached to the product overall. This wasn’t something they could realize immediately during the experiment, since they needed time to use and evaluate the experience.
  3. The experience changed some time in the 2 months after the experiment — the feature got worse, buggier, or clunkier.
  4. We shipped another feature that didn’t really make sense with this one, and the two surfaces cannibalized each other’s benefits.
  5. There may be some seasonal effect, where overall app usage is naturally lower — maybe because people are dating less or are less interested in the “experiences” we’re offering during particular months.
Short-term metric movements alone shouldn’t drive ship decisions. We should ask deeper questions like, “is this actually a good experience in the long-term product that we believe in?” and “how does this fit into the future product vision?”
For more significant changes, we may also consider longer term experiments or holdout groups to capture long-term sustained impacts.

Interpreting metric definitions wrong

When communicating results, we often simplify metrics and experiment outcomes to make them easier to understand.
But the more we simplify, the less nuance is captured, and the more likely our findings are wrongly interpreted. Sometimes, we may have missed some key nuances of the metrics ourselves.
Examples:
“There are 1000 users* of this feature now!”
  • Is this daily? weekly? monthly? unique users?
  • What defines a “user” of this feature?
  • How does this user count compare to similar features?

“We increased subscriptions* by 5%.”
  • Is this total new subscriptions or net subscriptions including cancellations and refunds?
  • Is this number global subscriptions or just exposed users?
  • Does this account for normal growth during this period?
Depending on how these qualifications are defined, the numbers can shift by orders of magnitude in either direction.

For example, it’s natural to isolate experiments to expose only users who were impacted by the treatment. Otherwise, measured effects may be too minuscule and diluted. But this may lead to some incorrect metric interpretations.
Example: We experimented with a new setting and our test showed +5% subscriptions — a massive win! We should see our global revenue go up significantly... right?
We may not have realized that the test only exposed people who went into the settings menu (perhaps only 2% of users over two weeks), who are more likely to subscribe anyway. In reality, the global increase in subscriptions might be closer to 0.1% (on the order of 2% exposure × 5% lift).
These nuances are especially relevant when trying to make causative claims or recommendations about future decisions.
Example:
“Improving prompt quality and quantity led to 20% increased matches, and therefore we should invest more deeply into prompts.“
→
“We saw a +20±11% increased match rate from profiles that adopted the batch of 10 new prompts, and +5±3% increased daily usage overall for all test group users. Based on the results of this experiment, we should ship the change. We generally suspect investing in having fresh and diverse prompt selection is beneficial for increasing user engagement.”
Striking the right balance between clarity and nuance ensures insights are communicated effectively without oversimplifying. Adding the right amount of context empowers teams to apply findings correctly and make better data-informed decisions!

iv. fin

We've recently witnessed numerous American analysts predict the 2024 U.S. Election to end in an easy Kamala Harris sweep or an extremely close race. What we saw instead was a Trump landslide victory claiming all seven swing states.
These analysts, despite their methods being rooted in statistical thinking, suffer from the same human mistakes we make in software experimentation. Their platform has flaws, their experiment methodology has issues, or they make the wrong interpretations and extrapolations from the data. The ongoing replication crisis in science also serves as a stark reminder of these challenges.
Ultimately, statistical conclusions are inherently probabilistic; they offer insights but cannot guarantee future outcomes.

The allure of experimentation lies in its promise of precision: validated hypotheses, actionable insights, and improved decision-making. But errors in platforms, setup, or interpretation can obscure the truth and lead to flawed conclusions.
To properly leverage experimentation, we must acknowledge its limitations and consider the nuances of metrics, exposures, and second-order effects. Experiments should guide decisions, not dictate them.

By pairing experimentation with user research, feedback, product intuition, and design thinking, we can take a more holistic approach to building products.
I hope this article was helpful in expanding your understanding of experimental thinking! 🙂