Why session-based attribution is flawed in A/B tests
To yield more actionable results, marketers should move away from session-based attribution and start measuring A/B tests based on revenue per user.
Here’s what you need to know:
- Shift from session-based to per-user conversion attribution for A/B testing to get more actionable results.
- Session-based KPIs are easier to track but per-user KPIs are better for long-term revenue growth.
- Session-based KPIs treat each session as independent, even for the same user, making them unreliable.
- Per-user KPIs count each user only once, providing a more reliable picture, with various algorithms available for large-scale user counting.
- Some A/B testing and personalization platforms still use session-based KPIs due to the computational cost of distinct user counting.
Successful marketing requires the ability to measure the relative contribution of each activity and channel to conversion and purchasing over an extended period of time. However, many A/B testing and conversion optimization solutions push marketers into flawed single-session attribution and short-term optimization. As a result, marketers embrace misleading tactics that undermine the effectiveness of their marketing optimization campaigns.
In this article, I’ll highlight the importance of choosing the right KPI for your experiments, and how to overcome a key challenge associated with selecting actionable KPIs.
The difference between session-based and user-based attribution
As mentioned, “per-session” KPIs are more widely used in the industry than “per unique user” KPIs. This is mainly because they are easier to implement, monitor and optimize toward. Yet the best KPIs for achieving true, long-term revenue uplifts are “per user” based. Here is why:
Practically every statistical engine out there assumes that trials (namely “unique users” or “sessions”) are independent. By independent we mean that the outcome of one trial does not change the probability that another trial converts or generates revenue. Clearly, when the same user starts several sessions, this does not hold: what happened in earlier sessions dramatically changes the probability of converting in later ones. There is a clear statistical dependence.
In practice, many users initiate several sessions. When the independence assumption breaks, the reliability of the results coming out of the statistical engine is badly hurt, and in an unknown direction. Statistical significance, probability to be best, confidence intervals and winner declarations all become less reliable to an unknown extent. The “per user” KPI is much more reliable because each user is counted just once, and the assumption that different users are statistically independent remains pretty solid.
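To make the difference concrete, here is a minimal Python sketch that computes both KPIs from the same data. The session log, its field layout and the function names are hypothetical and chosen only for illustration; a real platform would read these records from its event pipeline.

```python
from collections import defaultdict

# Hypothetical session log: (user_id, variation, converted_in_session).
sessions = [
    ("u1", "A", False),
    ("u1", "A", False),
    ("u1", "A", True),   # u1 converts on the third visit
    ("u2", "A", False),
    ("u3", "B", True),
    ("u4", "B", False),
    ("u4", "B", False),
]

def per_session_rate(sessions, variation):
    """Conversions divided by sessions: every session is its own trial."""
    rows = [s for s in sessions if s[1] == variation]
    return sum(1 for s in rows if s[2]) / len(rows)

def per_user_rate(sessions, variation):
    """Converted users divided by distinct users: each user is one trial."""
    converted = defaultdict(bool)
    for user_id, var, conv in sessions:
        if var == variation:
            converted[user_id] = converted[user_id] or conv
    return sum(converted.values()) / len(converted)

print(per_session_rate(sessions, "A"))  # 0.25 (1 conversion in 4 sessions)
print(per_user_rate(sessions, "A"))     # 0.5  (1 of 2 users converted)
```

In this toy data, user u1 contributes three of the four "trials" in variation A under the per-session view, and those three sessions are clearly not independent of each other. The per-user view collapses them into a single trial.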
Distinct counts are hard
Given the inherent advantages of per-user KPIs, why is it uncommon to make decisions using KPIs normalized by unique (a.k.a. distinct) users? Well, it turns out that counting distinct users at scale is a much harder task than it looks.
Enterprise-grade optimization often involves a stream of millions of events with repeating user IDs. Thus, maintaining a data structure that allows counting each observed user ID only once requires enormous amounts of memory, particularly when a user is running multiple tests and tracking multiple unique counters for each. Various algorithms have been offered which, given limited memory and a stream of recurring numbers, supply an approximate distinct count of those items. The crown jewel is the groundbreaking HyperLogLog algorithm, which is continually improved upon by multiple players in the analytics and big data fields. It is offered by Apache Spark, supported in Redis, and underlies many of the other hot technologies we’re hearing about today.
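The trade-off between exact and approximate counting can be sketched in a few lines of Python. This is a simplified, textbook-style HyperLogLog written for illustration only; it is not the Redis or Spark implementation, which add further bias corrections and compact encodings. The class name, the choice of SHA-1 as the hash function and the default of 2^14 registers are assumptions made for this example.

```python
import hashlib
import math

class HyperLogLog:
    """Simplified HyperLogLog: fixed memory, approximate distinct count."""

    def __init__(self, p=14):
        self.p = p                      # number of bits used to pick a register
        self.m = 1 << p                 # number of registers (16384 for p=14)
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant for m >= 128

    def add(self, item):
        # Take 64 bits of a hash of the item.
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                    # first p bits: register index
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return estimate

def exact_distinct(stream):
    """Exact count: memory grows with the number of distinct IDs seen."""
    return len(set(stream))

def approx_distinct(stream, p=14):
    """Approximate count: memory is fixed by p, whatever the stream size."""
    hll = HyperLogLog(p)
    for item in stream:
        hll.add(item)
    return hll.count()
```

Feeding the same stream of user IDs to both functions shows the design choice: `exact_distinct` has to remember every ID, while `approx_distinct` only ever holds its fixed set of registers and returns an estimate that is close to, but generally not equal to, the true count.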
However, HyperLogLog has one glaring problem which prevents its reliable use for A/B testing. The algorithm guarantees what the standard error will be given the amount of memory allocated to it. For example, the Redis implementation uses 12 KB per HyperLogLog counter, giving a 0.81% standard error. That sounds great! But what does it actually mean?
Well, it means that if you simulate one million tests using HyperLogLog for counting and calculate the relative standard deviation of the approximate counts vs. reality, the standard deviation would be ever-so-close to 0.81%. However, HyperLogLog does not make any guarantees on the maximum approximation error for a single given counter: it might be +0.01%, it might be -1.5%, it might be +4%! For analytics, this is generally fine. However, when deciding which A/B test variation is better based on a 2% difference in performance, you need to know that the difference is not due to HyperLogLog “misbehaving”.
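A rough back-of-the-envelope calculation shows why this matters. The sketch below assumes, purely for illustration, that a counter's relative error is approximately normally distributed with the 0.81% standard error quoted above, that only the unique-user denominators are approximated, and that the two variations use independent counters. None of these assumptions come from a specific platform.

```python
import math

HLL_STD_ERROR = 0.0081   # relative standard error quoted above for Redis
OBSERVED_UPLIFT = 0.02   # the 2% difference from the example above

def two_sided_tail(z):
    """P(|Z| > z) for a standard normal Z."""
    return math.erfc(z / math.sqrt(2))

# Chance that a single counter is off by more than 2% in either direction.
p_single_counter_off = two_sided_tail(OBSERVED_UPLIFT / HLL_STD_ERROR)

# Comparing two variations involves two independent counters, so the noise
# on the ratio of their KPIs is roughly sqrt(2) times larger (first order).
ratio_std_error = math.sqrt(2) * HLL_STD_ERROR

# Chance that counting noise alone moves the measured relative difference
# between the two variations by 2% or more.
p_spurious_uplift = two_sided_tail(OBSERVED_UPLIFT / ratio_std_error)

print(f"single counter off by >2%: {p_single_counter_off:.2%}")
print(f"std error of the comparison: {ratio_std_error:.2%}")
print(f"2% shift from counting noise alone: {p_spurious_uplift:.2%}")
```

Under these assumptions, a single counter misses by more than 2% only about once in a hundred times, but counting noise alone shifts the comparison between two variations by 2% or more several times in a hundred. That noise comes on top of the ordinary sampling noise the statistical engine already accounts for, and the engine has no way of knowing it is there.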
For this reason, some experimentation and personalization platforms like Dynamic Yield use only exact distinct counts when calculating KPIs used for decision making.