Pitfalls to Avoid in Creative Testing
Creative A/B and multivariate testing is broken. Most tests produce wildly different results when run again because the complexity of the digital ad ecosystem makes it impossible to conduct a test that generates accurate and repeatable creative insights.
There are several pitfalls that heavily bias creative tests that, once known and understood, can be avoided in a well-planned creative test. Adacus is the only ad server that eliminates these sources of bias in creative testing.
Faint Signal Problem
The central challenge in conducting creative tests is the faint signal problem. There are two parts to the faint signal problem:
- The signal is weak, and
- Media influences the signal far more than does creative.
These are actually two separate problems; we will address the first here and the second below, in Not Controlling for Media.
Weak Signal: The conversion rates used to measure digital ad effectiveness - from click-through rates and video completion rates to view-through site engagement rates and order rates - are generally very small. By comparison, A/B testing is common on websites, where effectiveness is measured in terms of page click-through rates, which are much larger.
Unfortunately, as conversion rates get smaller, the required sample size only gets larger. And the increase is not linear - it accelerates sharply as rates shrink, as can be seen in the chart below.
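The relationship can be sketched with the standard two-proportion sample-size formula (a textbook normal-approximation calculation, not any particular vendor's calculator; the rates and lift below are illustrative):

```python
from statistics import NormalDist

def sample_size_per_group(base_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate per-group sample size for a two-proportion z-test."""
    p1 = base_rate
    p2 = base_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2) + 1

# Detecting a 10% relative lift gets dramatically harder as rates shrink:
for rate in (0.05, 0.01, 0.002):
    print(rate, sample_size_per_group(rate, 0.10))
```

A 5% conversion rate needs tens of thousands of users per group; a 0.2% rate needs hundreds of thousands.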
Many digital marketers hoping to run an A/B test are stopped in their tracks once they learn how long it will take to get results. The result: most creatives go completely untested.
But it doesn't need to be this way. Adacus' platform employs four tactics to amplify the weak signal and deliver creative insights in days and weeks rather than months.
- Separating Media Attribution from A/B Testing
- Bayesian A/B Test Statistics
- Multivariate Test Design
- Offline Ad Effectiveness Tracking
Separating Media Attribution from A/B Testing: Ad servers such as Doubleclick apply attribution models by default to all conversion reporting. In other words, when a user converts after having seen ads from a test as well as ads from other placements, Doubleclick will not always attribute that conversion to the test group to which the user was assigned.
This makes A/B testing all but impossible, for two reasons:
- Doubleclick attribution removes most conversions from the results of an A/B test, thus increasing the amount of testing time required to achieve significance from weeks to months.
- Doubleclick attribution introduces noise into the A/B test results. Evaluation of an A/B test does not require multi-touch attribution as users are only presented with the A ad or the B ad throughout the duration of the test.
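The distinction can be illustrated with a minimal sketch (the log format and field names here are assumptions for illustration, not Doubleclick's actual schema): every conversion is credited to the group the user was randomly assigned to, with no attribution model filtering or reassigning conversions.

```python
from collections import Counter

# Hypothetical logs: each user's randomized test group, and raw conversions.
assignments = {"u1": "A", "u2": "B", "u3": "A", "u4": "B"}
conversions = ["u1", "u4", "u4"]  # u4 converted twice

# Credit every conversion to the user's assigned group - no attribution
# model removes or reassigns conversions across groups.
results = Counter(assignments[user] for user in conversions if user in assignments)
print(results)  # Counter({'B': 2, 'A': 1})
```

Because each user sees only their assigned creative for the duration of the test, this simple group-level tally is all the attribution the test requires.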
Bayesian A/B Test Statistics: But where do these sample size requirements come from? First-generation A/B testing platforms reused the statistics that have traditionally been used for offline hypothesis testing, such as clinical trials. Traditional hypothesis testing relies entirely on the calculation of a p-value as the sole measure of significance, as explained below.
A new generation of A/B testing statistics is being developed by online optimization platforms such as VWO (website A/B testing), swrve (mobile app A/B testing) and Adacus (digital advertising A/B testing). This new generation of A/B testing statistics is based on Bayesian statistics, which is increasingly leveraged for much of today’s data science and machine learning.
What is “statistical significance”?
Marketers perpetually ask if test results are “statistically significant,” but what does that term even mean? Probably not what you think.
Traditional statistics relies on p-values as a measure of the “statistical significance” of test results. Most marketers are surprised to learn that p-values do not, in fact, measure the probability that one creative or treatment will outperform another. What p-values measure is far more abstract and removed from the decisions that marketers make based on A/B tests.
In every A/B test, one variation will perform at least slightly better than the other. P-values measure the probability that a test result (say, creative variation B outperforming creative variation A by 10%) would have occurred if in fact there were no difference between the two creatives at all. That "95% confidence level" threshold you've probably heard bandied about simply means that there is a 5% chance that, were the two variations identical, you would have observed as large a difference in performance between them as you did. The p-value is an important measure in other fields of study to account for what is known in traditional statistics as Type I error. In our experience, we have yet to hear a digital marketer ask us for this specific probability. And why would they?
Bayesian statistics makes use of two metrics that are critical to making decisions based upon A/B tests: Chance to Beat and Potential Loss.
- Chance to Beat: Most marketers assume that p-values reflect the chance that one creative variation will beat another. As described above, they only measure one source of sampling error. Chance to Beat Control, however, is based upon direct analysis of the probability distribution for each creative variation, and thus answers the question being asked by marketers directly.
- Potential Loss: You've selected a winning creative variation, and you know based on the Chance to Beat that there is a chance it's a false positive and the losing variation could beat it. But do you know by how much? You learn this from the Potential Loss of selecting a variation as the winner. This tells you how much is at risk if the losing variation turns out to be the winner. In other words, if you were to rerun the test an infinite number of times, and note every time the selected variation lost, this metric tells us the average loss in all those tests.
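Both metrics fall out of Monte Carlo samples of each variation's posterior conversion-rate distribution. Here is a minimal sketch (the conversion counts are made-up illustrative data, and the uniform Beta(1, 1) prior is an assumption, not Adacus' documented model):

```python
import random

random.seed(42)

# Hypothetical test data: (conversions, impressions) for each variation.
a_conv, a_imps = 40, 1000
b_conv, b_imps = 60, 1000

# Beta(1 + conversions, 1 + non-conversions) posteriors under a uniform prior.
draws = 20000
a = [random.betavariate(1 + a_conv, 1 + a_imps - a_conv) for _ in range(draws)]
b = [random.betavariate(1 + b_conv, 1 + b_imps - b_conv) for _ in range(draws)]

# Chance to Beat: how often B's sampled rate exceeds A's.
chance_to_beat = sum(bi > ai for ai, bi in zip(a, b)) / draws

# Potential Loss of declaring B the winner: average shortfall when A wins.
potential_loss = sum(max(ai - bi, 0.0) for ai, bi in zip(a, b)) / draws

print(f"Chance to Beat: {chance_to_beat:.3f}")
print(f"Potential Loss: {potential_loss:.5f}")
```

Unlike a p-value, the first number answers the marketer's actual question ("how likely is B to beat A?") and the second quantifies the downside of being wrong.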
So, while p-values draw an arbitrary line in the sand, a line based on an abstract and non-intuitive measure of significance, Bayesian A/B statistics provide substantive, actionable measures of significance.
And not only do these metrics answer the actual questions that digital advertisers are asking, they are knowable with far smaller sample sizes than p-values require.
As the chart demonstrates, the difference in sample size requirements between traditional and Bayesian A/B test statistics gets larger as conversion rates get smaller. For creative to be tested and optimized at the fast pace of digital advertising, Bayesian statistics are a critical tool in the digital marketers’ toolkit.
Multivariate Test Design: In combination with Bayesian statistics, multivariate tests can also be used to reduce the required sample size of a digital advertising A/B test, thus speeding up time-to-insights and enabling mid-flight creative optimization.
Full factorial tests consist of two or more “factors” (elements to test) each of which has multiple options. This allows marketers to run multiple A/B tests simultaneously without increasing the required sample size.
While generally used for multivariate testing – to identify the effect of combinations of variables – the results of full factorial tests can also be analyzed for each separate factor in the test. This enables you to run multiple A/B tests concurrently, reusing the same sample of impressions.
In the below example, full factorial test design allows an automobile advertiser to test vehicle models (Group A v B) and messaging (Group C v D) simultaneously.
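A minimal sketch of that 2x2 design (the group labels, factor options, and per-cell counts are all invented for illustration): each served creative is one combination of model and message, and each factor is analyzed by pooling cells across the other factor, so both A/B tests reuse the same impressions.

```python
from itertools import product

# Hypothetical 2x2 full factorial: vehicle model x message.
models = ["SUV", "Sedan"]       # Group A vs Group B
messages = ["Safety", "Price"]  # Group C vs Group D
cells = list(product(models, messages))  # 4 creative variations

# Made-up per-cell results: (conversions, impressions).
results = {
    ("SUV", "Safety"): (30, 5000), ("SUV", "Price"): (22, 5000),
    ("Sedan", "Safety"): (25, 5000), ("Sedan", "Price"): (18, 5000),
}

def factor_rates(index, options):
    """Pool the cells sharing each option of one factor."""
    rates = {}
    for opt in options:
        conv = sum(c for cell, (c, n) in results.items() if cell[index] == opt)
        imps = sum(n for cell, (c, n) in results.items() if cell[index] == opt)
        rates[opt] = conv / imps
    return rates

print(factor_rates(0, models))    # model A/B test, using all 20,000 impressions
print(factor_rates(1, messages))  # message C/D test, same sample reused
```

Each marginal analysis uses every impression in the test, which is why the factorial design shrinks the total sample required versus running the two tests sequentially.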
Offline Ad Effectiveness Tracking: When all of a digital marketer’s conversions are found in online activity – site engagement, eCommerce orders – then conversion tracking is as simple as placing a pixel on their web site, but most marketers don’t have it that easy.
Marketers whose businesses convert customers in call centers or in brick-and-mortar retail locations often unnecessarily limit their A/B tests to only the small subset of orders that are placed online. This significantly increases the required sample size of impressions, making A/B testing all but impossible for most digital advertisers.
The solution is offline conversion tracking. If your orders are placed by phone, you can leverage existing call intelligence technology to tie a call to an online device. If your orders are placed in physical stores, you can leverage sales measurement vendors to tie a purchase to an online device. To maximize your A/B testing dollars, make sure your creative optimization vendor integrates with these offline tracking companies.
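Once a vendor has matched offline orders back to online device IDs, folding them into the test is a simple join (the device IDs and match sets below are hypothetical, and real integrations would involve the vendor's own matching pipeline):

```python
# Hypothetical data: test-group assignments by device, plus device IDs a
# call-intelligence or sales-measurement vendor matched to offline orders.
assignments = {"dev1": "A", "dev2": "B", "dev3": "A", "dev4": "B"}
offline_converters = {"dev2", "dev3"}  # matched phone / in-store orders
online_converters = {"dev1"}           # pixel-tracked site orders

conversions = {"A": 0, "B": 0}
for device, group in assignments.items():
    if device in offline_converters or device in online_converters:
        conversions[group] += 1

print(conversions)  # offline orders now feed the test alongside online ones
```

Counting offline converters alongside online ones multiplies the usable signal and shrinks the impression sample the test needs.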
Not Controlling for Media
There are two parts to the faint signal problem in testing creatives.
- The signal is weak. This was addressed above. And...
- Media influences the signal far more than does creative.
It's not that media is more important than creative, but that the variation in media quality is far greater than the variation in creative quality. Depending on DSP settings like CPM, frequency, and other bidding tactics, media buying may end up with a spot above the fold on a prominent news site or with completely unviewable or fraudulent media inventory.
As a result, when testing creative, it is critical to hold the media constant across the creatives being tested. This creates two critical requirements for any creative test in digital advertising:
1. Randomly assign users on the ad server to a creative being tested and serve the same creative to the user for the duration of the test.
Tests are all-too-often conducted by comparing the performance of two DSP line items being run simultaneously. The problem with this approach is that it does not hold the media constant. Different line items, even with identical settings, will inevitably access different inventory over the course of the test.
A/B tests on Facebook generate misleading results for this same reason. There is currently no way to do A/B testing of creatives on Facebook, though many try by creating a different Ad Set for each creative being tested. Facebook Ad Sets, however, are media bidders, just like DSP line items, and they automatically optimize to find the best performing inventory. When assigning different creatives to different Facebook Ad Sets, there is no way to know if the difference in performance is due to different bidding optimizations or to creative.
A proper A/B test in digital advertising must be run on a single line item, with the ad server randomly assigning users into groups at each impression so that the audience used for the test is truly identical across the test groups.
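One common way to implement this kind of sticky per-user randomization (a general hashing technique, not Adacus' documented implementation) is to hash a stable user ID together with a test ID, so the same user lands in the same group on every impression without any server-side state:

```python
import hashlib

def assign_group(user_id: str, test_id: str, groups=("A", "B")) -> str:
    """Deterministically bucket a user for one test.

    The same user always gets the same group, so they are served a
    single creative for the full duration of the test.
    """
    digest = hashlib.sha256(f"{test_id}:{user_id}".encode()).hexdigest()
    return groups[int(digest, 16) % len(groups)]

# Sticky across impressions for the same user:
print(assign_group("user-123", "summer-test"))

# Roughly balanced across many users:
counts = {"A": 0, "B": 0}
for i in range(10000):
    counts[assign_group(f"user-{i}", "summer-test")] += 1
print(counts)
```

Salting the hash with the test ID means the same user can be re-randomized independently for a future test.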
2. Do not change the allocation of traffic between creatives during a test. Otherwise changes in traffic allocation will be confounded with changes in media quality over time.
A common tactic to reduce testing time is to shift the weighting of traffic toward the "winning" creatives based on early results. This is sometimes done by manually resetting traffic allocation, or automatically, with multi-armed bandit algorithms.
Bandit algorithms make a lot of sense in domains such as automated selection of the news article headlines that generate the most article expansions by readers. In this common use of bandits, there is no reason to believe a headline's performance would change from one day or one week to the next.
When testing digital display and video ads, on the other hand, the near constant changes in media quality from programmatically purchased inventory would artificially bias any test that shifted more traffic to one creative than another during one period of the test.
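A tiny numeric sketch makes the bias concrete (all numbers are invented): two identical creatives, with media quality improving mid-test. A fixed 50/50 split measures them as equal, while shifting traffic toward creative A in the second period makes A look better purely because it received more of the good inventory.

```python
# Two identical creatives; conversion rate depends only on media quality,
# which improves from period 1 to period 2.
rate_by_period = {1: 0.005, 2: 0.015}

def observed_rate(impressions_by_period):
    conv = sum(n * rate_by_period[p] for p, n in impressions_by_period.items())
    return conv / sum(impressions_by_period.values())

# Fixed 50/50 allocation: both creatives see the same media mix.
fixed_a = observed_rate({1: 1000, 2: 1000})
fixed_b = observed_rate({1: 1000, 2: 1000})

# Shifted allocation: A gets 90% of period-2 traffic after "winning" early.
shifted_a = observed_rate({1: 1000, 2: 1800})
shifted_b = observed_rate({1: 1000, 2: 200})

print(fixed_a, fixed_b)      # identical, as they should be
print(shifted_a, shifted_b)  # A now looks far better than B
```

Under the shifted allocation, A appears to outperform B by roughly 70% even though the two creatives are identical by construction - the difference is entirely media-quality drift confounded with the allocation change.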
Test Plan Siloed from Media Plan
It is critical that creative A/B testing be coordinated with your media agency or in-house programmatic team. Programmatic media buying that is not coordinated with creative optimization can undermine creative tests in the following ways:
1. Buying low CPM inventory that is less viewable limits the impact of any creative
As discussed in the introduction, creative is growing in importance for multiple reasons, one of which is that digital advertisers are increasingly getting the viewability and attention to their ads that they had been missing. Your creative testing will only be as informative as your ads are viewable. Ad placements that compete with 5-10 other placements on the page may be low in CPM, but the limited attention they garner makes creative less effective, and thus makes creative testing less effective. If your impressions are 40% unviewable, 40% competing with 5-10 other placements, and only 20% both viewable and prominent, the impact of creative will simply be minimal.
2. Targeting the same users in the A/B test with other programmatic campaigns
Sometimes when an agency sets up a creative A/B test, they generate a separate placement in the ad server for the test and traffic it to a line item or package with their trading desk. Such tests are less likely to measure the actual differences in performance between creative variations, because users in the test are being served creatives from other programmatic campaigns at the same time. When providing multiple placements or ad tags to a trading desk, ensure that the trading desk isolates users in a test from other programmatic campaigns.
Rotation-based Testing on Doubleclick, Facebook
Many ad servers, including Doubleclick and Facebook, offer a feature within creative rotations that optimizes the percentage weight of each creative in the rotation based on conversions, as measured by clicks or by on-site events. Digital advertisers are strongly advised to avoid this feature and instead conduct actual A/B tests.
A/B testing via rotation optimization randomly assigns creative variations to impressions, not to users. As a result, users see multiple creatives during a test. Doubleclick, Facebook, and other ad servers that support rotation-based optimization use last-touch attribution to assign full credit for a conversion to the last creative viewed by a user who clicks or converts. This approach ignores the impact of every creative that is not the last one viewed or clicked, which negates decades of research on how advertising shapes buying patterns.
Ironically, optimizing the weighting of creatives within a rotation is premised on the assumption that rotations are bad: the impact of any creative that is not the last viewed or clicked within a rotation is presumed to be zero.
This is not to say that creative rotations aren’t often better than serving a single creative to a user. In fact, Adacus supports a special form of testing that evaluates the impact of serving the same creative multiple times versus serving a rotation of creatives to a user and selects the optimal subset of creatives to include in a rotation - Creative Rotation Attribution Testing.