I’m a predictable person. Why? Because I tend to repeat myself.
If something is smart, necessary or important, I’m going to continue emphasizing it until I die or until people get it.
One such thing is split testing.
Split testing is the essence of conversion optimization. A runner can’t run without moving his legs. In the same way, a conversion optimizer can’t optimize without conducting split tests. It’s like a silver bullet for conversion optimization.
But for all its silver bulletness, split testing has its own set of pitfalls and perils. One does not merely split test and, voila!, out pops a perfectly optimized site.
Those who split test must be aware of what could go wrong in the process of split testing. There are dangers. If you’re going to do any split testing — and you should — then you also need to be aware of the split testing killers.
I’ve listed four major split testing killers. But, you ask, aren’t there more split testing killers than just these few? Of course. But I don’t want to bore you with the little mistakes. I want to frontload the important stuff and help you remember a few good things rather than a whole laundry list of maybe-helpful-maybe-not things. Mkay?
Without further ado, here is the list of split testing killers.
1. Not split testing at all.
If you’re not testing then you’re not making mistakes, right? Wrong. You’re making the biggest mistake of all.
In spite of its warts, wrinkles, abuse and mistakes, split testing is still the only way to gain ground in the world of conversion optimization. Smart as you are, you can’t make all the right conversion optimization improvements 100% of the time. (Neither can split testing, but it can get a darn bit closer than you can.)
Yes, the biggest mistake is not making any mistakes. If you split test, you’re going to experience blunders. But that’s part of the value of testing. You screw it up, so you do it again. And in the process, you learn.
What are some of the things that would compel a conversion optimizer not to test?
- False Assurance — You read about other split tests, and they made the same changes that you did. So, instead of testing your own changes, you feel good about them, and don’t test them.
- Misguided Change — Applying the outcome of case studies you read about. So you read some tantalizing A/B test case study where some guy achieved an 89,090% increase in conversions by changing some random something. You think, “Ah, the man is a genius, and it shall work for me, too!” And so you make the change, totally not realizing that he was selling urinals, and you are selling perfume. If you split test that change, you may realize that what works for urinals doesn’t work for L’Air Du Temps.
Keep in mind that the A/B tests you read about might be guilty of testing mistakes, too. Don’t believe everything you read.
Instead, do your own testing.
2. Stopping a split test too soon.
When you end your test too early, you could be screwing up the entire thing.
Here’s what happens in a typical I-ended-the-test-too-early-because-I’m-stupid experience.
Jack is awesome, because he’s split testing. He has a split testing map a mile long. (Basically, he’s awesome.) Jack hypothesizes, starts an A/B test, watches the data roll in, and gets some amazing information! “Wow!” he thinks, shaking his head in amazement, “I’m so glad I tested!”
Jack’s testing results showed that the test was “statistically significant,” so he ended the test, analyzed the data, and took action.
End of story.
Now, for the sequel.
After Jack, the awesome split tester, ended his test, he was brimming with confidence. He had conducted a statistically significant test, and gained actionable insights that would change the world.
Based on the winning variation of his split test, he changed the button color from green to orange. Then, true to his analytical self, he started watching the conversion rates, expecting to see them rise with the same meteoric increase of the winning variable in his split test.
But nothing happened. Instead of causing conversions to increase, the rates actually declined by a few percentage points. What was going on?!
Let’s do a post-mortem, because this scenario is pervasively typical.
What went wrong?
Basically, Jack pulled the plug on his test too soon. But didn’t his testing software tell him that the test was statistically significant?
Yes, but statistical significance does not equal reliability. Statistical significance does not take into account the variables that affect conversion rates, including seasonal or weekly peaks and troughs, sale events, e-blasts, and other random weirdness.
You see, Jack ran his test for four days — Sunday through Wednesday. What he failed to take into account was that for his website, sales inevitably spike beginning on Thursday nights. The conversions he was analyzing came from a handful of random, non-typical online buyers.
His testing should have continued for at least another few weeks in order to measure conversion rates over one weekend peak, then another, and then a third, to account for the sales increases that take place on weekends. (There are actually several other variables here, but we’ll stick with weekends for this exercise.)
If Jack were a few neurons smarter, he would have used the “conversions per day of the week” report from Google Analytics.
That’s all the theoretical stuff. Now, let me give you the brass tacks on this all-too-common split test pitfall, so you know precisely what to do to eliminate it.
- Run the test for at least three business cycles. Ideally, your start/stop times should start and stop at the same point in the business cycle. In other words, if you started the test on Sunday at 10 p.m., then the test should end on a Sunday at 10 p.m.
- Run the test until you reach a statistical significance level of 95%. The higher the statistical significance, the higher the statistical confidence; and the higher the confidence, the greater the likelihood that the test will produce reliable results. However, watch out for false positives: 95% can be misleading if you’re not paying attention to all of the other variables. Be extra careful with this one.
- Measure at least 100 conversions per variation. The more conversions you have, the better your results. If you can get 200, great. 300, even better.
- Run the test with enough participants. Sample size, and the segments within it, is probably the most important factor here. Traffic levels play a key role in the duration of a test; it’s pointless to test for a certain time frame if your site isn’t receiving enough traffic during that window. Here is a helpful tool for determining the number of participants required for a statistically reliable test: http://www.testsignificance.com/. It will show you exactly how many participants you need to achieve a given confidence level based on conversion rate differences and group size.
A few final warnings, and then I’ll stop belaboring this point.
The problem here is that people stop their tests too early. But the solution isn’t to simply run the test longer. The solution is to identify an ideal duration based on the other factors — statistical significance level, total number of participants, business cycle duration, total number of conversions, etc.
It’s not enough to merely lengthen the time of the test. Instead, it’s important to actually calculate how long it needs to be to produce reliable results.
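If you want to actually run that calculation rather than eyeball it, here is a minimal sketch of a pre-test duration estimate using the standard two-proportion sample-size formula. The baseline rate, expected lift, and daily traffic numbers are hypothetical; only the Python standard library is used.

```python
# Hedged sketch: estimate how long a test needs to run before you start it.
# The 3% baseline, 20% relative lift, and 500 visitors/day are made-up inputs.
from statistics import NormalDist
import math

def required_sample_size(baseline_rate, min_lift, alpha=0.05, power=0.8):
    """Participants needed per variation to detect a relative lift,
    using the standard two-proportion z-test formula."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + min_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    pooled = (p1 + p2) / 2
    n = ((z_alpha * math.sqrt(2 * pooled * (1 - pooled))
          + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
         / (p2 - p1) ** 2)
    return math.ceil(n)

# Example: 3% baseline conversion rate, hoping to detect a 20% relative lift
n = required_sample_size(0.03, 0.20)
daily_visitors_per_variation = 500  # hypothetical traffic level
days = math.ceil(n / daily_visitors_per_variation)
print(n, days)
```

Whatever number of days this spits out, round it up to whole business cycles (full weeks, in Jack’s case) before you commit to a stop date.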
3. Split testing things that are too tiny to make any real difference.
Most conversion optimizers that I know are analytical people. They’re smart, organized, detail-oriented, and anal. (Did I just say that?)
The problem comes when such people want to test everything — like minutiae.
I changed the color of the 1px border around the left button — from CCFF66 to CCFF33! Get ready to experience a huge uptick in conversion rates!
This is the kind of A/B testing that is a total waste of time.
I love the example that Cennydd Bowles wrote about. Here it is in all its split testing glory:
Last week I tossed a coin a hundred times. 49 heads. Then I changed into a red t-shirt and tossed the same coin another hundred times. 51 heads. From this, I conclude that wearing a red shirt gives a 4.1% increase in conversion in throwing heads.
Do you get it? You test a little variation in two pieces of crap, and then conclude, based on the crap testing, that crap A is better than crap B. Therefore, you implement crap. And the net effect is CRAP!
The simple presence of confirming data does not mean that version A is better than version B. Remember, in every split test there are not two options but three.
What is the third option? The null hypothesis.
What is the null hypothesis? The null hypothesis states that there is no relationship between two measured phenomena. In other words, you test two variations, and neither one comes out as a clear winner.
You can quantify such an outcome via statistical confidence levels, or you can assess it logically by determining its impact upon a conversion situation. It is also known as a “flat test” or “no winner.”
Before I proceed, let me just say this plainly: I know that small changes can make big differences. CRO is a pile of surprises. A teeny little variation can produce cascades of conversions.
But most of the time, that’s simply not the case. Most of the time, teeny variations produce zero change in conversions.
Let me give you an example.
I admire these guys for testing something, but what are they testing?
- In version A, the “Upload” button is boldfaced.
- In version B, there is a small arrow on the “Convert” button.
Could it be that neither version A nor version B gains the win? Could it be that the null hypothesis has reared its ugly head?
Test the stuff that matters, not the little stuff.
Let me share another way of looking at it. “Null hypothesis” sounds smartass and nerdy, but there’s a more practical way of looking at things.
A test may lack statistical significance, which is one way of ascertaining its validity. But that same test may lack practical significance. That is the difference we’re focusing on in this point.
Every test must somehow tie into your overall conversion and business goals. The “degree of effect” must be connected with the focus of the test as a whole.
In a now-famous study, Google tested 41 shades of blue to see which one performed better. One ex-Googler also described a dispute over a 2-pixel difference in a border, in which colleagues demanded he prove his case for the different border width.
Jason Cohen, CEO of WP Engine, gets this. In fact, he gave a whole talk on the subject. You can watch his entire 50-minute session. It’s helpful, and it could be the best way you spend 50 minutes this week.
These are the kinds of navel-gazing split tests that turn conversion optimizers into bean-counting, clock-watching, cubicle-dwelling, fluorescent-light-seeking, Keurig-swilling, Office-watching, Dilbert-reading time wasters.
What kind of things matter?
- Headline variations
- Image changes
- Image sizes
- Button color, size, shape, position
- Social proof
Peep Laja doesn’t waste words when dealing with stupid split tests:
So you’re testing colors, huh? Stop.
Okay, I’ll say it too. Stop. Just. Stop.
4. You’re not segmenting your split tests.
Segmentation, I argue, is the secret to A/B testing smartness.
When I tweeted my article on segmentation, there was a small Twitter uprising. Therefore, I enter this territory with caution, well aware of the flame-throwing opponents who may desire to roast me.
Here’s the mistake. Many optimizers run split tests on aggregate data. The problem, as Unbounce contributor Vincent Barr explains, is that the “traffic [and therefore, the conversion] is coming from unequal visitor segments.”
He uses the example of two customer types — a trainer and a gym-goer. Traffic from the two groups varies according to situation, producing a skewed aggregate result.
Relying on this aggregate conversion rate (CVR) is going to make your test results look different than if you segmented your tests according to users.
When you run an indiscriminate test against the mass of visitors, you’re going to come up with a recommendation based on traffic and rates, not on customers and their unique type. Conversion optimization is for customers, not traffic rates.
My recommendation is that you conduct tests that are solidly based on user segments, not raw traffic.
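Here is what that skew can look like in practice. The numbers below are hypothetical, invented to illustrate Barr’s two-segment scenario (this is a textbook case of Simpson’s paradox): variation B wins inside every segment, yet the aggregate numbers crown variation A, simply because A’s traffic skews toward the high-converting trainer segment.

```python
# Hedged sketch with made-up numbers: per-segment winner vs. aggregate winner.
# segment: (visitors_A, conversions_A, visitors_B, conversions_B)
segments = {
    "trainers":  (1000, 100, 200, 24),  # A: 10.0%  B: 12.0% -> B wins
    "gym_goers": (200, 4, 1000, 50),    # A:  2.0%  B:  5.0% -> B wins
}

def rate(conversions, visitors):
    return conversions / visitors

# Per-segment, B beats A both times...
for name, (va, ca, vb, cb) in segments.items():
    print(name, rate(ca, va), rate(cb, vb))

# ...but in aggregate, A appears to win, because A's traffic is
# concentrated in the high-converting trainer segment.
visitors_a = sum(s[0] for s in segments.values())     # 1200
conversions_a = sum(s[1] for s in segments.values())  # 104
visitors_b = sum(s[2] for s in segments.values())     # 1200
conversions_b = sum(s[3] for s in segments.values())  # 74
print(rate(conversions_a, visitors_a), rate(conversions_b, visitors_b))
```

Aggregate A converts at about 8.7% versus B’s 6.2%, so an unsegmented test would ship the variation that is worse for both customer types. Segment first, then read the results.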
Split testing is awesome.
And if you do split testing, you’re awesome.
And if you screw it up, you’re not alone.
But there is hope. Watch out for these common killers, defend against them with all your might, and press on toward conversion optimization success.