Thanks for downloading this PDF! You can find the whole post online at: http://conversionxl.com/ab-testing-guide

The Smart Marketer’s Guide to A/B Testing

A/B testing. You’ve heard so much about it in the past few years. You may have even run a test or two (or a thousand). Yet, for all the content out there about split testing, the concept is still frequently misunderstood.

In its simplest sense, A/B split testing is a new term for an old technique: controlled experimentation. When researchers are testing the efficacy of new drugs, they use a ‘split test.’ In fact, most research experiments could be considered a ‘split test,’ complete with a hypothesis, a control and variation, and a statistically calculated result.

The main difference, however, lies in the variability of internet traffic. In a lab, it’s easier to control for external variables. Online, you can mitigate them, but it’s truly difficult to run a purely controlled test. In addition, testing new drugs requires an almost certain degree of accuracy. Lives are on the line. In technical terms, your period of ‘exploration’ can be much longer, as you want to be damned sure during your period of ‘exploitation’ that you didn’t reach a Type I error (false positive).

A/B split testing online is primarily a business decision. It’s a weighing of risk vs. reward, exploration vs. exploitation, science vs. business. Therefore, we view results with a different lens and make decisions slightly differently than tests in a pure lab setting. Ready to dive in? Let’s start with the basics.

What Is An A/B Test?

An A/B test is a controlled online experiment in which 50% of your traffic is sent to the original page and 50% is sent to a variation. That’s it. In its simplest sense, an A/B test is a 50/50 traffic split that seeks to validate changes you’ve made to a page by testing similar traffic during the same time period.


You can, of course, create more than two variations. This is broadly known as an A/B/n test; if you have the traffic that allows it, you can test as many variations as you’d like. Here’s an example of an A/B/C/D test, and how much traffic each variation is allocated:
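Under the hood, a testing tool has to assign each visitor to one variation and keep that assignment stable on repeat visits. Here is a minimal sketch of one common approach, hashing a visitor ID into equal buckets; the variation names and visitor IDs are made up for illustration, not taken from any particular tool:

```python
# Sketch: split traffic evenly across four variations by hashing a
# stable visitor ID into one of four equal buckets, so each visitor
# always sees the same variation on every visit.
import hashlib

VARIATIONS = ["A (original)", "B", "C", "D"]

def assign_variation(visitor_id: str) -> str:
    """Deterministically map a visitor to one of the four variations."""
    digest = hashlib.md5(visitor_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % len(VARIATIONS)  # ~25% of traffic each
    return VARIATIONS[bucket]

# The same visitor always lands in the same bucket:
assert assign_variation("visitor-123") == assign_variation("visitor-123")

counts = {v: 0 for v in VARIATIONS}
for i in range(100_000):
    counts[assign_variation(f"visitor-{i}")] += 1
print(counts)  # each variation gets roughly 25,000 visitors
```

Real tools layer targeting, holdbacks, and unequal splits on top of this, but the core idea is the same: a deterministic, roughly uniform assignment.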


Now, at this point, you may be wondering what the difference is between an A/B/C/D test and a multivariate test (MVT). MVT isn’t the only other type of experiment, either. There are also bandit algorithms that solve similar problems as A/B/n tests, just in a different way.

A/B Testing, Multivariate, and Bandit Algorithms: What’s the Difference?

Multivariate Tests

A/B/n tests are controlled experiments run on one or more variations plus the original page, directly comparing conversion rate means based on the changes made between variations. While it sounds similar, multivariate tests are controlled experiments that test multiple versions of a page and attempt to isolate which attributes cause the largest impact. In other words, multivariate tests are like A/B/n tests in that they test an original against variations, but each variation contains different combinations of design elements. For example:

If you have enough traffic, you should use both types of tests to maximize the output of your optimization program. Each one has a different and specific impact and use case, and used together, can help you get the most out of your site. Here’s how:


Use A/B testing to determine the best layouts. Use MVT to polish the layouts and make sure all the elements interact with each other in the best possible way.

As I said before, you need to get a ton of traffic to the page you’re testing before even considering MVT. However, make sure your priorities align with your testing program. Peep once said, “most top agencies that I’ve talked to about this run ~10 A/B tests for every 1 MVT.”

Bandit Testing

As for bandit algorithms, you can almost think of them as A/B/n tests that update in real time based on the performance of each variation. In essence, a bandit algorithm starts by sending traffic to two (or more) pages: the original and the variation(s). Then, in an attempt to ‘pull the winning slot machine arm most often,’ the algorithm updates based on whether or not a variation is ‘winning.’ Eventually, the algorithm fully exploits the best option:
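The explore/exploit logic described above can be sketched with the simplest bandit strategy, epsilon-greedy. This is a toy illustration, not what any commercial tool actually runs; the conversion rates, visitor count, and epsilon value are all made up:

```python
# Sketch of an epsilon-greedy bandit: mostly "exploit" the variation
# with the best observed conversion rate, occasionally "explore" the
# others. All rates and traffic numbers are illustrative.
import random

def run_bandit(true_rates, visitors=20000, epsilon=0.1, rng=None):
    """Return per-variation traffic counts after epsilon-greedy allocation."""
    rng = rng or random.Random(0)
    n = len(true_rates)
    shown = [0] * n        # visitors sent to each variation so far
    converted = [0] * n    # conversions observed for each variation
    for _ in range(visitors):
        if rng.random() < epsilon:
            arm = rng.randrange(n)  # explore: pick a random variation
        else:
            # exploit: pick the best observed rate so far
            arm = max(range(n),
                      key=lambda i: converted[i] / shown[i] if shown[i] else 0.0)
        shown[arm] += 1
        if rng.random() < true_rates[arm]:
            converted[arm] += 1
    return shown

# Variation B truly converts better (5% vs 3%), so over time the
# bandit shifts most of the traffic to it.
traffic = run_bandit([0.03, 0.05])
print(traffic)
```

Production bandits typically use more sophisticated strategies (e.g. Thompson sampling), but the shape is the same: allocation drifts toward the winner instead of staying fixed at 50/50.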


One of the big benefits of bandit testing is that bandits mitigate ‘regret,’ which is basically the lost conversions you experience while exploring a potentially worse variation in a test. This chart from Google explains it very well:

By the way, try not to think of bandits and A/B/n tests as a ‘this or that’ scenario; they’re tools that each have their purposes. In general, bandits are great for:

Headlines and Short-Term Campaigns
Automation for Scale
Targeting
Blending Optimization with Attribution

Read this article for more information on bandit algorithms. Two other minor issues with A/B testing:

One-tail or two-tail test?
Bayesian or Frequentist stats?


One-Tail vs Two-Tail A/B Tests

I promise you, this is a much smaller issue than some people think, so we’ll brush over it quickly. One-tailed tests allow for the possibility of an effect in just one direction, whereas with two-tailed tests you are testing for the possibility of an effect in two directions, both positive and negative.

No need to get very worked up about this. Matt Gershoff from Conductrics summed it up really well:

“If your testing software only does one type or the other, don’t sweat it. It is super simple to convert one type to the other (but you need to do this BEFORE you run the test) since all of the math is exactly the same in both tests. All that is different is the significance threshold level. If your software uses a one-tail test, just divide the p-value associated with the confidence level you are looking to run the test at by 2. So if you want your two-tail test to be at the 95% confidence level, then you would actually input a confidence level of 97.5%, or if at 99%, then you need to input 99.5%. You can then just read the test as if it was two-tailed.”

Dive down the rabbit hole with our article on one-tail vs two-tail tests if you’d like.

Bayesian or Frequentist Stats

Bayesian vs. Frequentist A/B testing is another hot topic for debate, especially with popular tools rebuilding their stats engines to feature a Bayesian methodology. Here’s the difference (very much simplified): using a Frequentist method means making predictions about the underlying truths of the experiment using only data from the current experiment. In the Bayesian view, a probability is assigned to a hypothesis; in the Frequentist view, a hypothesis is tested without being assigned a probability.

Dr. Rob Balon, who holds a PhD in statistics and market research, says the debate is mostly esoteric tail wagging done in the domain of the ivory tower. “In truth,” he says, “most analysts out of the ivory tower don’t care that much, if at all, about Bayesian vs. Frequentist.” Don’t get me wrong, there are practical business implications to each methodology. For the sake of discussion, though, if you’re at all new to A/B testing, there are much more important things to worry about.
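Matt Gershoff’s one-tail/two-tail threshold conversion from earlier can be written down in a couple of lines. A quick sketch of the arithmetic:

```python
# Converting a desired two-tail confidence level into the confidence
# level you'd enter into a one-tail-only tool: halve the remaining
# alpha (the significance threshold), as described in the quote above.
def one_tail_input_for_two_tail(two_tail_confidence):
    """Confidence level to enter in a one-tail tool for a two-tail read."""
    alpha = 1 - two_tail_confidence
    return 1 - alpha / 2

print(one_tail_input_for_two_tail(0.95))  # 0.975 -> enter 97.5%
print(one_tail_input_for_two_tail(0.99))  # 0.995 -> enter 99.5%
```

As the quote stresses, decide this before you run the test, not after.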


If you do want to dive down the rabbit hole, though, here’s an article we wrote on Bayesian vs Frequentist A/B Testing.

Strategy First: Setting up a Successful A/B Testing Plan

A structured approach to A/B testing could be your biggest area of impact. Don’t listen to blog posts that tell you “99 Things You Can A/B Test Right Now.” That’s a waste of time and traffic. Being a bit more process-minded will make you more money.

In a survey done by Econsultancy and RedEye, 74% of respondents who reported having a structured approach to conversion also stated they had improved their sales. Those that don’t have a structured approach, I’m almost sure, stay in what Craig Sullivan calls the “Trough of Disillusionment” (unless their results are littered with false positives, which we’ll get into later).

So, structure. To simplify, a winning process goes something like this:

1. Measurement
2. Prioritization
3. Experimentation
4. Repeat

Measurement: Getting Data-Driven Insights

Measurement is a crucial step in this process, because we want to know what is happening as well as why it’s happening. First things first, we’ll start with the high-level strategy and move down to the granular. So think in this order:

1. Define your business objectives
2. Define your website goals
3. Define your Key Performance Indicators
4. Define your target metrics

Once you know where you want to go, you can collect the data necessary to get there. To do this, we recommend the ResearchXL Framework. (We’ll go over this briefly, but to really understand it, read this post.) The point is, we want to be data-driven, which means we want to collect and analyze relevant data for our goals. This means a structured approach to our conversion research. Here’s the executive summary of our process:

1. Heuristic Analysis
2. Technical Analysis
3. Web Analytics Analysis
4. Mouse Tracking Analysis
5. Qualitative Surveys
6. User Testing

Heuristic analysis is about as close as we get to ‘best practices.’ The difference in our process is that, even after years of experience, you still can’t tell exactly what will work, but you can more easily point out opportunity areas. As Craig Sullivan put it: “My experience in observing and fixing things — these patterns do make me a better diagnostician but they don’t function as truths — they guide and inform my work but they don’t provide guarantees.” So when it comes to heuristic analysis, humility and a framework are important. We assess each page based on:

Relevancy
Clarity
Value
Friction
Distraction

Read about WiderFunnel’s LIFT Model for a good heuristic framework.

Technical analysis is an area often overlooked and highly underrated by optimizers. Bugs – if they’re around – are your main conversion killer. You think your site works perfectly – both in terms of user experience and functionality – with every browser and device? Probably not. This is low-hanging fruit that you can make a lot of money on (think 12-month perspective). So start here:

Conduct cross-browser and cross-device testing
Do speed analysis

Web analytics analysis is next. First things first: make sure everything is working. You’d be surprised how many analytics setups are broken. Google Analytics (and other analytics setups) are a course in themselves, so I’ll leave you with some helpful links to read:


Google Analytics 101: How To Configure Google Analytics To Get Actionable Data
Google Analytics 102: How To Set Up Goals, Segments & Events in Google Analytics

Next is mouse tracking analysis, which includes heat maps, scroll maps, click maps, form analytics, and user session replays. One piece of advice here: don’t get carried away with pretty visualizations of click maps and the like. Make sure you’re informing your larger goals with the analytics in this step.

Qualitative research is an important part of measurement as well, because it tells you the why that quantitative analysis misses. Many people think that qualitative analysis is “softer” or easier than quantitative, but it should be just as rigorous and can provide insights just as important as your GA data. For qualitative research, use things like:

On-site surveys
Customer surveys
Customer interviews and focus groups

Finally, we have user testing. The premise is simple: observe actual people using and interacting with your website while they narrate their thought process out loud. Pay attention to what they say and experience.

After the heavy-ass conversion research, you’ll have lots of data and need to do some prioritization.

Prioritizing A/B Tests

There are many frameworks for prioritizing your A/B tests, and you could even innovate with your own formula. But here’s how we do it. Once you go through all 6 steps, you will identify issues – some of them severe, some minor. You’ll want to allocate every finding into one of these 5 buckets:

1. Test. (This bucket is where you place stuff for testing.)
2. Instrument. (This can involve fixing, adding or improving tag or event handling on the analytics configuration.)
3. Hypothesize. (This is where we’ve found a page, widget or process that’s just not working well but we don’t see a clear single solution.)
4. Just Do It – JFDI. (Here’s the bucket for no-brainers. Just do it.)
5. Investigate. (If an item is in this bucket, you need to ask questions or do further digging.)

Then we rank them from 1 to 5 stars (1 = minor issue, 5 = critically important). There are 2 criteria that are more important than others when giving a score:


1. Ease of implementation (time/complexity/risk). Sometimes the data tells you to build a feature, but it takes months to do it. So it’s not something you’d start with.
2. Opportunity score (subjective opinion on how big of a lift you might get).

Then create a spreadsheet with all of your data, and you’ll have a prioritized testing roadmap more rigorous than anything most of your competitors will have. You can also use a variety of other frameworks. A very popular one is the PIE framework, which breaks opportunity areas into three scores:

1. Potential
2. Importance
3. Ease
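Whatever framework you choose, the mechanics boil down to scoring each idea and sorting. Here is a tiny PIE-style sketch; the 1–10 scales, the test ideas, and the scores are all invented for illustration:

```python
# Sketch of PIE-style prioritization: score each test idea on
# Potential, Importance, and Ease (assumed 1-10 scales here),
# then rank by the average score. Ideas and numbers are made up.
ideas = {
    "Rewrite checkout page copy": {"potential": 8, "importance": 9, "ease": 6},
    "New homepage hero image":    {"potential": 5, "importance": 6, "ease": 9},
    "Redesign pricing table":     {"potential": 9, "importance": 7, "ease": 3},
}

def pie_score(scores):
    """Average of the three PIE criteria."""
    return (scores["potential"] + scores["importance"] + scores["ease"]) / 3

roadmap = sorted(ideas, key=lambda name: pie_score(ideas[name]), reverse=True)
for name in roadmap:
    print(f"{pie_score(ideas[name]):.1f}  {name}")
```

The output is your testing roadmap: highest combined score first. In practice this lives in a spreadsheet, but the ranking logic is exactly this.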

Read more on frameworks to prioritize A/B testing here.

Setting Up Your A/B Tests

Once you’ve got a prioritized list of test ideas, it’s time to form a hypothesis and run an experiment. Basically, a hypothesis defines why you believe a problem occurs. Furthermore, a good hypothesis:

1. Is testable – it needs to be measurable, so that it can be used in testing.
2. Has a goal of solving conversion problems – split testing is done to solve specific conversion problems.
3. Gains market insights – a well-articulated hypothesis will let your split testing results give you information about your customers, whether the test ‘wins’ or ‘loses’ or whatever.


Craig Sullivan has put together a hypothesis kit to simplify the process. Here’s his simple version:

1. Because we saw (data/feedback)
2. We expect that (change) will cause (impact)
3. We’ll measure this using (data metric)

And the advanced one:

1. Because we saw (qual & quant data)
2. We expect that (change) for (population) will cause (impact(s))
3. We expect to see (data metric(s) change) over a period of (x business cycles)

Technical Stuff

Here’s the fun part: you finally get to pick your tool.


While this is the first thing many people think about, it’s not actually the most important, by any means. The strategy and statistical knowledge come first, and only then should you worry about picking a tool. That said, there are a few differences you should bear in mind.

One major distinction in tools is whether they are server-side or client-side testing tools. Server-side tools render code at the server level and send a randomized version of the page to the visitor, with no modification in the visitor’s browser. Client-side tools send the same page, but JavaScript in the visitor’s browser manipulates the appearance of both the original and the variation. Client-side testing tools include Optimizely, VWO, and Adobe Target. Conductrics has capabilities for both, and SiteSpect uses a proxy-based server-side method.

What does all this mean for you? If you’d like to save time up front, or if your team is small or lacks development resources, client-side tools can get you up and running faster. Server-side testing requires development resources but can often be more robust.

So while setting up tests can differ slightly depending on which tool you use, often it will be as simple as signing up for your favorite tool and following some basic instructions, like putting a JavaScript snippet on your website. You’ll basically want to set up goals (something that lets you know a conversion has been made, like a ‘thank you for purchasing’ page), and your testing tool will track when each variation converts visitors into customers.


Some skills that come in handy when setting up tests are HTML, CSS, and JavaScript/jQuery, as well as design and copywriting skills to draw up the variations. Sure, some tools offer a visual editor, but that limits your flexibility and control, so learning some technical skills is helpful. Or you could use something like Testing.Agency to set up your tests for you.

How Long Should You Run A/B Tests?

First rule: don’t stop a test just because it reaches statistical significance. This is probably the most common error committed by beginning optimizers with good intentions. If you’re calling your tests as soon as you hit significance, you’ll find that most of your lifts don’t translate to increased revenue (that’s the goal, after all). You’ll find that the lifts were, in fact, imaginary. Consider this: one thousand A/A tests (two identical pages tested against each other) were run.

771 experiments out of 1,000 reached 90% significance at some point
531 experiments out of 1,000 reached 95% significance at some point
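You can reproduce the flavor of that result yourself. Here is a small, self-contained sketch (not the original experiment) of why peeking inflates false positives: both arms have the same true conversion rate, yet checking significance after every batch of visitors still flags a large share of tests as ‘significant’ at some point. The rates, batch sizes, and trial counts are all made up:

```python
# Simulating "peeking" at A/A tests: both arms share the SAME true
# conversion rate, yet checking the p-value after every batch of
# visitors flags many tests as significant at some point anyway.
import math
import random

def z_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test p-value for the difference of two proportions."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_a / n_a - conv_b / n_b) / se
    # Normal CDF via erf; two-tailed area beyond |z|.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def ever_significant(true_rate=0.03, batches=50, batch_size=200,
                     alpha=0.05, rng=random):
    """Run one A/A test, peeking after every batch; True if p < alpha ever."""
    conv_a = conv_b = n = 0
    for _ in range(batches):
        conv_a += sum(rng.random() < true_rate for _ in range(batch_size))
        conv_b += sum(rng.random() < true_rate for _ in range(batch_size))
        n += batch_size
        if z_test_p_value(conv_a, n, conv_b, n) < alpha:
            return True
    return False

random.seed(42)
trials = 200
false_alarms = sum(ever_significant() for _ in range(trials))
print(f"{false_alarms}/{trials} A/A tests hit p < 0.05 at some point")
```

With a fixed sample size and a single look, only about 5% of A/A tests would come up significant; repeated peeking multiplies that dramatically.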


Stopping tests at significance invites false positives and ignores possible external validity threats like seasonality. Instead, you’ll want to predetermine a sample size and run the test for full weeks, usually for at least two business cycles. How do you predetermine sample size? There are lots of great tools out there for that, including calculators built into your favorite testing tool. Here’s how you’d calculate your sample size with Evan Miller’s tool:

In this case, we told the tool that we have a 3% conversion rate and want to detect at least a 10% uplift. The tool tells us that we need 51,486 visitors per variation before we can look at statistical significance levels and statistical power.

Oh, and you’ll notice that in addition to the significance level, there’s something called ‘statistical power’ in the photo above. Statistical power is another important factor in running your A/B test, as it helps you avoid Type II errors (false negatives). In other words, it makes sure you detect an effect if there actually is one.
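For the curious, here is roughly what such a calculator does under the hood, using the standard normal-approximation formula for comparing two proportions (alpha = 0.05 two-sided, 80% power). Evan Miller’s tool uses a slightly different derivation, so expect a number in the same ballpark rather than an exact match to 51,486:

```python
# Rough sample-size estimate for detecting a relative lift in a
# conversion rate, via the normal-approximation formula for two
# proportions (z values below correspond to alpha=0.05 two-sided
# and 80% power). A sketch, not Evan Miller's exact method.
import math

def sample_size_per_variation(baseline, relative_mde,
                              z_alpha=1.96, z_beta=0.8416):
    """Visitors needed per variation to detect the given relative lift."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)  # e.g. 3% -> 3.3% for a 10% lift
    delta = p2 - p1
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)

n = sample_size_per_variation(0.03, 0.10)
print(n)  # roughly 53,000 per variation with these inputs
```

Notice how brutal the math is at low baselines: detecting a 10% relative lift on a 3% conversion rate takes over 100,000 visitors across two variations.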


For practical purposes, know that 80% power is the standard for testing tools. To reach that level, you need a large sample size, a large effect size, or a longer-duration test.

There Are No Magic Numbers

You’ll read a lot of blog posts that give magic numbers like “100 conversions” or “1,000 visitors” as their stopping points. Math is not magic, math is math, and what we’re dealing with is slightly more complex than simplistic heuristics like that. Andrew Anderson from Malwarebytes put it well:

“It is never about how many conversions, it is about having enough data to validate based on representative samples and representative behavior. 100 conversions is possible in only the most remote cases and with an incredibly high delta in behavior, but only if other requirements like behavior over time, consistency, and normal distribution take place. Even then it has a really high chance of a type I error, a false positive.”

What we’re worried about is the representativeness of our sample. How do we get that in basic terms? Your test should run for 1, or better yet 2, business cycles, so it includes everything that’s going on:

every day of the week (and tested one week at a time, as your daily traffic can vary a lot),
various different traffic sources (unless you want to personalize the experience for a dedicated source),
your blog post and newsletter publishing schedule,
people who visited your site, thought about it, and then came back 10 days later to buy it,
any external event that might affect purchasing (e.g. payday)

Another (very important) note: be careful with low sample sizes. The internet is full of case studies steeped in shitty math, and most of them (if they even release full numbers) judged a test on something like 100 visitors per variation and 12 vs 22 conversions.

If you’ve set everything up correctly so far, then you’ll just want to avoid peeking (or letting your boss peek) at test results multiple times before the test is finished. This can result in calling a result early due to ‘spotting a trend’ (impossible). What you’ll find is that many test results regress to the mean.

Regression to the Mean

Often, you’ll see results vary wildly in the first few days of a test. Sure enough, they tend to converge as the test continues over the next few weeks. Here’s an example Peep gave in an older blog post of an eCommerce client:


Here’s what we’re looking at:

First couple of days, blue (variation #3) is winning big – like $16 per visitor vs $12.50 for Control. Lots of people would end the test here. (Fail.)
After 7 days: blue still winning – and the relative difference is big.
After 14 days: orange (#4) is winning!
After 21 days: orange still winning!
End: no difference.

So if you’d called the test at less than four weeks, you would have made an erroneous conclusion. A related phenomenon, one the internet always gets confused about, is the novelty effect. That’s when the novelty of your changes (a bigger blue button!) brings more attention to the variation. With time, the lift disappears because the change is no longer novel.

All of this is some of the more complex A/B testing territory. We have a bunch of blog posts devoted to the various topics covered above. Dive in if you’d like to learn more: Stopping A/B Tests: How Many Conversions Do I Need?

Can You Run Multiple A/B Tests Simultaneously?

You want to speed up your testing program and run more tests. High tempo testing, yeah? So a common question is: can you run more than one A/B test at the same time on your site? Will this increase your growth potential, or will it pollute the data because the tests interact with each other?


Look, this is a complicated issue. Some experts say you shouldn’t run multiple tests simultaneously, and some say it’s fine. In most cases, you will be fine running multiple simultaneous tests, and extreme interactions are unlikely. Unless you’re testing really important stuff (e.g. something that impacts your business model or the future of the company), the benefits of testing volume will most likely outweigh the noise in your data and the occasional false positive. If, based on your assessment, there’s a high risk of interaction between multiple tests, reduce the number of simultaneous tests and/or let the tests run longer for improved accuracy. If you want to read more on this, read these posts:

AB Testing: When Tests Collide
Can You Run Multiple A/B Tests at the Same Time?

Analyzing Your A/B Testing Results

Alright. You’ve done your research, set up your test, and the test is finally cooked. Now, on to analysis – and it’s not always as simple as glancing at the graph your testing tool gives you.

One thing you should always do is analyze your test results in Google Analytics.


It not only enhances your analysis capabilities; it also allows you to be more confident in your data and decision making. The point is, it’s possible that your testing tool is recording the data incorrectly, and if you have no other source for your test data, you can never be sure whether to trust it. Create multiple sources of data. (We won’t go too far into detail here, but read this post for how to set it all up.)

But what happens if, after analyzing the results in GA, there is no difference at all between variations? Don’t move on too quickly. First, realize these two things:

1. Your test hypothesis might have been right, but the implementation sucked. Let’s say your qualitative research says that concern about security is an issue. How many ways do we have to beef up the perception of security? Unlimited. The name of the game is iterative testing, so if you were onto something, try a few iterations that attempt to solve the problem.

2. Even if there was no difference overall, the treatment might have beaten the control in a segment or two. If you got a lift for returning visitors and mobile visitors, but a drop for new visitors and desktop users, those segments might cancel each other out, making it seem like a case of “no difference.” Analyze your test across key segments to see this.

All About Data Segmentation

The key to learning in A/B testing is segmentation. Even though B might lose to A in the overall results, B might beat A in certain segments (organic, Facebook, mobile, etc.). There are a ton of segments you can analyze. Optimizely lists the following possibilities:

Browser type
Source type
Mobile vs. desktop, or by device
Logged-in vs. logged-out visitors
PPC/SEM campaign
Geographical regions (City, State/Province, Country)
New vs. returning visitors
New vs. repeat purchasers


Power users vs. casual visitors
Men vs. women
Age range
New vs. already-submitted leads
Plan types or loyalty program levels
Current, prospective, and former subscribers
Roles (if your site has, for instance, both a Buyer and Seller role)

But definitely look at your test results at least across these segments (making sure you have an adequate sample size):

Desktop vs Tablet/Mobile
New vs Returning
Traffic that lands directly on the page you’re testing vs traffic that came via an internal link
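The ‘segments cancel each other out’ effect mentioned earlier is easy to see with a toy example. All of the figures below are invented purely for illustration:

```python
# A tiny illustration of how segment-level effects can cancel out in
# the aggregate numbers (all figures are made up for illustration).
segments = {
    # segment: (visitors_A, conversions_A, visitors_B, conversions_B)
    "returning": (5000, 250, 5000, 300),   # B lifts returning visitors...
    "new":       (5000, 250, 5000, 200),   # ...but hurts new visitors
}

total_a = sum(c for _, c, _, _ in segments.values())
total_b = sum(c for _, _, _, c in segments.values())
# Aggregate view: 500 vs 500 conversions, i.e. "no difference".
print("Overall:", total_a, "vs", total_b, "conversions")

# Segment view: B is +20% for returning and -20% for new visitors.
for name, (na, ca, nb, cb) in segments.items():
    print(f"{name}: A {ca/na:.1%} vs B {cb/nb:.1%}")
```

The aggregate says ‘no difference,’ while the segments tell two opposite stories, which is exactly why you analyze both.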

For segments, the same stopping rules apply. Make sure you have an adequate sample size within the segment itself as well (calculate it in advance, and be wary if it’s less than 250-350 conversions PER variation within the one segment you’re looking at). If your treatment performed well for a specific segment, it’s time to consider a personalized approach for that particular segment.

All Your Stats Questions Answered, Stat

There’s a certain level of statistical knowledge that comes in handy when analyzing A/B test results. Some of it we went over in the section on setting up A/B tests above, but there is still more to cover when it comes to analysis. Why do you need to know all this statistics stuff? We’re dealing with inference here, with means and probability, and therefore can’t go without some basic understanding of stats. Or as Matt Gershoff put it (quoting his college math professor), “how can you make cheese if you don’t know where milk comes from?!” There are three terms you should know before we dive into the nitty gritty of A/B testing statistics:

1. Mean (we’re not measuring all conversion rates, just a sample, and finding an average of them that is representative of the whole)

2. Variance (what is the natural variability of a population? That will affect our results and how we take action with them)

Page 20: The Smart Marketer’s Guide to A/B Testing › wp-content › uploads › 2015 › 12 › ... · The Smart Marketer’s Guide to A/B Testing A/B testing. You’ve heard so much about

3. Sampling (again, we can’t measure true conversion rate, so we select a sample that is hopefully representative of the whole)
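To connect these terms to the sample-size advice above, here's a rough sketch of the standard two-proportion sample-size calculation (normal approximation; the function name and example numbers are mine, not from any particular tool):

```python
import math
from statistics import NormalDist

def required_sample_size(baseline, expected, alpha=0.05, power=0.80):
    """Visitors needed PER VARIATION to detect a change from `baseline`
    to `expected` conversion rate (two-sided test, normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # ~0.84 for 80% power
    # Variance term: this is where the population's natural variability comes in
    variance = baseline * (1 - baseline) + expected * (1 - expected)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (expected - baseline) ** 2)

# Detecting a 20% relative lift on a 5% baseline (5% -> 6%):
n = required_sample_size(0.05, 0.06)
print(n)  # roughly 8,000+ visitors per variation
```

Notice how the answer blows up as the expected lift shrinks: small effects need huge samples, which is exactly why tiny per-segment samples are so untrustworthy.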

What The Hell Is a P-Value?

There are a lot of bloggers writing about conversion optimization who use the term “statistical significance” inaccurately. We talked a bit above about how statistical significance by itself is not a stopping rule, so what is it and why is it important? To start with, let’s go over p-values, which are also widely misunderstood. As FiveThirtyEight recently pointed out, even scientists can’t easily explain what p-values are. A p-value is basically a measure of evidence against the null hypothesis (the control, in A/B testing parlance). Very important: the p-value does not tell us the probability that B is better than A. Similarly, it doesn’t tell us the probability that we will make a mistake in selecting B over A. These are both extraordinarily common misconceptions, but they are false. The p-value is just the probability of seeing a result at least as extreme as the one observed, given that the null hypothesis is true. Or: “How surprising is this result?”
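To demystify that definition, here's a sketch of one common way a p-value is computed for an A/B test, a pooled two-proportion z-test; your testing tool may use a different method, and the traffic numbers below are made up:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-tailed p-value for the difference between two conversion rates,
    using a pooled z-test (normal approximation)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # rate under the null (A and B identical)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))  # halve this for a one-tailed test

# Control: 200 conversions / 10,000 visitors; variation: 250 / 10,000
p = two_proportion_p_value(200, 10_000, 250, 10_000)
print(round(p, 3))  # ~0.017, i.e. "surprising" if A and B truly convert the same
```

Note what the result does and does not say: a p-value of 0.017 means data this extreme would be unusual if the null were true, not that there's a 98.3% chance B beats A.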


So to sum it up, statistical significance (or a statistically significant result) is attained when the p-value is less than the significance level (which is usually set at .05). By the way, significance in statistical hypothesis testing is where the whole one-tailed vs. two-tailed issue comes up.

Confidence Intervals and Margin of Error

In A/B testing, we use confidence intervals to mitigate the risk of sampling errors. In that sense, we’re managing the risk associated with implementing a new variation. So if your tool says something like, “We are 95% confident that the conversion rate is X% +/- Y%,” then you need to account for the +/- Y% as the margin of error.
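To make the "X% +/- Y%" reading concrete, here's a rough sketch using a normal-approximation confidence interval, plus the common rule of thumb that overlapping ranges mean you should keep testing. Your tool may compute its intervals differently, and the traffic numbers are invented:

```python
from math import sqrt
from statistics import NormalDist

def rate_with_margin(conversions, visitors, confidence=0.95):
    """Observed conversion rate and its margin of error (the "X% +/- Y%")."""
    rate = conversions / visitors
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # ~1.96 for 95%
    margin = z * sqrt(rate * (1 - rate) / visitors)
    return rate, margin

rate_a, moe_a = rate_with_margin(200, 10_000)  # 2.00% +/- ~0.27%
rate_b, moe_b = rate_with_margin(250, 10_000)  # 2.50% +/- ~0.31%

# Rule of thumb: if the two conversion ranges overlap, keep testing.
# (Assumes B's observed rate is the higher one, as in this example.)
overlap = (rate_a + moe_a) > (rate_b - moe_b)
print(overlap)  # True: the intervals overlap, so don't call it yet
```

Also notice that more visitors shrink the margin of error: the denominator under the square root is the sample size, which is why bigger samples give tighter ranges.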

How confident you are in your results depends largely on how large the margin of error is. As a rule of thumb, if the two conversion ranges overlap, you’ll need to keep testing in order to get a valid result. Matt Gershoff gave a great illustration of how margin of error works: “Say your buddy is coming to visit you from Round Rock and is taking TX-1 at 5pm. She wants to know how long it should take her. You say, ‘I have 95% confidence that it will take you about 60 minutes, plus or minus 20 minutes.’ So your margin of error is 20 minutes, or 33%. If she is coming at 11am, you might say, ‘it will take you 40 minutes, plus or minus 10 minutes,’ so the margin of error is 10 minutes, or 25%. So while both are at the 95% confidence level, the margin of error is different.”

External Validity Threats

There’s a challenge with running A/B tests: the data is non-stationary.


In other words, a stationary time series is one whose statistical properties (mean, variance, autocorrelation, etc) are constant over time. For many reasons, website data is non­stationary, which means we can’t make the same assumptions as with stationary data. Here are a few reasons data might fluctuate:

- Season
- Day of the week
- Holidays
- Press (positive or negative)
- Other marketing campaigns
- PPC/SEM
- SEO
- Word-of-mouth
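To see what non-stationary data looks like in practice, here's a toy illustration with made-up daily conversion rates; the numbers are invented to show a day-of-week effect, which is also why the standard advice is to run tests in full-week blocks:

```python
from statistics import mean

# Hypothetical daily conversion rates over two weeks (Mon..Sun, Mon..Sun).
# These are fabricated for illustration, not real traffic data.
daily_rates = [0.031, 0.030, 0.032, 0.029, 0.028, 0.022, 0.021,
               0.030, 0.031, 0.033, 0.028, 0.027, 0.020, 0.023]

weekday = [r for i, r in enumerate(daily_rates) if i % 7 < 5]
weekend = [r for i, r in enumerate(daily_rates) if i % 7 >= 5]

# The mean is not constant across days, so the series is non-stationary.
print(f"weekday avg: {mean(weekday):.4f}")  # noticeably higher...
print(f"weekend avg: {mean(weekend):.4f}")  # ...than the weekend average
```

A test started on a Monday and stopped on a Friday would sample only the high-converting days, which is one concrete way external validity gets threatened.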

So seasonality and the other factors above are one source of external validity threat. Others include sample pollution, the flicker effect, revenue tracking errors, selection bias, and more. These are all things to keep in mind in planning and analyzing your A/B tests.

Archiving Test Results For Future Learning

A/B testing isn’t just about lifts, wins, losses, and testing random shit. As Matt Gershoff said, optimization is about “gathering information to inform decisions,” and the learnings from statistically valid A/B test results contribute to the greater goals of growth and optimization.


Smart organizations archive their test results and plan their approach to testing systematically. There’s a reason: organizations with a structured approach to optimization see greater growth and are limited less often by local maxima.

So here’s the tough part: there’s no single best way to structure your knowledge management. We wrote an article on how effective organizations archive their results, and as it turns out, many of them do it slightly differently. Some use sophisticated internally-built tools, some use third-party tools, and some use good ol’ Excel and Trello. If it helps, here are four tools built specifically for conversion optimization project management:

- Iridion
- Effective Experiments
- Growth Hackers’ Canvas
- Experiment Engine
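If you go the lightweight Excel-style route, the archive itself can be as simple as a CSV with one row per test. Here's a sketch with illustrative field names (this is not a standard schema, and the example record is made up):

```python
import csv
import io

# Fields a minimal test archive might track; names are illustrative only.
FIELDS = ["test_name", "hypothesis", "start", "end",
          "visitors_per_variation", "observed_lift", "p_value",
          "decision", "learnings"]

record = {
    "test_name": "homepage-hero-copy",
    "hypothesis": "Benefit-led headline lifts signups",
    "start": "2015-11-02", "end": "2015-11-30",
    "visitors_per_variation": 14200,
    "observed_lift": "+6.4%", "p_value": 0.03,
    "decision": "implemented",
    "learnings": "Benefit framing beat feature framing",
}

# Write one row to the archive (StringIO stands in for a real file here)
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=FIELDS)
writer.writeheader()
writer.writerow(record)

# Read it back, the way a teammate would before planning the next test
rows = list(csv.DictReader(io.StringIO(buf.getvalue())))
print(rows[0]["decision"])  # implemented
```

The exact columns matter less than capturing the hypothesis and the learnings: those are what turn a pile of old tests into "information to inform decisions."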

On a similar note, in larger organizations (or hell, in smaller as well), it’s important to be able to communicate across departments and to the executives above. Often, A/B test results aren’t super intuitive to the layperson (and most people haven’t read guides as long as this one). So what helps is visualization.


This is another area where, sadly, there is no single right way to do it. That said, Annemarie Klaassen and Ton Wesseling wrote an awesome post on our blog detailing their journey to great visualizations. Sneak peek: their post shows the visualization they ended up with.

A/B Testing Tools and Resources

Littered throughout this guide are tons of links to external resources: articles, tools, books, etc. To make it convenient for you, though, here are some of the best, divided by category.

A/B Test Tools

- Optimizely
- VWO
- Adobe Target
- Maxymiser
- Conductrics
- 53 Conversion Optimization Tools Reviewed By Experts

A/B Testing Calculators

- A/B Split Test Significance Calculator by VWO
- A/B Split and Multivariate Test Duration Calculator
- Evan Miller’s Sample Size Calculator
- Evan Miller’s Whole Suite of A/B Testing Tools

A/B Testing Statistics Resources

- Ignorant No More: Crash Course on A/B Testing Statistics
- Statistical Analysis and A/B Testing
- Understanding A/B testing statistics to get REAL Lift in Conversions
- One-Tailed vs Two-Tailed Tests (Does It Matter?)


- Bayesian vs Frequentist A/B Testing – What’s the Difference?
- Sample Pollution
- Science Isn’t Broken

A/B Testing/CRO Strategy Resources

- eCommerce A/B Test Data for Improved Process: What Percentage of Tests Are Winners?
- WiderFunnel’s LIFT Model
- 3 Frameworks To Help Prioritize & Conduct Your Conversion Testing
- What you have to know about conversion optimization
- Our Conversion Optimization Guide
- Small Business Big Money Online: A Proven System to Optimize eCommerce Websites and Increase Internet Profits (book)

Conclusion

A/B testing is an invaluable resource for anyone making decisions in an online environment. With a little bit of knowledge and a lot of diligence, you can mitigate many of the risks that most beginning optimizers face due to errors. This isn't a complete or ultimate guide, but it is a damn good start. If you really dig into the information here, you'll be ahead of 90% of people running tests. If you believe in the power of A/B testing for continued revenue growth, then that's a fantastic place to be. Knowledge is a limiting factor that only experience and iterative learning can bust through, though. So get testing :)