20 top A/B testing mistakes and how to avoid them


Posted on 13-Jun-2015






My latest deck covering all the top A/B testing mistakes you can make - and how to avoid or resolve them!


1. Oh Boy! These A/B tests appear to be bullshit!

2. @OptimiseOrDie - UX and Analytics (1999), User Centred Design (2001), Agile, Startups, No budget (2003), Funnel optimisation (2004), Multivariate & A/B (2005), Conversion Optimisation (2005), Persuasive Copywriting (2006), Joined Twitter (2007), Lean UX (2008), Holistic Optimisation (2009). Was: consulting all over the place. Now: Optimiser of Everything, Spareroom.co.uk.

3. Hands on!

4. The A/B Test Hype Cycle ("Zen Plumbing"). Timeline: discovered A/B testing -> tested stupid ideas, lots -> most A/B or MVT tests are bullshit -> triage, triangulation, prioritisation, maths.

5. Craig's Cynical Quadrant - improves revenue (yes/no) vs improves UX (yes/no):
- Improves revenue and UX: client fucking delighted.
- Improves revenue, not UX: client delighted.
- Improves UX, not revenue: client absolutely fucking furious (and fires you for another agency).
- Improves neither: client fires you (then wins an award for your work).

6. #1: You're doing it in the wrong place.

7. #1: You're doing it in the wrong place. There are 4 areas a CRO expert always looks at:
1. Inbound attrition (medium, source, landing page, keyword, intent and many more)
2. Key conversion points (product, basket, registration)
3. Processes, lifecycles and steps (forms, logins, registration, checkout, onboarding, emails, push)
4. Layers of engagement (search, category, product, add)
How to look at them:
1. Use visitor flow reports for attrition - very useful.
2. For key conversion points, look at loss rates and interactions.
3. For processes and steps, look at funnels or make your own.
4. For layers and engagement, make a ring model.

8. Example - concept: Bounce -> Engage -> Outcome.

9. Example - 16-25Railcard.co.uk: Bounce, Content, Engage; then Login to Account, Start Application, Type and Details, Eligibility, Photo, Complete.

10. Example - Guide Dogs: Bounce, Content, Engage; then Donation Pathway, Donation Page, Starts process, Funnel steps, Complete.

11. Within a layer: Page 1 to Page 5, Exit, Deeper Layer, plus micro-conversions (Email, Wishlist, Contact, Like).

12. #1: Make a money model. Get to know the flow and loss (leaks) inbound, inside and through key processes or conversion points.
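As a concrete illustration of "flow and loss", here is a minimal Python sketch. The step names and counts are entirely invented; it simply computes the step-to-step drop-off you would otherwise read out of your funnel reports:

```python
# Hypothetical funnel: step names and visitor counts are made up.
funnel = [
    ("landing", 100_000),
    ("product", 42_000),
    ("basket", 9_000),
    ("checkout", 5_000),
    ("complete", 3_200),
]

# Step-to-step continuation and loss rates - where are the big leaks?
for (step, n), (next_step, next_n) in zip(funnel, funnel[1:]):
    kept = next_n / n
    print(f"{step} -> {next_step}: {kept:.1%} continue, {1 - kept:.1%} lost")
```

Sorting the steps by absolute visitors lost points you at the leaks worth investigating first.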
- Once you know the key steps you're losing people at, and how much traffic you have, make a money model.
- 20,000 see the basket page - what's the basket-page-to-checkout-page ratio?
- Estimate how much you think you can shift the key metric (e.g. basket adds, basket -> checkout).
- What downstream revenue or profit would that generate?
- Sort by the money column.
- Congratulations - you've now built the world's first IT plan for growth with a return on investment estimate attached!
- I'll talk more about prioritising later, but here's a good real-world analogy for you to use:

13. Think like a store owner! If you can't refurbish the entire store, which floors or departments will you invest in optimising? Wherever there is: footfall, low return.

14. #2: Your hypothesis is crap! Insight inputs you don't want (#FAIL): competitor copying, guessing, dice rolling, panic, competitor change, an article the CEO read, ego, opinion, cherished notions, marketing whims, cosmic rays, "not on brand enough", IT inflexibility, internal company needs, some dumbass consultant, shiny feature blindness, knee-jerk reactions.

15. #2: These are the inputs you need: eye tracking, segmentation, surveys, sales and call centre, customer contact, social analytics, session replay, usability testing, forms analytics, search analytics, voice of customer, market research, A/B and MVT testing, big & unstructured data, web analytics, competitor evals, customer services.

16. #2: Brainstorming the test - check your inputs; assemble the widest possible team; share your data and research; design emotive writing guidelines.
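The money-model steps from slide 12 can be sketched in a few lines of Python. All numbers (traffic, rates, order value) are hypothetical, and the function name is my own; the point is the "sort by the money column" step:

```python
# Money-model sketch: estimate revenue from shifting a funnel-step ratio,
# then rank candidate tests by money. All inputs are invented.

def projected_uplift(visitors, current_rate, target_rate, avg_order_value):
    """Extra revenue per period if a step's conversion rate moves
    from current_rate to target_rate."""
    extra_conversions = visitors * (target_rate - current_rate)
    return extra_conversions * avg_order_value

# 20,000 see the basket page; suppose basket -> checkout is 55% today
# and we believe a test could lift it to 58%, at a 40.0 average order value.
candidates = [
    ("basket -> checkout", projected_uplift(20_000, 0.55, 0.58, 40.0)),
    ("product -> basket",  projected_uplift(80_000, 0.12, 0.13, 40.0)),
]
# Sort by the money column, biggest opportunity first.
candidates.sort(key=lambda row: row[1], reverse=True)
for name, money in candidates:
    print(f"{name}: {money:,.0f} per period")
```

The sorted list is your ROI-ordered testing plan.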
17. #2: Emotive writing - example. Customers do not know what to do and need support and advice:
- Emphasise that you understand that their situation is stressful.
- Emphasise your expertise and leadership in vehicle glazing and that you will help them get the best solution for their situation.
- Explain what they will need to do online and during the call-back, so that they know what the next steps will be.
- Explain that they will be able to ask any other questions they might have during the call-back.
Customers do not feel confident in assessing the damage:
- Emphasise that you will help them assess the damage correctly online.
Customers need to understand the benefits of booking online:
- Emphasise that the online booking system is quick, easy and provides all the information they need in regard to their appointment, plus general cost information.
Customers mistrust insurers and find dealing with their insurance situation very frustrating:
- Where possible, communicate that the job is most likely to be free for insured customers, or good value for money for cash customers.
- Show that you understand the hassle of dealing with insurance companies - emphasise that you will help with their insurance paperwork for them, freeing them of this burden.
Some customers cannot be bothered to take action to fix their car glass:
- Emphasise the consequences of not doing anything, e.g. "It's going to cost you more if the chip develops into a crack."

18. #2: THE DARK SIDE - "Keep your family safe and get back on the road fast with Autoglass."

19. #2: NOW YOU CAN BEGIN. You should have inputs, research, data and guidelines. Sit down with the team and prompt with 12 questions:
- Who is this page (or process) for?
- What problem does this solve for the user?
- How do we know they need it?
- What is the primary action we want people to take?
- What might prompt the user to take this action?
- How will we know if this is doing what we want it to do?
- How do people get to this page?
- How long are people here on this page?
- What can we remove from this page?
- How can we test this solution with people?
- How are we solving the user's needs in different and better ways than other places on our site?
- If this is a homepage, ask these too (bit.ly/1fX2RAa).

20. #2: PROMPT YOURSELF. Check your UX or copywriting guidelines. Use Get Mental Notes (www.GetMentalNotes.com). What levers can we apply now? Create a hypothesis: WE BELIEVE THAT DOING [A] FOR PEOPLE [B] WILL MAKE OUTCOME [C] HAPPEN. WE'LL KNOW THIS WHEN WE SEE DATA [D] AND FEEDBACK [E].

21. #2: THE FUN BIT! Collaborative sketching. Brainwriting. Refine and test!

22. We believe that doing [A] for people [B] will make outcome [C] happen. We'll know this when we observe data [D] and obtain feedback [E]. (reverse)

23. #2: Solutions.
- You need multiple tool inputs - tool decks are here: www.slideshare.net/sullivac
- Collaborative, customer-connected team - if you're not doing this, you're hosed.
- Session replay tools provide vital input - get that vital additional customer evidence.
- Simple page analytics don't cut it - invest in your analytics, especially event tracking.
- Ego, opinion and cherished notions fill gaps - fill those vacuums with insights and data.
- Champion the user - give them a chair at every meeting.

24. #2: HYPOTHESIS DESIGN SUMMARY.
- Inputs - get the right stuff: research, guidelines, data.
- Framing the problem(s) - questions to get you going; use card prompts for psychology.
- Create a hypothesis.
- Collaborative sketching. Brainwriting.
- Refine and check the hypothesis.
- Instrument and test.

25. We believe that doing [A] for people [B] will make outcome [C] happen. We'll know this when we observe data [D] and obtain feedback [E]. (reverse)
26. #3: No analytics integration. You need it for:
- Investigating problems with tests
- Segmentation of results
- Tests that fail, flip or move around
- Tests that don't make sense
- Broken test setups
- Understanding what drives the averages you see

27. (Image: A B / B A.)

28. "These Danish porn sites are so hardcore!" "We're still waiting for our A/B tests to finish!" #4: The test will finish after you die. Use a test length calculator like this one: visualwebsiteoptimizer.com/ab-split-test-duration/

29. #5: You don't test for long enough. The minimum length:
- 2 business cycles (so you can cross-check)
- Usually a week, 2 weeks or a month
- Always test whole, not partial, cycles
- Be aware of multiple cycles
- Don't self-stop!
- PURCHASE CYCLES - KNOW THEM

30. Business and purchase cycles. Customers change. Your traffic mix changes. So do markets and competitors. Be aware of all the waves. Always test whole cycles - minimum 2 cycles (week/month). Don't exclude slower buyers.

31. #5: You don't test for long enough. How long, beyond that?
- I aim for a minimum of 250 outcomes, ideally 350+, for each creative.
- If you test 4 recipes, that's 1,400 outcomes needed.
- You should have worked out how long each batch of 350 needs before you start!
- 95% confidence is the cherry, not the cake - BIG SECRET -> p-values are unreliable.
- If you segment, you'll need more data.
- It may need a bigger sample if the response rates are similar.*
- Use a test length calculator, but treat it as the BARE MINIMUM TO EXPECT.
- Important insider tip - watch the error bars! The +/- stuff.
* Stats geeks know I'm glossing over something here: test time depends on how the two experiments separate in terms of relative performance, as well as how volatile the test response is. I'll talk about this when I record this one! This is why testing similar stuff sux.
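The "work out how long each batch needs before you start" arithmetic can be sketched with the textbook two-proportion sample-size approximation. The baseline rate, target lift and 3,000 visitors/day are all invented numbers; use your testing tool's own duration calculator for real planning:

```python
import math

def sample_size_per_variant(p, relative_lift):
    """Visitors needed per variant to detect `relative_lift` on baseline
    conversion rate `p` (two-sided z-test, alpha=0.05, power=0.80).
    Standard normal-approximation formula - a sketch, not a substitute
    for your testing tool's calculator."""
    z_alpha, z_beta = 1.96, 0.84   # alpha = 0.05 (two-sided), power = 0.80
    p2 = p * (1 + relative_lift)
    pooled = (p + p2) / 2
    n = ((z_alpha * math.sqrt(2 * pooled * (1 - pooled))
          + z_beta * math.sqrt(p * (1 - p) + p2 * (1 - p2))) ** 2
         / (p2 - p) ** 2)
    return math.ceil(n)

# 5% baseline, hoping for a 20% relative lift -> roughly 8,100 per variant.
n = sample_size_per_variant(0.05, 0.20)
# At an assumed 3,000 eligible visitors/day split across two variants:
days = math.ceil(2 * n / 3_000)
print(n, "per variant, about", days, "days minimum")
```

Note how quickly the required sample shrinks as the expected lift grows - which is exactly why testing near-identical creatives takes forever.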
32. #5: You put faith in the confidence value. 95%, 99%, 99.99% - confidence, or "chance to beat baseline" - what's that?
- It's a stats thing. Seriously, look at this one LAST in your testing.
- Purchase cycle, business cycles, sample size and error bar separation ALL come before this one. Got it?
- Why? It's to do with p-values. Read this article: http://bit.ly/1gq9dtd
- If you rely on confidence, you are relying on something that's unreliable and moves around, particularly early in testing.
- Don't be fooled by your testing package - watch the error bars instead of confidence.

33. #5: The tennis court. Let's say we want to estimate, on average, what height Federer and Nadal hit the ball over the net at. So, let's start the match:

34. First set - Federer 6-4. We start to collect values: 62cm +/- 2cm vs 63.5cm +/- 2cm.

35. Second set - Nadal 7-6. Nadal starts sending them low over the net: 62cm +/- 1cm vs 62.5cm +/- 1cm.

36. Final set - Nadal 7-6. The estimates tighten: 61.8cm +/- 0.3cm vs 62cm +/- 0.3cm.

37. Let's look at this a different way: 9.1% +/- 0.3 vs 9.3% +/- 0.3.

38. As the test runs, the bars narrow: 9.1% +/- 0.5 vs 9.3% +/- 0.5; then 9.1% +/- 0.2 vs 9.3% +/- 0.2; then 9.1% +/- 0.1 vs 9.3% +/- 0.1.

39. The graph is a range, not a line: 9.1 +/- 1.9%, then 9.1 +/- 0.9%, then 9.1 +/- 0.3%.

40. #5: How long to test? The minimum length:
- 2 business cycles, and more than a purchase cycle, as a minimum - regardless of outcomes. Test for less and you're biasing the sample. ALWAYS, ALWAYS TEST WHOLE CYCLES.
- 250 outcomes is the ABSOLUTE MINIMUM for any sample; 350+ is nicer; 1,000 is sweet!
- Look for error bar separation (or minimal overlap) between creatives.
- Ignore 95%+ confidence (it's unreliable).
- Use a test calculator (VWO have a nice one). Work out your test units - how long to get 350 outcomes for each creative in your test.
- This is a minimum you should expect; sample size (or overlap) may mean you need longer.
- Then: when to stop?

41. #5: When to stop. Self-stopping is a huge problem:
- "I stopped the test when it looked good."
- "It hit 20% on Thursday, so I figured it was time to cut and run."
- "We need the test time for something else. Looks good to us."
- "We've got a big sample now, so why not finish it today?"
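Why self-stopping like this bites: below is a small simulation sketch of an A/A test (two identical arms, so any "winner" is false) where someone peeks after every batch and stops at the first significant reading. All numbers are arbitrary, but the pattern is general - the peeker declares far more false winners than the 5% a single look would give:

```python
import math
import random

random.seed(7)  # reproducible illustration

def z_significant(conv_a, conv_b, n, z_crit=1.96):
    """Two-proportion z-test with equal sample size n per arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * (2 / n))
    if se == 0:
        return False
    return abs(conv_a - conv_b) / n / se > z_crit

def aa_test(peeks, batch, rate=0.10):
    """One A/A test (identical arms). Peek after every batch; return True
    if the peeker would ever have declared a 'significant winner'."""
    conv_a = conv_b = 0
    for i in range(1, peeks + 1):
        conv_a += sum(random.random() < rate for _ in range(batch))
        conv_b += sum(random.random() < rate for _ in range(batch))
        if z_significant(conv_a, conv_b, i * batch):
            return True
    return False

trials = 400
peeked = sum(aa_test(peeks=10, batch=200) for _ in range(trials))
one_look = sum(aa_test(peeks=1, batch=2_000) for _ in range(trials))
print(f"peek every batch: {peeked / trials:.0%} false winners")
print(f"single look at the end: {one_look / trials:.0%} false winners")
```

Same total sample in both cases; the only difference is how often you looked.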
False positives and negatives:
- If you cut part of a business cycle, you bias the segments you have in the test. So if you ignore weekend shoppers by stopping your test on Friday, that will affect results.
- The other problem is FALSE POSITIVES and FALSE NEGATIVES.

42. #5: When to stop. Running to completion vs stopping at the first significant reading:

Run to the end:
                         Scenario 1     Scenario 2     Scenario 3     Scenario 4
After 200 observations   Insignificant  Insignificant  Significant!   Significant!
After 500 observations   Insignificant  Significant!   Insignificant  Significant!
End of experiment        Insignificant  Significant!   Insignificant  Significant!

Stop at first significance:
                         Scenario 1     Scenario 2     Scenario 3     Scenario 4
After 200 observations   Insignificant  Insignificant  Significant!   Significant!
After 500 observations   Insignificant  Significant!   (stopped)      (stopped)
End of experiment        Insignificant  Significant!   Significant!   Significant!

43. #5: When to stop - so what to do?
- Run a test calculator.
- Set the test time to hit the highest of the minimums.
- What minimums? Minimum sample (250, 350 or higher), business cycles (2+), purchase cycles (1 or 2+), and what your test calculator says.
- The longest one is how long it's gonna take.
- Set the test time. Run the test. Stop the test at the end, on a whole cycle. Analyse. That's it!

44. #6: The early stages of a test.
- Ignore the graphs. Don't draw conclusions. Don't dance. Calm down. Get a feel for the test, but don't do anything yet!
- Remember: in an A/B test, 50% of returning visitors will see a shiny new website!
- Until your test has had at least 2 business cycles and 250+ outcomes, don't bother even getting remotely excited.
- Watching regularly is good, though. You're looking for anything that looks really odd - if everyone is looking (but not concluding), then oddities will get spotted.
- All tests move around or show big swings early in the testing cycle. Even a very high traffic site can take 10 days to start settling; lower-traffic sites will stretch this period further.
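The "watch the error bars" advice can be made concrete. A sketch of the +/- interval on a conversion rate and a simple overlap check - all conversion counts here are invented:

```python
import math

def conversion_ci(conversions, visitors, z=1.96):
    """95% normal-approximation interval for a conversion rate:
    the 'error bars' to watch instead of the confidence figure."""
    rate = conversions / visitors
    margin = z * math.sqrt(rate * (1 - rate) / visitors)
    return rate - margin, rate + margin

def bars_separated(a, b):
    """True when two (low, high) intervals do not overlap at all."""
    return a[1] < b[0] or b[1] < a[0]

# Early in a test the bars are wide and overlap (counts invented):
early_a = conversion_ci(18, 200)       # about 9% +/- 4
early_b = conversion_ci(22, 200)       # about 11% +/- 4.3
# Later, with more outcomes, the same rates give narrow, separated bars:
late_a = conversion_ci(1_820, 20_000)  # about 9.1% +/- 0.4
late_b = conversion_ci(2_240, 20_000)  # about 11.2% +/- 0.4
print(bars_separated(early_a, early_b), bars_separated(late_a, late_b))
```

Separation (or minimal overlap) of the bars is the signal the deck tells you to wait for; the normal approximation used here is the crudest of several interval methods, but fine for eyeballing.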
45. #7: No QA testing for the A/B test?

46. #7: BIG SECRET! Over 40% of tests have had QA issues. It's very easy to break or bias the testing.
- Browser testing: www.crossbrowsertesting.com, www.browserstack.com, www.spoon.net, www.cloudtesting.com, www.multibrowserviewer.com, www.saucelabs.com
- Mobile devices: www.deviceanywhere.com, www.perfectomobile.com, www.opendevicelab.com

47. #7: What other QA testing should I do?
- Test from several locations (office, home, elsewhere).
- Test that the IP filtering is set up.
- Test that tags are firing correctly (analytics and the test tool).
- Test as a repeat visitor and check session timeouts.
- Cross-check figures from 2+ sources.
- Monitor closely from launch, recheck, watch. WATCH FOR BIAS!

48. #8: Tests are random and not prioritised. Once you have a list of potential test areas, rank them by opportunity vs effort. The common ranking metrics I use include: opportunity (revenue, impact), dev resource, time to market, and risk/complexity. Make yourself a quadrant.

49. #9: Your cycles are too slow. (Graph: conversion over 0-18 months.)

50. #9: Solutions.
- Give priority boarding to opportunities - the best seats are reserved for metric shifters.
- Release more often to close the gap. More testing resource helps, plus an analytics hawk-eye.
- Kaizen - continuous improvement. Others call it JFDI (just f***ing do it).
- Make changes AS WELL as tests, basically! These small things add up.
- RUSH Hair booking: over 100 changes, no functional changes at all, 37% improvement. In between product lifecycles? The added lift for 10 days' work was worth 360k.

51. #9: Make your own cycles.
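The opportunity-vs-effort ranking from #8 can be sketched in a few lines. The backlog items and 1-10 scores below are invented, and this scoring scheme is just one simple option of many:

```python
# Hypothetical backlog: names and 1-10 scores are made up for illustration.
candidates = [
    {"name": "checkout trust badges", "opportunity": 8, "effort": 2, "risk": 1},
    {"name": "homepage redesign",     "opportunity": 9, "effort": 9, "risk": 7},
    {"name": "basket copy tweak",     "opportunity": 4, "effort": 1, "risk": 1},
]

def score(c):
    # Opportunity per unit of (effort + risk), so cheap, safe,
    # high-impact tests float to the top of the quadrant.
    return c["opportunity"] / (c["effort"] + c["risk"])

for c in sorted(candidates, key=score, reverse=True):
    print(f'{c["name"]}: {score(c):.2f}')
```

Plotting opportunity against effort+risk gives you the quadrant the slide describes; the sorted list is the same thing in one dimension.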
52. #10: How do I know when it's ready? The hallmarks of a cooked test are:
- It's done at least 1, preferably 2+, business cycles, and at least one if not two purchase cycles.
- You have at least 250-350 outcomes for each recipe.
- It's not moving around hugely at creative- or segment-level performance.
- The test results are clear, even if the precise values are not.
- The intervals are not overlapping (much).
If a test is still moving around, you need to investigate:
- FIND OUT WHAT MARKETING ARE DOING. FIND OUT WHAT EVERYONE IS DOING.
- Be careful about limited-time-period campaigns (e.g. TV, print, online).
- If you know when TV (or other big campaigns) are running, try one week with TV and one without during tests - very interesting.

53. #11: Your test fails.

54. #11: Your test fails.
- Learn from the failure! If you can't learn from the failure, you've designed a crap test.
- Next time you design, imagine all your stuff failing. What would you do? If you don't know, or you're not sure, get it changed so that a negative becomes insightful.
- So: failure itself, at a creative or variable level, should tell you something.
- On a failed test, always analyse the segmentation and analytics. One or more segments will be over and under - check for varied performance.
- Now add the failure info to your knowledge base. Look at it carefully - what does the failure tell you? Which element do you think drove the failure?
- If you know what failed (e.g. making the price bigger), then you have very useful information: you turned the handle the wrong way.
- Now brainstorm a new test.
55. #12: The test is about the same.
- Analyse the segmentation. Check the analytics and instrumentation.
- One or more segments may be over and under. They may be cancelling out - the average is a lie.
- The segment-level performance will help you (beware of small sample sizes).
- If you genuinely have a test which failed to move any segments, it's a crap test - be bolder.
- This usually happens when a test isn't bold or brave enough in shifting away from the original design, particularly on lower-traffic sites.
- Get testing again!

56. #13: The test keeps moving around. There are three reasons it is moving around:
- Your sample size (outcomes) is still too small;
- The external traffic mix, customers or their reaction has suddenly changed; or
- Your inbound, marketing-driven traffic mix is completely volatile (very rare).
Check the sample size. Check all your marketing activity. Check the instrumentation. If there's no reason there, check the segmentation.

57. #14: The test has flipped on me. Something like this can happen:
- Check your sample size. If it's still small, then expect this until the test settles.
- If the test does genuinely flip, and quite severely, then something has changed with the traffic mix, the customer base or your advertising. Maybe the PPC budget ran out? Seriously!
- To analyse a flipped test, you'll need to check your segmented data. This is why you have a split-testing package AND an analytics system.
- The segmented data will help you to identify the source of the shift in response to your test.
- I rarely get a flipped one, and it's always something.

58. #15: Should I run an A/A test first? No, and this is why:
- It's a waste of time - it's easier to test and monitor instead.
- You are eating into test time.
- The same applies to A/A/B/B testing.
- A/B/A running at 25%/50%/25% is the best.
- Read my post here: http://bit.ly/WcI9EZ
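The "average is a lie" effect from #12 is easy to demonstrate. With invented numbers, two segments moving in opposite directions blend into a flat-looking overall result:

```python
# Invented numbers: two segments cancel out, so the blended result looks flat.
segments = {
    # segment: (control conversions, control n, variant conversions, variant n)
    "mobile":  (300, 10_000, 390, 10_000),   # variant up 30% here
    "desktop": (800, 10_000, 710, 10_000),   # variant down ~11% here
}

overall_control = (sum(s[0] for s in segments.values())
                   / sum(s[1] for s in segments.values()))
overall_variant = (sum(s[2] for s in segments.values())
                   / sum(s[3] for s in segments.values()))
print(f"overall: control {overall_control:.1%} vs variant {overall_variant:.1%}")
for name, (cc, cn, vc, vn) in segments.items():
    print(f"{name}: control {cc / cn:.1%} vs variant {vc / vn:.1%}")
```

Both blended rates come out identical (5.5%), while mobile is clearly up and desktop clearly down - exactly the case where only segment-level analysis saves you. Just mind the smaller sample sizes inside each segment.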
59. #16: Nobody feels the test.
- You promised a 25% rise in checkouts - you only see 2%.
- Traffic, advertising or marketing may have changed.
- Check they're using the same precise metrics.
- Run a calibration exercise. I often leave a 5 or 10% stub running in a test: this tracks the old creative once the new one goes live. If conversion is also down for that one, BINGO!
- Remember: the A/B test is an estimate - it doesn't precisely record future performance.
- This is why infrequent testing is bad. Always be trying a new test instead of basking in the glory of one you ran 6 months ago. You're only as good as your next test.

60. #17: You forgot about mobile and tablet.
- If you're A/B testing a responsive site, pay attention: content will break differently on many screens.
- Know thy users and their devices. Use Bango or Google Analytics to define a test list.
- Make sure you test mobile devices and viewports - what looks good on your desk may not be what the user sees.
- Cross-device tests are harder to design. You'll need to segment mobile, tablet and desktop response in the analytics or A/B testing package.
- Your personal phone is not a device mix. Ask me about making your device list. Buy core devices, rent the rest from deviceanywhere.com.

61. #18: Oh shit - no traffic.
- If volumes are small, contact customers - reach out. If the data volumes aren't there, there are still customers!
- Drive design from levers you can apply - game the system.
- Pick clean and simple clusters of change (hypothesis driven).
- Use a goal at an earlier ring stage or funnel step.
- Beware of using clickthroughs when attrition is high on the other side.
- Try before-and-after testing on identical time periods (measure in an analytics model).
- Be careful about small sample sizes (