WHITE PAPER
PRACTICAL GUIDE TO MACHINE LEARNING
MACHINE LEARNING: AN OVERVIEW
You may have heard how companies like Google and Facebook use machine learning to drive cars, recognize human speech and classify images. Very cool, you think. But how does that relate to my business?
Consider how these companies use machine learning today:
• A payments processing company detects fraud hidden among more than a billion transactions, and it does so in real time, reducing losses by $1 million per month;
• An auto insurer links losses from insurance claims to detailed geospatial data, enabling it to accurately predict the business impact of severe weather events;
• Working with data produced by vehicle telematics, a manufacturer uncovers patterns in operational metrics and uses them to drive proactive maintenance.
Two themes unify these success stories:
• Each application depends on large-scale datasets, in a variety of formats, and at high velocity;
• In each case, machine learning uncovers new insights and drives value.
The technical foundations of machine learning are more than fifty years old, but until recently few people outside of academia were aware of its capabilities. Machine learning requires a lot of computing power; early adopters simply lacked the infrastructure to make it cost-effective.
Several converging trends contribute to the recent surge of interest and activity:
• Moore’s Law radically reduces computing costs, and massive computing power is widely available at minimal cost;
• New and innovative algorithms provide faster results;
• With experience, data scientists have accumulated theory and practical guidance to drive value.
Above all, the tsunami of data creates analytic problems that simply cannot be solved with conventional statistics. Necessity is the mother of invention: old methods of analysis no longer work in today’s business environment.
Practical Guide to Machine Learning H2O White Paper | 2
MACHINE LEARNING TECHNIQUES
There are hundreds of different machine learning algorithms; a recent paper benchmarked more than 150 algorithms for classification alone. This overview covers the key techniques that data scientists use to drive value today.
Data scientists distinguish between techniques for supervised and unsupervised learning. Supervised learning techniques require prior knowledge of an outcome. For example, if we work with historical data from a marketing campaign, we can classify each impression by whether or not the prospect responded, or we can determine how much they spent. Supervised techniques provide powerful tools for prediction and classification.
Frequently, however, we do not know the “ultimate” outcome of an event. For example, in some cases of fraud, we may not know that a transaction is fraudulent until long after the event; in this case, rather than attempting to predict which transactions are frauds, we might want to use machine learning to identify transactions that are unusual, and flag these for further investigation. We use unsupervised learning when we do not have prior knowledge about a specific outcome, but still want to extract useful insights from the data.
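To make the idea of flagging unusual transactions concrete, here is a minimal Python sketch, not a production method. It assumes a small, made-up list of transaction amounts and scores each one by its distance from the median, scaled by the median absolute deviation. The function name, the data and the threshold of 5 are illustrative assumptions.

```python
from statistics import median

def flag_unusual(amounts, threshold=5.0):
    """Return the amounts whose robust score exceeds the threshold (hypothetical helper)."""
    center = median(amounts)
    # Median absolute deviation: a measure of spread that outliers barely affect.
    mad = median(abs(a - center) for a in amounts)
    if mad == 0:
        # Simplification for this sketch: with no measurable spread, flag nothing.
        return []
    return [a for a in amounts if abs(a - center) / mad > threshold]

# Made-up transaction amounts; no outcome labels are needed.
amounts = [20, 25, 22, 30, 27, 24, 950, 26]
flagged = flag_unusual(amounts)
print(flagged)
```

Note that nothing in the sketch knows which transactions are fraudulent; it only ranks how unusual each one is, which is the sense in which the approach is unsupervised.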
The most widely used supervised learning techniques include:
• GENERALIZED LINEAR MODELS (GLM): an advanced form of linear regression that supports different probability distributions and link functions, enabling the analyst to model the data more effectively. Enhanced with a grid search, GLM is a hybrid of classical statistics and the most advanced machine learning.
• DECISION TREES: a supervised learning method that learns a set of rules that split a population into progressively smaller segments that are homogeneous with respect to the target variable.
• RANDOM FORESTS: a popular ensemble learning method that trains many decision trees, then averages across the trees to develop a prediction. This averaging process produces a more generalizable solution, and filters out random noise in the data.
• GRADIENT BOOSTING MACHINE (GBM): a method that produces a prediction model by training a sequence of decision trees, where successive trees adjust for prediction errors in previous trees.
• DEEP LEARNING: an approach that models high-level patterns in data as complex multi-layered networks. Because it is the most general way to model a problem, Deep Learning has the potential to solve the most challenging problems in machine learning.
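The phrase “successive trees adjust for prediction errors in previous trees” can be illustrated with a toy sketch in plain Python. It is a simplified illustration under stated assumptions, not the implementation used by any particular product: each “tree” is a one-split stump on a single numeric feature, the loss is squared error, and the data, function names, tree count and learning rate are all made up for the example.

```python
from statistics import mean

def fit_stump(xs, residuals):
    """Find the one-threshold split on x that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def fit_gbm(xs, ys, n_trees=20, learning_rate=0.5):
    """Train stumps in sequence; each one is fit to the current errors."""
    base = mean(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate

def predict_gbm(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)

def training_mse(model, xs, ys):
    return sum((predict_gbm(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up data with three rough levels.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 5.2, 5.0, 5.1]
mse_one_tree = training_mse(fit_gbm(xs, ys, n_trees=1), xs, ys)
mse_many_trees = training_mse(fit_gbm(xs, ys, n_trees=20), xs, ys)
print(mse_one_tree, mse_many_trees)
```

The training error with twenty stumps is lower than with one, because every new stump is fit to whatever error the earlier stumps left behind. A random forest, by contrast, would train its trees independently and average them.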
Key techniques for unsupervised learning include:
• CLUSTERING: this technique groups objects into segments, or clusters, that are similar to one another on many metrics. Customer segmentation is an example of clustering in action. There are many different clustering algorithms; the most widely used is k-means.
• ANOMALY DETECTION: in fields like security and fraud, it is not possible to exhaustively investigate every transaction; we need to systematically flag the most unusual transactions. Deep Learning, a technique discussed previously under supervised learning, can also be used for anomaly detection.
• DIMENSION REDUCTION: as organizations capture more data, the number of possible predictors (or features) available for prediction expands rapidly. Simply identifying what data provides information value for a particular problem is a significant task. Principal Components Analysis (PCA) evaluates a set of raw features and reduces them to indices that are uncorrelated with one another.
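As an illustration of the clustering technique listed above, the following is a minimal k-means sketch in plain Python. It assumes two numeric measurements per customer and a deterministic choice of starting centroids so the result is repeatable. The data and function name are made up, and real implementations usually randomize the starting points and check for convergence.

```python
from math import dist

def k_means(points, k, iterations=10):
    """Group points into k clusters by repeatedly assigning and re-centering."""
    # Deterministic start for this sketch: pick k points spread through the list.
    centroids = [points[i * len(points) // k] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return labels, centroids

# Made-up customers described by two metrics (for example, visits and spend).
customers = [(1.0, 1.2), (1.5, 0.8), (0.9, 1.1), (8.0, 8.2), (7.6, 8.1), (8.3, 7.9)]
labels, centroids = k_means(customers, k=2)
print(labels)
```

The two segments emerge from the data alone; no customer was labeled in advance.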
While some machine learning techniques tend to consistently outperform others, it is rarely possible to say in advance which one will work best for a particular problem. Hence, most data scientists prefer to try many techniques and choose the best model. For this reason, high performance is essential, because it enables the data scientist to try more options and build the best possible model.
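The “try many techniques and choose the best model” workflow can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: three deliberately simple candidate techniques, a made-up data set split into training and validation rows, and mean squared error as the measure of success. All names and numbers are hypothetical.

```python
def fit_mean(train):
    """Baseline: always predict the average training outcome."""
    avg = sum(y for _, y in train) / len(train)
    return lambda x: avg

def fit_linear(train):
    """Ordinary least squares with a single predictor."""
    n = len(train)
    x_bar = sum(x for x, _ in train) / n
    y_bar = sum(y for _, y in train) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in train)
             / sum((x - x_bar) ** 2 for x, _ in train))
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x

def fit_nearest_neighbor(train):
    """Predict the outcome of the closest training row."""
    return lambda x: min(train, key=lambda pair: abs(pair[0] - x))[1]

def mean_squared_error(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# Made-up (predictor, outcome) pairs; validation rows are held out from training.
train = [(0, 1.1), (2, 4.9), (4, 9.2), (6, 12.8), (8, 17.1), (10, 20.9)]
valid = [(1, 3.0), (3, 7.1), (5, 10.9), (7, 15.2), (9, 18.8)]

candidates = {
    "mean baseline": fit_mean,
    "linear regression": fit_linear,
    "nearest neighbor": fit_nearest_neighbor,
}
# Train every candidate, score each on the held-out rows, keep the lowest error.
scores = {name: mean_squared_error(fit(train), valid) for name, fit in candidates.items()}
best_name = min(scores, key=scores.get)
print(best_name, scores)
```

Every candidate has to be trained and scored before a winner can be picked, which is why training speed limits how many options a data scientist can explore.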
HOW TO GET STARTED
If you are interested in machine learning and wondering how to apply it in your organization, there are some concrete steps you can take.
IDENTIFY A BUSINESS PROBLEM. Identify opportunities in your business where improved predictions will have a compelling impact, in the form of increased revenues, reduced costs or some other key business driver. Possible examples include (but are not limited to): detecting and preventing fraud; detecting security risks and threats; measuring credit and default risk; and other high-impact problems. If you can’t find problems like this in your business, you’re not looking hard enough; every business has opportunities to improve.
CONSULT WITH YOUR ANALYTICS TEAM. You may be surprised to learn that your analysts already use machine learning; if so, that’s great. If not, ask: why not? Most analysts are excited about machine learning, and actively seek out business cases where the techniques can drive value. There may be short-term barriers, however, such as a shortage of personnel, lack of software or lack of support from the IT organization. Work with your analysts to diagnose and resolve these barriers.
If you do not have an analytics team, engage a consultant or analytic services provider who can help you build the capability and provide interim support. If your analytics team expresses no interest in driving business value, examine the team’s leadership and incentives.
ENGAGE YOUR IT ORGANIZATION. Your IT organization plays a critical role in getting your application into production, so it’s important to engage them early. IT organizations are sometimes reluctant to introduce advanced analytics into production systems, fearing that “rocket scientists” will tie up the system or bring it down. To address these concerns, make sure that IT helps define the technical requirements for your machine learning software. Your IT team will be very concerned about such things as Hadoop support, the ability to run in the cloud and other things that can make or break your application.
CHOOSE YOUR SOFTWARE OPTIONS. You may be told that your organization already has the software it needs to deliver your application. That’s likely not true; machine learning is a rapidly developing field, with significant advances in the past year. Think of it this way: if your organization already has the software it needs to deliver your application, why isn’t your application built already?
In the section below headed Machine Learning Software Requirements, we outline the most important things to look for in machine learning software. Your analysts and your IT organization will fill in details about such things as Hadoop distributions and cloud platforms.
DEFINE EVALUATION CRITERIA. Your business problem defines your measures of success. Work with your analysts and IT representatives to develop specific and measurable criteria, which should include:
• Measures of prediction success
• Runtime performance for model training and model scoring
• Scaling requirements, measured in data volume (rows and columns)
• Output requirements
If there is an existing predictive model in production, your evaluation criteria should specify the performance thresholds any new software should meet.
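Criteria like these can be written down as explicit, testable checks. The sketch below is one hypothetical way to do so in Python: it measures prediction success with precision and recall, times the scoring step, and compares both against thresholds. The scoring rule, the data and the threshold values are placeholders, not recommendations.

```python
import time

def precision_recall(actual, predicted):
    """Precision and recall for a binary outcome coded as 1 (event) or 0."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def timed(fn, *args):
    """Run fn and return its result together with the elapsed seconds."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def score_model(amounts):
    # Hypothetical stand-in for a real model: flag large transactions.
    return [1 if amount > 1000 else 0 for amount in amounts]

# Made-up evaluation data: transaction amounts and known outcomes.
amounts = [50, 1200, 300, 5000, 1500, 20, 990, 2500]
actual = [0, 1, 0, 1, 0, 0, 1, 1]

predicted, seconds = timed(score_model, amounts)
precision, recall = precision_recall(actual, predicted)

# Hypothetical thresholds agreed with analysts and IT before the trial.
meets_criteria = precision >= 0.70 and recall >= 0.70 and seconds < 1.0
print(precision, recall, meets_criteria)
```

Writing the thresholds down before the trial starts makes it harder to move the goalposts afterward, and gives any existing production model and any new candidate the same bar to clear.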
PLAN A TRIAL. Work with your analysts and IT team to plan a trial, or Proof of Concept (POC). If you limit the scope to open source software, as we recommend, your out-of-pocket costs will be minimal. (Commercial software vendors ordinarily do not charge for evaluation software; however, licensing costs for commercial machine learning software that scales to Big Data start at seven figures.)
MACHINE LEARNING SOFTWARE REQUIREMENTS
Software for machine learning is widely available, and organizations seeking to develop a capability in this area have many options. The following requirements should be considered when evaluating machine learning software:
• Speed
• Time to Value
• Model Accuracy
• Easy Integration
• Flexible Deployment
• Usability
• Visualization
Let’s review each of these in turn.
SPEED: Time is money, and fast software makes your highly paid data scientists more productive. Practical data science is often iterative and experimental; a project may require hundreds of tests, so small differences in speed translate to dramatic improvements in efficiency. Given today’s data volumes, high-performance machine learning software must run on a distributed platform, so you can spread the workload over many servers.
TIME TO VALUE: Runtime performance is just one part of total time to value. The key metric for your business is the amount of time needed to complete a project from data ingestion to deployment. In practical terms, this means that your machine learning software should integrate with popular Hadoop and cloud formats, and should export predictive models as code you can deploy anywhere in your organization.
MODEL ACCURACY: Accuracy matters, especially so when the stakes are high; for applications like fraud detection, small improvements in accuracy can produce millions of dollars in annual savings. Your machine learning software should empower your data scientists to use all of your data, rather than forcing them to work with samples.
EASY INTEGRATION: Your machine learning software must co-exist with a complex stack of Big Data software in production. Open source software is easier to deploy, modify and integrate into your production workflows. Additionally, look for machine learning software that runs on commodity hardware, and does not require specialized HPC machines or exotic hardware like GPU chips.
FLEXIBLE DEPLOYMENT: Your machine learning software should support a range of deployment options, including co-located in Hadoop or in a freestanding cluster. If cloud is part of your architecture, look for software that runs in a variety of cloud platforms, such as Amazon Web Services, Microsoft Azure and Google Cloud.
USABILITY: Your data scientists use many different software tools to perform their work, including analytic languages like R, Python and Scala; your machine learning platform should integrate easily with the tools your data scientists already use. Well-designed machine learning algorithms include time-saving features:
• Ability to treat missing data
• Ability to transform categorical data
• Regularization techniques to manage complexity
• Grid search capability for automated test and learn
• Automatic cross-validation (to avoid overlearning)
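For readers unfamiliar with these features, the sketch below shows in plain Python what three of them do at their simplest: filling in missing values, turning a categorical column into numeric indicator columns, and splitting rows into folds for cross-validation. The function names and data are hypothetical, and machine learning platforms typically offer more sophisticated versions of each.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    present = [v for v in values if v is not None]
    avg = sum(present) / len(present)
    return [avg if v is None else v for v in values]

def one_hot(values):
    """Turn a categorical column into one 0/1 indicator column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories

def k_fold_indices(n_rows, k):
    """Yield (train, test) row indices so every row is held out exactly once."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train_idx, test_idx

# Made-up columns from a small customer table.
ages = impute_mean([10.0, None, 30.0])
encoded, categories = one_hot(["red", "blue", "red"])
splits = list(k_fold_indices(6, 3))
print(ages, categories, encoded, splits)
```

Holding each fold out in turn means a model is always judged on rows it did not see during training, which is how cross-validation guards against overlearning.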
VISUALIZATION: Successful predictive modeling requires collaboration between the data scientist and business users. Your machine learning software should provide business users with tools to visually evaluate the quality and characteristics of the predictive model.
ABOUT H2O.AI
H2O is the #1 open source machine learning platform for smarter applications. H2O.ai is the Silicon Valley software company supporting and developing H2O. Leading insurance, healthcare and financial services companies are using H2O to make smarter predictions about churn, pricing, fraud and more. H2O.ai is fostering a grassroots movement of systems engineers, data scientists, data developers and predictive analysts to move machine learning forward. A rapidly growing community of H2O users is now active in more than 5000 organizations worldwide. H2O.ai is a Gartner Cool Vendor in Data Science for 2015.
©2015 H2O.ai, Inc. All Rights Reserved. 2307 Leghorn St. Mountain View, CA 94043. Information is subject to change without notice.