Transcript
Page 1: PRACTICAL GUIDE TO MACHINE LEARNING - H2Ouniversity.h2o.ai/.../practical-guide-to-machine-learning.pdf · MACHINE LEARNING TECHNIQUES There are hundreds of different machine learning

WHITE PAPER

PRACTICAL GUIDE TO MACHINE LEARNING

Page 2: PRACTICAL GUIDE TO MACHINE LEARNING - H2Ouniversity.h2o.ai/.../practical-guide-to-machine-learning.pdf · MACHINE LEARNING TECHNIQUES There are hundreds of different machine learning

MACHINE LEARNING: AN OVERVIEW

You may have heard how companies like Google and Facebook use machine learning to drive cars, recognize human speech and classify images. Very cool, you think. But how does that relate to my business?

Consider how these companies use machine learning today:

•Apaymentsprocessingcompanydetectsfraudhiddenamongmorethanabilliontransactions,anditdoessoinrealtime,reducinglossesby$1millionpermonth;

•Anautoinsurerlinkslossesfrominsuranceclaimstodetailedgeospatialdata,enablingittoaccuratelypredictthebusinessimpactofsevereweatherevents;

•Workingwithdataproducedbyvehicletelematics,amanufactureruncoverspatternsinoperationalmetricsandusesthemtodriveproactivemaintenance.

Two themes unify these success stories:

•Eachapplicationdependsonlargescaledatasets,inavarietyofformats,andathighvelocity;

•Ineachcase,machinelearninguncoversnewinsightsanddrivesvalue.

Thetechnicalfoundationsofmachinelearningaremorethanfiftyyearsold,butuntilrecentlyfewpeopleoutsideofacademiawereawareofitscapabilities.Machinelearningrequiresalotofcomputingpower;earlyadopterssimplylackedtheinfrastructuretomakeitcost-effective.

Severalconvergingtrendscontributetotherecentsurgeofinterestandactivity:

•Moore’sLawradicallyreducescomputingcosts,andmassivecomputingpoweriswidelyavailableatminimalcost;

•Newandinnovativealgorithmsprovidefasterresults;

•Withexperience,datascientistshaveaccumulatedtheoryandpracticalguidancetodrivevalue.

Aboveall,thetsunamiofdatacreatesanalyticproblemsthatsimplycannotbesolvedwithconventionalstatistics.Necessityisthemotherofinvention:oldmethodsofanalysisnolongerworkintoday’sbusiness environment.

Practical Guide to Machine Learning H2O White Paper | 2

Page 3: PRACTICAL GUIDE TO MACHINE LEARNING - H2Ouniversity.h2o.ai/.../practical-guide-to-machine-learning.pdf · MACHINE LEARNING TECHNIQUES There are hundreds of different machine learning

MACHINE LEARNING TECHNIQUES

Therearehundredsofdifferentmachinelearningalgorithms;arecentpaperbenchmarkedmorethan150algorithmsforclassificationalone.Thisoverviewcoversthekeytechniquesthatdatascientistsuseto drive value today.

Datascientistsdistinguishbetweentechniquesforsupervisedandunsupervisedlearning.Supervised learningtechniquesrequirepriorknowledgeofanoutcome.Forexample,ifweworkwithhistoricaldatafromamarketingcampaign,wecanclassifyeachimpressionbywhetherornottheprospectresponded,orwecandeterminehowmuchtheyspent.Supervisedtechniquesprovidepowerfultoolsforpredictionandclassification.

Frequently,however,wedonotknowthe“ultimate”outcomeofanevent.Forexample,insomecasesoffraud,wemaynotknowthatatransactionisfraudulentuntillongaftertheevent;inthiscase,ratherthanattemptingtopredictwhichtransactionsarefrauds,wemightwanttousemachinelearningtoidentifytransactionsthatareunusual,andflagtheseforfurtherinvestigation.Weuseunsupervised learningwhenwedonothavepriorknowledgeaboutaspecificoutcome,butstillwanttoextractusefulinsights from the data.

Themostwidelyusedsupervisedlearningtechniquesinclude:

•GENERALIZED LINEAR MODELS (GLM): an advanced form of linear regression that supportsdifferentprobabilitydistributionsandlinkfunctions,enablingtheanalysttomodelthedatamoreeffectively.Enhancedwithagridsearch,GLMisahybridofclassicalstatisticsandthemost advanced machine learning.

•DECISION TREES: asupervisedlearningmethodthatlearnsasetofrulesthatsplitapopulationinto progressively smaller segments that are homogeneous with respect to the target variable.

•RANDOM FORESTS: a popular ensemble learning method that trains many decision trees, thenaveragesacrossthetreestodevelopaprediction.Thisaveragingprocessproducesamoregeneralizablesolution,andfiltersoutrandomnoiseinthedata.

•GRADIENT BOOSTING MACHINE (GBM):amethodthatproducesapredictionmodelbytrainingasequenceofdecisiontrees,wheresuccessivetreesadjustforpredictionerrorsinprevioustrees.

•DEEP LEARNING:anapproachthatmodelshigh-levelpatternsindataascomplexmulti-layerednetworks.Becauseitisthemostgeneralwaytomodelaproblem,DeepLearninghasthepotentialto solve the most challenging problems in machine learning.

Keytechniquesforunsupervisedlearninginclude:

•CLUSTERING:thistechniquegroupsobjectsintosegments,orclusters,thataresimilartooneanotheronmanymetrics.Customersegmentationisanexampleofclusteringinaction.Therearemanydifferentclusteringalgorithms;themostwidelyusedisk-means.

•ANOMALY DETECTION:infieldslikesecurityandfraud,itisnotpossibletoexhaustivelyinvestigateeverytransaction;weneedtosystematicallyflagthemostunusualtransactions.DeepLearning,atechniquediscussedpreviouslyundersupervisedlearning,canalsobeusedforanomalydetection.

Practical Guide to Machine Learning H2O White Paper | 3

Page 4: PRACTICAL GUIDE TO MACHINE LEARNING - H2Ouniversity.h2o.ai/.../practical-guide-to-machine-learning.pdf · MACHINE LEARNING TECHNIQUES There are hundreds of different machine learning

•DIMENSION REDUCTION:asorganizationscapturemoredata,thenumberofpossiblepredictors(orfeatures)availableforpredictionexpandsrapidly.Simplyidentifyingwhatdataprovidesinformationvalueforaparticularproblemisasignificanttask.PrincipalComponentsAnalysis(PCA)evaluates a set of raw features and reduces them to indices that are independent of one another.

Whilesomemachinelearningtechniquestendtoconsistentlyoutperformothers,itisrarelypossibletosayinadvancewhichonewillworkbestforaparticularproblem.Hence,mostdatascientistsprefertotrymanytechniquesandchoosethebestmodel.Forthisreason,highperformanceisessential,becauseitenablesthedatascientisttotrymoreoptionsandbuildthebestpossiblemodel.

HOW TO GET STARTED

Ifyouareinterestedinmachinelearningandwonderinghowtoapplyitinyourorganization,therearesome concrete steps you can take.

IDENTIFY A BUSINESS PROBLEM.Identifyopportunitiesinyourbusinesswhereimprovedpredictionswillhaveacompellingimpact,intheformofincreasedrevenues,reducedcostsorsomeotherkeybusinessdriver.Possibleexamplesinclude(butarenotlimitedto):detectingandpreventingfraud;detectingsecurityrisksandthreats;measuringcreditanddefaultrisk;andotherhigh-impactproblems.Ifyoucan’tfindproblemslikethisinyourbusiness,you’renotlookinghardenough;everybusinesshasopportunitiestoimprove.

CONSULT WITH YOUR ANALYTICS TEAM. You may be surprised to learn that your analysts alreadyusemachinelearning;ifso,that’sgreat.Ifnot,ask:whynot?Mostanalystsareexcitedaboutmachinelearning,andactivelyseekoutbusinesscaseswherethetechniquescandrivevalue.Theremaybeshort-termbarriers,however,suchasashortageofpersonnel,lackofsoftwareorlackofsupportfromtheITorganization.Workwithyouranalyststodiagnoseandresolvethesebarriers.

Ifyoudonothaveananalyticsteam,engageaconsultantoranalyticservicesproviderwhocanhelpyoubuildthecapabilityandprovideinterimsupport.Ifyouranalyticsteamexpressesnointerestindrivingbusinessvalue,examinetheteam’sleadershipandincentives.

ENGAGE YOUR IT ORGANIZATION.YourITorganizationplaysacriticalrolegettingyourapplicationintoproduction,soit’simportanttoengagethemearly.ITorganizationsaresometimesreluctanttointroduceadvancedanalyticsintoproductionsystems,fearingthat“rocketscientists”willtieupthesystemorbringitdown.Toaddresstheseconcerns,makesurethatIThelpsdefinethetechnicalrequirementsforyourmachinelearningsoftware.YourITteamwillbeveryconcernedaboutsuchthingsasHadoopsupport,theabilitytoruninthecloudandotherthingsthatcanmakeorbreakyourapplication.

CHOOSE YOUR SOFTWARE OPTIONS.Youmaybetoldthatyourorganizationalreadyhasthesoftwareitneedstodeliveryourapplication.That’slikelynottrue;machinelearningisarapidlydevelopingfield,withsignificantadvancesinthepastyear.Thinkofitthisway:if your organization already has the software it needs to deliver your application, why isn’t your application built already?

Practical Guide to Machine Learning H2O White Paper | 4

Page 5: PRACTICAL GUIDE TO MACHINE LEARNING - H2Ouniversity.h2o.ai/.../practical-guide-to-machine-learning.pdf · MACHINE LEARNING TECHNIQUES There are hundreds of different machine learning

InthesectionbelowheadedSoftwareConsiderations,weoutlinethemostimportantthingstolookforinmachinelearningsoftware.YouranalystsandyourITorganizationwillfillindetailsaboutsuchthingsasHadoopdistributionsandcloudplatforms.

DEFINE EVALUATION CRITERIA.Yourbusinessproblemdefinesyourmeasuresofsuccess.WorkwithyouranalystsandITrepresentativestodevelopspecificandmeasureablecriteria,whichshouldinclude:

•Measuresofpredictionsuccess

•Runtimeperformanceformodeltrainingandmodelscoring

•Scalingrequirements,measuredindatavolume(rowsandcolumns)

•Outputrequirements

Ifthereisanexistingpredictivemodelinproduction,yourevaluationcriteriashouldspecifytheperformancethresholdsanynewsoftwareshouldmeet.

PLAN A TRIAL. WorkwithyouranalystsandITteamtoplanatrial,orProofofConcept(POC).Ifyoulimitthescopetoopensourcesoftware,aswerecommend,yourout-of-pocketcostswillbeminimal.(Commercialsoftwarevendorsordinarilydonotchargeforevaluationsoftware;howeverlicensingcostsforcommercialmachinelearningsoftwarethatscalestoBigDatastartsatsevenfigures.)

MACHINE LEARNING SOFTWARE REQUIREMENTS

Softwareformachinelearningiswidelyavailable,andorganizationsseekingtodevelopacapabilityinthisareahavemanyoptions.Thefollowingrequirementsshouldbeconsideredwhenevaluatingmachinelearning:

•Speed

•TimetoValue

•ModelAccuracy

•EasyIntegration

•FlexibleDeployment

•Usability

•Visualization

Let’srevieweachoftheseinturn.

SPEED:Timeismoney,andfastsoftwaremakesyourhighlypaiddatascientistsmoreproductive.Practicaldatascienceisofteniterativeandexperimental;aprojectmayrequirehundredsoftests,sosmalldifferencesinspeedtranslatetodramaticimprovementsinefficiency.Giventoday’sdatavolumes,high-performancemachinelearningsoftwaremustrunonadistributedplatform,soyoucanspreadtheworkload over many servers.

Practical Guide to Machine Learning H2O White Paper | 5

Page 6: PRACTICAL GUIDE TO MACHINE LEARNING - H2Ouniversity.h2o.ai/.../practical-guide-to-machine-learning.pdf · MACHINE LEARNING TECHNIQUES There are hundreds of different machine learning

TIME TO VALUE:Runtimeperformanceisjustonepartoftotaltimetovalue.Thekeymetricforyourbusinessistheamountoftimeneededtocompleteaprojectfromdataingestiontodeployment.Inpracticalterms,thismeansthatyourmachinelearningsoftwareshouldintegratewithpopularHadoopandcloudformats,andshouldexportpredictivemodelsascodeyoucandeployanywhereinyourorganization.

MODEL ACCURACY:Accuracymatters,especiallysowhenthestakesarehigh;forapplicationslikefrauddetection,smallimprovementsinaccuracycanproducemillionsofdollarsinannualsavings.Yourmachinelearningsoftwareshouldempoweryourdatascientiststouseall of your data, rather than forcing them to work with samples.

EASY INTEGRATION:Yourmachinelearningsoftwaremustco-existwithacomplexstackofBigDatasoftwareinproduction.Opensourcesoftwareiseasiertodeploy,modifyandintegrateintoyourproductionworkflows.Additionally,lookformachinelearningsoftwarethatrunsoncommodityhardware,anddoesnotrequirespecializedHPCmachinesorexotichardwarelikeGPUchips.

FLEXIBLE DEPLOYMENT:Yourmachinelearningsoftwareshouldsupportarangeofdeploymentoptions,includingco-locatedinHadooporinafreestandingcluster.Ifcloudispartofyourarchitecture,lookforsoftwarethatrunsinavarietyofcloudplatforms,suchasAmazonWebServices,MicrosoftAzureandGoogleCloud.

USABILITY:Yourdatascientistsusemanydifferentsoftwaretoolstoperformtheirwork,includinganalyticlanguageslikeR,PythonandScala;yourmachinelearningplatformshouldintegrateeasilywiththetoolsyourdatascientistsalreadyuse.Well-designedmachinelearningalgorithmsincludetime-saving features.

•Abilitytotreatmissingdata •Abilitytotransformcategoricaldata •Regularizationtechniquestomanagecomplexity •Gridsearchcapabilityforautomatedtestandlearn •Automaticcross-validation(toavoidoverlearning)

VISUALIZATION:Successfulpredictivemodelingrequirescollaborationbetweenthedatascientistandbusinessusers.Yourmachinelearningsoftwareshouldprovidebusinessuserswithtoolstovisuallyevaluatethequalityandcharacteristicsofthepredictivemodel.

Practical Guide to Machine Learning H2O White Paper | 6

Page 7: PRACTICAL GUIDE TO MACHINE LEARNING - H2Ouniversity.h2o.ai/.../practical-guide-to-machine-learning.pdf · MACHINE LEARNING TECHNIQUES There are hundreds of different machine learning

ABOUT H2O.AI

H2Oisthe#1opensourcemachinelearningplatformforsmarterapplications.H2O.aiistheSiliconValleysoftwarecompanysupportinganddevelopingH2O.Leadinginsurance,healthcareandfinancialservicescompaniesareusingH2Otomakesmarterpredictionsaboutchurn,pricing,fraudandmore.H2O.aiisfosteringagrassrootsmovementofsystemsengineers,datascientists,datadevelopersandpredictiveanalyststomovemachinelearningforward.ArapidlygrowingcommunityofH2Ousersisnowactiveinmorethan5000organizationsworldwide.H2O.aiisaGartnerCoolVendorinDataSciencefor2015.

Practical Guide to Machine Learning H2O White Paper | 7

©2015 H2O.ai, Inc. All Rights Reserved. 2307 Leghorn St. Mountain View, CA 94043. Information is subject to change without notice.


Top Related