practical guide to machine learning - h2ouniversity.h2o.ai/.../practical-guide-to-machine- learning...


Post on 08-May-2018


  • WHITE PAPER

    PRACTICAL GUIDE TO MACHINE LEARNING

  • MACHINE LEARNING: AN OVERVIEW

    You may have heard how companies like Google and Facebook use machine learning to drive cars, recognize human speech and classify images. Very cool, you think. But how does that relate to my business?

    Consider how these companies use machine learning today:

    A payments processing company detects fraud hidden among more than a billion transactions, and it does so in real time, reducing losses by $1 million per month;

    An auto insurer links losses from insurance claims to detailed geospatial data, enabling it to accurately predict the business impact of severe weather events;

    Working with data produced by vehicle telematics, a manufacturer uncovers patterns in operational metrics and uses them to drive proactive maintenance.

    Two themes unify these success stories:

    Each application depends on large-scale data sets, in a variety of formats, and at high velocity;

    In each case, machine learning uncovers new insights and drives value.

    The technical foundations of machine learning are more than fifty years old, but until recently few people outside of academia were aware of its capabilities. Machine learning requires a lot of computing power; early adopters simply lacked the infrastructure to make it cost-effective.

    Several converging trends contribute to the recent surge of interest and activity:

    Moore's Law radically reduces computing costs, and massive computing power is widely available at minimal cost;

    New and innovative algorithms provide faster results;

    With experience, data scientists have accumulated theory and practical guidance to drive value.

    Above all, the tsunami of data creates analytic problems that simply cannot be solved with conventional statistics. Necessity is the mother of invention: old methods of analysis no longer work in today's business environment.

    Practical Guide to Machine Learning H2O White Paper | 2

  • MACHINE LEARNING TECHNIQUES

    There are hundreds of different machine learning algorithms; a recent paper benchmarked more than 150 algorithms for classification alone. This overview covers the key techniques that data scientists use to drive value today.

    Data scientists distinguish between techniques for supervised and unsupervised learning. Supervised learning techniques require prior knowledge of an outcome. For example, if we work with historical data from a marketing campaign, we can classify each impression by whether or not the prospect responded, or we can determine how much they spent. Supervised techniques provide powerful tools for prediction and classification.

    Frequently, however, we do not know the ultimate outcome of an event. For example, in some cases of fraud, we may not know that a transaction is fraudulent until long after the event; in this case, rather than attempting to predict which transactions are frauds, we might want to use machine learning to identify transactions that are unusual, and flag these for further investigation. We use unsupervised learning when we do not have prior knowledge about a specific outcome, but still want to extract useful insights from the data.
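    The idea of flagging unusual transactions without a known outcome can be illustrated with a very small sketch. It is not a method from this paper: it assumes a single numeric field (transaction amount) and uses the modified z-score, a simple robust statistic, with the conventional 3.5 cutoff. The function name `flag_unusual` and the sample amounts are hypothetical.

```python
from statistics import median


def flag_unusual(amounts, threshold=3.5):
    """Return the positions of amounts whose modified z-score exceeds the threshold."""
    med = median(amounts)
    # Median absolute deviation: a measure of spread that outliers barely affect.
    mad = median(abs(x - med) for x in amounts)
    if mad == 0:
        return []  # no spread to measure against, so nothing can be flagged
    return [
        i for i, x in enumerate(amounts)
        if 0.6745 * abs(x - med) / mad > threshold
    ]


# Hypothetical transaction amounts; no fraud labels are needed.
amounts = [42.0, 38.5, 45.1, 40.3, 39.9, 41.7, 980.0, 43.2]
flagged = flag_unusual(amounts)
print(flagged)
```

    The flagged positions are candidates for investigation, not confirmed frauds; a production system would score many fields at once.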

    The most widely used supervised learning techniques include:

    GENERALIZED LINEAR MODELS (GLM): an advanced form of linear regression that supports different probability distributions and link functions, enabling the analyst to model the data more effectively. Enhanced with a grid search, GLM is a hybrid of classical statistics and the most advanced machine learning.

    DECISION TREES: a supervised learning method that learns a set of rules that split a population into progressively smaller segments that are homogeneous with respect to the target variable.

    RANDOM FORESTS: a popular ensemble learning method that trains many decision trees, then averages across the trees to develop a prediction. This averaging process produces a more generalizable solution, and filters out random noise in the data.

    GRADIENT BOOSTING MACHINE (GBM): a method that produces a prediction model by training a sequence of decision trees, where successive trees adjust for prediction errors in previous trees.

    DEEP LEARNING: an approach that models high-level patterns in data as complex multi-layered networks. Because it is the most general way to model a problem, Deep Learning has the potential to solve the most challenging problems in machine learning.
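    To make the GBM description concrete, here is a minimal sketch of trees trained in sequence, each one fitted to the errors left by the trees before it. It is an illustration under simplifying assumptions, not H2O's implementation: one numeric predictor, squared-error loss, and one-split trees ("stumps"). All names, the learning rate and the data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the one-split tree that best fits the residuals (least squared error)."""
    best = None
    for split in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < split]
        right = [r for x, r in zip(xs, residuals) if x >= split]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, split, left_mean, right_mean)
    return best[1:]  # (split, left_mean, right_mean)


def predict(base, stumps, rate, x):
    """Start from the base value and add each stump's scaled correction."""
    total = base
    for split, left_mean, right_mean in stumps:
        total += rate * (left_mean if x < split else right_mean)
    return total


def boost(xs, ys, rounds=20, rate=0.5):
    """Train stumps in sequence; each one is fitted to the current prediction errors."""
    base = sum(ys) / len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - predict(base, stumps, rate, x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return base, stumps


def mse(base, stumps, rate, xs, ys):
    """Mean squared error of the boosted model on the given data."""
    return sum((predict(base, stumps, rate, x) - y) ** 2
               for x, y in zip(xs, ys)) / len(xs)


# Hypothetical training data with a step-like pattern.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]
base, stumps = boost(xs, ys, rounds=20, rate=0.5)
error_before = mse(base, [], 0.5, xs, ys)      # predicting the mean only
error_after = mse(base, stumps, 0.5, xs, ys)   # after 20 corrective stumps
print(error_before, error_after)
```

    A random forest differs in that its trees are trained independently and averaged, rather than trained in sequence on one another's errors.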

    Key techniques for unsupervised learning include:

    CLUSTERING: this technique groups objects into segments, or clusters, that are similar to one another on many metrics. Customer segmentation is an example of clustering in action. There are many different clustering algorithms; the most widely used is k-means.

    ANOMALY DETECTION: in fields like security and fraud, it is not possible to exhaustively investigate every transaction; we need to systematically flag the most unusual transactions. Deep Learning, a technique discussed previously under supervised learning, can also be used for anomaly detection.


  • DIMENSION REDUCTION: as organizations capture more data, the number of possible predictors (or features) available for prediction expands rapidly. Simply identifying what data provides information value for a particular problem is a significant task. Principal Components Analysis (PCA) evaluates a set of raw features and reduces them to indices that are uncorrelated with one another.
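    Of the unsupervised techniques above, k-means is simple enough to sketch in a few lines. The version below is a teaching sketch, not a production algorithm: it assumes numeric features on comparable scales, uses the first k points as starting centers (real implementations choose starting points more carefully), and runs a fixed number of iterations. The customer data and the `kmeans` function are hypothetical.

```python
from math import dist


def kmeans(points, k, iterations=10):
    """Group points into k clusters by repeatedly moving centers to cluster means."""
    centers = list(points[:k])  # naive start: the first k points (an assumption)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    labels = [min(range(k), key=lambda i: dist(p, centers[i])) for p in points]
    return centers, labels


# Hypothetical customers: (annual spend in $1,000s, store visits per month).
customers = [(1.0, 2.0), (9.0, 8.5), (1.2, 1.8), (0.8, 2.2), (8.8, 9.1), (9.3, 8.7)]
centers, labels = kmeans(customers, k=2)
print(labels)
```

    Each label identifies a customer segment; interpreting what a segment means for the business remains the analyst's job.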

    While some machine learning techniques tend to consistently outperform others, it is rarely possible to say in advance which one will work best for a particular problem. Hence, most data scientists prefer to try many techniques and choose the best model. For this reason, high performance is essential, because it enables the data scientist to try more options and build the best possible model.
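    The "try many techniques and choose the best model" workflow can be sketched as a loop over candidate models scored on data held out from training. The three candidates below (a mean baseline, a straight-line fit and a nearest-neighbor lookup) are deliberately tiny stand-ins chosen for illustration; they are not the techniques listed above, and the data is hypothetical.

```python
from statistics import mean


def fit_mean(xs, ys):
    """Baseline: always predict the average training outcome."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Least-squares straight line through the training data."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def fit_nearest(xs, ys):
    """Predict the outcome of the closest training example."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def holdout_mse(model, xs, ys):
    """Mean squared error on data the model has not seen."""
    return mean((model(x) - y) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, split into a training set and a held-out set.
train_x, train_y = [1, 2, 4, 5, 7, 8], [2.1, 3.9, 7.8, 10.1, 14.1, 15.9]
test_x, test_y = [3, 6], [6.2, 12.0]

candidates = {"mean": fit_mean, "linear": fit_line, "nearest": fit_nearest}
scores = {
    name: holdout_mse(fit(train_x, train_y), test_x, test_y)
    for name, fit in candidates.items()
}
best_name = min(scores, key=scores.get)
print(best_name, scores)
```

    The faster each candidate trains, the more candidates and settings fit into the same working day, which is the performance argument made above.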

    HOW TO GET STARTED

    If you are interested in machine learning and wondering how to apply it in your organization, there are some concrete steps you can take.

    IDENTIFY A BUSINESS PROBLEM. Identify opportunities in your business where improved predictions will have a compelling impact, in the form of increased revenues, reduced costs or some other key business driver. Possible examples include (but are not limited to): detecting and preventing fraud; detecting security risks and threats; measuring credit and default risk; and other high-impact problems. If you can't find problems like this in your business, you're not looking hard enough; every business has opportunities to improve.

    CONSULT WITH YOUR ANALYTICS TEAM. You may be surprised to learn that your analysts already use machine learning; if so, that's great. If not, ask: why not? Most analysts are excited about machine learning, and actively seek out business cases where the techniques can drive value. There may be short-term barriers, however, such as a shortage of personnel, lack of software or lack of support from the IT organization. Work with your analysts to diagnose and resolve these barriers.

    If you do not have an analytics team, engage a consultant or analytic services provider who can help you build the capability and provide interim support. If your analytics team expresses no interest in driving business value, examine the team's leadership and incentives.

    ENGAGE YOUR IT ORGANIZATION. Your IT organization plays a critical role in getting your application into production, so it's important to engage them early. IT organizations are sometimes reluctant to introduce advanced analytics into production systems, fearing that rocket scientists will tie up the system or bring it down. To address these concerns, make sure that IT helps define the technical requirements for your machine learning software. Your IT team will be very concerned about such things as Hadoop support, the ability to run in the cloud and other things that can make or break your application.

    CHOOSE YOUR SOFTWARE OPTIONS. You may be told that your organization already has the software it needs to deliver your application. That's likely not true; machine learning is a rapidly developing field, with significant advances in the past year. Think of it this way: if your organization already has the software it needs to deliver your application, why isn't your application built already?


  • In the section below headed Machine Learning Software Requirements, we outline the most important things to look for in machine learning software. Your analysts and your IT organization will fill in details about such things as Hadoop distributions and cloud platforms.

    DEFINE EVALUATION CRITERIA. Your business problem defines your measures of success. Work with your analysts and IT representatives to develop specific and measurable criteria, which should include:

    Measures of prediction success

    Runtime performance for model training and model scoring

    Scaling requirements, measured in data volume (rows and columns)

    Output requirements

    If there is an existing predictive model in production, your evaluation criteria should specify the performance thresholds any new software should meet.

    PLAN A TRIAL. Work with your analysts and IT team to plan a trial, or Proof of Concept (POC). If you limit the scope to open source software, as we recommend, your out-of-pocket costs will be minimal. (Commercial software vendors ordinarily do not charge for evaluation software; however, licensing costs for commercial machine learning software that scales to Big Data start at seven figures.)

    MACHINE LEARNING SOFTWARE REQUIREMENTS

    Software for machine learning is widely available, and organizations seeking to develop a capability in this area have many options. The following requirements should be considered when evaluating machine learning software:

    Speed

    Time to Value

    Model Accuracy

    Easy Integration

    Flexible Deployment

    Usability

    Visualization

    Let's review each of these in turn.

    SPEED: Time is money, and fast software makes your highly paid data scientists more productive. Practical data science is often iterative and experimental; a project may require hundreds of tests, so small differences in speed translate to dramatic improvements in efficiency. Given today's data volumes, high-performance machine learning software must run on a distributed platform, so you can spread the workload over many servers.


  • TIME TO VALUE: Runtime performance is just one part of total time to value. The key metric for your business is the amount of time needed to complete a project from data ingestion to deployment. In practical terms, this means that your machine learning software should integrate with popular Hadoop and cloud formats, and should export predictive models as code you can deploy anywhere in your organization.

    MODEL ACCURACY: Accuracy matters, especially so when the stakes are high; for applications like fraud detec
