PRACTICAL GUIDE TO MACHINE LEARNING - H2O university.h2o.ai/.../practical-guide-to-machine-

Post on 08-May-2018

WHITE PAPER
PRACTICAL GUIDE TO MACHINE LEARNING

MACHINE LEARNING: AN OVERVIEW

You may have heard how companies like Google and Facebook use machine learning to drive cars, recognize human speech and classify images. Very cool, you think. But how does that relate to my business?

Consider how these companies use machine learning today:

- A payments processing company detects fraud hidden among more than a billion transactions, and it does so in real time, reducing losses by $1 million per month;
- An auto insurer links losses from insurance claims to detailed geospatial data, enabling it to accurately predict the business impact of severe weather events;
- Working with data produced by vehicle telematics, a manufacturer uncovers patterns in operational metrics and uses them to drive proactive maintenance.

Two themes unify these success stories:

- Each application depends on large-scale data sets, in a variety of formats, and at high velocity;
- In each case, machine learning uncovers new insights and drives value.

The technical foundations of machine learning are more than fifty years old, but until recently few people outside of academia were aware of its capabilities. Machine learning requires a lot of computing power; early adopters simply lacked the infrastructure to make it cost-effective.

Several converging trends contribute to the recent surge of interest and activity:

- Moore's Law radically reduces computing costs, and massive computing power is widely available at minimal cost;
- New and innovative algorithms provide faster results;
- With experience, data scientists have accumulated theory and practical guidance to drive value.

Above all, the tsunami of data creates analytic problems that simply cannot be solved with conventional statistics. Necessity is the mother of invention: old methods of analysis no longer work in today's business environment.

MACHINE LEARNING TECHNIQUES

There are hundreds of different machine learning algorithms; a recent paper benchmarked more than 150 algorithms for classification alone. This overview covers the key techniques that data scientists use to drive value today.

Data scientists distinguish between techniques for supervised and unsupervised learning. Supervised learning techniques require prior knowledge of an outcome. For example, if we work with historical data from a marketing campaign, we can classify each impression by whether or not the prospect responded, or we can determine how much they spent. Supervised techniques provide powerful tools for prediction and classification.

Frequently, however, we do not know the ultimate outcome of an event. For example, in some cases of fraud, we may not know that a transaction is fraudulent until long after the event; in this case, rather than attempting to predict which transactions are frauds, we might want to use machine learning to identify transactions that are unusual, and flag these for further investigation. We use unsupervised learning when we do not have prior knowledge about a specific outcome, but still want to extract useful insights from the data.

The most widely used supervised learning techniques include:

- GENERALIZED LINEAR MODELS (GLM): an advanced form of linear regression that supports different probability distributions and link functions, enabling the analyst to model the data more effectively. Enhanced with a grid search, GLM is a hybrid of classical statistics and the most advanced machine learning.
- DECISION TREES: a supervised learning method that learns a set of rules that split a population into progressively smaller segments that are homogeneous with respect to the target variable.
- RANDOM FORESTS: a popular ensemble learning method that trains many decision trees, then averages across the trees to develop a prediction. This averaging process produces a more generalizable solution, and filters out random noise in the data.
- GRADIENT BOOSTING MACHINE (GBM): a method that produces a prediction model by training a sequence of decision trees, where successive trees adjust for prediction errors in previous trees.
- DEEP LEARNING: an approach that models high-level patterns in data as complex multi-layered networks. Because it is the most general way to model a problem, Deep Learning has the potential to solve the most challenging problems in machine learning.

Key techniques for unsupervised learning include:

- CLUSTERING: this technique groups objects into segments, or clusters, that are similar to one another on many metrics. Customer segmentation is an example of clustering in action. There are many different clustering algorithms; the most widely used is k-means.
- ANOMALY DETECTION: in fields like security and fraud, it is not possible to exhaustively investigate every transaction; we need to systematically flag the most unusual transactions. Deep Learning, a technique discussed previously under supervised learning, can also be used for anomaly detection.
- DIMENSION REDUCTION: as organizations capture more data, the number of possible predictors (or features) available for prediction expands rapidly. Simply identifying what data provides information value for a particular problem is a significant task. Principal Components Analysis (PCA) evaluates a set of raw features and reduces them to indices that are uncorrelated with one another.

While some machine learning techniques tend to consistently outperform others, it is rarely possible to say in advance which one will work best for a particular problem. Hence, most data scientists prefer to try many techniques and choose the best model. For this reason, high performance is essential, because it enables the data scientist to try more options and build the best possible model.

HOW TO GET STARTED

If you are interested in machine learning and wondering how to apply it in your organization, there are some concrete steps you can take.

IDENTIFY A BUSINESS PROBLEM. Identify opportunities in your business where improved predictions will have a compelling impact, in the form of increased revenues, reduced costs or some other key business driver. Possible examples include (but are not limited to): detecting and preventing fraud; detecting security risks and threats; measuring credit and default risk; and other high-impact problems. If you can't find problems like this in your business, you're not looking hard enough; every business has opportunities to improve.

CONSULT WITH YOUR ANALYTICS TEAM. You may be surprised to learn that your analysts already use machine learning; if so, that's great. If not, ask: why not? Most analysts are excited about machine learning, and actively seek out business cases where the techniques can drive value. There may be short-term barriers, however, such as a shortage of personnel, lack of software or lack of support from the IT organization. Work with your analysts to diagnose and resolve these barriers. If you do not have an analytics team, engage a consultant or analytic services provider who can help you build the capability and provide interim support. If your analytics team expresses no interest in driving business value, examine the team's leadership and incentives.

ENGAGE YOUR IT ORGANIZATION. Your IT organization plays a critical role in getting your application into production, so it's important to engage them early. IT organizations are sometimes reluctant to introduce advanced analytics into production systems, fearing that rocket scientists will tie up the system or bring it down. To address these concerns, make sure that IT helps define the technical requirements for your machine learning software. Your IT team will be very concerned about such things as Hadoop support, the ability to run in the cloud and other things that can make or break your application.

CHOOSE YOUR SOFTWARE OPTIONS. You may be told that your organization already has the software it needs to deliver your application. That's likely not true; machine learning is a rapidly developing field, with significant advances in the past year. Think of it this way: if your organization already has the software it needs to deliver your application, why isn't your application built already?

In the section below headed Machine Learning Software Requirements, we outline the most important things to look for in machine learning software. Your analysts and your IT organization will fill in details about such things as Hadoop distributions and cloud platforms.

DEFINE EVALUATION CRITERIA. Your business problem defines your measures of success. Work with your analysts and IT representatives to develop specific and measurable criteria, which should include:

- Measures of prediction success
- Runtime performance for model training and model scoring
- Scaling requirements, measured in data volume (rows and columns)
- Output requirements

If there is an existing predictive model in production, your evaluation criteria should specify the performance thresholds any new software should meet.

PLAN A TRIAL. Work with your analysts and IT team to plan a trial, or Proof of Concept (POC). If you limit the scope to open source software, as we recommend, your out-of-pocket costs will be minimal. (Commercial software vendors ordinarily do not charge for evaluation software; however, licensing costs for commercial machine learning software that scales to Big Data start at seven figures.)

MACHINE LEARNING SOFTWARE REQUIREMENTS

Software for machine learning is widely available, and organizations seeking to develop a capability in this area have many options. The following requirements should be considered when evaluating machine learning software:

- Speed
- Time to Value
- Model Accuracy
- Easy Integration
- Flexible Deployment
- Usability
- Visualization

Let's review each of these in turn.

SPEED: Time is money, and fast software makes your highly paid data scientists more productive. Practical data science is often iterative and experimental; a project may require hundreds of tests, so small differences in speed translate to dramatic improvements in efficiency. Given today's data volumes, high-performance machine learning software must run on a distributed platform, so you can spread the workload over many servers.

TIME TO VALUE: Runtime performance is just one part of total time to value. The key metric for your business is the amount of time needed to complete a project from data ingestion to deployment. In practical terms, this means that your machine learning software should integrate with popular Hadoop and cloud formats, and should export predictive models as code you can deploy anywhere in your organization.

MODEL ACCURACY: Accuracy matters, especially so when the stakes are high; for applications like fraud detection, small improvements in accuracy can produce millions of dollars in annual savings. Your machine learning software should empower your data scientists to use all of your data, rather than forcing them to work with samples.

EASY INTEGRATION: Your machine learning software must co-exist with a complex stack of Big Data software in production. Open source software is easier to deploy, modify and integrate into your production workflows. Additionally, look for machine learning software that runs on commodity hardware, and does not require specialized HPC machines or exotic hardware like GPU chips.

FLEXIBLE DEPLOYMENT: Your machine learning software should support a range of deployment options, including co-located in Hadoop or in a freestanding cluster. If cloud is part of your architecture, look for software that runs in a variety of cloud platforms, such as Amazon Web Services, Microsoft Azure and Google Cloud.

USABILITY: Your data scientists use many different software tools to perform their work, including analytic languages like R, Python and Scala; your machine learning platform should integrate easily with the tools your data scientists already use. Well-designed machine learning algorithms include time-saving features:

- Ability to treat missing data
- Ability to transform categorical data
- Regularization techniques to manage complexity
- Grid search capability for automated test and learn
- Automatic cross-validation (to avoid overlearning)

VISUALIZATION: Successful predictive modeling requires collaboration between the data scientist and business users. Your machine learning software should provide business users with tools to visually evaluate the quality and characteristics of the predictive model.

ABOUT H2O.AI

H2O is the #1 open source machine learning platform for smarter applications. H2O.ai is the Silicon Valley software company supporting and developing H2O. Leading insurance, healthcare and financial services companies are using H2O to make smarter predictions about churn, pricing, fraud and more. H2O.ai is fostering a grassroots movement of systems engineers, data scientists, data developers and predictive analysts to move machine learning forward. A rapidly growing community of H2O users is now active in more than 5000 organizations worldwide. H2O.ai is a Gartner Cool Vendor in Data Science for 2015.

2015 H2O.ai, Inc. All Rights Reserved. 2307 Leghorn St. Mountain View, CA 94043. Information is subject to change without notice.
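
For readers who want to see how three of the time-saving features named under USABILITY fit together (regularization, grid search and cross-validation), the following is a minimal sketch in plain Python. It is not H2O's API: the function names (fit_ridge, cross_val_mse, grid_search), the single-feature ridge model and the synthetic data are all assumptions chosen to keep the example self-contained. Each candidate penalty is scored on held-out folds, and the penalty with the lowest average error is selected.

```python
import random


def fit_ridge(xs, ys, lam):
    """Closed-form ridge fit for y ~ w * x (one feature, no intercept).

    lam is the regularization penalty; lam = 0 is ordinary least squares.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def mse(w, xs, ys):
    """Mean squared prediction error of the model y = w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_mse(xs, ys, lam, k=5):
    """Average held-out error over k folds for one penalty value."""
    scores = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        w = fit_ridge(train_x, train_y, lam)
        scores.append(mse(w, [xs[i] for i in held_out],
                          [ys[i] for i in held_out]))
    return sum(scores) / len(scores)


def grid_search(xs, ys, lambdas, k=5):
    """Score every candidate penalty and return the best one with all scores."""
    results = {lam: cross_val_mse(xs, ys, lam, k) for lam in lambdas}
    best = min(results, key=results.get)
    return best, results


# Hypothetical synthetic data: y = 3x plus a little noise. The rows are
# generated in random order, so contiguous folds are acceptable here;
# real data should usually be shuffled before splitting.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(50)]
ys = [3.0 * x + rng.gauss(0, 0.1) for x in xs]

lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]
best_lam, results = grid_search(xs, ys, lambdas)
for lam in lambdas:
    print(f"lambda={lam:<6} cross-validated MSE={results[lam]:.4f}")
print("selected lambda:", best_lam)
```

Scoring on held-out folds rather than on the training data is what guards against overlearning: a penalty that merely memorizes the training rows gains nothing when it is judged on rows it has not seen.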
