Making Big Data Projects Successful - Data Science Pop-up Seattle
TRANSCRIPT
#datapopupseattle
AARON CORDOVA, CTO and Co-Founder, Koverse
aaroncordova
Making Big Data Projects Successful
koverse
UNSTRUCTURED Data Science POP-UP in Seattle
www.dominodatalab.com
Produced by Domino Data Lab
Domino’s enterprise data science platform is used by leading analytical organizations to increase productivity, enable collaboration, and publish
models into production faster.
Keys to making successful big data projects repeatable
© Koverse | Company Confidential
Intro
Aaron Cordova
CTO, co-founder at Koverse Inc.
Built successful big data systems for DOD, Intelligence, Finance
Big Data Projects
How it tends to be
How it should be
Big Data Projects
Interesting part
More propellant
Support infrastructure
Propellant
Launch platform
Utilities
Step 1: Import
Bring the data to the data scientist
From where?
Step 1: Security
Sensitive data requires access controls
Using more than 1 dataset requires fine-grained access controls
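
One way to picture fine-grained access control across several datasets is to attach labels to each record and show a record only to readers who hold every label. The sketch below is illustrative only: the `Record` class, the label names, and the rule that a combined record inherits the labels of both inputs are assumptions, not something described in the talk.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """A record plus the labels a reader must hold to see it (hypothetical model)."""
    fields: dict
    labels: frozenset = frozenset()


def combine(a, b):
    """Merge two records; the result requires every label either input required."""
    return Record({**a.fields, **b.fields}, a.labels | b.labels)


def readable(records, authorizations):
    """Return only the records whose labels are all held by the reader."""
    auths = frozenset(authorizations)
    return [r for r in records if r.labels <= auths]


# Two datasets with different sensitivity, joined on customer_id.
customer = Record({"customer_id": 17, "name": "Ada"}, frozenset({"pii"}))
payment = Record({"customer_id": 17, "balance": 120.0}, frozenset({"finance"}))
joined = combine(customer, payment)
```

A reader holding only the `pii` label would see `customer` but neither `payment` nor `joined`, which is why per-dataset permissions alone stop being enough once datasets are combined.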
Step 2: Data Assumptions
Need to find out:
1. Structure of the data (field names, types)
2. Data semantics (is CustomerID in dataset A equal to CID from dataset B?)
Initial assumptions are almost certainly wrong.
Need to see actual data samples.
Go back, get more datasets; normalize, clean up data
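
Both questions can be checked against actual samples rather than assumed. The following is a minimal sketch: the helper names `infer_structure` and `key_overlap` and the sample rows are hypothetical, and the overlap ratio is only a rough hint about whether two fields mean the same thing.

```python
from collections import defaultdict


def infer_structure(samples):
    """Map each field name to the set of Python type names seen in the samples."""
    seen = defaultdict(set)
    for row in samples:
        for name, value in row.items():
            seen[name].add(type(value).__name__)
    return dict(seen)


def key_overlap(rows_a, field_a, rows_b, field_b):
    """Fraction of distinct field_a values that also occur as field_b values."""
    a = {row[field_a] for row in rows_a if field_a in row}
    b = {row[field_b] for row in rows_b if field_b in row}
    return len(a & b) / len(a) if a else 0.0


# Made-up samples: dataset A stores integer IDs, dataset B stores them as strings.
dataset_a = [{"CustomerID": 1, "name": "Ada"}, {"CustomerID": 2, "name": None}]
dataset_b = [{"CID": "1", "total": 9.5}, {"CID": "3", "total": 4.0}]

raw_overlap = key_overlap(dataset_a, "CustomerID", dataset_b, "CID")

# Normalize CID to int, then compare again.
normalized_b = [{**row, "CID": int(row["CID"])} for row in dataset_b]
clean_overlap = key_overlap(dataset_a, "CustomerID", normalized_b, "CID")
```

Here the raw overlap is zero only because of a type mismatch, which is the kind of wrong initial assumption that looking at samples and normalizing is meant to surface.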
Step 2: Data Assumptions
If primary analytical system can’t handle discovery, need another system for sampling, viewing, cleaning up, normalizing data
Step 3: Interesting Part!
Run analytics!
Need some sort of system for running analytics:
R
Python
Spark
MLlib
MapReduce
SAS
Step 4: Delivering Results
Reports are relatively easy to deliver – run once a day, small output
Some results are large, need to stay in the system
Indexing makes results searchable for a large number of consumers
Results can be embedded in interactive decision-making apps with an API
Step 4: Delivering Results
Find some system for indexing analytical results – possibly copying data, address consistency issues
Apply some solution for making results available via an API so they can be embedded in applications
… Then build applications
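
As a rough illustration of indexing results and exposing them through a lookup call, the sketch below builds a small inverted index that stores row positions rather than copies of the rows. The function names, field names, and sample results are assumptions made for the example; a real deployment would use a proper search or database index.

```python
from collections import defaultdict


def build_index(results, fields):
    """Map (field, value) pairs to the positions of the matching result rows."""
    index = defaultdict(list)
    for position, row in enumerate(results):
        for field in fields:
            if field in row:
                index[(field, row[field])].append(position)
    return index


def lookup(results, index, field, value):
    """Return matching rows; an API endpoint could wrap a call like this."""
    return [results[i] for i in index.get((field, value), [])]


# Made-up analytical results.
results = [
    {"customer": "acme", "risk": "high", "score": 0.91},
    {"customer": "globex", "risk": "low", "score": 0.12},
    {"customer": "initech", "risk": "high", "score": 0.77},
]
index = build_index(results, ["customer", "risk"])
high_risk = lookup(results, index, "risk", "high")
```

Because the index holds positions into `results` instead of duplicated rows, there is only one copy of the data to keep consistent.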
Scalability
Even if original datasets are small, multiple datasets need to be co-located
Original data is transformed into derivatives
Indexed data requires more space
Scalability becomes a problem eventually
Scalability
Migrate original solution to a scalable system.
Rewrite analytics, data flow for the scalable system.
Repeatability
System works! Now what?
As new data arrives, the whole process needs to be re-run, or run on all the available data
If any assumptions or structure of the data change, need to be able to re-process data
Live updates need to be scheduled, resource demands need to be balanced
Oh yeah, and go back and address security … if possible
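
The difference between handling newly arrived data and re-processing after an assumption changes can be sketched as follows. The `Dataflow` class and its method names are hypothetical, and scheduling and resource balancing are left out entirely; the point is only that keeping the raw records makes a full re-run possible.

```python
class Dataflow:
    """Keeps the raw records so the derived output can always be rebuilt."""

    def __init__(self, transform):
        self.transform = transform
        self.raw = []
        self.derived = []

    def ingest(self, batch):
        """New data arrived: transform only the new records."""
        batch = list(batch)
        self.raw.extend(batch)
        self.derived.extend(self.transform(record) for record in batch)

    def replace_transform(self, transform):
        """Assumptions changed: re-process everything seen so far."""
        self.transform = transform
        self.derived = [transform(record) for record in self.raw]


# First assumption: "amt" can be passed through as-is.
flow = Dataflow(lambda r: {"id": r["id"], "amount": r["amt"]})
flow.ingest([{"id": 1, "amt": "10"}])
flow.ingest([{"id": 2, "amt": "5"}])

# Later it turns out amounts arrive as strings, so re-process with a conversion.
flow.replace_transform(lambda r: {"id": r["id"], "amount": float(r["amt"])})
```

After the transform is replaced, `flow.derived` reflects the corrected logic for every record, not only for data that arrives afterwards.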
Working backwards
Want to provide value from data but first have to:
Address data discovery, security, scalability, repeatability…
Avoid Yak Shaving
Recommended approach
1. Start with scalable technologies
2. Build in security from the start
3. Admit that data is messy, make it possible to address data quality issues within the system
4. Integrate with whatever analytical tools data scientists want to use
5. Integrate indexing and search into the system, avoid copying data
6. Allow for prototyping new data flows, analytics, apps in production system. Going live is a matter of configuration, not a rewrite
Recommended approach
Go from 2-3 successful projects per year to 20-30