Big Data for Managers: From Hadoop to Streaming and Beyond
www.scispike.com Copyright © SciSpike 2016
Dr. Vladimir Bacvanski
§ Founder of SciSpike, a development, consulting, and training firm
§ Passionate about software and data
§ PhD in computer science, RWTH Aachen, Germany
§ Architect, consultant, mentor
§ Custom development: scalable Web and IoT systems
§ Training and mentoring in Big Data, Scala, node.js, software architecture
@OnSoftware
https://www.linkedin.com/in/vladimirbacvanski
Problems with Relational Stores
§ Data that does not naturally fit into tables → Impedance mismatch
§ Development time often too long
§ Dealing with unstructured data
§ Performance problems
§ Difficult to run on clusters
§ Cost
Structured and Unstructured Data Sources
Structured Data Sources
• Existing databases
• ERP/CRM/BI systems
• Inventory
• Supply chain
Unstructured Data Sources
• Server logs
• Search engine logs
• Browsing logs
• E-Commerce records
• Social media
• Voice
• Video
• Sensor data
NoSQL Impact
[Figure: cost/performance as data volume grows from 1M to 1B, 1T, 1Q, … HUGE!!! (x1000 at each step); further labels: Disks, Processors (x1000)]
Relational Database
Big Data + NoSQL
Tomorrow - Volume is out of reach
Today - Doable, but expensive and slow
Stabilize Cost & Increase Performance
Enable Unlimited Volume Growth
Scale Up vs. Scale Out
[Figure: two plots of cost against capability, one for Scale Up and one for Scale Out]
A Common Pattern for Processing Large Data
Load a large set of records onto a set of machines
Extract something interesting from each record ("Map")
Shuffle and sort intermediate results
Aggregate intermediate results ("Reduce")
Store end result
Intermediate results: Key/Value pairs
Two Key Aspects of Hadoop
§ MapReduce framework
– How Hadoop understands and assigns work to the nodes (machines)
§ Hadoop Distributed File System = HDFS
– Where Hadoop stores data
– A file system that spans all the nodes in a Hadoop cluster
– It links together the file systems on many local nodes to make them into one big file system
MapReduce Example: Word Count
§ Word Count is the "Hello World" of Big Data
– You will see various technologies implementing it
– A good first step to compare the expressiveness of Big Data tools
Input (pets.txt): dog cat bird dog cat bird dog dog cat
Input splits:
dog cat bird
dog cat bird
dog dog cat
Map:
dog, 1 cat, 1 bird, 1
dog, 1 cat, 1 bird, 1
dog, 1 dog, 1 cat, 1
Shuffle:
dog, 1 dog, 1 dog, 1 dog, 1
cat, 1 cat, 1 cat, 1
bird, 1 bird, 1
Reduce:
dog, 4
cat, 3
bird, 2
Output (pet_freq.txt): dog, 4 cat, 3 bird, 2
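The Word Count walk-through can be sketched in a few lines of plain Python. This is only an in-memory illustration of the map, shuffle, and reduce stages, not Hadoop code; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are made up for this sketch.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run the three stages one after another on a list of input splits."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(word, counts) for word, counts in groups.items())


# The three input splits of pets.txt from the slide.
splits = ["dog cat bird", "dog cat bird", "dog dog cat"]
pet_freq = word_count(splits)
print(pet_freq)  # {'dog': 4, 'cat': 3, 'bird': 2}
```

In a real cluster the map calls run in parallel on different nodes and the shuffle moves data over the network; here everything runs in one process so that the data flow is easy to follow.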
The MapReduce Programming Model
§ "Map" step:
– Input split into pieces
– Worker nodes process individual pieces in parallel (under global control of the JobTracker node)
– Each worker node stores its result in its local file system where a reducer is able to access it
§ "Reduce" step:
– Data is aggregated ("reduced" from the map steps) by worker nodes (under control of the JobTracker)
– Multiple reduce tasks can parallelize the aggregation
Separation of Work
Programmers
• Map
• Reduce
Framework
• Deals with fault tolerance
• Assigns workers to map and reduce tasks
• Moves processes to data
• Shuffles and sorts intermediate data
• Deals with errors
How To Create MapReduce Jobs
§ Java API
– Low level, very flexible
– Time consuming development
§ Streaming API
– A simple, productive model for Python and Ruby
§ Hive
– Open source language / Apache sub-project
– Provides a SQL-like interface to Hadoop
§ Pig
– Dataflow language / Apache sub-project
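To give a feel for the Streaming API style, here is a minimal sketch in Python. It assumes the usual streaming convention that a mapper and a reducer exchange text records of the form key, tab, value, and that the reducer receives its records sorted by key. The function names `mapper` and `reducer` and the local test run are illustrative only; in an actual job each function would live in its own script that reads standard input and prints to standard output.

```python
from itertools import groupby


def mapper(lines):
    """Turn raw text lines into tab-separated "word<TAB>1" records."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum the counts of consecutive records that share a key.

    Assumes the records arrive sorted by key, which is what the
    shuffle-and-sort step provides to a reducer.
    """
    parsed = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local dry run: sorted() stands in for the framework's shuffle and sort.
local_run = list(reducer(sorted(mapper(["dog cat bird", "dog dog cat"]))))
print(local_run)  # ['bird\t1', 'cat\t2', 'dog\t3']
```

The appeal of this model is that the job logic is ordinary line-oriented text processing, so it can be tested on a laptop before it is submitted to a cluster.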
The Big Picture: NoSQL + Hadoop in Applications
[Figure: applications backed by several kinds of stores]
• Columnar: price updates, logs
• Document: product info
• Graph: customer/agent relationships
• RDB: XA data
• Hadoop: oper. analytics, price analytics
• Key/Value: session data
Streaming: A New Paradigm
§ Conventional processing: static data
Data Queries Results
§ Real-time processing: streaming data
Queries Data Results
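The contrast between the two paradigms can be sketched as follows. In the conventional case a query runs once over data that is already stored; in the streaming case a standing query is in place first and produces an updated result as each new record arrives. The event format and the function names `query_static` and `query_stream` are assumptions made for this illustration.

```python
from collections import Counter


def query_static(stored_events):
    """Conventional processing: run one query over data already at rest."""
    return Counter(event["page"] for event in stored_events)


def query_stream(event_stream):
    """Streaming: a standing query updates its result as each event arrives."""
    running = Counter()
    for event in event_stream:
        running[event["page"]] += 1
        yield dict(running)  # a fresh result after every event


events = [{"page": "home"}, {"page": "cart"}, {"page": "home"}]

static_result = query_static(events)               # one answer, after the fact
stream_results = list(query_stream(iter(events)))  # one answer per event

print(static_result)
print(stream_results)
```

The streaming version never needs the whole data set at once, which is what makes low-latency results possible on unbounded input.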
Common Streaming Applications
§ Personalization
§ Search
§ Revenue optimization
§ User events
§ Content feeds
§ Log processing
§ Monitoring
§ Recommendations
§ Ads
§ Notable users:
– Twitter
– Yahoo
– Spotify
– Cisco
– Flickr
– Weather Channel
Beyond Hadoop: Spark & Flink
[Figure: processing engines MapReduce, Tez, Spark, and Flink]
Apache Spark
§ Important Features
– In-memory data
– Resilient Distributed Datasets (RDDs)
• Datasets can rebuild themselves if failure occurs
– Rich set of operators
§ Efficient:
– Up to 10x (on disk) to 100x (in memory) faster than Hadoop MR
– 2 to 5 times less code (rich APIs in Scala/Java/Python)
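The idea that a dataset can rebuild itself after a failure can be illustrated with a toy class that remembers the steps used to create it and replays them when its in-memory copy is lost. This is a single-process sketch of the concept only and does not use the Spark API; `ToyDataset` and `lose_cache` are hypothetical names.

```python
class ToyDataset:
    """Toy stand-in for a resilient dataset: it remembers how it was built."""

    def __init__(self, source, steps=()):
        self._source = source       # callable that re-reads the input
        self._steps = tuple(steps)  # recorded transformations to replay
        self._cache = None          # in-memory copy, which may be lost

    def map(self, fn):
        return ToyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        return ToyDataset(self._source, self._steps + (("filter", fn),))

    def collect(self):
        if self._cache is None:     # first use, or cache lost: rebuild
            data = list(self._source())
            for kind, fn in self._steps:
                if kind == "map":
                    data = [fn(x) for x in data]
                else:
                    data = [x for x in data if fn(x)]
            self._cache = data
        return self._cache

    def lose_cache(self):
        """Simulate a failure that wipes the in-memory copy."""
        self._cache = None


odd_squares = (
    ToyDataset(lambda: range(1, 6))
    .map(lambda x: x * x)
    .filter(lambda x: x % 2 == 1)
)
first = odd_squares.collect()
odd_squares.lose_cache()
rebuilt = odd_squares.collect()
print(first, rebuilt)  # [1, 9, 25] [1, 9, 25]
```

Because the recipe is kept alongside the data, nothing has to be replicated up front; lost results are recomputed from the original input only when they are needed again.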
Spark Architecture
§ A powerful set of tools
§ Beyond traditional Hadoop
Source: http://spark.apache.org
Data Sharing in Apache Spark
[Figure: HDFS → Iteration 1 → Result 1 (held in cluster memory) → Iteration 2 → Result 2 (held in cluster memory); Query 1, Query 2]
Apache Flink
§ Execution:
– Programs compiled into an execution plan
– Plan is optimized
– Executed
§ Design goals:
– High performance
– Hybrid batch and streaming runtime
– Simplicity for the developer
– Rich libraries
– Integration with many systems
Apache Flink Components
§ Integration with Hadoop YARN, MapReduce, HBase, Cassandra, Kafka, …
§ Execution engine for Apache Beam (Google Dataflow)
Flink Optimization and Execution
§ Optimizer selects an execution plan
§ Similar to what we have in relational databases
§ Optimal plan depends on the size of the input files
§ Run as standalone or on top of Hadoop
§ Integration with many Hadoop technologies
Flink & Spark: The Advantages and Outlook
§ Less IO overhead than conventional Hadoop
§ Caching
§ Iterative algorithms
§ Unifying batch and stream computing
§ Scala as a natural, expressive language for Big Data
– Other languages: Python, Java, R
§ Beware of less mature components
Typical NoSQL Systems
§ Non-relational
§ Distributed
§ Horizontally scalable
§ No need for a fixed schema
§ Several established players
§ Systems are specialized
NoSQL Stores and Their Categories
§ Choose the store that is the best match for your application
§ It is fine to use several different stores
– "Polyglot persistence"
[Figure: store categories: Key-Value, Column-Family, Document-Oriented, Graph DB]
NoSQL Stores: Scale vs. Complexity of Data
[Figure: Key-Value, Column-Family, Document-Oriented, and Graph DB stores plotted by scalability against complexity of data, with a region labeled "needs of most applications"]
Key-Value Stores
§ Key → Value mapping
§ Large, persistent Map ("hash table")
– Values could be lists and hashes
§ Easy to use
§ Scale very well
§ Data model may be too simple for most applications
§ Systems:
– Redis, Riak, Memcached, Amazon DynamoDB, Aerospike, FoundationDB
§ Use when data model is very simple and scalability essential
Typical Use Cases (Key-Value Stores)
§ The data model is very simple!
– Actual data can be JSON
§ Session data
§ User preferences and profiles
§ Shopping cart
§ If another NoSQL store is good enough, you may want to skip this and let a Column or Document store handle it
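A session-data use case can be sketched with a small in-memory stand-in for a key-value store. The store only knows keys and opaque values; here the values are JSON strings, as the slide suggests. `ToyKeyValueStore` and the `session:42` key format are assumptions for this illustration and do not correspond to any particular product's API.

```python
import json


class ToyKeyValueStore:
    """In-memory stand-in for a key-value store: opaque values found by key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = json.dumps(value)  # the store only sees a string

    def get(self, key, default=None):
        raw = self._data.get(key)
        return default if raw is None else json.loads(raw)

    def delete(self, key):
        self._data.pop(key, None)


store = ToyKeyValueStore()
store.put("session:42", {"user": "alice", "cart": ["book", "pen"]})

# Updating means: read the whole value, change it, write it back.
session = store.get("session:42")
session["cart"].append("lamp")
store.put("session:42", session)

print(store.get("session:42"))
```

Note that the store cannot answer a question such as "which sessions contain a lamp" without reading every value, which is the sense in which the data model may be too simple for many applications.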
Column-Family
§ "Column-family": similar to a table
– Table is sparse
§ Key → (Column : Value)*
§ Columns have names
§ Can be indexed
§ Can store complex data
– Denormalize!
§ Systems:
– Google BigTable, HBase, Cassandra, Amazon SimpleDB, Hypertable
§ Use when scalability is essential
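The mapping Key → (Column : Value)* and the notion of a sparse table can be sketched with nested dictionaries: every row key has its own set of named columns, and a column that a row does not use simply is not stored. `ToyColumnFamily` and the sample rows are hypothetical and only illustrate the data model.

```python
from collections import defaultdict


class ToyColumnFamily:
    """Sparse table: each row key maps to its own set of named columns."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def insert(self, row_key, columns):
        self._rows[row_key].update(columns)

    def get(self, row_key, column=None):
        row = self._rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)

    def columns_used(self):
        """All column names that appear in at least one row."""
        return sorted({name for row in self._rows.values() for name in row})


users = ToyColumnFamily()
users.insert("alice", {"email": "alice@example.com", "city": "Aachen"})
users.insert("bob", {"email": "bob@example.com", "last_login": "2016-05-01"})

print(users.get("alice", "city"))  # Aachen
print(users.get("bob", "city"))    # None: bob's row has no such column
print(users.columns_used())        # ['city', 'email', 'last_login']
```

Rows do not share a fixed schema, so adding a new column for one row costs nothing for the others.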
Typical Use Cases (Column-Family)
§ High insert volume: logging
§ Real-time updates
§ Content management
§ Expiring content
§ Cross-datacenter replication
§ MapReduce analytics over stored data
§ You don't need conventional (ACID) transactions
Document Stores
§ JSON, BSON, XML
§ No schema
§ Indexes improve performance
§ Easy transition from RDBMS
§ Systems
– MongoDB, CouchDB, Couchbase
§ Use when data is in semi-structured form
§ Often seen in new Web applications
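The two points "no schema" and "indexes improve performance" can be sketched together: documents with different fields live side by side, and an index on one field lets a lookup avoid scanning every document. `ToyDocumentStore` is a hypothetical in-memory illustration (inserts only, no updates) and not the API of any listed system.

```python
from collections import defaultdict


class ToyDocumentStore:
    """Schema-free documents plus a single secondary index."""

    def __init__(self, indexed_field):
        self._docs = {}
        self._indexed_field = indexed_field
        self._index = defaultdict(set)  # field value -> document ids

    def insert(self, doc_id, doc):
        self._docs[doc_id] = doc
        if self._indexed_field in doc:
            self._index[doc[self._indexed_field]].add(doc_id)

    def find_by_index(self, value):
        """Indexed lookup: touches only the matching documents."""
        return [self._docs[i] for i in sorted(self._index.get(value, ()))]

    def find(self, **criteria):
        """Full scan: works on any field, but reads every document."""
        return [
            doc for doc in self._docs.values()
            if all(doc.get(k) == v for k, v in criteria.items())
        ]


products = ToyDocumentStore(indexed_field="category")
products.insert(1, {"name": "Laptop", "category": "computers", "ram_gb": 16})
products.insert(2, {"name": "Novel", "category": "books", "author": "A. Writer"})
products.insert(3, {"name": "Tablet", "category": "computers"})

print([d["name"] for d in products.find_by_index("computers")])
print([d["name"] for d in products.find(ram_gb=16)])
```

Each product carries only the fields that make sense for it, which is why product information is a common fit for this kind of store.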
Typical Use Cases (Document Stores)
§ Logging
– Especially with variable content
§ Product information
§ Customer information
§ Content management
§ Data to be stored has a format that varies over time
– Flexible schema
§ Web analytics
Graph Databases
§ Nodes with properties
§ Nodes connected through relationships
§ Can model very complex graph data
– Social networks
§ Systems:
– Neo4J, InfiniteGraph, TitanDB, OrientDB
§ Use when data is a (complex) graph
Typical Use Cases (Graph Databases)
§ Highly interconnected data
§ Social graphs
§ Party relationships in an enterprise
§ Location based services
§ Purchasing analytics and recommendations
§ Often combined with other systems to store the bulk of data
– Graph database can focus on relationships
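A recommendation over a social graph can be sketched as a friends-of-friends traversal: nodes carry properties, relationships connect them, and candidates are ranked by how many mutual friends they share with a person. `ToyGraph`, its methods, and the sample names are hypothetical; a graph database would express the same traversal in its own query language.

```python
from collections import Counter, defaultdict


class ToyGraph:
    """Nodes with properties, connected through undirected relationships."""

    def __init__(self):
        self.properties = {}
        self._edges = defaultdict(set)

    def add_node(self, node, **props):
        self.properties[node] = props

    def relate(self, a, b):
        """Add a two-way relationship (for example, "friend")."""
        self._edges[a].add(b)
        self._edges[b].add(a)

    def neighbors(self, node):
        return set(self._edges.get(node, ()))

    def recommend(self, node):
        """Rank friends-of-friends by their number of mutual friends."""
        direct = self.neighbors(node)
        mutual = Counter()
        for friend in direct:
            for candidate in self.neighbors(friend):
                if candidate != node and candidate not in direct:
                    mutual[candidate] += 1
        return sorted(mutual.items(), key=lambda item: (-item[1], item[0]))


graph = ToyGraph()
for name in ["ann", "bob", "cat", "dan", "eve"]:
    graph.add_node(name, kind="person")
for a, b in [("ann", "bob"), ("ann", "cat"), ("bob", "dan"),
             ("cat", "dan"), ("cat", "eve")]:
    graph.relate(a, b)

suggestions = graph.recommend("ann")
print(suggestions)  # [('dan', 2), ('eve', 1)]
```

The query follows relationships directly from node to node, which is the kind of access pattern that becomes awkward when the same data is spread across joined tables.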
Integrating Relational, Streams, and Hadoop
[Figure: integration diagram. Labels: Traditional/Relational Data Sources; Non-Traditional/Non-Relational Data Sources (varied data formats: semi-structured, unstructured, ...); Event System; NoSQL; Streams (In-Motion Analytics, Ultra Low Latency Results); Data + Big Data (Data analytics, Results); Traditional Warehouse, Database & Warehouse (At-rest data analytics, Results)]
Lambda Architecture
[Figure: Incoming Data flows to the Batch Layer (Master Dataset, Batch Update, Batch View) and to the Event (Speed) Layer (Real Time Data, Real Time Update, Rolling Values); the Serving Layer answers Queries by merging the results]
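The interplay of the three layers can be sketched with page-view counts. The batch layer recomputes a view from the full master dataset, the speed layer keeps rolling values for events that arrived since the last batch run, and a query merges both. The event format and function names are assumptions made for this sketch, and it assumes that the rolling values only contain events not yet included in the batch view.

```python
from collections import Counter


def batch_update(master_dataset):
    """Batch layer: recompute the batch view from the full master dataset."""
    return Counter(event["page"] for event in master_dataset)


def real_time_update(rolling_values, event):
    """Speed layer: fold one new event into the rolling values."""
    rolling_values[event["page"]] += 1
    return rolling_values


def query(batch_view, rolling_values, page):
    """Serving layer: merge the batch view with the rolling values."""
    return batch_view.get(page, 0) + rolling_values.get(page, 0)


# Events already covered by the last batch run.
master_dataset = [{"page": "home"}, {"page": "home"}, {"page": "cart"}]
batch_view = batch_update(master_dataset)

# Events that arrived after the batch run started.
rolling_values = Counter()
for event in [{"page": "home"}, {"page": "search"}]:
    real_time_update(rolling_values, event)

print(query(batch_view, rolling_values, "home"))    # 3
print(query(batch_view, rolling_values, "search"))  # 1
```

When the next batch run finishes, the new events become part of the batch view and the rolling values for them can be discarded, so small errors in the speed layer do not accumulate.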
Master Data Management and Governance
§ Big Data and NoSQL stores can easily become a bigger mess than relational stores
§ Introduce a practical plan
– Avoid lengthy and cumbersome governance
– Actual use should be the driving force
– Start slow
§ Be ready for change
– The technologies change rapidly
§ Focus on business outcomes
Succeeding with Big Data and NoSQL
1. Actively look for solutions where the right store can ease the pain
2. Make sure you deliver tangible value to clients
3. After you get your first apps to work: create a Big Data introduction and governance plan
4. Prioritize: do the most useful thing for the business first
5. Integrate with existing IT
6. Make sure you hire or grow your Big Data champions
7. Field is immature: look out for new tools and techniques
Conclusions
– Hadoop and NoSQL address the weak points of relational systems:
• Scale
• Performance
• Unstructured and semistructured data
– Streaming addresses the processing of data in real-time
– Integrate with conventional technologies!
– Spark and Flink: the next generation Big Data systems
Questions?