Big Data for Managers: From Hadoop to Streaming and Beyond

Dr. Vladimir Bacvanski, [email protected], @OnSoftware

Upload: dataworks-summithadoop-summit

Post on 16-Apr-2017


TRANSCRIPT

Page 1: Big Data for Managers: From hadoop to streaming and beyond

Big Data for Managers: From Hadoop to Streaming and Beyond

[email protected]

@OnSoftware

Page 2: Big Data for Managers: From hadoop to streaming and beyond

www.scispike.com Copyright © SciSpike 2016

Dr. Vladimir Bacvanski

§ Founder of SciSpike, a development, consulting, and training firm
§ Passionate about software and data
§ PhD in computer science, RWTH Aachen, Germany
§ Architect, consultant, mentor
§ Custom development: Scalable Web and IoT systems
§ Training and mentoring in Big Data, Scala, node.js, software architecture

@OnSoftware

https://www.linkedin.com/in/vladimirbacvanski

Page 3: Big Data for Managers: From hadoop to streaming and beyond

Problems with Relational Stores

§ Data that does not naturally fit into tables → Impedance mismatch
§ Development time often too long
§ Dealing with unstructured data
§ Performance problems
§ Difficult to run on clusters
§ Cost

Page 4: Big Data for Managers: From hadoop to streaming and beyond

Structured and Unstructured Data Sources

Structured Data Sources

• Existing databases
• ERP / CRM / BI systems
• Inventory
• Supply chain

Unstructured Data Sources

• Server logs
• Search engine logs
• Browsing logs
• E-Commerce records
• Social media
• Voice
• Video
• Sensor data

Page 5: Big Data for Managers: From hadoop to streaming and beyond

NoSQL Impact

[Figure: cost/performance plotted against data volume (1M, 1B, 1T, 1Q, … HUGE, each step x1000), with disks and processors also growing x1000.]

Relational Database

• Today: doable, but expensive and slow
• Tomorrow: volume is out of reach

Big Data + NoSQL

• Stabilize cost & increase performance
• Enable unlimited volume growth

Page 6: Big Data for Managers: From hadoop to streaming and beyond

Scale Up vs. Scale Out

[Figure: two charts of capability versus cost, one for Scale Up and one for Scale Out.]

Page 7: Big Data for Managers: From hadoop to streaming and beyond

A Common Pattern for Processing Large Data

§ Load a large set of records onto a set of machines
§ Extract something interesting from each record ("Map")
§ Shuffle and sort intermediate results
§ Aggregate intermediate results ("Reduce")
§ Store end result

Data moves between the steps as key/value pairs.

Page 8: Big Data for Managers: From hadoop to streaming and beyond

Two Key Aspects of Hadoop

§ MapReduce framework
  – How Hadoop understands and assigns work to the nodes (machines)
§ Hadoop Distributed File System = HDFS
  – Where Hadoop stores data
  – A file system that spans all the nodes in a Hadoop cluster
  – It links together the file systems on many local nodes to make them into one big file system

Page 9: Big Data for Managers: From hadoop to streaming and beyond

MapReduce Example: Word Count

§ Word Count is the "Hello World" of Big Data
  – You will see various technologies implementing it
  – A good first step to compare the expressiveness of Big Data tools

Input (pets.txt): dog cat bird dog cat bird dog dog cat

Input splits:
  dog cat bird
  dog cat bird
  dog dog cat

Map:
  dog, 1   cat, 1   bird, 1
  dog, 1   cat, 1   bird, 1
  dog, 1   dog, 1   cat, 1

Shuffle:
  dog, 1   dog, 1   dog, 1   dog, 1
  cat, 1   cat, 1   cat, 1
  bird, 1   bird, 1

Reduce:
  dog, 4
  cat, 3
  bird, 2

Output (pet_freq.txt): dog, 4   cat, 3   bird, 2
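
The three phases on this slide can be sketched in a few lines of plain Python. This is only a single-machine illustration of the idea, not Hadoop code; the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are made up for this sketch.

```python
from collections import defaultdict

def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in split.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}

# pets.txt, already cut into three input splits as on the slide.
splits = ["dog cat bird", "dog cat bird", "dog dog cat"]
pairs = [pair for split in splits for pair in map_phase(split)]
pet_freq = reduce_phase(shuffle_phase(pairs))
print(pet_freq)
```

On a real cluster each split would be mapped on a different node and the shuffle would move data across the network; the logic per record stays the same.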

Page 10: Big Data for Managers: From hadoop to streaming and beyond

The MapReduce Programming Model

§ "Map" step:
  – Input split into pieces
  – Worker nodes process individual pieces in parallel (under global control of the JobTracker node)
  – Each worker node stores its result in its local file system where a reducer is able to access it
§ "Reduce" step:
  – Data is aggregated ("reduced" from the map steps) by worker nodes (under control of the JobTracker)
  – Multiple reduce tasks can parallelize the aggregation

Page 11: Big Data for Managers: From hadoop to streaming and beyond

Separation of Work

Programmers

• Map
• Reduce

Framework

• Deals with fault tolerance
• Assigns workers to map and reduce tasks
• Moves processes to data
• Shuffles and sorts intermediate data
• Deals with errors

Page 12: Big Data for Managers: From hadoop to streaming and beyond

How To Create MapReduce Jobs

§ Java API
  – Low level, very flexible
  – Time consuming development
§ Streaming API
  – A simple, productive model for Python and Ruby
§ Hive
  – Open source language / Apache sub-project
  – Provides a SQL-like interface to Hadoop
§ Pig
  – Data flow language / Apache sub-project
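
To give a feel for the Streaming API style, here is a minimal sketch of a word-count mapper and reducer in Python. With Hadoop Streaming the two parts would normally be separate scripts that read lines from standard input and write tab-separated key/value lines to standard output; here they are written as plain functions, and the sort step that the framework performs between them is simulated with `sorted`. The function names are assumptions of this sketch.

```python
from itertools import groupby

def mapper(lines):
    """Emit one tab-separated 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum the counts per word; the input must already be sorted by key."""
    parsed = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local stand-in for the framework: map, sort by key, then reduce.
text = ["dog cat bird", "dog cat bird", "dog dog cat"]
streamed = list(reducer(sorted(mapper(text))))
print(streamed)
```

The appeal of this model is that the programmer only writes two small text filters; distribution, sorting, and retries stay with the framework.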

Page 13: Big Data for Managers: From hadoop to streaming and beyond

The Big Picture: NoSQL + Hadoop in Applications

[Figure: applications backed by several specialized stores.]

• Columnar: price updates, logs
• Document: product info
• Graph: customer and agent relationships
• RDB: XA data
• Hadoop: operational analytics, price analytics
• Key/Value: session data

Page 14: Big Data for Managers: From hadoop to streaming and beyond

Streaming: A New Paradigm

§ Conventional processing: static data
  [Figure: queries are run against stored data to produce results.]
§ Real-time processing: streaming data
  [Figure: data flows through standing queries to produce results.]
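
The contrast between the two paradigms can be sketched in plain Python. In the first function the data sits still and a query scans it once; in the second the query stays in place and produces an updated result for every arriving event. The event shape (`{"page": ...}`) and the function names are assumptions made for this illustration.

```python
from collections import Counter

def static_query(stored_events):
    """Conventional: the data is at rest and one query scans it once."""
    return Counter(event["page"] for event in stored_events)

def standing_query(event_stream):
    """Streaming: the query stays in place and emits a result per event."""
    counts = Counter()
    for event in event_stream:
        counts[event["page"]] += 1
        yield dict(counts)  # updated result after every arriving event

events = [{"page": "home"}, {"page": "cart"}, {"page": "home"}]
batch_result = static_query(events)
stream_results = list(standing_query(iter(events)))
print(batch_result, stream_results[-1])
```

Both approaches end with the same counts; the streaming version simply has an answer available after every event instead of only after the whole data set has been loaded.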

Page 15: Big Data for Managers: From hadoop to streaming and beyond

Common Streaming Applications

§ Personalization
§ Search
§ Revenue optimization
§ User events
§ Content feeds
§ Log processing
§ Monitoring
§ Recommendations
§ Ads
§ Notable users:
  – Twitter
  – Yahoo
  – Spotify
  – Cisco
  – Flickr
  – Weather Channel

Page 16: Big Data for Managers: From hadoop to streaming and beyond

Beyond Hadoop: Spark & Flink

[Figure: progression of processing engines: MapReduce, Tez, Spark, Flink.]

Page 17: Big Data for Managers: From hadoop to streaming and beyond

Apache Spark

§ Important Features
  – In Memory Data
  – Resilient Distributed Datasets (RDDs)
    • Datasets can rebuild themselves if failure occurs
  – Rich set of operators
§ Efficient:
  – 10x (on Disk) to 100x (In Memory) faster than Hadoop MR
  – 2 to 5 times less code (Rich APIs in Scala/Java/Python)
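
The idea that a dataset "can rebuild itself" can be illustrated with a toy class that remembers how it was derived. This is not Spark's API and not how RDDs are implemented; `ToyDataset` is a hypothetical stand-in that only shows the principle of recomputing lost in-memory data from its recorded transformations.

```python
class ToyDataset:
    """Toy stand-in for an RDD: remembers how it was derived (its lineage)."""

    def __init__(self, source, steps=()):
        self.source = source  # function that re-reads the original input
        self.steps = steps    # transformations recorded so far
        self.cache = None     # in-memory copy, which may be lost

    def map(self, fn):
        """Record a transformation; nothing is computed yet."""
        return ToyDataset(self.source, self.steps + (fn,))

    def collect(self):
        """Return the data, recomputing from the source if the cache is gone."""
        if self.cache is None:
            data = self.source()
            for fn in self.steps:
                data = [fn(x) for x in data]
            self.cache = data
        return self.cache

squares = ToyDataset(lambda: [1, 2, 3]).map(lambda x: x * x).map(lambda x: x + 1)
first = squares.collect()
squares.cache = None  # simulate losing the in-memory copy
rebuilt = squares.collect()
print(first, rebuilt)
```

Because the recipe is kept alongside the data, a failure costs recomputation time rather than data loss, which is what makes holding data in memory acceptable.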

Page 18: Big Data for Managers: From hadoop to streaming and beyond

Spark Architecture

§ A powerful set of tools
§ Beyond traditional Hadoop

Source: http://spark.apache.org

Page 19: Big Data for Managers: From hadoop to streaming and beyond

Data Sharing in Apache Spark

[Figure: data is read from HDFS; the results of Iteration 1 and Iteration 2 are each held in cluster memory, where Query 1 and Query 2 can use them.]

Page 20: Big Data for Managers: From hadoop to streaming and beyond

Apache Flink

§ Execution:
  – Programs compiled into an execution plan
  – Plan is optimized
  – Executed
§ Design goals:
  – High performance
  – Hybrid batch and streaming runtime
  – Simplicity for the developer
  – Rich libraries
  – Integration with many systems

Page 21: Big Data for Managers: From hadoop to streaming and beyond

Apache Flink Components

§ Integration with Hadoop YARN, MapReduce, HBase, Cassandra, Kafka, …
§ Execution engine for Apache Beam (Google Dataflow)

Page 22: Big Data for Managers: From hadoop to streaming and beyond

Flink Optimization and Execution

§ Optimizer selects an execution plan
§ Similar to what we have in relational databases
§ Optimal plan depends on the size of the input files
§ Run as standalone or on top of Hadoop
§ Integration with many Hadoop technologies
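
Why the best plan depends on input size can be shown with a deliberately simple rule for joining two inputs. This is an illustrative sketch only, not Flink's optimizer or cost model: the function name, the 10 MB `broadcast_limit`, and the two strategies are assumptions chosen to make the point.

```python
def choose_join_strategy(left_bytes, right_bytes, broadcast_limit=10_000_000):
    """Pick a join plan from estimated input sizes (illustrative rule only)."""
    smaller = min(left_bytes, right_bytes)
    if smaller <= broadcast_limit:
        # A small input can be copied to every node, so the big one never moves.
        side = "left" if left_bytes <= right_bytes else "right"
        return f"broadcast {side} input to every node"
    # Two large inputs: redistribute both so matching keys meet on one node.
    return "repartition both inputs by join key"

small_vs_large = choose_join_strategy(5_000_000, 800_000_000_000)
large_vs_large = choose_join_strategy(300_000_000_000, 800_000_000_000)
print(small_vs_large)
print(large_vs_large)
```

The same program text leads to different physical plans, which is the behavior managers may know from query optimizers in relational databases.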

Page 23: Big Data for Managers: From hadoop to streaming and beyond

Flink & Spark: The Advantages and Outlook

§ Less IO overhead than conventional Hadoop
§ Caching
§ Iterative algorithms
§ Unifying batch and stream computing
§ Scala as a natural, expressive language for Big Data
  – Other languages: Python, Java, R
§ Beware of less mature components

Page 24: Big Data for Managers: From hadoop to streaming and beyond

Typical NoSQL Systems

§ Non-relational
§ Distributed
§ Horizontally scalable
§ No need for a fixed schema
§ Several established players
§ Systems are specialized

Page 25: Big Data for Managers: From hadoop to streaming and beyond

NoSQL Stores and Their Categories

§ Choose a store that is a best match for your application
§ It is fine to have several different stores used
  – "Polyglot persistence"

[Figure: the four categories of NoSQL stores: Key-Value, Column-Family, Document-Oriented, Graph DB.]

Page 26: Big Data for Managers: From hadoop to streaming and beyond

NoSQL Stores: Scale vs. Complexity of Data

[Figure: Key-Value, Column-Family, Document-Oriented, and Graph DB stores placed on axes of scalability and complexity, with a marker for the needs of most applications.]

Page 27: Big Data for Managers: From hadoop to streaming and beyond

Key-Value Stores

§ Key → Value mapping
§ Large, persistent Map ("hash table")
  – Values could be lists and hashes
§ Easy to use
§ Scale very well
§ Data model may be too simple for most applications
§ Systems:
  – Redis, Riak, Memcached, Amazon DynamoDB, Aerospike, FoundationDB
§ Use when data model is very simple and scalability essential

Page 28: Big Data for Managers: From hadoop to streaming and beyond

Typical Use Cases

§ The data model is very simple!
  – Actual data can be JSON
§ Session data
§ User preferences and profiles
§ Shopping cart
§ If another NoSQL store is good enough, you may want to skip this and let a Column or Document store handle it
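
A key-value store behaves much like a large dictionary. The sketch below keeps session data under a key and treats the value as an opaque JSON string; the class `KeyValueStore` and the key `session:42` are invented for this example and do not correspond to any particular product's API.

```python
import json

class KeyValueStore:
    """Toy key -> value map; the store never looks inside the value."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

store = KeyValueStore()
# The value is an opaque JSON string; only the key is used for lookup.
store.put("session:42", json.dumps({"user": "ann", "cart": ["book", "pen"]}))
session = json.loads(store.get("session:42"))
print(session["cart"])
```

Everything is reached through the key, which is why these stores scale so well and also why queries such as "all sessions with a pen in the cart" are not a natural fit.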

Page 29: Big Data for Managers: From hadoop to streaming and beyond

Column-Family

§ "Column-family": similar to a table
  – Table is sparse
§ Key → (Column: Value)*
§ Columns have names
§ Can be indexed
§ Can store complex data
  – Denormalize!
§ Systems:
  – Google BigTable, HBase, Cassandra, Amazon SimpleDB, Hypertable
§ Use when scalability is essential

Page 30: Big Data for Managers: From hadoop to streaming and beyond

Typical Use Cases

§ High insert volume: logging
§ Real-time updates
§ Content management
§ Expiring content
§ Cross-datacenter replication
§ MapReduce analytics over stored data
§ You don't need conventional (ACID) transactions
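
The "Key → (Column: Value)*" shape of a column-family store can be pictured as a dictionary of dictionaries in which each row carries only the columns it actually has. The sketch below uses a logging scenario; the row keys, column names, and values are made-up sample data, and `insert` is a hypothetical helper rather than a real client call.

```python
from collections import defaultdict

# Row key -> {column name: value}; each row holds only the columns it needs.
log_table = defaultdict(dict)

def insert(row_key, column, value):
    """Write one named column into the row identified by row_key."""
    log_table[row_key][column] = value

insert("web01#2016-06-28T10:00", "status", 200)
insert("web01#2016-06-28T10:00", "path", "/home")
insert("web02#2016-06-28T10:01", "status", 500)
insert("web02#2016-06-28T10:01", "error", "timeout")  # column absent in other rows

for row_key, columns in sorted(log_table.items()):
    print(row_key, columns)
```

No space is spent on columns a row does not use, which is what "the table is sparse" means in practice.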

Page 31: Big Data for Managers: From hadoop to streaming and beyond

Document Stores

§ JSON, BSON, XML
§ No schema
§ Indexes improve performance
§ Easy transition from RDBMS
§ Systems
  – MongoDB, CouchDB, CouchBase
§ Use when data is in semi-structured form
§ Often seen in new Web applications

Page 32: Big Data for Managers: From hadoop to streaming and beyond

Typical Use Cases

§ Logging
  – Especially with variable content
§ Product information
§ Customer information
§ Content management
§ Data to be stored has format that varies over time
  – Flexible schema
§ Web analytics
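
A document store holds self-describing records whose fields may differ from one record to the next. The sketch below models product information as Python dictionaries and builds a simple index on one field; the documents, the `_id` values, and the `tag_index` structure are assumptions for illustration and not the API of any specific database.

```python
from collections import defaultdict

products = [
    {"_id": 1, "name": "laptop", "specs": {"ram_gb": 16}, "tags": ["office"]},
    {"_id": 2, "name": "novel", "author": "A. Writer"},  # different fields
    {"_id": 3, "name": "tablet", "tags": ["office", "travel"]},
]

# A secondary index on "tags" avoids scanning every document per query.
tag_index = defaultdict(list)
for doc in products:
    for tag in doc.get("tags", []):
        tag_index[tag].append(doc["_id"])

office_ids = tag_index["office"]
print(office_ids)
```

New fields can be added to new documents without migrating the old ones, which is the practical meaning of a flexible schema.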

Page 33: Big Data for Managers: From hadoop to streaming and beyond

Graph Databases

§ Nodes with properties
§ Nodes connected through relationships
§ Can model very complex graph data
  – Social networks
§ Systems:
  – Neo4J, InfiniteGraph, TitanDB, OrientDB
§ Use when data is a (complex) graph

Page 34: Big Data for Managers: From hadoop to streaming and beyond

Typical Use Cases

§ Highly interconnected data
§ Social graphs
§ Party relationships in an enterprise
§ Location based services
§ Purchasing analytics and recommendations
§ Often combined with other systems to store the bulk of data
  – Graph database can focus on relationships
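
The kind of question a graph database answers well is one that follows relationships, such as "who are the friends of my friends?". The sketch below stores nodes with properties and directed relationships in plain Python structures; the people, cities, and the function `friends_of_friends` are invented sample data and names, not a real graph query language.

```python
# Nodes with properties.
nodes = {
    "ann": {"city": "Boston"},
    "bob": {"city": "Dallas"},
    "cy": {"city": "Austin"},
    "dee": {"city": "Miami"},
}
# Directed "friend" relationships stored as adjacency sets.
friends = {"ann": {"bob"}, "bob": {"cy", "dee"}, "cy": set(), "dee": {"ann"}}

def friends_of_friends(person):
    """People two hops away who are not the person or a direct friend."""
    direct = friends[person]
    two_hops = {fof for friend in direct for fof in friends[friend]}
    return two_hops - direct - {person}

suggestions = sorted(friends_of_friends("ann"))
print(suggestions)
```

In a relational store the same question needs a self-join per hop; in a graph model it is a short walk along stored relationships.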

Page 35: Big Data for Managers: From hadoop to streaming and beyond

Integrating Relational, Streams, and Hadoop

[Figure: integration of streams, Big Data, and the traditional warehouse.]

• Data sources: Traditional / Relational Data Sources; Non-Traditional / Non-Relational Data Sources (varied data formats, semi-structured, unstructured, ...)
• Streams: In-Motion Analytics, producing Ultra Low Latency Results
• Data + Big Data: data analytics, producing Results
• Traditional Warehouse: Database & Warehouse, at-rest data analytics, producing Results
• Also shown: Event System, NoSQL

Page 36: Big Data for Managers: From hadoop to streaming and beyond

Lambda Architecture

[Figure: Lambda Architecture.]

• Incoming Data feeds both the Event (Speed) Layer and the Batch Layer
• Event (Speed) Layer: Real Time Data, Real Time Update, Rolling Values
• Batch Layer: Master Dataset, Batch Update
• Serving Layer: Batch View
• Queries: Merge Results
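
The flow in the figure can be sketched in plain Python: a batch view computed from the master dataset, a rolling real-time view updated per event, and a query that merges the two. All names (`master_dataset`, `batch_view`, `realtime_view`, `on_event`, `query`) and the page-view data are assumptions of this sketch, not part of any framework.

```python
from collections import Counter

master_dataset = ["home", "cart", "home"]  # immutable, append-only events
batch_view = Counter(master_dataset)       # recomputed by the batch layer

realtime_view = Counter()                  # rolling values in the speed layer

def on_event(page):
    """Speed layer: update the rolling view as each event arrives."""
    realtime_view[page] += 1

def query(page):
    """Serving side: merge the batch view with the real-time view."""
    return batch_view[page] + realtime_view[page]

for page in ["home", "search"]:            # events not yet in a batch run
    on_event(page)

merged = (query("home"), query("search"), query("cart"))
print(merged)
```

When the next batch run absorbs the recent events into the master dataset, the rolling values for that period can be discarded, so the speed layer only ever has to cover the gap since the last batch.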

Page 37: Big Data for Managers: From hadoop to streaming and beyond

Master Data Management and Governance

§ Big Data and NoSQL stores can easily become a bigger mess than relational stores
§ Introduce a practical plan
  – Avoid lengthy and cumbersome governance
  – Actual use should be the driving force
  – Start slow
§ Be ready for change
  – The technologies change rapidly
§ Focus on business outcomes

Page 38: Big Data for Managers: From hadoop to streaming and beyond

Succeeding with Big Data and NoSQL

1. Actively look for solutions where the right store can ease the pain
2. Make sure you deliver tangible value to clients
3. After you get your first apps to work: create a Big Data introduction and governance plan
4. Prioritize: do the most useful thing for the business first
5. Integrate with existing IT
6. Make sure you hire or grow your Big Data champions
7. Field is immature: look out for new tools and techniques

Page 39: Big Data for Managers: From hadoop to streaming and beyond

Conclusions

– Hadoop and NoSQL address the weak points of relational systems:
  • Scale
  • Performance
  • Unstructured and semi-structured data
– Streaming addresses the processing of data in real time
– Integrate with conventional technologies!
– Spark and Flink: the next generation Big Data systems

Page 40: Big Data for Managers: From hadoop to streaming and beyond

Questions?