Big Data for Managers: From Hadoop to Streaming and Beyond

@OnSoftware

www.scispike.com

Copyright © SciSpike 2016

Dr. Vladimir Bacvanski

§  Founder of SciSpike, a development, consulting, and training firm

§  Passionate about software and data

§  PhD in computer science, RWTH Aachen, Germany

§  Architect, consultant, mentor

§  Custom development: Scalable Web and IoT systems

§  Training and mentoring in Big Data, Scala, node.js, software architecture

@OnSoftware

https://www.linkedin.com/in/vladimirbacvanski


Problems with Relational Stores

§  Data that does not naturally fit into tables → impedance mismatch

§  Development time often too long

§  Dealing with unstructured data

§  Performance problems

§  Difficult to run on clusters

§  Cost


Structured and Unstructured Data Sources

Structured Data Sources

• Existing databases
• ERP/CRM/BI systems
• Inventory
• Supply chain

Unstructured Data Sources

• Server logs
• Search engine logs
• Browsing logs
• E-commerce records
• Social media
• Voice
• Video
• Sensor data


NoSQL Impact

[Figure: Cost/Performance against data volume (1M, 1B, 1T, 1Q, … huge, x1000 per step); disks and processors also scale x1000.]

§  Relational Database
– Today: doable, but expensive and slow
– Tomorrow: volume is out of reach

§  Big Data + NoSQL
– Stabilize cost and increase performance
– Enable unlimited volume growth


Scale Up vs. Scale Out

[Figure: two charts of capability against cost, one for scale up and one for scale out.]


A Common Pattern for Processing Large Data

§  Load a large set of records onto a set of machines

§  Extract something interesting from each record ("Map")

§  Shuffle and sort intermediate results (key/value pairs)

§  Aggregate intermediate results ("Reduce")

§  Store end result


Two Key Aspects of Hadoop

§  MapReduce framework
– How Hadoop understands and assigns work to the nodes (machines)

§  Hadoop Distributed File System = HDFS
– Where Hadoop stores data
– A file system that spans all the nodes in a Hadoop cluster
– It links together the file systems on many local nodes to make them into one big file system


MapReduce Example: Word Count

§  Word Count is the "Hello World" of Big Data
– You will see various technologies implementing it
– A good first step to compare the expressiveness of Big Data tools

[Figure: word count data flow]

Input (pets.txt): dog cat bird dog cat bird dog dog cat

Splits: "dog cat bird" | "dog cat bird" | "dog dog cat"

Map: (dog, 1) (cat, 1) (bird, 1) | (dog, 1) (cat, 1) (bird, 1) | (dog, 1) (dog, 1) (cat, 1)

Shuffle: dog → 1, 1, 1, 1 | cat → 1, 1, 1 | bird → 1, 1

Reduce (pet_freq.txt): dog, 4 | cat, 3 | bird, 2
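The word count flow above can be sketched in a few lines of plain Python. This is a single-machine illustration only: the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are made up for this sketch and are not Hadoop APIs.

```python
from collections import defaultdict

def map_phase(split):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in split.split()]

def shuffle_phase(pairs):
    """Group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)

# The three splits of pets.txt from the slide.
splits = ["dog cat bird", "dog cat bird", "dog dog cat"]
mapped = [pair for split in splits for pair in map_phase(split)]
grouped = shuffle_phase(mapped)
pet_freq = dict(reduce_phase(k, v) for k, v in grouped.items())
print(pet_freq)  # {'dog': 4, 'cat': 3, 'bird': 2}
```

On a real cluster the three phases run on different machines; here they run one after another in a single process.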


The MapReduce Programming Model

§  "Map" step:
–  Input split into pieces
–  Worker nodes process individual pieces in parallel (under global control of the JobTracker node)
–  Each worker node stores its result in its local file system where a reducer is able to access it

§  "Reduce" step:
–  Data is aggregated ("reduced" from the map steps) by worker nodes (under control of the JobTracker)
–  Multiple reduce tasks can parallelize the aggregation
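A minimal sketch of how several reduce tasks can share the aggregation: each intermediate key is assigned to exactly one reduce task, so the tasks never need to coordinate. The `partition` rule and `run_job` helper are assumptions made for this illustration, and everything runs sequentially in one process rather than on worker nodes.

```python
import zlib
from collections import Counter

def partition(key, num_reducers):
    """Pick a reduce task for a key; a stable hash keeps equal keys together."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def run_job(pieces, num_reducers=2):
    # Map step: each piece is processed independently
    # (in parallel on a real cluster, sequentially here).
    buckets = [[] for _ in range(num_reducers)]
    for piece in pieces:
        for word in piece.split():
            buckets[partition(word, num_reducers)].append((word, 1))
    # Reduce step: each reduce task aggregates only its own bucket.
    outputs = []
    for bucket in buckets:
        totals = Counter()
        for word, count in bucket:
            totals[word] += count
        outputs.append(dict(totals))
    return outputs

reducer_outputs = run_job(["dog cat bird", "dog cat bird", "dog dog cat"])
print(reducer_outputs)
```

Because a word always hashes to the same bucket, the outputs of the reduce tasks never overlap and can simply be concatenated.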


Separation of Work

Programmers

• Map
• Reduce

Framework

• Deals with fault tolerance
• Assigns workers to map and reduce tasks
• Moves processes to data
• Shuffles and sorts intermediate data
• Deals with errors


How To Create MapReduce Jobs

§  Java API
– Low level, very flexible
– Time consuming development

§  Streaming API
– A simple, productive model for Python and Ruby

§  Hive
– Open source language / Apache sub-project
– Provides a SQL-like interface to Hadoop

§  Pig
– Data flow language / Apache sub-project
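To give a feel for the Streaming API style, here is a hedged sketch of a word count mapper and reducer in Python. With Hadoop Streaming, the mapper and reducer are ordinary programs that exchange tab-separated key/value lines over standard input and output, and the reducer sees its input sorted by key. In this sketch the two are plain functions over lists of lines, and the sort is done locally, so it can run without a cluster; the function names are illustrative.

```python
from itertools import groupby

def mapper(lines):
    """Turn raw text lines into 'word<TAB>1' records."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_records):
    """Sum counts for each run of identical keys in sorted input."""
    parsed = (record.split("\t") for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"

# Local stand-in for the framework: map, sort by key, then reduce.
records = sorted(mapper(["dog cat bird", "dog dog cat"]))
result = list(reducer(records))
```

In an actual streaming job each function would live in its own script, reading from `sys.stdin` and printing its records.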


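The Big Picture: NoSQL + Hadoop in Applications

[Figure: applications drawing on several stores, each holding the data that suits it]

§  Columnar: price updates, logs

§  Document: product info

§  Graph: customer and agent relationships

§  RDB: XA data

§  Hadoop: operational analytics, price analytics

§  Key/Value: session data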


Streaming: A New Paradigm

§  Conventional processing: static data

Data Queries Results

§ Real-time processing: streaming data

Queries Data Results
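One way to read the two diagrams: in conventional processing the data sits still and a query is run over it, while in streaming the query sits still and data flows through it. The sketch below illustrates that reading with plain Python; `batch_query` and `standing_query` are names invented for this example.

```python
def batch_query(stored_events, threshold):
    """Conventional: run a query over data that is already at rest."""
    return [e for e in stored_events if e > threshold]

def standing_query(event_stream, threshold):
    """Streaming: the query stays in place and yields results as data arrives."""
    for event in event_stream:
        if event > threshold:
            yield event

readings = [3, 9, 4, 12, 7]

# All data must be present before the batch query can answer.
print(batch_query(readings, 6))   # [9, 12, 7]

# The standing query produces its first result as soon as one event qualifies.
alerts = standing_query(iter(readings), 6)
print(next(alerts))               # 9
```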



Common Streaming Applications

§  Personalization
§  Search
§  Revenue optimization
§  User events
§  Content feeds
§  Log processing
§  Monitoring
§  Recommendations
§  Ads

§  Notable users:
–  Twitter
–  Yahoo
–  Spotify
–  Cisco
–  Flickr
–  Weather Channel


Beyond Hadoop: Spark & Flink

[Figure: MapReduce, Tez, Spark, Flink]


Apache Spark

§  Important Features
–  In-memory data
–  Resilient Distributed Datasets (RDDs)
•  Datasets can rebuild themselves if failure occurs
–  Rich set of operators

§  Efficient:
–  Up to 10x (on disk) to 100x (in memory) faster than Hadoop MR
–  2 to 5 times less code (rich APIs in Scala/Java/Python)
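The idea that a dataset "can rebuild itself" can be illustrated with a toy class that remembers the steps used to produce it. This is not Spark code: `ToyDataset` and its methods are hypothetical, and the sketch only mimics the principle of recomputing lost in-memory data from its recorded lineage.

```python
class ToyDataset:
    """Toy stand-in for an RDD: remembers how to recompute itself (its lineage)."""

    def __init__(self, source, transforms=()):
        self.source = source          # function returning the base records
        self.transforms = transforms  # ordered functions applied to each record
        self.cache = None             # in-memory copy, may be lost

    def map(self, fn):
        """Return a new dataset whose lineage has one more step."""
        return ToyDataset(self.source, self.transforms + (fn,))

    def collect(self):
        if self.cache is None:        # cache missing: rebuild from lineage
            records = list(self.source())
            for fn in self.transforms:
                records = [fn(r) for r in records]
            self.cache = records
        return self.cache

    def lose_cache(self):
        """Simulate a node failure that drops the in-memory data."""
        self.cache = None

squares = ToyDataset(lambda: range(5)).map(lambda x: x * x)
first = squares.collect()
squares.lose_cache()
rebuilt = squares.collect()   # recomputed from the source and the recorded steps
```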


Spark Architecture

§  A powerful set of tools

§  Beyond traditional Hadoop

Source: http://spark.apache.org


Data Sharing in Apache Spark

[Figure: HDFS; Iteration 1, with Result 1 held in cluster memory; Iteration 2, with Result 2 held in cluster memory; Query 1; Query 2]


Apache Flink

§  Execution:
–  Programs compiled into an execution plan
–  Plan is optimized
–  Executed

§  Design goals:
–  High performance
–  Hybrid batch and streaming runtime
–  Simplicity for the developer
–  Rich libraries
–  Integration with many systems


Apache Flink Components

§  Integration with Hadoop YARN, MapReduce, HBase, Cassandra, Kafka, …

§  Execution engine for Apache Beam (Google Dataflow)


Flink Optimization and Execution

§  Optimizer selects an execution plan

§  Similar to what we have in relational databases

§  Optimal plan depends on the size of the input files

§  Run as standalone or on top of Hadoop

§  Integration with many Hadoop technologies
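To show why the best plan depends on input size, here is a toy rule for choosing a join strategy. It is a deliberately simplified assumption, not Flink's optimizer: the function name, the plan labels, and the `broadcast_limit` threshold are all invented for this sketch.

```python
def choose_join_plan(left_rows, right_rows, broadcast_limit=10_000):
    """Pick a join strategy from estimated input sizes (toy cost rule)."""
    smaller = min(left_rows, right_rows)
    if smaller <= broadcast_limit:
        # Small input: ship a full copy of it to every node.
        return "broadcast smaller input"
    # Both inputs large: split both by join key across the nodes.
    return "repartition both inputs"

print(choose_join_plan(5_000, 80_000_000))        # broadcast smaller input
print(choose_join_plan(40_000_000, 80_000_000))   # repartition both inputs
```

The same program gets a different plan when the input sizes change, which is the point of leaving the choice to an optimizer rather than hard-coding it.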


Flink & Spark: The Advantages and Outlook

§  Less IO overhead than conventional Hadoop

§  Caching

§  Iterative algorithms

§  Unifying batch and stream computing

§  Scala as a natural, expressive language for Big Data
– Other languages: Python, Java, R

§  Beware of less mature components


Typical NoSQL Systems

§  Non-relational

§  Distributed

§  Horizontally scalable

§  No need for a fixed schema

§  Several established players

§  Systems are specialized


NoSQL Stores and Their Categories

§  Choose a store that is a best match for your application

§  It is fine to have several different stores used
– "Polyglot persistence"

[Figure: the four store categories: Key-Value, Column-Family, Document-Oriented, Graph DB]


NoSQL Stores: Scale vs. Complexity of Data

[Figure: axes are scalability and complexity; stores shown are Key-Value, Column-Family, Document-Oriented, and Graph DB; an annotation marks the needs of most applications.]


Key-Value Stores

§  Key → Value mapping

§  Large, persistent Map ("hash table")
– Values could be lists and hashes

§  Easy to use

§  Scale very well

§  Data model may be too simple for most applications

§  Systems:
– Redis, Riak, Memcached, Amazon DynamoDB, Aerospike, FoundationDB

§  Use when data model is very simple and scalability essential


Typical Use Cases

§  The data model is very simple!
– Actual data can be JSON

§  Session data

§  User preferences and profiles

§  Shopping cart

§  If another NoSQL store is good enough, you may want to skip this and let a Column or Document store handle it
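The key-value model is small enough to sketch with a dictionary. The class below is an in-memory stand-in written for this illustration, not the API of any of the products listed; it stores each value as a JSON string to mirror the "actual data can be JSON" point. The key naming scheme `session:42` is also just an assumption.

```python
import json

class SessionStore:
    """In-memory stand-in for a key-value store: one opaque value per key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # The store does not look inside the value; it keeps a JSON string.
        self._data[key] = json.dumps(value)

    def get(self, key, default=None):
        raw = self._data.get(key)
        return default if raw is None else json.loads(raw)

store = SessionStore()
store.put("session:42", {"user": "ann", "cart": ["sku-1", "sku-7"]})
session = store.get("session:42")
print(session["cart"])  # ['sku-1', 'sku-7']
```

Note that the only way to find a session is by its key. Asking "which sessions contain sku-7?" would need a scan of every value, which is the sense in which the model may be too simple for many applications.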


Column-Family

§  "Column-family": similar to a table
– Table is sparse

§  Key → (Column : Value)*

§  Columns have names

§  Can be indexed

§  Can store complex data
– Denormalize!

§  Systems:
– Google BigTable, HBase, Cassandra, Amazon SimpleDB, Hypertable

§  Use when scalability is essential


Typical Use Cases

§  High insert volume: logging

§  Real-time updates

§  Content management

§  Expiring content

§  Cross-datacenter replication

§  MapReduce analytics over stored data

§  You don't need conventional (ACID) transactions
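The column-family model, Key → (Column : Value)*, can be pictured as a dictionary of dictionaries in which each row stores only the columns it actually has. The sketch below is an in-memory illustration with made-up helper names and sample data, not the API of HBase, Cassandra, or any other listed system.

```python
# row key -> {column name: value}; rows need not share the same columns
column_family = {}

def put(row_key, column, value):
    """Write one named column of one row."""
    column_family.setdefault(row_key, {})[column] = value

def get(row_key, column, default=None):
    """Read one column; missing columns simply are not stored."""
    return column_family.get(row_key, {}).get(column, default)

put("user:1", "name", "Ann")
put("user:1", "email", "ann@example.com")
put("user:2", "name", "Bob")
put("user:2", "last_login", "2016-03-01")   # a column user:1 never stored

print(get("user:2", "last_login"))  # 2016-03-01
print(get("user:1", "last_login"))  # None
```

This is what "the table is sparse" means in practice: an absent column costs nothing, unlike a NULL-filled cell in a fixed relational schema.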


Document Stores

§  JSON, BSON, XML

§  No schema

§  Indexes improve performance

§  Easy transition from RDBMS

§  Systems
– MongoDB, CouchDB, CouchBase

§  Use when data is in semi-structured form

§  Often seen in new Web applications


Typical Use Cases

§  Logging
– Especially with variable content

§  Product information

§  Customer information

§  Content management

§  Data to be stored has a format that varies over time
– Flexible schema

§  Web analytics
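A short sketch of the document model: records in one collection can carry different fields, and an index on a field avoids scanning every document. The sample product documents and the `build_index` helper are assumptions for this illustration and do not correspond to the query API of MongoDB or the other systems named.

```python
# Documents in the same collection do not have to share a schema.
documents = [
    {"_id": 1, "type": "book", "title": "Big Data Basics", "pages": 320},
    {"_id": 2, "type": "shirt", "title": "Logo Tee", "sizes": ["S", "M", "L"]},
    {"_id": 3, "type": "book", "title": "Streams in Practice"},
]

def build_index(docs, field):
    """Map each value of `field` to the ids of the documents that contain it."""
    index = {}
    for doc in docs:
        if field in doc:
            index.setdefault(doc[field], []).append(doc["_id"])
    return index

by_type = build_index(documents, "type")
book_ids = by_type["book"]
print(book_ids)  # [1, 3]
```

Unlike the key-value case, the store can look inside the value, which is what makes indexes and field-based queries possible.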


Graph Databases

§  Nodes with properties

§  Nodes connected through relationships

§  Can model very complex graph data
– Social networks

§  Systems:
– Neo4J, InfiniteGraph, TitanDB, OrientDB

§  Use when data is a (complex) graph


Typical Use Cases

§  Highly interconnected data

§  Social graphs

§  Party relationships in an enterprise

§  Location based services

§  Purchasing analytics and recommendations

§  Often combined with other systems to store the bulk of data
– Graph database can focus on relationships
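As an illustration of recommendations over relationships, the sketch below stores purchases as (source, relationship, target) triples and suggests products bought by customers who share a purchase. The data, the `BOUGHT` label, and both helper functions are invented for this example; a graph database would express the same traversal in its own query language.

```python
# Each relationship connects two nodes: (source, relationship type, target).
relationships = [
    ("ann", "BOUGHT", "laptop"),
    ("ann", "BOUGHT", "mouse"),
    ("bob", "BOUGHT", "laptop"),
    ("bob", "BOUGHT", "keyboard"),
    ("cy", "BOUGHT", "phone"),
]

def bought_by(customer):
    """Follow BOUGHT relationships outward from one customer node."""
    return {dst for src, rel, dst in relationships
            if src == customer and rel == "BOUGHT"}

def recommend(customer):
    """Suggest products bought by customers who share a purchase with `customer`."""
    mine = bought_by(customer)
    customers = {src for src, _, _ in relationships}
    suggestions = set()
    for other in customers - {customer}:
        theirs = bought_by(other)
        if mine & theirs:               # at least one product in common
            suggestions |= theirs - mine
    return sorted(suggestions)

print(recommend("ann"))  # ['keyboard']
```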


Integrating Relational, Streams, and Hadoop

[Figure: integration of streams, warehouse, and Big Data]

§  Data sources
– Traditional / relational data sources
– Non-traditional / non-relational data sources: varied data formats; semi-structured, unstructured, ...

§  Streams: in-motion analytics, ultra low latency results

§  Database & warehouse (traditional warehouse): at-rest data analytics, results

§  Also shown: Data + Big Data, event system, NoSQL


Lambda Architecture

[Figure: Lambda architecture]

§  Incoming data

§  Batch layer: master dataset, batch update

§  Serving layer: batch view

§  Event (speed) layer: real time data, real time update, rolling values

§  Queries: merge results
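A minimal sketch of the Lambda idea, assuming the usual reading of the diagram: a batch view is recomputed from the master dataset, a speed layer keeps rolling values for data that arrived since the last batch run, and a query merges the two. The page-view events and function names are hypothetical.

```python
from collections import Counter

master_dataset = ["home", "cart", "home", "search"]   # events already batch-processed
recent_events = ["home", "cart"]                      # arrived since the last batch run

def batch_view(events):
    """Batch layer: recompute page counts from the full master dataset."""
    return Counter(events)

def realtime_view(events):
    """Speed layer: rolling counts for data the batch view has not seen yet."""
    return Counter(events)

def query(page, batch, realtime):
    """Serving side: merge both views to answer a query."""
    return batch[page] + realtime[page]

batch = batch_view(master_dataset)
realtime = realtime_view(recent_events)
print(query("home", batch, realtime))  # 3
```

When the next batch run absorbs the recent events into the master dataset, the rolling values can be discarded and the speed layer starts again from empty.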


Master Data Management and Governance

§  Big Data and NoSQL stores can easily become a bigger mess than relational stores

§  Introduce a practical plan
– Avoid lengthy and cumbersome governance
– Actual use should be the driving force
– Start slow

§  Be ready for change
– The technologies change rapidly

§  Focus on business outcomes


Succeeding with Big Data and NoSQL

1.  Actively look for solutions where the right store can ease the pain

2.  Make sure you deliver tangible value to clients

3.  After you get your first apps to work: create a Big Data introduction and governance plan

4.  Prioritize: do the most useful thing for the business first

5.  Integrate with existing IT

6.  Make sure you hire or grow your Big Data champions

7.  Field is immature: look out for new tools and techniques


Conclusions

– Hadoop and NoSQL address the weak points of relational systems:
•  Scale
•  Performance
•  Unstructured and semistructured data

– Streaming addresses the processing of data in real-time

– Integrate with conventional technologies!

– Spark and Flink: the next generation Big Data systems

Questions?
