nosql and data lake architecture - dama nydama-ny.com/images/meeting/051817/nosql_databases/... ·...

28
NOSQL and Data Lake Architecture IBM 590 Madison Avenue (at 57th St.) New York, NY By Tom Haughey InfoModel, LLC [email protected] 201 755 3350

Upload: dinhthien

Post on 20-Sep-2018

222 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: NOSQL and Data Lake Architecture - DAMA NYdama-ny.com/images/meeting/051817/NoSQL_Databases/... · NOSQL and Data Lake Architecture IBM 590 Madison Avenue (at 57th St.) New York,

NOSQLandDataLakeArchitecture

IBM590MadisonAvenue(at57thSt.)

NewYork,NY

ByTomHaugheyInfoModel,LLC

[email protected]

Page 2: NOSQL and Data Lake Architecture - DAMA NYdama-ny.com/images/meeting/051817/NoSQL_Databases/... · NOSQL and Data Lake Architecture IBM 590 Madison Avenue (at 57th St.) New York,

Agenda•  DescribeNOSQL•  ProposerelaConshipofNOSQLtothedatalake•  AnalyzethecapabiliCesofNOSQLtosupportBI•  Proposeanewoverallreferencedataarchitecture

©InfoModel,LLC.2017 2

Page 3: NOSQL and Data Lake Architecture - DAMA NYdama-ny.com/images/meeting/051817/NoSQL_Databases/... · NOSQL and Data Lake Architecture IBM 590 Madison Avenue (at 57th St.) New York,

NOSQL

Ontheoutside

Ontheinside–

hundredstothousandsofservers

Server

©InfoModel,LLC.2017 3

Page 4: NOSQL and Data Lake Architecture - DAMA NYdama-ny.com/images/meeting/051817/NoSQL_Databases/... · NOSQL and Data Lake Architecture IBM 590 Madison Avenue (at 57th St.) New York,

Other

Author

•  Blog entry could also be absorbed into Blog as an array. •  All arrows are links. •  In many cases, the actual user name would be included rather than just the ID. •  This is an example of a NOSQL data model, not necessarily of notation •  [ ] indicates embedding.

User

User ID User Name Password

Blog

Blog ID Blog Name Blog Description Blog Timestamp User ID Tag [ ] Tag Keyword

Blog Entry

Blog Entry TimeUUID Blog Entry Text Blog ID Comment [ ]

Comment TimeUUID Comment Text User ID Comment Timestamp

Follows

User ID [ ] Followed User ID

Followed By

User ID [ ] Follower User ID

Subscribes To

User ID [ ] Subscribed to Blog ID Subscription Date

Subscribed By

Blog ID [ ] Subscriber User ID Subscription Date

Time Ordered Blog Entries

User ID Blog Entry TimeUUID Posted By User ID Blog Entry Text

BlogNOSQLSampleDesign

©InfoModel,LLC.2017 4

Page 5: NOSQL and Data Lake Architecture - DAMA NYdama-ny.com/images/meeting/051817/NoSQL_Databases/... · NOSQL and Data Lake Architecture IBM 590 Madison Avenue (at 57th St.) New York,

UnderstandingNOSQL•  ThefollowingessenLaltoproperunderstandingofNOSQL:

–  DatadistribuLon:Dataisspreadhorizontallyovermanyservers–  Semi-structured:Not“schema-less”;schemaisembeddedinthedata

ortheapplicaCon–  Composite.Astructurecanconsistofotherstructures–  Hierarchical:1:Mrankingfromparenttochild(exceptgraphDBMSs)–  Key-valuestructure:MostNOSQL’sconsistofakeyandavalue

•  Valuecanbeablob,string,orcontainerofotherkey:valuepairs.•  TheexcepGontothisisgraphDBMS

–  Materializedqueries:Basicallyastructureforeachquery–  ApplicaLonorientaLon:Notenterprise-oriented,butqueryoriented.–  RelaLonships:UnidirecConallinks,notjoins.–  Datamodel:ANOSQLdatamodelisaphysical,notlogical,datamodel

•  Fourmaintypes:key:value,widecolumn,document,graph©InfoModel,LLC.2017 5

Page 6: NOSQL and Data Lake Architecture - DAMA NYdama-ny.com/images/meeting/051817/NoSQL_Databases/... · NOSQL and Data Lake Architecture IBM 590 Madison Avenue (at 57th St.) New York,

1-KeyValue

Key Value

Whateveryouwantittocontain,suchasaCSV,ablob,orotherkey:values,e.g.,streamingstockmarketdata:

record={Ccker:…,rate_date:…,price:…,price_change:…,percent_change:…

UseCasesFilestorageLogrecordsSessionstorageContentmanagementStreamingdata(songs,albums,etc.)ProductdatamanagementHighvolumedatafeedsUserdatamanagementHadoop

@InfoModel,LLC.2017 6

Page 7: NOSQL and Data Lake Architecture - DAMA NYdama-ny.com/images/meeting/051817/NoSQL_Databases/... · NOSQL and Data Lake Architecture IBM 590 Madison Avenue (at 57th St.) New York,

2-WideColumn(ColumnFamily)

Column Family: Personal Column Family: Business Row Key

Name Home Phone Office Phone Address

00001 Luke 212 555 5689 201 891 2536 21 McCoy Ave 00021 Bryce 212 232 6785 201 766 2091 26 McCoy Ave 00033 Matthew 201 234 5768 201 766 7381 26 McCoy Ave 00046 James 908 435 6242 908 657 5438 5 Hatfield St 00057 Jack 347 361 5429 973 376 8394 6 Wallace St

Thetableissortedbasedonrowkey

EachcellcanhavemulCpleversionsindividuallyCmestamped

347 361 9876

Timestamp1 Timestamp2

Cells

•  SomeCmescalledcolumn-orientedbutactuallyrow-oriented•  SeveralvariaConsbutthisistheHbaseversion

©InfoModel,LLC.20177

Page 8: NOSQL and Data Lake Architecture - DAMA NYdama-ny.com/images/meeting/051817/NoSQL_Databases/... · NOSQL and Data Lake Architecture IBM 590 Madison Avenue (at 57th St.) New York,

2–WideColumn(Cassandra)

CharacterisLcs:•  Sparsetablestructure•  Foreachkey,therecanbe:

–  Variableahributes–  Newcolumnswithout

indicaCngtheyarenew–  Omissionofcolumns

Key1 A=6 C=3 D=53 E=10Key2 B=42 D=62 E=84Key3 A=1 B=5 F=20 H=9

UseCases•  ProductCatalog/Playlist•  RecommendaCon/

PersonalizaConEngine•  SensorData/InternetofThings•  Messaging•  FraudDetecCon•  UsedbyeBay,Twissandra…

©InfoModel,LLC.2017 8

Table’ssortedbasedonrowkey

Columnssortedbasedoncolumnkey

Page 9: NOSQL and Data Lake Architecture - DAMA NYdama-ny.com/images/meeting/051817/NoSQL_Databases/... · NOSQL and Data Lake Architecture IBM 590 Madison Avenue (at 57th St.) New York,

3-Document•  A key : value store that understands the data Posts =

{_id: “A12345” author: “Mickey”, date: 22/6/2012, text: “The Adventures of Mickey Mouse”, keywords: [“drama”, “comic”, “adventure”] } comments : [ { author: “Nick Machiavelli”, date: 11/12/2012, rating: “Less filling, tastes great”, votes: 7000 } , … ] }

©InfoModel,LLC.2017 9

Page 10: NOSQL and Data Lake Architecture - DAMA NYdama-ny.com/images/meeting/051817/NoSQL_Databases/... · NOSQL and Data Lake Architecture IBM 590 Madison Avenue (at 57th St.) New York,

•  Contentmanagement•  Highvolumedatafeeds

–  Stockmarketdata–  Streamingmusic

•  OperaConalintelligence–  Atdetailedlevel–  Ataggregatelevel

•  Productdatamanagement•  Userdatamanagement(users,profiles,etc.)•  Hadoop

UseCasesforDocumentDataStores

©InfoModel,LLC.2017 10

Page 11: NOSQL and Data Lake Architecture - DAMA NYdama-ny.com/images/meeting/051817/NoSQL_Databases/... · NOSQL and Data Lake Architecture IBM 590 Madison Avenue (at 57th St.) New York,

4 - Graph Databases •  Every element contains a direct pointer to its adjacent elements

hhp://www.infoq.com/arCcles/graph-nosql-neo4j©InfoModel,LLC.2017 11

Page 12: NOSQL and Data Lake Architecture - DAMA NYdama-ny.com/images/meeting/051817/NoSQL_Databases/... · NOSQL and Data Lake Architecture IBM 590 Madison Avenue (at 57th St.) New York,

GraphDatabases•  Based on graph theory •  Addresses data complexity •  Direct path operations are easy •  Can be transactional •  FOAF (Friend Of A Friend) •  Usually paired with an index for search

©InfoModel,LLC.2017 12

Page 13: NOSQL and Data Lake Architecture - DAMA NYdama-ny.com/images/meeting/051817/NoSQL_Databases/... · NOSQL and Data Lake Architecture IBM 590 Madison Avenue (at 57th St.) New York,

Joins•  NOSQLstructuresarematerializedquerieswithallor

mostofthequerydataembeddedintotheonestructure•  Joinsarenotjust“unnecessary”inNOSQL,theyarenot

possible(throughtheDBMSs)–  Rememberdataishighlydistributedacrossmany,manyservers–  Joinswouldbefar-reachingandinefficient–  NOSQLembeddingimpliesredundancy

•  Joiningacrossserverswouldrequire(somethinglike):–  UseofaparCConingkeythatcanhash(notnecessarilythePK)–  AhashingalgorithmthatconsistentlyresolvesaparCConingkeytothesamenode

–  An~evendistribuConofdata–  Ideally,collocaConofdatatobejoined

©InfoModel,LLC.2017 13

Page 14: NOSQL and Data Lake Architecture - DAMA NYdama-ny.com/images/meeting/051817/NoSQL_Databases/... · NOSQL and Data Lake Architecture IBM 590 Madison Avenue (at 57th St.) New York,

MetadataintheDataLake•  Somemetadata,suchasdatatype,length,domain,

granularity,business/technicaldefiniConandothers,musteventuallybeassignedtodatalakefor:–  Data–  RelaConshipsandmore

•  SayMonthlySalesRevenueisingestedintothedatalakefromdifferentorgs/countries(inwhichcasethesetotalsarerawdata)–  Arethegrainsthesame?Dotheyagree?–  TheyarebothSalesRevenuebutsupposeoneis“salesasoflastdayofthemonth”andtheotheris“salesasofthelastFridayofthemonth”

–  Metadataisrequiredtounderstandthisdata

©InfoModel,LLC.2017 14

Page 15: NOSQL and Data Lake Architecture - DAMA NYdama-ny.com/images/meeting/051817/NoSQL_Databases/... · NOSQL and Data Lake Architecture IBM 590 Madison Avenue (at 57th St.) New York,

DataLakeonNOSQL?•  AdatalakecanresideonHadoop,NoSQL,AmazonSimpleStorageService,

arelaConaldatabase,ordifferentcombinaConsofthem•  Fedbydatastreams•  Datalakehasmanytypesofdataelements,datastructuresandmetadata

inHDFSwithoutregardtoimportance,IDs,orsummariesandaggregates•  Importanttounderstandthevariegatednatureofthedatalakedatain

relaContothemetamodeloftheperspecCveNOSQLdatabase–  Semi-structured–  Key:value(mostly)withitshierarchicalstructure–  ThekeyandcolumnnamebeingessenCalpartsofmostNOSQL

•  MoreotendatalakeiskeptonHadoopandfedtoorfromNOSQL–  SomeaddiConallyusedagraphdatabaseontoptokeeptrackofthe

relaConships–  OnceloadedontoNOSQL,thefullpowerofNOSQLcanbeused

•  Note:NOSQLisanoperaConal,notananalyCcal,datastore©InfoModel,LLC.2017 15

Page 16: NOSQL and Data Lake Architecture - DAMA NYdama-ny.com/images/meeting/051817/NoSQL_Databases/... · NOSQL and Data Lake Architecture IBM 590 Madison Avenue (at 57th St.) New York,

BIonNOSQL•  NOSQLdoesnotyethavecommodityBItools•  Inall,thereareseveralapproaches*toBIonNOSQL,

whichdivideintotwomajorgroups:–  ThosethatuseNoSQLonlyandstrengthentheapplicaConwithbeherUIs,beheradhocreporCngandothercustomfeaturesontotheNoSQLproductstheyarecurrentlyusing

–  ThosethatuseNoSQLtoruntheirapplicaCons,butthentakethatdataoutoftheNoSQLsystemandputitintoaRDBMSortradiConaldatawarehouseformore“aterthefact”analysis.

•  EachapproachhasmanysuccessstoriesandthebestapproachforaparCcularcompanyisbasedontheirspecificneeds,budgetsandskills

*BasedonpresentaConbyNicholasGoodman,andarCclesbyCharlesRoe“BI/AnalyCcsonNOSQL”

©InfoModel,LLC.2017 16

Page 17: NOSQL and Data Lake Architecture - DAMA NYdama-ny.com/images/meeting/051817/NoSQL_Databases/... · NOSQL and Data Lake Architecture IBM 590 Madison Avenue (at 57th St.) New York,

1-ApplicaLononTopofNOSQL•  ReportsonNOSQL

–  HaveadeveloperbuildanapplicaConforreporCngontopofNOSQL

–  HasthefullrichnessofNoSQLbutisexpensiveduetotheneedforadeveloper

NOSQL ApplicaCon

©InfoModel,LLC.2017 17

Page 18: NOSQL and Data Lake Architecture - DAMA NYdama-ny.com/images/meeting/051817/NoSQL_Databases/... · NOSQL and Data Lake Architecture IBM 590 Madison Avenue (at 57th St.) New York,

2-EnhancedNOSQL-Only•  AddsadynamicquerybuilderintothereporCngapp•  Needsadevelopertobuilditisevenmoreexpensive

NOSQLQueryBuilderApplicaCon

IndexesAggregates

©InfoModel,LLC.2017 18

Page 19: NOSQL and Data Lake Architecture - DAMA NYdama-ny.com/images/meeting/051817/NoSQL_Databases/... · NOSQL and Data Lake Architecture IBM 590 Madison Avenue (at 57th St.) New York,

3-NOSQLExtracttoRelaLonal•  SimilartotradiConalETLbutHadooporNOSQLarethesource–  ExtractfromNOSQL/HadoopandinsertintoRDBMS–  AllowstheuseofrichBItools

•  Addstofirstapproach,creaCngadynamicquerybuilderintothereporCngsystem–  Guidedadhoc–  Datafreshnessissueduetoday-olddata

RDBMSETLNOSQL/Hadoop

BI

©InfoModel,LLC.2017 19

Page 20: NOSQL and Data Lake Architecture - DAMA NYdama-ny.com/images/meeting/051817/NoSQL_Databases/... · NOSQL and Data Lake Architecture IBM 590 Madison Avenue (at 57th St.) New York,

Othersources

4-NOSQLasETLSource•  NOSQLsarejustpartoftheDWsourcing(ETL)•  DataextractedfromNOSQL/HadoopandinsertedintoDWandintegratedwithotherDWdata

•  ProsandCons–  CanusestandardBItools,whicharecostly–  Richflexibility/scalabilityonNOSQLgone–  ETLdevelopmentcost–  Note:Hadoopisabatchenvironment,notreal-Cme

RDBMSETLNOSQL/Hadoop

BI

©InfoModel,LLC.2017 20

Page 21: NOSQL and Data Lake Architecture - DAMA NYdama-ny.com/images/meeting/051817/NoSQL_Databases/... · NOSQL and Data Lake Architecture IBM 590 Madison Avenue (at 57th St.) New York,

5-NoSQLAddedtoBITools•  Developerintensive

–  AddsaservicetoastandardcommodityBItool–  FlahenstheNoSQLdataandoutputsitintoareport–  NoneedforSQLoradhocweb-basedaccesstools–  NoneedforloadsofexpensivereportswriheninNOSQL

•  TheprogramiswrihenoneCme–  ItisanapplicaConwithsubstanCaldevelopercosts–  CanuseM:Rforup-to-dateaccessto100%ofthedataset–  AggregaConsmightbeslower

NOSQL

NOSQLAddedto

BI

©InfoModel,LLC.2017 21

Page 22: NOSQL and Data Lake Architecture - DAMA NYdama-ny.com/images/meeting/051817/NoSQL_Databases/... · NOSQL and Data Lake Architecture IBM 590 Madison Avenue (at 57th St.) New York,

6–InterfacingNOSQLandBI•  ThirdpartyEnterpriseInformaConIntegraCon(EII)between

thecommodityBIandtheNoSQLordatalake–  TheEIItoolcanspeaktobothNOSQLandBI–  IntegraConwithotherdata,givesliveup-to-dateaccess–  ETLissimplewithINSERT/MERGEsdonenightlyandhasadhocaccess

tolive,cacheddata.–  “Bestofboth”:NoSQLontheback,BIonthefront–  ThecostandcomplicaConsofintroducingathirdsystem–  ThereissCllsomelossoftherichnessofNoSQLlanguage

EII BIToolsNOSQL

©InfoModel,LLC.2017 22

Page 23: NOSQL and Data Lake Architecture - DAMA NYdama-ny.com/images/meeting/051817/NoSQL_Databases/... · NOSQL and Data Lake Architecture IBM 590 Madison Avenue (at 57th St.) New York,

TransformingData•  Tousedata,itmustbeputintoauseablestate•  Youcannotputrawdataintoanyrepository(including

thedatalake)andpretendthatipsofactoiteliminatestheneedfortransformaCon(asmanyclaim!)

•  It’sacaseof“Paymenoworpaymelater”•  DataiseithertransformedbyETL/ELTbeforeitisstored

oritistransformedaKerbythequery–  Schemaonwrite-abatchprocessofETLorELTinarelaConalenvironment

–  Schemaonreadbythequerymanager–whetherinRDBMS,NOSQL,orMap:Reduce

•  OnemainreasonforstoringaggregatesintheDWistoprovideconsistentnumbers–  Thedatalaketooshouldensureconsistentnumbers

©InfoModel,LLC.2017 23

Page 24: NOSQL and Data Lake Architecture - DAMA NYdama-ny.com/images/meeting/051817/NoSQL_Databases/... · NOSQL and Data Lake Architecture IBM 590 Madison Avenue (at 57th St.) New York,

SampleStrategicQuery•  Strategicqueriesusingmanydimensionsand

aggregaConscouldposeaproblemtoNOSQL–  “GivemeabreakdownoftotalcommissionspaidandtransacConcountbyaccount,summarizedbyproducttypeandproductclass,andorderedbytheorganizaConthatownstheproductandtheorganizaConthatsoldtheproduct.”

•  SuchaquerywouldbedifficulttodoinNOSQLwithoutmaterializingthequeryandusingMap/ReducefuncCons

•  ADWonarobustpla|ormcouldefficientlysupportthiswithouthavingtomaterializetheaggregateandsCllsupportquerieswithothermixesofdimensions

©InfoModel,LLC.2017 24

Page 25: NOSQL and Data Lake Architecture - DAMA NYdama-ny.com/images/meeting/051817/NoSQL_Databases/... · NOSQL and Data Lake Architecture IBM 590 Madison Avenue (at 57th St.) New York,

MiningDataLakeData•  MiningistheuseofmathemaCcalalgorithmstofind

hiddenrelaConshipsinthedata•  Justasthepopularityofnewtoolsisexploding,soare

thecapabiliCesindatamining•  NOSQLiswellsuitedfordatamining•  Data-miningtechniquesfallintofourmajorcategories:

–  ClassificaCon–suchastargetedmarkeCng–  AssociaCon–suchasmarketbasketanalysis–  Sequencing–thosewhoboughtthisboughtthat–  Clustering–developingconclusionsusingspaceanddistance

•  NOTE:InHadoop,queryingandminingcanbedonethroughHive,MahoutandPig

©InfoModel,LLC.2017 25

Page 26: NOSQL and Data Lake Architecture - DAMA NYdama-ny.com/images/meeting/051817/NoSQL_Databases/... · NOSQL and Data Lake Architecture IBM 590 Madison Avenue (at 57th St.) New York,

DataLakeReferenceArchitecture

Appliance HADOOP NOSQL EDW Mart

RYO data

Consumers

Big Data Data Warehouse

RDBMSs

Real-time Analytics

Files Cloud Web Logs OLAP Tables Docs Sensors Events XML/JSON Streams

Data Streams

Trusted Data

Data Lakes

Raw Data

Data Streams

©InfoModel,LLC.2017 26

Page 27: NOSQL and Data Lake Architecture - DAMA NYdama-ny.com/images/meeting/051817/NoSQL_Databases/... · NOSQL and Data Lake Architecture IBM 590 Madison Avenue (at 57th St.) New York,

DW,DataMartandNOSQL•  ADWisageneralizedenvironmentinwhichdataisgatheredfrom

manysources,transformed,storedandthendeliveredwithbusinessmeaning–  Feedsanytypeofquery,includingadhocqueries–  ApplicaConneutral

•  DatamartsareaspecializedcollecConofrelateddataforaparCcularusercommunityandforalimitedCme–  Areuser-andapplicaCon-specific–  WillanswerverywellthosequesConswithindatamartsscope

•  FourtypesofNOSQL–  Key:value,widecolumn,document,graph–  MostNOSQLdatabasesarehierarchicalandoperaConal–  Thismeansmaterializingalmosteveryqueryrequiringanydatastores

•  DataLake–  AstoragerepositorythatholdsavastamountofrawdatainitsnaCve

formatunClitisneeded

©InfoModel,LLC.2017 27

Page 28: NOSQL and Data Lake Architecture - DAMA NYdama-ny.com/images/meeting/051817/NoSQL_Databases/... · NOSQL and Data Lake Architecture IBM 590 Madison Avenue (at 57th St.) New York,

QuesLons?

ThetermNOSQLwasfirstusedbyCarloStrozziin1998todescribeafile-based

databasehewasbuilding