nosql and data lake architecture - dama nydama-ny.com/images/meeting/051817/nosql_databases/... ·...

Post on 20-Sep-2018

223 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

NOSQLandDataLakeArchitecture

IBM590MadisonAvenue(at57thSt.)

NewYork,NY

ByTomHaugheyInfoModel,LLC

Tom.Haughey@InfoModelUSA.com2017553350

Agenda•  DescribeNOSQL•  ProposerelaConshipofNOSQLtothedatalake•  AnalyzethecapabiliCesofNOSQLtosupportBI•  Proposeanewoverallreferencedataarchitecture

©InfoModel,LLC.2017 2

NOSQL

Ontheoutside

Ontheinside–

hundredstothousandsofservers

Server

©InfoModel,LLC.2017 3

Other

Author

•  Blog entry could also be absorbed into Blog as an array. •  All arrows are links. •  In many cases, the actual user name would be included rather than just the ID. •  This is an example of a NOSQL data model, not necessarily of notation •  [ ] indicates embedding.

User

User ID User Name Password

Blog

Blog ID Blog Name Blog Description Blog Timestamp User ID Tag [ ] Tag Keyword

Blog Entry

Blog Entry TimeUUID Blog Entry Text Blog ID Comment [ ]

Comment TimeUUID Comment Text User ID Comment Timestamp

Follows

User ID [ ] Followed User ID

Followed By

User ID [ ] Follower User ID

Subscribes To

User ID [ ] Subscribed to Blog ID Subscription Date

Subscribed By

Blog ID [ ] Subscriber User ID Subscription Date

Time Ordered Blog Entries

User ID Blog Entry TimeUUID Posted By User ID Blog Entry Text

BlogNOSQLSampleDesign

©InfoModel,LLC.2017 4

UnderstandingNOSQL•  ThefollowingessenLaltoproperunderstandingofNOSQL:

–  DatadistribuLon:Dataisspreadhorizontallyovermanyservers–  Semi-structured:Not“schema-less”;schemaisembeddedinthedata

ortheapplicaCon–  Composite.Astructurecanconsistofotherstructures–  Hierarchical:1:Mrankingfromparenttochild(exceptgraphDBMSs)–  Key-valuestructure:MostNOSQL’sconsistofakeyandavalue

•  Valuecanbeablob,string,orcontainerofotherkey:valuepairs.•  TheexcepGontothisisgraphDBMS

–  Materializedqueries:Basicallyastructureforeachquery–  ApplicaLonorientaLon:Notenterprise-oriented,butqueryoriented.–  RelaLonships:UnidirecConallinks,notjoins.–  Datamodel:ANOSQLdatamodelisaphysical,notlogical,datamodel

•  Fourmaintypes:key:value,widecolumn,document,graph©InfoModel,LLC.2017 5

1-KeyValue

Key Value

Whateveryouwantittocontain,suchasaCSV,ablob,orotherkey:values,e.g.,streamingstockmarketdata:

record={Ccker:…,rate_date:…,price:…,price_change:…,percent_change:…

UseCasesFilestorageLogrecordsSessionstorageContentmanagementStreamingdata(songs,albums,etc.)ProductdatamanagementHighvolumedatafeedsUserdatamanagementHadoop

@InfoModel,LLC.2017 6

2-WideColumn(ColumnFamily)

Column Family: Personal Column Family: Business Row Key

Name Home Phone Office Phone Address

00001 Luke 212 555 5689 201 891 2536 21 McCoy Ave 00021 Bryce 212 232 6785 201 766 2091 26 McCoy Ave 00033 Matthew 201 234 5768 201 766 7381 26 McCoy Ave 00046 James 908 435 6242 908 657 5438 5 Hatfield St 00057 Jack 347 361 5429 973 376 8394 6 Wallace St

Thetableissortedbasedonrowkey

EachcellcanhavemulCpleversionsindividuallyCmestamped

347 361 9876

Timestamp1 Timestamp2

Cells

•  SomeCmescalledcolumn-orientedbutactuallyrow-oriented•  SeveralvariaConsbutthisistheHbaseversion

©InfoModel,LLC.20177

2–WideColumn(Cassandra)

CharacterisLcs:•  Sparsetablestructure•  Foreachkey,therecanbe:

–  Variableahributes–  Newcolumnswithout

indicaCngtheyarenew–  Omissionofcolumns

Key1 A=6 C=3 D=53 E=10Key2 B=42 D=62 E=84Key3 A=1 B=5 F=20 H=9

UseCases•  ProductCatalog/Playlist•  RecommendaCon/

PersonalizaConEngine•  SensorData/InternetofThings•  Messaging•  FraudDetecCon•  UsedbyeBay,Twissandra…

©InfoModel,LLC.2017 8

Table’ssortedbasedonrowkey

Columnssortedbasedoncolumnkey

3-Document•  A key : value store that understands the data Posts =

{_id: “A12345” author: “Mickey”, date: 22/6/2012, text: “The Adventures of Mickey Mouse”, keywords: [“drama”, “comic”, “adventure”] } comments : [ { author: “Nick Machiavelli”, date: 11/12/2012, rating: “Less filling, tastes great”, votes: 7000 } , … ] }

©InfoModel,LLC.2017 9

•  Contentmanagement•  Highvolumedatafeeds

–  Stockmarketdata–  Streamingmusic

•  OperaConalintelligence–  Atdetailedlevel–  Ataggregatelevel

•  Productdatamanagement•  Userdatamanagement(users,profiles,etc.)•  Hadoop

UseCasesforDocumentDataStores

©InfoModel,LLC.2017 10

4 - Graph Databases •  Every element contains a direct pointer to its adjacent elements

hhp://www.infoq.com/arCcles/graph-nosql-neo4j©InfoModel,LLC.2017 11

GraphDatabases•  Based on graph theory •  Addresses data complexity •  Direct path operations are easy •  Can be transactional •  FOAF (Friend Of A Friend) •  Usually paired with an index for search

©InfoModel,LLC.2017 12

Joins•  NOSQLstructuresarematerializedquerieswithallor

mostofthequerydataembeddedintotheonestructure•  Joinsarenotjust“unnecessary”inNOSQL,theyarenot

possible(throughtheDBMSs)–  Rememberdataishighlydistributedacrossmany,manyservers–  Joinswouldbefar-reachingandinefficient–  NOSQLembeddingimpliesredundancy

•  Joiningacrossserverswouldrequire(somethinglike):–  UseofaparCConingkeythatcanhash(notnecessarilythePK)–  AhashingalgorithmthatconsistentlyresolvesaparCConingkeytothesamenode

–  An~evendistribuConofdata–  Ideally,collocaConofdatatobejoined

©InfoModel,LLC.2017 13

MetadataintheDataLake•  Somemetadata,suchasdatatype,length,domain,

granularity,business/technicaldefiniConandothers,musteventuallybeassignedtodatalakefor:–  Data–  RelaConshipsandmore

•  SayMonthlySalesRevenueisingestedintothedatalakefromdifferentorgs/countries(inwhichcasethesetotalsarerawdata)–  Arethegrainsthesame?Dotheyagree?–  TheyarebothSalesRevenuebutsupposeoneis“salesasoflastdayofthemonth”andtheotheris“salesasofthelastFridayofthemonth”

–  Metadataisrequiredtounderstandthisdata

©InfoModel,LLC.2017 14

DataLakeonNOSQL?•  AdatalakecanresideonHadoop,NoSQL,AmazonSimpleStorageService,

arelaConaldatabase,ordifferentcombinaConsofthem•  Fedbydatastreams•  Datalakehasmanytypesofdataelements,datastructuresandmetadata

inHDFSwithoutregardtoimportance,IDs,orsummariesandaggregates•  Importanttounderstandthevariegatednatureofthedatalakedatain

relaContothemetamodeloftheperspecCveNOSQLdatabase–  Semi-structured–  Key:value(mostly)withitshierarchicalstructure–  ThekeyandcolumnnamebeingessenCalpartsofmostNOSQL

•  MoreotendatalakeiskeptonHadoopandfedtoorfromNOSQL–  SomeaddiConallyusedagraphdatabaseontoptokeeptrackofthe

relaConships–  OnceloadedontoNOSQL,thefullpowerofNOSQLcanbeused

•  Note:NOSQLisanoperaConal,notananalyCcal,datastore©InfoModel,LLC.2017 15

BIonNOSQL•  NOSQLdoesnotyethavecommodityBItools•  Inall,thereareseveralapproaches*toBIonNOSQL,

whichdivideintotwomajorgroups:–  ThosethatuseNoSQLonlyandstrengthentheapplicaConwithbeherUIs,beheradhocreporCngandothercustomfeaturesontotheNoSQLproductstheyarecurrentlyusing

–  ThosethatuseNoSQLtoruntheirapplicaCons,butthentakethatdataoutoftheNoSQLsystemandputitintoaRDBMSortradiConaldatawarehouseformore“aterthefact”analysis.

•  EachapproachhasmanysuccessstoriesandthebestapproachforaparCcularcompanyisbasedontheirspecificneeds,budgetsandskills

*BasedonpresentaConbyNicholasGoodman,andarCclesbyCharlesRoe“BI/AnalyCcsonNOSQL”

©InfoModel,LLC.2017 16

1-ApplicaLononTopofNOSQL•  ReportsonNOSQL

–  HaveadeveloperbuildanapplicaConforreporCngontopofNOSQL

–  HasthefullrichnessofNoSQLbutisexpensiveduetotheneedforadeveloper

NOSQL ApplicaCon

©InfoModel,LLC.2017 17

2-EnhancedNOSQL-Only•  AddsadynamicquerybuilderintothereporCngapp•  Needsadevelopertobuilditisevenmoreexpensive

NOSQLQueryBuilderApplicaCon

IndexesAggregates

©InfoModel,LLC.2017 18

3-NOSQLExtracttoRelaLonal•  SimilartotradiConalETLbutHadooporNOSQLarethesource–  ExtractfromNOSQL/HadoopandinsertintoRDBMS–  AllowstheuseofrichBItools

•  Addstofirstapproach,creaCngadynamicquerybuilderintothereporCngsystem–  Guidedadhoc–  Datafreshnessissueduetoday-olddata

RDBMSETLNOSQL/Hadoop

BI

©InfoModel,LLC.2017 19

Othersources

4-NOSQLasETLSource•  NOSQLsarejustpartoftheDWsourcing(ETL)•  DataextractedfromNOSQL/HadoopandinsertedintoDWandintegratedwithotherDWdata

•  ProsandCons–  CanusestandardBItools,whicharecostly–  Richflexibility/scalabilityonNOSQLgone–  ETLdevelopmentcost–  Note:Hadoopisabatchenvironment,notreal-Cme

RDBMSETLNOSQL/Hadoop

BI

©InfoModel,LLC.2017 20

5-NoSQLAddedtoBITools•  Developerintensive

–  AddsaservicetoastandardcommodityBItool–  FlahenstheNoSQLdataandoutputsitintoareport–  NoneedforSQLoradhocweb-basedaccesstools–  NoneedforloadsofexpensivereportswriheninNOSQL

•  TheprogramiswrihenoneCme–  ItisanapplicaConwithsubstanCaldevelopercosts–  CanuseM:Rforup-to-dateaccessto100%ofthedataset–  AggregaConsmightbeslower

NOSQL

NOSQLAddedto

BI

©InfoModel,LLC.2017 21

6–InterfacingNOSQLandBI•  ThirdpartyEnterpriseInformaConIntegraCon(EII)between

thecommodityBIandtheNoSQLordatalake–  TheEIItoolcanspeaktobothNOSQLandBI–  IntegraConwithotherdata,givesliveup-to-dateaccess–  ETLissimplewithINSERT/MERGEsdonenightlyandhasadhocaccess

tolive,cacheddata.–  “Bestofboth”:NoSQLontheback,BIonthefront–  ThecostandcomplicaConsofintroducingathirdsystem–  ThereissCllsomelossoftherichnessofNoSQLlanguage

EII BIToolsNOSQL

©InfoModel,LLC.2017 22

TransformingData•  Tousedata,itmustbeputintoauseablestate•  Youcannotputrawdataintoanyrepository(including

thedatalake)andpretendthatipsofactoiteliminatestheneedfortransformaCon(asmanyclaim!)

•  It’sacaseof“Paymenoworpaymelater”•  DataiseithertransformedbyETL/ELTbeforeitisstored

oritistransformedaKerbythequery–  Schemaonwrite-abatchprocessofETLorELTinarelaConalenvironment

–  Schemaonreadbythequerymanager–whetherinRDBMS,NOSQL,orMap:Reduce

•  OnemainreasonforstoringaggregatesintheDWistoprovideconsistentnumbers–  Thedatalaketooshouldensureconsistentnumbers

©InfoModel,LLC.2017 23

SampleStrategicQuery•  Strategicqueriesusingmanydimensionsand

aggregaConscouldposeaproblemtoNOSQL–  “GivemeabreakdownoftotalcommissionspaidandtransacConcountbyaccount,summarizedbyproducttypeandproductclass,andorderedbytheorganizaConthatownstheproductandtheorganizaConthatsoldtheproduct.”

•  SuchaquerywouldbedifficulttodoinNOSQLwithoutmaterializingthequeryandusingMap/ReducefuncCons

•  ADWonarobustpla|ormcouldefficientlysupportthiswithouthavingtomaterializetheaggregateandsCllsupportquerieswithothermixesofdimensions

©InfoModel,LLC.2017 24

MiningDataLakeData•  MiningistheuseofmathemaCcalalgorithmstofind

hiddenrelaConshipsinthedata•  Justasthepopularityofnewtoolsisexploding,soare

thecapabiliCesindatamining•  NOSQLiswellsuitedfordatamining•  Data-miningtechniquesfallintofourmajorcategories:

–  ClassificaCon–suchastargetedmarkeCng–  AssociaCon–suchasmarketbasketanalysis–  Sequencing–thosewhoboughtthisboughtthat–  Clustering–developingconclusionsusingspaceanddistance

•  NOTE:InHadoop,queryingandminingcanbedonethroughHive,MahoutandPig

©InfoModel,LLC.2017 25

DataLakeReferenceArchitecture

Appliance HADOOP NOSQL EDW Mart

RYO data

Consumers

Big Data Data Warehouse

RDBMSs

Real-time Analytics

Files Cloud Web Logs OLAP Tables Docs Sensors Events XML/JSON Streams

Data Streams

Trusted Data

Data Lakes

Raw Data

Data Streams

©InfoModel,LLC.2017 26

DW,DataMartandNOSQL•  ADWisageneralizedenvironmentinwhichdataisgatheredfrom

manysources,transformed,storedandthendeliveredwithbusinessmeaning–  Feedsanytypeofquery,includingadhocqueries–  ApplicaConneutral

•  DatamartsareaspecializedcollecConofrelateddataforaparCcularusercommunityandforalimitedCme–  Areuser-andapplicaCon-specific–  WillanswerverywellthosequesConswithindatamartsscope

•  FourtypesofNOSQL–  Key:value,widecolumn,document,graph–  MostNOSQLdatabasesarehierarchicalandoperaConal–  Thismeansmaterializingalmosteveryqueryrequiringanydatastores

•  DataLake–  AstoragerepositorythatholdsavastamountofrawdatainitsnaCve

formatunClitisneeded

©InfoModel,LLC.2017 27

QuesLons?

ThetermNOSQLwasfirstusedbyCarloStrozziin1998todescribeafile-based

databasehewasbuilding

top related