nosql and data lake architecture - dama nydama-ny.com/images/meeting/051817/nosql_databases/... ·...
TRANSCRIPT
NOSQLandDataLakeArchitecture
IBM590MadisonAvenue(at57thSt.)
NewYork,NY
ByTomHaugheyInfoModel,LLC
Agenda• DescribeNOSQL• ProposerelaConshipofNOSQLtothedatalake• AnalyzethecapabiliCesofNOSQLtosupportBI• Proposeanewoverallreferencedataarchitecture
©InfoModel,LLC.2017 2
NOSQL
Ontheoutside
Ontheinside–
hundredstothousandsofservers
Server
©InfoModel,LLC.2017 3
Other
Author
• Blog entry could also be absorbed into Blog as an array. • All arrows are links. • In many cases, the actual user name would be included rather than just the ID. • This is an example of a NOSQL data model, not necessarily of notation • [ ] indicates embedding.
User
User ID User Name Password
Blog
Blog ID Blog Name Blog Description Blog Timestamp User ID Tag [ ] Tag Keyword
Blog Entry
Blog Entry TimeUUID Blog Entry Text Blog ID Comment [ ]
Comment TimeUUID Comment Text User ID Comment Timestamp
Follows
User ID [ ] Followed User ID
Followed By
User ID [ ] Follower User ID
Subscribes To
User ID [ ] Subscribed to Blog ID Subscription Date
Subscribed By
Blog ID [ ] Subscriber User ID Subscription Date
Time Ordered Blog Entries
User ID Blog Entry TimeUUID Posted By User ID Blog Entry Text
BlogNOSQLSampleDesign
©InfoModel,LLC.2017 4
UnderstandingNOSQL• ThefollowingessenLaltoproperunderstandingofNOSQL:
– DatadistribuLon:Dataisspreadhorizontallyovermanyservers– Semi-structured:Not“schema-less”;schemaisembeddedinthedata
ortheapplicaCon– Composite.Astructurecanconsistofotherstructures– Hierarchical:1:Mrankingfromparenttochild(exceptgraphDBMSs)– Key-valuestructure:MostNOSQL’sconsistofakeyandavalue
• Valuecanbeablob,string,orcontainerofotherkey:valuepairs.• TheexcepGontothisisgraphDBMS
– Materializedqueries:Basicallyastructureforeachquery– ApplicaLonorientaLon:Notenterprise-oriented,butqueryoriented.– RelaLonships:UnidirecConallinks,notjoins.– Datamodel:ANOSQLdatamodelisaphysical,notlogical,datamodel
• Fourmaintypes:key:value,widecolumn,document,graph©InfoModel,LLC.2017 5
1-KeyValue
Key Value
Whateveryouwantittocontain,suchasaCSV,ablob,orotherkey:values,e.g.,streamingstockmarketdata:
record={Ccker:…,rate_date:…,price:…,price_change:…,percent_change:…
UseCasesFilestorageLogrecordsSessionstorageContentmanagementStreamingdata(songs,albums,etc.)ProductdatamanagementHighvolumedatafeedsUserdatamanagementHadoop
@InfoModel,LLC.2017 6
2-WideColumn(ColumnFamily)
Column Family: Personal Column Family: Business Row Key
Name Home Phone Office Phone Address
00001 Luke 212 555 5689 201 891 2536 21 McCoy Ave 00021 Bryce 212 232 6785 201 766 2091 26 McCoy Ave 00033 Matthew 201 234 5768 201 766 7381 26 McCoy Ave 00046 James 908 435 6242 908 657 5438 5 Hatfield St 00057 Jack 347 361 5429 973 376 8394 6 Wallace St
Thetableissortedbasedonrowkey
EachcellcanhavemulCpleversionsindividuallyCmestamped
347 361 9876
Timestamp1 Timestamp2
Cells
• SomeCmescalledcolumn-orientedbutactuallyrow-oriented• SeveralvariaConsbutthisistheHbaseversion
©InfoModel,LLC.20177
2–WideColumn(Cassandra)
CharacterisLcs:• Sparsetablestructure• Foreachkey,therecanbe:
– Variableahributes– Newcolumnswithout
indicaCngtheyarenew– Omissionofcolumns
Key1 A=6 C=3 D=53 E=10Key2 B=42 D=62 E=84Key3 A=1 B=5 F=20 H=9
UseCases• ProductCatalog/Playlist• RecommendaCon/
PersonalizaConEngine• SensorData/InternetofThings• Messaging• FraudDetecCon• UsedbyeBay,Twissandra…
©InfoModel,LLC.2017 8
Table’ssortedbasedonrowkey
Columnssortedbasedoncolumnkey
3-Document• A key : value store that understands the data Posts =
{_id: “A12345” author: “Mickey”, date: 22/6/2012, text: “The Adventures of Mickey Mouse”, keywords: [“drama”, “comic”, “adventure”] } comments : [ { author: “Nick Machiavelli”, date: 11/12/2012, rating: “Less filling, tastes great”, votes: 7000 } , … ] }
©InfoModel,LLC.2017 9
• Contentmanagement• Highvolumedatafeeds
– Stockmarketdata– Streamingmusic
• OperaConalintelligence– Atdetailedlevel– Ataggregatelevel
• Productdatamanagement• Userdatamanagement(users,profiles,etc.)• Hadoop
UseCasesforDocumentDataStores
©InfoModel,LLC.2017 10
4 - Graph Databases • Every element contains a direct pointer to its adjacent elements
hhp://www.infoq.com/arCcles/graph-nosql-neo4j©InfoModel,LLC.2017 11
GraphDatabases• Based on graph theory • Addresses data complexity • Direct path operations are easy • Can be transactional • FOAF (Friend Of A Friend) • Usually paired with an index for search
©InfoModel,LLC.2017 12
Joins• NOSQLstructuresarematerializedquerieswithallor
mostofthequerydataembeddedintotheonestructure• Joinsarenotjust“unnecessary”inNOSQL,theyarenot
possible(throughtheDBMSs)– Rememberdataishighlydistributedacrossmany,manyservers– Joinswouldbefar-reachingandinefficient– NOSQLembeddingimpliesredundancy
• Joiningacrossserverswouldrequire(somethinglike):– UseofaparCConingkeythatcanhash(notnecessarilythePK)– AhashingalgorithmthatconsistentlyresolvesaparCConingkeytothesamenode
– An~evendistribuConofdata– Ideally,collocaConofdatatobejoined
©InfoModel,LLC.2017 13
MetadataintheDataLake• Somemetadata,suchasdatatype,length,domain,
granularity,business/technicaldefiniConandothers,musteventuallybeassignedtodatalakefor:– Data– RelaConshipsandmore
• SayMonthlySalesRevenueisingestedintothedatalakefromdifferentorgs/countries(inwhichcasethesetotalsarerawdata)– Arethegrainsthesame?Dotheyagree?– TheyarebothSalesRevenuebutsupposeoneis“salesasoflastdayofthemonth”andtheotheris“salesasofthelastFridayofthemonth”
– Metadataisrequiredtounderstandthisdata
©InfoModel,LLC.2017 14
DataLakeonNOSQL?• AdatalakecanresideonHadoop,NoSQL,AmazonSimpleStorageService,
arelaConaldatabase,ordifferentcombinaConsofthem• Fedbydatastreams• Datalakehasmanytypesofdataelements,datastructuresandmetadata
inHDFSwithoutregardtoimportance,IDs,orsummariesandaggregates• Importanttounderstandthevariegatednatureofthedatalakedatain
relaContothemetamodeloftheperspecCveNOSQLdatabase– Semi-structured– Key:value(mostly)withitshierarchicalstructure– ThekeyandcolumnnamebeingessenCalpartsofmostNOSQL
• MoreotendatalakeiskeptonHadoopandfedtoorfromNOSQL– SomeaddiConallyusedagraphdatabaseontoptokeeptrackofthe
relaConships– OnceloadedontoNOSQL,thefullpowerofNOSQLcanbeused
• Note:NOSQLisanoperaConal,notananalyCcal,datastore©InfoModel,LLC.2017 15
BIonNOSQL• NOSQLdoesnotyethavecommodityBItools• Inall,thereareseveralapproaches*toBIonNOSQL,
whichdivideintotwomajorgroups:– ThosethatuseNoSQLonlyandstrengthentheapplicaConwithbeherUIs,beheradhocreporCngandothercustomfeaturesontotheNoSQLproductstheyarecurrentlyusing
– ThosethatuseNoSQLtoruntheirapplicaCons,butthentakethatdataoutoftheNoSQLsystemandputitintoaRDBMSortradiConaldatawarehouseformore“aterthefact”analysis.
• EachapproachhasmanysuccessstoriesandthebestapproachforaparCcularcompanyisbasedontheirspecificneeds,budgetsandskills
*BasedonpresentaConbyNicholasGoodman,andarCclesbyCharlesRoe“BI/AnalyCcsonNOSQL”
©InfoModel,LLC.2017 16
1-ApplicaLononTopofNOSQL• ReportsonNOSQL
– HaveadeveloperbuildanapplicaConforreporCngontopofNOSQL
– HasthefullrichnessofNoSQLbutisexpensiveduetotheneedforadeveloper
NOSQL ApplicaCon
©InfoModel,LLC.2017 17
2-EnhancedNOSQL-Only• AddsadynamicquerybuilderintothereporCngapp• Needsadevelopertobuilditisevenmoreexpensive
NOSQLQueryBuilderApplicaCon
IndexesAggregates
©InfoModel,LLC.2017 18
3-NOSQLExtracttoRelaLonal• SimilartotradiConalETLbutHadooporNOSQLarethesource– ExtractfromNOSQL/HadoopandinsertintoRDBMS– AllowstheuseofrichBItools
• Addstofirstapproach,creaCngadynamicquerybuilderintothereporCngsystem– Guidedadhoc– Datafreshnessissueduetoday-olddata
RDBMSETLNOSQL/Hadoop
BI
©InfoModel,LLC.2017 19
Othersources
4-NOSQLasETLSource• NOSQLsarejustpartoftheDWsourcing(ETL)• DataextractedfromNOSQL/HadoopandinsertedintoDWandintegratedwithotherDWdata
• ProsandCons– CanusestandardBItools,whicharecostly– Richflexibility/scalabilityonNOSQLgone– ETLdevelopmentcost– Note:Hadoopisabatchenvironment,notreal-Cme
RDBMSETLNOSQL/Hadoop
BI
©InfoModel,LLC.2017 20
5-NoSQLAddedtoBITools• Developerintensive
– AddsaservicetoastandardcommodityBItool– FlahenstheNoSQLdataandoutputsitintoareport– NoneedforSQLoradhocweb-basedaccesstools– NoneedforloadsofexpensivereportswriheninNOSQL
• TheprogramiswrihenoneCme– ItisanapplicaConwithsubstanCaldevelopercosts– CanuseM:Rforup-to-dateaccessto100%ofthedataset– AggregaConsmightbeslower
NOSQL
NOSQLAddedto
BI
©InfoModel,LLC.2017 21
6–InterfacingNOSQLandBI• ThirdpartyEnterpriseInformaConIntegraCon(EII)between
thecommodityBIandtheNoSQLordatalake– TheEIItoolcanspeaktobothNOSQLandBI– IntegraConwithotherdata,givesliveup-to-dateaccess– ETLissimplewithINSERT/MERGEsdonenightlyandhasadhocaccess
tolive,cacheddata.– “Bestofboth”:NoSQLontheback,BIonthefront– ThecostandcomplicaConsofintroducingathirdsystem– ThereissCllsomelossoftherichnessofNoSQLlanguage
EII BIToolsNOSQL
©InfoModel,LLC.2017 22
TransformingData• Tousedata,itmustbeputintoauseablestate• Youcannotputrawdataintoanyrepository(including
thedatalake)andpretendthatipsofactoiteliminatestheneedfortransformaCon(asmanyclaim!)
• It’sacaseof“Paymenoworpaymelater”• DataiseithertransformedbyETL/ELTbeforeitisstored
oritistransformedaKerbythequery– Schemaonwrite-abatchprocessofETLorELTinarelaConalenvironment
– Schemaonreadbythequerymanager–whetherinRDBMS,NOSQL,orMap:Reduce
• OnemainreasonforstoringaggregatesintheDWistoprovideconsistentnumbers– Thedatalaketooshouldensureconsistentnumbers
©InfoModel,LLC.2017 23
SampleStrategicQuery• Strategicqueriesusingmanydimensionsand
aggregaConscouldposeaproblemtoNOSQL– “GivemeabreakdownoftotalcommissionspaidandtransacConcountbyaccount,summarizedbyproducttypeandproductclass,andorderedbytheorganizaConthatownstheproductandtheorganizaConthatsoldtheproduct.”
• SuchaquerywouldbedifficulttodoinNOSQLwithoutmaterializingthequeryandusingMap/ReducefuncCons
• ADWonarobustpla|ormcouldefficientlysupportthiswithouthavingtomaterializetheaggregateandsCllsupportquerieswithothermixesofdimensions
©InfoModel,LLC.2017 24
MiningDataLakeData• MiningistheuseofmathemaCcalalgorithmstofind
hiddenrelaConshipsinthedata• Justasthepopularityofnewtoolsisexploding,soare
thecapabiliCesindatamining• NOSQLiswellsuitedfordatamining• Data-miningtechniquesfallintofourmajorcategories:
– ClassificaCon–suchastargetedmarkeCng– AssociaCon–suchasmarketbasketanalysis– Sequencing–thosewhoboughtthisboughtthat– Clustering–developingconclusionsusingspaceanddistance
• NOTE:InHadoop,queryingandminingcanbedonethroughHive,MahoutandPig
©InfoModel,LLC.2017 25
DataLakeReferenceArchitecture
Appliance HADOOP NOSQL EDW Mart
RYO data
Consumers
Big Data Data Warehouse
RDBMSs
Real-time Analytics
Files Cloud Web Logs OLAP Tables Docs Sensors Events XML/JSON Streams
Data Streams
Trusted Data
Data Lakes
Raw Data
Data Streams
©InfoModel,LLC.2017 26
DW,DataMartandNOSQL• ADWisageneralizedenvironmentinwhichdataisgatheredfrom
manysources,transformed,storedandthendeliveredwithbusinessmeaning– Feedsanytypeofquery,includingadhocqueries– ApplicaConneutral
• DatamartsareaspecializedcollecConofrelateddataforaparCcularusercommunityandforalimitedCme– Areuser-andapplicaCon-specific– WillanswerverywellthosequesConswithindatamartsscope
• FourtypesofNOSQL– Key:value,widecolumn,document,graph– MostNOSQLdatabasesarehierarchicalandoperaConal– Thismeansmaterializingalmosteveryqueryrequiringanydatastores
• DataLake– AstoragerepositorythatholdsavastamountofrawdatainitsnaCve
formatunClitisneeded
©InfoModel,LLC.2017 27
QuesLons?
ThetermNOSQLwasfirstusedbyCarloStrozziin1998todescribeafile-based
databasehewasbuilding