Data Management for Data Science
Master of Science in Data Science
Facoltà di Ing. dell'Informazione, Informatica e Statistica
Sapienza Università di Roma
Academic Year 2018/2019
Domenico Lembo
Dipartimento di Ingegneria Informatica, Automatica e Gestionale A. Ruberti
An Introduction to Big Data
Availability of Massive Data
• Digital data are nowadays collected at an unprecedented scale and in very many formats in a variety of domains (e-commerce, social networks, sensor networks, astronomy, genomics, medical records, etc.)
• This has been made possible by the incredible growth in recent years of the capacity of data storage tools and of the computing power of electronic devices, as well as by the advent of mobile and pervasive computing, cloud computing, and cloud storage.
Exploitability of Massive Data
• How to transform available data into information, and how to make organizations’ business take advantage of such information, are long-standing problems in IT, and in particular in information management and analysis.
• These issues have become more and more challenging and complex in the “Big Data” era
• At the same time, facing the challenge can be even more worthwhile than in the past, since the massive amount of data that is now available may allow for analytical results never achieved before
Be careful!
• “Big data is a vague term for a massive phenomenon that has rapidly become an obsession with entrepreneurs, scientists, governments and the media” (Tim Harford, journalist and economist, March 2014)*
*http://www.ft.com/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html#axzz3EvSLWwbu
Moore's Law for #BigData: The amount of nonsense packed into the term "Big Data" doubles approximately every two years (Mike Pluta, Data Architect, on Twitter, August 2014). https://twitter.com/mikepluta/status/502878691740090369
The Google Flu Trends*
• 2008: people at Google publish “Detecting influenza epidemics using search engine query data” in Nature (https://www.nature.com/articles/nature07634).
• They were able to track the spread of influenza across the US more quickly than the US Centers for Disease Control and Prevention (CDC).
• The tracking was essentially based on the correlation between what people searched for online and whether they had flu symptoms.
• Four years later, with a similar experiment, Google overestimated the spread of influenza by almost a factor of two!
• “theory-free analysis of mere correlations is inevitably fragile if you have no idea what is behind a correlation”.
*http://www.ft.com/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html#axzz3EvSLWwbu
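The fragility of a purely correlational predictor can be illustrated with a toy sketch. All numbers below are invented for illustration and have nothing to do with Google's actual data or model; the factor of two is chosen only to echo the overestimate mentioned above. The sketch assumes a predictor fitted while searches are driven by illness, then applied in a season where something else (here, a hypothetical wave of media coverage) also drives searches.

```python
# Toy illustration only: hypothetical weekly counts, not real flu data.

def fit_ratio(searches, cases):
    """Least-squares slope through the origin: cases ~ k * searches."""
    num = sum(s * c for s, c in zip(searches, cases))
    den = sum(s * s for s in searches)
    return num / den

# "Training" season: assume each flu case triggers about 2 searches.
train_cases = [100, 200, 400, 300, 150]
train_searches = [2 * c for c in train_cases]
k = fit_ratio(train_searches, train_cases)

# Later season: same illness level, but (by assumption) media coverage
# doubles the number of searches per actual case.
new_cases = [100, 200, 400, 300, 150]
new_searches = [4 * c for c in new_cases]

predicted = [k * s for s in new_searches]
overestimate = sum(predicted) / sum(new_cases)
print(f"fitted k = {k}, predicted/actual = {overestimate}")
```

The model has no notion of why searches and cases moved together, so it cannot notice that the relationship has changed.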
Correlation vs Causality!
Understanding Big Data is in fact difficult!
“There are a lot of small data problems that occur in big data. They don’t disappear because you’ve got lots of the stuff. They get worse!” (David Spiegelhalter, Cambridge University)
Thinking Big Data*
“"Big Data" has leapt rapidly into one of the most hyped terms in our industry, yet the hype should not blind people to the fact that this is a genuinely important shift about the role of data in the world. The amount, speed, and value of data sources is rapidly increasing. Data management has to change in five broad areas: extraction of data from a wider range of sources, changes to the logistics of data management with new database and integration approaches, the use of agile principles in running analytics projects, an emphasis on techniques for data interpretation to separate signal from noise, and the importance of well-designed visualization to make that signal more comprehensible. Summing up this means we don't need big analytics projects, instead we want the new data thinking to permeate our regular work.”
Martin Fowler
*http://martinfowler.com/articles/bigData/
Thinking Big Data
• Thus, roughly, Big Data is data that exceeds the processing capacity of conventional database systems
• But Big Data is also understood as a capability that allows companies to extract value from large volumes of data
• Note, however, that this does not mean only extremely large, massive databases
• Besides data size, what characterizes Big Data is also the heterogeneity in the way in which information is structured, the dynamicity with which data changes, and the ability to process it quickly
• This calls for new computing paradigms or frameworks, not only advanced data storage mechanisms
The Three Vs
To characterize Big Data, three Vs are used:
– Volume
– Velocity
– Variety
Volume
• Big data applications are characterized, of course, by big amounts of data, where big means extremely large, e.g., more than a terabyte (TB) or petabyte (PB), or more.
• There are various contexts in which these sizes can be easily reached: chatter from social networks, web server logs, traffic flow sensors, satellite imagery, broadcast audio streams, banking transactions, GPS trails, financial market data, biological data, etc.
• Some more concrete examples:
– Although some YouTube statistics are available¹, the total storage capacity of YouTube is not known, but realistically it should be no less than 1 EB (2016)
– NSA data center: estimated storage capacity of at least 2,000 PB (2013)²
– Facebook: 300 PB data warehouse (2014)³
¹ http://web.archive.org/web/20150217015601/http://www.youtube.com/yt/press/statistics.html
² http://www.forbes.com/sites/netapp/2013/07/26/nsa-utah-datacenter/#1b66cc7c3cd2
³ https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/
Volume
• How much data is there in the world? According to IDC (International Data Corporation):
– 800 Terabytes, 2000
– 160 Exabytes, 2006 (1 EB = 10¹⁸ B)
– 500 Exabytes, 2009
– 1.8 Zettabytes, 2011 (1 ZB = 10²¹ B)¹
– 2.8 Zettabytes, 2012¹
– 4.4 Zettabytes, 2013
– 175 Zettabytes by 2025² (estimate)
• Around 90% of the world’s data was generated in the last 4 years.
• The digital universe is doubling in size every two years³
¹ http://www.webopedia.com/quick_ref/just-how-much-data-is-out-there.html
² https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf
³ https://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm
Multiples of bits or bytes
Symbol  Name   Decimal value  Binary value
k       kilo   1000           1024
M       mega   1000²          1024²
G       giga   1000³          1024³
T       tera   1000⁴          1024⁴
P       peta   1000⁵          1024⁵
E       exa    1000⁶          1024⁶
Z       zetta  1000⁷          1024⁷
Y       yotta  1000⁸          1024⁸
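The two columns of the table diverge more and more as the prefix grows. The following minimal sketch recomputes both columns and the relative gap between them; the function names are illustrative only.

```python
# Prefixes in the order used by the table above.
PREFIXES = ["k", "M", "G", "T", "P", "E", "Z", "Y"]

def decimal_value(symbol):
    """Decimal interpretation: 1000 raised to the prefix's position."""
    return 1000 ** (PREFIXES.index(symbol) + 1)

def binary_value(symbol):
    """Binary interpretation: 1024 raised to the prefix's position."""
    return 1024 ** (PREFIXES.index(symbol) + 1)

for symbol in PREFIXES:
    dec, bin_ = decimal_value(symbol), binary_value(symbol)
    gap_percent = (bin_ / dec - 1) * 100
    print(f"{symbol}: decimal={dec:.0e}  binary is {gap_percent:.1f}% larger")
```

The gap is about 2.4% at kilo but roughly 21% at yotta, which is why it matters to state which convention a storage figure uses.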
Volume
• The sheer volume of data is enough to defeat many long-followed approaches to data management
• Traditional centralized database systems cannot handle many of the data volumes, forcing the use of clusters
• Data must necessarily be distributed, and the number of sources providing information can be huge, much higher than the number considered in traditional data integration and virtualization systems
Velocity
• Data’s velocity (i.e., the rate at which data is collected and made available to an organization) has followed a similar pattern to that of volume
• Many data sources accessed by organizations for their business are extremely dynamic
• Mobile devices increase the rate of data inflow: data “everywhere”, collected and consumed continuously
Velocity
• Some examples:
– Walmart: 1 million transactions per hour (2010)¹
– eBay: data throughput reaches 100 PB per day (2013)²
– Google processes 100 PB per day (2013-14)³
– Facebook: 600 TB added to the warehouse every day (2014)⁴
– 6,000-8,000 tweets per second (in 2019)⁵
• In 2013, it was estimated that every minute of every day we created⁶:
– More than 204 million email messages
– 571 new websites and 347 blog posts
– 72 hours of new YouTube video
– 1.8 million likes on Facebook
– 216,000 new photos on Instagram
– $83,000 spent on Amazon
¹ http://martinfowler.com/articles/bigData/
² http://www.v3.co.uk/v3-uk/news/2302017/ebay-using-big-data-analytics-to-drive-up-price-listings
³ http://www.slideshare.net/kmstechnology/big-data-overview-2013-2014
⁴ https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/
⁵ http://www.internetlivestats.com/twitter-statistics/
⁶ http://www.dailymail.co.uk/sciencetech/article-2381188/Revealed-happens-just-ONE-minute-internet-216-000-photos-posted-278-000-Tweets-1-8m-Facebook-likes.html
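To compare figures quoted per hour, per day and per minute, it can help to bring them to a common per-second rate. The sketch below does this arithmetic for some of the figures listed above, assuming decimal units (1 TB = 1000⁴ B) and a perfectly uniform load over time, which real systems do not have.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * 3600
GB, TB, PB = 1000 ** 3, 1000 ** 4, 1000 ** 5  # decimal units (assumption)

# Walmart: 1 million transactions per hour
walmart_tx_per_s = 1_000_000 / SECONDS_PER_HOUR

# Facebook: 600 TB added to the warehouse per day, expressed in GB/s
facebook_gb_per_s = 600 * TB / SECONDS_PER_DAY / GB

# Google: 100 PB processed per day, expressed in TB/s
google_tb_per_s = 100 * PB / SECONDS_PER_DAY / TB

# Email: 204 million messages per minute
emails_per_s = 204_000_000 / 60

print(f"Walmart:  {walmart_tx_per_s:.0f} transactions/s")
print(f"Facebook: {facebook_gb_per_s:.2f} GB/s")
print(f"Google:   {google_tb_per_s:.2f} TB/s")
print(f"Email:    {emails_per_s:,.0f} messages/s")
```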
Velocity
• Processing information as soon as it is available, thus speeding up the “feedback loop”, can provide competitive advantages
• Some examples of Fast Data Processing:
– Customer Experience/Retail: online retailers that are able to suggest additional products to a customer at every new piece of information entered during an online purchase (click-stream analysis)
– Financial Services Industry: algorithmic trading using event processing technology, but also real-time data integration and analytics
– Telecommunication: understand allocation of network resources based on traffic and application requirements, network usage patterns
– Energy: real-time processing of high volumes of events to make important decisions in order to effectively and efficiently manage possible faults on the distribution network
– Manufacturing: analyze real-time metrics to take corrective action before a failure occurs
Velocity
• Stream processing is a new, challenging computing paradigm, where information is not stored for later batch processing, but is consumed on the fly
• This is particularly useful when data arrive too fast to be stored in their entirety (for example because they need some processing to be stored properly), as in scientific applications, or when the application requires an immediate answer
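The difference between consuming data on the fly and storing it for later batch processing can be sketched with Python generators. This is a single-process illustration, not a real stream-processing engine; `sensor_stream` is a hypothetical data source invented for the example.

```python
def sensor_stream(n):
    """Hypothetical source: yields n readings one at a time."""
    for i in range(n):
        yield float(i % 10)

def running_mean(stream):
    """Streaming style: keep O(1) state, emit an updated answer per item."""
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        mean += (value - mean) / count  # incremental mean update
        yield mean

def batch_mean(stream):
    """Batch style: store everything first, then compute once."""
    stored = list(stream)
    return sum(stored) / len(stored)

latest = None
for latest in running_mean(sensor_stream(1000)):
    pass  # an answer is available after every single reading

print(latest, batch_mean(sensor_stream(1000)))
```

Both approaches reach the same final value, but the streaming version never holds more than one reading in memory and has an up-to-date answer at every step.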
Variety
• Data is extremely heterogeneous: e.g., in the format in which it is represented, but also in the way it represents information, both at the intensional and extensional level
• E.g., text from social networks, sensor data, logs from web applications, databases, XML documents, RDF data, etc.
• Data formats therefore range from structured (e.g., relational databases) to semistructured (e.g., XML documents), to unstructured (e.g., text documents)
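As a small illustration of the structured / semistructured / unstructured range, the sketch below extracts the same fact from three representations. The sample records and the regular expression are made up for the example; real unstructured text needs far more robust techniques than a single pattern.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Structured: fixed columns, every row has the same fields.
structured = "id,name,city\n1,Ada,Rome\n"
row = next(csv.DictReader(io.StringIO(structured)))
city_from_table = row["city"]

# Semistructured: self-describing tags, nesting may vary per document.
semistructured = (
    "<person id='1'><name>Ada</name>"
    "<address><city>Rome</city></address></person>"
)
root = ET.fromstring(semistructured)
city_from_xml = root.findtext("address/city")

# Unstructured: free text, the meaning has to be extracted.
unstructured = "Ada wrote that she moved to Rome last year."
match = re.search(r"moved to (\w+)", unstructured)
city_from_text = match.group(1) if match else None

print(city_from_table, city_from_xml, city_from_text)
```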
Variety
• As for unstructured data, for example, the challenge is to extract meaning for consumption by both humans and machines
• Entity resolution, which is the process that resolves, i.e., identifies, entities and detects relationships, then plays an important role
• In fact, these are well-known issues that have been studied for several years in the fields of data integration, data exchange, and data quality. In the Big Data scenario, however, they become even more challenging
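A very simplified flavour of entity resolution is deciding when two differently written names refer to the same entity. The sketch below uses string similarity from Python's standard `difflib` with a greedy grouping; the sample records, the normalization rules and the 0.8 threshold are all assumptions chosen for illustration, and production systems use much richer evidence than name similarity.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, drop simple punctuation, collapse whitespace."""
    cleaned = name.lower().replace(".", " ").replace(",", " ")
    return " ".join(cleaned.split())

def similarity(a, b):
    """Similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def resolve(records, threshold=0.8):
    """Greedy grouping: attach each record to the first similar cluster."""
    clusters = []
    for record in records:
        for cluster in clusters:
            if similarity(record, cluster[0]) >= threshold:
                cluster.append(record)
                break
        else:
            clusters.append([record])
    return clusters

records = [
    "Sapienza University of Rome",
    "sapienza university of rome.",
    "Sapienza Univ. of Rome",
    "University of Milan",
]
clusters = resolve(records)
for cluster in clusters:
    print(cluster)
```

Comparing every record against every cluster is quadratic in the worst case, which hints at why the problem gets harder at Big Data scale.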
A fourth V: Veracity*
• Data are of widely different quality
• Traditionally, data is thought of as coming from well-organized databases with controlled schemas
• Instead, in “Big Data” there is often little or no schema to control its structure
• The result is that there are serious problems with the quality of the data
*The literature often mentions only three Vs and does not include veracity. However, some authors tend to include veracity as a core characteristic of Big Data (alternatively, veracity is considered an aspect of variety)
Big Data: V3 + Value
Big Data can generate huge competitive advantages!
The value of Data for organizations
• Although it's difficult to get hard figures on the value of making full use of your data, much of the success of companies such as Amazon and Google is credited to their effective use of data¹
• Thus companies spend large amounts of money to reach this effective use: according to IDC, in 2017 the big data and analytics software market reached $54.1 billion worldwide, and it is expected to grow at a five-year CAGR (compound annual growth rate) of 11.2% (analysis 2018-2022)²
• Thus various Big Data solutions are now promoted by all major vendors of data management systems
¹ http://martinfowler.com/articles/bigData/
² https://www.idc.com/getdoc.jsp?containerId=US44243318
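A compound annual growth rate r means that a value V grows to V·(1 + r)ⁿ after n years. The sketch below applies this to the figures quoted above, under the assumption that the 2017 market size is the starting value and that growth compounds over five years; the result is plain arithmetic, not a number taken from the IDC report.

```python
def project(start_value, cagr, years):
    """Value after compounding `cagr` (as a fraction) for `years` years."""
    return start_value * (1 + cagr) ** years

def cagr(start_value, end_value, years):
    """Inverse computation: the constant yearly rate linking two values."""
    return (end_value / start_value) ** (1 / years) - 1

market_2017 = 54.1  # billion USD, from the slide
projected = project(market_2017, 0.112, 5)  # assumption: 5 years from 2017
print(f"Projected market after 5 years: ${projected:.1f} billion")
print(f"Recovered CAGR: {cagr(market_2017, projected, 5):.3f}")
```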
Potential value
Demand for new data management solutions*
• In the scenarios we depicted, it is not surprising that new data management solutions are in demand
• Indeed, despite the popularity and well-understood nature of relational databases, it is not the case that they should always be the destination for data
• Depending on the characteristics of the data, certain classes of databases are better suited than others for its management
• XML documents are more versatile when stored in dedicated XML storage systems (e.g., MarkLogic)
• Social network relations are graphs by nature, and graph databases such as Neo4J can make operations on them simpler and more efficient
*From: Edd Dumbill. What is Big Data. In Planning for Big Data. O’Reilly Radar Team
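To see why social relations fit a graph model, consider finding everyone within two hops of a person. On a graph this is a short traversal, whereas in a relational schema each extra hop typically means one more self-join on the friendship table. The sketch below is plain Python over an adjacency dictionary with invented names; it does not use Neo4J or any graph database API.

```python
from collections import deque

# Hypothetical friendship graph stored as adjacency sets.
friends = {
    "ann": {"bob", "carl"},
    "bob": {"ann", "dana"},
    "carl": {"ann"},
    "dana": {"bob", "eve"},
    "eve": {"dana"},
}

def within_hops(graph, start, max_hops):
    """Breadth-first traversal: people reachable in at most max_hops steps."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if distance[person] == max_hops:
            continue  # do not expand beyond the hop limit
        for friend in graph[person]:
            if friend not in distance:
                distance[friend] = distance[person] + 1
                queue.append(friend)
    del distance[start]
    return distance

print(within_hops(friends, "ann", 2))
```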
Demand for new data management solutions*
• A disadvantage of the relational database is the static nature of its schema
• In an agile environment, the results of computation will evolve with the detection and extraction of new information
• Semi-structured NoSQL databases meet this need for flexibility: they provide some structure to organize data (enough for certain applications), but do not require the exact schema of the data before storing it
*From: Edd Dumbill. What is Big Data. In Planning for Big Data. O’Reilly Radar Team
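The idea of storing data without declaring its exact schema first can be sketched with JSON-like documents. The in-memory `collection` list below merely stands in for a document store; `insert` and `find_with_field` are hypothetical helpers written for this example, not the API of any real NoSQL product.

```python
import json

collection = []  # stands in for a document-store collection

def insert(document):
    """Accept any JSON-serializable document; no schema is checked."""
    collection.append(json.loads(json.dumps(document)))

def find_with_field(field):
    """Return the documents that happen to contain the given field."""
    return [doc for doc in collection if field in doc]

# Two documents in the same collection with different structure.
insert({"user": "ann", "email": "ann@example.org"})
insert({"user": "bob", "phones": ["+39 06 0000000"], "address": {"city": "Rome"}})

print([doc["user"] for doc in find_with_field("address")])
```

A relational table would instead require altering the schema (or adding nullable columns) before the second record could be stored.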
NoSQL databases*
Or better… not only SQL
• The term "NoSQL" is very ill-defined. It's generally applied to a number of non-relational databases such as Cassandra, Mongo, Dynamo, Neo4J, Riak, and many others
• They embrace schemaless data, run on clusters, and have the ability to trade off traditional consistency for other useful properties
• Advocates of NoSQL databases claim that they can build systems that are more performant, scale much better, and are easier to program with
*From: Martin Fowler. NoSQL Distilled. Preface. (http://martinfowler.com/books/nosql.html)
Graph databases
Key-value databases
Document databases
Column Family Databases
NoSQL databases*
• Is this the first rattle of the death knell for relational databases, or yet another pretender to the throne? Our answer to that is "neither"
• Relational databases are a powerful tool that we expect to be using for many more decades, but we do see a profound change in that relational databases won't be the only databases in use
• Our view is that we are entering a world of Polyglot Persistence where enterprises, and even individual applications, use multiple technologies for data management
*From: Martin Fowler. NoSQL Distilled. Preface. (http://martinfowler.com/books/nosql.html)
Multiple technologies for data management
As an exercise, let us ask Google which database engine is used by Facebook. We get the following tools¹:
• MySQL as core database engine (in fact a customized version of MySQL, highly optimized and distributed)²
• Cassandra (an Apache open source fault-tolerant distributed NoSQL DBMS, originally developed at Facebook itself) as database for the Inbox mail search
• Memcached, a memory caching system to speed up dynamic database-driven websites
• HayStack, for storage and management of photos
• Hive, an open source, petabyte-scale data warehousing framework based on Hadoop, for analytics, and also Presto, an exabyte-scale data warehouse³
¹ https://www.techworm.net/2013/05/what-database-actually-facebook-uses.html
² http://www.datacenterknowledge.com/data-center-faqs/facebook-data-center-faq-page-2
³ http://prestodb.io/
Data Warehouse
• A data warehouse is a database used for reporting and data analysis. It is a central repository of data which is created by integrating data from one or more disparate sources
• According to Inmon*, a data warehouse is:
– Subject-oriented: The data in the data warehouse is organized so that all the data elements relating to the same real-world event or object are linked together
– Non-volatile: Data in the data warehouse are never overwritten or deleted; once committed, the data are static, read-only, and retained for future reporting
– Integrated: The data warehouse contains data from most or all of an organization's operational systems, and these data are made consistent
– Time-variant: For an operational system, the stored data contains the current value. The data warehouse, however, contains the history of data values
*Inmon, Bill (1992). Building the Data Warehouse. Wiley
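The non-volatile and time-variant properties can be contrasted with an operational table in a few lines. The sketch below uses Python's built-in `sqlite3` with an in-memory database; the table names, the product and the prices are invented for illustration, and a real warehouse would be loaded through an ETL process rather than written alongside the operational update.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Operational system: one row per product, holding the current value only.
conn.execute("CREATE TABLE operational_price (product TEXT PRIMARY KEY, price REAL)")
# Warehouse: rows are only appended, each tagged with its validity date.
conn.execute("CREATE TABLE warehouse_price (product TEXT, price REAL, valid_from TEXT)")

def set_price(product, price, day):
    # Operational side overwrites the previous value.
    conn.execute(
        "INSERT OR REPLACE INTO operational_price VALUES (?, ?)", (product, price)
    )
    # Warehouse side never updates or deletes: it records history.
    conn.execute(
        "INSERT INTO warehouse_price VALUES (?, ?, ?)", (product, price, day)
    )

set_price("widget", 10.0, "2019-01-01")
set_price("widget", 12.0, "2019-02-01")

current = conn.execute(
    "SELECT price FROM operational_price WHERE product = 'widget'"
).fetchall()
history = conn.execute(
    "SELECT price, valid_from FROM warehouse_price "
    "WHERE product = 'widget' ORDER BY valid_from"
).fetchall()
print(current, history)
```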
Data Warehouse vs. Big Data
• Do Data Warehouses (DWs) fall under the umbrella of Big Data?
• The notion of data warehousing dates back to the end of the 80s, and very many data warehouse and business intelligence solutions have been proposed since then
• BTW, Big Data and DWs have many points in common, at least w.r.t.
– Volume: data warehouses store large amounts of data
– Variety: at least in principle, data warehouses integrate heterogeneous information
– Veracity: data warehouses are usually equipped with data cleaning solutions, applied in the so-called extract-transform-load (ETL) phase
Data Warehouse vs. Big Data
• Existing enterprise data warehouses and relational databases excel at processing structured data, and can store massive amounts of data, though at a cost
• However, this requirement for structure imposes an inertia that makes data warehouses unsuited for agile exploration of massive heterogeneous data
• The amount of effort required to warehouse data often means that valuable data sources in organizations are never mined
• Therefore, new computing models and frameworks are needed to make new DW solutions compliant with the Big Data ecosystem.
MapReduce
• MapReduce is a programming framework for parallelizing computation
• Originally defined at Google
• Since then, there have been various implementations
• A well-known open source implementation is Apache Hadoop
MapReduce
A MapReduce program consists of two components:
• Map() procedure (the mapper) that performs filtering and sorting (it decomposes the problem into parallelizable subproblems)
• Reduce() procedure (the reducer) devoted to solving the subproblems
The MapReduce framework manages distributed servers, which execute the various subtasks in parallel, controls communication and data transfers between the various servers, and guarantees fault tolerance and disaster recovery.
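The division of labour between mapper and reducer can be sketched with the classic word-count example. What follows is a single-process simulation in plain Python, not the Hadoop API: the `shuffle` function stands in for the grouping-by-key step that a real framework performs between the two phases, and nothing here is distributed or fault tolerant.

```python
from collections import defaultdict

def mapper(document):
    """Map phase: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group intermediate values by key (done by the framework in practice)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Reduce phase: combine all the values emitted for one key."""
    return key, sum(values)

def map_reduce(documents):
    # Each mapper call is independent, so these could run in parallel.
    intermediate = [pair for doc in documents for pair in mapper(doc)]
    groups = shuffle(intermediate)
    # Each reducer call handles one key, so these could run in parallel too.
    return dict(reducer(key, values) for key, values in groups.items())

docs = ["big data is big", "data about data"]
counts = map_reduce(docs)
print(counts)
```

Because mappers only see their own document and reducers only see the values of their own key, the framework is free to spread both phases over many servers.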