data informatics• nosql databases do not store or enforce relationships among entities. the...

47
Data Informatics Seon Ho Kim, Ph.D. [email protected]

Upload: others

Post on 21-Jun-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

DataInformatics

SeonHoKim,[email protected]

NoSQL andBigDataProcessing

Database• RelationalDatabases– mainstayofbusiness• Web-basedapplicationscausedspikes

– Especiallytrueforpublic-facinge-Commercesites• Issueswithscalingupwhenthedatasetisjusttoobig• RDBMSwerenotdesignedtobedistributed• Begantolookatmulti-nodedatabasesolutions• Knownas‘scalingout’or‘horizontalscaling’

ShortReviewofSQL

Standardlanguageforqueryingandmanipulatingdata

StructuredQueryLanguage

• DataDefinitionLanguage(DDL)– Create/alter/deletetablesandtheirattributes

• DataManipulationLanguage(DML)– Queryoneormoretables– discussednext!– Insert/delete/modifytuplesintables

TablesinSQL

PName Price Category Manufacturer

Gizmo $19.99 Gadgets GizmoWorks

Powergizmo $29.99 Gadgets GizmoWorks

SingleTouch $149.99 Photography Canon

MultiTouch $203.99 Household Hitachi

Product

AttributenamesTablename

Tuplesorrows

TablesExplained

• Theschema ofatableisthetablenameanditsattributes:Product(PName, Price,Category,Manfacturer)

• Akey isanattributewhosevaluesareunique;weunderlineakey

Product(PName, Price,Category,Manfacturer)

SQLQuery

Basicform:(plusmanymanymorebellsandwhistles)

SELECT <attributes>FROM <oneormorerelations>WHERE <conditions>

SimpleSQLQuery

PName Price Category Manufacturer

Gizmo $19.99 Gadgets GizmoWorks

Powergizmo $29.99 Gadgets GizmoWorks

SingleTouch $149.99 Photography Canon

MultiTouch $203.99 Household Hitachi

SELECT *FROM ProductWHERE category=‘Gadgets’

Product

PName Price Category Manufacturer

Gizmo $19.99 Gadgets GizmoWorks

Powergizmo $29.99 Gadgets GizmoWorks“selection”

SimpleSQLQuery

PName Price Category Manufacturer

Gizmo $19.99 Gadgets GizmoWorks

Powergizmo $29.99 Gadgets GizmoWorks

SingleTouch $149.99 Photography Canon

MultiTouch $203.99 Household Hitachi

SELECT PName,Price,ManufacturerFROM ProductWHERE Price>100

Product

PName Price Manufacturer

SingleTouch $149.99 Canon

MultiTouch $203.99 Hitachi“selection” and“projection”

Joins

PName Price Category Manufacturer

Gizmo $19.99 Gadgets GizmoWorks

Powergizmo $29.99 Gadgets GizmoWorks

SingleTouch $149.99 Photography Canon

MultiTouch $203.99 Household Hitachi

Product Company

Cname StockPrice Country

GizmoWorks 25 USA

Canon 65 Japan

Hitachi 15 Japan

PName Price

SingleTouch $149.99

SELECT PName,PriceFROM Product,CompanyWHEREManufacturer=CNameANDCountry=‘Japan’

ANDPrice<=200

SQLPhysicalLayerAbstraction

• Applicationsspecifywhat,nothow• Queryoptimizationengine• Physicallayercanchangewithoutmodifyingapplications– Createindexestosupportqueries– InMemorydatabases

Transactions– ACIDProperties

• Atomic – Alloftheworkinatransactioncompletes(commit)ornoneofitcompletes

• Consistent – Atransactiontransformsthedatabasefromoneconsistentstatetoanotherconsistentstate.Consistencyisdefinedintermsofconstraints.

• Isolated – Theresultsofanychangesmadeduringatransactionarenotvisibleuntilthetransactionhascommitted.

• Durable – Theresultsofacommittedtransactionsurvivefailures

Trends

OLTPvs.OLAP• WecandivideITsystemsintotransactional(OLTP)andanalytical(OLAP).IngeneralwecanassumethatOLTPsystemsprovidesourcedatatodatawarehouses,whereasOLAPsystemshelptoanalyzeit.

.

ChallengesofScaleDiffer

MotivesBehindNoSQL

• Bigdata.• Scalability.• Dataformat.• Manageability.

BigData

• Collect.• Store.• Organize.• Analyze.• Share.

• Datagrowthoutrunstheabilitytomanageitsoweneedscalablesolutions.

Scalability

• Scaleup,Verticalscalability.– Increasingservercapacity.– AddingmoreCPU,RAM.– Managingishard.– Possibledowntimes

Scalability

• Scaleout,Horizontalscalability.– Addingserverstoexistingsystemwithlittleeffort,akaElasticallyscalable.

• Bugs,hardwareerrors,thingsfailallthetime.• Itshouldbecomecheaper.Costefficiency.

– Sharednothing.– Useofcommodity/cheaphardware.– Heterogeneoussystems.– ServiceOrientedArchitecture.Localstates.

• Decentralizedtoreducebottlenecks.• Avoidsinglepointoffailures.

WhatisWrongWithRDBMS?• Nothing.Onesizefitsall?Notreally.• Rigidschemadesignà Efficiency issueinbigscale.• Hardertoscale.• Replicationà Expensive.• Joinsacrossmultiplenodes?Hard.• HowdoesRDMShandledatagrowth?Hard.• NeedforaDBA.• Manyprogrammersalreadyfamiliarwithità Good.• Transactions/ACIDmakedevelopmenteasyà Good.• Lotsoftoolstouseà Good.

ACIDSemantics

• Atomicity,Consistency,Isolation,Durability• AnydatastorecanachieveAtomicity,IsolationandDurabilitybutdoyoualwaysneedconsistency?No.

• BygivingupACIDproperties,onecanachievehigherperformanceandscalability.

ThePerfectStorm

• Largedatasets,acceptanceofalternatives,anddynamically-typeddatahascometogetherinaperfectstorm

• Notabacklash/rebellionagainstRDBMS• SQLisarichquerylanguagethatcannotberivaledbythecurrentlistofNoSQLofferings

CAPTheorem

• Threepropertiesofasystem:– consistency,availabilityandpartitions

• Youcanhaveatmosttwoofthesethreepropertiesforanyshared-datasystem

• Toscaleout,youhavetopartition.Thatleaveseitherconsistencyoravailability tochoosefrom– Inalmostallcases,youwouldchooseavailabilityoverconsistency

CAPTheorem(cont.)

Consistency

Partitiontolerance

Availability

C:Onceawriterhaswritten,allreaderswillseethatwrite

A:Systemisavailableduringsoftwareandhardwareupgradesandnodefailures.

P:Asystemcancontinuetooperateinthepresenceofanetworkpartitions.

Theorem:Youcanhaveatmosttwo ofthesepropertiesforanyshared-datasystem

BASE,anACIDAlternative

• AlmosttheoppositeofACID.• BasicallyAvailable:Nodesintheadistributedenvironmentcangodown,butthewholesystemshouldn’tbeaffected.

• SoftState(scalable):Thestateofthesystemanddatachangesovertime.

• EventualConsistency:Givenenoughtime,datawillbeconsistentacrossthedistributedsystem.

AClashofculturesACID:

• Strongconsistency.• Lessavailability.• Pessimisticconcurrency.• Complex.

BASE:• Availability isthemostimportantthing.Willingtosacrificeforthis(CAP).

• Weakerconsistency(Eventual).• Besteffort.• Simpleandfast.• Optimistic.

NoSQLDefinition

• Fromwww.nosql-database.org:– NextGenerationDatabasesmostlyaddressingsomeofthepoints:beingnon-relational,distributed, open-source and horizontalscalable.Theoriginalintentionhasbeenmodernweb-scaledatabases.Themovementbeganearly2009andisgrowingrapidly.

– Oftenmorecharacteristicsapplyas:schema-free,easyreplicationsupport,simpleAPI,eventuallyconsistent/BASE(notACID),ahugedataamount,andmore.

WhatisNoSQL?

• StandsforNotOnlySQL• Classofnon-relationaldatastoragesystems• Usuallydonotrequireafixedtableschemanordotheyusetheconceptofjoins

• AllNoSQL offeringsrelaxoneormoreoftheACIDproperties(followingCAPtheorem)

WhyNoSQL?

• Fordatastorage,anRDBMScannotbethebe-all/end-all

• Justastherearedifferentprogramminglanguages,needtohaveotherdatastoragetoolsinthetoolbox

• ANoSQL solutionismoreacceptabletoaclientnowthanevenayearago

NoSQLDistinguishingCharacteristics

• Largedatavolumes– Google’s“bigdata”

• Scalablereplicationanddistribution– Potentiallythousandsof

machines– Potentiallydistributed

aroundtheworld• Queriesneedtoreturn

answersquickly• Mostlyquery,few

updates

• AsynchronousInserts&Updates

• Schema-less• ACIDtransaction

propertiesarenotneeded– BASE

• CAPTheorem• Opensource

development

WhatkindsofNoSQL• NoSQL solutionsfallintotwomajorareas:

– Key/Valueor‘thebighashtable’.• AmazonS3(Dynamo)• Voldemort• Scalaris• Memcached (in-memorykey/valuestore)• Redis

– Schema-lesswhichcomesinmultipleflavors,column-based,document-basedorgraph-based.

• Cassandra(column-based)• CouchDB (document-based)• MongoDB(document-based)• Neo4J(graph-based)• HBase (column-based)

• Thekey-valuedatamodelisbasedonastructurecomposedoftwodataelements:akeyandavalue,inwhicheverykeyhasacorrespondingvalueorsetofvalues.

• Thekeyvaluemodelisalsoreferredtoastheattribute-valueorassociativedatamodel.

• Considertheexampleshownonthenextslidewhichillustratesasmalltruck-drivingcompanycalledTrucks-R-Us.Eachofitsthreedrivershasoneormorecertificationsandothergeneralinformation.

• Basedonthisexamplewecandefinethefollowingimportantpoints:

NoSQL Databases

NoSQL Databases

DID CERT1 CERT2 DOB LICTYPECERT32732 80 90 1/24/1962 P

2946 92 90 4/11/19703650 86 6/27/1968 C

DID KEY VALUE2732 CERT1 80

2732 CERT3 90

2732 DOB 1/24/1962

2732 LICTYPE P2946 CERT2 92

2946 CERT3 90

2946 DOB 4/11/1970

3650 CERT1 863650 DOB 6/27/1968

36500 LICTYPE C

Datastoredusingtraditionalrelationalmodel Datastoredusingkey-valuemodel

Driver2732

• In the relational model:• Each row represents one entity instance• Each column represents one attribute of the

entity• The values in a column are of the same data

type• In the key-value model:

• Each row represents one attribute/value of one entity instance

• The “key” column could represent any entity’s attribute

• The values in the “value” column could be of any data type and therefore it is generally assigned a long string data type

• Inthekey-valuemodel,eachrowrepresentsoneattributeofoneentityinstance.The“key” columnpointstoanattributeandthe“value” columncontainstheactualvaluefortheattribute.

• Thedatatypeofthe“value” columnisgenerallyalongstringtoaccommodatethevarietyofactualdatatypesofthevaluesplacedinthecolumn.

• Toaddanewentityattributeintherelationalmodel,youneedtomodifythetabledefinition(schemamodification).Toaddanewattributeinthekey-valuemodel,youaddarowtothekey-valuestore,whichiswhyitissaidtobe“schema-less”.

NoSQL Databases

• NoSQL databasesdonotstoreorenforcerelationshipsamongentities.Theprogrammerisrequiredtomanagetherelationshipsinthecode.Also,alldataandintegrityvalidationsmustbedoneinthecode.

• NoSQL databasesusetheirownnativeapplicationprogramminginterface(API)withsimpledataaccesscommands,suchasput,read,anddelete.BecausethereisnodeclarativeSQL-likesyntaxtoretrievedata,theprogramcodemusttakecareofretrievingrelateddatainthecorrectway.

• Indexingandsearchescanbedifficult.Becausethe“value”columninthekey-valuemodelcouldcontainmanydifferentdatatypes,itisoftendifficulttocreateindicesonthedata.Atthesametime,searchescanbecomeverycomplex.

NoSQL Databases

• One of the big advantages of NoSQL databases is that theygenerally use a distributedarchitecture.

• NoSQL databases can handle very large volumes of data. Inparticular, they are suited for sparse data.

• Sparse data occurs when the number of attributes is verylarge, but the number of actual data instances is low. Usingthe truck driving company as an example, drivers can take anycertification exam, but they are not required to take them all.In this case there were three drivers and three possiblecertificates for each driver, so there will be nine possible datapoints (we illustrated only 4). Extrapolate this to a hospitalwith 50,000 patients and more than 2500 possible medicaltests that could be performed. We would not expect to see50000 x 2500 data points, in reality we would see far less.

NoSQL Databases

• Most NoSQL databases are geared toward performancerather than transaction consistency.

• Enforcing data consistency on a distributed database is a verydifficult problem. Distributed databases make copies of dataelements at multiple nodes to ensure high availability andfault tolerance. Updating values and requiring consistencyamongst all copies is a very tough problem to solve. NoSQLdatabase sacrifice this consistency to attain high levels ofperformance.

• Some NoSQL databases provide a feature called eventualconsistency, which means that updates to data will propagatethrough the system and eventually all copies will beconsistent. With eventual consistency, data are notguaranteed to be consistent across all copies immediatelyafter an update.

NoSQL Databases

Key/ValueDataModel

• Pros:– veryfast– veryscalable– simplemodel– abletodistributehorizontally

• Cons:– manydatastructures(objects)can'tbeeasilymodeledaskeyvaluepairs

Schema-LessDataModel

• Pros:– Schema-lessdatamodelisricherthankey/valuepairs

– eventualconsistency– manyaredistributed– stillprovideexcellentperformanceandscalability

• Cons:– typicallynoACIDtransactionsorjoins

CommonAdvantages

• Cheap,easytoimplement(opensource)• Dataarereplicatedtomultiplenodes(thereforeidenticalandfault-tolerant)andcanbepartitioned– Downnodeseasilyreplaced– Nosinglepointoffailure

• Easytodistribute• Don'trequireaschema• Canscaleupanddown• Relaxthedataconsistencyrequirement(CAP)

WhatamIgivingup?

• Features– joins– groupby– orderby– ACIDtransactions

• SQLissometimesfrustratingbutstillapowerfulquerylanguage

• EasyintegrationwithotherapplicationsthatsupportSQL

StoringandModifyingData

• Syntaxvaries– HTML– JavaScript– Etc.

• Asynchronous– Insertsandupdatesdonotwaitforconfirmation

• Versioned• OptimisticConcurrency

RetrievingData

• SyntaxVaries– Noset-basedquerylanguage– ProceduralprogramlanguagessuchasJava,C,etc.

• Applicationspecifiesretrievalpath• Noqueryoptimizer• Quickanswerisimportant• Maynotbeasingle“right” answer

OpenSource

• Smallupfrontsoftwarecosts• Suitableforlargescaledistributiononcommodityhardware

NoSQLSummary

• NoSQL databasesreject:– OverheadofACIDtransactions– “Complexity” ofSQL– Burdenofup-frontschemadesign– Declarativequeryexpression– Yesterday’stechnology

• Programmerresponsiblefor– Step-by-stepprocedurallanguage– Navigatingaccesspath

Summary

• SQLDatabases– PredefinedSchema– Standarddefinitionandinterfacelanguage– Tightconsistency– Welldefinedsemantics

• NoSQLDatabase– NopredefinedSchema– Per-productdefinitionandinterfacelanguage– Gettingananswerquicklyismoreimportantthangettingacorrectanswer

WebReferences• “NoSQL-- YourUltimateGuidetotheNon- RelationalUniverse!”http://nosql-database.org/links.html

• “NoSQL(RDBMS)”http://en.wikipedia.org/wiki/NoSQL

• “ScalableSQL”,ACMQueue,MichaelRys,April19,2011http://queue.acm.org/detail.cfm?id=1971597

• “apracticalguidetonoSQL”,PostedbyDeniseMiuraonMarch17,2011athttp://blogs.marklogic.com/2011/03/17/a-practical-guide-to-nosql/