Data Informatics
Database
• Relational databases – mainstay of business
• Web-based applications caused spikes
  – Especially true for public-facing e-commerce sites
• Issues with scaling up when the dataset is just too big
• RDBMS were not designed to be distributed
• Began to look at multi-node database solutions
• Known as 'scaling out' or 'horizontal scaling'
Short Review of SQL
SQL (Structured Query Language) is the standard language for querying and manipulating data.
• Data Definition Language (DDL)
  – Create/alter/delete tables and their attributes
• Data Manipulation Language (DML)
  – Query one or more tables (discussed next)
  – Insert/delete/modify tuples in tables
Tables in SQL
Table name: Product
Attribute names (first line) and tuples or rows (remaining lines):
PName        Price    Category     Manufacturer
Gizmo        $19.99   Gadgets      GizmoWorks
Powergizmo   $29.99   Gadgets      GizmoWorks
SingleTouch  $149.99  Photography  Canon
MultiTouch   $203.99  Household    Hitachi
Tables Explained
• The schema of a table is the table name and its attributes: Product(PName, Price, Category, Manufacturer)
• A key is an attribute whose values are unique; on the slide the key is underlined. In Product(PName, Price, Category, Manufacturer) the key is PName.
SQL Query
Basic form (plus many, many more bells and whistles):
SELECT <attributes>
FROM <one or more relations>
WHERE <conditions>
Simple SQL Query
Product
PName        Price    Category     Manufacturer
Gizmo        $19.99   Gadgets      GizmoWorks
Powergizmo   $29.99   Gadgets      GizmoWorks
SingleTouch  $149.99  Photography  Canon
MultiTouch   $203.99  Household    Hitachi
SELECT *
FROM Product
WHERE Category = 'Gadgets'
Result (“selection”):
PName        Price    Category     Manufacturer
Gizmo        $19.99   Gadgets      GizmoWorks
Powergizmo   $29.99   Gadgets      GizmoWorks
Simple SQL Query (cont.)
Product
PName        Price    Category     Manufacturer
Gizmo        $19.99   Gadgets      GizmoWorks
Powergizmo   $29.99   Gadgets      GizmoWorks
SingleTouch  $149.99  Photography  Canon
MultiTouch   $203.99  Household    Hitachi
SELECT PName, Price, Manufacturer
FROM Product
WHERE Price > 100
Result (“selection” and “projection”):
PName        Price    Manufacturer
SingleTouch  $149.99  Canon
MultiTouch   $203.99  Hitachi
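The two queries above can be tried out with a short script. This is a minimal sketch using Python's built-in sqlite3 module and an in-memory database; the column types (TEXT, REAL) and the ORDER BY clauses are assumptions added here so the output order is predictable, they are not part of the slides.

```python
import sqlite3

# In-memory database holding the Product table from the slides.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Product ("
    "PName TEXT PRIMARY KEY, Price REAL, Category TEXT, Manufacturer TEXT)"
)
conn.executemany(
    "INSERT INTO Product VALUES (?, ?, ?, ?)",
    [
        ("Gizmo", 19.99, "Gadgets", "GizmoWorks"),
        ("Powergizmo", 29.99, "Gadgets", "GizmoWorks"),
        ("SingleTouch", 149.99, "Photography", "Canon"),
        ("MultiTouch", 203.99, "Household", "Hitachi"),
    ],
)

# Selection: keep only the rows that satisfy the WHERE condition.
gadgets = conn.execute(
    "SELECT * FROM Product WHERE Category = 'Gadgets' ORDER BY PName"
).fetchall()

# Selection and projection: filter the rows, then keep three columns.
expensive = conn.execute(
    "SELECT PName, Price, Manufacturer FROM Product "
    "WHERE Price > 100 ORDER BY Price"
).fetchall()

print(gadgets)
print(expensive)
```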
Joins
Product
PName        Price    Category     Manufacturer
Gizmo        $19.99   Gadgets      GizmoWorks
Powergizmo   $29.99   Gadgets      GizmoWorks
SingleTouch  $149.99  Photography  Canon
MultiTouch   $203.99  Household    Hitachi
Company
CName       StockPrice  Country
GizmoWorks  25          USA
Canon       65          Japan
Hitachi     15          Japan
SELECT PName, Price
FROM Product, Company
WHERE Manufacturer = CName AND Country = 'Japan' AND Price <= 200
Result:
PName        Price
SingleTouch  $149.99
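The join above can be reproduced in the same way. This sketch again assumes Python's sqlite3 module and illustrative column types; it also runs an equivalent query written with explicit JOIN ... ON syntax, which is an addition here and not something the slides show.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Company (CName TEXT PRIMARY KEY, StockPrice INTEGER, Country TEXT);
    CREATE TABLE Product (
        PName TEXT PRIMARY KEY, Price REAL, Category TEXT,
        Manufacturer TEXT REFERENCES Company(CName)
    );
    INSERT INTO Company VALUES
        ('GizmoWorks', 25, 'USA'), ('Canon', 65, 'Japan'), ('Hitachi', 15, 'Japan');
    INSERT INTO Product VALUES
        ('Gizmo', 19.99, 'Gadgets', 'GizmoWorks'),
        ('Powergizmo', 29.99, 'Gadgets', 'GizmoWorks'),
        ('SingleTouch', 149.99, 'Photography', 'Canon'),
        ('MultiTouch', 203.99, 'Household', 'Hitachi');
    """
)

# The query from the slide: products made in Japan that cost at most 200.
rows = conn.execute(
    "SELECT PName, Price FROM Product, Company "
    "WHERE Manufacturer = CName AND Country = 'Japan' AND Price <= 200"
).fetchall()

# The same join written with explicit JOIN ... ON syntax.
rows_explicit = conn.execute(
    "SELECT PName, Price FROM Product JOIN Company ON Manufacturer = CName "
    "WHERE Country = 'Japan' AND Price <= 200"
).fetchall()

print(rows)
```

MultiTouch is also made in Japan, but its price is above 200, so only SingleTouch is returned.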
SQL Physical Layer Abstraction
• Applications specify what, not how
• Query optimization engine
• Physical layer can change without modifying applications
  – Create indexes to support queries
  – In-memory databases
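To illustrate that the physical layer can change without touching the application, the sketch below runs one unchanged query before and after an index is created. It assumes Python's sqlite3 module and SQLite's EXPLAIN QUERY PLAN statement; the index name idx_product_category and the helper plan() are made up for this example, and the exact wording of the plan text differs between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Product (PName TEXT, Price REAL, Category TEXT, Manufacturer TEXT)"
)
conn.executemany(
    "INSERT INTO Product VALUES (?, ?, ?, ?)",
    [
        ("Gizmo", 19.99, "Gadgets", "GizmoWorks"),
        ("Powergizmo", 29.99, "Gadgets", "GizmoWorks"),
        ("SingleTouch", 149.99, "Photography", "Canon"),
        ("MultiTouch", 203.99, "Household", "Hitachi"),
    ],
)

# The application's query: it says what is wanted, not how to find it.
QUERY = "SELECT PName FROM Product WHERE Category = 'Gadgets'"


def plan(sql):
    """Return the optimizer's plan as text (last column of each plan row)."""
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


rows_before = sorted(conn.execute(QUERY).fetchall())
plan_before = plan(QUERY)

# Physical-layer change only: the query text stays exactly the same.
conn.execute("CREATE INDEX idx_product_category ON Product (Category)")

rows_after = sorted(conn.execute(QUERY).fetchall())
plan_after = plan(QUERY)

print(plan_before)
print(plan_after)
```

The answer is identical in both runs; only the access path chosen by the optimizer changes.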
Transactions – ACID Properties
• Atomic – All of the work in a transaction completes (commit) or none of it completes.
• Consistent – A transaction transforms the database from one consistent state to another consistent state. Consistency is defined in terms of constraints.
• Isolated – The results of any changes made during a transaction are not visible until the transaction has committed.
• Durable – The results of a committed transaction survive failures.
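Atomicity and constraint-based consistency can be seen in a small experiment. The following is a sketch under stated assumptions: the Account table, the names alice and bob, and the transfer() helper are invented for illustration, and the example relies on Python's sqlite3 connection acting as a context manager that commits on success and rolls back when an exception escapes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint defines what "consistent" means for this table.
conn.execute(
    "CREATE TABLE Account (Name TEXT PRIMARY KEY, Balance INTEGER CHECK (Balance >= 0))"
)
conn.executemany("INSERT INTO Account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(src, dst, amount):
    """Move money between accounts; both updates apply, or neither does."""
    try:
        with conn:  # commit on success, roll back if an exception escapes
            conn.execute(
                "UPDATE Account SET Balance = Balance + ? WHERE Name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE Account SET Balance = Balance - ? WHERE Name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer("alice", "bob", 30)       # succeeds: alice 70, bob 80
failed = transfer("alice", "bob", 500)  # debit violates the CHECK constraint
balances = dict(conn.execute("SELECT Name, Balance FROM Account"))
print(ok, failed, balances)
```

In the failed transfer the credit to bob has already been applied when the debit is rejected, and the rollback undoes it, so no half-finished transfer is ever visible.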
OLTP vs. OLAP
• We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general, we can assume that OLTP systems provide source data to data warehouses, whereas OLAP systems help to analyze it.
Big Data
• Collect.
• Store.
• Organize.
• Analyze.
• Share.
• Data growth outruns the ability to manage it, so we need scalable solutions.
Scalability
• Scale up, vertical scalability.
  – Increasing server capacity.
  – Adding more CPU, RAM.
  – Managing is hard.
  – Possible downtime.
Scalability (cont.)
• Scale out, horizontal scalability.
  – Adding servers to an existing system with little effort, aka elastically scalable.
• Bugs, hardware errors: things fail all the time.
• It should become cheaper. Cost efficiency.
  – Shared nothing.
  – Use of commodity/cheap hardware.
  – Heterogeneous systems.
  – Service-oriented architecture. Local states.
• Decentralized to reduce bottlenecks.
• Avoid single points of failure.
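One common way to scale out in a shared-nothing fashion is to hash each key to the node that owns it. The sketch below is illustrative only: the node names, the key format and the node_for() helper are assumptions, and plain modulo hashing is used for brevity even though it moves most keys whenever a node is added or removed.

```python
import hashlib


def node_for(key, nodes):
    """Pick the node that owns a key by hashing the key (hypothetical helper)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


nodes = ["node-a", "node-b", "node-c"]
keys = [f"driver:{i}" for i in range(1000)]

# Each node stores only its own share of the keys (shared nothing).
placement = {}
for key in keys:
    placement.setdefault(node_for(key, nodes), []).append(key)

for node in nodes:
    print(node, len(placement.get(node, [])))
```

Because the owner of a key is computed from the key itself, any client can locate data without asking a central coordinator, which fits the goal of avoiding bottlenecks and single points of failure.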
What is Wrong With RDBMS?
• Nothing. One size fits all? Not really.
• Rigid schema design → efficiency issue at big scale.
• Harder to scale.
• Replication → expensive.
• Joins across multiple nodes? Hard.
• How does an RDBMS handle data growth? Hard.
• Need for a DBA.
• Many programmers already familiar with it → good.
• Transactions/ACID make development easy → good.
• Lots of tools to use → good.
ACID Semantics
• Atomicity, Consistency, Isolation, Durability
• Any data store can achieve Atomicity, Isolation and Durability, but do you always need consistency? No.
• By giving up ACID properties, one can achieve higher performance and scalability.
The Perfect Storm
• Large datasets, acceptance of alternatives, and dynamically-typed data have come together in a perfect storm
• Not a backlash/rebellion against RDBMS
• SQL is a rich query language that cannot be rivaled by the current list of NoSQL offerings
CAP Theorem
• Three properties of a system:
  – consistency, availability and partition tolerance
• You can have at most two of these three properties for any shared-data system
• To scale out, you have to partition. That leaves either consistency or availability to choose from
  – In almost all cases, you would choose availability over consistency
CAP Theorem (cont.)
• Consistency (C): Once a writer has written, all readers will see that write.
• Availability (A): System is available during software and hardware upgrades and node failures.
• Partition tolerance (P): A system can continue to operate in the presence of network partitions.
Theorem: You can have at most two of these properties for any shared-data system.
BASE, an ACID Alternative
• Almost the opposite of ACID.
• Basically Available: Nodes in a distributed environment can go down, but the whole system shouldn’t be affected.
• Soft State (scalable): The state of the system and data changes over time.
• Eventual Consistency: Given enough time, data will be consistent across the distributed system.
A Clash of Cultures
ACID:
• Strong consistency.
• Less availability.
• Pessimistic concurrency.
• Complex.
BASE:
• Availability is the most important thing. Willing to sacrifice for this (CAP).
• Weaker consistency (eventual).
• Best effort.
• Simple and fast.
• Optimistic.
NoSQL Definition
• From www.nosql-database.org:
  – Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable. The original intention has been modern web-scale databases. The movement began early 2009 and is growing rapidly.
  – Often more characteristics apply, such as: schema-free, easy replication support, simple API, eventually consistent/BASE (not ACID), a huge data amount, and more.
What is NoSQL?
• Stands for Not Only SQL
• Class of non-relational data storage systems
• Usually do not require a fixed table schema, nor do they use the concept of joins
• All NoSQL offerings relax one or more of the ACID properties (following the CAP theorem)
Why NoSQL?
• For data storage, an RDBMS cannot be the be-all/end-all
• Just as there are different programming languages, we need to have other data storage tools in the toolbox
• A NoSQL solution is more acceptable to a client now than even a year ago
NoSQL Distinguishing Characteristics
• Large data volumes
  – Google’s “big data”
• Scalable replication and distribution
  – Potentially thousands of machines
  – Potentially distributed around the world
• Queries need to return answers quickly
• Mostly query, few updates
• Asynchronous inserts & updates
• Schema-less
• ACID transaction properties are not needed
  – BASE
• CAP Theorem
• Open source development
What Kinds of NoSQL
• NoSQL solutions fall into two major areas:
  – Key/Value or ‘the big hash table’.
    • Amazon S3 (Dynamo)
    • Voldemort
    • Scalaris
    • Memcached (in-memory key/value store)
    • Redis
  – Schema-less, which comes in multiple flavors: column-based, document-based or graph-based.
    • Cassandra (column-based)
    • CouchDB (document-based)
    • MongoDB (document-based)
    • Neo4J (graph-based)
    • HBase (column-based)
• The key-value data model is based on a structure composed of two data elements: a key and a value, in which every key has a corresponding value or set of values.
• The key-value model is also referred to as the attribute-value or associative data model.
• Consider the example shown on the next slide, which illustrates a small truck-driving company called Trucks-R-Us. Each of its three drivers has one or more certifications and other general information.
• Based on this example we can define the following important points:
NoSQL Databases
Data stored using traditional relational model (- marks an empty cell):
DID   CERT1  CERT2  CERT3  DOB        LICTYPE
2732  80     -      90     1/24/1962  P
2946  -      92     90     4/11/1970  -
3650  86     -      -      6/27/1968  C
Data stored using key-value model:
DID   KEY      VALUE
2732  CERT1    80
2732  CERT3    90
2732  DOB      1/24/1962
2732  LICTYPE  P
2946  CERT2    92
2946  CERT3    90
2946  DOB      4/11/1970
3650  CERT1    86
3650  DOB      6/27/1968
3650  LICTYPE  C
• In the relational model:
  – Each row represents one entity instance
  – Each column represents one attribute of the entity
  – The values in a column are of the same data type
• In the key-value model:
  – Each row represents one attribute/value of one entity instance
  – The “key” column could represent any entity’s attribute
  – The values in the “value” column could be of any data type and therefore it is generally assigned a long string data type
• In the key-value model, each row represents one attribute of one entity instance. The “key” column points to an attribute and the “value” column contains the actual value for the attribute.
• The data type of the “value” column is generally a long string to accommodate the variety of actual data types of the values placed in the column.
• To add a new entity attribute in the relational model, you need to modify the table definition (schema modification). To add a new attribute in the key-value model, you add a row to the key-value store, which is why it is said to be “schema-less”.
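The mapping from the relational layout to the key-value layout can be sketched in a few lines of Python. The driver data comes from the Trucks-R-Us example; the to_key_value() helper and the ENDORSEMENT attribute are hypothetical and exist only to illustrate the point about adding attributes without a schema change.

```python
# Relational view: one dict per row, None for an empty cell.
drivers = [
    {"DID": 2732, "CERT1": 80, "CERT2": None, "CERT3": 90,
     "DOB": "1/24/1962", "LICTYPE": "P"},
    {"DID": 2946, "CERT1": None, "CERT2": 92, "CERT3": 90,
     "DOB": "4/11/1970", "LICTYPE": None},
    {"DID": 3650, "CERT1": 86, "CERT2": None, "CERT3": None,
     "DOB": "6/27/1968", "LICTYPE": "C"},
]


def to_key_value(rows, id_column="DID"):
    """Turn each non-empty cell into one (id, key, value) row."""
    triples = []
    for row in rows:
        for attribute, value in row.items():
            if attribute != id_column and value is not None:
                # Values of any type are stored as strings, as in the slides.
                triples.append((row[id_column], attribute, str(value)))
    return triples


kv_rows = to_key_value(drivers)

# Adding a new attribute needs no schema change: just add a row.
# ENDORSEMENT is a made-up attribute for illustration.
kv_rows_extended = kv_rows + [(2732, "ENDORSEMENT", "HAZMAT")]

for triple in kv_rows:
    print(triple)
```

Empty cells simply produce no row, so the key-value form stores ten rows for this example instead of a three-by-five grid of attribute cells.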
NoSQL Databases
• NoSQL databases do not store or enforce relationships among entities. The programmer is required to manage the relationships in the code. Also, all data and integrity validations must be done in the code.
• NoSQL databases use their own native application programming interface (API) with simple data access commands, such as put, read, and delete. Because there is no declarative SQL-like syntax to retrieve data, the program code must take care of retrieving related data in the correct way.
• Indexing and searches can be difficult. Because the “value” column in the key-value model could contain many different data types, it is often difficult to create indices on the data. At the same time, searches can become very complex.
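What "the programmer manages the relationships in the code" means can be shown with a toy store. This is a sketch, not any real product's API: the KeyValueStore class, the key naming scheme ("product:...", "company:...") and product_with_company() are all assumptions made for illustration.

```python
class KeyValueStore:
    """A toy in-memory store exposing only put, read and delete."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def read(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()
store.put("company:Canon", {"country": "Japan"})
store.put("product:SingleTouch", {"price": 149.99, "manufacturer": "Canon"})


def product_with_company(store, pname):
    """Application-side 'join': follow the manufacturer reference by hand."""
    product = store.read(f"product:{pname}")
    if product is None:
        return None
    # Nothing in the store guarantees that the company exists;
    # the application has to handle a dangling reference itself.
    company = store.read(f"company:{product['manufacturer']}")
    return {**product, "company": company}


joined = product_with_company(store, "SingleTouch")

# The store happily deletes a company that a product still refers to.
store.delete("company:Canon")
dangling = product_with_company(store, "SingleTouch")

print(joined)
print(dangling)
```

A relational database could reject the delete through a foreign-key constraint; here the broken reference only shows up when the application next follows it.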
NoSQL Databases
• One of the big advantages of NoSQL databases is that they generally use a distributed architecture.
• NoSQL databases can handle very large volumes of data. In particular, they are suited for sparse data.
• Sparse data occurs when the number of attributes is very large, but the number of actual data instances is low. Using the truck driving company as an example, drivers can take any certification exam, but they are not required to take them all. In this case there were three drivers and three possible certificates for each driver, so there are nine possible data points (the example fills in only five of them). Extrapolate this to a hospital with 50,000 patients and more than 2,500 possible medical tests that could be performed. We would not expect to see 50,000 x 2,500 data points; in reality we would see far fewer.
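The hospital arithmetic can be made concrete. In the sketch below the figure of 20 tests per patient is purely an assumption chosen for illustration; only the 50,000 patients and 2,500 possible tests come from the text.

```python
patients = 50_000
possible_tests = 2_500

# A relational table with one column per test has a cell for every pair.
possible_cells = patients * possible_tests

# Assumption for illustration: each patient has actually taken 20 tests.
tests_per_patient = 20
stored_rows = patients * tests_per_patient  # rows in a key-value layout

fill_ratio = stored_rows / possible_cells

print(f"possible cells: {possible_cells:,}")
print(f"stored rows:    {stored_rows:,}")
print(f"fill ratio:     {fill_ratio:.1%}")
```

Under that assumption fewer than one cell in a hundred would hold a value, which is the situation where storing only the rows that exist pays off.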
NoSQL Databases
• Most NoSQL databases are geared toward performance rather than transaction consistency.
• Enforcing data consistency on a distributed database is a very difficult problem. Distributed databases make copies of data elements at multiple nodes to ensure high availability and fault tolerance. Updating values and requiring consistency amongst all copies is a very tough problem to solve. NoSQL databases sacrifice this consistency to attain high levels of performance.
• Some NoSQL databases provide a feature called eventual consistency, which means that updates to data will propagate through the system and eventually all copies will be consistent. With eventual consistency, data are not guaranteed to be consistent across all copies immediately after an update.
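Eventual consistency can be imitated with a toy model in which a write reaches one replica immediately and the other replicas later. This is a single-process sketch with invented names (Replica, EventuallyConsistentStore, propagate); real systems replicate over a network and must also resolve conflicting writes, which is not modeled here.

```python
from collections import deque


class Replica:
    """One copy of the data."""

    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """Writes land on replica 0 at once and reach the others later."""

    def __init__(self, replica_count=3):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.pending = deque()  # updates not yet applied everywhere

    def write(self, key, value):
        self.replicas[0].data[key] = value
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def read(self, key, replica_index):
        return self.replicas[replica_index].data.get(key)

    def propagate(self):
        """Deliver all queued updates, as background replication would."""
        while self.pending:
            index, key, value = self.pending.popleft()
            self.replicas[index].data[key] = value


store = EventuallyConsistentStore()
store.write("driver:2732:CERT1", "80")

stale = store.read("driver:2732:CERT1", replica_index=2)  # not there yet
store.propagate()
fresh = store.read("driver:2732:CERT1", replica_index=2)  # now consistent

print(stale, fresh)
```

Between the write and the propagation step, a reader that happens to hit replica 2 sees the old state, which is exactly the window the last bullet describes.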
Key/Value Data Model
• Pros:
  – very fast
  – very scalable
  – simple model
  – able to distribute horizontally
• Cons:
  – many data structures (objects) can't be easily modeled as key-value pairs
Schema-Less Data Model
• Pros:
  – Schema-less data model is richer than key/value pairs
  – eventual consistency
  – many are distributed
  – still provide excellent performance and scalability
• Cons:
  – typically no ACID transactions or joins
Common Advantages
• Cheap, easy to implement (open source)
• Data are replicated to multiple nodes (therefore identical and fault-tolerant) and can be partitioned
  – Down nodes easily replaced
  – No single point of failure
• Easy to distribute
• Don't require a schema
• Can scale up and down
• Relax the data consistency requirement (CAP)
What am I giving up?
• Features
  – joins
  – group by
  – order by
  – ACID transactions
• SQL is sometimes frustrating but still a powerful query language
• Easy integration with other applications that support SQL
Storing and Modifying Data
• Syntax varies
  – HTML
  – JavaScript
  – Etc.
• Asynchronous
  – Inserts and updates do not wait for confirmation
• Versioned
• Optimistic concurrency
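Versioning and optimistic concurrency usually go together: a writer states which version it read, and the write is rejected if someone else has updated the value since. The sketch below is one generic way to do this; VersionedStore, VersionConflict and the cart example are illustrative names, not the API of any particular NoSQL product.

```python
class VersionConflict(Exception):
    """Raised when the value changed since the caller last read it."""


class VersionedStore:
    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        """Return (version, value); version 0 means the key does not exist."""
        return self._data.get(key, (0, None))

    def write(self, key, value, expected_version):
        """Store value only if nobody else has written since expected_version."""
        current_version, _ = self.read(key)
        if current_version != expected_version:
            raise VersionConflict(
                f"{key}: expected v{expected_version}, found v{current_version}"
            )
        self._data[key] = (current_version + 1, value)
        return current_version + 1


store = VersionedStore()
store.write("cart:42", ["gizmo"], expected_version=0)

# Two clients read the same version without taking any lock.
version_a, cart_a = store.read("cart:42")
version_b, cart_b = store.read("cart:42")

# Client A writes first and wins.
store.write("cart:42", cart_a + ["powergizmo"], expected_version=version_a)

# Client B's write is based on a stale version and is rejected.
try:
    store.write("cart:42", cart_b + ["singletouch"], expected_version=version_b)
    conflict = False
except VersionConflict:
    conflict = True

print(conflict, store.read("cart:42"))
```

No lock is held while the clients work; the cost is that the losing client must re-read and retry, which is acceptable when conflicts are rare.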
Retrieving Data
• Syntax varies
  – No set-based query language
  – Procedural program languages such as Java, C, etc.
• Application specifies retrieval path
• No query optimizer
• Quick answer is important
• May not be a single “right” answer
NoSQL Summary
• NoSQL databases reject:
  – Overhead of ACID transactions
  – “Complexity” of SQL
  – Burden of up-front schema design
  – Declarative query expression
  – Yesterday’s technology
• Programmer responsible for
  – Step-by-step procedural language
  – Navigating access path
Summary
• SQL Databases
  – Predefined schema
  – Standard definition and interface language
  – Tight consistency
  – Well-defined semantics
• NoSQL Databases
  – No predefined schema
  – Per-product definition and interface language
  – Getting an answer quickly is more important than getting a correct answer
Web References
• “NoSQL -- Your Ultimate Guide to the Non-Relational Universe!” http://nosql-database.org/links.html
• “NoSQL (RDBMS)” http://en.wikipedia.org/wiki/NoSQL
• “Scalable SQL”, ACM Queue, Michael Rys, April 19, 2011 http://queue.acm.org/detail.cfm?id=1971597
• “a practical guide to noSQL”, posted by Denise Miura on March 17, 2011 at http://blogs.marklogic.com/2011/03/17/a-practical-guide-to-nosql/