Analytics: SQL or NoSQL?
Richard Taylor
Chair, Business Intelligence SIG
The NoSQL Movement
Meetup June 11, 2009 in San Francisco
NoSQL name proposed by Eric Evans
2004 BigTable (Google)
2007 Dynamo (Amazon)
2008 Cassandra (Facebook)
Hadoop/HBase (Yahoo)
Project Voldemort (LinkedIn)
NoSQL Conferences
Relational Database/SQL
Database Timeline
1969 CODASYL: network database, schema, DDL/DML
1970 Codd: Relational Model
1974 SEQUEL
1979 Oracle
1980 Gray: Transaction
1981 Bernstein and Goodman: Multi-version Concurrency Control
1989 SQL-89
1992 SQL-92
1995 Berenson et al.: Critique of ANSI SQL Isolation Levels
1999 SQL:1999: Object Relational
2003 SQL:2003: Analytics extensions
Relational Model
• Table – n-tuple
• Normalized data
– “Atomic”
– Multi-column Key
• Relationship on key
– Primary Key
– Foreign Key
• Operations on tables: select, project, join
[Diagram: a table with Row, Column, and Key labeled]
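The three table operations named on this slide can be tried out with Python's built-in sqlite3 module. This is only a minimal sketch: the `customer` and `orders` tables, their columns, and the sample rows are hypothetical and are not taken from the talk.

```python
import sqlite3

# Hypothetical tables, used only to illustrate select, project, and join.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        city        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        amount      REAL NOT NULL
    );
    INSERT INTO customer VALUES (1, 'Ann', 'Oakland'), (2, 'Bob', 'San Jose');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# Select: keep only the rows that satisfy a condition.
selected = conn.execute(
    "SELECT * FROM customer WHERE city = 'Oakland'").fetchall()

# Project: keep only some of the columns.
projected = conn.execute(
    "SELECT name FROM customer ORDER BY customer_id").fetchall()

# Join: combine rows of two tables through the primary key / foreign key.
joined = conn.execute("""
    SELECT c.name, o.amount
    FROM customer AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    ORDER BY o.order_id
""").fetchall()
```

Here `orders.customer_id` plays the role of the foreign key and `customer.customer_id` the primary key it refers to.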
SQL
• Designed for Transaction Processing
• Good
– Easily handles simple cases
– Everyone has a Query Language
• Bad
– Data access language (not Turing complete)
– Declarative Language (4GL)
– Impedance mismatch with procedural languages
– Complicated cases get repetitive
Normalization
• Refine design of structured data
– “Atomic”
– No repeating groups
– Data item depends on key (and nothing else)
• Avoid modification anomalies
– Ensure every data item is stored only once
• Avoid bias to any particular pattern of querying
– Allow data to be accessed from every angle
Denormalization
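A modification anomaly is easier to see on a small case. The sketch below uses Python's sqlite3 module with hypothetical tables (`sale_flat`, `store`, `sale`) and made-up rows; it only illustrates the "stored only once" point and is not taken from the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Denormalized: the store's city is repeated on every sale row.
    CREATE TABLE sale_flat (
        receipt    INTEGER PRIMARY KEY,
        store      TEXT,
        store_city TEXT
    );
    INSERT INTO sale_flat VALUES (1, 'S1', 'Oakland'), (2, 'S1', 'Oakland');

    -- Normalized: the city depends only on the store key and is stored once.
    CREATE TABLE store (store TEXT PRIMARY KEY, city TEXT);
    CREATE TABLE sale (
        receipt INTEGER PRIMARY KEY,
        store   TEXT REFERENCES store(store)
    );
    INSERT INTO store VALUES ('S1', 'Oakland');
    INSERT INTO sale VALUES (1, 'S1'), (2, 'S1');
""")

# Updating just one flat row leaves the same store with two different cities.
conn.execute("UPDATE sale_flat SET store_city = 'Berkeley' WHERE receipt = 1")
flat_cities = conn.execute("""
    SELECT DISTINCT store_city FROM sale_flat
    WHERE store = 'S1' ORDER BY store_city
""").fetchall()

# In the normalized design the same change is a single-row update.
conn.execute("UPDATE store SET city = 'Berkeley' WHERE store = 'S1'")
norm_cities = conn.execute("""
    SELECT DISTINCT st.city
    FROM sale AS s JOIN store AS st ON st.store = s.store
""").fetchall()
```

After the updates, `flat_cities` holds two conflicting values for one store, while `norm_cities` holds one.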
Star Schema Example
Fact Table surrounded by four dimension tables: Product, Store, Promotion, Date

Fact Table columns:
– Date_key
– Store_key
– Promotion_key
– Product_key
– Receipt_number
– Quantity
– Revenue
– Unit_price

Date dimension columns:
– Date_key
– Day_in_week
– Day_in_month
– Day_in_year
– Day_name
– Week_in_month
– Week_in_year
– Month_nbr
– Month_name
– Quarter
– Year
– Holiday
– Holiday_desc
– …
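A typical star-schema query joins the fact table to a dimension and aggregates a measure. The sketch below uses Python's sqlite3 module and a subset of the column names from the slide; the table names `sales_fact` and `date_dim` and all sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Date dimension: one row per calendar day (subset of the slide's columns).
    CREATE TABLE date_dim (
        Date_key   INTEGER PRIMARY KEY,
        Day_name   TEXT,
        Month_name TEXT,
        Year       INTEGER
    );
    -- Fact table: one row per sale line, pointing at the dimension by key.
    CREATE TABLE sales_fact (
        Date_key       INTEGER REFERENCES date_dim(Date_key),
        Receipt_number INTEGER,
        Quantity       INTEGER,
        Revenue        REAL
    );
    INSERT INTO date_dim VALUES
        (1, 'Monday',  'January',  2009),
        (2, 'Tuesday', 'January',  2009),
        (3, 'Sunday',  'February', 2009);
    INSERT INTO sales_fact VALUES
        (1, 100, 2, 20.0),
        (2, 101, 1, 5.0),
        (3, 102, 3, 30.0);
""")

# Revenue by month: join facts to the Date dimension, then group.
revenue_by_month = conn.execute("""
    SELECT d.Year, d.Month_name, SUM(f.Revenue)
    FROM sales_fact AS f
    JOIN date_dim AS d ON d.Date_key = f.Date_key
    GROUP BY d.Year, d.Month_name
    ORDER BY MIN(d.Date_key)
""").fetchall()
```

The other dimensions (Product, Store, Promotion) would be joined the same way through their own keys.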
Database Summary
• Costs
– Fixed schema
– Normalization
– Transform data on load
– Cost of scaling
– Problems with large objects
– Complicated software
• Benefits
– Mature technology
– Precise querying
– Star Schema – historic data
Tuple Store/NoSQL
Tuple Storage Systems
• Google Database System
– Chubby – Lock/metadata manager
– Google File System – Distributed file system
– Bigtable – Tuple storage on GFS
– Map Reduce – Data processing on tuples
• Other tuple stores
– Voldemort – Amazon Dynamo
– Cassandra
– HBase
– Hypertable
Tuple Store Model
• One Table
• Operate on Map
– Set of (Key, Value)
– Structured Key
– Unstructured Value
• Operations: select, project, Map Reduce
[Diagram: Tuple Store maps Key to Value; the structured key is made up of Key, Column, Timestamp]
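The model on this slide can be pictured as one large map from a structured key to an opaque value. The sketch below is an in-memory toy, not how any of the systems named in the talk are implemented; the function names and the sample data are hypothetical.

```python
from typing import Dict, Optional, Tuple

# Structured key: (row key, column, timestamp). The value is unstructured bytes.
StructuredKey = Tuple[str, str, int]
store: Dict[StructuredKey, bytes] = {}

def put(row: str, column: str, timestamp: int, value: bytes) -> None:
    """Write one tuple; a new timestamp adds a version instead of overwriting."""
    store[(row, column, timestamp)] = value

def get_latest(row: str, column: str) -> Optional[bytes]:
    """Return the value with the highest timestamp for (row, column)."""
    versions = [(ts, v) for (r, c, ts), v in store.items()
                if r == row and c == column]
    if not versions:
        return None
    return max(versions, key=lambda pair: pair[0])[1]

def select_row(row: str) -> Dict[StructuredKey, bytes]:
    """Select: keep only the tuples that belong to one row key."""
    return {k: v for k, v in store.items() if k[0] == row}

def project_column(column: str) -> Dict[StructuredKey, bytes]:
    """Project: keep only the tuples for one column."""
    return {k: v for k, v in store.items() if k[1] == column}

put("example.com", "contents", 1, b"<html>v1</html>")
put("example.com", "contents", 2, b"<html>v2</html>")
put("example.com", "language", 1, b"en")
put("example.org", "contents", 1, b"<html>other</html>")
```

Select and project here only filter on parts of the key; there is no join, which matches the operation list on the slide.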
Map Reduce
• Define two functions
– Map
  • Input: tuple
  • Output: list of tuples
– Reduce
  • Input: key, list of values
  • Output: list or tuple
• Specify a cluster
• Specify input and output tuple stores
• Framework does the rest
{ Map(k1, v1) } -> { list(k2, v2) }
{ list(k2, v2) } -> { (k2, list(v2)) }
{ Reduce(k2, list(v2)) } -> { list(v3) } -> { (k2, v3) }
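The three arrows above correspond to a map phase, a grouping (shuffle) phase, and a reduce phase. The sketch below runs them in a single Python process, so it stands in for what a framework would do across a cluster; the `map_reduce` helper and the sales demo data are hypothetical.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List, Tuple

Pair = Tuple[Any, Any]

def map_reduce(
    tuples: Iterable[Pair],
    map_fn: Callable[[Any, Any], Iterable[Pair]],
    reduce_fn: Callable[[Any, List[Any]], Any],
) -> Dict[Any, Any]:
    # Map phase: each input tuple (k1, v1) yields a list of (k2, v2).
    mapped: List[Pair] = []
    for k1, v1 in tuples:
        mapped.extend(map_fn(k1, v1))

    # Shuffle phase: group the values by key, giving (k2, list(v2)).
    groups: Dict[Any, List[Any]] = defaultdict(list)
    for k2, v2 in mapped:
        groups[k2].append(v2)

    # Reduce phase: each (k2, list(v2)) collapses to a single (k2, v3).
    return {k2: reduce_fn(k2, values) for k2, values in groups.items()}

# Hypothetical demo: total quantity sold per store.
sales = [
    ("receipt-1", ("store-A", 2)),
    ("receipt-2", ("store-B", 1)),
    ("receipt-3", ("store-A", 5)),
]

def map_sale(receipt: str, sale: Tuple[str, int]) -> List[Pair]:
    store, quantity = sale
    return [(store, quantity)]

def reduce_quantity(store: str, quantities: List[int]) -> int:
    return sum(quantities)

totals = map_reduce(sales, map_sale, reduce_quantity)
```

Only `map_sale` and `reduce_quantity` are specific to the problem; the rest is the part the slide says the framework does.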
Map Reduce Example
For each web page count the number of pages that reference that page
Input tuple store is WWW
Map Function: for each anchor on web page, emit (anchorURL, 1)
Reduce Function: emit (anchorURL, sum(list))
{ Map(k1, v1) } -> { list(k2, v2) }
{ list(k2, v2) } -> { (k2, list(v2)) }
{ Reduce(k2, list(v2)) } -> { (k2, v3) }
Input tuples: (URL, Web Page), (URL, Web Page), …
Output tuple store is { (URL, count) }
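The map and reduce functions of this example are short enough to write out. In the sketch below a small dictionary stands in for the WWW tuple store, and a simple regular expression stands in for real HTML parsing; both are assumptions made to keep the example self-contained.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the WWW tuple store: URL -> web page.
www = {
    "http://a.example/": '<a href="http://b.example/">b</a> '
                         '<a href="http://c.example/">c</a>',
    "http://b.example/": '<a href="http://c.example/">c</a>',
    "http://c.example/": "no links here",
}

# Simplified anchor matcher; a real crawler would use an HTML parser.
ANCHOR = re.compile(r'<a\s+[^>]*href="([^"]+)"', re.IGNORECASE)

def map_page(url, page):
    """For each anchor on the web page, emit (anchorURL, 1)."""
    return [(anchor_url, 1) for anchor_url in ANCHOR.findall(page)]

def reduce_counts(anchor_url, ones):
    """Emit (anchorURL, sum(list))."""
    return (anchor_url, sum(ones))

# Group the mapped pairs by anchor URL, then reduce each group.
grouped = defaultdict(list)
for url, page in www.items():
    for anchor_url, one in map_page(url, page):
        grouped[anchor_url].append(one)

link_counts = dict(reduce_counts(k, v) for k, v in grouped.items())
```

A page that links to the same URL twice is counted twice here, because the map function emits one pair per anchor.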
Example in SQL
CREATE TABLE links (
    page     URL NOT NULL,  -- page containing the link; URL is a domain or string type
    ref_page URL NOT NULL,  -- page being referenced
    PRIMARY KEY (page, ref_page)
);

SELECT ref_page, COUNT(DISTINCT page)
FROM links
GROUP BY ref_page;
For each web page count the number of pages that reference that page
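The query on this slide can be run against a few sample rows with Python's sqlite3 module. In this sketch `TEXT` columns stand in for the slide's URL type, which is an assumption, and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE links (
        page     TEXT NOT NULL,   -- page containing the link
        ref_page TEXT NOT NULL,   -- page being referenced
        PRIMARY KEY (page, ref_page)
    );
    INSERT INTO links VALUES
        ('http://a.example/', 'http://b.example/'),
        ('http://a.example/', 'http://c.example/'),
        ('http://b.example/', 'http://c.example/');
""")

# For each referenced page, count the distinct pages that link to it.
counts = conn.execute("""
    SELECT ref_page, COUNT(DISTINCT page)
    FROM links
    GROUP BY ref_page
    ORDER BY ref_page
""").fetchall()
```

The `GROUP BY` plays the role of the grouping step in Map Reduce, and `COUNT` plays the role of the reduce function.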
Tuple Store Summary
• Semi-structured data
– No need to normalize data
• Simple implementations
– Cheap, fast, scalable
• Map Reduce Processing
– Simple programming (for geeks)
• Issues
– No guidance from schema
– No model for historic data
Hadoop wins Sort Benchmark
Synthesis
Summary
• SQL
– Structured data
– Precise
– Historic data
– Needs transformation
– Scalability issues
• NoSQL
– Cheap
– Scalable
– Handles large data
Enterprise Model
[Diagram labels: Money, Content, Analytics; Relational DB, NoSQL, ?; Metadata?]
Issues:
– Data volume
– Query requirements
Analytics Architecture
[Diagram components: Tuple Store; Map Reduce Processing (TB+/day); Reports; RDB Data Warehouse (GB++/day); Cubes, Reports, etc.]
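One plausible reading of this diagram is that Map Reduce condenses the high-volume raw tuples into a much smaller summary, which is then loaded into the relational data warehouse for reporting. The sketch below follows that reading; the flow itself, the event tuples, and the `daily_sales` table are assumptions, not details from the talk.

```python
import sqlite3
from collections import defaultdict

# Hypothetical raw tuples in the tuple store: (event id, (day, product, revenue)).
raw_events = [
    ("e1", ("2009-06-11", "widget", 10.0)),
    ("e2", ("2009-06-11", "widget", 5.0)),
    ("e3", ("2009-06-11", "gadget", 7.5)),
    ("e4", ("2009-06-12", "widget", 2.5)),
]

# Map Reduce step: map emits ((day, product), revenue); reduce sums each group.
grouped = defaultdict(list)
for _event_id, (day, product, revenue) in raw_events:
    grouped[(day, product)].append(revenue)
summary = [(day, product, sum(values))
           for (day, product), values in grouped.items()]

# Load step: the smaller, structured summary goes into the warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("""
    CREATE TABLE daily_sales (
        day     TEXT,
        product TEXT,
        revenue REAL,
        PRIMARY KEY (day, product)
    )
""")
warehouse.executemany("INSERT INTO daily_sales VALUES (?, ?, ?)", summary)

# Report step: precise SQL querying over the structured data.
report = warehouse.execute("""
    SELECT day, SUM(revenue)
    FROM daily_sales
    GROUP BY day
    ORDER BY day
""").fetchall()
```

The volume drops at each step: four raw tuples become three summary rows and two report rows, mirroring the TB-to-GB reduction shown in the diagram.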
Summary
It is all about structured data
How much do we want?
How much can we afford?