Exalead managing terabytes
![Page 1: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/1.jpg)
Content
• Introduction
• Databases
– ACID
– Data structures, algorithms
– Scalability issues
– Scaling patterns
• Search engines
– Data structures, algorithms
– Pros & cons
• NoSQL Movement
– Why and What
![Page 2: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/2.jpg)
Content
• NoSQL Families
– Key value stores
– Column stores
– Document stores
– Graph DB
• Principles: CAP, Scaling patterns, High availability patterns, Elasticity
• How to choose?
• Conclusion
![Page 3: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/3.jpg)
Introduction
• Who we are:
– Clément STENAC (Indexing and search techs)
– Jérémie BORDIER (360 team (a bit of everything))
• Exalead:
– Indexing technologies provider since 1998
– Online search engine: http://www.exalead.com
– Daily challenge: tackle information access problems for large companies.
![Page 4: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/4.jpg)
Introduction
• Universal answer to data storage: RELATIONAL DATABASES
• Well-known data representation: objects and relationships
• Powerful query language: SQL
• Open source implementations:
– MySQL
– PostgreSQL
– …
![Page 5: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/5.jpg)
Introduction
• Database scalability problems?
• Used to be a telco and bank problem…
• Until the internet came along!
(Image: Twitter whale, 2008)
![Page 6: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/6.jpg)
Introduction
• Thanks to the internet…
• …millions of rows are common…
• …and so are real-time websites.
How to deal with massive amounts of structured data? Are there alternatives?
What’s this NoSQL buzz?
![Page 7: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/7.jpg)
Knowing your enemy: RELATIONAL DATABASES
![Page 8: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/8.jpg)
Databases: ACID
ACID constraints
• Atomicity: transactions succeed or fail atomically
• Consistency: transactions leave the database in a consistent state
• Isolation: transactions do not see the effects of concurrent transactions
• Durability: once a transaction is committed, it can’t be lost
![Page 9: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/9.jpg)
Database structures: Primary storage
CREATE TABLE author (
  id INTEGER PRIMARY KEY,
  nick VARCHAR(16),
  age INTEGER,
  firstname VARCHAR(128),
  biography TEXT
);
CREATE TABLE post (
  id INTEGER PRIMARY KEY,
  author_id INTEGER REFERENCES author(id),
  timestamp TIMESTAMP,
  title VARCHAR(256),
  text TEXT
);
Row layout of the author table (Row 1, Row 2, …):
• id: 4 bytes
• age: 4 bytes
• nick: 16 bytes
• firstname: pointer
• biography: pointer
Strings table: a sequence of (len, data) entries, referenced by the pointers
Each value or pointer can be retrieved at a known offset in the row
Diagram labels: Fixed size; Variable size; Heuristics change it to variable-size
![Page 10: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/10.jpg)
Searching in a database
SELECT * FROM author WHERE age=24;
The raw way: full scan
• Enumerate all records in the table
• For each record, fetch the condition value
– Inline value: direct access at row_address + offset(column)
– Outside value: fetch pointer and fetch data
• Perform comparison
Analysis
• Need to analyse the full table
• Very CPU intensive
• If the table does not fit in memory?
– I/O on the whole table
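The full scan over fixed-offset rows can be sketched in Python. This is only an illustration, not how any particular database engine stores data: the 32-byte row format, the helper names (`build_table`, `full_scan_age`, `read_string`) and the sample rows in the usage note are assumptions made for the example.

```python
import struct

# Assumed layout, following the row diagram: id, age, nick, then two string pointers.
ROW = struct.Struct("<ii16sII")   # 4 + 4 + 16 + 4 + 4 = 32 bytes, no padding
AGE_OFFSET = 4                    # age sits right after the 4-byte id


def build_table(authors):
    """Pack rows into one buffer; variable-size strings go to a side table."""
    rows, strings = bytearray(), bytearray()

    def store(text):
        data = text.encode("utf-8")
        pointer = len(strings)
        strings.extend(struct.pack("<I", len(data)) + data)  # one (len, data) entry
        return pointer

    for id_, nick, age, firstname, biography in authors:
        rows += ROW.pack(id_, age, nick.encode("utf-8"),
                         store(firstname), store(biography))
    return bytes(rows), bytes(strings)


def full_scan_age(rows, wanted_age):
    """Ids of the rows WHERE age = wanted_age, found by enumerating every row."""
    hits = []
    for row_address in range(0, len(rows), ROW.size):
        # Inline value: direct access at row_address + offset(column).
        (age,) = struct.unpack_from("<i", rows, row_address + AGE_OFFSET)
        if age == wanted_age:
            (id_,) = struct.unpack_from("<i", rows, row_address)
            hits.append(id_)
    return hits


def read_string(strings, pointer):
    """Outside value: follow the pointer, read the length, then the data."""
    (length,) = struct.unpack_from("<I", strings, pointer)
    return strings[pointer + 4:pointer + 4 + length].decode("utf-8")
```

For instance, `full_scan_age(rows, 24)` on a table built with `build_table([(1, "bob", 24, "Robert", "..."), ...])` touches every row once, whatever the number of matches.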
![Page 11: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/11.jpg)
Database structures: Indexes
What is an index?
• Primary storage: forward mapping row_id -> row data
• Index: reverse mapping row data -> row_id(s)
• Updated together with the primary storage
Searching with an index
• Retrieve the row ids using the index
• Fetch the row data from primary storage
![Page 12: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/12.jpg)
Database structures: Indexes – Hash index
How it works
• Stores hashes of column values in a hash table
• Retrieve through the hash table
Pros
• Very easy and fast to update
• Fast lookup: a single hash table lookup
Cons
• Only provides equality matching
• Unable to answer inequality queries
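A hash index can be sketched with a Python dictionary, which is itself a hash table. The `HashIndex` class and the toy `authors` rows are hypothetical names for illustration; a real engine stores row addresses rather than Python objects.

```python
from collections import defaultdict

# Primary storage: forward mapping row_id -> row data (toy rows, made up here).
authors = {
    1: {"nick": "alice", "age": 24},
    2: {"nick": "bob", "age": 42},
    3: {"nick": "carol", "age": 24},
}


class HashIndex:
    """Reverse mapping column value -> row ids, backed by a hash table."""

    def __init__(self, column):
        self.column = column
        self.buckets = defaultdict(set)

    def insert(self, row_id, row):
        self.buckets[row[self.column]].add(row_id)

    def delete(self, row_id, row):
        self.buckets[row[self.column]].discard(row_id)

    def lookup(self, value):
        # Equality only: there is no way to ask for "age < 30" here.
        return sorted(self.buckets.get(value, ()))


age_index = HashIndex("age")
for row_id, row in authors.items():
    age_index.insert(row_id, row)


def select_where_age_equals(age):
    """Retrieve the row ids from the index, then fetch rows from primary storage."""
    return [authors[row_id] for row_id in age_index.lookup(age)]
```

Updates are cheap (one bucket to touch), but the index has to be maintained on every insert and delete, together with the primary storage.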
![Page 13: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/13.jpg)
Database structures: Indexes – BTree index
Pros
• Provides range and inequality queries easily
• Quite fast (logarithmic) operations
Cons
• More complex and expensive to update
• B-Tree rebalancing
(Diagrams: binary search tree, B-Tree)
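What an ordered index buys can be sketched with a sorted list and binary search. This is a stand-in, not a B-Tree: it gives the same logarithmic lookups and range queries, but inserting into a Python list is linear, which is precisely the cost a B-Tree avoids by rebalancing pages. `SortedIndex` is a hypothetical name.

```python
import bisect


class SortedIndex:
    """Ordered index: a sorted list of (value, row_id) pairs."""

    def __init__(self):
        self.entries = []

    def insert(self, value, row_id):
        bisect.insort(self.entries, (value, row_id))

    def range(self, low=None, high=None):
        """Row ids with low <= value < high; None means unbounded on that side."""
        start = 0 if low is None else bisect.bisect_left(self.entries, (low,))
        stop = (len(self.entries) if high is None
                else bisect.bisect_left(self.entries, (high,)))
        return [row_id for _, row_id in self.entries[start:stop]]
```

A query such as `age < 30` becomes `index.range(high=30)`: one binary search, then a sequential read of the matching slice.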
![Page 14: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/14.jpg)
Choosing how to search
Is indexed search always better?
• SELECT * FROM author WHERE age < 300;
Analysis
• Fetch of whole table
• Index: random lookups
• Full scan: sequential fetch
Choosing wisely
• Identify the expensive queries
• Use the EXPLAIN statement
• Only add indexes where they are required
• Indexes are expensive to update
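As an illustration of inspecting a plan before and after adding an index, here is a small sketch using SQLite through Python's standard `sqlite3` module. SQLite's `EXPLAIN QUERY PLAN` is its own variant of EXPLAIN; MySQL and PostgreSQL use different syntax and output. The index name `idx_author_age` and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE author (id INTEGER PRIMARY KEY, nick VARCHAR(16), age INTEGER)"
)
conn.executemany(
    "INSERT INTO author (nick, age) VALUES (?, ?)",
    [("alice", 24), ("bob", 42), ("carol", 24)],
)


def query_plan(sql):
    # The human-readable description of each step is the last column of the row.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


query = "SELECT * FROM author WHERE age = 24"
plan_without_index = query_plan(query)   # expected to describe a table scan

conn.execute("CREATE INDEX idx_author_age ON author (age)")
plan_with_index = query_plan(query)      # expected to mention idx_author_age
```

Printing both plans shows whether the optimizer actually uses the index; the exact wording depends on the SQLite version.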
![Page 15: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/15.jpg)
Joining
Goal
• Put together data from several tables
• For some values in table A, find matching values in table B
Example
• SELECT * FROM post INNER JOIN author ON author.id = post.author_id WHERE author.age = 42;
![Page 16: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/16.jpg)
Join algorithms
Nested loops
• Foreach (author WHERE age=42) {
    Foreach (post) {
      if (post.author_id == author.id) { append post to the result set; }
    }
  }
• Very naive algorithm: runs in P×A time (P posts, A matching authors)
• Supports all predicates
Hash join
• Algorithm
– Make a hashtable of author ids matching the « age = 42 » condition
– Scan the post table once
– For each post, look up the hashtable to check if it matches a valid author
• Faster than nested loops (2 scans in total instead of A scans of the post table)
• Requires memory to store the hashtable
• Only supports equality predicates
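Both algorithms can be written in a few lines of Python over in-memory lists. The sample `authors` and `posts` rows are invented for the example; only the structure of the loops matters.

```python
authors = [
    {"id": 1, "age": 42},
    {"id": 2, "age": 24},
    {"id": 3, "age": 42},
]
posts = [
    {"id": 10, "author_id": 1},
    {"id": 11, "author_id": 2},
    {"id": 12, "author_id": 3},
    {"id": 13, "author_id": 1},
]


def nested_loop_join(authors, posts, age):
    """One full scan of posts for every matching author: P x A comparisons."""
    result = []
    for author in authors:
        if author["age"] != age:
            continue
        for post in posts:
            if post["author_id"] == author["id"]:
                result.append(post["id"])
    return result


def hash_join(authors, posts, age):
    """Build a hash set of valid author ids, then probe it while scanning posts once."""
    valid_ids = {a["id"] for a in authors if a["age"] == age}    # build phase
    return [p["id"] for p in posts if p["author_id"] in valid_ids]  # probe phase
```

The two functions return the same posts, in a different order: nested loops follow the author order, the hash join follows the post order.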
![Page 17: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/17.jpg)
Join algorithms
Merge join
• Need to have both tables sorted by join key
– Post sorted by author_id
– Author sorted by id
• Perform a single parallel scan of the two tables and identify matches
• Fastest algorithm, but needs sorted data
• Disk-based sort for large data sets
Choice of join algorithm
• Performed automatically by the query optimizer (EXPLAIN)
• Main parameters:
– Relations cardinalities
– Data order (presence of an ORDER BY clause?)
– Available indexes
• JOINs are always expensive -> schema denormalization
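The merge join described on this slide can be sketched as two cursors advancing over already-sorted inputs. The sketch assumes author ids are unique (a primary key) and that the `nick` and `title` fields exist; both are illustrative choices.

```python
def merge_join(authors, posts):
    """Join authors (sorted by id, unique) with posts (sorted by author_id)."""
    result = []
    a = p = 0
    while a < len(authors) and p < len(posts):
        author_id = authors[a]["id"]
        post_author_id = posts[p]["author_id"]
        if author_id < post_author_id:
            a += 1            # this author has no more posts
        elif author_id > post_author_id:
            p += 1            # this post has no matching author
        else:
            result.append((authors[a]["nick"], posts[p]["title"]))
            p += 1            # ids are unique, so only the post cursor advances
    return result
```

Each table is read once, front to back, which is why the algorithm is fast when the data is already sorted and costly when a sort has to be done first.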
![Page 18: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/18.jpg)
Database scaling: Typical workloads
Mostly read workloads
• Example: Wikipedia
• First solution: high-level (frontend tier) caching
• Database scaling: 1 master – N slaves
– Replication of changes from master to slaves
• Does not solve the write bottleneck problem
High write workloads
• Examples: credit cards, Twitter (>1000 tweets/second, 1000s of deliveries)
• Performance limited by write I/O throughput
– Because of the « D » (durability) constraint
• Hard to have more than 1000-2000 writes/second
![Page 19: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/19.jpg)
Database scaling: Scaling writes
Multiple master setups
• All masters have the same data and share the updates
– « share-all » cluster architecture
• Extremely complex synchronization
– Bi-directional replication
– Conflict detection
• Bad performance
• Complex resilience
– Downtime of a master: need a resync
• Complex, heavy and expensive architectures
(Diagram: Master 1, Master 2, Client 1, Client 2; bi-directional replication flow between the masters)
![Page 20: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/20.jpg)
Database scaling: Scaling writes
Sharding
• Split the data between the masters based on a criterion
– Date
– User id
– hash(url), …
• Clients query the correct master for each piece of data
• No shared data between masters (« share-nothing »)
(Diagram: Master 1, Master 2, Client 1, Client 2)
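Client-side sharding can be sketched as a distribution function plus a thin routing layer. Everything named here is hypothetical: the master names, the choice of MD5 as a stable hash, and the in-memory dictionaries standing in for real database connections.

```python
import hashlib

MASTERS = ["master-1", "master-2"]   # hypothetical shard names


def shard_for(key, masters=MASTERS):
    """Distribution function: stable hash of the key, modulo the number of masters."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return masters[int.from_bytes(digest[:8], "big") % len(masters)]


class ShardedClient:
    """Routes every read and write to the single master that owns the key."""

    def __init__(self, masters=MASTERS):
        self.masters = list(masters)
        # In-memory stand-ins for real database connections.
        self.stores = {name: {} for name in self.masters}

    def put(self, key, value):
        self.stores[shard_for(key, self.masters)][key] = value

    def get(self, key):
        return self.stores[shard_for(key, self.masters)].get(key)
```

Note that the routing lives entirely in application code: the databases themselves know nothing about each other.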
![Page 21: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/21.jpg)
Database scaling: Problems with SQL sharding
Complexity
• Not integrated in SQL
• Need to perform the sharding in application code
Resilience
• Several machines but no resilience
• Loss of one master = loss of data (compare to RAID-0)
Loss of features
• You can’t do cross-shard joins
Complex evolutions
• How do you keep scaling?
• To add another machine, you need to change the distribution function
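The cost of changing the distribution function can be measured with a short experiment. Assuming a plain `hash(key) % shard_count` scheme (an assumption; the slides do not prescribe one), going from 4 to 5 machines leaves a key in place only when its hash has the same remainder modulo 4 and modulo 5, so about 80% of the keys are expected to move.

```python
import hashlib


def shard_index(key, shard_count):
    """Modulo-based distribution function over a stable hash."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def moved_fraction(keys, old_count, new_count):
    """Share of keys whose shard changes when the number of machines changes."""
    moved = sum(
        1 for key in keys
        if shard_index(key, old_count) != shard_index(key, new_count)
    )
    return moved / len(keys)


keys = [f"user:{i}" for i in range(10_000)]   # made-up key names
fraction = moved_fraction(keys, 4, 5)
```

Moving most of the data every time a machine is added is what makes naive sharding hard to grow.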
![Page 22: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/22.jpg)
Database scaling: Other SQL shortcomings
Strict schema
• It is good: it provides strong typing
• But, migration hell!
• Web applications change quickly
• Not « Agile »
![Page 23: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/23.jpg)
On the other side: SEARCH ENGINES
![Page 24: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/24.jpg)
A quick look at search engines
Differences from a traditional database
• Not designed for OLTP
– Update by batches
– No transactions, updates are available to readers « later »
• Heavily read-optimized
Full text search
• It’s more complex than LIKE ’%myword%’;
• Need specific data structures
![Page 25: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/25.jpg)
Search engines: Inverted lists
Exalead S.A. © 2010 CONFIDENTIAL
What it is
• A data structure mapping a « word identifier » to a list of « document identifiers »
• For each word of each document, store the positions
Example documents
• Document 1: The quick fox
• Document 2: The lazy dog
• Document 3: The dog quick dog
Dictionary (word identifiers)
• the = 1 • quick = 2 • fox = 3 • lazy = 4 • dog = 5
Inverted lists
• List for word 1 (the): doc 1 (at position 0), doc 2 (at position 0), doc 3 (at position 0)
• List for word 2 (quick): doc 1 (at position 1), doc 3 (at position 2)
• List for word 3 (fox): doc 1 (at position 2)
• List for word 4 (lazy): doc 2 (at position 1)
• List for word 5 (dog): doc 2 (at position 2), doc 3 (at positions 1, 3)
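Building the dictionary and the inverted lists for the three example documents can be sketched as follows. The tokenization (lowercase, split on whitespace) and the function name `build_index` are simplifying assumptions; a real engine does far more work at this stage.

```python
from collections import defaultdict

documents = {
    1: "The quick fox",
    2: "The lazy dog",
    3: "The dog quick dog",
}


def build_index(documents):
    """Return (dictionary, inverted) where inverted[word_id][doc_id] = positions."""
    dictionary = {}                                     # word -> word id
    inverted = defaultdict(lambda: defaultdict(list))   # word id -> doc id -> positions
    for doc_id in sorted(documents):
        words = documents[doc_id].lower().split()
        for position, word in enumerate(words):
            # Ids are assigned in order of first appearance, starting at 1.
            word_id = dictionary.setdefault(word, len(dictionary) + 1)
            inverted[word_id][doc_id].append(position)
    return dictionary, inverted


dictionary, inverted = build_index(documents)
```

With these inputs the word ids and positions match the slide: `dog` gets id 5 and appears in document 2 at position 2 and in document 3 at positions 1 and 3.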
![Page 26: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/26.jpg)
Search engines: Searching with inverted lists
Single word query: dog
• Resolve the word to its id using the dictionary (wid 5)
• Fetch the inverted list for this id
• Simply read the inverted list
• We have the hits: document 2 and document 3
Boolean query: the AND dog
• Resolve words, fetch inverted lists
– The: 1, 2, 3 Dog: 2, 3
• Perform intersection: hits = 2, 3
Boolean query: the OR dog
• Resolve/fetch
• Perform union: hits = 1, 2, 3
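The boolean queries reduce to set operations on sorted document lists. A minimal sketch, with the lists from the example hard-coded in a hypothetical `postings` dictionary:

```python
# Sorted document ids per word, taken from the three example documents.
postings = {
    "the": [1, 2, 3],
    "quick": [1, 3],
    "fox": [1],
    "lazy": [2],
    "dog": [2, 3],
}


def intersect(a, b):
    """AND: walk both sorted lists in parallel, keep ids present in both."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


def union(a, b):
    """OR: every id present in at least one list, kept sorted."""
    return sorted(set(a) | set(b))
```

Because the lists are sorted, the intersection reads each list once and never needs the documents themselves.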
![Page 27: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/27.jpg)
Search engines: Searching with inverted lists
Positional query: the NEXT dog
• Fetch the inverted lists and also read the positions
– The: 1(0), 2(0), 3(0) Dog: 2(2), 3(1,3)
• Identify “simple boolean” matches: docs 2 and 3
• For each possible match, check if positions form a sequence
– Only document 3 matches on sequence (0,1)
• Positional queries are more expensive, and storing word positions is expensive (disk space, decoding CPU, I/O)
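The positional check can be sketched on top of the same data: first keep the documents that contain both words, then look for a position of the first word immediately followed by a position of the second. The `positions` structure and the function name `next_query` are assumptions for the example.

```python
# word -> {doc id: positions}, taken from the three example documents.
positions = {
    "the": {1: [0], 2: [0], 3: [0]},
    "dog": {2: [2], 3: [1, 3]},
}


def next_query(first, second):
    """Documents where some occurrence of `first` is directly followed by `second`."""
    hits = []
    # Step 1: simple boolean match (documents containing both words).
    for doc_id in sorted(first.keys() & second.keys()):
        second_positions = set(second[doc_id])
        # Step 2: check whether positions form a sequence (p, p + 1).
        if any(p + 1 in second_positions for p in first[doc_id]):
            hits.append(doc_id)
    return hits
```

`next_query(positions["the"], positions["dog"])` keeps document 3 only: in document 2 the two words are at positions 0 and 2, so they are not adjacent.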
![Page 28: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/28.jpg)
The revolution: THE NOSQL MOVEMENT
![Page 29: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/29.jpg)
NoSQL Movement
• « NoSQL » © Eric EVANS (Rackspace, 2009)
“The name was an attempt to describe the emergence of a growing number of non-relational, distributed data stores that often did not attempt to provide ACID guarantees.” (Wikipedia)
![Page 30: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/30.jpg)
NoSQL Movement: Issue
• RDBMS fails with huge amounts of data
– Facebook’s 70TB of inbox
– Digg’s 3TB
– eBay’s 2PB…
• High scale SQL systems are either:
– Very expensive to buy and quite to maintain
– Very expensive to maintain
![Page 31: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/31.jpg)
NoSQL Movement
• We need new systems that:
– Scale horizontally (both reads and writes)
– Have no single point of failure
– Are fault tolerant
– Are elastic (adding nodes is easy)
– Have flexible data schemas
– Are more web application friendly
![Page 32: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/32.jpg)
NoSQL: Families
• Different types of data stores:
– Key-Value stores (Dynamo, Redis, Voldemort…)
– Column stores (BigTable, Cassandra, HBase…)
– Document stores (CouchDB, MongoDB…)
– Graph stores (Neo4J, Swarm…)
![Page 33: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/33.jpg)
NoSQL: Key-Value stores
• Distributed hashtables
– Btrees
– Fixed sized tables
• Benefits:
– Very simple API (get/put/delete/range)
– Easily shardable
– Fast reads
• Drawbacks:
– No data schema (no joins, data flattening…)
– No query language
• Implems: Redis, Amazon Dynamo, Voldemort
![Page 34: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/34.jpg)
NoSQL: Column Stores

| Id | Lastname | Firstname | Salary |
|----|----------|-----------|--------|
| 1  | Smith    | Joe       | 40000  |
| 2  | Jones    | Mary      | 50000  |
| 3  | Johnson  | Cathy     | 44000  |

• Row based storage:
– 1,Smith,Joe,40000;2,Jones,Mary,50000;3,Johnson,Cathy,44000;
• Column based storage:
– 1,2,3;Smith,Jones,Johnson;Joe,Mary,Cathy;40000,50000,44000;
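The two layouts can be reproduced from the same table with a few lines of Python. This is a sketch of the serialization order only; the function names are hypothetical and real column stores add compression and block structure on top.

```python
rows = [
    (1, "Smith", "Joe", 40000),
    (2, "Jones", "Mary", 50000),
    (3, "Johnson", "Cathy", 44000),
]


def row_layout(rows):
    """All the fields of a row are stored next to each other."""
    return "".join(",".join(str(v) for v in row) + ";" for row in rows)


def column_layout(rows):
    """All the values of a column are stored next to each other."""
    return "".join(",".join(str(v) for v in column) + ";" for column in zip(*rows))


def total_salary(rows):
    """An aggregate only needs one column: a single contiguous run in column layout."""
    columns = list(zip(*rows))
    return sum(columns[3])
```

`row_layout(rows)` and `column_layout(rows)` give back exactly the two strings shown on the slide.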
![Page 35: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/35.jpg)
NoSQL: Column Stores
• Benefits:
– Reading all the values of a given column is faster (ex: aggregates)
– Batch writes are faster
• Joins are faster
– Comparing two columns is sequential
– Many more L1 CPU cache hits
– L1 cache reference: 0.5ns
– L2 cache reference: 7ns
![Page 36: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/36.jpg)
NoSQL: Column Stores
• Drawbacks:
– Reading a single object is slower (multiple I/Os)
– Writing a single object is slower (multiple I/Os)
– Doesn’t fit most applications
• Finally:
– Well suited for heavy write / read applications (e.g. Facebook inbox indexes)
![Page 37: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/37.jpg)
NoSQL: Document Stores
• Can be seen as a schema-free, hierarchical database (usually represented as JSON)
SQL Schema:
Person: id, name, address, phone
Animal: id, person_id, name, address, phone
(1 Person to N Animals)
Document store:
Person: id, name, address, phone, animals = list of { id, person_id, name, address, phone }
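Turning the two relational tables into one nested document per person can be sketched like this. The sample values are invented, and the `to_documents` helper is hypothetical; document stores such as CouchDB or MongoDB each have their own APIs for storing the result.

```python
import json

# Hypothetical sample rows for the two relational tables.
persons = [
    {"id": 1, "name": "Ann", "address": "1 Main St", "phone": "555-0100"},
]
animals = [
    {"id": 10, "person_id": 1, "name": "Rex", "address": "1 Main St", "phone": "555-0100"},
    {"id": 11, "person_id": 1, "name": "Tom", "address": "1 Main St", "phone": "555-0100"},
]


def to_documents(persons, animals):
    """Embed each person's animals in the person document (1-N relation)."""
    documents = []
    for person in persons:
        document = dict(person)
        document["animals"] = [
            dict(animal) for animal in animals if animal["person_id"] == person["id"]
        ]
        documents.append(document)
    return documents


documents = to_documents(persons, animals)
as_json = json.dumps(documents[0], indent=2)
```

Reading a person and all of their animals is then a single fetch of one document instead of a join.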
![Page 38: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/38.jpg)
NoSQL: Document Stores
• Benefits:
– Data spatiality (locality): everything in one place
– Efficient writes and updates (in place)
– Efficient reads
– Highly flexible data schema
– Usually provides indexes over each object key, giving a powerful query language
• Drawbacks:
– Doesn’t encourage well-designed data schemas
![Page 39: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/39.jpg)
NoSQL: Graph Stores
• An entry is a node
• Nodes have properties
• Edges are links between nodes
![Page 40: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/40.jpg)
NoSQL: Graph Stores
• Benefits:
– Faster to fetch an entry and its related entries (links are already resolved, no need to join)
– Flexible data schema
• Drawbacks:
– Complex APIs
– Slow for batch operations
– Open source implems are not that good…
![Page 41: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/41.jpg)
The real issues… SCALABILITY IN PRACTICE
![Page 42: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/42.jpg)
CAP Theorem
• CAP:
– Consistency: Operating fully or not at all.
– Availability: The service must be reachable at any time.
– Partition Tolerance: No set of failures less than total network failure is allowed to cause the system to respond incorrectly.
“Any shared-data system can only achieve two of these three.”
CAP Theorem, Dr. Eric Brewer, Berkeley (2000)
![Page 43: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/43.jpg)
Consistent Hashing
• Ensuring data availability: replication!
• Reaching the right nodes? Hashing
• Consistent hashing: Hash ring
– Objects are mapped into a range
– Nodes are mapped into that range
– We write the object into the nearest node, clockwise
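A hash ring can be sketched with a sorted list of node positions. This is a minimal version under stated assumptions: one point per node (production systems usually add virtual nodes), MD5 as the hash, and hypothetical names such as `HashRing` and `nodes_for`.

```python
import bisect
import hashlib


def ring_position(name):
    """Map an object key or a node name into the same integer range."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    def __init__(self, nodes=()):
        self.points = []   # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        bisect.insort(self.points, (ring_position(node), node))

    def remove(self, node):
        self.points.remove((ring_position(node), node))

    def nodes_for(self, key, replicas=1):
        """First `replicas` distinct nodes met clockwise from the key's position."""
        start = bisect.bisect_left(self.points, (ring_position(key),))
        found = []
        for step in range(len(self.points)):
            node = self.points[(start + step) % len(self.points)][1]  # wrap around
            if node not in found:
                found.append(node)
            if len(found) == replicas:
                break
        return found
```

When a node is added, the only keys that change owner are those that now fall on the new node; every other key stays where it was, unlike with modulo-based sharding. Asking for several replicas returns the next nodes on the ring, which is one way to combine hashing with replication.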
![Page 44: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/44.jpg)
Data consistency
• Ensuring data eventual consistency: Quorum writes
– W = number of writes to ensure before returning OK
– R = number of reads to ensure
– N = replication factor
• W < N == High write availability
– Data may be lost or outdated if read from another node
• R < N == High read availability
– Data may be outdated
• W + R > N == Full consistency!
– But slower writes / reads
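The reason W + R > N gives consistent reads is that any set of W written replicas must then share at least one node with any set of R replicas being read. A brute-force check of that overlap property, as a sketch (the function name is made up):

```python
from itertools import combinations


def always_overlaps(n, w, r):
    """True if every choice of w written replicas intersects every choice of r read replicas."""
    replicas = range(n)
    return all(
        set(written) & set(read)
        for written in combinations(replicas, w)
        for read in combinations(replicas, r)
    )
```

With N = 3, the setting W = 2, R = 2 always overlaps, so a read sees at least one up-to-date copy; with W = 1, R = 1 a read can land on a replica that has not received the write yet.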
![Page 45: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/45.jpg)
Conflicts resolution
• What happens when R > 1 and two different versions are found?
• Conflict resolution!
• Common algorithm: vector clocks
![Page 46: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/46.jpg)
Vector clocks
• Assign to each node a unique ID
• A node increments its own counter in the vector and keeps track of the old entries
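A minimal vector clock can be sketched as a dictionary from node id to counter. The function names and the string results of `compare` are illustrative choices, not the API of any particular store.

```python
def increment(clock, node_id):
    """Return a copy of the clock where node_id's own counter is bumped by one."""
    updated = dict(clock)
    updated[node_id] = updated.get(node_id, 0) + 1
    return updated


def descends(a, b):
    """True if clock a has seen everything clock b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a, b):
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a is newer"
    if descends(b, a):
        return "b is newer"
    return "conflict"   # concurrent versions: neither one descends from the other


def merge(a, b):
    """Clock for a reconciled version: per-node maximum of both clocks."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}
```

For example, if node A writes a value, then nodes B and C each update it without seeing the other's write, the two resulting clocks compare as a conflict and the application (or the store) has to reconcile the versions.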
![Page 47: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/47.jpg)
Elasticity: Gossip Membership
• When a node joins…
![Page 48: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/48.jpg)
Elasticity: Gossip Membership
• When a node crashes!
![Page 49: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/49.jpg)
I’m starting the next big startup…
WHAT’S THE BEST SYSTEM?
![Page 50: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/50.jpg)
Choosing your storage system
• “Don’t optimize too early”
• MySQL is robust and works VERY well
– You’ll know where bugs come from (you)
• Key-Value stores are hype, and often badly implemented
• Anyway, most mature “NoSQL” systems:
– MongoDB
– Cassandra
![Page 51: Exalead managing terrabytes](https://reader034.vdocuments.mx/reader034/viewer/2022051608/5446e97aafaf9f59178b480b/html5/thumbnails/51.jpg)
Questions?