data management in large-scale distributed systems - nosql ... · sql born in the 70’s { still...

49
Data Management in Large-Scale Distributed Systems NoSQL Databases Fundamentals Thomas Ropars [email protected] http://tropars.github.io/ 2021 1

Upload: others

Post on 21-Feb-2021

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

Data Management in Large-Scale DistributedSystems

NoSQL Databases Fundamentals

Thomas Ropars

[email protected]

http://tropars.github.io/

2021

1

Page 2: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

References

• The lecture notes of V. Leroy

• The lecture notes of F. Zanon Boito

• Designing Data-Intensive Applications by Martin KleppmannI Chapters 2 and 7

2

Page 3: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

In this lecture

• Motivations for NoSQL databases

• ACID properties and CAP Theorem

• A landscape of NoSQL databases

3

Page 4: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

Agenda

Introduction

Why NoSQL?

Transactions, ACID properties and CAP theorem

Data models

4

Page 5: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

Common patterns of data accesses

Large-scale data processing

• Batch processing: Hadoop, Spark, etc.

• Perform some computation/transformation over a full dataset

• Process all data

Selective query

• Access a specific part of the dataset• Manipulate only the needed data

I 1 record among millions

• Main purpose of a database system

5

Page 6: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

Loaddata

Writeresults

Writeresults

Serve

queries

Processing/DatabaseLink

3

Database

BatchJob(Hadoop,Spark)

StreamJob(Spark,Storm)

ApplicaCon1 ApplicaCon2 ApplicaCon3

e.g.senCmentanalysis

e.g.TwiSertrendspage

Insert

records

Page 7: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

Differenttypesofdatabases

•  SofarweusedHDFS– Afilesystemcanbeseenasaverybasicdatabase– Directories/filestoorganizedata– Verysimplequeries(filesystempath)– Verygoodscalability,faulttolerance…

•  Otherendofthespectrum:RelaConalDatabases– SQLquerylanguage,veryexpressive– Limitedscalability(generally1server)

4

Page 8: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

Size/Complexity

5Size

Complexity

GraphDB

RelaConalDB Document

DBColumnDB

Key/ValueDB

Filesystem

Page 9: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

TheNoSQLJungle

6

Page 10: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

Agenda

Introduction

Why NoSQL?

Transactions, ACID properties and CAP theorem

Data models

6

Page 11: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

Relational databases

SQL

• Born in the 70’s – Still heavily used

• Data is organized into relations (in SQL: tables)

• Each relation is an unordered collection of tuples (rows)

• Foreign keys are used to define the relationships among thetables

7

Page 12: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

About SQL

Advantages

• Separate the data from the codeI High-level languageI Space for optimization strategies

• Powerful query languageI Clean semanticsI Operations on sets

• Support for transactions

8

Page 13: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

Motivations for alternative modelssee https://blog.couchbase.com/nosql-adoption-survey-surprises/

Some limitations of relational databases• Performance and scalability

I Difficult to partition the data (in general run on a single server)I Need to scale up to improve performance

• Lack of flexibilityI Will to easily change the schemaI Need to express different relationsI Not all data are well structured

• Few open source solutions

• Mismatch between the relational model and object-orientedprogramming model

9

Page 14: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

Illustration of the object-relational mismatchFigure by M. Kleppmann

Figure: A CV in a relation database10

Page 15: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

Illustration of the object-relational mismatchFigure by M. Kleppmann

{”user id”:251,”first name”: ”Bill”,”last name”: ”Gates”,”summary”: ”Co−chair of the Bill & Melinda Gates; Active blogger.”,”region id”: ”us:91”,”industry id”: 131,”photo url”: ”/p/7/000/253/05b/308dd6e.jpg”,”positions”: [{”job title”: ”Co−chair”, ”organization”: ”Bill & Melinda Gates

Foundation”},{”job title”: ”Co−founder, Chairman”, ”organization”: ”Microsoft”}

],”education”: [{”school name”: ”Harvard University”, ”start”: 1973, ”end”: 1975},{”school name”: ”Lakeside School, Seattle”, ”start”: null, ”end”: null}

],”contact info”: {

”blog”: ”http://thegatesnotes.com”,”twitter”: ”http://twitter.com/BillGates”

}}

Figure: A CV in a JSON document 11

Page 16: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

About NoSQL

What is NoSQL?• A hashtag

I NoSQL approaches were existing before the name becamefamous

• No SQL

• New SQL• Not only SQL

I Relational databases will continue to exist alongsidenon-relational datastores

12

Page 17: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

About NoSQL

A variety of NoSQL solutions

• Key-Value (KV) stores

• Wide column stores (Column family stores)

• Document databases

• Graph databases

Difference with relational databasesThere are several ways in which they differ from relationaldatabases:

• Properties

• Data models

• Underlying architecture

13

Page 18: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

Agenda

Introduction

Why NoSQL?

Transactions, ACID properties and CAP theorem

Data models

14

Page 19: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

About transactions

The concept of transaction

• Groups several read and write operations into a logical unit• A group of reads and writes are executed as one operation:

I The entire transaction succeeds (commit)I or the entire transaction fails (abort, rollback)

• If a transaction fails, the application can safely retry

Why do we need transactions?

• Crashes may occur at any timeI On the database sideI On the application sideI The network might not be reliable

• Several clients may write to the database at the same time

15

Page 20: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

About transactions

The concept of transaction

• Groups several read and write operations into a logical unit• A group of reads and writes are executed as one operation:

I The entire transaction succeeds (commit)I or the entire transaction fails (abort, rollback)

• If a transaction fails, the application can safely retry

Why do we need transactions?

• Crashes may occur at any timeI On the database sideI On the application sideI The network might not be reliable

• Several clients may write to the database at the same time

15

Page 21: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

ACID

ACID describes the set of safety guarantees provided bytransactions

• Atomicity

• Consistency

• Isolation

• Durability

Having such properties make the life of developers easy, but:• ACID properties are not the same in all databases

I It is not even the same in all SQL databases

• NoSQL solutions tend to provide weaker safety guaranteesI To have better performance, scalability, etc.

16

Page 22: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

ACID: Atomicity

Description

• A transactions succeeds completely or fails completelyI If a single operation in a transaction fails, the whole

transaction should failI If a transaction fails, the database is left unchanged

• It should be able to deal with any faults in the middle of atransaction

• If a transaction fails, a client can safely retry

In the NoSQL context:

• Atomicity is still ensured

17

Page 23: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

ACID: Consistency

Description

• Ensures that the transaction brings the database from a validstate to another valid stateI Example: Credits and debits over all accounts must always be

balanced

• It is a property of the application, not of the databaseI The database cannot enforce application-specific invariant

• It is the responsibility of the programmer to issue transactionsthat make sense.

I The database can check some specific invariant• A foreign key must be valid

In the NoSQL context:

• Consistency is (often) not discussed

18

Page 24: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

ACID: Durability

Description

• Ensures that once a transaction has committed successfully,data will not be lostI Even if a server crashes (flush to a storage device, replication)

In the NoSQL context:

• Durability is also ensured

19

Page 25: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

ACID: Isolation

Description

• Concurrently executed transactions are isolated from eachotherI We need to deal with concurrent transactions that access the

same data

• SerializabilityI High level of isolation where each transaction executes as if it

was the only transaction applied on the database• As if the transactions are applied serially, one after the other

I Many SQL solutions provide a lower level of isolation• Example: Read committed (see next slides)

In the NoSQL context:

• What about the CAP theorem?

20

Page 26: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

More about isolationAlternative (weak) isolation levels

Read Committed• No dirty reads: A read only sees data that has been

committed

• No dirty writes: A write only overwrites data that has beencommitted.

Many SQL databases implement this level of isolation.

21

Page 27: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

Example of allowed execution with Read Committed

Transaction to be executed (Increment a counter)

Begin transaction

Read count

count = count + 1

Write count

End transaction

ClientA

DB

ClientB

count=0

read count

read count write 1

write 1

count=1

commit

commit

No dirty writes does not prevent all inconsistencies.

22

Page 28: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

Example of allowed execution with Read Committed

Transaction to be executed (Increment a counter)

Begin transaction

Read count

count = count + 1

Write count

End transaction

ClientA

DB

ClientB

count=0

read count

read count write 1

write 1

count=1

commit

commit

No dirty writes does not prevent all inconsistencies.

22

Page 29: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

Example of allowed execution with Read Committed

2 Transactions executed concurrently

On the accounts of Alice:

• initial state: $500 of the two accounts

• Transaction 1: Transfer $100 from account B to A

• Transaction 2: Check balance.

Alice

DB

Bank

balance A? Balance B?

commit

B - 100 A + 100

commit

A+B= $900 ??

Some inconsistencies can also be observed with no dirty reads.

23

Page 30: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

Example of allowed execution with Read Committed

2 Transactions executed concurrently

On the accounts of Alice:

• initial state: $500 of the two accounts

• Transaction 1: Transfer $100 from account B to A

• Transaction 2: Check balance.

Alice

DB

Bank

balance A? Balance B?

commit

B - 100 A + 100

commit

A+B= $900 ??

Some inconsistencies can also be observed with no dirty reads.

23

Page 31: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

Example of allowed execution with Read Committed

2 Transactions executed concurrently

On the accounts of Alice:

• initial state: $500 of the two accounts

• Transaction 1: Transfer $100 from account B to A

• Transaction 2: Check balance.

Alice

DB

Bank

balance A? Balance B?

commit

B - 100 A + 100

commit

A+B= $900 ??

Some inconsistencies can also be observed with no dirty reads.23

Page 32: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

The CAP theorem

3 properties of databases

• ConsistencyI What guarantees do we have on the value returned by a read

operation?I It strongly relates to Isolation in ACID (and not to consistency)

• AvailabilityI The system should always accept updates

• Partition toleranceI The system should be able to deal with a partitioning of the

network

Comments on CAP theorem• Was introduced by E. Brewer in its lectures (beginning of

years 2000)

• Goal: discussing trade-offs in database design

24

Page 33: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

What does the CAP theorem says?

The theoremIt is impossible to have a system that provides Consistency,Availability, and partition tolerance.

How it should be understood:• Partitions are unavoidable

I It is a fault, we have no control on it

• We need to choose between availability and consistencyI In the CAP theorem:

• Consistency is meant as linearizability (the strongestconsistency guarantee)

• Availability is meant as total availability

I In practice, different trade-offs can be provided

25

Page 34: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

The intuition behind CAP

• Let inconsistencies occur? (No C)

• Stop executing transactions? (No A)

Note that in a centralized system (non-partitioned relationaldatabase), no need for Partition tolerance

• We can have Consistency and Availability

26

Page 35: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

The intuition behind CAP

• Let inconsistencies occur? (No C)

• Stop executing transactions? (No A)

Note that in a centralized system (non-partitioned relationaldatabase), no need for Partition tolerance

• We can have Consistency and Availability

26

Page 36: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

The impact of CAP on ACID for NoSQLsource: E. Brewer

The main consequence

• No NoSQL database with strong Isolation

Discussion about other ACID properties

• AtomicityI Each side should ensure atomicity

• DurabilityI Should never be compromised

27

Page 37: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

A vision of the NoSQL landscapeSource: https://blog.nahurst.com/visual-guide-to-nosql-systems

To be read with care:

• Solutions oftenprovide atrade-off betweenCP and AP

• A single solutionmay offer adifferent trade-offdepending on howis is configured.

• We don’t picktwo !

28

Page 38: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

Agenda

Introduction

Why NoSQL?

Transactions, ACID properties and CAP theorem

Data models

29

Page 39: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

Key-Value store

• Data are stored as key-value pairsI The value can be a data structure (eg, a list)

• In general, only support single-object transactionsI In this case, key-value pairs

• Examples:I RedisI Voldemort

• Use case:I Scalable (in-memory) cache for dataI Some solutions ensure durability by writing data to disk

30

Page 40: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

Key-value storeImage by J. Stolfi

31

Page 41: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

Column family stores

• Data are organized in rows and columns (Tabular data store)I The data are arranged based on the rowsI Column families are defined by users to improve performance

• Group related columns together

• Only support single-object transactionsI In this case, a row

• Examples:I BigTable/HBaseI Cassandra

• Use case:I Data with some structure with the goal of achieving scalability

and high throughputI Provide more complex lookup operations than KV stores

32

Page 42: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

Column family stores

Note that a row does not need to have an entry for all columns

33

Page 43: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

Document databases

• Data are organized in Key-Document pairsI A document is a nested structure with embedded metadataI No definition of a global schema

• The schema is implicit and not not enforced by the database• Schema-on-read

I Popular formats: XML, JSON

• Only support single-object transactionsI In this case, a document or a field inside a document

• Examples:I MongoDBI CouchDB

• Use case:I An alternative to relational databases for structured dataI Offer a richer set of operations compared to KV stores:

Update, Find, etc.

34

Page 44: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

DocumentDB

30

Page 45: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

GraphDB

•  Representdataasgraphs– Nodes/relaConshipswithproperCesasK/Vpairs

31

Page 46: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

GraphDB:Neo4j

•  Richdataformat– QuerylanguageaspaSernmatching– Limitedscalability

•  ReplicaContoscalereads,writesneedtobedonetoeveryreplica

32

Page 47: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

On the Many-to-one relationship

Many-to-one relationship

• Many items may refer to the same item

• Example: Many people went to the same university

Relational vs Document DB

• Relational databases use a foreign keyI Consistency and low memory footprint (normalization)I Easy updates and support for joinsI Difficult to scale

• Document databases duplicate dataI Efficient read operationsI Easy to scaleI Higher memory footprint and updates are more difficult (risk

of consistency issues)• Transactions on multiple objects could be very useful in this

case

I Some document databases support document references(equivalent to foreign keys)

35

Page 48: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

On the Many-to-one relationship

Many-to-one relationship

• Many items may refer to the same item

• Example: Many people went to the same university

Relational vs Document DB• Relational databases use a foreign key

I Consistency and low memory footprint (normalization)I Easy updates and support for joinsI Difficult to scale

• Document databases duplicate dataI Efficient read operationsI Easy to scaleI Higher memory footprint and updates are more difficult (risk

of consistency issues)• Transactions on multiple objects could be very useful in this

case

I Some document databases support document references(equivalent to foreign keys)

35

Page 49: Data Management in Large-Scale Distributed Systems - NoSQL ... · SQL Born in the 70’s { Still heavily used Data is organized into relations (in SQL: tables) ... A single solution

Additional references

Suggested reading

• http://martin.kleppmann.com/2015/05/11/

please-stop-calling-databases-cp-or-ap.html, M.Kleppmann, 2015.

• https://jvns.ca/blog/2016/11/19/

a-critique-of-the-cap-theorem/, J. Evans, 2016.

36