sql or nosql: that is the question

Upload: alikonweb

Post on 12-Aug-2015

1

SQL, OR NOSQL:

That is the question

2

About me

{

"_id": "555ae00a475a9b259281b21a",

"name": "Nicola Galgano",

"alias": "alikon",

"gender": "male",

"work": "DB consultant on banking systems",

"company": "looking for a new one",

"email": "[email protected]",

"twitter": "@alikon",

"address": "Roma, Italy, EU",

"current_hobby": "run away from dentist"

}

3

The question is not “What is the answer?”

But “What is the question?”

Henri Poincaré

Ipse dixit

4

Why?

5

What is Big Data?

Big data is an all-encompassing term for any collection of data sets so large or complex that it becomes difficult to process them using traditional data processing applications.

From wikipedia

6

How much is Big data ?

DVD 4.7 GB

Human brain 2.5 PB

LHC 1 PB/s

Net traffic 1 ZB/year

7

Where does big data come from?

Internet of Everything

IPv6 = 2^128 ≈ 3.4e+38 addresses

IPv6 can address every quark in the world

8

When ?

9

Do you know your data?

Structured / Unstructured

Volume

10

Are you ready for the 4 V ?

Volume Velocity Variety Veracity

11

…and for the 5x9 (five nines)?

Availability          Downtime/year    Downtime/month   Downtime/week
90 % (1 nine)         36.5 days        72 hours         16.8 hours
99 % (2 nines)        3.65 days        7.20 hours       1.68 hours
99.9 % (3 nines)      8.76 hours       43.8 minutes     10.1 minutes
99.99 % (4 nines)     52.56 minutes    4.38 minutes     1.01 minutes
99.999 % (5 nines)    5.26 minutes     25.9 seconds     6.05 seconds
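
The figures in the availability table follow from simple arithmetic: allowed downtime is the length of the period multiplied by the unavailability (10 to the power of minus the number of nines). A minimal sketch, assuming a 365-day year and a 30-day month (which is what the table's values imply); the function and constant names are made up for this example:

```python
# Allowed downtime for an availability target expressed as a number of nines.
# Assumption: 365-day year and 30-day month, matching the table's figures.

SECONDS_PER_DAY = 24 * 60 * 60
PERIODS = {
    "year": 365 * SECONDS_PER_DAY,
    "month": 30 * SECONDS_PER_DAY,
    "week": 7 * SECONDS_PER_DAY,
}


def downtime_seconds(nines: int, period: str) -> float:
    """Seconds of allowed downtime per period for `nines` nines of availability."""
    unavailability = 10 ** -nines  # e.g. 3 nines -> 0.001
    return PERIODS[period] * unavailability


for nines in range(1, 6):
    row = {period: round(downtime_seconds(nines, period), 2) for period in PERIODS}
    print(f"{nines} nine(s): {row} (seconds)")
```

Dividing the result by 60, 3600 or 86400 gives the minutes, hours and days shown in the table.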

12

The Database map

13

NoSQL (no-SQL or Not Only SQL)

Next-generation databases, mostly addressing some of these points:

• non-relational
• distributed
• horizontally scalable
• open-source

From www.nosql-database.org

14

Taxonomies of NoSQL

Key / value

Column

Document

Graph

15

Non-relational? What?!?!

A data model is a representation that we use to perceive and manipulate data

• Logic model
• Normalization
• 1NF, 2NF, 3NF, ...
• E-R
• Schema (rigid)
• Algebra of sets
• Impedance mismatch

16

NoSQL Data models

Schemaless (dynamic/implicit)

Denormalization

Aggregate

Aggregates are the basic element of data storage

17

Key / Value

Simple data model

Blob/Opaque

Only 3 API functions:
• Get(key)
• Set(key, value)
• Delete(key)

Key and value can be complex
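
The three-function API can be sketched in a few lines. This is only an in-memory illustration, assuming a plain Python dict as the storage engine; the class name `KeyValueStore` is hypothetical and does not refer to any particular product:

```python
class KeyValueStore:
    """Toy key/value store: the value is an opaque blob the store never inspects."""

    def __init__(self):
        self._data = {}  # key -> opaque value

    def set(self, key, value):
        """Create or overwrite the value stored under `key`."""
        self._data[key] = value

    def get(self, key):
        """Return the value for `key`, or None if the key is absent."""
        return self._data.get(key)

    def delete(self, key):
        """Remove `key`; deleting a missing key is a no-op."""
        self._data.pop(key, None)


store = KeyValueStore()
store.set(b"user:42", b'{"name": "Maria"}')  # the store does not parse this blob
print(store.get(b"user:42"))
store.delete(b"user:42")
print(store.get(b"user:42"))
```

Because the value is opaque, any query other than a lookup by key has to be done by the application.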

18

Document

More transparent

JSON (JavaScript Object Notation)

A lightweight data interchange format

Easy for humans and machines to read and write
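
To illustrate why a document is "more transparent" than an opaque blob, here is a small sketch using Python's standard `json` module. The document and its field names are invented for the example:

```python
import json

# A hypothetical document: nested fields and arrays live inside one aggregate.
doc = {
    "_id": "42",
    "name": "Maria",
    "tags": ["nosql", "json"],
    "address": {"city": "Roma", "country": "Italy"},
}

text = json.dumps(doc)       # serialize to the JSON interchange format
restored = json.loads(text)  # parse it back into Python objects

# Unlike an opaque value, individual fields can be inspected and queried.
print(restored["address"]["city"])
print([tag for tag in restored["tags"] if tag.startswith("no")])
```

A document store can use this visible structure to index and query on fields, which a pure key/value store cannot do.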

19

Column

Sparse, semi-structured, sorted map.

Flexible number of columns

Column keys can be grouped into families

How it is stored
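
One way to picture the model is a nested map: row key, then column family, then column, then value. The sketch below is a simplified in-memory illustration under that assumption, not the storage format of any specific column store; the class and the sample families (`profile`, `stats`) are hypothetical:

```python
from collections import defaultdict


class ColumnFamilyTable:
    """Toy sparse table: row key -> column family -> column -> value."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value):
        # Only the columns that are actually set take up space (sparse).
        self._rows[row_key][family][column] = value

    def get_row(self, row_key):
        families = self._rows.get(row_key, {})
        return {fam: dict(sorted(cols.items())) for fam, cols in sorted(families.items())}

    def scan(self):
        # Rows come back in sorted key order, as in a sorted map.
        for row_key in sorted(self._rows):
            yield row_key, self.get_row(row_key)


table = ColumnFamilyTable()
table.put("user:2", "profile", "name", "Nick")
table.put("user:1", "profile", "name", "Maria")
table.put("user:1", "stats", "logins", 7)  # user:2 has no "stats" family at all

for key, row in table.scan():
    print(key, row)
```

Each row can carry a different set of columns, which is what "flexible number of columns" means in practice.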

20

Graph

Graph theory model G = (V, E)

Store, map and query relationships

•Nodes connected by edges

•Complex relationships

•Recommend products

•ACID

Queries = graph traversal
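
As a rough illustration of "queries = graph traversal", the sketch below stores G = (V, E) as an adjacency map and answers a friend-of-a-friend question with a breadth-first traversal. The names and the sample edges are invented; a real graph database would expose its own query language instead:

```python
from collections import deque


def build_adjacency(edges):
    """Turn a list of undirected edges into node -> set of neighbours."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def traverse(adjacency, start, max_depth):
    """Breadth-first traversal: every node within max_depth hops, with its distance."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depths[node] == max_depth:
            continue  # do not expand beyond the requested depth
        for neighbour in adjacency.get(node, ()):
            if neighbour not in depths:
                depths[neighbour] = depths[node] + 1
                queue.append(neighbour)
    del depths[start]
    return depths


edges = [("nick", "maria"), ("maria", "anna"), ("anna", "luca")]
graph = build_adjacency(edges)

# "People Nick may know": nodes exactly two hops away.
reachable = traverse(graph, "nick", 2)
print([name for name, depth in reachable.items() if depth == 2])
```

The same traversal pattern underlies recommendation queries such as "products bought by people who bought this".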

21

Map reduce

Refers to 2 separate and distinct tasks:

The map job takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs)

The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples

Tasks run in parallel
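
The two tasks can be sketched with the classic word-count example. This is a single-process illustration that uses a thread pool to stand in for parallel workers; the function names and the intermediate `shuffle` step (grouping tuples by key between map and reduce) are choices made for this example, not part of the slides:

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_job(line):
    """Map: break one input line down into (key, value) tuples."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(mapped):
    """Group all values emitted by the map jobs under their key."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_job(item):
    """Reduce: combine the tuples for one key into a single, smaller tuple."""
    key, values = item
    return key, sum(values)


def word_count(lines):
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_job, lines))  # map jobs run in parallel
        groups = shuffle(mapped)
        return dict(pool.map(reduce_job, groups.items()))  # so do reduce jobs


lines = ["sql or nosql", "that is the question", "nosql is not only sql"]
print(word_count(lines))
```

Each map job only sees its own line and each reduce job only sees one key, which is what allows the work to be spread over many machines.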

22

Divide et impera

• There is no “Silver Bullet”
• There are multiple ways to model data
• How the data is going to be accessed
• Read intensive or write intensive
• Complex queries

23

Schemaless / Normalized Model

24

How do you scale?

Vertical (up): add more power (RAM/CPU/disk)

Horizontal (out): add more commodity systems

25

The 8 fallacies of distributed computing

1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn't change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.

26

Sharding

• Split up data into multiple chunks
• Store each chunk in a separate data node

• Partitioning strategy: “the shard key”
• Multi-shard operations (join/aggregate)
• Load balancing
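
A minimal sketch of one possible partitioning strategy, hash-based sharding: the shard key is hashed and the hash picks the data node. Here each "node" is just a dict, and `shard_for` and `ShardedStore` are hypothetical names; range-based or directory-based strategies are equally valid alternatives:

```python
import hashlib


def shard_for(shard_key, num_shards):
    """Map a shard key to a shard index using a stable hash."""
    digest = hashlib.sha256(shard_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    def __init__(self, num_shards):
        # Each dict stands in for a separate data node.
        self.shards = [{} for _ in range(num_shards)]

    def put(self, shard_key, record):
        self.shards[shard_for(shard_key, len(self.shards))][shard_key] = record

    def get(self, shard_key):
        # Single-shard operation: only one node is contacted.
        return self.shards[shard_for(shard_key, len(self.shards))].get(shard_key)

    def count_all(self):
        # Multi-shard operation: every node must be visited and the results combined.
        return sum(len(shard) for shard in self.shards)


store = ShardedStore(num_shards=4)
for user in ["nick", "maria", "anna", "luca"]:
    store.put(user, {"name": user})

print(store.get("maria"))
print(store.count_all())
print([len(shard) for shard in store.shards])  # how evenly the keys were spread
```

Lookups by shard key touch a single node, while aggregates and joins have to fan out to all of them, which is why the choice of shard key matters.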

27

Replication

Master / Slave
Multi-Master

Synchronous
Asynchronous

• Provide redundancy
• Increase availability
• Failover (automatic)

28

A common problem

Maria and Nick both work on the same data X:

T0  Get(X)
T1  Get(X)
T2  Put(X)
T3  Put(X)

Write Conflict
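
The timeline above is the classic lost-update problem: both clients read the same version of X, and the second Put silently overwrites the first. One common way to detect it, shown here only as an illustrative sketch, is an optimistic version check (compare-and-set). The `VersionedStore` class and the client names are assumptions made for the example:

```python
class VersionedStore:
    """Holds one value X plus a version counter used to detect conflicts."""

    def __init__(self, value):
        self.value = value
        self.version = 0

    def get(self):
        return self.value, self.version

    def put(self, new_value, expected_version):
        """Write only if nobody else has written since `expected_version` was read."""
        if expected_version != self.version:
            return False  # write conflict detected, caller must re-read and retry
        self.value = new_value
        self.version += 1
        return True


x = VersionedStore(value=100)

value_a, version_a = x.get()  # T0: first client reads X
value_b, version_b = x.get()  # T1: second client reads the same X

ok_a = x.put(value_a + 10, version_a)  # T2: first write succeeds
ok_b = x.put(value_b - 30, version_b)  # T3: second write is based on stale data

print(ok_a, ok_b, x.get())
```

Without the version check the second write would win and the first update would be lost without anyone noticing.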

29

RDBMS are ACID with transactions

Transaction: a sequence of operations that form a single unit of work

Transactions have 4 properties:
• Atomic
• Consistent
• Isolated
• Durable

30

ACID - Atomicity

Transfer 100€ from A to B

1. Read(A)

2. If A > 100

3. A=A-100

4. Write(A)

5. Read(B)

6. B=B+100

7. Write(B)

31

ACID - Consistency

Transfer 100€ from A to B

1. Read(A)

2. If A > 100

3. A=A-100

4. Write(A)

5. Read(B)

6. B=B+100

7. Write(B)

32

ACID - Isolation

Transfer 100€ from A to B

1. Read(A)

2. If A > 100

3. A=A-100

4. Write(A)

5. Read(B)

6. B=B+100

7. Write(B)

33

ACID - Durability

Transfer 100€ from A to B

1. Read(A)

2. If A > 100

3. A=A-100

4. Write(A)

5. Read(B)

6. B=B+100

7. Write(B)
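
The seven steps can be tried out with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: the `accounts` table, the `transfer` function and the `crash_after_debit` flag (which simulates a failure between steps 4 and 5) are invented for the example, and the balance check mirrors the slide's "If A > 100":

```python
import sqlite3


def transfer(conn, src, dst, amount, crash_after_debit=False):
    """Move `amount` from src to dst as one transaction; return True on commit."""
    try:
        with conn:  # commits if the block succeeds, rolls back on any exception
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if not balance > amount:  # step 2 of the slide: If A > 100
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
            if crash_after_debit:
                raise RuntimeError("simulated crash between Write(A) and Read(B)")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
    except (ValueError, RuntimeError):
        return False
    return True


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 500), ("B", 0)])
conn.commit()

print(transfer(conn, "A", "B", 100, crash_after_debit=True), balances(conn))
print(transfer(conn, "A", "B", 100), balances(conn))
```

In the first call the debit of A is rolled back, so the money is neither lost nor duplicated (atomicity); in the second call both updates become visible together.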

34

NoSQL are BASE

Basically Available:
• There will be a response to any request.
• Fast response even if some replicas are slow or crashed

Soft State:
• The state of the system could change over time
• It is the user application's task to guarantee consistency

Eventually consistent:
• The system will eventually become consistent once it stops receiving input.
• The data will propagate everywhere

35

Eventual Consistency (example)

• Nick finds a cool photo and shares it with Maria by posting on her Facebook wall
• Nick asks Maria to check it out
• Maria logs in to her account and checks her Facebook wall, but:
  - Nothing is there! (x apart)
• Nick tells Maria to wait a bit and check later
• Maria waits for a minute or so and checks back:
  - She finds the photo Nick shared with her!
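
The story can be reproduced with a toy model of asynchronous replication: writes land on one copy and are applied to the other copy later. This is a deliberately simplified sketch; the `ReplicatedWall` class is hypothetical and replication is triggered by hand instead of by a background process:

```python
from collections import deque


class ReplicatedWall:
    """A wall stored on a primary copy and one asynchronously updated replica."""

    def __init__(self):
        self.primary = []
        self.replica = []
        self._pending = deque()  # writes not yet applied to the replica

    def post(self, item):
        # The write is acknowledged as soon as the primary has it.
        self.primary.append(item)
        self._pending.append(item)

    def read_from_replica(self):
        return list(self.replica)

    def replicate(self):
        # Stands in for the background propagation of pending writes.
        while self._pending:
            self.replica.append(self._pending.popleft())


wall = ReplicatedWall()
wall.post("cool photo")          # Nick posts on Maria's wall

print(wall.read_from_replica())  # Maria reads a stale replica: nothing there yet
wall.replicate()                 # ...time passes, the write propagates...
print(wall.read_from_replica())  # Maria checks back and finds the photo
```

Once no new writes arrive and replication has caught up, both copies return the same data, which is the "eventual" part of the guarantee.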

36

CAP theorem

It’s impossible for a distributed computer system to simultaneously provide all three of these guarantees:

• Consistency: all nodes see the same data at the same time
• Availability: all clients can always read and write
• Partition tolerance: the system keeps working despite failures*

A distributed system can satisfy only 2 of them at the same time

37

Airline reservation system - Overbooking

Nick Maria

Who will take the next flight ?

EU US

38

The ATM example

• An ATM will allow you to withdraw money even if the machine is partitioned from the network
• Higher availability means higher revenue
• However, it puts a limit on the withdrawal amount
• The bank might also charge you a fee when an overdraft happens

39

From CAP to PACELC

In the absence of partitions

how does the system trade off

latency (L) and consistency (C)?

40

Consistency vs Availability

41

Summary

ACID / RDBMS                        BASE / NoSQL
Strong consistency                  Weak consistency (stale data)
Isolation                           Last write wins
Transaction                         Program managed
Mature technology                   New technology
SQL                                 No standard
Available & consistent              Available & partition tolerant
Scale up (limited)                  Scale out (unlimited*)
Shared something (disk/ram/proc)    Shared nothing (parallelizable)