vfabric sqlfire introduction

SQLFire: Scalable SQL instead of NoSQL. Jags Ramnarayan, Chief Architect, GemFire Products

Upload: jags-ramnarayan

Post on 02-Nov-2014


Category:

Technology



DESCRIPTION

VMware vFabric SQLFire: scalable SQL instead of NoSQL. There is quite a bit of buzz these days about "NoSQL" databases. The lack of transactions and of good support for querying (SQL) has kept many from adopting these solutions. This talk presents VMware SQLFire, a distributed SQL data management solution that melds Apache Derby (borrowing its SQL drivers, parsing and some aspects of the engine) with an object data grid (GemFire) to offer a horizontally scalable, memory-oriented data management system in which developers can continue to use SQL. We focus on new primitives that extend the well-known SQL data definition syntax with data partitioning and replication strategies while leaving the "select" and data manipulation parts of SQL intact, so the impact on your application is minimal. I gave this presentation at What's Next, Paris 2011 (http://www.whatsnextparis.com/abouttheseminar.html).

TRANSCRIPT

Page 1: vFabric SQLFire Introduction

Confidential

SQLFire Scalable SQL instead of NoSQL

Jags Ramnarayan, Chief Architect, GemFire Products

Page 2: vFabric SQLFire Introduction

Agenda

• Various NoSQL attributes and why SQL

• SQLFire features + Demo

• Scalability patterns

  • Hash partitioning

  • Entity groups and collocation

  • Scaling behavior using “data aware stored procedures”

• Consistency model

  • How we do distributed transactions

• Shared nothing persistence

Page 3: vFabric SQLFire Introduction

We challenge the traditional RDBMS design, not SQL

Too much I/O

Design roots don’t necessarily apply today

• Too much focus on ACID

• Disk synchronization bottlenecks

[Diagram labels: first write to LOG; second write to data files; buffers primarily tuned for I/O]

Page 4: vFabric SQLFire Introduction


Common themes in next-gen DB architectures

“Shared nothing” commodity clusters

focus shifts to memory, distributing data and clustering

Scale by partitioning the data and move behavior to data nodes

HA within cluster and across data centers

Add capacity to scale dynamically

NoSQL, Data Grids, Data Fabrics, NewSQL

Page 5: vFabric SQLFire Introduction

What is different?

Several data models

• Key-value

• Column family (inspired by Google BigTable)

• Document

• Graph

Most focus on making the model less rigid than SQL

Consistency model is not ACID

[Figure: consistency spectrum by scale. Low scale: STRICT – Full ACID (RDB); high scale: Tunable Consistency; very high scale: Eventual]

Page 6: vFabric SQLFire Introduction


What is our take with SQLFire?

Eventual consistency is too difficult for the average developer

Write(A,1) Read(A) may return 2 or (1,2)

SQL : Flexible, easily understood, strong type system

essential for integrity as well as query engine efficiency

Page 7: vFabric SQLFire Introduction


SQLFire

Replicated, partitioned tables in memory. Redundancy through memory copies.

Data resides on disk when you explicitly say so

Powerful SQL engine: standard SQL for select, DML

DDL has SQLF extensions

Leverages GemFire data grid engine.

Page 8: vFabric SQLFire Introduction


SQLFire

Applications access the distributed DB using JDBC, ADO.NET

Consistency model is FIFO, Tunable

Distributed transactions without global locks

Page 9: vFabric SQLFire Introduction


SQLFire

Asynchronous replication over WAN

Synchronous replication within cluster

Clients failover, failback

Easily integrate with existing DBs - caching framework to read through, write through or write behind

Page 10: vFabric SQLFire Introduction

SQLFire

"Data aware procedures": standard Java stored procedures with "data aware" and parallelism extensions

When nodes are added, data and behavior are rebalanced without blocking current clients

Page 11: vFabric SQLFire Introduction

Flexible Deployment Topologies

A Java application cluster can host an embedded clustered database just by changing the URL:

jdbc:sqlfire:;mcast-port=33666;host-data=true
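As a minimal sketch of what "just changing the URL" could look like from application code, the snippet below builds the embedded URL shown on the slide and opens a connection through plain JDBC. The class and method names are hypothetical, and it assumes the SQLFire JDBC driver is on the classpath; without it, the connection attempt simply reports the failure.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

final class EmbeddedSqlFireUrl {
    // Builds the embedded URL from the slide; mcast-port and host-data are the
    // two connection properties shown there.
    static String embeddedUrl(int mcastPort, boolean hostData) {
        return "jdbc:sqlfire:;mcast-port=" + mcastPort + ";host-data=" + hostData;
    }

    public static void main(String[] args) {
        String url = embeddedUrl(33666, true);
        // Assumption: the SQLFire driver jar is on the classpath.
        try (Connection conn = DriverManager.getConnection(url)) {
            System.out.println("Connected, autocommit=" + conn.getAutoCommit());
        } catch (SQLException e) {
            System.out.println("Could not connect to " + url + ": " + e.getMessage());
        }
    }
}
```

The rest of the application would keep using ordinary JDBC calls on the returned connection.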

Page 12: vFabric SQLFire Introduction


Flexible Deployment Topologies

Page 13: vFabric SQLFire Introduction


Partitioning & Replication

Page 14: vFabric SQLFire Introduction

Explore features through example

FLIGHTS
---------------------------------------------

FLIGHT_ID CHAR(6) NOT NULL, SEGMENT_NUMBER INTEGER NOT NULL, ORIG_AIRPORT CHAR(3), DEPART_TIME TIME, …

PRIMARY KEY (FLIGHT_ID, SEGMENT_NUMBER)

FLIGHTAVAILABILITY
---------------------------------------------

FLIGHT_ID CHAR(6) NOT NULL, SEGMENT_NUMBER INTEGER NOT NULL, FLIGHT_DATE DATE NOT NULL, ECONOMY_SEATS_TAKEN INTEGER, …

PRIMARY KEY (FLIGHT_ID, SEGMENT_NUMBER, FLIGHT_DATE)

FOREIGN KEY (FLIGHT_ID, SEGMENT_NUMBER) REFERENCES FLIGHTS (FLIGHT_ID, SEGMENT_NUMBER)

FLIGHTHISTORY
---------------------------------------------

FLIGHT_ID CHAR(6), SEGMENT_NUMBER INTEGER, ORIG_AIRPORT CHAR(3), DEPART_TIME TIME, DEST_AIRPORT CHAR(3), …

Relationship labels in the diagram: 1 – M and 1 – 1

SEVERAL CODE/DIMENSION TABLES
---------------------------------------------

AIRLINES: AIRLINE INFORMATION (VERY STATIC)

COUNTRIES: LIST OF COUNTRIES SERVED BY FLIGHTS

CITIES:

MAPS: PHOTOS OF REGIONS SERVED

Assume thousands of FLIGHTS rows and millions of FLIGHTAVAILABILITY records

Page 15: vFabric SQLFire Introduction

Creating Tables

CREATE TABLE FLIGHTS ( FLIGHT_ID CHAR(6) NOT NULL, SEGMENT_NUMBER INTEGER NOT NULL, ORIG_AIRPORT CHAR(3), DEPART_TIME TIME, … )

PARTITION BY COLUMN (FLIGHT_ID) REDUNDANCY 1;

CREATE TABLE FLIGHTAVAILABILITY ( FLIGHT_ID CHAR(6) NOT NULL, SEGMENT_NUMBER INTEGER NOT NULL, FLIGHT_DATE DATE NOT NULL, ECONOMY_SEATS_TAKEN INTEGER DEFAULT 0, … )

PARTITION BY COLUMN (FLIGHT_ID) COLOCATE WITH (FLIGHTS);

CREATE TABLE Airlines ( AIRLINE CHAR(2) NOT NULL PRIMARY KEY, AIRLINE_FULL VARCHAR(24), BASIC_RATE DOUBLE PRECISION, DISTANCE_DISCOUNT DOUBLE PRECISION, … )

REPLICATE;

[Diagram: SQLF servers holding partitioned tables with redundant partitions, colocated partitions, and replicated tables]

Page 16: vFabric SQLFire Introduction

Creating Tables

[Diagram: SQLF servers holding partitioned tables with redundant partitions, colocated partitions, and replicated tables]

By default, it is only the data dictionary that is persisted to disk.

Page 17: vFabric SQLFire Introduction

Creating Tables

[Diagram: SQLF servers holding partitioned tables with redundant partitions, colocated partitions, and replicated tables]

CREATE TABLE FLIGHTAVAILABILITY ( FLIGHT_ID CHAR(6) NOT NULL , SEGMENT_NUMBER INTEGER NOT NULL , FLIGHT_DATE DATE NOT NULL , ECONOMY_SEATS_TAKEN INTEGER DEFAULT 0, …) PARTITION BY COLUMN (FLIGHT_ID) COLOCATE WITH (FLIGHTS) PERSISTENT ;

Page 18: vFabric SQLFire Introduction

Partitioning Options

To partition using the Primary Key, use:

PARTITION BY PRIMARY KEY

(Primary Key’s Java implementation must hash evenly across its range)

CREATE TABLE FLIGHTS ( FLIGHT_ID CHAR(6) NOT NULL, SEGMENT_NUMBER INTEGER NOT NULL, ORIG_AIRPORT CHAR(3), DEPART_TIME TIME, … )

PARTITION BY PRIMARY KEY;

Page 19: vFabric SQLFire Introduction


Partitioning Options

When you wish to partition on a column or columns that are not the primary key, use:

PARTITION BY COLUMN (column-name [ , column-name ]*)

CREATE TABLE FLIGHTAVAILABILITY ( FLIGHT_ID CHAR(6) NOT NULL , SEGMENT_NUMBER INTEGER NOT NULL , FLIGHT_DATE DATE NOT NULL , ECONOMY_SEATS_TAKEN INTEGER DEFAULT 0, …)

PARTITION BY COLUMN (FLIGHT_ID);

Page 20: vFabric SQLFire Introduction


Partitioning Options

You can partition entries based on a range of values of one of the columns:

PARTITION BY RANGE (column-name )

( VALUES BETWEEN value AND value

[ , VALUES BETWEEN value AND value ]*)

CREATE TABLE FLIGHTAVAILABILITY ( FLIGHT_ID CHAR(6) NOT NULL , SEGMENT_NUMBER INTEGER NOT NULL , FLIGHT_DATE DATE NOT NULL , ECONOMY_SEATS_TAKEN INTEGER DEFAULT 0, …)

PARTITION BY RANGE ( economy_seats_taken )

( VALUES BETWEEN 0 AND 50,

VALUES BETWEEN 50 AND 100,

VALUES BETWEEN 100 AND 500);
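To make the routing rule concrete, here is a small sketch of how a value of economy_seats_taken could be mapped to one of the three ranges above. The class is hypothetical and is not SQLFire code. It assumes the lower bound of each range is inclusive and the upper bound exclusive, so that 50 and 100 each fall in exactly one range; what the product does with values outside every range is not covered on the slide, so the sketch just returns -1.

```java
import java.util.List;

final class RangePartitioner {
    // One "VALUES BETWEEN lower AND upper" clause.
    static final class Range {
        final int lower;
        final int upper;

        Range(int lower, int upper) {
            this.lower = lower;
            this.upper = upper;
        }
    }

    private final List<Range> ranges;

    RangePartitioner(List<Range> ranges) {
        this.ranges = List.copyOf(ranges);
    }

    // Returns the index of the range containing the value, or -1 if none does.
    // Assumption: lower bound inclusive, upper bound exclusive.
    int partitionFor(int value) {
        for (int i = 0; i < ranges.size(); i++) {
            Range r = ranges.get(i);
            if (value >= r.lower && value < r.upper) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        RangePartitioner p = new RangePartitioner(List.of(
                new Range(0, 50), new Range(50, 100), new Range(100, 500)));
        for (int seats : new int[] {0, 49, 50, 120, 700}) {
            System.out.println(seats + " seats -> range " + p.partitionFor(seats));
        }
    }
}
```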

Page 21: vFabric SQLFire Introduction


Partitioning Options

You can explicitly partition entries based on a list of potential values of a column:

PARTITION BY LIST ( column-name )

( VALUES ( value [ , value ]* ) [ , VALUES ( value [ , value ]* ) ]* )

CREATE TABLE Orders

(OrderId INT NOT NULL, ItemId INT, NumItems INT, CustomerId INT, OrderDate DATE, Priority INT, Status CHAR(10),

CONSTRAINT Pk_Orders PRIMARY KEY (OrderId),

CONSTRAINT Fk_Items FOREIGN KEY (ItemId) REFERENCES Items(ItemId))

PARTITION BY LIST ( Status )

( VALUES ( 'pending', 'returned' ),

VALUES ( 'shipped', 'received' ),

VALUES ( 'hold' ));
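The lookup implied by the list clause can be sketched as a plain map from column value to partition index. The class below is a hypothetical illustration, not SQLFire code; it ignores CHAR(10) padding and returns -1 for a status that appears in no list, since the slide does not say how such rows are placed.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ListPartitioner {
    private final Map<String, Integer> partitionByValue = new HashMap<>();

    // Each inner list corresponds to one "VALUES ( ... )" clause, in order.
    ListPartitioner(List<List<String>> valueLists) {
        for (int i = 0; i < valueLists.size(); i++) {
            for (String value : valueLists.get(i)) {
                if (partitionByValue.put(value, i) != null) {
                    throw new IllegalArgumentException("value listed twice: " + value);
                }
            }
        }
    }

    // Returns the index of the list containing the value, or -1 if none does.
    int partitionFor(String value) {
        return partitionByValue.getOrDefault(value, -1);
    }

    public static void main(String[] args) {
        ListPartitioner p = new ListPartitioner(List.of(
                List.of("pending", "returned"),
                List.of("shipped", "received"),
                List.of("hold")));
        for (String status : List.of("pending", "received", "hold", "lost")) {
            System.out.println(status + " -> list " + p.partitionFor(status));
        }
    }
}
```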

Page 22: vFabric SQLFire Introduction


Demo default partitioned tables, colocation, persistent tables


Page 23: vFabric SQLFire Introduction


Scaling with Partitioned tables

Page 24: vFabric SQLFire Introduction

Hash partitioning for linear scaling

Key hashing provides single-hop access to its partition

But what if the access is not based on the key? Say, joins are involved.
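The single-hop property can be sketched in a few lines: the client hashes the partitioning key to a bucket and looks up the one node that holds that bucket. Everything below is a hypothetical illustration (the bucket count, the server names and the round-robin bucket assignment are assumptions, not SQLFire internals).

```java
import java.util.List;

final class HashPartitioner {
    private final int bucketCount;
    private final List<String> nodes;

    HashPartitioner(int bucketCount, List<String> nodes) {
        this.bucketCount = bucketCount;
        this.nodes = List.copyOf(nodes);
    }

    // Hash the partitioning column value to a bucket. floorMod keeps the
    // result non-negative even when hashCode() is negative.
    int bucketFor(Object partitionKey) {
        return Math.floorMod(partitionKey.hashCode(), bucketCount);
    }

    // Assumed bucket-to-node assignment: simple round-robin over the nodes.
    String nodeFor(Object partitionKey) {
        return nodes.get(bucketFor(partitionKey) % nodes.size());
    }

    public static void main(String[] args) {
        HashPartitioner p = new HashPartitioner(113, List.of("server1", "server2", "server3"));
        for (String flightId : List.of("AA1111", "AA1112", "US1516")) {
            System.out.println(flightId + " -> bucket " + p.bucketFor(flightId)
                    + " on " + p.nodeFor(flightId));
        }
    }
}
```

Because the mapping is a pure function of the key, any client can compute the target node locally and send a key-based request straight to it.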

Page 25: vFabric SQLFire Introduction


Hash partitioning only goes so far

Consider this query :

Select * from flights, flightAvailability

where <equijoin flights with flightAvailability>

and flightId ='xxx';

If both tables are hash partitioned, the join logic will need to execute on all nodes where flightAvailability data is stored

Distributed joins are expensive and inhibit scaling

• Joins across distributed nodes could involve distributed locks and potentially a lot of intermediate data transfer across nodes

EquiJOIN of rows across multiple nodes is not supported in SQLFire 1.0

Page 26: vFabric SQLFire Introduction


Partition aware DB design

• Designer thinks about how data maps to partitions

• The main idea is to:

1) minimize excessive data distribution by keeping the most frequently accessed and joined data collocated on partitions

2) Collocate transaction working set on partitions so complex 2-phase commits/paxos commit is eliminated or minimized.

• Read Pat Helland’s “Life beyond Distributed Transactions” and the Google Megastore paper

Page 27: vFabric SQLFire Introduction

Partition aware DB design

• Turns out OLTP systems lend themselves well to this need

• Typically it is the number of entities that grows over time and not the size of the entity

  • Customer count perpetually grows, not the size of the customer info

• Most often access is very restricted and based on select entities

  • given a FlightID, fetch flightAvailability records

  • given a customerID, add/remove orders, shipment records

• Identify partition key for “Entity Group”

  • "entity groups": set of entities across several related tables that can all share a single identifier

  • flightID is shared between the parent and child tables

  • CustomerID shared between customer, order and shipment tables

Page 28: vFabric SQLFire Introduction


Partition aware DB design

• Entity groups defined in SQLFire using “colocation” clause

• Entity group guaranteed to be collocated in presence of failures or rebalance

• Now, complex queries can be executed without requiring excessive distributed data access
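The effect of colocating an entity group can be sketched as follows: if parent and child rows are both placed by hashing the same FLIGHT_ID, the equijoin for one flight only needs the rows held on a single node. The class is a hypothetical in-memory model (node storage as lists, a modulo hash for placement), not how SQLFire stores data.

```java
import java.util.ArrayList;
import java.util.List;

final class ColocationSketch {
    private final int nodeCount;
    // Per-node storage: list index = node, value = rows held on that node.
    private final List<List<String>> flights = new ArrayList<>();
    private final List<List<String[]>> availability = new ArrayList<>();

    ColocationSketch(int nodeCount) {
        this.nodeCount = nodeCount;
        for (int i = 0; i < nodeCount; i++) {
            flights.add(new ArrayList<>());
            availability.add(new ArrayList<>());
        }
    }

    // Both tables are placed with the same function of FLIGHT_ID, so rows
    // sharing a FLIGHT_ID always land on the same node.
    int nodeFor(String flightId) {
        return Math.floorMod(flightId.hashCode(), nodeCount);
    }

    void insertFlight(String flightId) {
        flights.get(nodeFor(flightId)).add(flightId);
    }

    void insertAvailability(String flightId, String flightDate) {
        availability.get(nodeFor(flightId)).add(new String[] {flightId, flightDate});
    }

    // Equijoin for one FLIGHT_ID, evaluated using only the owning node's rows.
    List<String> joinOnOwningNode(String flightId) {
        int node = nodeFor(flightId);
        List<String> result = new ArrayList<>();
        for (String flight : flights.get(node)) {
            if (!flight.equals(flightId)) {
                continue;
            }
            for (String[] row : availability.get(node)) {
                if (row[0].equals(flight)) {
                    result.add(flight + " " + row[1]);
                }
            }
        }
        return result;
    }
}
```

No other node is consulted in `joinOnOwningNode`, which is the point of keeping the entity group together.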

Page 29: vFabric SQLFire Introduction


Partition Aware DB design

STAR schema design is the norm in OLTP design

Fact tables (fast changing) are natural partitioning candidates

• Partition by: FlightID … Availability, history rows colocated with Flights

Dimension tables are natural replicated table candidates

• Replicate Airlines, Countries, Cities on all nodes

Dealing with Joins involving M-M relationships

• Can the one side of the M-M become a replicated table?

• If not, run the Join logic in a parallel stored procedure to minimize distribution

• Else, split the query into multiple queries in application

Page 30: vFabric SQLFire Introduction

Scaling Application logic with Parallel “Data Aware procedures”

Page 31: vFabric SQLFire Introduction


Procedures

Java Stored Procedures may be created according to the SQL Standard

SQLFabric also supports the JDBC type Types.JAVA_OBJECT. A parameter of type JAVA_OBJECT supports an arbitrary Serializable Java object.

In this case, the procedure will be executed on the server to which a client is connected (or locally for Peer Clients)

CREATE PROCEDURE getOverBookedFlights

(IN argument OBJECT, OUT result OBJECT)

LANGUAGE JAVA PARAMETER STYLE JAVA

READS SQL DATA DYNAMIC RESULT SETS 1

EXTERNAL NAME com.acme.OverBookedFLights;

Page 32: vFabric SQLFire Introduction

Data Aware Procedures

Parallelize procedure and prune to nodes with required data

Extend the procedure call with the following syntax:

CALL [PROCEDURE]

procedure_name

( [ expression [, expression ]* ] )

[ WITH RESULT PROCESSOR processor_name ]

[ { ON TABLE table_name [ WHERE whereClause ] } |

{ ON {ALL | SERVER GROUPS (server_group_name [, server_group_name ]*) }}

]

[Diagram: Client, Fabric Server 1, Fabric Server 2]

Hint the data the procedure depends on:

CALL getOverBookedFlights( <bind arguments> )

ON TABLE FLIGHTAVAILABILITY

WHERE FLIGHTID = <SomeFLIGHTID> ;

If the table is partitioned by the columns in the WHERE clause, the procedure execution is pruned to the nodes with the data (the node with <SomeFLIGHTID> in this case)
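The pruning decision can be sketched as a routing function: a filter on the partitioning column selects the single owning node, and the absence of such a filter means every node runs the procedure. The class is hypothetical and uses a modulo hash as a stand-in for the real placement logic.

```java
import java.util.Set;
import java.util.TreeSet;

final class ProcedureRouting {
    private final int nodeCount;

    ProcedureRouting(int nodeCount) {
        this.nodeCount = nodeCount;
    }

    // Assumed placement function for the partitioning column.
    int ownerOf(String flightId) {
        return Math.floorMod(flightId.hashCode(), nodeCount);
    }

    // flightIdFilter models "WHERE FLIGHTID = ?" on the partitioning column.
    // A null filter models a call with no usable WHERE clause.
    Set<Integer> targetNodes(String flightIdFilter) {
        Set<Integer> targets = new TreeSet<>();
        if (flightIdFilter != null) {
            targets.add(ownerOf(flightIdFilter)); // pruned to the owner
        } else {
            for (int node = 0; node < nodeCount; node++) {
                targets.add(node); // no pruning possible
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        ProcedureRouting routing = new ProcedureRouting(3);
        System.out.println("with filter:    " + routing.targetNodes("AA1111"));
        System.out.println("without filter: " + routing.targetNodes(null));
    }
}
```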

Page 33: vFabric SQLFire Introduction

Parallelize procedure then aggregate (reduce)

Register a Java Result Processor (optional in some cases):

CALL SQLF.CreateResultProcessor( processor_name, processor_class_name);

CALL [PROCEDURE]

procedure_name

( [ expression [, expression ]* ] )

[ WITH RESULT PROCESSOR processor_name ]

[ { ON TABLE table_name [ WHERE whereClause ] } |

{ ON {ALL | SERVER GROUPS (server_group_name [, server_group_name ]*) }}

]

[Diagram: Client, Fabric Server 1, Fabric Server 2, Fabric Server 3]
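The parallelize-then-reduce pattern can be sketched with plain Java concurrency: each node's procedure invocation is modeled as a `Callable`, and a reducer merges the partial results on the caller. The `ResultReducer` interface here is a hypothetical stand-in; it is not the SQLFire result processor API, whose actual interface is not shown on the slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ScatterGatherSketch {
    // Hypothetical stand-in for a result processor: merges per-node results.
    interface ResultReducer<T, R> {
        R reduce(List<T> partials);
    }

    // Runs one task per node in parallel, then reduces the partial results.
    // The task list must not be empty.
    static <T, R> R callOnAll(List<Callable<T>> perNodeTasks, ResultReducer<T, R> reducer) {
        ExecutorService pool = Executors.newFixedThreadPool(perNodeTasks.size());
        try {
            List<T> partials = new ArrayList<>();
            // invokeAll blocks until every task has completed.
            for (Future<T> future : pool.invokeAll(perNodeTasks)) {
                partials.add(future.get());
            }
            return reducer.reduce(partials);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Each callable stands for the overbooked-flight count found on one server.
        List<Callable<Integer>> tasks = List.of(() -> 3, () -> 0, () -> 5);
        Integer total = ScatterGatherSketch.<Integer, Integer>callOnAll(
                tasks, partials -> partials.stream().mapToInt(Integer::intValue).sum());
        System.out.println("overbooked flights across all servers: " + total);
    }
}
```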

Page 34: vFabric SQLFire Introduction


Consistency model

Page 35: vFabric SQLFire Introduction


Consistency Model without Transactions

• Replication within cluster is always eager and synchronous

• Row updates are always atomic; No need to use transactions

• FIFO consistency: writes performed by a single thread are seen by all other processes in the order in which they were issued

• Consistency in Partitioned tables

• a partitioned table row is owned by one member at a point in time

• all updates are serialized to replicas through owner

• "Total ordering" at a row level: atomic and isolated

• Membership changes and consistency

• Pessimistic concurrency support using ‘Select for update’

• Support for referential integrity

Page 36: vFabric SQLFire Introduction


Distributed Transactions

• Full support for distributed transactions (Single phase commit)

• Highly scalable without any centralized coordinator or lock manager

• We make some important assumptions

• Most OLTP transactions are small in duration and size

• W-W conflicts are very rare in practice

• How does it work?

• Each data node has a sub-coordinator to track TX state

• Eagerly acquire local “write” locks on each replica

• Object owned by a single primary at a point in time

• Fail fast if lock cannot be obtained

• Atomic and works with the cluster Failure detection system

• Isolated until commit

• Only support local isolation during commit
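The "eagerly acquire write locks, fail fast on conflict" idea can be sketched for a single node as a non-blocking lock table keyed by row. This is a hypothetical simplification: it models one lock table rather than one per replica, leaves out the sub-coordinator and failure detection, and assumes a transaction locks its whole write set in one call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class FailFastLocks {
    // Row key -> id of the transaction currently holding its write lock.
    private final Map<String, Long> writeLocks = new ConcurrentHashMap<>();

    // Takes the write lock without waiting; false signals a write-write conflict.
    boolean tryWriteLock(String rowKey, long txId) {
        Long holder = writeLocks.putIfAbsent(rowKey, txId);
        return holder == null || holder == txId;
    }

    // Eagerly locks every row in the write set. On the first conflict the
    // locks taken by this call are released and the transaction fails fast.
    boolean lockAll(List<String> rowKeys, long txId) {
        List<String> acquired = new ArrayList<>();
        for (String key : rowKeys) {
            if (!tryWriteLock(key, txId)) {
                release(acquired, txId);
                return false;
            }
            acquired.add(key);
        }
        return true;
    }

    // Called at commit or rollback; only removes locks held by this transaction.
    void release(List<String> rowKeys, long txId) {
        for (String key : rowKeys) {
            writeLocks.remove(key, txId);
        }
    }
}
```

Since nothing ever waits for a lock, there is no lock queue to manage and no deadlock to detect; the cost is that a conflicting transaction has to be retried by the application.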

Page 37: vFabric SQLFire Introduction


Parallel disk persistence

Page 38: vFabric SQLFire Introduction


Why is disk latency so high?

Challenges

• Disk seek time is still > 2 ms

• OLTP transactions are small writes

• Flushing to disk will result in a seek

• Best rates are in the 100s per second

RDBs and NoSQL try to avoid the problem

• Append to transaction logs; out-of-band writes to data files

• But, reads can cause seeks to disk

Page 39: vFabric SQLFire Introduction


Disk persistence in SQLF

Parallel log structured storage

Each partition writes in parallel

Backups write to disk also

• Increase reliability against h/w loss

[Diagram: two members, each with memory tables whose records (Record1, Record2, Record3) are written through OS buffers to append-only operation logs, with a LOG compressor]

• Don’t seek to disk

• Don’t flush all the way to disk

  – Use OS scheduler to time write

• Do this on primary + secondary

• Realize very high throughput
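A minimal sketch of the append-only idea, under stated assumptions: every update changes the in-memory table and appends one record to the end of a log file, no explicit sync is requested (so the operating system decides when the bytes reach disk), and recovery replays the log in order. The class, the `key=value` record format and the temp-file helper are all hypothetical; log compaction and the secondary copy are left out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

final class AppendOnlyLogSketch {
    private final Path logFile;
    private final Map<String, String> memoryTable = new HashMap<>();

    AppendOnlyLogSketch(Path logFile) {
        this.logFile = logFile;
    }

    // Helper so callers do not have to handle the checked IOException.
    static Path newTempLog() {
        try {
            return Files.createTempFile("oplog", ".log");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Updates memory, then appends one record at the end of the log.
    // No sync is forced, so the write lands in OS buffers first.
    void put(String key, String value) {
        memoryTable.put(key, value);
        String record = key + "=" + value + "\n";
        try {
            Files.write(logFile, record.getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    String get(String key) {
        return memoryTable.get(key);
    }

    // Recovery: replay the log in order; later records overwrite earlier ones.
    static Map<String, String> replay(Path logFile) {
        Map<String, String> table = new HashMap<>();
        try {
            for (String line : Files.readAllLines(logFile, StandardCharsets.UTF_8)) {
                int eq = line.indexOf('=');
                if (eq > 0) {
                    table.put(line.substring(0, eq), line.substring(eq + 1));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return table;
    }
}
```

Reads are served from the in-memory table, so the log is only read back during recovery.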

Page 40: vFabric SQLFire Introduction


Performance benchmark

Page 41: vFabric SQLFire Introduction


How does it perform? Scale?

Scale from 2 to 10 servers (one per host)

Scale from 200 to 1200 simulated clients (10 hosts)

Single partitioned table: int PK, 40 fields (20 ints, 20 strings)

[Chart: Partitioned table throughput - Query By PK (redundant copy). X axis: servers (2 to 10). Y axes: queries per second (0 to 800,000) and client threads (0 to 1,400). Series: queriesPerSecond, client threads.]

Page 42: vFabric SQLFire Introduction


How does it perform? Scale?

CPU% remained low per server – about 30% indicating many more clients could be handled

[Chart: Partitioned table throughput and CPU - Query By PK (redundant copy). X axis: servers (2 to 10). Y axes: queries per second (0 to 800,000) and CPU usage (0 to 90). Series: queriesPerSecond, vmCPUClient, vmCPUServer.]

Page 43: vFabric SQLFire Introduction


Is latency low with scale?

Latency decreases with server capacity

50-70% take < 1 millisecond

About 90% take less than 2 milliseconds

Small percentage of outliers

[Chart: Partitioned table response time - Query By PK (redundant copy). X axis: servers (2 to 10). Y axis: % queries (0 to 80). Series: < 1 ms, 1-2 ms, 2-5 ms, 5-10 ms.]

Page 44: vFabric SQLFire Introduction

Q & A

VMware vFabric SQLFire BETA will be released in early June

Check out community.gemstone.com

Page 45: vFabric SQLFire Introduction

Built using GemFire object data fabric + Derby

[Architecture diagram; recoverable labels:]

• JDBC 4.x, ADO.NET

• GemFire CORE (from GFE): storage (memory + disk), partitioning, replication, HA, events, reliable distribution

• Simplified config model: standard SQL DDL with extensions; cluster-wide config

• Query engine with cost-based optimizer; efficient tuple storage model; skip list based indexing

• QUERYING: distributed scatter/gather; rich SQL syntax; parallel data-aware procedures

• FRAMEWORK for: read through, write through, write behind

• NEW + Derby: SQL façade on top of GFE framework

• Component tags: Derby, NEW

Design focus: optimize for horizontally partitioned data models