vFabric SQLFire Introduction
DESCRIPTION
VMware vFabric SQLFire - scalable SQL instead of NoSQL. There is quite a bit of buzz these days about "NoSQL" databases. The lack of transactions and of good support for querying (SQL) has kept many from adopting these solutions. This talk presents VMware SQLFire, a distributed SQL data management solution that melds Apache Derby (borrowing SQL drivers, parsing and some aspects of the engine) and an object data grid (GemFire) to offer a horizontally scalable, memory-oriented data management system where developers can continue to use SQL. We focus on new primitives that extend the well-known SQL data definition syntax with data partitioning and replication strategies while leaving the "select" and data manipulation parts of SQL intact, so the impact on your application is minimal. I gave this presentation at What's Next, Paris 2011 (http://www.whatsnextparis.com/abouttheseminar.html).
TRANSCRIPT
Confidential
SQLFire Scalable SQL instead of NoSQL
Jags Ramnarayan, Chief Architect, GemFire Products
Agenda
• Various NoSQL attributes and why SQL
• SQLFire features + Demo
• Scalability patterns
  • Hash partitioning
  • Entity groups and collocation
  • Scaling behavior using “data aware stored procedures”
• Consistency model
  • How we do distributed transactions
• Shared nothing persistence
We challenge the traditional RDBMS design, NOT SQL
Too much I/O
Design roots don’t necessarily apply today
• Too much focus on ACID
• Disk synchronization bottlenecks
First write to LOG
Second write to Data files
Buffers primarily tuned for I/O
Common themes in next-gen DB architectures
“Shared nothing” commodity clusters
focus shifts to memory, distributing data and clustering
Scale by partitioning the data and move behavior to data nodes
HA within cluster and across data centers
Add capacity to scale dynamically
NoSQL, Data Grids, Data Fabrics, NewSQL
What is different?
Several data models
• Key-value
• Column family (inspired by Google BigTable)
• Document
• Graph
Most focus on making model less rigid than SQL
Consistency model is not ACID
Consistency by scale:
• Low scale: STRICT – Full ACID (RDB)
• High scale: Tunable Consistency
• Very high scale: Eventual
What is our take with SQLFire?
Eventual consistency is too difficult for the average developer
Write(A,1) Read(A) may return 2 or (1,2)
SQL: flexible, easily understood, strong type system
• essential for integrity as well as query engine efficiency
SQLFire
Replicated, partitioned tables in memory. Redundancy through memory copies.
Data resides on disk when you explicitly say so
Powerful SQL engine: standard SQL for select, DML
DDL has SQLF extensions
Leverages GemFire data grid engine.
SQLFire
Applications access the distributed DB using JDBC, ADO.NET
Consistency model is FIFO, Tunable
Distributed transactions without global locks
SQLFire
Asynchronous replication over WAN
Synchronous replication within cluster
Clients failover, failback
Easily integrate with existing DBs - caching framework to read through, write through or write behind
SQLFire
"Data aware procedures": standard Java stored procedures with "data aware" and parallelism extensions
When nodes are added, data and behavior are rebalanced without blocking current clients
Flexible Deployment Topologies
A Java application cluster can host an embedded clustered database by just changing the URL:
jdbc:sqlfire:;mcast-port=33666;host-data=true
Flexible Deployment Topologies
Partitioning & Replication
Explore features through example
FLIGHTS---------------------------------------------
FLIGHT_ID CHAR(6) NOT NULL , SEGMENT_NUMBER INTEGER NOT NULL , ORIG_AIRPORT CHAR(3), DEPART_TIME TIME,…..
PRIMARY KEY (FLIGHT_ID, SEGMENT_NUMBER)
FLIGHTAVAILABILITY---------------------------------------------
FLIGHT_ID CHAR(6) NOT NULL , SEGMENT_NUMBER INTEGER NOT NULL , FLIGHT_DATE DATE NOT NULL , ECONOMY_SEATS_TAKEN INTEGER ,…..
PRIMARY KEY ( FLIGHT_ID, SEGMENT_NUMBER, FLIGHT_DATE)
FOREIGN KEY (FLIGHT_ID, SEGMENT_NUMBER) REFERENCES FLIGHTS ( FLIGHT_ID, SEGMENT_NUMBER)
FLIGHTHISTORY---------------------------------------------
FLIGHT_ID CHAR(6), SEGMENT_NUMBER INTEGER, ORIG_AIRPORT CHAR(3), DEPART_TIME TIME, DEST_AIRPORT CHAR(3),…..
1 – M
1 – 1
SEVERAL CODE/DIMENSION TABLES---------------------------------------------
AIRLINES: AIRLINE INFORMATION (VERY STATIC)
COUNTRIES: LIST OF COUNTRIES SERVED BY FLIGHTS
CITIES:
MAPS: PHOTOS OF REGIONS SERVED
Assume thousands of FLIGHTS rows and millions of FLIGHTAVAILABILITY records
Creating Tables
CREATE TABLE FLIGHTS ( FLIGHT_ID CHAR(6) NOT NULL, SEGMENT_NUMBER INTEGER NOT NULL, ORIG_AIRPORT CHAR(3), DEPART_TIME TIME, … )
PARTITION BY COLUMN (FLIGHT_ID) REDUNDANCY 1;
CREATE TABLE FLIGHTS ( FLIGHT_ID CHAR(6) NOT NULL, SEGMENT_NUMBER INTEGER NOT NULL, ORIG_AIRPORT CHAR(3), DEPART_TIME TIME, … )
PARTITION BY COLUMN (FLIGHT_ID);
CREATE TABLE FLIGHTAVAILABILITY ( FLIGHT_ID CHAR(6) NOT NULL, SEGMENT_NUMBER INTEGER NOT NULL, FLIGHT_DATE DATE NOT NULL, ECONOMY_SEATS_TAKEN INTEGER DEFAULT 0, … )
PARTITION BY COLUMN (FLIGHT_ID) COLOCATE WITH (FLIGHTS);
CREATE TABLE Airlines ( AIRLINE CHAR(2) NOT NULL PRIMARY KEY, AIRLINE_FULL VARCHAR(24), BASIC_RATE DOUBLE PRECISION, DISTANCE_DISCOUNT DOUBLE PRECISION, … )
REPLICATE;
[Diagram: three SQLF members, each hosting a replicated table copy, a partitioned table, a redundant partition and a colocated partition]
Creating Tables
By default, it is only the data dictionary that is persisted to disk.
Creating Tables
CREATE TABLE FLIGHTAVAILABILITY ( FLIGHT_ID CHAR(6) NOT NULL, SEGMENT_NUMBER INTEGER NOT NULL, FLIGHT_DATE DATE NOT NULL, ECONOMY_SEATS_TAKEN INTEGER DEFAULT 0, … )
PARTITION BY COLUMN (FLIGHT_ID) COLOCATE WITH (FLIGHTS) PERSISTENT;
Partitioning Options
To partition using the primary key, use:
PARTITION BY PRIMARY KEY
(The primary key’s Java implementation must hash evenly across its range)
CREATE TABLE FLIGHTS ( FLIGHT_ID CHAR(6) NOT NULL, SEGMENT_NUMBER INTEGER NOT NULL, ORIG_AIRPORT CHAR(3), DEPART_TIME TIME, … )
PARTITION BY PRIMARY KEY;
Partitioning Options
When you wish to partition on a column or columns that are not the primary key, use:
PARTITION BY COLUMN (column-name [ , column-name ]*)
CREATE TABLE FLIGHTAVAILABILITY ( FLIGHT_ID CHAR(6) NOT NULL , SEGMENT_NUMBER INTEGER NOT NULL , FLIGHT_DATE DATE NOT NULL , ECONOMY_SEATS_TAKEN INTEGER DEFAULT 0, …)
PARTITION BY COLUMN (FLIGHT_ID);
Partitioning Options
You can partition entries based on a range of values of one of the columns:
PARTITION BY RANGE (column-name )
( VALUES BETWEEN value AND value
[ , VALUES BETWEEN value AND value ]*)
CREATE TABLE FLIGHTAVAILABILITY ( FLIGHT_ID CHAR(6) NOT NULL , SEGMENT_NUMBER INTEGER NOT NULL , FLIGHT_DATE DATE NOT NULL , ECONOMY_SEATS_TAKEN INTEGER DEFAULT 0, …)
PARTITION BY RANGE ( economy_seats_taken )
( VALUES BETWEEN 0 AND 50,
VALUES BETWEEN 50 AND 100,
VALUES BETWEEN 100 AND 500);
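The routing implied by the range clause can be sketched in a few lines. This is an illustration only, not SQLFire code: it assumes each range includes its lower bound and excludes its upper bound (so 50 falls in the second range), and the helper name `range_partition` is hypothetical.

```python
# Ranges copied from the PARTITION BY RANGE example above.
RANGES = [(0, 50), (50, 100), (100, 500)]


def range_partition(value, ranges=RANGES):
    """Return the index of the range holding value, or None if no range matches.

    Assumption: lower bound inclusive, upper bound exclusive.
    """
    for index, (low, high) in enumerate(ranges):
        if low <= value < high:
            return index
    return None


seats_taken = 75
print(range_partition(seats_taken))  # prints 1
```

How values outside every declared range are placed is not covered by the slide, so the sketch simply returns None for them.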
Partitioning Options
You can explicitly partition entries based on a list of potential values of a column:
PARTITION BY LIST ( column-name )
( VALUES ( value [ , value ]* ) [ , VALUES ( value [ , value ]* ) ]* )
CREATE TABLE Orders
(OrderId INT NOT NULL, ItemId INT, NumItems INT, CustomerId INT, OrderDate DATE, Priority INT, Status CHAR(10),
CONSTRAINT Pk_Orders PRIMARY KEY (OrderId),
CONSTRAINT Fk_Items FOREIGN KEY (ItemId) REFERENCES Items(ItemId))
PARTITION BY LIST ( Status )
( VALUES ( 'pending', 'returned' ),
VALUES ( 'shipped', 'received' ),
VALUES ( 'hold' ));
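List partitioning is a lookup from column value to the group that names it. A minimal sketch, using the three value lists from the Orders example; the function name `list_partition` is hypothetical and the handling of unlisted values (None) is an assumption.

```python
# Value lists copied from the PARTITION BY LIST example above.
LISTS = [
    ("pending", "returned"),
    ("shipped", "received"),
    ("hold",),
]


def list_partition(status, lists=LISTS):
    """Return the index of the value list containing status, or None."""
    for index, values in enumerate(lists):
        if status in values:
            return index
    return None


print(list_partition("received"))  # prints 1
```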
Demo: default partitioned tables, colocation, persistent tables
Scaling with Partitioned tables
Hash partitioning for linear scaling
Key hashing provides single-hop access to a key's partition
But what if the access is not based on the key … say, joins are involved?
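The single-hop property of key hashing can be sketched as follows. This is an illustration under stated assumptions, not SQLFire's actual hashing or bucket placement: the bucket count, member names and the CRC32 hash are all chosen for the example.

```python
import zlib

NUM_BUCKETS = 8  # hypothetical bucket count, chosen for illustration
MEMBERS = ["server-1", "server-2", "server-3"]  # hypothetical member names


def bucket_for(key, num_buckets=NUM_BUCKETS):
    """Map a partitioning key to a bucket using a stable hash (CRC32 here)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def owner_of(key, members=MEMBERS):
    """Return the member hosting the key's bucket (simple modulo placement)."""
    return members[bucket_for(key) % len(members)]


# Any client can compute the owner locally and contact it in a single hop.
owner = owner_of("AA1111")
print(owner)
```

Because the mapping depends only on the key, a lookup by key never has to ask other members where the row lives. A query that does not supply the key has no such shortcut, which is the problem the following slides address.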
Hash partitioning only goes so far
Consider this query :
Select * from flights, flightAvailability
where <equijoin flights with flightAvailability>
and flightId ='xxx';
If both tables are hash partitioned the join logic will need execution on all nodes where flightavailability data is stored
Distributed joins are expensive and inhibit scaling
• joins across distributed nodes could involve distributed locks and potentially a lot of intermediate data transfer across nodes
EquiJOIN of rows across multiple nodes is not supported in SQLFire 1.0
Partition aware DB design
• Designer thinks about how data maps to partitions
• The main idea is to:
1) minimize excessive data distribution by keeping the most frequently accessed and joined data collocated on partitions
2) Collocate transaction working set on partitions so complex 2-phase commits/paxos commit is eliminated or minimized.
• Read Pat Helland’s “Life beyond Distributed Transactions” and the Google MegaStore paper
Partition aware DB design
• Turns out OLTP systems lend themselves well to this need
  • Typically it is the number of entities that grows over time and not the size of the entity.
  • Customer count perpetually grows, not the size of the customer info
• Most often access is very restricted and based on select entities
  • given a FlightID, fetch flightAvailability records
  • given a customerID, add/remove orders, shipment records
• Identify partition key for “Entity Group”
  • "entity groups": set of entities across several related tables that can all share a single identifier
  • flightID is shared between the parent and child tables
  • CustomerID shared between customer, order and shipment tables
Partition aware DB design
• Entity groups defined in SQLFire using “colocation” clause
• Entity group guaranteed to be collocated in presence of failures or rebalance
• Now, complex queries can be executed without requiring excessive distributed data access
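Why colocation makes the join local can be shown with a small simulation. It is a sketch, not SQLFire internals: both tables are routed by the same FLIGHT_ID hash, so a flight's parent and child rows always land in the same bucket. The bucket count, sample rows and function names are assumptions made for the example.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count


def bucket_for(flight_id):
    """Both tables use the same function, which is what colocation guarantees."""
    return zlib.crc32(flight_id.encode("utf-8")) % NUM_BUCKETS


# Illustrative rows: FLIGHTS (flight_id, segment) and
# FLIGHTAVAILABILITY (flight_id, segment, flight_date, economy_seats_taken).
flights = [("AA1111", 1), ("UA2222", 1)]
availability = [
    ("AA1111", 1, "2011-05-01", 10),
    ("AA1111", 1, "2011-05-02", 12),
    ("UA2222", 1, "2011-05-01", 3),
]

buckets = defaultdict(lambda: {"flights": [], "availability": []})
for row in flights:
    buckets[bucket_for(row[0])]["flights"].append(row)
for row in availability:
    buckets[bucket_for(row[0])]["availability"].append(row)


def local_join(flight_id):
    """Equijoin for one flight, touching only the single bucket that owns it."""
    part = buckets[bucket_for(flight_id)]
    return [
        (f, a)
        for f in part["flights"]
        for a in part["availability"]
        if f[0] == a[0] == flight_id and f[1] == a[1]
    ]


print(len(local_join("AA1111")))  # prints 2
```

If the child table were hashed on a different column, the same join would have to gather rows from every bucket.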
Partition Aware DB design
STAR schema design is the norm in OLTP design
Fact tables (fast changing) are natural partitioning candidates
• Partition by: FlightID … Availability, history rows colocated with Flights
Dimension tables are natural replicated table candidates
• Replicate Airlines, Countries, Cities on all nodes
Dealing with Joins involving M-M relationships
• Can the one side of the M-M become a replicated table?
• If not, run the Join logic in a parallel stored procedure to minimize distribution
• Else, split the query into multiple queries in application
Scaling Application logic with Parallel “Data Aware procedures”
Procedures
Java Stored Procedures may be created according to the SQL Standard
SQLFabric also supports the JDBC type Types.JAVA_OBJECT. A parameter of type JAVA_OBJECT supports an arbitrary Serializable Java object.
In this case, the procedure will be executed on the server to which a client is connected (or locally for Peer Clients)
CREATE PROCEDURE getOverBookedFlights
(IN argument OBJECT, OUT result OBJECT)
LANGUAGE JAVA PARAMETER STYLE JAVA
READS SQL DATA DYNAMIC RESULT SETS 1
EXTERNAL NAME com.acme.OverBookedFLights;
Data Aware Procedures
Parallelize procedure and prune to nodes with required data
Extend the procedure call with the following syntax:
CALL [PROCEDURE]
procedure_name
( [ expression [, expression ]* ] )
[ WITH RESULT PROCESSOR processor_name ]
[ { ON TABLE table_name [ WHERE whereClause ] } |
{ ON {ALL | SERVER GROUPS (server_group_name [, server_group_name ]*) }}
]
Hint the data the procedure depends on:
CALL getOverBookedFlights( <bind arguments> )
ON TABLE FLIGHTAVAILABILITY
WHERE FLIGHT_ID = <SomeFLIGHTID>;
If the table is partitioned by the columns in the WHERE clause, procedure execution is pruned to the nodes with the data (the node holding <SomeFLIGHTID> in this case)
[Diagram: Client, Fabric Server 1, Fabric Server 2]
Parallelize procedure then aggregate (reduce)
CALL [PROCEDURE]
procedure_name
( [ expression [, expression ]* ] )
[ WITH RESULT PROCESSOR processor_name ]
[ { ON TABLE table_name [ WHERE whereClause ] } |
{ ON {ALL | SERVER GROUPS (server_group_name [, server_group_name ]*) }}
]
Register a Java Result Processor (optional in some cases):
CALL SQLF.CreateResultProcessor( processor_name, processor_class_name);
[Diagram: Client, Fabric Server 1, Fabric Server 2, Fabric Server 3]
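The two behaviors on these slides, pruning a call to the node that owns the data and running it everywhere before reducing, can be simulated in a few lines. This is a conceptual sketch, not the SQLFire procedure API: the member names, sample rows, the `call_procedure` helper and the callable standing in for a result processor are all hypothetical.

```python
import zlib

MEMBERS = ["server-1", "server-2", "server-3"]  # hypothetical member names


def member_for(flight_id):
    """Route a FLIGHT_ID to the member that owns its partition (stable hash)."""
    return MEMBERS[zlib.crc32(flight_id.encode("utf-8")) % len(MEMBERS)]


# Illustrative rows: (flight_id, flight_date, economy_seats_taken)
ROWS = [
    ("AA1111", "2011-05-01", 120),
    ("AA1111", "2011-05-02", 80),
    ("UA2222", "2011-05-01", 130),
    ("US3333", "2011-05-01", 40),
]
DATA = {member: [] for member in MEMBERS}
for row in ROWS:
    DATA[member_for(row[0])].append(row)


def overbooked_on(member, capacity):
    """Stand-in for the procedure body: scans only the rows local to one member."""
    return [row for row in DATA[member] if row[2] > capacity]


def concatenate(partials):
    """Default merge: concatenate the per-member result lists."""
    return [row for part in partials for row in part]


def call_procedure(capacity, flight_id=None, result_processor=concatenate):
    """Run on one member when a partitioning key is given, else on all members."""
    targets = [member_for(flight_id)] if flight_id is not None else list(MEMBERS)
    partials = [overbooked_on(member, capacity) for member in targets]
    return targets, result_processor(partials)


# Pruned call: only the member owning AA1111 executes the procedure.
pruned_targets, pruned_rows = call_procedure(100, flight_id="AA1111")

# Parallel call on all members, reduced to a single count.
all_targets, overbooked_count = call_procedure(
    100, result_processor=lambda partials: sum(len(part) for part in partials)
)
print(len(pruned_targets), len(all_targets), overbooked_count)  # prints 1 3 2
```

The reduce step plays the role the slide assigns to the result processor: each member returns a partial result and one function combines them for the caller.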
Consistency model
Consistency Model without Transactions
• Replication within cluster is always eager and synchronous
• Row updates are always atomic; No need to use transactions
• FIFO consistency: writes performed by a single thread are seen by all other processes in the order in which they were issued
• Consistency in Partitioned tables
  • a partitioned table row is owned by one member at a point in time
  • all updates are serialized to replicas through the owner
  • "Total ordering" at a row level: atomic and isolated
• Membership changes and consistency
• Pessimistic concurrency support using ‘Select for update’
• Support for referential integrity
Distributed Transactions
• Full support for distributed transactions (Single phase commit)
• Highly scalable without any centralized coordinator or lock manager
• We make some important assumptions
  • Most OLTP transactions are small in duration and size
  • W-W conflicts are very rare in practice
• How does it work?
• Each data node has a sub-coordinator to track TX state
• Eagerly acquire local “write” locks on each replica
• Object owned by a single primary at a point in time
• Fail fast if lock cannot be obtained
• Atomic and works with the cluster Failure detection system
• Isolated until commit
• Only support local isolation during commit
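The eager, fail-fast locking described above can be sketched with in-memory replicas. This is a heavily simplified illustration, not SQLFire's transaction engine: it ignores primaries, sub-coordinators and failure detection, and the class and method names are hypothetical.

```python
class ConflictError(Exception):
    """Raised when a write lock is already held by another transaction."""


class Replica:
    def __init__(self):
        self.rows = {}         # committed data
        self.write_locks = {}  # key -> id of the transaction holding the lock


class Transaction:
    def __init__(self, tx_id, replicas):
        self.tx_id = tx_id
        self.replicas = replicas
        self.pending = {}      # writes stay here, invisible, until commit

    def write(self, key, value):
        """Eagerly take the write lock on every replica; fail fast on conflict."""
        for replica in self.replicas:
            holder = replica.write_locks.get(key)
            if holder is not None and holder != self.tx_id:
                self.rollback()
                raise ConflictError(f"{key} is locked by transaction {holder}")
            replica.write_locks[key] = self.tx_id
        self.pending[key] = value

    def _release(self):
        for replica in self.replicas:
            for key in [k for k, h in replica.write_locks.items() if h == self.tx_id]:
                del replica.write_locks[key]

    def commit(self):
        """Apply the pending writes to every replica, then release the locks."""
        for replica in self.replicas:
            replica.rows.update(self.pending)
        self._release()
        self.pending = {}

    def rollback(self):
        self._release()
        self.pending = {}


replicas = [Replica(), Replica()]
t1 = Transaction(1, replicas)
t2 = Transaction(2, replicas)
t1.write("AA1111", 10)
try:
    t2.write("AA1111", 20)
    conflicted = False
except ConflictError:
    conflicted = True
visible_before_commit = dict(replicas[0].rows)
t1.commit()
print(conflicted, visible_before_commit, replicas[1].rows)
```

No global lock manager appears anywhere in the sketch: each replica tracks only its own locks, and a conflicting writer is rejected immediately instead of waiting.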
Parallel disk persistence
Why is disk latency so high?
Challenges
• Disk seek time is still > 2 ms
• OLTP transactions are small writes
• Flushing to disk will result in a seek
• Best rates in 100s per second
RDBs and NoSQL try to avoid the problem
• Append to transaction logs; out-of-band writes to data files
• But, reads can cause seeks to disk
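The "100s per second" ceiling follows directly from the seek time. A quick back-of-the-envelope check, assuming one seek per flushed write and a single disk (the function name is illustrative):

```python
def max_flushed_writes_per_second(seek_time_ms):
    """Upper bound on writes per second if every write costs one disk seek."""
    return 1000.0 / seek_time_ms


print(max_flushed_writes_per_second(2.0))  # prints 500.0
```

With seeks slower than 2 ms the bound drops below 500 writes per second per disk, which is why flushing every small OLTP write is so costly.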
Disk persistence in SQLF
Parallel log structured storage
Each partition writes in parallel
Backups write to disk also
• Increase reliability against h/w loss
• Don’t seek to disk
• Don’t flush all the way to disk
  – Use OS scheduler to time write
• Do this on primary + secondary
• Realize very high throughput
[Diagram: on each of two members, memory tables feed append-only operation logs (Record1, Record2, Record3) through OS buffers, with a log compressor working on the logs]
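The append-only idea can be sketched with an ordinary file. This is an illustration of the technique, not SQLFire's storage format: the JSON-lines layout, the class name `OperationLog` and the in-place compaction are assumptions, and "compaction" here is only a guess at what the diagram's log compressor does.

```python
import json
import os
import tempfile


class OperationLog:
    """Append-only log of key/value operations, one JSON record per line."""

    def __init__(self, path):
        self.path = path
        self.file = open(path, "a", encoding="utf-8")

    def append(self, key, value):
        """Append sequentially; hand the bytes to the OS without forcing a sync."""
        self.file.write(json.dumps({"key": key, "value": value}) + "\n")
        self.file.flush()  # reaches OS buffers only; no os.fsync() call

    def replay(self):
        """Rebuild the in-memory table by reading the log from start to end."""
        table = {}
        with open(self.path, encoding="utf-8") as log_file:
            for line in log_file:
                record = json.loads(line)
                table[record["key"]] = record["value"]
        return table

    def compact(self):
        """Rewrite the log keeping only the latest record per key."""
        latest = self.replay()
        self.file.close()
        with open(self.path, "w", encoding="utf-8") as log_file:
            for key, value in latest.items():
                log_file.write(json.dumps({"key": key, "value": value}) + "\n")
        self.file = open(self.path, "a", encoding="utf-8")


path = os.path.join(tempfile.mkdtemp(), "oplog.jsonl")
log = OperationLog(path)
log.append("Record1", 1)
log.append("Record2", 2)
log.append("Record1", 3)  # a newer version of Record1 is appended, not updated in place
recovered = log.replay()
log.compact()
print(recovered)
```

Writes never seek because they always go to the end of the file, and skipping the sync leaves the timing of the physical write to the OS. Writing the same log on a primary and a secondary is what offsets the risk of losing unsynced data on one machine.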
Performance benchmark
How does it perform? Scale?
Scale from 2 to 10 servers (one per host)
Scale from 200 to 1200 simulated clients (10 hosts)
Single partitioned table: int PK, 40 fields (20 ints, 20 strings)
[Chart: Partitioned table throughput - Query By PK (redundant copy). X axis: servers (2 to 10). Series: queriesPerSecond (axis 0 to 800000) and client threads (axis 0 to 1400)]
How does it perform? Scale?
CPU% remained low per server – about 30% indicating many more clients could be handled
[Chart: Partitioned table throughput and CPU - Query By PK (redundant copy). X axis: servers (2 to 10). Series: queriesPerSecond (axis 0 to 800000), vmCPUClient and vmCPUServer (CPU usage, axis 0 to 90)]
Is latency low with scale?
Latency decreases with server capacity
50-70% take < 1 millisecond
About 90% take less than 2 milliseconds
Small percentage of outliers
[Chart: Partitioned table response time - Query By PK (redundant copy). X axis: servers (2 to 10). Y axis: % queries (0 to 80). Series: < 1 ms, 1-2 ms, 2-5 ms, 5-10 ms]
Q & A
VMware vFabric SQLFire beta will be released in early June
Check out community.gemstone.com
Built using GemFire object data fabric + Derby
• JDBC 4.x, ADO.NET
• SQL façade on top of GFE framework
• Querying: query engine with cost based optimizer; efficient tuple storage model, skip list based indexing
  • Design focus: optimize for horizontally partitioned data models
  • distributed scatter/gather
  • Rich SQL syntax
• Framework for: read through, write through, write behind, parallel data-aware procedures
• Simplified config model: standard SQL DDL with extensions, cluster wide config
• GemFire CORE (from GFE): storage (memory + disk), partitioning, replication, HA, events, reliable distribution
• Diagram labels marking component origin: Derby, NEW + Derby, NEW