conquering "big data": an introduction to shard query

Conquering “big data”

An Introduction to Shard-Query

An MPP distributed middleware solution for MySQL databases

Big Data is a buzzword

Shard-Query works with big data, but it works with small data too

You don’t have to have big data to have big performance problems with queries

Big performance problems

MySQL typically has performance problems on OLAP workloads for even tens of gigabytes* of data

Analytics

Reporting

Data mining

MySQL is generally not scalable*^ for these workloads

• * By itself. The point of this talk is to show how Shard-Query fixes this :)

• ^ Another presentation goes into depth as to why MySQL doesn't scale for OLAP

http://www.slideshare.net/MySQLGeek/divide-and-conquer-in-the-cloud

Not only MySQL has these issues

All major open source databases have problems with these workloads

Why?

Single threaded queries

When all data is in memory, accessing X rows is generally X times as expensive as accessing ONE row, even when multiple CPUs could be used

MySQL scalability model is good for OLTP

MySQL was created at a time when commodity machines

Had few CPU cores (usually one)

Had small amounts of memory and limited disk IOPS

Managed a small amount of data

It did not make sense to code intra-query parallelism for these servers. They couldn’t take advantage of it anyway.

The new age of multi-core

“If your time to you is worth saving, then you better start swimming. Or you'll sink like a stone. For the times they are a-changing.”

- Bob Dylan

[Figure: four CPUs, each with eight cores]

It is 2013. Still only single threaded queries.

Building a multi-threaded query plan is a lot different than building a single threaded query plan

The time investment to build a parallel query interface inside of MySQL would be very high

MySQL has continued to focus on excellence for OLTP workloads while leaving the OLAP market untapped

Just adding basic subquery options to the optimizer has taken many years

MySQL scales great for OLTP because

MySQL has been improved significantly, especially in 5.5 and 5.6

Many small queries are “balanced” over many CPUs naturally

Large memories allow vast quantities of hot data

And very fast disk IO means that

The penalty for a cache miss is lower

No seek penalty for SSD especially reduces the cost of concurrent misses from multiple threads (no head movement)

But not for OLAP

Big queries "peg" one CPU and can use no more CPU resources (low efficiency queries)

Numerous large queries can "starve" smaller queries

This is often when innodb_thread_concurrency needs to be set > 0

http://www.slideshare.net/MySQLGeek/divide-and-conquer-in-the-cloud

But not for OLAP (cont)

When the data set is significantly larger than memory, single threaded queries often cause the buffer pool to "churn"

While SSD helps somewhat, one thread cannot read from an SSD at maximum device capacity

Disk may be capable of 1000s of MB/sec, but the single thread is generally limited to <100MB/sec

A multi-threaded workload could much better utilize the disk

Response similar to the NoSQL movement

Rather than fix the database or build complex software, users just change the underlying database

Many closed source vendors have stepped in and provided OLAP SQL solutions

Hardware: IBM Netezza, Oracle Exadata

Software: HP Vertica, Vectorwise, Teradata, Greenplum

Response similar to the NoSQL movement (cont)

Or SQL => map/reduce interfaces

Apache Hadoop/Apache Hive

Impala

MapR

Cloudera CDH

Google built a SQL interface to BigTable too…

Limitations

No correlated subqueries for example

What do those map/reduce things do?

Split data up over multiple servers (HDFS)

During query processing

Map (fetch/extract/select/etc) raw data from files or tables on HDFS

Write the data into temporary areas

Shuffle temporary data to reduce workers

Final reduce written

Return results
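The phases listed above can be sketched in a few lines of Python. This is only a toy, in-process illustration of the map, shuffle and reduce steps; the input format ("page hits" strings), the function names and the two-reducer setup are assumptions made for the example, not part of Hadoop or any real framework.

```python
from collections import defaultdict

def map_phase(split):
    # Map: extract (key, value) pairs from raw text records.
    for line in split:
        page, hits = line.split()
        yield page, int(hits)

def shuffle_phase(pairs, num_reducers):
    # Shuffle: route every key to exactly one reduce worker.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[hash(key) % num_reducers][key].append(value)
    return buckets

def reduce_phase(bucket):
    # Reduce: aggregate all values collected for each key.
    return {key: sum(values) for key, values in bucket.items()}

def run_job(splits, num_reducers=2):
    # "Temporary area": the mapped pairs are materialized before shuffling.
    mapped = [pair for split in splits for pair in map_phase(split)]
    result = {}
    for bucket in shuffle_phase(mapped, num_reducers):
        result.update(reduce_phase(bucket))
    return result

# Two hypothetical input splits, as if stored on two servers.
splits = [["en 3", "de 1"], ["en 2", "fr 5"]]
totals = run_job(splits)
print(totals)
```

Note that the map step has to parse the raw records again for every job that is run.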

That sounds expensive…

It is (in terms of dollars for closed solutions)

It is (in terms of execution time for open solutions)

The map is especially expensive when data is unstructured and it must be done repeatedly for each different query you run

And complicated…

You get

a whole new toolchain

A new set of data management tools

A new set of high availability tools

And all new monitoring tools to learn!

Even if MySQL supported parallel query:

MySQL* doesn’t do distributed queries

Those Map/Reduce solutions (and the closed source databases) can use more than one server!

Building a query plan for queries that must execute over a sharded data set has additional challenges:

SELECT AVG(expr)

must be computed as:

SUM(expr)/COUNT(expr) AS `AVG(expr)`

Probably the simplest example of a necessary rewrite

* Again, Shard-Query does. Almost there.
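A small sketch of why the AVG rewrite is needed. The per-shard numbers are made up, and the function names are hypothetical; the point is only that each shard returns a SUM and a COUNT, and the final division happens once over the combined totals.

```python
# Hypothetical partial results. Each shard is assumed to have run:
#   SELECT SUM(price) AS s, COUNT(price) AS c FROM orders
shard_partials = [
    {"s": 600.0, "c": 3},  # shard 1: average 200
    {"s": 100.0, "c": 1},  # shard 2: average 100
]

def combine_avg(partials):
    # Correct: SUM of the sums divided by SUM of the counts.
    total = sum(p["s"] for p in partials)
    count = sum(p["c"] for p in partials)
    return total / count if count else None

def naive_avg_of_avgs(partials):
    # Incorrect in general: averaging per-shard averages ignores shard sizes.
    averages = [p["s"] / p["c"] for p in partials]
    return sum(averages) / len(averages)

print(combine_avg(shard_partials))        # 175.0
print(naive_avg_of_avgs(shard_partials))  # 150.0
```

The two results differ because the shards hold different numbers of rows.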

MySQL network storage engines

Don't these engines claim to be parallel?

Fetching of data from remote machines may be done in parallel, but query processing is coordinated by a serial query thread

A sum still has to examine each individual row from every server serially

Joins are still evaluated serially (in many cases)

The engine is parallel, but the SQL layer using the engine is not.

NDB

NDB is bad for star schema

Dimension table rows are not usually co-located with fact rows.

Engine condition pushdown may help somewhat to alleviate network traffic but joins still have to traverse the network, which is expensive

Aggregation still serial

SPIDER

SPIDER is bad for star schema too

Nested loops may be very bad for SPIDER and star schema if the fact table isn't scanned first (must use STRAIGHT_JOIN hint extensively).

MRR/BKA in MariaDB might help?

Still no parallel aggregation or join.

CONNECT

Has ECP

No ICP or ability to expose remote indexes

Always uses join buffer (BNLJ) or BKAJ

Fetches in parallel

No parallel join

No parallel aggregation

Those are not parallel query solutions

Those engines are not OLAP parallel query

They are for OLTP lookup and/or filtering performance. Often can't sort in parallel.

They can offer improved performance when large numbers of rows are filtered from many machines in parallel

When aggregating, a query must return a small resultset before aggregation for good performance

star schema should be avoided

Enter Shard-Query

Massively parallel query execution for MySQL variants

Enter Shard-Query

Keep using MySQL

Choose a row store like XtraDB, InnoDB or TokuDB*

Choose a column store like ICE*, Groonga**

Use CSV, TAB, XML, or other data with the CONNECT** engine in MariaDB 10

* These engines work, but with some limitations due to bugs

** These engines have not been thoroughly tested

Shard-Query connects to 3306…

Shard-Query can use any MySQL variant as a data source

You continue to use regular SQL, no map/reduce

Is built on MySQL, PHP and Gearman – well proven technologies

You probably already know these things.

Shard-Query re-writes SQL

Flexible

Does not have to re-implement complex SQL functionality because it uses SQL directly

Hundreds of MySQL functions and features available out of the box

Small subset* of functions not available

last_insert_id(), get_lock(), etc.

* https://code.google.com/p/shard-query/wiki/UnsupportedFeatures

Shard-Query re-writes SQL

Familiar SQL

ORDER BY, GROUP BY, LIMIT, HAVING, subqueries, even WITH ROLLUP, all continue to work as normal

Support for all MySQL aggregate functions including count(distinct)

Aggregation and joins happen in parallel
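COUNT(DISTINCT) is a good illustration of why aggregate functions need care across shards. The sketch below uses made-up per-shard values and hypothetical function names. It shows one way a coordinator could combine results correctly, and is not a description of how Shard-Query implements it internally.

```python
# Hypothetical distinct user names seen on each shard.
shard_values = [
    {"alice", "bob"},   # shard 1
    {"bob", "carol"},   # shard 2
]

def sum_of_counts(shards):
    # Incorrect: a value present on several shards is counted several times.
    return sum(len(values) for values in shards)

def count_distinct(shards):
    # Correct: the coordinator de-duplicates the values before counting.
    return len(set().union(*shards))

print(sum_of_counts(shard_values))   # 4
print(count_distinct(shard_values))  # 3
```

Unlike COUNT(*), the per-shard counts cannot simply be summed, because the same value may appear on more than one shard.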

You don't have to know PHP to use Shard-Query!

Just use SQL

You can still connect to 3306 (and more)!

Shard-Query has multiple ways of interacting with your application

The PHP OO API is the underlying interface.

The other interfaces are built on it:

MySQL Proxy Lua script (virtual database)

HTTP or HTTPS web/REST interface

Access the database directly from Javascript?

Submit Gearman jobs (as SQL) directly from almost any programming language

MySQL Proxy

Web Interface

Command line (with explain plan)

echo "select * from (select count(*) from lineorder) sq;" | php run_query --verbose

SQL SET TO SEND TO SHARDS:
Array ( [0] => SELECT COUNT(*) AS expr_2942896428 FROM lineorder AS `lineorder` WHERE 1=1 ORDER BY NULL )
SENDING GEARMAN SET to: 2 shards

SQL FOR COORDINATOR NODE:
SELECT SUM(expr_2942896428) AS `count(*)` FROM `aggregation_tmp_21498632`

SQL SET TO SEND TO SHARDS:
Array ( [0] => SELECT * FROM ( SELECT SUM(expr_2942896428) AS `count(*)` FROM `aggregation_tmp_21498632` ) AS `sq` WHERE 1=1 )
SENDING GEARMAN SET to: 1 shards

SQL TO SEND TO COORDINATOR NODE:
SELECT * FROM `aggregation_tmp_88629847`

[count(*)] => 1199721041 rows returned
Exec time: 0.053546905517578

Shard-Query constructs parallel queries

MySQL can’t run a single query in multiple threads but it can run multiple queries at once in multiple threads (with multiple cores)

Shard-Query breaks one query into multiple smaller queries (aka tasks)

Tasks can run in parallel on one or more servers

OLAP into OLTP

Partitioning tables for parallelism

This is similar to Oracle Parallel Query

Partitioning splits queries on a single machine

Supports partitioning to divide up a table

RANGE, LIST and RANGE/LIST COLUMNS over a single column

Each partition can be accessed in parallel as an individual task
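The idea of one task per partition can be sketched as follows. This is a toy model, not Shard-Query code: the table, the partition bounds and the function names are hypothetical, and Python threads stand in for what would really be separate SQL queries running on separate MySQL connections.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical table of (order_year, revenue) rows.
rows = [(2010, 10), (2011, 20), (2011, 5), (2012, 7), (2012, 3), (2012, 1)]

# RANGE-style partitions over a single column: [low, high) per partition.
partitions = {
    "p2010": (2010, 2011),
    "p2011": (2011, 2012),
    "p2012": (2012, 2013),
}

def run_task(bounds):
    # One task: aggregate only the rows that fall inside one partition.
    low, high = bounds
    part = [revenue for year, revenue in rows if low <= year < high]
    return sum(part), len(part)

def parallel_sum_count(parts, workers=3):
    # Run one task per partition, then combine the partial results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(run_task, parts.values()))
    return sum(s for s, _ in partials), sum(c for _, c in partials)

print(parallel_sum_count(partitions))  # (46, 6)
```

Only the final combining step is serial; each partition is scanned independently.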

A different way to look at it:

You get to move all the pieces at the same time

[Figure: SINGLE THREADED versus PARALLEL execution of tasks (T1, T4, T8, … T64)]

*Small portion of execution is still serial, so speedup won't be quite linear (but should be close)

Sharding

Sharded tables split data over many servers

Works similarly to partitioning.

You specify a "shard key". This is like a partitioning key, but it applies to ALL tables in the schema.

If a table contains the "shard key", then the table is spread over the shards based on the values of that column

Pick a "shard key" with an even data distribution

Currently only a single column is supported

Unsharded Tables

Tables that don't contain the "shard key" are called "unsharded" tables

A copy of these tables is replicated on ALL nodes

It is a good idea to keep these tables relatively small and update them infrequently

You can freely join between sharded and unsharded tables

You can only join between sharded tables when the join includes the shard key*

* A CONNECT or FEDERATED table to a Shard-Query proxy can be used to support cross-shard joins. Consider MySQL Cluster for cross-shard joins.
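The placement and join rules above can be summarized in a short sketch. Everything here is hypothetical: the schema, the shard key name, the number of shards and the use of a CRC32 hash to pick a shard are assumptions for illustration only, and the mappers that Shard-Query actually ships may work differently.

```python
import zlib

NUM_SHARDS = 4             # assumption: four shards
SHARD_KEY = "customer_id"  # assumption: hypothetical shard key

# Hypothetical schema: table name -> column names.
SCHEMA = {
    "orders": ["order_id", "customer_id", "date_id", "amount"],
    "payments": ["payment_id", "customer_id", "order_id", "paid"],
    "dim_date": ["date_id", "year"],
}

def is_sharded(table):
    # A table is sharded only if it contains the shard key column.
    return SHARD_KEY in SCHEMA[table]

def target_shards(table, row):
    # Sharded rows go to one shard; unsharded rows are copied to all shards.
    if is_sharded(table):
        digest = zlib.crc32(str(row[SHARD_KEY]).encode("utf-8"))
        return [digest % NUM_SHARDS]
    return list(range(NUM_SHARDS))

def can_join(left, right, join_columns):
    # Two sharded tables may only be joined on the shard key.
    if is_sharded(left) and is_sharded(right):
        return SHARD_KEY in join_columns
    return True

print(target_shards("dim_date", {"date_id": 1, "year": 2013}))
print(can_join("orders", "payments", ["order_id"]))
```

Because rows with the same shard key value always land on the same shard, a join on that key never needs rows from another server.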

Parallel Execution

[Diagram: Sharding and/or Partitioned Tables + Gearman + Shard-Query (REST, Proxy and PHP OO interfaces) = parallel tasks (for example, Task1 on Shard1, Partition 1)]

Data Flow

[Diagram: flow of SQL and DATA]

Sharding for big data

Or how I stopped worrying and learned to scale out the database

You can only scale up so far

MySQL still caps out at between 24 and 48 cores though it continues to improve (5.7 will be the best one ever?)

If you are collecting enough data you will eventually need to use more than one machine to get good performance on queries over a large portion of the data set

Scale Out – And Up

You could choose to use 4 servers with 16 cores or 2 servers with 32 cores

Usually depends on how large your data set is

Keep as much data in memory as possible

Scale Out – And Up

In the cloud many small servers can leverage memory more efficiently than a few large ones

Run 8 smaller servers with (in aggregate)

16 cores (52 total ECU) [2/per]

136.8GB memory [17.1/per]

3360GB combined local HDD storage [420/per]

This is almost the same price as a single large SSD based machine

16 cores (35 ECU)

64 GB of RAM

2048GB local SSD storage

The large machine had SSD though

If the workload is IO bound (working set >128GB)

Go with the large machine with 16 cores

Very fast IO

Getting data into memory so that the CPUs can work on it is more important

Downgrade to smaller machines if the working set shrinks

Still partition for parallelism

Scale "in and out"

Splitting a shard in Shard-Query is a manual (but easy) process

Only supported when the directory mapper is used

mysqldump the data from the shard with the -T option (or use mydumper)

Truncate the tables on the old shard

Create the tables on the new shard

Update the mapping table to split the data

Use the Shard-Query loader to load data
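The "update the mapping table" step can be pictured with a small sketch. This is a hypothetical model of a directory (shard key value to shard name), not the real Shard-Query mapping schema, and the choice to move the upper half of the keys is an assumption made for the example. Dumping, truncating and reloading the data are still done with the tools named above.

```python
def split_shard(directory, old_shard, new_shard):
    # Return an updated directory plus the keys that must be reloaded.
    keys = sorted(k for k, shard in directory.items() if shard == old_shard)
    moved = keys[len(keys) // 2:]  # assumption: move the upper half
    updated = dict(directory)
    for key in moved:
        updated[key] = new_shard
    return updated, moved

# Hypothetical directory: shard key value -> shard name.
directory = {1: "shard1", 2: "shard1", 3: "shard1", 4: "shard1", 5: "shard2"}

new_directory, moved_keys = split_shard(directory, "shard1", "shard3")
print(moved_keys)     # [3, 4]
print(new_directory)
```

After the directory is updated, reloading the dumped rows through a loader that consults it places each row on its new shard.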

Combine with Map/Reduce

Use Map/Reduce jobs to extract data from HDFS and write it into ICE

Execute high performance low latency MySQL queries over the data

Combine with Map/Reduce (cont)

Make fast insights into smaller amounts of data extracted from petabyte HDFS data stores

Extract a particular year of climate data

Or particular cultivars when comparing genomic plant data

Open source ETL tools can automate this process

Performance Examples

Simple In-Memory COUNT(*) query performance on Wikipedia traffic stats

Working set: 128GB of data

[Chart: 8 Pawns versus The King, each with a linear trend line; y-axis 0 to 800]

Days    8 Pawns     The King
1       2.552858    40.84573
2       5.090356    81.4457
3       8.064888    129.0382
4       10.74412    171.9059
5       13.32697    213.2316
6       16.0227     256.3633
7       18.50571    296.0914
8       21.02053    336.3285
9       25.3414     405.4624
10      29.69324    475.0918
11      32.93455    526.9529
12      36.5517     584.8272
13      40.19016    643.0426
14      42.75       699.1011
15      44.69       750.4571

Shard-Query is scanning about 1B rows/sec
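The gap between the two series in the table can be checked with a few lines of arithmetic. The sketch assumes both columns are measurements in the same unit (the slide does not state the unit), and simply divides the "The King" value by the "8 Pawns" value for each row.

```python
# (days, 8 Pawns, The King) copied from the table above.
results = [
    (1, 2.552858, 40.84573), (2, 5.090356, 81.4457), (3, 8.064888, 129.0382),
    (4, 10.74412, 171.9059), (5, 13.32697, 213.2316), (6, 16.0227, 256.3633),
    (7, 18.50571, 296.0914), (8, 21.02053, 336.3285), (9, 25.3414, 405.4624),
    (10, 29.69324, 475.0918), (11, 32.93455, 526.9529), (12, 36.5517, 584.8272),
    (13, 40.19016, 643.0426), (14, 42.75, 699.1011), (15, 44.69, 750.4571),
]

# Ratio of the two series for every row.
ratios = [king / pawns for _, pawns, king in results]

for (days, _, _), ratio in zip(results, ratios):
    print(f"{days:2d} days: The King / 8 Pawns = {ratio:.2f}")
```

Every ratio falls between roughly 16 and 17, so the relative gap stays close to constant as more days are scanned.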

Star Schema Benchmark – Scale 10

6 cores

Partitioning for single node scaleup

6 worker threads

XTRADB

Schema Design for Big Data

Best schema – flat tables (no joins)

Scale to hundreds of machines with tens to hundreds of terabytes each

Dozens or hundreds of columns per table

Can use map/reduce when you need to join between sharded tables (MapR or something other than Shard-Query is used for this)

Joins to lookup tables can still be done but do so with care

One table model (flat table, no joins)

Great for machine generated data - quintessential big data.

Call data records (billing mediation and call analysis)

Sensor data (Internet of Things)

Web logs (Know thy own self before all others)

Hit/click information for advertising

Energy metering

Almost any large open data set

Ideal schema – flat tables (no joins)

Why one big table?

ICE/IEE

ICE and IEE engines are append-only (or append mostly)

ICE/IEE knowledge grid can filter out data more effectively when all of the filters are placed on a single table

No indexes means that only hash joins or sort/merge joins can be performed when joining tables

Ideal schema – flat tables (no joins)

Insert-only tables are the easiest on which to build summary tables

Querying is very easy as all attributes are always available

But all attributes can be overwhelming.

Views can be created in this case

When named properly the views can be accessed in parallel too

Special view support

Shard-Query has special support for treating views as partitioned tables* when the views have the prefix v_ followed by the actual table name

select * from v_mysql_metrics from all_metrics where

host_id = 33 and collect_date = '2013-05-27';

Joins to these views are supported too

Make sure you only use the MERGE algorithm or this will not work

* Shard-Query does not currently parse the underlying SQL for views, so this naming is necessary to allow Shard-Query to find the partition metadata for the underlying table.

Schema Design for Analytics/BI and Data Visualization

See better results through faster queries

Star Schema

Most common BI/analytics table is star schema or a denormalized table (see prev slides)

"Fact" (measurement) table is sharded

Dimension (lookup) tables are unsharded

JOINs between the fact and dimension tables are freely supported

Star Schema

In some cases a dimension might be sharded

sharding by date to spread data around evenly by date

for example

date_id is in the fact table and in the date dimension table

This is safe because you JOIN by the date_id column

sharding by customer (SaaS) is also common

customer_id in FACT and in dim_customer

Safe because join is by customer_id

Star Schema (cont)

Shard-Query has experimental STAR optimizer support

Scan dimension tables

Push FACT table IN predicates to SQL WHERE clause

Eliminate JOIN to dimension tables without projected columns
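The three steps above can be sketched as a query rewrite. This is only an illustration of the general technique, not the Shard-Query STAR optimizer itself: the tables, the star_rewrite function and the sample data are hypothetical, and SQLite (from the Python standard library) stands in for MySQL so that the example is runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE fact_sales (date_id INTEGER, revenue INTEGER);
INSERT INTO dim_date VALUES (1, 2012), (2, 2012), (3, 2013);
INSERT INTO fact_sales VALUES (1, 10), (2, 20), (3, 40), (3, 5);
""")

def star_rewrite(conn, year):
    # Step 1: scan the dimension table for the keys that match the filter.
    cursor = conn.execute(
        "SELECT date_id FROM dim_date WHERE year = ? ORDER BY date_id", (year,))
    ids = [row[0] for row in cursor]
    if not ids:
        return "SELECT SUM(revenue) FROM fact_sales WHERE 0"
    # Step 2: push the keys into the fact table WHERE clause as an IN list.
    # Step 3: no dimension column is projected, so the JOIN is dropped.
    in_list = ", ".join(str(int(i)) for i in ids)
    return f"SELECT SUM(revenue) FROM fact_sales WHERE date_id IN ({in_list})"

joined_sql = ("SELECT SUM(f.revenue) FROM fact_sales f "
              "JOIN dim_date d ON f.date_id = d.date_id WHERE d.year = 2012")
rewritten_sql = star_rewrite(conn, 2012)

joined = conn.execute(joined_sql).fetchone()[0]
pushed = conn.execute(rewritten_sql).fetchone()[0]
print(rewritten_sql)
print(joined, pushed)
```

Both forms return the same total, but the rewritten query touches only the fact table.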

Other schema types can work too

Master/detail relationship

Unsharded small lookup tables

comment_type

mood_type

etc

The main tables are sharded by blog_id:

blog_info

blog_posts

blog_comments

These all must contain the "shard key" (blog_id) because they are joined by blog_id, thus blog metadata, comments and posts must be stored in the same shard for the same blog.

Table relationships cannot currently be defined.

Some tables (like comments) require minor de-normalization to include the blog_id column.

Snowflake schema

Shard-Query STAR optimizer not yet extended to snowflake

Consider using star schema or flat table instead

Links and other info

Shard-Query

http://code.google.com/p/shard-query

http://shardquery.com

http://code.google.com/p/PHP-SQL-Parser

http://code.google.com/p/Instrumentation-for-PHP

Percona

The high performance MySQL and LAMP experts

http://www.percona.com

Training - http://training.percona.com

Support - MySQL, MariaDB, and Percona Server too

Remote DBA - We wake up so you don't have to

Consulting – Is your site slow? We can help.

Development services – Something's broken? We can fix it. We can add or improve features to fit your use case.

Gearman

http://www.gearman.org

Job process and concurrent workload management

Run one worker per physical CPU (or more if you are IO bound)

Add extra loader workers and exec workers if needed

Infobright

Infobright Community Edition

Append only

http://infobright.org

Infobright Enterprise Edition

http://infobright.com

They are both column stores but they are architecturally different.

IEE offers intra-query parallelism natively which Shard-Query benefits from because Infobright does not support partitioning.

TokuDB

Compressing row store for big data

Doesn't suffer IO penalty when updating secondary indexes

Variable compression level by library

New, so prepare to test thoroughly

http://www.tokudb.com

Groonga/Mroonga

Column store and text search system

Supports text and geospatial search

Native (column store) or fulltext wrapper around InnoDB/MyISAM

http://groonga.org/

Network Engines

NDB (MySQL Cluster)

http://dev.mysql.com/downloads/cluster/7.3.html

SPIDER storage engine

https://launchpad.net/spiderformysql

CONNECT engine for MariaDB 10.x alpha

http://www.skysql.com/enterprise/mariadb-connect-storage-engine
