Shard-Query, an MPP database for the cloud using the LAMP stack


Post on 11-Aug-2014






This combined #SFMySQL and #SFPHP meetup talked about Shard-Query. You can find the video to accompany this set of slides here:


Shard-Query
AN MPP DATABASE FOR THE CLOUD USING THE LAMP STACK

Introduction
  Presenter
    Justin Swanhart
    Principal Support Engineer at Percona
    Previously a trainer and consultant at Percona too
  Developer
    Swanhart-tools
    Shard-Query: MPP sharding middleware for MySQL
    Flexviews: Materialized views (fast refresh) for MySQL
    bcmath UDF: arbitrary precision math for MySQL

Intended Audience
  MySQL users with data too large to query efficiently using a single machine
    Big Data
    Analytics / OLAP
    User generated content analysis
  People interested in distributed database processing

Terms

MPP
  Massively Parallel Processing
  An MPP system is a system that can process a SQL statement in parallel on a single machine or even many machines
  A collection of machines is often called a Grid
  MPP is also sometimes called Grid Computing

MPP (cont)
  Not many open source databases (none?) support MPP
  Community editions of closed source offerings are limited
  Some closed source databases include Vertica, Greenplum, Redshift

The Cloud
  Managed collection of virtual servers
  Easy to add servers on demand
  Ideal for a federated, distributed database grid
    Easy to scale up by moving to a VM with more cores
    Easy to scale out by adding machines
  Amazon is one of the most popular cloud environments

LAMP stack
  Linux
    Amazon Linux
    RHEL
    Ubuntu LTS, etc.
  Apache Web Server
    Most popular web server on the planet
  MySQL
    The world's most popular open source database
  PHP
    High level language makes development easier

Database Middleware
  A piece of software that sits between an end-user application and the database
  Operates on the queries submitted by the application, then returns the results to the application
  Usually a proxy of some sort
    MySQL Proxy is the open source, user-configurable proxy for MySQL
    Supports Lua scripts which intercept queries
    Shard-Query can use MySQL Proxy out of the box

Message Queue / Job Server
  Accepts jobs or messages and places them in a queue
  A worker reads jobs/messages from the queue and acts on them
  Offers support for asynchronous jobs

Gearman
  My job server of choice for PHP
  Has two different PHP interfaces (pear and pecl)
  SQ comes bundled with a modified version* of the pear interface
  Excellent integration with MySQL as well (UDF)
  * Removes warnings triggered by modern PHP strict mode

Sharding
  It is short for "Shared Nothing"
  Means splitting up your data onto more than one machine
  Tables that are split up are called sharded tables
  Lookup tables are not sharded. In other words, they must be duplicated on all nodes
  Shard-Query supports directory based or hash based sharding

Shard mapper
  Shard-Query supports DIRECTORY and HASH mapping out of the box
  DIRECTORY based sharding allows you to add or remove shards from the system, but lookups may go over the network, reducing performance* compared to HASH mapping
  HASH based sharding uses a hash algorithm to balance rows over the sharded database. However, since a HASH algorithm is used, the number of database shards cannot change after initial data loading.
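
The trade-off between the two mappers can be illustrated with a small sketch. It is written in Python for brevity and is not Shard-Query's own (PHP) implementation: the class names HashMapper and DirectoryMapper, the use of CRC32 as the hash, and the "least loaded shard" placement rule are all assumptions made for illustration.

```python
import zlib
from collections import Counter


class HashMapper:
    """Hypothetical HASH mapper: the shard is computed from the key alone."""

    def __init__(self, shards):
        self.shards = list(shards)

    def shard_for(self, key):
        # CRC32 is stable across runs (unlike Python's built-in hash() for str).
        digest = zlib.crc32(str(key).encode("utf-8"))
        # The result depends on len(self.shards), so changing the shard count
        # after loading would send existing keys to different shards.
        return self.shards[digest % len(self.shards)]


class DirectoryMapper:
    """Hypothetical DIRECTORY mapper: an explicit key -> shard table."""

    def __init__(self, shards):
        self.shards = list(shards)
        self.directory = {}  # in a real system this table may live on another host

    def add_shard(self, shard):
        self.shards.append(shard)

    def shard_for(self, key):
        if key not in self.directory:
            # Assumed placement rule: put a new key on the least loaded shard.
            load = Counter(self.directory.values())
            self.directory[key] = min(self.shards, key=lambda s: load[s])
        return self.directory[key]


if __name__ == "__main__":
    hashed = HashMapper(["shard1", "shard2", "shard3", "shard4"])
    print("customer_id 50 (hash) ->", hashed.shard_for(50))

    directory = DirectoryMapper(["shard1", "shard2"])
    print("customer_id 50 (directory) ->", directory.shard_for(50))
    directory.add_shard("shard3")  # existing keys keep their recorded shard
    print("customer_id 50 after adding a shard ->", directory.shard_for(50))
```

The hash mapper needs no lookup at all, which is why it is faster for queries that filter on the shard key; the directory mapper pays for a lookup but can grow, because already-placed keys stay where the directory recorded them.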
  * But only for queries like select count(*) from table where customer_id = 50

What is big data
  Most machine generated data
  Line order information for a large organization like Wal-Mart
  Any data so large that you can't effectively operate on it on one machine
    For example, an important query that needs to run daily takes more than 24 hours to execute. It is impossible to meet the daily goal unless you can find a way to make the query execute faster.
    These kinds of problems can happen on relatively small amounts of data (tens of gigabytes)

Analytics (OLAP) versus OLTP
  OLTP is focused on short lived small transactions that read or write small amounts of data
  OLAP is focused on bulk loading and reading large amounts of data in a single query.
    Aggregation queries are OLAP queries
  Shard-Query is designed for analytics (OLAP), not OLTP
    It must parse all commands sent to it (and make multiple round trips)
    Minimum query time of around 20ms

PROBLEM: Single Threaded Queries
THE BIGGEST BOTTLENECK IN ANALYTICAL QUERIES IS THE SPEED OF A SINGLE CORE

Single thread queries in the database
  MySQL, PostgreSQL, Firebird and all other major open source databases have single threaded queries
  This means that a single query can only ever utilize the resources of a single core
  As the data size grows, analytical queries get slower and slower
    In memory, as the data grows the speed decreases because the data is accessed in a single query
    As the number of rows to be examined increases, performance decreases

Why single threaded
  MySQL is optimized for getting small amounts of data quickly (OLTP)
  It was created at a time when having more than one CPU was not common
  Adding parallelism now is a very complex task, particularly since MySQL supports multiple storage engines
  So adding parallel query is not a high priority (not even on the roadmap)
  Designed to run LOTS of small queries simultaneously, not one big query

Single Threading bad for IO
  If the data set is significantly larger than memory, single threaded
  queries often cause the buffer pool to "churn"
    For example, small lookup tables can easily be pushed out of the buffer pool, resulting in frequent IO to look up values
  While SSD may help somewhat, one database thread cannot read from an SSD at maximum device capacity
    While the disk may be capable of 1000+ MB/sec, a single thread is generally limited to [...]

    [...] query($sql);
    $endtime = microtime(true);

    if(!empty($shard_query->errors)) {
        echo "ERRORS RETURNED BY OPERATION:\n";
        print_r($shard_query->errors);
    }

    if(is_resource($stmt) || is_object($stmt)) {
        $count = 0;
        while($row = $shard_query->DAL->my_fetch_assoc($stmt)) {
            print_r($row);
            ++$count;
        }
        echo "$count rows returned\n";
        $shard_query->DAL->my_free_result($stmt);
    } else {
        if(!empty($shard_query->info)) print_r($shard_query->info);
        echo "no query results\n";
    }
    echo "Exec time: " . ($endtime - $stime) . "\n";

  Simple data access layer comes with Shard-Query
  Errors are returned as a member of the object
  Run the query

Shard-Query Architecture (diagram): PHP OO, Apache Web Interface and MySQL Proxy interfaces; Gearman Message Queue; Workers; MySQL database shards

Apache web interface
  GUI
  Easy to set up
  Run queries and get results
  Serves as an example of using Shard-Query in a web app with asynchronous queries
  Submits queries via Gearman
  Simple HTTP authentication

MySQL Proxy Interface
  Lua script for MySQL Proxy
  Supports most SHOW commands
  Intercepts queries, and sends them to Shard-Query using the MySQL Gearman UDF
  Serves as another example of using Gearman to execute queries.
  Behaves slightly differently than MySQL for some commands

Shard-Query Data Flow
  Query submitted
  SQL is parsed
  Query rewrite for parallelism yields multiple queries
  Gearman Jobs (map/combine)
  Final Aggregation (reduce)
  Return result
  Map/reduce like workflow

SQL Parser
  Find it at
  Supports
    SELECT/INSERT/UPDATE/DELETE
    REPLACE
    RENAME
    SHOW/SET
    DROP/CREATE INDEX/CREATE TABLE
    EXPLAIN/DESCRIBE
  Used by SugarCRM too, as well as other open source projects.

Query Rewrite for parallelism
  Shard-Query has to manipulate the SQL statement so that it can be executed over more than one partition or machine
    COUNT() turns into SUM of COUNTs from each query
    AVG turns into SUM and COUNT
    SEMI-JOIN is turned into a materialized join
    STDDEV/VARIANCE are rewritten as well, using the sum of squares method
  Push down LIMIT when possible

Query Rewrite for parallelism (cont)
  Because lookup tables are duplicated on all shards, the query executes in a shared-nothing way
  All joins, filtering and aggregation are pushed down
    Means very little data must flow between nodes in most cases
  High performance
    Meets or beats Amazon Redshift in testing at 200GB of data
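
The aggregate rewrites listed above (COUNT as a SUM of COUNTs, AVG as SUM and COUNT, STDDEV/VARIANCE via the sum of squares) can be sketched as a tiny map/combine/reduce step. This is an illustration in Python, not Shard-Query's actual code: the names Partial, shard_partial and combine are hypothetical, each "shard" is just an in-memory list, and population variance is assumed (which is what MySQL's STDDEV() and VARIANCE() compute).

```python
import math
from dataclasses import dataclass


@dataclass
class Partial:
    """What each shard would return: COUNT(x), SUM(x) and SUM(x*x)."""
    n: int
    s: float
    ss: float


def shard_partial(values):
    # Map/combine step: runs independently on every shard.
    return Partial(len(values), sum(values), sum(v * v for v in values))


def combine(partials):
    # Reduce step: the final aggregation only needs the small partial rows.
    n = sum(p.n for p in partials)    # COUNT -> SUM of per-shard COUNTs
    s = sum(p.s for p in partials)
    ss = sum(p.ss for p in partials)
    if n == 0:
        return {"count": 0, "avg": None, "var_pop": None, "stddev_pop": None}
    avg = s / n                       # AVG -> SUM / COUNT
    # Sum of squares method: VAR_POP = E[x^2] - (E[x])^2.
    # Clamp at zero to absorb tiny negative values from rounding.
    var_pop = max(ss / n - avg * avg, 0.0)
    return {"count": n, "avg": avg, "var_pop": var_pop,
            "stddev_pop": math.sqrt(var_pop)}


if __name__ == "__main__":
    shards = [[1, 2, 3], [4, 5], [6]]
    result = combine([shard_partial(rows) for rows in shards])
    print(result)
    # Averaging the per-shard averages would weight the shards incorrectly,
    # which is why AVG has to travel as SUM and COUNT.
    naive = sum(sum(rows) / len(rows) for rows in shards) / len(shards)
    print("average of averages:", naive)
```

Only three numbers per shard cross the network regardless of how many rows each shard scanned, which matches the point that very little data has to flow between nodes. The sum of squares formula can lose precision when values are large relative to their spread, so a production implementation may need more care than this sketch.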