Leveraging Hadoop in your PostgreSQL Environment

Postgres & Hadoop


DESCRIPTION

This talk will begin with a discussion of the strengths of PostgreSQL and Hadoop. We will then lead into a high level overview of Hadoop and its community of projects like Hive, Flume and Sqoop. Finally, we will dig down into various use cases detailing how you can leverage Hadoop technologies for your PostgreSQL databases today. The use cases will range from using HDFS for simple database backups to using PostgreSQL and Foreign Data Wrappers to do low latency analytics on your Big Data.

TRANSCRIPT

Page 1: Leveraging Hadoop in your PostgreSQL Environment

Postgres & Hadoop

Page 2: Leveraging Hadoop in your PostgreSQL Environment

Who am I?

Jim Mlodgenski

CTO, OpenSCG

Co-organizer, NYCPUG

Co-organizer, Philly PUG

Co-chair, PGConf US

[email protected]

@jim_mlodgenski

Page 3: Leveraging Hadoop in your PostgreSQL Environment

Agenda

Strengths of PostgreSQL

Strengths of Hadoop

Hadoop Community

Use Cases

Page 4: Leveraging Hadoop in your PostgreSQL Environment

Best of Both Worlds

Postgres

World's most advanced open source database solution

Enterprise class including MVCC, streaming replication & rich data type support (to name a few!)

Robust transaction support with strong ANSI-SQL compliance

Hadoop

Big data distributed framework

Reliable, massively scalable & proven

Failures handled at the application layer allowing commodity hardware

Page 5: Leveraging Hadoop in your PostgreSQL Environment

Strengths of PostgreSQL

Strong Data Types

Concurrency

Transactions

Security

Indexes

Connectors

Page 6: Leveraging Hadoop in your PostgreSQL Environment

Components of PostgreSQL

Database

Connectors

– JDBC

– ODBC

– Libpq

Foreign Data Wrappers

And more...

Page 7: Leveraging Hadoop in your PostgreSQL Environment

Strengths of Hadoop

Parallelism

Flexibility

Redundancy

Scalability

Page 8: Leveraging Hadoop in your PostgreSQL Environment

Components of Hadoop

HDFS

Hive

Flume

Sqoop

ZooKeeper

Hbase

And many more...

Page 9: Leveraging Hadoop in your PostgreSQL Environment

HDFS

Hadoop Distributed File System

Page 10: Leveraging Hadoop in your PostgreSQL Environment

Hbase

Modeled after Google BigTable

Column-oriented database on top of HDFS

Page 11: Leveraging Hadoop in your PostgreSQL Environment

ZooKeeper

Distributed Configuration Service

Supports synchronization and distributed locking

Automatic leader election

Page 12: Leveraging Hadoop in your PostgreSQL Environment

Hive

Adds SQL on Hadoop

Converts SQL (HQL) to MapReduce Jobs
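
For example, a simple aggregate like the sketch below (the pageviews table is hypothetical) is compiled by Hive into one or more MapReduce jobs that run across the cluster:

SELECT url, count(*) AS hits
FROM pageviews
GROUP BY url;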

Page 13: Leveraging Hadoop in your PostgreSQL Environment

Flume

Streams data into HDFS

Distributed and Highly Available

Page 14: Leveraging Hadoop in your PostgreSQL Environment

Sqoop

Allows for bulk transfers of data between Hadoop and an RDBMS

Page 15: Leveraging Hadoop in your PostgreSQL Environment

Hadoop Community

Much more like the Linux community than the PostgreSQL community

Competing commercial interests make the project's direction unclear to some

Page 16: Leveraging Hadoop in your PostgreSQL Environment

Use Cases

Page 17: Leveraging Hadoop in your PostgreSQL Environment

Hive Metastore

All of the metadata for the Hive tables resides in an RDBMS

The default is to use Derby

– Limited to a single connection

Page 18: Leveraging Hadoop in your PostgreSQL Environment

Hive Metastore (cont.)

Use PostgreSQL for scalability and reliability

Many concurrent users
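
Pointing the metastore at PostgreSQL is typically a matter of setting the JDBC connection properties in hive-site.xml; a minimal sketch (host, database name, and credentials are assumptions), with the PostgreSQL JDBC driver also on the Hive classpath:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:postgresql://127.0.0.1:5432/metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.postgresql.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive</value>
</property>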

Page 19: Leveraging Hadoop in your PostgreSQL Environment

PostgreSQL Backups

PostgreSQL's WAL archiving and Point In Time Recovery are powerful

– But it requires a lot of storage

Typically used with some sort of NFS

Page 20: Leveraging Hadoop in your PostgreSQL Environment

PostgreSQL Backups (cont.)

Use HDFS

– Redundancy & Scalability

Page 21: Leveraging Hadoop in your PostgreSQL Environment

PostgreSQL Backups (cont.)

Archive Command

archive_command = 'hadoop dfs -copyFromLocal %p /user/postgres/wal/%f'
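
A matching restore_command (not shown on the slide; the archive path is assumed to mirror the one above) pulls the segments back out of HDFS during recovery:

restore_command = 'hadoop dfs -copyToLocal /user/postgres/wal/%f %p'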

Page 22: Leveraging Hadoop in your PostgreSQL Environment

Log Files

Maintain log files for months or years

May use Syslog to consolidate multiple database logs (example settings below)

Turning on query logging makes the log file huge
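
For the Syslog consolidation mentioned above, a minimal postgresql.conf sketch (the facility and the statement-logging level are assumptions) that sends the server log, including statements, to the local syslog daemon:

log_destination = 'syslog'
syslog_facility = 'LOCAL0'
syslog_ident = 'postgres'
log_statement = 'all'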

Page 23: Leveraging Hadoop in your PostgreSQL Environment

Log Files (cont.)

Use Flume

Consolidates logs across databases

MapReduce allows for parallel analysis

Page 24: Leveraging Hadoop in your PostgreSQL Environment

Log Files (cont.)

Set up Syslog to forward messages to Flume

rsyslog.conf:

*.* @127.0.0.1:5140

Configure Flume to act as a Syslog server

pglogs.sources.sl.type = syslogudp

pglogs.sources.sl.port = 5140

pglogs.sources.sl.host = 0.0.0.0
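
The slide shows only the syslog source; a minimal sketch of the rest of the agent definition (the channel and sink names and the HDFS path are assumptions) that actually lands the events in HDFS:

pglogs.sources = sl
pglogs.channels = c1
pglogs.sinks = k1

pglogs.channels.c1.type = memory

pglogs.sinks.k1.type = hdfs
pglogs.sinks.k1.hdfs.path = /user/postgres/logs
pglogs.sinks.k1.hdfs.fileType = DataStream

pglogs.sources.sl.channels = c1
pglogs.sinks.k1.channel = c1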

Page 25: Leveraging Hadoop in your PostgreSQL Environment

Log Files (cont.)

MapReduce jobs can quickly analyze the logs

// Fragment of a MapReduce job (old org.apache.hadoop.mapred API) that counts
// PostgreSQL statements found in the syslog-collected logs.
public static class MapClass extends MapReduceBase
        implements Mapper<StatementOffset, Text, Text, LongWritable> {

    private final static String STATEMENT_DELIM = "statement: ";
    private final static String SYSLOG_IDENT = "postgres";
    private final static LongWritable one = new LongWritable(1);

    public void map(StatementOffset key, Text value,
                    OutputCollector<Text, LongWritable> output,
                    Reporter reporter) throws IOException {

        String line = value.toString();

        // Emit one count per PostgreSQL statement line, keyed by statement type
        if (line.startsWith(SYSLOG_IDENT) && line.contains(STATEMENT_DELIM)) {
            output.collect(getStatementType(line), one);
        }
    }
    ...

Page 26: Leveraging Hadoop in your PostgreSQL Environment

Transaction History

History tables grow very rapidly

Maintaining the tables over time is a huge undertaking

Partitioning frequently used
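
A minimal sketch of the inheritance-based partitioning commonly used at the time (the table and the date range are hypothetical):

CREATE TABLE transaction_history (
    history_id bigserial,
    event_d timestamp,
    detail text
);

CREATE TABLE transaction_history_2015_05 (
    CHECK (event_d >= DATE '2015-05-01' AND event_d < DATE '2015-06-01')
) INHERITS (transaction_history);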

Page 27: Leveraging Hadoop in your PostgreSQL Environment

Transaction History (cont.)

Use Sqoop

– Add a sequence to the table for fast incremental loads
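
A hedged sketch of what that incremental load might look like from the command line (the connection details, table, and check column are assumptions):

sqoop import \
  --connect jdbc:postgresql://127.0.0.1:5432/postgres \
  --username postgres \
  --table transaction_history \
  --incremental append \
  --check-column history_id \
  --last-value 1000000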

Page 28: Leveraging Hadoop in your PostgreSQL Environment

OLAP Cubes

Can take a very long time to build

PostgreSQL will use only a single CPU

Drilling down to the details can be a very long query

Page 29: Leveraging Hadoop in your PostgreSQL Environment

OLAP Cubes

Use a Foreign Data Wrapper

Looks like a native table to reporting tools

Drill down takes place on Hadoop

Page 30: Leveraging Hadoop in your PostgreSQL Environment

OLAP Cubes (cont.)

Create a Foreign Server

CREATE EXTENSION hadoop_fdw;

CREATE SERVER hadoop_server FOREIGN DATA WRAPPER hadoop_fdw OPTIONS (address '127.0.0.1', port '10000');

CREATE USER MAPPING FOR PUBLIC SERVER hadoop_server;

Page 31: Leveraging Hadoop in your PostgreSQL Environment

OLAP Cubes (cont.)

Create a Foreign Table

CREATE FOREIGN TABLE order_line (
    ol_w_id integer,
    ol_d_id integer,
    ol_o_id integer,
    ol_number integer,
    ol_i_id integer,
    ol_delivery_d timestamp,
    ol_amount decimal(6,2),
    ol_supply_w_id integer,
    ol_quantity decimal(2,0),
    ol_dist_info varchar(24)
) SERVER hadoop_server OPTIONS (table 'order_line');

Page 32: Leveraging Hadoop in your PostgreSQL Environment

OLAP Cubes (cont.)

Loading PostgreSQL aggregate tables is a simple SQL statement

Use Hive views for more complex aggregations (sketched after the load statement below)

INSERT INTO item_sale_month
SELECT ol_i_id as i_id,
       EXTRACT(YEAR FROM ol_delivery_d) as year,
       EXTRACT(MONTH FROM ol_delivery_d) as month,
       sum(ol_amount) as amount
FROM order_line
GROUP BY 1, 2, 3;
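
For the more complex aggregations mentioned above, a Hive view can do the rollup on the Hadoop side; a minimal sketch (the view name is an assumption) that a foreign table could then point at:

CREATE VIEW item_sale_month_v AS
SELECT ol_i_id,
       year(ol_delivery_d) AS year,
       month(ol_delivery_d) AS month,
       sum(ol_amount) AS amount
FROM order_line
GROUP BY ol_i_id, year(ol_delivery_d), month(ol_delivery_d);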

Page 33: Leveraging Hadoop in your PostgreSQL Environment

OLAP Cubes (cont.)

Drill downs pass the processing down to Hive

postgres=# explain verbose select sum(ol_amount) from order_line where ol_i_id = 34928;

                                   QUERY PLAN
------------------------------------------------------------------------------------
 Aggregate  (cost=11002.50..11002.51 rows=1 width=14)
   Output: sum(ol_amount)
   ->  Foreign Scan on public.order_line  (cost=10000.00..11000.00 rows=1000 width=14)
         Output: ol_w_id, ol_d_id, ol_o_id, ol_number, ol_i_id, ol_delivery_d,
                 ol_amount, ol_supply_w_id, ol_quantity, ol_dist_info
         Remote SQL: SELECT * FROM order_line WHERE ((ol_i_id = 34928))
(5 rows)

Page 34: Leveraging Hadoop in your PostgreSQL Environment

Audit History

All database access should be audited and autonomously logged

Must be maintained for years

Page 35: Leveraging Hadoop in your PostgreSQL Environment

Audit History (cont.)

Use the Hadoop Foreign Data Wrapper to write to Flume

Page 36: Leveraging Hadoop in your PostgreSQL Environment

Audit History (cont.)

Create a writable foreign table

CREATE FOREIGN TABLE audit (
    audit_id bigint,
    event_d timestamp,
    "table" varchar,
    action varchar,
    "user" varchar
) SERVER hadoop_server OPTIONS (table 'audit', flume_port '44444');
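
A hedged sketch of one way rows could get into that foreign table: a generic trigger (the audit_id_seq sequence and the customer table are assumptions) that records every change through the Flume-backed audit table:

CREATE OR REPLACE FUNCTION log_audit() RETURNS trigger AS $$
BEGIN
    -- Record who changed which table, and how, through the audit foreign table
    INSERT INTO audit
    VALUES (nextval('audit_id_seq'), now(), TG_TABLE_NAME, TG_OP, current_user);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER customer_audit
    AFTER INSERT OR UPDATE OR DELETE ON customer
    FOR EACH ROW EXECUTE PROCEDURE log_audit();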

Page 37: Leveraging Hadoop in your PostgreSQL Environment

Message Queue

Tables have a lot of churn with many updates and deletes

Causes a lot of table and index bloat in PostgreSQL

AKA a vacuuming nightmare

Page 38: Leveraging Hadoop in your PostgreSQL Environment

Message Queue (cont.)

Use an FDW to Hbase

Hbase is not an "Eventually Consistent" architecture, so it is ideal for message queues

Page 39: Leveraging Hadoop in your PostgreSQL Environment

Message Queue (cont.)

Create a writable foreign table

CREATE FOREIGN TABLE hbase_table (
    key varchar,
    value varchar
) SERVER hadoop_server OPTIONS (table 'hbase_table',
    hbase_address 'localhost', hbase_port '9090',
    hbase_mapping ':key,cf:val');

INSERT INTO hbase_table VALUES ('key1', 'value1');
INSERT INTO hbase_table VALUES ('key2', 'value2');
UPDATE hbase_table SET value = 'update' WHERE key = 'key2';
DELETE FROM hbase_table WHERE key = 'key1';
SELECT * FROM hbase_table;

Page 40: Leveraging Hadoop in your PostgreSQL Environment

High Availability

When setting up replication for high availability, many necessary components are not provided by PostgreSQL:

Failure detection

Split brain prevention

Replica promotion

Notification to clients of a failover

Page 41: Leveraging Hadoop in your PostgreSQL Environment

High Availability (cont.)

ZooKeeper with a custom background worker can handle all of the missing components

Page 42: Leveraging Hadoop in your PostgreSQL Environment

High Availability (cont.)

Failure Detection – Replicas watch an ephemeral lock created by the master

void watch_master()
{
    ...
    sprintf(root_path, "%s/lock", zookeeper_path);

    while (!found_master && !got_sigterm)
    {
        elog(DEBUG1, "Looking for the master lock...");

        rc = zoo_get_children(zh, root_path, 0, &children);

        if (rc == ZOK)
        {
            sprintf(child, "%s", "~");
            for (i = 0; i < children.count; i++)
            {
                if (strcmp(child, children.data[i]) > 0)
                {
                    sprintf(child, "%s", children.data[i]);
                    found_master = 1;
                }
            }

            if (found_master)
            {
                sprintf(lock_path, "%s/%s", root_path, child);
                elog(DEBUG1, "Found a lock at %s", lock_path);

                /* Set the watch on the lock */
                bufferlen = sizeof(buffer);
                rc = zoo_get(zh, lock_path, 1, buffer, &bufferlen, NULL);
                if (rc != ZOK)
                {
                    found_master = 0;
                    elog(LOG, "Unable to watch %s. Retrying...", lock_path);
                }
            }
        }
        else
        {
            elog(LOG, "The path %s does not have any children yet. ...", root_path);
        }
    }
    ...
}

Page 43: Leveraging Hadoop in your PostgreSQL Environment

High Availability (cont.)

Split-brain prevention – the master grabs an exclusive ZooKeeper lock on startup and shuts down immediately if unsuccessful

char *create_lock()
{
    char path[PATH_LEN];
    char *buffer;
    int rc;

    buffer = (char *) palloc(PATH_LEN);

    ensure_connected();

    sprintf(path, "%s/lock", zookeeper_path);
    if (zoo_exists(zh, path, 0, NULL) == ZNONODE)
    {
        rc = zoo_create(zh, path, NULL, -1, &ZOO_OPEN_ACL_UNSAFE, 0,
                        buffer, sizeof(buffer) - 1);
        if (rc)
            elog(FATAL, "Failure creating zooKeeper path: %d", rc);
    }

    sprintf(path, "%s/s-", path);

    rc = zoo_create(zh, path, "master", 6, &ZOO_OPEN_ACL_UNSAFE,
                    ZOO_EPHEMERAL | ZOO_SEQUENCE, buffer, sizeof(buffer) - 1);
    if (rc)
        elog(FATAL, "Failure creating zooKeeper lock: %d", rc);

    elog(DEBUG1, "Created a zooKeeper ephemeral path at: %s", buffer);

    return buffer;
}

Page 44: Leveraging Hadoop in your PostgreSQL Environment

High Availability (cont.)

Replica promotion – use ZooKeeper for the ballots of an election. Highest LSN wins

void elect_master()
{
    ...
    recptr = GetWalRcvWriteRecPtr(NULL, NULL);
    sprintf(lsn, "%X/%08X", (uint32) (recptr >> 32), (uint32) recptr);

    elog(DEBUG1, "Entering a ballot with an LSN of: %s", lsn);

    sprintf(path, "%s/lock/%s", zookeeper_path, replica_id);

    rc = zoo_create(zh, path, lsn, strlen(lsn), &ZOO_OPEN_ACL_UNSAFE,
                    ZOO_EPHEMERAL, buffer, sizeof(buffer) - 1);
    if (rc)
        elog(FATAL, "Failure creating zooKeeper path: %s", path);
    elog(DEBUG1, "Created a zooKeeper ephemeral path at: %s", buffer);

    /* Wait until every replica has cast its ballot */
    all_votes_in = false;
    while (!all_votes_in && !got_sigterm)
    {
        sprintf(path, "%s/replica", zookeeper_path);
        rc = zoo_get_children(zh, path, 0, &replicas);

        if (rc == ZOK)
        {
            sprintf(path, "%s/lock", zookeeper_path);
            rc = zoo_get_children(zh, path, 0, &ballots);

            if (rc == ZOK)
            {
                all_votes_in = true;
                for (i = 0; i < replicas.count; i++)
                {
                    found = false;
                    for (j = 0; j < ballots.count; j++)
                    {
                        if (strcmp(replicas.data[i], ballots.data[j]) == 0)
                        {
                            found = true;
                            break;
                        }
                    }

                    if (!found)
                    {
                        all_votes_in = false;
                        break;
                    }
                }
            }
        }
        …
    }

    /* Compare our LSN against every other ballot */
    for (j = 0; j < ballots.count; j++)
    {
        if (strcmp(ballots.data[j], replica_id) != 0)
        {
            sprintf(path, "%s/lock/%s", zookeeper_path, ballots.data[j]);

            memset(buffer, 0, sizeof(buffer));
            bufferlen = sizeof(buffer);
            rc = zoo_get(zh, path, 0, buffer, &bufferlen, NULL);
            if (rc != ZOK)
                elog(LOG, "Unable to get %s. New master probably already found...", path);

            elog(DEBUG1, "Comparing the LSN: %s", buffer);

            if (strcmp(lsn, buffer) < 0)
            {
                elog(DEBUG1, "Found an LSN greater than mine. I am not the winner.");
                return;
            }
            else if (strcmp(lsn, buffer) == 0)
            {
                elog(DEBUG1, "Found an LSN equal to mine. See if I was the first to the start.");
                if (strcmp(replica_id, ballots.data[j]) > 0)
                {
                    elog(DEBUG1, "Found an LSN equal to mine and a sequence earlier than mine. I am not the winner.");
                    return;
                }
            }
        }
    }

    elog(LOG, "Becoming the new master. Acquiring the proper locks.");

    lock = create_lock();

    /* Clean up the ballots */
    for (j = 0; j < ballots.count; j++)
    {
        elog(DEBUG1, "Removing ballot at %s", path);
        rc = zoo_delete(zh, path, -1);
        if (rc != ZOK)
            elog(LOG, "Unable to delete %s", path);
    }

    if (!has_lock(lock))
    {
        elog(LOG, "Unable to acquire a zooKeeper lock. Shutting down to prevent a split brain scenario");
        do_stop();
    }
    else
    {
        elog(LOG, "Promoting to become the new master.");
        do_promote();
    }

    publish_master_info();
}

Page 45: Leveraging Hadoop in your PostgreSQL Environment

High Availability (cont.)

Client notification – Python (or others) can watch the master and act appropriately

# Fragment of a client-side watcher (uses kazoo's KazooClient and ConfigParser)
# that follows the master's connection info in ZooKeeper and rewrites the
# pgbouncer configuration when it changes.
def __init__(self, zkHosts, pathName):
    self.zkHosts = zkHosts
    self.pathName = pathName

    watchPath = pathName + "/master"

    zk = KazooClient(hosts=zkHosts)
    zk.start()

    if zk.exists(watchPath + "/connection"):
        (data, stat) = zk.get(watchPath + "/connection")
        self.pgConnection = data

        @zk.DataWatch(watchPath + "/connection")
        def host_watch(data, stat):
            print("The new master connection is %s" % data)
            self.pgConnection = data

            host = self.pgConnection.split(":")[0]
            port = self.pgConnection.split(":")[1]

            # Rewrite the host/port in pgbouncer's [databases] section
            Config = ConfigParser.SafeConfigParser()
            Config.read(bouncer_config_file)
            for name, value in Config.items("databases"):
                if name == bouncer_db:
                    newValue = ""
                    options = value.split(" ")

                    for option in options:
                        if option.split("=")[0] == "host":
                            newValue = newValue + "host=" + host + " "
                        elif option.split("=")[0] == "port":
                            newValue = newValue + "port=" + port + " "
                        else:
                            newValue = newValue + option + " "

                    Config.set("databases", bouncer_db, newValue)

            cfgfile = open(bouncer_config_file, 'w')
            Config.write(cfgfile)
            cfgfile.close()

            self.reloadBouncer()
    else:
        raise NameError("The path (" + watchPath +
                        "/connection) does not exist in ZooKeeper")

Page 46: Leveraging Hadoop in your PostgreSQL Environment

Getting the Components

http://hadoop.apache.org/

http://hive.apache.org/

http://flume.apache.org/

http://sqoop.apache.org/

http://zookeeper.apache.org/

http://hbase.apache.org/

http://www.postgresql.org/

http://jdbc.postgresql.org/

http://openjdk.java.net/

http://openscg.com/se/hadoop-fdw/

Page 47: Leveraging Hadoop in your PostgreSQL Environment

Or...

BigSQL.org

Page 48: Leveraging Hadoop in your PostgreSQL Environment

Questions?