Leveraging Hadoop in your PostgreSQL Environment

Postgres & Hadoop


DESCRIPTION

This talk will begin with a discussion of the strengths of PostgreSQL and Hadoop. We will then lead into a high level overview of Hadoop and its community of projects like Hive, Flume and Sqoop. Finally, we will dig down into various use cases detailing how you can leverage Hadoop technologies for your PostgreSQL databases today. The use cases will range from using HDFS for simple database backups to using PostgreSQL and Foreign Data Wrappers to do low latency analytics on your Big Data.

TRANSCRIPT

Page 1: Leveraging Hadoop in your PostgreSQL Environment

Postgres & Hadoop

Page 2: Leveraging Hadoop in your PostgreSQL Environment

Who am I?

Jim Mlodgenski

CTO, OpenSCG

Co-organizer, NYCPUG

Co-organizer, Philly PUG

Co-chair, PGConf US

[email protected]

@jim_mlodgenski

Page 3: Leveraging Hadoop in your PostgreSQL Environment

Agenda

Strengths of PostgreSQL

Strengths of Hadoop

Hadoop Community

Use Cases

Page 4: Leveraging Hadoop in your PostgreSQL Environment

Best of Both Worlds

Postgres

World's most advanced open source database solution

Enterprise class including MVCC, streaming replication & rich data type support (to name a few!)

Robust transaction support with strong ANSI-SQL compliance

Hadoop

Big data distributed framework

Reliable, massively scalable & proven

Failures handled at the application layer allowing commodity hardware

Page 5: Leveraging Hadoop in your PostgreSQL Environment

Strengths of PostgreSQL

Strong Data Types

Concurrency

Transactions

Security

Indexes

Connectors

Page 6: Leveraging Hadoop in your PostgreSQL Environment

Components of PostgreSQL

Database

Connectors

– JDBC

– ODBC

– Libpq

Foreign Data Wrappers

And more...

Page 7: Leveraging Hadoop in your PostgreSQL Environment

Strengths of Hadoop

Parallelism

Flexibility

Redundancy

Scalability

Page 8: Leveraging Hadoop in your PostgreSQL Environment

Components of Hadoop

HDFS

Hive

Flume

Sqoop

ZooKeeper

Hbase

And many more...

Page 9: Leveraging Hadoop in your PostgreSQL Environment

HDFS

Hadoop Distributed File System

Page 10: Leveraging Hadoop in your PostgreSQL Environment

Hbase

Modeled after Google BigTable

Column-oriented database on top of HDFS

Page 11: Leveraging Hadoop in your PostgreSQL Environment

ZooKeeper

Distributed Configuration Service

Supports synchronization and distributed locking

Automatic leader election

Page 12: Leveraging Hadoop in your PostgreSQL Environment

Hive

Adds SQL on Hadoop

Converts SQL (HQL) to MapReduce Jobs
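
For example, a simple aggregate like the sketch below (the pageviews table is hypothetical) is compiled by Hive into one or more MapReduce jobs that run across the cluster:

SELECT url, count(*) AS hits
FROM pageviews
GROUP BY url;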

Page 13: Leveraging Hadoop in your PostgreSQL Environment

Flume

Streams data into HDFS

Distributed and Highly Available

Page 14: Leveraging Hadoop in your PostgreSQL Environment

Sqoop

Allows for bulk transfers of data between Hadoop and an RDBMS

Page 15: Leveraging Hadoop in your PostgreSQL Environment

Hadoop Community

Much more like the Linux community than the PostgreSQL community

Competing commercial interests make the project's direction unclear to some

Page 16: Leveraging Hadoop in your PostgreSQL Environment

Use Cases

Page 17: Leveraging Hadoop in your PostgreSQL Environment

Hive Metastore

All of the metadata for the Hive tables resides in an RDBMS

The default is to use Derby

– Limited to a single connection

Page 18: Leveraging Hadoop in your PostgreSQL Environment

Hive Metastore (cont.)

Use PostgreSQL for scalability and reliability

Many concurrent users
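
Pointing the metastore at PostgreSQL is typically a matter of setting the JDBC connection properties in hive-site.xml; a minimal sketch (host, database name, and credentials are assumptions), with the PostgreSQL JDBC driver also on the Hive classpath:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:postgresql://127.0.0.1:5432/metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.postgresql.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive</value>
</property>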

Page 19: Leveraging Hadoop in your PostgreSQL Environment

PostgreSQL Backups

PostgreSQL's WAL archiving and Point In Time Recovery are powerful

– But it requires a lot of storage

Typically used with some sort of NFS

Page 20: Leveraging Hadoop in your PostgreSQL Environment

PostgreSQL Backups (cont.)

Use HDFS

– Redundancy & Scalability

Page 21: Leveraging Hadoop in your PostgreSQL Environment

PostgreSQL Backups (cont.)

Archive Command

archive_command = 'hadoop dfs -copyFromLocal %p /user/postgres/wal/%f'
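
A matching restore_command (not shown on the slide; the archive path is assumed to mirror the one above) pulls the segments back out of HDFS during recovery:

restore_command = 'hadoop dfs -copyToLocal /user/postgres/wal/%f %p'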

Page 22: Leveraging Hadoop in your PostgreSQL Environment

Log Files

Maintain log files for months or years

May use Syslog to consolidate multiple database logs (example settings below)

Turning on query logging makes the log file huge
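
For the Syslog consolidation mentioned above, a minimal postgresql.conf sketch (the facility and the statement-logging level are assumptions) that sends the server log, including statements, to the local syslog daemon:

log_destination = 'syslog'
syslog_facility = 'LOCAL0'
syslog_ident = 'postgres'
log_statement = 'all'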

Page 23: Leveraging Hadoop in your PostgreSQL Environment

Log Files (cont.)

Use Flume

Consolidates logs across databases

MapReduce allows for parallel analysis

Page 24: Leveraging Hadoop in your PostgreSQL Environment

Log Files (cont.)

Set up Syslog to forward messages to Flume

rsyslog.conf:

*.* @127.0.0.1:5140

Configure Flume to act as a Syslog server

pglogs.sources.sl.type = syslogudp

pglogs.sources.sl.port = 5140

pglogs.sources.sl.host = 0.0.0.0
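
The slide shows only the syslog source; a minimal sketch of the rest of the agent definition (the channel and sink names and the HDFS path are assumptions) that actually lands the events in HDFS:

pglogs.sources = sl
pglogs.channels = c1
pglogs.sinks = k1

pglogs.channels.c1.type = memory

pglogs.sinks.k1.type = hdfs
pglogs.sinks.k1.hdfs.path = /user/postgres/logs
pglogs.sinks.k1.hdfs.fileType = DataStream

pglogs.sources.sl.channels = c1
pglogs.sinks.k1.channel = c1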

Page 25: Leveraging Hadoop in your PostgreSQL Environment

Log Files (cont.)

MapReduce jobs can quickly analyze the logs

// Fragment of a MapReduce job (old org.apache.hadoop.mapred API) that counts
// PostgreSQL statements found in the syslog-collected logs.
public static class MapClass extends MapReduceBase
        implements Mapper<StatementOffset, Text, Text, LongWritable> {

    private final static String STATEMENT_DELIM = "statement: ";
    private final static String SYSLOG_IDENT = "postgres";
    private final static LongWritable one = new LongWritable(1);

    public void map(StatementOffset key, Text value,
                    OutputCollector<Text, LongWritable> output,
                    Reporter reporter) throws IOException {

        String line = value.toString();

        // Emit one count per PostgreSQL statement line, keyed by statement type
        if (line.startsWith(SYSLOG_IDENT) && line.contains(STATEMENT_DELIM)) {
            output.collect(getStatementType(line), one);
        }
    }
    ...

Page 26: Leveraging Hadoop in your PostgreSQL Environment

Transaction History

History tables grow very rapidly

Maintaining the tables over time is a huge undertaking

Partitioning frequently used
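
A minimal sketch of the inheritance-based partitioning commonly used at the time (the table and the date range are hypothetical):

CREATE TABLE transaction_history (
    history_id bigserial,
    event_d timestamp,
    detail text
);

CREATE TABLE transaction_history_2015_05 (
    CHECK (event_d >= DATE '2015-05-01' AND event_d < DATE '2015-06-01')
) INHERITS (transaction_history);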

Page 27: Leveraging Hadoop in your PostgreSQL Environment

Transaction History (cont.)

Use Sqoop

– Add a sequence to the table for fast incremental loads
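
A hedged sketch of what that incremental load might look like from the command line (the connection details, table, and check column are assumptions):

sqoop import \
  --connect jdbc:postgresql://127.0.0.1:5432/postgres \
  --username postgres \
  --table transaction_history \
  --incremental append \
  --check-column history_id \
  --last-value 1000000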

Page 28: Leveraging Hadoop in your PostgreSQL Environment

OLAP Cubes

Can take a very long time to build

PostgreSQL will use only a single CPU

Drilling down to the details can be a very long query

Page 29: Leveraging Hadoop in your PostgreSQL Environment

OLAP Cubes

Use a Foreign Data Wrapper

Looks like a native table to reporting tools

Drill down takes place on Hadoop

Page 30: Leveraging Hadoop in your PostgreSQL Environment

OLAP Cubes (cont.)

Create a Foreign Server

CREATE EXTENSION hadoop_fdw;

CREATE SERVER hadoop_server FOREIGN DATA WRAPPER hadoop_fdw OPTIONS (address '127.0.0.1', port '10000');

CREATE USER MAPPING FOR PUBLIC SERVER hadoop_server;

Page 31: Leveraging Hadoop in your PostgreSQL Environment

OLAP Cubes (cont.)

Create a Foreign Table

CREATE FOREIGN TABLE order_line (
    ol_w_id integer,
    ol_d_id integer,
    ol_o_id integer,
    ol_number integer,
    ol_i_id integer,
    ol_delivery_d timestamp,
    ol_amount decimal(6,2),
    ol_supply_w_id integer,
    ol_quantity decimal(2,0),
    ol_dist_info varchar(24)
) SERVER hadoop_server OPTIONS (table 'order_line');

Page 32: Leveraging Hadoop in your PostgreSQL Environment

OLAP Cubes (cont.)

Loading PostgreSQL aggregate tables is a simple SQL statement

Use Hive views for more complex aggregations (sketched after the load statement below)

INSERT INTO item_sale_month
SELECT ol_i_id as i_id,
       EXTRACT(YEAR FROM ol_delivery_d) as year,
       EXTRACT(MONTH FROM ol_delivery_d) as month,
       sum(ol_amount) as amount
FROM order_line
GROUP BY 1, 2, 3;
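
For the more complex aggregations mentioned above, a Hive view can do the rollup on the Hadoop side; a minimal sketch (the view name is an assumption) that a foreign table could then point at:

CREATE VIEW item_sale_month_v AS
SELECT ol_i_id,
       year(ol_delivery_d) AS year,
       month(ol_delivery_d) AS month,
       sum(ol_amount) AS amount
FROM order_line
GROUP BY ol_i_id, year(ol_delivery_d), month(ol_delivery_d);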

Page 33: Leveraging Hadoop in your PostgreSQL Environment

OLAP Cubes (cont.)

Drill downs pass the processing down to Hive

postgres=# explain verbose select sum(ol_amount) from order_line where ol_i_id = 34928;

                                   QUERY PLAN
------------------------------------------------------------------------------------
 Aggregate  (cost=11002.50..11002.51 rows=1 width=14)
   Output: sum(ol_amount)
   ->  Foreign Scan on public.order_line  (cost=10000.00..11000.00 rows=1000 width=14)
         Output: ol_w_id, ol_d_id, ol_o_id, ol_number, ol_i_id, ol_delivery_d,
                 ol_amount, ol_supply_w_id, ol_quantity, ol_dist_info
         Remote SQL: SELECT * FROM order_line WHERE ((ol_i_id = 34928))
(5 rows)

Page 34: Leveraging Hadoop in your PostgreSQL Environment

Audit History

All database access should be audited and autonomously logged

Must be maintained for years

Page 35: Leveraging Hadoop in your PostgreSQL Environment

Audit History (cont.)

Use the Hadoop Foreign Data Wrapper to write to Flume

Page 36: Leveraging Hadoop in your PostgreSQL Environment

Audit History (cont.)

Create a writable foreign table

CREATE FOREIGN TABLE audit (
    audit_id bigint,
    event_d timestamp,
    "table" varchar,
    action varchar,
    "user" varchar
) SERVER hadoop_server OPTIONS (table 'audit', flume_port '44444');
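
A hedged sketch of one way rows could get into that foreign table: a generic trigger (the audit_id_seq sequence and the customer table are assumptions) that records every change through the Flume-backed audit table:

CREATE OR REPLACE FUNCTION log_audit() RETURNS trigger AS $$
BEGIN
    -- Record who changed which table, and how, through the audit foreign table
    INSERT INTO audit
    VALUES (nextval('audit_id_seq'), now(), TG_TABLE_NAME, TG_OP, current_user);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER customer_audit
    AFTER INSERT OR UPDATE OR DELETE ON customer
    FOR EACH ROW EXECUTE PROCEDURE log_audit();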

Page 37: Leveraging Hadoop in your PostgreSQL Environment

Message Queue

Tables have a lot of churn with many updates and deletes

Causes a lot of table and index bloat in PostgreSQL

AKA a vacuuming nightmare

Page 38: Leveraging Hadoop in your PostgreSQL Environment

Message Queue (cont.)

Use an FDW to Hbase

Hbase is not an "Eventually Consistent" architecture, so it is ideal for message queues

Page 39: Leveraging Hadoop in your PostgreSQL Environment

Message Queue (cont.)

Create a writable foreign table

CREATE FOREIGN TABLE hbase_table (
    key varchar,
    value varchar
) SERVER hadoop_server OPTIONS (table 'hbase_table',
    hbase_address 'localhost', hbase_port '9090',
    hbase_mapping ':key,cf:val');

INSERT INTO hbase_table VALUES ('key1', 'value1');
INSERT INTO hbase_table VALUES ('key2', 'value2');
UPDATE hbase_table SET value = 'update' WHERE key = 'key2';
DELETE FROM hbase_table WHERE key = 'key1';
SELECT * FROM hbase_table;

Page 40: Leveraging Hadoop in your PostgreSQL Environment

High Availability

When setting up replication for high availability, many necessary components are not provided by PostgreSQL:

Failure detection

Split brain prevention

Replica promotion

Notification to clients of a failover

Page 41: Leveraging Hadoop in your PostgreSQL Environment

High Availability (cont.)

ZooKeeper with a custom background worker can handle all of the missing components

Page 42: Leveraging Hadoop in your PostgreSQL Environment

High Availability (cont.)

Failure Detection – Replicas watch an ephemeral lock created by the master

void watch_master()
{
    ...
    sprintf(root_path, "%s/lock", zookeeper_path);

    while (!found_master && !got_sigterm)
    {
        elog(DEBUG1, "Looking for the master lock...");

        rc = zoo_get_children(zh, root_path, 0, &children);

        if (rc == ZOK)
        {
            sprintf(child, "%s", "~");
            for (i = 0; i < children.count; i++)
            {
                if (strcmp(child, children.data[i]) > 0)
                {
                    sprintf(child, "%s", children.data[i]);
                    found_master = 1;
                }
            }

            if (found_master)
            {
                sprintf(lock_path, "%s/%s", root_path, child);
                elog(DEBUG1, "Found a lock at %s", lock_path);

                /* Set the watch on the lock */
                bufferlen = sizeof(buffer);
                rc = zoo_get(zh, lock_path, 1, buffer, &bufferlen, NULL);
                if (rc != ZOK)
                {
                    found_master = 0;
                    elog(LOG, "Unable to watch %s. Retrying...", lock_path);
                }
            }
        }
        else
        {
            elog(LOG, "The path %s does not have any children yet. ...", root_path);
        }
    }
    ...
}

Page 43: Leveraging Hadoop in your PostgreSQL Environment

High Availability (cont.)

Split-brain prevention – the master grabs an exclusive ZooKeeper lock on startup and shuts down immediately if unsuccessful

char *create_lock()
{
    char path[PATH_LEN];
    char *buffer;
    int rc;

    buffer = (char *) palloc(PATH_LEN);

    ensure_connected();

    sprintf(path, "%s/lock", zookeeper_path);
    if (zoo_exists(zh, path, 0, NULL) == ZNONODE)
    {
        rc = zoo_create(zh, path, NULL, -1, &ZOO_OPEN_ACL_UNSAFE, 0,
                        buffer, sizeof(buffer) - 1);
        if (rc)
            elog(FATAL, "Failure creating zooKeeper path: %d", rc);
    }

    sprintf(path, "%s/s-", path);

    rc = zoo_create(zh, path, "master", 6, &ZOO_OPEN_ACL_UNSAFE,
                    ZOO_EPHEMERAL | ZOO_SEQUENCE, buffer, sizeof(buffer) - 1);
    if (rc)
        elog(FATAL, "Failure creating zooKeeper lock: %d", rc);

    elog(DEBUG1, "Created a zooKeeper ephemeral path at: %s", buffer);

    return buffer;
}

Page 44: Leveraging Hadoop in your PostgreSQL Environment

High Availability (cont.)

Replica promotion – use ZooKeeper for the ballots of an election. Highest LSN wins

void elect_master()
{
    ...
    recptr = GetWalRcvWriteRecPtr(NULL, NULL);
    sprintf(lsn, "%X/%08X", (uint32) (recptr >> 32), (uint32) recptr);

    elog(DEBUG1, "Entering a ballot with an LSN of: %s", lsn);

    sprintf(path, "%s/lock/%s", zookeeper_path, replica_id);

    rc = zoo_create(zh, path, lsn, strlen(lsn), &ZOO_OPEN_ACL_UNSAFE,
                    ZOO_EPHEMERAL, buffer, sizeof(buffer) - 1);
    if (rc)
        elog(FATAL, "Failure creating zooKeeper path: %s", path);
    elog(DEBUG1, "Created a zooKeeper ephemeral path at: %s", buffer);

    /* Wait until every replica has cast its ballot */
    all_votes_in = false;
    while (!all_votes_in && !got_sigterm)
    {
        sprintf(path, "%s/replica", zookeeper_path);
        rc = zoo_get_children(zh, path, 0, &replicas);

        if (rc == ZOK)
        {
            sprintf(path, "%s/lock", zookeeper_path);
            rc = zoo_get_children(zh, path, 0, &ballots);

            if (rc == ZOK)
            {
                all_votes_in = true;
                for (i = 0; i < replicas.count; i++)
                {
                    found = false;
                    for (j = 0; j < ballots.count; j++)
                    {
                        if (strcmp(replicas.data[i], ballots.data[j]) == 0)
                        {
                            found = true;
                            break;
                        }
                    }

                    if (!found)
                    {
                        all_votes_in = false;
                        break;
                    }
                }
            }
        }
        …
    }

    /* Compare our LSN against every other ballot */
    for (j = 0; j < ballots.count; j++)
    {
        if (strcmp(ballots.data[j], replica_id) != 0)
        {
            sprintf(path, "%s/lock/%s", zookeeper_path, ballots.data[j]);

            memset(buffer, 0, sizeof(buffer));
            bufferlen = sizeof(buffer);
            rc = zoo_get(zh, path, 0, buffer, &bufferlen, NULL);
            if (rc != ZOK)
                elog(LOG, "Unable to get %s. New master probably already found...", path);

            elog(DEBUG1, "Comparing the LSN: %s", buffer);

            if (strcmp(lsn, buffer) < 0)
            {
                elog(DEBUG1, "Found an LSN greater than mine. I am not the winner.");
                return;
            }
            else if (strcmp(lsn, buffer) == 0)
            {
                elog(DEBUG1, "Found an LSN equal to mine. See if I was the first to the start.");
                if (strcmp(replica_id, ballots.data[j]) > 0)
                {
                    elog(DEBUG1, "Found an LSN equal to mine and a sequence earlier than mine. I am not the winner.");
                    return;
                }
            }
        }
    }

    elog(LOG, "Becoming the new master. Acquiring the proper locks.");

    lock = create_lock();

    /* Clean up the ballots */
    for (j = 0; j < ballots.count; j++)
    {
        elog(DEBUG1, "Removing ballot at %s", path);
        rc = zoo_delete(zh, path, -1);
        if (rc != ZOK)
            elog(LOG, "Unable to delete %s", path);
    }

    if (!has_lock(lock))
    {
        elog(LOG, "Unable to acquire a zooKeeper lock. Shutting down to prevent a split brain scenario");
        do_stop();
    }
    else
    {
        elog(LOG, "Promoting to become the new master.");
        do_promote();
    }

    publish_master_info();
}

Page 45: Leveraging Hadoop in your PostgreSQL Environment

High Availability (cont.)

Client notification – Python (or others) can watch the master and act appropriately

# Fragment of a client-side watcher (uses kazoo's KazooClient and ConfigParser)
# that follows the master's connection info in ZooKeeper and rewrites the
# pgbouncer configuration when it changes.
def __init__(self, zkHosts, pathName):
    self.zkHosts = zkHosts
    self.pathName = pathName

    watchPath = pathName + "/master"

    zk = KazooClient(hosts=zkHosts)
    zk.start()

    if zk.exists(watchPath + "/connection"):
        (data, stat) = zk.get(watchPath + "/connection")
        self.pgConnection = data

        @zk.DataWatch(watchPath + "/connection")
        def host_watch(data, stat):
            print("The new master connection is %s" % data)
            self.pgConnection = data

            host = self.pgConnection.split(":")[0]
            port = self.pgConnection.split(":")[1]

            # Rewrite the host/port in pgbouncer's [databases] section
            Config = ConfigParser.SafeConfigParser()
            Config.read(bouncer_config_file)
            for name, value in Config.items("databases"):
                if name == bouncer_db:
                    newValue = ""
                    options = value.split(" ")

                    for option in options:
                        if option.split("=")[0] == "host":
                            newValue = newValue + "host=" + host + " "
                        elif option.split("=")[0] == "port":
                            newValue = newValue + "port=" + port + " "
                        else:
                            newValue = newValue + option + " "

                    Config.set("databases", bouncer_db, newValue)

            cfgfile = open(bouncer_config_file, 'w')
            Config.write(cfgfile)
            cfgfile.close()

            self.reloadBouncer()
    else:
        raise NameError("The path (" + watchPath +
                        "/connection) does not exist in ZooKeeper")

Page 46: Leveraging Hadoop in your PostgreSQL Environment

Getting the Components

http://hadoop.apache.org/

http://hive.apache.org/

http://flume.apache.org/

http://sqoop.apache.org/

http://zookeeper.apache.org/

http://hbase.apache.org/

http://www.postgresql.org/

http://jdbc.postgresql.org/

http://openjdk.java.net/

http://openscg.com/se/hadoop-fdw/

Page 47: Leveraging Hadoop in your PostgreSQL Environment

Or...

BigSQL.org

Page 48: Leveraging Hadoop in your PostgreSQL Environment

Questions?