Data Federation with Apache Spark

Data Federation with Spark Dan Marshall [email protected] 06/13/2017

Page 1: Data Federation with Apache Spark

Data Federation with Spark

Dan Marshall

[email protected]

06/13/2017

Page 2: Data Federation with Apache Spark
Page 3: Data Federation with Apache Spark

PostgreSQL Source

postgres=# create table pg_employee (emp_id int, emp_name varchar, emp_title varchar, emp_hire_date date, emp_dept_id varchar);

postgres=# select * from pg_employee;
 emp_id |     emp_name     | emp_title | emp_hire_date | emp_dept_id
--------+------------------+-----------+---------------+-------------
      1 | Fred Flinstone   | Quarryman | 2001-07-01    | M1
      2 | Donald Duck      | Fisherman | 2011-04-28    | F1
      3 | Larry Fitzgerald | Receiver  | 2005-11-12    | S2
      4 | Randy Johnson    | Pitcher   | 2008-01-11    | S2
(4 rows)

Page 4: Data Federation with Apache Spark

PostgreSQL - Spark
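
The code shown on this slide is not preserved in the transcript. As a minimal PySpark sketch, the pg_employee table could be loaded through Spark's built-in JDBC data source roughly as follows. The host, port, credentials, and both helper names are assumptions for illustration; only the table name and the postgres database name come from the session above. The PostgreSQL JDBC driver jar is assumed to be on the Spark classpath.

```python
def pg_jdbc_options(table, host="localhost", port=5432, database="postgres",
                    user="postgres", password="postgres"):
    """Build options for Spark's JDBC data source.

    Host, port, and credentials are placeholders (assumptions),
    not values taken from the slides.
    """
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }


def read_pg_employee(spark):
    """Load pg_employee into a DataFrame using an existing SparkSession."""
    options = pg_jdbc_options("pg_employee")
    return spark.read.format("jdbc").options(**options).load()
```

With a live SparkSession named spark, read_pg_employee(spark).show() would be expected to print the four employee rows listed on the previous page.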

Page 5: Data Federation with Apache Spark

PostgreSQL - Spark

Page 6: Data Federation with Apache Spark

HBase Source

hbase(main):004:0* create 'hb_dept','cf1'
=> Hbase::Table - hb_dept

hbase(main):008:0* put 'hb_dept','M1','cf1:dept_name','Maintenance'
hbase(main):009:0> put 'hb_dept','F1','cf1:dept_name','Entertainment'
hbase(main):010:0> put 'hb_dept','S2','cf1:dept_name','Sports'

hbase(main):012:0* scan 'hb_dept'
ROW    COLUMN+CELL
 F1    column=cf1:dept_name, timestamp=1496621309775, value=Entertainment
 M1    column=cf1:dept_name, timestamp=1496621309741, value=Maintenance
 S2    column=cf1:dept_name, timestamp=1496621309863, value=Sports
3 row(s) in 0.0590 seconds

Page 7: Data Federation with Apache Spark

HBase - Spark
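
The transcript does not say which HBase connector the slide used. The sketch below assumes the Hortonworks Spark-HBase Connector (shc), which maps an HBase table to DataFrame columns through a JSON catalog. The column name dept_id for the row key and the helper names are hypothetical; the table hb_dept, column family cf1, and qualifier dept_name come from the shell session above.

```python
import json

# Data source name of the Spark-HBase Connector (shc). Using shc at all is an
# assumption; the slide's actual connector is not preserved.
SHC_FORMAT = "org.apache.spark.sql.execution.datasources.hbase"


def hb_dept_catalog():
    """Return the shc catalog (a JSON string) describing the hb_dept table."""
    catalog = {
        "table": {"namespace": "default", "name": "hb_dept"},
        "rowkey": "key",
        "columns": {
            # The row key (M1, F1, S2) exposed as a column; the name is hypothetical.
            "dept_id": {"cf": "rowkey", "col": "key", "type": "string"},
            "dept_name": {"cf": "cf1", "col": "dept_name", "type": "string"},
        },
    }
    return json.dumps(catalog)


def read_hb_dept(spark):
    """Load hb_dept into a DataFrame with columns dept_id and dept_name."""
    return spark.read.format(SHC_FORMAT).options(catalog=hb_dept_catalog()).load()
```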

Page 8: Data Federation with Apache Spark

Join – HBase and PostgreSQL
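
Once both sources are DataFrames, Spark can join them as if they lived in one system. The following is a sketch only, assuming the HBase row key is exposed as a column named dept_id (a hypothetical name) and that an inner join is wanted; the slide's actual join is not preserved.

```python
def join_employee_dept(employees, depts):
    """Inner-join PostgreSQL employees to HBase departments on the department key."""
    condition = employees["emp_dept_id"] == depts["dept_id"]
    # Drop the duplicate key column that comes from the HBase side.
    return employees.join(depts, condition, "inner").drop("dept_id")
```

With the sample rows above, every employee's emp_dept_id (M1, F1, S2, S2) has a matching HBase row, so all four employees would pick up a dept_name.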

Page 9: Data Federation with Apache Spark

Cassandra Source

Connected to Test Cluster at cassandra:9042.
[cqlsh 5.0.1 | Cassandra 3.10 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.

cqlsh> use mykeyspace;
cqlsh:mykeyspace> create table bonus_table (userid int primary key, bonus_amount decimal);
cqlsh:mykeyspace> insert into bonus_table (userid, bonus_amount) values (1, 500.00);
cqlsh:mykeyspace> insert into bonus_table (userid, bonus_amount) values (4, 1000.00);
cqlsh:mykeyspace> select * from bonus_table;

 userid | bonus_amount
--------+--------------
      1 |       500.00
      4 |      1000.00

(2 rows)

Page 10: Data Federation with Apache Spark

Cassandra – Spark
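
The slide's code is missing from the transcript. A minimal sketch, assuming the DataStax spark-cassandra-connector is on the classpath and that the session was created with spark.cassandra.connection.host set to the cassandra host seen in the cqlsh banner; the helper names are hypothetical.

```python
# Data source name registered by the spark-cassandra-connector (assumed connector).
CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"


def bonus_table_options():
    """Keyspace and table names taken from the cqlsh session above."""
    return {"keyspace": "mykeyspace", "table": "bonus_table"}


def read_bonus_table(spark):
    """Load bonus_table into a DataFrame with columns userid and bonus_amount."""
    return spark.read.format(CASSANDRA_FORMAT).options(**bonus_table_options()).load()
```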

Page 11: Data Federation with Apache Spark

Use SQL on DataFrame from Cassandra Source
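
To query a DataFrame with SQL, it first has to be registered as a temporary view. The sketch below illustrates the idea; the view name and the query itself are made up for the example and are not the slide's original query.

```python
BONUS_VIEW = "bonus"  # hypothetical view name

# Illustrative query: bonuses of at least 1000.
BONUS_QUERY = (
    f"SELECT userid, bonus_amount FROM {BONUS_VIEW} WHERE bonus_amount >= 1000"
)


def query_bonus(spark, bonus_df):
    """Register the Cassandra-backed DataFrame as a view and query it with Spark SQL."""
    bonus_df.createOrReplaceTempView(BONUS_VIEW)
    return spark.sql(BONUS_QUERY)
```

Against the two rows inserted above, this query would return only the row for userid 4.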

Page 12: Data Federation with Apache Spark

Join – HBase, PostgreSQL, Cassandra
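
A three-source join can also be written in Spark SQL over temporary views. This is a sketch under stated assumptions: the view names (employee, dept, bonus) and the dept_id column are hypothetical, and a LEFT JOIN is chosen for the bonus table so that employees without a bonus row are kept. The join type on the original slide is not preserved.

```python
THREE_WAY_JOIN = """
SELECT e.emp_id, e.emp_name, e.emp_title, e.emp_hire_date,
       e.emp_dept_id, d.dept_name, b.bonus_amount
FROM employee e
JOIN dept d ON e.emp_dept_id = d.dept_id
LEFT JOIN bonus b ON e.emp_id = b.userid
"""


def join_three(spark, employees, depts, bonus):
    """Register the three DataFrames as views and join them with Spark SQL."""
    employees.createOrReplaceTempView("employee")
    depts.createOrReplaceTempView("dept")
    bonus.createOrReplaceTempView("bonus")
    return spark.sql(THREE_WAY_JOIN)
```

With the sample data, only emp_id 1 and 4 have bonus rows, so bonus_amount would be NULL for emp_id 2 and 3.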

Page 13: Data Federation with Apache Spark

JSON Source

{"dept":"F1"}

{"dept":"S2"}

Page 14: Data Federation with Apache Spark

JSON Code
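
The JSON source above is in JSON Lines form, one object per line, which is the layout spark.read.json expects by default. The slide's code is not preserved; in this sketch the file path is left to the caller, and the small dept_codes helper is a hypothetical addition that only illustrates the line-by-line layout using the standard library.

```python
import json

# The two records shown on the JSON Source page.
SAMPLE = '{"dept":"F1"}\n{"dept":"S2"}\n'


def dept_codes(json_lines):
    """Parse JSON Lines text: one JSON object per non-empty line."""
    return [json.loads(line)["dept"] for line in json_lines.splitlines() if line.strip()]


def read_dept_json(spark, path):
    """Load the JSON Lines file into a DataFrame with a single dept column."""
    return spark.read.json(path)
```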

Page 15: Data Federation with Apache Spark

Join – HBase, PostgreSQL, Cassandra, JSON
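
Because the JSON source carries only department codes, one plausible reading is that it acts as a filter on the joined data. The sketch below expresses that with a left semi join, which keeps rows from the left side that have a match and adds no columns from the right. Treating the JSON source as a filter is an assumption, as are the function and parameter names.

```python
def keep_listed_departments(enriched, dept_filter):
    """Keep only rows whose department code appears in the JSON source."""
    condition = enriched["emp_dept_id"] == dept_filter["dept"]
    return enriched.join(dept_filter, condition, "left_semi")
```

With the sample data, this would keep the employees in F1 and S2 and drop the one employee in M1.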

Page 16: Data Federation with Apache Spark

SQL on Final Temp View
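
The final joined DataFrame can be registered as a temporary view and queried like any table. The query on the original slide is not preserved; the aggregate below is an illustrative assumption, as are the view name and the presence of dept_name and bonus_amount columns in the final result.

```python
FINAL_VIEW = "enriched_employee"  # hypothetical view name

# Illustrative aggregate: head count and total bonus per department.
FINAL_QUERY = f"""
SELECT dept_name, COUNT(*) AS employees, SUM(bonus_amount) AS total_bonus
FROM {FINAL_VIEW}
GROUP BY dept_name
ORDER BY dept_name
"""


def summarize(spark, final_df):
    """Register the federated result as a view and aggregate it with Spark SQL."""
    final_df.createOrReplaceTempView(FINAL_VIEW)
    return spark.sql(FINAL_QUERY)
```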

Page 17: Data Federation with Apache Spark

JDBC Write
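
Writing the federated result back to PostgreSQL uses the same JDBC data source as the read. This is a sketch only: the target table name, connection URL, and credentials are placeholders, and overwrite mode is an assumption (append would add rows to an existing table instead of replacing it).

```python
def write_enriched(df, table="enriched_employee",
                   url="jdbc:postgresql://localhost:5432/postgres",
                   user="postgres", password="postgres", mode="overwrite"):
    """Write a DataFrame to a PostgreSQL table through Spark's JDBC data source.

    Table name, URL, and credentials are hypothetical defaults.
    """
    (df.write.format("jdbc")
       .options(url=url, dbtable=table, user=user, password=password,
                driver="org.postgresql.Driver")
       .mode(mode)
       .save())
```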

Page 18: Data Federation with Apache Spark

Enriched Table in PostgreSQL
