Data compression for Data Lakes

Data Management Compression for Data Lakes Jonathan WINANDY primatice.com


DESCRIPTION

A short presentation about row-based compression in a Data Lake with Hadoop. The first version of the source code is available at https://github.com/viadeo/viadeo-avro-utils . The code uses Avro for metadata management, which makes the task easier, but the ideas can be ported to other formats.

TRANSCRIPT

Page 1: Data compression for Data Lakes

Data Management Compression for Data Lakes

Jonathan WINANDY, primatice.com

Page 2: Data compression for Data Lakes

About Me

• I am (Not Only) a Data Engineer

Page 3: Data compression for Data Lakes

Contents

• Data Lake: what is a Data Lake?

• Compute: what can we do with the gathered data?

• Conclusion!

Page 4: Data compression for Data Lakes

[Architecture diagram: App V1 and App V2, caches, DB clusters, specific DBs, jobs, monitoring, authentication, an OLAP DB and a backend app, with HDFS / Hadoop at the centre; how to deploy jobs is still an open question]

Infrastructure > The Data Lake*

(*) you actually need more than just Hadoop to make a data lake.

Data Lake

Page 5: Data compression for Data Lakes

How do we move data between the different storage systems?

[Diagram: DB clusters, specific DBs and an OLAP DB connected to HDFS / Hadoop]

We use HDFS to store a copy of all Data.

And we use HADOOP to create views on this Data.

Data Lake

Page 6: Data compression for Data Lakes

Database snapshots

[Diagram: DB clusters offloading into HDFS / Hadoop]

From: sql://$schema/$table

To: hdfs://staging/fromschema=$schema/fromtable=$table/importdatetime=$time

Offloading to HDFS

Data Lake

(HDFS is ‘just’ a Distributed File System)

Offloading data is just making a copy of the databases' data onto HDFS, as a directory with a couple of files (parts).
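As a minimal sketch of the naming convention above (the helper name is an illustrative assumption; the path template and the timestamp format come from the slides):

from datetime import datetime, timezone

def staging_path(schema, table, when=None):
    """Build the HDFS staging directory for one snapshot of $schema.$table,
    following the layout shown above:
    hdfs://staging/fromschema=$schema/fromtable=$table/importdatetime=$time
    """
    when = when or datetime.now(timezone.utc)
    stamp = when.strftime("%Y-%m-%dT%H_%M_%S")  # e.g. 2014-09-22T18_01_10
    return f"hdfs://staging/fromschema={schema}/fromtable={table}/importdatetime={stamp}"

# staging_path("sales", "payments")
# -> "hdfs://staging/fromschema=sales/fromtable=payments/importdatetime=2014-09-22T18_01_10"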

Page 7: Data compression for Data Lakes

Database snapshots

[Diagram: DB clusters offloading into HDFS / Hadoop]

From: sql://$schema/$table

To: hdfs://staging/fromschema=$schema/fromtable=$table/importdatetime=$time

For relational databases, we can use Apache Sqoop to copy, for example, the table payments from the schema sales into:

hdfs://staging/sales/payments/2014-09-22T18_01_10
hdfs://staging/sales/payments/2014-09-21T17_50_25
hdfs://staging/sales/payments/2014-09-20T18_32_43

Data Lake

(HDFS is ‘just’ a Distributed File System)
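A hedged sketch of what such a Sqoop import could look like, driven from Python (the JDBC URL, credentials and mapper count are placeholders; only standard Sqoop flags are used):

import subprocess
from datetime import datetime, timezone

stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H_%M_%S")
target_dir = f"/staging/sales/payments/{stamp}"

# Import one table into a timestamped HDFS directory, stored as Avro data files.
subprocess.run(
    [
        "sqoop", "import",
        "--connect", "jdbc:mysql://db-host/sales",   # placeholder JDBC URL
        "--username", "etl",                         # placeholder credentials
        "--password-file", "/user/etl/.sqoop-pass",  # placeholder credentials
        "--table", "payments",
        "--target-dir", target_dir,
        "--as-avrodatafile",
        "--num-mappers", "4",                        # placeholder parallelism
    ],
    check=True,
)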

Page 8: Data compression for Data Lakes

A light capacity-planning exercise for your potential cluster:

• If you have 100 GB of binary-compressed data*,

• With a cost of storage around $0.03 per GB per month,

• An offload every day would cost $90 per month of kept data,

• In six months, this offloading would cost about $1,900 and weigh 18 TB (<- this is Big Data); the arithmetic is sketched below.

Data Lake

* obviously not plain text/JSON
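The figures above can be checked with a short back-of-the-envelope calculation (daily volume and price are taken from the slide; billing everything kept so far, every month, is the assumed cost model):

GB_PER_DAY = 100            # daily snapshot size, binary compressed
PRICE_PER_GB_MONTH = 0.03   # $ per GB per month
DAYS_PER_MONTH = 30

monthly_added_gb = GB_PER_DAY * DAYS_PER_MONTH        # 3 000 GB added each month
print(monthly_added_gb * PRICE_PER_GB_MONTH)          # 90.0 -> ~$90 per month of kept data

# Everything kept so far keeps being billed, so the bill grows every month.
six_month_cost = sum(m * monthly_added_gb * PRICE_PER_GB_MONTH for m in range(1, 7))
six_month_size_tb = GB_PER_DAY * 6 * DAYS_PER_MONTH / 1000
print(six_month_cost, six_month_size_tb)              # ~1890.0 ($) and 18.0 (TB)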

Page 9: Data compression for Data Lakes

But for these 18 TB, you get quite a lot of features:

• You can track and understand bugs and data corruption,

• You can analyse the business without harming your production DBs,

• You can bootstrap new products based on your existing data,

• And you now have a real excuse to learn Hadoop, or to make some use of your existing bare-metal Hadoop clusters!

Data Lake

Page 10: Data compression for Data Lakes

Snapshot analysis

Extract snapshots of the SQL tables onto HDFS, then compare them with the snapshots of the previous days:

delta = f(J, J-1)

merge = f(J, J-1, J-2, …)

With a couple of snapshots in HDFS, we can use the tremendous power of MapReduce to join over them and compute what changed in the database during a day.

Compute

Page 11: Data compression for Data Lakes

Snapshot analysis

Input / Output:

∆(snapshot A, snapshot B) =

Inputs:

snapshot A        snapshot B
id  name          id  name
1   Jon           1   John
2   Bob           3   James
3   James         4   Jim

Output:

id  name   dm
1   John   01
1   Jon    10
3   James  11
4   Jim    01
2   Bob    10

https://github.com/viadeo/viadeo-avro-utils

The ∆ function takes 2 (or more) snapshots and outputs a merged view of those snapshots with an extra column, 'dm'.

Compute

Page 12: Data compression for Data Lakes

Snapshot analysis

Input / Output (generalised):

∆(snapshot 1, snapshot 2, snapshot 3, snapshot 4) =

Inputs (generalised):

snapshot 1    snapshot 2    snapshot 3    snapshot 4
id            id            id            id
1             1             1             1
2             3             4             3
3             4             5             5

Output:

id  dm
1   1111
2   1000
3   1101
4   0110
5   0011

https://github.com/viadeo/viadeo-avro-utils

The ∆ function takes 2 (or more) snapshots and outputs a merged view of those snapshots with an extra column, 'dm': one bit per input snapshot, set to 1 when that exact row is present in it.

Compute
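A minimal in-memory sketch of the ∆ logic (the real implementation in viadeo-avro-utils runs as an Avro-aware Hadoop job over HDFS files; the function below and its row format are illustrative assumptions):

def delta(*snapshots):
    """Merge N snapshots into one deduplicated view where each distinct row
    carries a 'dm' bitstring: bit i is '1' when that exact row appears in
    snapshot i. Each snapshot is a list of dict rows, e.g. {"id": 1, "name": "Jon"}."""
    n = len(snapshots)
    presence = {}  # row (as a sorted tuple of items) -> list of presence bits
    for i, snapshot in enumerate(snapshots):
        for row in snapshot:
            key = tuple(sorted(row.items()))
            presence.setdefault(key, ["0"] * n)[i] = "1"
    return [dict(key, dm="".join(bits)) for key, bits in presence.items()]

# The two-snapshot example from the previous page:
day_before = [{"id": 1, "name": "Jon"},  {"id": 2, "name": "Bob"},   {"id": 3, "name": "James"}]
day_after  = [{"id": 1, "name": "John"}, {"id": 3, "name": "James"}, {"id": 4, "name": "Jim"}]

for row in delta(day_before, day_after):
    print(row)
# {'id': 1, 'name': 'Jon', 'dm': '10'}    -> row only in the older snapshot
# {'id': 2, 'name': 'Bob', 'dm': '10'}    -> gone from the newer snapshot
# {'id': 3, 'name': 'James', 'dm': '11'}  -> unchanged, stored once instead of twice
# {'id': 1, 'name': 'John', 'dm': '01'}   -> row only in the newer snapshot
# {'id': 4, 'name': 'Jim', 'dm': '01'}    -> newly inserted

Because unchanged rows are stored once with a short bitstring instead of once per snapshot, merging many near-identical snapshots ends up close to the size of a single one, which is exactly the ratio studied on the next slides.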

Page 13: Data compression for Data Lakes

Snapshot efficiency

Analysing the compression: merging 50 snapshots

Ratio: size of merged output / size of largest input. Less is better; ≤ 1 is perfect.

Compute
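As a tiny illustration of the metric (the sizes below are made-up placeholders, not measurements from the deck):

def compression_ratio(merged_size, snapshot_sizes):
    """Size of the merged output relative to the largest single input snapshot.
    Less is better; <= 1.0 means 50 snapshots fit in the footprint of the biggest one."""
    return merged_size / max(snapshot_sizes)

# Placeholder sizes in bytes: fifty ~2 GB snapshots merged into a 2.4 GB file.
print(compression_ratio(2_400_000_000, [2_000_000_000] * 50))  # 1.2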

Page 14: Data compression for Data Lakes

Snapshot efficiency

[Chart: compression ratio per dataset; most datasets are clear wins ("WINS !!"), some are "OK, just x2", and small datasets get a bit of overhead]

Analysing the compression: merging 50 snapshots

Compute

Ratio: size of merged output / size of largest input. Less is better; ≤ 1 is perfect.

Page 15: Data compression for Data Lakes

• So now the offloading might cost just $36 of storage for 6 months ($6 per month; at $0.03/GB that is roughly 200 GB of merged data instead of 18 TB of raw snapshots).

• Welcome back to small data!

• But what does this mean for production data? Did you really need to store it in a mutable database in the first place?

Conclusion

Page 16: Data compression for Data Lakes

Thank you. Questions?