Technologies for Data Analytics Platform
YAPC::Asia Tokyo 2015 - Aug 22, 2015
Who are you?
• Masahiro Nakagawa
  • github: @repeatedly
• Treasure Data Inc.
  • Fluentd / td-agent developer
  • https://jobs.lever.co/treasure-data
• I love OSS :)
  • D Language, MessagePack, the organizer of several meetups, etc…
Why do we analyze data?
• Reporting
• Monitoring
• Exploratory data analysis
• Confirmatory data analysis
• etc…
Need data, data, data!
That means we need a data analysis platform tailored to our own requirements.
Data Analytics Flow

Data source → Collect → Store → Process → Visualize → Reporting / Monitoring
Let’s launch a platform!
• Easy to use and maintain
• Single server
• RDBMS is popular and has a huge ecosystem

ETL (Extract + Transform + Load) → RDBMS → Query
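The ETL-into-RDBMS flow above can be sketched in a few lines. This is a minimal illustration, not a production loader; the table and field names are made up, and the in-memory SQLite database stands in for the single-server RDBMS.

```python
import sqlite3

# Extract: raw log lines (stand-in for a real data source)
raw_logs = [
    "2015-12-01 10:02:36,200,GET",
    "2015-12-01 10:22:09,404,GET",
]

# Transform: parse each line into structured fields
rows = [tuple(line.split(",")) for line in raw_logs]

# Load: insert into the RDBMS (in-memory SQLite here)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE access_log (time TEXT, code INTEGER, method TEXT)")
conn.executemany("INSERT INTO access_log VALUES (?, ?, ?)", rows)

# Query: the analytics step runs against the loaded table
count_200 = conn.execute(
    "SELECT COUNT(*) FROM access_log WHERE code = 200").fetchone()[0]
```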
Oops! An RDBMS is not good for data analytics on large data volumes. We need more speed and scalability!
Let’s consider Parallel RDBMS instead!
Parallel RDBMS

• Optimized for OLAP workloads
• Columnar storage, shared nothing, etc…
• Netezza, Teradata, Vertica, Greenplum, etc…

(A Leader Node accepts the query and distributes work across multiple Compute Nodes)
Columnar Storage

• Good data format for analytics workloads
• Reads only the selected columns, efficient compression
• Not good for insert / update

| time                | code | method |
|---------------------|------|--------|
| 2015-12-01 10:02:36 | 200  | GET    |
| 2015-12-01 10:22:09 | 404  | GET    |
| 2015-12-01 10:36:45 | 200  | GET    |
| 2015-12-01 10:49:21 | 200  | POST   |
| …                   | …    | …      |

Row storage keeps each record as one unit; columnar storage keeps each column as one unit.
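The row-vs-columnar difference can be sketched with plain Python lists. The layouts below mirror the sample table; the point is that a query touching only one column reads just that column's data in the columnar layout.

```python
# Row layout: one tuple per record, as in the table above
row_store = [
    ("2015-12-01 10:02:36", 200, "GET"),
    ("2015-12-01 10:22:09", 404, "GET"),
    ("2015-12-01 10:36:45", 200, "GET"),
    ("2015-12-01 10:49:21", 200, "POST"),
]

# Columnar layout: one list per column; a query that only needs
# "code" reads just that list and skips time/method entirely
column_store = {
    "time": [r[0] for r in row_store],
    "code": [r[1] for r in row_store],
    "method": [r[2] for r in row_store],
}

# SELECT COUNT(*) WHERE code = 200 touches only the "code" column
count_200 = sum(1 for c in column_store["code"] if c == 200)
```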
Okay, the query is now processed normally.
No silver bullet
• Performance depends on data modeling and queries
• distkey and sortkey are important
  • they reduce data transfer and IO cost
  • queries should take advantage of these keys
• There are some problems
  • Cluster scaling, metadata management, etc…
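Why the distkey matters can be sketched as follows: rows are routed to compute nodes by hashing the distkey, so rows sharing a key land on the same node and a GROUP BY or JOIN on that key needs no inter-node transfer. The three-node setup and the crc32 hash are illustrative stand-ins for a real engine's internals.

```python
import zlib

NUM_NODES = 3

def node_for(distkey_value):
    # Route a row to a compute node by hashing its distkey;
    # zlib.crc32 stands in for the engine's internal hash function
    return zlib.crc32(str(distkey_value).encode()) % NUM_NODES

rows = [("user1", 10), ("user2", 20), ("user1", 30)]

nodes = [[] for _ in range(NUM_NODES)]
for key, value in rows:
    nodes[node_for(key)].append((key, value))

# All "user1" rows sit on one node, so GROUP BY on the distkey
# aggregates locally with no data transfer between nodes
user1_rows = [r for r in nodes[node_for("user1")] if r[0] == "user1"]
```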
Performance is good :) But we often want to change the schema for new workloads. Now it’s hard to maintain the schema and its data…
Okay, let’s separate data sources into multiple layers for a reliable platform.
Schema on Write (RDBMS)

• Write data using a schema to improve query performance
• Pros:
  • minimum query overhead
• Cons:
  • need to design schema and workload beforehand
  • data load is an expensive operation
Schema on Read (Hadoop)

• Write data without a schema and map the schema at query time
• Pros:
  • robust against schema and workload changes
  • data load is a cheap operation
• Cons:
  • high overhead at query time
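The trade-off above can be sketched in a few lines: raw JSON log lines are stored as-is (cheap load, tolerant of schema drift), and the schema is applied only when a query runs (per-record parse overhead). The field names are illustrative.

```python
import json

# Schema on Read: store raw JSON lines as-is (cheap load)…
raw = [
    '{"time": "2015-12-01 10:02:36", "code": 200, "method": "GET"}',
    '{"time": "2015-12-01 10:22:09", "code": 404}',  # schema drift is fine
]

# …and apply the schema only at query time (parse overhead per record)
records = [json.loads(line) for line in raw]
errors = [r for r in records if r.get("code", 0) >= 400]
```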
Data Lake

• Schema management is hard
  • Volume is increasing and formats change often
  • There are lots of log types
• A feasible approach is storing raw data and converting it before analysis
• A Data Lake is a single storage for any logs
  • Note that there is no clear definition for now
Data Lake Patterns

• Use a DFS, e.g. HDFS, for log storage
  • ETL or data processing by the Hadoop ecosystem
  • Can convert logs via ingestion tools beforehand
• Use Data Lake storage and related tools
  • These storages support the Hadoop ecosystem
Apache Hadoop

• Distributed computing framework
• First implementation based on Google MapReduce

http://hortonworks.com/hadoop-tutorial/introducing-apache-hadoop-developers/
MapReduce
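The MapReduce model can be sketched as the classic word count: map emits key/value pairs, a shuffle groups them by key, and reduce aggregates each group. This is a single-process illustration of the programming model, not of Hadoop's distributed runtime.

```python
from collections import defaultdict

def map_phase(line):
    # map: emit (word, 1) for every word in the input line
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # shuffle: group all values by their key
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # reduce: aggregate the values of each key
    return {key: sum(values) for key, values in grouped.items()}

lines = ["hello world", "hello hadoop"]
pairs = [p for line in lines for p in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
```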
Cool! Data loading becomes robust!

(EL: extract and load raw data into the lake first; T: transform it into transformed data afterwards)
Apache Tez

• Low-level framework for YARN applications
  • Hive, Pig, new query engines and more
• Task and DAG based processing flow
  • A Task consists of Input, Processor and Output; tasks are connected into a DAG
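Running a DAG of tasks means executing each task only after its dependencies finish, i.e. a topological walk. A minimal sketch, with task names invented to echo the query stages discussed next:

```python
# Each task lists the tasks it depends on; running the DAG is a
# topological walk over this mapping (task names are illustrative)
dag = {
    "group_a": [],
    "group_b": [],
    "join": ["group_a", "group_b"],
    "order_by": ["join"],
}

def run_dag(dag):
    done, order = set(), []
    while len(done) < len(dag):
        for task, deps in dag.items():
            if task not in done and all(d in done for d in deps):
                order.append(task)  # "run" the task
                done.add(task)
    return order

order = run_dag(dag)
```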
MapReduce vs Tez

MapReduce runs each stage as a separate map/reduce job that writes its intermediate results to HDFS; Tez runs the whole query as one DAG of tasks without intermediate HDFS writes.

SELECT g1.x, g1.avg, g2.cnt
FROM (SELECT a.x, AVG(a.y) AS avg FROM a GROUP BY a.x) g1
JOIN (SELECT b.x, COUNT(b.y) AS cnt FROM b GROUP BY b.x) g2
ON (g1.x = g2.x) ORDER BY avg;

In MapReduce this query becomes separate jobs (GROUP BY a.x, GROUP BY b.x, JOIN (a, b), ORDER BY); in Tez it is a single DAG.

http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/9
Superstition

• HDFS and YARN have a SPOF
  • Recent versions don’t have a SPOF, in both MapReduce 1 and MapReduce 2
• Can’t build Hadoop from scratch
  • Really? Treasure Data builds Hadoop on CircleCI. Cloudera, Hortonworks and MapR do too.
  • They also check the dependent toolchain.
Which Hadoop package should we use?

• A distribution from a Hadoop distributor is better
  • CDH by Cloudera
  • HDP by Hortonworks
  • MapR distribution by MapR
• If you are familiar with Hadoop and its ecosystem, the Apache community edition becomes an option.
  • For example, Treasure Data has patches and wants to use the patched version.
Good :) In addition, we want to collect data in an efficient way!
Ingestion tools

• There are two execution models!
• Bulk load:
  • For high throughput
  • Most tools transfer data in batch and in parallel
• Streaming load:
  • For low latency
  • Most tools transfer data in micro-batches
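The micro-batch pattern behind streaming loaders can be sketched as a buffer that flushes downstream once it reaches a threshold, trading one bulk write per batch for one write per event. The class and its names are illustrative, not any specific tool's API.

```python
# Micro-batching loader sketch: events are buffered and flushed to
# the backend once the buffer reaches a threshold
class MicroBatchLoader:
    def __init__(self, flush_size=3):
        self.flush_size = flush_size
        self.buffer = []
        self.flushed = []  # stand-in for the downstream store

    def emit(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.flush_size:
            self.flush()

    def flush(self):
        # one bulk write per micro-batch instead of one per event
        self.flushed.append(list(self.buffer))
        self.buffer.clear()

loader = MicroBatchLoader(flush_size=3)
for i in range(7):
    loader.emit({"seq": i})
loader.flush()  # flush the remaining tail on shutdown
```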
Bulk load tools

• Embulk
  • Pluggable bulk data loader for various inputs and outputs
  • Write plugins using Java and JRuby
• Sqoop
  • Data transfer between Hadoop and RDBMS
  • Included in some distributions
• Or a dedicated bulk loader for each data store
Streaming load tools

• Fluentd
  • Pluggable and JSON based streaming collector
  • Lots of plugins on rubygems
• Flume
  • Mainly for the Hadoop ecosystem: HDFS, HBase, …
  • Included in some distributions
• Or Logstash, Heka, Splunk, etc…
Data ingestion also becomes robust and efficient!
It works! But… we want to issue ad-hoc queries over the entire data set. We can’t wait for data to be loaded into the database.
You can use an MPP query engine on top of your data stores.
MPP query engine

• It doesn’t have its own storage, unlike a parallel RDBMS
• Follows the “Schema on Read” approach
  • data distribution depends on the backend
  • data schema also depends on the backend
• Some products are called “SQL on Hadoop”
  • Presto, Impala, Apache Drill, etc…
  • Each has its own execution engine and doesn’t use MapReduce.
Presto

• Distributed query engine for interactive queries against various data sources and large data.
• Pluggable connectors for joining multiple backends
  • You can join MySQL and HDFS data in one query
• Lots of useful functions for data analytics
  • window functions, approximate queries, machine learning, etc…
(Diagram: a batch analysis platform — HDFS + Hive running daily/hourly batch — loads results into a separate visualization platform — PostgreSQL etc. — which commercial BI tools and dashboards query interactively.)

Problems with this two-platform setup:
✓ Less scalable
✓ Extra cost
✓ More work to manage 2 platforms
✓ Can’t query against “live” data directly
(Diagram: adding Presto lets the dashboard run interactive queries directly against the HDFS data, while Hive keeps running the daily/hourly batch — with or without the separate PostgreSQL layer.)
(Diagram: Presto provides SQL on any data sets — HDFS/Hive, Cassandra, MySQL, commercial DBs — serving dashboards and commercial BI tools such as IBM Cognos and Tableau, while Hive continues to run daily/hourly batch: one data analysis platform.)
(Presto architecture: a Client sends queries to the Coordinator; the Coordinator uses the Discovery Service to find Workers and schedules tasks on them; Workers access Storage / Metadata through Connector Plugins.)
Execution Model

MapReduce:
✓ Writes intermediate data to disk between stages
✓ Waits between stages

Presto:
✓ All stages are pipelined: no wait time, but no fault tolerance
✓ Memory-to-memory data transfer: no disk IO, but each data chunk must fit in memory
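The staged-vs-pipelined contrast above maps neatly onto lists versus generators: a list comprehension materializes each stage's full output before the next starts, while a generator streams rows through all stages with nothing waiting or materialized.

```python
# Staged (MapReduce-style): each stage materializes its full output
# before the next stage starts
def staged(data):
    mapped = [x * 2 for x in data]        # stage 1 finishes first…
    return [x for x in mapped if x > 4]   # …then stage 2 runs

# Pipelined (Presto-style): generators stream each row through all
# stages, so no stage waits for the previous one to complete
def pipelined(data):
    mapped = (x * 2 for x in data)
    return (x for x in mapped if x > 4)

data = [1, 2, 3, 4]
result_staged = staged(data)
result_pipelined = list(pipelined(data))
```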
Okay, we now have a combination of low latency and batch.
That resolved our concern! But… we also need quick estimation.
Currently, there are several stream processing frameworks available. Let’s try them!!
Apache Storm

• Distributed realtime processing framework
• Low latency: one tuple at a time
  • Trident mode uses micro-batches

https://storm.apache.org/
Norikra

• Schema-less CEP engine for stream processing
• Uses an SQL-like language (Esper EPL)
• Not distributed, unlike Storm, for now
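The core of a CEP-style streaming query is windowed aggregation: each incoming event updates a result computed over a bounded window of recent events. A minimal sketch, with the count-based window and the error-counting query invented for illustration:

```python
from collections import deque

# Windowed stream aggregation sketch: keep the last N events and
# emit a rolling aggregate on every incoming event
class CountWindow:
    def __init__(self, size):
        self.events = deque(maxlen=size)  # old events fall out automatically

    def push(self, event):
        self.events.append(event)
        # emit the aggregate over the current window:
        # here, how many events in the window are HTTP errors
        return sum(e["code"] >= 400 for e in self.events)

window = CountWindow(size=3)
results = [window.push({"code": c}) for c in [200, 404, 200, 500, 200]]
```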
Great! We can get insight in both streaming and batch ways :)
One more thing. We can make data transfer more reliable across multiple data streams with a distributed queue.
Apache Kafka

• Distributed messaging system
• Producer - Broker - Consumer pattern
• Pull model, replication, etc…

(Producers push messages to the broker; consumers pull them from it)
Push vs Pull

• Push:
  • Easy to transfer data to multiple destinations
  • Hard to control stream ratios across multiple streams
• Pull:
  • Easy to control the stream ratio
  • Must manage consumers correctly
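The pull side of the comparison can be sketched with a plain in-process queue standing in for the broker: the broker just holds messages, and each consumer pulls at its own pace, which is what makes the consumption rate easy to control. Names and structure are illustrative, not Kafka's actual API.

```python
import queue

# The broker holds messages; producers push to it
broker = queue.Queue()
for i in range(5):
    broker.put({"offset": i})

def consume(broker, max_messages):
    # The consumer pulls only when it is ready, at its own rate
    consumed = []
    while len(consumed) < max_messages and not broker.empty():
        consumed.append(broker.get())
    return consumed

first_batch = consume(broker, max_messages=2)  # a slow consumer…
rest = consume(broker, max_messages=10)        # …leaves the rest queued
```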
This is a modern analytics platform
Seems complex and hard to maintain? Let’s use useful services!
Amazon Redshift

• Parallel RDBMS on AWS
  • Re-use traditional parallel RDBMS know-how
  • Scaling is easier than in traditional systems
• Combining it with Amazon EMR is popular:
  1. Store data into S3
  2. EMR processes the S3 data
  3. Load processed data into Redshift
• EMR provides the Hadoop ecosystem
Using AWS Services
Google BigQuery

• Distributed query engine and scalable storage
  • Tree model, columnar storage, etc…
  • Separates storage from workers
• High performance queries on Google infrastructure
  • Lots of workers
  • Storage / IO layer on Colossus
• Can’t manage parallel RDBMS properties like distkey, but it works in most cases.
BigQuery architecture
Using GCP Services
Treasure Data

• Cloud based end-to-end data analytics service
  • Hive, Presto, Pig and Hivemall on one big repository
  • Lots of ingestion and output options, scheduling, etc…
  • No stream processing for now
• The service concept is a Data Lake
  • JSON based schema-less storage
• The execution model is similar to BigQuery
  • Separates storage from workers
  • Can’t specify parallel RDBMS properties
Using Treasure Data Service
Resource Model Trade-off

| Model | Pros | Cons |
|---|---|---|
| Fully guaranteed | Stable execution, easy to control resources | No boost mechanism |
| Guaranteed with multi-tenancy | Stable execution, good scalability | Less controllable resources |
| Fully multi-tenanted | Boosted performance, great scalability | Unstable execution |
MS Azure also has useful services: DataHub, SQL DWH, DataLake, Stream Analytics, HDInsight…
Use a service or build a platform?

• Should consider using a service first
  • AWS, GCP, MS Azure, Treasure Data, etc…
  • The important factor is data analytics, not the platform
  • Do you have enough resources to maintain it?
• If a specific analytics platform is a differentiator, building a platform is better
  • Use state-of-the-art technologies
  • Hard to implement on existing platforms
Conclusion

• There are many software products and services for data analytics
  • Lots of trade-offs: performance, complexity, connectivity, execution model, etc.
• SQL is the primary language for data analytics
• Focus on your goal!
  • Is the data analytics platform your business core? If not, consider using services first.
Cloud service for entire data pipeline!
Appendix
Apache Spark

• Another distributed computing framework
  • Mainly for in-memory computing with DAGs
  • RDD and DataFrame based clean APIs
• Combination with Hadoop is popular

http://slidedeck.io/jmarin/scala-talk
Apache Flink

• Streaming based execution engine
• Supports batch and pipelined processing
  • Hadoop and Spark are batch based

https://ci.apache.org/projects/flink/flink-docs-master/
Batch vs Pipelined

Batch (staged):
✓ Tasks run stage by stage (stage1 → stage2 → stage3)
✓ Write data to disk and wait between stages

Pipelined:
✓ All stages are pipelined: no wait time
✓ Memory-to-memory data transfer, using disk if needed
✓ Fault tolerance with checkpointing
Visualization

• Tableau
  • Popular BI tool in many areas
  • Awesome GUI, easy to use, lots of charts, etc.
• Metric Insights
  • Dashboard for many metrics
  • Scheduled queries, custom handlers, etc.
• Chartio
  • Cloud based BI tool
How do we manage job dependencies? We want to issue Job X after Job A and Job B have finished.
Data pipeline tools

• There are some important features:
  • Manage job dependencies
  • Handle job failures and retries
  • Easy to define a topology
  • Separate tasks into sub-tasks
• Apache Oozie, Apache Falcon, Luigi, Airflow, JP1, etc…
Luigi

• Python module for building job pipelines
  • Write Python code and run it.
  • A task is defined as a Python class
  • Easy to manage with a VCS
• Needs some extra tools
  • scheduled jobs, job history, etc…

class T1(luigi.Task):
    def requires(self):
        pass  # dependencies

    def output(self):
        pass  # store result

    def run(self):
        pass  # task body
Airflow

• Python and DAG based workflow
  • Write Python code, but it is for defining a DAG
  • A task is defined by an Operator
• There are good features
  • Management web UI
  • Task information is stored in a database
  • Celery based distributed execution

dag = DAG('example')
t1 = Operator(..., dag=dag)
t2 = Operator(..., dag=dag)
t2.set_upstream(t1)