akass@ + dmi@€¦ · - naturally worked well with parallel data-stores on hdfs - also improved...
TRANSCRIPT
![Page 1: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/1.jpg)
akass@ + dmi@
![Page 2: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/2.jpg)
digitalocean.com
Building a Robust Analytics Platformwith an open-source stack
![Page 3: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/3.jpg)
digitalocean.com
What’s coming up:1) DigitalOcean - a company background
2) Data @ DigitalOcean
3) The Big Data Tech Stack @ DO
4) Use-cases + Demo
![Page 4: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/4.jpg)
digitalocean.com
What is DigitalOcean?
![Page 5: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/5.jpg)
digitalocean.com
What is DigitalOcean?A Cloud Hosting Company for Software Developers.
![Page 6: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/6.jpg)
digitalocean.com
What is DigitalOcean?A Cloud Hosting Company for Software Developers.
- 4 years old
![Page 7: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/7.jpg)
digitalocean.com
What is DigitalOcean?A Cloud Hosting Company for Software Developers.
- 4 years old- 12 Data Centers globally
![Page 8: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/8.jpg)
digitalocean.com
What is DigitalOcean?A Cloud Hosting Company for Software Developers.
- 4 years old- 12 Data Centers globally- Over 30 Million VMs created
![Page 9: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/9.jpg)
digitalocean.com
What is DigitalOcean?A Cloud Hosting Company for Software Developers.
- 4 years old- 12 Data Centers globally- Over 30 Million VMs created- In 196 Countries
![Page 10: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/10.jpg)
digitalocean.com
What is DigitalOcean?A Cloud Hosting Company for Software Developers.
- 4 years old- 12 Data Centers globally- Over 30 Million VMs created- In 196 Countries- 700K Developer Users
![Page 11: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/11.jpg)
digitalocean.com
What is DigitalOcean?A Cloud Hosting Company for Software Developers.
- 4 years old- 12 Data Centers globally- Over 30 Million VMs created- In 196 Countries- 700K Developer Users- 30K Developer Teams
![Page 12: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/12.jpg)
digitalocean.com
- Data & Analytics (“DnA” - Analysts + Data Scientists/Engineering)
- (Alex & Dao live here)
- Platform/Infrastructure Engineering
- Product Engineering
- Security
Data @ DO - Participating Teams
![Page 13: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/13.jpg)
digitalocean.com
Data @ DO
- Product Usage
![Page 14: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/14.jpg)
- Product Revenue Forecasting- Churn Prediction- User Segmentation- Beta product uses and product cannibalization- Support Team Efficacy
Data @ DO - Product Usage
digitalocean.com
![Page 15: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/15.jpg)
digitalocean.com
Data @ DO - Product Usage, v1.0
Pain Point 1: General Data Architecture
Huge, unwieldy SQL Tables → Slooooow, monolithic, unadaptive
Pain Point 2: Upstream Dependencies
E.g. Monthly Invoicing → Revenue analytics done on monthly basis
![Page 16: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/16.jpg)
digitalocean.com
Data @ DO - Product Usage, v2.0
Revision 1: General Data Architecture
Microservices + Kafka pass application-level events → Faster and more robust, but teams must build their own consumers.
Revision 2: Active Downstream Consumption
E.g. Granular Billable Events → Daily, hourly, even near-RT processing of revenue for ingestion into analysis
![Page 17: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/17.jpg)
digitalocean.com
Data @ DO
- Product Usage
- Sales/Marketing Leads
![Page 18: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/18.jpg)
Problem:
- User behavioral data lives in AWS Redshift
- User metadata lives in MySQL on DO’s cloud
Data @ DO - Sales/Marketing Leads
digitalocean.com
![Page 19: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/19.jpg)
Problem:
- User behavioral data lives in AWS Redshift
- User metadata lives in MySQL on DO’s cloud
Solution:
- Migrate warehousing to our own cloud so that all data stays on-premise
Data @ DO - Sales/Marketing Leads
digitalocean.com
![Page 20: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/20.jpg)
digitalocean.com
Data @ DO
- Product Usage
- Sales/Marketing Leads
- Infrastructure
![Page 21: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/21.jpg)
Problem - nay - Conundrum:
- Every 5 minutes, our entire active VM fleet is polled for OS and HW data using Prometheus and other in-house scraping solutions
Data @ DO - Infrastructure
digitalocean.com
![Page 22: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/22.jpg)
Problem - nay - Conundrum:
- Every 5 minutes, our entire active VM fleet is polled for OS and HW data using Prometheus and other in-house scraping solutions
Data @ DO - Infrastructure
digitalocean.com
![Page 23: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/23.jpg)
Problem - nay - Conundrum:
- Every 5 minutes, our entire active VM fleet is polled for OS and HW data using Prometheus and other in-house scraping solutions
Data @ DO - Infrastructure
digitalocean.com
![Page 24: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/24.jpg)
Problem - nay - Conundrum:
- Every 5 minutes, our entire active VM fleet is polled for OS and HW data using Prometheus and other in-house scraping solutions
Data @ DO - Infrastructure
digitalocean.com
![Page 25: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/25.jpg)
Problem - nay - Conundrum:
- Every 5 minutes, our entire active VM fleet is polled for OS and HW data using Prometheus and other in-house scraping solutions
Data @ DO - Infrastructure
digitalocean.com
![Page 26: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/26.jpg)
Problem - nay - Conundrum:
- Every 5 minutes, our entire active VM fleet is polled for OS and HW data using Prometheus and other in-house scraping solutions
Data @ DO - Infrastructure
digitalocean.com
![Page 27: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/27.jpg)
Problem - nay - Conundrum:
- Every 5 minutes, our entire active VM fleet is polled for OS and HW data using Prometheus and other in-house scraping solutions
- Significant scale (too big for RDBMS), inherent silos
Data @ DO - Infrastructure
digitalocean.com
![Page 28: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/28.jpg)
digitalocean.com
We need to reimagine how we process and store everything.
To recap:- Product Data in MySQL is slow and isolated- Sales/Marketing data are isolated in different warehouses- Infra data are prohibitively large and isolated
![Page 29: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/29.jpg)
digitalocean.com
We need to reimagine how we process and store everything.
![Page 30: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/30.jpg)
digitalocean.com
We need to reimagine how we process and store everything.
![Page 31: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/31.jpg)
digitalocean.com
We need to reimagine how we process and store everything.
![Page 32: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/32.jpg)
digitalocean.com
What is DigitalOcean?A Cloud Hosting Company for Software Developers.
![Page 33: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/33.jpg)
digitalocean.com
What is DigitalOcean?A Cloud Hosting Company = We have cloud infrastructure.
![Page 34: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/34.jpg)
digitalocean.com
What is DigitalOcean?A Cloud Hosting Company = We have cloud infrastructure.
= We can build our own big data environment.
![Page 35: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/35.jpg)
digitalocean.com
What is DigitalOcean?A Cloud Hosting Company = We have cloud infrastructure.
= We can build our own big data environment.
![Page 36: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/36.jpg)
digitalocean.com
The DO Big Data Stack
![Page 37: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/37.jpg)
digitalocean.com
The DO Big Data Stack
![Page 38: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/38.jpg)
digitalocean.com
The DO Big Data Stack
1. Distributed Systems Management
![Page 39: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/39.jpg)
digitalocean.com
The DO Big Data Stack
1. Distributed Systems Management2. Parallel Compute
![Page 40: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/40.jpg)
digitalocean.com
The DO Big Data Stack
1. Distributed Systems Management2. Parallel Compute3. Standardized Compute Environments
![Page 41: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/41.jpg)
digitalocean.com
The DO Big Data Stack
1. Distributed Systems Management2. Parallel Compute3. Standardized Compute Environments4. Distributed Data Warehousing
![Page 42: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/42.jpg)
digitalocean.com
The DO Big Data Stack
1. Distributed Systems Management2. Parallel Compute3. Standardized Compute Environments4. Distributed Data Warehousing5. Distributed Streaming Events System
![Page 43: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/43.jpg)
digitalocean.com
The DO Big Data Stack
1. Distributed Systems Management2. Parallel Compute3. Standardized Compute Environments4. Distributed Data Warehousing5. Distributed Streaming Events System6. Massively Parallel Query Engine
![Page 44: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/44.jpg)
digitalocean.com
The DO Big Data Stack
1. Distributed Systems Management2. Parallel Compute3. Standardized Compute Environments4. Distributed Data Warehousing5. Distributed Streaming Events System6. Massively Parallel Query Engine7. Unifying IDE
![Page 45: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/45.jpg)
digitalocean.com
The DO Big Data Stack
1. Distributed Systems Management2. Parallel Compute3. Standardized Compute Environments4. Distributed Data Warehousing5. Distributed Streaming Events System6. Massively Parallel Query Engine7. Unifying IDE8. Logging at Scale
![Page 46: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/46.jpg)
digitalocean.com
The DO Big Data Stack
1. Distributed Systems Management2. Parallel Compute3. Standardized Compute Environments4. Distributed Data Warehousing5. Distributed Streaming Events System6. Massively Parallel Query Engine7. Unifying IDE8. Logging at Scale
![Page 47: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/47.jpg)
digitalocean.com
“Buzz.”
– Hive
![Page 48: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/48.jpg)
digitalocean.com
“Buzz.”
– Hive
![Page 49: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/49.jpg)
digitalocean.com
Use Case 1: A New ETL Pipeline for Support Ticket Events
Measuring Customer Satisfaction (CSAT)Integration with Internal TicketingMore transparency for Support Team
![Page 50: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/50.jpg)
digitalocean.com
Use Case 1: A New ETL Pipeline for Support Ticket Events
Subscribe to new Kafka topic
![Page 51: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/51.jpg)
digitalocean.com
Use Case 1: A New ETL Pipeline for Support Ticket Events
Subscribe to new Kafka topicPrototype reading of Kafka data within a
Spark-enabled Jupyter Notebook
![Page 52: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/52.jpg)
digitalocean.com
Use Case 1: A New ETL Pipeline for Support Ticket Events
Subscribe to new Kafka topicPrototype reading of Kafka data within a Spark-enabled
Jupyter NotebookUpdate Spark Docker container to include new modules
and configurations, if necessary
$ cat spark-defaults.conf | grep dockerspark.mesos.executor.docker.image docker.internal.digitalocean.com/platform/spark:1929b8d
![Page 53: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/53.jpg)
digitalocean.com
Use Case 1: A New ETL Pipeline for Support Ticket Events
Subscribe to new Kafka topicPrototype reading of Kafka data within a Spark-
enabled Jupyter NotebookUpdate Spark Docker container to include new
modules and configurations, if necessaryWrite Spark script & deploy onto Mesos
![Page 54: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/54.jpg)
digitalocean.com
Use Case 1: A New ETL Pipeline for Support Ticket Events
Subscribe to new Kafka topicPrototype reading of Kafka data within a Spark-
enabled Jupyter NotebookUpdate Spark Docker container to include new
modules and configurations, if necessaryWrite Spark script & deploy onto MesosAnalyze!
![Page 55: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/55.jpg)
digitalocean.com
Use Case 2: Billable Data, from Months to Minutes
Reminder:
Pain Point 2: Upstream Dependencies
E.g. Monthly Invoicing → Revenue analytics done on monthly basis
Revision 2: Active Downstream Consumption
Goal → Increase temporal granularity to daily, hourly, even near-RT processing of revenue for ingestion into analysis
![Page 56: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/56.jpg)
digitalocean.com
Use Case 2: Billable Data, from Months to Minutes
![Page 57: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/57.jpg)
digitalocean.com
Use Case 2: Billable Data, from Months to Minutes
![Page 58: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/58.jpg)
Use Case 3: Power Failure Detection + PDU Clustering
Real-World problem: - many different drive models out in fleet…
digitalocean.com
![Page 59: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/59.jpg)
Real-World problem: - many different drive models out in fleet…
- how to predict performance stability/failures with consistency?
Use Case 3: Power Failure Detection + PDU Clustering
digitalocean.com
![Page 60: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/60.jpg)
Use Case 3: Power Failure Detection + PDU Clustering
Solution:
1) Measure PDU patterns & fluctuations on a podlet/rack/device level
2) Join to smartctl data to get vendor/drive-specific metadata
3) Mine for outliers to help DC teams react quicker to anomalies
digitalocean.com
![Page 61: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/61.jpg)
digitalocean.com
Use Case 3: Power Failure Detection + PDU Clustering
![Page 62: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/62.jpg)
digitalocean.com
Use Cases in Action:
Let’s have a DEMO...
![Page 63: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/63.jpg)
digitalocean.com
Let’s have a DEMO...
Use Cases in Action:
![Page 64: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/64.jpg)
digitalocean.com
Lessons learned from work thus far...
![Page 65: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/65.jpg)
Lessons Learned (a small sample thus far)
- Jupyter was indispensable for prototyping Spark + Kafka interactively
digitalocean.com
![Page 66: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/66.jpg)
Lessons Learned (a small sample thus far)
- Jupyter was indispensable for prototyping Spark + Kafka interactively
- Spark image containerization à consistency across Mesosphere
digitalocean.com
![Page 67: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/67.jpg)
Lessons Learned (a small sample thus far)
- Jupyter was indispensable for prototyping Spark + Kafka interactively
- Spark image containerization à consistency across Mesosphere
- Multiple Docker images à developmental flexibility
digitalocean.com
![Page 68: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/68.jpg)
Lessons Learned (a small sample thus far)
- Jupyter was indispensable for prototyping Spark + Kafka interactively
- Spark image containerization à consistency across Mesosphere
- Multiple Docker images à developmental flexibility
- Presto DB à significantly improved querying efficiency:- Naturally worked well with parallel data-stores on HDFS- Also improved RDBMS data retrieval through parallelization
digitalocean.com
![Page 69: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/69.jpg)
Lessons Learned (a small sample thus far)
- Jupyter was indispensable for prototyping Spark + Kafka interactively
- Spark image containerization à consistency across Mesosphere
- Multiple Docker images à developmental flexibility
- Presto DB à significantly improved querying efficiency:- Naturally worked well with parallel data-stores on HDFS- Also improved RDBMS data retrieval through parallelization
- Laying pipelines to connect Kafka to Spark to HDFS was challenging;
digitalocean.com
![Page 70: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/70.jpg)
Lessons Learned (a small sample thus far)
- Jupyter was indispensable for prototyping Spark + Kafka interactively
- Spark image containerization à consistency across Mesosphere
- Multiple Docker images à developmental flexibility
- Presto DB à significantly improved querying efficiency:- Naturally worked well with parallel data-stores on HDFS- Also improved RDBMS data retrieval through parallelization
- Laying pipelines to connect Kafka to Spark to HDFS was challenging; tuningeverything was often even harder!
digitalocean.com
![Page 71: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/71.jpg)
digitalocean.com
Recapping…
![Page 72: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/72.jpg)
digitalocean.com
1) Consolidation/Centralization of data from across DO
Recapping…
![Page 73: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/73.jpg)
digitalocean.com
1) Consolidation/Centralization of data from across DO
Anyone can build reports/analyses
Recapping…
![Page 74: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/74.jpg)
digitalocean.com
1) Consolidation/Centralization of data from across DO
Anyone can build reports/analyses
2) Faster Reporting, Better Granularity
Recapping…
![Page 75: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/75.jpg)
digitalocean.com
1) Consolidation/Centralization of data from across DO
Anyone can build reports/analyses
2) Faster Reporting, Better Granularity
3) Tight coupling with the other engineering groups
Recapping…
![Page 76: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/76.jpg)
digitalocean.com
1) Consolidation/Centralization of data from across DO
Anyone can build reports/analyses
2) Faster Reporting, Better Granularity
3) Tight coupling with the other engineering groups
4) Modern stack allows massive scale with small headcount
Recapping…
![Page 77: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/77.jpg)
digitalocean.com
Gazing into the Future...
![Page 78: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/78.jpg)
Future Work- Hardware failure detection forecasting to help the DC team perform predictable
maintenance.
digitalocean.com
![Page 79: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/79.jpg)
Future Work- Hardware failure detection forecasting to help the DC team perform predictable
maintenance.
digitalocean.com
![Page 80: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/80.jpg)
Future Work- Hardware failure detection forecasting to help the DC team perform predictable
maintenance.
- Better inform hardware acquisition costs with detailed machine performance metrics.
digitalocean.com
![Page 81: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/81.jpg)
Future Work- Hardware failure detection forecasting to help the DC team perform predictable
maintenance.
- Better inform hardware acquisition costs with detailed machine performance metrics.
- Profile hypervisors by usage to optimize how VMs are allocated to particular racks.
digitalocean.com
![Page 82: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/82.jpg)
Future Work- Hardware failure detection forecasting to help the DC team perform predictable
maintenance.
- Better inform hardware acquisition costs with detailed machine performance metrics.
- Profile hypervisors by usage to optimize how VMs are allocated to particular racks.
- Beta-test/prototype/dogfood future products.
digitalocean.com
![Page 83: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/83.jpg)
digitalocean.com
Questions?
![Page 84: akass@ + dmi@€¦ · - Naturally worked well with parallel data-stores on HDFS - Also improved RDBMS data retrieval through parallelization - Laying pipelines to connect Kafka to](https://reader034.vdocuments.mx/reader034/viewer/2022042220/5ec65200a610db41e04efb80/html5/thumbnails/84.jpg)
Thank you!