Real-Time Data Processing Pipeline & Visualization with Docker, Spark, Kafka and Cassandra
Real-Time Data Processing Pipeline & Visualization with Docker, Spark, Kafka
and Cassandra
Roberto G. Hashioka – 2016-10-04 – TIAD – Paris
Personal Information
• Roberto Gandolfo Hashioka
• @rogaha (GitHub) and @rhashioka (Twitter)
• Finance -> Software Engineer
• Growth & Data Engineer at Docker
Summary
• Background / Motivation
• Project Goals
• How to build it?
• DEMO
Background
• Gather data from multiple sources and process it in "real time"
• Transform raw data into meaningful, useful information that enables more effective decision-making
• Provide more visibility into trends in: 1) user behavior 2) feature engagement 3) opportunities for future investment
• Data transparency and standardization
Project Goals
• Create a data processing pipeline that can handle a large volume of events per second
• Automate the development environment (Docker Compose)
• Automate remote machine management (Docker for AWS / Docker Machine)
• Reduce time to market / time to development (new hires / new features)
Project / Language Stack
How to build it?
• Step 1: Install Docker for Mac/Win and dockerize all the applications
link: https://www.docker.com/products/docker
Dockerfile example:

```dockerfile
FROM ubuntu:14.04
MAINTAINER Roberto Hashioka ([email protected])
RUN apt-get update && apt-get install -y nginx
RUN echo "Hello World! #TIAD" > /usr/share/nginx/html/index.html
EXPOSE 80
```

Build and run:

```shell
$ docker build -t rogaha/web_demotiad2016 .
$ docker run -d -p 80:80 --name web_demotiad2016 rogaha/web_demotiad2016
```
How to build it?
• Step 2: Define your services stack with a Docker Compose file
Docker Compose
```yaml
web:
  build: .
  command: python app.py
  ports:
    - "5000:5000"
  volumes:
    - .:/code
  links:
    - redis
  environment:
    - PYTHONUNBUFFERED=1
redis:
  image: redis:latest
  command: redis-server --appendonly yes
```
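Beyond the web/redis example above, a Compose file for this talk's full stack could be sketched as follows. This is a hedged illustration only: the image names, ports, and links are assumptions, not the project's actual file (the real one lives in the GitHub repository shown in the demo).

```yaml
zookeeper:
  image: zookeeper:latest        # illustrative image name
  ports:
    - "2181:2181"
kafka:
  image: kafka:latest            # illustrative image name
  links:
    - zookeeper
  ports:
    - "9092:9092"
cassandra:
  image: cassandra:latest        # illustrative image name
  ports:
    - "9042:9042"
spark:
  image: spark:latest            # illustrative image name
  links:
    - kafka
    - cassandra
```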
How to build it?
• Step 3: Test the applications locally from your laptop using containers
How to build it?
• Step 4: Provision your remote servers and deploy your containers
How to build it?
• Step 5: Scale your services with Docker swarm
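One way to sketch Step 5: declare a replica count in a Compose v3 `deploy` section (a later Compose feature; the service name and replica count here are illustrative), or run `docker service scale <service>=3` against a swarm manager.

```yaml
version: "3"
services:
  web:
    image: rogaha/web_demotiad2016   # image built in the Dockerfile example
    deploy:
      replicas: 3                    # illustrative replica count
```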
DEMO
source code: https://github.com/rogaha/data-processing-pipeline
Open Source Projects Used
• Docker (https://github.com/docker/docker)
• An open platform for distributed applications for developers and sysadmins
• Apache Spark / Spark SQL (https://github.com/apache/spark)
• A fast, in-memory data processing engine. Spark SQL lets you query structured data as a resilient distributed dataset (RDD)
• Apache Kafka (https://github.com/apache/kafka)
• A fast and scalable pub-sub messaging service
• Apache Zookeeper (https://github.com/apache/zookeeper)
• A distributed configuration service, synchronization service, and naming registry for large distributed systems
• Apache Cassandra (https://github.com/apache/cassandra)
• A scalable, highly available, distributed columnar NoSQL database
• D3 (https://github.com/mbostock/d3)
• A JavaScript visualization library for HTML and SVG.
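The stack above can be illustrated with a small, self-contained Python sketch of the pipeline's stages. This is a conceptual stand-in only: plain in-memory queues and dicts take the place of Kafka topics, Spark jobs, and Cassandra tables, and none of the names below come from those projects' real APIs.

```python
import json
import queue

# Stand-in for a Kafka topic: an in-memory queue of JSON-encoded events.
topic = queue.Queue()

def produce(events):
    """Publish raw events to the topic (Kafka producer stand-in)."""
    for event in events:
        topic.put(json.dumps(event))

def process():
    """Drain and aggregate the topic (Spark streaming stand-in):
    count events per user, like a feature-engagement metric."""
    counts = {}
    while not topic.empty():
        event = json.loads(topic.get())
        counts[event["user"]] = counts.get(event["user"], 0) + 1
    return counts

produce([{"user": "a"}, {"user": "b"}, {"user": "a"}])
store = process()  # Cassandra-write stand-in: keep the aggregate in a dict
print(store)       # → {'a': 2, 'b': 1}
```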
Thanks!
Questions?
@rhashioka