Building End to End Streaming Application on Spark


Upload: datamantra

Posted on 16-Apr-2017

TRANSCRIPT

Page 1: Building end to end streaming application on Spark

Building End to End Streaming Application on Spark

Streaming application development journey

https://github.com/Shasidhar/sensoranalytics

Page 2: Building end to end streaming application on Spark

● Shashidhar E S

● Big data consultant and trainer at datamantra.io

● www.shashidhare.com

Page 3: Building end to end streaming application on Spark

Agenda

● Problem Statement
● Spark Streaming
● Stage 1: File streams
● Stage 2: Kafka as input source (Introduction to Kafka)
● Stage 3: Cassandra as output store (Introduction to Cassandra)
● Stage 4: Flume as data collection engine (Introduction to Flume)
● How to test streaming code?
● Next steps

Page 4: Building end to end streaming application on Spark

Earlier System

Business model
● Providers of Wi-Fi hotspot devices in public spaces
● Ability to collect data from these devices and analyse it

Existing system
● Collect data and process it in daily batches to generate the required results

Page 5: Building end to end streaming application on Spark

Existing System

Servers → Central directory → Splunk → Downstream systems

Page 6: Building end to end streaming application on Spark

Need for a real-time engine
● Lots of failures in user login
● Need to analyse why there is a drop in user logins
● Ability to analyse the data in real time rather than in daily batches
● As the company is growing, Splunk was not scaling, as it is not meant for horizontal scaling

Page 7: Building end to end streaming application on Spark

New system requirements
● Able to collect and process large amounts of data
● Ability to store results in persistent storage
● A reporting mechanism to view the insights obtained from the analysis
● Need to see the results in real time
● In simple terms, we can call it a real-time monitoring system

Page 8: Building end to end streaming application on Spark

Why Spark Streaming?
● Easy to port a batch system to the streaming engine in Spark
● Spark Streaming can handle large amounts of data and it is very fast
● Best choice for near real-time systems
● Future-proof
  ○ Ability to ingest data from many sources
  ○ Good support for downstream stores like NoSQL
  ○ And lots more

Page 9: Building end to end streaming application on Spark

Spark Streaming Architecture

Server → Source directory → Spark Streaming engine → Output directory → View in Zeppelin

Page 10: Building end to end streaming application on Spark

Data format

Log data with the following format:
● Timestamp
● Country
● State
● City
● SensorStatus

Page 11: Building end to end streaming application on Spark

Required Results
● Country wise stats
  ○ Hourly, weekly and monthly view of total count of records captured country wise
● State wise stats
  ○ Hourly, weekly and monthly view of total count of records captured state wise
● City wise stats
  ○ Hourly, weekly and monthly view of total count of records captured city wise, with respect to sensor status

Page 12: Building end to end streaming application on Spark

Data Analytics - Phase 1
● Receive data from servers
● Store the input data into files
● Use files as input and output
● Process the data, generate the required statistics
● Store results into output files

Pipeline: Input files (directory) → Spark Streaming engine → Output files (directory)

Page 13: Building end to end streaming application on Spark

Spark streaming introduction

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams

Page 14: Building end to end streaming application on Spark

Micro batch

● Spark Streaming is a fast batch processing system
● Spark Streaming collects stream data into small batches and runs batch processing on them
● A batch can be as small as 1 second or as big as multiple hours (see the sketch below)
● Spark job creation and execution overhead is so low that it can do all of this under a second
● These batches are called DStreams
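
A minimal sketch of setting up a StreamingContext with a 1-second micro-batch interval; the master URL and application name are placeholders, not values from the talk:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// The batch interval passed here decides how often a micro batch (DStream RDD) is produced
val conf = new SparkConf().setMaster("local[2]").setAppName("SensorAnalytics")
val ssc  = new StreamingContext(conf, Seconds(1))

// ... define input DStreams and transformations here ...

ssc.start()
ssc.awaitTermination()
```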

Page 15: Building end to end streaming application on Spark

Apache Zeppelin
● Web-based notebook that allows interactive data analysis
● It allows
  ○ Data ingestion
  ○ Data discovery
  ○ Data analytics
  ○ Data visualization and collaboration

● Built-in Spark integration

Page 16: Building end to end streaming application on Spark

Data Model
● 4 models (sketched as case classes below)
  ○ SensorRecord - to read input records
  ○ CountryWiseStats - store country wise aggregations
  ○ StateWiseStats - store state wise aggregations
  ○ CityWiseStats - store city wise aggregations
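
A minimal sketch of how these four models could be expressed as Scala case classes; the field names and types are assumptions based on the data-format slide, not the repository's actual code:

```scala
// Input record: one parsed log line
case class SensorRecord(timestamp: Long, country: String, state: String,
                        city: String, sensorStatus: String)

// Aggregated results
case class CountryWiseStats(country: String, count: Long)
case class StateWiseStats(state: String, count: Long)
case class CityWiseStats(city: String, sensorStatus: String, count: Long)
```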

Page 17: Building end to end streaming application on Spark

Phase 1 - Hands On

Git branch : Master
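
A rough sketch of how a Phase 1 style pipeline could be wired up, reusing the StreamingContext ssc and the SensorRecord case class from the earlier sketches; the input/output paths and the comma-separated parsing are illustrative assumptions, not the master branch's actual code:

```scala
// Watch a directory for newly arriving log files
val lines = ssc.textFileStream("/data/sensor/input")

// Parse each line into a SensorRecord (assumes comma-separated fields)
val records = lines.flatMap { line =>
  line.split(",") match {
    case Array(ts, country, state, city, status) =>
      Seq(SensorRecord(ts.toLong, country, state, city, status))
    case _ => Seq.empty
  }
}

// Country wise counts for every micro batch
val countryStats = records.map(r => (r.country, 1L)).reduceByKey(_ + _)

// Write each batch of results to an output directory
countryStats.saveAsTextFiles("/data/sensor/output/countrywise")

ssc.start()
ssc.awaitTermination()
```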

Page 18: Building end to end streaming application on Spark

Problems with Phase 1
● Input and output are files
● Cannot detect new records / new data as and when they are received
● Files cause high latency in the system

Solution: Replace the input file source with Apache Kafka

Page 19: Building end to end streaming application on Spark

Data Analytics - Phase 2
● Receive data from servers
● Store the input data in Kafka
● Use Kafka as input
● Process the data, generate required statistics
● Store results into output files

Pipeline: Kafka → Spark Streaming engine → Output files (directory)

Page 20: Building end to end streaming application on Spark

Apache Kafka
● High-throughput publish-subscribe based messaging system
● Distributed, partitioned and replicated commit log
● Messages are persisted in the system as topics
● Uses ZooKeeper for cluster management
● Written in Scala, but supports many client APIs - Java, Ruby, Python, etc.
● Developed by LinkedIn

Page 21: Building end to end streaming application on Spark

High Level Architecture

Page 22: Building end to end streaming application on Spark

Terminology
● Topics: where messages are maintained and partitioned
● Producers: processes which publish messages to a topic
● Consumers: processes which subscribe to a topic and read messages
● Brokers: every server which is part of the Kafka cluster

Page 23: Building end to end streaming application on Spark

Anatomy of Kafka Topic

Page 24: Building end to end streaming application on Spark

Spark Streaming - Kafka
● Two ways to fetch data from Kafka into Spark
  ○ Receiver approach (see the sketch below)
    ■ Data is stored in receivers
    ■ Kafka topic partitions do not correlate with RDD partitions
    ■ Enable the WAL for zero data loss
    ■ To increase input speed, create multiple receivers
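
A minimal sketch of the receiver-based approach using the Kafka 0.8 spark-streaming-kafka API; the ZooKeeper address, consumer group and topic name are assumptions:

```scala
import org.apache.spark.streaming.kafka.KafkaUtils

// One receiver, one thread, consuming the "sensordata" topic
val kafkaStream = KafkaUtils.createStream(
  ssc,                      // StreamingContext from the earlier sketch
  "localhost:2181",         // ZooKeeper quorum
  "sensor-consumer-group",  // consumer group id
  Map("sensordata" -> 1))   // topic -> number of receiver threads

// Each element is (key, message); we only care about the message payload
val lines = kafkaStream.map(_._2)
```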

Page 25: Building end to end streaming application on Spark

Spark Streaming - Kafka (contd.)
  ○ Receiver-less (direct) approach (see the sketch below)
    ■ No data is stored in receivers
    ■ Exactly the same partitioning is maintained in the Spark RDDs as in the Kafka topics
    ■ No WAL is needed; as the data is already in Kafka, we can fetch older data after a crash
    ■ More Kafka partitions increase the data fetching speed
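
A minimal sketch of the direct (receiver-less) approach with the same Kafka 0.8 API; the broker list and topic name are again assumptions:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")

// One RDD partition is created per Kafka partition of the "sensordata" topic
val directStream = KafkaUtils
  .createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("sensordata"))

// Payload is the second element of the (key, message) tuple
val lines = directStream.map(_._2)
```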

Page 26: Building end to end streaming application on Spark

Phase 2 - Hands On

Git branch : Kafka

Page 27: Building end to end streaming application on Spark

Problems with Phase 2
● Output is still a file
● A full file scan is always needed to retrieve results, no lookups
● Querying results is cumbersome
● A NoSQL database is the better option

Solution: Replace the output file with Cassandra

Page 28: Building end to end streaming application on Spark

Data Analytics - Phase 3
● Receive data from servers
● Store the input data in Kafka
● Use Kafka as input
● Process the data, generate required statistics
● Store results into Cassandra

Pipeline: Kafka → Spark Streaming engine → Cassandra

Page 29: Building end to end streaming application on Spark

What is Cassandra?

“Apache Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tunably consistent, column-oriented database”

“Daughter of Dynamo and Big Table”

Page 30: Building end to end streaming application on Spark

Key Components and Features

● Distributed
● System keyspace
● Peer to peer - no SPOF
● Read and write to any node
● Operational simplicity
● Gossip and failure detection

Page 31: Building end to end streaming application on Spark

Overall Architecture

Components shown: Cassandra daemon, cassandra (CLI), language drivers / JDBC drivers, memtable, SSTables, commit log

Page 32: Building end to end streaming application on Spark

Spark Cassandra Connector (see the sketch below)
● Loads data from Cassandra into Spark and vice versa
● Handles type conversions
● Maps tables to Spark RDDs
● Supports all Cassandra data types, collections and UDTs
● Spark SQL support
● Supports Spark SQL predicate pushdown
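
A minimal sketch using the DataStax spark-cassandra-connector; the keyspace, table and column names are assumptions rather than the repository's schema, and countryStats is the (country, count) DStream from the earlier sketch:

```scala
// Assumes spark.cassandra.connection.host is set on the SparkConf
import com.datastax.spark.connector._            // RDD functions, SomeColumns
import com.datastax.spark.connector.streaming._  // saveToCassandra on DStreams

// Write country wise aggregates from each micro batch into Cassandra
countryStats.saveToCassandra(
  "sensorkeyspace", "countrywisestats", SomeColumns("country", "count"))

// Read the table back into Spark for ad-hoc checks
val stored = ssc.sparkContext.cassandraTable("sensorkeyspace", "countrywisestats")
```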

Page 33: Building end to end streaming application on Spark

Phase 3 - Hands On

Git branch : Cassandra

Page 34: Building end to end streaming application on Spark

Problems with Phase 3
● Servers cannot push directly to Kafka
● Manual intervention is needed to push data
● Need an automated way to push data

Solution: Add Flume as a data collection agent

Page 35: Building end to end streaming application on Spark

Data Analytics - Phase 4
● Receive data from servers
● Stream data into Kafka through Flume
● Store the input data in Kafka
● Use Kafka as input
● Process the data, generate required statistics
● Store results into Cassandra

Pipeline: Flume → Kafka → Spark Streaming engine → Cassandra

Page 36: Building end to end streaming application on Spark

Apache Flume
● Distributed data collection service
● Solution for data collection of all formats
● Initially designed to transfer log data into HDFS frequently and reliably
● It is horizontally scalable
● Configurable routing

Page 37: Building end to end streaming application on Spark

Flume Architecture

Components
  ○ Event
  ○ Source
  ○ Sink
  ○ Channel
  ○ Agent

Page 38: Building end to end streaming application on Spark

Flume Configuration (see the sketch below)
● Define source, sink and channel names
● Configure the source
● Configure the sink
● Configure the channel
● Bind the source and sink to the channel
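
A sketch of what such an agent configuration might look like, pushing server log files into the Kafka topic; the agent/component names, paths, topic and broker address are assumptions, and the Kafka sink properties shown are those of Flume 1.7+:

```
# Name the source, sink and channel
agent.sources  = logsource
agent.channels = memchannel
agent.sinks    = kafkasink

# Source: watch a spool directory for completed log files
agent.sources.logsource.type     = spooldir
agent.sources.logsource.spoolDir = /var/log/sensors

# Channel: in-memory buffer between source and sink
agent.channels.memchannel.type     = memory
agent.channels.memchannel.capacity = 10000

# Sink: publish events to Kafka
agent.sinks.kafkasink.type                    = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.kafkasink.kafka.topic             = sensordata
agent.sinks.kafkasink.kafka.bootstrap.servers = localhost:9092

# Bind source and sink to the channel
agent.sources.logsource.channels = memchannel
agent.sinks.kafkasink.channel    = memchannel
```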

Page 39: Building end to end streaming application on Spark

Phase 4 - Hands On

Git branch : Flume

Page 40: Building end to end streaming application on Spark

Data Analytics - Redesign
● Why do we want to redesign / restructure?
● What do we want to test?
● How to test streaming applications
● Hack a bit on Spark's manual clock
● Use scala-test for unit testing (see the sketch below)
● Bring up abstractions to decouple the code
● Write some tests
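
As an illustration of the "bring up abstractions" point, aggregation logic pulled out of the streaming job can be unit tested with scala-test without starting a StreamingContext. StatsAggregator and its method are hypothetical, not code from the repository; SensorRecord is the case class sketched earlier:

```scala
import org.scalatest.FunSuite

// Hypothetical abstraction: pure aggregation logic, decoupled from DStreams
object StatsAggregator {
  def countryCounts(records: Seq[SensorRecord]): Map[String, Long] =
    records.groupBy(_.country).mapValues(_.size.toLong).toMap
}

class StatsAggregatorSuite extends FunSuite {
  test("counts records per country") {
    val records = Seq(
      SensorRecord(1L, "IN", "KA", "Bangalore", "OK"),
      SensorRecord(2L, "IN", "MH", "Mumbai", "FAILED"),
      SensorRecord(3L, "US", "CA", "San Francisco", "OK"))

    assert(StatsAggregator.countryCounts(records) === Map("IN" -> 2L, "US" -> 1L))
  }
}
```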

Page 41: Building end to end streaming application on Spark

Manual Clock

● A clock whose time can be set and modified
● Its reported time does not change as wall-clock time elapses
● Only callers have control over it
● Used especially for testing (see the configuration sketch below)
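
A minimal sketch of how a test might switch Spark Streaming onto the manual clock, so batches are only generated when the test advances time. The spark.streaming.clock setting is an internal Spark property; the class name shown is the one used by recent Spark releases (older releases used org.apache.spark.streaming.util.ManualClock):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Batches fire only when the test advances the manual clock
val testConf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("sensoranalytics-test")
  .set("spark.streaming.clock", "org.apache.spark.util.ManualClock")

val testSsc = new StreamingContext(testConf, Seconds(1))
```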

Page 42: Building end to end streaming application on Spark

Phase 5 - Hands On

Git branch : unittest

Page 43: Building end to end streaming application on Spark

Next steps
● Use better serialization frameworks like Avro
● Enable checkpointing
● Integrate Kafka monitoring tools
● Add support for multiple Kafka topics
● Write more tests for all functionality