Real-time data processing with Kafka-Spark integration
TRANSCRIPT
© Copyright 2014 EMC Corporation. All rights reserved.
Real Time Data Streaming
+
Speakers:
Sumit Gupta, Data Intelligence Engineer, EMC
Kartikeya Putturaya, Data Intelligence Engineer, EMC
ChandraSekarRao Venkata, Data Intelligence Engineer, EMC
Data Engineering at EMC IT
Stack
Distributed Frameworks: Apache Spark, Pivotal Hadoop, Apache Storm
Messaging Systems: RabbitMQ, Apache Kafka
Relational Store: Greenplum
A glimpse of what we do
Predictive Maintenance of Exchange Servers - Monitoring over 145 Exchange servers in real time, with an analytics engine running on an 8-node cluster, processing ~100 MB of data every 2 minutes
User Behavior Analytics for Network Threat Detection - Real-time monitoring of EMC's internal networks and user behavior pattern analysis for threats, again on an 8-node cluster, processing a stream of ~150 MB of data at any point in time
Predictive Maintenance of Exchange Servers
User Behavior Analytics for Network Threat Detection
Apache Kafka
Overview
An Apache project initially developed at LinkedIn
Distributed publish-subscribe messaging system
• Designed for processing real-time activity stream data, e.g. logs, metrics collections
• Written in Scala
• Does not follow the JMS standard and does not use JMS APIs
Features
Persistent messaging
High throughput
Supports both queue and topic semantics
Uses ZooKeeper for forming a cluster of nodes (producer/consumer/broker)
and many more…
http://kafka.apache.org/
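The "queue and topic semantics" feature can be illustrated with a toy model. In Kafka, consumers belong to consumer groups, and each message is delivered to one consumer within each subscribing group. The sketch below is an in-memory illustration only, not Kafka client code: the `deliver` function is hypothetical, and the per-message round-robin inside a group is a simplifying assumption (real Kafka assigns whole partitions to group members).

```python
from collections import defaultdict
from itertools import cycle


def deliver(messages, groups):
    """Toy delivery: every group sees each message once; within a group,
    messages are spread round-robin over that group's consumers."""
    received = defaultdict(list)
    pickers = {g: cycle(members) for g, members in groups.items()}
    for msg in messages:
        for g in groups:
            received[next(pickers[g])].append(msg)
    return dict(received)


messages = ["m0", "m1", "m2", "m3"]

# Queue semantics: all consumers share one group, so the load is split.
queue_result = deliver(messages, {"g1": ["c1", "c2"]})

# Topic (publish-subscribe) semantics: one group per consumer,
# so every consumer receives every message.
topic_result = deliver(messages, {"gA": ["a"], "gB": ["b"]})

print(queue_result)
print(topic_result)
```

Putting all consumers in one group gives queue behavior; giving each consumer its own group gives publish-subscribe behavior.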
How it works
Real-time transfer
The broker does not push messages to the consumer; the consumer polls messages from the broker.
Kafka maintains feeds of messages in categories called topics. For each topic, the Kafka cluster maintains a partitioned log.
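The partitioned log and the pull model can be sketched with a small in-memory stand-in. This is not Kafka code: the `ToyTopic` class and its key-to-partition rule are assumptions made for illustration. It only shows that each partition is an append-only sequence, that each message gets an offset within its partition, and that the consumer asks for data from an offset it tracks itself.

```python
class ToyTopic:
    """In-memory stand-in for a topic: a list of append-only partition logs."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # Pick a partition from the key (illustrative rule, not Kafka's partitioner);
        # messages with equal keys always land in the same partition.
        p = sum(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def poll(self, partition, offset, max_messages=10):
        # The consumer requests data starting at its own offset; nothing is pushed.
        return self.partitions[partition][offset:offset + max_messages]


topic = ToyTopic(num_partitions=2)
first = topic.append("server-a", "cpu=10")
second = topic.append("server-a", "cpu=20")
other = topic.append("server-b", "cpu=5")

# The consumer tracks its position and pulls from there.
batch = topic.poll(partition=first[0], offset=0)
print(first, second, other, batch)
```

Because the consumer owns its offset, it can re-read older messages simply by polling from a smaller offset.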
Kafka Installation
Download
http://kafka.apache.org/downloads.html
Untar it
> tar -xzf kafka_<version>.tgz
> cd kafka_<version>
Start Servers
Start the ZooKeeper server
> bin/zookeeper-server-start.sh config/zookeeper.properties
Pre-requisite: ZooKeeper should be up and running.
Now start the Kafka server
> bin/kafka-server-start.sh config/server.properties
Create/List Topics
Create a topic
> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
List all topics
> bin/kafka-topics.sh --list --zookeeper localhost:2181
Output: test
Producer
Send some messages
> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
Now type on the console:
This is a message
This is another message
Consumer
Receive some messages
> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
This is a message
This is another message
Multi-Broker Cluster
Copy configs
> cp config/server.properties config/server-1.properties
> cp config/server.properties config/server-2.properties
Changes in the config files
config/server-1.properties:
broker.id=1
port=9093
log.dir=/tmp/kafka-logs-1
config/server-2.properties:
broker.id=2
port=9094
log.dir=/tmp/kafka-logs-2
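The per-broker edits follow a simple pattern: each extra broker on the same machine needs its own `broker.id`, `port`, and `log.dir`. As a rough sketch, the edits could be generated rather than typed by hand. The helper names `broker_overrides` and `apply_overrides` are hypothetical, and the `base` string is a made-up three-line sample, not the real contents of `config/server.properties`.

```python
def broker_overrides(broker_id, base_port=9092, log_root="/tmp/kafka-logs"):
    """Return the three settings that must differ per broker on one machine."""
    return {
        "broker.id": str(broker_id),
        "port": str(base_port + broker_id),
        "log.dir": "%s-%d" % (log_root, broker_id),
    }


def apply_overrides(base_text, overrides):
    """Rewrite `key=value` lines whose key is overridden; keep all other lines."""
    out = []
    seen = set()
    for line in base_text.splitlines():
        key = line.split("=", 1)[0].strip()
        is_setting = "=" in line and not line.lstrip().startswith("#")
        if is_setting and key in overrides:
            out.append("%s=%s" % (key, overrides[key]))
            seen.add(key)
        else:
            out.append(line)
    # Append any overridden key that the base text did not define.
    out.extend("%s=%s" % (k, v) for k, v in overrides.items() if k not in seen)
    return "\n".join(out) + "\n"


# Made-up sample standing in for the copied config file.
base = "# sample base config\nbroker.id=0\nport=9092\nlog.dir=/tmp/kafka-logs\n"
server_1 = apply_overrides(base, broker_overrides(1))
print(server_1)
```

With the defaults assumed here, broker 1 gets port 9093 and broker 2 gets port 9094, matching the values shown on the slide.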
Start with New Nodes
Start the other nodes with the new configs
> bin/kafka-server-start.sh config/server-1.properties &
> bin/kafka-server-start.sh config/server-2.properties &
Create a new topic with a replication factor of 3
> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic my-replicated-topic
Describe the new topic
> bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-replicated-topic
Topic:my-replicated-topic PartitionCount:1 ReplicationFactor:3 Configs:
Topic: my-replicated-topic Partition: 0 Leader: 1 Replicas: 1,2,0 Isr: 1,2,0
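In the per-partition line of the `--describe` output, Leader is the broker handling reads and writes for that partition, Replicas lists the brokers holding a copy, and Isr lists the replicas currently in sync with the leader. A small sketch of checking that line programmatically follows. The `parse_partition_line` helper is hypothetical, and it assumes the output has exactly the layout shown on the slide.

```python
import re


def parse_partition_line(line):
    """Parse one per-partition line of `kafka-topics.sh --describe` output."""
    fields = dict(re.findall(r"(\w+):\s*(\S+)", line))
    return {
        "topic": fields["Topic"],
        "partition": int(fields["Partition"]),
        "leader": int(fields["Leader"]),
        "replicas": [int(b) for b in fields["Replicas"].split(",")],
        "isr": [int(b) for b in fields["Isr"].split(",")],
    }


line = "Topic: my-replicated-topic Partition: 0 Leader: 1 Replicas: 1,2,0 Isr: 1,2,0"
info = parse_partition_line(line)

# All replicas are caught up when the in-sync set equals the replica set.
fully_in_sync = set(info["isr"]) == set(info["replicas"])
print(info, fully_in_sync)
```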
Spark Streaming
Makes it easy to build scalable fault-tolerant streaming applications.
Ease of Use
Fault Tolerance
Combine streaming with batch and interactive queries.
Spark Streaming Programming Model
Spark Streaming provides a high-level abstraction called a Discretized Stream, or DStream
- represents a stream of data
- implemented as a sequence of RDDs
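The idea of a stream "implemented as a sequence of RDDs" can be pictured without Spark. The sketch below is a plain-Python illustration, not the Spark API: the `discretize` function is hypothetical, timestamps are assumed to be seconds measured from zero, and each inner list stands in for the RDD of one batch interval.

```python
def discretize(events, batch_interval):
    """Group (timestamp_seconds, value) events into consecutive fixed-width batches.

    Timestamps are assumed to be non-negative and measured from stream start.
    """
    if not events:
        return []
    end = max(t for t, _ in events)
    num_batches = int(end // batch_interval) + 1
    batches = [[] for _ in range(num_batches)]
    for t, value in events:
        batches[int(t // batch_interval)].append(value)
    return batches


events = [(0.1, "a"), (0.7, "b"), (1.2, "c"), (3.5, "d")]
batches = discretize(events, batch_interval=1)
print(batches)
```

An interval with no data still yields a batch; it is simply empty, as with the third interval here.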
Spark Streaming + Kafka
There are two approaches to receiving data from Kafka in Spark Streaming:
• Receiver-based approach
• Direct approach
# Import StreamingContext and KafkaUtils
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="PythonStreamingKafkaWordCount")
ssc = StreamingContext(sc, 1)  # batch interval of 1 second

# Create the Kafka stream by passing the ZooKeeper server address,
# a consumer group id, and a {topic: partitions to consume} map
kvs = KafkaUtils.createStream(ssc, "localhost:2181", "spark-streaming-consumer", {"sparkStream": 1})

# lines DStream from the Kafka stream: keep the value of each (key, value) pair
lines = kvs.map(lambda x: x[1])

# counts DStream from the lines DStream
counts = lines.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()
ssc.awaitTermination()
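To see what the flatMap, map, and reduceByKey chain computes for a single batch, the same steps can be traced in plain Python without a Spark cluster. This is only an illustrative sketch: `word_count` is a hypothetical helper, and the two input lines are the sample messages typed into the console producer earlier.

```python
from collections import Counter


def word_count(batch_lines):
    """Count words in one micro-batch, mirroring flatMap -> map -> reduceByKey."""
    # flatMap: split every line and flatten into one list of words
    words = [w for line in batch_lines for w in line.split(" ")]
    # map: pair each word with a count of 1
    pairs = [(w, 1) for w in words]
    # reduceByKey: add up the counts of equal words
    counts = Counter()
    for word, one in pairs:
        counts[word] += one
    return dict(counts)


batch = ["This is a message", "This is another message"]
result = word_count(batch)
print(result)
```

In the streaming job this computation runs once per batch interval, so the printed counts reflect only the messages that arrived in that interval.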
from pyspark.streaming.kafka import KafkaUtils

# ssc, topic, and brokers are assumed to be defined earlier
directKafkaStream = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})

offsetRanges = []

def storeOffsetRanges(rdd):
    # Capture the Kafka offset ranges covered by the current batch
    global offsetRanges
    offsetRanges = rdd.offsetRanges()
    return rdd

def printOffsetRanges(rdd):
    for o in offsetRanges:
        print("%s %s %s %s" % (o.topic, o.partition, o.fromOffset, o.untilOffset))

directKafkaStream \
    .transform(storeOffsetRanges) \
    .foreachRDD(printOffsetRanges)
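The four fields printed for each offset range are enough to work out how much data a batch covered and where the next batch should start. The sketch below runs without Spark: the `OffsetRange` namedtuple is a hypothetical stand-in carrying the same four field names, the helper functions are illustrative, and the sample numbers are invented. It treats `fromOffset` as inclusive and `untilOffset` as exclusive.

```python
from collections import namedtuple

# Hypothetical stand-in with the same four fields that the slide prints.
OffsetRange = namedtuple("OffsetRange", "topic partition fromOffset untilOffset")


def messages_in_batch(offset_ranges):
    """Total messages covered by a batch; untilOffset is treated as exclusive."""
    return sum(o.untilOffset - o.fromOffset for o in offset_ranges)


def next_start_offsets(offset_ranges):
    """Offset at which the following batch begins for each (topic, partition)."""
    return {(o.topic, o.partition): o.untilOffset for o in offset_ranges}


# Invented sample values for illustration.
ranges = [OffsetRange("test", 0, 100, 150), OffsetRange("test", 1, 40, 45)]
total = messages_in_batch(ranges)
resume = next_start_offsets(ranges)
print(total, resume)
```

Storing the `resume` mapping somewhere durable is one way an application could pick up where it left off after a restart.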
Thank You