Spark Streaming + Amazon Kinesis
TRANSCRIPT
- 1. Spark Streaming + Amazon Kinesis @imai_factory
- 2. Spark Streaming: Kafka / Kinesis, RDD / DStream, FRP
- 3. Conclusion: Spark Streaming works as a Kinesis consumer, and you can run SQL over a Kinesis stream
- 4. DStream: a sequence of RDDs over time (RDD @t1, RDD @t2, RDD @t3, RDD @t4, RDD @t5, ...)
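Slide 4's picture can be approximated without Spark at all. The sketch below is conceptual only (not the Spark API): a DStream is modeled as a time-indexed sequence of micro-batches, with a plain `Vector` standing in for each RDD, and a transformation on the stream applying to every underlying batch. All names here are illustrative.

```scala
// Conceptual sketch only (not Spark API): a DStream is a sequence of
// micro-batches, one RDD per batch interval. A plain Vector stands in
// for each RDD; names are illustrative.
object DStreamSketch {
  type FakeRDD[A] = Vector[A]               // stand-in for RDD[A]
  type FakeDStream[A] = Vector[FakeRDD[A]]  // one FakeRDD per time step

  // A transformation on the DStream applies to every underlying RDD.
  def mapStream[A, B](ds: FakeDStream[A])(f: A => B): FakeDStream[B] =
    ds.map(rdd => rdd.map(f))

  def main(args: Array[String]): Unit = {
    val ds: FakeDStream[String] = Vector(
      Vector("a", "b"),  // RDD @t1
      Vector("c"),       // RDD @t2
      Vector("d", "e")   // RDD @t3
    )
    println(mapStream(ds)(_.toUpperCase))
    // Vector(Vector(A, B), Vector(C), Vector(D, E))
  }
}
```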
- 5. Programming with DStream
      val conf = new SparkConf()
      val ssc = new StreamingContext(conf, Seconds(1))
      val lines = ssc.socketTextStream("localhost", 9999)
      val words = lines.flatMap(_.split(" "))
      val pairs = words.map(word => (word, 1))
      val count = pairs.reduceByKey(_ + _)
      count.print()
      ssc.start()
      ssc.awaitTermination()
- 6.–8. Programming with DStream (the same code as slide 5, repeated across build slides)
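The transformation chain on slide 5 can be exercised without a Spark runtime: plain Scala collections have the same `flatMap`/`map` shape, and `reduceByKey(_ + _)` can be emulated with `groupMapReduce` (Scala 2.13+). This is a sketch of the logic only, not the streaming job itself, which needs a running StreamingContext.

```scala
// Plain-Scala sketch of slide 5's pipeline; no Spark runtime needed.
// The collection ops mirror the DStream ops one-for-one, with
// reduceByKey(_ + _) emulated via groupMapReduce (Scala 2.13+).
object WordCountSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] = {
    val words = lines.flatMap(_.split(" "))   // DStream.flatMap
    val pairs = words.map(word => (word, 1))  // DStream.map
    pairs.groupMapReduce(_._1)(_._2)(_ + _)   // reduceByKey(_ + _)
  }

  def main(args: Array[String]): Unit = {
    println(wordCount(Seq("to be or not to be")))
    // Map with to -> 2, be -> 2, or -> 1, not -> 1 (entry order may vary)
  }
}
```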
- 9. DStream data sources: Flume, Kafka, Kinesis, Twitter, File, Socket
- 10. Amazon Kinesis / Kafka
- 11. Amazon Kinesis: the Kinesis data stream handles Store and Shuffle & Sort; consumer apps handle Process
- 12. Spark Streaming + Amazon Kinesis: the Kinesis data stream handles Store and Shuffle & Sort; Spark Streaming handles Process
- 13. Spark Streaming + Amazon Kinesis: Spark as a Kinesis consumer, plus SparkSQL over a Kinesis stream
- 14. Building an Amazon Kinesis consumer app: on the Process side you can use the API/SDK directly, the Kinesis Client Library (KCL), or AWS Lambda; both Spark's Kinesis integration and Storm's kinesis-spout are built on the KCL
- 15. Run SparkSQL on a Kinesis stream: the Kinesis data stream handles Store and Shuffle & Sort; SQL runs on the Process side
- 16. Run SparkSQL on Kinesis Stream
      import org.apache.spark.streaming.kinesis.KinesisUtils

      val kinesisStreams = (0 until numStreams).map { i =>
        KinesisUtils.createStream(
          ssc, streamName, endpointUrl, kinesisCheckpointInterval,
          InitialPositionInStream.LATEST, StorageLevel.MEMORY_ONLY
        )
      }
      val unionStreams = ssc.union(kinesisStreams)
      val words = unionStreams.flatMap(...)
- 17. Run SparkSQL on Kinesis Stream (same code as slide 16): one DStream is created per receiver, the DStreams are unioned into a single DStream, and transformations are applied to the union
- 18. Run SparkSQL on Kinesis Stream: the JSON records in each micro-batch are queried inside foreachRDD
      words.foreachRDD(foreachFunc = (rdd: RDD[String], time: Time) => {
        val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
        sqlContext.read.json(rdd).registerTempTable("words")
        val wordCountsDataFrame =
          sqlContext.sql("select level, count(*) as total from words group by level")
        println(s"========= $time =========")
        wordCountsDataFrame.show()
      })
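The aggregation in slide 18's query (`select level, count(*) as total from words group by level`) can be sanity-checked with plain collections. In this sketch a case class stands in for the JSON records and `groupBy` plays the role of the SQL `group by`; no SparkSQL is involved, and the `Word` type is purely illustrative.

```scala
// Plain-Scala check of the aggregation in slide 18's SQL:
//   select level, count(*) as total from words group by level
// A case class stands in for the JSON records; no SparkSQL involved.
object GroupByLevelSketch {
  final case class Word(level: String, text: String)

  def totalsByLevel(words: Seq[Word]): Map[String, Long] =
    words.groupBy(_.level).map { case (lvl, ws) => lvl -> ws.size.toLong }

  def main(args: Array[String]): Unit = {
    val records = Seq(Word("info", "x"), Word("warn", "y"), Word("info", "z"))
    println(totalsByLevel(records))
    // Map with info -> 2, warn -> 1 (entry order may vary)
  }
}
```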
- 19. Conclusion: with Spark Streaming you can build a Kinesis consumer
- 20. Under the hood: KinesisUtils.createStream sets up a PluggableInputDStream backed by a KinesisReceiver, which runs a Kinesis Client Library (KCL) Worker thread; the worker calls GetRecords against the Kinesis stream and checkpoints its progress to a DynamoDB table
      KinesisUtils.createStream(
        ssc, streamName, endpointUrl, kinesisCheckpointInterval,
        InitialPositionInStream.LATEST, StorageLevel.MEMORY_ONLY
      )
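Slide 20's checkpoint mechanism can be illustrated with a toy model: the KCL worker periodically records the last processed sequence number per shard, so a restarted worker resumes just past that point instead of reprocessing the stream. Below, a mutable `Map` stands in for the DynamoDB table and a `Seq` of (sequence number, record) pairs stands in for GetRecords; everything here is a hypothetical simplification, not KCL code.

```scala
// Toy sketch of KCL-style checkpointing (slide 20). A mutable Map stands
// in for the DynamoDB lease/checkpoint table; only records after the
// stored sequence number are seen again by a restarted worker.
object CheckpointSketch {
  // checkpoint table: shardId -> last processed sequence number
  val table = scala.collection.mutable.Map.empty[String, Long]

  def checkpoint(shardId: String, seq: Long): Unit = table(shardId) = seq

  // Simulate resuming GetRecords just past the checkpoint, as a
  // restarted worker would.
  def resume(shardId: String, stream: Seq[(Long, String)]): Seq[String] = {
    val last = table.getOrElse(shardId, -1L)
    stream.collect { case (seq, rec) if seq > last => rec }
  }

  def main(args: Array[String]): Unit = {
    val stream = Seq(1L -> "a", 2L -> "b", 3L -> "c")
    checkpoint("shard-0", 2L)           // processed up to seq 2, then crashed
    println(resume("shard-0", stream))  // List(c)
  }
}
```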
- 21. One more thing: Amazon EMR now supports Apache Spark! (announced 2015/06/23, Spark 1.3.1)
- 22. One more thing: Amazon EMR now supports Apache Spark! Amazon Kinesis + Amazon EMR