Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture


Post on 27-Nov-2014








1. Spark Streaming and Friends
   Chris Fregly, Global Big Data Conference, Sept 2014
   Kinesis, Streaming

2. Who am I?
   - Former Netflixer: netflix.github.io
   - Spark

3. Spark Streaming-Kinesis JIRA

4. Quick Poll
   - Hadoop, Hive, Pig?
   - Spark, Spark Streaming?
   - EMR, Redshift?
   - Flume, Kafka, Kinesis, Storm?
   - Lambda Architecture?
   - Bloom filters, HyperLogLog?

5. Streaming
   Kinesis Streaming, Video Streaming, Piping, Big Data Streaming

6. Agenda
   - Spark and Spark Streaming: overview
   - Use cases
   - API and libraries
   - Execution model
   - Fault tolerance
   - Cluster deployment
   - Monitoring
   - Scaling and tuning
   - Lambda Architecture
   - Approximations

7. Spark Overview (1/2)
   - Berkeley AMPLab, ~2009
   - Part of the Berkeley Data Analytics Stack (BDAS, aka "badass")

8. Spark Overview (2/2)
   - Based on Microsoft's Dryad paper, ~2007
   - Written in Scala
   - Supports Java, Python, SQL, and R
   - In-memory when possible, not required
   - Improved efficiency over MapReduce: 100x in-memory, 2-10x on-disk
   - Compatible with Hadoop file formats, SerDes, and UDFs

9. Spark Use Cases
   - Ad hoc, exploratory, interactive analytics
   - Real-time + batch analytics (Lambda Architecture)
   - Real-time machine learning
   - Real-time graph processing
   - Approximate, time-bound queries

10. Explosion of Specialized Systems

11. Unified Spark Libraries
   - Spark SQL (data processing)
   - Spark Streaming (streaming)
   - MLlib (machine learning)
   - GraphX (graph processing)
   - BlinkDB (approximate queries)
   - Statistics (correlations, sampling, etc.)
   - Others: Shark (Hive on Spark), Spork (Pig on Spark)

12. Unified Benefits
   - Advancements in higher-level libraries are pushed down into the core, and vice versa
   - Examples:
     - Spark Streaming: GC and memory-management improvements
     - Spark GraphX: IndexedRDD for random, hashed access within a partition versus scanning the entire partition

13. Spark API
14. Resilient Distributed Dataset (RDD)
   - Core Spark abstraction
   - Represents partitions across the cluster nodes
   - Enables parallel processing on data sets
   - Partitions can be in-memory or on-disk
   - Immutable, recomputable, fault tolerant
   - Contains the transformation lineage of the data set

15. RDD Lineage

16. Spark API Overview
   - Richer, more expressive than MapReduce
   - Native support for Java, Scala, Python, SQL, and R (mostly)
   - Unified API across all libraries
   - Operations = Transformations + Actions

17. Transformations

18. Actions

19. Spark Execution Model

20. Spark Execution Model Overview
   - Parallel, distributed, DAG-based
   - Lazy evaluation allows optimizations: reduced disk I/O, reduced shuffle I/O, parallel execution, task pipelining
   - Data locality and rack awareness
   - Worker-node fault tolerance using RDD lineage graphs per partition

21. Execution Optimizations

22. Spark Cluster Deployment

23. Spark Cluster Deployment

24. Master High Availability
   - Multiple Master nodes; ZooKeeper maintains the current Master
   - Existing applications and workers are notified of a new Master election
   - New applications and workers need to explicitly specify the current Master
   - Alternatives (not recommended): local filesystem, NFS mount

25. Spark Streaming

26. Spark Streaming Overview
   - Low latency, high throughput, fault tolerance (mostly)
   - Long-running Spark application
   - Supports Flume, Kafka, Twitter, Kinesis, Socket, File, etc.
   - Graceful shutdown with in-flight message draining
   - Uses Spark Core, the DAG execution model, and its fault tolerance

27. Spark Streaming Use Cases
   - ETL on streaming data during ingestion
   - Anomaly, malware, and fraud detection
   - Operational dashboards
   - Lambda Architecture: unified batch and streaming, e.g. different machine learning models for different time frames
   - Predictive maintenance (sensors)
   - NLP analysis (Twitter firehose)

28. Discretized Stream (DStream)
   - Core Spark Streaming abstraction
   - Micro-batches of RDDs
   - Operations similar to RDDs
   - Fault tolerance using DStream/RDD lineage

29. Spark Streaming API
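The RDD model described above — lazy transformations that only record lineage, and actions that trigger recomputation from that lineage — can be illustrated with a minimal, pure-Python toy. The `ToyRDD` class below is hypothetical and is not the actual Spark API; it only demonstrates the lazy-evaluation idea.

```python
class ToyRDD:
    """Toy model of an RDD: immutable, lazy, and carrying its lineage."""

    def __init__(self, data=None, parent=None, fn=None, op=None):
        self._data = data        # only set for the source RDD
        self._parent = parent    # lineage pointer to the parent RDD
        self._fn = fn
        self._op = op

    # Transformations: return a new ToyRDD, compute nothing yet.
    def map(self, fn):
        return ToyRDD(parent=self, fn=fn, op="map")

    def filter(self, fn):
        return ToyRDD(parent=self, fn=fn, op="filter")

    # Action: walks the lineage chain and materializes the result.
    def collect(self):
        if self._parent is None:
            return list(self._data)
        parent = self._parent.collect()  # recompute from lineage
        if self._op == "map":
            return [self._fn(x) for x in parent]
        return [x for x in parent if self._fn(x)]


rdd = ToyRDD(data=range(10))
# Nothing is computed here -- only lineage is recorded:
evens_squared = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(evens_squared.collect())  # [0, 4, 16, 36, 64]
```

Because `collect()` rebuilds its result from the lineage each time, a lost partition in real Spark can be recomputed the same way — which is the fault-tolerance property the slides describe.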
30. Spark Streaming API Overview
   - Rich, expressive API similar to the core
   - Operations: transformations and actions
   - Window and state operations
   - Requires checkpointing to snip long-running DStream lineage
   - Register a DStream as a Spark SQL table for querying!

31. DStream Transformations

32. DStream Actions

33. Window and State DStream Operations

34. DStream Example

35. Spark Streaming Cluster Deployment

36. Spark Streaming Cluster Deployment

37. Scaling Receivers

38. Scaling Processors

39. Spark Streaming + Kinesis

40. Spark Streaming + Kinesis Architecture
   (Diagram) Kinesis producers write to a Kinesis stream (Shard 1, Shard 2, Shard 3). The Spark Streaming application runs Kinesis receiver DStreams, each using the Kinesis Client Library with record-processor threads per shard (e.g. Receiver DStream 1 with record-processor threads 1 and 2; Receiver DStream 2 with record-processor thread 1).

41. Throughput and Pricing
   (Diagram) Less than 10 seconds of delay from Kinesis producer to Spark Streaming application. Per shard: 1 MB/sec in, 1000 PUTs/sec, 50 KB per PUT, 2 MB/sec out. Shard cost: $0.36 per day per shard. PUT cost: $2.50 per day per shard. Network transfer: free within the region!

42. Demo! (Kinesis, Streaming)
   - Scala: /scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala
   - Java: /java/org/apache/spark/examples/streaming/

43. Spark Streaming Fault Tolerance

44. Fault Tolerance
   - Points of failure: receiver, driver, worker/processor
   - Solutions: data replication, secondary/backup nodes, checkpoints

45. Streaming Receiver Failure
   - Use a backup receiver
   - Use multiple receivers pulling from multiple shards
   - Use a checkpoint-enabled, sharded streaming source (i.e. Kafka or Kinesis)
   - Data is replicated to 2 nodes immediately upon ingestion
   - Possible data loss; possible at-least-once delivery
   - Use buffered sources (i.e. Kafka or Kinesis)
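The micro-batch and window operations covered in this section can be sketched in pure Python. The `windowed_counts` helper below is a hypothetical toy, not a Spark API: each micro-batch of words is counted on arrival, and a sliding window aggregates the counts of the last few batches — the same shape as a windowed word count over a DStream.

```python
from collections import Counter, deque

def windowed_counts(batches, window_len, slide):
    """Toy DStream-style sliding window.

    `batches` is a sequence of micro-batches (lists of words).
    Every `slide` batches, emit the aggregated word counts over
    the most recent `window_len` batches.
    """
    window = deque(maxlen=window_len)  # old batches fall off automatically
    results = []
    for i, batch in enumerate(batches, start=1):
        window.append(Counter(batch))  # per-batch computation
        if i % slide == 0:             # window trigger
            total = Counter()
            for counts in window:
                total.update(counts)   # combine the batches in the window
            results.append(dict(total))
    return results


batches = [["a", "b"], ["b"], ["a", "a"], ["c"]]
print(windowed_counts(batches, window_len=2, slide=2))
```

In real Spark Streaming the per-batch counts would be RDD transformations and the window aggregation a window operation over the DStream; the toy only shows why long-running windows need checkpointing — state accumulates across batches.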
46. Streaming Driver Failure
   - Use a backup driver; use DStream metadata checkpoint info to recover
   - A single point of failure interrupts stream processing
   - The streaming driver is a long-running Spark application that schedules the long-running stream receivers
   - State and window RDD checkpoints help avoid data loss (mostly)

47. Stream Worker/Processor Failure
   - No problem! DStream RDD partitions will be recalculated from lineage

48. Types of Checkpoints
   Spark:
   1. Spark checkpointing of StreamingContext DStreams and metadata
   2. Lineage of state and window DStream operations
   Kinesis:
   3. The Kinesis Client Library (KCL) checkpoints the current position within each shard; checkpoint info is stored in DynamoDB per Kinesis application, keyed by shard

49. Spark Streaming Monitoring and Tuning

50. Monitoring
   - Monitor the driver, receivers, worker nodes, and streams
   - Alert upon failure or unusually high latency
   - Spark Web UI (Streaming tab)
   - Ganglia, CloudWatch
   - StreamingListener callback

51. Spark Web UI

52. Tuning
   - Batch interval
     - High: reduces the overhead of submitting new tasks for each batch
     - Low: keeps latencies low
     - Sweet spot: DStream job time (scheduling + processing) is steady and less than the batch interval
   - Checkpoint interval
     - High: reduces checkpoint overhead
     - Low: reduces the amount of data lost on failure
     - Recommendation: 5-10x the sliding window interval
   - Use DStream.repartition() to increase the parallelism of DStream processing jobs across the cluster
   - Use spark.streaming.unpersist=true to let the streaming framework figure out when to unpersist
   - Use the CMS GC for consistent processing times

53. Lambda Architecture

54. Lambda Architecture Overview
   - Batch layer: immutable, batch read, append-only write; the source of truth; e.g. HDFS
   - Speed layer: mutable, random read/write; most complex; recent data only; e.g. Cassandra
   - Serving layer: immutable, random read, batch write; e.g. ElephantDB

55. Spark + AWS + Lambda

56. Spark + AWS + Lambda + ML

57. Approximations
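The checkpointing idea above — periodically snapshot how far you have read and the state so far, so that recovery replays only the tail of the stream instead of everything — can be sketched in pure Python. This is a toy model, not the actual Spark or KCL API; the function names are hypothetical.

```python
def process(events, checkpoint_every):
    """Fold a stream of numbers into a running sum, snapshotting
    (next position to read, state) every `checkpoint_every` events,
    the way the KCL checkpoints its shard position."""
    checkpoint = (0, 0)
    state = 0
    for pos, value in enumerate(events):
        state += value
        if (pos + 1) % checkpoint_every == 0:
            checkpoint = (pos + 1, state)  # durable snapshot
    return checkpoint, state

def recover(events, checkpoint):
    """After a crash: resume from the snapshot, replaying only the
    events after the checkpointed position."""
    pos, state = checkpoint
    for value in events[pos:]:
        state += value
    return state


events = [1, 2, 3, 4, 5, 6, 7]
ckpt, _ = process(events, checkpoint_every=3)
# After a crash, only events[6:] must be replayed, not the whole stream.
print(ckpt, recover(events, ckpt))  # (6, 21) 28
```

The shorter the checkpoint interval, the less must be replayed on failure but the more snapshot overhead is paid — exactly the trade-off slide 52 describes.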
58. Approximation Overview
   - Required for scaling: speeds up analysis of large datasets, reduces the size of the working dataset
   - Data is messy; the collection of data is messy
   - Exact isn't always necessary
   - Approximate is the new exact

59. Some Approximation Methods
   - Approximate time-bound queries: BlinkDB
   - Bernoulli and Poisson sampling: RDD.sample(), RDD.takeSample()
   - HyperLogLog: PairRDD.countApproxDistinctByKey()
   - Count-Min Sketch: Spark Streaming and Twitter Algebird
   - Bloom filters: everywhere!

60. Approximations in Action
   Figure: Memory Savings with Approximation Techniques

61. Spark Statistics Library
   - Correlations: dependence between 2 random variables; Pearson, Spearman
   - Hypothesis testing: measure of statistical significance; chi-squared test
   - Stratified sampling: sample separately from different sub-populations; Bernoulli and Poisson sampling; with and without replacement
   - Random data generation: uniform, standard normal, and Poisson distributions

62. Summary
   - Spark and Spark Streaming: overview, use cases, API and libraries, execution model, fault tolerance, cluster deployment, monitoring, scaling and tuning
   - Lambda Architecture
   - Approximations
   Oct 2014: MEAP Early Access
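Of the approximation methods listed on slide 59, a Bloom filter is the easiest to sketch in pure Python. The class below is a toy illustration, not the Algebird or Spark implementation: membership tests may return false positives but never false negatives, trading exactness for a small, fixed memory footprint — the memory-savings point of slide 60.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash functions over an m-bit array."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, item):
        # Derive k hash positions by salting a single hash function.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = True

    def might_contain(self, item):
        # False means "definitely absent"; True means "probably present".
        return all(self.bits[p] for p in self._positions(item))


bf = BloomFilter()
for word in ["spark", "kinesis", "streaming"]:
    bf.add(word)
# "spark" is reported present; "hadoop" almost certainly is not.
print(bf.might_contain("spark"), bf.might_contain("hadoop"))
```

Three words in a 1024-bit filter cost 128 bytes regardless of item size; the false-positive rate grows with the fill ratio of the bit array, so `m` and `k` are sized for the expected number of items.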