Apache Spark - Lightning-Fast Cluster Computing - Hyderabad Scalability Meetup
TRANSCRIPT
Spark - Lightning-Fast Cluster Computing by Example
Ramesh Mudunuri, Vectorum
Saturday, December 6, 2014
About me
• Big data enthusiast
• Member of the Vectorum.com startup product development team, using Spark technology
What to expect
• Introduction to Spark
• Spark ecosystem
• How Spark is different from Hadoop Map Reduce
• Where Spark shines well
• How easy to install and start learning
• Small code demos
• Where to find additional information
This is not…
• A training class
• A workshop
• A product demo with commercial interest
Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How is it different from Hadoop Map Reduce
• Where it shines well
• How easy to install and start learning
• Small code demos
• Where to find additional information
What is Spark?
General purpose large-scale high performance processing engine
http://spark.apache.org/
What is Spark?
Like MapReduce, but an in-memory processing engine that also runs fast
http://spark.apache.org/
What is Spark?
• Apache Spark™ is a fast and general engine for large-scale data processing.
Spark History
• Started as a research project in 2009 at UC Berkeley's AMPLab and became an Apache open-source project in 2010
• Matei Zaharia: Spark development team member and Databricks co-founder
Why is Spark so special
• Speed: a fast, general-purpose, in-memory processing engine
• (Relatively) easy to develop and deploy complex analytical applications
• APIs for Java, Scala, and Python
• Well-integrated ecosystem tools
www.databricks.com
Why is Spark so special…
• In-memory processing makes Spark well suited for iterative algorithms
• Can run in various setups
– Standalone (my favorite way to learn Spark)
– Cluster, EC2
– YARN, Mesos
• Reads data from
– Local file system
– HDFS
– HBase, Cassandra, and more
http://www.cloudera.com
Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How is it different from Hadoop Map Reduce
• Where it shines well
• How easy to install and start learning
• Small code demos
• Where to find additional information
Apache Spark Core
• Foundation of the Spark stack
• Scheduling
• Memory management
• Fault recovery, etc.
Spark SQL
• Run SQL expressions on Spark data
• Compatible with Hive*
• JDBC/ODBC connection capabilities

* Hive: SQL-on-Hadoop data warehouse software with custom UDF capabilities
Spark Streaming
• Component to process live data streams
• API to handle streaming data
• Example sources: log files, queued messages, sensor-emitted data
MLlib - Machine Learning
Libraries of machine learning algorithms
E.g.: classification, regression, clustering, collaborative filtering, dimensionality reduction
Very active Spark Development community
GraphX
APIs for graph computation
• PageRank
• Connected components
• Label propagation
• SVD++
• Strongly connected components
• Triangle count
Alpha level
Spark Engine Terminology
• Spark Context
– The object Spark uses to access the cluster
• Driver & Executor
– The driver runs the main program and executes parallel operations
– Executors run inside workers and execute the tasks
• Resilient Distributed Dataset (RDD)
– Immutable, fault-tolerant collection object
• RDD functions (similar to Hadoop map-reduce functions)
1. Transformation
2. Action
Spark shell and Spark context
Driver & Executor
• The driver runs the main program and executes parallel operations
• Executors run inside workers and execute the tasks
RDD-Resilient Distributed Dataset
• Resilient Distributed Datasets (RDDs) are Spark's fundamental abstraction for representing a collection of objects that can be distributed across multiple machines in a cluster.
• Simple definition: an immutable, fault-tolerant collection object
• There are two ways to create an RDD in Spark:
1. Create an RDD from an external data source
2. Perform a transformation on one or more existing RDDs

val lines = sc.textFile("/filepath/README.md")
val errors = lines.filter(_.startsWith("ERROR"))
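Spark's RDD API deliberately mirrors the Scala collections API, so the shape of the two lines above can be tried on a plain local List with no cluster at all. This is a minimal sketch of that shape, not Spark itself:

```scala
// Plain-Scala analogue of the RDD example above: the same filter
// shape works on a local List, with no SparkContext needed.
object RddShape {
  // analogous to: val errors = lines.filter(_.startsWith("ERROR"))
  def errorLines(lines: List[String]): List[String] =
    lines.filter(_.startsWith("ERROR"))
}
```

For example, `RddShape.errorLines(List("ERROR disk full", "INFO ok"))` returns `List("ERROR disk full")`; in the real shell the same filter runs distributed over the file read by `sc.textFile`.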
Transformation - Action
• Transformation operations are lazy (not executed immediately)
• Transformations create new RDDs from existing RDDs, e.g. filter, map
• Action operations return final values to the driver program or write data to the file system, e.g. collect, saveAsTextFile
http://www.mapr.com
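The laziness described above can be seen without Spark: Scala's own collection views behave the same way. This plain-Scala sketch (no cluster involved) shows that a "transformation" only defines work, while the "action" is what runs it:

```scala
object LazyDemo {
  // Returns (evaluations before the action, evaluations after it).
  def run(): (Int, Int) = {
    var calls = 0
    val xs = List(1, 2, 3)
    // "Transformation": lazy, like rdd.map -- nothing is computed yet
    val doubled = xs.view.map { x => calls += 1; x * 2 }
    val before = calls      // still 0: the map has not run
    val total = doubled.sum // "action": forces the computation
    require(total == 12)
    (before, calls)         // (0, 3)
  }
}
```

The same pattern is why chained RDD transformations cost nothing until `collect`, `count`, or another action is invoked.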
Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How is Spark different from Hadoop Map Reduce
• Where it shines well
• How easy to install and start learning
• Small code demos
• Where to find additional information
How is Spark different from Hadoop Map-Reduce
1. Speed
• Spark: up to 100x faster in-memory, 10x faster on disk

2. Ease of use
• Spark: easily write applications in Java, Scala, or Python; interactive shell available for Scala and Python; high-level, simple map-reduce operations
• Hadoop: Java only; no shell; complex map-reduce operations

3. Tools
• Spark: well-integrated tools (Spark SQL, Streaming, MLlib, etc.) to develop complex analytical applications
• Hadoop: loosely coupled large set of tools, but very mature

4. Deployment
• Spark: Hadoop V1/V2 (YARN), and also Mesos, Amazon EC2

5. Data sources
• Spark: HDFS (Hadoop), HBase, Cassandra, Amazon S3
How is Spark different from Hadoop Map-Reduce

6. Applications
• Spark: an 'application' is the higher-level unit; it runs multiple jobs in sequence or in parallel. Application processes, called executors, run on the cluster's workers
• Hadoop: a 'job' is the higher-level unit; it processes data with map-reduce and writes the output to storage

7. Executors
• Spark: an executor can run multiple tasks in a single process
• Hadoop: each map/reduce task runs in its own process

8. Shared variables
• Spark: broadcast variables (read-only lookup data, shipped to each worker only once) and accumulators (workers add values, the driver reads the total; fault tolerant)
• Hadoop: counters, including additional built-in system counters such as 'Map input records'

9. Persisting/caching RDDs
• Spark: cached RDDs can be reused across operations, which increases processing speed

10. Lazy evaluation
• Spark: transformation execution plans are bundled together and run only when an RDD action function is invoked
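The broadcast-variable and accumulator patterns described above can be illustrated without a cluster. In real Spark the calls are `sc.broadcast(...)` and `sc.accumulator(0)`; the plain-Scala analogue below (with made-up lookup data) only mimics the pattern: a read-only table every task reads, and a counter tasks add to while the driver reads the total:

```scala
object SharedVars {
  // Analogue of a broadcast variable: a read-only lookup table that every
  // task reads (Spark would ship it to each worker once via sc.broadcast).
  val countryNames: Map[String, String] =
    Map("IN" -> "India", "US" -> "United States")

  // Analogue of an accumulator: tasks only add to it; the driver reads
  // the total afterwards (sc.accumulator(0) in Spark 1.x).
  def resolve(codes: Seq[String]): (Seq[String], Int) = {
    var missing = 0
    val names = codes.map { code =>
      countryNames.getOrElse(code, { missing += 1; "unknown" })
    }
    (names, missing)
  }
}
```

Here `resolve(Seq("IN", "XX"))` yields the resolved names plus a miss count of 1, the same read-only/add-only division of labor Spark enforces across workers.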
Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How is it different from Hadoop Map Reduce
• Where Spark shines well
• How easy to install and start learning
• Small code demos
• Where to find additional information
Where Spark shines well
• Well suited for any iterative computation
– Machine learning algorithms
– Iterative analytics
• Multi-data-source computations
– Multi-sourced sensor data
• Aggregated analytics
– Transforming and summarizing the data
Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How is it different from Hadoop Map Reduce
• Where it shines well
• How easy to install and start learning
• Small code demos
• Where to find additional information
• Link http://spark.apache.org/downloads.html
• Standalone - choose a package type: prebuilt for Hadoop 1.x
• Source code is also available
– Build tools: Maven or sbt
– Distro versions: Hadoop, Cloudera, MapR
Current Spark version
Release Cycle : Every 3 months
How easy to install and start learning
Spark can be installed quickly on a laptop/PC.
• Parameter checklist:
– Java 1.7
– Scala 2.10.x
– SPARK/conf
Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How is it different from Hadoop Map Reduce
• Where it shines well
• How easy to install and start learning
• Small code demos
• Where to find additional information
Spark Scala REPL
cd $SPARK_HOME
./bin/spark-shell        (web UI on port 4040)

Spark Master & Worker in the background:
cd $SPARK_HOME
./sbin/start-all.sh      (starts both master and worker)
Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How is it different from Hadoop Map Reduce
• Where it shines well
• How easy to install and start learning
• Small code demos
• Where to find additional information
Use case with Spark SQL
• Spark Scala REPL• Spark SQL
• Write some interesting code snippets in the REPL using Scala
1. Read meetup participants' info and prepare a data file
2. Use Spark SQL to create aggregated data
3. Show a visualization with the Spark output data
Spark SQL Code : Create table and Run Queries
1. Create Spark context // the Spark context is created as sc when we launch the shell
2. Create SQL context
3. Create case class
4. Create RDD
5. Create schema
6. Register RDD as a table in the schema
7. Run SELECT statements
8. Save SQL output
9. Visualization - D3
Code
// The Spark context is available as sc when the shell starts
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
case class Attendees(Name: String, Interest: String)
val meetup = sc.textFile("/Users/vectorum/Documents/Ramesh/Dec6/meetup.csv").map(_.split(",")).map(a => Attendees(a(0), a(1)))
val hyd = sqlContext.createSchemaRDD(meetup)
hyd.registerTempTable("iiit")
val iiitRoster = sqlContext.sql("SELECT Name, Interest FROM iiit")
iiitRoster.count()
iiitRoster.map(a => "Name: " + a(0) + ", Interest: " + a(1)).collect().foreach(println)
val iiitAChart = sqlContext.sql("SELECT Interest, count(Interest) FROM iiit GROUP BY Interest ORDER BY Interest")
iiitAChart.map(a => a(0) + "," + a(1)).collect().foreach(println)
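As a sanity check on the GROUP BY query above, the same aggregation can be written against plain Scala collections. The attendee values here are illustrative, not from the demo file:

```scala
object InterestCounts {
  case class Attendee(name: String, interest: String)

  // Plain-Scala equivalent of:
  //   SELECT Interest, count(Interest) FROM iiit
  //   GROUP BY Interest ORDER BY Interest
  def counts(rows: Seq[Attendee]): Seq[(String, Int)] =
    rows.groupBy(_.interest)                          // GROUP BY Interest
        .map { case (interest, group) => (interest, group.size) } // count(Interest)
        .toSeq
        .sortBy(_._1)                                 // ORDER BY Interest
}
```

Spark SQL distributes exactly this kind of group-and-count across the cluster, which is why the query's output matches the local computation.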
Our Product - Technology Stack
• Visualization: HighCharts, D3
• Spark (SQL, Hive, MLlib)
• Data: HDFS, MySQL, files
Spark Programming Model
1. Define a set of transformations on input datasets.
2. Invoke actions that output the transformed datasets to persistent storage or bring them into the driver's local memory.
3. Run local computations on the results computed in a distributed fashion; these help decide which transformations and actions to undertake next.
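The three steps above can be sketched on a local collection. This is plain Scala with a lazy view standing in for an RDD, so forcing the view plays the role of the action:

```scala
object PipelineModel {
  // 1. define transformations (lazy), 2. invoke an action,
  // 3. use the local result to decide what to do next.
  def run(data: Seq[Int]): Boolean = {
    val transformed = data.view.filter(_ % 2 == 0).map(_ * 10) // step 1: lazy
    val collected = transformed.toList                         // step 2: "action"
    collected.sum > 100                                        // step 3: local decision
  }
}
```

In a real Spark program the step-3 decision (e.g. "is the aggregate large enough?") would typically determine which transformations and actions the driver submits next.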
Example RDD Lineage
HDFS/file → prepare dataset (RDD-0) → cached RDD → filtered data sets 0…n → export data → visualization / machine learning
Demo - Visualization
• Bubble chart: data distribution
• Heat chart: correlation
Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How is it different from Hadoop Map Reduce
• Where it shines well
• How easy to install and start learning
• Small code demos
• Where to find additional information
Where to find additional information
• http://spark.apache.org/
• http://spark-summit.org/2014#videos
• http://databricks.com/spark-training-resources
• Users mailing list: [email protected]
• Developers mailing list: [email protected]
• My Twitter handle: https://twitter.com/rameshmudunuri
Final note
Thank you