Big Data tools in practice
Darko Marjanović, [email protected]
Miloš Milovanović, [email protected]
Agenda
• Hadoop
• Spark
• Python
Hadoop
• Pros
  • Linear scalability.
  • Commodity hardware.
  • Pricing and licensing.
  • Any data types.
  • Analytical queries.
  • Integration with traditional systems.
• Cons
  • Implementation.
  • MapReduce ease of use.
  • Intense calculations with little data.
  • In memory.
  • Real-time analytics.
The Apache Hadoop software library is a framework that allows the distributed processing of large data sets across clusters of computers using simple programming models.
Hadoop
• Hadoop Common
• HDFS
• Map Reduce
• YARN
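The MapReduce component above can be sketched in plain Python. This is an illustration of the programming model only (in the style of Hadoop Streaming, which runs any stdin/stdout executable as a mapper or reducer); the function names and sample input are hypothetical, not part of the original slides.

```python
from itertools import groupby

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word seen.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reducer(pairs):
    # Reduce phase: Hadoop groups pairs by key (simulated here by
    # sorting), then folds each group into a single (word, count).
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

counts = dict(reducer(mapper(["to be or not to be"])))
```

In a real cluster the shuffle between the two phases is what Hadoop distributes across machines; the two functions themselves stay this simple.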
Hadoop HDFS
Apache Spark
• Pros
  • Up to 100x faster than MapReduce (for in-memory workloads).
  • Ease of use.
  • Streaming, MLlib, GraphX and SQL.
  • Pricing and licensing.
  • In memory.
  • Integration with Hadoop.
• Cons
  • Integration with traditional systems.
  • Limited memory per machine (GC pressure).
  • Configuration.
Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.
Spark
Spark stack
Resilient Distributed Datasets
A distributed memory abstraction that allows programmers to perform in-memory computations on large clusters while retaining the fault tolerance of data-flow models like MapReduce.*
• Immutability
• Lineage (reconstruct lost partitions)
• Fault tolerance through logging the updates made to a dataset (a single operation applied to many records)
• Creation:
  • Reading a dataset from storage (HDFS or any other)
  • From other RDDs
*Technical Report No. UCB/EECS-2011-82, available at: http://www.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-82.html
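The lineage idea above can be illustrated with plain Python. This is an analogy, not Spark's implementation: instead of replicating data, an RDD records how each partition was derived, so a lost partition can be recomputed on demand. All names and data here are made up for the sketch.

```python
# Base data split into two "partitions", plus the recorded
# transformations (the lineage) that derive the RDD from it.
source = [[1, 2], [3, 4]]
lineage = [lambda x: x * 2]   # e.g. a single map step

def recompute(partition_index):
    # Replay the lineage over the base partition to rebuild it.
    part = source[partition_index]
    for f in lineage:
        part = [f(x) for x in part]
    return part

partitions = [recompute(0), recompute(1)]   # materialized partitions
partitions[1] = None                        # simulate losing a partition
partitions[1] = recompute(1)                # rebuild from lineage, not from a replica
```

Because only the (small) lineage log is kept, fault tolerance is cheap even when a single operation applies to millions of records.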
RDD operations
• Transformations
  • Lazily evaluated (executed by calling an action)
    • Reduces wait states
    • Better pipelining
• Actions
  • Run immediately
  • Return a value to the application or export to a storage system
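The lazy-transformation/eager-action split can be felt in plain Python generators (an analogy for the execution model, not Spark itself; the variable names are illustrative).

```python
data = range(1, 6)

# "Transformations" only build a pipeline; nothing is computed yet.
doubled = (x * 2 for x in data)           # like rdd.map(lambda x: x * 2)
large   = (x for x in doubled if x > 4)   # like .filter(lambda x: x > 4)

# The "action" forces the whole pipeline to run in one pass over the
# data, which is exactly what enables the pipelining mentioned above.
result = list(large)                      # like rdd.collect()
```

Until `list()` is called, no element has been doubled or filtered; the two steps then run fused, element by element, with no intermediate collection.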
Transformations:
• map(f: T ⇒ U)
• filter(f: T ⇒ Bool)
• groupByKey()
• join()
Actions:
• count()
• collect()
• reduce(f: (T, T) ⇒ T)
• save(path: String)
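The reduce(f: (T, T) ⇒ T) signature deserves a note: f combines two values of the same type into one, and for a distributed reduce it should be associative (and ideally commutative) so per-partition partial results can be merged in any order. A minimal sketch with Python's own reduce:

```python
from functools import reduce

nums = [1, 2, 3, 4, 5]
# Addition is associative and commutative, so partitions could be
# reduced independently and merged: like rdd.reduce(lambda a, b: a + b).
total = reduce(lambda a, b: a + b, nums)
```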
Spark program lifecycle
Create RDD (external data or parallelized collection) → Transformation (lazily evaluated) → Cache RDD (for reuse) → Action (execute computation and return results)
Spark in a cluster mode
* http://spark.apache.org/docs/latest/img/cluster-overview.png
PySpark
• Python API for Spark
• Easy-to-use programming abstraction and parallel runtime:
  • “Here’s an operation, run it on all of the data”
• Dynamically typed (RDDs can hold objects of multiple types)
• Integrates with other Python libraries, such as NumPy, Pandas, scikit-learn, Flask
• Run Spark from Jupyter notebooks
Spark DataFrames
DataFrames are a common data science abstraction that goes across languages.
A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable and each row contains one case.
A Spark DataFrame is a distributed collection of data organized into named columns, and can be created:
• from structured data files
• from Hive tables
• from external databases
• from RDDs
Some supported operations:
• slice data
• sort data
• aggregate data
• join with other DataFrames
DataFrame benefits
• Lazy evaluation
• Domain-specific language for distributed data manipulation
• Automatic parallelization and cluster distribution
• Integration with the pipeline API for MLlib
• Query structured data with SQL (using SQLContext)
• Integration with Pandas DataFrames (and other Python data libraries)
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

df = sqlContext.read.json("data.json")
df.show()
df.select("id").show()
df.filter(df["id"] > 10).show()
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

df = sqlContext.read.json("data.json")
df.registerTempTable("data")
results = sqlContext.sql("SELECT * FROM data WHERE id > 10")
Pandas DF vs Spark DF

Pandas DF                                                    | Spark DF
Single-machine tool (all data must fit in memory, except with HDF5) | Distributed (data > memory)
Better API                                                   | Good API
No parallelism                                               | Parallel by default
Mutable                                                      | Immutable

Some function differences: reading data, counting, displaying, inferring types, statistics, creating new columns (https://medium.com/@chris_bour/6-differences-between-pandas-and-spark-dataframes-1380cec394d2)
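The mutability row is the difference you hit first in practice. A small sketch (assuming Pandas is installed; the data is made up): a Pandas DataFrame can be updated in place, while a Spark DataFrame operation always returns a new DataFrame, e.g. df.withColumn("id", df["id"] * 10).

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3]})
df["id"] = df["id"] * 10   # Pandas: mutates df in place
# Spark equivalent (not run here): df2 = df.withColumn("id", df["id"] * 10)
```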
A very popular benchmark
* https://databricks.com/wp-content/uploads/2015/02/Screen-Shot-2015-02-16-at-9.46.39-AM-1024x457.png