Big Data tools in practice
Darko Marjanović, [email protected]
Miloš Milovanović, [email protected]
Agenda
• Hadoop
• Spark
• Python
Hadoop
• Pros
  • Linear scalability.
  • Commodity hardware.
  • Pricing and licensing.
  • Any data types.
  • Analytical queries.
  • Integration with traditional systems.
• Cons
  • Implementation.
  • MapReduce ease of use.
  • Intense calculations with little data.
  • In memory.
  • Real-time analytics.
The Apache Hadoop software library is a framework that allows the distributed processing of large data sets across clusters of computers using simple programming models.
Hadoop
• Hadoop Common
• HDFS
• Map Reduce
• YARN
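The MapReduce component above can be sketched in plain Python. This is an illustration of the programming model only (in the style of Hadoop Streaming, which runs any stdin/stdout executable as a mapper or reducer); the function names and sample input are hypothetical, not part of the original slides.

```python
from itertools import groupby

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word seen.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reducer(pairs):
    # Reduce phase: Hadoop groups pairs by key (simulated here by
    # sorting), then folds each group into a single (word, count).
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

counts = dict(reducer(mapper(["to be or not to be"])))
```

In a real cluster the shuffle between the two phases is what Hadoop distributes across machines; the two functions themselves stay this simple.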
Hadoop HDFS
Apache Spark
• Pros
  • Up to 100x faster than MapReduce (for in-memory workloads).
  • Ease of use.
  • Streaming, MLlib, GraphX and SQL.
  • Pricing and licensing.
  • In memory.
  • Integration with Hadoop.
• Cons
  • Integration with traditional systems.
  • Limited memory per machine (GC pressure).
  • Configuration.
Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.
Spark
Spark stack
Resilient Distributed Datasets
A distributed memory abstraction that allows programmers to perform in-memory computations on large clusters while retaining the fault tolerance of data-flow models like MapReduce.*
• Immutability
• Lineage (reconstruct lost partitions)
• Fault tolerance through logging the updates made to a dataset (a single operation applied to many records)
• Creation:
  • Reading a dataset from storage (HDFS or any other)
  • From other RDDs
*Technical Report No. UCB/EECS-2011-82, available at: http://www.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-82.html
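The lineage idea above can be illustrated with plain Python. This is an analogy, not Spark's implementation: instead of replicating data, an RDD records how each partition was derived, so a lost partition can be recomputed on demand. All names and data here are made up for the sketch.

```python
# Base data split into two "partitions", plus the recorded
# transformations (the lineage) that derive the RDD from it.
source = [[1, 2], [3, 4]]
lineage = [lambda x: x * 2]   # e.g. a single map step

def recompute(partition_index):
    # Replay the lineage over the base partition to rebuild it.
    part = source[partition_index]
    for f in lineage:
        part = [f(x) for x in part]
    return part

partitions = [recompute(0), recompute(1)]   # materialized partitions
partitions[1] = None                        # simulate losing a partition
partitions[1] = recompute(1)                # rebuild from lineage, not from a replica
```

Because only the (small) lineage log is kept, fault tolerance is cheap even when a single operation applies to millions of records.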
RDD operations
• Transformations
  • Lazily evaluated (executed by calling an action)
    • Reduces wait states
    • Better pipelining
• Actions
  • Run immediately
  • Return a value to the application or export to a storage system
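The lazy-transformation/eager-action split can be felt in plain Python generators (an analogy for the execution model, not Spark itself; the variable names are illustrative).

```python
data = range(1, 6)

# "Transformations" only build a pipeline; nothing is computed yet.
doubled = (x * 2 for x in data)           # like rdd.map(lambda x: x * 2)
large   = (x for x in doubled if x > 4)   # like .filter(lambda x: x > 4)

# The "action" forces the whole pipeline to run in one pass over the
# data, which is exactly what enables the pipelining mentioned above.
result = list(large)                      # like rdd.collect()
```

Until `list()` is called, no element has been doubled or filtered; the two steps then run fused, element by element, with no intermediate collection.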
Transformations:
• map(f: T ⇒ U)
• filter(f: T ⇒ Bool)
• groupByKey()
• join()
Actions:
• count()
• collect()
• reduce(f: (T, T) ⇒ T)
• save(path: String)
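The reduce(f: (T, T) ⇒ T) signature deserves a note: f combines two values of the same type into one, and for a distributed reduce it should be associative (and ideally commutative) so per-partition partial results can be merged in any order. A minimal sketch with Python's own reduce:

```python
from functools import reduce

nums = [1, 2, 3, 4, 5]
# Addition is associative and commutative, so partitions could be
# reduced independently and merged: like rdd.reduce(lambda a, b: a + b).
total = reduce(lambda a, b: a + b, nums)
```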
Spark program lifecycle
Create RDD (external data or parallelized collection) → Transformation (lazily evaluated) → Cache RDD (for reuse) → Action (execute computation and return results)
Spark in a cluster mode
* http://spark.apache.org/docs/latest/img/cluster-overview.png
PySpark
• Python API for Spark
• Easy-to-use programming abstraction and parallel runtime:
  • “Here’s an operation, run it on all of the data”
• Dynamically typed (RDDs can hold objects of multiple types)
• Integrates with other Python libraries, such as NumPy, Pandas, scikit-learn, Flask
• Run Spark from Jupyter notebooks
Spark DataFrames
DataFrames are a common data science abstraction that goes across languages.
A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable and each row contains one case.
A Spark DataFrame is a distributed collection of data organized into named columns, and can be created:
• from structured data files
• from Hive tables
• from external databases
• from RDDs
Some supported operations:
• slice data
• sort data
• aggregate data
• join with other DataFrames
DataFrame benefits
• Lazy evaluation
• Domain-specific language for distributed data manipulation
• Automatic parallelization and cluster distribution
• Integration with the pipeline API for MLlib
• Query structured data with SQL (using SQLContext)
• Integration with Pandas DataFrames (and other Python data libraries)
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

df = sqlContext.read.json("data.json")
df.show()
df.select("id").show()
df.filter(df["id"] > 10).show()
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

df = sqlContext.read.json("data.json")
df.registerTempTable("data")
results = sqlContext.sql("SELECT * FROM data WHERE id > 10")
Pandas DF vs Spark DF

Pandas DF                                                    | Spark DF
Single-machine tool (all data must fit in memory, except with HDF5) | Distributed (data > memory)
Better API                                                   | Good API
No parallelism                                               | Parallel by default
Mutable                                                      | Immutable

Some function differences: reading data, counting, displaying, inferring types, statistics, creating new columns (https://medium.com/@chris_bour/6-differences-between-pandas-and-spark-dataframes-1380cec394d2)
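The mutability row is the difference you hit first in practice. A small sketch (assuming Pandas is installed; the data is made up): a Pandas DataFrame can be updated in place, while a Spark DataFrame operation always returns a new DataFrame, e.g. df.withColumn("id", df["id"] * 10).

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3]})
df["id"] = df["id"] * 10   # Pandas: mutates df in place
# Spark equivalent (not run here): df2 = df.withColumn("id", df["id"] * 10)
```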
A very popular benchmark
* https://databricks.com/wp-content/uploads/2015/02/Screen-Shot-2015-02-16-at-9.46.39-AM-1024x457.png