
SPARK: THE NEW AGE FOR DATA SCIENTISTS

In a very short time, Apache Spark has emerged as the next generation big

data processing engine, and is being applied throughout the industry faster

than ever.

TABLE OF CONTENTS

GETTING STARTED
    WHAT IS APACHE SPARK?
    BIG DATA: WHAT IT IS AND WHY IT MATTERS
        Parallelism enables Big Data
        Big data defined
    WHY SPARK?
    FROM BIG CODE TO SMALL CODE
        What language should I choose for my next Apache Spark project?
A BRIEF HISTORY OF SPARK
    THE INTERVIEW
    THE SPARK GROWTH
    THE COMPANIES AROUND SPARK
A UNIFIED STACK
    SPARK CORE
        How Spark works
            RDD
            Key Value Methods
            Caching Data
        Distribution and Instrumentation
            Spark Submit
            Cluster Manager
    SPARK LIBRARIES
        Spark SQL
            Impala vs Hive: differences between SQL-on-Hadoop components
            Cloudera Impala
            Apache Hive
            Other features that Hive includes
        Streaming Data
        Machine Learning
            Machine Learning Library (MLlib)
        GraphX
            How does GraphX work?
    OPTIMIZATIONS AND THE FUTURE
        Understanding closures
            Local vs. cluster modes
        Broadcasting
        Optimizing Partitioning
        Future of Spark
            R
            Tachyon
            Project Tungsten
REFERENCES
CONTACT INFORMATION


Getting Started

In this section I will explain the concepts behind Spark: how Spark works, how it has been built, the advantages of using it, and a brief nod to the history of the MapReduce algorithm.

WHAT IS APACHE SPARK?

Apache Spark is an open-source framework developed by the AMPLab at the University of California, Berkeley, and later donated to the Apache Software Foundation. Unlike Hadoop's disk-based two-stage MapReduce paradigm, Spark's in-memory primitives can deliver performance up to 100 times better for certain workloads.

Apache Spark is a general-purpose distributed computing engine for processing and analyzing large

amounts of data. Spark offers performance improvements over MapReduce, especially when Spark’s in-

memory computing capabilities can be leveraged. In practice, Spark extends the popular MapReduce

model to efficiently support more types of computations, including interactive queries and stream

processing. Speed is important in processing large datasets or Big Data, as it means the difference between

exploring data interactively and waiting minutes or hours.

Figure 1 – Spark is the best solution for Big Data

BIG DATA: WHAT IT IS AND WHY IT MATTERS

Big data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured. Today Big Data is as important to business – and society – as the Internet has become.


Why? More data may lead to more accurate analyses. More accurate analyses may lead to more confident decision making. And better decisions can mean greater operational efficiencies, cost reductions and reduced risk.

Organizations are increasingly generating large volumes of data as a result of instrumented business processes, monitoring of user activity, web site tracking, sensors, finance, accounting, among other reasons. For example, with the advent of social network websites, users create records of their lives by daily posting details of activities they perform, events they attend, places they visit, pictures they take, and things they enjoy and want. This data deluge is often referred to as Big Data, a term that conveys the challenges it poses to existing infrastructure with respect to the storage, management, interoperability, governance and analysis of large amounts of data.

Parallelism enables Big Data

• Big Data

- Applying data parallelism to the processing of large amounts of data (terabytes, petabytes, …)
- Exploiting clusters of computing resources and the Cloud

• MapReduce

- A data parallelism technique introduced by Google for big data processing
- Large scale, distributing both data and computation over large clusters of machines
- Can be used to parallelize different kinds of computational tasks
- Open source implementation: Hadoop

Big data defined

As far back as 2001, industry analyst Doug Laney articulated the now mainstream definition of big data

as the three Vs of big data: volume, velocity and variety.

Figure 2 – Three Vs of Big Data

• Volume defines the amount of data: many factors contribute to the increase in data volume, such as transaction-based data stored over the years. In the past, excessive data volume was a storage issue.


But with decreasing storage costs, other issues emerge, including how to determine relevance

within large data volumes and how to use analytics to create value from relevant data.

• Velocity refers to the rate at which data is produced and processed, for example in batch, near-time, real-time or streaming mode. Data is streaming in at unprecedented speed

and must be dealt with in a timely manner. Reacting quickly enough to deal with data velocity is

a challenge for most organizations.

• Variety represents the data types. Data today comes in all types of formats: structured, numeric

data in traditional databases. Information created from line-of-business applications.

Unstructured text documents, email, video, audio, stock ticker data and financial transactions.

Managing, merging and governing different varieties of data is something many organizations

still grapple with.

In addition, many companies consider two additional dimensions when they think about big data:

• Variability. In addition to the increasing velocities and varieties of data, data flows can be highly

inconsistent with periodic peaks. Is something trending in social media? Daily, seasonal and

event-triggered peak data loads can be challenging to manage. Even more so with unstructured

data involved.

• Complexity. Today's data comes from multiple sources. And it is still an undertaking to link,

match, cleanse and transform data across systems. However, it is necessary to connect and

correlate relationships, hierarchies and multiple data linkages or your data can quickly spiral out

of control.

Figure 3 – The cube of Big Data


WHY SPARK?

On the generality side, Spark is designed to cover a wide range of workloads that previously required

separate distributed systems, including batch applications, iterative algorithms, interactive queries and

streaming.

By supporting these workloads in the same engine, Spark makes it easy and inexpensive to combine

different processing types, which is often necessary in production data analysis pipelines. In addition, it

reduces the management burden of maintaining separate tools.

Spark is designed to be highly accessible, offering simple APIs in Python, Java, Scala, and SQL, and rich

built-in libraries. It also integrates closely with other Big Data tools. In particular, Spark can run in Hadoop

clusters and access any Hadoop data source, including Cassandra, for example.

FROM BIG CODE TO SMALL CODE

The most difficult thing for big data developers today is choosing a programming language for big data

applications. There are many languages that a developer can use, such as Python or R: both are the

languages of choice among data scientists for building Hadoop applications. With the advent of various

big data frameworks like Apache Spark or Apache Kafka, Scala programming language has gained

prominence among big data developers.

What language should I choose for my next Apache Spark project?

The answer to this question varies, as it depends on the programming expertise of the developers, but Scala has become the language of choice for working with big data frameworks like Apache Spark. Moreover, Spark itself is mostly written in Scala; more precisely:

1. Scala 81.2%

2. Python 8.7%

3. Java 8.5%

4. Shell 1.3%

5. Other 0.3%

Figure 4 – The Growth of Scala


Writing a program in Scala in a functional style does not have to hurt performance. Functional programming is an abstraction that restricts certain programming practices, such as side-effecting I/O; enforcing this abstraction is what allows the code to be executed in a parallel execution environment. For this reason Spark is built around the functional paradigm, and Scala turns out to be the most efficient of the supported languages.

Figure 5 – From Big Code to Tiny Code
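To make the "big code to tiny code" idea concrete, here is a minimal word-count sketch in Scala, the classic example usually shown in this comparison. It is only an illustration under assumed defaults: the input path and the local master setting are hypothetical.

    // Minimal Spark word count in Scala (illustrative sketch; the input path is hypothetical).
    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
        val sc = new SparkContext(conf)

        val counts = sc.textFile("hdfs:///data/input.txt")   // RDD[String], one element per line
          .flatMap(_.split("\\s+"))                          // split each line into words
          .map(word => (word, 1))                            // pair each word with a count of 1
          .reduceByKey(_ + _)                                // sum the counts for each word

        counts.take(10).foreach(println)                     // bring a small sample back to the driver
        sc.stop()
      }
    }

The same logic in classic Hadoop MapReduce typically requires separate mapper and reducer classes plus job configuration code, which is exactly the contrast Figure 5 is meant to convey.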

In the next sections I will explain the advantages of Spark. A philosophy of tight integration has several benefits:

- All libraries and higher-level components in the stack benefit from improvements at the lower layers. For example, when Spark's core engine adds an optimization, the SQL and machine learning libraries automatically speed up as well.

- The costs associated with running the stack are minimized, because instead of running 5-10 independent software systems, an organization needs to run only one. These costs include deployment, maintenance, testing, support, and others.

This also means that each time a new component is added to the Spark stack, every organization that uses

Spark will immediately be able to try this new component. This changes the cost of trying out a new type

of data analysis from downloading, deploying, and learning a new software project to upgrading Spark.

The advantages are the following:
- Readability and expressiveness
- Speed and testability
- Interactivity
- Fault tolerance
- A unified approach to Big Data


A Brief History of Spark

Apache Spark is one of the most interesting frameworks in big data in recent years. Spark had its humble

beginnings as a research project at UC Berkeley. Recently, O'Reilly's Ben Lorica interviewed Ion Stoica, UC Berkeley professor and Databricks CEO, about the history of Apache Spark.

The question was: "How did Apache Spark start?"

THE INTERVIEW

Ion Stoica, professor and Databricks CEO, replied:

“the story started back in 2009 with Mesos1. It was a class project in UC Berkley. Idea was to

build a cluster management framework, which can support different kind of cluster

computing systems. Once Mesos was built, then we thought what we can build on top of Mesos.

That’s how spark was born.”

“We wanted to show how easy it was to build a framework from scratch in Mesos. We also

wanted to be different from existing cluster computing systems. At that time, Hadoop targeted

batch processing, so we targeted interactive iterative computations like machine learning.”

“The requirement for machine learning came from the head of department at that time. They

were trying to scale machine learning on top of Hadoop. Hadoop was slow for ML as it needed

to exchange data between iterations through HDFS. The requirement for interactive querying

came from the company, Conviva, which I founded that time. As part of video analytical tool

built by company, we supported adhoc querying. We were using MySQL at that time, but we

knew it’s not good enough. Also there are many companies came out of UC, like Yahoo,

Facebook were facing similar challenges. So we knew there was need for framework focused

on interactive and iterative processing.2”

THE SPARK GROWTH

Spark is an open source project that has been built and is maintained by a thriving and diverse community

of developers. If you or your organization are trying Spark for the first time, you might be interested in

the history of the project. Spark started in 2009 as a research project in the UC Berkeley RAD Lab, later to

become the AMPLab. The researchers in the lab had previously been working on Hadoop Map-Reduce,

and observed that MapReduce was inefficient for iterative and interactive computing jobs. Thus, from the

1 Mesos is a distributed systems kernel: it abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to be built easily and run effectively. A later section explains more precisely what Mesos is.

2 (Phatak, 2015)


beginning, Spark was designed to be fast for interactive queries and iterative algorithms, bringing in ideas

like support for in-memory storage and efficient fault recovery. Research papers were published about Spark at academic conferences, and soon after its creation in 2009 it was already 10–20× faster than MapReduce for certain jobs.

Figure 6 – The History of Spark

In 2011, the AMPLab started to develop higher-level components on Spark, such as Shark (Hive on Spark)

and Spark Streaming. These and other components are sometimes referred to as the Berkeley Data

Analytics Stack (BDAS). Spark was first open sourced in March 2010, and was transferred to the Apache

Software Foundation in June 2013, where it is now a top-level project.

THE COMPANIES AROUND SPARK

Today, many companies use Apache Spark. Companies like eBay, Pandora, Twitter and Netflix use Spark for fast query analysis, improving their business.

Figure 7 – Who’s using Spark?


A Unified Stack

The Spark project contains multiple closely integrated components. At its core, Spark is a “computational

engine” that is responsible for scheduling, distributing, and monitoring applications consisting of many

computational tasks across many worker machines, or a computing cluster. Because the core engine of

Spark is both fast and general-purpose, it powers multiple higher-level components specialized for

various workloads, such as SQL or machine learning.

Figure 8 – The Spark stack

Finally, one of the largest advantages of tight integration is the ability to build applications that seamlessly

combine different processing models. For example, in Spark you can write one application that uses

machine learning to classify data in real time as it is ingested from streaming sources. Simultaneously,

analysts can query the resulting data, also in real time, via SQL (e.g., to join the data with unstructured

logfiles). In addition, more sophisticated data engineers and data scientists can access the same data via

the Python shell for ad hoc analysis. Others might access the data in standalone batch applications. All the

while, the IT team has to maintain only one system. Figure 8 illustrates this at a glance, highlighting the Spark architecture.

SPARK CORE

Spark Core contains the basic functionality of Spark, including components for task scheduling, memory

management, fault recovery, interacting with storage systems, and more. It provides distributed task

dispatching, scheduling, and basic I/O functionalities, exposed through an application programming

interface (for Java, Python, Scala, and R) centered on the RDD abstraction. Spark Core is based on the

Resilient Distributed Datasets (RDDs), which are Spark’s main programming abstraction.

More precisely, an RDD is a huge list of objects in memory, but distributed. You can imagine it as a huge collection spread like a tree across many compute nodes: a big set of objects that we can distribute and parallelize as we want, so in practice it scales horizontally reasonably well. A "driver" program invokes parallel operations such as map, filter or reduce on an RDD by passing a function to Spark, which


then schedules the function's execution in parallel on the cluster3. These operations, and additional ones

such as joins, take RDDs as input and produce new RDDs. RDDs are immutable and their operations are

lazy; fault-tolerance is achieved by keeping track of the "lineage" of each RDD, the sequence of operations

that produced it, so that it can be reconstructed in the case of data loss. RDDs can contain any type of Python,

Java, or Scala objects.

In this section I will show you the basics of Apache Spark: its working mechanism, what an RDD is, the difference between transformations and actions, and the caching mechanism.

How Spark works

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext

object in your main program (called the driver program). Specifically, to run on a cluster, the SparkContext

can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos

or YARN), which allocate resources across applications. Once connected, Spark acquires executors on

nodes in the cluster, which are processes that run computations and store data for your application. Next,

it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors.

Finally, SparkContext sends tasks to the executors to run.

Figure 9 – The architecture of Spark, inside the Core

The Spark architecture follows the Master-Worker pattern: the Driver Program (Master), controlled by the SparkContext component, orchestrates the work by sending the various jobs to the Workers; in particular, the work is divided into chunks for each Executor, which computes its part as quickly as possible. Once the Workers finish computing their requests, they send the results back to the Master, which aggregates them according to the DAG (Directed Acyclic Graph). The work continues until all chunks have been computed.

3 (Zaharia, Chowdhury, Franklin, Shenker, & Stoica, 2011)


RDD

At the heart of Apache Spark lies the MapReduce model: MapReduce and its variants have been highly successful in implementing large-scale data-intensive applications on commodity clusters. However, these systems are built around an acyclic data flow model (a DAG) that is not well suited to other popular applications, in particular those that reuse a working set of data across multiple parallel operations. To serve exactly those workloads, Spark introduces an abstraction called Resilient Distributed Datasets (RDDs).

An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a

partition is lost.

Thanks to the RDD and DAG model, Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query datasets of hundreds of gigabytes with sub-second response time.

Definition

RDD is a distributed memory abstraction that lets programmers perform in-memory computations on

large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications:

- Iterative algorithms

- Interactive data mining tools

In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted

form of shared memory, based on coarse-grained transformations rather than fine-grained updates to

shared state.

Behavior

An RDD - Resilient Distributed Dataset - is a collection of elements partitioned across the nodes of the cluster, which can be computed on in parallel. The workers can access the RDD only in read mode, so they cannot change it.

More precisely:

- Datasets: elements such as lists or arrays

- Distributed: the work is distributed across the cluster, so the computation can take place in

parallel

- Resilient: given that the work is distributed, Spark has powerful fault-tolerance management: if a job on a node crashes, Spark is able to detect the crash and immediately restart the job on another node.

Because many operations in Spark are lazy, rather than being computed immediately, one by one, these operations are recorded in what is called the DAG, the Directed Acyclic Graph, making fault tolerance as simple as possible.

The DAG grows with every transformation that is carried out, such as:

- map
- filter
- (other lazy operations)


Once all the results have been collected in a DAG, they can be reduced, and then brought to the desired

result, through the following actions:

- Collect

- Count

- Reduce

- …

The results can be saved in local storage and then sent to the Master (driver), or sent directly.

Transformation

Spark offers an environment that supports the functional paradigm, and thanks to Scala many operations are reduced to the minimum possible number of lines, as in the sketch below:
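The code figure that originally accompanied this point is not reproduced here; the following is a minimal sketch of lazy transformations, assuming an already created SparkContext named sc.

    // Transformations are lazy: nothing runs until an action is invoked.
    val numbers = sc.parallelize(1 to 1000000)     // distributed dataset (RDD[Int])
    val evens   = numbers.filter(_ % 2 == 0)       // lazy: only recorded in the DAG
    val squares = evens.map(n => n.toLong * n)     // lazy: still no computation performed
    // At this point no job has been launched; Spark has only built up the lineage.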

Actions

While all the transformations are lazily evaluated, the actions aggregate the data and return the result to the Master (Driver Program). More precisely, each action aggregates partial results directly on each node, which then sends them to the Master to be combined; this policy, inherited from the Hadoop world, was designed to minimize network traffic as much as possible.

Associative properties

Thanks to the associative property, Spark can aggregate the data regardless of the order in which the partial results arrive. The SparkContext can therefore organize the cluster in different ways, relying on the associativity that characterizes distributed computing.

Figure 10 – Associative Property of Spark

Action on Data

Whenever we want to "see" the result, the collect action computes and returns the entire contents of the RDD.


Figure 11
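The collect call and the other actions mentioned above can be sketched as follows; this is a hedged, self-contained example assuming an existing SparkContext sc.

    // Actions force evaluation of the lazy DAG and return results to the driver.
    val values  = sc.parallelize(1 to 100)
    val evens   = values.filter(_ % 2 == 0)      // lazy transformation
    val total   = evens.reduce(_ + _)            // action: partial sums on the executors, combined on the driver
    val howMany = evens.count()                  // action: number of elements in the RDD
    val sample  = evens.take(5)                  // action: only the first five elements reach the driver
    val all     = evens.collect()                // action: returns everything; use it only on small results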

Key Value Methods

Spark works with <key, value> pairs to transform and aggregate the data that you want to process. For example, these are some of the functions that you can use:

Figure 12 – RDD functions

Example:

- Given the following list of key-value pairs:

Figure 13 – List of key-value pairs

- The function highlighted above (rddToPairRDDFunctions) divides the table across different nodes. Note that equal keys are assigned to the same node.


Figure 14

- Now, each node can perform the work of transformation and action, and then send the result

to the Master.
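As a small, hedged illustration of the key-value methods described above (again assuming an existing SparkContext sc):

    // Pair-RDD sketch: identical keys are aggregated on the same node.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 4), ("a", 2), ("c", 7), ("b", 3)))

    val sums   = pairs.reduceByKey(_ + _)   // ("a",3), ("b",7), ("c",7)
    val groups = pairs.groupByKey()         // ("a", Iterable(1, 2)), ...
    val sorted = pairs.sortByKey()          // pairs ordered by key

    sums.collect().foreach(println)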

Caching Data

- Suppose, for example, that we have a simple RDD, and we want to run a compute-intensive algorithm on it, which we simulate with the Thread.sleep() method.

- The result is another RDD with the transformed data.

- Suppose that we then apply a simple map function, which is not as computationally heavy as the first one; we obtain the following result:

- Now, maybe the result is not the one you want. Instead of recalculating everything from scratch, Spark can store the intermediate result in the cache. By default, Spark uses RAM, but if specified you can also spill to larger storage such as the hard disk (beware: in this case a bottleneck can occur!).

- If, at this point, you want to continue the analysis, the data you will use is the data in the cache:

With this method there are two benefits


1. Speed: the data, especially intermediate results, are available in a short time.

2. Fault tolerance: if the heavy computation in the first transformation crashed, the data would be recalculated and made available again, without the user noticing it at all.
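A minimal caching sketch corresponding to the steps above, with the heavy computation simulated by Thread.sleep() as in the text (SparkContext sc assumed):

    // Cache an intermediate RDD so the expensive step is not recomputed.
    val raw = sc.parallelize(1 to 100)

    val expensive = raw.map { n =>
      Thread.sleep(10)   // simulate a computationally heavy transformation
      n * 2
    }.cache()            // keep the result in memory (persist() allows other storage levels, e.g. disk)

    expensive.count()                  // first action: the heavy map runs and its output is cached
    expensive.map(_ + 1).collect()     // subsequent work reuses the cached data instead of recomputing it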

Distribution and Instrumentation

This section explains the system-level aspects of Spark: where the data flows and which components are activated when an action occurs.

Spark Submit

In Spark, MapReduce-style jobs are launched through the spark-submit command. Applications launched in Spark refer to a main program and are sent by the driver to the cluster manager.

Figure 15

At this point, depending on how your cluster is configured, the driver may address the machine on which Spark is installed directly (for example, in local mode), or the Master node at the head of the distributed cluster. This choice is made in the launch phase of the main program.

The driver then asks the cluster manager for the specific resources to be used in the computation. Depending on availability, the cluster manager grants the resources that can be used for the MapReduce-style processing.

Figure 16

At this point, the driver runs the main application, building up the RDD lineage until an action occurs; this causes the work on the RDD to be sent to all the nodes of the cluster. Once the distributed workflow is complete, execution continues inside the driver until the application itself terminates.


Figure 17

Now, the resources are freed. Note: the resources may also be released manually before the job is

completed.

To better understand the potential offered by spark-submit, you can browse the help printed directly in the command shell:

Figure 18 – List of commands in Spark

1. First, we note that spark-submit can ship to the driver only compiled programs (JAR files for Java/Scala) or Python files, together with the appropriate options that will be shown shortly.

2. Through the kill or status commands, you can forcibly terminate or monitor the jobs present in Spark, passing the appropriate submission ID.

3. The --master option determines whether to run Spark locally or on the cluster, if one is available, optionally specifying the number of cores you want to use.

4. --deploy-mode accepts only two values, client or cluster. It specifies where the driver, explained above, resides: with client, the driver resides on the machine from which you launch the computation, which obviously must be connected to the computing cluster via a fast link. The cluster option, on the other hand, offers a "fire-and-forget" service, sending the computation directly to the Master.

5. Another possible option is specifying the main class to run, through the --class option.

6. The --name option assigns a name to the application that will be run.

7. --jars or --py-files specifies the additional files you want to ship with the application. Note that both Scala and Java programs are packaged as JAR files.

8. The --executor-memory option (not shown in the screenshot) sets the amount of memory that each launched executor can use. The default is 1 GB, but you can change it depending on the power required.

Cluster Manager

Apache Spark's cluster manager is distributed. Its main goal is managing and distributing the workload, assigning tasks to the machines that compose the cluster.

Figure 19 – Cluster Manager View

The kernel that would normally manage a single machine (in this case, the Master) is, in practice, "distributed" across the entire cluster.

Main Cluster Manager

Figure 20 – Main Cluster Manager

Spark ships with its own cluster manager: Spark Standalone.

The Hadoop cluster manager, on the other hand, is YARN, very convenient both in terms of infrastructure and in terms of ease of use.

Apache Mesos is another product created at UC Berkeley, in competition with YARN.


Standalone Manager

The Spark Standalone manager is simple and quite convenient if you are looking for a manager with just the basic functions.

To use it with spark-submit, just enter the host and port of the machine on which you want to launch the process (the default port is 7077).

The standalone manager works in both client mode and cluster mode.

Another very helpful setting is spark.deploy.spreadOut: when enabled, it tries to distribute the workload across the nodes of the cluster, involving as many machines as possible. If this parameter is set to false, on the other hand, the cluster manager tries to work with as few nodes as possible, grouping the work together.

The --total-executor-cores option lets you specify how many cores an application may use in total across the cluster. By default all available cores in the cluster are used, but you can specify a lower number if you want to leave resources free for future computations.

Apache Mesos Manager

The master URL is much the same as in standalone mode; the only difference is the port, which is 5050.

Mesos supports both deploy modes, client and cluster; it is not installed by default and must be installed separately.

Mesos can allocate resources dynamically in two ways, fine-grained and coarse-grained. By default it uses fine-grained mode; the difference lies in how the various jobs are allocated:


o fine-grained: each Spark task is mapped to and run as a Mesos task. This allows applications to scale their resources dynamically when needed4;

o coarse-grained: this mode reserves a large amount of resources for the whole duration of the launched application. This is best if you want maximum performance, since each application can determine its best configuration and keep it.

If spark.mesos.coarse is set to true, you can also set the number of cores to be used on each node in the cluster; by default all available cores are used.

Therefore, unlike with the Standalone manager, you cannot choose the kind of spread you want: by default, Mesos uses as few nodes as possible, minimizing the distribution of work within the cluster.

Hadoop YARN

Unlike the previous cluster managers, to launch a spark-submit process you simply pass --master yarn, since the HADOOP_CONF_DIR / YARN_CONF_DIR configuration points to the cluster manager to connect to.

Client / cluster: with this manager too you can choose the deploy mode of the computation.

The number of executors is set by default to 2 in this manager, which causes problems if you want to improve performance and scale.

You can specify the number of cores to be used by each executor, which by default is set to 1.

Unlike all the other managers, YARN has the concept of queues: you can specify the queue to which the jobs submitted to the cluster manager belong.

SPARK LIBRARIES

In this section we will look at how Spark is used at a practical level, that is, how it is used to query the Hadoop file system and which features it therefore makes available to us.

Spark SQL

This functionality is called Spark SQL because it provides a working environment where you can query any dataset stored in Hadoop with a language very similar to SQL.

Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces

provided by Spark SQL provide Spark with more information about the structure of both the data and the

4 This type of resource management, for example, is not optimal if you are running streaming applications, where applications require low latency; but optimal for a parallel use of multiple processes launched by different instances: the resources needed for a parallel use are automatically allocated and distributed.


computation being performed. Internally, Spark SQL uses this extra information to perform extra

optimizations. There are several ways to interact with Spark SQL including SQL, the DataFrames API and

the Datasets API. When computing a result, the same execution engine is used, independent of which

API/language you are using to express the computation. This unification means that developers can easily

switch back and forth between the various APIs based on which provides the most natural way to express

a given transformation.

One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL. Spark

SQL can also be used to read data from an existing Hive installation. For more on how to configure this

feature, please refer to the Hive Tables section. When running SQL from within another programming

language the results will be returned as a DataFrame. You can also interact with the SQL interface using

the command-line or over JDBC/ODBC.

This chapter shows only a single example of how functional-style code can be used to query the Hadoop cluster. The language chosen for the example is Scala, the most powerful of those supported by the Spark community.

The Spark SQL interface supports automatically converting an RDD of case classes into a DataFrame5. The case classes define the schema of the table. The names of the fields of the classes are read using reflection (the ability of a program to inspect itself, in particular the structure of its own source code) and become the column names. Classes can also be nested or contain complex types such as Sequences and Arrays. Such an RDD can be converted implicitly into a DataFrame and then registered as a table, and the tables can subsequently be used in SQL statements.

An excerpt of this kind of code is shown in Figure 21:

5 A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs. The DataFrame API is available in Scala, Java, Python, and R.


Figure 21 – Example Code
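The code in Figure 21 is not reproduced here, so the following is a minimal sketch of the reflection-based approach just described, assuming Spark 1.x-style APIs, an existing SparkContext sc and a hypothetical people.txt file containing "name,age" lines.

    // Infer a DataFrame schema from a case class and query it with SQL.
    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Int)

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._                 // enables .toDF() on RDDs of case classes

    val people = sc.textFile("people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))
      .toDF()

    people.registerTempTable("people")            // expose the DataFrame as a SQL table

    val teenagers = sqlContext.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
    teenagers.collect().foreach(println)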

Impala vs Hive: differences between SQL-on-Hadoop components

Apache Hive was introduced by Facebook to manage and process large amounts of distributed data sets saved on Hadoop. Apache Hive is an abstraction over Hadoop MapReduce that allows you to query data with its own SQL-like language, HiveQL.

Cloudera Impala, by contrast, was developed to try to overcome the limitations posed by the slow SQL interaction with Hadoop. Cloudera Impala provides high performance for processing and analyzing data saved on a Hadoop cluster.

Cloudera Impala | Apache Hive

- It does not require data stored in Apache HDFS and HBase to be moved or transformed before querying. | It is a data warehouse infrastructure built on top of the Hadoop platform to support tasks such as querying, analyzing, processing and displaying data.

- Impala uses code generated at runtime for the "big loops", using LLVM (Low Level Virtual Machine). | Hive generates queries at compile time.

- Impala avoids startup costs: its processes are daemons started when the nodes boot, always ready to plan, coordinate and execute queries. | All Hive processes suffer from a "cold start" problem, since they are compiled at compile time.

- Impala is very fast at parallel processing and follows the SQL-92 standard. | Hive translates queries written in its own "dialect" (HiveQL) into a DAG of MapReduce jobs that execute them.

Table 1 – Impala vs. Hive

Cloudera Impala

Cloudera Impala is an excellent choice for developers who do not need to move large amounts of data: it integrates easily with the Hadoop ecosystem and with frameworks like MapReduce, Apache Hive, Pig and other Apache Hadoop software. The architecture is specifically designed around the familiar SQL language, and it supports users making the transition from relational databases to big data.

Columnar Storage

Data in Impala is saved in a columnar layout, providing a high compression ratio and very efficient scanning.

Figure 22 – Columnar Storage with Impala

Tree Architecture

The tree architecture is crucial to achieving distributed computation: the query is delivered to the nodes and the results are aggregated once the sub-processing is finished.

Figure 23 – Tree Architecture

Impala offers the following features:

- Support for HDFS (Hadoop Distributed File System) and Apache HBase storage
- It recognizes all the common Hadoop file formats: text, LZO, SequenceFile, Avro, RCFile and Parquet
- Support for Hadoop security (Kerberos authentication)
- It can easily read metadata, and it handles the ODBC driver and the SQL syntax from Apache Hive

Apache Hive

Initially developed by Facebook, Apache Hive is a data warehouse infrastructure built on top of Hadoop

to provide a querying tool, analysis, processing, and visualization. It is a versatile tool that supports

analysis of large databases stored in Hadoop HDFS and other compatible systems like Amazon S3. To maintain compatibility with the SQL language it uses HiveQL, which converts queries into MapReduce jobs and submits them to the cluster.

Other features that Hive includes:

- Indexing to speed up computation
- Support for different storage formats, such as text files, RCFile, HBase, ORC and others
- Facilities for building user-defined functions (UDFs) to manipulate strings, dates, ...

If you are looking for an advanced analytics language with a great deal of familiarity with SQL (without writing separate MapReduce jobs), Apache Hive is the best choice. HiveQL queries are still converted into the corresponding MapReduce jobs, which run on the cluster and return the final result.

Streaming Data

Spark Streaming is an extension of the core API that enables stream processing that is scalable, high-throughput and fault-tolerant. Data can be ingested from various sources like Kafka, Flume, HDFS/S3, Kinesis, Twitter or plain TCP sockets, and can be processed using complex algorithms expressed at a high level thanks to functional operators such as map, reduce, join and window. Afterwards, the processed data can be written to a filesystem or to a database.

Figure 24 – Streaming Data

Internally, Spark works as follows: it receives live streaming data and divides it into separate batches, which are then processed by the Spark engine to generate the final stream containing the results.

Figure 25 – Streaming Data in detail


Spark Streaming provides a high-level abstraction layer called Discretized Stream or, more simply,

DStream, which represents a continuous stream of data. DStreams can be created from input data streams from sources such as Kafka, Flume and Kinesis, or simply through high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.
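As a hedged sketch of the DStream concept, the snippet below counts words arriving on a TCP socket in 5-second micro-batches; the host, port and batch interval are arbitrary choices for illustration.

    // Minimal Spark Streaming word count over a socket source.
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))      // micro-batches of 5 seconds

    val lines  = ssc.socketTextStream("localhost", 9999)   // DStream[String]
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)                                  // word counts within each batch

    counts.print()          // show a few results of every batch on the driver
    ssc.start()             // start receiving and processing data
    ssc.awaitTermination()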

Machine Learning

Machine learning algorithms are constantly growing and evolving and are becoming ever more efficient. To understand machine learning in Spark, it is useful to look at the evolution the library has gone through.

Machine Learning Library (MLlib)

MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and

easy. It consists of common learning algorithms and utilities, including classification, regression,

clustering, collaborative filtering, dimensionality reduction, as well as lower-level optimization primitives

and higher-level pipeline APIs.

It divides into two packages:

spark.mllib contains the original API built on top of RDDs.

spark.ml provides higher-level API built on top of DataFrames for constructing ML pipelines.

Using spark.ml is recommended because with DataFrames the API is more versatile and flexible. The Spark community will keep supporting spark.mllib alongside the development of spark.ml, so users can keep using spark.mllib features and expect more to come. Developers should contribute new

algorithms to spark.ml if they fit the ML pipeline concept well, e.g., feature extractors and transformers.

In fact, on the one hand, MATLAB and R are easy to use but are not scalable. On the other hand, tools such as Mahout and GraphLab grew up that are very scalable but difficult to use.


Figure 26 – MLlib Architecture

The Spark MLlib library aims to make the practice of machine learning scalable and relatively easy. It consists of a set of algorithms that include regression, clustering, collaborative filtering, dimensionality reduction and other low-level primitives. The architecture also optimizes the high-level APIs provided on top of them, and the library ships with tools to perform feature extraction and transformation.
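To give a flavor of the lower-level spark.mllib API, here is a hedged k-means sketch on a tiny in-memory dataset (SparkContext sc assumed; the data and parameters are made up for illustration).

    // k-means clustering with spark.mllib.
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),    // one cluster near the origin
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.2)     // another cluster far away
    ))

    val model = KMeans.train(points, 2, 20)                // 2 clusters, at most 20 iterations
    model.clusterCenters.foreach(println)                  // inspect the learned centroids
    println(model.predict(Vectors.dense(0.05, 0.05)))      // assign a new point to a cluster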

GraphX

GraphX is the Spark component for graphs and graph-parallel computation. GraphX extends Spark's RDDs by introducing a new graph abstraction, attaching to them the concepts of vertex and edge. To support this, GraphX provides computation operators such as subgraph, joinVertices and aggregateMessages. In addition, GraphX includes a growing collection of graph algorithms to simplify high-level analysis.

Figure 27 – How use GraphX

The WWW itself is a gigantic graph; multinationals like Twitter, Netflix and Amazon have used this concept to expand their markets and their profits, and in the world of medicine graphs are widely used in research.

How does GraphX work?

A GraphX graph is a directed multigraph with properties attached to its vertices and edges. In particular, each vertex may have associated metadata, as well as a unique identifier within the graph.


In this case, each point represents a person, as in the everyday "social graph". In GraphX this object is called a Vertex, which consists of:

- (VertexId, VType): simply a key-value pair

The arrows connect the vertices; in the example they are created from the relationships between the people stored in the vertices. In Spark these are called edges, and they store the IDs of the connected vertices and the type of relationship between them:

- Edge(VertexId, VertexId, EType)

The Spark graph is built on top of these simple elements: in particular, the graph is represented by two different RDDs, one containing the vertices and the other containing the edges.

In addition, there is another concept to keep in mind when working with graphs, namely the view that exists over each connection: the graph exposes a comprehensive EdgeTriplet, containing all the information associated with a connection. This provides a much more complete view of the graph and makes reasoning about it much simpler.
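A minimal GraphX sketch of the vertex/edge/triplet concepts above, assuming an existing SparkContext sc; the people and relationships are invented for illustration.

    // Build a tiny property graph and read its triplets.
    import org.apache.spark.graphx.{Edge, Graph}

    val vertices = sc.parallelize(Seq(
      (1L, "Alice"), (2L, "Bob"), (3L, "Carol")            // (VertexId, vertex attribute)
    ))
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, "follows"), Edge(2L, 3L, "friend")      // (source, destination, edge attribute)
    ))

    val graph = Graph(vertices, edges)

    println(graph.numVertices)                             // 3
    graph.triplets.collect().foreach { t =>                // EdgeTriplet: source, edge and destination together
      println(s"${t.srcAttr} ${t.attr} ${t.dstAttr}")
    }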


OPTIMIZATIONS AND THE FUTURE

In this final part I look at some recurring issues, especially to understand how Spark behaves under the hood. I will try to explain how the following constructs work and what problems they may raise:

- Closures
- Broadcast
- Partitioning

Understanding closures

One of the most difficult things to understand in Spark is understanding the scope and lifetime of variables

and methods when the code is executed on the cluster.

RDD operations that modify variables outside of their scope can be a frequent source of confusion. The example below uses a foreach() to increment a counter, but similar problems can occur for other well-known operations.

Example: consider summing the elements of an RDD, whose behavior may differ depending on the node on which it is processed. This is a common example of what happens when we move Spark from local mode to a cluster (via YARN, for example):

Local vs. cluster modes

The behavior of the code shown in the image above is not clearly defined and might not work as intended. To run jobs, Spark breaks the processing of the RDD into tasks, each executed by an executor. Before execution, Spark computes the closure of each task. The closure consists of all the variables and methods that must be visible to the executor to perform its computation on the RDD (in this case, the foreach statement). This closure is serialized and sent to each executor.

The variables within the closure are copied and sent to every executor, so when the counter is updated inside the foreach, it does not refer to the counter on the Master node (driver). There is still a counter in the driver node's memory, but it is not visible to the executors, which "see" only their serialized copy of the closure. So the final value of the counter on the driver will still be zero, since all operations on the counter referenced the value within the serialized closure.


In local mode, the foreach statement may run in the same JVM as the driver and reference the original counter: it could then update it directly, behaving differently from cluster mode.

To ensure well-defined behavior in these kinds of scenarios, you should use an Accumulator. Accumulators in Spark provide a mechanism for safely updating a variable by aggregating the updates that the various worker nodes send back to the driver, as sketched below.
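The following hedged sketch contrasts the broken counter pattern with an accumulator, using the Spark 1.x accumulator API (later versions expose sc.longAccumulator instead); SparkContext sc is assumed.

    // A plain var is captured in each task's serialized closure; an accumulator is merged back on the driver.
    val data = sc.parallelize(1 to 100)

    var plainCounter = 0
    data.foreach(x => plainCounter += x)   // each executor updates only its own copy of the closure
    println(plainCounter)                  // in cluster mode this still prints 0 on the driver

    val acc = sc.accumulator(0)            // driver-side accumulator (Spark 1.x API)
    data.foreach(x => acc += x)            // task updates are sent back and aggregated on the driver
    println(acc.value)                     // 5050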

Broadcasting

In the previous section you saw how Spark sends a copy of the data needed for the computation to the various workers (nodes) in the cluster. Without a policy and a well-designed strategy, this kind of operation can cause problems, given that the computations Spark must perform are massive and the data involved can be on the order of gigabytes.

This section first shows a poorly managed distribution of data to the workers, and then the strategy implemented in Spark to solve the problem.

Example: lazy operations and closures of type map and flatMap are provided, respectively. Given a Master (driver) and three Workers:

1. An RDD of 1 MB in size (a dummy variable) is created inside the driver.

2. Whenever the workers run the corresponding tasks, a copy of this data is sent to each of them.

3. The 1 MB of data, if requested, for example, 4 times by the first worker, becomes 4 MB at the local level. At a distributed scale, across all the workers of even this small network, the volume shipped could grow in an uncontrolled manner.


In a network of millions of nodes, this must not be allowed to happen.

In Spark this issue is resolved as follows: the shared data is wrapped in a broadcast variable, which handles sending it to the workers in parallel before the lazy operations run.

The broadcast function, first of all, compresses the data as much as possible; then:

1. The broadcast data is sent once and only once to each worker.

2. Each worker caches its copy of the broadcast data.

This keeps the growth described above under control, even if the workers repeatedly need the same data. Not only that: to mitigate the bottleneck that can arise between driver and workers, the workers can exchange broadcast data among themselves in a manner similar to BitTorrent networks.

This means, for example, that if the first worker has already received the broadcast data to perform some computation, and the second worker also needs it, the two workers can help each other, minimizing the calls to the driver that could otherwise cause an annoying bottleneck and add latency.
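A hedged broadcast sketch of the mechanism just described: a large read-only lookup structure is shipped once per worker instead of once per task (SparkContext sc assumed, sizes invented).

    // Broadcast a read-only structure so executors reuse one cached copy.
    val bigLookup = (1 to 1000000).map(i => i -> i * 2).toMap   // large driver-side map

    val broadcastLookup = sc.broadcast(bigLookup)   // compressed and sent to each worker only once

    val data   = sc.parallelize(1 to 1000)
    val mapped = data.map(x => broadcastLookup.value.getOrElse(x, 0))   // tasks read the worker-local copy

    println(mapped.sum())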


Optimizing Partitioning

The previous section showed how an uncontrolled distribution of data to the workers can grow out of control. This section shows, instead, how a disproportionate use of partitioning can also hurt if misused. The problem is best explained through an example.

Given the following RDD:

- a "huge" dataset, the integers from 1 to Int.MaxValue
- divided into 10,000 partitions

1. Perform a filter that shrinks the result as much as possible.
2. Apply a sort.
3. Perform a map that adds 1 to each element of the final array.
4. Show the result using the collect action.

In this case, if we inspect at the system level how the nodes of the network actually work, the results are surprising. First, we can see that the execution was split into 2 stages of approximately 10,000 tasks each, the first of which (sortBy) completed in 13 seconds, while the second (collect) took 20 seconds.

If we analyze the result further and drill down into the collect stage, we note that the sortBy, performed after the filtering, took a full 20 seconds. This is because Spark has been asked to use 10,000 tasks to order 10 values (the sortBy is performed, as the code is written, after the filtering):


Figure 28 – Systemistic View

As you can see, this overhead is harmful if not controlled: the image above shows that 10,000 tasks are executed, and almost every one of them spends essentially zero time on the required computation. Simply put, the scheduling and synchronization time of the various executors is far larger than the actual computation, increasing the overhead and the user-perceived latency out of all proportion.

A technique the user can apply to control the number of tasks is the coalesce function, which here reduces the number of partitions from 10,000 to 8 after the filter has run, as in the sketch below. The second argument passed to coalesce indicates the shuffle behavior: if it is set to true, the output is distributed equally across the nodes; otherwise, if it is set to false, as few partitions as possible are used.
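The experiment and the coalesce fix described above can be sketched as follows (SparkContext sc assumed; partition counts taken from the text).

    // A huge range split into 10,000 partitions, filtered down to a handful of values.
    val huge = sc.parallelize(1 to Int.MaxValue, 10000)

    // Without coalescing, the sort and collect still schedule ~10,000 tiny tasks.
    val slow = huge.filter(_ < 10).sortBy(identity).map(_ + 1).collect()

    // Coalescing after the filter shrinks the job to 8 partitions;
    // shuffle = true redistributes the (now tiny) data evenly across them.
    val fast = huge.filter(_ < 10).coalesce(8, shuffle = true).sortBy(identity).map(_ + 1).collect()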

If you monitor the execution time, the result is always the same, but the performance is far better: in fact, the collect stage after the filtering, in the second test, takes only 4 seconds compared to the 20 used previously.

In detail:


It is evident that the sort, this time, takes only 24 milliseconds, requiring the execution of 8 tasks in total. The coalesce function saves an enormous amount of time by reducing the number of partitions that have to be processed when actions are executed.

Future of Spark

This section will show some features that the community around Spark is trying to develop and advance.

R

Integration with R is definitely the first "frontier" to be conquered in order to bring together the community of mathematicians and statisticians with that of engineers and data scientists. R offers a user-friendly development environment from a statistical and mathematical point of view and is widely used in the community for data analysis.

Tachyon

Why does Spark turn out to be up to 100 times faster than MapReduce? Part of the answer is provided by the community working on Tachyon. Memory is the key to processing Big Data quickly and scalably: Tachyon is a memory-centric distributed storage layer adopted by Spark which, in addition to providing automatic fault tolerance and caching techniques, provides reliable memory sharing across the cluster. Continuing to develop this type of service would make big data analysis even faster and more powerful than it already is.

Project Tungsten

Another effort developed by the community that has rallied around Spark is Project Tungsten: the project changes the execution core on the nodes of the cluster, substantially improving the efficiency of memory and CPU usage for Spark applications and pushing performance as far as possible. The fundamental observation behind this research is that modern workloads are increasingly bound by CPU and memory rather than by I/O; the goal is therefore to minimize I/O overhead and to extract the maximum performance from the CPU and memory of the machines used as a cluster.


References

Karau, H., Konwinski, A., Wendell, P., & Zaharia, M. (2015). Learning Spark. Sebastopol, CA: O'Reilly Media.

Mortar Data Software. (2013, May). How Spark Works. Retrieved from https://www.linkedin.com/company/mortar-data-inc.

Phatak, M. (2015, January 2). History of Apache Spark: Journey from Academia to Industry. Retrieved from http://blog.madhukaraphatak.com/history-of-spark/

SAS - The Power to Know. (2013, April). Big Data. Retrieved from http://www.sas.com/en_ph/insights/big-data/what-is-big-data.html

Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., & Stoica, I. (2011, June). Spark: Cluster Computing with Working Sets. USENIX Workshop on Hot Topics in Cloud Computing (HotCloud). Retrieved from https://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/Spark-Cluster-Computing-with-Working-Sets.pdf


Contact information

MASSIMILIANO MARTELLA

Email: [email protected]