Lambda Architecture using Apache Spark – with Java code examples

Upload: quovantis

Post on 14-Apr-2017

Category: Technology



Lambda Architecture

Lambda architecture, devised by Nathan Marz, is a layered architecture that addresses the problem of computing arbitrary functions on arbitrary data in real time. In a real-time system, the requirement is essentially:

result = function (all data)

As the volume of data grows, running such a query over all of the data takes a significant amount of time, regardless of the resources used.

Lambda architecture uses three layers and the concept of pre-computed views to solve this problem. The three layers are:

● Batch Layer
● Speed Layer
● Serving Layer


Batch Layer

The batch layer stores the immutable master data set, computes arbitrary functions over all of that data, and creates batch views. Its function can be summarized as:

batch view = function (all data)

The batch layer runs this job continuously, updating the batch views on each run.

Serving Layer

The purpose of the serving layer is to store the batch views produced by the batch layer and to provide random access to them. When the batch layer computes new views, it updates them in the serving layer. The serving layer can be implemented with a random-access database.

Speed Layer

While the batch layer is recomputing a batch view, any data that arrives during the recomputation is not included in that view. The purpose of the speed layer is to compute incremental views over this recent data, which the batch views do not yet cover. These views are called real-time views.

The speed layer can be summarized as:

real time view = function (real time view, new data)

So, the final query is answered by combining results from the serving layer (batch views) and the speed layer (real-time views):

batch view = function (all data)

real time view = function (real time view, new data)

result = merge (query (batch view), query (real time view))
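The three formulas above can be sketched in plain Java. This is only an illustration under stated assumptions: both views are modelled as in-memory maps from a key to a count, and `merge` is simple addition, which is valid only if the real-time view covers exactly the data that the batch view has not yet absorbed. The class name `QueryMergeSketch`, its method names, and the sample keys are hypothetical.

```java
import java.util.Map;

/** Sketch of result = merge(query(batch view), query(real time view)). */
final class QueryMergeSketch {

    /** query(view): look up one key; a key missing from the view counts as zero. */
    static long query(Map<String, Long> view, String key) {
        return view.getOrDefault(key, 0L);
    }

    /** merge: for counts, the two partial results are simply added (assumption). */
    static long merge(long batchResult, long realTimeResult) {
        return batchResult + realTimeResult;
    }

    /** Full query path across both views. */
    static long result(Map<String, Long> batchView,
                       Map<String, Long> realTimeView,
                       String key) {
        return merge(query(batchView, key), query(realTimeView, key));
    }

    public static void main(String[] args) {
        // Hypothetical views: counts per key.
        Map<String, Long> batchView = Map.of("page-a", 100L);
        Map<String, Long> realTimeView = Map.of("page-a", 3L, "page-b", 1L);

        System.out.println(result(batchView, realTimeView, "page-a")); // prints 103
        System.out.println(result(batchView, realTimeView, "page-b")); // prints 1
    }
}
```

Addition works here because counting is associative; other functions (for example, distinct counts) would need a different merge step.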


An Example using Apache Spark

Suppose we want to build a system that finds popular hashtags in a Twitter stream. We can implement the lambda architecture using Apache Spark to build it.

Batch Layer Implementation - The batch layer reads a file of tweets, computes a hashtag frequency map, and saves it to a Cassandra table.

Batch.java
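The following is not the original Batch.java listing. It is a minimal sketch in plain Java, without Spark or Cassandra, of the computation described above: a full recomputation of the hashtag frequency map from all tweets. The class name `BatchViewSketch`, the whitespace-based tokenization, and the lower-casing of hashtags are assumptions made for illustration. In a Spark job the same logic would be spread across a cluster, and the resulting map would be written to a Cassandra table rather than printed.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Batch-layer sketch: rebuild the whole view from the full master data set. */
final class BatchViewSketch {

    /** batch view = function(all data): hashtag -> number of occurrences. */
    static Map<String, Long> hashtagCounts(List<String> allTweets) {
        Map<String, Long> view = new TreeMap<>();
        for (String tweet : allTweets) {
            for (String token : tweet.split("\\s+")) {
                // Assumption: any whitespace-separated token that starts
                // with '#' and has at least one more character is a hashtag.
                if (token.length() > 1 && token.startsWith("#")) {
                    view.merge(token.toLowerCase(Locale.ROOT), 1L, Long::sum);
                }
            }
        }
        return view;
    }

    public static void main(String[] args) {
        // Stand-in for the tweet file read by the batch layer.
        List<String> tweets = List.of(
                "Learning #Spark today",
                "#spark and #cassandra work well together",
                "no tags here");

        System.out.println(hashtagCounts(tweets)); // prints {#cassandra=1, #spark=2}
    }
}
```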

Speed Layer Implementation - The speed layer can also be written in Apache Spark, using its Spark Streaming feature. We read a stream of recent tweets and compute the real-time view from it. For simplicity, this real-time view can also be saved to Cassandra.

Speed.java
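The following is not the original Speed.java listing. It is a minimal plain-Java sketch of the incremental update described above, where each call to `update` stands in for one small batch of tweets arriving from the stream. The class name `SpeedViewSketch`, its methods, and the tokenization rule are assumptions for illustration. A Spark Streaming job would receive the batches from a live source and would persist the view to Cassandra instead of keeping it in memory.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Speed-layer sketch: fold each new batch of tweets into the existing view. */
final class SpeedViewSketch {

    private final Map<String, Long> realTimeView = new HashMap<>();

    /** real time view = function(real time view, new data). */
    void update(List<String> newTweets) {
        for (String tweet : newTweets) {
            for (String token : tweet.split("\\s+")) {
                // Assumption: a hashtag is a token starting with '#'.
                if (token.length() > 1 && token.startsWith("#")) {
                    realTimeView.merge(token.toLowerCase(Locale.ROOT), 1L, Long::sum);
                }
            }
        }
    }

    /** Read-only snapshot of the current real-time view. */
    Map<String, Long> view() {
        return Map.copyOf(realTimeView);
    }

    public static void main(String[] args) {
        SpeedViewSketch speed = new SpeedViewSketch();

        // Two consecutive micro-batches of recent tweets.
        speed.update(List.of("#spark rocks"));
        speed.update(List.of("more #Spark", "hello #lambda"));

        System.out.println(speed.view().get("#spark"));  // prints 2
        System.out.println(speed.view().get("#lambda")); // prints 1
    }
}
```

Unlike the batch layer, this view is never rebuilt from scratch: each update touches only the new data, which is what keeps it cheap enough to run continuously.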

Serving Layer Implementation - The serving layer can be implemented as a RESTful web service that queries the Cassandra tables to return the final result in real time.

References and image credits

http://www.databasetube.com/database/big-data-lambda-architecture/

Big Data: Principles and best practices of scalable realtime data systems, by Nathan Marz and James Warren