Big Data, Big Problem?

Scalable Big Data Architecture


PRESENTATION BY :

MOHAMMAD HASAN FARAZMAND

OCTOBER 2016


We Will Review…

Identifying Big Data Symptoms

Size Matters

Typical Business Use Case

Understanding the Big Data Project’s Ecosystem

Hadoop Distribution

Data Acquisition

Processing Language

Machine Learning

NoSQL Stores

Foundation of long-term Big Data Architecture

Architecture Overview

Log Ingestion Application

Learning Application

Processing Engine

Search Engine

This presentation has been prepared based on the first chapter of Scalable Big Data Architecture by Bahaaldine Azarmi.

Identifying Big Data Symptoms

Data management is more complex than it has been before!

Big Data is everywhere, on everyone’s mind

When should I think about employing Big Data?

Am I ready?

What should I start with?!

Different needs :

The volume of data you handle

Variety of data structure

Scalability issue

Reduce the cost of data processing

Size Matters

Two main areas : Size + Volume

Handle new data structures with flexible & schemaless technology

Big data is also about extracting added value information

Near real time processing with distributed architecture

Execute complex queries with NoSQL store

Value

Typical Business Use Case

Analyzing application logs, web access logs, server logs, database logs, and social network data

Customer Behavior Analytics: used on e-commerce websites

Sentiment Analysis: the image and reputation of companies as perceived across social networks

CRM Onboarding: combine online data sources with offline data sources for better and more accurate customer segmentation (profile-customized offers)

Prediction: learning from data, the main Big Data trend of the past two years

For example, in the telecommunications industry:

1) Issue or event prediction based on router logs

2) Product catalog selection

3) Pricing based on the user’s global behavior

Understanding Big Data Project’s Ecosystem

Choosing …

Hadoop distribution

Distributed file system

SQL-Like processing language

Machine learning language

Scheduler

Message-oriented middleware

NoSQL data store

Data visualization

Hadoop Distribution

Two Choices :

Download the project you need separately

Use one of the most popular Hadoop distributions

Cloudera CDH

1. Impala: a real-time, parallelized, SQL-based engine that searches for data in HDFS and HBase.

2. Cloudera Manager: Cloudera’s console to manage and deploy Hadoop components.

3. Hue: a console for user interaction with data and scripts.

Hortonworks HDP

Hadoop Distributed File System

HDFS

Key features:

Distribution

High Availability

Fault Tolerance

Tuning

Security

Load Balancing

High Throughput Access

Automatic replication across the cluster data nodes
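
To make the replication idea concrete, here is a minimal sketch in plain Python, not HDFS code. The function names, node names, and the round-robin placement are assumptions made for illustration; real HDFS uses a rack-aware placement policy. The default replication factor of 3 matches the HDFS default.

```python
def place_replicas(block_ids, data_nodes, replication=3):
    """Assign each block to `replication` distinct data nodes (round-robin)."""
    if replication > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    placement = {}
    for i, block_id in enumerate(block_ids):
        # Hypothetical policy: start at node i and take the next nodes in order.
        placement[block_id] = [
            data_nodes[(i + r) % len(data_nodes)] for r in range(replication)
        ]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    return [
        block_id
        for block_id, nodes in placement.items()
        if any(node not in failed_nodes for node in nodes)
    ]
```

With four nodes and three replicas per block, losing any two nodes still leaves every block readable, which is the fault-tolerance property listed above.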

Data Acquisition

Large log files, streamed data, ETL processing outcomes, online unstructured data, offline structured data, etc.

Apache Flume

Reliable, highly available, simple, and flexible, with an intuitive programming model based on streaming data flows.

Composed of “Sources”, “Channels”, and “Sinks”.
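
The source, channel, and sink roles can be sketched in a few lines of plain Python. This is only an illustration of the data flow, not Flume's API or configuration format; `MemoryChannel`, `run_source`, and `run_sink` are hypothetical names.

```python
from collections import deque


class MemoryChannel:
    """Hypothetical in-memory buffer that sits between a source and a sink."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take(self):
        # Return the oldest event, or None when the channel is empty.
        return self._events.popleft() if self._events else None


def run_source(lines, channel):
    """Source: wrap each raw line in an event and put it on the channel."""
    for line in lines:
        channel.put({"body": line.rstrip("\n")})


def run_sink(channel, store):
    """Sink: drain the channel and deliver event bodies to a destination list."""
    while (event := channel.take()) is not None:
        store.append(event["body"])
    return store
```

The channel decouples the two ends: the source can keep accepting data while the sink delivers it at its own pace.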

Apache Sqoop

Transfers bulk data between structured data stores and HDFS.

Imports data from an external relational database into HDFS, HBase, or Hive.

Exports data from a Hadoop cluster to a relational database or data warehouse.

Processing Language

MapReduce was the main processing framework in the first generation of Hadoop clusters.

It groups sibling data together (Map) and then aggregates the data according to a specified aggregation operation (Reduce).

Now that YARN (Yet Another Resource Negotiator) has been implemented in Hadoop 2, MapReduce is only one processing model among others.
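
The Map and Reduce steps can be illustrated with a word count in plain Python. This is a single-process sketch, not Hadoop code. In the usual description, map emits key-value pairs, a shuffle step groups them by key, and reduce aggregates each group. All function names here are illustrative.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group values by key so that sibling data ends up together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, aggregate=sum):
    """Reduce: apply the aggregation operation to each group of values."""
    return {key: aggregate(values) for key, values in groups.items()}


def word_count(lines):
    return reduce_phase(shuffle(map_phase(lines)))
```

Swapping `aggregate` for `max` or `len` changes the aggregation without touching the map step, which is the separation the model relies on.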

Batch Processing with Hive

Hive brings users the simplicity and power of querying data from HDFS in a SQL-like way.

Hive is not a near-real-time or real-time processing language. It is meant for long-running, low-priority processing jobs.

The main drawback of using another language rather than native MapReduce is performance.

Stream Processing with Spark Streaming

Extension of Spark.

Leverages Spark’s distributed data processing framework and treats streaming computation as a series of micro-batch computations.

Spark Streaming lets you write a processing job as you would for batch processing in Java, Scala, or Python.

Foundation of a strong fault-tolerant and high-performance system.
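
The idea of writing a streaming job like a batch job can be sketched in plain Python. This is not the Spark Streaming API: the stream is cut into small batches and the same batch function is applied to each one. Spark Streaming cuts batches by time interval; this sketch cuts by element count to stay deterministic, and all names are hypothetical.

```python
from collections import Counter
from itertools import islice


def micro_batches(stream, batch_size):
    """Cut a (possibly endless) iterator into lists of at most batch_size items."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def count_status_codes(batch):
    """The same function one would write for an ordinary batch job."""
    return Counter(batch)


def process_stream(stream, batch_size=3):
    """Apply the batch function to every micro-batch of the stream."""
    return [count_status_codes(b) for b in micro_batches(stream, batch_size)]
```

For example, `process_stream([200, 200, 404, 500, 200])` yields one counter per micro-batch rather than a single result for the whole stream.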

Message-Oriented Middleware with Apache Kafka

Persistent messaging and high-throughput system.

Kafka serves as a pivot point in our architecture, mainly to receive data and push it into Spark Streaming.
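
The pivot role can be pictured as an append-only log that producers write to and consumers read from at their own offsets. The sketch below is plain Python held in memory, not the Kafka client API; real Kafka persists messages to disk and splits topics into replicated partitions. `TopicLog` and `Consumer` are hypothetical names.

```python
class TopicLog:
    """Hypothetical single-partition topic: an append-only list of messages."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1  # offset of the message just written

    def read(self, offset, max_messages=10):
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Reads from a TopicLog and remembers how far it has read."""

    def __init__(self, log):
        self._log = log
        self.offset = 0

    def poll(self, max_messages=10):
        batch = self._log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch
```

Because messages stay in the log after being read, several consumers (for example a stream processor and an archiver) can each read the same data independently.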

Machine Learning

Spark MLlib enables machine learning for Spark.

Composed of various algorithms, ranging from basic statistics, logistic regression, k-means clustering, and Gaussian mixtures to singular value decomposition and multinomial naive Bayes.

Train on your data and build prediction models with a few lines of code.
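
As an illustration of one of the listed algorithms, here is k-means clustering on one-dimensional points in plain Python. It is a sketch of the algorithm itself (Lloyd's iteration), not MLlib's API. The starting centroids are passed in by the caller to keep the result deterministic, and the function names are illustrative.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Run Lloyd's algorithm on 1-D points from caller-supplied centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids


def predict(point, centroids):
    """Return the index of the centroid closest to the given point."""
    return min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))
```

Training produces the centroids, and `predict` then assigns new points to a cluster, which mirrors the train-then-predict workflow described above.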

NoSQL Stores

Fundamental pieces of the data architecture.

Scalability and Resiliency, and thus High Availability.

Ingest a very large amount of data.

Couchbase

Document-oriented NoSQL database that is easily scalable, provides a flexible model, and delivers consistently high performance.

ElasticSearch

Scalable, distributed indexing engine with search features.

Based on Apache Lucene; enables real-time data analytics and full-text search in your architecture.
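
Full-text search engines built on Lucene rely on an inverted index, which maps each term to the documents that contain it. The following plain-Python sketch shows that data structure only. It is not the Elasticsearch API, and the tokenizer and class name are simplifying assumptions.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Hypothetical in-memory index: term -> set of document ids."""

    def __init__(self):
        self._postings = defaultdict(set)

    def index(self, doc_id, text):
        for term in tokenize(text):
            self._postings[term].add(doc_id)

    def search(self, query):
        """Return sorted ids of documents that contain every query term."""
        terms = tokenize(query)
        if not terms:
            return []
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return sorted(result)
```

Looking up a term is a dictionary access rather than a scan over all documents, which is why this structure suits search over large document sets.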

ELK platform

ElasticSearch is part of the ELK platform.

ElasticSearch + Logstash + Kibana

Provides the best end-to-end platform for collecting, storing, and visualizing data.

Logstash lets you collect data from many kinds of sources.

ElasticSearch indexes the data in a distributed, scalable, and resilient system.

Kibana is a customizable user interface in which you can build simple to complex dashboards to explore and visualize data indexed by ElasticSearch.

Foundation of a Long-Term Big Data Architecture

Log Ingestion Application

Consume application logs such as web access logs.
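
As a rough sketch of the first step in consuming a web access log, the function below parses one line in the Common Log Format into a dictionary. The regular expression, the field names, and the sample line in the usage note are assumptions for illustration; actual log layouts depend on the web server configuration.

```python
import re

# Common Log Format: host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_access_log_line(line):
    """Return a dict of fields, or None when the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A "-" size means no response body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record
```

For a made-up line such as `127.0.0.1 - - [10/Oct/2016:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326`, the parser returns the host, timestamp, method, path, protocol, status, and size as separate fields that downstream components can index or aggregate.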

Learning Application

Receives a stream of data and builds predictions to optimize our recommendation engine.

Processing Engine

Heart of the architecture

Search Engine

The search engine leverages the data processed by the processing engine and exposes a dedicated RESTful API that will be used for analytic purposes.

Summary

We have seen all the components that make up our architecture.

Good Luck
