Big Data, Big Problem?

Scalable Big Data Architecture Big Data Big Problem? PRESENTATION BY : MOHAMMAD HASAN FARAZMAND OCTOBER 2016 [email protected]

Upload: mohammadhasan-farazmand

Post on 21-Jan-2018


TRANSCRIPT

Page 1: Big Data , Big Problem?

Scalable Big Data Architecture

Big Data Big Problem?

PRESENTATION BY :

MOHAMMAD HASAN FARAZMAND

OCTOBER 2016

[email protected]

Page 2: Big Data , Big Problem?

We Will Review…

Identifying Big Data Symptoms

Size Matters

Typical Business Use Case

Understanding the Big Data Project’s Ecosystem

Hadoop Distribution

Data Acquisition

Processing Language

Machine Learning

NoSQL Stores

Foundation of long-term Big Data Architecture

Architecture Overview

Log Ingestion Application

Learning Application

Processing Engine

Search Engine

Page 3: Big Data , Big Problem?

This presentation has been prepared based on the first chapter of

Scalable Big Data Architecture by

Bahaaldine Azarmi

Page 4: Big Data , Big Problem?

Identifying Big Data Symptoms

Data management is more complex than it has been before!

Big Data is everywhere, on everyone’s mind

When should I think about employing Big Data?

Am I ready?

What should I start with?!

Different needs :

The volume of data you handle

Variety of data structure

Scalability issue

Reduce the cost of data processing

Page 5: Big Data , Big Problem?

Size Matters

Two main areas : Size + Volume

Handle new data structures with flexible & schemaless technology

Big data is also about extracting added value information

Near real time processing with distributed architecture

Execute complex queries with a NoSQL store

Value

Page 6: Big Data , Big Problem?

Typical Business Use Case

Analyzing application logs, web access logs, server logs, DB logs, and social network data

Customer Behavior Analytics: used on e-commerce websites

Sentiment Analysis: the image and reputation of companies as perceived across social networks

CRM Onboarding: combine online data sources with offline data sources for better and more accurate customer segmentation (profile-customized offers)

Prediction: learning from data, the main Big Data trend of the past two years

For example, in the telecommunications industry:

1) Issue or event prediction based on router log

2) Product catalog selection

3) Pricing depending on user’s global behavior

Page 7: Big Data , Big Problem?

Understanding Big Data Project’s Ecosystem

Choosing …

Hadoop distribution

Distributed file system

SQL-Like processing language

Machine learning language

Scheduler

Message-oriented middleware

NoSQL data store

Data visualization

Page 8: Big Data , Big Problem?

Hadoop Distribution

Two choices:

Download the projects you need separately

Use one of the most popular Hadoop distributions

Page 9: Big Data , Big Problem?

Cloudera CDH

1. Impala: real-time, parallelized, SQL-based engine that searches for data in HDFS and HBase.

2. Cloudera Manager: Cloudera’s console to manage and deploy Hadoop components.

3. Hue: console for user interaction with data and scripts.

Page 10: Big Data , Big Problem?

Hortonworks HDP

Page 11: Big Data , Big Problem?

Hadoop Distributed File System

HDFS

Key features:

Distribution

High Availability

Fault Tolerance

Tuning

Security

Load Balancing

High Throughput Access

Automatic replication across the cluster data nodes
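
As a rough illustration of automatic replication across data nodes, the sketch below splits a file into fixed-size blocks and assigns each block to several nodes. The function name `place_blocks`, the node names, the sizes, and the round-robin policy are all assumptions made for this example; actual HDFS block placement is more involved.

```python
def place_blocks(file_size, block_size, nodes, replication=3):
    """Assign each block of a file to `replication` distinct nodes (round-robin)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block_id in range(num_blocks):
        # Start each block on a different node so copies spread over the cluster.
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


# Hypothetical cluster of four data nodes and a 300 MB file in 128 MB blocks.
nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = place_blocks(file_size=300, block_size=128, nodes=nodes)
for block_id, replicas in placement.items():
    print(block_id, replicas)
```

Because every block lives on several nodes, losing one node still leaves copies of each block elsewhere, which is the idea behind the fault tolerance listed above.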

Page 12: Big Data , Big Problem?

Data Acquisition

Large log files, streamed data, ETL processing outcomes, online unstructured data, offline structured data, etc.

Apache Flume

Reliable, highly available, simple, flexible, intuitive programming model based on streaming data flows.

Composed of “Sources”, “Channels”, and “Sinks”.
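
To make the source, channel, and sink roles concrete, here is a minimal plain-Python sketch of such a data flow. It does not use Flume itself; the `Channel` class and the `source` and `sink` functions are hypothetical names chosen for illustration.

```python
from collections import deque


class Channel:
    """Buffer that decouples the source from the sink."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take(self):
        # Return the oldest buffered event, or None when the channel is empty.
        return self._events.popleft() if self._events else None


def source(lines, channel):
    """Turn incoming lines into events and hand them to the channel."""
    for line in lines:
        channel.put({"body": line.strip()})


def sink(channel, store):
    """Drain the channel and write event bodies to the destination store."""
    while (event := channel.take()) is not None:
        store.append(event["body"])


channel = Channel()
stored = []
source(["GET /index.html\n", "GET /cart\n"], channel)
sink(channel, stored)
print(stored)
```

The channel sits between the two ends, so the source can keep accepting events even when the sink is temporarily slower.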

Page 13: Big Data , Big Problem?

Apache Sqoop

Transfer bulk data between structured data store and HDFS.

Import data from an external relational database to HDFS, HBase, or Hive.

Export data from Hadoop cluster to a relational database or data

warehouse.

Page 14: Big Data , Big Problem?

Processing Language

MapReduce was the main processing framework in the first generation of the Hadoop cluster.

It groups sibling data together (Map) and then aggregates the data according to a specified aggregation operation (Reduce).

Now that YARN (Yet Another Resource Negotiator) has been implemented, MapReduce is no longer the only processing framework available on a Hadoop cluster.
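
The Map and Reduce steps can be sketched in plain Python with a word count. This is only an in-memory illustration of the pattern, not Hadoop code; the function names `map_phase`, `shuffle`, and `reduce_phase` are assumptions for this example.

```python
from collections import defaultdict


def map_phase(records):
    """Emit a (key, value) pair per word so that sibling data shares a key."""
    for line in records:
        for word in line.split():
            yield word.lower(), 1


def shuffle(pairs):
    """Group all values emitted for the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Aggregate each group; here the aggregation operation is a sum."""
    return {key: sum(values) for key, values in groups.items()}


lines = ["big data big problem", "big data architecture"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)
```

In a real cluster the map and reduce functions run in parallel on many nodes, and the framework performs the grouping step between them.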

Page 15: Big Data , Big Problem?

Batch Processing with Hive

Hive brings users the simplicity and power of querying data from HDFS in a SQL-like way.

Hive is not a real-time or near-real-time processing language. It is meant for long-running, low-priority batch jobs.

The main drawback of using a higher-level language rather than native MapReduce is performance.

Page 16: Big Data , Big Problem?

Stream Processing with Spark Streaming

An extension of Spark.

It leverages Spark’s distributed data processing framework and treats streaming computation as a series of small batch computations (micro-batches).

Spark Streaming lets you write a processing job as you would for batch processing, in Java, Scala, or Python.

Foundation of a strong fault-tolerant and high-performance system.
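
The idea of writing a streaming job like a batch job can be illustrated without Spark: the sketch below cuts a stream of timestamped events into small intervals and applies the same batch function to each one. It is a plain-Python simulation, not the Spark Streaming API; `micro_batches`, `count_status_codes`, and the event fields are hypothetical.

```python
from collections import Counter


def count_status_codes(batch):
    """A batch-style job: count HTTP status codes in a list of events."""
    return Counter(event["status"] for event in batch)


def micro_batches(events, batch_interval):
    """Group timestamped events into consecutive intervals of `batch_interval` seconds."""
    batches = {}
    for event in events:
        batches.setdefault(event["ts"] // batch_interval, []).append(event)
    return [batches[key] for key in sorted(batches)]


events = [
    {"ts": 0, "status": 200},
    {"ts": 1, "status": 404},
    {"ts": 2, "status": 200},
    {"ts": 5, "status": 500},
]
# The same batch function is applied to every small interval of the stream.
results = [count_status_codes(b) for b in micro_batches(events, batch_interval=2)]
print(results)
```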

Page 17: Big Data , Big Problem?

Message-Oriented Middleware

with Apache Kafka

Persistent messaging and high-throughput system.

Kafka serves as a pivot point in our architecture, mainly to receive data and push it into Spark Streaming.
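
A minimal sketch of what persistent messaging means in practice: messages are appended to a log and stay there after being read, while each consumer tracks its own position. This is a toy in-memory model, not the Kafka client API; `TopicLog` and `Consumer` are hypothetical names.

```python
class TopicLog:
    """Append-only log: messages stay available after they are read."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1  # offset of the new message

    def read(self, offset, max_messages=10):
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Reads from a log and remembers how far it has read."""

    def __init__(self, log):
        self._log = log
        self.offset = 0

    def poll(self):
        batch = self._log.read(self.offset)
        self.offset += len(batch)
        return batch


log = TopicLog()
log.append("a")
log.append("b")

streaming = Consumer(log)
first = streaming.poll()       # reads everything written so far
log.append("c")
second = streaming.poll()      # reads only what arrived since the last poll
replay = Consumer(log).poll()  # a new consumer can re-read from the start
print(first, second, replay)
```

Because reading does not delete messages, a downstream job such as the streaming processor can fall behind or restart and still pick up where it left off.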

Page 18: Big Data , Big Problem?

Machine Learning

Spark MLlib enables machine learning for Spark.

Composed of various algorithms that go from basic statistics, logistic

regression, k-means clustering, and Gaussian mixtures to singular

value decomposition and multinomial naive Bayes.

Train models on your data and build predictions with a few lines of code.

Page 19: Big Data , Big Problem?

NoSQL Stores

Fundamental pieces of the data architecture.

Scalability and Resiliency, and thus High Availability.

Ingest a very large amount of data.

Page 20: Big Data , Big Problem?

Couchbase

Document-oriented NoSQL database that is easily scalable, provides a flexible model, and delivers consistently high performance.

ElasticSearch

Scalable, distributed indexing engine with search features.

Based on Apache Lucene and enables real-time data analytics

and full-text search in your architecture.
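
Full-text search engines built on Lucene rely on an inverted index, which maps each term to the documents containing it. The sketch below shows that idea in plain Python; it is not the ElasticSearch API, and `build_index`, `search`, and the sample documents are assumptions for illustration.

```python
from collections import defaultdict


def build_index(documents):
    """Map every term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


documents = {
    1: "Scalable big data architecture",
    2: "Big data big problem",
    3: "Search engine architecture",
}
index = build_index(documents)
print(search(index, "big data"))
print(search(index, "big architecture"))
```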

Page 21: Big Data , Big Problem?

ELK platform

ElasticSearch is part of the ELK platform.

ElasticSearch + Logstash + Kibana

Provide the best end-to-end platform for collecting, storing, and

visualizing data.

Logstash lets you collect data from many kinds of sources

ElasticSearch indexes the data in a distributed, scalable, and

resilient system.

Kibana is a customizable user interface in which you can build a

simple to complex dashboard to explore and visualize data indexed

by ElasticSearch.

Page 22: Big Data , Big Problem?

Foundation of a Long-Term

Big Data Architecture

Page 23: Big Data , Big Problem?

Log Ingestion Application

Consume application logs such as web access logs.
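
As a sketch of the first step such an application might perform, the code below parses one web access log line into structured fields. It assumes the common Apache-style access log layout; the pattern, the `parse_access_log_line` function, and the sample line are illustrative and not taken from the presentation.

```python
import re

# Assumed layout: host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_access_log_line(line):
    """Return a dict of fields, or None if the line does not match the layout."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means no body was sent; treat it as zero bytes.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


sample = '127.0.0.1 - - [10/Oct/2016:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
record = parse_access_log_line(sample)
print(record)
```

Structured records like this are what the later stages (processing, learning, search) can aggregate or index.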

Page 24: Big Data , Big Problem?

Learning Application

Receives a stream of data and builds predictions to optimize our recommendation engine.

Page 25: Big Data , Big Problem?

Processing Engine

Heart of the architecture

Page 26: Big Data , Big Problem?

Search Engine

The search engine leverages the data processed by the processing engine and exposes a dedicated RESTful API that will be used for analytic purposes.

Summary

We have seen all the components that make up our architecture.

Page 27: Big Data , Big Problem?

Good Luck