TRANSCRIPT · 2018-01-31
Big Data & Hadoop Environment
These training materials were produced as part of the Istanbul Big Data Education and Research Center Project (no. TR10/16/YNY/0036), carried out under the Istanbul Development Agency's 2016 Innovative and Creative Istanbul Financial Support Program. Sole responsibility for the content belongs to Bahçeşehir University; it does not reflect the views of İSTKA or the Ministry of Development.
What is big data?
Why do we need big data analytics?
How to set up infrastructure for big data?
How much data?
A sampling of reported scale figures:
• Hadoop: 10K nodes, 150K cores, 150 PB (4/2014)
• Processes 20 PB a day (2008)
• Crawls 20B web pages a day (2012)
• Search index is 100+ PB (5/2014)
• Bigtable serves 2+ EB, 600M QPS (5/2014)
• 300 PB of data in Hive, +600 TB/day (4/2014)
• 400B pages, 10+ PB (2/2014)
• LHC: ~15 PB a year
• LSST: 6-10 PB a year (~2020)
• 150 PB on 50K+ servers running 15K apps (6/2011)
• S3: 2T objects, 1.1M requests/second (4/2013)
• SKA: 0.3–1.5 EB per year (~2020)
"640K ought to be enough for anybody."
The percentage of all data in the world that has been generated in the last two years?
Big Data
What are the Key Features of Big Data? The 4 Vs:
• Volume: petabyte scale
• Variety: structured, semi-structured, unstructured
• Velocity: social media, sensors, throughput
• Veracity: unclean, imprecise, unclear
[Figure: a sequencer takes the subject genome and produces billions of overlapping short reads, e.g. AATGCTTACTATGCGGGCCCCTT, which must be assembled back into the genome]
Human genome: 3 Gbp
A few billion short reads (~100 GB compressed data)
What to do with more data?
Answering factoid questions
Pattern matching on the Web
Works amazingly well
Learning relations
Start with seed instances
Search for patterns on the Web
Using patterns to find more instances
Example (factoid questions): "Who shot Abraham Lincoln?" → pattern-match "X shot Abraham Lincoln"
(Brill et al., TREC 2001; Lin, ACM TOIS 2007)
Example (learning relations): seed instances Birthday-of(Mozart, 1756), Birthday-of(Einstein, 1879)
Matching text: "Wolfgang Amadeus Mozart (1756 - 1791)", "Einstein was born in 1879"
Learned patterns: "PERSON (DATE –", "PERSON was born in DATE"
(Agichtein and Gravano, DL 2000; Ravichandran and Hovy, ACL 2002; …)
No data like more data!
(Banko and Brill, ACL 2001)
(Brants et al., EMNLP 2007)
s/knowledge/data/g;
Database Workloads
OLTP (online transaction processing)
Typical applications: e-commerce, banking, airline reservations
User-facing: real time, low latency, highly concurrent
Tasks: relatively small set of “standard” transactional queries
Data access pattern: random reads, updates, writes (involving
relatively small amounts of data)
OLAP (online analytical processing)
Typical applications: business intelligence, data mining
Back-end processing: batch workloads, less concurrency
Tasks: complex analytical queries, often ad hoc
Data access pattern: table scans, large amounts of data per query
One Database or Two?
Downsides of co-existing OLTP and OLAP workloads
Poor memory management
Conflicting data access patterns
Variable latency
Solution: separate databases
User-facing OLTP database for high-volume transactions
Data warehouse for OLAP workloads
How do we connect the two?
OLTP/OLAP Architecture
OLTP OLAP
ETL (Extract, Transform, and Load)
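A minimal Python sketch of the ETL step connecting the two databases; row shapes, field names, and values are hypothetical, and real pipelines would read from and write to actual databases:

```python
# Minimal ETL sketch: extract rows from a stand-in OLTP source,
# transform them, and load them into a warehouse-style table.
# All names and data here are illustrative, not a real API.

def extract():
    # Stand-in for reading recent rows from the OLTP database
    return [
        {"order_id": 1, "amount": "19.99", "country": "tr"},
        {"order_id": 2, "amount": "5.00", "country": "de"},
    ]

def transform(rows):
    # Clean and normalize: parse amounts, upper-case country codes
    return [
        {"order_id": r["order_id"],
         "amount": float(r["amount"]),
         "country": r["country"].upper()}
        for r in rows
    ]

def load(rows, warehouse):
    # Stand-in for a bulk insert into the data warehouse
    warehouse.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
print(len(warehouse))  # 2
```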
Structure of Data Warehouses
SELECT P.Brand, S.Country,
SUM(F.Units_Sold)
FROM Fact_Sales F
INNER JOIN Dim_Date D ON F.Date_Id = D.Id
INNER JOIN Dim_Store S ON F.Store_Id = S.Id
INNER JOIN Dim_Product P ON F.Product_Id = P.Id
WHERE D.YEAR = 1997 AND P.Product_Category = 'tv'
GROUP BY P.Brand, S.Country;
Source: Wikipedia (Star Schema)
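For intuition, the star-join aggregation above can be rewritten in plain Python over small in-memory tables (the sample rows are made up):

```python
# Star-schema aggregation mirroring the SQL above: join the fact
# table to its dimensions, filter, then group and sum.
dim_date = {1: {"year": 1997}, 2: {"year": 1998}}
dim_store = {1: {"country": "DE"}, 2: {"country": "TR"}}
dim_product = {1: {"brand": "A", "category": "tv"},
               2: {"brand": "B", "category": "radio"}}
fact_sales = [
    {"date_id": 1, "store_id": 1, "product_id": 1, "units_sold": 3},
    {"date_id": 1, "store_id": 2, "product_id": 1, "units_sold": 2},
    {"date_id": 2, "store_id": 1, "product_id": 1, "units_sold": 9},  # wrong year
    {"date_id": 1, "store_id": 1, "product_id": 2, "units_sold": 5},  # not a tv
]

totals = {}
for f in fact_sales:
    d = dim_date[f["date_id"]]          # the three dimension joins
    s = dim_store[f["store_id"]]
    p = dim_product[f["product_id"]]
    if d["year"] == 1997 and p["category"] == "tv":   # WHERE clause
        key = (p["brand"], s["country"])              # GROUP BY key
        totals[key] = totals.get(key, 0) + f["units_sold"]

print(totals)  # {('A', 'DE'): 3, ('A', 'TR'): 2}
```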
OLAP Cubes
[Figure: a data cube with dimensions such as product and store]
Common operations: slice and dice, roll up/drill down, pivot
Fast forward…
ETL Bottleneck
ETL is typically a nightly task: what happens if processing 24 hours of data takes longer than 24 hours?
Hadoop is perfect:
Ingest is limited by speed of HDFS
Scales out with more nodes
Massively parallel
Ability to use any processing tool
Cheaper than parallel databases
ETL is a batch process anyway!
What’s changed?
Dropping cost of disks
Cheaper to store everything than to figure out what to throw away
Types of data collected
From data that’s obviously valuable to data whose value is less
apparent
Rise of social media and user-generated content
Large increase in data volume
Growing maturity of data mining techniques
Demonstrates value of data analytics
Virtuous Product Cycle
[Figure: a useful service → analyze user behavior to extract insights (data science) → transform insights into action (data products) → $ (hopefully)]
OLTP/OLAP/Hadoop Architecture
OLTP OLAP
ETL (Extract, Transform, and Load)
Hadoop
ETL: Redux
Often, with noisy datasets, ETL is the analysis!
Note that ETL necessarily involves brute force data scans
Load first, then Extract and Transform? (ELT)
Big Data Ecosystem
Source : datafloq.com
Cloud Computing
Before clouds…
Grids
Connection machine
Vector supercomputers
…
Cloud computing means many different things:
Big data
Rebranding of web 2.0
Utility computing
Everything as a service
Utility Computing
What?
Computing resources as a metered service (“pay as you go”)
Ability to dynamically provision virtual machines
Why?
Cost: capital vs. operating expenses
Scalability: “infinite” capacity
Elasticity: scale up or down on demand
Does it make sense?
Benefits to cloud users
Business case for cloud providers
I think there is a world
market for about five
computers.
Enabling Technology: Virtualization
[Figure: Traditional stack: apps run on one operating system on the hardware. Virtualized stack: a hypervisor runs on the hardware and hosts multiple guest OSes, each running its own apps.]
Everything as a Service
Utility computing = Infrastructure as a Service (IaaS)
Why buy machines when you can rent cycles?
Examples: Amazon’s EC2, Rackspace
Platform as a Service (PaaS)
Give me nice API and take care of the maintenance, upgrades, …
Example: Google App Engine
Software as a Service (SaaS)
Just run it for me!
Example: Gmail, Salesforce
Who cares?
A source of problems…
Cloud-based services generate big data
Clouds make it easier to start companies that generate big data
As well as a solution…
Ability to provision analytics clusters on-demand in the cloud
Commoditization and democratization of big data capabilities
Parallelization Challenges
How do we assign work units to workers?
What if we have more work units than workers?
What if workers need to share partial results?
How do we aggregate partial results?
How do we know all the workers have finished?
What if workers die?
What’s the common theme of all of these problems?
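A toy Python sketch of the first few questions: more work units than workers, one partial result per unit, and an explicit aggregation step (at scale, each of these steps becomes one of the problems listed above):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Split the input into work units, let a pool of workers produce
# partial results, then aggregate them into a final answer.
docs = ["the cat sat", "the dog sat", "the cat ran"]

def count_words(doc):
    # One work unit -> one partial result
    return Counter(doc.split())

# Fewer workers than work units: the pool assigns units to workers
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(count_words, docs))

# Aggregate the partial results
total = Counter()
for p in partials:
    total += p
print(total["the"], total["cat"])  # 3 2
```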
Where the rubber meets the road
Concurrency is difficult to reason about
Concurrency is even more difficult to reason about
At the scale of datacenters and across datacenters
In the presence of failures
In terms of multiple interacting services
Not to mention debugging…
The reality:
Lots of one-off solutions, custom code
Write your own dedicated library, then program with it
Burden on the programmer to explicitly manage everything
The datacenter is the computer!
Source: Barroso and Urs Hölzle (2009)
The datacenter is the computer
It’s all about the right level of abstraction
Moving beyond the von Neumann architecture
What’s the “instruction set” of the datacenter computer?
Hide system-level details from the developers
No more race conditions, lock contention, etc.
No need to explicitly worry about reliability, fault tolerance, etc.
Separating the what from the how
Developer specifies the computation that needs to be performed
Execution framework (“runtime”) handles actual execution
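MapReduce is the classic instance of this separation: the developer supplies only map and reduce functions (the "what"); the framework handles partitioning, shuffling, and recovery (the "how"). A framework-free Python sketch of word count, with a few lines standing in for the runtime:

```python
from itertools import groupby
from operator import itemgetter

# Word count as map and reduce functions only; the code after them
# stands in for what the execution framework would do at scale.

def mapper(line):
    for word in line.split():
        yield (word, 1)

def reducer(word, counts):
    yield (word, sum(counts))

lines = ["hello big data", "big data big deal"]

# "Framework": run map, shuffle/sort by key, then reduce per key
pairs = sorted(kv for line in lines for kv in mapper(line))
result = dict(
    out
    for word, group in groupby(pairs, key=itemgetter(0))
    for out in reducer(word, (c for _, c in group))
)
print(result["big"], result["data"])  # 3 2
```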
Scaling “up” vs. “out”
No single machine is large enough
Smaller cluster of large SMP machines vs. larger cluster of
commodity machines (e.g., 16 128-core machines vs. 128 16-core
machines)
Nodes need to talk to each other!
Intra-node latencies: ~100 ns
Inter-node latencies: ~100 µs
Move processing to the data
Clusters have limited bandwidth
Process data sequentially, avoid random access
Seeks are expensive, disk throughput is reasonable
Seamless scalability
Source: analysis on this and subsequent slides from Barroso and Hölzle (2009)
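A back-of-the-envelope calculation behind "seeks are expensive, disk throughput is reasonable", using assumed spinning-disk figures (~10 ms per seek, ~100 MB/s sequential transfer), not measurements:

```python
# Assumed (typical HDD) figures:
seek_time_s = 0.010     # ~10 ms per random seek
throughput_bps = 100e6  # ~100 MB/s sequential read

data = 1e9  # read 1 GB total

# One sequential scan vs. 10,000 random reads of 100 KB each
sequential = seek_time_s + data / throughput_bps
random = 10_000 * (seek_time_s + (data / 10_000) / throughput_bps)

print(round(sequential, 2), round(random, 1))  # 10.01 110.0
```

The same gigabyte costs roughly 10x more to read via random access, which is why batch systems favor sequential scans.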
Common Big Data Analytics Use Cases
• Batch
• Use case 1: ETL / batch query (single silo)
• Use case 2: distributed log aggregation
• Batch + real time
• Use case 3: real-time data store
• Use case 4: real-time data store + batch analytics
• Real time / streaming
• Use case 5: streaming
USE CASE 1 : ETL AND BATCH ANALYTICS AT SCALE
• Data collected in various databases
• Data is scattered across multiple silos !
• Need a single silo to bring all data together and analyze
USE CASE 1
• Using core Hadoop components
• No vendor lock-in (works on all Hadoop distributions)
• Use HDFS (Hadoop File System) for storage
• Data Ingest with Sqoop
• Processing done by Map Reduce & Cousins
• Results are exported back to DB
USE CASE 2 : DATA COMING FROM MULTIPLE SOURCES
• Data coming in from multiple sources.
• Data is ‘streaming in’
• Capture data in Hadoop
• Do batch analytics
USE CASE 2
• Flume
• To bring in logs from multiple sources
• Distributed, reliable way to collect and move data
• If uplinks are disconnected, Flume agents will store and
forward data
• HDFS
• Flume can directly write data to HDFS
• Files are segmented or ‘rolled’ by size / time e.g.
• Data-2015-01-01_10-00-00.log
• Data-2015-01-01_11-00-00.log
• Data-2015-01-01_12-00-00.log
USE CASE 2
• Analytics stack : Pig / Hive / Oozie / Spark
(Same as in Use Case 1)
• Oozie
• Workflow manager
• “run this workflow every 1 hour”
• “run this workflow when data shows up in the input directory”
• Can manage complex workflows
• Send alerts when processes fail
• etc.
USE CASE 3 : REAL TIME DATA STORE
• Events are coming in
• Need to store the events
• Can be billions of events
• And query them in real time, e.g. last 10 events by user
USE CASE 3
• HDFS is not ideal for updating data in real time or for
random access
• A scalable real-time store is needed, e.g. HBase or Cassandra,
to support real-time updates
• Data comes trickling in (as stream)
• Saved data becomes queryable immediately
• Use HBase APIs (Java / REST) to build dashboards
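Queries like "last 10 events by user" are usually served by row-key design rather than secondary indexes: key each row by user id plus a reversed timestamp, so a lexicographic prefix scan returns the newest events first. A store-agnostic Python sketch of that key scheme (a plain sorted dict stands in for an HBase table; names are illustrative):

```python
# Row-key design for "last N events by user": user id + reversed
# timestamp, so a lexicographic prefix scan yields newest-first.
MAX_TS = 10**13  # assumed upper bound on millisecond timestamps

def row_key(user, ts_ms):
    # Larger timestamp -> smaller reversed value -> sorts earlier
    return f"{user}#{MAX_TS - ts_ms:013d}"

store = {}  # stand-in for a sorted key-value store like HBase
for ts in [1000, 2000, 3000]:
    store[row_key("alice", ts)] = {"ts": ts}
store[row_key("bob", 1500)] = {"ts": 1500}

def last_events(user, n):
    # Prefix scan: all keys starting with "<user>#", newest first
    prefix = user + "#"
    keys = sorted(k for k in store if k.startswith(prefix))
    return [store[k]["ts"] for k in keys[:n]]

print(last_events("alice", 2))  # [3000, 2000]
```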
USE CASE 4 : REAL TIME + BATCH
• Building on use case 3
• Do extensive analysis on data in HBase
• E.g. ‘scoring user models’,
‘flagging credit card transactions’
USE CASE 4
• HBase is the real time store
• Analytics is done via Map Reduce stack (Pig / Hive)
• Can we do both in a single stack?
• May not be a good idea
• Don’t mix real-time and batch analytics
• Batch analytics will impede real-time performance
USE CASE 4
• How to replicate data?
• 1 : periodic synchronization of data between clusters
• 2 : data goes to both clusters at the same time
USE CASE 5 : Streaming
Decision times : batch ( hours / days)
Use cases:
• Modeling
• ETL
• Reporting
MOVING TOWARDS FAST DATA
• Decision time : (near) real time
• seconds (or milliseconds)
• Use cases
• Alerts (medical / security)
• Fraud detection
• Streaming is becoming more prevalent
• ‘Connected Devices’
• ‘Internet of Things’
• ‘Beyond Batch’
• We need faster processing / analytics
STREAMING ARCHITECTURE – DATA BUCKET
• ‘data bucket’
• Captures incoming data
• Acts as a ‘buffer’ – smooths out bursts
• So even if our processing is offline, we won’t lose data
• Data bucket choices
• Kafka
• MQ (RabbitMQ, etc.)
• Amazon Kinesis
KAFKA ARCHITECTURE
• Producers write data to brokers
• Consumers read data from brokers
• All of this is distributed / parallel
• Failure tolerant
• Data is stored as topics
• “sensor_data”
• “alerts”
• “emails”
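The broker/topic/offset model can be made concrete without a running cluster. A pure-Python simulation sketch (this is not the real Kafka client API; in practice you would use a client library):

```python
# Pure-Python simulation of Kafka's core model: topics are append-only
# logs held by the broker; consumers pull from an offset they control.
class Broker:
    def __init__(self):
        self.topics = {}  # topic name -> list of messages (the log)

    def produce(self, topic, msg):
        # Producers append to the end of a topic's log
        self.topics.setdefault(topic, []).append(msg)

    def consume(self, topic, offset, max_msgs=10):
        # Consumers read from their own offset; the broker deletes
        # nothing, so multiple consumers can read independently
        log = self.topics.get(topic, [])
        batch = log[offset:offset + max_msgs]
        return batch, offset + len(batch)  # messages + new offset

broker = Broker()
broker.produce("sensor_data", b"t=21.5")
broker.produce("sensor_data", b"t=21.7")
broker.produce("alerts", b"overheat")

batch, offset = broker.consume("sensor_data", 0)
print(len(batch), offset)  # 2 2
```

A real deployment adds partitions, replication, and many brokers, but the append-only log plus consumer-held offsets is the essential idea.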
STREAMING ARCHITECTURE – PROCESSING ENGINE
• Need to process events with low latency
• So many to choose from!
• Choices
• Storm
• Spark
• NiFi
• Flink
STREAMING ARCHITECTURE – DATA STORE
• Where processed data ends up
• Need to absorb data in real time
• Usually a NoSQL storage
• HBase
• Cassandra
• Lots of NoSQL stores
LAMBDA ARCHITECTURE
• Each component is scalable
• Each component is fault tolerant
• Incorporates best practices
• All open source!
LAMBDA ARCHITECTURE
1. All new data is sent to both batch layer and speed layer
2. Batch layer
• Holds the master data set (immutable, append-only)
• Answers batch queries
3. Serving layer
• Updates batch views so they can be queried ad hoc
4. Speed Layer
• Handles new data
• Facilitates fast / real-time queries
5. Query layer
• Answers queries using batch & real-time views
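The five layers above can be sketched end to end in miniature: new data feeds both layers, a batch run publishes the batch view, and a query merges the batch view with the speed layer's counts for data not yet batch-processed. A pure-Python sketch (illustrative only; a real deployment would use, e.g., Hadoop for the batch layer and Storm or Spark for the speed layer):

```python
# Lambda architecture in miniature: a query merges the (stale but
# complete) batch view with the speed layer's view of recent data.
master_dataset = []  # batch layer: immutable, append-only
speed_view = {}      # speed layer: counts not yet in the batch view
batch_view = {}      # serving layer: precomputed from master data

def new_data(event):
    # 1. All new data goes to both the batch and speed layers
    master_dataset.append(event)
    speed_view[event] = speed_view.get(event, 0) + 1

def run_batch():
    # 2./3. Recompute the batch view from the full master dataset
    batch_view.clear()
    for e in master_dataset:
        batch_view[e] = batch_view.get(e, 0) + 1
    speed_view.clear()  # batch view now covers everything so far

def query(event):
    # 5. Answer using batch view + real-time view
    return batch_view.get(event, 0) + speed_view.get(event, 0)

new_data("click"); new_data("click")
run_batch()
new_data("click")  # arrives after the last batch run
print(query("click"))  # 3
```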