TRANSCRIPT · 2018-01-31
Big Data & Hadoop Environment
These training materials were produced as part of the Istanbul Big Data Education and Research Center Project (no. TR10/16/YNY/0036), carried out under the Istanbul Development Agency's 2016 Innovative and Creative Istanbul Financial Support Program. Sole responsibility for the content belongs to Bahçeşehir University; it does not reflect the views of İSTKA or the Ministry of Development.
What is big data?
Why do we need big data analytics?
How to set up infrastructure for big data?
How much data?
A sampling of reported scale figures:
• Hadoop: 10K nodes, 150K cores, 150 PB (4/2014)
• Processes 20 PB a day (2008)
• Crawls 20B web pages a day (2012)
• Search index is 100+ PB (5/2014)
• Bigtable serves 2+ EB, 600M QPS (5/2014)
• 300 PB of data in Hive, +600 TB/day (4/2014)
• 400B pages, 10+ PB (2/2014)
• LHC: ~15 PB a year
• LSST: 6-10 PB a year (~2020)
• 150 PB on 50K+ servers running 15K apps (6/2011)
• S3: 2T objects, 1.1M requests/second (4/2013)
• SKA: 0.3–1.5 EB per year (~2020)
"640K ought to be enough for anybody."
The percentage of all data in the world that has been generated in the last two years?
Big Data
What are the Key Features of Big Data? The 4 Vs:
• Volume: petabyte scale
• Variety: structured, semi-structured, unstructured
• Velocity: social media, sensors, throughput
• Veracity: unclean, imprecise, unclear
[Figure: a sequencer takes the subject genome and produces billions of overlapping short reads, e.g. AATGCTTACTATGCGGGCCCCTT, which must be assembled back into the genome]
Human genome: 3 Gbp
A few billion short reads (~100 GB compressed data)
What to do with more data?
Answering factoid questions
Pattern matching on the Web
Works amazingly well
Learning relations
Start with seed instances
Search for patterns on the Web
Using patterns to find more instances
Example (factoid questions): "Who shot Abraham Lincoln?" → pattern-match "X shot Abraham Lincoln"
(Brill et al., TREC 2001; Lin, ACM TOIS 2007)
Example (learning relations): seed instances Birthday-of(Mozart, 1756), Birthday-of(Einstein, 1879)
Matching text: "Wolfgang Amadeus Mozart (1756 - 1791)", "Einstein was born in 1879"
Learned patterns: "PERSON (DATE –", "PERSON was born in DATE"
(Agichtein and Gravano, DL 2000; Ravichandran and Hovy, ACL 2002; …)
No data like more data!
(Banko and Brill, ACL 2001)
(Brants et al., EMNLP 2007)
s/knowledge/data/g;
Database Workloads
OLTP (online transaction processing)
Typical applications: e-commerce, banking, airline reservations
User-facing: real time, low latency, highly concurrent
Tasks: relatively small set of “standard” transactional queries
Data access pattern: random reads, updates, writes (involving
relatively small amounts of data)
OLAP (online analytical processing)
Typical applications: business intelligence, data mining
Back-end processing: batch workloads, less concurrency
Tasks: complex analytical queries, often ad hoc
Data access pattern: table scans, large amounts of data per query
One Database or Two?
Downsides of co-existing OLTP and OLAP workloads
Poor memory management
Conflicting data access patterns
Variable latency
Solution: separate databases
User-facing OLTP database for high-volume transactions
Data warehouse for OLAP workloads
How do we connect the two?
OLTP/OLAP Architecture
OLTP OLAP
ETL (Extract, Transform, and Load)
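A minimal Python sketch of the ETL step connecting the two databases; row shapes, field names, and values are hypothetical, and real pipelines would read from and write to actual databases:

```python
# Minimal ETL sketch: extract rows from a stand-in OLTP source,
# transform them, and load them into a warehouse-style table.
# All names and data here are illustrative, not a real API.

def extract():
    # Stand-in for reading recent rows from the OLTP database
    return [
        {"order_id": 1, "amount": "19.99", "country": "tr"},
        {"order_id": 2, "amount": "5.00", "country": "de"},
    ]

def transform(rows):
    # Clean and normalize: parse amounts, upper-case country codes
    return [
        {"order_id": r["order_id"],
         "amount": float(r["amount"]),
         "country": r["country"].upper()}
        for r in rows
    ]

def load(rows, warehouse):
    # Stand-in for a bulk insert into the data warehouse
    warehouse.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
print(len(warehouse))  # 2
```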
Structure of Data Warehouses
SELECT P.Brand, S.Country,
SUM(F.Units_Sold)
FROM Fact_Sales F
INNER JOIN Dim_Date D ON F.Date_Id = D.Id
INNER JOIN Dim_Store S ON F.Store_Id = S.Id
INNER JOIN Dim_Product P ON F.Product_Id = P.Id
WHERE D.YEAR = 1997 AND P.Product_Category = 'tv'
GROUP BY P.Brand, S.Country;
Source: Wikipedia (Star Schema)
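For intuition, the star-join aggregation above can be rewritten in plain Python over small in-memory tables (the sample rows are made up):

```python
# Star-schema aggregation mirroring the SQL above: join the fact
# table to its dimensions, filter, then group and sum.
dim_date = {1: {"year": 1997}, 2: {"year": 1998}}
dim_store = {1: {"country": "DE"}, 2: {"country": "TR"}}
dim_product = {1: {"brand": "A", "category": "tv"},
               2: {"brand": "B", "category": "radio"}}
fact_sales = [
    {"date_id": 1, "store_id": 1, "product_id": 1, "units_sold": 3},
    {"date_id": 1, "store_id": 2, "product_id": 1, "units_sold": 2},
    {"date_id": 2, "store_id": 1, "product_id": 1, "units_sold": 9},  # wrong year
    {"date_id": 1, "store_id": 1, "product_id": 2, "units_sold": 5},  # not a tv
]

totals = {}
for f in fact_sales:
    d = dim_date[f["date_id"]]          # the three dimension joins
    s = dim_store[f["store_id"]]
    p = dim_product[f["product_id"]]
    if d["year"] == 1997 and p["category"] == "tv":   # WHERE clause
        key = (p["brand"], s["country"])              # GROUP BY key
        totals[key] = totals.get(key, 0) + f["units_sold"]

print(totals)  # {('A', 'DE'): 3, ('A', 'TR'): 2}
```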
OLAP Cubes
[Figure: a data cube with dimensions such as product and store]
Common operations: slice and dice, roll up/drill down, pivot
Fast forward…
ETL Bottleneck
ETL is typically a nightly task: what happens if processing 24 hours of data takes longer than 24 hours?
Hadoop is perfect:
Ingest is limited by speed of HDFS
Scales out with more nodes
Massively parallel
Ability to use any processing tool
Cheaper than parallel databases
ETL is a batch process anyway!
What’s changed?
Dropping cost of disks
Cheaper to store everything than to figure out what to throw away
Types of data collected
From data that’s obviously valuable to data whose value is less
apparent
Rise of social media and user-generated content
Large increase in data volume
Growing maturity of data mining techniques
Demonstrates value of data analytics
Virtuous Product Cycle
[Figure: a useful service → analyze user behavior to extract insights (data science) → transform insights into action (data products) → $ (hopefully)]
OLTP/OLAP/Hadoop Architecture
OLTP OLAP
ETL (Extract, Transform, and Load)
Hadoop
ETL: Redux
Often, with noisy datasets, ETL is the analysis!
Note that ETL necessarily involves brute force data scans
Load first, then Extract and Transform? (ELT)
Big Data Ecosystem
Source : datafloq.com
Cloud Computing
Before clouds…
Grids
Connection machine
Vector supercomputers
…
Cloud computing means many different things:
Big data
Rebranding of web 2.0
Utility computing
Everything as a service
Utility Computing
What?
Computing resources as a metered service (“pay as you go”)
Ability to dynamically provision virtual machines
Why?
Cost: capital vs. operating expenses
Scalability: “infinite” capacity
Elasticity: scale up or down on demand
Does it make sense?
Benefits to cloud users
Business case for cloud providers
I think there is a world
market for about five
computers.
Enabling Technology: Virtualization
[Figure: Traditional stack: apps run on one operating system on the hardware. Virtualized stack: a hypervisor runs on the hardware and hosts multiple guest OSes, each running its own apps.]
Everything as a Service
Utility computing = Infrastructure as a Service (IaaS)
Why buy machines when you can rent cycles?
Examples: Amazon’s EC2, Rackspace
Platform as a Service (PaaS)
Give me nice API and take care of the maintenance, upgrades, …
Example: Google App Engine
Software as a Service (SaaS)
Just run it for me!
Example: Gmail, Salesforce
Who cares?
A source of problems…
Cloud-based services generate big data
Clouds make it easier to start companies that generate big data
As well as a solution…
Ability to provision analytics clusters on-demand in the cloud
Commoditization and democratization of big data capabilities
Parallelization Challenges
How do we assign work units to workers?
What if we have more work units than workers?
What if workers need to share partial results?
How do we aggregate partial results?
How do we know all the workers have finished?
What if workers die?
What’s the common theme of all of these problems?
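A toy Python sketch of the first few questions: more work units than workers, one partial result per unit, and an explicit aggregation step (at scale, each of these steps becomes one of the problems listed above):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Split the input into work units, let a pool of workers produce
# partial results, then aggregate them into a final answer.
docs = ["the cat sat", "the dog sat", "the cat ran"]

def count_words(doc):
    # One work unit -> one partial result
    return Counter(doc.split())

# Fewer workers than work units: the pool assigns units to workers
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(count_words, docs))

# Aggregate the partial results
total = Counter()
for p in partials:
    total += p
print(total["the"], total["cat"])  # 3 2
```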
Where the rubber meets the road
Concurrency is difficult to reason about
Concurrency is even more difficult to reason about
At the scale of datacenters and across datacenters
In the presence of failures
In terms of multiple interacting services
Not to mention debugging…
The reality:
Lots of one-off solutions, custom code
Write your own dedicated library, then program with it
Burden on the programmer to explicitly manage everything
The datacenter is the computer!
Source: Barroso and Urs Hölzle (2009)
The datacenter is the computer
It’s all about the right level of abstraction
Moving beyond the von Neumann architecture
What’s the “instruction set” of the datacenter computer?
Hide system-level details from the developers
No more race conditions, lock contention, etc.
No need to explicitly worry about reliability, fault tolerance, etc.
Separating the what from the how
Developer specifies the computation that needs to be performed
Execution framework (“runtime”) handles actual execution
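MapReduce is the classic instance of this separation: the developer supplies only map and reduce functions (the "what"); the framework handles partitioning, shuffling, and recovery (the "how"). A framework-free Python sketch of word count, with a few lines standing in for the runtime:

```python
from itertools import groupby
from operator import itemgetter

# Word count as map and reduce functions only; the code after them
# stands in for what the execution framework would do at scale.

def mapper(line):
    for word in line.split():
        yield (word, 1)

def reducer(word, counts):
    yield (word, sum(counts))

lines = ["hello big data", "big data big deal"]

# "Framework": run map, shuffle/sort by key, then reduce per key
pairs = sorted(kv for line in lines for kv in mapper(line))
result = dict(
    out
    for word, group in groupby(pairs, key=itemgetter(0))
    for out in reducer(word, (c for _, c in group))
)
print(result["big"], result["data"])  # 3 2
```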
Scaling “up” vs. “out”
No single machine is large enough
Smaller cluster of large SMP machines vs. larger cluster of
commodity machines (e.g., 16 128-core machines vs. 128 16-core
machines)
Nodes need to talk to each other!
Intra-node latencies: ~100 ns
Inter-node latencies: ~100 µs
Move processing to the data
Clusters have limited bandwidth
Process data sequentially, avoid random access
Seeks are expensive, disk throughput is reasonable
Seamless scalability
Source: analysis on this and subsequent slides from Barroso and Hölzle (2009)
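A back-of-the-envelope calculation behind "seeks are expensive, disk throughput is reasonable", using assumed spinning-disk figures (~10 ms per seek, ~100 MB/s sequential transfer), not measurements:

```python
# Assumed (typical HDD) figures:
seek_time_s = 0.010     # ~10 ms per random seek
throughput_bps = 100e6  # ~100 MB/s sequential read

data = 1e9  # read 1 GB total

# One sequential scan vs. 10,000 random reads of 100 KB each
sequential = seek_time_s + data / throughput_bps
random = 10_000 * (seek_time_s + (data / 10_000) / throughput_bps)

print(round(sequential, 2), round(random, 1))  # 10.01 110.0
```

The same gigabyte costs roughly 10x more to read via random access, which is why batch systems favor sequential scans.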
Common Big Data Analytics Use Cases
• Batch
• Use case 1: ETL / batch query (single silo)
• Use case 2: distributed log aggregation
• Batch + real time
• Use case 3: real-time data store
• Use case 4: real-time data store + batch analytics
• Real time / streaming
• Use case 5: streaming
USE CASE 1 : ETL AND BATCH ANALYTICS AT SCALE
• Data collected in various databases
• Data is scattered across multiple silos !
• Need a single silo to bring all data together and analyze
USE CASE 1
• Using core Hadoop components
• No vendor lock-in (works on all Hadoop distributions)
• Use HDFS (Hadoop File System) for storage
• Data Ingest with Sqoop
• Processing done by Map Reduce & Cousins
• Results are exported back to DB
USE CASE 2 : DATA COMING FROM MULTIPLE SOURCES
• Data coming in from multiple sources.
• Data is ‘streaming in’
• Capture data in Hadoop
• Do batch analytics
USE CASE 2
• Flume
• To bring in logs from multiple sources
• Distributed, reliable way to collect and move data
• If uplinks are disconnected, Flume agents will store and
forward data
• HDFS
• Flume can directly write data to HDFS
• Files are segmented or ‘rolled’ by size / time e.g.
• Data-2015-01-01_10-00-00.log
• Data-2015-01-01_11-00-00.log
• Data-2015-01-01_12-00-00.log
USE CASE 2
• Analytics stack : Pig / Hive / Oozie / Spark
(Same as in Use Case 1)
• Oozie
• Workflow manager
• “run this workflow every 1 hour”
• “run this workflow when data shows up in the input directory”
• Can manage complex workflows
• Send alerts when processes fail
• etc.
USE CASE 3 : REAL TIME DATA STORE
• Events are coming in
• Need to store the events
• Can be billions of events
• And query them in real time, e.g. last 10 events by user
USE CASE 3
• HDFS is not ideal for updating data in real time or for
random access
• A scalable real-time store is needed, e.g. HBase or Cassandra,
to support real-time updates
• Data comes trickling in (as stream)
• Saved data becomes queryable immediately
• Use HBase APIs (Java / REST) to build dashboards
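Queries like "last 10 events by user" are usually served by row-key design rather than secondary indexes: key each row by user id plus a reversed timestamp, so a lexicographic prefix scan returns the newest events first. A store-agnostic Python sketch of that key scheme (a plain sorted dict stands in for an HBase table; names are illustrative):

```python
# Row-key design for "last N events by user": user id + reversed
# timestamp, so a lexicographic prefix scan yields newest-first.
MAX_TS = 10**13  # assumed upper bound on millisecond timestamps

def row_key(user, ts_ms):
    # Larger timestamp -> smaller reversed value -> sorts earlier
    return f"{user}#{MAX_TS - ts_ms:013d}"

store = {}  # stand-in for a sorted key-value store like HBase
for ts in [1000, 2000, 3000]:
    store[row_key("alice", ts)] = {"ts": ts}
store[row_key("bob", 1500)] = {"ts": 1500}

def last_events(user, n):
    # Prefix scan: all keys starting with "<user>#", newest first
    prefix = user + "#"
    keys = sorted(k for k in store if k.startswith(prefix))
    return [store[k]["ts"] for k in keys[:n]]

print(last_events("alice", 2))  # [3000, 2000]
```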
USE CASE 4 : REAL TIME + BATCH
• Building on use case 3
• Do extensive analysis on data in HBase
• E.g. ‘scoring user models’,
‘flagging credit card transactions’
USE CASE 4
• HBase is the real time store
• Analytics is done via Map Reduce stack (Pig / Hive)
• Can we do both in a single stack?
• May not be a good idea
• Don’t mix real-time and batch analytics
• Batch analytics will impede real-time performance
USE CASE 4
• How to replicate data?
• 1 : periodic synchronization of data between clusters
• 2 : data goes to both clusters at the same time
USE CASE 5 : Streaming
Decision times : batch ( hours / days)
Use cases:
• Modeling
• ETL
• Reporting
MOVING TOWARDS FAST DATA
• Decision time : (near) real time
• seconds (or milliseconds)
• Use cases
• Alerts (medical / security)
• Fraud detection
• Streaming is becoming more prevalent
• ‘Connected Devices’
• ‘Internet of Things’
• ‘Beyond Batch’
• We need faster processing / analytics
STREAMING ARCHITECTURE – DATA BUCKET
• ‘data bucket’
• Captures incoming data
• Acts as a ‘buffer’ – smooths out bursts
• So even if our processing is offline, we won’t lose data
• Data bucket choices
• Kafka
• MQ (RabbitMQ, etc.)
• Amazon Kinesis
KAFKA ARCHITECTURE
• Producers write data to brokers
• Consumers read data from brokers
• All of this is distributed / parallel
• Failure tolerant
• Data is stored as topics
• “sensor_data”
• “alerts”
• “emails”
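The broker/topic/offset model can be made concrete without a running cluster. A pure-Python simulation sketch (this is not the real Kafka client API; in practice you would use a client library):

```python
# Pure-Python simulation of Kafka's core model: topics are append-only
# logs held by the broker; consumers pull from an offset they control.
class Broker:
    def __init__(self):
        self.topics = {}  # topic name -> list of messages (the log)

    def produce(self, topic, msg):
        # Producers append to the end of a topic's log
        self.topics.setdefault(topic, []).append(msg)

    def consume(self, topic, offset, max_msgs=10):
        # Consumers read from their own offset; the broker deletes
        # nothing, so multiple consumers can read independently
        log = self.topics.get(topic, [])
        batch = log[offset:offset + max_msgs]
        return batch, offset + len(batch)  # messages + new offset

broker = Broker()
broker.produce("sensor_data", b"t=21.5")
broker.produce("sensor_data", b"t=21.7")
broker.produce("alerts", b"overheat")

batch, offset = broker.consume("sensor_data", 0)
print(len(batch), offset)  # 2 2
```

A real deployment adds partitions, replication, and many brokers, but the append-only log plus consumer-held offsets is the essential idea.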
STREAMING ARCHITECTURE – PROCESSING ENGINE
• Need to process events with low latency
• So many to choose from!
• Choices
• Storm
• Spark
• NiFi
• Flink
STREAMING ARCHITECTURE – DATA STORE
• Where processed data ends up
• Need to absorb data in real time
• Usually a NoSQL storage
• HBase
• Cassandra
• Lots of NoSQL stores
LAMBDA ARCHITECTURE
• Each component is scalable
• Each component is fault tolerant
• Incorporates best practices
• All open source!
LAMBDA ARCHITECTURE
1. All new data is sent to both batch layer and speed layer
2. Batch layer
• Holds the master data set (immutable, append-only)
• Answers batch queries
3. Serving layer
• Updates batch views so they can be queried ad hoc
4. Speed Layer
• Handles new data
• Facilitates fast / real-time queries
5. Query layer
• Answers queries using batch & real-time views
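The five layers above can be sketched end to end in miniature: new data feeds both layers, a batch run publishes the batch view, and a query merges the batch view with the speed layer's counts for data not yet batch-processed. A pure-Python sketch (illustrative only; a real deployment would use, e.g., Hadoop for the batch layer and Storm or Spark for the speed layer):

```python
# Lambda architecture in miniature: a query merges the (stale but
# complete) batch view with the speed layer's view of recent data.
master_dataset = []  # batch layer: immutable, append-only
speed_view = {}      # speed layer: counts not yet in the batch view
batch_view = {}      # serving layer: precomputed from master data

def new_data(event):
    # 1. All new data goes to both the batch and speed layers
    master_dataset.append(event)
    speed_view[event] = speed_view.get(event, 0) + 1

def run_batch():
    # 2./3. Recompute the batch view from the full master dataset
    batch_view.clear()
    for e in master_dataset:
        batch_view[e] = batch_view.get(e, 0) + 1
    speed_view.clear()  # batch view now covers everything so far

def query(event):
    # 5. Answer using batch view + real-time view
    return batch_view.get(event, 0) + speed_view.get(event, 0)

new_data("click"); new_data("click")
run_batch()
new_data("click")  # arrives after the last batch run
print(query("click"))  # 3
```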