Big Data

TRANSCRIPT

Page 1:

Big Data

Page 2:

Definition

Big Data is a term for a collection of datasets so large and complex that it becomes difficult to process them using conventional database management tools or traditional data processing applications.

Page 3:

What is ‘Big’ ‘Data’? Is it:

Too big to be stored on a single server?
Too unstructured to fit into a row/column DB?
Too voluminous or dynamic to fit into a static data warehouse?

Page 5:

Scale of data

Page 6:

Scale of data

Page 7:

Opportunity for Big Data

Page 9:

Opportunity for Big Data

Page 10:

Candidates for Big Data

Page 11:

Big Data gap

Page 12:

Community

Usage and users: Adobe, Alibaba, Amazon, AOL, Facebook, Google, IBM

Page 13:

Conventional approach

Page 14:

Issues with conventional approach

Batch-oriented

Batches cannot be interrupted or reconfigured on-the-fly

Schema management required in multiple places

Lots of data being shuffled around

Data duplication

Turn-around times of hours rather than minutes

Page 15:

Big Data systems (real time)

[Diagram: applications (indexing and search, usage analytics, insights and recommendations, CRUD) served by views on top of a data backend (data, metadata, attention data, indexes)]

Page 16:

A Data system – example (perhaps)

[Diagram: raw data (e.g. tweets) feeding views such as #tweets/URL, influence scores, and trending topics, which serve applications (indexing and search, usage analytics, insights and recommendations, CRUD)]

Page 17:

Properties of a Data system

robustness

fast reads AND updates/inserts

scalable

generic

extensible

allows ad-hoc analysis

low-cost maintenance

debuggable

Page 18:

Big Data Technologies

Page 19:

MongoDB

Hadoop

Page 20:

MongoDB

Horizontally scalable, document oriented, high performance, fully consistent.

An application stores documents such as:

{ author: "steve", date: new Date(), text: "About MongoDB...", tags: ["tech", "database"] }
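For illustration only, a minimal PyMongo sketch of storing and querying a document shaped like the one above; the connection string, database name ("blog") and collection name ("posts") are assumptions, not from the slides.

```python
# A minimal PyMongo sketch of inserting and querying a document like the one above.
# The connection string, database and collection names are illustrative assumptions.
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
posts = client["blog"]["posts"]

posts.insert_one({
    "author": "steve",
    "date": datetime.now(timezone.utc),
    "text": "About MongoDB...",
    "tags": ["tech", "database"],
})

# Matching a value inside an array needs no schema change and no join.
for doc in posts.find({"tags": "tech"}):
    print(doc["author"], doc["text"])
```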

Page 21:

MongoDB philosophy

Keep functionality when we can (key/value stores are great, but we need more)

Non-relational (no joins) makes scaling horizontally practical

Document data models are good

Database technology should run anywhere: virtualized, cloud, bare metal, etc.

Page 22:

Under the hood

Written in C++

Runs nearly everywhere

Data serialized to BSON

Extensive use of memory-mapped files, i.e. read-through / write-through memory caching.
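As a hedged illustration of BSON serialization, the snippet below round-trips a document through the bson module that ships with PyMongo (assuming PyMongo 3.9+, where bson.encode/bson.decode are available); the document itself is made up.

```python
# Illustrative only: dict -> BSON bytes -> dict, using PyMongo's bundled bson module.
import bson

doc = {"author": "steve", "tags": ["tech", "database"]}
data = bson.encode(doc)          # dict -> BSON bytes (the on-disk/over-the-wire form)
print(len(data), "bytes of BSON")
print(bson.decode(data))         # BSON bytes -> dict
```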

Page 23:

Database Landscape

[Chart: scalability & performance vs. depth of functionality, positioning Memcached, MongoDB, and RDBMS]

Page 24:

MongoDB

“MongoDB has the best features of key/value stores, document databases and relational databases in one.” (John Nunemaker)

Page 25:

Relational databases made normalized data look like this:

User: Name, Email Address
Category: Name, URL
Article: Name, Slug, Publish date, Text
Tag: Name, URL
Comment: Comment, Date, Author

Page 26:

Document databases make normalized data look like this:

User: Name, Email Address
Article: Name, Slug, Publish date, Text, Author
Tag[]: Value
Comment[]: Comment, Date, Author
Category[]: Value
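As a sketch of what this looks like in practice, here is the article above as a single embedded document (field values are invented for illustration); tags, categories and comments live inside the article, so reading an article needs no joins.

```python
# An illustrative embedded document for the blog example above; values are made up.
article = {
    "name": "Intro to Big Data",
    "slug": "intro-to-big-data",
    "publish_date": "2016-01-12",
    "text": "...",
    "author": {"name": "steve", "email": "steve@example.com"},
    "tags": ["tech", "database"],
    "categories": ["databases"],
    "comments": [
        {"comment": "Nice overview", "date": "2016-01-13", "author": "alice"},
    ],
}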

Page 27:

When MongoDB?

Online processing

Working on small subsets at a time

Processing document store data (like game data) while interacting in game (items/HP/XP, graphics, stage, etc)

Page 28:

Hadoop

What is Hadoop?

A scalable, fault-tolerant, high-performance distributed file system

Asynchronous replication
Write-once, read-many (WORM)
Hadoop cluster with 3 DataNodes minimum
Data divided into 64 MB or 128 MB blocks, each block replicated 3 times (default)
NameNode holds filesystem metadata
Files are broken up and spread over the DataNodes
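A back-of-the-envelope sketch of what those defaults imply for a single file; the helper name and the 1 GB example are illustrative, not from the slides.

```python
# What 64 MB blocks and 3x replication mean for one file, roughly.
import math

def hdfs_footprint(file_size_mb, block_mb=64, replication=3):
    blocks = math.ceil(file_size_mb / block_mb)   # block entries the NameNode tracks
    raw_mb = file_size_mb * replication           # bytes actually stored across DataNodes
    return blocks, raw_mb

blocks, raw_mb = hdfs_footprint(1024)             # a 1 GB file
print(f"{blocks} blocks, ~{raw_mb} MB of raw storage across the cluster")
```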

Page 29:

Benefits of Hadoop

Runs on cheap commodity hardware

Automatically handles data replication and node failure

It does the hard work – you can focus on processing data

Cost-saving, efficient, and reliable data processing

Page 30:

Where and when Hadoop?

Where?

Batch data processing, not real-time / user facing (e.g. Document Analysis and Indexing, Web Graphs and Crawling)

Highly parallel data intensive distributed applications

Very large production deployments

When?

Processing lots of unstructured data

When your processing can easily be made parallel

Running batch jobs is acceptable

When you have access to lots of cheap hardware

Page 31:

What is Hadoop used for?

Searching

Log Processing

Recommendation systems

Analytics

Video and Image analysis

Data Retention

Hadoop

Massively scalable data: 100 PB of data
Ingest at scale: sensor data, clickstream, online gaming data (user activity, time online, activities performed, etc.)
Later analytics

Page 32:

How does Hadoop work?

Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster.

In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster.

Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.

Page 33:

Hadoop Architecture

Distributed File System and MapReduce

HDFS

Runs on top of the existing file system in a node

Designed to handle large files with streaming data access patterns

Designed for streaming (sequential data access rather than random access)

Page 34:

HDFS

Known as the Hadoop Distributed File System

Primary storage system for Hadoop Apps

Multiple replicas of data blocks distributed on compute nodes for reliability

Files are stored on multiple boxes for durability and high availability

Page 35:

HDFS

Optimized for long sequential reads

Data written once, read multiple times; no append possible

Large files and sequential reads, so no local caching of data

Data replication in HDFS

Page 36:

HDFS Architecture

Page 37:

HDFS Architecture

Block-structured file system

Files are divided into blocks and stored

Each individual machine in the cluster is a DataNode

Default block size is 64 MB

Information about blocks is stored in metadata

All this metadata is stored on a machine called the NameNode

Page 38:

MapReduce

Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.

Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System (see HDFS Architecture Guide) are running on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.

The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster-node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks. The slaves execute the tasks as directed by the master.

Page 39:

MapReduce

Technology from Google

MapReduce program contains transformations that can be applied to the data any number of times

A MapReduce job is an executing MapReduce program, with its map tasks running in parallel with each other and its reduce tasks likewise running in parallel.

Page 40:

MapReduce

Page 41:

MapReduce

HDFS handles the distributed file system layer

MapReduce is how we process the data

MapReduce daemons: JobTracker, TaskTracker

Goals: distribute the reading and processing of data; localize the processing when possible; share as little data as possible while processing

Page 42:

JobTracker

One per cluster: the “master node”

Takes jobs from clients

Splits work into “tasks”

Distributes “tasks” to TaskTrackers

Monitors progress, deals with failures

Page 43:

TaskTracker

Many per cluster: the “slave nodes”

Does the actual work, executes the code for the job

Talks regularly with JobTracker

Launches child process when given a task

Reports progress of running “task” back to JobTracker

Page 44:

Minimally, applications specify the input/output locations and supply map and reduce functions via implementations of appropriate interfaces and/or abstract classes. These, and other job parameters, comprise the job configuration.

The Hadoop job client then submits the job (jar/executable etc.) and configuration to the JobTracker, which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, and providing status and diagnostic information to the job client.

Although the Hadoop framework is implemented in Java™, MapReduce applications need not be written in Java.

Anatomy of MapReduce

Page 45:

Submitting a MapReduce job

Page 46:

Simple Data Flow Example

Page 47:

Example of MapReduce

Read text files and count how often each word occurs.

The input is text files. The output is a text file with one line per word: word, tab, count.

Map: Produce pairs of (word, count)

Reduce: For each word, sum up the counts.
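A sketch of this word count as Hadoop Streaming scripts in Python; the slides do not name a language (the classic implementation is Java), so this is just one illustrative option, and the file names are assumptions.

```python
# Word count as two Hadoop Streaming scripts; each function would live in its own file.

# mapper.py -- emit "word<TAB>1" for every word read from stdin
import sys

def mapper():
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

# reducer.py -- Streaming delivers map output sorted by key, so counts for a word
# arrive as one contiguous run and can be summed on the fly.
def reducer():
    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            count += int(value)
        else:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")
```

In practice the two scripts would be submitted with the hadoop-streaming JAR, roughly: hadoop jar hadoop-streaming.jar -input <dir> -output <dir> -mapper mapper.py -reducer reducer.py (paths illustrative).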

Page 48:

Anatomy of MapReduce

Client submits a job: “I want to count the occurrences of each word.” We will assume that the data to process is already in HDFS.

JobTracker receives the job: it queries the NameNode for the number of blocks in the file; the job is split into tasks, with one map task per block and as many reduce tasks as specified in the job.

TaskTracker checks in regularly with the JobTracker: “Is there any work for me?”

If the JobTracker has a map task for which the TaskTracker holds a local block of the file being processed, the TaskTracker is given that task.

Page 49:

Hadoop, pros & cons

Pros:
Superior in availability / scalability / manageability
Large block sizes suit large files (giga-, petabytes, …)
Extremely scalable due to HDFS
Batch-based MapReduce facilitating parallel work

Cons: programmability and metadata
Less efficient for smaller files
MapReduce is more complex than traditional SQL queries
MapReduce is batch-based, so there is delay
Need to publish data in well-known schemas

Page 50:

Hive

Turns Hadoop into a data warehouse

Developed at Facebook

Declarative language (SQL Dialect) - HiveQL

Schema non-optional but data can have many schemas

Relationally complete
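As a hedged illustration of what a HiveQL job looks like from the outside, the snippet below shells out to the hive CLI with -e; the table and column names (group_admins, country) and the query itself are entirely hypothetical.

```python
# Illustrative only: run a HiveQL aggregation by invoking the hive CLI from Python.
import subprocess

query = """
SELECT country, COUNT(*) AS admin_count
FROM group_admins
GROUP BY country;
"""

# Hive compiles the declarative query into MapReduce jobs over data stored in HDFS.
subprocess.run(["hive", "-e", query], check=True)
```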

Page 51:

Data warehousing in Facebook (case study)

Hadoop/Hive cluster: 8,400 cores
Raw storage capacity: ~12.5 PB
8 cores + 12 TB per node; 32 GB RAM per node
Two-level network topology:
1 Gbit/sec from node to rack switch
4 Gbit/sec to top-level rack switch
2 clusters: one for ad-hoc users, one for strict-SLA jobs

Page 52:

Hadoop / Hive usage

Statistics per day:
12 TB of compressed new data added per day
135 TB of compressed data scanned per day
7,500+ Hive jobs per day
80k computing hours per day

Hive simplifies Hadoop:
New engineers go through a Hive training session
~200 people/month run jobs on Hadoop/Hive
Analysts (non-engineers) use Hadoop through Hive
Most jobs are Hive jobs

Data warehousing in Facebook (case study)

Page 53:

Types of applications

Reporting
E.g.: daily/weekly aggregations of impression/click counts; measures of user engagement; MicroStrategy reports

Ad hoc analysis
E.g.: how many group admins, broken down by state/country

Machine learning (assembling training data)

Ad optimization
E.g.: user engagement as a function of user attributes

Many others

Data warehousing in Facebook (case study)

Page 54:

Scribe

A service for distributed log file collection

Designed to run as a daemon process on every node in the data center

Forwards log files from any process running on that machine back to a central pool of aggregators

Scribe is a server for aggregating streaming log data

designed to scale to a very large number of nodes and be robust to network and node failures

Page 55:

Scribe and Hadoop clusters at Facebook

Used to log Data from web servers

Clusters collocated with the web servers

Network is the biggest bottleneck

Typical cluster has about 50 nodes

Page 56:

Data Flow Architecture at Facebook

Page 57:

OK – what do we know so far?

We’ve talked about MongoDB for reasonably large amounts of unstructured data, and processing it quickly

We’ve talked about Hadoop (Hive, Scribe, etc.) for holding really large amounts of unstructured data so we can process and mine it with HiveQL

What if we have a combined need to process large and huge sets for the same app? What if we need to process good amounts of unstructured data right away, while storing it as lots more for analysis later?

Page 58:

Applications have complex needs

Let’s use the best tool for the job

Often more than one tool is needed

MongoDB ideal operational database

MongoDB ideal for BIG data, but

Not really a data processing engine

For heavy processing needs, use a tool designed for that job ... Hadoop

Page 59:

MongoDB & Hadoop together

Hadoop
Massively scalable data: 100 PB of data
Ingest at scale: sensor data, clickstream, online gaming data (user activity, time online, activities performed, etc.)
Later analytics

MongoDB
Online processing
Working on small subsets at a time
Processing document-store data (like game data) while interacting in game (items/HP/XP, graphics, stage, etc.)

Page 60:

MongoDB MapReduce

MongoDB map reduce is quite capable... but with limits

JavaScript is not the best language for map reduce processing

JavaScript has limited external data-processing libraries

Adds load to data store

Sharded environments do parallel processing
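A hedged sketch of MongoDB's JavaScript map/reduce driven from PyMongo via the mapReduce database command; the collection and field names are made up, and on recent server versions this command is deprecated in favour of the aggregation framework covered next.

```python
# Illustrative only: count posts per tag with MongoDB's JS map/reduce via PyMongo.
from bson.code import Code
from pymongo import MongoClient

db = MongoClient()["blog"]

map_fn = Code("function () { this.tags.forEach(function (t) { emit(t, 1); }); }")
reduce_fn = Code("function (key, values) { return Array.sum(values); }")

# Results are written to the 'tag_counts' collection.
db.command("mapReduce", "posts", map=map_fn, reduce=reduce_fn, out="tag_counts")

for doc in db["tag_counts"].find():
    print(doc["_id"], doc["value"])
```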

Page 61:

MongoDB Map Reduce

Page 62:

MongoDB Aggregation

Most uses of MongoDB Map Reduce were for aggregation

Aggregation Framework optimized for aggregate queries

Fixes some of the limits of MongoDB MR

Can do real-time aggregation, similar to SQL GROUP BY

Parallel processing on sharded clusters
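For comparison, the same per-tag count expressed with the aggregation framework: an $unwind/$group pipeline, roughly the MongoDB analogue of SQL GROUP BY. Collection and field names are illustrative.

```python
# Per-tag counts using the aggregation framework instead of JS map/reduce.
from pymongo import MongoClient

posts = MongoClient()["blog"]["posts"]

pipeline = [
    {"$unwind": "$tags"},                                # one document per tag value
    {"$group": {"_id": "$tags", "count": {"$sum": 1}}},  # count documents per tag
    {"$sort": {"count": -1}},
]

for row in posts.aggregate(pipeline):
    print(row["_id"], row["count"])
```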

Page 63:

MongoDB Aggregation

Page 64:

MongoDB Map Reduce

Page 65:

Hadoop Map Reduce

Page 66:

MongoDB & Hadoop