Big Data

TRANSCRIPT

Page 1:

Big Data

Page 2:

Definition

Big Data is a term for a collection of datasets so large and complex that it becomes difficult to process them using conventional database management tools or traditional data processing applications.

Page 3:

What is ‘Big’ ‘Data’? Is it:

Too big to be stored on a single server?
Too unstructured to fit into a row/column DB?
Too voluminous or dynamic to fit into a static data warehouse?

Page 5:

Scale of data

Page 6:

Scale of data

Page 7:

Opportunity for Big Data

Page 9:

Opportunity for Big Data

Page 10:

Candidates for Big Data

Page 11:

Big Data gap

Page 12:

Community

Usage and users: Adobe, Alibaba, Amazon, AOL, Facebook, Google, IBM

Page 13:

Conventional approach

Page 14:

Issues with conventional approach

Batch-oriented

Batches cannot be interrupted or reconfigured on-the-fly

Schema management required in multiple places

Lots of data being shuffled around

Data duplication

Turn-around times of hours rather than minutes

Page 15:

Big Data systems (real time)

[Diagram: applications (indexing and search, usage analytics, insights and recommendations, CRUD) served by views on top of a data backend (data, metadata, attention data, indexes)]

Page 16:

A Data system – example (perhaps)

[Diagram: raw data (e.g. tweets) feeding views such as #tweets/URL, influence scores, and trending topics, which serve applications (indexing and search, usage analytics, insights and recommendations, CRUD)]

Page 17:

Properties of a Data system

robustness

fast reads AND updates/inserts

scalable

generic

extensible

allows ad-hoc analysis

low-cost maintenance

debuggable

Page 18:

Big Data Technologies

Page 19:

MongoDB

Hadoop

Page 20:

MongoDB

Horizontally scalable, document oriented, high performance, fully consistent.

An application stores documents such as:

{ author: "steve", date: new Date(), text: "About MongoDB...", tags: ["tech", "database"] }
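For illustration only, a minimal PyMongo sketch of storing and querying a document shaped like the one above; the connection string, database name ("blog") and collection name ("posts") are assumptions, not from the slides.

```python
# A minimal PyMongo sketch of inserting and querying a document like the one above.
# The connection string, database and collection names are illustrative assumptions.
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
posts = client["blog"]["posts"]

posts.insert_one({
    "author": "steve",
    "date": datetime.now(timezone.utc),
    "text": "About MongoDB...",
    "tags": ["tech", "database"],
})

# Matching a value inside an array needs no schema change and no join.
for doc in posts.find({"tags": "tech"}):
    print(doc["author"], doc["text"])
```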

Page 21:

MongoDB philosophy

Keep functionality when we can (key/value stores are great, but we need more)

Non-relational (no joins) makes scaling horizontally practical

Document data models are good

Database technology should run anywhere: virtualized, cloud, bare metal, etc.

Page 22:

Under the hood

Written in C++

Runs nearly everywhere

Data serialized to BSON

Extensive use of memory-mapped files, i.e. read-through / write-through memory caching.
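As a hedged illustration of BSON serialization, the snippet below round-trips a document through the bson module that ships with PyMongo (assuming PyMongo 3.9+, where bson.encode/bson.decode are available); the document itself is made up.

```python
# Illustrative only: dict -> BSON bytes -> dict, using PyMongo's bundled bson module.
import bson

doc = {"author": "steve", "tags": ["tech", "database"]}
data = bson.encode(doc)          # dict -> BSON bytes (the on-disk/over-the-wire form)
print(len(data), "bytes of BSON")
print(bson.decode(data))         # BSON bytes -> dict
```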

Page 23:

Database Landscape

[Chart: scalability & performance vs. depth of functionality, positioning Memcached, MongoDB, and RDBMS]

Page 24:

MongoDB

“MongoDB has the best features of key/value stores, document databases and relational databases in one.” (John Nunemaker)

Page 25:

Relational databases made normalized data look like this:

User: Name, Email Address
Category: Name, URL
Article: Name, Slug, Publish date, Text
Tag: Name, URL
Comment: Comment, Date, Author

Page 26:

Document databases make normalized data look like this:

User: Name, Email Address
Article: Name, Slug, Publish date, Text, Author
Tag[]: Value
Comment[]: Comment, Date, Author
Category[]: Value
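As a sketch of what this looks like in practice, here is the article above as a single embedded document (field values are invented for illustration); tags, categories and comments live inside the article, so reading an article needs no joins.

```python
# An illustrative embedded document for the blog example above; values are made up.
article = {
    "name": "Intro to Big Data",
    "slug": "intro-to-big-data",
    "publish_date": "2016-01-12",
    "text": "...",
    "author": {"name": "steve", "email": "steve@example.com"},
    "tags": ["tech", "database"],
    "categories": ["databases"],
    "comments": [
        {"comment": "Nice overview", "date": "2016-01-13", "author": "alice"},
    ],
}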

Page 27:

When MongoDB?

Online processing

Working on small subsets at a time

Processing document store data (like game data) while interacting in game (items/HP/XP, graphics, stage, etc)

Page 28:

Hadoop

What is Hadoop?

A scalable, fault-tolerant, high-performance distributed file system

Asynchronous replication
Write-once, read-many (WORM)
Hadoop cluster with 3 DataNodes minimum
Data divided into 64 MB or 128 MB blocks, each block replicated 3 times (default)
NameNode holds filesystem metadata
Files are broken up and spread over the DataNodes
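A back-of-the-envelope sketch of what those defaults imply for a single file; the helper name and the 1 GB example are illustrative, not from the slides.

```python
# What 64 MB blocks and 3x replication mean for one file, roughly.
import math

def hdfs_footprint(file_size_mb, block_mb=64, replication=3):
    blocks = math.ceil(file_size_mb / block_mb)   # block entries the NameNode tracks
    raw_mb = file_size_mb * replication           # bytes actually stored across DataNodes
    return blocks, raw_mb

blocks, raw_mb = hdfs_footprint(1024)             # a 1 GB file
print(f"{blocks} blocks, ~{raw_mb} MB of raw storage across the cluster")
```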

Page 29:

Benefits of Hadoop

Runs on cheap commodity hardware

Automatically handles data replication and node failure

It does the hard work – you can focus on processing data

Cost-saving, efficient, and reliable data processing

Page 30:

Where and when Hadoop?

Where?

Batch data processing, not real-time / user facing (e.g. Document Analysis and Indexing, Web Graphs and Crawling)

Highly parallel data intensive distributed applications

Very large production deployments

When?

Processing lots of unstructured data

When your processing can easily be made parallel

Running batch jobs is acceptable

When you have access to lots of cheap hardware

Page 31:

What is Hadoop used for?

Searching

Log Processing

Recommendation systems

Analytics

Video and Image analysis

Data Retention

Hadoop

Massively scalable data: 100 PB of data
Ingest at scale: sensor data, clickstream, online gaming data (user activity, time online, activities performed, etc.)
Later analytics

Page 32:

How does Hadoop work?

Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster.

In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster.

Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.

Page 33:

Hadoop Architecture

Distributed File System and MapReduce

HDFS

Runs on top of the existing file system in a node

Designed to handle large files with streaming data access patterns

Designed for streaming (sequential data access rather than random access)

Page 34:

HDFS

Known as the Hadoop Distributed File System

Primary storage system for Hadoop Apps

Multiple replicas of data blocks distributed on compute nodes for reliability

Files are stored on multiple boxes for durability and high availability

Page 35:

HDFS

Optimized for long sequential reads

Data written once, read multiple times; no append possible

Large files and sequential reads, so no local caching of data

Data replication in HDFS

Page 36:

HDFS Architecture

Page 37:

HDFS Architecture

Block-structured file system

Files are divided into blocks and stored

Each individual machine in the cluster is a DataNode

Default block size is 64 MB

Information about blocks is stored in metadata

All this metadata is stored on a machine called the NameNode

Page 38:

MapReduce

Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.

Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System (see HDFS Architecture Guide) are running on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.

The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster-node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks. The slaves execute the tasks as directed by the master.

Page 39:

MapReduce

Technology from Google

MapReduce program contains transformations that can be applied to the data any number of times

A MapReduce job is an executing MapReduce program, with its map tasks running in parallel with each other and its reduce tasks likewise running in parallel.

Page 40:

MapReduce

Page 41:

MapReduce

HDFS handles the distributed file system layer

MapReduce is how we process the data

MapReduce daemons: JobTracker, TaskTracker

Goals: distribute the reading and processing of data; localize the processing when possible; share as little data as possible while processing

Page 42:

JobTracker

One per cluster: the “master node”

Takes jobs from clients

Splits work into “tasks”

Distributes “tasks” to TaskTrackers

Monitors progress, deals with failures

Page 43:

TaskTracker

Many per cluster: the “slave nodes”

Does the actual work, executes the code for the job

Talks regularly with JobTracker

Launches child process when given a task

Reports progress of running “task” back to JobTracker

Page 44:

Minimally, applications specify the input/output locations and supply map and reduce functions via implementations of appropriate interfaces and/or abstract classes. These, and other job parameters, comprise the job configuration.

The Hadoop job client then submits the job (jar/executable etc.) and configuration to the JobTracker, which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, and providing status and diagnostic information to the job client.

Although the Hadoop framework is implemented in Java™, MapReduce applications need not be written in Java.

Anatomy of MapReduce

Page 45:

Submitting a MapReduce job

Page 46:

Simple Data Flow Example

Page 47:

Example of MapReduce

Read text files and count how often each word occurs.

The input is text files. The output is a text file with one line per word: word, tab, count.

Map: Produce pairs of (word, count)

Reduce: For each word, sum up the counts.
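A sketch of this word count as Hadoop Streaming scripts in Python; the slides do not name a language (the classic implementation is Java), so this is just one illustrative option, and the file names are assumptions.

```python
# Word count as two Hadoop Streaming scripts; each function would live in its own file.

# mapper.py -- emit "word<TAB>1" for every word read from stdin
import sys

def mapper():
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

# reducer.py -- Streaming delivers map output sorted by key, so counts for a word
# arrive as one contiguous run and can be summed on the fly.
def reducer():
    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            count += int(value)
        else:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")
```

In practice the two scripts would be submitted with the hadoop-streaming JAR, roughly: hadoop jar hadoop-streaming.jar -input <dir> -output <dir> -mapper mapper.py -reducer reducer.py (paths illustrative).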

Page 48:

Anatomy of MapReduce

Client submits a job: “I want to count the occurrences of each word.” We will assume that the data to process is already in HDFS.

JobTracker receives the job: it queries the NameNode for the number of blocks in the file; the job is split into tasks, with one map task per block and as many reduce tasks as specified in the job.

TaskTracker checks in regularly with the JobTracker: “Is there any work for me?”

If the JobTracker has a map task for which the TaskTracker holds a local block of the file being processed, the TaskTracker is given that task.

Page 49:

Hadoop, pros & cons

Pros:
Superior in availability / scalability / manageability
Large block sizes suit large files (giga-, petabytes, …)
Extremely scalable due to HDFS
Batch-based MapReduce facilitating parallel work

Cons: programmability and metadata
Less efficient for smaller files
MapReduce is more complex than traditional SQL queries
MapReduce is batch-based, so there is delay
Need to publish data in well-known schemas

Page 50:

Hive

Turns Hadoop into a data warehouse

Developed at Facebook

Declarative language (SQL Dialect) - HiveQL

Schema non-optional but data can have many schemas

Relationally complete
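As a hedged illustration of what a HiveQL job looks like from the outside, the snippet below shells out to the hive CLI with -e; the table and column names (group_admins, country) and the query itself are entirely hypothetical.

```python
# Illustrative only: run a HiveQL aggregation by invoking the hive CLI from Python.
import subprocess

query = """
SELECT country, COUNT(*) AS admin_count
FROM group_admins
GROUP BY country;
"""

# Hive compiles the declarative query into MapReduce jobs over data stored in HDFS.
subprocess.run(["hive", "-e", query], check=True)
```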

Page 51:

Data warehousing in Facebook (case study)

Hadoop/Hive cluster: 8,400 cores
Raw storage capacity: ~12.5 PB
8 cores + 12 TB per node; 32 GB RAM per node
Two-level network topology:
1 Gbit/sec from node to rack switch
4 Gbit/sec to top-level rack switch
2 clusters: one for ad-hoc users, one for strict-SLA jobs

Page 52:

Hadoop / Hive usage

Statistics per day:
12 TB of compressed new data added per day
135 TB of compressed data scanned per day
7,500+ Hive jobs per day
80k computing hours per day

Hive simplifies Hadoop:
New engineers go through a Hive training session
~200 people/month run jobs on Hadoop/Hive
Analysts (non-engineers) use Hadoop through Hive
Most jobs are Hive jobs

Data warehousing in Facebook (case study)

Page 53:

Types of applications

Reporting
E.g.: daily/weekly aggregations of impression/click counts; measures of user engagement; MicroStrategy reports

Ad hoc analysis
E.g.: how many group admins, broken down by state/country

Machine learning (assembling training data)

Ad optimization
E.g.: user engagement as a function of user attributes

Many others

Data warehousing in Facebook (case study)

Page 54:

Scribe

A service for distributed log file collection

Designed to run as a daemon process on every node in the data center

Forwards log files from any process running on that machine back to a central pool of aggregators

Scribe is a server for aggregating streaming log data

designed to scale to a very large number of nodes and be robust to network and node failures

Page 55:

Scribe and Hadoop clusters at Facebook

Used to log Data from web servers

Clusters collocated with the web servers

Network is the biggest bottleneck

Typical cluster has about 50 nodes

Page 56:

Data Flow Architecture at Facebook

Page 57:

OK – what do we know so far?

We’ve talked about MongoDB for reasonably large amounts of unstructured data, and processing it quickly

We’ve talked about Hadoop (Hive, Scribe, etc.) for holding really large amounts of unstructured data so we can process and mine it with HiveQL

What if we have a combined need to process large and huge sets for the same app? What if we need to process good amounts of unstructured data right away, while storing it as lots more for analysis later?

Page 58:

Applications have complex needs

Let’s use the best tool for the job

Often more than one tool is needed

MongoDB ideal operational database

MongoDB ideal for BIG data, but

Not really a data processing engine

For heavy processing needs, use a tool designed for that job ... Hadoop

Page 59:

MongoDB & Hadoop together

Hadoop
Massively scalable data: 100 PB of data
Ingest at scale: sensor data, clickstream, online gaming data (user activity, time online, activities performed, etc.)
Later analytics

MongoDB
Online processing
Working on small subsets at a time
Processing document-store data (like game data) while interacting in game (items/HP/XP, graphics, stage, etc.)

Page 60:

MongoDB MapReduce

MongoDB map reduce is quite capable... but with limits

JavaScript is not the best language for map reduce processing

JavaScript has limited external data-processing libraries

Adds load to data store

Sharded environments do parallel processing
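A hedged sketch of MongoDB's JavaScript map/reduce driven from PyMongo via the mapReduce database command; the collection and field names are made up, and on recent server versions this command is deprecated in favour of the aggregation framework covered next.

```python
# Illustrative only: count posts per tag with MongoDB's JS map/reduce via PyMongo.
from bson.code import Code
from pymongo import MongoClient

db = MongoClient()["blog"]

map_fn = Code("function () { this.tags.forEach(function (t) { emit(t, 1); }); }")
reduce_fn = Code("function (key, values) { return Array.sum(values); }")

# Results are written to the 'tag_counts' collection.
db.command("mapReduce", "posts", map=map_fn, reduce=reduce_fn, out="tag_counts")

for doc in db["tag_counts"].find():
    print(doc["_id"], doc["value"])
```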

Page 61:

MongoDB Map Reduce

Page 62:

MongoDB Aggregation

Most uses of MongoDB Map Reduce were for aggregation

Aggregation Framework optimized for aggregate queries

Fixes some of the limits of MongoDB MR

Can do real-time aggregation, similar to SQL GROUP BY

Parallel processing on sharded clusters
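For comparison, the same per-tag count expressed with the aggregation framework: an $unwind/$group pipeline, roughly the MongoDB analogue of SQL GROUP BY. Collection and field names are illustrative.

```python
# Per-tag counts using the aggregation framework instead of JS map/reduce.
from pymongo import MongoClient

posts = MongoClient()["blog"]["posts"]

pipeline = [
    {"$unwind": "$tags"},                                # one document per tag value
    {"$group": {"_id": "$tags", "count": {"$sum": 1}}},  # count documents per tag
    {"$sort": {"count": -1}},
]

for row in posts.aggregate(pipeline):
    print(row["_id"], row["count"])
```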

Page 63:

MongoDB Aggregation

Page 64:

MongoDB Map Reduce

Page 65:

Hadoop Map Reduce

Page 66:

MongoDB & Hadoop