Big Data
DefinitionBig Data is a term for collection of datasets so
large and complex that it becomes difficult to process using conventional database management tools or traditional data processing applications
What is ‘Big’ ‘Data’? Is it:
Too big to be stored on a single server?
Too unstructured to fit into a row/column DB?
Too voluminous or dynamic to fit into a static data warehouse?
How big is big?
This was in 2011. By the end of 2012 there were more than 2.4 billion Internet users.
http://www.thecultureist.com/2013/05/09/how-many-people-use-the-internet-more-than-2-billion-infographic/
We now have about 7 billion connected devices.
Scale of data
Opportunity for Big Data
Sexiest job of the 21st Century?
http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ar/1
Candidates for Big Data
Big Data gap
Community
Usages: Adobe, Alibaba, Amazon, AOL, Facebook, Google, IBM
Users
Conventional approach
Issues with conventional approach
Batch-oriented
Batches cannot be interrupted or reconfigured on-the-fly
Schema management required in multiple places
Lots of data being shuffled around
Data duplication
Turn-around times of hours rather than minutes
Big Data systems (real time)
Applications
[Diagram: applications (Indexing and Search, Usage Analytics, Insights/recommendations, CRUD) read views built from a data backend holding data, metadata, attention data, and indexes]
A Data system – example (perhaps)
[Diagram: raw data (e.g. tweets) is processed into views (View 1, e.g. #tweets/URL; View 2, e.g. influence scores; View 3, e.g. trending topics) that serve applications: Indexing and Search, Usage Analytics, Insights/recommendations, CRUD]
Properties of a Data system
robustness
fast reads AND updates/inserts
scalable
generic
extensible
allows ad-hoc analysis
low-cost maintenance
debuggable
Big Data Technologies
MongoDB
Hadoop
MongoDB
Document oriented, horizontally scalable, high performance, fully consistent.

{ author: "steve",
  date: new Date(),
  text: "About MongoDB...",
  tags: ["tech", "database"] }
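What "document oriented" looks like from an application is sketched below with pymongo; this is a minimal sketch, assuming a local server and illustrative database/collection names.

```python
# Minimal pymongo sketch (assumed local server, illustrative names): store and
# query a schema-free document like the one on this slide.
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
posts = client["demo"]["posts"]  # hypothetical database and collection

posts.insert_one({
    "author": "steve",
    "date": datetime.now(timezone.utc),
    "text": "About MongoDB...",
    "tags": ["tech", "database"],
})

# Query by a field inside the document; no schema had to be declared first.
for doc in posts.find({"tags": "tech"}):
    print(doc["author"], doc["text"])
```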
MongoDB philosophy
Keep functionality when we can (key/value stores are great, but we need more)
Non-relational (no joins) makes scaling horizontally practical
Document data models are good
Database technology should run anywhere: virtualized, cloud, metal, etc.
Under the hood
Written in C++
Runs nearly everywhere
Data serialized to BSON
Extensive use of memory-mapped files, i.e. read-through, write-through memory caching.
Database Landscape
[Chart: Memcached, MongoDB, and RDBMS plotted by scalability & performance (y-axis) against depth of functionality (x-axis)]
"MongoDB has the best features of key/value stores, document databases and relational databases in one." (John Nunemaker)
Relational made normalized data look like this:
User: Name, Email Address
Category: Name, Url
Article: Name, Slug, Publish date, Text
Tag: Name, Url
Comment: Comment, Date, Author
Document databases make normalized data look like this:
User: Name, Email Address
Article: Name, Slug, Publish date, Text, Author
  Tag[]: Value
  Comment[]: Comment, Date, Author
  Category[]: Value
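To make the contrast concrete, here is a minimal sketch (a Python dict with hypothetical field names) of the single embedded article document the diagram above implies; in the relational version, these arrays would live in separate tables joined by foreign keys.

```python
# One article document with Tag[], Comment[], and Category[] embedded,
# instead of four separate tables joined by foreign keys. Field names
# are illustrative, not taken from the original slides.
article = {
    "name": "Why Document Stores?",
    "slug": "why-document-stores",
    "publish_date": "2013-05-09",
    "text": "Body text...",
    "author": {"name": "steve", "email": "steve@example.com"},
    "tags": ["tech", "database"],
    "categories": ["nosql"],
    "comments": [
        {"comment": "Nice post", "date": "2013-05-10", "author": "ana"},
    ],
}
# A single read returns the article with its tags, categories, and
# comments; no joins are needed.
```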
When MongoDB?
Online processing
Working on small subsets at a time
Processing document store data (like game data) while interacting in game (items/HP/XP, graphics, stage, etc)
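As an illustration of this "online processing on small subsets" pattern, the sketch below updates one player's stats in place with pymongo; the database, collection, and field names (hp, xp, stage) are illustrative assumptions.

```python
# Update a single player's document in place while the game is running.
# Names (db "game", collection "players", fields hp/xp/stage) are assumptions.
from pymongo import MongoClient

players = MongoClient("mongodb://localhost:27017")["game"]["players"]

# Atomically adjust one small subset of the data: this player's HP and XP.
players.update_one(
    {"_id": "player-42"},
    {"$inc": {"hp": -10, "xp": 250}, "$set": {"stage": "level-3"}},
)
```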
Hadoop
What is Hadoop?
A scalable, fault-tolerant, high-performance distributed file system
Asynchronous replication
Write-once, read-many (WORM)
Hadoop cluster with 3 DataNodes minimum
Data divided into 64 MB or 128 MB blocks, each block replicated 3 times (default)
NameNode holds filesystem metadata
Files are broken up and spread over the DataNodes
Benefits of Hadoop
Runs on cheap commodity hardware
Automatically handles data replication and node failure
It does the hard work – you can focus on processing data
Cost-saving, efficient, and reliable data processing
Where and When Hadoop?
Where?
Batch data processing, not real-time / user facing (e.g. Document Analysis and Indexing, Web Graphs and Crawling)
Highly parallel data intensive distributed applications
Very large production deployments
When?
Process lots of unstructured data
When your processing can easily be made parallel
Running batch jobs is acceptable
When you have access to lots of cheap hardware
What is Hadoop used for?
Searching
Log Processing
Recommendation systems
Analytics
Video and Image analysis
Data Retention
Hadoop
Massively scalable data: 100 PB of data
Ingest at scale: sensor data, clickstream, online gaming data (user activity, time online, activities performed, etc.)
Later analytics
How Hadoop works
Hadoop implements a computational paradigm named
Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster.
In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster.
Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.
Hadoop Architecture
Distributed File System and MapReduce

HDFS
Runs on top of the existing file system in a node
Designed to handle large files with streaming data access patterns
Designed for streaming (sequential data access rather than random access)
HDFS
Known as the Hadoop Distributed File System
Primary storage system for Hadoop Apps
Multiple replicas of data blocks distributed on compute nodes for reliability
Files are stored on multiple boxes for durability and high availability
HDFS
Optimized for long sequential reads
Data written once, read multiple times; no append possible
Large files and sequential reads, so no local caching of data
Data replication in HDFS
HDFS Architecture
HDFS Architecture
Block-structured file system
Files are divided into blocks and stored
Each individual machine in the cluster is a DataNode
Default block size is 64 MB
Information about the blocks is stored in metadata
All this metadata is stored on a machine called the NameNode
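A quick back-of-the-envelope sketch of what these defaults imply for a single file, using a hypothetical 1 GB file:

```python
# Back-of-the-envelope HDFS math using the defaults quoted above:
# 64 MB blocks, 3 replicas per block. The file size is hypothetical.
import math

block_size_mb = 64    # default block size quoted above
replication = 3       # default replication factor
file_size_mb = 1024   # a hypothetical 1 GB file

blocks = math.ceil(file_size_mb / block_size_mb)  # blocks the NameNode tracks
raw_storage_mb = file_size_mb * replication       # raw bytes across DataNodes

print(f"{blocks} blocks tracked in the NameNode's metadata")   # 16
print(f"~{raw_storage_mb} MB of raw storage on the DataNodes")  # ~3072
```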
Map Reduce
Hadoop MapReduce is a software framework for easily writing applications
which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.
Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System (see HDFS Architecture Guide) are running on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.
The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster-node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks. The slaves execute the tasks as directed by the master.
MapReduce
Technology from Google
A MapReduce program contains transformations that can be applied to the data any number of times
A MapReduce job is an executing MapReduce program, with its map tasks running in parallel with one another and its reduce tasks likewise running in parallel
MapReduce
MapReduce
HDFS handles the Distributed File System layer
MapReduce is how we process the data
MapReduce Daemons: JobTracker, TaskTracker
Goals:
Distribute the reading and processing of data
Localize the processing when possible
Share as little data as possible while processing
JobTracker
One per cluster (the "master node")
Takes jobs from clients
Splits work into “tasks”
Distributes “tasks” to TaskTrackers
Monitors progress, deals with failures
TaskTracker
Many per cluster (the "slave nodes")
Does the actual work, executes the code for the job
Talks regularly with JobTracker
Launches child process when given a task
Reports progress of running “task” back to JobTracker
Minimally, applications specify the input/output locations and supply map and reduce
functions via implementations of appropriate interfaces and/or abstract-classes. These, and other job parameters, comprise the job configuration.
The Hadoop job client then submits the job (jar/executable etc.) and configuration to the
JobTracker which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, providing status and diagnostic information to the job-client.
Although the Hadoop framework is implemented in Java™, MapReduce applications need not be written in Java.
Anatomy of MapReduce
Submitting a MapReduce job
Simple Data Flow Example
Example of MapReduce
Read text files and count how often words occur.
The input is text files; the output is a text file with one line per word: word, tab, count.
Map: Produce pairs of (word, count)
Reduce: For each word, sum up the counts.
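Since, as noted above, MapReduce applications need not be written in Java, here is a minimal word-count sketch for Hadoop Streaming in Python; the script name and the job invocation below are illustrative.

```python
#!/usr/bin/env python
# Word count for Hadoop Streaming: the same script acts as mapper
# ("wordcount.py map") or reducer ("wordcount.py reduce"). Hadoop sorts the
# mapper output by key, so all counts for one word reach the reducer together.
import sys

def mapper():
    # Emit (word, 1) for every word on every input line.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by word; sum consecutive counts per word.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

A job might then be launched with something like: hadoop jar hadoop-streaming-*.jar -files wordcount.py -input books -output counts -mapper "python wordcount.py map" -reducer "python wordcount.py reduce" (the streaming jar's exact path and name vary by distribution).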
Anatomy of MapReduce
Client submits a job: "I want to count the occurrences of each word." We assume the data to process is already in HDFS.
JobTracker receives the job and queries the NameNode for the number of blocks in the file. The job is split into tasks: one map task per block, and as many reduce tasks as specified in the job.
TaskTracker checks in regularly with the JobTracker: "Is there any work for me?"
If the JobTracker has a map task for which the TaskTracker holds a local block of the file being processed, the TaskTracker is given that task.
Hadoop, pros & cons
Pros: superior in availability / scalability / manageability
Large block sizes suit large files (giga-, petabytes, ...)
Extremely scalable due to HDFS
Batch-based MapReduce facilitating parallel work
Cons: programmability and metadata
Less efficient for smaller files
MapReduce is more complex than traditional SQL queries
MapReduce is batch-based, which introduces delay
Need to publish data in well-known schemas
Hive
Turns Hadoop into a data warehouse
Developed at Facebook
Declarative language (SQL Dialect) - HiveQL
Schema non-optional but data can have many schemas
Relationally complete
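As a hedged sketch of what a "declarative SQL dialect over Hadoop" means from a client's point of view, the snippet below runs a HiveQL aggregate through the PyHive library; the library choice, host, port, and the page_views table are all assumptions for illustration.

```python
# Run a HiveQL aggregate over data in HDFS. Host, port, and the "page_views"
# table are assumptions; Hive compiles the query into MapReduce jobs.
from pyhive import hive

conn = hive.Connection(host="hive.example.com", port=10000)
cur = conn.cursor()
cur.execute(
    "SELECT url, COUNT(*) AS views "
    "FROM page_views "
    "WHERE ds = '2013-05-09' "
    "GROUP BY url "
    "ORDER BY views DESC "
    "LIMIT 10"
)
for url, views in cur.fetchall():
    print(url, views)
```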
Data warehousing in Facebook (case study)
Hadoop/Hive cluster
8400 cores
Raw storage capacity ~12.5 PB
8 cores + 12 TB per node
32 GB RAM per node
Two-level network topology:
1 Gbit/sec from node to rack switch
4 Gbit/sec to top-level rack switch
2 clusters: one for ad hoc users, one for strict-SLA jobs
Hadoop / Hive usage
Statistics per day:
12 TB of compressed new data added per day
135 TB of compressed data scanned per day
7500+ Hive jobs per day
80k computing hours per day
Hive simplifies Hadoop:
New engineers go through a Hive training session
~200 people/month run jobs on Hadoop/Hive
Analysts (non-engineers) use Hadoop through Hive
Most jobs are Hive jobs
Data warehousing in Facebook (case study)
Types of Applications
Reporting
E.g.: daily/weekly aggregations of impression/click counts
Measures of user engagement
Microstrategy reports
Ad hoc analysis
E.g.: how many group admins, broken down by state/country
Machine learning (assembling training data)
Ad optimization
E.g.: user engagement as a function of user attributes
Many others
Data warehousing in Facebook (case study)
Scribe
A service for distributed log file collection
Designed to run as a daemon process on every node in the data center
Forwards log files from any process running on that machine back to a central pool of aggregators
Scribe is a server for aggregating streaming log data
designed to scale to a very large number of nodes and be robust to network and node failures
Scribe and Hadoop clusters at Facebook
Used to log data from web servers
Clusters collocated with the web servers
Network is the biggest bottleneck
Typical cluster has about 50 nodes
Data Flow Architecture at Facebook
OK – what do we know so far?
We’ve talked about MongoDB for reasonably large amounts of unstructured data, and processing it quickly
We’ve talked about Hadoop (Hive, Scribe, etc) for unstructured data holding really large amounts of stuff so we can process and mine it with HiveQL
What if we have a combined need to process large and huge sets for the same app? What if we need to process good amounts of unstructured data right away, while also storing lots more of it for analysis later?
Applications have complex needs
Let's use the best tool for the job
Often more than one tool is needed
MongoDB: an ideal operational database
MongoDB: ideal for BIG data, but
Not really a data processing engine
For heavy processing needs, use a tool designed for that job ... Hadoop
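One simple way to combine the two is sketched below under stated assumptions (the database, collection, and file path are illustrative; a dedicated MongoDB-Hadoop connector also exists but is not shown): serve online traffic from MongoDB, and periodically export documents as JSON lines that can be copied into HDFS for batch processing.

```python
# Nightly export: dump MongoDB documents to a JSON-lines file that can be
# copied into HDFS (e.g. with "hdfs dfs -put") for MapReduce/Hive jobs.
# Database, collection, and path names are illustrative.
import json
from pymongo import MongoClient

events = MongoClient("mongodb://localhost:27017")["game"]["events"]

with open("/tmp/events-2013-05-09.jsonl", "w") as out:
    for doc in events.find({}, {"_id": 0}):  # drop ObjectId for plain JSON
        out.write(json.dumps(doc, default=str) + "\n")
```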
MongoDB & Hadoop together
Hadoop
Massively scalable data: 100 PB of data
Ingest at scale: sensor data, clickstream, online gaming data (user activity, time online, activities performed, etc.)
Later analytics
MongoDB
Online processing
Working on small subsets at a time
Processing document store data (like game data) while interacting in game (items/HP/XP, graphics, stage, etc.)
MongoDB MapReduce
MongoDB map/reduce is quite capable... but with limits
JavaScript is not the best language for map/reduce processing
JavaScript is limited in external data-processing libraries
Adds load to data store
Sharded environments do parallel processing
MongoDB Map Reduce
MongoDB Aggregation
Most uses of MongoDB Map Reduce were for aggregation
Aggregation Framework optimized for aggregate queries
Fixes some of the limits of MongoDB MR
Can do real-time aggregation similar to SQL GROUP BY (see the sketch below)
Parallel processing on sharded clusters
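The GROUP BY analogy is sketched below with pymongo's aggregate(); this is a minimal sketch, with the connection string, collection, and field names all illustrative assumptions.

```python
# Equivalent in spirit to: SELECT author, COUNT(*) FROM posts
# GROUP BY author ORDER BY count DESC. Names are illustrative.
from pymongo import MongoClient

posts = MongoClient("mongodb://localhost:27017")["demo"]["posts"]

pipeline = [
    {"$match": {"tags": "tech"}},                          # filter (WHERE)
    {"$group": {"_id": "$author", "count": {"$sum": 1}}},  # GROUP BY author
    {"$sort": {"count": -1}},                              # ORDER BY count DESC
]
for row in posts.aggregate(pipeline):
    print(row["_id"], row["count"])
```

On a sharded cluster, the early pipeline stages run on the shards and the results are merged, which is where the parallel processing mentioned above comes from.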
MongoDB Aggregation
MongoDB Map Reduce
Hadoop Map Reduce
MongoDB & Hadoop
Some Videos
Mongo MapReduce: https://www.youtube.com/watch?v=WovfjprPD_I
MySQL & NoSQL at Craigslist: https://www.youtube.com/watch?v=a0OvgTfF8Pg
9 Databases in 45 minutes: https://www.youtube.com/watch?v=XfK4aBF7tEI