Lesson 1 - Hadoop and Big Data Overview
Post on 02-Jun-2018
-
8/10/2019 Lesson 1 - Hadoop and Big Data Overview
1/57
Hadoop Developer Day
Nicolas Morales
IBM Big Data
nicolasm@us.ibm.com
@NicolasJMorales
-
Big Data Developers @
FREE
Monthly Events: San Jose & Foster City
Full Day Developer Days; Afternoon & Evening Hackathons
Past Meetups covered:
  Text Analytics
  Real-time Analytics
  SQL for Hadoop
  HBase
  Social Media Analytics
  Machine Data Analytics
  Security and Privacy
Development Environment provided
Live streaming
Topic suggestions welcome
http://www.meetup.com/BigDataDevelopers/
NEXT MEETUP: Streams Developer Day on Thursday, April 17.
Coming Soon: Big R, Watson, Big Data in the Cloud, Big SQL, MongoDB & more!
© 2013 IBM Corporation
-
Agenda: Hadoop Developer Day
Time                   Subject
8:00 AM - 9:00 AM      Registration & Breakfast
9:00 AM - 9:30 AM      Introduction to Hadoop
9:30 AM - 11:00 AM     Hadoop Architecture and HDFS + Hands-on Lab
11:00 AM - 11:45 AM    Introduction to MapReduce
11:45 AM - 12:45 PM    Lunch
12:45 PM - 2:00 PM     MapReduce Hands-on Lab
2:00 PM - 4:00 PM      Using Hive for Data Warehousing + Hands-on Lab
4:00 PM - 6:00 PM      SQL for Hadoop + Hands-on Lab
6:00 PM                Closing Remarks
-
Big Data University
www.bigdatauniversity.com
-
Quick Start Edition VM
Download: http://ibm.co/QuickStart (.tar.gz)
Unpack using WinRAR, 7-Zip, etc.
-
Your feedback is important, please complete your survey
-
Introduction to Hadoop
Rafael Coss
IBM Big Data
rcoss@us.ibm.com
@racoss
-
Executive Summary
What's Big Data?
  More Analytics on More Data for More People
  More than just Hadoop
What's Hadoop?
  Distributed computing framework that is:
    Cost Effective
    Flexible
    Fault Tolerant
What's a Hadoop Distribution?
  Common set of Apache Projects
  Install
  Unique Value Add
-
Key Business-driven Use Cases: Improve Business Outcomes
  Enrich Your Information Base with Big Data Exploration
  Improve Customer Interaction with Enhanced 360 View of the Customer
  Help Reduce Risk and Prevent Fraud with Security and Intelligence Extension
  Optimize Infrastructure and Monetize Data with Operations Analysis
  Gain IT Efficiency and Scale with Data Warehouse Modernization
42TB
1,100
99%
Acoustic Data Analyzed
Gain in Analysis Performance
40X
Metered Customers in Five States
60K
Publishing Partnerships
In Time Required For Analysis
-
Why is Big Data important?
Data AVAILABLE to an organization
Data an organization can PROCESS
Enterprises are more blind to new opportunities.
Organizations are able to process less and less of the available data.
100 million tweets are posted every day, 35 hours of video are uploaded every minute, 6.1 x 10^12 text messages were sent in 2011, and 247 x 10^9 e-mails passed through the net, 80% of them spam and viruses. => Prefiltering is more and more important.
-
What is Big Data?
More Analytics on More Data for More People
Transactional & Application Data
  Volume
  Structured
  Throughput
Machine Data
  Velocity
  Semi-structured
  Ingestion
Social Data
  Variety
  Highly unstructured
  Veracity
Enterprise Content
  Variety
  Highly unstructured
  Volume
-
Every Industry can Leverage Big Data and Analytics
Insurance
  360 View of Domain or Subject
  Catastrophe Modeling
  Fraud & Abuse
  Producer Performance Analytics
  Analytics Sandbox
Banking
  Optimizing Offers and Cross-sell
  Customer Service and Call Center Efficiency
  Fraud Detection & Investigation
  Credit & Counterparty Risk
Telco
  Pro-active Call Center
  Network Analytics
  Location Based Services
Energy & Utilities
  Smart Meter Analytics
  Distribution Load Forecasting/Scheduling
  Condition Based Maintenance
  Create & Target Customer Offerings
Media & Entertainment
  Business process transformation
  Audience & Marketing Optimization
  Multi-Channel Enablement
  Digital commerce optimization
Retail
  Actionable Customer Insight
  Merchandise Optimization
  Dynamic Pricing
Travel & Transport
  Customer Analytics & Loyalty Marketing
  Predictive Maintenance Analytics
  Capacity & Pricing Optimization
Consumer Products
  Shelf Availability
  Promotional Spend Optimization
  Merchandising Compliance
  Promotion Exceptions & Alerts
Government
  Civilian Services
  Defense & Intelligence
  Tax & Treasury Services
Healthcare
  Measure & Act on Population Health Outcomes
  Engage Consumers in their Healthcare
Automotive
  Advanced Condition Monitoring
  Data Warehouse Optimization
  Actionable Customer Intelligence
Life Sciences
  Increase visibility into drug safety and effectiveness
Chemical & Petroleum
  Operational Surveillance, Analysis & Optimization
  Data Warehouse Consolidation, Integration & Augmentation
  Big Data Exploration for Interdisciplinary Collaboration
Aerospace & Defense
  Uniform Information Access Platform
  Data Warehouse Optimization
  Airliner Certification Platform
  Advanced Condition Monitoring (ACM)
Electronics
  Customer/Channel Analytics
  Advanced Condition Monitoring
-
Big data adoption
Big Data use study
2012 Big Data @ Work Study surveying 1144 business and IT professionals in 95 countries
When segmented into four groups based on current levels of big data activity, respondents showed significant consistency in organizational behaviors
-
Warehouse Modernization Has Two Themes
Traditional Analytics: Structured & Repeatable
  Structure built to store data
  Business Users Determine
  IT Team Builds System To Answer Known Questions
  Available Information
  Analyzed Information
  Capacity constrained down sampling of available information
  Carefully cleanse all information before any analysis
Big Data Analytics: Iterative & Exploratory
  Data is the structure
  IT Team Delivers Data On Flexible
  Business Users Explore and Ask Any Question
  Analyze ALL Available Information
  Analyzed Information
  Whole population analytics connects the dots
  Analyze information as is & cleanse as needed
-
Warehouse Modernization Has Two Themes
Traditional Analytics: Structured & Repeatable
  Structure built to store data
  Hypothesis, Question, Data, Answer
  Start with hypothesis
  Test against selected data
  Analyze after landing
Big Data Analytics: Iterative & Exploratory
  Data is the structure
  All Information, Data, Exploration, Correlation, Actionable Insight
  Data leads the way
  Explore all data, identify correlations
  Analyze in motion
-
Getting the Value from Big Data: Why a Platform?
Almost all big data use cases require an integrated set of big data technologies to address the business pain completely
The Whole is Greater than the Sum of the Parts
  Reduce time and cost and provide quick ROI by leveraging pre-integrated components
  Provide both out of the box and standards-based services
  Start small with a single project and progress to others over your big data journey
BIG DATA PLATFORM
  Discovery
  Application Development
  Systems Management
  Accelerators
  Hadoop System
  Stream Computing
  Data Warehouse
  Information Integration & Governance
Data, Media, Content, Machine, Social
-
Watson Foundations
Data types
  Transaction & application data
  Machine and sensor data
  Enterprise content
  Image and video
  Third-party data
Real-time processing & analytics
  STREAMS & DATA REPLICATION
Operational systems
Exploration, landing and archive
Trusted data
Reporting & interactive analysis
Deep analytics & modeling
Information Integration & Governance
Actionable insight
  Decision management
  Predictive analytics & modeling
  Reporting, analysis, content analytics
  Discovery and exploration
Watson Foundations Differentiators
1 More than Hadoop
    Greater resiliency and recoverability
    Advanced workload management, multi-tenancy
    Enhanced, flexible storage management (GPFS)
    Enhanced data access (BigSQL, Search)
    Analytics accelerators & visualization
    Enterprise-ready security framework
2 Data in Motion
    Enterprise class stream processing & analytics
3 Analytics Everywhere
    Richest set of analytics capabilities
    Ability to analyze data in place
4 Governance Everywhere
    Complete integration & governance capabilities
    Ability to govern all data wherever it is
5 Complete Portfolio
    End-to-end capabilities to address all needs
    Ability to grow and address future needs
    Remains open to work with existing investments
-
IBM Watson Foundations
IBM Big Data & Analytics
All Data
New/Enhanced Applications
Real-time Data Processing & Analytics
Data zones
  Landing, Exploration and Archive data zone
  EDW and data mart zone
  Operational data zone
  Deep Analytics
Information Integration & Governance
What is happening? Discovery and exploration
Why did it happen? Reporting and analysis
What could happen? Predictive analytics and modeling
What action should I take? Decision management
Cognitive Fabric
IBM Big Data & Analytics Infrastructure
  Systems
  Security
  Storage
  On premise, Cloud, As a service
-
What is Hadoop?
Apache open source software framework for reliable, scalable, distributed computing of massive amounts of data
  Hides underlying system details and complexities from user
  Developed in Java
Core sub projects:
  MapReduce
  . . .
  Hadoop Common
Supported by several Hadoop-related projects
  HBase
  Zookeeper
  Avro
  Etc.
Meant for heterogeneous commodity hardware
-
Design principles of Hadoop
New way of storing and processing the data:
  Let system handle most of the issues automatically:
    Failures
    Scalability
    Reduce communications
    Distribute data and processing power to where the data is
    Make parallelism part of operating system
    Relatively inexpensive hardware ($2-4K)
Hadoop = HDFS + MapReduce infrastructure + . . .
Optimized to handle
  Massive amounts of data through parallelism
  A variety of data (structured, unstructured, semi-structured)
  Using inexpensive commodity hardware
Reliability provided through replication
-
Hadoop is not for all types of work
Not to process transactions (random access)
Not good when work cannot be parallelized
Not good for low latency data access
Not good for processing lots of small files
Not good for intensive calculations with little data
Big Data Solution
-
Who uses Hadoop?
-
Map-Reduce
Hadoop
BigInsights
-
What is Apache Hadoop?
Flexible, enterprise-class support for processing large volumes of data
  Inspired by Google technologies (MapReduce, GFS, BigTable, ...)
  Initiated at Yahoo
    Originally built to address scalability problems of Nutch, an open source Web search technology
  Well-suited to batch-oriented, read-intensive applications
Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost effective manner
  CPU + disks = node
  Nodes can be combined into clusters
  New nodes can be added as needed without changing
    Data formats
    How data is loaded
    How jobs are written
-
Hadoop Open Source Projects
Hadoop is supplemented by an ecosystem of open source projects
-
How do I leverage Hadoop to create new value for my enterprise?
Hadoop, Pig, Hive, Zookeeper, Jaql, HBase, Oozie, Flume
HDFS
MapReduce
AQL
Terabytes, Petabytes, Exabytes
Machine learning
Log analysis
Sentiment analysis
CDRs
. . .
-
What's a Hadoop Distribution?
What's a Linux Distribution?
  Linux Kernel
  Open Source Tools around Kernel
  Installer
  Administration UI
Open Source Distribution Formula
  Kernel
  Core Projects around Kernel
  Value Add
  Test Components
  Installer
  Administration UI
  Apps
WebSphere WAS 25 > Apache Projects + Additional Open Source + installer + IBM Value Add
-
BigInsights: Value Beyond Open Source
Enterprise Capabilities
  Visualization & Exploration
  Development Tools
  Advanced Engines
  Connectors
  Workload Optimization
  Administration & Security
Open source components
  IBM-certified Apache Hadoop
Key differentiators
  Built-in analytics
    Text engine, annotators, Eclipse tooling
    Interface to project R (statistical platform)
  Enterprise software integration
  Spreadsheet-style analysis
  Integrated installation of supported open source and other components
  Web Console for admin and application access
  Platform enrichment: additional security, performance features, . . .
  World-class support
  Full open source compatibility
Business benefits
  Quicker time-to-value due to IBM technology and support
  Reduced operational risk
  Enhanced business knowledge with flexible analytical platform
  Leverages and complements existing software
-
From Getting Started to Enterprise Deployment: Different BigInsights Editions For Varying Needs
Axes: Enterprise class, Breadth of capabilities
Apache Hadoop
Quick Start
  Free. Non-production
Standard Edition
  Spreadsheet-style tool
  Web console
  Dashboards
  Pre-built applications
  Eclipse tooling
  RDBMS connectivity
  Big SQL
  Jaql
  Platform enhancements
  . . .
Enterprise Edition
  Accelerators
  GPFS FPO
  Adaptive MapReduce
  Text analytics
  Enterprise Integration
  Monitoring and alerts
  InfoSphere Streams*
  Watson Explorer*
  Cognos BI*
  . . .
* Limited use license
-
IBM Enriches Hadoop
Hadoop
  Scalable
    New nodes can be added on the fly
  Affordable
    Massively parallel computing on commodity servers
  Flexible
    Hadoop is schema-less, and can absorb any type of data
  Fault Tolerant
    Through MapReduce software framework
Enterprise Hardening of Hadoop
  Performance & reliability
    Adaptive MapReduce, Compression, Indexing, Flexible Scheduler, +++
  Productivity Accelerators
    Web-based UIs and tools
    End-user visualization
    Analytic Accelerators
    +++
  Enterprise Integration
    To extend & enrich your information supply chain
-
Big Database Vendors Adopt Hadoop
-
Competing Hadoop Distribution Vendors
Cloudera
  Cloudera makes it easy to run open source Hadoop in production
  Focus on deriving business value from all your data instead of worrying about managing Hadoop
Hortonworks
  Make Hadoop easier to consume for enterprises and technology vendors
  Provide expert support by the leading contributors to the Apache Hadoop open source projects
EMC Greenplum HD ** Pivotal HD **
  Provides a complete platform including installation, training, global support, and value-add beyond simple packaging of the Apache Hadoop distribution
MapR
  High Performance Hadoop, up to 2-5 times faster performance than Apache-based distributions
  The first distribution to provide true high availability at all levels making it more dependable
Amazon Elastic MapReduce
  Amazon Elastic MapReduce lets you focus on crunching or analyzing your data without having to worry about time-consuming set-up, management or tuning of Hadoop clusters or the compute capacity upon which they sit
-
Capabilities Required for Hadoop Style Workloads
  Visualization & Discovery
  Analytics Engines
  Application Support and Development Tooling
  Data Ingest
  Runtime
  Cluster and Workload Management
  File System
  Data Store
  Security
-
Open Source Hadoop Components
Capability areas
  Visualization & Discovery
  Analytics Engines
  Application Support and Development Tooling
  Data Ingest
  Runtime
  Cluster Optimization and Management
  File System
  Data Store
  Security
Open source components
  MapReduce
  Pig
  Hive
  Lucene
  Oozie
  HDFS
  HBase
  ZooKeeper
  Sqoop
  HCatalog
  Flume
  Avro
  Derby
-
Open Source Components Across Distributions
Component   BigInsights 2.0   HortonWorks HDP 1.2   MapR 2.0   Greenplum HD 1.2   Cloudera CDH3u5   Cloudera CDH4*
Hadoop      1.0.3             1.1.2                 0.20.2     1.0.3              0.20.2            2.0.0 *
HBase       0.94.0            0.94.2                0.92.1     0.92.1             0.90.6            0.92.1
Hive        0.9.0             0.10.0                0.9.0      0.8.1              0.7.1             0.8.1
Pig         0.10.1            0.10.1                0.10.0     0.9.2              0.8.1             0.9.2
Zookeeper   3.4.3             3.4.5                 X          3.3.5              3.3.5             3.4.3
Oozie       3.2.0             3.2.0                 3.1.0      X                  2.3.2             3.1.3
Avro        1.6.3             X                     X          X                  X                 X
Flume       0.9.4             1.3.0                 1.2.0      X                  0.9.4             1.1.0
Sqoop       1.4.1             1.4.2                 1.4.1      X                  1.3.0             1.4.1
HCatalog    0.4.0             0.5.0                 0.4.0      X                  X                 X
-
Two Key Aspects of Hadoop
Hadoop Distributed File System = HDFS
  Where Hadoop stores data
  A file system that spans all the nodes in a Hadoop cluster
  It links together the file systems on many local nodes to make them into one big file system
MapReduce framework
  How Hadoop understands and assigns work to the nodes (machines)
-
What is the Hadoop Distributed File System?
HDFS stores data across multiple nodes
HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple nodes
The file system is built from a cluster of data nodes, each of which serves up blocks of data over the network using a block protocol specific to HDFS.
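The idea of cutting a file into blocks and keeping each block on several data nodes can be sketched in a few lines of plain Java. This is only an illustration: the class name, the 64 MB block size, the replication factor of 3, and the round-robin placement are assumptions made for the example, and real HDFS placement is more involved.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not HDFS code: block size, replication factor and the
// round-robin placement policy are assumptions chosen for illustration.
final class HdfsPlacementSketch {

    // Number of fixed-size blocks needed to hold a file (the last block may be partial).
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Pick `replication` distinct data nodes for one block, starting at the block index.
    static List<Integer> replicaNodes(long blockIndex, int nodeCount, int replication) {
        if (replication > nodeCount) {
            throw new IllegalArgumentException("need at least as many nodes as replicas");
        }
        List<Integer> nodes = new ArrayList<>();
        for (int r = 0; r < replication; r++) {
            nodes.add((int) ((blockIndex + r) % nodeCount));
        }
        return nodes;
    }

    public static void main(String[] args) {
        long blockBytes = 64L * 1024 * 1024;   // assumed block size
        long fileBytes = 200L * 1024 * 1024;   // example file
        long blocks = blockCount(fileBytes, blockBytes);
        for (long b = 0; b < blocks; b++) {
            System.out.println("block " + b + " -> data nodes " + replicaNodes(b, 5, 3));
        }
    }
}
```

Because every block lives on more than one node in this sketch, losing any single node still leaves a copy of each block available elsewhere.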
-
MapReduce
Take a large problem and divide it into sub-problems
  Break data set down into small chunks
MAP
  Perform the same function on all sub-problems
  DoWork() DoWork() DoWork() . . .
REDUCE
  Combine the output from all sub-problems
  Output
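The divide, apply, combine pattern can be tried on a single machine with plain Java. The sketch below is not Hadoop code; the class name and the choice of summing integers are assumptions for illustration, and `doWork` simply echoes the DoWork() label from the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical single-machine sketch of "divide, apply the same function, combine".
final class DivideAndCombineSketch {

    // Divide: break the data set into chunks of at most chunkSize elements.
    static List<int[]> split(int[] data, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<int[]> chunks = new ArrayList<>();
        for (int start = 0; start < data.length; start += chunkSize) {
            int end = Math.min(start + chunkSize, data.length);
            chunks.add(Arrays.copyOfRange(data, start, end));
        }
        return chunks;
    }

    // Map: the same function is applied to every chunk independently.
    static long doWork(int[] chunk) {
        long sum = 0;
        for (int v : chunk) {
            sum += v;
        }
        return sum;
    }

    // Reduce: combine the partial outputs into one result.
    static long run(int[] data, int chunkSize) {
        return split(data, chunkSize).parallelStream()
                .mapToLong(DivideAndCombineSketch::doWork)
                .sum();
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        System.out.println("total = " + run(data, 3));
    }
}
```

Since each chunk is processed without reference to any other chunk, the map step can run in parallel; only the final combination needs all partial results.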
-
MapReduce Example
Hadoop computation model
  Data stored in a distributed file system spanning many inexpensive computers
  Bring function to the data
  Distribute application to the compute resources where the data is stored
Scalable to thousands of nodes and petabytes of data
MapReduce Application (runs on the Hadoop Data Nodes)
  1. Map Phase (break job into small parts)
  2. Shuffle (transfer interim output for final processing)
  3. Reduce Phase (boil all output down to a single result set)
Return a single result set

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {
    // Constant value 1 emitted for every word occurrence
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Emit (word, 1) for each token of the input line
    public void map(Object key, Text val, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr =
          new StringTokenizer(val.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {
    . . .
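To see the three phases without a cluster, the same word count can be mimicked in plain Java collections. This is a hedged, single-process sketch: the class and method names are invented for illustration and none of it uses the Hadoop API.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical in-memory imitation of map, shuffle and reduce for word count.
final class WordCountPhasesSketch {

    // Map phase: emit a (word, 1) pair for every token in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(new AbstractMap.SimpleEntry<>(itr.nextToken(), 1));
        }
        return out;
    }

    // Shuffle: group all emitted values by key so each key reaches one reducer call.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            counts.put(group.getKey(), reduce(group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("to be or");
        lines.add("not to be");
        System.out.println(wordCount(lines));
    }
}
```

In a real job the map calls run on the nodes that hold the input blocks and the shuffle moves data across the network; here all three steps simply run in sequence in one process.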
-
So What Does This Result In?
Easy To Scale
Fault Tolerant and Self-Healing
Data Agnostic
Extremely Flexible
-
Resources
bigdatauniversity.com
youtube.com/ibmBigData
Quick Start Editions
  ibm.co/quickstart
  ibm.co/streamsqs
ibm.meetup.com
ibmdw.net/streamsdev
ibm.co/streamscon
ibmbigdatahub.com
ibm.co/bigdatadev
http://tinyurl.com/biginsights
  Links to demos, papers, forum, downloads, etc.
-
Thank You
Your feedback is important!
Please fill out survey
-
Acknowledgements and Disclaimers
Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates.
The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.
All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.
Copyright IBM Corporation 2014. All rights reserved.
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
IBM, the IBM logo, ibm.com, and InfoSphere BigInsights are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml
Other company, product, or service names may be trademarks or service marks of others.
-
Backup
Global TLE Framework
-
Implications of Big Data
Just reading 100 terabytes is slow
  Standard computer (100 MB/s): ~11 days
  Across 10 Gbit link (high end storage): 1 day
  1000 standard computers: 15 minutes!
Seek times for random disk access are a problem
  1 TB data set with 10^10 100-byte records
  Updates to 1% would require 1 month
  Reading and rewriting the whole data set would take 1 day*
One node is not enough! Need to scale out, not up!
* From the Hadoop mailing list
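The slide's figures can be sanity-checked with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes), reads "100 MBPS" as 100 megabytes per second, and, for the random-update case, assumes 10 ms per disk access and two accesses (read and write) per updated record; the seek figures in particular are illustrative assumptions, not numbers from the presentation.

```java
// Hypothetical back-of-the-envelope check; the unit conventions and the seek
// cost (10 ms, two accesses per update) are assumptions, not slide content.
final class ScanTimeSketch {

    // Seconds to stream `bytes` when `machines` each read at `bytesPerSecond`.
    static double scanSeconds(double bytes, double bytesPerSecond, int machines) {
        return bytes / (bytesPerSecond * machines);
    }

    // Seconds to update a fraction of the records by random access.
    static double randomUpdateSeconds(double records, double fraction,
                                      double secondsPerAccess, int accessesPerUpdate) {
        return records * fraction * secondsPerAccess * accessesPerUpdate;
    }

    public static void main(String[] args) {
        double day = 86_400.0;
        double hundredTb = 100e12;

        System.out.printf("1 machine at 100 MB/s: %.1f days%n",
                scanSeconds(hundredTb, 100e6, 1) / day);
        System.out.printf("10 Gbit link: %.2f days%n",
                scanSeconds(hundredTb, 10e9 / 8, 1) / day);
        System.out.printf("1000 machines at 100 MB/s: %.1f minutes%n",
                scanSeconds(hundredTb, 100e6, 1000) / 60);

        double records = 1e12 / 100;   // 1 TB of 100-byte records = 10^10 records
        System.out.printf("random updates to 1%% of records: %.1f days%n",
                randomUpdateSeconds(records, 0.01, 0.010, 2) / day);
    }
}
```

Under these assumptions the streaming cases come out near the slide's 11 days, 1 day and 15 minutes, and the random-update case lands in the range of weeks, which is why sequential scans across many disks are preferred over seek-heavy access.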
-
Scaling out
Bad news: nodes fail, especially if you have many
  Mean time between failures for 1 node = 3 years, 1000 nodes = 1 day
  Super-fancy hardware still fails and commodity machines give better performance per dollar
Bad news II: distributed programming is hard
  Communication, synchronization, and deadlocks
  Recovering from machine failure
  Debugging
  Optimization
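The "3 years for one node, about 1 day for 1000 nodes" figure follows from dividing the per-node mean time between failures by the number of nodes. The sketch below assumes failures are independent and exponentially distributed, which is a simplifying assumption made for the example; the class and method names are hypothetical.

```java
// Hypothetical sketch: assumes independent, exponentially distributed node failures.
final class ClusterFailureSketch {

    // Expected time between failures anywhere in a cluster of `nodes` machines.
    static double clusterMtbfDays(double nodeMtbfDays, int nodes) {
        return nodeMtbfDays / nodes;
    }

    // Probability that at least one node fails within `windowDays`.
    static double probAnyFailure(double nodeMtbfDays, int nodes, double windowDays) {
        return 1.0 - Math.exp(-nodes * windowDays / nodeMtbfDays);
    }

    public static void main(String[] args) {
        double nodeMtbfDays = 3 * 365.0;   // 3 years per node, as on the slide
        System.out.printf("cluster MTBF for 1000 nodes: %.2f days%n",
                clusterMtbfDays(nodeMtbfDays, 1000));
        System.out.printf("chance of a failure on any given day: %.0f%%%n",
                100 * probAnyFailure(nodeMtbfDays, 1000, 1.0));
    }
}
```

With failures this frequent, recovery has to be handled by the framework automatically instead of by an operator or by each application.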
-
A new model is needed
It's all about the right level of abstraction
Hide system-level details from the developers
  No more race conditions, lock contention, etc.
Separating the what from the how
  Developer specifies the computation that needs to be performed
  Execution framework (runtime) handles actual execution
-
MapReduce
Traditional computing
MapReduce computing
-
MapReduce, the reality
Many nodes, little communication between the nodes, some stragglers and failures
-
Big Difference: Schema on Run
Regular database: Schema on load
  Raw data -> Schema to filter -> Storage (pre-filtered data)
Big Data (Hadoop): Schema on run
  Raw data -> Storage (unfiltered, raw data) -> Schema to filter -> Output
-
Key Benefit: Agility/Flexibility
Schema-on-Write (RDBMS)
  Schema must be defined before any data is loaded
  An explicit load operation has to take place which transforms data to internal DB structure
  New columns must be added explicitly before new data for such columns can be loaded into the database
  Pros
    Read Fast
    Standard/Governance
Schema-on-Read (Hadoop)
  Data is copied to the file store, no transformation is needed
  A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns (late binding)
  New data can start flowing anytime and will appear retroactively once SerDe is updated to parse it.
  Pros
    Load Fast
    Flexibility/Agility
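A small plain-Java sketch can make the late-binding idea concrete: raw lines are stored untouched, and a schema (just a list of column names here) is applied only when the data is read. The class, the sample records and the comma-separated layout are assumptions for illustration; this is a stand-in for the idea of a SerDe, not the Hive SerDe interface.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of schema-on-read; not an implementation of a Hive SerDe.
final class SchemaOnReadSketch {

    // Raw records exactly as they were copied into the file store (no load-time transform).
    static final List<String> RAW = Arrays.asList("alice,30,NYC", "bob,25,SF");

    // Apply a schema at read time: bind column names to field positions.
    static List<Map<String, String>> read(List<String> raw, String... columns) {
        List<Map<String, String>> rows = new ArrayList<>();
        for (String line : raw) {
            String[] fields = line.split(",");
            Map<String, String> row = new LinkedHashMap<>();
            for (int i = 0; i < columns.length && i < fields.length; i++) {
                row.put(columns[i], fields[i]);
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        // The first reader only knows two columns.
        System.out.println(read(RAW, "name", "age"));
        // An updated reader exposes a third column for data that was already stored.
        System.out.println(read(RAW, "name", "age", "city"));
    }
}
```

The second call shows the "appears retroactively" point: nothing was reloaded, yet the extra column becomes visible as soon as the reader knows how to parse it.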
-
Scalability: Scalable Software Deployment