A holistic approach to big data · dama-ncr.org/library/2013-09-19-a_holistic_approach.pdf · 2013. 9....
TRANSCRIPT
Raul F. Chong
Senior Big Data and Cloud Program Manager
Big Data University Community [email protected]
A holistic approach to Big Data
© 2013 BigDataUniversity.com
Agenda
The state of Big Data adoption
Big Data – A holistic approach
The 5 high value Big Data use cases
Technical details of key Big Data components
The future of Big Data and Cloud
Demos
Resources
The state of Big Data adoption
Big Data Adoption Phases
What is your Big Data source?
“What type of data/records are you planning to analyze using big data technologies?”
Multiple responses accepted
What do you want to do with the Big Data collected?
“What kind of analytics do you want to perform on this big data?”
Multiple responses accepted
Use of Big Data globally and in the financial sector
Multiple responses accepted
KTH Swedish Royal Institute of Technology: Reducing Traffic Congestion
• Deployed real-time Smarter Traffic system to predict and improve traffic flow.
• Analyzes streaming real-time data gathered from cameras at entry/exit to city, GPS data from taxis and trucks, and weather information.
• Predicts best time and method to travel, such as when to leave to catch a flight at the airport
Results
• Enables ability to analyze and predict traffic faster and more accurately than ever before
• Provides new insight into mechanisms that affect a complex traffic system
• Smarter, more efficient, and more environmentally friendly traffic
University of Southern California Innovation Lab Monitors Political Debates
Benefits
• Real-time display of public sentiment as candidates respond to questions
• Debate winner prediction based on public opinion instead of solely political analysts
Big Data – A holistic approach
Big Data is Not Only Hadoop!
Examples where Hadoop is not entirely applicable:
– Cyber security, Stock market, Traffic control, Sensor information, monitoring trends in Social Media
– What if your company has many silos of information, difficult to move to HDFS?
– What about governance? Can we trust the source of this data?
Big data holistic approach: A platform
The IBM Big Data Platform
Layers: Solutions; Analytics and Decision Management; Big Data Platform; Big Data Infrastructure
Platform components:
– Data Warehouse: Delivers deep insight with advanced in-database analytics & operational analytics
– Stream Computing: Analyze streaming data and large data bursts for real-time insights
– Hadoop System: Cost-effectively analyze petabytes of unstructured and structured data
– Information Integration & Governance: Govern data quality and manage the information lifecycle
– Accelerators: Speed time to value with analytic and application accelerators
– Visualization & Discovery: Discover, understand, search, and navigate federated sources of big data
– Application Development
– Systems Management
Big data holistic approach: A platform
Process any type of data
– Structured, unstructured, in-motion, at-rest, in-place
Built-for-purpose engines
– Designed to handle different requirements
Manage and govern data in the ecosystem
Enterprise data integration
Grow and evolve on current infrastructure
The whole is greater than the sum of parts
Integrated components
Out of the box, standards-based services
Start small (value is additive)
An example of the big data platform in practice
– Ingestion and Real-time Analytic Zone: Streams, Connectors
– Landing and Analytics Sandbox Zone: Hadoop, MapReduce, Hive/HBase, Col Stores, Documents in variety of formats
– Warehousing Zone: Enterprise Warehouse, Data Marts
– Analytics and Reporting Zone: BI & Reporting, Predictive Analytics, Visualization & Discovery
– Metadata and Governance Zone: ETL, MDM, Data Governance
The 5 High Value Big Data Use Cases
• Big Data Exploration: Find, visualize, understand all big data to improve business knowledge
• Enhanced 360º View of the Customer: Achieve a true unified view, incorporating internal and external sources
• Security/Intelligence Extension: Lower risk, detect fraud and monitor cyber security in real-time
• Data Warehouse Augmentation: Integrate big data and data warehouse capabilities to increase operational efficiency
• Operations Analysis: Analyze a variety of machine data for improved business results
Big Data Exploration: Illustrated
Find, visualize and understand all big data to improve business knowledge
• Greater efficiencies in business processes
• New insights from combining and analyzing data types in new ways
• Develop new business models with resulting increased market presence and revenue
Diagram components: data sources (CM, RM, DM; RDBMS; Feeds; Web 2.0; Email; Web; CRM, ERP; File Systems); Connector Framework; App Builder; UI / User; Data Explorer; Hadoop; Streams; Warehouse; Integration & Governance
Big Data Exploration: Example in Practice
Airplane Manufacturer (blinded for confidentiality)
• Exploring 4 TB to drive point business solutions (supplier portal, call center, etc.)
• Single point of data fusion for all employees to use
• Reduced costs & improved operational performance for the business
Is Big Data Exploration Right for You?
• How do you separate the “noise” from useful content?
• How do you perform data exploration on large and complex data?
• How do you find insights in new or unstructured data types (e.g. social media and email)?
• How do you enable employees to navigate and explore enterprise and external content? Can you present this in a single user interface?
• How do you identify areas of data risk before they become a problem?
• What is the starting point for your big data initiatives?
Big Data Platform Component Starting Point: Data Explorer
Enhanced 360º View of the Customer: Illustrated
SOURCE SYSTEMS
CRM
Name: J Robertson
Address: 35 West 15th
Address: Pittsburgh, PA 15213
ERP
Name: Janet Robertson
Address: 35 West 15th St.
Address: Pittsburgh, PA 15213
Legacy
Name: Jan Robertson
Address: 36 West 15th St.
Address: Pittsburgh, PA 15213
Master Data Management
360 View of Party Identity
First: Janet
Last: Robertson
Address: 35 West 15th St
City: Pittsburgh
State/Zip: PA / 15213
Gender: F
Age: 48
DOB: 1/4/64
Unified View of Party’s Information: Hadoop, Streams, Warehouse
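The consolidation of the three source records into one party identity can be sketched as a simple fuzzy match followed by a merge. This is only an illustration in plain Python using the standard library's difflib. The field names, the 0.7 threshold, and the "keep the longest value" merge rule are assumptions made for this sketch, not how a Master Data Management product actually matches records.

```python
from difflib import SequenceMatcher

# The three source records from the slide; field names are this sketch's own.
SOURCE_RECORDS = [
    {"system": "CRM", "name": "J Robertson",
     "street": "35 West 15th", "city": "Pittsburgh, PA 15213"},
    {"system": "ERP", "name": "Janet Robertson",
     "street": "35 West 15th St.", "city": "Pittsburgh, PA 15213"},
    {"system": "Legacy", "name": "Jan Robertson",
     "street": "36 West 15th St.", "city": "Pittsburgh, PA 15213"},
]

FIELDS = ("name", "street", "city")


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def same_party(r1: dict, r2: dict, threshold: float = 0.7) -> bool:
    """Treat two records as the same party if their average field
    similarity reaches the (assumed) threshold."""
    score = sum(similarity(r1[f], r2[f]) for f in FIELDS) / len(FIELDS)
    return score >= threshold


def merge(records: list) -> dict:
    """Build one unified record, keeping the longest value per field
    as a stand-in for 'most complete'."""
    return {f: max((r[f] for r in records), key=len) for f in FIELDS}


if __name__ == "__main__":
    print(merge(SOURCE_RECORDS))
```

Note that the merge rule cannot decide between "35" and "36 West 15th St."; it keeps whichever equally long value comes first, so a real system would need survivorship rules or a trusted source ranking.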
Enhanced 360º View of the Customer: Insight from user’s photos
Photo Albums and Pinboards: Style, Kitchen, Gallery, Dream Home, Wedding
• Pins / Re-pins, Likes / Dislikes, Tweets, Favorites
• Photo Semantic Analysis
• User Segmentation
• Preferred Styles, Designs, Products, Interests
• Advertisements, Promotions, Campaigns, Planning
Diagram labels: Consumer; Computer; Retailers, Marketers and Planners
Enhanced 360º Customer View: Customer Example
Leading Medical Equipment Supplier (blinded for confidentiality)
• Increase revenue and decrease cost in the call center
• Increase customer & employee satisfaction
• Leverage new data types in customer analysis
Is the Enhanced 360º Customer View Right for You?
• How do you identify and deliver all data as it relates to a customer, product, or competitor to those who need it?
• How do you gather insights about your customers from social data, surveys, support emails, etc.?
• How do you combine your structured and unstructured data to run analytics?
• How are you driving consistency across your information assets when representing your customers, clients, partners, etc.?
• How do you deliver a complete, enhanced view of the customer to your line of business users to ensure better business outcomes?
Big Data Platform Component Starting Point: Data Explorer, Hadoop
Security/Intelligence Extension: Illustrated
Traditional Security Operations and Technology
• Logs
• Events
• Alerts
• Configuration information
• System audit trails
• External threat intelligence feeds
• Network flows and anomalies
• Identity context
Big Data Analytics
• Web page text
• Video/audio surveillance
• E-mail and social activity
• Business process data
• Customer transactions
New Considerations
Collection, Storage and Processing
• Collection and integration
• Size and speed
• Enrichment and correlation
Analytics and Workflow
• Visualization
• Unstructured analysis
• Learning and prediction
• Customization
• Sharing and export
“Reconstructing Events” – Integrating Multimedia from Diverse Sources
• Correlate multimedia content across a wide diversity of sources and dynamic topology of cameras
• Exploit partial overlaps in field of view, re-identification of objects/people and contextual information
• Obtain real-time operational picture across diverse content
• 100K security cameras (static cameras, slowly changing topology)
• 10M mobile photos/day (limited knowledge about locations)
• 50M social media photos/video (uncertain geo-temporal context)
• Moving vehicles (patrol cars), overhead drones, broadcast, retail, 311, etc.
Sources shown: Overhead, Social Media, Mobile Cameras, Security Cameras
Security/Intelligence Extension: Customer Example
• Captured and analyzed 42TB of daily traffic in real-time for tracking persons of interest to take suitable action and reduce risk.
Would the Security / Intelligence Extension benefit you?
• What are your plans to enrich your security or intel system with unused or underleveraged data sources (video, audio, smart devices, network, Telco, social media)?
• How will you address the need for sub-second detection, identification, and resolution of physical or cyber threats?
• How do you intend to follow activities of criminals, terrorists, or persons on a blacklist?
• How do you plan to enhance your surveillance system with real-time data from video, acoustic, thermal or other security sensors?
• Do you want to correlate lots of technical or human intel data and sources looking for associations or patterns (big data forensics)?
• How are you going to deal with unstructured data (email, social, etc.) in your Security Information & Event Management (SIEM) solution to improve cyber threat detection & remediation?
Big Data Platform Component Starting Point: Streams, Hadoop
Operations Analysis: Illustrated
Raw Logs and Machine Data
• Indexing, Search
• Statistical Modeling
• Root Cause Analysis
• Federated Navigation & Discovery
• Real-time Analysis
• Only store what is needed
Machine Data Accelerator
Operations analysis is a Business Imperative
Cost of System Down Time
– 49 percent of Fortune 500 companies experience > 80 hours of system down time/year (1)
• Cost of down time varies between $90,000/hr and $6.48 million/hr
• 80 hours * $6.48M = approx $500M per year
– System downtime costs North American businesses $26.5 billion a year in lost revenue (2)
1 http://www.information-management.com/infodirect/2009_133/downtime_cost-10015855-1.html
2 http://www.itchannelplanet.com/business_news/article.php/3916786/IT-System-Downtime-Costs-265-Billion-A-Year-Study-Finds.htm
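The downtime arithmetic on this slide can be checked with a few lines of Python. The figures are the ones quoted above; the function and constant names are made up for this sketch. The upper bound works out to about $518M, which the slide rounds to "approx $500M".

```python
# Figures quoted on the slide.
HOURS_DOWN_PER_YEAR = 80
COST_PER_HOUR_LOW = 90_000        # $90,000/hr
COST_PER_HOUR_HIGH = 6_480_000    # $6.48 million/hr


def annual_downtime_cost(hours: int, cost_per_hour: int) -> int:
    """Yearly cost of downtime in dollars: hours down times hourly cost."""
    return hours * cost_per_hour


low = annual_downtime_cost(HOURS_DOWN_PER_YEAR, COST_PER_HOUR_LOW)
high = annual_downtime_cost(HOURS_DOWN_PER_YEAR, COST_PER_HOUR_HIGH)
summary = f"${low:,} to ${high:,} per year"

if __name__ == "__main__":
    print(summary)  # $7,200,000 to $518,400,000 per year
```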
Operations Analysis: Customer Example
• Intelligent Infrastructure Management: log analytics, energy bill forecasting, energy consumption optimization, anomalous energy usage detection, presence-aware energy management
• Optimized building energy consumption with centralized monitoring; automated preventive and corrective maintenance
• Utilized InfoSphere Streams, InfoSphere BigInsights, IBM Cognos
Would Operations Analysis benefit you?
• Do you deal with large volumes of machine data?
• How do you access and search that data?
• How do you perform root cause analysis?
• How do you perform complex real-time analysis to correlate across different data sets?
• How do you monitor and visualize streaming data in real time and generate alerts?
Big Data Platform Component Starting Point: Hadoop, Streams
Data Warehouse Augmentation: Needs
Integrate big data and data warehouse capabilities to increase operational efficiency
Need to leverage variety of data
• Structured, unstructured, and streaming data sources required for deep analysis
• Low latency requirements (hours, not weeks or months)
• Required query access to data
Extend warehouse infrastructure
• Optimized storage, maintenance and licensing costs by migrating rarely used data to Hadoop
• Reduced storage costs through smart processing of streaming data
• Improved warehouse performance by determining what data to feed into it
Data Warehouse Augmentation: Illustrated
• Hadoop: Filter and summarize big data for the warehouse
• Hadoop as a query-ready archive for a data warehouse
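The "filter and summarize" idea can be sketched in a few lines: drop records that are not worth keeping, then reduce the rest to one row per key before loading it into the warehouse. This is plain Python rather than a Hadoop job, and the record layout (day, product, amount) and the filter rule are assumptions made for the sketch.

```python
from collections import defaultdict


def filter_and_summarize(events, min_amount=0.0):
    """Keep events with amount above min_amount, then aggregate them
    to one summary row per (day, product)."""
    totals = defaultdict(lambda: {"count": 0, "amount": 0.0})
    for event in events:
        if event["amount"] <= min_amount:   # filter step
            continue
        key = (event["day"], event["product"])
        totals[key]["count"] += 1           # summarize step
        totals[key]["amount"] += event["amount"]
    return [
        {"day": day, "product": product, **agg}
        for (day, product), agg in sorted(totals.items())
    ]


if __name__ == "__main__":
    raw = [
        {"day": "2013-09-19", "product": "A", "amount": 10.0},
        {"day": "2013-09-19", "product": "A", "amount": 5.5},
        {"day": "2013-09-19", "product": "B", "amount": 0.0},
        {"day": "2013-09-20", "product": "A", "amount": 3.0},
    ]
    for row in filter_and_summarize(raw):
        print(row)
```

Only the summary rows would be fed to the warehouse; the raw events can stay in the cheaper store, which is the cost argument the slides make.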
Data Warehouse Augmentation: Customer Example
• Improved analysis performance by over 40 times, reduced wait time from hours to seconds, and increased campaign effectiveness by 20+%.
Could Data Warehouse Augmentation benefit you?
• Are you drowning in very large data sets (TBs to PBs) that are difficult and costly to store?
• Are you able to utilize and store new data types?
• Are you facing rising maintenance/licensing costs?
• Do you use your warehouse environment as a repository for all data?
• Do you have a lot of cold, or low-touch, data driving up costs or slowing performance?
• Do you want to perform analysis of data in-motion to determine what should be stored in the warehouse?
• Do you want to perform data exploration on all data? Are you using your data for new types of analytics?
Big Data Platform Component Starting Point: Hadoop, Streams
Technical details of key Big Data components
Sentiment Analysis using IBM Text Analytics (Basic example)
Sentiments for movie Ra.One :-(
Sentiments for movie Swades :-)
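To give a rough feel for what a basic sentiment classifier does, here is a minimal word-list sketch in Python. It is not AQL and not how IBM Text Analytics works internally. The word lists and sample sentences are invented for illustration; only the movie titles come from the slides.

```python
import re

# Tiny hand-made lexicons; a real system would use far richer rules.
POSITIVE = {"good", "great", "love", "loved", "amazing", "brilliant"}
NEGATIVE = {"bad", "boring", "hate", "hated", "awful", "disappointing"}


def sentiment(text: str) -> str:
    """Label text by counting positive and negative lexicon words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    print(sentiment("Loved Swades, a brilliant film"))
    print(sentiment("Ra.One was boring and disappointing"))
```

A word-count approach like this misses negation and sarcasm ("not good"), which is one reason rule languages and trained models are used in practice.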
Architecture Diagram
• AQL: rule language with familiar SQL-like syntax; specify annotator semantics declaratively
• Text Analytics Optimizer: choose an efficient execution plan that implements the semantics
• Compiled Operator Graph (.aog)
• Text Analytics Runtime: highly scalable, embeddable Java runtime
• Input Document Stream in, Annotated Document Stream out
How Streams Works
Continuous ingestion, continuous analysis
Achieve scale:
• By partitioning applications into software components
• By distributing across stream-connected hardware hosts
Infrastructure provides services for:
• Scheduling analytics across hardware hosts
• Establishing streaming connectivity
Example operators: Transform, Filter / Sample, Classify, Correlate, Annotate
Where appropriate: Elements can be fused together for lower communication latency
Scalable Stream Processing
Streams programming model: construct a graph
– Mathematical concept
• Not a line, bar, or pie chart!
• Also called a network
• Familiar: for example, a tree structure is a graph
– Consisting of operators and the streams that connect them
• The vertices (or nodes) and edges of the mathematical graph
• A directed graph: the edges have a direction (arrows)
Streams runtime model: distributed processes
– Single or multiple operators form a Processing Element (PE)
– Compiler and runtime services make it easy to deploy PEs
• On one machine
• Across multiple hosts in a cluster when scaled-up processing is required
– All links and data transport are handled by runtime services
• Automatically
• With manual placement directives where required
Diagram: operators (OP) connected by streams
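The operator graph described above can be imitated with Python generators: each operator consumes one stream and produces another, and chaining them forms a small directed graph. This is an analogy in plain Python, not Streams code. The operator names, the tuple fields, and the congestion rule are assumptions for this sketch.

```python
def source(readings):
    """Source operator: emit tuples one at a time."""
    for reading in readings:
        yield reading


def filter_op(stream, predicate):
    """Filter operator: pass on only the tuples that satisfy the predicate."""
    for item in stream:
        if predicate(item):
            yield item


def transform_op(stream, fn):
    """Transform operator: apply fn to every tuple."""
    for item in stream:
        yield fn(item)


def sink(stream):
    """Sink operator: collect results (a real sink might write to a store)."""
    return list(stream)


def run_graph(readings):
    """Wire the operators: source -> filter -> transform -> sink."""
    raw = source(readings)
    slow = filter_op(raw, lambda t: t["speed_kmh"] < 20)
    alerts = transform_op(slow, lambda t: {"road": t["road"], "status": "congested"})
    return sink(alerts)


if __name__ == "__main__":
    sample = [
        {"road": "E4", "speed_kmh": 12},
        {"road": "E18", "speed_kmh": 70},
    ]
    print(run_graph(sample))
```

Because generators are lazy, each tuple flows through the whole chain before the next one is read, which loosely mirrors continuous ingestion and analysis. Distribution across hosts is what the Streams runtime adds, and it is not modeled here.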
From Essential Elements to Running Jobs
Streams application graph:
– A directed, possibly cyclic, graph
– A collection of operators
– Connected by streams
Each complete application is a potentially deployable job
Jobs are deployed to a Streams runtime environment, known as a Streams Instance (or simply, an instance)
An instance can include a single processing node (hardware) or multiple processing nodes
Diagram: a Streams instance running sources (Src), operators (OP) and sinks (Sink) connected by streams, on one or more hardware nodes
Streams Runtime Illustrated
• Optimizing scheduler assigns jobs and their PEs to hosts, and continually manages resource allocation
• Commodity hardware: laptop, blades or high performance clusters
• Dynamically add hosts and jobs
• New jobs work with existing jobs
Diagram: operators (Meters, Company Filter, Usage Model, Usage Contract, Text Extract, Season Adjust, Daily Adjust, Temp Action, Degree History, Compare History, Store History) spread across x86 hosts, with a host added as new jobs are deployed
Streams Runtime Includes High Availability
• A PE failing on one host can be moved automatically to another; communications are automatically rerouted
• PEs on busy hosts can be moved manually by the Streams administrator
Diagram: the same operators (Meters, Company Filter, Usage Model, Usage Contract, Text Extract, Degree History, Compare History, Store History, Temp Action, Season Adjust, Daily Adjust) redistributed across x86 hosts
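The placement and failover behavior described on the runtime slides can be illustrated with a toy scheduler. This is a sketch under simple assumptions (each PE counts as one unit of load, and the least-loaded host wins); it does not describe the actual Streams scheduler, and the function names are hypothetical.

```python
def place(pes, hosts):
    """Assign each PE to the currently least-loaded host."""
    load = {host: 0 for host in hosts}
    placement = {}
    for pe in pes:
        target = min(hosts, key=lambda h: load[h])
        placement[pe] = target
        load[target] += 1
    return placement


def fail_over(placement, failed_host, hosts):
    """Move every PE from the failed host to the least-loaded survivor."""
    survivors = [h for h in hosts if h != failed_host]
    if not survivors:
        raise RuntimeError("no surviving hosts to move PEs to")
    load = {h: 0 for h in survivors}
    for host in placement.values():
        if host != failed_host:
            load[host] += 1
    new_placement = dict(placement)
    for pe, host in placement.items():
        if host == failed_host:
            target = min(survivors, key=lambda h: load[h])
            new_placement[pe] = target
            load[target] += 1
    return new_placement


if __name__ == "__main__":
    hosts = ["host1", "host2", "host3"]
    pes = ["Meters", "Company Filter", "Usage Model", "Usage Contract"]
    before = place(pes, hosts)
    after = fail_over(before, "host1", hosts)
    print(before)
    print(after)
```

Rerouting the streams between moved PEs, which the slide says happens automatically, is left out of this sketch.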
Social Data Analytics Accelerator Architecture
Input: Social Media Data
Online flow: Data-in-motion analysis (Stream Computing and Analytics)
• Data Ingest and Prep
• Extract Buzz, Intent, Sentiment
• Entity Analytics: Profile Resolution
• Real time analytics. Pre-defined views and charts
• Dashboard
Offline flow: Data-at-rest analysis (BigInsights System and Analytics)
• Extract Buzz, Intent, Sentiment and Consumer Profiles
• Entity Analytics and Integration
• Comprehensive Social Media Customer Profiles
• Pre-defined Workbooks and Dashboards
Optional: Indexed Search (Data Explorer)
• Index using Push API
• Ad hoc access
Machine Data Analytics Accelerator – Preventing outages
Business requirement
– Improve ability to understand, correct and anticipate outages
Solution Overview
– Provide faceted search across log records from multiple systems to find events
– Link and correlate events across systems
– Discover interesting patterns
Solution Detail
– BigInsights applications for
• Import, Extract, Transform, Analyze, Visualize
Roles: Data Scientist, End User, Data Administrator
Workflow: Import Logs, Extract, Transform, Analyze, Visualize
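Faceted search and cross-system correlation over log records can be sketched as follows. The log records, field names, and the 30-second correlation window are invented for this illustration; the accelerator itself runs as BigInsights applications, which this plain-Python sketch does not attempt to reproduce.

```python
from collections import Counter
from datetime import datetime

# Hypothetical, already-extracted log records from two systems.
LOGS = [
    {"system": "web", "time": "2013-09-19 10:00:05", "level": "ERROR", "msg": "upstream timeout"},
    {"system": "db", "time": "2013-09-19 10:00:20", "level": "ERROR", "msg": "connection pool exhausted"},
    {"system": "web", "time": "2013-09-19 10:01:00", "level": "INFO", "msg": "request served"},
    {"system": "db", "time": "2013-09-19 10:05:00", "level": "ERROR", "msg": "slow query"},
]

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def facet_counts(records, field):
    """Count records per distinct value of one facet field."""
    return Counter(r[field] for r in records)


def search(records, **facets):
    """Return records matching every given facet value."""
    return [r for r in records if all(r[k] == v for k, v in facets.items())]


def correlate(records, window_seconds=30):
    """Pair ERROR events from different systems that occur close in time."""
    errors = sorted(search(records, level="ERROR"), key=lambda r: r["time"])
    pairs = []
    for i, first in enumerate(errors):
        t1 = datetime.strptime(first["time"], TIME_FORMAT)
        for second in errors[i + 1:]:
            t2 = datetime.strptime(second["time"], TIME_FORMAT)
            gap = (t2 - t1).total_seconds()
            if first["system"] != second["system"] and gap <= window_seconds:
                pairs.append((first["system"], second["system"]))
    return pairs


if __name__ == "__main__":
    print(facet_counts(LOGS, "level"))
    print(search(LOGS, system="db", level="ERROR"))
    print(correlate(LOGS))
```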
The Future of Big Data and Cloud
SQL for Hadoop support improvements – towards full ANSI support
– Hive
– Impala (Cloudera)
– Big SQL (IBM)
– Stinger (Hortonworks)
– Drill (MapR)
– HAWQ (Pivotal)
– SQL-H (Teradata)
Improvements in Multimedia Analytics
Growth in usage and adoption of R programming language
Cloud Bare metal support helping with Hadoop workloads
– Private network
– Full support with APIs
Demos
Resources
Big Data University (bigdatauniversity.com)
BigInsights on the Cloud - Making Learning Hadoop Easy and Fun
• Flexible on-line delivery allows learning @your place and @your pace
• Free courses, free study materials.
• Cloud-based sandbox for exercises – zero setup with Robust Course Management System and Content Distribution infrastructure
• 108,000 registered students.
• Free IBM Hadoop, BigInsights Publications
Quick Start Editions available (free, non-production, no time bomb):
– IBM InfoSphere BigInsights (IBM’s Hadoop Distribution): ibm.co/QuickStart
– IBM InfoSphere Streams: ibm.co/streamsqs
My contact information
Contact Info:
Email: [email protected]
Twitter: @raulchong
Facebook: facebook.com/raul.f.chong
LinkedIN: linkedin.com/pub/raul-f-chong/8/aa2/b63
Thank You!