TOUG Big Data Challenge and Impact
Post on 19-Oct-2014
DESCRIPTION
Presented to Toronto Oracle Users Group members on Jan 22, 2014 by Ian Abramson
TRANSCRIPT
Ian Abramson
EPAM Systems, Canada
January 2014
Getting Started with Big Data
and Making an Impact
@TOUG Jan. 22, 2014
Confidential
Big Data .. The Silver Bullet?
Agenda
Introductions and Goals
What is Big Data
Technology Choices
Making an Impact with Data Science
Use Cases
About Me
• Degree in Applied Mathematics
• Over 20 years with Oracle software
• Over 10 years with data warehouses
• Big Data Analyst
• Author of numerous Oracle books
• Blogger: http://ians-oracle.blogspot.com/
• Oracle ACE
• IOUG Past-President
• TOUG Board Member
• Toronto based
• Twitter: @iabramson
WHERE IS BIG DATA?
Why Big data?
• New data sources
• Unprecedented volume
• Real World Issues
– Data systems are reaching capacity, requiring high-cost alternatives
– Archive data is too far offline
– Organizations require cost-effective options
– Retain all data for future analysis
Big Data Defined
• Wikipedia: Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
• IDC: Big Data is a term/concept used as a generic name for a “generation of technologies and architectures designed to extract value economically from very large volumes of a wide variety of data by enabling high-velocity capture, discovery, and/or analysis”.
• Gartner: Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
• Roger Magoulas (O’Reilly Research): “Data becomes ‘Big Data’ when the size of the data becomes a part of the problem”
The Attributes of Big Data
• Classic Data Attributes:
– Volume
– Velocity
– Variety
• Big Data Technical Attributes
– massive, parallel computing environment
– infinitely scalable computing clusters, including cloud
• Three main technical requirements
– A storage medium that can accommodate large volumes for both storage and data streaming
– Computing horsepower and an architectural approach that process the data where it resides, rather than extracting it first and processing it elsewhere
– A programming model that supports a computational paradigm in which computations run in a highly parallel and scalable environment
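The third requirement is usually met with a map/reduce style of programming. As a rough illustration only, the sketch below counts words with separate map, shuffle and reduce steps in plain Python. The function names and the thread pool standing in for cluster nodes are assumptions made for this example; this is not Hadoop's API.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values for each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks, workers=4):
    # Each chunk is mapped independently of the others. That independence is
    # what lets a real cluster run the map step on the node holding the data.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    # Hypothetical input: two "blocks" of text stored on different nodes.
    print(word_count(["big data big impact", "data science and big data"]))
```

Only the small (word, count) pairs move between steps; the raw chunks stay where they are.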
Challenges for Big Data
http://tdwi.org/blogs/fern-halper/2013/10/four-big-data-challenges.aspx
Big Data and Data Warehouse – war or peaceful coexistence?
• The problem: different uses call for different schemas and different partitioning. In most cases the requirements are orthogonal, so no single data partitioning/indexing scheme can be optimal for everybody
• The ideal goal: acquire and store “as is”, then access using multiple models. This needs a powerful artificial intelligence knowledge base and data access code generators.
• It will never be optimal for everybody without huge redundancy
• The problems are less painful if most of the data is read anyway. Good for analytics, not good for OLTP
• Eventually Big Data platforms will become DW platforms with well-developed access interfaces
• Until then: acquire and store, then distribute on demand to conventional DW and data marts
The New Data Architecture
[Diagram components: Operational Systems; Enterprise Data; Social & Clickstream; Sensor Generated; Public Data; Historical Data; Other New Sources; Big Data (Hadoop: HDFS, Map/Reduce); Data Archive; ODS; Data Warehouse/BI/Analytics]
TECHNOLOGY CHOICES
The Choices for Your Data
RDBMS
- High concurrency
- TB storage
- Indexed reads
- Efficient updates
- Caching
- Highly secure
Analytic Appliances
- Scalable
- Medium concurrency
- High volume processing (Postgres)
- No indexes
- TB+: Netezza (128 TB/rack), Oracle (300 TB/rack)
NoSQL
- Highly scalable
- High concurrency
- Storage options
- Updates
- Real-time capable
- Rudimentary indexes
- TB+ capacity
Hadoop
- Highly scalable
- Low concurrency
- Distributed storage
- Complex access
- Security (TBD)
The Open Source/Big Data Landscape
http://www.bigdata-startups.com/open-source-tools/
Hadoop In Detail
Reference: http://blog.blazeclan.com/252/
Hadoop Distributions
For Example if you choose Cloudera…
Comparing Hadoop Distributions
http://www.infoworld.com/d/business-intelligence/enterprise-hadoop-big-data-processing-made-easier-184330?page=0,5
Big Data’s Technical Challenges
• Disaster recovery
• Security
• Data consistency
• Workload management
• Reprocessing
• Troubleshooting
• Performance
DATA SCIENCE
Big Data vs. BI presentation viewpoint
IMPACT
Questions for BI and Big Data
• Sample questions for BI
– What is my sales volume by time, by region, by store, by season?
– What is the average review rating by product category and by product? How do reviews change over time, and what are the trends?
• Sample questions for Big Data/Data Science
– How does a change in review ratings impact sales?
– What is the time lag between a review rating change and a sales volume change?
– What products are purchased together, and can I improve product recommendations?
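One hedged way to approach the time-lag question is to shift the sales series against the ratings series and see which shift correlates best. The sketch below assumes two equal-length weekly series; the function names and the sample numbers are made up for illustration and are not from the presentation.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def best_lag(ratings, sales, max_lag):
    """Return (lag, correlation) for the lag at which sales[t + lag] tracks ratings[t] best."""
    results = {}
    for lag in range(max_lag + 1):
        xs = ratings[: len(ratings) - lag]  # ratings in period t
        ys = sales[lag:]                    # sales in period t + lag
        results[lag] = pearson(xs, ys)
    lag = max(results, key=results.get)
    return lag, results[lag]


if __name__ == "__main__":
    # Hypothetical weekly data: sales follow ratings two weeks later.
    ratings = [3.0, 3.2, 4.1, 4.0, 3.1, 3.0, 4.5, 4.4, 3.2, 3.1]
    sales = [310.0, 300.0] + [100.0 * r for r in ratings[:-2]]
    print(best_lag(ratings, sales, max_lag=4))
```

The same functions can be fed period-over-period changes instead of levels, which is closer to the wording of the question. A high correlation at some lag is a starting hypothesis to test, not proof that ratings drive sales.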
DATA SCIENCE
Data Science
Skills Science
• Purpose: State the problem
• Research: Discover information about the topic
• Hypothesis: Predict the outcome
• Experiment: Develop a process to test the hypothesis
• Analysis: Record the results
• Conclusion: Compare hypothesis and results
Data Science Team
Each team would include:
• Data Science Analyst – excellent communication skills, science and analytical background.
• Data Science Researcher/Solution Architect – good communication skills, good statistics/math skills, working knowledge of 2 of the following data science libraries and tools (Mahout or any other machine learning library, RHadoop, R, SAS, SPSS)
• Data Science Technologist – acceptable communication skills, 25% deployable to the client site (at minimum a few should be deployable, others can be offshore), good developer, working knowledge of Big Data and related technologies
• Data Science Presentation Engineer – knowledge of BI and presentation tools
Nordstrom’s Big Data Team Mission:
“Delighting Customers through data-driven
products”
USING BIG DATA
Data Science Sample use cases
Top 10 Use Cases (2013 Computerworld)
http://www.computerworld.com.sg/resource/storage/iiis-2013-technical-workshops/?page=2
1. Modeling Risk
2. Customer Churn Analysis
3. Recommendation Engines
4. Ad Targeting
5. POS Transaction Analysis
6. Analysis of network data to predict future failures
7. Threat Analysis
8. Trade Surveillance
9. Search Quality
10. Data Sandbox
The Big Data of Dating
• From analysis of match.com dating patterns:
• 21+ million members
• 100+ million hits per month
– January 2nd is the busiest day for people to sign up on dating sites
– Women get 60% more attention if their photo is taken indoors
– Men get 19% more attention if theirs is taken outside
– Full-body photos boost both sexes’ success by 203%
– Posing with animals or your best friends might seem cute, but it actually reduces your popularity by 53 per cent (men) and 42 per cent (women)
– Men get 8% fewer messages if they put up selfies
– Mentions of words like divorce and separated get men 52 per cent more messages
– Women who are more forward, using words like dinner, drinks or lunch in the first message, get 73 per cent more replies, while men should play it cooler: those who mention the same words in their opening message get 35 per cent fewer replies
Use Case Development
• Business Questions
• Business Stakeholders
• Identify Business Value
• Define Success Criteria
• Develop Hypothesis and Identify Data Sources
• Iterate results and develop data for goals
Use Case Checklist
• Title - An active description which identifies the goals of the primary actor
• Characteristics:
– Primary actor
– Goal in Context
– Scope
– Level
– Stakeholders and Interests
– Precondition
• Success criteria
– Precondition
– Minimal Guarantees
– Success Guarantees
– Trigger
– Main Success Scenario
– Extensions
• Technology & Data Variations List
• Related Information
Reference: Alistair Cockburn
Archive Use Case
• 1.5 petabytes of continuously ingested data
• One of the largest Hadoop clusters in the world
• 80% open source EDW
Customer Benefits
• Avoided massive cost of new DW infrastructure
• Able to keep and analyze historical transactions
• Reduced risk of DW replacement
• Able to scale on demand using low-cost servers
Transaction Volume
• > 500 GB daily increases from all sources: transaction, social, contact center
[Diagram labels: Call Center and Online data; Staging and Historical Analysis; Analytic Infrastructure; Informatica transformation & aggregation]
EXPEDIA CASE STUDY
Use Case: Sales Analysis
Sales per sq.ft.: Changes Over Time
• Fitting a no-intercept line to the scatter of sales against sales floor area gives a visual baseline Sales-per-Sq.Ft. (SpSF) for each year
• Mathematically, the SpSF measure is given by the slope coefficient of the trend: 392.51 CAD/sq.ft. in 2011 vs. 373.76 CAD/sq.ft. in 2012
SpSF: 417 in 2011, 417 in 2012
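For a line forced through the origin, the least-squares slope has a closed form: the sum of (area × sales) divided by the sum of (area squared). The sketch below computes it for one year of store data. The function name and the three sample stores are illustrative assumptions, not the presentation's data.

```python
def sales_per_sqft(floor_areas, sales):
    """Slope of the least-squares line through the origin: sum(x*y) / sum(x*x)."""
    numerator = sum(area * sale for area, sale in zip(floor_areas, sales))
    denominator = sum(area * area for area in floor_areas)
    if denominator == 0:
        raise ValueError("at least one store needs a non-zero floor area")
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical stores: floor area in sq.ft. and annual sales in CAD.
    areas = [1000, 2000, 3000]
    annual_sales = [400_000, 800_000, 1_200_000]
    print(sales_per_sqft(areas, annual_sales))  # 400.0 CAD per sq.ft.
```

Running this once per year gives one slope per year, which is the kind of year-over-year comparison the slide describes. Note that this slope weights large stores more heavily than a simple average of each store's own sales per square foot would.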
Looking for Patterns and Anomalies
This chart tells us that most of the stores have their highest sales on Saturday. But Store X peaks on Friday and is also doing well on Mondays. Why?
[Chart: sales by day of week, THU through WED; vertical axis from 0 to 10,000,000]
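A simple way to surface a store like Store X automatically is to find each store's peak day and flag the stores whose peak differs from the most common one. The sketch below assumes sales are already aggregated per store and day of week; the store names and figures are invented for illustration.

```python
from collections import Counter

DAYS = ["THU", "FRI", "SAT", "SUN", "MON", "TUE", "WED"]


def peak_day(sales_by_day):
    """Day of week with the highest sales for one store."""
    return max(sales_by_day, key=sales_by_day.get)


def stores_with_unusual_peak(stores):
    """Return {store: peak day} for stores that peak on a different day than most."""
    peaks = {name: peak_day(days) for name, days in stores.items()}
    typical_day, _ = Counter(peaks.values()).most_common(1)[0]
    return {name: day for name, day in peaks.items() if day != typical_day}


if __name__ == "__main__":
    # Hypothetical weekly totals, in the same day order as the chart axis.
    stores = {
        "Store A": dict(zip(DAYS, [40, 55, 90, 70, 30, 28, 33])),
        "Store B": dict(zip(DAYS, [35, 50, 80, 65, 25, 27, 30])),
        "Store X": dict(zip(DAYS, [45, 95, 70, 50, 75, 30, 32])),
    }
    print(stores_with_unusual_peak(stores))
```

This only flags the anomaly. Answering the "Why?" still takes the research and hypothesis steps of the data science process.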
Affinity Analysis Use Case
Build a model that provides the foundation for analyzing and understanding the factors that influence year-over-year change in store performance
• Affinity Analysis is an input to:
• Identify products purchased in tandem
• Provide guidance and recommendations for upsell and cross-sell
• Redesign stores, layouts and planograms
• Discount plans and promotions
• Identifying customer baskets in different times and geographies
• Investigating patterns on fine line and product levels
• Ranking customer baskets by number of times bought together and by revenue contributed
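The ranking step can be sketched as counting how often each pair of products shows up in the same basket and summing the revenue those two products contributed. The data layout below (one dict per basket, product name mapped to revenue) and the sample baskets are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def rank_product_pairs(baskets):
    """Rank product pairs by co-purchase count, then by revenue contributed.

    `baskets` is a list of dicts mapping product name -> revenue in that basket.
    Returns a list of (pair, times_bought_together, revenue) tuples.
    """
    counts = Counter()
    revenue = Counter()
    for basket in baskets:
        # Sorting the names makes (a, b) and (b, a) count as the same pair.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
            revenue[pair] += basket[pair[0]] + basket[pair[1]]
    return sorted(
        ((pair, counts[pair], revenue[pair]) for pair in counts),
        key=lambda row: (row[1], row[2]),
        reverse=True,
    )


if __name__ == "__main__":
    # Hypothetical baskets with revenue per product in CAD.
    baskets = [
        {"snow scraper": 10.0, "washer fluid": 5.0},
        {"snow scraper": 10.0, "washer fluid": 5.0, "potting soil": 8.0},
        {"potting soil": 8.0, "potted annuals": 12.0},
    ]
    for pair, times, total in rank_product_pairs(baskets):
        print(pair, times, total)
```

Counting every pair grows quadratically with basket size, so on real transaction volumes this is usually run per season or per region, or replaced by a frequent-itemset algorithm.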
Clustering of Products
Snow Scrapers and Washer Fluid
Related Baskets
1. Potted annuals/plants, Cell-packs/annual plants
2. Potted annuals/plants, vegetables/plants
3. Potted annuals/plants, Outdoor soils/outdoor lawn & plant care
4. Cell-packs/annual plants, vegetables/annual plants
The size of each circle shows how often the basket has been purchased
Season: 2012-05-16 - 2012-08-28
This kind of analysis can be used for spotting driver products
Big Data is Evolving
• The industry is evolving
• Hadoop is now 8 years old, since its start in 2006 at Yahoo
• CDH 5 recently released
• $2.5B in venture capital in the space
• Hadoop is now considered a standard
• HBase is an example of a project which has not found a standard
• Many tools today. What will there be 5 years from now?
• How to avoid the big data pitfalls?
• 50% of big data projects fail
• Those who succeed drive it by focus
• Insight vs. Impact
• Find one problem and fix it
• Data Science
• Change how you do analysis… scientific methods
• New and exciting
• Build a hybrid team to develop Data solutions
• The team can program, knows math and statistics, and can communicate
The Big Data Adventure
Ian Abramson
EPAM Systems
Toronto, Canada
GMT -5
Mobile phone: +1 (416) 254-9286
Skype: ian.abramson
E-mail: [email protected]
Thank You and Questions