Big Data Analysis Patterns with Hadoop, Mahout and Solr
DESCRIPTION
Big Data Analysis Patterns: Tying real world use cases to strategies for analysis using big data technologies and tools. Big data is ushering in a new era for analytics with large scale data and relatively simple algorithms driving results rather than relying on complex models that use sample data. When you are ready to extract benefits from your data, how do you decide what approach, what algorithm, what tool to use? The answer is simpler than you think. This session tackles big data analysis with a practical description of strategies for several classes of application types, identified concretely with use cases. Topics include new approaches to search and recommendation using scalable technologies such as Hadoop, Mahout, Storm, Solr, and Titan.
TRANSCRIPT
![Page 1: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/1.jpg)
1
Big Data Analysis Patterns
Atlanta Big Data User Group
8/15/2013
![Page 2: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/2.jpg)
2
whoami
• Brad Anderson
• Solutions Architect at MapR (Atlanta)
• ATLHUG co-chair
• NoSQL East Conference 2009
• “boorad” most places (twitter, github)
• [email protected]
![Page 3: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/3.jpg)
3
Announcements
• Next ATLHUG Meeting: Sept. 26, How Google Does Big Data
• Wednesday: MapR Data Warehouse Offload Roadshow
MapR Upcoming Training
• MapR M7 & HBase for Developers on August 27 in Campbell, CA
• MapR M7 & HBase for Developers on Sept 17 in Reston, VA
• MapR M5 for Administrators on Oct 3 in Campbell, CA
![Page 4: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/4.jpg)
4
BIG DATA
![Page 5: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/5.jpg)
5
![Page 6: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/6.jpg)
6
Big Data is not new!
but the tools are.
![Page 7: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/7.jpg)
7
The Good News in Big Data:
“Simple algorithms and lots of data trump complex models”
Halevy, Norvig, and Pereira, Google
IEEE Intelligent Systems
![Page 8: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/8.jpg)
8
The Challenge: So Many Solutions!
What solutions fit your business problem?
For example, do you need…
• Apache Hadoop?
• Apache Mahout?
• Storm?
• Apache Solr/Lucene?
• Apache HBase (or MapR M7)?
• Apache Drill (or Impala)?
• d3.js or Tableau?
• Node.js?
• Titan?
![Page 9: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/9.jpg)
9
Ask a Different Question
It may be more useful to define the problem better by asking some of these questions:
• How large is the data to be stored?
• How large is the data to be queried? (the analysis volume)
• What time frame is appropriate for your query response?
• How fast is data arriving? (bursts or continuously?)
• Are queries by sophisticated users?
• Are you looking for common patterns or outliers?
• How are your data sources structured?
![Page 10: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/10.jpg)
10
Picking the Best Solution
Your responses to these questions can help you better:
• define the problem
• recognize the analysis pattern to which it belongs
• guide the choice of solutions to try
But first, here’s a quick review of a few of the technologies you might choose, and then we will focus on three of the questions as a part of the landscape.
![Page 11: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/11.jpg)
11
Apache Solr/Lucene
Solr/Lucene is a powerful search engine used for flexible, heavily indexed queries, including data such as:
• Full text
• Geographical data
• Statistically weighted data
Solr is a small data tool that has flourished in a big data world
![Page 12: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/12.jpg)
12
Apache Mahout
Mahout provides a library of scalable machine learning algorithms useful for big data analysis based on Hadoop or other storage systems.
Mahout algorithms are mainly used for:
• Recommendation (collaborative filtering)
• Clustering
• Classification
Mahout can be used in conjunction with solutions such as Solr: you might use Mahout to create a co-occurrence database that could then be queried using a search tool such as Solr.
![Page 13: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/13.jpg)
13
Apache Drill
• Google Dremel clone
• Pluggable Query Languages
– Starts with ANSI SQL 2003
– Hive, Pig, Cascading, MongoQL, …
• Pluggable Storage Backends
– Hadoop, HBase
– MongoDB (BSON)
– RDBMS?
• Bypasses MapReduce
![Page 14: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/14.jpg)
14
Storm
• Realtime Stream Computation Engine
• Horizontal Scalability
• Guaranteed Data Processing
• Fault Tolerance
• Higher level abstraction over:
– Message Queues
– Worker Logic
“The Hadoop of Realtime”
![Page 15: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/15.jpg)
15
Titan
• Distributed Graph Database
• Property Graph
• Pluggable Backend Storage
– HBase or M7
– Cassandra
– Berkeley DB
• Search Integrated
– Solr/Lucene
– Elasticsearch
• Faunus
– Batch processing of large graphs
• Fulgora
– Graph traversals on subset
– In-memory
![Page 16: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/16.jpg)
16
Using the Answers to Guide Your Choices
For simplicity, let’s focus in on the first three questions:
• How large is the data to be stored?
• How large is the data to be queried? (the analysis volume)
• What time frame is appropriate for your query response?
![Page 17: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/17.jpg)
17
Big Data Decision Tree
[Figure: decision tree over the three questions]
• How big is your data? <10 GB, mid, >200 GB
• What size queries? Single element at a time; one pass over 100%; multiple passes over big chunks (branch labels: big storage, streaming)
• Response time? <100 s (human scale); throughput, not response
• Outcomes: A, B, C, D, E, ??
![Page 18: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/18.jpg)
18
Use Cases
• Company
• Data Shape
• Technique(s)
• Business Value
![Page 19: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/19.jpg)
19
Business Value
![Page 20: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/20.jpg)
20
Business Value
![Page 21: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/21.jpg)
21
Telecommunications Giant
ETL Offload
![Page 22: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/22.jpg)
22
Data Shape
• Lots of Data
• Lots of Queries across Large Sets
• Throughput important
Telecommunications
![Page 23: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/23.jpg)
23
Techniques
Analytics
ETL
Telecommunications
![Page 24: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/24.jpg)
24
Techniques
ETL (Hadoop) + Analytics (Teradata)
Telecommunications
![Page 25: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/25.jpg)
25
Business Value
Telecommunications
![Page 26: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/26.jpg)
26
Credit Card Issuer
![Page 27: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/27.jpg)
27
Data Shape
• Customer Purchase History (big)
• Merchant Designations
• Merchant Special Offers
• Throughput important
• Recommendations
Credit Card Issuer
![Page 28: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/28.jpg)
28
Techniques
A Recommendation Engine with Mahout and Solr/Lucene
History matrix
• One row per user
• One column per thing
Credit Card Issuer
![Page 29: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/29.jpg)
29
Techniques
Recommendation based on cooccurrence
• Cooccurrence gives item-item mapping
• One row and column per thing
Credit Card Issuer
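As a rough illustration of the history and cooccurrence matrices described on these slides, the sketch below counts, for every pair of items, how many users have both in their history, then scores unseen items for a user by summing the matching rows. It is plain Python rather than Mahout; the function names, the sample users and items, and the use of raw counts as scores are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(history):
    """Count, for every item pair, how many users have both items.

    history maps user -> set of items (one row per user, one column per thing).
    Returns item -> {other_item: count}, a sparse item-item matrix.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for items in history.values():
        for a, b in combinations(sorted(items), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(user_items, cooc, top_n=3):
    """Rank items the user lacks by summed cooccurrence with items they have."""
    scores = defaultdict(int)
    for item in user_items:
        for other, n in cooc.get(item, {}).items():
            if other not in user_items:
                scores[other] += n
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda k: (-scores[k], k))[:top_n]


# Hypothetical purchase histories, invented for illustration.
history = {
    "u1": {"coffee", "bagel", "gas"},
    "u2": {"coffee", "bagel"},
    "u3": {"coffee", "gas"},
    "u4": {"bagel", "books"},
}
cooc = cooccurrence(history)
print(recommend({"coffee"}, cooc))
```

A production job would normally weight or filter these counts so that very popular items do not dominate; that step is left out here.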
![Page 30: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/30.jpg)
30
Techniques
Cooccurrence matrix can also be implemented as a search index
Credit Card Issuer
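One way to picture the search-index idea: each item becomes a document whose indicator field lists the items it cooccurs with, and a user's recent history is used as the query. The sketch below uses a small in-memory inverted index as a stand-in for Solr; the field layout, function names, and sample data are assumptions, and the score is a plain term-overlap count rather than Solr's relevance scoring.

```python
from collections import defaultdict


def build_index(indicators):
    """Inverted index: indicator term -> set of item documents containing it."""
    index = defaultdict(set)
    for item, terms in indicators.items():
        for term in terms:
            index[term].add(item)
    return index


def search(index, user_history, top_n=3):
    """Score each item document by how many history terms hit its indicators."""
    scores = defaultdict(int)
    for term in user_history:
        for item in index.get(term, ()):
            if item not in user_history:  # do not recommend what the user already has
                scores[item] += 1
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


# Hypothetical item documents: item -> items it cooccurs with.
indicators = {
    "bagel": ["coffee", "books"],
    "gas": ["coffee"],
    "books": ["bagel"],
    "coffee": ["bagel", "gas"],
}
index = build_index(indicators)
print(search(index, ["coffee", "bagel"]))
```

The appeal of this layout is that serving a recommendation becomes an ordinary search query, so the web tier needs nothing beyond the search engine it may already run.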
![Page 31: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/31.jpg)
31
Techniques
[Diagram labels: Complete history; Cooccurrence (Mahout); Item meta-data; Solr Indexer (×2); Solr indexing; Index shards. Annotations: 20 Hrs, 3 Hrs]
Credit Card Issuer
![Page 32: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/32.jpg)
32
Techniques
[Diagram labels: User history; Web tier; Solr search; Solr Indexer (×2); Item meta-data; Index shards. Annotations: 8 Hrs, 3 Min]
Credit Card Issuer
![Page 33: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/33.jpg)
33
Techniques
[Diagram labels: Purchase History; Merchant Information; Merchant Offers; Hadoop; Recommendation Engine Results (Mahout); Export (4 hrs); Import (4 hrs); Presentation Data Store (DB2); App (×5)]
Credit Card Issuer
![Page 34: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/34.jpg)
34
Techniques
[Diagram labels: Purchase History; Merchant Information; Merchant Offers; Hadoop; Recommendation Engine Results (Mahout); Index Update (3 min); Recommendation Search Index (Solr); App (×5)]
Credit Card Issuer
![Page 35: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/35.jpg)
35
Business Value
Credit Card Issuer
![Page 36: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/36.jpg)
36
Idle Alerts
Waste & Recycling Leader
![Page 37: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/37.jpg)
37
Data Shape
• Truck Geolocation Data
– 20,000 trucks
– 5 sec interval (arriving quickly)
• Landfill Geographic Boundaries
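To put "arriving quickly" in numbers, the fleet size and reporting interval from the slide can be turned into an arrival rate. The sketch assumes every truck reports around the clock, which is an upper bound rather than a figure from the talk; the variable names are invented for the example.

```python
TRUCKS = 20_000            # from the slide
REPORT_INTERVAL_S = 5      # from the slide
SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: all trucks report continuously, all day (upper bound).
readings_per_second = TRUCKS // REPORT_INTERVAL_S
readings_per_day = readings_per_second * SECONDS_PER_DAY

print(readings_per_second)  # 4000
print(readings_per_day)     # 345600000
```

A few thousand small messages per second is a steady stream rather than a burst, which is the kind of arrival pattern a stream processor is built for.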
![Page 38: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/38.jpg)
38
Techniques
[Diagram labels: Truck Geolocation Data; Realtime Stream Computation (Storm); Immediate Alerts; Hadoop Storage; Batch Computation (MapReduce); Tax Reduction Reporting; Shortest Path Graph Algorithm (Titan); Route Optimization]
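The immediate-alert path might look something like the sketch below, written as plain Python rather than as a Storm topology. It assumes an alert is raised once when a truck has reported the same position for a set time while outside the landfill boundary; that rule, the thresholds, the flat x/y coordinates, and all names are assumptions for illustration, not details from the talk.

```python
def inside(point, polygon):
    """Ray-casting point-in-polygon test; polygon is a list of (x, y) vertices."""
    x, y = point
    hit = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                hit = not hit
    return hit


def idle_alerts(readings, landfill, max_idle_s=300, eps=1e-4):
    """Yield (truck_id, timestamp) once per stop that lasts too long outside the landfill.

    readings: iterable of (truck_id, timestamp_s, x, y) in arrival order.
    """
    anchor = {}  # truck_id -> (x, y, stopped_since, already_alerted)
    for truck, ts, x, y in readings:
        prev = anchor.get(truck)
        moved = prev is None or abs(x - prev[0]) > eps or abs(y - prev[1]) > eps
        if moved:
            anchor[truck] = (x, y, ts, False)
            continue
        px, py, since, alerted = prev
        if not alerted and ts - since >= max_idle_s and not inside((x, y), landfill):
            anchor[truck] = (px, py, since, True)
            yield truck, ts


# Hypothetical boundary and readings: t1 is parked outside, t2 sits inside the landfill.
landfill = [(0, 0), (1, 0), (1, 1), (0, 1)]
readings = [
    ("t1", 0, 5, 5), ("t2", 0, 0.5, 0.5),
    ("t1", 5, 5, 5), ("t2", 5, 0.5, 0.5),
    ("t1", 10, 5, 5), ("t2", 10, 0.5, 0.5),
    ("t1", 15, 5, 5),
]
print(list(idle_alerts(readings, landfill, max_idle_s=10)))
```

Because the per-truck state is keyed by truck id, the same logic could be partitioned by truck across workers in a streaming engine.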
![Page 39: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/39.jpg)
39
Business Value
![Page 40: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/40.jpg)
40
Social Engagement Application
Beverage Company
![Page 41: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/41.jpg)
41
Data Shape
• Tweets, FB Messages
• Person, Activity links
• Graph Traversal
![Page 42: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/42.jpg)
42
Consumer Activity Graph
Wal*Mart.com
CVS
Dollar General
Ebay
Ebay Motors
Toys R Us
StubHub
Shopping.com
Sam’s
![Page 43: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/43.jpg)
43
Techniques
Property Graph (Titan)
Key/Value Store (MapR M7)
Social Activity Stream
Graph Traversal (Faunus/Fulgora)
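As a small picture of what a traversal over person and activity links involves, the sketch below runs a breadth-first walk over a labelled adjacency list and reports how many hops each vertex is from a starting person. It is plain Python, not the Titan, Faunus, or Fulgora API; the graph contents, edge labels, and function name are invented for the example.

```python
from collections import deque


def within_hops(graph, start, max_hops):
    """Breadth-first traversal: vertex -> hop distance, for vertices within max_hops."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        if dist[v] == max_hops:
            continue  # do not expand beyond the hop limit
        for _label, w in graph.get(v, []):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


# Hypothetical property graph: vertex -> list of (edge label, target vertex).
graph = {
    "alice": [("tweeted", "post1"), ("follows", "bob")],
    "bob": [("liked", "post1"), ("follows", "carol")],
    "carol": [("tweeted", "post2")],
}
print(within_hops(graph, "alice", 2))
```

Limiting the hop count is what keeps a traversal like this confined to a small subset of a large graph.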
![Page 44: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/44.jpg)
44
Business Value
![Page 45: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/45.jpg)
45
Fraud Detection
Data Lake
![Page 46: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/46.jpg)
46
Data Sources
• Anti-Money Laundering
• Consumer Transactions
![Page 47: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/47.jpg)
47
Techniques
Anti-Money Laundering System
Consumer Transactions System
![Page 48: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/48.jpg)
48
Techniques
AML
Consumer Transactions
Data Lake (Hadoop)
Suspicious Events
Latent Dirichlet Allocation, Bayesian Learning Neural Network, Peer Group Analysis
Analyst
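Of the techniques named on this slide, peer group analysis is the simplest to sketch: compare each account with others of the same kind and surface those that sit far from the group norm. The version below flags accounts whose total is more than k standard deviations from their peer-group mean. The grouping, the threshold, the sample figures, and the function name are assumptions for illustration; the talk does not describe the actual scoring.

```python
from statistics import mean, pstdev


def peer_outliers(amounts, groups, k=2.0):
    """Return accounts whose amount is more than k std devs from their peer-group mean.

    amounts: account -> transaction total; groups: account -> peer group name.
    """
    by_group = {}
    for acct, group in groups.items():
        by_group.setdefault(group, []).append(acct)

    flagged = []
    for accts in by_group.values():
        vals = [amounts[a] for a in accts]
        mu, sigma = mean(vals), pstdev(vals)
        if sigma == 0:
            continue  # all peers identical, nothing stands out
        flagged += [a for a in accts if abs(amounts[a] - mu) / sigma > k]
    return sorted(flagged)


# Hypothetical monthly totals; a6 is far above its retail peers.
amounts = {"a1": 100, "a2": 110, "a3": 90, "a4": 105, "a5": 95, "a6": 1000,
           "b1": 5000, "b2": 5200}
groups = {"a1": "retail", "a2": "retail", "a3": "retail", "a4": "retail",
          "a5": "retail", "a6": "retail", "b1": "business", "b2": "business"}
print(peer_outliers(amounts, groups))
```

Note that the business accounts move far larger sums than a6 yet are not flagged, because each account is judged only against its own peers.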
![Page 49: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/49.jpg)
49
Business Value
![Page 50: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/50.jpg)
50
Machine Learning
Search Relevance
DNA Matching
![Page 51: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/51.jpg)
51
Data Sources
• Birth, Death, Census, Military, Immigration records
• Search Behavior Activity
• DNA SNP (snips)
![Page 52: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/52.jpg)
52
Techniques
• Record Linking
• Search Relevance
• Clickstream Behavior
• Security Forensics
• DNA Matching
![Page 53: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/53.jpg)
53
Business Value
![Page 54: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/54.jpg)
54
Traffic Analytics
![Page 55: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/55.jpg)
55
Data Sources
• Inrix Road Segment Data
– Avg Speed / minute / segment
– Reference Speeds
• Road Segment Geolocation Data
![Page 56: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/56.jpg)
56
Techniques
• Bottleneck Detection Algorithm
• Time Offset Correlations
– Alternate Routes
• Predictive Congestion Analysis
– Growth & Term Assumptions
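A bottleneck detector over per-minute segment speeds could be as simple as the sketch below: a segment is flagged when its average speed stays below a fraction of its reference speed for several consecutive minutes. The 60% ratio, the three-minute run length, the segment identifiers, and the function name are all assumptions for illustration, not the algorithm presented in the talk.

```python
def bottlenecks(speeds, reference, ratio=0.6, min_minutes=3):
    """Return {segment: longest run of consecutive slow minutes} for flagged segments.

    speeds: segment -> list of per-minute average speeds.
    reference: segment -> reference (free-flow) speed.
    """
    flagged = {}
    for seg, series in speeds.items():
        limit = ratio * reference[seg]
        run = longest = 0
        for s in series:
            run = run + 1 if s < limit else 0  # extend or reset the slow streak
            longest = max(longest, run)
        if longest >= min_minutes:
            flagged[seg] = longest
    return flagged


# Hypothetical segments: one sustained slowdown, one with isolated dips only.
speeds = {
    "I-85N:12": [62, 30, 28, 25, 31, 60],
    "I-285E:3": [55, 20, 58, 57, 21, 56],
}
reference = {"I-85N:12": 65, "I-285E:3": 60}
print(bottlenecks(speeds, reference))
```

Requiring a run of slow minutes, rather than any single slow reading, keeps one-off dips from being reported as congestion.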
![Page 57: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/57.jpg)
57
![Page 58: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/58.jpg)
58
![Page 59: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/59.jpg)
59
Business Value
![Page 60: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/60.jpg)
60
Similar Characteristics
• Lots of Data
• Structured, Semi-Structured, Unstructured
• Varied Systems Interoperating
– Hadoop, Storm, Solr, MPP, Visualizations
• Increase Revenue
• Decrease Costs
![Page 61: Big Data Analysis Patterns with Hadoop, Mahout and Solr](https://reader035.vdocuments.mx/reader035/viewer/2022081518/54b7687a4a795957768b463c/html5/thumbnails/61.jpg)
61
Questions?