DataWeek keynote: Large Scale Search, Discovery and Analysis in Action
DESCRIPTION
Session Description: Understanding and accessing large volumes of content often requires a multi-faceted approach that goes well beyond the basics of simple batch processing jobs. In many cases, one needs both ad hoc, real time access to the content as well as the ability to discover interesting information based on a variety of features such as recommendations, summaries and other insights. In this talk, we'll discuss real world use cases across several industries as well as how to effectively leverage open source tools like Hadoop, Solr, Mahout and others to better enable user access to big data.
TRANSCRIPT
Confidential © Copyright 2012
Large Scale Search, Discovery and Analysis in Action
Ivan Provalov
Research Engineer
Office of the Chief Scientist
September 25, 2012
Confidential and Proprietary © 2012 LucidWorks
User Interactions With Big Data
[Diagram: three paths from users to data]
• System Administrator: Command Line over DFS
• Engineer: Query Language over Key Value Store
• End User: Keyword Search over Index
Search, Discovery and Analytics
Is Search Enough?
• Keyword search is a commodity
• Holistic view of the data and the user interactions with that data
• Search, Discovery and Analytics are the key to unlocking this view of users and data
[Search box, example query: endeavour shuttle bay area]
Why Search, Discovery and Analytics?
• User Needs
- real-time, ad hoc access to content
- aggressive prioritization based on importance
- serendipity
- feedback/learning from past
• Business Needs
- deeper insight into users
- leverage existing internal knowledge
- cost effective
[Diagram: Search, Discovery, Analytics]
Topics
• Background and needs
• Architecture
• Search, Discovery and Analytics in action
• Road map
• Wrap up
Search
• Performance
• Real time
• Relevance and importance
• Presenting results
• Experiment management
Discovery
• Content clustering
• Discovering near duplicate documents
• Finding 'dark data'
• Making recommendations
• Uncovering trends
• Recognizing topics
• More like this
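One of the discovery tasks listed above, finding near duplicate documents, can be illustrated on a single machine. The sketch below is not from the talk: it assumes word shingles compared with Jaccard similarity, and the function names, the shingle size `k` and the `threshold` value are all hypothetical choices. A production system would more likely use a scalable variant of the same idea.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles in a lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def near_duplicates(docs, threshold=0.5, k=3):
    """Return id pairs whose shingle sets overlap at or above the threshold.

    docs maps a document id to its text. The threshold is an assumed
    value for illustration, not a recommendation.
    """
    sigs = {doc_id: shingles(text, k) for doc_id, text in docs.items()}
    ids = sorted(sigs)
    return [
        (a, b)
        for i, a in enumerate(ids)
        for b in ids[i + 1:]
        if jaccard(sigs[a], sigs[b]) >= threshold
    ]
```

Comparing every pair is quadratic in the number of documents, so this form only suits small collections.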
Analytics
• Term frequency
• Facets
• Click analysis
• Relevancy metrics
• Zero results queries
• Hot spots
• Statistically interesting phrases
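Two of the analytics listed above, term frequency and zero results queries, can be computed from a query log with very little code. This is a minimal sketch under stated assumptions: the log is taken to be a sequence of `(query, num_hits)` pairs, which is a hypothetical format, and the function names are illustrative rather than part of any product API.

```python
from collections import Counter


def term_frequency(queries):
    """Count how often each whitespace-separated term appears across queries."""
    counts = Counter()
    for query in queries:
        counts.update(query.lower().split())
    return counts


def zero_result_queries(log):
    """Count queries that returned no hits.

    log is an iterable of (query, num_hits) pairs (assumed format).
    """
    return Counter(query for query, hits in log if hits == 0)
```

Zero results queries are useful because each one points at either missing content or a vocabulary mismatch between users and documents.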
Some Use Cases
• Video streaming
- classification
- recommendations
• Financial, transportation, telecommunications
- fraud detection
• Social media
- trend monitoring
• Information technology
- log monitoring
• Healthcare
- identifying patients for clinical studies
In Focus: Personalized Medicine
[Diagram labels: Genetic Variations; Patient DNA; Alignment and other analysis; Search and Faceting; Standard Therapies; Alternative Therapies]
In Focus: Log Processing in Telecommunications
• Each year, large sums of money are lost due to fraudulent calls and poor service
• Logs are usually semi-structured and contain vital information about errors and fraud
• Deeper batch analytics can provide insight into patterns across vast amounts of data
• Search of call and network information (via logs) is critical to providing deeper analysis and understanding of these errors and fraudulent activities
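The log processing idea above can be sketched in a few lines. Everything specific here is an assumption for illustration: the log line layout (a timestamp followed by `key=value` fields), the field names `caller` and `status`, and the rule of flagging callers with repeated non-OK calls are all hypothetical, since the talk does not describe an actual log format or fraud rule.

```python
import re
from collections import Counter

# Matches key=value fields such as "caller=555-0100" (assumed log layout).
PAIR = re.compile(r"(\w+)=(\S+)")


def parse_line(line):
    """Split a semi-structured log line into a dict of its fields."""
    timestamp, _, rest = line.partition(" ")
    record = dict(PAIR.findall(rest))
    record["timestamp"] = timestamp
    return record


def callers_with_many_failures(lines, min_failures=2):
    """Return callers whose count of non-OK calls reaches min_failures."""
    failures = Counter(
        rec["caller"]
        for rec in map(parse_line, lines)
        if rec.get("status") != "OK" and "caller" in rec
    )
    return {caller: n for caller, n in failures.items() if n >= min_failures}
```

In a batch setting the same parse step would feed an aggregation job, while the parsed records would also be indexed for ad hoc search.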
What Does a Search, Discovery and Analytics Platform Need?
• Fast, efficient, scalable search
- bulk and near real time indexing
- handle billions of records with sub-second search and faceting
• Large scale, cost effective storage and processing capabilities
- need whole data consumption and analysis
- experimentation/sampling tools
• NLP and machine learning tools that scale to enhance discovery and analysis
Building a Search, Discovery and Analytics Platform
[Architecture diagram labels: Inputs; API; Management; Search, Discovery, Analytics; Processing & Storage; Provisioning, Monitoring & Configuration; Bulk & Real Time]
LucidWorks Big Data
[Architecture diagram labels: Inputs; API; Management; Search, Discovery, Analytics; Analytics Service; Document Service; Processing & Storage; Provisioning, Monitoring & Configuration]
LucidWorks Big Data
[Architecture diagram labels: Inputs; API; Mgmt; Search, Discovery, Analytics; Analytics Service; Document Service; Admin; Service Mgmt; Data Mgmt; Processing & Storage; Provisioning, Monitoring & Configuration]
LucidWorks Big Data
[Architecture diagram labels: Inputs; API (Big Data, LucidWorks, WebHDFS); Mgmt; Search, Discovery, Analytics; Analytics Service; Document Service; Admin; Service Mgmt; Data Mgmt; Processing & Storage; Provisioning, Monitoring & Configuration]
Components – LucidWorks Search
Component | Benefit
LucidWorks Search (2.1.1) | Lucene/Solr 4.0-dev, sharded with SolrCloud, near-real time indexing, transaction logs for recovery.
• connector framework
• security
• user click framework
• business process integration
• administration
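Because the search component is built on Lucene/Solr, a keyword search with faceting can be expressed as a request to Solr's standard `/select` handler. The sketch below only builds the request URL and makes no network call. The parameters `q`, `rows`, `wt`, `facet` and `facet.field` are standard Solr query parameters; the host, port, collection name `collection1` and field name `source` are assumptions for illustration, and LucidWorks Search may expose its own API in front of Solr.

```python
from urllib.parse import urlencode


def solr_facet_url(base_url, query, facet_field, rows=10):
    """Build a Solr /select URL for a keyword query with one facet field."""
    params = [
        ("q", query),                  # keyword query
        ("rows", rows),                # number of documents to return
        ("wt", "json"),                # response format
        ("facet", "true"),             # enable faceting
        ("facet.field", facet_field),  # field to count values for
    ]
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Hypothetical host, collection and field, using the example query from the slides.
url = solr_facet_url(
    "http://localhost:8983/solr/collection1",
    "endeavour shuttle bay area",
    "source",
)
```

The JSON response to such a request would carry both the matching documents and per-value counts for the facet field.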
Components - Hadoop
Component | Benefit
Apache Hadoop (1.0.3) | Distributed computing and processing for ETL and analytics jobs.
Apache HBase (0.92) | Key-value store allowing fast access to the data.
Apache Oozie (modified 3.2) | Workflow orchestration.
Components - Analysis/ML/NLP
Component | Benefit
Apache Mahout (trunk) | Distributed machine learning processing framework.
• k-means clustering
• statistically interesting phrases
• similar documents
• classification
Apache UIMA (2.4.0) | Text processing and annotations.
Apache OpenNLP (1.5.2) | Machine learning toolkit for natural language processing.
• named entity extraction
Behemoth (modified trunk) | Simplifies M/R data extraction; abstracts annotation frameworks.
Apache Pig (0.9.2) | Helps with writing analytics M/R programs.
• ETL
• log analysis
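To make the k-means clustering entry above concrete, here is a minimal single-machine sketch of Lloyd's algorithm. It is not Mahout's implementation, which runs as distributed jobs; the function name, the fixed iteration count and the choice to pass initial centroids explicitly are assumptions made to keep the example short and deterministic.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm on tuples of numbers, starting from given centroids.

    Returns the final centroids and the cluster index assigned to each point.
    """
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        assignment = [
            min(
                range(len(centroids)),
                key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])),
            )
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [pt for pt, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment
```

For document clustering the points would be term vectors rather than 2-D coordinates, and the two steps map naturally onto a map phase (assignment) and a reduce phase (centroid update).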
Components - Middleware
Component | Benefit
Apache ZooKeeper (3.4.3) | Service discovery.
• Netflix Curator
Apache Kafka (0.7) | Log consumption and event-based real-time document processing framework.
Components - SDA Engine
• RESTful services (Restlet 2.1)
• ZooKeeper + Netflix Curator
• Authentication and authorization
• Proxies for LucidWorks and WebHDFS API
• Workflow engine
Road Map
• Analytics themes
- relevance
- data quality
- discovery
- integration with other packages (R)
• Machine learning
- NLP
- recommendations
• Experiment management
Conclusions
• Search, Discovery and Analytics, when combined into a single, integrated system, provide powerful insight into both your content and your users
• LucidWorks has combined many of these capabilities into LucidWorks Big Data
LucidWorks Big Data
• Unified development platform for Big Data applications
• Integrated open source stack: Lucene/Solr, Hadoop, Mahout, NLP
• Single, uniform REST API
• Pre-tuned by open source industry experts
• Out of the box provisioning - hosted or on premise
www.lucidworks.com/bigdata
@iprovalov
Search | Discover | Analyze