dataweek keynote: large scale search, discovery and analysis in action

Confidential © Copyright 2012 Large Scale Search, Discovery and Analysis in Action Ivan Provalov Research Engineer Office of the Chief Scientist September 25, 2012

Upload: ivan-provalov

Post on 27-Jan-2015


DESCRIPTION

Session Description: Understanding and accessing large volumes of content often requires a multi-faceted approach that goes well beyond simple batch processing jobs. In many cases, one needs both ad hoc, real-time access to the content and the ability to discover interesting information based on a variety of features such as recommendations, summaries and other insights. In this talk, we'll discuss real-world use cases across several industries, as well as how to effectively leverage open source tools like Hadoop, Solr, Mahout and others to better enable user access to big data.

TRANSCRIPT

Page 1: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION

Confidential © Copyright 2012

Large Scale Search, Discovery and Analysis in Action

Ivan Provalov
Research Engineer
Office of the Chief Scientist
September 25, 2012

Page 2: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION

Confidential and Proprietary © 2012 LucidWorks

User Interactions With Big Data

[Diagram: three ways users reach data]

• Data → DFS → Command Line → System Administrator
• Data → Key Value Store → Query Language → Engineer
• Data → Index → Keyword Search → End User

Page 3: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION


Search, Discovery and Analytics

Is Search Enough?

• Keyword search is a commodity

• Holistic view of the data and the user interactions with that data

• Search, Discovery and Analytics are the key to unlocking this view of users and data

[Search box example query: endeavour shuttle bay area]

Page 4: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION


Why Search, Discovery and Analytics?

• User Needs
- real-time, ad hoc access to content
- aggressive prioritization based on importance
- serendipity
- feedback/learning from past

• Business Needs
- deeper insight into users
- leverage existing internal knowledge
- cost effective

[Diagram labels: Search; Discovery; Analytics]

Page 5: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION


Topics

• Background and needs
• Architecture
• Search, Discovery and Analytics in action
• Road map
• Wrap up

Page 6: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION


Search

• Performance
• Real time
• Relevance and importance
• Presenting results
• Experiment management

Page 7: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION


Discovery

• Content clustering
• Discovering near duplicate documents
• Finding 'dark data'
• Making recommendations
• Uncovering trends
• Recognizing topics
• More like this
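One of the discovery tasks listed above, finding near duplicate documents, can be illustrated with word shingles and Jaccard similarity. This is a minimal single-machine sketch, not the method used in the talk or in LucidWorks Big Data; the function names, the shingle size `k` and the `threshold` value are assumptions chosen for illustration.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles in a text (assumed tokenization: whitespace)."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts become a single shingle so they can still be compared.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, threshold=0.5, k=3):
    """Return pairs of document ids whose shingle sets overlap at or above threshold.

    docs is a dict mapping a document id to its text (hypothetical input shape).
    """
    sets = {doc_id: shingles(text, k) for doc_id, text in docs.items()}
    ids = sorted(sets)
    pairs = []
    for i, left in enumerate(ids):
        for right in ids[i + 1:]:
            if jaccard(sets[left], sets[right]) >= threshold:
                pairs.append((left, right))
    return pairs
```

The pairwise comparison is quadratic in the number of documents, so at the scale discussed in this talk one would likely replace it with a hashing scheme or a distributed job; the sketch only shows the similarity measure itself.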

Page 8: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION


Analytics

• Term frequency
• Facets
• Click analysis
• Relevancy metrics
• Zero results queries
• Hot spots
• Statistically interesting phrases
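Two of the analytics above, term frequency and zero results queries, can be computed from a query log in a few lines. The sketch below assumes a hypothetical log made of `(query, num_results)` tuples; the function name and input shape are illustrative and do not come from the talk.

```python
from collections import Counter


def query_log_stats(log):
    """Count query terms and zero-result queries.

    log is an iterable of (query, num_results) tuples (assumed format).
    Returns (term_counts, zero_result_queries) as two Counters.
    """
    term_counts = Counter()
    zero_result_queries = Counter()
    for query, num_results in log:
        # Normalize case and whitespace so trivially different queries match.
        normalized = " ".join(query.lower().split())
        term_counts.update(normalized.split())
        if num_results == 0:
            zero_result_queries[normalized] += 1
    return term_counts, zero_result_queries
```

Calling `zero_result_queries.most_common(10)` on the second return value would list the queries that most often fail to return anything, which is typically where content gaps or missing synonyms show up.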

Page 9: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION


Some Use Cases

• Video streaming
- classification
- recommendations

• Financial, transportation, telecommunications
- fraud detection

• Social media
- trend monitoring

• Information technology
- log monitoring

• Healthcare
- identifying patients for clinical studies

Page 10: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION


In Focus: Personalized Medicine

[Diagram labels: Genetic Variations; Patient DNA; Alignment and other analysis; Search and Faceting; Standard Therapies; Alternative Therapies]

Page 11: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION


In Focus: Log Processing in Telecommunications

• Each year, large sums of money are lost due to fraudulent calls and poor service

• Logs are usually semi-structured and contain vital information about errors and fraud

• Deeper batch analytics can provide insight into patterns across vast amounts of data

• Search of call and network information (via logs) is critical to providing deeper analysis and understanding of these errors and fraudulent activities
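To make semi-structured logs searchable, each line is usually turned into a document with named fields before indexing. The sketch below assumes a hypothetical log format (a timestamp, an event type, then `key=value` pairs); the real call and network log formats are not given in the slides, so the field names here are illustrative only.

```python
import re

# Assumed line shape: "<timestamp> <event> key=value key=value ..."
LINE = re.compile(r"^(?P<timestamp>\S+)\s+(?P<event>\S+)(?:\s+(?P<rest>.*))?$")
PAIR = re.compile(r"(\w+)=(\S+)")


def parse_log_line(line):
    """Parse one log line into a flat dict of string fields, or None if it does not match."""
    match = LINE.match(line.strip())
    if match is None:
        return None
    doc = {"timestamp": match.group("timestamp"), "event": match.group("event")}
    # Every key=value pair after the event type becomes its own field.
    doc.update(dict(PAIR.findall(match.group("rest") or "")))
    return doc
```

A line such as `2012-09-25T10:15:00 CALL caller=14155550100 duration=125 status=DROPPED` would yield a dict with `timestamp`, `event`, `caller`, `duration` and `status` fields, which could then be sent to a search index for faceting on fields like `status`.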

Page 12: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION


What Does a Search, Discovery and Analytics Platform Need?

• Fast, efficient, scalable search
- bulk and near real time indexing
- handle billions of records with sub-second search and faceting

• Large scale, cost effective storage and processing capabilities
- need whole data consumption and analysis
- experimentation/sampling tools

• NLP and machine learning tools that scale to enhance discovery and analysis
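Search with faceting, as required above, can be requested from Solr through its standard query parameters (`q`, `rows`, `wt`, `facet`, `facet.field`). The sketch below only builds the request URL; the host, port and collection name in the usage note are assumptions, and the helper name is hypothetical.

```python
from urllib.parse import urlencode


def build_facet_query(base_url, query, facet_fields, rows=10):
    """Build a Solr select URL that asks for results plus facet counts.

    base_url is the select handler of a Solr core or collection (assumed to exist).
    """
    params = [
        ("q", query),
        ("rows", rows),
        ("wt", "json"),
        ("facet", "true"),
    ]
    # facet.field may be repeated, once per field to facet on.
    params.extend(("facet.field", field) for field in facet_fields)
    return base_url + "?" + urlencode(params)
```

For example, `build_facet_query("http://localhost:8983/solr/collection1/select", "status:DROPPED", ["status", "region"])` produces a URL that could be fetched with any HTTP client; the field names `status` and `region` are placeholders.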

Page 13: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION


Building a Search, Discovery and Analytics Platform

[Architecture diagram labels: Inputs; API; Management; Search, Discovery, Analytics; Processing & Storage; Provisioning, Monitoring & Configuration; Bulk & Real Time]

Page 14: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION


LucidWorks Big Data

[Architecture diagram labels: Inputs; API; Provisioning, Monitoring & Configuration; Management; Search, Discovery, Analytics; Processing & Storage]

Page 15: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION


LucidWorks Big Data

[Architecture diagram labels: Inputs; Processing & Storage; API; Provisioning, Monitoring & Configuration; Management; Search, Discovery, Analytics]

Page 16: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION


LucidWorks Big Data

[Architecture diagram labels: Inputs; Search, Discovery, Analytics; Processing & Storage; Analytics Service; Document Service; API; Provisioning, Monitoring & Configuration; Management]

Page 17: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION


LucidWorks Big Data

[Architecture diagram labels: Inputs; Mgmt; Search, Discovery, Analytics; Processing & Storage; Analytics Service; Document Service; Admin; Service Mgmt; Data Mgmt; API; Provisioning, Monitoring & Configuration]

Page 18: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION


LucidWorks Big Data

[Architecture diagram labels: Inputs; Mgmt; Search, Discovery, Analytics; Processing & Storage; Provisioning, Monitoring & Configuration; Analytics Service; Document Service; Admin; Service Mgmt; Data Mgmt; API]

Page 19: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION


LucidWorks Big Data

[Architecture diagram labels: Inputs; API (Big Data, LucidWorks, WebHDFS); Mgmt; Search, Discovery, Analytics; Processing & Storage; Analytics Service; Document Service; Admin; Service Mgmt; Data Mgmt; Provisioning, Monitoring & Configuration]

Page 20: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION


Components – LucidWorks Search

Component: LucidWorks Search (2.1.1)
• connector framework
• security
• user click framework
• business process integration
• administration
Benefit: Lucene/Solr 4.0-dev, sharded with SolrCloud, near-real time indexing, transaction logs for recovery.

Page 21: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION


Components - Hadoop

Component: Apache Hadoop (1.0.3)
Benefit: Distributed computing and processing for ETL and analytics jobs.

Component: Apache HBase (0.92)
Benefit: Key-value store allowing fast access to the data.

Component: Apache Oozie (modified 3.2)
Benefit: Workflow orchestration.

Page 22: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION


Components - Analysis/ML/NLP

Component: Apache Mahout (trunk)
• k-means clustering
• statistically interesting phrases
• similar documents
• classification
Benefit: Distributed machine learning processing framework.

Component: Apache UIMA (2.4.0)
Benefit: Text processing and annotations.

Component: Apache OpenNLP (1.5.2)
• named entity extraction
Benefit: Machine learning toolkit for natural language processing.

Component: Behemoth (modified trunk)
Benefit: Simplifies M/R data extraction and abstracts annotation frameworks.

Component: Apache Pig (0.9.2)
• ETL
• log analysis
Benefit: Helps with writing analytics M/R programs.
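The k-means clustering listed for Mahout above runs as a distributed job in that framework. As a rough single-machine illustration of the algorithm only, not of Mahout's implementation or API, the assignment and update steps can be sketched as follows; the function name, the fixed iteration cap and the caller-supplied starting centroids are assumptions.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on tuples of numbers, starting from the given centroids.

    Returns (centroids, clusters), where clusters[i] holds the points
    assigned to centroids[i] in the last assignment step.
    """
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid
        # (squared Euclidean distance).
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centroids[i])),
            )
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centroids.append(old)
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters
```

In a document clustering setting the points would be term-weight vectors rather than the small tuples used here, and the choice of starting centroids strongly affects the result.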

Page 23: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION


Components - Middleware

Component: Apache ZooKeeper (3.4.3)
• Netflix Curator
Benefit: Service discovery.

Component: Apache Kafka (0.7)
Benefit: Log consumption and event-based real-time document processing framework.

Page 24: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION


Components - SDA Engine

• RESTful services (Restlet 2.1)
• ZooKeeper + Netflix Curator
• Authentication and authorization
• Proxies for LucidWorks and WebHDFS API
• Workflow engine

Page 25: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION


Road Map

• Analytics themes
- relevance
- data quality
- discovery
- integration with other packages (R)

• Machine learning
- NLP
- recommendations

• Experiment management

Page 26: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION


Conclusions

• Search, Discovery and Analytics, when combined into a single, integrated system, provide powerful insight into both your content and your users

• LucidWorks has combined many of these things into LucidWorks Big Data


Page 27: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION


LucidWorks Big Data

• Unified development platform for Big Data applications
• Integrated open source stack: Lucene/Solr, Hadoop, Mahout, NLP
• Single, uniform REST API
• Pre-tuned by open source industry experts
• Out of the box provisioning
- hosted or on premise

Page 28: DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION


www.lucidworks.com/bigdata

[email protected]

@iprovalov

Search | Discover | Analyze
