
Upload: oleksiy-kovyrin

Post on 21-Nov-2014


Search Analytics with Flume & HBase

Otis Gospodnetić, Sematext International

Copyright 2010 Sematext Int'l. All rights reserved.


Agenda

Who I am
What
Why
How
Architecture Evolution
Role of Flume and HBase + Flume HBase Sink
Challenges


About Otis Gospodnetić

Lucene/Solr/Nutch/Mahout committer
Lucene in Action 1 & 2 co-author
Lucene Consulting since 2005
Sematext Int'l since 2007


About Sematext

Consulting, development, support for:
  Big Data (Hadoop, HBase, Voldemort...)
  Search (Lucene, Solr, Elastic Search...)
  Web Crawling (Nutch)
  Machine Learning (Mahout)


What We Built

Analytics for Search

Numerous reports (e.g. query volume, rate, latency, term frequencies / comparisons, hit buckets, search origins, etc.)
Trending over time
Comparisons of time periods
Top N reports
Various report filters


Report Example


Why We Built it

We need it
  search-hadoop.com & search-lucene.com (subliminal msg: go use this site)
Search customers need it
  Want to know what their visitors are searching for
  Want to know how their search is behaving


How We Built it

JavaScript Beacons
Metric Capture Web App
Data Capture Mechanisms
  Custom Log4J Appender
  Flume Agents, Collectors, Sinks
HBase
MapReduce Aggregations
Search Analytics Reporting Web App
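To make the MapReduce aggregation step more concrete, here is a minimal in-memory sketch of the map and reduce phases. It is not the actual Sematext job: the event layout, the hourly bucketing, and all names (RAW_EVENTS, map_event, reduce_values, aggregate) are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical raw search events: (ISO timestamp, query string, latency in ms).
RAW_EVENTS = [
    ("2010-10-12T09:15:02", "hbase sink", 120),
    ("2010-10-12T09:47:30", "flume", 80),
    ("2010-10-12T09:59:59", "hbase sink", 100),
    ("2010-10-12T10:01:00", "flume", 60),
]


def map_event(event):
    """Map phase: emit ((hour, query), (count, latency)) for one raw event."""
    timestamp, query, latency_ms = event
    hour = timestamp[:13]  # e.g. "2010-10-12T09"
    yield (hour, query), (1, latency_ms)


def reduce_values(values):
    """Reduce phase: combine all values that share one (hour, query) key."""
    count = sum(c for c, _ in values)
    total_latency = sum(ms for _, ms in values)
    return {"count": count, "avg_latency_ms": total_latency / count}


def aggregate(events):
    """Group mapped pairs by key (the shuffle step), then reduce each group."""
    grouped = defaultdict(list)
    for event in events:
        for key, value in map_event(event):
            grouped[key].append(value)
    return {key: reduce_values(values) for key, values in grouped.items()}


HOURLY = aggregate(RAW_EVENTS)
```

In a real deployment the input would come from the raw table and the output would be written to an aggregate table; here both are plain Python structures.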


What's Flume

Distributed data/log collection service
Scalable, configurable, extensible
Centrally manageable, open source
Agents get data from app, Collectors save it
Abstractions: Source -> Decorator(s) -> Sink
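The Source, Decorator(s), Sink abstraction can be sketched as a small pipeline. This is a conceptual illustration only, not Flume's API: list_source, add_host, ListSink and run_pipeline are hypothetical names, and events are modeled as plain dictionaries.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Event = Dict[str, str]


def list_source(lines: Iterable[str]) -> Iterator[Event]:
    """Source: turns incoming lines into events."""
    for line in lines:
        yield {"body": line}


def add_host(host: str) -> Callable[[Event], Event]:
    """Decorator factory: returns a function that tags each event with a host."""
    def decorate(event: Event) -> Event:
        return {**event, "host": host}
    return decorate


class ListSink:
    """Sink: the final destination; here it just collects events in memory."""

    def __init__(self) -> None:
        self.events: List[Event] = []

    def append(self, event: Event) -> None:
        self.events.append(event)


def run_pipeline(source, decorators, sink) -> None:
    """Pull each event from the source, apply decorators in order, hand to sink."""
    for event in source:
        for decorate in decorators:
            event = decorate(event)
        sink.append(event)


sink = ListSink()
run_pipeline(list_source(["q=flume", "q=hbase"]), [add_host("web1")], sink)
```

The point of the design is that sources, decorators and sinks can be swapped independently, for example replacing the in-memory sink with one that writes to HBase.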


What's HBase

Scalable, reliable, distributed, column-oriented DB
On top of HDFS
MapReducable


High Level Architecture


Architecture #1


Architecture #1 - Getting Messy


Arch #2 - HBaseLog4JAppender


HBaseLog4JAppender Cons

Doesn't help with reliable delivery
  e.g. when network or HBase down
Non-centralized config with larger clusters
  e.g. changing destination table in HBase
  e.g. changing sampling rate


Architecture #3 - Flume OOTB


Arch #4 - Flume HBase Sink


FLUME-247 Flume HBase sink

Contributed by Sematext in September 2010
Reviewed, pending commit
Similar to FLUME-6 (basic example), but more flexible
https://issues.cloudera.org/browse/FLUME-247


Walk-Through

Start EC2 micro instance, configure logs-generation tool to simulate user actions
User actions start getting logged to a log file
Configure Flume Agent to "tail" the generated logs and send data to Flume Collector
Collector processes log messages and sends them to HBase's "raw logs" table
Later these logs are processed by the MapReduce job

Search Action -> Metric Capture -> Log File -> Flume Agent -> Flume Collector -> Decorators -> HBase Sink -> HBase

Decorator: processes Flume Collector log events and prepares them for HBase
HBase sink: FLUME-247
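As a rough idea of what "prepares them for HBase" might involve, the sketch below turns one log line into a row key plus column values. The log line format, the row key layout (site, then timestamp) and the "search:" column names are all assumptions for illustration; the slides do not specify them, and this is not the FLUME-247 code.

```python
from urllib.parse import unquote_plus


def to_hbase_put(line: str) -> dict:
    """Parse a hypothetical 'timestamp key=value ...' log line into a put-like dict."""
    timestamp, *pairs = line.split()
    fields = dict(pair.split("=", 1) for pair in pairs)

    # Assumed row key: site first, then timestamp, so rows for one site
    # sort together in time order.
    row_key = f"{fields['site']}|{timestamp}"

    # Assumed column family "search" with one qualifier per captured metric.
    columns = {
        "search:query": unquote_plus(fields["q"]),
        "search:hits": fields["hits"],
        "search:latency_ms": fields["ms"],
    }
    return {"row": row_key, "columns": columns}


put = to_hbase_put(
    "2010-10-12T09:15:02 site=search-hadoop.com q=hbase+sink hits=42 ms=120"
)
```

A sink would then write the resulting row key and columns to the "raw logs" table.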


Why Flume

Reliable delivery
  e.g. queue msgs locally if destination unreachable
Easy, centralized management via Web UI or console
Good community, good progress
But: more complex, more moving parts
On Flume: slideshare.net/cloudera/inside-flume
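The "queue msgs locally if destination unreachable" idea can be sketched as follows. This is a simplified illustration, not how Flume implements it: BufferingSender is a hypothetical class, the queue lives in memory rather than on disk, and the destination is any callable that raises ConnectionError when it cannot be reached.

```python
from collections import deque


class BufferingSender:
    """Keeps messages in a local queue until the destination accepts them."""

    def __init__(self, send):
        self._send = send        # callable; raises ConnectionError if unreachable
        self.pending = deque()   # local queue of undelivered messages

    def submit(self, msg):
        """Queue the message, then try to deliver everything queued so far."""
        self.pending.append(msg)
        return self.flush()

    def flush(self):
        """Deliver queued messages in order; stop at the first failure."""
        while self.pending:
            try:
                self._send(self.pending[0])
            except ConnectionError:
                return False     # destination down; keep messages for later
            self.pending.popleft()
        return True


# Usage with a destination that starts out unreachable.
delivered = []
destination_up = {"value": False}


def send(msg):
    if not destination_up["value"]:
        raise ConnectionError("destination unreachable")
    delivered.append(msg)


sender = BufferingSender(send)
sender.submit("search q=flume")
sender.submit("search q=hbase")
queued_while_down = list(sender.pending)

destination_up["value"] = True
flushed = sender.flush()
```

A message is removed from the queue only after the send call returns, so a failure mid-delivery leaves it queued for the next attempt.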


Why HBase

Scalable raw search data storage
MapReduce data input
Scalable aggregate data storage
Fast scans for time ranges, fast key lookups
Easy storage and compute power expansion
Good looking roadmap, community, progress
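Fast time-range scans and key lookups follow from HBase keeping rows sorted by row key. The sketch below imitates that with a sorted list and binary search. SortedTable is a hypothetical stand-in, not an HBase client, and the "site|timestamp" key layout is an assumption.

```python
import bisect


class SortedTable:
    """Toy table that keeps row keys sorted, like rows in an HBase table."""

    def __init__(self):
        self._keys = []   # row keys in sorted order
        self._rows = {}   # row key -> value

    def put(self, key, value):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def get(self, key):
        """Direct lookup of a single row by key."""
        return self._rows.get(key)

    def scan(self, start, stop):
        """Return rows with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


table = SortedTable()
table.put("site1|2010-10-12T09", 5)
table.put("site1|2010-10-12T10", 7)
table.put("site1|2010-10-13T00", 1)
table.put("site2|2010-10-12T09", 9)

# All rows for site1 on 2010-10-12: one contiguous slice of the sorted keys.
one_day = table.scan("site1|2010-10-12", "site1|2010-10-13")
```

Because the timestamp follows the site in the key, a time range for one site is a contiguous block of rows, so the scan never touches unrelated data.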


Challenges

"HBase in a box" is like dynamic equilibrium, or virtual reality, or jumbo shrimp (search-hadoop.com/m/p68C12nb7Hn)
Data size. Solutions:
  Compression (4-5x smaller with lzo)
  Data pruning (variable levels)
Lots of data to process, update, aggregate
Query string distribution: very long-tail
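One way data pruning could interact with a long-tail query distribution is to keep exact counts only for the most frequent queries and fold the rest into a single bucket. The slides do not describe the actual pruning rules, so the function below, its name, and the "(other)" bucket are purely an assumed illustration.

```python
from collections import Counter


def prune_long_tail(counts: Counter, keep: int) -> dict:
    """Keep the `keep` most frequent queries; sum everything else into '(other)'."""
    kept = dict(counts.most_common(keep))
    tail_total = sum(counts.values()) - sum(kept.values())
    if tail_total:
        kept["(other)"] = tail_total
    return kept


# A few popular queries followed by a tail of queries seen only once.
query_counts = Counter({"hbase": 50, "flume": 30, "lzo codec": 1,
                        "jumbo shrimp": 1, "solr facet": 1})
pruned = prune_long_tail(query_counts, keep=2)
```

Varying `keep` by report or by data age would be one way to get different pruning levels, at the cost of losing detail on rare queries.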


Work @ Sematext

We are hiring world-wide!
Search & Data Analytics
Machine Learning & NLP
Biiig Data


Contact

sematext.com
blog.sematext.com
@sematext
@otisg
[email protected]
