HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster

HBase Coprocessor to Index Columns into ElasticSearch Cluster Dibyendu Bhattacharya Architect – Big Data Analytics HappiestMinds

Upload: cloudera-inc

Post on 20-Aug-2015


TRANSCRIPT

Page 1: HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster

HBase Coprocessor to Index Columns into ElasticSearch Cluster

Dibyendu Bhattacharya
Architect – Big Data Analytics
HappiestMinds

Page 2

About HappiestMinds

• Next-generation IT consultancy launched in August 2011. Head office in Bangalore, India, with offices in the USA, UK, Canada, Australia and Singapore. Core focus on disruptive technologies such as Big Data/Analytics, Cloud, Mobile and Social.

• Raised USD 45M in Series A funding from prominent VCs (Intel Capital, Canaan Partners) and the founders.

• 45+ clients globally, 800+ employees.

About Myself:

Dibyendu is a Big Data Architect at HappiestMinds, where he architects and develops solutions on a Hadoop-based analytics and search platform. In the past few years, he has worked on complex data analytics projects that use Hadoop, HBase, and real-time analytics. Before HappiestMinds, he worked at EMC, FairIsaac, Cisco, and IBM, among others.

Page 3

This Presentation…

…explores the design and challenges HappiestMinds faced while implementing a storage and search infrastructure for a library procurement system, in which records for books, documents and artifacts are stored in Apache HBase. After a bulk insert of book records into HBase, the Elasticsearch index is built offline using MapReduce, but in certain use cases the records need to be re-indexed in Elasticsearch using RegionObserver coprocessors.

Page 4

Storing and Indexing Book Records from Publishers and Libraries

Diagram components: Publisher/Library Data, HDFS, HBase Cluster, ElasticSearch Cluster, User Search, User Update (5a, 5b).

1. Data Pre-Processing
• Data ingestion to Hadoop.

2. Data Loading: MapReduce
• Bulk data upload to HBase table.

3. Data Indexing: MapReduce
• Incremental data indexing to ElasticSearch.
• Part of the document is indexed.

4. User Search
• User searches data.
• Search engine displays results.
• Full data access requests fetch from HBase.

5. User Update
• User updates HBase record.
• Update will propagate to the search cluster.

Page 5

HBase Write Path

Page 6

HBase Write Path

Page 7

HBase Storage Layout

Region Server

………………….…….

Page 8

HBase Put Request

Page 9

Here Come the Coprocessors

The idea of HBase coprocessors was inspired by Google's Bigtable coprocessors.

• HBase coprocessors are an addition to the data-manipulation toolset, introduced as a feature in the HBase 0.92.0 release.

• With the introduction of coprocessors, we can push arbitrary computation out to the HBase nodes hosting data.

• Coprocessors can be loaded globally on all tables and regions hosted by the region server, or the administrator can specify which coprocessors should be loaded on all regions for a table on a per-table basis.

Page 10

Coprocessor Classes and Interfaces

The Coprocessor interface

• All user code must implement this interface

The CoprocessorEnvironment interface

• Retains state across invocations

The CoprocessorHost class

• Ties together the state and the user code

Page 11

Observer Coprocessors

Two types of coprocessor:
• Observers, which are like triggers in conventional databases.
• Endpoints, dynamic RPC endpoints that resemble stored procedures.

Observer coprocessors: callback functions/hooks for every explicit API method

• MasterObserver: hooks into the HMaster API

• RegionObserver: hooks into region-related operations

• WALObserver: hooks into write-ahead log operations

Page 12

RegionObserver Coprocessor … Put ( )

RegionObserver: Provides hooks for data manipulation events, Get, Put, Delete, Scan, and so on. There is an instance of a RegionObserver coprocessor for every table region and the scope of the observations they can make is constrained to that region.
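The trigger-like behaviour of an observer can be illustrated without HBase on the classpath. The sketch below is a plain-Java model only: `ToyRegion` and `PutObserver` are hypothetical names, not HBase classes, and a real coprocessor would instead implement the RegionObserver hooks of the HBase version in use. It shows a callback that fires after a put has been applied to one region, which is where indexing code would be placed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a single region. NOT the HBase API.
class ToyRegion {
    /** Callback invoked after a put has been applied, similar in spirit to postPut. */
    interface PutObserver {
        void postPut(String rowKey, Map<String, String> columns);
    }

    private final Map<String, Map<String, String>> rows = new TreeMap<>();
    private final List<PutObserver> observers = new ArrayList<>();

    /** Attach an observer; its scope is limited to this one region. */
    void register(PutObserver observer) {
        observers.add(observer);
    }

    /** Apply the write first, then notify every registered observer. */
    void put(String rowKey, Map<String, String> columns) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).putAll(columns);
        for (PutObserver observer : observers) {
            observer.postPut(rowKey, columns);
        }
    }

    Map<String, String> get(String rowKey) {
        return rows.get(rowKey);
    }
}
```

An observer registered on the region sees every row written to it, and only rows written to that region.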

Page 13

RegionObserver Coprocessor ... Get ( )

Page 14

Let Us See What ElasticSearch Is

Page 15

Distributed Search Engine : ElasticSearch

• Distributed
• Highly available
• REST-based search engine
• Designed to speak JSON (JSON in, JSON out)
• Built on top of Lucene

For each index you can specify:

• Number of shards: each index has a fixed number of shards.

• Number of replicas: each shard can have zero or more replicas, and this can be changed dynamically.
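One way to see why the shard count is fixed: ElasticSearch routes a document to a primary shard by hashing its routing value (the document id by default) modulo the number of primary shards. The sketch below is an illustration under stated assumptions: `ShardRouting` is a hypothetical class, and `String.hashCode()` stands in for the hash function ElasticSearch actually uses.

```java
// Hypothetical illustration of hash-modulo routing. NOT ElasticSearch code.
class ShardRouting {
    /**
     * Returns the primary shard for a document id.
     * String.hashCode() is a stand-in for the real routing hash.
     */
    static int shardFor(String docId, int numberOfPrimaryShards) {
        // floorMod keeps the result in [0, numberOfPrimaryShards) even for negative hashes.
        return Math.floorMod(docId.hashCode(), numberOfPrimaryShards);
    }
}
```

Because the shard number depends on the divisor, changing the number of primary shards would send existing ids to different shards, whereas replicas can be added or removed without affecting routing.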

Page 16

ElasticSearch: Automatic Discovery

The Discovery module is responsible for discovering nodes within the cluster, as well as electing a master node. The master node maintains the global cluster state and reacts when nodes join or leave the cluster by reassigning shards.

Page 17

ElasticSearch : Talking to Cluster

Page 18

ElasticSearch : Nodes are Different

Page 19

The idea is to perform Indexing into ElasticSearch from HBase Coprocessors…..

Page 20

We need a Java Client…

Use ElasticSearch Transport Client : The Transport Client connects remotely to an ElasticSearch cluster. It does not join the cluster, but simply gets one or more initial transport addresses and communicates with them in round robin fashion on each action (though most actions will probably be “two hop” operations).
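The round-robin behaviour described above can be sketched in a few lines of plain Java. This is a model of the selection policy only, not the Transport Client itself; `RoundRobinAddresses` and the host names used with it are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of round-robin address selection. NOT the ElasticSearch client.
class RoundRobinAddresses {
    private final List<String> addresses;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinAddresses(List<String> addresses) {
        if (addresses.isEmpty()) {
            throw new IllegalArgumentException("at least one transport address is required");
        }
        this.addresses = new ArrayList<>(addresses); // defensive copy
    }

    /** Returns the address to use for the next action, cycling through the list. */
    String pick() {
        // floorMod keeps the index valid after the counter overflows.
        int index = Math.floorMod(next.getAndIncrement(), addresses.size());
        return addresses.get(index);
    }
}
```

The node that receives the request is chosen without regard to where the target shard lives, which is why most actions need a second hop from the receiving node to the node holding the shard.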

Page 21

And Index with Transport Client…

Page 22

But this approach has a problem..

• The client has no knowledge of the ElasticSearch cluster.

• Two-hop indexing.
• No fault-tolerance mechanism if the transport address is down.
• An HBase Region Server can host hundreds of regions, and hence hundreds of transport clients.

Solution

• Use the ElasticSearch Node Client. A client node does not hold index data but has knowledge of the complete cluster.

• Use HBASE-6505 to share Node Client across Regions in a RegionServer.

Page 23

HBASE-6505

RegionCoprocessorEnvironment provides a getSharedData() method, which returns a ConcurrentMap, which is held by the RegionCoprocessorHost as a weak reference (in a special map with strongly referenced keys and weakly referenced values), and held strongly by the RegionEnvironment.

That way, if the coprocessor is blacklisted, the coprocessor's environment is removed and any shared data is immediately available for garbage collection. This shared data is per RegionServer. As long as at least one region observer or endpoint is active, the shared data is not garbage collected and can be used to share state between the remaining coprocessors of the same class.
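The sharing pattern this enables can be sketched with a plain ConcurrentMap. The code below is a model under stated assumptions, not HBase code: `SharedClientDemo`, `ToyNodeClient` and the map key are hypothetical, a static map stands in for the map returned by getSharedData(), and the weak-reference handling described above is not modelled.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of sharing one client across all regions of a RegionServer.
class SharedClientDemo {
    /** Placeholder for an expensive-to-create ElasticSearch Node Client. */
    static final class ToyNodeClient {
        static final AtomicInteger CREATED = new AtomicInteger();

        ToyNodeClient() {
            CREATED.incrementAndGet(); // count constructions to show sharing
        }
    }

    /** Stand-in for the ConcurrentMap that getSharedData() returns. */
    static final ConcurrentMap<String, Object> SHARED_DATA = new ConcurrentHashMap<>();

    /** What each region's coprocessor instance would do when it starts. */
    static ToyNodeClient clientForRegion() {
        // computeIfAbsent on ConcurrentHashMap is atomic: the client is built at most once.
        return (ToyNodeClient) SHARED_DATA.computeIfAbsent("es.node.client", key -> new ToyNodeClient());
    }
}
```

However many regions call `clientForRegion()`, they all receive the same instance, so a RegionServer hosting hundreds of regions still holds a single client.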

Page 24

Shared Node Client across Regions

Page 25

Shared Node Client across Regions

Page 26

The Final Problem….

Concurrency Control …

HBase solves it using MVCC (Multi-Version Concurrency Control):

Updates are implemented not by deleting an old piece of data and overwriting it with a new one, but by marking the old data as obsolete and adding a newer version.

ElasticSearch uses OCC (Optimistic Concurrency Control):

OCC assumes that multiple transactions can complete without affecting each other, and that transactions can therefore proceed without locking the data resources they affect. Before committing, each transaction verifies that no other transaction has modified its data. If the check reveals conflicting modifications, the committing transaction rolls back.
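The optimistic check can be illustrated with a small versioned store. This is a generic sketch of the OCC idea, not ElasticSearch code; `OptimisticStore` and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory store illustrating optimistic concurrency control.
class OptimisticStore {
    /** An immutable document snapshot together with the version it was read at. */
    static final class Versioned {
        final long version;
        final String value;

        Versioned(long version, String value) {
            this.version = version;
            this.value = value;
        }
    }

    private final Map<String, Versioned> docs = new HashMap<>();

    synchronized void create(String id, String value) {
        docs.put(id, new Versioned(1, value));
    }

    synchronized Versioned read(String id) {
        return docs.get(id);
    }

    /**
     * Commits only if nobody has changed the document since expectedVersion was read.
     * The synchronized section only makes compare-and-write atomic; no lock is held
     * between a client's read and its later update.
     */
    synchronized boolean update(String id, long expectedVersion, String newValue) {
        Versioned current = docs.get(id);
        if (current == null || current.version != expectedVersion) {
            return false; // conflict: the caller must re-read and retry
        }
        docs.put(id, new Versioned(expectedVersion + 1, newValue));
        return true;
    }
}
```

If two clients read the same version and both try to update, the first commit succeeds and bumps the version, and the second is rejected because its expected version is stale.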

Page 27

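Let's See a Conflict: Search and Update

[Diagram, two panels, each showing HBase and ES with two clients, C1 and C2. First panel: C1 and C2 both hold V1; ES holds V1 (M/R). Second panel: C1 and C2 both hold V1; labels V2 (Update success), V2 (CP), V1 (M/R) and Conflict. M/R presumably denotes the MapReduce indexing path and CP the coprocessor path.]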

Page 28

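One More Conflict: Search and Update

[Diagram, two panels, each showing HBase and ES with two clients, C1 and C2. First panel: C1 and C2 both hold V1; the label V1 (M/R) appears twice. Second panel: C1 and C2 both hold V1; labels V2 (M/R) and Conflict (twice).]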

Page 29

The bottom line is:

A search-and-update should succeed only when the ElasticSearch version and the HBase version are the same at the time of the update.

Page 30

Solution..

1. The data load from source to HBase inserts a document with a Put call.

2. The postPut coprocessor performs incrementColumnValue on a version column.

………………………

………………………

Page 31

Solution..

3. The same version number is propagated to ElasticSearch during MapReduce-based bulk indexing. ElasticSearch supports version numbers supplied externally.

4. Steps 1-3 repeat for any new data upload.

5. During search and update, the client performs a checkAndPut() call.

5i. The client performs a search and gets the version number from ElasticSearch.
5ii. The client constructs a Put with new version number = old version + 1.
5iii. The client performs checkAndPut, checking for the old version number before doing the Put.
5iv. The postCheckAndPut coprocessor is invoked to propagate the successful Put to the search cluster.
5v. After this step, the version number in the HBase column and the ElasticSearch version are equal.
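The whole flow can be modelled in plain Java to show why the second of two concurrent updaters is rejected. This is a sketch under stated assumptions, not the presented implementation: `VersionedIndexingModel` and all its members are hypothetical, in-memory maps stand in for the HBase table and the ElasticSearch index, and the external-version rule is modelled as "accept only a strictly higher version".

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of steps 1-5. NOT HBase or ElasticSearch code.
class VersionedIndexingModel {
    private final Map<String, Long> hbaseVersion = new HashMap<>(); // version column
    private final Map<String, String> hbaseDoc = new HashMap<>();   // document column
    private final Map<String, Long> esVersion = new HashMap<>();    // external version
    private final Map<String, String> esDoc = new HashMap<>();      // indexed document

    /** Steps 1-3: Put, bump the version column (postPut), index with the same version. */
    synchronized void bulkLoad(String row, String doc) {
        hbaseDoc.put(row, doc);
        long version = hbaseVersion.merge(row, 1L, Long::sum); // models incrementColumnValue
        indexExternal(row, doc, version);                      // models MapReduce indexing
    }

    /** Step 5i: the client learns the current version from the search result. */
    synchronized long searchVersion(String row) {
        return esVersion.get(row);
    }

    /** Steps 5ii-5iv: write only if the stored version still equals the one searched. */
    synchronized boolean checkAndPut(String row, long oldVersion, String newDoc) {
        Long current = hbaseVersion.get(row);
        if (current == null || current != oldVersion) {
            return false; // someone else updated the row after this client searched
        }
        hbaseDoc.put(row, newDoc);
        hbaseVersion.put(row, oldVersion + 1);
        indexExternal(row, newDoc, oldVersion + 1); // models postCheckAndPut propagation
        return true;
    }

    /** Externally versioned indexing: only a strictly higher version is accepted. */
    private boolean indexExternal(String row, String doc, long version) {
        Long stored = esVersion.get(row);
        if (stored != null && version <= stored) {
            return false;
        }
        esVersion.put(row, version);
        esDoc.put(row, doc);
        return true;
    }

    synchronized long storedVersion(String row) { return hbaseVersion.get(row); }

    synchronized String indexedDoc(String row) { return esDoc.get(row); }
}
```

When two clients search at version 1 and both call `checkAndPut`, only the first write passes the check, and afterwards both stores report version 2, matching step 5v.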

Page 32

Solution..

……………………………….