![Page 1: Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Platform" by Abinasha Karana](https://reader036.vdocuments.mx/reader036/viewer/2022062418/5560b794d8b42a033c8b4bc7/html5/thumbnails/1.jpg)
Apache Hadoop India Summit 2011
Searching Information
Inside Hadoop Platform
Abinasha KaranaDirector-Technology
Bizosys Technologies Pvt Ltd.
www.bizosys.com
![Page 2: Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Platform" by Abinasha Karana](https://reader036.vdocuments.mx/reader036/viewer/2022062418/5560b794d8b42a033c8b4bc7/html5/thumbnails/2.jpg)
Apache Hadoop India Summit 2011
To search a large dataset inside HDFS and HBase, At Bizosys we started with Map-Reduce and Lucene/Solr
![Page 3: Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Platform" by Abinasha Karana](https://reader036.vdocuments.mx/reader036/viewer/2022062418/5560b794d8b42a033c8b4bc7/html5/thumbnails/3.jpg)
Apache Hadoop India Summit 2011
Map-reduce
Result not in a mouse clickWhat didn’t work for us
![Page 4: Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Platform" by Abinasha Karana](https://reader036.vdocuments.mx/reader036/viewer/2022062418/5560b794d8b42a033c8b4bc7/html5/thumbnails/4.jpg)
Apache Hadoop India Summit 2011
It required vertical scaling with manual sharding and subsequent resharding as data grew
Lucene/Solr
What didn’t work for us
![Page 5: Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Platform" by Abinasha Karana](https://reader036.vdocuments.mx/reader036/viewer/2022062418/5560b794d8b42a033c8b4bc7/html5/thumbnails/5.jpg)
Apache Hadoop India Summit 2011
We built a new search for
Hadoop Platform HDFS and HBase
What we did
![Page 6: Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Platform" by Abinasha Karana](https://reader036.vdocuments.mx/reader036/viewer/2022062418/5560b794d8b42a033c8b4bc7/html5/thumbnails/6.jpg)
Apache Hadoop India Summit 2011
my learning from designing, developing
and benchmarking a distributed,
real-time search engine whose Index is
stored and served out of HBase
In the next few slides you will hear about
![Page 7: Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Platform" by Abinasha Karana](https://reader036.vdocuments.mx/reader036/viewer/2022062418/5560b794d8b42a033c8b4bc7/html5/thumbnails/7.jpg)
Apache Hadoop India Summit 2011
Key Learning1. Using SSD is a design decision.
2. Methods to reduce HBase table storage size
3. Serving a request without accessing ALL region servers
4. Methods to move processing near the data
5. Byte block caching to lower network and I/O trips to HBase
6. Configuration to balance network vs. CPU vs. I/O vs memory
![Page 8: Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Platform" by Abinasha Karana](https://reader036.vdocuments.mx/reader036/viewer/2022062418/5560b794d8b42a033c8b4bc7/html5/thumbnails/8.jpg)
Apache Hadoop India Summit 2011
1 Using SSD is a design decision
SSD improved HSearch response time by 66% over SATA. However, SSD is costlier.
In HBase Table Schema Design we considered “Data Access Frequency”, “Data Size” and “Desired Response Time” for selective SSD deployment.
![Page 9: Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Platform" by Abinasha Karana](https://reader036.vdocuments.mx/reader036/viewer/2022062418/5560b794d8b42a033c8b4bc7/html5/thumbnails/9.jpg)
Apache Hadoop India Summit 2011
.. Our SSD Friendly Schema Design
Keyword + Document in 1 Table
Keyword: Reads all for a query.
Document: Reads 10 docs / query.
Keyword + Document in 2 Tables
SSD deployment is All or none SSD deployment is only for Keyword Table
![Page 10: Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Platform" by Abinasha Karana](https://reader036.vdocuments.mx/reader036/viewer/2022062418/5560b794d8b42a033c8b4bc7/html5/thumbnails/10.jpg)
Apache Hadoop India Summit 2011
2 Methods to reduceHBase table storage size
Storing a 4 byte cell requires >27bytes in HBase.
Key len
gth
Valu
e leng
th
Ro
w len
gth
Ro
w B
ytes
Fam
ily Len
gth
Fam
ily Bytes
Qu
alifier Bytes
Tim
estamp
Key Typ
e
Valu
e B
ytes
4 BYTES1 BYTE4 BYTES 4 BYTES 2 BYTES BYTES 1 BYTE BYTES 8 BYTESBYTES
![Page 11: Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Platform" by Abinasha Karana](https://reader036.vdocuments.mx/reader036/viewer/2022062418/5560b794d8b42a033c8b4bc7/html5/thumbnails/11.jpg)
Apache Hadoop India Summit 2011
.. to 1/3rd
• Stored large cell values by merging cells• Reduced the Family name to 1 Character• Reduced the Qualifier name to 1 Character
![Page 12: Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Platform" by Abinasha Karana](https://reader036.vdocuments.mx/reader036/viewer/2022062418/5560b794d8b42a033c8b4bc7/html5/thumbnails/12.jpg)
Apache Hadoop India Summit 2011
3Serving a request without
accessing ALL region servers
Consider a 100 node cluster of HBase and a single
search request need to access all of them.
Bad Design.. Clogged Network.. No scaling
![Page 13: Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Platform" by Abinasha Karana](https://reader036.vdocuments.mx/reader036/viewer/2022062418/5560b794d8b42a033c8b4bc7/html5/thumbnails/13.jpg)
Apache Hadoop India Summit 2011
3 And our solution…
0-1 M
1-2 M
2-3 M
3-4 M
4-5 M
Ro
w R
ang
es
Machine 1
Machine 2
Machine 3
Machine 4
Machine 5
Fam
ily A
Fam
ily B
Scan “Family A” - 5 Machines HitScan Table A - 3 Machines Hit
Machine 1
Machine 2
Machine 3 Machine 3
Machine 4
Machine 5
Table A Table B
Index Table was divided on Column-Family as separate tables
![Page 14: Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Platform" by Abinasha Karana](https://reader036.vdocuments.mx/reader036/viewer/2022062418/5560b794d8b42a033c8b4bc7/html5/thumbnails/14.jpg)
Apache Hadoop India Summit 2011
4Methods to move
processing near the data
public class TermFilter implements Filter {public ReturnCode filterKeyValue(KeyValue kv) {
boolean isMatched = isFound(kv);if (isMatched ) return ReturnCode.INCLUDE;return ReturnCode.NEXT_ROW;
}…
• Sent filtered Rows over network.
public class DocFilter implements Filter {public void filterRow(List<KeyValue> kvL) {
byte[] val = extractNeededPiece(kvL);kvL.clear();kvL.add(new KeyValue(row,fam,,val));
}….
• Sent relevant section of a Field over network.
E.g. Matched rows for a keyword
E.g. Computing a best match section from within a document for a given query
• Sent relevant Fields of a Row over network.
![Page 15: Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Platform" by Abinasha Karana](https://reader036.vdocuments.mx/reader036/viewer/2022062418/5560b794d8b42a033c8b4bc7/html5/thumbnails/15.jpg)
Apache Hadoop India Summit 2011
5Byte block caching to lower
network and I/O trips to HBase
• Object caching – With growing number of objects
we encountered ‘Out of Memory’ exception
• HBase commit - Frequent flushing to HBase
introduced network and I/O latencies.
Converting Objects to intermediate Byte Blocks
increased record processing by 20x in 1 batch.
![Page 16: Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Platform" by Abinasha Karana](https://reader036.vdocuments.mx/reader036/viewer/2022062418/5560b794d8b42a033c8b4bc7/html5/thumbnails/16.jpg)
Apache Hadoop India Summit 2011
6Configuration to balance Network vs. CPU vs. I/O vs. Memory
DiskI/O
Network
Memory CPU
Block Caching
IPC Caching
Compression
Compression
Aggressive GC
In a Single Machine
![Page 17: Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Platform" by Abinasha Karana](https://reader036.vdocuments.mx/reader036/viewer/2022062418/5560b794d8b42a033c8b4bc7/html5/thumbnails/17.jpg)
Apache Hadoop India Summit 2011
… and it’s settings
• Network– Increased IPC Cache Limits (hbase.client.scanner.caching)
• CPU– JVM agressive heap
("-server -XX:+UseParallelGC -XX:ParallelGCThreads=4 XX:+AggressiveHeap “)
• I/O– LZO index compression (“Inbuilt oberhumer LZO” or “Intel IPP native LZO”)
• Memory– HBase block caching (hfile.block.cache.size) and overall memory
allocation for data-node and region-server.
![Page 18: Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Platform" by Abinasha Karana](https://reader036.vdocuments.mx/reader036/viewer/2022062418/5560b794d8b42a033c8b4bc7/html5/thumbnails/18.jpg)
Apache Hadoop India Summit 2011
.. and parallelized to multi-machines
•HTable.batch (Get, Put, Deletes)•ParallelHTable (Scans)•FUTURE-coprocessors (hbase 0.92 release).
Allocating appropriate resources dfs.datanode.max.xcievers, hbase.regionserver.handler.count and dfs.datanode.handler.count
![Page 19: Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Platform" by Abinasha Karana](https://reader036.vdocuments.mx/reader036/viewer/2022062418/5560b794d8b42a033c8b4bc7/html5/thumbnails/19.jpg)
Apache Hadoop India Summit 2011
HSearch Benchmarks on AWS
• Amazon Large instance 7.5 GB Memory * 11 Machines
with a single 7.5K SATA drive
• 100 Million Wikipedia pages of total 270GB and
completely indexed (Included common stopwords)
• 10 Million pages repeated 10 times. (Total indexing time
is 5 Hours)
• Search Query Response speed using a – regular word is 1.5 sec
– common word such as “hill” found 1.6 million matches and
sorted in 7 seconds
![Page 20: Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Platform" by Abinasha Karana](https://reader036.vdocuments.mx/reader036/viewer/2022062418/5560b794d8b42a033c8b4bc7/html5/thumbnails/20.jpg)
Apache Hadoop India Summit 2011
www.sourceforge.net/bizosyshsearch
• Apache-licensed (2 versions released)
• Distributed, real-time search
• Supports XML documents with rich search syntax
and various filtration criteria such as document
type, field type.
![Page 21: Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Platform" by Abinasha Karana](https://reader036.vdocuments.mx/reader036/viewer/2022062418/5560b794d8b42a033c8b4bc7/html5/thumbnails/21.jpg)
Apache Hadoop India Summit 2011
References• Initial performance reports (Bizosys HSearch, a Nosql
search engine, featured in Intel Cloud Builders Success Stories (July, 2010)
• HSearch is currently in use at http://www.10screens.com
• More on hsearch– http://www.bizosys.com/blog/– http://bizosyshsearch.sourceforge.net/
• More on SSD Product Technical Specifications http://download.intel.com/design/flash/nand/extreme/extreme-sata-ssd-product-brief.pdf