A New Practical Design for Browsable Over-the-Network Indexing
DESCRIPTION
Lucene is today's default indexing engine. Lucene, like indexing technologies in general, runs on top of local filesystems and does not treat the throughput of read/write operations as a limited resource. However, with the proliferation of clouds, over-the-network access to data is becoming commonplace. This paper proposes a new design for over-the-network indexing built on the core assumption that read/write throughput has to be optimized. As a separate feature, the proposed design is easily browsable, whereas Lucene-like indexing can only execute search queries. A software implementation of the proposed engine is released as open source.
TRANSCRIPT
The Over-the-Network Problem
M.Zhanikeev -- [email protected] -- A New Practical Design for Browsable Over-the-Network Indexing -- http://bit.do/140428 -- 2/18...
Over-the-Network Problem
[Figure: two setups compared. Traditional Client: Data, Indexer, and Index. Stringex Client: Data and Indexer, with Index reads and writes going over the Network.]
Everything is Over-the-Network
• ... in clouds
• ... inside data centers
• ... in home networks
When running over the network, the biggest problem is that there is a hard physical limit to throughput.
The "Best" Tools Today
The Closest Tools
1. Lucene, running locally only
2. Google Data APIs, which allow for shared control
   ◦ not really indexing, though
3. ... that's pretty much it!
Target Applications
Target Applications
[Figure: the Stringex Client setup, with Data, Indexer, and Index.]
• server-less applications (read: fully distributed)
• large-scale crowdsourcing connected via cloud storage
• distributed storage -- the same problem
• ...
The Stringex Problem
The Stringex Problem
• a very straightforward optimization problem
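\begin{align}
\text{minimize} \quad & w_1 R_{OUT} + w_2 R_{IN} \tag{1} \\
\text{subject to} \quad & \tag{2} \\
& 0 < R_{IN} \le R_{OUT} \le C, \tag{3} \\
& S_{LOCAL} \le M \le S_{REMOTE}, \tag{4} \\
& N_{LOCAL} \le N_{REMOTE} \le N_{USER} \tag{5}
\end{align}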
• R is rate, throughput, etc.
• S is storage size, can be local and remote
• C and M are constants, set by user
• N is the number of files over which the index is split
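The problem above can be made concrete with a small feasibility check and a brute-force choice among candidate settings. This is only a sketch: the slide does not define w1 and w2, so they are assumed here to be user-chosen weights, and the class and function names (`StringexConfig`, `is_feasible`, `objective`, `best`) are hypothetical, not part of the Stringex code.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class StringexConfig:
    """Hypothetical container for the symbols of constraints (3)-(5)."""
    r_in: float      # R_IN
    r_out: float     # R_OUT
    c: float         # C, constant set by user
    s_local: float   # S_LOCAL
    s_remote: float  # S_REMOTE
    m: float         # M, constant set by user
    n_local: int     # N_LOCAL
    n_remote: int    # N_REMOTE
    n_user: int      # N_USER


def is_feasible(cfg: StringexConfig) -> bool:
    """Check constraints (3), (4), and (5) exactly as written on the slide."""
    return (
        0 < cfg.r_in <= cfg.r_out <= cfg.c
        and cfg.s_local <= cfg.m <= cfg.s_remote
        and cfg.n_local <= cfg.n_remote <= cfg.n_user
    )


def objective(cfg: StringexConfig, w1: float, w2: float) -> float:
    """Objective (1); w1 and w2 are assumed to be user-chosen weights."""
    return w1 * cfg.r_out + w2 * cfg.r_in


def best(candidates: Iterable[StringexConfig], w1: float, w2: float) -> Optional[StringexConfig]:
    """Return the feasible candidate with the smallest objective, or None."""
    feasible = [cfg for cfg in candidates if is_feasible(cfg)]
    return min(feasible, key=lambda cfg: objective(cfg, w1, w2), default=None)


# Illustrative numbers only; they are not measurements from the paper.
shared = dict(c=50, s_local=1, s_remote=100, m=10, n_local=4, n_remote=16, n_user=64)
candidates = [
    StringexConfig(r_in=10, r_out=20, **shared),
    StringexConfig(r_in=5, r_out=8, **shared),
    StringexConfig(r_in=1, r_out=60, **shared),  # violates R_OUT <= C
]
chosen = best(candidates, w1=1.0, w2=0.5)
```

A real client would search over index layouts rather than a fixed list, but the same check-then-minimize structure applies.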
Naive Stringex Client
Practical Assumptions
• JSON input; only the top level is indexed, deeper levels are stringified
• several efficiency tricks
  1. split the index into relatively small files
  2. distribute smoothly using random hashing
  3. update parts on timeout -- accumulate multiple intensive updates
  4. create special maps which allow for browsing
• JSON aggregations in files: one line is base64(JSON string)
  ◦ if the bzip2 algorithm is within reach, you can have base64(bzip2(JSON string))
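The first and last assumptions above can be sketched in a few lines of Python. The function names (`flatten_top_level`, `encode_line`, `decode_line`) are hypothetical, and the choice to stringify nested values with `json.dumps` is an assumption; the slide only says that deeper levels are stringified.

```python
import base64
import bz2
import json

SCALARS = (str, int, float, bool)


def flatten_top_level(doc: dict) -> dict:
    """Keep top-level keys; stringify any nested value (assumed: via JSON)."""
    return {
        key: value if isinstance(value, SCALARS) or value is None
        else json.dumps(value, sort_keys=True)
        for key, value in doc.items()
    }


def encode_line(doc: dict, compress: bool = False) -> str:
    """One file line: base64(JSON string), or base64(bzip2(JSON string))."""
    raw = json.dumps(doc, sort_keys=True).encode("utf-8")
    if compress:
        raw = bz2.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def decode_line(line: str, compressed: bool = False) -> dict:
    """Inverse of encode_line; the caller must know whether bzip2 was used."""
    raw = base64.b64decode(line)
    if compressed:
        raw = bz2.decompress(raw)
    return json.loads(raw.decode("utf-8"))


doc = {"name": "alice", "age": 30, "tags": {"a": [1, 2]}}
flat = flatten_top_level(doc)
line = encode_line(flat)                   # plain base64
packed = encode_line(flat, compress=True)  # base64 over bzip2
```

Because base64 output contains no newline characters, each encoded document occupies exactly one line, which is what makes line-based aggregation in a single file safe.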
Naive Client: Data Structure
INPUT JSON: { name: value1, age: value2, … }
Files
  Index (one set of files per key)
    Key: name
      name.imap   { 'bk': { 'ik': 'start,end', … next 'ik' }, … next bk }
      name.vmap   { 'value': 'bk', … next value }
      name.bk1, name.bk2, …
    Key: age
      same .imap, same .vmap, …
  Data
    Docs
      docs.imap   { 'bk': { 'docid': 'start,end', … next 'docid' }, … next bk }
      no .vmap
      docs.bk1, docs.bk2, …
• meta is separate from data
• smart maps, which let the client read/write sections of files
  ◦ specifically for the chunk* API in Dropbox
• filenames are the first 2-3 symbols of MD5 hashes
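The layout above can be imitated in memory to show how the maps allow reading only a section of a bucket file. Everything here is a sketch under assumptions: the slide does not say what is hashed, so the MD5 is taken over the indexed value and rendered as hex; `start,end` is treated as a half-open byte range; and the helper names (`bucket_name`, `append_item`, `read_section`) are hypothetical. Plain dicts stand in for the remote files.

```python
import hashlib
from typing import Dict


def bucket_name(value: str, width: int = 2) -> str:
    """First `width` hex symbols of the MD5 hash (assumed: hash of the value)."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()[:width]


def append_item(buckets: Dict[str, bytearray], imap: dict, vmap: dict,
                value: str, item_key: str, line: bytes) -> None:
    """Append one encoded line to its bucket and record where it landed."""
    bk = bucket_name(value)
    data = buckets.setdefault(bk, bytearray())
    start = len(data)
    data += line + b"\n"
    # imap: bucket -> item key -> "start,end" (end assumed exclusive)
    imap.setdefault(bk, {})[item_key] = f"{start},{start + len(line)}"
    # vmap: value -> bucket, so a value can be located without scanning
    vmap[value] = bk


def read_section(bucket_bytes: bytes, imap: dict, bucket: str, item_key: str) -> bytes:
    """Return only the byte range that the imap records for one item."""
    start, end = (int(part) for part in imap[bucket][item_key].split(","))
    return bucket_bytes[start:end]


buckets, imap, vmap = {}, {}, {}
append_item(buckets, imap, vmap, "alice", "doc1", b"AAA")
append_item(buckets, imap, vmap, "alice", "doc2", b"BBBB")
bk = vmap["alice"]
section = read_section(bytes(buckets[bk]), imap, bk, "doc2")
```

Over a real network store, `read_section` would translate the recorded range into a partial read request instead of slicing a local buffer; that part is left out because the exact remote API is not given in the slides.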
Naive Client: Sync Engine Design
[Figure: components are the Stringex Client, the Sync Engine, a Local Cache (an optimization), and the Stringex Index; arrows are labeled 1 Check and 2 Use.]
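The slide shows only component labels, so the following is one possible reading rather than the actual Stringex sync engine: reads check the local cache first and then use the cached copy, and writes accumulate locally until a timeout, which matches the "update parts on timeout" trick listed earlier. The class name, the `fetch`/`push` callbacks, and the `flush_after` parameter are all hypothetical.

```python
import time
from typing import Callable, Dict, Optional, Set


class SyncEngine:
    """Sketch of a cache-first sync engine with timeout-based write batching."""

    def __init__(self, fetch: Callable[[str], bytes],
                 push: Callable[[str, bytes], None],
                 flush_after: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch              # reads one remote file
        self._push = push                # writes one remote file
        self._flush_after = flush_after  # seconds to accumulate updates
        self._clock = clock
        self._cache: Dict[str, bytes] = {}
        self._dirty: Set[str] = set()
        self._first_dirty_at: Optional[float] = None

    def read(self, name: str) -> bytes:
        if name not in self._cache:      # 1: check the local cache
            self._cache[name] = self._fetch(name)
        return self._cache[name]         # 2: use the cached copy

    def write(self, name: str, data: bytes) -> None:
        self._cache[name] = data
        self._dirty.add(name)
        if self._first_dirty_at is None:
            self._first_dirty_at = self._clock()
        self.maybe_flush()

    def maybe_flush(self) -> int:
        """Push dirty files only once the timeout has elapsed."""
        if self._first_dirty_at is None:
            return 0
        if self._clock() - self._first_dirty_at < self._flush_after:
            return 0
        return self.flush()

    def flush(self) -> int:
        """Push every dirty file once, however many times it was updated."""
        for name in sorted(self._dirty):
            self._push(name, self._cache[name])
        pushed = len(self._dirty)
        self._dirty.clear()
        self._first_dirty_at = None
        return pushed
```

Batching this way means that several intensive updates to the same file cost a single upload, which is the point of treating throughput as the limited resource.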
Evaluation
Stringex vs Lucene
[Figure: throughput (log of bytes/doc) versus index size (log), with one curve each for Lucene and Stringex.]
Wrapup
• https://github.com/maratishe/stringex has a JS client
• I also have a PHP client for command-line Stringex
• Stringex is better for browsing because items cluster naturally -- better than Lucene
  ◦ I use it for small browsable summaries of datasets
  ◦ ... and context-based browsable datasets
• many other uses are possible
That’s all, thank you ...