A New Practical Design for Browsable Over-the-Network Indexing


Upload: marat-zhanikeev

Post on 16-May-2015


DESCRIPTION

Lucene is today's default indexing engine. Lucene, and indexing technologies in general, run on top of local filesystems and do not treat read/write throughput as a limited resource. However, with the proliferation of clouds, over-the-network access to data is becoming commonplace. This paper proposes a new design for over-the-network indexing built on the core assumption that read/write throughput has to be optimized. In addition, the proposed design is easily browsable, whereas Lucene-like indexing can only execute search queries. A software implementation of the proposed engine is released as open source.

TRANSCRIPT

Page 1: A New Practical Design for Browsable Over-the-Network Indexing

The Over-the-Network Problem

M.Zhanikeev -- [email protected] -- A New Practical Design for Browsable Over-the-Network Indexing -- http://bit.do/140428


[Figure: two setups compared. Traditional Client: Data, Indexer, Index, Network. Stringex Client: Data, Indexer, with Index Read/Write going over the network.]


Everything is Over-the-Network

• ... in clouds
• ... inside data centers
• ... in home networks

When running over-the-network, the biggest problem is that there is a hard physical limit to throughput.


The "Best" Tools Today


The Closest Tools

1. Lucene, running locally only

2. Google Data APIs, which allow for shared control
   ◦ not really indexing, though

3. ... that's pretty much it!


Target Applications


[Figure: Stringex Client setup with Data, Indexer, and Index.]

• server-less applications (read: fully distributed)

• large-scale crowdsourcing connected via cloud storage

• distributed storage -- the same problem

• ...


The Stringex Problem


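• a very straightforward optimization problem

$$
\begin{aligned}
\text{minimize}\quad & w_1 R_{\mathrm{OUT}} + w_2 R_{\mathrm{IN}} && (1)\\
\text{subject to}\quad & && (2)\\
& 0 < R_{\mathrm{IN}} \le R_{\mathrm{OUT}} \le C, && (3)\\
& S_{\mathrm{LOCAL}} \le M \le S_{\mathrm{REMOTE}}, && (4)\\
& N_{\mathrm{LOCAL}} \le N_{\mathrm{REMOTE}} \le N_{\mathrm{USER}} && (5)
\end{aligned}
$$

• R is rate, throughput, etc.

• S is storage size, can be local and remote

• C and M are constants, set by user

• N is the number of files over which the index is split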
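To make the formulation concrete, here is a minimal Python sketch that scores candidate configurations against objective (1) and constraints (3)-(5). The `Plan` class, the function names, and the idea of choosing among a handful of pre-measured candidates are illustrative assumptions, not part of Stringex; the slides do not say how the problem is actually solved.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    """One candidate configuration (hypothetical container, not a Stringex type)."""
    r_in: float      # R_IN
    r_out: float     # R_OUT
    s_local: float   # S_LOCAL
    s_remote: float  # S_REMOTE
    n_local: int     # N_LOCAL
    n_remote: int    # N_REMOTE


def cost(plan, w1, w2):
    """Objective (1): w1 * R_OUT + w2 * R_IN."""
    return w1 * plan.r_out + w2 * plan.r_in


def feasible(plan, c, m, n_user):
    """Constraints (3), (4) and (5), checked literally as written on the slide."""
    return (
        0 < plan.r_in <= plan.r_out <= c
        and plan.s_local <= m <= plan.s_remote
        and plan.n_local <= plan.n_remote <= n_user
    )


def best_plan(plans, w1, w2, c, m, n_user):
    """Return the cheapest feasible plan, or None if no plan is feasible."""
    candidates = [p for p in plans if feasible(p, c, m, n_user)]
    if not candidates:
        return None
    return min(candidates, key=lambda p: cost(p, w1, w2))
```

An exhaustive scan like this is only reasonable for a small number of candidates; it is meant to show how the symbols relate, not to suggest a solver.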


Naive Stringex Client


Practical Assumptions

• JSON input, only top level is indexed, otherwise stringified

• several efficiency tricks
  1. split index in relatively small files
  2. distribute smoothly using random hashing
  3. update parts on timeout -- accumulate multiple intensive updates
  4. create special maps which allow for browsing

• JSON aggregations in files: one line is base64(JSON string)
  ◦ if the bzip2 algorithm is within reach, you can have base64(bzip2(JSON string))
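The input and line-encoding rules above can be sketched in a few lines of Python. The function names are hypothetical, and details such as `sort_keys=True` and UTF-8 encoding are assumptions made here for reproducibility; the slides only specify "top level indexed, otherwise stringified" and base64(bzip2(JSON string)).

```python
import base64
import bz2
import json


def flatten_top_level(doc):
    """Keep top-level keys; turn any nested object or list into a JSON string."""
    flat = {}
    for key, value in doc.items():
        if isinstance(value, (dict, list)):
            flat[key] = json.dumps(value, sort_keys=True)
        else:
            flat[key] = value
    return flat


def encode_line(doc, compress=True):
    """Encode one record as a single text line: base64(bzip2(JSON)) or base64(JSON)."""
    raw = json.dumps(doc, sort_keys=True).encode("utf-8")
    if compress:
        raw = bz2.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def decode_line(line, compressed=True):
    """Reverse encode_line; `compressed` must match the flag used when encoding."""
    raw = base64.b64decode(line)
    if compressed:
        raw = bz2.decompress(raw)
    return json.loads(raw.decode("utf-8"))
```

Because base64 output contains no newline characters, each record stays on one line, which is what lets a file hold many records that can be appended or read line by line.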


Naive Client: Data Structure

INPUT JSON { name: value1, age: value2, … }

Index files (Key: name)
  name.imap  { 'bk': { 'ik': 'start,end', … next 'ik' }, … next bk }
  name.vmap  { 'value': 'bk', … next value }
  name.bk1, name.bk2, …

Index files (Key: age)
  Same as for Key: name

Data files (Docs)
  docs.imap  { 'bk': { 'docid': 'start,end', … next 'docid' }, … next bk }
  docs.bk1, docs.bk2, …
  No .vmap
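The map files above suggest a two-step read: find the entry in an `.imap`, then fetch only that byte range from the corresponding block file. The sketch below illustrates this for the docs side. It assumes that 'start,end' are byte offsets with `end` exclusive, and `read_range` is a hypothetical callable standing in for whatever ranged read the storage backend offers; neither detail is confirmed by the slides.

```python
def parse_range(spec):
    """Parse a 'start,end' string into two integers."""
    start, end = (int(part) for part in spec.split(","))
    return start, end


def read_entry(imap, bk, key, read_range):
    """Fetch one entry from block `bk` using the byte range recorded in the imap."""
    spec = imap.get(bk, {}).get(key)
    if spec is None:
        return None
    start, end = parse_range(spec)
    return read_range(bk, start, end)


def find_doc(docs_imap, docid, read_range):
    """Docs have no vmap, so scan the imap for the block that holds `docid`."""
    for bk, entries in docs_imap.items():
        if docid in entries:
            return read_entry(docs_imap, bk, docid, read_range)
    return None


def browse_values(vmap):
    """List indexed values in sorted order; only the small vmap file is needed."""
    return sorted(vmap)
```

With in-memory blocks, `read_range` can be as simple as `lambda bk, start, end: blocks[bk][start:end]`. The point of the design is that a lookup or a browse touches the small map files plus one slice of one block, rather than the whole index.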

• meta is separate from data

• smart maps, which let you read/write sections of files
  ◦ specifically for the chunk* API in Dropbox

• filenames are the first 2-3 symbols of MD5 hashes
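The filename rule can be illustrated as follows. Which string gets hashed is not stated on the slide; this sketch assumes it is the indexed value, and the `key.prefix` filename pattern is modelled on the `name.bk1` examples. Both are assumptions, and the function names are hypothetical.

```python
import hashlib


def bucket_name(value, prefix_len=2):
    """Return the first `prefix_len` hex symbols of the MD5 hash of `value`."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return digest[:prefix_len]


def bucket_file(key, value, prefix_len=2):
    """Build a block filename such as 'name.5d' for the given key and value."""
    return f"{key}.{bucket_name(value, prefix_len)}"
```

A 2-symbol hex prefix gives at most 256 files per key and a 3-symbol prefix at most 4096, so the prefix length bounds how many files the index is split over while the hash spreads entries roughly evenly among them.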


Naive Client: Sync Engine Design

[Figure: sync engine design. Labels: Stringex Index, Stringex Client, Sync Engine, Optimization, Local Cache; steps 1 and 2: Check, Use.]
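The figure gives only the outline of the sync engine. One piece that the Practical Assumptions slide does spell out, "update parts on timeout -- accumulate multiple intensive updates", might look roughly like the buffer below. The `UpdateBuffer` class, its method names, and the `flush` callback are hypothetical; the actual Stringex clients may organize this differently.

```python
import time


class UpdateBuffer:
    """Accumulate updates per index part and write each part once per timeout."""

    def __init__(self, flush, timeout=5.0, clock=time.monotonic):
        self._flush = flush        # callable(part, entries) doing the remote write
        self._timeout = timeout    # seconds to keep accumulating before writing
        self._clock = clock        # injectable clock, useful for testing
        self._pending = {}         # part name -> list of queued entries
        self._first_at = None      # time of the oldest unflushed update

    def add(self, part, entry):
        """Queue an update locally; nothing is sent over the network yet."""
        if not self._pending:
            self._first_at = self._clock()
        self._pending.setdefault(part, []).append(entry)

    def tick(self):
        """Flush all parts if the timeout has elapsed; return how many were written."""
        if not self._pending:
            return 0
        if self._clock() - self._first_at < self._timeout:
            return 0
        pending, self._pending = self._pending, {}
        for part, entries in pending.items():
            self._flush(part, entries)
        return len(pending)
```

Calling `tick()` periodically turns a burst of many small updates into one write per touched part, which is the throughput saving the slide refers to.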


Evaluation


Stringex vs Lucene

[Figure: Stringex vs Lucene. X axis: Index Size (log). Y axis: Throughput (log of bytes/doc). Series: Lucene, Stringex.]


Wrapup

• https://github.com/maratishe/stringex has a JS client

• I also have a PHP client for command line Stringex

• stringex is better for browsing because items cluster naturally -- better than Lucene
  ◦ I use it for small browsable summaries of datasets
  ◦ ... and context-based browsable datasets

• many other uses are possible


That’s all, thank you ...
