scalability of similarity searching the mufin approach pavel zezula masaryk university brno, czech...

Scalabilityof Similarity Searching

the MUFIN approach

Pavel ZezulaMasaryk University

Brno, Czech Republic

Seminar Santiago da Compostela 2

Outline of the talk

• On the importance of similarity and searching

• Main memory and disk-oriented index structures• Parallel and distributed implementations

• Future trends:– Findability and retrieval in cloud services– Self-organizing search networks

April 2014


Real-life Similarity

• Are they similar?

April 2014




April 2014


Real-Life MotivationThe social psychology view

• Any event in the history of organism is, in a sense, unique.

• Recognition, learning, and judgment presuppose an ability to categorize stimuli and classify situations by similarity.

• Similarity (proximity, resemblance, communality, representativeness, psychological distance, etc.) is fundamental to theories of perception, learning, judgment, etc.

• Similarity is subjective a context-dependentApril 2014


Contemporary Networked MediaThe digital data view

• Almost everything that we see, read, hear, write, measure, or observe can be digital.

• Users autonomously contribute to production of global media and the growth is exponential.

• Sites like Flickr, YouTube, Facebook host user contributed content for a variety of events.

• The elements of networked media are related by numerous multi-facet links of similarity.

April 2014


Challenge

• Networked media database is getting close to the human “fact-bases”– the gap between physical and digital world has blurred

• Similarity data management is needed to connect, search, filter, merge, relate, rank, cluster, classify, identify, or categorize objects across various collections.

WHY?It is the similarity which is in the world revealing.

April 2014


State of the art inMetric Searching technology

Hanan SametFoundation of Multidimensional andMetric Data StructuresMorgan Kaufmann, 2006

P. Zezula, G. Amato, V. Dohnal, and M. BatkoSimilarity Search: The Metric Space ApproachSpringer, 2005

Teaching material: http://www.nmis.isti.cnr.it/amato/similarity-search-book/

April 2014

http://www.nmis.isti.cnr.it/amato/similarity-search-book/

http://www.nmis.isti.cnr.it/amato/similarity-search-book/

http://www.amazon.com/gp/product/0387291466/

http://www.amazon.com/Foundations-Multidimensional-Structures-Kaufmann-Computer/dp/0123694469/ref=pd_bxgy_b_img_b


Last Similarity Search Conference

April 2014

Keynote speakers:Ricardo Baeza-YatesJiri Matas



The Big Data

• Loads on a sharp rise – usage on decline• The (3V) problem of: Volume, Variety, Velocity• Issues:

– Acquisition: what to keep and what to discard– Unstructured data: what content to extract– Datafication: render into data many new aspects– Inaccuracy: approximation, imprecision, noise

April 2014



The Big Data problem

• Shifts in thinking: – from some to all (scalability)– from clean to messy (approximate)

• Technological obstacles: heterogeneity, scale, timeliness, complexity, and privacy aspects

• Foundational challenges: scalable and secure data analysis, organization, retrieval, and modeling

April 2014



Similarity & Geometry

• Figures that have the same shape but not necessarily the same size are similar figures:

• Any two line segments are similar:

• Any two circles are similar:

A B C D

April 2014




• Any two squares are similar:

• Any two equilateral triangles are similar:

April 2014




Definition:• Two polygons are similar to each other, if:

1. Their corresponding angles are equal2. The lengths of their corresponding sides are

proportional

Two geometric figures are either similaror they are not similar at all

April 2014



Visual Similarity

• MPEG-7 multimedia content desc. standard• Global feature descriptors:

– Color, shape, texture, …

– One high-dimensional vector per image

April 2014



Multiple Visual Aspects

April 2014



Visual Similarity- Local feature descriptors – SIFT (Scale Invariant Feature Transform), SURF, etc.- Invariant to image scaling, small viewpoint change, rotation, noise, illumination

April 2014


Visual Similarity - finding coresspondence

April 2014


Biometric Similarity

• Biometrics:– methods of recognizing a person based on

physiological and behavioral characteristics• Two types of recognition problems:

– Verification – authenticity of a person– Identification – recognition of a person

• Examples:– Finger prints, face, iris, retina, speech, gait, etc.

April 2014



Biometrics: Fingerprint

• Minutiae detection:– Detect ridges (endings and branching)– Represented as a sequence of minutiae

• P=( (r1,e1,θ1), …, (rm,em,θm) )• Point in polar coordinates (r,e) and direction θ

• Matching of two sequences:– Align input sequence with database one– Compute weighted edit distance

• wins,del=620

• wrepl=[0;26] - depending on similarity of two minutiae

April 2014



Biometrics: Hand Recognition

• Hand image analysis– Contour extraction, global registration

• Rotation, translation, normalization

– Finger registration– Contour represented as a set of pixels

F={f1,…,fNF}

• Matching: modified Hausdorff distance

April 2014

FGhGFhGFH ,,,max,

Gg FfGFf GgF

gfN

FGhgfN

GFh min1

,min1

,



Remote Biometrics: Approaches

• Detection, normalization, extraction, recognition• Face recognition

– Methods:• Appearance-based – analyze the face as a whole• Model-based – compare individual features (e.g., eyes, mouth)

• Gait recognition– Less likely to be obscured, low resolution suffices– Methods are based on shape or dynamics of the person:

• Appearance-based – analyze person’s silhouettes• Model-based – compare features (e.g., trajectory, angular velocity)

April 2014



Face similarity

• Face detection• Face recognition• face detection – MPEG-7

April 2014



Search – the goals

1. We search to get results (papers, books, …)2. We ask to find answers (what time … )3. We use filters so that the right staff finds us4. We browse while wandering and way-finding

in typically restricted space

• In reality, we move fluidly between modes of ask, browse, filter, and search

April 2014



Search – some quantitative facts

• 85% of all web traffic comes from search engines

• 450+ million searches/day are performed in North America alone

• 70%+ of all searches are done on Google sites

Search is the most popular application(second to E-mail??)

April 2014



Search – some experience

• 60% of searchers NEVER go past 1st page of search results

• The top three results draw 80% of the attention

• The first few results inordinately influence query reformulation.

April 2014



Search - as an interaction

• When we search, our next actions are reactions to the stimuli of previous search results

• What we find is changing what we seek

• In any case, search must be:

fast, simple, and relevantApril 2014



Search – changes our cognitive habits

1. We are increasingly handing off the job of remembering to search engines

2. When we expect information to be easily found again, we do not remember it well

3. Our original memory of facts is changing to a memory of ways to find the facts

April 2014



Evolution of Search Engine Strategies

Scalabilitydata volumenumber of users (queries)variety of data typesmulti-aspect queries

grad

e

high

low

well established cutting-edge research

peer

-to-

peer

cent

raliz

ed

para

llel

dist

ribut

ed

self-

orga

nize

d

April 2014

Determinismexact match similarity►precise approximate►same answer variable►fixed infrastr. mapping►


The MUFIN Approach

MUFIN: MUlti-Feature Indexing Network

SEARCHdata

& q

uerie

s

infrastructure

index structureScalability

P2P structureExtensibilitymetric space

IndependenceInfrastructure as a service

April 2014

http://mufin.fi.muni.cz/tiki-index.php


Extensibility: Metric Abstraction of Similarity

• Metric space: M = (D,d)– D – domain– distance function d(x,y)

x,y,z D• d(x,y) > 0 - non-negativity• d(x,y) = 0 x = y - identity• d(x,y) = d(y,x) - symmetry• d(x,y) ≤ d(x,z) + d(z,y) - triangle inequality

April 2014


Examples of Distance Functions

• Lp Minkovski distance (for vectors)• L1 – city-block distance

• L2 – Euclidean distance

• L¥ – infinity

• Edit distance (for strings)• minimal number of insertions, deletions and substitutions• d(‘application’, ‘applet’) = 6

• Jaccard’s coefficient (for sets A,B)

n

iii yxyxL

11 ||),(

n

iii yxyxL

1

22 ),(

ii

n

iyxyxL

max),(

1

BA

BABAd 1,

April 2014


Examples of Distance Functions

• Mahalanobis distance– for vectors with correlated dimensions

• Hausdorff distance– for sets with elements related by another distance

• Earth movers distance– primarily for histograms (sets of weighted features)

• and many others

April 2014


Similarity Search Problem

• For X D in metric space M,pre-process X so that the similarity queriesare executed efficiently.

In metric space: no total ordering exists!

April 2014

Seminar Santiago da Compostela 37April 2014

Similarity Range Query

• range query– R(q,r) = { x X | d(q,x) ≤ r }

… all museums up to 2km from my hotel …

rq


Nearest Neighbor Query

• the nearest neighbor query– NN(q) = x– x X, "y X, d(q,x) ≤ d(q,y)

• k-nearest neighbor query– k-NN(q,k) = A– A X, |A| = k– x A, y X – A, d(q,x) ≤ d(q,y)

… five closest museums to my hotel …

q

k=5


Basic Partitioning Principles

• Given a set X D in M=(D,d), basic partitioning principles have been defined:

– Ball partitioning– Generalized hyper-plane partitioning– Excluded middle partitioning


Ball Partitioning

• Inner set: { x X | d(p,x) ≤ dm }

• Outer set: { x X | d(p,x) > dm }

pdm


Generalized Hyper-plane

• { x X | d(p1,x) ≤ d(p2,x) }

• { x X | d(p1,x) > d(p2,x) }

p2

p1


Excluded Middle Partitioning

• Inner set: { x X | d(p,x) ≤ dm - }• Outer set: { x X | d(p,x) > dm + }

• Excluded set: otherwise

pdm

2r

pdm


Clustering

• Cluster data into sets– bounded by a ball region– { x X | d(pi,x) ≤ ri

c }


Scalability: Peer-to-Peer Indexing• Local search: Main memory structures• Native metric techniques: GHT*, VPT*• Transformation techniques: M-CAN, M-Chord

April 2014


Main memory structures1. ball partitioning methods

1. Burkhard-Keller Tree2. Fixed Queries Tree (Array)3. Vantage Point Tree4. Excluded Middle Vantage Point Forest

2. generalized hyper-plane partitioning approaches1. Bisector Tree2. Generalized Hyper-plane Tree

3. exploiting pre-computed distances1. AESA2. Linear AESA3. Other Methods – Shapiro, Spaghettis

April 2014




The M-tree [Ciaccia, Patella, Zezula, VLDB 1997]

1)Paged organization2)Dynamic3) Suitable for arbitrary metric spaces4) I/O and CPU optimization

- computing d can be time-consuming

April 2014


The M-tree Idea

• Depending on the metric, the “shape” of index regions changes

C D E F

A B

B

F D

EA

C

Metric: L2 (Euclidean)

L1 (city-block) L (max-metric) weighted-Euclidean quadratic form

April 2014


o7

M-tree: Example

o1o6

o10

o3

o2

o5

o4

o9

o8

o11

o1 4.5 -.- o2 6.9 -.-

o1 1.4 0.0 o10 1.2 3.3 o7 1.3 3.8 o2 2.9 0.0 o4 1.6 5.3

o2 0.0 o8 2.9o1 0.0 o6 1.4 o10 0.0 o3 1.2

o7 0.0 o5 1.3 o11 1.0 o4 0.0 o9 1.6

Covering radius

Distance to parentDistance to parentDistance to parent

Distance to parentLeaf entries


M-tree family

• Bulk loading• Slim-tree• Multi-way insertion• PM-tree• M2-tree• etc.

April 2014


D-Index [Dohnal, Gennaro, Zezula, MTA 2002]4 separable buckets at the first level

2 separable buckets at the second level

exclusion bucket of the whole structure

April 2014


D-index: Insertion

April 2014


D-index: Range Search

q

r

q

r

q

r

q

r

q

r

q

r

April 2014


Parallel implementations

• Trees (parallel M-tree):– Single input point – tree root, a possible bottleneck– Inherently sequential processing steps

• Paths from the root to leaves – we never know which node to access before the ancestor is processed

• Hash-based organizations:– All processing steps can be parallelized– Single input point remains

April 2014



Implementation Postulates of Distributed Indexes

• dynamism – nodes can be added and removed

• no hot-spots – no centralized nodes, no flooding by messages (transactions)

• update independence – network update at one site does not require an immediate change propagation to all the other sites

April 2014


DistributedSimilarity Search Structures

• Native metric structures:– GHT* (Generalized Hyperplane Tree)– VPT* (Vantage Point Tree)

• Transformation approaches:– M-CAN (Metric Content Addressable Network)– M-Chord (Metric Chord)

April 2014


GHT* Address Search Tree

• Based on the Generalized Hyperplane Tree [Uhl91]– two pivots for binary partitioning

p6

p5

p3

p4

p1

p2

p1 p2

p5 p6p3 p4



• Inner node– two pivots (reference objects)

• Leaf node– BID pointer to a bucket if

data stored on the current peer

– NNID pointer to a peer ifdata stored on a different peer

p1 p2

p5 p6p3 p4

BID1 BID2 BID3 NNID2

Peer 2



Peer 2 Peer 3

Peer 1


BID1 BID2 BID3 NNID2

Peer 2

p1 p2

p5 p6p3 p4

Peer 2

BID3 NNID2

p5 p6

p1 p2

GHT* Range Query• Range query R(q,r)

– traverse peer’s own AST– search buckets for all BIDs found– forward query to all NNIDs found

p6

p5

p3

p4

rq

p1

p2


AST: Logarithmic replication

• Full AST on every peer is space consuming– replication of pivots grows in a linear way

• Store only a part of the AST:– all paths to local buckets

• Deleted sub-trees:– replaced by NNID

of the leftmost peer

p13 p14p11 p12

p5 p6

p1 p2

p3 p4

p7 p8 p9 p10

NNID2 NNID3BID1 NNID4 NNID5 NNID6 NNID7 NNID8

p1 p2

p3 p4

p7 p8

BID1 NNID3 NNID5


AST: Logarithmic Replication (cont.)

• Resulting tree– replication of pivots grows in a logarithmic way

p1 p2

p3 p4

p7 p8

NNID2

NNID3

BID1

NNID5

p1 p2

p3 p4

p7 p8

BID1


p1

r1

p3

r3

VPT* Structure• Similar to the GHT* - ball partitioning is used

for AST– Based on the Vantage Point Tree [Yia93]

• inner nodes have one pivot and a radius• different traversing conditions

p2

r2

p1 (r1)

p2 (r2) p3 (r3)


M-Chord: The Metric Chord• Transform metric space to one-dimensional domain

– Use M-Index - a generalized version of the iDistance

• Divide the domain into intervals – assign each interval to a peer

• Use the Chord P2P protocol for navigation• The Skip graphs distributed protocol can be used,

alternatively

April 2014


– range query R(q,r): identify intervals of interest

• Generalization to metric spaces

– select pivots – then partition: Voronoi-style

M-Chord: Indexing the Distance

• iDistance – indexing technique for vector domains

– cluster analysis = centers = reference points pi

– assign iDistance keys to objects iCx

cixpdxiDist i ),()(

},...,{ 0 npp

April 2014


M-Chord: Chord Protocol

• Peer-to-Peer navigation protocol• Peers are responsible for intervals of keys• hops to localize a node storing a key

M-Chord set the iDistance domain make it uniform: function h

Use Chord on this domain

)(log n

)),(()( cixpdhxmchord i

April 2014


M-Chord: Range Query

• Node Nq initiates the search

• Determine intervals– generalized iDistance

• Forward requests to peers on intervals

• Search in the nodes– using local organization

• Merge the received partial answers

April 2014


M-CAN: The Metric CAN

• Based on the Content-Addressable Network (CAN)– a DHT navigating in an N-dimensional vector space

• The Idea:1. Map the metric space to a vector space

– given N pivots: p1, p2 , … , pN, transform every o into vector F(o)

2. Use CAN to

– distribute the vector space zones among the nodes– navigate in the network


CAN: Principles & Navigation

• CAN – the principles– the space is divided in zones – each node “owns” a zone– nodes know their neighbors

• CAN – the navigation– greedy routing– in every step, move to the

neighbor closer to the target location

2-d

imensi

onal vect

or

space

1

6 2

53

4

x,y


M-CAN: Contractiveness & Filtering

• Use the L∞ as a distance measure

– the mapping F is contractive

• More pivots better filtering– but, CAN routing is better for less dimensions

• Additional filtering– some pivots are only used for filtering data (inside

the explored nodes)– they are not used for mapping into CAN vector space

),())(),(( yxdyFxFL


Infrastructure Independence: MESSIF Metric Similarity Search Implementation Framework

Metric space (D,d) Operations Storage

Centralized index structures

Distributed index structures

Communication

NetVectors• Lp and quadratic formStrings • (weighted) edit and

protein sequence

Insert, delete,range query,k-NN query,Incremental k-NN

Volatile memoryPersistent memory

Performance statistics

April 2014

http://www.williamoslerhc.on.ca/Patient_Services/MRI.jpg

http://www.roswellpark.org/mri/Image_Gallery/FMR/images/aa-fmr-slices-fmr.jpg





Czech Technology Days at GT 71

MUFIN demos

• http://disa.fi.muni.cz/imgsearch/similar• http://www.pixmac.com/• http://disa.fi.muni.cz/twenga/• http://disa.fi.muni.cz/fingerprints/• http://disa.fi.muni.cz/subseq/• http://disa.fi.muni.cz/FaceMatch/• http://disa.fi.muni.cz/annotation-ui/• http://disa.fi.muni.cz/motion-match/• http://disa.fi.muni.cz/profimedia-neural_network-2

0M/May 29-30, 2014

http://disa.fi.muni.cz/imgsearch/similar

http://www.pixmac.com/



http://disa.fi.muni.cz/twenga/

http://disa.fi.muni.cz/twenga/

http://disa.fi.muni.cz/fingerprints/

http://disa.fi.muni.cz/subseq/

http://disa.fi.muni.cz/FaceMatch/

http://disa.fi.muni.cz/annotation-ui/






http://disa.fi.muni.cz/motion-match/

http://disa.fi.muni.cz/profimedia-neural_network-20M/

http://disa.fi.muni.cz/profimedia-neural_network-20M/



MUFIN demos

• http://mufin.fi.muni.cz/imgsearch/similar• http://www.pixmac.com/• http://mufin.fi.muni.cz/twenga/random• http://mufin.fi.muni.cz/fingerprints/random• http://mufin.fi.muni.cz/subseq/random• http://mufin.fi.muni.cz/mma-faces-extended/• http://mufin.fi.muni.cz/plugins/annotation• http://mufin.fi.muni.cz/motion-retrieval/rand

omApril 2014

http://mufin.fi.muni.cz/imgsearch/similar



http://mufin.fi.muni.cz/twenga/random

http://mufin.fi.muni.cz/fingerprints/random

http://mufin.fi.muni.cz/subseq/random

http://mufin.fi.muni.cz/mma-faces-extended/

http://mufin.fi.muni.cz/plugins/annotation

http://mufin.fi.muni.cz/motion-retrieval/random

http://mufin.fi.muni.cz/motion-retrieval/random



Problems with Current Applications

1. Current applications are implemented as complex software projects; it is costly, highly qualified specialists are needed

2. They fitfully need massive infrastructure to build and run multiple indices

3. Applications are much more complex than a search, which is an important supporting service

April 2014



SimilaritySearch

Computingeffectiveness efficiency

stimuli

matching extraction

evaluation executionoperators

Similarity Data Management System

April 2014

findability

retrieval

CloudServices

Application Areaswebenterprisemobilesocial


Similarity Searching in Clouds

• Retrieval – effectiveness, evaluation, operations, execution, and efficiency

• Findability – effectiveness, matching, stimuli, extraction, and efficiency

• Cloud way of computing:– Scalability – must balance load across servers and avoid bottlenecks

– Elasticity – to allow adding (reducing) capacity to a running system

– Availability – to provide high levels of usability and fault tolerance

– Privacy – to safeguard data that is valuable or sensitive against unauthorized access

April 2014


April 2014 Seminar Santiago da Compostela

Self-organizing Search Systems

• Big data in computer networks:– A very large number of users keeping data within their

devices (computers, mobiles, etc.)– A very high degree of churning users– Solution to storing and searching the data:

• Cloud-type systems – data stored within a single infrastructure• Fully distributed systems – data stored within each user’s device

that brings additional computational power to the system

FacebookYouTube

76/4



Self-organizing Search Systems

• A self-organizing system:– Devices are independent and equal in functionality– A set of devices interconnected by semantic links– There is no global control mechanism– Search mechanisms and link management based on social-

like interactions (acquaintance/friend links)• Properties:

– Scalability– Adaptability– Robustness

77/4



Architecture

• Each peer stores its piece of data– It asks and answers similarity queries

• Each peer maintains a query history– An acquaintance tie - the best answer– Friend ties – similar peer

P1

Q1

Q3

P3

Q3

P2

Q1

Q2

Q3

April 2014


150 peers organizing 200,000 images Extracted features: 45-dimensional vectors

Random initial state Five queries per peer

Experiments

• Query history– 100 queries limit

April 2014


Self-organizing Search Systems• Our Solution – Metric Social Network

– Resilience to disconnections of peers – initially 2 000– After each 20th test series:

• S1 – 200 random peers were disconnected• S2 – 200 the most knowledgeable peers were disconnected

S1 S2

80/4


scalability of similarity searching the mufin approach pavel zezula masaryk university brno, czech...

Documents

compostelaseminar santiago

similarity data management

importance of similarity

similarity proximity

similarity search conference

discardunstructured

secure data analysis

search networks