scalability of similarity searching the mufin approach pavel zezula masaryk university brno, czech...
TRANSCRIPT
Scalabilityof Similarity Searching
the MUFIN approach
Pavel ZezulaMasaryk University
Brno, Czech Republic
Seminar Santiago da Compostela 2
Outline of the talk
• On the importance of similarity and searching
• Main memory and disk-oriented index structures• Parallel and distributed implementations
• Future trends:– Findability and retrieval in cloud services– Self-organizing search networks
April 2014
Seminar Santiago da Compostela 3
Real-life Similarity
• Are they similar?
April 2014
Seminar Santiago da Compostela 4
Real-life Similarity
• Are they similar?
April 2014
Seminar Santiago da Compostela 5
Real-life Similarity
• Are they similar?
April 2014
Seminar Santiago da Compostela 6
Real-life Similarity
• Are they similar?
April 2014
Seminar Santiago da Compostela 7
Real-Life MotivationThe social psychology view
• Any event in the history of organism is, in a sense, unique.
• Recognition, learning, and judgment presuppose an ability to categorize stimuli and classify situations by similarity.
• Similarity (proximity, resemblance, communality, representativeness, psychological distance, etc.) is fundamental to theories of perception, learning, judgment, etc.
• Similarity is subjective a context-dependentApril 2014
Seminar Santiago da Compostela 8
Contemporary Networked MediaThe digital data view
• Almost everything that we see, read, hear, write, measure, or observe can be digital.
• Users autonomously contribute to production of global media and the growth is exponential.
• Sites like Flickr, YouTube, Facebook host user contributed content for a variety of events.
• The elements of networked media are related by numerous multi-facet links of similarity.
April 2014
Seminar Santiago da Compostela 9
Challenge
• Networked media database is getting close to the human “fact-bases”– the gap between physical and digital world has blurred
• Similarity data management is needed to connect, search, filter, merge, relate, rank, cluster, classify, identify, or categorize objects across various collections.
WHY?It is the similarity which is in the world revealing.
April 2014
Seminar Santiago da Compostela 10
State of the art inMetric Searching technology
Hanan SametFoundation of Multidimensional andMetric Data StructuresMorgan Kaufmann, 2006
P. Zezula, G. Amato, V. Dohnal, and M. BatkoSimilarity Search: The Metric Space ApproachSpringer, 2005
Teaching material: http://www.nmis.isti.cnr.it/amato/similarity-search-book/
April 2014
Seminar Santiago da Compostela 11
Last Similarity Search Conference
April 2014
Keynote speakers:Ricardo Baeza-YatesJiri Matas
Seminar Santiago da Compostela 12
The Big Data
• Loads on a sharp rise – usage on decline• The (3V) problem of: Volume, Variety, Velocity• Issues:
– Acquisition: what to keep and what to discard– Unstructured data: what content to extract– Datafication: render into data many new aspects– Inaccuracy: approximation, imprecision, noise
April 2014
Seminar Santiago da Compostela 13
The Big Data problem
• Shifts in thinking: – from some to all (scalability)– from clean to messy (approximate)
• Technological obstacles: heterogeneity, scale, timeliness, complexity, and privacy aspects
• Foundational challenges: scalable and secure data analysis, organization, retrieval, and modeling
April 2014
Seminar Santiago da Compostela 14
Similarity & Geometry
• Figures that have the same shape but not necessarily the same size are similar figures:
• Any two line segments are similar:
• Any two circles are similar:
A B C D
April 2014
Seminar Santiago da Compostela 15
Similarity & Geometry
• Any two squares are similar:
• Any two equilateral triangles are similar:
April 2014
Seminar Santiago da Compostela 16
Similarity & Geometry
Definition:• Two polygons are similar to each other, if:
1. Their corresponding angles are equal2. The lengths of their corresponding sides are
proportional
Two geometric figures are either similaror they are not similar at all
April 2014
Seminar Santiago da Compostela 17
Visual Similarity
• MPEG-7 multimedia content desc. standard• Global feature descriptors:
– Color, shape, texture, …
– One high-dimensional vector per image
April 2014
Seminar Santiago da Compostela 18
Multiple Visual Aspects
April 2014
Seminar Santiago da Compostela 19
Visual Similarity- Local feature descriptors – SIFT (Scale Invariant Feature Transform), SURF, etc.- Invariant to image scaling, small viewpoint change, rotation, noise, illumination
April 2014
Seminar Santiago da Compostela 20
Visual Similarity - finding coresspondence
April 2014
Seminar Santiago da Compostela 21
Biometric Similarity
• Biometrics:– methods of recognizing a person based on
physiological and behavioral characteristics• Two types of recognition problems:
– Verification – authenticity of a person– Identification – recognition of a person
• Examples:– Finger prints, face, iris, retina, speech, gait, etc.
April 2014
Seminar Santiago da Compostela 22
Biometrics: Fingerprint
• Minutiae detection:– Detect ridges (endings and branching)– Represented as a sequence of minutiae
• P=( (r1,e1,θ1), …, (rm,em,θm) )• Point in polar coordinates (r,e) and direction θ
• Matching of two sequences:– Align input sequence with database one– Compute weighted edit distance
• wins,del=620
• wrepl=[0;26] - depending on similarity of two minutiae
April 2014
Seminar Santiago da Compostela 23
Biometrics: Hand Recognition
• Hand image analysis– Contour extraction, global registration
• Rotation, translation, normalization
– Finger registration– Contour represented as a set of pixels
F={f1,…,fNF}
• Matching: modified Hausdorff distance
April 2014
FGhGFhGFH ,,,max,
Gg FfGFf GgF
gfN
FGhgfN
GFh min1
,min1
,
Seminar Santiago da Compostela 24
Remote Biometrics: Approaches
• Detection, normalization, extraction, recognition• Face recognition
– Methods:• Appearance-based – analyze the face as a whole• Model-based – compare individual features (e.g., eyes, mouth)
• Gait recognition– Less likely to be obscured, low resolution suffices– Methods are based on shape or dynamics of the person:
• Appearance-based – analyze person’s silhouettes• Model-based – compare features (e.g., trajectory, angular velocity)
April 2014
Seminar Santiago da Compostela 25
Face similarity
• Face detection• Face recognition• face detection – MPEG-7
April 2014
Seminar Santiago da Compostela 26
Search – the goals
1. We search to get results (papers, books, …)2. We ask to find answers (what time … )3. We use filters so that the right staff finds us4. We browse while wandering and way-finding
in typically restricted space
• In reality, we move fluidly between modes of ask, browse, filter, and search
April 2014
Seminar Santiago da Compostela 27
Search – some quantitative facts
• 85% of all web traffic comes from search engines
• 450+ million searches/day are performed in North America alone
• 70%+ of all searches are done on Google sites
Search is the most popular application(second to E-mail??)
April 2014
Seminar Santiago da Compostela 28
Search – some experience
• 60% of searchers NEVER go past 1st page of search results
• The top three results draw 80% of the attention
• The first few results inordinately influence query reformulation.
April 2014
Seminar Santiago da Compostela 29
Search - as an interaction
• When we search, our next actions are reactions to the stimuli of previous search results
• What we find is changing what we seek
• In any case, search must be:
fast, simple, and relevantApril 2014
Seminar Santiago da Compostela 30
Search – changes our cognitive habits
1. We are increasingly handing off the job of remembering to search engines
2. When we expect information to be easily found again, we do not remember it well
3. Our original memory of facts is changing to a memory of ways to find the facts
April 2014
Seminar Santiago da Compostela 31
Evolution of Search Engine Strategies
Scalabilitydata volumenumber of users (queries)variety of data typesmulti-aspect queries
grad
e
high
low
well established cutting-edge research
peer
-to-
peer
cent
raliz
ed
para
llel
dist
ribut
ed
self-
orga
nize
d
April 2014
Determinismexact match similarity►precise approximate►same answer variable►fixed infrastr. mapping►
Seminar Santiago da Compostela 32
The MUFIN Approach
MUFIN: MUlti-Feature Indexing Network
SEARCHdata
& q
uerie
s
infrastructure
index structureScalability
P2P structureExtensibilitymetric space
IndependenceInfrastructure as a service
April 2014
Seminar Santiago da Compostela 33
Extensibility: Metric Abstraction of Similarity
• Metric space: M = (D,d)– D – domain– distance function d(x,y)
x,y,z D• d(x,y) > 0 - non-negativity• d(x,y) = 0 x = y - identity• d(x,y) = d(y,x) - symmetry• d(x,y) ≤ d(x,z) + d(z,y) - triangle inequality
April 2014
Seminar Santiago da Compostela 34
Examples of Distance Functions
• Lp Minkovski distance (for vectors)• L1 – city-block distance
• L2 – Euclidean distance
• L¥ – infinity
• Edit distance (for strings)• minimal number of insertions, deletions and substitutions• d(‘application’, ‘applet’) = 6
• Jaccard’s coefficient (for sets A,B)
n
iii yxyxL
11 ||),(
n
iii yxyxL
1
22 ),(
ii
n
iyxyxL
max),(
1
BA
BABAd 1,
April 2014
Seminar Santiago da Compostela 35
Examples of Distance Functions
• Mahalanobis distance– for vectors with correlated dimensions
• Hausdorff distance– for sets with elements related by another distance
• Earth movers distance– primarily for histograms (sets of weighted features)
• and many others
April 2014
Seminar Santiago da Compostela 36
Similarity Search Problem
• For X D in metric space M,pre-process X so that the similarity queriesare executed efficiently.
In metric space: no total ordering exists!
April 2014
Seminar Santiago da Compostela 37April 2014
Similarity Range Query
• range query– R(q,r) = { x X | d(q,x) ≤ r }
… all museums up to 2km from my hotel …
rq
Seminar Santiago da Compostela 38April 2014
Nearest Neighbor Query
• the nearest neighbor query– NN(q) = x– x X, "y X, d(q,x) ≤ d(q,y)
• k-nearest neighbor query– k-NN(q,k) = A– A X, |A| = k– x A, y X – A, d(q,x) ≤ d(q,y)
… five closest museums to my hotel …
q
k=5
Seminar Santiago da Compostela 39April 2014
Basic Partitioning Principles
• Given a set X D in M=(D,d), basic partitioning principles have been defined:
– Ball partitioning– Generalized hyper-plane partitioning– Excluded middle partitioning
Seminar Santiago da Compostela 40April 2014
Ball Partitioning
• Inner set: { x X | d(p,x) ≤ dm }
• Outer set: { x X | d(p,x) > dm }
pdm
Seminar Santiago da Compostela 41April 2014
Generalized Hyper-plane
• { x X | d(p1,x) ≤ d(p2,x) }
• { x X | d(p1,x) > d(p2,x) }
p2
p1
Seminar Santiago da Compostela 42April 2014
Excluded Middle Partitioning
• Inner set: { x X | d(p,x) ≤ dm - }• Outer set: { x X | d(p,x) > dm + }
• Excluded set: otherwise
pdm
2r
pdm
Seminar Santiago da Compostela 43April 2014
Clustering
• Cluster data into sets– bounded by a ball region– { x X | d(pi,x) ≤ ri
c }
Seminar Santiago da Compostela 44
Scalability: Peer-to-Peer Indexing• Local search: Main memory structures• Native metric techniques: GHT*, VPT*• Transformation techniques: M-CAN, M-Chord
April 2014
Seminar Santiago da Compostela 45
Main memory structures1. ball partitioning methods
1. Burkhard-Keller Tree2. Fixed Queries Tree (Array)3. Vantage Point Tree4. Excluded Middle Vantage Point Forest
2. generalized hyper-plane partitioning approaches1. Bisector Tree2. Generalized Hyper-plane Tree
3. exploiting pre-computed distances1. AESA2. Linear AESA3. Other Methods – Shapiro, Spaghettis
April 2014
Seminar Santiago da Compostela 46
The M-tree [Ciaccia, Patella, Zezula, VLDB 1997]
1)Paged organization2)Dynamic3) Suitable for arbitrary metric spaces4) I/O and CPU optimization
- computing d can be time-consuming
April 2014
Seminar Santiago da Compostela 47
The M-tree Idea
• Depending on the metric, the “shape” of index regions changes
C D E F
A B
B
F D
EA
C
Metric: L2 (Euclidean)
L1 (city-block) L (max-metric) weighted-Euclidean quadratic form
April 2014
Seminar Santiago da Compostela 48April 2014
o7
M-tree: Example
o1o6
o10
o3
o2
o5
o4
o9
o8
o11
o1 4.5 -.- o2 6.9 -.-
o1 1.4 0.0 o10 1.2 3.3 o7 1.3 3.8 o2 2.9 0.0 o4 1.6 5.3
o2 0.0 o8 2.9o1 0.0 o6 1.4 o10 0.0 o3 1.2
o7 0.0 o5 1.3 o11 1.0 o4 0.0 o9 1.6
Covering radius
Distance to parentDistance to parentDistance to parent
Distance to parentLeaf entries
Seminar Santiago da Compostela 49
M-tree family
• Bulk loading• Slim-tree• Multi-way insertion• PM-tree• M2-tree• etc.
April 2014
Seminar Santiago da Compostela 50
D-Index [Dohnal, Gennaro, Zezula, MTA 2002]4 separable buckets at the first level
2 separable buckets at the second level
exclusion bucket of the whole structure
April 2014
Seminar Santiago da Compostela 51
D-index: Insertion
April 2014
Seminar Santiago da Compostela 52
D-index: Range Search
q
r
q
r
q
r
q
r
q
r
q
r
April 2014
Seminar Santiago da Compostela 53
Parallel implementations
• Trees (parallel M-tree):– Single input point – tree root, a possible bottleneck– Inherently sequential processing steps
• Paths from the root to leaves – we never know which node to access before the ancestor is processed
• Hash-based organizations:– All processing steps can be parallelized– Single input point remains
April 2014
Seminar Santiago da Compostela 54
Implementation Postulates of Distributed Indexes
• dynamism – nodes can be added and removed
• no hot-spots – no centralized nodes, no flooding by messages (transactions)
• update independence – network update at one site does not require an immediate change propagation to all the other sites
April 2014
Seminar Santiago da Compostela 55
DistributedSimilarity Search Structures
• Native metric structures:– GHT* (Generalized Hyperplane Tree)– VPT* (Vantage Point Tree)
• Transformation approaches:– M-CAN (Metric Content Addressable Network)– M-Chord (Metric Chord)
April 2014
Seminar Santiago da Compostela 56April 2014
GHT* Address Search Tree
• Based on the Generalized Hyperplane Tree [Uhl91]– two pivots for binary partitioning
p6
p5
p3
p4
p1
p2
p1 p2
p5 p6p3 p4
Seminar Santiago da Compostela 57April 2014
GHT* Address Search Tree
• Inner node– two pivots (reference objects)
• Leaf node– BID pointer to a bucket if
data stored on the current peer
– NNID pointer to a peer ifdata stored on a different peer
p1 p2
p5 p6p3 p4
BID1 BID2 BID3 NNID2
Peer 2
Seminar Santiago da Compostela 58April 2014
GHT* Address Search Tree
Peer 2 Peer 3
Peer 1
Seminar Santiago da Compostela 59April 2014
BID1 BID2 BID3 NNID2
Peer 2
p1 p2
p5 p6p3 p4
Peer 2
BID3 NNID2
p5 p6
p1 p2
GHT* Range Query• Range query R(q,r)
– traverse peer’s own AST– search buckets for all BIDs found– forward query to all NNIDs found
p6
p5
p3
p4
rq
p1
p2
Seminar Santiago da Compostela 60April 2014
AST: Logarithmic replication
• Full AST on every peer is space consuming– replication of pivots grows in a linear way
• Store only a part of the AST:– all paths to local buckets
• Deleted sub-trees:– replaced by NNID
of the leftmost peer
p13 p14p11 p12
p5 p6
p1 p2
p3 p4
p7 p8 p9 p10
NNID2 NNID3BID1 NNID4 NNID5 NNID6 NNID7 NNID8
p1 p2
p3 p4
p7 p8
BID1 NNID3 NNID5
Seminar Santiago da Compostela 61April 2014
AST: Logarithmic Replication (cont.)
• Resulting tree– replication of pivots grows in a logarithmic way
p1 p2
p3 p4
p7 p8
NNID2
NNID3
BID1
NNID5
p1 p2
p3 p4
p7 p8
BID1
Seminar Santiago da Compostela 62April 2014
p1
r1
p3
r3
VPT* Structure• Similar to the GHT* - ball partitioning is used
for AST– Based on the Vantage Point Tree [Yia93]
• inner nodes have one pivot and a radius• different traversing conditions
p2
r2
p1 (r1)
p2 (r2) p3 (r3)
Seminar Santiago da Compostela 63
M-Chord: The Metric Chord• Transform metric space to one-dimensional domain
– Use M-Index - a generalized version of the iDistance
• Divide the domain into intervals – assign each interval to a peer
• Use the Chord P2P protocol for navigation• The Skip graphs distributed protocol can be used,
alternatively
April 2014
Seminar Santiago da Compostela 64
– range query R(q,r): identify intervals of interest
• Generalization to metric spaces
– select pivots – then partition: Voronoi-style
M-Chord: Indexing the Distance
• iDistance – indexing technique for vector domains
– cluster analysis = centers = reference points pi
– assign iDistance keys to objects iCx
cixpdxiDist i ),()(
},...,{ 0 npp
April 2014
Seminar Santiago da Compostela 65
M-Chord: Chord Protocol
• Peer-to-Peer navigation protocol• Peers are responsible for intervals of keys• hops to localize a node storing a key
M-Chord set the iDistance domain make it uniform: function h
Use Chord on this domain
)(log n
)),(()( cixpdhxmchord i
April 2014
Seminar Santiago da Compostela 66
M-Chord: Range Query
• Node Nq initiates the search
• Determine intervals– generalized iDistance
• Forward requests to peers on intervals
• Search in the nodes– using local organization
• Merge the received partial answers
April 2014
Seminar Santiago da Compostela 67April 2014
M-CAN: The Metric CAN
• Based on the Content-Addressable Network (CAN)– a DHT navigating in an N-dimensional vector space
• The Idea:1. Map the metric space to a vector space
– given N pivots: p1, p2 , … , pN, transform every o into vector F(o)
2. Use CAN to
– distribute the vector space zones among the nodes– navigate in the network
Seminar Santiago da Compostela 68April 2014
CAN: Principles & Navigation
• CAN – the principles– the space is divided in zones – each node “owns” a zone– nodes know their neighbors
• CAN – the navigation– greedy routing– in every step, move to the
neighbor closer to the target location
2-d
imensi
onal vect
or
space
1
6 2
53
4
x,y
Seminar Santiago da Compostela 69April 2014
M-CAN: Contractiveness & Filtering
• Use the L∞ as a distance measure
– the mapping F is contractive
• More pivots better filtering– but, CAN routing is better for less dimensions
• Additional filtering– some pivots are only used for filtering data (inside
the explored nodes)– they are not used for mapping into CAN vector space
),())(),(( yxdyFxFL
Seminar Santiago da Compostela 70
Infrastructure Independence: MESSIF Metric Similarity Search Implementation Framework
Metric space (D,d) Operations Storage
Centralized index structures
Distributed index structures
Communication
NetVectors• Lp and quadratic formStrings • (weighted) edit and
protein sequence
Insert, delete,range query,k-NN query,Incremental k-NN
Volatile memoryPersistent memory
Performance statistics
April 2014
Czech Technology Days at GT 71
MUFIN demos
• http://disa.fi.muni.cz/imgsearch/similar• http://www.pixmac.com/• http://disa.fi.muni.cz/twenga/• http://disa.fi.muni.cz/fingerprints/• http://disa.fi.muni.cz/subseq/• http://disa.fi.muni.cz/FaceMatch/• http://disa.fi.muni.cz/annotation-ui/• http://disa.fi.muni.cz/motion-match/• http://disa.fi.muni.cz/profimedia-neural_network-2
0M/May 29-30, 2014
Seminar Santiago da Compostela 72
MUFIN demos
• http://mufin.fi.muni.cz/imgsearch/similar• http://www.pixmac.com/• http://mufin.fi.muni.cz/twenga/random• http://mufin.fi.muni.cz/fingerprints/random• http://mufin.fi.muni.cz/subseq/random• http://mufin.fi.muni.cz/mma-faces-extended/• http://mufin.fi.muni.cz/plugins/annotation• http://mufin.fi.muni.cz/motion-retrieval/rand
omApril 2014
Seminar Santiago da Compostela 73
Problems with Current Applications
1. Current applications are implemented as complex software projects; it is costly, highly qualified specialists are needed
2. They fitfully need massive infrastructure to build and run multiple indices
3. Applications are much more complex than a search, which is an important supporting service
April 2014
Seminar Santiago da Compostela 74
SimilaritySearch
Computingeffectiveness efficiency
stimuli
matching extraction
evaluation executionoperators
Similarity Data Management System
April 2014
findability
retrieval
CloudServices
Application Areaswebenterprisemobilesocial
Seminar Santiago da Compostela 75
Similarity Searching in Clouds
• Retrieval – effectiveness, evaluation, operations, execution, and efficiency
• Findability – effectiveness, matching, stimuli, extraction, and efficiency
• Cloud way of computing:– Scalability – must balance load across servers and avoid bottlenecks
– Elasticity – to allow adding (reducing) capacity to a running system
– Availability – to provide high levels of usability and fault tolerance
– Privacy – to safeguard data that is valuable or sensitive against unauthorized access
April 2014
April 2014 Seminar Santiago da Compostela
Self-organizing Search Systems
• Big data in computer networks:– A very large number of users keeping data within their
devices (computers, mobiles, etc.)– A very high degree of churning users– Solution to storing and searching the data:
• Cloud-type systems – data stored within a single infrastructure• Fully distributed systems – data stored within each user’s device
that brings additional computational power to the system
FacebookYouTube
76/4
April 2014 Seminar Santiago da Compostela
Self-organizing Search Systems
• A self-organizing system:– Devices are independent and equal in functionality– A set of devices interconnected by semantic links– There is no global control mechanism– Search mechanisms and link management based on social-
like interactions (acquaintance/friend links)• Properties:
– Scalability– Adaptability– Robustness
77/4
Seminar Santiago da Compostela 78
Architecture
• Each peer stores its piece of data– It asks and answers similarity queries
• Each peer maintains a query history– An acquaintance tie - the best answer– Friend ties – similar peer
P1
Q1
Q3
P3
Q3
P2
Q1
Q2
Q3
April 2014
Seminar Santiago da Compostela 79
150 peers organizing 200,000 images Extracted features: 45-dimensional vectors
Random initial state Five queries per peer
Experiments
• Query history– 100 queries limit
April 2014
April 2014 Seminar Santiago da Compostela
Self-organizing Search Systems• Our Solution – Metric Social Network
– Resilience to disconnections of peers – initially 2 000– After each 20th test series:
• S1 – 200 random peers were disconnected• S2 – 200 the most knowledgeable peers were disconnected
S1 S2
80/4