SEMWEB 1 MUFIN Basics MUFIN team Faculty of Informatics, Masaryk University Brno, Czech Republic [email protected]


Post on 19-May-2020


SEMWEB 1

MUFIN Basics

MUFIN team
Faculty of Informatics, Masaryk University
Brno, Czech Republic

[email protected]

SEMWEB 2

Search problem

[Diagram: SEARCH, with three components: data & queries, infrastructure, index structure]

SEMWEB 3

The thesis (intellectual proposition)

Search systems are becoming more and more complex
Future search systems will be born on the divergence of:

scale and determinism

SEMWEB 4

Trends in Scalability of Search

data volume - exponential growth
number of users - increasing fast
variety of data types - digital databases
multi-queries - multi-lingual, multi-feature, multi-modal

SEMWEB 5

Trends in Determinism of Search

• Exact match
• Precise answer
• Unvaried answer
• Fixed query
• Dedicated hardware

• Similarity
• Approximate answer
• Satisfactory answer (advice, recommendation)
• Personalized, context aware, proximate
• Dynamic mapping, mobile devices, infrastructure services

SEMWEB 6

Search systems

Scalability
● data volume – exponential growth
● number of users (queries) – increasing
● variety of data types – digitization
● multi-lingual (multi-feature, multi-modal) queries

Determinism
exact match ► similarity
precise ► approximate
unvaried answer ► good answer; advice
fixed query ► personalized; context aware
fixed infrastructure ► dynamic mapping; mobile

grad

e

high

low

well established cutting-edge research

peer

-to-p

eer

cent

raliz

ed

para

llel

dist

ribut

ed

self-

orga

nize

dtime

SEMWEB 7

The MUFIN Approach

[Diagram: SEARCH, with three components: data & queries, infrastructure, index structure]

Scalability: P2P structure

Extensibility: metric space
Minkowski distance
Edit distance
Jaccard's coef.
Mahalanobis distance
Hausdorff distance
etc.

MUFIN: MUlti-Feature Indexing Network

Cloud computing: infrastructure as a service

SEMWEB 8

EXTENSIBILITY
Metric Space: Abstraction of Similarity

Metric space: M = (D, d)
D – domain
d(x,y) – distance function
∀x,y,z ∈ D:

d(x,y) ≥ 0 - non-negativity
d(x,y) = 0 ⇔ x = y - identity
d(x,y) = d(y,x) - symmetry
d(x,y) ≤ d(x,z) + d(z,y) - triangle inequality
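The four postulates can be spot-checked on a finite sample of objects. The sketch below is only illustrative: the helper name `check_metric_postulates`, the two candidate functions and the sample values are assumptions, not part of MUFIN. A finite check can refute a candidate distance but cannot prove that it is a metric.

```python
from itertools import product


def check_metric_postulates(d, sample, tol=1e-9):
    """Return the names of the metric postulates violated on a finite sample."""
    violated = set()
    for x, y in product(sample, repeat=2):
        dxy = d(x, y)
        if dxy < -tol:
            violated.add("non-negativity")
        # identity: the distance is zero exactly when the objects are equal
        if (abs(dxy) <= tol) != (x == y):
            violated.add("identity")
        if abs(dxy - d(y, x)) > tol:
            violated.add("symmetry")
    for x, y, z in product(sample, repeat=3):
        if d(x, y) > d(x, z) + d(z, y) + tol:
            violated.add("triangle inequality")
    return violated


def abs_diff(x, y):
    """Absolute difference: a metric on real numbers."""
    return abs(x - y)


def squared_diff(x, y):
    """Squared difference: symmetric and non-negative, but not a metric."""
    return (x - y) ** 2


sample = [0.0, 1.0, 2.5, 4.0]
print(check_metric_postulates(abs_diff, sample))
print(check_metric_postulates(squared_diff, sample))
```

On this sample the absolute difference passes all checks, while the squared difference fails the triangle inequality, which is why it cannot be plugged into a metric index without losing correctness guarantees.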

SEMWEB 9

Why Can the Metric Approach be Useful

Many application areas:
biology, security
audio-visual, geo. search
software copy detection
data cleaning, integration, etc.

Query by example paradigm
one query image contains a lot of information
one image is worth 1000 words
advantage for mobile devices – minimal clicks

SEMWEB 10

Metric Search Grows in Popularity

Hanan Samet
Foundations of Multidimensional and Metric Data Structures
Morgan Kaufmann, 2006

P. Zezula, G. Amato, V. Dohnal, and M. Batko
Similarity Search: The Metric Space Approach
Springer, 2006

SEMWEB 11

Examples of Distance Functions

Lp Minkowski distance of order p

L1 – city-block distance: $L_1(x, y) = \sum_{i=1}^{n} |x_i - y_i|$

L2 – Euclidean distance: $L_2(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

L∞ – infinity distance: $L_\infty(x, y) = \max_{i=1}^{n} |x_i - y_i|$

edit distance (for strings)
minimal number of insertions, deletions and substitutions
d('application', 'applet') = 6

Jaccard's coefficient (for sets A, B): $d(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}$
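A minimal sketch of the distance functions listed on this slide, written as plain Python. The function names are illustrative choices, and the edit distance uses the standard dynamic-programming formulation with unit costs.

```python
import math


def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    if p == math.inf:
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def edit_distance(s, t):
    """Minimal number of insertions, deletions and substitutions turning s into t."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def jaccard_distance(a, b):
    """One minus the ratio of intersection size to union size."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return 1.0 - len(a & b) / len(a | b)


print(minkowski((0, 0), (3, 4), 1))
print(minkowski((0, 0), (3, 4), 2))
print(minkowski((0, 0), (3, 4), math.inf))
print(edit_distance("application", "applet"))
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))
```

The call `edit_distance("application", "applet")` reproduces the value 6 quoted on the slide: after the common prefix "appl", five characters are deleted and one is substituted.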

SEMWEB 12

Examples of Distance Functions

Mahalanobis distance
for vectors with correlated dimensions

Hausdorff distance
for sets with elements related by another distance

Earth mover's distance
primarily for histograms (sets of weighted features)

and many others
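As an illustration of a distance defined on top of another distance, here is a small sketch of the Hausdorff distance between two finite, non-empty sets. The function name and the example data are assumptions made for this sketch; the element distance `d` is passed in as a parameter.

```python
def hausdorff(A, B, d):
    """Hausdorff distance between finite non-empty sets A and B under element distance d."""
    def directed(S, T):
        # the farthest that any element of S is from its nearest element of T
        return max(min(d(s, t) for t in T) for s in S)

    return max(directed(A, B), directed(B, A))


def abs_diff(x, y):
    return abs(x - y)


print(hausdorff([0, 1], [0, 5], abs_diff))
```

Here the element 5 of the second set is 4 away from its nearest neighbour in the first set, so the distance is 4 even though every element of the first set has a close match.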

SEMWEB 13

Image MUFIN overlay
A demo on the CoPhIR 50M dataset (280-dimensional vectors)

Five combined MPEG-7 global descriptors:
Color Structure, max. dist.: 40, weight: 3
Color Layout, max. dist.: 300, weight: 2
Scalable Color, max. dist.: 3000, weight: 2
Edge Histogram, max. dist.: 68, weight: 4
Homogeneous Texture, max. dist.: 25, weight: 0.5
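The slide lists a maximal distance and a weight per descriptor but does not spell out how they are combined. One plausible reading, shown here purely as an assumption, is a weighted sum of the per-descriptor distances, each normalized by its maximal distance. The dictionary keys and the function name are hypothetical.

```python
# (max. dist., weight) per descriptor, as listed on the slide
DESCRIPTORS = {
    "color_structure": (40.0, 3.0),
    "color_layout": (300.0, 2.0),
    "scalable_color": (3000.0, 2.0),
    "edge_histogram": (68.0, 4.0),
    "homogeneous_texture": (25.0, 0.5),
}


def combined_distance(partial):
    """Weighted sum of normalized per-descriptor distances (assumed aggregation).

    `partial` maps each descriptor name to the distance already computed
    between two images for that descriptor.
    """
    return sum(
        weight * partial[name] / max_dist
        for name, (max_dist, weight) in DESCRIPTORS.items()
    )


# Two images at the maximal distance in every descriptor
worst = {name: max_dist for name, (max_dist, _) in DESCRIPTORS.items()}
print(combined_distance(worst))
```

Under this assumption the combined distance ranges from 0 to the sum of the weights (11.5), and a weighted sum of metrics with positive weights is itself a metric, so the result can still be indexed by metric techniques.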

SEMWEB 14

Face search
Face search demo – 6k images with people

face detection – 10k detected faces
face description – 64-dimensional vectors
face comparison – advanced face descriptor (MPEG-7)

Based on publicly available software

SEMWEB 15

SCALABILITY
Structured P2P networks

Objectives
To scale to contemporary audio-visual data volumes and query execution throughput, i.e.:

billions of objects
online response time
hundreds of queries per sec.

A peer
Contains metric objects, can issue/answer queries, and knows a few other peers

SEMWEB 16

Why structured P2P in MUFIN
Structured P2P networks employ a globally consistent protocol to ensure that any peer can efficiently route a search to some peer that has the desired data

Structured P2P networks are used in MUFIN for:
no bottleneck, no central component
multiple access points to the network
distribution of workload – parallel query execution
dynamic structure of peers – (controlled) resilience, join, leave
mechanisms for fault tolerance, replication and load balancing

SEMWEB 17

P2P Architecture of MUFIN
• Native metric techniques: GHT*, VPT*
• Transformation techniques: MCAN, M-Chord

(Skip-Graphs, Kademlia, etc.)

SEMWEB 18

P2P Architecture of MUFIN
Peers are not necessarily computers
The peer size determines a lower bound on the query response time
A peer's data can be searched by:

Filtering
M-tree
D-index
I-distance
Etc.
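To make the "Filtering" option concrete, here is a minimal sketch of pivot-based filtering for a range query inside one peer. It relies only on the triangle inequality: if |d(q,p) − d(o,p)| > r for some pivot p, then d(q,o) > r and the object can be discarded without computing d(q,o). The function names, pivots and data are illustrative assumptions, not MUFIN's actual implementation.

```python
def build_pivot_table(objects, pivots, d):
    """Precompute the distance from every object to every pivot."""
    return [[d(o, p) for p in pivots] for o in objects]


def range_search(query, radius, objects, pivots, table, d):
    """Return (objects within radius of query, number of object distances computed)."""
    q_to_pivots = [d(query, p) for p in pivots]
    result, computed = [], 0
    for obj, row in zip(objects, table):
        # lower bound from the triangle inequality: d(q, o) >= |d(q, p) - d(o, p)|
        if any(abs(qp - op) > radius for qp, op in zip(q_to_pivots, row)):
            continue  # safely discarded without computing d(query, obj)
        computed += 1
        if d(query, obj) <= radius:
            result.append(obj)
    return result, computed


def abs_diff(x, y):
    return abs(x - y)


objects = [1, 5, 9, 14, 20]
pivots = [0, 20]
table = build_pivot_table(objects, pivots, abs_diff)
print(range_search(10, 2, objects, pivots, table, abs_diff))
```

In this toy run four of the five objects are filtered out by the pivots, so only one object distance is evaluated. With expensive distance functions this saving is the point of the technique.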

SEMWEB 19

Scalability test
1M: 50 peers – memory based
10M: 500 peers – memory based
50M: 2000 peers – disk based

Effectiveness improves with data volume
Efficiency
lower-bounded by the peer size (20k, 20k, 25k)
does not change significantly
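The peer sizes quoted in parentheses follow directly from dividing the data volume by the number of peers. A short check, with variable names chosen for this sketch:

```python
# (number of objects, number of peers) for the three test configurations
configs = [(1_000_000, 50), (10_000_000, 500), (50_000_000, 2000)]

peer_sizes = [objects // peers for objects, peers in configs]
print(peer_sizes)
```

Because the number of peers grows with the data, the average peer size stays between 20k and 25k objects, which is consistent with the claim that efficiency does not change significantly.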

SEMWEB 20

Infrastructure as a Service
Why:

Performance tuning
Query response time
Query execution throughput

Performance adjustment
Different performance requirements (day – night, weekend – working days)

Experimental trials
Test an application
Purchase new hardware

Availability - reliability

SEMWEB 21

MUFIN Hardware Mapping
10M network, 500 peers, memory-based
Batch of 250 queries started from 10 peers

        Sequential from 1 peer               Parallel from 10 peers
        total [s]   single query [ms]        total [s]   queries/s   single query [ms]
CPUs                avg     min    max                               avg     min    max
 1      379         1472    169    3248      380         0.66        12810   169    69692
 2      203          780    168    2320      186         1.34         6654   165    24483
 4      122          468    162    1847       94         2.66         3265   165    10355
 8       87          324    170    1806       49         5.10         1787   181     5736
16       67          259    183    1605       27         9.26          958   184     2691
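The throughput and speedup implied by the parallel batch can be recomputed from the total times. The sketch below reads the parallel totals for 1, 2, 4, 8 and 16 CPUs as 380, 186, 94, 49 and 27 seconds, which is an interpretation of the slide's table; the variable and function names are illustrative.

```python
BATCH_SIZE = 250  # queries in the batch

# total time [s] of the parallel batch, keyed by number of CPUs (as read from the table)
parallel_total_s = {1: 380, 2: 186, 4: 94, 8: 49, 16: 27}


def throughput(cpus):
    """Queries per second for the whole batch."""
    return BATCH_SIZE / parallel_total_s[cpus]


def speedup(cpus):
    """Speedup of the batch relative to the single-CPU run."""
    return parallel_total_s[1] / parallel_total_s[cpus]


for cpus in parallel_total_s:
    print(cpus, round(throughput(cpus), 2), round(speedup(cpus), 2),
          round(speedup(cpus) / cpus, 2))
```

The recomputed throughput matches the queries/s column (0.66 up to 9.26), and the speedup on 16 CPUs is about 14, i.e. close to linear scaling for this batch.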

SEMWEB 22

MUFIN Overview

[Diagram labels:]
External index
Feature extraction
Peer-to-Peer Networks
Multi-overlay structure

Forms
• range
• k-nearest
• complex

Strategies
• precise
• approximate
• social

insert, delete
features

Web service

Universal
• batch, telnet, GUI

Specialized
• image web interface
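The two basic query forms named in the overview, range and k-nearest, can be stated precisely with a linear-scan reference implementation. This is only a baseline sketch against which an index could be checked. The function names and sample data are assumptions, and a real peer would use one of the index structures instead of scanning.

```python
import heapq


def range_query(data, q, r, d):
    """All objects whose distance from q is at most r."""
    return [o for o in data if d(q, o) <= r]


def knn_query(data, q, k, d):
    """The k objects closest to q, nearest first."""
    return heapq.nsmallest(k, data, key=lambda o: d(q, o))


def abs_diff(x, y):
    return abs(x - y)


data = [1, 4, 6, 10]
print(range_query(data, 5, 1, abs_diff))
print(knn_query(data, 5, 3, abs_diff))
```

Both forms depend only on the distance function, so the same code works for any of the metrics discussed earlier.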

SEMWEB 23

MUFIN plugin
News web-sites contain images
CNN, BBC, SEZNAM, iDNES

Photography collection of US National Parks
TERRA GALLERIA

Image text search
Google, Yahoo, Yandex, Ask, Seznam, Rajče, exalead

SEMWEB 24

Use of MUFIN in SAPIR Demos
Caching – to locate cached queries
Text+Image – to perform content similarity
Video search – to perform content similarity
Mobile interface – to perform content similarity

Some statistics
Permanent demo: coming soon