Improved search for Socially Annotated Data
Authors: Nikos Sarkas, Gautam Das, Nick Koudas
Presented by: Amanda Cohen Mostafavi


Introduction

•Social Annotation: A process where users collaboratively assign a short sequence of keywords (tags) to a number of resources
▫Each tag sequence is a concise and accurate summary of the resource’s content
▫Meant to aid navigation through a collection

•Leads to searching via tags
▫Enables relevant text retrieval
▫Allows accurate retrieval of non-textual objects
▫Presents a need for an efficient retrieval and ranking method based on user tags

RadING

•Ranking annotated data using Interpolated N-Grams

•Searching and ranking method based exclusively on user tags

•Uses interpolated n-grams to model tag sequences associated with every resource

•How does it rank?

Probabilistic Foundations

•Goal: To rank resources by the probability that they will be relevant to the query

•Given a keyword query Q and a resource R from the collection, we apply Bayes' theorem to get:

$$p(R \text{ is relevant} \mid Q) = \frac{p(Q \mid R \text{ is relevant})\, p(R \text{ is relevant})}{p(Q)}$$

where p(R is relevant) is the probability that R is relevant, independent of the query posed, and p(Q) is the probability of the query being issued

•p(R is relevant) is assumed constant across the resource collection, and p(Q) is the same for every resource because the query is fixed
▫Meaning: ranking resources by p(R is relevant|Q) is equivalent to ranking by p(Q|R is relevant)

•In order to estimate the probability of the query being “generated” by each resource, resources need to be modeled based on knowledge of the social annotation process
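
The ranking equivalence can be written out in one step. This is only a restatement of the slide's argument, assuming (as the slide does) that p(R is relevant) takes the same value for every resource and that the query Q is fixed:

```latex
\begin{align*}
p(R \text{ is relevant} \mid Q)
  &= \frac{p(Q \mid R \text{ is relevant})\, p(R \text{ is relevant})}{p(Q)} \\
  &\propto p(Q \mid R \text{ is relevant})
\end{align*}
```

Both dropped factors are identical for all resources, so they rescale every score by the same amount and leave the order unchanged.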

Dynamics and Properties of the Social Annotation Process

•The goal of the tagging process is to describe the resource’s content

•User opinions crystallize quickly: annotation trends can be found after witnessing a small number of assignments

•Therefore we assume the following:
▫p(Q | R is relevant) = p(Q is used to tag R)
▫In plain terms: users draw keyword sequences from the same distribution both to tag and to search for a resource

Social Annotation Process: Things to consider…

•Resources are rarely given assignments with only one tag

•Tag positions are not random: they progress from left to right, from more general to more specific

•Tags representing different perspectives on a resource are less likely to occur together in the same assignment

•N-gram models are used to model these co-occurrence patterns

N-gram Models

•Given an assignment made up of a sequence s of l tags t_1, …, t_l, the probability of this sequence being assigned to a resource is:
▫p(t_1, …, t_l) = p(t_1) p(t_2 | t_1) … p(t_l | t_1, …, t_{l-1})

•The purpose of using n-gram models is to approximate each conditional probability by conditioning on only the last n-1 tags
▫In the case of a bi-gram model, p(t_k | t_1, …, t_{k-1}) is approximated by p(t_k | t_{k-1})
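
As a rough illustration of the bi-gram approximation, the sketch below multiplies one conditional probability per tag, working in log space to avoid underflow. The function names, the `<s>` start marker used in place of a predecessor for the first tag, and the toy probabilities are all assumptions made for this example, not part of the presentation.

```python
import math

def sequence_log_prob(tags, cond_prob, start="<s>"):
    """Log-probability of a tag sequence under a bi-gram approximation.

    cond_prob(prev, cur) is a hypothetical callable returning p(cur | prev).
    The start marker stands in for the missing predecessor of the first tag.
    """
    log_p = 0.0
    prev = start
    for tag in tags:
        log_p += math.log(cond_prob(prev, tag))
        prev = tag
    return log_p

# Toy conditional probabilities, invented for illustration only.
TOY = {
    ("<s>", "python"): 0.5,
    ("python", "tutorial"): 0.4,
    ("tutorial", "beginner"): 0.25,
}

def toy_cond_prob(prev, cur):
    # Unseen pairs get a small floor value so the logarithm is defined.
    return TOY.get((prev, cur), 0.01)

log_p = sequence_log_prob(["python", "tutorial", "beginner"], toy_cond_prob)
prob = math.exp(log_p)  # product of the three toy conditionals
```

Each factor depends only on the immediately preceding tag, which is what keeps the number of parameters per resource manageable.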

•Calculate the probability using the maximum likelihood estimate:

$$p(t_2 \mid t_1) = \frac{c(t_1, t_2)}{\sum_{t} c(t_1, t)}$$

•c(t1, t2) = the number of occurrences of the bi-gram t1, t2

•The summation in the denominator is the sum of the occurrences of all bi-grams with t1 as the first tag
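
A minimal sketch of the maximum likelihood estimate, p(t2|t1) = c(t1, t2) / Σ_t c(t1, t), computed by counting adjacent tag pairs. The function name and the toy assignments are invented for illustration.

```python
from collections import Counter

def bigram_mle(assignments):
    """Maximum likelihood bi-gram estimates from a list of tag sequences."""
    pair_counts = Counter()   # c(t1, t2)
    first_counts = Counter()  # sum over t of c(t1, t)
    for tags in assignments:
        for t1, t2 in zip(tags, tags[1:]):
            pair_counts[(t1, t2)] += 1
            first_counts[t1] += 1
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}

# Toy assignments for a single resource, invented for illustration only.
toy_assignments = [
    ["python", "tutorial"],
    ["python", "tutorial", "beginner"],
    ["python", "programming"],
]
mle = bigram_mle(toy_assignments)
```

Any bi-gram that never occurs in the assignments is simply absent from the result, which corresponds to an estimate of zero.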

Interpolation

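•Interpolation is used to compensate for sparse data: it redistributes probability mass from high counts to low counts

•Used the Jelinek-Mercer interpolation technique. Applied to a bi-gram, it yields:

$$p(t_2 \mid t_1) = \lambda_1 \hat{p}(t_2 \mid t_1) + \lambda_2 \hat{p}(t_2) + \lambda_0 p_{bg}(t_2)$$

$$0 \le \lambda_0, \lambda_1, \lambda_2 \le 1, \qquad \lambda_0 + \lambda_1 + \lambda_2 = 1$$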
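
The interpolation can be sketched as a weighted sum of three estimates whose weights are non-negative and sum to 1. The function below is illustrative only: its name and argument names are made up, and reading the `bg` subscript as a background (collection-wide) probability is an assumption.

```python
def interpolated_bigram(p_bigram, p_unigram, p_background, lam1, lam2):
    """Jelinek-Mercer style mix of bi-gram, unigram and background estimates.

    lam0 is implied by the constraint lam0 + lam1 + lam2 = 1.
    """
    lam0 = 1.0 - lam1 - lam2
    if lam1 < 0.0 or lam2 < 0.0 or lam0 < -1e-12:
        raise ValueError("weights must be non-negative and sum to at most 1")
    lam0 = max(lam0, 0.0)  # absorb floating-point round-off
    return lam1 * p_bigram + lam2 * p_unigram + lam0 * p_background

# A bi-gram never seen for this resource still receives some probability mass.
smoothed = interpolated_bigram(0.0, 0.2, 0.01, lam1=0.6, lam2=0.3)
```

With a zero bi-gram estimate the result falls back on the unigram and background terms, which is how sparse counts are compensated.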

Parameter Optimization

•Goal: to maximize the likelihood function L(λ1,λ2) in order to find the ideal interpolation parameters

•Definitions:
▫D*: The constrained domain of λ1 and λ2
▫λ*: The global maximum of L(λ1,λ2)
▫λc: The point at which L(λ1,λ2) attains its maximum value within D*; this is the point that must be found to optimize the parameters
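
To make the objective concrete, the sketch below evaluates a log-likelihood over a coarse grid and keeps the best feasible point. This is a plain grid search used as a stand-in, not the RadING optimization framework. It assumes that D* is the region λ1 ≥ 0, λ2 ≥ 0, λ1 + λ2 ≤ 1, and that the likelihood is computed over a list of observed tag events, each given as a hypothetical triple of bi-gram, unigram and background estimates.

```python
import math

def log_likelihood(events, lam0, lam1, lam2):
    """Sum of log interpolated probabilities over (p_bi, p_uni, p_bg) triples."""
    total = 0.0
    for p_bi, p_uni, p_bg in events:
        mix = lam1 * p_bi + lam2 * p_uni + lam0 * p_bg
        if mix <= 0.0:
            return -math.inf  # this weight setting cannot explain the event
        total += math.log(mix)
    return total

def grid_search(events, steps=20):
    """Best (log-likelihood, lam1, lam2) on a grid covering the assumed D*."""
    best_ll, best_l1, best_l2 = -math.inf, 0.0, 0.0
    for i in range(steps + 1):
        for j in range(steps + 1 - i):  # keeps lam1 + lam2 <= 1
            lam1, lam2 = i / steps, j / steps
            lam0 = (steps - i - j) / steps
            ll = log_likelihood(events, lam0, lam1, lam2)
            if ll > best_ll:
                best_ll, best_l1, best_l2 = ll, lam1, lam2
    return best_ll, best_l1, best_l2

# Toy events, invented for illustration only.
toy_events = [(0.5, 0.2, 0.01), (0.0, 0.1, 0.02), (0.8, 0.3, 0.01)]
best_ll, lam1, lam2 = grid_search(toy_events)
```

A grid search costs on the order of `steps` squared likelihood evaluations per resource, which hints at why a dedicated optimization procedure matters when one model is trained per resource.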

RadING Optimization Framework

•Step 1: If L(λ1,λ2) is unbounded, perform 1D optimization to locate λc

•Step 2: If L(λ1,λ2) is bounded, apply 2D optimization to find λ*

•Step 3: If λ* is not in D*, locate λc

Searching Process

•Step 1: Train a bi-gram model for each resource
▫Compute the bi-gram and unigram probabilities and optimize the interpolation parameters

•Step 2: At query time, compute the probability of the query keyword sequence being generated by each resource’s bi-gram model:

$$p_R(q_1, \ldots, q_k) = \prod_{j=1}^{k} p(q_j \mid q_{j-1})$$

•Use the Threshold Algorithm to compute the top-k results
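
The query-time step can be sketched as scoring the query under every resource's model and keeping the k best. For brevity this example scans all resources exhaustively with `heapq.nlargest` instead of using the Threshold Algorithm. The helper names, the `<s>` marker standing in for the predecessor of the first keyword, the floor value for unseen pairs, and the toy models are all assumptions.

```python
import heapq
import math

def make_model(table, floor=1e-4):
    """Hypothetical per-resource model: p(cur | prev) with a floor for unseen pairs."""
    return lambda prev, cur: table.get((prev, cur), floor)

def query_log_prob(query, model, start="<s>"):
    """log p_R(q_1, ..., q_k) as a sum of log p(q_j | q_{j-1})."""
    log_p, prev = 0.0, start
    for keyword in query:
        log_p += math.log(model(prev, keyword))
        prev = keyword
    return log_p

def top_k(query, models, k):
    """Names of the k resources most likely to have generated the query."""
    scored = ((query_log_prob(query, m), name) for name, m in models.items())
    return [name for _, name in heapq.nlargest(k, scored)]

# Toy per-resource models, invented for illustration only.
models = {
    "url_a": make_model({("<s>", "python"): 0.5, ("python", "tutorial"): 0.4}),
    "url_b": make_model({("<s>", "python"): 0.3, ("python", "tutorial"): 0.1}),
    "url_c": make_model({("<s>", "cooking"): 0.6}),
}
ranking = top_k(["python", "tutorial"], models, k=2)
```

An exhaustive scan touches every resource for every query; the Threshold Algorithm named on the slide is a way to obtain the same top-k answer without doing so.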

Searching Example

Experimental Evaluation

•Test data: web crawl of del.icio.us
▫70,658,851 assignments
▫Posted by 567,539 users
▫Attached to 24,245,248 unique URLs
▫Average length of assignment: 2.77
▫Standard deviation: 2.70
▫Median: 2

Optimization Efficiency

Ranking Effectiveness

•Compares the RadING ranking method to adaptations of tf/idf ranking
▫Tf/Idf: concatenates a resource’s assignments into a document and performs ranking based on tf/idf similarity to each document
▫Tf/Idf+: computes the tf/idf similarity of each individual assignment and ranks resources based on average similarity

•10 judges contacted through Amazon Mechanical Turk to measure precision
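
For orientation, the plain Tf/Idf baseline might look roughly like the sketch below: each resource's assignments are concatenated into one bag of tags, and resources are ranked by cosine similarity to the query. The weighting (raw term frequency times log(N/df)) and the cosine measure are assumptions, since the slides do not say which tf/idf variant was used, and the toy data is invented.

```python
import math
from collections import Counter

def tfidf_rank(resources, query):
    """Rank resource names by tf-idf cosine similarity to the query.

    resources maps a name to a list of assignments (each a list of tags).
    """
    # Concatenate every assignment of a resource into one "document".
    docs = {name: Counter(tag for a in assigns for tag in a)
            for name, assigns in resources.items()}
    n_docs = len(docs)
    df = Counter(tag for counts in docs.values() for tag in counts)

    def vector(counts):
        # Tags unknown to the collection are skipped.
        return {t: tf * math.log(n_docs / df[t])
                for t, tf in counts.items() if t in df}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm = (math.sqrt(sum(w * w for w in u.values()))
                * math.sqrt(sum(w * w for w in v.values())))
        return dot / norm if norm else 0.0

    q_vec = vector(Counter(query))
    scores = {name: cosine(q_vec, vector(counts)) for name, counts in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy collection, invented for illustration only.
resources = {
    "url_a": [["python", "tutorial"], ["python", "programming"]],
    "url_b": [["cooking", "recipes"], ["cooking", "tutorial"]],
    "url_c": [["python", "snake", "zoo"]],
}
tfidf_ranking = tfidf_rank(resources, ["python", "tutorial"])
```

Concatenation discards tag order and assignment boundaries, which is the information the n-gram approach is designed to keep.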
