
Improved Search for Socially Annotated Data
Authors: Nikos Sarkas, Gautam Das, Nick Koudas
Presented by: Amanda Cohen Mostafavi

Introduction

•Social annotation: a process in which users collaboratively assign a short sequence of keywords (tags) to a number of resources
▫Each tag sequence is a concise and accurate summary of the resource’s content
▫Meant to aid navigation through a collection

•Leads to searching via tags
▫Enables relevant text retrieval
▫Allows accurate retrieval of non-textual objects
▫Presents a need for an efficient retrieval and ranking method based on user tags

RadING

•Ranking annotated data using Interpolated N-Grams

•Searching and ranking method based exclusively on user tags

•Uses interpolated n-grams to model tag sequences associated with every resource

•How does it rank?

Probabilistic Foundations

•Goal: To rank resources by the probability that they will be relevant to the query

•Given a keyword query Q and a resource R from the collection, we apply Bayes' theorem to get:

$$p(R \text{ is relevant} \mid Q) = \frac{p(Q \mid R \text{ is relevant})\, p(R \text{ is relevant})}{p(Q)}$$

where p(R is relevant) is the probability that R is relevant, independent of the query posed, and p(Q) is the probability of the query being issued

Probabilistic Foundations

•p(R is relevant) is constant throughout the resource collection, as well as p(Q)
▫Meaning: ranking resources by p(R is relevant|Q) is equivalent to ranking by p(Q|R is relevant)

•In order to estimate the probability of the query being “generated” by each resource, resources need to be modeled based on knowledge of social annotation
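The equivalence claimed on this slide can be spelled out in one step. The resources R_1 and R_2 below are hypothetical names introduced only for illustration; the step relies on the slide's assumption that p(R is relevant) is the same for every resource.

```latex
% Both p(R is relevant) and p(Q) are the same for every resource,
% so they act as a positive constant factor in the posterior:
\[
p(R \text{ is relevant} \mid Q)
  = \underbrace{\frac{p(R \text{ is relevant})}{p(Q)}}_{\text{constant over resources}}
    \; p(Q \mid R \text{ is relevant})
  \;\propto\; p(Q \mid R \text{ is relevant})
\]
% Hence, for any two resources R_1 and R_2:
\[
p(R_1 \text{ is relevant} \mid Q) > p(R_2 \text{ is relevant} \mid Q)
\iff
p(Q \mid R_1 \text{ is relevant}) > p(Q \mid R_2 \text{ is relevant})
\]
```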

Dynamics and Properties of the Social Annotation Process

•The goal of the tagging process is to describe the resource’s content

•User opinions crystallize quickly: annotation trends can be identified after witnessing a small number of assignments

•Therefore we assume the following:
▫p(Q | R is relevant) = p(Q is used to tag R)
▫In English: users will use keyword sequences derived from the same distribution to both tag and search for a resource

Social Annotation Process: Things to consider…

•Resources are rarely given assignments with only one tag

•Also, tag positions are not random: they progress from left to right, from more general to more specific

•Tags representing different perspectives on a resource are less likely to occur together in the same assignment

•N-gram models are used to model these co-occurrence patterns

N-gram Models

•Given an assignment made up of a sequence (s) of l tags t1,…,tl, the probability of this sequence being assigned to a resource is:
▫p(t1,…,tl) = p(t1)p(t2|t1)…p(tl|t1,…,tl-1)

•The purpose of using n-gram models is to approximate each conditional probability by conditioning on only the preceding n-1 tags
▫In the case of a bi-gram model, p(tk|t1,…,tk-1) is approximated by p(tk|tk-1)
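As a small worked instance of the bi-gram approximation, consider an assignment of three tags. The tags "web", "design" and "css" are hypothetical and chosen only for illustration.

```latex
% Exact chain rule for a three-tag assignment:
\[
p(\text{web}, \text{design}, \text{css})
  = p(\text{web})\; p(\text{design} \mid \text{web})\; p(\text{css} \mid \text{web}, \text{design})
\]
% Bi-gram approximation: each tag is conditioned on its immediate predecessor only.
\[
p(\text{web}, \text{design}, \text{css})
  \approx p(\text{web})\; p(\text{design} \mid \text{web})\; p(\text{css} \mid \text{design})
\]
```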

N-gram Models

•Calculate the probability using the Maximum Likelihood equation:

$$p(t_2 \mid t_1) = \frac{c(t_1, t_2)}{\sum_{t} c(t_1, t)}$$

•c(t1, t2) = the number of occurrences of the bi-gram

•The summation is the sum of the occurrences of all bi-grams involving t1 as the first tag
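The maximum likelihood estimate above amounts to counting. The following is a minimal sketch, assuming each assignment is a Python list of tags; the function name `bigram_mle` and the sample assignments are hypothetical and not taken from the presentation.

```python
from collections import Counter


def bigram_mle(assignments):
    """Build the ML estimate of p(t2 | t1) from a resource's tag assignments."""
    pair_counts = Counter()   # c(t1, t2)
    first_counts = Counter()  # sum over t of c(t1, t)
    for tags in assignments:
        for t1, t2 in zip(tags, tags[1:]):
            pair_counts[(t1, t2)] += 1
            first_counts[t1] += 1

    def p(t2, t1):
        total = first_counts[t1]
        # No bi-gram starts with t1: the estimate is undefined, return 0.0 here.
        return pair_counts[(t1, t2)] / total if total else 0.0

    return p


# Hypothetical assignments attached to a single resource.
assignments = [
    ["python", "tutorial", "beginner"],
    ["python", "tutorial"],
    ["python", "reference"],
]
p = bigram_mle(assignments)
print(p("tutorial", "python"))  # 2 of the 3 bi-grams that start with "python"
print(p("java", "python"))      # unseen bi-gram: the ML estimate is zero
```

The zero estimate for unseen bi-grams is the sparse-data problem that interpolation addresses.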

Interpolation

•Interpolation is used to compensate for sparse data by distributing probability mass from high counts to low counts

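•Used the Jelinek-Mercer interpolation technique. Applied to a bi-gram, it yields:

$$p(t_2 \mid t_1) = \lambda_2\, \hat{p}(t_2 \mid t_1) + \lambda_1\, \hat{p}(t_2) + \lambda_0\, p_{bg}(t_2)$$

$$0 \le \lambda_0, \lambda_1, \lambda_2 \le 1, \qquad \lambda_0 + \lambda_1 + \lambda_2 = 1$$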
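A minimal sketch of Jelinek-Mercer interpolation for a bi-gram model, assuming the three components are the resource's ML bi-gram estimate, its ML unigram estimate, and a collection-wide background probability supplied as a dictionary. The function names, the weights `lam1` and `lam2` (with the background weight taken as `1 - lam1 - lam2`), and the sample data are all hypothetical.

```python
from collections import Counter


def train_counts(assignments):
    """Collect unigram and bi-gram counts from one resource's assignments."""
    unigrams, bigrams, firsts = Counter(), Counter(), Counter()
    for tags in assignments:
        unigrams.update(tags)
        for t1, t2 in zip(tags, tags[1:]):
            bigrams[(t1, t2)] += 1
            firsts[t1] += 1
    return unigrams, bigrams, firsts


def interpolated(t2, t1, counts, p_bg, lam1, lam2):
    """Interpolated p(t2 | t1): bi-gram, unigram and background components."""
    unigrams, bigrams, firsts = counts
    lam0 = 1.0 - lam1 - lam2  # the three weights sum to 1
    total = sum(unigrams.values())
    p_uni = unigrams[t2] / total if total else 0.0
    p_bi = bigrams[(t1, t2)] / firsts[t1] if firsts[t1] else 0.0
    return lam2 * p_bi + lam1 * p_uni + lam0 * p_bg.get(t2, 0.0)


# Hypothetical resource assignments and background tag distribution.
counts = train_counts([["python", "tutorial"], ["python", "reference"]])
p_bg = {"python": 0.5, "tutorial": 0.2, "reference": 0.1, "java": 0.2}

seen = interpolated("tutorial", "python", counts, p_bg, lam1=0.3, lam2=0.6)
unseen = interpolated("java", "python", counts, p_bg, lam1=0.3, lam2=0.6)
print(seen, unseen)
```

The tag "java" never appears on the resource, yet it receives a small non-zero probability through the background term, so a single unseen query keyword no longer zeroes out the whole query.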

Parameter Optimization

•Goal: to maximize the likelihood function L(λ1,λ2) in order to find the ideal interpolation parameters

•Definitions:
▫D*: The constrained domain of λ1 and λ2
▫λ*: The global maximum of L(λ1,λ2)
▫λc: The point at which L(λ1,λ2) evaluates to its maximum value within D*, which must be found to optimize parameters

RadING Optimization Framework

•Step 1: If L(λ1,λ2) is unbounded, perform 1D optimization to locate λc

•Step 2: If L(λ1,λ2) is bounded, apply 2D optimization to find λ*

•Step 3: If λ* is not in D*, locate λc
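To make the objective concrete, the sketch below evaluates a log-likelihood over the constrained domain (λ1, λ2 ≥ 0, λ1 + λ2 ≤ 1) and picks the best point by plain grid search. This is a deliberately naive stand-in and not the RadING optimization framework described in the steps above. It assumes the likelihood is computed over a set of held-out bi-grams, each given as a hypothetical triple of component probabilities (bi-gram, unigram, background); the function names and numbers are illustrative only.

```python
import math


def log_likelihood(lams, components):
    """Log-likelihood of held-out bi-grams for weights (lam0, lam1, lam2)."""
    lam0, lam1, lam2 = lams
    total = 0.0
    for p_bi, p_uni, p_bg in components:
        p = lam2 * p_bi + lam1 * p_uni + lam0 * p_bg
        if p <= 0.0:
            return float("-inf")
        total += math.log(p)
    return total


def grid_search(components, steps=20):
    """Exhaustively scan the constrained domain on a regular grid."""
    best_ll, best_lams = float("-inf"), None
    for i in range(steps + 1):           # lam1 = i / steps
        for j in range(steps + 1 - i):   # lam2 = j / steps, lam1 + lam2 <= 1
            lams = ((steps - i - j) / steps, i / steps, j / steps)
            ll = log_likelihood(lams, components)
            if ll > best_ll:
                best_ll, best_lams = ll, lams
    return best_ll, best_lams


# Hypothetical (bi-gram, unigram, background) probabilities of held-out bi-grams.
components = [(0.5, 0.25, 0.2), (0.0, 0.1, 0.05), (1.0, 0.5, 0.1)]
best_ll, best_lams = grid_search(components)
print(best_ll, best_lams)
```

Grid search costs a number of likelihood evaluations quadratic in `steps` for every resource, which illustrates why a cheaper dedicated procedure is desirable when a model must be trained per resource.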

Searching Process

•Step 1: Train a bi-gram model for each resource
▫Compute the bi-gram and unigram probability and optimize the interpolation parameters

•Step 2: At query-time compute the probability of the query keyword sequence being generated by each resource’s bi-gram model

$$p(q_1, \ldots, q_k \mid R) = \prod_{j=1}^{k} p(q_j \mid q_{j-1})$$

•Use Threshold Algorithm to compute top-k results
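The query-time scoring step can be sketched as follows. This version scores every resource and keeps the k best with a heap; it is a brute-force stand-in and does not implement the Threshold Algorithm named above. It assumes each resource's model is a callable returning p(t | prev), that the first keyword is conditioned on a start marker `"<s>"`, and that probabilities are combined in log space; the helper `table_model`, its `floor` value, and the resource names are hypothetical.

```python
import heapq
import math


def query_log_prob(query, cond_prob, start="<s>"):
    """Sum of log p(q_j | q_{j-1}) over the query keywords."""
    prev, total = start, 0.0
    for q in query:
        p = cond_prob(q, prev)
        if p <= 0.0:
            return float("-inf")
        total += math.log(p)
        prev = q
    return total


def top_k(query, models, k):
    """Score every resource and return the k highest (score, name) pairs."""
    scored = ((query_log_prob(query, m), name) for name, m in models.items())
    return heapq.nlargest(k, scored)


def table_model(table, floor=1e-6):
    """Toy model: look up p(t | prev) in a table, with a small smoothing floor."""
    return lambda t, prev: table.get((prev, t), floor)


# Hypothetical per-resource models.
models = {
    "url_a": table_model({("<s>", "python"): 0.6, ("python", "tutorial"): 0.5}),
    "url_b": table_model({("<s>", "python"): 0.3, ("python", "tutorial"): 0.1}),
    "url_c": table_model({("<s>", "java"): 0.7}),
}
results = top_k(["python", "tutorial"], models, k=2)
print([name for _, name in results])
```

Working in log space turns the product into a sum, which avoids numerical underflow for longer queries and is also the form a monotone top-k aggregation would operate on.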

Searching Example

Experimental Evaluation

•Test data: web crawl of del.icio.us
▫70,658,851 assignments
▫Posted by 567,539 users
▫Attached to 24,245,248 unique URLs
▫Average length of assignment: 2.77
▫Standard deviation: 2.70
▫Median: 2

Optimization Efficiency

Ranking Effectiveness

•Compares the RadING ranking method to adaptations of tf/idf ranking
▫Tf/Idf: concatenates a resource’s assignments into a document and performs ranking based on tf/idf similarity to each document
▫Tf/Idf+: computes the tf/idf similarity of each individual assignment and ranks resources based on average similarity

•10 judges contacted through Amazon Mechanical Turk to measure precision

Ranking Effectiveness
