(Some issues in) Text Ranking
Recall General Framework
• Crawl
  – Use XML structure
  – Follow links to get new pages
• Retrieve relevant documents
  – Today
• Rank
  – PageRank, HITS
  – Rank Aggregation
Relevant documents
• Usually: relevant with respect to a keyword, a set of keywords, or a logical expression
• Closely related to ranking
  – “How relevant” a document is can be considered another ranking measure
• Usually done as a separate step
  – Recall the online vs. offline issue
• But some techniques are reusable
Defining Relevant Documents
• Common strategy: treat text documents as a “bag of words” (BOW)
  – Denote BOW(D) for a document D
  – Bag rather than set (i.e. multiplicity is kept)
  – Words are typically stemmed, i.e. reduced to root form
  – Loses structure, but simplifies life
• Simple definition:
  – A document D is relevant to a keyword W if W is in BOW(D)
Defining Relevant Documents (cont.)
• Simple variant
  – The level of relevance of D to W is the multiplicity of W in BOW(D)
  – Problem: bias towards long documents
  – So divide by the document length |BOW(D)|
  – This is called term frequency (TF)
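As a minimal sketch of the definitions above, the following Python builds a bag of words and computes term frequency. The function names and the whitespace tokenizer are assumptions made for illustration; a real system would tokenize and stem properly.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase and split on whitespace (a stand-in for real tokenization and stemming)."""
    return Counter(text.lower().split())

def term_frequency(word, bow):
    """Multiplicity of `word` in the bag divided by the bag size |BOW(D)|."""
    total = sum(bow.values())
    return bow[word] / total if total else 0.0

doc = "the beatles came from liverpool and the beatles changed pop music"
bow = bag_of_words(doc)

is_relevant = "beatles" in bow               # simple definition: W is in BOW(D)
tf_beatles = term_frequency("beatles", bow)  # 2 occurrences out of 11 words
```

Using a `Counter` keeps multiplicities, which is exactly the "bag rather than set" distinction.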
A different angle
• Given a document D, what are the “most important” words in D?
• Clearly high term frequency should be considered
• Rank terms according to TF?
Ranking according to TF
A          2022
Is         1023
He         350
...
Liverpool  25
Beatles    12
IDF (inverse document frequency)
• Observation: if W is rare in the document set, but appears many times in a document D, then W is “important” for D
• IDF(W) = log(|Docs| / |Docs’|)
  – Docs is the set of all documents in the corpus, Docs’ is the subset of documents that contain W
• TFIDF(D,W) = TF(W,D) * IDF(W)
  – “Correlation” of D and W
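The two formulas can be sketched in a few lines of Python. The toy corpus and function names are illustrative assumptions, and returning 0.0 for a word that appears in no document is a convention chosen here, not something the slides specify.

```python
import math
from collections import Counter

def tf(word, bow):
    """Term frequency: multiplicity of `word` divided by the document length."""
    total = sum(bow.values())
    return bow[word] / total if total else 0.0

def idf(word, bows):
    """log(|Docs| / |Docs'|), where Docs' are the documents containing `word`."""
    containing = sum(1 for bow in bows if word in bow)
    # Assumption of this sketch: a word found in no document gets IDF 0.0.
    return math.log(len(bows) / containing) if containing else 0.0

def tfidf(word, bow, bows):
    return tf(word, bow) * idf(word, bows)

corpus = [
    "a song by the beatles",
    "a port city called liverpool",
    "a flea is a small insect",
]
bows = [Counter(text.split()) for text in corpus]

score_a = tfidf("a", bows[0], bows)              # "a" occurs in every document
score_beatles = tfidf("beatles", bows[0], bows)  # "beatles" occurs in one document
```

A word such as "a" that occurs in every document has IDF log(1) = 0, so its TFIDF score vanishes however often it appears.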
Inverted Index
• For every term we keep a list of all documents in which it appears
• The list is sorted by TFIDF scores
• Scores are also kept
• Given a keyword, it is then easy to return the top-k documents
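A small sketch of such an index, assuming documents are given as a dictionary from a document id to its text. The names `build_index` and `top_k` are hypothetical, and the tie-break on document id is an arbitrary choice.

```python
import math
from collections import Counter, defaultdict

def build_index(docs):
    """Map each term to a list of (score, doc_id), sorted by descending TF-IDF."""
    bows = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    # Number of documents that contain each term.
    doc_freq = Counter(term for bow in bows.values() for term in bow)
    index = defaultdict(list)
    for doc_id, bow in bows.items():
        length = sum(bow.values())
        for term, count in bow.items():
            score = (count / length) * math.log(len(docs) / doc_freq[term])
            index[term].append((score, doc_id))
    for postings in index.values():
        postings.sort(key=lambda p: (-p[0], p[1]))  # best score first
    return index

def top_k(index, term, k):
    """The k best documents for a single keyword: a prefix of its sorted list."""
    return index.get(term, [])[:k]

docs = {
    "d1": "beatles beatles liverpool",
    "d2": "beatles fridge can",
    "d3": "sailor dogs flea",
}
index = build_index(docs)
best = top_k(index, "beatles", 1)
```

Because each list is sorted once at build time, answering a single-keyword top-k query is just a slice of the list.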
Ranking
• Now assume that these documents are web pages
• How do we return the most relevant?
• How do we combine with other rankings? (e.g. PageRank?)
• How do we answer boolean queries?
  – X1 AND (X2 OR X3)
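One possible answer to the boolean-query question, sketched under the assumption that the index maps each term to the set of document ids containing it. The result is an unranked set; ordering it would still need a score such as TFIDF. All names here are illustrative.

```python
def postings(index, term):
    """Set of document ids containing `term` (empty if the term is unknown)."""
    return index.get(term, set())

def query_and_or(index, x1, x2, x3):
    """Evaluate X1 AND (X2 OR X3) with set intersection and union."""
    return postings(index, x1) & (postings(index, x2) | postings(index, x3))

index = {
    "beatles": {"d1", "d2", "d4"},
    "liverpool": {"d1", "d3"},
    "fridge": {"d2"},
}
hits = query_and_or(index, "beatles", "liverpool", "fridge")
```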
Rank Aggregation
• To combine TFIDF, PageRank, etc.
• To combine TFIDF with respect to different keywords
Part-of-Speech Tagging
• So far we have considered documents only as bags of words
• Computationally efficient, easy to program, BUT
• We lost the structure, which may be very important:
  – E.g. perhaps we are interested (more) in documents for which W is often the sentence subject?
• Part-of-speech tagging
  – Useful for ranking
  – For machine translation
  – Word-sense disambiguation
  – …
Part-of-Speech Tagging
• Tag this word. This word is a tag.
• He dogs like a flea
• The can is in the fridge
• The sailor dogs me every day
A Learning Problem
• Training set: tagged corpus
  – Most famous is the Brown Corpus with about 1M words
  – The goal is to learn a model from the training set, and then perform tagging of untagged text
  – Performance tested on a test set
Simple Algorithm
• Assign to each word its most popular tag in the training set
• Problem: ignores context
• “Dogs” and “tag” will always be tagged as nouns…
• “Can” will be tagged as a verb
• Still, achieves around 80% correctness for real-life test sets
  – Goes up to as high as 90% when combined with some simple rules
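The baseline can be sketched as follows. The tiny training set, the tag names and the default tag for unseen words are all assumptions made for illustration; they are not taken from the Brown Corpus.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Map each word to the tag it carries most often in the training set."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_baseline(model, words, default="NOUN"):
    """Tag each word independently; unseen words get `default` (an assumption of this sketch)."""
    return [(w, model.get(w.lower(), default)) for w in words]

training = [
    [("the", "DET"), ("can", "NOUN"), ("is", "VERB"),
     ("in", "ADP"), ("the", "DET"), ("fridge", "NOUN")],
    [("he", "PRON"), ("can", "VERB"), ("swim", "VERB")],
    [("she", "PRON"), ("can", "VERB"), ("run", "VERB")],
]
model = train_baseline(training)
tagged = tag_baseline(model, ["The", "can", "is", "in", "the", "fridge"])
```

Here "can" is a verb in two of its three training occurrences, so the baseline tags it as a verb even in "The can is in the fridge", which is the context problem described above.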
Hidden Markov Model (HMM)
• Model: sentences are generated by a probabilistic process
• In particular, a Markov chain whose states correspond to parts of speech
• Transitions are probabilistic
• In each state a word is output
  – The output word is again chosen probabilistically based on the state
HMM
• An HMM is:
  – A set of N states
  – A set of M symbols (words)
  – An N×N matrix of transition probabilities Ptrans
  – A vector of size N of initial state probabilities Pstart
  – An N×M matrix of emission probabilities Pout
• “Hidden” because we see only the outputs, not the sequence of states traversed
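To make the components concrete, here is a toy HMM with two states and three words, written as nested dictionaries, together with a sampler for the generative process. Every probability below is invented for illustration.

```python
import random

STATES = ["NOUN", "VERB"]                      # N = 2 states
P_START = {"NOUN": 0.7, "VERB": 0.3}           # initial state probabilities
P_TRANS = {                                    # N x N transition probabilities
    "NOUN": {"NOUN": 0.3, "VERB": 0.7},
    "VERB": {"NOUN": 0.8, "VERB": 0.2},
}
P_OUT = {                                      # N x M emission probabilities
    "NOUN": {"dogs": 0.5, "tag": 0.3, "can": 0.2},
    "VERB": {"dogs": 0.2, "tag": 0.3, "can": 0.5},
}

def draw(dist, rng):
    """Sample a key of `dist` with probability equal to its value."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]

def generate(length, rng):
    """Run the chain for `length` steps; return (hidden states, emitted words)."""
    states, words = [], []
    state = draw(P_START, rng)
    for _ in range(length):
        states.append(state)
        words.append(draw(P_OUT[state], rng))
        state = draw(P_TRANS[state], rng)
    return states, words

states, words = generate(5, random.Random(0))
```

A tagger only ever sees `words`; recovering `states` from them is the tagging problem.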
3 Fundamental Problems
1) Compute the probability of a given observation sequence (= sentence)
2) Given an observation sequence, find the most likely hidden state sequence
   – This is tagging
3) Given a training set, find the model that would make the observations most likely
Tagging
• Find the most likely sequence of states that led to an observed output sequence
• Problem: exponentially many possible sequences!
Viterbi Algorithm
• Dynamic programming
• $V_{t,k}$ is the probability of the most probable state sequence
  – Generating the first t + 1 observations $(X_0, \dots, X_t)$
  – And terminating at state k
• $V_{0,k} = P_{start}(k) \cdot P_{out}(k, X_0)$
• $V_{t,k} = P_{out}(k, X_t) \cdot \max_{k'} \{ V_{t-1,k'} \cdot P_{trans}(k', k) \}$
Finding the path
• Note that we are interested in the most likely path, not only in its probability
• So we need to keep track at each point of the argmax
  – Combine them to form a sequence
• What about top-k?
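The recurrence and the argmax bookkeeping can be sketched as follows, using an invented two-state toy model stored as nested dictionaries. This returns only the single best path; it does not address the top-k question.

```python
def viterbi(obs, states, p_start, p_trans, p_out):
    """Return (probability, path) of the most likely state sequence for `obs`."""
    # V[t][k]: probability of the best state sequence emitting obs[0..t] and ending in k.
    V = [{k: p_start[k] * p_out[k][obs[0]] for k in states}]
    back = [{}]  # back[t][k]: the argmax predecessor of state k at time t
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for k in states:
            prev = max(states, key=lambda j: V[t - 1][j] * p_trans[j][k])
            V[t][k] = p_out[k][obs[t]] * V[t - 1][prev] * p_trans[prev][k]
            back[t][k] = prev
    # Follow the stored argmax pointers backwards from the best final state.
    last = max(states, key=lambda k: V[-1][k])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return V[-1][last], path

# Toy model; all probabilities are invented for illustration.
STATES = ["NOUN", "VERB"]
P_START = {"NOUN": 0.7, "VERB": 0.3}
P_TRANS = {"NOUN": {"NOUN": 0.3, "VERB": 0.7}, "VERB": {"NOUN": 0.8, "VERB": 0.2}}
P_OUT = {
    "NOUN": {"dogs": 0.5, "tag": 0.3, "can": 0.2},
    "VERB": {"dogs": 0.2, "tag": 0.3, "can": 0.5},
}
prob, path = viterbi(["dogs", "can", "tag"], STATES, P_START, P_TRANS, P_OUT)
```

The two nested loops over states inside the loop over time steps give the quadratic dependence on the number of states.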
![Page 24: (Some issues in) Text Ranking](https://reader035.vdocuments.mx/reader035/viewer/2022081507/56815a0c550346895dc75886/html5/thumbnails/24.jpg)
Complexity
• O(T*|S|^2)
• Where T is the sequence (=sentence) length, |S| is the number of states (= number of possible tags)
Computing the probability of a sequence
• Forward probabilities: $\alpha_t(k)$ is the probability of seeing the sequence $X_0 \dots X_t$ and terminating at state k
• Backward probabilities: $\beta_t(k)$ is the probability of seeing the sequence $X_{t+1} \dots X_n$ given that the Markov process is at state k at time t
Computing the probabilities
• Forward algorithm
  – $\alpha_0(k) = P_{start}(k) \cdot P_{out}(k, X_0)$
  – $\alpha_t(k) = P_{out}(k, X_t) \cdot \sum_{k'} \alpha_{t-1}(k') \cdot P_{trans}(k', k)$
  – $P(X_0, \dots, X_n) = \sum_k \alpha_n(k)$
• Backward algorithm
  – $\beta_t(k) = P(X_{t+1} \dots X_n \mid \text{state at time } t \text{ is } k)$
  – $\beta_t(k) = \sum_{k'} P_{trans}(k, k') \cdot P_{out}(k', X_{t+1}) \cdot \beta_{t+1}(k')$
  – $\beta_n(k) = 1$ for all k
  – $P(X_0, \dots, X_n) = \sum_k P_{start}(k) \cdot P_{out}(k, X_0) \cdot \beta_0(k)$
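Both recursions can be sketched on an invented toy model and checked against each other, since they must yield the same sequence probability. In the backward case the final sum multiplies by the start probability and by the emission of the first observation, because β at time 0 covers only the observations after the first one.

```python
def forward(obs, states, p_start, p_trans, p_out):
    """alpha[t][k]: probability of emitting obs[0..t] and being in state k at time t."""
    alpha = [{k: p_start[k] * p_out[k][obs[0]] for k in states}]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append({
            k: p_out[k][obs[t]] * sum(prev[j] * p_trans[j][k] for j in states)
            for k in states
        })
    return alpha

def backward(obs, states, p_trans, p_out):
    """beta[t][k]: probability of emitting obs[t+1..n] given state k at time t."""
    beta = [{k: 1.0 for k in states}]  # beta at the last time step is 1
    for t in range(len(obs) - 2, -1, -1):
        nxt = beta[0]
        beta.insert(0, {
            k: sum(p_trans[k][j] * p_out[j][obs[t + 1]] * nxt[j] for j in states)
            for k in states
        })
    return beta

# Toy model; all probabilities are invented for illustration.
STATES = ["NOUN", "VERB"]
P_START = {"NOUN": 0.7, "VERB": 0.3}
P_TRANS = {"NOUN": {"NOUN": 0.3, "VERB": 0.7}, "VERB": {"NOUN": 0.8, "VERB": 0.2}}
P_OUT = {
    "NOUN": {"dogs": 0.5, "tag": 0.3, "can": 0.2},
    "VERB": {"dogs": 0.2, "tag": 0.3, "can": 0.5},
}
obs = ["dogs", "can"]

alpha = forward(obs, STATES, P_START, P_TRANS, P_OUT)
beta = backward(obs, STATES, P_TRANS, P_OUT)
p_fwd = sum(alpha[-1][k] for k in STATES)
p_bwd = sum(P_START[k] * P_OUT[k][obs[0]] * beta[0][k] for k in STATES)
```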
Learning the HMM probabilities
• Expectation-Maximization algorithm
  1. Start with initial probabilities
  2. Compute Eij, the expected number of transitions from i to j while generating a sequence, for each i, j (see next)
  3. Set the probability of transition from i to j to be Eij / (Σk Eik)
  4. Similarly for the emission probabilities
  5. Repeat 2-4 using the new model, until convergence
Estimating the expectations
• By sampling
  – Run a random execution of the model 100 times
  – Count transitions
• By analysis
  – Use Bayes’ rule on the formula for sequence probability
  – Called the forward-backward algorithm
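The analytic route can be sketched for the transition probabilities as follows. The expected count of i-to-j transitions sums, over time steps, the forward probability of i, the transition, the next emission and the backward probability of j, divided by the sequence probability. This is one re-estimation step on a single observation sequence with an invented toy model; emission re-estimation and the convergence loop are left out.

```python
def forward(obs, states, p_start, p_trans, p_out):
    alpha = [{k: p_start[k] * p_out[k][obs[0]] for k in states}]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append({k: p_out[k][obs[t]] * sum(prev[j] * p_trans[j][k] for j in states)
                      for k in states})
    return alpha

def backward(obs, states, p_trans, p_out):
    beta = [{k: 1.0 for k in states}]
    for t in range(len(obs) - 2, -1, -1):
        nxt = beta[0]
        beta.insert(0, {k: sum(p_trans[k][j] * p_out[j][obs[t + 1]] * nxt[j] for j in states)
                        for k in states})
    return beta

def expected_transitions(obs, states, p_start, p_trans, p_out):
    """E[i][j]: expected number of i -> j transitions given the observed sequence."""
    alpha = forward(obs, states, p_start, p_trans, p_out)
    beta = backward(obs, states, p_trans, p_out)
    p_obs = sum(alpha[-1][k] for k in states)
    E = {i: {j: 0.0 for j in states} for i in states}
    for t in range(len(obs) - 1):
        for i in states:
            for j in states:
                E[i][j] += (alpha[t][i] * p_trans[i][j]
                            * p_out[j][obs[t + 1]] * beta[t + 1][j]) / p_obs
    return E

def reestimate_transitions(E, states):
    """Step 3 of the EM outline: normalize each row of expected counts."""
    new = {}
    for i in states:
        row_total = sum(E[i].values())  # assumed positive in this sketch
        new[i] = {j: E[i][j] / row_total for j in states}
    return new

# Toy model; all probabilities are invented for illustration.
STATES = ["NOUN", "VERB"]
P_START = {"NOUN": 0.7, "VERB": 0.3}
P_TRANS = {"NOUN": {"NOUN": 0.3, "VERB": 0.7}, "VERB": {"NOUN": 0.8, "VERB": 0.2}}
P_OUT = {
    "NOUN": {"dogs": 0.5, "tag": 0.3, "can": 0.2},
    "VERB": {"dogs": 0.2, "tag": 0.3, "can": 0.5},
}
E = expected_transitions(["dogs", "can", "tag"], STATES, P_START, P_TRANS, P_OUT)
new_trans = reestimate_transitions(E, STATES)
```

A useful sanity check is that the expected counts sum to the number of transitions in the sequence, here two.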
Accuracy
• Tested experimentally
• Exceeds 96% for the Brown Corpus
  – Trained on half and tested on the other half
• Compare with the 80-90% achieved by the trivial algorithm
• The hard cases are few but are very hard
NLTK
• http://www.nltk.org/
• Natural Language Toolkit
• Open-source Python modules for NLP tasks
  – Including stemming, POS tagging and much more