Web Mining. Why IR? Research & Fun
Web Mining
Why IR?
Why IR?
Research & Fun
http://duilian.msra.cn
Overview of Search Engine
Flow Chart of SE
Text Processing (1) - Indexing
A list of terms with relevant information
- Frequency of terms
- Location of terms
- Etc.

Index terms: represent document content & separate documents
- “economy” vs “computer” in a news article of Financial Times

To get an index
- Extraction of index terms
- Computation of their weights
Text Processing (2) - Extraction
Extraction of index terms
- Word or phrase level
- Morphological analysis (stemming in English)
  - “information”, “informed”, “informs”, “informative” → inform
- Removal of stop words
  - “a”, “an”, “the”, “is”, “are”, “am”, …
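
The extraction steps above can be sketched in a few lines of Python. This is only an illustration: the stop-word set repeats the examples from the slide, and the suffix list, `naive_stem`, and `extract_index_terms` are hypothetical names chosen so that the four "inform" variants collapse to one stem. A real system would use a proper stemmer such as Porter's.

```python
import re

# Assumed, illustrative stop-word list (the slide's own list is open-ended).
STOP_WORDS = {"a", "an", "the", "is", "are", "am"}

# Assumed suffix list for a toy stemmer; not a real morphological analyzer.
SUFFIXES = ("ation", "ative", "ed", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_index_terms(text: str) -> list[str]:
    """Lowercase, tokenize, drop stop words, and stem the remaining words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(token) for token in tokens if token not in STOP_WORDS]
```

For example, `extract_index_terms("The system informs an informed user")` returns `["system", "inform", "inform", "user"]`.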
Text Processing (3) – Term Weight
Calculation of term weights
- Statistical weights using frequency information
- Importance of a term in a document

E.g. TF*IDF
- TF: total frequency of a term k in a document
- IDF: inverse document frequency of a term k in a collection
- DF: in how many documents does the term appear?
- High TF, low DF means a good word to represent the text
- High TF, high DF means a bad word
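
A minimal sketch of TF*IDF weighting follows. The slide does not give the exact formula, so the IDF variant used here, log(N / DF) with N the number of documents, is an assumption (one common choice among several); `tf_idf` is a hypothetical helper name.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: weight} dict per document, weight = TF * log(N / DF)."""
    n_docs = len(docs)
    # DF: number of documents in which each term appears at least once.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # TF: raw count of each term in this document
        weights.append({term: tf[term] * math.log(n_docs / df[term]) for term in tf})
    return weights
```

With this variant, a term that occurs in every document gets weight 0 no matter how often it appears, which matches the "high TF, high DF means a bad word" rule above.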
An Example
Document 1
Document 2
Text Processing (4) - Storing indexing results
[Figure: index table with columns “Index Word” and “Word Info.”; the index words “Arizona”, “University”, … have entries referring to Document 1 and Document 2.]
Text Processing (2) - Storing indexing result
Text Processing (3) - Inverted File
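
As a rough illustration of an inverted file, the sketch below maps each index term to the documents that contain it, together with the positions where it occurs. The layout (`term -> {doc_id: [positions]}`) and the name `build_inverted_file` are assumptions made for this example, not the structure shown on the original slide.

```python
from collections import defaultdict


def build_inverted_file(docs: dict[int, list[str]]) -> dict[str, dict[int, list[int]]]:
    """Map each term to {doc_id: [positions]}; term frequency is len(positions)."""
    index: dict[str, dict[int, list[int]]] = defaultdict(dict)
    for doc_id, terms in docs.items():
        for position, term in enumerate(terms):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


# Hypothetical two-document collection, already tokenized.
example = build_inverted_file({
    1: ["university", "of", "arizona"],
    2: ["arizona", "state", "university"],
})
```

Looking up `example["arizona"]` gives `{1: [2], 2: [0]}`, so a query term can be resolved to its documents without scanning the collection.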
Matching & Ranking (2)
Ranking
- Retrieval model
  - Boolean (exact) => Fuzzy Set (inexact)
  - Vector Space
  - Probabilistic
  - Inference Net
  - ...
- Weighting schemes
  - Index terms, query terms
  - Document characteristics
Vector Space Model
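
In the vector space model, documents and queries are represented as vectors of term weights and ranked by how closely they point in the same direction. The sketch below uses cosine similarity over sparse vectors stored as dictionaries. The choice of cosine and the names `cosine` and `rank` are assumptions for illustration, since the slide's own figure is not available.

```python
import math


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def rank(query: dict[str, float], docs: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Return (doc_id, score) pairs sorted by decreasing similarity to the query."""
    scored = [(doc_id, cosine(query, vector)) for doc_id, vector in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Because cosine similarity normalizes by vector length, a long document is not favored merely for containing more words.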
Matching & Ranking (2)
Techniques for efficiency
- New storage structure, esp. for new document types
- Use of accumulators for efficient generation of ranked output
- Compression/decompression of indexes

Techniques for Web search engines
- Use of hyperlinks
  - Inlinks & outlinks (PageRank)
  - Authority vs hub pages (HITS)
- In conjunction with Directory Services (e.g. Yahoo)
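
One way to read "use of accumulators" is term-at-a-time scoring: each query term's posting list is scanned once, and a per-document accumulator collects the partial scores. The sketch below assumes postings are stored as `(doc_id, weight)` pairs and uses a heap to pick the top `k` results. The function name and the data layout are hypothetical.

```python
import heapq
from collections import defaultdict


def ranked_retrieval(query_terms, postings, k=10):
    """Term-at-a-time scoring with one accumulator per candidate document."""
    accumulators = defaultdict(float)
    for term in query_terms:
        # postings[term] is assumed to be a list of (doc_id, weight) pairs.
        for doc_id, weight in postings.get(term, []):
            accumulators[doc_id] += weight
    # Pick the k highest-scoring documents without fully sorting all accumulators.
    return heapq.nlargest(k, accumulators.items(), key=lambda item: item[1])
```

Only documents that contain at least one query term ever get an accumulator, so the work is proportional to the posting lists touched rather than to the size of the collection.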
PageRank Algorithm
Basic idea: more links to a page implies a better page
- But not all links are created equal
- Links from a more important page should count more than links from a weaker page

Basic PageRank R(A) for page A:
- R(A) = sum, over all pages B that link to A, of R(B) / outDegree(B)
- outDegree(B) = number of edges leaving page B = number of hyperlinks on page B
- Page B distributes its rank boost over all the pages it points to
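
The basic rule, in which each page B splits its rank R(B) evenly over its outDegree(B) outgoing links, can be iterated until the ranks settle. The sketch below is a minimal version under stated assumptions: it uses no damping factor, starts from uniform ranks, runs a fixed number of iterations, and expects every link target to appear as a key of `links`. The name `basic_pagerank` and the example graph are hypothetical.

```python
def basic_pagerank(links: dict[str, list[str]], iterations: int = 100) -> dict[str, float]:
    """Iterate R(A) = sum of R(B) / outDegree(B) over all pages B linking to A."""
    pages = list(links)
    rank = {page: 1.0 / len(pages) for page in pages}  # uniform starting ranks
    for _ in range(iterations):
        new_rank = {page: 0.0 for page in pages}
        for page, targets in links.items():
            if not targets:
                continue  # a page without outlinks passes on no rank in this sketch
            share = rank[page] / len(targets)  # rank split evenly over outlinks
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
ranks = basic_pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

On this graph the ranks converge to roughly A = 0.4, B = 0.2, C = 0.4: C receives links from both other pages, and A inherits all of C's rank. Without damping, the plain iteration can fail to converge on graphs with dead ends or strict cycles, which is one reason practical variants add a damping term.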
Readings
- Gregory Grefenstette (1998). “The Problem of Cross-Language Information Retrieval.” In Cross-Language Information Retrieval (ed: Grefenstette), Kluwer Academic Publishers.
- Doug Oard et al. (1999). “Multilingual Information Discovery and AccesS (MIDAS).” D-Lib Magazine, 5 (10), Oct.
- Sung Hyon Myaeng et al. (1998). “A Flexible Model for Retrieval of SGML Documents.” Proc. of the 21st ACM SIGIR Conference, Australia.
- James Allan (2002). “Introduction to Topic Detection and Tracking.” In Topic Detection and Tracking: Event-based Information Organization (ed: Allan), Kluwer Academic Publishers.
- Paul Resnick & Hal Varian (1997). “Recommender Systems.” CACM 40 (3), March, pp 56-58.
- Badrul Sarwar et al. (2001). “Item-based Collaborative Filtering Recommendation Algorithms.” http://citeseer.nj.nec.com/sarwar01itembased.html
- Karen Sparck Jones (1999). “Automatic summarizing: factors and directions.” In Advances in Automatic Text Summarization (eds: Mani & Maybury), MIT Press.
- Ellen Voorhees (2000). “Overview of TREC-9 Question Answering Track.”
- Ralph Grishman (1997). “Information Extraction: Techniques and Challenges.” In Information Extraction - International Summer School SCIE-97 (ed: Maria Teresa Pazienza), Springer-Verlag, 1997. (See http://nlp.cs.nyu.edu/publication/index.shtml)