web search and mining “wma” cs635 autumn 2013 mon thu 6:30—8:00pm lch31 (third floor lhc)...

33
Web Search and Mining WMaCS635 Autumn 2013 Mon Thu 6:30—8:00pm LCH31 (third floor LHC) (Venue may change)

Upload: deborah-tyler

Post on 30-Dec-2015

215 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Web Search and Mining “WMa” CS635 Autumn 2013 Mon Thu 6:30—8:00pm LCH31 (third floor LHC) (Venue may change)

Web Search and Mining“WMa”

CS635 Autumn 2013Mon Thu 6:30—8:00pmLCH31 (third floor LHC)

(Venue may change)

Page 2: Web Search and Mining “WMa” CS635 Autumn 2013 Mon Thu 6:30—8:00pm LCH31 (third floor LHC) (Venue may change)

Motivation Few areas of computer science have had as

much impact on our lives in the last 15 years as the Internet, the Web, search engines, e-commerce, and social media

Rapid rise of some of the largest and most accomplished companies on earth

Draws heavily from data structures, algorithms, databases, probability and statistics, machine learning, parallel computing, economics and game theory, social sciences, …(“whatever works”)

Page 3: Web Search and Mining “WMa” CS635 Autumn 2013 Mon Thu 6:30—8:00pm LCH31 (third floor LHC) (Venue may change)

Highlights

1992

1994

1996

1998

2003

Netscape, Yahoo! founded

Mosaic Web brower developed

Arthur C. Clarke predicts that satellites would "bring the accumulated knowledge of the world to your fingertips" using a console combining the functionality of the photocopier, telephone, television and a small computer, allowing data transfer and video conferencing around the globe.

1970

1990 Tim Berners-Lee &co build first prototype at CERN

Backrub, later called Google, foundedGoto.com introduces paid search and ads

Google launches AdSense

2005 YouTube founded

2006 Twitter founded

1995 Alta Vista launched by Digital Equipment Corp

2009 Bing launched by Microsoft

2004 Facebook founded

CrawlingIndexing

PageRank

Text miningAuctions

PersonalizationSocial media

Collaborativefiltering

Query+clicklog mining

Local search

De-spamming

Infrastructure

Page 4: Web Search and Mining “WMa” CS635 Autumn 2013 Mon Thu 6:30—8:00pm LCH31 (third floor LHC) (Venue may change)

CS635 and CS728 CS635 = Web search and mining = first

course on Web information retrieval• Indexing, query processing, ranking• Corpus and document models, text mining• Hyperlinks and social network analysis• Large scale data analysis: “big data”

CS728 = Organization of Web information = second course with more recent topics• Bootstrapping knowledge bases• Named entity and relation recognition• Open domain knowledge extraction• Semantic annotations and search

Page 5: Web Search and Mining “WMa” CS635 Autumn 2013 Mon Thu 6:30—8:00pm LCH31 (third floor LHC) (Venue may change)

Vs. CS101 Usually CS635 runs in Autumn and CS728

runs in Spring Taught CS101 in Spring 2012 Largely recovered from the trauma in

Autumn 2012, didn’t teach anything Therefore offered CS635 in Spring 2013 Offering CS635 again in Autumn 2013 to

reset the clock• To avoid falling asleep, may sneak in new

material

Page 6: Web Search and Mining “WMa” CS635 Autumn 2013 Mon Thu 6:30—8:00pm LCH31 (third floor LHC) (Venue may change)

Administrivia http://soumen.in/teach/cs635a2013/ soumen.in = www.cse.iitb.ac.in/~soumen =

soumen.cse.iitb.ac.in https://moodle.iitb.ac.in/course/view.php?id=2550 Meant primarily for Mtech1, Btech3, PhD Mtech2, Btech4 ok, a little late for research? Weak prerequisites: prob+stat, databases If you can read and understand a fair bit of math,

you will need ~10 days of self-study to pick up the relevant pieces from basic machine learning

Page 7: Web Search and Mining “WMa” CS635 Autumn 2013 Mon Thu 6:30—8:00pm LCH31 (third floor LHC) (Venue may change)

Course design: books Basic material from a few books

• Manning-Raghavan-Schutze online book• Baeza-Yates and Ribeiro-Neto• Managing Gigabytes (plumbing tips)• MTW second edition in progress, notes will be available

for many segments

Page 8: Web Search and Mining “WMa” CS635 Autumn 2013 Mon Thu 6:30—8:00pm LCH31 (third floor LHC) (Venue may change)

Course design: research papers• Papers written much faster than you can

write a book

• Digesting papers: critical survival skill

• They sound like nothing left to do

• You get credit for pointing out loopholes, flaws, and further ideas to follow from existing papers

Page 9: Web Search and Mining “WMa” CS635 Autumn 2013 Mon Thu 6:30—8:00pm LCH31 (third floor LHC) (Venue may change)

Evaluation Homeworks

• All homeworks and exams from all earlier offerings are effectively (ungraded) homeworks (many with solutions) — take advantage!

• Hands-on work (coding, simulations and measurements)

• Projects: depends on class size, interest

Exams• Limited time, in-class• Problems “served on a platter”• Hard to pose and solve real systems issues

Page 10: Web Search and Mining “WMa” CS635 Autumn 2013 Mon Thu 6:30—8:00pm LCH31 (third floor LHC) (Venue may change)

Multidisciplinary synthesis Content: hypermedia, markup standards, text

and semistructured data models Activity: linking, blogging, selling, spamming Algorithms: graphs, indexing, compression,

string processing, ranking Statistics: models for text and link graph,

catching (link) spam, profiling queries Language: tagging, extraction Plumbing: networking, storage, distributed

systems, map-reduce, “big data”, cloud …

Page 11: Web Search and Mining “WMa” CS635 Autumn 2013 Mon Thu 6:30—8:00pm LCH31 (third floor LHC) (Venue may change)

CS635 syllabus Text indexing, search Document and corpus models Ranking, learning to rank Whole-document labeling/classification Measuring and modeling social networks Hyperlink assisted search and mining Web sampling, crawling, monitoring

Page 12: Web Search and Mining “WMa” CS635 Autumn 2013 Mon Thu 6:30—8:00pm LCH31 (third floor LHC) (Venue may change)

CS728 syllabus Entity and relation annotation Information extraction and integration Coreference resolution Bootstrapping Web knowledge bases Adding graph structure to text search Labeling and ranking in graph data models

Page 13: Web Search and Mining “WMa” CS635 Autumn 2013 Mon Thu 6:30—8:00pm LCH31 (third floor LHC) (Venue may change)

Text indexing, search and ranking Boolean queries involving word occurrence

in documents Inverted index design, construction,

compression, updates; query processing Vector space model, relevance ranking,

recall, precision, F1, break-even Probabilistic ranking, belief networks Fast top-k search Similarity search: minhash, random

projections, locality-preserving hash

Page 14: Web Search and Mining “WMa” CS635 Autumn 2013 Mon Thu 6:30—8:00pm LCH31 (third floor LHC) (Venue may change)

Document and corpus models Token, compound, phrase, stems, stopwords Language and typography issues Practical computer representations Bag of words, corpus, term-document matrix Probabilistic generative models Modeling multi-topic corpus and documents Statistical notions of semantic similarity Text clustering applications

Page 15: Web Search and Mining “WMa” CS635 Autumn 2013 Mon Thu 6:30—8:00pm LCH31 (third floor LHC) (Venue may change)

Query = SL100

Diversi

ty

Query suggestion

Page 16: Web Search and Mining “WMa” CS635 Autumn 2013 Mon Thu 6:30—8:00pm LCH31 (third floor LHC) (Venue may change)

Whole-document labeling/classification Topics on Yahoo!, spam vs. non-spam, etc. Bayesian classification using generative

corpus models Conditional probabilistic classification Discriminative classification Transductive, semi-supervised and active

labeling Exploiting graph connectivity for improved

labeling accuracy Machine learning needed here

Page 17: Web Search and Mining “WMa” CS635 Autumn 2013 Mon Thu 6:30—8:00pm LCH31 (third floor LHC) (Venue may change)

Topic

Top

ic

Label variable

Local feature variable

Instance

Label coupling

CS705, IT655

Page 18: Web Search and Mining “WMa” CS635 Autumn 2013 Mon Thu 6:30—8:00pm LCH31 (third floor LHC) (Venue may change)

Measuring and modeling social networks Prestige, centrality, co-citation Web graph: degree, diameter, dense

subgraphs, giant bow-tie Generative models: preferential attachment,

copying links, “Googlearchy” Link locality and content locality Generating synthetic social networks Compressing and indexing large social

networks, reference compression, connectivity server, reachability index

Page 19: Web Search and Mining “WMa” CS635 Autumn 2013 Mon Thu 6:30—8:00pm LCH31 (third floor LHC) (Venue may change)

Gnutella Internet

Web

Page 20: Web Search and Mining “WMa” CS635 Autumn 2013 Mon Thu 6:30—8:00pm LCH31 (third floor LHC) (Venue may change)

Hyperlink assisted search and mining Review of spectral graph theory CS725 Hyperlink induced topic search (HITS) and

Google’s Pagerank Large-scale computational issues Stability, topology sensitivity, spam resilience Other random walks: SALSA, PHITS, maxent

walks, absorbing walks, SimRank, … Personalized and topic-sensitive Pagerank Viral marketing, social networks Link prediction, recommender systems

Page 21: Web Search and Mining “WMa” CS635 Autumn 2013 Mon Thu 6:30—8:00pm LCH31 (third floor LHC) (Venue may change)

Web sampling, crawling, monitoring Plumbing: DNS, TCP/IP, HTTP, HTML Large-scale crawling issues: concurrency,

network load, shared work pool, spider traps, politeness

Setting crawl priorities using graph properties and page contents

Sampling Web pages using random walks Monitoring change and refreshing crawls

• Driven by query workload• Driven by search and advertising customers?

Page 22: Web Search and Mining “WMa” CS635 Autumn 2013 Mon Thu 6:30—8:00pm LCH31 (third floor LHC) (Venue may change)
Page 23: Web Search and Mining “WMa” CS635 Autumn 2013 Mon Thu 6:30—8:00pm LCH31 (third floor LHC) (Venue may change)

Adding graph structure to text search XML and related (largely) tree data models Typed entity-relationship networks Path expressions, integrating word queries

with path matches; indexing, query processing

Spreading activation queries: find entity of specified type “near” matching predicates

Steiner query: explain why two or more entities (or words) are closely related

Page 24: Web Search and Mining “WMa” CS635 Autumn 2013 Mon Thu 6:30—8:00pm LCH31 (third floor LHC) (Venue may change)

“Find a person near IBM and IITB”

Page 25: Web Search and Mining “WMa” CS635 Autumn 2013 Mon Thu 6:30—8:00pm LCH31 (third floor LHC) (Venue may change)

Entity annotation in text

• Need a label catalog• Michael Jordan, Stuart Russell• Collective annotation

Page 26: Web Search and Mining “WMa” CS635 Autumn 2013 Mon Thu 6:30—8:00pm LCH31 (third floor LHC) (Venue may change)

Bridging the Structured-Unstructured Gap

Born in New York in 1934 , Sagan wasa noted astronomer whose lifelong passionwas searching for intelligent life in the cosmos.

person

scientist

physicist

astronomer

entity

region

city

district

state

hasDigit isDDDD

Where was Sagan born? type=region NEAR “Sagan”

Name a physicist who searchedfor intelligent life in the cosmos type=physicist NEAR “cosmos”…

When was Sagan born? type=timepattern=isDDDD NEAR“Sagan” “born”

abstraction

time

year

is-a

Page 27: Web Search and Mining “WMa” CS635 Autumn 2013 Mon Thu 6:30—8:00pm LCH31 (third floor LHC) (Venue may change)

Mining the Web as a noisy database

Page 28: Web Search and Mining “WMa” CS635 Autumn 2013 Mon Thu 6:30—8:00pm LCH31 (third floor LHC) (Venue may change)

Mining Web tables

“Organically grown” Web tables Much relational information No coordinated schema Use social entity and type catalogs to label

Page 29: Web Search and Mining “WMa” CS635 Autumn 2013 Mon Thu 6:30—8:00pm LCH31 (third floor LHC) (Venue may change)

Small answer subgraph search

http://www.cse.iitb.ac.in/banks/

Page 30: Web Search and Mining “WMa” CS635 Autumn 2013 Mon Thu 6:30—8:00pm LCH31 (third floor LHC) (Venue may change)

Bootstrapping Web knowledge bases Hearst, 1992; KnowItAll (Etzioni+ 2004)

• T such as x, x and other Ts, x or other Ts, T x, x is a T, x is the only T, …

Google sets

Page 31: Web Search and Mining “WMa” CS635 Autumn 2013 Mon Thu 6:30—8:00pm LCH31 (third floor LHC) (Venue may change)

Information carnivores at work

Probe Word PhraseKhalid 1.3M 0Omar 6.63M 0sort 130M 0Karachi 2.51M 629Pakistan 50.5M 1

“Garth Brooks is a country” [singer],“gift such as wall” [clock]

“person like Paris” [Hilton],“researchers like Michael Jordan” (which one?)

KO :: India Pakistan Cricket SeriesA web site by Khalid Omar, sort of live from Karachi, Pakistan.

“cities such as [probe]”

“[probe] and othercities”, “[probe] is acity”, etc.

Page 32: Web Search and Mining “WMa” CS635 Autumn 2013 Mon Thu 6:30—8:00pm LCH31 (third floor LHC) (Venue may change)

Sample output author; “Harry Potter”

• J K Rowling, Ron

person; “Eiffel Tower”• Gustave, (Eiffel), Paris

director; Swades movie• Ashutosh Gowariker, Ashutosh Gowarikar

What can search engines do to help?• Cluster mentions and assign IDs • Allow queries for IDs — expensive!• “Harry Potter” context in “Ron is an author”

Ambiguity andextremely skewed

Web popularity

Page 33: Web Search and Mining “WMa” CS635 Autumn 2013 Mon Thu 6:30—8:00pm LCH31 (third floor LHC) (Venue may change)

Summary Understand Web-scale text data processing Crawling, indexing, search, ranking, link

mining, personalization Mix of theory and practice VC for “big data” and “cloud” exceeds that for

“social ___” Jobs! Are you “the one”?