Mining the Web for Information Organization

J. H. Wang, Academia Sinica

Outline

• Introduction
• Web Mining
• Cross-Language Web Search
• Other Applications

Introduction

• Huge amount of Web data
  – Rich and dynamic resources of human knowledge
  – Multimedia
  – Scalability

How to organize Web data into useful information?

Number of Web Pages

The world's largest search engine?

[Chart: Billions of Textual Documents Indexed, December 1995-September 2003. KEY: GG=Google, ATW=AllTheWeb, INK=Inktomi, TMA=Teoma, AV=AltaVista.]

Source: Search Engine Watch (Nov. 2004)

Search Engine    Reported Size             Page Depth
Google           8.1 billion               101K
MSN              5.0 billion               150K
Yahoo            4.2 billion (estimate)    500K
Ask Jeeves       2.5 billion               101K+

Web Users and Pages (7 years ago)

Area          Users    Web Pages    Time
World-wide    150M     800M         7/99
China         4M       2.5~3M       7/99
Taiwan        4M       3M           7/99

Challenge of scalability!

Total users: 800M
Chinese users: 110M

Including 87M (CN), 4.9M (HK), 11.6M (TW), 2.9M (MY), 2.14M (SG), and others.

Source: Global Reach, 2004

Web Mining

• Data Mining
• Text Mining
• Web Mining Technologies

Data Mining

• Data Mining (Knowledge Discovery in Databases) is a process of nontrivial extraction of implicit, previously unknown, and potentially useful information (such as knowledge rules, constraints, regularities) from data in databases [G. Piatetsky-Shapiro and W. J. Frawley]

Text Mining

• Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources [Marti Hearst]

Web Mining

• Web Mining is the use of data mining techniques to automatically discover and extract information from Web documents and services [O. Etzioni]

Comparison

• Data mining tries to find interesting (non-trivial, implicit, previously unknown, potentially useful) patterns from large databases

• In text mining, the patterns are extracted from natural language texts rather than from structured databases of facts

• Web mining discovers and extracts information from Web documents and services

Web Mining Technologies

• Web content mining
• Web structure mining
• Web usage mining

Web Content Mining

• Unstructured documents
  – Free texts such as news articles
• Semi-structured documents
  – HTML structures and hyperlink information
  – Intra-document structure
• Applications: text categorization, text clustering, information extraction, computational linguistics, …

Web Structure Mining

• The structure of the hyperlinks within the Web
  – Inter-document structure
  – HITS, PageRank
• Social network and citation analysis
• Applications: to calculate the quality rank or relevancy of each Web page, Web page categorization, …
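
As a minimal sketch of link-based quality ranking, the following computes PageRank by power iteration on a tiny link graph. The graph, the damping factor of 0.85, the iteration count, and the function name are illustrative assumptions and do not come from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the same "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # Split this page's rank evenly among its out-links.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank


# Hypothetical three-page graph: A and C both link to B, and B links to A.
toy_links = {"A": ["B"], "B": ["A"], "C": ["B"]}
ranks = pagerank(toy_links)
```

In this toy graph, page C has no in-links, so it ends up with the lowest score, while B, which is linked from two pages, ranks highest.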

Web Usage Mining

• Techniques that could predict user behavior while the user interacts with the Web
  – To map the usage data of the Web server into relational tables
  – To use the log data directly
• Applications: learning a user profile (personalization) vs. learning user navigation patterns

Related Fields of Research

• IR (Information Retrieval)
• IE (Information Extraction)
• ML (Machine Learning)

LiveTrans: Cross-language Web Search

• LiveTrans: http://livetrans.iis.sinica.edu.tw/lt.html

Examples

More Examples

Cross Language Information Retrieval (CLIR)

• A technology enabling users to query in one language and retrieve relevant documents written or indexed in another language

Cross Language Web Search

• A technology enabling users to query in one language and retrieve relevant Web pages written or indexed in another language

Why “Cross-Language”?

• Source: Global Reach (global-reach.biz/globstats)

Top Ten Languages Used in the Web

Source: Internet World Stats (Sep. 20, 2006)

TOP TEN LANGUAGES IN THE INTERNET (number of Internet users by language)

Columns: (1) % of all Internet users; (2) Internet users by language; (3) Internet penetration by language; (4) Internet growth for the language (2000-2006); (5) world population for the language (2006 estimate).

Language                   (1)        (2)              (3)       (4)        (5)
English                    29.7 %     322,600,837      28.7 %    135.2 %    1,125,664,397
Chinese                    13.3 %     144,301,513      10.8 %    346.7 %    1,340,767,863
Japanese                   7.9 %      86,300,000       67.2 %    83.3 %     128,389,000
Spanish                    7.5 %      81,729,671       18.7 %    231.1 %    437,502,257
German                     5.4 %      58,854,682       61.3 %    113.2 %    95,982,043
French                     4.6 %      49,660,498       13.0 %    307.1 %    381,193,149
Portuguese                 3.1 %      34,064,760       14.8 %    349.6 %    230,846,275
Korean                     3.1 %      32,372,000       45.8 %    78.0 %     73,945,860
Italian                    2.7 %      28,870,000       48.8 %    118.7 %    59,115,261
Russian                    2.2 %      23,700,000       16.5 %    664.5 %    143,682,757
TOP TEN LANGUAGES          79.5 %     863,981,961      21.5 %    166.7 %    4,017,088,863
Rest of World Languages    20.5 %     222,268,942      9.0 %     500.0 %    2,482,608,197
WORLD TOTAL                100.0 %    1,086,250,903    16.7 %    200.9 %    6,499,697,060

More and more non-English users!

Web Content by Language

Source: http://www.netz-tipp.de/languages.html (2002)

[Chart of Web Content, 2002. x-axis: language (English, German, French, Japanese, Spanish, Chinese, Italian, Dutch, Russian, Korean, Portuguese); y-axis: millions of Web pages, 0 to 1200.]

More and more non-English pages

Number of Chinese Web Pages

866,000,000 pages

Scalability problem!

Challenge of Cross-Language Web Search

• Existing CLIR systems mostly rely on bilingual dictionaries and dictionary lookup

• Translations for 81% of the search terms could not be found in common English-Chinese translation dictionaries

Examples: 中央處理器 (CPU), 電子商務 (E-commerce), 個人數位助理 (PDA), 雅虎 (Yahoo), 太空總署 (NASA), 星際大戰 (Star Wars), 非典型肺炎 (SARS), …

Challenge

• Existing CLIR systems mostly rely on bilingual dictionaries and dictionary lookup

• Translations for 81% of the search requests could not be found in common English-Chinese translation dictionaries

• How can effective translations be found automatically for query terms that are not included in a dictionary?
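
To make the limitation concrete, here is a minimal sketch of dictionary-lookup query translation that reports which query terms fall outside the dictionary. The two-entry dictionary and the function name are hypothetical; a real bilingual dictionary would be far larger but would still miss many proper nouns and new terms.

```python
# Hypothetical toy dictionary used only for illustration.
toy_dictionary = {
    "museum": ["博物館"],
    "library": ["圖書館"],
}


def translate_by_lookup(query_terms, dictionary):
    """Split query terms into translated terms and out-of-dictionary terms."""
    translated, unknown = {}, []
    for term in query_terms:
        entry = dictionary.get(term.lower())
        if entry:
            translated[term] = entry
        else:
            # No entry: plain dictionary lookup cannot translate this term.
            unknown.append(term)
    return translated, unknown


query = ["museum", "SARS", "Yahoo"]
translated, unknown = translate_by_lookup(query, toy_dictionary)
oov_rate = len(unknown) / len(query)
```

Here "SARS" and "Yahoo" end up in `unknown`, which is the kind of term the Web mining approach in the following slides targets.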

CLIR

• Conventional approach to query translation
  – Parallel documents as the corpus
  – Assumes long queries
• Problems of CLIR in Web search
  – No corpus for cross-lingual training
  – Short queries
  – "Out-of-dictionary" terms, e.g., proper nouns, new terminologies, …

English Terminologies                    Chinese Translation
mechanical strain                        機械應變
viscous damping                          黏滯阻尼
Richard Feynman                          費曼
Hypoplastic Left Heart Syndrome          左心發育不全症候群
NII Japan                                國立情報學研究所
SARS                                     嚴重急性呼吸道症候群
Extracorporeal Shock Wave Lithotripsy    震波碎石
Davinci                                  達文西

Translation Lexicon Construction for CLIR

• To use the Web as the corpus for query translation
  – Web mining techniques
    • Anchor-text-based [ACM TOIS ‘04, ACM TALIP ‘02]
    • Search-result-based [JCDL ‘04]
• To extract terms from real document collections as possible queries
  – Term extraction method [SIGIR ‘97]

Web Mining Approach to Term Translation Extraction

[Diagram: a source query ("Academia Sinica") is submitted to the LiveTrans Engine, which mines anchor texts and search results from the Web and returns the target translations (中央研究院 / 中研院).]

Search-Result Page – National Palace Museum vs. 故宮博物院

• Mixed-language characteristic in Chinese pages
• How to extract translation candidates?
• Which candidates to choose?

[Screenshot annotation: noise in the search-result page]

Coverage Rate of Top-Ranked Search-Result Pages

• 95% of popular Web queries’ translations can be found in top 30-40 result pages

• About 70% of random queries were covered

• Many relevant translations can also be found

Anchor-Text Set -- Yahoo vs. 雅虎

• Anchor text (link text)
  – The descriptive text of a link on a Web page
• Anchor-text set
  – A set of anchor texts pointing to the same page (URL)
  – Multilingual translations
    − Yahoo / 雅虎 / 야후
    − America / 美国 / アメリカ
• Anchor-text-set corpus
  – A collection of anchor-text sets

[Diagram: anchor texts on pages from America, Taiwan, China, Japan, and Korea (Yahoo Search Engine, 美国雅虎, 雅虎搜尋引擎, Yahoo! America, 야후-USA, アメリカの Yahoo!) all point to http://www.yahoo.com]
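
A minimal sketch of building an anchor-text-set corpus, assuming a crawler has already produced (anchor text, target URL) pairs. The sample pairs, the `example.org` URL, and the function name are hypothetical.

```python
from collections import defaultdict


def build_anchor_text_sets(links):
    """Group anchor texts by the URL they point to."""
    sets = defaultdict(set)
    for anchor_text, url in links:
        sets[url].add(anchor_text.strip())
    return dict(sets)


# Hypothetical crawl output: (anchor text, target URL) pairs.
crawled_links = [
    ("Yahoo Search Engine", "http://www.yahoo.com"),
    ("雅虎搜尋引擎", "http://www.yahoo.com"),
    ("美国雅虎", "http://www.yahoo.com"),
    ("야후-USA", "http://www.yahoo.com"),
    ("Some other page", "http://example.org/"),
]
corpus = build_anchor_text_sets(crawled_links)
```

Each value in `corpus` is one anchor-text set; terms in different languages that share a set are candidate translations of one another.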

Problems

• How to extract translation candidates with correct lexical boundaries?
  – Term extraction
    • From the search-result pages
    • From the document collections
  – Bilingual lexicon construction
• How to choose relevant candidates?
  – Term translation

Term Translation Extraction from Different Resources

[Diagram: a source query (National Palace Museum) is sent to a search engine to collect search-result pages, while a Web spider builds the anchor-text corpus; term extraction and similarity estimation over both resources yield the target translations (國立故宮博物院, 故宮, 故宮博物院).]

Term Extraction

• Problem: extracting terms with correct boundaries from DL (digital library) documents
  – Correctly segmented: 國立故宮博物院, 故宮博物院, 故宮
  – Incorrect text segments: 立故宮, 故宮博, 宮博物, 宮博
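
The size of the problem can be seen by enumerating every character n-gram of a short string as a raw term candidate. This is only an illustrative sketch; the function name and the minimum length of 2 are assumptions.

```python
def candidate_ngrams(text, min_len=2, max_len=None):
    """Enumerate all character n-grams of a string as raw term candidates."""
    max_len = max_len or len(text)
    return [
        text[i:i + n]
        for n in range(min_len, max_len + 1)
        for i in range(len(text) - n + 1)
    ]


candidates = candidate_ngrams("國立故宮博物院")
```

The list contains both well-formed terms such as 故宮 and fragments such as 立故宮, so a term extraction method has to score candidates to tell them apart.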

Three Approaches to Term Extraction

• NLP (Linguistic) approach
  – Named entity recognition
• Extraction pattern/template approach
  – Wrapper generation/induction
• Statistical approach
  – Class-based language model
  – PAT-tree-based

PAT-tree-based Term Extraction

• SCP (Symmetric Conditional Probability)
  – Cohesion holding the words together
  – Low-frequency n-grams tend to be discarded
• CD (Context Dependency)
  – Dependence on the left- or right-adjacent word/character
  – Low-frequency n-grams can be extracted
• SCPCD: a combination of the two

Association Measure

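$$SCP(w_1 \ldots w_n) = \frac{p(w_1 \ldots w_n)^2}{\frac{1}{n-1}\sum_{i=1}^{n-1} p(w_1 \ldots w_i)\, p(w_{i+1} \ldots w_n)} = \frac{freq(w_1 \ldots w_n)^2}{\frac{1}{n-1}\sum_{i=1}^{n-1} freq(w_1 \ldots w_i)\, freq(w_{i+1} \ldots w_n)}$$

$$CD(w_1 \ldots w_n) = \frac{LC(w_1 \ldots w_n)\, RC(w_1 \ldots w_n)}{freq(w_1 \ldots w_n)^2}$$

$$SCPCD(w_1 \ldots w_n) = SCP(w_1 \ldots w_n) \times CD(w_1 \ldots w_n) = \frac{LC(w_1 \ldots w_n)\, RC(w_1 \ldots w_n)}{\frac{1}{n-1}\sum_{i=1}^{n-1} freq(w_1 \ldots w_i)\, freq(w_{i+1} \ldots w_n)}$$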
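
The three association measures can be sketched over a toy corpus as follows. This sketch counts n-grams with plain dictionaries rather than a PAT tree, and it assumes that LC and RC are the numbers of distinct left- and right-adjacent characters of an n-gram; that reading, the sample sentences, and all function names are assumptions for illustration.

```python
from collections import Counter, defaultdict


def ngram_statistics(sentences, max_n):
    """Count n-gram frequencies and collect distinct left/right neighbours."""
    freq = Counter()
    left, right = defaultdict(set), defaultdict(set)
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                freq[gram] += 1
                if i > 0:
                    left[gram].add(tokens[i - 1])
                if i + n < len(tokens):
                    right[gram].add(tokens[i + n])
    return freq, left, right


def avg_split_product(gram, freq):
    """(1/(n-1)) * sum over split points of freq(prefix) * freq(suffix)."""
    n = len(gram)
    return sum(freq[gram[:i]] * freq[gram[i:]] for i in range(1, n)) / (n - 1)


def scp(gram, freq):
    return freq[gram] ** 2 / avg_split_product(gram, freq)


def cd(gram, freq, left, right):
    # Assumption: LC and RC are counts of distinct adjacent characters.
    return len(left[gram]) * len(right[gram]) / freq[gram] ** 2


def scpcd(gram, freq, left, right):
    return len(left[gram]) * len(right[gram]) / avg_split_product(gram, freq)


# Hypothetical toy corpus, tokenized by character.
sentences = [list("參觀故宮博物院"), list("故宮博物院展覽"), list("去故宮看展覽")]
freq, left, right = ngram_statistics(sentences, max_n=4)
good = tuple("故宮")   # a well-formed term
bad = tuple("宮博")    # a fragment crossing a term boundary
```

On this toy corpus the fragment 宮博 always has the same left and right neighbours, so its SCPCD score is lower than that of 故宮, which appears in several different contexts.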

Term Extraction Performance

Table 1. The obtained extraction accuracy including precision, recall, and average recall-precision of auto-extracted translation candidates using different methods.

Association Measure    Precision    Recall    Avg. R-P
CD                     68.1 %       5.9 %     37.0 %
SCP                    62.6 %       63.3 %    63.0 %
SCPCD                  79.3 %       78.2 %    78.7 %

Speed Performance

Table 2. The obtained average speed performance of different term extraction methods.

Term Extraction Method             Time for Preprocessing    Time for Extraction
LocalMaxs (Web Queries)            0.87 s                    0.99 s
PATtree+LocalMaxs (Web Queries)    2.30 s                    0.61 s
LocalMaxs (1,367 docs)             63.47 s                   4,851.67 s
PATtree+LocalMaxs (1,367 docs)     840.90 s                  71.24 s
LocalMaxs (5,357 docs)             47,247.55 s               350,495.65 s
PATtree+LocalMaxs (5,357 docs)     11,086.67 s               759.32 s

Term Translation

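• Problem

[Diagram: the source query "porcelain" is compared by similarity against the extracted translation candidates (故宮博物院, 故宮, 繪畫, 書法, 陶瓷, 玉器, 瓷器, 刺繡), which include relevant terms as well as translations.]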

Similarity Estimation

How to decide the ranking?

1) S and Ti frequently co-occur in the same pages
   – Not necessarily true for synonyms and antonyms
2) S and Ti have similar co-occurring context terms in the search-result pages

[Diagram: query S (National Palace Museum) is compared against candidates T1, T2, …, Tn (國立故宮博物院, 故宮, 故宮博物院).]

Chi-Square Test

• Chi-Square Test: a statistical method for co-occurrence analysis [Gale & Church ‘91]

$$S_{\chi^2}(s,t) = \frac{N \times (a \times d - b \times c)^2}{(a+b) \times (a+c) \times (b+d) \times (c+d)} \qquad (3)$$

a: # of pages containing both terms s and t
b: # of pages containing term s but not t
c: # of pages containing term t but not s
d: # of pages containing neither term s nor t
N: the total number of pages, i.e., N = a+b+c+d

These counts are available from a search engine.
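
A minimal sketch of the chi-square score in Eq. (3), computed from page counts. The helper that derives a, b, c, and d from hit counts for "s", "t", and "s AND t", together with the sample numbers, is an assumption about how search-engine counts could be used and is not taken from the slides.

```python
def chi_square_similarity(a, b, c, d):
    """Chi-square co-occurrence score from a 2x2 contingency table."""
    n = a + b + c + d
    denominator = (a + b) * (a + c) * (b + d) * (c + d)
    if denominator == 0:
        return 0.0  # degenerate table: one of the margins is empty
    return n * (a * d - b * c) ** 2 / denominator


def counts_from_hits(hits_s, hits_t, hits_both, total_pages):
    """Derive (a, b, c, d) from hit counts of 's', 't', and 's AND t'."""
    a = hits_both
    b = hits_s - hits_both
    c = hits_t - hits_both
    d = total_pages - a - b - c
    return a, b, c, d


# Hypothetical hit counts.
related = chi_square_similarity(*counts_from_hits(100, 80, 60, 1000))
independent = chi_square_similarity(10, 90, 90, 810)
```

When the two terms occur independently, a × d equals b × c and the score is zero; the more often they co-occur beyond chance, the larger the score.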

Example Boolean Query for Chi-Square Test

Context Vector Analysis

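• Context Vector Analysis
  – Co-occurring context terms as feature vectors
  – TF-IDF weighting

$$w_{t_i} = \frac{f(t_i, d)}{\max_j f(t_j, d)} \times \log\left(\frac{N}{n}\right) \qquad (4)$$

• Similarity measure
  – Cosine measure

$$S_{cv}(s,t) = \frac{\sum_{i=1}^{m} w_{s_i} \times w_{t_i}}{\sqrt{\sum_{i=1}^{m} (w_{s_i})^2} \times \sqrt{\sum_{i=1}^{m} (w_{t_i})^2}} \qquad (5)$$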
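
The weighting in Eq. (4) and the cosine measure in Eq. (5) can be sketched as below, assuming f(t_i, d) is the frequency of context term t_i in the search-result pages d, N is the total number of pages, and n is the number of pages containing t_i. The term counts, page counts, and function names are hypothetical.

```python
import math


def tfidf_vector(term_counts, page_counts, total_pages):
    """Weight each context term by (tf / max tf) * log(N / n)."""
    max_tf = max(term_counts.values())
    return {
        term: (tf / max_tf) * math.log(total_pages / page_counts[term])
        for term, tf in term_counts.items()
    }


def cosine_similarity(vec_s, vec_t):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * vec_t.get(term, 0.0) for term, w in vec_s.items())
    norm_s = math.sqrt(sum(w * w for w in vec_s.values()))
    norm_t = math.sqrt(sum(w * w for w in vec_t.values()))
    if norm_s == 0 or norm_t == 0:
        return 0.0
    return dot / (norm_s * norm_t)


# Hypothetical context-term counts taken from search-result pages.
page_counts = {"museum": 100, "taipei": 50, "art": 200, "stock": 400}
total_pages = 10000
vec_s = tfidf_vector({"museum": 4, "taipei": 2, "art": 2}, page_counts, total_pages)
vec_t = tfidf_vector({"museum": 3, "taipei": 3}, page_counts, total_pages)
vec_u = tfidf_vector({"stock": 5}, page_counts, total_pages)
```

A candidate whose context vector shares terms with the source query (`vec_t`) scores higher than one with a disjoint context (`vec_u`), even if the two never appear on the same page.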

The Combined Method

• To take advantage of both methods
  – Chi-Square Test: co-occurrence
  – Context Vector Analysis: similar context

$$S_{all}(s,t) = \sum_{m} \frac{\alpha_m}{R_m(s,t)} \qquad (6)$$

α_m: the weight assigned to method m
R_m(s,t): the rank of the score of t with respect to s under method m
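
A minimal sketch of rank-based combination: each method contributes a weight divided by the rank it gives a candidate, and the contributions are summed. The equal weights, the sample scores, and the requirement that every method scores the same candidates are assumptions made for illustration.

```python
def ranks(scores):
    """Map each candidate to its 1-based rank (highest score first)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {cand: position for position, cand in enumerate(ordered, start=1)}


def combined_scores(score_tables, weights):
    """Sum weight / rank over all methods for every candidate."""
    rank_tables = [ranks(scores) for scores in score_tables]
    candidates = score_tables[0].keys()
    return {
        cand: sum(w / table[cand] for w, table in zip(weights, rank_tables))
        for cand in candidates
    }


# Hypothetical scores for the source query "National Palace Museum".
chi_square = {"故宮博物院": 400.0, "故宮": 350.0, "繪畫": 20.0}
context_vector = {"故宮博物院": 0.8, "故宮": 0.7, "繪畫": 0.1}
combined = combined_scores([chi_square, context_vector], weights=[1.0, 1.0])
best = max(combined, key=combined.get)
```

Combining ranks rather than raw scores avoids having to put chi-square values and cosine values on a common scale.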

Experiments

• Web Search
  – Chinese search engine logs
    • Dreamer Log: 228,566 unique terms, during a period of 3 months in 1998
    • GAIS Log: 114,182 unique queries during a period of two weeks in 1999
• Digital Library
  – STICNET Database
    • 33,797 scientific documents in 86 categories, between 1983 and 1997

Experiments on Web Search

• Test sets
  – Popular-query set: 430 popular query terms
    • Type Dic: 36%
    • Type OOV: 64%
  – Random-query set
    • Randomly selected 200 Chinese queries from the top 20,000 queries in Dreamer Log
  – Proper names & technical terms
    • 50 scientists’ names & 50 disease names
  – Common terms
    • Randomly selected 100 common nouns and 100 common verbs

Popular Chinese Query Set

Table 3. Coverage and inclusion rates for popular Chinese queries using different methods.

Method      Query Type    Top-1    Top-3    Top-5    Coverage
CV          Dic           56.4%    70.5%    74.4%    80.1%
CV          OOV           56.2%    66.1%    69.3%    85.0%
CV          All           56.3%    67.7%    71.2%    83.3%
χ2          Dic           40.4%    61.5%    67.9%    80.1%
χ2          OOV           54.7%    65.0%    68.2%    85.0%
χ2          All           49.5%    63.7%    68.1%    83.3%
Combined    Dic           57.7%    71.2%    75.0%    80.1%
Combined    OOV           56.6%    67.9%    70.9%    85.0%
Combined    All           57.2%    68.6%    72.8%    83.3%
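
The evaluation measures can be sketched as follows, assuming that the top-n inclusion rate is the fraction of queries with at least one correct translation among the n highest-ranked candidates, and that coverage is the fraction of queries with a correct translation anywhere in the extracted candidates. These definitions, the sample data, and the function names are assumptions for illustration.

```python
def top_n_inclusion_rate(ranked_candidates, gold, n):
    """Fraction of queries with a correct translation in the top n."""
    hits = sum(
        1
        for query, candidates in ranked_candidates.items()
        if gold[query] & set(candidates[:n])
    )
    return hits / len(ranked_candidates)


def coverage_rate(ranked_candidates, gold):
    """Fraction of queries with a correct translation among all candidates."""
    hits = sum(
        1
        for query, candidates in ranked_candidates.items()
        if gold[query] & set(candidates)
    )
    return hits / len(ranked_candidates)


# Hypothetical ranked outputs and reference translations.
ranked = {
    "National Palace Museum": ["故宮", "博物館", "故宮博物院"],
    "Yahoo": ["搜尋", "雅虎"],
    "SARS": ["肺炎"],
}
gold = {
    "National Palace Museum": {"故宮博物院", "故宮"},
    "Yahoo": {"雅虎"},
    "SARS": {"嚴重急性呼吸道症候群"},
}
top1 = top_n_inclusion_rate(ranked, gold, 1)
top3 = top_n_inclusion_rate(ranked, gold, 3)
coverage = coverage_rate(ranked, gold)
```

Under these definitions coverage is an upper bound on every top-n inclusion rate, which matches the pattern in the tables.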

Popular English Query Set

Table 4. Coverage and top-n inclusion rates for popular English query set using different methods.

Method Top-1 Top-3 Top-5 Coverage

CV 50.9% 60.1% 60.8% 80.9%

χ2 44.6% 56.1% 59.2% 80.9%

Combined 51.8% 60.7% 62.2% 80.9%

Random Query Set

Table 5. Coverage and top-n inclusion rates for the random-query set using different methods.

Method Top-1 Top-3 Top-5 Coverage

CV 25.5% 45.5% 50.5% 60.5%

χ2 26.0% 44.5% 50.5% 60.5%

Combined 29.5% 49.5% 56.5% 60.5%

Proper Names and Technical Terms

Table 6. Top-n inclusion rates for proper names and technical terms using the combined method.

Query Type        Top-1    Top-3    Top-5
Scientist Name    40.0%    52.0%    60.0%
Disease Name      44.0%    60.0%    70.0%

Common Nouns and Verbs

Table 8. Top-n inclusion rates for common nouns and verbs using the combined approach.

Query Type Top-1 Top-3 Top-5

100 Common Nouns 23.0% 33.0% 43.0%

100 Common Verbs 6.0% 8.0% 10.0%

• Our methods are less reliable for common terms

Summary of Different Methods

• χ2 test
  – Fast
  – More suitable for high-frequency terms
• CV
  – Slow (for feature extraction)
  – Applicable to low-frequency terms
• Combined
  – Slow
  – Both high-frequency & low-frequency terms

Experiments on Digital Libraries

• Cross-language search for STICNET Database
  – 33,797 scientific documents in 86 categories, between 1983 and 1997
  – 410,557 English-Chinese bilingual key terms
• Challenges:
  – Various categories in specific domains
  – Hard to find translations on the Web

Example Cross-Lingual Queries in STICNET Database

STICNET Database Search Result

Translation of Auto-Extracted Unknown Terms

Table 9. The top-n inclusion rates of translations for auto-extracted useful unknown terms.

Query Type                                                Top-1    Top-3    Top-5
Auto-extracted useful terms in Information Engineering    33.3%    37.5%    50.0%
Auto-extracted useful terms in Medicine                   34.6%    46.2%    50.0%

• The feasibility of auto-extracted unknown terms has been shown

Some Examples of the Auto-Extracted Translations

English Terminologies Chinese Translation

mechanical strain 機械應變

viscous damping 黏滯阻尼

Extracorporeal Shock Wave Lithotripsy 震波碎石

Galilei, Galileo 伽利略 / 伽里略 / 加利略

Legionnaires' Disease 退伍軍人症

Other Applications

• Text Classification
• Query Clustering
• Search Result Clustering
• Concept Search

LiveClassifier

A system that creates classifiers through Web mining

[WWW 2004]

LiveClassifier

Users create topic hierarchies and define classes/keywords

LiveClassifier

Web

Auto-extracted training data; No manually-labeled data provided

Exploiting the structure information inherent for training

LiveClassifier

People

Place

Subjects

Sub-subjects

LiveClassifier

Classifying documents

Into classes

LiveClassifier

Classifying short texts

Into classes

Term Clustering

HAC-based Binary Clustering

Min-Max Partition

Query Clustering

[Dendrogram of query terms, cut into five clusters (numbered 1 to 5 in the figure):]

• 勞委會, 職訓局, 就業, 青輔會, 自傳, 徵才, 人力資源, 104 人力銀行, 人力銀行, 找工作, 履歷表, 求職, 求才
• 占卜, 塔羅牌, 算命, 紫微斗數, 命理, 姓名學, 心理測驗, 星座, 愛情
• eva, 長榮航空 (EVA airline), 長榮 (EVA), 航空公司 (airline), 航空 (airway), 華航 (China airline), 中華航空 (China airline)
• 補帖, 大補帖, 泡麵, dbt
• 武俠, 金庸, 武俠小說, 黃易, 作家

Thesaurus Construction from Query Log

• Query logs provide representative terms for DL usage
• Taxonomy generation from query logs
  – Query clustering
  – Query categorization
  – Document categorization

[Diagram: high-frequency terms from the query logs feed taxonomy generation (query clustering); low-frequency terms feed query term categorization; relevant documents feed document term categorization.]

Search Result Clustering

• Why search result clustering?
• Why is SRC different from document clustering?
  – In assessment of algorithm’s quality
  – Precision, recall vs. user-oriented, subjective assessment

Example of Search Result Clustering

Query: NTU?

• National Taiwan University
• NTU Hospital
• Nanyang Technological University, Singapore

Example Clustering Search Engines

• Vivisimo.com
  – Clusty.com
• KillerInfo.com
• InfoNetWare.com
• SnakeT (Snippet Aggregation for Knowledge ExTraction): http://roquefort.unipi.it/
  – A hierarchical clustering engine for snippets
• Mooter.com
• …

Example on Vivisimo

Vivisimo (cont.)

Clusty.com

InfoNetWare.com

Concept Search

• Conventional search
  – Keyword search for “researcher” and “AI” and “Taiwan”
• Concept-level search

[Diagram: the concepts researcher, AI, and Taiwan are linked to an interesting document that contains related terms such as “professor”, “NTU”, and “neural network”.]

Further Reading

• Jenq-Haur Wang, Jei-Wen Teng, Wen-Hsiang Lu, and Lee-Feng Chien, "Exploiting the Web as the Multilingual Corpus for Unknown Query Translation," Journal of the American Society for Information Science and Technology (JASIST), Vol. 57, No. 5, pp. 660-670, Special Issue on Multilingual Information Systems, Mar. 2006. (SCI, SSCI)
• Jenq-Haur Wang, Jei-Wen Teng, Pu-Jen Cheng, Wen-Hsiang Lu, and Lee-Feng Chien, "Translating Unknown Cross-Lingual Queries in Digital Libraries Using a Web-based Approach," Proceedings of the 4th ACM/IEEE Joint Conference on Digital Libraries (JCDL 2004), pp. 108-116.
• Wen-Hsiang Lu, Lee-Feng Chien, and Hsi-Jian Lee, "Anchor Text Mining for Translation of Web Queries: A Transitive Translation Approach," ACM Transactions on Information Systems, Vol. 22, No. 2, pp. 242-269, 2004. (SCI)
• Chien-Chung Huang, Shui-Lung Chuang, and Lee-Feng Chien, "LiveClassifier: Creating Hierarchical Text Classifiers through Web Corpora," Proceedings of WWW 2004, pp. 184-192.
• Wen-Hsiang Lu, Lee-Feng Chien, and Hsi-Jian Lee, "Translation of Web Queries Using Anchor Text Mining," ACM Transactions on Asian Language Information Processing, pp. 159-172, 2002.
• Lee-Feng Chien, T.-I. Huang, and M.-C. Chien, "PAT-tree-based Keyword Extraction for Chinese Information Retrieval," Proceedings of SIGIR 1997, pp. 50-58.

Thanks for Your Attention!

• Any questions or comments?
  – [email protected]