mining the web for information organization j. h. wang academia sinica

Mining the Web for Information Organization

J. H. Wang
Academia Sinica

2

Outline

• Introduction
• Web Mining
• Cross-Language Web Search
• Other Applications

3

Introduction

• Huge amount of Web data
  – Rich and dynamic resources of human knowledge
  – Multimedia
  – Scalability

How to organize Web data into useful information?

4

Number of Web Pages

The world's largest search engine?

[Chart: Billions of Textual Documents Indexed, December 1995-September 2003]

KEY: GG=Google, ATW=AllTheWeb, INK=Inktomi, TMA=Teoma, AV=AltaVista.

Source: Search Engine Watch (Nov. 2004)

Search Engine   Reported Size            Page Depth
Google          8.1 billion              101K
MSN             5.0 billion              150K
Yahoo           4.2 billion (estimate)   500K
Ask Jeeves      2.5 billion              101K+

5

Web Users and Pages (7 years ago)

Area         Users   Web Pages   Time
World-wide   150M    800M        7/99
China        4M      2.5~3M      7/99
Taiwan       4M      3M          7/99

Challenge of scalability!

Total Users: 800M
Chinese Users: 110M, including 87M (CN), 4.9M (HK), 11.6M (TW), 2.9M (MY), 2.14M (SG), and others.

Source: Global Reach, 2004

6

Web Mining

• Data Mining
• Text Mining
• Web Mining Technologies

7

Data Mining

• Data Mining (Knowledge Discovery in Databases) is a process of nontrivial extraction of implicit, previously unknown, and potentially useful information (such as knowledge rules, constraints, regularities) from data in databases [G. Piatetsky-Shapiro and W. J. Frawley]

8

Text Mining

• Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources [Marti Hearst]

9

Web Mining

• Web Mining is the use of data mining techniques to automatically discover and extract information from Web documents and services [O. Etzioni]

10

Comparison

• Data mining tries to find interesting (non-trivial, implicit, previously unknown, potentially useful) patterns from large databases

• In text mining, the patterns are extracted from natural language texts rather than from structured databases of facts

• Web mining discovers and extracts information from Web documents and services

11

Web Mining Technologies

• Web content mining
• Web structure mining
• Web usage mining

12

Web Content Mining

• Unstructured documents
  – Free texts such as news articles
• Semi-structured documents
  – HTML structures and hyperlink information
  – Intra-document structure
• Applications: text categorization, text clustering, information extraction, computational linguistics, …

13

Web Structure Mining

• The structure of the hyperlinks within the Web
  – Inter-document structure
  – HITS, PageRank
• Social network and citation analysis
• Applications: to calculate the quality rank or relevancy of each Web page, Web page categorization, …
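As a rough illustration of how link structure alone can yield a quality rank, the following is a minimal power-iteration sketch of PageRank. The toy link graph, the damping value of 0.85, and the iteration count are assumptions chosen for the example, not values from these slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets the same "random jump" share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page Web: D links to C but nobody links to D.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
```

In this toy graph, page C collects links from three pages and ends up with the highest rank, while D, which has no in-links, ends up with the lowest.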

14

Web Usage Mining

• Techniques that could predict user behavior while the user interacts with the Web
  – To map the usage data of the Web server into relational tables
  – To use the log data directly
• Applications: learning a user profile (personalization) vs. learning user navigation patterns
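The first option, mapping server usage data into relational tables, can be sketched as follows. The log lines are made-up samples, and the Common Log Format layout, the `hits` table, and its column names are assumptions for illustration only.

```python
import re
import sqlite3

# Assumed layout: Common Log Format (host, identity, user, time, request, status, size).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

# Made-up sample entries, not real usage data.
SAMPLE_LOG = [
    '10.0.0.1 - - [01/Jul/1999:10:00:00 +0800] "GET /index.html HTTP/1.0" 200 1043',
    '10.0.0.1 - - [01/Jul/1999:10:00:05 +0800] "GET /search.html HTTP/1.0" 200 2310',
    '10.0.0.2 - - [01/Jul/1999:10:01:00 +0800] "GET /index.html HTTP/1.0" 200 1043',
]


def load_log(lines):
    """Parse log lines and store them in an in-memory relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hits (host TEXT, time TEXT, method TEXT, path TEXT, status INTEGER)"
    )
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that do not follow the assumed format
        conn.execute(
            "INSERT INTO hits VALUES (?, ?, ?, ?, ?)",
            (
                match.group("host"),
                match.group("time"),
                match.group("method"),
                match.group("path"),
                int(match.group("status")),
            ),
        )
    return conn


conn = load_log(SAMPLE_LOG)
# Once the data is relational, usage questions become ordinary SQL queries.
page_counts = conn.execute(
    "SELECT path, COUNT(*) FROM hits GROUP BY path ORDER BY COUNT(*) DESC, path"
).fetchall()
```

Grouping by `host` instead of `path` would give a first approximation of per-user navigation sequences.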

15

Related Fields of Research

• IR (Information Retrieval)
• IE (Information Extraction)
• ML (Machine Learning)

16

LiveTrans: Cross-language Web Search

• LiveTrans: http://livetrans.iis.sinica.edu.tw/lt.html


17

Examples

18

More Examples

19

Cross Language Information Retrieval (CLIR)

• A technology enabling users to query in one language and retrieve relevant documents written or indexed in another language

20

Cross Language Web Search

• A technology enabling users to query in one language and retrieve relevant Web pages written or indexed in another language

21

Why “Cross-Language”?

• Source: Global Reach (global-reach.biz/globstats)

22

Top Ten Languages Used in the Web

Source: Internet World Stats (Sep. 20, 2006)

Top Ten Languages in the Internet

Columns: % of all Internet users; Internet users by language; Internet penetration by language; Internet growth for language (2000-2006); world population for the language (2006 estimate)

Language                  % of Users   Internet Users   Penetration   Growth    Population
English                   29.7 %       322,600,837      28.7 %        135.2 %   1,125,664,397
Chinese                   13.3 %       144,301,513      10.8 %        346.7 %   1,340,767,863
Japanese                  7.9 %        86,300,000       67.2 %        83.3 %    128,389,000
Spanish                   7.5 %        81,729,671       18.7 %        231.1 %   437,502,257
German                    5.4 %        58,854,682       61.3 %        113.2 %   95,982,043
French                    4.6 %        49,660,498       13.0 %        307.1 %   381,193,149
Portuguese                3.1 %        34,064,760       14.8 %        349.6 %   230,846,275
Korean                    3.1 %        32,372,000       45.8 %        78.0 %    73,945,860
Italian                   2.7 %        28,870,000       48.8 %        118.7 %   59,115,261
Russian                   2.2 %        23,700,000       16.5 %        664.5 %   143,682,757
TOP TEN LANGUAGES         79.5 %       863,981,961      21.5 %        166.7 %   4,017,088,863
Rest of World Languages   20.5 %       222,268,942      9.0 %         500.0 %   2,482,608,197
WORLD TOTAL               100.0 %      1,086,250,903    16.7 %        200.9 %   6,499,697,060

[Chart: Top Ten Languages Used in the Web (Number of Internet Users by Language)]

More and more non-English users!

http://www.internetworldstats.com/reports.htm



http://www.internetworldstats.com/stats10.htm







http://www.internetworldstats.com/stats.htm

23

Web Content by Language

Source: http://www.netz-tipp.de/languages.html (2002)

[Chart of Web Content, 2002. x-axis: Language (English, German, French, Japanese, Spanish, Chinese, Italian, Dutch, Russian, Korean, Portuguese); y-axis: Millions of Web pages, 0 to 1200]

More and more non-English pages

24

Number of Chinese Web Pages

866,000,000 pages

Scalability problem!

25

Challenge of Cross-Language Web Search

• Existing CLIR systems mostly rely on bilingual dictionaries and dictionary lookup

• 81% of the search terms could not be found in common English-Chinese translation dictionaries
  – Examples: 中央處理器 (CPU), 電子商務 (E-commerce), 個人數位助理 (PDA), 雅虎 (Yahoo), 太空總署 (NASA), 星際大戰 (Star Wars), 非典型肺炎 (SARS), …

26

Challenge

• Existing CLIR systems mostly rely on bilingual dictionaries and dictionary lookup

• 81% of the search requests could not be found in common English-Chinese translation dictionaries

• How to automatically find effective translations for query terms not included in a dictionary?

27

CLIR

• Conventional approach to query translation
  – Parallel documents as the corpus
  – Assume long queries
• Problems of CLIR in Web search
  – No corpus for cross-lingual training
  – Short queries
  – "Out-of-dictionary" terms, e.g., proper nouns, new terminology, …

English Terminology                      Chinese Translation
mechanical strain                        機械應變
viscous damping                          黏滯阻尼
Richard Feynman                          費曼
Hypoplastic Left Heart Syndrome          左心發育不全症候群
NII Japan                                國立情報學研究所
SARS                                     嚴重急性呼吸道症候群
Extracorporeal Shock Wave Lithotripsy    震波碎石
Davinci                                  達文西

28

Translation Lexicon Construction for CLIR

• To use the Web as the corpus for query translation
  – Web mining techniques
    • Anchor-text-based [ACM TOIS '04, ACM TALIP '02]
    • Search-result-based [JCDL '04]
• To extract terms from real document collections as possible queries
  – Term extraction method [SIGIR '97]

29

Web Mining Approach to Term Translation Extraction

[Diagram: the source query "Academia Sinica" is sent to the LiveTrans Engine, which mines anchor texts and search results from the Web and returns the target translations 中央研究院 / 中研院]

30

Search-Result Page – National Palace Museum vs. 故宮博物院

• Mixed-language characteristic in Chinese pages
• How to extract translation candidates?
• Which candidates to choose?

[Screenshot label: Noise]

31

Coverage Rate of Top-Ranked Search-Result Pages

• Translations of 95% of popular Web queries can be found in the top 30-40 result pages

• About 70% of random queries were covered

• Many relevant translations can also be found

32

Anchor-Text Set -- Yahoo vs. 雅虎

• Anchor text (link text)
  – The descriptive text of a link on a Web page
• Anchor-text set
  – A set of anchor texts pointing to the same page (URL)
  – Multilingual translations
    − Yahoo / 雅虎 / 야후
    − America / 美国 / アメリカ
• Anchor-text-set corpus
  – A collection of anchor-text sets

[Diagram: pages in Taiwan, China, Japan, Korea, and America link to http://www.yahoo.com with anchor texts such as "Yahoo Search Engine", "美国雅虎", "雅虎搜尋引擎", "Yahoo! America", "야후 -USA", and "アメリカの Yahoo!"]
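A minimal sketch of how an anchor-text-set corpus might be assembled from crawled links: anchor texts are grouped by target URL, and other texts in the same set become translation candidates. The function names, the sample links, and the `example.org` URL are hypothetical.

```python
from collections import defaultdict


def build_anchor_text_sets(links):
    """Group anchor texts by the URL they point to.

    links is an iterable of (anchor_text, target_url) pairs.
    """
    sets = defaultdict(set)
    for anchor_text, url in links:
        sets[url].add(anchor_text.strip())
    return dict(sets)


def translation_candidates(source_term, anchor_sets):
    """Collect anchor texts that share an anchor-text set with source_term."""
    candidates = set()
    for texts in anchor_sets.values():
        if source_term in texts:
            candidates |= texts - {source_term}
    return candidates


# Illustrative links only; the second URL is a placeholder.
sample_links = [
    ("Yahoo", "http://www.yahoo.com"),
    ("雅虎", "http://www.yahoo.com"),
    ("야후", "http://www.yahoo.com"),
    ("America", "http://example.org/usa"),
    ("美国", "http://example.org/usa"),
]

anchor_sets = build_anchor_text_sets(sample_links)
candidates = translation_candidates("Yahoo", anchor_sets)
```

The candidates still need to be ranked, since a real anchor-text set also contains unrelated descriptive text.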

33

Problems

• How to extract translation candidates with correct lexical boundary?
  – Term extraction
    • From the search-result pages
    • From the document collections
  – Bilingual lexicon construction
• How to choose relevant candidates?
  – Term translation

34

Term Translation Extraction from Different Resources

[Diagram: a source query (National Palace Museum) goes to a search engine, which returns search-result pages; a Web spider builds the anchor-text corpus. Term extraction is applied to both resources, and similarity estimation selects the target translations (國立故宮博物院, 故宮, 故宮博物院)]

35

Term Extraction

• Problem: extracting terms from DL documents with correct lexical boundaries
  – Correctly segmented: 國立故宮博物院, 故宮博物院, 故宮
  – Incorrect text segments: 立故宮, 故宮博, 宮博物, 宮博, …

36

Three Approaches to Term Extraction

• NLP (Linguistic) approach
  – Named entity recognition
• Extraction pattern/template approach
  – Wrapper generation/induction
• Statistical approach
  – Class-based language model
  – PAT-tree-based

37

PAT-tree-based Term Extraction

• SCP (Symmetric Conditional Probability)
  – Cohesion holding the words together
  – Low-frequency n-grams tend to be discarded
• CD (Context Dependency)
  – Dependence on the left- or right-adjacent word/character
  – Low-frequency n-grams can be extracted
• SCPCD: a combination of the two

38

Association Measure

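$$\mathrm{SCP}(w_1 \ldots w_n) = \frac{p(w_1 \ldots w_n)^2}{\frac{1}{n-1}\sum_{i=1}^{n-1} p(w_1 \ldots w_i)\, p(w_{i+1} \ldots w_n)} = \frac{\mathrm{freq}(w_1 \ldots w_n)^2}{\frac{1}{n-1}\sum_{i=1}^{n-1} \mathrm{freq}(w_1 \ldots w_i)\, \mathrm{freq}(w_{i+1} \ldots w_n)}$$

$$\mathrm{CD}(w_1 \ldots w_n) = \frac{\mathrm{LC}(w_1 \ldots w_n)\, \mathrm{RC}(w_1 \ldots w_n)}{\mathrm{freq}(w_1 \ldots w_n)^2}$$

$$\mathrm{SCPCD}(w_1 \ldots w_n) = \mathrm{SCP}(w_1 \ldots w_n)\, \mathrm{CD}(w_1 \ldots w_n) = \frac{\mathrm{LC}(w_1 \ldots w_n)\, \mathrm{RC}(w_1 \ldots w_n)}{\frac{1}{n-1}\sum_{i=1}^{n-1} \mathrm{freq}(w_1 \ldots w_i)\, \mathrm{freq}(w_{i+1} \ldots w_n)}$$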
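To make the SCP, CD, and SCPCD measures concrete, here is a small sketch that scores character n-grams over a toy corpus. It assumes that LC and RC are the numbers of distinct characters seen immediately to the left and right of an n-gram, and it counts n-grams with a plain dictionary rather than a PAT tree; the four sample strings are invented.

```python
from collections import Counter, defaultdict


def ngram_stats(texts, max_n):
    """Count character n-grams and record their distinct left/right neighbours."""
    freq = Counter()
    left = defaultdict(set)
    right = defaultdict(set)
    for text in texts:
        for i in range(len(text)):
            for n in range(1, max_n + 1):
                if i + n > len(text):
                    break
                gram = text[i:i + n]
                freq[gram] += 1
                if i > 0:
                    left[gram].add(text[i - 1])
                if i + n < len(text):
                    right[gram].add(text[i + n])
    return freq, left, right


def scp(gram, freq):
    """Symmetric conditional probability using raw frequencies (len(gram) >= 2)."""
    n = len(gram)
    avg = sum(freq[gram[:i]] * freq[gram[i:]] for i in range(1, n)) / (n - 1)
    return freq[gram] ** 2 / avg


def cd(gram, freq, left, right):
    """Context dependency: LC * RC / freq^2 (LC, RC as assumed above)."""
    return len(left[gram]) * len(right[gram]) / freq[gram] ** 2


def scpcd(gram, freq, left, right):
    """Combined measure: SCP multiplied by CD."""
    return scp(gram, freq) * cd(gram, freq, left, right)


# Invented snippets that all mention 故宮 in different contexts.
corpus = ["參觀故宮博物院", "到故宮看展", "故宮博物院門票", "去故宮玩"]
freq, left, right = ngram_stats(corpus, max_n=7)

good = scpcd("故宮", freq, left, right)  # a correctly segmented term
bad = scpcd("宮博", freq, left, right)   # a fragment crossing a boundary
```

On this toy corpus the fragment 宮博 always has the same neighbours (故 before, 物 after), so its context-dependency factor is small and it scores well below 故宮.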

39

Term Extraction Performance

Table 1. The obtained extraction accuracy, including precision, recall, and average recall-precision, of auto-extracted translation candidates using different methods.

Association Measure   Precision   Recall   Avg. R-P
CD                    68.1 %      5.9 %    37.0 %
SCP                   62.6 %      63.3 %   63.0 %
SCPCD                 79.3 %      78.2 %   78.7 %

40

Speed Performance

Table 2. The obtained average speed performance of different term extraction methods.

Term Extraction Method              Time for Preprocessing   Time for Extraction
LocalMaxs (Web Queries)             0.87 s                   0.99 s
PATtree+LocalMaxs (Web Queries)     2.30 s                   0.61 s
LocalMaxs (1,367 docs)              63.47 s                  4,851.67 s
PATtree+LocalMaxs (1,367 docs)      840.90 s                 71.24 s
LocalMaxs (5,357 docs)              47,247.55 s              350,495.65 s
PATtree+LocalMaxs (5,357 docs)      11,086.67 s              759.32 s

41

Term Translation

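• Problem

[Diagram: the source query "porcelain" is compared by similarity against translation candidates 故宮博物院, 故宮, 繪畫, 瓷, 書法, 陶瓷, 玉器, 瓷器, 刺繡, …; figure labels: Source query, Translation candidates, Similarity, Relevant terms]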

42

Similarity Estimation

How to decide the ranking?

1) S and Ti frequently co-occur in the same pages
   – Not necessarily true for synonyms and antonyms

2) S and Ti have similar co-occurring context terms in the search-result pages

[Diagram: query S is compared with candidates T1, T2, …, Tn, e.g., National Palace Museum vs. 國立故宮博物院, 故宮, 故宮博物院]

43

Chi-Square Test

• Chi-Square Test: a statistical method for co-occurrence analysis [Gale & Church ‘91]

$$S_{\chi^2}(s,t) = \frac{N \times (a \times d - b \times c)^2}{(a+b) \times (a+c) \times (b+d) \times (c+d)} \qquad (3)$$

a: # of pages containing both terms s and t

b: # of pages containing term s but not t

c: # of pages containing term t but not s

d: # of pages containing neither term s nor t

N: the total number of pages, i.e., N = a+b+c+d

Available from search engine
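The chi-square score of Eq. (3) needs only the four page counts, which can be derived from search-engine hit counts. A minimal sketch follows; the helper `counts_from_hits` and all hit counts are hypothetical, and the total number of indexed pages is assumed to be known or estimated.

```python
def counts_from_hits(hits_both, hits_s, hits_t, total_pages):
    """Derive the contingency counts a, b, c, d from search-engine hit counts.

    hits_both: pages matching "s AND t"; hits_s, hits_t: pages matching each term.
    """
    a = hits_both
    b = hits_s - hits_both
    c = hits_t - hits_both
    d = total_pages - a - b - c
    return a, b, c, d


def chi_square_score(a, b, c, d):
    """Chi-square association score N(ad - bc)^2 / ((a+b)(a+c)(b+d)(c+d))."""
    n = a + b + c + d
    denominator = (a + b) * (a + c) * (b + d) * (c + d)
    if denominator == 0:
        return 0.0  # a term that appears on no page or on every page carries no signal
    return n * (a * d - b * c) ** 2 / denominator


# Made-up hit counts for a source term s and two candidate translations.
strong = chi_square_score(*counts_from_hits(80, 100, 120, 1000))
weak = chi_square_score(*counts_from_hits(20, 100, 120, 1000))
independent = chi_square_score(10, 90, 90, 810)
```

When the two terms co-occur exactly as often as chance predicts (a×d equals b×c), the score is zero.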

44

Example Boolean Query for Chi-Square Test

45

Context Vector Analysis

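• Context Vector Analysis
  – Co-occurring context terms as feature vectors
  – TF-IDF weighting

$$w_{t_i} = \frac{f(t_i, d)}{\max_j f(t_j, d)} \times \log\left(\frac{N}{n}\right) \qquad (4)$$

• Similarity measure
  – Cosine measure

$$S_{cv}(s,t) = \frac{\sum_{i=1}^{m} w_{s_i} \times w_{t_i}}{\sqrt{\sum_{i=1}^{m} (w_{s_i})^2 \times \sum_{i=1}^{m} (w_{t_i})^2}} \qquad (5)$$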
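A small sketch of context vector analysis: each term gets a TF-IDF-weighted vector of its co-occurring context terms, and two terms are compared with the cosine measure. It assumes that N is the total number of pages and n the number of pages containing the context term; all counts and context terms below are invented.

```python
import math


def tfidf_vector(term_freqs, doc_freqs, total_pages):
    """Weight context terms by normalized frequency times log(N / n)."""
    max_f = max(term_freqs.values())
    return {
        term: (f / max_f) * math.log(total_pages / doc_freqs[term])
        for term, f in term_freqs.items()
    }


def cosine(vec_s, vec_t):
    """Cosine similarity between two sparse vectors stored as dicts."""
    shared = set(vec_s) & set(vec_t)
    dot = sum(vec_s[k] * vec_t[k] for k in shared)
    norm_s = math.sqrt(sum(w * w for w in vec_s.values()))
    norm_t = math.sqrt(sum(w * w for w in vec_t.values()))
    if norm_s == 0 or norm_t == 0:
        return 0.0
    return dot / (norm_s * norm_t)


# Invented statistics: pages containing each context term, out of 1000 pages.
doc_freqs = {"museum": 50, "taipei": 100, "art": 200, "stock": 80}
total_pages = 1000

# Invented context-term frequencies in the search-result pages of each term.
source_ctx = {"museum": 4, "taipei": 2, "art": 2}
cand1_ctx = {"museum": 3, "taipei": 3, "art": 1}
cand2_ctx = {"stock": 5, "taipei": 1}

vec_s = tfidf_vector(source_ctx, doc_freqs, total_pages)
sim1 = cosine(vec_s, tfidf_vector(cand1_ctx, doc_freqs, total_pages))
sim2 = cosine(vec_s, tfidf_vector(cand2_ctx, doc_freqs, total_pages))
```

The first candidate shares most of its context with the source term and scores far higher than the second, even though neither needs to co-occur with the source term on the same page.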

46

The Combined Method

• To take advantage of both methods
  – Chi-Square Test: co-occurrence
  – Context Vector Analysis: similar context

$$S_{all}(s,t) = \sum_{m} \frac{\alpha_m}{R_m(s,t)} \qquad (6)$$

R_m(s,t): ranking of the score under each method m; α_m: weight assigned to method m
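The rank-based combination can be sketched as below, reading Eq. (6) as a sum over methods of a per-method weight divided by the candidate's rank under that method. The weights, method names, and scores are made up for illustration.

```python
def ranks_from_scores(scores):
    """Map each candidate to its rank (1 = best) under one method."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {cand: pos for pos, cand in enumerate(ordered, start=1)}


def combined_scores(score_tables, weights):
    """Sum weight / rank over all methods for every candidate."""
    combined = {}
    for method, scores in score_tables.items():
        for cand, rank in ranks_from_scores(scores).items():
            combined[cand] = combined.get(cand, 0.0) + weights[method] / rank
    return combined


# Hypothetical scores for three candidate translations of one query.
score_tables = {
    "chi_square": {"故宮博物院": 486.5, "故宮": 300.0, "瓷器": 6.7},
    "context_vector": {"故宮博物院": 0.95, "故宮": 0.91, "瓷器": 0.20},
}
weights = {"chi_square": 1.0, "context_vector": 1.0}  # assumed equal weights

combined = combined_scores(score_tables, weights)
best = max(combined, key=combined.get)
```

Using ranks rather than raw scores sidesteps the fact that chi-square values and cosine values live on very different scales.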

47

Experiments

• Web Search
  – Chinese search engine logs
    • Dreamer Log: 228,566 unique terms, during a period of 3 months in 1998
    • GAIS Log: 114,182 unique queries, during a period of two weeks in 1999
• Digital Library
  – STICNET Database
    • 33,797 scientific documents in 86 categories, between 1983 and 1997

48

Experiments on Web Search

• Test sets
  – Popular-query set: 430 popular query terms
    • Type Dic: 36%
    • Type OOV: 64%
  – Random-query set
    • Randomly selected 200 Chinese queries from the top 20,000 queries in Dreamer Log
  – Proper names & technical terms
    • 50 scientists' names & 50 disease names
  – Common terms
    • Randomly selected 100 common nouns and 100 common verbs

49

Popular Chinese Query Set

Table 3. Coverage and inclusion rates for popular Chinese queries using different methods.

Method     Query Type   Top-1    Top-3    Top-5    Coverage
CV         Dic          56.4%    70.5%    74.4%    80.1%
CV         OOV          56.2%    66.1%    69.3%    85.0%
CV         All          56.3%    67.7%    71.2%    83.3%
χ2         Dic          40.4%    61.5%    67.9%    80.1%
χ2         OOV          54.7%    65.0%    68.2%    85.0%
χ2         All          49.5%    63.7%    68.1%    83.3%
Combined   Dic          57.7%    71.2%    75.0%    80.1%
Combined   OOV          56.6%    67.9%    70.9%    85.0%
Combined   All          57.2%    68.6%    72.8%    83.3%

50

Popular English Query Set

Table 4. Coverage and top-n inclusion rates for popular English query set using different methods.

Method Top-1 Top-3 Top-5 Coverage

CV 50.9% 60.1% 60.8% 80.9%

χ2 44.6% 56.1% 59.2% 80.9%

Combined 51.8% 60.7% 62.2% 80.9%

51

Random Query Set

Table 5. Coverage and top-n inclusion rates for the random-query set using different methods.

Method Top-1 Top-3 Top-5 Coverage

CV 25.5% 45.5% 50.5% 60.5%

χ2 26.0% 44.5% 50.5% 60.5%

Combined 29.5% 49.5% 56.5% 60.5%

52

Proper Names and Technical Terms

Table 6. Top-n inclusion rates for proper names and technical terms using the combined method.

Query Type       Top-1    Top-3    Top-5
Scientist Name   40.0%    52.0%    60.0%
Disease Name     44.0%    60.0%    70.0%

53

Common Nouns and Verbs

Table 8. Top-n inclusion rates for common nouns and verbs using the combined approach.

Query Type         Top-1    Top-3    Top-5
100 Common Nouns   23.0%    33.0%    43.0%
100 Common Verbs   6.0%     8.0%     10.0%

• Our methods are less reliable for common terms

54

Summary of Different Methods

• χ2 test
  – Fast
  – More suitable for high-frequency terms
• CV
  – Slow (for feature extraction)
  – Applicable to low-frequency terms
• Combined
  – Slow
  – Both high-frequency & low-frequency terms

55

Experiments on Digital Libraries

• Cross-language search for STICNET Database
  – 33,797 scientific documents in 86 categories, between 1983 and 1997
  – 410,557 English-Chinese bilingual key terms
• Challenges:
  – Various categories in specific domains
  – Hard to find translations on the Web

56

Example Cross-Lingual Queries in STICNET Database

57

STICNET Database Search Result

58

Translation of Auto-Extracted Unknown Terms

Table 9. The top-n inclusion rates of translations for auto-extracted useful unknown terms.

Query Type                                                 Top-1    Top-3    Top-5
Auto-extracted useful terms in Information Engineering    33.3%    37.5%    50.0%
Auto-extracted useful terms in Medicine                   34.6%    46.2%    50.0%

• The feasibility of translating auto-extracted unknown terms has been shown

59

Some Examples of the Auto-Extracted Translations

English Terminologies Chinese Translation

mechanical strain 機械應變

viscous damping 黏滯阻尼

Extracorporeal Shock Wave Lithotripsy 震波碎石

Galilei, Galileo 伽利略 / 伽里略 / 加利略

Legionnaires' Disease 退伍軍人症

60

Other Applications

• Text Classification
• Query Clustering
• Search Result Clustering
• Concept Search

61

LiveClassifier

A system that creates classifiers through Web mining

[WWW 2004]

62

LiveClassifier

Users create topic hierarchies and define classes/keywords

63

LiveClassifier

Web

Auto-extracted training data; No manually-labeled data provided

Exploiting the structure information inherent for training

64

LiveClassifier

People

Place

Subjects

Sub-subjects

65

LiveClassifier

Classifying documents

Into classes

66

LiveClassifier

Classifying short texts

Into classes

67

LiveClassifier

68

LiveClassifier

70

Term Clustering

71

HAC-based Binary Clustering

72

Min-Max Partition

73

Query Clustering

[Figure: dendrogram of query terms, cut into five clusters (numbered 1 to 5 in the figure)]

• 勞委會, 職訓局, 就業, 青輔會, 自傳, 徵才, 人力資源, 104 人力銀行, 人力銀行, 找工作, 履歷表, 求職, 求才
• 占卜, 塔羅牌, 算命, 紫微斗數, 命理, 姓名學, 心理測驗, 星座, 愛情
• eva, 長榮航空 (EVA airline), 長榮 (EVA), 航空公司 (airline), 航空 (airway), 華航 (China airline), 中華航空 (China airline)
• 補帖, 大補帖, 泡麵, dbt
• 武俠, 金庸, 武俠小說, 黃易, 作家

74

Thesaurus Construction from Query Log

• Query logs provide representative terms for DL usage
• Taxonomy generation from query logs
  – Query clustering
  – Query categorization
  – Document categorization

[Diagram: from the query logs, high-frequency terms feed Taxonomy Generation (Query Clustering), low-frequency terms feed Query Term Categorization, and relevant documents feed Document Term Categorization]

75

Search Result Clustering

• Why search result clustering?
• Why is SRC different from document clustering?
  – In assessment of algorithm's quality
  – Precision, recall vs. user-oriented, subjective assessment

76

Example of Search Result Clustering

National Taiwan University NTU Hospital

Nanyang Technological University, Singapore

NTU?

77

Example Clustering Search Engines

• Vivisimo.com
  – Clusty.com
• KillerInfo.com
• InfoNetWare.com
• SnakeT (Snippet Aggregation for Knowledge ExTraction): http://roquefort.unipi.it/
  – A hierarchical clustering engine for snippets
• Mooter.com
• …



78

Example on Vivisimo

79

Vivisimo (cont.)

80

Clusty.com

81

InfoNetWare.com

82

Concept Search

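• Conventional search
  – Keyword search for "researcher" and "AI" and "Taiwan"
• Concept-level search

[Diagram: an interesting document containing "professor", "NTU", and "neural network" is matched through the concepts researcher, Taiwan, and AI, although it contains none of the query keywords]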

83

Further Reading

• Jenq-Haur Wang, Jei-Wen Teng, Wen-Hsiang Lu, and Lee-Feng Chien, "Exploiting the Web as the Multilingual Corpus for Unknown Query Translation," Journal of the American Society for Information Science and Technology (JASIST), Vol. 57, No. 5, pp. 660-670, Special Issue on Multilingual Information Systems, Mar. 2006. (SCI, SSCI)

• Jenq-Haur Wang, Jei-Wen Teng, Pu-Jen Cheng, Wen-Hsiang Lu, and Lee-Feng Chien, "Translating Unknown Cross-Lingual Queries in Digital Libraries Using a Web-based Approach," Proceedings of the 4th ACM/IEEE Joint Conference on Digital Libraries (JCDL 2004), pp. 108-116.

• Wen-Hsiang Lu, Lee-Feng Chien, Hsi-Jian Lee, Anchor Text Mining for Translation of Web Queries: A Transitive Translation Approach. ACM Transactions on Information Systems, Vol. 22, No. 2, pp. 242-269, 2004. (SCI)

• Chien-Chung Huang, Shui-Lung Chuang, Lee-Feng Chien, Liveclassifier: Creating Hierarchical Text Classifiers through Web Corpora, Proceedings of WWW 2004, pp. 184-192.

• Wen-Hsiang Lu, Lee-Feng Chien, Hsi-Jian Lee, Translation of Web Queries Using Anchor Text Mining. ACM Transactions on Asian Language Information Processing, pp. 159-172, 2002.

• Lee-Feng Chien, T.-I. Huang, M-C. Chien, Pat-tree-based Keyword Extraction for Chinese Information Retrieval, Proceedings of SIGIR 1997, pp. 50-58.

84

Thanks for Your Attention!

• Any questions or comments?
  – [email protected]
