Mining Academic Community Jan-Ming Ho hoho@iis.sinica.edu.tw Computer System and Communication Lab Institute of Information Science Academia Sinica

Post on 15-Jan-2016


Mining Academic Community

Jan-Ming Ho
hoho@iis.sinica.edu.tw

Computer System and Communication Lab
Institute of Information Science

Academia Sinica


What is Community?

In Graph Theory
• densely connected groups of vertices, with sparser connections between groups

In Social Network Analysis
• groups of entities that share similar properties or connect to each other via certain relations

A social network is a structure made up of nodes, representing entities from different conceptual groups, that are linked with different types of relations
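The graph-theoretic definition can be made concrete with a minimal sketch that counts edges inside groups versus edges between groups. The function name, the toy graph, and the group labels are illustrative assumptions, not part of the original slides.

```python
def intra_inter_edges(edges, groups):
    """Count edges whose endpoints share a group versus edges crossing groups.

    edges:  iterable of (u, v) vertex pairs
    groups: dict mapping a group name to the set of its vertices
    """
    member_of = {v: name for name, members in groups.items() for v in members}
    intra = inter = 0
    for u, v in edges:
        if member_of[u] == member_of[v]:
            intra += 1
        else:
            inter += 1
    return intra, inter


# Toy example: two triangles joined by a single bridge edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
groups = {"g1": {"a", "b", "c"}, "g2": {"d", "e", "f"}}
print(intra_inter_edges(edges, groups))
```

A grouping with many intra-group edges and few inter-group edges matches the "dense inside, sparse between" description above.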


Why is Community Important?

Interesting data with community structure
• researcher collaboration, friendship networks, the WWW, massively multiplayer online gaming, electronic communications

Groups of web pages that link to more web pages in the community than pages outside correspond to web pages on related topics

Groups in social networks correspond to social communities, which can be used to understand organizational structure, academic collaboration, shared interests and affinities, etc.


Motivation

Understand the research network between authors, conferences and topics (rank entities by relevance for given entities)

Find and justifiably recommend research collaborators for given authors

Explore the academic social network
• Find the most important papers, researchers and venues for a given topic


Related Systems

Many digital library systems exist
• ACM Digital Library
• IEEE Xplore
• DBLP
• CiteSeer
• Libra
• DBConnect

Problems
• The coverage of the datasets is not large enough
• The name ambiguity problem exists in
  • Web pages
  • Citation records


Libra Academic Search

http://libra.msra.cn
• Free computer science bibliography search engine
• A test-bed for object-level vertical search research
• Currently the following types of paper-related objects can be searched: Papers, Authors, Conferences, Journals, Research Communities


DBconnect: Conference


DBconnect: Topic


DBconnect: Author


ZoomInfo

(1) People Directory

(2) Developer Tools

(3) Social Network, Profile Statistics, Employment History

(4) Ability to disambiguate names? E.g., a search returns 21 different people called “Bing Liu”


ArnetMiner


Our goal

Developing an automatic system to
• Explore the academic social network
• Find the most important papers, researchers and venues for a given topic

Provide solutions for existing problems
• Collecting larger citation datasets
  • Retrieving data from web pages: publication list finder, extracting citation strings from web pages, citation parser
  • Multilingual data sources: Chinese and English corpora
• Name disambiguation mechanism in
  • Web pages
  • Citation records


Our contributions

Kai-Hsiang Yang, Kun-Yan Chiou, Hahn-Ming Lee, and Jan-Ming Ho, "Web Appearance Disambiguation of Personal Names Based on Network Motif," in the 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006), Hong Kong, Dec. 18-22, 2006

Kai-Hsiang Yang, Jen-Ming Chung and Jan-Ming Ho, "PLF: A Publication List Web Page Finder for Researchers," in Proceedings of the 2007 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2007), Silicon Valley, USA, Nov. 2-5, 2007

Kai-Hsiang Yang, Wei-Da Chen, Hahn-Ming Lee and Jan-Ming Ho, "Mining Translations of Chinese Name from Web Corpora by Using Query Expansion Technique and Support Vector Machine," in Proceedings of the 2007 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2007), Silicon Valley, USA, Nov. 2-5, 2007

Chia-Ching Chou, Kai-Hsiang Yang and Hahn-Ming Lee, "AEFS: Authoritative Expert Finding System Based on a Language Model and Social Network Analysis," in Proceedings of the 12th Conference on Artificial Intelligence and Applications (TAAI2007), Nov 16-17, 2007

Chien-Chih Chen, Kai-Hsiang Yang and Jan-Ming Ho, "BibPro: A Citation Parser Based on Sequence Alignment Techniques," will appear in Proceedings of the IEEE 22nd International Conference on Advanced Information Networking and Applications (AINA-08)


PLF: A Publication List Web Page Finder for Researchers


Agenda

• Introduction
• Publication List Web Page Finder, PLF
• Performance Evaluation
• Conclusion, Future Work


Overview of a Publication List Web Page

Keep abreast of state-of-the-art research
• Contains citations not found elsewhere.
• May provide some reference materials, such as slides and talks.

Challenges
• How to find the publication list web pages given only a name.
• Various versions or multiple copies: an author may have many affiliations.
• Name ambiguity problem: e.g., for Dr. Bing Liu, querying ZoomInfo (a people search engine) shows that 26 people share the same name.


Problem

“Publication List Web Page?”


Definition of Publication List

[Affiliation] Institute of Information Science, Academia Sinica

citation string

Affiliated Personal Publication List Web Page (APPL)

A web page that belongs to the affiliated web site of a specific person with the given name.


Agenda

• Introduction
• Publication List Web Page Finder, PLF
• Performance Evaluation
• Conclusion, Future Work


Process Flow

[Figure: given names → Citation String Search → Web Page Crawler → Rank Function]

• Citation String Search: queries digital libraries and collects the citation strings.
• Web Page Crawler: interacts with search engines and collects the hyperlinks of web pages by using the collected citation strings as queries.
• Rank Function: parses the web pages and analyses the citation statistics of all the collected hyperlinks.


Basic Concept

A publication list web page may contain many citation strings.

[Figure: the paper titles PT1 ... PTm of an author (Jan-Ming Ho) are each sent as a query (QPT1, QPT2, ...) to a search engine, which returns web pages WP1 ... WPn; one of the returned pages is a publication list web page.]
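The basic concept suggests that a page returned for many different citation-string queries is a likely publication list page. The slides do not give PLF's actual rank function, so the following is only a minimal vote-counting sketch; the function name and the example URLs are hypothetical.

```python
from collections import Counter


def rank_pages(results_per_query, top_k=5):
    """Rank URLs by the number of citation-string queries that returned them.

    results_per_query: list of URL lists, one list per query.
    This simple vote count is an assumed stand-in, not PLF's published rank function.
    """
    votes = Counter()
    for urls in results_per_query:
        votes.update(set(urls))  # a page counts at most once per query
    # Sort by vote count (descending), then by URL so ties are deterministic.
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [url for url, _ in ranked[:top_k]]


results = [
    ["http://example.org/pubs", "http://example.org/a"],
    ["http://example.org/pubs", "http://example.org/b"],
    ["http://example.org/pubs"],
]
print(rank_pages(results, top_k=2))
```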


Agenda

• Introduction
• Publication List Web Page Finder, PLF
• Performance Evaluation
• Conclusion, Future Work


Dataset

Scenario
• Seminar members have usually published major research works
• We randomly collected 200 names from the WWW ’06 Conference Committee website

APPL Types     #APPL   #people   %population
others         0       22        11%
single-group   1       120       60%
multi-group    2       35        17.5%
               3       16        8%
               4       7         3.5%


Experiment Evaluation

Evaluation metrics
• We consider the top-5 results derived by each link and focus on the top-5 recall metric, which is calculated by:

\[ \text{recall} = \frac{R}{R_a} \]

Notation   Definition
R_a        the number of publication list web pages belonging to researchers listed in the dataset
R          the number of publication list web pages contained in the top-5 results
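A minimal sketch of the top-5 recall computation, following the definitions of R and R_a above. The data layout (dictionaries keyed by researcher name) and the function name are assumptions made for illustration.

```python
def top_k_recall(ranked_results, true_pages, k=5):
    """Compute recall = R / R_a over a set of researchers.

    ranked_results: {name: [urls in rank order]} returned by a finder
    true_pages:     {name: set of that person's publication list page URLs}
    """
    # R_a: all publication list pages belonging to researchers in the dataset.
    r_a = sum(len(pages) for pages in true_pages.values())
    # R: those pages that actually appear in each researcher's top-k results.
    r = sum(
        len(set(ranked_results.get(name, [])[:k]) & pages)
        for name, pages in true_pages.items()
    )
    return r / r_a if r_a else 0.0


truth = {"A": {"u1", "u2"}, "B": {"u3"}}
ranked = {"A": ["u1", "x", "y", "z", "w", "u2"], "B": ["q"]}
print(top_k_recall(ranked, truth))  # u2 is ranked sixth, so only u1 counts
```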


Parameter Analysis for Single-Group

(a) Fixed n with different scales of m

(b) Fixed m with different scales of n

Figure (a)
• When m increases, the recall rate also increases.

Figure (b)
• System performance may be constrained by m.

[Plot axis labels: (m, n)]


Parameter Analysis for Multi-Group

(a) Fixed n with different scales of m

(b) Fixed m with different scales of n

Figure (a)

• It is clear that the performance when m = 40 is always better than the other settings.

Figure (b)

• The best performance (top-5 recall is 70%) occurs when n = 75.


Performance Evaluations

(a) Performance of approaches in single-group

(b) Performance of approaches in multi-group

1. The parameter m has a strong influence on the system’s performance; for example, an oversized m may degrade the performance.

2. The parameter n has little influence on the system’s performance.

3. The PLF system outperforms the other two approaches on both the single-group and the multi-group datasets.

(given name + keyword)


Conclusion

We have defined the problem of finding the publication list web pages of a researcher, and proposed the “PLF” system.

Ongoing work
• Name ambiguity problem
• How to merge the multiple publication list web pages for a specific person into a single page


Discussion – Name Ambiguity Problem

Scenario
• We take the name “Bing Liu” as an example
• Analyze manually

Observation
• Citation count
• Name translation problem
• Partial matching problem


Extracting Citation Strings from Web Pages


Extract Citation Records

[Figure: Web Page → Extract → Structured Data]


Challenges

• The formats of publication list web pages vary
• There are no fixed syntactic rules for parsing citation records
• Hence, we cannot apply simple rules to extract citation records automatically


Challenges: Complex Layouts of Publication List Pages


Ideas

The semantic structure of web pages is organized by visual arrangement.

We can utilize the semi-structured (visual) information of web pages to help the extraction task.

With its hierarchical structure and geometric information, the DOM tree is not only a good structure for representing Web pages, but also very helpful for visual pattern analysis.


DOM Tree Presentation of Web page

[Figure: a web page and its DOM tree. Page regions: Banner, Navigator Bar, Publication List. The Publication List region contains repeated Citation String entries.]


Architecture of Citation Extraction System

[Figure: Citation Extraction System. Components: Publication List Web Page Finder (supplies Publication List Pages), Parsing into a DOM Tree, Common Style Finder (Mining Common Style Patterns), Citation Extractor (Extracting Candidate Records, Ranking Records), Normal Citation Model (CiteSeer), output Citation Records.]

[Figure panels: "Extracting Candidate Records" and "Estimating Citation Length", each showing DOM subtrees with nodes such as p, li, a, em and div over text nodes T1 to T6.]


Modules of Citation Extraction System

Common Style Finder
• find all common style patterns for each level of granularity in web pages

Citation Extractor
• explore data regions with common style patterns
• distill extraction rules from those data regions
• rank extraction patterns based on a normal word count distribution probability
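The ranking step can be sketched as scoring each extraction pattern by how closely the word counts of its candidate records follow a normal distribution of citation lengths. The mean and standard deviation below are placeholder values, and averaging the density over a pattern's records is an assumed scoring rule; the slides state only that ranking uses a normal word count distribution probability.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))


def rank_patterns(pattern_records, mean_words=25.0, std_words=8.0):
    """Order extraction patterns by the average citation-likeness of their records.

    pattern_records: {pattern name: [candidate record strings]}
    mean_words / std_words: hypothetical parameters of the citation length model.
    """
    scores = {}
    for pattern, records in pattern_records.items():
        if not records:
            continue
        densities = [normal_pdf(len(r.split()), mean_words, std_words) for r in records]
        scores[pattern] = sum(densities) / len(densities)
    return sorted(scores, key=scores.get, reverse=True)


candidates = {
    "li": [" ".join(["word"] * 25), " ".join(["word"] * 22)],  # citation-length records
    "a": ["Home", "PDF"],                                      # short navigation links
}
print(rank_patterns(candidates))
```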


BibPro: A Citation Parser based on Sequence Alignment Techniques


System Goal

Citation
Chomsky, Noam. 1956. Three models for the description of language. IRE Transactions on Information Theory. 2(3) 113--124.

Our System

MetaData
Author: Chomsky, Noam
Title: Three models for the description of language
Journal: IRE Transactions on Information Theory
Volume: 2
Issue: 3
Page: 113-124
Month:
Year: 1956


Basic Idea(1/2)

Encode a citation as a protein sequence
• Only keep the citation style information
  • order of fields
  • field separators

[Figure: the fields of a citation (Author, Title, Journal, year, page, …) are mapped to a protein sequence: … A D T D L D Y R P H S …]


Basic Idea(2/2)

To determine citation style by the order of punctuation marks and reserved words

[Figure: System Preprocess stores pairs of (Feature Index, Citation Style). In Online Parsing, a Citation String is converted to a Feature Index, and the search tool BLAST looks up the matching Citation Style.]


How to encode a citation as a protein sequence?

Keep the citation style information
• Which fields should be included? (only 23 symbols can be used)
• Which punctuation marks are used to separate fields?

By observing different citation styles, we define an encoding table to translate each token of a citation to an amino acid symbol


Encode Table

A: Author
T: Title
L: Journal
F: Volume value
W: Issue value
H: Page value
M: Month
Y: Year
X: noise (unrecognized token)
S: Issue key, e.g. “no”, “No”
P: Page key, e.g. “pp”, “page”
V: Volume key, e.g. “Vol”, “vo”
N: numeral
Q: @ # $ % ^ & * + = \ | ~ _ / ! ? 。
I: ( [ { < 「
K: ) ] } > 」
D: .
G: " “ ”
R: ,
C: - :
E: ' `
Z: ;
B: blank
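As an illustration of the encode table, here is a minimal sketch that maps only the style-bearing tokens (punctuation, reserved key words, numerals, blanks) to their symbols and marks every other token as X. The tokenizer, the case-insensitive key matching, and the use of X for all remaining words are assumptions; assigning field symbols such as A, T or L requires known metadata and is not attempted here.

```python
import re

# Punctuation symbols copied from the encode table.
PUNCTUATION = {}
for symbol, chars in {
    "Q": "@#$%^&*+=\\|~_/!?。",
    "I": "([{<「",
    "K": ")]}>」",
    "D": ".",
    "G": "\"“”",
    "R": ",",
    "C": "-:",
    "E": "'`",
    "Z": ";",
}.items():
    for ch in chars:
        PUNCTUATION[ch] = symbol

# Reserved key words from the table (matched case-insensitively: an assumption).
KEYWORDS = {"no": "S", "pp": "P", "page": "P", "vol": "V", "vo": "V"}


def encode(citation):
    """Translate a citation string into a sequence of encode-table symbols."""
    tokens = re.findall(r"\s+|[A-Za-z]+|[0-9]+|.", citation)
    symbols = []
    for token in tokens:
        if token.isspace():
            symbols.append("B")                      # blank
        elif token.isdigit():
            symbols.append("N")                      # numeral
        elif token.lower() in KEYWORDS:
            symbols.append(KEYWORDS[token.lower()])  # issue / page / volume key
        elif token in PUNCTUATION:
            symbols.append(PUNCTUATION[token])
        else:
            symbols.append("X")                      # unrecognized token
    return "".join(symbols)


print(encode("Vol. 2, pp. 113-124"))  # VDBNRBPDBNCN
```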


How to use protein sequences to extract metadata?

Transform the extraction problem into a sequence alignment problem

Form translation
• Unknown answer: BASE FORM, ALIGN FORM, INDEX FORM
• Known answer: RESULT FORM, STYLE FORM, INDEX FORM


RESULT FORM (Known Answer)


BASE FORM (Unknown Answer)


System Structure

System PreProcess (Template Generating System)
• Citation Crawler
• Template Builder

Online Parsing (Parsing System)
• Template Matching
• Metadata Extraction

[Figure: (1) System PreProcess builds the Template Database from resources on the Internet; (2) Online Parsing takes a query citation and returns its metadata.]


Citation Crawler

[Figure: the Citation Crawler uses the IEEE, Google, CiteSeer and ACM engines to collect BibTex records; a BibTex Parser turns them into (MetaData, Citation) pairs.]


BLAST-powered Template Matching

[Figure: a query citation is encoded with the Encode Table and translated to an INDEX FORM; BLAST searches the TEMPLATE DATABASE of (INDEX FORM, STYLE FORM) pairs and returns the matching STYLE FORMs.]
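Template matching can be sketched as a nearest-neighbour search over encoded index forms. The sketch below swaps BLAST for Python's standard `difflib.SequenceMatcher` purely for illustration; the template strings and style labels are hypothetical, and the similarity measure is not the alignment score BLAST would compute.

```python
from difflib import SequenceMatcher


def best_template(query_index, templates):
    """Return the (index_form, style_form) pair most similar to the query.

    templates: non-empty list of (index_form, style_form) pairs.
    SequenceMatcher is an assumed stand-in for the BLAST search in the slides.
    """
    def similarity(pair):
        return SequenceMatcher(None, query_index, pair[0], autojunk=False).ratio()

    return max(templates, key=similarity)


template_db = [
    ("ARBTDBLDBNCN", "style-1"),  # hypothetical template entries
    ("TDBARBNIK", "style-2"),
]
print(best_template("ARBTDBLDBNCN", template_db)[1])  # style-1
```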


Evaluation for CiteSeer DataSet

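Consider the inconsistency between the citation string and the BibTex file (metadata)

Old Measurement:

\[ \text{FieldPrecision}_{\text{old}} = \frac{\#[\text{Token}_{\text{parsed field}} \cap \text{Token}_{\text{BibTex field}}]}{\#[\text{Token}_{\text{parsed field}} \cap \text{Token}_{\text{BibTex}}]} \]

New Measurement:

\[ \text{FieldPrecision}_{\text{new}} = \frac{\#[\text{Token}_{\text{parsed field}} \cap \text{Token}_{\text{BibTex field}}]}{\#[\text{Token}_{\text{query citation}} \cap \text{Token}_{\text{BibTex field}}]} \]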


Definition

Token_parsed field: tokens that appear in the parsed subfield

Token_query citation: tokens that appear in the query citation string

Token_BibTex field: tokens that appear in the specific subfield in the BibTex file

Token_BibTex: all tokens that appear in the BibTex file

These tokens do not include punctuation.
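A minimal sketch of the two field-precision measurements built on the token sets defined above. It assumes the old measurement divides by the parsed-field tokens found anywhere in the BibTex file, and the new one divides by the BibTex-field tokens present in the query citation; lowercasing and treating tokens as sets rather than multisets are further assumptions.

```python
import re


def tokens(text):
    """Lowercased word tokens; punctuation is dropped."""
    return set(re.findall(r"\w+", text.lower()))


def field_precision_old(parsed_field, bibtex_field, bibtex_all):
    parsed = tokens(parsed_field)
    denominator = parsed & tokens(bibtex_all)
    if not denominator:
        return 0.0
    return len(parsed & tokens(bibtex_field)) / len(denominator)


def field_precision_new(parsed_field, bibtex_field, query_citation):
    denominator = tokens(query_citation) & tokens(bibtex_field)
    if not denominator:
        return 0.0
    return len(tokens(parsed_field) & tokens(bibtex_field)) / len(denominator)


# Hypothetical two-author citation where the parser recovers only the first author.
citation = "Chomsky, Noam and Miller, George. 1956. Three models."
bib_author = "Chomsky, Noam and Miller, George"
bib_all = bib_author + " Three models 1956"
parsed_author = "Chomsky, Noam"

print(field_precision_old(parsed_author, bib_author, bib_all))   # 1.0
print(field_precision_new(parsed_author, bib_author, citation))  # 0.4
```

Under these assumptions, a parser that extracts less can still score perfectly on the old measurement, while the new one penalizes the missing tokens.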


Compare with ParaCite

DataSet
• Collected from CiteSeer
• Training Set: 2416
• Testing Set: 4131

ParaCite
• Uses its default template database (adding templates to its database is not easy)
• Tested on the Testing Set

Our System
• Uses the template database built from the Training Set
• Tested on the Testing Set


Experimental Results

System    Evaluation  Author   Title    Journal  Volume   Page     Issue    Month    Year     Score
ParaCite  new         32.90%   73.35%   29.83%   -        4.58%    25.05%   -        77.04%   50.22%
ParaCite  old         99.08%   62.72%   30.46%   -        100.00%  93.96%   -        99.70%   78.81%
Our       new         93.73%   73.32%   51.34%   83.52%   94.62%   85.11%   89.18%   96.49%   84.80%
Our       old         90.58%   89.51%   67.66%   93.58%   96.69%   91.79%   99.49%   99.50%   91.45%


Analysis

• ParaCite can extract only one author name
• The old evaluation has a problem: you are very likely to obtain high accuracy if you extract less information


Evaluation for clean DataSet

The citation string is fully composed of the corresponding metadata

\[ \text{Accuracy} = \frac{\text{Number of correctly extracted fields}}{\text{Total number of fields}} \]


Compare with INFOMAP

DataSet
• Includes 160000 records
• Training Dataset: 10000 × 6 (JMIS, ACM, IEEE, APA, MISQ, and ISR)
• Testing Dataset: 10000 × 6 (JMIS, ACM, IEEE, APA, MISQ, and ISR)


Result

  Author Title Journal Volume Page Issue Year Overall average

APA 99.67% 96.38% 97.06% 98.99% 98.71% 98.12% 99.42% 98.33%

IEEE 98.72% 98.12% 99.12% 99.30% 98.40% 98.39% 99.40% 98.78%

ACM 97.14% 95.01% 93.93% 97.19% 97.92% 97.03% 98.88% 96.73%

ISR 99.48% 96.17% 96.96% 99.15% 98.55% 98.39% 99.35% 98.29%

MISQ 98.59% 97.99% 98.98% 99.41% 98.83% 98.61% 99.54% 98.85%

JMIS 91.95% 87.90% 90.46% 99.23% 98.76% 98.03% 99.46% 95.11%

Average 97.59% 95.26% 96.09% 98.88% 98.53% 98.09% 99.34% 97.68%


Evaluation for Cora DataSet

• 500 records
• Used as a benchmark in many papers (HMM, SVM, CRF)


Evaluation

Divide words into four kinds: TP, FP, TN, FN

Four metrics:
• Word Accuracy: (TP+TN)/(TP+FP+FN+TN)
• Precision: TP/(TP+FP)
• Recall: TP/(TP+FN)
• F1-measure: (2*Precision*Recall)/(Precision+Recall)
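The four metrics translate directly into code. The function name and the guards against empty denominators are additions for illustration; the counts in the example are made up.

```python
def word_metrics(tp, fp, tn, fn):
    """Compute word accuracy, precision, recall and F1 from word-level counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall) / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


print(word_metrics(tp=8, fp=2, tn=85, fn=5))
```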


Our System

  acc. F1.

Author 97.17% 93.98%

Title 94.17% 90.13%

Journal 93.58% 83.27%

Volume 99.21% 84.62%

Page 99.21% 92.09%

Date 99.92% 98.96%


Mining Translations of Chinese Names from Web Corpora by Using a Query Expansion Technique and Support Vector Machine


Agenda

• Introduction
• Proposed Approach
• Experiments
• Conclusions and Future Work


Background

Most academic information can be found on the Web
• Google Scholar, DBLP, etc.


Problems in Searching Chinese Name

Only Chinese Corpus


Challenges in Chinese Name Translation

Many pronunciation rules in different areas
• 陳 Chen (Taiwan)
• 陳 Tsun (Hong Kong)
• 陳 Tan (Fukien)

Some additional words exist.
• Ex: 黃光明 (Kwang-Ming Frank Hwang)
• Ex: 張韻詩 (Jane Win-Shih Liu)


Common Chinese Name Translation Format

Name Format | Examples
Type-1. (Chinese given name) (Surname) or (Surname), (Chinese given name) | 劉豐哲 (Fon-Che Liu), 黃田漢 (Ng Tian Hann), 林牛 (Ngau Lam)
Type-2. (Merged Chinese given name) (Surname) | 吳德琪 (Derchyi Wu)
Type-3. (Western first name) (Surname) | 趙蓮菊 (Anne Chao)
Type-4. (Chinese given name) (Western first name) (Surname) | 黃光明 (Kwang-Ming Frank Hwang)
Type-5. (Abbreviated Chinese given name) (Surname) | 張秀瑜 (S.-Y. Chang)
Type-6. (Western first name) (Abbreviated Chinese given name) (Surname) | 李昭勝 (Jack-C. Lee)
Type-7. (Chinese given name) (Abbreviated Chinese given name) (Surname) | 蔡桂紅 (Gwei-Hung H. Tsai)
Type-8. (Chinese given name) (Unpredictable Surname) | 張韻詩 (Jane Win-Shih Liu)


Goal

Design an automatic mechanism to translate a given Chinese name into its related English name


Agenda

• Introduction
• Proposed Approach
• Experiments
• Conclusions and Future Work


Concepts of Proposed Approach

No corresponding translations


Three Major Techniques

Query expansion technique (translation of the surname)
• Obtaining the related Web page snippets of the Chinese name translation.
• Solving the problem of unrelated terms existing in the name translation.

Knowledge-based method (Chinese surname database, a common dictionary, Western first name database)
• Obtaining all the name-like terms from the returned Web page snippets.

SVM (Chinese pronunciation database, the phonetic feature and the distant feature, selected training samples)
• Selecting the appropriate Chinese name translations from the candidates.


System Architecture

[Figure: Chinese names → Query expander (Chinese surname database) → returned Web page snippets → Candidate extractor (Western first name database, on-line dictionary) → name candidates → SVM-based name selector (Chinese pronunciation database) → translated English names]


Query Expander

Goal: To retrieve Web page snippets that contain both a person’s Chinese name and the translation of the person’s surname.

Name splitter (uses the Chinese surname database)
• Determining whether the input Chinese name contains a compound surname.
• Dividing the input Chinese name into a “surname” part and a “given name” part.

Surname translator (uses the Chinese surname database)
• Selecting appropriate surname translations.
• The strength of the relationship between each surname translation and the person is determined by the “distance from the person’s Chinese name to the surname’s translation”.

Web page retriever
• Making the concept of the query word clearer.
• Retrieving the related Web pages.
• The new query word will be “(Chinese name) + (Surname’s translation)”.
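The name splitter and the expanded query can be sketched as follows. The two small dictionaries are hypothetical stand-ins for the Chinese surname database, joining the query parts with a space is an assumption about the query syntax, and the distance-based selection of surname translations is omitted.

```python
# Hypothetical stand-ins for the Chinese surname database.
COMPOUND_SURNAMES = {"歐陽", "司馬"}
SURNAME_TRANSLATIONS = {
    "陳": ["Chen", "Tsun", "Tan"],  # variants listed on the pronunciation slide
    "歐陽": ["Ouyang"],             # illustrative entry
}


def split_name(chinese_name):
    """Split a Chinese name into (surname, given name), checking compound surnames first."""
    if len(chinese_name) > 2 and chinese_name[:2] in COMPOUND_SURNAMES:
        return chinese_name[:2], chinese_name[2:]
    return chinese_name[:1], chinese_name[1:]


def expand_queries(chinese_name):
    """Build one "(Chinese name) + (surname translation)" query per known translation."""
    surname, _given_name = split_name(chinese_name)
    return [f"{chinese_name} {translation}"
            for translation in SURNAME_TRANSLATIONS.get(surname, [])]


print(split_name("陳威達"))
print(expand_queries("陳威達"))
```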


Distance from Two Terms

Calculation of the “distance from two terms”: where D is the distance, N is the number of non-words between the two terms.

陳威達 ( Wei-Da Chen)

The distance from the person’s Chinese name ( 陳威達 ) to the surname’s translation (Chen) is 3.

D N


Candidate Extractor

Goal: To extract possible candidates from the retrieved Web page snippets.

Steps:
1. Remove all HTML tags.
2. Identify the positions of all Chinese surnames in the snippets (using the Chinese surname database).
3. Extract any English term near each surname in the snippets if the term has one of the following properties:
   – The term cannot be found in a common dictionary.
   – The term is a Western first name.
   – The length of the term is 1.

※ At most three English terms in the neighborhood of each surname are extracted.
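The extraction steps could look roughly like the sketch below. It rests on several assumptions: the three sets are tiny hypothetical stand-ins for the Chinese surname database, the common dictionary, and the Western first name database; "near" is taken to mean "starting within `window` characters after the surname"; and the function names are invented for the example.

```python
import re

# Hypothetical stand-ins for the resources named on the slide.
CHINESE_SURNAMES = {"陳", "丁", "歐陽"}
COMMON_DICTIONARY = {"professor", "of", "computer", "science", "home", "page"}
WESTERN_FIRST_NAMES = {"alan", "judy", "frank", "eugene"}

TAG_RE = re.compile(r"<[^>]+>")
ENGLISH_TERM_RE = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")
# Longer surnames first so that compound surnames win over their prefixes.
SURNAME_RE = re.compile(
    "|".join(sorted(map(re.escape, CHINESE_SURNAMES), key=len, reverse=True))
)


def is_candidate(term):
    """A term qualifies if it has at least one of the three listed properties."""
    lowered = term.lower()
    return (lowered not in COMMON_DICTIONARY
            or lowered in WESTERN_FIRST_NAMES
            or len(term) == 1)


def extract_candidates(snippet, window=40, max_terms=3):
    text = TAG_RE.sub(" ", snippet)              # Step 1: remove HTML tags
    candidates = []
    for surname in SURNAME_RE.finditer(text):    # Step 2: locate surnames
        limit = surname.end() + window           # assumed neighborhood size
        picked = []
        # Step 3: scan English terms that start inside the neighborhood.
        for term in ENGLISH_TERM_RE.finditer(text, surname.end()):
            if term.start() >= limit or len(picked) == max_terms:
                break
            if is_candidate(term.group()):
                picked.append(term.group())
        candidates.extend(picked)                # at most three per surname
    return candidates


print(extract_candidates("<b>陳威達</b> (Wei-Da Chen) Professor of Computer Science"))
```

Scanning whole terms by start position, rather than slicing the text at the window boundary, avoids cutting a dictionary word in half and mistaking the fragment for a name.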


System Architecture 4/10

- Candidate extractor

Step 1: Identify the positions of all Chinese surnames in the snippets.

Step 2: Extract any English term near each surname in the snippets if the term has one of the following properties:
• The term cannot be found in a common dictionary.
• The term is a Western first name.
• The length of the term is 1.

The extracted terms become the name translation candidates and are sent to the SVM-based name selector for processing.


SVM-based Name Selector

Goal: To extract each candidate’s features and use them to determine whether the candidate is the correct translation of the input Chinese name.

Features:
1. The phonetic feature:
   – Phonetic similarity (Soundex algorithm)
2. The distance features:
   – Smallest distance (between the Chinese name and the translation candidate)
   – Number of appearances in the neighborhood
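The slide names the Soundex algorithm for the phonetic feature but does not say how the Chinese pronunciation is compared with a candidate. The sketch below implements standard American Soundex and adds a hypothetical `same_sound` helper; it assumes a romanized pronunciation of the Chinese name is available (for example from the Chinese pronunciation database), and the actual feature used by the system may differ.

```python
# Standard American Soundex digit classes.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(term):
    """Return the four-character Soundex code of an ASCII term."""
    letters = [c for c in term.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "hw":
            continue                  # h and w do not separate equal codes
        code = CODES.get(letter, "")  # vowels and y map to ""
        if code and code != previous:
            result += code
        previous = code               # a vowel resets the previous code
    return (result + "000")[:4]


def same_sound(romanized_name, candidate):
    """Hypothetical binary phonetic feature: do the two Soundex codes agree?"""
    return soundex(romanized_name) == soundex(candidate)


print(soundex("Robert"), soundex("Tymczak"), soundex("Chen"))
```

A graded similarity (for instance, the number of matching code positions) would be an equally plausible feature; the exact form is not given on the slide.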


Distance Features

The “neighborhood”:
• The area close to each occurrence of the Chinese name.
• The close area is defined by a given distance threshold, measured in number of words.

Example for the candidate “win-shih”:
• Number of appearances in the neighborhood: 2
• Smallest distance: 2
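Both features can be computed from a tokenized snippet. The sketch below is illustrative only: it assumes the distance between two terms is the difference of their token positions, which may not match the distance definition used in the system, and the token list, the `NAME` placeholder, and the function name are hypothetical.

```python
def distance_features(tokens, name, candidate, threshold):
    """Return (appearances in the neighborhood, smallest distance).

    Distance is ASSUMED to be the difference in token positions.
    """
    name_positions = [i for i, tok in enumerate(tokens) if tok == name]
    cand_positions = [i for i, tok in enumerate(tokens) if tok == candidate]

    # A candidate occurrence counts if it lies within `threshold` tokens
    # of at least one occurrence of the Chinese name.
    appearances = sum(
        1 for c in cand_positions
        if any(abs(c - n) <= threshold for n in name_positions)
    )
    gaps = [abs(c - n) for c in cand_positions for n in name_positions]
    smallest = min(gaps) if gaps else None
    return appearances, smallest


# "NAME" stands in for the Chinese name; the snippet is made up.
tokens = ["NAME", "prof", "win-shih", "lab", "news", "news", "news",
          "win-shih", "NAME", "x", "x", "x", "x", "x", "win-shih"]
print(distance_features(tokens, "NAME", "win-shih", threshold=3))
```

The third occurrence of the candidate lies six tokens away from the nearest name occurrence, so with a threshold of 3 it is not counted.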


Summary

Query expansion technique
• Retrieving related Web pages.

Knowledge-based method
• Extracting appropriate name translation candidates from the retrieved Web pages.

SVM
• Learning the verification rule.
• Selecting the appropriate name translation from the extracted candidates.


Agenda

• Introduction
• Proposed Approach
• Experiments
• Conclusions and Future Work


Testing Environment and Dataset 1/3

The following tools are used:
• Cambridge on-line dictionary
• Google search engine
• LIBSVM

Two datasets are used:
• Dataset I (training & testing):
   – Collected from the directory of scholars of the Institute of Mathematics.
   – Contains 78 entries.
• Dataset II (testing):
   – Collected by our program from the Web site of the directory of the Division of Computer Science of the National Science Council.
   – Contains 1,157 entries; the name translations of 40 of them could not be found through Google.


Testing Environment and Dataset 2/3

Name format | Example | Dataset I # | Dataset I % | Dataset II # | Dataset II %
Type-1. (Chinese given name) (Surname) or (Surname), (Chinese given name) | 丁建文 (Jen-Wen Ding), 丁德榮 (Der-Rong Din), 歐陽明 (Ming Ouhyang) | 19 | 24.3% | 1000 | 89.5%
Type-2. (Merged Chinese given name) (Surname) | 蔡丕裕 (Piyu Tsai) | 10 | 12.8% | 42 | 3.8%
Type-3. (Western first name) (Surname) | 賴友仁 (Eugene Lai) | 9 | 11.5% | 9 | 0.8%
Type-4. (Chinese given name) (Western first name) (Surname) | 劉立頌 (Alan Li-Sung liu), 陳嘉懿 (Jia-Yih Joy Chen), 楊豐瑞 (Fongray Frank Young) | 14 | 17.9% | 50 | 4.5%
Type-5. (Abbreviated Chinese given name) (Surname) | 洪英超 (I.-C. Hung) | 3 | 3.8% | 0 | 0%
Type-6. (Western first name) (Abbreviated Chinese given name) (Surname) | 曾秋蓉 (Judy C. R. Tseng) | 8 | 10.3% | 9 | 0.8%
Type-7. (Chinese given name) (Abbreviated Chinese given name) (Surname) | 黃哲志 (Tetz C. Huang) | 3 | 3.8% | 3 | 0.4%
Type-8. (Chinese given name) (Unpredictable Surname) | 張肇健 (Trieu-Kien Truong) | 12 | 15.4% | 4 | 0.4%


Testing Environment and Dataset 3/3

The alignment accuracy
• Proposed by Huang (2005).
• The probability of selecting the correct answer when the retrieved snippets contain the correct answer.

$$A_i = \frac{N_{cc}}{N_d}$$

where
• $A_i$: The alignment accuracy of candidate $i$.
• $N_d$: The number of test data.
• $N_{cc}$: The number of correct translations.

Performance measurement: Top-1 to Top-5 alignment accuracy.
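A Top-k alignment accuracy in this spirit can be computed as the fraction of test names whose correct translation appears among the first k ranked candidates. The sketch below is an assumption about how the measure is applied: the function name is hypothetical, and the candidate rankings are invented purely to exercise the code (only the gold translations are taken from the example names in the dataset table).

```python
def top_k_alignment_accuracy(ranked_candidates, gold, k):
    """N_cc / N_d, counting a hit when the gold translation is in the top k."""
    n_d = len(ranked_candidates)
    n_cc = sum(
        1 for name, candidates in ranked_candidates.items()
        if gold[name] in candidates[:k]
    )
    return n_cc / n_d


# Gold translations from the slides; the rankings below are made up.
gold = {
    "丁建文": "Jen-Wen Ding",
    "蔡丕裕": "Piyu Tsai",
    "賴友仁": "Eugene Lai",
}
ranked_candidates = {
    "丁建文": ["Jen-Wen Ding", "Ding"],
    "蔡丕裕": ["Tsai", "Piyu Tsai"],
    "賴友仁": ["Lai"],
}

for k in (1, 2):
    print(k, round(top_k_alignment_accuracy(ranked_candidates, gold, k), 3))
```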


Results and Analysis 1/3

- Overall performance on Dataset I

70.5% top-1 accuracy

91% top-5 accuracy


Results and Analysis 2/3

- Overall performance on Dataset II

57.9% top-1 accuracy

86.2% top-5 accuracy


Results and Analysis 3/3

- Performance of each name type

Our system performs better on Type-1, Type-2, Type-4, and Type-6.

Name format | Example
Type-1 | 丁建文 (Jen-Wen Ding), 丁德榮 (Der-Rong Din), 歐陽明 (Ming Ouhyang)
Type-2 | 蔡丕裕 (Piyu Tsai)
Type-3 | 賴友仁 (Eugene Lai)
Type-4 | 劉立頌 (Alan Li-Sung liu), 陳嘉懿 (Jia-Yih Joy Chen)
Type-5 | 洪英超 (I.-C. Hung)
Type-6 | 曾秋蓉 (Judy C. R. Tseng)
Type-7 | 黃哲志 (Tetz C. Huang)
Type-8 | 張肇健 (Trieu-Kien Truong)


Discussions

Major reason for the low performance on Type-3, Type-5, Type-7, and Type-8: the lack of Web information.

More than one correct name translation is usually found for an input Chinese name: the name ambiguity problem.


Limitations

Uncommon surnames

Reliance on Web resources

Search engine selection

No name disambiguation


Agenda

• Introduction
• Proposed Approach
• Experiments
• Conclusions


Conclusions

Mining information from Web corpora is effective for the person name translation problem.

The name ambiguity problem arises frequently.


Thank You

Jan-Ming Ho

hoho@iis.sinica.edu.tw

Institute of Information Science

Academia Sinica