Who Should I Cite? Learning Literature Search Models from Citation Behavior

CIKM’10. Advisor: Jia Ling, Koh. Speaker: SHENG HONG, CHUNG.


Page 1

Who Should I Cite? Learning Literature Search Models from Citation Behavior

CIKM’10
Advisor: Jia Ling, Koh
Speaker: SHENG HONG, CHUNG

Page 2

Outline

• Introduction
• Features
  – Similar terms
  – Cited information
  – Recency
  – Cited using similar terms
  – Similar topics
  – Social habits
• Experiment
• Conclusion

Page 3

Introduction

• Motivation

[Figure: an article whose references are unknown (Article → ? reference)]

Page 4

Framework of the system

• Input: topic, abstract
• Query: q <- term(topic) + term(abstract)
• System: receives q
• Output: reference list (d1, d2, d3, ...)

Page 5

Pattern from article

Page 6

Pattern from article

Page 7

Example

Corpus: D1, D2, D3, D4, D5

Document | Authors    | Venue | Context (term: count) | Total | References
D1       | a1, a2     | CIKM  | {t1: 3, t2: 2, t3: 1} | 6     | D2, D3, D4
D2       | a1, a3     | CIKM  | {t1: 2, t2: 2}        | 4     | D1, D5
D3       | a1, a2, a3 | CIKM  | {t1: 3, t3: 1}        | 4     | D1, D2, D4
D4       | a2, a3     | SIGIR | {t2: 4, t4: 2}        | 6     | D2, D3, D5
D5       | a4         | SIGIR | {t4: 3, t5: 2}        | 5     | D1, D2

Page 8

Similar terms

The query (topic terms and abstract terms) is matched against the context of every document in the corpus (D1, D2, D3, D4, D5, ...) with a TF-IDF score:

Scoreterms(q,D1), Scoreterms(q,D2), Scoreterms(q,D3), ...

Page 9

Query: t1, t2 (corpus as in the Example on Page 7; 5 documents, t1 and t2 each occur in 3 of them)

f1:

$$\mathrm{score}_{\text{terms}}(q, D_1) = 3^{0.5}\left(1+\log\frac{5}{3+1}\right)^{2} + 2^{0.5}\left(1+\log\frac{5}{3+1}\right)^{2} = 2.084 + 1.702 = 3.786$$

$$\mathrm{score}_{\text{terms}}(q, D_4) = 0 + 4^{0.5}\left(1+\log\frac{5}{3+1}\right)^{2} = 2.406$$

Here log is the base-10 logarithm, which is the base that reproduces the values shown.
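The f1 arithmetic can be checked with a short Python sketch. The names `CONTEXT` and `score_terms` are illustrative and do not come from the slides; the base-10 logarithm is an inference from the printed values 3.786 and 2.406, not something the slide states.

```python
import math

# Term counts of each document's context, copied from the example corpus.
CONTEXT = {
    "D1": {"t1": 3, "t2": 2, "t3": 1},
    "D2": {"t1": 2, "t2": 2},
    "D3": {"t1": 3, "t3": 1},
    "D4": {"t2": 4, "t4": 2},
    "D5": {"t4": 3, "t5": 2},
}


def score_terms(query, doc_id, context=CONTEXT):
    """Sum of sqrt(tf) * idf^2 over the query terms, as in the f1 example."""
    n_docs = len(context)
    score = 0.0
    for term in query:
        tf = context[doc_id].get(term, 0)
        # Document frequency: number of documents whose context has the term.
        df = sum(1 for counts in context.values() if term in counts)
        idf = 1 + math.log10(n_docs / (df + 1))  # base 10 is an assumption
        score += math.sqrt(tf) * idf ** 2
    return score


print(round(score_terms(["t1", "t2"], "D1"), 3))  # 3.786
print(round(score_terms(["t1", "t2"], "D4"), 3))  # 2.406
```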

Page 10

Cited information

According to the features from the corpus (Example on Page 7), citing(D) is the set of documents that cite D:

citing(D1) = {D2, D3, D5}
citing(D2) = {D1, D3, D4, D5}
citing(D3) = {D1, D4}
citing(D4) = {D1, D3}
citing(D5) = {D2, D4}

f2:

$$|\mathrm{citing}(D_1)| = 3 = \mathrm{score}_{\text{citation-count}}(q, D_1)$$
$$|\mathrm{citing}(D_2)| = 4 = \mathrm{score}_{\text{citation-count}}(q, D_2)$$
$$|\mathrm{citing}(D_3)| = 2 = \mathrm{score}_{\text{citation-count}}(q, D_3)$$
$$|\mathrm{citing}(D_4)| = 2 = \mathrm{score}_{\text{citation-count}}(q, D_4)$$
$$|\mathrm{citing}(D_5)| = 2 = \mathrm{score}_{\text{citation-count}}(q, D_5)$$
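As a minimal sketch, the citing sets and citation counts can be derived by inverting each document's reference list. `REFERENCES` is read off the example corpus; the function and variable names are illustrative.

```python
# Reference lists of the example corpus: document -> documents it cites.
REFERENCES = {
    "D1": ["D2", "D3", "D4"],
    "D2": ["D1", "D5"],
    "D3": ["D1", "D2", "D4"],
    "D4": ["D2", "D3", "D5"],
    "D5": ["D1", "D2"],
}


def citing_sets(references):
    """Invert the reference lists: document -> set of documents citing it."""
    citing = {doc: set() for doc in references}
    for source, targets in references.items():
        for target in targets:
            citing[target].add(source)
    return citing


CITING = citing_sets(REFERENCES)
# f2: the citation count of a document is the size of its citing set.
CITATION_COUNT = {doc: len(sources) for doc, sources in CITING.items()}

print(sorted(CITING["D1"]))  # ['D2', 'D3', 'D5']
print(CITATION_COUNT)
```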

Page 11

[Figure: citation graph over D1, D2, D3, D4, D5]

More in-links, more important.

Take friendship as an example:
• A has 5 friends, all of them ordinary people
• B has 5 friends, all of them famous people
• B is more important than A

How to tell ordinary from famous? PageRank.

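Page 12

[Figure: citation graph over D1, D2, D3, D4, D5]

f3:

$$\mathrm{score}_{\text{pagerank}}(q, D_1) = \frac{0.15}{5} + 0.85\left(\frac{\mathrm{score}_{\text{pagerank}}(q, D_2)}{2} + \frac{\mathrm{score}_{\text{pagerank}}(q, D_3)}{3} + \frac{\mathrm{score}_{\text{pagerank}}(q, D_5)}{2}\right)$$

D2, D3 and D5 are the documents that cite D1; each contribution is divided by the number of references of the citing document (2, 3 and 2).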
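The recursive f3 equation can be solved by repeated substitution. The sketch below uses plain power iteration with a fixed iteration count, which is one common way to do this and an assumption here; the slide only states the equation. All names are illustrative.

```python
# Reference lists of the example corpus: document -> documents it cites.
REFERENCES = {
    "D1": ["D2", "D3", "D4"],
    "D2": ["D1", "D5"],
    "D3": ["D1", "D2", "D4"],
    "D4": ["D2", "D3", "D5"],
    "D5": ["D1", "D2"],
}


def pagerank(references, damping=0.85, iterations=100):
    """Iterate rank(d) = (1 - damping)/n + damping * sum(rank(s)/out(s))."""
    docs = list(references)
    n = len(docs)
    rank = {doc: 1.0 / n for doc in docs}
    for _ in range(iterations):
        new_rank = {}
        for doc in docs:
            # Each citing document s passes on rank(s) / number of references.
            incoming = sum(
                rank[s] / len(references[s]) for s in docs if doc in references[s]
            )
            new_rank[doc] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank


ranks = pagerank(REFERENCES)
for doc in sorted(ranks):
    print(doc, round(ranks[doc], 4))
```

Every document in this toy graph has at least one reference, so the sketch does not handle documents with an empty reference list.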

Page 13

Citation counts (from Page 10): D1: 3, D2: 4, D3: 2, D4: 2, D5: 2

f4: venue citation count, summed over all documents published in the same venue

$$\mathrm{score}_{\text{venue-citation-count}}(q, D_1) = 3 + 4 + 2 = 9 \quad (\text{CIKM: } D_1, D_2, D_3)$$
$$\mathrm{score}_{\text{venue-citation-count}}(q, D_4) = 2 + 2 = 4 \quad (\text{SIGIR: } D_4, D_5)$$

f5: author citation count, the maximum over the document's authors

a1: {D1, D2, D3}
a2: {D1, D3, D4}

$$\mathrm{score}_{\text{author-citation-count}}(q, D_1) = \max\{9, 7\} = 9$$
$$\mathrm{score}_{\text{author-citation-count}}(q, D_5) = \max\{2\} = 2$$
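A minimal sketch of f4 and f5 on the example corpus follows. The dictionaries are copied from the example; the function names are illustrative, and taking the maximum over authors follows the `max{9,7}` in the worked example.

```python
VENUE = {"D1": "CIKM", "D2": "CIKM", "D3": "CIKM", "D4": "SIGIR", "D5": "SIGIR"}
AUTHORS = {
    "D1": ["a1", "a2"],
    "D2": ["a1", "a3"],
    "D3": ["a1", "a2", "a3"],
    "D4": ["a2", "a3"],
    "D5": ["a4"],
}
# |citing(D)| for each document, from the citation-count feature.
CITATION_COUNT = {"D1": 3, "D2": 4, "D3": 2, "D4": 2, "D5": 2}


def venue_citation_count(doc):
    """f4: total citations of all documents in the same venue as doc."""
    return sum(c for d, c in CITATION_COUNT.items() if VENUE[d] == VENUE[doc])


def author_citation_count(doc):
    """f5: largest total citation count among the authors of doc."""
    totals = [
        sum(c for d, c in CITATION_COUNT.items() if author in AUTHORS[d])
        for author in AUTHORS[doc]
    ]
    return max(totals)


print(venue_citation_count("D1"), venue_citation_count("D4"))    # 9 4
print(author_citation_count("D1"), author_citation_count("D5"))  # 9 2
```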

Page 14

h-index

• Definition:
  – An author has published h papers, each of which has been cited by others at least h times
  – Sort the papers by cited count
• Example:

Rank    | Example 1 | Example 2
1       | D8 (600)  | D8 (600)
2       | D10 (10)  | D10 (10)
3       | D3 (10)   | D3 (10)
4       | D2 (10)   | D2 (10)
5       | D1 (10)   | D1 (10)
6       | D4 (6)    | D4 (5)
7       | D5 (3)    | D5 (3)
8       | D6 (2)    | D6 (2)
9       | D7 (2)    | D7 (2)
10      | D9 (2)    | D9 (2)
h-index | 6         | 5

Page 15

Citation counts (from Page 10): D1: 3, D2: 4, D3: 2, D4: 2, D5: 2

a1: {D1, D2, D3}, sorted: 1. D2 (4), 2. D1 (3), 3. D3 (2), so h-index = 2
a2: {D1, D3, D4}, sorted: 1. D1 (3), 2. D3 (2), 3. D4 (2), so h-index = 2

f6:

$$\mathrm{score}_{\text{author-h-index}}(q, D_1) = \max\{2, 2\} = 2$$
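The h-index definition (sort by cited count, then find the largest rank h whose paper has at least h citations) can be sketched as below. The function name is illustrative; the inputs are the citation counts from the h-index examples and from authors a1 and a2.

```python
def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break  # counts are sorted, so no later rank can qualify
    return h


print(h_index([600, 10, 10, 10, 10, 6, 3, 2, 2, 2]))  # 6
print(h_index([600, 10, 10, 10, 10, 5, 3, 2, 2, 2]))  # 5

# Authors of D1 in the example corpus: a1 wrote D1, D2, D3; a2 wrote D1, D3, D4.
a1_counts = [3, 4, 2]
a2_counts = [3, 2, 2]
print(max(h_index(a1_counts), h_index(a2_counts)))  # 2
```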

Page 16

The example corpus with publication years added:

Document | Year
D1       | 2007
D2       | 2008
D3       | 1988
D4       | 2003
D5       | 2002

Query (topic): Year 2009

f7:

$$\mathrm{score}_{\text{age}}(q, D) = -(\mathrm{year}(q) - \mathrm{year}(D))$$

$$\mathrm{score}_{\text{age}}(q, D_1) = -(2009 - 2007) = -2$$
$$\mathrm{score}_{\text{age}}(q, D_3) = -(2009 - 1988) = -21$$

Page 17

Cited using similar terms

• TF-IDF + cited information

citing(D1) = {D2, D3, D5}

f8:

$$\mathrm{score}_{\text{term-citing}}(q, D_1) = \mathrm{score}_{\text{terms}}(q, \mathrm{concat}(\mathrm{term}(D_2), \mathrm{term}(D_3), \mathrm{term}(D_5)))$$
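A sketch of f8 under stated assumptions: the contexts of the citing documents are merged by adding term counts, and the merged bag is scored like f1. Taking the document frequencies from the original five-document corpus and using a base-10 logarithm are assumptions; the slide does not specify either. All names are illustrative.

```python
import math
from collections import Counter

CONTEXT = {
    "D1": {"t1": 3, "t2": 2, "t3": 1},
    "D2": {"t1": 2, "t2": 2},
    "D3": {"t1": 3, "t3": 1},
    "D4": {"t2": 4, "t4": 2},
    "D5": {"t4": 3, "t5": 2},
}
CITING = {
    "D1": ["D2", "D3", "D5"],
    "D2": ["D1", "D3", "D4", "D5"],
    "D3": ["D1", "D4"],
    "D4": ["D1", "D3"],
    "D5": ["D2", "D4"],
}


def citing_context(doc_id):
    """concat(term(D') for D' in citing(doc_id)) as a bag of term counts."""
    merged = Counter()
    for citing_doc in CITING[doc_id]:
        merged.update(CONTEXT[citing_doc])
    return merged


def score_term_citing(query, doc_id):
    """Score the query against the merged context of the citing documents."""
    merged = citing_context(doc_id)
    n_docs = len(CONTEXT)
    score = 0.0
    for term in query:
        df = sum(1 for counts in CONTEXT.values() if term in counts)
        idf = 1 + math.log10(n_docs / (df + 1))  # assumed, as in f1
        score += math.sqrt(merged[term]) * idf ** 2
    return score


print(dict(citing_context("D1")))
print(score_term_citing(["t1", "t2"], "D1"))
```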

Page 18

scoreterm-citing(q,D1): q is matched with TF-IDF against the documents D' that cite D (for D1: D2, D3, D5).

Problems:
• Too many terms
• Stopwords

New data set:
• Assume that there are k documents in the corpus
• Collect m documents (m<k) cited at least five times, {D1~Dm}, and the terms that occur at least twice, {t1~tn}
• pmi(t1~n,D1) gives textpmi(D1), pmi(t1~n,D2) gives textpmi(D2), ..., pmi(t1~n,Dk) gives textpmi(Dk)

Page 19

$$\mathrm{pmi}(x, y) = \log\frac{p(x, y)}{p(x)\,p(y)}$$

Assume that p(x) = 1/2 and p(y) = 1/2:
• If p(x,y) > 1/4: positive correlation
• If p(x,y) = 1/4: x and y are independent
• If p(x,y) < 1/4: negative correlation

On the example corpus:

$$\mathrm{pmi}(t_1, D_1) = \log\frac{p_{\text{termciting}}(t_1, D_1)}{p_{\text{term}}(t_1)\,p_{\text{citing}}(D_1)}$$

$$p_{\text{term}}(t_1) = \frac{3+2+3}{6+4+4+6+5} = \frac{8}{25}$$

Page 20

citing(D1) = {D2, D3, D5}
citing(D2) = {D1, D3, D4, D5}
citing(D3) = {D1, D4}
citing(D4) = {D1, D3}
citing(D5) = {D2, D4}

$$p_{\text{citing}}(D_1) = \frac{3}{3+4+2+2+2} = \frac{3}{13}$$

$$p_{\text{termciting}}(t_1, D_1) = \frac{2+3}{13+21+12+10+10} = \frac{5}{66}$$

The numerator counts t1 in the documents citing D1 (2 in D2, 3 in D3); each term of the denominator is the total number of terms in the documents citing one corpus document.

f9:

$$\mathrm{pmi}(t_1, D_1) = \log\frac{5/66}{(8/25)(3/13)} = 0.011$$

Here log is the base-10 logarithm, which is the base that reproduces 0.011.
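The three probabilities and the PMI value can be reproduced with exact fractions. This is a sketch; the function names are illustrative, and the base-10 logarithm is inferred from the printed result 0.011.

```python
import math
from fractions import Fraction

CONTEXT = {
    "D1": {"t1": 3, "t2": 2, "t3": 1},
    "D2": {"t1": 2, "t2": 2},
    "D3": {"t1": 3, "t3": 1},
    "D4": {"t2": 4, "t4": 2},
    "D5": {"t4": 3, "t5": 2},
}
CITING = {
    "D1": ["D2", "D3", "D5"],
    "D2": ["D1", "D3", "D4", "D5"],
    "D3": ["D1", "D4"],
    "D4": ["D1", "D3"],
    "D5": ["D2", "D4"],
}


def total_terms(doc):
    return sum(CONTEXT[doc].values())


def p_term(term):
    """Share of all term occurrences in the corpus that are `term`."""
    hits = sum(counts.get(term, 0) for counts in CONTEXT.values())
    return Fraction(hits, sum(total_terms(d) for d in CONTEXT))


def p_citing(doc):
    """Share of all citations in the corpus that point at `doc`."""
    return Fraction(len(CITING[doc]), sum(len(v) for v in CITING.values()))


def p_term_citing(term, doc):
    """Occurrences of `term` in documents citing `doc`, over all citing text."""
    hits = sum(CONTEXT[c].get(term, 0) for c in CITING[doc])
    all_terms = sum(total_terms(c) for citers in CITING.values() for c in citers)
    return Fraction(hits, all_terms)


def pmi(term, doc):
    ratio = p_term_citing(term, doc) / (p_term(term) * p_citing(doc))
    return math.log10(float(ratio))  # base 10 is an assumption


print(p_term("t1"), p_citing("D1"), p_term_citing("t1", "D1"))  # 8/25 3/13 5/66
print(round(pmi("t1", "D1"), 3))  # 0.011
```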

Page 21

Latent Dirichlet Allocation

[Figure: a Document is linked to Topic 1, Topic 2, ..., Topic N; the topics are linked to Word 1, Word 2, ..., Word M]

Page 22

Assume 3 topics: {algorithm, performance, running time}

topics(q) = {4/6, 1/6, 1/6}
topics(D1) = {1/4, 1/4, 1/2}

f10:

$$\mathrm{score}_{\text{topics}}(q, D_1) = \cos(\mathrm{topics}(q), \mathrm{topics}(D_1)) = 0.67$$

f11:

citing(D1) = {D2, D3}
topics(D2) = {1/6, 1/6, 4/6}
topics(D3) = {1/4, 1/2, 1/4}

$$\mathrm{topics}(K) = \frac{1}{2}\left\{\frac{5}{12}, \frac{2}{3}, \frac{11}{12}\right\}$$

$$\mathrm{score}_{\text{topics-citing}}(q, D_1) = \cos(\mathrm{topics}(q), \mathrm{topics}(K))$$
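Both topic scores are cosine similarities between topic distributions, which a short sketch makes concrete. The helper names are illustrative; topics(K) is computed as the element-wise mean of the citing documents' distributions, which matches the 1/2 * {5/12, 2/3, 11/12} in the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_distribution(distributions):
    """Element-wise mean of several topic distributions."""
    return [sum(column) / len(distributions) for column in zip(*distributions)]


topics_q = [4 / 6, 1 / 6, 1 / 6]
topics_d1 = [1 / 4, 1 / 4, 1 / 2]
print(round(cosine(topics_q, topics_d1), 2))  # 0.67

# f11: average the topics of the documents citing D1, then compare with q.
topics_d2 = [1 / 6, 1 / 6, 4 / 6]
topics_d3 = [1 / 4, 1 / 2, 1 / 4]
topics_k = mean_distribution([topics_d2, topics_d3])
print(topics_k)  # approximately [5/24, 1/3, 11/24]
print(cosine(topics_q, topics_k))
```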

Page 23

Assume 3 topics: {algorithm, performance, running time}

f12:

topics(q) = {4/6, 1/6, 1/6}, toptopic(q) = 1
topics(D1) = {1/4, 1/4, 1/2}
citing(D1) = {D2, D3}
topics(D2) = {4/6, 1/6, 1/6}, toptopic(D2) = 1
topics(D3) = {1/4, 1/2, 1/4}, toptopic(D3) = 2

$$\mathrm{score}_{\text{topic-citation-count}}(q, D_1) = 1$$

f13:

f14:

$$\mathrm{score}_{\text{topic-entropy}}(q, D_1) = -\left(\frac{1}{4}\log\frac{1}{4} + \frac{1}{4}\log\frac{1}{4} + \frac{1}{2}\log\frac{1}{2}\right)$$
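A sketch of the topic citation count and the topic entropy follows. Reading the count as "the number of citing documents whose top topic equals the query's top topic" is an interpretation of the worked example, and the natural logarithm in the entropy is an assumption, since the slide does not give a base. Names are illustrative.

```python
import math


def top_topic(distribution):
    """1-based index of the most probable topic."""
    best = max(range(len(distribution)), key=lambda i: distribution[i])
    return best + 1


def topic_citation_count(query_topics, citing_topics):
    """Citing documents whose top topic matches the query's top topic."""
    target = top_topic(query_topics)
    return sum(1 for dist in citing_topics if top_topic(dist) == target)


def topic_entropy(distribution):
    """Shannon entropy of a topic distribution (natural log assumed)."""
    return -sum(p * math.log(p) for p in distribution if p > 0)


topics_q = [4 / 6, 1 / 6, 1 / 6]
topics_d1 = [1 / 4, 1 / 4, 1 / 2]
citing_d1 = [[4 / 6, 1 / 6, 1 / 6], [1 / 4, 1 / 2, 1 / 4]]  # D2, D3

print(topic_citation_count(topics_q, citing_d1))  # 1
print(topic_entropy(topics_d1))
```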

Page 24

Social Habits

• Authors tend to cite their own papers

Query authors: a1, a2 (corpus as on Page 16; a1 and a2 each appear in 3 of the 5 documents)

f15:

$$\mathrm{score}_{\text{authors}}(q, D_1) = \mathrm{score}_{\text{terms}}(\mathrm{authors}(q), \mathrm{authors}(D_1)) = \mathrm{tf}(a_1)\,\mathrm{idf}(a_1) + \mathrm{tf}(a_2)\,\mathrm{idf}(a_2) = 1\left(1+\log\frac{5}{3+1}\right) + 1\left(1+\log\frac{5}{3+1}\right)$$

Page 25

Social Habits

• Whether the query's authors have cited the document before

Query authors: a1, a2

citing(D1) = {D2, D3, D5}

f16:

$$\mathrm{score}_{\text{author-cited-article}}(q, D_1) = \mathrm{score}_{\text{terms}}(\mathrm{authors}(q), \mathrm{concat}(\mathrm{authors}(D_2), \mathrm{authors}(D_3), \mathrm{authors}(D_5))) = 2\left(1+\log\frac{5}{3+1}\right) + 1\left(1+\log\frac{5}{3+1}\right)$$

In the concatenated author list, a1 occurs twice (D2, D3) and a2 once (D3).
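The two author-overlap scores (authors of D1 itself, and authors of the documents citing D1) can be sketched with one helper that applies tf * idf to author names. The helper names are illustrative, and the base-10 logarithm is carried over from the f1 example as an assumption.

```python
import math
from collections import Counter

AUTHORS = {
    "D1": ["a1", "a2"],
    "D2": ["a1", "a3"],
    "D3": ["a1", "a2", "a3"],
    "D4": ["a2", "a3"],
    "D5": ["a4"],
}
CITING = {"D1": ["D2", "D3", "D5"]}


def author_idf(author):
    """1 + log(N / (df + 1)), with df = documents the author appears in."""
    df = sum(1 for names in AUTHORS.values() if author in names)
    return 1 + math.log10(len(AUTHORS) / (df + 1))  # base 10 is an assumption


def score_author_bag(query_authors, author_bag):
    """Sum of tf * idf over the query authors, tf counted in author_bag."""
    tf = Counter(author_bag)
    return sum(tf[a] * author_idf(a) for a in query_authors)


def score_authors(query_authors, doc):
    """Match the query authors against the authors of doc itself."""
    return score_author_bag(query_authors, AUTHORS[doc])


def score_author_cited_article(query_authors, doc):
    """Match the query authors against authors of documents citing doc."""
    bag = [a for citing_doc in CITING[doc] for a in AUTHORS[citing_doc]]
    return score_author_bag(query_authors, bag)


print(score_authors(["a1", "a2"], "D1"))               # 2 * idf
print(score_author_cited_article(["a1", "a2"], "D1"))  # 3 * idf
```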

Page 26

Social Habits

scoreauthor-cited-article(q,D1)
• Does not take D1's authors into account
• q is matched against the documents D' that cite D1: citing(D1) = {D2, D3, D5}

f17: scoreauthor-cited-author(q,D1)
• Takes D1's authors into account
• q is matched against the documents D' that cite D1, D2, D3 or D4: citing(D1), citing(D2), citing(D3), citing(D4)

Page 27

Social Habits

f18: scoreauthor-cited-venue(q,D1)
• Fixes D1's venue
• q is matched against the documents D' that cite D1, D2 or D3: citing(D1), citing(D2), citing(D3)

Page 28

Social Habits

f19: Scoreauthors-coauthored(q,D1)

[Figure: q and D, with documents D1, D2, D3, D4 and author sets {a1, a2} and {a3, a4}]

Page 29

Weighted sum

• Combine all features by a weighted sum

• Compress scale

Page 30

Experiment

• Precision
• Average precision
• Mean average precision

Page 31

Precision / Recall

• Precision = Relevant Documents Retrieved / Total Retrieved Documents
• Recall = Relevant Documents Retrieved / Total Relevant Documents

Example:
• Dataset: 10,000 items
  – 500 items are related to dining food
  – The user gives the query "dining food"; the system retrieves 4,000 items, 400 of which are related to dining food
• Precision = 400/4000 = 0.1
• Recall = 400/500 = 0.8
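The dining-food example can be checked with a few lines of Python. The function names are illustrative; the numbers are the ones given in the example.

```python
def precision(relevant_retrieved, total_retrieved):
    """Fraction of the retrieved items that are relevant."""
    return relevant_retrieved / total_retrieved


def recall(relevant_retrieved, total_relevant):
    """Fraction of the relevant items that were retrieved."""
    return relevant_retrieved / total_relevant


# 4,000 items retrieved, 400 of them relevant, 500 relevant items overall.
print(precision(400, 4000))  # 0.1
print(recall(400, 500))      # 0.8
```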

Page 32

Page 33

Page 34

Each query Q1, ..., Q16 is sent to the system; the system retrieves documents, and an average precision (AP) is computed for each query.

$$\mathrm{MAP} = \frac{\mathrm{AP}(Q_1) + \mathrm{AP}(Q_2) + \dots + \mathrm{AP}(Q_{16})}{16}$$
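A minimal sketch of average precision and MAP follows. It uses the common definition of average precision (precision at each rank where a relevant document appears, divided by the number of relevant documents); the surviving slide text does not spell out the definition, so treat this as an assumption. The example lists are the reference list D1 to D4 and the retrieved list D1, D5, D3, D4 from the experiment overview.

```python
def average_precision(retrieved, relevant):
    """Mean of precision@k over the ranks k that hold a relevant document."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Average of the per-query average precisions; runs = [(retrieved, relevant)]."""
    return sum(average_precision(ret, rel) for ret, rel in runs) / len(runs)


ap = average_precision(["D1", "D5", "D3", "D4"], ["D1", "D2", "D3", "D4"])
print(ap)  # (1/1 + 2/3 + 3/4) / 4

runs = [
    (["D1", "D5", "D3", "D4"], ["D1", "D2", "D3", "D4"]),
    (["D2", "D1"], ["D1"]),  # hypothetical second query
]
print(mean_average_precision(runs))
```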

Page 35

Experiment

Page 36

Experiment

• The article "Who Should I Cite?" has the references D1, D2, D3, D4
• Its topic and abstract are given to the system
• The system retrieves a reference list: D1, D5, D3, D4
• The retrieved reference list is compared with the references to calculate MAP

Page 37

$$\mathrm{score}(q, d) = w_1 f_1 + w_2 f_2 + \dots + w_{19} f_{19}$$

With $w_1 = 1$ and $w_2, \dots, w_{19} = 0$:

$$\mathrm{score}(q, d) = w_1 f_1$$

• Model + weights: retrieve documents (D1, D5, D3, ...) for the development data (D1, D2, D3, ...)
• Calculate MAP
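The weighted sum over the 19 features can be sketched as a dot product. The feature vector below fills in only f1 and f2 for D1 from the earlier worked examples and uses 0.0 placeholders for the rest; those placeholders and the function name are purely illustrative.

```python
def combined_score(weights, features):
    """score(q, d) = w1*f1 + w2*f2 + ... over equally long lists."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(w * f for w, f in zip(weights, features))


NUM_FEATURES = 19

# Initial model: w1 = 1, all other weights 0, so the score reduces to f1.
initial_weights = [1.0] + [0.0] * (NUM_FEATURES - 1)

# Features of D1: f1 = 3.786 and f2 = 3 from the examples, placeholders otherwise.
features_d1 = [3.786, 3.0] + [0.0] * (NUM_FEATURES - 2)

print(combined_score(initial_weights, features_d1))  # 3.786
```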

Page 38

MEANAVGPREC(model, D, N)
{
    for each document d in D
    {
        q <- TERMS(TITLE(d)) + TERMS(ABSTRACT(d))
        d' <- RETRIEVE(model, q, N)
        AP(d) <- AVGPREC(citedarticles(d), d')
    }
    MAP <- mean of AP(d) over all d in D
    return MAP
}

Page 39

Iterative procedure

1. Start with data = []
2. For each training document d, put its query q into the model m (initially w1 = 1, all other weights 0) and get the retrieved documents d'
3. Compare each d' with S, the set built from the references (ref) of d: if d' ∈ S, label = 1; else label = -1
4. Update data and use a classifier to adjust the weights

Page 40

Experiment

Page 41

Experiment

Related work features: terms, citation-count, pagerank, age, term-citing, authors

Page 42

Experiment

Page 43

Conclusion

• Presents a model for scientific article retrieval that introduces new features and a new learning algorithm.

• Shows that the weights for these features can be learned more efficiently by an iterative procedure.