Web Document Modeling
Peter Brusilovsky, University of Pittsburgh (peterb/2480-171/documentmodeling.pdf)
With slides from Jae-wook Ahn and Jumpol Polvichai



Where are we?
• Search, Navigation, Recommendation
• Content-based, Semantics/Metadata, Social

Introduction
• Modeling means "the construction of an abstract representation of the document"
  – Useful for all applications aimed at processing information automatically.
• Why build models of documents?
  – To guide users to the right documents, we need to know what they are about and how they are structured.
  – Some adaptation techniques can operate on documents as "black boxes", but others depend on the ability to understand and model documents.

Document modeling
• Documents → Document models → Processing application
• Document models: bag of words, tag-based, link-based, concept-based, AI-based
• Processing applications: matching (IR), filtering, adaptive presentation, etc.

Document model: example
the death toll rises in the middle east as the worst violence in four years spreads beyond jerusalem. the stakes are high, the race is tight. prepping for what could be a decisive moment in the presidential battle. how a tiny town in iowa became a booming melting pot and the image that will not soon fade. the man who captured it tells the story behind it.

Outline
• Classic IR-based representation
  – Preprocessing
  – Boolean, Probabilistic, Vector Space models
• Web-IR document representation
  – Tag-based document models
  – Link-based document models: HITS, PageRank
• Concept-based document modeling
  – LSI
• AI-based document representation
  – ANN, Semantic Network, Bayesian Network

Markup Languages
A markup language is a text-based language that combines content with its metadata. Markup languages support structure modeling.
• Presentational markup
  – Expresses document structure via the visual appearance of the whole text or of a particular fragment.
  – Example: word processors.
• Procedural markup
  – Focuses on the presentation of text, but is usually visible to the user editing the text file, and is expected to be interpreted by software in the same procedural order in which it appears.
  – Examples: TeX, PostScript.
• Descriptive markup
  – Applies labels to fragments of text without necessarily mandating any particular display or other processing semantics.
  – Examples: SGML, XML.

Classic IR model: process
Documents → Preprocessing → Set of terms ("bag of words") → Term weighting → Matching (by IR models) against a query (or other documents)

Preprocessing: motivation
• Extract the document content itself to be processed (used)
• Remove control information
  – Tags, scripts, stylesheets, etc.
• Remove non-informative fragments
  – Stopwords; reduce words to their stems
• Possibly extract semantic information (noun phrases, concepts, named entities)

Preprocessing: HTML tag removal
• Removes <.*> parts from the HTML document (source)
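The <.*> removal above can be sketched with a regular expression. This is a simplification for illustration; real pages need an HTML parser to handle scripts, comments, and attributes that contain ">".

```python
import re

def strip_tags(html: str) -> str:
    """Remove <...> markup, as on the slide, then collapse whitespace."""
    text = re.sub(r"<[^>]*>", " ", html)
    return re.sub(r"\s+", " ", text).strip()

strip_tags("<p>General <b>Motors</b> was locked in intense negotiations.</p>")
# 'General Motors was locked in intense negotiations.'
```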

PreprocessingHTMLtagremoval

DETROIT—Withitsaccesstoagovernmentlifelineinthebalance,GeneralMotorswaslockedinintensenegoBaBonsonMondaywiththeUnitedAutomobileWorkersoverwaystocutitsbillsforreBreehealthcare.

11

Preprocessing: tokenizing / case normalization
• Extract term/feature tokens from the text
• Example: detroit with its access to a government lifeline in the balance general motors was locked in intense negotiations on monday with the united automobile workers over ways to cut its bills for retiree health care
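A minimal tokenizer in the spirit of the slide: lowercase everything and keep only letter runs (a simplifying assumption; real tokenizers also handle numbers, hyphens, and apostrophes).

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

tokenize("DETROIT — With its access to a government lifeline")
# ['detroit', 'with', 'its', 'access', 'to', 'a', 'government', 'lifeline']
```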

Preprocessing: stopword removal
• Very common words
• Do not contribute to separating one document from another meaningfully
• Usually a standard set of words is matched and removed
• Example (after removal): detroit access government lifeline balance general motors locked intense negotiations monday united automobile workers ways cut bills retiree health care

Named Entities
• Named entities and other concepts are typically extracted from full text
• Example sentence (entities such as DETROIT, General Motors, Monday, and the United Automobile Workers would be marked): "DETROIT — With its access to a government lifeline in the balance, General Motors was locked in intense negotiations on Monday with the United Automobile Workers over ways to cut its bills for retiree health care."

Extracting Semantic Information
• Sometimes HTML tags are useful
(figure)

Preprocessing: stemming
• Extracts word "stems" only
• Avoids word variations that are not informative
  – apples, apple
  – retrieval, retrieve, retrieving
  – Should they be distinguished? Maybe not.
• Common stemmers: Porter, Krovetz

Preprocessing: stemming (Porter)
• Martin Porter, 1979
• Cyclical recognition and removal of known suffixes and prefixes
• Try the demo at http://qaa.ath.cx/porter_js_demo.html

Preprocessing: stemming (Krovetz)
• Bob Krovetz, 1993
• Makes use of inflectional linguistic morphology
• Removes inflectional suffixes in three steps
  – singular form (e.g. '-ies', '-es', '-s')
  – past to present tense (e.g. '-ed')
  – removal of '-ing'
• Each candidate stem is checked in a dictionary
• Produces more human-readable stems

Preprocessing: stemming examples
• Porter stemming example
  [detroit access govern lifelin balanc gener motor lock intens negoti mondai unit automobil worker wai cut bill retire healthcare]
• Krovetz stemming example
  [detroit access government lifeline balance general motor lock intense negotiation monday united automobile worker ways cut bill retiree healthcare]

Term weighting
• How should we represent the terms/features after the processing steps so far?
• detroit access government lifeline balance general motor lock intense negotiation monday united automobile worker ways cut bill retiree healthcare → ?

Term weighting: document-term matrix
• Columns: every term that appears in the corpus (not in a single document)
• Rows: every document in the collection
• Example: a collection with N documents and M terms

         T1  T2  T3  …  TM
  Doc1    0   1   1  …   0
  Doc2    1   0   0  …   1
  …       …   …   …  …   …
  DocN    0   0   0  …   1

Term weighting: document-term matrix
• Binary document-term matrix (1 if the term appears, 0 otherwise)
• Every term is treated equivalently

         attract  benefit  book  …  zoo
  Doc1      0        1       1   …   0
  Doc2      1        0       0   …   1
  Doc3      0        0       0   …   1
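Building the binary matrix above is a few lines of Python (a sketch with made-up three-document data, not the slide's corpus):

```python
def binary_matrix(docs):
    """Binary document-term matrix: rows = documents, columns = sorted vocabulary."""
    vocab = sorted({t for doc in docs for t in doc})
    rows = [[1 if t in set(doc) else 0 for t in vocab] for doc in docs]
    return vocab, rows

vocab, rows = binary_matrix([["benefit", "book"], ["attract", "zoo"], ["zoo"]])
# vocab = ['attract', 'benefit', 'book', 'zoo']
# rows  = [[0, 1, 1, 0], [1, 0, 0, 1], [0, 0, 0, 1]]
```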

Term weighting: term frequency
• So we need "weighting": give different "importance" to different terms
• TF: term frequency
  – How many times a term appears in a document
  – Higher frequency → higher relatedness

Term weighting: IDF
• IDF: Inverse Document Frequency
  – Generality of a term: too general → not beneficial
  – Example
• "information" (in the ACM Digital Library)
  – 99.99% of articles will have it
  – TF will be very high in each document, IDF low
• "personalization"
  – Say, 5% of documents will have it
  – TF again will be very high, IDF high

Term weighting: IDF
• IDF = log(N / DF)
  – N = number of documents in the corpus
  – DF = document frequency = number of documents that contain the term
• If the ACM DL has 1M documents
  – IDF("information") = log(1,000,000 / 999,900) ≈ 0.0001
  – IDF("personalization") = log(1,000,000 / 50,000) = log(20) ≈ 3.0
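The two IDF values above can be checked directly (using the natural logarithm, which is one common convention; base 2 or 10 only rescales the values):

```python
import math

def idf(n_docs: int, df: int) -> float:
    """Inverse document frequency: idf = ln(N / DF)."""
    return math.log(n_docs / df)

round(idf(1_000_000, 999_900), 4)   # ≈ 0.0001  "information": near-ubiquitous, low IDF
round(idf(1_000_000, 50_000), 2)    # ≈ 3.0     "personalization": rarer, higher IDF
```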

Term weighting: IDF
• information, personalization, recommendation
• Can we say…
  – Doc1, 2, 3 … are about information?
  – Doc1, 6, 8 … are about personalization?
  – Doc5 is about recommendation?
(figure: a grid of Doc1 … Doc8)

Term weighting: TF*IDF
• TF multiplied by IDF
• Considers TF and IDF at the same time
• High-frequency terms concentrated in a smaller portion of the documents get higher scores

  Document  benef  attract  sav    springer  book
  d1        0.176  0.176    0.417  0.176     0.176
  d2        0.000  0.350    0.000  0.528     0.000
  d3        0.528  0.000    0.000  0.000     0.176

Term weighting: BM25
• Okapi BM25 (Okapi Best Match 25)
• Probabilistic model: calculates term relevance within a document
• Computes a term weight according to the probability of its appearance in a relevant document and to the probability of it appearing in a non-relevant document in a collection D
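The slide does not give the formula, so the sketch below uses one common textbook form of the BM25 term weight (an assumption, not necessarily the variant used in the course); k1 and b are the usual free parameters controlling term-frequency saturation and length normalization.

```python
import math

def bm25_weight(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    """One common BM25 term weight: idf * saturated, length-normalized tf."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
```

Note how the weight grows with tf but saturates, and how long documents are penalized via doc_len / avg_len.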

Term weighting: entropy weighting
• Entropy of term ti
  – −1: equal distribution over all documents
  – 0: appearing in only 1 document

IR models
• Boolean
• Probabilistic
• Vector Space

IR Models: Boolean model
• Based on set theory and Boolean algebra
• Example query → d1, d3 (the query itself was a figure; with the earlier example collection, a conjunctive query such as benef AND book matches d1 and d3)

IR Models: Boolean model
• Simple and easy to implement
• Shortcomings
  – Only retrieves exact matches (no partial match)
  – No ranking
  – Depends on user query formulation
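Boolean AND retrieval is set intersection; a minimal sketch over the example collection (document term sets assumed from the earlier TF*IDF table):

```python
def boolean_and(query_terms, docs):
    """Return indices of documents containing every query term."""
    return [i for i, d in enumerate(docs) if all(t in d for t in query_terms)]

docs = [{"benef", "attract", "sav", "springer", "book"},   # d1
        {"attract", "springer"},                           # d2
        {"benef", "book"}]                                 # d3
boolean_and({"benef", "book"}, docs)   # [0, 2]  i.e. d1 and d3
```

Note there is no ranking: a document either satisfies the Boolean condition or it does not.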

IR Models: Probabilistic model
• Binary weight vectors
• Query-document similarity function
• Probability that a certain document is relevant to a certain query
• Ranking: according to the probability of being relevant

IR Models: Probabilistic model
• Similarity calculation (formula shown as a figure; obtained via Bayes' theorem and removing some constants)
• Simplifying assumptions
  – No relevance information at startup

IR Models: Probabilistic model
• Shortcomings
  – Division of the set of documents into relevant / non-relevant documents
  – Term independence assumption
  – Index terms have binary weights

IR Models: Vector space model
• Document = a vector in an m-dimensional space (m = number of index terms)
• Each term represents a dimension
• The component of a document vector along a given direction → term importance
• Queries and documents are represented as vectors

IR Models: Vector space model
• Document similarity
  – Cosine of the angle between the vectors
• Benefits
  – Term weighting
  – Partial matching
• Shortcomings
  – Term independence assumption
(figure: document vectors d1 and d2 in the t1-t2 plane, separated by the cosine angle)

IR Models: Vector space model
• Example
  – Query = "springer book" → q = (0, 0, 0, 1, 1)
  – Sim(d1, q) = (0.176 + 0.176) / (√1 + √(0.176² + 0.176² + 0.417² + 0.176² + 0.176²)) = 0.228
  – Sim(d2, q) = 0.528 / (√1 + √(0.350² + 0.528²)) = 0.323
  – Sim(d3, q) = 0.176 / (√1 + √(0.528² + 0.176²)) = 0.113

  Document  benef  attract  sav    springer  book
  d1        0.176  0.176    0.417  0.176     0.176
  d2        0.000  0.350    0.000  0.528     0.000
  d3        0.528  0.000    0.000  0.000     0.176

IR Models: Vector space model
• Document-document similarity (cosine)
• Sim(d1, d2) = 0.447
• Sim(d2, d3) = 0.0
• Sim(d1, d3) = 0.408

  Document  benef  attract  sav    springer  book
  d1        0.176  0.176    0.417  0.176     0.176
  d2        0.000  0.350    0.000  0.528     0.000
  d3        0.528  0.000    0.000  0.000     0.176
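The document-document numbers above can be reproduced with a plain cosine similarity over the table's rows:

```python
import math

def cosine(u, v):
    """Cosine similarity between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

#     benef  attract  sav    springer  book
d1 = [0.176, 0.176,   0.417, 0.176,    0.176]
d2 = [0.000, 0.350,   0.000, 0.528,    0.000]
d3 = [0.528, 0.000,   0.000, 0.000,    0.176]

round(cosine(d1, d2), 3)  # 0.447
round(cosine(d2, d3), 3)  # 0.0
round(cosine(d1, d3), 3)  # 0.408
```

d2 and d3 share no terms, so their dot product and hence their similarity is exactly zero.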

Curse of dimensionality
• TDT4 corpus
  – |D| = 96,260 documents
  – |ITD| = 118,205 index terms
• Linearly calculating sim(q, D)
  – 96,260 inner products (one per document), each over 118,205 dimensions
• However, document matrices are very sparse
  – Mostly 0's
  – Storing those 0's is inefficient in both space and calculation

Curse of dimensionality
• Inverted index
  – An index from term to document
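The term-to-document index can be sketched as a dictionary of postings lists (toy data assumed), so a query only touches documents that actually contain its terms:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the list of document ids that contain it."""
    index = defaultdict(list)
    for doc_id, terms in enumerate(docs):
        for t in set(terms):
            index[t].append(doc_id)
    return index

idx = build_inverted_index([["benef", "book"], ["springer"], ["benef"]])
sorted(idx["benef"])   # [0, 2]
```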

Web-IR document representation
• Enhances the classic VSM
• Exploits possibilities offered by the HTML language
• Tag-based
• Link-based
  – HITS
  – PageRank

Web-IR: tag-based approaches
• Give different weights to different tags
  – Some text fragments within a tag may be more important than others
  – <body>, <title>, <h1>, <h2>, <h3>, <a> …

Web-IR: tag-based approaches
• WEBOR system
• Six classes of tags
• CIV = class importance vector
• TFV = class frequency vector (the term's frequency within each tag class)

Web-IR: tag-based approaches
• Term weighting example
  – CIV = {0.6, 1.0, 0.8, 0.5, 0.7, 0.8, 0.5}
  – TFV("personalization") = {0, 3, 3, 0, 0, 8, 10}
  – W("personalization") = (0.0 + 3.0 + 2.4 + 0.0 + 0.0 + 6.4 + 5.0) × IDF = 16.8 × IDF
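The slide's weight is a dot product of the class importance vector and the term's per-class frequencies:

```python
def class_weight(civ, tfv):
    """Tag-aware term weight: CIV · TFV (multiply by IDF afterwards)."""
    return sum(c * f for c, f in zip(civ, tfv))

class_weight([0.6, 1.0, 0.8, 0.5, 0.7, 0.8, 0.5], [0, 3, 3, 0, 0, 8, 10])
# 16.8, matching the slide's (0.0 + 3.0 + 2.4 + 0.0 + 0.0 + 6.4 + 5.0)
```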

Web-IR: HITS (Hyperlink-Induced Topic Search)
• Link-based approach
• Improves search performance by considering Web document links
• Works on an initial set of retrieved documents
• Hubs and authorities
  – A good authority page is one that is pointed to by many good hub pages
  – A good hub page is one that points to many good authority pages
  – Circular definition → iterative computation

Web-IR: HITS (Hyperlink-Induced Topic Search)
• Iterative update of the authority and hub vectors (in matrix form, a ← Lᵀh and h ← La for adjacency matrix L, normalizing after each update)

Web-IR: HITS (Hyperlink-Induced Topic Search)
• Example on a three-page graph (Doc0, Doc1, Doc2)
• N = 1: A = [0.371, 0.557, 0.743], H = [0.667, 0.667, 0.333]
• N = 10: A = [0.344, 0.573, 0.744], H = [0.722, 0.619, 0.309]
• N = 1000: A = [0.328, 0.591, 0.737], H = [0.737, 0.591, 0.328]
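The slide's graph survives only as a figure; the sketch below assumes the link structure Doc0→{Doc1, Doc2}, Doc1→{Doc0, Doc2}, Doc2→Doc1, which reproduces the slide's numbers, with hubs updated first and L2 normalization after each update.

```python
import math

def _norm(v):
    """L2-normalize a vector."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def hits(edges, n_nodes, iterations):
    """HITS: h(i) = sum of a(j) over links i->j, then a(j) = sum of h(i) over links i->j."""
    auth = [1.0] * n_nodes
    hub = [1.0] * n_nodes
    for _ in range(iterations):
        hub = _norm([sum(auth[t] for (s, t) in edges if s == i) for i in range(n_nodes)])
        auth = _norm([sum(hub[s] for (s, t) in edges if t == j) for j in range(n_nodes)])
    return auth, hub

edges = [(0, 1), (0, 2), (1, 0), (1, 2), (2, 1)]   # assumed from the slide's figure
auth, hub = hits(edges, 3, 1)      # ≈ [0.371, 0.557, 0.743], [0.667, 0.667, 0.333]
auth, hub = hits(edges, 3, 1000)   # ≈ [0.328, 0.591, 0.737], [0.737, 0.591, 0.328]
```

Doc2 ends up as the strongest authority (two in-links from good hubs) and Doc0 as the strongest hub.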

Web-IR: PageRank
• Google
• Unlike HITS
  – Not limited to a specific initial retrieved set of documents
  – A single value per page (instead of separate hub and authority scores)
• Initial state: no links considered yet; evenly divide the score among the 4 documents A, B, C, D
• PR(A) = PR(B) = PR(C) = PR(D) = 1/4 = 0.25

Web-IR: PageRank
• With B, C, and D each linking only to A:
  PR(A) = PR(B) + PR(C) + PR(D) = 0.25 + 0.25 + 0.25 = 0.75

Web-IR: PageRank
• If B has two out-links (to A and to C), its score is split
  – PR(B) to A = 0.25/2 = 0.125
  – PR(B) to C = 0.25/2 = 0.125
• PR(A) = PR(B→A) + PR(C) + PR(D) = 0.125 + 0.25 + 0.25 = 0.625

Web-IR: PageRank
• If D has three out-links, its score is split three ways
  – PR(D) to A = 0.25/3 = 0.083
  – PR(D) to B = 0.25/3 = 0.083
  – PR(D) to C = 0.25/3 = 0.083
• PR(A) = PR(B→A) + PR(C) + PR(D→A) = 0.125 + 0.25 + 0.083 = 0.458
• Recursively keep calculating for further documents linking to A, B, C, and D
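One simplified PageRank update (no damping factor, as on the slides) can be sketched as follows; the link structure B→{A, C}, C→{A}, D→{A, B, C} is an assumption reconstructed from the walkthrough above.

```python
def pagerank_step(pr, out_links):
    """One simplified PageRank iteration: each page splits its current
    score evenly among the pages it links to (no damping)."""
    new_pr = {p: 0.0 for p in pr}
    for page, targets in out_links.items():
        for t in targets:
            new_pr[t] += pr[page] / len(targets)
    return new_pr

pr = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
out = {"B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}   # assumed links
pagerank_step(pr, out)["A"]   # 0.125 + 0.25 + 0.083 ≈ 0.458
```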

Concept-based document modeling: LSI (Latent Semantic Indexing)
• Represents documents by concepts
  – Not by terms
• Reduces the term space to a concept space
  – Linear algebra technique: SVD (Singular Value Decomposition)
• Step 1: matrix decomposition; the original document matrix A is factored into three matrices, A = UΣVᵀ
• Step 2: a rank k is selected from the original equation (k = reduced number of concept dimensions)
• Step 3: the original term-document matrix A is converted to Ak
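The three steps can be sketched with NumPy's SVD; the 4×3 term-document matrix here is a toy assumption, not the slide's data.

```python
import numpy as np

def lsi_approx(A, k):
    """Rank-k approximation A_k = U_k Σ_k V_kᵀ via SVD, as in LSI."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Toy 4-term x 3-document count matrix (hypothetical data).
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])
A2 = lsi_approx(A, 2)   # documents re-expressed through a 2-concept space
```

Keeping more singular values reduces the reconstruction error; keeping fewer compresses documents into fewer "concepts".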

Concept-based document modeling: LSI
• Document-term matrix A (figure)

Concept-based document modeling: LSI
• Decomposition (figure)

Concept-based document modeling: LSI
• Low-rank approximation (k = 2) (figure)

Concept-based document modeling: LSI
• Final Ak
  – Columns: documents
  – Rows: concepts (k = 2)
(figure: documents d1, d2, d3 plotted in the two-dimensional concept space given by the SVD document matrix Vᵀ)

Extracting Semantics
• Semantic indexing based on matching documents to external models
  – Wikipedia / DBpedia
  – WordNet, NELL
  – Ontologies
• Semantic classification
  – Yahoo.com, dmoz.org
  – Topic collections

AI-based approaches: Artificial Neural Networks
• Query, terms, and documents are separated into 3 layers
• Term-document weight = normalized TF-IDF
• Query → term activation → if the sum of the signals exceeds a threshold → document retrieval
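The query → term → document activation flow can be sketched as a spreading-activation pass; the weights here are made-up stand-ins for the normalized TF-IDF values the slide mentions.

```python
def retrieve_ann(query_terms, term_doc_weights, threshold=0.5):
    """3-layer sketch: activate the query's term nodes, sum the weighted
    signals reaching each document node, and retrieve documents whose
    total activation exceeds the threshold."""
    scores = {}
    for term in query_terms:
        for doc, w in term_doc_weights.get(term, {}).items():
            scores[doc] = scores.get(doc, 0.0) + w
    return [d for d, s in scores.items() if s > threshold]

weights = {"springer": {"d2": 0.528, "d1": 0.176},   # hypothetical TF-IDF weights
           "book":     {"d1": 0.176, "d3": 0.176}}
retrieve_ann(["springer", "book"], weights, threshold=0.3)   # d2 and d1 pass
```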

AI-based approaches: Semantic Networks
• Conceptual knowledge
• Relationships between concepts

AI-based approaches: Bayesian Networks
• Metzler and Croft (2004)
  – Indri search engine, based on InQuery
• Inference network
  – Document
  – Representation (terms, phrases)
  – Query
  – Information need
• Calculates the probability of each document from the network