web document modeling - University of Pittsburgh - peterb/2480-171/documentmodeling.pdf
TRANSCRIPT
Introduction

• Modeling means "the construction of an abstract representation of the document"
  – Useful for all applications aimed at processing information automatically.
• Why build models of documents?
  – To guide users to the right documents, we need to know what they are about and how they are structured.
  – Some adaptation techniques can operate with documents as "black boxes", but others are based on the ability to understand and model documents.
Document modeling

Documents → Document Models (Bag of Words, Tag-based, Link-based, Concept-based, AI-based) → Processing Application (Matching (IR), Filtering, Adaptive presentation, etc.)
Document model - Example

the death toll rises in the middle east as the worst violence in four years spreads beyond jerusalem. the stakes are high, the race is tight. prepping for what could be a decisive moment in the presidential battle. how a tiny town in iowa became a booming melting pot and the image that will not soon fade. the man who captured it tells the story behind it.
Outline

• Classic IR based representation
  – Preprocessing
  – Boolean, Probabilistic, Vector Space models
• Web-IR document representation
  – Tag-based document models
  – Link-based document models - HITS, PageRank
• Concept-based document modeling
  – LSI
• AI-based document representation
  – ANN, Semantic Network, Bayesian Network
Markup Languages

A Markup Language is a text-based language that combines content with its metadata. Markup languages support structure modeling.
• Presentational Markup
  – Expresses document structure via the visual appearance of the whole text or of a particular fragment.
  – Ex. word processors
• Procedural Markup
  – Focuses on the presentation of text, but is usually visible to the user editing the text file, and is expected to be interpreted by software following the same procedural order in which it appears.
  – Ex. TeX, PostScript
• Descriptive Markup
  – Applies labels to fragments of text without necessarily mandating any particular display or other processing semantics.
  – Ex. SGML, XML
Classic IR model - Process

Documents → Preprocessing → Set of Terms ("Bag of Words") → Term weighting → Matching (by IR models) against a Query (or other documents)
Preprocessing - Motivation

• Extract the document content itself to be processed (used)
• Remove control information
  – Tags, scripts, stylesheets, etc.
• Remove non-informative fragments
  – Stopwords, word stems
• Possible extraction of semantic information (noun phrases, concepts, named entities)
Preprocessing - HTML tag removal

DETROIT - With its access to a government lifeline in the balance, General Motors was locked in intense negotiations on Monday with the United Automobile Workers over ways to cut its bills for retiree health care.
Preprocessing - Tokenizing / case normalization

• Extract term/feature tokens from the text

detroit with its access to a government lifeline in the balance general motors was locked in intense negotiations on monday with the united automobile workers over ways to cut its bills for retiree health care
Preprocessing - Stopword removal

• Very common words
• Do not contribute meaningfully to distinguishing one document from another
• Usually a standard set of words is matched and removed

detroit with its access to a government lifeline in the balance general motors was locked in intense negotiations on monday with the united automobile workers over ways to cut its bills for retiree health care
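The preprocessing steps so far (tag removal, tokenizing, case normalization, stopword removal) can be sketched as a small pipeline; the stopword list below is a tiny illustrative subset, not a standard one.

```python
import re

# Small illustrative stopword list (real systems use standard lists of several hundred words)
STOPWORDS = {"with", "its", "to", "a", "in", "the", "was", "on", "over", "for", "of", "and"}

def preprocess(html: str) -> list[str]:
    text = re.sub(r"<[^>]+>", " ", html)          # strip HTML tags
    tokens = re.findall(r"[a-z]+", text.lower())  # tokenize + case-normalize
    return [t for t in tokens if t not in STOPWORDS]  # remove stopwords

doc = ("<p>DETROIT - With its access to a government lifeline in the balance, "
       "General Motors was locked in intense negotiations on Monday.</p>")
print(preprocess(doc))
```

Running it on the example sentence yields the content words only, e.g. detroit, access, government, lifeline, and so on.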
Named Entities

• Named Entities and other concepts are typically extracted from full text

DETROIT - With its access to a government lifeline in the balance, General Motors was locked in intense negotiations on Monday with the United Automobile Workers over ways to cut its bills for retiree health care. (Named entities here: DETROIT, General Motors, Monday, the United Automobile Workers.)
Preprocessing - Stemming

• Extracts word "stems" only
• Avoids word variations that are not informative
  – apples, apple
  – retrieval, retrieve, retrieving
  – Should they be distinguished? Maybe not.
• Porter
• Krovetz
Preprocessing - Stemming (Porter)

• Martin Porter, 1980
• Cyclical recognition and removal of known suffixes and prefixes
• Try the demo at http://qaa.ath.cx/porter_js_demo.html
Preprocessing - Stemming (Krovetz)

• Bob Krovetz, 1993
• Makes use of inflectional linguistic morphology
• Removes inflectional suffixes in three steps
  – single form (e.g. '-ies', '-es', '-s')
  – past to present tense (e.g. '-ed')
  – removal of '-ing'
• Checks results in a dictionary
• More human-readable stems
Preprocessing - Stemming

• Porter stemming example
  [detroit access govern lifelin balanc gener motor lock intens negoti mondai unit automobil worker wai cut bill retir healthcare]
• Krovetz stemming example
  [detroit access government lifeline balance general motor lock intense negotiation monday united automobile worker ways cut bill retiree healthcare]
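A toy illustration of the three inflectional steps described for Krovetz; this omits the dictionary check a real stemmer performs, so it is a simplification for intuition only, not the actual algorithm.

```python
def toy_stem(word: str) -> str:
    # Step 1: plural suffixes ('-ies', '-s')
    if word.endswith("ies"):
        word = word[:-3] + "y"
    elif word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    # Step 2: past tense ('-ed')
    if word.endswith("ed") and len(word) > 4:
        word = word[:-2]
    # Step 3: removal of '-ing'
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
    return word

print([toy_stem(w) for w in ["apples", "cities", "retrieved", "retrieving"]])
```

Without the dictionary step, "retrieved" and "retrieving" both collapse to the crude stem "retriev"; the real Krovetz stemmer would restore the readable form "retrieve".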
Term weighting

• How should we represent the terms/features after the processing so far?

detroit access government lifeline balance general motor lock intense negotiation monday united automobile worker ways cut bill retiree healthcare → ?
Term weighting - Document-term matrix

• Columns - every term appearing in the corpus (not in a single document)
• Rows - every document in the collection
• Example
  – If a collection has N documents and M terms...

        T1   T2   T3   ...  TM
Doc1    0    1    1    ...  0
Doc2    1    0    0    ...  1
...     ...  ...  ...  ...  ...
DocN    0    0    0    ...  1
Term weighting - Document-term matrix

• Document-term matrix
  – Binary (1 if the term appears, otherwise 0)
• Every term is treated equivalently

        attract  benefit  book  ...  zoo
Doc1    0        1        1     ...  0
Doc2    1        0        0     ...  1
Doc3    0        0        0     ...  1
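Building such a binary document-term matrix can be sketched as follows (the tiny corpus is hypothetical, chosen to mirror the table above):

```python
def binary_matrix(docs: list[list[str]]):
    # columns: every term appearing anywhere in the corpus, in sorted order
    vocab = sorted({t for d in docs for t in d})
    # rows: one per document; 1 if the term appears, otherwise 0
    rows = [[1 if t in d else 0 for t in vocab] for d in docs]
    return vocab, rows

docs = [["benefit", "book"], ["attract", "zoo"], ["zoo"]]
vocab, rows = binary_matrix(docs)
print(vocab)  # ['attract', 'benefit', 'book', 'zoo']
print(rows)   # [[0, 1, 1, 0], [1, 0, 0, 1], [0, 0, 0, 1]]
```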
Term weighting - Term frequency

• So, we need "weighting"
  – Give different "importance" to different terms
• TF - Term frequency
  – How many times did a term appear in a document?
  – Higher frequency → higher relatedness
Term weighting - IDF

• IDF - Inverse Document Frequency
  – Generality of a term → too general, not beneficial
  – Example
    • "Information" (in the ACM Digital Library)
      – 99.99% of articles will have it
      – TF will be very high in each document, IDF low
    • "Personalization"
      – Say, 5% of documents will have it
      – TF again will be very high, IDF high
Term weighting - IDF

• IDF

  IDF(t) = log(N / DF(t))

  – N = number of documents in the corpus
  – DF = document frequency = number of documents that contain the term
  – If the ACM DL has 1M documents
    • IDF("information") = log(1,000,000 / 999,900) ≈ 0.0001
    • IDF("personalization") = log(1,000,000 / 50,000) ≈ 3.0
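The two example values above can be checked directly (using the natural logarithm):

```python
import math

def idf(n_docs: int, df: int) -> float:
    # IDF(t) = log(N / DF(t)); rarer terms get higher scores
    return math.log(n_docs / df)

print(round(idf(1_000_000, 999_900), 4))  # ≈ 0.0001: "information" is too general
print(round(idf(1_000_000, 50_000), 2))   # ≈ 3.0: "personalization" is discriminative
```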
Term weighting - IDF

• information, personalization, recommendation
• Can we say...
  – Doc1, 2, 3, ... are about information?
  – Doc1, 6, 8, ... are about personalization?
  – Doc5 is about recommendation?

[Figure: grid of documents Doc1 through Doc8]
Term weighting - TF*IDF

• TF*IDF
  – TF multiplied by IDF
  – Considers TF and IDF at the same time
  – High-frequency terms concentrated in a smaller portion of the documents get higher scores

Document   benef   attract   sav     springer   book
d1         0.176   0.176     0.417   0.176      0.176
d2         0.000   0.350     0.000   0.528      0.000
d3         0.528   0.000     0.000   0.000      0.176
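A minimal TF*IDF computation (the term counts and collection sizes here are hypothetical, purely to illustrate the effect described above):

```python
import math

def tf_idf(tf: int, n_docs: int, df: int) -> float:
    # TF multiplied by IDF: frequent terms that occur in few documents score highest
    return tf * math.log(n_docs / df)

# a term appearing 3 times but in only 10 of 1000 docs...
print(round(tf_idf(3, 1000, 10), 2))   # 13.82
# ...outscores one appearing 5 times but in 800 of 1000 docs
print(round(tf_idf(5, 1000, 800), 2))  # 1.12
```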
Term weighting - BM25

• Okapi BM25 - Okapi Best Match 25
  – Probabilistic model - calculates term relevance within a document
  – Computes a term weight according to the probability of its appearance in a relevant document and the probability of it appearing in a non-relevant document in a collection D
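The transcript does not reproduce the formula; a commonly used formulation of the BM25 term weight, with the usual k1 and b parameters, is sketched below.

```python
import math

def bm25_weight(tf, doc_len, avg_doc_len, n_docs, df, k1=1.2, b=0.75):
    # IDF component: rarer terms weigh more
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    # TF component: saturates as tf grows, normalized by document length
    tf_part = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_part

w1 = bm25_weight(tf=1, doc_len=100, avg_doc_len=100, n_docs=10_000, df=50)
w5 = bm25_weight(tf=5, doc_len=100, avg_doc_len=100, n_docs=10_000, df=50)
print(w1 < w5 < 5 * w1)  # scores grow with tf but saturate, unlike raw TF*IDF
```

The saturation behavior (five occurrences score less than five times one occurrence) is the main practical difference from plain TF*IDF.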
Term weighting - Entropy weighting

• Entropy weighting
  – Entropy of term ti
    • -1: equal distribution across all documents
    • 0: appearing in only 1 document
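The transcript omits the formula; a common log-entropy form, consistent with the -1 (uniform) to 0 (single-document) range described above, is:

```latex
H(t_i) \;=\; \frac{1}{\log N}\sum_{j=1}^{N} p_{ij}\,\log p_{ij},
\qquad p_{ij} \;=\; \frac{tf_{ij}}{\sum_{k} tf_{ik}}
```

Here N is the number of documents and p_ij the share of term t_i's occurrences falling in document j; the resulting global weight is often applied as 1 + H(t_i).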
IR Models - Boolean model

• Simple and easy to implement
• Shortcomings
  – Only retrieves exact matches
    • No partial match
  – No ranking
  – Depends on user query formulation
IR Models - Probabilistic model

• Binary weight vector
• Query-document similarity function
• Probability that a certain document is relevant to a certain query
• Ranking - according to the probability of being relevant
IR Models - Probabilistic model

• Similarity calculation
  – Derived via Bayes' Theorem, removing some constants
• Simplifying assumptions
  – No relevance information at startup
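The similarity formula itself did not survive in the transcript; the standard binary-independence form, obtained after applying Bayes' theorem and dropping query-independent constants, is:

```latex
sim(d_j, q) \;\propto\;
\sum_{t_i \,\in\, q \,\cap\, d_j}
\log\frac{P(t_i \mid R)\,\bigl(1 - P(t_i \mid \bar{R})\bigr)}
         {P(t_i \mid \bar{R})\,\bigl(1 - P(t_i \mid R)\bigr)}
```

With no relevance information at startup, P(t_i | R) is typically assumed to be 0.5 and P(t_i | R̄) is approximated by DF_i / N, the term's share of the collection.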
IR Models - Probabilistic model

• Shortcomings
  – Division of the set of documents into relevant/non-relevant documents
  – Term independence assumption
  – Index terms - binary weights
IR Models - Vector space model

• Document = m-dimensional space (m = number of index terms)
• Each term represents a dimension
• The component of a document vector along a given direction → term importance
• Queries and documents are represented as vectors
IR Models - Vector space model

• Document similarity
  – Cosine of the angle between vectors
• Benefits
  – Term weighting
  – Partial matching
• Shortcomings
  – Term independence assumption

[Figure: document vectors d1 and d2 in the plane spanned by terms t1 and t2; similarity is the cosine of the angle between them]
IR Models - Vector space model

• Example
  – Query = "springer book"
  – q = (0, 0, 0, 1, 1)
  – Sim(d1,q) = (0.176+0.176) / (√(1²+1²) · √(0.176²+0.176²+0.417²+0.176²+0.176²)) ≈ 0.456
  – Sim(d2,q) = 0.528 / (√2 · √(0.350²+0.528²)) ≈ 0.589
  – Sim(d3,q) = 0.176 / (√2 · √(0.528²+0.176²)) ≈ 0.224

Document   benef   attract   sav     springer   book
d1         0.176   0.176     0.417   0.176      0.176
d2         0.000   0.350     0.000   0.528      0.000
d3         0.528   0.000     0.000   0.000      0.176
IR Models - Vector space model

• Document-document similarity
• Sim(d1,d2) = 0.447
• Sim(d2,d3) = 0.0
• Sim(d1,d3) = 0.408

Document   benef   attract   sav     springer   book
d1         0.176   0.176     0.417   0.176      0.176
d2         0.000   0.350     0.000   0.528      0.000
d3         0.528   0.000     0.000   0.000      0.176
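The document-document values above can be reproduced with a plain cosine similarity over the TF*IDF vectors from the table:

```python
import math

def cosine(u, v):
    # cosine of the angle between two term-weight vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

d1 = [0.176, 0.176, 0.417, 0.176, 0.176]
d2 = [0.000, 0.350, 0.000, 0.528, 0.000]
d3 = [0.528, 0.000, 0.000, 0.000, 0.176]

print(round(cosine(d1, d2), 3))  # 0.447
print(round(cosine(d2, d3), 3))  # 0.0  (no shared terms)
print(round(cosine(d1, d3), 3))  # 0.408
```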
Curse of dimensionality

• TDT4
  – |D| = 96,260
  – |ITD| = 118,205
• If sim(q,D) is calculated linearly
  – 96,260 (one per document) * 118,205 (inner-product dimensions) comparisons
• However, document matrices are very sparse
  – Mostly 0's
  – Storing those 0's is inefficient in both space and calculation
Web-IR document representation

• Enhances the classic VSM
• Exploits possibilities offered by the HTML language
• Tag-based
• Link-based
  – HITS
  – PageRank
Web-IR - Tag-based approaches

• Give different weights to different tags
  – Some text fragments within a tag may be more important than others
  – <body>, <title>, <h1>, <h2>, <h3>, <a> ...
Web-IR - Tag-based approaches

• WEBOR system
• Six classes of tags
• CIV = class importance vector
• TFV = term frequency vector (the term's frequency in each tag class)
Web-IR - Tag-based approaches

• Term weighting example
  – CIV = {0.6, 1.0, 0.8, 0.5, 0.7, 0.8, 0.5}
  – TFV("personalization") = {0, 3, 3, 0, 0, 8, 10}
  – W("personalization") = (0.0 + 3.0 + 2.4 + 0.0 + 0.0 + 6.4 + 5.0) * IDF = 16.8 * IDF
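The example weight is the inner product of CIV and TFV, scaled by the term's IDF:

```python
def tag_weight(civ, tfv, idf=1.0):
    # weight = sum over tag classes of (class importance * term frequency), times IDF
    return sum(c * f for c, f in zip(civ, tfv)) * idf

CIV = [0.6, 1.0, 0.8, 0.5, 0.7, 0.8, 0.5]
TFV = [0, 3, 3, 0, 0, 8, 10]   # "personalization" counts per tag class
print(round(tag_weight(CIV, TFV), 1))  # 16.8 (to be multiplied by IDF)
```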
Web-IR - HITS (Hyperlink-Induced Topic Search)

• Link-based approach
• Improves search performance by considering Web document links
• Works on an initial set of retrieved documents
• Hubs and authorities
  – A good authority page is one that is pointed to by many good hub pages
  – A good hub page is one that points to many good authority pages
  – Circular definition → iterative computation
Web-IR - HITS (Hyperlink-Induced Topic Search)

• N=1
  – A = [0.371 0.557 0.743]
  – H = [0.667 0.667 0.333]
• N=10
  – A = [0.344 0.573 0.744]
  – H = [0.722 0.619 0.309]
• N=1000
  – A = [0.328 0.591 0.737]
  – H = [0.737 0.591 0.328]

[Figure: link graph over three pages Doc0, Doc1, Doc2]
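A minimal power-iteration sketch of HITS. The original figure is lost; the edge list below (Doc0→{Doc1,Doc2}, Doc1→{Doc0,Doc2}, Doc2→{Doc1}) is inferred from the scores above, which it reproduces.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def hits(edges, n, iterations):
    auth = [1.0] * n
    hub = [1.0] * n
    for _ in range(iterations):
        # a good hub points to many good authorities
        hub = normalize([sum(auth[d] for s, d in edges if s == i) for i in range(n)])
        # a good authority is pointed to by many good hubs
        auth = normalize([sum(hub[s] for s, d in edges if d == j) for j in range(n)])
    return auth, hub

edges = [(0, 1), (0, 2), (1, 0), (1, 2), (2, 1)]
auth, hub = hits(edges, 3, 1)
print([round(x, 3) for x in auth])  # [0.371, 0.557, 0.743]
print([round(x, 3) for x in hub])   # [0.667, 0.667, 0.333]
```

Iterating further converges to the principal eigenvectors, matching the N=1000 row above.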
Web-IR - PageRank

• Google
• Unlike HITS
  – Not limited to a specific initial retrieved set of documents
  – Produces a single score per page
• Initial state = no link information
• Evenly divide the score among the 4 documents
• PR(A) = PR(B) = PR(C) = PR(D) = 1/4 = 0.25

[Figure: four-page link graph A, B, C, D]
Web-IR - PageRank

• PR(B) to A = 0.25/2 = 0.125
• PR(B) to C = 0.25/2 = 0.125
• PR(A) = PR(B)/2 + PR(C) + PR(D) = 0.125 + 0.25 + 0.25 = 0.625
Web-IR - PageRank

• PR(D) to A = 0.25/3 = 0.083
• PR(D) to B = 0.25/3 = 0.083
• PR(D) to C = 0.25/3 = 0.083
• PR(A) = PR(B)/2 + PR(C) + PR(D)/3 = 0.125 + 0.25 + 0.083 = 0.458
• Recursively keep calculating through further documents linking to A, B, C, and D
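One pass of this simplified PageRank (no damping factor), with the link structure implied by the slides (B→A,C; C→A; D→A,B,C; the original figure is lost, so this graph is reconstructed from the arithmetic):

```python
def pagerank_pass(out_links, ranks):
    # each page divides its current rank evenly among its out-links
    new = {p: 0.0 for p in ranks}
    for page, targets in out_links.items():
        for t in targets:
            new[t] += ranks[page] / len(targets)
    return new

out_links = {"B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
ranks = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
new = pagerank_pass(out_links, ranks)
print(round(new["A"], 3))  # 0.458
```

Repeating the pass until the scores stop changing gives the fixed point that full PageRank (with a damping factor) approximates.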
Concept-based document modeling - LSI (Latent Semantic Indexing)

• Represents documents by concepts
  – Not by terms
• Reduces the term space → concept space
  – Linear algebra technique: SVD (Singular Value Decomposition)
• Step (1): Matrix decomposition - the original document matrix A is factored into three matrices
Concept-based document modeling - LSI

• Step (2): A rank k is selected from the original equation (k = reduced number of concept dimensions)
• Step (3): The original term-document matrix A is converted to Ak
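Written out in standard SVD notation (the transcript omits the equations), the three steps are:

```latex
A \;=\; U \,\Sigma\, V^{T}
\qquad\Longrightarrow\qquad
A_k \;=\; U_k \,\Sigma_k\, V_k^{T}
```

U_k and V_k keep only the first k columns and Σ_k the k largest singular values; A_k is the best rank-k approximation of A, and the columns of Σ_k V_kᵀ give the document coordinates in the k-dimensional concept space.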
Concept-based document modeling - LSI

• Final Ak
  – Columns: documents
  – Rows: concepts (k = 2)

[Figure: documents d1, d2, d3 plotted in the 2-dimensional concept space given by VT of the SVD]
Extracting Semantics

• Semantic indexing based on matching documents to external models
  – Wikipedia / DBpedia
  – WordNet, NELL
  – Ontologies
• Semantic classification
  – Yahoo.com, dmoz.org
  – Topic collections
AI-based approaches - Artificial Neural Networks

• Query, terms, documents → separated into 3 layers
• Term-document weight = normalized TF-IDF
• Query → term activation → if the sum of the signals exceeds a threshold → document retrieval
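A minimal sketch of this three-layer spreading-activation idea; the term-document weights and the threshold below are hypothetical, chosen only to illustrate the mechanism.

```python
# term layer -> document layer: {term: {document: normalized TF-IDF weight}}
weights = {
    "personalization": {"doc1": 0.8, "doc2": 0.1},
    "web":             {"doc1": 0.3, "doc2": 0.6},
}

def retrieve(query_terms, threshold=0.5):
    # the query activates its terms; activation spreads along weighted links
    signal = {}
    for term in query_terms:
        for doc, w in weights.get(term, {}).items():
            signal[doc] = signal.get(doc, 0.0) + w
    # documents whose summed signal exceeds the threshold are retrieved
    return sorted(d for d, s in signal.items() if s > threshold)

print(retrieve(["personalization", "web"]))  # ['doc1', 'doc2']
print(retrieve(["web"]))                     # ['doc2']
```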