Mark Sanderson, University of Sheffield
University of Sheffield; CIIR, University of Massachusetts
Deriving concept hierarchies from text
Mark Sanderson, Bruce Croft
The question is...
• What paper already presented at this SIGIR is most like the one you’re about to see?
• We’ll have the answer, right after this!
Concept hierarchies from documents?
• Hierarchy of concepts, as in Yahoo
  • General down to specific
  • Child under one or more parents
• No training data
• Why?
  • Understandable
Current methods
• Polythetic clustering
[Table: documents D1 to D4 (rows) against the terms Battery, California, Technology, Mile, State (columns); an X marks a term occurring in a document]
An alternative?
• Monothetic clustering
  • Clusters based on a single feature
  • More ‘Yahoo/Dewey decimal’ like?
  • Easier to understand?
    » Preferable to users?
• What about hierarchies of clusters?
[Table: documents D1 to D4 (rows) against the terms Battery, California, Technology, Mile, State (columns); an X marks a term occurring in a document]
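To make the monothetic idea concrete, here is a minimal sketch in which each cluster is defined by a single term: the cluster for a term is just the set of documents containing it. The document names and term assignments are invented for illustration and are not the contents of the slide's table.

```python
from collections import defaultdict

# Hypothetical toy collection: document id -> terms it contains.
# These assignments are made up; they do not reproduce the slide's table.
docs = {
    "doc1": {"battery", "california", "technology"},
    "doc2": {"battery", "mile", "state"},
    "doc3": {"california", "technology"},
    "doc4": {"mile", "state"},
}

def monothetic_clusters(docs):
    """Group documents by single terms: one cluster per term."""
    clusters = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            clusters[term].add(doc_id)
    return dict(clusters)

clusters = monothetic_clusters(docs)
for term in sorted(clusters):
    print(term, sorted(clusters[term]))
```

Membership in such a cluster is explained by one feature (the term), which is what makes the cluster easy to label; a polythetic cluster, by contrast, groups documents on overall similarity across many terms.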
How to arrange cluster terms?
• Existing techniques
  • WordNet
    » earthquake, volcano (eruption?)
  • Key phrases (Hearst 1998)
    » “such as”, “especially”
  • Phrase classification (Grefenstette 1997)
    » NP head or modifier: “types of research” from “research things”
  • Hierarchical phrase analysis (Woods 1997)
    » Head modifier again: “car washing” under “washing”, not “car”
WordNet (aside)
• 1 sense of earthquake; sense 1:
  • earthquake, quake, temblor, seism -- (shaking and vibration at the surface of the earth resulting from underground movement along a fault plane or from volcanic activity)
    » geological phenomenon -- (a natural phenomenon involving the structure or composition of the earth)
    » natural phenomenon, nature -- (all non-artificial phenomena)
    » phenomenon -- (any state or process known through the senses rather than by intuition or reasoning)
WordNet (aside)
• 5 senses of eruption; sense 1:
  • volcanic eruption, eruption -- (the sudden occurrence of a violent discharge of steam and volcanic material)
    » discharge -- (the sudden giving off of energy)
    » happening, occurrence, natural event -- (an event that happens)
    » event -- (something that happens at a given place and time)
Start with something simpler?
• Term clustering?
  • Simple monothetic clusters
  • No ordering.
Use subsumption
• Initially using subsumption
  • Finds related terms
  • Decides which is more general and which is more specific (idf?)
• Strict interpretation
  • x subsumes y iff P(x|y) = 1 and P(y|x) < 1
• In practice
  • x subsumes y iff P(x|y) > 0.8 and P(y|x) < 1
  • P(x|y) > 0.8 and P(y|x) < P(x|y)
[Figure: Venn diagram of the documents containing x, the documents containing y, and their overlap xy]
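The relaxed test on this slide can be sketched directly from document sets. This is a minimal illustration, not the authors' implementation: it assumes each term is represented by the set of documents it occurs in and estimates P(x|y) as the fraction of y's documents that also contain x. The function name, the example terms and the document ids are all hypothetical.

```python
def subsumes(docs_x, docs_y, threshold=0.8):
    """Relaxed subsumption test: P(x|y) > threshold and P(y|x) < P(x|y).

    docs_x and docs_y are the sets of documents containing x and y.
    """
    both = len(docs_x & docs_y)
    if both == 0:
        return False  # the terms never co-occur, so they are unrelated
    p_x_given_y = both / len(docs_y)
    p_y_given_x = both / len(docs_x)
    return p_x_given_y > threshold and p_y_given_x < p_x_given_y

# Hypothetical document sets: "polio" is frequent, "vaccine" is rarer
# and occurs almost only alongside "polio".
polio_docs = {f"d{i}" for i in range(1, 11)}            # d1 .. d10
vaccine_docs = {"d1", "d2", "d3", "d4", "d5", "d11"}    # 5 of 6 overlap

print(subsumes(polio_docs, vaccine_docs))   # polio as parent of vaccine
print(subsumes(vaccine_docs, polio_docs))   # the reverse direction
```

Because P(y|x) must be smaller than P(x|y), the more widely occurring term ends up as the parent, which is how the test also decides which term is more general.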
How to build a “hierarchy”
• X subsumes Y
• X subsumes Z
• X subsumes M
• X subsumes N
• Y subsumes Z
• A subsumes B
• A subsumes Z
• B subsumes Z
[Figure: graph over the nodes X, Y, Z, M, N, A, B]
Really it’s a DAG.
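One way to turn the pairs listed on this slide into a graph is sketched below. It assumes that a direct link is dropped whenever the child is already reachable through another child (so X to Z is dropped because X reaches Z via Y); that pruning step is an assumption of this sketch, and the helper names are hypothetical.

```python
from collections import defaultdict

# (parent, child) pairs as listed on the slide.
pairs = [("X", "Y"), ("X", "Z"), ("X", "M"), ("X", "N"),
         ("Y", "Z"), ("A", "B"), ("A", "Z"), ("B", "Z")]

def build_graph(pairs):
    """Map each parent to the set of terms it subsumes."""
    children = defaultdict(set)
    for parent, child in pairs:
        children[parent].add(child)
    return children

def descendants(children, node):
    """All nodes reachable from node by following parent -> child links."""
    seen, stack = set(), list(children.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, ()))
    return seen

def prune_transitive(children):
    """Drop parent -> child when child is reachable via another child."""
    pruned = {parent: set(kids) for parent, kids in children.items()}
    for parent, kids in children.items():
        for child in kids:
            for other in kids:
                if other != child and child in descendants(children, other):
                    pruned[parent].discard(child)
    return pruned

dag = prune_transitive(build_graph(pairs))
for parent in sorted(dag):
    print(parent, "->", sorted(dag[parent]))
```

After pruning, Z still has two parents (Y and B), which is why the structure is a DAG rather than a tree.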
How to display it?
• DAGs were big
  • Unlikely to get all on screen
  • Only want to see current focus plus the route taken to get there?
• Use a method users are familiar with
  • Hierarchical menus
[Figure: the graph redrawn as a menu tree over X, Y, Z, M, N, A, B, with Z appearing twice]
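Showing a DAG as hierarchical menus means a node with several parents is simply repeated under each of them. The sketch below illustrates that with an indented text outline; the graph literal is a hypothetical, already-pruned version of the example nodes, and `menu_lines` is an illustrative helper rather than anything from the talk.

```python
# Hypothetical parent -> children mapping for the example nodes.
dag = {"X": {"Y", "M", "N"}, "Y": {"Z"}, "A": {"B"}, "B": {"Z"}}

def menu_lines(children, nodes, depth=0):
    """Expand the DAG into an indented outline, one line per menu entry."""
    lines = []
    for node in sorted(nodes):
        lines.append("  " * depth + node)
        lines.extend(menu_lines(children, children.get(node, ()), depth + 1))
    return lines

# Top-level menu entries are the nodes that are nobody's child.
all_children = set().union(*dag.values())
roots = [node for node in dag if node not in all_children]

lines = menu_lines(dag, roots)
print("\n".join(lines))
```

Z is listed once under Y and once under B, so a user only ever sees the current focus and the path taken to reach it, never the whole graph.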
What about ambiguity?
• Monothetic clusters of ambiguous terms?
• Derive hierarchy from retrieved documents
  • Take a query and retrieve on it,
  • take top 500 documents,
  • build hierarchy from them.
• Topics/concepts are words/phrases taken from
  • Query
  • Retrieved documents
  • Comparison of frequencies
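The "comparison of frequencies" step could look something like the sketch below: keep a term when it occurs in a markedly larger share of the retrieved documents than of the collection as a whole. The ratio test, the `min_ratio` value and the toy documents are all assumptions made for illustration; the slides do not give the exact criterion.

```python
from collections import Counter

def document_frequencies(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for terms in docs:
        df.update(set(terms))
    return df

def select_terms(retrieved, collection, min_ratio=2.0):
    """Keep terms over-represented in the retrieved set (assumed criterion)."""
    df_retrieved = document_frequencies(retrieved)
    df_collection = document_frequencies(collection)
    selected = []
    for term, count in df_retrieved.items():
        rate_retrieved = count / len(retrieved)
        rate_collection = df_collection[term] / len(collection)
        if rate_collection > 0 and rate_retrieved / rate_collection >= min_ratio:
            selected.append(term)
    return sorted(selected)

# Hypothetical toy collection; the first two documents play the role of
# the top-ranked documents retrieved for a query.
collection = [
    ["polio", "vaccine", "the"],
    ["polio", "the", "syndrome"],
    ["the", "market"],
    ["the", "election"],
    ["the", "market", "bank"],
    ["the", "weather"],
]
retrieved = collection[:2]

print(select_terms(retrieved, collection))
```

A word such as "the" is as common in the collection as in the retrieved set, so it is filtered out, while topic-specific words survive as candidate concepts.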
Poliomyelitis and Post-Polio (TREC topic 302)
Did you guess the paper?
• Bit like Peter Anick’s work?
Experiment
• Test properties of hierarchy
• Does it mimic (in some way) Yahoo-like categories?
  • Parent related to child?
  • Parent more general than child?
Experimental set-up
• Gathered eight subjects
  • Presented subsumption categories and ‘random’ categories.
  • Asked whether each parent/child pair was ‘interesting’.
    » If yes, what type of relationship is it? Types taken (roughly) from WordNet:
    » Aspect of
    » Type of
    » Same as
    » Opposite of
    » Don’t know
Results
• Was the parent/child pairing ‘interesting’ or not?
  • Random, 51%
  • Subsumption, 67%
  • Difference significant from t-test, p<0.002
• If interesting, what is the parent/child relationship type?
[Chart annotation: “Odd?”]
Yahoo categories?
Results and conclusions
• Interesting AND (aspect of OR type of)
  • Random, 28% (51% * (47% + 8%))
  • Subsumption, 48% (67% * (49% + 23%))
• It appears that subsumption, with an ordering based on document frequency, does a reasonable job.
  • For work on term frequency, see:
    » Sparck Jones, K. (1972) A statistical interpretation of term specificity and its application in retrieval, in Journal of Documentation, 28(1): 11-21
    » Caraballo, S.A., Charniak, E. (1999) Determining the specificity of nouns from text, in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP):
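The combined figures on this slide are the share of pairs judged interesting multiplied by the share of those given an "aspect of" or "type of" label. A short check of that arithmetic, using only the percentages quoted on the slide (the function name is illustrative):

```python
def interesting_and_hierarchical(interesting, aspect_of, type_of):
    """Share of all pairs that are interesting AND (aspect of OR type of)."""
    return interesting * (aspect_of + type_of)

random_rate = interesting_and_hierarchical(0.51, 0.47, 0.08)
subsumption_rate = interesting_and_hierarchical(0.67, 0.49, 0.23)

print(f"Random: {random_rate:.0%}")
print(f"Subsumption: {subsumption_rate:.0%}")
```

The products come to roughly 0.28 and 0.48, matching the 28% and 48% quoted on the slide.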
Future work?
• More user studies.
• Incorporate other term relationship techniques.
• Other visualisations.
• Application of techniques to whole document collections.
• Presentation of Cross Language IR results?