Detection of Relations in Textual Documents
Manuela Kunze,
Dietmar Rösner
University of Magdeburg Knowledge Based Systems and Document Processing
Kunze, Rösner: Detection of Relations in Textual Documents 2
Introduction
http://en.wikipedia.org/wiki/Unsupervised_learning
Introduction
• to extract information from text, you can use techniques like simple pattern matching etc.
• additional knowledge is required:
  – `Thursday': a day of the week
  – meaning of
    • (implicit) `open' vs. `close'
    • `Pay-what-you-wish'
• text understanding / techniques of NLP
• `Exhibition of over 30 color photographs and stories of life in China's Yunnan Province …'
Introduction
ontologies contain information about:
• definition/description of concepts and
• description of instances
• kind of relation (name, type),
  – definition of domain and range values,
  – characteristic of the relation: cardinality, transitivity, ...
Natural Language Processing
• NLP techniques:
  – case frame analysis
  – exploiting syntactic structures
  – corpus-based IE for an initial ontology
• corpus:
  – autopsy protocols (400 protocols)
  – different document parts:
    • findings
    • histological findings
    • background
    • discussion
    • …
  – short linguistic structures
  – typical attribute-value structures
Overview
Case Frame
Analysis of Specific Syntactic Structures
Discussion/Conclusion
Case Frames
• resources:
  – results from syntactic parser
    <NP TYPE="COMPLEX" RULE="NPC3" GEN="MAS" NUM="SG" CAS="NOM">
      <NP TYPE="FULL" RULE="NP1" CAS="NOM" NUM="SG" GEN="MAS">
        <N>Flachschnitt</N>
      </NP>
      <PP RULE="PP1" CAS="AKK">
        <PRP CAS="AKK">in</PRP>
        <NP TYPE="FULL" RULE="NP2" CAS="AKK" NUM="SG" GEN="NTR">
          <DETD>das</DETD>
          <N>Zungengewebe</N>
        </NP>
      </PP>
    </NP>
  – results from semantic tagger
  – description of case frames
Case Frames
• (corpus-based) definition of roles for a concept
  – `Flachschnitt' (flat cut)
    • `location'
      – sem. category: `tissue'
      – PP, case of NP: accusative, preposition: `in'
  – `Herausschleudern' (skidding)
    • `patient'
      – sem. category: `body-hum'
      – NP; case of NP: genitive
    • `location'
      – sem. category: `vehicle'
      – PP, case of NP: dative, preposition: `aus'
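The role definitions above can be read as constraints that a phrase must satisfy before it fills a role. The following is a minimal sketch of that idea, not the authors' implementation: the names `Role`, `Phrase`, `CASE_FRAMES` and `fill_roles` are hypothetical, and the sketch assumes each phrase already carries the syntactic parser's and semantic tagger's results.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Role:
    name: str                          # role / relation name, e.g. "location"
    sem_category: str                  # required semantic category of the filler
    phrase_type: str                   # "PP" or "NP"
    case: str                          # grammatical case of the NP
    preposition: Optional[str] = None  # only set for PP roles


@dataclass(frozen=True)
class Phrase:
    text: str
    phrase_type: str
    case: str
    sem_category: str
    preposition: Optional[str] = None


# Role definitions taken from the slide; the dictionary layout is an assumption.
CASE_FRAMES = {
    "Flachschnitt": [Role("location", "tissue", "PP", "akk", "in")],
    "Herausschleudern": [
        Role("patient", "body-hum", "NP", "gen"),
        Role("location", "vehicle", "PP", "dat", "aus"),
    ],
}


def fill_roles(head, phrases):
    """Return {role name: phrase text} for each role of `head` that some phrase satisfies."""
    filled = {}
    for role in CASE_FRAMES.get(head, []):
        for p in phrases:
            if (p.phrase_type == role.phrase_type
                    and p.case == role.case
                    and p.sem_category == role.sem_category
                    and p.preposition == role.preposition):
                filled[role.name] = p.text
                break  # take the first matching phrase
    return filled


example = fill_roles(
    "Flachschnitt",
    [Phrase("in das Zungengewebe", "PP", "akk", "tissue", "in")],
)
```

Roles without a matching phrase are simply left out of the result, so a partially filled frame is still returned.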
Case Frames
…
<CONCEPT TYPE="medicalOperation">
  <WORD>Flachschnitt</WORD>
  <DESC>medizinischer Schnitt</DESC>
  <SLOTS>
    <RELATION TYPE="LOCATION">
      <ASSIGN_TO>TISSUE</ASSIGN_TO>
      <FORM>P(akk, fak, in)</FORM>
      <CONTENT>in das Zungengewebe</CONTENT>
    </RELATION>
  </SLOTS>
</CONCEPT>
<CONCEPT TYPE="traffic-event">
  <WORD>Herausschleudern</WORD>
  <DESC>event</DESC>
  <SLOTS>
    <RELATION TYPE="PATIENT">
      <ASSIGN_TO>BODY-HUM</ASSIGN_TO>
      <FORM>N(gen, fak)</FORM>
      <CONTENT>des Koerpers</CONTENT>
    </RELATION>
    <RELATION TYPE="LOCATION">
      <ASSIGN_TO>VEHICLE</ASSIGN_TO>
      <FORM>P(dat, fak, aus)</FORM>
      <CONTENT></CONTENT>
    </RELATION>
  </SLOTS>
</CONCEPT>
…
Case Frames
• coverage of phrases like `fracture of elbow joint'?
• abstraction
  – `fracture' (sem. category: `trauma')
    • role `patient': sem. category: `bone'
  – `bruise' (sem. category: `trauma')
    • role `patient': sem. category: `organ'
  – `hematoma' (sem. category: `trauma')
    • role `patient': sem. category: `tissue'
• concept x (sem. category: `trauma')
  – role `patient': sem. category: `body-part'
Case Frames
• results:
  – relations are defined by the case frame
    • name/type of relation
    • domain, range
  – corpus-based abstractions:
    • redefinition of semantic restriction
      – use the least general hypernym as semantic restriction
• not yet extracted:
  – information about the characteristic of a relation
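The abstraction step (bone, organ and tissue generalised to body-part) can be sketched as a search for the most specific category that subsumes all observed role fillers. The hierarchy fragment below is an assumption made for illustration, including the top node `physical-object`; the function names are hypothetical.

```python
# Hypothetical fragment of a semantic-category hierarchy (child -> parent).
PARENT = {
    "bone": "body-part",
    "organ": "body-part",
    "tissue": "body-part",
    "body-part": "physical-object",  # assumed top node, not from the slides
}


def ancestors(category):
    """Return [category, parent, grandparent, ...] up to the root."""
    chain = [category]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def least_general_hypernym(categories):
    """Most specific category that subsumes all given categories, or None."""
    categories = list(categories)
    if not categories:
        return None
    common = set(ancestors(categories[0]))
    for cat in categories[1:]:
        common &= set(ancestors(cat))
    # Walk upwards from the first category: the first shared node is the most specific one.
    for node in ancestors(categories[0]):
        if node in common:
            return node
    return None


# Semantic categories observed for the role `patient` of trauma concepts.
observed = {"fracture": "bone", "bruise": "organ", "hematoma": "tissue"}
restriction = least_general_hypernym(observed.values())
```

With this fragment, `restriction` is `body-part`, which matches the generalised role on the slide.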
Overview
Case Frame
Analysis of Specific Syntactic Structures
Discussion/Conclusion
Analysis of Specific Syntactic Structures
• from general to specific information
• resources:
  – results from syntactic parser
  – results from semantic tagger
  – description of interpretation of syntactic structures
• Which word class can be interpreted as concept/instance?
• Which word class describes a relation?
  – adjective in an NP: describes the noun in the NP; relation `prop'
  – negations: negate concepts, verbs, or properties of a concept
  – particle: modification of adjectives
Analysis of Specific Syntactic Structures
CLMed N ADJ
prop(N, ADJ)
N interpreted as concept
ADJ interpreted as concept
results:
prop_catadj(N,ADJ)
Analysis of Specific Syntactic Structures
`liver tissue bloodless'
Steps:
• nouns and adjectives are interpreted as concept/instance
  – liver tissue: instance of concept liver_tissue (subclass of tissue)
  – bloodless: instance of concept bloodless (subclass of blood-concentration)
• adjectives describe a relation
  – in general: `prop'
  – here: prop_blood-concentration
[Figure: graph for the example phrase; legend: concept, instance, relation]
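The refinement from the general relation `prop` to a category-specific relation can be sketched as follows. This is a sketch under stated assumptions: `RelationDef`, `SEM_CATEGORY` and `refine_prop` are hypothetical names, and the category lookup stands in for the semantic tagger.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationDef:
    name: str    # relation name
    domain: str  # semantic category of the noun
    range: str   # semantic category of the adjective


# Stand-in for the semantic tagger's results on the example phrase.
SEM_CATEGORY = {"liver_tissue": "tissue", "bloodless": "blood-concentration"}


def refine_prop(noun, adjective, categories=SEM_CATEGORY):
    """Specialise prop(N, ADJ) to prop_<category of ADJ> with domain and range."""
    noun_cat = categories[noun]
    adj_cat = categories[adjective]
    return RelationDef(name="prop_" + adj_cat, domain=noun_cat, range=adj_cat)


relation = refine_prop("liver_tissue", "bloodless")
```

For the example phrase this gives the relation `prop_blood-concentration` with domain `tissue` and range `blood-concentration`.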
Analysis of Specific Syntactic Structures
`liver tissue bloodless'
…
<owl:Class rdf:ID="lebergewebe">
  <rdfs:subClassOf>
    <owl:Class rdf:ID="tissue"/>
  </rdfs:subClassOf>
</owl:Class>
<owl:Class rdf:ID="blood-concentration"/>
<owl:Class rdf:ID="blutleer">
  <rdfs:subClassOf rdf:resource="#blood-concentration"/>
</owl:Class>
<owl:ObjectProperty rdf:ID="prop_blood-concentration">
  <rdfs:domain rdf:resource="#tissue"/>
  <rdfs:range rdf:resource="#blood-concentration"/>
</owl:ObjectProperty>
<lebergewebe rdf:ID="Lebergewebe_6">
  <prop_blood-concentration>
    <blutleer rdf:ID="blutleer_7"/>
  </prop_blood-concentration>
</lebergewebe>
…
Analysis of Specific Syntactic Structures
"kaum wahrnehmbare Unterblutungen" (Engl. "hardly detectable hematomas")
results of syntactic parser:
<NP TYPE="FULL" RULE="NP4" CAS="_" NUM="PL" GEN="FEM">
  <ADJP RULE="ADJP1">
    <ADV>kaum</ADV>
    <ADJ>wahrnehmbare</ADJ>
  </ADJP>
  <N>Unterblutungen</N>
</NP>
results of semantic tagger:
  – `kaum': weak-graduation
  – `wahrnehmbar': unknown token
  – `Unterblutung': trauma
resources for interpretation:
• N: concept/instance
• ADJ:
  – concept/instance
  – rel: prop (adjective specifies noun)
• ADV:
  – concept/instance
  – rel: mod (adverb specifies adjective)
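The interpretation rules above (adjective specifies noun, adverb specifies adjective) can be applied directly to the parser output. The sketch below uses Python's standard `xml.etree.ElementTree`; the function names, the token-keyed lookup table and the fallback category `unspecified` for unknown tokens are assumptions made for illustration, and lemmatisation is skipped.

```python
import xml.etree.ElementTree as ET

PARSER_OUTPUT = """
<NP TYPE="FULL" RULE="NP4" CAS="_" NUM="PL" GEN="FEM">
  <ADJP RULE="ADJP1">
    <ADV>kaum</ADV>
    <ADJ>wahrnehmbare</ADJ>
  </ADJP>
  <N>Unterblutungen</N>
</NP>
"""

# Stand-in for the semantic tagger, keyed by surface token for simplicity.
# `wahrnehmbare` is deliberately missing: it is an unknown token.
SEM_CATEGORY = {"kaum": "weak-graduation", "Unterblutungen": "trauma"}


def category(token):
    """Semantic category of a token; unknown tokens fall back to `unspecified`."""
    return SEM_CATEGORY.get(token, "unspecified")


def relations_from_np(xml_text):
    """Return (relation, subject, object) triples for one parsed NP."""
    np = ET.fromstring(xml_text)
    noun = np.findtext("N")
    triples = []
    for adjp in np.findall("ADJP"):
        adj = adjp.findtext("ADJ")
        if noun is None or adj is None:
            continue
        # adjective specifies noun: `prop`, refined by the adjective's category
        triples.append(("prop_" + category(adj), noun, adj))
        for adv in adjp.findall("ADV"):
            # adverb specifies adjective: `mod`, refined by the adverb's category
            triples.append(("mod_" + category(adv.text), adj, adv.text))
    return triples


triples = relations_from_np(PARSER_OUTPUT)
```

For the example NP this yields one `prop_unspecified` triple between noun and adjective and one `mod_weak-graduation` triple between adjective and adverb.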
Analysis of Specific Syntactic Structures
`hardly detectable hematomas'
Steps:
• nouns, adjectives and adverbs are interpreted as concept/instance
  – hematoma: instance of concept hematoma (subclass of trauma)
  – detectable: instance of concept detectable (subclass of unspecified)
  – hardly: instance of concept hardly (subclass of weak-graduation)
• adjectives and adverbs describe relations
  – prop_unspecified
  – mod_weak-graduation
[Figure: graph for the example phrase; legend: concept, instance, relation]
Analysis of Specific Syntactic Structures
`hardly detectable hematomas'
<owl:Class rdf:ID="unterblutung">
  <rdfs:subClassOf rdf:resource="#trauma"/>
</owl:Class>
<owl:Class rdf:ID="trauma"/>
<owl:Class rdf:ID="wahrnehmbar">
  <rdfs:subClassOf rdf:resource="#unspecified"/>
</owl:Class>
<owl:Class rdf:ID="unspecified"/>
<owl:Class rdf:ID="kaum">
  <rdfs:subClassOf rdf:resource="#weak-graduation"/>
</owl:Class>
<owl:Class rdf:ID="weak-graduation"/>
Analysis of Specific Syntactic Structures
`hardly detectable hematomas'
<owl:ObjectProperty rdf:ID="mod_weak-graduation">
  <rdfs:domain rdf:resource="#unspecified"/>
  <rdfs:range rdf:resource="#weak-graduation"/>
</owl:ObjectProperty>
<owl:ObjectProperty rdf:ID="prop_unspecified">
  <rdfs:domain rdf:resource="#trauma"/>
  <rdfs:range rdf:resource="#unspecified"/>
</owl:ObjectProperty>
<unterblutung rdf:ID="Unterblutungen_5">
  <prop_unspecified rdf:resource="#wahrnehmbare_4"/>
</unterblutung>
<wahrnehmbar rdf:ID="wahrnehmbare_4">
  <mod_weak-graduation rdf:resource="#kaum_3"/>
</wahrnehmbar>
<kaum rdf:ID="kaum_3"></kaum>
Analysis of Specific Syntactic Structures
[Figure: ontology graph; legend: concept, instance, relation]
Protégé Plugin for Visualization: Ontoviz
Phrases like:
• NP NP NP
• NP N Adj Conj Adj
• NP N conj N Adj
• …
Analysis of Specific Syntactic Structures
• results
  – definition of concepts/instances
  – corpus-based definition/concretion of relations:
    • prop → prop_catADJ
    • information about domain, range
• not extracted:
  – information about the characteristic of a relation
Overview
Case Frame
Analysis of Specific Syntactic Structures
Discussion/Conclusion
Conclusion
• NLP techniques for extraction of information
  – analyse syntactic structures
  – information about semantic categories
  – result: corpus-based description of an initial ontology
• case frame analysis
  – relations are described in the case frame
  – disadvantage: creation of case frames
  – advantage: a definition of the relation
• analysis of specific syntactic structures
  – a general interpretation of tokens and the syntactic structures
  – redefined by results from the semantic tagger
  – disadvantage: in some cases, only the general relation definition is delivered
  – advantage: less effort to describe the resources
Conclusion
• no information about the characteristic of a relation (cardinality, …)
• solutions
  – analyse occurrences in the corpus
    • corpus-based assumption about cardinality
  – integration of additional knowledge
    • initial domain specific ontology
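One possible reading of the corpus-based assumption about cardinality is to count, per relation, how many distinct fillers a single instance has in the extracted data. The sketch below illustrates this. The assertion list is made up for the example (the third entry is invented) and the function name is hypothetical; an observed maximum is only an assumption about cardinality, not a proven upper bound.

```python
from collections import defaultdict

# Illustrative (relation, subject instance, object instance) assertions.
# The third entry is invented for this example and not taken from the corpus.
ASSERTIONS = [
    ("prop_blood-concentration", "Lebergewebe_6", "blutleer_7"),
    ("prop_unspecified", "Unterblutungen_5", "wahrnehmbare_4"),
    ("prop_unspecified", "Unterblutungen_5", "wahrnehmbare_9"),
]


def observed_max_cardinality(assertions):
    """For each relation, the largest number of distinct fillers seen on one subject."""
    fillers = defaultdict(set)
    for relation, subject, obj in assertions:
        fillers[(relation, subject)].add(obj)
    result = {}
    for (relation, _subject), objs in fillers.items():
        result[relation] = max(result.get(relation, 0), len(objs))
    return result


cardinality = observed_max_cardinality(ASSERTIONS)
```

A relation whose observed maximum stays at 1 across the whole corpus would be a candidate for a functional property, subject to review against additional domain knowledge.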
Key Aspects for IE
• ‘conceptual’ preprocessing steps: names of concepts occur in different linguistic structures, e.g. compound vs. complex noun phrase (like ‘liver tissue’ and ‘tissue of liver’)
  – handle only one canonical linguistic structure as a representative for all paraphrases
• treatment of generalisation within local contexts
  – The token ‘liver’ may occur in the first sentence of a paragraph. In the following sentences of the paragraph, only the hypernym ‘organ’ is used.
• concept or instance: which term in a linguistic structure has to be interpreted as a concept, and which as an instance of a concept?
• definition of the scope for a concept:
  – a paragraph starts with a description of an organ (e.g. the organ ‘liver’ in: ‘The liver shows ... . Bloodrichness of the tissue.’); this is followed by a description of parts of the organ (e.g. ‘Gewebe’, tissue). In such cases, additional knowledge about the domain has to be employed (for example, about meronyms or holonyms)
  – tissue part-of liver vs. tissue part-of concept X
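The first key aspect, one canonical structure for all paraphrases, can be illustrated with a deliberately small normalisation rule. The sketch only covers the English gloss pattern `X of Y` from the example above. The function name and the choice of the compound as canonical form are assumptions, and real German input would need the parser output rather than a regular expression.

```python
import re

# Matches exactly "<head> of <modifier>", e.g. "tissue of liver".
_OF_PATTERN = re.compile(r"^(\w+) of (\w+)$")


def canonical_form(phrase):
    """Map `X of Y` to the compound `Y X`; leave all other phrases unchanged."""
    phrase = " ".join(phrase.lower().split())  # normalise case and whitespace
    match = _OF_PATTERN.match(phrase)
    if match:
        head, modifier = match.groups()
        return f"{modifier} {head}"
    return phrase
```

Both `canonical_form("tissue of liver")` and `canonical_form("liver tissue")` return the same string, so later steps only have to handle one representative. Phrases with articles or longer modifiers are not covered by this rule.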