
Post on 23-Dec-2015


Detection of Relations in Textual Documents

Manuela Kunze,

Dietmar Rösner

University of Magdeburg Knowledge Based Systems and Document Processing


Kunze, Rösner: Detection of Relations in Textual Documents 2

Introduction

http://en.wikipedia.org/wiki/Unsupervised_learning


Introduction

• to extract information from text, you can use techniques like simple pattern matching etc.

• additional knowledge is required:
  • 'Thursday': a day of the week
  • meaning of
    • (implicit) `open' vs. `close'
    • `Pay-what-you-wish'

• text understanding / techniques of NLP
  • `Exhibition of over 30 color photographs and stories of life in China's Yunnan Province …'


Introduction

ontologies contain information about:

• definition/description of concepts and

• description of instances

• kind of relation (name, type),
  – definition of domain and range values,
  – characteristic of the relation: cardinality, transitivity, ...


Natural Language Processing

• NLP techniques:
  – case frame analysis
  – exploiting syntactic structures
  – corpus-based IE for an initial ontology

• corpus:
  – autopsy protocols (400 protocols)
  – different document parts:
    • findings
    • histological findings
    • background
    • discussion
    • …
  – short linguistic structures
  – typical attribute-value structures


Overview

Case Frame

Analysis of Specific Syntactic Structures

Discussion/Conclusion


Case Frames

• resources:
  – results from syntactic parser

<NP TYPE="COMPLEX" RULE="NPC3" GEN="MAS" NUM="SG" CAS="NOM">
  <NP TYPE="FULL" RULE="NP1" CAS="NOM" NUM="SG" GEN="MAS">
    <N>Flachschnitt</N>
  </NP>
  <PP RULE="PP1" CAS="AKK">
    <PRP CAS="AKK">in</PRP>
    <NP TYPE="FULL" RULE="NP2" CAS="AKK" NUM="SG" GEN="NTR">
      <DETD>das</DETD>
      <N>Zungengewebe</N>
    </NP>
  </PP>
</NP>

  – results from semantic tagger
  – description of case frames


Case Frames

• (corpus-based) definition of roles for a concept
  – `Flachschnitt' (flat cut)
    • `location'
      – sem. category: `tissue'
      – PP, case of NP: accusative, preposition: `in'
  – `Herausschleudern' (skidding)
    • `patient'
      – sem. category: `body-hum'
      – NP; case of NP: genitive
    • `location'
      – sem. category: `vehicle'
      – PP, case of NP: dative, preposition: `aus'


Case Frames

…
<CONCEPT TYPE="medicalOperation">
  <WORD>Flachschnitt</WORD>
  <DESC>medizinischer Schnitt</DESC>
  <SLOTS>
    <RELATION TYPE="LOCATION">
      <ASSIGN_TO>TISSUE</ASSIGN_TO>
      <FORM>P(akk, fak, in)</FORM>
      <CONTENT>in das Zungengewebe</CONTENT>
    </RELATION>
  </SLOTS>
</CONCEPT>

<CONCEPT TYPE="traffic-event">
  <WORD>Herausschleudern</WORD>
  <DESC>event</DESC>
  <SLOTS>
    <RELATION TYPE="PATIENT">
      <ASSIGN_TO>BODY-HUM</ASSIGN_TO>
      <FORM>N(gen, fak)</FORM>
      <CONTENT>des Koerpers</CONTENT>
    </RELATION>
    <RELATION TYPE="LOCATION">
      <ASSIGN_TO>VEHICLE</ASSIGN_TO>
      <FORM>P(dat, fak, aus)</FORM>
      <CONTENT></CONTENT>
    </RELATION>
  </SLOTS>
</CONCEPT>
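A minimal sketch of how relation definitions might be read off such a case frame description. It assumes the XML layout shown on this slide; treating the `WORD` element as the domain and `ASSIGN_TO` as the range is an interpretation, and the function name `relations_from_case_frame` is hypothetical, not part of the authors' system.

```python
import xml.etree.ElementTree as ET

# Case frame copied from the slide (second concept only).
CASE_FRAME_XML = """
<CONCEPT TYPE="traffic-event">
  <WORD>Herausschleudern</WORD>
  <DESC>event</DESC>
  <SLOTS>
    <RELATION TYPE="PATIENT">
      <ASSIGN_TO>BODY-HUM</ASSIGN_TO>
      <FORM>N(gen, fak)</FORM>
      <CONTENT>des Koerpers</CONTENT>
    </RELATION>
    <RELATION TYPE="LOCATION">
      <ASSIGN_TO>VEHICLE</ASSIGN_TO>
      <FORM>P(dat, fak, aus)</FORM>
      <CONTENT></CONTENT>
    </RELATION>
  </SLOTS>
</CONCEPT>
"""


def relations_from_case_frame(xml_text):
    """Return one (relation name, domain, range, surface form) tuple per slot."""
    concept = ET.fromstring(xml_text)
    domain = concept.findtext("WORD")  # assumption: the concept word acts as domain
    result = []
    for rel in concept.iter("RELATION"):
        result.append((
            rel.get("TYPE").lower(),     # name/type of the relation
            domain,
            rel.findtext("ASSIGN_TO"),   # semantic category used as range
            rel.findtext("FORM"),        # syntactic realisation of the slot
        ))
    return result


relations = relations_from_case_frame(CASE_FRAME_XML)
for relation in relations:
    print(relation)
```

Each tuple carries the pieces an ontology relation needs (name, domain, range); the `FORM` string is kept only so the syntactic constraint stays visible.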


Case Frames

• coverage of phrases like `fracture of elbow joint'?

• abstraction
  – `fracture' (sem. category: `trauma')
    • role `patient': sem. category: `bone'
  – `bruise' (sem. category: `trauma')
    • role `patient': sem. category: `organ'
  – `hematoma' (sem. category: `trauma')
    • role `patient': sem. category: `tissue'

• concept x (sem. category: `trauma')
  – role `patient': sem. category: `body-part'
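The abstraction step can be sketched as a search for the most specific category that covers all observed role fillers. The taxonomy below is a hypothetical stand-in: only `bone`, `organ`, `tissue` and `body-part` come from the slide, and the root `entity` and the function names are invented for illustration.

```python
# Hypothetical hypernym links (child -> parent). Only the categories named
# on the slide are taken from the source; the root 'entity' is invented.
HYPERNYM = {
    "bone": "body-part",
    "organ": "body-part",
    "tissue": "body-part",
    "body-part": "entity",
}


def ancestors(category):
    """Return the category followed by its hypernyms, most specific first."""
    chain = [category]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def least_general_hypernym(categories):
    """Return the most specific category that subsumes all given ones."""
    categories = list(categories)
    common = set(ancestors(categories[0]))
    for category in categories[1:]:
        common &= set(ancestors(category))
    # Walk up from the first category; the first shared node is the most specific.
    for candidate in ancestors(categories[0]):
        if candidate in common:
            return candidate
    return None


# Role 'patient' as observed for three concepts of category 'trauma'.
observed = {"fracture": "bone", "bruise": "organ", "hematoma": "tissue"}
restriction = least_general_hypernym(observed.values())
print(restriction)
```

With this toy taxonomy the three observed restrictions generalise to `body-part`, which is the restriction the slide gives for a generic `trauma` concept.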


Case Frames

• results:
  – relations are defined by the case frame
    • name/type of relation
    • domain, range
  – corpus-based abstractions:
    • redefinition of semantic restriction
      – use the least general hypernym as semantic restriction

• not yet extracted:
  – information about the characteristic of a relation


Overview

Case Frame

Analysis of Specific Syntactic Structures

Discussion/Conclusion


Analysis of Specific Syntactic Structures

• from general to specific information
• resources:
  – results from syntactic parser
  – results from semantic tagger
  – description of interpretation of syntactic structures

• Which word class can be interpreted as concept/instance?

• Which word class describes a relation?
  – adjective in an NP: describes the noun in the NP; relation `prop'
  – negations: negate concepts, verbs, or properties of a concept
  – particle: modification of adjectives


Analysis of Specific Syntactic Structures

CLMed N ADJ

prop(N, ADJ)

N interpreted as concept

ADJ interpreted as concept

results:

prop_catadj(N,ADJ)
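A small sketch of the refinement from the general relation prop(N, ADJ) to prop_cat(N, ADJ), where cat is the semantic category of the adjective. The lexicon is a hypothetical stand-in for the semantic tagger; falling back to the category `unspecified` for unknown adjectives mirrors the `prop_unspecified` relation in the hematoma example and is otherwise an assumption.

```python
# Hypothetical semantic-tagger lexicon: adjective -> semantic category.
ADJ_CATEGORY = {"bloodless": "blood-concentration"}


def refine_prop(noun, adjective, lexicon=ADJ_CATEGORY):
    """Turn prop(N, ADJ) into prop_<category of ADJ>(N, ADJ)."""
    # Assumption: unknown adjectives get the placeholder category 'unspecified'.
    category = lexicon.get(adjective, "unspecified")
    return (f"prop_{category}", noun, adjective)


print(refine_prop("liver tissue", "bloodless"))
print(refine_prop("hematoma", "detectable"))
```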


Analysis of Specific Syntactic Structures

`liver tissue bloodless'

Steps:

• nouns and adjectives are interpreted as concept/instance

• adjectives describe a relation
  • in general: 'prop'

[Figure: graph with the nodes `tissue', `liver tissue', `liver_tissue*', `blood concentration', `bloodless' and `bloodless*', linked by the relation `prop_blood-concentration' (shown twice). Legend: concept, instance, relation.]


Analysis of Specific Syntactic Structures

`liver tissue bloodless'

<owl:Class rdf:ID="lebergewebe">
  <rdfs:subClassOf><owl:Class rdf:ID="tissue"/></rdfs:subClassOf>
</owl:Class>

<owl:Class rdf:ID="blood-concentration"/>

<owl:Class rdf:ID="blutleer">
  <rdfs:subClassOf rdf:resource="#blood-concentration"/>
</owl:Class>

<owl:ObjectProperty rdf:ID="prop_blood-concentration">
  <rdfs:domain rdf:resource="#tissue"/>
  <rdfs:range rdf:resource="#blood-concentration"/>
</owl:ObjectProperty>

<lebergewebe rdf:ID="Lebergewebe_6">
  <prop_blood-concentration><blutleer rdf:ID="blutleer_7"/></prop_blood-concentration>
</lebergewebe>
…


Analysis of Specific Syntactic Structures

"kaum wahrnehmbare Unterblutungen" (Engl. "hardly detectable hematomas")

results of syntactic parser:

<NP TYPE="FULL" RULE="NP4" CAS="_" NUM="PL" GEN="FEM">
  <ADJP RULE="ADJP1">
    <ADV>kaum</ADV>
    <ADJ>wahrnehmbare</ADJ>
  </ADJP>
  <N>Unterblutungen</N>
</NP>

results of semantic tagger:
  – `kaum': weak-graduation
  – `wahrnehmbar': unknown token
  – `Unterblutung': trauma

resources for interpretation:
  • N: concept/instance
  • ADJ:
    • concept/instance
    • rel: prop
  • ADV:
    • concept/instance
    • rel: mod

adverb specifies adjective

adjective specifies noun
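The interpretation rules above (adjective specifies noun, adverb specifies adjective) can be sketched over the parser output as follows. This is an illustration under assumptions: the tag dictionary is keyed by surface form instead of lemma, unknown tokens get the category `unspecified`, and `interpret_np` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Parser output copied from the slide.
PARSE = """
<NP TYPE="FULL" RULE="NP4" CAS="_" NUM="PL" GEN="FEM">
  <ADJP RULE="ADJP1">
    <ADV>kaum</ADV>
    <ADJ>wahrnehmbare</ADJ>
  </ADJP>
  <N>Unterblutungen</N>
</NP>
"""

# Semantic tagger results, keyed by surface form (simplification: the slide
# lists lemmas; lemmatisation is skipped here). 'wahrnehmbare' is unknown.
SEM_TAG = {"kaum": "weak-graduation", "Unterblutungen": "trauma"}


def interpret_np(xml_text, tags=SEM_TAG):
    """Derive (relation, subject, value) triples from one parsed noun phrase."""
    phrase = ET.fromstring(xml_text)
    noun = phrase.findtext("N")
    relations = []
    for adjp in phrase.findall("ADJP"):
        adjective = adjp.findtext("ADJ")
        adverb = adjp.findtext("ADV")
        if adjective is None:
            continue
        # adjective specifies noun -> relation 'prop', refined by category
        relations.append((f"prop_{tags.get(adjective, 'unspecified')}", noun, adjective))
        if adverb is not None:
            # adverb specifies adjective -> relation 'mod', refined by category
            relations.append((f"mod_{tags.get(adverb, 'unspecified')}", adjective, adverb))
    return relations


triples = interpret_np(PARSE)
for triple in triples:
    print(triple)
```

The two resulting relation names, `prop_unspecified` and `mod_weak-graduation`, are the ones that appear in the OWL output for this phrase.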


Analysis of Specific Syntactic Structures

`hardly detectable hematomas'

Steps:

• nouns, adjectives and adverbs are interpreted as concept/instance

• adjectives and adverbs describe relations

[Figure: graph with the nodes `trauma', `hematoma', `hematoma*', `unspecified', `detectable*', `weak-graduation', `hardly' and `hardly*', linked by the relations `prop_unspecified' and `mod_weak-graduation' (each shown twice). Legend: concept, instance, relation.]


Analysis of Specific Syntactic Structures

`hardly detectable hematomas'

<owl:Class rdf:ID="unterblutung">
  <rdfs:subClassOf rdf:resource="#trauma"/>
</owl:Class>

<owl:Class rdf:ID="trauma"/>

<owl:Class rdf:ID="wahrnehmbar">
  <rdfs:subClassOf rdf:resource="#unspecified"/>
</owl:Class>

<owl:Class rdf:ID="unspecified"/>

<owl:Class rdf:ID="kaum">
  <rdfs:subClassOf rdf:resource="#weak-graduation"/>
</owl:Class>

<owl:Class rdf:ID="weak-graduation"/>


Analysis of Specific Syntactic Structures

`hardly detectable hematomas'

<owl:ObjectProperty rdf:ID="mod_weak-graduation">
  <rdfs:domain rdf:resource="#unspecified"/>
  <rdfs:range rdf:resource="#weak-graduation"/>
</owl:ObjectProperty>

<owl:ObjectProperty rdf:ID="prop_unspecified">
  <rdfs:domain rdf:resource="#trauma"/>
  <rdfs:range rdf:resource="#unspecified"/>
</owl:ObjectProperty>

<unterblutung rdf:ID="Unterblutungen_5">
  <prop_unspecified rdf:resource="#wahrnehmbare_4"/>
</unterblutung>

<wahrnehmbar rdf:ID="wahrnehmbare_4">
  <mod_weak-graduation rdf:resource="#kaum_3"/>
</wahrnehmbar>

<kaum rdf:ID="kaum_3"></kaum>


Analysis of Specific Syntactic Structures

[Figure: ontology graph. Legend: concept, instance, relation.]

Protégé Plugin for Visualization: Ontoviz

Phrases like:
• NP NP NP
• NP N Adj Conj Adj
• NP N conj N Adj
• …


Analysis of Specific Syntactic Structures

• results
  – definition of concepts/instances
  – corpus-based definition/concretion of relations:
    • prop → prop_catADJ
    • information about domain, relation

• not extracted:
  – information about the characteristic of a relation


Overview

Case Frame

Analysis of Specific Syntactic Structures

Discussion/Conclusion


Conclusion

• NLP techniques for extraction of information
  – analyse syntactic structures
  – information about semantic categories
  – result: corpus-based description of an initial ontology

• case frame analysis
  – relations are described in the case frame
  – disadvantage: creation of case frames
  – advantage: a definition of the relation

• analysis of specific syntactic structures
  – a general interpretation of tokens and the syntactic structures
  – redefined by results from the semantic tagger
  – disadvantage: in some cases, only the general relation definition is delivered
  – advantage: less effort to describe the resources


Conclusion

• no information about the characteristic of a relation (cardinality, …)

• solutions
  – analyse occurrences in the corpus
    • corpus-based assumption about cardinality
  – integration of additional knowledge
    • initial domain specific ontology
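One way a corpus-based assumption about cardinality might be derived is to count, per relation, how many distinct values a single instance is linked to. The sketch below is an assumption, not the authors' procedure; the third triple and its identifier `sichtbare_9` are invented for illustration.

```python
from collections import defaultdict


def observed_max_cardinality(triples):
    """Map each relation to the largest number of distinct values seen for one subject."""
    fillers = defaultdict(set)
    for relation, subject, value in triples:
        fillers[(relation, subject)].add(value)
    result = defaultdict(int)
    for (relation, _subject), values in fillers.items():
        result[relation] = max(result[relation], len(values))
    return dict(result)


# Instance-level triples shaped like the OWL examples in the slides;
# the last one is invented to show a relation with two fillers.
corpus_triples = [
    ("prop_blood-concentration", "Lebergewebe_6", "blutleer_7"),
    ("prop_unspecified", "Unterblutungen_5", "wahrnehmbare_4"),
    ("prop_unspecified", "Unterblutungen_5", "sichtbare_9"),
]

cardinality = observed_max_cardinality(corpus_triples)
print(cardinality)
```

An observed maximum only supports an assumption: a relation never seen with more than one value in the corpus may still allow several.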


Key Aspects for IE

• ‘conceptual’ preprocessing steps: names of concepts occur in different linguistic structures; compound vs. complex noun phrase (like ‘liver tissue’ and ‘tissue of liver’)
  – handle only one canonical linguistic structure as a representative for all paraphrases

• treatment of generalisation within local contexts
  – The token ‘liver’ may occur in the first sentence of a paragraph. In the following sentences of the paragraph, only the hypernym ‘organ’ is used.

• concept or instance: which term in a linguistic structure has to be interpreted as a concept, and which as an instance of a concept

• definition of the scope for a concept:
  – a paragraph starts with a description of an organ (e.g. organ ‘liver’ in: ‘The liver shows ... . Bloodrichness of the tissue.’), followed by a description of parts of the organ (e.g., ‘Gewebe’, tissue). In such cases, additional knowledge about the domain has to be employed (for example, about meronyms or holonyms)
  – tissue part-of liver vs. tissue part-of concept X
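The first key aspect, mapping paraphrases to one canonical structure, can be illustrated with a deliberately simple rule. It only handles English two-noun phrases of the form "X of Y" and treats the compound as the canonical form; both choices are assumptions made for the example, and `canonical_form` is a hypothetical name.

```python
import re


def canonical_form(phrase):
    """Rewrite 'X of Y' as the compound 'Y X'; leave other phrases unchanged."""
    text = phrase.strip().lower()
    match = re.fullmatch(r"(\w+) of (?:the )?(\w+)", text)
    if match:
        head, modifier = match.groups()
        return f"{modifier} {head}"
    return text


print(canonical_form("tissue of liver"))
print(canonical_form("liver tissue"))
```

Both paraphrases from the slide end up as the same string, so later steps only need to handle one structure per concept name.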