Human Language Technologies for the Semantic Web

Department of Computer Science, University of Sheffield

Fabio Ciravegna and Yorick Wilks

F. Ciravegna- AKT Town Meeting April 2003

Language Technologies

• Goal
  – Building systems able to process Natural Language in its written or spoken form
• Methodology
  – Use of Language Analysis
• Technologies (examples):
  – Information Extraction from Text
  – Question Answering
  – Text Generation

HLT for Knowledge Management

• Use of HLT for Knowledge
  – Acquisition
  – Retrieval
  – Publication
• Main benefits
  – Cost reduction
  – Less time needed for KM
  – Improving knowledge accessibility
    • Accessing/Diffusing/Understanding

HLT in AKT for KM

[Diagram: HLT technologies (Text mining; Information Extraction from Text; Text Generation) placed along the KM phases acquisition, retrieval and publishing]

HLT for Semantic Web

• Use of HLT for:
  – Document annotation
  – Information integration from different sources
• Benefit
  – Reduce annotation needs
  – Retrieve and integrate dispersed information

Information Extraction

• Textual documents are pervasive (e.g. Web)
  – Contained knowledge cannot be queried, therefore cannot be
    • Used by automatic systems
    • Easily managed by humans
• IE can identify information in documents
  – e.g. to populate a database
  – e.g. to annotate documents
• Method: natural language analysis
  – Words → Information → Knowledge

IE tasks

• Tasks: Named Entities; Template Elements; Template Relations; Scenario Template

• Example text:

WASHINGTON, D.C. (October 5, 1999) - nQuest Inc. today announced that Paul Jacobs, former Vice-President of E-Commerce at SRA International, has joined the company's executive management team as president.

• Named entities: nQuest Inc.; Paul Jacobs; SRA International

• Templates:
  – Company: nQuest Inc.; Date: today; InPerson: Paul Jacobs; InRole: president
  – Company: SRA International; OutPerson: Paul Jacobs; OutRole: Vice-President of E-Commerce
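The in/out templates for the example sentence can be pictured as small records. The following is only a sketch: it fills them with one hand-written regular expression tuned to this single sentence shape, which is an assumption made for illustration and not how the IE systems in these slides work (they rely on language analysis). The names `RoleChange`, `PATTERN` and `extract_role_changes` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class RoleChange:
    """One filled template: who enters or leaves which role at which company."""
    company: str
    person: str
    role: str
    direction: str  # "in" or "out"


# Hand-written pattern for this one sentence shape only (illustrative assumption).
PATTERN = re.compile(
    r"- (?P<company>.+?) today announced that "
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+), "
    r"former (?P<out_role>.+?) at (?P<out_company>.+?), "
    r"has joined .+? as (?P<in_role>\w+)\."
)


def extract_role_changes(text: str) -> List[RoleChange]:
    """Return the 'in' and 'out' templates, or an empty list if nothing matches."""
    m = PATTERN.search(text)
    if m is None:
        return []
    return [
        RoleChange(m["company"], m["person"], m["in_role"], "in"),
        RoleChange(m["out_company"], m["person"], m["out_role"], "out"),
    ]


TEXT = (
    "WASHINGTON, D.C. (October 5, 1999) - nQuest Inc. today announced that "
    "Paul Jacobs, former Vice-President of E-Commerce at SRA International, "
    "has joined the company's executive management team as president."
)

for change in extract_role_changes(TEXT):
    print(change)
```

A single regex breaks as soon as the wording changes, which is the motivation for the rule-based and adaptive tools on the following slides.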

IE Tools @ Sheffield

• GATE:
  – General Architecture for Text Engineering
  – Used to integrate HLT modules
• ANNIE:
  – Rule-based Named Entity Recogniser
  – Download at www.gate.ac.uk
• Amilcare:
  – Adaptive IE system
  – Portable using examples
  – www.nlp.shef.ac.uk/amilcare

IE Tools @ Sheffield (2)

• Melita:
  – Annotation tool supported by adaptive IE (Amilcare)
  – Learns how to annotate
  – www.aktors.org/technologies/melita/
• Lasie
  – IE system for complex event extraction
  – Manual rule development
  – www.dcs.shef.ac.uk/research/groups/nlp/funded/lasie.html

GATE is…

• An architecture
  – A macro-level organisational picture for LE software systems.
• A framework
  – For programmers, GATE is an object-oriented class library that implements the architecture.
• A development environment
  – For language engineers, computational linguists et al., GATE is a graphical development environment bundled with a set of tools for doing e.g. Information Extraction.
• Free software (LGPL). Mature, robust software (in development since 1995).
• Comes with…
  – Some free components... ...and wrappers for other people's components
  – Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.

Some users…

At time of writing a representative fraction of GATE users includes:
• Longman Pearson publishing, UK;
• BT Exact Technologies, UK;
• Merck KGaA, Germany;
• Canon Europe, UK;
• Knight Ridder (the second biggest US news publisher);
• BBN Technologies, US;
• Sirma AI Ltd., Bulgaria;
• Resco AB, Sweden/Finland/Germany;
• Glaxo Smith Kline Plc: drug-based navigation of Medline abstracts;
• Master Foods NV: extraction of commodities events from news;
• the American National Corpus project, US;
• Imperial College, London, the University of Manchester, Queen Mary College, UMIST, the University of Karlsruhe, Vassar College, ISI / the University of Southern California and a large number of other UK, US and EU Universities;
• the Perseus Digital Library project, Tufts University, US.

GATE and Content Extraction

• ANNIE - Open-source IE system in GATE, providing modules needed for content extraction
  – Pre-processing
  – Named entity recognition
  – Coreference resolution
    • ANNIE handles proper names, pronouns, and nominals
• Easy-to-use pattern-action rule language to enable customisation and postprocessing of the IE results
• Contact Hamish Cunningham (hamish@dcs.shef.ac.uk)
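The idea of a pattern-action rule can be sketched in a few lines of Python. This is not GATE's rule language or its API: the regular expressions, the `Annotation` record and `apply_rules` are hypothetical stand-ins that only show the shape "when this pattern matches, run this action to create an annotation". The sample sentence reuses organisation names from the user list.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int
    end: int
    label: str
    text: str


def make_action(label):
    """Build an action that turns a regex match into an annotation with this label."""
    def action(match):
        return Annotation(match.start(), match.end(), label, match.group())
    return action


# Each rule pairs a pattern (left-hand side) with an action (right-hand side).
RULES = [
    # Capitalised words followed by a company suffix.
    (re.compile(r"\b(?:[A-Z][A-Za-z]* )+(?:Ltd\.|Inc\.|Plc\b)"), make_action("Organization")),
    # A tiny gazetteer-style list of places.
    (re.compile(r"\b(?:UK|US|Germany|Bulgaria)\b"), make_action("Location")),
]


def apply_rules(text, rules=RULES):
    """Run every rule over the text and return annotations in document order."""
    annotations = [action(m) for pattern, action in rules for m in pattern.finditer(text)]
    return sorted(annotations, key=lambda a: a.start)


SAMPLE = "Users include Sirma AI Ltd., Bulgaria, and Glaxo Smith Kline Plc."

for annotation in apply_rules(SAMPLE):
    print(annotation.label, "->", annotation.text)
```

Customisation then amounts to adding or editing entries in the rule list rather than changing the engine.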

Amilcare: Active annotation for the Semantic Web

• Tool for adaptive IE from Web-related texts
  – Specifically designed for document annotation
  – Trains with a limited number of examples
  – Effective on different text types
    • From free texts to rigid docs (XML, HTML, etc.)
  – Tools for:
    • Normal user
      – Able to annotate a corpus
    • Amilcare Expert
      – Able to optimise experiments
    • IE Expert
      – Able to edit rules
  – Uses ANNIE for preprocessing up to Named Entity Recognition

[Ciravegna – IJCAI 2001]

Implementation details

• 100% Java
• External Interfaces:
  – API for use from other programs
  – GUI for manual training
• Requirements:
  – 10 MB on hard disk
  – Up to 300 MB RAM
• Contact Fabio Ciravegna (fabio@dcs.shef.ac.uk)

Users

• Integrated with SW annotation tools:
  – MnM (Open Univ.)
  – Ontomat (Karlsruhe Univ.)
  – Melita (Sheffield Univ.)
• Users:
  – Merck (D)
  – ISOCO (SP)
  – Quinary (I)
  – Ontoprise (D)
  – University College Dublin (IE)
  – 2 departments of CNRS (F)
  – University of Trier (D)
  – University of Texas (Austin, USA)

Document Annotation

• Many application areas require document annotation (enrichment)
  – Knowledge Management
    • Protocol analysis in industry (Kingston 94)
    • Italian police: 100 annotators/6 pages a day each
  – Semantic Web (Staab00, Motta02, Ciravegna02)
• Annotation is generally manual
  – Expensive
  – Inefficient
  – Difficult
  – Tedious & Tiring
    • Error prone (15-30% inter-annotator disagreement)
  – Never ending

Melita

• Document annotation tool
  – Uses adaptive IE engine to support annotation
• IE System:
  – Trains while users annotate
  – Provides preliminary annotation for new documents
• Advantages
  – Annotates trivial or previously seen cases
  – Focuses slow/expensive user activity on unseen cases
  – User validates extracted information
    • Simpler & less error prone
    • Speeds up corpus annotation
  – Learns how to improve capabilities

Annotation with IE

[Diagram: the user annotates bare text; the IE system trains on the annotated corpus; the IE system annotates bare text; the two annotations are compared; the system retrains using errors, missing tags and mistakes]

Annotation with Suggestions

[Diagram: the IE system annotates bare text; the user corrects the suggestions; the system uses the corrections to retrain]
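The suggest, correct, retrain cycle can be sketched as a loop. Everything here is a toy assumption: `MemorisingTagger` simply remembers the most frequent tag per token and stands in for an adaptive IE engine, and `simulated_user` plays the human annotator with a fixed list of speaker names. Neither reflects how Melita or Amilcare are implemented.

```python
from collections import Counter, defaultdict


class MemorisingTagger:
    """Toy learner: remembers the most frequent tag seen for each token."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def train(self, tokens, tags):
        for token, tag in zip(tokens, tags):
            self.counts[token.lower()][tag] += 1

    def suggest(self, tokens):
        # Unseen tokens get the "outside" tag "O".
        return [
            self.counts[t.lower()].most_common(1)[0][0] if t.lower() in self.counts else "O"
            for t in tokens
        ]


def annotate_with_suggestions(tagger, documents, user_annotate):
    """Suggest tags, let the user correct them, retrain; return corrections per document."""
    corrections = []
    for tokens in documents:
        suggested = tagger.suggest(tokens)
        final = user_annotate(tokens, suggested)  # the user fixes wrong or missing tags
        corrections.append(sum(s != f for s, f in zip(suggested, final)))
        tagger.train(tokens, final)  # corrections feed the next round
    return corrections


SPEAKER_NAMES = {"fabio", "ciravegna", "yorick", "wilks"}


def simulated_user(tokens, suggested):
    """Stand-in for the human annotator (ignores the suggestions and gives the right tags)."""
    return ["SPEAKER" if t.lower() in SPEAKER_NAMES else "O" for t in tokens]


DOCUMENTS = [
    ["Speaker", ":", "Fabio", "Ciravegna"],
    ["Speaker", ":", "Yorick", "Wilks"],
    ["Fabio", "Ciravegna", "and", "Yorick", "Wilks"],
]

print(annotate_with_suggestions(MemorisingTagger(), DOCUMENTS, simulated_user))
```

In this toy run the number of user corrections drops to zero on the third document, because every name in it was seen earlier; that is the effect the slide describes as focusing user effort on unseen cases.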

Cooperation: is IE a Useful Support?

CMU Seminars task. Test: 250 texts (Amilcare reports the best IE results ever)

[Four plots, one per slot (Location, Speaker, Stime, Etime): Precision, Recall and F-measure (0 to 100) against the number of training examples (0 to 140)]
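For reference, the three measures plotted here can be computed as follows. This is a minimal sketch using the standard definitions with exact-match scoring; the slot fillers in the example are invented for illustration and are not taken from the CMU Seminars data.

```python
from typing import Set, Tuple


def precision_recall_f(predicted: Set[tuple], gold: Set[tuple]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and balanced F-measure over extracted items."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Invented example: (document, slot, filler) triples.
GOLD = {
    ("doc1", "speaker", "Fabio Ciravegna"),
    ("doc1", "stime", "3:30 pm"),
    ("doc1", "location", "Room 101"),
    ("doc2", "speaker", "Yorick Wilks"),
}
PREDICTED = {
    ("doc1", "speaker", "Fabio Ciravegna"),
    ("doc1", "stime", "3:30 pm"),
    ("doc2", "speaker", "Wilks"),  # partial match counts as an error here
}

p, r, f = precision_recall_f(PREDICTED, GOLD)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

Two of the three predictions are correct and two of the four gold items are found, so precision is 2/3 and recall is 1/2.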

Integrating Information

• Information is available over the Web
  – Dispersed
  – In textual format
• IE as basis for retrieval and integration of information
  – Unsupervised learning using
    • The redundancy of the web
    • Available Repositories
      – Collections of documents/data
      – Known services (e.g. databases, digital libraries, search engines)
  – to bootstrap learning and produce simple high precision IE applications

Mining Web Sites

• Extracting knowledge from CS Web sites
  – Person: Name; Position; Email/Telephone; Involvement in projects; Publications; Co-workers
• Information distributed
• Challenges
  – Retrieving information
  – Integrating Information
  – Largely unsupervised by user

Mining Web sites

[Diagram: Home Page; Search; People and Project names. Basket: Project/People name lists and hyperlinks]

• Annotates known names
• Trains on annotations to discover the HTML structure of the page
• Recovers all names and hyperlinks
• Mines the site looking for Project and People names
• Uses
  – Generic patterns
  – ANNIE
  – Citeseer for likely bigrams
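The step "annotate known names, learn the HTML structure, recover all names and hyperlinks" can be sketched with the standard library alone. This is an illustration under strong assumptions, not the system's actual wrapper induction: the "structure" learned is just the tag path under which known names occur, the page and its URLs are made up, and `PathCollector` and `recover_names` are hypothetical names.

```python
from collections import Counter
from html.parser import HTMLParser


class PathCollector(HTMLParser):
    """Records (tag path, text, href) for every non-empty text node.

    Sketch only: assumes reasonably well-formed HTML with closed tags.
    """

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags
        self.hrefs = []  # href per open tag (None unless it is an <a> with href)
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.hrefs.append(dict(attrs).get("href") if tag == "a" else None)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack:
                self.hrefs.pop()
                if self.stack.pop() == tag:
                    break

    def handle_data(self, data):
        text = data.strip()
        if text:
            href = next((h for h in reversed(self.hrefs) if h), None)
            self.nodes.append(("/".join(self.stack), text, href))


def recover_names(html, known_names):
    """Find the tag path where known names appear, then return everything on that path."""
    collector = PathCollector()
    collector.feed(html)
    votes = Counter(path for path, text, _ in collector.nodes if text in known_names)
    if not votes:
        return []
    best_path = votes.most_common(1)[0][0]
    return [(text, href) for path, text, href in collector.nodes if path == best_path]


# Made-up page; the URLs are hypothetical.
PAGE = """
<html><body>
<h1>Staff</h1>
<ul>
<li><a href="/people/fabio">Fabio Ciravegna</a></li>
<li><a href="/people/yorick">Yorick Wilks</a></li>
<li><a href="/people/hamish">Hamish Cunningham</a></li>
</ul>
<p><a href="/contact">Contact us</a></p>
</body></html>
"""

print(recover_names(PAGE, {"Fabio Ciravegna"}))
```

One known name is enough here to recover the other two list entries with their links, while the "Contact us" link is ignored because it sits under a different tag path.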

Mining Web sites

[Diagram: Home Page; Search; Projects/People Web pages]

Extracts personal data
• Addresses
• Tel number
• Email address
• …

Basket (People and Projects): Name lists and hyperlinks; Personal data

Mining Web sites

[Diagram: Home Page; Search; People Publications]

• Annotates known papers
• Trains on annotations to discover the HTML structure
• Recovers co-authoring information

Basket (People and Projects): Name lists and hyperlinks; Personal data; Co-authoring information

Paper discovery

Focus on people

User Role

• Providing:
  – A URL
  – List of services (e.g. Google)
• Train wrappers using examples
  – some examples of fillers (e.g. projects)
• If needed, correcting intermediate results

Rationale

• Large collections (e.g. Web) contain redundant information
  – Redundancy can be used to bootstrap learning
• Mining the Web for information
  – Learned patterns
• Integration of information
  – Multiple evidence
    • Different strategies with different reliability
    • Scruffy works!
  – User corrections of data if needed
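One way to picture "multiple evidence" from strategies of different reliability is a noisy-OR combination: a candidate is accepted with a confidence that grows with each independent strategy supporting it. The slides do not say how evidence is combined, so the noisy-OR rule, the independence assumption and the reliability values below are all assumptions made for illustration.

```python
from math import prod

# Hypothetical reliabilities: the chance that a strategy's proposal is correct.
RELIABILITY = {
    "generic_pattern": 0.6,
    "named_entity_recogniser": 0.8,
    "bigram_lookup": 0.5,
}


def combined_confidence(supporting_strategies, reliability=RELIABILITY):
    """Noisy-OR: one minus the chance that every supporting strategy is wrong."""
    return 1.0 - prod(1.0 - reliability[s] for s in supporting_strategies)


# A name proposed by two strategies versus a name proposed by one.
print(round(combined_confidence(["generic_pattern", "named_entity_recogniser"]), 2))
print(round(combined_confidence(["bigram_lookup"]), 2))
```

Under these assumed numbers, agreement between two imperfect strategies gives higher confidence than either alone, which is the sense in which redundancy lets simple, "scruffy" extractors reach high precision.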

Conclusion

• In AKT we are using HLT (IE) for:
  – Helping in document annotation
  – Integrating information from different sources
• Benefit:
  – Reduce annotation needs
  – Retrieve and integrate dispersed information
    • Minimum user intervention
