natural language processing within the archaeotools project


DESCRIPTION

Natural Language Processing within the Archaeotools Project. Michael Charno, Stuart Jeffrey, Julian Richards, Fabio Ciravegna, Stewart Waller, Sam Chapman and Ziqi Zhang. CAA Williamsburg, March 2009.

TRANSCRIPT

Page 1: Natural Language Processing within the Archaeotools Project

Natural Language Processing within the Archaeotools Project

Michael Charno, Stuart Jeffrey, Julian Richards, Fabio Ciravegna, Stewart Waller, Sam Chapman and Ziqi Zhang. CAA Williamsburg, March 2009

Page 2: Natural Language Processing within the Archaeotools Project

“To support research, learning and teaching with high quality and dependable digital resources.”

Page 3: Natural Language Processing within the Archaeotools Project

AHRC-EPSRC-JISC eScience research grants scheme:

AIM: To allow archaeologists to discover, share and analyse datasets and legacy publications which have hitherto been very difficult to integrate into existing digital frameworks

BUILDS UPON: Common Information Environment Enhanced Geospatial browser

PARTNERS: Natural Language Processing Research Group, Department of Computer Science, University of Sheffield

Joint Information Systems Committee

Page 4: Natural Language Processing within the Archaeotools Project

Three distinct work packages:

• Work package 1 – Advanced Faceted Classification / Geo-spatial browser – 1m+ records; 4 primary facets (What, Where, When and Media) – Reported on in CAA, Budapest.

• Work package 2 – Natural language processing / Data-mining of Grey Literature.

• Work package 3 – Data-mining of Historic Literature; plus geoXwalk

Page 5: Natural Language Processing within the Archaeotools Project

• WP1 Datasets include:
  – National Monuments Records (Scotland, Wales, England)
  – Excavation Index (EH)
  – Archive Holdings
  – Local Authority Historic Environment Records

• WP2/3 Datasets include:
  – ‘Grey’ (Gray) Literature
  – Proceedings of the Society of Antiquaries of Scotland (PSAS)

• Thesauri include:
  – Thesaurus of Monuments Types (TMT)
  – Thesaurus of Object Types
  – MIDAS Period list
  – UK Government list of administrative areas, County, District, Parish (CDP)

Not MIDAS

Page 6: Natural Language Processing within the Archaeotools Project

[Architecture diagram. Labelled components: Oracle RDBMS; MIDAS XML Record (input); XML Docs of Thesaurus (input); Information Extraction; RDF Resource; Knowledge triple store; When, Where, What ontologies as entries to faceted index; Query; User Interface]

Page 7: Natural Language Processing within the Archaeotools Project

UP TO DATE VERSION OF THIS

Page 8: Natural Language Processing within the Archaeotools Project

Three distinct work packages:

• Work package 1 – Advanced Faceted Classification / Geo-spatial browser – 1m+ records; 4 primary facets (What, Where, When and Media) – Reported on in CAA, Budapest.

• Work package 2 – Natural language processing / Data-mining of Grey Literature.

• Work package 3 – Data-mining of Historic Literature; plus geoXwalk

Page 9: Natural Language Processing within the Archaeotools Project
Page 10: Natural Language Processing within the Archaeotools Project
Page 11: Natural Language Processing within the Archaeotools Project

BARROW

BARROW BARROW

Page 12: Natural Language Processing within the Archaeotools Project

“I never said she stole my money”

The meaning changes with the word that is stressed:

• Stress on “I”: Someone else said it, but I didn’t.

• Stress on “never”: I simply didn’t ever say it.

• Stress on “said”: I might have implied it, but I never said it.

• Stress on “she”: I said someone stole it, I didn’t say it was she.

• Stress on “stole”: I just said she probably borrowed it.

• Stress on “my”: I said she stole someone else’s money.

• Stress on “money”: I said she stole something, but not my money.

Was it Bonnie or Clyde?

Page 13: Natural Language Processing within the Archaeotools Project

State-of-the-art review: approaches to rule induction

Two mainstream methodologies for rule induction:

Human handcrafted rules (rule-based system): built manually by analysing example annotations and deriving human-readable discriminative patterns.

• Easy to understand, easy to implement, effective for structured texts and simple patterns, and no need to train learning models, but…

• Not robust to less-structured texts, and deriving rules from a large number of example annotations is time consuming and difficult.

Machine-learned rules (machine learning): built automatically by analysing example annotations and converting features into numeric representations, which mathematical models consume to derive discriminative patterns that are not human-readable.

• Very robust; copes with large amounts of data and complex patterns; we only select the features, and the machine analyses the examples and induces the rules, but…

• Very sensitive to feature selection; implementation and feature tuning are difficult and take time; may not work well with few examples.
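The step of "converting features into numeric representations" can be illustrated with a one-hot encoding. This is only a minimal sketch of one common encoding, not necessarily the one used in the project; the function names and the two example feature dictionaries are made up for illustration.

```python
def build_vocabulary(feature_dicts):
    """Map every distinct (feature, value) pair to a column index."""
    pairs = sorted(
        {(name, str(value)) for features in feature_dicts
         for name, value in features.items()}
    )
    return {pair: index for index, pair in enumerate(pairs)}


def one_hot(features, vocabulary):
    """Turn one feature dictionary into a 0/1 vector over the vocabulary."""
    vector = [0] * len(vocabulary)
    for name, value in features.items():
        index = vocabulary.get((name, str(value)))
        if index is not None:  # pairs never seen in training are ignored
            vector[index] = 1
    return vector


# Hypothetical feature dictionaries for two words.
examples = [
    {"first_letter_capitalised": True, "preceded_by": "the"},
    {"first_letter_capitalised": False, "preceded_by": "a"},
]
vocabulary = build_vocabulary(examples)
vectors = [one_hot(features, vocabulary) for features in examples]
print(vectors)
```

Vectors of this kind are what a learning model consumes, which is also why the induced patterns are not directly readable by a person.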

Page 14: Natural Language Processing within the Archaeotools Project

The fundamental idea...

The fundamental idea is to study the features of positive and negative examples of entities, and/or their surrounding N words, over a large collection of annotated documents (training data prepared by humans) and design rules that capture instances of a given type (Nadeau et al, 2006). The rules are then applied to a new corpus to classify each individual token (both previously seen and unseen) into a suitable class.

Features: descriptors or characteristic attributes of words, designed for algorithmic consumption

Positive examples: instances of a given type to be extracted

Negative examples: any text units that are not annotated as the given type
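As a rough sketch of how positive and negative examples and the surrounding N words might be derived from an annotated sentence: the sentence, the span format `(start, end, class)` and the class names below are all assumptions made for illustration, not the project's annotation format.

```python
def label_tokens(tokens, annotated_spans, target_class):
    """Pair each token with True (positive example) or False (negative)."""
    labels = [False] * len(tokens)
    for start, end, annotated_class in annotated_spans:
        if annotated_class == target_class:
            for index in range(start, end):
                labels[index] = True
    return list(zip(tokens, labels))


def context_window(tokens, index, n):
    """Return the up-to-n tokens before and after the token at index."""
    left = tokens[max(0, index - n):index]
    right = tokens[index + 1:index + 1 + n]
    return left, right


# Made-up sentence; spans are (start, end, class) with end exclusive.
tokens = "A Bronze Age barrow was recorded near Malton .".split()
spans = [(1, 3, "when"), (3, 4, "what"), (7, 8, "where")]

labelled = label_tokens(tokens, spans, "where")
positives = [token for token, is_positive in labelled if is_positive]
print(positives)
print(context_window(tokens, 7, 2))
```

Every token outside an annotation of the target class counts as a negative example, so negatives normally far outnumber positives.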

Page 15: Natural Language Processing within the Archaeotools Project
Page 16: Natural Language Processing within the Archaeotools Project
Page 17: Natural Language Processing within the Archaeotools Project

The fundamental idea...

Example annotations in highlighted colours are positive examples

Un-annotated texts are negative examples

Features of this annotation:
• first_letter_capitalised: true
• word_found_in_gazetteer: true
• preceded_by: the
• followed_by: period
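The four features listed above could be computed for a single token roughly as follows. The feature names come from the slide; the function, the example sentence and the tiny gazetteer are hypothetical.

```python
def annotation_features(tokens, index, gazetteer):
    """Compute the four slide features for the token at the given index."""
    word = tokens[index]
    previous = tokens[index - 1].lower() if index > 0 else None
    following = tokens[index + 1] if index + 1 < len(tokens) else None
    return {
        "first_letter_capitalised": word[:1].isupper(),
        "word_found_in_gazetteer": word.lower() in gazetteer,
        "preceded_by": previous,
        # A full stop is reported by name, as on the slide.
        "followed_by": "period" if following == "." else following,
    }


# Made-up sentence and gazetteer entries, for illustration only.
tokens = ["Excavation", "took", "place", "beside", "the", "Ouse", "."]
gazetteer = {"ouse", "york"}
features = annotation_features(tokens, 5, gazetteer)
print(features)
```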

Page 18: Natural Language Processing within the Archaeotools Project

Rule-based systems are good at extracting information that matches simple patterns and/or occurs in regular contexts, and are therefore applied to:

• Grid reference (easting and northing)
• Report title*
• Report creator*
• Report publisher*
• Report publication date*
• Report publisher contact
• Bibliography & references

Machine learning is good at extracting information that cannot be matched by patterns, occurs in irregular contexts, or occurs in large amounts, and is therefore applied to:

• What (subject)
• Where (place name)
• When (temporal info)
• Event date
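For the rule-based side, a grid reference is a typical simple pattern. The sketch below assumes Ordnance Survey National Grid references (two letters followed by an even number of digits, the first half easting and the second half northing); the regular expression and function name are illustrative and are not the project's actual rules.

```python
import re

# Two grid letters (the letter I is not used), then the digits, which
# may be written with or without spaces.
GRID_REF = re.compile(r"\b([A-HJ-Z]{2})\s?(\d{2,5})(\s?)(\d{2,5})\b")


def find_grid_references(text):
    """Return (letters, easting, northing) tuples found in the text."""
    results = []
    for match in GRID_REF.finditer(text):
        letters, first, separator, second = match.groups()
        if separator:
            # Written as two groups: both halves must be equally long.
            if len(first) != len(second):
                continue
            easting, northing = first, second
        else:
            # Written as one run of digits: split it down the middle.
            digits = first + second
            if len(digits) % 2 != 0:
                continue
            half = len(digits) // 2
            easting, northing = digits[:half], digits[half:]
        results.append((letters, easting, northing))
    return results


# Made-up sentence for illustration.
found = find_grid_references("The site lies at SE 6052 5219, north of NT123456.")
print(found)
```

A rule like this is easy to read and adjust, but it will also match any capitalised two-letter token followed by digits, which is the kind of brittleness the slide attributes to handcrafted rules.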

Page 19: Natural Language Processing within the Archaeotools Project

From the first batch of the annotated corpus: 35 unique annotated documents.

Number of annotations by class:

• publisher.name: 93
• title: 53
• date.event: 129
• coverage.temporal: 2185
• subject: 7935
• publisher.contact: 21
• date.publication: 28
• coverage.spatial.placename: 1467
• creator: 67
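The counts above are very unevenly distributed across classes, which matters because machine learning may not work well with few examples. A short sketch that totals the counts from the slide and prints each class's share:

```python
# Annotation counts copied from the slide.
annotation_counts = {
    "publisher.name": 93,
    "title": 53,
    "date.event": 129,
    "coverage.temporal": 2185,
    "subject": 7935,
    "publisher.contact": 21,
    "date.publication": 28,
    "coverage.spatial.placename": 1467,
    "creator": 67,
}

total = sum(annotation_counts.values())
shares = {name: count / total for name, count in annotation_counts.items()}

# Print the classes from most to least frequent.
for name, count in sorted(annotation_counts.items(),
                          key=lambda item: item[1], reverse=True):
    print(f"{name:28s} {count:5d} {100 * shares[name]:5.1f}%")
```

The subject class alone accounts for roughly two thirds of all annotations, while several metadata classes have fewer than 100 examples each.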

Page 20: Natural Language Processing within the Archaeotools Project

Class: useful features to test *

What (subject)
• word text
• word stem (root)
• word lemma (root format in dictionary)
• word orthography
• word Part-of-Speech
• word position in document (e.g., on page #)
• word position in page (e.g., position relative to page start offset and end offset)
• word membership in Gazetteer
• word general entity class (e.g., person, organisation, location, date, time)
• plus the above features for the preceding 5 words and the succeeding 5 words

Where (placename)

When (temporal)

Event date

* These features are generally applied to every other class too. See following slides.
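The "preceding 5 words and succeeding 5 words" idea can be sketched as a sliding window that repeats the per-word features for each neighbour, keyed by relative offset. Only word text and a simple orthography feature are shown; stem, lemma and Part-of-Speech would need an NLP toolkit and are left out. All names and the example sentence are hypothetical.

```python
def orthography(word):
    """Coarse description of a word's letter case."""
    if word.isdigit():
        return "digits"
    if word.isupper():
        return "upper"
    if word[:1].isupper():
        return "capitalised"
    if word.islower():
        return "lower"
    return "mixed"


def word_features(word):
    """Per-word features; a small subset of the list on the slide."""
    return {"text": word.lower(), "orthography": orthography(word)}


def windowed_features(tokens, index, window=5):
    """Features of the word at index plus those of its neighbours."""
    features = {}
    for offset in range(-window, window + 1):
        position = index + offset
        if not 0 <= position < len(tokens):
            continue  # the window runs past the start or end of the text
        for name, value in word_features(tokens[position]).items():
            features[f"{offset:+d}:{name}"] = value
    return features


# Made-up sentence for illustration.
tokens = ["A", "Roman", "villa", "near", "York"]
print(windowed_features(tokens, 2, window=1))
```

With the default window of 5, each word carries up to eleven copies of every feature, which is one reason feature selection and tuning take effort.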

Page 21: Natural Language Processing within the Archaeotools Project

Class: useful features to test

Title: features marked with * plus special word identifier (e.g., report, survey, evaluation)

Creator: features marked with * plus relative position to title

Publisher name: features marked with * plus relative position to title

Publisher contact: features marked with * plus relative position to title

Publication date: features marked with * plus relative position to title

Grid reference points:
• Identifier special word token (e.g., grid point, grid reference, easting, northing)
• Pattern

Bibliography/references:
• Identifier special word token
• Pattern (e.g., person name followed by year)
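The "person name followed by year" pattern for bibliography entries might look like the sketch below. It assumes entries that begin with a surname, initials and a four-digit year, optionally in brackets; real reference styles vary far more widely, and both the expression and the example lines are invented for illustration.

```python
import re

# Surname, one or more initials, then a year between 1600 and 2099.
REFERENCE_START = re.compile(
    r"^(?P<surname>[A-Z][A-Za-z'\-]+),?\s+"
    r"(?P<initials>(?:[A-Z]\.\s?)+),?\s*"
    r"\(?(?P<year>(?:1[6-9]|20)\d{2})\)?"
)


def parse_reference_start(line):
    """Return (surname, initials, year), or None if the line does not fit."""
    match = REFERENCE_START.match(line.strip())
    if match is None:
        return None
    initials = match.group("initials").replace(" ", "")
    return match.group("surname"), initials, int(match.group("year"))


# Made-up lines for illustration.
print(parse_reference_start("Smith, A. B. 1999. An example title."))
print(parse_reference_start("Jones, C. (2004) Another example title."))
print(parse_reference_start("The trench was opened in 1999."))
```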

Page 22: Natural Language Processing within the Archaeotools Project