ariadne: final report on natural language processing

36
D16.4: Final Report on Natural Language Processing Authors: Andreas Vlachidis, USW Douglas Tudhope, USW Milco Wansleeben, LU Jeremy Azzopardi, SND Katie Green, ADS Lei Xia, ADS Holly Wright, ADS Ariadne is funded by the European Commission’s 7th Framework Programme.

Upload: ariadnenetwork

Post on 22-Jan-2018

195 views

Category:

Data & Analytics


1 download

TRANSCRIPT

Page 1: ARIADNE: Final Report on Natural Language Processing

D16.4: Final Report on Natural Language Processing

Authors: Andreas Vlachidis, USWDouglas Tudhope, USWMilco Wansleeben, LUJeremy Azzopardi, SNDKatie Green, ADSLei Xia, ADSHolly Wright, ADS

Ariadne is funded by the European Commission’s 7th Framework Programme.

Page 2: ARIADNE: Final Report on Natural Language Processing

Theviewsandopinionsexpressedinthisreportarethesoleresponsibilityoftheauthor(s)anddonotnecessarilyreflecttheviewsoftheEuropeanCommission.

ARIADNED16.4:FinalReportonNaturalLanguageProcessing(Public)

Version:1.0(final) 16thJanuary2017

Authors: AndreasVlachidis,UniversityofSouthWales,USW

DouglasTudhope,UniversityofSouthWales,USW

MilcoWansleeben,UniversiteitLeiden,LU

JeremyAzzopardi,SND

KatieGreen,ADS

LeiXia,ADS

HollyWright,ADSQualityReview

AchilleFelicetti,PIN

Versions Nr. Authors&changesmade DateDraft .2 DouglasTudhopeandAndreasVlachidis(USW) 7.11.2016Draft .3 HollyWright(ADS) 10.11.2016Draft .4 Douglas Tudhope, Andreas Vlachidis (USW), Jeremy Azzopardi

(SND)30.11.2016

Final 1.0 HollyWright(ADS) 16.01.2017

Page 3: ARIADNE: Final Report on Natural Language Processing

Deliverable16.4:Finalreportonnaturallanguageprocessing January2017

3

TableofContents

Glossary.............................................................................................................................4

ExecutiveSummary............................................................................................................5

1. Introduction..................................................................................................................6

1.1 Background................................................................................................................................6

2 Rule-basedNLPforDutchlanguagegreyliterature........................................................8

2.1 Introduction...............................................................................................................................82.2 NERPipelinesforDutchreports.................................................................................................8

2.2.1 NERPipelineVersion1..................................................................................................................92.2.2 NERPipelineVersion2..................................................................................................................92.2.3 NERPipelineVersion3................................................................................................................102.2.4 Version3PerformanceIssuesandImprovements......................................................................102.2.5 Longertermactions(requiringinputfromDutcharchaeologydomainexperts)........................14

3 Rule-basedNLPforSwedisharchaeologicalreports.....................................................163.1 SwedishlanguagepilotgeneralNLPpipeline...........................................................................163.2 SwedishlanguagerevisedgeneralNLPpipeline......................................................................20

4 Rule-basedNLPinvestigationsonspecificcasestudies................................................21

4.1 Numismaticcasestudy.............................................................................................................214.2 Data/NLPmultilingualcasestudyonitemleveldataintegration............................................214.3 NLPpipelinesmadeavailableforfurtherwork........................................................................24

5 MachineLearningAPIfortheADSGreyLiteratureLibrary...........................................26

5.1 Introduction.............................................................................................................................265.2 NamedEntityRecognition(NER)WebService.........................................................................27

6 Conclusion...................................................................................................................31

7 Appendix1:InstructionsforAnnotatingGreyLiteratureDocuments...........................34

AnnotationPrinciples..............................................................................................................................34EntitiesAnnotation.................................................................................................................................35

8 Appendix2:InitialglossaryofSwedishdatecontextmarkers......................................36

Page 4: ARIADNE: Final Report on Natural Language Processing

Deliverable16.4:Finalreportonnaturallanguageprocessing January2017

4

Glossary

ADS ArchaeologyDataServiceArchaeotools

NLPprojecttocreatetoolsforarchaeologiststoallowarchaeologiststodiscover,shareandanalysedatasets

CIDOC-CRM

TheCIDOCConceptualReferenceModel(CRM)providesdefinitionsandaformalstructurefordescribingtheimplicitandexplicitconceptsandrelationshipsusedinculturalheritagedocumentation

CRF ConditionalRandomFieldF-Measure MeasureofaccuracycalculatedfromrecallandprecisionmeasurementsGATE AcomputerarchitectureframeworkforNLPGATEfication

Thedesignandtransformationoptionsfortranslatingtheoriginalresources(ofSKOSifiedRDFfiles)intoGATEfriendlyOWL-Litestructures

GoldStandard Atestsetofhumanannotateddocumentsdescribingthedesirablesystemoutcome

Greyliturature Unpublishedreports IE InformationExtractionJAPE SpeciallydevelopedpatternmatchinglanguageforGATELinkedOpenData

Awayofpublishingstructureddatathatallowsmetadatatobeconnectedandenriched

NLP NaturalLanguageProcessing NER NamedEntityRecognitionOBIE OntologyBasedInformationExtractionOWL-Lite OntologiesinGATEpurelysupporttheaimsofinformationextractionand

arenotstand-aloneformalontologiesforlogic-basedpurposesPolysemy Multiple,relatedmeaningsRCE RijksdienstCultureelErfgoedThesauriRDF ResourceDescriptionFrameworkSENESCHAL SemanticENrichmentEnablingSustainabilityofarCHAeologicalLinksSKOS SimpleKnowledgeOrganizationSystemSTAR SemanticTechnologiesforArchaeologicalResourcesSTELLAR SemanticTechnologiesEnhancingLinksandLinkeddataforArchaeological

Research.Stringmatching Actionofmatchingseveralstrings(patterns)withinalargerstringortextSVM LinearSupportVectorMachineSynonymy SimilarmeaningsTextMining TheprocessofderivinginformationfromtextTrainingdata/documents

TheannotatedtextusedtotrainNLPclassifiers

URI UniqueResourceIdentifierUSW UniversityofSouthWalesXML ExtensibleMarkupLanguageXSL MicrosoftExcelformat

Page 5: ARIADNE: Final Report on Natural Language Processing

Deliverable16.4:Finalreportonnaturallanguageprocessing January2017

5

ExecutiveSummary

Thisdocument isadeliverable (D16.4)of theARIADNEproject (“AdvancedResearch Infrastructurefor Archaeological Dataset Networking in Europe”), which is funded under the EuropeanCommunity'sSeventhFrameworkProgramme.ItpresentsthefinalresultsoftheworkcarriedoutinTasks 16.2 “Natural Language Processing (NLP)”. NLP is an interdisciplinary field ofcomputerscience,linguisticsand artificial intelligence that uses many different techniques to explore theinteractionbetweenhuman(natural)andcomputerlanguages.

Thepartners continued to focusononeof themost important,but traditionallydifficult toaccessresources in archaeology; the largely unpublished reports generated by commercial or “rescue”archaeology,commonlyknownas“greyliterature”,exploringbothrule-basedandmachinelearningNLP methods, the use of archaeological thesauri in NLP, and various Information Extraction (IE)methodsintheirownlanguage.

USWextended their English language rule basedmethodsusing theGATE toolkit forNER (NamedEntityRecognition)toDutchandSwedish languagegrey literaturereports, incollaborationwithLUandDANS(Dutchreports)andSND(Swedishreports).Thismadeuseofglossariesandthesaurifromthe partners, including the Dutch Rijksdienst Cultureel Erfgoed (RCE) Thesauri. The process ofimporting the thesauri resources into a specific framework (GATE), and the suitability andperformance of the selected resources when used for the purposes of Named Entity Recognition(NER)wereanalysed.

TheNERtechniqueswerefocusedonthegeneralarchaeologicalentitiesofArchaeologicalContext,Material,PhysicalObject(Finds),Monument,Place,andTemporal(TimeAppellation).Themethodsproved capable of extracting CIDOC CRM element and in some case studies Getty Art andArchitectureThesaurusconcepts,inadditiontothenativevocabularies.GeneralarchaeologicalNLP(GATE) pipelines for English, Dutch and Swedish have been developed. In addition experimentalpipelinesweredevelopedfortwoexploratorythematiccasestudiesondataintegration,wheretheoutputisexpressedasRDFLinkedDataviaaCRMbaseddatamodel.AnEnglishlanguagepipelineisavailable foranumismaticcasestudy.English,DutchandSwedishpipelineareavailable foracasestudyof item leveldata/NLP integrationona loose themebasedaroundarchaeological interest inwoodenobjectsandtheirdating,asexpressedindifferentkindsofdatasetsandreports.Bothcasestudieshave resulted in interactivedemonstratorsoperatingover theARIADNELinkedDataCloud.All7pipelinesarefreelyavailableasopensourceARIADNEoutcomes.

The Archaeology Data Service (ADS) at the University of York continued developing a machinelearning-basedNLPtechniquewhichhasnowbeenintegrateditintoanewmetadataextractionwebAPI,whichtakespreviouslyunseenEnglishlanguagetextasinput,andidentifiesandclassifiesnamedentitieswithin the text. The outputs can then be used to enrich resource discoverymetadata forexisting and future resources. This API can be incorporated into existing interfaces and used byarchaeological practitioners to automatically generate metadata related to text-based contentuploadedonaper-filebasis,orbyusingbatchcreationofmetadataformultiplefiles.

Thisreportpresentsthefinalresultsoftheworkcarriedouttodate,andpresentsthe issuestobeaddressedduringtheremainderoftheARIADNEProject.

Page 6: ARIADNE: Final Report on Natural Language Processing

Deliverable16.4:Finalreportonnaturallanguageprocessing January2017

6

1. IntroductionThisdocument isadeliverable (D16.4)of theARIADNEproject (“AdvancedResearch Infrastructurefor Archaeological Dataset Networking in Europe”), which is funded under the EuropeanCommunity'sSeventhFrameworkProgramme.ItpresentsthefinalresultsoftheworkcarriedoutinTask 16.2 “Natural Language Processing (NLP)”. NLP is an interdisciplinary field ofcomputerscience,linguisticsand artificial intelligence that uses many different techniques to explore theinteractionbetweenhuman(natural)andcomputerlanguages.

1.1 Background

In D16.2, the ARIADNE partners explored NLPwith the aim ofmaking text-based resourcesmorediscoverable and useful. The partners specifically focused on one of the most important, buttraditionallydifficult toaccess resources inarchaeology; the largelyunpublishedreportsgeneratedbycommercialor“rescue”archaeology,commonlyknownas“greyliterature”.

The partners explored aspects of rule-based and machine learning approaches, the use ofarchaeologicalthesauriinNLP,andvariousInformationExtraction(IE)methods.Therule-basedworkwas carried out by the University of South Wales, in partnership with Leiden University, onarchaeologythesauriforNLP,whichappliedNamedEntityRecognition(NER)totheDutchRijksdienstCultureelErfgoed(RCE)Thesauri.TheprocessofimportingasubsetofRCEthesauriresourcesintoaspecificframework(GATE),andthesuitabilityandperformanceoftheselectedresourceswhenusedforthepurposesofNamedEntityRecognition(NER)werediscussed.Thisrevealedissuesrelatingtothe role of the RCE thesauri inNER and further development of techniques for the annotation ofDutch compound noun forms. Some of these issues are addressed in the current deliverable,includingadaptingtheontologyresourcetotherequirementsoftheNLPtask.

As reported inD16.2,University of SouthWales also undertook a study for aDutchNERpipeline,which included the results of the early pilot evaluation based on the input of a single, manuallyannotateddocument. Thereportalsopresentedtheresultsof thevocabulary transformation taskfrom spreadsheets to RDF/XML hierarchical structures, expressed as an OWL-Lite (ontology).Observations related to the vocabulary transformation process and pipeline results, and revealedinitial issues that affect vocabulary usage and focus of the NER exercise. The document wasannotatedwith respect to the followingentities;Actor, Place,Monument,Archaeological Context,Artefact,Material, Period. The NER pipeline is configured to identify the following entities: Place,Physical Thing (i.e. Monument), Physical Object (i.e. Artefact), Time Appellation (i.e. Period),Material,Context.Eachentityproduceddiffering levelsof results,which insomecasesweregood,butothersneededtobeexploredfurtherforimprovement.

Work was also carried out by the Archaeology Data Service (ADS) at the University of York, todevelop and evaluate machine learning-based NLP techniques and integrate them into a newmetadataextractionwebapplication,whichtakespreviouslyunseenEnglishlanguagetextasinput,and identifies and classifies named entities within the text. The outputs were used to enrich theresourcediscoverymetadata for existing and future resources. The intentionwas to create a finalapplicationwithaweb-based,userfriendlyinterfacethatcanbeusedbyarchaeologicalpractitionersto automatically generatemetadata related to uploaded text-based content on a per-file basis orusingbatchcreationofmetadataformultiplefiles.

Thework described in D16.2 revealed theNERmoduleworked successfully and produced correctentitiesfortheclassesithasbeentrainedtoidentify.Itwasusefulforextractingresourcediscoverymetadata from unstructured archaeological data, particularly grey literature reports, for resourcediscovery indexing, where little or no metadata currently exists. From a data managementperspectivehowever,thelargequantitiesofentitiesextractedbytheNERmodulewerefelttobetoolargetoeffectivelymanage.Aprototypeannotationtoolbuiltintothewebapplicationwascreated

Page 7: ARIADNE: Final Report on Natural Language Processing

Deliverable16.4:Finalreportonnaturallanguageprocessing January2017

7

toallowuserstoproducemoretrainingdatatobettertrainthemodule.TheintentionofADSwastocontinuetoworktorefinethetool,especiallywithregardtotheinterfacetomakeiteasierandmoreintuitive to use, exploring crowdsourcing for processing large quantities of unstructured data,improvementof the text extractionmodule, anddevelopmentof amodule toexport the selectedmetadatainavarietyofformats.Integrationofthewebapplicationandtechniqueswereexploredto“tidy”,groupandranktheentitiesoutputfromtheNERmoduleusingtextclustering,andgeneratingclusterlabelsbasedonthecontentinrespectiveclusters.

Page 8: ARIADNE: Final Report on Natural Language Processing

Deliverable16.4:Finalreportonnaturallanguageprocessing January2017

8

2 Rule-basedNLPforDutchlanguagegreyliterature

2.1 Introduction

As stated in D16.2, Information Extraction (IE) is a specific NLP techniquewhich extracts targetedinformationfromtextualcontent.Itisaprocesswherebytextualinputisanalysedtoformatextualoutput capable of further manipulation. Rule-based IE systems consist of a pipeline of cascadedsoftwareelements thatprocess input insuccessivestages.Hand-craftedrulesmakeuseofdomainknowledge and vocabularies, together with domain-independent linguistic syntax, in order tonegotiatesemanticsincontext.

Theemploymentofrule-basedIEanddomainvocabularyresourcesdistinguishesthisapproachfromsupervisedmachine learningwork,which relies on the existence and quality of training data. Theabsence of a training corpus coupled with the availability of a significant volume of high qualitydomain-specific knowledge organisation resources, such as a conceptual model, thesauri andglossaries were contributing factors to the adoption of rule-based techniques in this study. Rulesinvokeinputfromgazetteers, lexicons,dictionariesandthesauritosupportthepurposesofNamedEntity Recognition (NER). Such word classification systems contain specific terms of predefinedgroups, such as names of people or organisations, week days, months etc., which can be madeavailable to thehand-crafted rules. In addition, rule-based IE techniquesexploit a rangeof lexical,part of speech and syntactical attributes that describe word level features, such as word case,morphologicalfeaturesandgrammarelementsthatsupportdefinitionofrichextractionrules,whichareemployedbytheNERprocess.

Rule-based techniques have previously been employed successfully with English languagearchaeology reports from the ADS Grey Literature Digital Library as part of the STAR Project (incollaboration with ADS), yielding promising evaluation results1. This took advantage of existingarchaeologicalvocabulariesfromEnglishHeritage(EH)andprovedcapableofsemanticenrichmentofgreyliteraturereportsconformingbothtoarchaeologicalthesauriandcorrespondingCIDOCCRMontologyclassesrepresentingarchaeologicalentities,suchasArtefacts,Features,MonumentsTypesandPeriods.ThisEnglishlanguageNLPworkhasbeencontinuedwithinARIADNE,reportedonbelowinSection4,appliedtoannotationsofRomancoins.TheGATEframework2usedforthisworkistheoutcomeofa20yearoldprojectestablishedin1995attheUniversityofSheffieldwithaworldwidesetofusers;theGATEcommunityhasbeeninvolvedinaplethoraofEuropeanresearchprojects.

AmajordevelopmentoftherulebasedapproachwithinARIADNEhasbeenthegeneralisationoftheprevious rule based techniques to Dutch language grey literature. This faces the challenge of adifferent set of vocabularies available via the Rijksdienst Cultureel Erfgoed (RCE). It also faces theissueofdifferencesinlanguagecharacteristics,forexamplecompoundnounforms.

InitialworkwasreportedinD16.2.Buildingontheseoutcomes,furtherworkonDutchlanguageNLPisdiscussedbelow.

2.2 NERPipelinesforDutchreports

ThissectiondiscussesthelatestversionoftheDutchNERpipeline(version3),whichaddressedtheshort term actions following the review of the earlier NER pipeline(version 2)as discussed inDeliverable16.2. Such limitations concerneda) vocabulary coverageand suitability forNLPof the

1VlachidisA,TudhopeD.2016.Aknowledge-basedapproachtoInformationExtractionforsemanticinteroperabilityinthe

archaeologydomain.JournaloftheAssociationforInformationScienceandTechnology,67(5),1138–1152,Wiley2GATE(GeneralArchitectureforTextEngineering)https://gate.ac.uk

Page 9: ARIADNE: Final Report on Natural Language Processing

Deliverable16.4:Finalreportonnaturallanguageprocessing January2017

9

Rijksdienst Cultureel Erfgoed (RCE) Thesauri and b) plasticity of information extraction rules toaddress complex scenarios of compound noun forms and negated phrases. The latest versionfocusedonadaptingandenhancing theRCE resources toNLPand inparticular to the taskofNERwith respect to Archaeological Context (Features), Material, Physical Object (Finds), Monument,Place,andTemporal(TimeAppellation)entities.Somerulemodificationsandcorrectionswerealsomadeaimedatimprovingtheperformanceofthepipelinebutthelatestversionhasnotaddressedtheissuesofcompoundnounformsextractionandnegationdetection. Thedetailsofthelatestversionintermsoftheappliedimprovementsandproblemsencounteredarediscussed.First,wereviewtheearlierNERversionsofthepipeline.Threepipelineversionshavebeendeveloped so far. All pipelines addressed the task of NER with respect to the following entitiesArchaeologicalContext(Features),Material,PhysicalObject(Finds),Monument,Place,andTemporal(Time Appellation). Their difference primarily resides on the size and origin of the vocabularyresourcestheyengage.AstheNERversionsprogressedtheruleshavebeenimprovedandenhancedinformedbyformativeevaluation.

2.2.1 NERPipelineVersion1TheNERpipelinedevelopedinpreviousworkfortheSTARprojectwasGazetteerBasedincontrasttothe various pipelines developed for ARIADNE which are Ontology Based. The major differencebetweenGazetterandOntologybasedpipelinesrelatestotheconstructofthevocabularyresourcesinGATE.Itisarathertechnicalissuethataffectsthewayvocabularyresourcesaremadeavailabletothe information extraction rules and the flexibility of those resources to accommodate semanticfeatures, synonyms, and to support partial matches. Gate ontologies provide better control forexploitingpartsofthevocabularythroughtransitiveParent-Childrelationshipsandfeaturematching.Gazetteers on the other hand, provide a flexible matching over word tokens and are easier toconstruct either as flat lists or featured lists from vocabulary resources of Excel spreadsheets orWorddocuments.Dependingontheformatofvocabularyresourcesandtheaimsofaninformationextractiontaskontologybasedorgazetteerbasedoracombinationofthetworesourcestypescanbeengagedbythepipeline.

ThefirstpipelineversionutilisedarangeofvocabularyresourcesthatweremadeavailableinExcelspreadsheet format. The vocabularies originated from theArchis database and partly consisted ofsubsets of the National Thesaurus RCE. In details, the vocabularies contained Artefact types,Monument types (complex types),Materials,ArchaeologicalContext (akaFeatures /grondsporen),Periods,andPlacenames.AsetofstraightforwardJAPErulesexploitedtheresourcesanddeliveredasetofnamedentitiesviamappingvocabularytoentitytypes.

2.2.2 NERPipelineVersion2ThesecondversionofthepipelineutilisedtheskosifiedversionoftheRCEthesauri(availablefromhttp://www.erfgoedthesaurus.nl/).TheXMLversionsoftheskosifiedresourceswereretrievedusingadedicatedAPIkeyandtheresourcestransformedtoOWL-LiteformatontologyusingXSLtemplatestructures.ThedetailsoftheGATEficationprocess(i.e.transformationoftheoriginalXMLresourcestoOWL-Lite) isdiscussed inD16.2.Overall, fiveseparatethesauristructuresweretransformedandjoined under a unified OWL-Lite structure. These are, the Archaeological Types; Complextypen(monuments), Perioden (Periods), Artefacttypen (artefacts) and the Global Thesauri; Locaties(Locations),andMaterialen (Materials).There isa lackofadedicated thesaurus forArchaeologicalcontexttermsas inthegrondsporen.xsl resourcesused inversion1.However, thevastmajorityofarchaeologicalcontextterms(grondsporen.xsl)arecontainedintheArtefacttypenstructure.Asetof

Page 10: ARIADNE: Final Report on Natural Language Processing

Deliverable16.4:Finalreportonnaturallanguageprocessing January2017

10

JAPErulesisusedinversion3ofthepipelineforexploitingonlythepartsofthestructurerelevanttoarchaeologicalcontext.

The second version of the pipeline employed a range of JAPE rules that exploited the Ontologyvocabulary and delivered matches with respect to the targeted entities (Archaeological context,Artefacts,Materials,Monuments,Places,Periods). Inaddition,a flatgazetteer resourcecontainingperiodrelatedsuffixes,suchasA.D,B.C,voorChristusetc.wasconstructedandusedinthedefinitionofJAPErulestargetedatmatchingnumericaldatese.g.1200AD,800v.Chr.Similarly,asetofJAPEruleswasdefinedformatchinggridreferencesandgeographiccoordinates(numericalplaces),suchas216.518/568.889.Thepipelinealsoexperimentedwithpartialmatchingofvocabularytargetedatmatchingcompoundnounformcases.TheavailableRCEthesuari,originallyavailableinXMLformat,were transformed into gazetteer listings and engaged by JAPE rules targeting partial matching ofterms. Thiswasanexperimental effort aimedatmatching compoundnoun forms,which regularlyoccurinDutchandaffecttheperformanceoftheNERtask.Theinitialresultswereencouraginganddemonstrated thematching of compound noun form can be achieved. However, partialmatchingalsodeliveredasignificantamountofnoisethataffectedprecision(seediscussioninD16.2).

2.2.3 NERPipelineVersion3ThethirdversionoftheDutchNERpipelineaimedatimprovingarangeofvocabularyissuesaffectingthe second version, primarily relating to the RCE thesauri and secondly to the quality of the goldstandard. In particular, the third version addressed the problem of the overloaded ontology classlabelswhichhadresultedfromthetransformationoftheoriginalXML(RCE)thesauritoontologyandnotedwithversion2.Version3automaticallyandmanuallyenhancedtheontologywithanumberofalternativelabels,spellingvariationsandsynonymsandalsoaddressedmajorissuesofGoldStandardquality that unnecessarily undermined the performance of the pipeline. Minor corrections andmodifications were also made in JAPE rules aiming to improve the performance of the pipelinewheneverthatwaspossible.

2.2.4 Version3PerformanceIssuesandImprovements

Thissectiondiscusses indetailthethreemainstrands(Vocabulary,GoldStandardandvariousRuleimprovements)appliedinthelatestpipeline.Thesectionalsorevealswaysandtechniquesthatcouldbe employed to overcome some of the performance problems that were encountered but notaddressedbythethirdversion.Itendswithdiscussionoffurtherwork.

VocabularyRelatedThe main effort of the vocabulary improvement was offered in breaking down the overloadedvocabularyentries intoindividualtermconstituents.Supplementaryeffortsenhancedandmodifiedthe vocabulary with spelling variations and synonyms informed by the Gold Standard input. Inaddition, a new set of thesauri structures were added into the ontology for complementing thevocabulary.TheRCEthesauriwerenotnecessarilydevelopedwithNaturalLanguageProcessinginmindandasaresultcontainentriesthatarenotsuitableforautomaticandalgorithmictermmatchingduetotheirmulti-term,sometimesdescriptiveandverbosepunctuationstructure. Forexamplethevocabularyentryamulet/talismananditschildentryamulet/talisman–kruisvormigarenotsuitableforNLP.Itis

Page 11: ARIADNE: Final Report on Natural Language Processing

Deliverable16.4:Finalreportonnaturallanguageprocessing January2017

11

veryunlikelythatsuchentrieswillbefoundinnatural languagetexthavingthisformat.Mostlikelyeither amulet or talismanwill be found as individual entries and if an adjective is used, such askruisvormig(cruciform)thiswillfollowagrammaticallycorrectsyntaxform(i.e.kruisvormigamuletinsteadofamuletkruisvormig).Therefore,entriesliketheaboveshouldbeenhancedwithlabelsthatare closer to what is likely to appear in text rather than containing descriptive and non naturallanguagedescriptions.

TheaforementionedcaseofoverloadedlabelswasoftenfoundintheArtefactTypenThesaurusandin the ComplexType thesaurus. Other thesauri resources of material, periods and places mostlycontained single term entries. A set of XSL templates was developed for breaking down theoverloaded entries. The overloaded entries followed a pattern for joining synonyms andspecialisationstogetherunderasinglelabel.Theforwardslash(/)characterjoinssynonymsasinthecaseamulet/talisman.Thehyphen(-)characteraddsspecialisationasinthecaseamulet/talisman–kruisvormig.Thecomma(,)characteraddsaformofperiphrasticdescriptionwhichcanbetreatedasan alternative label, for example amfoor, dikwandig aardewerk (amphora, thick walled pottery).Somerarecasesuse thecolumn (:) character toaddageneralisation, forexamplegeelwitbakkendgedraaid:beker(yellowwhitebakedtwisted:beker).

TheXSLtemplatesincorporatedtheabovepatternstogeneratethenewvocabularylabels.Forexample:

• amulet/talisman→twolabels(amulet,talisman)

• amulet/talisman–kruisvormig→twolabels(kruisvormigamulet,kruisvormigtalisman)

• amfoor,dikwandigaardewerk→twolabels(amfoor,dikwandigaardewerk)

• geelwitbakkendgedraaid:beker→onelabelgeelwitbakkendgedraaidbeker

The employment of XSLT transformation and the automatic enhancement of the vocabulary withalternative labels undoubtedly has some trade-offs. In most cases special characters for joiningsynonymsandexpressingspecialisationsorgeneralisationsarestandardacrossthethesauriandthetransformation delivers useful alternative labels. However, there are cases that do not follow thestandard use of special characters or are very verbose (eg huisplattegrond:4-schepig - type St.Oedenrode) .Suchcasesduetotheircomplexityarenotmatchedbythetransformationtemplatesandareignored.Theremightbeapossibilitythattheautomatictransformationhasdeliveredsomeerroneousorfalseresults.However,suchresultswillnotreflectactualusesofthenaturallanguageanditisveryunlikelythattheywilldeliveranymatchesintheNERprocess.

ManualVocabularyEnhancement

AnalysisofaGoldStandard(humanannotateddocuments)hassuggestedseveralalternativelabelsand synonyms that were used to enhance existing vocabulary entries. The extent of the GoldStandardisnotsufficienttosuggestafullycomprehensivelistofsynonymsandalternativelabelsfortherangeofthevocabularyterms.However,wheneverpossibletheGoldStandardinputwasusedtoenhanceexistingvocabularytermsparticularlywithfrequentlyusedspellingvariations.Forexample,midden, laat (mid, late) and other period related prefixes can appear with bracket as (Midden)MesolithicumorwiththeacronymMESOL, insteadoftheoriginallabel,MiddenMesolithicum.Suchalternative labels are applicable tomany terms andnot just those contained in the gold standardwereadded to thevocabulary.Other casesofmanualvocabularyenhancementconcerngroupsofsynonymsor specialisationswhichare frequentlyusedwithin theGold Standardand fall under an

Page 12: ARIADNE: Final Report on Natural Language Processing

Deliverable16.4:Finalreportonnaturallanguageprocessing January2017

12

existing more generic vocabulary entry. For example the entry zaad/vrucht/noot/pit(seed/fruit/nut/kernel) together with the automatically generated labels (from splitting on theforward slash) was enhanced with the Gold Standard derived terms graan (grain), vlas (flax),macroresten (micro-remains),gerst (barley)and tawe (wheat). Other less importantmodificationsconcerndescriptivelabels,suchasoverig(other)andonbekend(unknown)whichwererenamedto<overig>and<onbekend>torefrainfrommatching.

GoldStandardRelatedSeveral Gold Standard related issues that already appeared in the former versions of the pipelinehavebeenaddressedbythelatestNEReffort.TheGoldStandardconsistsof7long(someareupto300 pages) grey literature reports containing thousands of annotated text instances. However thelarge number of annotations has in some cases affected the quality of the annotations. Threeseparateissuesrelatingtothegoldstandarddefinitionhavebeenidentified.

1. Missing annotations: These are cases of valid annotations that the human annotator hasclearlyfailedtoidentify.Thelargevolumeofthedocumentsdictatesacumbersometaskofmanual annotation to identifywhere such cases ofmissed annotationsmight happen. Forexample kuil (pit) is a frequently occurring word and which in some instances might beoverlookedbyahumanannotator. Insuchmissingannotationcases, theNERpipelinemayproducea(correct)annotationwhich,unfortunately,isrecordedasafalsepositivematch,ieit isnot falsebuta truematchthathasnotbeen identifiedby thehumanannotator.SuchcaseswhenpossibletoidentifywerecorrectedintheGoldStandardsotheprecisionratesofthepipelinewouldnotbepenalisedunnecessarily.

2. Out of scope: Due to potential ambiguity in the manual annotation guidelines (ormisinterpretationby theannotators), thehumanannotators sometimesmarkedentities intheGoldStandardthatareoutofthescopeoftheNERtask.Suchcasesincludedannotationof contemporary dates (eg 20 April 1987) and place names outside Netherlands (e.g.Belgium). Such cases were removed from the Gold Standard to avoid harming the recallratesofthepipeline.

3. Non–considered: These casesaredifferent from themissingannotationson thebasis thattheformer(non–considered)arecasesthathavenotbeenoverlooked(asinthecaseofkuil)buthavenotbetaken intoconsiderationasrelevant termstothemanualannotationtask.Mostlikelythemanualannotationguidelinesdidnotmakeclearthatsuchentitiesarewithinthe scope of the task. Such cases include terms like gebouw (building), akker (field), weg(road), etc, which are not necessarily monuments but are included in the monuments(complex types) thesaurus.No actionwas takenwith regards to such cases but the termshavebeenidentifiedandcanberesolvedinfutureversionsofthepipeline.

4. MaterialorObject:Thisparticularcaseappearstobearealproblemwiththesemanticsoflanguage use in the archaeology domain regardless of language (the same behaviour hasbeenobserved inEnglishand inDutch).Theproblem is summarisedunder thenotion thatmaterial finds can constitute small findse.g.pottery (aardewerk inDutch) andas suchareannotatedasobjects(finds).Tothehumanannotatordinstictionbetweenthematerialsenseandtheobjectsenseofthepotterytermsmaybeapparent incontext (taking intoaccountthe intendeduseof the annotations).However, this formof distinction is hard to addresswithcomputationalmethods.

Page 13: ARIADNE: Final Report on Natural Language Processing

Deliverable16.4:Finalreportonnaturallanguageprocessing January2017

13

JAPERuleRelatedJAPErule-relatedissuesdescribecaseswheretheperformanceofthepipelineisaffectedduetothelimitations or under-performance of information extraction rules. The third version of the NERpipeline improved many cases of underspecified rules that were identified during evaluation ofversion 2 and also identified some new cases of under-performance that require furtherimprovement.ThefollowingcasesofJAPErule-relatedissueshavebeenaddressedandidentifiedforfurtherimprovement.ImprovedCases(JAPErules)

1. PeriodandNumericaldates:Theperiodsuffixgazetteerlisthasbeenenhancedwithentries,suchaseeuw(century)ande(th)toenablematchingofe.g.8eeeuw(8thcentury).Ithasalsobeenenhancedwith improved rules formatching rangeofdatese.g. tussen1600en1900(between1600and1900)

2. Place Grid Reference: New rules have been added to address alternative grid referencepatternswhichwerenotmatchedfrompreviousversione.g.216581/568889

3. AdditionalLookups:Newruleshavebeenaddedforexploitinginputfromthenewlyaddedthesauri structures in the ontology, such as the Erfgoedthesaurus material and theLandscapeelementsoftheObjecttypenthesaurus.

4. Place, upper case restriction: The restriction that any Place entitymatchmust commencewith an upper case letter has been lifted for those cases commencingwith s', such as 's-Heerenberg,'s-Gravelandetc.

FurtherImprovementCases(furtherworkonJAPErules)

1. Compound noun forms: Compound noun forms appear in Dutch regularly joining period

termswithobjects,objecttermswithmaterial,materialtermswitharchaeologicalcontextsetc.Awayforwardoftacklingsuchcasesistoemploypartialmatchingoverwordsinsteadofthewholewordmatching.Partialmatching ispossible inGATEbut shouldbeplannedandexecutedcarefullydue to thesignificantamountofnoisegeneratedand thecomplicationsfor entity type assignment (i.e. will the compound form carry a single type, or two entitytypes).

2. Negated entities: The current version does not exclude thematching of negated phrases(e.g.geenvondsten).Suchnegatedphrasesmightinfactbeacommentinthereportthatnoevidencehasbeenfoundforapotentialfindingandthusshouldnotbeannotated(atleastasasimpleNERinstance).Anegationdetectionmodulewouldenabledetectionofthenegatedentitiesandannotateaccordingly.

3. Non-Singlewordmatching:Thecurrentpipelineimposesrestrictionsonthepartofspeechtype of the matched terms, requiring all matches to be nouns except for period/timeappellations.Thenounvalidationiscurrentlyachievedusingthepartofspeechinputthatisassignedonsinglewords.Thus,twowordvocabularyentries(egMetamorfegesteente)arenot ignored from matching. In order to enable matching on non-single word, the noun

Page 14: ARIADNE: Final Report on Natural Language Processing

Deliverable16.4:Finalreportonnaturallanguageprocessing January2017

14

validationshouldhappenoveranounphrasenot justonasinglewordtoken. TheEnglishNounPhrasemoduleofGATEcouldbeadaptedinDutchtoenablethiskindofvalidation.

4. Adjective matching: The noun validation restriction excluded adjectives from matching.However, the Gold Standard contains many material entities of adjectival form, such asbronzen(bronze),stenen(stone)etc.Therestrictioncanbeeasilyliftedbutcarefulplanningis required in order to conclude such cases either as individual material entities or asmoderatorsofobjectormonumententities.

5. Placenamesasorganisationsor surnames: Several falsepositivematchesof placenamesoccuraspartofanorganisation'snameorsurname. Inorderto improvematchingonsuchcases, actor and organisation entities should be identified first, in order to exclude themmatchingasPlaces.TheActorsOrganizationsandActorsPersonthesaurihavealreadyaddedin the ontology. A future version of the pipeline could address such entities and improvematchingoverplacenames.

6. ErroneousPartofSpeechandTokenizerinput.JAPErulesandvocabularyLookupmatchingrelies on input from the POS and Tokenizer modules. In the cases where such input iserroneous (egaverbtaggedasnoun,awrongwordroot)LookupmatchingandJAPErulesdelivererroneousmatches.However,suchcasesarefewanddonotsignificantlyaffecttheperformanceofthepipeline.

2.2.5 Longertermactions(requiringinputfromDutcharchaeologydomain

experts)

• Identify themost frequentcasesof terms thatcontribute tocompoundnoun forms. Itwillnot be efficient to produce part-matches via gazetteer from the totality of the“Archeologische artefacttypen” thesaurus, as this will have an impact on precision(generatingtoomanypartmatches).Insteadaselectedsetoftermsthatfrequentlyappearin compound forms should be identified and exposed as gazetteer list. For example“aardewerk”hasmuchmorechanceofappearingasacompoundnounthanotherterms,soitshouldbeprioritisedforpart-matching.

• Identify entity-type combinations that deliver compound noun forms. Based on the GSresults, it appears there are three main combination types a) material+artefact, b)period+artefact and c) artefact+artefact. It would be helpful to discuss these combinationformswithDutcharchaeologistsbeforeresolvingonanyNLPmatchingrules.

• Inadditiontotheabove,theannotationapproachtowardscompoundentityformsshouldbediscussedandfinalised.Atthisstageitisnotclearthenumberandtypeofannotationsthatshould be delivered from a compound entity form. For example consider the case of“aardewerkfragment”(potteryfragment).Willitdeliver:

o a single span annotation (aardewerkfragment) associated with two SKOSreferencesonefor“aardewerk”andanotherfor“fragment”;

o twoseparateannotationseachassociatedwithaSKOSreference;

o threeannotations,twoseparateannotations(asabove)andathirdforthewholespanannotatedas“P45.consists_of”property.

• Similarly,somethoughtshouldbegiventowardstheannotationofcompoundentityformsof part-known constituents, where only one of the parts is “known” to the ontology. For

Page 15: ARIADNE: Final Report on Natural Language Processing

Deliverable16.4:Finalreportonnaturallanguageprocessing January2017

15

example“aardewerkmagering”,orhow“magering”,which isnotaknown (available) term,will be treated. Will it just be ignored and only a single annotation will be delivered i.e.“aardewerk”,orwillitbeincludedintheannotationspan(whichisalsopossible).

• 'Expert annotator' review of the existing GS for consistency and in light of the automaticoutputresults.

• Actionsconcerningaddingnewthesauriconcepts,andreleasingrespectiveSKOSreferences.

• Rearranging a thesaurus structure for adding new broader terms for a set of specialisedtermsalreadyincludedintheresourcee.g.“Nederzetting”(settlement).

• With regards to the above, a quick fix for NLP purposes which would not requirerestructuring the resource, could be adding an alternative label of the broad term to theexisting specialised terms e.g. “Nederzetting met stedelijk karakter”. Or to use a generalpurpose thesaurus that contains the broad term (Nederzetting), such as ErfgoedthesaurusObjecttypenfordeliveringmatcheswithSKOSreferencerespectivetothebroadtermnottoaspecialisedterm.

Page 16: ARIADNE: Final Report on Natural Language Processing

Deliverable16.4:Finalreportonnaturallanguageprocessing January2017

16

3 Rule-basedNLPforSwedisharchaeologicalreportsSubsequenttotheDutchNERwork,theSwedishNationalDataService(SND)andUSWcollaboratedonaninitialinvestigationofthegeneralisationofthe(USW)Englishlanguagerule-basedapproachtoNER on Swedish language archaeological reports.USW contributed on the technical side and SNDcontributed with archaeological reports, vocabularies, mappings to AAT, manual annotations andevaluation of the NLP outputs, amongst other work. Due to the limited time available, this is anexploratoryinvestigationintendedtolaythegroundworkforfurtherdevelopment.

3.1 SwedishlanguagepilotgeneralNLPpipeline

SND collaborated with USW in the creation of an NLP general pipeline as part of the metadataenrichmenteffortbycontributingtotheSwedish-languageversionofthegeneralpipeline(GP).Thestepsinvolvedwere:

1) Initialmanualannotationsoffirstrun–NinearchaeologicalreportsinSwedishwereselectedfromSND’s corpusofpublished studies. Thesenine reportswereannotatedby a groupofthreearchaeologistsanddatamanagers.Annotationconsistedofmarkingkeywordswithineach textby thecategoriesused for theDutchNLPwork, following similar Instructions forAnnotatorsasfortheDutchNERwork(seeAppendix1):

a. Contextb. Materialc. Monumentd. Objecte. Placef. Timeappellations

2) Vocabularies – In parallel with manual annotation, vocabularies for each of theabovementionedcategorieswerealsoproduced.TheseSwedish-languagevocabularieswerebasedonSND’sownvocabularies,whichareused formetadatawithin itsownonlinedatacatalogue.

Vocabulary mapping: In order to aid the mapping of keywords within the ARIADNE portal, theSwedishlanguagevocabularieswererevisedandmappedtoGetty’sArtsandArchitectureThesaurus(AAT).ThismappingwasdoneusingSKOSMappingPropertiestofacilitatesemanticwebintegrationandmetadataenrichmentwithintheARIADNEregistryandportal.ThisisdescribedinmoredetailinARIADNED15.1-ReportonThesauriandTaxonomies.

TheresultsofthiseffortwereusefulforevaluatingtheuseoftheNLPgeneralpipelineformetadataenrichment, and produced a number of recommendations for future improvements. Future workincludes explorationof anexpandedgeneral vocabulary for Swedish archaeology. Thiswouldbe ausefulresourceformetadataenrichment,bothautomatedandotherwise.

To provide a benchmark, SND annotated 9 documents following the Instructions for Annotators.These were archaeological reports from four different archaeological investigators (one countymuseum, one private company, one university funded organization, and the National HeritageBoard’s contract archaeology division), and from different phases of the archaeological process(desk-based assessment, field evaluation and final excavation). These were accompanied by therelevantSNDvocabularies.OneissueforfutureworkflaggedupinthemanualannotationprocessisthattheInstructionsassumeasingleentityforanygivenword.Itisnotpossibletoannotatethetextsin “layers”, i.e. put one single word inmultiple categories (e.g. “Vråkeramik”, “Vrå” being both aplacenameandapotterytype,aswellasalocaltermforthetimeperiodthatcoversthePittedWareculture,etc…).

Page 17: ARIADNE: Final Report on Natural Language Processing

Deliverable16.4:Finalreportonnaturallanguageprocessing January2017

17

Abriefnoteonthevocabularies:

Places-Therearetwoseparate listsforthegeographical information(Places).Oneisbasedonthemodern administrative use of counties (Sw. län), municipalities (Sw. kommun) and parishes (Sw.församling),theothercontainsthenamesofhistorical/traditionalprovinces(landskap)andparishes(socken3).Onlythefirstwasusedinthepilotpipeline.

TimeAppellations - Time periodswere provided alongwith someof themost common local sub-divisionsofarchaeologicalperiods,aswellas inflections.This listofgeographical/localvariations isnotcomprehensive.

ObjectsandMaterials-ThelistsforArchaeologicalObjectsandMaterialsaretakenfromtheSwedishHistoryMuseum.Theyarenotspellcheckedandconsideredascontrolledvocabularies.

Monuments-TheMonumentsvocabulary istheoneusedbytheSwedishNationalHeritageBoard,withsomeadditionallesscommontermsandarchitecturalfeatures,aswellasinflections.

Context Types - This is a list created by the SND during the work of annotation - it is notcomprehensive and is not controlled by any archaeological authority. In Swedish archaeology, thetermfeatureisusuallyused,ratherthancontext(exceptforurbanand/orhistoricalarchaeology(ca.1100-1800AD),wheretheSingleContextmethodismorecommonlyused).

Following the availability of the annotated benchmark corpus, USW built a pilot Swedisharchaeological NER Pipeline targeting the six entity types (Context, Object, Material, Monument,Place, Time Appellation). The pipeline uses the OPEN NLP Tokeniser, Sentence Splitter, Part ofSpeechtaggerforSwedishandaGazetteercreatedfromthevocabularyfilesprovided.

Thiswas evaluated against the benchmark following the standardGATE evaluation procedure andprecision/recallmetricsforthe9documentswereproducedtogetherwiththeannotationsincontextoftheoriginaldocument.Theresultswerepromisingconsideringtheearlystageofthepipelineandtheunderpinningvocabularies.

TheevaluationresultswereanalysedbySNDandalsoUSWfromatechnicalperspectiveaccordingtothestandardNLPevaluationdimensions.Theevaluationisdiscussedbelow,togetherwithsuggestedpointsforfuturework:

• MissedtermsbyNLP:Thebiggestmisswasduetothelackoftermvariations,suchassingular/plural,definite/indefinite,possessive,andcompoundterms.Examplesinclude

Vocabularyterm Greylit.term Greylit.form

Nedgrävning nedgrävningar Plural

Stolpe Stolpar/stolppar Plural(misspelledplural)

3 Socken is an archaic name for the original rural church parishes, “kyrk-socken”. It also describes a secular area, a

sockenkommun ("rural area locality") or a taxation area, a jordbokssocken. The socken system was in many ways thepredecessor to modern municipalities. In 1862, the socken parishes in Sweden were abolished as administrative areas

duringmunicipalityreforms.Thejordbrukssockenterm("taxationarea")remainedinuseuntilthe"Reformforregistrationofrealproperty"1976–1995wascomplete.Nofurtheralterationstothesockennamesorbordersweremadeafterthis.Even though the term socken is no longer in use administratively, it is still used for cataloging and registering events,

artefacts and archives within the research fields of history, archaeology, botany, and history of languages (such astoponymyanddialectresearch).

Page 18: ARIADNE: Final Report on Natural Language Processing

Deliverable16.4:Finalreportonnaturallanguageprocessing January2017

18

Lager lagret Definite

störhål störhålets Possessive

kulturlager kulturlagerrest Compound

stolphål stolphålsbotten compound

störhål stör stem

Thisconfirmedaknownlimitationoftheinitialpilotpipeline,whichdidnotattempttomatchtermvariations.

• Missingvocabularyelements:Severalmissedtermswereduetoincompleteinitialvocabularycoverage,especiallyforMonumentsandObjectsandsometemporalterms.

• Compoundtermsaremissedandthusposechallengesforrecall:Thisisasimilarissuetothat

encounteredwhenworkingontheDutchgrey-literature.Anexampleofacompoundtermis‘fornlämningsområdet’. Configuring the system for partial (part of word) matching couldsupportmatchingof compound terms.However, thiswouldbring a riskof noise and falsepositivematching.Ratherthanenablingpartialmatchingonthewholerangeofvocabularyterms,partialmatchingcouldberestrictedtoasmallsub-setofcommonlyoccurringterminas compound parts. Thismight enable the capture amajority of compound caseswithoutintroducingtoomuchnoise.

• FalsePositives:Manualannotationisnotconsistentinsomecases:some(notall)ofthefalse

positive results are due to inconsistency in the manual annotation (as encountered withotherlanguages).Oneparticularcaseis‘Torv’orpeat.Inthisarticle,‘torv’isonlypresentasamaterialwithinadeposit.However,itisusedforC14analyses.Itispossiblethatitwasnotmarkedbecause itdidnotappear tobeofdirect interest.Thiscouldbe thecase forotherterms.Infuturework,arevisedandmorespecificsetofInstructionsforAnnotatorscouldbeconsidered for the Swedish context with amore specific description of the entities to beannotatedforSwedisharchaeologicalpractice.

Inotherfalsepositives,mostlynotduetoproblemswithmanualannotation,sometermsarevery generic, e.g.: Vad, Väg, Byggnad, Hus, Struktur. ‘Vad’ also has multiple meanings,including‘what’and‘ford’.Abettercontextawarenessmighthelptocatchfalsepositives.Insomecasesthenumberoffalsepositivescouldbereducedbylookingatcontextmarkere.g.Hyddacouldbecountedasamonumentunlessthewords‘I’or‘en’arepresentbeforetheterm itself. Another example is hög – it can mean ‘mound’/’heap’ or ‘high’. If it hasmeasurementssuchas0,2mbeforeit,theseindicateheight.

Moregenerally,contextcouldbeusedasindicatorsofwhenageneraltermisbeingusedinan archaeological sense; in/definite articles in from of some monument terms tend toindicate non-archaeological remains, while terms without definite articles are used forarchaeological features -e.g. vägsträcking,väg.This issomewhatspeculativecurrentlybutcouldbeatopicforfuturework.

• Revised user guidelines: Some false positives in the NLP outcomes are the results of

occasionallackofadherencetotheguidelinesinmanualannotation,whichcanbeimprovedwith practice, so to speak. In some cases, the entities in the guidelines (and in the NERexercisegenerally)couldbeclarifiedfortheSwedishcontext,egdoes’Monument(Complex)Types’ indicate that it is (partof)acomplex? Itwouldbeuseful tocreatespecificcategorydescriptionsfortheSwedishsituation(seebelowonMonumentandContextannotation).

Page 19: ARIADNE: Final Report on Natural Language Processing

Deliverable16.4:Finalreportonnaturallanguageprocessing January2017

19

• Context entities performed well, although Monuments had low recall (more vocabularyneeded)

• Material had good Recall, less good Precision: As with English and Dutch, there is theproblemofambiguityinwhetheratermistreatedasMaterialvsObjectsense.Theproblemincludes the termsKeramik,kvarts, flinta,kol,bränd lera, tegel, tall,ben, skärvsten,malm,porslin. The term ‘Kol’ could be removed from the objects vocabulary and replaced with‘kolbit’–charcoalpieces,toavoidconflictwiththematerialvocabulary.Againmorecontextaware NER would also help. This can include adjacent terms and consideration ofsingular/plural. If a number/quantifier is present before the term (e.g. ‘ben’, ‘sten’), thisusually indicatesobjectsratherthanmaterials.Another indicator iswhenadjectivesbeforethe termare in theplural form (e.g. ‘brändaben’),which also indicates quantity and thus(usually)objects.Articlesarealsoanotherindicatorofobjectsratherthanmaterials.

Additionally,fortheusecaseforagivenNLPpipelineshouldbeconsidered;isthedistinctionbetweenmaterialandobjectactuallyrelevantforthemainintendeduses?

• ManyNLPoutcomesmarkedaspartiallycorrecttermsareinfactmorespecificinstancesofexistingvocabularytermsandarethuscorrect.

• MonumentandContextannotation: thedefinitionof thesecategoriescouldbeclarifiedasthere may be some overlap – Anläggningar (equated with archaeological context in thisexercise) means a structure, building, installation, something which was constructed.Archaeologicalcontext,ontheotherhand,istightlylinkedtoastratigraphicevent,andmayormaynot include structures. Thus, a soil deposit is an archaeological context but not ananläggning, while a rubble wall foundation can be both an archaeological context and ananläggning.Astoneovenmayormaynotbeanarchaeologicalcontext in itself,but it isananläggning.Anexecutionsite isnotandarchaeologicalcontextbut it isananläggning.ThedistinctionbetweentheseentitiesshouldberevisitedforSwedisharchaeologywithupdatedvocabularies - it may be that the issue revolves around the treatment of grouping andphasing interpretation for NLP purposes versus the previous identification of stratigraphiccontexts.

• The context vocabulary list needs to be reviewed forwords such as ‘långsida’, ‘gavel’ andotherarchitecturalcomponents,whichshouldeitherbecompletelyremovedorelsemovedtothemonuments/objectslists.

• Placenames - someplacenames, like ‘Mark’, ‘Ny’ and ‘Vara’ are also very commonwords,andcouldthusleadtolowprecision.Recallseemstobelowmostlyduetomanylow-levelornon-administrativeplacenamespresentinthetexts.

• Dating - Numerical dates should be catered for. A preliminary glossary of date contextmarkerscanbefoundinAppendix2:

• Negationdetectionwouldbeausefuladdition.

• TablesposechallengesandmeritaspecificNLPmodule.

• Abbreviations and dating terminology (including ± symbol denoting C14 dates) would beuseful additions to the vocabularies. Geological periods, minor place names, informalregionalnames,andpotteryphase/typologynamesmaybeinterestingvocabulariestoaddinthefuture.

Page 20: ARIADNE: Final Report on Natural Language Processing

Deliverable16.4:Finalreportonnaturallanguageprocessing January2017

20

3.2 SwedishlanguagerevisedgeneralNLPpipeline

Taking account of the evaluation of the pilot Swedish NLP system by SND and USW, a revisedSwedisharchaeologicalNERpipelinewasproducedwithimprovedmatchingontermvariation(akeyissuebroughtupbytheevaluation).Thismakesuseofastemmer4(morphologicalanalyser)inorderto address the pilot system's limitations on term variation discussed above. The stemmer enablesmatchingonword root input rather thanon thewholestring, allowingmatchingof singular/pluraland other term variations from a single vocabulary entry. Although the quality of the stemmerdictatesthequalityoftermvariationmatching(withscopeforsomelossofrecall),itispreferred,interms of time scale and final result, to employ a stemmer than enriching a vocabularywith eachterm'svariations.ThisgeneralSwedishNLPpipelineisavailableasanARIADNEoutcomealongwiththeotherpipelinesdescribedhere.

4http://snowball.tartarus.org/algorithms/swedish/stemmer.html

Page 21: ARIADNE: Final Report on Natural Language Processing

Deliverable16.4:Finalreportonnaturallanguageprocessing January2017

21

4 Rule-basedNLPinvestigationsonspecificcasestudiesTwoadditionalcasestudiesconductedonspecificapplicationareasarebrieflyreportedhere.

4.1 Numismaticcasestudy

Natural LanguageProcessing techniqueswereemployedbyUSW (HypermediaResearchGroup) toextractnumismaticinformationfromasamplesetofsixEnglishlanguagereportsfromtheADSGreyLiterature library to demonstrate the potential of NLP in data integration. The resulting datawasexpressed in the same CIDOC CRM, AAT and Nomisma form used for the numismatic item levelintegrationcasestudy investigatedaspartofWP14.Anextract fromthisresultingCRMbasedRDFwas integrated into the FORTH-ICS case study demonstrator and it was found that the NLPtechniques had identified items from the report text not explicitly mentioned in the site recordmetadata.SeeARIADNED15.25foradiscussionofthe itemlevel investigation.TheNLPtechniqueswere slightly adapted from the informationextractionpipelineused in theSTARProject'sOPTIMAtoolkit6, including some grammatical patterns for Relation Extraction. This is capable of extracting'rich phrases' combining CRM semantic entities, such as 'medieval silver coin', 'late Roman coins','coinsdatingtoAD350–53','coinsbelongedtothesecondhalfofthe3rdcenturyAD'.TheNomismawithitsnumismaticvocabularyincludingcoindenomination,waspartoftheinformationextraction,eg yielding 'radiate of Tetricus I or II dating to around the AD 270s' (NER of Emperors was notincludedinthisexercisebutthatwouldformoneofthenextprioritiestoincorporate).

4.2 Data/NLPmultilingualcasestudyonitemleveldataintegration

Asa final integrative taskwithinWP16, itwasdecided to investigate a specific case studyof itemleveldata/NLPintegrationwithNLPoutputexpressedasRDFandmadeavailableforexplorationinan interactive demonstrator. Inspired by the DANSwork on dendrochronological analysis, a loosethemetoorganisethestudywaschosenbasedaroundarchaeologicalinterestinwoodenobjectsandtheir dating, as expressed in different kinds of datasets and reports. Accordingly, the NLP wasfocused on concepts relevant to this theme, such as samples, materials, objects and temporalinformation, togetherwith their connections. Theworkwas undertaken by USW on the technicalside, in collaboration with DANS and SND as regards Dutch and Swedish archaeological datasets,reportsandvocabularies.This is verymuchanexploratoryprototype (with limited resources),onenotintendedtorevealnewscientificfindingsbutrathertoshowfuturepossibilitiesofthesemantictechniquesforlargerscaleeffortsonmultilingualintegrationofdatasetswithreports.Greyliteraturereportsareavastbutunder-utilisedresourcewhichcanbeusedtogetherwithdatasetswheretheyexist formeta researchand large scale studies.NLPhas thepotential toextractmore informationfromthereportsthancanbefoundinthemetadataalone.The multilingual demonstrator aims to investigate the potential for NLP information extractiontechniquestoachieveadegreeofsemanticinteroperabilitybetweenarchaeologicaldatasetsandthetextual content of grey literature reports. The case study has a broad theme relating to woodenmaterial including shipwrecks, with a focus on indications of types of wooden material, samplestaken,woodenobjectswithdatingfromdendrochronologicalanalysis,etc.

The resources comprise English andDutch languagedatasets and grey literature reports, togetherwith Swedish archaeological reports. The ADS Grey literature archives were searched for reportsrelating to "dendrochronology" and 11 documents were retrieved. DANS provided a sample of 9

5ARIADNED15.2forthcomingathttp://www.ariadne-infrastructure.eu/Resources6VlachidisA,TudhopeD.2016.Aknowledge-basedapproachtoInformationExtractionforsemanticinteroperabilityinthe

archaeologydomain.JournaloftheAssociationforInformationScienceandTechnology,67(5),1138–1152,Wiley

Page 22: ARIADNE: Final Report on Natural Language Processing

Deliverable16.4:Finalreportonnaturallanguageprocessing January2017

22

Dutch reports and SND provided 5 Swedish reports based on a focus on wood material anddendrochronologicalanalysis.ADSdatasetsincludedtwoshipwreckdatasets(NewportMedievalShipandtheFlowerofUgie)andtheVernacularArchitectureGroupdatabase.DANSfacilitatedanextractfromthedatabaseoftheinternationalDigitalCollaboratoryforCulturalDendrochronology(DCCD7).ACRMbaseddatamodelwasdesigned to connect thedataelementsand theNLPentities,whichinclude Object, Sample, Material, Place (in some cases), date ranges. A spine vocabulary wasidentifiedfromtheAAThierarchiesformaterialandobjects.CorrespondingDutchtermsfortheAATconcepts mostly existed already from WP15 mapping work, while Swedish terms for the AATconceptswereproducedbySND,aspartoftheirWP16effort.

AnextractofrelevantsectionsfromtheSwedishreportswasproducedmanuallyforthecasestudy.The Dutch pipeline explored the potential for automatic detection of dendrochronology relatedsections based on a bespoke glossary. The Dutch pipeline has a component that detects sectionsrelevanttothecasestudy.Whilethiswasasimpletechniquebasedonafixednumberofsentencessurrounding a glossary lookup, it proved sufficient for the exploratory case study.More elaborateversions would involve additional rules and pattern detection. In future work, the automaticdetectionofsectionstoemphasiseforNLPorconverselytoavoidwouldbeausefulcomponent,inlightofthelengthofsomearchaeologicalreports.

Following formative evaluation, certain English andDutch termswere excluded frommatching (ieactingas'stopwords')duetotheirhighpotentialforproducingfalsepositives.PolysemousSwedishterms,suchaslager,mightalsobegoodcandidatesforstopwords,duetotheirambiguity.Aswiththe general NLP pipelines, improvements to the Dutch and Swedish Part of Speech taggers andStemmerswouldbeanimmediatefocusinfuturework,togetherwithanimprovedglossaryofdateindicatorsandcontextmarkers.Forexample,contextmarkersforsingleyears(asopposedtootherinstancesofintegers)wouldbeveryvaluable.Anotherprioritywouldbeacarefulmanualannotationset to drive systematic evaluation,whichwould also require an iterative approach to refining theguidelinesformanualannotation.Vocabulariesneedbeimprovedusingtermsfromalargercorpusthan the one used in this limited study. Resources such as glossaries, word lists, trade/thematiclexicons, and other such resources could be used to enlarge the vocabularies being used. Suchenlargementswouldneedtobeevaluatedagainstmanuallyannotatedtextssothatprecisionisnotnegativelyaffected.

Aswiththegeneralcase,ambiguitybetweenmaterialandobjectsensesprovedchallenginginsomecases (forbothmachineNLP and human annotation). For example, in the Swedish reports, itwassometimesdifficulttodistinguishbetweenatree(e.g.tall,orpinetree)andmaterialmadefrompine.Ifthedistinctionisconsideredimportant, itwouldbeusefultomakeuseofcontextmarkers intheNER,whichmightallowforanimprovementinprecision.Theterm‘tall’byitselfcanbeambiguousdue to the interchangeableway thatmaterials and objects are discussed in archaeological report.However,wordsensedisambiguation techniquescouldbeused to resolve suchambiguitiesas, forexample,inthephrase‘avtall’,or‘ofpine’,wherethetermismorelikelytoindicatematerialthanobject.Inaddition,thedefiniteformofatreename(egtallen–thepine)seemstobeoftenusedinSwedishtoconnotethematerialused,ratherthanaspecifictree.

FutureeffortswillhavethechoicetofocusoneitherthematicNLP(asinthiscase)ormoregenericarchaeologicalNLP(orbothcases).Followingathemehasbeenuseful inthisexercise,as ithelpedcontain the problem to a specific theme within archaeology. Such smaller themes could bedeveloped by smaller groups, and could facilitate subsequent efforts made to create a moreencompassing tool. A relatively broad theme, as in this exercise, arguably posesmore challengesthanmoreconcretetopics,suchastheexerciseoncoinsreportedintheprevioussection.

7DCCDdatabase-http://dendro.dans.knaw.nl

Page 23: ARIADNE: Final Report on Natural Language Processing

Deliverable16.4:Finalreportonnaturallanguageprocessing January2017

23

ExperimentalNERpipelinesweredevelopedfortheaboveentitiesandvocabularyinEnglish,Dutch,Swedish, building on thework for the general NLP pipelines described above. The atomic entitiesresulting from the NER were combined into CRM properties, where considered appropriate. Theworkisstillatanearlystageandresultsarepreliminarywithneedoffurtherrefinementtoreducefalsepositivesandextend thevocabulariesused.Morework isalsoneededonRelationExtraction(RE)algorithmsthatassertCRMpropertiesbetweenconnections.TheEnglishlanguageNLPoutputisbased on grammatical patterns for Relation Extraction, building on previous work for the STARproject8.FortheDutchandSwedishreports,simplerNERtechniquesareusedthatdonotattemptconnectionsbetweenentitiesextracted(otherthanoccurrencewithinthesamesentence).Apriorityin futurework is to apply a pattern-based information extraction approach toDutch and Swedishreportssimilar to theEnglish languagework. ImprovedNLPpipelinescouldbesubstituted intothedataintegrationworkflowdevelopedforthecasestudy.

Illustrativeexamplesof thevariousNLPoutput (withcolourcoding indicatingthesemanticentitiesidentified)includethefollowing:

The calculation of the common felling period for each dated timber from this floor suggests aconstructiondatebetweenAD1682andcAD1699.

The fellingdateofAD704/5 identified for these timbers indicates that the structurewas inuseduringtheearlyeighthcentury.

ThissamplehasacalculatedfellingdaterangeofAD1609toAD1645.

TwotimbersdatedfromthewestwingroofproducefellingdatesinthewinterofAD1735/6andthespringofAD1736.

Theresults identifiedthatoneboardwasdatablebytree-ringdatingtechniques,withthisboardfelledineitherthelate-sixteenthcenturyorearlyseventeenthcentury.

Manyoftheoakboardsappearedfromexternalexaminationtobetimberssuitableforanalysis;

DendrochronologischonderzoekdoorStichtingRINGinAmersfoortwijstuitdatdeeikwaaruitdepaalisvervaardigd,isgeveldtussen55en69naChr.

Dedateringenopbasis vandendrochronolo-gischonderzoekvanhethoutuitde sporen6en9wijzenuitdateeneventueledereparatievoor62naChr.

Wel valt opdathetaantal dateerbareeiken toeneemtnaarmatewedichter indebuurt vande4600-4550BCkomen.

Eenvandepaalgenomendendro-monsterleverdeeenkapdatumvan1516±6.38

Tvåprovertogsfrånåtelpålenochkundegenomendendrokronologiskanalysdaterastill1730-tal.

Samtligaprovdaterasochtäckerdenmestexaktadateringenvinterhalvåret1677/78.

Prov1somvarbearbetatvirkeavekdateradestillfällningsårvinterhalvåren1536/37.

Den större fartygslämningen , daterad till tidigt 1800-tal har troligen varit en skuta förkustseglationellerfiske.

8VlachidisA,TudhopeD.2016.Aknowledge-basedapproachtoInformationExtractionforsemanticinteroperabilityinthe

archaeologydomain.JournaloftheAssociationforInformationScienceandTechnology,67(5),1138–1152,Wiley

Page 24: ARIADNE: Final Report on Natural Language Processing

Deliverable16.4:Finalreportonnaturallanguageprocessing January2017

24

The aim of this exploratory case study was to investigate the potential of the various semantictechniquesemployedinamultilingualdemonstrator.Theresultssuggestthattheapproachisworthcontinued investigation. The NLP techniques were able to generate CRM/AAT based output fromEnglish,DutchandSwedishtextsinthesameformatastheinstancedataextractedandmappedtotheCRM/AAT.Ofcourse,theNLPderivedRDFstatementsdonotcarrythesamedegreeofreliabilityasthosederivedfromthedatasetsandfalsepositivescanbefound.Infuturework,someindicationof the provenance (and hence reliability) of the RDF data could be included in the CRM model.Nonetheless,theprincipleofsemanticdataintegrationfromtextdocumentsanddatabaseshasbeendemonstrated.

TheoutputfromGATEisexpressedasRDFusingtheCIDOCCRMmodelwithconnectionsalsomadetotheAAT.Thiswasachievedbymeansofanewmapping/extractiontool,STELETO,developedbyUSW forARIADNEandavailable as open source9. STELETO is a 'lite' cross platformversionof theSTELLAR.CONSOLEapplicationdevelopedfor theSTELLARProject1011 -asimplerversionofSTELLARwithan improvedcommand line functionality.STELETO isageneraldelimited textdataconversionutilitythatcanbeusedformappingandextractinginstancedatatotheCRMviatemplatesthathidethecomplexityoftheCRMmodel.ItwasusedfortheCRMbasedRDFforboththedatasetsandtextreports.

Themultilingualdemonstratorcross-searchesoverthedatasetsandtextreportsviaSPARQLqueries.Output is expressed as RDF using essentially the same CIDOC CRM model as used for the CoinsDemonstratorwithmappingsmadetotheAAT.Theoutcomeisapilotdemonstratorofthetechnicalpossibilities,operatingoveraLinkedDataexpressionof theoutput,whichofferscrosssearchoverboth the datasets and text reports via an interactive, browser based SPARQL query builder. ItdemonstratesthepotentialforalternativeuserinterfacestoaplainSPARQLendpointbuildingonthe'widget'techniquesdevelopedintheSENESCHALproject12.TheworkisongoingandwillbereportedintheforthcomingARIADNEdeliverableD15.3ReportonSemanticAnnotationandLinking13.

4.3 NLPpipelinesmadeavailableforfurtherwork

USWdevelopedthreeseparategeneralarchaeologyNamedEntityRecognitionpipelinesforEnglish,Dutch,andSwedishlanguagesusingtheGATEplatform.Thepipelinesarerule-basedanddrivenbydomainvocabularyexpressedasOWL-LiteontologiesorflatgazetteerlistsasinthecaseofSwedishpipeline.ThevocabularyoftheEnglishpipelineoriginatesfromtheHeritageDatavocabularies14,theDutch vocabulary from Erfgoed Thesaurus15, and the Swedish from SND (in house resources). Allthreepipelines focuson the recognitionof the followingentities;Archaeological context (i.e.post-hole,ditchetc),PhysicalObjects,Materials,Temporal (as inPeriodsandas inNumericalDatesbutnot contemporary dates) and Monument types. In the case of Dutch and Swedish reports,PlacenamesandGridreferencesarealsoaddressed.

Inaddition,asdescribedabove,experimentalEnglish,DutchandSwedishlanguagepipelinesforthedata/NLP data integration case study on the wood / dendrochronology theme were developed,

9STELETOhttps://github.com/cbinding/STELETO10STELLARProjecthttp://hypermedia.research.southwales.ac.uk/kos/stellar/11BindingC.,CharnoM.,JeffreyS.,MayK.,TudhopeD.:TemplateBasedSemanticIntegration:FromLegacyArchaeological

DatasetstoLinkedData.InternationalJournalonSemanticWebandInformationSystems,11(1),1-29.IGIGlobal.12SENESCHALProjecthttp://hypermedia.research.southwales.ac.uk/kos/SENESCHAL/13ARIADNED15.3forthcomingathttp://www.ariadne-infrastructure.eu/Resources14http://www.heritagedata.org/blog/vocabularies-provided/15http://www.erfgoedthesaurus.nl/

Page 25: ARIADNE: Final Report on Natural Language Processing

Deliverable16.4:Finalreportonnaturallanguageprocessing January2017

25

togetherwith an English languagepipeline for thenumismatic data integration case study.All theNLPpipelinesarefreelyavailableasopensourceARIADNEoutcomesofWP1616.

TheplanforfutureworkistousetheAATvocabularyasaspineforcrosssearchingamongdifferentlanguages.An initial studyon theMaterialentity showed that theAATcoverage for thisparticularentity type for English and Dutch is around 80%. A sub-set of the Swedish vocabulary has beenmanuallymappedtoAATconcepts,aspartoftheworkforWP16.

16English,Dutch,Swedishrule-basedNLPpipelineshttps://github.com/avlachid/Multilingual-NLP-for-Archaeological-Reports-Ariadne-Infrastructure

Page 26: ARIADNE: Final Report on Natural Language Processing

Deliverable16.4:Finalreportonnaturallanguageprocessing January2017

26

5 MachineLearningAPIfortheADSGreyLiteratureLibrary

5.1 Introduction

Asreported inD16.2, theADShasbuiltupontheworkand lessons learned fromtheArchaeotoolsproject, to further develop NLP tools and help the archaeological domain better access the vastresourceofunstructureddigitaldataavailabletoarchaeologistsintheformoftext.Thistexttypicallyexists inPDF,MSWord,orplaintext fileswithintheADSLibraryofUnpublishedFieldworkreports(alsoknownastheGreyLiteratureLibrary),digitisedjournalcollections,andreportsdepositedwithinprojectarchives.

TrainingdatadevelopedbyArchaeotoolswasappliedtoaclassifier.Aclassifierisamachinelearningtoolthattakesdataitemsandplacesthemintoclassesresultinginastatisticalmodel,whichisusedtoextractentitiesfromenteredtext.Afteranevaluationofclassifiers,theCRFclassifierwaschosen,asitwaseasiertoimplementintothewebapplicationandrequiredlesscomputingtimetoproduceresults.Themodelsbuiltby theclassifierwithgazetteerswere thendirectlyapplied to theunseendata from grey literature reports. As there is currently no Gold Standard for archaeological greyliterature,agroupofreportsfromtheNorthYorkshireregion(knowingtherehadnotbeenprevioustrainingongrey literature fromaNorthYorkshiredataset)were chosenandmanually scored. Thegazetteers were especially useful for improving extraction performance, when applied to moreunseen corpora. This confirmed there is substantial overlap of information from various corporawithinthegrey literature.Totrain theCRFclassifier,awindowsizeof fivesurroundingtokensandthefollowingfeaturesetwasused:

• N-Gramswithmaxlengthofsixtokens(i.e.contiguoussequenceofwords)

• Exacttokenstring

• Featuresfrompreviouswordclasssequence

• ArchaeologicalGazetteer

Aprototypewebapplication interfacewasdeveloped for testinganddemonstrationpurposesandalso reported in D16.2. The prototype allowed domain experts to annotate reports, generateresource discovery metadata where none exists, and generate metadata which can be used tofurthertraintheclassifiers.Theapplicationwasdesignedtoallowtexttobeenteredintoan“inputtextarea”,orafile(PDForDOC)tobeuploadedtotheapplication.Whenusingthelatteroptiontheprototypeextracted textoutofaPDForDOC fileautomatically,anddisplayed it in the ‘input textarea’.Whileonlyaprototype,theinterfaceshowedhowtheAPImightbevisualisedifimplementedin an existing interface,whichmay be useful for producingmore training data in the future, as italloweduserstocorrectresultswhichcanthenbeusedbythetrainingclassifier.

Toextractthepossiblemetadatafromtheuploadeddocuments,anNERmodulewascreatedandtheprototypewasbuiltasasimpleJavaapplicationwrittentoutilisedtheCRFclassifier.Whentextwasenteredintothe“inputtextarea”entitieswereextractedfromthetextusingtheNERmodulebasedontheCRFclassifier.Theextractedentitiesweredisplayedassuggestedmetadatatotherightoftheenteredtext,anduserscanassesstherelevanceoftheextractedentities.ThewebapplicationalsodetectedandextractedUKgridreferencesusingmanuallycraftedregularexpressions.Extractedgridreferences were automatically verified using UK Geospatial data held within an Oracle Spatialdatabase, where incorrect grid references can be filtered out from the result. By clicking on themagnifyingglassiconsbesideeachentitygenerated,userscouldjumpdirectlytothewordinthetextfromwhichtheresultwasderived.

Page 27: ARIADNE: Final Report on Natural Language Processing

Deliverable16.4:Finalreportonnaturallanguageprocessing January2017

27

Figure1:ScreenshotoftheprototypeADSwebapplicationshowingentitiesextractedfromthetext.

TheentitiesextractedbytheNERmoduleusingthismethod(usingarelativelyshortpieceoftext),specificallycomposedtoprovidean introductoryoverviewofanarchive),producedverysuccessfulresults, and the relatively smallnumberofentitieswereeasy toviewandmanagewithin thewebapplication by a user, although this becamemore complicatedwhen testedwith a larger body oftext. For a full analysis and examples of the entities extracted using the NERmodule, please seeD16.2.

5.2 NamedEntityRecognition(NER)WebService

In D16.2, it was stated ADS planned to continue development of the web application, but afteradditionalconsiderationitwasdecidedfurtherrefinementoftheinterfacewouldbelessusefulthanthecreationofanNERWebServiceAPI,whichcouldbemadefreelyavailabletothearchaeologicaldomain.Subsequently,developmenttimewasfocussedoncreatingandrefininganAPIthatallowsusers to submit NER tasks to an ADS server, which then returns a set of terms, including theircategoryandoffsets,whichdeveloperscanincorporateintotheirexistinginterfaces.

TheAPI follows commonpractice for a RESTfulHTTPweb service.Users submit a task and clientsPOSTJSONtoanAPIendpoint. If successful, itwill returnHTTPstatus200,andreturnJSON in theresponse.Dependingonthecomplexityofthetaskandlengthofthecontent,theAPImayreturntheresultasynchronously,inwhichcasetheresultsarenotimmediatelyavailable,andadelaymustbeimplementedonthedeveloper’sendaftersubmittingatask.

APIendpointandsupportedmethods

AllAPIURLsarerelativetotherootendpointontheADSserver:

• POSThttp://ads.ac.uk/nlp/api/nse/analysis--forsimpletextanalysissynchronousoperation,returnsasetofnamedentities

Page 28: ARIADNE: Final Report on Natural Language Processing

Deliverable16.4:Finalreportonnaturallanguageprocessing January2017

28

• POST http://ads.ac.uk/nlp/api/nse/asyncAnalysis -- for lengthy analysis and asynchronousoperation,returnsasetofnamedentities

TheAPIprovidesabasicHTMLinterface:

http://ads.ac.uk/nlp/demo.jsf

If an endpoint supports POST, it can be opened in a web browser, and raw JSON data can besubmitted.Thedemoprogramimplementedpreviouslyisnowusingtheseservices.

Figure2:ScreenshotofthesimpleAPIHTMLinterface.

Createatask

BelowisasimpleexampleofatestJSONrequest.Itincludestwoparameters,botharemandatory.

content string Textcontenttobeanalysed

isFragment boolean InformservicetoaddstringmatchingoperationforextremelyshortcontentwherecontextistooshorttoprovideusefulinformationforNERtask

Task.json

{

"Content":

"ThevarioussitesthatButserAncientFarmoccupiedovertheyearswereall,inonewayoranother,basedontheconceptofdemonstratingwhatafarm,whichwouldhaveexistedintheBritishIronAgecirca300BC,mighthavebeen like. Itwasfoundedin1972astheButserAncientFarmProjectandoccupiedsitesonLittleButserHill,HampshireUK,theso-calledDemonstrationSiteinthegroundsofQueenElizabethCountryPark,HampshireandfinallyitmovedtoitspresentsiteatBascombDownin1991.TheworkwasextendedtoincludetheconstructionofaRomanVillain2002.",

"isFragment":"false"

}

Page 29: ARIADNE: Final Report on Natural Language Processing

Deliverable16.4:Finalreportonnaturallanguageprocessing January2017

29

Thefollowingcommandcantesttheservice:

curl-H"Content-Type:application/json"-XPOST-d‘{json}’http://ads.ac.uk//nlp/api/nse/asyncAnalysis

SampleResponse:

{"Entities":

{"term":[{"category":"placename","value":"ButserAncientFarm","startOffSet":23,"endOffSet":42},{"category":"subject","value":"farm","startOffSet":145,"endOffSet":149},{"category":"temporal","value":"BritishIronAgecirca300BC","startOffSet":183,"endOffSet":212},{"category":"placename","value":"ButserAncientFarm","startOffSet":266,"endOffSet":285},{"category":"placename","value":"LittleButserHill","startOffSet":316,"endOffSet":334},{"category":"placename","value":"Hampshire","startOffSet":431,"endOffSet":440},{"category":"placename","value":"BascombDown","startOffSet":485,"endOffSet":497},{"category":"temporal","value":"Roman","startOffSet":562,"endOffSet":567},{"category":"subject","value":"Villa","startOffSet":568,"endOffSet":573},{"category":"temporal","value":"2002","startOffSet":577,"endOffSet":581}]},"

Date":"2016-10-04T08:52:47.263Z"

}

category pre-defined categories such as the locations, subject, andperiod

value Identifiedtokens

startOffSet/endOffSet stringstartandendindex

Fromthesmallparagraphofsampletext,theservicehassuccessfullyrecognised:

Placename:ButserAncientFarm

Placename:LittleButserHill

Placename:Hampshire

Placename:BascombDown

Subject:farm

Subject:Villa

Temporal:BritishIronAgecirca300BC

Temporal:Roman

Temporal:2002

Asstated inD16.2,ADSplannedtotest theAPIaspartof theredevelopmentof theOASISsystem(theonline system for indexing archaeological grey literature in theUK). The aimwas to allow anarchaeologist to upload a report to OASIS, and by choosing to use the NER service, be able toautomaticallyextract suggestedmetadata for the report.Themetadata could thenbeacceptedorrejected by the user and then automatically populated into the correct fields within OASIS.Unfortunatelythetimeframeforthismajorre-developmentprojectacrossseveralUKorganisations

Page 30: ARIADNE: Final Report on Natural Language Processing

Deliverable16.4:Finalreportonnaturallanguageprocessing January2017

30

has been delayed, and this testing was not possible within the ARIADNE project. The API wascirculated to ARIADNE partners for review however, and both the University of SouthWales andINCIPITCSICtestedtheserviceandprovidedinterimfeedback.Itwasfoundthatwhiletheservicedidnotreturnanyfalsepositives,itfailedtoreturnallpotentialpositives.Thiswouldindicatethatwhilethemetadatageneratedby theservice is reliable, itmaynotbecomplete. Itwasdetermined thatthiswaslikelyduetoaneedformoretrainingdata,and/oranadjustmenttothealgorithm.ADSwillcontinue towork on the service beyond the completion of the ARIADNE project, to see if furtherimprovement is possible. The service is currently freely available for use by the archaeologicalcommunity,andisoneoftheservicesdevelopedthoughARIADNE,butavailabletoall.

Page 31: ARIADNE: Final Report on Natural Language Processing

Deliverable16.4:Finalreportonnaturallanguageprocessing January2017

31

6 ConclusionTheARIADNEpartnersinvolvedinthisdeliverablecontinuedtoexploreNLPwiththeaimofmakingtext-basedresourcesmorediscoverableanduseful.Thepartnershavespecificallyfocusedononeofthe most important, but traditionally difficult to access resources in archaeology; the largelyunpublished reportsgeneratedbycommercialor“rescue”archaeology, commonlyknownas“greyliterature”.Thepartnersexploredaspectsofrule-basedandmachinelearningapproaches,theuseofarchaeologicalthesauriinNLP,andvariousInformationExtraction(IE)methods.

USWextended their English language rule basedmethodsusing theGATE toolkit forNER (NamedEntityRecognition)toDutchandSwedish languagegrey literaturereports, incollaborationwithLUandDANS(Dutchreports)andSND(Swedishreports).Thismadeuseofglossariesandthesaurifromthe partners, including the Dutch Rijksdienst Cultureel Erfgoed (RCE) Thesauri. The process ofimporting the thesauri resources into a specific framework (GATE), and the suitability andperformance of the selected resources when used for the purposes of Named Entity Recognition(NER)wereanalysed.

TheNERtechniqueswerefocusedonthegeneralarchaeologicalentitiesofArchaeologicalContext,Material,PhysicalObject(Finds),Monument,Place,andTemporal(TimeAppellation).Themethodsproved capable of extracting CIDOC CRM element and in some case studies Getty Art andArchitecture Thesaurus concepts, in addition to the native vocabularies. The English languageNLPpipeline has been evaluated in previouswork for the STAR Project and has gone through severaliterations. TheDutch and Swedishpipelineswereevaluatedaspart of theARIADNEwork and thefindings are reported in this deliverable. In total, three versions of the Dutch pipeline and twoversionsoftheSwedishpipelineweredeveloped.

GeneralarchaeologicalNLP(GATE)pipelinesforEnglish,DutchandSwedishhavebeendeveloped.Inadditionexperimentalpipelinesweredevelopedfor twoexploratory thematiccasestudiesondataintegration, where the output is expressed as RDF Linked Data via a CRM based data model. AnEnglish language pipeline is available for a numismatic case study. English, Dutch and Swedishpipeline are available for a case study of item level data/NLP integration on a loose themebasedaroundarchaeologicalinterestinwoodenobjectsandtheirdating,asexpressedindifferentkindsofdatasets and reports. Both case studies have resulted in interactive demonstrators operating overthe ARIADNE Linked Data Cloud. All seven pipelines are freely available as open source ARIADNEoutcomes.

WithintheconstraintsoftheresourcesavailableforWP16,therulesfortheDutchandSwedishNERandREpipelineswerenotaselaborateasthepattern-basedEnglishlanguagerules.Nonetheless,theoutcomes are promising and show thepotential for the applicationofNLPmethods toDutch andSwedish language reports. Furtherwork is neededbefore anoperational capability is achieved, asdiscussed above. In particular, work on enlarging the vocabularies available for the NLP andstructural modification of these resources would be helpful, including adapting the thesaurusterminologyinsomecasesforNLPpurposes.

Furtherdevelopmentof techniques for theannotationofcompoundnounformsare important forextending the English language techniques to Dutch and Swedish. Tables pose challenges in alllanguagesandmeritaspecialisedNLPmodule.Theambiguitybetweenmaterialandobjectinnaturallanguage use should be revisited. This has proved a problematic issue in each language for bothmachineandhumanannotators. Ifthedistinction is indeedimportant(itmaynotbedependingontheusecase)thenfurtherrefinementofNERtechniquesisrequired.Thiscouldincludeidentificationof contextmarkers for each case to informpattern based rules. Amore elaborate list of contextmarkers for dates would be a cost effective addition in light of the archaeological concern withdating.AnappropriatesetofexpertannotatedreportsisnecessaryforevaluatingandimprovingNLPtechniques.ThesetofentitiesforNERshouldberevisitedfortheintendedusecases.Effortshould

Page 32: ARIADNE: Final Report on Natural Language Processing

Deliverable16.4:Finalreportonnaturallanguageprocessing January2017

32

bedevotedtocreatingannotationguidelinestailoredtothecontextofeachlanguageandcreatingagoldstandardsetofannotatedreports.Thisshoulditselfbeevaluated;itissometimesthecasethatapparentfalsepositivesareinfactcausedbyomissionsinthemanualannotation.

The Archaeology Data Service (ADS) at the University of York, continued to develop and evaluatemachinelearning-basedNLPtechniquesandintegratethemintoanewmetadataextractionNamedEntity Recognition module, which takes previously unseen English language text as input, andidentifiesandclassifiesnamedentitieswithinthetext.Theoutputscanthenbeusedtoenrichtheresourcediscoverymetadata forexistingandfutureresources.Thefinaloutput for thisdeliverablewas intended tobeamore refinedWebapplication interface,butafteradditional consideration itwasdecidedthiswouldbelessusefulthanthecreationofanNERWebServiceAPI,whichcouldbeimplementedbyanyoneinthearchaeologicalcommunity.

Early work revealed the NER module worked successfully and produced correct entities for theclasses itwas trained to identify. TheNLP toolswereveryuseful forextracting resourcediscoverymetadata from unstructured archaeological data, particularly grey literature reports, for resourcediscovery indexing, where little or no metadata currently exists. From a data managementperspectivehowever,thelargequantitiesofentitiesextractedbytheNERmodulemaybetoolargeto effectively manage and this will need further exploration. The Web Service API is currentlyavailableforuseandintegrationintootherinterfaces,andwillcontinuetobedevelopedbeyondthecompletionoftheARIADNEproject.

The partners have successfully explored a variety of NLP techniques to make text-basedarchaeological resources more discoverable and useful. However, there are still areas requiringfurtherworktofullyachievethepotentialofthetechniquesexplored.

Negationinunstructuredarchaeologicaltextcontentwasobservedduringtheevaluation.Itiscrucialforanypracticalimplementationthisisrecognised.Forexample,ifanamedentityoccursinsidethescopeofanegationthenthatnamedentityshouldnotbeincludedintheoutput.Negationdetectionshouldbeexplored.

Some English language NLP research has begun to investigate the issue of negation detection inarchaeological grey literature reports, with a view to distinguishing a finding of evidence, forexample, of Roman activity from statements reporting a lack of evidence, or no sign of Romanremains. A technique previously used in the biomedical domain was adapted to archaeologicalvocabularyandwritingstyle.EvaluationonrulestargetedatidentifyingnegatedcasesoffourCIDOC-CRMentitiesgavepromisingresults,Recall80%andPrecision89%17.Furtherresearchisneededonnegation detection (e.g. a negative finding) and the ability to discriminate in reports betweenimportantfindingsofarchaeologicalevidenceandmentionsinpassingoflessimportantinformation.Thisispickedupbelow.

Feedbackfromdomainexpertssuggeststhatalthoughtheentitiesdetectedbythesystemarevalidterms,somearenotconsideredimportant.Although,wedonotthinkthisisaproblemfromanNLPperspective,nevertheless, it isdesirableforthesystemtobemoreselective.Sofar,workhasbeenfocusedonNER,butitmaybepossibletosolvethisissueusingtechniquesfromEntityLinking(EL).TheproblemisdistinctfromtheNERmodule,asitdoesnotidentifytheoccurrenceofthe“names”,buttheirreference.Inordertobuildsuchasystem,aknowledgebaseisneeded.ItmaybepossibletodevelopthisknowledgebasefromavailablearchaeologicalLinkedOpenData.EntitylinkingcanbedefinedasmatchingatextualentitydetectedbytheNERmodule,toaknowledgebaseentry,suchas

17VlachidisA,TudhopeD.2015.Negationdetectionandwordsensedisambiguationindigitalarchaeologyreportsforthepurposesofsemanticannotation.Program:electroniclibraryandinformationsystems49,2(2015),118-134.

Page 33: ARIADNE: Final Report on Natural Language Processing

Deliverable16.4:Finalreportonnaturallanguageprocessing January2017

33

aLinkedDatanodethatisacanonicaltermforthatentity.However,entitiesareoftendetectedbytheNERmodulewhich have different surface forms, including abbreviations, shortened forms, oraliases.Therefore,ELmustfindanentrydespitechangesinthedetectedstringbytheNERmodule.EntityAmbiguity (EA) resolution is another problem thatwill need to be resolvedwhenusing thistechnique. For instance, “Roman”, can match multiple Linked Data entries as either “subject” or“temporal”. The last difficulty is the absence of the entity in the knowledgebase. Processing largetextcollectionsguaranteesthatmanyentitieswillnotappearintheLinkedData,sothesystemmaynotbeabletocopewiththissituation.Addressinga“negativesample”couldbeusedwhencreatingthetrainingdata,asopposedtothepositivesamplestakenduringtheoriginal trainingof thedatausedforthistool.

Page 34: ARIADNE: Final Report on Natural Language Processing

Deliverable16.4:Finalreportonnaturallanguageprocessing January2017

34

7 Appendix1:InstructionsforAnnotatingGreyLiteratureDocuments

The manual annotation task aims to annotate grey literature documents with respect toarchaeologicalconcepts that relate toculturalandheritagedata.Theannotation is focusedontheNamedEntityRecognition(NER)taskandparticularlyintheidentificationofthefollowingconcepts;

• TimeAppellations,• ArchaeologicalObjects• Materials• Places• Monument(Complex)Types• ArchaeologicalContexttypes(alsoknownasfeatures).

The annotation should target the above types in “isolation”, following the aims of NER, thusactivities,relationshipsandeventsarenotinscope.Actorsarenotinthescopeofthecurrenttask.

Itisproposedtousethefollowinghighlightcolourstomarktheannotations.• TimeAppellations(Blue)• ArchaeologicalObjects(Brown)• Materials(Purple)• Places(Yellow)• Monuments(Green)• ArchaeologicalContext(Light-Green)• Negation(Red)

Youmayuseanyotherhighlightcoloursofyourchoice formarking theannotations. In thiscasepleasegiveacolourkeydefinitionatthetopofthedocument.

AnnotationPrinciples

Annotatorsshouldproducetheirmanualannotationswiththefollowingprincipalsinmind.1. Negation Detection: Any of the above entities that are negated should be annotated as

Negation.Forexample“Noevidenceofpottery”shouldbeannotatedasNegation.Thespanof this particular annotation types should cover the WHOLE sentence clause denotingnegation.

2. Toconsiderhowrelevantistheentitytotheoveralldiscourse.Topicalitycouldaffectcasesofambiguitye.g.PhysicalObjectsorMaterials.Forexampletheterm'brick'caneitherreferto a material (a brick wall) or to a physical object (a brick found in context). Annotatorsshoulddecideontheconceptualalignmentoftermsthatcanbeeithermaterialsorphysicalobjects. Other case of topicality might affect Place names which can also be as peoplesurnames(commoninEnglishnotsosureifthisisthecaseinDutch).

3. Annotators should consider, plural when applicable as well as spelling variations andacronyms common in the archaeology domain eg CBM (cERAMIC Building Material).Compound words containing any of the targeted entities should also be annotated. Theannotation should span only on the part of compoundword relevant to an entity type. Ifmore thanoneentities are relevant then respective annotationhighlight colours shouldbeusedfordistinguishingthepartsofthecompoundword. CompoundworksarecommoninDutch (not that much in English). For example the word “steentijd vindplaatsen” shoulddelivertwoannotationspans(steentijdasTimeAppellation)and(vindplaatsenasPlace).

4. Annotators should consider conjuncted phrases. For example annotators should considerconjunctionsofthekind'EarlyRomantoLateRoman','Potteryandbrick',aswellas'workedflint','smallfinds'etc.

Page 35: ARIADNE: Final Report on Natural Language Processing

Deliverable16.4:Finalreportonnaturallanguageprocessing January2017

35

EntitiesAnnotation

Time Appellations: All time appellations both numerical and lexical. eg Roman, 1045 AD,early-midIronageetc

Archaeological Objects: Objects of archaeological interest such as finds, small find,architecturalelementsetc.

Materials:Objectsofarchaeological interest,contemporarymaterialof littlearchaeologicalinterestsuchasplasticshouldbeexcludedfromannotation.

Places:PlacesofarchaeologicalinterestandrelevantPlacenames.Gridreferencesmayalsobeannotated.

Monument(Complex)Types:Suchasbuildingtypesandarchitecturalfeatures

ArchaeologicalContexttypes:Contextsrevealedduringtheexcavationprocess,alsoknowsinDutchasfeatures,suchaspit,pitfill,deposit,andlargercontextgroupingsaspost-hole,post-holestructures,circularpitsetc.

Page 36: ARIADNE: Final Report on Natural Language Processing

Deliverable16.4:Finalreportonnaturallanguageprocessing January2017

36

8 Appendix2:InitialglossaryofSwedishdatecontextmarkers

±

B.P.

BP

e.Kr.

e.v.t.

EfterKristus

Eftervårtideräkning

evt

f.Kr.

f.v.t.

f.v.t.b

fvt

fvtb

FöreKristus

Förevårtideräkning

Medel

Medeltida

Sen

Sentida

v.t.

vt

Yngre

Ålder

Äldre