ANNIE and JAPE

GATE Training Course, 23 November 2006

Diana Maynard, Andrey Shafirin

GATE and Information Extraction

● Basic introduction to IE and GATE

● Overview of ANNIE

● JAPE: rule writing

● JAPE debugger

GATE and IE

● IE is one of the core tasks GATE is designed for

● IE is the basis for many other, more complex applications, e.g. semantic annotation

● Cornerstone of IE is Named Entity Recognition

A Typical IE System

1. Pre-processing
– format detection
– tokenisation
– word segmentation
– sense disambiguation
– sentence splitting
– POS tagging

2. Named entity detection
– entity detection
– coreference

Two Approaches to IE

Knowledge Engineering
● rule based
● developed by experienced language engineers
● make use of human intuition
● obtain marginally better performance
● development could be very time consuming
● some changes may be hard to accommodate

Learning Systems
● use statistics or other machine learning
● developers do not need LE expertise
● requires large amounts of annotated training data
● some changes may require re-annotation of the entire training corpus

Named Entity Recognition

● NE involves identification of proper names in texts, and classification into a set of predefined categories of interest.

● Three universally accepted categories: person, location and organisation

● Other common tasks: recognition of date/time expressions, measures (percent, money, weight etc), email addresses etc.

● Other domain-specific entities: names of drugs, medical conditions, names of ships, bibliographic references etc.

ANNIE

[Figure: ANNIE architecture. Input (URL or text) in a supported document format (XML, HTML, SGML, email, …) becomes a GATE Document, which passes through the ANNIE IE modules. Output: GATE Document, or an XML dump of the IE annotations. NOTE: in the diagram, square boxes are processes, rounded ones are data.]

ANNIE IE modules (processes) and the data they use:
● Unicode Tokeniser: character class sequence rules
● FS Gazetteer Lookup: lists
● Sentence Splitter: JAPE sentence patterns
● Hepple POS Tagger: Brill rules, lexicon
● Semantic Tagger: JAPE IE grammar cascade
● OrthoMatcher
● Pronominal Coreferencer: JAPE grammar

Unicode Tokeniser

•Bases tokenisation on Unicode character classes

•Language-independent tokenisation

•Declarative token specification language, e.g.:

"UPPERCASE_LETTER" "LOWERCASE_LETTER"* > Token; orthography=upperInitial; kind=word

Look at the ANNIE English tokeniser and at tokenisers for other languages (in plugins directory) for more information and examples
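
The idea behind the rule above can be sketched in plain Java. This is not the GATE tokeniser; it only illustrates classification by Unicode character class, which is what makes the approach language-independent. The class name, method name and all labels other than upperInitial are assumptions made for this sketch.

```java
final class OrthographySketch {

    /** Labels a word by the Unicode classes of its characters (hypothetical labels). */
    static String orthography(String word) {
        if (word.isEmpty()) {
            return "none";
        }
        int upper = 0;
        int lower = 0;
        for (int i = 0; i < word.length(); i++) {
            int type = Character.getType(word.charAt(i));
            if (type == Character.UPPERCASE_LETTER) {
                upper++;
            } else if (type == Character.LOWERCASE_LETTER) {
                lower++;
            } else {
                return "none"; // not a pure upper/lower-case letter sequence
            }
        }
        boolean firstUpper =
                Character.getType(word.charAt(0)) == Character.UPPERCASE_LETTER;
        // Mirrors "UPPERCASE_LETTER" "LOWERCASE_LETTER"* from the rule above
        if (firstUpper && lower == word.length() - 1) {
            return "upperInitial";
        }
        if (upper == word.length()) {
            return "allCaps";
        }
        if (lower == word.length()) {
            return "lowercase";
        }
        return "mixedCaps";
    }

    public static void main(String[] args) {
        for (String w : new String[] {"Brown", "GATE", "\u00C9lan", "iPhone"}) {
            System.out.println(w + " -> " + orthography(w));
        }
    }
}
```

Because the test is on character classes rather than on the ASCII range, an accented capital such as É counts as an upper-case initial without any language-specific rule.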

Gazetteer

● Set of lists compiled into Finite State Machines
● 60k entries in 80 types, inc.: organization; artifact; location; amount_unit; manufacturer; transport_means; company_designator; currency_unit; date; government_designator; ...
● Each list has attributes MajorType and MinorType (and Language):
city.lst: location: city: english
currency_prefix.lst: currency_unit: pre_amount
currency_unit.lst: currency_unit: post_amount
● Attributes are used as input to JAPE grammars
● List entries may be entities or parts of entities, or they may contain contextual information (e.g. job titles often indicate people)
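
A minimal sketch of how list attributes reach a lookup result, assuming an in-memory map instead of the finite state machines GATE actually compiles. The class, its methods and the sample entries are hypothetical; only the index line format comes from the slide.

```java
import java.util.HashMap;
import java.util.Map;

final class GazetteerSketch {

    /** Attributes shared by every entry of one list. */
    static final class ListInfo {
        final String majorType;
        final String minorType;
        final String language;

        ListInfo(String majorType, String minorType, String language) {
            this.majorType = majorType;
            this.minorType = minorType;
            this.language = language;
        }
    }

    /** Parses an index line of the form file:majorType[:minorType[:language]]. */
    static ListInfo parseIndexLine(String line) {
        String[] parts = line.split(":");
        String major = parts.length > 1 ? parts[1].trim() : "";
        String minor = parts.length > 2 ? parts[2].trim() : "";
        String language = parts.length > 3 ? parts[3].trim() : "";
        return new ListInfo(major, minor, language);
    }

    private final Map<String, ListInfo> entries = new HashMap<>();

    /** Registers every entry of a list under that list's attributes. */
    void addList(String indexLine, String... listEntries) {
        ListInfo info = parseIndexLine(indexLine);
        for (String entry : listEntries) {
            entries.put(entry, info);
        }
    }

    /** Returns the attributes for an exact entry, or null if it is not listed. */
    ListInfo lookup(String text) {
        return entries.get(text);
    }

    public static void main(String[] args) {
        GazetteerSketch gazetteer = new GazetteerSketch();
        gazetteer.addList("city.lst: location: city: english", "London", "Sheffield");
        ListInfo info = gazetteer.lookup("Sheffield");
        System.out.println(info.majorType + " / " + info.minorType);
    }
}
```

A JAPE pattern such as {Lookup.majorType == location} then tests exactly these attributes, not the list entry itself.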

The Named Entity Grammar

● JAPE phases run sequentially and constitute a cascade of FSTs over annotations
● hand-coded rules applied to annotations to identify NEs
● annotations from format analysis, tokeniser, POS tagger and gazetteer modules
● use of contextual information
● rule priority based on pattern length, rule status and rule ordering
● Common entities: persons, locations, organisations, dates, addresses.

Orthomatcher

● Orthographic coreference between annotations in the same document, e.g. Mr Brown, James Brown

● Matching rules are invoked between annotations of the same type, or between an existing annotation and an “Unknown” annotation

● The latter is the only case where an annotation type can be changed

● Lookup tables of aliases and exceptions (i.e. overriding of matching rules)

● Also pronominal coreference (see User Guide)

JAPE: a Jolly And Pleasant Experience

● Grammars (cascades of phases)
– Phases (lists of rules)
● Rules
– LHS (patterns)
– RHS (actions)
● Priority
– Implicit
   ● longest match
   ● first mention
– Explicit
   ● priority

LHS of JAPE rules

● The LHS of the rule contains patterns to be matched, in the form of annotations (and optionally their attributes).

● Annotation types to be recognised must be declared at the beginning of the phase

● Annotations may be combined using traditional operators [ | * + ?]

● There is no negative operator

● More than one pattern can be matched in a single rule

● Left and right context (not to be annotated) can be matched

Examples of LHS patterns

({Lookup.majorType == location}):loc

---------------------

({Token.string == "in"} | {Token.string == "by"})
({Year}):date

--------------------

(
  ({Lookup.majorType == jobtitle}):jobtitle
  {Surname}
):person

RHS of JAPE rules

({Lookup.majorType == location}):loc
-->
:loc.Location = {kind = "city", rule = "Location1"}

----------------------

(
  ({Lookup.majorType == jobtitle}):jobtitle
  {Surname}
):person
-->
:jobtitle.JobTitle = {rule = "PersonJobTitle"},
:person.Person = {kind = "Surname", rule = "PersonJobTitle"}

Complex RHS

● JAPE RHS is quite limited in what you can do
● But you can use any Java you like on the RHS of the rule
● Useful for e.g. removing temporary annotations and percolating and manipulating features from previous annotations
● Also means you can use JAPE for many other things apart from just creating annotations, e.g. counting things, manipulating the text, adding annotations to the document, etc.
● And you don’t have to be a Java expert to do it.
● Although it helps to have friends who are….

Example of using Java in a rule

Rule: FirstName
(
  {Lookup.majorType == person_first}
):person
-->
{
  // annotations bound to the label "person" on the LHS
  gate.AnnotationSet person = (gate.AnnotationSet)bindings.get("person");
  gate.Annotation personAnn = (gate.Annotation)person.iterator().next();
  // copy the gazetteer minorType into a "gender" feature
  gate.FeatureMap features = Factory.newFeatureMap();
  features.put("gender", personAnn.getFeatures().get("minorType"));
  features.put("rule", "FirstName");
  outputAS.add(person.firstNode(), person.lastNode(), "FirstPerson", features);
}

Available Java objects

● bindings: binding variables
● doc: GATE Document
● annotations: all GATE Document annotations
● inputAS, outputAS: phase input and output annotations
● ontology

See documentation for more details…..

JAPE Application modes

● Brill (fires all matches)
● First (shortest match fires)
● Once (Phase exits after first match)
● All (as for Brill, but matching continues from offset following the current one, not from the end of the last match)
● Appelt (priority ordering: longest match fires, then explicit rule priority, then first defined rule fires)

Note that prioritisation only operates within a single phase, not globally

{A}+ Application Modes

[Figure: spans matched by the pattern {A}+ over the input A A A under each mode: Appelt, Once, Brill, First, All.]
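
A rough sketch of what the mode descriptions imply for {A}+ over a run of adjacent A annotations. It is one reading of the slide text, not GATE's matcher, and every name in it is hypothetical. Once is left out because the slide does not say which match fires before the phase exits.

```java
import java.util.ArrayList;
import java.util.List;

final class ApplicationModesSketch {

    enum Mode { BRILL, FIRST, ALL, APPELT }

    /**
     * Spans "start-end" (end exclusive, counted in annotations) that the
     * pattern {A}+ fires on over n adjacent A annotations.
     */
    static List<String> matches(Mode mode, int n) {
        List<String> fired = new ArrayList<>();
        int pos = 0;
        while (pos < n) {
            switch (mode) {
                case BRILL:
                    // every match starting at pos fires; resume after the longest
                    for (int end = pos + 1; end <= n; end++) {
                        fired.add(pos + "-" + end);
                    }
                    pos = n;
                    break;
                case ALL:
                    // as Brill, but resume from the next offset
                    for (int end = pos + 1; end <= n; end++) {
                        fired.add(pos + "-" + end);
                    }
                    pos++;
                    break;
                case FIRST:
                    // the shortest match fires; resume after it
                    fired.add(pos + "-" + (pos + 1));
                    pos++;
                    break;
                case APPELT:
                    // only the longest match fires
                    fired.add(pos + "-" + n);
                    pos = n;
                    break;
            }
        }
        return fired;
    }

    public static void main(String[] args) {
        for (Mode mode : Mode.values()) {
            System.out.println(mode + ": " + matches(mode, 3));
        }
    }
}
```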

Example: “China Sea”

Rule: Location1
Priority: 25
(
  ({Lookup.majorType == loc_key, Lookup.minorType == pre})?
  {Lookup.minorType == country}
  ({Lookup.majorType == loc_key, Lookup.minorType == post})?
):locName
-->
:locName.Location = {kind = "location", rule = "Location1"}

Rule: Location2
Priority: 20
({Lookup.minorType == location}):location
-->
:location.Name = {kind = "location", rule = "GazLocation"}
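
When two rules in the same phase compete at one position, Appelt mode picks a single winner. The selection order stated earlier (longest match, then explicit priority, then first defined rule) can be sketched as a comparator. The Candidate type and the match lengths in main are assumptions for illustration; they are not taken from a real run over “China Sea”.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

final class AppeltChoiceSketch {

    /** One rule that matched at the current position (hypothetical type). */
    static final class Candidate {
        final String rule;
        final int length;    // size of the matched span
        final int priority;  // explicit Priority: value
        final int order;     // position of the rule in the grammar file

        Candidate(String rule, int length, int priority, int order) {
            this.rule = rule;
            this.length = length;
            this.priority = priority;
            this.order = order;
        }
    }

    /** Longest match first, then highest priority, then earliest defined rule. */
    static Candidate choose(List<Candidate> candidates) {
        Comparator<Candidate> appelt = Comparator
                .comparingInt((Candidate c) -> -c.length)
                .thenComparingInt(c -> -c.priority)
                .thenComparingInt(c -> c.order);
        return Collections.min(candidates, appelt);
    }

    public static void main(String[] args) {
        List<Candidate> candidates = Arrays.asList(
                new Candidate("Location2", 1, 20, 1),
                new Candidate("Location1", 2, 25, 0));
        System.out.println(choose(candidates).rule);
    }
}
```

Note that length is compared before priority, so a low-priority rule with a longer match still wins.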

JAPE Hints and Tricks

● JAPE is quite limited in some respects as to what can be done
– There is no negative operator
– It can be slow if it is badly written, e.g. ({Token})*
– Context is consumed, which can make rule-writing awkward
– Priority can be difficult to set correctly

● But fear not, there is generally a sneaky way around it…..

How to prevent a pattern from matching

Rule: disablePattern
Priority: 1000
(<pattern>)
-->
{}

● Instead of having a negative operator, we can simply add a high priority rule which does nothing when fired.

● This rule is preferred over a lower priority rule which performs the intended action, so the latter fires only when the disabling pattern does not apply.

How to play with input annotations

Input: Person Organisation VerbWork Split
…
Rule: RelationWorkIn
({Person} {VerbWork} {Organisation})
-->
{… /* create annotation of type “Relation” */ …}

● Use existing annotations to find relations
● We ignore Tokens to enable more flexibility, i.e. there could be additional words between the annotations specified
● Split ensures we don’t cross sentence boundaries
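
The effect of the Input declaration can be sketched as a filter applied before matching: annotation types not listed are invisible to the pattern, while listed types that the pattern does not mention (here Split) block it. This is a simplified model over a flat sequence of type names, not GATE code, and all names in it are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class InputFilterSketch {

    static final List<String> PATTERN =
            Arrays.asList("Person", "VerbWork", "Organisation");

    /** True if PATTERN occurs contiguously among the annotation types visible to the phase. */
    static boolean matches(List<String> annotationTypes, Set<String> input) {
        List<String> visible = new ArrayList<>();
        for (String type : annotationTypes) {
            if (input.contains(type)) {
                visible.add(type); // types outside Input are skipped entirely
            }
        }
        return Collections.indexOfSubList(visible, PATTERN) >= 0;
    }

    public static void main(String[] args) {
        Set<String> input = new HashSet<>(
                Arrays.asList("Person", "Organisation", "VerbWork", "Split"));
        List<String> sameSentence = Arrays.asList(
                "Person", "Token", "VerbWork", "Token", "Token", "Organisation");
        List<String> acrossSentences = Arrays.asList(
                "Person", "VerbWork", "Split", "Token", "Organisation");
        System.out.println(matches(sameSentence, input));
        System.out.println(matches(acrossSentences, input));
    }
}
```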

How to deal with overlapping annotations

● Because matched annotations are consumed, when two annotations overlap (e.g. in gazetteer lists), the second one will never be matched.

● E.g. for the string “hALCAM” with Lookups hAL, ALCAM, and CAM, ALCAM will never be matched

● Solution is to delete the annotations once matched, and then rerun the same grammar phase over the text

● The process may need to be repeated several times (determine by trial and error)
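
The delete-and-rerun approach can be sketched as a loop of left-to-right passes, assuming each pass fires the earliest available Lookup, consumes its span, and removes it before the next pass. The Lookup type and the offsets for “hALCAM” are written out by hand for this illustration; this is a simplified model, not GATE code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class OverlapRerunSketch {

    /** A gazetteer match with character offsets (end exclusive). */
    static final class Lookup {
        final String text;
        final int start;
        final int end;

        Lookup(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** One pass: fired Lookups consume their span and are deleted from 'remaining'. */
    static List<String> pass(List<Lookup> remaining) {
        List<String> fired = new ArrayList<>();
        int pos = 0;
        while (true) {
            Lookup next = null;
            for (Lookup l : remaining) {
                if (l.start < pos) {
                    continue; // starts inside text that was already consumed
                }
                if (next == null || l.start < next.start
                        || (l.start == next.start && l.end > next.end)) {
                    next = l;
                }
            }
            if (next == null) {
                return fired;
            }
            fired.add(next.text);
            remaining.remove(next);
            pos = next.end;
        }
    }

    /** Reruns the pass until every Lookup has been matched once. */
    static List<List<String>> runAll(List<Lookup> lookups) {
        List<Lookup> remaining = new ArrayList<>(lookups);
        List<List<String>> passes = new ArrayList<>();
        while (!remaining.isEmpty()) {
            passes.add(pass(remaining));
        }
        return passes;
    }

    public static void main(String[] args) {
        List<Lookup> lookups = Arrays.asList(
                new Lookup("hAL", 0, 3),
                new Lookup("ALCAM", 1, 6),
                new Lookup("CAM", 3, 6));
        System.out.println(runAll(lookups));
    }
}
```

In this model ALCAM is only reached on the second pass, once hAL has been deleted.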

More examples

● In the GATE User Guide under the section “Useful tricks with JAPE”

● Look in the ANNIE grammars and in the foreign language grammars – there are many examples of little tricks

● Check the GATE mailing list archives

Custom Processing Resource for your grammars

1. Java developer extends GATE's default JAPE Transducer, creating a Java class

package com.yourcompany;

import gate.creole.Transducer;

public class CustomTransducer extends Transducer {
}

2. JAPE developer adds definition in the plugin’s creole.xml

<RESOURCE>
  <NAME>My custom JAPE Transducer</NAME>
  <CLASS>com.yourcompany.CustomTransducer</CLASS>
  <PARAMETER NAME="document" RUNTIME="true">gate.Document</PARAMETER>
  <PARAMETER NAME="inputASName" RUNTIME="true" OPTIONAL="true">java.lang.String</PARAMETER>
  <PARAMETER NAME="outputASName" RUNTIME="true" OPTIONAL="true">java.lang.String</PARAMETER>
  <PARAMETER NAME="grammarURL" DEFAULT="myDir/myMain.jape" SUFFIXES="jape">java.net.URL</PARAMETER>
  <PARAMETER NAME="encoding" DEFAULT="UTF-8">java.lang.String</PARAMETER>
</RESOURCE>

3. GATE user opens custom resource in GATE GUI

Right-click on “Processing Resources”
In the pop-up menu select “New >” --> “My custom JAPE Transducer”

JAPE debugger

● Speeds up the development of JAPE grammars
● Integrated in GATE GUI
● Friendly for non-experts

Allows you to:
● Inspect the pattern matching
● Find overridden rules
● Detect complex inter-rule influence
● And many other things

Inspection of pattern matching

Overridden rules

Inter-rule influence (finding problem)

Inter-rule influence (what is that?)

Inter-rule influence (problem synopsis)

Text processed:

… of the J. L. Kellog Graduate School of Management and the Indiana University School of Business …

Conflicting rule:

Rule: NotPersonFull
Priority: 80
// Det + Surname
// This rule was commented course
// J.L. Kellog processed without J.
// 17.06.03
(
  {Token.category == DT} | {Token.category == PRP} | {Token.category == RB}
)
(
  (PREFIX)* (UPPER) (PERSONENDING)?
):foo

Shadowed rule:

Rule: PersonFullExt
Priority: 100
// F.W. Jones Fred Jones
// Andrew "Flip" Filipowski
// Andrew J. "Flip" Filipowski
//({Token.category == DT})?
(
  ((FIRSTNAME | FIRSTNAMEAMBIG))+
  (INITIALS)?
  ((FIRSTNAME | FIRSTNAMEAMBIG))*
  (PREFIX)*
  ((UPPER)):surname
  (PERSONENDING)?
):person
-->

Coming soon….. JAPE4

What JAPE4 IS:
● a new version of the internal language in GATE release 4
● a language based on the original JAPE
● incorporates best practices from JAPE, Jape+ and Japec
● 3-5 times faster than JAPE

What JAPE4 IS NOT:
● an improved version of original Jape, Jape+ or Japec, but rather a new language
● a language backward compatible with JAPE

In most cases it seems to be possible to easily modify original Jape, Jape+ or Japec grammars to be compatible with the JAPE4 specification.
