©2003 paula matuszek taken primarily from a presentation by lin lin. lin/globalinfosys/gate.ppt csc...

29
©2003 Paula Matuszek Taken primarily from a presentation by Lin Lin. http://webster.cs.uga.edu/~lin/GlobalInfoSys/GATE.pp CSC 9010: Text Mining Applications Fall, 2003 Introduction to GATE Dr. Paula Matuszek [email protected] Taken primarily from a presentation by Lin Lin http://webster.cs.uga.edu/~lin/GlobalInfoSys/GATE.ppt

Upload: neil-richardson

Post on 04-Jan-2016

213 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: ©2003 Paula Matuszek Taken primarily from a presentation by Lin Lin. lin/GlobalInfoSys/GATE.ppt CSC 9010: Text Mining Applications

©2003 Paula Matuszek

Taken primarily from a presentation by Lin Lin. http://webster.cs.uga.edu/~lin/GlobalInfoSys/GATE.ppt

CSC 9010: Text Mining Applications

Fall, 2003

Introduction to GATE

Dr. Paula [email protected]

Taken primarily from a presentation by Lin Lin http://webster.cs.uga.edu/~lin/GlobalInfoSys/GATE.ppt

Page 2: ©2003 Paula Matuszek Taken primarily from a presentation by Lin Lin. lin/GlobalInfoSys/GATE.ppt CSC 9010: Text Mining Applications

©2003 Paula Matuszek

Taken primarily from a presentation by Lin Lin. http://webster.cs.uga.edu/~lin/GlobalInfoSys/GATE.ppt

What is GATE? Stands for General Architecture for Text

Engineering. The theory behind GATE is SALE

(Software Architecture for Language Engineering):– computer processing of human language– computer infrastructure for software

development

Page 3: ©2003 Paula Matuszek Taken primarily from a presentation by Lin Lin. lin/GlobalInfoSys/GATE.ppt CSC 9010: Text Mining Applications

©2003 Paula Matuszek

Taken primarily from a presentation by Lin Lin. http://webster.cs.uga.edu/~lin/GlobalInfoSys/GATE.ppt

Who Use GATE?

Scientists performing experiments that involve processing human language

Developers developing applications with language processing components

Teachers and students of courses about language and language computation

Page 4: ©2003 Paula Matuszek Taken primarily from a presentation by Lin Lin. lin/GlobalInfoSys/GATE.ppt CSC 9010: Text Mining Applications

©2003 Paula Matuszek

Taken primarily from a presentation by Lin Lin. http://webster.cs.uga.edu/~lin/GlobalInfoSys/GATE.ppt

How GATE can Help? Specify an architecture, or organizational

structure, for language processing software Provide a framework, or class library, that

implements the architecture and can be used to embed language processing capabilities in diverse applications

Provide a development environment built on top of the framework made up of convenient graphical tools for developing components

Page 5: ©2003 Paula Matuszek Taken primarily from a presentation by Lin Lin. lin/GlobalInfoSys/GATE.ppt CSC 9010: Text Mining Applications

©2003 Paula Matuszek

Taken primarily from a presentation by Lin Lin. http://webster.cs.uga.edu/~lin/GlobalInfoSys/GATE.ppt

What are GATE Components?

Reusable software chunks with well defined interfaces

Used in Java beans and Microsoft’s .Net

Page 6: ©2003 Paula Matuszek Taken primarily from a presentation by Lin Lin. lin/GlobalInfoSys/GATE.ppt CSC 9010: Text Mining Applications

©2003 Paula Matuszek

Taken primarily from a presentation by Lin Lin. http://webster.cs.uga.edu/~lin/GlobalInfoSys/GATE.ppt

GATE as an architecture Breaks down to three types of components:

– LanguageResources (LRs) – represent entities such as lexicons, documents,

corpora, annotation schemas, or ontologies;

– ProcessingResources (PRs) – represent entities that are primarily algorithmic, such

as parsers, generators or ngram modelers;

– VisualResources (VRs) – represent visualization and editing components that

participate in GUIs.

Page 7: ©2003 Paula Matuszek Taken primarily from a presentation by Lin Lin. lin/GlobalInfoSys/GATE.ppt CSC 9010: Text Mining Applications

©2003 Paula Matuszek

Taken primarily from a presentation by Lin Lin. http://webster.cs.uga.edu/~lin/GlobalInfoSys/GATE.ppt

LRs: Corpora, Documents, and Annotations

A Corpus in Gate is a Java Set whose members are Documents.

Documents are modeled as content plus annotations plus features.

Annotations are organized in graphs, which are modeled as Java sets of Annotation.

Page 8: ©2003 Paula Matuszek Taken primarily from a presentation by Lin Lin. lin/GlobalInfoSys/GATE.ppt CSC 9010: Text Mining Applications

©2003 Paula Matuszek

Taken primarily from a presentation by Lin Lin. http://webster.cs.uga.edu/~lin/GlobalInfoSys/GATE.ppt

Documents Processing in GATE

Document:– Formats including XML, RTF, email, HTML,

SGML, and plain text.– Identified and converted into GATE

annotation format.– Processed by PRs.– Results stored in a serial data store (based

on Java serialization) or as XML.

Page 9: ©2003 Paula Matuszek Taken primarily from a presentation by Lin Lin. lin/GlobalInfoSys/GATE.ppt CSC 9010: Text Mining Applications

©2003 Paula Matuszek

Taken primarily from a presentation by Lin Lin. http://webster.cs.uga.edu/~lin/GlobalInfoSys/GATE.ppt

Built-in GATE Components Resources for common LE data

structures and algorithms, including documents, corpora and various annotation types

A set of language analysis components for Information Extraction (e.g. ANNIE)

A range of data visualization and editing components

Page 10: ©2003 Paula Matuszek Taken primarily from a presentation by Lin Lin. lin/GlobalInfoSys/GATE.ppt CSC 9010: Text Mining Applications

©2003 Paula Matuszek

Taken primarily from a presentation by Lin Lin. http://webster.cs.uga.edu/~lin/GlobalInfoSys/GATE.ppt

Develop Language Processing Functionality using GATE

Programming, or the development of Language Resources such as grammars that are used by existing Processing Resources, or a mixture of both.

The development environment is used for:– visualization of the data structures produced

and consumed during processing– debugging– performance measurement

Page 11: ©2003 Paula Matuszek Taken primarily from a presentation by Lin Lin. lin/GlobalInfoSys/GATE.ppt CSC 9010: Text Mining Applications

©2003 Paula Matuszek

Taken primarily from a presentation by Lin Lin. http://webster.cs.uga.edu/~lin/GlobalInfoSys/GATE.ppt

CREOLE

A Collection of REusable Objects for Language Engineering

The set of resources integrated with GATE

All the resources are packaged as Java Archive (or ‘JAR’) files, plus some XML configuration data.

Page 12: ©2003 Paula Matuszek Taken primarily from a presentation by Lin Lin. lin/GlobalInfoSys/GATE.ppt CSC 9010: Text Mining Applications

©2003 Paula Matuszek

Taken primarily from a presentation by Lin Lin. http://webster.cs.uga.edu/~lin/GlobalInfoSys/GATE.ppt

PRs: ANNIE

A family of Processing Resources for language analysis included with GATE

Stands for A Nearly-New Information Extraction system.

Using finite state techniques to implement various tasks: tokenization, semantic tagging, verb phrase chunking, and so on.

Page 13: ©2003 Paula Matuszek Taken primarily from a presentation by Lin Lin. lin/GlobalInfoSys/GATE.ppt CSC 9010: Text Mining Applications

©2003 Paula Matuszek

Taken primarily from a presentation by Lin Lin. http://webster.cs.uga.edu/~lin/GlobalInfoSys/GATE.ppt

ANNIE IE Modules

Page 14: ©2003 Paula Matuszek Taken primarily from a presentation by Lin Lin. lin/GlobalInfoSys/GATE.ppt CSC 9010: Text Mining Applications

©2003 Paula Matuszek

Taken primarily from a presentation by Lin Lin. http://webster.cs.uga.edu/~lin/GlobalInfoSys/GATE.ppt

ANNIE Components Tokenizer Gazetteer Sentence Splitter Part of Speech Tagger

– produces a part-of-speech tag as an annotation on each word or symbol.

Semantic Tagger OrthoMatcher Coreference Module

Page 15: ©2003 Paula Matuszek Taken primarily from a presentation by Lin Lin. lin/GlobalInfoSys/GATE.ppt CSC 9010: Text Mining Applications

©2003 Paula Matuszek

Taken primarily from a presentation by Lin Lin. http://webster.cs.uga.edu/~lin/GlobalInfoSys/GATE.ppt

ANNIE Component: Tokenizer

Token Types– word, number, symbol, punctuation, and

spaceToken. A tokenizer rule has a left hand side and

a right hand side.

Page 16: ©2003 Paula Matuszek Taken primarily from a presentation by Lin Lin. lin/GlobalInfoSys/GATE.ppt CSC 9010: Text Mining Applications

©2003 Paula Matuszek

Taken primarily from a presentation by Lin Lin. http://webster.cs.uga.edu/~lin/GlobalInfoSys/GATE.ppt

Tokenizer Rule Operations used on the LHS:

– | (or) –  * (0 or more occurrences)  – ? (0 or 1 occurrences)  – + (1 or more occurrences)

The RHS uses ’;’ as a separator, and has the following format: {LHS} > {Annotation type};{attribute1}={value1};...;{attribute  n}={value n}

Page 17: ©2003 Paula Matuszek Taken primarily from a presentation by Lin Lin. lin/GlobalInfoSys/GATE.ppt CSC 9010: Text Mining Applications

©2003 Paula Matuszek

Taken primarily from a presentation by Lin Lin. http://webster.cs.uga.edu/~lin/GlobalInfoSys/GATE.ppt

Example Tokenizer Rule"UPPERCASE_LETTER" "LOWERCASE_LETTER"* 

>  

Token;orth=upperInitial;kind=word;

– The sequence must begin with an uppercase letter, followed by zero or more lowercase letters. This sequence will then be annotated as type “Token”. The attribute “orth” (orthography) has the value “upperInitial”; the attribute “kind” has the value “word”.

Page 18: ©2003 Paula Matuszek Taken primarily from a presentation by Lin Lin. lin/GlobalInfoSys/GATE.ppt CSC 9010: Text Mining Applications

©2003 Paula Matuszek

Taken primarily from a presentation by Lin Lin. http://webster.cs.uga.edu/~lin/GlobalInfoSys/GATE.ppt

ANNIE Component: Gazetteer

The gazetteer lists used are plain text files, with one entry per line.

Each list represents a set of names, such as names of cities, organizations, days of the week, etc.

src\gate\resources\Creole\gazeteer\Default\*.lst

Page 19: ©2003 Paula Matuszek Taken primarily from a presentation by Lin Lin. lin/GlobalInfoSys/GATE.ppt CSC 9010: Text Mining Applications

©2003 Paula Matuszek

Taken primarily from a presentation by Lin Lin. http://webster.cs.uga.edu/~lin/GlobalInfoSys/GATE.ppt

Example Gazetteer List A small section of the list for units of

currency: …… Ecu  

European Currency Units  FFr  Fr  German mark  German marks  New Taiwan dollar  New Taiwan dollars  NT dollar  NT dollars

……

Page 20: ©2003 Paula Matuszek Taken primarily from a presentation by Lin Lin. lin/GlobalInfoSys/GATE.ppt CSC 9010: Text Mining Applications

©2003 Paula Matuszek

Taken primarily from a presentation by Lin Lin. http://webster.cs.uga.edu/~lin/GlobalInfoSys/GATE.ppt

ANNIE Component: Semantic Tagger

Based on JAPE language, which contains rules that act on annotations assigned in earlier phases.

Produce outputs of annotated entities.

Page 21: ©2003 Paula Matuszek Taken primarily from a presentation by Lin Lin. lin/GlobalInfoSys/GATE.ppt CSC 9010: Text Mining Applications

©2003 Paula Matuszek

Taken primarily from a presentation by Lin Lin. http://webster.cs.uga.edu/~lin/GlobalInfoSys/GATE.ppt

ANNIE Component: Sentence Splitter

Segments the text into sentences. This module is required for the tagger. The splitter uses a gazetteer list of

abbreviations to help distinguish sentence-marking full stops from other kinds.

Page 22: ©2003 Paula Matuszek Taken primarily from a presentation by Lin Lin. lin/GlobalInfoSys/GATE.ppt CSC 9010: Text Mining Applications

©2003 Paula Matuszek

Taken primarily from a presentation by Lin Lin. http://webster.cs.uga.edu/~lin/GlobalInfoSys/GATE.ppt

ANNIE Component: OrthoMatcher

Adds identity relations between named entities found by the semantic tagger, in order to perform coreference.

Does not find new named entities, but it may assign a type to an unclassified proper name.

Page 23: ©2003 Paula Matuszek Taken primarily from a presentation by Lin Lin. lin/GlobalInfoSys/GATE.ppt CSC 9010: Text Mining Applications

©2003 Paula Matuszek

Taken primarily from a presentation by Lin Lin. http://webster.cs.uga.edu/~lin/GlobalInfoSys/GATE.ppt

Create a New Resource Write a Java class that implements

GATE’s beans model. Compile the class, and any others that it

uses, into a Java Archive (JAR) file. Write some XML configuration data for

the new resource. Tell GATE the URL of the new JAR and

XML files.

Page 24: ©2003 Paula Matuszek Taken primarily from a presentation by Lin Lin. lin/GlobalInfoSys/GATE.ppt CSC 9010: Text Mining Applications

©2003 Paula Matuszek

Taken primarily from a presentation by Lin Lin. http://webster.cs.uga.edu/~lin/GlobalInfoSys/GATE.ppt

Example: Create a New Component Called GoldFish GoldFish:

– Is a processing resource– Look for all instances of the word “fish” in

the document– Add an annotation of type “GoldFish”

Page 25: ©2003 Paula Matuszek Taken primarily from a presentation by Lin Lin. lin/GlobalInfoSys/GATE.ppt CSC 9010: Text Mining Applications

©2003 Paula Matuszek

Taken primarily from a presentation by Lin Lin. http://webster.cs.uga.edu/~lin/GlobalInfoSys/GATE.ppt

Example: Create GoldFish Using BootStrap Wizard

Page 26: ©2003 Paula Matuszek Taken primarily from a presentation by Lin Lin. lin/GlobalInfoSys/GATE.ppt CSC 9010: Text Mining Applications

©2003 Paula Matuszek

Taken primarily from a presentation by Lin Lin. http://webster.cs.uga.edu/~lin/GlobalInfoSys/GATE.ppt

GoldFish: default files created

Creates Java code in Goldfish.java. Creates XML configuration for GoldFish

in resource.xml.

Page 27: ©2003 Paula Matuszek Taken primarily from a presentation by Lin Lin. lin/GlobalInfoSys/GATE.ppt CSC 9010: Text Mining Applications

©2003 Paula Matuszek

Taken primarily from a presentation by Lin Lin. http://webster.cs.uga.edu/~lin/GlobalInfoSys/GATE.ppt

Create an Application with PRs Applications model a control strategy for the

execution of PRs. Currently only pipeline execution is

supported.– Simple pipelines: group a set of PRs together

in order and execute them in turn.– Corpus pipelines: open each document in the

corpus in turn, set that document as a runtime parameter on each PR, run all the PRs on the corpus, then close the document

Page 28: ©2003 Paula Matuszek Taken primarily from a presentation by Lin Lin. lin/GlobalInfoSys/GATE.ppt CSC 9010: Text Mining Applications

©2003 Paula Matuszek

Taken primarily from a presentation by Lin Lin. http://webster.cs.uga.edu/~lin/GlobalInfoSys/GATE.ppt

Additional Facilities

JAPE– a Java Annotation Patterns Engine, provides

regular-expression based pattern/action rules over annotations.

– The file “Main.jape” contains a list of the grammars to be used for for Named Entity Recognition, in the correct processing order.

– Used in ANNIE.

Page 29: ©2003 Paula Matuszek Taken primarily from a presentation by Lin Lin. lin/GlobalInfoSys/GATE.ppt CSC 9010: Text Mining Applications

©2003 Paula Matuszek

Taken primarily from a presentation by Lin Lin. http://webster.cs.uga.edu/~lin/GlobalInfoSys/GATE.ppt

Embedding ANNIE

Create a stand alone ANNIE extraction system.

Example code that will embed ANNIE in an application that takes URLs as inputs and produces named entities as outputs.