![Page 1: GATE Overview and Demo University of Washington CLMA Treehouse Presentation October 8, 2010 Prescott Klassen](https://reader034.vdocuments.mx/reader034/viewer/2022042615/56649c4d5503460f948f2e95/html5/thumbnails/1.jpg)
GATE Overview and Demo
University of Washington CLMA Treehouse Presentation
October 8, 2010
Prescott Klassen
Overview
• Summary of GATE information and documentation found at gate.ac.uk
• GATE Developer features, components, and plug-ins
• IDE Demo
• Embedded GATE
• Using GATE with Condor on Patas
• GATE code samples
Background
• Sheffield Natural Language Processing Group at the University of Sheffield
• Released 1996 – re-written and re-released 2002
• Latest Release GATE 5.2.1 (May 6, 2010) – Windows, Linux, Solaris, and Mac OS
• Beta Release GATE 6.0 (Beta 1 – August 21, 2010)
• 100% Java Reference Implementation
• Compatible with IBM Unstructured Information Management Architecture (UIMA)
• Open Source (GNU Library General Public License)
• XML Corpus Encoding Standard (XCES) format, used by the American National Corpus
What is GATE?
• An architecture describing how language processing systems are made up of components.
• A framework (class library) written in Java and tested on Linux, Windows and Solaris.
• A graphical development environment built on the framework (IDE for NLP)
GATE Products
• GATE Developer
– IDE for language processing components, bundled with ANNIE (A Nearly-New Information Extraction system) and plug-ins
• GATE Teamware
– Web app for collaborative semantic annotation projects incorporating a workflow engine and a backend service infrastructure
• GATE Embedded
– Object library optimized for inclusion in applications
• GATE Services
– Hosted services for cloud application development
• GATE Wiki
– Wiki/CMS
• GATE Cloud
– Cloud computing solution for hosted large-scale text processing
GATE Components
• Language Resources (LRs)—documents, corpora and ontologies
• Processing Resources (PRs)—parsers, stemmers, co-reference resolvers, ML components, etc.
• Visual Resources (VRs)—IDE components that provide a visual interface (GUI) to GATE components and plug-ins
Language Resources
• Documents, corpora, and ontologies
• Can persist in Java Serial Store or Lucene Serial Data Store
• Document = content + annotations + features
• “Stand-off” markup
• Annotations as Directed Acyclic Graphs (start Node, end Node, ID, type, Feature Map, pointers into the source document as character offsets)
• Input formats: Plain Text, HTML, SGML, XML, RTF, Email, PDF, Microsoft Word
• Ontology support (Sesame2, OWLIM3)
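The stand-off model on this slide can be pictured with a small plain-Java sketch. It deliberately does not use the GATE API: `StandoffSketch`, `Ann`, the `locType` feature, and the sample sentence are hypothetical names chosen for illustration. The point is only that the text itself is never modified, and each annotation carries an ID, a type, character offsets, and a feature map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-off annotation model (hypothetical names, not GATE classes).
final class StandoffSketch {

    // One annotation: a typed span over the document text plus arbitrary features.
    static final class Ann {
        final int id;
        final String type;
        final int start; // inclusive character offset
        final int end;   // exclusive character offset
        final Map<String, Object> features = new HashMap<String, Object>();

        Ann(int id, String type, int start, int end) {
            this.id = id;
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    final String content; // the untouched source text
    final List<Ann> annotations = new ArrayList<Ann>();

    StandoffSketch(String content) {
        this.content = content;
    }

    // Adds an annotation and returns it so features can be attached.
    Ann annotate(String type, int start, int end) {
        Ann a = new Ann(annotations.size(), type, start, end);
        annotations.add(a);
        return a;
    }

    // Recovers the annotated text by following the offsets into the content.
    String textOf(Ann a) {
        return content.substring(a.start, a.end);
    }

    public static void main(String[] args) {
        StandoffSketch doc = new StandoffSketch("GATE was developed in Sheffield.");
        Ann loc = doc.annotate("Location", 22, 31);
        loc.features.put("locType", "city");
        System.out.println(loc.type + " -> " + doc.textOf(loc) + " " + loc.features);
    }
}
```

Because annotations only point into the text, several overlapping or competing annotation layers can coexist over the same document without interfering with each other.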
Processing Resources
• ANNIE (A Nearly-New Information Extraction System)
– Document Reset
– Tokeniser
– Gazetteer
– Sentence Splitter
– RegEx Sentence Splitter
– Part of Speech Tagger
– Semantic Tagger
– Orthographic Coreference (OrthoMatcher)
– Pronominal Coreference
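ANNIE runs its components one after another over the same document, each stage building on what earlier stages produced. The plain-Java sketch below illustrates that serial arrangement with two toy stages. Everything in it (`PipelineSketch`, `Stage`, `WhitespaceTokeniser`, `ListGazetteer`) is a hypothetical stand-in and far simpler than the real ANNIE components.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative serial pipeline (hypothetical names, not GATE classes).
final class PipelineSketch {

    // Shared state that each stage reads and extends, like annotations on a document.
    static final class Doc {
        final String text;
        final List<String> tokens = new ArrayList<String>();
        final List<String> lookups = new ArrayList<String>();

        Doc(String text) {
            this.text = text;
        }
    }

    // A processing stage: does its work in place on the document.
    interface Stage {
        void execute(Doc doc);
    }

    // Stage 1: whitespace tokeniser (much cruder than ANNIE's tokeniser).
    static final class WhitespaceTokeniser implements Stage {
        public void execute(Doc doc) {
            doc.tokens.clear();
            for (String t : doc.text.trim().split("\\s+")) {
                if (!t.isEmpty()) {
                    doc.tokens.add(t);
                }
            }
        }
    }

    // Stage 2: gazetteer lookup over the tokens produced by stage 1.
    static final class ListGazetteer implements Stage {
        private final Set<String> entries;

        ListGazetteer(String... entries) {
            this.entries = new HashSet<String>(Arrays.asList(entries));
        }

        public void execute(Doc doc) {
            doc.lookups.clear();
            for (String t : doc.tokens) {
                if (entries.contains(t)) {
                    doc.lookups.add(t);
                }
            }
        }
    }

    // Runs the stages strictly in order; later stages depend on earlier output.
    static Doc run(String text, List<Stage> stages) {
        Doc doc = new Doc(text);
        for (Stage s : stages) {
            s.execute(doc);
        }
        return doc;
    }

    public static void main(String[] args) {
        List<Stage> stages = Arrays.<Stage>asList(
                new WhitespaceTokeniser(), new ListGazetteer("Sheffield", "Seattle"));
        Doc doc = run("GATE comes from Sheffield", stages);
        System.out.println(doc.tokens + " " + doc.lookups);
    }
}
```

Order matters in this arrangement: if the gazetteer stage ran before the tokeniser it would find nothing, because the tokens it consumes would not exist yet.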
Processing Resources
• JAPE (Java Annotation Pattern Engine):
– Regular expressions over annotations
– Finite state transduction over annotations based on regular expressions
– Not against strings but against annotation graphs
– Non-deterministic
• ANNIC: ANNotations-In-Context
– full-featured annotation indexing and retrieval system
– Searchable Serial DataStore
– Based on Lucene
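To make "regular expressions over annotations" concrete, here is a rough plain-Java sketch. It is not JAPE, which has its own grammar language, and it handles only a flat sequence of annotations rather than a graph. The annotation type names (`Title`, `UpperInitial`), the one-letter encoding, and the class name are all assumptions made for illustration. The pattern is matched against a sequence of annotation types instead of against characters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative pattern matching over annotation types (not JAPE itself).
final class AnnotationPatternSketch {

    // Encodes each annotation type as one character so a regex can run over the sequence.
    static String encode(List<String> types) {
        StringBuilder sb = new StringBuilder();
        for (String t : types) {
            if (t.equals("Title")) {
                sb.append('T');
            } else if (t.equals("UpperInitial")) {
                sb.append('U');
            } else {
                sb.append('o'); // any other annotation type
            }
        }
        return sb.toString();
    }

    // Returns {startIndex, endIndexExclusive} pairs for runs matching: Title UpperInitial+
    static List<int[]> findPersons(List<String> types) {
        List<int[]> spans = new ArrayList<int[]>();
        Matcher m = Pattern.compile("TU+").matcher(encode(types));
        while (m.find()) {
            spans.add(new int[] { m.start(), m.end() });
        }
        return spans;
    }

    public static void main(String[] args) {
        // Annotation types for the tokens of: "yesterday Dr Jane Smith spoke"
        List<String> types = Arrays.asList(
                "Lower", "Title", "UpperInitial", "UpperInitial", "Lower");
        for (int[] span : findPersons(types)) {
            System.out.println("Person over annotations " + span[0] + ".." + (span[1] - 1));
        }
    }
}
```

The rule never inspects the token strings themselves. It fires on whatever earlier processing stages annotated as a title followed by capitalised tokens, which is the sense in which the slide says matching is against annotations rather than strings.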
Processing Resources
• The Annotation Diff Tool
– enables two sets of annotations in one or two documents to be compared
– figures are generated for precision, recall, F-measure
• Corpus Benchmark Tool
– Apply evaluation across an entire corpus
• Balanced Distance Metric (BDM) Ontology Tool
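The figures named on this slide follow the usual definitions: precision is the share of response annotations that are correct, recall is the share of key annotations that were found, and the F-measure combines the two. The sketch below computes them for exact matches only. It is an illustration in plain Java, not the Annotation Diff Tool; the `type:start-end` string keys and the class name are assumptions, and partially overlapping annotations are simply counted as wrong.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative precision/recall/F1 over two annotation sets (exact matching only).
final class AnnotationDiffSketch {

    // Each annotation is keyed as "type:start-end"; two annotations match only if keys are equal.
    static double[] score(Set<String> key, Set<String> response) {
        Set<String> correct = new HashSet<String>(response);
        correct.retainAll(key); // annotations present in both sets
        double precision = response.isEmpty() ? 0.0 : (double) correct.size() / response.size();
        double recall = key.isEmpty() ? 0.0 : (double) correct.size() / key.size();
        double f1 = (precision + recall == 0.0)
                ? 0.0
                : 2 * precision * recall / (precision + recall);
        return new double[] { precision, recall, f1 };
    }

    public static void main(String[] args) {
        Set<String> key = new HashSet<String>(Arrays.asList(
                "Person:0-5", "Location:10-19", "Date:25-29", "Person:40-46"));
        Set<String> response = new HashSet<String>(Arrays.asList(
                "Person:0-5", "Location:10-19", "Location:30-35"));
        double[] prf = score(key, response);
        System.out.printf("P=%.3f R=%.3f F1=%.3f%n", prf[0], prf[1], prf[2]);
    }
}
```

With two of three response annotations correct and two of four key annotations found, precision is 2/3, recall is 1/2, and F1 is 4/7.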
Processing Resources (PlugIns)
• OntoGazetteer
• HashGazetteer
• Gazetteer List Collector
• Large KB Gazetteer
• Ontology-Aware JAPE Transducer
• Batch Learning PR (LibSVM, PAUM algorithm, Weka interface)
• Machine Learning PR (Maxent, Weka and SVM Light)
Resources on the Web site gate.ac.uk
• User Guide
• Movie Tutorials
• Developer’s Guide/API docs
• NLP Application Programmer’s Guide
• Research Papers
• GATE project descriptions
• Demos
• Plug-in Info
• Commercial/Academic partnerships
• Etc…
IDE Demo
What is GATE Embedded?
• Everything in GATE IDE without the GUI
• A Java framework for many different types of NLP solutions
• A complex assortment of core functionality and plug-ins
• Extensible and Composable
– GATE can be included as a component in other Java Frameworks and vice-versa
Example Application with a GATE Embedded Component
Running GATE (“Hello World”)
import java.io.File;

import gate.*;
import gate.creole.*;

public class Main {

    public static void main(String[] args) throws Exception {
        // Point GATE at its installation and plug-in directories before initialising
        Gate.setGateHome(new File(<Path to GATE>));
        Gate.setPluginsHome(new File(<Path to Plugins>));

        Gate.init(); // start GATE
    }
}
Registering Directories
// File.toURL() is deprecated and does not escape special characters; go through toURI() instead
Gate.getCreoleRegister().registerDirectories(new File(Gate.getPluginsHome(), "ANNIE").toURI().toURL());
Gate.getCreoleRegister().registerDirectories(new File(Gate.getPluginsHome(), "Information_Retrieval").toURI().toURL());
Gate.getCreoleRegister().registerDirectories(new File(Gate.getPluginsHome(), "Stemmer_Snowball").toURI().toURL());
Creating Processing Resources
SerialAnalyserController annieController =
    (SerialAnalyserController) Factory.createResource(
        "gate.creole.SerialAnalyserController",
        Factory.newFeatureMap(), Factory.newFeatureMap(), "ANNIE");

FeatureMap params = Factory.newFeatureMap();

annieController.add((ProcessingResource) Factory.createResource("gate.creole.annotdelete.AnnotationDeletePR", params));
annieController.add((ProcessingResource) Factory.createResource("gate.creole.tokeniser.DefaultTokeniser", params));
annieController.add((ProcessingResource) Factory.createResource("stemmer.SnowballStemmer", params));
annieController.add((ProcessingResource) Factory.createResource("gate.creole.gazetteer.DefaultGazetteer", params));
annieController.add((ProcessingResource) Factory.createResource("gate.creole.splitter.RegexSentenceSplitter", params));
annieController.add((ProcessingResource) Factory.createResource("gate.creole.POSTagger", params));
annieController.add((ProcessingResource) Factory.createResource("gate.creole.ANNIETransducer", params));
annieController.add((ProcessingResource) Factory.createResource("gate.creole.orthomatcher.OrthoMatcher", params));

FeatureMap coRefParams = Factory.newFeatureMap();
coRefParams.put("resolveIt", "true");

annieController.add((ProcessingResource) Factory.createResource("gate.creole.coref.Coreferencer", coRefParams));
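Creating Language Resources
Corpus corpus = Factory.newCorpus("DUC Queries");

@SuppressWarnings("static-access")
File topicsFile = new File(ConfigMgr.getTopicFilePath() + "topics.xml");
// File.toURL() is deprecated; toURI().toURL() also escapes special characters in the path
gate.Document topicDoc = Factory.newDocument(topicsFile.toURI().toURL());

corpus.add(topicDoc);
annieController.setCorpus(corpus);

// Run every processing resource in the controller over the corpus
annieController.execute();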
Iteration and Cleanup
AnnotationSet defaultAnnotations = topicDoc.getAnnotations();
AnnotationSet originalMarkup = topicDoc.getAnnotations("Original markups");
AnnotationSet topicAnnotationSet = originalMarkup.get("TOPIC");

for (Annotation topicAnnotation : topicAnnotationSet) {
    ArrayList<Query> topicQueryArrayList = new ArrayList<Query>();

    if (ConfigMgr.isQueryBreakdown()) {
        topicQueryArrayList = Utilities.buildTopicMultiQuery(topicAnnotation,
            originalMarkup, defaultAnnotations, config);
    } else {
        topicQueryArrayList = Utilities.buildTopicQuery(topicAnnotation,
            originalMarkup, defaultAnnotations, config);
    }

    String topicKey = null;
    topicKey = topicQueryArrayList.get(0).getDucTopicName();
    globalQueryHash.put(topicKey, topicQueryArrayList);
}

topicDoc.cleanup();
Factory.deleteResource(topicDoc);
corpus.cleanup();
Factory.deleteResource(corpus);
Iterating through Annotations
public static AnnotationSet getChildAnnotationSet(
        String childAnnotationSetName, Annotation annotation,
        AnnotationSet parentAnnotationSet) throws NullPointerException {

    AnnotationSet childAnnotationSet = null;

    // traverse nested Annotation Set for named annotation using parent offsets to delimit range
    try {
        childAnnotationSet = parentAnnotationSet.get(childAnnotationSetName,
            annotation.getStartNode().getOffset(),
            annotation.getEndNode().getOffset());
        if (childAnnotationSet == null) {
            throw new NullPointerException();
        }
    } catch (Exception e) {
        System.err.println(e.getMessage());
    }

    return childAnnotationSet;
}
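The idea behind the method above, selecting the annotations of one type that fall within a parent annotation's offsets, can be sketched without GATE as follows. `ChildSpanSketch`, `Span`, and the `title`/`narr` type names are hypothetical. The sketch uses strict containment, which is an assumption and may not match how the GATE call on the slide treats annotations that only partly overlap the range.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative offset-range filter (hypothetical names, not the GATE API).
final class ChildSpanSketch {

    // A typed span with start (inclusive) and end (exclusive) character offsets.
    static final class Span {
        final String type;
        final long start;
        final long end;

        Span(String type, long start, long end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    // Returns spans of the given type that lie entirely inside the parent's offsets.
    static List<Span> childrenOf(Span parent, String childType, List<Span> all) {
        List<Span> children = new ArrayList<Span>();
        for (Span s : all) {
            boolean inside = s.start >= parent.start && s.end <= parent.end;
            if (s != parent && inside && s.type.equals(childType)) {
                children.add(s);
            }
        }
        return children; // an empty list rather than null when nothing matches
    }

    public static void main(String[] args) {
        Span topic1 = new Span("TOPIC", 0, 100);
        Span topic2 = new Span("TOPIC", 100, 200);
        List<Span> all = new ArrayList<Span>();
        all.add(topic1);
        all.add(topic2);
        all.add(new Span("title", 5, 30));
        all.add(new Span("narr", 35, 95));
        all.add(new Span("title", 105, 130));
        System.out.println(childrenOf(topic1, "title", all).size());
    }
}
```

Returning an empty list when nothing matches is a design choice of this sketch: callers can iterate over the result directly and do not have to guard against null.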
Example Script for Compiling on Patas
#!/bin/bash

javac -classpath .:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/bin/gate.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/activation.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/ant-contrib-1.0b2.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/ant.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/ant-junit.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/ant-launcher.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/jdom.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/antlr.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/ant-nodeps.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/ant-trax.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/Bib2HTML.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/commons-discovery-0.2.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/commons-fileupload-1.0.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/commons-lang-2.4.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/commons-logging.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/concurrent.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/gate-asm.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/gate-compiler-jdt.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/gateHmm.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/geronimo-ws-metadata_2.0_spec-1.1.1.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/GnuGetOpt.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/icu4j.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/jakarta-oro-2.0.5.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/javacc.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/jaxb-api-2.0.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/jaxen-1.1.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/jaxws-api-2.0.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/junit.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/jwnl.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/log4j-1.2.14.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/lubm.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/lucene-core-2.2.0.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/mail.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/nekohtml-1.9.8+2039483.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/ontotext.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/orajdbc3.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/PDFBox-0.7.2.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/pg73jdbc3.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/poi-2.5.1-final-20040804.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/spring-beans-2.0.8.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/spring-core-2.0.8.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/stax-api-1.0.1.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/tm-extractors-0.4.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/wstx-lgpl-3.2.3.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/xercesImpl.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/xml-apis.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/xmlunit-1.2.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/xpp3-1.1.3.3_min.jar:\
/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/xstream-1.2.jar:\
edu.mit.jwi_2.1.5.jar \
ling573extractive/*.java
GATE Condor Script
universe = java
executable = ling573extractive/Main.class
arguments = ling573extractive.Main
output = ling573extractive.output
error = ling573extractive.error
jar_files = /NLP_TOOLS/tool_sets/gate/gate-5.1/bin/gate.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/junit.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/ant-junit.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/jdom.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/commons-lang-2.4.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/gate-asm.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/gate-compiler-jdt.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/log4j-1.2.14.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/lucene-core-2.2.0.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/nekohtml-1.9.8+2039483.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/ontotext.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/PDFBox-0.7.2.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/orajdbc3.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/wstx-lgpl-3.2.3.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/xercesImpl.jar,edu.mit.jwi_2.1.5.jar
java_vm_args = -Xmn100M -Xms500M -Xmx500M
+RequiresWholeMachine = True
Requirements = ( Memory > 0 && TotalMemory >= (7*1024) )
queue
Discussion