Poio API and GrAF-XML @ Balisage 2013
Poio API and GrAF-XML
A radical stand-off approach in language documentation and language typology
Jonathan Blumtritt, Cologne Center for eHumanities, University of Cologne
Peter Bouda, Centro Interdisciplinar de Documentação Linguística e Social
Felix Rau, Department of Linguistics, University of Cologne
Overview
Existing infrastructure and workflows
CLARIN
Annotation graphs
GrAF and Poio API
Example: Elan EAF to GrAF-XML
CLASS
Fieldwork
Photos
Existing Infrastructure
LD tools and standards
Elan: EAF, MPEG, WAV
Toolbox: TXT, XML, WAV
Arbil: IMDI/CMDI (Component MetaData Infrastructure)
Praat: XML, WAV
...
No standards for tier hierarchies, tier names or annotation schemes
Efforts in ISOcat
European initiative within the European Research Infrastructure Consortium: Common Language Resources and Technology Infrastructure (CLARIN)
aims at providing easy and sustainable access for scholars in the humanities and social sciences to digital language data
Started in 2006, part of a roadmap process, timeline currently ending 2020
CLARIN-D: working groups in Germany
Curation projects for different research areas in linguistics
Annotation Graphs
the underlying data model for linguistic annotations
pivot structure for linguistic data
time vs. byte offsets
not hierarchical (but trees are also graphs)
stand-off annotation
"It is important to recognize that translation into AGs does not magically create compatibility among systems whose semantics are different." [Bird & Liberman 2001]
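As a rough illustration of the annotation graph idea (nodes anchored to time offsets, labelled arcs between them, annotations kept apart from the signal), here is a minimal sketch. The class names, the layer names and the toy offsets are all invented for this example and are not taken from any particular tool.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Node:
    """A graph node, optionally anchored to a time offset (seconds)."""
    id: str
    offset: Optional[float] = None


@dataclass(frozen=True)
class Arc:
    """A labelled arc between two nodes; `type` names the annotation layer."""
    start: str
    end: str
    type: str
    label: str


@dataclass
class AnnotationGraph:
    nodes: dict = field(default_factory=dict)
    arcs: list = field(default_factory=list)

    def add_node(self, node_id, offset=None):
        self.nodes[node_id] = Node(node_id, offset)

    def add_arc(self, start, end, type_, label):
        self.arcs.append(Arc(start, end, type_, label))

    def arcs_of_type(self, type_):
        return [arc for arc in self.arcs if arc.type == type_]

    def span(self, arc):
        """Return the (start, end) offsets an arc covers in the signal."""
        return (self.nodes[arc.start].offset, self.nodes[arc.end].offset)


# Toy data (invented): one utterance split into two words.
ag = AnnotationGraph()
for node_id, offset in [("n0", 0.0), ("n1", 0.4), ("n2", 1.1)]:
    ag.add_node(node_id, offset)
ag.add_arc("n0", "n2", "utterance", "so what")
ag.add_arc("n0", "n1", "word", "so")
ag.add_arc("n1", "n2", "word", "what")

words = [arc.label for arc in ag.arcs_of_type("word")]
utterance_span = ag.span(ag.arcs_of_type("utterance")[0])
```

The utterance and the words share nodes rather than nesting inside each other, which is why the model is not inherently hierarchical.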
AGs visualized
GrAF
GrAF: Graph Annotation Framework
ISO 24612: Language resource management - Linguistic annotation framework (LAF)
Started as stand-off version of XCES
API and representation as data structures, not a file format
GrAF/XML as XML representation
Used for the Manually Annotated Sub-Corpus (MASC) of the American National Corpus (ANC)
Nodes, edges, regions, annotations, feature structures
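The five GrAF entities can be pictured as plain data structures. The sketch below is an assumption-laden simplification: the field names, the flat dictionary standing in for a feature structure, and the toy primary text are chosen for illustration and do not reproduce the ISO data model in detail.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """A span of primary data, identified by anchors (here: character offsets)."""
    id: str
    anchors: tuple


@dataclass
class Node:
    """A graph node, linked to zero or more regions by id."""
    id: str
    regions: list = field(default_factory=list)


@dataclass
class Edge:
    """A directed edge between two nodes."""
    id: str
    from_node: str
    to_node: str


@dataclass
class Annotation:
    """A labelled annotation on a node, carrying a (flattened) feature structure."""
    id: str
    label: str
    ref: str
    features: dict = field(default_factory=dict)


# Toy primary data and stand-off annotations (invented for this sketch).
primary = "so what"
regions = {"r1": Region("r1", (0, 2)), "r2": Region("r2", (3, 7))}
nodes = {"n1": Node("n1", ["r1"]), "n2": Node("n2", ["r2"])}
edges = [Edge("e1", "n1", "n2")]
annotations = [
    Annotation("a1", "pos", "n1", {"tag": "ADV"}),
    Annotation("a2", "pos", "n2", {"tag": "PRON"}),
]


def text_of(node_id):
    """Resolve a node to the primary-data substrings its regions cover."""
    linked = (regions[region_id] for region_id in nodes[node_id].regions)
    return [primary[r.anchors[0]:r.anchors[1]] for r in linked]
```

The primary data is never modified: everything that describes it points into it from outside.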
TEI and GrAF
Schemata for GrAF created with TEI Roma
Customized version of TEI P5 schema
ODD: One Document Does it all
GrAF is not TEI compliant
Share data types and feature structures of annotations
TEI has stand-off variant, uses XPointer/XLink
Primary data has to be XML
Why we use GrAF
Because it's new! :-)
No inline markup
Radical stand-off approach
Easier to share and manage data
Preferred solution to archive cultural heritage
Ideal for sparse annotations
Existing code: Java and Python
The beauty of annotation graphs
Poio API
Think of GrAF as an assembly language for linguistic annotation; then Poio API is a library to map from and to higher-level languages
Subset of GrAF to represent tier based annotation
Filters and filter chains for search
Plugin mechanism for file formats
Mapping semantics: tiers and annotations to nodes and edges
Meta-data for additional information (tier types etc.)
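The mapping of tiers and annotations to nodes and edges, and the idea of chaining filters for search, can be sketched in plain Python. This is not Poio API code: the input layout, the function names (`tiers_to_graph`, `by_tier`, `by_value`, `apply_filters`) and the toy annotations are assumptions made for illustration only.

```python
# Hypothetical tier-based input: tier name -> list of
# (annotation id, parent annotation id or None, annotation value).
tiers = {
    "utterance": [("u1", None, "so what")],
    "words": [("w1", "u1", "so"), ("w2", "u1", "what")],
    "part_of_speech": [("p1", "w1", "ADV"), ("p2", "w2", "PRON")],
}


def tiers_to_graph(tiers):
    """Turn every annotation into a node and every parent link into an edge."""
    nodes = {}
    edges = []
    for tier_name, annotations in tiers.items():
        for ann_id, parent_id, value in annotations:
            # The tier name is kept as metadata on the node.
            nodes[ann_id] = {"tier": tier_name, "value": value}
            if parent_id is not None:
                edges.append((parent_id, ann_id))
    return nodes, edges


def by_tier(name):
    """Filter: keep nodes that belong to the given tier."""
    return lambda node: node["tier"] == name


def by_value(substring):
    """Filter: keep nodes whose value contains the given substring."""
    return lambda node: substring in node["value"]


def apply_filters(nodes, filters):
    """A filter chain: a node matches only if every filter accepts it."""
    return [nid for nid, node in nodes.items() if all(f(node) for f in filters)]


nodes, edges = tiers_to_graph(tiers)
matches = apply_filters(nodes, [by_tier("words"), by_value("wh")])
```

Here `matches` contains only the word-tier node for "what"; the utterance node also contains "wh" but is rejected by the tier filter.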
Example: Mapping of EAF to GrAF-XML
Elan EAF
so [...]
GrAF entities
GrAF structure
GrAF-XML
so
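For readers who have not seen GrAF-XML, the sketch below builds a tiny fragment with Python's standard library. The namespace is the one listed under Links; the element and attribute names (`region`, `node`, `link`, `a`, `fs`, `f`, `anchors`, `targets`, `ref`) follow my understanding of GrAF-XML as used in MASC and should be checked against the schema. The ids, the anchor values and the feature name `annotation_value` are invented.

```python
import xml.etree.ElementTree as ET

GRAF_NS = "http://www.xces.org/ns/GrAF/1.0/"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Serialize GrAF elements in the default namespace (no prefix).
ET.register_namespace("", GRAF_NS)


def q(tag):
    """Qualify a tag name with the GrAF namespace."""
    return "{%s}%s" % (GRAF_NS, tag)


graph = ET.Element(q("graph"))

# A region anchored in the primary data (anchor values are made up).
ET.SubElement(graph, q("region"), {XML_ID: "r1", "anchors": "780 1340"})

# A node that points at the region.
node = ET.SubElement(graph, q("node"), {XML_ID: "n1"})
ET.SubElement(node, q("link"), {"targets": "r1"})

# An annotation on the node, with a one-feature feature structure.
ann = ET.SubElement(graph, q("a"), {XML_ID: "a1", "label": "utterance", "ref": "n1"})
fs = ET.SubElement(ann, q("fs"))
feature = ET.SubElement(fs, q("f"), {"name": "annotation_value"})
feature.text = "so"

xml_text = ET.tostring(graph, encoding="unicode")
```

Nothing in the fragment contains the media itself: the region only refers to it through its anchors.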
Tier hierarchies
[ ['utterance..K-Spch'],
['utterance..W-Spch', ['words..W-Words', ['part_of_speech..W-POS'] ], ['phonetic_transcription..W-IPA'] ],
['gestures..W-RGU', ['gesture_phases..W-RGph', ['gesture_meaning..W-RGMe'] ] ],
['gestures..K-RGU', ['gesture_phases..K-RGph', ['gesture_meaning..K-RGMe'] ] ]]
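The nested list above can be walked recursively to recover each tier's parent. A minimal sketch follows; the function name `walk` is made up, and reading the part before `..` as the tier type and the part after it as the tier name is an assumption based on how the entries look.

```python
hierarchy = [
    ['utterance..K-Spch'],
    ['utterance..W-Spch',
        ['words..W-Words', ['part_of_speech..W-POS']],
        ['phonetic_transcription..W-IPA']],
    ['gestures..W-RGU',
        ['gesture_phases..W-RGph', ['gesture_meaning..W-RGMe']]],
    ['gestures..K-RGU',
        ['gesture_phases..K-RGph', ['gesture_meaning..K-RGMe']]],
]


def walk(tier, parent=None):
    """Yield (tier, parent) pairs; the first list item is the tier itself,
    every following item is a child subtree."""
    name, *children = tier
    yield name, parent
    for child in children:
        yield from walk(child, name)


pairs = [pair for top in hierarchy for pair in walk(top)]
parent_of = dict(pairs)

# Assumed reading of the identifiers: "<tier type>..<tier name>".
tier_type, tier_name = 'words..W-Words'.split('..')
```

Top-level tiers map to `None` in `parent_of`, which makes it easy to tell root tiers from dependent ones.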
The code
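import poioapi.annotationgraph
import poioapi.io.graf

ag = poioapi.annotationgraph.AnnotationGraph()
# Read the Elan EAF file and set up the conversion to GrAF
parser = poioapi.io.ElanParser("example.eaf")
writer = poioapi.io.graf.Writer()
converter = poioapi.io.graf.GrAFConverter(parser, writer)
converter.parse()
# Write the result as GrAF-XML
converter.write("example.hdr")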
Analysis workflows
Graph-based methods
Pipe to scientific Python libraries
GrAF connectors for major linguistic workflow tools (GATE and Apache UIMA)
Example: Polysemy in dictionaries
Example: Counting word orders
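The slides do not show how the word orders were counted. One plausible shape for such an analysis, once annotations have been read into Python, is sketched below. The role labels (S, V, O), the function name `word_order` and the toy clauses are invented for illustration and are not results from the project.

```python
from collections import Counter

# Invented toy data: each clause is a list of (token, role) pairs
# in surface order, as they might be read from an annotation tier.
clauses = [
    [("dog", "S"), ("bites", "V"), ("man", "O")],
    [("man", "S"), ("dog", "O"), ("bites", "V")],
    [("she", "S"), ("sees", "V"), ("it", "O")],
]


def word_order(clause):
    """Concatenate the core roles of a clause in the order they appear."""
    return "".join(role for _, role in clause if role in {"S", "V", "O"})


counts = Counter(word_order(clause) for clause in clauses)
```

Because the result is an ordinary `Counter`, it can be handed to other Python libraries for statistics or plotting.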
CLASS
Thank you for your attention!
Links
CLARIN curation project: http://de.clarin.eu/en/discipline-specific-working-groups/wg-3-linguistic-fieldwork-anthropology-language-typology/curation-project-1.html
Poio API: http://media.cidles.eu/poio/poio-api/
GrAF: http://www.xces.org/ns/GrAF/1.0/
CLASS: http://class.uni-koeln.de