Talk at Latrobe University (May 2005, Melbourne)
Closing the Gap: Data Models for Documentary Linguistics
Baden Hughes
Department of Computer Science and Software Engineering
The University of [email protected]
Latrobe Uni - Linguistics Seminar - 20050505 2
Overview
• Overall Context
• The Electronic Data Format Challenge
• Common Problems
• Data Encoding Models
  • Lexicons, interlinear texts, paradigms, syntactic trees, annotation standards, query languages
• Linguistic Motivations vs Computational Interests
• New Types of Data Exploration
• Effects on Linguistic Analysis
• New Tools
• Conclusions
Overall Context
• Large amounts of human language data continue to be managed in electronic form and analysed in fieldwork-driven linguistic documentation
• Increasing focus on acquisition-centric methodologies, which have vastly increased the rate of growth of linguistic data
• Reasonably static basic linguistic data structures, largely grounded in the print domain
The Electronic Data Format Challenge
• The methods used for the digital encoding of linguistic data are often disparate
• Often at best reduced to native formats supported by widely-used tools such as Shoebox
• Conversion is typically complex and lossy
  • Sometimes this can't be predicted in advance
• Many utility manipulation functions are required to move data between analytical applications and outputs
• These functions are largely external to analytical environments, with some notable exceptions (e.g. regular expression manipulation)
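As a rough illustration of why conversion out of tool-native formats can be lossy, here is a minimal Python sketch that reads records written in a Shoebox-style backslash-marker layout. The markers (`\lx`, `\ps`, `\ge`), the sample entries and the function name `parse_sfm` are assumptions made for this example, not part of the talk.

```python
# Illustrative records in a Shoebox-style backslash-marker layout.
# The markers (\lx, \ps, \ge) and the entries are assumptions for this sketch.
SAMPLE = r"""
\lx kata
\ps n
\ge word

\lx lari
\ps v
\ge run
\ge flee
"""


def parse_sfm(text, record_marker="lx"):
    """Return a list of records, each an ordered list of (marker, value)."""
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("\\"):
            marker, _, value = line[1:].partition(" ")
            # A record marker (or the very first field) opens a new record.
            if marker == record_marker or current is None:
                current = []
                records.append(current)
            current.append((marker, value.strip()))
        elif current:
            # A line without a marker continues the previous field.
            marker, value = current[-1]
            current[-1] = (marker, f"{value} {line}".strip())
    return records


records = parse_sfm(SAMPLE)

# Keeping ordered pairs preserves both \ge fields of the second record.
glosses = [value for marker, value in records[1] if marker == "ge"]

# A naive dict conversion silently keeps only the last repeated field.
lossy = dict(records[1])
```

Keeping each record as an ordered list of pairs retains repeated fields and their order; flattening to a plain dictionary is one simple way such information can be lost without warning.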
Common Problems
• Despite diversity of language and analytical approach, many documentary and descriptive linguists face a common challenge: the interoperability and longevity of electronic data generated in fieldwork settings.
  • Repurposing data
  • Publishing data on the web
  • Publishing in papers
  • New analysis tools
  • New generation formats
The Emergence of Abstract Language Data Encoding Models
• Recently, a number of formal data encoding models for linguistic data types have emerged from projects investigating "best practice" methods for preserving linguistic data.
• We will briefly consider models for
  • lexicons
  • interlinear texts
  • paradigms
  • syntactic trees
  • annotation standards
  • query languages
Data Models (1)
• Lexicons
  • Bell & Bird (2001)
• Interlinear Text
  • Bow, Hughes & Bird (2003)
  • Hughes, Bird & Bow (2003)
• Linguistic Paradigms
  • Penton, Bow, Bird & Hughes (2004)
  • Penton & Bird (2004)
Data Models (2)
• Syntactic Trees
  • Lai & Bird (2004)
• Annotation Standards
  • Farrar, Lewis & Langendoen (2002)
  • Farrar & Langendoen (2003)
• Query Languages
  • Bird, Chen, Davidson, Lee & Zheng (2005)
  • Cassidy & Bird (2000)
  • Taylor (2004)
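To make the idea of querying tree-structured linguistic data concrete, here is a small Python sketch that runs path queries over a toy syntactic tree encoded as XML. It uses only the limited XPath subset in the standard library's `xml.etree.ElementTree`; the element names and the sentence are assumptions for illustration and do not reproduce the query languages proposed in the cited papers.

```python
import xml.etree.ElementTree as ET

# A toy constituency tree encoded as XML (element names are assumptions).
TREE = """
<S>
  <NP><Det>the</Det><N>dog</N></NP>
  <VP><V>saw</V><NP><Det>a</Det><N>cat</N></NP></VP>
</S>
"""

root = ET.fromstring(TREE)


def words(node):
    """Return the leaf words under a node, in document order."""
    return [leaf.text for leaf in node.iter() if len(leaf) == 0]


# Every noun anywhere in the tree.
nouns = [n.text for n in root.findall(".//N")]

# Noun phrases immediately dominated by a verb phrase (object NPs here).
object_nps = [words(np) for np in root.findall("./VP/NP")]
```

Dominance queries like these map naturally onto paths through the XML hierarchy. Queries about linear order between words are less convenient to express in this subset.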
Linguistic Motivations
• Data models: so what?
• It is the combined utility of these models that makes them attractive to documentary linguists
• The challenge is to lower the barrier to use of these technologies in fieldwork and analytical contexts
• Linguists (mostly) don't care about the technology, they just want to do linguistics!
• Computer scientists are generally not interested in linguistics …
Computational Interests
• The development of such models may be inherently interesting to computationally inclined researchers
  • Human language data encoding and annotation is genuinely interesting in computer science terms; unfortunately basic data modelling isn't
• Technologists have a bad habit of providing advice which is well intended but lacks traction for non-technical communities (e.g. “use XML”)
• Many of the solutions are XML-based, but contain many more components than just XML-encoded data
New Types of Data Exploration (1)
• Open implemented solutions for a range of manipulations are available
  • Lexicons
    • Generation of different types of lexicons
  • Interlinear Text (see following examples …)
    • Generation of different types of interlinear text
    • Induction of morphosyntactic glossing from lexicons
    • Generation of lexicons from interlinear text
    • Enrichment of lexicons from interlinear text
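One of the manipulations listed above, generating a lexicon from interlinear text, can be sketched in a few lines of Python. The nested structure used here (phrases containing words, words containing morpheme and gloss pairs), the sample phrases and the function name `build_lexicon` are assumptions for illustration; the sketch is not the implementation referred to in the talk.

```python
from collections import defaultdict

# Hypothetical interlinear text: each phrase has a free translation and a
# list of words; each word is a list of (morpheme, gloss) pairs.
TEXT = [
    {
        "translation": "That cat eats fish.",
        "words": [
            [("kucing", "cat")],
            [("itu", "that")],
            [("makan", "eat")],
            [("ikan", "fish")],
        ],
    },
    {
        "translation": "That is food.",
        "words": [
            [("itu", "that")],
            [("makan", "eat"), ("-an", "NMLZ")],
        ],
    },
]


def build_lexicon(phrases):
    """Collect every morpheme with its observed glosses and token count."""
    lexicon = defaultdict(lambda: {"glosses": set(), "count": 0})
    for phrase in phrases:
        for word in phrase["words"]:
            for form, gloss in word:
                entry = lexicon[form]
                entry["glosses"].add(gloss)
                entry["count"] += 1
    return dict(lexicon)


lexicon = build_lexicon(TEXT)
```

Because the glosses are gathered as sets, a morpheme glossed inconsistently across the text shows up with more than one gloss, which is a simple starting point for the lexicon enrichment mentioned in the same list.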
Nenets Interlinear (1)
Nenets Interlinear (2)
New Types of Data Exploration (2)
• Open implemented solutions for a range of manipulations are available
  • Syntactic Trees
    • Induction of trees from interlinear text
    • Creation of interlinear text from syntactic tree drawing
    • Creation of lexicons from syntactic trees
  • Paradigms (see following examples …)
    • Generation of different types of paradigms
    • Induction of paradigms from interlinear text
    • Annotation of interlinear text from paradigms
    • Enrichment of lexicons from paradigms
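Inducing a paradigm from glossed forms can be sketched along the same lines. The Python example below assumes a gloss convention of the form `lemma.PERSON+NUMBER` (for example `love.1SG`) and uses Latin present-tense forms as sample data; the convention, the data and the function names are all assumptions for this sketch rather than the model described in the cited papers.

```python
import re

# Assumed gloss convention: lemma, a dot, person (1-3) and number (SG/PL).
FEATURE = re.compile(r"^(?P<lemma>\w+)\.(?P<person>[123])(?P<number>SG|PL)$")

# Hypothetical glossed tokens, as they might be pulled from interlinear text.
TOKENS = [
    ("amo", "love.1SG"), ("amas", "love.2SG"), ("amat", "love.3SG"),
    ("amamus", "love.1PL"), ("amatis", "love.2PL"), ("amant", "love.3PL"),
    ("et", "and"),
]


def induce_paradigm(tokens):
    """Group forms by lemma, then by (person, number) cell."""
    paradigm = {}
    for form, gloss in tokens:
        match = FEATURE.match(gloss)
        if match is None:
            continue  # Tokens without person/number features are skipped.
        cell = (match["person"], match["number"])
        paradigm.setdefault(match["lemma"], {}).setdefault(cell, set()).add(form)
    return paradigm


def render(cells):
    """Lay out one lemma's cells as rows of person by columns of number."""
    rows = []
    for person in "123":
        sg = "/".join(sorted(cells.get((person, "SG"), {"-"})))
        pl = "/".join(sorted(cells.get((person, "PL"), {"-"})))
        rows.append(f"{person}  {sg:<8}{pl}")
    return "\n".join(rows)


paradigm = induce_paradigm(TOKENS)
table = render(paradigm["love"])
```

Storing the cells separately from the rendering step means the same induced data could be laid out in different ways, for example with number as rows instead of columns.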
Kanarese Paradigm (1)
Kanarese Paradigm (2)
Effects on Linguistic Analysis
• Integrated encoding standards for linguistic data affect the practice of linguistic analysis
  • Some analysis types are now easier
  • New possibilities emerge
  • New analytical challenges are discovered
• Data linkage/integration is certainly one of the improvements
New Tools
• The next generation of tools which support these data models natively is emerging, e.g. FIELD, ELAN, Toolbox (almost)
• “Middleware” which allows the translation of legacy formats to and from these models is reasonably widely available
• Analytical tools are increasingly being implemented with web-grounded technologies and using web-derived models
• Open source/open data approaches are becoming pervasive
Conclusion
• Reducing the gap between computationally tractable representations, on which a high degree of functionality can be built, and the simple underlying formats driven by fieldwork-oriented tools is advantageous to all parties
  • It reduces the intermediate data-munging steps, which require technical rather than linguistic knowledge
• While we are not quite “there yet”, the light at the end of the tunnel is definitely visible
• A growing community of philosophically aligned computer scientists and linguists
References
• Bell & Bird, 2001. A Preliminary Study of the Structure of Lexicon Entries. Proceedings of the Workshop on Web-Based Language Documentation and Description.
• Bow, Hughes & Bird, 2003. Towards a General Model for Interlinear Text. Proceedings of EMELD 2003.
• Farrar, Lewis & Langendoen, 2002. A Common Ontology for Linguistic Concepts. Proceedings of the Knowledge Technologies Conference.
• Farrar & Langendoen, 2003. A Linguistic Ontology for the Semantic Web. GLOT International 7(3).
• Hughes, Bird & Bow, 2003. Encoding and Presenting Interlinear Text Using XML Technologies. Proceedings of ALTW 2003.
• Lai & Bird, 2004. Querying and Updating Treebanks: A Critical Survey and Requirements Analysis. Proceedings of ALTW 2004.
• Penton, Bow, Bird & Hughes, 2004. Towards a General Model for Linguistic Paradigms. Proceedings of EMELD 2004.
• Penton & Bird, 2004. Representing and Rendering Linguistic Paradigms. Proceedings of ALTW 2004.
• Bird, Chen, Davidson, Lee & Zheng, 2005. Extending XPath to Support Linguistic Queries. Proceedings of PLANX 2005.
• Cassidy & Bird, 2000. Querying Databases of Annotated Speech. Proceedings of the Eleventh Australasian Database Conference.
• Taylor, 2004. XSLT as a Linguistic Query Language. BSc(Hons) Thesis, University of Melbourne.
Questions? Comments?