what’s needed for lexical databases? experiences with kirrkirr

What’s needed for lexical databases? Experiences with Kirrkirr. Christopher Manning and Kristen Parton, Depts of Computer Science and Linguistics, Stanford University. http://www.sultry.arts.usyd.edu.au/kirrkirrr/


Post on 28-Jan-2016


Page 1: What’s needed for lexical databases? Experiences with Kirrkirr

What’s needed for lexical databases?

Experiences with Kirrkirr

Christopher Manning and Kristen Parton

Depts of Computer Science and Linguistics

Stanford University

http://www.sultry.arts.usyd.edu.au/kirrkirrr/

Page 2: What’s needed for lexical databases? Experiences with Kirrkirr

Overview

Background on the Kirrkirr project

What’s needed for dictionary databases

Kirrkirr data structure and data access

Page 3: What’s needed for lexical databases? Experiences with Kirrkirr

Background: Kirrkirr

A dictionary browser/visualization tool

In use with a dictionary of Warlpiri, an Indigenous Australian language (large for such a dictionary: 10 MB, with examples, cross-references, etc.)

Dictionary is maintained by linguists as text files, with a text editor, in an ad hoc format

We convert it automatically into validated XML (stack-based error-correcting Perl parser)

Kirrkirr software is written in Java (JDK 1.1, any platform) and uses an XML text file “database”

Page 4: What’s needed for lexical databases? Experiences with Kirrkirr

[Figure labels: Warlpiri, Warumungu, Alawa]

Page 5: What’s needed for lexical databases? Experiences with Kirrkirr

Kirrkirr: Objectives

Exploit the power of a computer interface in mediating between users and dictionary data

Present a dictionary in a way which is flexible, interactive, customizable, and fun

Do visualization: networks of words, domains, activities, dictionary reversal (Warlpiri-English to English-Warlpiri)

Suitable for diverse users, with widely varying literacy levels: inter alia linguists, elementary school children, teachers, and native speakers

Aid linguistic science: for subtle linguistic judgments, one needs speaker involvement

Page 6: What’s needed for lexical databases? Experiences with Kirrkirr

Usability

We’ve been doing paper and electronic dictionary usability testing (Corris, Manning, Poetsch, and Simpson 1999, 2001)

10/6/00: Steve Patrick Jampijinpa, Jessie Patrick Nangala and Samara Napangardi

Steve started to look at it with the children, … taking them through the exercises in the dictionary worksheet, and getting them to do the typing and mousing. JP was keen to look up words, Samara, being younger, was more interested in flashing things and banging keys, but was also keen to be involved. They were keen to look up words which had pictures…. They were disappointed not to find puluku in the dictionary – Samara tried to look it up under cow as well. JP was a slow careful speller, and so could type in words she wanted to know without having them written in front of her. We used the rhyme sort to find rhymes. While rhyme is not a feature of Warlpiri songs, it is useful for teaching phonics. Steve asked whether the dictionary would be at the school, and was pleased to hear that when Carmel got some more RAM it would be.

Page 7: What’s needed for lexical databases? Experiences with Kirrkirr

Overview

Background on the Kirrkirr project

What’s needed for dictionary databases

Kirrkirr data structure and data access

Page 8: What’s needed for lexical databases? Experiences with Kirrkirr

The many aspects of databases

Three levels: a logical level specifying query semantics between physical data level and external views of/interfaces to the data

Data model; data integrity and consistency

Query language

Concurrency control, transaction management, and data recovery: we’re not doing this (like most XML work? Abiteboul et al. 2000), but some people need this

Storage and query optimization; indices

Page 9: What’s needed for lexical databases? Experiences with Kirrkirr

Choices for dictionary representation

A relational database (Nathan and Austin 1992, …): the flexible, hierarchical, ordered text structure of dictionaries means that this is painful to do; retrieving dictionary entries may involve innumerable joins

A text file (“the document culture”): common in practice. No data integrity, etc. But portable and tangible. Authors like it.

As semi-structured data: matches the variable, non-rigid, and extensible hierarchical structure found in dictionaries

Page 10: What’s needed for lexical databases? Experiences with Kirrkirr

But semi-structured data is a continuum…

From highly structured data that could easily be represented in a relational or OO database (but isn’t for interchange or trendiness reasons)

To very unstructured text data, with occasional limited markup of basic structure

Linguistic databases tend to be at the unstructured end of the continuum

But (unfortunately for linguists) most work on semi-structured databases has focused on the quite structured end … with only very limited work aimed at text databases

Page 11: What’s needed for lexical databases? Experiences with Kirrkirr

Crucial observation for dictionary databases

In fairly unstructured databases, the contents of fields are also likely to be quite free-form

Desired querying is likely to involve flexible content-based queries

Current XML query language proposals don’t adequately support this style of usage

Even standard techniques for text, like word-based inverted file indices, often contain restrictions, such as allowing wildcards only at the end of words, which greatly limit their usefulness in text applications (e.g., PAT (Salminen and Tompa 1994) can’t search for ‘-isms’)

Page 12: What’s needed for lexical databases? Experiences with Kirrkirr

Ramifications for indexing

Pre-indexing is often not particularly useful or effective over text databases

Regular expressions are often more suitable

Linguists often want to ask pattern questions (words with a high vowel after a velar)

We can do “fuzzy spelling” correction without Soundex-style precomputation

In Kirrkirr, we’re working on doing online morphological analysis, which is again usefully viewed as a finite-state transduction
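A pattern question such as "words with a high vowel after a velar" can be posed directly as a regular expression over the headword list. The following is only an illustrative sketch: it assumes a simplified view of the orthography in which velars are written k or ng and high vowels i or u, and the class name, method name, and word list are hypothetical rather than taken from Kirrkirr.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class PatternQuery {
    // Assumption: velars are spelled "k" or "ng"; high vowels are "i" or "u".
    static final Pattern HIGH_VOWEL_AFTER_VELAR = Pattern.compile("(?:k|ng)[iu]");

    /** Returns the headwords that contain a match for the pattern anywhere. */
    static List<String> matching(List<String> headwords, Pattern pattern) {
        return headwords.stream()
                .filter(word -> pattern.matcher(word).find())
                .collect(Collectors.toList());
    }
}
```

Because the query is just a pattern, no index has to be prepared in advance for each kind of question a linguist might ask.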

Page 13: What’s needed for lexical databases? Experiences with Kirrkirr

Indexing

Indexing is not particularly needed: you can grep 10 MB in 2–3 seconds on a standard PC (users are happy to wait)

XML indexing research has concentrated on the structured end of the problem:

Regular expressions over path structures are not of much use for textbases

We mainly need queries over textual content within XML entities

There are not complex join conditions but simple use of intersection or alternation

Realistic search needs do not add excessive combinatoric complexity: a linear search of the text is sufficient
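A linear search with intersection and alternation might look roughly like the sketch below. It is not Kirrkirr's code: the `<ENTRY>` element comes from the slides, but the `HW` and `GL` element names, the class, and the method are assumptions made for illustration. Intersection is a list of patterns that must all match the same entry; alternation is ordinary `|` inside one pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LinearSearch {
    // One pass over the text, entry by entry; no index is consulted.
    private static final Pattern ENTRY =
            Pattern.compile("<ENTRY>.*?</ENTRY>", Pattern.DOTALL);

    /** Returns the entries in which every pattern matches (intersection). */
    static List<String> matchAll(String dictionaryXml, List<Pattern> patterns) {
        List<String> hits = new ArrayList<>();
        Matcher entries = ENTRY.matcher(dictionaryXml);
        while (entries.find()) {
            String entry = entries.group();
            boolean allMatch = true;
            for (Pattern p : patterns) {
                allMatch &= p.matcher(entry).find();
            }
            if (allMatch) {
                hits.add(entry);
            }
        }
        return hits;
    }
}
```

A query such as "headword starts with m and the gloss mentions dog or kangaroo" is then two patterns, for example `<HW>m[^<]*</HW>` and `<GL>[^<]*(?:dog|kangaroo)[^<]*</GL>`, each confined to the text of one element.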

Page 14: What’s needed for lexical databases? Experiences with Kirrkirr

Data models/schemas

Data consistency and correctness are vitally important

Even if authors like text editors, it’s a license to make errors and inconsistencies

Every kind of validation available has been useful (DTD, id/idref-style constraints)

One dictionary data model doesn’t fit all

E.g., Warlpiri dictionary has unusual organization via paradigm examples

I feel that exploring mediators will be more profitable than complex standards
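To make the DTD and id/idref point concrete, here is a small sketch using the validating parser in the standard Java XML API. The miniature DTD, the `CF` cross-reference element, and the attribute names are hypothetical and are not the Warlpiri dictionary's actual schema; the point is only that a dangling cross-reference is reported by the parser without any custom checking code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

final class DtdCheck {
    // Hypothetical miniature DTD: each entry has an ID, and each
    // cross-reference (CF) must point at an existing entry via IDREF.
    static final String DTD =
          "<!DOCTYPE DICTIONARY [\n"
        + "<!ELEMENT DICTIONARY (ENTRY*)>\n"
        + "<!ELEMENT ENTRY (HW, CF*)>\n"
        + "<!ATTLIST ENTRY id ID #REQUIRED>\n"
        + "<!ELEMENT HW (#PCDATA)>\n"
        + "<!ELEMENT CF EMPTY>\n"
        + "<!ATTLIST CF ref IDREF #REQUIRED>\n"
        + "]>\n";

    /** Parses DTD + body with validation on and returns the reported problems. */
    static List<String> validate(String body) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        List<String> problems = new ArrayList<>();
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { problems.add(e.getMessage()); }
            public void error(SAXParseException e) { problems.add(e.getMessage()); }
            public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });
        builder.parse(new InputSource(new StringReader(DTD + body)));
        return problems;
    }
}
```

Validity errors (such as an IDREF with no matching ID) are collected rather than thrown, so a whole file can be checked in one run; well-formedness errors still abort the parse.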

Page 15: What’s needed for lexical databases? Experiences with Kirrkirr

Overview

Background on the Kirrkirr project

What’s needed for dictionary databases

Kirrkirr data structure and data access

Page 16: What’s needed for lexical databases? Experiences with Kirrkirr

Data structures and data access in Kirrkirr

Data maintained by lexicographers in text files

Backslash codes, but with end tags, nesting

Converted to XML via Perl parser

Result is guaranteed to be valid XML (though heuristic parser can make semantic errors)

This has involved a lot of work and revealed many inconsistencies in the data. Painful!

Automatic data consistency and integrity maintenance is really useful, I’d argue!

But text gives freedom, ease-of-use, tangibility (UI issues win: cf. Excel vs. Access)
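The idea of a stack-based, error-correcting conversion from backslash codes to XML can be sketched as follows. This is written in Java for consistency with the rest of Kirrkirr, whereas the actual converter is a Perl parser whose details are not given here. The code inventory (`me`, `gl`, `cf`) and the convention that `\eX` closes `\X` are assumptions for illustration, and the repair strategy shown (auto-closing inner elements, dropping stray end tags) is one plausible choice, not necessarily the authors'.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BackslashToXml {
    // Hypothetical code inventory; "\eX" is assumed to be the end tag of "\X".
    private static final Set<String> CODES = Set.of("me", "gl", "cf");
    // A token is either a backslash code or a run of text without backslashes.
    private static final Pattern TOKEN = Pattern.compile("\\\\(\\w+)|([^\\\\]+)");

    static String convert(String source) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>();   // currently open elements
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            String code = m.group(1);
            if (code == null) {
                out.append(escape(m.group(2).trim()));
            } else if (CODES.contains(code)) {
                open.push(code);
                out.append('<').append(code.toUpperCase()).append('>');
            } else if (code.startsWith("e") && open.contains(code.substring(1))) {
                // Error correction: close anything left open inside this element.
                String target = code.substring(1);
                String top;
                do {
                    top = open.pop();
                    out.append("</").append(top.toUpperCase()).append('>');
                } while (!top.equals(target));
            }
            // Unknown codes and end tags with no matching start are dropped.
        }
        while (!open.isEmpty()) {                  // close whatever is still open
            out.append("</").append(open.pop().toUpperCase()).append('>');
        }
        return out.toString();
    }

    private static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

Because every element pushed on the stack is eventually popped and all text is escaped, the output is balanced markup even for inconsistent input. As the slide notes, though, a heuristic repair can still be semantically wrong, for instance by closing an element earlier than the lexicographer intended.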

Page 17: What’s needed for lexical databases? Experiences with Kirrkirr

Indices/tables

Kirrkirr builds and stores on disk two custom indices/tables over the XML

One indexes Warlpiri headwords to XML file positions, and holds a few extra bits of info (about pictures, subentry status, etc.) so the scroll list can be displayed quickly

The other indexes English glosses to Warlpiri words

Maintained in memory at runtime (not that large, allows easy regexp-based fuzzy spelling matching)
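Regexp-based fuzzy spelling over an in-memory headword list could work along the lines of the sketch below, which turns what the user typed into a forgiving pattern instead of precomputing Soundex-style keys. The particular equivalences (p/b, t/d, k/g, e for i, o for u, single versus doubled letters) are assumptions chosen for illustration, not Kirrkirr's actual rules, and the class and method names are hypothetical.

```java
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class FuzzySpelling {
    /** Builds a forgiving pattern from the spelling the user typed. */
    static Pattern fuzzyPattern(String typed) {
        StringBuilder regex = new StringBuilder();
        char previous = 0;
        for (char c : typed.toLowerCase(Locale.ROOT).toCharArray()) {
            if (c == previous) continue;   // collapse doubled letters in the query
            previous = c;
            switch (c) {
                case 'p': case 'b': regex.append("[pb]"); break;
                case 't': case 'd': regex.append("[td]"); break;
                case 'k': case 'g': regex.append("[kg]"); break;
                case 'i': case 'e': regex.append("i+"); break;   // also long vowels
                case 'u': case 'o': regex.append("u+"); break;
                case 'a': regex.append("a+"); break;
                case 'r': regex.append("r+"); break;             // r versus rr
                default: regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    /** Returns the headwords that the typed spelling could stand for. */
    static List<String> lookup(String typed, List<String> headwords) {
        Pattern pattern = fuzzyPattern(typed);
        return headwords.stream()
                .filter(word -> pattern.matcher(word).matches())
                .collect(Collectors.toList());
    }
}
```

Since the headword list is small enough to hold in memory, matching the pattern against every headword on each lookup is cheap, and the equivalence rules can be changed without rebuilding anything.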

Page 18: What’s needed for lexical databases? Experiences with Kirrkirr

Kirrkirr data access

[Diagram. Components shown: Kirrkirr Dictionary Browser; dictionary interface; XML Warlpiri dictionary file (<DICTIONARY> containing <ENTRY>...</ENTRY> elements); indices in memory (word, position, bits; English to Warlpiri); XML Parser; XML Document Object Model; grep (Jakarta-ORO)]

Our “logical level” is Java code with hardwired methods for each query, though we have also experimented with XQL (for parts of it)

Page 19: What’s needed for lexical databases? Experiences with Kirrkirr

Data access

Scroll list display, simple lookups and searches over headwords and glosses done purely from in-memory indices

Getting cross-references for network display, semantic domains, pictures, HTML, etc. is done by using the index to jump into the XML file, and then parsing it (with SAX, until the end of the entry)

Complex searches are done as entity-sensitive regexp search over either the whole dictionary file, or the entries that the search is restricted to (found via the headword index)
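The "jump into the XML file, then parse with SAX until the end of the entry" step might be sketched as below. To stay self-contained it works on a byte array rather than a file on disk, and the `HW` and `GL` element names, the class, and the sample data are assumptions; only `<ENTRY>` is taken from the slides. Stopping the parser by throwing an exception from the handler is a common SAX idiom and is not necessarily how Kirrkirr does it.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class EntryLookup {
    /** Maps each headword to the byte offset of its <ENTRY> start tag. */
    static Map<String, Integer> buildIndex(byte[] xml) {
        // ISO-8859-1 decodes one byte to one char, so char offsets equal byte offsets.
        String text = new String(xml, StandardCharsets.ISO_8859_1);
        Matcher m = Pattern.compile("<ENTRY>\\s*<HW>([^<]*)</HW>").matcher(text);
        Map<String, Integer> index = new LinkedHashMap<>();
        while (m.find()) {
            byte[] raw = m.group(1).getBytes(StandardCharsets.ISO_8859_1);
            index.put(new String(raw, StandardCharsets.UTF_8), m.start());
        }
        return index;
    }

    /** Parses only the entry that starts at the given offset and returns its gloss text. */
    static String glossAt(byte[] xml, int offset) throws Exception {
        StringBuilder gloss = new StringBuilder();
        boolean[] finished = {false};
        DefaultHandler handler = new DefaultHandler() {
            private boolean inGloss;
            @Override public void startElement(String uri, String local, String qName, Attributes atts) {
                if (qName.equals("GL")) inGloss = true;
            }
            @Override public void characters(char[] ch, int start, int length) {
                if (inGloss) gloss.append(ch, start, length);
            }
            @Override public void endElement(String uri, String local, String qName) throws SAXException {
                if (qName.equals("GL")) inGloss = false;
                if (qName.equals("ENTRY")) {       // end of entry: stop reading the file
                    finished[0] = true;
                    throw new SAXException("end of entry");
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(xml, offset, xml.length - offset), handler);
        } catch (SAXException e) {
            if (!finished[0]) throw e;             // a real parse error, not our stop signal
        }
        return gloss.toString();
    }
}
```

The parser sees the entry's start tag as if it were the document root, so only the bytes of that one entry are ever interpreted, however large the dictionary file is.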

Page 20: What’s needed for lexical databases? Experiences with Kirrkirr

Customizing Format with XSLT

XSLT stylesheets format dictionary entries in ways suited to the needs of different users

E.g., simple formats for low literacy users

The resulting HTML pages show typed cross-references in the dictionary as colored hyperlinks between different words

Since the XML is parsed at run-time, we can add extra information by “parameter passing” from the program to the XSLT

E.g., file locations for pictures, search titles
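Passing parameters from the program to a stylesheet can be done with the standard `javax.xml.transform` API, as in the sketch below. The stylesheet, the `pictureDir` parameter, and the `HW`, `GL`, and `IMAGE` element names are all made up for illustration; Kirrkirr's real stylesheets are not shown in these slides.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class EntryFormatter {
    // Hypothetical stylesheet: the picture directory is supplied at run time.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='html' indent='no'/>"
        + "<xsl:param name='pictureDir'/>"
        + "<xsl:template match='ENTRY'>"
        + "<p><b><xsl:value-of select='HW'/></b>: <xsl:value-of select='GL'/>"
        + "<img src='{$pictureDir}/{IMAGE}'/></p>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Formats one entry as HTML, telling the stylesheet where pictures live. */
    static String toHtml(String entryXml, String pictureDir) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        transformer.setParameter("pictureDir", pictureDir);
        StringWriter html = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(entryXml)),
                              new StreamResult(html));
        return html.toString();
    }
}
```

A different stylesheet (for example, a simpler layout for low-literacy users) can be swapped in without touching the Java code, while machine-specific details such as picture locations stay out of the stylesheets.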

Page 21: What’s needed for lexical databases? Experiences with Kirrkirr

English-Warlpiri Dictionary

Source dictionary is only Warlpiri-English, but a bidirectional dictionary is needed by users

An English index was built from glosses so that glosses link to equivalent Warlpiri entries

Basis for English wordlist and fast search

Multiword glosses are indexed under every word except stopwords, giving easy lookup

One underlying dictionary: data consistency

The XML entries of all Warlpiri equivalents to an English word are merged, and passed to an XSLT stylesheet, which produces merged HTML
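Building an English index from the glosses, with multiword glosses indexed under every word except stopwords, could be sketched like this. The stopword list, the class name, and the sample glosses are assumptions for illustration rather than Kirrkirr's actual data.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class GlossIndex {
    // Hypothetical stopword list; a real one would be longer.
    private static final Set<String> STOPWORDS = Set.of("a", "an", "the", "of", "to", "or");

    /** Maps each non-stopword English gloss word to the headwords glossed with it. */
    static Map<String, Set<String>> build(Map<String, String> glossByHeadword) {
        Map<String, Set<String>> index = new TreeMap<>();
        glossByHeadword.forEach((headword, gloss) -> {
            for (String word : gloss.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
                if (word.isEmpty() || STOPWORDS.contains(word)) continue;
                index.computeIfAbsent(word, key -> new TreeSet<>()).add(headword);
            }
        });
        return index;
    }
}
```

Because the index is derived from the single Warlpiri-English source, the English side never has to be edited separately, which is what keeps the two directions consistent.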

Page 22: What’s needed for lexical databases? Experiences with Kirrkirr

Warlpiri Morphological Parsing

Warlpiri is an agglutinating language:

nyangulparnangku

nya -ngu -lpa =rna =ngku

see -PAST -IPFV =1SG.SUBJ =2SG.OBJ

‘I was looking at you.’

For lookup/linking, users or the program have to know the root/citation form

This is difficult for people with limited literacy

We have been developing a morphological analyzer so we can look up any form, and link words in examples, etc. (Finite state methods)
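As a rough illustration of looking up an inflected form, the sketch below strips known suffixes and clitics from the end of a word until a root in the lexicon is reached. It is a naive recursive suffix stripper, not a finite-state transducer and not the analyzer under development; the root and affix lists contain only the morphemes from the example above, and no ordering or phonological constraints are modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class SuffixStripper {
    // Only the morphemes from the example; a real lexicon would be far larger.
    private static final Set<String> ROOTS = Set.of("nya");
    private static final List<String> AFFIXES = List.of("-ngu", "-lpa", "=rna", "=ngku");

    /** Returns every segmentation of the form into a known root plus known affixes. */
    static List<String> analyze(String form) {
        List<String> analyses = new ArrayList<>();
        if (ROOTS.contains(form)) {
            analyses.add(form);
        }
        for (String affix : AFFIXES) {
            String surface = affix.substring(1);   // drop the "-" or "=" boundary symbol
            if (form.length() > surface.length() && form.endsWith(surface)) {
                String stem = form.substring(0, form.length() - surface.length());
                for (String rest : analyze(stem)) {
                    analyses.add(rest + " " + affix);
                }
            }
        }
        return analyses;
    }
}
```

For example, `SuffixStripper.analyze("nyangulparnangku")` yields the single segmentation `nya -ngu -lpa =rna =ngku`, whose first element is the root to look up. A finite-state analyzer would additionally rule out impossible affix orders, which this sketch does not attempt.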

Page 23: What’s needed for lexical databases? Experiences with Kirrkirr

Conclusions

The data structuring and data integrity of a semi-structured database are great for dictionaries

A query language that supported textual content-based queries well would be great too

At present, though, we do not have many good options, and Kirrkirr gets by with limited ad hoc indices and text searches, done via a dictionary abstraction layer in the code

This hasn’t troubled us too much; UI issues have normally been much bigger challenges

Page 24: What’s needed for lexical databases? Experiences with Kirrkirr

Acknowledgements

Ken Hale, Mary Laughren, Robert Hoogenraad, Jane Simpson, David Nash, Nic Gambold, Kay Ross, Kevin Jansz, Nitin Indurkhya, Kevin Lim, Miriam Corris, Susan Poetsch, and many others….