1
Language Documentation in West Africa
July 27 2010
Winneba, Ghana
David Nathan
Endangered Languages Archive
Hans Rausing Endangered Languages Project
SOAS, University of London
Data management
2
Data management
Data management is crucial for language documentation because …
documentation revolves around data
data can be irreplaceable
media data needs special management
you may end up with hundreds if not thousands of files
3
This session …
4
Your data management
are you satisfied?
do you manage your data? how?
what problems do you have?
5
Relationship to archiving
what is the relationship between data management and archiving?
well managed data will be easier to archive
well managed data will be worth archiving
well managed data will be useful and shareable with others
archives are responsible for preserving, and perhaps disseminating, materials. Like libraries, they are not completely responsible for the quality of their holdings
6
The core of data management
making what you have (or a certain collection of it) findable, explicit and understandable is more important than having lots of higgledy-piggledy stuff
Sam Atintono’s video
7
Findable?
by ourselves
by others
by the computer
8
Explicit?
all structures and conventions are clearly defined
9
Understandable?
your explanations and metadata are responsible for carrying understanding of the materials into the future, for a range of others
you should express your understandings and implicit knowledge
10
And …
your data should also be portable …
imagine it was all placed on someone else’s computer right now
11
The path of data
Where is data before it enters your computer?
documents, e.g. fieldnotes
audio and video recordings
pictures, diagrams
in your mind
12
The process of getting it into the computer is often where information gets lost, e.g.:
typing up fieldnotes and leaving out diagrams/markings/structures
putting recordings, pictures etc. into the computer without linking them to their context (metadata/captions etc.)
13
Four aspects of data management
data management outside or prior to the computer
file management
information management within files
links between information
14
Outside or prior to the computer
categories of things, e.g.
  people: names, ages, number of children etc.
  events: locations, time, participants
information objects and their structures and conventions, e.g.
  dictionaries have headwords, POS, glosses, examples; subentries may repeat the structures
  other texts have embedded categories, which may depend on linguistic system or hearer knowledge (e.g. book titles)
15
we need to distinguish data structures from the representations of them
while working on data, represent structures explicitly
G:\ELAR_EXAMPLE_DATA\data\dryer2008walman
16
In your computer
file management
design a well-organised system of folders so that you can always find your stuff according to what it is, not when/where you last used it
don’t just put your files where your software wants to put them
don’t just name your files as your software wants to name them
17
File structures and names
design the folder structure as a logical hierarchy that suits your project and way of working
have materials gathered within one overall directory (e.g. for backup)
make directories for relevant categories, e.g. media types, days, sessions
design it so that you will always be able to find things
you need to be able to create and move folders
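A minimal sketch of setting up one overall project directory with a folder per category. The category names and the project name `my_project` are hypothetical, and the tree is built in a temporary location so the example can be run safely.

```python
import tempfile
from pathlib import Path

# Hypothetical categories; choose ones that suit your project and way of working.
CATEGORIES = ["audio", "video", "photos", "transcriptions", "metadata"]


def create_project_tree(root, categories):
    """Create one overall directory with a sub-folder per category."""
    created = []
    for category in categories:
        folder = root / category
        folder.mkdir(parents=True, exist_ok=True)  # safe to run repeatedly
        created.append(folder)
    return created


with tempfile.TemporaryDirectory() as base:
    project = Path(base) / "my_project"
    create_project_tree(project, CATEGORIES)
    tree = sorted(p.name for p in project.iterdir())

print(tree)
```

Because everything sits under one root, backing up the project means copying a single folder.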
18
Designing a file structure
it should relate to real, actual life
locations should be obvious to you, so you will know where to look for things (hint: where do you keep your socks; passport; favorite cup; shoe polish?)
the best location is "the place that you would naturally look to find something"
19
On identifiers
real world objects are inherently identified because of their physical uniqueness, location etc. An unlabelled cassette is only poorly identified. Digital objects have no such physical independence - they depend on the identifiers that we give them
we can recognise three types of identifiers:
semantic
keys
relative
20
On identifiers
semantic, e.g.
Nelson Mandela
The Sound of Music
SA_JA_Bongo_Palace_Land Dispute Trial_015_29-04-2010.wav
21
On identifiers
keys (disambiguators), e.g.
1137204 (a student number)
0803 211 6148 (a telephone number)
p12893fh23.pdf (some system's reference number)
22
On identifiers
relative, e.g.
67 High Street
the secretary
index.html
metadata.xls
23
On identifiers
your collection will have a mix of these but it is important to be aware of the differences, for example:
semantic identifiers can be types or tokens (or masters vs. copies)
keys: a program or process might depend on the identifier to work properly
if you move items with relative names you may destroy their meaning
24
Objects and identities
the identity of a digital object relies on its location
the full identity of a file is its path + filename. The path is a representation of the directory/folder hierarchy
if the full identity is naturally unambiguous then everything is fine; compare the following:
c:\dogs\spaniels\rover.jpg
c:\cars\british\rover.jpg
or
lectures\syntax\20091103\lect-notes.doc
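As an illustration of "path + filename", the sketch below compares the two rover.jpg examples using Python's `pathlib`. Python is an assumption here; the slides do not name a tool.

```python
from pathlib import PureWindowsPath

dog = PureWindowsPath(r"c:\dogs\spaniels\rover.jpg")
car = PureWindowsPath(r"c:\cars\british\rover.jpg")

# The bare filenames are identical ...
same_name = dog.name == car.name

# ... but the full identities (path + filename) differ.
same_identity = dog == car

print(same_name, same_identity)
print(dog.parent, car.parent)
```

The filename alone cannot tell the two files apart; only the directory path does. This is why moving such a file can change what it appears to be.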
25
Objects and identities
but semantic identifiers are potentially dangerous, because just adding more of them to disambiguate filenames will not work:
my\rover.jpg
my\white_rover.jpg
so objects in your system which are not naturally semantically unique need identifiers which are either keys, or relative
26
File naming
we tend to be unsystematic in naming files. This can often be OK, and if you have a method that already does everything you need to do (and will need to do in the future) then you do not need to change anything. But filenames that are unsystematic or are non-standard will cause problems for most people
27
Filename rules
don't accept the default filename suggested by an application when you first save a file
don’t accept the default location suggested by an application
a new file: put it where it belongs, immediately. If necessary, create the place (directory/path) where it belongs (I often create a new blank file in the right place, and only then start adding the content)
28
Filename rules
all filenames should have correct extensions
each filename should have only one ".", before the extension
do not use characters other than letters, numbers, hyphen - and underscore _
wherever possible, avoid non-ASCII characters in filenames
keep filenames short, just long enough to contain the necessary identifier - don't fill them up with lots of information about the content (that is metadata! - see later)
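The rules above can be checked mechanically. This is a rough sketch, not a standard tool: the function name `filename_problems` and the 40-character limit are assumptions, since the rules do not say how short "short" is.

```python
import re

# ASCII letters, digits, hyphen, underscore, plus the dot before the extension.
ALLOWED = re.compile(r"[A-Za-z0-9_.-]+")


def filename_problems(name, max_length=40):
    """Return a list of rule violations for a filename (empty list = OK)."""
    problems = []
    if name.count(".") != 1 or name.startswith(".") or name.endswith("."):
        problems.append('needs exactly one ".", placed before the extension')
    if not ALLOWED.fullmatch(name):
        problems.append("contains characters other than letters, numbers, - and _")
    if len(name) > max_length:
        problems.append("is longer than max_length")
    return problems


for candidate in ["paaka_063.wav",
                  "day.dreaming.xls",
                  "SA_JA_Bongo_Palace_Land Dispute Trial_015_29-04-2010.wav"]:
    print(candidate, filename_problems(candidate))
```

The check cannot tell whether an extension is the correct one for the file's content; that still needs a human or a format-detection tool.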
29
Make filenames sortable
make filenames usefully sortable:

20100119lecture.doc
20100203lecture.doc

gr_transcription_1.txt
gr_transcription_2.txt
gr_transcription_9.txt
gr_transcription_53.txt

gr_transcription_001.txt
gr_transcription_002.txt
gr_transcription_009.txt
gr_transcription_053.txt
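A short sketch of why the zero-padded numbers matter, assuming filenames are sorted as plain text, which is how most file browsers and programs sort them:

```python
numbers = (1, 2, 9, 53)

unpadded = [f"gr_transcription_{n}.txt" for n in numbers]
padded = [f"gr_transcription_{n:03d}.txt" for n in numbers]

# Text sorting compares character by character, so "53" sorts before "9".
print(sorted(unpadded))

# With leading zeros, text order and numeric order agree.
print(sorted(padded))
```

The same reasoning applies to dates: written as year, month, day (20100119), text order is also chronological order.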
30
Associating files
you can make resources sortable together by giving them the same filename root (the part before the extension), or part of the root:
gr_reefs.wav
gr_reefs.eaf
gr_reefs.txt

paaka_photo001.jpg
paaka_photo002.jpg
paaka_txt_conv203.wav
paaka_txt_conv203.eaf
paaka_txt_lex.doc
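A shared filename root also lets a program collect related files automatically. The sketch below groups the example names by root; the variable name `bundles` is chosen only for illustration.

```python
from collections import defaultdict
from pathlib import PurePath

names = [
    "gr_reefs.wav", "gr_reefs.eaf", "gr_reefs.txt",
    "paaka_photo001.jpg", "paaka_photo002.jpg",
    "paaka_txt_conv203.wav", "paaka_txt_conv203.eaf", "paaka_txt_lex.doc",
]

# Map each filename root (the part before the extension) to its extensions.
bundles = defaultdict(list)
for name in sorted(names):
    path = PurePath(name)
    bundles[path.stem].append(path.suffix)

for root, extensions in bundles.items():
    print(root, extensions)
```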
31
Avoid metadata in filenames
avoid stuffing metadata into filenames. The filename is an identifier, not a container for information. Some people try to put language names, locations, speech genres, dates, speakers' names etc all into their filenames
better to use a simple (semantic) filename or a key (i.e. meaningless) filename, and then create a metadata table. The table can contain all the information, fully expressed, and can be extended with further metadata
32
Avoid metadata in filenames
i.e. NOT
Paaka_Reefs_Dan_BH_3Oct97.wav
better:
paaka_063.wav
paaka_063.txt

language   topic                speaker      location     date
Paakantyi  Reefs at Mutawintyi  Dan Herbert  Broken Hill  1997-10-03

paaka_063.txt
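One possible way to keep such a table is a plain-text CSV file with one row per recording. This is a sketch: the `file` column and the use of CSV are assumptions, and the row reuses the example values from this slide.

```python
import csv
import io

FIELDS = ["file", "language", "topic", "speaker", "location", "date"]

rows = [{
    "file": "paaka_063.wav",
    "language": "Paakantyi",
    "topic": "Reefs at Mutawintyi",
    "speaker": "Dan Herbert",
    "location": "Broken Hill",
    "date": "1997-10-03",
}]

# Write the table as CSV text (an in-memory buffer stands in for a real file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)
table_text = buffer.getvalue()

# Read it back and look up a recording by its short filename.
lookup = {row["file"]: row for row in csv.DictReader(io.StringIO(table_text))}
print(lookup["paaka_063.wav"]["speaker"])
```

New columns can be added to `FIELDS` later without renaming any file.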
33
make sure to carefully design a filename system for your important data and to document that system so that somebody else can understand it.
some examples that could be improved
G:\ELAR_EXAMPLE_DATA\data\delta_abbixxx\aud03_May2008\MD 01
Sam’s samples
34
Filename system
make sure to carefully design a filename system for your important data and to document that system so that somebody else can understand it.
Sam’s new system (eg an audio file)
aaa_bb_cc_yyyy-mm-dd_nnn.wav
35
A filenaming system
aaa_bb_cc_yyyy-mm-dd_nnn.wav
aaa = village code
bb = (main) speaker code
cc = genre/event code
yyyy-mm-dd = date (why this order?)
nnn = optional number (e.g. 001)
.wav = correct extension for file content type
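A documented pattern like this can be parsed and checked by a program. The sketch below is one possible reading of the pattern. It assumes the codes are letters or digits of exactly the lengths shown, and the codes in the sample filenames (`bon`, `sa`, `lt`) are invented for illustration.

```python
import re
from datetime import date

PATTERN = re.compile(
    r"(?P<village>[A-Za-z0-9]{3})_"
    r"(?P<speaker>[A-Za-z0-9]{2})_"
    r"(?P<genre>[A-Za-z0-9]{2})_"
    r"(?P<date>\d{4}-\d{2}-\d{2})"
    r"(?:_(?P<number>\d{3}))?"          # nnn is optional
    r"\.(?P<extension>[A-Za-z0-9]+)"
)


def parse_filename(name):
    """Split a filename into its coded parts, or return None if it does not fit."""
    match = PATTERN.fullmatch(name)
    if match is None:
        return None
    fields = match.groupdict()
    try:
        fields["date"] = date.fromisoformat(fields["date"])  # rejects e.g. month 13
    except ValueError:
        return None
    return fields


print(parse_filename("bon_sa_lt_2010-04-29_015.wav"))
print(parse_filename("bon_sa_lt_29-04-2010_015.wav"))  # day-first date does not fit

# Year-first dates make text order match chronological order.
print(sorted(["bon_sa_lt_2010-04-29_015.wav", "bon_sa_lt_2009-12-01_002.wav"]))
```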
36
Documenting the filename system
write down the system (as in previous slide)
document the codes: this is probably also part of your metadata
37
Metadata
metadata is data about data
for identification, management, retrieval
providing the context and understanding of that data
carries those understandings into the future, to others
reflects knowledge and practices of providers
defines and constrains audiences and usages for data
the goals of language documentation heighten the importance of metadata
38
Metadata formats
common or standard:
IMDI (ISLE Metadata Initiative, DoBeS)
OLAC (Open Language Archives Community)
EAD (Encoded Archival Description), others
ELAR has its own set, currently being developed. For ELAR:
deposit-wide metadata in deposit form
also, depositor’s own metadata
39
On metadata formats
each depositor can also have different metadata!
types of metadata are relative to each project, consultants, community ...
our goal: to maximise the amount and quality of metadata
quality and extent are more important than standards and comparability
many depositors are sending extensive metadata in a variety of formats including spreadsheets
40
Types of metadata
depositor's / delegates' details
descriptive metadata
administrative metadata
preservation metadata
access protocols
metadata for individual files
41
Depositors and delegates
name
address
contact details (telephone, fax, email, URL)
role
affiliation
date of birth
nationality
42
Descriptive metadata
title, description, subject, summary
keywords
subject language, community
location
time span
43
Administrative metadata
project details
funding and hosting institutions
details of other copies of data
modifications and status
details of accession agreement
cf. deposit form
access
access protocols (see elsewhere)
group membership identification
44
Preservation metadata
(original) carrier media
formats, size
provenance (source/history)
45
File-level metadata
media files
  duration, file size
  MIME type, content type, format
text files
  font, character set, encoding
  format, markup
access protocols
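Some file-level metadata can be collected automatically. The sketch below reads file size and guesses a MIME type from the extension using Python's standard library. The function name `file_level_metadata` is hypothetical, and duration is left out because reading it needs a media library.

```python
import mimetypes
import tempfile
from pathlib import Path


def file_level_metadata(path):
    """Collect basic metadata that the file system and extension can provide."""
    mime_type, _encoding = mimetypes.guess_type(path.name)
    return {
        "filename": path.name,
        "size_bytes": path.stat().st_size,
        "mime_type": mime_type,  # guessed from the extension only
    }


with tempfile.TemporaryDirectory() as folder:
    sample = Path(folder) / "paaka_063.txt"
    sample.write_bytes("Reefs at Mutawintyi\n".encode("utf-8"))
    record = file_level_metadata(sample)

print(record)
```

Because the MIME type is guessed from the extension, it is only as reliable as the extension itself.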
46
Metadata
examples
G:\ELAR_EXAMPLE_DATA\data\coelho
G:\ELAR_EXAMPLE_DATA\data\bowern\1
G:\ELAR_EXAMPLE_DATA\data\delta_abbixxx
G:\ELAR_EXAMPLE_DATA\data\kansakar\Baram_Elan_Sample_Data
G:\ELAR_EXAMPLE_DATA\data\round
yet more metadata
how can I create it:
design it!
use a table format, e.g. MS Excel
or use plain text
47
Some other advice
have only one master version of each resource (but have it well backed up)
for text, use plain text wherever possible
what is "plain text?"
how do I create it?
how do I convert to plain text?
what is good software for plain text?*
always use Unicode for special characters
*http://notepad-plus.sourceforge.net/
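A plain text file is just characters stored in a stated encoding, with no fonts or layout. As a small sketch, assuming UTF-8 as the Unicode encoding (the slides say only "Unicode"), the code below writes three IPA letters to a text file and reads them back unchanged.

```python
import tempfile
from pathlib import Path

line = "ŋ ɛ ɔ"  # sample content only: three IPA letters separated by spaces

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "sample.txt"
    path.write_text(line, encoding="utf-8")      # always state the encoding
    raw_bytes = path.read_bytes()
    restored = path.read_text(encoding="utf-8")

# 5 characters are stored as 8 bytes: each special letter takes 2 bytes in UTF-8.
print(len(line), len(raw_bytes), restored == line)
```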
48
On files, applications and Windows
Windows associates files with software according to their extension
the characters that follow the last dot in the filename, e.g.
reality.doc
fantasy.jpg
dreaming.xls
day.dreaming.xls
if you double click on files to open them, you are hostage to Windows’ preference
this can be avoided, and changed
how?
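The "characters that follow the last dot" rule can be seen with Python's `pathlib`, which treats the extension the same way. This is a small illustration, not a description of how Windows itself is implemented.

```python
from pathlib import PureWindowsPath

names = ["reality.doc", "fantasy.jpg", "dreaming.xls", "day.dreaming.xls"]

# .suffix is everything from the last dot onwards.
extensions = {name: PureWindowsPath(name).suffix for name in names}

for name, extension in extensions.items():
    print(name, extension)
```

Note that `day.dreaming.xls` counts as an `.xls` file: only the last dot matters, which is one reason to keep a single dot in each filename.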
49
MS Word
valuable but for making print materials (letters, reports, books), not for writing or managing data
encourages you to use features that do not exist in other software, or may not exist in the future e.g. colour, font size and weight, spacing
the file format is Microsoft's commercial secret - users need to buy it, and Microsoft may change it
your data may not be shareable or archivable
G:\ELAR_EXAMPLE_DATA\data\dryer2008walman
View invisible characters
50
MS Word
however, it is still legitimate and useful to use it for some parts of your project, but
understand the limitations
use it well; this means using minimal typography, or all typographical effects 100% consistent, controlled by styles, and documented
51
Backup
how do you do backup?
there are many ways of doing backup but it must be habitual and effective (i.e. tested)
backup is for when you have a disaster!
backup problems
recommended: keep 3 copies of your data/work
on 2 different forms of media (e.g. hard disk and flash card)
in 2 different places
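One way to test a backup is to compare a checksum of each copy with the original. This is a minimal sketch, assuming SHA-256 as the checksum; the function names are hypothetical, and temporary folders stand in for the second disk or flash card.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source, backup_dir):
    """Copy a file into backup_dir and confirm the copy is identical."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    copy = Path(shutil.copy2(source, backup_dir / source.name))
    return sha256_of(source) == sha256_of(copy)


with tempfile.TemporaryDirectory() as work, tempfile.TemporaryDirectory() as other:
    master = Path(work) / "paaka_063.txt"
    master.write_bytes(b"sample content")
    verified = backup_and_verify(master, Path(other) / "backup")

print(verified)
```

A matching checksum shows only that the copy was identical when it was made; the copies still need to sit on different media and in different places.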
52
Threats to your computer and data
power loss/surges
loss and theft
viruses (use a good antivirus or a computer system that resists viruses, such as the Linux family, e.g. Ubuntu)
53
End