1 Language Documentation in West Africa July 27 2010 Winneba, Ghana David Nathan Endangered Languages Archive Hans Rausing Endangered Languages Project SOAS, University of London Data management


Language Documentation in West Africa
July 27 2010
Winneba, Ghana

David Nathan
Endangered Languages Archive
Hans Rausing Endangered Languages Project
SOAS, University of London

Data management

Data management

Data management is crucial for language documentation because …
  documentation revolves around data
  data can be irreplaceable
  media data needs special management
  you may end up with hundreds if not thousands of files

This session …

Your data management

are you satisfied?
do you manage your data?
how?
what problems do you have?

Relationship to archiving

what is the relationship between data management and archiving?
  well managed data will be easier to archive
  well managed data will be worth archiving
  well managed data will be useful and shareable with others
  archives are responsible for preserving, and perhaps disseminating, materials. Like libraries, they are not completely responsible for the quality of their holdings

The core of data management

making what you have (or a certain collection of it) findable, explicit and understandable is more important than having lots of higgledy-piggledy stuff

Sam Atintono’s video

Findable?

by ourselves
by others
by the computer

Explicit?

all structures and conventions are clearly defined

Understandable?

your explanations and metadata are responsible for carrying understanding of the materials into the future, for a range of others

you should express your understandings and implicit knowledge

And …

your data should also be portable …
  imagine it was all placed on someone else’s computer right now

The path of data

Where is data before it enters your computer?
  documents, e.g. fieldnotes
  audio and video recordings
  pictures, diagrams
  in your mind

The process of getting it into the computer is often where information gets lost, e.g.:
  typing up fieldnotes and leaving out diagrams/markings/structures
  putting recordings, pictures etc into the computer without linking them to their context (metadata/captions etc)

Four aspects of data management

data management outside or prior to the computer
file management
information management within files
links between information

Outside or prior to the computer

categories of things, e.g.
  people – names, ages, number of children etc
  events – locations, time, participants

information objects and their structures and conventions, e.g.
  dictionaries have headwords, POS, glosses, examples - subentries may repeat the structures
  other texts have embedded categories – may depend on linguistic system or hearer knowledge (e.g. book titles)

we need to distinguish data structures from the representations of them

while working on data, represent structures explicitly

G:\ELAR_EXAMPLE_DATA\data\dryer2008walman

In your computer

file management
  design a well-organised system of folders so that you can always find your stuff according to what it is, not when/where you last used it
  don’t just put your files where your software wants to put them
  don’t just name your files as your software wants to name them

File structures and names

design the folder structure as a logical hierarchy that suits your project and way of working
  have materials gathered within one overall directory (e.g. for backup)
  make directories for relevant categories, e.g. media types, days, sessions
  design it so that you will always be able to find things
  you need to be able to create and move folders
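
A folder hierarchy like this can also be created by a short script, which keeps it consistent across sessions. The sketch below is only one possible layout: the category names in MEDIA_TYPES and the function name are assumptions for illustration, not part of the original slides.

```python
from pathlib import Path

# Hypothetical category names; choose ones that suit your own project.
MEDIA_TYPES = ["audio", "video", "photos", "texts"]


def create_project_folders(root, sessions):
    """Create root/<media type>/<session> folders; return them as sorted relative paths."""
    root = Path(root)
    created = []
    for media in MEDIA_TYPES:
        for session in sessions:
            folder = root / media / session
            # parents=True builds missing levels; exist_ok=True makes re-runs harmless.
            folder.mkdir(parents=True, exist_ok=True)
            created.append(folder.relative_to(root).as_posix())
    return sorted(created)
```

Because everything sits under one root directory, backing up the project means copying a single folder.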

Designing a file structure

it should relate to real, actual life
locations should be obvious to you, so you will know where to look for things (hint: where do you keep your socks; passport; favorite cup; shoe polish?)
the best location is "the place that you would naturally look to find something"

On identifiers

real world objects are inherently identified because of their physical uniqueness, location etc. An unlabelled cassette is only poorly identified. Digital objects have no such physical independence - they depend on the identifiers that we give them

we can recognise three types of identifiers:
  semantic
  keys
  relative

On identifiers

semantic, e.g.
  Nelson Mandela
  The Sound of Music
  SA_JA_Bongo_Palace_Land Dispute Trial_015_29-04-2010.wav

On identifiers

keys (disambiguators), e.g.
  1137204 (a student number)
  0803 211 6148 (a telephone number)
  p12893fh23.pdf (some system's reference number)

On identifiers

relative, e.g.
  67 High Street
  the secretary
  index.html
  metadata.xls

On identifiers

your collection will have a mix of these but it is important to be aware of the differences, for example:
  semantic identifiers can be types or tokens (or masters vs. copies)
  keys: a program or process might depend on the identifier to work properly
  if you move items with relative names you may destroy their meaning

Objects and identities

the identity of a digital object relies on its location

the full identity of a file is its path + filename. The path is a representation of the directory/folder hierarchy

if the full identity is naturally unambiguous then everything is fine, compare the following:
  c:\dogs\spaniels\rover.jpg
  c:\cars\british\rover.jpg
or
  lectures\syntax\20091103\lect-notes.doc
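
One way to see which files depend on their location for their identity is to look for filenames that occur in more than one folder. The following is a minimal sketch using the example paths from this slide; the function name shared_filenames is an assumption for illustration.

```python
from collections import defaultdict
from pathlib import PureWindowsPath


def shared_filenames(paths):
    """Map each filename that occurs in more than one folder to its full paths."""
    by_name = defaultdict(list)
    for path in paths:
        # .name is the last component of the path, i.e. the bare filename.
        by_name[PureWindowsPath(path).name].append(path)
    return {name: found for name, found in by_name.items() if len(found) > 1}


paths = [
    r"c:\dogs\spaniels\rover.jpg",
    r"c:\cars\british\rover.jpg",
    r"lectures\syntax\20091103\lect-notes.doc",
]
clashes = shared_filenames(paths)
```

Here only rover.jpg is reported: the two files are told apart by their paths alone, so moving either one out of its folder would lose that distinction.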

Objects and identities

but semantic identifiers are potentially dangerous, because just adding more of them to disambiguate filenames will not work:
  my\rover.jpg
  my\white_rover.jpg

so objects in your system which are not naturally semantically unique need identifiers which are either keys, or relative

File naming

we tend to be unsystematic in naming files. This can often be OK, and if you have a method that already does everything you need to do (and will need to do in the future) then you do not need to change anything. But filenames that are unsystematic or are non-standard will cause problems for most people

Filename rules

don't accept the default filename suggested by an application when you first attempt to save it

don’t accept the default location suggested by an application

a new file: put it where it belongs, immediately. If necessary, create the place (directory/path) where it belongs (I often create a new blank file in the right place, and only then start adding the content)

Filename rules

all filenames should have correct extensions
each filename should have only one ".", before the extension
do not use characters other than letters, numbers, hyphen - and underscore _
wherever possible, avoid non-ASCII characters in filenames
keep filenames short, just long enough to contain the necessary identifier - don't fill them up with lots of information about the content (that is metadata! - see later)
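
These rules can be checked automatically before files are archived. The sketch below is one possible check, not an ELAR tool: the function name and the 40-character limit are assumptions, since the slide only says "short".

```python
import re

# ASCII letters, digits, hyphen and underscore, then exactly one dot and an extension.
SAFE_NAME = re.compile(r"[A-Za-z0-9_-]+\.[A-Za-z0-9]+")


def follows_rules(filename, max_length=40):
    """Return True if the filename meets the rules above (max_length is an assumed limit)."""
    return len(filename) <= max_length and SAFE_NAME.fullmatch(filename) is not None
```

For example, paaka_063.wav passes, while day.dreaming.xls (two dots) and any name containing a space do not. The check cannot tell whether the extension is the correct one for the file's content.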

Make filenames sortable

make filenames usefully sortable:

20100119lecture.doc
20100203lecture.doc

gr_transcription_1.txt
gr_transcription_2.txt
gr_transcription_9.txt
gr_transcription_53.txt

gr_transcription_001.txt
gr_transcription_002.txt
gr_transcription_009.txt
gr_transcription_053.txt
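
The difference between the two gr_transcription lists shows up when a computer sorts them as text: without leading zeros, 53 sorts before 9. The sketch below demonstrates this and pads the numbers; the helper name pad_number and the width of 3 digits are assumptions for illustration.

```python
import re


def pad_number(filename, width=3):
    """Zero-pad the run of digits that comes directly before the extension."""
    return re.sub(
        r"(\d+)(?=\.[^.]+$)",
        lambda match: match.group(1).zfill(width),
        filename,
    )


names = [
    "gr_transcription_9.txt",
    "gr_transcription_53.txt",
    "gr_transcription_1.txt",
    "gr_transcription_2.txt",
]
unpadded_order = sorted(names)
padded_order = sorted(pad_number(name) for name in names)
```

In unpadded_order, gr_transcription_53.txt comes before gr_transcription_9.txt; in padded_order the files appear in numerical order.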

Associating files

you can make resources sortable together by giving them the same filename root (the part before the extension), or part of the root:

gr_reefs.wav
gr_reefs.eaf
gr_reefs.txt

paaka_photo001.jpg
paaka_photo002.jpg
paaka_txt_conv203.wav
paaka_txt_conv203.eaf
paaka_txt_lex.doc
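
A shared filename root also lets a script collect the files that belong together. This is a minimal sketch, assuming the root is everything before the extension; the function name group_by_root is hypothetical.

```python
from collections import defaultdict
from pathlib import PurePath


def group_by_root(filenames):
    """Group filenames by their root (the part before the extension)."""
    groups = defaultdict(list)
    for name in filenames:
        groups[PurePath(name).stem].append(name)
    return dict(groups)


files = ["gr_reefs.wav", "gr_reefs.eaf", "gr_reefs.txt", "paaka_photo001.jpg"]
groups = group_by_root(files)
```

The recording, its annotation file and its text end up in one group under the key gr_reefs.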

Avoid metadata in filenames

avoid stuffing metadata into filenames. The filename is an identifier, not a container for information. Some people try to put language names, locations, speech genres, dates, speakers' names etc all into their filenames

better to use a simple (semantic) filename or a key (i.e. meaningless) filename, and then create a metadata table to contain all the information. The table can contain all the information, fully expressed, and will also be extensible for further metadata

Avoid metadata in filenames

i.e. NOT
  Paaka_Reefs_Dan_BH_3Oct97.wav
better:
  paaka_063.wav
  paaka_063.txt

paaka_063.txt

language   topic                speaker      location     date
Paakantyi  Reefs at Mutawintyi  Dan Herbert  Broken Hill  1997-10-03
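
A metadata table like this can be kept as plain text in CSV form, which spreadsheets and scripts can both read. The sketch below uses the values from this slide; the extra "file" column, which links each row to its recording, and the function names are assumptions for illustration.

```python
import csv
import io

# "file" is an assumed extra column that ties each row to a recording.
FIELDS = ["file", "language", "topic", "speaker", "location", "date"]

rows = [
    {
        "file": "paaka_063.wav",
        "language": "Paakantyi",
        "topic": "Reefs at Mutawintyi",
        "speaker": "Dan Herbert",
        "location": "Broken Hill",
        "date": "1997-10-03",
    }
]


def to_csv(records):
    """Serialise the metadata records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def lookup(csv_text, filename):
    """Return the metadata row for a filename, or None if it is not listed."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["file"] == filename:
            return row
    return None
```

Adding a new kind of metadata later means adding a column to the table; no file needs to be renamed.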

make sure to carefully design a filename system for your important data and to document that system so that somebody else can understand it.

some examples that could be improved
  G:\ELAR_EXAMPLE_DATA\data\delta_abbixxx\aud03_May2008\MD 01
  Sam’s samples

Filename system

make sure to carefully design a filename system for your important data and to document that system so that somebody else can understand it.

Sam’s new system (eg an audio file)

aaa_bb_cc_yyyy-mm-dd_nnn.wav

A filenaming system

aaa_bb_cc_yyyy-mm-dd_nnn.wav
  aaa = village code
  bb = (main) speaker code
  cc = genre/event code
  yyyy-mm-dd = date (why this order?)
  nnn = optional number (e.g. 001)
  .wav = correct extension for file content type
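
A documented system like this can be checked and unpacked by a script. The following is a minimal sketch under stated assumptions: codes are taken to be lowercase letters of exactly the lengths shown, and the sample codes in the usage note are invented. On the date order: with year first, then month, then day, sorting the filenames as text also sorts them chronologically.

```python
import re
from datetime import date

# Assumes lowercase codes of exactly 3, 2 and 2 letters, as in aaa_bb_cc.
PATTERN = re.compile(
    r"(?P<village>[a-z]{3})_(?P<speaker>[a-z]{2})_(?P<genre>[a-z]{2})_"
    r"(?P<date>\d{4}-\d{2}-\d{2})(?:_(?P<number>\d{3}))?\.(?P<ext>[a-z0-9]+)"
)


def parse_name(filename):
    """Split a filename into its parts, or return None if it does not fit the system."""
    match = PATTERN.fullmatch(filename)
    if match is None:
        return None
    parts = match.groupdict()
    try:
        parts["date"] = date.fromisoformat(parts["date"])
    except ValueError:
        # Right shape but not a real calendar date, e.g. month 13.
        return None
    return parts
```

For a made-up name such as bon_sa_tr_2010-04-29_015.wav, parse_name returns the village, speaker and genre codes, the date and the number; a name with the date written as 29-04-2010 is rejected.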

Documenting the filename system

write down the system (as in previous slide)
document the codes – this is probably also part of your metadata

Metadata

metadata is data about data
  for identification, management, retrieval
  providing the context and understanding of that data
  carries those understandings into the future, to others
  reflects knowledge and practices of providers
  defines and constrains audiences and usages for data
the goals of language documentation heighten the importance of metadata

Metadata formats

common or standard:
  IMDI (ISLE Metadata Initiative, DoBeS)
  OLAC (Open Language Archives Community)
  EAD (Encoded Archival Description), others

ELAR has its own set, currently being developed. For ELAR:
  deposit-wide metadata in deposit form
  also, depositor’s own metadata

On metadata formats

each depositor can also have different metadata!

types of metadata are relative to each project, consultants, community ...

our goal: to maximise the amount and quality of metadata

quality and extent are more important than standards and comparability

many depositors are sending extensive metadata in a variety of formats including spreadsheets

Types of metadata

depositor's / delegates' details
descriptive metadata
administrative metadata
preservation metadata
access protocols
metadata for individual files

Depositors and delegates

name
address
contact details (telephone, fax, email, URL)
role
affiliation
date of birth
nationality

Descriptive metadata

title, description, subject, summary
keywords
subject language, community
location
time span

Administrative metadata

project details
funding and hosting institutions
details of other copies of data
modifications and status
details of accession agreement
  cf. deposit form
access
  access protocols (see elsewhere)
  group membership identification

Preservation metadata

(original) carrier media
formats, size
provenance (source/history)

File-level metadata

media files
  duration, file size
  MIME type, content type, format
text files
  font, character set, encoding
  format, markup
access protocols

Metadata

examples
  G:\ELAR_EXAMPLE_DATA\data\coelho
  G:\ELAR_EXAMPLE_DATA\data\bowern\1
  G:\ELAR_EXAMPLE_DATA\data\delta_abbixxx
  G:\ELAR_EXAMPLE_DATA\data\kansakar\Baram_Elan_Sample_Data
  G:\ELAR_EXAMPLE_DATA\data\round
yet more metadata

how can I create it:
  design it!
  use a table format, e.g. MS Excel
  or use plain text

Some other advice

have only one master version of each resource (but have it well backed up)

for text, use plain text wherever possible
  what is "plain text?"
  how do I create it?
  how do I convert to plain text?
  what is good software for plain text?*

always use Unicode for special characters

*http://notepad-plus.sourceforge.net/

On files, applications and Windows

Windows associates files with software according to their extension
  the characters that follow the last dot in the filename, e.g.
    reality.doc
    fantasy.jpg
    dreaming.xls
    day.dreaming.xls

if you double click on files to open them, you are hostage to Windows’ preference
  this can be avoided, and changed
  how?
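
The "characters that follow the last dot" rule is easy to reproduce, which also shows why extra dots in a filename are confusing to read. This is a small sketch; the function name extension is an assumption for illustration.

```python
from pathlib import PurePath


def extension(filename):
    """Return the characters after the last dot, or '' if the name has no extension."""
    return PurePath(filename).suffix.lstrip(".")
```

For day.dreaming.xls the extension is xls, not dreaming.xls, so the file is treated like any other .xls file.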

MS Word

valuable but for making print materials (letters, reports, books), not for writing or managing data

encourages you to use features that do not exist in other software, or may not exist in the future e.g. colour, font size and weight, spacing

the file format is Microsoft's commercial secret - users need to buy it, and Microsoft may change it

your data may not be shareable or archivable
  G:\ELAR_EXAMPLE_DATA\data\dryer2008walman

View invisible characters

MS Word

however, it is still legitimate and useful to use it for some parts of your project but
  understand the limitations
  use it well; this means
    using minimal typography, or
    all typographical effects 100% consistent, controlled by styles, and documented

Backup

how do you do backup?
there are many ways of doing backup but it must be habitual and effective (i.e. tested)
backup is for when you have a disaster!
backup problems
recommended:
  keep 3 copies of your data/work
  on 2 different forms of media (e.g. hard disk and flash card)
  in 2 different places
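
One way to test a backup is to compare every file in the master folder with its copy by checksum. The sketch below is a possible approach, not a recommendation of a specific tool; the function names are assumptions, and SHA-256 is simply one widely available checksum.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 checksum of a file, read in chunks to handle large media."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(master_dir, backup_dir):
    """Return relative paths of files that are missing from, or differ in, the backup."""
    master_dir, backup_dir = Path(master_dir), Path(backup_dir)
    problems = []
    for source in sorted(master_dir.rglob("*")):
        if source.is_file():
            relative = source.relative_to(master_dir)
            copy = backup_dir / relative
            if not copy.is_file() or sha256_of(source) != sha256_of(copy):
                problems.append(relative.as_posix())
    return problems
```

An empty list means every master file has an identical copy. The check does not report extra files that exist only in the backup.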

Threats to your computer and data

power loss/surges
loss and theft
viruses (use a good antivirus or a computer system that resists viruses, such as the Linux family, e.g. Ubuntu)

End