THE DATA TRANSCRIPTION AND ANALYSIS (DTA) TOOL. HANDS ON WORKSHOP
Development of Linguistic Linked Open Data (LLOD) Resources for Collaborative Data-Intensive Research in the Language Sciences
María Blume, Pontificia Universidad Católica del Perú, Isabelle Barrière, Long Island University & Yeled V'Yalda Early Childhood CenterCristina Dye, Newcastle University, andTed Caldwell, GorgesWith the invaluable help of Carissa Kang and Jonathan Masci, Cornell University
July 25th, 2015
This tool funded by
National Science Foundation. CI-TEAM program. “Transforming the Primary Research Process Through Cybertool Dissemination: An Implementation of a Virtual Center for the study of language acquisition”. (María Blume and Barbara Lust). NSF OCI-0753415
Purpose and goals
To create a culture of national and international collaboration among researchers and their labs.
to create shared principles and methods of data documentation, management and collaboration
to enable the practice of these principles and methods through the use of cybertools.
To provide a new generation of researchers and students, including those with diverse disciplinary, geographical and cultural backgrounds, with a solid foundation in these principles and methods through the use of these new cybertools.
GoalPurpose
Purpose and goals
To create a tool for collaboration that can allow for the management, documentation and analysis of crosslinguistic language data.
To provide a resource that allows users to manage data across datasets and projects, including the ability to reuse previously collected data.
Purpose Goal
http://vcla.clal.cornell.edu/
Virtual Center for the Study of Language Acquisition (VCLA)
The Virtual Center for Language Acquisition Research
A community of researchers that are linked in their assumption that the most fundamental questions of language acquisition require interdisciplinary collaboration, both theoretical and empirical methods, and a cross-linguistic approach.
Eight founding member institutions. One international collaborator in Peru.
The VCLA website
A center that unites through the web a series of research labs across the country and the world.
[Its] mission is to foster collaborative research among researchers working in the area of language acquisition, collaborations which are potentially interdisciplinary, which may be at a distance geographically and which may involve the comparative study of multiple languages, and interactions on shared data, as well as a variety of lab methods.
The VCLA
List of projects by VCLA members to give undergraduate and graduate students and other researchers ideas for future research and collaboration.
Courses
Courses
We have created a series of courses centered on research methodology, best practices and the intensive hands-on experience with cybertools such as the Experiment Bank and Web DTA.
The Web Conferences: Elluminate
Cornell and UTEP students meeting during our first course.
The Web Conferences: Elluminate
A UTEP student presents her research proposal to peers and faculty at UTEP and Pontificia Universidad Católica del Perú.
The Virtual Linguistics Lab
The VLL portal provides structured access to the components of a virtual linguistic lab:
Materials for the scientific and collaborative study of language acquisition.
web-based courses, integrating synchronous and asynchronous forms of interactive information distribution.
Meeting the Challenges through a Virtual Linguistics Lab
The VLL includes a series of web-based courses, integrating synchronous
and asynchronous forms of interactive information distribution,
a web-based experiment bank and data transcription and analysis tool, with an associated set of data collected over 20 years by the Cornell Language Acquisition Lab and other labs across the USA.
a series of structured audio-visual demonstrations and related learning modules.
These materials are integrated into a university-supported cyberinfrastructure to ensure the high availability needs of a distance learning program
VLL Components
Laboratory methods: Research methods manual. Standards. Courses
Teaching materials. Audio/visual samples (lessons, assignments,
data). Web conferences. Discussion board.
VLL Portal Topics
Teaching Modules
Provide graduate and undergraduate students with a set of interactive web-based lessons which teach them the specific procedures of investigating language knowledge.
These link to Audio/video examples. Glossary The experiment bank The methods manual The Data Transcription and analysis tool. Published or unpublished papers Specific exercises/homework.
The modules
provide students with selected excerpts of language data to be studied and analyzed
give students a virtual experience of an interview of a subject and real experience of analysis of the subject's language.
allow students to learn a method to use in own research or practice
allow students to learn how to analyze previously collected data
A teaching module
The teaching modules give students access to:
•Audio/Visual examples.
•PowerPoint presentations explaining the methods.
•Readings
•Interactive assignments.
Audio/Visual Materials
Teaching the procedure for the Act Out task.
Audio/Visual Materials
An experimental study showing the Elicited Imitation task done with a 2-year-old in Peru.
An interactive assignment
Elicited Imitation assignment comparing monolingual and bilingual children.
These assignments train students to transcribe and analyze data, and compare their results to the original paper’s results.
An interactive assignment
A child subject enjoys the experiment.
These samples give the student a virtual experience of data collection.
Cybertools
Cybertools
Multilingualism questionnaire. Data Transcription and Analysis Tool
(DTA) includes an Experiment Bank gives access to Libraries of comparable
data. DTA User’s Manual.
Virtual workshops.
Cybertool access through VLL
Data quality: the opportunities Technology can enable:
Precision and completeness in data-capture procedures
Capacity for many levels of structural description and analysis
Capacity to link points of data along multiple dimensions
Why do we need the DTA tool in the study of language acquisition and use?
Multiple languages Multiple formats Multiple methods of data collection
observational vs. experimental, cross-sectional or longitudinal.
Multiple aspects of metadata age and/or developmental/cognitive stage of
speaker. social and pragmatic context culture.
Data management and use
Different labs practice distinct forms of data management.
The scientific use of any single record requires access to many levels of data, ranging from raw (establishing provenance) to structured and analyzed data (establishing intellectual worth).
Data Transcription and Analysis Tool (WebDTA)
31
http://webdta.clal.cornell.edu/site/login A primary research tool which provides the
user with a web interface which guides him/her through steps for generating, storing and accessing data.
Users contribute data in a structured, uniform manner.
Users access calibrated data from a shared relational database.
Diverse data become comparable at many levels.
The Data Transcription and Analysis Tool (WebDTA)
32
Collects all information related to a study (experimental or observational) in the same location.
Makes all information about the study available to the public. Researchers seeking to replicate or criticize it. Students studying the particular method or
research topic. Trains researchers and students on how to
organize research data.
WebDTA Tool33
It stores its data in a relational database on a centralized server (other systems store flat text files).
It supports both Natural Speech and Experimental data.
It can be used for both Research and Education (structured teaching modules).
It is open-ended. New specialized coding screens can be added.
It has robust query capabilities based on its relational database structure.
Brief development history
Virtual Language Laboratory (VLL), its Data Transcription and Analysis Tool (DTA, WebDTA) and the proprietary methodology that supports these were developed over 30 years of personal effort by Prof. Lust and student and peer contribution.
Several rudimentary versions of the DTA were sketched out and crafted in old software. However, when user friendly relational databases became common place, research and student users were able to define a new approach. A more powerful version of the DTA using FoxPro as the engine was developed. Katharina Boser, Reiko Mazuka, Julie Eisele, Paul Navarre,
David Parkinson, Shamitha Somashekar, and María Blume.
Brief development history
Cliff Crawford provoked the CLAL's development of a web-based interface for the DTA tool and has held major responsibility for programming of the first web-based interface, using PostgreSQL.
The current version of the DTA tool, unifying the previously independent cybertool Experiment Bank with the DTA was developed by Ted Caldwell and Greg Kops at Gorges, Web Development and Internet Solutions (http://www.gorges.us/) with María Blume and Barbara Lust, and input from students of the Cornell Language Acquisition Lab (Natalia Buitrago, Gabriel Clandorf, Poornima Guna, Jennie Lin, and Jordan Whitlock and UTEP Marina Kalashnikova and Martha Rayas).
DTA Schema
Structure
The current version of the WebDTA tool is built on Yii, a PHP web development framework that uses the "Model-View-Controller" pattern to structure the application and the "Active Record" pattern to manage records from the database.
MySQL is used for the database platform. All are open source technologies.
External links
We are collaborating with Cornell University’s Albert Mann Library in their current pilot program, DataStaR (Data Staging Repository) intended to help researchers create high quality metadata in the formats required by external repositories…”(Steinhart 2010: 1) (Funded by the National Science Foundation (Grant No. 111-0712989)
The program adopts a semantic web approach to metadata. At present, one VCLA dataset (Sinhala language) from more than
400 children studied in Sri Lanka has been entered in DataStaR, linking the VCLA database to the Library staging repository, and is available for collaborative use through this repository.
DataStaR uses RDF (Resource Description Framework (RDF)) statements and OWL (Web Ontology Language) classes in order to integrate different metadata frameworks across disciplines.
http://datastar.mannlib.cornell.edu/display/n6291 and http://www.news.cornell.edu/stories/Oct11/SinhalaTools.html
Project sample
An Experimental Project https://webdta.clal.cornell.edu/projects/151/overview A Natural Speech Corpus
https://webdta.clal.cornell.edu/projects/15/datasets/40/sessions
Coding Gesture https://webdta.clal.cornell.edu/projects/235/datasets/249/sessions/2648/transcriptions/1669/utterances
Code-switching project https://webdta.clal.cornell.edu/projects/230/overview
Queries https://webdta.clal.cornell.edu/queries
Accessing the DTA
Permission required due to Human Subjects Issues.
Need to contact Barbara Lust ([email protected]) or María Blume ([email protected])
Go to https://webdta.clal.cornell.edu/ (the link is in a doc called DTA address which we e-mailed you along with documents containing data from children which you can use to practice. We have removed the identifying data.)
Acknowledgments María Blume and Barbara Lust. 2008.
Transforming the Primary Research Process Through Cybertool Dissemination: An Implementation of a Virtual Center for the Study of Language Acquisition. NSF OCI-0753415
Lust, Barbara. 2003. Planning Grant: A Virtual Center for Child Language Acquisition Research. National Science Foundation. NSF BCS-0126546
VCLA founding members: Cornell: Marianella Casasola, Claire Cardie, James Gair, and Qi
Wang. NeuroFocus: Elise Temple Boston College: Claire Foley Rutgers University at New Brusnwick: Liliana Sánchez. Rutgers University at Newark: Jennifer Austin California State University at San Bernardino: YuChin Chien. Southern Illinois University at Carbondale: Usha Lakshmanan.
Acknowledgments VCLA affiliates:
City University of New Yors: Gita Martohardjono, Valerie Shafer, and Isabelle Barrière .
Newcastle University: Cristina Dye. Ben Gurion University at the Negev: Yarden Kedar Tyndale University College and Seminary: Sujin Yang. Columbia University: Joy Hirsch. University of Texas at El Paso: Ellen Courtney and Alfredo
Urzúa. University of California at San Diego: Sarah Callahan. Pontificia Universidad Católica Del Perú: Jorge Iván Pérez
Silva Kyungsung University: Kwee Ock Lee Central Institute of English and Foreign Languages: R.
Amritavalli Osmania University: A. Usha Rani.
Acknowledgments
Janet McCue and Barbara Lust 2004-2006. National Science Foundation Award: Planning Information Infrastructure Through a New Library-Research Partnership. (SGER=Small Grant for Exploratory Research)
American Institute for Sri Lankan Studies, Cornell University Einaudi Center.
Cornell University Faculty Innovation in Teaching Awards, Cornell Institute for Social and Economic Research (CISER).
New York State Hatch grant.
Our application developers Ted Caldwell and Greg Kops (GORGES).
Our consultants Cliff Crawford and Tommy Cusick;
Our student RAs: Darlin Alberto, Gabriel Clandorf, Natalia Buitrago, Poornima Guna, Jennie Lin, and Jordan Whitlock formerly at Cornell now MIT, and Marina Kalashnikova. Martha Rayas Tanaka, Lizzeth Pattison, María Jiménez, and Mónica Martínez at UTEP.
The students at all the participating institutions that helped us with comments and suggestions.
References
Berners-Lee, Tim. .3/2009. Ted Lecture. Tim Berners-Lee on the next Web. http://en.wikipedia.org/wiki/Linked_data .
Bickel, Balthasar, Bernard Comrie, and Martin Haspelmath. 2008. Leipzig Glossing Rules: Conventions for Interlinear Morpheme-by-Morpheme Glosses. Available online at http://www.eva.mpg.de/lingua/resources/glossing-rules.php .
Blume, María and Barbara Lust, 2011a and in prep. Data Transcription and Analysis Tool User’s Manual. (with the collaboration of Shamitha Somashekar, and Tina Ogden).
Blume, María and Barbara Lust. 2011b. Presentation to the National Science Foundation. CI Team Principal Investigator’s Meeting. University of Illinois at Urbana Champaign, Ill. May 24-26. Transforming the Primary Research Process Through Cybertool Dissemination: An Implementation of a Virtual Center for the Study of Language Acquisition. NSF OCI-0753415.
Farrar, S.O. and Langendoen, D.T. 2003 A linguistic ontology for the semantic web. GLOT International, 7(3) 97-100.
Khan, Huda, Brian Caruso, Brian Lowe, Jon Corson-Rikert, Diane Dietrich and Gail Steinhart. 2011. DataStaR: Using the Semantic Web approach for Data Curation. International Journal of Digital Curation 2(6): 209-221.
Lowe, Brian. 2009. DataStaR: Bridging XML and OWL in Science Metadata Management. Metadata and Semantics Research 46: 141-150. http://www.springerlink.com/content/q0825vj78ul38712/
References
Lust, Barbara, Suzanne Flynn, María Blume, Elaine Westbrooks, and Theresa Tobin. (2010). Constructing Adequate Documentation for Multi-faceted Cross Linguistic Language Data: A Case Study from a Virtual Center for Study of Language Acquisition. In Grenoble, Lenore and Louanna Furbee, (eds.), Language Documentation: Theory, Practice and Values. pp. 127-152. Amsterdam/Philadelphia: John Benjamins.
Open Archives Initiative (OAI), http://www.openarchives.org/ (15 Mar. 2005). Open Language Archives Community (OLAC), http://www.language-archives.org/ (24
Feb. 2011). Simons, G. Farrar, JS., Fitzsimons, B., Lewis, W., Langendoen, D.T. and Gonzalez, H.
2004a. The semantics of markup: Mapping legacy markup schemas to a common semantics.
Simons, G., Fitzsimons, B., Langendoen, D.T., Lewis, Wm., Farrar, S., Lanham, A., Basham, R. and Gonzalez H. 2004b. http://emeld.org/workshop/2004/langendoen-paper.html
Steinhart, Gail. 2010. DataStaR: A Data Staging Repository to Support the Sharing and Publication of Research Data. 31st Annual IATUL Conference - The Evolving World of e-Science: Impact and Implications for Science and Technology Libraries. June 20-24, 2010. West Lafayette, IN. http://docs.lib.purdue.edu/iatul2010/conf/day2/8/.
DTA: Project list
DTA: Project info
Metadata on Experimental or Naturalistic research.
These screens help students and researchers save/access the basic information for a research study and also keep track of publications, presentations, related studies, and bibliography related to a research project.
DTA Metadata: Subject info
Subject information that allows for one subject’s data to be used in multiple datasets.
DTA: Research Design
These screens help students and researchers save/access the research study’s design.
DTA: Summary Report
This report shows the data at the project level.
DTA: Summary Report
From the project report one can access the summary reports for the different datasets of the project.
DTA: Summary Report
DTA Metadata: Session info
This screen provides info for every time a subject was recorded for a given dataset.
DTA: Recordings
One can include several “recordings” for each session, including audio, video, and previous transcripts.
DTA: Transcription
This screen allows one to transcribe, switch between recordings, and time-align recordings and transcripts.
DTA: Basic Natural Speech Coding
Basic levels of linguistic coding to train students.
Additional levels of general or project-specific coding can be created by users.
DTA: Experimental Coding for Grammaticality Judgment Task.
An example of a project specific coding created for an experimental task.
DTA: Query
A multi-condition query.
Different queries can be created and saved by users as needed.
The Virtual Workshops
The Virtual Workshops Topics
Virtual Workshops teach users how to navigate our cybertools.
The Virtual Workshops: The DTA Manual
It allows users to take notes and has quizzes to check for understanding of the cybertool.
The Virtual Workshops: A video demonstration
Prof. Lust explains the purpose and motivation of the cybertool to students and researchers at Cornell, Rutgers New Brunswick, MIT and UTEP.
The Virtual Workshops: A video demonstration
The DTA tool programmer, Ted Caldwell, shows students the different DTA screens and their purpose with added comentary by María Blume and Barbara Lust.