Creating the CoMeRe corpus bank: a Corpus-écrits – Ortolang – TEI-CMC partnership

Corpus-écrits General Assembly, 21 November

Consortium Corpus-écrits

SIG TEI-CMC

Open Resources and TOols for LANGuage

http://comere.org

http://hdl.handle.net/11403/comere

Thierry Chanier, Céline Poudat, Julien Longhi, Gudrun Ledegen, Ciara Wigham, Linda Hriba, Kun Jin, Georges Antoniadis, Benoit Sagot, Camille Paloque, Natalia Grabar, Georgeta Cislaru, Achille Falaise, Paul Lotin


http://www.tei-c.org/Activities/SIG/CMC/

http://wiki.tei-c.org/index.php/SIG:Computer-Mediated_Communication


Our subject and goals

Our subject:

building and annotating corpora of computer-mediated

communication (CMC) – as resources for empirical research on

CMC phenomena in the Humanities (linguistics, communication

science, language technology, …)

This resource must therefore be freely accessible (open access research data) so that research communities can reuse it.

We will come back to this point later.

Computer-mediated communication (CMC):

All genres of interpersonal communication mediated through computer networks (the internet) and used via personal computers and/or mobile devices: chats, online forums, instant messaging, tweets, comments on weblogs, discussions in wikis and on “social network” sites, interactions in multimodal communication environments such as Skype, MMORPGs or “virtual worlds” (e.g., SecondLife), SMS, WhatsApp, …




Our vision: These corpora shall be …

- interoperable (i) with each other and (ii) with other types of linguistic corpora (text corpora, speech corpora)
- represented in conformance with established encoding standards in the field of Digital Humanities
- linguistically annotated in order to allow for sophisticated queries and language-focused research


The problem / challenge:

- So far, there are no established standards for the representation of CMC genres
- Established standards for the representation of text genres do not include models for the representation of the peculiarities of CMC
- “Off the shelf” NLP tools for automatic linguistic analysis and annotation (tokenizers, part-of-speech taggers, lemmatizers, normalizers, parsers) do not perform well on CMC data (because they have usually been trained on edited text and therefore can’t handle “non-standard” phenomena and multimodal elements in CMC discourse)
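The tokenization problem above can be illustrated with a minimal sketch. Both regular expressions and the sample post are assumptions made up for this illustration. They are not the tools or data of any project named here, and the "CMC-aware" pattern covers only a handful of emoticons, @mentions and hashtags.

```python
import re

# Tokenizer of the kind suited to edited text: runs of word characters,
# or single punctuation marks. (Illustrative pattern, not a real tool.)
EDITED_TEXT = re.compile(r"\w+|[^\w\s]")

# Hypothetical CMC-aware variant: try a few emoticons, @mentions and
# #hashtags before falling back to the edited-text rules.
CMC_AWARE = re.compile(r"[:;=][-']?[()DPp]|[@#]\w+|\w+|[^\w\s]")


def tokenize(pattern, text):
    """Return all non-overlapping matches of `pattern` in `text`."""
    return pattern.findall(text)


post = "@anna lol :-) #cmc"  # invented sample post

print(tokenize(EDITED_TEXT, post))
# ['@', 'anna', 'lol', ':', '-', ')', '#', 'cmc']
print(tokenize(CMC_AWARE, post))
# ['@anna', 'lol', ':-)', '#cmc']
```

The first tokenizer splits the emoticon, the mention and the hashtag into fragments, so any tagger run on its output would see tokens that do not correspond to CMC units.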


Our goals:

work on solutions for these desiderata

develop suggestions for standards for

- packaging and sharing (mono- and multimodal) CMC corpora,
- modeling these types of “texts” within a framework which is conformant with the encoding framework of the Text Encoding Initiative (TEI) and thus with a widely accepted de facto standard in the field of Digital Humanities,
- processing and annotating these corpora (part-of-speech, normalization, ...) with NLP tools.

Who belongs to our community (so far)?

French CMC corpora

Infrastructure for languages

National consortium on corpora

National infrastructure for Digital Humanities

Our kernel projects and founding members

http://hdl.handle.net/11403/comere

Dortmund Chat Corpus

http://www.chatkorpus.tu-dortmund.de

German Reference Corpus of CMC

http://www.tinyurl.com/derik-llc

Wikipedia corpus in DeReKo

(Mannheim)

Scientific network

„Empirical research of CMC“

http://www.empirikom.net

German CMC corpora

Dutch CMC corpora

SoNaR

(Stevin Nederlandstalig Referentiecorpus)

Italian CMC pilot corpus

http://glottoweb.org/web2corpus/

Activities and initiatives (past and future)

- 2013, 2014: European workshops on CMC corpora (Dortmund)
- Special journal issue (JLCL)


Our pathway

- 2013: Creation of the TEI-CMC SIG
- End of 2014: Publication of French CMC corpora (CoMeRe) in open access, all in TEI-CMC
- 2015: Application to CLARIN-DE; transform existing German corpora into TEI-CMC
- October 2015: International CMC conference, Rennes (Ledegen)
- 2015: Submission of the TEI-CMC model
- 2015: Launch of a larger CMC-corpora community
- 2016: Common system of basic CMC annotations (POS tagging)

Objective: a kernel corpus assembling existing corpora of different CMC genres and new corpora built from data extracted from the Internet. These heterogeneous corpora will be structured and processed in a uniform way, and complemented with metadata. CoMeRe will be released as OpenData through the national infrastructure Ortolang, following constraints which will be reused for the forthcoming “Corpus de Référence du Français”.

Project supported by the national consortium Corpus-écrits, sub-part of Huma-Num, and Ortolang

Variety + Standards + Open Access


LRL local server

Individual depositor

Engineer: Kun Jin

Quality group

Discussion with depositor

NLP tagging group: TEI-v2

TEI-v1

Funding: ORTOLANG > Corpus-écrits > LRL

Ref                              Tokens   Participants    Posts                                          Environment
(Antoniadis, 2014)               449 313  359             22 052                                         SMS
(Falaise, 2014)                  35 M     25 000          3 M                                            textchat
(Ledegen, 2014)                  357 000  850             22 000                                         SMS
(Reffay et al., 2014)            600 000  67 + 4 groups   textchat: 6 790; emails: 2 030; forums: 2 686  LMS
(Yun, Chanier, 2014)             77 605   31 + 2 courses  7 750                                          textchat
(Abendroth-Timmer et al., 2014)  273 546  26 + 4 groups   1 200                                          Blog
(Longhi, Marinica, 2014)         567 851  205             34 273                                         Tweet


Domains (as listed on the slide): informal business, informal, informal, education, education, education, politics


http://repository.mmulce.org/


Mono-mode, mono-modality:
- Textchat
- Forum
- SMS
- Tweets
- Email
- Blogs (images not a means of interaction)

Multimodalities: LMS
- email
- forum
- chat

Multimodes: conference system
- Audiochat
- Textchat

Verbal / Verbal & non-verbal: conference system, 3D environment, etc.
- Audiochat
- Textchat
- Icons
- Collective production (whiteboard, word processor, semantic maps)
- Avatars
- …


Interaction Space

Time(s)

Locations

Participants: Author, Addressee(s), Group, Network

Environments: Course, Session, Channel, Simultaneity


New macro-level elements

http://wiki.tei-c.org/index.php/SIG:CMC/Draft:_A_basic_schema_for_representing_CMC_in_TEI

Computer-Mediated Communication in TEI: What Lies Ahead

TEI-MM 2013 (Rome)

1.5 min video

* Paper: (Wigham & Chanier, 2013), CALL journal

* Data: (Wigham, 2013) LETEC corpus

Modality interplay


Multimodality: verbal and non-verbal

(Wigham & Chanier, 2013)


Collaborative word processor

Audio: clarification

Textchat: correction (with error)

Textchat: request for confirmation

Context: Lyceum conference environment, 3 learners (English L2) working in a word processor: one writing, the others helping

Now in TEI-speech

http://wiki.tei-c.org/index.php/SIG:Computer-Mediated_Communication


• the user is authorized to download a copy of the corpus […]

• reuse (reproduction, distribution) of non-substantial parts of corpus XXX is permitted […]

• reuse is subject to the condition of citing in full, as credits: […]

• reuse (reproduction, distribution) of substantial parts of corpus XXX is not permitted under this licence of use.

I consent to these terms of use (required to access the corpus)

Example of a corpus licence displayed on the National Infrastructure for Digital Humanities and considered as being “open access”

Viewing but not re-using: is that OA?
