Creating the CoMeRe corpus repository: a Corpus-écrits, Ortolang and TEI-CMC partnership
TRANSCRIPT
Corpus-écrits General Assembly, 21 November
Consortium Corpus-écrits
SIG TEI-CMC
Open Resources and TOols for LANGuage
http://comere.org
http://hdl.handle.net/11403/comere
Thierry Chanier, Céline Poudat, Julien Longhi, Gudrun Ledegen, Ciara Wigham, Linda Hriba, Kun Jin, Georges Antoniadis, Benoit Sagot, Camille Paloque, Natalia Grabar, Cislaru Georgeta, Achille Falaise, Paul Lotin
http://www.tei-c.org/Activities/SIG/CMC/
http://wiki.tei-c.org/index.php/SIG:Computer-Mediated_Communication
Our subject and goals
Our subject:
building and annotating corpora of computer-mediated
communication (CMC) – as resources for empirical research on
CMC phenomena in the Humanities (linguistics, communication
science, language technology, …)
This resource must therefore be freely accessible (open
access research data) so that it can be reused by
research communities.
We will come back to this point later.
Computer-mediated communication (CMC):
All genres of interpersonal communication mediated
through computer networks (the internet) and used
via personal computers and/or mobile devices: chats,
online forums, instant messaging, tweets, comments
on weblogs, discussions in wikis and on “social
network” sites, interactions in multimodal communication
environments such as Skype, MMORPGs or “virtual
worlds” (e.g., SecondLife), SMS, WhatsApp, ...
Our vision: These corpora shall be …
interoperable (i) with each other and (ii) with other types of
linguistic corpora (text corpora, speech corpora)
represented in conformance with established encoding standards in
the field of Digital Humanities
linguistically annotated in order to allow for sophisticated
queries and language-focused research
The problem / challenge:
So far, there are no established standards for the
representation of CMC genres
Established standards for the representation of text genres do
not include models for the representation of the peculiarities of
CMC
“Off the shelf” NLP tools for automatic linguistic analysis and
annotation (tokenizers, part-of-speech taggers, lemmatizers,
normalizers, parsers) do not perform well on CMC data
(because they usually have been trained on edited text and
therefore can’t handle “non-standard” phenomena and
multimodal elements in CMC discourse)
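To illustrate the tokenization problem named above, here is a minimal Python sketch. The sample message, the function names and the regular expression are illustrative assumptions, not tools used by the project; the pattern covers only a handful of CMC phenomena (URLs, mentions, hashtags, a few emoticons).

```python
import re

# Hypothetical chat message (an assumption for illustration only).
MESSAGE = "lol c u 2moro :-) #exam @anna http://comere.org"

def naive_tokenize(text):
    """Split into word runs and single punctuation marks,
    roughly what a tokenizer tuned for edited text does."""
    return re.findall(r"\w+|[^\w\s]", text)

# Alternatives are tried left to right, so CMC-specific units
# are matched before the generic word/punctuation fallbacks.
CMC_PATTERN = re.compile(
    r"https?://\S+"         # URLs
    r"|[@#]\w+"             # mentions and hashtags
    r"|[:;=][-o]?[)(DPp]"   # a few western-style emoticons
    r"|\w+"                 # ordinary words
    r"|[^\w\s]"             # any other punctuation mark
)

def cmc_tokenize(text):
    """Tokenize while keeping CMC units (emoticons, URLs, tags) whole."""
    return CMC_PATTERN.findall(text)

print(naive_tokenize(MESSAGE))
print(cmc_tokenize(MESSAGE))
```

The naive version breaks the emoticon, the hashtag and the URL into fragments, while the CMC-aware version keeps each as one token. Non-standard spellings such as "2moro" still need a separate normalization step.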
Our goals:
work on solutions for these desiderata
develop suggestions for standards for
- packaging and sharing (mono- and multimodal) CMC
corpora,
- modeling these types of “texts” within a framework which
conforms to the encoding framework of the Text
Encoding Initiative (TEI) and thus to a widely accepted
de facto standard in the field of Digital Humanities,
- processing and annotating these corpora (part-of-speech,
normalization, ...) with NLP tools.
Who belongs to our community (so far)?
French CMC corpora
Infrastructure for languages
National consortium on corpora
National infrastructure for Digital Humanities
Our kernel projects and founding members
http://hdl.handle.net/11403/comere
Dortmund Chat Corpus
http://www.chatkorpus.tu-dortmund.de
German Reference Corpus of CMC
http://www.tinyurl.com/derik-llc
Wikipedia corpus in DeReKo
(Mannheim)
Scientific network
„Empirical research of CMC“
http://www.empirikom.net
German CMC corpora
Dutch CMC corpora
SoNaR
(Stevin Nederlandstalig Referentiecorpus)
Italian CMC pilot corpus
http://glottoweb.org/web2corpus/
2013, 2014: European workshops on CMC corpora (Dortmund); special journal issue (JLCL)
Activities and initiatives (past and future)
Our pathway
2013: creation of the TEI-CMC SIG
End of 2014: publication of French CMC corpora (CoMeRe) in open access, all in TEI-CMC
2015: application to CLARIN-DE: transform existing German corpora into TEI-CMC
October 2015: international CMC conference, Rennes (Ledegen)
2015: submission of the TEI-CMC model
2015: launch of a larger CMC-corpora community
2016: common system of basic CMC annotations (POS tagging)
Objective: a kernel corpus assembling existing corpora of different CMC
genres and new corpora built from data extracted from the Internet. These
heterogeneous corpora will be structured and processed in a uniform way and
complemented with metadata. CoMeRe will be released as open data
through the national infrastructure Ortolang, following constraints which will
be reused for the forthcoming “Corpus de Référence du Français”.
Project supported by the national
consortium Corpus-écrits, sub-part of
Huma-Num, and Ortolang
Variety + Standards + Open Access
Local LRL server
Individual depositor
Engineer: Kun Jin
Quality group
Discussion with depositor
NLP tagging group: TEI-v2
TEI-v1
Funding: ORTOLANG > Corpus-écrits > LRL
Ref | Tokens | Participants | Posts | Environment
(Antoniadis, 2014) | 449 313 | 359 | 22 052 | SMS
(Falaise, 2014) | 35 M | 25 000 | 3 M | textchat
(Ledegen, 2014) | 357 000 | 850 | 22 000 | SMS
(Reffay et al., 2014) | 600 000 | 67 + 4 groups | textchat: 6 790; emails: 2 030; forums: 2 686 | LMS
(Yun, Chanier, 2014) | 77 605 | 31 + 2 courses | 7 750 | textchat
(Abendroth-Timmer et al., 2014) | 273 546 | 26 + 4 groups | 1 200 | Blog
(Longhi, Marinica, 2014) | 567 851 | 205 | 34 273 | Tweet
Context of each corpus: informal/business, informal, informal, education, education, education, politics
Mono-mode, mono-modality
- Textchat
- Forum
- SMS
- Tweets
- Email
- Blogs (image not a means of interaction)
Multimodalities
LMS:
- email
- forum
- chat
Multimodes
Conf system:
- Audiochat
- Textchat
Verbal / Verbal & non-verbal
Conference system, 3D environment, etc.
- Audiochat
- Textchat
- Icons
- Collec. prod.
Whiteboard, word processor, semantic maps
- Avatars
- …
Interaction space
Time(s)
Locations
Participants
Environments
Author, addressee(s), group, network
Course, session, channel, simultaneity
New macro-level elements
http://wiki.tei-c.org/index.php/SIG:CMC/Draft:_A_basic_schema_for_representing_CMC_in_TEI
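As a rough illustration of what a macro-level unit for CMC makes possible, the Python sketch below counts posts per participant in a small XML fragment. The element name `post`, the attributes `who` and `when`, and the sample data are all assumptions made for this example; the actual element and attribute names are defined in the draft schema linked above.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment: element and attribute names are assumptions,
# not copied from the TEI-CMC draft schema.
SAMPLE = """
<div type="textchat">
  <post who="#p1" when="2014-11-21T10:00:00">salut :-)</post>
  <post who="#p2" when="2014-11-21T10:00:05">hello</post>
  <post who="#p1" when="2014-11-21T10:00:09">ok, on commence ?</post>
</div>
"""

def posts_per_participant(xml_text):
    """Count <post> elements by the value of their 'who' attribute."""
    root = ET.fromstring(xml_text.strip())
    return Counter(post.get("who") for post in root.iter("post"))

counts = posts_per_participant(SAMPLE)
for participant, n in sorted(counts.items()):
    print(participant, n)
```

Because every genre (SMS, tweet, chat turn, forum message) would be wrapped in the same kind of unit, a query like this one could run unchanged across corpora of different CMC genres.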
Computer-Mediated Communication in TEI: What Lies Ahead
TEI-MM 2013 (Rome)
1.5 min video
* Paper: (Wigham & Chanier, 2013), CALL journal
* Data: (Wigham, 2013), LETEC corpus
Modality interplay
Multimodality: verbal and non-verbal
(Wigham & Chanier, 2013)
Collaborative word processor
Audio: clarification
Textchat: correction (with error)
Textchat: request for confirmation
Context: Lyceum conference environment, 3 learners (English L2) working in a word processor: one writing, the others helping
Now in TEI-speech
http://wiki.tei-c.org/index.php/SIG:Computer-Mediated_Communication
• the user is authorized to download a copy of the corpus […]
• reuse (reproduction, distribution) of non-substantial parts of the XXX corpus is authorized […]
• reuse is subject to the condition of citing in full, as credits: […]
• reuse (reproduction, distribution) of substantial parts of the XXX corpus is not permitted on
the basis of this user licence.
I consent to these terms of use (required to gain access to the corpus)
Example of a corpus licence displayed on the National Infrastructure for Digital Humanities and considered as being "open access"
Viewing but not re-using: is that OA?