does size matter? when small is good enough


Does size matter? When small is better.

A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi, N. Ireson

Department of Computer Science, University of Sheffield

2011, 23rd March

A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi, N. Ireson Does size matter? When small is better. 1 / 20



Outline

1 Introduction

2 Email Corpus

3 Dynamic Topic Classification Of Short Texts

4 Experiments

5 Conclusions




Settings

Main Goal: observe the influence of document size on the accuracy of a defined text processing task

Hypothesis: results obtained using longer texts may be approximated by short texts of micropost size, i.e., maximum length 140 characters

Dataset: artificially generated comparable corpora, consisting of truncated emails, from micropost size (140 characters) and successive multiples thereof

Methodology:

corpus-driven topic extraction

document topic classification




Research Questions

statistical analysis of emails exchanged via the oak mailing list (over a period of six months) to determine if email is indeed used as a short messaging service

content analysis of emails as microposts, to evaluate to what degree the knowledge content of truncated or abbreviated messages can be compared to the complete message

determine if the knowledge content of short emails may be used to obtain useful information about, e.g., topics of interest or expertise within an organisation, as a basis for carrying out tasks such as expert finding or content-based social network analysis (SNA)




Micropost Services

mainly used for social information exchange, but also used in more formal (working) environments [Herbsleb et al., 2002, Isaacs et al., 2002]

sometimes perceived negatively in the workplace, as they may be seen to reduce productivity [TNS US Group, 2009], and/or pose threats to security and privacy

where restrictions to use are in place, alternatives are sought that obtain the same benefits

same communication patterns in alternative media: email as a short message service for communication via, e.g., mailing lists




Oak mailing list based corpus

six month period (July - January 2011), totalling 659 emails

Figure: Email length distribution



Dynamic Topic Classification Of Short Texts

Goal: evaluate to what degree the knowledge content of a shorter message can be compared to that of a full message.

Chosen task: text classification on non-predefined topics.

Test bed: generated by preprocessing the oak email corpus to obtain several fixed-size corpora.

Method:

Corpus-driven topic extraction: a number of topics are automatically extracted from a document collection; each topic is represented as a weighted vector of terms.

Document topic classification: each document is labelled with the topic it is most similar to, and classified into the corresponding cluster.




Topic Extraction: Proximity-based Clustering

Document corpus D = {d_1, ..., d_k}

each document d_i = {t_1, ..., t_v} is a vector of weighted terms

Term clusters C = {c_1, ..., c_k} (clustering performed by using the inverted index of D as feature space)

each cluster c_k = {t_1, ..., t_n} is a vector of weighted terms

each cluster ideally represents a topic in the document collection
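The slides do not specify the proximity-based clustering algorithm, so the following is only a rough sketch of the general idea under stated assumptions: each term is described by its row in the inverted index (the documents it occurs in, weighted here by raw term frequency), and terms with similar rows are grouped together. The greedy single-pass grouping, the 0.5 threshold, and the function names tokenize, inverted_index and cluster_terms are illustrative choices, not the authors' method.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens only (a simplification)."""
    return re.findall(r"[a-z]+", text.lower())


def inverted_index(docs):
    """Map each term to {doc_id: term frequency}; this is the term's feature vector."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return dict(index)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_terms(index, threshold=0.5):
    """Greedy single-pass grouping of terms (a stand-in, NOT the authors' algorithm).

    A term joins the first cluster whose seed term it resembles closely enough;
    otherwise it seeds a new cluster. The weight of a term in its cluster is
    its similarity to the seed.
    """
    clusters = []  # each entry: {"seed": vector, "terms": {term: weight}}
    for term in sorted(index):
        vec = index[term]
        for cluster in clusters:
            s = cosine(vec, cluster["seed"])
            if s >= threshold:
                cluster["terms"][term] = s
                break
        else:  # no existing cluster was close enough
            clusters.append({"seed": vec, "terms": {term: 1.0}})
    return [cluster["terms"] for cluster in clusters]


docs = ["parser grammar parser", "grammar parser",
        "budget meeting", "meeting budget budget"]
topics = cluster_terms(inverted_index(docs))
```

On this toy input the terms that co-occur in the same documents end up in the same weighted term vector, which is the shape of topic representation the slide describes.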




Email Topic Classification

sim(d, C_i): similarity between documents and clusters (by cosine similarity)

labelDoc(D): each document d is mapped to the topic C_i which maximises the similarity sim(d, C_i)
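As a minimal sketch of these two definitions, assuming documents and topics are both stored as sparse term-to-weight dictionaries (a representation chosen here for illustration), sim is the usual cosine, i.e. the dot product of d and C_i divided by the product of their norms, and labelling picks the topic with the highest score. The function names sim and label_doc mirror the slide but the code is not taken from the authors.

```python
import math


def sim(d, c):
    """Cosine similarity between a document d and a topic cluster c (sparse dicts)."""
    dot = sum(w * c[t] for t, w in d.items() if t in c)
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_c = math.sqrt(sum(w * w for w in c.values()))
    return dot / (norm_d * norm_c) if norm_d and norm_c else 0.0


def label_doc(docs, topics):
    """Return, for each document, the index of the topic maximising sim(d, C_i).

    Ties (including documents that share no term with any topic) go to the
    lowest topic index; how such cases should be handled is an open choice.
    """
    return [max(range(len(topics)), key=lambda i: sim(d, topics[i]))
            for d in docs]


topics = [{"parser": 1.0, "grammar": 0.9}, {"budget": 1.0, "meeting": 0.9}]
docs = [{"grammar": 2, "parser": 1}, {"meeting": 1, "lunch": 1}]
labels = label_doc(docs, topics)
```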




Dataset Preparation: Comparable Corpora

Corpus Name              Maximum text length of each document

corpus140                truncated at 140 characters if longer, full text otherwise

corpus280 to corpus980   truncated likewise at successive multiples of 140 characters

mainCorpus               full email body
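A minimal sketch of how such comparable corpora could be generated from a list of email bodies. The range of multiples (1 to 7, giving 140 to 980 characters) is an assumption drawn from the corpus names in the results; the function names are hypothetical.

```python
MICROPOST = 140  # micropost size in characters, as stated in the slides


def truncate(text, max_len):
    """Cut text to max_len characters; shorter texts are returned whole."""
    return text[:max_len]


def build_comparable_corpora(emails, multiples=range(1, 8)):
    """Map corpus names to truncated copies of every email body.

    multiples=range(1, 8) yields corpus140 ... corpus980 (an assumption);
    mainCorpus keeps the full bodies.
    """
    corpora = {
        f"corpus{MICROPOST * m}": [truncate(e, MICROPOST * m) for e in emails]
        for m in multiples
    }
    corpora["mainCorpus"] = list(emails)
    return corpora


emails = ["a" * 300, "short"]
corpora = build_comparable_corpora(emails)
```

Because every corpus contains the same documents in the same order, labels assigned on one corpus can be compared position by position with labels assigned on another.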




Experimental Approach

corpus-driven topic extraction: using mainCorpus

email classification: apply the labelDoc procedure to all the different comparable corpora, including mainCorpus.




Corpus Topics

Figure: Visualisation of topic clusters.



Results

            Precision   Recall   F-Measure
corpus140   0.80        0.63     0.69
corpus280   0.90        0.86     0.87
corpus420   0.93        0.92     0.92
corpus560   0.98        0.96     0.97
corpus700   0.95        0.94     0.94
corpus840   0.98        0.98     0.98
corpus980   0.99        0.99     0.99
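The slides do not say how these figures were averaged. The sketch below shows one plausible way to compute them, under two assumptions: the labels obtained on mainCorpus serve as the reference, and the scores are macro-averaged over topics. The function name macro_prf is hypothetical.

```python
def macro_prf(gold, predicted):
    """Macro-averaged precision, recall and F-measure over all labels.

    gold: reference labels (assumed here to be those assigned on mainCorpus)
    predicted: labels assigned on a truncated corpus, in the same order
    """
    labels = sorted(set(gold) | set(predicted))
    pairs = list(zip(gold, predicted))
    precisions, recalls, f_scores = [], [], []
    for label in labels:
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f_scores.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f_scores) / n


gold = [0, 0, 1, 1]        # e.g. labels obtained on the full texts
predicted = [0, 1, 1, 1]   # e.g. labels obtained on truncated texts
p, r, f = macro_prf(gold, predicted)
```

With macro-averaging, the reported F-measure is the mean of the per-topic F-measures, so it need not equal the harmonic mean of the averaged precision and recall.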




Conclusions

a fair portion of the emails exchanged, for the corpus generated from a mailing list, are very short, with approximately 40% falling within the single micropost size, and 65% up to two microposts;

for the text classification task described, the accuracy of classification for micropost size texts is an acceptable approximation of classification performed on longer texts, with a decrease of only ~5% for up to the second micropost block within a long e-mail.




Future Work

enriching the micro-emails with semantic information (e.g., concepts extracted from domain and standard ontologies) would improve the results obtained using unannotated text

investigate the influence of other similarity measures

application to expert finding tasks, exploiting dynamic topic extraction as a means to determine authors' and recipients' areas of expertise

formal evaluation of topic validity will be required, including the human (expert) annotator in the loop




References

Herbsleb, J. D., Atkins, D. L., Boyer, D. G., Handel, M., and Finholt, T. A. (2002). Introducing instant messaging and chat in the workplace. In Proc., SIGCHI conference on Human factors in computing systems: Changing our world, changing ourselves, pages 171-178.

Isaacs, E., Walendowski, A., Whittaker, S., Schiano, D. J., and Kamm, C. (2002). The character, functions, and styles of instant messaging in the workplace. In Proc., ACM conference on Computer supported cooperative work, pages 11-20.

TNS US Group (2009). Social media exploding: More than 40% use online social networks. http://www.tns-us.com/news/social_media_exploding_more_than.php.

