Does Size Matter? When Small is Good Enough

Does size matter? When small is better.
A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi, N. Ireson
Department of Computer Science, University of Sheffield
23rd March 2011

DESCRIPTION

This paper reports the observation of the influence of the size of documents on the accuracy of a defined text processing task. Our hypothesis is that, based on a specific task (in this case, topic classification), results obtained using longer texts may be approximated by short texts, of micropost size, i.e., maximum length 140 characters. Using an email dataset as the main corpus, we generate several fixed-size corpora, consisting of truncated emails, from micropost size (140 characters), and successive multiples thereof, to the full size of each email. Our methodology consists of two steps: (1) corpus-driven topic extraction and (2) document topic classification. We build the topic representation model using the main corpus, through k-means clustering, with each k-derived topic represented as a weighted number of terms. We then perform document classification according to the k topics: first over the main corpus, then over each truncated corpus, and observe the variance in classification accuracy with document size. The results obtained show that the accuracy of topic classification for micropost-size texts is a suitable approximation of classification performed on longer texts.

TRANSCRIPT

Page 1: Does Size Matter? When Small is Good Enough

Introduction, Email Corpus, Dynamic Topic Classification Of Short Texts, Experiments, Conclusions

Does size matter? When small is better.

A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi, N. Ireson

Department of Computer Science, University of Sheffield

2011, 23rd March

A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi, N. Ireson Does size matter? When small is better. 1 / 20

Page 2: Does Size Matter? When Small is Good Enough

Outline

1 Introduction

2 Email Corpus

3 Dynamic Topic Classification Of Short Texts

4 Experiments

5 Conclusions

Page 3: Does Size Matter? When Small is Good Enough

Settings

Main Goal: observation of the influence of the size of documents on the accuracy of a defined text processing task

Hypothesis: results obtained using longer texts may be approximated by short texts, of micropost size, i.e., maximum length 140 characters

Dataset: artificially generated comparable corpora, consisting of truncated emails, from micropost size (140 characters), and successive multiples thereof

Methodology:

(1) corpus-driven topic extraction
(2) document topic classification

Page 4: Does Size Matter? When Small is Good Enough

Research Questions

statistical analysis of emails exchanged via oak mailing list (over a period of six months) to determine if email is indeed used as a short messaging service

content analysis of emails as microposts, to evaluate to what degree the knowledge content of truncated or abbreviated messages can be compared to the complete message

determine if the knowledge content of short emails may be used to obtain useful information about e.g., topics of interest or expertise within an organisation, as a basis for carrying out tasks such as expert finding or content-based social network analysis (SNA)

Page 5: Does Size Matter? When Small is Good Enough

Micropost Services

mainly used for social information exchange, but also used in more formal (working) environments [Herbsleb et al., 2002, Isaacs et al., 2002]

sometimes perceived negatively in the workplace, as they may be seen to reduce productivity [TNS US Group, 2009], and/or pose threats to security and privacy

where restrictions to use are in place, alternatives are sought that obtain the same benefits

same communication patterns in alternative media: email as a short message service for communication via, e.g., mailing lists

Page 6: Does Size Matter? When Small is Good Enough

Oak mailing list based corpus

six-month period (July - January 2011), totalling 659 emails

Figure: Email length distribution

Page 7: Does Size Matter? When Small is Good Enough

Dynamic Topic Classification Of Short Texts

Goal: evaluate to what degree the knowledge content of a shorter message can be compared to that of a full message.

Chosen task: text classification on non-predefined topics.

Test bed: generated by preprocessing the oak email corpus to obtain several fixed-size corpora.

Method:

(1) Corpus-driven topic extraction: a number of topics are automatically extracted from a document collection; each topic is represented as a weighted vector of terms;
(2) Document topic classification: each document is labelled with the topic it is most similar to, and classified into the corresponding cluster.

Page 8: Does Size Matter? When Small is Good Enough

Topic Extraction: Proximity-based Clustering

Document corpus D = {d1, ..., dk}

each document di = {t1, ..., tv} is a vector of weighted terms

Term clusters C = {c1, ..., ck} (clustering performed by using as feature space the inverted index of D)

each cluster ck = {t1, ..., tn} is a vector of weighted terms

each cluster ideally represents a topic in the document collection
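
The term clustering described on this slide can be illustrated with a minimal sketch. It assumes whitespace tokenisation, raw term counts as weights, squared Euclidean distance, and a deterministic seeding of the centroids; none of these choices is stated in the slides, and the names inverted_index, kmeans_terms and topic_vector are hypothetical. Each term is represented by its row in the inverted index (one weight per document), and k-means groups terms that occur in similar sets of documents.

```python
from collections import Counter


def inverted_index(docs):
    """Map each term to its vector of raw counts across the documents."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    return {term: [float(c[term]) for c in counts] for term in vocab}


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def kmeans_terms(index, k, max_iters=20):
    """Cluster terms with plain k-means over their inverted-index vectors."""
    terms = list(index)
    # Assumption: seed with the first k distinct term vectors (deterministic).
    centroids = []
    for term in terms:
        if index[term] not in centroids:
            centroids.append(index[term])
        if len(centroids) == k:
            break
    clusters = [[] for _ in centroids]
    for _ in range(max_iters):
        clusters = [[] for _ in centroids]
        for term in terms:
            nearest = min(range(len(centroids)),
                          key=lambda i: sq_dist(index[term], centroids[i]))
            clusters[nearest].append(term)
        updated = [centroid([index[t] for t in members]) if members else centroids[i]
                   for i, members in enumerate(clusters)]
        if updated == centroids:  # assignments are stable, stop early
            break
        centroids = updated
    return clusters


def topic_vector(members, index):
    """Weight each term by its total corpus count (an illustrative choice)."""
    return {term: sum(index[term]) for term in members}


docs = [
    "grant deadline proposal",
    "proposal grant budget",
    "server backup outage",
    "backup server restart",
]
index = inverted_index(docs)
clusters = kmeans_terms(index, k=2)
topics = [topic_vector(members, index) for members in clusters]
```

On this toy input the funding-related terms and the server-related terms end up in separate clusters, each of which can then be read as one topic vector.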

Page 9: Does Size Matter? When Small is Good Enough

Email Topic Classification

sim(d, Ci): similarity between documents and clusters (by cosine similarity)

labelDoc(D): each document d is mapped to the topic Ci which maximises the similarity sim(d, Ci)
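
A minimal sketch of the two definitions above, assuming documents and topics are both sparse term-weight dictionaries and that documents are tokenised on whitespace with raw counts as weights. The function names cosine and label_doc, and the example topics, are illustrative and not taken from the slides.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


def label_doc(text, topics):
    """Return the name of the topic whose vector is most similar to the text."""
    doc = Counter(text.lower().split())
    return max(topics, key=lambda name: cosine(doc, topics[name]))


# Hypothetical topic vectors, standing in for the output of topic extraction.
topics = {
    "funding": {"grant": 2.0, "proposal": 2.0, "budget": 1.0},
    "operations": {"server": 2.0, "backup": 2.0},
}
label = label_doc("grant proposal due friday", topics)
```

Because the same topic vectors are reused for every corpus, only the document side of the similarity changes when an email is truncated.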

Page 10: Does Size Matter? When Small is Good Enough

Dataset Preparation: Comparable Corpora

Corpus Name   Maximum text length of each document
corpus140     truncated at length 140 (if longer, full text otherwise)
corpus280     truncated at length 280 (if longer, full text otherwise)
corpus420     truncated at length 420 (if longer, full text otherwise)
corpus560     truncated at length 560 (if longer, full text otherwise)
corpus700     truncated at length 700 (if longer, full text otherwise)
corpus840     truncated at length 840 (if longer, full text otherwise)
corpus980     truncated at length 980 (if longer, full text otherwise)
mainCorpus    full email body
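
The corpora in this table can be derived from the full email bodies with plain character slicing. The sketch below assumes each email is already available as a string and that truncation is purely character-based; the function name build_corpora and its parameters are hypothetical.

```python
def build_corpora(emails, block=140, n_blocks=7):
    """Build corpus140 ... corpus980 plus mainCorpus from full email bodies.

    Slicing leaves emails shorter than the limit unchanged, which matches
    the "full text otherwise" rule in the table.
    """
    corpora = {}
    for i in range(1, n_blocks + 1):
        limit = block * i
        corpora[f"corpus{limit}"] = [body[:limit] for body in emails]
    corpora["mainCorpus"] = list(emails)
    return corpora


# Two made-up email bodies: one shorter and one longer than a micropost.
emails = ["short note", "x" * 300]
corpora = build_corpora(emails)
```

Every corpus keeps the same documents in the same order, so labels assigned on a truncated corpus can be compared position by position with those assigned on mainCorpus.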

Page 11: Does Size Matter? When Small is Good Enough

Experimental Approach

corpus-driven topic extraction: using mainCorpus

email classification: apply the labelDoc procedure to all the different comparable corpora, including mainCorpus.

Page 12: Does Size Matter? When Small is Good Enough

Corpus Topics

Figure: Visualisation of topic clusters.

Page 13: Does Size Matter? When Small is Good Enough

Results

            Precision   Recall   F-Measure
corpus140   0.80        0.63     0.69
corpus280   0.90        0.86     0.87
corpus420   0.93        0.92     0.92
corpus560   0.98        0.96     0.97
corpus700   0.95        0.94     0.94
corpus840   0.98        0.98     0.98
corpus980   0.99        0.99     0.99
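
The slides do not state how these scores were computed. One plausible reading, sketched below purely as an assumption, treats the topic assigned to each email in mainCorpus as the reference label, scores the labels assigned to the truncated version of the same email against it, and averages precision, recall and F-measure over topics (macro-averaging). The function name macro_prf and the toy labels are illustrative only.

```python
def macro_prf(reference, predicted):
    """Macro-averaged precision, recall and F-measure over all topics."""
    topics = sorted(set(reference) | set(predicted))
    precisions, recalls, f_scores = [], [], []
    for topic in topics:
        pairs = list(zip(reference, predicted))
        tp = sum(1 for r, p in pairs if r == topic and p == topic)
        fp = sum(1 for r, p in pairs if r != topic and p == topic)
        fn = sum(1 for r, p in pairs if r == topic and p != topic)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_score = (2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f_scores.append(f_score)
    n = len(topics)
    return sum(precisions) / n, sum(recalls) / n, sum(f_scores) / n


# Toy labels: topics assigned on the full emails vs. on a truncated corpus.
reference = ["a", "a", "b", "b"]
predicted = ["a", "b", "b", "b"]
precision, recall, f_measure = macro_prf(reference, predicted)
```

Under macro-averaging the reported F-measure is the mean of per-topic F-measures, so it need not equal the harmonic mean of the averaged precision and recall.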

Page 14: Does Size Matter? When Small is Good Enough

Conclusions

a fair portion of the emails exchanged, for the corpus generated from a mailing list, are very short, with approximately 40% falling within the single micropost size, and 65% up to two microposts;

for the text classification task described, the accuracy of classification for micropost size texts is an acceptable approximation of classification performed on longer texts, with a decrease of only ∼5% for up to the second micropost block within a long e-mail.
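
Shares such as the 40% and 65% quoted above can be read off the email length distribution. The sketch below shows one way to compute them, assuming length is measured in characters of the email body; the function names and the sample data are hypothetical and do not reproduce the corpus figures.

```python
import math
from collections import Counter


def block_histogram(emails, block=140):
    """Count emails by the number of micropost-sized blocks they occupy."""
    return Counter(max(1, math.ceil(len(body) / block)) for body in emails)


def share_within(emails, n_blocks, block=140):
    """Fraction of emails no longer than n_blocks micropost-sized blocks."""
    limit = n_blocks * block
    return sum(1 for body in emails if len(body) <= limit) / len(emails)


# Made-up email bodies of 100, 140, 200 and 500 characters.
emails = ["a" * 100, "b" * 140, "c" * 200, "d" * 500]
histogram = block_histogram(emails)
one_block = share_within(emails, 1)
two_blocks = share_within(emails, 2)
```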

Page 15: Does Size Matter? When Small is Good Enough

Future Work

enriching the micro-emails with semantic information (e.g., concepts extracted from domain and standard ontologies) would improve the results obtained using unannotated text

investigate the influence of other similarity measures

application to expert finding tasks, exploiting dynamic topic extraction as a means to determine authors’ and recipients’ areas of expertise

formal evaluation of topic validity will be required, including the human (expert) annotator in the loop

Page 16: Does Size Matter? When Small is Good Enough

References

Herbsleb, J. D., Atkins, D. L., Boyer, D. G., Handel, M., and Finholt, T. A. (2002). Introducing instant messaging and chat in the workplace. In Proc., SIGCHI conference on Human factors in computing systems: Changing our world, changing ourselves, pages 171–178.

Isaacs, E., Walendowski, A., Whittaker, S., Schiano, D. J., and Kamm, C. (2002). The character, functions, and styles of instant messaging in the workplace. In Proc., ACM conference on Computer supported cooperative work, pages 11–20.

TNS US Group (2009). Social media exploding: More than 40% use online social networks. http://www.tns-us.com/news/social_media_exploding_more_than.php.
