CL and Social Media
LING 575, Fei Xia
Week 2: 01/11/2011
Outline
• A few announcements
• Personal vs. Business email
• Email zone classification
• Deception detection
• Hw2
• Hw1: quick update from the students
A few announcements
Databases on Patas
• Three MySQL databases on patas/capuchin:
  – enron: the ISI database
    • Seems to have many more senders
    • The tables are slightly different from the paper
  – berkeley_enron: the database from Berkeley
  – zonerelease: email zone annotation
• Query the database:
  – userid: enronmail
  – password: askwhy
Databases on Patas (cont)
• mysql -u enronmail -p -h capuchin
• enter your password (“askwhy”)
• use database_name;
• show tables;
• select * from table_name limit 5;
• MySQL APIs are available for Perl and other languages
Recent workshops on social media
• NAACL 2010 workshop: http://www.aclweb.org/anthology-new/W/W10/W10-05.pdf
• ACL 2011 workshop (due date: 4/1): http://research.microsoft.com/en-us/events/lsm2011/default.aspx
• International Conference on Weblogs and Social Media, in conjunction with IJCAI-2011 (due date: 1/31): http://www.icwsm.org/2011/cfp.php
Personal vs. business emails
Task
• Determine whether an email is personal or business
• (Jabbari et al., 2006):
  – Manual annotation
  – Inter-annotator agreement
  – Automatic classification
Annotated data
• Available at http://staffwww.dcs.shef.ac.uk/people/L.Guthrie/nlp/research.htm
• Stored on patas under $data_dir/personal_vs_business/
• Size:
  – 12,500 emails
  – 83% business, 17% personal
  – Mismatch between the paper and the data
Class labels
• Business:
  – core business, routine admin, inter-employee relations, soliciting, image, keeping_current
• Personal:
  – close personal, personal maintenance, personal circulation
Inter-annotator agreement
• 2,200 emails were double annotated:
  – 6% disagreement
  – 82% were labeled as “business” by both
  – 12% were labeled as “personal” by both
• Disagreements: about 130 emails
  – 25% for subscriptions
  – 18% for travel arrangements
  – 13% for colleague meetings
  – 8% for services provided to Enron employees
• Questions:
  – What do the annotators see? The email only, or the whole thread? Do they look only at the email body, or at the “To” field as well?
Automatic classification
• Classification algorithm: (Guthrie and Walker, 1994)
• Data:
  – 4,000 messages on “core business”
  – 1,000 messages on “close personal”
• Results: 0.93 (system accuracy) vs. 0.94 (inter-annotator agreement)
(Guthrie and Walker, 1994): Algorithm for text classification
• Let T1, T2, …, Tk be the class labels.
• Assumption: a test document with class label Ti has a “word” distribution similar to that of the union of the training documents labeled Ti.
• Training:
  – partition the set of words into W1, W2, …, Wm
  – for each Ti:
    • “merge” the training documents whose class label is Ti
    • calculate pij, the fraction of Ti’s word tokens that fall in each Wj
  – Ex: |T|=2, |W|=3; pij is (0.1, 0.05, 0.85) for T1 and (0.01, 0.2, 0.79) for T2
• Testing:
  – let nj be the number of word tokens in the test document that belong to Wj
    • Ex: the frequencies are (10, 200, 8900)
  – choose the Ti that maximizes ∏j (pij)^nj, i.e., Σj nj log pij
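The training and testing steps above can be sketched in Python. This is my reconstruction under the slide's definitions; the function names and the light smoothing (the slide does not say how zero counts are handled) are my own additions.

```python
import math


def train_class_profiles(docs_by_class, word_sets):
    """For each class label Ti, merge its training documents and compute
    p_ij: the fraction of Ti's word tokens that fall in word set Wj.
    `word_sets` is a list of sets partitioning the vocabulary."""
    profiles = {}
    for label, docs in docs_by_class.items():
        tokens = [w for doc in docs for w in doc]
        counts = [sum(1 for w in tokens if w in ws) for ws in word_sets]
        total = sum(counts)
        # Light smoothing so no p_ij is exactly zero (my assumption).
        profiles[label] = [(c + 0.5) / (total + 0.5 * len(word_sets))
                           for c in counts]
    return profiles


def classify(doc, profiles, word_sets):
    """Choose the Ti maximizing prod_j p_ij^n_j, computed in log space
    as sum_j n_j * log(p_ij), where n_j counts the test-document tokens
    that belong to W_j."""
    n = [sum(1 for w in doc if w in ws) for ws in word_sets]
    return max(profiles, key=lambda t: sum(nj * math.log(p)
                                           for nj, p in zip(n, profiles[t])))
```

With the slide's example numbers (p-vectors (0.1, 0.05, 0.85) and (0.01, 0.2, 0.79), counts (10, 200, 8900)), the log-score for T1 is higher, so T1 would be chosen.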
(Guthrie and Walker, 1994): Experiments
• Two class labels: T1 and T2
• Three word sets: W1, W2, and W3
  – W1 includes the 300 most frequent words in Docs(T1) that are not among the 500 most frequent words in Docs(T2).
  – W2 includes the 300 most frequent words in Docs(T2) that are not among the 500 most frequent words in Docs(T1).
  – W3 includes the rest of the words.
• Accuracy: 100%
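The word-set construction used in these experiments can be sketched as follows. The function name and parameters are mine, and the slide does not specify how frequency ties are broken, so this is an approximation.

```python
from collections import Counter


def build_word_sets(docs_t1, docs_t2, top_n=300, exclude_n=500):
    """W1: the top_n most frequent words in Docs(T1) that are not among
    the exclude_n most frequent words in Docs(T2); W2 symmetrically;
    W3: every remaining word in the combined vocabulary."""
    freq1 = Counter(w for d in docs_t1 for w in d)
    freq2 = Counter(w for d in docs_t2 for w in d)
    block2 = {w for w, _ in freq2.most_common(exclude_n)}  # T2's top words
    block1 = {w for w, _ in freq1.most_common(exclude_n)}  # T1's top words
    ranked1 = [w for w, _ in freq1.most_common()]
    ranked2 = [w for w, _ in freq2.most_common()]
    w1 = set([w for w in ranked1 if w not in block2][:top_n])
    w2 = set([w for w in ranked2 if w not in block1][:top_n])
    w3 = (set(freq1) | set(freq2)) - w1 - w2
    return w1, w2, w3
```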
Issues
• Using word features: the words in a business email can vary a lot depending on what the business is.
• Other important cues:
  – the relation between the sender and the recipient
    • Do they work in the same company?
    • What is the path between them in the company’s reporting chain?
    • Are they friends?
  – other emails in the same thread
  – the nature of the sender’s/recipient’s/company’s work and the words in the emails (e.g., “stock”, “parent meeting”)
  – …
• Other ideas?
Email zoning
Email zone classification
• Task: given a message, break it into zones (e.g., header, greeting, body, disclaimer, etc.)
• Today’s paper: Andrew Lampert, Robert Dale, and Cecile Paris, 2009. Segmenting Email Message Text into Zones. In Proc. of EMNLP-2009.
• Data:
  – Available at http://zebra.thoughtlets.org/
  – Stored on patas under $data_dir/email_zoning_dataset/EmailZoneData/
  – Stored on capuchin as a MySQL database called “zonerelease”
Email zones in (Estival et al., 2007)
• Five categories:
  – Author text
  – Signature
  – Advertisement (automatically appended)
  – Quoted text
  – Reply lines
Email zones in (Lampert et al., 2009)
• Sender zones:
  – Author: new content from the current email sender, excluding any text that has been included from previous messages
  – Greeting: e.g., “Hi, Mike”
  – Signoff: e.g., “thanks. AJ”
• Quoted conversation zones:
  – Reply: content quoted from a previous message
  – Forward: content from an email message outside the current conversation thread that has been forwarded by the current email sender
Email zones (cont)
• Boilerplate zones contain content that is reused without modification across multiple email messages:
  – Signature
  – Advertising
  – Disclaimer
  – Attachment: automatically generated text
Manual annotation
• Annotated data:
  – almost 400 email messages
  – 11,881 lines (7,922 non-blank)
  – use the Berkeley database (“berkeley_enron”)
  – one annotator
• Use 10-fold cross validation
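A minimal sketch of the 10-fold cross-validation split, in Python. The round-robin fold assignment is my own simplification; the paper may shuffle the data first.

```python
def k_fold_splits(items, k=10):
    """Yield (train, test) pairs for k-fold cross validation.
    Items are dealt to folds round-robin, so fold sizes differ
    by at most one."""
    folds = [items[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```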
Automatic classification
• Classifier: SVM
• Two approaches:
  – two-stage (zone fragment classification):
    • segment a message into zone fragments
    • classify those fragments
  – one-stage:
    • classify each line
Detecting zone boundaries
• Different kinds of boundaries:
  – Blank boundaries: line 12
  – Separate boundaries: lines 17-20
  – Adjoining boundaries: lines 10 and 11
• Heuristic approach:
  – consider every blank line or line beginning with 4+ repeated punctuation marks
  – cannot handle adjoining boundaries
  – high recall, low precision
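The boundary heuristic can be sketched as below. The exact set of punctuation characters is my assumption, not the paper's.

```python
import re

# A line opening with a run of 4+ repeated punctuation characters
# (e.g. "-----" or "____"); the character set here is an assumption.
_PUNCT_RUN = re.compile(r"^\s*([-=_*#~+.])\1{3,}")


def candidate_boundaries(lines):
    """Indices of lines flagged as possible zone boundaries: every blank
    line, plus every line opening with 4+ repeated punctuation marks.
    High recall, low precision; it cannot propose a boundary between
    two adjoining non-blank content lines."""
    return [i for i, line in enumerate(lines)
            if not line.strip() or _PUNCT_RUN.match(line)]
```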
Classifying zone fragments
• Features:
  – Graphic features: the layout of text in the email
  – Orthographic features: the use of distinctive characters and character sequences, including punctuation, capital letters, and numbers
  – Lexical features: information about the words used in the email text
Graphic features
• the number of words in the text fragment
• the number of characters in the text fragment
• the start position of the text fragment
• the end position of the text fragment
• the average line length (in chars) within the text fragment
• the length of the text fragment relative to the previous fragment
• the number of blank lines preceding the text fragment
• …
Orthographic features
• whether all lines start with the same character (e.g., ‘>’)
• whether a prior text fragment in the message contains a quoted header
• whether a prior text fragment in the message contains repeated punctuation characters
• whether the text fragment contains a URL
• whether the text fragment contains an email address
• whether the text fragment contains a sequence of four or more digits
• the number of capitalised words in the text fragment
• the percentage of capitalised words in the text fragment
• …
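A few of these graphic and orthographic features can be computed as below. The feature names, tokenization, and regexes are my own simplifications of the features listed above; the paper feeds such vectors to an SVM.

```python
import re


def fragment_features(fragment_lines, start_line):
    """Compute a handful of graphic/orthographic features for one zone
    fragment, given as a list of its lines plus its start position."""
    text = "\n".join(fragment_lines)
    words = text.split()
    caps = [w for w in words if w[0].isupper()]
    return {
        "num_words": len(words),
        "num_chars": len(text),
        "start_line": start_line,
        "end_line": start_line + len(fragment_lines) - 1,
        "avg_line_len": sum(len(l) for l in fragment_lines)
                        / max(len(fragment_lines), 1),
        "same_first_char": len({l[0] for l in fragment_lines if l}) == 1,
        "has_url": bool(re.search(r"https?://\S+", text)),
        "has_email": bool(re.search(r"\S+@\S+\.\S+", text)),
        "has_4digits": bool(re.search(r"\d{4,}", text)),
        "num_cap_words": len(caps),
        "pct_cap_words": len(caps) / max(len(words), 1),
    }
```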
Lexical features
• word unigrams
• word bigrams
• whether the text fragment contains the sender’s name
• whether a prior text fragment in the message contains the sender’s name
• whether the text fragment contains the sender’s initials
• whether the text fragment contains a recipient’s name
Results

Confusion matrix for nine-zone line classification

Precision and recall
Issues
• Sequence labeling problem:
  – add features that look at the labels of preceding segments
• Is the 9-zone label set sufficient?
• How to take advantage of emails in the bigger context?
  – emails in the same discussion thread
  – emails by the same sender
  – general email structure: e.g., greeting, body, signoff, etc.
Deception detection
Papers for today
• [11] M.L. Newman, J.W. Pennebaker, D.S. Berry, and J.M. Richards. “Lying words: Predicting deception from linguistic style”. Personality and Social Psychology Bulletin, 29:665–675, 2003.
• [13] L. Zhou, J.K. Burgoon, J.F. Nunamaker Jr., and D.P. Twitchell. “Automating linguistics-based cues for detecting deception in text-based asynchronous computer-mediated communication”. Group Decision and Negotiation, 13:81–106, 2004.
(Newman et al., 2003)
• Assumption: deceptive communications should be characterized by:
  – fewer first-person singular pronouns (e.g., “I”, “me”, and “my”): writers disassociate themselves from their statements
  – more words reflecting negative emotion: liars feel guilty about lying or about the topic they are discussing
  – fewer "exclusive" words (e.g., “except”, “but”, “without”) and more action words (e.g., “walk”): due to reduced cognitive resources
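These cues can be counted with simple word lists, as sketched below. The tiny lists here are my own stand-ins for the LIWC-style categories the study relied on, and the per-100-words normalization is an assumption.

```python
import re

# Illustrative mini-lexicons (my stand-ins, not the study's lexicon).
FIRST_PERSON_SG = {"i", "me", "my", "mine", "myself"}
EXCLUSIVE = {"but", "except", "without"}
NEGATIVE_EMOTION = {"hate", "guilty", "worthless", "sad", "angry"}
MOTION = {"walk", "move", "go", "carry", "run"}


def newman_cues(text):
    """Rate (per 100 words) of the four cue types; deceptive text is
    predicted to score lower on first-person singular and exclusive
    words, higher on negative-emotion and motion words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    n = max(len(tokens), 1)
    rate = lambda ws: 100 * sum(t in ws for t in tokens) / n
    return {"first_person_sg": rate(FIRST_PERSON_SG),
            "negative_emotion": rate(NEGATIVE_EMOTION),
            "exclusive": rate(EXCLUSIVE),
            "motion": rate(MOTION)}
```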
Experiments: Five studies
• videotaped abortion attitudes
• typed abortion attitudes
• handwritten abortion attitudes
• feelings about friends
• mock crime
Experiments
• Trained on four studies and applied the "classifier" to the remaining study
• Accuracy: about 61%
• They found that the weights of these four word types were consistent with their assumptions.
(Zhou et al., 2004)
• Experiments:
  – students were asked to exchange emails about a desert survival task
  – students were asked to tell the truth or to lie
  – features: 27 linguistic cues
Hypothesis
• Deceptive senders display:
  – higher (a) quantity, (b) expressivity, (c) positive affect, (d) informality, (e) uncertainty, and (f) nonimmediacy, and
  – less (g) complexity, (h) diversity, and (i) specificity of language in their messages
  than truthful senders and than their respective receivers.
Linguistic cues
• quantity:
  – # of words
  – # of verbs
  – # of NPs
  – # of sentences
• expressivity:
  – # of adjectives/adverbs divided by # of nouns and verbs
Linguistic cues (cont)
• positive affect: expressions of positive emotion
• informality: # of misspelled words / # of words
• uncertainty:
  – # of modifiers (adjectives/adverbs)
  – # of modal verbs
  – # of uncertainty words
  – # of third-person pronouns
Linguistic cues (cont)
• nonimmediacy:
  – passive voice
  – generalizing terms
  – (fewer) self references
  – group references: first-person plural pronouns
Linguistic cues (cont)
• complexity:
  – average # of clauses per sentence
  – average sentence length
  – average word length
  – …
• diversity:
  – lexical diversity
  – content word diversity
  – redundancy
  – …
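Several of these surface cues need no POS tagger and can be computed directly, as in the sketch below. The function name and the simple tokenization are my own; cues like expressivity or # of NPs would additionally require a tagger or parser.

```python
import re


def surface_cues(text):
    """Compute tagger-free cues from the lists above: quantity (word and
    sentence counts), average sentence length, average word length, and
    lexical diversity (type/token ratio)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    n = max(len(words), 1)
    return {
        "num_words": len(words),
        "num_sentences": len(sentences),
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "avg_word_len": sum(len(w) for w in words) / n,
        "lexical_diversity": len({w.lower() for w in words}) / n,
    }
```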
Issues
• Different settings for deception could affect the cues (e.g., the length of the messages):
  – interviews
  – emails
  – blogs
  – lying spontaneously vs. being asked to lie
Hw2
• Your presentation
• Reading assignments
• Suggestions for others’ projects