CL and Social Media. LING 575, Fei Xia. Week 2: 01/11/2011


Page 1: CL and Social Media

CL and Social Media

LING 575, Fei Xia

Week 2: 01/11/2011

Page 2: CL and Social Media

Outline

• A few announcements

• Personal vs. Business email

• Email zone classification

• Deception detection

• Hw2

• Hw1: quick update from the students

Page 3: CL and Social Media


A few announcements

Page 4: CL and Social Media

Databases on Patas

• Three MySQL databases on patas/capuchin:
– enron: the ISI database
• Seems to have many more senders
• The tables are slightly different from the paper
– berkeley_enron: the database from Berkeley
– zonerelease: email zone annotation

• Query the database:
– userid: enronmail
– password: askwhy

Page 5: CL and Social Media

Databases on Patas (cont)

• mysql -u enronmail -p -h capuchin
• enter your password (“askwhy”)
• use database_name;
• show tables;
• select * from table_name limit 5;

• mysql API for Perl and other languages

Page 6: CL and Social Media


Recent workshops on social media

• NAACL 2010 workshop: http://www.aclweb.org/anthology-new/W/W10/W10-05.pdf

• ACL 2011 workshop: (due date is 4/1) http://research.microsoft.com/en-us/events/lsm2011/default.aspx

• International conference on Weblogs and Social Media: in conjunction with IJCAI-2011 (due date is 1/31) http://www.icwsm.org/2011/cfp.php

Page 7: CL and Social Media


Personal vs. business emails

Page 8: CL and Social Media

Task

• Determine whether an email is personal or business

• (Jabbari et al., 2006):
– Manual annotation
– Inter-annotator agreement
– Automatic classification

Page 9: CL and Social Media

Annotated data

• Available at http://staffwww.dcs.shef.ac.uk/people/L.Guthrie/nlp/research.htm

• Stored on patas under $data_dir/personal_vs_business/

• Size:
– 12,500 emails
– 83% business, 17% personal
– Mismatch between the paper and the data

Page 10: CL and Social Media

Class labels

• Business:
– core business, routine admin, inter-employee relations, soliciting, image, keeping_current

• Personal:
– close personal, personal maintenance, personal circulation

Page 11: CL and Social Media

Inter-annotator agreement

• 2,200 emails are double annotated:
– 6% disagreement
– 82% are labeled as “business” by both
– 12% are labeled as “personal” by both

• Disagreements: about 130 emails
– 25% for subscriptions
– 18% for travel arrangements
– 13% for colleague meetings
– 8% for services provided to Enron employees

• Questions:
– What do annotators see? The email only, or the whole thread? Do they look only at the email body, or at the “To” field as well?

Page 12: CL and Social Media

Automatic classification

• Classification algorithm: (Guthrie and Walker, 1994)

• Data:
– 4,000 messages on “core business”
– 1,000 messages on “close personal”

• Results: 0.93 (system accuracy) vs. 0.94 (inter-annotator agreement)

Page 13: CL and Social Media

(Guthrie and Walker, 1994): Algorithm for text classification

• Let T1, T2, …, Tk be the class labels.

• Assumption: a test document with class label Ti has a “word” distribution similar to that of the union of the training documents labeled Ti.

• Training:
– partition the set of words into W1, W2, …, Wm
– for each Ti:
• “merge” the documents in the training data whose class label is Ti
• calculate pij, the proportion of the merged words that fall into Wj
• Ex: |T|=2, |W|=3, pij is (0.1, 0.05, 0.85) for T1, and (0.01, 0.2, 0.79) for T2

• Testing:
– let nj be the frequency of the words in the test document that belong to Wj
• Ex: the frequencies are (10, 200, 8900)
– choose the Ti that maximizes ∏j pij^nj (equivalently, Σj nj log pij)
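The decision rule above can be read as a multinomial model over word sets rather than over individual words. A minimal Python sketch under that reading (all function and variable names are mine, not the paper's; add-one smoothing is my addition to keep every pij nonzero):

```python
import math

def train(docs_by_label, word_sets):
    """For each label Ti, merge its training documents and estimate
    p_ij: the fraction of their words falling into word set Wj.
    Words in none of the listed sets are counted in the last set."""
    probs = {}
    for label, docs in docs_by_label.items():
        counts = [0] * len(word_sets)
        for doc in docs:
            for w in doc:
                for j, wset in enumerate(word_sets):
                    if w in wset:
                        counts[j] += 1
                        break
                else:
                    counts[-1] += 1  # catch-all "rest of the words" set
        total = sum(counts)
        # add-one smoothing so no p_ij is exactly zero
        probs[label] = [(c + 1) / (total + len(word_sets)) for c in counts]
    return probs

def classify(tokens, probs, word_sets):
    """Pick the Ti maximizing sum_j n_j * log p_ij, where n_j counts
    the test-document words that fall into Wj."""
    n = [0] * len(word_sets)
    for w in tokens:
        for j, wset in enumerate(word_sets):
            if w in wset:
                n[j] += 1
                break
        else:
            n[-1] += 1
    return max(probs, key=lambda t: sum(nj * math.log(p)
                                        for nj, p in zip(n, probs[t])))
```

Scoring with Σj nj log pij instead of the raw product ∏j pij^nj avoids floating-point underflow when nj is large, as in the (10, 200, 8900) example.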

Page 14: CL and Social Media

(Guthrie and Walker, 1994): Experiments

• Two class labels: T1 and T2

• Three word sets: W1, W2, and W3

– W1 includes the top 300 most frequent words in Docs(T1) that are not among the top 500 most frequent words in Docs(T2).

– W2 includes the top 300 most frequent words in Docs(T2) that are not among the top 500 most frequent words in Docs(T1).

– W3 includes the rest of the words

• Accuracy: 100%
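A rough sketch of how such word sets might be built from frequency counts (the parameter names and tie-breaking details are my guesses, not the paper's; W3 is left implicit as everything outside W1 and W2):

```python
from collections import Counter

def build_word_sets(docs_t1, docs_t2, top_n=300, exclude_top=500):
    """W1: the top_n most frequent words in Docs(T1) that are not among
    the exclude_top most frequent words of Docs(T2); W2 symmetrically."""
    freq1 = Counter(w for d in docs_t1 for w in d)
    freq2 = Counter(w for d in docs_t2 for w in d)
    top1 = {w for w, _ in freq1.most_common(exclude_top)}
    top2 = {w for w, _ in freq2.most_common(exclude_top)}
    # walk each frequency list in descending order, skipping words that
    # are frequent in the other class, and keep the first top_n survivors
    w1 = set([w for w, _ in freq1.most_common() if w not in top2][:top_n])
    w2 = set([w for w, _ in freq2.most_common() if w not in top1][:top_n])
    return w1, w2
```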

Page 15: CL and Social Media

Issues

• Using word features: the words in a business email could vary a lot depending on what the business is.

• Other important cues:
– the relation between the sender and the recipient
• Do they work in the same company?
• What is the path between them in the company's reporting chain?
• Are they friends?
– other emails in the same thread
– the nature of the sender's/recipient's/company's work and the words in the emails (e.g., “stock”, “parent meeting”)
– …

• Other ideas?

Page 16: CL and Social Media


Email zoning

Page 17: CL and Social Media

Email zone classification

• Task: given a message, break it into zones (e.g., header, greeting, body, disclaimer, etc.)

• Today's paper: Andrew Lampert, Robert Dale, and Cecile Paris, 2009. Segmenting Email Message Text into Zones. In Proc. of EMNLP-2009.

• Data:
– Available at http://zebra.thoughtlets.org/
– Stored on patas under $data_dir/email_zoning_dataset/EmailZoneData/
– Stored on capuchin as a MySQL database called “zonerelease”

Page 18: CL and Social Media

Email zones in (Estival et al., 2007)

• Five categories:
– Author text
– Signature
– Advertisement (automatically appended ones)
– Quoted text
– Reply lines

Page 19: CL and Social Media

Email zones in (Lampert et al., 2009)

• Sender zones
– Author: new content from the current email sender, excluding any text that has been included from previous messages
– Greetings: e.g., “Hi, Mike”
– Signoff: e.g., “thanks. AJ”

• Quoted conversation zones
– Reply: content quoted from a previous message
– Forward: content from an email message outside the current conversation thread that has been forwarded by the current email sender

Page 20: CL and Social Media

Email zones (cont)

• Boilerplate zones contain content that is reused without modification across multiple email messages:
– Signature
– Advertising
– Disclaimer
– Attachment: automatically generated text

Page 21: CL and Social Media


Page 22: CL and Social Media

Manual annotation

• Annotated data:
– almost 400 email messages
– 11,881 lines (7,922 non-blank lines)
– uses the Berkeley database (“berkeley_enron”)
– one annotator

• Use 10-fold cross-validation

Page 23: CL and Social Media

Automatic classification

• Classifier: SVM

• Two approaches:
– two stages (zone fragment classification):
• segment a message into zone fragments
• classify those fragments
– one stage:
• classify each line

Page 24: CL and Social Media

Detecting zone boundaries

• Different kinds of boundaries:
– Blank boundaries: line 12
– Separate boundaries: lines 17-20
– Adjoining boundaries: lines 10 and 11

• Use a heuristic approach:
– consider every blank line or line beginning with 4+ repeated punctuation marks
– cannot handle adjoining boundaries
– high recall, low precision
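The heuristic as described might look like this in Python (the exact punctuation pattern is my assumption; the slide only says lines beginning with 4+ repeated punctuation marks):

```python
import re

# A line opening with four or more copies of the same punctuation mark
# (e.g. "-----Original Message-----") suggests a zone boundary.
REPEATED_PUNCT = re.compile(r'^\s*([^\w\s])\1{3,}')

def candidate_boundaries(lines):
    """Return indices of lines flagged as possible zone-fragment
    boundaries: blank lines, or lines beginning with 4+ repeated
    punctuation marks. As noted above: high recall, low precision."""
    return [i for i, line in enumerate(lines)
            if not line.strip() or REPEATED_PUNCT.match(line)]
```

Because adjoining boundaries have no blank line or punctuation run between them, this detector necessarily misses them, matching the limitation stated on the slide.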

Page 25: CL and Social Media

Classifying zone fragments

• Features:
– Graphic features: layout of the text in the email
– Orthographic features: the use of distinctive characters and character sequences, including punctuation, capital letters, and numbers
– Lexical features: information about the words used in the email text

Page 26: CL and Social Media

Graphic features

• the number of words in the text fragment
• the number of characters in the text fragment
• the start position of the text fragment
• the end position of the text fragment
• the average line length (in characters) within the text fragment
• the length of the text fragment relative to the previous fragment
• the number of blank lines preceding the text fragment
• …

Page 27: CL and Social Media

Orthographic features

• whether all lines start with the same character (e.g., ‘>’);
• whether a prior text fragment in the message contains a quoted header;
• whether a prior text fragment in the message contains repeated punctuation characters;
• whether the text fragment contains a URL;
• whether the text fragment contains an email address;
• whether the text fragment contains a sequence of four or more digits;
• the number of capitalised words in the text fragment;
• the percentage of capitalised words in the text fragment;
• …
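Several of these checks are simple to sketch with regular expressions (the regexes and feature names below are illustrative choices, not the paper's exact definitions):

```python
import re

def orthographic_features(fragment_lines):
    """Compute a handful of the orthographic cues for one text fragment,
    given as a list of lines."""
    text = "\n".join(fragment_lines)
    words = re.findall(r"[A-Za-z']+", text)
    caps = [w for w in words if w[0].isupper()]
    return {
        # all non-empty lines share a first character, e.g. quote marker '>'
        "all_lines_same_first_char":
            len({l[0] for l in fragment_lines if l}) == 1,
        "has_url": bool(re.search(r'https?://\S+|www\.\S+', text)),
        "has_email": bool(re.search(r'\b[\w.+-]+@[\w-]+\.[\w.-]+\b', text)),
        "has_4plus_digits": bool(re.search(r'\d{4,}', text)),
        "num_capitalised": len(caps),
        "pct_capitalised": len(caps) / len(words) if words else 0.0,
    }
```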

Page 28: CL and Social Media

Lexical features

• word unigrams
• word bigrams

• whether the text fragment contains the sender's name;
• whether a prior text fragment in the message contains the sender's name;
• whether the text fragment contains the sender's initials; and
• whether the text fragment contains a recipient's name.

Page 29: CL and Social Media


Results

Page 30: CL and Social Media


Confusion matrix for nine-zone line classification

Page 31: CL and Social Media


Precision and recall

Page 32: CL and Social Media

Issues

• Sequence labeling problem:
– add features that look at the labels of preceding segments

• Is the 9-zone label set sufficient?

• How to take advantage of emails in the bigger context?
– emails in the same discussion thread
– emails by the same sender
– general email structure: e.g., greeting, body, signoff, etc.

Page 33: CL and Social Media


Deception detection

Page 34: CL and Social Media

Papers for today

• [11] M.L. Newman, J.W. Pennebaker, D.S. Berry, and J.M. Richards. “Lying words: Predicting deception from linguistic style”. Personality and Social Psychology Bulletin, 29:665–675, 2003.

• [13] L. Zhou, J.K. Burgoon, J.F. Nunamaker Jr, and D. Twitchel. “Automating linguistics-based cues for detecting deception in text-based asynchronous computer-mediated communication”. Group Decision and Negotiation, 13:81–106, 2004.

Page 35: CL and Social Media

(Newman et al., 2003)

• Assumptions: deceptive communications should be characterized by
– fewer first-person singular pronouns (e.g., “I”, “me”, and “my”): disassociating oneself from one's statements
– more words reflecting negative emotion: liars may feel guilt about lying or about the topic they are discussing
– fewer “exclusive” words (e.g., “except”, “but”, “without”) and more action words (e.g., “walk”): due to reduced cognitive resources
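Counting these word categories is straightforward. The study itself used LIWC categories, so the tiny word lists below are illustrative stand-ins only, not the real lexicons:

```python
# Illustrative stand-ins for the LIWC categories used in the study.
FIRST_PERSON_SG = {"i", "me", "my", "mine", "myself"}
NEGATIVE_EMOTION = {"hate", "guilty", "afraid", "sad", "angry"}
EXCLUSIVE = {"but", "except", "without", "although"}
MOTION = {"walk", "go", "move", "run", "carry"}

def deception_cues(tokens):
    """Return each cue as a fraction of the total token count."""
    tokens = [t.lower() for t in tokens]
    n = len(tokens) or 1  # avoid division by zero on empty input
    return {
        "first_person_sg": sum(t in FIRST_PERSON_SG for t in tokens) / n,
        "negative_emotion": sum(t in NEGATIVE_EMOTION for t in tokens) / n,
        "exclusive": sum(t in EXCLUSIVE for t in tokens) / n,
        "motion": sum(t in MOTION for t in tokens) / n,
    }
```

Under the assumptions above, deceptive text should score lower on the first-person and exclusive-word rates and higher on the negative-emotion and motion-word rates.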

Page 36: CL and Social Media

Experiments: Five studies

• videotaped abortion attitudes
• typed abortion attitudes
• handwritten abortion attitudes
• feelings about friends
• mock crime

Page 37: CL and Social Media

Experiments

• Trained on four studies and used the “classifier” on the remaining study

• Accuracy: about 61%

• They found that these four types of words have weights consistent with their assumptions.

Page 38: CL and Social Media

(Zhou et al., 2004)

• Experiments:
– students are asked to exchange emails about a desert survival task
– students are asked to tell the truth or to lie
– features: 27 linguistic cues

Page 39: CL and Social Media

Hypothesis

• Deceptive senders display
– higher (a) quantity, (b) expressivity, (c) positive affect, (d) informality, (e) uncertainty, and (f) nonimmediacy, and
– less (g) complexity, (h) diversity, and (i) specificity of language in their messages

than truthful senders and than their respective receivers.

Page 40: CL and Social Media

Linguistic cues

• quantity:
– # of words
– # of verbs
– # of NPs
– # of sentences

• expressivity:
– # of adj/adv divided by # of nouns and verbs

Page 41: CL and Social Media

Linguistic cues (cont)

• positive affect: expression of positive emotion

• informality: # of misspelled words / # of words

• uncertainty:
– # of modifiers (adj/adv)
– # of modal verbs
– # of uncertainty words
– # of third-person pronouns
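Several of these cues are simple ratios over counts produced by an NLP pipeline. A sketch (the function signatures are mine; Zhou et al. describe the cues, not this interface):

```python
def expressivity(n_adj_adv, n_nouns, n_verbs):
    """Expressivity cue: modifiers relative to nouns and verbs,
    i.e. #(adj + adv) / #(nouns + verbs)."""
    denom = n_nouns + n_verbs
    return n_adj_adv / denom if denom else 0.0

def informality(n_misspelled, n_words):
    """Informality cue: fraction of words that are misspelled."""
    return n_misspelled / n_words if n_words else 0.0
```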

Page 42: CL and Social Media

Linguistic cues (cont)

• nonimmediacy:
– passive voice
– generalizing terms
– (fewer) self references
– group references: first-person plural pronouns

Page 43: CL and Social Media

Linguistic cues (cont)

• Complexity:
– average # of clauses per sentence
– average sentence length
– average word length
– …

• Diversity:
– lexical diversity
– content word diversity
– redundancy
– …
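Two of these measures sketched in Python (reading lexical diversity as the standard type-token ratio; the paper's exact definitions may differ):

```python
def lexical_diversity(tokens):
    """Type-token ratio: distinct word forms over total words."""
    return len({t.lower() for t in tokens}) / len(tokens) if tokens else 0.0

def avg_sentence_length(sentences):
    """Complexity cue: average number of words per sentence.
    `sentences` is a list of token lists."""
    return (sum(len(s) for s in sentences) / len(sentences)
            if sentences else 0.0)
```

Note that the type-token ratio shrinks as messages get longer, which is one reason message length (raised as an issue on the next slide) can confound these cues.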

Page 44: CL and Social Media

Issues

• Different deception settings could affect the cues (e.g., the length of the messages):
– interviews
– emails
– blogs
– lying spontaneously vs. being asked to lie

Page 45: CL and Social Media


Hw2

• Your presentation

• Reading assignments

• Suggestions for others’ projects