CL and Social Media
LING 575, Fei Xia
Week 2: 01/11/2011
Outline
• A few announcements
• Personal vs. Business email
• Email zone classification
• Deception detection
• Hw2
• Hw1: quick update from the students
A few announcements
Databases on Patas
• Three MySQL databases on patas/capuchin:
  – enron: the ISI database
    • Seems to have many more senders
    • The tables are slightly different from the paper
  – berkeley_enron: the database from Berkeley
  – zonerelease: email zone annotation
• Query the database:
  – userid: enronmail
  – password: askwhy
Databases on Patas (cont)
• mysql -u enronmail -p -h capuchin
• enter your password (“askwhy”)
• use database_name;
• show tables;
• select * from table_name limit 5;
• MySQL APIs are available for Perl and other languages
Recent workshops on social media
• NAACL 2010 workshop: http://www.aclweb.org/anthology-new/W/W10/W10-05.pdf
• ACL 2011 workshop (due date: 4/1): http://research.microsoft.com/en-us/events/lsm2011/default.aspx
• International Conference on Weblogs and Social Media, in conjunction with IJCAI-2011 (due date: 1/31): http://www.icwsm.org/2011/cfp.php
Personal vs. business emails
Task
• Determine whether an email is personal or business
• (Jabbari et al., 2006):
  – Manual annotation
  – Inter-annotator agreement
  – Automatic classification
Annotated data
• Available at http://staffwww.dcs.shef.ac.uk/people/L.Guthrie/nlp/research.htm
• Stored on patas under $data_dir/personal_vs_business/
• Size:
  – 12,500 emails
  – 83% business, 17% personal
  – Mismatch between the paper and the data
Class labels
• Business:
  – core business, routine admin, inter-employee relations, soliciting, image, keeping_current
• Personal:
  – close personal, personal maintenance, personal circulation
Inter-annotator agreement
• 2,200 emails were double annotated:
  – 6% disagreement
  – 82% were labeled as “business” by both
  – 12% were labeled as “personal” by both
• Disagreements: about 130 emails
  – 25% for subscriptions
  – 18% for travel arrangements
  – 13% for colleague meetings
  – 8% for services provided to Enron employees
• Questions:
  – What do the annotators see? The email only, or the whole thread? Do they look only at the email body, or at the “To” field as well?
Automatic classification
• Classification algorithm: (Guthrie and Walker, 1994)
• Data:
  – 4,000 messages on “core business”
  – 1,000 messages on “close personal”
• Results: 0.93 (system accuracy) vs. 0.94 (inter-annotator agreement)
(Guthrie and Walker, 1994): Algorithm for text classification
• Let T1, T2, …, Tk be the class labels.
• Assumption: a test document with class label Ti has a “word” distribution similar to that of the union of the training documents labeled Ti.
• Training:
  – partition the set of words into W1, W2, …, Wm
  – for each Ti:
    • “merge” the training documents whose class label is Ti
    • calculate pij, the fraction of Ti’s word tokens that fall in each Wj
  – Ex: |T|=2, |W|=3; pij is (0.1, 0.05, 0.85) for T1 and (0.01, 0.2, 0.79) for T2
• Testing:
  – let nj be the number of word tokens in the test document that belong to Wj
    • Ex: the frequencies are (10, 200, 8900)
  – choose the Ti that maximizes ∏j (pij)^nj, i.e., Σj nj log pij
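The training and testing steps above can be sketched in Python. This is my reconstruction under the slide's definitions; the function names and the light smoothing (the slide does not say how zero counts are handled) are my own additions.

```python
import math


def train_class_profiles(docs_by_class, word_sets):
    """For each class label Ti, merge its training documents and compute
    p_ij: the fraction of Ti's word tokens that fall in word set Wj.
    `word_sets` is a list of sets partitioning the vocabulary."""
    profiles = {}
    for label, docs in docs_by_class.items():
        tokens = [w for doc in docs for w in doc]
        counts = [sum(1 for w in tokens if w in ws) for ws in word_sets]
        total = sum(counts)
        # Light smoothing so no p_ij is exactly zero (my assumption).
        profiles[label] = [(c + 0.5) / (total + 0.5 * len(word_sets))
                           for c in counts]
    return profiles


def classify(doc, profiles, word_sets):
    """Choose the Ti maximizing prod_j p_ij^n_j, computed in log space
    as sum_j n_j * log(p_ij), where n_j counts the test-document tokens
    that belong to W_j."""
    n = [sum(1 for w in doc if w in ws) for ws in word_sets]
    return max(profiles, key=lambda t: sum(nj * math.log(p)
                                           for nj, p in zip(n, profiles[t])))
```

With the slide's example numbers (p-vectors (0.1, 0.05, 0.85) and (0.01, 0.2, 0.79), counts (10, 200, 8900)), the log-score for T1 is higher, so T1 would be chosen.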
(Guthrie and Walker, 1994): Experiments
• Two class labels: T1 and T2
• Three word sets: W1, W2, and W3
  – W1 includes the 300 most frequent words in Docs(T1) that are not among the 500 most frequent words in Docs(T2).
  – W2 includes the 300 most frequent words in Docs(T2) that are not among the 500 most frequent words in Docs(T1).
  – W3 includes the rest of the words.
• Accuracy: 100%
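The word-set construction used in these experiments can be sketched as follows. The function name and parameters are mine, and the slide does not specify how frequency ties are broken, so this is an approximation.

```python
from collections import Counter


def build_word_sets(docs_t1, docs_t2, top_n=300, exclude_n=500):
    """W1: the top_n most frequent words in Docs(T1) that are not among
    the exclude_n most frequent words in Docs(T2); W2 symmetrically;
    W3: every remaining word in the combined vocabulary."""
    freq1 = Counter(w for d in docs_t1 for w in d)
    freq2 = Counter(w for d in docs_t2 for w in d)
    block2 = {w for w, _ in freq2.most_common(exclude_n)}  # T2's top words
    block1 = {w for w, _ in freq1.most_common(exclude_n)}  # T1's top words
    ranked1 = [w for w, _ in freq1.most_common()]
    ranked2 = [w for w, _ in freq2.most_common()]
    w1 = set([w for w in ranked1 if w not in block2][:top_n])
    w2 = set([w for w in ranked2 if w not in block1][:top_n])
    w3 = (set(freq1) | set(freq2)) - w1 - w2
    return w1, w2, w3
```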
Issues
• Using word features: the words in a business email can vary a lot depending on what the business is.
• Other important cues:
  – the relation between the sender and the recipient
    • Do they work in the same company?
    • What is the path between them in the company’s reporting chain?
    • Are they friends?
  – other emails in the same thread
  – the nature of the sender’s/recipient’s/company’s work and the words in the emails (e.g., “stock”, “parent meeting”)
  – …
• Other ideas?
Email zoning
Email zone classification
• Task: given a message, break it into zones (e.g., header, greeting, body, disclaimer, etc.)
• Today’s paper: Andrew Lampert, Robert Dale, and Cecile Paris, 2009. Segmenting Email Message Text into Zones. In Proc. of EMNLP-2009.
• Data:
  – Available at http://zebra.thoughtlets.org/
  – Stored on patas under $data_dir/email_zoning_dataset/EmailZoneData/
  – Stored on capuchin as a MySQL database called “zonerelease”
Email zones in (Estival et al., 2007)
• Five categories:
  – Author text
  – Signature
  – Advertisement (automatically appended)
  – Quoted text
  – Reply lines
Email zones in (Lampert et al., 2009)
• Sender zones:
  – Author: new content from the current email sender, excluding any text that has been included from previous messages
  – Greeting: e.g., “Hi, Mike”
  – Signoff: e.g., “thanks. AJ”
• Quoted conversation zones:
  – Reply: content quoted from a previous message
  – Forward: content from an email message outside the current conversation thread that has been forwarded by the current email sender
Email zones (cont)
• Boilerplate zones contain content that is reused without modification across multiple email messages:
  – Signature
  – Advertising
  – Disclaimer
  – Attachment: automatically generated text
Manual annotation
• Annotated data:
  – almost 400 email messages
  – 11,881 lines (7,922 non-blank)
  – use the Berkeley database (“berkeley_enron”)
  – one annotator
• Use 10-fold cross validation
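A minimal sketch of the 10-fold cross-validation split, in Python. The round-robin fold assignment is my own simplification; the paper may shuffle the data first.

```python
def k_fold_splits(items, k=10):
    """Yield (train, test) pairs for k-fold cross validation.
    Items are dealt to folds round-robin, so fold sizes differ
    by at most one."""
    folds = [items[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```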
Automatic classification
• Classifier: SVM
• Two approaches:
  – two-stage (zone fragment classification):
    • segment a message into zone fragments
    • classify those fragments
  – one-stage:
    • classify each line
Detecting zone boundaries
• Different kinds of boundaries:
  – Blank boundaries: line 12
  – Separate boundaries: lines 17-20
  – Adjoining boundaries: lines 10 and 11
• Heuristic approach:
  – consider every blank line or line beginning with 4+ repeated punctuation marks
  – cannot handle adjoining boundaries
  – high recall, low precision
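The boundary heuristic can be sketched as below. The exact set of punctuation characters is my assumption, not the paper's.

```python
import re

# A line opening with a run of 4+ repeated punctuation characters
# (e.g. "-----" or "____"); the character set here is an assumption.
_PUNCT_RUN = re.compile(r"^\s*([-=_*#~+.])\1{3,}")


def candidate_boundaries(lines):
    """Indices of lines flagged as possible zone boundaries: every blank
    line, plus every line opening with 4+ repeated punctuation marks.
    High recall, low precision; it cannot propose a boundary between
    two adjoining non-blank content lines."""
    return [i for i, line in enumerate(lines)
            if not line.strip() or _PUNCT_RUN.match(line)]
```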
Classifying zone fragments
• Features:
  – Graphic features: the layout of text in the email
  – Orthographic features: the use of distinctive characters and character sequences, including punctuation, capital letters, and numbers
  – Lexical features: information about the words used in the email text
Graphic features
• the number of words in the text fragment
• the number of characters in the text fragment
• the start position of the text fragment
• the end position of the text fragment
• the average line length (in chars) within the text fragment
• the length of the text fragment relative to the previous fragment
• the number of blank lines preceding the text fragment
• …
Orthographic features
• whether all lines start with the same character (e.g., ‘>’)
• whether a prior text fragment in the message contains a quoted header
• whether a prior text fragment in the message contains repeated punctuation characters
• whether the text fragment contains a URL
• whether the text fragment contains an email address
• whether the text fragment contains a sequence of four or more digits
• the number of capitalised words in the text fragment
• the percentage of capitalised words in the text fragment
• …
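A few of these graphic and orthographic features can be computed as below. The feature names, tokenization, and regexes are my own simplifications of the features listed above; the paper feeds such vectors to an SVM.

```python
import re


def fragment_features(fragment_lines, start_line):
    """Compute a handful of graphic/orthographic features for one zone
    fragment, given as a list of its lines plus its start position."""
    text = "\n".join(fragment_lines)
    words = text.split()
    caps = [w for w in words if w[0].isupper()]
    return {
        "num_words": len(words),
        "num_chars": len(text),
        "start_line": start_line,
        "end_line": start_line + len(fragment_lines) - 1,
        "avg_line_len": sum(len(l) for l in fragment_lines)
                        / max(len(fragment_lines), 1),
        "same_first_char": len({l[0] for l in fragment_lines if l}) == 1,
        "has_url": bool(re.search(r"https?://\S+", text)),
        "has_email": bool(re.search(r"\S+@\S+\.\S+", text)),
        "has_4digits": bool(re.search(r"\d{4,}", text)),
        "num_cap_words": len(caps),
        "pct_cap_words": len(caps) / max(len(words), 1),
    }
```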
Lexical features
• word unigrams
• word bigrams
• whether the text fragment contains the sender’s name
• whether a prior text fragment in the message contains the sender’s name
• whether the text fragment contains the sender’s initials
• whether the text fragment contains a recipient’s name
Results

Confusion matrix for nine-zone line classification

Precision and recall
Issues
• Sequence labeling problem:
  – add features that look at the labels of preceding segments
• Is the 9-zone label set sufficient?
• How to take advantage of emails in the bigger context?
  – emails in the same discussion thread
  – emails by the same sender
  – general email structure: e.g., greeting, body, signoff, etc.
Deception detection
Papers for today
• [11] M.L. Newman, J.W. Pennebaker, D.S. Berry, and J.M. Richards. “Lying words: Predicting deception from linguistic style”. Personality and Social Psychology Bulletin, 29:665–675, 2003.
• [13] L. Zhou, J.K. Burgoon, J.F. Nunamaker Jr., and D.P. Twitchell. “Automating linguistics-based cues for detecting deception in text-based asynchronous computer-mediated communication”. Group Decision and Negotiation, 13:81–106, 2004.
(Newman et al., 2003)
• Assumption: deceptive communications should be characterized by:
  – fewer first-person singular pronouns (e.g., “I”, “me”, and “my”): writers disassociate themselves from their statements
  – more words reflecting negative emotion: liars feel guilty about lying or about the topic they are discussing
  – fewer "exclusive" words (e.g., “except”, “but”, “without”) and more action words (e.g., “walk”): due to reduced cognitive resources
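These cues can be counted with simple word lists, as sketched below. The tiny lists here are my own stand-ins for the LIWC-style categories the study relied on, and the per-100-words normalization is an assumption.

```python
import re

# Illustrative mini-lexicons (my stand-ins, not the study's lexicon).
FIRST_PERSON_SG = {"i", "me", "my", "mine", "myself"}
EXCLUSIVE = {"but", "except", "without"}
NEGATIVE_EMOTION = {"hate", "guilty", "worthless", "sad", "angry"}
MOTION = {"walk", "move", "go", "carry", "run"}


def newman_cues(text):
    """Rate (per 100 words) of the four cue types; deceptive text is
    predicted to score lower on first-person singular and exclusive
    words, higher on negative-emotion and motion words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    n = max(len(tokens), 1)
    rate = lambda ws: 100 * sum(t in ws for t in tokens) / n
    return {"first_person_sg": rate(FIRST_PERSON_SG),
            "negative_emotion": rate(NEGATIVE_EMOTION),
            "exclusive": rate(EXCLUSIVE),
            "motion": rate(MOTION)}
```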
Experiments: Five studies
• videotaped abortion attitudes
• typed abortion attitudes
• handwritten abortion attitudes
• feelings about friends
• mock crime
Experiments
• Trained on four studies and applied the "classifier" to the remaining study
• Accuracy: about 61%
• They found that the weights of these four word types were consistent with their assumptions.
(Zhou et al., 2004)
• Experiments:
  – students were asked to exchange emails about a desert survival task
  – students were asked to tell the truth or to lie
  – features: 27 linguistic cues
Hypothesis
• Deceptive senders display:
  – higher (a) quantity, (b) expressivity, (c) positive affect, (d) informality, (e) uncertainty, and (f) nonimmediacy, and
  – less (g) complexity, (h) diversity, and (i) specificity of language in their messages
  than truthful senders and than their respective receivers.
Linguistic cues
• quantity:
  – # of words
  – # of verbs
  – # of NPs
  – # of sentences
• expressivity:
  – # of adjectives/adverbs divided by # of nouns and verbs
Linguistic cues (cont)
• positive affect: expressions of positive emotion
• informality: # of misspelled words / # of words
• uncertainty:
  – # of modifiers (adjectives/adverbs)
  – # of modal verbs
  – # of uncertainty words
  – # of third-person pronouns
Linguistic cues (cont)
• nonimmediacy:
  – passive voice
  – generalizing terms
  – (fewer) self references
  – group references: first-person plural pronouns
Linguistic cues (cont)
• complexity:
  – average # of clauses per sentence
  – average sentence length
  – average word length
  – …
• diversity:
  – lexical diversity
  – content word diversity
  – redundancy
  – …
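Several of these surface cues need no POS tagger and can be computed directly, as in the sketch below. The function name and the simple tokenization are my own; cues like expressivity or # of NPs would additionally require a tagger or parser.

```python
import re


def surface_cues(text):
    """Compute tagger-free cues from the lists above: quantity (word and
    sentence counts), average sentence length, average word length, and
    lexical diversity (type/token ratio)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    n = max(len(words), 1)
    return {
        "num_words": len(words),
        "num_sentences": len(sentences),
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "avg_word_len": sum(len(w) for w in words) / n,
        "lexical_diversity": len({w.lower() for w in words}) / n,
    }
```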
Issues
• Different settings for deception could affect the cues (e.g., the length of the messages):
  – interviews
  – emails
  – blogs
  – lying spontaneously vs. being asked to lie
Hw2
• Your presentation
• Reading assignments
• Suggestions for others’ projects