LIWC Dictionary Expansion
DESCRIPTION
This presentation explains the research I did while working at the Social Computing Lab at KAIST. The main goal was to expand the LIWC vocabulary and adapt it for Twitter sentiment analysis. Download it to see the animations :)

TRANSCRIPT
LIWC Dictionary Expansion
Luiz Gustavo Ferraz Aoqui
Social Computing Lab – GSCT – KAIST
Motivation
• Dictionary-based classifiers have high precision
• But usually low recall
• Natural language is very dynamic
• New words appear
• Words change their meaning and sentiment
• Heaps’ law: the vocabulary keeps growing as the corpus grows
• Hard to update the dictionary at the same speed
LIWC Dictionary
• Fairly large dictionary
• Almost 4,500 words and stems
• 406 positive
• 499 negative
• Development and updating is a long process
• Almost exclusively done manually
• Requires a lot of human resources
• Last update was in 2007
• Twitter was launched in July, 2006
System overview

Raw tweet record (example):
19027743 1985381275 NULL NULL <d>2009-06-01 00:00:00</d> <s>web</s> <t>I think i'm gonna go with the magic in 6.... just cause now that bron bron's out i wanna see kobe lose too.</t> SeanBennettt 98 434 159 -18000 0 0 <n>Sean Bennett</n> <ud>2009-01-15 16:36:04</ud> <t>Eastern Time (US & Canada)</t> <l>Long Island, NY</l>...
Positive:
.. :) :- ...... live tweet ;) .& -- =) everytime rain tweets (:
mj xd michael !!!!!! lil ." dog sun jus fan wit =] :] aww
album via luv photo ;- john pic different kno wearing
la ).
Negative:
!! :( ?? getting twitter omg ?! ppl :/ dude idk da
weather bout wtf iphone smh wat internet =( heat dnt
=/ facebook :| gosh kate :[ fml ima jon swear punch
text =[ cringe ): nd ** imma
System overview
System overview / Parser

Raw tweet record (example):
19027743 1985381275 NULL NULL <d>2009-06-01 00:00:00</d> <s>web</s> <t>I think i'm gonna go with the magic in 6.... just cause now that bron bron's out i wanna see kobe lose too.</t> SeanBennettt 98 434 159 -18000 0 0 <n>Sean Bennett</n> <ud>2009-01-15 16:36:04</ud> <t>Eastern Time (US & Canada)</t> <l>Long Island, NY</l>...

Clean tweets (example output):
haha nooo! i just wanna kill mee!!!! i didn`t do my
homework...and i feel sick =(
I can see the bus again. that makes me happy.
$$ Black Swan Fund Makes a Big Bet on Inflation
wonder how Roubini feels about this...?
blahh, i feel boredd and tiredd as hell haha
jay to conan... upgrade. lc to kristin... downgrade.
rushing home for lauren's final episode. my life
makes me sad.
Parser pipeline:
Tweets (structured text)
→ Extract tweet (RegEx)
→ Filter: remove user name (RegEx), remove URL (RegEx), remove hash tag (RegEx)
→ Clean Tweets
Parser
• Regular Expressions
• Very powerful tool for text processing…
• …but very complex
• Ex.:
<d>2009-06-01 00:00:00</d>
<s>web</s> <t>I just reached level 2.
#spymaster http://bit.ly/playspy</t>
asmith393 1522 1498 207 -18000 0 0
<n>Adam Smith</n> <ud>2007-03-07
18:17:20</ud> <t>Eastern Time (US
& Canada)</t>
<t>(.*?)</t>
→ I just reached level 2. #spymaster http://bit.ly/playspy
Parser
• Regular Expressions
• Very powerful tool for text processing…
• …but very complex
• Ex.:
#[0-9a-zA-Z+_]*
Before: I just reached level 2. #spymaster http://bit.ly/playspy
After: I just reached level 2. http://bit.ly/playspy
Parser
• Regular Expressions
• Very powerful tool for text processing…
• …but very complex
• Ex.:
((http://|www.)([a-zA-Z0-9/.~])*)
Before: I just reached level 2. #spymaster http://bit.ly/playspy
After: I just reached level 2. #spymaster
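The three regexes shown on these slides can be combined into one cleaning step. A minimal sketch in Python (the project's language), assuming the tweet body sits in the first <t>…</t> field; the user-name removal pattern is not shown on the slides, so it is omitted here:

```python
import re

# The three regexes from the slides: tweet extraction, hash-tag removal, URL removal.
TWEET_RE = re.compile(r"<t>(.*?)</t>")
HASHTAG_RE = re.compile(r"#[0-9a-zA-Z+_]*")
URL_RE = re.compile(r"(http://|www\.)[a-zA-Z0-9/.~]*")

def clean_tweet(record):
    """Extract the tweet body, then strip hash tags and URLs."""
    match = TWEET_RE.search(record)
    if match is None:
        return ""
    text = HASHTAG_RE.sub("", match.group(1))
    text = URL_RE.sub("", text)
    return " ".join(text.split())  # collapse leftover whitespace

record = "<t>I just reached level 2. #spymaster http://bit.ly/playspy</t>"
print(clean_tweet(record))  # I just reached level 2.
```

The non-greedy `(.*?)` is what keeps the match from running past the first `</t>` into the time-zone field of the same record.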
System overview / Master

haha nooo! i just wanna kill mee!!!! i didn`t do my homework...and i feel sick =(
I can see the bus again. that makes me happy.
$$ Black Swan Fund Makes a Big Bet on Inflation wonder how Roubini feels about this...?
blahh, i feel boredd and tiredd as hell haha
jay to conan... upgrade. lc to kristin... downgrade. rushing home for lauren's final episode. my life makes me sad.
Master pipeline (outputs: Index, Frequency, Chunks, Co-frequency):
Tweets → Splitter → Chunks
Tweets → Indexer → Index
Chunks → Mapper (M M M, in parallel) → unsorted Frequency + Co-frequency
Mapper output → Reducer (R R R, in parallel) → Frequency + Co-frequency
Frequency → Sort
Master/Splitter
• Count the lines in the input file
• Select only tweets that contain words from the LIWC dictionary
• Split the input file in smaller chunks
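A minimal sketch of the Splitter in Python, assuming `liwc_words` is a set of dictionary words; the in-memory lists and chunk size are illustrative (the real system works on files):

```python
def split_tweets(tweets, liwc_words, chunk_size):
    """Keep only tweets that contain a LIWC word, then cut fixed-size chunks."""
    selected = [t for t in tweets
                if set(t.lower().split()) & liwc_words]
    for i in range(0, len(selected), chunk_size):
        yield selected[i:i + chunk_size]

tweets = ["i feel happy today", "bus route 42", "so sad and tired"]
chunks = list(split_tweets(tweets, {"happy", "sad", "tired"}, chunk_size=1))
# "bus route 42" contains no LIWC word and is dropped
```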
Master/Indexer
• Simply saves the vocabulary to a file, sorted alphabetically
• Important later, for looking up the hashed co-word files
Master/Mapper
• Spawn processes in parallel and divide the chunks among them
• Each worker does two jobs:
• First: create (word, frequency) pairs
Chunk → Worker → Frequency.tmp:
someone 6
down 8
ever 10
kinda 2
crazy 14
…
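A minimal sketch of this first job in Python, counting in memory instead of writing Frequency.tmp; plain whitespace splitting is used here for brevity (the real splitting regex appears in the Issues slide):

```python
from collections import Counter

def map_frequencies(chunk):
    """Mapper job 1: produce (word, frequency) pairs for one chunk of tweets."""
    counts = Counter()
    for tweet in chunk:
        counts.update(tweet.lower().split())
    return counts

freq = map_frequencies(["haha i feel sick", "i feel happy"])
print(freq.most_common(2))  # [('i', 2), ('feel', 2)]
```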
Master/Mapper
• Spawn processes in parallel and divide the chunks among them
• Each worker does two jobs:
• First: create (word, frequency) pairs
• Second: save the co-words for each word
haha → Worker
Master/Mapper
haha nooo! i just wanna kill
mee!!!! i didn`t do my
homework...and i feel sick =(
haha
nooo
!
i
just
wanna
kill
mee
!!!!
i
didn`t
do
my
homework
...
and
i
feel
sick
=(

Steps: Split Words → Remove Duplicates → Generate files → Save co-words
Master/Mapper/Issues
• Splitting is not trivial
• Splitting on whitespace alone is not enough
• homework… ≠ homework
• Removing all punctuation is not enough either
• :) would be lost
• Solution: RegEx again
• ([\w\-\'`]*)(\W*)
• File names:
• Unique, easy to find, and respecting OS rules
• Hash of the word
• This is why the index file is important
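A sketch of the splitting regex from the slide in Python; keeping the non-word group as its own token is what lets emoticons like =( survive:

```python
import re

# The splitting regex from the slide: a word part followed by a non-word part,
# so "homework..." splits into "homework" and "...".
TOKEN_RE = re.compile(r"([\w\-'`]*)(\W*)")

def split_words(text):
    tokens = []
    for word, punct in TOKEN_RE.findall(text):
        if word:
            tokens.append(word)
        if punct.strip():          # keep punctuation runs, drop plain spaces
            tokens.append(punct.strip())
    return tokens

print(split_words("homework...and i feel sick =("))
# ['homework', '...', 'and', 'i', 'feel', 'sick', '=(']
```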
Master/Mapper/Issues
• Parallel programming in Python
• The reference interpreter (CPython) doesn’t support true multi-threading, because of the GIL…
• Alternatives, such as Jython and IronPython, do
• …but it is still possible to work in parallel
• Multi-thread vs. multi-process
• Multi-process in Python: the multiprocessing module
• http://docs.python.org/library/multiprocessing.html#module-multiprocessing.pool
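A minimal sketch of spawning mapper workers with multiprocessing.Pool from the linked module; the worker function here is illustrative:

```python
from collections import Counter
from multiprocessing import Pool

def count_words(chunk):
    """Worker: count the words in one chunk of tweets."""
    counts = Counter()
    for tweet in chunk:
        counts.update(tweet.split())
    return counts

if __name__ == "__main__":
    chunks = [["i feel happy", "so sad"], ["i feel sick =("]]
    with Pool(processes=2) as pool:
        partials = pool.map(count_words, chunks)  # one Counter per chunk
    print(sum(partials, Counter()).most_common(3))
```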
Master/Reducer
• Spawn processes in parallel and split the words among them
• Basically counts the mapper results
• Also, each worker does two jobs:
• First: sums all the (word, frequency) pairs and save
frequency.tmp:
car 4
house 2
ball 5
car 1
house 1

→ Reducer →

frequency.txt:
car 5
house 3
ball 5
Master/Reducer
• Spawn processes in parallel and split the words among them
• Basically counts the mapper results
• Also, each worker does two jobs:
• First: sums all the (word, frequency) pairs and save
• Second: sums the co-occurrence frequency
trip (co-word counts in):
car 1
ball 3
car 2
house 1

→ Worker →

trip (co-word counts out):
car 3
ball 3
house 1
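Both reducer jobs come down to the same summation. A minimal sketch, using the frequency.tmp pairs from the slide:

```python
from collections import defaultdict

def reduce_pairs(pairs):
    """Reducer: sum the (word, frequency) pairs the mappers produced."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

tmp = [("car", 4), ("house", 2), ("ball", 5), ("car", 1), ("house", 1)]
print(reduce_pairs(tmp))  # {'car': 5, 'house': 3, 'ball': 5}
```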
Master/Reducer/Issues
• Index file
• Useful to access the files
• Each word has a file with a list of co-words
• But file name is hashed
• Non-invertible function
• Look the word up in the index, hash it, and open the matching file
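A sketch of the hashed file-name scheme; md5 is an assumption here, the slides only say the name is a hash:

```python
import hashlib

def coword_filename(word):
    """File name for a word's co-word list (hash function is an assumption)."""
    return hashlib.md5(word.encode("utf-8")).hexdigest() + ".txt"

# Hashing is non-invertible, so the index file is the only way back:
# look the word up in the index, re-hash it, and open its file.
index = ["car", "house", "trip"]
lookup = {w: coword_filename(w) for w in index}
print(lookup["trip"])
```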
Master/Sort
• Simply sort the frequencies file
• Most frequent first
Classifier
Inputs: Frequency, Co-frequency
Parameters: α, β, γ, δ, max results
Scores → New words
Classifier/Sentiment words
Frequency:
Car 232
Ball 143
Street 125
House 121
Boat 114
Pencil 105
Pen 98
Computer 81

Top α% → Car, Ball, Street, House, Boat
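A minimal sketch of the α selection; the α value below is illustrative:

```python
def top_percent(freqs, alpha):
    """Select the top α% most frequent words, most frequent first."""
    ranked = sorted(freqs, key=freqs.get, reverse=True)
    k = max(1, int(len(ranked) * alpha / 100))
    return ranked[:k]

freqs = {"Car": 232, "Ball": 143, "Street": 125, "House": 121,
         "Boat": 114, "Pencil": 105, "Pen": 98, "Computer": 81}
print(top_percent(freqs, alpha=62.5))  # the five most frequent words
```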
Classifier/Co-words
Top β% of co-words:
Car → tire, door, engine
Ball → court, play, game
Street → name, size
Classifier/Score
Positive co-word lists: (tire, door, engine) (court, play, game) (door, size)
Negative co-word lists: (size, type, home, room) (size, door, price)
Candidate co-words scored by (positive count, negative count):
engine 1 0
tire 1 0
door 2 1
size 1 2
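A minimal sketch of the scoring idea, assuming candidate co-words are scored by how often they appear in the co-word lists of known positive and negative words (the example lists mirror the slide):

```python
# Illustrative co-word lists of known positive and negative seed words.
positive_lists = [{"tire", "door", "engine"}, {"court", "play", "game"}, {"door", "size"}]
negative_lists = [{"size", "type", "home", "room"}, {"size", "door", "price"}]

def score(word):
    """Count in how many positive and negative co-word lists the word appears."""
    pos = sum(word in s for s in positive_lists)
    neg = sum(word in s for s in negative_lists)
    return pos, neg

print(score("door"))  # (2, 1)
print(score("size"))  # (1, 2)
```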
Classifier/Collapse
• Created to deal with problems like:
• :) :)) :), :).
• They should all be treated as the same token
• Harder to do for ordinary words
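A sketch of collapsing emoticon variants; the rule below (a smiley core with extra mouths or trailing punctuation) is an assumption, not the author's exact rule:

```python
import re

def collapse(token):
    """Map :) :)) :), :). and the like onto the single token ':)'."""
    if re.match(r"^:-?\)", token):
        return ":)"
    return token

print([collapse(t) for t in [":)", ":))", ":),", ":)."]])  # all become ':)'
```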
Classifier/New words
• Rules to compare the scores
• So far the rules are
• If the positive score is bigger than the negative score plus delta, tag the word as positive
• Same idea for negative
• Returns the new words up to a maximum value
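The rules above can be sketched as follows; the delta value and the scores are illustrative, and the maximum-results cut-off is omitted:

```python
def tag(pos, neg, delta=1):
    """Tag positive if pos beats neg by more than delta, and vice versa."""
    if pos > neg + delta:
        return "positive"
    if neg > pos + delta:
        return "negative"
    return None  # inconclusive: not added to the dictionary

scores = {"engine": (2, 0), "fml": (0, 3), "door": (2, 1)}
new_words = {w: tag(p, n) for w, (p, n) in scores.items() if tag(p, n)}
print(new_words)  # {'engine': 'positive', 'fml': 'negative'}
```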
Other ideas
• WordNet based
• PMI similarity score
Evaluation
• Two evaluation methods:
• First method
• Find tweets that could not be classified before but now can be
• Manually check the precision of the result
• Second method
• Manually select positive and negative tweets
• Compare the precision of the old dictionary with the new dictionary
Sub-product
• LIWC Dictionary Library for Python
• Provides easy access to the dictionary information
• Easy search
• Reverse index
• Match wildcard
• Ex.:
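The example itself is cut off in the transcript. A hedged sketch of what wildcard matching against LIWC stems could look like; the LiwcDictionary class and its API are hypothetical, not the actual library's interface:

```python
import fnmatch

class LiwcDictionary:
    """Hypothetical wrapper: LIWC stems like 'happ*' match any continuation."""

    def __init__(self, entries):
        # e.g. {"happ*": "positive", "sad": "negative"}
        self.entries = entries

    def match(self, word):
        """Return the categories of every entry whose pattern matches word."""
        return [cat for pat, cat in self.entries.items()
                if fnmatch.fnmatch(word, pat)]

liwc = LiwcDictionary({"happ*": "positive", "sad": "negative"})
print(liwc.match("happiness"))  # ['positive']
```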