liwc dictionary expansion

LIWC Dictionary Expansion Luiz Gustavo Ferraz Aoqui Social Computing Lab GSCT KAIST

Upload: luiz-aoqui

Post on 06-Jul-2015


DESCRIPTION

This presentation explains the research I did while working at the Social Computing Lab at KAIST. The main goal was to expand the LIWC vocabulary and adapt it for Twitter sentiment analysis. Download it to see the animations :)

TRANSCRIPT

Page 1: LIWC Dictionary Expansion

LIWC Dictionary Expansion

Luiz Gustavo Ferraz Aoqui

Social Computing Lab – GSCT – KAIST

Page 2: LIWC Dictionary Expansion

Motivation

• Dictionary-based classifiers have high precision

• But usually low recall

• Natural language is very dynamic

• New words appear

• Words change their meaning and sentiment

• Heaps' law: the vocabulary keeps growing as the corpus grows

• Hard to update the dictionary at the same speed

Page 3: LIWC Dictionary Expansion

LIWC Dictionary

• Fairly large dictionary

• Almost 4,500 words and stems

• 406 positive

• 499 negative

• Developing and updating it is a long process

• Almost exclusively done manually

• Requires a lot of human resources

• Last update was in 2007

• Twitter was launched in July, 2006

Page 4: LIWC Dictionary Expansion

System overview

19027743 1985381275 NULL NULL <d>2009-06-01 00:00:00</d> <s>web</s> <t>I think i'm gonna go with the magic in 6.... just cause now that bron bron's out i wanna see kobe lose too.</t> SeanBennettt 98 434 159 -18000 0 0 <n>Sean Bennett</n> <ud>2009-01-15 16:36:04</ud> <t>Eastern Time (US &amp; Canada)</t> <l>Long Island, NY</l>...

Positive:

.. :) :- ...... live tweet ;) .& -- =) everytime rain tweets (:

mj xd michael !!!!!! lil ." dog sun jus fan wit =] :] aww

album via luv photo ;- john pic different kno wearing

la ).

Negative:

!! :( ?? getting twitter omg ?! ppl :/ dude idk da

weather bout wtf iphone smh wat internet =( heat dnt

=/ facebook :| gosh kate :[ fml ima jon swear punch

text =[ cringe ): nd ** imma

Page 5: LIWC Dictionary Expansion

System overview

Page 6: LIWC Dictionary Expansion

System overview/Parser

19027743 1985381275 NULL NULL <d>2009-06-01 00:00:00</d> <s>web</s> <t>I think i'm gonna go with the magic in 6.... just cause now that bron bron's out i wanna see kobe lose too.</t> SeanBennettt 98 434 159 -18000 0 0 <n>Sean Bennett</n> <ud>2009-01-15 16:36:04</ud> <t>Eastern Time (US &amp; Canada)</t> <l>Long Island, NY</l>...

haha nooo! i just wanna kill mee!!!! i didn`t do my homework...and i feel sick =(

I can see the bus again. that makes me happy.

$$ Black Swan Fund Makes a Big Bet on Inflation wonder how Roubini feels about this...?

blahh, i feel boredd and tiredd as hell haha

jay to conan... upgrade. lc to kristin... downgrade. rushing home for lauren's final episode. my life makes me sad.

Page 7: LIWC Dictionary Expansion

Parser

[Diagram: parser pipeline]

• Data: Structured Text, Tweets, Clean Tweets

• Steps: Extract tweet (RegEx), Filter, Clean

• Sub-steps: Remove user name (RegEx), Remove URL (RegEx), Remove hash tag (RegEx)

Page 8: LIWC Dictionary Expansion

Parser

• Regular Expressions

• Very powerful tool for text processing…

• ..but very complex

• Ex.:

Input: <d>2009-06-01 00:00:00</d> <s>web</s> <t>I just reached level 2. #spymaster http://bit.ly/playspy</t> asmith393 1522 1498 207 -18000 0 0 <n>Adam Smith</n> <ud>2007-03-07 18:17:20</ud> <t>Eastern Time (US &amp; Canada)</t>

RegEx: <t>(.*?)</t>

Output: I just reached level 2. #spymaster http://bit.ly/playspy

Page 9: LIWC Dictionary Expansion

Parser

• Regular Expressions

• Very powerful tool for text processing…

• ..but very complex

• Ex.:

RegEx: #[0-9a-zA-Z+_]*

Input: I just reached level 2. #spymaster http://bit.ly/playspy

Output: I just reached level 2. http://bit.ly/playspy

Page 10: LIWC Dictionary Expansion

Parser

• Regular Expressions

• Very powerful tool for text processing…

• ..but very complex

• Ex.:

RegEx: ((http://|www.)([a-zA-Z0-9/.~])*)

Input: I just reached level 2. #spymaster http://bit.ly/playspy

Output: I just reached level 2. #spymaster
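
The parser steps from the last few slides can be sketched in Python roughly as follows. The three patterns for the tweet text, hash tags and URLs are copied from the slides; the user-name pattern `USER_RE` is an assumption, since the slides do not show it, and the function names are made up for illustration.

```python
import re

# Patterns quoted from the slides.
TWEET_RE = re.compile(r"<t>(.*?)</t>")
HASHTAG_RE = re.compile(r"#[0-9a-zA-Z+_]*")
# Note: the dot in "www." is unescaped in the slide, so it matches any character.
URL_RE = re.compile(r"((http://|www.)([a-zA-Z0-9/.~])*)")
# Assumption: the slides do not show the user-name pattern.
USER_RE = re.compile(r"@\w+")


def extract_tweet(record):
    """Return the text of the first <t>...</t> field, or None if there is none."""
    match = TWEET_RE.search(record)
    return match.group(1) if match else None


def clean_tweet(text):
    """Strip user names, URLs and hash tags, then collapse whitespace."""
    for pattern in (USER_RE, URL_RE, HASHTAG_RE):
        text = pattern.sub("", text)
    return " ".join(text.split())


record = (
    "<d>2009-06-01 00:00:00</d> <s>web</s> "
    "<t>I just reached level 2. #spymaster http://bit.ly/playspy</t> "
    "asmith393 1522 1498 207 -18000 0 0 <n>Adam Smith</n> "
    "<ud>2007-03-07 18:17:20</ud> <t>Eastern Time (US &amp; Canada)</t>"
)
tweet = extract_tweet(record)
print(tweet)
print(clean_tweet(tweet))
```

The record holds two `<t>` fields (tweet text and time zone), so the sketch relies on `search` returning the first, non-greedy match.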

Page 11: LIWC Dictionary Expansion

System overview/Master

haha nooo! i just wanna kill mee!!!! i didn`t do my homework...and i feel sick =(

I can see the bus again. that makes me happy.

$$ Black Swan Fund Makes a Big Bet on Inflation wonder how Roubini feels about this...?

blahh, i feel boredd and tiredd as hell haha

jay to conan... upgrade. lc to kristin... downgrade. rushing home for lauren's final episode. my life makes me sad.

Index Frequency Chunks Co-frequency

Page 12: LIWC Dictionary Expansion

Master

[Diagram: master pipeline]

• Stages: Splitter, Indexer, Mapper (workers M M M), Reducer (workers R R R), Sort

• Data: Tweets, Chunks, Index, Co-frequency, Unsorted Frequency, Frequency

Page 13: LIWC Dictionary Expansion

Master/Splitter

• Count the lines in the input file

• Select only tweets that contain words in the LIWC dictionary

• Split the input file into smaller chunks
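
A minimal sketch of the splitter, assuming the tweets are already in memory as a list of strings. The function names, the toy dictionary and the chunk size are hypothetical, and dictionary matching is simplified to exact, lower-cased words (stems are ignored here).

```python
def select_tweets(tweets, dictionary):
    """Keep tweets with at least one whitespace-separated word in the dictionary."""
    return [
        tweet for tweet in tweets
        if any(word in dictionary for word in tweet.lower().split())
    ]


def split_chunks(items, chunk_size):
    """Split a list into consecutive chunks of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


tweets = [
    "that makes me happy",
    "jay to conan",
    "i feel sick",
    "my life makes me sad",
]
toy_dictionary = {"happy", "sick", "sad"}  # stand-in for the LIWC word list

print(len(tweets))  # line count of the input
selected = select_tweets(tweets, toy_dictionary)
chunks = split_chunks(selected, 2)
print(chunks)
```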

Page 14: LIWC Dictionary Expansion

Master/Indexer

• Simply save the vocabulary to a file, sorted alphabetically

• Becomes important later

Page 15: LIWC Dictionary Expansion

Master/Mapper

• Spawn processes in parallel and divide the chunks among them

• Each worker does two jobs:

• First: create (word, frequency) pairs

Chunk → Worker → Frequency.tmp

someone 6, down 8, ever 10, kinda 2, crazy 14, …

Page 16: LIWC Dictionary Expansion

Master/Mapper

• Spawn processes in parallel and divide the chunks among them

• Each worker does two jobs:

• First: create (word, frequency) pairs

• Second: save the co-words for each word
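
The two jobs of a mapper worker might look like the sketch below. It is an illustration under assumptions: frequencies count every occurrence of a word, co-words are de-duplicated per tweet, splitting is plain whitespace splitting, and everything stays in memory instead of being written to files. The name `map_chunk` is made up.

```python
from collections import Counter, defaultdict


def map_chunk(chunk):
    """Return (word frequencies, co-words per word) for one chunk of tweets."""
    frequency = Counter()
    co_words = defaultdict(list)
    for tweet in chunk:
        words = tweet.split()
        frequency.update(words)          # job 1: (word, frequency) pairs
        unique = set(words)              # remove duplicates within the tweet
        for word in unique:              # job 2: save the co-words of each word
            co_words[word].extend(sorted(unique - {word}))
    return frequency, dict(co_words)


frequency, co_words = map_chunk(["i feel sick", "i feel happy happy"])
print(frequency)
print(co_words["i"])
```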

Page 17: LIWC Dictionary Expansion

Master/Mapper

[Diagram: worker processing one tweet]

Worker steps: Split Words → Remove Duplicates → Generate files → Save co-words

Example word: haha

Input: haha nooo! i just wanna kill mee!!!! i didn`t do my homework...and i feel sick =(

Split words: haha, nooo, !, i, just, wanna, kill, mee, !!!!, i, didn`t, do, my, homework, ..., and, i, feel, sick, =(

Page 18: LIWC Dictionary Expansion

Master/Mapper/Issues

• Splitting is not trivial

• Splitting on whitespace

• homework… ≠ homework

• Remove punctuation

• :) ☐

• Solution: RegEx again

• ([\w\-\'`]*)(\W*)

• File names:

• Unique, easy to find and respect OS rules

• Hash

• This is why the index file is important
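
The splitting pattern above can be tried out with a short sketch. The pattern is the one on the slide; how the two captured groups are turned into tokens (keeping punctuation runs as separate tokens and dropping whitespace) is an assumption, and `tokenize` is a hypothetical name.

```python
import re

# Pattern from the slide: a word-like run followed by a non-word run.
TOKEN_RE = re.compile(r"([\w\-\'`]*)(\W*)")


def tokenize(tweet):
    """Split a tweet into word tokens and punctuation/emoticon tokens."""
    tokens = []
    for word, rest in TOKEN_RE.findall(tweet):
        if word:
            tokens.append(word)
        # The non-word run may mix spaces and symbols, e.g. " =(".
        tokens.extend(rest.split())
    return tokens


tokens = tokenize(
    "haha nooo! i just wanna kill mee!!!! i didn`t do my "
    "homework...and i feel sick =("
)
print(tokens)
```

With this handling, "homework..." yields the tokens "homework" and "...", and the emoticon "=(" survives as a token of its own.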

Page 19: LIWC Dictionary Expansion

Master/Mapper/Issues

• Parallel programming in Python

• The original interpreter (CPython) doesn't run threads in parallel because of its global interpreter lock…

• Alternatives, such as Jython and IronPython, do

• …but it is still possible to work in parallel

• Multi-thread vs. multi-process

• Multi-process in Python

• multiprocessing module

• http://docs.python.org/library/multiprocessing.html#module-multiprocessing.pool
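
A minimal sketch of dividing chunks among worker processes with `multiprocessing.Pool`. The worker function, the number of processes and the sample chunks are assumptions for illustration, not the original code.

```python
from collections import Counter
from multiprocessing import Pool


def count_words(chunk):
    """Worker: count whitespace-separated words in one chunk of tweets."""
    return Counter(word for tweet in chunk for word in tweet.split())


def run_mappers(chunks, processes=2):
    """Spawn worker processes and hand each one chunks to count."""
    with Pool(processes=processes) as pool:
        return pool.map(count_words, chunks)


if __name__ == "__main__":
    chunks = [["i feel sick", "i feel happy"], ["my life makes me sad"]]
    for partial in run_mappers(chunks):
        print(partial.most_common(2))
```

The worker is a module-level function because `Pool.map` has to pickle it to send it to the child processes; the `__main__` guard keeps child processes from re-running the driver code on platforms that start them by spawning.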

Page 20: LIWC Dictionary Expansion

Master/Reducer

• Spawn processes in parallel and split the words among them

• Basically counts the mapper results

• Also, each worker does two jobs:

• First: sums all the (word, frequency) pairs and saves the result

frequency.tmp: car 4, house 2, ball 5, car 1, house 1

→ Reducer →

frequency.txt: car 5, house 3, ball 5

Page 21: LIWC Dictionary Expansion

Master/Reducer

• Spawn processes in parallel and split the words among them

• Basically counts the mapper results

• Also, each worker does two jobs:

• First: sums all the (word, frequency) pairs and saves the result

• Second: sums the co-occurrence frequency

trip (before): car 1, ball 3, car 2, house 1

→ Worker →

trip (after): car 3, ball 3, house 1
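
Both reducer jobs come down to summing counts per word. A minimal sketch, using the numbers from the two reducer slides and a hypothetical `reduce_pairs` helper that works on in-memory pairs instead of files:

```python
from collections import Counter


def reduce_pairs(pairs):
    """Sum the counts of repeated words in a list of (word, count) pairs."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals


# Job 1: partial frequencies written by the mappers.
frequency_tmp = [("car", 4), ("house", 2), ("ball", 5), ("car", 1), ("house", 1)]
frequency = reduce_pairs(frequency_tmp)
print(dict(frequency))

# Job 2: co-word counts collected for the word "trip".
trip_tmp = [("car", 1), ("ball", 3), ("car", 2), ("house", 1)]
trip = reduce_pairs(trip_tmp)
print(dict(trip))
```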

Page 22: LIWC Dictionary Expansion

Master/Reducer/Issues

• Index file

• Useful to access the files

• Each word has a file with a list of co-words

• But file name is hashed

• Non-invertible function

• Look up the word in the index, hash it, and open the resulting file
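
The index-plus-hash lookup could be sketched like this. The slides do not name the hash function or the file-name scheme, so SHA-256 and the ".txt" suffix are stand-ins, and all function names are hypothetical.

```python
import bisect
import hashlib


def hashed_file_name(word):
    """Map any token (even ':)' or '?!') to a file name that is safe on any OS."""
    # Assumption: SHA-256 stands in for whatever hash the original code used.
    return hashlib.sha256(word.encode("utf-8")).hexdigest() + ".txt"


def build_index(vocabulary):
    """The index is just the vocabulary, de-duplicated and sorted."""
    return sorted(set(vocabulary))


def co_word_file(word, index):
    """Return the co-word file for a word, or None if it is not in the index."""
    pos = bisect.bisect_left(index, word)
    if pos == len(index) or index[pos] != word:
        return None
    return hashed_file_name(word)


index = build_index(["sick", "=(", "haha", "sick"])
print(index)
print(co_word_file("haha", index))
print(co_word_file("unknown", index))
```

Because the hash cannot be inverted, the file names alone do not reveal which words exist; the sorted index is what makes it possible to enumerate the vocabulary and find each word's file.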

Page 23: LIWC Dictionary Expansion

Master/Sort

• Simply sort the frequencies file

• Most frequent first

Page 24: LIWC Dictionary Expansion

Classifier

[Diagram: classifier]

• Inputs: Frequency, Co-frequency

• Parameters: α, β, γ, δ, Max results

• Outputs: Scores, New words

Page 25: LIWC Dictionary Expansion

Classifier/Sentiment words

Frequency: Car 232, Ball 143, Street 125, House 121, Boat 114, Pencil 105, Pen 98, Computer 81

Top α%: Car, Ball, Street, House, Boat

Page 26: LIWC Dictionary Expansion

Classifier/Co-words

Top β% co-words for each word:

• Car: tire, door, engine

• Ball: court, play, game

• Street: name, size
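
The "top α%" and "top β%" selections could both be served by one small helper. This is a sketch under assumptions: the rounding rule (round up) is not stated on the slides, and the value 60 is a hypothetical α chosen only because it reproduces the five words shown on the frequency slide.

```python
import math


def top_percent(ranked, percent):
    """Keep the first `percent` percent of an already ranked list (rounded up)."""
    keep = math.ceil(len(ranked) * percent / 100)
    return ranked[:keep]


frequency = {
    "Car": 232, "Ball": 143, "Street": 125, "House": 121,
    "Boat": 114, "Pencil": 105, "Pen": 98, "Computer": 81,
}
# Most frequent first, as produced by the sort step.
ranked_words = sorted(frequency, key=frequency.get, reverse=True)
top_words = top_percent(ranked_words, 60)
print(top_words)
```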

Page 27: LIWC Dictionary Expansion

Classifier/Score

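[Diagram: scoring candidate words from co-word lists]

• Co-word lists: (tire, door, engine); (court, play, game); (door, size); (size, type, home, room); (size, door, price)

• Candidate words: engine, tire, door, size

• Score values shown: 1 0, 1 0, 2, 2, 1, 1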

Page 28: LIWC Dictionary Expansion

Classifier/Collapse

• Created to deal with problems like:

• :) :)) :), :).

• They should all be treated as the same token
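
One possible collapse rule, shown only as an illustration because the slides do not give the actual rule: squeeze repeated characters and drop trailing commas and periods. The `collapse` name is hypothetical.

```python
import re


def collapse(token):
    """Squeeze repeated characters and strip trailing ',' and '.'."""
    collapsed = re.sub(r"(.)\1+", r"\1", token)
    stripped = collapsed.rstrip(",.")
    # Keep tokens made only of punctuation, such as "...", from vanishing.
    return stripped or collapsed


for token in [":)", ":))", ":),", ":).", "feel"]:
    print(token, "->", collapse(token))
```

The last case hints at why this is harder for words: the same rule that merges ":))" into ":)" also turns "feel" into "fel".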

• Harder for words

Page 29: LIWC Dictionary Expansion

Classifier/New words

• Rules to compare the scores

• So far the rules are:

• If the positive score is greater than the negative score plus delta (δ), tag the word as positive

• Same idea for negative

• Returns the new words, up to a maximum number of results
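
The delta rule can be written down directly. In this sketch the scores are given as (positive, negative) pairs with made-up numbers; ranking candidates by the score margin before applying the maximum is an assumption, since the slides only say the result is capped.

```python
def classify(pos_score, neg_score, delta):
    """Apply the delta rule; return None when neither side wins clearly."""
    if pos_score > neg_score + delta:
        return "positive"
    if neg_score > pos_score + delta:
        return "negative"
    return None


def new_words(scores, delta, max_results):
    """Tag words from {word: (pos, neg)} and return at most max_results."""
    tagged = []
    for word, (pos, neg) in scores.items():
        label = classify(pos, neg, delta)
        if label is not None:
            tagged.append((abs(pos - neg), word, label))
    # Assumption: keep the words with the largest score margin first.
    tagged.sort(key=lambda item: item[0], reverse=True)
    return [(word, label) for _, word, label in tagged[:max_results]]


# Hypothetical scores, for illustration only.
scores = {"luv": (6, 1), "fml": (0, 4), "weather": (2, 2)}
print(new_words(scores, delta=1, max_results=10))
```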

Page 30: LIWC Dictionary Expansion

Other ideas

• WordNet based

• PMI similarity score
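
For reference, pointwise mutual information compares how often two words occur together with how often they would co-occur if they were independent: PMI(x, y) = log2(p(x, y) / (p(x) p(y))). A minimal sketch from raw counts, where the function name and the numbers are only for illustration:

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """PMI in bits, from co-occurrence and single-word counts over `total` tweets."""
    if count_xy == 0:
        return float("-inf")  # the words never co-occur
    return math.log2((count_xy * total) / (count_x * count_y))


# Independent words score 0; words that co-occur more than chance score above 0.
print(pmi(1, 10, 100, 1000))
print(pmi(10, 20, 50, 1000))
```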

Page 31: LIWC Dictionary Expansion

Evaluation

• Two evaluation methods:

• First method

• Find tweets that could not be categorized before but can be now

• Manually check the precision of the result

• Second method

• Manually select positive and negative tweets

• Compare the precision of the old dictionary with the new dictionary

Page 32: LIWC Dictionary Expansion

Sub-product

• LIWC Dictionary Library for Python

• Provides easy access to the dictionary information

• Easy search

• Reverse index

• Match wildcard

• Ex.:
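
The slide's own example is not part of this transcript. Purely as an illustration of the three features listed, and not as the library's actual API, a lookup with search, reverse index and wildcard matching could be sketched as below. The class name, method names and toy entries are all hypothetical; the only thing taken from LIWC is the convention that a trailing `*` marks a stem.

```python
from collections import defaultdict


class LiwcLookup:
    """Toy lookup over {pattern: [categories]}; a trailing '*' marks a stem."""

    def __init__(self, entries):
        self.exact = {}
        self.stems = {}
        self.by_category = defaultdict(list)  # reverse index: category -> patterns
        for pattern, categories in entries.items():
            if pattern.endswith("*"):
                self.stems[pattern[:-1]] = categories
            else:
                self.exact[pattern] = categories
            for category in categories:
                self.by_category[category].append(pattern)

    def categories(self, word):
        """Search: exact entries first, then any stem the word starts with."""
        if word in self.exact:
            return list(self.exact[word])
        found = []
        for stem, categories in self.stems.items():
            if word.startswith(stem):
                found.extend(categories)
        return found

    def patterns(self, category):
        """Reverse index: all dictionary patterns tagged with a category."""
        return sorted(self.by_category.get(category, []))


# Toy entries, not the real dictionary content.
lookup = LiwcLookup({"happy": ["positive"], "sad*": ["negative"], "hate*": ["negative"]})
print(lookup.categories("sadness"))
print(lookup.patterns("negative"))
```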

Page 33: LIWC Dictionary Expansion