bbcsoundindex
DESCRIPTION
Slides accompanying the VLDB Journal 2010 paper: Daniel Gruhl, Meenakshi Nagarajan, Jan Pieper, Christine Robson, Amit Sheth, "Multimodal Social Intelligence in a Real-Time Dashboard System", to appear in a special issue of the VLDB Journal on "Data Management and Mining for Social Networks and Social Media".
TRANSCRIPT
BBC SoundIndex: Pulse of the Online Music Populace
http://www.almaden.ibm.com/cs/projects/iis/sound/
Daniel Gruhl, Meenakshi Nagarajan, Jan Pieper, Christine Robson, Amit Sheth, "Multimodal Social Intelligence in a Real-Time Dashboard System", to appear in a special issue of the VLDB Journal on "Data Management and Mining for Social Networks and Social Media", 2010
The Vision: What is ‘really’ hot?
! Netizens do not always buy their music, let alone buy it in a CD store.
! Traditional sales figures are a poor indicator of music popularity.
BBC: Are online music communities good proxies for popular music listings?!
IBM: Well, let’s build and find out!
BBC SoundIndex - “A pioneering project to tap into the online buzz surrounding artists and songs, by leveraging several popular online sources”
http://www.almaden.ibm.com/cs/projects/iis/sound/
“one chart for everyone” is so old!
“Multimodal Social Intelligence in a Real-Time Dashboard System”, VLDB Journal 2010 Special Issue: Data Management and Mining for Social Networks and Social Media.
User metadata, Artist/Track metadata
Unstructured and structured attention metadata
Right data source, right crowd, timeliness of data..?
UIMA Analytics Environment
Album/Track identification, Sentiment identification, Spam and off-topic comments
Extracted concepts into explorable data structures
What are 18 year olds in London listening to?
Validating crowd-sourced preferences
Pulse of the ‘foodie’ populace! Where are 20-something year olds going? Why?
Imagine doing this for a local business!
Fig. 2 SoundIndex architecture. Data sources are ingested and, if necessary, transformed into structured data using the MusicBrainz RDF and data miners. The resulting structured data is stored in a database and periodically extracted to update the front end.
Is the radio still the most popular medium for music? In 2007 less than half of all teenagers purchased a CD [6], and with the advent of portable MP3 players, fewer people are listening to the radio.
With the rise of new ways in which communities are exposed to music, the BBC saw a need to rethink how popularity is measured. Could the wealth of information in online communities be leveraged by monitoring online public discussions, examining the volume and content of messages left for artists on their pages by fans, and looking at what music is being requested, traded and sold in digital environments? Furthermore, could the notion of “one chart for everyone” be replaced with a notion that different groups might wish to generate their own charts reflecting the popularity of people like themselves (as the Guardian [7] put it: “What do forty-something female electronica fans in the US rate?”)? Providing the data in a way that users can explore the data of interest to them was a critical goal of our project.
[6] NPD Group: Consumers Acquired More Music in 2007, But Spent Less
[7] http://www.guardian.co.uk/music/2008/apr/18/popandrock.netmusic
2.2 Basic Design
The basic design of the SoundIndex, which we will describe in more detail in the remainder of the paper, is shown in Figure 2. We looked at several multimodal sources of social data, including “discussion sites” such as MySpace, Bebo and the comments on YouTube, sources of “plays” such as YouTube videos and LastFM tracks, and purely structured data such as the number of sales from iTunes. We then cleaned this data using spam removal and off-topic detection, and performed analytics to spot songs, artists and sentiment expressions. The cleaned and annotated data is combined, adjudicating for different sources, and uploaded to a front end for use in analysis.
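To make the shape of this pipeline concrete, here is a minimal Python sketch; it is illustrative only, with made-up comment data and toy annotators standing in for the UIMA chain and the DB2 store:

```python
# Toy pipeline: ingest comments, run annotators (entities, sentiment, spam),
# and keep only clean, annotated records for aggregation. All names here
# (Comment, the annotator functions, the gazetteers) are illustrative.
from dataclasses import dataclass, field

@dataclass
class Comment:
    text: str
    age: int
    location: str
    entities: list = field(default_factory=list)
    sentiment: str | None = None
    is_spam: bool = False

def spot_entities(c: Comment, catalog: set[str]) -> Comment:
    c.entities = [t for t in catalog if t.lower() in c.text.lower()]
    return c

def tag_sentiment(c: Comment, positive: set[str]) -> Comment:
    c.sentiment = "pos" if any(w in c.text.lower() for w in positive) else None
    return c

def flag_spam(c: Comment, spam_phrases: set[str]) -> Comment:
    # Rule from the paper: a spam phrase alone marks spam, but a spotted
    # entity plus positive sentiment rescues the comment.
    hit = any(p in c.text.lower() for p in spam_phrases)
    c.is_spam = hit and not (c.entities and c.sentiment == "pos")
    return c

def pipeline(comments, catalog, positive, spam_phrases):
    for c in comments:
        c = flag_spam(tag_sentiment(spot_entities(c, catalog), positive),
                      spam_phrases)
        if not c.is_spam:
            yield c

clean = list(pipeline(
    [Comment("check us out!! add us!!", 17, "London"),
     Comment("your new song Umbrella is wicked", 18, "London")],
    catalog={"Umbrella"}, positive={"wicked", "love"},
    spam_phrases={"check us out", "add us"}))
print([c.text for c in clean])  # only the non-spam comment survives
```

Note the annotator order (entities, then sentiment, then spam) mirrors the paper's finding that the spam filter benefits from earlier annotations.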
2.3 Challenges in Continuous Operation
Real world events often impact systems attempting to use social data. Every time the news reports that MySpace has been the subject of a DDoS attack, one can be assured that those trying to pull its data are suffering as well. When sites report wildly inconsistent numbers for views or listens (e.g., the total number of “all time” listens today is less than the total number yesterday), it also causes problems for the kind of aggregation we wish to do.
Combine these unforeseeable events with the data scale, multiple daily updates to the front end, and running a distributed system with components on different continents, and it becomes clear why this kind of system is challenging to build and operate.
2.4 Testing and Validation
Is the social network you are looking at the right one for the questions you are asking? Is the data timely and relevant? Is there enough signal? All of these questions need to be asked about any data source, and even more
SoundIndex Architecture
UIMA Annotators
MySpace user comments (~ Twitter)
Named Entity Recognition
Sentiment Recognition
Spam Elimination
I heart your new song Umbrella..
madge..ur pics on celebration concert with jesus r awesome!
Challenges, intuitions, findings, results..
Recognizing Named Entities
Cultural entities, informal text, context-poor utterances, restricted to the music sense..
“Ohh these sour times... rock!”
Problem Definition: Semantic annotation of album/track names (using MusicBrainz) [at ISWC ‘09]
NER
Got to Be There; Ben; Music and Me; Forever, Michael; Off the Wall; Thriller; Bad; Dangerous
Spot and Disambiguate Paradigm
Generate candidate entities
“Thriller was my most fav MJ album”
“this is bad news, ill miss you MJ”
Disambiguate spots/mentions in context
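The spot-and-disambiguate paradigm can be sketched in a few lines of Python; the catalog, cue words, and context test below are toy stand-ins for the MusicBrainz data and the paper's actual disambiguation features:

```python
# Toy "spot, then disambiguate" sketch: a naive dictionary spotter proposes
# every catalog match, and a second pass keeps only spots with supporting
# music-domain context in the same comment.
import re

CATALOG = ["Thriller", "Bad", "Dangerous", "Off the Wall"]
MUSIC_CUES = {"album", "song", "track", "tune"}

def naive_spot(comment: str) -> list[str]:
    """Propose every catalog entry that literally appears in the comment."""
    return [e for e in CATALOG
            if re.search(rf"\b{re.escape(e)}\b", comment, re.IGNORECASE)]

def disambiguate(comment: str, spots: list[str]) -> list[str]:
    """Keep spots only if a music-domain cue occurs in the same comment."""
    words = set(comment.lower().split())
    return spots if words & MUSIC_CUES else []

for c in ["Thriller was my most fav MJ album",
          "this is bad news, ill miss you MJ"]:
    print(c, "->", disambiguate(c, naive_spot(c)))
# The first comment keeps "Thriller"; in the second, "bad" is spotted by
# the naive pass but dropped for lack of any music cue.
```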
Disambiguation Intensive Approach
Challenge 1: Multiple Senses, Same Domain
60 songs with Merry Christmas
3600 songs with Yesterday
195 releases of American Pie
31 artists covering American Pie
“Happy 25th! Loved your song Smile ..”
This new Merry Christmas tune.. SO GOOD!
Which ‘Merry Christmas’? ‘So Good’ is also a song!
Intuition: Scoped Graphs
Scoped relationship graphs, using cues from the content, webpage title, url..
Reduce potential entity spot size
Generate candidate entities
Disambiguate in context
What Content Cues to Exploit?
“I heart your new album Habits”
“Happy 25th lilly, love ur song smile”
“Congrats on your rising star award”..
Releases that are not new; artists who are at least 25; new careers..
Experimenting with Restrictions
Career Length, Album Release, No. of Albums, ...Specific Artist
Gold Truth Dataset: 1800+ spots in MySpace user comments from artist pages
Keep your SMILE on! (good spot, bad spot, inconclusive spot?)
4-way annotator agreement across spots:
Madonna 90% agreement
Rihanna 84% agreement
Lily Allen 53% agreement
Sample Restrictions, Spot Precision (3 artists, 1800+ spots)
[Figure: spot precision vs. percentage of the taxonomy used by the spotter, on log-log axes, for restrictions ranging from the entire MusicBrainz taxonomy down to the songs of a single artist.]
From all of MusicBrainz (281,890 artists, 6,220,519 tracks) to tracks of one artist
Spot precision closely follows the distribution of random restrictions, conforming loosely to a Zipf distribution
Choosing which constraints to implement is simple: pick the easiest first
Scoped Entity Lists
User comments are on MySpace artist pages
Restriction: Artist name
Assumption: no other artist/work mention
Madonna’s tracks
“this is bad news, ill miss you MJ”
The naive spotter has the advantage of spotting all possible mentions (modulo spelling errors), but generates several false positives
Challenge 2: Disambiguating Spots
Got your new album Smile. Loved it!
Keep your SMILE on!
Separating valid and invalid mentions of music named entities.
Intuition: Using Natural Language Cues
Generic syntactic, spot-level, domain features
1. Sentiment expressions: slang sentiment gazetteer using Urban Dictionary
2. Domain-specific terms: music, album, concert..
Got your new album Smile. Loved it!

Table 6. Features used by the SVM learner
Syntactic features (Notation-S)
+POS tag of s | s.POS
POS tag of one token before s | s.POSb
POS tag of one token after s | s.POSa
Typed dependency between s and sentiment word | s.POS-TDsent*
Typed dependency between s and domain-specific term | s.POS-TDdom*
Boolean typed dependency between s and sentiment | s.B-TDsent*
Boolean typed dependency between s and domain-specific term | s.B-TDdom*
Word-level features (Notation-W)
+Capitalization of spot s | s.allCaps
+Capitalization of first letter of s | s.firstCaps
+s in quotes | s.inQuotes
Domain-specific features (Notation-D)
Sentiment expression in the same sentence as s | s.Ssent
Sentiment expression elsewhere in the comment | s.Csent
Domain-related term in the same sentence as s | s.Sdom
Domain-related term elsewhere in the comment | s.Cdom
+ Refers to basic features; others are advanced features.
* These features apply only to one-word-long spots.

Table 7. Typed Dependencies Example
Valid spot: “Got your new album Smile. Simply loved it!” Encoding: nsubj(loved-8, Smile-5), implying that Smile is the nominal subject of the expression loved.
Invalid spot: “Keep your smile on. You’ll do great!” Encoding: no typed dependency between smile and great.

Typed Dependencies: We also captured the typed dependency paths (grammatical relations) via the s.POS-TDsent and s.POS-TDdom features. These were obtained between a spot and co-occurring sentiment and domain-specific words by the Stanford parser [12] (see example in Table 7). We also encode a boolean value indicating whether a relation was found at all, using the s.B-TDsent and s.B-TDdom features. This allows us to accommodate parse errors given the informal and often non-grammatical English in this corpus.

5.2 Data and Experiments
Our training and test data sets were obtained from the hand-tagged data (see Table 3). Positive and negative training examples were all spots that all four annotators had confirmed as valid or invalid respectively, for a total of 571 positive and 864 negative examples. Of these, we used 550 positive and 550 negative examples for training. The remaining spots were used for test purposes.
Our positive and negative test sets comprised all spots that three annotators had confirmed as valid or invalid spots, i.e. had 75% agreement. We also included spots where 50% of the annotators had agreement on the validity of the
Got your new album Smile. Loved it!
Binary Classification
SVM Binary Classifiers
Training Set:
Positive examples (+1): 550 valid spots
Negative examples (-1): 550 invalid spots
Test Set:
120 valid spots, 458 invalid spots
Got your new album Smile. Loved it!
Valid Music mention or not?
Keep ur SMILE on!
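A minimal sketch of this classifier setup follows, assuming scikit-learn and a handful of Table 6-style word-level and domain features; the gazetteers and training examples below are invented for illustration, not the paper's data:

```python
# Toy SVM spot classifier: encode each (comment, spot) pair with a few
# word-level and domain features from Table 6 and train a binary SVM.
from sklearn.svm import SVC

SENTIMENT = {"loved", "love", "awesome", "wicked"}
DOMAIN = {"album", "song", "track", "music", "tune"}

def features(comment: str, spot: str) -> list[int]:
    words = comment.lower().replace(".", " ").split()
    return [
        int(spot.isupper()),                # s.allCaps
        int(spot[0].isupper()),             # s.firstCaps
        int(f'"{spot}"' in comment),        # s.inQuotes
        int(bool(SENTIMENT & set(words))),  # sentiment in comment (~ s.Csent)
        int(bool(DOMAIN & set(words))),     # domain term in comment (~ s.Cdom)
    ]

train = [("Got your new album Smile. Loved it!", "Smile", 1),
         ("Keep your SMILE on!", "SMILE", 0),
         ("this new tune Umbrella is awesome", "Umbrella", 1),
         ("have a great day", "day", 0)]
X = [features(c, s) for c, s, _ in train]
y = [label for _, _, label in train]

clf = SVC(kernel="linear").fit(X, y)
print(clf.predict([features("ur song Smile is wicked", "Smile")]))  # -> [1]
```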
Efficacy of Features
- Feature combinations were the most stable and best performing
- Gazetteer-matched domain words and sentiment expressions proved to be useful
- PR tradeoffs: choose feature combinations depending on the end application
[Chart: precision-recall pairs for feature settings ranging from precision-intensive to recall-intensive: 78-50, 42-91, 90-35]
Identified 90% of valid spots; eliminated 35% of invalid spots
How did we do overall?
[Figure: classifier accuracy at varying valid/invalid splits; precision for Madonna, Rihanna and Lily Allen, and recall, compared against the naive spotter baseline.]
Step 1: Spot with naive spotter, restricted knowledge base
Madonna’s track spots: 23% precision
Step 2: Disambiguate using NL features (SVM classifier)
Madonna’s track spots: ~60% precision
42-91: “All features” setting
Lessons Learned..
Using domain knowledge, especially for cultural entities, is non-trivial
Computationally intensive NLP is prohibitive
Two-stage approach: NL learners over dictionary-based naive spotters
- allows more time-intensive NL analytics to run on less than the full set of input data
- control over precision and recall of the final result
Sentiment Expressions
NER allows us to track and trend online mentions
Popularity assessment: sentiment associated with the target entity
Exploratory project: what kinds of sentiments? What distribution of positive and negative expressions?
Observations, Challenges
Negations, sarcasm, refutations are rare
Short sentences: OK to overlook target attachment
Target demographic: teenagers!
Slang expressions: wicked, tight, “the shit”, tropical, bad..
What are the common positive and negative expressions this demographic uses?
Urbandictionary.com to the rescue!
Urbandictionary.com
Related tags: bad appears with terrible and good!
Glossary
Slang Dictionary!
Mining a Slang Sentiment Dictionary
Map frequently used sentiment expressions to their orientations
Seed positive- and negative-oriented dictionary entries: good, terrible..
Step 1: For each seed, query UD and get related tags
Good -> awesome, sweet, fun, bad, rad..
Terrible -> bad, horrible, awful, shit..
Step 2: Calculate the semantic orientation of each related tag [Turney]
SO(rad) = PMI_UD(rad, “good”) - PMI_UD(rad, “terrible”)
PMI over the Web vs. UD: SO_web(rock) = -0.112, SO_UD(rock) = 0.513
Step 3: Record orientation; add to dictionary: {good, rad}, {terrible}
Continue for other related tags and all new entries..
Mined dictionary: 300+ entries (manually verified for noise)
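The Turney-style bootstrapping loop above can be sketched as follows; the co-occurrence counts are invented, and PMI here is computed from toy frequencies rather than live Urban Dictionary related-tag queries:

```python
# Toy semantic-orientation bootstrap: SO(term) = PMI(term, "good") -
# PMI(term, "terrible"), with terms adopted into the positive or negative
# bucket depending on the sign. All counts below are made up.
import math

# co[x][y]: co-occurrence count of term x with seed y in related-tag lists
co = {"rad":  {"good": 40, "terrible": 2},
      "shit": {"good": 5,  "terrible": 30}}
freq = {"good": 1000, "terrible": 400, "rad": 120, "shit": 300}
N = 10_000  # notional corpus size

def pmi(x: str, y: str) -> float:
    """Pointwise mutual information from the toy counts."""
    return math.log2((co[x][y] / N) / ((freq[x] / N) * (freq[y] / N)))

def semantic_orientation(term: str) -> float:
    return pmi(term, "good") - pmi(term, "terrible")

dictionary = {"positive": {"good"}, "negative": {"terrible"}}
for term in ("rad", "shit"):
    so = semantic_orientation(term)
    dictionary["positive" if so > 0 else "negative"].add(term)
    print(f"SO({term}) = {so:+.2f}")
print(dictionary)  # "rad" lands in positive, "shit" in negative
```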
Sentiment Expressions in Text
“Your new album is wicked”
Shallow NL parse: verbs, adjectives [Hatzivassiloglou 97]
Your/PRP$ new/JJ album/NN is/VBZ wicked/JJ
Look up the word in the mined dictionary, record its orientation
Why don’t we just spot using the dictionary!?
fuuunn! (coverage issues)
“super man is in town”; “i heard amy at tropical cafe yday..”
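A minimal sketch of this tag-then-lookup step, assuming NLTK and its tagger models are installed; the mined dictionary here is a three-entry stand-in:

```python
# Toy sentiment spotter: POS-tag a comment, keep adjectives and verbs,
# and read their orientation from the mined slang dictionary.
import nltk  # requires the punkt and averaged_perceptron_tagger models

MINED = {"wicked": "positive", "tight": "positive", "terrible": "negative"}

def sentiment_spots(comment: str):
    tokens = nltk.word_tokenize(comment)
    tagged = nltk.pos_tag(tokens)  # e.g. [('Your','PRP$'), ('wicked','JJ')]
    return [(w, MINED[w.lower()])
            for w, tag in tagged
            if tag.startswith(("JJ", "VB")) and w.lower() in MINED]

print(sentiment_spots("Your new album is wicked"))
# -> [('wicked', 'positive')]
```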
Dictionary Coverage Issues
Resort to the corpus (transliterations)
Co-occurrence in corpus (tight : top-scoring dictionary entries)
“tight-awesome”: 456, “tight-sweet”: 136, “tight-hot”: 429
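One plausible way to implement the transliteration and corpus-support checks is sketched below; the letter-collapsing heuristic and the support threshold are illustrative choices, not the paper's exact method, and the counts are invented:

```python
# Toy coverage helpers: map a stretched word like "fuuunn" onto a known
# dictionary entry, and adopt an unknown slang word if it co-occurs
# strongly with top-scoring dictionary entries in the corpus.
import re

MINED = {"fun", "sweet", "awesome", "hot"}

def candidates(word: str) -> set[str]:
    """Possible normalizations of a stretched word."""
    w = word.lower()
    return {re.sub(r"(.)\1+", r"\1", w),       # fuuunn -> fun
            re.sub(r"(.)\1{2,}", r"\1\1", w)}  # sweeeet -> sweet

def is_transliteration(word: str) -> bool:
    return bool(candidates(word) & MINED)

# corpus co-occurrence counts of a candidate with dictionary entries
cooc = {"tight": {"awesome": 456, "sweet": 136, "hot": 429}}
SUPPORT = 300  # minimum total co-occurrence to adopt a new word

def has_corpus_support(word: str) -> bool:
    return sum(cooc.get(word, {}).values()) >= SUPPORT

print(is_transliteration("fuuunn"))  # True: collapses to "fun"
print(has_corpus_support("tight"))   # True: 1021 total co-occurrences
```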
Miscellaneous
Presence of an entity improves confidence in the identified sentiment
Short sentences, high coherence: associate sentiment with all entities
If no entity is spotted, associate with the artist whose page we are on
“you are awesome!” on Madonna’s page
Evaluations - Mined Dictionary + Identifying Orientation
Negative sentiments: slang orientations out of context!
“you were terrible last night at SNL but I still <3 you!”
Precision excluding corpus transliterations: incorrect NL parses, selectivity

presence and orientation of sentiment expressions. Tunable cut-off thresholds for annotators were determined based on experiments. Table 3 shows the accuracy of the annotator and illustrates the importance of using transliterations in such corpora.

Table 3: Transliteration accuracy impact
Annotator Type | Precision | Recall
Positive Sentiment | 0.81 | 0.9
Negative Sentiment | 0.5 | 1.0
PS excluding transliterations | 0.84 | 0.67

These results indicate that the syntax and semantics of sentiment expression in informal text is difficult to determine. Words that were incorrectly identified as sentiment bearing, because of incorrect parses due to sentence structure, resulted in inaccurate transliterations which contributed to low precision, especially in the case of the Negative Sentiment annotator. We also experimented with advanced NL features using dependency parses of comments, but found them to be expensive and minimally effective due to poor sentence constructions.

Low recall was consistently traced back to one of the following reasons: a failure to catch sentiment indicators in a sentence (inaccurate NL parses), or a word that was not included in our mined dictionary, either because it did not appear in UD or because it had insufficient support in the corpus for the transliteration (e.g., ‘funnn’). Evaluating the annotator without the mined dictionary, i.e., only with the corpus support, significantly reduced the recall (owing to sparse co-occurrences in the corpus). However, it also slightly improved the precision, indicating the need for more selective transliterations. This speaks to our method for mining a dictionary of transliterations, which does not take into account the context of the sentiment expression in the comment, an important near-term investigation for this work. Empirical analysis also revealed that the use of negation words such as ‘not’ and ‘never’ was rare in this data and therefore ignored at this point. However, as this may not be true for other data sources, this is an important area of future work.

7.3 Dealing with Spam
Like many online data sets today, this corpus suffers from a fair amount of spam: off-topic comments that are often a kind of advertising. A preliminary analysis shows that for some artists more than half of the comment postings are spam (see Table 4). This level of noise could significantly impact the data analysis and ordering of artists if it is not accounted for.

Table 4: Percentage of total comments that are spam for several popular artists
Gorillaz | 54%
Coldplay | 42%
Lily Allen | 40%
Keane | 40%
Placebo | 39%
Amy Winehouse | 38%
Lady Sovereign | 37%
Joss Stone | 36%

A particular type of spam that is irrelevant to our popularity-gauging exercise is comments unrelated to the artist’s work, for example those making references to an artist’s personal life. Eliminating such comments and only counting those relevant to an artist or their work is an important goal in generating ranked artist lists.

Since many of the features in such irrelevant content overlap with the ones of interest to us, classifying spam comments using only spam phrases or features was not very effective. However, a manual examination of a random set of comments yielded useful cues to use in eliminating spam:
1. The majority of spam comments were related to the domain, had the same buzz words as many non-spam comments, and were often less than 300 words long.
2. Like any auto-generated content, there were several patterns in the corpus indicative of spam. Some examples include “Buy our cd”, “Come check us out..”, etc.
3. Comments often had spam and appreciative content in the same sentence, which implied that a spam annotator would benefit from being aware of the previous annotation results.
4. Empirical observations also suggested that the presence of sentiment is pivotal in distinguishing spam content. Figure 8 illustrates the difference in distribution of sentiments in spam and non-spam content.

These observations also speak to the order in which our annotators are applied. We first perform named entity spotting, followed by the sentiment annotator and finally the spam filter.

7.3.1 Spotting Spam: Approach and Findings
Our approach to identifying spam (including comments irrelevant to an artist’s work) differs from past work, because the small size of individual comments and the use of slang posed new challenges. Typical content-based techniques work by testing content on patterns or regular expressions (Mason, 2002) that are indicative of
Crucial Finding
Sentiment expression frequencies in 60k comments over 26 weeks (-ve vs. +ve): < 4%
Forget about the sentiment annotator! Just use entity mention volumes!
Spam, Off-topic, Promotions
Special type of spam: unrelated to artists’ work
Paul McCartney’s divorce; Rihanna’s abuse; Madge and Jesus
Self-promotions: “check out my new cool sounding tracks..”
music domain, similar keywords, harder to tell apart
Standard spam: “Buy cheap cellphones here..”
Observations
60k comments, 26 weeks
Common 4-grams pulled out several ‘spammy’ phrases
Phrases -> spam and non-spam buckets
Fig. 8 Examples of sentiment in spam and non-spam comments.
SPAM (80% have 0 sentiments):
CHECK US OUT!!! ADD US!!!
PLZ ADD ME!
IF YOU LIKE THESE GUYS ADD US!!!
NON-SPAM (50% have at least 3 sentiments):
Your music is really bangin! You’re a genius! Keep droppin bombs!
u doin it up 4 real. i really love the album. keep doin wat u do best. u r so bad!
hey just hittin you up showin love to one of chi-town’s own. MADD LOVE.
spam, or by training Bayesian models over spam and non-spam content (Blosser and Josephsen, 2004). Recent investigations on removing spam in blogs use similar statistical techniques with good results (Thomason, 2007). These techniques were largely ineffective on our corpus because comments are rather short (1 or 2 sentences), share similar buzz words with non-spam content, are poorly formed, and contain frequent variations of word/slang usage. Our approach to filtering spam is an aggregate function that uses a finite set of mined spam patterns from the corpus and other non-spam content, such as artist names and sentiments, that are spotted in a comment.
Our spam annotator builds off a manually assembled seed of 45 phrases found in the corpus that are indicative of spam content. This seed was assembled by empirical analysis of frequent 4-grams in comments that are indicative of spam content. First, the algorithm spots these possible spam phrases and their variations in text by sliding a window of words over the user comment and computing the string similarity between the spam phrase and the window of words. These spots, along with a set of rules over the results of the previous entity and sentiment annotators, are used to decide if a posting is spam. An example of such a rule would be that if a spam phrase, an artist/track name and a positive sentiment were spotted, the comment is probably not spam. By looking for previously spotted meaningful entities we ensure that we only discard spam comments that make no mention of an artist or their work.
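A minimal sketch of this aggregate rule follows, with an invented seed list, difflib string similarity standing in for the paper's matcher, and a toy veto rule over prior annotations:

```python
# Toy spam annotator: fuzzy-match seed spam phrases with a sliding word
# window, then let previously spotted entities and positive sentiment
# veto the spam flag.
from difflib import SequenceMatcher

SPAM_SEEDS = ["buy our cd", "come check us out", "add us"]

def has_spam_phrase(comment: str, threshold: float = 0.8) -> bool:
    words = comment.lower().split()
    for seed in SPAM_SEEDS:
        n = len(seed.split())
        for i in range(len(words) - n + 1):
            window = " ".join(words[i:i + n])
            if SequenceMatcher(None, seed, window).ratio() >= threshold:
                return True
    return False

def is_spam(comment: str, entities: list[str], sentiment: str | None) -> bool:
    # Rule from the paper: spam phrase + spotted artist/track + positive
    # sentiment probably means a genuine fan comment, so keep it.
    if not has_spam_phrase(comment):
        return False
    return not (entities and sentiment == "pos")

print(is_spam("come check us outt!!", [], None))                          # True
print(is_spam("add us if u like umbrella, love it", ["Umbrella"], "pos")) # False
```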
Table 5 shows the accuracy of the spam and non-spam annotators for the hand-tagged comments. Our analysis indicates that lowered precision or recall in the spam annotator was a direct consequence of deficiencies in the preceding named entity and sentiment annotators. For example, in cases where the comment did not have a spam pattern from our manually assembled list, and the first annotator spotted incorrect tracks, the spam annotator interpreted the comment to be related to music and classified it as non-spam. Other cases included more clever promotional comments that included the actual artist tracks, genuine sentiments and very limited spam content (e.g., “like umbrella ull love this song. . . ”). As is evident, the amount of information available in a comment (one or two sentences), in addition to poor grammar, necessitates more sophisticated techniques for spam identification. This is an important focus of our ongoing research.

Table 5: Spam annotator performance
Annotator Type | Precision | Recall
Spam | 0.76 | 0.8
Non-Spam | 0.83 | 0.88
8 Generation of the Hypercube
Many modern Business Intelligence systems are designed to support exploration of the underlying dimensions of “facts” and the correlations and relationships between them. This can be modeled as a high-dimensionality data hypercube (also known as an OLAP cube (Codd et al, 1993)). The system stores this in a relational database (e.g., DB2) to allow the analyst to explore the relative importance of various dimensions for the task at hand (for the BBC, the popularity of musical topics). The dimensions of the cube are generated in two ways: from the structured data (e.g., age, gender, user location, timestamp, artist), and from the measurements generated by a chain of annotator methods that annotate artist, entity and sentiment mentions in each comment [16]. The resulting tuple is then placed into a star schema in which the primary measure is a relevance with regard to musical topics. This is equivalent to defining a function

M : (Age, Gender, Location, Time, Artist, ...) → M

In other words, we define a multidimensional grid where the dimensions are quantities of interest. What those are depends on the domain; for music, they are values such as age, gender, etc. We “fill in” each point in the grid with the number of occurrences of non-spam comments made by a poster of that age, gender and location posting at that time about that artist, song,
[16] Due to the scale of data being examined, these annotations are generated automatically by the UIMA analysis chain. While the results are spot-checked by a human periodically to assure language drift has not occurred, in general they run unsupervised.
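"Filling in" the grid can be sketched with a counter keyed by dimension tuples; this toy stands in for the star schema rows the paper stores in DB2, and the comment rows are invented:

```python
# Toy cube fill: count non-spam comments at each point
# (age, gender, location, week, artist).
from collections import Counter

cube: Counter = Counter()

comments = [  # (age, gender, location, week, artist), non-spam only
    (19, "M", "New York City", 38, "Rihanna"),
    (19, "M", "New York City", 38, "Rihanna"),
    (18, "F", "London", 38, "Lily Allen"),
]
for point in comments:
    cube[point] += 1  # M(Age, Gender, Location, Time, Artist) += 1

print(cube[(19, "M", "New York City", 38, "Rihanna")])  # 2
```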
The spam annotator should be aware of other annotator results!
Spam Elimination
Aggregate function:
- Phrases indicative of spam (mined from frequent 4-grams)
- Rules over previous annotator results: if a spam phrase, an artist/track name and a positive sentiment were spotted, the comment is probably not spam.
Performance
Directly proportional to the quality of previous annotator results (incorrect entity, sentiment spots)
Some self-promotions are just clever! “like umbrella, ull love this song. . . ”
Given the amount of effort involved, the obvious question arises: why filter spam? For our end goal of using comment counts to lead to positions on the list, filtering spam is key. This is corroborated by the volume of spam and non-spam content observed over a period of 26 weeks for 8 artists; see the percentages in Table 4 above. Some artists had at least half as many spam as non-spam comments on their page. This level of noise would significantly impact the ordering of artists if it were not accounted for.
4.3 Generation of the Hypercube
We use a data hypercube (also known as an OLAP cube [5]) stored in a DB2 database to explore the relative importance of various dimensions to the popularity of musical topics. The dimensions of the cube are generated in two ways: from the structured data in the posting (e.g., age, gender, location of the user commenting, timestamp of the comment, artist), and from the measurements generated by the above annotator methods. This annotates each comment with a series of tags from unstructured and structured data. The resulting tuple is then placed into a star schema in which the primary measure is a relevance with regard to musical topics. This is equivalent to defining a function

M : (Age, Gender, Location, Time, Artist, ...) → M    (1)

In our case, we have stored the aggregation of occurrences of non-spam comments at the intersecting dimension values of the hypercube. Storing the data this way makes it easy to examine rankings over various time intervals, weight various dimensions differently, etc. Once (and if) a total ordering approach is fixed, this intermediate data staging step might be eliminated.

4.4 Projecting to a List
Ultimately we are seeking to generate a “one-dimensional” ordered list from the various contributing dimensions of the cube. In general we project to a one-dimensional ranking which is then used to sort the artists, tracks, etc. We can aggregate and analyze the hypercube using a variety of multidimensional data operations on it to derive what are essentially custom popular lists for particular musical topics. In addition to the traditional billboard “Top Artist” lists, we can slice and project (marginalize dimensions) the cube for lists such as “What is hot in New York City for 19 year old males?” and “Who are the most popular artists from San Francisco?” These translate to the following mathematical operations:

L1(X) = Σ_{T,...} M(A = 19, G = M, L = “New York City”, T, X, ...)    (2)

L2(X) = Σ_{T,A,G,...} M(A, G, L = “San Francisco”, X, ...)    (3)

where
X = Name of the artist
T = Timestamp
A = Age of the commenter
G = Gender
L = Location

Note that for the remainder of this paper we aggregate tracks and albums to artists, as we wanted as many comments as possible for our experiments. Clearly track-based projections are equally possible and the rule for our ongoing work.
This tremendous flexibility is one advantage of the cube approach.
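The two projections can be sketched with pandas as a stand-in for the DB2 roll-ups; the cube rows below are invented:

```python
# Toy projection: marginalize unwanted dimensions of the cube by summing
# comment counts, then sort to get a one-dimensional popularity list.
import pandas as pd

cube = pd.DataFrame([
    # age, gender, location, week, artist, non-spam comment count
    (19, "M", "New York City", 38, "Rihanna", 120),
    (19, "M", "New York City", 39, "Rihanna", 90),
    (19, "M", "New York City", 38, "Keane", 40),
    (23, "F", "San Francisco", 38, "Lily Allen", 75),
], columns=["age", "gender", "location", "week", "artist", "count"])

# L1(X): what is hot in New York City for 19 year old males? (sum over time)
l1 = (cube.query("age == 19 and gender == 'M' and location == 'New York City'")
          .groupby("artist")["count"].sum().sort_values(ascending=False))
print(l1)  # Rihanna 210, Keane 40

# L2(X): most popular artists from San Francisco (sum over time, age, gender)
l2 = (cube.query("location == 'San Francisco'")
          .groupby("artist")["count"].sum().sort_values(ascending=False))
print(l2)
```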
5. EXPERIMENTS
To test the effectiveness of our popularity ranking system we conducted a series of experiments. We prepared a new top-N list of popular music to contrast with the most recent Billboard list. To validate the accuracy of our lists, we then conducted a study.

5.1 Generating our Top-N List
We started with the top-50 artists in Billboard’s singles chart during the week of September 22nd through 28th, 2007. If an artist had multiple singles in the chart and appeared multiple times, we only kept the highest-ranked single to ensure a unique list of artists. MySpace pages of the 45 unique artists were crawled, and all comments posted in the corresponding week were collected.
We loaded the comments into DB2 as described in Section 4.1. The crawled comments were passed through the three annotators to remove spam and identify sentiments. The tables below show statistics on the crawling and annotation processes.

Table 6: Crawl Data
Number of unique artists | 45
Total number of comments collected | 50489
Total number of unique posters | 33414
Do not Confuse Activity with Popularity!
Artist popularity rankings changed dramatically after excluding spam!
% spam in 60k comments over 26 weeks (spam vs. non-spam)
presence and orientation of sentiment expressions. Tun-able cut-off thresholds for annotators were determinedbased on experiments. Table 3 shows the accuracy ofthe annotator and illustrates the importance of usingtransliterations in such corpora.
Annotator Type Precision RecallPositive Sentiment 0.81 0.9Negative Sentiment 0.5 1.0
PS excluding transliterations 0.84 0.67
Table 3 Transliteration accuracy impact
These results indicate that the syntax and seman-tics of sentiment expression in informal text is difficultto determine. Words that were incorrectly identified assentiment bearing, because of incorrect parses due tosentence structure, resulted in inaccurate translitera-tions which contributed to low precision, especially inthe case of the Negative Sentiment annotator. We alsoexperimented with advanced NL features using depen-dency parses of comments, but found them to be ex-pensive and minimally effective due to poor sentenceconstructions.
Low recall was consistently traced back to one ofthe following reasons: A failure to catch sentiment in-dicators in a sentence (inaccurate NL parses) or a wordthat was not included in our mined dictionary, eitherbecause it did not appear in UD or because it hadinsufficient support in the corpus for the translitera-tion (e.g., ‘funnn’). Evaluating the annotator withoutthe mined dictionary, i.e., only with the corpus sup-port, significantly reduced the recall (owing to sparseco-occurrences in the corpus). However, it also slightlyimproved the precision, indicating the need for more se-lective transliterations. This speaks to our method formining a dictionary of transliterations that does nottake into account the context of the sentiment expres-sion in the comment – an important near-term inves-tigation for this work. Empirical analysis also revealedthat the use of negation words such as ‘not’, ‘never’was rare in this data and therefore ignored at this point.However, as this may not be true for other data sources,this is an important area of future work.
7.3 Dealing with Spam
Like many online data sets today, this corpus suffersfrom a fair amount of spam — off-topic comments thatare often a kind of advertising. A preliminary analy-sis shows that for some artists more than half of thecomment postings are spam (see Table 4). This level of
noise could significantly impact the data analysis andordering of artists if it is not accounted for.
Gorillaz 54% Placebo 39%Coldplay 42% Amy Winehouse 38%Lily Allen 40% Lady Sovereign 37%Keane 40% Joss Stone 36%
Table 4 Percentage of total comments that are spam for severalpopular artists.
A particular type of spam that is irrelevant to our popularity-gauging exercise is comments unrelated to the artist's work, for example those referencing an artist's personal life. Eliminating such comments and counting only those relevant to an artist or their work is an important goal in generating ranked artist lists.
Since many features of such irrelevant content overlap with those of interest to us, classifying spam comments using spam phrases or features alone was not very effective. However, a manual examination of a random set of comments yielded useful cues for eliminating spam:
1. The majority of spam comments were related to the domain, used the same buzzwords as many non-spam comments, and were often less than 300 words long.
2. Like any auto-generated content, the corpus contained several patterns indicative of spam, e.g., "Buy our cd" and "Come check us out..".
3. Comments often mixed spam and appreciative content in the same sentence, implying that a spam annotator would benefit from being aware of previous annotation results.
4. Empirical observations also suggested that the presence of sentiment is pivotal in distinguishing spam content. Figure 8 illustrates the difference in the distribution of sentiments in spam and non-spam content.
These observations also dictate the order in which our annotators are applied: we first perform named-entity spotting, then sentiment annotation, and finally the spam filter. A minimal sketch of this ordering appears below.
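The sketch below illustrates the three-stage ordering under stated assumptions: the pattern list, sentiment cue set, and decision rule are hypothetical stand-ins, not the UIMA annotators the system actually runs.

```python
import re

# Hypothetical spam patterns modeled on observation 2 above; the real
# annotator uses a richer feature set.
SPAM_PATTERNS = [re.compile(p, re.IGNORECASE)
                 for p in (r"buy our cd", r"come check us out")]
POSITIVE_WORDS = {"love", "like", "great", "fun"}  # toy sentiment cues

def spot_entities(comment, known_tracks):
    """Stage 1: naive track spotting by case-insensitive substring match."""
    low = comment.lower()
    return [t for t in known_tracks if t.lower() in low]

def annotate_sentiment(comment):
    """Stage 2: toy sentiment score (count of positive cue words)."""
    return sum(1 for w in re.findall(r"[a-z']+", comment.lower())
               if w in POSITIVE_WORDS)

def is_spam(comment, entities, sentiment_score):
    """Stage 3: spam filter that sees the earlier annotations
    (observation 3). Explicit spam patterns trump other evidence;
    absent sentiment plus absent track mentions (observation 4) is
    weak evidence of spam in this toy rule."""
    if any(p.search(comment) for p in SPAM_PATTERNS):
        return True
    return sentiment_score == 0 and not entities

comment = "like umbrella ull love this song... come check us out"
tracks = spot_entities(comment, ["Umbrella"])
score = annotate_sentiment(comment)
print(is_spam(comment, tracks, score))  # True: the spam pattern dominates
```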
7.3.1 Spotting Spam: Approach and Findings
Our approach to identifying spam (including comments irrelevant to an artist's work) differs from past work because the small size of individual comments and the use of slang posed new challenges. Typical content-based techniques work by testing content against patterns or regular expressions (Mason, 2002) that are indicative of spam.
Workflow
Unstructured chatter to explorable hypercube
Hypercube, List Projection
Exploring dimensions of popularity
Data Hypercube (DB2) from
structured data in posts (user demographics, location, age, gender, timestamp)
unstructured data (entity, sentiment, spam flag)
Intersection of dimensions => non-spam comment counts
Hypercube to One-Dimensional List
Roll-ups
What is hot in New York for 19 year old males?
Annotator Type  Precision  Recall
Spam            0.76       0.80
Non-Spam        0.83       0.88
Table 4: Spam annotator performance
When the comment did not have a spam pattern and the first annotator spotted incorrect tracks, the spam annotator interpreted the comment as related to music and classified it as non-spam. Other cases included cleverer promotional comments that mixed actual artist tracks, genuine sentiment, and very limited spam content (e.g., 'like umbrella ull love this song...'). Evidently, the limited information available, combined with grammatically poor sentences, calls for more sophisticated spam-identification techniques.
Given the effort involved, an obvious question arises: why filter spam at all? Because our end goal is to turn comment counts into list positions, filtering spam is key. This is corroborated by the volume of spam and non-spam content observed over a period of 26 weeks for 8 artists; see Table 5. Some artists had at least half as many spam as non-spam comments on their page, a level of noise that would significantly distort the ordering of artists if left unaccounted for.
Gorillaz    54%    Placebo         39%
Coldplay    42%    Amy Winehouse   38%
Lily Allen  40%    Lady Sovereign  37%
Keane       40%    Joss Stone      36%
Table 5: Percentage of total comments that are spam for several popular artists.
4.3 Generation of the hypercube
We use a data hypercube (also known as an OLAP cube [5]) stored in a DB2 database to explore the relative importance of various dimensions to the popularity of musical topics. The dimensions of the cube are generated in two ways: from the structured data in the posting (e.g., age, gender, and location of the commenting user, timestamp of the comment, artist), and from the measurements generated by the annotators described above. Each comment is thus annotated with a series of tags from unstructured and structured data. The resulting tuple is placed into a star schema in which the primary measure is relevance with regard to musical topics. This is equivalent to defining a function
M : (Age, Gender, Location, Time, Artist, ...) → M    (1)
In our case, we store the aggregated count of non-spam comments at each intersection of dimension values in the hypercube. Storing the data this way makes it easy to examine rankings over various time intervals, weight dimensions differently, and so on. Once (and if) a total ordering approach is fixed, this intermediate data-staging step might be eliminated.
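As a rough sketch of this staging step, assuming an in-memory counter in place of the DB2 star schema: each annotated comment contributes one count to its (age, gender, location, week, artist) cell, and spam comments contribute nothing. The field names here are illustrative assumptions, not the system's actual schema.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class AnnotatedComment:
    # Structured post metadata plus annotator outputs (field names
    # are assumptions for illustration).
    age: int
    gender: str
    location: str
    week: str        # coarsened timestamp
    artist: str
    is_spam: bool    # from the spam annotator
    sentiment: str   # from the sentiment annotator

def build_cube(comments):
    """Store the non-spam comment count at each intersection of
    dimension values, mirroring the star schema's measure M."""
    cube = Counter()
    for c in comments:
        if not c.is_spam:
            cube[(c.age, c.gender, c.location, c.week, c.artist)] += 1
    return cube

cube = build_cube([
    AnnotatedComment(19, "M", "New York City", "2007-W39", "Rihanna", False, "positive"),
    AnnotatedComment(19, "M", "New York City", "2007-W39", "Rihanna", True, "none"),
])
print(cube)  # only the non-spam comment is counted
```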
4.4 Projecting to a list
Ultimately we seek to generate a "one-dimensional" ordered list from the various contributing dimensions of the cube. In general we project to a one-dimensional ranking, which is then used to sort the artists, tracks, etc. We can aggregate and analyze the hypercube using a variety of multi-dimensional operations to derive what are essentially custom popularity lists for particular musical topics. In addition to traditional Billboard-style "Top Artist" lists, we can slice and project (marginalize dimensions of) the cube to answer questions such as "What is hot in New York City for 19-year-old males?" and "Who are the most popular artists from San Francisco?" These translate to the following mathematical operations:
L1(X) = \sum_{T,...} M(A = 19, G = M, L = "New York City", T, X, ...)    (2)

L2(X) = \sum_{T,A,G,...} M(A, G, L = "San Francisco", X, ...)    (3)

where
X = name of the artist
T = timestamp
A = age of the commenter
G = gender
L = location
Note that for the remainder of this paper we aggregate tracks and albums up to artists, as we wanted as many comments as possible for our experiments. Track-based projections are equally possible and are the rule in our ongoing work. This flexibility is one advantage of the cube approach; a roll-up sketch appears below.
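The following is a sketch of the L1 roll-up under the same toy-cube assumption as the previous sketch: fix A, G, and L, marginalize T, and rank artists by the summed measure. It illustrates the projection, not the production DB2 query.

```python
from collections import Counter

def project_l1(cube, age=19, gender="M", location="New York City", top_n=10):
    """L1(X): fix A, G, L; sum the measure over T; rank artists."""
    totals = Counter()
    for (a, g, loc, week, artist), count in cube.items():
        if (a, g, loc) == (age, gender, location):
            totals[artist] += count  # marginalizes the time dimension
    return totals.most_common(top_n)

# Toy cube keyed by (age, gender, location, week, artist) with non-spam
# comment counts as the measure.
cube = {
    (19, "M", "New York City", "2007-W38", "Rihanna"): 3,
    (19, "M", "New York City", "2007-W39", "Soulja Boy"): 5,
    (22, "F", "San Francisco", "2007-W38", "Keane"): 2,
}
print(project_l1(cube))  # [('Soulja Boy', 5), ('Rihanna', 3)]
```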
5. EXPERIMENTS
To test the effectiveness of our popularity ranking system, we conducted a series of experiments. We prepared a new top-N list of popular music to contrast with the most recent Billboard list. To validate the accuracy of our lists, we then conducted a study.
5.1 Generating our Top-N list
We started with the top 50 artists in Billboard's singles chart for the week of September 22-28, 2007. If an artist appeared multiple times with different singles, we kept only the highest-ranked single so as to obtain a unique list of artists. We crawled the MySpace pages of the resulting 45 unique artists and collected all comments posted during that week.
We loaded the comments into DB2 as described in Section
4.1. The crawled comments were passed through the three
annotators to remove spam and identify sentiments. The
tables below show statistics on the crawling and annotation
processes.
Number of unique artists              45
Total number of comments collected    50,489
Total number of unique posters        33,414
Table 6: Crawl Data
The Word on the Street
Billboard's Top 50 Singles chart during the week of Sept 22-28, 2007
(Unique) Top 45 MySpace artist pages
Crawl, Annotate, Build Hypercube, Project lists
38% of total comments were spam
61% of total comments had positive sentiments
4% of total comments had negative sentiments
35% of total comments had no identifiable sentiments
Table 7: Annotation Statistics
As described in Section 8, the structured metadata (artist name, timestamp, etc.) and annotation results (spam/non-spam, sentiment, etc.) were loaded into the hypercube.
Each cell of the cube holds the number of comments for a given artist. The dimensionality of the cube depends on which variables we examine in our experiments. Timestamp, age and gender of the poster, geography, and other factors are all dimensions in the hypercube, in addition to the measures derived from the annotators (spam, non-spam, number of positive sentiments, etc.).
For the purposes of creating a top-N list, all dimensions except artist name are collapsed. The cube is then sliced along the spam axis (to retain only non-spam comments) and the comment counts are projected onto the artist-name axis. Since the percentage of negative comments was very small (4%), the top-N list was prepared by sorting artists by the number of non-spam comments they received, independent of sentiment scoring; a sketch of this collapse follows.
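Under the same toy-cube assumptions as the sketches in Sections 4.3 and 4.4, the full top-N collapse marginalizes every dimension except artist (spam comments never entered the cube) and sorts by count. A hedged illustration, not the production query:

```python
from collections import Counter

def top_n_artists(cube, n=10):
    """Collapse every dimension except artist; rank by non-spam count."""
    totals = Counter()
    for (*_dims, artist), count in cube.items():
        totals[artist] += count
    return totals.most_common(n)

cube = {  # (age, gender, location, week, artist) -> non-spam count
    (19, "M", "New York City", "2007-W38", "Soulja Boy"): 7,
    (22, "F", "San Francisco", "2007-W38", "T.I."): 9,
    (17, "F", "London", "2007-W38", "Soulja Boy"): 4,
}
print(top_n_artists(cube))  # [('Soulja Boy', 11), ('T.I.', 9)]
```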
In Table 8 we show the top 10 most popular Billboard artists alongside the list generated by our analysis of MySpace for the week of the survey. While some top artists appear on both lists (e.g., Soulja Boy, Timbaland, 50 Cent, and Pink), there are important differences. In some cases, our MySpace Analysis list identified rising artists before they reached the Top 10 on Billboard (e.g., Fall Out Boy and Alicia Keys both climbed to #1 on Billboard.com shortly after we produced these lists). Overall, the Billboard.com list contains more artists with a long history and a large body of work (e.g., Kanye West, Fergie, Nickelback), whereas our MySpace Analysis list is more likely to identify "up and coming" artists. This is consistent with our expectations, particularly in light of the aforementioned industry reports indicating that teenagers are the biggest music influencers (Mediamark, 2004).

Billboard.com     MySpace Analysis
Soulja Boy        T.I.
Kanye West        Soulja Boy
Timbaland         Fall Out Boy
Fergie            Rihanna
J. Holiday        Keyshia Cole
50 Cent           Avril Lavigne
Keyshia Cole      Timbaland
Nickelback        Pink
Pink              50 Cent
Colbie Caillat    Alicia Keys
Table 8: Billboard's Top Artists vs. our generated list
11.1.2 The Word on the Street
Using the above lists, we performed a casual preference poll of 74 people in the target demographic. We conducted the survey among students of an after-school program (Group 1), Wright State (Group 2), and Carnegie Mellon (Group 3). Group 1 comprised respondents between the ages of 8 and 15, while Groups 2 and 3 primarily comprised college students in the 17-22 age group. Table 9 shows statistics for the three survey groups.

Groups and Age Range    Male respondents    Female respondents
Group 1 (8-15)          8                   9
Group 2 (17-22)         21                  26
Group 3 (17-22)         7                   3
Table 9: Survey Group Statistics
The survey was conducted as follows: the 74 respondents were asked to study the two lists shown in Table 8, one generated by Billboard and the other from the crawl of MySpace. They were then asked: "Which list more accurately reflects the artists that were more popular last week?" Each response was recorded along with the respondent's age, gender, and reason for preferring a list.
The sources used to prepare the lists were not shown to the respondents, so they would not be influenced by the popularity of MySpace or Billboard. In addition, we periodically switched the order of the two lists during the study to avoid any bias based on which list was presented first.
11.1.3 Results
The raw results of our study immediately suggest the validity of the system, as seen in Table 10. The list generated from MySpace data is preferred over the Billboard list by more than 2 to 1 among our 74 test subjects, and the preference is consistently in favor of our list across all three survey groups.
More precisely, 68.9 ± 5.4% of subjects preferred the SI-derived list to the Billboard list. Looking specifically at Group 1, the youngest survey group (ages 8-15), our list is even more successful. Even with a smaller sample group (resulting in
* Top artists appear in both lists; several overlaps
* Artists with a long history/body of work vs. 'up and coming' artists
* Predictive power of MySpace: Billboard next week looked a lot like MySpace this week..
* Teenagers are big music influencers [MediaMark2004]
Showing Top 10
Casual Preference Poll Results
"Which list more accurately reflects the artists that were more popular last week?"
74 participants
Overall 2:1 preference for MySpace list
Younger age groups: 6:1 (8-15 yrs)
Challenging traditional polling methods!
Lessons Learned, Miles to go
Informal (teen-authored) text is not your average blog content..
Quality checking happens not at a day-to-day, per-comment precision level but at the system level: are we missing a source or artist, is the crawl behaving, what adjudication techniques to use across multiple data sources..?
Leveraging SI for BI is difficult: validation is key, with continuous fine-tuning alongside experts