extracting relevant & trustworthy information from microblogs
DESCRIPTION
Extracting Relevant & Trustworthy Information from Microblogs. Joint work with Bimal Viswanath, Farshad Kooti, Saptarshi Ghosh, Naveen Sharma, Niloy Ganguly, Fabricio Benevenuto MPI-SWS, Germany; IIT Kharagpur, India; UFOP, Brazil. My research: The big picture. - PowerPoint PPT PresentationTRANSCRIPT
Extracting Relevant & Trustworthy Information from Microblogs
Joint work with Bimal Viswanath, Farshad Kooti, Saptarshi Ghosh, Naveen Sharma, Niloy Ganguly,Fabricio Benevenuto
MPI-SWS, Germany; IIT Kharagpur, India; UFOP, Brazil
My research: The big pictureThree fundamental trends & challenges in social Web
1. User-generated content sharing can we protect privacy of users sharing personal data?
2. Word-of-mouth based content exchange can we understand & leverage word-of-mouth better??
3. Crowd-sourcing content rating and ranking can we find trustworthy & relevant content sources?
Twitter microblogging site An important source for real-time Web content
500 million active users posting 400 million tweets daily
Quality of tweets / content vary widely Any one can post tweets
Celebrities, politicians, news media, academics, spammers
Challenge: Finding relevant & trustworthy content Trustworthy: Thwart spammers and their spam Relevance: Identify authoritative experts on specific
topics
Thwarting Spammers in Twitter[WWW 2012]
Part 1
Background: How spammers operate Twitter spammers try to gain lots of followers
To promote spam directly To gain influence in the network
Search engines rank tweets based on how influential the user is Most metrics depend on user’s network
connectivity More followers help a user to gain influenceIncentivizes spammers to acquire links to gain influence
Acquiring followers via link farming Unrelated users exchange links with each
other To gain more influence based on network
connectivity
Alice
Bob
Charlie
David
Influence based on Influence based on connectivity is improvedconnectivity is improved
To thwart spammers We need to
1. Understand link farming activity in Twitter 2. Combat link farming activity in Twitter
Prior works: Focused on detecting spammers Via their characteristics, e.g., follower to following
ratios Rat-race between spammers and spam fighters
We focus on the spammer support network
Identifying spammers Used Twitter network gathered from previous
study [ICWSM’10]
Data collected in August 2009 54M nodes, 1.9B links, 1.7B Tweets
Identified accounts suspended by Twitter Account could be suspended for various reasons
Found suspended users that posted blacklisted URLs Includes 41,352 such spammers
Spammers farm links at large-scale Spam-targets: 27% of all users followed by at
least one of ~40,000 spammers!
Spam-followers: 82% of all followers have been targeted
Spammers have more followers than random users Avg follower count for Spammers: 234, Random
users: 36
Who responds to links from spammers? Small number of followers respond most of
the time
Top 100k followers exhibit high Top 100k followers exhibit high reciprocation of 0.8 on avg.reciprocation of 0.8 on avg.
Top 100k users account for Top 100k users account for 60% of all links to spammers60% of all links to spammers
We call these users We call these users link farmerslink farmers
Are link farmers real users or spammers?To find out if they are spammers or real users, we
1. Checked if they were suspended by Twitter 76% users not suspended, 235 of them verified by Twitter
2. Manually verified 100 random users 86% users are real with legitimate links in their Tweets
3. Analyzed their profiles More active in updating their profiles than random users
Are link farmers lay or popular users?
Conventional wisdom: Lay users more likely to follow back due to social etiquette Popular users might be more conservative in following others
Probability increases Probability increases with user popularitywith user popularity
Link farmers are popular users with lots of followers
Are link farmers lay or popular users? Top 5 link farmers according to Pagerank:
1. Barack Obama: Obama 2012 campaign staff 2. Britney Spears 3. NPR Politics: Political coverage and conversation 4. UK Prime Minister: PM’s office 5: JetBlue Airways
Link farmers include legitimate, popular users & organizations
What possibly motivates link farmers? One explanation:
Link farmers have similar incentives as spammers They seek to amass social capital & influence in
the network
Link farmers rank among top 5% influential Twitter users In terms of various metrics like Pagerank &
Followerrank
Combating link farming Key challenge:
Real, popular and active users are involved in link farming
Detecting and suspending spammers alone will not help
Insight: Discourage users from following others carelessly Penalize users following anyone found to be bad
Lower the influence scores of users following spammersIncentivizes users to be more careful about who they link to
Collusionrank
Borrows ideas from spam defense strategies for Web [WWW’05]
Low Collusionrank score for a user indicates heavy linking to spammers or spam-followers
Requires a seed set of known spammers Twitter operator periodically identifies and updates
spammers
Collusionrank
Algorithm:1. Negatively bias the initial scores to the set of spammers
2. In Pagerank style, iteratively penalize userswho follow spammers or those who follow spam-followers
Collusionrank is based on the score of followings of a user
Because user is penalized based on who he follows
Evaluating Collusionrank Goal:
To penalize spammers and spam-followers Should not penalize users who are not following
spammers
Used a small subset of 600 spammers as seed set
Compare ranks between Pagerank Pagerank + Collusionrank
Measures influence after accounting for link farming activity
Effect of Collusionrank on spammers
40% of spammers 40% of spammers appear in top 20% appear in top 20% according to according to PagerankPagerank
Most of the Most of the spammers get spammers get pushed to last 10% pushed to last 10% positions based on positions based on CollusionrankCollusionrank
Effect on link farmers
87% of link farmers in top 87% of link farmers in top 2% users according to 2% users according to Pagerank Pagerank
98% of the link 98% of the link farmers get pushed to farmers get pushed to last 10% positions last 10% positions based on based on CollusionrankCollusionrank
Effect on normal users Focus on top 100,000 users according to
Pagerank Analyze the percentile difference in ranks between
Pagerank (P) & Pagerank + Collusionrank (PC) Percentile Difference = ( |PC-P|/N ) x 100
Only 20% of users Only 20% of users get demoted get demoted heavilyheavily
Heavily demoted Heavily demoted users follow many users follow many more spammers more spammers than othersthan others
Collusion rank selectively filters out spammers and spam-followers
Summary: Thwarting spammers Spammers infiltrate the Twitter network by farming links
Link farming helps them gain influence to promote spam Search involves ranking users based on connectivity & influence
Analyzed link farming in Twitter by studying spammers Top link farmers are real, active and popular users
Proposed an algorithm Collusionrank to limit link farming Incentivizes users to be careful about who they connect with
Finding Topic Experts in Twitter[WOSN 2012] [SIGIR 2012]
Part 2
Topic experts in Twitter Twitter is now an important source of current
news 500 million users post 400 million tweets daily
Quality of tweets posted by different users vary widely News, pointless babble, conversational tweets, spam,
…
Challenge: to find topic experts Sources of authoritative information on specific topics
Identifying topic experts in Twitter Existing approaches
Research studies: Pal [WSDM 11], Weng [WSDM 10] Application systems: Twitter Who-To-Follow, Wefollow,
…
Existing approaches primarily rely on information provided by the user herself Bio, contents of tweets, network features e.g. #followers
We rely on “wisdom of the Twitter crowd” How do others describe a user?
Twitter Lists A feature to organize tweets received from
the people whom a user is following
Create a List, add name & description, add Twitter users to the list List meta-data offers cues for who-is-who
Tweets from all listed users will be available as a separate List stream
Mining Lists to infer expertise Collect Lists containing a given
user U
Identify U’s topics from List meta-data Basic NLP techniques Extract nouns and adjectives
Extracted words collected to obtain a topic document for user
[movies tv hollywood stars entertainment celebrity hollywood …]
Lists vs. other features
Fallon, happy, love, fun, video, song, game, hope, #fjoln, #fallonmono
Most common words from tweets
celeb, funny, humor, music, movies, laugh, comics, television, entertainers
Most common words from Lists
Profile bio
Dataset Collected Lists of 55 million Twitter users who
joined before or in 2009
Our analysis infers topics for 1.3 million users who are included in 10 or more Lists
Evaluating inference quality
Quality metrics Is the inference accurate? Is the inference informative?
Evaluation of popular users Celebrities, News media sources, US Senators
Using user feedback
Popular users set 1: Celebrities
Biographical Tags
Topics of ExpertisePopular
Perception
government, president, USA,
democratpolitics, government
celebs, leader, famous, current
events
sports, cyclist, athlete
tdf, triathlon, cancercelebs, influential, famous, inspiration
The inferred attributes accurately capture Biographical information Topics of expertise Popular perception about the user
Popular users set 2: News media sources The inferred attributes indicate
Primary topics of the media source Perceived political bias (Verified using ADA scores)
MediaBiographical
TagsTopics of Expertise
Popular Perception
CNNmedia, journalist,
bloggerspolitics, sports, tech,
weather, currentinfluential outlets
The Nationmedia, journalist, magazines, blogs
politics, governmentprogressive,
liberal
Townhall.com
media, bloggers, commentary,
journalistspolitics
conservative, republican
GuardianFilm
journalists, reviews
movies, cinema, actors, theatre,
hollywoodfilm critics
Popular users set 3: US Senators
Out of the 100 US senators, 84 have Twitter accounts
The inferred attributes correctly infer Their political party The state represented by them Their gender
‘Female’ or ‘Women’ for all 15 female senators Their political ideology
progressive/liberal/conservative/tea-party The senate committees to which they belong
Popular users set 3: US Senators
Biographical TagsSenate
CommitteesPerception
Chuck Grassley
politics, senator, republican, iowa,
gop
health, food, agriculture
conservative
Claire McCaskill
politics, democrats, missouri, women
tech, security, power, health,
commerce
progressive, liberal
Jim Inhofepolitics, congress,
oklahoma, republican
army, energy, climate, foreign
conservative
John Kerrypolitics, senate,
democrats, bostonhealth, climate, tech progressive
User feedback
Accurate
Informative
Total Evaluations
345 342
Response: Yes 274 277
Response: No 18 20
Can’t tell 53 45
Ignoring can’t tell responses,
• Accuracy – 94 %
• Informative – 93 %
Evaluating inference coverage
What fraction of Twitter can our method of inference be applied to?
A large fraction of popular Twitter users are covered
Evaluating inference coverage We could also infer attributes of less popular users
6% of users with Follower Ranks between 1 and 10 Million They are often experts on niche topics
User: Twitter bioFollowers
Listed
Inferred Attributes
spacespin: news on robotic space exploration
56 11science, space exploration, nasa, astronomy, planets
laithm: Al-jazeera network battle cameraman
201 16jounalists, photographer, al-jazeera, media
HumphreysLab: Stem Cell, Regenrative Biology of Kidney
119 17science, stem cell, genetics, cancer, physicians, biotech, nephrologist
Cognos Search system for topic experts in Twitter
Given a query (topic) Identify experts on the topic using Lists Rank identified experts
Ranking experts Used a ranking scheme solely based on Lists
Two components of ranking user U w.r.t. query Q Relevance of user to query – cover density ranking
between topic document TU of user and Q Popularity of user – number of Lists including the
user
Topic relevance(TU, Q) × log(#Lists including U)
Cognos results for “stem cell”
Cognos: http://twitter-app.mpi-sws.org/whom-to-follow/
Evaluation of Cognos
System deployed and evaluated ‘in-the-wild’
Evaluators were students & researchers from the three home institutes of authors
Cognos: http://twitter-app.mpi-sws.org/whom-to-follow/
Cognos vs. Twitter Who-To-Follow
Cognos: http://twitter-app.mpi-sws.org/whom-to-follow/
Cognos vs. Twitter Who-To-Follow Considering 27 distinct queries asked at least twice Judgment by majority voting
Cognos judged better on 12 queries Computer science, Linux, Mac, Apple, Ipad, Internet,
Windows phone, photography, political journalist, …
Twitter Who-To-Follow judged better on 11 queries Music, Sachin Tendulkar, Anjelina Jolie, Harry Potter,
metallica, cloud computing, IIT Kharagpur, …
Cognos: http://twitter-app.mpi-sws.org/whom-to-follow/
Cognos: http://twitter-app.mpi-sws.org/whom-to-follow/
Results for query music
Summary: Finding topic experts in Twitter Developed and deployed Cognos
Uses Lists to infer topics of expertise and rank users Competes favorably with Twitter Who-To-Follow
Lists vital in searching for topic experts in Twitter
Future work Make the inference methodology robust against List
spam Key insight: Unlike follow-links, experts do not List
non-expert users Cognos: http://twitter-app.mpi-sws.org/whom-to-follow/
Twitter microblogging site An important source for real-time Web content
500 million active users posting 400 million tweets daily
Quality of tweets / content vary widely Any one can post tweets
Celebrities, politicians, news media, academics, spammers
Challenge: Finding relevant & trustworthy content Trustworthy: Thwart spammers and their spam Relevance: Identify authoritative experts on specific
topics
Higher-level take away Links mean different things in different real-
world social networks In fact, every social network offers different types of
links
They are backed by different social interactions Many links are implicit
Important to differentiate and leverage domain-specific usage of social links
Thank You
You can try Cognos at:http://twitter-app.mpi-sws.org/whom-to-follow/http://twitter-app.mpi-sws.org/who-is-who/