flux of meme - dow 1st semester

Flux of MEME - description of work, 1st semester. project: Flux of Meme; author: Thomas M. Alisi - [email protected]; client: Telecom Italia; review: deliverable 11.3.11; Wednesday, March 9, 2011

Upload: thomas-alisi

Post on 21-Dec-2014


DESCRIPTION

First results after 6 months of work under the Telecom Italia Working Capital research grant, showing the technology used and a prototype web interface for a topic extraction and clustering tool.

TRANSCRIPT

Page 1: Flux of MEME - DOW 1st semester

Flux of MEME - description of work, 1st semester

project: Flux of Meme
author: Thomas M. Alisi - [email protected]
client: Telecom Italia
review: deliverable 11.3.11


Wednesday, March 9, 2011

Page 2: Flux of MEME - DOW 1st semester

even though geo-tagging is growing, it still represents <1% of total user-generated content


Page 3: Flux of MEME - DOW 1st semester

from the Twitter blog, December 2010:

What makes a trend a Trend? Twitter users now send more than 95 million Tweets a day, on just about every topic imaginable. We track the volume of terms mentioned on Twitter on an ongoing basis. Topics break into the Trends list when the volume of Tweets about that topic at a given moment dramatically increases.


Page 4: Flux of MEME - DOW 1st semester

project overview


1. fetch data from real-time social networks

2. create clusters of geo-located information

3. extract topics

4. analyze stats, creating timeline predictions


Page 5: Flux of MEME - DOW 1st semester

prologue - struggling with hardware and algorithms


Page 6: Flux of MEME - DOW 1st semester

fetching data: the Twitter streaming API

• data is fetched using the Twitter streaming API

• issues:

• access to data is limited: a basic “Spritzer” account receives only 1% of total tweets

• geo-localized tweets are still a small fraction: around 1%

• “good” data (tweets that carry geo-localized information) is therefore around 90M (total tweets/day) * 1% * 1%, i.e. roughly 9,000 tweets/day
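The estimate above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the three figures are the ones quoted on the slide, and the function and constant names are made up for illustration.

```python
def estimate_geo_tweets_per_day(total_per_day, sample_rate, geo_rate):
    """Expected number of geo-localized tweets reaching the application per day."""
    return total_per_day * sample_rate * geo_rate


# Figures quoted on the slide (names are illustrative, not from the project).
TOTAL_TWEETS_PER_DAY = 90_000_000   # total tweets/day
SPRITZER_SAMPLE_RATE = 0.01         # share of the stream a basic account receives
GEO_TAGGED_RATE = 0.01              # share of tweets carrying geo information

usable = estimate_geo_tweets_per_day(
    TOTAL_TWEETS_PER_DAY, SPRITZER_SAMPLE_RATE, GEO_TAGGED_RATE
)
print(f"about {round(usable):,} geo-localized tweets per day")
print(f"about {round(usable / 24)} per hour")
```

A few hundred usable tweets per hour worldwide is little material for per-city clustering, which is what motivates the two problems on the next slide.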


Page 7: Flux of MEME - DOW 1st semester

problems

1. how to increase geo-localized data?

2. how to increase the amount / quality of text used for topic extraction?


Page 8: Flux of MEME - DOW 1st semester

approximating geo-information


geo information is extracted as text from the Twitter profile and searched in the GeoNames database, after having indexed its content (cities with population > 5,000)
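A minimal sketch of this kind of lookup, assuming the index is a plain in-memory dictionary keyed by normalized city name. The project's actual index and matching rules are not described in the slides; the sample rows, the made-up village "Paesino" and all function names below are illustrative, and the coordinates and populations are rough values rather than GeoNames data.

```python
import re
import unicodedata

# Stand-in for rows loaded from a GeoNames dump: (name, lat, lon, population).
# Values are rough and for illustration only; "Paesino" is a made-up village.
SAMPLE_CITIES = [
    ("Torino", 45.07, 7.69, 900_000),
    ("Firenze", 43.77, 11.25, 370_000),
    ("Roma", 41.89, 12.48, 2_700_000),
    ("Paesino", 44.00, 8.00, 1_200),
]


def normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return " ".join(re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split())


def build_index(cities, min_population=5000):
    """Index only cities above the population threshold, as on the slide."""
    return {
        normalize(name): (lat, lon)
        for name, lat, lon, population in cities
        if population > min_population
    }


def locate(profile_location, index):
    """Return (lat, lon) for the first city name found in the free text, else None."""
    words = normalize(profile_location).split()
    # Try the longest word sequences first so multi-word city names win.
    for size in range(len(words), 0, -1):
        for start in range(len(words) - size + 1):
            candidate = " ".join(words[start:start + size])
            if candidate in index:
                return index[candidate]
    return None


index = build_index(SAMPLE_CITIES)
print(locate("Torino, Italy", index))
print(locate("somewhere over the rainbow", index))
```

Exact matching on normalized names is the simplest possible choice; ambiguous names (several cities sharing a name) would need a tie-break rule such as preferring the largest population.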


Page 9: Flux of MEME - DOW 1st semester

enriching information

• extra information carried by single tweets is used to enrich data sets for topic extraction

• linked content is filtered through a blacklist, so that only what is actually relevant for clustering is crawled and fetched
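The blacklist filter can be sketched as follows, assuming it works on the domain of each link found in a tweet. The slides do not say which domains were blacklisted or how matching was done, so the set below and the function names are hypothetical.

```python
import re
from urllib.parse import urlparse

URL_PATTERN = re.compile(r"https?://\S+")

# Hypothetical blacklist: domains whose pages add little text for topic extraction.
BLACKLISTED_DOMAINS = {"twitpic.com", "foursquare.com", "youtube.com"}


def extract_links(text):
    """Find URLs in a tweet, dropping trailing punctuation."""
    return [url.rstrip(".,;:!?)") for url in URL_PATTERN.findall(text)]


def is_blacklisted(url, blacklist=BLACKLISTED_DOMAINS):
    """True if the URL's host is a blacklisted domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return any(host == domain or host.endswith("." + domain) for domain in blacklist)


def links_to_crawl(text, blacklist=BLACKLISTED_DOMAINS):
    """Links worth fetching to enrich the text available for this tweet."""
    return [url for url in extract_links(text) if not is_blacklisted(url, blacklist)]


tweet = "great read http://example.org/article and a pic http://twitpic.com/abc123"
print(links_to_crawl(tweet))
```

Only the surviving links would then be crawled, and the fetched page text added to the tweet's own text before topic extraction.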

chart legend: geo information present / fetched through GeoNames / not present

Page 10: Flux of MEME - DOW 1st semester

E-R (entity-relationship) model, focusing on posts / links / queries / clusters


Page 11: Flux of MEME - DOW 1st semester

application lifecycle

• while the Twitter API connection fetches a continuous stream of data, the clustering algorithm runs asynchronously

1. fetch data and store it in a continuous timeline

2. cut time into relevant slices

3. create geo-localized clusters of information, using HAC (Hierarchical Agglomerative Clustering)

4. extract topics from geo-clusters using LDA (Latent Dirichlet Allocation)
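Steps 2 and 3 above can be sketched in plain Python. The slides do not state the slice width, the linkage criterion or the stopping rule, so fixed-width slices, single linkage and a distance cut-off in kilometres are assumptions made here for illustration; the sample posts are invented and the LDA step is left out.

```python
import math
from collections import defaultdict


def slice_by_time(posts, slice_seconds):
    """Group (timestamp, lat, lon) posts into fixed-width time slices."""
    slices = defaultdict(list)
    for timestamp, lat, lon in posts:
        slices[int(timestamp // slice_seconds)].append((lat, lon))
    return dict(slices)


def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def hac(points, max_km):
    """Single-linkage agglomerative clustering.

    Starts with one cluster per point and repeatedly merges the two closest
    clusters, stopping once the closest pair is farther apart than max_km.
    """
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(haversine_km(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > max_km:
            break
        clusters[i].extend(clusters.pop(j))
    return clusters


# Invented sample: two posts near Turin and one near Rome in the first hour,
# one more post near Rome in the second hour.
posts = [(0, 45.07, 7.69), (10, 45.08, 7.70), (20, 41.89, 12.48), (3700, 41.90, 12.50)]
slices = slice_by_time(posts, slice_seconds=3600)
clusters = hac(slices[0], max_km=50)
print(len(slices), "slices;", len(clusters), "clusters in the first slice")
```

The pairwise search makes this sketch far too slow for real volumes; it is meant only to show how a time slice feeds the clustering step, whose output clusters would then go to topic extraction.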

timeline diagram (T): yesterday, today, tomorrow?, divided into time slices

Page 12: Flux of MEME - DOW 1st semester

software architecture


Page 13: Flux of MEME - DOW 1st semester

web interface

• first prototype of web interface, showing geo-localized clusters

• radius of clusters indicates standard deviation

• opacity indicates density (number of posts)

• for each cluster, its corresponding metadata is shown, including:

• list of topics

• list of posts

• related links
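The two visual encodings above can be sketched as follows. The slides do not define how the standard deviation or the density scale were computed, so this assumes the radius is the root-mean-square distance of the posts from the cluster centroid (in degrees) and that opacity grows linearly with post count above a minimum floor; all names and the floor value are illustrative.

```python
import math


def centroid(points):
    """Mean latitude and longitude of a cluster's posts."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def radius_std(points):
    """Root-mean-square distance of the posts from their centroid, in degrees."""
    c = centroid(points)
    squared = sum((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for p in points)
    return math.sqrt(squared / len(points))


def opacity(post_count, max_count, floor=0.2):
    """Map a post count to an opacity in [floor, 1.0] (floor is an assumed value)."""
    if max_count <= 0:
        return floor
    return floor + (1.0 - floor) * (post_count / max_count)


def describe(clusters):
    """Drawing parameters for each cluster: centre, radius and opacity."""
    max_count = max(len(c) for c in clusters)
    return [
        {
            "center": centroid(c),
            "radius": radius_std(c),
            "opacity": opacity(len(c), max_count),
        }
        for c in clusters
    ]


sample_clusters = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0)]]
shapes = describe(sample_clusters)
for shape in shapes:
    print(shape)
```

Each dictionary could then be passed to whatever map layer draws the circles, with the topics, posts and links attached as the cluster's metadata.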


Page 14: Flux of MEME - DOW 1st semester

what’s next?

• refinement of the LDA topic extraction algorithm (using different sources, building better ground-truth datasets for constructing the statistical model)

• Twitter streaming API tweaks:

• location boxes

• use of keywords and keyword expansion for context-specific searches

• implementation of search masks with a content indexing system (e.g. Apache Solr)

• timeline representation of clusters / topics
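For the streaming API tweaks listed above, a sketch of how the request parameters might be assembled. It assumes the filter endpoint's `locations` parameter (comma-separated longitude,latitude pairs, south-west corner first) and `track` parameter (comma-separated keywords) as publicly documented for the Twitter streaming API at the time; the bounding box, the keyword expansion table and the function names are hypothetical, and no request is actually sent.

```python
def format_locations(boxes):
    """Serialize bounding boxes given as (west_lon, south_lat, east_lon, north_lat)."""
    values = []
    for west, south, east, north in boxes:
        if west >= east or south >= north:
            raise ValueError("each box must list its south-west corner first")
        values.extend([west, south, east, north])
    return ",".join(str(v) for v in values)


def expand_keywords(seeds, expansions):
    """Add related terms to each seed keyword, keeping order and dropping duplicates."""
    expanded = []
    for seed in seeds:
        for term in [seed] + expansions.get(seed, []):
            if term not in expanded:
                expanded.append(term)
    return expanded


def build_filter_params(boxes=(), keywords=()):
    """Parameters for a filtered stream request (only the two discussed here)."""
    params = {}
    if boxes:
        params["locations"] = format_locations(boxes)
    if keywords:
        params["track"] = ",".join(keywords)
    return params


# Rough, illustrative box around Italy and a hypothetical expansion table.
ITALY_BOX = (6.6, 35.5, 18.5, 47.1)
EXPANSIONS = {"elezioni": ["voto"]}

params = build_filter_params(
    boxes=[ITALY_BOX],
    keywords=expand_keywords(["elezioni"], EXPANSIONS),
)
print(params)
```

One point worth verifying against the API documentation before relying on this: the two parameters were described as being combined with a logical OR, so a tweet matching either the box or a keyword would be delivered, and any "keyword within area" restriction would have to be applied on the client side.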
