flux of meme - dow 1st semester

Flux of MEME - description of work, 1st semester

project: Flux of Meme

author: Thomas M. Alisi - [email protected]: Telecom Italia

review: deliverable 11.3.11


Wednesday, March 9, 2011



although geo-tagging is growing, it still accounts for less than 1% of all user-generated content



from the Twitter blog, December 2010:

What makes a trend a Trend?

Twitter users now send more than 95 million Tweets a day, on just about every topic imaginable. We track the volume of terms mentioned on Twitter on an ongoing basis. Topics break into the Trends list when the volume of Tweets about that topic at a given moment dramatically increases.



project overview


1. fetch data from real-time social networks

2. create clusters of geo-located information

3. extract topics

4. analyze stats, creating timeline predictions
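
The clustering step of the overview (creating clusters of geo-located information) could be sketched as below. This is a minimal illustration of agglomerative clustering, not the project's code: the average-linkage criterion, the haversine distance, the `max_km` stopping threshold and all function names are assumptions made for the example.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def average_linkage(c1, c2, points):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(haversine_km(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerate(points, max_km):
    """Merge the two closest clusters until the closest pair is farther than max_km."""
    clusters = [[i] for i in range(len(points))]  # start with one cluster per post
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_linkage(clusters[x], clusters[y], points)
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        if d > max_km:  # assumed stopping rule: distance threshold
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

# Illustrative coordinates: two posts near Florence, two near Rome.
posts = [(43.77, 11.25), (43.80, 11.29), (41.90, 12.50), (41.85, 12.55)]
clusters = agglomerate(posts, max_km=50)
```

The naive search over all cluster pairs is cubic in the number of posts, which is acceptable for a sketch but would need a smarter data structure on a real stream.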


prologue - struggling with hardware and algorithms



fetching data: the Twitter streaming API

• data is fetched using the Twitter streaming API

• issues:

  • access to data is limited: a basic “Spritzer” account receives only about 1% of all tweets

  • geo-localized tweets are still a small share of the total: around 1%

  • “good” data (tweets that carry geo-localized information) therefore amounts to roughly 90M (total tweets/day) * 1% * 1% ≈ 9,000 tweets/day
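
The estimate above can be reproduced with a few lines of arithmetic. The function name is hypothetical; the figures are the ones quoted on this slide.

```python
def usable_tweets_per_day(total_per_day, sample_share, geo_share):
    """Tweets per day that are both in the sampled stream and geo-localized."""
    return total_per_day * sample_share * geo_share

# 90M tweets/day, 1% visible through the basic account, 1% of those geo-localized
estimate = usable_tweets_per_day(90_000_000, 0.01, 0.01)
print(round(estimate))  # 9000
```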



problems

1. how to increase geo-localized data?

2. how to increase the amount / quality of text used for topic extraction?



approximating geo-information


geo information is extracted as text from the Twitter profile

and looked up in the GeoNames database, after indexing its content (cities with population > 5,000)
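
A minimal sketch of this lookup, assuming a simple name-to-coordinates index. The mini-gazetteer, its values and every function name below are hypothetical stand-ins: a real index would be built from the GeoNames data, and the project's actual matching rules are not described in these slides.

```python
import re
import unicodedata

# Hypothetical mini-gazetteer with illustrative, approximate values
# (not actual GeoNames records).
CITIES = [
    {"names": ["Firenze", "Florence"], "population": 370_000, "lat": 43.77, "lon": 11.25},
    {"names": ["Torino", "Turin"], "population": 880_000, "lat": 45.07, "lon": 7.69},
    {"names": ["Smallville"], "population": 1_200, "lat": 40.00, "lon": 10.00},
]

def normalize(text):
    """Lowercase, strip accents, and collapse punctuation to single spaces."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def build_index(cities, min_population=5000):
    """Index every name of each city above the population threshold."""
    index = {}
    for city in cities:
        if city["population"] > min_population:
            for name in city["names"]:
                index[normalize(name)] = (city["lat"], city["lon"])
    return index

def lookup_location(profile_text, index):
    """Try the whole profile string first, then each comma/slash/pipe-separated part."""
    candidates = [profile_text] + re.split(r"[,/|]", profile_text)
    for candidate in candidates:
        key = normalize(candidate)
        if key in index:
            return index[key]
    return None  # no approximate position available

index = build_index(CITIES)
coords = lookup_location("Florence, Italy", index)
```

Exact matching on normalized names is the simplest possible choice; free-text profile fields would probably also need fuzzy matching and disambiguation between cities that share a name.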


enriching information

• extra information carried by single tweets is used to enrich the data sets for topic extraction

• linked data is filtered through a blacklist, so that only content that is actually relevant for clustering purposes is crawled and fetched
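
The blacklist filter can be sketched as a host check applied before crawling. The blacklisted hosts and the function names are assumptions chosen for illustration; the slides do not list the actual blacklist.

```python
from urllib.parse import urlparse

# Hypothetical blacklist: hosts assumed to add little text for topic extraction.
BLACKLISTED_HOSTS = {"twitpic.com", "yfrog.com", "foursquare.com"}

def is_relevant(url, blacklist=BLACKLISTED_HOSTS):
    """True if the URL has a host that is neither blacklisted nor a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if not host:
        return False  # not a crawlable URL
    return not any(host == b or host.endswith("." + b) for b in blacklist)

def links_to_crawl(urls):
    """Keep only the links worth fetching for enrichment."""
    return [u for u in urls if is_relevant(u)]

links = links_to_crawl([
    "http://www.twitpic.com/abc",
    "http://example.org/article",
    "http://m.yfrog.com/x",
])
```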


chart legend: geo information present / fetched through GeoNames / not present


entity-relationship (E-R) model, focusing on posts / links / queries / clusters



application lifecycle

• while the connection to the Twitter streaming API stays open and fetches a continuous stream of data, the clustering algorithm runs asynchronously

1. fetch data and store it in a continuous timeline

2. cut time into relevant slices

3. create geo-localized clusters of information, using HAC (Hierarchical Agglomerative Clustering)

4. extract topics from geo-clusters using LDA (Latent Dirichlet Allocation)
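
To make the last step more concrete, here is a compact sketch of LDA fitted with collapsed Gibbs sampling on the tokenized posts of one cluster. It is an illustration only: the slides do not say which LDA implementation or inference method the project uses, and the hyperparameters (`alpha`, `beta`, `n_iter`), the function names and the toy posts are all assumptions.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]             # tokens per (doc, topic)
    topic_word = [[0] * n_words for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                           # tokens per topic
    assignments = []
    for d, doc in enumerate(docs):                         # random initial topics
        row = []
        for w in doc:
            k = rng.randrange(n_topics)
            row.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(row)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], assignments[d][i]
                # remove the token's current assignment from the counts
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # resample its topic from the conditional distribution
                weights = [
                    (doc_topic[d][t] + alpha) * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    return vocab, topic_word

def top_words(vocab, topic_word, n=3):
    """The n most frequent words of each topic."""
    return [[vocab[v] for v in sorted(range(len(vocab)), key=lambda v: -row[v])[:n]]
            for row in topic_word]

# Toy posts of one geo-cluster, already tokenized (hypothetical data).
cluster_posts = [
    ["match", "goal", "stadium", "goal"],
    ["stadium", "match", "fans"],
    ["concert", "band", "stage"],
    ["band", "concert", "tickets", "stage"],
]
vocab, topic_word = lda_gibbs(cluster_posts, n_topics=2)
topics = top_words(vocab, topic_word)
```

On short texts such as tweets the per-document statistics are very sparse, which is one reason for enriching posts with linked content before topic extraction.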


timeline figure: time axis T cut into time slices (yesterday, today, tomorrow?)
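
Cutting the stored timeline into slices could look like the sketch below. Fixed-length slices aligned to a common epoch are an assumption: the slides only say that time is cut into "relevant" slices. The function name and the sample posts are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def slice_timeline(posts, slice_length=timedelta(hours=1)):
    """Group (timestamp, payload) pairs into consecutive fixed-length slices.

    Returns a dict mapping each slice start time to the payloads inside it,
    ordered by time.
    """
    seconds = slice_length.total_seconds()
    slices = defaultdict(list)
    for timestamp, payload in posts:
        index = int((timestamp - EPOCH).total_seconds() // seconds)
        slices[EPOCH + index * slice_length].append(payload)
    return dict(sorted(slices.items()))

posts = [
    (datetime(2011, 3, 9, 10, 5), "post a"),
    (datetime(2011, 3, 9, 11, 10), "post c"),
    (datetime(2011, 3, 9, 10, 55), "post b"),
]
slices = slice_timeline(posts)
```

Each slice can then be clustered and analyzed on its own, so that clusters and topics can be compared from one slice to the next.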


software architecture



web interface

• first prototype of web interface, showing geo-localized clusters

• the radius of each cluster circle indicates the standard deviation of its posts' positions

• opacity indicates density (number of posts)

• for each cluster, its corresponding metadata is shown, including:

  • list of topics

  • list of posts

  • related links
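
The two visual encodings can be derived from a cluster's posts roughly as follows. This is one possible reading of "standard deviation" (root-mean-square distance from the centroid, in km) and of "density" (post count scaled against the largest cluster); the flat-earth approximation, the `min_opacity` floor and all names are assumptions.

```python
import math

KM_PER_DEGREE = 111.32  # approximate length of one degree of latitude

def cluster_marker(points, max_posts, min_opacity=0.2):
    """Centre, radius and opacity for drawing one cluster on the map.

    points: list of (lat, lon) in degrees; max_posts: size of the largest cluster.
    """
    n = len(points)
    lat_c = sum(lat for lat, _ in points) / n
    lon_c = sum(lon for _, lon in points) / n
    # squared distance of each post from the centroid, equirectangular approximation
    squared = [
        ((lat - lat_c) * KM_PER_DEGREE) ** 2
        + ((lon - lon_c) * KM_PER_DEGREE * math.cos(math.radians(lat_c))) ** 2
        for lat, lon in points
    ]
    radius_km = math.sqrt(sum(squared) / n)
    # more posts give a more opaque circle, capped at 1.0
    opacity = min_opacity + (1 - min_opacity) * min(n / max_posts, 1.0)
    return {"center": (lat_c, lon_c), "radius_km": radius_km,
            "opacity": opacity, "posts": n}

marker = cluster_marker([(0.0, 0.0), (0.0, 2.0)], max_posts=4)
```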



what’s next?

• refinement of the LDA topic extraction algorithm (using different sources, and determining better ground-truth datasets for building the statistical model)

• Twitter streaming API tweaks:

  • location boxes

  • use of keywords and keyword expansion for context-specific searches

• implementation of search masks with a content indexing system (e.g. Apache Solr)

• timeline representation of clusters / topics
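
As a sketch of the streaming API tweaks, the snippet below builds the `locations` and `track` values for a filtered stream request. Twitter's filter endpoint documents `locations` as comma-separated longitude,latitude pairs with the south-west corner first, and `track` as comma-separated keywords; as far as that documentation describes, the two filters are combined with OR rather than AND. The helper names, the synonym table and the bounding box are hypothetical, and no request is actually sent.

```python
def expand_keywords(keywords, synonyms):
    """Add known synonyms to each keyword, keeping order and dropping duplicates."""
    expanded = []
    for keyword in keywords:
        for term in [keyword] + synonyms.get(keyword, []):
            if term not in expanded:
                expanded.append(term)
    return expanded

def filter_params(boxes=(), keywords=()):
    """Build request parameters from bounding boxes and keywords.

    Each box is (west_lon, south_lat, east_lon, north_lat) in degrees.
    """
    params = {}
    if boxes:
        for west, south, east, north in boxes:
            if not (west < east and south < north):
                raise ValueError("box must be given south-west corner first")
        params["locations"] = ",".join(str(v) for box in boxes for v in box)
    if keywords:
        params["track"] = ",".join(keywords)
    return params

# Hypothetical synonym table and a rough bounding box around Italy.
SYNONYMS = {"football": ["calcio", "soccer"]}
params = filter_params(
    boxes=[(6.6, 36.6, 18.5, 47.1)],
    keywords=expand_keywords(["football", "derby"], SYNONYMS),
)
```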



http://moritz.stefaner.eu/projects/map%20your%20moves/

http://a.parsons.edu/~drumb588/tweetcatcha/

http://truthy.indiana.edu/

http://www.janwillemtulp.com/worldeconomicforum/


thanks!

Thomas M. Alisi, [email protected]

Giuseppe Serra, [email protected]

Marco Bertini, [email protected]



