flux of meme - dow 1st semester
DESCRIPTION
First results after 6 months of work under the Telecom Italia Working Capital research grant, showing the technology used and a prototype web interface for a topic extraction and clustering tool.
TRANSCRIPT
Flux of MEME - description of work, 1st semester
project: Flux of Meme
author: Thomas M. Alisi - [email protected]: Telecom Italia
review: deliverable 11.3.11
Wednesday, March 9, 2011
even though geo-tagging is growing, it still accounts for less than 1% of all user-generated content
from the Twitter blog, December 2010:
What makes a trend a Trend?
Twitter users now send more than 95 million Tweets a day, on just about every topic imaginable. We track the volume of terms mentioned on Twitter on an ongoing basis. Topics break into the Trends list when the volume of Tweets about that topic at a given moment dramatically increases.
project overview
1. fetch data from real-time social networks
2. create clusters of geo-located information
3. extract topics
4. analyze stats, creating timeline predictions
prologue - struggling with hardware and algorithms
fetching data: the Twitter streaming API
• data is fetched using the Twitter streaming API
• issues:
  • access to data is limited: a basic “Spritzer” account receives only 1% of total tweets
  • geo-localized tweets are still a small fraction: around 1%
  • “good” data (tweets carrying geo-localized information) therefore amounts to about 90M (total tweets/day) * 1% * 1% ≈ 9,000 tweets/day
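The estimate above can be written out as a small calculation, which also makes it easy to see how the volume would change with a different access level. This is only a sketch: the 90M figure and both 1% rates are the ones quoted on this slide, while the function name and the 10% what-if value are illustrative assumptions.

```python
# Back-of-the-envelope estimate of usable ("good") tweets per day.
TOTAL_TWEETS_PER_DAY = 90_000_000   # total tweets/day, as quoted on the slide
SAMPLE_SHARE = 0.01                 # share of the stream a basic account receives
GEO_SHARE = 0.01                    # share of tweets carrying geo information


def good_tweets_per_day(total=TOTAL_TWEETS_PER_DAY,
                        sample=SAMPLE_SHARE,
                        geo=GEO_SHARE):
    """Return the expected number of geo-localized tweets reachable per day."""
    return round(total * sample * geo)


print(good_tweets_per_day())             # 9000
print(good_tweets_per_day(sample=0.10))  # 90000 (hypothetical 10% access level)
```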
problems
1. how to increase the amount of geo-localized data?
2. how to increase the amount / quality of text used for topic extraction?
approximating geo-information
geo information is extracted as text from the Twitter profile and searched in the GeoNames database, after having indexed its content (cities with population > 5,000)
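A minimal sketch of this lookup, assuming the index is a plain in-memory dictionary built from rows in the tab-separated GeoNames dump format. The function names, the matching strategy (whole string first, then comma-separated parts) and the sample rows are illustrative assumptions; the sample coordinates and populations are rounded placeholders, not actual GeoNames records, and the project's real indexing system is not described on the slide.

```python
import csv

# Column positions in the tab-separated GeoNames dump format.
NAME, ASCII_NAME, LAT, LON, POPULATION = 1, 2, 4, 5, 14


def build_city_index(lines, min_population=5000):
    """Index cities by lower-cased name, keeping the most populous match."""
    index = {}
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        population = int(row[POPULATION] or 0)
        if population < min_population:
            continue
        entry = (float(row[LAT]), float(row[LON]), population)
        for key in {row[NAME].lower(), row[ASCII_NAME].lower()}:
            if key not in index or index[key][2] < population:
                index[key] = entry
    return index


def locate(profile_location, index):
    """Try the whole profile string first, then each comma-separated part."""
    text = profile_location.strip().lower()
    candidates = [text] + [part.strip() for part in text.split(",")]
    for candidate in candidates:
        if candidate in index:
            lat, lon, _ = index[candidate]
            return lat, lon
    return None


def sample_row(name, lat, lon, population):
    """Build a 19-column row shaped like a dump line (illustrative values)."""
    fields = [""] * 19
    fields[NAME] = fields[ASCII_NAME] = name
    fields[LAT], fields[LON] = str(lat), str(lon)
    fields[POPULATION] = str(population)
    return "\t".join(fields)


rows = [
    sample_row("Florence", 43.77, 11.25, 350000),
    sample_row("Milan", 45.46, 9.19, 1300000),
    sample_row("Tinyville", 44.00, 10.00, 1200),   # below the 5,000 threshold
]
index = build_city_index(rows)
print(locate("Florence, Italy", index))
print(locate("somewhere over the rainbow", index))
```

Free-text profile locations are noisy, so a lookup like this is expected to miss many users; it only approximates a position for those who wrote a recognizable city name.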
enriching information
• extra information carried by single tweets is used to enrich data sets for topic extraction
• linked data is filtered through a blacklist, so that only what is actually relevant for clustering purposes is crawled and fetched
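The blacklist step could look roughly like the sketch below, which drops links by domain before they are handed to a crawler. The blacklisted domains and the function names are hypothetical; the slide does not say which sites the project actually excludes or how the filter is implemented.

```python
from urllib.parse import urlparse

# Hypothetical blacklist: domains assumed to add little text for clustering.
BLACKLISTED_DOMAINS = {"twitpic.com", "yfrog.com", "foursquare.com"}


def is_relevant(url, blacklist=BLACKLISTED_DOMAINS):
    """Return True if the URL has a host that is not blacklisted."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    # Reject the blacklisted domain itself and any of its subdomains.
    return bool(host) and not any(
        host == domain or host.endswith("." + domain) for domain in blacklist
    )


def links_to_crawl(urls, blacklist=BLACKLISTED_DOMAINS):
    """Keep only the links worth fetching for topic extraction."""
    return [url for url in urls if is_relevant(url, blacklist)]


print(links_to_crawl([
    "http://twitpic.com/abc",
    "http://www.example.org/news/1",
    "not a url",
]))
```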
[figure legend: geo information present / fetched through GeoNames / not present]
E-R (entity-relationship) model, focusing on posts / links / queries / clusters
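The diagram itself is not reproduced in this transcript. As a rough illustration only, the four entities named in the title could be modelled as below; every field and relationship here is an assumption made for the sketch, not the project's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Link:
    url: str
    text: str = ""                      # crawled page text, if any (assumed field)


@dataclass
class Post:
    post_id: int
    text: str
    created_at: datetime
    lat: Optional[float] = None         # None when no geo information is available
    lon: Optional[float] = None
    links: List[Link] = field(default_factory=list)


@dataclass
class Query:
    keywords: List[str]
    posts: List[Post] = field(default_factory=list)


@dataclass
class Cluster:
    start: datetime                     # time slice covered by the cluster
    end: datetime
    posts: List[Post] = field(default_factory=list)
    topics: List[str] = field(default_factory=list)


post = Post(1, "hello from Florence", datetime(2011, 3, 9, 12, 0),
            lat=43.77, lon=11.25, links=[Link("http://example.org/")])
cluster = Cluster(datetime(2011, 3, 9), datetime(2011, 3, 10), posts=[post])
print(len(cluster.posts), cluster.posts[0].links[0].url)
```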
application lifecycle
• while the Twitter API connection fetches a continuous stream of data, the clustering algorithm runs asynchronously
1. fetch data and store it in a continuous timeline
2. cut time into relevant slices
3. create geo-localized clusters of information, using HAC (Hierarchical Agglomerative Clustering)
4. extract topics from geo-clusters using LDA (Latent Dirichlet Allocation)
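Steps 2 and 3 can be sketched in plain Python as follows. This is a naive illustration under stated assumptions: fixed-width time slices, centroid linkage, a distance threshold as the stopping rule, and centroids taken as simple averages of latitude and longitude. The slides do not say which linkage or stopping criterion the project uses, and the LDA topic extraction of step 4 is not covered here.

```python
from collections import defaultdict
from math import asin, cos, radians, sin, sqrt


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(h))


def slice_timeline(posts, slice_seconds):
    """Group (timestamp, lat, lon) posts into fixed-width time slices."""
    slices = defaultdict(list)
    for timestamp, lat, lon in posts:
        slices[int(timestamp // slice_seconds)].append((lat, lon))
    return dict(slices)


def centroid(points):
    """Average latitude and longitude (adequate for small, local clusters)."""
    return (sum(p[0] for p in points) / len(points),
            sum(p[1] for p in points) / len(points))


def hac(points, max_km):
    """Merge the two closest clusters repeatedly, stopping once the
    closest pair of centroids is farther apart than max_km."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = haversine_km(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > max_km:
            break
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Illustrative posts: (unix-style timestamp in seconds, lat, lon).
posts = [(10, 43.77, 11.25), (20, 43.78, 11.26), (30, 45.46, 9.19),
         (3700, 43.77, 11.25)]
slices = slice_timeline(posts, slice_seconds=3600)
for slice_id, points in sorted(slices.items()):
    print(slice_id, [len(c) for c in hac(points, max_km=50)])
```

The pairwise search makes every merge quadratic in the number of clusters, which is acceptable for a sketch but would need a proper library implementation for real volumes.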
[timeline diagram: yesterday / today / tomorrow?; T = time slice]
software architecture
web interface
• first prototype of web interface, showing geo-localized clusters
• radius of clusters indicates standard deviation
• opacity indicates density (number of posts)
• for each cluster, its corresponding metadata is shown, including:
  • list of topics
  • list of posts
  • related links
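The two visual encodings listed above (radius and opacity) might be computed as in this sketch. It assumes that "standard deviation" means the spread of the posts around the cluster centroid and that opacity scales linearly with the post count; the function names and the minimum opacity are illustrative choices, not details taken from the prototype.

```python
from math import sqrt


def cluster_radius(points):
    """Standard deviation of the points around their centroid,
    in the same units as the coordinates."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    variance = sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n
    return sqrt(variance)


def cluster_opacity(post_count, max_post_count, floor=0.2):
    """Map the number of posts to an opacity between floor and 1.0."""
    if max_post_count <= 0:
        return floor
    share = min(post_count, max_post_count) / max_post_count
    return floor + (1.0 - floor) * share


print(cluster_radius([(0, 0), (2, 0)]))   # 1.0
print(cluster_opacity(0, 10))             # 0.2
```

Keeping a non-zero floor means sparse clusters stay visible on the map instead of fading out completely.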
what’s next?
• refinements of the LDA topic extraction algorithm (using different sources, and determining better datasets of ground-truth content for building the statistical model)
• Twitter streaming API tweaks:
  • location boxes
  • use of keywords and keyword expansion for context-specific searches
• implementation of search masks with a content indexing system (e.g. Apache Solr)
• timeline representation of clusters / topics
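For the streaming API tweaks, the sketch below builds request parameters for location boxes and keywords without making any network call. It assumes the `locations` and `track` parameters of the streaming filter endpoint, with each box given as the south-west corner followed by the north-east corner and longitude before latitude; the helper names and the example box are hypothetical.

```python
def bounding_box(south, west, north, east):
    """Format one location box: SW corner first, longitude before latitude.
    Boxes crossing the antimeridian are not handled in this sketch."""
    if south > north or west > east:
        raise ValueError("expected south <= north and west <= east")
    return f"{west},{south},{east},{north}"


def filter_params(boxes=(), keywords=()):
    """Build the parameter dictionary for a filtered stream request."""
    params = {}
    if boxes:
        params["locations"] = ",".join(bounding_box(*box) for box in boxes)
    if keywords:
        params["track"] = ",".join(keywords)
    return params


# Illustrative box roughly around Florence, plus two example keywords.
print(filter_params(boxes=[(43.7, 11.1, 43.9, 11.4)],
                    keywords=["flood", "strike"]))
```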
http://moritz.stefaner.eu/projects/map%20your%20moves/
http://a.parsons.edu/~drumb588/tweetcatcha/
http://truthy.indiana.edu/
http://www.janwillemtulp.com/worldeconomicforum/
thanks!
Thomas M. Alisi, [email protected]
Giuseppe Serra, [email protected]
Marco Bertini, [email protected]