Building Topic/Trend Detection System based on Slow Intelligence
Chia-Chun Shih & Ting-Chun Peng
Institute for Information Industry
Taipei, Taiwan
Presented at DMS’10 special session on Slow Intelligence Systems
Agenda
• Introduction
• Topic/Trend Detection System
• Topic/Trend Detection System with Slow Intelligence
• Conclusion
Introduction
• Social media is prevailing
• Social media is a reflection of the real world
  – An experiment from HP Social Computing Lab shows:
    • Twitter-rate time series can accurately predict box-office movie sales, with adjusted R² = 0.973 (amazing!!)
• The emerging market for social media monitoring services
  – E.g., Nielsen Buzzmetrics, Radian6
[Figure: Twitter posts, blog posts, Facebook users]
Introduction (cont’d)
• Topic Detection and Tracking (TDT)
  – Initiated by DARPA in 1996
  – Goal: discover the topical structure in unsegmented streams of news reporting as it appears across multiple media
  – Tasks:
    • Topic Detection
    • Topic Tracking
    • First Story Detection
    • Story Segmentation
    • Link Detection
Introduction (cont’d)
• Slow Intelligence provides a software development framework that lets systems with insufficient computing resources adapt gradually to their environments in order to handle complexity
[Figure: Slow Intelligence System. Labels: Problem, Enumerator, Adaptor, Eliminator, Concentrator, Solution, Knowledge-based Controller, Environment]
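The four building blocks can be read as one iteration of a search loop that is repeated as the environment changes. A minimal Python sketch of that reading follows. The function name `slow_intelligence_step`, its callback parameters, and the toy integer problem are all assumptions made for illustration; the slides do not define an API, and the Knowledge-based Controller is not modeled here.

```python
from typing import Callable, Iterable, List, TypeVar

S = TypeVar("S")


def slow_intelligence_step(
    current: S,
    enumerate_fn: Callable[[S], Iterable[S]],
    adapt_fn: Callable[[S], S],
    score_fn: Callable[[S], float],
    keep: int = 3,
) -> S:
    """One Enumerate -> Adapt -> Eliminate -> Concentrate cycle (illustrative only)."""
    # Enumerator: propose alternative solutions around the current one.
    candidates: List[S] = list(enumerate_fn(current))
    # Adaptor: adjust each candidate to the constraints of the environment.
    candidates = [adapt_fn(c) for c in candidates]
    # Eliminator: rank candidates (plus the current solution) and drop the weak ones.
    ranked = sorted(candidates + [current], key=score_fn, reverse=True)
    survivors = ranked[:keep]
    # Concentrator: focus further effort on the single best survivor.
    return survivors[0]


# Toy usage (hypothetical problem): find the integer the environment rewards most.
target = 42
solution = 0
for _ in range(20):
    solution = slow_intelligence_step(
        solution,
        enumerate_fn=lambda x: [x - 5, x - 1, x + 1, x + 5],
        adapt_fn=lambda x: max(0, x),  # assumed constraint: values must be non-negative
        score_fn=lambda x: -abs(x - target),
    )
print(solution)
```

Each call does only a small amount of work, which is the point of the sketch: the solution improves gradually over many cycles rather than in one expensive search.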
Introduction (cont’d)
• In this paper, we propose a design for an online topic/trend detection system for social media that takes advantage of Slow Intelligence.
• Four complexities of designing online topic/trend detection systems are identified, along with corresponding Slow Intelligence solutions.
[Figure: Slow Intelligence System building blocks (Enumerator, Adaptor, Eliminator, Concentrator) applied to the Topic/Trend Detection System components (Crawler & Extractor, Topic Extractor, Trend Detector) through four subsystems: SIS system for focused crawling, SIS system for scheduling crawlers, SIS system for adapting extractors, SIS system for selecting the trend estimation method]
Topic/Trend Detection System
• Objective
  – Detect current hot topics and predict future hot topics based on data collected from social media
• Three components
  – Crawler & Extractor: collects data and extracts information from social media
  – Topic Extractor: detects hot topics from a set of text documents
  – Trend Detector: detects trends (future hot topics) based on currently available data
[Figure: Social Media, Crawler & Extractor, Topic Extractor, Trend Detector; outputs: Current Hot Topics, Future Hot Topics]
Topic/Trend Detection System (cont’d)
• Crawler & Extractor
[Figure: Crawler & Extractor. Labels: Social Media, User’s Keywords of Interest, Web Crawler, HTML documents, Information Extractor, Text documents, Web data DB, Topic Extractor]
* Extract articles and metadata (title, author, content, etc.) from semi-structured web content
Topic/Trend Detection System (cont’d)
• Topic Extractor
[Figure: Topic Extractor. Labels: Web data DB, Topic Word Extraction, Topic Word Clustering, Current topics, Hot topic extraction, Current Hot topics]
• Apply the TF-IDF scheme to generate the top-N topic words for each document
• Apply a clustering algorithm to cluster topic words into topic groups. The topic groups are treated as “topics”
• Apply aging theory to find hot topics
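The first of these steps can be sketched in a few lines of Python. The slides do not say which TF-IDF variant is used, so this sketch assumes relative term frequency and a natural-log inverse document frequency. The function name `top_n_topic_words` and the sample documents are hypothetical. The clustering and aging-theory steps are not shown.

```python
import math
from collections import Counter
from typing import Dict, List


def top_n_topic_words(docs: List[List[str]], n: int = 3) -> List[List[str]]:
    """Return the n highest-scoring TF-IDF words for each tokenized document."""
    num_docs = len(docs)
    # Document frequency: the number of documents in which each word appears.
    doc_freq: Counter = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    results = []
    for tokens in docs:
        counts = Counter(tokens)
        scores: Dict[str, float] = {}
        for word, count in counts.items():
            tf = count / len(tokens)                   # assumed: relative frequency
            idf = math.log(num_docs / doc_freq[word])  # assumed: unsmoothed natural log
            scores[word] = tf * idf
        # Highest score first; ties broken alphabetically for a stable order.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        results.append(ranked[:n])
    return results


# Hypothetical posts, already tokenized.
docs = [
    "new phone launch new phone review".split(),
    "movie review box office movie".split(),
    "phone battery review".split(),
]
topic_words = top_n_topic_words(docs, n=2)
print(topic_words)
```

A word such as "review" that occurs in every document gets an IDF of zero, so it never ranks as a topic word, which is the behavior the TF-IDF step relies on.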
Topic/Trend Detection System (cont’d)
• Trend Detector
[Figure: Trend Detector. Labels: Current topics, Trend Estimation Algorithms, Topic Trend (Future Hot Topics)]
• The trend estimation algorithm is a black box for now; it will “find its way” once Slow Intelligence is involved in the system
Topic/Trend Detection System with Slow Intelligence
T/TD System with Slow Intelligence
• Four complexities of designing online topic/trend detection systems
• 1. Collecting all web data is infeasible with a limited amount of computing resources. The system needs data collection strategies that concentrate those resources on important web data.
[Component: Crawler & Extractor]
T/TD System with Slow Intelligence (cont’d)
• 2. Many computation methods are available for estimating trends. Once parameter settings are taken into account, there are too many combinations to choose from. Furthermore, the Internet is a changing environment, so the current best solution may not perform well in the future. The system needs to automatically (or at least quasi-automatically) find the best solution among many alternatives in a changing environment.
[Component: Trend Detector]
T/TD System with Slow Intelligence (cont’d)
• 3. The crawler needs to revisit websites at hourly or daily intervals to collect up-to-date data. Each site has a different amount of data to be updated and a different policy for restricting frequent access, neither of which is known beforehand. The system needs to find a feasible data collection schedule based on past experience.
[Component: Crawler & Extractor]
T/TD System with Slow Intelligence (cont’d)
• 4. Any change in a web page may disrupt the Extractors. When many websites are monitored, an automatic repair mechanism for Extractors is needed. The repair mechanism needs to detect Extractor errors, find alternatives, and choose the best alternative to fix the disrupted Extractors.
[Component: Crawler & Extractor]
T/TD System with Slow Intelligence (cont’d)
1. SIS to help restrict the range of data collection
[Figure labels: Knowledge of data, Knowledge of algorithm]
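As a rough illustration of concentrating a limited crawl budget on important data, the Python sketch below ranks candidate links by how well their anchor text matches the user's keywords of interest and keeps only the best ones. The scoring rule, the names `relevance` and `plan_crawl`, and the example URLs are assumptions for illustration, not the authors' design.

```python
import heapq
from typing import Dict, List, Set


def relevance(text: str, keywords: Set[str]) -> float:
    """Fraction of words in `text` that match the user's keywords of interest."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in keywords for w in words) / len(words)


def plan_crawl(link_texts: Dict[str, str], keywords: Set[str], budget: int) -> List[str]:
    """Pick at most `budget` URLs, most relevant anchor text first."""
    # heapq is a min-heap, so negate the score to pop the best URL first.
    heap = [(-relevance(text, keywords), url) for url, text in link_texts.items()]
    heapq.heapify(heap)
    chosen: List[str] = []
    while heap and len(chosen) < budget:
        neg_score, url = heapq.heappop(heap)
        if neg_score == 0:
            break  # eliminate links that show no sign of relevance
        chosen.append(url)
    return chosen


# Hypothetical frontier: URL -> anchor text seen on an already crawled page.
links = {
    "http://example.com/a": "slow intelligence systems",
    "http://example.com/b": "celebrity gossip today",
    "http://example.com/c": "trend detection for social media",
    "http://example.com/d": "topic detection",
}
plan = plan_crawl(links, {"trend", "topic", "detection", "intelligence"}, budget=2)
print(plan)
```

In a fuller system the relevance score would presumably be refined from what earlier crawls actually yielded, which is where the gradual adaptation would come in.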
T/TD System with Slow Intelligence (cont’d)
2. SIS to help select and adapt trend detection algorithms
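One way to picture this selection is to enumerate method and parameter combinations, backtest each on recent data, and keep the one with the lowest error, repeating the process as new data arrives. The Python sketch below does that for three simple forecasters. The choice of methods, the parameter grids, and all function names are assumptions for illustration; the slides leave the trend estimation algorithm unspecified.

```python
from typing import Callable, Dict, List

Forecaster = Callable[[List[float]], float]


def moving_average(window: int) -> Forecaster:
    return lambda history: sum(history[-window:]) / len(history[-window:])


def exp_smoothing(alpha: float) -> Forecaster:
    def forecast(history: List[float]) -> float:
        level = history[0]
        for value in history[1:]:
            level = alpha * value + (1 - alpha) * level
        return level
    return forecast


def last_plus_drift() -> Forecaster:
    # Naive linear trend: the last value plus the most recent change.
    return lambda h: h[-1] + (h[-1] - h[-2] if len(h) > 1 else 0.0)


def enumerate_candidates() -> Dict[str, Forecaster]:
    """Enumerate every method/parameter combination under consideration."""
    candidates = {f"ma(window={w})": moving_average(w) for w in (2, 3, 5)}
    candidates.update({f"es(alpha={a})": exp_smoothing(a) for a in (0.3, 0.6, 0.9)})
    candidates["drift"] = last_plus_drift()
    return candidates


def backtest_error(forecaster: Forecaster, series: List[float], warmup: int = 3) -> float:
    """Mean absolute one-step-ahead error over the part of the series after warmup."""
    errors = [abs(forecaster(series[:t]) - series[t]) for t in range(warmup, len(series))]
    return sum(errors) / len(errors)


def select_method(series: List[float]) -> str:
    """Eliminate all but the candidate with the lowest backtest error."""
    scored = {name: backtest_error(f, series) for name, f in enumerate_candidates().items()}
    return min(scored, key=scored.get)


# Hypothetical hourly mention counts for one topic.
mentions_per_hour = [10, 12, 14, 16, 18, 20, 22, 24]
best = select_method(mentions_per_hour)
print(best)
```

Because the environment changes, `select_method` would be rerun on a sliding window of recent data, so a method that wins today can be replaced later.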
T/TD System with Slow Intelligence (cont’d)
3. SIS to help schedule the Crawler
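Learning a feasible schedule from past experience can be sketched as a per-site revisit interval that is adjusted after every crawl. In the Python sketch below, the class name `SiteSchedule`, the thresholds, and the adjustment factors are all assumed values chosen for illustration; they do not come from the slides.

```python
from dataclasses import dataclass


@dataclass
class SiteSchedule:
    """Revisit interval for one site, adapted from crawl outcomes (illustrative)."""
    interval_hours: float = 6.0
    min_hours: float = 1.0   # assumed lower bound: hourly revisits
    max_hours: float = 24.0  # assumed upper bound: daily revisits

    def record_visit(self, new_items: int, blocked: bool) -> float:
        """Update the interval from the result of one crawl and return it."""
        if blocked:
            factor = 2.0   # the site restricted access: back off sharply
        elif new_items == 0:
            factor = 1.5   # nothing new: we are visiting too often
        elif new_items > 20:
            factor = 0.5   # large backlog: we are visiting too rarely
        else:
            factor = 1.0   # the current interval looks about right
        self.interval_hours = min(
            self.max_hours, max(self.min_hours, self.interval_hours * factor)
        )
        return self.interval_hours


# Hypothetical crawl history: (new items found, whether access was blocked).
schedule = SiteSchedule()
history = [(35, False), (28, False), (4, False), (0, False), (0, True)]
intervals = [schedule.record_visit(n, b) for n, b in history]
print(intervals)
```

Keeping one `SiteSchedule` per site lets each site settle on its own interval, since update volumes and access policies differ from site to site and are not known in advance.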
T/TD System with Slow Intelligence (cont’d)
4. SIS to help adapt Extractors
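The repair mechanism has three parts: detect Extractor errors, find alternatives, and choose the best alternative. A minimal Python sketch of those parts follows. It assumes that Extractors are regular-expression rules and that an error means a missing field; both are simplifications made for illustration, and the names `regex_extractor`, `success_rate`, and `repair` are hypothetical.

```python
import re
from typing import Callable, Dict, List, Optional

Extractor = Callable[[str], Dict[str, Optional[str]]]


def regex_extractor(title_pattern: str, body_pattern: str) -> Extractor:
    """Build an extractor from two regex rules (a stand-in for real wrappers)."""
    def extract(html: str) -> Dict[str, Optional[str]]:
        title = re.search(title_pattern, html, re.S)
        body = re.search(body_pattern, html, re.S)
        return {
            "title": title.group(1).strip() if title else None,
            "content": body.group(1).strip() if body else None,
        }
    return extract


def success_rate(extractor: Extractor, pages: List[str]) -> float:
    """Error detection: share of pages for which every field was extracted."""
    ok = sum(all(extractor(page).values()) for page in pages)
    return ok / len(pages)


def repair(current: Extractor, alternatives: List[Extractor],
           pages: List[str], threshold: float = 0.8) -> Extractor:
    """Keep the current extractor if it still works, else pick the best alternative."""
    if success_rate(current, pages) >= threshold:
        return current
    return max(alternatives, key=lambda e: success_rate(e, pages))


old = regex_extractor(r"<h1>(.*?)</h1>", r'<div class="post">(.*?)</div>')
candidates = [
    regex_extractor(r"<h2>(.*?)</h2>", r'<div class="post">(.*?)</div>'),
    regex_extractor(r"<h1>(.*?)</h1>", r"<article>(.*?)</article>"),
]
# Hypothetical redesign: the body moved from <div class="post"> to <article>.
pages = ["<h1>Hot topic</h1><article>Everyone is talking about it.</article>"]
active = repair(old, candidates, pages)
record = active(pages[0])
print(record)
```

Here the alternatives are listed by hand; a fuller system would generate them, for example by varying the rules of the broken Extractor.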
Conclusion
• An online trend detection system requires careful resource allocation and automatic algorithm adaptation to process large volumes of heterogeneous data.
• This research adopts Slow Intelligence, which provides a framework for systems with insufficient computing resources to gradually adapt to their environments, to respond to these challenges.
• Four Slow Intelligence subsystems are proposed, each targeting one challenge in designing online topic/trend detection systems.
If you have any questions, please e-mail us
[email protected] (Chia-Chun Shih)
[email protected] (Ting-Chun Peng)