swapnil soni thesis_presentation
TRANSCRIPT
Domain-Specific Document Retrieval Framework for Near Real-time Social Health Data
1
Master’s Thesis
Swapnil Soni
Committee
Prof. Amit P. Sheth (Advisor)
Prof. Krishnaprasad Thirunarayan
Dr. Tanvi Banerjee
Collaborator: Ashutosh Jadhav
Contact:
LinkedIn: https://www.linkedin.com/in/swapnilsoniknoesis
Home page: http://knoesis.org/researchers/swapnil/
http://knoesis.org/
2
Outline
Background, Objective
Problem Statement
Data Collection
Pattern Extraction Analysis
Results and Evaluation
Demonstration
Conclusion
4
Background
Sources: Pew Research
http://www.pewinternet.org/2010/03/24/health-information/
http://www.pewinternet.org/2011/02/01/health-topics-3/
Online health resources are easily accessible and provide information about most health topics.
These resources can help non-experts make more informed decisions and play a vital role in improving health literacy.
5
According to Pew Research*,
45% of U.S. adults are dealing with at least one chronic condition
Of those who are living with two or more conditions, 45% have diabetes
*http://www.pewinternet.org/files/old-media/Files/Reports/2013/PIP_TrackingforHealth%20with%20appendix.pdf
Background
6
Health Information Seeking
Web search engine
Social media
Choudhury et al., Seeking and Sharing Health Information Online: Comparing Search Engines and Social Media, ACM, 2014
Teevan et al., #TwitterSearch: A Comparison of Microblog Search and Web Search, 2011
Real-time content
Popular trends
Online health information-seeker
Learn about basic facts
Get deeper understanding
about a topic of interest
7
Online health information-seeker
Real-time content
Relevant
Reliable information
Health Information Seeking
Social media search-engine
8
Health Information Seeking Challenges
Keyword-based techniques are based on the interpretation of keywords
Search results may not be real-time
9
Example: How to control diabetes
Keyword-based
Not real-time
10
To provide a platform for asking health-related questions and retrieving near real-time, reliable, and relevant health information shared on social media.
Objective
11
In the US, 18% of internet users use Twitter.
Twitter carries about 500 million tweets per day and has around 75K verified healthcare professional accounts from all over the world.
Healthcare professionals post about 152K health tweets every day.
Twitter as a Data Source
Twitter has become a new source of information overload in health-care
12
Problem
How to extract near real-time, reliable and relevant
documents from the health information shared on
Twitter for a given user query?
14
Questions
Real-time Twitter data
Data Collection
15
Predefined questions:
Selected the most frequently asked questions from Mayo Clinic, WebMD, etc.
Dynamic questions:
Users can ask any question
Categories of Questions
16
System Architecture
[Architecture diagram. Components: User interface; Database; Apache Storm processing pipeline (Language Identifier, URL extractor, URL resolver, DBHandler); URL content extractor; URL share & like counts extractor; Hadoop-based Pattern extractor; Similarity score calculator; Patterns rank calculator.]
17
Apache Storm
It is a distributed, real-time computation system.
Spouts and bolts are the basic components in Storm for real-time processing of data.
Networks of spouts and bolts are packaged into a “topology”, which is submitted to a Storm cluster.
18
Topology Architecture
Crawler Spout: crawls tweets in real time based on keywords
Language identifier Bolt: lets only English tweets through
Object Modeling: converts the tweet object to a Java object
Hashtag extractor: retrieves hashtags from the tweets
URL Extractor: extracts URLs from the tweets
URL resolver: expands each URL from its short form to its original form
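The stages above can be sketched as a chain of plain Python generators. This is only an illustration and not Storm code: the tweet dictionaries, the `lang` field, and every function name are assumptions made for the example, and URL resolution is left out because it needs network access.

```python
import re
from dataclasses import dataclass, field

# Hypothetical raw records standing in for the crawler spout's output.
RAW_TWEETS = [
    {"lang": "en", "text": "5 easy remedies to control #diabetes http://bit.ly/13oypg4"},
    {"lang": "es", "text": "Consejos para la #diabetes http://example.com/es"},
]

HASHTAG_RE = re.compile(r"#(\w+)")
URL_RE = re.compile(r"https?://\S+")


@dataclass
class Tweet:
    """Typed object produced by the object-modeling stage."""
    text: str
    hashtags: list = field(default_factory=list)
    urls: list = field(default_factory=list)


def language_filter(tweets):
    """Language identifier stage: let only English tweets through."""
    return (t for t in tweets if t.get("lang") == "en")


def to_objects(tweets):
    """Object-modeling stage: convert raw records to Tweet objects."""
    return (Tweet(text=t["text"]) for t in tweets)


def extract_hashtags(tweets):
    """Hashtag extractor stage: collect hashtags without the leading '#'."""
    for t in tweets:
        t.hashtags = HASHTAG_RE.findall(t.text)
        yield t


def extract_urls(tweets):
    """URL extractor stage: collect the (still shortened) URLs."""
    for t in tweets:
        t.urls = URL_RE.findall(t.text)
        yield t


def run_pipeline(raw):
    """Chain the stages in the order listed on the slide."""
    return list(extract_urls(extract_hashtags(to_objects(language_filter(raw)))))
```

Each generator plays the role of one bolt: it consumes the stream from the previous stage and emits its own, so a stage can be added or removed without touching the others.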
20
Component: Pattern Extractor
[System architecture diagram, repeated. Components: User interface; Database; Apache Storm processing pipeline (Language Identifier, URL extractor, URL resolver, DBHandler); URL content extractor; URL share & like counts extractor; Hadoop-based Pattern extractor; Similarity score calculator; Patterns rank calculator.]
21
Extractors
Content Extractor:
Extracts content from the URLs present in the tweets.
URL(s) Share & Like Counts Extractor:
Popularity of a source: to measure content popularity, we use the social media share and like counts of the URLs (Facebook share count, Facebook like count, Twitter share count).
Reliability of a source: Google domain PageRank of the URLs.
22
Pattern Extractor
Pattern-based Mining
Question
Construct an AQL query
Triple: subject, predicate, and object
A noun or noun phrase, or a verb or verb phrase
23
Annotation Query Language (AQL)
AQL is a language for building queries that pull structured information from unstructured or semi-structured text.
The syntax of AQL is similar to that of Structured Query Language (SQL).
AQL file
AOG
SystemT
Data folder
Contains all the patterns.
Result contains pattern.
24
5 easy natural remedies to control diabetes : If you are a diabetic or know someone who is a diabeti... http://bit.ly/13oypg4
Pattern Extractor: Example
How to control diabetes?
X control diabetes
X control blood sugar
X handle blood sugar
X handle diabetes
UMLS
WordNet
Synsets
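The expansion in this example can be sketched as a product over synonym lists. The synonym table below is a hand-written stand-in for the UMLS and WordNet lookups named on the slide, and `expand_pattern` is a hypothetical helper, not part of the system.

```python
from itertools import product

# Toy synonym table; on the slide the synonyms come from UMLS and WordNet synsets.
SYNONYMS = {
    "control": ["control", "handle"],
    "diabetes": ["diabetes", "blood sugar"],
}


def expand_pattern(predicate, obj, synonyms=SYNONYMS):
    """Return 'X <predicate> <object>' patterns for every synonym combination."""
    predicates = synonyms.get(predicate, [predicate])
    objects = synonyms.get(obj, [obj])
    return [f"X {p} {o}" for p, o in product(predicates, objects)]
```

Calling `expand_pattern("control", "diabetes")` yields the four patterns listed on the slide, with `X` left as the open subject slot to be filled from the tweet or article text.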
25
This module extracts triples (patterns) from unstructured text (tweets and the content of their URLs) based on predefined questions (AQL queries).
The text analytics engine executes the AQL queries at six-hour intervals.
Predefined Questions: Pattern Mining
26
Dynamic Query Processing Architecture
How to control diabetes by exercise?
Part-of-speech tagger
control (verb), diabetes (noun), exercise (noun)
Query builder
WordNet
Synsets
Query executor
diabetes control exercise
exercise control diabetes
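The tagging and query-building steps can be sketched as follows. The tiny lexicon is a stand-in for a real part-of-speech tagger, and both function names are assumptions for illustration; synonym expansion is omitted here.

```python
from itertools import permutations

# Toy part-of-speech lexicon; a real tagger would replace this lookup.
POS_LEXICON = {"control": "verb", "diabetes": "noun", "exercise": "noun"}


def tag(question):
    """Keep only words found in the toy lexicon, paired with their tag."""
    words = question.lower().rstrip("?").split()
    return [(w, POS_LEXICON[w]) for w in words if w in POS_LEXICON]


def build_patterns(tagged):
    """Place each verb between every ordered pair of distinct nouns."""
    verbs = [w for w, t in tagged if t == "verb"]
    nouns = [w for w, t in tagged if t == "noun"]
    return [f"{s} {v} {o}" for v in verbs for s, o in permutations(nouns, 2)]
```

For the question on the slide, `build_patterns(tag("How to control diabetes by exercise?"))` gives both noun orderings around the verb, which matches the two patterns shown.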
Question: How to control diabetes?
Extracted pattern: control blood sugar
Paragraph: Exercise is a healthy way to lower and control blood sugar levels within your body. Doing exercise and lifting weights will improve your condition significantly. http://t.co/88ulxDPFTo
Question: How to control diabetes?
Extracted pattern: insulin into the blood stream to handle
Paragraph: When a meal is eaten, the pancreas will send larger amounts of insulin into the blood stream to handle the food http://t.co/WsCWiNqhb9
Question: How to control diabetes?
Extracted pattern: remove sugar
Paragraph: Since people with Type 2 diabetes tend to accumulate sugar in their blood due to their inability to efficiently remove sugar from the blood http://t.co/aHqKJjrTPY
27
Results
Question: What are the symptoms of diabetes?
Extracted pattern: Diabetics tend to get
Paragraph: Diabetics should be very cautious when having a pedicure. Diabetics tend to get bad infections in the feet, so you must be very aware of any puncture or cut you notice on your feet. http://t.co/HqJBjBtrXC
Question: What are the causes of diabetes?
Extracted pattern: can cause diabetes
Paragraph: Smoking isn’t healthy for anyone but can be very dangerous if you’re a diabetic. This habit produces many poor health issues. Smoking makes a person’s insulin resistant, in can cause diabetes to develop http://t.co/Ca5SaXRL6w
Pattern Rank Calculator Architecture
[Architecture diagram. Components: Patterns; URL share and like counts; Similarity Score Calculator; Pattern Rank Calculator; Database. Step labels include 1a, 1b, 2, and 3.]
29
Features Set
Popularity: Facebook share counts, Facebook like counts, Twitter share count
Relevancy: Vector-based similarity score
Reliability: Google domain rank
30
Query Expansion
How to control diabetes?
How to control diabetes?
How to control blood sugar?
How to handle blood sugar?
How to handle diabetes?
0.81 (TF-IDF score)
0.0 (TF-IDF score)
0.81 (TF-IDF score)
0.77 (TF-IDF score)
Exercise controls diabetes
Natural way to handle blood sugar
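A vector-based similarity between the expanded queries and a document can be sketched in plain Python. The smoothed IDF used here is one common variant and an assumption, as are all function names; the sketch is not expected to reproduce the scores shown on the slide.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, drop question marks, split on whitespace."""
    return text.lower().replace("?", "").split()


def tfidf_vectors(texts):
    """Return one {term: weight} dict per text, using smoothed IDF."""
    docs = [Counter(tokenize(t)) for t in texts]
    n = len(docs)
    df = Counter(term for d in docs for term in d)  # document frequency
    idf = {term: math.log((1 + n) / (1 + c)) + 1 for term, c in df.items()}
    return [{term: tf * idf[term] for term, tf in d.items()} for d in docs]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_similarity(expanded_queries, document):
    """Score a document by its best match among the expanded queries."""
    vectors = tfidf_vectors(expanded_queries + [document])
    doc_vec = vectors[-1]
    return max(cosine(q, doc_vec) for q in vectors[:-1])
```

Taking the maximum over the expanded queries is the point of the expansion: a document that says "handle blood sugar" shares almost no terms with the original question but matches one of its rewrites closely.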
31
Experiments: ML Classifiers
Social media share and like count + Jaccard index on query expansion
Classifier     Precision  Recall
NaiveBayes     0.638      0.639
supportVector  0.63       0.667
RandomForest   0.678      0.694
AdaBoostM1     0.376      0.556
Social media share and like count + TF-IDF on query expansion
Classifier     Precision  Recall
NaiveBayes     0.753      0.722
supportVector  0.687      0.75
RandomForest   0.793      0.806
AdaBoostM1     0.501      0.583
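The Jaccard index used as the alternative relevancy feature can be sketched as set overlap between query terms and document terms. Whitespace tokenization and the function name are assumptions for the example.

```python
def jaccard(a, b):
    """Jaccard index of the word sets of two strings: |A ∩ B| / |A ∪ B|."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0
```

Unlike TF-IDF, this measure treats every shared word equally, so common words such as "to" count as much as "diabetes".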
33
Evaluation
Reliability, Relevancy, and Real-time
Pattern Generator
Query Expansion based on Relevance Feedback
34
Evaluation: Reliability, Relevancy & Real-time
Reliability:
• Based on the Google domain PageRank of each URL (the extracted news article)
• Filtering criterion: the URL’s Google domain PageRank must be greater than 4
Relevancy:
• Based on a qualitative approach
• For a given question, user survey participants judge the relevancy of the result sets from 1) Twitter search, 2) Social Health Signal, and 3) Google time-bound search, and assign a relevancy score from 1 (low) to 3 (high)
Real-time:
• Timeliness (trends) of a retrieved document. We consider only 6 hours of data when answering a user’s query
• Example: breaking news on diabetes
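The reliability and real-time criteria can be sketched as a simple filter. The record fields (`domain_rank`, `seen_at`) and the function name are hypothetical, and reading the six-hour rule as "the six hours before the query time" is an assumption.

```python
from datetime import datetime, timedelta

MIN_DOMAIN_RANK = 4          # slide: rank must be greater than 4
WINDOW = timedelta(hours=6)  # slide: only six hours of data are considered


def filter_documents(docs, now):
    """Keep documents that are both reliable and recent enough."""
    cutoff = now - WINDOW
    return [
        d for d in docs
        if d["domain_rank"] > MIN_DOMAIN_RANK and d["seen_at"] >= cutoff
    ]
```

Passing `now` in explicitly, rather than calling `datetime.now()` inside the function, keeps the filter deterministic and easy to check.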
35
Evaluation: Relevancy
Collected the top 10 results from these sources: Twitter search, Social Health Signal, and Google time-bound search
Queries (frequently asked questions)
1) How to control diabetes?
2) What are the causes of diabetes?
3) What are the symptoms of diabetes?
36
Evaluation: Relevancy
Presented the top 10 results from all three sources for each query to participants
Participants judged each document for a query on a scale of 1 to 3 (1: not good, 2: good, 3: very good)
To calculate the average rank, we used the following formula*:
*http://help.surveymonkey.com/articles/en_US/kb/What-is-the-Rating-Average-and-how-is-it-calculated
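A minimal sketch of a rating average, assuming the cited page describes the usual weighted average (each rating value multiplied by the number of participants who chose it, divided by the total number of responses). The function name and input shape are hypothetical.

```python
def rating_average(counts):
    """counts maps a rating value (1, 2, 3) to how many participants chose it."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no responses")
    return sum(weight * n for weight, n in counts.items()) / total
```

For example, one rating of 1, two ratings of 2, and two ratings of 3 give (1 + 4 + 6) / 5.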
37
Evaluation: Relevancy
How to control diabetes?
Result 1
Result 2
Result 3
Score 1 Score 2 Score 3
38
Evaluation: Relevancy
Query    Twitter search  Social Health Signal  Google time-bound
Query 1  40%             50%                   10%
Query 2  10%             60%                   50%
Query 3  40%             50%                   10%
Query    Twitter search  Social Health Signal  Google time-bound
Query 1  10%             10%                   40%
Query 2  30%             10%                   10%
Query 3  30%             20%                   70%
Query    Twitter search  Social Health Signal  Google time-bound
Query 1  50%             40%                   50%
Query 2  60%             30%                   40%
Query 3  30%             30%                   20%
Bad
Good
Very Good
39
Evaluation Metrics: Relevancy
nDCG@K (Normalized Discounted Cumulative Gain)
nDCG@K can handle multiple levels of relevance
It gives more weight to a document at a higher-ranked position than to one at a lower-ranked position
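nDCG@K can be sketched as below. The slides do not say which DCG variant was used; this version divides each graded relevance by log2(rank + 1), which is one common formulation and an assumption here.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain over the top-k graded relevances."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg(relevances, k):
    """DCG divided by the DCG of the same grades in ideal (descending) order."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal else 0.0
```

With grades from 1 to 3, a result list already sorted from best to worst scores 1.0, and any list that places a lower grade above a higher one scores less.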
40
Evaluation Metrics: Relevancy
Metric  Twitter-Search  Social Health Signal  Google Time-Bound
DCG     9.68            12.72                 9.12
IDCG    13.33           13.76                 9.81
nDCG    0.726           0.924                 0.929
Metric  Twitter-Search  Social Health Signal  Google Time-Bound
DCG     9.67            13.15                 10.03
IDCG    10.55           14.15                 12.76
nDCG    0.91            0.92                  0.78
Metric  Twitter-Search  Social Health Signal  Google Time-Bound
DCG     10.75           11.47                 10.76
IDCG    12.69           13.45                 10.89
nDCG    0.84            0.85                  0.98
Query 2
Query 1
Query 3
41
Evaluation: Popularity
Query: How to control diabetes?
Google time-bound
Facebook (Share + Like) Counts  Twitter Share Counts
4                               0
0                               0
0                               0
0                               4
0                               0
0                               0
1                               2
52                              1211
229                             0
Social Health Signal
Facebook (Share + Like) Counts  Twitter Share Counts
3910                            1843
213                             8
81                              90
0                               128
149                             826
0                               24
0                               20
0                               24
0                               2
42
Google Time-Bound Search
Obesity cause diabetes
Overweight cause diabetes
43
Social Health Signal
Obesity less Mortality Risk
URL Title: Replacing sugary drinks with water may reduce diabetes risk
Extracted Pattern: 'obese is a major risk'
URL Title: More Evidence Links Diabetes to Alzheimer's Disease
Extracted Pattern: 'Overweight May Decrease Mortality Risk'
URL Title: The facts about sugar
Extracted Pattern: 'overweight can increase your risk'
URL Title: Having Diabetes Can Increase Your Alzheimer's Risk Via Blood Glucose And Brain Plaque Link
Extracted Pattern: 'obesity can also increase our risk'
URL Title: Diabetes Study Suggests a Little Extra Weight Tied to Longer Survival
Extracted Pattern: 'risk for dying than overweight'
44
Demo
http://knoesis-twit.cs.wright.edu/SocialHealthSignal/
45
Future Work
Evaluation
Relevancy on more Queries
Pattern Generator
Query Expansion based on Relevance Feedback
Semantic Categorization
Performance improvement for dynamic queries
46
Conclusion
◦ Twitter has become a popular tool for seeking health information.
◦ Extracting relevant and reliable health documents from Twitter in near real-time is a very difficult task.
◦ We address this problem by using state-of-the-art approaches such as:
◦ Semantics-based pattern mining
◦ TF-IDF relevancy score on query expansion
◦ Content popularity: social media share and like counts
◦ Reliability: Google domain PageRank
47
Acknowledgements
Dr. Amit Sheth Dr. T.K Prasad Dr. Tanvi Banerjee Ashutosh Jadhav
48
Thanks!
Questions?
49
Social Health Signal
Screenshots
50
Home Screen
51
Search & Explore Screen
52
Top 10 URLs Screen
53
Tweet Locations Screen