swapnil soni thesis_presentation

Domain-Specific Document Retrieval Framework for Near Real-time Social Health Data

Master’s Thesis

Swapnil Soni

Committee

Prof. Amit P. Sheth (Advisor)

Prof. Krishnaprasad Thirunarayan

Dr. Tanvi Banerjee

Collaborator: Ashutosh Jadhav

Contact:

LinkedIn: https://www.linkedin.com/in/swapnilsoniknoesis

Home page: http://knoesis.org/researchers/swapnil/

http://knoesis.org/

Outline

Background, Objective

Problem Statement

Data Collection

Pattern Extraction Analysis

Results and Evaluation

Demonstration

Conclusion


Background

Sources: Pew Research

http://www.pewinternet.org/2010/03/24/health-information/

http://www.pewinternet.org/2011/02/01/health-topics-3/

Online health resources are easily accessible and provide information about most health topics.

These resources can help non-experts make more informed decisions and play a vital role in improving health literacy.

According to Pew Research*,

45% of U.S. adults are dealing with at least one chronic condition

Of those who are living with two or more conditions, 45% have diabetes

*http://www.pewinternet.org/files/old-media/Files/Reports/2013/PIP_TrackingforHealth%20with%20appendix.pdf


Health Information Seeking

Web search engine Social media

Choudhury et al., Seeking and Sharing Health Information Online: Comparing Search Engines and Social Media, ACM, 2014

Teevan et al., #TwitterSearch: A Comparison of Microblog Search and Web Search, 2011

Real-time content

Popular trends

Online health information-seeker

Learn about basic facts

Get deeper understanding about a topic of interest

Online health information-seeker

Real-time content

Relevant

Reliable information

Health Information Seeking

Social media search-engine

Health Information Seeking Challenges

Keyword-based techniques depend on the interpretation of individual keywords

Search results may not be real-time

Example: How to control diabetes

Keyword-based

Not real-time

Twitter

Objective

To provide a platform for asking health-related questions and getting near real-time, reliable, and relevant health information shared on social media.

Twitter as a Data Source

In the US, 18% of internet users use Twitter.

About 500 million tweets are posted per day, and there are around 75K verified healthcare professional accounts worldwide.

152K health tweets are posted every day by healthcare professionals.

Twitter has become a new source of information overload in health care

Problem

How to extract near real-time, reliable, and relevant documents from the health information shared on Twitter for a given user query?


Questions

Real-time Twitter data

Data Collection

Categories of Questions

Predefined questions:

Selected from the most frequently asked questions on Mayo Clinic, WebMD, etc.

Dynamic questions:

Users can ask any question

System Architecture

[Figure: system architecture. Components: Twitter; Apache Storm processing pipeline (Language Identifier, URL extractor, URL resolver, DBHandler); Database; URL content extractor; URL share & like counts extractor; Hadoop-based Pattern extractor; Similarity score Calc; Patterns Rank Calc; User interface. Further labels: patterns, URL social media rank. Data-flow steps are numbered in the diagram.]

Apache Storm

It is a distributed, real-time computation system.

Spouts and bolts are the basic components in Storm for real-time processing of data.

Networks of spouts and bolts are packaged into a “topology”, which is submitted to a Storm cluster.
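The spout and bolt idea can be illustrated with a small sketch. This is not Storm code (Storm topologies are normally written against its Java API); it is a plain-Python stand-in, and the names `keyword_spout`, `english_only_bolt`, and `run_topology` as well as the tweet dictionary layout are hypothetical.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional

Tweet = Dict[str, str]  # hypothetical record, e.g. {"text": "...", "lang": "en"}


def keyword_spout(stream: Iterable[Tweet], keywords: List[str]) -> Iterator[Tweet]:
    """Spout stand-in: emit only tweets whose text mentions a keyword."""
    for tweet in stream:
        text = tweet["text"].lower()
        if any(keyword in text for keyword in keywords):
            yield tweet


def english_only_bolt(tweet: Tweet) -> Optional[Tweet]:
    """Bolt stand-in: pass English tweets through, drop everything else."""
    return tweet if tweet.get("lang") == "en" else None


def run_topology(
    spout: Iterator[Tweet],
    bolts: List[Callable[[Tweet], Optional[Tweet]]],
) -> List[Tweet]:
    """Push each tuple emitted by the spout through the chain of bolts."""
    results = []
    for item in spout:
        for bolt in bolts:
            item = bolt(item)
            if item is None:  # a bolt dropped the tuple
                break
        else:
            results.append(item)
    return results
```

In a real deployment each bolt would run as parallel tasks on the cluster; the sketch only shows the order in which tuples flow.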

Topology Architecture

Crawler Spout: a spout which crawls tweets in real time based on keywords

Language Identifier Bolt: allows only English tweets

Object Modeling: converts the tweet object to a Java object

Hashtag Extractor: retrieves hashtags from the tweets

URL Extractor: extracts URLs from the tweets

URL Resolver: expands the URL(s) from their short form to their original form
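A minimal sketch of what the hashtag and URL extraction steps might do to a tweet's text. The regular expressions and function names are assumptions, not the thesis implementation, and URL resolving is left out because it needs network access.

```python
import re
from typing import List

# Assumed patterns: a URL is "http(s)://" followed by non-space characters,
# a hashtag is "#" followed by word characters.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#(\w+)")


def extract_urls(text: str) -> List[str]:
    """Return the URLs in a tweet, without trailing punctuation."""
    return [url.rstrip(".,;:!?)") for url in URL_RE.findall(text)]


def extract_hashtags(text: str) -> List[str]:
    """Return hashtags without '#', ignoring '#' fragments inside URLs."""
    return HASHTAG_RE.findall(URL_RE.sub(" ", text))
```

For example, `extract_urls` applied to a tweet ending in `http://bit.ly/13oypg4` returns that single short URL, which a resolver step would then expand.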


Component: Pattern Extractor

[Figure: the system architecture diagram, shown again for the pattern extractor component.]

Extractors

Content Extractor:

Extracts content from the URLs present in the tweets.

URL Share & Like Counts Extractor:

Popularity of a source: measured by the social media share and like counts of the URLs (Facebook share count, Facebook like count, Twitter share count).

Reliability of a source: measured by the Google domain page rank of the URLs.

Pattern Extractor

Pattern-based Mining

Triple: subject, predicate, and object

Question

Construct an AQL query

A noun or noun phrase, or a verb or verb

Annotation Query Language (AQL)

AQL is a language for building queries that pull structured information from unstructured or semi-structured text.

The syntax of AQL is similar to that of Structured Query Language (SQL).

AQL file

AOG

SystemT

Data folder

Contains all the patterns.

Result contains pattern.

5 easy natural remedies to control diabetes : If you are a diabetic or know someone who is a diabeti... http://bit.ly/13oypg4

Pattern Extractor: Example

How to control diabetes?

X control diabetes

X control blood sugar

X handle blood sugar

X handle diabetes

UMLS

WordNet

Synsets

http://t.co/eut17LIR5c
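The expansion of "How to control diabetes?" into the four "X ..." patterns can be sketched as a cross product over synonym sets. The hard-coded `SYNONYMS` table below is an assumption standing in for the WordNet and UMLS synset lookups, and `expand_patterns` is a hypothetical helper name.

```python
from itertools import product
from typing import Dict, List

# Assumed stand-in for synsets that would come from WordNet and UMLS.
SYNONYMS: Dict[str, List[str]] = {
    "control": ["control", "handle"],
    "diabetes": ["diabetes", "blood sugar"],
}


def expand_patterns(predicate: str, obj: str,
                    synonyms: Dict[str, List[str]] = SYNONYMS) -> List[str]:
    """Build 'X <predicate> <object>' patterns for every synonym pair."""
    predicates = synonyms.get(predicate, [predicate])
    objects = synonyms.get(obj, [obj])
    return [f"X {p} {o}" for p, o in product(predicates, objects)]
```

Here `X` marks the open subject slot that the pattern extractor fills from tweets and URL content.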

Predefined Questions: Pattern Mining

This module extracts triples (patterns) from unstructured text (tweets and the content of their URLs) based on predefined questions (AQL queries).

The text analytics engine executes the AQL queries at intervals of six hours.

How to control diabetes by exercise?

Part-of-speech tagger

Control (verb), diabetes(noun), exercise(noun)

Query builder

WordNet

Synsets

Query executor

diabetes control exercise

exercise control diabetes

Dynamic Query Processing Architecture
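A rough sketch of the dynamic query flow for "How to control diabetes by exercise?": tag the words, keep verbs and nouns, and build candidate patterns. The tiny `TOY_POS` lexicon is an assumption used in place of a real part-of-speech tagger, and the function names are hypothetical.

```python
import re
from itertools import permutations
from typing import Dict, List, Tuple

# Assumed toy lexicon; a real system would call a part-of-speech tagger.
TOY_POS: Dict[str, str] = {"control": "VERB", "diabetes": "NOUN", "exercise": "NOUN"}


def tag(question: str) -> List[Tuple[str, str]]:
    """Return (word, tag) pairs for the words found in the toy lexicon."""
    tokens = re.findall(r"[a-z]+", question.lower())
    return [(token, TOY_POS[token]) for token in tokens if token in TOY_POS]


def build_patterns(tagged: List[Tuple[str, str]]) -> List[str]:
    """Combine each verb with every ordered pair of distinct nouns."""
    verbs = [word for word, pos in tagged if pos == "VERB"]
    nouns = [word for word, pos in tagged if pos == "NOUN"]
    return [f"{subj} {verb} {obj}"
            for verb in verbs
            for subj, obj in permutations(nouns, 2)]
```

A query builder would then expand each pattern with synsets before the query executor runs it.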

Extracted Pattern: control blood sugar
Paragraph: Exercise is a healthy way to lower and control blood sugar levels within your body. Doing exercise and lifting weights will improve your condition significantly. http://t.co/88ulxDPFTo

Extracted Pattern: insulin into the blood stream to handle
Paragraph: When a meal is eaten, the pancreas will send larger amounts of insulin into the blood stream to handle the food http://t.co/WsCWiNqhb9

Extracted Pattern: remove sugar
Paragraph: Since people with Type 2 diabetes tend to accumulate sugar in their blood due to their inability to efficiently remove sugar from the blood http://t.co/aHqKJjrTPY

Results

Question: What are the Symptoms of diabetes?
Extracted Pattern: Diabetics tend to get
Paragraph: Diabetics should be very cautious when having a pedicure. Diabetics tend to get bad infections in the feet, so you must be very aware of any puncture or cut you notice on your feet. http://t.co/HqJBjBtrXC

Question: What are the causes of diabetes?
Extracted Pattern: can cause diabetes
Paragraph: Smoking isn’t healthy for anyone but can be very dangerous if you’re a diabetic. This habit produces many poor health issues. Smoking makes a person’s insulin resistant, in can cause diabetes to develop http://t.co/Ca5SaXRL6w

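Pattern Rank Calculator Architecture

[Figure: pattern rank calculator architecture. Inputs: patterns; URL share and like counts. Components: Similarity Score Calculator, Pattern Rank Calculator, Database. Data-flow steps are numbered in the diagram.]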

Features Set

Popularity: Facebook share counts, Facebook like counts, Twitter share count

Relevancy: vector-based similarity score

Reliability: Google domain rank

Query Expansion

How to control diabetes?

How to control blood sugar?

How to handle blood sugar?

How to handle diabetes?

0.81 (TF-IDF score)

0.0 (TF-IDF score)

0.81 (TF-IDF score)

0.77 (TF-IDF score)

Exercise controls diabetes

Natural way to handle blood sugar
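A minimal sketch of scoring an extracted pattern against the expanded queries, using the two relevance measures named in this deck (Jaccard index and TF-IDF). The tokenizer, the smoothed IDF formula, and taking the maximum over the expanded queries are all assumptions, so the scores it produces are not expected to match the values on the slide.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard(a: str, b: str) -> float:
    """Jaccard index of the two token sets."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Sparse TF-IDF vectors with a smoothed IDF (an assumed variant)."""
    tokenized = [tokenize(doc) for doc in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    return [
        {term: count * (math.log((1 + n) / (1 + df[term])) + 1)
         for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def best_similarity(expanded_queries: List[str], pattern: str) -> float:
    """Highest TF-IDF cosine similarity between the pattern and any query."""
    vectors = tfidf_vectors(expanded_queries + [pattern])
    pattern_vector = vectors[-1]
    return max(cosine(query, pattern_vector) for query in vectors[:-1])
```

Scoring against every expanded query lets a pattern such as "Natural way to handle blood sugar" match even though it shares no content word with "control diabetes".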

Experiments: ML Classifiers

Social Media share and like count + Jaccard Index on query expansion

NaiveBayes: Precision 0.638, Recall 0.639
supportVector: Precision 0.63, Recall 0.667
RandomForest: Precision 0.678, Recall 0.694
AdaBoostM1: Precision 0.376, Recall 0.556

Social Media share and like count + TF-IDF on query expansion

NaiveBayes: Precision 0.753, Recall 0.722
supportVector: Precision 0.687, Recall 0.75
RandomForest: Precision 0.793, Recall 0.806
AdaBoostM1: Precision 0.501, Recall 0.583
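For reference, precision and recall for one class can be computed as in the sketch below. How the reported figures were averaged across classes is not stated in the slides, so this shows only the basic definitions; `precision_recall` is a hypothetical helper.

```python
from typing import Sequence, Tuple


def precision_recall(y_true: Sequence[int], y_pred: Sequence[int],
                     positive: int = 1) -> Tuple[float, float]:
    """Precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```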


Evaluation

Reliability, Relevancy, and Real-time

Pattern Generator

Query Expansion based on Relevance Feedback


Evaluation: Reliability, Relevancy & Real-time

Reliability:

• Based on the Google domain PageRank of each URL (extracted news article)

• Filtering criterion: the URL’s Google domain PageRank must be greater than 4

Relevancy:

• Based on a qualitative approach

• For a given question, user survey participants judge the relevancy of the result sets from 1) Twitter search, 2) Social Health Signal, and 3) Google time-bound search, and assign a relevancy score from 1 (low) to 3 (high)

Real-time:

• Timeliness (trends) of a retrieved document. We consider only 6 hours of data when finding information for a user’s query

• Example: breaking news on diabetes
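The reliability and real-time criteria above can be sketched as a simple filter. The `Document` fields and the `filter_documents` helper are hypothetical, and the sketch assumes the six-hour window is measured back from the time of the query.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Document:
    url: str
    domain_rank: int       # Google domain PageRank of the URL (assumed field)
    collected_at: datetime  # when the tweet carrying the URL was collected


def filter_documents(docs: List[Document], now: datetime,
                     min_rank: int = 4,
                     window: timedelta = timedelta(hours=6)) -> List[Document]:
    """Keep documents with rank greater than min_rank inside the time window."""
    return [
        doc for doc in docs
        if doc.domain_rank > min_rank and now - doc.collected_at <= window
    ]
```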

Evaluation: Relevancy

Collected the top 10 results from each of these sources: Twitter search, Social Health Signal, and Google time-bound search

Queries (frequently asked questions)

1) How to control diabetes?

2) What are the causes of diabetes?

3) What are the symptoms of diabetes?

Presented the top 10 results from all three sources for each query to participants

Participants judge each document for a query on a scale of 1 to 3 (1: not good, 2: good, 3: very good)

To calculate the average rank, we used the following formula*:

*http://help.surveymonkey.com/articles/en_US/kb/What-is-the-Rating-Average-and-how-is-it-calculated
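A rating average of this kind is a weighted mean: each rating value is multiplied by the number of participants who chose it, and the sum is divided by the total number of responses. The sketch below assumes that definition; `rating_average` is a hypothetical helper name.

```python
from typing import Dict


def rating_average(counts: Dict[int, int]) -> float:
    """Weighted mean of ratings; counts maps rating value -> response count."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no responses")
    return sum(rating * count for rating, count in counts.items()) / total
```

For instance, 2 responses of "1", 3 of "2", and 5 of "3" give (2 + 6 + 15) / 10 = 2.3.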

Evaluation: Relevancy

How to control diabetes?

Result 1

Result 2

Result 3

Score 1 Score 2 Score 3


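Query 1: Twitter search 40%, Social Health Signal 50%, Google time-bound 10%
Query 2: Twitter search 10%, Social Health Signal 60%, Google time-bound 50%
Query 3: Twitter search 40%, Social Health Signal 50%, Google time-bound 10%

Query 1: Twitter search 10%, Social Health Signal 10%, Google time-bound 40%
Query 2: Twitter search 30%, Social Health Signal 10%, Google time-bound 10%
Query 3: Twitter search 30%, Social Health Signal 20%, Google time-bound 70%

Query 1: Twitter search 50%, Social Health Signal 40%, Google time-bound 50%
Query 2: Twitter search 60%, Social Health Signal 30%, Google time-bound 40%
Query 3: Twitter search 30%, Social Health Signal 30%, Google time-bound 20%

Bad

Good

Very Good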

nDCG@K (Normalized Discounted Cumulative Gain)

nDCG@K can handle multiple levels of relevance

It gives more weight to a document at a higher ranking position than to one at a lower ranking position

Evaluation Metrics: Relevancy
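A minimal sketch of nDCG@K over graded relevance scores (here 1 to 3, as in the survey). It uses one common variant, gain divided by log2(position + 1); the exact gain and discount used for the reported numbers are not stated, so this is an assumption.

```python
import math
from typing import Optional, Sequence


def dcg(relevances: Sequence[float], k: Optional[int] = None) -> float:
    """Discounted cumulative gain of a ranked list of relevance scores."""
    ranked = relevances if k is None else relevances[:k]
    # Position 1 gets discount log2(2) = 1, position 2 gets log2(3), and so on.
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(ranked))


def ndcg(relevances: Sequence[float], k: Optional[int] = None) -> float:
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal else 0.0
```

A ranking that already lists the most relevant documents first scores 1.0; pushing relevant documents down the list lowers the score.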

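DCG: Twitter-Search 9.68, Social Health Signal 12.72, Google Time-Bound 9.12
IDCG: Twitter-Search 13.33, Social Health Signal 13.76, Google Time-Bound 9.81
nDCG: Twitter-Search 0.726, Social Health Signal 0.924, Google Time-Bound 0.929

DCG: Twitter-Search 9.67, Social Health Signal 13.15, Google Time-Bound 10.03
IDCG: Twitter-Search 10.55, Social Health Signal 14.15, Google Time-Bound 12.76
nDCG: Twitter-Search 0.91, Social Health Signal 0.92, Google Time-Bound 0.78

DCG: Twitter-Search 10.75, Social Health Signal 11.47, Google Time-Bound 10.76
IDCG: Twitter-Search 12.69, Social Health Signal 13.45, Google Time-Bound 10.89
nDCG: Twitter-Search 0.84, Social Health Signal 0.85, Google Time-Bound 0.98

Query 2

Query 1

Query 3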

Evaluation: Popularity

Query: How to control diabetes?

Google time-bound (Facebook (Share + Like) Counts, Twitter Share Counts)
4, 0
0, 0
0, 0
0, 4
0, 0
0, 0
1, 2
52, 1211
229, 0

Social Health Signal (Facebook (Share + Like) Counts, Twitter Share Counts)
3910, 1843
213, 8
81, 90
0, 128
149, 826
0, 24
0, 20
0, 24
0, 2

Google Time-Bound Search

Obesity cause diabetes Overweight cause diabetes

URL Title: Replacing sugary drinks with water may reduce diabetes risk
Extracted Pattern: 'obese is a major risk'

URL Title: More Evidence Links Diabetes to Alzheimer's Disease
Extracted Pattern: Overweight May Decrease Mortality Risk

URL Title: The facts about sugar
Extracted Pattern: ‘overweight can increase your risk’

URL Title: Having Diabetes Can Increase Your Alzheimer's Risk Via Blood Glucose And Brain Plaque Link
Extracted Pattern: obesity can also increase our risk

URL Title: Diabetes Study Suggests a Little Extra Weight Tied to Longer Survival
Extracted Pattern: risk for dying than overweight

Social Health Signal

Obesity less Mortality Risk

Demo

http://knoesis-twit.cs.wright.edu/SocialHealthSignal/

http://localhost:8080/SocialHealthSignal/health.jsp


Future Work

Evaluation

Relevancy on more Queries

Pattern Generator

Query Expansion based on Relevance Feedback

Semantic Categorization

Performance improvement for dynamic queries


Conclusion

◦ Twitter has become a popular tool for seeking health information.

◦ It is a very difficult task to extract relevant and reliable health documents from Twitter in near real time

◦ We address this problem by using state-of-the-art approaches such as

◦ Semantics-based pattern mining

◦ TF-IDF relevancy score on query expansion

◦ Content popularity: Social media share and like counts

◦ Reliability: Google domain page rank

Acknowledgements

Dr. Amit Sheth, Dr. T.K. Prasad, Dr. Tanvi Banerjee, Ashutosh Jadhav

Thanks!

Questions?

Screenshots

Home Screen

Search & Explore Screen

Top 10 URLs Screen

Tweet Locations Screen

Tweet Locations Screen
