Information Discovery on Vertical Domains


Page 1: Information Discovery on Vertical Domains

Information Discovery on Vertical Domains

Vagelis Hristidis
Assistant Professor
School of Computing and Information Sciences
Florida International University (FIU), Miami

Page 2: Information Discovery on Vertical Domains

Need for Information Discovery

Amount of available data increases
Needle in the haystack problem
Some applications:
◦ Web
◦ Desktop search
◦ Data warehousing
◦ Bibliographic databases
◦ Home and car search, e.g., realtor.com, autotrader.com
◦ Scientific domains, e.g.,
  genes, proteins, and publications in biology
  elements and interactions of components in chemistry
  patient hospitalizations, physician info, and procedure outcomes in hospitals

Vagelis Hristidis - FIU - Information Discovery on Vertical Domains 2

Page 3: Information Discovery on Vertical Domains

Strengths and Limitations of Current Approaches

Web Search
+ Scalability
+ Handles free text
+ Exploits content and link structure to achieve ranking
+ Simple keyword queries
- Limited query expressive power
- Generic, domain-independent ranking algorithms
- Returns pages, not answers

Database Querying
+ Efficient
+ Handles structured data
+ Well-defined theory and answers
- Must learn a query language, e.g., SQL
- No automatic ranking of results

Keyword Search in Databases
+ Simple keyword queries
+ Exploits links (e.g., primary-foreign keys)
- Generic ranking, typically by size of result
- No domain semantics

[Figure: example data graph with the following nodes]
p1: person[name="John", nation="US"]
l1: lineitem[quantity=10, shipdate=Oct 14 2001]
l2: lineitem[quantity=10, shipdate=Oct 15 2001]
pa1: part[partkey=1008, name="VCR"]
pa2: part[partkey=1009, name="VCR & DVD"]
pa3: part[partkey=1005, name="TV"]

Page 4: Information Discovery on Vertical Domains

Research Objective

Allow effective and efficient information discovery on vertical domains.
Strategy:
◦ Exploit associations between entities
◦ Model domain semantics, e.g., the patient entity is critical for a medical practitioner, but not for a biologist
◦ Model the users of a domain
◦ Use the knowledge of domain experts and existing knowledge structures (e.g., domain ontologies)
◦ Exploit user feedback
◦ Go beyond plain keyword search: explore the best search interface for each domain, e.g., faceted search

Page 5: Information Discovery on Vertical Domains

Specific Domains Studied (or being studied)

Products marketplace
Biological databases
Clinical databases
Bibliographic
Patents

Page 7: Information Discovery on Vertical Domains

Products Marketplace

Project started while visiting Microsoft Research at Redmond, in Summer 2003
SQL returns unordered sets of results
This overwhelms users of information discovery applications
How can ranking be introduced, given that ALL results satisfy the query?

Page 8: Information Discovery on Vertical Domains

Products Marketplace (cont’d)

Example: Realtor Database
House attributes: Price, City, Bedrooms, Bathrooms, SchoolDistrict, Waterfront, BoatDock, Year
Query: City = 'Seattle' AND Waterfront = TRUE
Too many results!
Intuitively, houses with a lower Price, more Bedrooms, or a BoatDock are generally preferable.

Comment (vagelis): All results have the same query-specified attribute values, since we focus on conjunctive queries, as I explain later.
Page 9: Information Discovery on Vertical Domains

Products Marketplace (cont’d)

Rank According to Unspecified Attributes [VLDB’04, TODS’06]

The score of a result tuple t depends on:
Global score: global importance of unspecified attribute values
◦ E.g., newer houses are generally preferred
Conditional score: correlations between specified and unspecified attribute values
◦ E.g., Waterfront → BoatDock; Many Bedrooms → Good School District

Comment (vagelis): So since all result tuples have identical query-specified attribute values...as introduced in MS CIDR03
Page 10: Information Discovery on Vertical Domains

Products Marketplace (cont’d)

Key Problems
Given a query Q:
◦ How to combine the global and conditional scores into a ranking function: use Probabilistic Information Retrieval (PIR).
◦ How to calculate the global and conditional scores: use the query workload and the data.

Comment (vagelis): In particular, I will show that the global and conditional parts of the ranking naturally appear by adapting PIR ranking techniques to our problem.
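
The way global and conditional scores could combine into one ranking function can be sketched as follows. This is a simplified illustration under stated assumptions, not the formula from the cited papers: the function names, the add-one smoothing, and the toy workload and data are all hypothetical. Each score is estimated as a ratio of a frequency in the query workload to the same frequency in the data.

```python
def fraction(records, condition):
    """Add-one smoothed fraction of records matching every attribute=value pair."""
    hits = sum(all(r.get(a) == v for a, v in condition.items()) for r in records)
    return (hits + 1) / (len(records) + 2)

def conditional(records, x, y):
    """Smoothed estimate of P(x | y) over records, for (attribute, value) pairs x and y."""
    both = sum(r.get(x[0]) == x[1] and r.get(y[0]) == y[1] for r in records)
    only_y = sum(r.get(y[0]) == y[1] for r in records)
    return (both + 1) / (only_y + 2)

def score(row, query, workload, data):
    """Product of global and conditional ratios over the unspecified attributes."""
    result = 1.0
    for y in row.items():
        if y[0] in query:
            continue  # specified attributes are identical across all results
        # Global part: value requested often in the workload relative to the data.
        result *= fraction(workload, dict([y])) / fraction(data, dict([y]))
        # Conditional part: correlation of y with each specified value x.
        for x in query.items():
            result *= conditional(workload, x, y) / conditional(data, x, y)
    return result

# Toy data (assumed): houses, and past queries given as attribute=value conditions.
data = [
    {"City": "Seattle", "Waterfront": True, "BoatDock": True},
    {"City": "Seattle", "Waterfront": True, "BoatDock": False},
    {"City": "Seattle", "Waterfront": False, "BoatDock": False},
    {"City": "Seattle", "Waterfront": False, "BoatDock": False},
]
workload = [
    {"Waterfront": True, "BoatDock": True},
    {"Waterfront": True, "BoatDock": True},
    {"City": "Seattle", "BoatDock": True},
    {"City": "Seattle"},
]
query = {"City": "Seattle", "Waterfront": True}

results = [r for r in data if all(r.get(a) == v for a, v in query.items())]
ranked = sorted(results, key=lambda r: score(r, query, workload, data), reverse=True)
```

In this toy setting the waterfront house with a boat dock ranks first, because past queries ask for BoatDock far more often than the data would suggest.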
Page 11: Information Discovery on Vertical Domains

Products Marketplace (cont’d)

Other Projects
Select the best attributes to output: the attribute ordering problem [SIGMOD’06]
◦ E.g., Color is important for sports cars but not much for family cars
Product advertising: select the best attributes to display for a product to maximize its visibility among its competitors [ICDE’08, TKDE’09]
◦ Use the past query workload
◦ Maximize the number of past queries for which the product is returned
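
The product advertising objective can be illustrated with a brute-force sketch. It assumes Boolean attributes and conjunctive past queries, each modeled as a set of required attributes; the function name `best_attributes` and the example data are hypothetical, and the cited papers may use different models and far more efficient algorithms.

```python
from itertools import combinations

def best_attributes(product_attrs, workload, m):
    """Pick the m attributes that make the product satisfy the most past queries."""
    best, best_hits = (), -1
    for subset in combinations(sorted(product_attrs), m):
        chosen = set(subset)
        # A conjunctive query returns the product only if every attribute
        # it requires is among the displayed ones.
        hits = sum(1 for q in workload if q <= chosen)
        if hits > best_hits:
            best, best_hits = subset, hits
    return set(best), best_hits

car = {"AC", "Turbo", "Sunroof", "FourDoor"}
past_queries = [{"AC"}, {"AC", "FourDoor"}, {"Turbo"},
                {"Sunroof", "Turbo"}, {"AC", "FourDoor"}]
chosen, hits = best_attributes(car, past_queries, 2)
```

Enumerating all subsets is exponential in the number of attributes, so this is only practical for small examples.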

Page 12: Information Discovery on Vertical Domains

Specific Domains Studied (or being studied)

Products marketplace
Biological databases
Clinical databases
Bibliographic
Patents

Page 13: Information Discovery on Vertical Domains

Biological Databases [EDBT’09]

With University of Maryland
Intuitive but powerful query language, based on soft (ranking) and hard (pruning) filters
Goal is to improve the experience of PubMed users
Exploit associations between entities (genes, proteins, publications)
Example query: Find the most important publications on “cancer” that are related to the “TNF” gene through a protein.

Page 14: Information Discovery on Vertical Domains

Results Navigation in PubMed with BioNav [ICDE’09, TKDE’10]

With SUNY Buffalo.
Most publications in PubMed are annotated with Medical Subject Headings (MeSH) terms.
Present results in the MeSH tree.
Propose a navigation model and smart expansion techniques that may skip tree levels.

Page 15: Information Discovery on Vertical Domains

BioNav: Exploring PubMed Results

[Figure: static navigation tree for query “prothymosin”. Visible concepts, with citation counts in parentheses; the remaining nodes are collapsed as “N more nodes”.]
MESH (313)
Amino Acids, Peptides, and Proteins (310)
Proteins (307)
Nucleoproteins (40)
Histones (15)
Biological Phenomena, … (217)
Cell Physiology (161)
Cell Growth Processes (99)
Genetic Processes (193)
Gene Expression (92)
Transcription, Genetic (25)

- Query keyword: prothymosin
- Number of results: 313
- Navigation tree stats:
  • # of nodes: 3941
  • depth: 10
  • total citations: 30897

Big tree with many duplicates!
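
Why a static navigation tree holds many duplicates can be seen in a small sketch. It assumes that each citation is attached to every concept it is annotated with and to all ancestors of those concepts, which is one plausible reading of the counts on the slide. The toy hierarchy, the citation IDs, and the function name are hypothetical.

```python
def navigation_tree_members(annotations, parent):
    """Map each concept to the set of citations at or below it in the hierarchy."""
    members = {}
    for citation, concepts in annotations.items():
        for concept in concepts:
            node = concept
            while node is not None:
                members.setdefault(node, set()).add(citation)
                node = parent.get(node)  # None once the root is passed
    return members

# Toy concept hierarchy (child -> parent), loosely named after the slide.
parent = {
    "Amino Acids, Peptides, and Proteins": "MeSH",
    "Proteins": "Amino Acids, Peptides, and Proteins",
    "Nucleoproteins": "Proteins",
    "Histones": "Nucleoproteins",
    "Genetic Processes": "MeSH",
    "Gene Expression": "Genetic Processes",
}
annotations = {
    "pmid1": ["Histones", "Gene Expression"],
    "pmid2": ["Proteins"],
    "pmid3": ["Gene Expression"],
}

members = navigation_tree_members(annotations, parent)
distinct_results = len(annotations)
total_citations = sum(len(cits) for cits in members.values())
```

Here 3 distinct results appear 13 times across 7 tree nodes, which mirrors on a small scale the gap between 313 results and 30897 total citations reported on the slide.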

Vagelis Hristidis, Searching and Exploring Biomedical Data 15

Page 16: Information Discovery on Vertical Domains

BioNav: Exploring PubMed Results

Reveal to the user a selected set of descendant concepts that:
(a) Collectively contain all results
(b) Minimize the expected user navigation cost
Unlike static navigation, not all children of the root are necessarily revealed.

Page 17: Information Discovery on Vertical Domains

BioNav Evaluation

[Figure: bar chart of overall navigation cost (# of concepts revealed + # of EXPAND actions), on a 0 to 20 axis, comparing Static and BioNav]

Page 18: Information Discovery on Vertical Domains

Specific Domains Studied (or being studied)

Products marketplace
Biological databases
Clinical databases
Bibliographic
Patents

Page 19: Information Discovery on Vertical Domains

XOntoRank: Use Ontologies to Search Electronic Medical Records [ICDE’09]

With Miami Children’s Hospital, Indiana University School of Medicine, IBM Almaden.
Latest EMR format: HL7 CDA (XML-based)
Algorithm to enhance keyword search using ontological knowledge (e.g., SNOMED)

[Figure: fragment of a medical dictionary (ontology). Concepts shown, with their codes:]
50043002 Disorder of Respiratory System
79688008 Respiratory Obstruction
118946009 Disorder of Thorax
41427001 Disorder of Bronchus
195967001 Asthma
301229001 Bronchial Finding
405944004 Asthmatic Bronchitis
266364000 Asthma Attack
955009 Bronchial Structure
82094008 Lower Respiratory Tract Structure
[Edges are labeled “Is a”, “May be”, and “Finding site of”; the individual edges are not recoverable from the transcript.]

Page 20: Information Discovery on Vertical Domains

SAMPLE CDA FRAGMENT

Page 21: Information Discovery on Vertical Domains

XOntoRank: Example 1

q = {“bronchitis”, “albuterol”}

result = Observation (code; value: Bronchitis; value: Albuterol)

Page 22: Information Discovery on Vertical Domains

XOntoRank: Example 2

q = {“asthma”, “albuterol”}

result = ???

Page 23: Information Discovery on Vertical Domains

XOntoRank

A CDA node may be associated with a query keyword w through the ontology.
XOntoRank first assigns scores to ontological concepts:
◦ OntoScore OS(): semantic relevance of a concept c in the ontology to a query keyword w.
Then, given these scores, it assigns Node Scores NS() to document nodes.
Other aggregation functions are possible.

Page 24: Information Discovery on Vertical Domains

Computing OntoScore of Concept Given Query Keyword

Three ways to view the ontology graph:
◦ As an unlabeled, undirected graph.
◦ As a taxonomy.
◦ As a complete set of relationships.
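
The first view, the ontology as an unlabeled, undirected graph, can be sketched as below. This is an illustration under assumptions rather than the XOntoRank algorithm itself: the per-hop decay factor, the substring keyword match, the use of max as the aggregation for node scores, and the three edges are all hypothetical. Only the concept codes and names come from the slides.

```python
from collections import deque

def onto_scores(edges, labels, keyword, decay=0.5):
    """OS(c, keyword) = decay ** (hops from the nearest concept whose label matches)."""
    graph = {}
    for a, b in edges:  # ignore edge labels and direction
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    seeds = [c for c, label in labels.items() if keyword.lower() in label.lower()]
    scores = {c: 1.0 for c in seeds}
    queue = deque(seeds)
    while queue:  # breadth-first search visits concepts in order of distance
        concept = queue.popleft()
        for neighbor in graph.get(concept, ()):
            if neighbor not in scores:
                scores[neighbor] = scores[concept] * decay
                queue.append(neighbor)
    return scores

def node_score(node_concepts, scores):
    """NS of a document node: best OntoScore among the concepts it references."""
    return max((scores.get(c, 0.0) for c in node_concepts), default=0.0)

labels = {
    "195967001": "Asthma",
    "41427001": "Disorder of Bronchus",
    "955009": "Bronchial Structure",
    "50043002": "Disorder of Respiratory System",
}
# Assumed edges, for illustration only.
edges = [("195967001", "41427001"), ("41427001", "955009"), ("41427001", "50043002")]

scores = onto_scores(edges, labels, "asthma")
ns = node_score({"41427001"}, scores)
```

A document node that references only “Disorder of Bronchus” still receives a non-zero score for the keyword “asthma”, which is the effect the ontology is meant to provide.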


Page 25: Information Discovery on Vertical Domains

Authority Flow Ranking in EMRs

A subset of the electronic health record dataset.
Work under submission.

[Figure: data graph with nodes v1 to v7, connected by edges labeled prescribed_to and recorded_by. Node contents:]
EventsPlan: TimeStampCreated="2004-11-03 11:57:00.0", Events="….small residual pericardial effusion….."
Hospitalization: TimeStampCreated="2004-10-27 22:00:00.0", History="18 year old boy with an aggressive form of chest lymphoma…", Allergies="NKDA"…...
Cardiac: PatientID="1438", Complication="apical impulse … Echo-large increasing pericardial effusion…"
Employee: TimeStampCreated="2004-12-23 14:03:00.0", Title="Pediatric Cardiologist"….
EventsPlan: Events="4 month old baby… pericardial effusion..."
Medication: TimeStampCreated="2003-02-13 21:57:00.0"..
Hospitalization: History="48 year old.."

Query: “pericardial effusion”
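
Authority flow ranking of this kind can be sketched as a random walk with restarts at the nodes that contain the query keywords (the base set). This is a minimal sketch in the spirit of ObjectRank; the damping factor, the edge weights, the iteration count, and the four-node graph are assumptions, not values from the dataset or the paper.

```python
def object_rank(edges, base_set, damping=0.85, iterations=50):
    """edges: {(src, dst): authority transfer weight}. Returns a score per node."""
    nodes = {n for edge in edges for n in edge} | set(base_set)
    # Random jumps land only on base-set nodes, i.e. nodes matching the query.
    jump = {n: (1.0 / len(base_set) if n in base_set else 0.0) for n in nodes}
    rank = dict(jump)
    for _ in range(iterations):
        new_rank = {n: (1 - damping) * jump[n] for n in nodes}
        for (src, dst), weight in edges.items():
            new_rank[dst] += damping * weight * rank[src]
        rank = new_rank
    return rank

# Hypothetical graph: v1 and v3 contain "pericardial effusion"; v2 and v4 do not.
edges = {("v1", "v2"): 0.5, ("v3", "v2"): 0.5, ("v2", "v4"): 0.3}
rank = object_rank(edges, base_set={"v1", "v3"})
```

Nodes v2 and v4 do not contain the keywords, yet they receive authority through their links to nodes that do, so they can appear in the ranked results.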

Page 26: Information Discovery on Vertical Domains

ObjectRank on EMRs: Authority Flow Ranking

[Figure: schema of the EMR dataset. Entities: Hospitalization, Employee, Associated_Events, Patient, Medication. Edge labels: created_by, recorded_by, prescribed_by, prescribed_to, of, for. Edge abbreviations shown: A-E, P-M, H-M, M-E, A-H, H-E, P-E.]

Page 27: Information Discovery on Vertical Domains

User Study

Page 28: Information Discovery on Vertical Domains

Explaining Subgraph

Page 29: Information Discovery on Vertical Domains

User Study Results

[Figure: two bar charts, Mean Sensitivity and Mean Specificity, each on a 0 to 1 axis, for the methods CO085BM25, BM25, CO085, and CO030]
BM25: traditional information retrieval ranking function
CO: Clinical ObjectRank (authority flow)

Page 30: Information Discovery on Vertical Domains

Other Challenges of Searching EMRs [NSF Symposium on Next Generation of Data Mining ’07]

Entity and association semantics
Negative statements
Personalization
Treatment of time and location attributes
Free text embedded in CDA document

Page 31: Information Discovery on Vertical Domains

Syntax vs. Semantics in Schema

Example: query “Asthma Theophylline”

More details at [Hristidis et al. NSF Symposium on Next Generation of Data Mining ’07]

Page 32: Information Discovery on Vertical Domains

Specific Domains Studied (or being studied)

Products marketplace
Biological databases
Clinical databases
Bibliographic
Patents

Page 33: Information Discovery on Vertical Domains

Bibliographic Databases

Work started while at UCSD
Exploit the citation link structure to create query-specific ranking [VLDB’04, TODS’08]
Demo available for database literature at http://dbir.cs.fiu.edu/BibObjectRank

Page 34: Information Discovery on Vertical Domains

Bibliographic Databases (cont’d)

Query Reformulation
Work with U of Maryland [ICDE’08]
Based on user-selected results:
◦ Perform query expansion: add query keywords or change their weights
◦ Adjust authority flow weights
Currently working on applying these ideas to queries on PubMed.
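
Content-based reformulation from user-selected results can be illustrated with a Rocchio-style update. This is a stand-in technique chosen for illustration, not the method of the cited paper: the `beta` parameter, the whitespace tokenization, and the cap on new terms are all assumptions.

```python
from collections import Counter

def reformulate(query_weights, selected_docs, beta=0.5, max_new_terms=2):
    """Boost query terms that are frequent in the selected results and add new ones."""
    centroid = Counter()
    for doc in selected_docs:
        centroid.update(doc.lower().split())
    n = len(selected_docs) or 1
    updated = dict(query_weights)
    # Change the weight of keywords already in the query.
    for term in query_weights:
        updated[term] += beta * centroid[term] / n
    # Add the most frequent terms that are not yet in the query.
    candidates = [(t, c) for t, c in centroid.most_common() if t not in query_weights]
    for term, count in candidates[:max_new_terms]:
        updated[term] = beta * count / n
    return updated

selected = ["index selection for olap", "range queries in olap data cubes"]
updated = reformulate({"olap": 1.0}, selected)
```

A real system would also remove stop words and, as the slide notes, adjust the authority flow weights; neither is shown here.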

Page 35: Information Discovery on Vertical Domains

Explaining Query Results – Explaining Subgraph

Target object: “Modeling Multidimensional Databases” paper.
Explaining subgraph creation:
1. BFS in reverse direction from the target object.
2. BFS in forward direction from the base set objects (authority sources).
3. The subgraph contains all nodes/edges traversed in the forward direction.
4. Compute the explaining authority flow along each edge by eliminating the authority leaving the subgraph (iterative procedure).
5. Structure-based reformulation: high-flow edges in the explaining subgraph receive a weight boost.
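
Steps 1 to 3 above can be sketched as two breadth-first searches. The sketch assumes that the forward search is restricted to nodes found by the reverse search, so that only nodes on some path from the base set to the target are kept. The function names and the toy graph are hypothetical, and the iterative flow computation of step 4 and the reformulation of step 5 are not shown.

```python
from collections import deque

def reachable(start_nodes, adjacency):
    """Breadth-first search: all nodes reachable from start_nodes."""
    seen = set(start_nodes)
    queue = deque(start_nodes)
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def explaining_subgraph(edges, base_set, target):
    forward, backward = {}, {}
    for src, dst in edges:
        forward.setdefault(src, []).append(dst)
        backward.setdefault(dst, []).append(src)
    # Step 1: reverse BFS finds every node that can send authority to the target.
    can_reach_target = reachable([target], backward)
    # Step 2: forward BFS from the base set, staying inside that node set.
    restricted = {s: [d for d in ds if d in can_reach_target]
                  for s, ds in forward.items()}
    starts = [b for b in base_set if b in can_reach_target]
    nodes = reachable(starts, restricted)
    # Step 3: keep the edges between the traversed nodes.
    sub_edges = [(s, d) for s, d in edges if s in nodes and d in nodes]
    return nodes, sub_edges

edges = [("v1", "v2"), ("v2", "v4"), ("v3", "v4"), ("v2", "v5"), ("v6", "v1")]
nodes, sub_edges = explaining_subgraph(edges, base_set={"v1", "v3"}, target="v4")
```

In the example, v5 is dropped because it cannot reach the target, and v6 is dropped because it is not reachable from the base set.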

[Figure: explaining subgraph with nodes v1 to v6; edges are annotated with authority flow values. Node contents:]
Paper: Authors=“H. Gupta, V. Harinarayan, A. Rajaraman, J. Ullman”, Title=“Index Selection for OLAP.”, Year=“ICDE 1997”
Paper: Authors=“C. Ho, R. Agrawal, N. Megiddo, R. Srikant”, Title=“Range Queries in OLAP Data Cubes.”, Year=“SIGMOD 1997”
Paper (TARGET OBJECT): Authors=“R. Agrawal, A. Gupta, S. Sarawagi”, Title=“Modeling Multidimensional Databases.”, Year=“ICDE 1997”
Author: Name=“R. Agrawal”
Year: Name=“ICDE”, Year=1997, Location=Birmingham
Conference: Name=“ICDE”
Edge flow values shown: 1.59e-7, 6.76e-6, 1.48e-4, 7.12e-6, 2.37e-6, 3.02e-4, 1.0e-4, 0.001, 6.76e-6, 7.12e-7, 9.55e-7

Page 36: Information Discovery on Vertical Domains

Specific Domains Studied (or being studied)

Products marketplace
Biological databases
Clinical databases
Bibliographic
Patents

Page 37: Information Discovery on Vertical Domains

Search Patents

Special characteristics of patents:
◦ Patents are organized into classes and subclasses.
◦ Patents have links to external publications and to other patents.
◦ Patents are organized into various sections (abstract, claims, description, and images).
◦ Patents use specific legal wording in the claims section. Further, claims have references to other claims; that is, claims can be viewed as a graph.
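
The idea that claims form a graph can be sketched by extracting claim references from claim text. The regular expression and the sample claims are assumptions for illustration; real claim language is more varied (for example, ranges such as “claims 1-3”), which this sketch does not handle.

```python
import re

# Matches "claim 1" or "claims 2"; captures only the first number that follows.
CLAIM_REF = re.compile(r"\bclaims?\s+(\d+)", re.IGNORECASE)

def claim_graph(claims):
    """claims: {claim number: text}. Returns {claim number: claims it refers to}."""
    graph = {}
    for number, text in claims.items():
        refs = {int(m) for m in CLAIM_REF.findall(text)}
        # Keep only references to other claims that exist in this patent.
        graph[number] = sorted(r for r in refs if r != number and r in claims)
    return graph

claims = {
    1: "A device comprising a sensor.",
    2: "The device of claim 1, wherein the sensor is optical.",
    3: "The device according to claim 2, further comprising a display.",
}
graph = claim_graph(claims)
```

The resulting adjacency lists link each dependent claim to the claim it builds on, which is the structure a graph-based ranking method could then exploit.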

Demo at PatentsSearcher.com

Page 38: Information Discovery on Vertical Domains

End - Thank You

For more information, please go to: http://ww.cis.fiu.edu/~vagelis

Supported by:
◦ NSF CAREER, 2010-2015
◦ NSF grant IIS-0811922: III-CXT-Small: Information Discovery on Domain Data Graphs, 2008-2011
◦ DHS grant 2009-ST-062-000016: Information Delivery and Knowledge Discovery for Hurricane Disaster Management, 2009-2011

Page 39: Information Discovery on Vertical Domains

Extra Slides


Page 40: Information Discovery on Vertical Domains

CDA Document – Tree View