Intensional Associations in Dataspaces Marcos Vaz Salles Cornell University Jens Dittrich Saarland University Lukas Blunschi ETH Zurich ICDE 2010


Page 1: Intensional Associations in Dataspaces

Intensional Associations in Dataspaces

Marcos Vaz Salles

Cornell University

Jens Dittrich

Saarland University

Lukas Blunschi

ETH Zurich

ICDE 2010

Page 2: Intensional Associations in Dataspaces

Irrelevant results that sound like Kevin Spacey

Potentially relevant results

What is missing?

Page 3: Intensional Associations in Dataspaces

Potentially relevant results

Items connected to Kevin Spacey by relationships:

• Colleagues who acted together with Kevin Spacey

• Other members of the Spacey family in the trade

• Other folks from NJ in the trade

Page 4: Intensional Associations in Dataspaces

Potentially relevant results

Items connected to Kevin Spacey by relationships:

• Movies in Common: Samuel L. Jackson (37), Tom Hanks (34), Robin Williams (34), Dustin Hoffman (34), Morgan Freeman (32)

• Same Last Name: John Graham Spacey (great-great-uncle!)

• Same Place of Birth: Zach Braff, Adam Horovitz, Andrew Shue, Joel Silver, Craig Kingsbury, Joseph Kraft, Drew Rosenhaus, Lauryn Hill, Stacey Kent

Page 5: Intensional Associations in Dataspaces

The Problem

• Keywords are not enough
– If an item is not tagged, it is not returned
– No meaningful definition of relatedness

• Relationships are essential, but hard to get right
– Searches do not include related items
– Adding relationships to search queries hurts response time
– The more flexible the definition of relatedness, the higher the cost

Page 6: Intensional Associations in Dataspaces

Our Solution

• Keywords are not enough
– Declarative mini-language to define intensional associations

• Relationships are essential, but hard to get right
– Special class of neighborhood-enriched search queries over virtual associations
– New index structure for neighborhood searches to process these queries efficiently

Page 7: Intensional Associations in Dataspaces

Association Trails

A: QL =θ⇒ QR

– QL, QR: search queries that select elements in the dataspace
– θ(L, R): join predicate that relates elements from the left with elements from the right

Meaning: Every element from the query on the left has an intensional edge to the θ-matching elements from the query on the right

• Example: Actors in the same movies

moviesInCommon: //person[type=“actor”] =θ1⇒ //person[type=“actor”],

θ1 = (∃ m1 ∈ L/movies : m1 ∈ R/movies)
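The definition above can be sketched in Python. This is only an illustration under stated assumptions, not the authors' implementation: items are plain dicts, the class `AssociationTrail` and its field names are hypothetical, the queries are modeled as simple predicates rather than path expressions, and self-pairs are skipped. The edges are computed with a nested-loop join purely to show the semantics.

```python
from dataclasses import dataclass
from typing import Callable

Item = dict  # hypothetical item representation: a dict of attributes


@dataclass(frozen=True)
class AssociationTrail:
    """Hypothetical model of A: QL =θ⇒ QR."""
    name: str
    query_left: Callable[[Item], bool]    # QL, modeled as a predicate
    query_right: Callable[[Item], bool]   # QR, modeled as a predicate
    theta: Callable[[Item, Item], bool]   # join predicate θ(L, R)

    def edges(self, items):
        """Yield (left name, right name) for every intensional edge."""
        items = list(items)
        lefts = [i for i in items if self.query_left(i)]
        rights = [i for i in items if self.query_right(i)]
        for l in lefts:
            for r in rights:
                # Assumption: an item is not considered related to itself.
                if l is not r and self.theta(l, r):
                    yield (l["name"], r["name"])


def is_actor(item):
    return item.get("type") == "actor"


movies_in_common = AssociationTrail(
    name="moviesInCommon",
    query_left=is_actor,
    query_right=is_actor,
    # θ1: some movie appears in both filmographies.
    theta=lambda l, r: bool(set(l["movies"]) & set(r["movies"])),
)

# Toy data, not taken from IMDb.
people = [
    {"name": "A", "type": "actor", "movies": ["m1", "m2"]},
    {"name": "B", "type": "actor", "movies": ["m2"]},
    {"name": "C", "type": "actor", "movies": ["m3"]},
    {"name": "D", "type": "director", "movies": ["m1"]},
]

edges = sorted(movies_in_common.edges(people))
print(edges)  # [('A', 'B'), ('B', 'A')]
```

Note that no edge is stored anywhere: the association is defined by the trail and only becomes explicit when the join is evaluated.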

Page 8: Intensional Associations in Dataspaces

Neighborhood Search Queries

• Combine search with pre-defined joins in association trails to get related items

• Examples:

– Search for “kevin spacey” also returns colleagues who acted together with him, other family members, etc.

– Search for “actors who won the Oscar” also returns other actors strongly related to this set by virtual associations

Search Results

Related Items
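A neighborhood search query can be sketched as a search followed by the joins defined in the trails. The function `neighborhood_query`, the dict-based items, and the toy data below are assumptions made for illustration; the trails are reduced to their join predicates, and the joins are evaluated at query time with no index.

```python
def neighborhood_query(items, search, trails):
    """Return the search results plus the related items reached via trails.

    `trails` maps a trail name to its join predicate theta(left, right).
    """
    results = [i for i in items if search(i)]
    result_names = {i["name"] for i in results}
    related = {}  # related item name -> names of the trails that reach it
    for trail_name, theta in trails.items():
        for l in results:
            for r in items:
                if r["name"] not in result_names and theta(l, r):
                    related.setdefault(r["name"], set()).add(trail_name)
    return sorted(result_names), related


# Toy data, not taken from IMDb.
items = [
    {"name": "A", "movies": {"m1"}, "placeOfBirth": "NJ"},
    {"name": "B", "movies": {"m1"}, "placeOfBirth": "CA"},
    {"name": "C", "movies": {"m2"}, "placeOfBirth": "NJ"},
    {"name": "D", "movies": {"m3"}, "placeOfBirth": "CA"},
]
trails = {
    "moviesInCommon": lambda l, r: bool(l["movies"] & r["movies"]),
    "samePlaceOfBirth": lambda l, r: l["placeOfBirth"] == r["placeOfBirth"],
}

results, related = neighborhood_query(items, lambda i: i["name"] == "A", trails)
print(results)  # ['A']
print(related)  # {'B': {'moviesInCommon'}, 'C': {'samePlaceOfBirth'}}
```

Every query in this sketch pays for a join between the results and all items, once per trail.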

Page 9: Intensional Associations in Dataspaces

Query Processing over Association Trails

• Intuition: Index at association trail definition time to avoid costly joins at runtime

• Naive Approach
– Materialize all association trails into a join index
– Probe the join index to get related items

Naive approach: given m association trails and n items, the index size is worst-case O(mn²)
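The naive approach can be sketched as follows, assuming dict-based items and trails reduced to their join predicates; the function names are hypothetical. With a single trail in which all n items match each other, the materialized index holds n(n − 1) edges, which is where the quadratic term comes from.

```python
from collections import defaultdict


def build_naive_index(items, trails):
    """Materialize every trail: index[trail][left] = set of right neighbors."""
    index = {name: defaultdict(set) for name in trails}
    for name, theta in trails.items():
        for l in items:
            for r in items:
                if l is not r and theta(l, r):
                    index[name][l["name"]].add(r["name"])
    return index


def index_size(index):
    """Number of materialized edges over all trails."""
    return sum(len(neigh) for per_trail in index.values()
               for neigh in per_trail.values())


def probe(index, result_names):
    """Look up the neighbors of each search result in every trail."""
    related = set()
    for per_trail in index.values():
        for name in result_names:
            related |= per_trail.get(name, set())
    return related - set(result_names)


# Toy worst case: six people who all share the same place of birth.
items = [{"name": f"p{i}", "placeOfBirth": "NJ"} for i in range(6)]
trails = {
    "samePlaceOfBirth": lambda l, r: l["placeOfBirth"] == r["placeOfBirth"],
}

index = build_naive_index(items, trails)
print(index_size(index))             # 30, i.e. 6 * 5 edges for one trail
print(sorted(probe(index, ["p0"])))  # ['p1', 'p2', 'p3', 'p4', 'p5']
```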

Page 10: Intensional Associations in Dataspaces

Grouping-Compressed Index (GCI)

• Still materializes join, but in compressed form

• Takes advantage of redundancy in join output
– O(mn) worst-case on equi-joins

• Intuition: for each clique, only represent the pivot, the edges from the pivot, and the elements in the clique

samePlaceOfBirth: θ(L,R) = (L.placeOfBirth = R.placeOfBirth)

[Figure: people grouped into cliques by place of birth (NJ, CA)]
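The grouping idea can be sketched for a single equi-join trail. This is a simplified illustration, not the index structure from the paper: the join key value stands in for the pivot, and the names `build_grouped_index`, `pivot_of`, and `clique` are assumptions. The point is only that each element is stored a constant number of times, so the size grows linearly in n per trail instead of quadratically.

```python
from collections import defaultdict


def build_grouped_index(items, key_attr):
    """Group items by join key instead of materializing pairwise edges."""
    pivot_of = {}               # element -> pivot (one edge per element)
    clique = defaultdict(list)  # pivot -> elements in the clique
    for item in items:
        key = item[key_attr]
        pivot_of[item["name"]] = key
        clique[key].append(item["name"])
    return pivot_of, clique


def grouped_size(pivot_of, clique):
    """Entries stored by the grouped representation."""
    return len(pivot_of) + sum(len(members) for members in clique.values())


def materialized_edges(clique):
    """Edges an uncompressed join index would store for the same cliques."""
    return sum(len(members) * (len(members) - 1) for members in clique.values())


# Seven people with the places of birth shown on the slide.
births = ["NJ", "NJ", "CA", "CA", "CA", "NJ", "NJ"]
items = [{"name": f"p{i}", "placeOfBirth": b}
         for i, b in enumerate(births, start=1)]

pivot_of, clique = build_grouped_index(items, "placeOfBirth")
print(grouped_size(pivot_of, clique))  # 14
print(materialized_edges(clique))      # 18, i.e. 4*3 + 3*2
```

On seven items the difference is small, but a clique of k elements costs about 2k entries here versus k(k − 1) materialized edges, so the gap widens quickly as cliques grow.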

Page 11: Intensional Associations in Dataspaces

Grouping-Compressed Index (GCI)

• Technical challenge is to answer neighborhood queries without decompressing

• Intuition: probe each pivot only once

samePlaceOfBirth: θ(L,R) = (L.placeOfBirth = R.placeOfBirth)

[Figure: search results for “actors who won the Oscar” over the NJ and CA place-of-birth cliques; each pivot is probed only once]

• Details in the full version of the paper
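Answering a neighborhood query directly on a grouped representation might look like the sketch below. It is a simplified stand-in for the technique on the slide, assuming an equi-join trail, the join key as pivot, and hypothetical names throughout: several search results that fall into the same clique trigger a single probe of that clique's pivot, and the pairwise edges are never expanded.

```python
from collections import defaultdict


def build_grouped_index(items, key_attr):
    """Group items by join key: element -> pivot, pivot -> clique members."""
    pivot_of, clique = {}, defaultdict(list)
    for item in items:
        pivot_of[item["name"]] = item[key_attr]
        clique[item[key_attr]].append(item["name"])
    return pivot_of, clique


def neighborhood(result_names, pivot_of, clique):
    """Return (related items, number of pivot probes) for the search results."""
    related = set()
    seen_pivots = set()
    probes = 0
    for name in result_names:
        pivot = pivot_of[name]
        if pivot in seen_pivots:
            continue  # this clique was already fetched for an earlier result
        seen_pivots.add(pivot)
        probes += 1
        related.update(clique[pivot])
    return related - set(result_names), probes


# Toy data with the places of birth shown on the slide.
births = ["NJ", "NJ", "CA", "CA", "CA", "NJ", "NJ"]
items = [{"name": f"p{i}", "placeOfBirth": b}
         for i, b in enumerate(births, start=1)]
pivot_of, clique = build_grouped_index(items, "placeOfBirth")

# Hypothetical search results: two people from NJ and one from CA.
related, probes = neighborhood(["p1", "p2", "p3"], pivot_of, clique)
print(sorted(related))  # ['p4', 'p5', 'p6', 'p7']
print(probes)           # 2
```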

Page 12: Intensional Associations in Dataspaces

Experiments with IMDb Dataset

• Dataset:
– IMDb biographies and filmographies
– ~2M people, ~1.5M movies

• Queries:
– Original search returns a subset of people
– Neighborhood processing includes all people related to the original set through association trails

• Association trails: moviesInCommon, samePlaceOfBirth, sameHeight, sameLastName, sameBirthdate

Page 13: Intensional Associations in Dataspaces

Experiments with IMDb Dataset

• Indexing: gains of over an order of magnitude

• Querying:
– Naive method is very sensitive to selectivity
– Querying the compressed index is comparable to querying the uncompressed one with high selectivity

Page 14: Intensional Associations in Dataspaces

Related Work

• Neighborhood queries in dataspaces / IR: Dong & Halevy [SIGMOD 2007], Carmel et al. [SIGIR 2003]

• Intensional Associations: Srivastava & Velegrakis [SIGMOD 2007]

• Graph Indexing: Trissl and Leser [SIGMOD 2007], Neumann & Weikum [VLDB 2008], Weiss et al. [VLDB 2008], XML

• Recursive Queries: Declarative Networking & Datalog [SIGMOD 2006]

Page 15: Intensional Associations in Dataspaces

Conclusion

• Association Trails
– Declarative mini-language to specify intensional associations in dataspaces

• Neighborhood Search Queries
– Return associated items along with search results
– Search combined with joins

• Grouping-Compressed Index (GCI)
– Efficient scheme to index intensional associations and process neighborhood search queries

Thank you!

Page 16: Intensional Associations in Dataspaces

Backup Slides

Page 17: Intensional Associations in Dataspaces

Association Trail Examples

• Actors in the same movies

moviesInCommon: //person[type=“actor”] =θ1⇒ //person[type=“actor”],

θ1 = (∃ m1 ∈ L/movies : m1 ∈ R/movies)

• Actors born in the same place

samePOB: //person[type=“actor”] =θ2⇒ //person[type=“actor”],

θ2 = (L.placeOfBirth = R.placeOfBirth)