Page 1: Database Laboratory University of A Coruña A Coruña, Spain

Retrieving Documents with Geographic References Using a Spatial Index Structure Based on Ontologies

Database Laboratory, University of A Coruña

A Coruña, Spain

Miguel R. Luaces, Ángeles S. Places, Francisco J. Rodríguez, Diego Seco

20th October, 2008 Barcelona - SeCoGIS 2008 2/29

Motivation

- The Database Laboratory at the University of A Coruña works in two very active research fields:
  - Geographic Information Systems (GIS): EIEL Project (http://www.dicoruna.es/webeiel)
  - Information Retrieval (IR): Galician Virtual Library (http://bvg.udc.es)
- GIS + IR: Geographic Information Retrieval (GIR), the retrieval of geographically and thematically relevant documents in response to a query of the form <theme, location>

Motivation

- Many of the documents stored in digital libraries and document databases include geographic references
  - Example: “…the hurricane touched land at Veracruz…”
- Few index structures or retrieval algorithms take these geographic references into account
- Some proposals have appeared recently, but they do not take into account some particularities of geographic space:
  - Hierarchical nature of geographic space
  - Topological relationships between the geographic objects

Contents

Motivation

Related Work

Architecture

Index Structure

Supported Query Types

Experiments

Conclusions and Future Work

Related Work

- Many index structures and techniques have been proposed in each area:
  - IR: Inverted Index
    - Geographic references are treated as normal words
  - GIS: R-Tree
    - These structures do not take into consideration the hierarchy of space
- The combination of both types of indexes (SPIRIT project):
  - Text-First
  - Geo-First
  - They do not take into account the relationships between the geographic objects that they are indexing
- An ontology can describe the specific characteristics of geographic space:
  - Ontologies are used in query expansion, relevance rankings, and web resource annotation
  - Nobody has tried to combine an ontology with other types of indexes to obtain a hybrid structure

Architecture

[Figure: system architecture. Components shown: Administration User Interface, Query User Interface, Geographic Information Retrieval Module, Geometry Supplier Service, Gazetteer Service, WMS, Query Evaluation Service, Document Abstraction, Index Construction, Index Structure, Geographic Database, Document Database and Geographic Space Ontology Service. Documents are depicted as "Text, place name, text". The ontology defines the classes Continent, Country, Administrative Region and Populated Place, each linked to the next by a "contains" relationship. The ontology instance shown is Europe, Spain, Galicia, Coruña.]

Architecture

- Document storage workflow:
  - Keyword extraction
    - Classic IR techniques (stopword removal, stemming)
  - Build the index structure
    - Natural Language Processing techniques
    - Gazetteer Service
    - Geographic Space Ontology Service
- Processing services:
  - Query solving service
  - Web Map Service (WMS)
  - Geographic information retrieval module
- User Interface:
  - Administration (manage the document collection)
  - Query
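The document storage workflow could be sketched roughly as follows. This is only an illustration under stated assumptions: the stopword list, the tiny gazetteer, the crude suffix stripper and the function names are all hypothetical stand-ins for the real IR techniques, Gazetteer Service and NLP steps, which the slides do not detail. Keeping place names out of the keyword list is also a simplification made here.

```python
import re

# Hypothetical resources; a real system would use a full stopword list
# and query a gazetteer service instead of a hard-coded set.
STOPWORDS = {"the", "at", "a", "of", "in", "and"}
GAZETTEER = {"veracruz", "spain", "galicia"}


def naive_stem(word: str) -> str:
    """Very crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def abstract_document(text: str) -> tuple[list[str], list[str]]:
    """Split a document into stemmed keywords and candidate place names."""
    tokens = re.findall(r"[a-zñáéíóú]+", text.lower())
    keywords, places = [], []
    for token in tokens:
        if token in GAZETTEER:
            places.append(token)        # candidate geographic reference
        elif token not in STOPWORDS:
            keywords.append(naive_stem(token))
    return keywords, places


keywords, places = abstract_document("The hurricane touched land at Veracruz")
print(keywords, places)
```

The keywords would feed the inverted index, while the place names would be placed in the ontology-based tree.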

Index Structure

[Figure: the index structure. An inverted index maps each word to its document identifiers (hotel: 1,3,7,8,12,…; sea: 3,5,6,9,10,…). A place name hash table (entries such as Germany and Spain) points to nodes of the ontology-based tree (Europe, Asia, …; Spain, Germany; Madrid, Galicia; Coruña). Each tree node stores a list of document identifiers (DocIds 2,3,…).]

Index Structure

- Based on a spatial ontology
  - It models both the vocabulary and the spatial structure of places for purposes of information retrieval
- Tree composed of nodes that represent place names
  - Each node stores:
    - Keyword (a place name)
    - Bounding box
    - Document identifiers
    - Children nodes
  - These nodes are connected by means of inclusion relationships
  - If the list of children nodes exceeds a threshold, an R-Tree is used
- Auxiliary structures:
  - Place name hash table
  - Traditional Inverted Index
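A minimal sketch of these three structures might look like the following. The class name `PlaceNode`, the helper `build_place_table`, the bounding-box coordinates and the assignment of document identifiers to particular nodes are all assumptions made for illustration. The R-Tree that would index a long list of children is left out.

```python
from dataclasses import dataclass, field


@dataclass
class PlaceNode:
    keyword: str                                  # place name
    bbox: tuple[float, float, float, float]       # (min_x, min_y, max_x, max_y)
    doc_ids: set[int] = field(default_factory=set)
    children: list["PlaceNode"] = field(default_factory=list)

    def add_child(self, child: "PlaceNode") -> "PlaceNode":
        """Attach a place contained in this one and return it."""
        self.children.append(child)
        return child


# Made-up bounding boxes and illustrative document identifiers.
europe = PlaceNode("Europe", (-25.0, 34.0, 45.0, 72.0))
spain = europe.add_child(PlaceNode("Spain", (-9.5, 36.0, 3.5, 43.8), {2, 3}))
galicia = spain.add_child(PlaceNode("Galicia", (-9.3, 41.8, -6.7, 43.8), {5, 7}))
coruna = galicia.add_child(PlaceNode("Coruña", (-8.5, 43.3, -8.3, 43.4), {12, 14}))


def build_place_table(root: PlaceNode) -> dict[str, PlaceNode]:
    """Place name hash table: place name -> node of the tree."""
    table, stack = {}, [root]
    while stack:
        node = stack.pop()
        table[node.keyword] = node
        stack.extend(node.children)
    return table


place_table = build_place_table(europe)

# Traditional inverted index: word -> document identifiers.
inverted_index = {"hotel": {1, 3, 7, 8, 12}, "sea": {3, 5, 6, 9, 10}}
```

Because the hash table stores references to the tree nodes themselves, a lookup such as `place_table["Spain"]` reaches the node without descending from the root.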

Index Structure

- Advantages:
  - All textual queries and spatial queries can be processed efficiently
  - Queries combining textual and spatial aspects are supported
  - Updates and optimizations in each index are handled independently
- Drawbacks:
  - The tree that supports the structure is possibly unbalanced
  - The structure is static

Supported Query Types

- Pure textual queries
  - “Retrieve all documents where the words hotel and sea appear”
  - How do we solve it?
    - Inverted Index
- Pure spatial queries
  - “Retrieve all documents that refer to the following geographic area”
  - How do we solve it?
    - Descend the structure + refine the result
    - The same algorithm that is used with spatial indexes
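Both query types can be sketched in a few lines. The names `Node`, `textual_query` and `spatial_query` and the toy coordinates are hypothetical. The spatial part assumes a plain window query that collects the documents of every node whose bounding box intersects the window. The refinement step against exact geometries is not described in the slides and is omitted here.

```python
from dataclasses import dataclass, field

Box = tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


@dataclass
class Node:
    keyword: str
    bbox: Box
    doc_ids: set[int] = field(default_factory=set)
    children: list["Node"] = field(default_factory=list)


def intersects(a: Box, b: Box) -> bool:
    """True when two axis-aligned boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def textual_query(inverted_index: dict[str, set[int]], words: list[str]) -> set[int]:
    """Documents containing every query word (intersection of posting sets)."""
    postings = [inverted_index.get(word, set()) for word in words]
    return set.intersection(*postings) if postings else set()


def spatial_query(node: Node, window: Box) -> set[int]:
    """Descend only into nodes whose bounding box intersects the window."""
    if not intersects(node.bbox, window):
        return set()                      # prune this whole subtree
    result = set(node.doc_ids)
    for child in node.children:
        result |= spatial_query(child, window)
    return result


# Toy data with made-up coordinates.
galicia = Node("Galicia", (10, 30, 20, 40), {5, 7})
spain = Node("Spain", (10, 10, 40, 40), {2, 3}, [galicia])
germany = Node("Germany", (50, 50, 80, 90), {9})
europe = Node("Europe", (0, 0, 100, 100), set(), [spain, germany])
index = {"hotel": {1, 3, 7, 8, 12}, "sea": {3, 5, 6, 9, 10}}

print(textual_query(index, ["hotel", "sea"]))       # {3}
print(spatial_query(europe, (12, 32, 15, 35)))      # documents of Spain and Galicia
```

The Germany subtree is never visited in this example because its bounding box does not intersect the query window.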

Supported Query Types

- Textual queries over a geographic area
  - “Retrieve all documents with the word hotel that refer to the following area”
  - How do we solve it?
    - Example
  - Geographic references can be given using place names

Supported Query Types

[Figure: example of a textual query over a geographic area. The inverted index gives the text result for hotel (1,3,7,8,12,…). The query window is used to descend the index structure (Europe, Asia, …; Spain, Portugal; Madrid, Galicia; Coruña) down to the Coruña node (DocIds 12,14,…), which gives the spatial result (12,14,…). The query result is the intersection of both: 12,….]
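The example can be reproduced with a small sketch that intersects the text result with the spatial result. Everything below is an assumption made for illustration: the names, the coordinates, and the choice to attach documents only to the Coruña node. The spatial descent is a plain bounding-box window query without the exact-geometry refinement.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    bbox: tuple                     # (min_x, min_y, max_x, max_y), made-up values
    doc_ids: set = field(default_factory=set)
    children: list = field(default_factory=list)


def overlaps(a, b):
    """True when two axis-aligned boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def spatial_result(node, window):
    """Collect documents of all nodes whose bounding box meets the window."""
    if not overlaps(node.bbox, window):
        return set()
    docs = set(node.doc_ids)
    for child in node.children:
        docs |= spatial_result(child, window)
    return docs


def text_over_area(inverted_index, root, word, window):
    """Text result intersected with spatial result."""
    text = inverted_index.get(word, set())
    return text & spatial_result(root, window)


coruna = Node("Coruña", (1, 8, 2, 9), {12, 14})
galicia = Node("Galicia", (0, 7, 3, 10), children=[coruna])
madrid = Node("Madrid", (5, 4, 6, 5))
spain = Node("Spain", (0, 0, 10, 10), children=[madrid, galicia])
portugal = Node("Portugal", (-3, 0, -0.5, 9))
europe = Node("Europe", (-10, -5, 40, 40), children=[portugal, spain])

inverted_index = {"hotel": {1, 3, 7, 8, 12}, "sea": {3, 5, 6, 9, 10}}
window = (1.2, 8.2, 1.8, 8.8)      # query window inside Coruña

result = text_over_area(inverted_index, europe, "hotel", window)
print(result)                       # {12}
```

The two partial results are computed independently, which matches the stated advantage that each index can be updated and optimized on its own.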

Supported Query Types

- Textual queries with place names
  - “Retrieve all documents with the word hotel that refer to Spain”
  - How do we solve it?
    - Example
  - We save some time by avoiding the tree traversal from the root: the place name hash table leads directly to the node

Supported Query Types

[Figure: example of a textual query with a place name, "hotel" in Spain. The inverted index gives the text result for hotel (1,3,7,8,12,…). The place name hash table leads directly to the Spain node of the index structure (Europe, Asia, …; Spain, Germany; Madrid, Galicia). The spatial result is built from the documents of Spain and of its subtree, Galicia and Madrid (DocIds 2,3,…; 5,7,…; 12,14,…), giving 2,3,5,7,12,14,…. The query result is the intersection of both: 3,7,12,….]

Supported Query Types

- Another improvement: query expansion
  - “Retrieve all documents that refer to Spain”
  - How do we solve it?
    - The Query Evaluation Service discovers that Spain is a geographic reference
    - The Place Name Hash Table quickly locates the internal node that represents the geographic object Spain
    - All the documents associated with this node are part of the result
    - All the documents associated with the subtree are part of the result
  - The result contains not only those documents that include the term Spain, but also all documents that contain the name of a geographic object included in Spain
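Query expansion through the place name hash table can be sketched as below. The names `expand_place` and `text_with_place` are hypothetical, and which node holds which document identifiers is an assumption; only the identifier lists themselves come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    doc_ids: set = field(default_factory=set)
    children: list = field(default_factory=list)


def subtree_docs(node):
    """Documents of a node plus those of every place it contains."""
    docs = set(node.doc_ids)
    for child in node.children:
        docs |= subtree_docs(child)
    return docs


def build_place_table(root):
    """Place name hash table: name -> tree node."""
    table, stack = {}, [root]
    while stack:
        node = stack.pop()
        table[node.name] = node
        stack.extend(node.children)
    return table


def expand_place(place_table, place):
    """Jump straight to the node for `place` and gather its subtree."""
    node = place_table.get(place)
    return subtree_docs(node) if node else set()


def text_with_place(inverted_index, place_table, word, place):
    """Documents containing `word` that refer to `place` or anything inside it."""
    return inverted_index.get(word, set()) & expand_place(place_table, place)


# Assumed assignment of the document lists to nodes.
spain = Node("Spain", {2, 3}, [Node("Galicia", {5, 7}), Node("Madrid", {12, 14})])
germany = Node("Germany", {4})
europe = Node("Europe", set(), [spain, germany])

place_table = build_place_table(europe)
inverted_index = {"hotel": {1, 3, 7, 8, 12}, "sea": {3, 5, 6, 9, 10}}

print(expand_place(place_table, "Spain"))                             # {2, 3, 5, 7, 12, 14}
print(text_with_place(inverted_index, place_table, "hotel", "Spain"))  # {3, 7, 12}
```

A document that mentions only Galicia or Madrid is returned for the query "Spain" even though the word Spain never appears in it, which is the point of the expansion.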

Demo

Experiments

Document collections

                                  FT-91      FT-94
# documents                       5,368     71,489
# textual terms                  64,711    251,057
# place name candidates          23,550    299,661
# documents with candidates       4,652     60,823
# documents with place names      4,182     54,899
# place names                   167,774  2,274,146
# spatial nodes                  21,282     64,843

Experiments

Randomly generated pure spatial queries

Ontology-based index versus R-Tree [FT-91]

Query area              0.001%   0.01%    0.1%      1%
Ontology-based index     0.013   0.017   0.052   0.360
R-Tree                   0.010   0.016   0.057   0.370

Experiments

Ontology-based index versus R-Tree (zones of high document density) [FT-91]

Query area              0.001%   0.01%    0.1%      1%
Ontology-based index      0.03    0.11    1.05    9.84
R-Tree                    0.07    0.22    1.64   12.85

Ontology-based index versus R-Tree (zones of low document density) [FT-91]

Query area              0.001%   0.01%    0.1%      1%
Ontology-based index      0.02    0.03    0.09     0.4
R-Tree                    0.02    0.03    0.07     0.2

Experiments

Improved R-Tree
  - List of documents for each location

Ontology-based index versus Improved R-Tree [FT-91]

Query area              0.001%   0.01%    0.1%      1%
Ontology-based index     0.013   0.017   0.052   0.360
Improved R-Tree          0.007   0.008   0.022   0.145

Ontology-based index versus Improved R-Tree [FT-94]

Query area              0.001%   0.01%    0.1%      1%
Ontology-based index     0.016   0.036   0.223   3.017
Improved R-Tree          0.007   0.017   0.097   1.056
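To read the comparison with the Improved R-Tree more easily, the ratio between the two rows can be computed for each query area. The snippet below only restates the numbers reported on this slide. The slides do not state the time unit, which does not matter for a ratio, and the names `TIMES` and `slowdown` are hypothetical.

```python
# Times copied from the slide; the unit is not stated there.
QUERY_AREAS = ["0.001%", "0.01%", "0.1%", "1%"]
TIMES = {
    "FT-91": {
        "ontology": [0.013, 0.017, 0.052, 0.360],
        "improved_rtree": [0.007, 0.008, 0.022, 0.145],
    },
    "FT-94": {
        "ontology": [0.016, 0.036, 0.223, 3.017],
        "improved_rtree": [0.007, 0.017, 0.097, 1.056],
    },
}


def slowdown(collection):
    """Ontology-based time divided by Improved R-Tree time, per query area."""
    times = TIMES[collection]
    return [o / r for o, r in zip(times["ontology"], times["improved_rtree"])]


for name in TIMES:
    for area, ratio in zip(QUERY_AREAS, slowdown(name)):
        print(f"{name} {area:>7}: {ratio:.2f}x")
```

With these figures the ontology-based index takes roughly two to three times as long as the Improved R-Tree for pure spatial queries on both collections.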

Conclusions

- We have defined an architecture for Geographic Information Retrieval systems
- It takes into account both textual and geographic references in the documents
- This is achieved by a new index structure that combines an inverted index, a spatial index and an ontology
- Traditional queries and new types of queries can be solved

Future Work

- Evaluate the performance of the index structure
- Define algorithms to rank the retrieved documents
- Improve the disambiguation process of place names
- Explore the use of different ontologies
- Include other types of spatial relationships (e.g., adjacency)

Contact: [email protected]