Extracting Metadata for Spatially-Aware Information Retrieval on the Internet
Research Paper Presentation – CS572 Summer 2011
Presented by Donghee Sung
Paper by Paul Clough (University of Sheffield, Western Bank)
Short Overview
SPIRIT: spatial awareness for information systems, e.g.
- transport timetables
- routing systems for motorists
- map-based web sites
- location-based services
Key part: extraction and use of geospatial information
Short Overview
Criteria: Speed, Reliability, Flexibility, Multilingualism
Geo-Parsing:
- Identifying geographic references
- Gazetteer lookup with context rules to filter out common-usage words and personal names
Geo-Coding:
- Assigning spatial coordinates
- Based on information from geographic resources
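The two stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SPIRIT implementation: the miniature `GAZETTEER`, the `Footprint` type, and the function names are hypothetical, the coordinates are approximate city-centre values, and a capitalisation check stands in for the context rules.

```python
from typing import Dict, List, NamedTuple


class Footprint(NamedTuple):
    """A point footprint in decimal degrees (hypothetical type)."""
    lat: float
    lon: float


# Hypothetical miniature gazetteer; coordinates are approximate.
GAZETTEER: Dict[str, Footprint] = {
    "sheffield": Footprint(53.38, -1.47),
    "london": Footprint(51.51, -0.13),
    "tampere": Footprint(61.50, 23.76),
}


def geo_parse(text: str) -> List[str]:
    """Geo-parsing: return capitalised tokens that match a gazetteer entry."""
    candidates = []
    for token in text.split():
        word = token.strip(".,;:!?()\"'")
        if word[:1].isupper() and word.lower() in GAZETTEER:
            candidates.append(word)
    return candidates


def geo_code(place: str) -> Footprint:
    """Geo-coding: assign coordinates taken from the geographic resource."""
    return GAZETTEER[place.lower()]


def extract_metadata(text: str) -> Dict[str, Footprint]:
    """Run both stages and return place name -> footprint."""
    return {place: geo_code(place) for place in geo_parse(text)}


metadata = extract_metadata("The paper came from Sheffield, not London.")
```

Here `metadata` maps "Sheffield" and "London" to their footprints. A real gazetteer would also have to handle multi-word names and ambiguous entries, which this sketch ignores.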
What’s the SPIRIT?
< http://www.geo-spirit.org/ >
What’s the SPIRIT?
SPIRIT: SPatially-Aware Information Retrieval on the InterneT
A search engine to find documents and datasets on the web relating to places or regions
What’s the SPIRIT?
Existing web search facilities are poor at finding information related to a particular location.
- Vicinity: find other places within a radius
- www.somewherenear.com
- Yellow pages services: find a specific place or post code
- Buyukkokten: associated the admin's IP with a telephone area code
- Stanford Research Institute: proposed '.geo' with cells of latitude and longitude
What’s the SPIRIT?
Resources relating to a place:
- may not be found
- may not be places nearby
- may have another name
Major shortcoming: cannot recognize alternative names
- modern/historical variants
- informal names
- names of contained places
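One way to picture the fix is a lookup table that maps every variant to a canonical name, so a query on any one name can be expanded to all of them. The sketch below is illustrative only: the `ALTERNATIVE_NAMES` table and the function names are hypothetical, and the single entry lists one historical variant, one informal name, and one contained place for London.

```python
from typing import Dict, List, Optional

# Hypothetical alias table: canonical name -> alternative names.
ALTERNATIVE_NAMES: Dict[str, List[str]] = {
    "London": ["Londinium", "The Big Smoke", "Westminster"],
}


def _build_lookup() -> Dict[str, str]:
    """Map every lower-cased name (canonical or variant) to its canonical form."""
    lookup = {}
    for canonical, variants in ALTERNATIVE_NAMES.items():
        lookup[canonical.lower()] = canonical
        for variant in variants:
            lookup[variant.lower()] = canonical
    return lookup


LOOKUP = _build_lookup()


def resolve(name: str) -> Optional[str]:
    """Return the canonical place name, or None if the name is unknown."""
    return LOOKUP.get(name.strip().lower())


def expand_place_query(name: str) -> List[str]:
    """Expand a place name to the canonical name plus all known variants."""
    canonical = resolve(name)
    if canonical is None:
        return [name]  # unknown name: search for it unchanged
    return [canonical] + ALTERNATIVE_NAMES[canonical]


expanded = expand_place_query("Londinium")
```

Treating a contained place as a plain alias of its container is a simplification. A fuller model would keep the containment relation separate from true synonyms.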
What’s the SPIRIT?
SPIRIT Project
- Query expansion / relevance ranking procedures
- Machine learning techniques
  - extraction of geographical context
  - generating metadata
- Multi-modal user interface
  - textual input
  - interactive map feedback
- Spatial indices for web collections
Data Sources
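Sources of spatial data: TGN (Getty Thesaurus of Geographic Names), OS (Ordnance Survey), SABE (Seamless Administrative Boundaries of Europe)
The SPIRIT collection: a large web collection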
Geo-Parsing Techniques
- Tokenization issues
- Stop-words
- Named-Entity Recognition (NER)
- Gazetteers
Geo-Parsing Techniques
Named-Entity Recognition (NER)
- Processing a text to identify phrases and assign them to particular categories of named entities (NE)
- e.g. People, Organization, Location, etc.
Geo-Parsing Techniques
Tokenization Procedure
1) Tokenize on whitespace (Perl regular expression): @words = split(/\s+/, $sentence);
   "Isn't it ashame." -> Isn't / it / ashame.
2) Stemming / case conversion: isn't / it / asham
3) Removing stop-words
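The three steps above can be sketched in Python. This is a rough illustration, not the paper's code: `crude_stem` is a toy suffix stripper standing in for a real stemmer such as Porter's, and the `STOP_WORDS` set is a hypothetical example list.

```python
from typing import List

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"it", "the", "a", "is"}


def tokenize(sentence: str) -> List[str]:
    """Step 1: split on runs of whitespace, roughly like Perl's split(/\\s+/, ...)."""
    return sentence.split()


def crude_stem(term: str) -> str:
    """Toy stemmer: strip one common suffix if at least 3 characters remain."""
    for suffix in ("ing", "ers", "er", "es", "s", "e"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def normalize(token: str) -> str:
    """Step 2: case conversion, punctuation trimming, then stemming."""
    return crude_stem(token.lower().strip(".,;:!?\""))


def preprocess(sentence: str) -> List[str]:
    """Steps 1-3: tokenize, normalize, then drop stop-words."""
    terms = [normalize(token) for token in tokenize(sentence)]
    return [term for term in terms if term and term not in STOP_WORDS]


tokens = tokenize("Isn't it ashame.")    # ["Isn't", "it", "ashame."]
terms = preprocess("Isn't it ashame.")   # ["isn't", "asham"]
```

Splitting on whitespace keeps punctuation attached to tokens ("ashame."), so it has to be trimmed before stemming and gazetteer matching.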
Geo-Parsing Techniques
Default settings for indexing and retrieval
- Case sensitivity: Off
- Stop-word removal: Off
- Stemming: Off
Stop-word removal / stemming -> reduce the size of index files
But these can be useful:
- Stop-words: 'in', 'inside', or 'of'
- Stemming: "London" from "London" & "Londoner"
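A small Python sketch can make the trade-off concrete. It assumes a hypothetical `IndexOptions` record whose defaults mirror the "Off" settings listed here, a toy stemming rule, and an illustrative stop-word set; none of these names come from the SPIRIT system.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class IndexOptions:
    """Hypothetical indexing options; defaults mirror the 'Off' settings."""
    case_sensitive: bool = False
    remove_stop_words: bool = False
    stemming: bool = False


# Illustrative stop-word list that includes spatial prepositions.
STOP_WORDS = frozenset({"in", "inside", "of", "the", "a"})


def toy_stem(term: str) -> str:
    """Toy rule only: strip a trailing 'er', so 'londoner' -> 'london'."""
    return term[:-2] if term.endswith("er") and len(term) > 4 else term


def index_terms(text: str, options: IndexOptions = IndexOptions()) -> Set[str]:
    """Return the set of distinct index terms for a text under the given options."""
    terms = set()
    for token in text.split():
        term = token.strip(".,;:!?")
        if not options.case_sensitive:
            term = term.lower()
        if options.remove_stop_words and term.lower() in STOP_WORDS:
            continue
        if options.stemming:
            term = toy_stem(term)
        if term:
            terms.add(term)
    return terms


text = "A Londoner inside London met a Londoner"
default_terms = index_terms(text)
reduced_terms = index_terms(text, IndexOptions(remove_stop_words=True, stemming=True))
```

With the defaults the index keeps five distinct terms, including "inside". With stop-word removal and stemming switched on it shrinks to "london" and "met", which is smaller but no longer records the spatial preposition.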
Geo-Parsing Techniques
Filtering candidate locations using context rules to remove:
- stop-words
- references to people and organizations
- links to emails/URLs
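The filtering step might look like the following Python sketch. The rules are simplified stand-ins chosen for illustration (a preceding personal title, a following organization marker, an email/URL pattern, and a capitalisation check for common-usage words). The gazetteer, word lists, and function name are all hypothetical.

```python
import re
from typing import List

# Hypothetical resources, for illustration only.
GAZETTEER = {"sheffield", "london", "reading", "bath"}
PERSON_TITLES = {"mr", "mrs", "ms", "dr", "prof"}
ORG_MARKERS = {"ltd", "plc", "inc"}
URL_OR_EMAIL = re.compile(r"https?://|www\.|@", re.IGNORECASE)
PUNCTUATION = ".,;:!?()\"'"


def filter_locations(text: str) -> List[str]:
    """Return gazetteer matches that survive the simple context rules."""
    tokens = text.split()
    kept = []
    for i, raw in enumerate(tokens):
        if URL_OR_EMAIL.search(raw):
            continue  # rule: ignore names inside emails/URLs
        word = raw.strip(PUNCTUATION)
        if word.lower() not in GAZETTEER:
            continue
        if not word[:1].isupper():
            continue  # rule: lower-case match is treated as a common-usage word
        prev_word = tokens[i - 1].strip(PUNCTUATION).lower() if i > 0 else ""
        next_word = tokens[i + 1].strip(PUNCTUATION).lower() if i + 1 < len(tokens) else ""
        if prev_word in PERSON_TITLES:
            continue  # rule: "Dr. Sheffield" refers to a person
        if next_word in ORG_MARKERS:
            continue  # rule: "London Ltd" refers to an organization
        kept.append(word)
    return kept


sample = ("Dr. Sheffield enjoys reading in Bath; email info@london.example "
          "or visit London Ltd near Sheffield.")
locations = filter_locations(sample)  # ["Bath", "Sheffield"]
```

Hand-written rules like these are brittle. A sentence-initial "Reading", for example, would pass the capitalisation check.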
Conclusion
The geo-parsing method could be improved by enhancing the gazetteer matching and filtering.
False hits could be reduced by generating a better list of stop-words and using further context rules.
The need for creating rules would be alleviated by using machine learning on features to generate further context rules.
Spatially-Aware Information Retrieval on the Internet - A Working Searching System
Thank You!