Design and Creation of Ontologies for Environmental (Multimedia) Information Retrieval*
Vipul Kashyap
National Library of Medicine
Workshop on Science and the Semantic Web
October 24, 2002
* Work done by the author when at MCC and LSDIS Lab, UGA
Science on the Semantic Web Workshop – 2
Outline
Ontologies for Information Retrieval: The InfoSleuth System
The Ontology Design Process:
– “Reverse Engineering” from a database schema
– Ontology refinement based on user queries
– Using a data dictionary and thesaurus
Ontology-based Multimedia Information Retrieval:
– Information Extraction from Textual Data
– Information Extraction from Image Data
Conclusions and Future Work
Ontologies for Information Retrieval: The InfoSleuth System
[Figure: architecture diagram. Labels:]
– Ontology-based retrieval query
– KQML/OKBC agents
– Document Database (e.g., Verity)
– Structured Database (e.g., Oracle)
– Image Database: features, patterns, semantic objects
A Multimedia GIS Query using an ontological model
Get me all regions (blocks, counties) having a population greater than 500 and an area greater than 50 acres, having an urban land cover, and such that all the nearby fires have excellent containment
[Figure: ontological model]
Fire: name, containment
Region: county, block, area, population, spatial_location, land_cover
Fire isLocatedNear Region
select county, block, spatial_location
from region
where area > 50 and population > 500
and land_cover = “urban”
and region.isLocatedNear.containment = “excellent”
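The semantics of this query can be sketched in a few lines of Python. This is only an illustration, not the InfoSleuth implementation: the class layout, the `is_located_near` list on `Region`, and the sample data are assumptions made for the example. The point it shows is that "all the nearby fires" is a universal condition over the fires related to a region.

```python
from dataclasses import dataclass, field


@dataclass
class Fire:
    name: str
    containment: str


@dataclass
class Region:
    county: str
    block: str
    area: float            # acres
    population: int
    land_cover: str
    spatial_location: str
    # Assumed representation of the isLocatedNear relationship.
    is_located_near: list = field(default_factory=list)


def matching_regions(regions):
    """Return (county, block, spatial_location) for regions satisfying the query."""
    return [
        (r.county, r.block, r.spatial_location)
        for r in regions
        if r.area > 50
        and r.population > 500
        and r.land_cover == "urban"
        # "all the nearby fires have excellent containment"
        and all(f.containment == "excellent" for f in r.is_located_near)
    ]


# Hypothetical sample data.
regions = [
    Region("County A", "B1", 80.0, 900, "urban", "loc-1",
           [Fire("F1", "excellent")]),
    Region("County B", "B2", 80.0, 900, "urban", "loc-2",
           [Fire("F2", "excellent"), Fire("F3", "poor")]),
    Region("County C", "B3", 30.0, 900, "urban", "loc-3", []),
]
print(matching_regions(regions))
```

Note that `all()` is true for a region with no nearby fires; whether such regions should be returned is a modelling decision the slide does not settle.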
Ontologies for Information Retrieval
Provide a concise, uniform, declarative description of semantic information
Independent of the syntactic representations and conceptual models of the underlying information bases
Domain models provide wider access by supporting multiple world views on the same underlying data
EDEN ontology defined in the context of the InfoSleuth system:
– crucial for capturing elements of environmental information
Sources for Ontology construction
Pre-existing Database Schemas
– data-directed component
Collection of a representative set of queries, possibly parameterized based on the application user interface
– application-directed component
Thesauri and Vocabularies (e.g., EEA Thesaurus)
– knowledge-directed component
Ontology = knowledge-based middle ground between applications and data!
The Ontology Design Process
[Figure: flowchart of the design process. Box labels, grouped by phase:]
Ontology from Database Schema
– Choose new database schema
– Abstract details from database schema
– Determine entities and attributes
– Group information, analyze foreign keys and dependencies
– Determine relationships
Ontology from Queries
– Choose new query
– Drop entities and attributes
– Add new entities and attributes
– Add new subclasses and superclasses
– No more queries
Evaluate ontology
Implement and test
Environmental Databases
CERCLIS 3
– http://www.epa.gov/enviro/html/cerclis/
ITT
HAZDAT
– http://www.atsdr.cdc.gov/hazdat.html
ERPIMS
– http://ns1.ktc.com/personal/larnold/erpims.htm
Basel Convention Database
– http://www.unep.ch/basel
Grouping Information in Multiple Tables
Database Schema:
– Site: site_id (PK), site_name, site_ifms_ssid_code, site_rcra_id, site_epa_id
– Site_Characteristic: site_id (PK, FK to Site), rsic_code (PK, FK to Ref_Sic), sc_date
– Ref_Sic: rsic_code (PK), rsic_code_desc
– Site_Alias: site_id (PK, FK to Site), site_alias_id (PK), sa_name
Ontology:
– Site: date, name, code, alias_name, description
Identifying Relationships
Database Schema:
– Site: site_id (PK), site_name, site_ifms_ssid_code, site_rcra_id, site_epa_id
– Action: site_id (PK, FK to Site), rat_code (PK, FK to ref_action_type), act_code_id (PK)
– Ref_action_type: rat_code (PK), rat_name, rat_def
– Waste_Src_Media_Contaminated: wsmrc_nmbr (PK), site_id (PK, FK to Action), rat_code (FK to Action), act_code_id (FK to Action)
– Remedial_Response: site_id, act_code_id, rat_code
Ontology (figure labels):
– Site
– Contaminant
– RemedialResponse
– PerformedAt
– actionName
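One way to read this slide is that foreign keys in the schema suggest candidate relationships between ontology entities. The sketch below illustrates that idea only; the dictionary encoding of the schema and the function name are hypothetical, and a designer would still have to name each relationship and decide which candidates to keep.

```python
# Hypothetical, simplified encoding of the slide's schema:
# table name -> {foreign-key column: referenced table}.
FOREIGN_KEYS = {
    "Site": {},
    "Action": {"site_id": "Site", "rat_code": "Ref_action_type"},
    "Ref_action_type": {},
    "Waste_Src_Media_Contaminated": {
        "site_id": "Action",
        "rat_code": "Action",
        "act_code_id": "Action",
    },
}


def candidate_relationships(foreign_keys):
    """Return sorted, de-duplicated (referencing_table, referenced_table) pairs."""
    pairs = {
        (table, target)
        for table, columns in foreign_keys.items()
        for target in columns.values()
    }
    return sorted(pairs)


for source, target in candidate_relationships(FOREIGN_KEYS):
    print(f"{source} -> {target}")
```

A composite foreign key (three columns pointing at Action) collapses into a single candidate, which matches the intuition that it expresses one relationship.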
Ontology refinement based on user queries
Addition of new attributes
– At NPL sites with a land use category of INDUSTRIAL, what is the cleanup level range for LEAD ….
– Add an attribute landUseCategory to the entity Site in the ontology
Addition of new relationships
– What is the range of concentrations for ARSENIC as a contaminant of concern in the SURFACE SOIL at NPL sites?
– Add a relationship HasContaminant between the entities Site and Contaminant in the ontology
Addition of class-subclass relationships and new entities
– How many Superfund sites are in Edison County, New Jersey?
– Add an entity SuperFundSite as a subclass of Site in the ontology
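The three kinds of refinement above can be pictured as operations on a small ontology structure. The `Ontology` class below is a hypothetical toy, not the representation used in InfoSleuth; only the entity, attribute, and relationship names come from the slide.

```python
class Ontology:
    """Toy ontology: entities with attributes, relationships, subclass links."""

    def __init__(self):
        self.attributes = {}        # entity -> set of attribute names
        self.relationships = set()  # (name, domain entity, range entity)
        self.superclass = {}        # entity -> parent entity

    def add_entity(self, name, parent=None):
        self.attributes.setdefault(name, set())
        if parent is not None:
            self.attributes.setdefault(parent, set())
            self.superclass[name] = parent

    def add_attribute(self, entity, attribute):
        self.attributes.setdefault(entity, set()).add(attribute)

    def add_relationship(self, name, domain, range_):
        self.add_entity(domain)
        self.add_entity(range_)
        self.relationships.add((name, domain, range_))


onto = Ontology()
onto.add_entity("Site")
onto.add_entity("Contaminant")

# Refinements driven by the three example queries.
onto.add_attribute("Site", "landUseCategory")
onto.add_relationship("HasContaminant", "Site", "Contaminant")
onto.add_entity("SuperFundSite", parent="Site")

print(onto.attributes["Site"], onto.relationships, onto.superclass)
```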
Using a data dictionary (EDR) to enhance the ontology
[Figure: the state attribute of Site and the Map table of coding schemes]
Site: state
state values: StateName, StateCode, StateAbbr
Map: coding_scheme1, coding_scheme2, coding_scheme3
select * from Site where state = ‘TX’ or state = ‘California’
select coding_scheme1 from Map where coding_scheme3 = ‘TX’
{ “Texas”, “California” } { “TX”, “CA” }
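The role of the Map table can be sketched as follows: a query that mixes coding schemes ('TX' and 'California') is normalized by translating every value into one scheme. The table contents and the pairing of coding_scheme1/2/3 with state name, numeric code, and abbreviation are assumptions for illustration; the slide only confirms that coding_scheme1 yields names and coding_scheme3 holds abbreviations.

```python
# Hypothetical contents of the Map table; column names follow the slide.
MAP = [
    {"coding_scheme1": "Texas", "coding_scheme2": "48", "coding_scheme3": "TX"},
    {"coding_scheme1": "California", "coding_scheme2": "06", "coding_scheme3": "CA"},
]


def translate(value, target):
    """Translate a state value given in any coding scheme into the target scheme."""
    for row in MAP:
        if value in row.values():
            return row[target]
    raise KeyError(f"unknown state value: {value!r}")


query_values = ["TX", "California"]
names = {translate(v, "coding_scheme1") for v in query_values}
abbrs = {translate(v, "coding_scheme3") for v in query_values}
print(sorted(names))
print(sorted(abbrs))
```

Each source database can then be queried with the value set that matches its own coding scheme.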
Enhancing the Ontology by using a Thesaurus
abandoned site
THEME POLLUTION
BT land setup
NT disused military site
LandSetup
Site
AbandonedSite
DisusedMilitarySite
SuperfundSite
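A thesaurus entry of this form can be turned into subclass links mechanically: a broader term (BT) becomes a superclass and a narrower term (NT) becomes a subclass. The sketch below assumes a dictionary encoding of the entry and a CamelCase naming convention inferred from the class names on the slide; it does not place Site or SuperfundSite, which come from other sources.

```python
def class_name(term):
    """Turn a thesaurus term such as 'land setup' into 'LandSetup'."""
    return "".join(word.capitalize() for word in term.split())


def subclass_edges(entry):
    """Return (subclass, superclass) pairs implied by the entry's BT/NT links."""
    term = class_name(entry["term"])
    edges = [(term, class_name(bt)) for bt in entry.get("BT", [])]
    edges += [(class_name(nt), term) for nt in entry.get("NT", [])]
    return edges


# The entry from the slide, in an assumed dictionary encoding.
entry = {
    "term": "abandoned site",
    "THEME": "POLLUTION",
    "BT": ["land setup"],
    "NT": ["disused military site"],
}
print(subclass_edges(entry))
```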
Information Extraction from Text and Multimedia Data
Get me all regions (blocks, counties) having a population greater than 500 and an area greater than 50 acres, having an urban land cover, and such that all the nearby fires have excellent containment
[Figure: ontological model]
Fire: name, containment
Region: county, block, area, population, spatial_location, land_cover
Fire isLocatedNear Region
select county, block, spatial_location
from region
where area > 50 and population > 500
and land_cover = “urban”
and region.isLocatedNear.containment = “excellent”
Information Extraction from Textual Data
[Figure: ontology elements mapped to topic expressions; extracted slots: fire.name, region.county]
Fire isLocatedNear Region
Fire.containment = “excellent”:
<ACCRUE>(<SENTENCE>(<AND>(<NUMBER>(X), X < 25),
<WORD>(%), <WORD>(active)), <PHRASE>(full, containment,,
<STEM>(was), expected)<PHRASE>(the, fire, <STEM>(is),
contained))
Region.county (also block, state):
<ACCRUE>(<SENTENCE>( <PHRASE>(<OR>(New, Las, San),
[region.county]), <OR>(county, block, state)))
isLocatedNear:
<PARAGRAPH>(FIRE, REGION)
Mapping “domain specific” model elements to media specific metadata
county(x,y) gets mapped to:
– word(x), phrase(x), accrue(<list-of-subtrees>)
containment(x, “excellent”) gets mapped to:
– sentence(<set-of-words>), stem(x), accrue(<list-of-subtrees>)
isLocatedNear(x, y) gets mapped to:
– paragraph(x,y)
Mapping SQL queries to Topic Expressions
select county from region
where isLocatedNear.containment = “excellent”
<PARAGRAPH>(
<ACCRUE>(<SENTENCE>(<AND>(<NUMBER>(X), X < 25),
<WORD>(%), <WORD>(active)), <PHRASE>(full, containment,,
<STEM>(was), expected)<PHRASE>(the, fire, <STEM>(is),
contained)),<ACCRUE>(<SENTENCE>( <PHRASE>(<OR>(New, Las, San),
[region.county]), county))
)
Limitations of Current Indexing Technologies: “selection operation”
select county from region
=> post-processing of patterns returned (WILDCARD as place-holder)
Problem:
– WILDCARD may match a lot of words in the same sentence
– WILDCARD may match different words in different sentences
<ACCRUE>(<SENTENCE>(<PHRASE>(<OR>(New, Las, San), WILDCARD),
<OR>(county, block, state)))
Using NLP and statistical techniques
WILDCARD matches a number of words in the same sentence
– Yeltsin was appointed the Prime Minister when sleeping
– the: article; Prime Minister: noun; when: conjunction; sleeping: verb
=> Use part-of-speech tagging to reduce the number of possibilities
WILDCARD matches different words in different sentences
– Yeltsin was appointed Prime Minister
– Yeltsin was appointed President
=> Use frequency statistics to give a level of confidence
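The frequency-statistics idea can be sketched as follows: collect the fillers that WILDCARD matched across sentences and use each filler's relative frequency as its confidence. The function name and the sample matches are hypothetical, and relative frequency is only one simple choice of statistic; the slide does not say which one was used.

```python
from collections import Counter


def filler_confidence(fillers):
    """Map each candidate filler to its relative frequency across sentences."""
    counts = Counter(fillers)
    total = sum(counts.values())
    return {filler: count / total for filler, count in counts.items()}


# Hypothetical WILDCARD matches collected from four sentences.
matches = ["Prime Minister", "President", "Prime Minister", "Prime Minister"]
confidence = filler_confidence(matches)
best = max(confidence, key=confidence.get)
print(best, confidence[best])
```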
Definition Support
INCIDENT MANAGEMENT SITUATION REPORT
Friday August 1, 1997 - 0530 MDT
NATIONAL PREPAREDNESS LEVEL II
CURRENT SITUATION: Alaska continues to experience large fire activity. Additional fires hastaffed for structure protection.
SIMELS, Galena District, BLM. This fire is on the east side of the Innoko Flats, between GalenaThe fore is active on the southern perimeter, which is burning into a continuous stand of black sfire has increased in size, but was not mapped due to thick smoke. The slopover on the eastern 35% contained, while protection of the historic cabit continues.
CHINIKLIK MOUNTAIN, Galena District, BLM. A Type II Incident Management Team (Wehassigned to the Chiniklik fire. The fire is contained. Major areas of heat have been mopped up. contained. Major areas of heat have been mopped-up. All crews and overhead will mop-up wherburned beyond the meadows. No flare-ups occurred today. Demobilization is planned for this wedepending on the results of infrared scanning.
Phrase: SIMELS, Galena District, BLM.
Slot: fire.name
value: SIMELS
structure: <name> , <place> , <unit> .
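The structure `<name> , <place> , <unit> .` can be approximated with a regular expression that fills the fire.name slot from the start of a report paragraph. This is a sketch under assumptions: the character classes were chosen to fit the two sample entries in the report above, and the slot names other than fire.name are made up for the example.

```python
import re

# <name> , <place> , <unit> .  at the start of a report paragraph.
# Assumed: names and units are upper-case, places contain no comma.
ENTRY = re.compile(
    r"^(?P<name>[A-Z][A-Z ]+), (?P<place>[^,]+), (?P<unit>[A-Z]+)\."
)


def extract_slots(paragraph):
    """Return the name/place/unit slots, or None if the structure is absent."""
    match = ENTRY.match(paragraph)
    if match is None:
        return None
    return {
        "fire.name": match.group("name"),
        "place": match.group("place"),
        "unit": match.group("unit"),
    }


print(extract_slots("SIMELS, Galena District, BLM. This fire is on the east side"))
print(extract_slots("CHINIKLIK MOUNTAIN, Galena District, BLM. A Type II Incident"))
print(extract_slots("CURRENT SITUATION: Alaska continues to experience large fire activity."))
```

The last call returns None because the paragraph does not follow the name, place, unit structure.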
MIDAS*: Information Extraction from Multimedia Data
Query: Get me all regions (blocks, counties) having a population greater than 500 and area greater than 50 acres having an urban land cover
select county, block, area, population, spatial_location, land_cover
from region
where area > 50
and population > 500
and land_cover = ‘urban’
and relief = ‘moderate’
*Media Independent DomAin Specific correlation
Get me all regions (counties, blocks) having
– 50 < population < 100
– 25 < area < 50
– and low density urban area land cover ...
Media independent correlation across domain specific metadata
Correlation across image and structured data at an intensional domain level
Population, Area: SQL queries to structured data (Census DB)
Boundaries: SQL Gateway to textual data (TIGER/Line DB)
Land cover, Relief: Image Processing routines for Image Data
Mapping “domain specific” model elements to media specific metadata
contained(<concept>, <image>) gets mapped to:
– latitude/longitude, image-coordinates
– bounding box of region
– image type: LULC, DEM
land_cover(x, “low density urban”) gets mapped to:
– percentage(<pixel-color>, <bounding-box>)
relief(x, “moderate”) gets mapped to:
– standard-deviation(<pixel-value>, <bounding-box>)
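The two image-level measures named above, percentage(<pixel-color>, <bounding-box>) and standard-deviation(<pixel-value>, <bounding-box>), can be sketched on plain nested lists. The raster values, the class code 11, and the end-exclusive bounding-box convention are assumptions for the example; the thresholds that would turn these numbers into "low density urban" or "moderate" are not given in the slides and are left out.

```python
from statistics import pstdev


def crop(grid, box):
    """Return pixel values inside box = (row0, col0, row1, col1), end-exclusive."""
    row0, col0, row1, col1 = box
    return [value for row in grid[row0:row1] for value in row[col0:col1]]


def percentage(grid, pixel_class, box):
    """Percentage of pixels in the bounding box that carry the given class/colour."""
    pixels = crop(grid, box)
    return 100.0 * sum(1 for p in pixels if p == pixel_class) / len(pixels)


def standard_deviation(grid, box):
    """Population standard deviation of pixel values (e.g. elevation) in the box."""
    return pstdev(crop(grid, box))


# Hypothetical 3x4 rasters: land-cover class codes (LULC) and elevations (DEM).
lulc = [
    [11, 11, 41, 41],
    [11, 11, 41, 41],
    [11, 52, 52, 41],
]
dem = [
    [100, 102, 104, 106],
    [100, 102, 104, 106],
    [100, 102, 104, 106],
]
box = (0, 0, 2, 2)  # top-left 2x2 block
print(percentage(lulc, 11, box))
print(standard_deviation(dem, box))
```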
Need for characterization of Domain Vocabularies
Geological Region
– Urban: Residential, Commercial, Industrial
– Forest Land: Deciduous, Evergreen, Mixed
– Water: Lakes, Reservoirs, Streams and Canals
Geological Region
– State
– County
– City, Rural Area
– Tract
– Block Group, Block
Another source of domain ontology construction: classification standards
Conclusions and Future Work
Role of semantic content in handling data/information overload
– Domain-specific ontologies: an approach for capturing semantic content
Design and construction of domain ontologies
– Labor intensive, time consuming, difficult endeavor
– Re-use readily available information (schemas, queries, data dictionaries, thesauri) to minimize the involvement of the domain expert
Metadata is the key for Multimedia Information Retrieval
– Use an expanded notion of metadata as schema and a declarative SQL-like query language
– Pragmatic incorporation of NLP/Image+Speech+Video Processing/Computer Vision techniques
– Exploit synergy across multiple media for better precision and performance
Extrapolate this technique into other domains:
– Medical and Bio-Informatics
– Telecommunication
– IP networks (use of CIM information model by DMTF)
Ontology Extraction from Textual Data:
– Clustering techniques to identify central concepts and taxonomic relationships
– NLP techniques to identify concept associations
– Consensus analysis techniques to establish ontologies