semantic web

www.sti-innsbruck.at © Copyright 2008 STI INNSBRUCK www.sti-innsbruck.at

Semantic Web

Semantic AnnotationDieter Fensel

Katharina Siorpaes

www.sti-innsbruck.at

Where are we?

2

# Date Title

1 Introduction

2 Semantic Web Architecture

3 RDF and RDFs

4 Web of hypertext (RDFa, Microformats) and Web of data

5 Semantic Annotation

6 Repositories and SPARQL

7 OWL

8 RIF

9 Web-scale reasoning

10 Social Semantic Web

11 Ontologies and the Semantic Web

12 SWS

13 Tools

14 Applications

15 Exam


Agenda

1. Motivation

2. Technical solution, illustrations, and extensions1. Semantic annotation of text

2. Semantic annotation of multimedia

3. Summary

4. References

3


MOTIVATION

4


Semantic Annotation

• The population of ontologies is a task within the semantic content creation process as it links abstract knowledge to concrete knowledge.

• Creating semantic labels within documents for the Semantic Web.

• Used to support:– Advanced searching (e.g. concept)– Information Visualization (using ontology)– Reasoning about Web resources

• Converting syntactic structures into knowledge structures (humanmachine)

5


Semantic Annotation Process

6


Semantic Annotation

• Manual• Semi-automatic• Automatic

7


Semantic Annotation Concerns

– Scale, Volume• Existing & new documents on the Web• Manual annotation

– Expensive – economic, time– Subject to personal motivation– Schema Complexity

– Storage• support for multiple ontologies• within or external to source document?• Knowledge base refinement

– Access - How are annotations accessed?• API, custom UI, plug-ins

8


TECHNICAL, SOLUTION ILLUSTRATIONS, AND EXTENSIONS

9


Annotation of text

10


Language Toolkits

– GATE – language processing system• Component architecture, SDK, IDE

• ANNIE (‘A Nearly-New IE system’)– tokenizer, gazetteer, POS tagger, sentence splitter, etc

• JAPE – Java Annotations Pattern Engine– provides regular-expression based pattern/action rules

• Amilcare– adaptive IE system designed for document annotation

– based on LP2

– uses ANNIE

11


Annotation of text

• Many systems apply rules or wrappers that were manually created that try to recognize patterns for the annotations.

• Some systems learn how to annotate with the help of the user.• Supervised systems learn how to annotate from a training set that was

manually created beforehand. • Semi-automatic approaches often apply information extraction

technology, which analyzes natural language for pulling out information the user is interested in. In information extraction, one distinguishes between five different types of pieces of information:

12


Information extraction

• Recognition of – Entities (people, places, organization, etc.). With up to 95% accuracy, entity

recognition is the most reliable IE technology even though it is strongly domain-dependent.

– Mentions (referring to entities). Mentions identify references to entities in text. In the sentence “That's Mickey, I know him.”, “him” would be identified as reference to “Mickey”.

– Descriptions of entities combines several IE technologies and achieves a rather good score with around 80%.

– Relations between entities can be discovered with the help of defining possible relations between entities. This of course strongly depends on the domain.

– Event extraction involving entities is a difficult task as it is heavily dependent on the domain and tied to the scenarios of interest to the user: scenario templates only achieve 60% accuracy.

13


Tools for semantic annotation of text

• Armadillo– [Dingli2003] – pattern-based approach – Amilcare information extraction system [Ciravegna2003] – especially suitable for highly structured Web pages– starts from a seed pattern and does not require human input initially– does not require a manually annotated training set– patterns for entity recognition have to be added manually

• KIM– Knowledge and information management platform (KIM) [Popov2003] – ontology and knowledge base – indexing and retrieval server– RDF data is stored in an RDF repository [Kiryakov2005]– LUCENE system is performing search– based on an underlying ontology (KIMO or PROTON) – relies on GATE [Bontcheva2003] as an information extraction tool

14


KIM (2003)

– ontology, kb, semantic annotation, indexing and retrieval server, front-ends (Web UI, IE plug-in)

– KIMO ontology• 250 classes, 100

properties

• 80,000 entities from general news corpus in KB

– (plus >100,000 aliases)

• IE – Uses GATE, JAPE– Gazetteers (from KB) Source: http://www.ontotext.com/kim/SemWebIE.pdf

15



• Magpie– [Domingue2004– suite of tools that supports the fully automatic annotation of Web pages– maps entities that are found in the knowledge base against those identified on the Web

pages– quality of the results depends on the ontology in the background

• Melita– [Ciravegna2002] – Amilcare – can learn from user input by taking annotated content and generalizing this content in

order to make new annotations– system requires the user to fully annotate documents and starts to learn– makes suggestions to the user– learns from the user's feedback– possibilities for rule writing, i.e. advanced users can define rules for automatic

annotation

16



• MnM– [Vargas-Vera2002– semi-automatic annotation based on the Amilcare system– set of training data in order to carry out the annotation semi-automatically– system gradually takes over the annotation process– while browsing the Web, the user manually annotates chosen Web pages in the MnM

Web browser

• S-Cream– [Handschuh2002] – semi-automatic annotation tool that combines two tools: Ont-O-Mat, a manual

annotation editor implementing the CREAM framework, and Amilcare– can be trained for different domains with the appropriate training data– both manual and semi-automatic annotation of Web pages

17


Ont-O-Mat (2002)

– Uses Amilcare• Wrapper induction (LP2)

– Extensible• Adapted in 2004 for

PANKOW algorithm

– Disambiguation by maximal evidence

– Proper nouns + ontology linguistic phrases

Source: http://www.aifb.uni-karlsruhe.de/WBS/sha/papers/kcap2001-annotate-sub.pdf

18

18


SemTag (2003)

– Large-scale annotation• Annotations separate from source• “Semantic Label Bureau”

– Uses the TAP taxonomy

– Approach is:• Find match to label in taxonomy

• Save window before & after match

• Perform disambiguation

• Main contribution is using taxonomy for disambiguation

Source: http://www.almaden.ibm.com/webfountain/resources/semtag.pdf

19

19


Platform Effectiveness

*as reported by platform authors

20


Multimedia annotation

21


Multimedia Annotation

• Different levels of annotations– Metadata

• Often technical metadata• EXIF, Dublin Core, access rights

– Content level• Semantic annotations• Keywords, domain ontologies, free-text

– Multimedia level• low-level annotations• Visual descriptors, such as dominant color

22


Metadata

• refers to information about technical details• creation details

– creator, creationDate, …– Dublin Core

• camera details– settings– resolution– format– EXIF

• access rights– administrated by the OS– owner, access rights, …

23


Content Level

• Describes what is depicted and directly perceivable by a human• usually provided manually

– keywords/tags– classification of content

• seldom generated automatically– scene classification– object detection

• different types of annotations– global vs. local– different semantic levels

24


Global vs. Local Annotations

• Global annotations most widely used– flickr: tagging is only global– organization within categories– free-text annotations– provide information about the content as a whole– no detailed information

• Local annotations are less supported– e.g. flickr, PhotoStuff allow to provide annotations of regions– especially important for semantic image understanding

• allow to extract relations• provide a more complete view of the scene

– provide information about different regions– and about the depicted relations and arrangements of objects

25


Semantic Levels

• Free-Text annotations cover large aspects, but less appropriate for sharing, organization and retrieval

– Free-Text Annotations probably most natural for the human, but provide least formal semantics

• Tagging provides light-weight semantics– Only useful if a fixed vocabulary is used– Allows some simple inference of related concepts by tag analysis (clustering)– No formal semantics, but provides benefits due to fixed vocabulary– Requires more effort from the user

• Ontologies– Provide syntax and semantic to define complex domain vocabularies– Allow for the inference of additional knowledge– Leverage interoperability– Powerful way of semantic annotation, but hardly comprehensible by “normal

users”

26


Tools

• Web-based Tools– flickr– riya

• Stand-Alone Tools– PhotoStuff– AktiveMedia

• Annotation for Feature Extraction– M-OntoMat-Annotizer

27


flickr

• Web2.0 application• tagging photos globally• add comments to image regions

marked by bounding box• large user community and tagging

allows for easy sharing of images• partly fixed vocabularies evolved

– e.g. Geo-Tagging

28


riya

• Similar to flickr in functionality• Adds automatic annotation features

– Face Recognition• Mark faces in photos• associate name• train system• automatic recognition of the person in the future

29


PhotoStuff

• Java application for the annotation of images and image regions with domain ontologies

• Used during ESWC2006 for annotating images and sharing metadata• Developed within Mindswap

30


AktiveMedia

• Text and image annotation tool• Region-based annotation• Uses ontologies

– suggests concepts during annotation

– providing a simpler interface for the user

• Provides semi-automatic annotation of content, using– Context– Simple image understanding

techniques– flickr tagging data

31


M-OntoMat-Annotizer

• Extracts knowledge from image regions for automatic annotation of images

• Extracting features:– User can mark image regions manually or using an

automatic segmentation tool– MPEG-7 descriptors are extracted– Stored within domain ontologies as prototypical,

visual knowledge• Developed within aceMedia• Currently Version 2 is under development,

incorporating– true image annotation– central storage– extended knowledge extraction– extensible architecture using a high-level

multimedia ontology

32


Multimedia Ontologies

• Semantic annotation of images requires multimedia ontologies– several vocabularies exist (Dublin Core, FOAF)– they don’t provide appropriate models to describe multimedia

content sufficiently for sophisticated applications

• MPEG-7 provides an extensive standard, but especially semantic annotations are insufficiently supported

• Several mappings of MPEG-7 into RDF or OWL exist– now: VDO and MSO developed within aceMedia– later: Engineering a multimedia upper ontology

33


aceMedia Ontology Infrastructure

• aceMedia Multimedia Ontology Infrastructure– DOLCE as core ontology– Multimedia Ontologies

• Visual Descriptors Ontology (VDO)

• Multimedia Structures Ontology (MSO)

• Annotation and Spatio-Temporal Ontology augmenting VDO and MSO

– Domain Ontologies• capture domain specific

knowledge

34


Visual Descriptors Ontology

• Representation of MPEG-7 Visual Descriptors in RDF– Visual Descriptors represent low-level features of multimedia

content– e.g. dominant color, shape or texture

• Mapping to RDF allows for– linking of domain ontology concepts with visual features– better integration with semantic annotations– a common underlying model for visual and semantic features

35


Visual Knowledge

• Used for automatic annotation of images• Idea:

– Describe the visual appearance of domain concepts by providing examples

– User annotates instances of concepts and extracts features– features are represented with the VDO– the examples are then stored in the domain ontology as

prototype instances of the domain concepts

• Thus the names: prototype and prototypical knowledge

36


Extraction of Prototype

<?xml version='1.0' encoding='ISO-8859-1' ?><Mpeg7 xmlns…> <DescriptionUnit xsi:type = "DescriptorCollectionType"> <Descriptor xsi:type = "DominantColorType"> <SpatialCoherency>31</SpatialCoherency> <Value> <Percentage>31</Percentage> <Index>19 23 29 </Index> <ColorVariance>0 0 0 </ColorVariance> </Value> </Descriptor> </DescriptionUnit></Mpeg7>


37


Transformation to VDO



extractextract

<vdo:ScalableColorDescriptor rdf:ID="vde-inst1"> <vdo:coefficients> 0 […] 1 </vdo:coefficients> <vdo:numberOfBitPlanesDiscarded> 6 </vdo:numberOfBitPlanesDiscarded> <vdo:numberOfCoefficients> 0 </vdo:numberOfCoefficients></vdo:ScalableColorDescriptor>

<vdoext:Prototype rdf:ID=“Sky_Prototype_1"> <rdf:type rdf:resource="#Sky"/> <vdoext:hasDescriptor rdf:resource="#vde-inst1"/></vdoext:Prototype>

<vdo:ScalableColorDescriptor rdf:ID="vde-inst1"> <vdo:coefficients> 0 […] 1 </vdo:coefficients> <vdo:numberOfBitPlanesDiscarded> 6 </vdo:numberOfBitPlanesDiscarded> <vdo:numberOfCoefficients> 0 </vdo:numberOfCoefficients></vdo:ScalableColorDescriptor>

<vdoext:Prototype rdf:ID=“Sky_Prototype_1"> <rdf:type rdf:resource="#Sky"/> <vdoext:hasDescriptor rdf:resource="#vde-inst1"/></vdoext:Prototype>

transformtransform


Using Prototypes for Automatic Labelling

extract

<RDF /><RDF />

<RDF /><RDF />

<RDF /><RDF />

<RDF /><RDF />

segment labeling

Knowledge Assisted Analysis

<RDF /><RDF />rockrockskysky

seasea

beachbeachbeach/rockbeach/rock

rock/beachrock/beach

sea, skysea, sky

person/bearperson/bear


Multimedia Structure Ontology

• RDF representation of the MPEG-7 Multimedia Description Schemes

• Contains only classes and relations relevant for representing a decomposition of images or videos

• Contains Classes for different types of segments– temporal and spatial segments

• Contains relations to describe different decompositions• Augmented by annotation ontology and spatio-temporal

ontology, allowing to describe– regions of an image or video– the spatial and temporal arrangement of the regions– what is depicted in a region

40


MSO Example

Sky/Sea

Sea

Sand

Sea Sea/Sky

Person/SandPerson

image01

segment01 sky01

sea01

sand01

Image

Sky

Sea

Sand

Segment

spatial-decomposition

rdf:type

rdf:type

rdf:type

rdf:type

depicts

depicts

depicts

segment02

rdf:type

segment03


SUMMARY

42


Summary

• The population of ontologies is a task within the semantic content creation process as it links abstract knowledge to concrete knowledge.

• This knowledge acquisition can be done manually, semi-automatically, or fully automatically.

• There is a wide range of approaches that carry out semi-automatic annotation of text: most of the approaches make use of natural language processing and information extraction technology.

• In the annotation of multimedia aim at closing the so-called semantic gap, i.e. the discrepancy between low-level technical features which can be automatically processed to a large extent, and the high-level meaning-bearing features a user is typically interested in.

• Low level semantics can be extracted automatically, while high level semantics are still a challenge (and require human input to a large extent).

43


References

• F. Ciravegna, A. Dingli, D. Petrelli, and Y. Wilks. User-system cooperation in document annotation based on information. In 13th International Conference on Knowledge Engineering and KM (EKAW02), 2002.

• P. Asirelli, S. Little, M. Martinelli, and O. Salvetti. Mulimedia metadata management: a proposal for an infrastructure. In Proceedings of SWAP 2006, 2006.

• C. Barras, E. Georois, Z. Wu, and M. Liberman. Transcriber: Development and use of a tool for assisting speech corpora production. Speech Communication Special Issue on Speech Annotation and Corpus Tools, 33(1-2), 2000.

• S. Bloehdorn, K. Petridis, C. Saatho, N. Simou, V. Tzouvaras, Y. Avrithis, S. Handschuh, Y. Kompatsiaris, S. Staab, and M. G. Strintzis. Semantic annotation of images and videos for multimedia analysis. Springer LNCS, 2005.

• T. Buerger and C. Ammendola. A user centered annotation methodology for multimedia content. In Poster Session at ESWC 2008, CEUR Vol. 367, 2008.

• A. Chakarvarthy, F. Ciravegna, and V. Lanfranchi. Cross-media document annotation and enrichment. In 1st Semantic Authoring and Annotation Workshop (SAAW2006), 2006.

• F. Ciravegna and Y. Wilks. Designing Adaptive Information Extraction for the Semantic web in Amilcare, volume Annotation for the Semantic Web. IOS Press, Amsterdam, 2003.

44


References (cont‘d)

• K. Dellschaft and S. Staab. Ontology Learning and Population: Briding the Gap between Text and Knowledge, chapter Strategies for the evaluation of ontology learning, pages 253272. IOS Press, 2008.

• S. Dill, N. Gibson, D. Gruhl, R.V. Guha, A. Jhingran, T. Kanungo, S. Rajagopalan, A. Tomkins, J.A. Tomlin, and J.Y. Zien. Semtag and seeker: Bootstrapping the semantic web via automated semantic annotation. In Twelfth International World Wide Web Conference, 2003.

• S. Handschuh and S. Staab. Annotation for the semantic web, 2003.• S. Handschuh, S. Staab, and F. Ciravegna. S-cream semi-automatic

creation of metadata. In SAAKM 2002 -Semantic Authoring, Annotation & Knowledge Markup, 2002.

45

semantic web

Documents