

Supervised Multimedia Categorisation

Frank Aldershoff, Alfons Salden, Sorin Iacob and Masja Kempen

Telematica Instituut, P.O. Box 589, Drienerlolaan 5, 7500 AN Enschede, The Netherlands

ABSTRACT

Even static multimedia on the Web can hardly be structured manually. Although unavoidable and necessary, manual annotation of dynamic multimedia becomes even less feasible when multimedia quickly changes in complexity, i.e. in volume, modality, and usage context. The latter context could be set by learning or other purposes of the multimedia material. This multimedia dynamics calls for categorisation systems that index, query and retrieve multimedia objects on the fly, much as a human expert would. We present and demonstrate such a supervised dynamic multimedia object categorisation system. Our categorisation system comes about by continuously gauging it to a group of human experts who annotate raw multimedia for a certain domain ontology given a usage context. Thus our system effectively learns the categorisation behaviour of human experts. By inducing supervised multi-modal content- and context-dependent potentials, our categorisation system associates field strengths of raw dynamic multimedia object categorisations with those human experts would assign. After a sufficiently long period of supervised machine learning we arrive at automated, robust and discriminative multimedia categorisation. We demonstrate the usefulness and effectiveness of our multimedia categorisation system in retrieving semantically meaningful soccer-video fragments, in particular by taking advantage of multimodal and domain-specific information and knowledge supplied by human experts.

Keywords: multimedia, categorisation, equivalence, scale-space, supervised machine learning

1. INTRODUCTION

Representation, analysis, processing and understanding of multimedia require a multidisciplinary and even interdisciplinary approach to the simultaneous categorisation of text, audio, video and annotations added by human experts. A categorisation system should first of all take advantage of the dependency relations between all multimodal objects that are annotated by human experts. During a learning phase, a multimedia categorisation system should subsequently be gauged to the categorisation behaviour of human experts communicating with the system in terms of annotations. A reason for imposing this requirement is that the dynamics of multimedia in volume, type and usage context requires at least a semi-automation of the multimedia categorisation system. After this supervision phase, the process of multimedia categorisation by the human experts could be automated through supervised machine learning techniques based on either discrete or continuous classes [1-3]. In the case of discrete classes a class name is stored at a leaf, whereas for a continuous class a value is stored, i.e. categorical versus numeric prediction. However, whenever multimedia domain-specific contexts, such as the rules of a soccer game, change, human experts must certainly interface with the system again to supervise and thereby accelerate the categorisation process, gauging the system up to an acceptable confidence level, so to speak. In this paper we present such a supervised multimedia categorisation system and apply it to a soccer game. Automation of the multimedia categorisation will not be reached at the level of natural language alone. Our categorisation system takes advantage of the information and knowledge hidden in other modalities like audio-video material, because those media are actually the bases of the induced hierarchically minimal linguistic multimedia descriptions, i.e. multimedia summarisations and associations.
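The categorical-versus-numeric distinction can be sketched with a minimal decision stump; the threshold and training pairs below are purely illustrative, not the system described in this paper:

```python
# A minimal decision stump: the same structure stores either a class name
# (discrete classes -> categorical prediction) or a numeric value
# (continuous class -> numeric prediction) at each leaf.
# Threshold and training pairs are illustrative only.

def fit_stump(samples, threshold, leaf="class"):
    """Split 1-D samples (x, y) at `threshold` and store a prediction
    at each leaf: a majority class name, or a mean value."""
    left = [y for x, y in samples if x <= threshold]
    right = [y for x, y in samples if x > threshold]
    if leaf == "class":
        summarise = lambda ys: max(set(ys), key=ys.count)   # majority class
    else:
        summarise = lambda ys: sum(ys) / len(ys)            # mean value
    return {"t": threshold, "left": summarise(left), "right": summarise(right)}

def predict(stump, x):
    return stump["left"] if x <= stump["t"] else stump["right"]

# Categorical: does a fragment contain a goal?
clf = fit_stump([(0.2, "no-goal"), (0.3, "no-goal"), (0.8, "goal")], 0.5)
# Numeric: predicted crowd-excitement level of a fragment.
reg = fit_stump([(0.2, 1.0), (0.3, 2.0), (0.8, 9.0)], 0.5, leaf="value")
```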
Note that supervised machine learning in our approach to multimedia categorisation boosts or reinforces old inference rules derived from the interaction of a human expert with the system, but also enforces new rules, like those for the soccer game.

The purpose of a multimedia categorisation system is to come up with reproducible and stable multimedia representation, analysis, processing and inference schemes. The latter schemes in turn could be used to index, query, retrieve, summarise and associate multimedia databases within different usage contexts [4, 5].

E-mail: {Frank.Aldershoff, Alfons.Salden, Sorin.Iacob, Masja.Kempen}@telin.nl

(De)categorification of dynamic multimedia objects could be defined as a natural, multimedia-consistent summarisation, abstraction and association, respectively. When decategorifying, one forgets about morphisms that actually allow us to abstract away from reality, and not only in the sense of a simplification: the concept of an "abstract painting" gives us more freedom than bare mathematical axiomatics. Here the multimedia could comprise other streams, like annotations supplied by human experts. These annotation streams could be classified as the highest levels of (de)categorification possible, interfacing almost directly with the usage context. They are therewith the most valuable for reproducible, robust and discriminative multimedia categorisation. The question arises how to automate such a multimedia categorisation system for the multimedia annotations. Such an automation would assure a sufficiently rapid categorisation of dynamic multimedia that is ever growing in volume and type, and is used in different contexts.

Recently it was shown that integration of information from multiple senses can be very beneficial to humans performing complex tasks [6]. In multimedia categorisation system theories and applications, this observation led to the proposal of coupling multimodal dynamic objects in order to reveal the most predominant semantics of the concurrent multimodal streams [4, 5]. These theories and applications already alluded to using user profiles or other context information to steer multimedia categorisation schemes in a particular direction and to speed up that process considerably. Evidently, supervision by human experts could further accelerate the categorisation process for static as well as dynamic multimedia objects.
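Such coupling can be sketched as a late-fusion step in which per-modality evidence scores are combined and optionally reweighted by a user profile; all scores and weights below are hypothetical:

```python
# Late-fusion sketch: combine per-modality evidence scores for one event
# by a weighted average; weights may come from a user profile or other
# context information. All scores and weights are hypothetical.

def fuse_scores(modality_scores, context_weights=None):
    """Weighted average of per-modality scores in [0, 1]."""
    if context_weights is None:
        context_weights = {m: 1.0 for m in modality_scores}  # uniform weights
    total = sum(context_weights[m] for m in modality_scores)
    return sum(s * context_weights[m] for m, s in modality_scores.items()) / total

# Audio alone is ambiguous; coupled with video and text the event is clearer.
scores = {"audio": 0.6, "video": 0.9, "text": 0.9}
fused = fuse_scores(scores)
biased = fuse_scores(scores, {"audio": 2.0, "video": 1.0, "text": 1.0})
```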

Current categorisation system theories and applications are developed and deployed from different disciplinary perspectives. Video categorisation systems like QBIC [7], VisualSEEK [8], Virage [9] and VideoQ [10] are not designed to integrate and couple multiple multimedia streams. Furthermore, multimedia models that integrate different multimodal streams, like text and images, still do not really fuse these streams during the multimedia categorisation process [11].

The superiority of supervised over semi- and unsupervised categorisation systems in terms of effectiveness has been observed for text, audio and video [12, 13]. For example, supervised multiscale Bayesian image segmentation models enable a robust contextual categorisation of textured images. The different multiscale models are based on pyramids, Hidden Markov Trees and Markov Random Fields that are used to characterise interscale and intrascale dependencies of textures [14-19].

The above multiscale Bayesian image categorisation models have, however, some notable drawbacks. First, the geometry assumed to live on the image domain is not directly coupled to the image features themselves. Instead, the geometry is fully determined by the layout of the grid, i.e. a Euclidean space is assumed to be appropriate. Realising that surface textures are normally view-dependent, it should come as no surprise that under perspective transformations, or more severe image transformations like homotopies or morphological transformations, those models break down. Second, these multiscale segmentation schemes are not coupled to the image structure, i.e. the image intensity field is not assumed to induce an appropriate metric or connection such that a consistent figure-ground segmentation can be carried out. Such a metric or connection is known in physics as a field potential, whereas the figure-ground segmentation is substantiated by field strengths that can be derived from that metric or connection. Third, the scaling of image structure that is carried out to retain reliable and robust contextual or feature information is coupled neither to that image structure nor to external usage contexts. Fourth, these texture models do not really support multiple image interpretations, because the metric, connection and similarity operators used are fixed, so that they fail to resolve ambiguities. The question arises which multiscale image categorisation model should be applied in order to retain robust and discriminative image features and contexts as texture classifiers consistent with a particular usage context. Another question is how to extend multiscale image categorisation to other monomodal and multimodal media. Answering these questions is necessary in order to decide which multi-scale and multi-context inferences to carry out on multimedia in general. After doing such inferences we know into which category multimedia objects could fall and which interpretations could be associated with them.

Our multimedia categorisation system not only supports, but also extends, the multiscale Bayesian image models in all of the above aspects. For example, applying various connections, metrics, similarity operators and inference schemes could resolve ambiguities that occur upon changes in usage context (see Refs. [4, 5] for extensive expositions on these matters). Our approach to multimedia categorisation substantiates content-, colour- and algebraic-frame-based methods [20-24]. Furthermore, the supervised multimedia categorisation system we propose has not yet been considered. Instead of trying to classify every multimedia element, our system considers statistical geometries and topologies of multimedia-consistent scale-spaces. The multimedia-dependent metrics, connections and similarity operators then in essence determine the scale-space structure that can be explored by means of corresponding inference schemes. Furthermore, figure-ground segmentation schemes and inference schemes to index, query and retrieve arrangements of multimedia objects are then in fact also a consequence of choices in usage contexts [4, 5]. The question arises how supervision and automation (supervised learning) of such categorisation systems can come about on the basis of textual annotations made by an ensemble of human experts using a specific domain ontology [25]. Note that the ensemble of multimedia objects, including annotations, also possesses a natural statistics and a corresponding geometry and topology that is not yet imposed as contextual information or knowledge on multimedia-consistent scale-space schemes. This natural statistics and geometry actually influences a human expert in defining a specific domain ontology. In this context a domain ontology (e.g. for soccer) is a hierarchical taxonomy of categories, concepts or words, and even of relationships between concepts. In the field of natural language, hard text categorisation is concerned with assigning either a true or a false value to the decision to put a document in a certain category by means of a so-called classifier [26-28].

If only one category can be assigned, one speaks of single-label text categorisation; in the case of more categories, of multi-label text categorisation. Furthermore, the use of a text classifier to find categories (documents) given a document (category) is referred to as document-pivoted (category-pivoted) text categorisation. Opposite to hard text categorisation stands ranking text categorisation, in which the classifier estimates the similarity of categories (documents) with a certain document (category), i.e. so-called category-ranking (document-ranking) text categorisation. The question arises how an ontology for, e.g., the soccer domain can enforce statistically relevant inference schemes for the semantically most meaningful events, like goals and penalties, by means of supervised learning.
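These variants can be sketched with a cosine-similarity classifier over bag-of-words vectors; the category profiles, document and threshold below are illustrative assumptions, not the paper's classifier:

```python
import math

def cosine(u, v):
    """Cosine similarity of two bag-of-words weight vectors (dicts)."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_categories(doc, category_profiles):
    """Category-ranking categorisation: order categories by similarity."""
    return sorted(category_profiles,
                  key=lambda c: cosine(doc, category_profiles[c]),
                  reverse=True)

def hard_categorise(doc, category_profiles, threshold=0.5):
    """Hard multi-label categorisation: a True/False decision per category."""
    return {c: cosine(doc, p) >= threshold
            for c, p in category_profiles.items()}

# Hypothetical soccer-domain category profiles and document:
profiles = {"goal": {"goal": 1.0, "score": 0.8},
            "penalty": {"penalty": 1.0, "foul": 0.7}}
doc = {"goal": 1.0, "score": 1.0}
```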

Our paper is organised as follows. In Section 2 we briefly recapitulate and illustrate the mathematical-physical and logical framework of our supervised multimedia categorisation system. In Section 3 we demonstrate the effectiveness of our supervised multimedia categorisation system in retrieving statistically relevant and reinforced multiscale semantics inferred from arrangements of multimedia objects obtained after multiscale multimedia segmentation. The reinforcement of the multiscale semantics is induced by the annotations made over time by an ensemble of human experts. We do all this for the soccer domain in particular.

2. SUPERVISED MULTIMEDIA CATEGORISATION

Multimedia can be represented as a mapping of a vector-valued energy-density field (current) of the external physical field dynamics onto a vector-valued density field of the induced field dynamics on a multimedia categorisation system:

Definition 2.1. Multimedia I is defined by a mapping I : M → N, where M is an external physical field dynamics, N an induced/stored physical field dynamics on the multimedia categorisation system, and I an induction operator.

Here M could be the external multimedia field dynamics impinging on the system, whereas N could be the induced multimedia system dynamics given the induction operator I, which hides the interaction or entanglements of external, induced and stored physical field dynamics. Note that I, M and N form a mathematical-physical model of space-time and of external, induced and stored physical field dynamics. The latter multimedia dynamics could comprise streams of video, audio and text, but also annotations made by human experts.

In order to properly analyse multimedia (Definition 2.1) in terms of a complete and irreducible set of equivalences, it is essential to know how the multimedia representation changes when it is subjected to a particular class of so-called gauge groups:

Definition 2.2. A gauge group G consistent with multimedia (Definition 2.1) is a group or set of transformations leaving multimedia (Definition 2.1), or some of its properties, invariant.

Such gauge groups could cover deformational and even morphological transformations, including spatio-temporal reordering, cutting, pasting, insertion and deletion of multimedia objects. For audio-video as well as text, such a gauge group could introduce e.g. flashbacks, while at an expert level it might even cause a change in plot (Sherlock Holmes is not the detective but the murderer).
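As a toy illustration (with hypothetical shot labels), the multiset of multimedia objects is one statistic that stays invariant under the reordering elements of such a gauge group: a flashback permutes the shots but leaves this equivalence unchanged, whereas deletion of a shot does not.

```python
from collections import Counter

def reorder_invariant(stream):
    """An equivalence invariant under the spatio-temporal reordering gauge
    group: the multiset (histogram) of multimedia objects, which ignores
    the order in which they occur."""
    return Counter(stream)

# Hypothetical shot labels: the flashback cut permutes the shots.
original = ["intro", "crime", "investigation", "reveal"]
flashback = ["reveal", "intro", "crime", "investigation"]
```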

A set of equivalences of multimedia (Definition 2.1) could come about after setting up a (co-)frame field, metric and/or connection invariant under a particular class of gauge groups (Definition 2.2) (see Refs. [4, 5] for a brief exposition on those objects). Besides semi-local geometric equivalences there also exist non-local, joint, topological and functional equivalences that could by definition be invariant under the considered gauge group. It is noteworthy that such equivalences are not restricted to simple indices but extend to physical objects and operators such as multimedia inference schemes. The latter type of equivalences yields, besides winding numbers of higher-order homotopy groups, Betti numbers of co-homology groups, and knot, link, braid and other CW-complex invariants, also morphisms enabling abstraction and association. Together these invariants or equivalences, which comprise not only measures but also inference and association schemes, enable a multimedia-consistent (de)categorification. The latter observation explains how supervised learning of a domain ontology identified by human experts can automate and considerably speed up multimedia categorisation. Note that in the audio, video and text domains such equivalences are commonly known as features, such as edges, cepstral coefficients and classifiers, respectively. Quite often these equivalences are heuristically formulated and not consistent with the model assumed to underlie the multimedia categorisation scheme.

As indicated, some of the gauge groups generate active transformations, such as morphological transformations, that have far-reaching implications on a semantic as well as a contextual level (Sherlock Holmes is murdered). However, they could be undone by means of similarity operations inducing inference schemes on the multimedia objects that remain robust and discriminative despite those types of transformations (evidently, somebody died). Thus our multimedia categorisation problem involves, besides the problem of invariance under gauge groups, also the problem of robustness under similarity operations that filter possible outliers. For example, spatio-temporal reordering of multimedia objects need not cause dramatic changes in the interpretations acquired through inferences carried out on multimedia.

In order to ensure some level of robustness, i.e. Lyapunov stability under noise and structural stability under severe morphological transformations, as well as discriminative power of a multimedia categorisation in terms of equivalences, a gauge-group-consistent class of multimedia filtrations must be applied iteratively until that level and power are reached:

Definition 2.3. A filtration of equivalence F states that the change per unit scale τ in equivalence F in a spatio-temporal region Ω of the multimedia (Definition 2.1) is equal to the topological current jF between this region and its surrounding across their (common, or virtual in case of non-adjacency) boundary S = ∂Ω:

\[
\delta_\tau F = -j_F, \qquad
j_F = -\frac{\nabla_{\Gamma}^{v_s} F}{\gamma^2\!\left(\sqrt{g\!\left(\nabla_{\Gamma}^{v_s} F,\ \nabla_{\Gamma}^{v_s} F\right)}\right)}, \qquad
Z = \exp\!\left[-F[V_i(x)]\right], \qquad
F[V_i(x)] = \sum_{i,k,p} dv_p\!\left(V_{i;\pi_k(g_1 \ldots g_k)}\!\left(x,\ \tau_{i;\pi_k(g_1 \ldots g_k)}\right)\right),
\]

with γ a monotonically increasing function, (g, Γ) a metric and connection with suitable initial-boundary conditions, v_s a vector field connecting free equivalence states F(p_i) and F(p_j) on different physical objects, and equivalence F related to the statistical partition function Z. Here x labels any spatio-temporal location, region or cell, π_k is a permutation of a sequence of k ≥ 0 integers (g_1 ... g_k), with k = 0 labeling the frame vector fields v_{g_k}, and the τ's are dynamic scales consistent with the gauge group G and the equivalences V_{i;π_k(g_1...g_k)}.
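Under strong simplifying assumptions (a regular 1-D grid, Euclidean metric, and γ²(s) = 1 + (s/κ)²) the filtration of Definition 2.3 takes the form of a Perona-Malik-style nonlinear diffusion, where each scale step exchanges current between neighbouring cells. The sketch below is only that simplest instance; κ, the step sizes and the test signal are illustrative:

```python
# Sketch of Definition 2.3 on a 1-D grid: flux between neighbouring cells
# is j = grad / gamma^2(|grad|) with gamma^2(s) = 1 + (s/kappa)^2, so large
# gradients (edges) diffuse less than small ones (noise).
# kappa, dtau, steps and the signal are illustrative assumptions.

def diffuse(F, kappa=1.0, dtau=0.1, steps=10):
    F = list(F)
    for _ in range(steps):
        # flux across the boundary between cells i and i + 1
        flux = []
        for i in range(len(F) - 1):
            grad = F[i + 1] - F[i]
            flux.append(grad / (1.0 + (grad / kappa) ** 2))
        new = list(F)
        for i in range(len(F)):
            inflow = flux[i] if i < len(flux) else 0.0    # from the right
            outflow = flux[i - 1] if i > 0 else 0.0       # to the left
            new[i] = F[i] + dtau * (inflow - outflow)
        F = new
    return F
```

Because each step only moves current between adjacent cells, the total "charge" sum(F) is conserved while the peak is smoothed out.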

The topological current could be steered by equivalences such that predominant semantic multimedia information, knowledge or contexts are preserved. The initial and boundary conditions can be formulated in terms of similar information. For example, an initial segmentation of the multimedia into figure and ground could be used to impose spatio-temporal inclusion and connectivity relations and dynamic ordering relations for an equivalence. These relations in turn could be used to set up local or global constraints on the exchange principle. Note that the dynamic scale-space paradigm can be different for each equivalence invariant under a gauge group. Furthermore, the formulation of the topological current could depend on the spatio-temporal inhomogeneity, asymmetry and anisotropy of the multimedia dynamics. Last but not least, multimedia-consistent similarity operators possibly have related recursion operators that generate other solutions of the above differential-integral equation used in multimedia processing. Both these types of operators generate hierarchical nestings of gauge-invariant, robust and discriminative equivalences that could be perceptually consistently grouped into multimedia objects. These hierarchical nestings of self-similar dynamic multimedia objects could come about through segmentations and arrangements of dynamic scale-spaces of equivalences of primal multimedia objects. Next, an ensemble of inference schemes can be derived on top of those equivalences through combinatorics and enumeration. Combinatorics and enumeration of semantics, by inducing different connections on the multimedia objects, perfectly explain the possibility of ambiguity and also the capability of humans to disambiguate such objects given a particular context in which input stimuli occur. In order to come up with a unique interpretation of scenes, user context information is obviously indispensable. Thus these (fuzzy) inference schemes could give rise to multiple dynamic multimedia object interpretations. The output of these inference schemes, in terms of gauge-invariant and robust multimedia meta-indices, can then also be used to come up with multiple query, retrieval, summarisation, synthesis and association schemes. The natural statistics of those artificial inference schemes could in the end correlate with, and even underlie, the syntax, semantics and contexts assigned to multimedia by human experts applying a specific domain ontology such as soccer.

Recently we demonstrated that integration of audio and video material is very important in order to arrive at robust and discriminative audio-video categorisation schemes capable of revealing meaningful compositional semantics [4, 5]. Instead of repeating the same academic exercise of integrating audio-video and text together with multimedia annotations made by a human expert, we merely elucidate in the sequel how categorisation schemes available for text can be directly fused with those for other multimedia.

Machine learning approaches to text categorisation automatically build a text classifier by learning the characteristics of the categories of interest from an initial corpus of documents, manually preclassified by a domain expert, with accuracies comparable to or exceeding those of human experts. The effectiveness of a specific text classifier is evaluated with respect to a test set of documents by measuring how often an expert's categorisation decision on the test documents matches the one made by the classifier. Obviously, the categorisation schemes instantiated by an expert statistically gauge or influence machine learning approaches to text categorisation. Analogously, our artificial multimedia categorisation system could be complemented with a supervised machine learning component for textual information.
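This effectiveness measure can be sketched as a simple agreement rate between expert and classifier decisions (the labels below are hypothetical):

```python
def expert_agreement(expert_labels, classifier_labels):
    """Fraction of test documents on which the classifier's category
    decision matches the human expert's annotation."""
    matches = sum(e == c for e, c in zip(expert_labels, classifier_labels))
    return matches / len(expert_labels)

# Hypothetical annotations for four test fragments:
expert = ["goal", "foul", "goal", "throw-in"]
system = ["goal", "foul", "no-goal", "throw-in"]
```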

In the above multimedia categorisation system, adapted to machine learning approaches for text categorisation, the inductive construction of text classifiers plays a crucial role. Inductive construction of text classifiers involves the analytical or experimental determination of categorisation status value functions yielding evidence measures that a document belongs to a certain category. These hard or ranked measures in turn require determining thresholds for the classifiers. The classifiers considered for text categorisation could be probabilistic, decision-tree or rule-based, neural-network, example-based, support vector machine and committee classifiers. Other classifiers are based on regression methods, e.g. Linear Least Squares Fit; on-line methods, e.g. the cosine similarity classifier and the Rocchio method for closeness of vectors of weighted terms to a category; Bayesian inference networks; genetic algorithms; and maximum entropy modelling. Again, our supervised multimedia categorisation system provides similar inductive construction schemes for multimedia classifiers, with the difference that they are based on a dynamic scale-space approach to multimedia equivalences.

In text categorisation, the techniques used for indexing the documents from an initial corpus and those to be classified, for the inductive construction of text classifiers, and for evaluating the effectiveness of classifiers are mainly information retrieval techniques. Evidently, our dynamic scale-space approach extends the latter techniques considerably and moreover gives them a solid basis. For example, Bayesian inference networks for text categorisation are physically substantiated by associated multimedia streams, and our categorisation system makes advanced forms of such networks operational.

Document indexing is one of the main problems in multimedia categorisation. It involves the problem of lexical semantics, i.e. identifying meaningful units of text, and the problem of compositional semantics, i.e. identifying meaningful rules for combining those units of text. Lexical semantics could be represented as a vector of weighted terms or features, i.e. a bag of words. Compositional semantics could be indexed by means of Hidden Markov Models [29]. Instead of using only weights of terms such as words, stems or phrases, the Darmstadt Indexing Approach (DIA) [30] uses properties of terms, documents, categories, and their pairwise relationships. This indexing approach largely resembles our dynamic scale-space approach in determining appropriate fuzzy inference schemes. In order to make the inductive construction of text classifiers feasible and to prevent overfitting to the documents to be classified, one applies to the term space a so-called dimensionality reduction: by information-theoretic term selection or filtering on the basis of e.g. document frequency and information gain, or by term extraction via (un)supervised term clustering [12] and latent semantic indexing to handle synonymy and polysemy [31]. Again, our multimedia filtration schemes support similar dimensionality reduction schemes.
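A minimal sketch of bag-of-words indexing with document-frequency-based dimensionality reduction (the corpus and the `min_df` cut-off are illustrative assumptions):

```python
def bag_of_words(doc):
    """Lexical semantics as a vector of weighted terms (a bag of words),
    here with relative-frequency weights."""
    words = doc.lower().split()
    return {w: words.count(w) / len(words) for w in set(words)}

def reduce_by_document_frequency(corpus_bags, min_df=2):
    """Dimensionality reduction by term filtering: keep only terms that
    occur in at least `min_df` documents, discarding rare terms to
    limit overfitting."""
    df = {}
    for bag in corpus_bags:
        for t in bag:
            df[t] = df.get(t, 0) + 1
    keep = {t for t, n in df.items() if n >= min_df}
    return [{t: w for t, w in bag.items() if t in keep} for bag in corpus_bags]

# Hypothetical three-document corpus:
docs = ["goal scored goal", "penalty goal", "corner kick"]
bags = reduce_by_document_frequency([bag_of_words(d) for d in docs])
```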

Applications of text categorisation range from automatic document indexing with controlled dictionaries; document organisation; word sense disambiguation, context-sensitive spelling correction, prepositional phrase attachment, part-of-speech tagging and word choice selection; to hierarchical categorisation of hypermedia on the Web. Our supervised multimedia categorisation schemes are typically well designed to handle such categorisation issues [4, 5].

The evaluation of the effectiveness of text categorisation is normally based on precision and recall, or on benchmarks. The applied evaluation criteria are unfortunately not well-defined, nor are they text or document consistent. Our supervised multimedia categorisation approach offers an opportunity to get hold of the proper criteria to be considered. Further elaboration on these matters is out of the scope of this paper and will be presented in a forthcoming exposition.
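Precision and recall themselves are straightforward to compute; the sketch below uses hypothetical fragment identifiers:

```python
def precision_recall(retrieved, relevant):
    """Precision = fraction of retrieved fragments that are relevant;
    recall = fraction of relevant fragments that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical soccer-fragment retrieval result:
p, r = precision_recall(retrieved=["goal1", "goal2", "foul1"],
                        relevant=["goal1", "goal2", "goal3"])
```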

For supervised multimedia categorisation it is necessary to segment a dynamic scale-space of multimedia. The segmentation schemes permitted by our categorisation scheme are then adapted to the raw multimedia structure as well as to the usage context imposed by a human expert using a particular domain ontology. Furthermore, a human expert will identify specific arrangements of segmented dynamic scale-space objects of multimedia in determining the lexical and compositional semantics of multimedia by means of inferences. In the following subsections we further elaborate on multimedia segmentation, arrangements and inferences.

2.1. Multimedia Segmentation

In the introduction to this section we addressed the supervised multimedia categorisation problem given a class of gauge groups to which audio, video, text and annotation streams could be subjected. Assuming we have identified this class of deformational as well as morphological multimedia transformations, we may readily derive appropriate segmentation operators for dynamic multimedia scale-spaces [4, 5]. For multimedia streams I = (I^1, ..., I^k), in which k labels e.g. audio, video, text and annotation streams, we can subsequently retain a spatio-temporal and scale segmentation of a dynamic multimedia scale-space by means of the following regularised topological distributions:

• Topological distribution of temporal extrema:

\[
\nu_{e_t}(I^k) = \frac{1}{2}\left(\operatorname{sign}(\nabla_{e_t} I^k) + \operatorname{sign}(\nabla_{-e_t} I^k)\right),
\]

• Topological distribution of temporal inflections:

\[
\nu_{i_t}(I^k) = \frac{1}{2}\left(\operatorname{sign}\nabla_{e_t}(\nabla_{e_t} I^k) - \operatorname{sign}\nabla_{-e_t}(\nabla_{-e_t} I^k)\right),
\]

• Topological distribution of spatial extrema:

\[
\nu_{e_s}(I^k) = \frac{1}{\# e_s}\sum_{s}\left(\operatorname{sign}(\nabla_{e_s} I^k) + \operatorname{sign}(\nabla_{-e_s} I^k)\right),
\]

where s denotes the (locally relevant) spatial dimensions of the multimedia,

• Topological distribution of spatial inflections:

\[
\nu_{i_s}(I^k) = \frac{1}{2}\left(\operatorname{sign}\nabla_{e_s}(\nabla_{e_s} I^k) - \operatorname{sign}\nabla_{-e_s}(\nabla_{-e_s} I^k)\right),
\]

where in this case e_s denotes the normalised multimedia gradient field,

• Topological distribution of ridge manifolds:

\[
\nu_{r_s}(I^k) = \frac{1}{2}\left(\operatorname{sign}\nabla_{e_s^{\perp}}(\nabla_{e_s} I^k) + \operatorname{sign}\nabla_{-e_s^{\perp}}(\nabla_{-e_s} I^k)\right),
\]

where in this case e_s^⊥ denotes one of the normalised orthogonal multimedia gradient fields,

• Topological distribution of scale extrema:

\[
\nu_{e_\tau}(I^k) = \frac{1}{2}\left(\operatorname{sign}\nabla_{e_\tau} I^k + \operatorname{sign}\nabla_{-e_\tau} I^k\right),
\]

• Topological distribution of scale inflections:

\[
\nu_{i_\tau}(I^k) = \frac{1}{2}\left(\operatorname{sign}\nabla_{e_\tau}(\nabla_{e_\tau} I^k) - \operatorname{sign}\nabla_{-e_\tau}(\nabla_{-e_\tau} I^k)\right).
\]

Here e’s are gauge invariant frame fields that define integral multimedia segments. At local temporal maximaνe

t (Ik) = 1, whereas at local temporal minima νet (Ik) = −1. For regular temporal points and inflections νe

t = 0.Similar reasonings hold for the other topological distributions. Furthermore, one can distinguish rising and fallinginflections and the like, as we can distinguish between left and right, top and bottom, and future and past inmultimedia. Of course, there exist other relevant spatio-temporal topological quanta and geometric quanta, likedislocation and disclination currents that could capture multimedia deformation as well as distortion fields.4, 5

Note that the multimedia could be represented according to various categorisation models including colour models.22

A multi-scale multimedia segmentation on the basis of the above topological distributions provides us with robust and discriminative equivalences to instantiate a categorisation. For example, a multi-scale spatial figure-ground segmentation of video frames comes about by studying the signature of ν_{i_s}(I^k). On the basis of this segmentation, subsequently, connectivity relations, spatial inclusion and hierarchical nestings of multimedia objects can be determined on the basis of other similar criteria.4, 5 Similar remarks hold for the temporal aspects of multimedia. For example, pitches in the audio signal could be detected on the basis of signature criteria analogous to those for figure-ground segmentation. Furthermore, shot changes in the video stream can be robustly quantified and qualified in terms of both topological and modern geometric currents. Last but not least, the lexical and compositional semantics of streams of text and annotations can be associated with the audio-video segmentation. Actually, the latter segmentation substantiates the text and annotation streams. These lexical and compositional semantics are coupled to and can be associated with arrangements on a multi-scale multimedia segmentation. Therewith annotation streams induced by a human expert could generate audio-video and text streams, and vice versa, i.e. abstraction versus summarisation in (de)categorification.
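A minimal 1-D sketch of such signature-based segmentation is given below. Reading the signature off the sign changes of the second difference is our own simplification of the inflection distribution ν_{i_s}; real video frames would of course require the full multi-scale, multi-dimensional machinery:

```python
import numpy as np

def inflection_segments(signal):
    """Cut a 1-D stream into segments bounded by inflection points,
    i.e. positions where the sign of the second difference changes.
    A toy analogue of the figure-ground segmentation of video frames."""
    I = np.asarray(signal, dtype=float)
    curv = np.sign(np.diff(I, 2))                # sign of second difference
    cuts = np.where(np.diff(curv) != 0)[0] + 1   # inflection positions
    bounds = [0] + list(cuts) + [len(I) - 1]
    return list(zip(bounds[:-1], bounds[1:]))    # (start, end) index pairs
```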

2.2. Multimedia Arrangements

In the previous paragraph we obtained spatio-temporal dynamic scale-space segmentations of multimedia. The problem remains how to induce a relevant semantic structure consistent with a specific usage context. As stated above, the multi-scale multimedia segmentations supply us with the opportunity to look around segmented multimedia objects and lay bare their mutual relations. These relations can be spatio-temporal inclusion relations and dynamic ordering relations, i.e. hierarchical nestings of segmented multi-scale multimedia objects induced by filtration schemes (see section 2). Using run-length coding and connectivity analysis techniques on top of those objects one can derive different multimedia semantics. Even very advanced modern geometric and topological equivalences for categorisation can be computed.4, 5 However, there are so many possible semantic representations that supervised learning of inference schemes by the system is needed. Another reason is that the inferences made by a machine should also make sense to humans. In order to supervise our categorisation system we should try to figure out which robust and discriminative inference schemes a human expert uses. By allowing a human expert to interface with our multimedia categorisation system, he or she can point out which arrangements of segmented multi-scale multimedia objects to look for and to consider as input for further usage-context-sensible analysis.
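For instance, the run-length coding step might look like the following sketch; representing a segmentation as a 1-D binary mask is our own simplifying assumption:

```python
def run_lengths(mask):
    """Run-length coding of a 1-D binary segmentation mask:
    return (start, length) pairs for each maximal run of 1s."""
    runs, start = [], None
    for i, v in enumerate(mask):
        if v and start is None:
            start = i                       # a run of object pixels begins
        elif not v and start is not None:
            runs.append((start, i - start))  # the run ends
            start = None
    if start is not None:
        runs.append((start, len(mask) - start))
    return runs
```

Connectivity analysis in two or more dimensions proceeds analogously, by merging overlapping runs on adjacent scan lines.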

2.3. Multimedia Inferences

On the basis of multiscale arrangements of segmented multimedia objects that are assigned by a human expert applying a particular domain-specific ontology within a given usage context, we should derive usable multimedia inference schemes in order to effectuate supervised multimedia categorisation.4, 5 These supervised inference schemes could form the backbone of a decision rule system that ultimately could enable automation of multimedia categorisation. One such inference scheme could be based on geometric and algebraic invariants of multimedia. By means of signature criteria with respect to the topological distributions of section 2.1 we could segment multi-scale audio-video and lexical objects. Next we could compute their spatial and temporal centers of mass. Subsequently, we could describe the center ξ_i and mass m_i of each object with respect to the center X_c of mass M_c of all objects considered meaningful by a human expert. For spatial objects in video frames, say N in number, so-called linear forms l_i:

l_i = (m_i / M_c) ξ_i · ξ,

could be constructed. Here the operator "·" could be consistent with the central perspective or elliptic geometry introduced by the camera system. A problem from classical invariance theory is finding an irreducible and complete set of invariants of this set of forms. Such a system of independent integral and algebraic invariants A_p can be retained upon computing:

H_{2N−2}(ξ, t) = (h, g)_1 = h′(t) g(t) − g′(t) h(t) = Σ_{p=0}^{2N−4} A_p t^{2N−p};   H(ξ, t) = Σ_{k=1}^{N} l_k t^{N−k} = ξ_1 h(t) + ξ_2 g(t).

Finally, considering the signatures of the invariants A_p, global morphological changes between consecutive video frames can be read out. We could certainly add numerous other examples, like permutation invariants32 for the geometric structures retained by imposing various dynamic ordering and spatio-temporal inclusion relations, not only for video frames but also for other slices of the multimedia. For example, enumeration and combinatorics of arrangements of segmented multimedia objects could permit us to come up with possible robust cognitive inference schemes to resolve multimedial ambiguities. Ultimately, we could associate such invariants to the lexical and compositional semantics pointed out by a human expert using a specific domain ontology given a certain usage context.
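The centre-of-mass construction behind the linear forms above can be sketched as follows. Representing each segmented object by an array of its pixel coordinates and taking the pixel count as its mass are our own simplifying assumptions:

```python
import numpy as np

def linear_form_coefficients(objects):
    """Coefficient vectors of the linear forms l_i = (m_i / M_c) * xi_i,
    where xi_i is an object's centre of mass relative to the global centre
    X_c and m_i its mass (here: pixel count). 'objects' is a list of
    (n_pixels, 2) coordinate arrays, one per segmented object."""
    masses = np.array([len(o) for o in objects], dtype=float)
    centres = np.array([o.mean(axis=0) for o in objects])
    Mc = masses.sum()
    Xc = (masses[:, None] * centres).sum(axis=0) / Mc  # global centre of mass
    xi = centres - Xc                                  # relative centres xi_i
    return (masses[:, None] / Mc) * xi                 # coefficients of l_i
```

The signatures of invariants built from these coefficients could then be compared across consecutive frames.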

3. APPLICATION TO SOCCER DOMAIN

After so many years of multimedia research, automation of multimedia categorisation seems further off than ever. As stated in the introduction, the need for automation of multimedia categorisation is ever more pressing. However, ontological multimedia representations provided by a human expert could make such an automation happen. In the sequel we automate supervised multimedia categorisation, capable of handling large amounts of multimedia and enabling a human expert to do the "intelligent work". The inference schemes applied by the human expert are learned by the system and associated with those inference schemes that can be made operational by the system on the raw multimedia streams of text, audio and video. We do so for a multimedia stream of a soccer match that includes audio, video, speech-to-text transcriptions and annotations made by a human expert. The domain ontology is soccer∗. The usage context for which the human expert is providing multimedia annotation streams is fast retrieval for TV summaries of, e.g., scores after interruptions of play.

The supervised learning phase of automation of our categorisation system consists of (re)inforcing inference schemes consistent with the raw multimedia streams that are strongly correlated to or associated with those applied by a human expert. He or she expresses those inferences in terms of annotation streams consistent with our soccer-domain specific ontology within the particular usage context mentioned above. The inferences quantify and qualify the lexical and compositional semantics of the raw dynamic multimedia objects in terms of discrete classes or continuous classes.1–3 The evolutionary statistics of the ontology domain and usage context imposed by the human experts determine a hierarchical nesting of the most predominant and valuable multimedia

∗http://www.lgi2p.ema.fr/ ranwezs/ontologies/soccerV2.0.daml

Figure 1. Preamble to a goal.

inference schemes that could be automated. This also enables the definition of relevance measures and metrics for automating and launching appropriate multimedia indexing, query and retrieval schemes.
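A toy sketch of this (re)inforcement loop is given below; the dictionary representation, the one-frame co-occurrence window and the ±lr update rule are all our own illustrative assumptions, not the system's actual learning rule:

```python
def reinforce(weights, machine_events, expert_times, lr=0.1):
    """Toy supervised (re)inforcement: an inference scheme whose detected
    event times co-occur (within one frame) with the expert's annotation
    times gains confidence; otherwise it loses some.
    weights: {scheme: confidence}; machine_events: {scheme: set of times}."""
    for scheme, times in machine_events.items():
        hit = any(abs(t - e) <= 1 for t in times for e in expert_times)
        weights[scheme] = weights.get(scheme, 0.0) + (lr if hit else -lr)
    return weights
```

Repeating this over many annotated fragments would let the most valuable inference schemes dominate the hierarchical nesting described above.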

In practice though, the main idea can be roughly described as a mapping of the concepts and relations making up a specific ontology onto a set of easily measurable equivalences of multiscale multimedia objects. We are aware of the fact that the whole "conceptual space" defined by ontological entities and relations cannot be mapped linearly onto any space of equivalences derived from multimedia, because some of the aspects of the usage context are not represented in terms of the applied domain ontology. Nevertheless, we will observe that raw multiscale multimedia equivalences can be reliably inferred from and associated with the conceptual space considered by a human expert. In order to illustrate our approach we implement a supervised categorisation scheme for soccer videos. Although many other techniques have been proposed for semantic video categorisation, especially for sports videos and commercials, the novelty of our technique resides in its generality in terms of its sustainability, i.e. scalability, extensibility and applicability. To simplify the implementation of our categorisation system we first start from a taxonomy of the selected domain, and isolate sub-trees that include the whole set of videos that need to be categorised. Second, we characterise the leaf nodes of the remaining sub-tree in terms of disjoint sets of multiscale multimedia equivalences (features). Third, we attach to each parent node a set of equivalences that is the union of its children's equivalences. Fourth, we analyse the input multimedia with the selected set of inference schemes corresponding to the roots of the above sub-trees. Fifth, since the typical equivalences that characterise each leaf are assumed to be irreducible and complete, it suffices to compute the angle between the vector of equivalences and each of the vectors of equivalences in the learned soccer domain ontology. We can then link the raw dynamic multimedia objects to a user-defined domain-specific ontology up to some confidence measure.
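The angle computation in the fifth step might be sketched as follows; the per-leaf prototype vectors and their dictionary representation are our own illustrative assumptions:

```python
import numpy as np

def categorise(equivalences, leaf_prototypes):
    """Match an input vector of equivalences against each leaf's learned
    prototype vector by the angle between them; the leaf with the smallest
    angle wins. Returns (leaf, angle in radians)."""
    v = np.asarray(equivalences, dtype=float)
    best, best_angle = None, np.pi
    for leaf, proto in leaf_prototypes.items():
        p = np.asarray(proto, dtype=float)
        cos = np.dot(v, p) / (np.linalg.norm(v) * np.linalg.norm(p))
        angle = np.arccos(np.clip(cos, -1.0, 1.0))  # clip guards rounding
        if angle < best_angle:
            best, best_angle = leaf, angle
    return best, best_angle
```

The returned angle itself can serve as the confidence measure mentioned above.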

In Fig. 1 and Fig. 2 this process is illustrated for several modalities of the soccer video. From top to bottom, first the video data is displayed, then the audio, followed by the text annotation and lastly instances from the ontology. The video data has been automatically segmented into objects. When properly trained, the system is able to identify these objects with the players, the ball and other entities in the soccer domain. By tracking the position of objects through time (consecutive frames), paths for the objects can be derived. These paths can then be used to identify interactions between the objects, e.g. the ball crosses the path of a player and its velocity and direction change abruptly: the player kicked the ball. The processing has been done in Mathematica,33 a symbolic mathematics modelling environment. While not the fastest environment for simulations, its symbolic nature makes it a convenient tool for semantic and symbolic modelling. For the audio, a spectrogram is displayed. The y-axis represents the range of frequencies from low (bottom) to high frequencies (top). The x-axis

Figure 2. The goal and a very happy player (Pardo).

represents time, and the intensity of the point (x, y) indicates the activity of the frequency at that time. The sound of a ball being kicked into the goal is distinctly visible in the lower frequencies. The occurrence of speech is also easily recognisable. The annotations consist of the text spoken by the TV commentator (in Dutch). The speech has been automatically extracted and segmented, but the speech-to-text transcription has been done manually. Since only a subset of words is relevant for the soccer domain, automatic speech-to-text is quite feasible. For many other domains the vocabulary of most relevant words is also quite small. A bar between the text and the spectrogram indicates the exact length of the speech sections. The words 'doelpunt' (goal) and 'Pardo' are underlined because these words directly map to entities in the ontology. In Fig. 1 the commentator is silent, so no text is shown. The last row shows instances from the ontology that are relevant to the segment they are displayed with. In both figures the relevant parts in the video domain are highlighted with an ellipse.
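The path-based inference sketched earlier in this section ('the player kicked the ball' when the ball's velocity and direction change abruptly) could, under our own simplifying assumptions about track representation and thresholds, look like:

```python
import numpy as np

def kick_events(ball_track, angle_thresh=1.0, speed_ratio=1.5):
    """Flag frame indices where a tracked ball's velocity direction or
    magnitude changes abruptly between consecutive frames.
    ball_track: sequence of (x, y) positions, one per frame.
    The thresholds (radians, ratio) are illustrative assumptions."""
    p = np.asarray(ball_track, dtype=float)
    v = np.diff(p, axis=0)  # per-frame velocity vectors
    events = []
    for t in range(1, len(v)):
        s0, s1 = np.linalg.norm(v[t - 1]), np.linalg.norm(v[t])
        if s0 == 0 or s1 == 0:
            continue
        cos = np.dot(v[t - 1], v[t]) / (s0 * s1)
        turn = np.arccos(np.clip(cos, -1.0, 1.0))  # change of direction
        if turn > angle_thresh or max(s0, s1) / min(s0, s1) > speed_ratio:
            events.append(t)  # frame index of the abrupt change
    return events
```

Coinciding such events with a player's path (and, say, a kick sound in the spectrogram) would substantiate the multimodal inference.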

4. CONCLUSION AND DISCUSSION

We presented and demonstrated an automated supervised multimedia categorisation system that allows the multimedia to be dynamic and allows human experts to interface with it, such that they can gauge the system for a specific domain ontology given a particular usage context. In order to build such a system, first of all we modelled supervised multimedia categorisation not merely as a simplification and abstraction process, but also as an association process. In this process we considered the supervision by human experts, imposed in terms of annotation streams, as an integral part of the multimedia. All multimedia streams in our categorisation framework statistically induce potentials and consequently field strengths for multimedia segmentations, arrangements and inferences. By filtration of the multimedia field strengths we retain a hierarchical nesting of multiscale multimedia dynamics. The supervised learning that is carried out by the categorisation system subsequently focuses on reproducible, robust and discriminative multimedia inferences induced by human experts and associated with and coupled to the other raw multimedia streams, i.e. our categorisation system adapts to the inference schemes used by a human expert. Whenever the domain ontology given a usage context changes, human experts are asked again to interface with and gauge the system, i.e. inference rules have to be (re)inforced. We illustrated our supervised multimedia categorisation system for an annotated soccer match.

In our exposition above we did not elaborate on the impact of our categorisation method on other disciplines. For example, in cybernetics our categorisation system could enable robots to perform reproducible, robust and discriminative coordinated and controlled autonomous actions on the basis of integrated and coupled vision, audio, motor, haptic, tactile and vestibular streams. In the field of natural language processing it could categorise

tone, chromaticity, grammar, discourse, prosody and style-figures. Most importantly, our categorisation system could be very valuable in areas like knowledge management, by providing inference rules for businesses not only to collaborate but also to compete.34

REFERENCES

1. J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA, 1992.
2. W. Daelemans, J. Zavrel, K. van der Sloot and A. van den Bosch, "TiMBL: Tilburg Memory Based Learner, version 4.2," Reference Guide Technical Report 02-01, 2002.
3. L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone, Classification and Regression Trees, Wadsworth, United Kingdom, 1984.
4. A. H. Salden, "Multimedia system analysis and processing," in Proceedings of the 2001 IEEE International Conference on Multimedia and Expo, ICME2001, Waseda University, Tokyo, Japan, CD-ROM, 2001.
5. F. Aldershoff and A. H. Salden, "Multiscale audio-video analysis and processing: segmentations and arrangements," in Proceedings of SPIE, Internet Multimedia Management Systems II, 4519, pp. 20-31, 2001.
6. B. E. Stein, P. J. Laurienti, T. R. Stanford and M. T. Wallace, "Neural mechanisms for integrating information from multiple senses," in Proceedings of the IEEE International Conference on Multimedia and Expo ICME2000, pp. 567-570, 2000.
7. C. Faloutsos, M. Flickner, W. Niblack, D. Petkovic, W. Equitz and R. Barber, "Efficient and Effective Querying by Image Content," Research Report RJ 9203 (81511), IBM Almaden Research Center, San Jose, 1993.
8. J. R. Smith and S. F. Chang, "VisualSEEk: A fully automated content-based image query system," ACM Multimedia, 1996.
9. A. Hampapur, A. Gupta, B. Horowitz, C. F. Shu, C. Fuller, J. Bach, M. Gorkani and R. Jain, "Virage Video Engine," in SPIE Proceedings on Storage and Retrieval for Image and Video Databases V, pp. 188-197, 1997.
10. S. F. Chang, W. Chen, J. Meng, H. Sundaram and D. Zhong, "VideoQ: An automated content based video search system using visual cues," ACM Multimedia 1997, pp. 313-324, 1997.
11. C. Meghini, F. Sebastiani and U. Straccia, "A model of multimedia information retrieval," Journal of the ACM, 48(5), pp. 909-970, 2001.
12. N. Slonim and N. Tishby, "The power of word clusters for text classification," in Proceedings of ECIR-01, 23rd European Colloquium on Information Retrieval Research, Darmstadt, Germany, 2001.
13. X. Song and G. Fan, "A Study of Supervised, Semi-Supervised and Unsupervised Multiscale Bayesian Image Segmentation," in Proceedings of the 45th IEEE International Midwest Symposium on Circuits and Systems, Tulsa, Oklahoma, USA, 2002.
14. M. S. Crouse, R. D. Nowak and R. G. Baraniuk, "Wavelet-based statistical signal processing using hidden Markov models," IEEE Transactions on Signal Processing, 46(4), pp. 886-902, April 1998.
15. H. Choi and R. Baraniuk, "Multiscale image segmentation using wavelet-domain hidden Markov models," IEEE Transactions on Image Processing, 10(9), pp. 1309-1321, 2001.
16. G. Fan and X.-G. Xia, "Maximum likelihood texture analysis and classification using wavelet-domain hidden Markov models," in Proceedings of the 34th Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, 2000.
17. C. A. Bouman and M. Shapiro, "A multiscale random field model for Bayesian image segmentation," IEEE Transactions on Image Processing, 3(2), pp. 162-177, 1994.
18. G. Fan and X.-G. Xia, "A joint multi-context and multiscale approach to Bayesian image segmentation," IEEE Transactions on Geoscience and Remote Sensing, 39(12), pp. 2680-2688, 2001.
19. G. Fan and X.-G. Xia, "On context-based Bayesian image segmentation: Joint multi-context and multiscale approach and wavelet-domain hidden Markov models," in Proceedings of the 35th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, Nov. 2001.
20. A. W. M. Smeulders, M. Worring, S. Santini, A. Gupta and R. Jain, "Content-based image retrieval at the end of the early years," IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12), pp. 1349-1380, 2000.
21. J. M. Corridoni, A. del Bimbo and P. Pala, "Image retrieval by colour semantics," Multimedia Systems, 7, pp. 175-183, 1999.
22. T. Gevers and A. W. M. Smeulders, "Colour based object recognition," Pattern Recognition, pp. 453-464, 1999.
23. H. Burkhardt and S. Siggelkow, "Invariant features for discriminating between equivalence classes," in Nonlinear Model-Based Image/Video Processing and Analysis, John Wiley and Sons, 2000.
24. G. Sommer and J. J. Koenderink (eds.), Algebraic Frames for the Perception-Action Cycle, Lecture Notes in Computer Science; Lecture Notes in Artificial Intelligence, 1315(VIII), 1997.
25. A. Del Bimbo, "Expressive semantics for automatic annotation and retrieval of video streams," in Proceedings of the IEEE International Conference on Multimedia and Expo ICME2000, pp. 671-674, 2000.
26. T. Joachims and F. Sebastiani (eds.), "Automated text categorisation," special issue of Journal of Intelligent Information Systems, 18(2-3), 2002.
27. D. D. Lewis and P. J. Hayes (eds.), "Automated text categorisation," special issue of ACM Transactions on Information Systems, 12(3), 1994.
28. F. Sebastiani, "Machine learning in automated text categorisation: a survey," Technical Report IEI-B4-31-1999, Pisa, IT, 1999.
29. L. Denoyer, H. Zaragoza and P. Gallinari, "HMM-based passage models for document classification and ranking," in Proceedings of ECIR-01, 23rd European Colloquium on Information Retrieval Research, Darmstadt, Germany, 2001.
30. N. Fuhr, "A probabilistic model of dictionary-based automatic indexing," in Proceedings of RIAO-85, 1st International Conference "Recherche d'Information Assistée par Ordinateur", Grenoble, France, pp. 207-216, 1985.
31. S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas and R. A. Harshman, "Indexing by Latent Semantic Analysis," Journal of the American Society for Information Science, 41(6), pp. 391-407, 1990.
32. G. Csurka and O. Faugeras, "Algebraic and geometric tools to compute projective and permutation invariants," IEEE Transactions on Pattern Analysis and Machine Intelligence, 21, pp. 58-65, 1999.
33. R. E. Maeder, The Mathematica Programmer II, Academic Press, 1996.
34. A. Salden and M. Kempen, "Business Information and Knowledge Sharing," in Proceedings of the IASTED International Conference on Information and Knowledge Sharing (IKS 2002), St. Thomas, Virgin Islands, USA, November 18-20, 2002.