
A Framework for Customizable Sports Video Management and Retrieval

Dian Tjondronegoro, Yi-Ping Phoebe Chen, and Binh Pham

Centre for Information Technology Innovation, Faculty of Information Technology

Queensland University of Technology, Brisbane, Australia
{dtjondronegoro, pchen, bpham}@qut.edu.au

Abstract. Several domain-specific approaches for sports video management have shown the benefits of integrating low- and high-level video contents to support more robust retrieval. However, very little work has shown how to integrate them in order to support different types of sports. In this paper, we firstly propose a framework for a customizable video management system which allows the system to detect the type of video to be indexed, so that appropriate tools can be used to extract the key segments. It is also customizable because the system manages user preferences and usage history to make the system support specific requirements. Secondly, we show how the extracted key segments can be summarised using standard MPEG-7 descriptions in a hierarchical scheme which is potentially easy to share between users. Thirdly, we have developed and tested some queries which show that XQuery provides a powerful language for our video management system's retrieval.

1 Introduction

With the ongoing advance of computer technology, the requirements for using video data, such as large storage capacity, fast processing power, and broadband networking, have become inexpensive. This phenomenon triggers fast growth of video data for a vast range of applications, such as sport, news and security. Unfortunately, the effectiveness of video usage is still very limited due to the unavailability of a single and complete technology which can fully capture the semantic content of video and index video parts according to their contents, so that users can intuitively retrieve specific video segments. Thus, content-based Video Management Systems (VDMS) have become an active research topic for many researchers from both academia and industry.

To enable effective usage of video, the key segments of a video stream need to be indexed (i.e. extracted and labelled according to their contents) to support content-based retrieval. Since video data is typically long and any of its arbitrary frames may contain important information which can be of users' interest, the process of developing a content-based VDMS requires the following set of complex procedures. The first step is video segmentation, where the key segments (or highlights) of the video are identified. At the same time, content analysis is used to analyse the content of each segment, and works in conjunction with video labelling, where video segments are labelled according to their contents. The last step is to structure the

"�3-��4�-5�6�-���� ���72���,��- ��/������������ ������ -��8������������%

video descriptions for efficient retrieval, such as by constructing hierarchical summaries of the exciting segments of sport videos.

These processes rely heavily on accurate and complete understanding of the semantic contents of video data. However, computers are not yet as intelligent as humans in fully understanding the complexity of semantic concepts, while manual work is too expensive and often biased by subjectivity. Thus, computers depend on computable semantics of video, such as the characteristics of its audio and moving pictures examined via signal analysis algorithms. This process is complex since different video applications have specific features and characteristics. For example, the audio components of sport video typically consist of background noise, crowd cheers and the commentator's voice. To overcome these limitations, we aim to demonstrate in this paper the significance of integrating a domain-specific approach to improve the accuracy and effectiveness of the content extraction process, as well as to allow robust retrieval according to user/application requirements. The integrated system can be customised to suit other domains, particularly video domains that have similar characteristics. This paper presents a framework for a customizable VDMS by focusing on the three most important components: key-segment extraction, content descriptions, and retrieval. Our experimental work on segmentation is based on indexing goal segments in soccer videos, while using the currently evolving technologies of MPEG-7 for our video content descriptions and XQuery for retrieval. Due to their rapid evolution, this paper only reflects the current status of MPEG-7 and XQuery.

2 Related Work

State-of-the-art content-based VDMS can be categorized into two main approaches based on the types of data which can be extracted (see [1-3] for a review). The first approach allows users to retrieve video according to its high-level semantic contents. However, the main limitation is that the manual work for annotating the video data is expensive and often subjective. The second approach utilizes low-level feature comparison, which allows users to query video by example (i.e. using audio or visual samples). Although this approach uses automatically extracted features, users are not able to retrieve video based on high-level concepts. For this reason, many researchers have tried to show the benefits of a domain-specific approach in closing the main gaps between the high-level and low-level features of video. The following are some examples of domain-specific video management approaches applied to sport videos, such as tennis, soccer, and basketball.

In the tennis domain, Zhong and Chang [4] summarised the temporal structure of live broadcast tennis video by detecting recurrent event boundaries, such as serving views, which consist of unique visual cues such as colour, motion and object layout. Such views can be automatically detected by using supervised learning and domain-specific rules. Thus, users can browse and query based on these recurrent events. In addition to serve views, Miyamori et al. [5] segmented tennis video by extracting court-net lines, player-ball position, and player behaviour. Each court-net line is detected by connecting two feature points at both sides, while the positions of players and ball are captured by adaptive template matching. Players' behaviours are


evaluated using a 2D appearance model on silhouette transitions. Using a similar framework, Sudhir et al. [6] extracted court scenes from video frames using the distinguishing colour properties of the various tennis courts, including carpet, clay, hard, and grass courts. Their court-line detection algorithm extracts basic line segments in order to reconstruct the complete tennis court. Then, players' positions are examined with regard to the tennis court lines and the net in order to recognise high-level events such as passing shots and serve-and-volleying. For soccer, Xie et al. [7] summarised soccer videos into play and break states because sport videos usually have only 60% play segments in which key events can be found. They used a Hidden Markov Model to compute play/break transitions based on the dominant-colour ratio of camera views and motion intensity features. Alternatively, Gong et al. [8] segmented a soccer match according to the court position (for example in midfield, on the left corner area and so on). Their method detects the court's white line, then eliminates the noise (e.g. from the players) before trimming the line to fix breaks due to player occlusions. The captured line is matched with a pre-defined pattern to decide which part of the court is contained in a particular scene. However, Yajima et al. [9] identified certain aspects of soccer video which cannot be annotated and retrieved by keywords. Thus, they proposed an indexing framework to allow users to query objects' movements (including player and ball) as well as to query by directly drawing a movement direction on a video screen. Their framework depends on tracing the spatio-temporal relationships between moving objects. Finally, for basketball videos, Nepal et al. [10] designed some temporal models based on the typical events which can be used to detect goals in basketball games. These models drive the decision to choose audio-visual features that can be used to automatically detect the typical events. In particular, they extracted high-energy audio segments, video text, and changes in motion direction. In contrast, Zhou et al. [11] used a rule-based classifier to categorise the structures in basketball video into left-, right-, and middle-court classes as well as a close-up class, according to distinct visual and motional characteristics.

Based on this review, we have identified the main benefits of developing domain-specific video management systems. First of all, the structure of a specific video type can be pre-determined using domain knowledge and specific user requirements. Secondly, this video structure model determines a set of specific low-level features which need to be extracted in order to detect the transitions and classification of key segments. Thirdly, key segments are indexed with content descriptions according to the domain-specific characteristics, so that queries or browsing methods can be designed to support user and application requirements. Thus, we will propose a framework to integrate them into a customizable video indexing system in the next section.

For describing video segments, several works have shown the challenges in utilizing MPEG-7 descriptions. For example, in order to describe the content of video data, Pfeiffer et al. [12] used MPEG-7 for their TV-anytime application, which showed how to describe a program summary, using a soccer match description as an example. Similarly, Van Beek et al. [13] used soccer match descriptions to give examples of the MPEG-7 Multimedia Description Schemes (MDS) semantic content descriptions and summary DS, and Hunter et al. [14] showed how to use MPEG-7 content descriptions in describing news video. While researchers have shown the benefits of using an

"�3-��4�-5�6�-���� ���72���,��- ��/������������ ������ -��8�����������(>

MPEG-7 based hierarchical indexing scheme for video browsing, MPEG-7 does not actually aim to standardize the techniques and methods for the extraction of features and the usage of the descriptions. Thus, our paper aims to contribute to this active competition for developing methods to extract key segments and retrieve them. Moreover, since MPEG-7 uses XML, we need to find the most suitable methods for the organisation and retrieval of XML data. We have found that effective usage of XML data is a big and interesting issue by itself. Although XML was not initially designed as a new database scheme, since XML documents contain a collection of data, in the strictest sense of the term XML is a database [15], and therefore designing and querying XML databases has also become a very challenging research topic in its own right. For example, Michael Graves [16] has recently published a comprehensive textbook for designing XML databases in enterprise and web applications. Some alternative languages for querying XML documents, such as XQuery [17], have also been introduced. However, despite these studies on organising and retrieving XML data, finding the most suitable solution for MPEG-7 data is still an emerging research field, and therefore our paper attempts to show an alternative on how XML data structuring and retrieval methods can be applied to our MPEG-7 based video descriptions.

3 Extracting Goal Segments from Soccer Videos

The first requirement of our content-based VDMS is the extraction of key segments. In this section, we present our case study, which extracts the goal segments from soccer videos, since goal events are often used for sport highlights. In order to facilitate accurate detection, we have investigated the typical events that happen before and after a goal is scored. The following events can typically indicate goals:

Crowd cheers and commentator's voice: during a soccer match, when exciting events happen, the level of crowd noise becomes much louder, and the commentator's voice raises its tone (indicating the excitement). These remarkable audio features are usually sustained for around 6 seconds when a goal is scored; if a goal is not scored (i.e. during other exciting events), the crowd cheer loudness is not sustained and the commentator's voice returns to normal.

Position of play is near the goal: soccer has a very big court and involves many players, so the camera mainly shows top views (tower views) in order to follow the ball's movements. Thus, when the video frames show that the position of play is near the goal for longer than 2 seconds, usually an interesting event happens and a goal is more likely to be scored.

Slow motion replay and scoreboard display: when an interesting event has just happened, a slow motion replay usually follows it. When a goal is scored, slow motion replays will definitely be played repeatedly, depending on the editing decision. Following the slow motion, an updated scoreboard is displayed within the next 2-7 seconds. It is also possible that the replay comes together with the scoreboard display. However, scoreboard displays and slow motion replays vary significantly depending on the editing work done by the TV broadcaster. For example, some soccer videos always display the scoreboard throughout the game, in which case the only way to detect the goal is when the scoreboard has just been updated. Nevertheless, usually in this


case, when a goal has just been scored, a bigger text display tells viewers who scored the goal and the timestamp when the goal was scored.

Fig. 1 below shows some examples of these events. From left to right: 1) position of play is near the goal (just before a goal is scored); 2) a digital-video-editing point indicating the start of slow motion replay scenes after the goal celebration; 3) an updated scoreboard, showing the current score-line of the two teams.

Fig. 1. Events that indicate goals in soccer video

After recognising the typical events for goal segments, we can integrate several segmentation techniques to detect each of these typical events. Firstly, we implemented sound analysis algorithms to detect the crowd cheer and excited commentator's voice. Secondly, we designed a method to detect the editing work that separates the slow motion replay scenes, in order to locate the slow motion scenes. We detect pairs of DVEs (Digital Video Effects), which are displayed before and after a slow motion replay segment and can therefore be perceived as scene transitions between live and replay scenes [18]. After detecting the replay shots, we can search for the actual live scene, which typically has the framing of a shot from a fixed angle of a main camera (e.g. in soccer the main camera is the top-view camera) [19]. Thirdly, we can use the available techniques to identify those video shots that reveal the position of the play, and then recognise the line marks on the soccer field to determine the position of play [4, 8]. Finally, we can also benefit from techniques to analyse the text in the scoreboard [20], so that the system can determine whether the score status has been updated (because of the goal). The integrated segmentation techniques are then encapsulated in a segmentation tools library, which stores all the implemented segmentation tools, as sketched below.
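The following Python sketch, which is ours and not part of the original implementation (all cue and tool names are illustrative), shows one way such a segmentation tools library could be organised:

from typing import Callable, Dict, List, Tuple

Segment = Tuple[float, float]                      # (start_sec, end_sec)
SegmentationTool = Callable[[str], List[Segment]]

class SegmentationToolsLibrary:
    """Stores the implemented segmentation tools, keyed by the cue they detect."""

    def __init__(self) -> None:
        self._tools: Dict[str, SegmentationTool] = {}

    def register(self, cue: str, tool: SegmentationTool) -> None:
        self._tools[cue] = tool

    def tools_for(self, cues: List[str]) -> List[SegmentationTool]:
        # An event model (e.g. "soccer goal") lists the cues it relies on;
        # the library resolves them to the implemented detectors.
        return [self._tools[c] for c in cues if c in self._tools]

library = SegmentationToolsLibrary()
library.register("crowd_cheer", lambda video: [])         # sound analysis
library.register("slow_motion_replay", lambda video: [])  # DVE pair detection
library.register("play_near_goal", lambda video: [])      # line-mark analysis
library.register("scoreboard_update", lambda video: [])   # video text analysis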

The segmentation library is used by our customizable video segmentation module. Our proposed customizable video segmentation architecture is depicted in Fig. 2 and can be described as follows. Firstly, the video stream from a TV broadcast is digitised and stored in a video buffer, which is then filtered by the classification tool before the video is stored in the video storage. Secondly, the classification tool filters the TV programs according to user preferences, such as news, sport, and documentary, in order to save the hard disk space used for storing the video. After the video is stored, the customizable segmentation module determines the key segments. Thus, our segmentation module relies heavily on the segmentation processor, whose job is to produce key segments according to the user profile (user preferences, usage history and user options), which determines what type of key segments the user wants for a specific type of video.

The segmentation processor uses the video profile, which comprises the event models that correspond to the current type of video (e.g., soccer event models); each event model determines the set of segmentation tools from the segmentation tools library which best satisfies the event model. This set of segmentation tools is customized

"�3-��4�-5�6�-���� ���72���,��- ��/������������ ������ -��8�����������(*

since not all of them necessarily need to be run; rather, the user can select the processing time (e.g. slow but most accurate, moderate, quick). This processing time is pre-determined by the sum of the processing times of the segmentation tools used. Each time the segmentation processor uses one or more segmentation tools, it outputs a set of candidate key segments; users can finally decide to see either the earliest candidates (least accurate: some false detections, but fewer missed detections) or the final version of candidate segments (most accurate: fewest false detections, but possibly eliminating some detections that users may still want). Each set of candidate key segments is used as the input to be re-confirmed by the other segmentation tools (eliminating false detections), as in the sketch below.
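As an illustration, here is a minimal Python sketch of this cascaded re-confirmation, under the assumption that each tool returns candidate segments as (start, end) pairs in seconds and that a later tool re-confirms a candidate by overlapping it:

from typing import List, Tuple

Segment = Tuple[float, float]

def overlaps(a: Segment, b: Segment) -> bool:
    return a[0] <= b[1] and b[0] <= a[1]

def refine(candidates: List[Segment], confirmations: List[Segment]) -> List[Segment]:
    # Keep only the candidates confirmed (overlapped) by the next tool's output.
    return [c for c in candidates if any(overlaps(c, d) for d in confirmations)]

def run_cascade(video: str, tools, stages: int) -> List[Segment]:
    # 'stages' reflects the user's processing-time choice: 1 gives the earliest
    # (least accurate) candidates, len(tools) the slowest but most accurate set.
    candidates = tools[0](video)
    for tool in tools[1:stages]:
        candidates = refine(candidates, tool(video))
    return candidates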

The user profile can be stored using the MPEG-7 standard User Interaction Description Scheme, which provides the tools to personalize how users interact with video content, in terms of user preferences pertaining to video content and the usage history of users in using the video indexes. User preferences describe each user identifier based on, for example, the creation, classification, and source information of the video, as well as storing their filtering, search, and browsing preferences. On the other hand, the usage history stores the user action list in terms of the action type and its date/time. Thus, the user profile can be managed by reading the MPEG-7 tags, for example as follows.
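As a hedged illustration, here is a Python sketch of reading such a profile with the standard library's ElementTree; the element and attribute names below are assumptions about how the description is laid out, not verified MPEG-7 tag names:

import xml.etree.ElementTree as ET

def load_profile(path: str) -> dict:
    root = ET.parse(path).getroot()
    # Textual filtering/search preferences (e.g. preferred genres), wherever
    # they appear in the (possibly namespaced) description.
    prefs = [el.text.strip() for el in root.iter()
             if el.tag.endswith("PreferenceValue") and el.text]
    # Usage history: one entry per recorded user action (type and date/time).
    actions = [(el.get("type"), el.get("dateTime"))
               for el in root.iter() if el.tag.endswith("UserAction")]
    return {"preferences": prefs, "actions": actions}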

In order to clarify how our customizable segmentation module works, we provide the following process illustration using a concrete context:

Pre-processing: The video stream is digitized and stored in the video buffer. The classification tool checks the user profile, which requests the filtering of the contents of the buffer by selecting only soccer videos.
Highlighting process (5 steps): First of all, the segmentation processor looks at the user profile, which is interested only in goal segments as highlights. If no user preference has been stored, the processor will look at the usage history. Alternatively, users can ask the processor to allow interactive inputs. In the second step, the segmentation processor checks whether there is a non-highlighted soccer video in the video storage. If all the soccer videos have been highlighted, the segmentation processor stops processing and asks whether the user wants to re-do the highlighting process. Thirdly, the segmentation processor selects the video profile for soccer video. The video profile contains the typical events for detecting goal


segments, in order to determine which segmentation techniques (from the segmentation tools library) can be used. In the next step, the segmentation processor checks the user profile to determine what type of processing the user wants (i.e. fast or slow). Finally, the segmentation processor uses fast processing, which utilizes sound and text analysis from the segmentation tools library. Sound analysis produces a set of candidate key segments, which in this case is the final candidate set of key segments (since other methods are not used).

Fig. 2. Customizable Video Segmentation Module

4 Utilising MPEG-7 for Video Summaries

After the key segments are extracted, the next step is to label them according to their contents, so that users can effectively browse and retrieve them later on. Following our investigation of MPEG-7, which is proposed by the Moving Picture Experts Group (MPEG) as the standard multimedia description scheme, we propose our video description requirements below. Please note that highlights are also what people refer to as key segments. Our proposed video description scheme mainly extends standard MPEG-7 descriptions to suit our requirements (which

"�3-��4�-5�6�-���� ���72���,��- ��/������������ ������ -��8�����������((

are mainly to describe sport video summaries). For example, we added some domain-specific annotations for sport videos to enable more precise retrieval (e.g. "what is the final score of the game?"). However, while we benefit from specific annotations, we have ensured that our schema is still robust and can be easily shared between users, since we use the "standard" MPEG-7 annotations wherever possible. The following is our proposed schema for a collection of video summaries.

A collection of video summaries describes unordered sets of video summaries. This can be instantiated as, for example, a sport video library, which is a collection of video summaries of sport genre. A video summary describes the overall summary and a hierarchical summary of a particular video segment or video stream. The overall summary provides the links to the whole content of the video, and the link to the full-length video. The hierarchical summary structures multiple highlights; for example, we can group goal 1 and goal 2 into a highlight summary named "goals". Highlight segment descriptions include the link to the media, while describing the contents. Each highlight can link to a key video clip and a key audio clip, within which we may also have links to the key frame and key sound. For example, for a soccer goal segment, the key frame may show the face of the goal scorer, and the key sound is when the commentator says that the player has just scored. Sequential summaries can be combined with our hierarchical summary descriptions in order to describe the properties of the key frames, key sounds or the texts displayed.

Preamble (wrapper)
<schema xmlns="http://www.w3.org/2000/10/XMLSchema"
        xmlns:mpeg7="http://www.mpeg7.org/2001/MPEG-7 schema"
        targetNameSpace="[the URI by which the current schema is to be identified]/VIDSchema"
        elementFormDefault="unqualified"
        attributeFormDefault="unqualified">
<import namespace="http://www.mpeg7.org/2001/MPEG-7 Schema">

Collection of Video Summaries
<element name="VideoSummariesCollection">
  <complexType name="VideoSummariesCollectionType">
    <extension base="mpeg7:CollectionType">
      <element name="Video" type="VideoSummaryType" minOccurs="1" maxOccurs="unbounded" />
    </extension>
  </complexType>
</element>

Video Summary
<complexType name="VideoSummaryType">
  <sequence>
    <element name="OverallSummary" type="vid:OverallSummaryType" />
    <element name="HierarchicalSummary" type="vid:HierarchicalSummaryType" />
    <element name="RelatedVideo" type="mpeg7:ReferenceType" />
  </sequence>
  <attribute name="id" use="required" type="string" />
  <attribute name="genre" use="required" type="string" />
  <attribute name="type" use="required" type="string" />
</complexType>


Overall Summary and Hierarchical Summary
<complexType name="OverallSummaryType">
  <complexContent>
    <extension base="mpeg7:SummaryType">
      <element name="OverallSummaryAnnotation" type="vid:OverallSummaryAnnotationType" />
    </extension>
  </complexContent>
</complexType>
<complexType name="HierarchicalSummaryType">
  <complexContent>
    <restriction base="mpeg7:HierarchicalSummaryType">
      <element name="HighlightSummary" type="vid:HighlightSummaryType" />
    </restriction>
  </complexContent>
</complexType>
<complexType name="HighlightSummaryType">
  <complexContent>
    <extension base="mpeg7:HighlightSummaryType">
      <element name="KeyFrameProperty" type="mpeg7:FramePropertyType" minOccurs="0" maxOccurs="unbounded" />
      <element name="KeySoundProperty" type="mpeg7:SoundPropertyType" minOccurs="0" maxOccurs="unbounded" />
      <element name="KeyTextProperty" type="mpeg7:TextPropertyType" minOccurs="0" maxOccurs="unbounded" />
      <element name="HighlightSegmentAnnotation" type="vid:HighlightSegmentAnnotationType" use="required" />
    </extension>
  </complexContent>
</complexType>

Overall Summary - & Highlight Segment - Annotation<complexType name=”OverallSummaryAnnotation”><sequence>

<element name=”Event” type=” mpeg7:TextualType” /><element name=”HomeTeam” type=" mpeg7:TextualType” /><element name=”VisitingTeam” type” mpeg7:TextualType” /><element name=”Place” type=” mpeg7:TextualType” /><element name=”Time” type=” mpeg7:TextualType” /><element name=”Winner”>

<simpleType><restriction base=”String”>

<choice><value = ”HomeTeam”><value =”VisitingTeam”><value = “Draw”>

</choice></restriction>

</simpleType></element><element name=”FinalScore” >

<simpleType><restriction base=”String”>

<pattern value=”H \d - V \d” /></restriction>

</simpleType></element><element name=”MatchComment” type=”mpeg7:TextualType” />

</sequence></complexType><complexType name=”HighlightSegmentAnnotation”><sequence>

<element name=”Player” type=” mpeg7:TextualType” /> <!-- scorer --><element name=”Team” type=“mpeg7:TextualType” / > <!-- team benefited --><element name=”Time” type=” mpeg7:TextualType” /> <!-- when scored --><element name=”HighlightComment” type=”mpeg7:TextualType”

</sequence></complexType>

"�3-��4�-5�6�-���� ���72���,��- ��/������������ ������ -��8�����������($

<VideoSummariesCollection name="Sport Video Album - Soccer Champions League 2002">
  <CreationInformation>
    <Creation>
      <Title> Soccer Champions League 2002 </Title>
      <Creator>
        <Agent xsi:type="PersonType">
          <Name> Dian Tjondronegoro </Name>
        </Agent>
      </Creator>
    </Creation>
  </CreationInformation>
  <TextAnnotation>
    <FreeTextAnnotation> Collection of soccer Champions League videos 2002 </FreeTextAnnotation>
  </TextAnnotation>
  <Video id="ACMilanBayern1" genre="sport" type="soccer">
    <MatchSummary>
      <Name> soccer: AC Milan vs Bayern Muenchen </Name>
      <SourceLocator>
        <MediaUri>http://vidlib.org/soccer1.mpg</MediaUri>
        <MediaTime>
          <MediaRelTimePoint>PT0S</MediaRelTimePoint>
          <MediaDuration>PT100M</MediaDuration>
        </MediaTime>
      </SourceLocator>
      <MatchAnnotation>
        <WhatAction> Soccer: Champions League Qualifying </WhatAction>
        <HomeTeam> AC Milan soccer team </HomeTeam>
        <VisitingTeam> Bayern Muenchen soccer team </VisitingTeam>
        <Where> Milan, Italy </Where>
        <Winner> AC Milan soccer team </Winner>
        <FinalScore> H2-V1 </FinalScore>
        <MatchComment> A very thrilling match </MatchComment>
      </MatchAnnotation>
    </MatchSummary>
    <HierarchicalSummary>
      <SummaryThemeList>
        <SummaryTheme xml:lang="en" id="E0"> soccer </SummaryTheme>
        <SummaryTheme xml:lang="en" id="E01" parentId="E0"> goals </SummaryTheme>
      </SummaryThemeList>
      <HighlightSummary id="ACMilan_goals" themeIds="E01">
        <HighlightSegment id="Milan_Goal1" level="0">
          <KeyAVclip>
            <MediaTime>
              <MediaRelTimePoint> PT44M10S </MediaRelTimePoint>
              <MediaDuration> PT1M20S </MediaDuration>
            </MediaTime>
          </KeyAVclip>
          <HighlightSegmentAnnotation>
            <Who> Serginho </Who>
            <Team> AC Milan soccer team </Team>
            <Time> 44M </Time>
            <HighlightComment> Serginho left unmarked before he drives a low shot beautifully </HighlightComment>
          </HighlightSegmentAnnotation>
        </HighlightSegment>
      </HighlightSummary>
    </HierarchicalSummary>
    <RelatedVideo href="http://vidlib.org/VideoSummariesCollection2.mp7"
                  xpath="/VideoSummariesCollection/Video[id='ACMilanJuventus1']" />
  </Video>
  <CollectionRef href="http://vidlib.org/VideoSummariesCollection1.mp7" />
</VideoSummariesCollection>


For a better understanding of our video description scheme, we have included a video summaries collection example above. This example compiles all of the video summaries which belong to the Soccer Champions League 2002. This video summaries collection references another video summaries collection, which stores all the video summaries of the Soccer Champions League 2001 (i.e. the older season). Moreover, the video with ID 'ACMilanBayern1' is linked to a similar video from another video summaries collection, which stores all the soccer videos of the Italian league. A sketch of how such descriptions could be emitted programmatically is given below.
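The following Python sketch is our own helper, not part of MPEG-7 itself; the element names mirror the schema and example above, and it builds a highlight segment like the one shown:

import xml.etree.ElementTree as ET

def highlight_segment(seg_id: str, start: str, duration: str,
                      who: str, team: str, comment: str) -> ET.Element:
    seg = ET.Element("HighlightSegment", id=seg_id)
    clip = ET.SubElement(seg, "KeyAVclip")
    time = ET.SubElement(clip, "MediaTime")
    ET.SubElement(time, "MediaRelTimePoint").text = start
    ET.SubElement(time, "MediaDuration").text = duration
    ann = ET.SubElement(seg, "HighlightSegmentAnnotation")
    ET.SubElement(ann, "Who").text = who
    ET.SubElement(ann, "Team").text = team
    ET.SubElement(ann, "HighlightComment").text = comment
    return seg

summary = ET.Element("HighlightSummary", id="ACMilan_goals", themeIds="E01")
summary.append(highlight_segment("Milan_Goal1", "PT44M10S", "PT1M20S",
                                 "Serginho", "AC Milan soccer team",
                                 "Serginho left unmarked before he drives a low shot"))
print(ET.tostring(summary, encoding="unicode"))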

5 Managing Video Retrieval

After the key segments are extracted and described to form a collection of video summaries, the next step is to design the retrieval technique to meet user requirements. Since our video descriptions are based on MPEG-7, which uses XML, we need to design a data model to structure the XML database and search for a standard query language to retrieve our XML schema-based video content descriptions.

Since there is still no standard language for XML retrieval, we have compared the three most recent XML query languages proposed to the W3C: XML-QL, XQL, and XQuery (see Table 1 for the comparison). Our study has found that XQuery is superior to the older ones, because it supports our requirements, as well as proposing some important extra features that we can use for our complex queries.

Based on this table, it is clear that XQuery provides the most complete set of features meeting our requirements. In addition, we also found that XQuery uses the 'XQuery 1.0 and XPath 2.0' data model, which supports MPEG-7 (i.e. by supporting XML Schema). XQuery supports our video summary schema by allowing:

• All operations on all data types represented by the data model (e.g., simple and complex types, references, and collections).
• Hierarchies and sequences of document structures, with the ability to preserve them in the output document.
• Combination of information from different parts of a document or from multiple documents, together with existing database management operations, such as conditions, quantifiers, aggregation, and sorting.

5.1 Data Model for Our Video Summaries

We have decided to use the W3C's XQuery 1.0 and XPath 2.0 data model, which is the data model for XQuery 1.0, a query language for XML. The data model supports our MPEG-7 based video descriptions by supporting: 1) XML Schema types, such as structures and simple data types, and 2) the representation of collections of documents and complex values [21]. Since XML documents are tree-structured, we describe our data model using conventional terminology for trees [16]. Fig. 3 illustrates the tree structure of our video descriptions schema.

We also found that retrieval in an XML database can be classified based on the topology of the query [6]. Based on the topology of our tree-structured video summaries collection, we have classified our queries into 4 categories, of which we will

"�3-��4�-5�6�-���� ���72���,��- ��/������������ ������ -��8�����������(%

present an example of each in the next section. Technically, all four topologies can be illustrated as returning a table of results as needed:

• A singleton query matches one column with one row, i.e. a single value. For instance, 'find the first goal segment where Serginho scores.'
• A path query returns a table with a set of one or more tuples. For example, 'find all goal segments where Serginho scores.'
• Tree queries are used when conjunctions of paths are queried or when multiple results are needed (join). For example, 'find all goal segments where AC Milan hosts the game and Van Nistelrooy scores' or 'find the visiting team and the final score of a match where AC Milan hosts the game.'
• Graph patterns extend the hierarchical queries supported by XML paths to improve the complexity and flexibility of the queries allowed. The tree structure of an XML document becomes a graph when nodes are connected by intra-document links (via IDREF) or inter-document links (via XLink). Thus a graph query allows entire (indexed) XML documents with links to be examined, and all fragments in the documents that match the pattern are returned. Graph (pattern) queries are also used when two or more paths of a tree must be joined for a query. For instance, 'find all video summaries and their related videos (regardless of whether they are in the same document or not).'

Fig. 3. Tree structure of a Video Summaries Collection

5.2 Implemented XQuery Retrievals

We now present some implemented XQuery queries. The main purpose is to show how we can retrieve our video descriptions using XQuery features. For our experimental work, we have used Howard Katz's XQuery engine (www.fatdog.com), which is provided as a set of Java APIs. We have chosen this engine because it supports our XML document structure and XML parser [22]. Please note that queries 1 to 3 are already supported by this engine, while queries 4 and 5 are yet to be fully supported by the next version.


Table 1. Support for features by alternative query languages for XML (XML-QL, XQL, and XQuery)

[The body of Table 1, which compares the three languages feature by feature, is not legible in the source.]

Query 1: Singleton query
In most cases, users want to display the summary of the specific video which they are keen to watch. In this case, users can specify the video identifier.

/VideoSummariesCollection/Video[id='ACMilanBayern1']

Query 2: Path query
Users may want to display all the highlight segments from all videos which belong to the Soccer Champions League 2002. We can do the query by using the full (absolute) path expression to access the elements:

/VideoSummariesCollection/Video/HierarchicalSummary/HighlightSummary/HighlightSegment

"�3-��4�-5�6�-���� ���72���,��- ��/������������ ������ -��8�����������'>

However, we can alternatively use the abbreviated descendant path expression ('//') to reduce the query length. It is less error-prone, as we do not need to know the full path to the element we want, and it is not affected should the path change.

//HighlightSegment

Query 3: Tree query
Display all key segments of soccer matches where AC Milan hosts the game and Serginho scores.

/VideoSummariesCollection/Video[.//HomeTeam = 'AC Milan soccer team']/HierarchicalSummary/HighlightSummary/HighlightSegment[.//Who = 'Serginho']

Query 4: Graph query
Find the video summary, and its related video, which features the AC Milan soccer team as the host of the match.

/VideoSummariesCollection/Video[.//HomeTeam = 'AC Milan soccer team']/RelatedVideo[.//HomeTeam = 'AC Milan soccer team']

Query 5: Example of a complex query
Display all players who appear in the videos (each listed just once), and for each player display the match summary of every video in which the player appears.

for $p in distinct-values(//Who/data())
return
  <result>
    <player>{ $p }</player>
    { for $v in //Video,
          $p2 in $v/HierarchicalSummary/HighlightSummary/HighlightSegment/HighlightSegmentAnnotation/Who/data()
      where $p = $p2
      return $v/MatchSummary }
  </result>

6 Experimental Results on Goal Segments Extraction

We used MATLAB 6 to analyse the audio signals of soccer videos, because this information is highly correlated with the important events during sport matches. In this section, we present our experimental work, which extracted the goal segments in soccer videos by detecting loud crowd cheer and excited commentator's speech. It is noted that the audio samples obtained from soccer matches are complex, since they contain human speech, crowd cheer, whistle sounds, and various background noises. Thus, we firstly designed a digital filter to extract the crowd cheer and commentator's speech components, and segmented our audio samples into equal 1.5 sec audio frames. After the filter was applied to each audio frame, we calculated


its signal energy. We then compared the signal energy of each audio frame to a preset threshold. If the signal energy is greater than the threshold, we marked the particular audio frame as a potential goal segment. Finally, we examined the audio frames which had been marked and, based on our experiments, if there are >= 5 continuous marked audio frames, we determine that these segments contain a goal event. Please note that continuous marked audio frames are grouped into runs (separated by semicolons) in Table 2.

Our audio filter to extract crowd cheer and commentator's speech was designed as a band-pass filter with the following specifications:

[The four filter specification lines, giving the stop-band and pass-band edge frequencies together with the stop-band attenuation and pass-band ripple in dB, are not legible in the source.]

The signal energy of an audio frame (in the time domain) was calculated using the algorithm presented in [19]:

E_x = \sum_{n=1}^{L} x[n]^2

where E_x is the signal energy of audio frame x, x[n] is the n-th sample value (magnitude) of the audio frame, and L is the length of the audio frame. Based on our investigation, an audio frame of 1.5 sec length is marked (as a potential goal segment) if E_x >= 445. A sketch of this marking rule is given below.
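The following minimal Python sketch is our reconstruction of the marking rule; the band-pass filtering step is omitted, and the threshold of 445 assumes the same sample scaling as the MATLAB program:

import numpy as np

FRAME_SEC, THRESHOLD, MIN_RUN = 1.5, 445.0, 5

def marked_frames(samples: np.ndarray, rate: int) -> list:
    # Return start times (s) of 1.5 sec frames whose energy exceeds the threshold.
    size = int(FRAME_SEC * rate)
    marks = []
    for i in range(0, len(samples) - size + 1, size):
        frame = samples[i:i + size].astype(np.float64)
        if float(np.sum(frame ** 2)) >= THRESHOLD:
            marks.append(i / rate)
    return marks

def contains_goal(marks: list) -> bool:
    # >= MIN_RUN marked frames in a row; one missing frame per run is tolerated,
    # as with the 15 sec frame missed in the Goal2 sample (see Table 2).
    run, gaps, prev = 0, 0, None
    for t in marks:
        if prev is None or abs(t - prev - FRAME_SEC) < 1e-6:
            run += 1
        elif abs(t - prev - 2 * FRAME_SEC) < 1e-6 and gaps == 0:
            run, gaps = run + 2, 1      # count the tolerated missing frame too
        else:
            run, gaps = 1, 0
        if run >= MIN_RUN:
            return True
        prev = t
    return False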

All the audio samples in our experiments were stored in PCM-encoded WAV format with 16-bit sample size and 48 kHz sampling frequency. Although the samples were recorded in 2 channels (stereo), our program converted them into mono for faster processing. We used 5 audio samples extracted from 3 different soccer matches (marked as match 1, 2, and 3), so that we can ensure that our detection algorithm does not depend on a particular speaker or TV broadcaster. The following describes our audio samples, while Table 2 reports our experimental results.

• Goal1 (match 1): a segment in which a very exciting goal happens as a result of a terrible defensive mistake.
• Goal2 (match 2): a segment in which a goal happens as a result of a well-delivered cross finished with a skilful header.
• Penalty (match 3): a segment in which a goal happens as a result of a penalty.
• Non goal (match 1): a segment which contains two occurrences of fouls.
• Kick off (match 2): a segment from the beginning of match two.

Table 2. Audio samples and the potential key segments (continuous runs of marked frames are separated by semicolons)

Audio samples       Marked segments (in seconds)
Goal1 (match 1)     12, 13.5, 15, 16.5, 18
Goal2 (match 2)     4.5; 12, 13.5, 16.5, 18, 19.5; 24, 25.5
Penalty (match 3)   13.5; 25.5, 27, 28.5, 30; 34.5; 90, 91.5, 93, 94.5, 96, 97.5, 99, 100.5
Non goal (match 1)  9, 10.5, 12
Kick off (match 2)  None

Based on these results, we have correctly detected the goal segments from Goal1 and Goal2 and the penalty segment. The grouped runs of marked segments show that there are

"�3-��4�-5�6�-���� ���72���,��- ��/������������ ������ -��8�����������'*

more than 4 continuous marked audio frames during a goal event. Please note that the marked frames at 25.5, 27, 28.5 and 30 sec in the penalty segment indicate that the penalty has been awarded (as a result of a foul in the penalty area). Moreover, due to the possibility of a sudden silence between speeches, our program still regards the marked frames at 12, 13.5, 16.5, 18 and 19.5 sec in Goal2 as a continuous segment, although the 15 sec frame was missed (i.e. missing one marked frame is still regarded as continuous).

7 Summary and Some Future Work Suggestions

We have presented a three-stage scheme for indexing sport videos using our MPEG-7 based video summaries. The main benefit of this system is its capability to allow users to select particular soccer video(s) using the match highlights, which are composed of the annotated goal segments. For the first stage, where key segments are automatically extracted, we proposed the concept of a customizable video segmentation module. Its purpose is to combine the sets of techniques for detecting key segments which are most suitable for a particular type of video. In this case, our experimental work has shown that the segmentation module can detect goal segments in soccer videos by detecting audio frames with loud crowd cheer and excited speech. In the second stage, soccer videos and their goal segments are structured hierarchically into match highlights and video summaries. The benefit of adopting the MPEG-7 standard description scheme for our video summaries is to allow straightforward sharing of information between various VDMS. In the third stage, we have shown that XQuery can retrieve our MPEG-7 based video summaries (which are formed by XML documents) using singleton, path, tree, and graph queries. The advanced features of XQuery over its predecessors, such as XQL and XML-QL, will also enable users to do some advanced retrievals.

We suggest the following future work to enhance our VDMS. First of all, we need to develop more techniques to detect goal segments from soccer videos using audiovisual and text feature extraction. In particular, the system needs to decide whether the current position of play is near the goal area before the goal segments, and whether there is a slow motion replay scene and an updated scoreboard display after the goal segments. Secondly, we need to develop an XML document parser to ensure that the annotated video summaries are valid. After that, we need to develop or extend an XQuery retrieval engine which can fully support our video summaries. Finally, we plan to develop a user interface which will enable users to intuitively understand the video summaries and easily formulate their queries.

References

1. Huang, Q., A. Puri, and Z. Liu, Multimedia search and retrieval: new concepts, system implementation, and application. IEEE Transactions on Circuits and Systems for Video Technology, 2000. 10(5): p. 679-692.


2. Koh, J.-L., C.-S. Lee, and A.L.P. Chen. Semantic video model for content-based retrieval. In Multimedia Computing and Systems, IEEE International Conference on, 1999. Taipei, Taiwan.
3. Hampapur, A. and R. Jain, Video Data Management Systems: Metadata and Architecture. In Multimedia Data Management: Using Metadata to Integrate and Apply Digital Media, A. Sheth and W. Klas, Editors. 1998, McGraw-Hill: New York.
4. Zhou, W., A. Vellaikal, and C.C.J. Kuo. Rule-based video classification system for basketball video indexing. In ACM Workshops on Multimedia, 2000. Los Angeles, California, United States: ACM.
5. Miyamori, H. and S.-I. Iisaku. Video annotation for content-based retrieval using human behavior analysis and domain knowledge. In Automatic Face and Gesture Recognition, Proceedings, Fourth IEEE International Conference on, 2000. Koganei, Japan.
6. Sudhir, G., J.C.M. Lee, and A.K. Jain. Automatic classification of tennis video for high-level content-based retrieval. In Content-Based Access of Image and Video Database, Proceedings, 1998 IEEE International Workshop on, 1998. Hong Kong.
7. Xie, L., et al. Structure analysis of soccer video with hidden Markov models. In Acoustics, Speech, and Signal Processing, 2002 IEEE International Conference on, 2002. Columbia University.
8. Gong, Y., et al. Automatic parsing of TV soccer programs. In Multimedia Computing and Systems, Proceedings of the International Conference on, 1995.
9. Yajima, C., Y. Nakanishi, and K. Tanaka. Querying video data by spatio-temporal relationships of moving object traces. In 6th IFIP Working Conference on Visual Database Systems, 2002. Brisbane: Kluwer.
10. Nepal, S., U. Srinivasan, and G. Reynolds. Automatic detection of 'Goal' segments in basketball videos. In ACM International Conference on Multimedia, 2001. Ottawa, Canada: ACM.
11. Zhou, W., S. Dao, and C.-C. Jay Kuo, On-line knowledge- and rule-based video classification system for video indexing and dissemination. Information Systems, 2002. 27(8): p. 559-586.
12. Pfeiffer, S. and S. Savitha. TV anytime as an application scenario for MPEG-7. In Proceedings on ACM Multimedia 2000 Workshops, 2000. Los Angeles, California, United States: ACM Press.
13. Van Beek, P., A. Benitez, J. Heuer, J. Martinez, P. Salembier, Y. Shibata, J.R. Smith, and T. Walker, Text of 15938-5 FCD Information Technology - Multimedia Content Description Interface - Part 5: Multimedia Description Schemes. 2001, International Organisation for Standardisation, Coding of Moving Pictures and Audio, ISO/IEC JTC 1/SC 29/WG 11/N3966: Singapore.
14. Hunter, J. and R. Iannella. The Application of Metadata Standards to Video Indexing. In Second European Conference on Research and Advanced Technology for Digital Libraries (ECDL'98), 1998. Crete, Greece.
15. Bourret, R., XML and Databases. 2002.
16. Graves, M., Designing XML Databases. 2002, Upper Saddle River, NJ: Prentice Hall.
17. Boag, S., et al., XQuery 1.0: An XML Query Language. 2002, W3C.
18. Pan, H., P. van Beek, and M.I. Sezan. Detection of slow-motion replay segments in sports video for highlights generation. In Acoustics, Speech, and Signal Processing, Proceedings, 2001 IEEE International Conference on, 2001. Salt Lake City, UT, USA.
19. Babaguchi, N., et al. Linking live and replay scenes in broadcasted sports video. In ACM Workshop on Multimedia, 2000. Los Angeles, California, United States: ACM Press.

"�3-��4�-5�6�-���� ���72���,��- ��/������������ ������ -��8�����������'(

20. Lienhart, R. Automatic text recognition for video indexing. In Proceedings of the Fourth ACM International Conference on Multimedia, 1996. Boston, Massachusetts, United States: ACM Press.
21. Fernández, M., J. Marsh, and M. Nagy, XQuery 1.0 and XPath 2.0 Data Model. 2002, W3C.
22. Katz, H., An Introduction to XQuery: A look at the W3C proposed standard for an XML query language. 2002, IBM developerWorks: XML zone.