White paper

SCALABLE MEDIA PERSONALIZATION

Amos Kohn

September 2007

ABSTRACT

User expectations, competition and sheer revenue pressures are driving rapid development, and operator acquisition, of highly complex media processing technologies.

Historically, cable operators provided "one stream for all" service in both the analog and digital domains. At most, they provided two to three streams for East and West Coast delivery. Video on Demand (VOD) represented a first step toward personalization, using personalized delivery, in the form of "pumping" and network QAM routing, in lieu of personalization of the media playout itself. In some cases, personalized advertisement play-lists were also created. This resulted in massive deployments of VOD servers and edge QAMs.

The second step in this evolution is the introduction of switched digital video, which takes the linear delivery one step further to deliver a hybrid VOD/linear experience without applying any personal media processing. Like previous personalization approaches, user-based processing is limited to network pumping and routing, with no access to the actual media or ability to manipulate it for true personalization.

True user personalization requires the generic ability to perform intensive media processing on a per-user basis. As of today, an STB-based approach to media personalization seems to be dominant. This approach necessitates future deployment of more capable (thus more expensive) STBs. Although straightforward, it conflicts with operators' needs to lower costs, unify the user experience and retain customers. The network approach, where per-user personalization is completely or partially accomplished BEFORE the video reaches the STB (or any other user device), delivers the same experience but has been explored only in a very limited fashion. It nevertheless has the most potential to benefit operators, as it addresses most of the current and future challenges they face.


NETWORK-BASED PROCESSING TOOLKIT

The following defines the set of coding properties used as part of the media personalization solution. As indicated below, one of the advantages of this solution is that it is standards-based, as are the tools. The properties defined here combine MPEG-4 (mostly H.264) and MPEG-2; the combination provides a solution for both coding schemes.

MPEG-4 is composed of a collection of "tools" built to support and enhance scalable composition applications. Among the tools discussed here are shape coding, motion estimation and compensation, texture coding, error resilience, sprite coding and scalability.

Unlike MPEG-4, MPEG-2 provides a very limited set of functionality for scalable personalization. The tools defined in this document are nevertheless sufficient to provide personalization in the MPEG-2 domain.

Object-Based Structure and Syntax

Content-based interactivity: the MPEG-4 standard extends traditional frame-based processing toward the composition of several video objects superimposed on a background image. For proper rendering of the scene without disturbing artifacts at the borders of video objects (VOs), the compressed stream contains the encoded shape of each VO. Representing video as objects rather than as video frames enables content-based applications. This, in turn, provides new levels of content interactivity based on efficient representation of objects, object manipulation, bitstream editing and object-based scalability.

An MPEG-4 visual scene may consist of one or more video objects. Each video object is characterized by temporal and spatial information in the form of shape, motion and texture. The visual bitstream provides a hierarchical description of a visual scene, and each level of the hierarchy can be accessed via start codes, which are special code values. The ability to process objects, layers and sequences selectively is a significant enabler for scalable personalization. The hierarchical levels include:

Visual Object Sequence (VS): An MPEG-4 scene may include any 2-D or 3-D natural or synthetic objects. These objects and sequences can be addressed individually based on the targeted user.

Video Object (VO): A video object corresponds to a certain 2-D element in the scene. The simplest example is a rectangular frame; alternatively, it can be an arbitrarily shaped region corresponding to an object or to the background of the scene.

Video Object Layer (VOL): Video object encoding takes place in one of two modes, scalable or non-scalable, depending on the application, as represented in the video object layer (VOL). The VOL provides support for scalable coding.

Group of Video Object Planes (GOV): Optional in nature, GOVs provide random access points into the bitstream: points at which video object planes are encoded independently. Because MPEG-4 video consists of video objects rather than frames, it allows true interactivity and manipulation of separate, arbitrarily shaped objects, with an efficient scheduling scheme to speed up real-time computation.


Video Object Plane (VOP): VOPs are video objects sampled in time. They can be sampled either independently or dependently, using motion compensation; rectangular shapes can represent a conventional video frame. A motion estimation and compensation technique is provided for interlaced digital video such as VOPs: predictor motion vectors for differentially encoding a current field-coded macroblock are obtained as the median of the motion vectors of surrounding blocks or macroblocks, which supports high system scalability.

Figure 1 below illustrates an object-based visual bitstream.

A visual elementary stream carries the compressed visual data of just one layer of one visual object; there is only one elementary stream (ES) per visual bitstream. Visual configuration information, comprising the visual object sequence (VOS), visual object (VO) and visual object layer (VOL), must be associated with each ES.

Figure 1: The visual bitstream format
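To make the hierarchy above concrete, here is a minimal sketch of start-code scanning, assuming the standard MPEG-4 Part 2 start-code values (VS 0xB0, VO 0xB5, video objects 0x00-0x1F, VOLs 0x20-0x2F, GOV 0xB3, VOP 0xB6). The sample buffer is hypothetical and the helper names are ours, not part of any standard API.

```python
# A minimal sketch: walking an MPEG-4 Visual bitstream by start codes and
# classifying each code into its hierarchy level (VS, VO, VOL, GOV, VOP).
# Start-code values follow MPEG-4 Part 2; the sample buffer is invented.

def classify(code: int) -> str:
    """Map the byte following the 0x000001 prefix to a hierarchy level."""
    if code == 0xB0:
        return "Visual Object Sequence (VS) start"
    if code == 0xB5:
        return "Visual Object (VO) start"
    if 0x00 <= code <= 0x1F:
        return f"Video Object {code} start"
    if 0x20 <= code <= 0x2F:
        return f"Video Object Layer {code - 0x20} start"
    if code == 0xB3:
        return "Group of VOPs (GOV) start"
    if code == 0xB6:
        return "Video Object Plane (VOP) start"
    return f"other start code 0x{code:02X}"

def walk_start_codes(bitstream: bytes):
    """Yield (offset, description) for every 00 00 01 xx start code."""
    i = bitstream.find(b"\x00\x00\x01")
    while i != -1 and i + 3 < len(bitstream):
        yield i, classify(bitstream[i + 3])
        i = bitstream.find(b"\x00\x00\x01", i + 3)

# Hypothetical buffer: VS -> VO -> VOL 0 -> GOV -> VOP
sample = bytes.fromhex("000001b0f5000001b500000120000001b3000001b6")
for offset, level in walk_start_codes(sample):
    print(f"{offset:4d}: {level}")
```

The point of the sketch is the selective access the text describes: a network element can seek directly to a VOL or VOP boundary without decoding everything above it.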

COMPRESSION TOOLS

Intra-coded VOPs (I-VOPs): VOPs that are coded using only information within the VOP, removing some of the spatial redundancy. Inter coding exploits temporal redundancy between frames through motion estimation and compensation; two inter-coding modes are provided: prediction from a previous VOP (P-VOPs) and prediction from both a previous and a future VOP (B-VOPs). These tools are used in the content preparation stage to increase compression efficiency and error resilience and to code different types of video objects.

Shape coding tools: MPEG-4 provides tools for encoding arbitrarily shaped objects. Binary shape information defines which portions (pixels) of the object belong to the video object at a given time, and is encoded by a motion-compensated block-based technique that allows both lossless and lossy coding. The technique allows for accurate representation of objects, which in turn improves the quality of the final composition and assists in differentiating between video and non-video objects within the stream.

Sprite coding: A sprite is an image composed of pixels belonging to a video object that is visible throughout a video sequence. It is an efficient and concise representation of a background video object, which is typically compressed with the object-based coding technique. Sprites achieve high compression efficiency when the whole background is visible at least once over the video sequence.

MPEG-4 H.264/AVC Scalable Video Coding (SVC): One method of achieving high video compression efficiency is the scalable extension of H.264/AVC, known as scalable video coding or SVC.

A scalable video bitstream contains a non-scalable base layer and one or more enhancement layers. (The term "layer" in Video Coding Layer (VCL) refers to syntax layers such as the block, macroblock and slice layers.) The basic SVC design can be classified as a layered video codec. In general, the coder structure, as well as the coding efficiency, depends on the scalability space required by an application. An enhancement layer may enhance the temporal resolution (i.e., the frame rate), the spatial resolution, or the quality of the video content represented by the lower layer or part of it. The scalable layers can be aggregated into a single transport stream or transported independently.

Scalability is provided at the bitstream level, allowing for reduced complexity. Reduced spatial and/or temporal resolution can be obtained by discarding, from a global SVC bitstream, the NAL units (or network packets) that are not required for decoding the target resolution. NAL units contain motion information and texture data. NAL units of Progressive Refinement (PR) slices can additionally be truncated to further reduce the bit rate and the associated reconstruction quality.
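The discard operation described above can be sketched as a simple filter. This is an illustrative model, not a real SVC extractor: NAL units are represented as pre-parsed records whose field names (dependency_id, quality_id, temporal_id) follow the H.264 Annex G layer identifiers, and the sample stream is invented.

```python
# A minimal sketch of SVC bitstream extraction: NAL units carrying layers
# beyond the target spatial/temporal/quality operating point are discarded.

from dataclasses import dataclass

@dataclass
class NalUnit:
    dependency_id: int   # spatial (and coarse-grain quality) layer
    temporal_id: int     # temporal layer (frame-rate scalability)
    quality_id: int      # fine-grain quality refinement layer
    payload: bytes

def extract_operating_point(nals, max_did, max_tid, max_qid):
    """Keep only the NAL units needed for the target resolution/rate/quality."""
    return [n for n in nals
            if n.dependency_id <= max_did
            and n.temporal_id <= max_tid
            and n.quality_id <= max_qid]

# Hypothetical stream: base layer plus spatial and temporal enhancements.
stream = [
    NalUnit(0, 0, 0, b"base"),          # always required
    NalUnit(0, 1, 0, b"temporal-enh"),  # doubles the frame rate
    NalUnit(1, 0, 0, b"spatial-enh"),   # e.g. SD -> HD
    NalUnit(1, 1, 1, b"quality-enh"),   # PR slice, may also be truncated
]

# Target: base spatial resolution at full frame rate, no quality refinement.
for nal in extract_operating_point(stream, max_did=0, max_tid=1, max_qid=0):
    print(nal.payload)
```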


NETWORK-BASED PERSONALIZATION CONCEPT

Network-based personalization represents an evolution of the network infrastructure. The solution includes devices that allow multi-point media processing, enabling the network to target any user, with any device, with any content. In this paper we focus primarily on the cable market and TV services; however, the concept is not confined to these areas.

The existing content flow remains intact regardless of how processing functionality is extended within each of the network components, including the user device. This approach can accommodate the range of available STBs, employ modifications based on user profiles, and support a variety of sources.

The methodology behind the concept anticipates that the in and out points of the system must support a variety of containers, formats, profiles, rates and so forth. Within the system, however, the manipulation flow is unified for simplicity and scalability. Network-based personalization can serve incoming baseline (low-resolution), Standard Definition (SD) and High Definition (HD) formats and support multiple containers (such as Flash, Windows Media, QuickTime, MPEG Transport Stream and Real).

Network personalization requires an edge processing point and, optionally, ingest and user-premises content manipulation locations. The conceptual flow of the solution is shown in Figure 2 below.

Figure 2: Virtual Flow: Network based personalization

The virtual flow and building blocks defined here are generic and can be placed at different locations in the network, co-located or remote. Specific architecture examples are reviewed later in this paper.

[Diagram: Prepare, Integrate, Create and Present blocks, spanning the asset and session domains, with user interaction.]


At the "preparation" point, media content is ingested and manipulated in two respects: 1) analysis of the content and creation of relevant information (metadata), which will then accompany it across the flow; 2) processing of the content for integration and creation, which includes manipulation such as changing format, structure, resolution and rate. The outcome of the preparation stage is a single copy of the incoming media, but in a form that includes the data that will allow the other blocks to create multiple personalized streams from it.

The "integration" point is the transition from asset focus to session focus. This block connects and synchronizes prepared media streams with instructions and other data to create a complete session-specific media and data flow, which is then provided to the "create" block.

The "create" and "present" blocks are the final content processing steps: for a given session, each media stream is crafted according to the user, device and medium (in the "create" block), then joined into a visual experience at the "present" block. The two blocks are intentionally defined separately, in order to accommodate end-user devices of different types and power. Further discussion of this subject appears in the "Power Shifting to the User" section below.
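The division of labour among the four blocks can be sketched as a simple pipeline. The function names follow the text (prepare once per asset; integrate, create and present per session), while the dictionary-based data structures are purely illustrative assumptions.

```python
# A minimal sketch of the four virtual blocks. The division of labour follows
# the text; the data structures are hypothetical placeholders.

def prepare(asset):
    """Asset-scoped: analyze once, emit one enriched copy plus metadata."""
    metadata = {"format": asset["format"], "objects": asset["objects"]}
    return {"media": asset["media"], "metadata": metadata}

def integrate(prepared_assets, session):
    """Transition from asset focus to session focus: align and synchronize."""
    return {"session": session, "flows": prepared_assets}

def create(integrated):
    """Craft each stream for this user, device and medium."""
    device = integrated["session"]["device"]
    return [{"flow": f, "tuned_for": device} for f in integrated["flows"]]

def present(created):
    """Join the crafted streams into a single visual experience."""
    return {"composited": created}

session = {"user": "u1", "device": "legacy-mpeg2-stb"}
assets = [prepare({"media": b"...", "format": "mp4", "objects": ["bg", "fg"]})]
print(present(create(integrate(assets, session))))
```

The property the sketch preserves is that "prepare" runs once per asset while the remaining blocks run once per session, which is what makes the flow scalable.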


PUTTING IT ALL TOGETHER

The proposed implementation of network-based personalization takes into account the set of tools and the virtual building blocks defined above to create the required end result.

To support high-level personal session-based services, we propose to utilize the MPEG-4 toolkit, which enables scene-related information to be transmitted together with video, audio and data to a processor-based network element, in which an object-based scene is composed according to the user device's rendering capabilities. Using MPEG-4 authoring tools and applying BIFS (Binary Format for Scenes) encoding at the content preparation stage, the system supports more efficient personalization stream processing, specifically at the "create" and "present" stages. Different encoding levels are required to support the same bitstream; for example, varied network computational power will be required to process the foreground, background and other data, such as 2D/3D elements, in the same bitstream. Moreover, some of the video rendering can be passed directly to the user reception device (STB), reducing network image processing requirements.

The solution described in this paper utilizes a set of tools allowing the content creator to build multimedia applications without any knowledge of the internal representation structure of an MPEG-4 scene. Using the MPEG-4 toolkit, the multimedia content is object-oriented, with spatial-temporal attributes that can be attached to it, including the BIFS encoding scheme. The MPEG-4 encoded objects address video, audio and multimedia presentations such as 3D, as defined by the authoring tools.

The solution is built on four network elements: prepare, integrate, create and present. All four work together to ensure the highest processing efficiency and to accommodate different service scenarios: legacy MPEG-2 set-top boxes; H.264 set-top boxes with no object-based rendering capabilities; and, finally, STBs with full MPEG-4 object-based processing capabilities. Two-way feedback between the STB, the edge network and the network-based stream processor is established in order to define what will be processed at each of the network stages.

PREPARE

At the prepare stage, the assumption is that incoming content is received in, or converted to, MPEG-4 toolkit encoding, generating media in object-based format. Authoring tools are used to upload content and create scene-related object information, supporting improved compression of the media that will be transmitted and processed by the network. The object-based scene is created using MPEG-4 authoring tools and applying BIFS (Binary Format for Scenes) encoding, which supports the seamless integration and control of different audio/visual and synthetic objects in a scene. Compression and manipulation of visual content using the MPEG-4 toolkit introduces the novel concepts of the Video Object Plane (VOP) and the sprite. Using video segmentation, each frame of an input video sequence can be segmented into a number of VOPs, each of which may describe a physical object within the scene. A sprite coding technique may be used to support a mosaic layout: it is based on a large image composed of pixels belonging to a video object visible throughout a video segment, and it captures spatio-temporal information in a very compact way.

Other tools might also be applied at the prepare stage to improve network processing and reduce bandwidth. These include I-VOPs ("intra-coded video object planes"), which allow each object to be encoded and decoded based on its shape, motion and texture, and bidirectional video object planes (B-VOPs), which may be predicted from a past and a future reference VOP for each object or shape, with motion vectors built from neighbouring motion vectors that were already encoded.

The output of the prepare stage is, per asset, a set of object-based information coded as elementary streams, packetized elementary streams and metadata. The different object layers and data can in turn be transported as independent IP flows, over UDP or RTP, to the integrate stage.
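As a rough illustration of "independent IP flows", the sketch below sends each object layer's packetized stream over its own RTP-over-UDP socket. The 12-byte RTP header follows RFC 3550; the destination address, ports, payload type and 25 fps timestamp step are all hypothetical values, not taken from the paper.

```python
# A minimal sketch: each object layer's packetized elementary stream is sent
# as its own RTP-over-UDP flow (one socket, one port per layer).

import socket
import struct

def rtp_packet(payload, seq, timestamp, ssrc, payload_type=96):
    # V=2, P=0, X=0, CC=0 -> first byte 0x80; marker bit left clear.
    header = struct.pack("!BBHII", 0x80, payload_type, seq & 0xFFFF,
                         timestamp & 0xFFFFFFFF, ssrc)
    return header + payload

def send_layer(chunks, dest, ssrc, clock_step=3600):
    """Send one object layer as an independent UDP flow."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    ts = 0
    for seq, chunk in enumerate(chunks):
        sock.sendto(rtp_packet(chunk, seq, ts, ssrc), dest)
        ts += clock_step  # 90 kHz clock at 25 fps -> 3600 ticks per VOP
    sock.close()

# Hypothetical flows: base VOL and an enhancement VOL to the integrate stage.
send_layer([b"vol0-vop0", b"vol0-vop1"], ("192.0.2.10", 5004), ssrc=0x1001)
send_layer([b"vol1-vop0", b"vol1-vop1"], ("192.0.2.10", 5006), ssrc=0x1002)
```

Keeping each layer in its own flow is what later lets downstream stages forward or drop layers per session without touching the media itself.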

INTEGRATE

The session with the preparation stage will be an "object-based" session, embodied mainly in its visualization of several visual object types. The scalable core profile is required mostly because it supports arbitrarily shaped coding, temporal/spatial scalability and so forth. At the same time, the scalable core profile will need to support computer graphics, such as 2D meshes and synthetic objects, as part of the range of scalable objects at the integration stage.

MPEG-4 object-based coding allows separate encoding of foreground figures and background scenes. Arbitrarily shaped coding needs to be supported to maintain the quality of the input elements; it includes shape information in the compressed stream. In order to adapt streams to different delivery environments and available bandwidths, temporal and spatial scalability are included in the system. Spatial scalability allows the addition of one or more enhancement video object layers (VOLs) to the base VOL to achieve different video scenes. To summarize, at the integrate stage a user session is composed out of multiple incoming object-based assets, creating the final, synchronized video object layers and object planes. The output of the integrate stage includes all the information and media required for the session; however, at this point the media is not yet tuned to the specifics of the network, device and user; it is a superset of them. The streams are then transported to the "create" and "present" stages, where the final manipulation is done.

CREATE

The systems part of MPEG-4 allows creation or viewing of a multimedia sequence with hybrid elementary streams, each of which can be encoded and decoded with the codec best suited to it. However, manipulating those streams synchronously and composing them onto a screen in real time is computationally demanding. A temporal cache is therefore used in the "create" stage to store the encoded media streams. All of the elementary streams (ES) consist of either a multiplexed stream (using the MPEG-4-defined FlexMux) or a single stream, but all of them have been packetized by the MPEG-4 sync layer (SL). The use of FlexMux and the sync layer allows grouping of the elementary streams with low multiplexing overhead at the "prepare" and "integrate" stages, while the SL is used to synchronize bitstream delivery information from the previous stage to the "create" stage.

In order to generate the relevant session (stream), the "create" stage uses an HTTP submission to ask for a desired media presentation. The submission contains only the index of the preformatted BIFS (Binary Format for Scenes) for a pre-created and stored presentation, or a text-based description of the user's authored presentation. BIFS coding also allows seamless integration and control of different audio/video objects in a scene. The "integrate" stage receives the request and sends the media to the "create" stage, i.e., the BIFS stream together with the object descriptor in the form of an initial object descriptor stream.

If the client side can satisfy the decoding requirements, it sends a confirmation to the "create" stage to start the presentation; otherwise, the client sends its decoding and resolution capabilities to the "create" stage. At this point the "create" stage repeatedly downgrades to a lower profile until it meets the client's decoding capabilities, or informs the "present" stage to compose a stream that satisfies the client's decoding device (i.e., H.264 or MPEG-2). The "create" stage initiates the establishment of the necessary sessions for the SD (scene description) stream, in BIFS format, and the OD (object description) stream referenced with the user device. This allows the user device to retrieve the compressed media stream in real time, using the URL contained in the ES descriptor stream. The BIFS is used to lay out the media elementary streams in the presentation, as it provides the spatial and temporal relationships of those objects by referencing their ES_IDs. If the "create" stage needs to modify the received scene, such as by adding an enhancement layer to the current scene based on user device or network capabilities, it can send a BIFS update command to the "integrate" stage and obtain a reference to the new media elementary stream.

The "create" stage can handle multiple streams and synchronize between different objects and between the different elementary streams of a single object (e.g., base layer and enhancement layer). The synchronization layer is responsible for synchronizing the elementary streams. Each SL packet consists of an access unit (AU) or a fragment of an AU. An AU carries the time stamps needed for synchronization and constitutes the data unit consumed by the decoder at the "create" stage or by the user device decoder. An AU consists of a video object plane (VOP), and each AU is received by the decoder at the time instance specified by its decoding time stamp (DTS).

The media is then processed by the "present" stage in such a way that MPEG objects are transcoded to either an H.264 or an MPEG-2 transport stream, utilizing stored motion vector information and macroblock modes. The applicable process is defined by the user device's rendering capabilities. When the target is an advanced user device with MPEG-4 object layer decoding capabilities, the "present" processor acts as a stream adaptor, resizing streams, while composition is performed by the client device (advanced STB).
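The profile-downgrade loop in the exchange above is easy to misread, so here is a minimal sketch of one way it could work. The profile ladder, capability fields and bit rates are invented for illustration; the only behaviour taken from the text is stepping down until the client's decoding capabilities are met, with a fallback to a fully composed stream from the "present" stage.

```python
# A minimal sketch of the capability negotiation: the "create" stage steps
# down an ordered profile ladder until one fits the client's reported
# capabilities, otherwise it falls back to a composed "present" stream.
# Profile names and capability fields are illustrative assumptions.

PROFILE_LADDER = [
    {"name": "mpeg4-object-hd", "needs_object_decoding": True,  "mbps": 12},
    {"name": "mpeg4-object-sd", "needs_object_decoding": True,  "mbps": 4},
    {"name": "h264-flat-sd",    "needs_object_decoding": False, "mbps": 3},
]

def negotiate(client_caps):
    """Return the stream the client will receive, downgrading as needed."""
    for profile in PROFILE_LADDER:
        if profile["needs_object_decoding"] and not client_caps["object_decoding"]:
            continue  # client cannot compose MPEG-4 objects itself
        if profile["mbps"] > client_caps["max_mbps"]:
            continue  # stream would exceed the client's decode budget
        return {"source": "create", "profile": profile["name"]}
    # Nothing fits: the "present" stage composes a stream the client's
    # decoder (H.264 or MPEG-2) can handle directly.
    return {"source": "present", "profile": "mpeg2-composited"}

print(negotiate({"object_decoding": True,  "max_mbps": 5}))   # mpeg4-object-sd
print(negotiate({"object_decoding": False, "max_mbps": 2}))   # present fallback
```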

PRESENT

The modularity of the coding tools, expressed as well-known MPEG profiles and levels, allows easy customization of the "present" stage for a selected segment: for example, legacy MPEG-2 STB markets, where full stream composition needs to be applied in the network, versus advanced set-top boxes with full MPEG-4 scene object-based capability, where minimal stream preparation needs to be applied by the network "present" stage.

Two extreme service scenarios might be applied as follows:

Network-based "present": The "present" function applies stream adaptation and resizing; composes the network object elements; and applies transcoding functions to convert MPEG-4 file-based format to either MPEG-2 stream-based format or MPEG-4/AVC stream-based format.


STB-based "present": The "present" function might pass the object elements through the network, after rate adaptation and resizing, to be composed and presented by the advanced user device.

The "present" functionality is based on client/network awareness. In general, media provisioning is based on metadata generated by the client device and the network manager. The metadata includes the following information (a minimal sketch of such a record follows the list):

Video format, e.g., MPEG-2, H.264, VC-1, MPEG-4, QT

User device rendering capabilities

User device resolution format, e.g., SQCIF, QCIF, CIF, 4CIF, 16CIF

Network bandwidth allocation for the session
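A minimal sketch of such a metadata record, with field values drawn from the list above; the field names, the rendering-capability vocabulary and the mode-selection rule are illustrative assumptions, not defined by the paper.

```python
# A minimal sketch of the per-session metadata exchanged among the client
# device, the network manager and the "present" stage.

from dataclasses import dataclass

@dataclass
class SessionMetadata:
    video_format: str        # e.g. "MPEG2", "H.264", "VC-1", "MPEG4", "QT"
    rendering: str           # e.g. "none", "h264-flat", "mpeg4-objects"
    resolution: str          # e.g. "SQCIF", "QCIF", "CIF", "4CIF", "16CIF"
    bandwidth_kbps: int      # network allocation for this session

def choose_present_mode(meta: SessionMetadata) -> str:
    """Network-based vs STB-based "present", driven by the metadata."""
    if meta.rendering == "mpeg4-objects":
        return "stb-present"      # pass adapted objects through for the STB
    return "network-present"      # compose and transcode in the network

meta = SessionMetadata("MPEG2", "none", "4CIF", bandwidth_kbps=3750)
print(choose_present_mode(meta))  # -> network-present
```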

"Present" stage performance

It is essential that the "present" function be composed of object-based elements that use the defined set of tools, which provide a binary-coded representation of individual audiovisual objects, text, graphics and synthetic objects. It composes the Visual Object Sequence (VS), Video Object Layer (VOL) or any other defined tool into a valid H.264 or MPEG-2 stream, at the resolution and bandwidth defined by the client device and the network metadata feedback.

The elementary streams (scene data, visual data, etc.) are received at the "present" stage from the "create" system element, which allows scalable representations and alternate codings (bit rate, resolution, etc.), enhanced with metadata and protection information. An object described by an ObjectDescriptor is sent from the content originator, i.e., the "prepare" stage, and provides simple metadata related to the object, such as content creation information or chapter time layout. This descriptor also contains all information related to stream setup, including synchronization information and initialization data for decoders.

The BIFS (Binary Format for Scenes) at the "present" stage is used to place each object, with various effects potentially applied to it, in a display that is then transcoded to an MPEG-2 or H.264 stream.

STB-based "present": Object reconstruction

The essence of MPEG-4 lies in its object-oriented structure. Each object forms an independent entity that may or may not be linked to other objects, spatially and temporally. This approach gives the end user at the client side tremendous flexibility to interact with the multimedia presentation and manipulate the different media objects: end users can change the spatial-temporal relationships among media objects, and turn media objects on or off. However, it requires a complex session management and control architecture.

A remote client retrieves information regarding the media objects of interest and composes a presentation based on what is available and desired. The following communication messages occur between the client device and the "present" stage:

The client requests a service by submitting the description of the presentation to the data controller (DC) at the "present" stage side.

The DC on the "present" stage side directs the encoder/producer module to generate the corresponding scene descriptor, object descriptors, command descriptors and other media streams, based on the presentation description information submitted by the end user at the client side.

Session control on the "create" stage side handles session initiation, control and termination.

Actual stream delivery commences after the client indicates that it is ready to receive, and streams flow from the "create" stage to the "present" client. After the decoding and composition procedures, the MPEG-4 presentation authored by the end user is rendered on his or her display.

The set-top box client is required to support the architectural design of the MPEG-4 System Decoder Model (SDM), which is defined to achieve media synchronization, buffer management and timing when reconstructing the compressed media data.

The session controller at the client side communicates with the session controller at the server ("create" stage) side to exchange session status information and session control data. The session controller translates user actions into appropriate session control commands.
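Reduced to a scripted transcript, the exchange above might look like the following sketch. The message names and tuple format are invented; only the ordering (request, descriptor generation, session setup, readiness, delivery) comes from the text.

```python
# A minimal sketch of the client/"present" message exchange as a scripted
# sequence; message names and payloads are illustrative.

def stb_present_session(presentation_description):
    transcript = []
    # 1. Client submits the presentation description to the data controller.
    transcript.append(("client->DC", "REQUEST", presentation_description))
    # 2. DC drives the encoder/producer to generate scene/object/command
    #    descriptors and the referenced media streams.
    transcript.append(("DC->producer", "GENERATE", ["SD", "OD", "commands"]))
    # 3. Session control on the "create" side initiates the session.
    transcript.append(("create->client", "SESSION_READY", None))
    # 4. Streams flow once the client signals readiness; the client then
    #    decodes, composes and renders the authored presentation.
    transcript.append(("client->create", "READY", None))
    transcript.append(("create->client", "STREAM", b"...media..."))
    return transcript

for msg in stb_present_session({"objects": ["news", "ticker", "weather"]}):
    print(msg)
```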

Network-based MPEG-4 to H.264/AVC baseline profile transcoding

Transcoding from MPEG-4 to H.264/AVC can be done in the spatial domain or in the compressed domain. The most straightforward method is to fully decode each video frame and then completely re-encode it with H.264. This approach, known as spatial-domain video transcoding, involves full decoding and re-encoding and is therefore very computationally intensive.

Motion vector refinement and an efficient transcoding algorithm are used for transcoding the MPEG-4 object-based scene to an H.264 stream. The algorithm exploits side information from the decoding stage to predict the coding modes and motion vectors for H.264 encoding. Both intra-macroblock (MB) and inter-macroblock transcoding are exploited by the transcoding algorithm at the "present" stage.

During the decoding stage, the incoming bitstream is parsed in order to reconstruct the spatial video signal. During this process, the prediction directions for intra-coded macroblocks and the motion vectors are stored for later use in the coding process.

To achieve the highest transcoding efficiency at the "present" stage, side information is stored. During the decoding of MPEG-4, a great deal of side information (such as motion vectors) is obtained. The "present" stage reuses this side information, which reduces transcoding complexity compared with a full decode/re-encode scenario: while decoding the MPEG-4 bitstream, the side information is stored and then used to facilitate the re-encoding process. In the transcoding process, both motion vector estimation and coding mode decisions reuse the side information, reducing complexity and computation power.
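A minimal sketch of the motion-vector reuse idea: the MV stored during the decode pass seeds a small refinement search in the re-encode pass, instead of a full-range estimation. The SAD cost over flat pixel lists and the dictionary-indexed reference frame are toy stand-ins for real macroblock data.

```python
# A minimal sketch of MV reuse with refinement for transcoding: search only
# a small window around the decoded (stored) motion vector.

def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b) for a, b in zip(block_a, block_b))

def refine_mv(current_mb, reference, stored_mv, radius=1):
    """Refine the stored MV within a +/- radius window."""
    best_mv, best_cost = stored_mv, float("inf")
    for dx in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            mv = (stored_mv[0] + dx, stored_mv[1] + dy)
            candidate = reference.get(mv)
            if candidate is None:
                continue
            cost = sad(current_mb, candidate)
            if cost < best_cost:
                best_mv, best_cost = mv, cost
    return best_mv, best_cost

# Hypothetical data: a reference frame exposed as blocks indexed by MV.
reference = {(2, 0): [10, 12, 11, 13], (2, 1): [10, 12, 12, 13], (3, 0): [9, 12, 11, 14]}
mb = [10, 12, 12, 13]
print(refine_mv(mb, reference, stored_mv=(2, 0)))  # -> ((2, 1), 0)
```

Because the search window is a few positions around a good starting guess instead of the full frame, the re-encode pass does a small fraction of the motion-estimation work of a blind encoder.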

Network-based MPEG-4 to MPEG-2 transcoding

To support legacy STBs that have limited local processing capabilities and support only MPEG-2 transport streams, a full decode/encode is performed by the "present" stage. However, the "present" stage utilizes the tools used for MPEG-4 to H.264 conversion in order to remove complexity. Stored motion vector information and macroblock mode decision algorithms for inter-frame prediction, based on machine learning techniques, are used as part of the MPEG-4 to MPEG-2 transcode process. Since coding mode decisions consume most of the resources in video transcoding, fast macroblock (MB) mode estimation leads to reduced complexity.


The implementation presented above can operate in both offline and real-time environments. See Appendix 2 for elaboration on real-time implementation.


BENEFITS OF NETWORK-BASED PERSONALIZATION

Deploying network-based processing, whether complete or hybrid, has significant benefits:

A unified user experience is delivered across the various STBs in the field;

It presents a future-proof cost model for low-end to high-end STBs.

It utilizes the existing VOD environment, servers and infrastructure. Network-based processing accommodates low-end and future high-end systems, all under operators' existing, managed on-demand systems. Legacy servers require more back-office preparation, with minimal server processing power overhead, while newer servers can provide additional per-user processing and thus more personalization features.

Rate utilization is optimized. Instead of consuming bandwidth for every stream that makes up the user experience, network-optimized processing reduces overhead significantly. In the extreme case, delivery may be a single stream with no overhead, instead of 4-5 times the available bandwidth; in the common case, the overhead is approximately 20%.

Best quality of service for connected-home optimization. By performing most or all of the processing before the content reaches the home, the operator optimizes bandwidth and experience across the user's end devices, delivering the best quality of service.

Prevention of subscriber churn in favour of direct over-the-top (OTT) services. The operator has control over the edge network; over-the-top providers do not. Media manipulation in the network can and will be done by OTT operators, but unlike cable operators they do not control the edge network, which limits the effectiveness of their actions, unless there is a QoS agreement with the operator, in which case control stays in the operator's hands.

Maintaining the position of the current and future "smart pipe". Being aware of the end-user device, and processing for it, is critical for the operator to maintain processing capabilities that will allow migration to other areas such as mobile and 3D streaming.


IMPLEMENTING NETWORK-BASED PERSONALIZATION

As indicated earlier in the document, the solution can be implemented in a variety of ways. In this section we present three of the options, all under a generic North American on-demand architecture: hybrid network edge and back office; network edge only; and hybrid home and network.

Hybrid network edge and back office

As the user device powers up, or when the user starts using personalization features, the user client connects to the session manager; identifies the user, his device type and his personalization requirements; and, once resources are identified, starts a session. In this implementation the "prepare" function is physically separated from the other building blocks, and the user STB is not capable of the relevant video processing/rendering. Each incoming media asset is processed and enriched for downstream personalization as part of the standard ingest process. Once a session is initiated and the edge processing resources are found, sets of media and metadata flows are propagated across the internal CDN to the "integrate" step at the network edge. The set of flows includes the different media flows; related metadata, which covers target STB-based processing, source media characteristics, target content insertion information, interactivity support and so forth, and which must be available for the edge to start processing the session; objects; data from content providers/advertisers; and so forth.

After arrival at the edge, the "integrate" function aligns the flows and passes them to the "create" and "present" functions, which in this case generate a single, personally composed stream, accompanied by relevant metadata and directed at a specific user.

Figure 3: Hybrid back office and network edge

[Diagram: legacy media, analog/broadcast, IP and broadband sources are "prepared" offline at the back office, flow over IP across the region to the edge "integrate", "compose" and "present" functions operating in real time, and reach the legacy STB via edge QAM/HFC, wired and wireless paths; the session manager, AMS/CDN, UERM and application servers coordinate.]


As can be seen from Figure 3 above, the SMP (Scalable Media Personalization) session manager connects the user device and the network, influencing the "integrate", "create" and "compose" edge functions in real time.

Network edge only

This application case performs all the processing on demand, in real time. It is similar to the hybrid case; however, instead of the "prepare" function being located at the back office and working offline, all functions are hosted on the same platform. As can be expected, this option carries significant horsepower requirements for the "prepare" function, since content needs to be "prepared" in real time. In this example the existing flow is almost seamless, as the resource manager simply identifies the platform as another network resource and manages it accordingly.

Figure 4: Network Edge

[Diagram: as in Figure 3, but the generic ingest and all four functions ("prepare", "integrate", "compose", "present") run together at the network edge in real time, with the session manager, AMS/CDN, UERM and application servers coordinating.]


Hybrid Home and Network

In the hybrid implementation, the end-user device (the STB in our case) is identified as one capable of hosting the "present" function. As a result, as can be seen from Figure 5, the "present" function is relocated to the user's home, and the system demarcation point is the "create" function. During the session, multiple "prepared" flows of data and media arrive at the STB, consuming significantly less bandwidth than the non-prepared options and requiring reduced CPU horsepower for the "present" function.

Figure 5: Hybrid Home and Network

[Diagram: "prepare" remains in the back office and the "integrate" and "compose" functions at the network edge, while the "present" function runs on the advanced STB in the home; the session manager, AMS/CDN, UERM and application servers coordinate as before.]


POWER SHIFTING TO THE USER

Although legacy STBs are indeed present in many homes, the overall processing horsepower in the home is growing and will continue to grow. This means that the user device will be able to do more processing at home and, theoretically, will need less network-based assistance. At first glance this is indeed the case; however, on closer examination, two main challenges reveal themselves.

1. The increase in user device capabilities, and in actual user expectations, comes back to the network as a direct increase in bandwidth utilization, which in turn affects users' experience and their ability to run enhanced applications such as multi-view.

For example, today's next-generation STBs support 800 to 16,000 MIPS, versus the legacy 20 to 1,000 MIPS, with dedicated dual 400 MHz video graphics processors and dual 250 MHz audio processors (S-A/Cisco's next-gen Zeus silicon platform).

As Figure 6 below illustrates, the expected migration of media services into other home devices, such as media centres and game consoles, significantly increases available home processing power.

[Chart: home processing roadmap in TMIPS, on a 0-3 TMIPS scale, for the years 2007-2010.]

Figure 6: Home Processing Power Roadmap

2. No matter how "fast and furious" home processing power becomes, users will always want more. Having home devices perform ALL the video processing increases CPU and memory utilization and directly diminishes the performance of other applications.

In addition, as discussed earlier in the document, the increase in open standard home capabilities substantially strengthens the threat of customer churn for the cable operators.

Network-based personalization is targeted at solving these challenges. The approach is to use network processing to assist the user, improving his experience.


By performing the "prepare", "integrate" and "create" functions in the network, leaving only the "present" function in the user's home, several key benefits are delivered that effectively address the above challenges.

Network bandwidth utilization: The "create" function drives down network bandwidth consumption. The streams delivered to the user are no longer the complete, original media, but only what is needed. For example, for one HD and two SD streams in the same multi-view window, each of the three streams will have the correct resolution and frame rate required at each given moment, resulting in significant bandwidth savings, as can be seen in Figure 7.

[Chart: bandwidth to the home for 1 HD + 2 SD, in Mbps (0-18 scale), comparing STB-only, hybrid and network-only delivery for MPEG2 and H.264.]

Figure 7: 2SD, 1HD bandwidth to the home
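A back-of-the-envelope version of this comparison appears below. The per-stream rates are assumptions for illustration (they are not read off Figure 7); the 20% hybrid overhead figure comes from the benefits section above.

```python
# A rough sketch of the 1 HD + 2 SD multi-view bandwidth comparison. The
# rates are assumed typical full-resolution stream rates, not figures from
# the paper; the point is only the relative sizes of the three options.

RATES_MBPS = {
    "MPEG2": {"HD": 15.0, "SD": 3.75},
    "H.264": {"HD": 8.0,  "SD": 2.0},
}

def stb_only(codec):
    """Every stream arrives at full rate; the STB resizes and composes."""
    r = RATES_MBPS[codec]
    return r["HD"] + 2 * r["SD"]

def network_only(codec, overhead=0.0):
    """One composed stream at display (HD) resolution, plus any overhead."""
    return RATES_MBPS[codec]["HD"] * (1 + overhead)

for codec in ("MPEG2", "H.264"):
    print(codec,
          f"STB-only {stb_only(codec):.1f} Mbps,",
          f"network-only {network_only(codec):.1f} Mbps,",
          f"hybrid ~{network_only(codec, 0.20):.1f} Mbps")
```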

CPU processing power: As indicated in the "Putting It All Together" section, our solution allows for selective composition of object layers. Also, the actual multi-view is created from multiple resolutions, so there is no need for render-resize-compose functions at the user device, which in turn reduces overall CPU utilization.

Finally, the fact that the network can deliver the above benefits inherently puts power back into the hands of the operator, who can deliver the best user experience.


SUMMARY

Exceeding user expectations while maintaining a viable business case is becoming more challenging than ever for the cable operator. As weight shifts to the home and to broadband streaming, the operator is forced to find new solutions to maintain leadership in the era of personalization and interactivity. Network-based personalization provides a balanced solution: the ability to maintain an open, standards-based solution, while dynamically shifting the processing balance based on user, device, network and time, can provide the user and operator a "golden" solution.


ABOUT THE AUTHOR

Amos Kohn is Vice President of Business Development at Scopus Video Networks. He has more than 20 years of multinational executive management experience in convergence technology development, marketing, business strategy and solutions engineering at telecom and emerging multimedia organizations. Prior to joining Scopus, Amos Kohn held senior positions at ICTV, Liberate Technologies and Golden Channels.


APPENDIX 1: STB BASED ADDRESSABLE ADVERTISING

In the home addressable advertising model, multiple user profiles in the same household are offered to advertisers within the same ad slot. For example, within the same slot, multiple targeted ads will replace the same program feed, targeted at different ages of youth, while other advertisements may target the adults in the house (male, female) based on specific profiles. During the slot, a youth will see one ad while an adult sees another. Addressable advertising requires more bandwidth to the home than traditional zone-based advertisement. Granularity might step one level up, where the targeted advertisement targets the household rather than an individual user within it; in this case, less bandwidth is required in a given serving area compared with user-based targeted advertisement. The impact of home addressability on the infrastructure of channels that are already in the digital tier and enabled for local ad insertion will be similar to unicast VOD service bandwidth requirements.

In a four-demographics scenario, each ad zone needs four times the bandwidth allocated for a linear ad: for example, four targeted copies of a 3.75 Mbps SD ad slot would consume 15 Mbps.

APPENDIX 2: REAL-TIME IMPLEMENTATION

Processing in real time is determined by stream provisioning (fast motion estimation), stream complexity and the size of the buffer at each stage.

Support for scenes as compositions of audiovisual objects (AVOs), hybrid coding of natural video and 2D/3D graphics, and advanced system and interoperability capabilities all support real-time processing.

MPEG-4 real-time software encoding of arbitrarily shaped video objects (VOs) is a key element in the structure of the solution. The MPEG-4 toolkit unites the advantages of block-based and pixel-recursive motion estimation methods in one common scheme, leading to fast hybrid recursive motion estimation that supports MPEG-4 processing.