
HTML5 MSE Playback of MPEG 360 VR Tiled Streaming
JavaScript implementation of MPEG-OMAF viewport-dependent video profile with HEVC tiles

Dimitri Podborski, Jangwoo Son, Gurdeep Singh Bhullar, Robert Skupin, Yago Sanchez,
Cornelius Hellge, Thomas Schierl
Fraunhofer Heinrich Hertz Institute
Multimedia Communications Group

Berlin, Germany
[email protected]

ABSTRACT

Virtual Reality (VR) and 360-degree video streaming have gained significant attention in recent years. First standards have been published in order to avoid market fragmentation. For instance, 3GPP released its first VR specification to enable 360-degree video streaming over 5G networks, which relies on several technologies specified in ISO/IEC 23090-2, also known as MPEG-OMAF. While some implementations of OMAF-compatible players have already been demonstrated at several trade shows, so far, no web browser-based implementations have been presented. In this demo paper we describe a browser-based JavaScript player implementation of the most advanced media profile of OMAF: the HEVC-based viewport-dependent OMAF video profile, also known as tile-based streaming, with multi-resolution HEVC tiles. We also describe the applied workarounds for the implementation challenges we encountered with state-of-the-art HTML5 browsers. The presented implementation was tested in the Safari browser with support of HEVC video through the HTML5 Media Source Extensions API. In addition, the WebGL API was used for rendering, using region-wise packing metadata as defined in OMAF.

CCS CONCEPTS

• Information systems → Multimedia streaming; RESTful web services; • Computing methodologies → Virtual reality.

KEYWORDS

OMAF, VR, 360 video, Streaming, HEVC, Tiles, JavaScript, MSE

ACM Reference Format:
Dimitri Podborski, Jangwoo Son, Gurdeep Singh Bhullar, Robert Skupin, Yago Sanchez, Cornelius Hellge, and Thomas Schierl. 2019. HTML5 MSE Playback of MPEG 360 VR Tiled Streaming: JavaScript implementation of MPEG-OMAF viewport-dependent video profile with HEVC tiles. In MMSys '19: ACM Multimedia Systems Conference - Demo Track, June 18–21, 2019, Amherst, MA, USA. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3304109.3323835

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
MMSys '19, June 18–21, 2019, Amherst, MA, USA
© 2019 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-6297-9/19/06.
https://doi.org/10.1145/3304109.3323835

1 INTRODUCTION

Virtual Reality (VR) and 360-degree video streaming have gained popularity among researchers and the multimedia industry in recent years. For example, in addition to many published research papers in this area, several standardization organizations such as ISO/IEC, MPEG and 3GPP have published their first specifications on VR [1] [6]. One such specification is the result of an MPEG activity and is referred to as the Omnidirectional Media Format (OMAF); it specifies a storage and delivery format for 360-degree multimedia content. OMAF particularly defines the HEVC-based viewport-dependent media profile for video, which allows streaming HEVC tiles of different resolutions and finally combining them into a single bitstream, so that only one video is decoded on the client end-device. This approach, often referred to as tile-based streaming, allows the service operator to increase the resolution within the viewport while decoding a lower-resolution video compared to the conventional naive 360-degree video streaming approach. However, this comes with additional complexity in the player implementation, particularly when implemented in a web browser environment using only JavaScript. In addition, the W3C Media Source Extensions (MSE) offer no direct support for OMAF, which requires workarounds at certain functional steps. In this paper we describe how to overcome these challenges in order to implement a fully standard-compliant OMAF player for tile-based streaming in JavaScript. Furthermore, the entire source code of our implementation is available on GitHub [4].

The remainder of the paper is structured as follows. Section 2 presents the overall architecture of our implementation and describes each component of the system. The proof of concept is presented in Section 3, explaining the most important challenges and workarounds. Finally, Section 4 concludes our paper.

2 ARCHITECTURE

A general overview of the main components involved in our implementation is depicted in Figure 1. It consists of six main modules which interact with each other and together provide a fully functional player for OMAF 360-degree video. Those modules are: Player, Downloader (DL), MPD Parser (MP), Scheduler (SE), Media Engine (ME) and finally the Renderer (RE).

The Player module represents the core of the entire application. It connects all modules with each other and controls them. The DL module deals with all HTTP requests to the server. The MP module implements the parsing of the DASH manifest file (MPD) together with additional metadata defined in OMAF.

Figure 1: Overall architecture. (The Player connects the Downloader, MPD Parser, Scheduler, Media Engine with its MSE SourceBuffer, the Renderer and a Sensor; DASH content is fetched via the Fetch API over HTTP/2, with media/metadata and control data flows between the modules.)

The SE module controls the DL module and decides when requests for the next segments should be executed, based on the current status of the player. The task of the ME module is to parse OMAF-related metadata on the File Format level and re-package the downloaded OMAF content in such a way that the Media Source Extensions API of the web browser can process the data. Finally, the RE module uses the OMAF metadata in order to correctly render the video texture on the canvas using the WebGL API. The following subsections describe each of these six modules in more detail.

2.1 Player

As already mentioned in the previous section, the Player module can be seen as the core of the entire application. Its main goal is to connect all modules with each other and to control them. It also connects HTML elements such as video and canvas from the main page with the ME and RE modules. In addition, it provides basic functionality to the user, such as loading a source, play, pause, loop, entering and leaving full-screen mode, and retrieving certain metrics of the application in order to plot the data on the screen.
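As a rough illustration of this role, the following is a hypothetical sketch of how such a core module might wire the components together. The class and method names mirror the module names of Figure 1 but are purely illustrative; the actual API can be found in the source code on GitHub [4].

```javascript
// Hypothetical sketch of the Player core wiring the modules together.
// All constructor and method names are illustrative, not the omaf.js API.
class OMAFPlayer {
  constructor(videoElement, canvasElement) {
    this.mp = new MPDParser();                 // parses MPD + OMAF metadata
    this.dl = new Downloader();                // HTTP traffic via Fetch API
    this.me = new MediaEngine(videoElement);   // repackaging for MSE
    this.re = new Renderer(canvasElement);     // WebGL rendering
    this.se = new Scheduler(this.dl, this.me); // segment request timing
  }

  async load(mpdUrl) {
    const manifest = this.mp.parse(await this.dl.fetchText(mpdUrl));
    const initSegments = await this.dl.fetchSegments(this.mp.getInitUrls());
    this.re.init(this.me.parseRWPK(initSegments)); // region-wise packing metadata
  }

  play()  { this.se.start(); this.re.startRenderLoop(); }
  pause() { this.se.stop(); }
}
```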

2.2 Downloader

The main task of this module is to manage all HTTP traffic between our application and the server. This module receives a list of URLs from the Player module and downloads them using the Fetch API. After all required HTTP requests are processed and all requested data is successfully downloaded, it forwards the data to the ME module for further processing. Since the current version of the player issues many simultaneous HTTP requests, it is desirable to host the media data on an HTTP/2-enabled server in order to improve the streaming performance.
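A minimal sketch of this download step is given below; the function name is illustrative, and segment URLs are assumed to have been resolved from the MPD already. On an HTTP/2 server the parallel requests are multiplexed over a single connection.

```javascript
// Sketch: download a batch of segment URLs in parallel with the Fetch API
// and resolve all bodies to ArrayBuffers before handing them to the ME module.
async function downloadSegments(urls) {
  const responses = await Promise.all(urls.map((url) => fetch(url)));
  for (const res of responses) {
    if (!res.ok) throw new Error(`Segment request failed: ${res.status}`);
  }
  return Promise.all(responses.map((res) => res.arrayBuffer()));
}
```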

2.3 MPD Parser

OMAF uses Dynamic Adaptive Streaming over HTTP (DASH) as a primary delivery mechanism for VR media. It also specifies additional metadata for 360-degree video streaming such as:

• Projection type: only Equirectangular (ERP) or Cubemap (CMP) projections are allowed.

• Content coverage: is used to determine which region each DASH Adaptation Set covers, while each HEVC tile is stored in a separate Adaptation Set. This information is required to select only those Adaptation Sets which cover the entire 360-degree space.

• Spherical region-wise quality ranking (SRQR): is used to determine where the region with the highest quality is located within an Adaptation Set. This metadata allows us to select an Adaptation Set based on the current orientation of the viewport.

In addition to OMAF metadata, another notable feature is the DASH Preselection Descriptor, which indicates the dependencies between different DASH Adaptation Sets. The MP module parses all required DASH manifest metadata and implements several helper functions which are used by the Player module in order to make appropriate HTTP requests.
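To make the descriptor handling concrete, the following is a simplified sketch of how such descriptors can be located in the MPD with the browser's DOMParser. The scheme URIs shown follow the OMAF specification; the value parsing is deliberately reduced compared to the actual MP module, and the function name is illustrative.

```javascript
// Sketch: locate OMAF-related descriptors on each AdaptationSet of a DASH MPD.
// Only the descriptor elements are collected here; interpreting their @value
// attributes (coverage regions, quality ranking, etc.) is left out.
function parseOmafDescriptors(mpdText) {
  const doc = new DOMParser().parseFromString(mpdText, "application/xml");
  const sets = [];
  for (const as of doc.getElementsByTagName("AdaptationSet")) {
    const info = { id: as.getAttribute("id"), coverage: null, srqr: null };
    for (const prop of as.getElementsByTagName("SupplementalProperty")) {
      const scheme = prop.getAttribute("schemeIdUri");
      if (scheme === "urn:mpeg:mpegI:omaf:2017:cc") info.coverage = prop;   // content coverage
      if (scheme === "urn:mpeg:mpegI:omaf:2017:srqr") info.srqr = prop;     // quality ranking
    }
    sets.push(info);
  }
  return sets;
}
```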

2.4 Scheduler

One of the key aspects of any streaming service is maintaining a sufficient buffer in order to facilitate smooth media playback. In the implementation, the buffer is maintained using a parameter named 'buffer limit' which can be set prior to or during a streaming session. The parameter indicates the maximum buffer fullness level in milliseconds, and depending on its value the SE module schedules the next request. If the buffer is able to accommodate a segment, the SE module initiates the request for the next segment. On the other hand, if the buffer is full, the request for the next segment is delayed until buffer fullness and the current time of the media playback indicate otherwise.
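The core of this decision can be sketched as follows, assuming the `bufferLimitMs` parameter described above; the helper name is illustrative.

```javascript
// Sketch of the buffer-limit check: returns true when the buffered media
// ahead of the playback position leaves room for another segment request.
function canRequestNextSegment(video, sourceBuffer, bufferLimitMs) {
  const buffered = sourceBuffer.buffered;
  if (buffered.length === 0) return true; // nothing buffered yet
  const bufferedAheadMs =
    (buffered.end(buffered.length - 1) - video.currentTime) * 1000;
  return bufferedAheadMs < bufferLimitMs;
}

// The SE module can poll this check and schedule requests accordingly, e.g.:
// if (canRequestNextSegment(video, sb, 3000)) downloader.requestNext();
```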

An important point to mention for any viewport-dependent streaming implementation is that the user can change the viewport orientation during playback at any time. Therefore, the system should adapt to the changed viewport orientation as quickly as possible, which implies that the buffer fullness limit must be kept relatively small. Preferably, it should be in the range of a few seconds.

2.5 Media Engine

The ME module takes input from the DL module and prepares the downloaded media segments for consumption by the Media Source Extensions (MSE). Current state-of-the-art MSE implementations do not yet fully support all modern features of the ISO Base Media File Format, in particular some File Format features used in OMAF media segments. Therefore, the downloaded media segment data is repackaged in a format that is compatible with MSE implementations. For the parsing and writing of the File Format data we are using the JavaScript version of GPAC's MP4Box tool, mp4box.js [2]. Furthermore, the parsing of the OMAF-related metadata is also implemented in the ME module. Figure 2 visualizes the repackaging process and shows the flow of downloaded media segments through the ME module to the MSE SourceBuffer.

Figure 2: The Media Engine module repackages the downloaded DASH segments before pushing them to the MSE SourceBuffer. (Incoming segments contain several hvc1 tracks plus an hvc2 extractor track, with track IDs that vary between tiling configurations; the repackaged segments contain a single hvc1 track with track ID 1.)

Before the ME module processes the media segments, each segment should be completely downloaded and consist of multiple hvc1 video tracks (one track for each HEVC tile) and one additional hvc2 video track with Extractor NAL Units for HEVC video [5]. The extractor track is required for the creation of a single HEVC bitstream from the individually delivered HEVC tiles in order to use a single decoder instance. An extractor represents the in-stream data structure using a NAL unit header for extraction of data from other tracks and can be logically seen as a pointer to data located in a different File Format track. Unfortunately, currently available web browsers do not natively support File Format extractor tracks, and thus a repackaging workaround as performed by the ME module is necessary. Therefore, the ME module resolves all extractors within an extractor track and packages the resolved bitstream into a new track with a unique track ID, even if the extractor track ID changes. Hence, from the MSE SourceBuffer perspective, it looks like the segments are coming from the same DASH Adaptation Set even if the player switches between different tiling configurations.
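To illustrate the File Format handling, the following is a minimal sketch using the mp4box.js API [2] to parse a segment and pull out the samples of the hvc2 extractor track. It assumes the initialization segment is appended first so that track information is available; resolving the Extractor NAL units and rewriting the result into a new single-track segment (what the ME module actually does) is omitted for brevity.

```javascript
// Sketch: parse an OMAF segment with mp4box.js and extract the samples of
// the hvc2 extractor track. MP4Box is assumed to be loaded globally [2].
function extractSamples(initBuffer, mediaBuffer, onSamples) {
  const file = MP4Box.createFile();
  file.onError = (e) => console.error("parse error", e);
  file.onReady = (info) => {
    // Pick the hvc2 extractor track from the parsed track list.
    const extTrack = info.videoTracks.find((t) => t.codec.startsWith("hvc2"));
    file.setExtractionOptions(extTrack.id, null, { nbSamples: 100 });
    file.start();
  };
  file.onSamples = (trackId, user, samples) => onSamples(samples);
  initBuffer.fileStart = 0;                      // mp4box.js needs byte offsets
  mediaBuffer.fileStart = initBuffer.byteLength;
  file.appendBuffer(initBuffer);
  file.appendBuffer(mediaBuffer);
  file.flush();
}
```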

2.6 Renderer

After the repackaged segment is processed by the MSE SourceBuffer, the browser decodes the video and the video texture is finally rendered by the RE module using OMAF metadata. The RE module is implemented using a custom shader written in the OpenGL Shading Language (GLSL) together with the three.js library (a WebGL library) [8], which is used to implement three-dimensional graphics on the Web. Our rendering implementation is based on triangular polygon mesh objects and supports both equirectangular and cubemap projections. In the case of a cubemap projection, one cube face is divided into two triangular surfaces, while in the case of an equirectangular projection, a helper class from the three.js library is used to create a sphere geometry.
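As an illustration of the geometry setup, the following is a minimal sketch of the equirectangular path using three.js [8]; the function name is illustrative. The multi-resolution cubemap path additionally applies the custom GLSL unpacking shader sketched further below.

```javascript
import * as THREE from "three"; // three.js [8]

// Sketch of the ERP rendering path: the decoded <video> element becomes a
// WebGL texture on a sphere that is viewed from the inside.
function createErpScene(video) {
  const texture = new THREE.VideoTexture(video);
  const geometry = new THREE.SphereGeometry(100, 64, 64);
  geometry.scale(-1, 1, 1); // invert the sphere so the texture faces inward
  const material = new THREE.MeshBasicMaterial({ map: texture });
  const scene = new THREE.Scene();
  scene.add(new THREE.Mesh(geometry, material));
  return scene;
}
```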

Figure 3 shows the three main processes used for rendering, from the decoded picture to the final result rendered on the cube. It shows an example where each face of the cube is divided into four tiles, while the decoded picture is composed of 12 high-resolution and 12 low-resolution tiles. The 12 triangular surfaces of the cube as depicted in Figure 3 (c) can be represented as a 2D plane as in Figure 3 (b). The fragment shader of the RE module uses OMAF metadata to render the decoded picture correctly onto the cube faces as shown in Figure 3 (b). The Region-wise Packing (RWPK) metadata of OMAF contains the top-left position, width and height in packed and unpacked coordinates, as well as the rotation of tiles, for all tracks. Since the shader reconstructs the position of pixels based on OMAF metadata, Figure 3 (b) can be regarded as a region-wise unpacked image. Therefore, the shader sets the rendering range of Figure 3 (a) using the RWPK metadata and renders the tiles of the decoded picture onto the cube faces of Figure 3 (b). However, when there is a change in the viewport position, the RE module has to be given the correct metadata for the current track. In the implementation, when the manifest file is loaded, the RE module is initialized with all RWPK metadata in order to correctly render all tracks. The synchronization of the track switching is covered in the following section.

Figure 3: Rendering process of one face of the cube; (a) the decoded video texture before unpacking; (b) the 12 triangular surfaces on the 2D plane after region-wise unpacking; (c) the composition of the 12 triangular surfaces on the cube.
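To make the mapping concrete, the following is a simplified sketch of the region-wise unpacking idea in a fragment shader, embedded as a GLSL string in JavaScript. Rectangles are given in normalized [0,1] coordinates, rotation is omitted, and the uniform layout is illustrative rather than the actual shader of the RE module.

```javascript
// Simplified fragment-shader sketch: a texture coordinate in the unpacked
// (rendering) layout is remapped into the packed (decoded picture) layout
// using the RWPK rectangles of one region.
const fragmentShader = `
  uniform sampler2D decodedPicture;
  uniform vec4 packedRect;    // x, y, width, height in the decoded picture
  uniform vec4 unpackedRect;  // x, y, width, height in the projected picture
  varying vec2 vUv;

  void main() {
    vec2 local = (vUv - unpackedRect.xy) / unpackedRect.zw; // position inside region
    vec2 packedUv = packedRect.xy + local * packedRect.zw;  // map into packed layout
    gl_FragColor = texture2D(decodedPicture, packedUv);
  }
`;
```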

3 PROOF OF CONCEPT

In this section, we first give a brief overview of the implementation. We then discuss the most important challenges we encountered during implementation and describe their workarounds.

3.1 Implementation overview

Figure 4 shows a screenshot of the main page of our OMAF JavaScript Player. In addition to some basic video player controls such as load MPD, play, pause, change to full-screen mode etc., it also provides several controls for debugging purposes. In the top input menu, the user can provide the URL of a DASH manifest file (MPD) and load it. After the MPD is successfully loaded and parsed, the player downloads all initialization segments (one for each emphasized viewport, which, in relation to the example in Figure 3, corresponds to 24 extractor tracks), parses OMAF-related metadata and initializes the RE module with the extracted region-wise packing information. In addition, the ME module creates a pseudo-initialization segment for the MSE SourceBuffer to initialize it in such a way that the following repackaged media segments can be successfully decoded by the web browser.

The streaming session starts when the user presses the play button. The player then continuously downloads media segments of a certain extractor track, depending on the current orientation of the viewport. In addition, all dependent media segments (hvc1 tracks) are also downloaded. All downloaded media segments are then immediately repackaged, and the corresponding RWPK metadata is used to correctly render the video on the canvas. We tested our implementation on Safari 12.0.2, since only web browsers from Apple and Microsoft¹ natively support it. Raw video sequences were provided by Ericsson and prepared for streaming using the OMAF file creation tools from Fraunhofer HHI [3] [4]. Finally, the content was hosted on the Amazon CloudFront CDN, which supports HTTP/2 in conjunction with HTTPS.

¹ The Microsoft Edge web browser supports the implementation on devices with a hardware HEVC decoder. For devices that do not provide hardware support for HEVC, the HEVC Video Extensions have to be enabled in the Microsoft Store.

Figure 4: Screenshot of the OMAF JavaScript Player user interface.

The following section covers some of the issues that we faced during the implementation.

3.2 Implementation challenges

An important prerequisite for the correct functioning of the implementation is the synchronization of the extractor track switching and the corresponding RWPK metadata. When the user changes the viewport, the extractor track is changed, and the high- and low-quality tile positions of the decoded picture are derived using the corresponding region-wise packing metadata. The RE module has to reset the texture mapping of the decoded picture according to the changed track, and this has to happen at exactly the same time as the video texture changes from one track to another. The simplest way to detect the exact frame number is to check the current time of the video. Unfortunately, the W3C is still discussing the precise accuracy of the currentTime property of the video element [7], and at present it is not possible to reliably detect the exact frame number of the video using currentTime. Therefore, the ME module uses two video elements together with two MSE SourceBuffers and alternately switches between them whenever the extractor track (and thus the RWPK metadata) changes. The ME module stores the bitstream of the changed track in the SourceBuffer of the other video element. When the Player reaches the end of the active buffer, it switches to the other video element. At the same time, the Player informs the RE module through an event about the change of the video element. The RE module declares two scene objects and associates them with each video element. Also, the RE module calculates and stores the RWPK metadata of the decoded picture for all tracks in the initialization phase. When receiving the event about the change of the video element from the Player, the RE module replaces the scene and maps the video texture of the decoded picture based on the scene object, so that track synchronization is performed without error.
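The following sketch outlines this two-video-element workaround. Each slot owns a video element and an MSE SourceBuffer; on a track switch the repackaged segments go into the inactive slot, and playback plus the Renderer scene hop over once the active buffer is exhausted. The HEVC codec string and the renderer callback are illustrative assumptions.

```javascript
// Sketch: create one video element + MSE SourceBuffer per slot.
function createSlot(mime = 'video/mp4; codecs="hvc1.1.6.L93.B0"') {
  return new Promise((resolve) => {
    const video = document.createElement("video");
    const mediaSource = new MediaSource();
    video.src = URL.createObjectURL(mediaSource);
    mediaSource.addEventListener("sourceopen", () => {
      resolve({ video, sourceBuffer: mediaSource.addSourceBuffer(mime) });
    }, { once: true });
  });
}

async function initSlots(renderer) {
  const slots = [await createSlot(), await createSlot()];
  let active = 0;
  return {
    // On a viewport-driven track switch, feed the *inactive* slot.
    onTrackSwitch(repackagedSegment) {
      slots[1 - active].sourceBuffer.appendBuffer(repackagedSegment);
    },
    // When the active buffer runs out, hop to the other element and
    // notify the Renderer so it swaps the scene and RWPK metadata.
    swap() {
      active = 1 - active;
      slots[active].video.play();
      renderer.onVideoElementChanged(slots[active].video); // hypothetical hook
    },
  };
}
```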

While this solution works well on Safari, we discovered an open issue in the Microsoft Edge browser [9] that interferes with the two-buffer workaround. The Edge web browser requires a few seconds of data in each buffer in order to start the decoding process, and therefore the new segment at every track switch cannot be instantly rendered.

Furthermore, due to the track synchronization solution using two video elements, we need to operate two MSE SourceBuffer objects, which makes the buffering logic somewhat more complex, as the SE module has to monitor the buffer fullness level of both SourceBuffer objects. The durations of the media segments present in the two media sources are combined to determine the available buffer time at a given moment, so that the SE module can make requests for future media segments accordingly.
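A small sketch of this combined check, reusing the slot structure sketched above, might look as follows; the exact accounting in the SE module may differ.

```javascript
// Sketch: approximate combined buffer fullness (in ms) across both slots:
// the time left ahead of the playback position in the active element plus
// everything already queued in the inactive element.
function combinedBufferMs(activeSlot, inactiveSlot) {
  const aheadSec = (slot, fromTime) => {
    const b = slot.sourceBuffer.buffered;
    return b.length ? Math.max(0, b.end(b.length - 1) - fromTime) : 0;
  };
  return (aheadSec(activeSlot, activeSlot.video.currentTime) +
          aheadSec(inactiveSlot, 0)) * 1000;
}
```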

For future work we plan to further optimize the streaming performance of the player, reducing the number of HTTP requests, and to implement suitable rate-adaptation algorithms for tile-based streaming.

4 CONCLUSIONS

In this paper, we present the first implementation of playback of the HEVC-based viewport-dependent OMAF video profile in a web browser. This OMAF profile makes it possible to increase the resolution of 360-degree video inside the viewport by sacrificing the resolution of areas which are not presented to the user. The entire implementation is in JavaScript, and therefore no additional plugins are required. Although current state-of-the-art MSE implementations do not yet fully support some File Format features used in OMAF, our implementation shows how these limitations can be overcome by using JavaScript in combination with MSE. We also point out that the W3C is currently working on accurate frame-by-frame signaling for media elements, the absence of which currently requires a slightly more complex workaround in the rendering process using two MSE SourceBuffer elements in alternating order. The entire source code of the implementation presented in this paper is published on GitHub [4]. Information on other MPEG-OMAF implementations from Fraunhofer HHI can be found in [3].

REFERENCES

[1] 3GPP. 2019. 5G; 3GPP Virtual reality profiles for streaming applications. Technical Specification (TS) 26.118. 3rd Generation Partnership Project. Version 15.1.0.
[2] GPAC. 2019. GPAC Multimedia Open Source Project, JavaScript version of GPAC's MP4Box tool. Retrieved April 19, 2019 from https://gpac.github.io/mp4box.js
[3] Fraunhofer HHI. 2019. Better quality for 360-degree video. Retrieved April 19, 2019 from http://hhi.fraunhofer.de/omaf
[4] Fraunhofer HHI. 2019. HTML5 MSE Playback of MPEG 360 VR Tiled Streaming. Retrieved April 19, 2019 from https://github.com/fraunhoferhhi/omaf.js
[5] ISO/IEC. 2017. 14496-15, Information technology - Coding of audio-visual objects - Part 15: Carriage of network abstraction layer (NAL) unit structured video in the ISO base media file format. ISO/IEC.
[6] ISO/IEC. 2019. 23090-2, Information technology - Coded representation of immersive media (MPEG-I) - Part 2: Omnidirectional media format. ISO/IEC.
[7] W3C Media and Entertainment Interest Group. 2018. Frame accurate seeking of HTML5 MediaElement. Retrieved April 19, 2019 from https://github.com/w3c/media-and-entertainment/issues/4
[8] three.js. 2019. Three.js, JavaScript 3D Library. Retrieved April 19, 2019 from https://threejs.org/ Version r101.
[9] David V. 2017. Video MSE issues as of March 01 2017. Retrieved April 19, 2019 from https://developer.microsoft.com/en-us/microsoft-edge/platform/issues/11147314