

Page 1: HEVC Tile Based Streaming to Head Mounted Displaysiphome.hhi.de/skupin/assets/pdfs/CCNC2017_HEVCTile...HEVC Tile Based Streaming to Head Mounted Displays Robert Skupin, Yago Sanchez,

HEVC Tile Based Streaming to Head Mounted Displays

Robert Skupin, Yago Sanchez, Dimitri Podborski, Cornelius Hellge, Thomas Schierl

Video Coding & Analytics Department

Fraunhofer Heinrich-Hertz-Institute

Berlin, Germany

{forename.surname}@hhi.fraunhofer.de

Abstract—360° video streaming to clients using Virtual Reality head mounted displays is a challenge for traditional video delivery. As transmission of the complete content in a desirable quality sacrifices a large fraction of available client and network resources, adaptivity to the user viewport promises substantial benefits. An efficient way to achieve viewport adaptive streaming without per-user or per-orientation encoding, i.e. essentially transcoding, is to make use of motion-constrained HEVC tiles. DASH can be used for tiled streaming, where tiled content resides on the server at multiple resolutions. The DASH client selects the resolutions of each tile according to the current viewport. This demonstration paper presents an agile and responsive streaming prototype system for 360° video content. In order to achieve acceptable responsiveness, the DASH client relies on small buffer sizes and shifted random access points across tiles when suitable.

Keywords—360° video; streaming; tiles; HEVC.

I. INTRODUCTION

Major platforms such as YouTube and Facebook are already streaming 360° video to various devices. The combination of this type of content with consumption devices such as head mounted displays (HMDs) poses interesting challenges to the established design of video streaming services. In video services that offer high resolution content of which a subset is displayed on traditional end devices with flat panel screens, such as TV sets or tablets, constraining user interaction to relax service requirements is a common approach, e.g. by limiting scrolling speed. However, such constraints are not suitable for HMDs, as head pose may change considerably within milliseconds. Therefore, the complete 360° video content needs to be transmitted at all times in case the user turns quickly. Covering the full 360° surroundings at a desirable resolution could easily require several times UHD resolution. Current deployments of 360° video streaming services transmit the content in a user viewport agnostic way. Such a streaming approach sacrifices a substantial amount of throughput and decoder capability for pixels that are not even presented to the user. As a result, the viewport visual quality of such services can hardly compete with what users of traditional video services are used to. Moreover, many virtual reality relevant devices such as mobile phones contain hardware video decoders that are tailored to the conventional resolutions used in traditional video services, such as FHD or UHD. With such a tight budget of decodable pixels, only viewport adaptive coding and transmission schemes can achieve desirable visual quality within the user viewport.

Adaptive streaming based on HTTP is currently the dominant means for distribution of video on demand. Specifically, the MPEG DASH [1] standard has seen major deployment in recent years, e.g. at YouTube and Netflix. While adapting the video bitrate to the available throughput is most important for traditional video content, 360° video on HMDs allows for further adaptation strategies. Taking the current user viewport into account for adaptation promises to yield significant gains with respect to the requirements on throughput and decoder capabilities. A simplistic approach in this direction is to provide an encoding per user. This approach, however, does not scale well to large deployments. Per-orientation encodings, which use MPEG DASH in the fashion of traditional video services, are another option that was recently implemented on a large scale at Facebook [2]. Here, the DASH client switches between numerous representations with per-orientation encodings according to the user head pose to maintain high visual quality within the current viewport. However, this approach comes with significant overhead for the generation, encoding and storage of as many as several dozen versions of what was originally the same content.

In this demonstration paper, we present an alternative approach for streaming of 360° video based on varying resolution motion-constrained HEVC tiles as presented in [3]. This approach emphasizes the current user viewport by decreasing the resolution of video pixels outside the viewport on the fly. Because these areas are reduced in resolution rather than dropped, the full 360° surroundings are always available on the end device. The resolution adaptivity per video area is achieved without transcoding by merging tiles of varying quality or resolution into a single common bitstream through lightweight tile aggregation [4]. The demonstration shows a 360° video streaming prototype system for HMDs using tile based MPEG DASH streaming that is capable of adapting the resolution of individual video areas depending on the current user viewport. Furthermore, the DASH client switches between representations with different random access configurations to mitigate undesirable bitrate peaks.


II. VIDEO CONTENT PREPARATION

The 360° video scheme described in [3] is based on the cubic projection, which comes with a number of advantages. As opposed to more complex polyhedral projections, the cubic projection has seen widespread adoption and support in graphics and rendering frameworks. Further, when compared to the also widespread simple equirectangular projection, the cubic projection is more convenient for tiling approaches, as equally sized tiles result in roughly the same covered Field of View (FoV) per tile.

For content preparation on the server side, the original 360° video in a cubic format is initially encoded at two resolutions making use of motion-constrained HEVC tiles at the desired tiling granularity. As shown in the top left of Fig. 1, the cubic video¹ is, as an example, evenly divided into 24 tiles, i.e. at a granularity of 4 tiles per cube face. Likewise, a downsampled version of the cubic video, as depicted on the top right side, is tiled at the same granularity. The tiles are HEVC encoded with a set of encoder constraints that allow aggregation into a single common bitstream without transcoding. These constraints mostly concern inter prediction and temporal motion vector prediction at tile boundaries. For further details the reader is referred to [4]. The bottom of Fig. 1 shows the user viewport dependent mixture of the two resolutions, in which, as an example, 8 tiles are delivered to the client in the original high resolution (HR), while the remaining 16 tiles are of low resolution (LR). In this example setup, the overall resolution of the transmitted video is half the resolution of the original content.
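The "half the resolution" figure can be checked with a few lines of arithmetic. The sketch below assumes that the downsampled version halves both width and height, so that each LR tile carries a quarter of the pixels of an HR tile; the text does not state the downsampling factor, and the function name is purely illustrative.

```python
from fractions import Fraction

def transmitted_pixel_fraction(total_tiles, hr_tiles, downsample_factor):
    """Share of the original pixel count that is sent when hr_tiles tiles keep
    full resolution and all remaining tiles are downsampled."""
    lr_tiles = total_tiles - hr_tiles
    # Assumption: downsampling by f in both dimensions leaves 1/f^2 of the pixels.
    lr_share = Fraction(1, downsample_factor ** 2)
    return (hr_tiles + lr_tiles * lr_share) / total_tiles

# 24 tiles, 8 of them HR, LR assumed to be downsampled by 2 per dimension.
print(transmitted_pixel_fraction(24, 8, 2))  # 1/2
```

Under that assumption, 8 HR tiles plus 16 quarter-size LR tiles amount to 12 of the 24 original tiles' worth of pixels.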

However, from an encoding perspective, adapting the resolution of tiles as the user viewport changes requires the encoding to provide intra coded random access points (RAPs), which are costly in bitrate, at a suitable rate within the bitstream. In order to mitigate the effect of frequent RAPs on the buffering behavior, multiple RAP variants are encoded. Specifically, the concept of shifted IDRs is adopted from [5]. As illustrated in Fig. 2, the different variants with a given RAP interval are encoded with RAP pictures at varying positions. After segmentation for DASH, each segment is available in a RAP and a non-RAP version. This allows the client to distribute RAP induced bitrate peaks over time as suitable, i.e. depending on viewport changes.

¹ Dataset generously provided by Deep Inc.

III. TILE BASED STREAMING FOR HMDS

The individual components of the demonstrator system are illustrated in Fig. 3. The Oculus Rift Consumer Version 1 HMD serves as the end device for the demonstrator. Feedback from the OculusSDK about the current head orientation of the user is passed to the DASH client and the tile aggregation component. The DASH client drives the tile selection process using the orientation information: it decides which tiles to request at high and low resolution respectively, and it analyzes potential viewport orientation variations in order to decide which segments are requested with and without RAPs. Another parameter that influences the tile selection process is the FoV of the end device, as ratios of HR to LR tiles other than the one depicted in Fig. 1 can be requested. Once segments are downloaded, a single viewport dependent bitstream for the individual user is created by the tile aggregation component. A proprietary HEVC software decoder processes the bitstream and passes decoded pictures to an OpenGL based renderer, which generates the user viewport to be output on the Oculus Rift HMD.
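The tile selection step can be sketched as follows. The paper does not specify the selection rule, so this example assumes a simple heuristic: rank the 24 tiles by the angle between the viewing direction and each tile centre, and request the closest ones at HR. The face names, the axis convention (x right, y up, -z forward) and the function names are all illustrative assumptions.

```python
import itertools
import math

# Face normals as (axis index, sign), with x right, y up, -z forward.
# Face names and this axis convention are illustrative assumptions.
FACES = {
    "right": (0, 1), "left": (0, -1),
    "top": (1, 1), "bottom": (1, -1),
    "back": (2, 1), "front": (2, -1),
}

def tile_centers(tiles_per_edge=2):
    """Unit vectors from the viewer to each tile centre on a cube of edge length 2."""
    offsets = [-1 + (2 * k + 1) / tiles_per_edge for k in range(tiles_per_edge)]
    centers = {}
    for face, (axis, sign) in FACES.items():
        u_axis, v_axis = [a for a in range(3) if a != axis]
        for (i, u), (j, v) in itertools.product(enumerate(offsets), repeat=2):
            p = [0.0, 0.0, 0.0]
            p[axis], p[u_axis], p[v_axis] = float(sign), u, v
            norm = math.sqrt(sum(c * c for c in p))
            centers[(face, i, j)] = tuple(c / norm for c in p)
    return centers

def view_direction(yaw_deg, pitch_deg):
    """Viewing direction for a head pose given as yaw and pitch in degrees."""
    yaw, pitch = math.radians(yaw_deg), math.radians(pitch_deg)
    return (math.sin(yaw) * math.cos(pitch),
            math.sin(pitch),
            -math.cos(yaw) * math.cos(pitch))

def select_tiles(yaw_deg, pitch_deg, hr_count=8):
    """Label the hr_count tiles closest to the viewing direction HR, the rest LR."""
    direction = view_direction(yaw_deg, pitch_deg)
    centers = tile_centers()
    # A larger dot product means a smaller angle to the viewing direction.
    ranked = sorted(
        centers,
        key=lambda t: -sum(a * b for a, b in zip(direction, centers[t])),
    )
    hr = set(ranked[:hr_count])
    return {tile: ("HR" if tile in hr else "LR") for tile in centers}

selection = select_tiles(yaw_deg=0, pitch_deg=0)
print(sorted(tile for tile, res in selection.items() if res == "HR"))
```

When looking straight ahead, the four front tiles are always selected; the remaining four HR slots go to neighbouring tiles that are tied in angle. A real client would also need to account for the FoV of the device and for roll, which this sketch ignores.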

The Media Presentation Description (MPD) on the HTTP server makes use of MPEG DASH Spatial Relationship Descriptors (SRD) [6], which provide the spatial location of individually offered streams within a global coordinate system, as well as the relationship between the tiles of different resolutions. SRD is essential for tile based streaming because it gives the DASH client an understanding of the applied tiling setup. For instance, in [7] the authors demonstrated a novel approach utilizing tile based streaming with SRD to adaptively distribute bitrate throughout the picture plane. The presented demonstrator uses SRD in a similar fashion to allow for tile based streaming. In addition, for a given resolution and bitrate, each tile is made available in several representations with a RAP at different positions in time (see Fig. 2). These shifted IDR representations are equivalent to each other and allow clients to download segments with or without a RAP depending on the viewport changes.
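The layout of shifted IDR representations can be pictured with a small sketch. It assumes that the RAP interval is expressed in segments and that one representation exists per possible offset; both are simplifications for illustration, and the function names are hypothetical.

```python
def has_rap(segment_index, rap_interval, offset):
    """True if the representation shifted by `offset` starts this segment with a RAP."""
    return (segment_index - offset) % rap_interval == 0

def rap_variants(segment_index, rap_interval):
    """Offsets of the shifted representations that carry a RAP in this segment."""
    return [o for o in range(rap_interval) if has_rap(segment_index, rap_interval, o)]

def schedule(rap_interval, n_segments):
    """One row per shifted representation: 'R' marks a RAP segment, '.' a non-RAP one."""
    return [
        "".join("R" if has_rap(n, rap_interval, offset) else "." for n in range(n_segments))
        for offset in range(rap_interval)
    ]

for row in schedule(rap_interval=3, n_segments=6):
    print(row)
# R..R..
# .R..R.
# ..R..R
```

With this layout, every segment index has exactly one representation that offers a RAP and others that do not, so the client can pick either version per tile and per segment.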

[Figure 1 labels: server side: original resolution, downsampled; client side: user dependent mixture, 33% high resolution; legend: tile boundary, slice boundary]

Fig. 1. (top) original and downsampled encodings on server side with motion-constrained tiles. (bottom) client side user dependent mixed resolution cubic projection with 8 high and 16 low resolution tiles.

[Figure 2 labels: rows RAP#1, RAP#2, RAP#3; columns segment n, segment n+1, segment n+2, segment n+3]

Fig. 2. Shifted RAPs: short and long interval with varying offsets.

The DASH client buffer size used for adaptive streaming determines the stability and agility of the system. In traditional non-live streaming services, rather large buffer sizes of several seconds are used to cope with throughput variations and to avoid buffering events, i.e. playout interruptions. A large buffer, however, comes at the cost of slower reaction to viewport changes. Note that low resolution content is presented to the user whenever the viewport changes beyond a tolerable degree during playout of a given segment. Among other factors, the tolerable viewport change depends on the ratio of HR to LR tiles. It is therefore important that LR tiles in the changed viewport are switched to HR tiles as quickly as possible. If large buffers are used, either pre-buffered media segments are discarded and downloaded again in the suitable version, inducing an overhead in transmitted bitrate, or the adaptation response is delayed until all pre-buffered media segments are played out, the latter leading to a poor user experience. In fact, when streaming viewport adaptive 360° video on HMDs, the latter is not desirable as the viewport changes rapidly and frequently. Therefore the buffer size has to be kept minimal.
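The two options can be compared with a rough back-of-the-envelope model. The sketch below is a simplification under stated assumptions: the buffer holds only whole segments, every pre-buffered segment of a changed tile is re-requested in the first option, and the remainder of the segment currently playing is ignored. The numbers in the usage lines are hypothetical.

```python
import math

def adaptation_options(buffer_level_s, segment_duration_s, changed_tiles):
    """Cost of reacting to a viewport change under the two options in the text."""
    buffered_segments = math.floor(buffer_level_s / segment_duration_s)
    return {
        # Option 1: discard pre-buffered segments of changed tiles and fetch them again.
        "redownload_requests": buffered_segments * changed_tiles,
        # Option 2: keep the buffer and wait until it has been played out.
        "wait_delay_s": buffered_segments * segment_duration_s,
    }

# Hypothetical values: 1 s segments, 4 tiles change resolution.
print(adaptation_options(buffer_level_s=4, segment_duration_s=1, changed_tiles=4))
print(adaptation_options(buffer_level_s=1, segment_duration_s=1, changed_tiles=4))
```

In this model both the re-download overhead and the waiting time grow linearly with the buffer level, which is the motivation for keeping the buffer small.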

However, keeping the buffer size small poses challenges for the rate adaptation algorithms of DASH clients. More concretely, peak bitrates resulting from segments that contain a RAP might affect the buffer level and might lead to DASH clients switching to lower qualities. To avoid this issue, the presented prototype streaming system makes use of the aforementioned shifted IDR concept. Once the DASH client has determined which tiles to download at high and low resolution, the rate adaptation algorithm of the client has to select which of the shifted IDR representations to use for downloading the segments. That is, the DASH client has to select RAP and non-RAP segments in such a way that the peak bitrate of the downloaded data is kept minimal.
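One plausible policy for this selection is sketched below: a tile is requested in its RAP version only when its resolution differs from the previous segment, and in its non-RAP version otherwise. This rule is an assumption for illustration, not the rate adaptation algorithm of the prototype, and the segment sizes in `SIZES_KBIT` are invented.

```python
def choose_segment_versions(previous, current):
    """Map each tile to 'RAP' if its resolution changed (or it is new), else 'non-RAP'.

    previous and current map tile ids to resolution labels such as 'HR' or 'LR'.
    """
    return {
        tile: ("RAP" if previous.get(tile) != resolution else "non-RAP")
        for tile, resolution in current.items()
    }

def segment_size(versions, resolutions, size_table):
    """Total size of one download round, given per-(resolution, version) sizes."""
    return sum(size_table[(resolutions[tile], version)] for tile, version in versions.items())

# Hypothetical segment sizes in kbit; RAP segments are larger than non-RAP ones.
SIZES_KBIT = {
    ("HR", "RAP"): 400, ("HR", "non-RAP"): 150,
    ("LR", "RAP"): 120, ("LR", "non-RAP"): 40,
}

previous = {t: ("HR" if t < 8 else "LR") for t in range(24)}
current = {t: ("HR" if 2 <= t < 10 else "LR") for t in range(24)}

versions = choose_segment_versions(previous, current)
all_rap = {t: "RAP" for t in current}

print(segment_size(versions, current, SIZES_KBIT))  # 2500
print(segment_size(all_rap, current, SIZES_KBIT))   # 5120
```

In this example four tiles change resolution, so only four of the 24 requests use the RAP version, and the download is less than half the size of an all-RAP round under the assumed sizes.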

IV. SUMMARY

This demonstration paper presents a prototype streaming system for 360° video content to head mounted displays based on motion-constrained HEVC tiles. The demonstrated approach allows reduction of pixels that need to be decoded without being presented to the user. This is achieved by reducing resolution of video areas not in the current user viewport. Providing video variants with varying random access configuration allows mitigating the penalty of RAP induced bitrate peaks. On the end device, a user dependent HEVC bitstream is created from the received tile segments through lightweight tile aggregation and fed into a single HEVC decoder.

REFERENCES

[1] Information technology – Dynamic adaptive streaming over HTTP (DASH) – Part 1: Media presentation description and segment formats, ISO/IEC 23009-1:2014.

[2] Kuzyakov, E., & Pio, D., “Next-generation video encoding techniques for 360 video and VR”, retrieved from: https://code.facebook.com/posts/1126354007399553/next-generation-video-encoding-techniques-for-360-video-and-vr/, 2016.

[3] Skupin, R., Sanchez, Y., Hellge, C., & Schierl, T., “Tile Based HEVC Video for Head Mounted Displays”, Proceedings of IEEE International Symposium on Multimedia (ISM), Las Vegas, US, December 2016.

[4] Sanchez, Y., Skupin, R., & Schierl, T., “Compressed Domain Video Processing for Tile based Panoramic Streaming using HEVC”, Proceedings of IEEE International Conference on Image Processing (ICIP), Quebec, Canada, September 2015.

[5] Sanchez, Y., Podborski, D., Hellge, C., & Schierl, T., “Shifted IDR Representations for Low Delay Live DASH Streaming using HEVC Tiles”, Proceedings of IEEE International Symposium on Multimedia (ISM), Las Vegas, US, December 2016.

[6] Niamut, O. A., et al., “MPEG DASH SRD: spatial relationship description”, Proceedings of the 7th International Conference on Multimedia Systems, ACM, 2016.

[7] Le Feuvre, J., & Concolato, C., “Tiled-based adaptive streaming using MPEG-DASH”, Proceedings of the 7th International Conference on Multimedia Systems, ACM, 2016.

[Figure 3 labels: User Device: Oculus SDK, Orientation Feedback, DASH Client, Tile Aggregation, HEVC Decoder, OpenGL Rendering; network; HTTP Server: Per Tile Bitstream Segments (low resolution, original resolution)]

Fig. 3. Demonstrator system overview