


SASTECH Journal 30 Volume 11, Issue 2, Sep 2012

Design and Development of an Efficient H.264 Video Encoder for CPU/GPU using OpenCL

Shaikh Mohd. Laeeq1, Gangadhar N. D.2, Brian Gee Chacko3

1- M. Sc. [Engg.] Student, 2-Professor, 3-Assistant Professor Computer Engineering,

M. S. Ramaiah School of Advanced Studies, Bangalore-560 058.

Abstract

Video codecs have undergone dramatic improvements and have grown in complexity over the years, driven by commercial products such as mobile phones and tablet PCs. With H.264 emerging as the de facto standard for video, uniformity in the delivery of video has been achieved. Given the constraints of memory and transmission bandwidth, the focus has been on effective compression and decompression of video. Multicore architectures have increasingly become available on mobile phones and tablet PCs, and as codecs have grown more complex and computationally intensive, it is all the more important to distribute such computation over multicore hardware. OpenCL, a framework for programming multicore hardware architectures such as CPUs, GPUs and DSPs, has reached a high level of maturity.

In this study an efficient H.264 video encoder is developed using OpenCL for multicore architectures, based on the open-source x264 H.264 library. The x264 library is profiled with sample videos on a CPU and performance hotspots are identified for optimization. These hotspots are optimized by encapsulating them in OpenCL kernel loops, within which four parallel threads are created by OpenMP. Further, compiler optimization flags and assembly instructions within the x264 library are used to improve memory efficiency and execution speed. Programs to identify and use the queried OpenCL CPU device and to analyze the PCI bandwidth between the host and the device are also developed.

When launched on CPU and GPU platforms with OpenCL APIs and multithreading, improvements in execution time and in the number of system calls made are observed. Optimizing the x264_pixel_satd_8x4 hotspot yielded a 1.2-second gain over the earlier non-OpenCL version on the CPU and a 0.4-second gain on the GPU. The degradation in performance on the GPU platform is due to read and write latencies. However, combining this with compiler optimization flags and assembly instructions across the entire x264 library resulted in a 4.3x improvement on the CPU and 4.2x on the GPU platform. It can be concluded that, alongside multithreading with OpenCL, the traditional approach of compiler-level optimization remains important, as it addresses the core performance of the application.

Keywords: OpenCL, Video Encoding, GPU Platforms.

1. INTRODUCTION

Over the years, the demand for high-quality video consumption, generation and broadcast has increased tremendously. The emergence of tablets, mobile phones and other consumer electronics products has pushed the landscape of video viewing to a whole new level, and end consumers increasingly demand a better, higher-quality viewing experience. This has driven standards such as MPEG-2 to be rethought and redrafted for further improvement.

The traditional medium of video consumption has been the television. With enormous cable networks and large numbers of channels relayed via satellite, this did not change for a long period of time. In recent years, this prime medium of cable networks has seen stiff competition from other media such as set-top boxes and the cloud. Set-top boxes based on DTH technology offer substantially higher bandwidth, increasing the capacity for video data transfer; as a natural outcome, both audio and video are of high quality. The cloud has seen similar improvements. With websites like YouTube making the uploading, downloading and viewing of videos easy, video broadcasting has found a new place alongside the traditional TV medium. With services like Netflix and YouTube, movies can be viewed on desktops; the only limitation is network bandwidth, which has also improved over time.

With the emergence of embedded devices and increasing sophistication in both hardware and software, video playback on such devices has become possible. These devices are severely constrained in hardware resources and hence have limited computational capability. The underlying CPU, memory and implemented network protocols vary from device to device; hence the video playback experience and the supported formats vary as well.

The other traditional medium of viewing video has been the television served by cable operators. The quality of video and audio has been limited by the technologies the cable operators use. Little improvement in quality has been observed, because the coaxial cables used offer limited bandwidth and suffer losses from heat and noise. With the emergence of DTH, TVs can now be connected directly to the satellite downlink and a full high-definition video experience is possible.


Gaming and graphics applications have always utilized the GPU effectively.

Of late, many other user-space applications have started using the GPU as well, commonly under the term GPGPU (General-Purpose GPU) computing. Applications in encryption and cryptography, banking and finance, virtualization and the like have begun offloading computation to the GPU. To make this possible, frameworks from different software companies have been proposed; CUDA from NVIDIA is a proprietary GPU-programming framework running only on NVIDIA GPUs. With the emergence of such frameworks, writing applications that exploit GPUs has become easy, efficient and portable.

OpenCL stands for Open Computing Language and is an open standard released by the Khronos Group. It is a framework for programming heterogeneous computing platforms consisting of CPUs, GPUs and floating-point DSPs. It has allowed user-space applications to be ported and redeveloped so that GPU compute power is utilized, and has led to the emergence of the GPGPU computing paradigm, with many applications now built this way.

However, the domain of video processing, though compute intensive, has not yet fully exploited the power of heterogeneous compute platforms. Given the ease of programmability and the emergence of these frameworks, the prime tasks of encoding and decoding video can be performed over multicore platforms as well. This study aims to develop an efficient H.264 encoder that uses the underlying hardware effectively, pointing towards a new generation of video codecs that are aware of multicore platforms and lift the user experience to a new level.

2. BACKGROUND THEORY

2.1 Video Processing

Earlier MPEG audio and video coding standards such as MPEG-1 and MPEG-2 enabled many familiar consumer products. For instance, these standards enabled video CDs and DVDs, allowing video playback on digital DVD players, VCRs and computers, while digital broadcast video was delivered via terrestrial, cable or satellite networks, enabling digital TV and HDTV. MPEG-1 addressed coding of non-interlaced video at lower resolutions and bit-rates, offering VHS-like video quality; MPEG-2 addressed coding of interlaced video at higher resolutions and bit-rates, enabling digital TV and HDTV with decent video quality.

At the time of their completion they represented a timely and practical state-of-the-art technical solution, consistent with the cost/performance trade-offs of the intended products within the implementation technology then available. Various products in the consumer electronics space, such as set-top boxes and DVD players, delivered MPEG-2 quality video and ruled the industry for quite some time.

MPEG-4 was launched to address a new generation of multimedia applications and services. It is an evolving standard, with new additions and applications routinely incorporated so that it remains the defining standard for ever-evolving consumer electronics products. Multimedia applications, services and products such as interactive TV, smart TVs and Internet video needed a different standard altogether, and MPEG-4 was drafted with these futuristic applications in mind.

From a coding-efficiency standpoint, MPEG-4 video was evolutionary in nature: it was built on the coding structure of the MPEG-2 and H.263 standards, adding enhanced tools within the same structure. Thus, MPEG-4 Part 2 offers a modest coding gain, but only at the expense of a modest increase in complexity. The expectation was that, since object-based video was the main focus, the increase in complexity could be justified only for those applications, not for pure rectangular video applications.

In the meantime, while highly interactive multimedia applications remain farther in the future than anticipated, there is seemingly inexhaustible demand for much higher compression at the best possible video quality, for practical applications such as Internet multimedia, wireless video, personal video recorders, video-on-demand and videoconferencing. The H.264/MPEG-4 AVC standard (Richardson) is a state-of-the-art video coding standard that addresses the above-mentioned applications and promises significantly higher compression than earlier standards.

The main focus is on compression efficiency, as it yields two important benefits: less memory for storage and, in the case of video transmission, less bandwidth. These are the prime constraints for embedded devices. Further flexibility is available in the form of profiles to choose from, giving a trade-off between power consumed and performance. The encoder has one prime area of concentration: motion estimation.

The redundancy between frames is largely due to motion, and this can be exploited to achieve even higher compression. One frame can describe the objects and their motion; the next frame then need not be compressed in full. Subtracting consecutive frames, after aligning for motion, leaves a residual frame with the motion removed; this is called motion compensation. Since motion compensation occurs at the encoder, the corresponding motion prediction must also happen at the decoder. This is illustrated in Figure 1.

Fig. 1 H.264 encoder block diagram

Figure 1 shows the entire block diagram of the encoder. The important aspects of motion compensated prediction are


displayed clearly. This is the most time- and resource-consuming part of the encoder.

Fig. 2 The division and organization of pixels in H.264

Figure 2 shows the organization of a video sequence or picture in terms of its constituent units. Each picture is divided into slices; each slice is divided into macroblocks of 16x16 luminance samples, which are further divided into sub-blocks of either 8x8 or 8x4, depending on the video format and profile chosen. These are again divided into blocks of the same types, down to the pixel data entities themselves. Three main profiles are defined: baseline, main and extended. They range from basic functionality and performance up to richer features and better performance, with a gradual increase in coding complexity. Features such as the NAL (Network Abstraction Layer) for live streaming of video are available in the higher-level profiles, at a significant overhead in processing and complexity.

2.2 Introduction to CPU/ GPU Platforms

CPU-based computation has been the hallmark of all applications running on an operating system; examples include web browsers, document editing and drawing. Over time, user-space applications have become much more sophisticated and more demanding in terms of compute power, CPU time, memory and, in general, all hardware resources. As a result, sluggish performance has been observed for specialized applications such as gaming, graphics and video on general-purpose systems.

One reason is that although CPU performance has been doubling roughly every 18 months as per Moore's Law, it has mostly been exploited sequentially. Programming demanding applications sequentially and running them on CPUs limits the capabilities of the applications themselves and thereby the overall user experience.

Over time, dedicated hardware units called GPUs have emerged for the special applications of graphics and gaming. Owing to the limitation of sequential computation on CPUs, GPUs differ in one major feature: they are highly parallel. Dedicated hardware units called SPEs (streaming processing elements) are available that can execute processes and threads in parallel. Offloading the compute-intensive parts of user-space applications to the GPU can increase overall performance dramatically, by 2x to 100x, by leveraging these special hardware units.

Fig. 3 CPU and GPU based computation

Figure 3 shows the execution models of CPUs and GPUs. On a CPU there is one main memory and all processes share it in a time-sliced fashion; the time needed to complete execution therefore depends on the number of processes running concurrently, plus the time needed for context switches. With all these latencies, processes take longer to execute and overall performance is slower. On a GPU, processes run in parallel on the dedicated SPE hardware units, and the computation time is restricted mainly by the available PCI bandwidth, since read and write operations are needed frequently.

3. DESIGN AND DEVELOPMENT

As the study concerns an efficient H.264 encoder on a CPU/GPU multicore platform using OpenCL, device detection and usage forms the first step in working with OpenCL. The device and the various compute units available need to be identified; this allows code to be dispatched to the various compute units and computed effectively.

Fig. 4 OpenCL device detection CPU

In Figure 4 the flowchart of the device-detection program is shown; in OpenCL terminology this is normally called clinfo. The first step is the installation of the SDK, since the runtime environment is vital for a successful launch of OpenCL. The open-source compiler used is GCC, which is supported by OpenCL. Thereafter the driver needs to be queried. Note that for OpenCL to detect a device, a compliant driver must be installed; for GPUs, an OpenCL-compliant GPU driver is needed. After detecting the driver, the available platforms are queried. On multicore or SMP hardware a single platform will be detected, and the CPU device will report as many compute units as there are cores.

Furthermore, in the case of GPUs, as it is an AMP (asymmetric multiprocessing) platform, two platforms will be detected: one for the master CPU and one for the GPU, analogous to master-slave communication, with the CPU enumerated first and the GPU next. Thereafter the properties of the underlying hardware can be printed, which helps in understanding the hardware better. Features such as IEEE 754 floating-point support or the amount of L1/L2/L3 cache memory available are indeed helpful in deciding how to distribute the computation.

The next step after detecting the device is to understand the latency between the host and the device. As these are normally connected via the PCI bus, the minimum read and write times to and from the device are needed. This helps in determining the minimum amount of work a kernel loop must contain before it is worth sending to the GPU or compute unit: if the computation is too small relative to the PCI transfer cost, more time is spent reading and writing data and results than on the actual computation, which is counterproductive.

In Figure 5, the PCI bandwidth query is performed with OpenCL. The first step in the program is to identify the device and then select from the devices listed. To measure the read and write bandwidth, 200 iterations over a buffer of size 4096 x 4096 are performed. This gives the available read and write bandwidth and also acts as a benchmark under standard conditions.

x264 is an open-source H.264 encoder library. It is widely available for various platforms and its code base runs to many thousands of lines; the entire functioning of an H.264 encoder can be correlated with the library. When the x264 library is run on sample videos, various functions come into play, and many system calls are made to the kernel for each frame across the entire video sequence. The execution has been profiled with the gprof profiler; the results are shown in Figure 6, where the various functions being executed can be observed, along with the system calls made at the kernel level. The most time- and resource-consuming is the x264_pixel_satd_8x4() function: it takes the most time and makes a large number of system calls.

Note that, as a measure of efficiency, the time taken to execute should be inversely proportional to the number of system calls made: more calls completed in less time means higher overall efficiency.

Fig. 5 OpenCL PCI bandwidth query

Fig. 6 Gprof profiler results

In Figure 7, the internals of the function x264_pixel_satd_8x4 are described. A complete video sequence consists of frames, each frame consists of macroblocks, and within them sub-macroblocks are present. This hierarchy comes in different sizes such as 16x16, 8x16 or 8x4, depending on the profile chosen. Hence x264_pixel_satd_8x4 is a function at the sub-macroblock level, operating on blocks of 8 elements vertically and 4 elements horizontally in pix1 and pix2 respectively.


Fig. 7 x264_pixel_satd_8*4 function and its internals

Fig. 8 x264_satd_8*4 functions flowchart

In Figure 8, the flowchart of the function x264_pixel_satd_8x4 is explained. It first initializes the two pixel arrays via the pointers *pix1 and *pix2, and declares further temporary arrays. The loop body runs four times, incrementing the pixel pointers each time and performing a specific computation: the subtraction of R1C1 and R1C2, the subtraction of R2C1 and R2C2, and then the addition of the two results. This mirrors the logic used in motion compensation and estimation. Once the loop over pix1 has executed four times, the data is copied and the function returns the final result.

The optimization done here is in the copying of data entities. Since the motion compensation and estimation logic itself is already correct, no further parallelization of the arithmetic is attempted; parallelism is instead applied to the way the data entities are read and written. Hence the whole function has been encapsulated in an OpenCL kernel loop.

The earlier function definition was

static NOINLINE int x264_pixel_satd_4x4( pixel *pix1, int i_pix1, pixel *pix2, int i_pix2 )

This has been encapsulated into an OpenCL kernel loop as

__kernel static NOINLINE int x264_pixel_satd_8x4( __global pixel *pix1, __global int i_pix1, __global pixel *pix2, __global int i_pix2 )

Note that the __kernel qualifier is needed for OpenCL to recognise it as a kernel loop. Further, the memory is allocated in the global address space, i.e. a shared-memory model is used; hence the __global qualifier before pix1 and pix2.

Further, the addition of the #pragma omp parallel for directive creates parallel threads within the function. This loop then executes over the compute units identified by the OpenCL device-detection step. The number of threads chosen is 4, matching the four iterations to be done. Changes have been made to config.mak and the Makefile, and the library has been compiled successfully; the header #include <omp.h> and the setting OMP_NUM_THREADS=4 have been added.

Improvements have been observed; these are discussed in the results section. This specific change of encapsulating the function in an OpenCL kernel loop is portable even to the GPU: instead of CL_DEVICE_TYPE_CPU, the CL_DEVICE_TYPE_GPU device type is requested. This creates kernel loops capable of executing over the compute units of the device in an effective parallel fashion. Beyond kernel loops, compiler optimizations are equally important, as they help create compact and memory-efficient code. The optimization flags used are

CFLAGS=-Wshadow -O3 -ffast-math -fopenmp -Wall -I. -I$(SRCPATH) -pg -std=gnu99 -I/usr/local/include -I/usr/local/include -fno-tree-vectorize

These options are added in the config.mak file as the compiler flags, commonly called CFLAGS. They optimize the generated code and improve the efficiency of the overall program. The linker flags are

LDFLAGS= -pg -lm -lpthread -fopenmp -lOpenCL

4. RESULTS AND DISCUSSIONS

Figures here indicate the results obtained on the CPU and GPU platforms with OpenCL.


Fig. 9 OpenCL info for the CPU

Figure 9 shows the results of detecting the device with OpenCL. Important properties such as the cache memory and the supported image height and width are displayed. This is very helpful when distributing the computation, as these properties can be exploited to add efficiency to the whole application.

Fig. 10 OpenCL device info for CPU

In Figure 10, further details of the CPU are given by the OpenCL device program. This helps in identifying the device and its various features.

Fig. 11 PCI bandwidth for CPU

In Figure 11, the program for checking the PCI bandwidth and its output are described. The available PCI bandwidth and the time to read and write data between host and device are obtained. This is important for identifying the bus bottleneck when communicating computations and results.

Fig. 12 OpenCL GPU device detection

In Figure 12, the results of detecting the GPU device are shown. Note that it is an NVIDIA graphics card, hence the platform name reported is NVIDIA CUDA. Also, CL_DEVICE_TYPE returns CL_DEVICE_TYPE_GPU.

Fig. 13 PCI bandwidth for GPU

In Figure 13, the PCI bandwidth between the CPU and the GPU, and the time to read and write results, are shown.

Fig. 14 Matrix program compilation

In Figure 14, the successful compilation of the matrix program is shown. This helps test the running and execution of the OpenCL environment.

Fig. 15 Output of matrix addition program


In Figure 15 the outputs of the matrix addition program are shown. This helps verify the working of OpenCL based programs.

Fig. 16 x264 running for a test video

In Figure 16, the running of the x264 library is shown: a test video of 600 frames is being encoded.

Fig. 17 Profiler results of x264 with test video

In Figure 17, the profiler output from gprof is shown. This is helpful in identifying the hotspots to be optimized. The most time- and resource-consuming function is x264_pixel_satd_8x4().

Fig. 18 Test video results

In Figure 18, the test video, a football match, is shown; it is used for all the results described in the following sections. It consists of 600 frames of raw video.

Table. 1 Encoding results on CPU

                  % Time    Seconds    System Calls    Call Name
Without OpenCL     17.34     5.92      49375005        x264_pixel_satd_8x4
With OpenCL         9.51     4.71      70814214        x264_pixel_satd_8x4
Gain                7.83     1.21      21439209

In Table 1, the results obtained before and after optimization are described. Initially, without the use of OpenCL, the x264_pixel_satd_8x4 hotspot consumes 17.34 percent of the total time, amounting to 5.92 seconds and 49375005 system calls made to the kernel. With the OpenCL APIs, the device is detected and the computation is distributed properly, making better use of the architectural features and memories. Hence the results after optimization with OpenCL are better in terms of execution speed and call throughput: the kernel completes a larger number of calls in less time, and memory and speed efficiency is achieved.

Table. 2 Encoding results on GPU

                  % Time    Seconds    System Calls    Call Name
Without OpenCL     16.24    24.13      289447503       x264_pixel_satd_8x4
With OpenCL        16.89    23.72      277162456       x264_pixel_satd_8x4
Gain               -0.65     0.41       12285047

In Table 2, the encoding results obtained on the GPU are shown. A similar approach was followed, with the GPU detected as the OpenCL device. The improvements are minor. This is a result of PCI bandwidth latencies and the frequent read and write operations between host and device: the benefit of local computation is lost, which reduces performance. The OpenCL kernels still complete more work per unit time, showing that memory is being used efficiently; hence efficiency is observed internally, but no large overall gain is achieved.

Table. 3 Encoding results on CPU with compiler optimizations

                         Frames    Real / Total
Without Optimizations    600       3 m 12.859 s
With Optimizations       600       0 m 44.387 s

In Table 3, the encoding results on the CPU with OpenCL and the compiler optimizations are shown. For the test video of 600 frames, the total encoding time without any optimizations is 3 minutes 12.859 seconds. After the optimizations, the time taken is only

44.387 seconds. Hence a gain of 2 m 28.472 s, a 4.3x speed-up, is obtained on the CPU.

Table. 4 Encoding results on GPU with compiler optimizations

                         Frames    Real / Total
Without Optimizations    600       2 m 19.206 s
With Optimizations       600       0 m 33.343 s

In Table 4, the results of encoding the video on the GPU with compiler optimizations are shown. A gain of 1 m 45.863 s, a 4.2x speed-up, is obtained on the GPU.

5. CONCLUSIONS

These are the critical conclusions drawn from the study. The encoding process in video codecs is best computed on CPU-based SMP architectures, where read and write latencies are lowest and the architectural benefits of memory and local on-device computation can be exploited. Sub-macroblock- and macroblock-level parallelization, combined with compiler optimizations, is the way to distribute encoder computation over multicore platforms; this is the prime area in which optimized and efficient computation can be achieved in complex video encoders working on high-resolution, high-frame-rate video.

Encoders need to be aware of the underlying architecture and hardware and use it effectively by means of the OpenCL framework. This allows the compute- and resource-intensive parts of an application to be encapsulated in kernel loops and executed over the compute units of a device, increasing performance and efficiency dramatically. GPU computation is overkill for a 2D video encoding process; for 3D video, however, the architectural features of GPUs will assist in rendering graphics, animations and other geometric effects.

The use of compiler optimization flags and assembly instructions across the entire x264 library resulted in a 4.3x improvement on the CPU and 4.2x on the GPU platform. It can be concluded that, alongside multithreading with OpenCL, the traditional approach of compiler-level optimization remains important, as it addresses the core performance of the application.

6. REFERENCES

[1] Donaldson, A., Amesfoort, A. (2011), ‘The Impact of Diverse Memory Architectures on Multicore Consumer Software’. ACM, MSPC’11, California.

[2] Barak, A., Ben-Nun, T., Levy, E., Shiloh, A. (n.d.), ‘A Package for OpenCL Based Heterogeneous Computing on Clusters with Many GPU Devices’. The Hebrew University of Jerusalem, 91904 Israel.

[3] Puri, A., Chen, X., Luthra, A. (2004), ‘Video coding using the H.264/MPEG-4 AVC compression standard’, Signal Processing: Image Communication (19), 793–849.

[4] Elliott, G., Sun, C., Anderson, J. (n.d.),‘Real-Time Handling of GPU Interrupts in LITMUS’ Department of Computer Science, University of North Carolina.

[5] Wiegand, T., Sullivan, G., Bjontegaard, G., Luthra, A. (2003), ‘Overview of the H.264/AVC Video Coding Standard’, IEEE Transactions on Circuits and Systems for Video Technology.

[6] Du, P., Luszczek, P., Dongarra, J (n.d.),‘OpenCL Evaluation for Numerical Linear Algebra Library Development’, available from < http://icl.cs.utk.edu/news_pub/submissions/saahpc_10_cuda_vs_opencl.pdf > [10 October 2012].

[7] Fleming, K., Lin, C., Dave, N., Raghavan, A. (2008), ‘H.264 Decoder: A Case Study in Multiple Design Points’, MEMOCODE 2008, 6th ACM/IEEE International Conference, 165–174.

[8] Kirk, D., Hwu, W. (2010), Programming Massively Parallel Processors: A Hands-on Approach. Burlington: Morgan Kaufmann.