Abstract

This paper introduces a learned hierarchical B-frame coding scheme in response to the Grand Challenge on Neural Network-based Video Coding at ISCAS 2023. We specifically address three issues: (1) B-frame coding, (2) YUV 4:2:0 coding, and (3) content-adaptive variable-rate coding with a single model. Most learned video codecs operate internally in the RGB domain and focus on P-frame coding; B-frame coding for YUV 4:2:0 content is largely under-explored. In addition, while there have been prior works on variable-rate coding with conditional convolution, most of them fail to take content information into account. We build our scheme on conditional augmented normalizing flows (CANF). It features conditional motion and inter-frame codecs for efficient B-frame coding. To cope with YUV 4:2:0 content, two conditional inter-frame codecs are used to process the Y and UV components separately, with the coding of the UV components conditioned additionally on the Y component. Moreover, we introduce adaptive feature modulation in every convolutional layer, taking into account both the content information and the coding levels of B-frames to achieve content-adaptive variable-rate coding. Experimental results show that our model outperforms x265 and the winner of last year's challenge on commonly used datasets in terms of PSNR-YUV.

Method


Adaptive feature (AF) modulation adapts the feature distribution in every convolutional layer, in order to achieve variable-rate compression with a single model and content-adaptive coding. The AF modulation is placed after every convolutional layer in the motion and inter-frame codecs. As shown in the figure, it outputs channel-wise affine parameters, which are used to dynamically adjust the output feature distributions. Compared to previous works, our scheme has two distinctive features. First, we introduce the coding level $C$ of a B-frame as its contextual information to achieve hierarchical rate control. This is motivated by the fact that, with hierarchical B-frame coding, the reference quality of a B-frame varies with its coding level. The additional contextual information from the coding level allows greater flexibility in adjusting the bit allocation among B-frames. Second, our AF module incorporates a global average pooling (GAP) layer to summarize the input feature maps into a 1-D feature vector. As such, the AF module is able to adapt the feature distribution in a content-adaptive manner.
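
The sketch below illustrates one plausible realization of such an AF modulation layer in PyTorch: global average pooling summarizes the input feature maps, the pooled vector is concatenated with a conditioning vector (e.g., the coding level and a rate parameter), and a small MLP produces channel-wise scale and shift. All names, layer sizes, and the exact form of the conditioning vector are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of adaptive feature (AF) modulation, assuming a PyTorch setup.
# The conditioning vector `cond` is assumed to hold the (normalized) coding
# level and a rate/lambda index; its exact encoding is hypothetical.
import torch
import torch.nn as nn

class AFModulation(nn.Module):
    def __init__(self, channels: int, cond_dim: int = 2, hidden: int = 64):
        super().__init__()
        # GAP summarizes the input feature maps into a 1-D content descriptor.
        self.gap = nn.AdaptiveAvgPool2d(1)
        # Small MLP maps [pooled features, conditioning vector] to
        # channel-wise affine parameters (scale and shift).
        self.mlp = nn.Sequential(
            nn.Linear(channels + cond_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 2 * channels),
        )

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x:    (N, C, H, W) output of the preceding convolutional layer
        # cond: (N, cond_dim), e.g. [coding level, rate parameter]
        pooled = self.gap(x).flatten(1)                      # (N, C) content summary
        params = self.mlp(torch.cat([pooled, cond], dim=1))  # (N, 2C)
        gamma, beta = params.chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)            # (N, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        # Channel-wise affine modulation of the feature distribution.
        return x * (1.0 + gamma) + beta
```

In this reading, one such module would follow every convolutional layer of the motion and inter-frame codecs, so that both the content summary and the coding level steer the feature statistics and, in turn, the bit allocation.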

Results

The figure shows the rate-distortion comparison and a subjective quality comparison. First, our method outperforms x265 and the learned codec of last year's challenge winner by a large margin across all the datasets, which is attributed to the use of more efficient B-frame coding. Second, the proposed method is inferior to HM under the random access configuration, which represents a much stronger baseline for B-frame coding. However, our method shows better visual quality, with less color bias and fewer blocking artifacts, for some content.