Abstract

This paper introduces a learned hierarchical B-frame coding scheme in response to the Grand Challenge on Neural Network-based Video Coding at ISCAS 2023. We specifically address three issues: (1) B-frame coding, (2) YUV 4:2:0 coding, and (3) content-adaptive variable-rate coding with a single model. Most learned video codecs operate internally in the RGB domain and focus on P-frame coding; B-frame coding for YUV 4:2:0 content is largely under-explored. In addition, while there have been prior works on variable-rate coding with conditional convolution, most of them fail to take content information into account. We build our scheme on conditional augmented normalizing flows (CANF). It features conditional motion and inter-frame codecs for efficient B-frame coding. To cope with YUV 4:2:0 content, two conditional inter-frame codecs are used to process the Y and UV components separately, with the coding of the UV components conditioned additionally on the Y component. Moreover, we introduce adaptive feature modulation in every convolutional layer, taking into account both the content information and the coding levels of B-frames to achieve content-adaptive variable-rate coding. Experimental results show that our model outperforms x265 and the winner of last year's challenge on commonly used datasets in terms of PSNR-YUV.

Method


Adaptive feature (AF) modulation adapts the feature distribution in every convolutional layer, in order to achieve variable-rate compression with a single model and content-adaptive coding. The AF modulation is placed after every convolutional layer in the motion and inter-frame codecs. As shown in the figure, it outputs channel-wise affine parameters, which are used to dynamically adjust the output feature distributions. Compared to previous works, our scheme has two distinctive features. First, we introduce the coding level $C$ of a B-frame as its contextual information to achieve hierarchical rate control. This is motivated by the fact that, with hierarchical B-frame coding, the reference quality of a B-frame varies with its coding level. The additional contextual information from the coding level allows greater flexibility in adjusting the bit allocation among B-frames. Second, our AF module incorporates a global average pooling (GAP) layer to summarize the input feature maps into a 1-D feature vector. As such, the AF module is able to adapt the feature distribution in a content-adaptive manner.
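
The sketch below illustrates one plausible realization of such an AF modulation layer in PyTorch: global average pooling summarizes the input feature maps, the pooled vector is concatenated with a conditioning vector (e.g., the coding level and a rate parameter), and a small MLP produces channel-wise scale and shift. All names, layer sizes, and the exact form of the conditioning vector are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of adaptive feature (AF) modulation, assuming a PyTorch setup.
# The conditioning vector `cond` is assumed to hold the (normalized) coding
# level and a rate/lambda index; its exact encoding is hypothetical.
import torch
import torch.nn as nn

class AFModulation(nn.Module):
    def __init__(self, channels: int, cond_dim: int = 2, hidden: int = 64):
        super().__init__()
        # GAP summarizes the input feature maps into a 1-D content descriptor.
        self.gap = nn.AdaptiveAvgPool2d(1)
        # Small MLP maps [pooled features, conditioning vector] to
        # channel-wise affine parameters (scale and shift).
        self.mlp = nn.Sequential(
            nn.Linear(channels + cond_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 2 * channels),
        )

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x:    (N, C, H, W) output of the preceding convolutional layer
        # cond: (N, cond_dim), e.g. [coding level, rate parameter]
        pooled = self.gap(x).flatten(1)                      # (N, C) content summary
        params = self.mlp(torch.cat([pooled, cond], dim=1))  # (N, 2C)
        gamma, beta = params.chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)            # (N, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        # Channel-wise affine modulation of the feature distribution.
        return x * (1.0 + gamma) + beta
```

In this reading, one such module would follow every convolutional layer of the motion and inter-frame codecs, so that both the content summary and the coding level steer the feature statistics and, in turn, the bit allocation.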

Results

The figure shows the rate-distortion comparison and a subjective quality comparison. First, our method outperforms x265 and the learned codec of last year's challenge winner by a large margin across all the datasets, which is attributed to the use of more efficient B-frame coding. Second, the proposed method is inferior to HM under the random access configuration, which represents a much stronger baseline for B-frame coding. However, our method shows better visual quality, with less color bias and fewer blocking artifacts, for some content.