The figure presents an overview of our proposed method. As shown, encoding a B-frame $x^{420}_t$ begins with the motion estimation network (MENet), which operates internally in the YUV 4:4:4 domain, estimating bi-directional optical flow maps $m_{t\to t-k}, m_{t\to t+k}$ with respect to the two reference frames $\hat{x}^{420}_{t-k}, \hat{x}^{420}_{t+k}$, respectively. The resulting flow maps are compressed jointly by the CANF-based conditional motion codec ($M, M^{-1}$), given the conditioning signals $m^p_{t\to t-k}, m^p_{t\to t+k}$ generated by the motion prediction network (MPNet). The decoded flow maps $\hat{m}_{t\to t-k}, \hat{m}_{t\to t+k}$ are then used for bi-directional motion compensation. In particular, we adopt two separate motion compensation networks (MCNet-Y, MCNet-UV) to synthesize the motion-compensated frames $\hat{x}^y_c, \hat{x}^{uv}_c$ for the Y and UV components, respectively. These motion-compensated frames serve as the conditioning signals for the conditional inter-frame coding of $x^y_t, x^{uv}_t$, yielding the reconstructed Y and UV components $\hat{x}^y_t, \hat{x}^{uv}_t$, respectively. Notably, for coding the UV components, we introduce the reconstructed Y component as an additional conditioning signal. The following sections elaborate on these proposed modules.
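To make the data flow concrete, the following is a minimal PyTorch-style sketch of the B-frame coding pipeline described above. The module names (MENet, MPNet, the conditional motion codec, MCNet-Y/UV, and the conditional inter-frame codecs) follow the text, but their exact interfaces, tensor layouts, and the `condition=` keyword are illustrative assumptions rather than the actual implementation.

```python
import torch
import torch.nn as nn


class BFrameCodec(nn.Module):
    """Sketch of the conditional B-frame coding pipeline (interfaces assumed)."""

    def __init__(self, menet, mpnet, motion_codec, mcnet_y, mcnet_uv,
                 inter_codec_y, inter_codec_uv):
        super().__init__()
        self.menet = menet                  # bi-directional motion estimation (YUV 4:4:4 internally)
        self.mpnet = mpnet                  # motion prediction -> conditioning flows
        self.motion_codec = motion_codec    # CANF-based conditional motion codec (M, M^-1)
        self.mcnet_y = mcnet_y              # motion compensation for the Y component
        self.mcnet_uv = mcnet_uv            # motion compensation for the UV components
        self.inter_codec_y = inter_codec_y      # conditional inter-frame codec for Y
        self.inter_codec_uv = inter_codec_uv    # conditional inter-frame codec for UV

    def forward(self, x_y, x_uv, ref_prev, ref_next):
        # 1. Estimate bi-directional optical flow w.r.t. the two reference frames.
        m_fwd, m_bwd = self.menet(x_y, x_uv, ref_prev, ref_next)

        # 2. Predicted flows from MPNet condition the motion codec.
        m_fwd_p, m_bwd_p = self.mpnet(ref_prev, ref_next)
        m_fwd_hat, m_bwd_hat = self.motion_codec(
            (m_fwd, m_bwd), condition=(m_fwd_p, m_bwd_p))

        # 3. Separate motion compensation for the Y and UV components.
        xc_y = self.mcnet_y(ref_prev, ref_next, m_fwd_hat, m_bwd_hat)
        xc_uv = self.mcnet_uv(ref_prev, ref_next, m_fwd_hat, m_bwd_hat)

        # 4. Conditional inter-frame coding; the reconstructed Y additionally
        #    conditions the UV codec.
        x_y_hat = self.inter_codec_y(x_y, condition=xc_y)
        x_uv_hat = self.inter_codec_uv(x_uv, condition=(xc_uv, x_y_hat))
        return x_y_hat, x_uv_hat
```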
Adaptive feature (AF) modulation adapts the feature distribution in every convolutional layer, in order to achieve variable-rate compression with a single model as well as content-adaptive coding. The AF module is placed after every convolutional layer in the motion and inter-frame codecs. As shown in the figure, it outputs channel-wise affine parameters that dynamically adjust the output feature distributions. Compared with previous works, our scheme has two distinctive features. First, we introduce the coding level $C$ of a B-frame as its contextual information to achieve hierarchical rate control. This is motivated by the fact that with hierarchical B-frame coding, the reference quality of a B-frame varies with its coding level. The additional contextual information from the coding level allows greater flexibility in adjusting the bit allocation among B-frames. Second, our AF module incorporates a global average pooling (GAP) layer to summarize the input feature maps into a 1-D feature vector. As such, our AF module is able to adapt the feature distribution in a content-adaptive manner.
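Below is a hedged PyTorch sketch of such an AF block, assuming the conditioning information (target rate point and coding level $C$) arrives as a single vector `cond`; the hidden width and the exact way the condition is embedded are illustrative assumptions.

```python
import torch
import torch.nn as nn


class AFModulation(nn.Module):
    """Sketch of adaptive feature modulation: content- and condition-dependent
    channel-wise affine transform applied after a convolutional layer."""

    def __init__(self, num_channels, cond_dim, hidden_dim=64):
        super().__init__()
        # GAP summarizes the input feature maps into a 1-D vector, which makes
        # the predicted affine parameters content-adaptive.
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.mlp = nn.Sequential(
            nn.Linear(num_channels + cond_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, 2 * num_channels),
        )

    def forward(self, feat, cond):
        # feat: (N, C, H, W) features from the preceding conv layer
        # cond: (N, cond_dim) vector encoding the rate point and coding level C
        summary = self.gap(feat).flatten(1)                       # (N, C)
        gamma, beta = self.mlp(torch.cat([summary, cond], dim=1)).chunk(2, dim=1)
        # Channel-wise affine modulation of the feature distribution.
        return feat * gamma.unsqueeze(-1).unsqueeze(-1) + beta.unsqueeze(-1).unsqueeze(-1)
```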
The figure shows the rate-distortion comparison and the subjective quality comparison. First, our method outperforms x265 and the learned baseline codec by a large margin across all the datasets, which is attributed to the use of more efficient B-frame coding. Second, the proposed method is inferior to HM under the random access configuration, which represents a much stronger baseline for B-frame coding. However, our method shows better visual quality, with less color bias and fewer blocking artifacts, on some content.