MBS-NET: A MOVING-CAMERA BACKGROUND SUBTRACTION NETWORK FOR AUTONOMOUS DRIVING

Abstract: Background subtraction aims at detecting the salient background, which in turn provides regions of moving objects, referred to as the foreground. Background subtraction inherently uses temporal relations by including the time dimension in its formulation. Traditional background subtraction techniques require stationary cameras for learning the background. Stationary cameras provide semi-constant background images that make learning the salient background easier. Still cameras, however, are not applicable to moving-camera scenarios, such as vehicle-embedded cameras for autonomous driving. For moving cameras, due to the complexity of modelling a changing background, recent approaches focus on directly detecting the foreground objects in each frame independently. This treatment, however, requires learning all possible objects that can appear in the field of view. In this paper, we achieve background subtraction for moving cameras using a specialized deep learning approach, the Moving-camera Background Subtraction Network (MBS-Net). Our approach robustly detects a changing background in various scenarios and does not require training on foreground objects. The developed approach uses temporal cues from past frames by applying Conditional Random Fields as part of the neural network. Our proposed method performs well on the ApolloScape dataset (Huang et al., 2018) with videos of resolution 3384 × 2710. To the best of our knowledge, this paper is the first to propose background subtraction for moving cameras using deep learning.


INTRODUCTION
The primary goal of background subtraction is to find moving objects based on their differences from the salient background, which is learned from a stream of images. This task can be considered classification of each pixel as background or foreground, i.e., a pixel-wise binary semantic segmentation task. Generalizing this binary segmentation to more than two classes, also known as semantic background subtraction, has been shown to improve the performance of background subtraction. These methods aim to finally label pixels into a number of moving-object regions (Cioppa et al., 2020; Braham et al., 2017). Apart from background subtraction, another body of work directly detects the objects in each frame (Yokoyama and Poggio, 2005). These object detection methods are widely used in low-level computer vision tasks such as video surveillance, robotics and authentication systems. Modern object detection methods seek to locate object instances by learning predefined categories from images (Liu et al., 2020; Redmon et al., 2016; Redmon and Farhadi, 2017; Redmon and Farhadi, 2018). Considering that there can be thousands of different object categories, these methods cannot be generalized. Increasing the number of known object categories indefinitely increases not only the computational cost but also the complexity of the model used to learn these categories. In fact, for some problems, knowing the category of the object may not be important: in autonomous vehicle scenarios, obstacles on the road matter, and they can take any form from among the thousands of object categories one can think of. Hence, distinguishing between background and foreground becomes more important and is vital for autonomous driving. In particular, during autonomous driving, the vehicle cameras provide images whose background region is built up of the sky, buildings, lanes, trees and the road itself, among others.
One can arguably conjecture that the number of categories appearing in the background is significantly limited compared to that of object categories. Therefore, learning the background, and hence background subtraction, is a more efficient and feasible task. In this work, we model the background using convolutional neural networks (CNNs), which have been successfully applied to image segmentation, among other tasks, to model complex relationships between inputs and outputs (Vemulapalli et al., 2016).
Labeling pixels as background or foreground may introduce spurious regions and noisy labels, which can be reduced by imposing spatial and temporal regularization. Conditional Random Fields (CRFs) have generally been used for spatial regularization as a probabilistic graphical model (Lafferty et al., 2001). While other regularization solutions have been introduced in the past, such as Hidden Markov Models (HMMs) (Krogh et al., 2001) and stochastic grammars (Zhu et al., 2007), CRFs offer an undirected probabilistic graphical model that relaxes the strong causal dependence assumptions between the current frame and its previous adjacent frame(s). In this study, we introduce CRFs as temporal regularization to ensure that the background information learned in previous frames is carried over to the current CNN output. This is critical, as past data contains important temporal cues that help refine the current result, which in turn improves the background subtraction accuracy. Additionally, adopting CRFs as a CNN layer preserves the end-to-end nature of CNN-based approaches.
Another important consideration in background subtraction is the loss function being minimized. In a typical Gaussian Mixture Model, maximum likelihood estimation is used to fit the mixture parameters. Our main contributions are as follows:
- We apply Focal Loss to assign dynamic weights to all training samples, preventing easy examples from dominating the loss.
- We propose MBS-Net, which achieves impressive results on the benchmarks of the ApolloScape dataset. More specifically, we achieve 97.53% Mean IoU on background and 76.06% on foreground.
The rest of this paper is organized as follows. Section 2 reviews recent related work in background subtraction. Section 3 states the problem with current approaches. Section 4 illustrates the proposed MBS-Net in detail. Section 5 presents our experimental results and an ablation study of the proposed MBS-Net.

RELATED WORK
Background subtraction has been an active area of research for a long time. There are several baseline methods, such as the Gaussian Mixture Model (Zivkovic, 2004), Principal Component Analysis (Guyon et al., 2012) and its variants, Kernel Density Estimation (Mittal and Paragios, 2004), and Mean Shift (Piccardi, 2004). While these techniques have been used over the past two decades, they do not typically apply spatial and/or temporal regularization to the labeled pixels. Zamalieva et al. (Zamalieva et al., 2014) introduced motion, appearance, temporal and spatial regularization terms into the labeling cost, which they minimized using graph-cut. While their method works under nominal camera motion, it suffers under larger camera motions due to its optical flow estimation step, which requires small camera motion.
Aside from more traditional background subtraction algorithms, deep learning has also been used in more recent papers. These techniques build on the development of CNNs over the past decade. While there is a large body of work on object detection and tracking in the published literature, we cite only a few representatives of their categories as they relate to background subtraction. Considering that background subtraction provides regions of moving objects, one can use deep learning to detect and track the objects directly. Wang proposed object tracking aimed at predicting trajectories of multiple targets in video sequences. Girdhar (Girdhar et al., 2018) performs object detection in video by building on achievements in human detection and video understanding. As an alternative to object detection and tracking, one can consider semantic scene segmentation. In (Long et al., 2015; Ronneberger et al., 2015), the authors semantically segment an image, which provides enclosing boundaries of the objects in the scene as well as the clutter region, which can loosely be considered the background. While the goal is different, video scene parsing can also be considered a way of detecting objects in video. Among others, scene parsing can be performed based on optical flow estimation (Gadde et al., 2017; Kroeger et al., 2016), recurrent neural networks (Hochreiter and Schmidhuber, 1997; Fayyaz et al., 2016) and convolutional networks (Shelhamer et al., 2016).
For most training datasets, imbalance between easy and hard examples, as well as between negative and positive examples, is a common problem. Both issues arise in datasets usable for moving-camera background subtraction, where the background is mostly composed of sky and buildings. A common treatment is to modify the loss function so that it focuses on the harder training examples. In recent years, several approaches have been proposed to alleviate this imbalance problem, including Online Hard Example Mining (Shrivastava et al., 2016), the gradient harmonizing mechanism (Li et al., 2019), and Focal Loss introduced by Lin et al. (Lin et al., 2017). These loss functions were originally proposed for object detection, which suffers from severe imbalance, as only a few of millions of candidate proposals contain objects. We are, to our knowledge, the first to apply these approaches to the background detection problem. Experiments indicate that this benefits both the training process and the network's predicted output.
Our proposed MBS-Net modifies the BiSeNet network architecture (Yu et al., 2018). BiSeNet generates a one-stage output from two paths within the architecture, the spatial path and the context path, to preserve image size and provide a large receptive field. MBS-Net introduces three main modifications. First, we use Focal Loss during the training process, which assigns dynamic weights to all examples and makes hard examples dominant during training. Second, we introduce upsampling of the fused feature map(s) back to the original input size. Finally, we add CRFs to the network to achieve temporal regularization by sharing the labeling constraints of previous frames with the current CNN output.

PROBLEM STATEMENT
We seek to achieve background subtraction using vehicle-embedded cameras. As the vehicle moves, the camera sees a new scene and the background changes. The motion of the camera depends on the vehicle motion, which can be forward or backward while turning or going straight. These motion types provide geometric conditions on the types of images acquired.
In Figure 1, we show an example road scene acquired from a vehicle-mounted camera. The same figure also shows the background reference in gray, where the labeling criteria for background include sky, buildings, road, lane signs and trees. The remaining regions in the reference image, including pedestrians and vehicles, correspond to other objects, indicated as foreground. A sequence of frames acquired from the vehicle camera contains redundancy and temporal cues that provide important constraints on the solution. Fusing the temporal cues mainly introduces two advantages. First, temporal cues carry semantic information which, when utilized, improves background subtraction performance. Second, they regularize the labeling process and provide coherency that smooths the generated labels along the time axis. Semantic segmentation networks in the literature, however, do not consider regularization in the spatial and time domains. The vanilla application of these approaches to the background subtraction problem, where the acquired image sequence contains camera shake, creates many spurious regions that are incorrectly identified as background or foreground. Spurious labels arising for other reasons can likewise be removed using temporal information. Based on these advantages of temporal information, MBS-Net introduces temporal regularization into the background estimation step.

MOVING BACKGROUND SUBTRACTION NETWORK ARCHITECTURE
The proposed MBS-Net uses the convolutional neural network BiSeNet (Yu et al., 2018) as its backbone architecture. The use of CRFs introduces temporal regularization into background estimation and overcomes spurious regions and the effects of camera shake while the vehicle is in motion. MBS-Net also adopts Focal Loss to tackle the sample imbalance problem.

Convolutional Neural Network Architecture
The MBS-Net is built on the BiSeNet architecture. BiSeNet includes a spatial path, a context path, an attention refinement module and a feature fusion module. In this paper, temporal regularization is introduced by modifying this architecture. The structure of the MBS-Net architecture is illustrated in Fig. 2.
In this architecture, the spatial path preserves the spatial size of the original input image. It extracts feature maps that are 1/8 of the original image size by cascading three 2D convolution layers with stride 2. The spatial path therefore encodes spatial information with many details preserved in large feature maps. In contrast to the spatial path, the context path provides a large receptive field that captures sufficient context at the pixel level.
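As a concrete check of the 1/8 downsampling, the spatial size after each stride-2 convolution follows standard convolution arithmetic. The sketch below assumes 3×3 kernels with padding 1 (a common BiSeNet choice; the exact kernel sizes are our assumption):

```python
def conv_out_size(size, kernel=3, stride=2, padding=1):
    """Standard convolution output-size arithmetic."""
    return (size + 2 * padding - kernel) // stride + 1

def spatial_path_size(h, w, num_layers=3):
    """Cascade three stride-2 convolutions, as in the spatial path."""
    for _ in range(num_layers):
        h, w = conv_out_size(h), conv_out_size(w)
    return h, w

# A 240 x 400 input (the resolution used in our experiments)
# comes out at 1/8 of the original size.
print(spatial_path_size(240, 400))  # (30, 50)
```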
With respect to the trade-off between large receptive fields and high computational cost, we adopted the light-weight models Xception (Chollet, 2017) and MobileNet (Howard et al., 2017). Once the embedding is completed, the attention refinement module refines the extracted context by integrating spatial and contextual information. This step is followed by the feature fusion module, which integrates the two paths; the fused feature maps are then brought back to the input image size.

Conditional Random Fields
The architecture introduced in Fig. 2 contains a specialized fully connected layer operating on the time axis to ensure labeling consistency across the image sequence. This fully connected layer acts as a CRF that models the interaction between the current frame and a set of previous frames. In the context of deep learning, CRFs have been used to model spatial interactions between the input image and the output labels. While there are similarities, our use of CRFs is different: we model interactions along the time axis. This application requires changes to the model and the kernels used. Specifically, the kernels in our work become

k(i, j) = w^(1) exp( -|P_i - P_j|^2 / (2Θ_α^2) - |l_i - l_j|^2 / (2Θ_β^2) ) + w^(2) exp( -|P_i - P_j|^2 / (2Θ_γ^2) ),

where the first kernel models temporal interactions using the pixels (denoted as P) and past labels (denoted as l), and the second kernel models spatial interactions. The hyperparameters Θ_α, Θ_β, Θ_γ control the "scale" of the kernels and remain constant during training. One can observe that the larger these parameters are, the more likely the corresponding features are to be ignored. w^(1) and w^(2) are compatibilities, weighting the relative contribution of the two separate kernels: the larger a compatibility, the more its corresponding kernel is weighted.
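A minimal numpy sketch of one mean-field update with these two Gaussian kernels, on a 1-D toy signal, illustrates the mechanics. The kernel bandwidths, the weights and the 1-D layout are illustrative assumptions, not the trained values:

```python
import numpy as np

def gaussian_kernel(f_i, f_j, theta):
    """Gaussian compatibility between two feature values/vectors."""
    return np.exp(-np.sum((f_i - f_j) ** 2) / (2.0 * theta ** 2))

def mean_field_step(unary, positions, prev_labels,
                    w1=1.0, w2=1.0, theta_a=1.0, theta_b=0.5, theta_g=1.0):
    """One mean-field update of Q(x_i) given unary softmax scores.

    unary:       (N, C) class probabilities from the CNN
    positions:   (N,) pixel positions (toy 1-D layout)
    prev_labels: (N,) labels from the previous frame (the temporal cue)
    """
    n, c = unary.shape
    q = unary.copy()
    new_q = np.zeros_like(q)
    for i in range(n):
        msg = np.zeros(c)
        for j in range(n):
            if i == j:
                continue
            # temporal kernel: pixel position and previous-frame label
            k_t = w1 * gaussian_kernel(positions[i], positions[j], theta_a) \
                     * gaussian_kernel(prev_labels[i], prev_labels[j], theta_b)
            # spatial kernel: pixel position only
            k_s = w2 * gaussian_kernel(positions[i], positions[j], theta_g)
            msg += (k_t + k_s) * q[j]   # neighbors reinforce similar labelings
        scores = np.log(unary[i] + 1e-9) + msg
        e = np.exp(scores - scores.max())
        new_q[i] = e / e.sum()          # renormalize to a distribution
    return new_q
```

Pixels that were labeled alike in the previous frame exchange stronger messages, which is exactly the temporal smoothing the CRF layer is meant to provide.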

Focal Loss
Modern object detection has two main sub-branches: one-stage and two-stage approaches. In the two-stage approach, the first stage generates a sparse set of candidate proposals and the second stage classifies them. The one-stage approach performs both stages at once. One-stage object detection therefore faces an extreme candidate-location imbalance between object and non-object samples during training: such detectors evaluate 10^4 to 10^5 candidate proposals, but only a few of them contain objects. Similar to the dilemma faced by one-stage object detection, we also need to tackle a sample imbalance problem: in our training dataset, the ratio of background to foreground samples reaches roughly 10:1. Focal Loss was originally designed for one-stage object detection (Lin et al., 2017), which has a better trade-off between speed and accuracy compared with two-stage methods. Traditionally, the weighted cross-entropy (CE) loss function in (2) is applied in classification problems:

CE(p_k) = -α_k log(p_k),    (2)

where p_k (p_k ∈ [0,1]) represents the softmax probability of the sample belonging to its ground-truth class k and α_k (α_k ∈ [0,1]) specifies a predefined weight for class k. While α balances the contribution of background and foreground examples, all samples in the same class still carry the same significance, no matter how easy or hard they are. As training progresses, the easily classified pixels grow to constitute the majority of the loss and dominate the back-propagated gradients. This can impede or even halt learning. Lin et al. (Lin et al., 2017) proposed Focal Loss, which introduces dynamic weights for all samples to reshape the loss as:

FL(p_k) = -α_k (1 - p_k)^γ log(p_k),

where γ ≥ 0 is the focusing parameter. Focal Loss down-weights easy pixels as the probability p_k gets close to 1. For example, with γ=2, a pixel k classified with p_k=0.9 contributes only 1/100 of the loss it would under the CE loss, while the contribution remains almost unchanged as p_k → 0.
Therefore, by reshaping the CE loss, Focal Loss makes hard-to-classify pixels dominate the loss. In particular, Focal Loss is equivalent to the CE loss when γ=0.
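The two formulas above can be sketched directly; the per-pixel functions below use the stdlib only, and the default α and γ values are the common Focal Loss defaults rather than the values used in our training:

```python
import math

def cross_entropy(p_k, alpha_k=0.25):
    """Weighted cross-entropy for one pixel: CE = -alpha * log(p)."""
    return -alpha_k * math.log(p_k)

def focal_loss(p_k, alpha_k=0.25, gamma=2.0):
    """Focal Loss for one pixel: FL = -alpha * (1 - p)^gamma * log(p).

    p_k is the softmax probability of the ground-truth class;
    gamma=0 recovers the weighted cross-entropy loss.
    """
    return -alpha_k * (1.0 - p_k) ** gamma * math.log(p_k)

# An easy pixel (p=0.9) contributes about 1/100 of its CE loss when
# gamma=2, while a hard pixel (p=0.1) keeps most of its contribution.
print(focal_loss(0.9) / cross_entropy(0.9))   # ≈ 0.01
print(focal_loss(0.1) / cross_entropy(0.1))   # ≈ 0.81
```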

EXPERIMENTS
In our implementation, we introduced Xception39 into the Spatial Path of BiSeNet. Using this code with the other changes, including Focal Loss, the CRF temporal regularization layer and the deconvolution upsampling technique, we evaluate its performance on the ApolloScape road02_seg dataset, which is available from the ApolloScape website. It contains 25 snippets and 11435 continuous frames in total. We manually divided them into training, validation and testing datasets, which respectively contain 7923, 1200 and 2312 frames. All frames are finely annotated and have a resolution of 3,384 × 2,710, in which each pixel is annotated with one of 25 predefined labels organized into 8 groups. Section 5.2 presents accuracy and frame-per-second (fps) speed results on the ApolloScape testing dataset. Finally, Section 5.3 investigates the effect of the fully connected CRFs, Focal Loss and bilinear interpolation upsampling through an ablation study.

Implementation details
The ApolloScape datasets were first released by Baidu Research and contain 140K time-dependent images with corresponding semantic pixel-level labels. These datasets were collected in various cities in China over recent years, aiming to increase the variability and complexity of the urban street views (Huang et al., 2018). Frames are acquired one meter apart while the equipped vehicle keeps a velocity of 30 km/h. All frames in each snippet are time dependent. Considering that the goal of this paper is to detect background and foreground regions from images acquired by a moving camera, we fuse all classes into background and foreground: classes such as sky, building and road become background, denoted as 1, and everything else, including all moving objects, becomes foreground, labeled as 0.
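The class fusion above amounts to a simple id remapping. The sketch below illustrates it; the class-name set and the id-to-name mapping are hypothetical stand-ins for the actual ApolloScape label definition:

```python
import numpy as np

# Hypothetical subset of class names treated as background; the real
# id-to-class mapping follows the ApolloScape label definition.
BACKGROUND_CLASSES = {"sky", "building", "road", "lane", "tree"}

def fuse_labels(label_map, id_to_name):
    """Collapse a multi-class label map into a binary mask:
    background -> 1, foreground (everything else) -> 0."""
    fused = np.zeros_like(label_map, dtype=np.uint8)
    for class_id, name in id_to_name.items():
        if name in BACKGROUND_CLASSES:
            fused[label_map == class_id] = 1
    return fused
```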
In our tests, traditional mean subtraction and standard normalization are not used, because the batch normalization (Ioffe and Szegedy, 2015) layers normalize the feature maps inside the mini-batches. There are over 11k frames in our preprocessed ApolloScape dataset. This rich set of finely annotated frames removes the typical need for traditional data augmentation such as random flips and random crops. Therefore, we only employ sequential crop and resize operations, cropping the frame resolution from 3384 × 2710 to 960 × 1600 and resizing it to 240 × 400, to preserve object shapes, cut the computational cost and save GPU memory.
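The crop-then-resize pipeline can be sketched as follows. The center placement of the crop window and the nearest-neighbor resize are simplifying assumptions; since 960/240 = 1600/400 = 4, striding by 4 gives an exact nearest-neighbor downsample that preserves aspect ratio:

```python
import numpy as np

def crop_and_resize(frame, crop_hw=(960, 1600), out_hw=(240, 400)):
    """Center-crop a frame, then downsample it to out_hw.

    Assumes crop_hw is an integer multiple of out_hw, so strided
    slicing acts as a nearest-neighbor resize.
    """
    h, w = frame.shape[:2]
    ch, cw = crop_hw
    top, left = (h - ch) // 2, (w - cw) // 2
    crop = frame[top:top + ch, left:left + cw]
    sy, sx = ch // out_hw[0], cw // out_hw[1]
    return crop[::sy, ::sx]

frame = np.zeros((2710, 3384, 3), dtype=np.uint8)  # one ApolloScape frame
print(crop_and_resize(frame).shape)  # (240, 400, 3)
```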
Using this dataset, we implemented the MBS-Net architecture, which contains three convolutional layers with stride 2 in its Spatial Path and a pretrained Xception39 model in its Context Path. The model uses the Attention Refinement Module and the Feature Fusion Module (FFM) to refine and fuse the feature maps generated by the two paths. The output of the FFM is 1/8th of the input image size. Bilinear interpolation is then used to enlarge the output map back to the original image size. Finally, a CRF layer is applied as a temporal regularization layer within MBS-Net, modeling interactions between the current frame t and the previous n frames. In the experiments, considering the short-time dependency and long-time independency of video frames, we set n=1 and repeat the boundary frames within each snippet. Note that the CRF layer is only activated during testing.
Our implementation uses the Adam optimizer with initial learning rate η0 = 3 × 10^-3, and applies a step-wise decay learning-rate schedule during training, where the learning rate decays by a factor of 0.9 every 2 epochs: η = η0 · 0.9^(⌊n/2⌋), with n the epoch index. The Focal Loss is initialized with α0=0.75, α1=0.25 and γ=1. The mini-batch size is set to 8 due to GPU memory limitations.
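The step-wise schedule above reduces to a one-line function; the sketch below reproduces the decay rule (the equivalent of PyTorch's `StepLR` with step_size=2, gamma=0.9):

```python
def step_wise_lr(epoch, base_lr=3e-3, decay=0.9, step=2):
    """Learning rate at a given epoch: base_lr * decay^(epoch // step)."""
    return base_lr * decay ** (epoch // step)

# Decays by a factor of 0.9 every 2 epochs.
print(step_wise_lr(0))  # 0.003
print(step_wise_lr(2))  # ≈ 0.0027
print(step_wise_lr(3))  # same as epoch 2: the rate is held for 2 epochs
```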
We note that training and testing implementations are conducted with PyTorch on NVIDIA Titan V.

Results
The computational bandwidth of autonomous vehicles is constrained by the other tasks the vehicle performs every second. Hence, speed becomes a key factor in algorithm evaluation. Aside from the quantitative comparisons, for this reason we conducted experiments comparing different backbone architectures, shown in Table 2. In our tests, the fastest results were obtained at 305 fps using ResNet18 as the base model.
In Table 2, we compare the speed of MBS-Net with several popular base models. We count the total MBS-Net parameters under each base model, as well as its speed with and without the CRF layer activated. All experiments are conducted on an NVIDIA Titan V. For all experiments in Table 2, the input image has a resolution of 240 × 400 for fair comparison. In this speed experiment, we do not apply any loss function or evaluation metric, to simulate real-scene practice.
Aside from achieving high throughput, we also achieve state-of-the-art accuracy in the quantitative analysis. Among the ResNet base-model variants, we pick ResNet50, as it outperformed the others in our experiments. We also tested GoogleNet, MobileNet and Xception39 as part of the MBS-Net architecture, and selected Xception39 for our final design. For a fair comparison, we test the ResNet50 and Xception39 base-model MBS-Nets on the above-mentioned test dataset and compute Mean IoU with the CRF layer activated and deactivated, as shown in Table 3.
In Table 3, we assess the accuracy of the two best-performing base models, ResNet50 and Xception39. The Mean IoU of background and foreground is computed with the CRF layer activated and deactivated under the two MBS-Net base models. It can be observed that the CRFs significantly improve foreground detection for the Xception39 architecture (highlighted row), while only slightly improving foreground detection for the ResNet50 architecture. This is because the ResNet50-based MBS-Net is already more powerful at detecting boundaries between foreground and background, and the CRFs, as a temporal regularization approach, contribute mainly to boundary regularization in the same way.

Table 3. Accuracy Analysis

Ablation study
CRFs are a necessary part of MBS-Net; they were originally used for image semantic segmentation without considering temporal regularization. The improvement is large: CRFs in MBS-Net improve the performance from 68.35% to 76.06% on foreground and from 96.64% to 97.53% on background, as shown in the highlighted row of Table 3. Qualitatively, these results are illustrated in Fig. 3.

Our final ablation study is on bilinear interpolation upsampling. The upsampling layer is designed to increase the resolution of the fused feature maps back to that of the original input image. Existing approaches include bilinear interpolation, pooling-indices memorization and deconvolution, among others. Different from the other two approaches, pooling-indices memorization requires sharing the pooling indices from the encoder feature map(s) with the corresponding feature map(s) in the decoder (Badrinarayanan et al., 2017). As stated in Section 4.1, the Spatial Path (SP) in MBS-Net cascades three Conv+BN+ReLU blocks with stride 2, which downsamples images to 1/8th of the input size. Hence, pooling-indices memorization is not a viable alternative, because the pooling indices are lost. Here we mainly compare the performance of bilinear interpolation and deconvolution, as shown in Table 4. Bilinear interpolation outperforms the deconvolution approach in both speed and accuracy without introducing additional parameters.
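The "no additional parameters" point can be made concrete: a learned deconvolution carries a weight tensor, while bilinear interpolation is a fixed operation. The sketch below counts the parameters a deconvolution layer would add, assuming a 4×4 kernel (a common choice for ×2 upsampling; the channel counts are illustrative):

```python
def deconv_params(in_ch, out_ch, kernel=4, bias=True):
    """Parameter count of a ConvTranspose2d-style layer:
    weights (in * out * k * k) plus an optional bias per output channel."""
    return in_ch * out_ch * kernel * kernel + (out_ch if bias else 0)

def bilinear_params():
    """Bilinear interpolation is a fixed operation: no learned parameters."""
    return 0

# Upsampling a 2-class score map from, e.g., 128 feature channels:
print(deconv_params(128, 2))  # 4098
print(bilinear_params())      # 0
```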

CONCLUSIONS AND FUTURE WORK
In this paper, we introduced MBS-Net, which modifies an existing semantic segmentation CNN architecture by including additional steps and layers to ensure temporal regularization is performed in the background labeling process. The temporal regularization step, combined with spatial regularization, has been tested on the ApolloScape benchmark dataset and is shown to achieve good results. We apply Focal Loss, which reshapes the cross-entropy loss to focus on hard-to-learn examples during training. We also designed an ablation study to investigate the efficacy of these components, showing that MBS-Net achieves state-of-the-art accuracy and speed. Incorporating reward mechanisms such as those used in reinforcement learning is ongoing research.