INDOOR SEMANTIC SEGMENTATION FROM RGB-D IMAGES BY INTEGRATING FULLY CONVOLUTIONAL NETWORK WITH HIGHER-ORDER MARKOV RANDOM FIELD

Indoor scenes are characterized by abundant semantic categories, illumination changes, and occlusions and overlaps among objects, all of which pose great challenges for indoor semantic segmentation. In this paper, we therefore develop a method based on a higher-order Markov random field model for indoor semantic segmentation from RGB-D images. Instead of using the RGB-D images directly, we first train and apply a RefineNet model using RGB information only to generate the high-level semantic information. Then, the spatial location relationships from the depth channel and the spectral information from the color channels are integrated as a prior for a marker-controlled watershed algorithm to obtain robust and accurate visual homogeneous regions. Finally, the higher-order Markov random field model encodes the short-range context among adjacent pixels and the long-range context within each visual homogeneous region to refine the semantic segmentation. To evaluate the effectiveness and robustness of the proposed method, experiments were conducted on the public SUN RGB-D dataset. Experimental results indicate that, compared with using RGB information alone, the proposed method remarkably improves the semantic segmentation results, especially at object boundaries.


INTRODUCTION
Semantic segmentation is a fundamental problem in computer vision, which decomposes a scene into meaningful parts and assigns semantic labels to them (Wolf et al., 2015). Compared with its outdoor counterpart, indoor scene annotation is relatively difficult, since indoor scenes usually contain illumination variations, occlusions and overlaps among objects, significant appearance variations and imbalanced representations of object categories (Chu et al., 2017). Therefore, semantic segmentation of indoor scenes has attracted increasing interest.
In recent years, many methods for indoor semantic segmentation have been presented. Most previous studies rely primarily on hand-crafted features from both the color channels and the depth channel, which serve as input to a commonly used classifier for automatic classification. Silberman and Fergus (2011) developed a CRF-based model that combines a 3D location prior from the depth channel with features captured from both the depth and color channels for indoor scene segmentation. Ren et al. (2012) adopted a kernel-based framework to transform the pixel-level similarity within each super-pixel into a patch descriptor, which was then integrated with contextual information for labeling RGB-D images. Gupta et al. (2013) made effective use of depth information to optimize image segmentation and defined super-pixel features for automatic classification using a random forest classifier and a support vector machine. Müller and Behnke (2014) used a conditional random field, into which color, depth and 3D scene features were incorporated, for semantic annotation of RGB-D images. Unfortunately, these conventional methods usually consist of separate segmentation, feature extraction and classification stages, and their final results depend on the results of each stage (Husain et al., 2016).
With the success of convolutional neural networks (CNNs) in many applications, a large variety of CNN architectures, especially fully convolutional ones, have been developed in recent years to extract high-level semantic features for semantic segmentation in an end-to-end manner. He et al. (2017) developed a spatio-temporal pooling layer for combining contextual information derived from multi-view images for semantic image segmentation. Chu et al. (2017) integrated learnable constraint layers, which encode contextual regularization between neighboring pixels, with a deep convolutional segmentation network to enhance the semantic segmentation results of indoor scene images. More recently, inexpensive RGB-D sensors have proved to be a rich source of information for indoor scenes and can provide color and depth images in real time (Khan et al., 2014). To use the depth channel effectively, Höft et al. (2014) presented the histogram of oriented depth descriptor as input to a convolutional neural network. Lin et al. (2017a) proposed a context-aware receptive field and a multi-branch network model for segmenting RGB-D images. To sufficiently exploit contextual information, Li et al. (2017) used two-stream FCNs to learn the RGB and depth features separately and gradually fused these features from high level to low level for indoor scene semantic segmentation. Jiang et al. (2018) developed an encoder-decoder architecture that extracts RGB information and depth information separately and fuses the information over several layers for indoor semantic segmentation. By incorporating the depth information, spatial geometric information, which is more invariant to illumination changes and appearance, can be derived to improve semantic segmentation.
To address the issues raised by the state of the art in semantic segmentation of indoor scenes, we develop a method based on a higher-order Markov random field model for indoor semantic segmentation from RGB-D images. Because of illumination changes, occlusions and overlaps among objects in indoor scenes, the spatial location relationships from the depth channel and the spectral information from the color channels are integrated as prior information for a marker-controlled watershed algorithm to derive robust and accurate visual homogeneous regions, which encode the low-level visual features that complement the reconstruction of detailed boundaries. Moreover, to alleviate the blurry object boundaries caused by pooling operations, a higher-order Markov random field model is adopted to encode the short-range context among adjacent pixels and the long-range context within each visual homogeneous region, refining the semantic segmentation, especially at object boundaries.
The rest of this paper is organized as follows. Section 2 describes the proposed method in detail. Section 3 presents the experimental results and analysis for evaluating the proposed method. Section 4 concludes the paper with a discussion of future research considerations.

METHODOLOGY
In this paper, we develop a method based on a higher-order Markov random field (MRF) model, which combines the high-level semantic information derived from RefineNet with the low-level visual information captured by a marker-controlled watershed algorithm, for indoor semantic segmentation from RGB-D images. As shown in Fig. 1, the proposed method consists of the following steps: (1) initial semantic segmentation using RefineNet, (2) generation of visual homogeneous regions by combining color information and depth information, and (3) region-level label consistency based on the higher-order MRF model. As a result, the indoor scenes are interpreted into 38 classes. The key algorithms of the proposed method are described in more detail below.

Initial semantic segmentation using RefineNet
To date, numerous fully convolutional neural network (FCNN) architectures have been developed for semantic segmentation, such as U-Net (Ronneberger et al., 2015), SegNet (Badrinarayanan et al., 2017), PSPNet (Zhao et al., 2017) and DeepLab (Chen et al., 2017). To efficiently exploit all the information available along the down-sampling process when reconstructing the high-resolution prediction, these architectures adopt a variety of strategies, such as atrous convolutions (Chen et al., 2017) and skip connections (Ronneberger et al., 2015; Badrinarayanan et al., 2017). RefineNet (Lin et al., 2017b) effectively integrates low-resolution semantic features with fine-grained low-level visual features to generate high-resolution semantic feature maps, and adopts residual connections with identity mappings to address the vanishing-gradient problem during training (He et al., 2016); it achieved new state-of-the-art performance on seven public datasets. Thus, in this paper we use the trained RefineNet model to predict the initial semantic segmentation from RGB images alone. An illustration of the RefineNet architecture is presented in Fig. 2. For more details about the RefineNet architecture, please refer to Lin et al. (2017b).
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLII-4, 2018. ISPRS TC IV Mid-term Symposium "3D Spatial Information Science - The Engine of Change", 1-5 October 2018, Delft, The Netherlands.

Visual homogenous regions generated by combining color information and depth information
As mentioned above, incorporating depth information has been very successful in enhancing the performance of semantic segmentation (Höft et al., 2014; Husain et al., 2016; Lin et al., 2017a; Li et al., 2017; Jiang et al., 2018), since depth data provide the 3D spatial location relationships among objects (Lin et al., 2017a) and are insensitive to the illumination changes from which the color channels suffer. However, the data quality of depth sensors, measured by point precision, is limited (Khoshelham, 2012). For example, the random error of depth measurements increases drastically with increasing distance from the sensor, which inevitably causes semantic segmentation errors if the depth data directly serve as input to CNN architectures. In our implementation, the depth data are used only to assist the generation of visual homogeneous regions. Furthermore, abundant semantic categories, occlusions and overlaps among objects are common in indoor scenes, which easily results in over-segmentation when producing the visual homogeneous regions. A marker-controlled watershed segmentation algorithm is simple and intuitive, can be parallelized (Xu et al., 2011), and effectively avoids over-segmentation through the marker constraints. Therefore, we use a marker-controlled watershed segmentation algorithm that combines color information with depth information to efficiently and robustly derive a set of visual homogeneous regions from RGB-D images.
Marker-controlled watershed segmentation is a variant of the conventional watershed segmentation (Vincent and Soille, 1991) that addresses the over-segmentation caused by numerous potential but trivial regional minima. Watershed segmentation considers a gray-level image as a topographic surface, where the gray value of each pixel is interpreted as its altitude. Suppose a water source is placed in each regional minimum and the entire topography is flooded from below. When water from two sources (i.e., regional minima) is about to meet, a dam is constructed to prevent the merging. The flooding and dam construction process continues until only the dams are visible from above. These dams effectively segment the image into regions. Due to noise and quantization error (Parvati et al., 2008), over-segmentation is an intrinsic problem of watersheds. In our implementation, we constrain the watershed segmentation with a marker image that is generated through multiple morphological operations. As a result, each marker is associated with a region in the segmented image. Fig. 3 shows a simple example of the marker-controlled watershed algorithm based on the morphological operations, which consists of the following steps.
(1) Generation of the gray gradient image, depth gradient image and normal vector gradient image. The original RGB image and depth image are each transformed into the associated gradient image using the Sobel filter (Sobel et al., 1968). The original depth image is also used to produce a 3D point cloud based on the corresponding camera intrinsics, and the normal vector at each pixel is estimated to derive the normal vector gradient image. In these gradient images, high gradient magnitudes occur at object boundaries, while low gradient magnitudes occur inside objects. In the subsequent steps, we perform the watershed segmentation on the derived gradient images instead of the original image; the gradient images associated with the RGB image and the depth image are fused to provide redundant and complementary object boundaries from different perspectives.
(2) Generation of the marker image. Compared with the traditional opening and closing operators, opening by reconstruction and closing by reconstruction are less destructive and preserve object shape better (Lewis and Dong, 2012). Thus, the marker image is derived from the original RGB image using morphological operations, including opening by reconstruction and closing by reconstruction.
(3) The combined gradient image is modified using the minima imposition technique (Vincent, 1993), which forces regional minima to occur at the marker pixels of the marker image derived in Step (2).
(4) Marker-controlled watershed segmentation is performed on the modified gradient image.
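The steps above can be illustrated with a minimal, pure-Python sketch. It is not the authors' implementation: the markers are placed by hand rather than derived by opening and closing by reconstruction, the normal vector gradient is omitted, the fusion rule (an equally weighted sum) and the 4-connectivity are assumptions, and the function names `sobel_magnitude`, `fuse_gradients` and `marker_watershed` are hypothetical. Flooding starts only from the markers, which plays the role of minima imposition, and no explicit dam pixels are kept.

```python
import heapq


def sobel_magnitude(img):
    """Sobel gradient magnitude of a 2D list of numbers (borders replicated)."""
    h, w = len(img), len(img[0])

    def px(r, c):
        # Clamp coordinates so border pixels reuse their nearest neighbour.
        return img[min(max(r, 0), h - 1)][min(max(c, 0), w - 1)]

    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            gx = (px(r - 1, c + 1) + 2 * px(r, c + 1) + px(r + 1, c + 1)
                  - px(r - 1, c - 1) - 2 * px(r, c - 1) - px(r + 1, c - 1))
            gy = (px(r + 1, c - 1) + 2 * px(r + 1, c) + px(r + 1, c + 1)
                  - px(r - 1, c - 1) - 2 * px(r - 1, c) - px(r - 1, c + 1))
            out[r][c] = (gx * gx + gy * gy) ** 0.5
    return out


def fuse_gradients(gradients, weights):
    """Weighted sum of gradient images (the fusion rule is an assumption)."""
    h, w = len(gradients[0]), len(gradients[0][0])
    return [[sum(wt * g[r][c] for g, wt in zip(gradients, weights))
             for c in range(w)] for r in range(h)]


def marker_watershed(gradient, markers):
    """Flood the gradient surface from labelled markers (0 = unlabelled)."""
    h, w = len(gradient), len(gradient[0])
    labels = [row[:] for row in markers]
    heap, order = [], 0
    for r in range(h):
        for c in range(w):
            if labels[r][c] != 0:
                heapq.heappush(heap, (gradient[r][c], order, r, c))
                order += 1
    while heap:
        # Always grow the region from the lowest pixel reached so far.
        _, _, r, c = heapq.heappop(heap)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and labels[nr][nc] == 0:
                labels[nr][nc] = labels[r][c]
                heapq.heappush(heap, (gradient[nr][nc], order, nr, nc))
                order += 1
    return labels


# Toy scene: a vertical boundary visible in both intensity and depth.
gray = [[0, 0, 0, 10, 10, 10] for _ in range(4)]
depth = [[1, 1, 1, 3, 3, 3] for _ in range(4)]
markers = [[0] * 6 for _ in range(4)]
markers[1][0] = 1  # hand-placed marker for the left object
markers[1][5] = 2  # hand-placed marker for the right object

fused = fuse_gradients([sobel_magnitude(gray), sobel_magnitude(depth)], [0.5, 0.5])
regions = marker_watershed(fused, markers)
for row in regions:
    print(row)
```

Because exactly one region grows from each marker, the number of output regions equals the number of markers, which is how the marker constraint suppresses over-segmentation.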

Region-level label consistency based on higher-order MRF model
The down-sampling operations in CNN architectures, such as pooling layers, cause blurry boundaries in the semantic segmentation results. Recently, higher-order potentials have been incorporated into the MRF model to capture higher-level contextual information, with success in many applications (Woodford et al., 2009; Ren et al., 2015; Yang et al., 2018). In these models, visual homogeneous regions help to model long-range contextual information, which is particularly useful for obtaining object segmentations with fine boundaries (Kohli and Torr, 2009). Hence, in this section we use the higher-order MRF model (Kohli and Torr, 2009) to optimize the semantic segmentation by encoding the short-range contextual information among adjacent pixels and the long-range contextual information within each visual homogeneous region.
An MRF model (Geman and Geman, 1987) is defined over an undirected graph $G = (V, E)$, where $V$ denotes a set of vertices and $E$ represents a set of undirected edges between neighboring vertices. For image semantic segmentation, an observed image with $|V|$ pixels is denoted by a discrete random field, where each random variable is associated with a pixel. The goal is to infer the labeling $Y = \{y_i \mid i \in V\}$ of the image, where each variable $y_i$ is the label of pixel $i$ and takes a value from the set $\{1, 2, \ldots, L\}$, and $L$ is the number of classes. In computer vision, finding the optimal label configuration $Y^*$ can be naturally formulated as the minimization of the energy function in Eq. (5).

$$Y^* = \arg\min_{Y} En(Y), \qquad En(Y) = En_{unary}(Y) + \lambda_1 En_{pairwise}(Y) + \lambda_2 En_{region}(Y) \qquad (5)$$

where the first-order (unary) energy term $En_{unary}(Y)$ measures the disagreement between $Y$ and the observed data, the second-order (pairwise) energy term $En_{pairwise}(Y)$ measures the extent to which $Y$ is not piecewise smooth, the higher-order energy term $En_{region}(Y)$ measures the label consistency over visual homogeneous regions, and $\lambda_1$ and $\lambda_2$ are weighting parameters.
The unary term typically takes the form $En_{unary}(Y) = \sum_{i \in V} D_i(y_i)$, where $D_i(y_i)$ quantitatively measures the degree of "fit" between the label $y_i$ and the observed data. In this paper, the output of the softmax layer of the learned RefineNet architecture is used to measure the disagreement between the label $y_i$ and the observed data. As defined in Eq. (6), the larger the class posterior probability, the smaller the unary term.
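As a hedged illustration of how a softmax output can be turned into a unary term, the sketch below uses the negative log-probability, $D_i(y_i) = -\log P(y_i \mid x_i)$. This specific form is an assumption (a common choice that is consistent with the statement that a larger posterior gives a smaller unary term), and the function name `unary_energy` is hypothetical.

```python
import math


def unary_energy(softmax, labels, eps=1e-12):
    """Sum over pixels of D_i(y_i) = -log P(y_i | x_i) (assumed form).

    softmax: one list of class probabilities per pixel.
    labels:  one class index per pixel.
    eps:     floor that avoids log(0) for zero probabilities.
    """
    return sum(-math.log(max(p[y], eps)) for p, y in zip(softmax, labels))


# Two pixels, two classes: pixel 0 favours class 0, pixel 1 favours class 1.
probs = [[0.9, 0.1], [0.2, 0.8]]
agreeing = unary_energy(probs, [0, 1])     # labels follow the softmax output
disagreeing = unary_energy(probs, [1, 0])  # labels contradict the softmax output
print(agreeing, disagreeing)
```

The labeling that follows the network's most probable classes receives the lower energy, so minimizing this term alone reproduces the initial RefineNet prediction.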
To generate a locally continuous and globally optimal label configuration, the pairwise energy term $En_{pairwise}(Y)$ is generally defined as in Eq. (8),
where $x_i$ and $x_j$ denote the semantic feature vectors of pixels $i$ and $j$, respectively, derived from the learned RefineNet architecture.
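A minimal sketch of a pairwise term with this behavior is given below. It assumes a contrast-sensitive Potts form, $\sum_{(i,j) \in E} [y_i \neq y_j] \exp(-\beta \lVert x_i - x_j \rVert^2)$, which may differ from the exact expression of Eq. (8); the parameter `beta` and the function name `pairwise_energy` are hypothetical.

```python
import math


def pairwise_energy(features, labels, edges, beta=1.0):
    """Contrast-sensitive Potts penalty over neighbouring pixel pairs (assumed form).

    features: one feature vector per pixel.
    labels:   one class index per pixel.
    edges:    list of (i, j) index pairs of adjacent pixels.
    """
    total = 0.0
    for i, j in edges:
        if labels[i] != labels[j]:
            # Squared Euclidean distance between the two feature vectors.
            d2 = sum((a - b) ** 2 for a, b in zip(features[i], features[j]))
            # Similar features (small distance) give a penalty close to 1.
            total += math.exp(-beta * d2)
    return total


# Three pixels in a row: pixels 0 and 1 look alike, pixel 2 looks different.
features = [[0.0, 0.0], [0.1, 0.0], [3.0, 0.0]]
edges = [(0, 1), (1, 2)]

smooth = pairwise_energy(features, [0, 0, 0], edges)         # no label change
cut_similar = pairwise_energy(features, [0, 1, 1], edges)    # cut between 0 and 1
cut_different = pairwise_energy(features, [0, 0, 1], edges)  # cut between 1 and 2
print(smooth, cut_similar, cut_different)
```

Placing a label boundary between pixels with similar features is penalized far more than placing it between dissimilar ones, which pushes boundaries toward real feature discontinuities.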
As defined in Eq. (8), the smoothness penalty term is zero for neighboring pixels with the same label. For adjacent pixels with different labels, the smaller the distance between their feature vectors, the larger the smoothness penalty term. Consequently, the pairwise energy term $En_{pairwise}(Y)$ encodes the extent to which adjacent pixels belong to the same label.
To reconstruct the segmented objects with refined boundaries, the higher-order energy term $En_{region}(Y)$ is incorporated into the energy function of Eq. (5) to capture the long-range contextual information within each visual homogeneous region derived in Section 2.2. Although the combination of color and geometric information improves the generation of visual homogeneous regions, some inaccurate segmentations might still exist due to the complexity of indoor scenes. Thus, we use a Robust $P^n$ model (Kohli and Torr, 2009; Yang et al., 2018), as defined in Eq. (9), to capture the long-range contextual information. This model allows some pixels inside the same segmented object to take different labels and thereby avoids the over-smoothing caused by rigid consistency.
where $S$ denotes the number of visual homogeneous regions.
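The behavior of the Robust $P^n$ potential in Eq. (9) can be sketched as follows. This is a simplified version of the Kohli and Torr formulation in which the cost of a region grows linearly with the number of pixels that deviate from the region's dominant label and is truncated at a maximum cost; the per-label offsets are assumed to be zero, and the names `robust_pn_energy`, `q` and `gammas` are hypothetical stand-ins for the truncation threshold and the per-region maximum cost.

```python
from collections import Counter


def robust_pn_energy(labels, regions, q, gammas):
    """Simplified Robust P^n cost summed over visual homogeneous regions.

    labels:  one class index per pixel.
    regions: one list of pixel indices per region.
    q:       number of deviating pixels at which the cost saturates.
    gammas:  maximum cost of each region (e.g. set from its homogeneity).
    """
    total = 0.0
    for region, gamma_max in zip(regions, gammas):
        counts = Counter(labels[i] for i in region)
        # Pixels that do not take the dominant label of the region.
        n_deviating = len(region) - max(counts.values())
        # Linear growth, truncated so a few outliers are tolerated cheaply.
        total += min(n_deviating * gamma_max / q, gamma_max)
    return total


regions = [[0, 1, 2, 3], [4, 5, 6, 7]]
gammas = [1.0, 1.0]

consistent = robust_pn_energy([0, 0, 0, 0, 2, 2, 2, 2], regions, q=2, gammas=gammas)
one_outlier = robust_pn_energy([0, 0, 0, 1, 2, 2, 2, 2], regions, q=2, gammas=gammas)
mixed = robust_pn_energy([0, 1, 2, 3, 2, 2, 2, 2], regions, q=2, gammas=gammas)
print(consistent, one_outlier, mixed)
```

A single deviating pixel costs only a fraction of the maximum, whereas a strict consistency potential would charge the full cost as soon as any pixel disagrees; this is what keeps imperfect regions from over-smoothing the result.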

EXPERIMENTATION AND ANALYSIS
To evaluate the effectiveness and robustness of the proposed method, in this section, we performed both qualitative and quantitative analysis on the public SUN RGB-D dataset (Song et al., 2015).

Experimental data and evaluation criteria
The SUN RGB-D dataset is a scene understanding dataset of indoor scene images, which contains 10335 RGB and depth image pairs captured with different cameras. There are 37 semantic classes, and about 0.25% of the pixels are unannotated and do not belong to any of the 37 classes. Following Song et al. (2015), the whole dataset was divided into 5285 image pairs for training and 5050 image pairs for testing.
In this paper, we use five common evaluation criteria to measure the segmentation quality: the global accuracy, the class accuracy, the mean class accuracy, the intersection-over-union (IoU) score (Everingham et al., 2010) and the mean IoU. The global accuracy is the percentage of correctly classified pixels, i.e., the total number of true-positive pixels divided by the total number of ground-truth pixels. The class accuracy is the percentage of correctly classified pixels in a class $i$. The mean class accuracy is the sum of the class accuracies divided by the number of classes. The IoU extends the class accuracy of a class $i$ by additionally penalizing false positives. The mean IoU is the mean of the IoU over all classes.
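These criteria can be computed from a confusion matrix, as in the following sketch. It assumes the usual definitions, with the class accuracy taken as $TP_i / GT_i$ and the IoU as $TP_i / (GT_i + FP_i)$; the function name `segmentation_metrics` and the handling of classes with no pixels (scored as 0) are assumptions.

```python
def segmentation_metrics(conf):
    """Compute the five criteria from a confusion matrix.

    conf[g][p] is the number of pixels of ground-truth class g predicted as p.
    """
    n = len(conf)
    tp = [conf[i][i] for i in range(n)]                  # true positives
    gt = [sum(conf[i]) for i in range(n)]                # ground-truth pixels
    fp = [sum(conf[g][i] for g in range(n)) - conf[i][i]
          for i in range(n)]                             # false positives
    class_acc = [tp[i] / gt[i] if gt[i] else 0.0 for i in range(n)]
    iou = [tp[i] / (gt[i] + fp[i]) if gt[i] + fp[i] else 0.0
           for i in range(n)]
    return {
        "global_accuracy": sum(tp) / sum(gt),
        "class_accuracy": class_acc,
        "mean_class_accuracy": sum(class_acc) / n,
        "iou": iou,
        "mean_iou": sum(iou) / n,
    }


# Two classes with 10 ground-truth pixels each.
confusion = [[8, 2],
             [1, 9]]
metrics = segmentation_metrics(confusion)
for name, value in metrics.items():
    print(name, value)
```

In this toy case the IoU of each class is lower than its class accuracy because the false positives enter the denominator.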

Experimental analysis
As mentioned above, blurry boundaries are common in the semantic segmentation results of conventional fully convolutional network architectures because of the pooling operations. Furthermore, depth information can improve the performance of semantic segmentation, but how best to use it is still an open question. Thus, we propose a higher-order MRF framework that exploits the depth data and further optimizes the semantic segmentation derived from the existing RefineNet architecture, particularly at the boundaries among objects. First, to evaluate the effectiveness of the proposed method, we compared it with the conventional RefineNet architecture. Table I lists the performance comparisons in class accuracy, mean class accuracy, IoU and mean IoU between the conventional RefineNet architecture and the proposed method on the SUN RGB-D dataset. Fig. 4 shows some typical comparisons between the conventional RefineNet architecture and the proposed method. The experiments suggest that, for most indoor objects, the proposed method further optimizes the semantic segmentation results, especially at object boundaries (as shown in Fig. 4), and performs better than the conventional RefineNet, with a difference in class accuracy of 2.26% on average and in IoU of 2.33 on average. Second, to further evaluate the effectiveness of the proposed method, we compared it with other existing architectures. Table II lists the performance comparison between the proposed method and the other existing architectures. For the existing architectures in Table II, we report the best performances given in the corresponding papers (Chen et al., 2014; Kendall et al., 2015; Badrinarayanan et al., 2017; He et al., 2017; Li et al., 2017). Among all the methods, the proposed method achieved the best performance in global accuracy, mean class accuracy and mean IoU, with difference of 7.
illustrate that the proposed method succeeded in the improvement of semantic segmentations.

CONCLUSION
We developed a method based on a higher-order Markov random field model for indoor semantic segmentation from RGB-D images. In this paper, we used the depth information to enhance the performance of the watershed algorithm and combined the high-level semantic information with the long-range contextual information to improve the semantic segmentation results under the higher-order MRF framework. Although the experimental results suggest that the semantic segmentation results are improved to some extent, the final results primarily depend on the initial semantic segmentation derived from the fully convolutional network architecture and on the robust generation of visual homogeneous regions. Our future work will focus on further improving the performance of the fully convolutional network architecture itself and on enhancing the robustness of producing the visual homogeneous regions.
Table II Performance comparison between the proposed method and the other existing architectures. The best performances among all methods are marked in bold.
Fig. 2 An illustration of RefineNet.
Fig. 3 An example of the marker-controlled watershed algorithm. Different visual homogeneous regions in the output results are randomly rendered in different colors.
probability of pixels, $Q$ represents the threshold controlling the rigidity of the higher-order potentials, and $\gamma_{\max}$ is the homogeneity of each segmented object. $TP$ and $GT$ denote the total numbers of true-positive and ground-truth pixels, respectively; $TP_i$, $GT_i$ and $FP_i$ denote the numbers of true-positive, ground-truth and false-positive pixels in a class $i$, respectively.
Fig. 4 Typical comparisons between the conventional RefineNet architecture and the proposed method.