TOWARDS LEARNING LOW-LIGHT INDOOR SEMANTIC SEGMENTATION WITH ILLUMINATION-INVARIANT FEATURES

ABSTRACT: Semantic segmentation models are often affected by illumination changes and fail to predict correct labels. Although there has been a lot of research on indoor semantic segmentation, it has not been studied in low-light environments. In this paper we propose a new framework, LISU, for Low-light Indoor Scene Understanding. We first decompose the low-light images into reflectance and illumination components, and then jointly learn reflectance restoration and semantic segmentation. To train and evaluate the proposed framework, we propose a new data set, namely LLRGBD, which consists of a large synthetic low-light indoor data set (LLRGBD-synthetic) and a small real data set (LLRGBD-real). The experimental results show that the illumination-invariant features effectively improve the performance of semantic segmentation. Compared with the baseline model, the mIoU of the proposed LISU framework increases by 11.5%. In addition, pre-training on our synthetic data set increases the mIoU by 7.2%. Our data sets and models are available on our project website.


INTRODUCTION
Indoor semantic segmentation is a fundamental computer vision task, which assigns a semantic label to each pixel in an indoor scene image. In recent years, convolutional neural networks (CNNs) have made remarkable achievements in indoor semantic segmentation. However, most of the research focuses on the segmentation of normal-light scenes, while scene understanding in low-light indoor scenes is practical but has not been investigated much. When robots or first responders perform search tasks and use a light source to illuminate a dark room, the lack of illumination makes the colors and textures of the same objects look different as the view changes. These illumination variations reduce the robustness of CNN-based methods and yield inaccurate segmentation. It is nearly impossible to build a sufficiently large data set that covers all illumination settings and to train a robust CNN on it to learn complete representations of illumination changes. To overcome the negative influence of illumination changes on semantic segmentation, some research in the field of autonomous driving pre-processes RGB images and transforms them into illumination-invariant images based on spectral response (Alshammari et al., 2018). However, these transformation methods are sensitive to the saturation of images, so they are not always effective. This paper attempts to fill the gap of indoor semantic segmentation under low light, and explores how to take advantage of illumination-invariant features to improve segmentation accuracy. Inspired by intrinsic image decomposition methods based on Retinex theory (Land and McCann, 1971), an image I can be factorized as the product of a reflectance map R(I) and an illumination map S(I):

I = R(I) · S(I). (1)
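The Retinex relation above can be sanity-checked numerically: for any strictly positive illumination map, the reflectance is recovered by elementwise division. A minimal NumPy sketch (function names are illustrative, not the paper's implementation):

```python
import numpy as np

def compose(reflectance, illumination):
    # Retinex model: image = reflectance * illumination (elementwise product)
    return reflectance * illumination

def init_illumination(image):
    # A common initial estimate of illumination: per-pixel maximum over RGB
    return image.max(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
R = rng.uniform(0.1, 1.0, size=(4, 4, 3))    # reflectance in (0, 1]
S = rng.uniform(0.05, 0.3, size=(4, 4, 1))   # dim illumination
I = compose(R, S)                            # synthetic low-light image
R_recovered = I / S                          # exact when S is known and positive
assert np.allclose(R_recovered, R)
```

In practice neither R nor S is observed, which is why the decomposition has to be learned, here from paired low/normal-light images.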

The reflectance component is beneficial to the segmentation task because it represents the intrinsic properties of a scene and shows the original colors of objects, unaffected by illumination. As shown in Figure 1, we propose a novel framework dubbed LISU, which segments low-light indoor scenes by embedding illumination-invariant features into the segmentation branch. We decompose paired low/normal-light images of the same scene into reflectance and illumination components in an unsupervised manner. Then, we feed the decomposed components of the low-light image into a simple encoder-decoder network. This network has two separate decoders, learning reflectance restoration and semantic segmentation, respectively. The features from each task's decoder are linked to the decoder of the other task, so as to enable tighter joint learning of the two tasks. Our contributions can be summarized as follows:
• We propose a novel framework which exploits illumination-invariant features for robust low-light indoor semantic segmentation. As far as we know, we are the first to propose an end-to-end (without any pre-/post-processing) trainable framework for semantic segmentation in low-light indoor scenes.

Figure 2. The cascade architecture of our LISU network, which consists of LISU-decomp and LISU-joint. LISU-decomp learns the decomposition of each input image, and LISU-joint jointly learns the reflectance restoration and semantic segmentation.
• Since there is no available data set for low-light indoor scene understanding, we collect and annotate a large synthetic data set and a small real data set to train and evaluate our method. For each scene in the data sets, two different illumination settings are deployed. In addition, the corresponding segmentation maps and depth maps are provided.
• The experimental results show the effectiveness of introducing illumination-invariant features to the low-light semantic segmentation task. Our data sets and models are available on our project website.
The remainder of the article is organized as follows. Section 2 introduces the architecture and loss functions of our LISU framework. Section 3 presents the details of the data sets. Section 4 elaborates on the experimental results and discussion. Section 5 concludes the paper.

LISU: A FRAMEWORK FOR LOW-LIGHT INDOOR SCENE UNDERSTANDING
The upper part of Figure 2 shows our LISU framework for Low-light Indoor Scene Understanding. It consists of two sub-networks: LISU-decomp and LISU-joint. The former is responsible for decomposing an image into reflectance and illumination components. The second sub-network receives the reflectance map and illumination map of the low-light image output by LISU-decomp and performs joint learning of reflectance restoration and semantic segmentation. Our framework needs paired low/normal-light images for training, while only low-light images are needed at the inference stage. Next, we will introduce the detailed structures and loss functions of the two sub-networks.
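The data flow through the cascade can be sketched as follows. This is a shape-level illustration only: the functions below are hypothetical stand-ins for the two sub-networks, not their actual layers.

```python
import numpy as np

# Hypothetical stand-ins for the two sub-networks (names mirror the paper's
# modules, but the bodies are toy placeholders, not the learned networks).
def lisu_decomp(image):
    # Decompose an RGB image (H, W, 3) into reflectance and illumination.
    illumination = image.max(axis=-1, keepdims=True)          # (H, W, 1)
    reflectance = image / np.maximum(illumination, 1e-6)      # (H, W, 3)
    return reflectance, illumination

def lisu_joint(reflectance, illumination, num_classes=13):
    # The real network is a shared encoder with two decoders; here we only
    # illustrate the input/output contract: a 4-channel input, a restored
    # reflectance map, and per-pixel class scores.
    x = np.concatenate([reflectance, illumination], axis=-1)  # (H, W, 4)
    restored = x[..., :3]                                     # stub restoration
    logits = np.zeros(x.shape[:2] + (num_classes,))           # stub scores
    return restored, logits

low_light = np.random.default_rng(1).uniform(0.0, 0.2, (8, 8, 3))
R, S = lisu_decomp(low_light)
restored, logits = lisu_joint(R, S)
assert R.shape == (8, 8, 3) and S.shape == (8, 8, 1)
assert logits.shape == (8, 8, 13)  # 13-class labels, as in the data sets
```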

LISU-decomp: Intrinsic Image Decomposition
Our decomposition network is shown in the upper left (green box) of Figure 2. The long skip connections (gray arrows) concatenate features from the encoder to the corresponding decoder layers and enable the network to preserve low-level information and generate sharper results. Similar network structures have been used for intrinsic image decomposition (Dai et al., 2016, Rematas et al., 2016) and semantic segmentation (Ronneberger et al., 2015, Wu et al., 2018). Since the ground-truth reflectance and illumination maps of real images are not available, we adopt the unsupervised method proposed in (Zhang et al., 2019) and suppose that we have paired images of the same scene taken under low-light and normal-light conditions, [I_l, I_n]. We use the same network to decompose these two images into a reflectance map and an illumination map, namely [R(I_l), S(I_l)] and [R(I_n), S(I_n)], respectively. According to Eq. 1, the first part of our reconstruction loss can be defined as:

L_rec = ||R(I_l) · S(I_l) − I_l||_1 + ||R(I_n) · S(I_n) − I_n||_1, (2)

where ||·||_1 denotes the l1 norm. Ideally, the reflectance maps of the two images should be equal. Therefore, we can reconstruct the low-light image using the reflectance map of the normal-light image and the illumination map of the low-light image, and vice versa. We construct another part of the reconstruction loss as follows:

L_cross = ||R(I_n) · S(I_l) − I_l||_1 + ||R(I_l) · S(I_n) − I_n||_1. (3)

Following Retinex-based methods (Guo et al., 2016, Handa et al., 2016, Land and McCann, 1971), we use the maximum value over the RGB channels of the input as an initial estimation of the illumination map, and constrain the predicted illumination map to stay close to it:

S_0(I) = max_{c∈{R,G,B}} I_c,  L_init = ||S(I_l) − S_0(I_l)||_1 + ||S(I_n) − S_0(I_n)||_1. (4)

We also use a loss to constrain the smoothness of the illumination map. Our structure-aware smoothness loss is defined as:

L_smooth = ||∇S(I_l) · exp(−λ_g ∇I_l)||_1 + ||∇S(I_n) · exp(−λ_g ∇I_n)||_1, (5)

where ∇ represents the first-order derivative in both horizontal and vertical directions. λ_g is a weight term that ensures piece-wise smoothness, and it is set to 10 as in (Wei et al., 2018). Unlike (Wei et al., 2018), which used reflectance maps to weight the function, we take clues from the original low-light images to weight the loss function. The reason is that the reflectance map generated at this stage is noisy and not reliable enough to guide the decomposition of the illumination map. Our final loss function for the decomposition network is:

L_decomp = L_rec + λ1 L_cross + λ2 L_init + λ3 L_smooth, (6)

where λ1, λ2 and λ3 are weight factors.
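The reconstruction and structure-aware smoothness terms described above can be sketched in NumPy. The tensors and helper names are illustrative; the real losses operate on network outputs rather than on an analytic decomposition:

```python
import numpy as np

def l1(x):
    return np.abs(x).mean()

def gradients(x):
    # First-order difference magnitudes in horizontal and vertical directions.
    dh = np.abs(np.diff(x, axis=1))
    dv = np.abs(np.diff(x, axis=0))
    return dh, dv

def recon_loss(R, S, I):
    # || R * S - I ||_1 (one Eq. 2-style term; S is broadcast over channels)
    return l1(R * S - I)

def smooth_loss(S, I, lam_g=10.0):
    # Structure-aware smoothness: illumination gradients are down-weighted
    # where the original low-light image has strong edges.
    loss = 0.0
    for dS, dI in zip(gradients(S), gradients(I.mean(axis=-1, keepdims=True))):
        loss += l1(dS * np.exp(-lam_g * dI))
    return loss

rng = np.random.default_rng(0)
I_low = rng.uniform(0, 0.2, (8, 8, 3))
S_low = I_low.max(axis=-1, keepdims=True)       # initial illumination estimate
R_low = I_low / np.maximum(S_low, 1e-6)
assert recon_loss(R_low, S_low, I_low) < 1e-6   # exact decomposition reconstructs I
assert smooth_loss(S_low, I_low) >= 0.0
```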

LISU-joint: joint learning of reflectance restoration and semantic segmentation
We want to use the output of LISU-decomp to train our segmentation network, as the reflectance maps are not affected by illumination. However, the reflectance maps obtained by decomposing low-light images are seriously degraded, and they can be further restored using the corresponding reflectance maps of normal-light images (Zhang et al., 2019). Therefore, we extend the single-task segmentation network to jointly learn reflectance restoration.
The proposed joint learning network LISU-joint, shown in the lower part (blue box) of Figure 2, is also a U-shaped network similar to our decomposition network, but with a deeper encoder. The five-layer encoder takes the reflectance map and illumination map of a low-light image output by LISU-decomp as input and learns shared features. Then, two distinct decoders learn reflectance restoration and semantic segmentation, respectively. We enhance the correlation between the two tasks by linking the features (light blue arrows) from the two decoders together. As the training goes on, the semantic segmentation task benefits from the gradually restored illumination-invariant features. At the same time, the segmentation branch also provides semantic information to the restoration branch, and promotes it to produce better restoration at boundaries. Following (Zhang et al., 2019), the loss function for reflectance restoration is defined as:

L_restore = ||R̂(I_l) − R(I_n)||_2^2 + (1 − SSIM(R̂(I_l), R(I_n))) + ||∇R̂(I_l) − ∇R(I_n)||_2^2, (7)

where R̂(I_l) is the restored reflectance map, ||·||_2^2 denotes the l2 reconstruction loss (MSE), and SSIM(·, ·) measures the structural similarity (Wang et al., 2004) of two reflectance maps. The last term makes the restored reflectance map have textures close to the reference. For the semantic segmentation task, we use cross-entropy as the loss function:

L_seg = −Σ_i Σ_{c=1}^{M} y_i^c log p_i^c, (8)

where p_i^c denotes the probability of pixel i being predicted as class c, y_i^c is the one-hot ground-truth label, and M is the number of semantic categories. The combined objective function for LISU-joint is:

L_joint = L_restore + L_seg. (9)
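The two task losses can be illustrated with NumPy. The SSIM term is omitted for brevity; the cross-entropy and the gradient-based texture term follow the definitions above (function names are illustrative):

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def seg_cross_entropy(logits, labels, num_classes):
    # Mean per-pixel cross-entropy: -log p_i^c for the true class c of pixel i.
    p = softmax(logits)                          # (H, W, M)
    onehot = np.eye(num_classes)[labels]         # (H, W, M)
    return -(onehot * np.log(p + 1e-12)).sum(axis=-1).mean()

def restoration_loss(pred, target):
    # MSE plus a gradient (texture) term; the paper's loss adds an SSIM term
    # as well, which is left out of this sketch.
    mse = ((pred - target) ** 2).mean()
    gh = ((np.diff(pred, axis=1) - np.diff(target, axis=1)) ** 2).mean()
    gv = ((np.diff(pred, axis=0) - np.diff(target, axis=0)) ** 2).mean()
    return mse + gh + gv

H, W, M = 4, 4, 13
logits = np.zeros((H, W, M))
labels = np.zeros((H, W), dtype=int)
# Uniform predictions: cross-entropy equals log(M) for every pixel.
assert abs(seg_cross_entropy(logits, labels, M) - np.log(M)) < 1e-6
assert restoration_loss(np.ones((H, W, 3)), np.ones((H, W, 3))) == 0.0
```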

NEW DATA SETS OF LOW-LIGHT INDOOR SCENES
In this section we introduce the synthetic and real data sets, namely LLRGBD-synthetic and LLRGBD-real, which are used for understanding low-light indoor scenes. Figure 3 shows some sample images in our data sets.

LLRGBD-synthetic
We render realistic image pairs using a modified version of OppositeRenderer (McCormac et al., 2017, Pedersen, 2013), which is an open-source renderer based on the Nvidia OptiX ray tracing engine. For each rendering, the engine selects one layout from a total of 57 layouts, and randomly places relevant 3D objects from the ShapeNet repository (Chang et al., 2015) into the scene. Then the camera moves in the scene to generate random collision-free trajectories. We maintain a high-quality texture library to realize photorealistic rendering of scenes. In order to simulate an LED light source in the dark, we modify the source code of the renderer and install a white light source on the camera. The engine uses the z-buffer of OpenGL to render the depth map, and we obtain the 13-class (Couprie et al., 2013) labels by a simple object-class matching process.
We render 29K × 2 images at 640×480 resolution, and each rendering takes 2-3 minutes on an Nvidia Titan XP GPU. This data set is split into training and test sets with a 90%/10% ratio.

LLRGBD-real
This data set consists of 515 pairs of real low-light and normal-light images of indoor scenes, including offices, bedrooms, bathrooms, kitchens, and living rooms. These image pairs are taken with a fixed Intel RealSense D435i camera, an advanced RGB-D camera usually used as a SLAM sensor. Unlike methods that collect image pairs by changing shutter speed and ISO (Chen et al., 2018a, Guo et al., 2016, Wei et al., 2018), when capturing low-light images we ensure that the scenes are completely dark and use only one point light as illumination. The color temperature of this light is 5500±200 K and its luminous flux is about 800 lm. The purpose of using a light instead of controlling the camera parameters is that we aim to focus on real exploration missions that use a light as illumination in dark indoor environments. Then the white lights in the room are turned on and we take these pictures as normal-light images. We capture the images at 640×480 resolution, and we use the 13 categories defined in (Couprie et al., 2013) to label our data set. This data set is divided into 415 image pairs for training and 100 for testing. The corresponding depth map of each scene is collected by the infrared sensor of the RealSense D435i and aligned to the RGB images. We further refine these depth maps using the colorization method proposed in (Levin et al., 2004). Although depth maps are not used in this paper, we still release them as part of our data set. Researchers can use this data set to study other scene understanding tasks, such as depth estimation of low-light indoor scenes.

The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B2-2021, XXIV ISPRS Congress (2021 edition)

Metrics
We evaluate segmentation with three metrics: 1) pixel accuracy (PixAcc.), the percentage of correctly classified pixels; 2) mean accuracy (mAcc.), the average of the per-class pixel accuracies; 3) mean Intersection over Union (mIoU). For each class, IoU is the intersection between the predicted pixels and the true labels divided by their union; mIoU averages the IoU over all classes.
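All three metrics can be derived from a single confusion matrix. A minimal NumPy sketch (note that classes absent from both prediction and ground truth would need to be excluded in a full implementation):

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes):
    # Build a confusion matrix, then derive PixAcc., mAcc. and mIoU from it.
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for p, g in zip(pred.ravel(), gt.ravel()):
        cm[g, p] += 1
    tp = np.diag(cm).astype(float)
    pix_acc = tp.sum() / cm.sum()                        # correct / total pixels
    per_class_acc = tp / np.maximum(cm.sum(axis=1), 1)   # per-class accuracy
    m_acc = per_class_acc.mean()
    union = cm.sum(axis=1) + cm.sum(axis=0) - tp         # pred ∪ gt per class
    m_iou = (tp / np.maximum(union, 1)).mean()
    return pix_acc, m_acc, m_iou

# Toy check: a perfect prediction scores 1.0 on all three metrics.
gt = np.array([[0, 1], [1, 1]])
pix, macc, miou = segmentation_metrics(gt.copy(), gt, num_classes=2)
assert pix == 1.0 and macc == 1.0 and miou == 1.0
```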

Implementation details
We use PyTorch to implement our network, and train the models on an Nvidia Titan XP GPU with batch size 12. We use the Adam solver with (β1, β2) = (0.95, 0.999) as the optimizer. The initial learning rate is 0.001 and it is scheduled with polynomial decay with power p = 0.9. When training on LLRGBD-synthetic, the total number of epochs is set to 50, while when training on LLRGBD-real, it is set to 200. All training images and labels are down-sampled to 320 × 240, and no data augmentation is applied. The influence of the weights on the results is studied in Experiment II of Section 4.5, and we finally set λ1 = 0.01, λ2 = 0.1 and λ3 = 0.5 because this combination of weights achieves the best segmentation accuracy.
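The polynomial decay rule is simple to state; in PyTorch it can be attached to the optimizer via torch.optim.lr_scheduler.LambdaLR. A framework-free sketch of the schedule used here (initial learning rate 0.001, power 0.9):

```python
def poly_lr(initial_lr, epoch, total_epochs, power=0.9):
    # Polynomial decay: lr = lr0 * (1 - epoch / total_epochs) ** power
    return initial_lr * (1.0 - epoch / total_epochs) ** power

lrs = [poly_lr(0.001, e, total_epochs=50) for e in range(50)]
assert abs(lrs[0] - 0.001) < 1e-12               # starts at the initial rate
assert all(a > b for a, b in zip(lrs, lrs[1:]))  # monotonically decreasing
```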

Evaluation of the baseline model for low-light image segmentation
We first evaluate the segmentation branch of our LISU-joint, which serves as the baseline model in this paper. It is the bottom part of LISU-joint shown in Figure 2, and contains only the gray encoder features, the yellow decoder features and the red features from the encoder. We call the baseline model LISU-seg. In this experiment we directly use the original low-light images to train the segmentation models. Table 1 shows the results of our baseline model LISU-seg, and of two other encoder-decoder models, SegNet (Badrinarayanan et al., 2017) and U-Net (Ronneberger et al., 2015). We report the results trained on LLRGBD-synthetic and LLRGBD-real respectively, and our LISU-seg performs best on both data sets. SegNet does not perform well because the pixel values of low-light images are very small, and the use of max pooling loses a lot of information. Although U-Net also uses max pooling layers to reduce the sizes of feature maps, its U-shaped structure with long skip connections helps to maintain shallow convolutional features. LISU-seg has a structure similar to U-Net, but it has fewer encoder features and avoids max pooling.

Evaluation of segmentation using degraded reflectance maps
In this experiment we explore whether segmenting the reflectance maps of low-light images can achieve better results than using the original low-light images. However, the ground truth of reflectance is not available. Therefore, we first train our decomposition network LISU-decomp with paired low/normal-light images, and then feed the output three-channel reflectance maps of the low-light images to the segmentation networks. We still train and evaluate the three segmentation networks on the proposed data sets. Table 2 shows the results, and we can see that the accuracy of all three segmentation models on both the synthetic and real data sets has increased. It is worth noting that the reflectance maps used in the training come from our decomposition network and are severely degraded, but they are still helpful to semantic segmentation.

Evaluation of LISU for low-light images segmentation
Experiment I: In this experiment we evaluate the segmentation accuracy of LISU, which contains the joint learning network LISU-joint. The results in Table 3 show that the proposed joint learning network outperforms the baseline model on both data sets. Especially on the LLRGBD-real data set, there is a great improvement (mIoU 47.6 versus 36.1). The training of the segmentation task on the small data set benefits from the illumination-invariant features. The effectiveness of LISU is also reflected on LLRGBD-synthetic (mIoU 43.4 versus 39.5). Since this synthetic data set contains sequential images, the network has a better chance to learn the representations of the same objects under illumination variations. The strategy of joint learning further improves the learning ability of the network. Experiment II: In our method, LISU-decomp is very important because it provides the input of LISU-joint. In this experiment we evaluate the influence of the weights in Eq. 6 on segmentation accuracy. The results in Table 4 show that when we fix λ2 and λ3, a larger λ1 produces worse segmentation. The first combination of weights is chosen (λ1 = 0.01, λ2 = 0.1, λ3 = 0.5) because it has the highest mIoU and mean accuracy.

Evaluation of the effectiveness of pre-training on synthetic data

In this experiment we explore whether a model pre-trained on our synthetic data set can further improve the segmentation performance on small-scale real data. We first train LISU for 50 epochs on LLRGBD-synthetic, and then fine-tune the model on LLRGBD-real by freezing the encoders of LISU-decomp and LISU-joint. Compared with LISU without pre-training, the mIoU shown in Table 5 increases by 7.2%.

Segmentation with modified DeepLab v3+
Since our LISU-joint uses a simple encoder-decoder structure as the segmentation network, in principle it can be replaced by any network with a similar structure. In order to further verify the superiority of joint learning of semantic segmentation and reflectance restoration in low-light indoor scene segmentation, we evaluate our approach with the state-of-the-art DeepLab v3+ (Chen et al., 2018b) (DLv3p) as shown in Figure 4. The structure in the yellow box is the original DLv3p (without the green feature maps), and it consists of an encoder (ResNet50 backbone) and a decoder. We modify the original structure by adding an extra decoder to restore the reflectance (DLv3p-joint), and we remove the dropout layers for this decoder. The features from the two decoders are concatenated together (yellow and green feature maps). We use original low-light images from LLRGBD-real to train DLv3p, while for the modified structure, the input comes from the output of LISU-decomp. We encountered overfitting problems when directly training DLv3p on LLRGBD-real; thus, we adopted early stopping, and the results are shown in the first row of Table 6. The introduction of joint learning makes the training more stable. Note that only one layer in the decoders participates in feature exchange, which once again proves the effectiveness of our approach. We show qualitative results in Figure 5. Although we do not need normal-light images at inference time, we still show them here because some low-light images are too dark to be seen clearly.
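The feature exchange added to DLv3p amounts to concatenating same-resolution feature maps from the two decoders along the channel axis, so that each head also sees the other task's features. A shape-level sketch (channel counts are illustrative, not the actual DLv3p widths):

```python
import numpy as np

def exchange_features(seg_feat, restore_feat):
    # Concatenate same-resolution feature maps from the two decoders along
    # the channel axis, so each decoder head also sees the other task's
    # features before its final layers.
    assert seg_feat.shape[:2] == restore_feat.shape[:2]
    return np.concatenate([seg_feat, restore_feat], axis=-1)

seg_feat = np.zeros((32, 32, 64))      # illustrative segmentation-decoder features
restore_feat = np.zeros((32, 32, 64))  # illustrative restoration-decoder features
fused = exchange_features(seg_feat, restore_feat)
assert fused.shape == (32, 32, 128)
```

A single concatenation like this is the only point of feature exchange between the two decoders, which keeps the modification to the original network minimal.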

CONCLUSION
In this paper, we present a novel end-to-end trainable CNN framework that takes advantage of the illumination-invariant features for low-light indoor scene segmentation. We also present a new data set for understanding low-light indoor scenes. The experimental results on both synthetic and real data sets show the effectiveness of our approach.