UNSUPERVISED DOMAIN ADAPTATION USING A TEACHER-STUDENT NETWORK FOR CROSS-CITY CLASSIFICATION OF SENTINEL-2 IMAGES

A machine learning algorithm in remote sensing often fails at inference on a data set from a different geographic location than the training data. This is because data from different locations follow different underlying distributions, owing to factors such as climate and culture. For a large-scale or global-scale task, this issue becomes relevant because it is extremely expensive to collect training data over all regions of interest. Unsupervised domain adaptation is a potential solution. Its goal is to train an algorithm in a source domain and generalize it to a target domain without using any labels from the target domain. In remote sensing, these domains can be associated with geographic locations. In this paper, we apply an unsupervised domain adaptation strategy based on a teacher-student network, the mean teacher model, to a cross-city classification problem in remote sensing. The mean teacher model consists of two identical networks, a teacher network and a student network. The objective function is a combination of a classification loss and a consistent loss. The classification loss works within the source domain (a city) and drives the classification task itself. The consistent loss works within the target domain (another city) and aims at transferring the knowledge learned from the source to the target. Two cross-city scenarios are set up. In the first, we train the model with the data of Munich, Germany, and test it on the data of Moscow, Russia. In the second, the training and testing cities are switched. For comparison, the baseline algorithm is a ResNet-18, which is also chosen as the backbone for the teacher and student networks in the mean teacher model. Over 10 independent runs, in the first scenario, the mean teacher model has a mean overall accuracy of 53.38%, which is slightly higher than that of the baseline, 52.21%. In the second scenario, however, the mean teacher model has a mean overall accuracy of 62.71%, about 5 percentage points higher than that of the baseline, 57.76%. This work demonstrates that it is worthwhile to explore the potential of the mean teacher model for solving domain adaptation issues in remote sensing.


INTRODUCTION
According to the United Nations (UN), more than 55.3% of the world's population lived in urban areas in 2018, and the number is still growing. Mapping urban regions globally provides strategic geographic information for human development. Current state-of-the-art global urban mapping delivers a global binary mask layer, urban vs. non-urban, such as the World Settlement Footprint (WSF) (Marconcini et al., 2019). However, binary maps cannot provide information within cities, such as the functionality and morphological structure of blocks. This information is highly relevant. For example, the evaluation of the Sustainable Development Goals (SDGs) relies on such geographic information within cities (Paganini et al., 2018; Melchiorri et al., 2019). Some efforts have been made toward providing detailed urban maps on the global scale (Demuzere et al., 2019; Yoo et al., 2019). All of those studies have pointed out one technical obstacle to achieving their goals: the cross-city classification challenge. For a global task, a classification algorithm is trained on data sets of a limited number of cities and is then applied to all cities globally. During inference, the accuracy of the trained algorithm is often not acceptable. This is because the data of different cities differ due to different climates, environments, cultures, and so on. The issue is highly relevant in remote sensing because it cannot be avoided whenever a large-scale or global-scale task is under consideration. From the methodological perspective, domain adaptation is one option for tackling it.
Domain adaptation in the context of this paper refers to training in a source domain and testing in a target domain for the same task, following the description in (Pan, Yang). For a global-scale remote sensing task, the target domain normally has no labeled data samples, or occasionally a few labeled ones. This work focuses on the former case, known as unsupervised domain adaptation. In the literature, some studies (Demuzere et al., 2019; Yoo et al., 2019) test the transferability of various algorithms in remote sensing tasks. However, to the best of our knowledge, only a few studies have developed strategies to improve the transfer capability of their algorithms. One of them is the work of Tong et al., which trains a deep network in the source domain, predicts labels of instances from the target domain with the trained network, selects reliable predictions in the target domain based on a defined criterion, and fine-tunes the trained network with the selected reliable samples. Their experiments have shown considerable improvements. However, the selection of reliable predictions in this framework requires human interaction and empirical experience, which might be an issue in practice when dealing with big data. Therefore, it would be more practical to have an end-to-end learnable solution. Fang et al. and Liu et al. have both applied a generative adversarial network strategy to domain adaptation for land cover mapping using very high resolution (VHR) optical aerial images. However, it is very expensive to access VHR optical aerial images with consistent quality or global coverage.
Figure 1. The structure of the mean teacher model implemented in this work, modified from (French et al., 2017).
In pursuing an end-to-end network solution to the domain adaptation problem with no labeled data available in the target domain, one model drew the authors' attention: the mean teacher model (Tarvainen, Valpola; French et al., 2017). The mean teacher model was originally designed as a temporal ensemble solution for semi-supervised learning (Tarvainen, Valpola), and was later modified in (French et al., 2017) to deal with domain adaptation problems. The modified version produced state-of-the-art classification accuracy on multiple benchmark data sets of unsupervised domain adaptation. In this paper, the authors investigate the performance of the mean teacher model on the domain adaptation problem in remote sensing. Section 2 introduces the cross-city problem, the mean teacher model, and the data used in this paper. Section 3 presents the experimental results and discusses them. Section 4 concludes this paper.

Problem statement
In this paper, the cross-city classification challenge is formulated as a domain adaptation problem. The annotated data of one city are treated as the source domain and are represented as (Xs, Ys), where Xs denotes the data and Ys the corresponding labels. The data of another city, without any annotation, are treated as the target domain and are represented as Xt. The task is to estimate the labels Yt of the data Xt in the target domain.

Mean teacher model
The structure of the mean teacher model is illustrated in Figure 1. It consists of a student network and a teacher network.
The student network takes the data of the source domain and performs supervised classification by minimizing the cross-entropy loss. Meanwhile, both the teacher network and the student network take the data of the target domain, and the consistent loss (mean squared error) encourages the two networks to provide identical outputs for the same sample. The consistent loss aims to bridge the gap between the source and target domains. Because the teacher network is a temporal ensemble version of the student network, it is more robust than the student network. Through the consistency objective, the teacher network therefore guides the student network in predicting the data samples of the target domain.
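The combined objective described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in this work: it assumes that the consistent loss is the mean squared error between the softmax outputs of the two networks, and the weighting factor `consistency_weight` is a hypothetical parameter that the text does not specify.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Classification loss for one labeled source-domain sample."""
    return -math.log(softmax(logits)[label])


def consistency_mse(student_logits, teacher_logits):
    """Consistent loss for one unlabeled target-domain sample.

    Assumption: the MSE is taken between class probabilities.
    """
    ps = softmax(student_logits)
    pt = softmax(teacher_logits)
    return sum((a - b) ** 2 for a, b in zip(ps, pt)) / len(ps)


def mean_teacher_objective(src_logits, src_labels,
                           tgt_student_logits, tgt_teacher_logits,
                           consistency_weight=1.0):
    """Return (total, classification, consistency) losses for one batch.

    consistency_weight is a hypothetical knob, not a value from the paper.
    """
    cls = sum(cross_entropy(l, y)
              for l, y in zip(src_logits, src_labels)) / len(src_labels)
    con = sum(consistency_mse(s, t)
              for s, t in zip(tgt_student_logits, tgt_teacher_logits))
    con /= len(tgt_student_logits)
    return cls + consistency_weight * con, cls, con
```

Only the student network would receive gradients from this objective; the teacher network is updated separately from the student weights.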
The teacher network is a temporal ensemble version of the student network because its weights are updated by an exponential moving average (EMA) of the weights of the student network:
$$W_t^{I_n} = \alpha \, W_t^{I_{n-1}} + (1 - \alpha) \, W_s^{I_n} \tag{1}$$
where $I_n$ is the $n$-th iteration, $\alpha$ is a weight value ranging from 0 to 1, and $W_t$ and $W_s$ are the weights of the teacher and student networks, respectively.
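The EMA update of the teacher weights can be illustrated with a short sketch. It assumes the standard mean teacher form, in which the new teacher weights are α times the previous teacher weights plus (1 − α) times the current student weights; the flat list of floats stands in for the real network parameters, and the value of α in the usage lines is hypothetical.

```python
def ema_update(teacher_weights, student_weights, alpha):
    """Return the teacher weights after one EMA step.

    teacher_weights, student_weights: equal-length lists of floats that
    stand in for the network parameters (an illustrative simplification).
    alpha: smoothing factor in [0, 1]; larger values change the teacher
    more slowly.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(teacher_weights) != len(student_weights):
        raise ValueError("teacher and student must have the same size")
    return [alpha * wt + (1.0 - alpha) * ws
            for wt, ws in zip(teacher_weights, student_weights)]


# Hypothetical usage: the teacher drifts slowly toward the student.
teacher = [1.0, 0.0]
student = [0.0, 1.0]
teacher = ema_update(teacher, student, alpha=0.9)
```

With α = 1 the teacher never changes, and with α = 0 it simply copies the student, which is one way to see why the choice of α matters.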
The advantages of this update strategy are:
1. The computation cost is lower than that of optimizing the teacher network directly.
2. The teacher network is a temporal ensemble of the student network, which makes it robust.
On the other hand, its disadvantages are as follows:
1. The performance of the teacher network depends heavily on the student network.
2. The teacher network brings little diversity to the consistent loss, although diversity is important for domain adaptation.
3. The setting of α is complicated.

Data
This paper investigates the mean teacher model on the cross-city classification of local climate zones (Stewart, Oke; Bechtel et al., 2015). The data set used in this paper is a part of the So2Sat LCZ42 dataset. It has about 400,000 pairs of Sentinel-1 and Sentinel-2 patches with annotated local climate zone labels. The Sentinel-2 patches of Moscow and Munich are used in this paper. The data patches have a size of 32 by 32 by 10. The ten channels are ten of the thirteen bands of the Sentinel-2 data; the first, the ninth, and the tenth bands are discarded. More details about the data can be found in (Schmitt et al., 2019). The numbers of samples per class are shown in Table 1. For the sake of simplicity, the same set of classes is kept for the two cities.
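The band selection described above can be made explicit with a small sketch. It rests on an assumption: "the first, the ninth, and the tenth bands" is read here as the Sentinel-2 band numbers B1, B9, and B10, with B8A counted as a separate band between B8 and B9. The names `SENTINEL2_BANDS` and `DROPPED_BANDS` are illustrative and do not come from the data set.

```python
# The thirteen Sentinel-2 MSI bands in their usual order.
SENTINEL2_BANDS = ["B1", "B2", "B3", "B4", "B5", "B6", "B7",
                   "B8", "B8A", "B9", "B10", "B11", "B12"]

# Assumption: the three discarded bands, read as band numbers.
DROPPED_BANDS = {"B1", "B9", "B10"}


def kept_band_indices(all_bands=SENTINEL2_BANDS, dropped=DROPPED_BANDS):
    """Indices of the bands that remain after dropping the given ones."""
    return [i for i, name in enumerate(all_bands) if name not in dropped]


def select_channels(pixel, indices):
    """Reduce one 13-value pixel spectrum to the kept channels."""
    return [pixel[i] for i in indices]


indices = kept_band_indices()
kept_names = [SENTINEL2_BANDS[i] for i in indices]
```

Under this reading, ten channels remain, which matches the patch depth of 10 stated in the text.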

Experiment setting
Two cross-city scenarios are set up in the experiments: (1) training on the data of Munich and testing on the data of Moscow, and (2) training on the data of Moscow and testing on the data of Munich. Three settings are compared in each scenario. First, the baseline is a ResNet-18 (He et al., 2016), which is trained on the source city and tested on the target city. Second, the mean teacher model is trained on the source city and tested on the target city. The student and teacher networks in the mean teacher model have the same structure as the ResNet-18. Last, a ResNet-18 is trained and tested with data of the target city to demonstrate the best achievable results. For evaluation, all experiments are repeated ten times to provide statistically robust mean accuracy.
Table 2. Cross-city classification results of the baseline network and the mean teacher model, given as overall accuracy, average accuracy, and kappa coefficient. The target-domain accuracies are the results of training and testing on the target city, which represent the best achievable results. The numbers are the mean values of ten independent experiments, with the standard deviation following the "±" symbol.
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B2-2020, 2020 XXIV ISPRS Congress (2020 edition)
For training, the Adam optimizer (Kingma, Ba) is applied with a learning rate of 1e-3. Every training procedure lasts for 100 epochs. The number of batches is 100.
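The "mean ± standard deviation" summary over repeated runs can be computed as sketched below. The text does not say whether the sample or the population standard deviation is reported; the sketch uses the sample version as an assumption, and the accuracy values are placeholders chosen for illustration, not results from this work.

```python
import statistics


def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) over repeated runs.

    Assumption: the sample standard deviation (n - 1 denominator) is used;
    statistics.pstdev would give the population version instead.
    """
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# Placeholder accuracies in percent, not taken from the experiments.
example_runs = [50.0, 52.0, 54.0]
mean_acc, std_acc = summarize_runs(example_runs)
summary = f"{mean_acc:.2f} ± {std_acc:.2f}"
```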

Discussion
Statistical outcomes. Table 2 shows the classification accuracy of the experiments in a statistical manner. For training with the data of Munich, the mean teacher model performs similarly to the baseline algorithm in terms of accuracy. For training with the data of Moscow, the mean teacher model improves the overall accuracy, the average accuracy, and the kappa coefficient by about 5 percentage points, 2.4 percentage points, and 0.05, respectively, compared to the baseline experiment. This is a considerable improvement. However, it is also noticeable that the standard deviations of the overall accuracy and the kappa coefficient are much larger than those of the baseline algorithm. This means that the mean teacher model is not stable in terms of those two indicators. Compared to training and testing in the target domain, a gap of more than 20 percentage points remains in both overall and average accuracy. This gap also illustrates the difficulty of the cross-city classification challenge.
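For reference, the three indicators can be derived from a confusion matrix as sketched below. The sketch assumes rows are reference classes and columns are predicted classes, and it takes the average accuracy to be the mean of the per-class accuracies; the text does not spell out either convention. The example matrix is hypothetical.

```python
def accuracy_metrics(confusion):
    """Return (overall accuracy, average accuracy, kappa coefficient).

    confusion: square list of lists; confusion[i][j] counts samples of
    reference class i predicted as class j (assumed convention).
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(n)) for j in range(n)]

    overall = correct / total
    # Classes without reference samples are left out of the average.
    per_class = [confusion[i][i] / row_sums[i]
                 for i in range(n) if row_sums[i] > 0]
    average = sum(per_class) / len(per_class)
    # Agreement expected by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    kappa = (overall - expected) / (1.0 - expected) if expected < 1.0 else 0.0
    return overall, average, kappa


# Hypothetical two-class confusion matrix.
oa, aa, kappa = accuracy_metrics([[8, 2], [1, 9]])
```

Because kappa subtracts the agreement expected by chance, it can differ noticeably from the overall accuracy when the class distribution is imbalanced.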
Individual outcome. Table 3 lists the classification outcomes of every repetition of the experiments. Considering the best results of all four experiments (marked in blue), the mean teacher model exhibits superior performance in both cross-city scenarios by a considerable margin. Meanwhile, the worst results of all four experiments (marked in red) suggest that the mean teacher model can also perform worse than the baseline. Therefore, we conclude that the mean teacher model is not stable for the task described in this paper.
Producer accuracy. Table 4 provides the number of training samples, the number of testing samples, and the mean producer accuracy. This table demonstrates the impact of an imbalanced number of samples. For the classes compact low-rise, scattered trees, and bush, scrub, the samples are limited in both cities. The mean producer accuracy of these classes is so low that they are effectively not classified. On the other hand, for low plants, dense trees, and large low-rise, which have a large number of samples, the producer accuracy is relatively high. Therefore, sample balance has a major impact. Table 4 also demonstrates that the adaptation difficulty is directional. For example, it is an easy task to recognize dense trees when adapting from Munich to Moscow, but it is hard the other way around.
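The producer accuracy of a class is the fraction of its reference samples that are classified correctly. A minimal sketch follows, assuming a confusion matrix with reference classes as rows and predicted classes as columns; the example matrix is hypothetical and only illustrates how a class with few samples can end up with a very low score.

```python
def producer_accuracy(confusion):
    """Per-class producer accuracy from a confusion matrix.

    confusion[i][j] counts samples of reference class i predicted as
    class j (assumed convention). Classes without reference samples
    yield None, since their producer accuracy is undefined.
    """
    result = []
    for i, row in enumerate(confusion):
        support = sum(row)
        result.append(row[i] / support if support > 0 else None)
    return result


# Hypothetical matrix: class 0 is frequent, class 1 is rare, class 2 is absent.
example = [[90, 10, 0],
           [4, 1, 0],
           [0, 0, 0]]
per_class = producer_accuracy(example)
```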
Confusion matrix. Figure 2 provides

CONCLUSION
This paper investigates the cross-city classification problem, in which a classification algorithm is trained on the data set of one city and deployed on the data set of a different city. The cross-city scenario is a fundamental setup for a global task, yet it is more challenging than conventional setups in which the training and testing data are located in the same region. This paper adapts an end-to-end unsupervised domain adaptation model, the mean teacher model, to the cross-city problem. The mean teacher model is trained on the data of Munich and tested on the data of Moscow for local climate zone classification. The cities for training and testing are then switched for a second experiment. For comparison, the baseline model is a ResNet-18. Each experiment is repeated ten times to provide statistical outcomes that are reliable for analysis. This work summarizes three findings from the experiments: (1) the mean teacher model has the potential to be a solution to the domain adaptation problem in remote sensing, because accuracy improvements have been found; (2) the mean teacher model is unstable, according to the standard deviation of the accuracy over repeated experiments; (3) the sample imbalance across classes and across the source and target domains can be problematic for domain adaptation in remote sensing.
Based on the findings of this work, future work will: (1) test the mean teacher model on a large data set; (2) develop a strategy to overcome the impact of imbalanced samples, e.g., data augmentation; (3) modify the mean teacher model for remote sensing tasks.