IMPROVED CLASSIFICATION OF SATELLITE IMAGERY USING SPATIAL FEATURE MAPS EXTRACTED FROM SOCIAL MEDIA

In this work, we consider the exploitation of social media data in the context of Remote Sensing and Spatial Information Sciences. To this end, we explore a way of augmenting a system for the classification of satellite images with information represented by geo-located feature vectors. For that purpose, we present a general data fusion framework based on a Convolutional Neural Network (CNN) and an initial examination of our approach using features from geo-located social media postings on Twitter together with Sentinel images. For this examination, we selected six simple Twitter features derived from the metadata, which we believe could contain information about the spatial context. We present initial experiments using geotagged Twitter data from Washington DC and Sentinel images showing this area. The goal of the classification is to determine local climate zones (LCZ). First, we test whether our selected feature maps are correlated with the LCZ classification at the geotag position. We apply a simple boosted tree classifier to this data. The result turns out to be better than a mere random classifier, which indicates that this data is correlated with LCZ. To show the improvement achieved by our method, we compare classification with and without the Twitter feature maps. In our experiments, we apply a standard pixel-based CNN classification of the Sentinel data and use it as a baseline model. After that, we expand the input by adding the Twitter feature maps within the CNN and assess the contribution of these additional features to the overall F1 score of the classification, which we determine from spatial cross-validation.


INTRODUCTION
Humans are very powerful sensors, capable of not only perceiving the world, but interpreting it as well. Social media give us the opportunity to access information gathered by humans and thus can be seen as an interface to this powerful sensor. The usage of this implicit, user-generated data can reduce the demand for explicit data, which has to be gathered with additional overhead and costs. One of the scenarios where user-generated data could help is the development of land use and land cover classification models.
The classification of land use and land cover has been an important research topic in remote sensing for several decades (Anderson, 1976), (Foody, 2002). Recently, a world-wide data set with a resolution of 30 m has been created (Chen et al., 2015). In the last decades, fine-grained classification of urban regions has shown great potential for understanding urban dynamics. To this end, Oke et al. proposed the local climate zones (LCZ) classification scheme, which has seen wide adoption (Stewart and Oke, 2012), (Stewart et al., 2014). In this scheme, 17 classes have been defined that jointly cover surface structure (height and density) and surface cover (pervious or impervious). Most of the publicly available LCZ data is provided by the World Urban Database and Access Portal Tools (WUDAPT). This organization provides rasterized LCZ data for a number of cities around the world. Ground truth for local climate zone classification has been acquired by individuals for a few cities in the WUDAPT project (Mills et al., 2015). Many efforts have been made to predict LCZ classes from various remote sensing sources (Yokoya et al., 2018), (Bechtel and Daneke, 2012), (Bechtel et al., 2015), most notably including the last edition of the IEEE GRSS Data Fusion Contest in 2017.
Social media as a novel modality is quite orthogonal to usual remote sensing data: it is very sparse, it is local, and it is mostly generated by humans and thereby exploits human knowledge, but it is noisy. While the text content of tweets is difficult to relate to the rather morphological LCZ classification, it can still provide hints on how to distinguish different classes that look similar from space. In this paper, however, we focus on aggregated tweet metadata including the number of tweets and the type of users. We aggregate this metadata on a grid coregistered with downsampled Sentinel imagery. In this work we first show that Twitter data contains implicit information about land use and land cover (Experiment 1) and then that this data improves the LCZ classification of a baseline Convolutional Neural Network (CNN) based on satellite imagery (Experiment 2). Note that our experimental setup is not designed to provide the best possible classification performance, as this always comes with the risk of strong overfitting and overly targeted models. Instead, a simple baseline classification system is analyzed with respect to its classification quality with and without Twitter data.
The remainder of this paper is structured as follows: In chapter 2 we give a brief overview of the background of this work, including the LCZ system, CNNs for different image-related tasks and Twitter data. We continue in chapter 3 with a more detailed description of the dataset we created and used. In chapter 4 we describe our methodological approach. In the subsequent two chapters 5 and 6 we focus on the experiments, including our evaluation strategy. The results of the experiments are presented in chapter 7, followed by a discussion in chapter 8. In the last chapter we conclude our work and give a brief outlook regarding possible future work.

BACKGROUND
Local climate zones are a land use / land cover classification system currently receiving high attention within the GI community. It has a focus on the built environment, which leads to its capability to improve climatic modeling (Stewart and Oke, 2012) and to provide a generalized and therefore comparable representation of urban architectural topographies (Taubenböck et al., 2012). LCZ is a raster data set in which each pixel is assigned a single class from the 17 classes, which are defined identically all over the world. Tables 1 and 2 show the definitions of the classes from (Stewart and Oke, 2012). The classes 1 to 10 describe areas with buildings; they are derived from parameters like the height of the buildings, the ratio of width and height, and materials. The classes A to F describe land cover and are derived from parameters like the height of trees, the density of trees and the soil type. Finally, class G represents water. This classification uses resolutions between 100 m and 200 m, because describing topographic layout and morphology at a finer resolution is not meaningful. A large amount of work in the context of LCZ is directed toward classifying single cities (Danylo et al., 2016), (Verdonck et al., 2017).
Two organisations are responsible for the collection of LCZ data and for securing its quality: GeoWiki and the World Urban Database and Access Portal Tools (WUDAPT). The data contribution is driven by crowdsourcing campaigns (See et al., 2013), (Foody et al., 2013) and games (Laso Bayas et al., 2016).
Convolutional Neural Networks (CNNs) are deep learning architectures, inspired by the visual perception mechanism of animals based on receptive fields in the visual cortex (Gu et al., 2018).
CNNs have been known for several decades, e.g. LeCun et al. (LeCun et al., 1990) showed in 1989 that CNNs can be used to classify images of hand-written digits. Since the mid 2000s CNNs have become more popular because earlier problems like the lack of training data and computational resources became less relevant (Gu et al., 2018). This was due on the one hand to the improvement of the methods and on the other hand to the usage of more efficient hardware. Nowadays CNN architectures reach state-of-the-art performance in many image-related tasks like image classification, which is the task of assigning the correct class to an image (or image patch), or semantic segmentation, where the goal is to assign the correct class to each pixel of the input image.
Twitter data is available in large amounts, but it is unequally distributed in space. There are areas where only sparse information is present, as opposed to other, mostly urban areas with a high amount of Twitter data. Among all tweets, the share of geotagged messages is rather low and also differs from country to country. Geotagged tweets have a very unclear relation between their content and their location: usually people use Twitter as a communication medium and not as a means to convey information about the environment. However, in emergency scenarios (such as earthquakes or flooding), people also talk about the event, and thus the content can be beneficially exploited (see e.g. (Dittrich et al., 2015)). Independent of these problems, there are several research topics on employing spatial information, like the automated derivation of features with spatial relevance (Sengstock and Gertz, 2012) or inferring home locations (Lin and Cromley, 2018) from geotagged tweets. Tweets contain several types of information, like text, image/video and location, which can be fused in order to identify spatial events such as floods (Feng and Sester, 2018). Nevertheless, simple meta information like user mention count and tag count can be used to distinguish between types of users (Guo and Chen, 2014).

DATASET
When we created the dataset to prove our hypothesis, the biggest limitation was the availability of LCZ data. This examination focuses on the local area around Washington DC, defined by the corresponding LCZ label raster provided by WUDAPT, visualized in Figure 1. We decided to investigate this area because it seemed suitable for our purpose due to the high amount of Twitter data as well as the availability of open satellite imagery. The LCZ label raster is used as ground truth in our experiments. Although the provided labels are not annotated by humans, but instead created by a classification algorithm, we assume the data to be error free. The LCZ classes are distributed in this area as shown in Table 3.
For the baseline of the CNN we take advantage of the publicly available satellite imagery gathered during the Sentinel-2 earth observation mission developed by ESA. In particular, we used rectified and georeferenced images of different spectral bands (Processing Level-1C). We decided to use the infrared, red, green and blue channels with a spatial resolution of 10 m as well as the bands 11 and 12 with central wavelengths of around 1610 nm and 2190 nm, respectively. These two bands have a spatial resolution of 20 m.
We generate six feature maps which are used in two experiments. These features were selected due to their potential relevance for LCZ classes. The feature maps are generated as follows:
1. Tweet count (TC) - number of geotagged tweets in a particular cell. This information could help distinguish between urban and rural areas. TC contains integer values from zero up to 19 thousand.
2. Mean text length (MTL) - mean number of characters per tweet. Text length is a simple feature which could help distinguish between private, casual tweets and business tweets of companies. More generally, this feature can indicate whether a tweet is spontaneous or well planned with optimized content density. Areas where companies place their offices tend to have a specific structure, which can be related to LCZ classes. This feature is the mean tweet length, which is bounded by the 280-character limit.
For the second experiment we prepare the Twitter feature maps for training by means of linear normalization to the interval from 0 to 1. The feature maps TC and MFC contain hot spots, with a few single cells containing very high values. Linear normalization of such a feature map would yield very small normalized values in most cells; we therefore clip the values above the 95th percentile.
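The clipping and normalization step can be illustrated with a minimal sketch. The function name `clip_and_normalize` and the toy raster are our own assumptions for illustration and are not taken from the original implementation; only the 95th percentile and the target interval from 0 to 1 come from the text.

```python
import numpy as np


def clip_and_normalize(feature_map, upper_percentile=95.0):
    """Clip values above a percentile, then rescale linearly to [0, 1]."""
    fmap = np.asarray(feature_map, dtype=float)
    # Cap hot-spot cells at the chosen percentile of all cell values.
    cap = np.percentile(fmap, upper_percentile)
    clipped = np.minimum(fmap, cap)
    lo, hi = clipped.min(), clipped.max()
    if hi == lo:
        # Constant map: avoid division by zero.
        return np.zeros_like(clipped)
    return (clipped - lo) / (hi - lo)


# Hypothetical 10 x 10 tweet-count raster with one extreme hot spot.
tc = (np.arange(100) % 5).reshape(10, 10)
tc[0, 0] = 19000
tc_norm = clip_and_normalize(tc)
```

Without the clipping, the single hot-spot cell would push all other cells close to zero after normalization; with it, the remaining value range is spread over the whole interval.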
The second experiment uses a Convolutional Neural Network for a pixel-wise (cell-wise) prediction of LCZ classes based on different bands of satellite imagery and the six Twitter feature maps.
In order to investigate the influence of the Twitter data we train and evaluate two classification models.The first one, referred to as baseline model, infers the LCZ classes based on the satellite images only.The second version of the model additionally uses the Twitter data and is referred to as augmented model.

FIRST EXPERIMENT
In the first experiment we use the six feature maps described earlier as input. The input data has the following structure for a single entry:

TC | MTL | MFC | MT | MTC | MMC || LCZ class
The distribution of the classes within the Twitter data can be seen in the third column of Table 3.
In order to combat overrepresentation we apply random undersampling. All classes have at least 540 samples, except class A with 475 samples. Therefore only the following classes are trained in this experiment: {2, 3, 5, 6, 8, 9, A}. The total number of used samples is 3715. All the data is divided into a training set (67%) and a test set (33%).
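A minimal sketch of random undersampling followed by a 67/33 split is given below. It assumes that each class is capped at 540 samples, which is consistent with the stated total (6 x 540 + 475 = 3715) but is not spelled out in the text; the function names and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def undersample(samples, labels, per_class, seed=0):
    """Randomly keep at most `per_class` samples of every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    balanced = []
    for label, members in by_class.items():
        keep = rng.sample(members, min(per_class, len(members)))
        balanced.extend((sample, label) for sample in keep)
    rng.shuffle(balanced)
    return balanced


def train_test_split(pairs, test_fraction=0.33):
    """Split shuffled (sample, label) pairs into training and test parts."""
    n_test = round(len(pairs) * test_fraction)
    return pairs[n_test:], pairs[:n_test]


# Hypothetical, imbalanced toy data: class "2" dominates, class "A" is rare.
labels = ["2"] * 900 + ["6"] * 600 + ["A"] * 475
samples = list(range(len(labels)))
balanced = undersample(samples, labels, per_class=540)
train, test = train_test_split(balanced)
```

Classes with fewer samples than the cap, like class A here, are kept completely, so they remain slightly underrepresented after balancing.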
The classification model is implemented using XGBClassifier from the eXtreme Gradient Boosting package provided by the Distributed (Deep) Machine Learning Community (DMLC). The hyperparameters are optimized by means of a grid search. The number of estimators is selected using cross-validation with 5 folds.
In order to avoid the influence of the outliers, the experiment is repeated 10 times.For each repetition the following performance metrics are calculated for the test set and finally averaged over all 10 repetitions: F1 score per class, overall accuracy, recall and precision.
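The evaluation over repeated runs can be sketched as follows. We assume here that each repetition yields a confusion matrix with true classes in rows and predicted classes in columns; the helper name `per_class_metrics` and the two toy matrices are hypothetical and only illustrate the averaging.

```python
import numpy as np


def per_class_metrics(confusion):
    """Precision, recall and F1 per class plus overall accuracy.

    Rows are true classes, columns are predicted classes.
    """
    confusion = np.asarray(confusion, dtype=float)
    true_pos = np.diag(confusion)
    predicted = confusion.sum(axis=0)
    actual = confusion.sum(axis=1)
    # Guard against classes without predictions or without samples.
    precision = np.divide(true_pos, predicted,
                          out=np.zeros_like(true_pos), where=predicted > 0)
    recall = np.divide(true_pos, actual,
                       out=np.zeros_like(true_pos), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(true_pos), where=denom > 0)
    accuracy = true_pos.sum() / confusion.sum()
    return precision, recall, f1, accuracy


# Two hypothetical repetitions of a 3-class problem.
runs = [
    np.array([[8, 1, 1], [2, 6, 2], [1, 1, 8]]),
    np.array([[7, 2, 1], [1, 8, 1], [2, 2, 6]]),
]
f1_runs = np.array([per_class_metrics(c)[2] for c in runs])
acc_runs = np.array([per_class_metrics(c)[3] for c in runs])
mean_f1_per_class = f1_runs.mean(axis=0)
mean_accuracy = acc_runs.mean()
```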

SECOND EXPERIMENT
In the second experiment we investigate whether the LCZ classification results of a CNN based on satellite images can be improved by additionally feeding the generated Twitter feature maps to it. Therefore we set up a fully Convolutional Neural Network. In contrast to a standard network that maps an image to a class label map, where input and output have the same spatial resolution, we have to deal with different spatial resolutions of the input and reference data.
The finest ground sampling distance (GSD) of 10 m comes with the infrared, red, green and blue channels of the satellite images. The other two bands used, 11 and 12, have a GSD of 20 m, and the Twitter feature maps as well as the reference label maps have a GSD of 100 m.
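The relation between the three resolutions can be made concrete with a short calculation. The sketch assumes a training patch of 250 x 250 px for the 10 m input, which is the patch size used for training; the dictionary keys and the function name are hypothetical.

```python
# Ground sampling distances (metres) as stated in the text.
GSD_M = {"rgb_nir": 10, "swir_b11_b12": 20, "twitter_and_labels": 100}


def cells_for_patch(patch_px, patch_gsd_m, target_gsd_m):
    """Cells per side at the target GSD that cover the same ground extent."""
    extent_m = patch_px * patch_gsd_m
    if extent_m % target_gsd_m != 0:
        raise ValueError("patch extent is not a multiple of the target GSD")
    return extent_m // target_gsd_m


patch_px = 250  # side length of the 10 m training patch
sizes = {name: cells_for_patch(patch_px, 10, gsd) for name, gsd in GSD_M.items()}
```

A 250 x 250 px patch at 10 m thus corresponds to 125 x 125 px at 20 m and to 25 x 25 cells at 100 m, which are the sizes that have to match when the inputs are combined.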
Instead of sampling the satellite images down by a factor of 10 and 5 respectively, we decided to create a network with a downsampling architecture, which is capable of receiving the input data at the highest available resolution. This is realized by using the input data with the highest spatial resolution as the first input and then performing strided convolutions to reduce the size of the intermediate feature maps. The remaining input images with lower resolutions as well as the Twitter feature maps are then concatenated to the intermediate feature maps of the corresponding size.
Besides the benefit of using the available input data at maximum resolution, the developed architecture has two additional advantages. Firstly, since the network is fully convolutional, it is possible to feed differently sized images to the network (as long as the ratio between the different input data matches). We use this by training the network on patches, where the size of the first input images is 250 x 250 px, and evaluating the network on bigger parts of the image. The second advantage concerns our investigation regarding the Twitter data. In order to set up the baseline model we simply skip the concatenation of the Twitter data, without modifying the rest of the network. The network architecture of the augmented model is shown in Figure 3. The filter sizes were chosen to create an overlap during the strided convolution and additionally to obtain an appropriate receptive field of 1.6 x 1.6 km for the classification of each target tile. All convolutions are followed by adding a bias and applying the leaky rectified linear unit as activation function.
We decided on a fixed number of epochs as a stopping criterion, since we do not use an additional validation set due to the low amount of data. We chose epoch 500 by analyzing the mean F1 score for the test set over 1000 epochs for both models and with each quarter as test set. Figure 4 shows the mean F1 score for the upper right quarter over 1000 training epochs, where the network uses the patches of the other three quarters as training data. The shown graphs are slightly smoothed using the mean over a window of five epochs for visualization purposes. It can be seen that the mean F1 score reaches a plateau at epoch 500 for both models.
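The fusion principle of the architecture, reducing the resolution step by step and concatenating each additional input at its matching size, can be sketched without a deep learning framework. The sketch below is not the network from Figure 3: it replaces the learned, overlapping strided convolutions by a simple non-overlapping block average, and the strides of 2 and 5 as well as all names are assumptions chosen only so that the grids of 10 m, 20 m and 100 m line up.

```python
import numpy as np


def strided_reduce(x, stride):
    """Stand-in for a strided convolution: average non-overlapping blocks.

    x has shape (channels, height, width); height and width must be
    divisible by `stride`.
    """
    c, h, w = x.shape
    blocks = x.reshape(c, h // stride, stride, w // stride, stride)
    return blocks.mean(axis=(2, 4))


def fuse(rgb_nir_10m, swir_20m, twitter_100m=None):
    """Downsample step by step and concatenate inputs at matching sizes."""
    x = strided_reduce(rgb_nir_10m, 2)           # 10 m grid -> 20 m grid
    x = np.concatenate([x, swir_20m], axis=0)    # add bands 11 and 12
    x = strided_reduce(x, 5)                     # 20 m grid -> 100 m grid
    if twitter_100m is not None:                 # augmented model only
        x = np.concatenate([x, twitter_100m], axis=0)
    return x


rng = np.random.default_rng(0)
rgb_nir = rng.random((4, 250, 250))   # four 10 m bands
swir = rng.random((2, 125, 125))      # bands 11 and 12 at 20 m
twitter = rng.random((6, 25, 25))     # six Twitter feature maps at 100 m

baseline_features = fuse(rgb_nir, swir)
augmented_features = fuse(rgb_nir, swir, twitter)
```

Skipping the last concatenation yields the baseline variant while leaving every other step unchanged, which mirrors how the baseline model is derived from the augmented one.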

RESULTS
In the first experiment the resulting mean overall accuracy of the test set classification reached 25.5 %, which is slightly above the performance of a random classifier (7 classes, corresponding to 14.3 %). The results of the second experiment are shown in the last two columns of Table 3. There we compare the F1 scores of each class for both models. For 11 out of the 13 classes which are represented in the data, the augmented model reaches a higher F1 score. This trend is also represented in the mean F1 scores, shown in the last row. The overall accuracy is also 1.3 % higher in the augmented model with a value of 78.6 %. The absolute differences in the F1 scores per class are visualized in Figure 5.
Although the improvement of the augmented model is below 5 % for most classes, the classes 1 and 2 show an improvement of approx. 10 % and 20 % respectively. The F1 score for the classes 4 and E decreased by less than 5 % in comparison to the baseline model.

DISCUSSION
Since the overall accuracy achieved in the first experiment is better than that of a random classifier, we conclude that Twitter data contains information suitable for LCZ classification. As expected, the quality is low, due to the missing explicit content information and the noisy nature of the Twitter data. In the second experiment we gathered additional evidence for the positive effect of Twitter data in a more realistic classification scenario. The results also imply that our developed architecture is capable of fusing the dense satellite images and the sparse Twitter data. In both experiments this improvement was most prominent for LCZ class 2. Against our expectations, we did not observe a dependency between the amount of Twitter data and the improvement of the prediction of a class. We assume that this is because the absence of Twitter data is itself valuable information for the implemented classification model.
The decrease of the F1 score for the classes 4 and E stands out. As described in section 6, we combat the imbalanced distribution of the data by applying weight penalties to overrepresented classes. These penalties are proportional to the count of the particular class in the whole training area. The classes 4 and E are underrepresented in the Twitter data, so that a few noisy instances of Twitter data could cause the performance loss.
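One common way to penalize overrepresented classes is to weight the loss of each sample inversely to the frequency of its class. The sketch below shows this reading under the assumption that the penalty translates into a weight of the form total / (number of classes x class count); the exact weighting used in our network is not specified here, and the function name and label counts are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Per-class loss weights that shrink as a class becomes more frequent."""
    counts = Counter(labels)
    n_classes = len(counts)
    total = sum(counts.values())
    # weight = total / (n_classes * count): equals 1 for a balanced data set.
    return {cls: total / (n_classes * n) for cls, n in counts.items()}


# Hypothetical label counts for three LCZ classes in a training area.
labels = ["6"] * 600 + ["2"] * 300 + ["E"] * 100
weights = inverse_frequency_weights(labels)
```

With such weights, a rare class like E contributes as much to the total loss as a frequent one, which also means that a few noisy samples of a rare class can have a comparatively large effect.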
In conclusion, our results support our initial hypothesis that Twitter data contributes beneficially to land use classification, and they indicate that our classification approach is suitable for data fusion.

OUTLOOK
In this work we use very simple and straightforward Twitter data features. It would be a logical step to investigate whether more complex features derived from tweet text or pictures improve the classification results even more. Since we assume that many tweets are not related to their tagged location, those features may also help to provide more specific or additional locally related information to the model. On the other hand, we did not investigate whether all of the used features derived from the Twitter data are actually relevant. In future work we want to train the model using only a subset of the features to get a better understanding of the relevant information.
Furthermore, our goal was to investigate the improvement of land use classification using Twitter data. Since our results support this hypothesis, we now want to develop a more sophisticated and optimized classification model. Considering more training and testing data, e.g. including different cities, we could evaluate our approach in a more realistic scenario. This would additionally allow us to see whether the incorporation of Twitter data can improve the model's capability of generalization.
The classes 7 (Lightweight low-rise), 10 (Heavy industry), C (Bush, scrub) and F (Bare soil or sand) are not present in this data set. The Twitter data set is generated by filtering 680,982,894 tweets, collected between 09 February 2018 and 19 June 2018, by their location with respect to our target area. The result is 392,559 tweets with geotags in Washington DC. The spatial distribution of the tweets is shown in Figure 2. Most of the tweets are located in the urban agglomerations.

3. Mean friends count (MFC) - mean number of friends of the authors of the tweets. Similar to MTL, this feature could be useful to identify business usage. Companies gather all possible followers in order to reach as many potential customers as possible. The values of this feature range up to a mean of 193,066.0 friends per cell.
4. Mean time (MT) - hour of the day at which the tweet was posted. Tweet time could indicate whether a cell is a residential area or a recreation area. This raster contains the mean over values in the range of 0 to 23, representing the local hour rounded down.
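The per-cell aggregation behind the feature maps TC, MTL, MFC and MT could be sketched as follows. The tweet record layout (keys "x", "y", "text", "friends", "hour"), the use of projected coordinates in metres and the grid origin are assumptions made for illustration only.

```python
from collections import defaultdict


def aggregate_tweets(tweets, x_min, y_min, cell_size):
    """Aggregate tweets into raster cells keyed by (row, col).

    Each tweet is a dict with hypothetical keys: "x", "y" (projected
    coordinates in metres), "text", "friends" and "hour" (local hour, 0-23).
    """
    cells = defaultdict(list)
    for tweet in tweets:
        col = int((tweet["x"] - x_min) // cell_size)
        row = int((tweet["y"] - y_min) // cell_size)
        cells[(row, col)].append(tweet)

    features = {}
    for key, members in cells.items():
        n = len(members)
        features[key] = {
            "TC": n,                                          # tweet count
            "MTL": sum(len(t["text"]) for t in members) / n,  # mean text length
            "MFC": sum(t["friends"] for t in members) / n,    # mean friends count
            "MT": sum(t["hour"] for t in members) / n,        # mean posting hour
        }
    return features


# Three hypothetical tweets on a grid with 100 m cells.
tweets = [
    {"x": 10, "y": 20, "text": "hello", "friends": 100, "hour": 9},
    {"x": 90, "y": 80, "text": "hi there", "friends": 300, "hour": 13},
    {"x": 150, "y": 20, "text": "coffee", "friends": 50, "hour": 8},
]
features = aggregate_tweets(tweets, x_min=0, y_min=0, cell_size=100)
```

Cells without any tweet do not appear in the result; when the maps are rasterized, such cells have to be filled with a default value such as zero.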

Figure 2. Spatial distribution of the Twitter data. Colors represent the number of tweets in a single cell (TC). The single-cell hot spots are not visible in this view. Zoomed-in view in the lower right corner: value distribution of the tag count (MTC).

Figure 3. Architecture of the Convolutional Neural Network. The details of the convolution operations are shown on top of the network. Below, the sizes of the input and feature maps are displayed. The blue box represents the Twitter feature maps, which are only used in the augmented model.

Figure 4. Exemplary smoothed trend of the mean F1 score in the test quarter for both models over 1000 iterations. Both models reach a plateau after approximately 500 training epochs.
In Figure 6, the mean F1 scores for both models are shown for each epoch. The augmented model outperforms the baseline model for most epochs.

Figure 5. Difference in F1 score per class. A positive value indicates a better score for the augmented model.

Figure 6. Mean F1 score for both models over 500 training epochs. The values are based on the summed confusion matrices over all runs and quarters per epoch and model.
Table 2. LCZ classification system, land cover structures

Table 3. LCZ support and F1 score per class.

Table 4. Results of the first experiment.
It is worth noting that the performance for LCZ class 2 (Table 4) is far above random classification.