CLASSIFICATION OF THE STRUCTURE OF CITIES THROUGH MID-RESOLUTION SATELLITE IMAGERY AND PATCH BASED NEURAL NETWORKS

Studies on the classification of urban spatial structure have been essential in deriving insights into land cover and built typology, which help in estimating energy consumption patterns, urban density, compactness, and the hierarchy of settlements. However, the analysis and comparison of the physical forms of cities have been attempted in a piecemeal fashion, with the requirement of datasets and computation power for analysis being a major hindrance. With the advancement of machine learning based techniques, large datasets such as satellite imagery can be studied with advanced computer vision methods. These solutions may help in studying the intricate nature of human habitats over large geographical extents that include various urban areas. This study utilizes small patches of medium resolution Sentinel-2B imagery of ten cities in India to explore the urban forms present in these cities. It uses a Stacked Convolutional Autoencoder (CAE) to reduce the dimensionality of the satellite imagery patches, and unsupervised techniques, namely t-SNE and K-means, to study the characteristics of similar patches. On analyzing the clusters through visual exploration, similar patches are delineated and given labels representing urban forms. Individual clusters are then studied with respect to each city. The motive of the study is to gain insights into the different types of morphological patterns present within and among cities.


INTRODUCTION
Land use Land cover (LULC) maps have been extensively used for the delineation of land characteristics. LULC maps are fundamental in the estimation of agricultural production (Zheng et al., 2015), the analysis of biodiversity (Szostak et al., 2018), and the assessment of natural hazards (de Moel and Aerts, 2011; Khatami and Mountrakis, 2012) and urbanization (Taubenböck et al., 2009). The availability of open data from Earth Observation (EO) satellites, combined with open image processing software and toolboxes, has pushed LULC based research to a new level. Several machine learning algorithms have been studied and implemented to classify different features of the land (Noi and Kappas, 2018; Shao and Lunetta, 2012). However, the use of such algorithms in studying intra-urban features has been scarce. The major reason is that the satellite imagery available as open datasets has a relatively coarse resolution, which acts as a limiting factor. Studies utilizing high and very high-resolution imagery (Kuffer et al., 2017; Mboga et al., 2017) have been conducted to understand the structure of cities, but such imagery has limited availability to researchers. Medium resolution datasets such as Sentinel and Landsat products, on the other hand, have only been utilized for regional level analysis.
With rapid urban expansion, particularly in cities in developing countries, there is a growing need to monitor changes in intra-urban structures and textures in quick succession. Urban morphology, which studies the form and function of urban areas, has been widely discussed in urban planning and management. Morphology defines the uniqueness, identity and vibrant character of the city. The Local Climate Zones (LCZ) classification method (Bechtel et al., 2015) has been widely used by researchers to understand morphology and urban fabric. The methodology (Ching et al., 2015) to create LCZ maps involves manual delineation of training samples for supervised classification, which specifically requires a field expert to interpret scenes and to create training samples.
This study aims to examine the inherent structure of cities through unsupervised learning. It provides a step by step approach from the creation of a patch-based dataset to the classification of results, using CAE as a dimensionality reduction technique. The present methodology is created as a test, which may be scaled to include more cities or cities from different countries. The methodology includes the sequential clipping of image tiles from cities, which are then stacked together to form a training dataset. The dataset is passed through the CAE, and the embeddings for each of the clips are recorded and stored. These embeddings are plotted with the help of the t-SNE algorithm (Van Der Maaten and Hinton, 2008). Clusters are created through the K-means algorithm. Each cluster is assigned a class after studying the gist of the information it provides. The presence of such classes is determined in the cities and statistics are generated (Fig. 1).

DATA COLLECTION
Satellite imagery is downloaded for the ten largest cities in India from the Sentinel-2B database (Table 1). All images were captured within a one-month period (15 March to 15 April 2018). Bands 2, 3, 4 and 8, which represent the Blue, Green, Red, and Near-Infrared regions of the spectrum, are stacked to create a composite set for each of the cities. The city boundaries are used to clip the city extents from the downloaded image dataset. The large variation in the areas of the considered cities is evident in Table 1. Delhi has the largest area of the ten cities, owing to the inclusion of contiguous hinterlands that extend up to the border of the State of Delhi. To create patches from each set of imagery, we used GDAL to sub-divide each city into smaller images of size 32x32 pixels, each covering an area of 10.2 Ha. The process resulted in the generation of approximately 10^5 patches (Table 1). The set of patches is further processed with the convolution based autoencoder and unsupervised clustering algorithms. Vanilla techniques such as Random Forests and KNN are effective on small sets of data. However, these algorithms are not feasible for larger datasets (10^5 x 32 x 32 x 4 values). Commonly used dimensionality reduction algorithms such as Principal Component Analysis (PCA) perform better on unstructured datasets. For structured datasets such as satellite imagery, we instead utilized a CNN to extract relevant features and to reduce the dimensionality of the image tiles.
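The patch arithmetic described above can be sketched in a few lines. This is a minimal illustration, not the authors' GDAL workflow: the function names are hypothetical, the 10 m pixel size is the nominal resolution of Sentinel-2 bands 2, 3, 4 and 8, and dropping incomplete patches at the image edges is an assumption.

```python
# Sketch only: names and the edge-handling rule are assumptions,
# not details confirmed by the paper.

PIXEL_SIZE_M = 10   # nominal resolution of Sentinel-2 bands 2, 3, 4 and 8
PATCH_SIZE_PX = 32  # patch edge length used in the study


def patch_area_ha(patch_px=PATCH_SIZE_PX, pixel_m=PIXEL_SIZE_M):
    """Ground area of one square patch in hectares (1 ha = 10,000 m^2)."""
    edge_m = patch_px * pixel_m
    return edge_m * edge_m / 10_000


def tile_windows(width_px, height_px, patch_px=PATCH_SIZE_PX):
    """Return (x_offset, y_offset, x_size, y_size) for every full patch.

    Partial patches at the right and bottom edges are dropped (assumption).
    """
    windows = []
    for y in range(0, height_px - patch_px + 1, patch_px):
        for x in range(0, width_px - patch_px + 1, patch_px):
            windows.append((x, y, patch_px, patch_px))
    return windows


def dataset_values(n_patches, patch_px=PATCH_SIZE_PX, n_bands=4):
    """Number of scalar values held by n_patches patches of all bands."""
    return n_patches * patch_px * patch_px * n_bands


if __name__ == "__main__":
    print(patch_area_ha())               # area of one 32x32 patch in ha
    print(len(tile_windows(100, 70)))    # full patches in a 100x70 raster
```

Each window tuple has the offset-and-size layout that a raster clipping tool expects (for example the `-srcwin` option of `gdal_translate`), so the same loop could drive the actual clipping step.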

ARCHITECTURE OF CONVOLUTIONAL AUTOENCODER
CNN can be regarded as an extension of regular Artificial Neural Networks that focuses on image analysis. CNN is primarily used for image classification, segmentation and object detection tasks, and is widely used in industry and academia. The ability of a CNN to learn features from the provided image array while preserving the local structure and composition of the image is utilized considerably in this study. These networks discover features and patterns and frame meaningful representations of the input data with the help of the various filters present in the hidden layers. We utilize the learned features with the help of autoencoders to create a unique set of values (embeddings) for each patch. Autoencoders translate data from a higher to a lower dimension; the lower-dimensional values are known as embeddings. An autoencoder is a data compression algorithm based on three functions: encoding, decoding, and a loss computed between the original input and its decompressed representation. The encoding task reduces the dimensionality of the input to generate its compressed representation. The decoder learns to produce an output similar to the provided input from these compressed values. Convolutional Autoencoders (CAE) use convolutional layers as part of the encoder and the decoder. CAE is used in this study to process the patches and to generate the set of values in lower dimensions.
The CAE model (Fig. 2) includes a convolution network (encoder), an embedding layer and a deconvolution network (decoder), which together consist of input, convolutional, max pooling, embedding, upsampling, and output layers in an organized fashion. The input layer is a placeholder for the prepared dataset in the form of 32x32x4 tiles. It is connected to a convolutional layer in which a kernel of size 3x3 is applied to the input; 16 activation maps are created and stacked to form a volume of 32x32x16. Max pooling is a downsampling operation that reduces the number of values in the volume. Here, max pooling is implemented with 2x2 filters, which decreases the volume size to 16x16x16. Similarly, the second convolutional layer uses a 3x3 kernel with 12 activation maps, followed by max pooling with 2x2 filters, and the resultant volume is reduced to 8x8x12. A further max pooling operation with a 2x2 filter creates the embedding layer of size 4x4x12 (192 values). The embedding layer holds the reduced and encoded form of the input data, which is then used to reconstruct the input through a group of convolutional and upsampling layers. In Fig. 2, it can be seen that the decoder mirrors the encoder architecture and differs only by using upsampling operations instead of max pooling.
The loss between the input and the reconstructed output is calculated with every iteration of the model run. The training is stopped after the loss has stabilized. Fig. 3 shows the input and reconstructed output of the image tiles. The CAE model is trained on a workstation with a multicore Xeon processor, 32 Gigabytes of memory and an Nvidia K2000 graphics card with 2 Gigabytes of memory. The model is trained for two weeks until the loss subsided. The embedding values (192-D) generated by the CAE are converted to 2D scatter plots with t-SNE for better visualization.
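The layer sizes quoted for the encoder can be checked with simple shape bookkeeping. The sketch below trains nothing and is not the authors' model code; the helper names are hypothetical, and stride 1 with "same" padding is an assumption inferred from the text, which reports that a 32x32 input remains 32x32 after the first 3x3 convolution.

```python
# Shape bookkeeping only; no weights are created or trained here.
# Assumption: 3x3 convolutions use stride 1 with "same" padding, so the
# height and width are unchanged and only the channel count changes.

def conv3x3_same(shape, n_filters):
    height, width, _ = shape
    return (height, width, n_filters)


def max_pool(shape, size=2):
    """Non-overlapping max pooling halves height and width for size=2."""
    height, width, channels = shape
    return (height // size, width // size, channels)


def n_values(shape):
    height, width, channels = shape
    return height * width * channels


def encoder_shapes(input_shape=(32, 32, 4)):
    """List (layer_name, output_shape) for the encoder described in the text."""
    steps = [("input", input_shape)]
    x = conv3x3_same(input_shape, 16)
    steps.append(("conv1", x))
    x = max_pool(x)
    steps.append(("pool1", x))
    x = conv3x3_same(x, 12)
    steps.append(("conv2", x))
    x = max_pool(x)
    steps.append(("pool2", x))
    x = max_pool(x)
    steps.append(("embedding", x))
    return steps


if __name__ == "__main__":
    for name, shape in encoder_shapes():
        print(name, shape, n_values(shape))
```

Under these assumptions a patch of 32 x 32 x 4 = 4096 values is compressed to 192 values, a reduction by a factor of roughly 21.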

T-SNE CLUSTERING
t-distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear dimensionality reduction technique which preserves the relationship between data points while converting from higher to lower dimensions. Tang et al. (2016) experimented with the K-means algorithm on a high-dimensional dataset to delineate clusters in the dataset. This study utilizes a similar approach by applying K-means clustering with k = 10 to the t-SNE generated representation. The map points are clustered into 10 classes (Fig. 5). The clips corresponding to the map points in each cluster are visually inspected to find general characteristics.
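To make the clustering step concrete, the following is a minimal K-means (Lloyd's algorithm) sketch for 2-D map points. The paper does not state which implementation was used, so this pure-Python version, its function name, its random initialization and its stopping rule are all assumptions made for illustration.

```python
import random


def kmeans_2d(points, k, iterations=100, seed=0):
    """Cluster 2-D points with Lloyd's algorithm.

    points: list of (x, y) tuples, e.g. t-SNE map points.
    Returns (labels, centroids). Initial centroids are k points drawn at
    random (assumption; other initializations are possible).
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [-1] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = []
        for px, py in points:
            dists = [(px - cx) ** 2 + (py - cy) ** 2 for cx, cy in centroids]
            new_labels.append(dists.index(min(dists)))
        # Stop once the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                mean_x = sum(x for x, _ in members) / len(members)
                mean_y = sum(y for _, y in members) / len(members)
                centroids[c] = (mean_x, mean_y)
    return labels, centroids


if __name__ == "__main__":
    toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(kmeans_2d(toy, k=2))
```

For the setting described in the text, the call would use k=10 on the full set of map points; the toy example uses k=2 only to keep the behavior easy to inspect.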

RESULTS
Table 2 provides the general features of each of the created clusters, which include various built forms, vegetation, water bodies, croplands, and forests. The distinction between the characteristics of the clusters is not perfect; however, general descriptions can be made by visual observation of the image tiles in each cluster. The clusters are studied in relation to the cities considered, and statistics on the presence of similar characteristics across tiles are examined. Six of the ten clusters (1, 2, 5, 6, 7, 9) show various urban built forms, ranging from scattered built areas to densely compact, low-rise settlements. These clusters can be distinguished from the rest by the presence of less vegetation and fewer open areas among the built cover (Fig. 6).
Cluster 1 includes built-up areas with the presence of playgrounds and trees, while clusters 7 and 9 represent uniform urban fabric with less open and green cover. The city of Jaipur shows 42 percent of its area as low rise with dense built form, which might be due to the unique built typology prevalent in arid and semi-arid regions of the country. Cluster 0 includes areas with an open or exposed soil character. This feature can be attributed to large playgrounds, areas cleared of vegetation for development, and naturally occurring areas of low vegetation. The cities of Pune and Ahmedabad show the highest presence of cluster 0, at 9.8 and 7.2 percent respectively. The rocky terrain and hills around the periphery of Pune and the large parcels of land under development in Ahmedabad may be among the reasons.
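The per-city cluster shares discussed above (of the kind reported in Table 3) amount to counting tiles per cluster and normalizing by the number of tiles in each city. The sketch below shows one way to compute them; the function name, the input layout and the city names in the example are hypothetical, and the values are made up for illustration rather than taken from the study.

```python
from collections import Counter


def cluster_percentages(tile_labels, n_clusters=10):
    """Percentage of each city's tiles that fall in each cluster.

    tile_labels: iterable of (city_name, cluster_id) pairs, one per tile.
    Returns {city_name: [pct_cluster_0, ..., pct_cluster_{n-1}]}.
    """
    counts_by_city = {}
    for city, cluster in tile_labels:
        counts_by_city.setdefault(city, Counter())[cluster] += 1

    table = {}
    for city, counts in counts_by_city.items():
        total = sum(counts.values())
        table[city] = [
            round(100.0 * counts[c] / total, 1) for c in range(n_clusters)
        ]
    return table


if __name__ == "__main__":
    # Hypothetical labels, not data from the study.
    example = [("CityA", 0), ("CityA", 0), ("CityA", 1), ("CityA", 9),
               ("CityB", 3)]
    for city, row in cluster_percentages(example).items():
        print(city, row)
```

Because each row is normalized by the tile count of its own city, rows are comparable between cities of very different areas.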

LIMITATIONS AND CONCLUSIONS
This study presented a method to understand the inherent structure of ten Indian cities through unsupervised learning based on embeddings generated by a CAE. It experiments with new image processing ideas and combines them with the task of classifying urban landscapes. The study aimed to create a novel method to understand cities. However, there are various shortcomings which can be addressed in later studies on this topic. This study clips patches of 32x32 pixels, each covering an area of approximately 10 Ha. The same methodology can be repeated with smaller clips for detailed urban studies. Tile size is a tradeoff between the detail required in the study and increasing computation costs. Further, the number of bands in the imagery can be increased from 4 to 13 to offer more spectral variety to the CNN based model. CNNs are sensitive to the tuning of hyperparameters, and slight changes in these values may affect the quality of image reconstruction. The quality of the current output may be significantly improved with minor tweaks in hyperparameter settings. The extent of each city considered in this study expands beyond the administrative boundary line, which slightly amplifies the statistical figures in the classification.
Future studies may consider only the city area inside the municipal limits for better data comparison. Recently, studies have utilized the transfer learning approach, which uses pre-trained Convolutional Neural Networks to produce embeddings, known as transfer values. Such pre-trained networks have been trained on millions of images and can find meaningful associations between the features extracted from images. Such methods can be applied to the task demonstrated in this study, and their performance can be checked. Given sufficient storage and computational power, a large-scale study covering all the urban areas of the world could be carried out, with the cities studied simultaneously. This study used the t-SNE algorithm extensively to plot the embeddings and to create clusters through a nearest-neighbor approach. However, the t-SNE algorithm does not preserve information regarding density and distances among the data points. The appropriateness of the clusters created by K-means, which relies on density and distance relations, is therefore questionable. An alternative option is the use of self-organizing maps (SOM), an unsupervised learning method based on artificial neural networks.
The study is part of an exploration of newer methods and their applicability in answering some of the questions in urban mapping studies. Future research involving the latest algorithms and approaches would enhance the understanding of the topic.
Creating LCZ maps involves manual delineation of training samples for supervised classification, which specifically requires a field expert to interpret scenes and to create training samples. This study utilizes a novel unsupervised classification technique to categorize urban structure, in which the categorization of classes is done with the help of dimensionality reduction and clustering methods. Research in image processing is undergoing a paradigm shift with the inclusion of state-of-the-art deep learning methods. Convolutional Neural Networks (CNN) are one such method and have provided near-human accuracy in computer vision tasks such as image classification. CNNs can learn a hierarchical representation of the variety of features present in the spatial and spectral domains of satellite imagery. In this study, CNN is used as encoder and decoder to learn the feature representation and to obtain the reduced dimension of the input data as embeddings.

Figure 1: Flow diagram showing the methodology of the study.
t-SNE is used to find the relevant clusters by generating 2D representations of the embeddings generated by the CAE. Embedding values that are similar in structure form clusters. The embeddings generated from the dataset lie in the data space R^D, where D = 192. The final representation after applying the t-SNE algorithm lies in R^2, where each data point is represented by a map point in the 2-D map space.
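For reference, the mapping from the data space to the map space can be written out using the standard t-SNE formulation of Van Der Maaten and Hinton (2008). The paper itself does not give these equations or its perplexity setting; the symbols x_i (a 192-D embedding), y_i (its 2-D map point), N (the number of patches) and sigma_i (the per-point bandwidth set by the perplexity) are notation introduced here.

```latex
% Similarities in the data space, x_i \in \mathbb{R}^{D}, D = 192
p_{j|i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}

% Similarities in the map space, y_i \in \mathbb{R}^{2} (Student-t kernel)
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% Map points are found by minimizing the Kullback-Leibler divergence
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

Because the cost compares neighbor probabilities rather than raw distances, distances between well-separated groups in the map are not directly interpretable.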

Figure 5: t-SNE representation with K-means clusters.
Fig. 4 shows the output of the t-SNE representation. t-SNE captures the local and global texture of the data space. The interesting patterns in the map representation can be delineated through different methods. (Tang et al., 2016) experimented with K-means algorithm on the High-dimensional dataset to delineate

Figure 6: Example tiles present in each cluster.

Table 1: Data preparation from 10 cities.

Table 2: General characteristics of each cluster.
Cluster  General characteristics
1  Open mid/high rise settlements, moderate green cover, open areas surrounded by buildings.
2  Fringe areas, scattered development, visible road/railroads, tree plantations, patches of crops.
3  Sea, rivers, and lakes.
4  Major croplands with minor patches of development, especially roads.
5  A mix of compact settlements and sparse built-up, roads/railroads surrounded by urban settlements.
6  Scattered low rise development like villages in urban boundaries, tree plantations, croplands, roads/railroads surrounded by croplands.
7  Compact built form, less open spaces, fewer trees, uniform built typology.
8  Croplands, forests, water, almost no development, fewer visible road/railroads.
9  Compact, low rise development, fewer visible road/railroads, less vegetation, uniform urban fabric.

Table 3: Percentage of image tiles of each city present in clusters 0-9.