PROCESSING OF CRAWLED URBAN IMAGERY FOR BUILDING USE CLASSIFICATION

Recent years have shown a shift from purely geometric 3D city models to data with semantics. This is driven by new applications (e.g. Virtual/Augmented Reality) and by the requirements of concepts like Smart Cities. However, essential urban semantic data such as building use categories is often not available. We present a first step towards bridging this gap by proposing a pipeline that uses crawled urban imagery and links it with ground truth cadastral data as input for automatic building use classification. We aim to extract this city-relevant semantic information automatically from Street View (SV) imagery. Convolutional Neural Networks (CNNs) have proved extremely successful for image interpretation but require a huge amount of training data. The main contribution of the paper is the automatic provision of such training datasets by linking semantic information, as already available from databases provided by national mapping agencies or city administrations, to the corresponding façade images extracted from SV. Finally, we present first investigations with a CNN and an alternative classifier as a proof of concept.


INTRODUCTION
Over the last few years, there has been a shift in photogrammetry and geoinformation applications from purely geometric reconstruction of virtual cities to 'intelligent' data, i.e. models with semantics. Building Information Modeling (BIM) and Smart Cities are currently hot topics. These applications feed on a multitude of data sources. However, this reveals a discrepancy at the same time: the semantic information required for a multitude of applications like urban planning and infrastructure management includes building use, number of dwelling units and more (Hecht, 2014), yet such data is often not available. A key piece of information, from which several other metrics can be derived or at least approximated, is the aforementioned building use. Therefore, we see a need for large-scale automatic building category classification. This paper proposes an approach to leverage Google's area-wide available Street View data and link the depicted buildings with data from the digital city base map provided by the City Survey Office Stuttgart. To extract only building-relevant parts from the Street View data, we pre-process the images. To this end, we utilize metadata provided by the Street View API (Google Developers, 2017) and take advantage of a Deep Learning framework for semantic image segmentation (Long et al., 2015) to analyze our data for relevant content. Based on the information obtained in the crawling process, we try to link image content with building polygons in the ground truth. The outcome is a set of tuples, each consisting of a building image and its corresponding building category. This data is then used to train a classifier. With the trained classifier it will be possible to predict building categories for new input images. First experiments focus on investigating the potential of a Bag-of-Words (BoW) approach and a pre-trained CNN.
For now, we want to distinguish between four different building use types: residential (purely residential use), commercial (purely commercial use), hybrid (mixture of commercial and residential use) and special use (any other building use, for example churches, hospitals and museums, but also construction sites). The remainder of this paper is structured as follows: in section 2 we give a brief review of urban classification using semantic segmentation and deep learning, section 3 describes our approach for the generation of training data to perform building use classification, section 4 shows some first results, and in section 5 we discuss and draw some conclusions.

RELATED WORK
Within this section, several topics of related work are discussed. Section 2.1 briefly gives an overview of the subject of Urban Classification as a whole. In section 2.2 we more specifically address Semantic Segmentation for Urban Scenes. Finally, section 2.3 investigates recent related work in the field of Deep Learning.

Urban Classification
Urban classification can be hierarchically divided according to the type of data acquisition the classification is based on. Satellite data provides information to perform classification with respect to different land use, based on hyperspectral analyses. Hoberg et al. (2015) present a multitemporal and multiscale classification based on Conditional Random Fields (CRF). There are also several approaches to perform building outline detection from satellite imagery (Niemeyer et al., 2014). With aerial data acquisition, urban classification typically diversifies further: not only are building outlines extracted (Ortner et al., 2007), but the scenery is typically divided into vegetation, ground and buildings. Besides pure 2D image segmentation, the state of the art is to use 3D point cloud information obtained from dense image matching (Haala and Rothermel, 2015) or LiDAR (Guo et al., 2011). Data obtained by LiDAR systems can stem either from airborne laser scanning (ALS) or from terrestrial systems, which are either static (TLS) or mobile (MLS). MLS data in particular is in the focus of urban classification and will be discussed in the next section.

Semantic Segmentation for Urban Scenes
When dealing with terrestrial urban data, a great number of tasks are tackled in the literature. In (Weinmann et al., 2015) several approaches (e.g. Nearest Neighbor, Decision Tree, SVM, Random Forest, Multilayer Perceptron) are investigated to classify MLS point clouds into semantic urban classes like façade, ground, cars, motorcycles, traffic signs and pedestrians. They report that Random Forests provide the best trade-off between accuracy and efficiency. Wang et al. (2015) presented an approach for holistic scene understanding, which reasons jointly about 3D object detection, pose estimation, semantic segmentation and depth reconstruction from a single geo-tagged image by using a holistic CRF. Similarly, Xiao and Quan (2009) use pairwise Markov Random Fields across multiple views to perform semantic segmentation for Street View images. We are aware of the large body of literature concerning building façade segmentation and interpretation. However, since we do not aim at extracting individual façade parts such as windows and doors in the presented work, but rather want to determine a specific building use category, we do not cover this topic here. An extensive overview of urban reconstruction, including façade interpretation, can be found in (Musialski et al., 2013).

Deep Learning
Recent years have shown rapid development in CNN designs, performance and applications. Deep Learning is not only successfully applied in speech recognition (Hinton et al., 2012) and natural language processing (Collobert and Weston, 2008) tasks, but is nowadays also state of the art for image classification and segmentation (Russakovsky et al., 2015, Everingham et al., 2012). Recent work proposed an approach to generate full sentences that describe image content (Karpathy and Fei-Fei, 2015). With regard to urban data, Weyand et al. (2016) presented an approach that treats the photo geo-location problem as a classification problem, in contrast to the more popular strategy of framing it as an image retrieval problem. They subdivide the earth into thousands of multiscale geographical cells and train a deep network (PlaNet) using millions of geotagged images. For a query image, PlaNet outputs a probability distribution over the surface of the earth. The same task is addressed by (Hershey and Wulfe, 2016). They use a GoogLeNet model, pre-trained on a scene classification data set, to geo-locate images taken from Google Street View (GSV) in 10 different cities. They report an accuracy of 75%, which exceeds human performance. The work of (Movshovitz-Attias et al., 2015) uses SV images for the classification of storefronts, more specifically the classification into business categories. They create a large training data set by propagating business category information with the help of an ontology that uses geographical concepts. For learning, they also use a network based on GoogLeNet. With a top-1 accuracy of 69%, they are approximately at human level.

REGISTRATION OF IMAGE DATA WITH BUILDING USE CATEGORY
This part is structured as follows: in section 3.1 we describe the crawling process to extract georeferenced façade images from SV data. Selection and preprocessing of images to provide suitable image patches for classifier training is covered in section 3.2. Finally, in section 3.3 we elaborate on linking image patches to existing semantic information using coarse georeferencing information from Street View.

Urban Image Crawling
A crucial element in performing classification tasks is to obtain an appropriate number of training samples. Frequently, these are available from datasets and benchmarks within the fields of Computer Vision and Machine Learning. The SUN database (Xiao et al., 2010) consists of almost 4000 object categories, but there are only slightly over 1000 images containing buildings. ImageNet (Deng et al., 2009) provides over 20,000 indexed synsets (synonymous word fields) and over 14 million images in total.
There are also several benchmarks for urban scenes. Geiger et al. (2013) developed a mobile mapping platform and host KITTI, a benchmark with data for a variety of vision tasks ranging from stereo matching and scene flow to semantic segmentation. Likewise, the CITYSCAPES dataset provided by (Cordts et al., 2016) contains scenes from 50 cities with corresponding semantic pixelwise annotations for each frame, obtained by a windshield-mounted stereo camera system. For these datasets, GPS information of the car's trajectory is available. However, these datasets are not suitable for our task, since we aim at assigning specific usage categories to buildings. We take another path and make use of municipal surveying data in combination with a publicly available image source. This way we can narrow down and merge the variety of building categories, and enforce correctness of the ground truth. There are several reasons why we pursue the proposed framework at all, when there are already huge CNNs that classify hundreds of categories with a reasonable level of correctness, including classes like apartment building or office building. First, those very deep CNNs developed by companies are fed with massive amounts of training data, and not everybody can provide or produce such huge collections of training examples. Moreover, large CNNs cover a broad range of category types, while our work aims at a small subset of those classes. We are not interested in classifying a plethora of different categories, but rather very few, with potentially high intra-class variance. The evaluation of state-of-the-art approaches with a multitude of classes is frequently based on the top-5 error. However, since we aim at determining a rather limited number of classes with rather high reliability, the top-1 error is our main interest.
The actual crawling is implemented in JavaScript, based on (Ashwell, 2015) and modified for our use. As output from the crawling process, we obtain a list of positions P (longitude, latitude) and headings, one per crawl position. By dragging the Google Maps marker, one can define the initial crawling position. Using the Street View API, the crawler searches for the next available panorama based on the current position. Figure 1 shows the crawling interface with the initial Street View on the left and all crawled panoramas on the right. We use two different modes of crawling: panorama-link based and random sampling. The first method successively visits the link nodes stored in the current panorama until a predefined total number of panoramas is fetched. However, this method only returns the center heading of the Street View car for this position. Therefore, when using the panorama-link based method, we add 90° to this heading and thereby obtain frontal views of the buildings. When using the random sampling technique, we generate random offsets for latitude and longitude, thereby performing a random walk of the geographical position. To prevent excessive divergence, we reset to the initial position at predefined intervals. Based on the randomly sampled positions, we then search for the nearest panorama and calculate the heading. The outcome of both crawling processes is a list of 2D geographic coordinates and a corresponding heading for each. We use this data together with the parameters pitch Φ and field of view (FOV) to query an image Ι as part of the panorama via the Street View API. Φ is measured positively looking upwards with respect to the camera's initial horizontal position. We chose Φ = 15° and an FOV of 90° to ensure that larger buildings are also covered.
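The random sampling mode can be sketched as follows. This is a minimal illustration in Python, not the JavaScript crawler used in this work; the function names, the step size and the reset interval are assumptions chosen for illustration, and the search for the nearest panorama is omitted.

```python
import random


def random_walk_positions(start_lat, start_lon, n_samples,
                          max_step_deg=0.0005, reset_every=20, seed=0):
    """Generate candidate crawl positions by a random walk with periodic reset.

    max_step_deg and reset_every are illustrative values (assumptions).
    """
    rng = random.Random(seed)
    lat, lon = start_lat, start_lon
    positions = []
    for i in range(n_samples):
        if i > 0 and i % reset_every == 0:
            # Reset to the initial position to prevent excessive divergence.
            lat, lon = start_lat, start_lon
        lat += rng.uniform(-max_step_deg, max_step_deg)
        lon += rng.uniform(-max_step_deg, max_step_deg)
        positions.append((lat, lon))
    return positions


def frontal_heading(driving_heading_deg):
    """Rotate the driving direction by 90 degrees to face the facades."""
    return (driving_heading_deg + 90.0) % 360.0
```

Each returned position would then be passed to the panorama search, and the resulting heading, together with the pitch and FOV, to the image query.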

Extraction of building-relevant images
We aim at the extraction of good training data, i.e. images with a clear view onto exactly one building in the center. However, many of the initially crawled images do not meet those requirements (see also section 3.2.1 and section 3.2.2). Thus, after fetching the Street View data, we preprocess all images Ι to extract only samples with relevant content. One tool we use to analyze the images is a reimplementation of a Fully Convolutional Network (FCN) (Long et al., 2015) provided by (Caesar and Uijlings, 2016). This end-to-end/pixel-to-pixel trained network uses "raw" images as input and produces a semantic pixelwise labelling. We use the FCN-16s SIFT Flow model, which is based on the SIFT Flow dataset with roughly 3000 images and their corresponding pixel labels. In total, there are 33 semantic categories, ranging from awning, balcony and bird, over mountain and person, to tree and window. However, there are not only semantic but also geometric labels, and the FCN can learn a joint representation and predict both. We are not interested in all of those classes. Effectively, we only want to detect whether or not a building is the actual main content of the current image. Hence, we merge several classes: for example, we merge awning, balcony and window into the building class. Similarly, we merge grass and tree into the plant class.

Occlusions:
As stated in the previous section, we have to ensure that the main image content is the building of interest. Thus, as a first step of processing the crawled urban imagery, we use the described FCN to perform a pixelwise segmentation. By using the merged classes introduced in the previous section, we obtain results as depicted in Figure 2 on the right. If the main content of the segmented image consists of plant or car pixels, we discard the image.
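The occlusion filter can be illustrated with a short Python sketch. It assumes that the segmentation result is available as a 2D grid of label names; the merge table only contains the examples given in the text, and interpreting "main content" as the merged class with the largest pixel count is an assumption of this sketch.

```python
from collections import Counter

# Merge table following the examples in the text; labels not listed keep their name.
MERGE = {
    "awning": "building", "balcony": "building", "window": "building",
    "grass": "plant", "tree": "plant",
}


def dominant_class(label_map):
    """Return the merged class covering the most pixels and its pixel share."""
    counts = Counter(MERGE.get(label, label) for row in label_map for label in row)
    cls, n = counts.most_common(1)[0]
    return cls, n / sum(counts.values())


def keep_image(label_map, rejected=("plant", "car")):
    """Discard images whose dominant merged class is plant or car."""
    cls, _ = dominant_class(label_map)
    return cls not in rejected
```

A stricter variant could additionally require a minimum share of building pixels; no such threshold is specified in this work.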

Blurred Images:
Each building owner has the legal right to demand that Google make his or her private property unrecognizable within the Street View data. Google approaches this the same way it anonymizes persons: by blurring the affected buildings. Obviously, we want to discard those images since no actual content is provided. There has been a lot of work on edge-based blur detection (Ong et al., 2003; Narvekar et al., 2011). In fact, edge detection delivers quite consistent results in our case, as shown in Figure 4. However, as we incorporate the aforementioned FCN, we can make use of a particular property when evaluating images. In that framework, blurred regions are typically classified as sky or sea pixels and can thus be detected easily.
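A possible way to exploit this property is to flag an image when the share of sky or sea pixels becomes implausibly large. The following Python sketch assumes a 2D grid of label names as input; the threshold of 0.6 is a hypothetical value, not one used in this work.

```python
def blur_suspect(label_map, blur_labels=("sky", "sea"), max_share=0.6):
    """Flag an image if sky/sea pixels exceed max_share of all pixels.

    max_share is a hypothetical threshold chosen for illustration.
    """
    total = sum(len(row) for row in label_map)
    blurred = sum(1 for row in label_map for label in row if label in blur_labels)
    return total > 0 and blurred / total > max_share
```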

Linkage of images with correct ground truth
Our ground truth data consists of a 2D shape file with ground plan polygons for each building, enriched with several aspects of semantic information like address, communal district, building block number and, of particular interest to us, building use. For each building polygon BP we calculate its centroid. Once it is ensured that Ι actually contains building content, we have to link it to the correct corresponding ground truth. Here, we make use of the previously gathered data from the crawling process: we know the actual position P for each obtained SV image. However, these positions are in geographic coordinates, whereas the ground truth data is located in the Gauß-Krüger coordinate system. Therefore, we perform a datum transformation between geographic coordinates and the reference coordinate system of the national mapping agency. Subsequently, for each P we carry out a nearest neighbour (NN) search in the ground truth dataset based on the centroids of the building polygons and extract a set of candidate polygons BP. These buildings form our neighbourhood NH, in which we have to find the actual building displayed in the image, denoted as Γ. To obtain the correct Γ we have to address several issues, which are covered in the following.
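The centroid computation and the neighbourhood search can be sketched in Python as follows. The sketch assumes that the crawl position and the polygons are already given in the same planar metric coordinate system (the datum transformation is omitted), that polygons are simple vertex lists, and that k candidates are extracted; the value of k is an assumption.

```python
import math


def polygon_centroid(vertices):
    """Area-weighted centroid of a simple polygon given as [(x, y), ...]."""
    area2 = cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    # area2 is twice the signed area, hence the factor 3 (= 6 / 2).
    return cx / (3.0 * area2), cy / (3.0 * area2)


def neighbourhood(position, polygons, k=3):
    """Indices of the k polygons whose centroids are closest to position."""
    centroids = [polygon_centroid(p) for p in polygons]
    order = sorted(range(len(polygons)),
                   key=lambda i: math.dist(position, centroids[i]))
    return order[:k]
```

For a full city dataset, a spatial index such as a k-d tree would replace the linear scan used here.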

Interiors:
In the crawling process, the random sampling approach in particular is not limited to the required street-level imagery but potentially also provides images from interior panoramas. To eliminate such data, typically covering shops, public institutions and suchlike, we take P and perform a point-in-polygon test for each BP in NH. If the test returns true for one of the polygons, Ι contains indoor scenery and is discarded. However, the limited geolocation accuracy of these interior panoramas might lead to a reported position outside the building, in which case the test fails to detect them.
In future work we have to counteract this problem, since the semantic segmentation FCN is trained for outdoor scenes and hence does not provide useful information in this case. Once interiors are handled, we make use of the heading information to construct a line of sight from P for the predefined FOV. We limit the length of the line of sight to 20 meters to ensure that Γ is the central content of Ι. In the next step, we determine whether the line of sight hits any of the polygons BP.
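The interior test can be illustrated with a standard ray-casting point-in-polygon check. This Python sketch assumes simple polygons given as vertex lists in the same planar coordinates as the crawl position; the function names are illustrative.

```python
def point_in_polygon(point, vertices):
    """Ray-casting test: True if point lies inside the simple polygon."""
    x, y = point
    inside = False
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y0 > y) != (y1 > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x < x_cross:
                inside = not inside
    return inside


def is_interior_panorama(position, neighbourhood_polygons):
    """True if the crawl position falls inside any candidate building polygon."""
    return any(point_in_polygon(position, bp) for bp in neighbourhood_polygons)
```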

Multiple Hits and Viewing Angle Dependency:
To verify whether or not a suitable Γ exists, we use the line of sight and perform a test for intersection with the polygons BP in NH. If there are intersections, we call each of them a hit. However, it is possible that we obtain multiple hits. The second hit is likely to be the intersection with the rear or side part of the same polygon. For multiple buildings in close proximity, there can be more than two hits. If this occurs, we simply sort the hits by distance to P and take the candidate with the shortest Euclidean distance as our correct hit. Multiple hits are more likely if the viewing angle onto Γ is very flat. This is one reason to avoid flat viewing angles, but the main reason is that we do not consider such samples good training input. Ideally, we aim at quasi-frontal shots of the building façades. Thus, we proceed as follows. First, we determine our hit and detect the edge where Γ is intersected. This edge is considered our façade plane. At the hit location we construct the façade normal and determine the angle between the normal and the line of sight, which represents our viewing angle (Figure 6). Ideally, this angle would be close to zero. The viewing angle depicted in Figure 3 is still acceptable; however, if the angle exceeds a certain threshold, we discard the image candidate. In the future, we plan to consider not only the central line of sight but also the bounding rays of our FOV, for cases where the hit of the line of sight might not represent the actual central building content but rather a different building polygon within the bounds of the FOV. Figure 5 depicts crawled imagery for all four classes. The first two rows show examples we consider good, whereas the last row demonstrates some negative examples.
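The hit test and the viewing angle computation can be sketched in Python as follows. The sketch assumes planar metric coordinates with x pointing east and y pointing north, headings measured clockwise from north, and polygons as vertex lists; the function names are illustrative, and the angle threshold for discarding candidates is left to the caller since no value is given here.

```python
import math


def sight_end(position, heading_deg, length=20.0):
    """End point of the line of sight (heading clockwise from north)."""
    h = math.radians(heading_deg)
    return (position[0] + length * math.sin(h),
            position[1] + length * math.cos(h))


def segment_hit(p, q, a, b):
    """Parameter t in [0, 1] along p->q where it crosses edge a->b, else None."""
    rx, ry = q[0] - p[0], q[1] - p[1]
    sx, sy = b[0] - a[0], b[1] - a[1]
    denom = rx * sy - ry * sx
    if abs(denom) < 1e-12:  # parallel segments are treated as no hit
        return None
    t = ((a[0] - p[0]) * sy - (a[1] - p[1]) * sx) / denom
    u = ((a[0] - p[0]) * ry - (a[1] - p[1]) * rx) / denom
    return t if 0.0 <= t <= 1.0 and 0.0 <= u <= 1.0 else None


def first_hit(position, heading_deg, polygons, length=20.0):
    """Closest hit as (polygon index, facade edge, viewing angle in degrees)."""
    end = sight_end(position, heading_deg, length)
    best = None
    for idx, poly in enumerate(polygons):
        for i in range(len(poly)):
            a, b = poly[i], poly[(i + 1) % len(poly)]
            t = segment_hit(position, end, a, b)
            if t is not None and (best is None or t < best[0]):
                best = (t, idx, (a, b))
    if best is None:
        return None
    _, idx, (a, b) = best
    rx, ry = end[0] - position[0], end[1] - position[1]
    ex, ey = b[0] - a[0], b[1] - a[1]
    # |cos| of the angle between sight line and facade edge equals
    # the sine of the angle between sight line and facade normal.
    c = abs(rx * ex + ry * ey) / (math.hypot(rx, ry) * math.hypot(ex, ey))
    return idx, (a, b), math.degrees(math.asin(min(1.0, c)))
```

Sorting by the ray parameter t is equivalent to sorting hits by Euclidean distance to the crawl position, since all hits lie on the same line of sight.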

STREET-VIEW BASED IMAGE CLASSIFICATION
At the moment, we limit our classification problem in terms of the number of classes. Thus, one might argue about the classifier of choice. From our point of view, it is worthwhile not to restrict ourselves to handcrafted features like HOG, SIFT or SURF but also to investigate learned features from CNNs. Several works show that on small-scale datasets with homogeneous distribution, the performance of handcrafted features can be considered on a par with learned ones, whereas larger and more heterogeneous datasets lead to superiority of CNNs (Antipov et al., 2015; Fischer et al., 2014). Since we are crawling Street View images, we effectively have a vast amount of training data available; our limiting factor is the availability of correct ground truth for the building use.

Bag-of-Words Classification
For comparison, we applied an existing implementation of a Bag-of-Words classifier, based on SURF features and a multiclass linear SVM. The underlying training and test database is described in more detail in section 4.2. The original training set is randomly split into 80% actual training set and 20% validation set. SURF features for each image are extracted and subsequently clustered using K-Means to create the visual vocabulary. Based on this vocabulary, a multiclass linear SVM is trained on the training set and evaluated on the validation set. The average accuracy on the validation set is only 62%, and the average accuracy on the training set is similarly low at 63%. This classifier is then applied to a test set with available ground truth (the same as in section 4.2). The average accuracy here is 41%. These results are clearly insufficient, thus an alternative approach is required.
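The role of the visual vocabulary can be illustrated by the quantisation step that turns the local descriptors of an image into a word histogram. This Python sketch assumes that the descriptors and the K-Means cluster centres are already available as coordinate tuples; feature extraction, clustering and SVM training are omitted, and the function names are illustrative.

```python
import math


def nearest_word(descriptor, vocabulary):
    """Index of the vocabulary entry (cluster centre) closest to the descriptor."""
    return min(range(len(vocabulary)),
               key=lambda k: math.dist(descriptor, vocabulary[k]))


def bow_histogram(descriptors, vocabulary):
    """Normalised histogram of visual word occurrences for one image."""
    hist = [0.0] * len(vocabulary)
    for d in descriptors:
        hist[nearest_word(d, vocabulary)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist
```

The resulting fixed-length histogram, one per image, is what the linear SVM receives as its feature vector.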

Pre-trained Convolutional Neural Network
The data we use for training and testing the CNN is the same as in section 4.1, therefore we further elaborate on it here. Our training set consists of 8000 images (4 classes with 2000 images each) and the validation set contains at least 70 images per class. However, the original training set is smaller: roughly 2200 images with a distribution of 19% commercial, 22% hybrid, 43% residential and 16% special use. Thus, we use data augmentation to provide an equal number of training samples for each class. To this end, we randomly pick images and randomly perform one of these three manipulations: 1.) flip the image about its vertical axis, 2.) crop and resize to the original dimension, 3.) define a random 2D affine transformation (within a certain range), warp the image and resize to the original dimension. For our first proof of concept we use transfer learning on the imagenet-vgg-f model from (Chatfield et al., 2014). For further information about the architecture, we refer to that reference. To adapt this network to our needs, we remove the last two layers (the fully connected fc8 layer and the softmax layer) and add a custom fc8 layer, which has an output depth of only 4 as opposed to the original output depth of 1000. As the final layer we add a cross-entropy layer to compute the training loss. Additionally, we add two dropout layers, one between fc6 and fc7 and one between fc7 and fc8, with a dropout rate of 0.5 each, since they were probably removed in the testing phase of the original network. During the training phase, we use jittering to reduce overfitting. Within each training batch we randomly flip and crop images. On top of that, we apply an alteration of the RGB channel intensities using PCA, as reported in (Krizhevsky et al., 2014). We use a batch size of 40 images and a fixed learning rate of 0.0001. After 96 epochs, the top-1 training error is 0.725% and the top-1 validation error is 21.4% (Figure 7). We run the network on a test set (the same as for the BoW classifier), which, however, also contains images from the evaluation set.
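The class balancing by augmentation can be sketched as follows. The Python sketch assumes images represented as lists of pixel rows and only implements the flip manipulation; cropping and affine warping are omitted, and the function names are illustrative.

```python
import random


def flip_horizontal(image):
    """Mirror an image (list of pixel rows) about its vertical axis."""
    return [row[::-1] for row in image]


def balance_by_augmentation(samples_per_class, target, augment, seed=0):
    """Pad each class to `target` samples with augmented copies of random originals."""
    rng = random.Random(seed)
    balanced = {}
    for label, samples in samples_per_class.items():
        missing = max(0, target - len(samples))
        extra = [augment(rng.choice(samples)) for _ in range(missing)]
        balanced[label] = list(samples) + extra
    return balanced
```

With roughly 2200 original images and a 16% share for special use, about 350 originals would have to be padded to 2000 samples, so the rarest class relies most heavily on augmented copies.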
Here, we obtain an average accuracy of 75.9%. Table 1 lists the results for precision and recall. With 85%, the precision for residential is best, whereas the special use category is at the lower end with 63.3%. This is most likely due to the high intra-class variance of the special use category, whereas the residential class is more homogeneous in terms of visual similarity. In Figure 8, some examples of the classification are provided. We depict correct and wrong examples in the form of a confusion matrix. Columns represent ground truth, rows represent the corresponding predictions from the CNN. Correctly classified images are therefore displayed on the main diagonal; all remaining images are wrong classifications. Transfer to unknown data representation type: For comparison purposes, we additionally applied our trained net to data we used in a previous test, where humans were asked to classify input images into the respective building categories (Tutzauer et al., 2016). This database additionally provided two alternative representations for building objects: firstly, screenshots of textured meshes from Google Earth and secondly, screenshots of manually modelled untextured LOD3 building models. We picked the untextured LOD3 models as input to the CNN, since they only have an abstract resemblance to the original training data. In total we evaluated almost 80 images and achieved an average accuracy of 63.6%. There are two important issues: a) the CNN has not seen this representation type at all during the training phase and b) the LOD3 models additionally contain several samples with class-specific geometric properties on which the network was not trained. Nevertheless, this shows the transferability of the network even to a completely different representation type in the input data. Some examples are depicted in Figure 9.
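Per-class precision and recall follow directly from a confusion matrix laid out as described above, with rows as predictions and columns as ground truth. The following Python sketch shows the computation; the function name is illustrative, and any matrix passed to it in an example is toy data, not a result of this work.

```python
def precision_recall(confusion):
    """Per-class (precision, recall) for a matrix with rows = predictions,
    columns = ground truth."""
    n = len(confusion)
    result = []
    for c in range(n):
        true_pos = confusion[c][c]
        predicted = sum(confusion[c])                    # row sum
        actual = sum(confusion[r][c] for r in range(n))  # column sum
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        result.append((precision, recall))
    return result
```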

CONCLUSIONS AND FUTURE WORK
In this paper, we successfully linked Google Street View imagery to a database that contains semantic information for every contained building polygon. Such databases are available for a number of cities. Hence, it is potentially possible to generate large amounts of training data, which is a prerequisite for the successful application of Deep Learning frameworks for classification. A first test indicated that this approach is promising; a more thorough investigation is the subject of future work. To this end, some additional work has to be done in the processing step. Indoor scenes with limited geo-location accuracy have to be detected and eliminated. The incorporation of the bounding rays of the FOV might help in cases where the hit of the line of sight does not represent the actual central building content. Moreover, the FCN used for image analysis could be replaced by an object detector framework like Faster R-CNN (Ren et al., 2015), since we are ultimately only interested in the bounding boxes of buildings. However, pre-trained models do not contain a building class yet. Therefore, such a network has to be trained from scratch. In our investigations, we found that semantic data provided by the city administration can be ambiguous or even erroneous. This is an issue which at the same time shows the necessity of the proposed approach of automatic building use classification. For now, obviously wrong or ambiguous samples were discarded in an interactive post-processing step to provide a reasonable training input. In the future, we aim at training on a variety of architectural styles as well as performing the training phase in one city and testing in a different one to investigate transferability. For that purpose we want to train our own CNN architecture from scratch. Since we ultimately want to diversify further from the current four classes, it is conceivable to leverage the original building-related segmentation classes from the FCN (awning, balcony, door, window) as a meta-classifier. As an application for our approach, we envision the area-wide enrichment of crowd-sourced data like OSM building polygons.

Figure 1. Left: Initial crawling position. Right: Markers depicting each crawled position after the process has finished.

Figure 4. From left to right: SV images with ascending level of blurriness. From top to bottom: SV input data, edge images, output of the FCN evaluation. The colour coding in the last row is the same as in Figure 2.

Figure 2. Input image and corresponding output from the FCN evaluation. The semantic class building is depicted in blue, sky in red, road in yellow, plant in green and car in bright blue.

Figure 3. Left: Building polygons BP of neighbourhood NH, based on crawling position P (depicted with a red cross, see also Figure 6). Right: Corresponding SV image Ι.

Figure 5. Columns: Our four classes (from left to right): commercial, hybrid, residential, special use. The first two rows depict samples considered to be good, whereas the last row shows bad examples.

Figure 6. Viewing angle dependency. The red bounding box depicts the detected Γ. The straight line emerging from Γ is the façade normal, whereas the line of sight is depicted in green. The viewing angle is the enclosed angle between those lines.

Figure 8. Predictions of the approach described in section 4.2, depicted in the form of a confusion matrix. The main diagonal entries are correct predictions. Please note that some of the ground truth labels themselves are ambiguous, or that the correct class is hard to identify even for humans. Example 1: row 2, column 4 was classified as hybrid but has the ground truth label special use. This is actually a care facility, and we class all care facilities as special use. Example 2: row 4, column 3 is clearly a building under construction. The residential label is correct, but we trained the network on several construction sites with the label special use, hence the respective prediction. (Note: the special use class is labelled as unknown in the images here.)

Figure 9. Results from prediction of the pre-trained CNN. The first row shows some correct predictions, from left to right: commercial, residential, special use. The second row depicts wrong classifications, from left to right, denoted as ground truth vs predicted: residential vs commercial, hybrid vs residential, residential vs special use (note: special use is labelled as unknown here).

Table 1. Precision and recall after evaluation on our test set (a value of 1.0 equals 100%).