GAN-BASED SYNTHESIS OF DEEP LEARNING TRAINING DATA FOR UAV MONITORING

Wind energy is a critical part of the transition away from fossil and nuclear energy. The price pressure on the renewable energy sector demands a reduction of the costs of the regular inspections currently carried out by industrial climbers. Drone-based video inspection reduces these costs and increases the safety of the inspection personnel. To further increase throughput, automatic or semi-automatic solutions to analyze these videos are needed. However, modern machine learning architectures need a lot of data to work reliably. This is by design a problem, as structural damage is rather rare in industrial infrastructure. Our proposed approach uses Generative Adversarial Networks to generate synthetic unmanned aerial vehicle imagery. This allows us to create a sufficiently large training dataset (> 10³) from a dataset which is at least an order of magnitude smaller (approx. 10²). We show that we can increase the classification accuracy by up to 6 percentage points.


INTRODUCTION
Onshore and especially offshore wind energy farms are crucial to overcoming the use of fossil fuels or nuclear energy to generate electricity. Due to high price pressure on the energy market, ongoing costs have to be reduced. Regular inspection carried out by professional industrial climbers is essential to prevent structural damage and to assure optimal performance of the turbines. New technological approaches are needed to lower the costs of this labor-intensive and thus expensive method. This applies to a wide range of other industries running hard-to-reach infrastructure as well. Our research project uses unmanned aerial vehicle (UAV)-based video inspection to avoid highly trained and thus costly climbers. Furthermore, UAV inspection improves the safety of the inspection task, as an error will not be fatal. UAV-based inspection is already used in a multitude of scenarios, e.g. the inspection of bridges (Metni, Hamel, 2007, Hallermann, Morgenthal, 2014), industrial facilities (Nikolic et al., 2013), power lines (Jones, 2005, Deng et al., 2014), poles (Sa et al., 2015), buildings (Phung et al., 2017) and power facilities (Jordan et al., 2017). For a more in-depth overview of the possibilities and limitations of UAV-based inspection see (Morgenthal, Hallermann, 2014), and for an extensive review of applications of UAV inspection see (Jordan et al., 2017).

Although the acquisition of the images or video can now be done by UAV operators, experts are still needed to review the UAV videos, especially since the large majority (> 99%) of the frames contain no damage at all. Thus semi- or fully automatic inspection is required in order to cope with the huge amounts of video data. A problem by design with machine learning-based structural-damage detection is that damage observations are quite rare, thus there is not much training data available. In computer vision this problem is called one-shot or low-shot learning, depending on the scarcity of data. Object detectors, here damage detectors, using modern deep learning architectures such as Mask R-CNN (He et al., 2017) or YOLO (Redmon, Farhadi, 2018) are quite advanced and show state-of-the-art detection performance. However, these need a large number of data samples to learn from (approx. > 10³). Due to the very small number of damage observations in UAV-based inspections, these networks cannot be applied in this domain in a straightforward manner. To tackle this problem, algorithmic solutions exist, e.g. the low-shot transfer detector (LSTD), which exchanges the classifiers of these detectors. In most cases, however, an increase in training data and the use of these established state-of-the-art detectors is favored. In addition, in these approaches the objects to learn from should not be morphologically too different from the objects to be detected. Yet, damage patterns such as rust can be morphologically quite diverse, as rust can form arbitrary shapes, thus a new solution is needed.

Here we propose an alternative solution using Generative Adversarial Networks (GANs), based on the pix2pix architecture (Isola et al., 2017), to generate synthetic samples. The pix2pix network transfers the style of one image to that of another. In their paper, Isola et al. show the transfer from segmentation masks to a street scene or a facade image, from aerial imagery to a map, from a photo taken at day to one taken at night, and from an edge image to a photo, and they use it to colorize images.
The basis for the generation of our synthetic samples is a segmentation mask of the image to be generated. The mask can either be created by altering the segmentation mask of an existing image, or by manually creating a new one. Altering means introducing or removing features, such as rust, contamination, or oil spills, in the synthetic image. This process allows us to introduce damage into any image of an intact structure, which can then be used as training data. In addition, we can remove unwanted features such as contamination on the lens.
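As an illustration of this alteration step, consider the following minimal sketch. It removes a lens-contamination class from a color-coded label mask and paints a new elliptical rust blob; the class colors, file names and blob geometry are illustrative assumptions, not part of our actual pipeline.

```python
# Minimal sketch of altering a segmentation mask, assuming color-coded
# label images (one RGB color per class). Colors, file names and the
# blob geometry are hypothetical.
import numpy as np
from PIL import Image

RUST = (255, 0, 0)        # hypothetical class color for rust
LENS_DIRT = (0, 255, 0)   # hypothetical class color for lens contamination
BACKGROUND = (0, 0, 0)

mask = np.array(Image.open("mask_0001.png").convert("RGB"))

# Remove lens contamination: relabel those pixels as background.
dirt = np.all(mask == LENS_DIRT, axis=-1)
mask[dirt] = BACKGROUND

# Introduce a new rust spot: paint an elliptical blob at a chosen position.
yy, xx = np.mgrid[:mask.shape[0], :mask.shape[1]]
cy, cx, ry, rx = 300, 400, 25, 15  # illustrative center and radii
blob = ((yy - cy) / ry) ** 2 + ((xx - cx) / rx) ** 2 <= 1.0
mask[blob] = RUST

Image.fromarray(mask).save("mask_0001_altered.png")
```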

METHOD
For an overview of the methodology, see Figure 1 (in the figure, training data and derived data and models are colored blue, test data red and synthetic data purple); more information on each step is provided below.

a) Image acquisition We trained and evaluated our approach on 310 images shot by a UAV, each showing a part of a wind turbine. The wind turbine is an on-shore model located in Bremerhaven on the German North Sea coast.
b) Train/Test split The images were split into 290 training images and 20 test images to avoid data leakage.
c) Image annotation The images were annotated by hand with labeled polygons marking the most important features, such as tower, oil, rust, or contamination on the lens. Of the 290 training images, 145 showed rust of different sizes and qualities. The annotations were automatically analyzed and the largest connected annotation was selected as the area of interest, i.e. the wind turbine, whereas the rest was considered background. A segmentation mask was created from the labeled polygon annotations (without the background).
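The selection of the area of interest could, for instance, be implemented as in the following sketch, assuming the polygon annotations have already been rasterized into a binary mask; function and variable names are illustrative.

```python
# Sketch: pick the largest connected annotated region (the turbine) from
# a binary mask of all annotations; everything else counts as background.
import numpy as np
from scipy import ndimage

def area_of_interest(annotation_mask: np.ndarray) -> np.ndarray:
    """Return a boolean mask of the largest connected annotated component."""
    labels, n = ndimage.label(annotation_mask > 0)
    if n == 0:
        return np.zeros(annotation_mask.shape, dtype=bool)
    sizes = np.bincount(labels.ravel())
    sizes[0] = 0  # label 0 is background; exclude it from the maximum
    return labels == sizes.argmax()
```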

d) Training of pix2pix
With the segmentation masks a network based on the pix2pix architecture (Isola et al., 2017) was trained. This network learns to transfer from one representation of a scene to another. In our case, this is from the segmentation mask of an image to the actual color image. To our knowledge, we are the first to propose the use of pix2pix in the context of inspection and monitoring. The pix2pix network was trained for 7500 epochs.
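As a rough illustration of the required data preparation, the sketch below concatenates each segmentation mask with its photograph side by side, the paired-image layout expected by common pix2pix reference implementations; the directory layout and file naming are assumptions.

```python
# Sketch: build paired (mask | photo) training images for a pix2pix-style
# network. Directory names and the side-by-side layout convention are
# assumptions about the concrete implementation.
from pathlib import Path
from PIL import Image

mask_dir, photo_dir, out_dir = Path("masks"), Path("photos"), Path("paired")
out_dir.mkdir(exist_ok=True)

for mask_path in sorted(mask_dir.glob("*.png")):
    mask = Image.open(mask_path).convert("RGB")
    photo = Image.open(photo_dir / mask_path.name).convert("RGB")
    pair = Image.new("RGB", (mask.width + photo.width, mask.height))
    pair.paste(mask, (0, 0))            # source representation on the left
    pair.paste(photo, (mask.width, 0))  # target photo on the right
    pair.save(out_dir / mask_path.name)
```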
e) Mask/image generation Segmentation mask images of the test set were altered such that the masks for contaminations on the lens were removed and additional masks for rust were introduced by hand. Furthermore, for each segmentation mask six copies were created using shifts of [-300, -150, -50, 50, 150, 300] in the image's x direction, and an additional six copies by random cropping and resizing to the original size, i.e. zooming in on the image. The pix2pix network can then be applied to the masks to generate synthetic images of wind turbines. As the background is not important for creating training data, and as it is quite hard to generate a realistic-looking environment, we omitted its generation and replaced it with an excerpt of an image showing background. The resulting images are 512×768 px in size.
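A minimal sketch of this augmentation is given below; the shifts are assumed to be in pixels, and the zoom range is an illustrative choice, as it is not specified above.

```python
# Sketch: create six shifted and six randomly zoomed copies per mask.
import random
from PIL import Image

SHIFTS = [-300, -150, -50, 50, 150, 300]  # assumed to be pixel offsets

def augment(mask: Image.Image) -> list:
    copies = []
    w, h = mask.size
    for dx in SHIFTS:
        # Translate along x; regions shifted in from outside stay empty.
        copies.append(mask.transform((w, h), Image.AFFINE, (1, 0, -dx, 0, 1, 0)))
    for _ in range(6):
        scale = random.uniform(0.6, 0.9)  # illustrative zoom range
        cw, ch = int(w * scale), int(h * scale)
        x0, y0 = random.randint(0, w - cw), random.randint(0, h - ch)
        # Random crop, then resize back to the original resolution.
        copies.append(mask.crop((x0, y0, x0 + cw, y0 + ch)).resize((w, h)))
    return copies
```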
f) Extraction of (synthetic) rust To train a classifier, we need patch-level training data. Thus a grid is overlaid on the synthetic images and all patches of size 64 × 64 containing rust are extracted. For the wind turbine photographs the same grid is used; however, this time all patches are extracted and divided into those containing rust and those containing background.
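The grid-based extraction could look as follows; the integer rust label and the criterion that a patch counts as rust as soon as it contains at least one rust pixel are assumptions for illustration.

```python
# Sketch: overlay a 64x64 grid and collect all patches whose mask region
# contains rust pixels.
import numpy as np

PATCH = 64
RUST_LABEL = 1  # hypothetical integer label for rust in the mask

def rust_patches(image: np.ndarray, mask: np.ndarray) -> list:
    patches = []
    h, w = mask.shape
    for y in range(0, h - PATCH + 1, PATCH):
        for x in range(0, w - PATCH + 1, PATCH):
            if (mask[y:y + PATCH, x:x + PATCH] == RUST_LABEL).any():
                patches.append(image[y:y + PATCH, x:x + PATCH])
    return patches
```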

g) Training of CNN A convolutional neural network (CNN) is trained with these patches to classify each patch as either containing rust or belonging to the background. For more details on the CNN see the CNN section below. We trained two classifiers, one with the GAN-generated rust patches added and one without them. In the case without the GAN samples we trained the classifier with 7344 background and 498 rust patches. When adding the GAN-generated rust samples there are 2874 rust patches and still 7344 background samples.
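A condensed sketch of such a training setup in PyTorch is shown below; the directory layout, optimizer and the ResNet18 stand-in architecture are illustrative choices, while the 64 epochs and batch size of 32 are taken from the CNN section below.

```python
# Sketch: train a binary rust/background patch classifier. Folder layout
# ("patches/train/<class>/") and the concrete architecture are assumptions.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
train_set = datasets.ImageFolder("patches/train", transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=32, shuffle=True)

model = models.resnet18(weights=None)          # trained from scratch
model.fc = nn.Linear(model.fc.in_features, 2)  # two classes: rust, background
model = model.to(device)

optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

for epoch in range(64):  # 64 epochs, as stated in the CNN section
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```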
h) Patch generation The test images are annotated completely, i.e. all rust spots are marked. All images of the test set are cut into patches using the grid approach described above.
i) Classification of patches These patches are classified using the trained CNN. Based on the annotations, we can evaluate the classifications with precision, recall and (macro) F1-score. There were 140 test patches containing rust and 5236 test patches containing background.

CNN
We tested several recent classifier architectures on this problem: AlexNet (Krizhevsky, 2014), VGG16 with batch normalization, ResNet101 (He et al., 2016), ResNeXt50 (32x4d), SqueezeNet 1.1, ShuffleNetV2 (x0.5) and MobileNetV2 (Sandler et al., 2018). If not stated otherwise, the networks were trained from scratch; however, some networks performed better using pretraining with ImageNet data. The number of epochs was 64 and a batch size of 32 was used. PyTorch (Paszke et al., 2019) was used for the patch classification experiment. For the pix2pix network the TensorFlow implementation was used (Abadi et al., 2016).
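For illustration, the compared architectures can be instantiated from torchvision as sketched below (using the torchvision >= 0.13 weights API); the mapping of names to constructors reflects the networks listed above, while the exact pretraining configuration per network is not reproduced here. The classification head of each network must still be adapted to the two classes, as in the training sketch above.

```python
# Sketch: build any of the compared classifiers, from scratch or with
# ImageNet pretraining (torchvision >= 0.13).
from torchvision import models

def build(name: str, pretrained: bool = False):
    weights = "IMAGENET1K_V1" if pretrained else None
    constructors = {
        "alexnet": models.alexnet,
        "vgg16_bn": models.vgg16_bn,
        "resnet101": models.resnet101,
        "resnext50_32x4d": models.resnext50_32x4d,
        "squeezenet1_1": models.squeezenet1_1,
        "shufflenet_v2_x0_5": models.shufflenet_v2_x0_5,
        "mobilenet_v2": models.mobilenet_v2,
    }
    return constructors[name](weights=weights)
```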

Evaluation metrics
For evaluation, precision P, recall R and the F1-score F1 are computed:

P = TP / (TP + FP),   R = TP / (TP + FN),   F1 = 2 · P · R / (P + R),

where TP are the true positives, e.g. the samples containing rust and being classified as rust, FP are the false positives, e.g. the samples which are classified to contain rust although there is no rust on the patch, and FN are the false negatives, e.g. the samples which contain rust but are classified as not containing rust. These are per-class measures. To yield one value for an experiment, one can compute the measure for each class and average it, yielding the macro scores. The macro scores are invariant to class abundance and are thus quite relevant for our case, as patches showing background are much more abundant than patches showing rust. For completeness we also provide the weighted scores, which are more commonly used. Here the class-wise scores are averaged using a weighting scheme proportional to the abundance of the classes.
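These scores can be computed, for example, with scikit-learn; the labels below are purely illustrative.

```python
# Sketch: macro and weighted precision/recall/F1 for the binary patch
# classification (0 = background, 1 = rust).
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 0, 0, 0, 1, 1]  # illustrative ground-truth labels
y_pred = [0, 0, 0, 1, 1, 0]  # illustrative predictions

# Macro: unweighted mean over classes, invariant to class abundance.
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
# Weighted: class-wise scores weighted by class abundance (support).
pw, rw, f1w, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
print(f"macro F1 = {f1:.4f}, weighted F1 = {f1w:.4f}")
```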

RESULTS
A resulting synthetic image is shown in Figure 2 a), alongside the original image (Figure 2 b), the original segmentation mask (Figure 2 c) and the altered segmentation mask (Figure 2 d). We show that rust can be added successfully and realistically, and that the lens contamination was removed. The whole images, along with the segmentation masks, can be used to train an object detection algorithm, such as Mask R-CNN or YOLO. In addition, the rust spots can be extracted to train a convolutional neural network (CNN), such as ResNet (He et al., 2016), to classify rust in images showing wind turbines. Furthermore, rust training samples for classification, i.e. images showing only rust, can be generated directly to train classification networks. Although pix2pix works deterministically, thus producing one output per segmentation mask, different-looking segmentation masks, and thus different resulting images, can be generated by manipulating the masks with random flips of background pixels to rust pixels at the border of the already existing rust segmentation masks (a minimal sketch is given at the end of this section). In addition, by moving either all segmentations, i.e. the tower including the rust annotations, or the rust annotations alone, different-looking results and thus more training samples can be achieved.

In Table 1 the results of the patch classification experiment are shown. If multiple parametrizations of an algorithm have been tested, only the best performing one, i.e. the one with the highest macro F1-score, is chosen. The corresponding result (with or without added GAN samples) with the same network parametrization is then presented as well, even though there might be other configurations where the corresponding experiment performed better. All tested networks, except resnext50_32x4d and squeezenet1_1, provide better results with the GAN-generated training samples added. However, the decrease in performance for resnext50_32x4d and squeezenet1_1 is small. The biggest increase is that of vgg16_bn. However, for this network, as well as for AlexNet, the increase is somewhat misleading, as the version without added GAN samples performed very poorly by always predicting background, thus achieving a classification performance of 49.34. With added GAN samples, vgg16_bn performed rather competitively with the other networks. Apart from these trivial improvements, resnet101, with 4.42 percentage points, shows the biggest gain. The best overall macro F1-score, and the only one above 90%, is achieved by shufflenet_v2_x0_5, which also features the second biggest improvement. More detailed results are shown in the Appendix.
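The border perturbation mentioned above could be sketched as follows, assuming an integer-labeled mask; the flip probability is an illustrative parameter.

```python
# Sketch: flip random background pixels bordering existing rust regions
# to rust, so the deterministic pix2pix network yields varied outputs.
import numpy as np
from scipy import ndimage

RUST_LABEL = 1  # hypothetical integer label for rust

def perturb_border(mask: np.ndarray, flip_prob: float = 0.5, rng=None) -> np.ndarray:
    if rng is None:
        rng = np.random.default_rng()
    rust = mask == RUST_LABEL
    # Background pixels directly adjacent to a rust region.
    border = ndimage.binary_dilation(rust) & ~rust
    flips = border & (rng.random(mask.shape) < flip_prob)
    out = mask.copy()
    out[flips] = RUST_LABEL
    return out
```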

CONCLUSION
We have shown that our approach is able to create arbitrarily large training datasets for deep learning-based detection or classification from few training samples. This approach offers another solution to the low-shot learning problem in inspection-based projects. We achieved improvements of up to 4.42 percentage points and could increase the absolute macro F1-score to over 90% solely through our methodology. With 91.52%, shufflenet with added GAN samples is the best performing network. Except for two networks, adding GAN samples improved the classification performance of all networks (cf. Table 1). Our approach allows us to compensate for the scarcity of image samples of damaged structures. The overall results allow for a semi-automatic inspection process in which attention is guided by the classification algorithm. This enables inspection teams to check the huge amounts of data acquired by drones today.

REFERENCES

Krizhevsky, A., 2014. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997.