WALL STONE EXTRACTION BASED ON STACKED CONDITIONAL GAN AND MULTISCALE IMAGE SEGMENTATION

Historical castle walls (castellated walls), which are part of Japan's cultural heritage, require regular maintenance, and the arrangement of individual wall stones must be recorded during maintenance work. Recently, image processing techniques have been applied to optimize the maintenance and management of infrastructure assets. In our previous study, we proposed an automatic method for efficiently extracting individual wall stone polygons using an improved multiscale image segmentation technique. However, the problem remained that wall stone polygons could not be extracted properly when there were no clear gaps or boundaries between stones. To address this problem, we improved the multiscale image segmentation technique used in our previous studies. The first improvement is that the region growing process selects the best combination of a plurality of objects to merge, instead of only two. The second improvement is a modification of the shape criterion. In addition, we propose a three-stage Stacked cGAN for wall stone edge detection that complements areas where stone edges are weak or broken. This approach consists of a coarse-to-fine image-to-edges translation network. The edge images derived from this method are used as an additional channel in multiscale image segmentation, with a higher weight than the RGB channels. We confirmed that the proposed method improved the separation performance of individual wall stone polygons. Furthermore, the proposed method is highly effective in reducing the difficulty of setting the scale parameter, to which segmentation results are usually sensitive and which requires trial and error.


INTRODUCTION
There are hundreds of historical castles (castellated walls) in Japan, some of which are designated as national treasures or important cultural properties. Castellated walls are regarded as tangible cultural heritage and require regular maintenance to preserve their original state (Fig. 1). For the dismantling and repair of a castellated wall, the individual stones constituting the wall must be identified in advance, and automated processing is expected to reduce this workload significantly. The following methods have conventionally been used for documenting cultural heritage. One approach is to capture the front view of the castellated walls using a laser scanner (Křemen et al., 2011; Vacca et al., 2012). However, it is difficult to identify individual wall stones after dismantling with this method. Another approach is to attach integrated circuit (IC) tags to the individual wall stones before dismantling the castellated walls for restoration (Ryu et al., 2014). However, attaching an IC tag to each stone and capturing images of the stones individually is quite burdensome. A study similar to ours detects bricks in masonry walls (Ibrahim et al., 2019). That approach combines U-Net based brick seed localization with the Watershed algorithm for accurate instance segmentation of bricks. However, the result depends almost entirely on how well the seed regions are extracted by U-Net.
In our previous study, we proposed a method for efficiently extracting individual wall stone polygons using an improved multiscale image segmentation technique (Sakamoto et al., 2018). Focusing on the fact that many wall stones have a shape close to their convex hull, we introduced a new shape criterion called convex hull fitness and showed that it improves the extraction of wall stone polygons.
Even with this method, however, the problem remained that wall stone polygons could not be extracted properly when there were no clear gaps between stones or when the textures of adjacent stones were very similar (shown by the red circle in Fig. 2).
In this study, we introduce the Conditional GAN (Generative Adversarial Network), a kind of deep learning technique, to complement the unclear stone edges in the target image. By using the extracted edge images as supplementary input, we confirmed that the separation performance of individual wall stone polygons was improved and, at the same time, that setting the scale parameter in image segmentation became less difficult.

Extraction of Wall Stone Polygons by Multiscale Image Segmentation
Multiscale image segmentation is a region growing approach that evaluates and merges image regions in units referred to as objects (Baatz and Schäpe, 2000; Chen et al., 2005; Esch et al., 2008; Li et al., 2009).
Each region (object) starts from one pixel at the initial stage. The decision to merge neighboring regions is made by evaluating the change in heterogeneity of the objects, which includes both a spectral component and a shape component. This relationship is formulated according to the following equation.
F = w_{color} \cdot F_{color} + w_{shape} \cdot F_{shape}    (1)

where
Fcolor = heterogeneity in spectral component
Fshape = heterogeneity in shape component
wcolor = weight for Fcolor
wshape = weight for Fshape

The degree of object merging is regulated by a parameter called the scale parameter (SP), which also indirectly affects the size of the derived objects. Neighboring regions are merged when the evaluation value F does not exceed the square of SP. SP starts from a small value, and all possible merges of objects are performed at that value. When no more objects can be merged, the SP is incremented. This process is repeated until the SP reaches a predefined value.
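The merge rule and the SP schedule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the schedule treats each candidate merge as having a fixed cost F, whereas in the actual procedure the costs change as objects grow.

```python
def merge_cost(f_color, f_shape, w_color, w_shape):
    """Overall heterogeneity change F = w_color * F_color + w_shape * F_shape."""
    return w_color * f_color + w_shape * f_shape


def may_merge(f_color, f_shape, w_color, w_shape, sp):
    """A merge is allowed while F does not exceed the square of SP."""
    return merge_cost(f_color, f_shape, w_color, w_shape) <= sp ** 2


def sp_schedule(costs, sp_start, sp_max, sp_step):
    """Toy schedule: record at which SP each fixed-cost merge becomes allowed.

    Simplifying assumption: costs do not change after other merges happen.
    Returns the list of (sp, cost) pairs merged and the costs never merged.
    """
    pending = sorted(costs)
    merged_at = []
    sp = sp_start
    while sp <= sp_max:
        threshold = sp ** 2
        merged_at.extend((sp, c) for c in pending if c <= threshold)
        pending = [c for c in pending if c > threshold]
        sp += sp_step  # no more merges possible at this SP, so increment it
    return merged_at, pending


merged, left = sp_schedule([3, 20, 90, 500], sp_start=2, sp_max=10, sp_step=4)
print(merged)  # merges unlocked at each SP value
print(left)    # merges still blocked at the upper limit of SP
```

Because the threshold grows quadratically with SP, small changes in the upper limit of SP can unlock many merges at once, which is one way to see why this limit is hard to tune.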
Fcolor is defined using the average and/or standard deviation calculated from each color component (channel) of pixels within the objects.
where
p = an object before the merge
r = an object after the merge
{p} = a set of objects neighboring to object p
s = subset of {p}
n = number of pixels within an object
σi = average and/or standard deviation of the i-th channel in an object
wci = weight for the i-th channel
N = number of channels

Fshape is calculated using several types of shape criteria that regulate geometric properties for the merged objects.
F_{shape} = \sum_{i=1}^{K} ws_{i} \cdot f_{i}    (3)

where
fi = heterogeneity in the i-th shape criterion
wsi = weight for the i-th shape criterion
K = number of shape criteria

This method is very well suited to the extraction of wall stone polygons because of the gaps between stones, the differences in texture between individual stones, and the geometric stability of each stone polygon. In addition, by restricting the size of the generated objects based on prior knowledge of the dimensions of the stones, excessive merging of objects can be suppressed.
On the other hand, there is a problem that adjustment of the upper limit for the SP requires trial and error. This is because the region of a stone is divided into a plurality of polygons when the SP is small, and a plurality of stone regions are merged into one polygon when the scale parameter is large.
Based on our previous study (Sakamoto et al., 2018), it has been confirmed that the compactness (fcmpct) and the convex hull fitness (fcnvx) are both effective as shape criteria, which are defined as follows.
where
l = boundary length of an object
c = number of pixels in the convex hull of an object
m = boundary length of the convex hull of an object
t = adjustment parameter in the range 0 ≤ t ≤ 1

The convex hull fitness was originally proposed in our previous work. Unlike the formulations in existing studies, in this study the number of objects merged at one time is not limited to two. That is, in the region growing process, the best combination is selected from all possible combinations of neighboring objects. In addition, the definition of the convex hull fitness is modified to make it a more suitable index.
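Selecting the best combination of neighbors, rather than the best single neighbor, can be sketched as an exhaustive search over subsets. The code below is only an illustration under stated assumptions: `best_merge_subset` and the toy cost function are hypothetical, and the toy cost (distance of the merged area from an expected stone area) stands in for the evaluation value F used in the paper.

```python
from itertools import combinations


def best_merge_subset(neighbors, cost_fn, max_size=None):
    """Return (subset, cost) with the lowest cost over all non-empty subsets.

    neighbors: names of the objects adjacent to the object being grown.
    cost_fn:   maps a tuple of neighbor names to a merge cost.
    max_size:  optional cap on subset size; max_size=1 gives pairwise merging.
    """
    limit = len(neighbors) if max_size is None else min(max_size, len(neighbors))
    best_subset, best_cost = None, float("inf")
    for k in range(1, limit + 1):
        for subset in combinations(neighbors, k):
            cost = cost_fn(subset)
            if cost < best_cost:
                best_subset, best_cost = subset, cost
    return best_subset, best_cost


# Toy example: object p has 40 pixels and three neighbors (hypothetical areas).
AREAS = {"a": 25, "b": 35, "c": 10}
P_AREA, EXPECTED_STONE_AREA = 40, 100


def toy_cost(subset):
    merged_area = P_AREA + sum(AREAS[name] for name in subset)
    return abs(EXPECTED_STONE_AREA - merged_area)


print(best_merge_subset(list(AREAS), toy_cost))              # multi-object merge
print(best_merge_subset(list(AREAS), toy_cost, max_size=1))  # pairwise merge only
```

The number of subsets grows exponentially with the number of neighbors, so a practical implementation would likely need a cap such as `max_size` or some pruning; the source does not say how this is handled.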
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B2-2020, 2020 XXIV ISPRS Congress (2020 edition)

From our previous knowledge (Sakamoto et al., 2018), the processing parameters are adjusted as follows.
(1) Multi-stage parameter setting: The setting of the SP is divided into several stages, and the weights for the spectral evaluation index and the shape evaluation index are changed from stage to stage.
(2) Balance between spectral index and shape index: At the stages where the SP is small, the weight for the spectral index (color and texture information) is made large, and the weight for the shape index (shape stability) is gradually increased as the SP increases.
(3) Balance in shape criteria: In the initial stage, the weights for fcmpct and fcnvx are made approximately equal, and the weight of fcnvx is increased as the SP increases.
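The three rules above amount to a stage table indexed by SP. The sketch below shows one possible encoding; every number in it is a hypothetical placeholder chosen only to follow the stated trends, since the actual stage boundaries and weights are not given here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    sp_upper: float  # the stage applies while SP <= sp_upper
    w_color: float   # weight for the spectral index
    w_shape: float   # weight for the shape index
    w_cmpct: float   # weight for compactness within the shape index
    w_cnvx: float    # weight for convex hull fitness within the shape index


# Hypothetical values: spectral weight falls, shape weight rises, and the
# convex hull fitness gains weight over compactness as SP increases.
SCHEDULE = [
    Stage(sp_upper=10, w_color=0.9, w_shape=0.1, w_cmpct=0.5, w_cnvx=0.5),
    Stage(sp_upper=30, w_color=0.7, w_shape=0.3, w_cmpct=0.4, w_cnvx=0.6),
    Stage(sp_upper=60, w_color=0.5, w_shape=0.5, w_cmpct=0.2, w_cnvx=0.8),
]


def stage_for(sp, schedule=SCHEDULE):
    """Return the first stage whose upper SP bound covers the given SP."""
    for stage in schedule:
        if sp <= stage.sp_upper:
            return stage
    raise ValueError(f"SP {sp} exceeds the last stage bound")


print(stage_for(5))
print(stage_for(45))
```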
The processing result based on only the image segmentation is used as a baseline for evaluating the effect of newly introducing the edge images generated by the proposed Conditional GAN approach.

Edge Extraction of Wall Stones by Conditional GAN
Overview:
Here we describe a methodology for extracting the edges of wall stones and using them as complementary input to image segmentation, based on one of the Generative Adversarial Network (GAN) techniques (Goodfellow et al., 2014). GANs learn a loss that tries to classify whether an output image is real or fake, while simultaneously training a generative model to minimize this loss. Isola et al. (2017) proposed pix2pix, a conditional GAN (cGAN) framework for image translation. They introduced a "U-net" based architecture (Ronneberger et al., 2015) as the generator and a convolutional classifier as the discriminator. We propose a network architecture derived from cGAN that enables edge detection for wall stones with unclear edge gaps or similar textures at their boundaries. Our task is therefore positioned as an "image-to-edges" translation problem.

Stacked cGAN:
First, we propose a three-stage Stacked cGAN for wall stone edge detection that enables an end-to-end learning architecture. Our approach is a coarse-to-fine image-to-edges translation network that consists of three stages: 1) thick edge generator, 2) refiner (generator for refinement) and 3) thin edge generator (Fig. 3). All stages follow an adversarial model, i.e. each stage consists of a generator and a discriminator. First, the thick edge generator produces coarse edges from the input wall stone image. Next, the refinement network fills in missing regions of the coarse edges by generating thicker edges. Finally, the thin edge generator sharpens the reinforced coarse edges. The models in all stages are trainable and differentiable.
The real images (ground truth) in all stages are rasterized images having different thickness generated from the same vectorized edge data.
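Producing ground-truth rasters of different thickness from one set of vector edges can be sketched in plain Python. This is an assumption-laden illustration: the function names, the brute-force distance test, and the sample coordinates are all hypothetical, and the tool actually used for rasterization is not stated in the source.

```python
import math


def point_segment_distance(px, py, ax, ay, bx, by):
    """Euclidean distance from point (px, py) to segment (ax, ay)-(bx, by)."""
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return math.hypot(px - ax, py - ay)
    # Projection parameter of the point onto the segment, clamped to [0, 1].
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def rasterize(polylines, width, height, thickness):
    """Mark every pixel within thickness/2 of any polyline segment as edge (1)."""
    half = thickness / 2.0
    image = [[0] * width for _ in range(height)]
    for line in polylines:
        for (ax, ay), (bx, by) in zip(line, line[1:]):
            for y in range(height):
                for x in range(width):
                    if point_segment_distance(x, y, ax, ay, bx, by) <= half:
                        image[y][x] = 1
    return image


# One hypothetical vector edge, rasterized twice at different thickness.
edges = [[(2, 10), (17, 10)]]
thin = rasterize(edges, 20, 20, thickness=1)
thick = rasterize(edges, 20, 20, thickness=4)
print(sum(map(sum, thin)), sum(map(sum, thick)))
```

Because both rasters come from the same vector data, the thin edge always lies inside the thick one, which keeps the targets of the different stages geometrically consistent.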

Network architecture:
The proposed network architecture for each stage consists of a U-net based generator and a Convolutional Neural Network (CNN) based discriminator. Fig. 4 shows our generator, which predicts the edge image from the input data. The original U-net is simply an encoder-decoder framework with skip connections between the encoding and decoding layers. Our U-net based generator is modified to reduce the number of pooling layers while keeping the number of convolutional layers. The original U-net reduces the smallest feature maps to 1/32 of the input image size, whereas our generator reduces them only to 1/8. When extracting edge features from small objects, high-resolution feature maps are very important for preserving edge information that would otherwise vanish. Our discriminator, on the other hand, is a very simple convolutional image classifier that takes a pair of images, i.e. "input image (given to generator) and generated image (derived from generator)" or "input image (given to generator) and real image (true edge image)".
The adversarial objective of each stage uses the following notation:
i = index of stage
Gi = generator in stage i
x = input data
y = ground truth
z = input noise
The input noise is fed in through a dropout layer.
The generator Gi tries to minimize this objective against an adversarial discriminator Di that tries to maximize it. Additionally, our generator is expected not only to deceive the discriminator but also to produce output close to the ground truth in color space. For this term we used the L1 distance as the loss function. Our final loss function for Stacked cGAN is defined as follows.
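For orientation, the pix2pix formulation of Isola et al. (2017), on which each stage is based, can be written per stage as below. This is a sketch under assumptions rather than the paper's final loss: the discriminator symbol D_i, the weight \lambda, and the per-stage form itself are assumed here, and how the losses of the three stages are combined is not specified in the text above.

```latex
\begin{align}
\mathcal{L}_{cGAN}(G_i, D_i) &= \mathbb{E}_{x,y}\left[\log D_i(x, y)\right]
  + \mathbb{E}_{x,z}\left[\log\left(1 - D_i(x, G_i(x, z))\right)\right] \\
\mathcal{L}_{L1}(G_i) &= \mathbb{E}_{x,y,z}\left[\left\lVert y - G_i(x, z) \right\rVert_1\right] \\
G_i^{*} &= \arg\min_{G_i} \max_{D_i} \; \mathcal{L}_{cGAN}(G_i, D_i) + \lambda \, \mathcal{L}_{L1}(G_i)
\end{align}
```

Here x, y and z follow the notation of the text (input data, ground truth and input noise), and the L1 term is what pulls each generated edge image toward its ground truth.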
We use minibatch SGD and apply the Adam solver, with a learning rate of 0.0002 and momentum parameters β1 = 0.5, β2= 0.999.
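To make the role of these hyperparameters concrete, the following is a minimal scalar sketch of one Adam update (Kingma and Ba's algorithm) using the stated learning rate and momentum parameters. It is not the authors' training code; the function name and the `eps` value are assumptions.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.0002, beta1=0.5, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter.

    m, v: running first and second moment estimates; t: 1-based step count.
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# First step from zero-initialized moments: the step size is close to lr,
# independent of the gradient magnitude.
p, m, v = adam_step(param=1.0, grad=4.0, m=0.0, v=0.0, t=1)
print(p)
```

With β1 = 0.5 the first-moment average forgets old gradients faster than with the common default of 0.9.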

Utilization of edge image:
The edge images derived from the proposed Stacked cGAN are used as the fourth channel of the wall stone images processed by the baseline method (multiscale image segmentation). The channel allocated to the edge is processed with a higher weight than the RGB channels.
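The effect of weighting the edge channel more heavily can be sketched with a spectral merge cost. The code below assumes the two-object form of Baatz and Schäpe (2000), where the cost per channel is the increase in pixel-count-weighted standard deviation; the paper's own multi-object definition is not reproduced here, and the pixel values, weights and SP are hypothetical.

```python
from statistics import pstdev


def color_cost(obj_a, obj_b, channel_weights):
    """Sum over channels of w_c * (n_ab * sd_ab - (n_a * sd_a + n_b * sd_b)).

    obj_a, obj_b: lists of pixels, each pixel a tuple (R, G, B, E) where E is
    the edge channel. Assumed Baatz-Schaepe style two-object form.
    """
    n_a, n_b = len(obj_a), len(obj_b)
    merged = obj_a + obj_b
    cost = 0.0
    for c, w in enumerate(channel_weights):
        sd_a = pstdev([px[c] for px in obj_a])
        sd_b = pstdev([px[c] for px in obj_b])
        sd_ab = pstdev([px[c] for px in merged])
        cost += w * ((n_a + n_b) * sd_ab - (n_a * sd_a + n_b * sd_b))
    return cost


# A stone interior and a strip of pixels lying on a generated edge:
# almost identical in RGB, very different in the edge channel.
stone = [(120, 118, 115, 0)] * 4
edge_strip = [(121, 119, 116, 255)] * 2

equal = color_cost(stone, edge_strip, (1, 1, 1, 1))
edge_heavy = color_cost(stone, edge_strip, (1, 1, 1, 3))
sp = 30  # hypothetical scale parameter, so the merge threshold is 900
print(equal <= sp ** 2, edge_heavy <= sp ** 2)
```

In this toy case the merge across the edge is allowed with equal weights but blocked once the edge channel is weighted three times as heavily, which is the intended role of the fourth channel.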

Training Data
The training images of wall stone edges used for each proposed network model were prepared by drawing an edge wherever adjacent stones were different stones, regardless of whether there was a clear gap between them. The edge data were obtained in vector format and converted to raster images, changing the line thickness as necessary.
For the training and the inference, we implemented a model, which was extended based on the network of pix2pix, to suit the processing of our proposed method. Three wall stone images and the corresponding wall stone edge images were used as training data, and another two wall stone images were used for the test.
Here, the number of images used for training is relatively small for the following two reasons. First, when a particular castellated wall is targeted, there is not much variation in the edges of the wall stones to be extracted. Second, we confirmed in cross-validation that increasing the number of training images made almost no difference to the stability of model generation or to the performance of wall stone extraction. Thus, when the proposed method is applied in practical situations such as wall stone extraction for different types of castellated walls, it has the advantage of requiring considerably less training data.

Network Model
For comparison, we set the following three network models. Here the wall stone image to be learned is denoted by Img, the training edge image having a thickness of w pixels is denoted by Edgerealw, and the edge image generated so as to approach Edgerealw is denoted by Edgefakew. Also, a flow for generating a fake image (fake) so as to approach the conversion from the input image (input) to the true image (real) will be described as "input ↦ <real> ↦ fake".
(c) Type 3: A model in which stage 2 is replaced with "Edgefake40 ↦ <Edgereal80> ↦ Edgefake80" and stage 3 is replaced with "Edgefake80 ↦ <Edgereal10> ↦ Edgefake10" in model Type 2.

Fig. 5 shows the processing results when only pure image segmentation (the baseline method) is used. When only the spectral component is used as the evaluation index, individual wall stones cannot be extracted satisfactorily (Fig. 5 (b)). As shown in Fig. 5 (c) and (d), when the shape component is added as an evaluation index, the performance of wall stone polygon extraction improves. However, some polygons are generated across stone boundaries, and others are insufficiently merged.

Edge Generation Result
In the network model of Type 1, the training process was slightly unstable and in some cases failed to converge because of the unstable gradient from the discriminator. In contrast, in both three-stage network models, Type 2 and Type 3, stable learning was possible in all cross-validation runs. Fig. 6 (a) shows, for comparison, the edge detection result obtained with the Sobel filter. In this result, many small edges derived from the texture of the wall stones are detected, and the edge is broken where a stone has no clear boundary. Fig. 6 (b), (c) and (d) show the edge detection results of each stage of our proposed Stacked cGAN.
In Fig. 6 (b), which is the output of stage 1, desirable edge detection of wall stones with little noise is achieved, but interruptions of the edges are also observed. In Fig. 6 (c), in which thicker edges are output, many of the broken edges are repaired. However, where the interruption of an edge is large, the edge is not well connected due to a lack of context in the cGAN. Fig. 6 (d) shows the thinned edges generated by stage 3. Compared with the result of the Sobel filter, the edges at the boundaries of the wall stones are reproduced better and the noise is properly suppressed.

Table 1 shows the result of the quantitative evaluation of the number of extracted wall stones. In this table, "valid wall stones" are the stones that were successfully extracted. "Under-merged wall stones" are stones for which the extracted polygon covers less than 90% of the stone area, and "over-merged wall stones" are cases where one polygon is merged with more than 20% of the area of any of a plurality of wall stones. Baseline 1 is the method in which only the compactness is used as the shape criterion and only the conventional merging of two neighboring regions is performed. Baseline 2 is the method proposed in this study, which uses both the compactness and the modified convex hull fitness as shape criteria and which allows the region growing process to select the best combination of a plurality of regions instead of only two.
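The three categories of Table 1 can be made concrete with pixel sets. The sketch below rests on one reading of the definitions, which is an assumption: a stone is under-merged if its best-matching polygon covers less than 90% of it, and stones are over-merged if a single polygon covers more than 20% of each of two or more stones. The function name, the precedence of over-merged over under-merged, and the sample geometry are hypothetical.

```python
def classify_stones(stones, polygons, cover_min=0.9, spill_max=0.2):
    """Label each stone as 'valid', 'under-merged' or 'over-merged'.

    stones, polygons: dicts mapping a name to a set of (x, y) pixels.
    """
    over = set()
    for poly in polygons.values():
        # Stones of which this polygon takes more than spill_max of the area.
        hit = [s for s, px in stones.items() if len(px & poly) / len(px) > spill_max]
        if len(hit) >= 2:
            over.update(hit)
    labels = {}
    for name, px in stones.items():
        best = max((len(px & poly) / len(px) for poly in polygons.values()), default=0.0)
        if name in over:
            labels[name] = "over-merged"
        elif best < cover_min:
            labels[name] = "under-merged"
        else:
            labels[name] = "valid"
    return labels


def block(x0, x1, height=10):
    """Rectangle of pixels with x in [x0, x1) and y in [0, height)."""
    return {(x, y) for x in range(x0, x1) for y in range(height)}


stones = {"A": block(0, 10), "B": block(10, 20), "C": block(20, 30), "D": block(30, 40)}
polygons = {
    "P1": block(0, 10),   # matches stone A exactly
    "P2": block(10, 25),  # all of B plus half of C
    "P3": block(25, 30),  # the rest of C
    "P4": block(30, 36),  # 60% of D
    "P5": block(36, 40),  # the rest of D
}
print(classify_stones(stones, polygons))
```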
Baseline 2 improved the extraction performance of wall stone polygons compared to the conventional Baseline 1. Furthermore, all types of the proposed network models based on Stacked cGAN achieved performance exceeding the baseline methods. The improvement by Type 3 is particularly remarkable: in the object-based count for this case, 93.6% of wall stone polygons were extracted properly. In Type 3, the edge image is gradually improved as the stages progress, so the effect of complementing broken edges is considered to be higher than in the other types. As a result, Type 3 demonstrated the best performance for the extraction of wall stone polygons.
On the other hand, it should be noted that Type 1 has a problem with the stability of the training process, and its processing results therefore fluctuate. The Type 1 result shown here is not typical, as it comes from a run in which the model happened to converge well.
When only the conventional baseline method is applied, it is difficult to finely adjust the degree of merging of wall stone polygons by setting the upper limit of SP. However, we confirmed that when a wall stone edge image was used together with it, a stable processing result was obtained even when the upper limit of SP was varied relatively widely.

Fig. 9 is an example showing the effect of the combined use of the edge image generated by the proposed Stacked cGAN and multiscale image segmentation. In Fig. 9 (b), most of the outlines of the extracted wall stone polygons are generated along the edges, but not all the edges are reflected (edges in the red circle). As shown by the red circle in Fig. 9 (c), the wall stone polygons are appropriately extracted by the multiscale image segmentation even in the area where no edge is generated. This is a typical example of the effectiveness of the proposed method.

CONCLUSIONS
In the present study, we addressed the problem that the extraction performance of wall stone polygons degrades when there are no clear edges or boundaries between wall stones. To solve this problem, we improved the multiscale image segmentation technique used in our previous studies. The first improvement is that the region growing process selects the best combination of a plurality of objects to merge, instead of only two. The second improvement is that we modified the convex hull fitness, the shape criterion proposed in our previous study.
We proposed a new cGAN based network model, called Stacked cGAN, to deal with cases of missing boundaries between wall stones that could not be resolved by the improved image segmentation technique alone. This model made it possible to generate edges even in areas where the boundaries of wall stones were unclear, by using the existing wall stone edges as context.
We confirmed that the separation performance of individual wall stone polygons was improved by appending the edge image generated by the proposed method to the input channels of the baseline image segmentation. Furthermore, the proposed method is highly effective in reducing the difficulty of setting the scale parameter, to which segmentation results are usually sensitive and which requires trial and error.