DENSE 3D OBJECT RECONSTRUCTION USING STRUCTURED-LIGHT SCANNER AND DEEP LEARNING

Structured light scanners are intensively exploited in various applications such as non-destructive quality control at an assembly line, optical metrology, and cultural heritage documentation. While more than 20 companies develop commercially available structured light scanners, the accuracy of structured light technology is limited for fast systems. Surface discrepancies in the model often appear if the object's texture has severe changes in brightness or reflective properties. The primary source of such discrepancies is errors in the stereo matching caused by complex surface texture. These errors result in ridge-like structures on the surface of the reconstructed 3D model. This paper focuses on the development of a deep neural network, LineMatchGAN, for error reduction in 3D models produced by a structured light scanner. We use the pix2pix model as a starting point for our research. The aim of our LineMatchGAN is to refine the rough optical flow A and generate an error-free optical flow B̂. We collected a dataset (which we term ZebraScan) consisting of 500 samples to train our LineMatchGAN model. Each sample includes image sequences (S_l, S_r), ground-truth optical flow B, and a ground-truth 3D model. We evaluate our LineMatchGAN on a test split of our ZebraScan dataset that includes 50 samples. The evaluation shows that our LineMatchGAN improves the stereo matching accuracy (optical flow end point error, EPE) from 0.05 pixels to 0.01 pixels.


INTRODUCTION
Close-range photogrammetric techniques have proved to provide accurate and reliable non-contact 3D measurements in many applications, ranging from industrial ones to anthropology and cultural heritage (Bosemann, 2011, Remondino, 2011, Knyaz and Maksimov, 2014). Active photogrammetric systems based on structured light demonstrate high accuracy and high performance in obtaining multiple 3D coordinates of an object's surface. Structured light scanners are intensively exploited in various applications such as non-destructive quality control at an assembly line, optical metrology, and cultural heritage documentation. While more than 20 companies develop commercially available structured light scanners, the accuracy of structured light technology is limited for fast systems. Surface discrepancies in the model often appear if the object's texture has severe changes in brightness or reflective properties. The primary source of such discrepancies is errors in the stereo matching caused by complex surface texture. These errors result in ridge-like structures on the surface of the reconstructed 3D model. Many methods have been proposed to compensate for errors in stereo matching for structured light systems (Curless and Levoy, 1995, Wang et al., 2016, Taylor, 2012, O'Toole et al., 2014, Wang and Feng, 2014, Chen and Shen, 2018, Bian and Liu, 2016, Knyaz, 2010). While these methods reduce the surface distance error between the reconstructed and the ground-truth models, they cannot eliminate the discrepancies caused by uneven texture brightness. The problem of stereo matching can be considered a special case of optical flow estimation. For structured light systems, the matches between corresponding points result in a dense optical flow from the left to the right camera. Deep neural networks (Krizhevsky et al., 2012) have proved to be the most effective algorithms for robust optical flow estimation.
Moreover, a new generation of neural networks, commonly named Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), has recently been proposed. These networks can be trained for complex image-to-image translation tasks such as object transfiguration, image super-resolution, and noise reduction (Chen et al., 2020). A GAN model consists of two networks: a generator G and a discriminator D. The two networks are trained simultaneously on competing tasks. The aim of the discriminator D is to distinguish 'real' samples B from the training dataset from 'fake' samples B̂ produced by the generator G. The objective of the generator G is the synthesis of 'fake' samples B̂ that are indistinguishable from random samples B from the training dataset. This paper focuses on the development of a deep neural network, LineMatchGAN, for error reduction in 3D models produced by a structured light scanner. We use the pix2pix model as a starting point for our research. The aim of our LineMatchGAN is to refine the rough optical flow A and generate an error-free optical flow B̂. We hypothesize that our model can simultaneously process the object images S from the left and the right cameras and the rough optical flow A to fix errors in the preliminary stereo matching. Our proposed pipeline is given in Figure 1.
We collected a dataset (which we term ZebraScan) to train our LineMatchGAN model. The dataset consists of 500 samples. Each sample includes image sequences (Q_l, Q_r), object images (S_l, S_r), ground-truth optical flow B, and a ground-truth 3D model. We evaluate our LineMatchGAN algorithm on a test split of our ZebraScan dataset that includes 50 samples. The evaluation shows that our LineMatchGAN improves the stereo matching accuracy (optical flow end point error, EPE) from 0.05 pixels to 0.01 pixels. This improvement reduces the surface error from 0.05 mm to 0.015 mm for a working volume of 300 × 300 × 300 mm³. Our pipeline can be used to improve the accuracy of existing hand-held laser scanners. The evaluation of our LineMatchGAN model using our DeepScan scanner demonstrates that the proposed approach is robust against complex object textures.
The inference time of our LineMatchGAN is 1 second for an input with a resolution of 1024 × 768 pixels on an NVIDIA Jetson TX2 GPU. Such computational efficiency allows the proposed pipeline to be used for on-line data processing to improve the accuracy of existing hand-held scanners. We made our dataset and model publicly available (see footnote 1).
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B2-2020, 2020 XXIV ISPRS Congress (2020 edition)

Structured Light Scanners
Today structured light scanners are widely used in various tasks of photogrammetry and optical metrology. The works of (Akca et al., 2007) and (Chalmers et al., 2001) demonstrate the application of 3D scanners for documentation and visualization of cultural heritage. The first work presents the results of 3D modelling of two cultural heritage objects, where a close-range coded structured light system was used to generate the 3D models. In the second work, a low-cost 3D scanner based on structured light projection with a versatile coloured stripe pattern approach was designed. It adopts a set of patterns produced by recursive subdivision, which mixes thin stripes (from which the shape is reconstructed) and coloured bands that are used to reindex the stripes.
Structured light scanners are also used for shape measurement and 3D object surface reconstruction. Lin et al. developed an automatic 3D color shape measurement system (Lin, 2020) based on images recorded by a stereo camera. 3D shape measurement techniques have also been widely used in industrial inspection, intelligent manufacturing, reverse engineering, and many other areas (Gupta et al., 2011). Pribanic et al. have developed a multiple phase-shifting method. Unlike the original approach, this method is not influenced by wrapped phase computation inaccuracies, and it is faster than common LUT-based (search) methods. Xiao et al. have proposed a structured light measurement technique (Xiao et al., 2020) based on reverse photography. The technique includes an auxiliary reverse camera installed behind the structured light system. The camera makes it possible to unify the local 3D shape data acquired from multi-view structured light measurements into a global frame. Such an approach achieves a holistic 3D shape data fusion.
1 http://www.zefirus.org/en/linematchgan
Finally, using a structured light scanner in conjunction with deep learning methods makes it possible to generate depth maps of real-world scenes. Li et al. have proposed a novel method (Li et al., 2019) that combines structured light and deep learning stereo matching to calculate the depth map. To prevent holes in textureless areas during the stereo matching, a depth map is first predicted by a convolutional neural network. Then, a fine and accurate depth map is obtained by phase matching. The proposed method can generate a high-precision depth map and mitigate occlusion in the structured light system.

Mobile Scanners
Increasingly, the need arises for accurate and reliable reconstruction of three-dimensional objects in tasks where it is impossible to use expensive and bulky equipment. Recent progress in and availability of small and accurate industrial cameras have allowed the creation of low-cost mobile systems for the reconstruction of 3D objects. Piccirilli et al. developed a mobile sensor based on fringe projection techniques (Piccirilli et al., 2016). The goal of the developed sensor is 3D face acquisition: capturing a 3D model of a face and its color texture using a smartphone. The data acquisition, pattern generation, and reconstruction of the final 3D point cloud are all driven only by the smartphone. A method for robust 3D model reconstruction was proposed in (Kniaz, 2019). The method combines a mobile structured light 3D scanner with a deep learning technique for online 3D reconstruction. The deep learning-based approach performs stereo matching with a 3D scanner and achieves sub-millimeter accuracy in the object space.

Generative Adversarial Networks
The development of a new type of neural network known as generative adversarial networks (GANs) (Goodfellow et al., 2014) made it possible to take a significant step forward in the field of image processing. GANs consist of two deep convolutional neural networks. A Generator network tries to synthesize an image that is visually indistinguishable from a given sample of images from the target domain. A Discriminator network tries to distinguish the 'fake' images generated by the Generator network from the real images in the target domain. The Generator and Discriminator networks are trained simultaneously. This approach can be considered an adversarial game between two players.
One of the first tasks addressed using GANs was image synthesis. The image-to-image translation problem was solved using a conditional GAN termed pix2pix. Such a network learns a mapping G : (x, z) → y from an observed image x and a random noise vector z to an output y. This method uses a sum of two loss functions: a conditional adversarial objective function and an L1 distance. However, for many image-to-image translation tasks it is not possible to generate paired training datasets.
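For reference, the sum of the two loss functions mentioned above is commonly written as follows. This is a sketch in the notation of the original pix2pix formulation; the weight λ is a hyperparameter whose value is not stated in this text.

```latex
\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}\left[\log D(x, y)\right]
                         + \mathbb{E}_{x,z}\left[\log\left(1 - D(x, G(x, z))\right)\right]

\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}\left[\lVert y - G(x, z) \rVert_1\right]

G^{*} = \arg\min_{G} \max_{D} \; \mathcal{L}_{cGAN}(G, D) + \lambda \, \mathcal{L}_{L1}(G)
```

The adversarial term pushes the output towards the distribution of real samples, while the L1 term keeps it close to the paired ground truth.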
To overcome this difficulty, the CycleGAN model was proposed. The StyleGAN model (Karras et al., 2018) provides superior performance in the perceptual realism and quality of the reconstructed image. Unlike the common generator architecture that feeds the latent code through the input layer, StyleGAN adds a mapping of the input to an intermediate latent space, which controls the generator. Moreover, adaptive instance normalization (AdaIN) is used at each convolution layer. Gaussian noise is injected after each convolution, facilitating the generation of stochastic features such as hairstyle or freckles. The problems of the first StyleGAN model were partially eliminated in the second model, StyleGANv2 (Karras et al., 2019). In this model, the parameters were optimized and the neural network training pipeline was adjusted. These changes improved the quality of the results.

METHOD
The aim of our LineMatchGAN model is to improve stereo matching accuracy using a deep neural network. Unlike existing approaches (Li et al., 2019), where error correction is performed in the domain of the depth map, we perform processing in the domain of the optical flow (Wedel and Cremers, 2011). Specifically, we consider stereo matches as a sparse optical flow that can be densified and improved by a GAN model.
The rest of this section presents an overview of our 3D reconstruction pipeline. After that, the technical properties of our mobile scanner are discussed. Finally, we present details of our LineMatchGAN model.

Framework Overview
The first step in our pipeline is the acquisition of the structured light sequences S_l, S_r using a mobile scanner. After that, we use a hand-crafted stereo matching algorithm (Knyaz, 2010) S : (S_l, S_r) → A to generate the rough optical flow A. We use the rough optical flow A and the object images S_l, S_r as an input for our LineMatchGAN G : (A, S_l, S_r) → B̂. Finally, we feed the predicted refined optical flow B̂ to the hand-crafted 3D reconstruction algorithm to obtain a 3D mesh. We developed a mobile hand-held 3D scanner, DeepScan, to train and evaluate our LineMatchGAN model.

Mobile Scanner
We use a mobile scanner developed in our previous research (Kniaz, 2019, Kniaz et al., 2020). The developed scanner is based on the assumptions made by Knyaz et al. (Knyaz, 2010, Knyaz, 2015).
The whole system consists of two high-speed industrial cameras mounted on an aluminum beam. The cameras are separated by a baseline of 300 mm. The structured light illumination is provided by a mobile multimedia projector. The projector has an autonomous power supply sufficient for 40 minutes of operation.
We use an external synchronization clock to synchronize the cameras and the projector. The total weight of our system is 1.3 kg, which allows it to be used in the field for online scanning of archeological objects and cultural heritage. The complete system is presented in Figure 3. Technical specifications of the system are presented in Table 1.

LineMatchGAN
Our LineMatchGAN is based on the pix2pix model. Specifically, it is a conditional generative adversarial framework with a generator and a discriminator. Our model translates an input tensor X ∈ R^(W×H×8) into a corrected optical flow B̂ ∈ R^(W×H×2), where X is a concatenation of the rough optical flow A ∈ R^(W×H×2) and the images from the left and right cameras S_l, S_r ∈ R^(W×H×3). Therefore, our generator G learns a mapping G : (S_l, S_r, A) → B̂ from the input data to the corrected optical flow.
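The 8-channel input can be illustrated with a short sketch: 2 flow channels plus 3 colour channels for each of the two cameras. The function name `build_generator_input`, the channel order, and the (H, W, C) array layout are assumptions made for illustration; the text only states that X is a concatenation of A, S_l, and S_r.

```python
import numpy as np


def build_generator_input(A, S_l, S_r):
    """Concatenate rough flow (H, W, 2) with left/right RGB images (H, W, 3).

    Returns an array of shape (H, W, 8). The channel order (flow, left,
    right) is an assumption; the paper does not specify it.
    """
    A, S_l, S_r = (np.asarray(t, dtype=np.float32) for t in (A, S_l, S_r))
    if not (A.shape[:2] == S_l.shape[:2] == S_r.shape[:2]):
        raise ValueError("A, S_l and S_r must share the same height and width")
    if A.shape[2] != 2 or S_l.shape[2] != 3 or S_r.shape[2] != 3:
        raise ValueError("expected 2 flow channels and 3 colour channels per image")
    return np.concatenate([A, S_l, S_r], axis=2)


# Small dummy example (real inputs would be e.g. 768 x 1024 pixels).
H, W = 4, 6
A = np.zeros((H, W, 2), dtype=np.float32)
S_l = np.ones((H, W, 3), dtype=np.float32)
S_r = np.ones((H, W, 3), dtype=np.float32)
X = build_generator_input(A, S_l, S_r)
print(X.shape)  # (4, 6, 8)
```

Stacking the images and the rough flow along the channel axis lets a pix2pix-style generator see the texture and the preliminary matches for the same pixel at once.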
We made two main contributions to the pix2pix model. Firstly, we added an inverted residual block (Sandler et al., 2018) after each convolutional layer to increase the number of trainable parameters and improve the optical flow reconstruction accuracy. Secondly, we added a preprocessing module that converts the input rough optical flow to the [−1, 1] range suitable for the training pipeline.
[Algorithm 1 (fragment; only the final step is recoverable): calculate the difference between P_Right and P_Left.]
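One possible form of the range conversion is sketched below. The text does not describe how the preprocessing module maps the flow to [−1, 1]; dividing by a maximum expected displacement `max_disp` and clipping is an assumption, and both function names are hypothetical.

```python
import numpy as np


def normalize_flow(flow, max_disp):
    """Scale flow from [-max_disp, max_disp] pixels to [-1, 1], clipping outliers.

    max_disp is an assumed parameter: the largest displacement (in pixels)
    expected for the scanner geometry.
    """
    if max_disp <= 0:
        raise ValueError("max_disp must be positive")
    return np.clip(np.asarray(flow, dtype=np.float32) / max_disp, -1.0, 1.0)


def denormalize_flow(flow_norm, max_disp):
    """Map a network output in [-1, 1] back to pixel units."""
    return np.asarray(flow_norm, dtype=np.float32) * max_disp


# Displacements of -64, 0, 32 and 128 px with max_disp = 64 px;
# the last value exceeds the range and is clipped to 1.
flow = np.array([[[-64.0, 0.0], [32.0, 128.0]]])
print(normalize_flow(flow, 64.0))
```

The inverse mapping is needed after inference, because the 3D reconstruction step expects the flow in pixels.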

Dataset Generation
To train our LineMatchGAN framework, we collected a new dataset, ZebraScan. It includes synthetic and real images of four solid objects: Gnome, Vase, Nefertiti, and David. Some of the models were generated using a structured light scanner (Gnome, Vase). The others were collected from open-source 3D models available on the Internet. Synthetic images were generated using the Blender 3D creation suite. The dataset includes three splits: real, 'synthetic full', and 'synthetic reduced'.
The real split of the dataset (Figure 4) was collected using a structured light scanner. To generate the ground truth optical flow, we imported the ground truth 3D models into the scene and simulated the optical flow from the left to the right camera using the estimated external orientation of the cameras.
The 'synthetic full' split of the dataset consists of images that simulate a structured light scanner (Figure 2). We rendered the images from the left and the right cameras and simulated a moving light line. This split includes 300 pairs of images. We developed program code that processes the synthetic images and generates the rough optical flow. The algorithm is presented in Algorithm 1.
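The idea of deriving a rough optical flow from a moving light line can be sketched as follows. This is not the authors' Algorithm 1, of which only the final step (the difference between P_Right and P_Left) survives in this text. The sketch assumes rectified grayscale sequences, a roughly vertical light line, and one line position per frame; all function names and the threshold are hypothetical.

```python
import numpy as np


def line_positions(frame, threshold):
    """Per image row, return the column of the brightest pixel, or -1 if too dim."""
    cols = frame.argmax(axis=1)
    peaks = frame.max(axis=1)
    return np.where(peaks >= threshold, cols, -1)


def rough_flow_from_line_sequences(seq_left, seq_right, threshold=0.5):
    """Build a sparse left-to-right flow from two (T, H, W) image sequences.

    For each frame and row, the line is located in both views and the
    horizontal flow at the left position is P_right - P_left. Pixels never
    hit by the line stay NaN. The vertical component is set to zero because
    rectified images are assumed.
    """
    T, H, W = seq_left.shape
    flow = np.full((H, W, 2), np.nan, dtype=np.float32)
    rows = np.arange(H)
    for t in range(T):
        p_left = line_positions(seq_left[t], threshold)
        p_right = line_positions(seq_right[t], threshold)
        valid = (p_left >= 0) & (p_right >= 0)
        flow[rows[valid], p_left[valid], 0] = p_right[valid] - p_left[valid]
        flow[rows[valid], p_left[valid], 1] = 0.0
    return flow


# Toy sequence: the line moves one column per frame and appears
# two columns further to the right in the right camera.
seq_l = np.zeros((3, 2, 8))
seq_r = np.zeros((3, 2, 8))
for t in range(3):
    seq_l[t, :, t + 1] = 1.0
    seq_r[t, :, t + 3] = 1.0
flow = rough_flow_from_line_sequences(seq_l, seq_r)
print(flow[0, 1:4, 0])  # horizontal flow of 2 px at the three scanned columns
```

The result is sparse by construction, which matches the view of stereo matches as a sparse optical flow that the network later densifies.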
The 'synthetic reduced' split of the dataset was created to expand the training dataset (Figure 2). We used a special technique that not only creates images from the cameras but also simulates the processing of the sequences by a scanner. We simultaneously generated the ground-truth optical flow and the rough optical flow. This split contains 90 quadruples of images. The resolution of the rendered images is 1280 × 960 pixels. We used the PNG format for the color images and the OpenEXR format for the optical flow.

Network Training
The LineMatchGAN framework was trained on the training split of the ZebraScan dataset using the PyTorch library (Paszke et al., 2017). The training split includes 450 local image patches. The training was performed on an NVIDIA RTX 2080 GPU and took 12 hours. For network optimization, we use minibatch SGD with the Adam solver. We set the learning rate to 0.0002 and the momentum parameters to β1 = 0.5, β2 = 0.999, similar to the pix2pix model.
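To show where these hyperparameters enter, the sketch below implements a single standard Adam update in NumPy. It is not the authors' training code (which uses PyTorch); the function name `adam_step` and the value eps = 1e-8 are assumptions.

```python
import numpy as np


def adam_step(param, grad, m, v, t, lr=2e-4, beta1=0.5, beta2=0.999, eps=1e-8):
    """One Adam update with the learning rate and betas quoted in the text.

    m and v are the running first and second moment estimates, t is the
    1-based step counter. eps is an assumed value.
    """
    m = beta1 * m + (1.0 - beta1) * grad          # first moment (momentum)
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # second moment
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v


# Minimise f(w) = w^2 from w = 1; the gradient is 2w.
w, m, v = np.array(1.0), 0.0, 0.0
for t in range(1, 4):
    w, m, v = adam_step(w, 2.0 * w, m, v, t)
print(float(w))  # slightly below 1.0 after three small steps
```

With β1 = 0.5 the first-moment average forgets old gradients faster than with the more common value 0.9, a choice that is widely used for GAN training.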

Qualitative Evaluation
We evaluate our LineMatchGAN framework qualitatively in terms of the smoothness of the reconstructed optical flow B̂. For the evaluation, we use an independent test split of our ZebraScan dataset consisting of 20 images for the real split and 20 images for the synthetic split. We present the results in Figure 5. The evaluation results show that our LineMatchGAN model learns both optical flow completion and elimination of the discrepancies caused by uneven brightness of the surface texture.

Quantitative Evaluation
We evaluate our LineMatchGAN framework quantitatively in terms of the L1 distance between the ground-truth optical flow and the optical flow B̂ processed by our model. The evaluation results are presented in Table 2. The average endpoint error of the optical flow produced by the scanner (EPE base) includes discrepancies caused by the illumination and ranges from 0.5 to 0.1 pixel. Processing the rough optical flow with our LineMatchGAN model reduces the error by more than a factor of ten.
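The two metrics named above can be computed as in the following sketch. It uses the standard definitions: the endpoint error is the mean Euclidean norm of the per-pixel flow difference, and the L1 distance is the mean absolute difference over all components. The function names and the optional validity mask are assumptions, not part of the paper.

```python
import numpy as np


def endpoint_error(flow_pred, flow_gt, mask=None):
    """Mean endpoint error (EPE) in pixels between two (H, W, 2) flows.

    mask is an optional boolean (H, W) array selecting the pixels to evaluate.
    """
    diff = np.asarray(flow_pred, dtype=np.float64) - np.asarray(flow_gt, dtype=np.float64)
    epe = np.sqrt((diff ** 2).sum(axis=-1))
    if mask is not None:
        epe = epe[mask]
    return float(epe.mean())


def l1_distance(flow_pred, flow_gt):
    """Mean absolute difference over all flow components."""
    diff = np.asarray(flow_pred, dtype=np.float64) - np.asarray(flow_gt, dtype=np.float64)
    return float(np.abs(diff).mean())


# One of four pixels is off by (3, 4) px, i.e. by 5 px in Euclidean norm.
gt = np.zeros((2, 2, 2))
pred = gt.copy()
pred[0, 0] = [3.0, 4.0]
print(endpoint_error(pred, gt))  # 1.25
print(l1_distance(pred, gt))     # 0.875
```

The two numbers differ for the same pair of flows, so it matters which of them a table column reports.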
We reconstructed 3D models of the objects using both the rough optical flow and the optical flow filtered by our LineMatchGAN model to compare surface distance accuracy. The surface distance to the ground-truth 3D model for the models reconstructed using the rough optical flow (SD base) was about 0.05 mm for all objects. The 3D models reconstructed using the filtered optical flow have an average surface distance of 0.01 mm.

CONCLUSION
We showed that a conditional adversarial loss function can be used to improve 3D model reconstruction accuracy for a mobile structured light scanner. Specifically, our model corrects the stereo matching errors caused by the uneven surface brightness of the object's texture.
We developed LineMatchGAN, a conditional generative adversarial model for optical flow filtering. Our model receives a joint input consisting of a stereo pair and the rough optical flow generated by a mobile scanner. The model attempts to reconstruct the error-free optical flow using what it has learnt during training. We collected the ZebraScan dataset to train and evaluate our LineMatchGAN model. Both qualitative and quantitative evaluations demonstrate that the framework successfully removes errors in the optical flow. Moreover, 3D models reconstructed using the filtered optical flow have a five times lower surface distance error than the models reconstructed using the rough optical flow.