3D DIGITIZATION OF TRANSPARENT AND GLASS SURFACES: STATE OF THE ART AND ANALYSIS OF SOME METHODS

In the field of industrial metrology, there is a rising need for very high-resolution 3D information for micro-measurements and quality control of transparent objects such as glass bottles (beer, wine, cola, cosmetics, etc.). However, such objects are particularly challenging for optical 3D reconstruction methods and systems such as photogrammetry, photometric stereo, structured light scanning, and laser scanning, which typically deliver poor metrological performance on them. Indeed, these methods require the surface of the object to diffusely reflect the incoming light, which is not the case for glass, where refraction and absorption do not permit their use. Over the years, various methods have been investigated and developed to avoid the coating (or powdering) treatment often used to make transparent objects opaque and diffusely reflecting. Most of the approaches require either some a priori knowledge of the transparent object or assumptions about how light interacts with the surface. This paper provides a general overview of state-of-the-art 3D digitization methods for optically non-cooperative surfaces featuring absorption, scattering, and refraction. The paper reviews research works, grouping them into four categories: shape-from-X, direct ray measurements, hybrid, and learning-based approaches. Moreover, we provide some 3D results to better highlight the advantages and disadvantages of each method in practice when dealing with transparent objects.


INTRODUCTION
In the field of industrial metrology, there is a rising need for very high-resolution 3D information for quality inspection (i.e., shape and textures) and 3D monitoring over time of manufactured products. 3D object reconstruction is generally performed using either active (range-based) or passive (image-based) methods (Blais, 2004; Remondino and El-Hakim, 2006; Remondino et al., 2013; Ahmadabadian et al., 2019; Karami et al., 2021). However, these methods are not directly applicable to the 3D reconstruction of transparent objects such as glass bottles (beverages, cosmetics, oil, etc.). This is mainly because glass does not diffusely reflect the incoming light and, for passive methods, does not have a texture of its own, which is needed for image matching. Instead, because of refraction and specular reflections, the appearance of such objects depends on their shape, the surrounding background, and the lighting conditions, with light traveling through the surface and changing its path in the process. This makes standard techniques inappropriate, causing large errors and, most often, failures in the 3D reconstruction process (Figure 1). One of the traditional approaches used in industry to deal with such objects is to spray a thin layer of powder onto the object's surface (Figure 1) to make it opaque and diffusely reflecting (Palousek et al., 2015; Lin et al., 2017; Pereira et al., 2019). This supplementary treatment, however, is challenging and time-consuming, and may not always be feasible in real-time 3D inspection of industrial components (Pereira et al., 2019; Karami et al., 2022). Moreover, the added layer could increase the overall volume of the object and may negatively affect the final accuracy depending on the powder thickness and its homogeneity (Palousek et al., 2015; Pereira et al., 2019). Over the years, various approaches have been investigated and developed to avoid the coating treatment.
Most of them require either some a priori knowledge of the transparent object's shape or assumptions about how light interacts with the surface.
Figure 1. 3D reconstruction results for a glass bottle (400x80x80 mm) and a plastic bottle (300x70x70 mm): (a) image samples before and after powdering, (b) camera network, (c) 3D reconstruction. Photogrammetry was applied to 36 images (ground sample distance, GSD ≈ 38 µm) acquired with a turntable. Without powdering, no results are achieved; only after powdering can a successful 3D result be obtained.
A comprehensive review was provided in (Ihrke et al., 2010), which presents a taxonomy of nine object types to describe light transport, ranging from ideally diffuse to a more complicated surface with absorption, scattering, and refraction. This work was further completed by Mériaudeau et al. (2012), in which all techniques were categorized based on the physical interactions in conjunction with developed techniques like transmission, reflection, and emission. Stolz et al. (2016) proposed a short overview of only polarimetric imaging-based methods for 3D measurements of transparent objects.

Aim of the work
This paper aims to provide an updated and general overview of 3D reconstruction approaches for optically non-cooperative surfaces, in particular transparent ones (Figure 2). It reviews the most relevant research works, grouping them into different categories: shape-from-X (distortion, silhouette, reflection, polarization, heating), direct ray measurements, hybrid, and learning-based approaches. The main benefits and drawbacks, as well as their applications in industrial sectors, are also discussed and assessed. Moreover, for some techniques, we provide a comparative or visual evaluation to show, with practical examples, the advantages and disadvantages of each method when dealing with transparent objects.
Figure 2. General taxonomy of 3D digitization of transparent and glass surfaces. SFS and SFP are the abbreviations for Shape from Silhouette and Shape from Polarization, respectively.

STATE OF THE ART
In this section, we summarize research works related to the 3D measurement of transparent surfaces, grouped into four categories: shape-from-X, direct ray measurements, hybrid, and learning-based approaches.

Shape from X
Several approaches known as Shape from X techniques have been developed for extracting shape information from 2D images, where X can be distortion, silhouette, reflection, polarization, heating, and so on. Shape from distortion, also known as deflectometry, is one of the earliest methods specifically developed for transparent objects. This technique recovers the 3D shape of an object by analyzing the distortion of a known pattern placed behind or near the surface. This approach has long been investigated to reconstruct mirror-like surfaces (Tarini et al., 2005), liquids (Murase, 1990; Jähne et al., 1994), and solid refractive surfaces (Ben-Ezra and Nayar, 2003; Wetzstein et al., 2011; Tanaka et al., 2016; Kim et al., 2017). The 3D reconstruction of refractive surfaces is more complex than that of specular or textureless surfaces because the ray path depends on the refractive index in addition to the surface normal (Wu et al., 2018; Lyu et al., 2020). These approaches are also limited to the recovery of a single refractive surface or of parametric surfaces with simple geometry, and therefore generalize to a wider range of object categories only approximately (Wu et al., 2018; Lyu et al., 2020). Shape from Silhouette (SFS) is a well-known 3D reconstruction method applicable to a wider range of object categories. This method reconstructs the 3D shape of an object using a sequence of images taken from different views, where the silhouette of the object is the sole relevant feature of the image. Depending on the geometric projection of the imaging system (e.g., telecentric or central perspective), the silhouette of the object at each station (image) can be seen as the base of a prismatic or conic volume in three-dimensional space. The silhouette itself represents the locus of tangent points on the straight lines departing from the perspective center of the camera (for a central perspective).
By intersecting these volumes, which is also known as Space Carving, a 3D reconstruction of the object can be generated. This method was first presented by Baumgart in 1974. Since then, various versions of SFS have been proposed. For example, Martin and Aggarwal (1983) and Kim and Aggarwal (1986) used volumetric descriptions to represent the reconstructed shape. Following this, some works (Potmesil, 1987; Ahuja and Veenstra, 1989) used an octree data structure to speed up the 3D reconstruction process. Szeliski (1993) built a non-invasive 3D digitizer using a turntable and a single camera with SFS as the reconstruction method. As shown in Figure 3, SFS can recover the 3D shape of an object regardless of the object's surface properties and shape, as long as the region of the object in each image is distinguishable from the background (the red and yellow boxes in Figure 3 demonstrate how a non-distinguishable background affects the 3D model). Photogrammetric 3D results, on the other hand, depend directly on the object's surface properties, and photogrammetry completely failed to reconstruct the object's 3D shape before powdering because of the refraction of light (Figure 1). However, the accuracy of SFS depends directly on the binarization of the silhouette boundary, which can be done using automated or user-defined global thresholding of an image. In many cases, it might be difficult to determine the optimum threshold for distinguishing transparent objects from the background. As a result, the silhouette of an object may shrink or grow, making the resulting 3D model smaller or larger than the real object, or noisier. To evaluate this, the 3D results achieved with SFS were geometrically compared against photogrammetric data of the coated objects, which are of relatively higher accuracy than the SFS results.
The quantitative analysis in Figure 3 shows that the error (RMSE) of the 3D data generated with the SFS approach was 0.6 mm before powdering and decreased slightly to 0.51 mm after coating the surface. This is because, without powdering, refraction makes it more difficult to identify the object's silhouette in the captured images and distinguish it from the background. Another primary issue with SFS is that concavities on an object's surface remain unseen, making it unsuitable for reconstructing the inside of a hole or concave areas (Figure 3, red and brown boxes). To deal with this issue, Zuo et al. (2015) incorporated internal occluding contours into traditional SFS methods to recover the concavities on an object's surface. Wu et al. (2018) and Lyu et al. (2020) started with an initial 3D shape reconstruction generated from traditional SFS and then gradually optimized the model.
Figure 3. Quantitative analyses for the SFS-based 3D results on the two transparent objects of Figure 1. Using the 3D data achieved with photogrammetry on the powdered objects (Figure 1) as reference, the RMSE of a cloud-to-cloud comparison is computed.
Shape from reflection/refraction is another approach, introduced for the first time by Morris and Kutulakos (2007), to recover the 3D shape of transparent objects. This approach usually describes the behavior of rays as they pass through a refractive object by controlling the background behind the object (Morris and Kutulakos, 2007; Yeung et al., 2011; Yeung et al., 2015; Han et al., 2015; Han et al., 2021). However, the data collection of this method can be complex and inefficient, and it is necessary to manually rotate a spotlight around the hemisphere to illuminate the object and a reference sphere from various angles. Following a similar idea, Yeung et al. (2011, 2015) used a more convenient data collection method to obtain the specular reflection information on the surface of a transparent object and applied graph cuts to recover and optimize the normal vectors and, consequently, the depth map. Although the results are insufficiently precise for industrial inspection, they are promising for 3D computer graphics animation. Iwabuchi et al. (2011) also presented a similar method based on inverse ray-tracing. This method uses multiple sensors placed around a transparent object with simple geometry and can recover the shape and refractive index of the object. Chari et al. (2013) proposed a method that combines geometric and radiometric information for the reconstruction. The position and direction of each light path were recovered and combined with the light radiance at the beginning and end of each light path. More recently, Han et al. (2015, 2021) employed a single fixed camera with a refractive object in front of a checkerboard background. The approach required two images with the background pattern placed in two different known locations.
However, the approach required a change in refractive index, necessitating immersion of the object in water, which is a significant disadvantage for industrial purposes. Shape from Polarization (SFP) methods (Miyazaki et al., 2002, 2003, 2004; Miyazaki and Ikeuchi, 2005; Huynh et al., 2010; Cui et al., 2017; Sun et al., 2020) recover the 3D shape of an object from polarization information of the reflected light. The basic principle is that, after capturing polarization information such as the intensity, degree of polarization, and polarization phase angle, the surface normal can be recovered by analyzing the relationship between the surface normal and the polarization image formation model. This method has been applied to different object types with various reflection properties, such as dielectric (Huynh et al., 2010), black (Miyazaki et al., 2016), metallic (Morel et al., 2006), translucent (Chen et al., 2007), and transparent (Miyazaki and Ikeuchi, 2005; Huynh et al., 2010; Cui et al., 2017) objects. This method is also quite robust and stable under different lighting conditions, such as indoors, outdoors, or under patterned illumination, as long as the incident light is unpolarized (Durou et al., 2020). These methods calculate surface normals, which must afterward be converted into a height map. However, the results are highly vulnerable to noise since they depend solely on the weak shape cue supplied by polarization and do not ensure integrability (Durou et al., 2020). The ambiguity in polarization analysis is also one of the main issues of this approach. To resolve the azimuth and zenith angle ambiguity, for example, Miyazaki et al. (2002) used the degree of polarization in the far-infrared wavelength range, instead of the visible range, for estimating the surface orientation. Morel et al. (2006) recommended using active lighting. Stolz et al. (2012) proposed a multispectral method for determining the optimal zenith angle, and Garcia et al.
(2015) used circularly polarized light. More recently, the ambiguities of this approach have been resolved by combining it with other approaches that provide rough geometric information, such as Multi-View Stereo (Miyazaki et al., 2004; Zhu and Smith, 2019), binocular stereo vision (Tian et al., 2022), and light-path triangulation (Xu and Qiao, 2016; Xu et al., 2017), among others (Durou et al., 2020). Shape from heating is another technique for 3D reconstruction of transparent objects (Eren et al., 2009) that, unlike the previously described approaches, ignores the refractive properties of the object. Laser range scanning of transparent objects is possible using an IR laser rather than visible light since long-wave and thermal infrared spectrum is not refracted by glass. This technique is based on the principle of infrared thermal imaging, in which the infrared source heats up the object, and then the IR-sensitive sensor detects and records the geometric surface information of the object. Aubreton et al. (2013) also demonstrated a very similar approach for highly specular objects utilizing high-power lasers. Since these approaches use single laser spots as activating light sources, their measurement areas and acquisition speed are restricted by the time required for scanning. There are additional limitations in spatial resolution and precision because of the size of the laser dots. To overcome these restrictions, Wiedenmann et al. (2015) developed a demonstrator system based on a CO2 laser with a single thermal camera and a phase-shifting projection technique with sinusoidal heat patterns. Brahm et al. (2016) developed a stereo-vision configuration consisting of two uncooled long-wave infrared (LWIR) cameras to detect the heat radiation emitted from an object, induced by a pattern projection unit driven by a CO2 laser. More recently, Landmann et al. (2019) demonstrated real-time 3D thermographs at a frame rate of 30 frames per second (fps).
This technique is well suited to applications where the geometry or temperature distribution of the objects is rapidly changing. Landmann et al. (2021) developed a simplified and robust projection approach based on a focused single thermal fringe that can rapidly scan across the object's surface. Higher intensities were obtained using such focused single thermal fringe compared to multi-fringe projection, which increased acquisition speed while improving measurement accuracy.
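The phase-shifting principle mentioned above for sinusoidal heat patterns can be sketched with the generic N-step decoding formula. This is a textbook illustration and not the processing chain of any of the cited systems: it assumes that each pixel records samples of the form I_k = A + B cos(phi + 2 pi k / N) with equally spaced shifts, and it ignores thermal effects such as heat diffusion. The function names `decode_phase` and `simulate_pixel` are hypothetical.

```python
import math


def decode_phase(intensities):
    """Return (wrapped phase, modulation) from N >= 3 equally shifted samples.

    Assumes I_k = A + B * cos(phi + 2*pi*k/N) for k = 0..N-1.
    """
    n = len(intensities)
    if n < 3:
        raise ValueError("at least three phase steps are needed")
    s = sum(i_k * math.sin(2 * math.pi * k / n) for k, i_k in enumerate(intensities))
    c = sum(i_k * math.cos(2 * math.pi * k / n) for k, i_k in enumerate(intensities))
    phase = math.atan2(-s, c)                # wrapped phase in [-pi, pi]
    modulation = 2.0 * math.hypot(s, c) / n  # estimate of the fringe amplitude B
    return phase, modulation


def simulate_pixel(phi, offset, amplitude, steps):
    """Ideal samples of one pixel under `steps` equally shifted sinusoidal patterns."""
    return [offset + amplitude * math.cos(phi + 2 * math.pi * k / steps)
            for k in range(steps)]


# Four-step example: the decoded phase should match the simulated one.
samples = simulate_pixel(phi=1.0, offset=5.0, amplitude=2.0, steps=4)
phase, modulation = decode_phase(samples)
```

The decoded phase is wrapped and still has to be unwrapped and converted to depth through the system calibration, which is outside the scope of this sketch. The modulation value is commonly used to mask pixels where the fringe signal is too weak to be trusted.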

Direct ray measurement
Direct ray measurement techniques, which detect light rays directly, have long been used for the 3D reconstruction of refractive surfaces. Kutulakos et al. (2008) published foundational work on measuring the geometry of refractive objects using light-ray correspondences. By mapping the light rays which reach and depart from the object, the geometry of transparent objects, characterized by depths and surface normals, can be determined. As shown in Figure 4, the projection of a point is defined by the 3D path(s) that light would take to reach the camera, given an arbitrary 3D point p, a known viewpoint c, and a known image plane. As expressed by Kutulakos et al. (2008), refractive surface reconstruction problems can be formulated as N-K-M triangulation, where N is the number of viewpoints required for reconstruction, K is the number of refractive surface points on a piecewise linear path, and M is the number of calibrated reference points along the ray exiting the refractive object.
Figure 4. Light-path triangulation, after Kutulakos et al. (2008). To reach point q on the image plane, the light path from p crosses three surfaces, including refractive and mirror-like ones, passing through three vertices, v1, v2, and v3, which form four segments. The objective of light-path triangulation is to estimate the normals and coordinates of the vertices using the known coordinates of c, q and p.
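At each refractive vertex of such a light path, the incoming and outgoing ray directions and the surface normal are linked by Snell's law. The sketch below shows the vector form of that relation and how a normal can be recovered from a measured ray pair, which is the elementary building block of ray-based methods. It is an illustration only, not the algorithm of Kutulakos et al. (2008): the function names and the refractive indices 1.0 (air) and 1.5 (a typical value for glass) are assumptions.

```python
import math


def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def refract(d, n, n1, n2):
    """Refract unit direction d at a surface with unit normal n (n . d < 0).

    n1 and n2 are the refractive indices before and after the interface.
    Returns None in case of total internal reflection.
    """
    eta = n1 / n2
    cos_i = -dot(d, n)
    k = 1.0 - eta * eta * (1.0 - cos_i * cos_i)
    if k < 0.0:
        return None
    f = eta * cos_i - math.sqrt(k)
    return tuple(eta * dc + f * nc for dc, nc in zip(d, n))


def normal_from_rays(d_in, d_out, n1, n2):
    """Recover the unit normal from a refracted ray pair.

    Snell's law implies that n1 * d_in - n2 * d_out is parallel to the normal.
    The result is oriented against the incoming ray. The vector degenerates to
    zero when n1 == n2, in which case the normal cannot be recovered.
    """
    v = normalize(tuple(n1 * a - n2 * b for a, b in zip(d_in, d_out)))
    if dot(v, d_in) > 0.0:
        v = tuple(-c for c in v)
    return v


# Air-to-glass example with an assumed glass index of 1.5.
true_normal = (0.0, 0.0, 1.0)
d_in = normalize((0.3, -0.2, -1.0))
d_out = refract(d_in, true_normal, 1.0, 1.5)
recovered = normal_from_rays(d_in, d_out, 1.0, 1.5)
```

The recovery step needs the refractive index, or at least the index ratio, in addition to the two directions. This is one reason why many of the methods reviewed here either assume a known index or estimate it jointly with the shape.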
However, methods based on light-path triangulation are known to have collinearity ambiguities, as the 3D surface point can be located anywhere along the optical ray that passes through the pixel. To remove the ambiguity, Tsai et al. (2015) assumed that the light rays are refracted twice. They recovered the geometry of a transparent object from a single monocular image, using a monitor to control the background image, without needing to immerse the object in water. Some researchers (Morris and Kutulakos, 2011; Ding et al., 2011) employed stereo or multiple cameras to record the refractive surface, relying on a cross-view normal consistency constraint: the normals computed using the pixel-point correspondences obtained from multiple viewpoints must be consistent. Alternatively, some studies (Ye et al., 2012; Wetzstein et al., 2011; Tsai, 2020; Tsai et al., 2021) estimated ray-ray correspondences using specific devices such as Bokode (Ye et al., 2012) and light field probes (Wetzstein et al., 2011; Tsai et al., 2021), capturing the incident rays emitted from the background and the exiting rays traveling to the camera. Although the 3D results appear to be highly promising, the high cost of such devices is an important downside of these approaches. In addition, one of the main common shortcomings of the aforementioned approaches is that they provide dependable normals but noisy depths. To provide the boundary condition for the integration of the normals, they need to assume a planar surface near the boundary (Ye et al., 2012; Ding et al., 2011) or approximate the border using noisy depths (Morris and Kutulakos, 2011; Wetzstein et al., 2011). To address the restrictions mentioned above, Qian et al. (2016, 2017) proposed a position-normal consistency constraint, solved with a global optimization method, to recover depth maps of the front and back surfaces. Similarly, Kim et al.
(2017) proposed a method based on optimizing the object's shape and refractive index to minimize the disparity between observed and simulated transmission/refraction rays traveling through an object. It cannot, however, be applied to non-symmetric objects. Following that, Wu et al. (2018) extended this technique and provided a non-intrusive method to reconstruct the whole geometry of a transparent object; nevertheless, the results are always over-smoothed due to the independent optimization and multi-view fusion of the recovered point clouds. Lately, Lyu et al. (2020) extended this work by directly optimizing the surface mesh generated by the SFS method using differentiable rendering algorithms. However, these approaches rely on feature correspondence across several views to discover similar features for triangulation. This requires additional assumptions and constraints, making them insufficient for actual industrial applications, which must cope with a wide range of circumstances and environments.

Hybrid methods
This group of methods includes combinations of different approaches. The primary goal of combining two techniques is to overcome the constraints of one method by leveraging the strengths of the other, allowing complete and precise 3D reconstruction of optically non-cooperative objects. For instance, SFS is considered a suitable and practical approach to reconstruct the 3D shape of transparent objects regardless of the object's surface properties and shape. However, the concavities on an object's surface remain unseen. Therefore, some works (e.g., Kampel et al., 2002) have addressed this problem of SFS by combining it with a structured light method. Some researchers have also tried to merge range sensor and silhouette information to provide more reliable sensor data on transparent objects. For instance, Chiu et al. (2011) described a method to improve Microsoft Kinect depth maps by employing a cross-modal stereo path derived from disparity matching between the Kinect's built-in IR and RGB sensors. Narayan et al. (2015) merged silhouette information and depth images in the 2D image domain, which can improve 3D reconstruction of concave and transparent objects with interactive segmentation. Ji et al. (2017) also combined silhouette information and depth from an RGB-D sensor to retrieve the missing surface of transparent objects. First, they locate the 3D region that includes the transparent object from multiple views, using the incorrect depth caused by transparent materials. The 3D shape is then retrieved inside these noisy areas using SFS. Another solution developed to deal with transparent surfaces is to combine SFP with other approaches such as light-path triangulation (Xu and Qiao, 2016; Xu et al., 2017), conventional ray tracing (Miyazaki et al., 2007), Multi-View Stereo (Miyazaki et al., 2004; Zhu and Smith, 2019), and binocular stereo vision (Tian et al., 2022). For instance, Miyazaki et al.
(2007) developed a polarization ray tracing approach, which combines traditional ray tracing (which calculates the path of light rays) with SFP (which calculates the polarization state of the light). Starting with an initial shape of the transparent object, the shape was iteratively modified to minimize the difference between the input polarization data and the rendered polarization data obtained by polarization ray tracing. More recently, He et al. (2022) developed a pipeline based on the fusion of the laser tracking frame to frame (LTFtF) method and stereo vision to distinguish and extract the laser lines reflected on the front surface from the several laser reflection candidates caused by refraction in transparent objects.

Learning-based methods
Recently, many researchers have used (machine or deep) learning-based approaches to solve the problem of 3D measurement of transparent objects. These approaches could be categorized into three groups as follows. Li et al. (2020) suggested a physically-based network for generating the 3D geometry of transparent objects using multiple images acquired from different viewpoints while also taking into account light transport patterns. Similar to Lyu et al. (2020), this method (Li et al., 2020) optimizes the surface normals corresponding to a back-projected ray on both sides of the object using an in-network differentiable rendering layer, given the visual hull as an initial 3D reconstruction. Although this method is less restrictive than previous multi-view methods (Wu et al., 2018; Lyu et al., 2020), it still requires the environment map and the object's refractive index. It is also difficult to use in real-time applications because of the time-consuming optimization procedure. Furthermore, these data-driven algorithms rely on synthetic training images since obtaining a significant quantity of real training images is difficult (Lyu et al., 2020).

Depth completion (from partial RGB-D depths)
These approaches use different learning-based methods to fill in the missing depths (where transparent objects are) in data acquired with an RGB-D sensor (Figure 5). Sajjan et al. (2020) presented a deep learning approach (named ClearGrasp) for predicting the 3D geometry of transparent objects partly surveyed with an RGB-D sensor. Deep networks are used to identify masks, occlusion borders, and surface normals given RGB images, and then the initial depth is optimized using the network predictions. The optimization, however, needs transparent objects to have interaction boundaries with non-transparent objects. Otherwise, the depth of the transparent region remains unpredictable. Figure 5 shows an example of depth completion using the method of Sajjan et al. (2020): the missing parts of the scene (where both transparent objects are located) are predicted and the new point cloud is more complete. Zhu et al. (2021) proposed another learning-based technique which uses a local implicit neural representation built on ray-voxel pairs that can generalize to unseen objects and fill in missing depth in given noisy depth maps.

Monocular shape prediction
This group of approaches requires only a single image as input in order to predict the 3D shape of transparent objects. Stets et al. (2019) proposed a deep convolutional neural network (CNN) method for determining depths and normals of a transparent object using a single image obtained under an arbitrary environment map. More recently, Eppel et al. (2022) presented a method for predicting 3D points of transparent objects directly from an image taken from an unknown source, using a neural net that is independent of camera parameters. In this method, each pixel in the predicted map is assigned the X, Y, Z coordinates of a point rather than the distance to that point. To train the net, 50k transparent container images containing 13k different objects, 500 different environments, and 1450 material textures were utilized. A total of 104 real-world images of various transparent containers with depth maps were also utilized. Instead of using absolute XYZ coordinates to calculate the training loss, the distance between pairs of points inside the 3D model was utilized, making the loss function translation invariant. Unlike previous methods, this approach does not require camera parameters and can work with images from unknown cameras. The method was designed for specific manipulation applications of transparent chemical bins but, with specific re-training, it could be generalized to other objects. Figure 6 shows some results obtained using the method presented in Eppel et al. (2022). It can be seen that the predicted 3D shape is only approximate, with anisotropic scaling issues remaining unsolved.
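The idea of a translation-invariant loss built on point pairs can be illustrated with a small sketch. It is not the loss of Eppel et al. (2022), whose exact formulation is not given here: the function `pairwise_distance_loss` is hypothetical and compares scalar distances between all corresponding point pairs. Note that using scalar distances makes this particular variant invariant to rotations as well, which may differ from the original formulation.

```python
import math
from itertools import combinations


def pairwise_distance_loss(predicted, reference):
    """Mean absolute difference between corresponding point-pair distances."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("need two equally sized sets of at least two points")
    total, count = 0.0, 0
    for i, j in combinations(range(len(predicted)), 2):
        d_pred = math.dist(predicted[i], predicted[j])
        d_ref = math.dist(reference[i], reference[j])
        total += abs(d_pred - d_ref)
        count += 1
    return total / count


# Hypothetical reference points and two predictions: one shifted, one scaled.
reference = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.5), (0.0, 2.0, 2.0), (1.0, 1.0, 3.0)]
shifted = [(x + 5.0, y - 3.0, z + 10.0) for x, y, z in reference]
scaled = [(2.0 * x, 2.0 * y, 2.0 * z) for x, y, z in reference]

loss_shifted = pairwise_distance_loss(shifted, reference)  # close to zero
loss_scaled = pairwise_distance_loss(scaled, reference)    # clearly positive
```

A pure shift of the prediction leaves all pair distances unchanged and is therefore not penalized, whereas a wrong scale is. This is consistent with the text above: the absolute position of the object relative to an unknown camera does not matter, but shape and scale errors do.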

CONCLUSIONS
This paper presented a general overview of 3D digitization methods for non-cooperative surfaces featuring absorption, scattering, and refraction. The paper reviewed the most relevant research works, grouping them into four categories: shape-from-X, direct ray measurements, hybrid, and learning-based approaches. Shape from silhouette has become a popular 3D reconstruction method for static objects. However, the accuracy of SFS depends directly on the binarization of the acquired images, and it is difficult to determine the optimum threshold for distinguishing transparent objects from the background. Shape from reflection may also be used to generate a precise 3D reconstruction of transparent objects with complex, inhomogeneous interiors when additional constraints are taken into account. However, data collection is more complex, making it unsuitable for many real-time applications. Approaches based on shape from heating appear to be very promising for 3D reconstruction of optically non-cooperative objects compared to other approaches. However, their resolution is limited since the wavelength of the incident illumination is significantly longer than that of visible light. In addition, high-resolution IR cameras are quite expensive. SFP is a quite accurate method, and its recent integration with other techniques such as light-path triangulation, traditional ray tracing, and Multi-View Stereo indicates some further possibilities. Most light path-based approaches for refractive object shape reconstruction rely on feature correspondence across several views to discover similar features for triangulation. To simplify the problem and make it more tractable, the majority of the techniques use assumptions and constraints. As a result, the application window of these approaches becomes too limited, making them unstable and untrustworthy for actual industrial applications, which must cope with a wide range of circumstances and environments.
More recent learning-based approaches learn directly from real or synthetic training data and do not require assumptions or constraints such as controlled data acquisition, darkroom environments, or other limitations. Nevertheless, these approaches are still significantly less accurate than traditional methods, making them unsuitable for industrial applications where accuracy, reliability, and traceability of 3D measurements are mandatory. In our experience, learning-based approaches can deliver a rough 3D shape of transparent objects which can only be used for low-accuracy applications. Still, learning-based methods have demonstrated encouraging results with a cost-effective and re-trainable approach. However, training such algorithms needs huge datasets annotated for the specific case, and there is still a gap between real-world and synthetic images, making it difficult to generalize to real-world input datasets.