REINFORCEMENT LEARNING FOR AUTONOMOUS 3D DATA RETRIEVAL USING A MOBILE ROBOT

3D data retrieval is required in various fields such as industrial monitoring, agriculture, and robotics. Recent advances in photogrammetry and computer vision have made it possible to perform 3D reconstruction from a set of images captured with an uncalibrated camera. This technique is commonly known as Structure-from-Motion. In this paper, we propose a reinforcement learning framework, RL3D, for online planning of strong camera configurations onboard a mobile robot. The mobile robot consists of a skid-steered wheeled platform, a single-board computer and an industrial camera. Our aim is to develop a model that plans a set of robot locations that provide a strong camera configuration. We developed an environment simulator to train our RL3D framework. The simulator was implemented using a 3D model of the indoor scene and includes a model of the robot's dynamics. We trained our framework using the simulator and evaluated it in virtual and real environments. The results of the evaluation are encouraging and demonstrate that the controller model successfully learns simple camera configurations such as a circle around an object.


INTRODUCTION
3D data retrieval is required in various fields such as industrial monitoring, agriculture, and robotics. Recent advances in photogrammetry and computer vision have made it possible to perform 3D reconstruction from a set of images captured with an uncalibrated camera. This technique is commonly known as Structure-from-Motion (SfM) (Remondino et al., 2017). Multiple commercial and open-source software packages implement SfM, and it has become a convenient solution for fast 3D reconstruction of indoor and outdoor scenes. Still, the success and accuracy of 3D reconstruction using SfM strongly depend on the configuration of the cameras during data collection. Weak placement of the cameras can lead to a large number of outliers or to a complete failure of the bundle adjustment step. While an experienced surveyor can plan a strong camera configuration, automatic configuration planning remains an open problem. The estimation of camera configuration effectiveness has received a lot of scholarly attention recently (Hastedt et al., 2021). While modern methods can robustly estimate a strong camera configuration from a 3D model of a scene, online configuration planning from a set of reference photos remains an open problem.
Reinforcement learning (RL) methods have demonstrated exciting progress recently and proved that they can be used for such complicated tasks as autonomous helicopter flight (Kim et al., 2004, Ng et al., 2006), robot hand manipulation (OpenAI et al., 2018a) and playing games (Ha and Schmidhuber, 2018b, Freeman et al., 2019). In the RL framework, a decision-making model, commonly called an agent, learns to interact with the environment by choosing from the available actions and earning rewards. Related to our controller C is the RL-GAN-Net (Sarmad et al., 2019) model, in which a GAN model for point cloud shape completion is controlled by an RL controller. Related to our model is the WorldModels (Ha and Schmidhuber, 2018b, Freeman et al., 2019) framework, which consists of three key components: a Variational Auto-Encoder (VAE) (Kingma and Welling, 2014) that translates the input image into a latent code, a Mixture Density Network Recurrent Neural Network (MDN-RNN) (Graves, 2013, Freeman et al., 2019) that learns to predict a sequence of actions, and a controller that is trained using the CMA-ES (Hansen and Ostermeier, 2001, Hansen, 2016) algorithm. Our RL3D framework leverages training with the 'world model' of the WorldModels framework to learn to generate realistic and self-consistent image splices.
In this paper, we propose a reinforcement learning framework, RL3D, for online planning of strong camera configurations onboard a mobile robot. The mobile robot consists of a skid-steered wheeled platform, a single-board computer and an industrial camera. Our aim is to develop a model that plans a set of robot locations that provide a strong camera configuration. We use the WorldModels (Ha and Schmidhuber, 2018b) framework as a starting point for our research.
Specifically, our framework includes three deep models: a controller C that predicts the robot control, a variational auto-encoder V that encodes the input image A into a latent code z, and an MDN-RNN M that learns to predict the robot's behavior from the movement history and the input image A. We design a new loss function that uses the fraction of the scene surface captured by the robot's camera and the final residual of the bundle adjustment as the training penalty.
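The two ingredients of the training penalty can be illustrated with a minimal sketch. The additive form, the weights `w_coverage` and `w_residual`, and the function name are assumptions made only for illustration; the text states which quantities enter the loss but not how they are combined.

```python
def camera_configuration_penalty(coverage, ba_residual,
                                 w_coverage=1.0, w_residual=1.0):
    """Hypothetical penalty built from the two quantities named in the text.

    coverage    -- fraction of the scene surface captured by the camera, in [0, 1]
    ba_residual -- final residual of the bundle adjustment (non-negative)
    The weighted sum and both weights are illustrative assumptions.
    """
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must lie in [0, 1]")
    if ba_residual < 0.0:
        raise ValueError("ba_residual must be non-negative")
    # Low coverage and a large residual both increase the penalty.
    return w_coverage * (1.0 - coverage) + w_residual * ba_residual
```

Under these assumptions, a trajectory that sees the whole surface and yields a zero residual receives no penalty, while incomplete coverage or a poorly conditioned bundle adjustment increases it.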
We developed an environment simulator to train our RL3D framework. The simulator was implemented using a 3D model of the indoor scene and includes a model of the robot's dynamics. We trained our framework using the simulator and evaluated it in virtual and real environments. The results of the evaluation are encouraging and demonstrate that the controller model successfully learns simple camera configurations such as a circle around an object.

Camera configuration planning
Camera configuration planning has received a lot of scholarly attention recently (Michelini and Mayer, 2014, Chiabrando et al., 2017, Tufarolo et al., 2019). In (Michelini and Mayer, 2014), the authors present an approach for detecting weak camera configurations. The approach can be applied to planning the survey stage for a calibrated Structure-from-Motion (SfM) pipeline. Moreover, the method can leverage image triplets for complex, unordered image sets, e.g., obtained by combining terrestrial images and images from small Unmanned Aerial Systems (UAS).

Reinforcement Learning
Reinforcement learning has become a powerful tool for training a system (agent) to interact with a given environment. One of the first deep learning models able to learn control policies directly from high-dimensional sensory input using reinforcement learning was presented in 2013 (Mnih et al., 2013). It was trained with a variant of Q-learning and demonstrated its ability to master complicated control policies for computer games using only raw pixels as input. This approach outperformed all previous methods on six computer games.
Hausknecht and Stone developed the Deep Recurrent Q-Network (DRQN) method (Hausknecht and Stone, 2015), in which the effects of adding recurrence to a Deep Q-Network (DQN) were investigated. The authors proposed replacing the first post-convolutional fully-connected layer with a recurrent Long Short-Term Memory (LSTM) layer (Hochreiter and Schmidhuber, 1997). This made it possible to eliminate some shortcomings related to the memory limitations of game controllers and to the incomplete and noisy state information resulting from partial observability.
Reinforcement learning is also actively used for learning to control a real manipulator from simulated data. Andrychowicz et al. proposed a method (OpenAI et al., 2018b) to train control policies that perform in-hand manipulation. The authors demonstrated the application of the method on a real robot prototype. The training is performed in a simulated environment in which many of the physical properties of the system, such as friction coefficients and an object's appearance, are randomized. Kalashnikov et al. proposed a robot control method (Kalashnikov et al., 2018) with a reinforcement learning-based vision system. It enables closed-loop vision-based control, whereby the robot continuously updates its operating strategy based on the most recent observations to optimize long-horizon grasp success.
An alternative approach to reinforcement learning was proposed in (Ha and Schmidhuber, 2018a). In this approach, a generative recurrent neural network is trained in an unsupervised manner to model popular reinforcement learning environments through compressed spatiotemporal representations. The method fully replaces the actual reinforcement learning environment with a copy based on generative modeling. The agent's controller C is trained using only the internal world model M of the environment, and the learned policy is then transferred back into the actual environment. This approach offers many practical benefits, such as eliminating the rendering of image frames, which requires significant computing resources.
Reinforcement learning (RL) methods have demonstrated exciting progress recently and proved that they can be used for such complicated tasks as autonomous helicopter flight (Kim et al., 2004, Ng et al., 2006), robot hand manipulation (OpenAI et al., 2018a) and playing games (Ha and Schmidhuber, 2018b, Freeman et al., 2019). In the RL framework, a decision-making model, commonly called an agent, learns to interact with the environment by choosing from the available actions and earning rewards. Related to our controller C is the RL-GAN-Net (Sarmad et al., 2019) model, in which a GAN model for point cloud shape completion is controlled by an RL controller. Related to our model is the WorldModels (Ha and Schmidhuber, 2018b, Freeman et al., 2019) framework, which consists of three key components: a Variational Auto-Encoder (VAE) (Kingma and Welling, 2014) that translates the input image into a latent code, a Mixture Density Network Recurrent Neural Network (MDN-RNN) (Graves, 2013, Freeman et al., 2019) that learns to predict a sequence of actions, and a controller that is trained using the CMA-ES (Hansen and Ostermeier, 2001, Hansen, 2016) algorithm. Our RL3D framework leverages training with the 'world model' of the WorldModels framework to learn to generate realistic and self-consistent image splices.

Reinforcement learning in GAIL
The authors of (Ho and Ermon, 2016) propose a general framework for directly extracting a policy from data, as if it were obtained by reinforcement learning following inverse reinforcement learning. They show that a certain instantiation of the framework draws an analogy between imitation learning and generative adversarial networks, from which they derive a model-free imitation learning algorithm that obtains significant performance gains over existing model-free methods in imitating complex behaviors in large, high-dimensional environments. This method is called Generative Adversarial Imitation Learning (GAIL).

RL3D Framework
Our method is inspired by GAIL (Ho and Ermon, 2016). The GAIL approach defines a general pipeline for training an actor in a given environment using trajectories provided by experts.
Using an inverse reinforcement learning (IRL) approach, GAIL aims to estimate the expert's policy and objective.
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B2-2022 XXIV ISPRS Congress (2022 edition), 6-11 June 2022, Nice, France
The aim of our RL3D framework is to train an agent that moves around a given object along a trajectory that provides a strong camera configuration. We assume that the agent is a skid-steered mobile robot equipped with a forward-looking camera. Therefore, our RL3D framework extends the GAIL framework with two key technical contributions: (1) a 3D pose encoder based on YOLOv3 (Redmon and Farhadi, 2018), and (2) a camera configuration loss function.
The inverse reinforcement learning (IRL) task is to train an agent that tries to match the outcomes of an expert's entire trajectory rather than individual actions, as in behavioral cloning. The output of this algorithm is a function that scores "expert behavior" on a trajectory higher than novice behavior. On the left of Figure 2 we see a typical RL feedback loop, where an agent (blue) observes a state (s) and, using a reward function (R), chooses an action (a) that yields a transition (T) to a new state and a reward (r). In contrast, on the right, the rewards resulting from these states, actions, and transitions are represented implicitly by examples from an expert (E), and the agent (blue) instead learns to replicate this sequence through a learned reward function (RE) rather than being explicitly "in the loop" of the algorithm. In other words, instead of learning a policy from an explicit reward function, we observe an expert's behavior and infer a reward function that would lead to the observed actions. The aim of IRL is to estimate the expert's policy and to approximate a loss function that explains the expert's behavior. The behavior is given as a set of trajectories produced by an expert in the same environment. Nevertheless, in most cases it is hard to obtain expert trajectories. Moreover, IRL is computationally expensive. The GAIL (Ho and Ermon, 2016) framework addresses this problem by combining concepts developed in the fields of IRL and GANs.
The Generative Adversarial Imitation Learning (GAIL) approach minimizes the loss for an agent whose behavior mimics the expert's behavior. Conversely, agents whose behavior differs significantly from the expert's behavior receive a large penalty. Moreover, the GAIL framework makes it possible to find a loss function that explains the expert's behavior. This defines the overall objective of the GAIL framework, which aims at minimizing this loss function. To achieve this, the following algorithm is applied:
1. Prepare a set of expert trajectories and randomly initialize the discriminator and policy parameters.
2. Generate a set of trajectories for the RL agent under the current policy.
3. Update the discriminator parameters with a stochastic gradient descent step.
4. Update the policy parameters with gradient-based updates using the Trust Region Policy Optimization (TRPO) algorithm.
5. Repeat steps 2-4 until the parameters of the policy and the discriminator converge.
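The discriminator update of step 3 and the cost passed to the policy update of step 4 can be sketched as follows. This is a simplified illustration, not the RL3D implementation: the discriminator is assumed to be a logistic model over hand-made state-action features, the agent and expert samples are synthetic arrays, and the TRPO policy step itself is omitted. The sign convention (discriminator output close to 1 for agent samples, and log D used as the policy cost) follows Ho and Ermon (2016).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discriminator_step(w, agent_sa, expert_sa, lr=0.1):
    """One gradient-ascent step on mean(log D(agent)) + mean(log(1 - D(expert))).

    D(s, a) = sigmoid(w . x), where x is a state-action feature vector.
    The linear discriminator and the learning rate are illustrative assumptions.
    """
    d_agent = sigmoid(agent_sa @ w)
    d_expert = sigmoid(expert_sa @ w)
    grad = (agent_sa.T @ (1.0 - d_agent) / len(agent_sa)
            - expert_sa.T @ d_expert / len(expert_sa))
    return w + lr * grad

def imitation_cost(w, sa):
    """Per-sample cost log D(s, a) that the policy update would minimize."""
    return np.log(sigmoid(sa @ w))

# Synthetic stand-ins for state-action features of expert and agent trajectories.
rng = np.random.default_rng(0)
expert_sa = rng.normal(loc=1.0, scale=0.5, size=(256, 3))
agent_sa = rng.normal(loc=-1.0, scale=0.5, size=(256, 3))

w = np.zeros(3)
for _ in range(200):
    w = discriminator_step(w, agent_sa, expert_sa)

expert_cost = imitation_cost(w, expert_sa).mean()
agent_cost = imitation_cost(w, agent_sa).mean()
```

After the discriminator has been trained, expert-like samples receive a lower cost than agent-like samples, which is the signal that drives the policy towards the expert's behavior in the full algorithm.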

Mobile Robot
To train and evaluate our RL3D framework we used a mobile robot developed in previous research (Kniaz, 2015, Kniaz, 2016, Kniaz, 2017). The robot consists of a four-wheeled mobile platform that is operated as a skid-steer. In other words, the left and right wheels are independently controlled by four motors. If the left and right wheels turn in the same direction, the robot moves forwards or backwards. Otherwise, the robot turns to the left or to the right. The robot is equipped with a Raspberry Pi single-board computer and a Raspberry Pi camera module.
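The skid-steer behavior described above can be approximated by an ideal differential-drive model. The sketch below is only an illustration under that assumption: it ignores wheel slip, and the function names and the `track_width` parameter are hypothetical rather than taken from the robot's software.

```python
import math

def skid_steer_twist(v_left, v_right, track_width):
    """Body velocity from left/right side wheel speeds (m/s).

    Ideal differential-drive approximation without wheel slip.
    Returns (linear velocity in m/s, angular velocity in rad/s).
    """
    linear = 0.5 * (v_right + v_left)
    angular = (v_right - v_left) / track_width
    return linear, angular

def integrate_pose(x, y, heading, v_left, v_right, track_width, dt):
    """Advance the planar pose by one explicit Euler step of length dt."""
    linear, angular = skid_steer_twist(v_left, v_right, track_width)
    x += linear * math.cos(heading) * dt
    y += linear * math.sin(heading) * dt
    heading += angular * dt
    return x, y, heading
```

Equal wheel speeds give pure translation, and opposite wheel speeds give rotation in place, which matches the two motion modes described in the text.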

Virtual Environment
To provide a common coordinate system for the 3D model of the scene and the estimated camera trajectories, a special test scene was designed. The object coordinate system OoXoYoZo is defined as follows: the origin is located at the center of the scene, the Xo axis is directed towards the window, and the Yo axis is directed towards the wall (Figure 3).
Three additional coordinate systems are defined to transform coordinates from the image space to the object space. To define the robot's coordinate system, a set of circular targets was placed on the upper deck of the robot. The origin of the robot's coordinate system OrXrYrZr is defined by the central target (#8). The Yr axis is directed along the forward motion of the robot. The Zr axis is normal to the upper deck of the robot.
The origin of the image coordinate system OiXiYiZi is located in the upper left pixel, the Xi axis is directed to the right, and the Yi axis is directed downwards. The origin of the camera coordinate system OcXcYcZc is located in the perspective center, the Xc axis is collinear with the Xi axis, the Yc axis is collinear with the Yi axis, and the Zc axis is normal to the Xc and Yc axes. The rotation of the robot's coordinate system with respect to the object coordinate system is defined by the rotation matrix Ror, which is composed of the elementary rotations Rα, Rω and Rκ, where Rα is the rotation matrix around the Y axis, Rω is the rotation matrix around the X axis, and Rκ is the rotation matrix around the Z axis.
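A rotation matrix of this kind can be assembled from the three elementary rotations as sketched below. The composition order Rα · Rω · Rκ and the right-handed, counter-clockwise sign convention are assumptions, since the text names the three elementary rotations but not their order; the function names are hypothetical.

```python
import numpy as np

def rot_x(omega):
    """Rotation by omega (radians) around the X axis."""
    c, s = np.cos(omega), np.sin(omega)
    return np.array([[1.0, 0.0, 0.0],
                     [0.0, c, -s],
                     [0.0, s, c]])

def rot_y(alpha):
    """Rotation by alpha (radians) around the Y axis."""
    c, s = np.cos(alpha), np.sin(alpha)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def rot_z(kappa):
    """Rotation by kappa (radians) around the Z axis."""
    c, s = np.cos(kappa), np.sin(kappa)
    return np.array([[c, -s, 0.0],
                     [s, c, 0.0],
                     [0.0, 0.0, 1.0]])

def compose_rotation(alpha, omega, kappa):
    """Assumed composition order: R = R_alpha @ R_omega @ R_kappa."""
    return rot_y(alpha) @ rot_x(omega) @ rot_z(kappa)
```

Whatever order is chosen, the result should remain orthonormal with unit determinant, which is a convenient sanity check when converting between the robot and object coordinate systems.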
The real environment was set up in a dedicated room for experiments with the real robot. The virtual environment was developed using the Blender 3D Creation Suite and Unreal Engine 4. It includes the robot's dynamics model and a 3D scene, which allows us to train the robot in the virtual environment. The comparison of the real and virtual environments is presented in Figure 4.

Network Training
For all deep learning models we use the pre-trained models provided by the authors. We optimize our controller C using the CMA-ES algorithm, similarly to (Freeman et al., 2019). For the other models we use the Adam solver with a batch size of one and an initial learning rate of 2 · 10^−4.
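The Adam settings stated above can be illustrated with a minimal NumPy sketch of a single update. Only the learning rate of 2 · 10^−4 and the batch size of one come from the text; the decay rates `beta1`, `beta2` and `eps` are the commonly used defaults and are assumed here, and the quadratic toy loss is a stand-in for the real model losses.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=2e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy problem: each step uses the gradient of one sample (batch size of one)
# of the loss 0.5 * ||theta||^2, whose gradient is theta itself.
theta_start = np.array([1.0, -2.0])
theta = theta_start.copy()
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 1001):
    grad = theta
    theta, m, v = adam_step(theta, grad, m, v, t)
```

With such a small learning rate the parameters move slowly, so many single-sample updates are needed before the loss changes noticeably.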

Quantitative Evaluation
We evaluate our RL3D framework quantitatively in terms of the quality of the robot's trajectories. Most of the learned trajectories were close to the classical all-around camera configuration. An example of a robot trajectory is presented in Figure 5.

Reconstruction Accuracy
We compare the accuracy of 3D models reconstructed from images captured by the mobile robot when it was controlled by an expert and when it was controlled by our RL3D framework. The models are reconstructed using an open-source SfM implementation and compared with reference 3D models generated by a 3D scanner based on fringe projection. The 3D scanner (Knyaz, 2010) provides an accuracy of 0.1 mm for the reference 3D models. We used models provided by the MVSIR dataset (Knyaz et al., 2017). To evaluate the deviation of the 3D models obtained by the various techniques from the reference 3D model, we transform them to a common coordinate system and display the deviations using pseudo colors. The accuracy of the reconstructed surfaces is presented in Figure 6, Figure 7 and Table 1.
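The deviation computation can be sketched as a nearest-neighbor comparison between point sets. This is a simplified illustration: it assumes that the rigid transform into the common coordinate system is already known, it measures point-to-point rather than point-to-surface distances, and it uses a brute-force search that is only practical for small point sets. All function names are hypothetical.

```python
import numpy as np

def to_common_frame(points, rotation, translation):
    """Apply a rigid transform to an (N, 3) array of points."""
    return points @ rotation.T + translation

def nearest_distances(points, reference):
    """Distance from each point to its nearest reference point (brute force)."""
    diff = points[:, None, :] - reference[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)

def deviation_stats(distances):
    """Summary statistics of the deviations, in the units of the input (e.g. mm)."""
    return {
        "mean": float(distances.mean()),
        "rms": float(np.sqrt((distances ** 2).mean())),
        "max": float(distances.max()),
    }
```

The per-point distances returned by `nearest_distances` are the values that would be mapped to pseudo colors, while `deviation_stats` gives the kind of summary that could be reported in a table.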

CONCLUSION
We demonstrated that inverse reinforcement learning can be applied to the challenging task of robot path planning. Furthermore, we demonstrated that our RL3D framework makes it possible to discover complex trajectories that provide a strong camera configuration.
Figure 6. Distances in mm to the ground-truth 3D model for the 3D model reconstructed using images captured by a human operator (top) and the 3D model reconstructed using images captured by an agent trained using our RL3D framework (bottom).
Figure 7. Comparison of the reference model with the 3D model reconstructed using images captured by a human operator (top) and by an agent trained using our RL3D framework (bottom).