A MODULAR MOBILE MAPPING PLATFORM FOR COMPLEX INDOOR AND OUTDOOR ENVIRONMENTS

ABSTRACT: In this work we present the development of a prototype mobile mapping platform with a modular design and architecture that can be readily modified to address both outdoor and indoor environments effectively. Our system is built on the Robot Operating System (ROS) and utilizes multiple sensors to capture images, pointclouds and 3D motion trajectories. These include synchronized cameras with wide-angle lenses, a lidar sensor, a GPS/IMU unit and a tracking optical sensor. We report on the individual components of the platform, its architecture, the integration and calibration of its components and the fusion of all recorded data, and we provide initial 3D reconstruction results. The processing algorithms are based on existing implementations of SLAM (Simultaneous Localisation and Mapping) methods combined with SfM (Structure-from-Motion) for optimal estimation of orientations and 3D pointclouds. The scope of this work, which is part of an ongoing H2020 program, is to digitize the physical world, collect relevant spatial data and make digital copies available to experts and the public to cover a wide range of needs: remote access and viewing, processing, design, use in VR etc.


INTRODUCTION
3D mapping platforms are a core component of many diverse workflows and have become increasingly useful in recent years. There is a growing interest in platforms that allow accurate, but also fast and massive, capturing of data for 3D mapping. Many commercial solutions are offered as products or services that target the recording of either outdoor or indoor environments. They usually rely on a combination of modern geospatial technologies such as laser scanning, GNSS navigation and photogrammetry. They come in different forms: systems mounted on cars, trolleys, backpacks, or even autonomous robots. Such systems can effectively capture large urban areas, or buildings and construction sites, but they remain very expensive due to the top-end hardware components they rely on.
Urban planning and public infrastructure management are application areas that benefit from such technologies as they require constantly updated geographic information. Mobile mapping technologies have been widely used in a variety of applications in urban areas, for mapping transportation infrastructure, utilities, buildings, vegetation and lately for autonomous vehicle driving (Shi et al., 2017). A recent survey of such applications for lidar based mobile mapping is presented in the work of Wang et al. (2019). Real Estate, and Architecture, Engineering, Construction (AEC) sectors are also adopting digitization. Building Information Modelling (BIM) and Geographic Information Systems (GIS) are becoming standard tools that handle large amounts of geospatial data.
Besides specialists, the general public also consumes mobile mapping data daily through online tools or mobile apps like Google Street View (Anguelov et al., 2010). Of special interest is the case of Mapillary, a collaborative alternative to Google Street View that allows users not only to access but also to capture street-level videos or image sequences with any camera and upload them to a map.
The development of new, more versatile mobile mapping systems is expected to grow due to i) an abundance of new medium/low cost sensors that are widely produced for the mobile phone and the automotive industry, and ii) constant advancement of the underlying methods and technologies from the robotics and autonomous navigation communities.
The work presented here is part of an ongoing European H2020 STARTS Research Program called "Mindspaces" that aims to utilize 3D mapping, among other technologies, towards art-driven adaptive outdoor and indoor design (Alvanitopoulos et al., 2019). The scope of our research is to provide a tool that captures multiple types of relevant spatial data of the environment, such as raw video footage, georeferenced imagery, pointclouds etc. These can subsequently be exploited by designers and artists to collaborate with scientists and engineers towards the creation of innovative designs and experiences.
In the following sections, after a short review of related work, we present the individual sensors and components of our platform, describe their integration within the Robot Operating System (ROS) and discuss processing workflows for generating 3D reconstructions from the collected data. Initial experiments are also presented.

Mobile Mapping Systems
3D laser scanners, photogrammetry and surveying have been the typical means for 3D recording of the physical world and man-made constructions. The scientific and technological advances of the last decade have made it possible to adopt, in everyday use, much more scalable approaches to data capturing for 3D reconstruction. In this context, several mobile mapping platforms have become available on the market in recent years (Puente et al., 2013). These fall into two main categories: i) those that are suitable for mapping large-scale outdoor environments and ii) those suitable for indoor scenes.

Outdoor mapping
Most major players in the geospatial market offer similar systems, like UltraCam Mustang by VEXCEL (VEXCEL, 2020), RIEGL (RIEGL, 2020), LEICA Pegasus by HEXAGON (LEICA, 2020) and VIAMETRIS (VIAMETRIS, 2020). All these systems combine proprietary high-end laser scanners with high-performance INS/GNSS units and, optionally, 360° panoramic high-resolution multi-camera rigs. The latter two also offer backpack versions of their platforms for vehicle-restricted areas. Imajbox by imajing, originally designed for trains and now updated for cars, is a lower-cost vision-based alternative (imajing, 2020).

Indoor mapping
For interior spaces different approaches exist. There are platforms that use technologies similar to those of the car-mounted systems but are built on trolleys (NAVVIS, 2020), helmets (REscan, 2020) or backpacks (LEICA, 2020; VIAMETRIS, 2020). Handheld devices like the PARACOSM PX-80 (PARACOSM, 2020) are also available, but their accuracy is not directly comparable to that of the above systems.
Matterport (MATTERPORT, 2020) has a dedicated solution for creating digital twins for the Real Estate market. Indoor spaces are scanned via a proprietary low-cost 360° camera with depth sensors, or lately via a mobile phone, and all required processing as well as the hosting of data are done on a web service they provide.
For the Architecture, Engineering, Construction (AEC) market Doxel (DOXEL, 2020) provides automated solutions for quality inspection and progress tracking. They use artificial intelligence and autonomous robots that capture images and perform laser scanning surveys on a daily basis.

Simultaneous Localization and Mapping
Estimating the 6 Degrees of Freedom (DoF) motion trajectory of a mobile mapping platform is key to obtaining georeferenced data. Direct georeferencing from the GNSS/INS sensors is not always accurate and can fail in GPS-restricted areas. Lately many systems have adopted workflows from the robotics literature, like Visual Odometry or Simultaneous Localization and Mapping. A taxonomy and review of standard methods for visual odometry can be found in the well-known articles of Fraundorfer (2011a, 2011b).
Simultaneous Localization and Mapping is currently under heavy research. Current state-of-the-art SLAM algorithms exploit a broad range of data, such as images, IMUs or laser scanners, and achieve remarkable results (Zhang and Singh, 2015), especially in autonomous driving scenarios, whilst maintaining near real-time performance. Visual SLAM methods can be divided into feature-based methods, where features are first extracted from the images (Mur-Artal et al., 2015), and direct methods, which exploit all image gradients in the available images (Newcombe et al., 2011). Forster et al. (2017) proposed a semi-direct approach that combines direct methods for tracking pixels with feature correspondences to refine both camera poses and structure by bundle adjustment. In a recent publication Kuo et al. (2020) propose a generic vision-based SLAM solution, which is sensor-agnostic and adapts to arbitrary multi-camera configurations. Other approaches rely on 3D point cloud to image matching using specialized descriptors (Pujol-Miro et al., 2017), as well as on constraining a SLAM algorithm given a street map background (Vysotska and Stachniss, 2017).

Structure from Motion
Structure from Motion (SfM) has, for the last two decades, been widely considered the dominant image-based technique for automatic image alignment and 3D model generation, and it can be employed in workflows for processing data from mobile mapping platforms. SfM is a well-studied topic in the research community with many nearly production-ready implementations (Schonberger and Frahm, 2016) and extensions, such as the integration of video from aerial platforms (Leotta et al., 2016). However, it is still an open research field and new approaches have emerged, for example in robust image matching, mainly due to recent developments in deep learning (Yi et al., 2016).

SENSORS - COMPONENTS
Most mobile mapping systems share similar sensors for simultaneously recording visual information, depth, 3D point clouds, as well as the position and the orientation of the system in the world. More specifically, the proposed space sensing platform can support multiple sensors. The current implementation (Figure 1, Figure 2) consists of: i) four embedded 13MP machine vision cameras by e-con Systems, which can record still images or synchronized 4K video sequences, ii) a Velodyne® PUC VLP-16 LiDAR sensor, which captures 3D point clouds, iii) an Xsens MTi-G-700 GPS/IMU unit that records absolute 3D positions and rotations and iv) an Intel RealSense T265 Tracking camera for relative positioning in GPS-restricted areas (such as indoor scenes). Currently no lighting device is integrated in the platform.

Lidar
The platform uses a Velodyne® PUC VLP-16 LiDAR sensor for pointcloud recording. This sensor was selected for its balance of relatively low cost and high performance. It has a range of 100 m, a positional accuracy of ~3 cm, and 360° horizontal and 30° vertical fields of view (with 16 discrete channels). It can capture up to 600,000 points/second depending on the selected horizontal rotation velocity. It can be directly connected to a GPS/IMU device and supports data synchronization with precise GPS-supplied time via Pulse-Per-Second (PPS), in conjunction with a once-per-second NMEA GPRMC or GPGGA sentence. The lidar sensor is mounted on a ball camera tripod head on top of the platform to avoid occlusions from the other sensors, and it is placed with an inclination of ~35° to capture floors and ceilings.
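As an illustration of the once-per-second NMEA sentence mentioned above, the following minimal Python sketch validates the checksum of a GPRMC sentence and extracts its UTC time field. It is not the sensor's firmware logic, which is not described here; the function names and the sample field values are hypothetical, and only the standard NMEA 0183 checksum rule (XOR of the characters between `$` and `*`) is assumed.

```python
from functools import reduce


def nmea_checksum(payload: str) -> str:
    """XOR of all characters between '$' and '*', as two uppercase hex digits."""
    return f"{reduce(lambda acc, ch: acc ^ ord(ch), payload, 0):02X}"


def parse_gprmc_time(sentence: str) -> tuple:
    """Validate a GPRMC sentence and return its UTC time as (hour, minute, second)."""
    body, _, checksum = sentence.strip().lstrip("$").partition("*")
    if nmea_checksum(body) != checksum.upper():
        raise ValueError("NMEA checksum mismatch")
    fields = body.split(",")
    if fields[0] != "GPRMC":
        raise ValueError("not a GPRMC sentence")
    hhmmss = fields[1]  # UTC time encoded as hhmmss(.sss)
    return int(hhmmss[0:2]), int(hhmmss[2:4]), float(hhmmss[4:])


# Illustrative payload with made-up values; the checksum is computed, not hard-coded.
payload = "GPRMC,123519,A,4807.038,N,01131.000,E,022.4,084.4,230394,003.1,W"
sentence = f"${payload}*{nmea_checksum(payload)}"
utc_time = parse_gprmc_time(sentence)
```

In such a scheme the PPS pulse marks the exact start of each second, while the sentence tells the receiver which second it is.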

GPS/IMU
For direct georeferencing in outdoor spaces the platform uses the Xsens® MTi-G-700 GPS/IMU unit. It is the 4th generation motion tracker by Xsens and has built-in vibration-rejecting gyroscopes and accelerometers, a multi-GNSS receiver (GPS, GLONASS, BeiDou and Galileo) and a barometer. It measures attitude angles and accelerations, and Xsens applies a Kalman Filter based sensor fusion algorithm to provide 3D position and orientation information.

Tracking camera
For GPS-restricted areas like indoor environments the platform uses a new sensor by Intel®, the RealSense™ Tracking Camera T265. It is an embedded computer vision solution that combines two fisheye lens sensors with a combined, close to hemispherical, ~160° FOV, an Inertial Measurement Unit (IMU) and an Intel Movidius Myriad 2 Visual Processing Unit (VPU) that runs a proprietary Visual SLAM algorithm directly on the device. The T265 is connected and powered via USB and outputs 6DoF data at a sample rate of 200 Hz.

Embedded PC & Laptop PC (optional)
To host the multi-camera rig, the platform utilizes an NVIDIA® Jetson AGX Xavier™ development kit that is widely used for the development of end-to-end AI robotics applications. This kit bundles a carrier board, an integrated thermal solution together with the embedded system-on-module (SoM) Jetson AGX Xavier. It combines an 8-Core ARM v8.2 64-Bit CPU, a 512-Core Volta GPU with Tensor Cores and 32 GB 256-Bit LPDDR4x Memory. It is configured to run Ubuntu 18.04.
An NVMe disk is added for storage and an LCD touch screen for control and visualization. Since this embedded system is powerful enough, our initial intention was to build the entire platform on it. This was achieved except for support for the Xsens GPS/IMU unit, since no drivers were implemented for the ARM architecture. Thus, a laptop PC configured with Ubuntu 18.04 is an additional component used to include the GPS/IMU sensor (see distributed architecture implementation in Section 4.1). In an upcoming version of the platform we plan to replace the specific sensor with one compatible with the embedded PC.

Power supply
To power all the sensors and the embedded PC, a 4-cell LiPo battery of 5500 mAh and 14.8 V was used. This provides enough power to run the platform for ~30 min. When the platform is mounted on a car, a typical 12 V to 220 V inverter can be used instead.

Mounting
To physically combine all available sensors, a prototype base was designed in 3D and then 3D printed (Figure 3). This design also provided a good approximation of all sensors' relative orientations (boresight alignment parameters). The base includes a typical camera mount that can be connected to a camera tripod on a dolly (Figure 2) and pushed around to perform data collection in interior environments or relatively small outdoor areas (like squares, individual buildings etc). Alternatively, it can be mounted on a car roof via a DSLR suction cup camera mount. To further optimize the capturing process, the use of a gimbal to reduce sensor shake, as well as a backpack form factor version, are considered for future implementations.
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B1-2020, 2020 XXIV ISPRS Congress (2020 edition)

SENSORS INTEGRATION & DATA PROCESSING
Our platform aims to provide precise and fast 3D recording of outdoor and indoor spaces and consists of two main modules: i) the space sensing module, which is responsible for data collection, and ii) the 3D reconstruction module, which processes all available data. The platform also runs in two separate modes, one for indoors and one for outdoors, each with some different hardware components, such as the tracking camera for indoors and the GPS/IMU sensor for outdoors.

Space Sensing Module
For the integration of all sensors into a single capturing system the Robot Operating System (ROS) (Quigley et al., 2009) was adopted. ROS is a middleware that is widely used by robotics teams both in academia and in the development of commercial products. Several ROS-based open-source projects that implement sensor integration are available. Lately ROS was also proven to be a suitable platform for building mobile mapping systems to capture 3D interior (Blaser et al., 2018) and underground environments (Blaser et al., 2019). ROS was selected since it is open source, it supports multiple programming languages (C++ and Python), it allows for low-level device control and it is modular by design, making it relatively easy to add or remove devices.
ROS implements a message-passing communication architecture. A node is created for each sensor, which publishes sensor data as messages on specific topics. Topics may contain raw or processed values, and each data entry inside a topic is assigned a timestamp. Real-time processes are usually implemented as nodes that subscribe to specific topics and then publish their estimations on new topics. For offline processing, all messages are recorded in a single "bag" file. This is implemented by a "rosbag" node that subscribes to all messages from the available sensors and stores them on the disk drive in a "bag" file. "Bag" files can then be replayed for developing and testing algorithms.
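The publish/subscribe and recording pattern described above can be illustrated with a small, self-contained Python toy. It deliberately does not use the ROS API: the `ToyBus` class, the topic names and the message contents are all hypothetical stand-ins, meant only to show how one recorder that subscribes to every topic ends up with a single time-stamped log that can be replayed offline.

```python
from collections import defaultdict


class ToyBus:
    """In-memory stand-in for topic-based message passing (not the ROS API)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, stamp, data):
        for callback in self._subscribers[topic]:
            callback(topic, stamp, data)


bus = ToyBus()
bag = []  # plays the role of the "bag" file: one time-stamped log of all topics

# The recorder subscribes to every sensor topic and appends what it receives.
for topic in ("/lidar/points", "/tracking/pose"):
    bus.subscribe(topic, lambda name, stamp, data: bag.append((stamp, name, data)))

# Two sensor "nodes" publishing stamped messages (hypothetical names and values).
bus.publish("/tracking/pose", 0.005, {"xyz": (0.0, 0.0, 0.0)})
bus.publish("/lidar/points", 0.100, {"num_points": 3})
bus.publish("/tracking/pose", 0.010, {"xyz": (0.01, 0.0, 0.0)})

# Offline "playback": replay messages in time order, here filtered to one topic.
poses = [msg for msg in sorted(bag, key=lambda m: m[0]) if msg[1] == "/tracking/pose"]
```

Because the recorder is just another subscriber, adding or removing a sensor only changes the list of topics, which mirrors the modularity argument made for ROS.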
ROS offers tools to monitor all recorded topics ("rqt-topic") (Figure 4) as well as tools to visualize 2D and 3D sensor data (rviz) (Figure 5). Since all data are published as topics with timestamps, these timestamps are recorded in the "bag" file. Synchronization during offline data access or processing is usually handled by taking the data of each sensor that corresponds to the nearest timestamp or by interpolation.
Figure 4. Data from all sensors in ROS can be monitored via the "rqt-topic" tool. Data entries are usually accessed through a timeline feature.
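The nearest-timestamp strategy mentioned above can be sketched in a few lines of Python. This is a minimal illustration, assuming each sensor's timestamps are available as a sorted list of seconds; the function names, the sample rates and the `max_gap` tolerance are assumptions made for the example and are not taken from the platform's code.

```python
from bisect import bisect_left


def nearest_index(timestamps, t):
    """Index of the entry in sorted `timestamps` closest to time `t`."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Pick the closer of the two neighbours around the insertion point.
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1


def match_nearest(reference_stamps, other_stamps, max_gap):
    """Pair each reference stamp with the nearest stamp of another sensor.

    Pairs further apart than `max_gap` seconds are dropped.
    """
    pairs = []
    for t in reference_stamps:
        j = nearest_index(other_stamps, t)
        if abs(other_stamps[j] - t) <= max_gap:
            pairs.append((t, other_stamps[j]))
    return pairs


# Hypothetical stamps in seconds: lidar scans at 10 Hz, camera frames at 30 FPS.
lidar_stamps = [0.00, 0.10, 0.20, 0.30]
frame_stamps = [k / 30 for k in range(12)]
pairs = match_nearest(lidar_stamps, frame_stamps, max_gap=0.02)
```

The binary search keeps each lookup logarithmic in the number of messages, and the `max_gap` guard avoids pairing data across a dropout of one sensor.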
More specifically, the proposed space sensing platform was implemented in ROS Melodic Morenia on Ubuntu 18.04. A node was created for each sensor. Nodes communicate with the sensors and publish their data on a suitably designed topic.
Figure 5. Visualization of the pointcloud topic from the Velodyne VLP-16 Lidar node in the rviz ROS tool.
For the Velodyne® PUC VLP-16 LiDAR node, the official ROS "velodyne_driver" and "velodyne_pointcloud" packages were used. The first provides basic device handling for Velodyne lidars and publishes the raw data packets that are transmitted from the sensor through an ethernet connection. The second provides point cloud conversions. The Velodyne node publishes a "velodyne_points (sensor_msgs/PointCloud2)" topic which contains accumulated Velodyne points transformed into a selected frame of reference.
An official ROS package, "xsens_mti_driver", was also used for the Xsens® MTi-G-700 GPS/IMU unit node. The node publishes a "tf (geometry_msgs/TransformStamped)" topic that contains the 6 DoF orientation parameters (X, Y, Z translations and quaternion rotations) transformed into a selected frame of reference. A similar topic is published by the RealSense™ Tracking Camera T265 node, which uses the ROS "realsense2_camera" package.
A new package was developed by our team for the multi-camera rig since no ROS-compatible implementation was available. It is designed to work on the NVIDIA® Jetson AGX Xavier™, with custom-made nodes and topics. The package consists of two subprograms, the "capturer" (C) and the "publisher" (C++). The first handles the cameras and captures images via the v4l2 and GStreamer libraries, while the second is responsible for publishing image data and metadata as a ROS topic. Initially we published image frames in a ROS topic, but this approach led to low FPS performance. In the current implementation the "capturer" app records 4 synchronised 4K videos at 30 FPS as .mkv files with H.264 encoding at a storage path defined by the "publisher" app. The latter publishes the start/end timestamps of the video sequence, as well as the timestamps and the frame_ids of every synchronized frame that is added to the GStreamer buffer. Video files require ~60 MB/camera/minute.
ROS natively supports a distributed architecture where sensors can run across multiple machines, which communicate through a local network via a talker/listener logic. All nodes are configured to use a single ROS Master app ("roscore"), the address of which is defined by an environment variable ("ROS_MASTER_URI"). Although the initial plan was to build the space sensing module on a single machine (NVIDIA® Jetson AGX Xavier™) to which all sensors would be connected, the incompatibility of the GPS/IMU sensor with ARM processors meant that the system was actually built following two alternative architectures (Figure 6): a single-machine mode, used when the GPS/IMU is not needed (for example in indoor environments), and a distributed mode that supports the GPS/IMU device. In the latter configuration all sensors are connected to a laptop PC, except for the camera rig, which by design must run on the NVIDIA Jetson Xavier embedded computer. The space sensing module is executed by a script that launches all processes (Figure 7).
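To put the quoted video storage rate of the multi-camera rig into perspective, the short Python sketch below turns it into a per-session storage estimate and an average per-camera bitrate. The rate and the number of cameras come from the text; the function names, the 15 minute session length used in the example and the convention 1 MB = 10^6 bytes are assumptions.

```python
MB_PER_CAMERA_MINUTE = 60   # storage rate quoted in the text
NUM_CAMERAS = 4             # cameras in the rig, from the text


def video_storage_mb(minutes, cameras=NUM_CAMERAS, rate=MB_PER_CAMERA_MINUTE):
    """Approximate video storage for one capture session, in MB."""
    return minutes * cameras * rate


def mean_bitrate_mbps(rate=MB_PER_CAMERA_MINUTE):
    """Average per-camera bitrate in Mbit/s, assuming 1 MB = 10**6 bytes."""
    return rate * 8 / 60


session_mb = video_storage_mb(15)   # e.g. a survey of about 15 minutes
bitrate = mean_bitrate_mbps()
```

Under these assumptions a 15 minute session needs roughly 3.6 GB for video alone, at about 8 Mbit/s per camera, which gives a sense of the sustained write load on the storage disk.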
A basic GUI for touch screens (Figure 8) was also developed. It has tools to set the capturing parameters and to start/stop the capturing session. Tools to inspect sensor connectivity and to assist the refocusing of each camera are also included.

3D Reconstruction Module
Every capturing mission with the space sensing module collects multiple types of data from the connected sensors, which are stored in a "bag" file. All data are organised based on their timestamps. For any given time point or period it is possible to retrieve the corresponding data (i.e. image frames, pointclouds and 6DoF motion trajectories) and apply 3D reconstruction workflows. Since the work presented here is part of ongoing research, several alternative approaches are under consideration before settling on an optimal workflow. More specifically, for the 3D reconstruction module of our platform we relied on existing software libraries, such as ORB-SLAM2 (Mur-Artal et al., 2015) and Google Cartographer (Hess et al., 2016) for vision-based and lidar-based SLAM respectively, AliceVision & Meshroom (Jancosek and Pajdla, 2011; Moulon et al., 2012) for Structure-from-Motion and Open3D (Zhou et al., 2018) for pointcloud processing.
Direct georeferencing from the specific GPS/IMU sensor or the tracking camera is not preferable due to their limited accuracy. However, initial tests have shown that the trajectories provided by these sensors can be used to assist vision-based or lidar-based SLAM algorithms. The latter provide more accurate estimations of the platform's motion trajectory and initial values of the orientation parameters for the individual pointclouds and the image frames. Global maps in the form of registered pointclouds are also provided, but most of the time they are sparse, incomplete and noisy. In most cases though, the 3D models can be further improved by means of Structure-from-Motion solutions.
Since the four cameras of the platform capture images at high FPS rates, each data collection mission consists of several million image frames. Using the image orientations provided by SLAM, key poses of the multi-camera rig are selected so that they capture the area of interest with sufficient overlap and leave no gaps. Only these key frames are used in a Structure-from-Motion workflow through the Meshroom open-source software framework. Meshroom implements a self-calibration bundle adjustment solution that supports Camera Rig Calibration. This allows for optimal estimation of the four cameras' interior orientation parameters along with their relative orientation. This also leads to more accurate and consistent 3D reconstruction results. Finally, dense 3D point clouds are generated via Multi View Stereo 3D reconstruction algorithms. In GPS-deprived areas, where absolute orientation is a requisite, relative path and reconstruction estimations can be updated by means of Ground Control Points (GCPs) measured through standard surveying techniques.
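One simple way to thin the frame sequence into key poses is to keep a pose only once the rig has moved a minimum distance since the last selected pose. The Python sketch below illustrates this distance-interval criterion only; it does not check image overlap or coverage, and it is not claimed to be the selection rule used in the platform. The function name, the spacing value and the sample trajectory are hypothetical.

```python
from math import dist


def select_keyframes(positions, min_spacing):
    """Return indices of poses spaced at least `min_spacing` metres apart.

    `positions` is a sequence of (x, y, z) rig positions in trajectory order.
    """
    if not positions:
        return []
    selected = [0]  # always keep the first pose
    for i in range(1, len(positions)):
        # Compare against the last selected pose, not the previous sample.
        if dist(positions[i], positions[selected[-1]]) >= min_spacing:
            selected.append(i)
    return selected


# Hypothetical straight-line trajectory sampled every 0.25 m.
trajectory = [(0.25 * k, 0.0, 0.0) for k in range(21)]
keyframes = select_keyframes(trajectory, min_spacing=1.0)
```

A criterion of this kind discards the many near-duplicate frames recorded while the platform moves slowly or stands still, at the cost of ignoring rotation-only motion, which a fuller rule would also have to consider.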
It must be mentioned that, since all sensors were placed on a custom-designed 3D printed case with known dimensions, good approximations of all sensors' relative orientations were available a priori. The effect of small misalignments was handled by the SLAM and bundle-adjustment solutions. An approach that we plan to investigate further is to update those relative orientation parameters by matching, in 3D space, the individual motion trajectories provided by the different sensors.

EXPERIMENTS
To demonstrate the effectiveness of a first prototype of the platform, two data collection experiments are presented.

Indoors Environment
During the development of both the space sensing and the 3D reconstruction modules of the platform several experiments were conducted inside our office space. The presented survey corresponds to a single loop of the platform around an open desk area of ~90 m². Figure 9 shows four image frames from the multi-camera rig. In this experiment the 6 DoF motion trajectory from the RealSense™ T265 Tracking Camera was used to initialize a Structure-from-Motion solution with Camera Rig Calibration. A subset of the multi-camera rig image frames was used (images at a given distance interval) (Figure 10). Figure 11 shows an accumulated pointcloud from individual scans of the Velodyne® PUC VLP-16 LiDAR. The 3D translations and rotations of each individual scan were interpolated from the synchronized trajectory of the tracking camera.
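Interpolating a pose for each lidar scan from a higher-rate trajectory can be sketched as follows: translations are interpolated linearly and rotations by spherical linear interpolation (slerp) of unit quaternions. This is a minimal illustration rather than the platform's implementation; it assumes at least two strictly increasing trajectory timestamps and quaternions stored as (x, y, z, w), and the function names and sample values are hypothetical.

```python
from bisect import bisect_right
from math import acos, sin, sqrt


def slerp(q0, q1, u):
    """Spherical linear interpolation between unit quaternions (x, y, z, w)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:  # q and -q encode the same rotation; take the shorter arc
        q1, dot = tuple(-c for c in q1), -dot
    if dot > 0.9995:  # nearly identical rotations: fall back to normalised lerp
        q = tuple(a + u * (b - a) for a, b in zip(q0, q1))
    else:
        theta = acos(dot)
        w0, w1 = sin((1 - u) * theta) / sin(theta), sin(u * theta) / sin(theta)
        q = tuple(w0 * a + w1 * b for a, b in zip(q0, q1))
    norm = sqrt(sum(c * c for c in q))
    return tuple(c / norm for c in q)


def interpolate_pose(stamps, positions, quaternions, t):
    """Pose at time `t` from a trajectory sampled at sorted `stamps`."""
    if not stamps[0] <= t <= stamps[-1]:
        raise ValueError("t lies outside the trajectory time span")
    i = min(bisect_right(stamps, t), len(stamps) - 1)
    t0, t1 = stamps[i - 1], stamps[i]
    u = (t - t0) / (t1 - t0)  # fraction of the way from sample i-1 to sample i
    position = tuple(a + u * (b - a) for a, b in zip(positions[i - 1], positions[i]))
    return position, slerp(quaternions[i - 1], quaternions[i], u)


# Hypothetical two-sample trajectory: 1 m along x while yawing by 90 degrees.
stamps = [0.0, 1.0]
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
quaternions = [(0.0, 0.0, 0.0, 1.0), (0.0, 0.0, sqrt(0.5), sqrt(0.5))]
position, rotation = interpolate_pose(stamps, positions, quaternions, 0.5)
```

Halfway through the sample trajectory the sketch yields a position of 0.5 m along x and a yaw of 45 degrees. With a 200 Hz trajectory the interpolation interval is only a few milliseconds, so the constant-velocity assumption behind this scheme is mild.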

Outdoors Environment
A second, more complete data capture mission was carried out in the area around the cultural centre of Tecla Sala, which is situated in the city of L'Hospitalet, in Barcelona, Spain. This is the first Pilot Use Case of the "Mindspaces" H2020 Research Program. The tripod dolly was moved slowly around the cultural centre building to ensure sufficient overlap between scanlines and image frames. The whole survey with the mobile mapping platform lasted ~15 minutes. In this experiment a vision-based SLAM solution was used to estimate the motion and rotation trajectory of the platform, and then an automatically selected subset of the collected image dataset was fed to the Structure-from-Motion workflow (Figure 14). A dense point cloud was also computed via Multi View Stereo Dense Reconstruction (Figure 15).

CONCLUDING REMARKS
In this contribution we presented a first implementation of a modular mobile mapping platform that is based on commercial hardware components and open source software libraries. The integration of all sensors was carried out with the Robot Operating System (ROS), which allows for easy additions, changes and updates of the platform's components.
Several improvements are under consideration. A first one is the replacement of the GPS/IMU sensor with one compatible with ARM CPUs. This will allow the platform to run exclusively on the NVIDIA® Jetson AGX Xavier™ development kit and thus reduce its dependency on hardware and its size, and improve its overall portability. Mounting the platform on a camera gimbal is also considered, to facilitate data collection sessions and obtain more stabilized data. The 3D reconstruction module requires further development, and all workflows need to be thoroughly tested and evaluated with respect to their effectiveness, accuracy and performance in well-organised experiments.
A final, more general remark has to do with a well-known restriction of mobile mapping systems, which is the inability to capture spatial information that is not directly visible from street level (e.g. building roofs, backyards etc). Occlusions due to obstacles such as buildings, parked cars or trees also lead to unavoidable gaps. To get complete digital copies of complex spaces, mobile mapping missions need to be combined either with existing geospatial data from open databases or with aerial missions from drones. This was initially planned for the Tecla Sala Pilot Use Case but was not realised because of a general ban on drone flight missions in the specific area. However, a second Pilot Use Case is currently under preparation in a more suitable area, where a licence to perform both mobile and aerial mapping missions has been granted to the consortium. This will soon allow us to present the potential of combining mobile mapping with UAV photogrammetry.