PROGRESS ON ISPRS BENCHMARK ON MULTISENSORY INDOOR MAPPING AND POSITIONING

This paper presents the design of the benchmark dataset on multisensory indoor mapping and positioning (MIMAP), which is sponsored by the ISPRS scientific initiatives. The benchmark dataset includes point clouds captured by an indoor mobile laser scanning system (IMLS) in indoor environments of varying complexity. The benchmark aims to stimulate and promote research in the following three fields: (1) SLAM-based indoor point cloud generation; (2) automated BIM feature extraction from point clouds, with an emphasis on the elements involved in building management and navigation tasks, such as floors, walls, ceilings, doors, windows, stairs, lamps, switches, and air outlets; and (3) low-cost multisensory indoor positioning, focusing on smartphone platform solutions. MIMAP provides a common framework for the evaluation and comparison of LiDAR-based SLAM, BIM feature extraction, and smartphone indoor positioning methods.


INTRODUCTION
Indoor environments are essential to people's daily lives, and indoor mapping and positioning technologies have been in high demand in recent years. Visualization, positioning and location-based services (LBS), routing and navigation in large public buildings, navigational assistance for disabled or elderly people, and evacuation under different emergency conditions are just a few examples of emerging applications that require 3D mapping and positioning of indoor environments. SLAM-based indoor mobile laser scanning systems (IMLS) provide an effective tool for indoor applications: during IMLS data collection, both 3D point clouds and high-accuracy trajectories with position and orientation are acquired. Many efforts have been made in the last few years to improve SLAM algorithms (Zhang & Singh, 2014a) and the extraction of geometric and semantic information from point clouds and images (Armeni et al., 2016a). Nevertheless, challenges remain. First, there is a lack of efficient or real-time methods for generating 3D point clouds of as-built indoor environments. Second, extracting building information model (BIM) features is difficult in cluttered and occluded indoor environments. In addition, given its relatively high accuracy, the IMLS trajectory provides a good reference, or ground truth, for low-cost indoor positioning solutions.

SENSORS AND DATA ACQUISITION
Standard datasets are critical for evaluating and comparing indoor mapping and positioning methodologies. In this project, the XBeibao II system (Wen et al., 2016a), shown in Figure 1(a) and developed by the SCSC Lab at Xiamen University, is used to collect the multisensory indoor data. The system includes two Velodyne multi-beam laser scanners and a fisheye lens camera (Figure 1(b)). In addition, navigation-related data from the smartphone's built-in sensors, such as the barometer, magnetometer, six-degree-of-freedom MEMS IMU, and WiFi, can be collected. A SLAM-based 3D point cloud of the indoor environment can also be generated with the XBeibao processing software package. Finally, the Riegl VZ 1000 (Figure 1(c)) provides a high-accuracy point cloud that serves as ground truth for indoor mapping.
1 × Riegl VZ 1000 scanner (www.riegl.com/datasheet_vz-1000): range from 1.5 m up to 1200 m, 5 mm precision, 8 mm accuracy, 0.3 million points/second, and a field of view of 100° vertical × 360° horizontal.

When collecting the data, we placed the smartphone face up on top of the upper LiDAR sensor. A laptop is used to control the camera and the LiDARs. It also acts as a hotspot for the smartphone, is used to synchronize the sensors, and stores the incoming LiDAR data streams. A system operator carries the laptop during the collection process.

Dataset
We collected raw data in three different scenes; each scene was recorded more than three times along different routes. Each round of data consists of three parts: the raw data, the benchmark data, and the calibration files. Only half of the complete dataset has been released, in separate versions for the different tests. For the indoor LiDAR-based SLAM test and the BIM feature extraction test, no benchmark data are released. For the smartphone indoor positioning test, only the raw smartphone data and the calibration files are provided.

Data description:
Each sequence of data is compressed into a file named "date_number_type.zip", where "date" is a placeholder for the recording date and "number" is the serial number of that day's recording round. The "type" field takes four values, 00, 01, 03 and 04, representing the complete data, the SLAM test data, the BIM feature extraction test data and the indoor positioning test data, respectively. The directory structure is shown in Figure 2.

Figure 2. Structure of the dataset. Here, 'date', 'number', 'unixtime', 'sensor_name', 'scene' and 'type' are placeholders. 'date_x.pcap' refers to the two LiDAR streams, and 'date_x.mp4' refers to the two video camera streams.
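The naming convention can be checked programmatically. The Python sketch below splits an archive name into its fields and maps the type code to its content. It assumes that neither 'date' nor 'number' contains an underscore and that 'type' is always a two-digit code; the function name and the example file name are hypothetical.

```python
import re
from typing import NamedTuple

# Type codes and their contents, as listed in the text above.
TYPE_CODES = {
    "00": "complete data",
    "01": "SLAM test data",
    "03": "BIM feature extraction test data",
    "04": "indoor positioning test data",
}

# Assumption: 'date' and 'number' contain no underscores; 'type' has two digits.
ARCHIVE_NAME = re.compile(r"^(?P<date>[^_]+)_(?P<number>[^_]+)_(?P<type>\d{2})\.zip$")


class ArchiveInfo(NamedTuple):
    date: str
    number: str
    type_code: str
    content: str


def parse_archive_name(filename: str) -> ArchiveInfo:
    """Split a 'date_number_type.zip' archive name into its fields."""
    match = ARCHIVE_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a 'date_number_type.zip' name: {filename!r}")
    code = match.group("type")
    if code not in TYPE_CODES:
        raise ValueError(f"unknown type code: {code!r}")
    return ArchiveInfo(match.group("date"), match.group("number"), code, TYPE_CODES[code])


# Hypothetical archive name, for illustration only.
info = parse_archive_name("20190301_2_04.zip")
print(info.content)  # indoor positioning test data
```

Rejecting unknown codes, rather than passing them through, makes a mistyped or unexpected archive name fail early instead of being silently misclassified.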
The raw data are saved in the subdirectory "date_rawdata/", which contains data from the three kinds of sensors mentioned above: LiDAR, camera, and smartphone.
Velodyne LiDAR: To distinguish the Velodyne readings of LiDAR sensor A from those of LiDAR sensor B, the LiDAR scans are named "date_A.pcap" and "date_B.pcap", where 'date' is the date on which the data were collected. Each point is stored with its (x, y, z) coordinates and its reflectance intensity value (r). The file "unixtime_start.txt" records the start time of the recording.
Smartphone: The data of each sensor are saved in the file "unixtime_data/sensor_name.txt", where "unixtime" and "sensor_name" are placeholders for the start time of the recording and the sensor's abbreviated name, respectively. A Unix timestamp is recorded for every sample of every sensor. The file "timeOffset.txt" records the time offsets from the phone to the local NTP server at different times.
The three kinds of benchmark are saved in the corresponding zip files, each of which includes the file formats and detailed descriptions. The benchmarks are discussed in subsection 3.3.

Figure 3. An example of a scene's architectural plan. The red dot marks the selected origin, which lies on the ground, and the blue arrows indicate the directions of the X- and Y-axes. The Z-axis is perpendicular to the X- and Y-axes and points from the ground to the ceiling.

The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLII-2/W13, 2019 ISPRS Geospatial Week 2019, 10-14 June 2019, Enschede, The Netherlands

Time synchronization
To synchronize all the sensors, the laptop is set up as a local NTP (Network Time Protocol) server, and all the sensors connect to it to synchronize their clocks. The LiDAR is connected to the laptop through a network cable, whereas the smartphone and the camera connect via WiFi. For the LiDAR, only the start Unix timestamp of the data collection is available; the timestamp of every point or frame is relative to this start time. Likewise, for the camera, only the start Unix timestamp of the videos is available. The time of each frame can be obtained by interpolation according to the frame rate, but frame loss sometimes occurs, which makes these interpolated times less reliable. The smartphone, in contrast, can synchronize with the local NTP server during recording, so the Unix timestamp of every sample is relatively accurate. Since timestamps are available for all data, the position at any time can be obtained by interpolation, and the LiDAR positioning result can serve as the ground truth for smartphone positioning.
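The per-frame timing described above amounts to t_k = t_start + k / fps for frame k. A minimal Python sketch, assuming a constant frame rate and no dropped frames (the function names and the example start time are hypothetical), computes these nominal frame times and finds the frame closest to a given Unix timestamp, such as a smartphone sample:

```python
import bisect
from typing import List


def frame_timestamps(start_unix: float, fps: float, n_frames: int) -> List[float]:
    """Nominal Unix timestamp of each video frame: start + k / fps.

    Assumes a constant frame rate and no dropped frames; after a frame loss
    these nominal times drift away from the true capture times.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    return [start_unix + k / fps for k in range(n_frames)]


def nearest_frame(timestamps: List[float], t: float) -> int:
    """Index of the frame whose (sorted) timestamp is closest to time t."""
    if not timestamps:
        raise ValueError("no frames")
    i = bisect.bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Choose the closer of the two neighbouring frames.
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1


# Hypothetical values: a 30 fps video with a start time read from the record.
stamps = frame_timestamps(1551400000.0, 30.0, 300)
print(nearest_frame(stamps, 1551400002.51))
```

Because a dropped frame shifts every later index by one, the nominal times drift after a loss. Comparing the expected frame count (duration × fps) with the number of frames actually decoded is one simple way to detect that this has happened.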

Multi-Sensor Calibration
In this system, LiDAR sensor A, with coordinate system $(X_1, Y_1, Z_1)$, is mounted horizontally; LiDAR sensor B, with coordinate system $(X_2, Y_2, Z_2)$, is mounted at 45° below LiDAR sensor A (Figure 1(b)). Based on our previous work, the point cloud of LiDAR sensor A, $P_A$, and the point cloud of LiDAR sensor B, $P_B$, are fused into a single point cloud $P$ using the 4 × 4 transformation matrix $T_B^A$ between the two LiDAR sensors (Eq. (2)):

$$P = P_A \cup T_B^A\, P_B \qquad (2)$$

Additionally, terrestrial laser scanning (TLS) data are introduced to bridge the calibration between the LiDAR sensors and the cameras. The calibration process is shown in
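Eq. (2) amounts to expressing every point of LiDAR B in LiDAR A's frame and concatenating the two clouds. The Python sketch below illustrates this with a hypothetical transform (a 45° rotation about A's X-axis plus a 0.2 m vertical offset), since the actual calibration values are not given in the text. Only the (x, y, z) coordinates are transformed; the intensity r would simply be carried along unchanged.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def transform_points(T: Sequence[Sequence[float]], points: Sequence[Point]) -> List[Point]:
    """Apply a 4x4 homogeneous rigid transform (last row 0 0 0 1) to 3-D points."""
    return [
        (
            T[0][0] * x + T[0][1] * y + T[0][2] * z + T[0][3],
            T[1][0] * x + T[1][1] * y + T[1][2] * z + T[1][3],
            T[2][0] * x + T[2][1] * y + T[2][2] * z + T[2][3],
        )
        for x, y, z in points
    ]


def fuse_clouds(cloud_a: Sequence[Point], cloud_b: Sequence[Point], T_b_to_a) -> List[Point]:
    """P = P_A union (T * P_B): express B's points in A's frame and concatenate."""
    return list(cloud_a) + transform_points(T_b_to_a, cloud_b)


# Hypothetical extrinsics: B rotated 45 degrees about A's X-axis, 0.2 m lower.
c, s = math.cos(math.radians(45)), math.sin(math.radians(45))
T_b_to_a = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, c, -s, 0.0],
    [0.0, s, c, -0.2],
    [0.0, 0.0, 0.0, 1.0],
]
fused = fuse_clouds([(1.0, 0.0, 0.0)], [(0.0, 1.0, 0.0)], T_b_to_a)
```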

LiDAR-to-LiDAR calibration:
The calibration between the LiDAR sensors is computed recursively during sub-map construction, subject to an isomorphism constraint. Let $T_A^{0:n}$ denote the trajectory of LiDAR sensor A from time 0 to time $n$ in the mapping algorithm, $P_B^n$ the point cloud of LiDAR sensor B at time $n$, and $T_0$ the initial coordinate transformation between the LiDAR sensors. The exact calibration matrix $T_B^A$ is then computed from

$$\hat{P}_B^n = T_A^n\, T_0\, P_B^n, \qquad Q = \mathrm{NN}\big(M, \hat{P}_B^n\big)$$

where $\mathrm{NN}(\cdot)$ is the nearest neighbour point search algorithm. Using $T_A^n$ and $T_0$, $P_B^n$ is first transformed to its location at time $n$ in the sub-map $M$. The $\mathrm{NN}(\cdot)$ algorithm is then used to search the sub-map for the nearest neighbour point set $Q$. Finally, an environmental consistency constraint between $\hat{P}_B^n$ and $Q$ is introduced to obtain $T_B^A$.
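The first two steps of the LiDAR-to-LiDAR calibration can be sketched in Python: LiDAR B's scan at time n is moved into the sub-map using LiDAR A's pose at time n and the initial LiDAR-to-LiDAR transform, and each moved point is paired with its nearest sub-map point. This is an illustration under stated assumptions, not the authors' implementation. The brute-force search stands in for whatever nearest-neighbour structure (for example a k-d tree) is used in practice, the environmental consistency constraint is not reproduced, and the mean neighbour distance returned here is only a simple stand-in score that a calibration routine could try to minimize. All names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Matrix4 = List[List[float]]


def matmul4(A: Matrix4, B: Matrix4) -> Matrix4:
    """Product of two 4x4 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def apply_transform(T: Matrix4, points: Sequence[Point]) -> List[Point]:
    """Apply a 4x4 homogeneous transform to 3-D points."""
    return [
        tuple(T[r][0] * x + T[r][1] * y + T[r][2] * z + T[r][3] for r in range(3))
        for x, y, z in points
    ]


def nearest_neighbours(submap: Sequence[Point], queries: Sequence[Point]) -> List[Point]:
    """Brute-force nearest-neighbour search, O(len(submap) * len(queries))."""
    return [min(submap, key=lambda m: math.dist(m, q)) for q in queries]


def associate_scan(T_A_n: Matrix4, T_0: Matrix4,
                   cloud_B_n: Sequence[Point], submap: Sequence[Point]):
    """Move LiDAR B's scan at time n into the sub-map M and pair it with neighbours.

    Returns the transformed scan, the neighbour set Q, and the mean
    point-to-neighbour distance (a simple score, not the paper's constraint).
    """
    if not cloud_B_n or not submap:
        raise ValueError("empty scan or sub-map")
    moved = apply_transform(matmul4(T_A_n, T_0), cloud_B_n)
    Q = nearest_neighbours(submap, moved)
    score = sum(math.dist(p, q) for p, q in zip(moved, Q)) / len(moved)
    return moved, Q, score
```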

Camera-to-LiDAR calibration:
The camera intrinsic calibration matrix is given by

$$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$$

where $(f_x, f_y)$ is the focal length of the camera, $(c_x, c_y)$ is the position of the principal point of the camera, and $(k_1, k_2, k_3)$ are the radial distortion factors. Scaramuzza's camera calibration method (Scaramuzza et al., 2006a) is used to determine the intrinsic parameters and distortion factors of the camera and to obtain the camera's intrinsic model.
We use a TLS (e.g., the Riegl VZ 1000) to bridge the calibration between the LiDAR sensors and the cameras. By manually selecting matching points between the TLS point cloud and the camera images, we obtain the camera's extrinsic transformation $[R, t]$, where $R$ is the 3 × 3 rotation matrix and $t$ is the 3 × 1 translation vector.
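To make the roles of K, (k1, k2, k3) and [R, t] concrete, the Python sketch below projects a 3-D point from the LiDAR/TLS frame into pixel coordinates. It assumes the common pinhole model with a radial distortion factor 1 + k1 r² + k2 r⁴ + k3 r⁶ applied to normalized image coordinates. Scaramuzza's omnidirectional model, used for the fisheye camera, has a different projection function, so this illustrates the parameterization stated above rather than that calibration toolbox. The function name and parameter values are hypothetical.

```python
from typing import Sequence, Tuple


def project_point(
    X: Sequence[float],
    R: Sequence[Sequence[float]],
    t: Sequence[float],
    fx: float, fy: float, cx: float, cy: float,
    k1: float = 0.0, k2: float = 0.0, k3: float = 0.0,
) -> Tuple[float, float]:
    """Project a 3-D point (LiDAR/TLS frame) to pixel coordinates (u, v).

    Pinhole model with intrinsics K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]],
    radial distortion (k1, k2, k3) and extrinsics [R, t] (t as a 3x1 vector).
    """
    # Extrinsics: camera-frame coordinates Xc = R X + t.
    xc, yc, zc = (sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3))
    if zc <= 0:
        raise ValueError("point is behind the camera")
    # Normalized image coordinates.
    x, y = xc / zc, yc / zc
    # Radial distortion polynomial 1 + k1 r^2 + k2 r^4 + k3 r^6.
    r2 = x * x + y * y
    factor = 1.0 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    # K maps the distorted normalized coordinates to pixels.
    return fx * x * factor + cx, fy * y * factor + cy
```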

Phone-to-LiDAR calibration:
We place the smartphone face up on LiDAR A (Figure 5), with its Y-axis parallel to the laser beam scanning direction. Thus, the phone's coordinate system and the LiDAR's coordinate system have the same axis directions. We then use the Riegl VZ 1000 TLS to scan the XBeibao II system and calibrate the translation $(t_x, t_y, t_z)$ between the phone and the LiDAR by manually picking points in the high-accuracy 3-D point cloud.
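Since the phone's axes are aligned with LiDAR A's axes, the calibrated translation acts as a lever arm. Assuming the LiDAR pose at a given time is available as a rotation matrix and a position in the map frame (the text does not state how the offset is applied), the phone's reference position follows from rotating the lever arm into the map frame and adding it to the LiDAR position. A minimal Python sketch, with hypothetical names:

```python
from typing import Sequence, Tuple


def phone_position(
    R_lidar: Sequence[Sequence[float]],
    p_lidar: Sequence[float],
    t_phone: Sequence[float],
) -> Tuple[float, ...]:
    """Map-frame position of the phone given the LiDAR pose (R_lidar, p_lidar).

    Because the phone's axes are aligned with LiDAR A's axes, the phone is
    offset from the LiDAR origin by the lever arm t_phone = (tx, ty, tz),
    expressed in the LiDAR frame; it is rotated into the map frame first.
    """
    return tuple(
        sum(R_lidar[i][j] * t_phone[j] for j in range(3)) + p_lidar[i] for i in range(3)
    )


# Hypothetical pose: LiDAR yawed 90 degrees about Z, lever arm of 0.1 m / 0.3 m.
R_yaw90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
print(phone_position(R_yaw90, (1.0, 2.0, 0.0), (0.1, 0.0, 0.3)))
```

Neglecting the rotation and adding the lever arm directly would only be correct while the LiDAR keeps its initial orientation, which is why the offset is rotated first.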

Reference data generation
For the benchmark evaluation, we generated reference data from a subset of the raw data and incorporated additional high-accuracy data.
For the evaluation of SLAM-based indoor point clouds, we built a high-accuracy 3-D reference map from data collected with the Riegl VZ 1000. First, we placed many highly reflective rectangular stickers on the walls and the floor. We then scanned the scene from different positions, ensuring that adjacent sub-maps overlapped. Finally, the sub-maps were manually registered by picking the same stickers and other feature points in the RiSCAN PRO software.
For the BIM feature benchmark, we used the building line framework extracted by Wang's method and semantic objects labeled manually. We selected the building lines longer than 0.1 m in the structured indoor building and saved the coordinates of their two endpoints. Fig. 6 gives an example of BIM features. For the indoor positioning evaluation, we used the LiDAR trajectory generated by a SLAM method (Zhang & Singh, 2014a) with loop closure as the reference.
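The length criterion for the building lines is a simple filter on the segment endpoints. A short Python sketch, assuming each line is stored as a pair of 3-D endpoints in metres (the function name is hypothetical):

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Segment = Tuple[Point, Point]


def select_building_lines(lines: Sequence[Segment], min_length: float = 0.1) -> List[Segment]:
    """Keep line segments strictly longer than min_length (metres), as endpoint pairs."""
    return [(p, q) for p, q in lines if math.dist(p, q) > min_length]


# Hypothetical segments: a 5 cm fragment is dropped, a 2.5 m wall edge is kept.
kept = select_building_lines([((0.0, 0.0, 0.0), (0.05, 0.0, 0.0)),
                              ((0.0, 0.0, 0.0), (0.0, 0.0, 2.5))])
```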

SLAM-based indoor point cloud:
Kümmerle et al. (2009a) proposed a metric for measuring the performance of a SLAM algorithm by considering the poses of a robot during data acquisition. Rather than relying on the error of the trajectory end-point, the metric averages over all relations between poses. Geiger et al. (2012a) extended the metric by treating the rotation and translation errors separately. Here, we follow a similar approach:

$$E_{\mathrm{trans}} = \frac{1}{N}\sum_{i,j} \mathrm{trans}\big(\delta_{i,j} \ominus \delta_{i,j}^{*}\big)^2, \qquad E_{\mathrm{rot}} = \frac{1}{N}\sum_{i,j} \mathrm{rot}\big(\delta_{i,j} \ominus \delta_{i,j}^{*}\big)^2$$

where $N$ is the number of relative relations and $\ominus$ is the inverse of the standard motion composition operator. Let $\delta_{i,j}$ be the relative transformation from pose $j$ to pose $i$, and $\delta_{i,j}^{*}$ the corresponding reference relative relation. The functions $\mathrm{trans}(\cdot)$ and $\mathrm{rot}(\cdot)$ separate the translation and rotation components of the error.
In indoor environments, however, a reference for the trajectory poses is hard to obtain. Following Kümmerle's method, we therefore apply the metric to landmark locations instead of trajectory poses, so that the relations are determined by the relative distances between landmarks.
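The landmark-based variant can be sketched as follows. For every pair of landmarks observed in both the SLAM map and the reference map, the pairwise distances are compared, and the squared differences are averaged over the N pairs, in the spirit of the relative-relation metric of Kümmerle et al. This is a hypothetical Python illustration: the landmark identifiers (for example, the reflective stickers used for the reference map) and the exact form of the authors' landmark metric are assumptions.

```python
import itertools
import math
from typing import Dict, Tuple

Point = Tuple[float, float, float]


def landmark_relation_error(estimated: Dict[str, Point], reference: Dict[str, Point]) -> float:
    """Mean squared error of pairwise landmark distances.

    For every pair (i, j) of landmarks present in both maps, the distance
    ||l_i - l_j|| in the SLAM map is compared with the same distance in the
    reference (TLS) map; the squared differences are averaged over the N pairs.
    """
    common = sorted(estimated.keys() & reference.keys())
    pairs = list(itertools.combinations(common, 2))
    if not pairs:
        raise ValueError("need at least two shared landmarks")
    total = 0.0
    for i, j in pairs:
        d_est = math.dist(estimated[i], estimated[j])
        d_ref = math.dist(reference[i], reference[j])
        total += (d_est - d_ref) ** 2
    return total / len(pairs)
```

Because only distances between landmarks enter the metric, the SLAM map does not need to be aligned to the reference map beforehand.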

BIM feature:
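We propose the following method to evaluate BIM feature extraction. Assume a ground-truth line $L_g$ and the corresponding evaluated line $L_e$, i.e., the extracted line whose midpoint is nearest to the midpoint of $L_g$. For both lines, we compute the direction vectors $\mathbf{v}_g$ and $\mathbf{v}_e$, the midpoints $\mathbf{m}_g$ and $\mathbf{m}_e$, and the lengths $l_g$ and $l_e$. From these, we obtain the angle $\theta$ between the two lines, the distance $d$ between the two midpoints, and the length difference $\Delta l$:

$$\theta = \arccos\frac{|\mathbf{v}_g \cdot \mathbf{v}_e|}{\lVert\mathbf{v}_g\rVert\,\lVert\mathbf{v}_e\rVert}, \qquad d = \lVert \mathbf{m}_g - \mathbf{m}_e \rVert, \qquad \Delta l = |l_g - l_e|$$

We then set three thresholds $\theta_{th}$, $d_{th}$ and $\Delta l_{th}$, and consider the evaluated line valid only if all three conditions are met: (1) $\theta \le \theta_{th}$, (2) $d \le d_{th}$, and (3) $\Delta l \le \Delta l_{th}$. Finally, the accuracy is calculated by Eq. (6):

$$acc = \frac{N_t}{N_g} \qquad (6)$$

where $N_t$ is the number of true (valid) lines and $N_g$ is the total number of ground-truth lines.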
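The line matching and validity test above translate directly into code. In the Python sketch below, the thresholds are free parameters (the text does not state their values), the angle is measured in degrees using the absolute cosine because a line has no preferred direction, and all function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]
Line = Tuple[Point, Point]


def _midpoint(line: Line) -> Point:
    (x1, y1, z1), (x2, y2, z2) = line
    return ((x1 + x2) / 2, (y1 + y2) / 2, (z1 + z2) / 2)


def line_errors(gt: Line, ev: Line) -> Tuple[float, float, float]:
    """Angle theta (degrees), midpoint distance d and length difference dl."""
    vg = [b - a for a, b in zip(*gt)]
    ve = [b - a for a, b in zip(*ev)]
    lg, le = math.hypot(*vg), math.hypot(*ve)
    # |cos| because a line has no preferred direction; clamp against rounding.
    cos_theta = min(1.0, abs(sum(a * b for a, b in zip(vg, ve))) / (lg * le))
    theta = math.degrees(math.acos(cos_theta))
    d = math.dist(_midpoint(gt), _midpoint(ev))
    return theta, d, abs(lg - le)


def bim_accuracy(gt_lines: Sequence[Line], ev_lines: Sequence[Line],
                 theta_th: float, d_th: float, dl_th: float) -> float:
    """acc = N_t / N_g, matching each ground-truth line to the extracted line
    whose midpoint is nearest to its own midpoint."""
    if not gt_lines:
        raise ValueError("no ground-truth lines")
    if not ev_lines:
        return 0.0
    n_true = 0
    for gt in gt_lines:
        mg = _midpoint(gt)
        ev = min(ev_lines, key=lambda line: math.dist(_midpoint(line), mg))
        theta, d, dl = line_errors(gt, ev)
        if theta <= theta_th and d <= d_th and dl <= dl_th:
            n_true += 1
    return n_true / len(gt_lines)
```

Matching by nearest midpoint is cheap, but it can pair a ground-truth line with an unrelated neighbour; the three thresholds are what reject such pairs.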

Indoor positioning:
Indoor positioning is evaluated in the same way as the translation error in subsection 3.4.1. However, the output frequency of the smartphone positions differs from that of the ground truth generated by SLAM. To solve this problem, we generate the ground-truth position at each smartphone timestamp by linear interpolation. Formally, the ground-truth position at time $t$ is calculated by

$$\mathbf{p}_t = \frac{t_2 - t}{t_2 - t_1}\,\mathbf{p}_{t_1} \oplus \frac{t - t_1}{t_2 - t_1}\,\mathbf{p}_{t_2}$$

where $t$ falls within the interval $(t_1, t_2)$ defined by two consecutive timestamps of the benchmark trajectory, $\mathbf{p}_{t_1}$ and $\mathbf{p}_{t_2}$ are the ground-truth positions at times $t_1$ and $t_2$, respectively, and $\oplus$ denotes a compositional operator.

Fig. 7 shows some examples from this dataset. Fig. 7(a) is a frame of Velodyne VLP-16L LiDAR data, in which color encodes the intensity of each point: brighter colors indicate stronger intensity. Fig. 7(b) shows the high-accuracy data from the Riegl VZ 1000, which is used as the indoor LiDAR SLAM ground truth. Panels (c) and (d) of Figure 7 show the BIM benchmark, and (e) and (d) show the indoor positioning benchmark. The blue dots in (d) are trajectories generated by the LiDAR SLAM method, and the yellow dots are trajectories generated from the smartphone sensor data.
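The interpolation of the reference trajectory at smartphone timestamps and the resulting error can be sketched in Python as follows. The sketch assumes that the smartphone fixes and the reference trajectory are already expressed in the same coordinate frame and that the reference timestamps are strictly increasing. For positions, the compositional operator reduces to a weighted sum of the two bracketing positions; interpolating orientations would instead need, for example, spherical linear interpolation. The function names are hypothetical, and averaging squared errors is one way to mirror the translation metric.

```python
import bisect
import math
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def interpolate_position(times: Sequence[float], positions: Sequence[Point], t: float) -> Point:
    """Linearly interpolate the reference trajectory at time t (times strictly increasing)."""
    if not times or not times[0] <= t <= times[-1]:
        raise ValueError("t is outside the reference trajectory")
    i = bisect.bisect_right(times, t)
    if i == len(times):  # t equals the last timestamp
        return tuple(positions[-1])
    t1, t2 = times[i - 1], times[i]
    w = (t - t1) / (t2 - t1)
    return tuple((1 - w) * a + w * b for a, b in zip(positions[i - 1], positions[i]))


def positioning_error(phone: Sequence[Tuple[float, Point]],
                      times: Sequence[float], positions: Sequence[Point]) -> float:
    """Mean squared error of (timestamp, position) phone fixes against the reference.

    Phone fixes outside the reference time span are skipped.
    """
    sq_errors = []
    for t, p in phone:
        if times[0] <= t <= times[-1]:
            ref = interpolate_position(times, positions, t)
            sq_errors.append(math.dist(p, ref) ** 2)
    if not sq_errors:
        raise ValueError("no phone fixes overlap the reference trajectory")
    return sum(sq_errors) / len(sq_errors)
```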

CONCLUSION
This paper presents the design of the benchmark dataset on multisensory indoor mapping and positioning (MIMAP). Each scene in the dataset contains point clouds from the multi-beam laser scanners, images from the fisheye lens camera, signals from the MIMU, and records from the attached smartphone's sensors. The benchmark dataset can be used to evaluate algorithms for: (1) SLAM-based indoor point cloud generation; (2) automated BIM feature extraction from point clouds; and (3) low-cost multisensory indoor positioning, focusing on smartphone platform solutions.