A 3D MAP AIDED DEEP LEARNING BASED INDOOR LOCALIZATION SYSTEM FOR SMART DEVICES

Indoor positioning technologies are a fast-developing field of research owing to the rapidly increasing need for indoor location-based services (ILBS), in particular for applications using personal smart devices. Recent progress in indoor mapping, including 3D modeling and semantic labeling, has started to benefit indoor positioning algorithms, mainly in terms of accuracy. This work presents a method for efficient and robust indoor localization that can support applications in large-scale environments. To achieve high performance, the proposed concept integrates two main indoor localization techniques: Wi-Fi fingerprinting and deep learning-based visual localization using a 3D map. The robustness and efficiency of the technique are demonstrated with real-world experiments.


INTRODUCTION
The need for indoor positioning systems is growing rapidly owing to the emerging market for indoor commercial applications, including asset tracking, personal security and entertainment (Holman, 2012), built on ILBS and fueled by the proliferation of personal smart devices. In general, the typical requirements for indoor positioning techniques using smart devices are low cost, high accuracy, and availability in a large variety of scenarios, e.g., large-scale environments. Since GPS generally works poorly in indoor environments, various alternative radio-frequency (RF) based approaches using different signals and sensors, such as Wi-Fi, Bluetooth Low Energy (BLE) beacons, Radio Frequency Identification (RFID), and Ultra-wideband (UWB), have been proposed for indoor positioning (Yassin et al., 2016). However, the main drawbacks of these technologies are low accuracy and the high cost of the required infrastructure. The typical 2D localization accuracy of Wi-Fi, BLE and RFID systems varies from 1-2 meters to a few tens of meters, while UWB can achieve an accuracy of a few decimeters (Anagnostopoulos, 2017; Ficco et al., 2014). On the other hand, BLE, RFID and UWB positioning systems need additional infrastructure and extra sensors on the user's end that are not integrated in modern smart devices; therefore, the cost of using these systems is relatively high.
In the commercial arena, several companies have proposed indoor map solutions, such as Google Maps Indoor and HERE Indoor Maps (Li et al., 2019). Beyond basic visualization, indoor maps play an important role in achieving high performance in any indoor localization system (Li et al., 2019). For example, the requirements for 3D indoor maps to support indoor navigation applications have been investigated in (Brown et al., 2013) with respect to recovering the 6 degree-of-freedom (DOF) camera pose of a query image captured by a smart device. Many methods have been proposed for this task with different representations of the 3D map data, such as (Sarlin et al., 2018) for outdoor environments with feature maps or (Taira et al., 2018) for indoor environments with dense RGB-D point clouds. To improve the robustness of indoor image-based localization, deep learning was introduced for processing and representing the query image at the object level (Xu et al., 2017) and the feature level (Taira et al., 2018). Specifically, Taira et al. (2018) demonstrated that their open-source visual indoor localization system, called InLoc, can achieve a localization rate of 40.7% at an accuracy threshold of 0.5 m by using a state-of-the-art CNN-based image retrieval method followed by 2D-3D dense matching with CNN features. However, InLoc can fail in photogrammetrically challenging scenarios, e.g., when images contain many dynamic elements, such as moving people and objects. Additionally, because it lacks an initial location estimate, the method needs to compare the query image with all database images every time it runs, and thus the image retrieval performance will decrease significantly as the map grows.
To avoid using sensors that are not available in smart devices, in this work we integrate received signal strength (RSS)-based Wi-Fi fingerprinting positioning (WFP) with InLoc. Since WFP is robust against non-line-of-sight (NLoS) conditions, signal fluctuation and multipath effects in complex indoor environments (He, 2015), we use WFP to provide a coarse position estimate, which serves either as the initial position or as the final location when the visual algorithm fails. In our approach, InLoc is supplied with the WFP results to perform coarse-to-fine 6DOF pose estimation using an RGBD-based 3D indoor map. The details of the proposed method are discussed in the remainder of this paper, which is organized as follows. Section 2 reviews the two techniques integrated in our indoor localization system (Figure 1), WFP and InLoc. The field experiment setup, including the building of indoor maps with different representations of the environment, and the results are presented in Section 3. Finally, the conclusions are summarized in Section 4.

Wi-Fi Fingerprinting Positioning System
In Wi-Fi fingerprinting techniques, fingerprints or signatures represent information and clues about the environment. In WFP methods, the fingerprints are built from the Wi-Fi received signal strength. While RSS is the essential component of a fingerprint, other geo-related information, such as the IP and MAC addresses of Wi-Fi access points (APs), which is helpful for localization in large-scale environments, can also be added to the fingerprint (Honkavirta et al., 2009). One typical example of using IP and MAC addresses for localization is the Geolocation module in Google Maps. Since the location is estimated by matching the user's fingerprint measurements against the fingerprint database, WFP generally consists of two phases: a training phase (offline) and a localization phase (online) (Kim et al., 2012). The workflow of the WFP algorithm used in this study is shown in Figure 2. To create the radio map in the training phase, we applied the mean peak value to sample the RSS observations (Mallozzi et al., 1996). This Peak-based Wi-Fi Fingerprinting (PWF) technique shows robustness and improved accuracy by overcoming the RSS variance problem (Kim et al., 2012). Under probabilistic assumptions, such as probabilistic independence or Gaussian noise in the samples from different APs (He et al., 2015), the matching problem is solved by obtaining the posterior distribution with Bayes' rule, which is described as:

$$P(l \mid o) = \frac{P(o \mid l)\,P(l)}{P(o)}$$

where $P(l \mid o)$ = posterior of a possible calibration point (CP) location $l$ given the observation $o$. The CP location with the maximum posterior is then used as the 2D positioning result $(\hat{x}, \hat{y})$.
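The Bayesian matching step described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes the radio map stores a per-AP mean and standard deviation of RSS for each CP, that noise is Gaussian and independent across APs, and that the prior over CPs is uniform. The function names `wfp_posterior` and `locate` are hypothetical.

```python
import numpy as np


def wfp_posterior(radio_map_mean, radio_map_std, observation, prior=None):
    """Posterior over calibration points (CPs) for one RSS observation.

    radio_map_mean, radio_map_std: arrays of shape (n_cps, n_aps), in dBm
        (assumed radio-map layout, not taken from the paper).
    observation: array of shape (n_aps,), in dBm; NaN marks an AP not heard.
    """
    mean = np.asarray(radio_map_mean, dtype=float)
    std = np.asarray(radio_map_std, dtype=float)
    obs = np.asarray(observation, dtype=float)
    n_cps = mean.shape[0]
    if prior is None:
        prior = np.full(n_cps, 1.0 / n_cps)  # assumed uniform prior

    seen = ~np.isnan(obs)  # use only APs that were actually observed
    z = (obs[seen] - mean[:, seen]) / std[:, seen]
    # Independent Gaussian noise per AP, so the log-likelihoods add up.
    log_lik = np.sum(
        -0.5 * z**2 - np.log(std[:, seen] * np.sqrt(2.0 * np.pi)), axis=1
    )
    log_post = log_lik + np.log(prior)
    log_post -= log_post.max()  # stabilise before exponentiating
    post = np.exp(log_post)
    return post / post.sum()  # normalising plays the role of dividing by P(o)


def locate(cp_xy, posterior, top_k=2):
    """Return the maximum-posterior CP coordinates and the top-k CP indices."""
    cp_xy = np.asarray(cp_xy, dtype=float)
    order = np.argsort(posterior)[::-1][:top_k]
    return cp_xy[order[0]], order
```

Working in log space avoids numerical underflow when many APs contribute to the product of likelihoods. The top-k indices are returned as well because the coarse WFP result is later used to narrow the visual search.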

InLoc
InLoc is a state-of-the-art visual indoor localization system that estimates the 6DOF camera pose of a query image by dense matching against an RGBD-based indoor map, which includes a 3D model and an image database. The pipeline of InLoc is summarized as follows:
1. Given a query image taken by a smartphone, the system first retrieves the N=100 most similar images from the whole dataset by comparing the CNN-based descriptors produced by NetVLAD (Arandjelovic et al., 2016). The architecture of NetVLAD is shown in Figure 3.
2. The CNN features are built from the outputs of the 17th (fine features: length=256) and 30th (coarse features: length=512) CNN layers of NetVLAD and densely matched in a coarse-to-fine manner, in which the matches of the fine features are restricted by the correspondences found among the coarse features. Next, the camera poses of the N candidate images are computed from the associated 3D model with Perspective-3-Point Random Sample Consensus (P3P-RANSAC) (Fischler & Bolles, 1981). The top 10 candidates are then selected based on the number of RANSAC inliers.
3. In the final pose verification step, the best 6DOF pose estimate is picked from the 10 remaining camera poses by comparing the differences between the query image and the re-projected synthetic image.
The original CNN model in InLoc is used in this work, since it was trained with a bigger dataset (254,064 images) than our test dataset.
We improve the efficiency and robustness of InLoc in two ways: (1) the Wi-Fi positioning results are applied as initial information to significantly reduce the image retrieval search space as well as the map matching space, which is particularly effective in large-scale environments; and (2) the Wi-Fi positioning results can also be offered to the user as the final localization estimate when InLoc fails. In our tests, we use the images with the top-2 Wi-Fi fingerprinting matching posteriors for InLoc processing.
More details of the methods are given in the experiment section.
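The way a coarse Wi-Fi result can narrow the image retrieval step may be sketched as below. This is an illustrative sketch under stated assumptions, not the InLoc code: it assumes that each database image carries the index of the radio-map cell it was stored for, that global descriptors (for example from NetVLAD) are already computed as vectors, and that similarity is measured by cosine similarity. The names `restricted_retrieval` and `final_position` are hypothetical.

```python
import numpy as np


def restricted_retrieval(query_desc, db_descs, db_cell_ids, cell_posterior,
                         top_cells=2, top_images=10):
    """Rank only the database images stored in the most probable cells."""
    query_desc = np.asarray(query_desc, dtype=float)
    db_descs = np.asarray(db_descs, dtype=float)
    db_cell_ids = np.asarray(db_cell_ids)

    # Cells with the highest WFP posterior (top-2 is used in the tests).
    cells = np.argsort(cell_posterior)[::-1][:top_cells]
    candidates = np.flatnonzero(np.isin(db_cell_ids, cells))

    # Cosine similarity between L2-normalised global descriptors.
    q = query_desc / np.linalg.norm(query_desc)
    d = db_descs[candidates]
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    sims = d @ q

    order = np.argsort(sims)[::-1][:top_images]
    return candidates[order], sims[order]


def final_position(visual_xy, wfp_xy):
    """Fall back to the WFP estimate when visual localization fails (None)."""
    return wfp_xy if visual_xy is None else visual_xy
```

Because only the images of the selected cells are compared, the retrieval cost depends on the number of images per cell rather than on the size of the whole map.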

Mapping
The map used in this research contains two components: (1) a Wi-Fi radio/fingerprint map, and (2) an RGBD-based 3D indoor map, including RGB images, depth maps, and 3D indoor models. To minimize the drift of the RGBD SLAM algorithm, data were collected in a typical office hallway (55 m x 3 m) at the Ohio State University, see Figure 4.

Radio Map:
For the Wi-Fi radio map, 18 calibration points (CPs) are used for collecting fingerprints at the centers of the cells. As a trade-off between accuracy and effort (Cherntanomwong et al., 2009), we set the interval between adjacent calibration points to 3 m; hence, the length of each cell is also 3 m. The measurement time for each fingerprint varied from 50 to 60 seconds, and the MAC addresses and IP information of the APs were also recorded. A VAIO Z Canvas laptop was used for Wi-Fi data collection. The coordinates of the CPs were manually picked from the 3D model rendered by RGBD SLAM.

3D Indoor Map:
The platform used for indoor 3D mapping is the LooMo robot with a Kinect V1 RGBD camera mounted on top, see Figure 5; data were collected at 10 Hz. The data are processed with an RGBD SLAM technique named RTAB-Map (Labbé and Michaud, 2019), which resulted in a colored indoor 3D model of about 55 million points, see Figure 4b and c, and 5122 key frames including RGB images (640 x 480), 3D scan data, and 6DoF camera poses. After optimization and noise filtering, the 3D indoor map dataset is integrated with the Wi-Fi radio map as follows: (1) 5 RGB images are sampled from the key frames and stored for each cell, with an average interval of 0.6 m between images; (2) the depth map for each stored image is generated from the registered and optimized 3D scans within an 8 m range of the camera location, see Figure 6; and (3) to reduce the computation cost of the final step in InLoc, the 3D model of the floor is segmented and a segment is stored for each cell. The length of the segments varies from 6 m to 30 m.
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B4-2020, 2020 XXIV ISPRS Congress (2020 edition)
Figure 6. Stored RGB image (top) and corresponding depth map (bottom).
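One common way to generate a depth map such as the one in Figure 6 from registered 3D scans is to project the points into the stored camera pose and keep the nearest depth per pixel. The sketch below illustrates this under assumptions that are not stated in the paper: a pinhole camera model with intrinsic matrix `K`, a world-to-camera pose given as `R_wc` and `t_wc`, and range measured as Euclidean distance from the camera. The function name `render_depth_map` is hypothetical.

```python
import numpy as np


def render_depth_map(points_world, R_wc, t_wc, K, width, height, max_range=8.0):
    """Project 3D points into a camera and keep the nearest depth per pixel.

    points_world: (n, 3) points in the map frame, in metres.
    R_wc, t_wc: world-to-camera rotation (3x3) and translation (3,), so that
        p_cam = R_wc @ p_world + t_wc (assumed pose convention).
    K: 3x3 pinhole intrinsic matrix (assumed camera model).
    Returns a (height, width) array; pixels hit by no point are NaN.
    """
    pts = np.asarray(points_world, dtype=float) @ np.asarray(R_wc, dtype=float).T
    pts = pts + np.asarray(t_wc, dtype=float)

    # Keep points in front of the camera and within the range limit.
    z = pts[:, 2]
    dist = np.linalg.norm(pts, axis=1)
    keep = (z > 0) & (dist <= max_range)
    pts, z = pts[keep], z[keep]

    # Pinhole projection to integer pixel coordinates.
    uvw = pts @ np.asarray(K, dtype=float).T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z = u[inside], v[inside], z[inside]

    # Z-buffer: the smallest depth wins when several points share a pixel.
    depth = np.full((height, width), np.inf)
    np.minimum.at(depth, (v, u), z)
    depth[np.isinf(depth)] = np.nan
    return depth
```

The range limit keeps distant, typically noisier points out of the depth map, and the z-buffer prevents surfaces behind a wall from overwriting the visible one.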

Test Data
The test dataset is created on 13 points in two groups: (1) six query images selected from the 3D map dataset, with 6DoF camera poses as ground truth, and (2) seven query images taken by a cell phone, with 2D locations (X, Y) as ground truth. Wi-Fi fingerprints are collected at all test points, with a measuring time of about 10 seconds for each fingerprint. To avoid scale problems, the coordinates of the CPs and the ground truth of the test points are manually picked from the 3D model according to markers of different colors, see Figure 7. The Group 1 query images, with their associated 6DoF camera poses, are then selected from the 3D map dataset as the ones closest to the corresponding locations of the black markers. A SONY XPERIA X cellphone is used for Wi-Fi data collection and for taking query images (2160 x 2880) on the green markers. For the Group 2 test data, the Wi-Fi fingerprint is measured right after taking the image at each point. For the Group 1 test data, Wi-Fi fingerprints are collected after the robot stopped on the green markers while 3D mapping was ongoing. The intrinsic parameters of the cellphone camera are calculated with the MATLAB camera calibration function, and those of the Kinect are the manufacturer's data, which are used in RTAB-Map as the default setting.
Figure 7. Examples of markers (left), in which CP is denoted in red, Group 2 test point is in black and Group 1 test point is in green; (right) shows markers in the 3D model.

Test Results
Compared with the original InLoc system, our workflow efficiently reduces the search space for image retrieval and the computation cost, see Table 1. For step 2 of the InLoc pipeline, dense matching with CNN-based (VGG-16) features significantly outperforms classic features, such as SURF, in the texture-less hallway testing area, see Figure 8 and Table 2. In the next step, the best pose estimate is determined by evaluating image similarity, comparing local patch descriptors between the query image and the synthetic image. Examples of the best pose estimation on Group 1 and Group 2 data are visualized in Figures 9 and 10, respectively. Warm colors denote large errors. Table 3 shows that the horizontal positioning accuracy increased after the InLoc process compared to the WFP standalone results. Note that the performance of the system with Group 2 data is better than with Group 1 data. The reason may be related to the accuracy of the intrinsic parameters. Most of the WFP + InLoc localization results are better than those of WFP alone, though the error budget is higher than that of WFP, as plotted in Figure 11. Clearly, the quality of the 3D model directly impacts the performance of InLoc (Taira et al., 2018). Note that the quality of the 3D model in this study is limited by the performance of RGBD SLAM on data obtained by inexpensive sensors.
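A horizontal positioning comparison of the kind reported in Table 3 can be computed as sketched below. This is a generic illustration, not the authors' evaluation script: the horizontal error is assumed to be the Euclidean distance in the (X, Y) plane, the 0.5 m threshold is borrowed from the InLoc accuracy figure quoted in the introduction, and the function names are hypothetical. No values from the paper are reproduced here.

```python
import numpy as np


def horizontal_errors(estimates_xy, truth_xy):
    """Euclidean distance in the (X, Y) plane for each test point, in metres."""
    est = np.asarray(estimates_xy, dtype=float)
    gt = np.asarray(truth_xy, dtype=float)
    return np.linalg.norm(est - gt, axis=1)


def summarize(errors, threshold=0.5):
    """Mean and maximum error, plus the share of points within the threshold."""
    errors = np.asarray(errors, dtype=float)
    return {
        "mean": float(errors.mean()),
        "max": float(errors.max()),
        "rate_within": float(np.mean(errors <= threshold)),
    }
```

Applying `summarize` to the WFP-only estimates and to the WFP + InLoc estimates for the same test points gives directly comparable figures for the two configurations.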

CONCLUSION
In this study, we have demonstrated the efficiency of our system in terms of increased search and computation speed. With the help of InLoc, the localization performance is better than that of WFP alone. However, the localization accuracy is affected by the quality of the 3D model. In future work, we will try to (1) improve the quality of the 3D model, and then (2) test the system in a large-scale indoor environment.