INDOOR 3D INTERACTIVE ASSET DETECTION USING A SMARTPHONE

Building floor plans with locations of safety, security and energy assets such as IoT sensors, thermostats, fire sprinklers, EXIT signs, fire alarms, smoke detectors, routers etc. are vital for climate control, emergency security, safety, and maintenance of building infrastructure. Existing approaches to building survey are manual, and usually involve an operator with a clipboard and pen, or a tablet, enumerating and localizing assets in each room. In this paper, we propose an interactive method for a human operator to use an app on a smartphone to (a) create the 2D layout of a room, (b) detect assets of interest, and (c) localize them within the layout. We use deep learning methods to train a neural network to recognize assets of interest, and use human-in-the-loop interactive methods to correct erroneous recognitions by the network. These corrections are then used to improve the accuracy of the system over time as the inspector moves from one room to another in a given building, or from one building to the next; this progressive training and testing mechanism makes our system useful in building inspection scenarios where a given class of assets in a building are the same instantiation of that object category, thus reducing the problem to instance, rather than category, recognition. Experiments show our proposed method achieves an accuracy rate of 76% when testing 102 objects across 10 classes.


INTRODUCTION
Building floor plans with locations of safety, security and energy assets such as Internet of Things (IoT) sensors, thermostats, fire sprinklers, EXIT signs, fire alarms, smoke detectors, routers etc. are vital for asset management, climate control, emergency security, safety, and maintenance of building infrastructure (Minoli et al., 2017). Existing approaches to building survey are manual, and usually involve an operator with a clipboard and a pen, or a tablet, enumerating and localizing assets in each room. As such, the process is tedious, time consuming, and error prone. Also, it does not result in any contextual data, i.e. the proximity and relationship between the sensors, and the proximity and relationship between the assets and the room (Asplund et al., 2018), (Wagner et al., 2003), (LaFlamme et al., 2006).
In this paper, we propose an interactive method for a human operator to use an app on a smartphone to (a) create the 2D layout of a room, (b) detect assets of interest, and (c) localize them within the layout. The output is the layout of each room in a building with the location of each asset marked in 2D and 3D in the layout, and an associated picture for each asset. This "as built" recovery of the assets in a building can be used in Building Information Modeling (BIM) of buildings by architects, owners, construction firms and facility managers (Boyes et al., 2017). Furthermore, it can be integrated with Facilities Management (FM) software tools such as TRIRIGA from IBM or Archibus.
The driving force behind our approach is the ubiquity of smartphones, and advances in machine learning and augmented reality (AR) on mobile platforms such as smartphones. Today's mobile devices are equipped with powerful processors and a myriad of sensors such as accelerometers, gyroscopes, and high resolution cameras (Zhang et al., 2018). Thus they are well suited for AR tracking systems and applications, as well as for running object detection models (Carmigniani, 2011), (Craig, 2013). Furthermore, tracking is now scalable to large environments (Zhang, 2001). AR allows the existence of virtual objects in the real world (Ruan et al., 2012) by assigning anchors tied to a location in the real world environment (Azuma, 1997), (Azuma et al., 2001). By integrating object detection with AR, it is possible to develop a markerless system that frees the user from the need to use QR codes or other types of markers to extract the 3D position of assets (Yan, 2014).
The outline of this paper is as follows. In Section 2, we provide an overview of our system. Section 3 describes the user interface and operation of the app, and Section 4 covers the deep learning methods used in our system. Section 5 includes experimental results, and Section 6 presents conclusions and future work.

SYSTEM OVERVIEW
Our goal is to develop a smartphone-based app which can be used for fast, semi-automated asset detection and localization during building surveys by inspectors and auditors. In this paper, we are concerned with 10 classes of assets related to the safety, security and comfort of users, but our method can be extended to a much larger set of assets. In particular, our system is designed to recognize the following object classes: router, fire sprinkler, fire alarm, fire alarm handle, EXIT sign, cardkey reader, light switch, emergency lights, fire extinguisher, and outlet.
To automatically detect assets, we leverage existing advances in deep learning. Broadly speaking, there are two general approaches to applying machine learning to object detection: instance recognition and category recognition. Instance recognition refers to situations where one is interested in finding all instances of a particular brand and model number of a given asset, such as "Honeywell 1231304 programmable 7-day thermostat" (Zhang et al., 2006). This is in contrast with category recognition, whereby one is interested in finding all objects in the same category regardless of brand or model number. An example of category recognition in the building context would be to find all thermostats inside a building regardless of their brand or model number (Wang et al., 2006). Both problems require training neural networks with examples of the object to be detected. As expected, category recognition is more involved and requires a significantly larger number of training examples than instance recognition.
For our particular application of surveying a building and detecting and localizing assets of interest, a number of considerations need to be taken into account in choosing between instance and category recognition. To begin with, the number of publicly available training examples for our desired classes of objects is small, making category recognition less attractive. For example, the picture of a fire extinguisher found in public image databases such as Google Images, shown in Fig. 1(b), is quite different from a picture of an installed fire extinguisher in an actual building taken under realistic lighting conditions, as shown in Fig. 1(a). As such, it is not possible to rely solely on publicly available image databases for training the models for our application. Furthermore, the assets found inside a given building are usually limited to a few brands/models and as such are quite amenable to instance recognition. Specifically, one can envision using instances of particular assets inside a building to construct an instance recognition engine, which is enhanced as the inspector progressively adds new examples of the same instance as he or she visits more rooms inside a given building, or more buildings with similar devices in a campus. This bootstrapping strategy improves the recognition accuracy of the instance recognition model over time, and can also be used as a starting point for a category recognition model in this or other applications.

USER EXPERIENCE
We start with the user experience in creating the layout for each room, and then describe the way assets in a given room are tagged, located and documented.

Layout Generation
Upon launching the app, the user first creates the layout of the room and then localizes the assets within it. There are multiple approaches to creating the layout of a room. One way is to point the viewfinder of the smartphone at each vertical wall to detect it, compute the equation of the plane for that wall, find the intersections of those planes, and then project them onto the horizontal x-z plane to create a 2D layout. To extend the layout to 3D, we can detect the plane associated with the ceiling and find its intersection with the remaining vertical planes. We found this method to be error prone and inaccurate. Instead, we opt for a simpler approach by instructing the user to tap on the corners of the room, in a clockwise or counter-clockwise order, to generate the 2D layout. To extrude to 3D, the user can also place an anchor on the ceiling to assign a generic height to the room. In both approaches we take advantage of the positioning and tracking capabilities of modern smartphones via their AR capabilities.
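The corner-tapping step can be sketched as follows. This is a minimal sketch with hypothetical helper names, not our app's actual implementation; it assumes each tap yields a 3D anchor position (x, y, z) in the AR world frame, with y the vertical axis.

```python
# Hypothetical sketch of layout generation from tapped corner anchors.
# Dropping the vertical (y) coordinate projects each 3D anchor onto
# the horizontal x-z plane, yielding the 2D room footprint.

def project_corners_to_2d(corner_anchors):
    """Project tapped 3D corner anchors onto the x-z floor plane."""
    return [(x, z) for (x, y, z) in corner_anchors]

def extrude_to_3d(layout_2d, ceiling_anchor_y, floor_y=0.0):
    """Assign a single generic room height from one ceiling anchor."""
    height = ceiling_anchor_y - floor_y
    return {"footprint": layout_2d, "height": height}

# Four corners tapped counter-clockwise, plus one ceiling anchor.
corners = [(0.0, 0.0, 0.0), (4.2, 0.0, 0.1), (4.1, 0.0, 3.0), (0.0, 0.0, 2.9)]
room = extrude_to_3d(project_corners_to_2d(corners), ceiling_anchor_y=2.7)
```

The single ceiling anchor reflects the paper's simplification: one generic height for the whole room rather than a full ceiling-plane intersection.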

Asset Documentation
The initial recognition model for a given building can either be generated from previously inspected buildings, or it can be made in situ from the assets of the building under consideration itself. In this paper, we take the latter approach in our experiments. As mentioned earlier, this initial recognition model is refined as the operator progresses from room to room inside a given building.
Once the layout for a given room is created, the workflow of our system for detecting and localizing objects in that room is as follows: the user points the smartphone viewfinder at a given asset and taps to capture a picture, which is run through the neural network on the smartphone, assigning a probability to each of the N classes of objects it has been trained on. The most likely class, together with its associated probability or confidence, is then displayed on the screen. The human operator then either confirms or rejects the auto-recognition result. In the latter case, a dropdown menu with all categories of assets is displayed so that the user can choose the correct class, localize the asset, and draw its associated bounding box. The final output is a 2D layout of the room superimposed with the locations of the detected assets.
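The confirm-or-correct loop above can be sketched as follows. This is a minimal sketch; `classify`, `confirm_or_correct`, and the model/UI callbacks are hypothetical names, not our app's actual API.

```python
# Hypothetical sketch of the detect-confirm workflow: run the
# classifier, surface the top class and its confidence, then let
# the operator either confirm it or supply a corrected label.

def classify(image, model, class_names):
    """Return (best_class, confidence) from per-class probabilities."""
    probs = model(image)  # length-N probability vector
    best = max(range(len(probs)), key=lambda i: probs[i])
    return class_names[best], probs[best]

def confirm_or_correct(prediction, confidence, ask_user):
    """ask_user returns a corrected label, or None if the prediction is right."""
    corrected = ask_user(prediction, confidence)
    label = corrected if corrected is not None else prediction
    # Every example, right or wrong, is saved for future retraining.
    return {"label": label, "was_correct": corrected is None}
```

In the app, `ask_user` corresponds to the confirm/"UNDO" interaction, and the returned record is what gets persisted for the bootstrapped retraining described below.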
Figure 2 shows an example of the above interactive process. The user points the viewfinder at the object of interest and taps a button to capture a picture to be input to the recognition engine, as shown in Figures 2a and 2b for an outlet and a light switch respectively. The recognition engine then outputs the category with the highest confidence or probability and places an AR anchor on that object. As seen, for the objects in Figures 2a and 2b the confidence is 0.92 and 0.87 respectively, and the detected category is correct in both cases. Figure 2c shows an example of an outlet which has erroneously been detected by the recognition system as a light switch. As seen, the confidence reported by the system is 9%. In this case, the user clicks the "UNDO" button at the bottom of the screen, and a list of alternate objects in a dropdown menu, shown in Figure 2d, is presented, giving the user a chance to correct the error by choosing the correct category. Once the correct category is chosen, the user is given a chance to place the anchor in the right location on the image, as shown in Figure 2e, and to draw the corresponding bounding box for the object, as shown in Fig. 2f.
Note that the location of the anchor is not only registered in the 2D image as feedback to the user, but also in a fixed 3D coordinate system for all objects, which is registered with the floor plan the user generates for the room at the outset. The final output is a 2D or 3D layout of the room superimposed with the locations of the detected assets, where correctly detected and classified assets are displayed in green and misclassified assets are displayed in red, as seen in Figure 3.
Regardless of whether the system detects objects correctly or erroneously, all positive and negative examples generated during the human interaction process are saved and used for future training of the learning algorithm. It is this bootstrapping activity that makes our system more valuable over time as more operators use it. In this way, it is somewhat similar to search engines, whose performance improves over time as more users interact with them.

DEEP LEARNING PIPELINE
Our approach for detecting assets consists of a training and a testing phase. We chose the Single Shot Detector (SSD) model, pre-trained on the MSCOCO dataset (Lin et al., 2014), as a starting checkpoint for our model. SSD requires only a single pass through the neural network during inference, making it inherently faster and in turn more suitable for our use case of integrating object detection with anchor placement (Liu et al., 2016). In our system, we retrain the last layer of the neural network using the training examples obtained in situ. We use few-shot learning by training on multiple views of a few instances of an object. The input images are downsampled to 600 × 600 pixels. The model takes in images and ground-truth bounding boxes for each object. Our SSD model has 6 neural network layers, the last of which is retrained with our training data for 30,000 steps. We fine-tune the model using RMSProp with an initial learning rate of 4, a decay factor of 0.9, momentum of 0.9, a batch size of 15 and 8,000 decay steps. The model returns the class label with the highest predicted score for the detected asset and the confidence level for that prediction.
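For reference, these hyperparameters map onto a TensorFlow Object Detection API `pipeline.config` fragment roughly as follows. This is a hedged sketch, not our actual configuration file; the field names follow the API's SSD configs, and the initial learning rate is reproduced as printed above (typical published SSD configs use much smaller values, e.g. 0.004).

```protobuf
# Hypothetical train_config fragment; values taken from the text.
train_config {
  batch_size: 15
  num_steps: 30000
  optimizer {
    rms_prop_optimizer {
      learning_rate {
        exponential_decay_learning_rate {
          initial_learning_rate: 4.0   # as stated in the text
          decay_steps: 8000
          decay_factor: 0.9
        }
      }
      momentum_optimizer_value: 0.9
      decay: 0.9
    }
  }
}
```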

Data Augmentation
To make the model more robust (Perez et al., 2017) to various input object sizes, orientations, view angles, and room lighting conditions (Taqi et al., 2018), each training image is randomly sampled by one of the data augmentation options native to the TensorFlow Object Detection API. These consist of: (a) use the original image; (b) flip the original image horizontally; (c) flip the original image vertically; (d) crop the original image such that at least a 0.1, 0.3, 0.5, 0.7, 0.9, or 1.0 fraction of the input bounding box remains, where the minimum overlap with the new cropped image under which the bounding box is kept is 0.1, 0.3, 0.5, 0.7, 0.9, or 1.0; (e) scale the original image with a scale ratio ≥ 0.5 and ≤ 2.0; (f) adjust the brightness of the original image by a factor in the range [0, 0.2]; (g) adjust the contrast of the original image by a contrast factor equal to the original contrast times a value in the range [0.8, 1.25]; (h) rotate the original image by 90 degrees.
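A minimal sketch of this per-image sampling step follows; the option names and helper functions are illustrative labels for the eight options, not the Object Detection API's internal identifiers.

```python
# Hypothetical sketch: one augmentation option is drawn uniformly
# at random for each training image, mirroring the eight options
# described above.
import random

AUGMENTATIONS = [
    "identity", "flip_horizontal", "flip_vertical",
    "random_crop", "random_scale", "adjust_brightness",
    "adjust_contrast", "rotate_90",
]

def sample_augmentation(rng=random):
    """Pick one augmentation option for a training image."""
    return rng.choice(AUGMENTATIONS)

def sample_scale_ratio(rng=random):
    """Scale ratio drawn from the stated range [0.5, 2.0]."""
    return rng.uniform(0.5, 2.0)
```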

TensorFlowLite (TFLite)
To reduce the size and complexity of our model so that it can operate on a smartphone, we froze our 86 MB, 30,000-step TensorFlow model and converted the resulting 23 MB frozen inference graph into a 22 MB TFLite model, which is an offline model optimized for smartphone devices requiring low latency and a small binary size (Ushakov et al., 2018).
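A conversion along these lines can be performed with the TensorFlow 1.x `tflite_convert` command-line tool. This is a sketch only; the graph filename and tensor names below are placeholder assumptions, not those of our actual model.

```shell
# Convert a frozen TensorFlow inference graph to a TFLite model.
# File name and input/output tensor names are placeholders.
tflite_convert \
  --graph_def_file=frozen_inference_graph.pb \
  --output_file=detect.tflite \
  --input_arrays=normalized_input_image_tensor \
  --input_shapes=1,600,600,3 \
  --output_arrays=TFLite_Detection_PostProcess \
  --allow_custom_ops
```

The `--allow_custom_ops` flag is needed because SSD's detection post-processing is a custom TFLite op.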

RESULTS
Table 1 shows experimental results of our proposed system for ten object categories carried out in three rounds of experiments.
The object classes currently in our system are router, fire sprinkler, fire alarm, fire alarm handle, EXIT sign, cardkey reader, light switch, emergency lights, fire extinguisher, and outlet.
In the first round of our experiment, we used 234 training examples over all ten classes, with fire sprinkler having the most examples at 30, and fire extinguisher the fewest at 2. A total of 12 medium-sized rooms with 73 objects were tested in this round, resulting in an overall accuracy of 69%. As expected, classes with a large number of training examples, such as fire sprinkler, achieved higher accuracy (86%) than classes with few examples, such as fire extinguisher, which achieved zero percent accuracy. Of course, it must be noted that the test size for the fire extinguisher class is quite small, making the zero percent accuracy not statistically significant.
For round two of our experiments, the training set consisted of the original training set from round 1, plus the incorrectly classified images from the first round, which were interactively corrected by the user, resulting in a total of 267 training examples. In this round, we tested the system in 4 rooms visited in round 1 with 49 assets, and three new rooms with 24 assets, for a total of 73 assets. As expected, there is an overall improvement in accuracy, from 69% in round 1 to 83% for the new rooms in round 2.
In round 3 of our experiments, we used the training examples from round 2, in addition to the incorrectly classified images from round 2, which were interactively corrected by the operator. This resulted in a total of 282 examples. In this round we visited 7 new rooms not previously seen in either of the two previous rounds. Overall accuracy is 75%, which is an improvement over round 1, but a slight decrease compared to round 2. For many classes, the number of tests is not large enough for the per-class results to be statistically significant.

CONCLUSIONS AND FUTURE WORK
We proposed an interactive way of documenting and localizing assets inside a room using a smartphone equipped with native AR capability and capable of running machine learning models in real time. Future work involves extending the layout creation to 3D and to more than one room, and improving the accuracy of the system via additional augmentation of training examples. It would also be useful to develop online learning algorithms which can incrementally update the models as new examples become available.

Figure 1. An example of a fire extinguisher (a) captured by the operator in an actual room; (b) from the Google Image Database.

Figure 2: Screenshots of our system showing (a) correct detection of an outlet with confidence level of 93%; (b) correct detection of a light switch with confidence level of 87%; (c) incorrect detection of an outlet as a light switch with confidence level of 9.7%; (d) dropdown menu showing the incorrect detection, which the user can correct; (e) user placing the anchor in the correct place; (f) user drawing a bounding box around the object to enable the system to use this image for future training.

Figure 3: 2D floor plan of a room with (a) all correctly classified assets; (b) correctly detected assets (green) and misclassified/incorrectly detected assets (red).

Figure 4. Round 3 experiment classification accuracy vs. number of training examples per class.

Table 1: Experimental results of our proposed system for ten object categories over three rounds of experiments.

REFERENCES

Perez, L. and Wang, J., 2017. The effectiveness of data augmentation in image classification using deep learning. arXiv preprint arXiv:1712.04621.