VIDEO-BASED POINT CLOUD GENERATION USING MULTIPLE ACTION CAMERAS

Due to the development of action cameras, the use of video technology for collecting geo-spatial data has become an important trend. The objective of this study is to compare the image mode and video mode of multiple action cameras for 3D point cloud generation. Still images are acquired from discrete camera stations, while videos are taken along continuous trajectories. The proposed method includes five major parts: (1) camera calibration, (2) video conversion and alignment, (3) orientation modelling, (4) dense matching, and (5) evaluation. As action cameras usually have a large FOV in wide viewing mode, camera calibration plays an important role in correcting the effect of lens distortion before image matching. Once the cameras have been calibrated, this study uses them to record video in an indoor environment. The videos are then converted into multiple frame images according to the frame rate. To overcome the time synchronisation problem between videos from different viewpoints, an additional timer app is used to determine the time shift between cameras during time alignment. A structure from motion (SfM) technique is used to obtain the image orientations. Then, a semi-global matching (SGM) algorithm is adopted to obtain dense 3D point clouds. The preliminary results indicate that the 3D points from 4K video are similar to those from 12MP images, but data acquisition with 4K video is more efficient than with 12MP still images.


Motivation
Three-dimensional geospatial information of an indoor environment can be generated from cameras and laser scanners. Laser scanners obtain 3D points directly, while cameras obtain 3D points indirectly via stereo image matching. Digital still cameras and digital video are two possible ways to collect digital images for image matching. Nowadays, a lightweight action camera such as the GoPro Hero 4 Black Edition is able to collect digital still images at up to 12MP (4000 x 3000) resolution and video at up to 8.3MP (3840 x 2160) resolution at 30 frames per second. Although the spatial resolution of a digital still image is higher than that of a video frame, the sampling rate of digital video is higher than that of a digital still camera. As video data can be converted to frame images like those of a digital still camera, the highly overlapped frame images from video provide high similarity and high redundancy for image matching. In addition, an action camera is able to acquire video and still images (one image every 5 seconds) simultaneously. Therefore, there is a need to compare these two strategies for indoor point cloud generation.

Action Cameras
With the development of camera technology, most action cameras provide both image and video functions. Compared with a traditional consumer digital camera, an action camera such as the GoPro (GoPro, 2015) emphasises light weight, small dimensions, waterproofing, a large field of view (FOV), 4K video recording and a high burst frame rate. Comparisons of up-to-date action cameras can be found in (Crisp, 2014; Staub, 2015). Action cameras were originally developed for sports and underwater usage, where users record their activities during extreme sports or special events. Due to the light weight, low cost and high spatial resolution of the video mode, the usage of action cameras has been extended to unmanned aerial vehicles (UAV), mobile mapping systems (MMS), and other photogrammetric purposes.

Related Works
Digital video devices record image sequences, and these dynamically sampled images can be used for different applications. Traditional photogrammetry mostly relies on images of high spatial resolution. Due to the improvement of video resolution and frame rate, the use of video technology for collecting geo-spatial data has become an important trend. Many video-related applications have been presented in different geoinformation-related domains. For example, the spaceborne Skybox™ constellation is capable of acquiring sub-meter satellite imagery and high-definition panchromatic video for earth monitoring; video collected by UAV can be used to produce geospatial data via Full Motion Video (FMV) in ArcGIS™ or other commercial software; and video from car dashboard cameras can be used for crowdsourced street-level mapping via Mapillary.com or other online mapping services.
Several photogrammetry studies have used GoPro action cameras for 3D measurement purposes. Balletti et al. (2014) discussed different camera calibration methods using GoPro cameras for 3D measurement. Kim et al. (2014) constructed 3D point clouds of a building façade using GoPro 1080P super-view stereo video. To meet the need for stereo vision, GoPro provides accessories (i.e. a dual-camera stereo housing, a synchronization cable and software) to capture and produce 3D movies. Because of the waterproof housing, this technology has also been applied in underwater stereo vision. For example, Schmidt and Rzhanov (2012) used dual GoPro cameras to measure seafloor micro-bathymetry; the 4K stereo videos were able to generate a seafloor grid of 3mm resolution at a distance of 70cm. Nelson et al. (2014) combined a sonar scanner and dual GoPro cameras in a remotely operated vehicle for underwater 3D reconstruction. Their results showed the potential of combining 3D sonar data and 3D surfaces from image matching for underwater archaeological applications. These previous studies indicate that GoPro stereo videos are suitable for close-range photogrammetry.

Research Purposes
The objective of this study is to compare the image mode and video mode of multiple action cameras for 3D point cloud generation. Still images are acquired from discrete camera stations, while videos are taken along continuous trajectories. The proposed method includes five major parts: (1) camera calibration, (2) video conversion and alignment, (3) orientation modelling, (4) dense matching, and (5) evaluation. As action cameras usually have a large FOV in wide viewing mode, camera calibration plays an important role in correcting the effect of lens distortion before image matching. A black and white chessboard pattern and the Brown equations are adopted in camera calibration. Once the cameras have been calibrated, this study uses them to record video in an indoor environment. The videos are then converted into multiple frame images according to the frame rate. To overcome the time synchronisation problem between videos from different viewpoints, the image scenes are manually identified to calculate the time shift between cameras during time alignment. A structure from motion (SfM) technique is used to obtain the image orientations. Then, a semi-global matching (SGM) algorithm is adopted to obtain dense 3D point clouds (Remondino et al., 2014).

System Specifications
This study uses five GoPro Hero4 Black cameras for point cloud generation. The five cameras are integrated in a Freedom360™ mount to obtain 360-degree panoramic images and are controlled by a GoPro remote controller. The multi-view camera system is a cube of about 10cm x 10cm x 10cm (see Figure 1). Each camera provides both still image and video modes. The highest spatial resolution for a digital still image is 12MP (4000 x 3000), while the highest spatial resolution for digital video is 4K (3840 x 2160) at 30 frames per second (fps). As the shutter time of 4K video (1/30 sec) might produce blurred images, this study also considers 1080P (1920 x 1080) at 120fps to avoid image blur. Table 1 shows the related camera parameters.
The spatial resolution of an action camera is usually lower than that of a digital single-lens reflex (DSLR) camera. In order to understand the suitability of action cameras for close-range photogrammetry, this study analyses the spatial resolution of the action camera at different distances and in different modes. Figure 2 summarises the spatial resolution of image and video at the nadir and diagonal points. An action camera usually has a large FOV, and consequently points near the image boundaries have a coarser spatial resolution (a larger object-space footprint per pixel). This issue should be taken into consideration in 3D measurement. To obtain a resolution of 5cm or better, the object distance for 12MP images and 4K video should be less than 20m. The action camera might not be suitable for long-range photogrammetry, but it is suitable for indoor environments at near range (<20m). Therefore, the scope of this study is the use of multiple action cameras in an indoor environment.
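The relationship between object distance and pixel footprint can be illustrated with a short calculation. The sketch below is not the analysis behind Figure 2: it assumes an idealised equidistant fisheye projection (r = f * theta) and a flat wall perpendicular to the optical axis, and the focal length in pixels is a hypothetical placeholder rather than a calibrated value from this study.

```python
import math

def gsd_nadir(distance_m, focal_px):
    """Object-space size of one pixel on the optical axis, in metres."""
    return distance_m / focal_px

def gsd_off_axis_equidistant(distance_m, focal_px, off_axis_deg):
    """Radial pixel footprint on a wall perpendicular to the optical axis.

    Assumes an equidistant fisheye (r = f * theta). For that model the
    footprint grows by 1 / cos^2(theta) away from the image centre.
    """
    theta = math.radians(off_axis_deg)
    return distance_m / (focal_px * math.cos(theta) ** 2)

# Hypothetical focal length in pixels, for illustration only.
FOCAL_PX = 1800.0
for d in (5, 10, 20):
    centre = gsd_nadir(d, FOCAL_PX)
    edge = gsd_off_axis_equidistant(d, FOCAL_PX, 60.0)
    print(f"{d:>2} m: centre {centre * 100:.2f} cm, 60 deg off-axis {edge * 100:.2f} cm")
```

Under these assumptions the footprint scales linearly with distance and is four times larger at 60 degrees off-axis than at the image centre, which is the qualitative behaviour described above.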

Camera Calibration
As action cameras usually have a large FOV in wide viewing mode, camera calibration plays an important role in correcting the effect of lens distortion for image matching. This study uses the Brown distortion model (equations (1) to (4)) (Brown, 1971) to determine the lens distortion. PhotoScan (Agisoft, 2015) and PhotoModeler (EOS System, 2015) are used to evaluate the results. PhotoScan uses a regular chessboard pattern to obtain a large number of conjugate points for camera calibration. PhotoModeler uses circular signalized targets and self-calibration to determine the lens distortion parameters. Note that the radial distortion parameter K3 is needed for a large-FOV camera.
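For orientation, the Brown model can be sketched in a few lines. The function below uses one common photogrammetric form with three radial terms (K1, K2, K3) and two tangential terms (P1, P2). Sign conventions, coordinate units and parameter ordering differ between software packages, so this should be read as an assumed form rather than the exact equations evaluated by PhotoScan or PhotoModeler, and the coefficient values in the example are hypothetical.

```python
def brown_distortion(x, y, k1, k2, k3, p1, p2):
    """Return the distortion (dx, dy) at image point (x, y).

    x and y are measured from the principal point. The radial part is
    K1*r^2 + K2*r^4 + K3*r^6; the tangential part uses P1 and P2.
    """
    r2 = x * x + y * y
    radial = k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    dx = x * radial + p1 * (r2 + 2 * x * x) + 2 * p2 * x * y
    dy = y * radial + p2 * (r2 + 2 * y * y) + 2 * p1 * x * y
    return dx, dy

# Hypothetical coefficients in normalised image units, for illustration only.
dx, dy = brown_distortion(0.5, 0.3, k1=-0.25, k2=0.08, k3=-0.01, p1=1e-4, p2=-2e-4)
print(dx, dy)
```

Because the K3 term grows with the sixth power of the radius, it contributes mainly near the image corners, which is where a large-FOV lens deviates most from the pinhole model.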
This study performs the camera calibration for 12MP images, 4K video and 1080P video separately. For video calibration, the video mode is used to record the coded targets from different viewing angles and positions. The videos are then converted into images at 1 image per second. In addition, the initial focal length and frame size (Kolor, 2015) are written to the EXIF metadata for calibration purposes. The total errors of PhotoModeler are smaller than 2 pixels in all modes. PhotoScan, however, does not provide an accuracy index for the lens distortion correction. Table 2 shows the results of camera calibration for camera 2 using PhotoModeler. Figure 3 shows the curves of radial and tangential distortion. The impact of radial distortion is significantly larger than that of tangential distortion. Comparing the still image and video modes, the PhotoModeler results are highly consistent in radial distortion, whereas the tangential distortion for 1080P differs.
Comparing the two packages, the radial distortion estimated by PhotoModeler is larger than that estimated by PhotoScan. This study also generates two undistorted images using the two sets of parameters (see Figure 4). The two methods behave similarly in the central area, but for straight lines near the image corners the result of PhotoModeler is better than that of PhotoScan. Therefore, this study uses the lens distortion parameters from PhotoModeler.

Cameras Alignment
The five cameras are fixed together in a mount, and a camera alignment is needed to determine the geometrical relationship between them. This study uses camera 1 as the master camera and the other four as slave cameras. The transformation between the master and each slave camera is described by two sets of parameters, i.e. lever-arms (dx, dy, dz) and boresight-angles (dω, dφ, dκ). In this study, 120 signalized targets (markers) are distributed on a 90cm x 90cm x 65cm box (see Figure 5a). Then, 80 images are taken from 16 stations by the 5 cameras. These 80 images are used in a bundle adjustment to determine their exterior orientations in the mapping frame (see Figure 5b). The lever-arms and boresight-angles are calculated by equations (5) and (6). Table 3 summarises the results of the camera alignment. The standard deviations of the boresight-angles are better than 0.6 degrees, except for camera 5 on the top. The standard deviations of the lever-arms are less than 1.9cm in all cases, which is about 19% of the size of the camera system (see Figure 1). These parameters can therefore only be treated as initial values in orientation modelling, and further investigation is needed.
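The relative pose of a slave camera with respect to the master can be sketched as follows. This is an illustration under assumed conventions and not a reproduction of equations (5) and (6): each rotation matrix is taken to rotate from the camera frame to the mapping frame, the lever-arm is expressed in the master camera frame, and the angles are extracted for the rotation order R = Rx(ω) Ry(φ) Rz(κ). The function names are hypothetical.

```python
import math

def transpose(m):
    """Transpose a 3x3 matrix given as nested lists."""
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    """Multiply two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]

def relative_pose(r_master, c_master, r_slave, c_slave):
    """Lever-arm (master frame) and boresight angles (degrees) of a slave camera.

    r_*: assumed camera-to-mapping rotation matrices; c_*: perspective
    centres in the mapping frame.
    """
    rt = transpose(r_master)
    lever_arm = matvec(rt, [s - m for s, m in zip(c_slave, c_master)])
    r_bore = matmul(rt, r_slave)
    # Angle extraction for R = Rx(omega) * Ry(phi) * Rz(kappa).
    omega = math.atan2(-r_bore[1][2], r_bore[2][2])
    phi = math.asin(max(-1.0, min(1.0, r_bore[0][2])))
    kappa = math.atan2(-r_bore[0][1], r_bore[0][0])
    return lever_arm, tuple(math.degrees(a) for a in (omega, phi, kappa))

# Illustrative poses: slave is 10 cm to the side and rotated 90 degrees about z.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rz90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
print(relative_pose(identity, [0.0, 0.0, 0.0], rz90, [0.1, 0.0, 0.0]))
```

Repeating this calculation for every station and taking the mean and standard deviation of the resulting parameters would give per-camera statistics of the kind summarised in Table 3.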

Data Synchronisation
The five cameras are controlled by a remote control, and no cables are connected between the cameras. A slight time lag was found when triggering the cameras to take images or video. This time lag does not affect digital still images taken on a fixed tripod, but it might cause asynchronous data in the video mode. As there is no cable connecting the cameras for synchronisation, the only way is to use an additional timer to align the videos. Figure 6 shows the same timer recorded by different cameras in video mode. A timer app with 1/100 sec precision is used for time alignment. Each camera records the same timer, and the recording times of the videos are shifted to the reference time of the timer. Although the timer provides 1/100 sec precision, the precision of the time alignment is restricted by the frame rate. For example, the frame interval of 4K video is 1/30 sec, so this method can only ensure synchronisation to within 1/30 sec for 4K video.
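The timer-based alignment can be sketched as a small calculation. The example assumes that, for each camera, the timer value is read from one video frame with a known frame index. The function names and the timer readings are hypothetical and are not taken from the experiment.

```python
def video_start_time(timer_reading_s, frame_index, fps):
    """Timer time at which frame 0 of a video was recorded."""
    return timer_reading_s - frame_index / fps

def frame_shift(start_ref_s, start_cam_s, fps):
    """Whole-frame shift that aligns a camera to the reference camera.

    Returns (shift_frames, residual_s): frame k of the reference video
    corresponds to frame k + shift_frames of the other video, and
    residual_s is the misalignment that remains after the shift.
    """
    shift_s = start_ref_s - start_cam_s
    shift_frames = round(shift_s * fps)
    residual_s = shift_s - shift_frames / fps
    return shift_frames, residual_s

# Hypothetical readings: both timer values are read from frame 0 of each video.
fps = 30
ref_start = video_start_time(12.50, 0, fps)
cam_start = video_start_time(12.37, 0, fps)
shift, residual = frame_shift(ref_start, cam_start, fps)
print(shift, round(residual, 4))
```

Because the shift is rounded to whole frames, the residual cannot be reduced below a fraction of the frame interval, which is why the frame rate rather than the timer limits the alignment precision.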

Point Cloud Generation
After camera calibration, camera alignment and time alignment, the videos are converted into image frames at different sampling intervals for 3D point cloud generation. The procedure includes: (1) a structure from motion (SfM) technique for image orientation; (2) absolute orientation using control points; (3) a semi-global matching (SGM) algorithm for dense point matching. This study uses the commercial software Agisoft PhotoScan for 3D point cloud generation.
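The conversion of a video into frames at a chosen sampling interval amounts to selecting frame indices. The sketch below only computes which frames to keep; decoding the video itself would be done with a video tool and is not shown. The function name and the example clip lengths are illustrative assumptions.

```python
def sample_frame_indices(n_frames, fps, interval_s):
    """Indices of the frames kept when extracting one frame every interval_s seconds."""
    step = interval_s * fps
    if step < 1:
        raise ValueError("sampling interval is shorter than one frame")
    indices = []
    k = 0
    while round(k * step) < n_frames:
        indices.append(round(k * step))
        k += 1
    return indices

# Illustrative clips: 25 s at 30 fps sampled every 1 s, 25 s at 120 fps every 0.5 s.
print(len(sample_frame_indices(25 * 30, 30, 1.0)))
print(len(sample_frame_indices(25 * 120, 120, 0.5)))
```

A shorter interval gives more overlap and redundancy for matching at the cost of more images to orient and match, so the interval is a trade-off between geometry and computation time.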

EVALUATION
The evaluation includes two cases: a stair and a lobby.

Case 1. Stair
Stair modelling is a challenging task in 3D indoor modelling. Discrete digital still images usually cannot provide favourable intersection geometry because of the limited number of camera stations. In contrast, a digital video is able to capture multi-view images effectively. The aim of this section is to compare the performance of 12MP images, 4K video and 1080P video. Only one action camera is used in this section. For the digital still images, an image is taken at every step of the stair. Image acquisition takes about 150 seconds, whereas video acquisition takes only 25 seconds for the same area, so data acquisition in video mode is about six times faster than in image mode. In addition, the standard deviations of the camera baseline are 15.1cm for 12MP, 7.9cm for 4K and 3.8cm for 1080P, which indicates that video provides more uniformly spaced camera stations than still images. Comparing the 4K and 1080P videos, the resolution of 4K is higher than that of 1080P, while the sampling rate of 4K (i.e. 30fps) is lower than that of 1080P (i.e. 120fps). Hence, the visual image quality (e.g. the effect of motion blur) of 1080P is better than that of 4K.
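The uniformity of the camera stations can be quantified from the estimated camera centres. The sketch below assumes that a baseline is the distance between consecutive camera centres and uses the sample standard deviation; both are assumptions about how the figures above were derived, and the coordinates are made up for illustration.

```python
import math
import statistics

def baselines(centres):
    """Distances between consecutive camera centres given as (x, y, z) tuples."""
    return [math.dist(a, b) for a, b in zip(centres, centres[1:])]

def baseline_std(centres):
    """Sample standard deviation of the consecutive baselines."""
    return statistics.stdev(baselines(centres))

# Made-up camera centres in metres: evenly and unevenly spaced trajectories.
even = [(0.0, 0.0, 0.0), (0.3, 0.0, 0.0), (0.6, 0.0, 0.0), (0.9, 0.0, 0.0)]
uneven = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.6, 0.0, 0.0), (0.9, 0.0, 0.0)]
print(baseline_std(even), baseline_std(uneven))
```

A smaller value means more evenly spaced stations, which is the sense in which the video modes are described as more uniform than the still images.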
The images and video frames are used in relative orientation modelling, and 4 control points are manually selected for absolute orientation. The residuals of the control points are less than 5cm in all three cases. Then, high-density image matching is used to obtain the point clouds of the stair. Table 4 summarises the results of the three modes. The point density of 12MP is the highest, but the result of 4K video is similar to that of 12MP. Figure 7 shows a section of the stair for comparison. The section includes 18 steps, and the stair is about 1.5m wide, 4m long and 2.4m high. The shapes of the three results show high consistency. In other words, 4K video is able to produce results similar to those of 12MP images.

Case 2. Lobby
In Case 2, the multiple action camera system is used to reconstruct the point clouds of a lobby. The test area is about 20m wide, 15m long and 3m high. In order to have multi-view images for image matching, a tripod is used to take digital still images at five different heights (i.e. 1.0m, 1.25m, 1.50m, 1.75m, and 2.00m). The distance between camera positions at the same station is about 0.25m, while the distance between different stations is about 3m. Image acquisition for the five stations takes about 10 minutes, whereas the 4K video of the same area takes just 22 seconds. Table 5 summarises the results of the two modes. The video mode obtains continuous image frames, and the average distance between camera centres in video mode is 32.2cm. Therefore, the number of frames from video is larger than the number of traditional digital images (i.e. 376 images > 125 images). However, the video mode needs more computation time to produce point clouds (i.e. 4hrs > 2hrs).
Figure 8. Results of a lobby: (a) perspective centres of 12MP images; (b) perspective centres of 4K video frames; (c) points from 12MP image; (d) points from 4K video.

CONCLUSIONS AND FUTURE WORK
This research proposed a multiple action camera system for indoor mapping. The characteristics of this system are 360-degree panoramic imaging and high-resolution 4K video, which are beneficial for data acquisition in an indoor environment as well as for 3D point cloud generation. This study also demonstrated the results of camera calibration for image and video modes. The maximum radial distortion of 4K video reached 500 pixels at the image boundary. The lens distortion should therefore be pre-calibrated, as its impact was significant in relation to the image frame. The five cameras were mounted together, and the lever-arms and boresight-angles were calculated by camera alignment. The results of the camera alignment can be used as the initial orientations in orientation modelling. Time synchronisation in video mode was implemented with an additional timer, which compensates for the time lag of this system. Finally, the 3D point clouds were generated by orientation modelling and dense matching.
The preliminary results indicated that the 3D points from 4K video were similar to those from 12MP images. In addition, data acquisition with 4K video was faster than with 12MP still images. The limitations of video-based point cloud generation are the long computation time for large data sets and the low image quality caused by video compression and motion blur. Future work will evaluate the system in different scenarios and with different parameters. As the radiometric performance of an action camera influences its geometrical performance, future work will also focus on the radiometric performance of action cameras in image and video modes.