COOPERATIVE LOCALISATION USING IMAGE SENSORS IN A DYNAMIC TRAFFIC SCENARIO

Localisation is one of the key elements in navigation. With the development of automated driving in particular, precise and reliable localisation is becoming essential. In this paper, we report on different cooperation approaches in visual localisation with two vehicles driving in a convoy formation. Each vehicle is equipped with a multi-sensor platform consisting of front-facing stereo cameras and a global navigation satellite system (GNSS) receiver. In the first approach, the GNSS signals are used as eccentric observations for the projection centres of the cameras in a bundle adjustment, whereas the second approach uses markers on the front vehicle as dynamic ground control points (GCPs). As the platforms are moving and data acquisition is not synchronised, we use time-dependent platform poses. These time-dependent poses are represented by trajectories consisting of multiple 6 Degree of Freedom (DoF) anchor points between which linear interpolation takes place. In order to investigate the developed approach experimentally, in particular the potential of dynamic GCPs, we captured data using two platforms driving on a public road at normal speed. As a baseline, we determine the localisation parameters of one platform using only data of that platform. We then compute a solution based on image and GNSS data from both platforms. In a third scenario, the front platform is used as a dynamic GCP which can be related to the trailing platform by markers observed in the images acquired by the latter. We show that both cooperative approaches lead to significant improvements in the precision of the poses of the anchor points after bundle adjustment compared to the baseline. The improvement achieved due to the inclusion of dynamic GCPs is somewhat smaller than the one due to relating the platforms by tie points. Finally, we show that for an individual vehicle, the use of dynamic GCPs can compensate for the lack of GNSS data.


INTRODUCTION
In the field of automated driving, precise and reliable self-localisation is a fundamental pre-condition. Especially in densely built-up urban areas classical positioning sensors like global navigation satellite system (GNSS) receivers reach their limits, as occlusions and multipath effects can lead to systematic errors. In these areas, in particular, additional sensors such as laser scanners or cameras are used to improve the localisation, e.g. (Garcia-Fernandez and Schön, 2019). Although cameras require an external light source, they are lighter and cheaper than laser scanners, making them more flexible in use. Examples of the usage of cameras in challenging urban areas are (Cavegn et al., 2016) and (Cavegn, 2020), where the authors show an improvement of the accuracy of the object points by a factor of 10 when comparing image-based georeferencing using ground control points (GCPs) with direct georeferencing using GNSS/INS. They also demonstrate that the uncertainty of the localisation derived from GNSS/INS only can be too optimistic.
While typical visual localisation methods, for example visual simultaneous localisation and mapping (SLAM), use a single platform, sharing information among multiple platforms can lead to better reliability of the solution. Therefore, applications for cooperative visual SLAM with multiple agents have recently been developed (Zou et al., 2019). Due to the emergence of car-to-car and car-to-X communication, cooperation is also interesting for automated driving. Cooperation can be achieved, for example, by sharing position information.
In this paper, we investigate the potential improvement of localisation using image sensors in typical traffic scenarios due to cooperation. The paper extends our prior work (Trusheim and Heipke, 2020), in which only simulations were investigated, by presenting results achieved using real data. For this purpose, we recorded data in a typical traffic situation. Two vehicles equipped with multi-sensor platforms drive in a convoy formation along a road with normal traffic volume. Markers on the vehicles allow them to be identified in the images (figure 1). Both multi-sensor platforms are equipped with a GNSS receiver and a stereo camera pair. Images are captured with GNSS timestamps, so all sensor data are available in a common time frame. The pose of all sensors and the markers relative to the platform is known from prior calibration.
Our main contribution is the demonstration of the advantages of cooperative visual localisation in urban traffic scenarios using bundle adjustment. For this purpose we qualitatively investigate the trajectory precision of the corresponding vehicles, using (a) common tie points and (b) so-called dynamic GCPs.
The paper is structured as follows: After the introduction, we give a short overview of existing work in section 2. In section 3 we describe the employed bundle adjustment approach with a detailed description of the used functional models. Section 4 contains our experiments. We first introduce the different scenarios and give a short overview of the sensors and data used in the experiments. We then present and discuss the results of each scenario in a comparative way. In section 5 we recapitulate the results and discuss possible steps for future work.

RELATED WORK
Visual sensors provide important information for localisation in difficult environments. Cavegn et al. (2016) and Cavegn (2020) deal with the challenge of georeferencing image sequences. The authors employ image-based georeferencing by using bundle adjustment. With this method, they can reduce the residuals at checkpoints from approx. 40 cm, achieved with GNSS/IMU sensors only, to 4 cm. While these results show the potential of the idea to use visual localisation in autonomous driving for determining ego-motion, the authors use GCPs to obtain their results. In many applications, however, such GCPs are not available.
To increase the accuracy and reliability of localisation, cooperation between several cameras is also used, especially in robotics; cf. the overview in (Zou et al., 2019). In CoSLAM (Zou and Tan, 2013), several moving cameras acquire data that are processed in a centralised adjustment. The authors state that they can scale their system to up to 12 cameras; image coordinates of points on moving objects are eliminated by outlier detection. Another possibility of cooperation is the recognition of cooperating platforms in the images. Stoven-Dubois et al. (2018) introduce an unmanned aerial vehicle (UAV) tandem system for surveying objects in GNSS-denied areas. A so-called surveying UAV flies next to the object to be surveyed and takes images while being tracked in an image and georeferenced by another UAV that flies at a higher altitude with a good GNSS signal. MapKITE (Molina et al., 2017; Nahon et al., 2019) also uses a tandem system. Here, the authors combine a terrestrial mobile mapping van with a UAV, so that they can make use of both types of measurements. The van has a much higher payload and can thus carry heavier and more accurate equipment. In that approach, the vehicle is used as a dynamic GCP. For accurate automated positioning, a circular target is placed on the vehicle roof. In our previous work (Trusheim and Heipke, 2020), a comparison of static and dynamic GCPs in a traffic-related scenario based on simulated data was shown. Here, we apply this approach to real data.
Obviously, dynamic approaches must be able to handle time-dependent parameters. To this end, Colomina and Blázquez (2004) describe a model for trajectory and sensor orientation. The authors compare a state space and a network approach and point out the respective advantages and disadvantages. They show that, whereas state space approaches are faster, network approaches achieve higher precisions and allow for the integration of both time-dependent and time-independent models. For short trajectory sections, linear interpolation can be performed for time-dependent parameters; this is discussed in (Cucci et al., 2017a) and (Cucci et al., 2017b) regarding raw observations from inertial measurement units (IMUs) in dynamic networks.

COOPERATIVE VISUAL POSE ESTIMATION
In a dynamic environment, observations taken at different epochs are related to different states. In our case, we use observations of moving GNSS sensors to derive what we call dynamic GCPs, which can be observed in images acquired by cameras from another moving platform. In this case, the GNSS observations used to define the dynamic GCP do not refer to the 3D position at which that dynamic GCP was located when the images showing it were captured. Thus, fusing observations of time-dependent processes typically requires interpolation of some entities, here platform poses, in a time-dependent model. We solve this problem by modelling the platform pose by a set of anchor points with linear interpolation in between. Each anchor point represents a 6 Degree of Freedom (DoF) pose of a platform at a specific point in time. Image points in different images showing the same object point (i.e. potential tie points of the photogrammetric block) may also refer to different positions in 3D if these object points are not static. In this paper, we consider these non-static points as blunders, which are to be eliminated in a robust bundle adjustment.
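In the bundle adjustment, we use three types of observations: GNSS observations of the antenna positions, image coordinates of tie points, and image coordinates of marker points attached to a cooperating platform.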

Functional models
We start the description of the functional model with the GNSS observations. Figure 2 shows the relationship between the position of the GNSS antenna in the platform frame and the position of the antenna observed in the global frame. We formulate this relationship in eq. 1 (the superscript indicates the frame; the global frame does not have a superscript, and a t in the subscript means "time-dependent"). Here, $X_{GNSS,t}$ is the position of the GNSS antenna observed in the global frame at time t. $X_{plat,t}$ represents the position of the platform at time t, whereas $R_{plat,t}$ is a rotation matrix describing the rotation between the platform and the global systems at time t. It is a function of the three time-dependent rotation angles roll, pitch and yaw $(r_t, p_t, y_t)$. Thus, $(R_{plat,t}, X_{plat,t})$ describes the unknown pose of the platform at time t. $X^{plat}_{GNSS}$ is the position of the antenna in the platform frame, which is constant and known from prior calibration.
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B1-2021 XXIV ISPRS Congress (2021 edition)
Figure 3 illustrates the observation of a static object point which is used as a tie point. The functional model for the image coordinates of these tie points is given in eqs. 2 to 4. In eq. 2, the coordinates of the tie point in the global frame $X_{tp}$ and the pose $(R_{plat,t}, X_{plat,t})$ of the platform are unknown. The rotation matrix $R^{cam}_{plat}$ and the shift $X^{cam}_{plat}$ (eq. 3) represent the pose of the camera in the platform frame; this transformation is constant and given from prior calibration. The interior orientation parameters $x_0, y_0, c$ and the parameters of the distortion correction functions $\Delta x, \Delta y$ (eq. 4) are also known from this calibration.
Here, $X_{X^{cam}_{tp,t}}$ is the first component of the vector $X^{cam}_{tp,t}$, which represents the position of the tie point at time t in the camera frame; $Y_{X^{cam}_{tp,t}}$ is correspondingly the second component and $Z_{X^{cam}_{tp,t}}$ the third. $x_{tp}, y_{tp}$ are the observed image coordinates of the tie point.
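Eqs. 1 to 4 can be sketched as follows. This is a reconstruction from the symbol definitions in the text rather than a verbatim copy, and it rests on three assumptions: $R_{plat,t}$ rotates platform-frame vectors into the global frame, $(R^{cam}_{plat}, X^{cam}_{plat})$ maps platform coordinates to camera coordinates, and the camera looks along its positive Z axis (with the opposite axis convention the sign of the $c$ term flips).

```latex
\begin{align}
X_{GNSS,t} &= X_{plat,t} + R_{plat,t}\, X^{plat}_{GNSS}
  && \text{(cf. eq. 1)} \\
X^{plat}_{tp,t} &= R_{plat,t}^{\top}\left(X_{tp} - X_{plat,t}\right)
  && \text{(cf. eq. 2)} \\
X^{cam}_{tp,t} &= R^{cam}_{plat}\, X^{plat}_{tp,t} + X^{cam}_{plat}
  && \text{(cf. eq. 3)} \\
x_{tp} &= x_0 + c\,\frac{X_{X^{cam}_{tp,t}}}{Z_{X^{cam}_{tp,t}}} + \Delta x, \qquad
y_{tp} = y_0 + c\,\frac{Y_{X^{cam}_{tp,t}}}{Z_{X^{cam}_{tp,t}}} + \Delta y
  && \text{(cf. eq. 4)}
\end{align}
```

In this form the GNSS observation depends only on the platform pose, whereas each tie point observation depends on both the platform pose and the unknown object point $X_{tp}$.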
Finally, in figure 4 we show the case of an observed marker point attached to a cooperating vehicle. For such observations, we use the model given in eqs. 5 to 7. Here, platform j is the observed platform and platform k is the observing platform. We transform the position $X^{plat_j}_{mk}$ of the marker point on the observed platform j into the global frame using the pose $(R_{plat_j,t}, X_{plat_j,t})$ of platform j, where $X^{plat_j}_{mk}$ is determined in a prior calibration and assumed to be constant, and the pose of the observed platform $(R_{plat_j,t}, X_{plat_j,t})$ is unknown (eq. 5). Using the pose $(R_{plat_k,t}, X_{plat_k,t})$ of the observing platform k, this point is transformed into the platform frame of k. Similar to eqs. 3 and 4, we then transform the point into the image coordinates $x_{mk}, y_{mk}$, which are used as observations (see eqs. 6 and 7). Note that this type of observation contributes to the determination of the poses of both platforms.
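The chain of transformations for a marker point can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the angle convention R = Rz(yaw) Ry(pitch) Rx(roll), the direction of the camera mounting transformation and the sign of the projection are assumptions made for illustration, and the distortion corrections are omitted.

```python
import math


def rot_rpy(roll, pitch, yaw):
    """Rotation matrix R = Rz(yaw) * Ry(pitch) * Rx(roll) (assumed convention)."""
    cr, sr = math.cos(roll), math.sin(roll)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp, cp * sr, cp * cr],
    ]


def mat_vec(m, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def transpose(m):
    """Transpose of a 3x3 matrix (the inverse of a rotation matrix)."""
    return [[m[j][i] for j in range(3)] for i in range(3)]


def project_marker(marker_in_j, pose_j, pose_k, cam_in_k, interior):
    """Image coordinates of a marker on platform j seen from platform k.

    pose_j, pose_k: (R, X) with R rotating platform coordinates into the
    global frame. cam_in_k: (R, X) mapping platform-k coordinates to camera
    coordinates (assumed direction). interior: (x0, y0, c).
    """
    (r_j, x_j), (r_k, x_k) = pose_j, pose_k
    r_cam, x_cam = cam_in_k
    x0, y0, c = interior
    # Platform j frame -> global frame (role of eq. 5).
    in_global = [a + b for a, b in zip(x_j, mat_vec(r_j, marker_in_j))]
    # Global frame -> platform k frame.
    diff = [a - b for a, b in zip(in_global, x_k)]
    in_k = mat_vec(transpose(r_k), diff)
    # Platform k frame -> camera frame (role of eq. 6).
    in_cam = [a + b for a, b in zip(mat_vec(r_cam, in_k), x_cam)]
    # Central projection, camera assumed to look along +Z (role of eq. 7).
    return (x0 + c * in_cam[0] / in_cam[2], y0 + c * in_cam[1] / in_cam[2])
```

Because both `pose_j` and `pose_k` enter the projection, a single marker observation constrains the poses of the observed and the observing platform at the same time, which is the property noted in the text.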

Stochastic models
Each of the three different groups of observations is assumed to be of constant precision, and all correlations are neglected. As a consequence, the variance-covariance matrix of the observations is a diagonal matrix. For the observations of the GNSS antenna positions, we used a relatively conservative standard deviation of $\sigma_{XYZ_{GNSS}} = 0.5$ m for all components. The image coordinates are introduced with $\sigma_{xy_{tp}} = 0.8$ pixel for the tie points and $\sigma_{xy_{mk}} = 0.5$ pixel for the marker points. The standard deviation for the marker points was chosen to be slightly smaller than the one for the tie points because the markers were specially designed to be well identifiable in the images.
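A diagonal stochastic model of this kind translates directly into per-observation weights. The following Python sketch uses the three standard deviations stated above; the function name, the argument layout and the a priori variance factor `sigma0` are hypothetical and only serve the illustration.

```python
SIGMA_GNSS_M = 0.5     # standard deviation of each GNSS coordinate [m]
SIGMA_TIE_PX = 0.8     # standard deviation of tie point image coordinates [pixel]
SIGMA_MARKER_PX = 0.5  # standard deviation of marker image coordinates [pixel]


def diagonal_weights(n_gnss, n_tie, n_marker, sigma0=1.0):
    """Diagonal of the weight matrix P = sigma0^2 * inverse(Sigma_ll).

    With uncorrelated observations, Sigma_ll is diagonal, so every weight
    is simply sigma0^2 / sigma_i^2. The counts refer to single coordinates.
    """
    groups = [
        (n_gnss, SIGMA_GNSS_M),
        (n_tie, SIGMA_TIE_PX),
        (n_marker, SIGMA_MARKER_PX),
    ]
    return [sigma0 ** 2 / sigma ** 2 for count, sigma in groups for _ in range(count)]
```

Note that the weights of different groups carry different units (1/m² and 1/pixel²), which is why the groups cannot be compared by weight alone.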

Platform trajectory modelling
As mentioned before, the trajectory of a platform is modelled by a series of anchor points $A_{plat,t_i}$, each consisting of a 3D vector $X_{plat,t_i}$ representing the position of the platform at time $t_i$ and another 3D vector $O_{plat,t_i} = (r_{t_i}, p_{t_i}, y_{t_i})$ representing the time-dependent rotation angles forming the other component of the platform pose at time $t_i$. The components of the anchor points are the actual unknowns in the adjustment. To determine the pose of the platform at the time t at which an image or a GNSS observation is acquired, we use linear interpolation between the neighbouring anchor points for both components:
$$X_{plat,t} = X_{plat,t_i} + \frac{t - t_i}{t_{i+1} - t_i}\left(X_{plat,t_{i+1}} - X_{plat,t_i}\right)$$
$$O_{plat,t} = O_{plat,t_i} + \frac{t - t_i}{t_{i+1} - t_i}\left(O_{plat,t_{i+1}} - O_{plat,t_i}\right)$$
where $X_{plat,t}$ and $O_{plat,t}$ represent the pose of the platform at time t, i.e. the time of observation, and the entities at times $t_i$ and $t_{i+1}$ are the corresponding entities of the anchor points. The time $t_i$ is the time of the anchor point before the observation and $t_{i+1}$ the one after the observation. The angles $O_{plat,t}$ are used to compute the rotation matrix $R_{plat,t}$ (e.g. eq. 1).
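The interpolation between anchor points can be sketched in a few lines of Python. The function name and the flat six-element pose layout (X, Y, Z, roll, pitch, yaw) are assumptions made for this example; the angles are interpolated component-wise, which presumes that they do not wrap around between neighbouring anchor points.

```python
from bisect import bisect_right


def interpolate_pose(anchor_times, anchor_poses, t):
    """Linearly interpolate a 6-DoF pose at observation time t.

    anchor_times: strictly increasing anchor point times t_i.
    anchor_poses: one [X, Y, Z, roll, pitch, yaw] list per anchor point.
    """
    if not anchor_times[0] <= t <= anchor_times[-1]:
        raise ValueError("t lies outside the interval covered by the anchor points")
    # Index i of the anchor point before the observation; clamp so that
    # t == anchor_times[-1] uses the last interval with weight 1.
    i = min(bisect_right(anchor_times, t) - 1, len(anchor_times) - 2)
    t0, t1 = anchor_times[i], anchor_times[i + 1]
    w = (t - t0) / (t1 - t0)
    return [a + w * (b - a) for a, b in zip(anchor_poses[i], anchor_poses[i + 1])]
```

For example, an image taken at t = 0.125 s between anchor points at 0 s and 0.25 s receives a pose halfway between the two anchor poses, so its observations contribute to the estimation of both.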

Precision
As a measure of quality of our investigation, we use the precision resulting from the bundle adjustment, contained in the covariance matrix of the unknowns. The inverse of the matrix of normal equations N contains information about the variances of the unknowns. In our case, we are interested in the 6 DoF poses of the anchor points of the trajectories and the 3D coordinates of the object points. The precision of an unknown is then calculated by taking the square root of the corresponding diagonal entry of $N^{-1}$, multiplied by the a posteriori standard deviation of unit weight.
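As a toy illustration of this computation, the following Python sketch derives the standard deviations of two unknowns from a 2 × 2 normal equation matrix. The function names and the numbers in the usage note are made up; a real bundle adjustment involves thousands of unknowns and would use a sparse solver instead of an explicit inverse.

```python
import math


def invert_2x2(n):
    """Explicit inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = n
    det = a * d - b * c
    if det == 0:
        raise ValueError("normal equation matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def precisions(normal_matrix, sigma0_hat):
    """sigma_i = sigma0_hat * sqrt(q_ii), with Q = inverse(N)."""
    q = invert_2x2(normal_matrix)
    return [sigma0_hat * math.sqrt(q[i][i]) for i in range(2)]
```

With `normal_matrix = [[4.0, 0.0], [0.0, 16.0]]` and `sigma0_hat = 1.2`, the diagonal of the inverse is 0.25 and 0.0625, which gives standard deviations of 0.6 and 0.3 in the units of the respective unknowns.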

EXPERIMENTS

Scenarios
In this section, we consider five different scenarios involving two platforms to demonstrate the potential of cooperative localisation using bundle adjustment. Note that for some scenarios only image-related observations from one platform are needed, and thus all computations (feature extraction, matching and bundle adjustment) can be executed locally on that vehicle: only the GNSS data of the other vehicle need to be communicated. For other scenarios, image observations from both vehicles are needed, thus locally extracted features and their descriptions also need to be transferred to the vehicle which performs the bundle adjustment. However, a more detailed discussion of the communication aspects is beyond the scope of this paper. The five scenarios are defined as follows:
1. In the first scenario, we use only a single platform. The GNSS data and the image coordinates of tie points are used as observations in a bundle adjustment. This scenario provides a baseline regarding the obtainable precision using one multi-sensor platform and no cooperation.
2. In the second scenario the GNSS data and the image coordinates of the tie points of both platforms are used as observations in a common bundle adjustment. The inclusion of information from different platforms, including tie points observed from both platforms, should lead to a better result for the precision than in scenario 1.
3. In the third scenario, we use the image observations and the GNSS data of platform_back, in addition to the image coordinates of the observed marker points of platform_front and its GNSS data; the markers thus allow the observed platform to be used as a dynamic GCP. Up to four marker points are visible in one image. Due to the use of the dynamic GCP, results should be better than in scenario 1, but not as precise as in scenario 2, because tie point observations are only used from platform_back.
4. The fourth scenario combines scenarios 2 and 3 by considering multiple platforms including a dynamic GCP. With this scenario, we want to check if the precision of scenario 2 can be further improved by using the additional cooperation strategy.
5. Finally, we show that the use of a dynamic GCP also makes it possible to calculate the ego-pose of platform_back even if no own GNSS data are available for that platform. For this purpose, we use the image coordinates of the tie points of platform_back and the image coordinates of the observed marker points of platform_front, as well as the GNSS data of platform_front.
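The five scenarios can be summarised as a small lookup table, sketched here in Python. The dictionary layout and the helper names are hypothetical; the content follows the scenario descriptions above, with platform_back taken as the single platform of scenario 1.

```python
# Observation groups used per scenario. "back" and "front" stand for
# platform_back and platform_front; "markers" refers to marker points on
# platform_front observed in the images of platform_back.
SCENARIOS = {
    1: {"gnss": {"back"}, "tie_points": {"back"}, "markers": False},
    2: {"gnss": {"back", "front"}, "tie_points": {"back", "front"}, "markers": False},
    3: {"gnss": {"back", "front"}, "tie_points": {"back"}, "markers": True},
    4: {"gnss": {"back", "front"}, "tie_points": {"back", "front"}, "markers": True},
    5: {"gnss": {"front"}, "tie_points": {"back"}, "markers": True},
}


def needs_feature_exchange(scenario):
    """True if image features of platform_front must be transferred."""
    return "front" in SCENARIOS[scenario]["tie_points"]


def is_cooperative(scenario):
    """True if any observation related to platform_front is used."""
    cfg = SCENARIOS[scenario]
    return cfg["markers"] or "front" in (cfg["gnss"] | cfg["tie_points"])
```

In this summary, only scenarios 2 and 4 require the exchange of image features, whereas scenarios 3 and 5 only need the GNSS data of the front vehicle in addition to the local images.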

Data Acquisition
The data were recorded in a measurement campaign with several multi-sensor platforms. For this paper, a track is chosen where two platforms travel in tandem around a curve of approximately 90 degrees in an urban canyon. For the first half of the track, the vehicles travel in an easterly direction and then turn south (figure 5). We use the stereo camera pairs and the GNSS receivers of both platforms shown in figure 1 in different combinations as described above. As the cameras look in the direction of travel, the distribution of tie points varies along the trajectory: for the part of the track after the turn there are significantly more points than for the first part, see figure 5.
The cameras we use are Grasshopper 3 USB cameras. They acquire images of 1920 × 1200 pixels at a frequency of 5 Hz and have a focal length of 11.3 mm, equivalent to 1930 pixels. Image acquisition was initiated by an external trigger signal provided to both cameras. The GNSS time is also saved on the basis of this signal, so that all sensor data are given in the same time frame. The GNSS positions are captured at a frequency of 1 Hz using geodetic receivers (Septentrio PolaRx5e, SN 3061550) with a JAVRING-ANT G5T NONE antenna (SN 06380). The images were taken on Aug. 25, 2020, at 5 pm, thus relatively late in the day. The sky was overcast, which led to difficult lighting conditions (figure 6).

Experimental Setup
For the experiments, a section of the track covering about 25 s was chosen. At a frequency of 1 Hz, this yields 25 GNSS observations for each platform, as well as 125 image pairs per stereo camera pair captured at 5 Hz (500 images in total). To define the trajectory we place an anchor point every 0.25 s (4 Hz), yielding a total of 100 anchor points per trajectory. These anchor points were aligned with the GNSS observations such that the time stamp of every GNSS observation corresponds exactly to the time associated with one anchor point. Image coordinates of tie points were extracted using the software COLMAP, where we used SIFT features and exhaustive matching.
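The numbers in this setup follow from simple arithmetic, which the Python sketch below reproduces. The function name and the choice to start the time axis at the first GNSS observation (t = 0 s) are assumptions; the rates and the duration are those stated above.

```python
def experiment_layout(duration_s=25, gnss_hz=1, image_hz=5, anchor_hz=4,
                      cameras_per_platform=2, platforms=2):
    """Counts and time stamps implied by the stated rates and duration."""
    n_gnss = duration_s * gnss_hz          # GNSS observations per platform
    n_pairs = duration_s * image_hz        # image pairs per stereo camera pair
    n_images = n_pairs * cameras_per_platform * platforms
    anchor_times = [i / anchor_hz for i in range(duration_s * anchor_hz)]
    gnss_times = [i / gnss_hz for i in range(n_gnss)]
    # Every GNSS time stamp coincides exactly with one anchor point.
    gnss_anchor_index = [anchor_times.index(t) for t in gnss_times]
    return {
        "n_gnss": n_gnss,
        "n_pairs": n_pairs,
        "n_images": n_images,
        "anchor_times": anchor_times,
        "gnss_anchor_index": gnss_anchor_index,
    }
```

With the default values, every fourth anchor point carries a GNSS observation, while the 5 Hz images generally fall between anchor points and therefore rely on the interpolated pose.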
In scenarios 3 and 5, where only GNSS observations are available for platform_front, the anchor point density of this platform is reduced to 1 Hz (25 anchor points), as there are no image observations which could support a denser selection. For the marker points, only observations in images closest in time to the GNSS observations of platform_front are used, which is done to avoid interpolation errors. Thus, these marker point observations are only available at a frequency of 1 Hz.
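Selecting the images closest in time to the GNSS observations amounts to a nearest-neighbour search on the time stamps. The Python sketch below illustrates this; the function name is hypothetical, and no threshold on the remaining time difference is applied because the text does not state one.

```python
def closest_image_indices(image_times, gnss_times):
    """Index of the image closest in time to each GNSS observation."""
    return [
        min(range(len(image_times)), key=lambda i: abs(image_times[i] - t))
        for t in gnss_times
    ]
```

For images taken at 0.05 s, 0.25 s, 0.45 s and so on, the GNSS epochs at 0 s and 1 s would be paired with the images at 0.05 s and 1.05 s, respectively.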

Results
In all scenarios, the estimated standard deviation of unit weight indicated a good fit of the stochastic model after convergence. The average back-projection error of the image coordinates amounts to approximately 0.65 pixels in the x-direction and 0.8 pixels in the y-direction. In general, the height component of the positions of both the anchor and the object points is determined with a lower precision than the planimetric components.
The number of tie points observed in the individual images is shown in figure 7. The reduction of the number of tie points at the beginning and the end of the right-hand turn of the vehicles (near anchor points 25 and 50, respectively, for platform_back) is particularly noticeable. Figure 7 also shows that for platform_front there are significantly more tie points per image, which can be explained by a longer exposure time for the cameras on this platform. An example for the distribution of these tie points can be seen in figure 6. The figure also shows the effects of using robust adjustment: in this image, some of the tie points were eliminated as blunders, and the majority of them lie on the moving platform_front. We note, however, that not all these moving tie points were identified as blunders. In this image, four marker points can also be seen (marked in yellow). It has to be noted that they are not well distributed across the whole image, which somewhat weakens the solution.

Scenario 1:
It is interesting to observe the impact of the right-hand turn of the platform on the precision. While driving along a straight trajectory, the $\sigma_{Height}$ values decrease, whereas they increase during the right-hand turn of the platform before becoming smaller and increasing again towards the end of the trajectory. In general, the planimetric precision is better than the one in height. Due to the turning manoeuvre, the relationship between the two global planimetric position axes (East vs. North) and those of the platform (in vs. across driving direction) changes between anchor points 35 and 55. In the first part of the track, the orange curve (North/across-track) has a behaviour similar to the height, but is slightly more precise, partly due to the images having a wider horizontal than vertical extension. In the second half of the track, the blue curve (East/across-track) shows a corresponding behaviour.
The precision in driving direction (first blue, then orange curve) is better than the one across driving direction.
The bottom part of figure 9 shows the estimated precisions of the rotation angles. The effect of the change of the driving direction on $\sigma_{Roll}$ and $\sigma_{Pitch}$ is similar to the one on $\sigma_{East}$ and $\sigma_{North}$. Due to the spatial distribution of object points (figure 5), the precision of the rotation around the global East-axis (blue curve during the first part of the drive; orange curve in the second part) is better than the one around the global North-axis. The precision around the height axis ($\sigma_{Yaw}$) is the best one along the entire trajectory.

Scenario 2:
In the second scenario, platform_front is introduced as a second platform to investigate cooperative localisation. In order to compare the results to scenario 1, again the results of platform_back are shown, see figure 10. It can be seen that the general shapes of the curves are similar to those in scenario 1, but in scenario 2 a better precision is achieved for all pose elements. This corresponds to our expectations according to the results of earlier simulations (Lenz, 2020).

Scenario 3:
In the third scenario, the cooperative part is introduced by using the attached marker points (eqs. 5 and 7). This allows platform_front to be used as a dynamic GCP. The observations used are the image coordinates of the marker points and the GNSS observations of platform_front in addition to those used in scenario 1.
In this scenario, a problem occurs: in the course of the trajectory there is a short time interval of about 5 s during the turn in which the marker points are not visible. Thus, we do not have any observations supporting the estimation of the related anchor points of platform_front. This means that the rotations of platform_front cannot be determined here. We find a workaround by regularising the solution: we introduce direct observations of the rotations of platform_front for each anchor point with relatively large standard deviations (σ = 0.2 rad (11.5°) for roll and pitch and σ = 0.5 rad (28.6°) for yaw), based on the assumption that the car only moves in the direction of travel and roll and pitch are rather constant over time. The yaw angle is estimated from two consecutive GNSS observations. In this way, numerical instabilities of the solution are prevented.
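Estimating the yaw angle from two consecutive GNSS observations can be sketched as follows in Python. The function name and the yaw convention (counter-clockwise from the East axis, in radians) are assumptions, since the text does not specify them; the two standard deviations are those stated above.

```python
import math

SIGMA_ROLL_PITCH_RAD = 0.2  # regularisation standard deviation for roll and pitch
SIGMA_YAW_RAD = 0.5         # regularisation standard deviation for yaw


def yaw_from_gnss(p_prev, p_next):
    """Heading from two consecutive (East, North) positions.

    Assumes the vehicle moves along its longitudinal axis and that yaw is
    measured counter-clockwise from the East axis (hypothetical convention).
    """
    d_east = p_next[0] - p_prev[0]
    d_north = p_next[1] - p_prev[1]
    if d_east == 0 and d_north == 0:
        raise ValueError("identical positions: heading is undefined")
    return math.atan2(d_north, d_east)
```

Under this convention, a vehicle heading east has a yaw of 0 and one heading south a yaw of -π/2. Converting the two standard deviations with `math.degrees` reproduces the values of about 11.5° and 28.6° given in the text.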
If we compare the precision results of this scenario (figure 11) with those of the two previous ones, we notice that we achieve an improvement compared to scenario 1, but it is somewhat smaller than the one obtained in scenario 2. The reason is that the photogrammetric block is already geometrically rather stable in scenario 1, as enough well distributed tie points are available, so the GNSS data for platform_back are sufficient to yield a rather precise solution, and the additional dynamic GCP does not have much effect. The precision of yaw is slightly decreased in the anchor points corresponding to the turn of the platform, when platform_front is not visible.

Scenario 4:
The fourth scenario combines scenarios 2 and 3. As the determination of the rotation of platform_front is supported by the image coordinate observations of the tie points between the vehicles, the introduction of a regularisation as in scenario 3 is not necessary. The results achieved in this scenario are shown in figure 12. These are similar to the results obtained in scenario 2, which again shows that the introduction of the dynamic GCP does not have a significant effect if the photogrammetric block has a stable geometry stemming from a large enough number of well distributed tie points.

Scenario 5:
In the fifth scenario, we consider the situation of localisation of a platform without GNSS observations taken by its own sensor, so that dynamic GCPs are the only information about the global frame. For this purpose, additional rotation observations of platform_front are again needed to regularise the solution as described in section 4.4.3.
The results are shown in figure 13. The precision plots have a similar appearance to those for the other scenarios, but, as expected, the solution is significantly less precise than in scenario 1. The fact that the front vehicle is not visible during the right-hand turn has a further negative impact on these results.

Comparison
Finally, we compare all scenarios based on the mean values for the precision of the pose parameters at the anchor points of the whole trajectory of platform_back (see table 2). We consider the first scenario as the baseline, as it does not contain any cooperation. Scenario 2 yields an improvement in the precision of the position of 37 mm in East, 31 mm in North, 47 mm in height and an improvement in rotation precision of 0.16° for roll, 0.14° for pitch and 0.08° for yaw. Overall, the average improvement over the precision of the 6 DoF pose is 27.5 % compared to the non-cooperative solution.
For scenario 3, an improvement in the precision of the 6 DoF poses of the anchor points compared to the baseline can also be observed. The improvements are 33 mm in East, 29 mm in North, 42 mm in height and 0.13° in roll, 0.10° in pitch and 0.07° in yaw. This results in an average improvement of 24.0 % compared to the non-cooperative solution.
Figure 10. Results of platform_back, scenario 2 (details see figure 9).
Figure 11. Results of platform_back, scenario 3 (details see figure 9).
Figure 12. Results of platform_back, scenario 4 (details see figure 9).
Figure 13. Results of platform_back, scenario 5 (details see figure 9).
For the combination of the two cooperation strategies (scenario 4), the individual improvements as well as the average improvement (27.8 %) are very similar to those obtained in scenario 2.
Scenario 5 is interesting, as in this case cooperation is the only way to achieve a solution. Compared to scenario 1, the precision in position deteriorates by 28 mm in East, 17 mm in North and 50 mm in height; the precision of rotation also deteriorates, by 0.32° in roll, 0.25° in pitch and 0.08° in yaw.
Overall, the precision of the 6 DoF pose deteriorates by 32.2 % on average.

CONCLUSION
In summary, we underline that a significant improvement in visual localisation can be obtained through cooperation. It was shown that both cooperation by using other participants as dynamic GCPs and cooperation in a common bundle adjustment lead to improvements in the precision of over 20 % with respect to the non-cooperative approach.
The improvement obtained with a common bundle adjustment is slightly higher compared to using a single dynamic GCP, but a larger number and a better distribution of dynamic GCPs will improve the results obtained. Furthermore, the results show that cooperation can also compensate for the (temporary) absence of GNSS data. Such situations often happen in urban environments.
In further work, we will introduce more general interpolation schemes, which will allow us to be more flexible in defining anchor points. These will also be chosen as a function of driving mode (straight course, turn etc.), and we will select different distances between anchor points for different pose parameters. As a further improvement, additional sensor data could be introduced, such as IMU data, which are available at a higher measuring frequency and provide further information about the driving behaviour. These extensions should also be examined as a means to avoid the regularisation for the rotation introduced in scenarios 3 and 5.
Another aspect is that in this work dynamic tie points are considered as blunders and are eliminated in a robust adjustment. Figure 6 shows that this works for some points, but also that some points lying on the front vehicle are not eliminated. In future work, we will investigate possibilities to subdivide the tie points into dynamic and static ones before the adjustment. This should lead to a further improvement, not least because trajectories can then also be defined for dynamic tie points.
We conclude that our work shows that cooperative visual localisation in a real-world traffic environment leads to promising results with improved precision.