AUTOMATED 3D ROAD SIGN MAPPING WITH STEREOVISION-BASED MOBILE MAPPING EXPLOITING DISPARITY INFORMATION FROM DENSE STEREO MATCHING

This paper presents algorithms and investigations on the automated detection, classification and mapping of road signs which systematically exploit depth information from stereo images. This approach was chosen due to recent progress in the development of stereo matching algorithms enabling the generation of accurate and dense depth maps. In comparison to mono imagery-based approaches, depth maps also allow 3D mapping of the objects. This is essential for efficient inventory and for future change detection purposes. Test measurements with the mobile mapping system by the Institute of Geomatics Engineering of the FHNW University of Applied Sciences and Arts Northwestern Switzerland demonstrated that the developed algorithms for the automated 3D road sign mapping perform well, even under difficult to poor lighting conditions. Approximately 90% of the relevant road signs with predominantly red, blue and yellow colors in Switzerland can be detected, and 85% can be classified correctly. Furthermore, fully automated mapping with a 3D accuracy of better than 10 cm is possible.


INTRODUCTION
A great many road signs can be found along the streets in Western Europe; in Switzerland alone, approximately five million signs exist. In many cases, there is no digital information available concerning the position and state of these road signs, and several of them are believed to be unnecessary. A road sign inventory could address these issues and support analysis. To establish such an inventory, attribute data and images of the road signs are traditionally captured in situ and the position is determined using a GNSS receiver with meter to decimeter accuracy. In recent years, mapping and inventory have increasingly been carried out on the basis of data recorded with mobile mapping systems, as they permit efficient mapping of 3D road infrastructure assets without disrupting the traffic flow or endangering the surveying staff. In Belgium, road signs over the whole country could be mapped by means of laser scanning data; attribute data was mostly obtained by user interaction (Trimble 2009). In the Netherlands, road sign mapping was carried out manually on the basis of panorama imagery which was collected every 5 m (de With et al. 2010).
If road signs can largely be extracted automatically from georeferenced images, the manual effort can be reduced significantly. This paper introduces algorithms for automated road sign detection and classification from mobile stereo image sequences as well as the determination of the 3D position and other attribute data. These algorithms were primarily optimized for road signs in Switzerland with mainly red, blue and yellow colors, which can appear in the shapes circle, triangle, rectangle, square and diamond as well as in four different dimensions depending on the road type (see Figure 1). However, the algorithms can be adapted to road signs of other countries. Since driver assistance systems or intelligent autonomous vehicles are not the focus of this work, real-time execution is not a top priority. Instead, the emphasis is on completeness, correctness and geometric accuracy.
The image-based road sign extraction process can typically be subdivided into two main steps. First, a detection of the road signs is carried out, aiming at localizing potential candidates. Second, a classification is necessary to identify the type of road sign. If the absolute position of the detected road signs is of interest, mapping of the signs is performed in a third step. A comprehensive overview of different approaches for road sign detection and classification is given in Nguwi & Kouzani (2008); the most relevant are documented in the following chapters and at length in Cavegn & Nebiker (2012).

Detection of road signs
In many cases, road sign detection is based on color information. Color segmentation with thresholds allows fast focusing on search regions. As the RGB color space is sensitive to changes of lighting conditions due to shadows, illumination and view geometry as well as strong reflections, segmentation is usually carried out in the HSV color space based on the hue and saturation components (Fleyeh 2006, Maldonado-Bascón et al. 2008). Madeira et al. (2005) use the hue and the chromatic RGB component for color segmentation. In comparison to the chromatic RGB component, the saturation component is very sensitive to noise in case of small values.

Classification of road signs
Road signs are frequently classified by means of neural networks (de la Escalera et al. 2003, Nguwi & Kouzani 2008). Since the algorithms have to be trained based on many images appearing in different scaling, orientation and illumination contexts, they are usually just implemented for a few types such as speed signs (Ren et al. 2009). Another method for the classification process is template matching. This intensity-based image correlation approach is, for example, used by Piccioli et al. (1996) and Malik et al. (2007). In its basic form, it is not robust regarding scaling, rotation or affine transformations in general and is sensitive to illumination changes (Ren et al. 2009).

Further approaches for the detection and classification of road signs
Many approaches are not designed to exclusively detect or classify road signs, but they are able to perform both tasks. A few of them are mentioned in the following.
The Hough transform tolerates gaps and is not very sensitive to noise. However, due to different dimensions and shapes of road signs, many scales have to be considered, which negatively influences the computation time and memory requirements. Therefore, real-time applications need faster modified methods. Chutatape & Guo (1999) proposed a modified version of the Hough transform which is utilized by Kim et al. (2006) for road sign detection following the extraction of edges from image data by means of the Canny operator. Barrile et al. (2007) detect shapes based on the standardized Hough transform. For the classification, they use the generalized Hough transform which is also utilized by Habib et al. (1999) on edges which were extracted with the Canny filter.
The approaches of Support Vector Machines (SVM) and Scale Invariant Feature Transform (SIFT) are increasingly applied to both road sign detection and classification. If the SIFT approach by Lowe (2004) is used, the extracted features are invariant in terms of translation, rotation and scaling as well as insensitive to illumination changes, image noise and small geometric deformations (Reiterer et al. 2009, Ren et al. 2009). Maldonado-Bascón et al. (2007) implemented two types of SVM which enable their algorithms to handle translations, rotations, scaling and mostly partial occlusions.

EXPLOITATION OF DEPTH INFORMATION FROM STEREOVISION GEOMETRY
For the designed and subsequently presented approach aiming at detection, classification and mapping of road signs, the exploitation of depth maps from stereovision imagery is the core element. Although depth information has an enormous potential, earlier and related work on vision-based road sign extraction was primarily focused on utilizing mono imagery. Only Cyganek (2008) incorporated depth data from stereo imagery as an optional contribution for search space reduction in the extraction process. Furthermore, previous investigations in general did not focus on establishing the 3D position of the extracted road signs. Exceptions are Madeira et al. (2005), Kim et al. (2006) and Baró et al. (2009), who determine the absolute 3D object point coordinates based on stereo imagery, as well as Shi et al. (2008), who use a combined approach of image and laser scanning data. While Shi et al. (2008) are able to achieve an accuracy of approximately 30 cm, Madeira et al. (2005) just obtain point coordinates with meter accuracy. However, precise determination of infrastructure objects in all three dimensions in a global geodetic reference system is crucial and has become increasingly important with respect to traffic planning, automated change detection, simulations and visual inspection in mixed reality environments.
For efficient data capturing, a stereovision-based mobile mapping system (MMS) has to be employed (see Figure 2). The generation of depth maps is advantageously based on normalized images. Therefore, the distortion of the collected stereo images has to be corrected and the imagery subsequently transformed into the stereo normal case. Based on the resulting normalized images, the disparity for each pixel is determined by means of a stereo matching algorithm. The stereo geometry allows computing a depth value for each disparity, and all values of an image constitute a depth map. For the investigations described in this paper, dense matching was performed with the semi-global block matching algorithm implemented in OpenCV (OpenCV 2012), which differs in a few points from the SGM algorithm by Hirschmüller (2008) (e.g. computation of matching costs).
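The relation between disparity and depth in the stereo normal case can be illustrated with a minimal Python sketch (the paper's own implementation is in Matlab). It assumes the standard relation Z = f · B / d, with the focal length f in pixels, the stereo base B in meters and the disparity d in pixels. The function name and the focal length of 2000 px are hypothetical; only the base of roughly 0.9 m is taken from the text.

```python
import numpy as np

def depth_from_disparity(disparity_px, focal_px, base_m):
    """Depth Z = f * B / d for the stereo normal case.

    Pixels without a valid (positive) disparity are returned as NaN.
    """
    d = np.asarray(disparity_px, dtype=float)
    depth = np.full(d.shape, np.nan)
    valid = d > 0
    depth[valid] = focal_px * base_m / d[valid]
    return depth

# Hypothetical focal length (2000 px); stereo base of about 0.9 m as in the paper.
disparity = np.array([[0.0, 180.0],
                      [300.0, 450.0]])
depth = depth_from_disparity(disparity, focal_px=2000.0, base_m=0.9)
print(depth)
```

Because depth is inversely proportional to disparity, a fixed matching error in pixels leads to a depth error that grows with distance, which is one reason to restrict the evaluation to a limited distance range.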
For the subsequent automated detection and mapping of road signs, both normalized images and depth maps are required (see Figure 2). The classification process additionally needs templates of all possible road signs. After successful detection, classification and mapping, the regions of interest, the attribute data and the 3D position of the road signs are known.
The developed object extraction algorithms exploit the stereo disparities and the derived depth maps, respectively, for the following tasks:
• Search space reduction using a predefined distance range interval
• Definition of distance-related criteria for the color segments
• Generation of regions with similar depth values (planar segments)
• Computation of 3D coordinates

DEVELOPED ALGORITHMS AND SOFTWARE MODULES
The presented approach, which is based on stereo images and depth maps, was implemented in Matlab with several algorithms and software modules. They cover the whole workflow from the automated detection and classification through to the mapping of road signs (see Figure 3) and are explained below.

Automated detection of road signs
The input to the detection process consists of the left normalized image and the corresponding depth map for each stereo image pair (see Figure 4). Since no permanent road signs are expected to occur in the lower third of the normalized image, this region is colored black. As the hue and saturation components are relatively insensitive to the varying lighting conditions, which are typical of vision-based mobile mapping, the RGB normalized image is transformed into the HSV color space. Afterwards, the depth map is used to restrict the subsequent search space in the imagery by applying a predefined distance range interval. To enable the detection of road signs on an adjacent lane, a base-depth ratio from 0.06 to 0.25 was chosen. In addition, with a high image acquisition frequency, the same road sign can be detected and classified redundantly. The segmentation of red, blue and yellow color segments is carried out using thresholds for the hue and saturation components, which were determined empirically based on images from different measuring campaigns. For blue segments, the hue values have to be between 0.52 and 0.72 and the saturation range is from 0.20 to 0.80. Pixels featuring a hue value between 0.04 and 0.19 as well as a saturation value which is higher than 0.50 and smaller than 0.98 belong to yellow segments. If the area of the color segment corresponds to distance-related criteria, its shape is described by the two features roundness and fill factor. The extents of a segment must match the standardized road sign dimensions within a certain tolerance. Again, the dense depth maps are used in determining the metric heights and widths of segments in object space. The depth maps are also utilized in the detection of planar segments. These are regions with similar depth values. The ratio between the area of the planar segment in the color segment (intersection of Figures 4f and 4j) and the full area of the color segment (Figure 4f) serves as detection indicator, which is used to assess the detection process.
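The thresholding and the detection indicator described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' Matlab code: hue and saturation are assumed to be scaled to [0, 1], all interval bounds are treated as inclusive, and the function and variable names are hypothetical. The blue and yellow thresholds, the base-depth ratio interval and the stereo base of about 0.9 m are taken from the text; thresholds for red are not given in this section and are therefore omitted.

```python
import numpy as np

# Hue/saturation intervals quoted in the text (inclusive bounds are an assumption).
BLUE = {"hue": (0.52, 0.72), "sat": (0.20, 0.80)}
YELLOW = {"hue": (0.04, 0.19), "sat": (0.50, 0.98)}

def distance_range(base_m, ratio_min=0.06, ratio_max=0.25):
    """Convert a base-depth ratio interval into (near, far) distances in meters."""
    return base_m / ratio_max, base_m / ratio_min

def color_segment_mask(hue, sat, depth, thresholds, near_m, far_m):
    """Pixels that satisfy the color thresholds and lie inside the distance range."""
    h_lo, h_hi = thresholds["hue"]
    s_lo, s_hi = thresholds["sat"]
    in_color = (hue >= h_lo) & (hue <= h_hi) & (sat >= s_lo) & (sat <= s_hi)
    in_range = (depth >= near_m) & (depth <= far_m)
    return in_color & in_range

def detection_indicator(color_mask, planar_mask):
    """Share of the color segment area that is also covered by the planar segment."""
    area = color_mask.sum()
    return float((color_mask & planar_mask).sum()) / area if area else 0.0

near, far = distance_range(0.9)  # about 3.6 m to 15 m for a 0.9 m base

hue = np.array([[0.10, 0.10], [0.60, 0.10]])
sat = np.array([[0.70, 0.30], [0.70, 0.70]])
depth = np.array([[8.0, 8.0], [8.0, 20.0]])
yellow = color_segment_mask(hue, sat, depth, YELLOW, near, far)

color = np.array([[True, True], [True, True]])
planar = np.array([[True, True], [True, False]])
indicator = detection_indicator(color, planar)
```

With a stereo base of about 0.9 m, the ratio interval of 0.06 to 0.25 corresponds to distances of roughly 3.6 m to 15 m, which agrees with the 4 to 14 m range used in the tests.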

Automated classification and mapping of road signs
The classification process for a detected road sign is performed using cross-correlation-based template matching with predefined reference templates. Since the hierarchical classification approach uses the properties color and shape to considerably reduce the candidate set, not all road signs have to be tested. Road signs which do not exist on the captured roads can also be excluded from the classification process.
The template with the highest normalized cross-correlation coefficient within the search image is determined. This value also serves as classification indicator. If it exceeds a predefined threshold, the classification is considered successful.
Candidates with a detection or classification indicator below the respective predefined threshold are assigned to a list of uncertain objects for a subsequent user-controlled verification and (re-)classification.
The dimensions of the search image are defined by the road sign dimensions in the normalized image plus a margin on each side (e.g. 10 pixels). The template is scaled to the dimensions of the color segment within the search image. The correlation is computed based on the channel which empirically showed the highest similarity between a real road sign image and the corresponding synthetic template. This is the blue channel for red road signs, the red channel for blue signs and the saturation component for yellow road signs. Depending on the image quality, pixels of the search image which are white in the template often have too low a gray value. Hence, to improve the matching results, all pixel values are set to white (maximal value) or black (zero). The required threshold is computed dynamically based on the gray value distribution of the search image. Further details can be found in Cavegn & Nebiker (2012).
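A minimal Python sketch of binarization followed by normalized cross-correlation is given below for illustration. Several simplifications are assumptions rather than the paper's method: the binarization threshold is simply the mean gray value (the dynamic threshold actually used is described in Cavegn & Nebiker 2012), the template is compared at a single position instead of being shifted within the search image, and the acceptance threshold of 0.5 as well as all names are hypothetical.

```python
import numpy as np

def binarize(gray, threshold=None):
    """Set every pixel to 1.0 (white) or 0.0 (black).

    The default threshold (mean gray value) is an assumption for this sketch.
    """
    g = np.asarray(gray, dtype=float)
    t = g.mean() if threshold is None else threshold
    return (g > t).astype(float)

def ncc(a, b):
    """Normalized cross-correlation coefficient of two equally sized images."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def classify(patch, templates, min_coeff=0.5):
    """Return (best template id, coefficient, accepted flag)."""
    binary = binarize(patch)
    scores = {tid: ncc(binary, tpl) for tid, tpl in templates.items()}
    best = max(scores, key=scores.get)
    return best, scores[best], scores[best] > min_coeff

# Two synthetic 4x4 templates: left half white vs. top half white.
left = np.zeros((4, 4)); left[:, :2] = 1.0
top = np.zeros((4, 4)); top[:2, :] = 1.0
templates = {"left": left, "top": top}

# A gray value patch that resembles the "left" template.
patch = np.full((4, 4), 40.0); patch[:, :2] = 200.0
best_id, coeff, accepted = classify(patch, templates)
```

In a fuller implementation, the coefficient would be evaluated for every template position inside the search image (including the margin), and the maximum over all positions and candidate templates would serve as the classification indicator.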
Once a road sign has been detected and classified, the 3D object coordinates of the sign are determined. For the computation of the model coordinates, the image coordinates of the sign's center of gravity, the corresponding depth value as well as the parameters of the interior orientation are needed.
The subsequent transformation to the desired geodetic reference system requires that the exterior orientation parameters of the left normalized image are known.
The detection, classification and mapping processes automatically yield a number of attribute data such as the 3D coordinates, the template number and the standardized side lengths. They can further be used for creating or updating a GIS database. This is essential because, even in highly developed countries, GIS-based digital road sign inventories either do not yet exist at all or were derived from analogue maps and are normally not up to date.
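The two steps, model coordinates from image coordinates and depth, followed by the transformation with the exterior orientation, can be sketched in Python as below. The sketch assumes a distortion-free pinhole model for the normalized image, with focal length and principal point in pixels, and an exterior orientation given as a rotation matrix R and a translation vector t. The axis conventions, all names and all numeric values are hypothetical.

```python
import numpy as np

def model_coordinates(u, v, depth_m, focal_px, cx, cy):
    """Back-project pixel (u, v) with depth Z into the camera (model) frame."""
    x = (u - cx) * depth_m / focal_px
    y = (v - cy) * depth_m / focal_px
    return np.array([x, y, depth_m])

def to_reference_system(p_model, R, t):
    """Apply the exterior orientation: X = R @ p_model + t."""
    return R @ p_model + t

# Hypothetical interior orientation and center of gravity of a detected sign.
p_model = model_coordinates(u=1160.0, v=440.0, depth_m=10.0,
                            focal_px=2000.0, cx=960.0, cy=540.0)

# Hypothetical exterior orientation (identity rotation, pure translation).
R = np.eye(3)
t = np.array([100.0, 200.0, 300.0])
p_world = to_reference_system(p_model, R, t)
```

In practice, R and t would come from the direct georeferencing of the left normalized image, and the depth value would be read from the depth map at the sign's center of gravity.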

INVESTIGATIONS AND RESULTS
The implemented algorithms were evaluated based on two field test campaigns in the city of Muttenz near Basel with the stereovision-based mobile mapping system by the FHNW Institute of Geomatics Engineering (IVGI). Currently, this MMS features two pairs of stereo systems, each with a stereo base of approximately 90 cm, and with industry cameras at different geometric resolutions (Full HD and 11MP). Direct georeferencing of the stereo imagery is provided by an entry-level GNSS/IMU system in combination with a distance measuring indicator. Earlier empirical tests of the IVGI MMS in multiple test campaigns demonstrate accuracies in object coordinate space for well-defined points of 3-4 cm in the along-track and cross-track dimensions and 2-3 cm in the vertical dimension, given good GNSS availability (Burkhard et al. 2012).
The first test campaign was carried out in winter time (November 2010) with difficult to poor lighting conditions, the second in summer (July 2011) in sunny conditions. In both cases, about 2500 stereo image pairs were captured on residential roads at five frames per second and at a driving speed of approximately 40 km/h, resulting in about one Full HD stereo frame every two meters. For the subsequent evaluation of the detection and classification quality, all relevant road signs with predominantly red, blue and yellow colors were identified. These relevant signs were all road signs adjacent to the driving lane on the right-hand side facing the driver, i.e. with a road sign plane roughly perpendicular to the road axis, thus covering the vast majority of road signs. The group of relevant road signs also did not include road signs for cross-roads which were not facing the mapped roads. The designed algorithms were applied in the distance range from 4 to 14 m. The first test with winter imagery yielded an automatic detection of 89% of all relevant road signs and a correct automatic classification of 82% (see Table 1). Based on the summer imagery, 91% of all road signs could automatically be detected, and 89% could be classified correctly. With an additional user-supported step, the classification accuracy could be increased by another 5%. Due to this user-supported approach and some further built-in constraints, there were hardly any false positives.
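The acquisition figures quoted above can be checked with a short calculation, shown here as a Python sketch. The function name is hypothetical, and the estimate of how many frames show a sign inside the evaluated distance range is a derived illustration that assumes constant speed; it is not a figure reported in the paper.

```python
def frame_spacing_m(speed_kmh, frames_per_second):
    """Distance travelled between two consecutive frames, in meters."""
    return speed_kmh / 3.6 / frames_per_second

# 40 km/h at five frames per second gives about 2.2 m between frames.
spacing = frame_spacing_m(40.0, 5.0)

# Number of frame intervals while a sign passes through the 4 to 14 m range.
frames_in_range = (14.0 - 4.0) / spacing
print(round(spacing, 2), round(frames_in_range, 1))
```

Under these assumptions, a sign lies within the evaluated distance range in about four to five consecutive stereo frames, which is consistent with the remark that the same road sign can be detected and classified redundantly.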
There are different reasons for an incorrect detection or classification of road signs. The detection process yields many red segments close to construction areas due to safety fences and warning devices. If the areas of these color segments are not too big, they can lead to false positives. Although the road signs in Switzerland generally appear in good condition, a few of them are yellowed. Thus, there are very low values for the saturation component. The same is also the case for road signs which are located in shadows. Since the defined threshold for this component cannot be exceeded, the detection of such road signs is not possible. In addition, there are some difficulties in automatically detecting road signs if the depth maps are poor or incomplete. For several road signs, no predefined template exists, which leads to no classification or a wrong one. A suboptimal threshold for the search image binarization can cause a correlation coefficient that is too low.
For the evaluation of the geometric accuracy, 3D positions for 22 reference road signs were determined using precise tachymetric observations. For the first test campaign, the differences between the 3D positions which were automatically derived by the described algorithms and the reference positions were computed. The maximum residual for a component is 16 cm; however, most differences are in the range of 5 cm (see Table 2). For the empirical standard deviation of the 3D position difference, a value of 9.5 cm was calculated.

CONCLUSIONS AND OUTLOOK
The investigations demonstrate the potential, in terms of automation and accuracy, offered by stereovision-based mobile mapping, if dense depth information is exploited. Approximately 90% of the relevant road signs with predominantly red, blue and yellow colors in Switzerland can be detected, and 85% can be classified correctly. By means of a user-supported approach (Cavegn & Nebiker 2012), these rates can be increased by another 5%. Therefore, only 5 to 10% of the road signs have to be digitized either interactively in the stereo imagery or on site. Moreover, due to various constraints built into the algorithms, there are hardly any false positives.
The presented approach is robust in terms of scaling, translations and small rotations. Although better results are expected for nearby road signs, signs can be detected in the whole predefined distance range interval. Road signs can be positioned arbitrarily in the image and small rotations are tolerated. Furthermore, it is possible to detect multiple road signs in the same image appearing in the shapes circle, rectangle, square, triangle and diamond.
Not only depth maps of good quality but also sufficient color segmentation is crucial for the detection success. For this purpose, appropriate thresholds have to be applied. For the presented investigations, the interval for each component was chosen to be quite large. However, this was only possible since the search space could significantly be reduced due to depth information and false positives could be rejected using certain built-in constraints.
Since a detection and classification quality of 100% is unlikely, it is possible to overlay the automatically mapped road signs in a georeferenced 3D video. The 3D videos can be viewed with a stereovision client (e.g. Burkhard et al. 2011), the results visually verified and the missing road signs quickly digitized.
In the future, a first implementation of the algorithms for white and gray road signs, which uses the depth information in combination with the Hough transform (Cavegn & Nebiker 2012), will be improved further. The detection of other complex road signs and the identification of text (Wu et al. 2005) are also planned. An increase of the geometric accuracy and reliability could be achieved by matching in stereo image sequences (Huber et al. 2011). Tracking of road signs over multiple stereo image pairs would particularly enhance the semantic quality.
The goal of related work in progress is to determine the impact of different camera resolutions on the detection and classification quality. First investigations with a stereo system composed of industry cameras with a higher resolution of eleven megapixels show a slight improvement of the results. For the identification of text, the higher geometric resolution is mandatory. Current investigations also show that the depth map quality can significantly be increased using both image sensors with a higher resolution and adequate radiometric adjustments, which again positively affect the automated road sign mapping.

Figure 1. Road signs in Switzerland which can automatically be detected, classified and mapped with the developed algorithms

Figure 2. Input and output data for the automated detection, classification and 3D mapping of road signs

Figure 3. Developed algorithms for the automated detection, classification and 3D mapping of chromatic road signs (gray fields: operations exploiting the disparity and depth information respectively)

Figure 4. Automated detection of a road sign with predominantly yellow colors (a: left normalized image, b: hue component, c: saturation component, d: distance reduced hue component, e: distance reduced saturation component, f: yellow color segments, g: depth map, h: distance reduced depth map, i: planar segments, j: planar segments after morphological operations)

Table 1. Detection and classification quality of the developed algorithms for two test campaigns