AUTOMATIC OBJECT SEGMENTATION TO SUPPORT CRISIS MANAGEMENT OF LARGE-SCALE EVENTS

S. M. Azimi1,∗, R. Kiefl2, V. Gstaiger1, R. Bahmanyar1, N. Merkle1, C. Henry1, D. Rosenbaum1, F. Kurz1
1 Remote Sensing Technology Institute, German Aerospace Center (DLR), Oberpfaffenhofen, Germany
{seyedmajid.azimi; veronika.gstaiger; reza.bahmanya; nina.merkle; corentin.henry; dominik.rosenbaum; franz.kurz}@dlr.de
2 German Remote Sensing Data Center, German Aerospace Center (DLR), Oberpfaffenhofen, Germany
ralph.kiefl@dlr.de
Commission II, WG 6
∗ Corresponding author


INTRODUCTION
Large-scale events with widely distributed parking and camping areas represent a particular challenge for event and crisis management and require extensive preparation and constant monitoring to guarantee the safety of participants. Injuries and deaths occur repeatedly at large gatherings of people, and for years research has been conducted into the causes of accidents and ways of avoiding them in order to make large events safer (Fruin, 1993; Helbing et al., 2000). In order to prevent dangerous or damaging situations at large events and to be able to act quickly and effectively in an emergency, decision-makers need spatially referenced information for a situation picture that is as close to reality as possible during the event. Due to the increasing availability of high-resolution remote sensing data and the growing awareness of the possibility of deriving area-wide information from it, such data is increasingly being integrated into disaster management procedures (Aina, Bello, 2014; Römer et al., 2016). In the event of an emergency, it must be ensured that rescue routes are wide enough and, above all, free of any objects that would obstruct the passage of the emergency services, and that participants can leave the site at any time. An important aspect is therefore information on both the number of event participants and their distribution on the event site. In general, this information is also essential for the installation of infrastructure such as waste disposal or the supply of food and drinking water.
Figure 1. Illustration of a sample result from the DLR-AerialTent dataset with pixel- and instance-wise segmentation using aerial imagery of a music festival in 2013 in Germany with 9 cm/px GSD. The red outlined area represents a sample from the test set.

There are existing works on crowd analysis and the measurement of crowd density on festival sites (Meynberg et al., 2016; Bahmanyar et al., 2019); however, the situation at night has not yet been investigated. In order to estimate the distribution and number of people on a festival site as accurately as possible, we propose to analyze the sleeping facilities on the different sites. In particular, the type and size of tents, vehicles, and similar objects have to be determined, and a distinction according to their function has to be made, such as between objects with and without a sleeping function. In order to make this information available in a timely manner, an automated situation assessment is required, as a manual evaluation of larger areas would be too time-consuming. In recent years, an end-to-end monitoring system has been developed, improved, and tested under real-world conditions, and was successfully demonstrated at several large-scale events (Römer et al., 2016). It aims to support the management of events and the authorities in charge of security and rescue efforts by recording and providing optical aerial imagery and relevant derived information. Examples include overviews of the current traffic situation and the occupancy of parking and camping areas. This system consists of a chain of loosely coupled components.
It includes an optical camera system (Kurz et al., 2014), software and hardware for pre-processing and analysing data on board, a downlink for data transmission in near real time, additional ground-based components for information extraction (Römer et al., 2014; Kersten, 2014), as well as modules for the provision and interactive visualisation of situational information based on web services (Römer et al., 2016). To prepare future advancements of the image analysis components of such a processing chain, this study focuses on the detection and feature-based classification of vehicles, tents, and similar objects.

RELATED WORKS
In recent years, deep learning methods have shown promising object detection and instance-wise segmentation results on ground imagery and have outperformed traditional methods. This enhanced performance owes its rapid progress to a large extent to large-scale datasets such as ImageNet (Deng et al., 2009), Pascal VOC (Everingham et al., 2010), and MS-COCO (Lin et al., 2014). For aerial imagery, however, similar datasets are scarce, which has slowed down the development of such methods. Furthermore, the existing aerial image datasets for semantic segmentation are either limited to a few individual classes, such as roads and building boundaries in the INRIA (Maggiori et al., 2017), Massachusetts (Mnih, 2013), SpaceNet (Van Etten et al., 2018), and DeepGlobe (Demir et al., 2018) datasets, or provide very coarse classes, as in the ISPRS Vaihingen and Potsdam (Cramer, 2010) datasets. For object detection and instance-wise segmentation, on the other hand, multi-class object detection plays a major role in remote sensing applications, and several datasets are publicly available for these tasks. Example aerial image datasets in this area are iSAID (Waqas Zamir et al., 2019), DOTA (Xia et al., 2017), TAS (Heitz, Koller, 2008), VEDAI (Razakarivony, Jurie, 2016), COWC (Mundhenk et al., 2016), DLR-3K-Munich-Vehicle (Liu, Mattyus, 2015), and UCAS-AOD (Zhu et al., 2015). These datasets were generated either for general purposes or for particular applications. However, to the best of our knowledge, none of them tackles tent classification at large events with campsites. To address this limitation, we propose a new aerial image dataset with detailed annotations, the so-called "DLR-AerialTent" (see Figure 2).
To investigate the feasibility of instance-wise segmentation for function-based tent classification, we apply, among others, a well-established variant of the Region-based Convolutional Neural Network (RCNN) algorithm (Girshick et al., 2014), the so-called Mask-RCNN, as our baseline. Among the RCNN variants, Fast-RCNN (Girshick, 2015) improves the detection performance of RCNN by minimizing the region proposal regression and classification losses simultaneously. Faster-RCNN (Ren et al., 2015) improves the localization accuracy of Fast-RCNN by deploying a region proposal network (RPN) for learning the region proposals. Faster-RCNN can be further improved by multi-scale training and testing to learn feature maps at multiple levels; however, this increases the memory usage and the inference time. Alternatively, image pyramids or Feature Pyramid Networks (FPNs) (Pinheiro et al., 2016; Honari et al., 2016; Ghiasi, Fowlkes, 2016; Newell et al., 2016; Lin et al., 2017) can be utilized to improve the performance at different scales at a marginal extra cost. Rotated region proposals (Liu et al., 2017) improve the localization in oriented bounding box (OBB) tasks by predicting object orientations using the single shot detector (SSD) (Liu et al., 2016). For instance-wise segmentation, a method has been proposed which applies adaptive weighted pooling and discriminative Region of Interest (RoI) pooling in a two-stage process together with an RPN (Cao et al., 2020). In addition, ISDNet (Garg et al., 2020) applies the atrous spatial pyramid pooling (ASPP) module from the DeepLabv3+ (Chen et al., 2018) algorithm in the Mask-RCNN and Cascaded-RCNN manner. In this paper, we provide a new aerial dataset for instance-wise segmentation with highly accurate annotations and fine-grained classes for camp-relevant objects to promote the development of models for previously unsupported tasks, such as accommodation-wise event monitoring. Additionally, we carry out first evaluations of one of the well-established instance-wise segmentation algorithms.

DATASET
This study is based on true-color aerial images taken over a festival in Germany in early August 2013 and 2016. The images were acquired by a camera-array sensor system mounted on a helicopter, which provides high flexibility for airborne monitoring and is usually available to rescue- and security-related authorities and organizations (Kurz et al., 2014). The images cover an area of 3.44 km² and were acquired at a flight height of around 1000 m above ground, which results in a ground sampling distance of 9 cm and 10 cm, respectively. Note that a part of the aerial images acquired in 2013 was already described in (Römer et al., 2016). We prepared a dataset called "DLR-AerialTent" with images from the years 2013 and 2016 and split it into training and test sets as shown in Figure 2. It is composed of the following 10 semantic classes: 1) tent (with sleeping function), 2) small vehicle/transporter, 3) trailer, 4) truck/bus, 5) camper/caravan, 6) pavilion/large tent (assembly and supply function), 7) awning/tarpaulin, 8) inflatable pool, 9) infrastructure, and 10) other objects ("clutter"). This classification is based on experience with large events gained over the past 10 years. It takes into account the most common and, for our research question, most important classes of objects found in parking and camping areas at festivals and similar large-scale events in Germany, and should be considered a first proposal for such a dataset. Figure 3 shows some samples of the different classes, and Table 1 provides an overview of the classes and the number of instances contained in each class.
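For annotation handling, the 10 classes above can be encoded as a category mapping in MS-COCO annotation style. The sketch below is purely illustrative: the integer IDs and machine-readable names are our own assumptions, not part of the dataset specification.

```python
# Hypothetical category mapping for the DLR-AerialTent classes.
# IDs and names are illustrative; the released annotations may differ.
CATEGORIES = [
    {"id": 1, "name": "tent_sleep_function"},
    {"id": 2, "name": "small_vehicle_transporter"},
    {"id": 3, "name": "trailer"},
    {"id": 4, "name": "truck_bus"},
    {"id": 5, "name": "camper_caravan"},
    {"id": 6, "name": "pavilion_large_tent"},
    {"id": 7, "name": "awning_tarpaulin"},
    {"id": 8, "name": "inflatable_pool"},
    {"id": 9, "name": "infrastructure"},
    {"id": 10, "name": "clutter"},
]

def id_to_name(cat_id):
    """Resolve a category ID to its class name."""
    return next(c["name"] for c in CATEGORIES if c["id"] == cat_id)
```

Such a mapping is what a COCO-style training pipeline would consume together with the per-instance polygon annotations.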

METHOD
At the beginning of this research, we wanted to find out whether it is possible to detect and distinguish tents based on their function. For this reason, we applied pixel-wise semantic segmentation to a small dataset and focused on identifying and localizing tents with a sleeping function, pavilions/large tents, and vehicles. First, we annotated a part of the aerial image of 2013 to serve as the training set, and then tested our method on the rest of the image. We used a two-stream pixel-wise semantic segmentation algorithm which handles large- and small-scale objects by combining shallow features from high spatial resolution inputs with rich features from low spatial resolution inputs, as described in (d'Angelo et al., 2019). The first segmentation results are visible in the left red outlined sample in Figure 1. After achieving these promising results, we set our goal to identify further types of tents, vehicles, and other artificial structures such as infrastructure elements.
In order to localize objects more accurately and to be able to count each object instance, each object of interest has to be identified separately, regardless of whether it shares a border with another object of the same class. Therefore, we decided to analyse the images using an instance-wise segmentation approach. We chose Mask-RCNN as the baseline, a well-established deep neural network designed to solve instance segmentation problems in computer vision. Specifically, it separates different object instances in an image by providing object bounding boxes, classes, and masks through three heads. Mask-RCNN is an extension of Faster-RCNN for instance-wise segmentation. As in Faster-RCNN, there are two stages in Mask-RCNN: first, it generates region proposals for possibly existing object regions, and second, it predicts the object class, refines its bounding box, and generates a polygon mask at the pixel level. Both stages are added downstream of the backbone network, which extracts high-level features and can be either single-scale or multi-scale. In other words, to adapt Faster-RCNN to instance-wise segmentation, Mask-RCNN contains two heads, one for bounding-box object detection and another for instance mask segmentation, which are trained end-to-end.
In contrast to the majority of recent systems, where classification depends on mask predictions, Mask-RCNN outputs a single binary mask for each RoI. During training, a multi-task loss is applied to each selected RoI:

L = L_cls + L_box + L_mask.

The classification loss L_cls and the bounding-box loss L_box are identical to those of Faster-RCNN. Considering K classes in total, the mask branch yields a K·m²-dimensional output per RoI, encoding K binary masks of m × m resolution. A per-pixel sigmoid is applied to this output, and L_mask is defined as the average binary cross-entropy loss. Therefore, for a sampled RoI associated with ground-truth class k, L_mask is evaluated only on the k-th mask; the other mask outputs do not affect the loss. With a multi-scale backbone such as an FPN, the algorithm comprises several sub-modules: the FPN, the RPN, the region of interest (RoI) head, non-maximum suppression (NMS), and the mask head. In the RPN module, we minimize the multi-task loss

L({p_i}, {t_i}) = (1/N_obj) Σ_i L_obj(p_i, p*_i) + λ (1/N_reg) Σ_i p*_i L_reg(t_i, t*_i),

where, for an anchor i in a mini-batch, p_i is the predicted probability of an object being present and p*_i is the ground-truth binary label. For the classification (object/not-object), the log-loss L_obj(p_i, p*_i) = −p*_i log p_i is applied, while the smooth L1 loss is employed for the bounding-box regression. The regression targets are parameterized as

t_{x,i} = (x_i − x_{i,a}) / w_a,   t*_{x,i} = (x*_i − x_{i,a}) / w_a,

where x_i, x_{i,a}, and x*_i denote the predicted, anchor, and ground-truth coordinates, respectively (the same applies to y), and w_a and h_a are the anchor width and height. N_obj and N_reg are normalizing hyper-parameters (the mini-batch size and the number of anchor locations, respectively), and λ denotes the balancing hyper-parameter between the two loss terms, which is set to 10.
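The mask loss described above can be sketched as follows. This is a minimal NumPy illustration of the idea, not the Detectron implementation: a per-pixel sigmoid is applied to the K mask outputs of one RoI, and the binary cross-entropy is averaged only over the mask of the ground-truth class k.

```python
import numpy as np

def sigmoid(x):
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

def mask_loss(mask_logits, gt_mask, k):
    """Average binary cross-entropy on the k-th predicted mask.

    mask_logits: (K, m, m) raw mask-head outputs for one RoI
    gt_mask:     (m, m) binary ground-truth mask
    k:           ground-truth class index; the other K-1 mask
                 outputs do not contribute to the loss
    """
    p = sigmoid(mask_logits[k])
    eps = 1e-12  # numerical stability for log(0)
    bce = -(gt_mask * np.log(p + eps) + (1 - gt_mask) * np.log(1 - p + eps))
    return bce.mean()
```

With all-zero logits the sigmoid yields 0.5 everywhere, so the loss equals -log(0.5) ≈ 0.693 regardless of the ground truth, which is a useful sanity check for an untrained mask head.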
In the RoI head, each selected region proposal is simultaneously regressed and classified using the multi-task loss

L(p, u, t^u, v) = L_cls(p, u) + λ [u ≥ 1] L_loc-HBB(t^u, v),

where u is the true class and p is the discrete probability distribution over the predicted classes, defined over K + 1 categories as p = (p_0, ..., p_K) with index 0 reserved for the background; the classification loss is the log-loss L_cls(p, u) = −log p_u. In contrast to Faster-RCNN, Mask-RCNN uses RoIAlign instead of RoIPool to improve the localization performance of each RoI. The localization loss L_loc-HBB(t^u, v) for horizontal bounding boxes (HBB) is defined similarly to L_reg, where the coordinates {x_min, y_min, w, h} (the upper-left corner, width, and height) of the prediction t^u and the ground truth v are computed.
When an object is classified as background (u = 0), the Iverson bracket [u ≥ 1] evaluates to zero and the offset regression is ignored. The balancing hyper-parameter λ is set to 1 in this case. The same region proposal is fed to the mask head, which outputs the boundary mask for the object inside the region proposal. This mask is accepted as the final output if the region proposal is assigned to a class other than background. To obtain the final detections, we deploy NMS as the final post-processing step, in which the overlaps among detections are computed in order to keep the best-localized region and to omit redundant regions.
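The NMS post-processing can be sketched as a greedy procedure. The snippet below is a simplified single-class version for illustration; Detectron's actual implementation differs in details such as batching and class-wise handling.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, format (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    order = np.argsort(scores)[::-1]  # highest score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_thresh]  # drop redundant detections
    return keep
```

For example, two heavily overlapping detections of the same tent collapse into the higher-scoring one, while a detection elsewhere on the campsite is kept.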

EXPERIMENTAL SETUP
We carried out the experiments using two Titan XP GPUs and the Detectron framework based on Caffe2. We trained the algorithms for 5000, 10000, 20000, and 30000 iterations, denoted in the result tables as 1x, 2x, 3x, and 4x. For the training, we used a learning rate of 0.02 with a step schedule that decays the learning rate by a factor (gamma) of 0.1 at 60% and 80% of the total iterations. As backbone networks, we used ResNet-50, ResNet-101 (He et al., 2016), and ResNeXt-101 (Xie et al., 2017). The ResNeXt backbones were trained with cardinalities of 32 and 64 and bottleneck widths of 8d and 4d, respectively. In addition, the features of the last convolution layer of the 4th stage of the backbones (C4) as well as the FPN features are used as inputs for the three heads. Using an FPN after the backbone network allows images to be processed at multiple feature scales, which should significantly improve the performance on small objects, as they are usually lost in the output of high-level features.
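Our reading of the schedule above (base learning rate 0.02, decayed by gamma = 0.1 at 60% and 80% of the total iterations) can be sketched as a small helper; the function name and signature are our own:

```python
def learning_rate(it, total_iters, base_lr=0.02, gamma=0.1,
                  milestones=(0.6, 0.8)):
    """Step learning-rate schedule: multiply by `gamma` at each
    milestone, given as a fraction of the total iterations."""
    lr = base_lr
    for m in milestones:
        if it >= m * total_iters:
            lr *= gamma
    return lr
```

For a 1x run of 5000 iterations, the learning rate would thus drop to 0.002 at iteration 3000 and to 0.0002 at iteration 4000.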
The mask head resolution of Mask-RCNN is 28 × 28, and RoIAlign is used for aligning the region proposals. The RoI batch size for each image is 512, and the image-wise batch size is 1.

We employ mean Average Precision (mAP) as the evaluation metric, similar to the evaluation of the MS-COCO dataset. For bounding-box and segmentation-mask detections, APs are computed based on the Intersection over Union (IoU) at intersection rates of 50%, 75%, and 95%. Furthermore, since the dataset is heavily skewed and unbalanced, we also calculate the instance-weighted mAP (mAP_insW). Table 2 and Table 3 present a baseline comparison for the instance-wise segmentation task. According to the results, ResNeXt-101 with cardinality = 32 and bottleneck width = 8d after 3x training with augmentation in the training and test sets outperforms the other configurations. It achieves a mAP of 36.5% on the instance-wise segmentation task. This configuration also achieves the best mAP_insW (54.04%) with 4x training. Moreover, according to Table 2, almost all configurations perform poorly on the infrastructure, inflatable pool, and truck/bus classes. This could be expected due to the small number of available samples for these classes (see Table 1). The results show that more training iterations improve the performance for the inflatable pool class; however, they decrease the performance for the other classes due to overfitting.

Figure 5. Samples of visual outputs for mask segmentation and confusion in the DLR-AerialTent test set. Color codes for the first and third rows: tent (sleep function), small vehicle/transporter, trailer, truck/bus, camper/caravan, pavilion/large tent (assembly and supply function), awning/tarpaulin, inflatable pool, infrastructure, other objects ("clutter"). Confusion color codes for the second and fourth rows: true positive, false positive (wrong class), and false negative (object not detected).
They also show that, despite their large diversity, tents with a sleeping function can be distinguished with high accuracy from similar object classes such as large tents, pavilions, awnings, tarpaulins, and sun sails. In addition, camping vehicles with a sleeping function can be distinguished from the other vehicle classes with a relatively high level of confidence. In order to better analyse the correlation between performance and object size, Table 3 reports the average precision for large (AP^Box_l), medium (AP^Box_m), and small (AP^Box_s) objects. According to the results, small objects are harder to detect and segment than larger ones. This is due to their smaller number of samples in our dataset as well as their complex features, which resemble those of the other classes. This can be confirmed by analysing the false positives in Figure 4, which shows the performance of the best Mask-RCNN configuration on the tent and vehicle categories of the DLR-AerialTent test set. The diagrams on the left show the cumulative fraction of detections that were classified correctly (Cor) or represent false positives due to poor localization (Loc), confusion with similar (Sim) or other (Oth) categories, or confusion with the background (BG). The solid red line indicates how the recall changes under the strong criterion of 0.5 (Jaccard overlap) as the number of detections increases; the dashed red line reflects the weak criterion of 0.1 (Jaccard overlap). The diagrams on the right indicate the distribution of the top-ranked false-positive factors.
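The instance-weighted mAP mentioned above can be computed as a weighted average of per-class APs, with each class weighted by its number of instances. This is our interpretation of mAP_insW for a skewed dataset; the exact definition used for the tables may differ.

```python
def instance_weighted_map(ap_per_class, instances_per_class):
    """Weighted mean of per-class APs, where the weight of each
    class is its share of the total instance count."""
    total = sum(instances_per_class.values())
    return sum(ap_per_class[c] * n / total
               for c, n in instances_per_class.items())
```

Under this weighting, rare classes such as inflatable pool contribute little to the aggregate score, which is why mAP_insW can be considerably higher than the plain mAP on an unbalanced dataset.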

RESULTS AND DISCUSSION
We have carried out such an analysis for all classes; however, for the sake of space, we merge the tent, pavilion, large-tent, and awning classes as well as the small-vehicle, caravan, camper, trailer, and truck/bus classes. In both cases, similarity and confusion with objects from other classes can be considered the main cause of the false positives. Figure 5 also shows some examples of the visual output of the mask segmentation on the DLR-AerialTent test set.

CONCLUSION AND FUTURE WORKS
In this paper, we present a proof of concept showing that it is feasible to distinguish tents on campsites based on their function. We introduce the first dataset for this application, which we use to train the instance-wise segmentation algorithm Mask-RCNN in multiple configurations. The results are promising for the most important categories, despite low performance for a few classes. From an operational point of view, the results of this study can support future developments and improve monitoring systems for area occupancy and the passability of rescue routes during large-scale events. With the help of the object classes, the number of people and their distribution can be estimated by assigning specific, empirically determined values to the classes. This step, as well as the evaluation of the results, will follow this study. Additionally, we will investigate more recent network architectures and work on developing dedicated algorithms for this task to achieve better performance. An extension to the analysis of data from temporary refugee camps, as well as the use of satellite data, is being considered.