EVALUATING SECTOR RING HISTOGRAM OF ORIENTED GRADIENTS FILTER IN LOCATING HUMANS WITHIN UAV IMAGES

ABSTRACT

Developing systems that locate injured people quickly after natural disasters is an important research topic, and in recent years special attention has been paid to the use of UAV images for this purpose. Such a system requires an accurate and robust feature. The Sector Ring Histogram of Oriented Gradients (SRHOG) has been shown to be a feature largely invariant to rotation and scale. The aim of this paper is to evaluate the performance of a human detection algorithm based on this feature. The experiments carried out suggest that, using the SRHOG feature, humans can be detected with an accuracy of 73.69%. However, despite this good accuracy, the SRHOG results contain more than 33.33% false labels.

The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B2-2020, 2020 XXIV ISPRS Congress (2020 edition). This contribution has been peer-reviewed. https://doi.org/10.5194/isprs-archives-XLIII-B2-2020-23-2020 | © Authors 2020. CC BY 4.0 License.


INTRODUCTION
The use of unmanned aerial vehicles (UAVs) in various subject areas and applications has increased dramatically in recent years. It is of great interest to users and scientists to develop systems for the rapid identification of injured persons following natural disasters in UAV images (Jingxuan et al., 2016). Unfortunately, in addition to the usual challenges of human sensing, the use of a UAV as an imaging platform introduces certain other problems (Blondel, 2013). Because the pictures are taken from above, individuals can look quite different from how they appear in ground-level images. Moreover, an injured person may be partially covered by snow, rock and the like (Liu, 2017).
Work on human detection using photos has concentrated mostly on pedestrian detection (Mihçioğlu et al., 2019; ZhangSr et al., 2019; Karg et al., 2020), the study of human movement (Bahri et al., 2019) and facial recognition (Ding et al., 2019; Prasad et al., 2020). Benenson et al. (2014) compared over 40 methods and concluded that the main challenge ahead seems to be developing a deeper understanding of what makes good features good, so as to enable the design of even better ones. Therefore, the principal task is to identify a feature that can describe the presence of the human body. To build such features, information such as texture (Ojala et al., 2002; Leibe et al., 2005), colour (Ott et al., 2009; Walk et al., 2010) and edges (Nguyen et al., 2009) is often extracted. For instance, Leibe et al. use texture information to identify pedestrians in a crowded scene (Leibe et al., 2005). For this, so-called Haar wavelets (Dollár et al., 2008) are used, which are grey-level patterns computed from the magnitude of the difference between neighbouring pixel intensities. Another example is the colour self-similarity (CSS) feature (Walk et al., 2010), which is defined using the histogram of colour tones present in different parts of an image. In general, techniques based on texture or colour features depend highly on the pixel values, and thus the image background may disturb their outcomes. To this end, they are usually used along with background subtraction or motion analysis techniques (Cutler et al., 1998).
To capture the shape of an object, edge features are of great use, as objects can be well represented through their edges (Nguyen et al., 2011). In contrast to colour and texture, edge features describe objects mainly based on their geometry.
Histograms of oriented gradients (HOG) (Dalal et al., 2005) combined with the support vector machine (SVM) (Cortes et al., 1995) has received great attention and has been applied extensively to human detection since it was proposed in 2005. Similar to Edge Orientation Histograms (EOH) (Gerónimo et al., 2007) and the Scale-Invariant Feature Transform (SIFT) (Lowe et al., 2004), HOG concentrates on the gradient information of the image, but differs in that it employs a dense grid of uniformly spaced cells and overlapping local contrast normalization to strengthen robustness to illumination and shadow. UAVs move in a 3D world: a drone's camera undergoes rolling, pitching, heading changes or a combination of all, and this makes detection more complex, so the feature should be rotation invariant. SRHOG (Liu et al., 2017), inspired by HOG (Dalal et al., 2005), utilizes a dynamically defined polar coordinate system (Figure 2) to calculate the gradients via the Approximate Radial Gradient Transformation (ARGT) (Takacs et al., 2013); it is a rotation-invariant feature which can handle in-plane rotation.
Using the Radial Gradient Transform (RGT) (Takacs et al., 2013) coordinate system, which varies with the pixel position, instead of the fixed global (X, Y) system to describe the pixel's gradient makes the feature vector resistant to image rotations. The orthogonal basis of the local frame consists of the radial and tangential unit vectors at the pixel p relative to the detection window centre C. The components of the gradient (Gr, Gt) are described in the directions r and t. Therefore, when the image is rotated, this local coordinate system rotates with it and the resulting components do not change, which makes the feature vector resistant to image rotations.

Figure 2. Definition of the local coordinate system for each search window in the SRHOG method

As shown in Figure 2, the gradient orientation is redefined as the angle β between the gradient vector and the local basis vector. After an image rotation, the new gradient orientation β′ is still equal to β, which guarantees the rotation invariance of the statistic at each pixel.
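The projection of a pixel's global gradient onto this local radial/tangential basis can be sketched as follows (a minimal illustration of the idea, not the authors' implementation; the function and variable names are our own):

```python
import numpy as np

def radial_gradient_transform(gx, gy, cx, cy):
    """Project global gradients (gx, gy) onto the local radial/tangential
    basis defined at every pixel p relative to the window centre C = (cx, cy).

    Returns (Gr, Gt): the radial and tangential gradient components, which
    stay unchanged when the whole window is rotated about its centre.
    """
    h, w = gx.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    # Radial unit vector r at each pixel, pointing away from the centre.
    rx, ry = xs - cx, ys - cy
    norm = np.hypot(rx, ry)
    norm[norm == 0] = 1.0            # avoid division by zero at the centre
    rx, ry = rx / norm, ry / norm
    # Tangential unit vector t is r rotated by 90 degrees.
    tx, ty = -ry, rx
    Gr = gx * rx + gy * ry           # gradient component along r
    Gt = gx * tx + gy * ty           # gradient component along t
    return Gr, Gt
```

Because (r, t) is an orthonormal basis at every pixel, the gradient magnitude is preserved by the projection, while the components themselves become independent of the window's global orientation.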
The aim of this paper is to assess SRHOG for human detection in UAV images. The remainder of this paper includes three parts: the first discusses how the SRHOG feature is computed and implemented; the experiments performed to test it are then reported; and ultimately conclusions are drawn and recommendations are proposed for future work.

METHODOLOGY OF COMPUTING SRHOG FEATURE AND ITS IMPLEMENTATION FOR HUMAN DETECTION
In this section the methodology is explained briefly. At first, the image is scanned by a 128x128 search window. Then the radial and tangential gradients (Figure 2) of each pixel are calculated via the Approximate Radial Gradient Transformation method. The magnitude and direction of the gradient vector are obtained by:

Gradient magnitude = √(Gr² + Gt²) (1)

Gradient direction = arctan(Gt / Gr) (2)

where Gr and Gt are the radial and tangential gradients respectively. Once the gradients are calculated, the search window is split into several sector rings using 15 concentric circles and 16 angular sectors. It is worth noting that these blocks have some overlap (Figure 3) to make the search window feature more robust against changes in illumination (Liu et al., 2017). The next step is the computation of each block's gradient histogram. The horizontal axis of this histogram corresponds to gradient directions ranging from 0° to 160° (i.e. 9 bins of 20° each). Pixels in the block whose gradient directions fall within a bin's range are assigned to that bin, and the height of each bar is equal to the weighted sum of the pixel gradient magnitudes. At the end, the feature vectors of the block histograms are concatenated to form the final feature vector of the search window.

Figure 3. a) spatial configurations in SRHOG, b) gradient of pixels in one of the blocks in the search window, c) histogram of gradients of the block
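The sector-ring binning and histogram step described above can be sketched as follows (a simplified illustration under our own assumptions about ring spacing; the block overlap and normalization of the actual method are omitted):

```python
import numpy as np

def srhog_histograms(Gr, Gt, cx, cy, n_rings=15, n_sectors=16, n_bins=9):
    """Sketch of the SRHOG binning step: split a search window into
    ring/sector blocks and build a 9-bin orientation histogram per block.
    Gr, Gt are the radial and tangential gradient components of the window.
    """
    h, w = Gr.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    dx, dy = xs - cx, ys - cy
    radius = np.hypot(dx, dy)
    theta = np.arctan2(dy, dx) % (2 * np.pi)

    # Local orientation beta in [0, 180) degrees; magnitude as in Eq. (1).
    magnitude = np.hypot(Gr, Gt)
    beta = np.degrees(np.arctan2(Gt, Gr)) % 180.0

    # Assign every pixel to a ring, a sector, and an orientation bin.
    r_max = radius.max()
    ring = np.minimum((radius / r_max * n_rings).astype(int), n_rings - 1)
    sector = np.minimum((theta / (2 * np.pi) * n_sectors).astype(int), n_sectors - 1)
    bin_idx = np.minimum((beta / 180.0 * n_bins).astype(int), n_bins - 1)

    hist = np.zeros((n_rings, n_sectors, n_bins))
    for r, s, b, m in zip(ring.ravel(), sector.ravel(), bin_idx.ravel(), magnitude.ravel()):
        hist[r, s, b] += m          # magnitude-weighted vote, as in the text
    return hist.ravel()             # final feature vector of the window
```

With 15 rings, 16 sectors and 9 bins, the resulting window feature has 15 × 16 × 9 = 2160 components before any overlap is added.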
In this paper a supervised classification method, the Support Vector Machine (SVM), is used to identify humans. SVM is conducted in two main phases: training and testing. During the training phase, a classification model is built that is later used to assign each test picture the human or non-human tag. For this, the features of many positive (completely or partially containing humans) and negative (containing no humans) samples are extracted using SRHOG. These features constitute the training data that SVM needs to build its model. To find the label of any test image, its features are extracted and passed to the SVM classifier. In the end, all test images that contain one or more humans are assigned to the human category, whereas the test images that contain no humans are assigned to the non-human category. To find a human in a scene, a sliding-window approach has been used, which labels each window by means of the method described above. There is a large gap between the data input speed and the processing speed for large images. To shorten this gap, a parallel processing scheme is used, in which several processes (their number depending on the CPU) can run at the same time.
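The sliding-window search with parallel classification can be sketched as follows (a simplified illustration; `predict` stands in for the SRHOG extraction plus the trained SVM, and the stride value is our own assumption):

```python
from multiprocessing import Pool

WIN = 128  # search-window size used throughout the paper

def window_positions(height, width, stride):
    """Top-left corners of all 128x128 windows that fit in the image."""
    return [(x, y)
            for y in range(0, height - WIN + 1, stride)
            for x in range(0, width - WIN + 1, stride)]

def detect_humans(image, predict, stride=32, workers=1):
    """Classify every window with `predict` (a stand-in for SRHOG feature
    extraction followed by the trained SVM, returning 1 for human) and
    return the positions labelled as human.  With workers > 1 the windows
    are spread over several processes, narrowing the gap between input
    and processing speed on large images."""
    positions = window_positions(image.shape[0], image.shape[1], stride)
    windows = [image[y:y + WIN, x:x + WIN] for x, y in positions]
    if workers > 1:
        with Pool(workers) as pool:      # one OS process per worker
            labels = pool.map(predict, windows)
    else:
        labels = [predict(w) for w in windows]
    return [pos for pos, lab in zip(positions, labels) if lab == 1]
```

Note that with `Pool`, `predict` must be a module-level function so it can be sent to the worker processes; the sequential path has no such restriction.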

EVALUATIONS
Three datasets were used for the evaluations. The first is the INRIA dataset, proposed along with the HOG feature (Dalal et al., 2005), which has frequently been used in numerous studies as a systematic benchmark for testing human and pedestrian detection algorithms (Benenson et al., 2014). The collection contains only standing, walking and upright views of people. However, as already described, an individual is mostly viewed from a top-down angle in a UAV image. Humans are deformable objects and therefore show large variations within the class, so the training dataset should be comprehensive enough to allow accurate classification. Thus, we acquired and used many additional images taken by AR Drone 2.0, DJI Tello, DJI Inspire 2.0, and DJI Phantom 4 Pro drones.

Figure 4 shows some example images from the INRIA dataset, from which we used 2164 positive and 432 negative samples for training and 1126 positive and 453 negative images for the test. As mentioned before, however, the samples need to be 128x128 pixels. Therefore, to increase the dimensions, the border parts of the samples were duplicated and merged to them (Figure 5a) to make them 128x128-pixel images.

Figure 4. Some examples of INRIA dataset

The additional drone images acquired by the authors comprised 592 positive and 200 negative samples. The positive images included humans either completely or partially. Figure 5 shows some examples of both positive and negative images.

Figure 5. Some examples of Drone images

The Tello images were taken at lower altitudes to make the training dataset even more comprehensive. In the training phase, 162 positive and 92 negative images taken by Tello were added to the other two datasets. Some examples are shown in Figure 6.

Figure 6. Some examples of Tello images
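The border-duplication resizing can be sketched with edge padding (a minimal illustration assuming symmetric padding to the target size; the exact scheme used may differ):

```python
import numpy as np

def pad_to_128(sample):
    """Grow a smaller sample to 128x128 by duplicating its border pixels."""
    h, w = sample.shape[:2]
    top = (128 - h) // 2
    left = (128 - w) // 2
    # Replicate edge rows/columns on each side; extra channels untouched.
    pads = ((top, 128 - h - top), (left, 128 - w - left)) + ((0, 0),) * (sample.ndim - 2)
    return np.pad(sample, pads, mode='edge')
```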
To evaluate the performance of the algorithm, three indices, Recall, Recall_neg, and Precision (Wójcikowski, 2016), were used, which are computed from the TP, TN, FP, and FN figures (Table 1). The Recall index shows the ratio of correctly identified positive windows over the total number of positive windows and is computed by:

Recall = TP / (TP + FN) (3)

In effect, Recall shows how strong the proposed feature is in identifying the positive samples.
The second index, Recall_neg, refers to the ratio of correctly identified negative samples over the total number of negative samples, and is calculated by:

Recall_neg = TN / (TN + FP) (4)
A bigger Recall_neg suggests that the procedure is stronger and makes fewer mistakes. The last index, Precision, refers to the ratio of correctly identified positive windows over all windows labelled as positive and is computed by:

Precision = TP / (TP + FP) (5)

Precision shows the overall accuracy of the method.

Table 1. Meaning of the indices

Index | Meaning
TP | Positive windows correctly identified
TN | Negative windows correctly identified
FP | Negative windows incorrectly labelled as positive
FN | Positive windows incorrectly labelled as negative

For the first two tests, the ROC curve and the area under it were also calculated. The ROC curve is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied; it is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The AUC (area under the curve) indicates how capable the model is of distinguishing between the classes.
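The three indices can be computed directly from the TP, TN, FP and FN counts; for example, with 453 negative test samples of which 151 are correctly classified, Recall_neg is 151/453 ≈ 33.3%. A minimal sketch:

```python
def recall(tp, fn):
    """Eq. (3): correctly identified positives over all positive windows."""
    return tp / (tp + fn)

def recall_neg(tn, fp):
    """Eq. (4): correctly identified negatives over all negative windows."""
    return tn / (tn + fp)

def precision(tp, fp):
    """Correctly identified positives over all windows labelled positive."""
    return tp / (tp + fp)
```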

RESULTS AND DISCUSSIONS
At first, the overall efficiency of the SRHOG feature was evaluated using the INRIA dataset. Then this test was carried out once again, this time with drone images, to test its ability to detect humans in nadir images. Finally, the ability of SRHOG to detect humans appearing in different situations, such as standing and sitting, was evaluated.
The results of the first experiment are presented in Table 2. Here, out of 453 negative samples, SRHOG classified only 151 cases correctly. This means that, despite its relatively good accuracy in detecting correct positive labels, SRHOG also produces too many false positive labels. Table 3 shows the next experiment, where the performance of SRHOG for detecting humans in UAV images is studied. In this experiment, 803 positive and 150 negative samples taken by a Parrot AR Drone camera (Fig. 6) are used. The AUC obtained was much less than in the previous test; therefore, the capability to distinguish between the classes dropped.
As mentioned, the third experiment concerned checking SRHOG on images that include a person in various standing or sitting positions and lighting conditions. In other words, injured people can appear in a variety of positions, such as standing, sitting and lying down, possibly with a part of the body covered or in shadow. There is also no guarantee that a human is imaged under proper lighting conditions.
In this experiment, 500 positive samples taken by the Tello drone were used for each situation. This is because only in this dataset were we able to capture the various body positions. Examples of these images are presented in Figure 9.

CONCLUSION
In this paper, several experiments were carried out to examine the performance of SRHOG in detecting humans in UAV images.
The tests covered sitting and lying positions, and occluded bodies and inappropriate lighting conditions were studied separately. The SRHOG Recall in the lying and standing positions was 85.10% and 73.42% respectively.
The biggest weakness of SRHOG was in giving many false labels where the image does not contain any humans; the development of a feature that overcomes this issue is desired in future studies. It was also observed that the feature has a smaller success rate in detecting the sitting position, which presents a different appearance of the human being in the image. Finally, perhaps the most significant problem was occlusion, where applying part-based techniques may lead to better results.