KNOWLEDGE DISTILLATION USING GANS FOR FAST OBJECT DETECTION

In this paper, we propose a new method for knowledge distillation based on generative adversarial networks. Discriminator CNNs are used as an adaptive knowledge distillation loss. In the experiments, single shot multibox detectors (SSD) based on MobileNet v2 and ShuffleNet v1 are used as student networks. Our tests showed AP and mAP improvements of more than 3% on the Pascal VOC dataset and 1% on the MS COCO dataset compared with the baseline algorithm, without any architecture or dataset changes. The proposed approach is general and can be used not only with SSD but also with any type of object detection algorithm.


INTRODUCTION
Nowadays, there are many practical computer vision tasks that can be solved with previously unattainable quality by using convolutional neural networks (CNNs). However, the well-known drawback of CNNs is their high computational cost, which makes them quite difficult to implement on embedded systems, especially for real-time image sequence analysis. This is the case even despite the latest advances in embedded hardware capabilities for neural network processing (e.g. Google TPU or NVIDIA Xavier).
From the algorithmic side, special "mobile" CNN architectures have been developed (e.g. MobileNet, ShuffleNet), which are very computationally efficient in inference (in terms of floating point operations) compared with regular CNNs. However, their practical use is restrained by difficult hyperparameter fine-tuning, and further performance improvement for such "mobile" architectures is still an acute task.
Knowledge distillation can be one of the possible solutions in this area, since it is a method for knowledge transfer from one neural network to another, usually from a deep and slow network to a small and fast one. This approach is becoming increasingly popular in the practical application of artificial neural networks.
In this work, we consider the problem of fast object detection as a test task. In contrast to the classical distilling technique, we use generative adversarial networks (GAN) as an adaptive loss function for deep feature mimicking. As the basic object detection algorithm we use the single shot multibox detector (SSD). The proposed approach allows us to get an mAP gain on the COCO and Pascal VOC datasets without any architecture or dataset changes.

RELATED WORKS
Object detection. Currently, there are two basic concepts implemented in object detection algorithms: region proposal object detection and single shot object detection.
Historically, the region proposal detectors appeared first and implemented the idea of splitting the detection problem into two stages: first, to create hypotheses about the possible locations of objects in the image without their classification, and then, on the second stage, to verify the hypotheses and refine object locations. Examples of such algorithms are R-CNN, Fast RCNN (Girshick, 2015), Faster RCNN (Ren, 2015), R-FCN (Dai, 2016), and FPN. In RCNN, which can be considered the basic work for this class of algorithms, the first part that generates hypotheses is based on a selective search procedure, whereas on the second stage a neural network is used for classification. In Faster RCNN, which is the further development of RCNN, the two stages are combined in one network architecture, which delivers a significant increase in processing speed. R-FCN improves speed and accuracy by removing fully connected layers for final detection. In the case of FPN, a pyramidal architecture with lateral connections was developed for building multi-scale high-level feature maps, in which object detection is performed independently at each level. In general, region proposal detectors provide high flexibility and accuracy by dividing the processing flow into two stages, but at the same time, it is still extremely difficult to implement this concept in real time.
Single shot detectors solve the object detection problem in one processing stage based on a single neural network. Such a neural network receives an image as input and outputs bounding boxes for the positions of detected objects along with their class labels. This group of detectors is represented by YOLO (Redmon, 2016), SSD (Liu, 2016), DSOD (Shen, 2017), and RetinaNet. One of the first single shot algorithms is YOLO, which is based on an original CNN architecture and provides a processing speed of 244 FPS in the TinyYolo modification. A further development of single shot detectors is SSD. In contrast to YOLO, the SSD architecture uses deep features from various layers of the neural network, depending on the size of the object. In addition, to improve the quality of detection of objects of various shapes, anchor boxes similar to those proposed in the R-CNN algorithm are used. Currently, there are a lot of various SSD-style detectors (e.g. Yolo v2, RetinaNet, DSOD, DSSD), which have modified network architectures and loss functions, but employ the same ideology.
Single shot detectors are currently used in strict real-time applications such as the real-time face detectors SSH, S3FD and FaceBoxes.
Knowledge distillation. The knowledge distillation task involves the transfer of knowledge from a "teacher" network to a "student" network in order to improve the quality of the latter. One of the first works in this area was (Romero, 2014), in which knowledge transfer for the classification problem was considered. The proposed method minimizes the L2 difference between the deep features of the teacher and student networks. This method is widely used in practice because of its simple implementation. For example, in (Quanquan, 2017) the method was used to transfer knowledge for the region proposal two-stage Faster R-CNN detector. Another approach is to "implicitly" learn deep features (Hinton, 2015). In (Hinton, 2015) the so-called "soft labels" were proposed. These labels are created from the teacher's answers and are further used in the student's loss function. In this case, direct minimization of differences in deep features does not occur. Another line of work employs relationships between different samples to improve quality. In (Huang, 2017) the knowledge distillation problem was transformed into a distribution matching problem. In (Wang, 2018) an algorithm for transferring knowledge by using generative adversarial networks was proposed: the discriminator is used as a loss function to minimize differences in deep features. In the original article, the knowledge transfer problem was considered for the classification task. In our work, we propose a similar approach, but adapted for the task of object detection.
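For reference, a minimal PyTorch sketch of the soft-label idea of (Hinton, 2015) is given below; the temperature T, the weight alpha and all names are our illustrative assumptions, not the original implementation:

import torch.nn.functional as F

def soft_label_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.5):
    """Distillation with "soft labels": the student matches the teacher's
    softened class distribution in addition to the hard ground truth."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction='batchmean') * (T * T)
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard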

GAN FOR KNOWLEDGE DISTILLATION
Classical GANs generate some signal or image from a random vector. Conditional GANs transform some input data (and possibly a random vector) into an output image or vector. Typically, a GAN presumes two neural networks: G, the generator, and D, the discriminator. The generator network G generates some signal in the target domain from the input data. The discriminator network D is trained to distinguish "real" signals from the target domain from the "fakes" produced by the generator. Generator and discriminator are trained simultaneously. The discriminator provides the adversarial loss that enforces the generator to produce "fakes" that cannot be distinguished from the "real" signal. Conditional generative adversarial networks are widely used for domain-to-domain translation.
Generator and Discriminator are trained simultaneously using the following loss:

L = L_{GAN} + L_{REC}    (1)

where L_{GAN} is the "adversarial" loss and L_{REC} is the reconstruction loss.
Binary cross entropy loss is widely used as an adversarial loss:

L_{GAN} = -E_{y}[\log D(y)] - E_{x}[\log(1 - D(G(x)))]    (2)

where y is a real sample, x is the input data, D is the Discriminator CNN, and G is the Generator CNN.
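As an illustration only (not our exact code), loss (2) can be written in PyTorch as follows, assuming the discriminator ends with a sigmoid and outputs probabilities:

import torch
import torch.nn.functional as F

def adversarial_losses(D, G, y, x):
    """BCE form of the adversarial loss (2): y is a real target-domain
    sample, x is the generator input; D outputs values in (0, 1)."""
    fake = G(x)
    d_real = D(y)
    d_fake = D(fake.detach())       # detach so the D step does not update G
    d_loss = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    g_fake = D(fake)
    # The generator is trained with the "real" label to fool D.
    g_loss = F.binary_cross_entropy(g_fake, torch.ones_like(g_fake))
    return d_loss, g_loss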
L1 or L2 distances are usually used as the reconstruction loss function L_{REC}. The reconstruction loss leads to better convergence and prevents a constant generator output.
The knowledge distillation problem can be easily represented as a domain-to-domain transform. In that case, the source domain is the student feature space and the target domain is the teacher feature space (see Figure 1).

Figure 1. GAN for knowledge distillation.
Using a Discriminator CNN instead of L1 or L2 distances provides an adaptive loss function that leads to better accuracy.
In the case of knowledge distillation, the reconstruction loss from (1) can be replaced by a task-specific loss (such as an object detection, classification or segmentation loss). A similar approach was used in (Wang, 2018) for the classification problem, where it increased accuracy from 68.43 to 74.1 for MobileNet.
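A minimal sketch of this substitution (names are ours; the task-specific loss takes the place of L_{REC} in (1)):

import torch
import torch.nn.functional as F

def distillation_loss(d_out_on_student, task_loss):
    """Student loss: an adversarial term (the discriminator judging the
    student's features, trained toward the "real" label) plus a
    task-specific loss replacing the reconstruction term of (1)."""
    adv = F.binary_cross_entropy(d_out_on_student,
                                 torch.ones_like(d_out_on_student))
    return adv + task_loss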

PROPOSED APPROACH
In our work, we consider SSD-based object detection algorithms (SSD, DSOD, RetinaNet and others). In contrast to region proposal approaches, SSD detects objects using only one forward pass. These algorithms discretize the output space of object bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. During prediction, the CNN generates scores separately for each default box and for each object class. Instead of direct bounding box prediction, the CNN generates adjustments to the default boxes. In addition, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes.
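For illustration, the standard SSD-style decoding of the predicted adjustments relative to the default boxes looks roughly as follows; this sketch assumes the common (cx, cy, w, h) parameterization and variance scaling, and the variable names are ours:

import torch

def decode_boxes(loc, priors, variances=(0.1, 0.2)):
    """loc: predicted offsets (N, 4); priors: default boxes (N, 4),
    both as (cx, cy, w, h). Returns (xmin, ymin, xmax, ymax) boxes."""
    centers = priors[:, :2] + loc[:, :2] * variances[0] * priors[:, 2:]
    sizes = priors[:, 2:] * torch.exp(loc[:, 2:] * variances[1])
    return torch.cat([centers - sizes / 2, centers + sizes / 2], dim=1)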
SSD is one of the most popular and fastest object detectors. For example, SSD with the base CNN MobileNet v2, which is discussed in our article, provides a processing speed of 5 FPS on a Google Pixel 1 phone with a Qualcomm Snapdragon 821 processor (Sandler, 2018). SSD-MobileNet v2 reaches a processing speed of more than 250 FPS on an NVIDIA RTX 2080 Ti GPU with the TensorRT library.
The basic Single Shot MultiBox Detector (Liu, 2016) can be divided into two parts (see Figure 2): the base network and the extra layers. In the original paper, the VGG-16 network is used as the base network. Replacing VGG with mobile networks like MobileNets or ShuffleNets opens the possibility of providing high computational speed on embedded platforms and in real-time solutions.
Extra layers are a series of progressively smaller convolutional layers (see Figure 2). Layers from the "extra layers", along with some of the earlier base network layers, are used to predict scores and bounding boxes. We name these layers "feature layers". These predictions are performed by 3x3 convolutions: one filter for each category score and one for each dimension of the bounding box that is regressed. At the end, non-maximum suppression (NMS) is used for post-processing of the predictions to get the final detection results.
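A sketch of one such per-feature-layer prediction head (channel counts and names are illustrative):

import torch.nn as nn

def make_prediction_head(in_channels, num_anchors, num_classes):
    """Per-feature-layer 3x3 convolutions: one output channel per
    (anchor, class) score and four per anchor for box regression."""
    cls_conv = nn.Conv2d(in_channels, num_anchors * num_classes,
                         kernel_size=3, padding=1)
    loc_conv = nn.Conv2d(in_channels, num_anchors * 4,
                         kernel_size=3, padding=1)
    return cls_conv, loc_conv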
Therefore, in the SSD case, we have several feature layers for mimicking. The student network plays the role of the generator that generates features, and we need a unique discriminator for each feature map (see Figure 3). Using this approach, we can transfer knowledge between two SSDs with different base networks and the same extra layers architecture. The loss function, according to (1), will be the following:

L = \sum_{i} L_{GAN_i} + L_{MBOX}    (3)

where L_{GAN_i} is the "adversarial" loss for the i-th discriminator and L_{MBOX} is the multibox loss.
In our work, we use the original multibox loss function from (Liu, 2016).
L_{GAN_i} is similar to (2). The proposed approach is general and can be used for any type of object detection algorithm or any convolutional neural network; in this case, for each feature layer we need an individual discriminator. For example, for simple classification we need only one discriminator.
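A sketch of the student-side loss (3), assuming a list of per-feature-layer discriminators and the corresponding student feature maps (names are ours; the multibox loss is the original one from (Liu, 2016)):

import torch
import torch.nn.functional as F

def student_total_loss(discriminators, student_feats, multibox_loss):
    """Eq. (3): sum of per-feature-layer adversarial losses plus the
    SSD multibox loss. The student (generator) is trained toward the
    "real" label; each D_i outputs a patch GAN probability map."""
    adv = 0.0
    for D_i, feat in zip(discriminators, student_feats):
        out = D_i(feat)
        adv = adv + F.binary_cross_entropy(out, torch.ones_like(out))
    return adv + multibox_loss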

Architecture.
Teacher Network. We used SSD based on DarkNet-53 as the teacher network for the following reasons:
1. In comparison with the well-known ResNet-152 CNN, DarkNet-53 provides similar performance on the ImageNet dataset but is twice as fast;
2. With SSD, it provides much higher accuracy on the test datasets (Pascal VOC 0712: ~0.77 mAP) than MobileNet or ShuffleNet based SSDs;
3. It has feature layer shapes similar to those of MobileNet and ShuffleNet based SSDs.
This neural network was proposed in (Redmon, 2018) for training the Yolo v3 object detector and has the architecture presented in Figure 4.
Student Network. As mentioned before, in this work we consider the class of algorithms for fast object detection. In our experiments we used the SSD-Lite modification of SSD. MobileNet v2 (Sandler, 2018) and ShuffleNet v1 were used as the student networks. These CNNs are specially developed for fast and embedded applications. Due to the architectural features of the base networks, the original SSD was slightly modified following (Sandler, 2018).
Discriminator Networks. In the SSD case we have 6 discriminator networks. We tried different architectures and discriminator types. The best results were obtained with the medium-size discriminator CNNs shown in Tables 2 and 3. Deeper discriminator architectures can lead to overfitting and worse results.
Our discriminators are built from blocks that contain convolutional layers, instance normalization layers (Ulyanov, 2016) and leaky ReLU activation functions. The CNN architecture for the first 5 discriminators (DNet1-DNet5) is based on the patch GAN ideology (Isola, 2017). This approach implies that instead of one answer, the output of the discriminator is a semantic map for the two classes, "real" and "fake". These types of discriminators are often used in image processing and can improve the quality of the generator compared to the classical ones. In our case, SSD feature maps are also spatially aware, except for the last feature layer, so we use the patch GAN architecture for DNet1-DNet5 with the following output sizes: 3x3 for DNet1-DNet4 and 2x2 for DNet5.
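A sketch of such a block and a patch GAN style discriminator follows; the layer widths are illustrative and do not reproduce the exact configurations of Tables 2 and 3:

import torch.nn as nn

def d_block(in_ch, out_ch, stride=2):
    """Convolution + instance normalization + leaky ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
        nn.InstanceNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
    )

class PatchDiscriminator(nn.Module):
    """Patch GAN discriminator: outputs a small spatial map of
    "real"/"fake" probabilities instead of a single scalar."""
    def __init__(self, in_ch):
        super().__init__()
        self.body = nn.Sequential(
            d_block(in_ch, 64),
            d_block(64, 128),
            nn.Conv2d(128, 1, kernel_size=3, padding=1),  # per-patch logit
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.body(x)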
Additional tests were performed to study the influence of mimicking on the quality of object detection at various scales. Objects were divided into three groups: small, medium and large, in accordance with the anchor mapping rules of the original SSD.

Table 1. AP gain for different object sizes.
As can be seen from Table 1, the use of mimicking improves the quality of detection for all considered object sizes.
Training procedure.
The training process consists of four main stages, described below; a sketch of the full schedule is given after the stage descriptions.
Discriminator pre-learning: the discriminator CNN pre-training stage. At this stage, we used only 75 iterations to pre-train the discriminators with only the adversarial loss (2) (the student network was frozen).
Student-Net learning: the student CNN training stage (generator training). At this stage, the discriminator networks were frozen (weights were not changing) and only the student network was trained using loss (3). According to the GAN ideology, a "real" label is passed to the discriminator as the answer. This stage was run until the mean adversarial loss per epoch became less than a given threshold.
Discriminator fine-tuning: the discriminator CNN training stage. At this stage, only the adversarial loss (2) was used (the student network was frozen) and the number of iterations was also equal to 75.
Student-Net post-learning: at this stage, the classical SSD training method (Liu, 2016) with the SGD solver was used, according to the original paper (for the other stages we used the Adam optimizer with a fixed learning rate).
After the pre-learning stage, we alternated the student network and discriminator learning stages for 180 epochs. Then we applied 80 epochs of the post-learning stage.
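The four-stage schedule can be sketched as follows; all callables are hypothetical wrappers around the losses above, and the threshold value is illustrative:

def run_training(pretrain_d, train_student_epoch, finetune_d, post_train_epoch,
                 adv_threshold=0.3, gan_epochs=180, post_epochs=80):
    """Four-stage schedule described above (a sketch, not the exact code)."""
    pretrain_d(num_iters=75)                 # stage 1: loss (2), student frozen
    for _ in range(gan_epochs):              # stages 2-3, alternated
        mean_adv = train_student_epoch()     # loss (3), D frozen, Adam optimizer
        if mean_adv < adv_threshold:         # adversarial loss fell below threshold
            finetune_d(num_iters=75)         # stage 3: loss (2), student frozen
    for _ in range(post_epochs):             # stage 4: classical SSD training,
        post_train_epoch()                   # multibox loss only, SGD (Liu, 2016)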
Following the original paper, we used backbones pre-trained on the ImageNet dataset and the same augmentation set.

Datasets.
For our experiments, we used the Pascal VOC (Everingham, 2014) and MS COCO (Lin, 2014) datasets to balance between large-scale and small-sized objects.

The PASCAL VOC 0712 dataset contains 21,493 images of annotated, predominantly large, objects from 20 classes. For testing we used the metrics and test subset from the PASCAL VOC benchmark described in (Everingham, 2014). For training we used the union of the PASCAL VOC and MS COCO training subsets.
The MS COCO dataset contains 328k images of 91 object types (more than 2.5 million labeled, predominantly small-sized, objects). For testing we used the COCO 17 val subset and the metrics of the object detection protocol described in the original paper (Lin, 2014).

CONCLUSIONS
This paper introduces a new approach to knowledge distillation based on generative adversarial networks. Instead of the classical approaches, which employ a distance-based loss function, we propose to use an adversarial loss function. In this case, the knowledge distillation problem is transformed into a domain-to-domain translation problem, and generative adversarial networks have been successfully applied to this type of problem in many practical applications. Here, the feature space of the teacher network is the target domain, an input image is the source domain, and the student network is the generator network.
We used the single shot multibox detector (SSD) as the basic algorithm and object detection as the test task. We targeted the problem of improving the quality of a fast SSD based on MobileNet through knowledge distillation from an SSD based on a deep and slow network. According to the SSD architecture, we used 6 discriminator networks with an original architecture based on the patch GAN ideology.
The training process presumes 4 main stages: 1. Discriminator pre-learning; 2. Student-Net learning; 3. Discriminator fine-tuning; 4. Student-Net post-learning. Unlike the classical generative adversarial network setup, we train the discriminator and generator sequentially: the generator CNN is trained with frozen discriminators until the adversarial loss reaches some level, and then the discriminators are fine-tuned with the frozen generator network.
Our approach was tested on the well-known PASCAL VOC and MS COCO datasets. SSD based on the DarkNet-53 network was employed as the teacher network, and MobileNet v2 and ShuffleNet v1 were the two options for the student network. The proposed approach allows us to get about a 3% mAP gain (depending on the selected base CNN architecture) on the Pascal VOC dataset and approximately a 1.5% mAP gain on the COCO dataset without any architecture or dataset changes. In addition, it is worth mentioning that the approach is general and can be used not only with SSD but also with any type of object detection algorithm.