MOHE-NET: MONOCULAR OBJECT HEIGHT ESTIMATION NETWORK USING DEEP LEARNING AND SCENE GEOMETRY

Estimating the heights of objects in the field of view has applications in many tasks, such as robotics, autonomous platforms and video surveillance. Object height is a concrete and indispensable characteristic that people or machines can learn and capture, and many actions, such as a vehicle avoiding obstacles, are taken based on it. Traditionally, object height is estimated using laser ranging, radar or stereo cameras. Depending on the application, the cost of these techniques may inhibit their use, especially on autonomous platforms. Using available lower-cost sensors would allow height estimation to be adopted more widely. Our approach to height estimation requires only a single 2D image. To solve this problem we introduce the Monocular Object Height Estimation Network (MOHE-Net), a cascade of two networks. The first network performs object detection and returns the bounding boxes of the objects of interest. This information is then input to the second network, a linear Multi-layer Perceptron (MLP), which estimates the object height. The linear MLP models the camera-scene geometry and, unlike a conventional MLP, neither requires training nor contains activation functions. The developed approach works for a static camera setup as well as for a moving platform. The proposed approach achieves state-of-the-art performance and can be deployed for obstacle avoidance on autonomous platforms. Our code is available at https://github.com/OSUPCVLab/Ford2019/tree/master/Moving%20Object%20Height%20Estimation%20Network


INTRODUCTION
Object height estimation has applications in a number of problem domains, including but not limited to autonomous driving, robotics and visual surveillance. Once the object height is estimated, this information can be used, for instance, to avoid obstacles in autonomous driving scenarios and thereby ensure safety. Deploying a height estimation system requires two modules, an object detection module and a height estimation module, both of which must run in real time for time-constrained problems.
Arguably, object detection can be considered a mid-level perception problem required by many higher-level tasks (Zou et al., 2019). It has been an active area of research for several decades. The goal of object detection is to determine whether instances of an object class, such as person, car or truck, exist in the image and to return their locations as enclosing masks or bounding boxes (Liu et al., 2020). Recently, deep learning techniques, including but not limited to Faster R-CNN and the YOLO series (Ren et al., 2016, Redmon et al., 2016, Redmon and Farhadi, 2017, Redmon and Farhadi, 2018), have been shown to be considerably more accurate and faster than traditional approaches based on features such as SIFT, SURF and BRIEF (Lindeberg, 2012, Calonder et al., 2010, Bay et al., 2006). Our proposed object height estimation system adopts existing pretrained deep convolutional neural networks (Ren et al., 2016, Jocher et al., 2020) to detect object instances, with bounding boxes, in images recorded by a monocular camera.
We refer to the height estimation problem as the metric estimation of the object height from the 2D bounding boxes. In particular, this step backprojects the pixel coordinates to 3D camera coordinates using view geometry modeled as an MLP. Because the 2D to 3D relation is projective and the object scale is unknown, we introduce additional geometric constraints to solve the problem. The first is the assumption that the camera looks at a piece-wise planar scene, so that the normalized Direct Linear Transformation (DLT) (Hartley and Zisserman, 2003) applies, as shown in Fig. 1. Second, we assume that the objects and the image plane are upright and vertical to the ground, which generates a special geometry that is discussed later in the text.
Figure 1. Red lines indicate the projective physical distance between the embedded camera and the objects. Each red line has two end points: one at the camera, outside the image, and the other at the object pixel coordinate. DLT utilizes relationships between points in projective coordinates and geometric coordinates.
Our main contributions to height estimation can be summarized as follows:
- MOHE-Net requires only a monocular camera.
- It can estimate object height from both stationary and moving platform.
- The geometry is represented as an MLP to generate the network cascade.
- It generates accurate results on the collected dataset. More specifically, it achieves 5.08 cm mean error and 26.4 fps speed.
The rest of this paper is organized as follows. Section 2 reviews recent related work on object detection as well as height estimation. Section 3 describes the problem and provides details of proposed MOHE-Net. Section 4 introduces our collected data and experimental implementation. Section 5 provides details on the results.

RELATED WORK
Object height is considered an important piece of information for autonomous systems and can be measured directly using range sensing systems such as LIDAR or stereo cameras. Its importance stems from the fact that avoiding high or low lying obstacles reduces damage to the vehicle while ensuring the safety of the passengers and of the objects around the vehicle. Below we discuss the two modules required for an end-to-end system: object detection and height estimation.
Considering that the amount of work published on object detection is vast, we only consider more recent studies that use deep learning. With the introduction of regions to CNN architectures (R-CNN) (Girshick et al., 2014), object detection methods started to produce results significantly better than traditional approaches. These developments can be divided into two categories. The first category is the two-stage approach, which starts from a region proposal followed by classification and bounding box regression. The approaches in the first category include R-CNN and its improved versions, Fast R-CNN (Girshick, 2015) and Faster R-CNN (Ren et al., 2016). Fast R-CNN performs feature extraction on the image as a whole, avoiding independent feature extraction for each proposed region. Faster R-CNN replaces selective search with a region proposal network to generate proposed regions. The second category is the one-stage approach, which performs the classification and localization steps simultaneously via grid regression. Arguably the most representative models in this category are the You Only Look Once (YOLO) variations (Redmon et al., 2016, Redmon and Farhadi, 2017, Redmon and Farhadi, 2018, Bochkovskiy et al., 2020, Jocher et al., 2020). Those models typically use Pascal VOC (Everingham et al., 2010) and MS COCO (Lin et al., 2014) for training and evaluation. We adopted the second category of approaches and observed that their existing pre-trained networks provided good accuracy and speed for the object detection task in our approach.
The height estimation task estimates height by transforming image pixel coordinates to 3D coordinates (Hartley and Zisserman, 2003). When an image of a scene is captured, the depth information is lost. To estimate 3D object characteristics, one has to backproject the image into 3D space. Godard et al. (Godard et al., 2017) proposed an unsupervised monocular depth estimation approach to predict depth using a single camera. Zhou et al. (Zhou et al., 2017) proposed an approach to recover depth information from 2D motion information providing disparity. Although these methods could in principle be used for height estimation, they are impractical here because recovering depth for every pixel is computationally expensive. The planarity condition of the scene also makes these approaches impractical for height estimation. Our approach, in contrast, uses the planarity condition directly (Abdel-Aziz et al., 2015) to estimate the heights of upright objects.
Many recent approaches (Mousavian et al., 2017, Wu et al., 2019, Ke et al., 2020, Kundu et al., 2018) were proposed to estimate vehicle size (length, width and height) and 6-DoF pose. Without exception, those approaches require estimation of the camera rotation and translation, even though they likewise use a monocular camera. What distinguishes our approach from the others is that it does not require any estimation of camera pose, yet it accurately estimates the height of objects from over 80 classes in real time.

METHODOLOGY
For metric height estimation, the proposed MOHE-Net cascades two neural networks as shown in Fig

Problem formulation
Let the 4×1 tuple $A = (I, S_0, L, c_0)$ define the imaged scene for the $i$-th frame, where $I \in \mathbb{R}^{H \times W \times 3}$ is the image frame with width $W$ and height $H$, $S_0$ represents the region of interest (ROI), $L = (l_1, l_2, \ldots, l_n)$ is the set of classes of objects of interest (COI), and $c_0$ is the confidence threshold of the object detector. MOHE-Net ignores objects that are not members of the COI as well as objects with confidence lower than $c_0$.

Object Detection Network
The object detection network (OD-Net) is a pretrained CNN model that detects objects within the COI: $O_i = f_{\text{OD-Net}}(I_i)$. This network maps the input image $I_i$ to the output $O_i$, a 6×n tuple $O_i = (b_{i,j,1}, b_{i,j,2}, b_{i,j,3}, b_{i,j,4}, l_{i,j}, c_{i,j})$, where $b_{i,j,1}$ to $b_{i,j,4}$ are the upper-left and lower-right bounding box coordinates of the detected objects, shown in Fig. 3. In this equation, the first subscript $i$ denotes the frame index, the second subscript $j$ indicates the $j$-th object instance, and the last subscript represents one of the four sides of the $j$-th instance; $l_{i,j}$ is the label predicted by OD-Net for the $j$-th instance and $c_{i,j}$ is its corresponding confidence. The height estimation network is activated only when $l_{i,j} \in L$ and $c_{i,j} \ge c_0$. The implementation details of $f_{\text{OD-Net}}$ are given in Section 4.
To represent object height in the image domain, the algorithm selects two points, the middle bottom point and the middle top point of the rectangular bounding box, referred to as the bottom point and the top point, as shown in Fig. 3.
Figure 3. Object detection output generated by the detection network in the form of a bounding box: $b_{i,j,1}, b_{i,j,2}, b_{i,j,3}, b_{i,j,4}$. The two marked points (middle bottom and middle top) of the bounding box represent the object bottom and top, such that the object height is the length of the line connecting these two points.
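The gating rule and the choice of bottom and top points can be illustrated with a short Python sketch. The `Detection` class, the function names and the sample boxes are hypothetical and are not taken from the authors' repository; the sketch assumes image coordinates with the v axis pointing down, so the bottom point lies on the lower edge of the box.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

Point = Tuple[float, float, float]  # homogeneous pixel coordinates (u, v, 1)


@dataclass
class Detection:
    """One detected instance (hypothetical container, not the authors' API)."""
    x1: float  # upper-left corner, u
    y1: float  # upper-left corner, v
    x2: float  # lower-right corner, u
    y2: float  # lower-right corner, v
    label: str         # predicted class label l_{i,j}
    confidence: float  # detector confidence c_{i,j}


def select_detections(dets: List[Detection], coi: Set[str], c0: float) -> List[Detection]:
    """Keep instances whose label is in the COI and whose confidence is >= c0."""
    return [d for d in dets if d.label in coi and d.confidence >= c0]


def bottom_and_top_points(det: Detection) -> Tuple[Point, Point]:
    """Midpoints of the lower and upper box edges (image v axis points down)."""
    u_mid = 0.5 * (det.x1 + det.x2)
    return (u_mid, det.y2, 1.0), (u_mid, det.y1, 1.0)


# Made-up detections for illustration only.
detections = [
    Detection(100.0, 200.0, 180.0, 420.0, "person", 0.91),
    Detection(300.0, 350.0, 340.0, 400.0, "bottle", 0.12),  # below threshold
    Detection(500.0, 100.0, 700.0, 300.0, "kite", 0.80),    # not in the COI
]
kept = select_detections(detections, {"person", "bottle"}, c0=0.25)
bottom, top = bottom_and_top_points(kept[0])
```

Only the first detection passes both conditions; its bottom and top points share the same u coordinate, which is what makes the pixel distance between them a vertical extent.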

Height Estimation Network
To estimate the height of an object instance, we designed a task-oriented multi-layer perceptron (MLP), referred to as the HE-Net, that inversely projects the bottom and top points from image coordinates to real-world coordinates: $H_i = f_{\text{HE-Net}}(O_i)$, where $O_i$ is the OD-Net output and $f_{\text{HE-Net}}$ maps the input $O_i$ to a 3×m ($m \le n$) tuple $H_i = (h_{i,j}, l_{i,j}, c_{i,j})$ satisfying $l_{i,j} \in L$ and $c_{i,j} \ge c_0$. The subscripts $i, j$ are the same as in Section 3.2.
The object bottom and top points are coplanar and lie on a plane perpendicular to the horizontal road plane, as shown in Fig. 4. This object plane is denoted by the orange line and is vertical to the road plane (black canvas). The two dotted lines originating from the camera center (black point) represent the backprojection from the image plane to object space. The distance between the two parallel planes along the camera z direction shown in Fig. 5 is referred to as parameter $z$, also called the object depth. In our data collection, a single camera is mounted on the vehicle. We define the spatial x, y and z axes to point in the right, down and forward directions with respect to the forward motion of the vehicle.
Let $(u, v, 1)$ be the homogeneous coordinates of a point $x$ in image coordinates and $X = (X_W, Y_W, Z_W, 1)$ be the corresponding homogeneous coordinates in the real world. Given the camera intrinsic matrix $K$ and the pose of the camera in the world coordinate frame in the $i$-th video frame, $(R, T) \in SE(3)$, the geometric relationship between $x$ and $X$ is given by equation (3). In fact, object height is the same in both the camera frame and the world frame. The proposed MOHE-Net estimates object height in camera coordinates to avoid estimating the motion of the camera, that is, $(R, T)$ in equation (3). Given the homogeneous camera-frame coordinates $(X_C, Y_C, Z_C, 1)$, the projection from a 3D point in the camera coordinate frame to the image plane is given by equation (4), and the inverse projection from the image plane to the camera coordinate frame by equation (5), where $K_{inv}$ is the inverse camera intrinsic matrix.
Figure 5. Dataset collection platform with predefined camera coordinates with respect to the forward direction of the vehicle; the x, y and z axes point in the right, down and forward directions.
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIII-B2-2021 XXIV ISPRS Congress (2021 edition)
We recorded a short video of a 6 × 9 chessboard to calibrate the camera intrinsic matrix $K$ and the distortion coefficients $D$ using the chessboard calibration algorithm of the OpenCV library (Bradski, 2000). Apart from $K_{inv}$, the object depth, parameter $z$, defined as the distance between the parallel image plane and object plane in Fig. 4, is also needed for the inverse projection of image points. In the proposed HE-Net, we implement DLT to estimate the object depth $z$. This is discussed in Section 3.4.
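For reference, under the standard pinhole camera model (Hartley and Zisserman, 2003) the three relations referred to as equations (3) to (5) are commonly written as below. This is a sketch in textbook form rather than a transcription of the authors' equations: the projective scale $s$ is an assumed symbol, $(R, T)$ is assumed to act as the world-to-camera transform, and lens distortion is ignored.

```latex
\begin{align}
s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
  &= K \,[\, R \mid T \,]
     \begin{bmatrix} X_W \\ Y_W \\ Z_W \\ 1 \end{bmatrix} \tag{3} \\
z \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
  &= K \begin{bmatrix} X_C \\ Y_C \\ Z_C \end{bmatrix},
  \qquad z = Z_C \tag{4} \\
\begin{bmatrix} X_C \\ Y_C \\ Z_C \end{bmatrix}
  &= z \, K_{inv} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} \tag{5}
\end{align}
```

In this form the last row of $K$ is $(0, 0, 1)$, so the scale in (4) equals the depth $Z_C$, which is why a single depth value per object is enough to invert the projection in (5).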
Overall, HE-Net is a handcrafted linear MLP whose flowchart is shown in Fig. 6. The MLP uses $K$ and $D$ as its pre-trained weights and inversely projects the object bottom point and top point on the image plane back to the 3D depth-normalized camera coordinate frame, denoted as $P_{bottom,3D,N}$ and $P_{top,3D,N}$. The subscript $N$ indicates that the point is depth-normalized. With the estimated object depth, parameter $z$ in equation (5), those two points are denormalized back to the 3D camera frame. The object height is the distance in the vertical direction from the bottom point ($P_{bottom,3D}$) to the top point ($P_{top,3D}$), as also shown in Fig. 4.
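The chain of linear steps described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function name `he_net_height` and the intrinsic values are hypothetical, the input pixels are assumed to be already undistorted (the coefficients $D$ are not used), and both points are assumed to share the same depth $z$.

```python
import numpy as np


def he_net_height(K: np.ndarray, bottom_uv, top_uv, z: float) -> float:
    """Metric height from two pixels at a common depth z (camera y axis points down)."""
    K_inv = np.linalg.inv(K)
    # Step 1: inverse projection to depth-normalized camera coordinates (Z = 1).
    p_bottom_n = K_inv @ np.array([bottom_uv[0], bottom_uv[1], 1.0])
    p_top_n = K_inv @ np.array([top_uv[0], top_uv[1], 1.0])
    # Step 2: denormalize with the estimated object depth z.
    p_bottom = z * p_bottom_n
    p_top = z * p_top_n
    # Step 3: vertical distance; the bottom point has the larger Y because y points down.
    return float(p_bottom[1] - p_top[1])


# Hypothetical intrinsics for a 1280 x 720 image (focal length 1000 px).
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
height = he_net_height(K, bottom_uv=(640.0, 560.0), top_uv=(640.0, 200.0), z=5.0)
```

With these made-up numbers the two points are 360 px apart vertically, which at a depth of 5 m and a focal length of 1000 px corresponds to 1.8 m. Every step is a matrix product or a scaling, which is consistent with describing the estimator as an MLP without activation functions.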

Object Depth Estimation
In the HE-Net described above, we apply DLT to estimate parameter $z$. In Fig. 7, there are 36 cone markers on the ground, within the ROI or on its margins. Those markers on the ground plane have 2 degrees of freedom, x and z. When collecting the dataset, we first measured the marker coordinates on the road plane, taking the projection of the mounted camera onto the road plane as the origin, and defined them as $P_C = (p_{C,1}, p_{C,2}, \ldots, p_{C,36})$, where $p_{C,i} = (x_i, z_i, 1)$. Then, we manually picked those markers on the image plane and recorded their homogeneous coordinates as $P_I = (p_{I,1}, p_{I,2}, \ldots, p_{I,36})$, where $p_{I,i} = (u_i, v_i, 1)$.
Normalization is basically a preconditioning step that decreases the condition number of the matrices $P_C$ and $P_I$. Let $T_C$ and $T_I$ be the normalization matrices that map $P_C$ and $P_I$ to $\tilde{P}_C$ and $\tilde{P}_I$, in the camera coordinate frame and the image coordinate frame respectively, with mean 0 and standard deviation $\sqrt{2}$. The homography matrix $h_{3\times3}$ is estimated from the normalized points $\tilde{P}_C$ and $\tilde{P}_I$ (equation (6)), and denormalizing $h_{3\times3}$ gives $H_{3\times3}$ (equation (7)). Given the object bottom point $P_{i,j,bottom} = (u, v, 1)$ from Section 3.2, we estimate parameter $z$ using equation (8); this $z$ is then applied in the inverse projection equation (5) in Section 3.3 as the object depth in the camera frame.
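A minimal NumPy sketch of normalized DLT and the depth lookup is given below. Several choices here are assumptions rather than the authors' formulation: the homography is taken to map image points to ground-plane points, so that $H_{3\times3} = T_C^{-1} h_{3\times3} T_I$; the normalization follows the common Hartley convention (centroid at the origin, mean distance $\sqrt{2}$); and all function names and the synthetic data are hypothetical.

```python
import numpy as np


def normalization_matrix(points: np.ndarray) -> np.ndarray:
    """Similarity transform moving the centroid to 0 and the mean distance to sqrt(2)."""
    centroid = points.mean(axis=0)
    mean_dist = np.linalg.norm(points - centroid, axis=1).mean()
    s = np.sqrt(2.0) / mean_dist
    return np.array([[s, 0.0, -s * centroid[0]],
                     [0.0, s, -s * centroid[1]],
                     [0.0, 0.0, 1.0]])


def normalized_dlt(image_pts: np.ndarray, ground_pts: np.ndarray) -> np.ndarray:
    """Estimate H with ground ~ H @ image from n >= 4 point pairs (n x 2 arrays)."""
    T_I = normalization_matrix(image_pts)
    T_C = normalization_matrix(ground_pts)
    ones = np.ones((len(image_pts), 1))
    img_n = (T_I @ np.hstack([image_pts, ones]).T).T
    gnd_n = (T_C @ np.hstack([ground_pts, ones]).T).T
    rows = []
    for (u, v, w), (x, z, _) in zip(img_n, gnd_n):
        # Two linear constraints per correspondence on the 9 entries of h.
        rows.append([-u, -v, -w, 0.0, 0.0, 0.0, x * u, x * v, x * w])
        rows.append([0.0, 0.0, 0.0, -u, -v, -w, z * u, z * v, z * w])
    _, _, vt = np.linalg.svd(np.asarray(rows))
    h = vt[-1].reshape(3, 3)             # null-space vector = normalized homography
    H = np.linalg.inv(T_C) @ h @ T_I     # denormalization
    return H / H[2, 2]


def depth_from_bottom_point(H: np.ndarray, u: float, v: float) -> float:
    """Map an image bottom point to the ground plane and return its z coordinate."""
    x, z, w = H @ np.array([u, v, 1.0])
    return float(z / w)


# Synthetic check: 36 image points mapped through a made-up homography.
H_true = np.array([[0.02, 0.001, -12.0],
                   [0.0005, -0.03, 30.0],
                   [0.0, -0.0012, 1.0]])
us, vs = np.meshgrid(np.linspace(200.0, 1000.0, 6), np.linspace(400.0, 700.0, 6))
image_pts = np.column_stack([us.ravel(), vs.ravel()])
proj = (H_true @ np.column_stack([image_pts, np.ones(36)]).T).T
ground_pts = proj[:, :2] / proj[:, 2:3]
H_est = normalized_dlt(image_pts, ground_pts)
z = depth_from_bottom_point(H_est, 640.0, 560.0)
```

On noise-free synthetic correspondences the estimate reproduces the generating homography up to numerical precision; with measured marker positions a robust wrapper such as RANSAC would normally be placed around this solver.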

EXPERIMENT DESIGN
To evaluate the performance of the MOHE-Net, we conducted three case studies. In case I, the objects and the platform are both stationary, so the relative distance between objects and platform remains the same. We measured the ground truth height for all objects within the ROI, which range from 20 cm to 180 cm. In case II, we keep the platform motionless while a person 183 cm tall walks within the ROI, from left to right and from near to far. In case III, the platform, a vehicle, moves forward so that more object instances come into the camera view.
Figure 7. Camera view from the monocular camera mounted on the vehicle. The red polyline is the margin of the ROI. HE-Net is activated only for objects whose bottom points are within the ROI. Cones on the ground are markers for homography estimation.
To collect the dataset, we mounted a monocular camera on top of a moving vehicle. The z direction of the camera in Fig. 5 is aligned with the forward direction of the vehicle. Its front view, free of occlusion, is shown in Fig. 7. The MOHE-Net design requires calibration of the vehicle-mounted camera to estimate the homography matrix. To calibrate, we uniformly placed 36 red cones at measured locations and used the red polylines to denote the ROI margins. The coordinates of these points are also marked in the images. The homography estimation is then achieved using Random Sample Consensus (RANSAC) (Fischler and Bolles, 1981), which minimizes the overall geometric error. The vehicle-mounted camera recorded the image sequence at 20 frames per second (fps) with a resolution of 1280 × 720.
The proposed MOHE-Net pipeline consists of object detection and height estimation. For object detection, we adopted Faster R-CNN (Ren et al., 2016) and YOLOv5 (Jocher et al., 2020). Fig. 3 shows an example of object detection with YOLOv5. Both detectors can detect and classify up to 80 object classes while achieving real-time performance. In the experiment, we set the confidence threshold to $c_0 = 0.25$. The second component of the pipeline, HE-Net, uses the estimated camera intrinsic matrix $K$, the distortion coefficients $D$ and the $H_{3\times3}$ matrix to assign the MLP weights. We additionally introduce the ROI map shown in Fig. 7 into the HE-Net to reduce computation time and increase inference speed.
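One simple way to realize the ROI gating is a point-in-polygon test on each object's bottom point. The sketch below uses the standard ray-casting test in plain Python; it is an assumption about how such a gate could be built, not a description of the authors' ROI map, and the polygon vertices are made up.

```python
from typing import List, Tuple


def point_in_polygon(point: Tuple[float, float],
                     polygon: List[Tuple[float, float]]) -> bool:
    """Ray-casting test: count crossings of a ray going in +x from the point."""
    x, y = point
    inside = False
    n = len(polygon)
    for k in range(n):
        x1, y1 = polygon[k]
        x2, y2 = polygon[(k + 1) % n]
        # The edge crosses the horizontal line through the point.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical trapezoidal ROI in pixel coordinates of a 1280 x 720 frame.
roi = [(200.0, 700.0), (1080.0, 700.0), (800.0, 400.0), (480.0, 400.0)]
in_roi = point_in_polygon((640.0, 560.0), roi)   # bottom point inside the ROI
out_roi = point_in_polygon((100.0, 560.0), roi)  # bottom point left of the ROI
```

Running the height estimator only when the test returns true skips objects outside the calibrated ground region, where the homography would be extrapolating beyond the marker layout.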

RESULTS
We discuss two aspects of the results generated by MOHE-Net, accuracy and speed. In either case the output of MOHE-Net is the predicted heights of the objects within the ROI, as shown in Fig. 8. All experiments were conducted on an NVIDIA Titan V.
In the first case study dataset, objects (such as bottles, a chair and a sports ball) are statically placed on the ground (see Fig. 8).
Object height predictions and ground truths are summarized in Table 1. As can be observed, the errors for different objects range from 0 to within 6 centimeters. We observed that the MOHE-Net with YOLOv5+OST as its object detection backbone accurately estimated object heights regardless of whether the objects are tall or short. In case study II, the height estimation is performed sequentially. Fig. 9 shows the predictions registered on the ground truth for one of the backbones used in the study. The statistical analysis for all other backbones is tabulated in Table 2. Because the person's body moves up and down with the gait while walking, our approach produces a range of height estimates, with a 5.09 cm mean error. The gait change is manifested in the plot as a sinusoidal variation, as shown in Fig. 10. Among the sequential predictions, there are several errors, which we point out with arrows in Fig. 10. The main reason for these errors is that our approach relies on a monocular camera, such that appearance changes affect height estimation. Fig. 11
In case study III, the vehicle-mounted camera moves with the platform. The region of interest changes simultaneously as the platform moves forward. Many objects come in and out of the ROI, as shown in Fig. 13. In each row, the heights of several vehicles within the ROI are estimated. For instance, a total of four vehicles within the red polygons have their heights estimated. To our eyes, in the last row, the silver wagon looks close in height to the vehicle on its right but taller than the other two. The predicted heights, displayed in the blue boxes in meters, match our judgement.
The computational bandwidth of an autonomous vehicle is consumed by the many tasks the vehicle performs every second. Hence, the speed of object height estimation is a key factor in algorithm evaluation. Aside from the accuracy comparisons, we also compare the speed of the entire architecture when the object detector is changed in the MOHE-Net pipeline. The results for Faster R-CNN, YOLOv5 and its variants are shown in Table 3. The table shows the total parameter count, height estimation error and speed of the pipeline, respectively. The results in the table are also plotted in Fig. 12.

CONCLUSIONS AND FUTURE WORK
In this paper, we achieved object height estimation from a monocular image sequence using a cascade of neural networks that encodes the view geometry. The cascade architecture, referred to as the MOHE-Net, is evaluated for its accuracy and speed in an autonomous vehicle setting and is observed to achieve state-of-the-art accuracy. The proposed MOHE-Net cascade contains an object detector network and a height estimator network and performs real-time estimation of the height of all objects in the field of view.
Zou, Z., Shi, Z., Guo, Y., Ye, J., 2019. Object detection in 20 years: A survey. arXiv preprint arXiv:1905.05055.
Figure 13. The platform moves and the heights of the objects within the red polygons are estimated. Heights are also displayed in the blue boxes in meters, matching the judgements we made from the vehicles' appearance.