A NOVEL APPROACH FOR PART BASED OBJECT MATCHING USING DISTANCE METRIC LEARNING WITH GRAPH CONVOLUTIONAL NETWORKS

Part-based object representation and the part matching problem appear in many areas of data analysis. A special case of particular interest is when the parts are not fully separated but are related to each other. The natural way to model such objects is with graphs, and the part matching problem then becomes a graph matching problem. Over the years, many methods for solving graph matching problems have been proposed, but the problem remains relevant due to its complexity. We propose a novel approach to solving the graph matching problem based on learning a distance metric on graph vertices. We empirically demonstrate that our method outperforms traditional methods based on solving a quadratic assignment problem. We also provide a theoretical estimate of the computational complexity of the proposed method.


INTRODUCTION
Part-based object representation is often used in image analysis and computer vision, and has applications in image detection and classification, object tracking, shape matching and more. Under the part-based approach, an object is viewed as a set of meaningful primitive parts. One naturally arising problem for such a representation is part matching, that is, finding a correspondence between the parts of two different objects.
Often, relations between the parts within an object can be established. Such objects can be naturally modeled by graphs, with parts corresponding to vertices and relations corresponding to edges. In this case, the part matching problem becomes a graph matching problem, that is, establishing correspondences between vertices with respect to edges. This problem, however, has been proven to be NP-hard. Over the years, many methods for solving this problem have been proposed, but due to its complexity it remains highly relevant.
In this work, we concern ourselves with the graph matching problem applied to matching objects in photographs. We propose a novel graph matching method based on deep distance metric learning on graph vertices. We show empirically that our method achieves higher matching accuracy than graph matching methods based on traditional techniques. Moreover, it performs significantly better than these methods when the actual match between the graphs is small relative to their number of vertices.
The rest of this paper is structured as follows. In Section 2 we provide an overview of modern graph matching methods and discuss related works in the field. In Section 3 we describe our method in detail and provide theoretical estimates of computational complexity. In Section 4 we describe our experimental setup and provide empirical results.

RELATED WORK
In broad terms, the graph matching problem for graphs $G_1$ and $G_2$ consists in finding some binary relation $r$ between their vertices: $r \subseteq V_1 \times V_2$ (Conte et al., 2004). Often, the relation $r$ is required to be a mapping $r : V_1 \to V_2$ or even a bijection. A possible additional constraint is for the mapping $r$ to preserve edges. This special case is referred to as the exact graph matching problem. However, the edge preservation requirement usually conflicts with the high object variability that is common in data analysis.
An attributed graph (Tsai and Fu, 1979) is an extension of the traditional notion of a graph. An attributed graph is a tuple $\langle V, E, \mu, \varepsilon \rangle$, where $\mu = \{\mu_1, \dots, \mu_{|V|}\}$ are vertex attributes and $\varepsilon = \{\varepsilon_1, \dots, \varepsilon_{|E|}\}$ are edge attributes. This is particularly useful in data analysis, as vertex and edge attributes represent features extracted from object parts and the relations between them. In this work, both vertex and edge attributes are numerical vectors.
Typically, the inexact graph matching problem is formally defined as a discrete optimization problem of minimizing a cost function (Yan et al., 2016). Under this approach, each pair of vertices $i \in V_1$ and $i' \in V_2$ is assigned a unary matching cost $c_{ii'}$, and each two pairs of vertices $(i, i'), (j, j') \in V_1 \times V_2$ are assigned a pairwise matching cost $d_{ii',jj'}$ based on vertex and edge attributes and the graph edge structure. The matching problem is then reduced to the binary quadratic programming problem (BQPP):

$$\min_{r} \; \sum_{i,i'} c_{ii'} r_{ii'} + \sum_{i,i',j,j'} d_{ii',jj'} r_{ii'} r_{jj'},$$

where $r$ is a binary matrix of shape $|V_1| \times |V_2|$ that is constrained by the matching requirements. Depending on the matching requirements, various constraints can be used: if the match is required to be a mapping, $\forall i \; \sum_{i'} r_{ii'} = 1$ and $\forall i' \; \sum_{i} r_{ii'} \le 1$, and if the match is required to be a bijection, $r$ is a permutation matrix. In these cases, the cost function can be rewritten as $\sum_{i} c_{i r(i)} + \sum_{i,j} d_{i r(i), j r(j)}$, and the BQPP becomes a quadratic assignment problem (QAP; Lawler, 1963). It can be shown that many popular graph matching methods based on cost optimization can be reduced to the BQPP; for instance, the reduction for graph edit distance is provided in (Neuhaus and Bunke, 2007).
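To make the cost function concrete, the following minimal sketch evaluates it in two ways: from a binary matrix r, and from the rewritten form that uses a mapping r(i). The function names, the random costs and the example mapping are illustrative assumptions, not part of the method.

```python
import numpy as np

def matching_cost(c, d, r):
    """Unary plus pairwise cost of a binary matching matrix r.

    c has shape (n1, n2); d has shape (n1, n2, n1, n2) and is indexed
    as d[i, i', j, j']; r has shape (n1, n2).
    """
    unary = np.sum(c * r)
    pairwise = np.einsum("iajb,ia,jb->", d, r, r)
    return float(unary + pairwise)

def mapping_cost(c, d, mapping):
    """Same cost written with a mapping: sum_i c[i, r(i)] + sum_ij d[i, r(i), j, r(j)]."""
    n = len(mapping)
    unary = sum(c[i, mapping[i]] for i in range(n))
    pairwise = sum(d[i, mapping[i], j, mapping[j]]
                   for i in range(n) for j in range(n))
    return float(unary + pairwise)

# Illustrative random costs for |V1| = 3 and |V2| = 4.
rng = np.random.default_rng(0)
n1, n2 = 3, 4
c = rng.random((n1, n2))
d = rng.random((n1, n2, n1, n2))

# A hypothetical mapping r: V1 -> V2 and its binary matrix form.
mapping = [2, 0, 3]
r = np.zeros((n1, n2))
r[np.arange(n1), mapping] = 1.0

print(matching_cost(c, d, r), mapping_cost(c, d, mapping))
```

Both functions return the same value when r encodes a mapping, which is the equivalence that turns the BQPP into a QAP.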
As QAP itself is NP-hard, multiple techniques for finding an approximate solution have been proposed over the years, including ones based on finding the principal eigenvector of the cost matrix (Leordeanu and Hebert, 2005, Cour et al., 2006), a method based on projection onto convex sets (van Wyk and van Wyk, 2004), modified gradient descent methods that exploit problem specifics to obtain a better approximation of the solution (Gold and Rangarajan, 1996, Leordeanu et al., 2009), and an interior point-like optimization procedure (Zhou and de la Torre, 2012). Typically, these techniques follow the same pattern:
1. approximating the original discrete problem with a continuous one;
2. solving the continuous problem approximately;
3. performing some discretization procedure over the continuous solution.
To the best of our knowledge, machine learning methods in relation to graph matching have only been employed to calculate the matching costs $c_{ii'}$ and $d_{ii',jj'}$ as a function of the graphs (Caetano et al., 2009, Leordeanu et al., 2012, Zanfir and Sminchisescu, 2018, Nowak et al., 2018). In most cases, matching costs are treated as expert knowledge.
To sum up, the graph matching pipeline in most cases is the following:
1. for both objects, a domain-specific feature extraction method is employed to provide attributed graphs;
2. from the graphs, a QAP is constructed;
3. the QAP is solved approximately, and the matching is produced.
In this pipeline, machine learning methods are typically used to train the feature extraction step and potentially the QAP construction step. The paper (Zanfir and Sminchisescu, 2018) is particularly representative. The authors use a deep learning-based feature extraction method to convert images to attributed graphs. Then the two attributed graphs are converted to a QAP. The key feature of this approach is that feature extraction, conversion and solution of the QAP allow for joint back-propagation, so the feature extraction and conversion steps can be trained. We use an extended version of the method provided in (Zanfir and Sminchisescu, 2018) as our baseline competitor.

PROPOSED GRAPH MATCHING MODEL
The obvious drawback of the QAP-based approach to graph matching is its inherently high computational cost, as the pairwise costs $d_{ii',jj'}$ form a 4-dimensional tensor of size $|V_1|^2 \cdot |V_2|^2$. As such, we suggest a different approach that does not rely on solving a QAP.
We propose a pipeline for graph matching based on siamese networks (Bromley et al., 1993) and distance metric learning between graph vertices. Under this approach, two graphs are processed in parallel and independently, their intermediate representations are produced, and the matching is synthesized from these representations. The model can be divided into three consecutive parts: graph construction, graph processing and matching synthesis.
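As a quick illustration of the size argument, the sketch below counts the entries of the pairwise cost tensor and, for comparison, the entries of a vertex-to-vertex matrix of shape |V1| × |V2|. The graph size of 15 vertices is only an example value, and the helper names are hypothetical.

```python
def pairwise_cost_entries(n1, n2):
    """Number of entries in the 4-D pairwise cost tensor d[i, i', j, j']."""
    return n1 ** 2 * n2 ** 2

def vertex_matrix_entries(n1, n2):
    """Number of entries in a |V1| x |V2| vertex-to-vertex matrix."""
    return n1 * n2

n1 = n2 = 15  # example graph sizes
tensor_size = pairwise_cost_entries(n1, n2)
matrix_size = vertex_matrix_entries(n1, n2)
print(tensor_size, matrix_size, tensor_size // matrix_size)
```

For this example the tensor holds 50625 entries against 225 for the matrix, and the ratio grows as |V1| · |V2|.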

Graph Construction
Graph construction methods are domain-specific. In this work, we use a set of images with pre-specified keypoints at known coordinates, and every image is transformed into a graph. Graph vertices correspond to the keypoints of the image. We employ a pre-trained convolutional neural network to produce a feature map for the image, and use the values of that feature map taken at the keypoints as vertex features. Graph edges correspond to the edges of the Delaunay triangulation of the keypoint set, with edge length used as the edge attribute. The pipeline for attributed graph construction from images is presented in Fig. 1.

Graph Processing
Now consider two attributed graphs constructed from objects. We propose a machine learning model that learns to produce a matching between object graphs by learning a distance metric on graph vertices. The model can be divided into the following main stages:
1. an embedding stage that, given attributed graphs, constructs a new representation for each vertex and edge using the provided graph;
2. a similarity computation stage that produces a pairwise similarity matrix between components from these representations.
For the embedding stage, we propose graph convolutional networks, and for the similarity computation stage, we propose metric learning on vertices. The pipeline is presented in Fig. 2.
On the embedding stage, both graphs are processed independently and in parallel using the same model. To obtain secondary representations of the vertices, we employ a graph convolutional network (GCN) (Kipf and Welling, 2019), also known as a message passing neural network. Under this approach, each layer of the network calculates new features for a graph vertex using both the features of the vertex itself and those of its neighbors. A conventional GCN does not make use of edge attributes; we, however, propose an extended version that also incorporates that information. In this model, each layer accepts an attributed directed graph $G = \langle V, E, \mu, \varepsilon \rangle$ and recalculates its vertex and edge attributes using two formulas, a vertex attribute transformation and an edge attribute transformation. In these formulas, $\mu_k$ and $\mu'_k$ are the input and output feature vectors for vertex $k \in V$ respectively, $\varepsilon_{ij}$ and $\varepsilon'_{ij}$ are the input and output feature vectors for edge $(i, j) \in E$ respectively, $N_{in}(k)$ and $N_{out}(k)$ are the sets of neighbors of vertex $k$ that have an edge going from them to $k$ and from $k$ to them respectively, $W$ and $b$ denote trained model parameters, and $\sigma$ denotes an activation function. If $N_{in}(k)$ or $N_{out}(k)$ is empty, the corresponding term is simply not computed; this means that for an isolated vertex the layer is identical to a dense layer, and if the graph has no edges at all (that is, it is simply a set of vertices), the GCN is equivalent to applying an MLP to each component. The purpose of this part is to produce representations of the vertices to match; therefore, we discard the edge features at the end.
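A layer of this kind might be sketched as follows. The exact layer formulas are not reproduced in this text, so the form below is an assumption: each vertex combines its own features with the mean of messages from its in-neighbors and out-neighbors, and each edge combines its own features with those of its two endpoints. The parameter names `W_eself`, `W_vfrom` and `W_vto` follow the terms named in the complexity discussion; all other names, the mean aggregation and the row-vector feature layout are hypothetical choices made for illustration.

```python
import numpy as np

def gcn_layer(mu, eps, edges, p):
    """One assumed GCN layer with edge attributes.

    mu: (|V|, m) vertex features, eps: (|E|, n) edge features,
    edges: list of directed edges (i, j), p: dict of parameters.
    """
    n_vertices = mu.shape[0]
    new_mu = mu @ p["W_vself"] + p["b_v"]      # dense part, always computed
    in_sum = np.zeros_like(new_mu)
    out_sum = np.zeros_like(new_mu)
    in_cnt = np.zeros(n_vertices)
    out_cnt = np.zeros(n_vertices)
    for e, (i, j) in enumerate(edges):
        # message arriving at j from its in-neighbor i
        in_sum[j] += mu[i] @ p["W_vin"] + eps[e] @ p["W_ein"]
        in_cnt[j] += 1
        # message arriving at i from its out-neighbor j
        out_sum[i] += mu[j] @ p["W_vout"] + eps[e] @ p["W_eout"]
        out_cnt[i] += 1
    has_in, has_out = in_cnt > 0, out_cnt > 0  # empty neighborhoods are skipped
    new_mu[has_in] += in_sum[has_in] / in_cnt[has_in, None]
    new_mu[has_out] += out_sum[has_out] / out_cnt[has_out, None]

    src = np.array([i for i, _ in edges], dtype=int)
    dst = np.array([j for _, j in edges], dtype=int)
    new_eps = (eps @ p["W_eself"] + mu[src] @ p["W_vfrom"]
               + mu[dst] @ p["W_vto"] + p["b_e"])
    # tanh for vertices and ReLU for edges, as in the experimental setup
    return np.tanh(new_mu), np.maximum(new_eps, 0.0)

rng = np.random.default_rng(0)
m, n, m_out, n_out = 4, 1, 3, 1
params = {
    "W_vself": rng.normal(size=(m, m_out)), "W_vin": rng.normal(size=(m, m_out)),
    "W_vout": rng.normal(size=(m, m_out)), "W_ein": rng.normal(size=(n, m_out)),
    "W_eout": rng.normal(size=(n, m_out)), "b_v": rng.normal(size=m_out),
    "W_eself": rng.normal(size=(n, n_out)), "W_vfrom": rng.normal(size=(m, n_out)),
    "W_vto": rng.normal(size=(m, n_out)), "b_e": rng.normal(size=n_out),
}
mu = rng.normal(size=(3, m))
edges = [(0, 1), (1, 0)]                        # vertex 2 is isolated
eps = rng.normal(size=(len(edges), n))
new_mu, new_eps = gcn_layer(mu, eps, edges, params)
```

In this sketch the isolated vertex 2 receives only the dense part of the update, which mirrors the remark that the layer reduces to a dense layer for isolated vertices.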
As the distance metric between vertices of graphs $G_1$ and $G_2$, we suggest using a conventional distance between vertex embeddings. We use the Mahalanobis metric, as is typically learned in contemporary metric learning problems (Bellet et al., 2013). To that end, we simply apply a linear transformation to the vertex representations of both graphs and calculate the pairwise Euclidean distances. The result is a numeric matrix $D$ of shape $|V_1| \times |V_2|$ of pairwise distances between graph vertices.
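The equivalence between a learned linear transformation followed by Euclidean distance and a Mahalanobis distance can be sketched as below. The matrix `L`, the embedding sizes and the random inputs are illustrative assumptions; the Mahalanobis matrix is taken to be M = LᵀL.

```python
import numpy as np

def pairwise_distances(z1, z2, L):
    """D[i, j] = ||L z1[i] - L z2[j]||, with vertex embeddings stored as rows."""
    y1 = z1 @ L.T
    y2 = z2 @ L.T
    diff = y1[:, None, :] - y2[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))

rng = np.random.default_rng(0)
k = 16                                  # embedding size (example value)
z1 = rng.normal(size=(5, k))            # embeddings of the vertices of G1
z2 = rng.normal(size=(7, k))            # embeddings of the vertices of G2
L = rng.normal(size=(k, k))             # stands in for the learned transform

D = pairwise_distances(z1, z2, L)

# The same distance written in Mahalanobis form with M = L^T L.
M = L.T @ L
delta = z1[0] - z2[3]
mahalanobis = np.sqrt(delta @ M @ delta)
print(D.shape, np.isclose(D[0, 3], mahalanobis))
```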

Matching Synthesis
The matching synthesis stage takes the pairwise distance matrix $D$ and produces the binary matrix $R$ by binarizing $D$. We use a simple threshold rule: if the distance is less than a threshold, the vertices match; otherwise, they do not. The matching stage is only used to produce the matching itself and is not used during the learning process.
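The threshold rule can be written in a few lines. The threshold value and the example distances below are placeholders, since no specific value is given here.

```python
import numpy as np

def synthesize_matching(D, threshold):
    """Binarize a distance matrix: 1 where the distance is below the threshold."""
    return (np.asarray(D) < threshold).astype(int)

# Hypothetical distances between 2 vertices of G1 and 3 vertices of G2.
D_example = np.array([[0.2, 1.5, 0.9],
                      [1.1, 0.3, 2.0]])
matching = synthesize_matching(D_example, threshold=0.5)
print(matching)
```

Note that a plain threshold does not by itself enforce that each vertex is matched at most once.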

Model Learning
The nature of our method allows us to combine the embedding stage and the similarity computation stage into a single pipeline that allows for backpropagation. It should be noted that in our particular case the graph feature extractor based on a convolutional neural network can be included in the pipeline as well, allowing for fine-tuning of the feature extraction.
The training set consists of pairs of objects. Each pair has an associated binary matrix $R$ that represents the actual target relation between the parts. For each object we construct a directed attributed graph of parts as explained in 3.1.
For a pair of graphs from the training set, we perform a forward pass up to the distance matrix $D$. As the surrogate loss function, we suggest an MSE-inspired loss $L(R, D) = \|R - \exp(-D^2)\|_F^2 / (|V_1| \cdot |V_2|)$, where the exponential and the square of $D$ are taken elementwise. This finishes the model definition.
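A direct NumPy transcription of this loss, with the exponential and square applied elementwise, might look as follows. The example matrices are made up to show the two extremes.

```python
import numpy as np

def surrogate_loss(R, D):
    """Squared Frobenius norm of R - exp(-D^2), divided by |V1| * |V2|."""
    R = np.asarray(R, dtype=float)
    S = np.exp(-np.asarray(D, dtype=float) ** 2)   # similarity in (0, 1]
    return float(np.sum((R - S) ** 2) / R.size)

R_example = np.eye(2)

# Matching vertices at distance 0, non-matching vertices far apart.
good_D = np.array([[0.0, 10.0], [10.0, 0.0]])
# All distances zero: every pair looks like a match.
bad_D = np.zeros((2, 2))

loss_good = surrogate_loss(R_example, good_D)
loss_bad = surrogate_loss(R_example, bad_D)
print(loss_good, loss_bad)
```

The loss is close to zero when matching vertices are close and non-matching ones are far apart, and it equals 0.5 in the second case because two of the four entries are off by one.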

Computational Efficiency
We stress here that our model differs dramatically from a traditional QAP-based method. During the matching itself, we do not deal with pairs of pairs of vertices and their associated 4-D cost tensor $d_{ii',jj'}$. In fact, we discard edge attributes and all edge information after producing vertex representations. We expect this to positively affect the computational efficiency of our method. Here, we discuss the computational complexity of our model.
Suppose we have two graphs $G_t = \langle V_t, E_t, \mu^t, \varepsilon^t \rangle$, $t = 1, 2$. Each graph $G_t$ is defined by its adjacency matrix $A_t$, a binary matrix of shape $|V_t| \times |V_t|$, its vertex feature matrix $\mu^t$, a continuous matrix of shape $m \times |V_t|$, and its edge feature matrix $\varepsilon^t$, a continuous matrix of shape $n \times |E_t|$. From these, we can calculate the following auxiliary matrices:
1. $G_t$, $H_t$: incidence matrices, binary matrices of shape $|V_t| \times |E_t|$. If edge $e \in E_t$ begins in vertex $i \in V_t$ and ends in $j \in V_t$, then $(G_t)_{ie} = 1$ and $(H_t)_{je} = 1$; all other entries are 0. These matrices are very sparse, with exactly one 1 in each column. They can be obtained from $A_t$ in $O(|V_t|^2)$, and $G_t H_t^T = A_t$. We note that our baseline competitor (Zanfir and Sminchisescu, 2018) makes use of these matrices as well, as inspired by (Zhou and de la Torre, 2012).
We will also denote by $[x]$ a diagonal matrix with the vector $x$ on its main diagonal.
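The construction of the incidence matrices from an adjacency matrix, and the identity that recovers the adjacency matrix from them, can be checked with a small sketch. The function name and the example graph are illustrative; dense arrays are used for brevity, although a sparse format would be the natural choice in practice.

```python
import numpy as np

def incidence_matrices(A):
    """Build G (edge sources) and H (edge targets) from a binary adjacency matrix.

    Both have shape (|V|, |E|); column e has a single 1, in the row of the
    vertex where edge e begins (G) or ends (H).
    """
    src, dst = np.nonzero(A)             # one (i, j) pair per directed edge
    n_vertices, n_edges = A.shape[0], len(src)
    G = np.zeros((n_vertices, n_edges), dtype=int)
    H = np.zeros((n_vertices, n_edges), dtype=int)
    G[src, np.arange(n_edges)] = 1
    H[dst, np.arange(n_edges)] = 1
    return G, H

# Example directed graph on 3 vertices with 4 edges.
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]])
G, H = incidence_matrices(A)
print(np.array_equal(G @ H.T, A))
```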

ct = (
Let us investigate the computational complexity of the forward pass of the GCN vertex layer (2):
1. $W_{eself} \varepsilon_{ij}$ in matrix form can be computed as $W_{eself} \varepsilon^t$, and its complexity is $O(|E_t|)$;
2. $W_{vfrom} \mu_i + W_{vto} \mu_j$ in matrix form can be computed as $W_{vfrom} \mu^t G_t + W_{vto} \mu^t H_t$, and its complexity is $O(|E_t| \cdot |V_t|)$;
3. the complexity of the other parts of (3) is negligible compared to the ones above.
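The matrix form in item 2 can be checked numerically against the per-edge expression. In the sketch below, features are stored as columns (shape m × |V|) as in the text; the graph, the sizes and the random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]   # example directed edges (i, j)
n_vertices, n_edges = 3, len(edges)
m, k = 4, 2                                # input and output feature sizes

mu = rng.normal(size=(m, n_vertices))      # vertex features as columns
W_vfrom = rng.normal(size=(k, m))
W_vto = rng.normal(size=(k, m))

# Incidence matrices: G marks where each edge begins, H where it ends.
G = np.zeros((n_vertices, n_edges))
H = np.zeros((n_vertices, n_edges))
for e, (i, j) in enumerate(edges):
    G[i, e] = 1.0
    H[j, e] = 1.0

# Matrix form: column e of (mu @ G) is the feature vector of the source of edge e.
matrix_form = W_vfrom @ mu @ G + W_vto @ mu @ H

# Per-edge form: W_vfrom mu_i + W_vto mu_j for every edge (i, j).
per_edge = np.stack([W_vfrom @ mu[:, i] + W_vto @ mu[:, j] for i, j in edges],
                    axis=1)
print(np.allclose(matrix_form, per_edge))
```

Because each column of the incidence matrices contains a single 1, the products with them amount to gathering columns, which is why sparse storage keeps this step cheap.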

Dataset
To test our method, we apply it to matching points in images. We use the CUB-200-2011 dataset (Wah et al., 2011). The dataset contains almost 12000 photographs of birds of different species and in different poses. In each photo, no more than 15 keypoints are marked, each annotated with its type. There are 15 types of points in total, denoting different parts of the bird's body. In a photo, no two points have the same type.
For each image, the bounding box of a bird is provided in the dataset. We use this information to normalize the images. We cut out the bird from the image using bounding box information and reshape it to the size of 224 × 224. The annotated point coordinates are transformed accordingly.
The original dataset provides no graphs. We construct the graph edges using the Delaunay triangulation of the keypoint set. For vertex feature extraction, we use the MobileNetV2 convolutional network (Sandler et al., 2018). As our feature map, we use the output of the block_5_expand_relu layer. Then, we use the elements of the feature map at the positions that correspond to the coordinates of the keypoints as vertex features. This means that the size of the vertex feature vectors in an object graph is 192. We use the Euclidean distance between points as the only edge feature. We do not fine-tune our feature extraction model.
The original dataset comes already split into non-overlapping train and test subsets of almost 6000 images each. We make use of this and select training and test pairs from the respective subsets. This ensures that images used for training are never used for testing, and vice versa. Unlike in (Zanfir and Sminchisescu, 2018), we use arbitrary test pairs, so we cannot expect the birds in each tested pair to be in similar poses.
The original dataset has no target relations. We assume in our matching problem that the same body parts in two images match. If a body part is visible in one image of a pair but not in the other, it does not have any match.
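The feature sampling and edge attribute steps can be sketched as below. Several things here are assumptions made for illustration: the feature map is a random stand-in of shape 28 × 28 × 192 (the spatial size one would expect from this layer for a 224 × 224 input), keypoints are mapped to feature map cells by simple scaling and truncation, and the edge list is hand-picked rather than computed by a Delaunay triangulation.

```python
import numpy as np

def vertex_features(feature_map, keypoints, image_size=224):
    """Pick the feature vector at the cell that contains each (x, y) keypoint."""
    h, w, _ = feature_map.shape
    rows = np.clip((keypoints[:, 1] * h / image_size).astype(int), 0, h - 1)
    cols = np.clip((keypoints[:, 0] * w / image_size).astype(int), 0, w - 1)
    return feature_map[rows, cols]

def edge_lengths(keypoints, edges):
    """Euclidean distance between the endpoints of each edge."""
    e = np.asarray(edges)
    return np.linalg.norm(keypoints[e[:, 0]] - keypoints[e[:, 1]], axis=1)

# Random stand-in for the network output; a real pipeline would use the CNN.
feature_map = np.random.default_rng(0).random((28, 28, 192))

# Hypothetical keypoints (x, y) in a 224 x 224 image and hand-picked edges.
keypoints = np.array([[10.0, 20.0], [100.0, 50.0], [223.0, 223.0]])
edges = [(0, 1), (1, 2), (0, 2)]

feats = vertex_features(feature_map, keypoints)
lengths = edge_lengths(keypoints, edges)
print(feats.shape, lengths)
```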

Models and Metrics
We consider two models: a baseline model based on (Zanfir and Sminchisescu, 2018), and our model. Both models begin with the same feature extraction step and embedding step. After that, the baseline pipeline learns and approximately solves a QAP, while our model performs distance metric learning and fast matching.
Our baseline competitor is not identical to the one described in (Zanfir and Sminchisescu, 2018). First, we use our own vertex and edge features. Second, we extend the model by adding the same embedding as in our approach. This change, however, is expected to make our baseline stronger, because we allow vertex and edge features to incorporate information from the neighborhood, unlike in the original. Only after we obtain representations for both vertices and edges from the GCN do we proceed with the pipeline from (Zanfir and Sminchisescu, 2018) to construct and solve the QAP. In this case, we do not discard edge information. The baseline competitor still contains a large 4D tensor in its pipeline.
For the embedding step in both the baseline and our model, we use a GCN of 4 layers with output dimensions of 128, 64, 32 and 16 for vertex features and 1 for edge features. As activation, we use the hyperbolic tangent for vertex features and ReLU for edge features. We also apply a dense layer with linear activation, a square weight matrix and no bias to the final vertex features. This layer performs a linear transform of the vertex representations, which is equivalent to learning a Mahalanobis distance.
Here R (actual matching) and M (predicted matching) are interpreted both as binary matrices of size |V1| × |V2| and as subsets of V1 × V2. Accuracy is a standard quality metric for graph matching. We also decided to use the Jaccard measure because the number of pairs in R is small compared to the size of V1 × V2.
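Both metrics can be computed directly from the binary matrices. The Jaccard measure below is the standard ratio of intersection to union of the two sets of matched pairs. The accuracy definition is an assumption, since its formula is not given here: it is taken to be the fraction of matrix entries on which R and M agree. Returning 1 for two empty matchings is likewise a convention chosen for this sketch.

```python
import numpy as np

def jaccard(R, M):
    """|R ∩ M| / |R ∪ M| for matchings viewed as sets of vertex pairs."""
    R, M = np.asarray(R, dtype=bool), np.asarray(M, dtype=bool)
    union = np.logical_or(R, M).sum()
    if union == 0:
        return 1.0                      # both matchings empty (convention)
    return float(np.logical_and(R, M).sum() / union)

def entrywise_accuracy(R, M):
    """Assumed accuracy: share of entries where R and M agree."""
    R, M = np.asarray(R, dtype=bool), np.asarray(M, dtype=bool)
    return float(np.mean(R == M))

# Hypothetical actual and predicted matchings for |V1| = 2, |V2| = 3.
R_true = [[1, 0, 0],
          [0, 1, 0]]
M_pred = [[1, 0, 0],
          [0, 0, 1]]

print(jaccard(R_true, M_pred), entrywise_accuracy(R_true, M_pred))
```

In this example one of the three pairs in the union is shared, so the Jaccard measure is 1/3, while four of the six entries agree, which shows how entrywise agreement can look high when R is sparse.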
During the experiments, the model is trained on random pairs of images drawn from training subset, and performance metrics are averaged over random pairs of images drawn from test subset.

Results
We have trained both our model and our baseline competitor on the same dataset and compared their average matching accuracies and Jaccard measures. The results are provided in Table 1. This shows that our model clearly outperforms the competitor in general.
In addition, we have studied the robustness of the methods. Namely, we wanted to know how the methods behave when the actual matching is small, that is, when the objects have few parts in common. To that end, we recorded the average matching accuracies and Jaccard measures for actual match sizes from 6 to 13 (the other sizes were too rare to draw conclusions from). The results are presented in Figs. 3 and 4. It can clearly be seen that our matching model outperforms the baseline for every size.

CONCLUSION
We have presented and examined a novel machine learning-based approach to graph matching that abandons the typical method of solving a quadratic assignment problem and instead uses a siamese graph convolutional network that performs distance metric learning on graph vertices. We have demonstrated empirically that our approach outperforms the traditional QAP-based graph matching approach. We have also provided a theoretical estimate of the computational complexity of the approach, showing that under many circumstances it is less computationally intensive than QAP-based ones.