Development of Neuromorphic SIFT Operator with Application to High Speed Image Matching

There has always been a speed/accuracy challenge in the photogrammetric mapping process, including feature detection and matching. Most studies have improved algorithm speed through simplifications or software modifications that affect the accuracy of the image matching process. This research tries to improve speed without changing the accuracy of the same algorithm, using Neuromorphic techniques. We have developed a general design of a Neuromorphic ASIC to handle algorithms such as SIFT, and we have investigated the neural assignment in each step of the SIFT algorithm. With a rough estimation based on the delay of the elements used, including the MAC and the comparator, we have estimated the resulting chip's performance for three scenarios: Full HD movie (videogrammetry), 24 MP image sequence (UAV photogrammetry), and 88 MP image sequence. Our estimations led to approximately 3000 fps for Full HD movie, 250 fps for the 24 MP image sequence and 68 fps for the 88 MP Ultracam image sequence, which can be a huge improvement for current photogrammetric processing systems. We also estimated a power consumption of less than 10 watts, which is far below that of current workflows.


INTRODUCTION
Neuromorphic techniques include both electronic and bio-inspired studies, which enable the user to reach a much higher performance per watt in compatible algorithms. Several companies such as Intel, Qualcomm and IBM are currently developing Neuromorphic chips so that developers will be able to use the neural architecture in many studies. This improvement demands a significant change in the algorithm's processing architecture. The resulting architecture can be implemented using customized CMOS chips. However, it can also be simulated using FPGA elements to define each neural unit and produce an equivalent of that CMOS chip at a lower speed. The Neuromorphic framework in this research is not restricted to the SIFT operator and can be used dynamically to compute any other mathematical operator, from comparison-based (binary) operators like FREAK to bio-inspired object recognition methods such as HMAX. Generally, this method is based on neural units as computational elements used for operations such as averaging, Gaussian filtering or comparison. The entire process is based on logic gates which are packed into a neural unit. There are four types of neural units, which are developed using a CMOS library and can be simulated with FPGAs. Based on the input data (image sequence vs. high resolution images), the assignment of these neural units may vary to reach the maximum throughput.
Although this research attempts purely to speed up the algorithm, it also presents a scalable framework for the SIFT operator that reaches very high frame rates, since the processor incurs no delay for analysing input commands. This results in several thousand frames per second. This estimation is based on a computationally equivalent algorithm on the CM1K Application Specific Integrated Circuit (ASIC) package, which reaches a delay of 500 nanoseconds. The resulting algorithm can be used on a customized ASIC package as a co-processor at any computational scale. At first glance, several thousand frames per second are not needed for many applications, but this large improvement in performance translates into much lower power consumption in the 30-60 fps range. Lowering the power consumption can be useful in devices such as smart glasses, smartphones, security cameras and smart cars, and it leads to lower temperature, passive cooling and a smaller package size for the computational unit. Development of such ASIC packages demands extensive electronics work, which is not the main goal of this research; consequently, a mid-range FPGA is used to simulate the ASIC chip in SIFT computations, which lowers the target frame rate to several hundred fps. The test images include 24 MP aerial images and an HD close-range image sequence. The processing time is then compared to several commercial photogrammetry software packages such as AGISoft to show the applicability of the research.

LITERATURE REVIEW
In this section, FPGA and ASIC applications in image processing are explained and some commercial devices are mentioned. Then the FPGA implementation of SIFT and some of the optimizations made by other researchers are briefly described.

ASIC and FPGAs for image processing
Since an FPGA implements the logic required by an application by building separate hardware for each function, FPGAs are inherently parallel. This gives them the speed that results from a hardware design while retaining the reprogrammable flexibility of software at a relatively low cost. This makes FPGAs well suited to image processing, particularly at the low and intermediate levels, where they are able to exploit the parallelism inherent in images. Nowadays this type of processing is found in many security, traffic and professional cameras, in which an FPGA is coupled directly to the CMOS sensor. Altera Cyclone, Xilinx and Lattice are the most used FPGA brands in these cameras. ASICs differ from FPGAs in that they are optimized for a specific application, which leads to a more costly solution as well as a higher speed. An FPGA requires 20-40 times the silicon area of an equivalent ASIC, but it is cheaper due to its added value in higher volumes of production. On the other hand, ASICs are much faster than equivalent FPGAs, consume less power and are smaller, so that the form factor is completely different. Examples of ASICs are image processors such as BIONZ in Sony, DIGIC in Canon and EXPEED in Nikon cameras. ASICs are also used in fields other than RAW image processing, such as the CM1K in neural network development. Metaio also had an ASIC under development in past years to handle image and location processing algorithms for Augmented Reality applications.

FPGA SIFT
There have been many efforts at SIFT parallelization, including [4-9], each of which improved the algorithm's performance by implementing it on an FPGA device. [10] used five-element Gaussian filters and four iterations of CORDIC to calculate the magnitude and direction of the gradient. They also simplified the gradient calculation, with only one filter used per octave rather than one per scale. [11] made the observation that after downsampling the data volume is reduced by a factor of four. This allows all of the processing for the second octave and below to share a single Gaussian filter block and peak detect block. [12] took the processing one step further and also implemented the descriptor extraction in hardware. The gradient magnitude used the simplified formulation of eq. 1 rather than the square root, and a small lookup table was used to calculate the orientation. They used a modified descriptor that had fewer parameters but was easier to implement in hardware. The feature matching, however, was performed in software using an embedded MicroBlaze processor.
$$ M = |G_x| + |G_y| \qquad (1) $$
where M denotes the gradient magnitude and G_x and G_y the horizontal and vertical gradient components.
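As a rough illustration of why the simplified magnitude of eq. 1 is attractive in hardware, the following Python sketch compares the sum of absolute values with the exact Euclidean magnitude. The function names and sample gradients are illustrative assumptions and are not taken from [12].

```python
import math

def magnitude_exact(gx, gy):
    """Euclidean gradient magnitude (needs two multiplies and a square root)."""
    return math.sqrt(gx * gx + gy * gy)

def magnitude_simplified(gx, gy):
    """Sum of absolute values: two sign removals and one addition."""
    return abs(gx) + abs(gy)

# Hypothetical sample gradients, used only to show the size of the error.
samples = [(3.0, 4.0), (-5.0, 0.0), (2.0, -2.0)]
ratios = [magnitude_simplified(gx, gy) / magnitude_exact(gx, gy)
          for gx, gy in samples]
```

The simplified form never underestimates the true magnitude and overestimates it by at most a factor of sqrt(2), which occurs on the diagonals.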

PROPOSED METHOD
In this section we describe the proposed method in detail, including neural assignment, network size and reusability of the neurons.

Neuron Types
There are five types of electronic neurons defined in this network, each of which is capable of forming an independent part of the network that can be used by specific functions of the SIFT computation. The most important operations in any photogrammetric computation are multiplication and accumulation. So, there are two types of electronic neurons for multiplication and accumulation in this network, which are called N1 and N2 respectively. The third neural element is one of the best-known hardware elements of the past decade, the MAC, which is formed by combining several multiplication units with an accumulation unit. We have used a very fast implementation of the MAC, [1], in our NP1 neural element. The fourth and fifth types of electronic neurons are not arithmetic functions; instead, they solve logical problems. The N3 neural unit is a logical element that can act as any logic gate, such as NAND, NOR or XOR, only by changing its inputs. The N3 neural unit is also capable of producing constant true or false outputs independent of its main inputs. It will be used to adapt the neural network to algorithms other than SIFT, which are not the subject of this paper.
Comparators are needed in any photogrammetric problem that involves thresholding. So, the last type of electronic neuron, called N4 in this paper, is a 16-bit comparator which can be used in neighbourhood comparison in SIFT as well as in any comparison operation in other algorithms such as FREAK. In the following sections the architectures of NP1-N4, their input and output formats and their measured delays are described.

Multiply and Accumulator Architecture (NP1)
NP1 contains the most important element in every photogrammetric calculation, including collinearity equations, fundamental matrix calculation, and descriptor generation. It is originally a MAC unit developed by [1], which has eight inputs of 16-bit floating point numbers. In order to fit the Gaussian kernel we modify this unit to have ten inputs, which fits the five-element kernel size. It also generates one 16-bit output, which is computed as shown in fig. 1.
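The behaviour of a ten-input NP1 unit can be sketched in software as follows. This is only an illustrative model: rounding to half precision after every multiply and every add is an assumption about the hardware, and the names `to_half` and `np1_mac` are hypothetical.

```python
import struct

def to_half(value):
    """Round a Python float to the nearest IEEE 754 16-bit float."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

def np1_mac(pixels, weights):
    """Ten inputs (five pixels, five weights), one 16-bit output."""
    if len(pixels) != 5 or len(weights) != 5:
        raise ValueError("NP1 expects exactly five pixels and five weights")
    acc = 0.0
    for p, w in zip(pixels, weights):
        product = to_half(to_half(p) * to_half(w))  # multiplication part (N1)
        acc = to_half(acc + product)                # accumulation part (N2)
    return acc

# Binomial approximation of a five-tap Gaussian (an assumed kernel).
kernel = [1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16]
flat_response = np1_mac([10.0] * 5, kernel)
```

Because the kernel weights sum to one, a flat neighbourhood is returned unchanged, which is a quick sanity check for any MAC-based smoothing unit.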

Gaussian Convolution
Any convolution algorithm consists of several Multiplication and Accumulation (MAC) operations, so the first section of our proposed neural network is the network of NP1 neural elements. In order to increase the usability of this section, each NP1 is divided into N1 and N2 neural units so that the multiplication and accumulation parts can be used separately. The size of the NP1 network depends strongly on its compatibility with image resolutions such as HD, full HD and so on. On the other hand, there are two limitations, due to the production and speed aspects of this problem. The production limitation is the maximum number of transistors in a single chip, which limits the size of our computational network. The speed limitation concerns the number of computation cycles needed for all pixels in a single image. If the NP1 network is too small, a single computation has to be spread over too many cycles, so that the total speed of the entire chip decreases. In addition, the exact assignment of each neural unit should be specified before determining the optimal size for the computational networks. A convolution operation depends on the kernel size as well as the size of the image; in this case one of the image dimensions should be a multiple of the number of neurons in this network. To keep it simple, we embedded the kernel size in each neural element. Due to the 5 x 5 kernel size, each NP1 neuron should be able to handle 5 MACs in each cycle, so that each NP1 neuron has 10 inputs (5 pixels and 5 weights) and one output. Assuming an HD or full HD input frame, the number of NP1 neurons should be the largest number that divides both 1280 and 1920, which leads to 640 NP1 neurons.
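The sizing argument above can be checked with a short Python sketch. It computes the largest neuron count that divides both frame widths and then filters one image row in passes of that many output pixels. The function `convolve_row` and the edge replication at the borders are assumptions made for illustration, not part of the chip design.

```python
import math

FRAME_WIDTHS = (1280, 1920)               # HD and full HD row lengths
np1_count = math.gcd(*FRAME_WIDTHS)       # largest common divisor of both widths
passes_per_row = {w: w // np1_count for w in FRAME_WIDTHS}

def convolve_row(row, kernel, units):
    """Filter one row, producing `units` output pixels per pass (one NP1 each)."""
    half = len(kernel) // 2
    n = len(row)
    out, passes = [], 0
    for start in range(0, n, units):
        passes += 1
        for x in range(start, min(start + units, n)):
            # One NP1 neuron: five multiply-accumulate steps, borders replicated.
            out.append(sum(k * row[min(max(x + j - half, 0), n - 1)]
                           for j, k in enumerate(kernel)))
    return out, passes

kernel = [0.0625, 0.25, 0.375, 0.25, 0.0625]   # assumed five-tap kernel
filtered, passes = convolve_row([1.0] * 1920, kernel, np1_count)
```

With 640 units, a full HD row needs three passes and an HD row needs two, so no neuron is left idle in either format.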

Difference of Gaussians
The same window cannot be used to detect key-points at different scales. It works for small corners, but larger windows are needed to detect larger corners. Scale-space filtering is used to solve this. In the scale-space filtering process, the Laplacian of Gaussian (LoG) of the image is found for various σ values. LoG acts as a blob detector which detects blobs of various sizes as σ changes. In other words, σ acts as a scaling parameter. For example, a Gaussian kernel with low σ gives a high value for a small corner, while a Gaussian kernel with high σ fits well for a larger corner. So, we can find the local maxima across scale and space, which gives us a list of (x,y,σ) values, each meaning that there is a potential key-point at (x,y) at scale σ. The SIFT algorithm uses the Difference of Gaussians (DoG), which is an approximation of LoG. The Difference of Gaussians is obtained as the difference between Gaussian blurrings of an image with two different values of σ, say σ and kσ. This process is done for the different octaves of the image in the Gaussian pyramid, as shown in figure… After computation of the Gaussians in each octave, the difference operation can be done using the same array of adders (640 N2 units) in the NP1 network. We have used the adder architecture described in [1] because of its simplicity of implementation and high speed.
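A minimal software sketch of the Difference of Gaussians step is given below, assuming a separable five-tap kernel, replicated borders and k = sqrt(2). None of these choices is specified in the text, and the function names are hypothetical.

```python
import math

def gaussian_kernel(sigma, size=5):
    """Normalized five-tap Gaussian kernel (truncated, an assumption)."""
    half = size // 2
    w = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-half, half + 1)]
    total = sum(w)
    return [v / total for v in w]

def blur_rows(img, kernel):
    """Convolve every row with the kernel, replicating border pixels."""
    half = len(kernel) // 2
    out = []
    for row in img:
        n = len(row)
        out.append([sum(k * row[min(max(x + j - half, 0), n - 1)]
                        for j, k in enumerate(kernel)) for x in range(n)])
    return out

def transpose(img):
    return [list(col) for col in zip(*img)]

def gaussian_blur(img, sigma):
    """Separable blur: rows first, then columns."""
    k = gaussian_kernel(sigma)
    return transpose(blur_rows(transpose(blur_rows(img, k)), k))

def difference_of_gaussians(img, sigma, k=math.sqrt(2.0)):
    """L(k*sigma) - L(sigma): one subtraction per pixel (the N2 adders)."""
    fine = gaussian_blur(img, sigma)
    coarse = gaussian_blur(img, k * sigma)
    return [[c - f for f, c in zip(rf, rc)] for rf, rc in zip(fine, coarse)]

# A single bright pixel: the DoG response at the blob centre is negative.
spot = [[0.0] * 9 for _ in range(9)]
spot[4][4] = 1.0
dog_spot = difference_of_gaussians(spot, 1.0)
```

The subtraction itself needs no multiplier, which is why the adder part of the NP1 network is sufficient for this stage.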

Fine Scale Calculation
Once potential key-point locations are found, they have to be refined to get more accurate results. A Taylor series expansion of the scale space is used to get a more accurate location of each extremum, and if the intensity at the extremum is less than a threshold value, it is rejected. DoG has a higher response for edges, so edges also need to be removed. A concept similar to the Harris corner detector is used for edge removal: a 2x2 Hessian matrix (H) is used to compute the principal curvatures, and if the ratio of the eigenvalues is greater than a threshold, called the edge threshold, the key-point is discarded. Applying the contrast and edge thresholds eliminates low-contrast key-points and edge key-points, and what remains are strong interest points.
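The eigenvalue ratio test can be evaluated without computing eigenvalues, using the trace and determinant of H as in the original SIFT formulation. The sketch below assumes the commonly used ratio r = 10, which is not stated in this paper, and the function name is hypothetical.

```python
def passes_edge_test(dxx, dyy, dxy, r=10.0):
    """Return True when the key-point is kept (not edge-like).

    Equivalent to trace(H)^2 / det(H) < (r + 1)^2 / r, rearranged so that
    only multiplications and a single comparison are needed.
    """
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0.0:
        return False  # principal curvatures of opposite sign: reject
    return trace * trace * r < (r + 1.0) ** 2 * det

keep_blob = passes_edge_test(2.0, 2.0, 0.0)    # equal curvatures
keep_edge = passes_edge_test(20.0, 1.0, 0.0)   # curvature ratio of 20
```

Written this way, the test maps onto multiplier units followed by one comparator, with no division or square root.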

Orientation Assignment
In the SIFT operator, an orientation is assigned to each key-point to achieve invariance to image rotation. A neighbourhood is taken around the key-point location depending on the scale, and the gradient magnitude and direction are calculated in that region. An orientation histogram with 36 bins covering 360 degrees is created. It is weighted by the gradient magnitude and by a Gaussian-weighted circular window with σ equal to 1.5 times the scale of the key-point. The highest peak in the histogram is taken, and any peak above 80% of it is also considered when calculating the orientation. This creates key-points with the same location and scale but different directions, which contributes to the stability of matching.
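The histogram and the 80% rule can be sketched in Python as follows. The Gaussian window weight is passed in by the caller, peak interpolation is omitted, and the function names are illustrative assumptions.

```python
import math

def orientation_histogram(gradients, num_bins=36):
    """gradients: iterable of (gx, gy, window_weight) tuples."""
    hist = [0.0] * num_bins
    bin_width = 360.0 / num_bins
    for gx, gy, weight in gradients:
        magnitude = math.hypot(gx, gy)
        angle = math.degrees(math.atan2(gy, gx)) % 360.0
        hist[int(angle // bin_width) % num_bins] += weight * magnitude
    return hist

def dominant_orientations(hist, ratio=0.8):
    """Bin centres (degrees) of local peaks that reach ratio * highest peak."""
    n = len(hist)
    peak = max(hist)
    if peak <= 0.0:
        return []
    bin_width = 360.0 / n
    result = []
    for i, value in enumerate(hist):
        is_local_max = value > hist[i - 1] and value > hist[(i + 1) % n]
        if is_local_max and value >= ratio * peak:
            result.append((i + 0.5) * bin_width)
    return result

# Hypothetical neighbourhood: strong responses at 0 and 90 degrees, weak at 180.
grads = [(1.0, 0.0, 1.0)] * 10 + [(0.0, 1.0, 1.0)] * 9 + [(-1.0, 0.0, 1.0)] * 3
orientations = dominant_orientations(orientation_histogram(grads))
```

In this example the key-point would be duplicated with two orientations, since the second peak reaches 90% of the first, while the third peak is ignored.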

Tangent Approximation
Despite the advantages of ASIC-compatible algorithms, there is always a limitation in using complex functions such as trigonometric operations. The rotation assignment and descriptor generation parts of the SIFT algorithm both need to compute the inverse tangent of the horizontal to vertical gradient ratio, which cannot be done directly. Fortunately, many functions have been approximated in embedded computing applications, including the inverse tangent.
There are several types of approximation with different accuracies and speeds. We have used the approximation in table 3 because of its simplicity, its accuracy of 0.3 degrees and its MAC compatibility. The details of the inverse tangent approximation are shown in table 3.
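Table 3 is not reproduced in this text, so the following Python sketch is an assumption about the approximation rather than a copy of it. It uses the well-known rational form atan(y/x) ≈ xy / (x² + 0.28125 y²) with an octant-dependent base angle, which is consistent with the 0.28125 constant and the AB, A² and B² products mentioned in the performance estimate of the orientation assignment. The function name is hypothetical.

```python
import math

def approx_atan2(y, x):
    """Approximate atan2(y, x) in radians with products and one division."""
    if x == 0.0 and y == 0.0:
        return 0.0
    if abs(x) >= abs(y):
        # Octants next to the x axis: angle = base + xy / (x^2 + 0.28125 y^2)
        angle = x * y / (x * x + 0.28125 * y * y)
        if x < 0.0:
            angle += math.pi if y >= 0.0 else -math.pi
        return angle
    # Octants next to the y axis: angle = +-pi/2 - xy / (y^2 + 0.28125 x^2)
    base = math.pi / 2.0 if y > 0.0 else -math.pi / 2.0
    return base - x * y / (y * y + 0.28125 * x * x)

# Worst-case error over the full circle, in degrees.
worst_error_deg = 0.0
for step in range(3600):
    theta = -math.pi + step * (2.0 * math.pi / 3600.0)
    gx, gy = math.cos(theta), math.sin(theta)
    diff = math.degrees(approx_atan2(gy, gx) - math.atan2(gy, gx))
    diff = (diff + 180.0) % 360.0 - 180.0   # wrap the difference to [-180, 180)
    worst_error_deg = max(worst_error_deg, abs(diff))
```

Under this assumption the worst-case error stays below the 0.3 degrees quoted above, and every step is a multiplication, an addition or a single division, which suits MAC-style hardware.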

Neural Assignment
The first layer of NP1 neurons computes the image gradients by convolving the [−1 0 1] kernel with the neighbourhood. Then the NP1 neurons compute the first stage of the orientation computation according to formula (). After that, the N1 neurons perform the division in formula () to finish the inverse tangent computation. In order to generate the orientation histogram, N4 neurons are needed for comparison and categorization of the computed orientations into 32 bins, each of which covers 11.25 degrees of the 360. Decreasing the number of bins in the orientation histogram leads to better compatibility with the network size, and it is also recommended because of the inverse tangent approximation. The number could be decreased further to 16 bins to lower the orientation noise caused by the 0.3 degree accuracy, although the effect on the robustness of the SIFT operator is unknown. Decreasing 36 bins to 32 also improves the performance by reducing computation cycles by 12.5 %.
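One way to categorize an angle with comparators only is a thermometer code, in which each of 32 comparators checks the angle against one bin edge. This is a sketch of that idea under the assumption that one N4 neuron is used per bin edge. The names are hypothetical.

```python
NUM_BINS = 32
BIN_WIDTH = 360.0 / NUM_BINS                      # 11.25 degrees per bin

# One threshold per comparator: the lower edge of each bin.
THRESHOLDS = [i * BIN_WIDTH for i in range(NUM_BINS)]

def bin_by_comparators(angle_deg):
    """Bin index from 32 independent 'angle >= threshold' comparisons."""
    angle = angle_deg % 360.0
    flags = [angle >= t for t in THRESHOLDS]      # all comparisons are independent
    return sum(flags) - 1                         # thermometer code to index

example_bin = bin_by_comparators(100.0)
```

Since the comparisons do not depend on each other, they can all be evaluated in the same cycle, which matches the use of 32 comparators per key-point.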

Descriptor Generation
The descriptor generation controller acts exactly like the orientation assignment section, performing the same computations in a 16*16 neighbourhood containing 4*4 sub-blocks. For each sub-block, an 8-bin orientation histogram is created, so a total of 128 bin values are available. These are represented as a vector to form the key-point descriptor. In addition, several measures are taken to achieve robustness against illumination changes, rotation and so on.
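The layout of the 128 values can be illustrated with the simplified Python sketch below. It omits Gaussian weighting, interpolation between bins and rotation to the key-point orientation, and it uses plain vector normalization as one example of an illumination-robustness measure. These simplifications and the function name are assumptions.

```python
import math

def sift_like_descriptor(gx, gy):
    """gx, gy: 16x16 nested lists of gradients around a key-point.

    Returns 4 * 4 * 8 = 128 values, normalized to unit length.
    """
    descriptor = []
    for block_y in range(4):
        for block_x in range(4):
            hist = [0.0] * 8                       # 8 bins of 45 degrees
            for y in range(block_y * 4, block_y * 4 + 4):
                for x in range(block_x * 4, block_x * 4 + 4):
                    angle = math.degrees(math.atan2(gy[y][x], gx[y][x])) % 360.0
                    hist[int(angle // 45.0) % 8] += math.hypot(gx[y][x], gy[y][x])
            descriptor.extend(hist)
    norm = math.sqrt(sum(v * v for v in descriptor))
    return [v / norm for v in descriptor] if norm > 0.0 else descriptor

# Hypothetical patch with a uniform horizontal gradient.
patch_gx = [[1.0] * 16 for _ in range(16)]
patch_gy = [[0.0] * 16 for _ in range(16)]
vector = sift_like_descriptor(patch_gx, patch_gy)
```

Normalizing the vector makes the descriptor insensitive to a uniform scaling of the gradients, which is what a global contrast change produces.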

Performance Evaluation
In this section, an overview of the estimated performance, power consumption and area of the resulting chip layout is given. Owing to the reusability of the two-layer MAC architecture and the comparator neural units, there is a maximum of 640 MAC units in the first layer, 128 MAC units in the second layer and 4096 comparator units, which are used by the controllers of each phase of the algorithm. The transistor count, power consumption and area of each controller are currently unknown, but they will be much less than those of the neural units described here.

Gaussian Convolution
Having 3*1080*2 cycles of computation leads to about 10 microseconds of delay, given the 1.6 ns delay of each NP1 (MAC) neuron. However, this delay should be multiplied by about 6.6 to account for all octaves and scales (4 and 5 respectively), which gives about 66 microseconds of delay. Since each octave halves the image dimensions, the total processing time for all octaves is (1 + 1/4 + 1/16 + 1/64) multiplied by the number of scale levels in each octave, which gives a slowdown factor of about 6.6. Because this stage processes a massive number of pixels over a high number of scale levels, it is expected to be the slowest part of the algorithm. On the other hand, this stage will not consume the highest power since it does not use N4 neurons. The entire power consumption of this stage should be about 5 watts.
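The arithmetic behind this estimate can be reproduced with a few lines of Python. Reading the factor of 3 as 1920 / 640 passes per row and the factor of 2 as one horizontal and one vertical pass is an interpretation, not something stated explicitly in the text.

```python
WIDTH, HEIGHT = 1920, 1080        # full HD frame
NP1_COUNT = 640                   # first-layer MAC units
NP1_DELAY_NS = 1.6                # delay of one NP1 cycle
OCTAVES, SCALES = 4, 5

passes_per_row = WIDTH // NP1_COUNT               # 3 passes of 640 pixels
cycles = passes_per_row * HEIGHT * 2              # 3 * 1080 * 2
single_blur_us = cycles * NP1_DELAY_NS / 1000.0   # one Gaussian level, octave 0

# Each octave holds a quarter of the pixels of the previous one.
slowdown = SCALES * sum(0.25 ** octave for octave in range(OCTAVES))
total_us = single_blur_us * slowdown
```

Without intermediate rounding the result is slightly above the 66 microseconds quoted above, which comes from rounding to 10 microseconds and 6.6 before multiplying.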

Difference of Gaussians
In order to calculate the number of processing cycles for all difference operations, assuming a full HD image, 4 octaves and 5 scales, we have the following numbers of pixels in each octave: 2 million, 500k, 125k and 31250. Assuming 4 subtraction operations in each octave leads to the following numbers of pixels processed in each octave: 8 million, 2 million, 500k and 125k. So the total number of pixels processed in all octaves is 10625000. Having 640 NP1 neurons leads to about 16600 cycles. If we use the N2 array to subtract two images, it gives about 1.2 ns of delay in each operation, which produces about 19 microseconds of delay. Using NP1 neurons increases this delay to about 26 microseconds. So the entire DoG delay, including the 66 microseconds of Gaussian convolution, will be 92 microseconds.
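The cycle count for the subtraction stage follows directly from the figures in the text, as the short Python check below shows. Only the variable names are introduced here.

```python
pixels_per_octave = [2_000_000, 500_000, 125_000, 31_250]   # rounded, from the text
SUBTRACTIONS_PER_OCTAVE = 4        # 5 scale levels give 4 differences
UNITS = 640

total_subtractions = SUBTRACTIONS_PER_OCTAVE * sum(pixels_per_octave)
cycles = total_subtractions / UNITS

n2_delay_us = cycles * 1.2 / 1000.0     # adder-only (N2) path
np1_delay_us = cycles * 1.6 / 1000.0    # full NP1 path
gaussian_plus_dog_us = 66.0 + 26.0      # rounded stage values used in the text
```

The N2 path is faster, while the 92 microsecond total quoted above corresponds to the slower NP1 path added to the Gaussian convolution delay.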

Comparison
Assuming 3 scale levels in each octave, there are 26 * 3 * (total number of pixels in each scale level in each octave) comparisons, which leads to about 207 million comparison operations. Having 4096 N4 neurons enables the chip to process about 157*26 comparison operations in each cycle. This leads to 50757 cycles of comparison and produces about 31 microseconds of delay. This stage will consume 8 watts of power since almost all of the N4 neurons are active.
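The comparison count can be reproduced as follows. The per-cycle comparator delay of 0.62 ns is not given in this section; it is inferred here from the orientation assignment estimate (80 cycles in 49.6 nanoseconds) and should be read as an assumption.

```python
import math

pixels_all_octaves = 2_000_000 + 500_000 + 125_000 + 31_250
NEIGHBOURS = 26                 # 8 in the same level, 9 above and 9 below
SCALE_LEVELS = 3
N4_COUNT = 4096

comparisons = NEIGHBOURS * SCALE_LEVELS * pixels_all_octaves
pixels_per_cycle = N4_COUNT // NEIGHBOURS          # whole pixels served per cycle
comparisons_per_cycle = pixels_per_cycle * NEIGHBOURS
cycles = math.ceil(comparisons / comparisons_per_cycle)

N4_DELAY_NS = 0.62              # assumed, inferred from 49.6 ns / 80 cycles
delay_us = cycles * N4_DELAY_NS / 1000.0
```

Only 157 * 26 = 4082 of the 4096 comparators are busy in each cycle, because a pixel needs all 26 of its comparisons at once.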

Orientation Assignment
One of the main power consumption bottlenecks lies in the orientation assignment and descriptor generation steps, which use both NP1 and N4 neurons simultaneously. However, these steps run very quickly, since processing about 10k key-points rather than 2 million pixels gives a huge difference in processing cycles. Assuming 32 histogram bins, a 4*4 neighbourhood and 8 directional gradients leads to a total of 128 MACs in the first layer, 32 MACs in the second layer and 32 comparisons in each cycle for each key-point. Having 512 and 128 MACs in the two layers limits the chip to processing only 4 cycles simultaneously. The first layer of NP1 neurons calculates 64 MAC operations for a total of 16 pixels in 2 directions, each of which requires 2 MAC operations to calculate the gradient. Then the first layer of NP1 neurons performs the 4 operations AB, A^2, B^2 and base angle (depending on the octant), which use all 512 NP1 neurons in the first layer for 8 key-points in each cycle (16 pixels, 2 directions and 4 calculations). Then the second layer multiplies A^2 or B^2 by 0.28125, depending on the octant, which uses all 128 NP1 neurons in the second layer for 8 key-points. Then the division is calculated using the second-layer N1 neurons (in the NP1 neurons). After this step, the N4 neurons are responsible for categorizing the output orientation angle into the bins of the orientation histogram. There are 32 bins for each key-point, so 32 N4 neurons are used for each key-point. Having 4096 N4 neurons leads to 128 key-points in each cycle. Assuming 10k key-points in each image leads to 7.5k cycles for the NP1 neurons and about 80 cycles for the N4 neurons. This leads to a maximum delay of 12 microseconds for the NP1 neurons and 49.6 nanoseconds for the N4 neural units. Due to the huge cycle difference between NP1 and N4 neurons, the power consumption of this stage will not be much higher than that of the NP1 neurons themselves, leading to 5-6 watts.
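The timing figures of this stage can be cross-checked with the sketch below. The 7.5k NP1 cycles are taken from the text as given, and the variable names are illustrative.

```python
import math

KEYPOINTS = 10_000
N4_COUNT, BINS = 4096, 32
NP1_DELAY_NS = 1.6

keypoints_per_n4_cycle = N4_COUNT // BINS                       # 32 comparators each
n4_cycles_exact = math.ceil(KEYPOINTS / keypoints_per_n4_cycle)
n4_cycles_text = 80                                             # rounded value in the text

np1_cycles = 7_500                                              # taken from the text
np1_delay_us = np1_cycles * NP1_DELAY_NS / 1000.0

# Per-cycle comparator delay implied by 49.6 ns for 80 cycles.
n4_delay_per_cycle_ns = 49.6 / n4_cycles_text
```

The NP1 path dominates the delay of this stage by more than two orders of magnitude, which is why the comparators contribute little to its power estimate.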

Descriptor Generation
In order to calculate the power consumption and delay of this stage, we only have to calculate two scale factors, one for the NP1 and one for the N4 neurons. The NP1 neurons have to perform 16 times more calculations than in the orientation assignment process, since there are 16 of these 4*4 neighbourhoods for each key-point. On the other hand, the N4 neurons are 4 times less engaged than in the orientation assignment section, since the orientation histogram has 8 bins instead of 32. With only about 20 cycles, the N4 neurons have almost no effect on the delay, while the NP1 neurons slow down the MAC calculation by a factor of 16, which leads to 192 microseconds. Since we used the same NP1 and N4 arrays in multiple processes, we faced a slowdown in the comparison section due to the low number of N4 neurons, while the same number of N4 neurons was more than needed for orientation assignment and descriptor generation.

Resulting frame rate
As the delay and power consumption of the resulting chip have been estimated in the previous sections, it will operate approximately in the range of 2000-3000 frames per second (327 microseconds of delay) for Full HD movie, which is far beyond current photogrammetric processes.
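The 327 microsecond figure appears to be the sum of the per-stage delays estimated in the previous sections. The following Python sketch adds them up and converts the total to a frame rate, under the assumption that the stages run one after another. The dictionary keys are only labels.

```python
stage_delays_us = {
    "Gaussian convolution and DoG": 92.0,
    "comparison": 31.0,
    "orientation assignment": 12.0,
    "descriptor generation": 192.0,
}

total_delay_us = sum(stage_delays_us.values())
frames_per_second = 1e6 / total_delay_us
```

Descriptor generation accounts for more than half of the total, so it is the first stage to look at if a higher frame rate is needed.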

High resolution Image Sequence
Since all of the evaluations in the previous sections were related to videogrammetry, with the assumption of Full HD video input, we now evaluate the resulting chip's performance on 24 and 88 (Ultracam) megapixel image input, which is very common in UAV and traditional photogrammetry. Assuming that the ratio of the number of key-points to the resolution is constant, the 327 µs delay becomes about 4000 µs, which leads to about 250 frames per second for 24 MP input. For Ultracam images, it should take about 14 ms to detect key-points and calculate descriptors in 88 MP, which is equivalent to 68 frames per second. However, all of these numbers depend on the fabrication technology, which changes power consumption and speed. In particular, the NP1 and N4 neurons use implementations based on 150 nm and 65 nm technology, respectively. For comparison, table 4 shows some common processors with their fabrication technology.
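The scaling to larger images can be sketched as a simple proportionality between delay and pixel count. This linear model and the function names are assumptions, and the exact pixel counts of the 24 MP and 88 MP sensors are approximated by their nominal megapixel values.

```python
FULL_HD_PIXELS = 1920 * 1080
FULL_HD_DELAY_US = 327.0

def scaled_delay_us(megapixels):
    """Delay for a larger frame, assuming delay grows linearly with pixels."""
    return FULL_HD_DELAY_US * megapixels * 1e6 / FULL_HD_PIXELS

def frames_per_second(delay_us):
    return 1e6 / delay_us

delay_24mp_us = scaled_delay_us(24)
delay_88mp_us = scaled_delay_us(88)
fps_at_4000_us = frames_per_second(4000.0)
```

Linear scaling gives roughly 3.8 ms and 13.9 ms, which is of the same order as the rounded 4000 µs and 14 ms used above.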

Chips area
As mentioned before, the area of each NP1 neuron (MAC) is equal to 12000 µm^2 with 0.15 micron fabrication technology. The area of each N4 neuron can also be computed using a rough estimation with respect to a regular Core 2 Duo CPU with 291 million transistors and a 143 mm^2 die size, or an 8800GT GPU with 754 million transistors and a 324 mm^2 die size. Each N4 neuron consists of approximately 1000 transistors, which is equal to 429 µm^2. Having 640 NP1 and 4096 N4 neurons leads to an approximate die size of 10 mm^2 with 0.15 and 0.065 micron fabrication technology for the NP1 and N4 neurons respectively. By upgrading the fabrication technology for all neurons to 28 nm, the chip will be significantly smaller.
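The area estimate can be retraced with the Python sketch below. It derives the area per transistor from the two reference dies and sums the neuron areas. Routing, controllers and I/O are ignored, which is an assumption of this sketch.

```python
UM2_PER_MM2 = 1e6

# Area per transistor on the two reference dies (µm^2 per transistor).
cpu_um2_per_transistor = 143 * UM2_PER_MM2 / 291e6    # Core 2 Duo
gpu_um2_per_transistor = 324 * UM2_PER_MM2 / 754e6    # 8800GT

N4_TRANSISTORS = 1000
n4_area_from_cpu_um2 = N4_TRANSISTORS * cpu_um2_per_transistor
n4_area_from_gpu_um2 = N4_TRANSISTORS * gpu_um2_per_transistor

NP1_AREA_UM2 = 12_000
N4_AREA_UM2 = 429                                     # value used in the text
die_area_mm2 = (640 * NP1_AREA_UM2 + 4096 * N4_AREA_UM2) / UM2_PER_MM2
```

The 429 µm^2 figure matches the GPU-based density, and the NP1 array accounts for about four fifths of the resulting area.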

Future Works
This paper is part of a larger research project named "Design of a Reconfigurable Neuromorphic Framework for Photogrammetric applications", which is the PhD thesis of the author. Several sections of that work concern other algorithms such as HMAX, FREAK, and 2.5D and 3D SIFT, which are all compatible with this type of computation. The resulting chip should be able to handle any similar algorithm at very high speed or very low power consumption. Another suggestion for continuing this line of research is to port photogrammetric equations, the matching problem, classifiers and filtering algorithms to these neural elements, which would change the chip layout in terms of network size and neural elements. A further idea for this type of neural network concerns a learning algorithm, which could enable each neural element (consisting of basic elements such as N1 and N2) to learn how to act in every algorithm.

Conclusion
In this research, an implementation of Neuromorphic SIFT for ASICs is described. The main difference between this method and others is its ability to adapt to other algorithms such as HMAX and FREAK without major changes in the chip's layout. This method can be used in any photogrammetric application by implementing it on an FPGA or ASIC. A rough performance estimation led to 3000 fps for Full HD movie, 250 fps for a 24 MP image sequence and 68 fps for Ultracam input images, which can be a huge improvement for current photogrammetry workflows. As mentioned before, the resulting chip's abilities are not limited to SIFT, and it can be extended to HMAX, FREAK, photogrammetric equations and so on with minor changes in the chip layout.

Figure 5. Neuromorphic layout for Difference of Gaussians

Figure 7. Neuromorphic layout for Descriptor Generation (similar to orientation assignment)

Table 4. Some common processors with their fabrication technology