Getting Started

To start using IMatch, first upload two images that you want to match against each other, or use the two provided. The images will be matched the same regardless of which is the "target" and which is the "query", but the process and output are optimized for the target image being larger than the query image.

Next, click "Generate Target Features" and "Generate Query Features" buttons. This will begin the process of gathering feature data from each of the images - the progress of these processes will be displayed above each of the images, and once complete, the features themselves will be displayed as circles with lines on top of the images. Feel free to click both buttons at once: the processes will run in parallel.

Finally, once both images have had their features generated, click the "Match" button. This will compute matches between the two feature sets, then calculate a suitable transformation for aligning the images, and display this alignment as output, with the "query" image (bounded by a checkered line) overlaid onto the "target" image.

You can optionally display the "inliers" and "outliers" generated during this process - inliers are feature matches that are deemed useful, while outliers are feature matches that were rejected as being too far apart. The inliers are displayed as either blue-green lines (with the blue end at a point in the target image and the green end at a point in the transformed query image) or green points (if those points are nearly identical). The outliers are displayed as red-yellow lines (with the red end at a point in the target image and the yellow end at a point in the transformed query image).

How It Works

Feature Finding: The first step in the image-matching process is to find points of interest in each of the images, which is what the "Generate Features" buttons do. To do so, the script opens up two web workers - one for each image - and they both load OpenCV, a popular open-source image-processing library. OpenCV supports ORB, an algorithm which finds and labels features of interest in images using clever pixel sampling and intensity weighting, and this is the first step in IMatch's operation.

However, there are two problems with ORB. First, it tends to focus too much on certain areas, returning too many features in particularly "feature-dense" regions and too few in others, even though those other regions might still be of interest. To fix this, IMatch runs ORB multiple times, but prevents it from returning any features too close to those that have already been found, essentially forcing it to spread itself out. Second, ORB's features aren't flip-invariant. To work around this, after finding all of the features in the query image, it's flipped, and the feature-finding process is run again, providing two sets of features to match with. If the flipped features fit better than the normal features, the image is flipped.

Once calculated, features are displayed over top of the base images as circles with single radii. The circle encloses all pixels used to calculate the feature's descriptor, and the single radius points in the chosen reference direction for the feature. This reference direction is important as a way to provide rotation-invariance: by calculating the descriptor relative to the reference direction, it remains the same regardless of what angle the image itself is rotated to.

Feature Matching: Once feature sets for the target and query images have been collected, they need to be matched together. BRIEF feature descriptors (the type used by ORB) are bitstrings of length 256, and the distance between two features is defined as the Hamming distance between their descriptors. Since the data points are sparse, and Hamming distance is simple, the fastest approximate nearest neighbors solution is a set of Locality-Sensitive Hash Tables. To make these tables, each target feature descriptor is broken into 16 chunks of 16 bits each. Then, 16 tables are created - one for each chunk. Finally, each feature is placed into all 16 tables, indexed by its value within the specified chunk.

To find matches, the query feature's descriptor is broken down similarly, and the appropriate index in each of the tables is checked. This returns any target features which share at least one chunk with the query feature - note that this is what makes it an approximate approach, as two descriptors with a Hamming distance of only 16 may share no chunks (though in practice this is unlikely). The Hamming distances between each of the target features and the query feature are then computed, using a binary representation of the data and some bit-fiddling to speed up computation. All of these target features and distances are placed into a min heap, and the closest two matches are then taken off of the top.

Images generally have hundreds or thousands of features, but often have only tens of matches. Two methods are used to pare down potential matches. First, any matches with a distance of 32 (1/8) or more are ignored. Second, for each match, the closest match must be at most 70% the distance of the second-closest match (or, equivalently, the second-closest match must be at least 43% further away than the closest match). Any matches which don't meet said criterion are considered low-quality and discarded.

The matching step is definitely not instantaneous, but since it runs in less than a second even when thousands of features need to be matched, I have decided to keep it blocking in the main thread, rather than creating another web worker for it.

Transform Computation: Once the set of matches between the target and query images has been found, it must be used to compute a transform that maps the query image onto the target image, minimizing the distances between matching features. To do this, a simplified version of RANSAC (RANdom SAmple Consensus) is used; this has the benefit of being able to produce a good result even if the "outlier" matches are more numerous than the "inlier" matches (which is often the case when working with images that differ significantly from each other). Additionally, the modifications reduce its reliance on a pre-determined outlier percentage.

To start, two random matches are chosen from the set of matches computed earlier. Given these two matches - and knowing from the feature collection step whether the image will need to be flipped or not - there is exactly one transform which maps the query points onto the target points using only rotation, translation, and uniform scaling. This transform has a closed-form solution, and is thus very quick to compute. Once computed, all of the matches are run through it, returning a new distance for each of them. This new distance is used to place all of the matches into a min heap.

Matches are taken off of the min heap and considered inliers until the following three criteria are met: there are at least 5 inliers, there are at least 80% as many inliers as could be expected from the relative number of features, and the axis-aligned bounding boxes surrounding all inliers (one each for the query and target frames of reference) have an area of at least 10000 pixels or 1% of the original image (in the approprate reference frame). These criteria ensure that there are plenty of inliers, and that they're not concentrated in a single point. Once all of these criteria are met, all matches remaining on the heap are considered outliers. Unlike a normal RANSAC, no linear least-squares step is performed; eschewing it speeds things up significantly, and that speed allows the algorithm to run more steps in the same amount of time, recouping most (if not all) of the accuracy loss.

Each of the inlier matches then has the distance between its transformed query point and its target point measured, and added to a running total. This total is considered the "cost" of the transform. The process is then repeated, selecting a new random pair of matches, computing a new transform, and meauring its cost - if the new cost is lower, the new transform must be better, and it is kept as the new preferred result. This continues until we haven't encountered a better transform in a certain number of steps - in practice, this is 1000 steps each for the flipped and non-flipped feature sets (illustrating the value of removing a more computationally expensive linear least-squares step).

Finally, once the best transform is decided upon, it's used to transform the query image onto the target image using a handy HTML5 function. A bounding box is drawn on top of it, and the inliers and outliers are rendered as well, if desired. This all occurs on a single HTML canvas, enabling the user to save it as a single image if they wish.

Lessons Learned

OpenCV in JavaScript: This project originally started as an investigation into OpenCV. I had heard a lot about it, and figured it was worth trying out. I decided to use the Python version, since I'm very familiar with it and its simplicity supports quick development in these early stages. So easy and user-friendly was the package and its documentation that within a day I had feature recognition and matching working, and after another day I had a full program for overlaying a query image onto a target one. I was astounded by this early progress, and decided to write a JavaScript version so that I could display my work on my portfolio.

Unfortunately, while the Python version of OpenCV is very well-documented, the JavaScript version is not. There are a few basic tutorials, but users are sent to the C++ documentation for anything more in-depth. This is mostly fine, since the JS functions mirror the C++ functions fairly closely. Unfortunately, not all of the C++ functions are implemented in the JS port, and the data types used are completely different (as one might expect from such different languages). This ended up requiring a lot of experimentation on my part.

As an example, while the Python version includes a utility for finding Hamming nearest neighbors, the JS version does not, which required me to implement it myself - more on that below. Also, the features returned by ORB are in strange data structures and types - after dumping them to the console a few times, I found that I needed to copy the raw data into a new array if I wnated to work with it. Worse, there was one function that I used in my Python version which did exist in the JS version, but threw an unnamed error when called, providing only a memory address to debug with. I ended up removing it completely; it was the function which overlaid the query image onto the target one, and I discovered that HTML5 offered a function that performed the same task, allowing me to ignore that part of OpenCV. Indeed, while the Python version made it seem effortless, I more often found myself trying to circumvent OpenCV when working in JS.

Finally, there was the aspect of loading a script - and, in the end, loading it into a web worker - which I was unfamiliar with at the start. Loading it into the main thread wasn't too unusual, but loading it into a web worker seemed to cause a problem: after loading the script, it seemed to not have any of the functionalities I needed. Whole functions and variables were missing from the scope. I thought for sure that it couldn't be a case of premature access, since the script loading operation in JS web workers is blocking, so in theory I shouldn't've been able to access it until everything was ready. What I eventually discovered is that OpenCV is not done loading once the script loads; it performs some asynchronous initialization (including making most functions available) afterwards. I had to await a different event - OpenCV's internal loading complete event - before accessing it. Would've been nice if they had put that in the tutorial somewhere.

Hamming Nearest Neighbors: Having already done a lot of work on Euclidean nearest neighbors, I figured that the methods I had learned there would be easy enough to apply to Hamming distance. So I ditched my brute-force matcher, which was taking about 5 seconds to run on the test case, and implemented a KD-Tree (operating as a prefix tree for the descriptors) with A* search (also known as Best Bin First), one of the fastest methods for finding approximate or exact nearest neighbors in Euclidean space. Unfortunately, after an entire day spent writing the implementation, it took about 8 seconds to run on the test case. Frustrated, I dug into the data to find out why it was being so slow, and found that the data was simply too sparse: almost none of the branches got pruned, because they were separated by only one or two bits each time, whereas the whole descriptor was 256 bits. With no pruning, each query descriptor was basically being measured against all of the target descriptors, returning to brute force.

My next idea was to do a relative distance search: Hamming distance obeys a slightly stronger version of the triangle inequality, where |AC| <= |AB|+|BC|, but also, |AC| >= abs(|AB|-|BC|), since the number of bitflips in the longer distance can only be "undone", in the best case, by as many bitflips are in the shorter distance. To take advantage of this, I set up a data structure that started with a number of "atoms" with large Hamming distances from each other. I took all of the target descriptors, measured their Hamming distances from each atom, and stored them in one bin per atom based on their distance. Query descriptors then calculated their distance from each atom, and looked in the appropriate bins (and those around them) to find potential nearby matches. This brought matching times on the test case back down to 5 seconds, but no further, to my chagrin. Upon closer inspection, I was being thwarted by statistics: on average, any 256-bit string has a Hamming distance of about 128 bits from any other 256-bit string. Indeed, most of the target descriptors ended up in these middle bins relative to each of the atoms. Then, when a query descriptor also ended up being in this range, it had to measure the distance to pretty much all of the target descriptors, bringing it back to brute force.

Finally, I decided to go by the book, and implement a Locality-Sensitive Hash. I had read about their utility in this case before I started, but didn't realize just how important they would be. I set up a system that could take an arbitrary number of chunks from a descriptor - even overlapping ones - and index them into a hash table accordingly, as described above. I started with 32 chunks of 16 bits each, which allowed for some overlap - and, to my surprise, it ran on the test case in only 2 seconds. Excited, I set about trying different parameters, and eventually found that 16 chunks of 16 bits ran the fastest on the test case, about 1 second. These parameters may not be optimal for other cases or other machines, and this may be worth future study. For the time being, though, I was happy.

Except for one thing: the bit math. I had been developing these data structures using JS string representations of the descriptors for simplicity's sake, but I knew that the overhead involved in those operations was likely slowing things down a little. I wanted to use a proper binary representation, and opted for unsigned 32-bit integers - this required a data structure to handle and operate on a set of 8 integers, but that was easy enough. Efficiently computing the distance between any pair of these integers is a two-step process: first, XOR them to get a representation of all the places where they differ. Next, add up all of the 1s in said representation, and voila, the Hamming distance. But that second step can be a bit annoying to do - iterating through each bit and checking if it's set is canonical, but relatively very slow.

For this, I found an excellent bit-fiddling algorithm online, but it came without explanation, and I didn't want to use it until I understood it, which took some exploration. The process, as I discovered, is as follows: the integer is broken up into its even and odd bits, and the odd bits are shifted right to line up with the even ones. They're added, meaning that each pair of bits now contains the sum of the bits that used to occupy that space. Since the sum can be at most 2, 2 bits is enough to represent it. This process is repeated with even and odd pairs of bits, putting the sum of each original nybl's bits (which is at most 4) into said nybl. The process is then repeated one more time, putting the sum of each original byte's bits (which is at most 8) into said byte. Finally, this result is multiplied by four bytes of 1, which has the effect of adding all 4 bytes into the most significant byte - and since the sum is at most 32, this result fits into said byte. Right-shifting to put that byte in the least-significant place gives the final result: a count, from 0 to 32, of the number of 1s in the original representation. This last improvement, along with bit-fiddling of my own to replace the substring functionality, brought the matching time on the test case down to 1/5 of a second - so fast that it didn't even have to be offloaded onto a web worker!

Cross-Browser Compatibility: I use Firefox in my day-to-day life, and so developing with it is natural to me. However, when I tested the site on Chrome, the web workers failed to return their results. Upon closer inspection, the descriptors that OpenCV returns - which are Uint8Arrays - come with array buffer objects, and while Firefox handled these without issue (presumably working around them), Chrome threw an error and declared them uncopyable. The fix was fairly simple: casting the data into a new Uint8Array removed the extraneous buffers, and Chrome handled the site fine thereafter.

... with one exception. OpenCV's ORB feature-finder takes a parameter, maxFeatures, which sets the maxmimum number of features it's allowed to find in any one round of computation. In the Python version, I had found this to be more of a suggestion than an actual limit, and it would often return a few more features than requested. In Firefox, though, it seemed to stick to the limit, never returning more than the maximum. But for some reason, in Chrome, it will occasionally - not always - return one more feature than requested. I'm not sure whether one more is the limit for all cases, and I certainly have no idea why it happens, but it means that the features found by running this page in Firefox and Chrome are slightly different. Luckily, the difference is small enough that it doesn't affect the final outcome, so I'm content to let it be for now.

IMatch Image Matcher

Getting Started

How It Works

Lessons Learned