Send the link below via email or IMCopy
Present to your audienceStart remote presentation
- Invited audience members will follow you as you navigate and present
- People invited to a presentation do not need a Prezi account
- This link expires 10 minutes after you close the presentation
- A maximum of 30 users can follow your presentation
- Learn more about this feature in our knowledge base article
Transcript of SIFT
Francisco Scale-invariant feature transform SIFT What is SIFT? Algorithm that detects and describes local features in images.
Extract distinctive invariant features from images.
Match between different views of an image.
Features are invariant to image scale and rotation.
Detect locations that are invariant to scale change of the image.
The only possible scale-space kernel is the Gaussian function.
Get rid of some detail from the image using Gaussian Blur.
Ensure that we do not introduce new false details.
Create a scale space –
Take the original image and generate progressively blurred out images.
Resize the original image to half size and generate blurred out images again, and so on. Scale-space extrema detection Object Recognition Object Recognition Image rotation Introduction Image features have many properties that makes them suitable for matching different images of an object or scene.
To extract these features we use major stages of calculation:
1. Scale-space extrema detection
2. Keypoint localization
3. Orientation assignment
4. Descriptor representation A few technical details Octaves and Scales
The number of octaves and scale depends on the size of the original image. However, the creator of SIFT suggests that 3 octaves and 3 blur levels are ideal for the algorithm.
The first octave
If the original image is doubled in size and blurred a bit then the algorithm produces a lot more keypoints.
Mathematically, “blurring” is referred to as the convolution of the gaussian operator and the image. Gaussian blur has a particular expression or “operator” that is applied to each pixel. Gaussian Blur operator The symbols:
L is a blurred image
G is the Gaussian Blur operator
I is an image
x,y are the location coordinates
σ is the “scale” parameter. (the amount of blur).
Greater the value, greater the blur.
The * is the convolution operation in x and y.
It “applies” gaussian blur G onto the image I. Laplacian of Gaussian We use those blurred images to generate another set of images, the Difference of Gaussians (DoG). This is a great way for finding interesting keypoints in the image.
How it works
take an image and blur it a little.
calculate second order derivatives on it (or, the “laplacian”).
This locates edges and corners on the image -
potentially good keypoints.
The second order derivative is extremely sensitive to noise. The blur smoothes out the noise and stabilizes the second order derivative. Laplacian of Gaussian Use blurred images to generate another set of images, the Difference of Gaussians (DoG).
Difference between two consecutive scales.
approximately equivalent to the Laplacian of Gaussian.
“scale invariant”. Image Alignment Some of the keypoints lie along an edge, or they don’t have enough contrast.
In both cases, they are not useful as features because they are not robust enough. (Their location in other scales may differ)
Removing low contrast features
If the magnitude of the intensity at the current pixel in the DoG image is less than a certain value, it is rejected.
If the magnitude of intensity value at subpixel locations is less than a certain value, we reject the keypoint. Eliminate edges and low contrast regions A few technical details Computability:
Usually, a non-maxima or non-minima position won’t have to go through all 26 checks. A few initial checks will usually sufficient to discard it.
Keypoints are not detected in the lowermost and topmost scales.
There simply aren’t enough neighbors to do the comparison. The stages of keypoint selection Removing edges
Calculate two gradients at a keypoint. Based on the image around the keypoint, three possibilities exist. The image around the keypoint can be:
A flat region - If this is the case, both gradients will be small.
An edge - One gradient will be big and the other will be small (along the edge)
A “corner” - Here, both gradients will be big.
Corners are great keypoints. So we want just corners. If both gradients are big enough, we let it pass as a key point. Otherwise, it is rejected.
Mathematically, this is achieved by the Hessian Matrix. Using this matrix, you can easily check if a point is a corner or not. Eliminate edges and low contrast regions Some of the detected SIFT frames. Input image. Finding Keypoints Using the available pixel data, subpixel values are generated. This is done by the Taylor expansion of the image around the approximate key point. Mathematically, it’s like this:
We can easily find the extreme points of this equation (differentiate and equate to zero). On solving, we’ll get subpixel key point locations. These subpixel values increase chances of matching and stability of the algorithm. The red crosses mark pixels in the image. But the actual extreme point is the green one Finding Keypoints Find subpixel maxima/minima
the marked points are the approximate maxima and minima. They are “approximate” because the maxima/minima almost never lies exactly on a pixel. It lies somewhere between the pixel. But because we simply cannot access data “between” pixels, we must mathematically locate the subpixel location. X is marked as a “key point” if its value is the greatest or lowest of all its 26 neighbors. Maxima and minima
Compare x with its 26
neighbors at 3 scales Finding key points is a two part process:
1. Locate maxima/minima in DoG images.
2. Find subpixel maxima/minima.
Locate maxima/minima in DoG images
The first step is to locate the maxima and minima by iterating through
each pixel and check all it’s neighbors. The check is done within the
current image, and also the one above and below it. Finding Keypoints Finding Keypoints 1. The pixel original image.
2. The initial 832 keypoints locations at maxima and minima of the difference of gaussian function.
Keypoints ate displayed as a vector indicating scale, orientation, and location.
3. 729 keypoints remain after applying a threshold on minimum contrast.
4. The final 536 keypoints that remain following an additional threshold on ratio of principal
curvatures. Orientation assignment The highest peak in the histogram is detected, and any peaks above 80% of the highest peak are converted into a new keypoint. This new keypoint has the same location and scale as the original. But it’s orientation is equal to the other peak. Therefore, for locations with multiple peaks of similar magnitude, there will be multiple keypoints created at the same location and scale but different orientations. Doing this for all 16 pixels, you would’ve “compiled” 16 totally random orientations into 8 predetermined bins. You do this for all sixteen 4×4 regions. So you end up with 4x4x8 = 128 numbers. Once you have all 128 numbers, you normalize them. These 128 numbers form the “feature vector”. This keypoint is uniquely identified by this feature vector. Descriptor representation Orientation assignment An orientation histogram is formed from the gradient orientations of sample points within the region around the keypoint. Gradient magnitudes and orientations are calculated using these formulae: The size of the “orientation collection region” around the keypoint depends on it’s scale. The bigger the scale, the bigger the collection region. We collect gradient directions and magnitudes around each keypoint. Then we figure out the most prominent orientations in that region, and we assign this orientations to the keypoint.
Any later calculations are done relative to this orientation. This ensures rotation invariance. Orientation assignment This feature vector introduces a few complications. We need to get rid of them before finalizing the fingerprint.
The feature vector uses gradient orientations. Clearly, if you rotate the image, everything changes. All gradient orientations also change.
To achieve rotation independence, the keypoint’s rotation is subtracted from each orientation. Thus each gradient orientation is relative to the keypoint’s orientation.
If we threshold numbers that are big, we can achieve illumination independence. So, any number (of the 128) greater than 0.2 is changed to 0.2. This resultant feature vector is normalized again. And now you have an illumination independent feature vector. A few technical details You do this for all sixteen 4×4 regions. So you end up with 4x4x8 = 128 numbers. Once you have all 128 numbers, you normalize them. These 128 numbers form the “feature vector”. This keypoint is uniquely identified by this feature vector. Descriptor representation Descriptor representation We want to generate a very unique fingerprint for the keypoint. We also want it to be relatively lenient when it is being compared against other keypoints.
First a set of orientation histograms are created on 4x4 pixel neighborhoods with 8 bins each.
To do this, we create a 16×16 window around the keypoint.
This 16×16 window is broken into sixteen 4×4 windows. A sample descriptor. Some of the detected SIFT descriptors. Any gradient orientation in the range 0-44 degrees add to the first bin. 45-89 add to the next bin and so on. The amount added to the bin depends on the magnitude of the gradient.
The amount added also depends on the distance from the keypoint, So gradients that are far away from the keypoint will add smaller values to the histogram.
This is done using a “gaussian weighting function”. This function simply generates a gradient , multiple it with the magnitude of orientations, and get a weighted values. The farther away, the lesser the magnitude. Descriptor representation Within each 4×4 window, gradient magnitudes and orientations are calculated. These orientations are put into an 8 bin histogram. Descriptor representation Orientation assignment Scale-space extrema detection Keypoint localization Descriptor representation K-means Algorithm Given a collection of vectors and K paramter
Find optimal distribution of K clusters.
1. “Guess” K centers.
2. Each vector belongs to the nearest center.
3. Re-Calculate centers by finding the cluster center.
4. Repeat steps 2-3 until there is no more updates. Analogy to documents
Could images be represented as Bag-of-Visual Words?
Leverage classic methods in IR for image applications What Is Visual Word? Bag of 'visual words' ? Object Bag-of-Word model Retrieve Text Words in Information Retrieval (IR)
Descriptiveness Application in Image Search Vector quantization
Clustering Algorithm Traditional Visual Word Generation Feature space Visual Word Use cases of SIFT Object recognition using SIFT features Robot localization and mapping Panorama stitching Analyzing the Human Brain in 3D Magnetic Resonance Images Object recognition using SIFT features Robot localization and mapping 3D scene modeling, recognition and tracking Analyzing the Human Brain in 3D Magnetic Resonance Images The Feature-based Morphometry (FBM) technique uses extrema in a difference of Gaussian scale-space to analyze and classify 3D magnetic resonance images (MRIs) of the human brain.
FBM models the image probabilistically as a collage of independent features, conditional on image geometry and group labels, e.g. healthy subjects and subjects with Alzheimer's disease (AD).
Features are first extracted in individual images from a 4D difference of Gaussian scale-space, then modeled in terms of their appearance, geometry and group co-occurrence statistics across a set of images. FBM was validated in the analysis of AD using a set of ~200 volumetric MRIs of the human brain, automatically identifying established indicators of AD in the brain and classifying mild AD in new images with a rate of 80%. This application uses SIFT features for 3D object recognition and 3D modeling in context of augmented reality, in which synthetic objects with accurate pose are superimposed on real images. SIFT matching is done for a number of 2D images of a scene or object taken from different angles. This is used with bundle adjustment to build a sparse 3D model of the viewed scene and to simultaneously recover camera poses and calibration parameters. Then the position, orientation and size of the virtual object are defined relative to the coordinate frame of the recovered model. For online match moving, SIFT features again are extracted from the current video frame and matched to the features already computed for the world mode, resulting in a set of 2D-to-3D correspondences. These correspondences are then used to compute the current camera pose for the virtual projection and final rendering. A regularization technique is used to reduce the jitter in the virtual projection. Researchers have compiled a high-resolution atlas of genetic activity in the adult human brain
First they scanned braines with functional magnetic resonance imaging (fMRI) to capture their precise anatomical details.
The researchers then cut up the brain into many tiny slices and chemically analyzed genetic activity within about 900 precise areas. In this application three cameras are used to determine 3D estimates for keypoint locations.
Keypoints are used only when they appear in all 3 images with consistent disparities, resulting in very few outliers.
As the robot moves, it localizes itself using feature matches to the existing 3D map, and then incrementally adds features to the map while updating their 3D positions using a Kalman filter.
This provides a robust and accurate solution to the problem of robot localization in unknown environments. SIFT feature matching can be used in image stitching for fully automated panorama reconstruction from non-panoramic images.
The SIFT features extracted from the input images are matched against each other to find k nearest-neighbors for each feature.
These correspondences are then used to find m candidate matching images for each image.
Because there is no restriction on the input images, graph search is applied to find connected components of image matches such that each connected component will correspond to a panorama.
Finally for each connected component Bundle adjustment is performed to solve for joint camera parameters, and the panorama is rendered using multi-band blending.
Because of the SIFT-inspired object recognition approach to panorama stitching, the resulting system is insensitive to the ordering, orientation, scale and illumination of the images.
The input images can contain multiple panoramas and noise images (some of which may not even be part of the composite image), and panoramic sequences are recognized and rendered as output. Extensions of the SIFT descriptor to 2+1-dimensional spatio-temporal data in context of human action recognition in video sequences have been studied.
The computation of local position-dependent histograms in the 2D SIFT algorithm are extended from two to three dimensions to describe SIFT features in a spatio-temporal domain.
For application to human action recognition in a video sequence, sampling of the training videos is carried out either at spatio-temporal interest points or at randomly determined locations, times and scales.
The spatio-temporal regions around these interest points are then described using the 3D SIFT descriptor. These descriptors are then clustered to form a spatio-temporal Bag of words model. 3D SIFT descriptors extracted from the test videos are then matched against these words for human action classification.
The authors report much better results with their 3D SIFT descriptor approach than with other approaches like simple 2D SIFT descriptors and Gradient Magnitude. Researchers have compiled a high-resolution atlas of genetic activity in the adult human brain based on the complete brains of two men as well as a hemisphere from a third man's brain, all of the tissue healthy when the men died. The researchers are making their data freely accessible online to aid in studies of normal and abnormal human brain function. The main challenge when it comes to understanding the human brain is the fact that it is the most powerful computer known.
It consists of approximately 100 billion neurons with roughly 1 quadrillion (1 million billion) connections wiring these cells together, and each connection or synapse typically fires about 10 times per second.
If our brain wasn't so difficult to understand we couldn't understood him. First, SIFT features are obtained from the input image using the algorithm described above.
Second these features are matched to the SIFT feature database obtained from the training images. This feature matching is done through nearest neighbor approach.
Third to increase robustness, matches are rejected for those keypoints for which the ratio of the nearest neighbor distance to the second nearest neighbor distance is greater than 0.8.
This discards many of the false matches arising from background clustter.
Finally, to avoid the expensive search required for finding the Euclidean-distance-based nearest neighbor, an approximate algorithm called the best-bin-first algorithm is used.
This is a fast method for returning the nearest neighbor with high probability, and can give speedup by factor of 1000 while finding nearest neighbor 95% of the time.
For each candidate cluster, a least-squares solution for the best estimated affine projection parameters relating the training image to the input image is obtained.
If the projection of a keypoint through these parameters lies within half the error range that was used for the parameters in the Hough transform bins, the keypoint match is kept. If fewer than 3 points remain after discarding outliers for a bin, then the object match is rejected. The least-squares fitting is repeated until no more rejections take place. Mathematically, “blurring” is referred to as the convolution of the gaussian operator and the image.