Introduction to Computer Vision

Making Computers See Stuff: A Brief Example...
Steven Mitchell, Ph.D., Componica, LLC


DESCRIPTION

A lecture I gave to an artificial intelligence undergraduate class taught by Hien Nguyen, Ph.D., at the University of Wisconsin-Whitewater in the fall of 2011.

TRANSCRIPT

Page 1: Introduction to Computer Vision

Making Computers See Stuff
A Brief Example...

Steven Mitchell, Ph.D.
Componica, LLC

Page 2: Introduction to Computer Vision

About Us: Componica, LLC (http://www.componica.com/)

Strong Background in Computer Vision

Copyright 2011 - Componica, LLC (http://www.componica.com/)

Page 3: Introduction to Computer Vision


Definition - Image Processing

Any sort of signal processing done to an image.

Acquisition of an image.

Compression and storage of an image.

Enhancement and restoration.

Registration of an image to another.

Measurement of data from the image (height of a building, speed of a car).

Interpretation and recognition of objects.

Page 4: Introduction to Computer Vision


A Menu of Tools:

Image Enhancement - "Computer... uncrop and enhance!"

Image Segmentation - These pixels belong to this, those pixels belong to that.

Image Registration - Line this image up with that image.

Object Recognition - This is an image of a frog and that's an image of a cheeseburger.

Image Compression - Crush this image and make sure the process is undo-able. (Not Covered...there are plenty of free libraries to do this.)

Page 5: Introduction to Computer Vision


Image Enhancement - Simple Pixel Stuff:

Normalizing brightness and contrast.

Gamma correction (compensating for the non-linear response of TVs and displays).

Histogram equalization - maximize the global contrast.

Color Correction... color temperature and tint. White balancing. (A quick sketch of gamma correction and histogram equalization follows.)
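As a rough illustration, here is what gamma correction and histogram equalization look like in Python with OpenCV and NumPy. A minimal sketch; the filenames are placeholders:

```python
import cv2
import numpy as np

# Hypothetical input, loaded as grayscale for simplicity.
img = cv2.imread("photo.jpg", cv2.IMREAD_GRAYSCALE)

# Gamma correction: normalize to [0, 1], apply the power curve, rescale.
gamma = 2.2
corrected = (((img / 255.0) ** (1.0 / gamma)) * 255.0).astype(np.uint8)

# Histogram equalization: spread the intensities to maximize global contrast.
equalized = cv2.equalizeHist(img)

cv2.imwrite("corrected.jpg", corrected)
cv2.imwrite("equalized.jpg", equalized)
```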

Page 6: Introduction to Computer Vision


Image Enhancement - Geometry

Geometric Stuff:

Interpolation. Make the image bigger or smaller. This is not as easy as it sounds if you want non-aliased results. Trips up web developers whenever they try to roll their own thumbnail generator.

Warping an image from one geometry to another.

Simple rotation, scale, and translation. You need two or three landmark points.

Perspective (aka Projection or Homogeneous) transforms. You need four landmark points (see the sketch after this list).

Lens distortions. Images pinch or barrel out through a camera lens. You can calibrate and correct for that with enough landmark points.

General warping. Typically used for image morphing in special effects.

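A minimal sketch of the four-landmark perspective correction in OpenCV. The filenames and point coordinates are made up for illustration:

```python
import cv2
import numpy as np

img = cv2.imread("building.jpg")  # hypothetical input

# Four landmark points in the source image (e.g., corners of a sign)
# and where they should land in the corrected view.
src = np.float32([[120, 80], [480, 60], [500, 400], [100, 420]])
dst = np.float32([[0, 0], [400, 0], [400, 300], [0, 300]])

# A perspective (homogeneous) transform is fully determined by 4 point pairs.
M = cv2.getPerspectiveTransform(src, dst)
warped = cv2.warpPerspective(img, M, (400, 300))
cv2.imwrite("corrected.jpg", warped)
```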

Page 7: Introduction to Computer Vision


Image Enhancement - Interpolation

[Figure: original images upsampled and downsampled, amateur vs. pro interpolation.]
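If you must roll your own thumbnails, the safe route is to let a library pick the filter. A hedged OpenCV sketch (filename hypothetical): INTER_AREA averages over the source pixels when shrinking, which is exactly the anti-aliasing step naive resizers skip.

```python
import cv2

img = cv2.imread("original.jpg")
h, w = img.shape[:2]

# Downsampling: INTER_AREA acts as an anti-aliasing filter.
thumb = cv2.resize(img, (w // 4, h // 4), interpolation=cv2.INTER_AREA)

# Upsampling: bicubic looks smoother than nearest-neighbor or bilinear.
big = cv2.resize(img, (w * 2, h * 2), interpolation=cv2.INTER_CUBIC)
```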

Page 8: Introduction to Computer Vision


Image Enhancement - Geometry

[Figure: original image vs. perspective-corrected version.] The original image is warped to a perspective-corrected version.

The four black dots indicate the landmark points used to normalize the image to this artificial view.

The black regions are areas that fall outside the original image: unknown data.

Page 9: Introduction to Computer Vision


Image Enhancement - Tricky Stuff

Noise removal. Depends on the noise. Normally you have a set of tools (median filter, average filter, etc.) and you try a bunch of stuff until the image looks cleaner. Ad hoc.

De-blurring. You estimate the cause of the blurring and then undo it using deconvolution.

Motion blur - Estimate the camera movements and undo it.

Focus blur - Estimate the lens blur and undo it.

Very tricky problems and the results tend to suck because the image sucked to begin with. Don't expect magic.


http://cg.postech.ac.kr/research/fast_motion_deblurring/

Page 10: Introduction to Computer Vision


Image Segmentation

Divide an image into known parts:

It's not quite object recognition because images are typically interpreted on a pixel basis using stuff like edge detection or colors.

Sometimes it's good enough because you just care about the borders of stuff, not what the stuff is.

Border of tumor vs. healthy tissue.

The bright red ball in the color picture.

Is this pixel part of a letter ‘q’ or paper?

Often the first step to a bigger solution.

In OCR, step one is to separate the letters from the paper under annoying conditions.


Page 11: Introduction to Computer Vision


Image Segmentation - Pixel Classes

Make a decision based on a single pixel:

Simple thresholding - is this pixel darker than 160?

Slightly better - is this pixel red?

Even better - is this pixel statistically more likely to be paper or letter based on its RGB value?

The works - model the unevenness of the lighting on the paper, build a statistical model of RGB, and ask: is this damn pixel ink or paper?

The downfall is that you're looking at single pixels without understanding how their neighbors relate, missing the whole picture. Sometimes it's good enough, however. Really fast and easy to do.


[Figure panels: original, simple threshold, adaptive binarization. A sketch of both thresholding styles follows.]
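A short OpenCV sketch of both panels (input path hypothetical): a global "darker than 160?" threshold, and an adaptive binarization that compares each pixel against its local neighborhood to cope with uneven lighting:

```python
import cv2

gray = cv2.imread("page.jpg", cv2.IMREAD_GRAYSCALE)

# Simple global threshold: is this pixel darker than 160?
_, simple = cv2.threshold(gray, 160, 255, cv2.THRESH_BINARY)

# Adaptive binarization: threshold against a Gaussian-weighted local mean
# (31x31 neighborhood, offset 10), which models uneven page lighting.
adaptive = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                 cv2.THRESH_BINARY, 31, 10)
```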

Page 12: Introduction to Computer Vision


Image Segmentation - Least Cost Path

Minimizing or maximizing a path which borders between stuff. This is a common technique with many variants:

Assign a cost to all the pixels. For example, edge detection - it will cost me a lot to cross an edge. Or color transition, it will cost me to cross to a different color.

Use an optimization technique from classic data structures (typically used in graph theory if you still remember) to compute the cheapest path from one location of the image to a different location.

Dynamic Programming - Strange name, but all it means is compute the cheapest path from one side of an image to another. Works best for paths that tend to be linear overall. (A from-scratch sketch follows.)

Minimum Graph Cut - Find the cheapest way to split the image in two regions. This works well for circular paths and 3D stuff. Tends to be much more complicated than dynamic programming and slower.

Check this out: http://www.youtube.com/watch?v=6NcIJXTlugc

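Here is a from-scratch NumPy sketch of the dynamic-programming variant: given a per-pixel cost image (gradient magnitude, say), it accumulates the cheapest top-to-bottom path that moves at most one pixel sideways per row. An illustration of the idea, not production code:

```python
import numpy as np

def cheapest_vertical_path(cost):
    """Cheapest top-to-bottom path; each step moves down one row and
    at most one pixel left or right."""
    h, w = cost.shape
    total = cost.astype(np.float64)
    for y in range(1, h):
        left = np.r_[np.inf, total[y - 1, :-1]]   # came from upper-left
        right = np.r_[total[y - 1, 1:], np.inf]   # came from upper-right
        total[y] += np.minimum(np.minimum(left, total[y - 1]), right)
    # Backtrack from the cheapest pixel on the bottom row.
    path = [int(np.argmin(total[-1]))]
    for y in range(h - 2, -1, -1):
        x = path[-1]
        lo, hi = max(x - 1, 0), min(x + 2, w)
        path.append(lo + int(np.argmin(total[y, lo:hi])))
    return path[::-1]  # x-coordinate of the path at each row
```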

Page 13: Introduction to Computer Vision


Image Segmentation - Bleeding Edge

Random Decision Forests applied to pixels: play twenty questions with a pixel and its surrounding neighbors, building decision trees that guess what the pixel belongs to.

Have a bunch of these decision trees (a forest) and aggregate the results. Strangely, it tends to work in many cases.


Page 14: Introduction to Computer Vision


Image Registration

I have a source image that I can transform (move, resize, rotate, or bend). Make it best fit a target image. (A sketch of the first two transforms follows this slide.)

Translation Only - Shift the source image until it best fits the target.

Similarity Transform - Adjust the scale, rotation, and translation until the source image overlaps the target.

Perspective Transform - Move the source image’s four corners until it matches the target in a perspective manner.

Non-Rigid Warping - The source image is on a rubber sheet. Warp it onto the target image.

Obvious Applications:

Augmented Reality

Image Stitching

Object Detection

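A small OpenCV sketch of the first two cases (filenames and numbers arbitrary): a pure translation, then a similarity transform built from a rotation angle, a uniform scale, and a center point:

```python
import cv2
import numpy as np

src = cv2.imread("source.jpg")  # hypothetical input
h, w = src.shape[:2]

# Translation only: shift the image 25 px right and 40 px down.
T = np.float32([[1, 0, 25], [0, 1, 40]])
shifted = cv2.warpAffine(src, T, (w, h))

# Similarity transform: rotate 15 degrees about the center, scale 0.9.
S = cv2.getRotationMatrix2D((w / 2, h / 2), 15, 0.9)
similar = cv2.warpAffine(src, S, (w, h))
```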

Page 15: Introduction to Computer Vision


Image Registration - Interesting Points

The most common way to register an image is the following:

Find the most interesting points on the two images (usually blobs and corners).

Scale-invariant feature transform - SIFT (Patented and slow)

Speeded Up Robust Features - SURF (Recently Patented and fast)

Features from Accelerated Segment Test - FAST (Very fast but noisy)

With SURF and SIFT you get a position, an angle, a scale factor, and a descriptor vector to compare (64 elements for SURF, 128 for SIFT). With FAST you get a position and maybe a rotation.

Compare all the interesting points from one image to the other, forming matching pairs of points between images. Naive implementations are SLOW. (A sketch follows after the figure.)

Use RANSAC to find a consensus (next page).


[Figure: SURF vs. FAST interest points on the same image.]
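A sketch of the detect-and-match step, as promised above. Because SIFT and SURF are patented, this stand-in uses ORB, a free detector/descriptor that ships with OpenCV; it is not one of the detectors named on this slide, but the workflow is identical (filenames hypothetical):

```python
import cv2

img1 = cv2.imread("source.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("target.jpg", cv2.IMREAD_GRAYSCALE)

# ORB: free alternative to the patented SIFT/SURF.
orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Brute-force descriptor matching; crossCheck keeps pairs that agree
# in both directions. This all-pairs comparison is the SLOW part.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
```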

Page 16: Introduction to Computer Vision


FAST Example: Given a pixel, based on the 16 surrounding pixels, is this pixel interesting?

FAST uses a decision tree trained on real images and converted to nested if statements in C.

Doesn't use (floating-point) math, and averages about 3 comparisons per pixel... very, very FAST.

http://mi.eng.cam.ac.uk/~er258/work/fast.html
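Running FAST itself is nearly a one-liner in OpenCV. A minimal sketch with an arbitrary threshold and a hypothetical input frame:

```python
import cv2

gray = cv2.imread("frame.jpg", cv2.IMREAD_GRAYSCALE)

# A pixel is "interesting" if a long run of the 16 pixels on the
# surrounding circle is all brighter or all darker than it.
fast = cv2.FastFeatureDetector_create(threshold=20, nonmaxSuppression=True)
keypoints = fast.detect(gray, None)
print(len(keypoints), "corners found")
```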

Page 17: Introduction to Computer Vision


FAST Example

The source code is computer generated, and free for anyone to use.

It is 6000 lines long and not comprehensible.

With an averaging of vectors and an arctangent, you can get a rotation vector cheaply.

[Excerpt shown from Taylor & Drummond, "Multiple Target Localisation at over 100 FPS": FAST-9 interest points are assigned an orientation by summing gradients between opposite pixels in the 16-pixel FAST ring; appearance is captured with a sparsely-sampled, quantised 8×8 patch rotated to that orientation, and the most repeatably-detected features across training viewpoints are kept in the target database.]

DEMO: http://mi.eng.cam.ac.uk/~er258/work/fast.html

Page 18: Introduction to Computer Vision


FAST Example

So what’s the point? These points are stable regardless of angle, scale, or translation.

This reduces the data such that you can rapidly compare the image to a template for techniques like augmented reality, image stitching, and motion tracking.

Page 19: Introduction to Computer Vision


Image Registration - Interesting Points

As you can imagine, matching points locally doesn't yield global matches.

RANdom SAmple Consensus - RANSAC - prunes away the mismatches and computes the transform that converts the source image to the target.

It works as follows:

Most transforms can be described by a minimal number of points. For similarity transforms it's two points; for projection transforms it's four.

Pick two (or four) matched pairs of points at random.

Compute a transform from those pairs.

Transform all the source points using this transform and see how many land close to their targets.

Repeat this and keep the transform that matches the most points. (A from-scratch sketch follows this slide.)


Here the green lines indicate pairs of matched points that fit the transform (looks like a similarity transform), and the red lines are matched pairs that failed to fit this transform and were therefore rejected.

RANSAC may seem ad hoc, but it works surprisingly well.
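To make the recipe concrete, here is a from-scratch NumPy sketch of RANSAC for a similarity transform (two point pairs determine it, as noted above). It assumes src and dst are N x 2 arrays of matched points; the iteration count and inlier tolerance are arbitrary choices:

```python
import numpy as np

def similarity_from_two_pairs(p, q):
    """Solve x' = a*x - b*y + tx, y' = b*x + a*y + ty from two pairs:
    four equations, four unknowns (a, b, tx, ty)."""
    (x1, y1), (x2, y2) = p
    A = np.array([[x1, -y1, 1, 0], [y1, x1, 0, 1],
                  [x2, -y2, 1, 0], [y2, x2, 0, 1]], dtype=np.float64)
    a, b, tx, ty = np.linalg.solve(A, np.asarray(q, np.float64).ravel())
    return np.array([[a, -b, tx], [b, a, ty]])

def ransac_similarity(src, dst, iters=500, tol=3.0):
    best, best_inliers = None, 0
    for _ in range(iters):
        i, j = np.random.choice(len(src), 2, replace=False)  # random pair
        try:
            M = similarity_from_two_pairs(src[[i, j]], dst[[i, j]])
        except np.linalg.LinAlgError:
            continue  # degenerate sample (coincident points)
        # Transform ALL source points and count how many land close.
        mapped = src @ M[:, :2].T + M[:, 2]
        inliers = np.sum(np.linalg.norm(mapped - dst, axis=1) < tol)
        if inliers > best_inliers:  # keep the best transform so far
            best, best_inliers = M, inliers
    return best, best_inliers
```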

Page 20: Introduction to Computer Vision


Augmented Reality

Per-frame timings (Table 1 of the excerpted paper):

FAST interest point detection: 0.55 ms
Building query bit masks: 0.12 ms
Matching into database: 0.35 ms
Robust pose estimation: 0.10 ms
Total frame time: 1.12 ms

[Excerpt shown from the paper's results: the target was correctly localised in all 754 frames of a test sequence, averaging 1.12 ms per frame on a 2.4 GHz processor; on a second sequence with viewpoints outside the training set it still localised the target in 94% of frames (1.52 ms per frame), and the bit-error dissimilarity score sorted matches well enough to drive a PROSAC-like robust estimation.]

Once you have correspondence, you can compute 3D geometry. This is popular right now for some reason.

http://mi.eng.cam.ac.uk/~er258/work/fast.html

http://nghiaho.com

Page 21: Introduction to Computer Vision


Image Registration - The Mathy Way

A very accurate way to line up a template onto an image is to use derivatives and linear algebra.

Works very well only when the images are very close to each other in the first place.

It’s usually the polishing step after using an Interesting Points / RANSAC method.

Known as Lucas-Kanade Tracker -or- Baker-Matthews Tracker.

It’s the secret sauce to how the “Predator” algorithm works.


http://www.wedesoft.demon.co.uk/lucas-kanade-tracker.html
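A minimal OpenCV sketch of pyramidal Lucas-Kanade between two hypothetical frames: find corners in the first, then refine each one's position in the second. It only converges when the starting guess is already close, which is why it is the polishing step rather than the first step:

```python
import cv2

prev = cv2.imread("frame0.jpg", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame1.jpg", cv2.IMREAD_GRAYSCALE)

# Good corners to track in the first frame...
p0 = cv2.goodFeaturesToTrack(prev, maxCorners=200, qualityLevel=0.01,
                             minDistance=7)

# ...refined in the next frame using derivatives and linear algebra.
p1, status, err = cv2.calcOpticalFlowPyrLK(prev, curr, p0, None)
tracked = p1[status.ravel() == 1]
```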

Page 22: Introduction to Computer Vision


Object Recognition

Making computers recognize objects in images.

Many different algorithms, and each algorithm has appropriate uses depending on the objects being detected. For example:

AdaBoost - Awesome at detecting and locating faces, sucks at recognizing whose face it is.

Deep Neural Networks - Works great for highly distorted text, but too slow for generalized OCR.

Active Appearance Models - Works great for both recognition and segmentation, but only on generally fixed shapes like faces and hearts. Sucks for livers and cancer.

Most recognizers depend on training the algorithm on known objects offline and then testing. Which brings up the topic of data...


Page 23: Introduction to Computer Vision


Object Recognition - It's all about the Data

The data can be the most valuable part of a trainable system, as many algorithms will generally function with somewhat similar hit rates.

Often ideas fail to take into account where the data comes from. It’s the killer of many ideas.

The basics of training something:

Data is normally split into a training and testing set. You train a thingy with the training set and test how well it works with the testing set.

Why? Most trainable thingies are prone to overfitting. Splitting the data into two sets prevents this problem because you use the test set to know when to stop training.

Disadvantage: you effectively need twice as much data. Sucks.


It's obvious this data is best represented as a line, but if the model over-fits the data, it may compute a relationship that's nonsensical.

As your algorithm learns from the training set (blue line), the error decreases for that set, but on the testing set there comes a point where overfitting sets in and the real-world error increases. You stop training at the point where the testing set starts getting worse. (A toy demonstration follows.)
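A toy NumPy demonstration of the figure's point: the data is truly linear, and as model capacity grows (polynomial degree standing in for "training longer"), training error keeps falling while testing error typically bottoms out near degree 1 and then climbs:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 60)
y = 2 * x + rng.normal(scale=0.2, size=60)  # truly linear data plus noise

# Split the data: fit on the training half, watch the testing half.
x_tr, y_tr, x_te, y_te = x[:30], y[:30], x[30:], y[30:]

for degree in range(1, 13):  # more capacity ~ more training
    coeffs = np.polyfit(x_tr, y_tr, degree)
    tr_err = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    te_err = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    print(f"degree {degree:2d}  train {tr_err:.4f}  test {te_err:.4f}")
```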

Page 24: Introduction to Computer Vision


Object Recognition - The Menu

Steve's crappy breakdown of object recognition algorithms:

A ship of fools approach - Armies of stupid algorithms that together become smart. Kinda like democracy sort of...not really.

Let’s create a brain, Igor - So what happens if you simulate brain tissue? It’s grown quite a bit since the neural network hype in the late 80s / early 90s. Fun Fact: This will eventually kill us all in a bloody uprising.

If the shoe fits... - Well, if this template Starbucks logo fits somewhere on my image using image registration as described above, then I'm guessing this Starbucks logo is present in the image. Duh.

Find Features to Filter and Fit in a Feed-forward Fashion - You see this pattern all the time, and it lacks creativity. Find interesting features in the image, and feed them into a trainable function like a neural network, a non-linear regression, or a support vector machine. Boring.

I’ve stared at a wall for 20 minutes now and I think everything out there is either one or a combination of these general ideas. Hmm...That’s it? Disappointed.


Page 25: Introduction to Computer Vision


Face Detection: This is NOT facial recognition.

Developed by Viola and Jones in 2001. I consider it a breakthrough in image recognition. It inspired many new methodologies.

How much does a cow weigh? (Ask a crowd and average the guesses: many weak estimates combine into one strong one, which is the idea behind boosting.)

An army of simple face detectors.

"Robust Real-time Object Detection"Paul Viola and Michael Jones

Page 26: Introduction to Computer Vision


BTW, It’s how the Kinect sees people.

Page 27: Introduction to Computer Vision


BTW, It’s how the Kinect sees people.

"Real-Time Human Pose Recognition in Parts from Single Depth Images"Shotton, Fitzgibbon, Cook, Sharp, Finocchio, Moore, Kipman, Blake

Microsoft Research Cambridge & Xbox Incubation

Page 28: Introduction to Computer Vision


Facial Recognition

Remove effects caused by lighting and perspective.

After you find a face, reduce it to numbers.

"Statistical Models of Appearance for Computer Vision"T.F. Cootes and C.J.Taylor

Page 29: Introduction to Computer Vision


Facial Recognition

Let’s mix some paint...

Comparing numbers in hyperspace

k-Nearest Neighbor, Wikipedia
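A tiny k-Nearest Neighbor sketch with made-up 64-dimensional "face vectors": measure distances in hyperspace, take the k closest gallery entries, and vote on the identity:

```python
import numpy as np

def knn_label(gallery, labels, query, k=3):
    """Return the majority label among the k nearest gallery vectors."""
    dists = np.linalg.norm(gallery - query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = [labels[i] for i in nearest]
    return max(set(votes), key=votes.count)

# Toy example: three enrolled identities, three samples each.
rng = np.random.default_rng(1)
gallery = rng.normal(size=(9, 64))
labels = ["alice"] * 3 + ["bob"] * 3 + ["carol"] * 3
print(knn_label(gallery, labels, gallery[4] + 0.01))  # -> "bob"
```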

Page 30: Introduction to Computer Vision


QR Codes

http://en.wikipedia.org/wiki/QR_Code

"Quick Response code" invented by Toyota subsidiary Denso Wave in 1994.

Specifications are in Japanese, good luck.

Why it’s awesome to use

Open License

Up to ~3KB of data (2,953 bytes in binary mode)

Error Correction

Easy to read and generate:

ZXing library
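Reading one really is easy. Newer OpenCV builds (well after this lecture) include a QR detector; ZXing, mentioned above, is the heavier-duty option. A minimal sketch with a hypothetical image:

```python
import cv2

img = cv2.imread("label.jpg")

detector = cv2.QRCodeDetector()
text, points, _ = detector.detectAndDecode(img)
print("decoded:", text if text else "(no QR code found)")
```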

Page 31: Introduction to Computer Vision


Lytro - How the @#! does it work?

Page 32: Introduction to Computer Vision


Lytro - How the @#! does it work?

http://www.futurepicture.org/?p=47

Page 33: Introduction to Computer Vision


Optical Character Recognition

iPhone 4th Gen

iPod Touch 4th Gen

Page 34: Introduction to Computer Vision


Optical Character Recognition

Page 35: Introduction to Computer Vision


Commentary: Ubiquitous surveillance... extreme dislike.

Birthday Paradox... the probability that, in a set of n randomly chosen people, some pair of them will have the same birthday. (The same math makes false matches add up fast when recognition is run against a very large database.)

Page 36: Introduction to Computer Vision


Commentary: Video cameras may fit the criteria of legally blind.

Page 37: Introduction to Computer Vision


Commentary

But there are benefits that could help people.

Smartphones that ID diseases, plants, insects.

Robotic lawnmowers that don’t run over the neighbor’s cat.

Computers that judge emotions by reading your face.

Keyless entry based on face, iris.

Automated inspection of manufactured parts.

Search and Rescue.
